HPC Cluster Setup
This guide covers installing and running Underworld3 on HPC clusters. Install scripts are maintained in the uw3-hpc-baremetal-install-run repository.
Architecture
All supported clusters use the same architecture:
pixi hpc env → Python 3.12, sympy, scipy, pint, pydantic, ... (conda-forge, no MPI)
cluster MPI → OpenMPI (spack or module) (cluster MPI)
source build → mpi4py, PETSc+AMR+petsc4py, h5py (linked to cluster MPI)
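Why source builds? Anything linked against MPI must use the same MPI library that the cluster launches jobs with. conda-forge's MPI-enabled packages bring their own MPI build (MPICH), which does not match the cluster's OpenMPI and is incompatible with launching under Slurm/PBS. Building mpi4py, PETSc, and h5py from source against the cluster MPI ensures the correct linkage.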
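One way to confirm which MPI a built mpi4py actually links against is to look at the library version string it reports. The following is a minimal sketch, assuming bash; the helper name `classify_mpi` is hypothetical and not part of the install scripts, while `MPI.Get_library_version()` is the standard mpi4py call.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify an MPI library version string.
# Open MPI reports strings such as "Open MPI v4.1.6, ...";
# MPICH reports strings such as "MPICH Version: ...".
classify_mpi() {
    case "$1" in
        *"Open MPI"*) echo "openmpi" ;;
        *MPICH*)      echo "mpich" ;;
        *)            echo "unknown" ;;
    esac
}

# Usage on a cluster (requires mpi4py to be importable):
#   classify_mpi "$(python3 -c 'from mpi4py import MPI; print(MPI.Get_library_version())')"
```

If this reports `mpich` on a cluster whose MPI is OpenMPI, mpi4py was most likely installed from a prebuilt wheel or conda package rather than built from source.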
Why pixi? Pixi manages the Python environment consistently with local development: the same pixi.toml and the same package versions. The hpc environment is pure Python (no MPI packages from conda-forge).
PETSc build: petsc-custom/build-petsc.sh auto-detects the cluster from hostname, or can be overridden with UW_CLUSTER=kaiju|gadi. Cluster-specific differences (HDF5 source, BLAS, cmake, compiler flags) are handled internally.
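The detect-then-override behaviour can be sketched as follows. This is an illustration only: the function name `detect_cluster` and the hostname patterns are assumptions, not copied from build-petsc.sh.

```shell
#!/usr/bin/env bash
# Sketch of hostname-based cluster detection with an environment override.
# The hostname patterns below are assumptions for illustration.
detect_cluster() {
    # An explicit UW_CLUSTER always wins over auto-detection.
    if [ -n "${UW_CLUSTER:-}" ]; then
        echo "$UW_CLUSTER"
        return 0
    fi
    # Optional first argument lets callers pass a hostname; default is this host.
    local host="${1:-$(hostname)}"
    case "$host" in
        *kaiju*) echo "kaiju" ;;
        *gadi*)  echo "gadi" ;;
        *)       echo "unknown"; return 1 ;;
    esac
}
```

Setting `UW_CLUSTER=gadi` before invoking the build is the documented way to bypass detection when a hostname does not match.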
Kaiju
Hardware
| Resource | Specification |
|---|---|
| Head node | 1× Intel Xeon Silver 4210R, 40 CPUs @ 2.4 GHz |
| Compute nodes | 8× Intel Xeon Gold 6230R, 104 CPUs @ 2.1 GHz each |
| Shared storage | |
| Scheduler | Slurm with Munge authentication |
| MPI | Spack |
Prerequisites
Spack must have OpenMPI available:
spack find openmpi
# openmpi@4.1.6
Pixi must be installed in your user space:
pixi --version # check
curl -fsSL https://pixi.sh/install.sh | bash # install if missing
Installation
Copy kaiju_install_user.sh (per-user) or kaiju_install_shared.sh (admin) from uw3-hpc-baremetal-install-run to a convenient location, edit the variables at the top, then:
source kaiju_install_user.sh install
| Step | Function | Time |
|---|---|---|
| Install pixi | | ~1 min |
| Clone Underworld3 | | ~1 min |
| Install pixi hpc env | | ~3 min |
| Build mpi4py | | ~2 min |
| Build PETSc + AMR tools | | ~1 hour |
| Build h5py | | ~2 min |
| Install Underworld3 | | ~2 min |
| Verify | | ~1 min |
Individual steps can be run after sourcing:
source kaiju_install_user.sh
install_petsc # run just one step
What PETSc builds on Kaiju
AMR tools: mmg, parmmg, pragmatic, eigen, bison
Solvers: mumps, scalapack, slepc
Partitioners: metis, parmetis, ptscotch
MPI: Spack’s OpenMPI (--with-mpi-dir)
HDF5: downloaded (not in Spack)
BLAS/LAPACK: fblaslapack (no guaranteed system BLAS on Rocky Linux 8)
cmake: downloaded (not in Spack)
petsc4py: built during configure (--with-petsc4py=1)
Activating the Environment
Source the install script at the start of every session or job:
source kaiju_install_user.sh
This loads spack openmpi@4.1.6, activates the pixi hpc environment via pixi shell-hook, and sets PETSC_DIR, PETSC_ARCH, and PYTHONPATH.
pixi shell-hook is used instead of pixi shell because it activates the environment in the current shell without spawning a new one, which is required for Slurm batch jobs.
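The difference can be illustrated with a short sketch. `pixi shell-hook` prints activation code on stdout, and evaluating that output applies it to the running shell. The function name `activate_pixi_env` is hypothetical, and the install script may pass different options.

```shell
#!/usr/bin/env bash
# Sketch: activate a pixi environment in the *current* shell.
# `pixi shell-hook` only prints shell code; eval applies it here,
# so no interactive subshell is spawned (unlike `pixi shell`).
# The environment name "hpc" comes from this guide; the function name is hypothetical.
activate_pixi_env() {
    local manifest_dir="$1" env_name="${2:-hpc}"
    if ! command -v pixi >/dev/null 2>&1; then
        echo "pixi not found on PATH" >&2
        return 1
    fi
    local hook
    hook="$(pixi shell-hook --manifest-path "$manifest_dir/pixi.toml" -e "$env_name")" || return 1
    eval "$hook"
}

# Usage (path is a placeholder for your Underworld3 checkout):
#   activate_pixi_env "<UW3_PATH>"
```

Because the environment variables end up in the calling shell, commands later in the same batch script (for example `srun`) see the activated environment.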
Running with Slurm
Use kaiju_slurm_job.sh from uw3-hpc-baremetal-install-run. Edit the variables at the top, then:
sbatch kaiju_slurm_job.sh
--mpi=pmix is required on Kaiju (Spack has pmix@5.0.3):
srun --mpi=pmix python3 my_model.py
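For orientation, a job script for this setup might look roughly like the one produced below. This is a hypothetical sketch, not the contents of kaiju_slurm_job.sh: the generator function `write_uw3_job`, the location of the install script under `$HOME`, and the resource numbers are all placeholder assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical generator for a minimal Slurm job script; not kaiju_slurm_job.sh itself.
# Paths and resource numbers are placeholders to edit.
write_uw3_job() {
    local out="$1" nodes="${2:-1}" tasks_per_node="${3:-4}" model="${4:-my_model.py}"
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH --job-name=uw3
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${tasks_per_node}
#SBATCH --time=01:00:00
#SBATCH --output=uw3_%j.out

# Source inside the job so the environment exists on the compute nodes.
source "\$HOME/kaiju_install_user.sh"

# srun takes its task count from the #SBATCH directives above.
srun --mpi=pmix python3 ${model}
EOF
}

# Usage:
#   write_uw3_job uw3_job.sh 2 8 my_model.py && sbatch uw3_job.sh
```

The `--output=uw3_%j.out` pattern matches the log file name used in the monitoring commands in this section.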
Monitor progress:
squeue -u $USER
tail -f uw3_<jobid>.out
Troubleshooting (Kaiju)
import underworld3 fails on compute nodes
Source the install script inside the job script (not the login shell) so all paths propagate to compute nodes. The kaiju_slurm_job.sh template does this correctly.
PETSc needs rebuilding after Spack module update
PETSc links against Spack’s OpenMPI at build time. If openmpi@4.1.6 is reinstalled:
source kaiju_install_user.sh
rm -rf ~/uw3-installation/underworld3/petsc-custom/petsc
install_petsc
install_h5py
h5py replaces source-built mpi4py
pip install h5py without --no-deps silently replaces the source-built mpi4py with a wheel linked to a different MPI. The install script uses --no-deps to prevent this. If mpi4py was accidentally replaced:
pip install --no-binary :all: --no-cache-dir --force-reinstall "mpi4py>=4,<5"
PARMMG configure failure
pixi’s conda linker requires transitive shared library dependencies to be explicitly linked. libmmg.so built with SCOTCH support causes PARMMG’s link test to fail. This is fixed in build-petsc.sh by building MMG without SCOTCH (-DUSE_SCOTCH=OFF).
Gadi
Hardware
| Resource | Specification |
|---|---|
| System | NCI Gadi (CentOS, Lustre filesystem) |
| Compute | Multiple node types (normal, hugemem, gpuvolta) |
| Shared storage | |
| Scheduler | PBS Pro |
| MPI | Module |
Prerequisites
The following Gadi modules must be available:
module load openmpi/4.1.7 hdf5/1.12.2p gmsh/4.13.1 cmake/3.31.6
Pixi must be installed:
pixi --version # check
curl -fsSL https://pixi.sh/install.sh | bash # install if missing
Inode quota: Gadi’s /g/data has strict inode limits. PETSc (which creates many files during build) may need to be built on /scratch and symlinked from /g/data. The install script handles this if you set PETSC_DIR to a /scratch path.
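The build-on-scratch approach can be sketched as below. The function name `link_build_dir` and the example paths are hypothetical, and the install script may implement this differently; the point is that the symlink costs a single inode under /g/data while the many build files live on /scratch.

```shell
#!/usr/bin/env bash
# Sketch: keep a file-heavy build tree on /scratch and expose it under /g/data via a symlink.
# Function name and example paths are hypothetical.
link_build_dir() {
    local real_dir="$1" link_path="$2"
    mkdir -p "$real_dir" "$(dirname "$link_path")" || return 1
    # Refuse to clobber an existing real file or directory.
    if [ -e "$link_path" ] && [ ! -L "$link_path" ]; then
        echo "refusing to replace existing non-symlink: $link_path" >&2
        return 1
    fi
    # -n: replace an existing symlink instead of descending into its target.
    ln -sfn "$real_dir" "$link_path"
}

# Usage (placeholder paths):
#   link_build_dir /scratch/<project>/$USER/petsc /g/data/<project>/$USER/petsc
```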
Installation
Copy gadi_install_user.sh (per-user) or gadi_install_shared.sh (admin) from uw3-hpc-baremetal-install-run to a convenient location, edit the variables at the top, then:
source gadi_install_shared.sh install
| Step | Function | Time |
|---|---|---|
| Install pixi | | ~1 min |
| Clone Underworld3 | | ~1 min |
| Install pixi hpc env | | ~3 min |
| Build mpi4py | | ~2 min |
| Build PETSc + AMR tools | | ~1 hour |
| Build h5py | | ~2 min |
| Install Underworld3 | | ~2 min |
| Verify | | ~1 min |
What PETSc builds on Gadi
AMR tools: mmg, parmmg, pragmatic, eigen
Solvers: mumps, scalapack, slepc, superlu, superlu_dist, hypre
Partitioners: metis, parmetis, ptscotch
MPI: Gadi’s OpenMPI module (--with-cc/cxx/fc)
HDF5: Gadi’s hdf5/1.12.2p module (--with-hdf5-dir)
BLAS/LAPACK: fblaslapack (auto-detection fails due to compiler env manipulation)
petsc4py: built during configure (--with-petsc4py=1)
Activating the Environment
Source the install script at the start of every session or job:
source gadi_install_shared.sh
This loads Gadi modules, activates the pixi hpc environment via pixi shell-hook, and sets PETSC_DIR, PETSC_ARCH, and PYTHONPATH. Gadi’s HDF5 lib dir is prepended to LD_LIBRARY_PATH to ensure the parallel HDF5 1.12.2p is loaded at runtime (not conda’s serial HDF5 1.14).
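Prepending a library directory so that it wins at runtime can be sketched as follows. The function name `prepend_ld_library_path` is hypothetical, and the HDF5 path in the usage comment is a placeholder rather than the actual module prefix.

```shell
#!/usr/bin/env bash
# Sketch: prepend a directory to LD_LIBRARY_PATH exactly once.
# The function name is hypothetical.
prepend_ld_library_path() {
    local dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present; leave the existing order unchanged
        *) export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
}

# Usage (placeholder path):
#   prepend_ld_library_path "<hdf5-module-prefix>/lib"
```

The dynamic loader searches LD_LIBRARY_PATH entries left to right, which is why the parallel HDF5 directory has to come before any directory holding conda's serial HDF5.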
Running with PBS
Use gadi_pbs_job.sh from uw3-hpc-baremetal-install-run. Edit the variables at the top, then:
qsub gadi_pbs_job.sh
Monitor progress:
qstat -u $USER
tail -f <jobid>.o*
Shared Installation (Admin)
Deploys to /g/data/m18/software/uw3-pixi/ so all m18 project members can use it:
source gadi_install_shared.sh install
The install script is then copied to the install directory so users can source it directly:
source /g/data/m18/software/uw3-pixi/gadi_install_shared.sh
Troubleshooting (Gadi)
h5py undefined symbol: H5E_BADATOM_g
The pixi hpc env ships a serial HDF5 1.14 (transitive conda-forge dependency). If h5py links against it instead of Gadi’s parallel HDF5 1.12.2p, this symbol (removed in 1.14) is missing at runtime. The install script fixes this by temporarily hiding conda’s HDF5 during the h5py build so meson can only find Gadi’s. If you see this error, re-run:
source gadi_install_shared.sh
install_h5py
Compiler interference during PETSc build
The pixi hpc env ships a full conda toolchain (x86_64-conda-linux-gnu-*) that interferes with Gadi’s OpenMPI wrappers. build-petsc.sh handles this via setup_gadi_build_env(), which unsets conda compiler variables and forces the MPI wrappers to use system compilers (/usr/bin/gcc).
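The general idea can be sketched as below. This is not the actual setup_gadi_build_env() from build-petsc.sh: the function name `reset_compiler_env`, the list of variables cleared, and the /usr/bin compiler paths other than /usr/bin/gcc are assumptions. `OMPI_CC`, `OMPI_CXX`, and `OMPI_FC` are the environment variables Open MPI's wrapper compilers consult to select the underlying compiler.

```shell
#!/usr/bin/env bash
# Sketch of the idea only; not the actual setup_gadi_build_env() from build-petsc.sh.
# Assumes system GCC, G++, and gfortran live under /usr/bin.
reset_compiler_env() {
    # Drop compiler and flag variables that conda activation typically exports.
    unset CC CXX FC F77 F90 CPP CFLAGS CXXFLAGS FFLAGS CPPFLAGS LDFLAGS
    # Open MPI wrappers honour these when choosing the backend compiler.
    export OMPI_CC=/usr/bin/gcc
    export OMPI_CXX=/usr/bin/g++
    export OMPI_FC=/usr/bin/gfortran
}
```

After calling such a function, `mpicc --showme:command` should print the system compiler rather than an `x86_64-conda-linux-gnu-*` one.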
Fortran MPI library not found
Gadi ships compiler-tagged Fortran MPI libraries (libmpi_usempif08_GNU.so) rather than the standard untagged names. build-petsc.sh creates symlinks in petsc-custom/mpi-gadi-gnu-libs/ to bridge this.
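A symlink bridge of this kind can be sketched as follows. The function `link_untagged_libs` is hypothetical and is not the code in build-petsc.sh; the `_GNU` tag is taken from the library name mentioned above, and the source directory in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: expose compiler-tagged libraries under untagged names via symlinks.
# The function is hypothetical; "_GNU" is the tag from the example library name.
link_untagged_libs() {
    local src_dir="$1" dest_dir="$2" tag="${3:-_GNU}" lib base prefix suffix
    mkdir -p "$dest_dir" || return 1
    for lib in "$src_dir"/lib*"$tag".so*; do
        [ -e "$lib" ] || continue            # glob matched nothing
        base="$(basename "$lib")"
        prefix="${base%%"$tag".so*}"         # e.g. libmpi_usempif08
        suffix="${base#"$prefix$tag"}"       # e.g. .so or .so.40
        # libmpi_usempif08_GNU.so.40 -> libmpi_usempif08.so.40
        ln -sfn "$lib" "$dest_dir/$prefix$suffix"
    done
}

# Usage (source directory is a placeholder):
#   link_untagged_libs "<openmpi-module-prefix>/lib" petsc-custom/mpi-gadi-gnu-libs
```

The destination directory can then be added to the linker search path so that tools looking for the untagged names resolve to the tagged libraries.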
import underworld3 fails in PBS job
Ensure the install script is sourced inside the job script (not just in the login shell). The gadi_pbs_job.sh template does this correctly.
Rebuilding Underworld3 after source changes
source kaiju_install_user.sh # or gadi_install_shared.sh
cd <UW3_PATH>
git pull
pip install -e .