Parallel-Safe Scripting in Underworld3¶
Overview¶
Underworld3 uses PETSc for parallel operations, which means you rarely need to use MPI directly. PETSc handles domain decomposition, ghost cells, and synchronization automatically. However, understanding parallel safety is crucial for writing correct scripts that work on multiple processors.
Key Principle: PETSc Handles Parallelism¶
Critical Understanding: You don’t write MPI code - you write parallel-safe UW3 code.
All mesh operations are inherently parallel via PETSc
Solvers automatically distribute work across processors
Use UW3 API functions, not direct MPI calls
The main use of uw.mpi.rank is for conditional output/visualization.
MPI + Thread Pools (Oversubscription)¶
When running with MPI, each rank can also spawn BLAS/OpenMP worker threads. If this is not controlled, total runnable threads can explode and performance can degrade severely.
Example: mpirun -np 8 with OpenBLAS default 10 threads can create up to
80 compute threads, often slower than expected.
Default Underworld3 Policy¶
Underworld3 now applies MPI-safe defaults (thread pool size 1) unless users
explicitly set their own values:
OMP_NUM_THREADSOPENBLAS_NUM_THREADSMKL_NUM_THREADSVECLIB_MAXIMUM_THREADSNUMEXPR_NUM_THREADS
This happens in two places:
./uwlauncher: sets defaults before Python starts.underworld3import path: applies the same defaults for MPI runs if unset.
Runtime Warning¶
If running with MPI and any of the thread variables above are explicitly set
to values greater than 1, Underworld3 prints a rank-0 warning about possible
oversubscription.
User Controls¶
Disable automatic thread caps:
export UW_DISABLE_THREAD_CAPS=1
Suppress warning (keep your explicit thread settings):
export UW_SUPPRESS_THREAD_WARNING=1
Recommended Practice¶
For most MPI benchmark and production jobs, keep 1 thread per rank unless
you are intentionally tuning hybrid MPI+threads.
Parallel-Safe Output¶
The Problem with Rank Conditionals¶
This common pattern is dangerous:
# DANGEROUS - can hang in parallel!
if uw.mpi.rank == 0:
stats = var.stats() # Collective operation - other ranks wait forever!
print(f"Stats: {stats}")
Why it hangs: var.stats() is a collective operation - ALL ranks must call it. If only rank 0 calls it, other ranks wait forever for synchronization that never comes.
The Solution: Parallel Print¶
Use uw.pprint() for rank-specific output:
# Safe - all ranks execute stats(), only rank 0 prints
uw.pprint(f"Stats: {var.stats()}")
# Debug - print from multiple ranks
uw.pprint(slice(0, 4), f"Local max: {var.data.max()}")
Rank Selection Syntax¶
uw.pprint() supports flexible rank selection:
Basic Selection¶
# Single rank
uw.pprint("Only rank 0")
# All ranks
uw.pprint(None, "Everyone prints this")
# Range of ranks (Python slice)
uw.pprint(slice(0, 4), "Ranks 0-3")
uw.pprint(slice(2, 8, 2), "Ranks 2, 4, 6")
# Specific ranks (list/tuple)
uw.pprint([0, 3, 7], "Ranks 0, 3, and 7")
Named Patterns¶
uw.pprint('all', "All ranks")
uw.pprint('first', "Rank 0 only")
uw.pprint('last', "Highest rank only")
uw.pprint('even', "Even-numbered ranks")
uw.pprint('odd', "Odd-numbered ranks")
Advanced Selection¶
# Percentage of ranks
uw.pprint('10%', "First 10% of ranks")
# Function-based
uw.pprint(lambda r: r % 3 == 0, "Every third rank")
# NumPy arrays
import numpy as np
mask = np.array([True, False, True, False])
uw.pprint(mask, "Using boolean mask")
Selective Execution Context¶
For code that should only execute on certain ranks (not just print), use selective_ranks():
# Visualization - only rank 0 executes
with uw.selective_ranks(0) as should_execute:
if should_execute:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(x, temperature.data[:, 0])
plt.savefig("temp_profile.png")
# Multiple ranks
with uw.selective_ranks(slice(0, 4)) as should_execute:
if should_execute:
# Only ranks 0-3 execute this block
process_local_partition()
Tip
Using the Context Manager Return Value
The selective_ranks() context manager yields True if the current rank should execute, False otherwise. Always check this value to ensure code only runs on selected ranks:
with uw.selective_ranks(0) as should_execute:
if should_execute:
# Your rank-specific code here
pass
### Collective Operation Safety
```{warning} Avoid Collective Operations in Selective Blocks
**Never call collective operations inside `selective_ranks()` blocks** - they will cause deadlocks since not all ranks participate:
```python
# WRONG - Will hang!
with uw.selective_ranks(0):
stats = var.stats() # Collective operation - deadlock!
# RIGHT - Use pprint instead
uw.pprint(f"Stats: {var.stats()}") # All ranks execute stats(), only rank 0 prints
Future versions will include automatic detection of collective operations within selective execution blocks.
## Understanding Collective Operations
**Collective operations** require all MPI ranks to participate:
### Common Collective Operations
```python
# Solver operations
stokes.solve() # All ranks must call
var.stats() # All ranks must call
mesh.save("file.h5") # Collective I/O
# Data operations
var.rbf_interpolate() # All ranks participate
swarm.migrate() # Redistributes particles
Local Operations (Safe)¶
# These only access local data
var.data[...] # Local partition only
var.data.max() # Local maximum
print(var.data.shape) # Local shape
Practical Patterns¶
Pattern 1: Time-Stepping Output¶
# Time-stepping loop with progress output
for step in range(nsteps):
dt = stokes.estimate_dt()
# All ranks execute solver, only rank 0 prints progress
stokes.solve()
uw.pprint(f"Step {step}: dt={dt:.3e}, time={time:.3e}")
# Update fields...
time += dt
Pattern 2: Convergence Monitoring¶
# Monitor solver convergence
solver.solve()
# Get global statistics (collective), print on rank 0
velocity_stats = velocity.stats()
pressure_stats = pressure.stats()
uw.pprint(f"Velocity max: {velocity_stats['max']:.6e}")
uw.pprint(f"Pressure range: [{pressure_stats['min']:.6e}, {pressure_stats['max']:.6e}]")
Pattern 3: Debugging Parallel Decomposition¶
# Check local data on multiple ranks
uw.pprint(slice(0, 4), f"Rank {uw.mpi.rank}: Local elements = {mesh.dm.getLocalSize()}")
uw.pprint('all', f"Rank {uw.mpi.rank}: Partition shape = {var.data.shape}")
# Check first and last rank only
uw.pprint([0, uw.mpi.size-1], f"Rank {uw.mpi.rank}: Boundary points = {boundary_count}")
# Custom selection - every 4th rank
uw.pprint(lambda r: r % 4 == 0, f"Rank {uw.mpi.rank}: Memory usage = {psutil.Process().memory_info().rss / 1e9:.2f} GB")
Pattern 4: Visualization and I/O¶
# Only rank 0 does visualization
with uw.selective_ranks(0) as should_execute:
if should_execute:
import pyvista as pv
plotter = pv.Plotter(off_screen=True)
plotter.add_mesh(mesh.pyvista_mesh, scalars=temperature.data)
plotter.camera_position = 'xy'
plotter.show(screenshot=f"temp_{step:04d}.png")
Pattern 5: Mesh Information Display¶
# Display mesh statistics (from mesh.view())
uw.pprint(f"\nMesh # {mesh.instance}: {mesh.name}\n")
uw.pprint(f"Number of cells: {num_cells}\n")
if len(mesh.vars) > 0:
uw.pprint(f"| Variable Name | component | degree | type |")
uw.pprint(f"| ---------------------------------------------------------- |")
for vname in mesh.vars.keys():
v = mesh.vars[vname]
uw.pprint(f"| {v.clean_name:<20}|{v.num_components:^10} |{v.degree:^7} | {v.vtype.name:^15} |")
uw.pprint(f"| ---------------------------------------------------------- |")
else:
uw.pprint(f"No variables are defined on the mesh\n")
Pattern 6: Conditional Execution by Rank Group¶
# Define rank groups for different tasks
io_ranks = 0
compute_ranks = slice(1, None)
# Rank 0: Handle I/O
with uw.selective_ranks(io_ranks) as should_execute:
if should_execute:
mesh.save(f"checkpoint_{step}.h5")
swarm.save(f"particles_{step}.h5")
# Ranks 1+: Continue computing
with uw.selective_ranks(compute_ranks) as should_execute:
if should_execute:
# Prepare next timestep while rank 0 saves
update_advection_timestep()
Testing Parallel Safety¶
Always test scripts with multiple processors:
# Test with 2 processors
mpirun -np 2 python my_script.py
# Test with 4 processors
mpirun -np 4 python my_script.py
If your script hangs, check for:
Collective operations inside rank conditionals
Rank-specific execution of collective operations
Mismatched barriers or synchronization
Common Pitfalls¶
Pitfall 1: Collective in Conditional¶
# WRONG - hangs
if uw.mpi.rank == 0:
result = var.stats() # Other ranks wait forever
# RIGHT - all execute, selective output
uw.pprint(f"Result: {var.stats()}")
Pitfall 2: Assuming Global Data¶
# WRONG - data is local to each rank
total_points = var.data.shape[0] # Only local points!
# RIGHT - use collective operation
total_points = var.stats()['count'] # Global count
Pitfall 3: Serial Libraries in Parallel¶
# WRONG - matplotlib on all ranks (crashes or conflicts)
import matplotlib.pyplot as plt
plt.plot(x, y)
# RIGHT - only rank 0
with uw.selective_ranks(0):
import matplotlib.pyplot as plt
plt.plot(x, y)
Migration from Old Patterns¶
If you have existing code using rank conditionals, here’s how to migrate:
Old Pattern → New Pattern¶
# OLD: Rank conditional with print
if uw.mpi.rank == 0:
print(f"Iteration {step}, Time: {time}")
# NEW: Use pprint
uw.pprint(f"Iteration {step}, Time: {time}")
# OLD: Collective operation in conditional (DANGEROUS!)
if uw.mpi.rank == 0:
max_value = var.stats()['max']
print(f"Max: {max_value}")
# NEW: All ranks execute, selected ranks print
uw.pprint(f"Max: {var.stats()['max']}")
# OLD: Rank conditional for visualization
if uw.mpi.rank == 0:
import pyvista as pv
plotter = pv.Plotter()
plotter.add_mesh(mesh.pyvista_mesh)
plotter.show()
# NEW: Use selective_ranks
with uw.selective_ranks(0) as should_execute:
if should_execute:
import pyvista as pv
plotter = pv.Plotter()
plotter.add_mesh(mesh.pyvista_mesh)
plotter.show()
Note
Why the Change? The new patterns are safer because:
uw.pprint()ensures all ranks evaluate arguments (preventing collective operation deadlocks)selective_ranks()makes it explicit which code is rank-specificCode is more readable and intention is clear
Future automatic deadlock detection becomes possible
When to Use What¶
Task |
Use |
Example |
|---|---|---|
Print from specific ranks |
|
|
Serial code (viz, I/O) |
|
Matplotlib, file writes |
Collective with output |
|
|
Debug multiple ranks |
|
Check local data |
Best Practices¶
Never use direct MPI unless absolutely necessary
Use
uw.pprint()instead ofif uw.mpi.rank == 0: print()Wrap serial libraries in
selective_ranks(0)Test with multiple processors early and often
Check error messages - they tell you exactly what’s wrong
Advanced: Rank Groups¶
For complex patterns, define rank groups:
class RankGroups:
def __init__(self):
size = uw.mpi.size
self.io_rank = 0
self.compute_ranks = list(range(1, size))
self.sample_ranks = list(range(0, size, size // 4))
groups = RankGroups()
# Use throughout script
uw.pprint(groups.io_rank, "Saving data...")
with uw.selective_ranks(groups.compute_ranks):
heavy_computation()
Quick Reference¶
API Summary¶
uw.pprint(ranks, *args, prefix=True, **kwargs)
Print from selected ranks
All ranks evaluate arguments (safe for collective operations)
Optional rank prefix (default:
[0])
uw.selective_ranks(ranks)
Context manager for rank-specific execution
Yields
True/Falsefor current rankUse with
if should_execute:check
Rank Selection Quick Reference¶
Syntax |
Selects |
Example |
|---|---|---|
|
Single rank |
|
|
Range of ranks |
|
|
Specific ranks |
|
|
All ranks |
|
|
Rank 0 |
|
|
Highest rank |
|
|
Even ranks |
|
|
Odd ranks |
|
|
First 10% of ranks |
|
|
Custom function |
|
Common Collective Operations¶
These operations require ALL ranks to participate:
Operation |
Type |
Example |
|---|---|---|
|
Statistics |
Global min/max/mean |
|
Solver |
All PETSc solvers |
|
I/O |
Parallel HDF5 write |
|
Interpolation |
Radial basis functions |
|
Redistribution |
Particle migration |
Migration Checklist¶
Replace
if uw.mpi.rank == 0: print(...)withuw.pprint(...)Replace
if uw.mpi.rank == 0:for visualization withwith uw.selective_ranks(0):Ensure collective operations are called on ALL ranks
Test with
mpirun -np 2andmpirun -np 4Check for deadlocks (script hangs = collective operation issue)
Timing Output at Extreme Scale¶
uw.timing.print_table() ultimately calls PETSc’s PetscLogView. At very
high CPU counts (≳1000 ranks), the ASCII output path can hang —
typically appearing as a job that completes its computation cleanly but
never exits. The CSV write path uses a different, less collective-heavy
strategy and avoids the issue:
# Default — fine at small scale, can hang at ≳1000 ranks
uw.timing.print_table()
uw.timing.print_table("results.txt")
# Safe at any scale — recommended for HPC runs
uw.timing.print_table("results.csv")
The behaviour is in PETSc, not Underworld; choosing CSV at scale is the recommended workaround. (Issue #134.)
Summary¶
Key Takeaways:
PETSc handles parallelism - you write parallel-safe UW3 code
Use
uw.pprint(ranks, ...)for output on specific ranksUse
with uw.selective_ranks(ranks):for serial operationsCollective operations must run on ALL ranks - never inside rank conditionals
Test with
mpirun -np Nto catch issues earlyAt ≳1000 ranks, write timing output as
.csvto avoidPetscLogViewhangs
The parallel safety system makes parallel programming in Underworld3 safer and more intuitive - collective operations are evaluated on all ranks automatically, preventing common deadlock scenarios!