In-memory checkpoint as a general UW3 capability¶
Design note. Spun off from the deformable-surface architectural
discussion (2026-05-11) as a self-contained capability that is bounded,
useful in its own right, and not specifically tied to free-surface code.
The companion note that motivated this is
docs/developer/design/deformable_surface_metronome_design_note.md (in
the feature/exp-integrator-freesurface worktree).
Motivation¶
Primary use case: backtrack past unstable timestepping. A timestepper hits an instability or a sudden regime change; the cleanest recovery is to restore the last known-good state and continue with smaller Δt (or a different scheme). Today this is done ad-hoc by users where it happens at all.
Secondary uses, all sharing the same primitive:
Multi-stage time integration (RK4 between stages — restore to start-of-step, deform to next stage configuration, sample rate). The free-surface integrator-zoo work is the immediate case; the same primitive applies to any multi-stage scheme.
Adaptive Δt with error estimate (take a step, estimate per-step error, restore + retry if too large).
Predictor-corrector probing (try a predictor, check the corrector residual, fall back if not converging — relevant to the VEP work).
Regime-change feeling-out (e.g., elastic predictor → check yield surface → restore + plastic split if violated).
All five want one thing: at one moment, capture enough state that the system can be put back at any point afterwards as if nothing had happened.
Reframe: same operation as on-disk checkpoint, different backend¶
Rather than build a parallel snapshot mechanism, treat in-memory snapshot as a backend variant of the existing checkpoint code. Two benefits:
One source of truth for “what is system state.” Whatever set of (mesh coords, MV DOFs, swarm positions, swarm-var values, algorithm history, …) the checkpoint serialiser captures becomes the contract for in-memory snapshots too. Adding new state-bearers means adding them once; both backends pick them up.
The existing checkpoint code gets exercised harder and improved as a side effect. In-memory roundtrip is a cheap unit test — every snapshot/restore cycle exercises the serialise/deserialise paths, which surfaces gaps and bugs that disk-only checkpointing exposes only at quarterly-test cadence.
Two checkpoint paths today (audit-confirmed)¶
The audit (read-only investigation, current main) confirmed two
distinct on-disk paths in src/underworld3/:
Path A — PETScSection-based (native checkpoint).
Mesh.write_checkpoint(...)atdiscretisation/discretisation_mesh.py:1892–1953.Uses
dm.sectionView(viewer, subdm)anddm.globalVectorView(viewer, subdm, var._gvec)through aPETSc.ViewerHDF5().Captures mesh topology, deformed coordinates, MV DOF values, and swarm-variable values via the
_meshVarproxy.Lower-level; bound to the producing DM.
Path B — write_timestep (visualization-oriented).
Mesh.write_timestep(...)atdiscretisation/discretisation_mesh.py:1750–1830andSwarm.write_timestep(...)atswarm.py:3726–3810.Per-variable HDF5 + XDMF for ParaView; mixes
PETSc.ViewerHDF5()with directh5pywrites (swarm.py:1772–1850).More flexible — can be re-loaded at different resolution / decomposition; bulkier; the user-facing visualisation pipeline.
For in-memory snapshot, Path A is the conceptual model. Restore goes back to the same DM, so the resolution/decomposition flexibility of Path B is unneeded. But the in-memory backend will not actually use the HDF5 Viewer — it will copy section structure + global-vector data directly into numpy arrays. Same conceptual capture, different mechanism.
What state must be captured¶
Audit-informed inventory:
Captured today (in Path A):
DM topology and section
Deformed mesh coordinates
Mesh-variable global vectors (DOF values) — including
DDt.psi_star, which is itself a mesh variable; its DOF data goes through the section path automaticallySwarm-variable values (via mesh-DM proxy fields)
NOT captured today, required for full-state in-memory restore:
state |
location |
priority |
|---|---|---|
|
|
high |
|
|
high |
|
|
high |
Binding |
implicit in |
high |
Simulation time, step counter |
not in any object today |
high |
Parameter mutation history |
|
medium |
|
|
medium |
Solver iteration counts / convergence history |
scattered |
low |
The DDt example is representative, not exceptional. Algorithm-internal scalar/list state on Python objects is the architecturally significant gap — it lives in solver-adjacent Python objects, not in any PETScSection. The in-memory snapshot must compose PETScSection state + Python-side mutable state into a single token. The next section addresses how this composition should be designed in general — not as a per-class bolt-on, but as a contract that new algorithm-helper classes follow from day one.
General serialisation contract for solver-internal state¶
The opportunity. Bolting per-class snapshot hooks onto each algorithm helper as it appears would produce a half-baked system that silently misses state whenever a new helper is added without a corresponding hook. DDt today, the next adaptive-Δt controller tomorrow, the gamma estimator the day after — the pattern recurs.
This is the right moment to design a general serialisation contract for solver-internal state, before it leaks into the user API. New algorithm-helper classes should declare their state slots from day one; existing classes (DDt, parameter mutation history, solver convergence bookkeeping) get retrofitted as they’re touched.
Three options for the contract.
(A) Declarative slots. Class declares a class-level list of attribute names that constitute state.
class DDt:
_state_attrs = ('_dt_history', '_history_initialised',
'_n_solves_completed')
Snapshot copies named attrs; restore writes them back. Stringly-typed; silent breakage when attribute names drift. Least invasive — works as a mixin on any existing class.
(B) Explicit save/restore methods. Class implements paired methods.
class DDt:
def _save_state(self) -> dict: ...
def _restore_state(self, d: dict) -> None: ...
Most flexible (transform on save, validate on restore). Most boilerplate;
easy to forget to update when adding a new state attribute. Standard
Python pickling pattern (__getstate__ / __setstate__) is essentially
this option.
(C) State as a first-class object. Computation and state are separated; the State object is a dataclass / pydantic model that trivially serialises.
@dataclass
class DDtState:
dt_history: list[float]
history_initialised: bool
n_solves_completed: int
psi_star_var_names: list[str] # binding to MVs lives here too
class DDt:
def __init__(self, ...):
self.state = DDtState(...)
# All mutations go via self.state.<attr>
Most self-documenting; enables side benefits beyond serialisation (deep-copy, equality testing, repr, schema-versioned migrations). Most invasive — changes how new solver-internal classes are written.
Recommendation. (C) for new code, (B)-style adapters for the small number of existing classes that need retrofitting (primarily DDt and parameter-mutation-history). (A) is rejected because the silent-drift failure mode is exactly the kind of half-baked outcome we’re trying to avoid by designing this now.
The decision matters because it sets the pattern for every algorithm-internal class added over the next few years. The cost of choosing wrong is that snapshot tokens silently miss state for any class added without proper instrumentation; the cost of choosing right is that new solver-internal classes get serialisation, deep-copy, and equality testing automatically.
Side benefits of (C) beyond checkpoint.
solver_a.state == solver_b.statebecomes a meaningful comparison — useful for regression testing of solver-internal behaviourrepr(obj.state)is automatic and useful for debuggingsnapshot tokens compose trivially:
{id(obj): obj.state.copy()}schema versioning is tractable: a
_schema_versionfield on each State object plus a migration registry handles cross-version compatibility (relevant for on-disk checkpoints that survive UW3 upgrades)the
Snapshottableinterface becomes a one-liner: “has a.stateattribute that is a dataclass.”
Bindings to PETSc-owned objects. State objects must NOT hold direct PETSc Vec / DM / MV handles — only stable identifiers (variable names, mesh names) that can resolve back to the live object on restore. This keeps tokens plain-Python and avoids the DM-lifecycle hazards identified in earlier work.
A third backend: on-disk full-state snapshot¶
The in-memory backend is one storage option. The same capture/restore machinery supports a slower-but-persistent on-disk full-state backend with no architectural changes — only the serialisation layer differs.
This is distinct from the existing write_timestep on-disk path,
which is selective (per-variable), designed for visualisation, emits
XDMF for ParaView, and does not restore solver-internal state. Three
backends serving three different needs:
dimension |
existing |
new on-disk full-state |
new in-memory |
|---|---|---|---|
storage |
HDF5 + XDMF, per-variable |
HDF5, monolithic |
dict of numpy arrays |
selectivity |
user picks variables |
always full state |
always full state |
Python-side state |
not captured |
captured |
captured |
restorability |
partial (re-init solvers, lose history) |
bit-equivalent |
bit-equivalent |
persistence |
survives process exit |
survives process exit |
intra-run only |
speed |
medium |
slow (HDF5 I/O) |
fast (RAM copy) |
typical use |
visualisation, restart-with-changes |
crash recovery, bisection, debugging |
RK staging, backtrack, adaptive Δt |
The existing write_timestep continues to serve its role unchanged.
The new mechanism (in-memory + on-disk-full-state) addresses a
different need: faithful state restore for algorithmic uses where
“approximate restart” would silently corrupt the run.
Use cases the on-disk full-state backend opens up:
Crash recovery for long simulations. Periodic snapshots written to disk; on restart, restore from the most recent. No lost work beyond the snapshot interval.
Cross-run resumption. A simulation that ran to \(t = T_1\) can be picked up at \(t = T_1\) in a later session, bit-equivalent — including DDt history and any other algorithm-internal state.
Bisection / branching exploration. Snapshot at a decision point; try one parameter path; if unsatisfactory, restore and try another. Powerful for sensitivity studies on long runs.
Debugging captures. Snapshot at a problem point; examine offline; iterate without re-running the costly setup.
What this adds to the design. Three implications, all bounded:
Backend abstraction must serialise to bytes from day one. The in-memory backend stores numpy arrays directly; the on-disk backend serialises them to HDF5. Both go through the same
save_*/load_*protocol on the backend interface — the abstraction is shaped by needing both.Schema versioning becomes critical. In-memory tokens are short-lived; on-disk tokens may be loaded by a future UW3 version. Each
Statedataclass carries a_schema_version; a migration registry handles cross-version restore. (The on-diskwrite_timesteppath has no equivalent need because it doesn’t restore solver state.)Generation-counter semantics need a cross-process branch. Within-process: counter check ensures you don’t restore an invalidated snapshot. Across-process (restoring from disk on a fresh run): the model is being initialised from the snapshot, not restored to a previous state — counters are set to the snapshot’s values, no invalidation check needed. Restore code distinguishes the two paths.
API shape¶
Full-state always; backend chosen at save time by passing (or
omitting) a file=. (Final names landed after a phase-5 rename
pass that replaced the draft snapshot/restore verbs — see
the user guide at docs/advanced/snapshot-restore.md.)
# Same call, different storage — dispatch on whether a file is given.
token = model.save_state() # in-memory (default)
model.save_state(file='step42.snap.h5') # on-disk full-state
model.load_state(token) # in-memory restore
model.load_state('step42.snap.h5') # on-disk restore
# Existing per-variable selective on-disk path is unchanged:
mesh.write_timestep('step42.h5', ...) # visualisation; not full-state
Backends share a single Snapshot structure — only the serialisation
layer differs:
class Snapshot:
section_state: dict[DM, SectionAndVecData] # PETSc state
python_state: dict[ObjectId, ObjectState] # State dataclasses
generations: dict[ObjectId, int] # within-process invalidation
metadata: dict # sim time, step #, schema version
The in-memory backend stores section_state arrays as numpy buffers
directly; the on-disk backend serialises them to HDF5 datasets. The
Python-side State dataclasses serialise to HDF5 attributes (small
scalars / short lists) and groups (arrays).
Tokens are plain Python / numpy — never PETSc Vec or DM handles. On-disk “tokens” are paths to single HDF5 files. Either way, the DM-lifecycle hazards identified in earlier work do not apply.
Restore semantics for swarms — rebuild, do not refuse¶
Correction (2026-05-11, post-review). An earlier draft of this
section proposed using a per-swarm _population_generation counter as
an invalidation gate: if the counter at restore differs from the
counter at capture, raise rather than restore. That design is wrong.
The whole point of the toolkit is to undo state changes — including
particle motion / migration / repopulation between capture and
restore. Refusing on counter mismatch breaks the central use cases (RK
staging, backtrack-on-instability, adaptive Δt retry — all of which
will migrate particles between capture and restore).
The correct semantics: restore rebuilds the swarm’s particle population from the snapshot. Specifically:
Clear the swarm’s current local particles.
Re-add the captured per-rank coordinates via
add_particles_with_coordinates(saved_local_coords, migrate=False). The mesh partition is deterministic and unchanged within v1 scope (mesh-version check still applies), so particles that were local at capture are local at restore — no migration step needed.Write the captured per-particle variable data back into the newly-added particles, in their captured order.
The Swarm._population_generation counter is still useful as
informational metadata — it can flag in logs / metadata what
happened between capture and restore, it can feed cache invalidation
in other consumers, it can power future optimisations (e.g., a fast
in-place restore when the counter happens to match). But it is not
a restore gate.
Mutation sites where the counter is incremented (current line numbers re-derived per audit; the design’s correctness does not depend on the exact set as long as we over-bump rather than under-bump):
file |
site |
call |
|---|---|---|
|
end of |
covers internal |
|
|
bumps unconditionally; conservative no-op safe |
|
after |
direct PETSc DM call, not via |
|
after |
catches |
|
|
recycle-mode reinitialisation |
Mesh._mesh_version is a separate counter on the mesh side. In v1
the mesh-version mismatch does refuse restore — see the next
section.
Architectural work required¶
In rough dependency order:
Backend abstraction layer. Extract the “capture” logic from the “store” logic in
Mesh.write_checkpoint(). Introduce aCheckpointBackendprotocol (e.g.,save_section,save_vector,save_metadata,load_section,load_vector,load_metadata). Shaped from day one to support both backends — the in-memory case is the cheapest test of the abstraction’s correctness; the on-disk case is what locks in the byte-level serialisation contract. The existing HDF5 path is refactored to fit the protocol; the wedge is clean because section/vector ops are PETSc abstractions and the HDF5 coupling lives in the Viewer construction.Backend implementations. Two concrete backends:
InMemoryBackend— dict of numpy arrays. Trivial once the abstraction exists.OnDiskFullStateBackend— single monolithic HDF5 file. Shares PETSc-state serialisation with the existingwrite_checkpointpath (already HDF5); adds Python-state serialisation as HDF5 attributes/groups.
Adopt the serialisation contract for new solver-internal code. Decision: option (C) — state as first-class dataclass (see “General serialisation contract” section). Any new algorithm-helper class added from this point on declares its state as a separate dataclass; the checkpoint mechanism reads
.stateautomatically. Document the pattern in the developer guide; add a check to CI that new classes insystems/andsolvers/declare.state.Retrofit existing solver-internal classes to the contract. In priority order:
DDt(systems/ddt.py), parameter mutation history (parameters.py:145), any solver convergence-tracking state (audit pending). Each retrofit is small; total bounded by the number of classes (probably under ten).Swarm rebuild on restore + informational
_population_generationcounter. Snapshot captures per-rank particle coordinates and per-variable arrays. Restore clears the current local population and re-adds the captured particles at their captured coords (see “Restore semantics for swarms” section above for the corrected design). The counter is bumped at every identified mutation site for informational use; it is not an invalidation gate.Schema versioning + migration registry. Each
Statedataclass carries a_schema_versioninteger. A central registry maps(class, version)to migration functions that lift older State data to the current schema. In-memory restore checks version equality (any mismatch is a programming error since both sides are the same process); on-disk restore consults the migration registry and applies migrations in sequence. Without this infrastructure, on-disk snapshots break the moment any retrofitted class evolves.Model.snapshot()/Model.restore()orchestration. Walks registered meshes, swarms, MVs, and any object exposing a.stateattribute; composes the token; routes to backend (in-memory or on-disk) based on whetherpath=is given; restores in safe order; distinguishes within-process restore (counter validation enforced) from cross-process restore (counters initialised from snapshot).Tests. In-memory roundtrip on standard setups (Stokes benchmark, swarm-bearing setup, DDt-using setup). On-disk roundtrip on the same setups. Cross-process restore: write to disk in process A, load in fresh process B, verify the restored model continues bit-equivalently from the snapshot point. Verify generation invalidation raises cleanly; verify
.state == .stateafter roundtrip for retrofitted classes.
Scope boundaries (NOT in v1)¶
Mesh adaptation roundtrip — scheduled for v1.2, not a permanent limitation. A snapshot taken before a
mesh.adapt()event in v1 refuses restore via the_mesh_versioncheck, because the captured DOF arrays are sized for the pre-adapt section and writing them in-place into the post-adapt DM would corrupt the run. v1.2 will replace the refusal with a mesh-rebuild path: capture enough topology / section info to destroy the post-adapt DM and rebuild the pre-adapt one, then write DOFs into the rebuilt DM. The principle is the same as the swarm rebuild (capture-the-state, rebuild-on-restore); the implementation is more invasive because every MeshVariable / Swarm / solver holds references into the DM and those wrappers need to re-bind. v1’s snapshot captures the topology / section info even though v1 restore ignores it, so the payload is forward-compatible with v1.2 without a schema bump.Replacing the existing
write_timesteppath. The selective per-variable on-disk path continues unchanged. The new full-state on-disk backend is additive and serves a different need (faithful restore vs visualisation/restart-with-changes).Cross-rank-count restore. On-disk full-state snapshots written on N ranks restore on N ranks. Restoring on a different rank count requires the existing
write_timesteppath’s interpolation machinery and is out of scope for the full-state backend.Lazy / copy-on-write in-memory tokens. Always-eager copy for v1. The use cases that drove this (RK staging, single-step backtrack) all happen on a per-step cadence where eager copy is affordable. Lazy/COW is an optimisation for later if someone proves it matters.
Open implementation questions¶
Does PETSc expose a clean way to copy section + vector state directly into numpy without going through a Viewer? The PETSc
Vec.getArray()/Section.getDof()low-level APIs should suffice, but it needs a quick prototype to confirm. Alternative:PETSc.Viewer.Type.MEMORYmay exist but support for DM operations is uncertain.How does
Model.snapshot()discover state-bearing objects? With the contract decision made (option C —.statedataclass), discovery options narrow to: (a) walk Model’s registered solvers and recurse into anything with a.stateattr; (b) explicit registration at construction time. (a) is more ergonomic but relies on solvers maintaining proper registration with the Model. (b) is more explicit but requires every state-bearing object to know about Model. (a) is probably right; verify the existing solver-Model relationship can support recursive.statediscovery.What does
Model.restore(token)order look like? Mesh coords first (since MV DOFs are tied to mesh layout), then MV DOFs, then swarm positions + migrate, then swarm-var values, then DDt and Python-side state, then generation counters validated last. Confirm this order is correct; particularly whether MV restoration needs the mesh to be at restored coords first.Memory cost on realistic setups. For a typical coupled-physics run (mesh + swarm + several MVs + DDt history), what’s the per-snapshot byte cost? Drives the answer to “can users keep N snapshots in memory without thinking about it.” Probably bounded by one-Stokes-solve worth of memory, but worth measuring early.
Status¶
Audit complete. Design pending review. Implementation has not started. The work is bounded (no open-ended research items in any of the seven architectural-work items above) and decomposes into commits that can land sequentially: backend abstraction → InMemoryBackend → Python-side registration → swarm generation counter → Model orchestration → tests.
Expected size: around four weeks of careful work, dominated by:
the backend-abstraction refactor (item 1), which touches existing checkpoint code other things depend on, and must be shaped to support both backends from day one;
the contract retrofit (item 4), mechanically small per class but needing careful audit so no state is silently missed;
the on-disk backend (item 2b) and schema versioning (item 6), together adding around a week beyond the in-memory-only scope but unlocking the cross-run / crash-recovery / bisection use cases.
The contract decision (item 3) and the schema-versioning decision (item 6) are the highest-leverage pieces — they set the pattern for every algorithm-internal class added over the next few years and the durability guarantee for every on-disk full-state snapshot. Worth getting right even if they slow the immediate work, because the alternative is years of half-baked snapshots that silently miss state or break across UW3 versions.
Cross-references¶
docs/developer/design/deformable_surface_metronome_design_note.md(infeature/exp-integrator-freesurfaceworktree) — the parent design discussion that motivated spinning this off as a separate capability.publications/free-surface-paper/integrator_zoo_supplementary.md— the empirical work that exposed the need for snapshot/restore as part of multi-stage time integration.