Lagged reference-mesh SL history (and mixed-mesh evaluation)¶
Status: design / plan (2026-06). Supersedes the ALE v_mesh-pulse + CARRY
approximation for the semi-Lagrangian history on a moving mesh. Two-phase plan:
Phase 1 — mixed-mesh expression evaluation (the enabling primitive,
implement and prove first); Phase 2 — lagged rolling-clone SL history
(the architecture built on Phase 1).
Related: REMESH_FIELD_TRANSFER_DESIGN, in_memory_checkpoint_design, submesh-solver-architecture, STRESS_EQUILIBRIUM_FREESURFACE.
Problem¶
On a moving mesh (free surface, or adaptation) the semi-Lagrangian update needs the old field at the departure point. At a receding free surface (downwellings) that point lies in the layer the surface just vacated — which the new mesh does not cover. Sampling the new-mesh FE field there extrapolates the P3 polynomial → ±300 node-level speckles → blow-up (Ra=1e5 annulus, ~step 20).
The current code approximates “old field on old geometry” without keeping a
copy: history vars are CARRY’d onto the new mesh and the trace-back subtracts
v_mesh = Δx/dt (ddt.py SemiLagrangian.update_pre_solve /
_activate_ale_for_traceback). This is the approximation that fails:
The order-1 advected field has its
v_meshcorrection gated off ati=0(the band-aid re-record path,ddt.py:2447), so its foot lands at the old surface — in the vacated layer.The annulus
return_coords_to_bounds(meshing/annulus.py:507) clamps to the construction-time circular radius, not the deformed surface — so receded feet aren’t recognised as outside and aren’t clamped.
Confirmed empirically (~/+Simulations/fs_convection_goal4/smoke_geomclamp.py):
a geometry-aware clamp delays the blow-up ~2 steps and halves the early
overshoot but does not cure it — because the departure value is wrong, not
just the geometry. Clamp/monotone band-aids are dead ends; monotone does run on
the FE path (functions_unit_system.py:854) but fails via neighbour-corruption
feedback once a speckle seeds.
Target architecture: lagged rolling-clone¶
Keep a frozen clone of the previous-step mesh (geometry + the primitive
fields needed for history). The SL trace-back samples that clone:
global_evaluate(history_expr_on_clone, departure_coords). The departure point
sits in the receded layer, which the old mesh covers — so the sample is a
real interpolated value: no clamp, no extrapolation, no speckle. This is the
exact computation the v_mesh/CARRY apparatus only approximates.
Invariants¶
One clone, any scheme order. Every trace-back is a single-Δt reach from the current mesh to the previous-step mesh. BDF2 reaches 2Δt as two composed single-Δt traces across successive steps (
psi_star[1]is last step’s already-tracedpsi_star[0]). History depth lives in the stack of fields, not in a stack of meshes. The clone is a rolling one-step-behind copy, refreshed at the start of each step.Shared partition. The clone must share topology and the exact parallel partition / local numbering (PETSc
DMClone-style: shared DMPlex +PetscSF, independent coordinate vector). Then refreshing the clone’s geometry each step is a local memcpy of coordinates, no global scatter. (Sampling at launch points isglobal_evaluate— parallel-routed and partition-agnostic; shared partition lets it reuse routing.)The clone stores primitives, not derived tensors. Store
T(andV, and stress for VE) at the lagged time; recompute gradients / fluxes / stress at full FE accuracy on the clone. This retires the lossy projected histories (DFDtflux history, thepsi_star_flattensor view, the VE stress projection) and their projection diffusion.Mesh-homogeneous history; the mixing guard defines the clone set. Each history term is an expression evaluated entirely on the clone; the only thing crossing the mesh boundary is a value array at coordinates, never a symbolic expression spanning two DMs. The evaluator’s existing refuse-mixed-mesh guard is the enforcement: if a history build trips it, that names a current-mesh field that must be cloned to the lagged mesh. Time- splitting (BDF/AM) already separates terms by time level, so “no mixed mesh” and “no mixed time level in one expression” become the same discipline.
Split the implicit
self.mesh. Todayupdate_pre_solveassumes one mesh everywhere (get_closest_cells,_centroids,return_coords_to_bounds, theglobal_evaluatecalls). The refactor makes two roles explicit: foot computation (where did the material come from) on the current mesh; history sampling (value/flux there) on the lagged clone. Every evaluation takes its mesh from the variable/expression, not a default.Unifies free-surface and adaptation. The lagged clone is the SL reference regardless of how nodes moved (free surface, mmpde, OT). Same code path — the original goal of this work.
Phase 1 — mixed-mesh expression evaluation (implement first)¶
The enabling primitive, and the efficiency-critical piece. Two implementation strategies; Phase 1 likely needs both, chosen per term:
(a) Clone-the-primitives → expression is homogeneous on the clone. Simple per-eval; costs memory + per-step field copy. Good default for history terms.
(b) True mixed-mesh evaluate → evaluate each meshVariable on its own DM at the shared physical coordinates, then combine the per-mesh arrays pointwise. No field cloning; the harder/efficiency-critical path.
Semantics of (b)¶
For an expression E whose meshVariable atoms belong to meshes {M_a, M_b, …},
at physical points X: value at x_k = E(var_a(x_k|M_a), var_b(x_k|M_b), …).
Pointwise; derivatives are local to each variable’s own mesh, so they’re fine.
Mechanism¶
Partition the (unwrapped) expression’s atoms by owning mesh.
For each mesh
M_i, point-locateXand interpolate the needed variable/derivative components → arrayA_i.Substitute the
A_iinto the symbolic combination and evaluate pointwise (existing numpy/lambdify path).
Efficiency (the crux)¶
Per-mesh location cache. Point location dominates. Cache per mesh, keyed on coord-hash (cf.
_dminterpolation_cache, see data-access); reuse across all history levels evaluated at the sameXwithin one step.Shared-partition fast path. When
M_bis a partition-sharing clone ofM_a, a point owned by rank r onM_ais owned by rank r onM_b— one routing, two local interpolations; no second scatter. This is why the clone shares the partition.Reuse the locator across the BDF stack (all
psi_star[i]shareX).
API¶
Keep refuse-mixed-mesh as the default (accidental mixing still caught).
Add an explicit opt-in (e.g.
evaluate(..., allow_mixed_mesh=True)or a dedicatedmulti_mesh_evaluate) for intentional history evaluation.Caveat to verify up front:
global_evaluateof derivative / vector expressions on a non-current mesh has a known wrinkle on some branches (scalar_componentmismatch). The gradient-recompute path depends entirely on this — confirm/fix it as the first Phase-1 task.
Tests¶
Correctness vs per-mesh manual eval; the shared-partition fast path (bit-identical, reduced routing); parallel (np>1); derivative/vector exprs across meshes.
Phase 2 — lagged rolling-clone SL history¶
Clone primitive:
DMClone-with-shared-partition + independent coords; rolling refresh (local coord + field copy at step start). Reuse the in-memory snapshot toolkit if it already provides same-partition state hold.DDt refactor: split foot-mesh (current) vs history-mesh (clone); sample history via Phase-1 evaluation; recompute fluxes/gradients homogeneously on the clone.
Retire the
CARRY+v_mesh-pulse +i=0-gate machinery for the trace- back (scope what else still depends onCARRYbefore deleting).Validate: the free-surface smoke test reaches approximate equilibrium (T∈[0,1], no speckle); adaptation regression unchanged; bit-identical on a non-moving mesh.
Open questions / risks¶
Derivative/vector
global_evaluateon a non-current mesh (Phase-1 task 1).What else depends on
CARRY/ the ALE pulse (swarm advection? VE?).Memory/perf of the rolling clone at scale; locator rebuild vs reuse.
VE: the stress history needs lagged
V(and∇V) on the clone — confirm the clone primitive set covers the constitutive expression’s full free-variable set (the mixing guard enumerates it).
Relationship to landed work¶
PR #246 (capability gate +
Mesh.deform+ephemeral_coords+ SL-fieldCARRYauto-stamp) made coordinate mutation foolproof and proved the failure is not transfer-incoherence — it set up this design. TheCARRYauto-stamp is part of the apparatus this design will eventually retire.The big question this unblocks: SL height field vs full node movement + mmpde. The lagged-clone approach makes SL correct at a moving boundary; the comparison is then about cost/robustness, not correctness.
Stage 0 — landed (2026-06-18): old-frame reach-back, cost/benefit¶
Stage 0 (a) — ephemeral old-geometry restore on the working mesh — is implemented and validated as the production cure for the high-Ra free-surface SL blow-up. It is the minimal subset of this design: no clone, no mixed-mesh evaluator. This section records what it costs, what it leaves on the table, and the recommendation on whether to proceed to the rolling clone (Stage 1 (c)) now.
What landed¶
SemiLagrangian.old_frame_traceback (opt-in, default False; forwarded through
AdvDiffusionSLCN(..., old_frame_traceback=True) to the advective DuDt
only — the diffusive DFDt keeps stock ALE, validated sufficient at Ra=1e5).
Three hooks in systems/ddt.py:
on_remeshstashes the pre-move geometry (_oldframe_X = ctx.old_X, earliest since the last solve) instead of av_meshdisplacement, and leavespsi_starCARRY’d.update_pre_solvecomputes the departure foot from the physical velocity (nov_mesh;_ale_activeisFalse), recordspsi_star[0]by direct nodal carry (not a re-evaluate on the deformed mesh — see below), skips the new-mesh bounds clamp, and samplespsi_starinsidewith mesh.ephemeral_coords(): mesh._deform_mesh(_oldframe_X); global_evaluate(...).The one-step stash is consumed at the end of
update_pre_solve.
De-risked: a variable’s .data is bit-identical through a
_deform_mesh(old) → _deform_mesh(new) round-trip (nuke_coords_and_rebuild
rebuilds the DS / DM / DOF-coordinate caches but never touches the solution
Vec), and _deform_mesh invalidates _dminterpolation_cache + the evaluation
cache — so the ephemeral sample is correct and not stale. The clone (Stage 1) is
therefore not required for correctness; it is an efficiency / capability
play.
Load-bearing detail: the first real-API run blew up like the baseline
despite old-frame being active, because the standard store_result path
re-records psi_star[0] by evaluating psi_fn on the deformed mesh at
centroid-shifted nodes — injecting boundary-layer interpolation error that grows
with h_max and then rides the old-geometry sample. Recording the history by a
direct nodal carry (reusing the parallel _record_psi_star_from_field_data
path) restores the prototype’s exact behaviour. This is the “store primitives,
not re-derived values” principle of invariant 3, in miniature.
Validation (Ra=1e5 annulus, res 20, free surface, CN)¶
run |
worst T over 30 steps |
|---|---|
baseline (stock ALE) |
|
old-frame, SLCN (BDF1 + CN flux, θ=0.5) |
|
old-frame, SL-BDF2 (BDF2 + flux at n+1, θ=1) |
|
h_max reaches ~0.067 (surface deformed ~2.3 cells), Nu climbs 12→55 — vigorous,
not over-diffused. Old-frame is order-agnostic — it changes only where the
history is sampled, so both the canonical SLCN and SL-BDF2 schemes hold T∈[0,1].
(The two scheme knobs must be paired: SLCN = BDF1 time-difference + Crank-Nicolson
flux; SL-BDF2 = BDF2 time-difference + flux implicit at n+1 — a BDF2 stencil with a
CN flux is not a consistent 2nd-order scheme. See
docs/advanced/semi-lagrangian-time-integration.md
and Bonaventura et al., 2021.) Unit regression: tests/test_0855_oldframe_sl_traceback.py
(no-op on static mesh bit-identical; geometry restored + stash consumed; samples
on old geometry not new; bounded on a moving mesh; BDF2) — passes serial and
np=2; the existing ddt suite is unchanged (opt-in, default off).
Cost of Stage 0 (a)¶
Measured on the Ra=1e5 res-20 free-surface loop (oldframe_overhead_bench.py):
adv_diff.solve: +166 ms/step (+8 %) over baseline (2033 → 2199 ms/step), from +2nuke_coords_and_rebuildper solve (deform-to-old + restore) for an order-1DuDt. Order N wraps the sample per history level →2Nrebuilds / step (a hoist to one ephemeral block per step is an obvious Stage-0 optimisation, deferred).Hidden churn not in that 8 %:
_deform_meshunconditionally setsis_setup = Falseon every registered solver, so each ephemeral round-trip also forces the Stokes DM/assembly to rebuild on its next solve — even though the working mesh returns to bit-identical coordinates. This is the real Stage-0 tax at scale (Stokes assembly ≫ anuke), and it is entirely avoidable: a guard that skips theis_setupinvalidation whenephemeral_coordsrestores identical coordinates would remove it without the clone. Recommended as the first Stage-0 follow-up if the tax bites.
What Stage 0 leaves on the table (the Stage-1 (c) clone)¶
The rolling clone (DMClone, shared partition, independent coords, refreshed by local memcpy; sampled via the mixed-mesh evaluator, Phase 1) would:
Eliminate both costs above — the working mesh never moves, so no per-step
nukeand no solver-rebuild churn; the clone refresh is a local coord memcpy.Enable old
Von the clone → second-order temporal reach-back (the current foot uses only the current velocity).Recompute fluxes / gradients at full FE accuracy on the clone, retiring the lossy projected histories (
DFDt,psi_star_flat, VE stress projection).
Its build cost is the large piece deliberately excluded from Stage 0: the
mixed-mesh global_evaluate (per-mesh point location + pointwise combine,
shared-partition fast path) and verifying derivative/vector global_evaluate
on a non-current DM (the scalar_component wrinkle, invariant 5 / Phase-1 task
1).
Recommendation¶
Ship Stage 0 (a) now; defer the clone (Stage 1 (c)). Stage 0 delivers the correctness cure with ~8 % on the advection solve and zero new infrastructure, and (Task 2 below) is mesh-agnostic — it covers interior-node adaptation on the same code path. The clone’s payoff is efficiency and second-order/flux capability, not correctness, and it costs the whole mixed-mesh evaluator. Build it when one of these triggers fires, not before:
The Stokes-rebuild churn is shown to dominate a production run and the cheap
is_setupguard above does not remove it; ora feature needs old
V(second-order temporal reach-back) or full-accuracy flux/stress recompute on a moving mesh (VE on a free surface / adapting mesh); orthe per-step
nukecost becomes material at higher order or resolution.
Until then the v_mesh/CARRY/i=0-gate apparatus stays as the default path; old-
frame is the opt-in cure. Retiring that apparatus is gated on unifying the
adaptation path onto old-frame (Task 2) and is tracked separately.
Task 2 — old-frame through interior-node adaptation¶
(see ~/+Simulations/fs_convection_goal4/oldframe_adapt_comparison.py;
no-slip Ra=1e5 res-16 annulus convection, periodic interior adaptation, old-frame
vs the current v_mesh path; gentle refinement=1.3 — aggressive refinement
destabilises the base convection regardless of transfer mode, so it is not a
fair test).
Old-frame is mesh-agnostic: on_remesh stashes the old geometry for any node
move (free surface or mmpde/OT interior adaptation), so the same code path
applies. The comparison (30 steps, 7 adaptation events each):
mover |
v_mesh worst T |
old-frame worst T |
final Nu (vm / of) |
worst q (vm / of) |
|---|---|---|---|---|
|
[-6.97, +4.27] |
[-0.51, +1.07] |
16.4 / 16.0 |
0.778 / 0.783 |
|
[-0.014, 1.000] |
[0.000, 1.000] |
21.8 / 21.7 |
0.762 / 0.762 |
|
[0, 1] (n_adapts=0) |
[0, 1] (n_adapts=0) |
32.3 / 32.1 |
0.826 / 0.826 |
Conclusion: old-frame MATCHES or BEATS the current v_mesh path with no
regression. For follow_metric’s anisotropic interior motion it is
substantially better bounded — v_mesh overshoots to [-7, +4] (the same
new-mesh-frame fold that fails at the free surface), while old-frame holds
≈[0, 1] — at equal Nu and equal-or-better mesh quality. For OT_adapt (whose
internal reset + FE-remap already keeps the history clean) both are bounded, old-
frame marginally better. smooth_mesh_interior on a uniform mesh is a no-op
(equant cells → no move); the near-identical Nu confirms old-frame does not
perturb a no-adaptation run.
So old-frame is the single SL reference for both free surface and adaptation
(invariant 6), and is the path to eventually retiring the v_mesh/CARRY/i=0-
gate apparatus. That retirement is held back only by sequencing/benchmarking
discipline (it is the default path today and old-frame is opt-in), not by any
correctness gap — and is tracked as the unification follow-up.