Test Reliability Classification System
Last Updated: 2025-11-15
Status: Active - Use for all new tests and test reviews
Overview
Underworld3 uses a three-tier reliability classification system (A/B/C) to ensure tests are trustworthy and appropriate for their intended use. This system prevents test-driven development from being derailed by unreliable tests and provides clear guidelines for test maturation.
Reliability Tiers
Tier A: Production-Ready (Trusted)
Use for: Test-Driven Development (TDD), Continuous Integration (CI), Release Validation
Characteristics:
✅ Long-lived tests with proven track record (>3 months in codebase)
✅ Consistently passing across multiple environments
✅ Clear, well-documented test intent
✅ Covers stable, well-understood functionality
✅ Failure indicates DEFINITE regression in production code
✅ No known flakiness or environmental sensitivity
✅ Reviewed and approved by core maintainers
Examples:
Core Stokes solver tests (test_101*_Stokes*.py)
Basic mesh creation and data access (test_0100-0199_*.py)
Fundamental units system tests (test_0700_units_system.py)
Pytest Marker: @pytest.mark.tier_a
When to Use:
Running full CI pipeline before merging
Test-driven development sprints
Release validation
Bisecting regressions (these tests can be trusted to find the problem)
Tier B: Validated (Use with Caution)
Use for: Feature Validation, Exploratory Testing, Manual Review
Characteristics:
⚠️ Successfully run at least once, but not yet battle-tested
⚠️ Test appears correct but functionality may still be evolving
⚠️ Limited production usage or edge case coverage
⚠️ May have environmental dependencies not fully documented
⚠️ Failure could indicate test OR code issue - requires investigation
⚠️ Not yet reviewed for promotion to Tier A
Examples:
Recently added units integration tests (test_08*_*.py - many currently failing)
New reduction operation tests (test_0850_*.py)
Feature tests for newly implemented capabilities
Pytest Marker: @pytest.mark.tier_b
When to Use:
Manual feature validation after implementation
Exploratory testing of new capabilities
Code review process (validate test works as intended)
NOT for automated TDD sprints (unless the results are explicitly monitored)
Promotion Path: B → A
Test passes consistently for 3+ months
Functionality confirmed stable in production
Core maintainer review confirms test quality
Add to Tier A suite via PR review
Tier C: Experimental (Development)
Use for: Feature Development, Debugging, Test Development
Characteristics:
🚧 Test OR code (or both!) may be incorrect
🚧 Actively under development
🚧 Used to explore expected behavior
🚧 May test unimplemented or partially implemented features
🚧 Failures are EXPECTED and informative
🚧 Not suitable for any automated testing
Examples:
Tests written for not-yet-implemented features
Exploratory tests to understand API design
Tests for actively debugged features
Tests with known issues (mark with @pytest.mark.xfail plus a reason)
Pytest Markers:
@pytest.mark.tier_c
@pytest.mark.xfail(reason="Feature not yet implemented")
@pytest.mark.skip(reason="Waiting for X to be fixed")
When to Use:
Feature development (write test first, then implement)
Debugging complex issues (write test to reproduce bug)
API design exploration (what SHOULD the behavior be?)
NEVER for automated CI/TDD
Promotion Path: C → B
Feature fully implemented
Test passes consistently
Developer confirms test is correct
Remove xfail/skip markers
Promote to Tier B for further validation
Implementation in Pytest
pytest.ini Configuration
[pytest]
markers =
    # Reliability tiers (how much to trust the test)
    tier_a: Production-ready tests (trusted, use for TDD and CI)
    tier_b: Validated tests (use with caution, manual review recommended)
    tier_c: Experimental tests (development only, not for automation)
    # Complexity levels (what kind of test, independent of number prefix)
    level_1: Quick core tests - imports, basic setup, no solving (~seconds)
    level_2: Intermediate tests - integration, units, regression (~minutes)
    level_3: Physics tests - solvers, time-stepping, coupled systems (~minutes to hours)
    # Other markers
    mpi: marks tests as requiring MPI
    slow: marks tests as slow (>10s)
Level vs Number Prefix
IMPORTANT: The number prefix (0000-9999) is for organization/ordering only. The actual complexity level is marked explicitly with @pytest.mark.level_N.
Why this matters:
A file test_1010_stokes_basic.py can contain both Level 1 (setup) and Level 3 (benchmark) tests
Allows thematic organization: All Stokes tests in 1010-1099 regardless of complexity
Can run “all quick tests” across all topics:
pytest -m level_1
Example:
# File: test_1010_stokes_basic.py (number 1010 = Stokes topic)
import pytest
import underworld3 as uw


@pytest.mark.level_1  # Quick - just setup
@pytest.mark.tier_a   # Production-ready
def test_stokes_create_mesh_and_variables():
    """Test creating Stokes mesh and variables (no solving)."""
    mesh = uw.meshing.StructuredQuadBox(elementRes=(4, 4))
    v = uw.discretisation.MeshVariable("v", mesh, 2, degree=2)
    p = uw.discretisation.MeshVariable("p", mesh, 1, degree=1)
    # Just creation - very fast!


@pytest.mark.level_3  # Physics - full solve + benchmark
@pytest.mark.tier_a   # Production-ready
def test_stokes_sinking_block_benchmark():
    """Test Stokes solver against analytical solution."""
    # Complex benchmark with large mesh, comparison to theory
    # Could take minutes!
Both tests live in the same file (organized by topic), but have different levels (organized by complexity).
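Because tier and level are independent markers, they can be combined in a single `-m` expression. The helper below is a minimal sketch of how such an expression could be assembled; `marker_expression` is a hypothetical name and not part of Underworld3 or pytest. It relies only on the `and`/`or` operators and parentheses that pytest accepts in marker expressions.

```python
def marker_expression(tiers=(), levels=()):
    """Build a pytest -m expression for the given tiers and levels.

    Hypothetical helper: names inside a group are OR-ed together,
    and the tier group is AND-ed with the level group.
    """
    groups = []
    if tiers:
        groups.append(" or ".join(f"tier_{t}" for t in tiers))
    if levels:
        groups.append(" or ".join(f"level_{n}" for n in levels))
    # Parenthesise each group so "or" does not leak across the "and".
    return " and ".join(f"({g})" for g in groups)


if __name__ == "__main__":
    # Quick tests from the trusted and validated tiers, across all topics.
    print(marker_expression(tiers=["a", "b"], levels=[1]))
```

The printed string can be passed directly to pytest, for example `pytest -m "(tier_a or tier_b) and (level_1)"`.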
Marking Tests
import pytest
import underworld3 as uw


# Level 1 + Tier A: Quick, production-ready
@pytest.mark.level_1
@pytest.mark.tier_a
def test_basic_mesh_creation():
    """Test mesh creation with default parameters."""
    mesh = uw.meshing.StructuredQuadBox(elementRes=(8, 8))
    assert mesh.dim == 2
    assert mesh.elementCount > 0


# Level 2 + Tier B: Intermediate, validated but new
@pytest.mark.level_2
@pytest.mark.tier_b
def test_units_integration_with_stokes():
    """Test Stokes solver with unit-aware variables.

    Status: Passing locally, needs more production validation.
    """
    # Test implementation...


# Level 3 + Tier A: Complex physics, production-ready
@pytest.mark.level_3
@pytest.mark.tier_a
def test_stokes_benchmark_sinking_block():
    """Test Stokes solver against analytical solution.

    Validated against Gerya (2019) textbook solution.
    """
    # Benchmark implementation...


# Level 2 + Tier C: Intermediate complexity, experimental feature
@pytest.mark.level_2
@pytest.mark.tier_c
@pytest.mark.xfail(reason="Advanced units propagation not yet implemented")
def test_symbolic_units_propagation():
    """Test automatic unit propagation through symbolic operations.

    This test documents expected behavior for future implementation.
    """
    # Test for future feature...
Running Tests by Tier
# Run only Tier A tests (safe for TDD)
pytest -m tier_a
# Run Tier A and B tests (full validation)
pytest -m "tier_a or tier_b"
# Run all tests including experimental (for development)
pytest
# Exclude experimental tests
pytest -m "not tier_c"
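The exclusion of Tier C tests can also be enforced from a `conftest.py`, so that an automated run cannot pick them up by accident. The following is a minimal sketch, not the project's actual configuration: the `SKIP_TIER_C` environment variable and the `split_by_tier_c` helper are hypothetical names chosen for illustration. It uses pytest's `pytest_collection_modifyitems` hook and the `pytest_deselected` hook so that dropped tests are reported as deselected.

```python
# conftest.py (sketch; SKIP_TIER_C is a hypothetical variable name)
import os


def split_by_tier_c(items):
    """Return (kept, dropped), where dropped items carry the tier_c marker."""
    kept, dropped = [], []
    for item in items:
        if item.get_closest_marker("tier_c") is not None:
            dropped.append(item)
        else:
            kept.append(item)
    return kept, dropped


def pytest_collection_modifyitems(config, items):
    """Deselect Tier C tests when SKIP_TIER_C=1 is set in the environment."""
    if os.environ.get("SKIP_TIER_C") != "1":
        return
    kept, dropped = split_by_tier_c(items)
    if dropped:
        # Report the removed tests as "deselected" in the pytest summary.
        config.hook.pytest_deselected(items=dropped)
        # Modify the list in place so pytest sees the change.
        items[:] = kept
```

With this in place, a CI job could run `SKIP_TIER_C=1 pytest` and get the same selection as `pytest -m "not tier_c"`, while local development runs stay unchanged.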
Current Test Classification Status
2025-11-15 Audit Results
Units Test Suite (test_07_units.py, test_08*_*.py):
Total: 259 tests
Passing: 180 (69.5%)
Failing: 79 (30.5%)
Immediate Actions Required:
✅ DONE: Fixed Stokes JIT unwrapping bug (test_0818_stokes_nd.py now passing)
🔄 IN PROGRESS: Classify remaining 79 failures as B or C
📋 TODO: Eliminate tests for unimplemented features (move to C or remove)
📋 TODO: Fix legitimate test failures or mark as xfail with clear reasons
Test Review Process
For New Tests (PR Review)
Checklist:
Test has clear docstring explaining intent
Test has appropriate tier marker (start at C, promote through review)
If Tier C/xfail: Reason clearly documented
Test follows project conventions (naming, structure)
Test is not redundant with existing tests
If testing edge case: Edge case clearly documented
For Promoting Tests (C → B → A)
C → B Promotion:
Feature fully implemented
Test passes consistently (developer verified)
Test correctly validates intended behavior
Remove xfail/skip markers
Update tier marker to @pytest.mark.tier_b
B → A Promotion (Requires Core Maintainer Review):
Test has passed for 3+ months without modification
Functionality confirmed stable in production use
Test quality reviewed (clear, maintainable, appropriate assertions)
No known environmental flakiness
Update tier marker to @pytest.mark.tier_a
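The B → A criteria can be expressed as a simple eligibility check, which may help when reviewing many Tier B tests at once. This is a sketch under stated assumptions: the `TierBRecord` fields are hypothetical bookkeeping (the project does not necessarily track them in this form), and "3+ months" is approximated as 90 days. Maintainer review remains a manual step; the flag only records its outcome.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: "3+ months" is approximated as 90 days.
MIN_STABLE_PERIOD = timedelta(days=90)


@dataclass
class TierBRecord:
    """Hypothetical bookkeeping for one Tier B test."""
    last_modified: date            # last change to the test itself
    last_failure: Optional[date]   # most recent failure, if any
    maintainer_approved: bool      # core maintainer reviewed test quality
    known_flaky: bool              # any known environmental flakiness


def ready_for_tier_a(record, today):
    """Return True if the record meets the B -> A promotion criteria."""
    # The stable period starts at the later of the last edit and last failure.
    stable_since = record.last_modified
    if record.last_failure is not None:
        stable_since = max(stable_since, record.last_failure)
    return (
        today - stable_since >= MIN_STABLE_PERIOD
        and record.maintainer_approved
        and not record.known_flaky
    )
```

For example, a test last modified on 2025-06-01 with no recorded failures, approved and not flaky, would qualify on 2025-11-15; the same test with a failure on 2025-10-01 would not.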
Guidelines for Test Development
When Writing a New Test
Start at Tier C: All new tests begin as experimental
Document Intent: Clear docstring explaining what behavior is tested
Use xfail Appropriately: If testing unimplemented feature, mark with xfail
Don’t Break CI: Tier C tests with xfail won’t break automated testing
Promote Deliberately: Don’t rush promotion - let tests prove reliability
When a Test Fails
If Tier A Test Fails:
🚨 HIGH PRIORITY: Definite regression in production code
Investigate immediately
Bisect to find breaking commit
Fix production code or demote test if it was incorrectly promoted
If Tier B Test Fails:
⚠️ MEDIUM PRIORITY: Could be test OR code issue
Investigate to determine root cause
If code issue: Fix code
If test issue: Fix test or demote to Tier C
Document findings in test or issue tracker
If Tier C Test Fails:
ℹ️ EXPECTED: Tier C tests may fail
No immediate action required
Useful for tracking development progress
Update xfail reason if expectations change
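The triage rules above amount to a lookup from tier marker to priority and first action. A minimal sketch of that lookup follows; the `Priority` enum and `triage` function are hypothetical names, and the choice to raise an error for tests with no tier marker is an assumption, not project policy.

```python
from enum import Enum


class Priority(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    EXPECTED = "expected"


# Priority and first action per tier, taken from the guidelines above.
TRIAGE = {
    "tier_a": (Priority.HIGH, "Investigate immediately; bisect to find the breaking commit."),
    "tier_b": (Priority.MEDIUM, "Determine whether the test or the code is at fault."),
    "tier_c": (Priority.EXPECTED, "No immediate action; update the xfail reason if expectations change."),
}


def triage(marker_names):
    """Return (priority, action) for a failed test given its marker names."""
    for tier in ("tier_a", "tier_b", "tier_c"):
        if tier in marker_names:
            return TRIAGE[tier]
    # Assumption: an unclassified test is a review problem, not a tier.
    raise ValueError("failed test has no tier marker")
```

For instance, `triage({"level_3", "tier_a"})` returns the high-priority entry, regardless of the test's complexity level.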
Integration with Slash Commands
The following slash commands help manage test reliability:
/test-solvers: Run Tier A solver tests (trusted for validation)
/test-units: Run Tier A+B units tests (quick validation)
/test-regression: Run full Tier A suite (check for regressions)
/validate-docs: Check documentation test coverage
Migration Plan for Existing Tests
Phase 1: Initial Classification (2025-11-15 to 2025-12-01)
Classify all existing tests:
Simple tests (0000-0199): Review for Tier A
Intermediate tests (0500-0699): Review for Tier A/B
Regression tests (0600-0699): Review for Tier B (→A after validation)
Units tests (0700-0799): Classify failures as B or C
Complex tests (1000+): Review for Tier A/B
Add markers to all test files
Update pytest.ini with tier markers
Document known issues with xfail
Phase 2: Stabilization (2025-12-01 to 2026-01-01)
Fix or document all Tier B test failures
Remove failing Tier C tests or mark them xfail with a clear reason
Promote stable Tier B tests to Tier A
Establish CI pipeline using Tier A tests
Phase 3: Maintenance (Ongoing)
Regular review of Tier B tests for promotion
Continuous monitoring of Tier A test stability
Documentation updates for test coverage
Rationale
Why This System?
Prevents TDD Confusion: Developers know which tests to trust
Documents Test Maturity: Clear progression from experimental to production
Reduces False Alarms: Tier C failures expected, Tier A failures are urgent
Guides Review Process: Clear criteria for test promotion
Supports Development: Can write tests for future features without breaking CI
Comparison to Other Approaches:
Better than xfail alone: Three tiers provide more nuanced classification
Better than skip: Tests still run, failures are informative
Better than no classification: Prevents “all tests are equal” assumption
References
Pytest Markers: https://docs.pytest.org/en/stable/how-to/mark.html
Test Coverage Analysis: docs/reviews/2025-10/TEST-COVERAGE-ANALYSIS.md
Project Test Organization: CLAUDE.md (Test Suite Organization section)