Test Reliability Classification System¶

Last Updated: 2025-11-15
Status: Active - Use for all new tests and test reviews

Overview¶

Underworld3 uses a three-tier reliability classification system (A/B/C) to ensure tests are trustworthy and appropriate for their intended use. This system prevents test-driven development from being derailed by unreliable tests and provides clear guidelines for test maturation.

Reliability Tiers¶

Tier A: Production-Ready (Trusted)¶

Use for: Test-Driven Development (TDD), Continuous Integration (CI), Release Validation

Characteristics:

✅ Long-lived tests with proven track record (>3 months in codebase)
✅ Consistently passing across multiple environments
✅ Clear, well-documented test intent
✅ Exercises stable, well-understood functionality
✅ Failure indicates DEFINITE regression in production code
✅ No known flakiness or environmental sensitivity
✅ Reviewed and approved by core maintainers

Examples:

Core Stokes solver tests (test_101*_Stokes*.py)
Basic mesh creation and data access (test_0100-0199_*.py)
Fundamental units system tests (test_0700_units_system.py)

Pytest Marker: @pytest.mark.tier_a

When to Use:

Running full CI pipeline before merging
Test-driven development sprints
Release validation
Bisecting regressions (these tests can be trusted to find the problem)

Tier B: Validated (Use with Caution)¶

Use for: Feature Validation, Exploratory Testing, Manual Review

Characteristics:

⚠️ Successfully run at least once, but not yet battle-tested
⚠️ Test appears correct but functionality may still be evolving
⚠️ Limited production usage or edge case coverage
⚠️ May have environmental dependencies not fully documented
⚠️ Failure could indicate test OR code issue - requires investigation
⚠️ Not yet reviewed for promotion to Tier A

Examples:

Recently added units integration tests (test_08*_*.py - many currently failing)
New reduction operation tests (test_0850_*.py)
Feature tests for newly implemented capabilities

Pytest Marker: @pytest.mark.tier_b

When to Use:

Manual feature validation after implementation
Exploratory testing of new capabilities
Code review process (validate test works as intended)
NOT for automated TDD sprints (unless someone is explicitly monitoring the results)

Promotion Path: B → A

Test passes consistently for 3+ months
Functionality confirmed stable in production
Core maintainer review confirms test quality
Add to Tier A suite via PR review
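
The promotion criteria above can be read as a checklist. The following is a minimal sketch of that checklist, not part of Underworld3 or pytest: the `PromotionCandidate` class, its fields, and the helper functions are hypothetical names introduced for illustration, and the 3-month threshold is the one stated in the criteria.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class PromotionCandidate:
    """Hypothetical record of a Tier B test being considered for Tier A."""

    name: str
    first_consistent_pass: date  # date the test started passing consistently
    functionality_stable: bool   # functionality confirmed stable in production
    maintainer_approved: bool    # core maintainer has reviewed the test quality


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end (never negative)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return max(months, 0)


def unmet_criteria(candidate: PromotionCandidate, today: date) -> list[str]:
    """Return the B -> A criteria the candidate does not yet satisfy."""
    reasons = []
    if months_between(candidate.first_consistent_pass, today) < 3:
        reasons.append("has not passed consistently for 3+ months")
    if not candidate.functionality_stable:
        reasons.append("functionality not confirmed stable in production")
    if not candidate.maintainer_approved:
        reasons.append("core maintainer review still pending")
    return reasons
```

An empty list means the test is ready to be proposed for Tier A in a PR; the PR review itself remains a human step.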

Tier C: Experimental (Development)¶

Use for: Feature Development, Debugging, Test Development

Characteristics:

🚧 Test OR code (or both!) may be incorrect
🚧 Actively under development
🚧 Used to explore expected behavior
🚧 May test unimplemented or partially implemented features
🚧 Failures are EXPECTED and informative
🚧 Not suitable for any automated testing

Examples:

Tests written for not-yet-implemented features
Exploratory tests to understand API design
Tests for actively debugged features
Tests with known issues (mark with @pytest.mark.xfail + reason)

Pytest Markers:

@pytest.mark.tier_c
@pytest.mark.xfail(reason="Feature not yet implemented")
@pytest.mark.skip(reason="Waiting for X to be fixed")

When to Use:

Feature development (write test first, then implement)
Debugging complex issues (write test to reproduce bug)
API design exploration (what SHOULD the behavior be?)
NEVER for automated CI/TDD
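
One way to keep Tier C tests out of automated runs is to deselect them at collection time. The sketch below is a possible `conftest.py` fragment, not the project's actual configuration. It assumes the CI system sets the environment variable `CI=true` (many hosted services do, but this should be checked). `pytest_collection_modifyitems`, `get_closest_marker`, and `pytest_deselected` are standard pytest hooks and methods.

```python
import os


def pytest_collection_modifyitems(config, items):
    """Deselect Tier C tests when running under CI.

    Assumption: the CI environment exports CI=true.
    """
    if os.environ.get("CI", "").lower() != "true":
        return  # local run: leave the collected tests untouched

    selected, deselected = [], []
    for item in items:
        if item.get_closest_marker("tier_c") is not None:
            deselected.append(item)
        else:
            selected.append(item)

    if deselected:
        # Report the removed tests so they appear as "deselected" in the summary.
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

Running `pytest -m "not tier_c"` in the CI job achieves the same selection without a hook; the hook only guards against someone forgetting the flag.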

Promotion Path: C → B

Feature fully implemented
Test passes consistently
Developer confirms test is correct
Remove xfail/skip markers
Promote to Tier B for further validation

Implementation in Pytest¶

pytest.ini Configuration¶

[pytest]
markers =
    # Reliability tiers (how much to trust the test)
    tier_a: Production-ready tests (trusted, use for TDD and CI)
    tier_b: Validated tests (use with caution, manual review recommended)
    tier_c: Experimental tests (development only, not for automation)

    # Complexity levels (what kind of test, independent of number prefix)
    level_1: Quick core tests - imports, basic setup, no solving (~seconds)
    level_2: Intermediate tests - integration, units, regression (~minutes)
    level_3: Physics tests - solvers, time-stepping, coupled systems (~minutes to hours)

    # Other markers
    mpi: marks tests as requiring MPI
    slow: marks tests as slow (>10s)
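
Registering the markers does not by itself ensure that every test carries exactly one tier. The following is a minimal sketch of a collection-time check that could live in `conftest.py`. The helper name `tier_problem` and the choice to emit warnings rather than fail the run are assumptions made for illustration. If a `conftest.py` already defines `pytest_collection_modifyitems`, the loop would need to be merged into that existing hook.

```python
import warnings

TIER_MARKERS = {"tier_a", "tier_b", "tier_c"}


def tier_problem(marker_names):
    """Describe the tier problem for a test, or return None if exactly one tier is set."""
    tiers = sorted(TIER_MARKERS.intersection(marker_names))
    if len(tiers) == 1:
        return None
    if not tiers:
        return "no tier marker"
    return "multiple tier markers: " + ", ".join(tiers)


def pytest_collection_modifyitems(config, items):
    """Warn about tests whose tier classification is missing or ambiguous."""
    for item in items:
        names = {mark.name for mark in item.iter_markers()}
        problem = tier_problem(names)
        if problem is not None:
            warnings.warn(f"{item.nodeid}: {problem}")
```

Separately, pytest's `--strict-markers` option turns misspelled marker names (for example `tier_A`) into errors instead of warnings.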

Level vs Number Prefix¶

IMPORTANT: The number prefix (0000-9999) is for organization/ordering only. The actual complexity level is marked explicitly with @pytest.mark.level_N.

Why this matters:

A file test_1010_stokes_basic.py can contain both Level 1 (setup) and Level 3 (benchmark) tests
Allows thematic organization: All Stokes tests in 1010-1099 regardless of complexity
Can run “all quick tests” across all topics: pytest -m level_1

Example:

# File: test_1010_stokes_basic.py (number 1010 = Stokes topic)
import pytest
import underworld3 as uw

@pytest.mark.level_1  # Quick - just setup
@pytest.mark.tier_a   # Production-ready
def test_stokes_create_mesh_and_variables():
    """Test creating Stokes mesh and variables (no solving)."""
    mesh = uw.meshing.StructuredQuadBox(elementRes=(4, 4))
    v = uw.discretisation.MeshVariable("v", mesh, 2, degree=2)
    p = uw.discretisation.MeshVariable("p", mesh, 1, degree=1)
    # Just creation - very fast!

@pytest.mark.level_3  # Physics - full solve + benchmark
@pytest.mark.tier_a   # Production-ready
def test_stokes_sinking_block_benchmark():
    """Test Stokes solver against analytical solution."""
    # Complex benchmark with large mesh, comparison to theory
    # Could take minutes!

Both tests live in the same file (organized by topic), but have different levels (organized by complexity).

Marking Tests¶

import pytest
import underworld3 as uw

# Level 1 + Tier A: Quick, production-ready
@pytest.mark.level_1
@pytest.mark.tier_a
def test_basic_mesh_creation():
    """Test mesh creation with default parameters."""
    mesh = uw.meshing.StructuredQuadBox(elementRes=(8, 8))
    assert mesh.dim == 2
    assert mesh.elementCount > 0

# Level 2 + Tier B: Intermediate, validated but new
@pytest.mark.level_2
@pytest.mark.tier_b
def test_units_integration_with_stokes():
    """Test Stokes solver with unit-aware variables.

    Status: Passing locally, needs more production validation.
    """
    # Test implementation...

# Level 3 + Tier A: Complex physics, production-ready
@pytest.mark.level_3
@pytest.mark.tier_a
def test_stokes_benchmark_sinking_block():
    """Test Stokes solver against analytical solution.

    Validated against Gerya (2019) textbook solution.
    """
    # Benchmark implementation...

# Level 2 + Tier C: Intermediate complexity, experimental feature
@pytest.mark.level_2
@pytest.mark.tier_c
@pytest.mark.xfail(reason="Advanced units propagation not yet implemented")
def test_symbolic_units_propagation():
    """Test automatic unit propagation through symbolic operations.

    This test documents expected behavior for future implementation.
    """
    # Test for future feature...

Running Tests by Tier¶

# Run only Tier A tests (safe for TDD)
pytest -m tier_a

# Run Tier A and B tests (full validation)
pytest -m "tier_a or tier_b"

# Run all tests including experimental (for development)
pytest

# Exclude experimental tests
pytest -m "not tier_c"
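
Tier and level markers can be combined in one `-m` expression, since pytest accepts `and`, `or`, `not`, and parentheses there. As a rough sketch, the hypothetical helper below (not an Underworld3 or pytest function) builds such an expression from a list of tiers and a list of levels.

```python
def marker_expression(tiers=(), levels=()):
    """Build a pytest -m expression selecting the given tiers and levels.

    Names within each group are OR-ed; the two groups are AND-ed.
    An empty result means "no marker filter".
    """
    groups = []
    for names in (tiers, levels):
        if names:
            groups.append("(" + " or ".join(names) + ")")
    return " and ".join(groups)


if __name__ == "__main__":
    # Quick, trusted tests only; pass the printed string to `pytest -m`.
    print(marker_expression(tiers=["tier_a"], levels=["level_1"]))
```

For one-off runs the expression can be typed directly, for example `pytest -m "tier_a and level_1"`.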

Current Test Classification Status¶

2025-11-15 Audit Results¶

Units Test Suite (test_07_units.py, test_08*_*.py):

Total: 259 tests
Passing: 180 (69.5%)
Failing: 79 (30.5%)
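
For reference, the percentages above follow directly from the counts. A small sketch of the arithmetic, using only the numbers reported in this audit (the function name `rates` is illustrative):

```python
def rates(total, passing):
    """Return (pass %, fail %) rounded to one decimal place."""
    failing = total - passing
    return round(100 * passing / total, 1), round(100 * failing / total, 1)


pass_rate, fail_rate = rates(total=259, passing=180)
print(f"Passing: {pass_rate}%  Failing: {fail_rate}%")
```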

Immediate Actions Required:

✅ DONE: Fixed Stokes JIT unwrapping bug (test_0818_stokes_nd.py now passing)
🔄 IN PROGRESS: Classify remaining 79 failures as B or C
📋 TODO: Eliminate tests for unimplemented features (move to C or remove)
📋 TODO: Fix legitimate test failures or mark as xfail with clear reasons

Guidelines for Test Development¶

When Writing a New Test¶

Start at Tier C: All new tests begin as experimental
Document Intent: Clear docstring explaining what behavior is tested
Use xfail Appropriately: If testing unimplemented feature, mark with xfail
Don’t Break CI: Tier C tests with xfail won’t break automated testing
Promote Deliberately: Don’t rush promotion - let tests prove reliability

When a Test Fails¶

If Tier A Test Fails:

🚨 HIGH PRIORITY: Definite regression in production code
Investigate immediately
Bisect to find breaking commit
Fix production code or demote test if it was incorrectly promoted

If Tier B Test Fails:

⚠️ MEDIUM PRIORITY: Could be test OR code issue
Investigate to determine root cause
If code issue: Fix code
If test issue: Fix test or demote to Tier C
Document findings in test or issue tracker

If Tier C Test Fails:

ℹ️ EXPECTED: Tier C tests may fail
No immediate action required
Useful for tracking development progress
Update xfail reason if expectations change
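
The triage rules above amount to a lookup from tier to priority and first action. The sketch below restates them as data. The `triage` function and the "unclassified" fallback are assumptions added for illustration and are not defined by this policy.

```python
# Priority and first action per tier, paraphrasing the guidance above.
TRIAGE = {
    "tier_a": ("high", "Investigate immediately; bisect to find the breaking commit."),
    "tier_b": ("medium", "Determine whether the test or the code is at fault; document findings."),
    "tier_c": ("expected", "No immediate action; update the xfail reason if expectations change."),
}


def triage(marker_names):
    """Return (priority, action) for a failing test given its marker names."""
    for tier in ("tier_a", "tier_b", "tier_c"):
        if tier in marker_names:
            return TRIAGE[tier]
    # Assumption: a test without a tier should be classified before triage.
    return ("unclassified", "Assign a tier before acting on the failure.")
```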

Integration with Slash Commands¶

The following slash commands help manage test reliability:

/test-solvers: Run Tier A solver tests (trusted for validation)
/test-units: Run Tier A+B units tests (quick validation)
/test-regression: Run full Tier A suite (check for regressions)
/validate-docs: Check documentation test coverage

Migration Plan for Existing Tests¶

Phase 1: Initial Classification (2025-11-15 to 2025-12-01)¶

Classify all existing tests:
- Simple tests (0000-0199): Review for Tier A
- Intermediate tests (0500-0699): Review for Tier A/B
- Regression tests (0600-0699): Review for Tier B (→A after validation)
- Units tests (0700-0799): Classify failures as B or C
- Complex tests (1000+): Review for Tier A/B
Add markers to all test files
Update pytest.ini with tier markers
Document known issues with xfail

Phase 2: Stabilization (2025-12-01 to 2026-01-01)¶

Fix or document all Tier B test failures
Remove failing Tier C tests or mark them xfail with a clear reason
Promote stable Tier B tests to Tier A
Establish CI pipeline using Tier A tests

Phase 3: Maintenance (Ongoing)¶

Regular review of Tier B tests for promotion
Continuous monitoring of Tier A test stability
Documentation updates for test coverage

Rationale¶

Why This System?

Prevents TDD Confusion: Developers know which tests to trust
Documents Test Maturity: Clear progression from experimental to production
Reduces False Alarms: Tier C failures expected, Tier A failures are urgent
Guides Review Process: Clear criteria for test promotion
Supports Development: Can write tests for future features without breaking CI

Comparison to Other Approaches:

Better than xfail alone: Three tiers provide more nuanced classification
Better than skip: Tests still run, failures are informative
Better than no classification: Prevents “all tests are equal” assumption

References¶

Pytest Markers: https://docs.pytest.org/en/stable/how-to/mark.html
Test Coverage Analysis: docs/reviews/2025-10/TEST-COVERAGE-ANALYSIS.md
Project Test Organization: CLAUDE.md (Test Suite Organization section)
