Cross-Form Data Validation Rules: Deterministic Multi-Form Edit Checks for Clinical EDC Pipelines

Cross-form validation is the engineering problem of reconciling clinical relationships that no single form can prove on its own: a randomization date that must follow a consent date captured on a different visit, a dose recorded in the exposure form that must agree with the hepatic safety markers in the lab extract, a serious adverse event whose onset must fall inside a treatment window assembled from three separate eCRF modules. Intra-form checks cannot express these constraints; cross-form rules can, because they reconcile temporal, clinical, and operational relationships across disparate eCRF modules, laboratory feeds, and ePRO streams. This page belongs to the broader Clinical Query Generation & Discrepancy Management section, and the deterministic rule outputs defined here become the structured discrepancies consumed downstream by Automated Clinical Query Generation. The regulatory stakes are concrete: every rule firing must be reproducible from a frozen data state and a versioned rule catalog, because an inspector will ask you to reconstruct exactly why a value was queried and exactly which logic version produced that query.

Validation Sequence at a Glance

Cross-form rules run only after intra-form checks, against a materialized subject-visit graph; failures become structured discrepancies while passes are written to an immutable ledger.

Concept and Prerequisites

A cross-form rule is a relational predicate evaluated against a normalized, subject-level data graph rather than a single record. The three operational patterns that cover the majority of production rules are temporal sequencing (e.g., consent_date <= randomization_date), conditional clinical reconciliation (e.g., investigational-product dosing linked to ALT/AST thresholds drawn from the lab domain), and visit-window alignment across screening, treatment, and follow-up modules. Each rule must be version-controlled, parameterized, and explicitly mapped to a protocol amendment or statistical analysis plan, so that a change in clinical tolerance is a tracked change to a manifest rather than an edit buried in code.
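Of the three patterns, visit-window alignment is the one not illustrated elsewhere on this page. A minimal sketch, assuming a window defined as a target study day plus or minus a tolerance measured from the randomization date. The function name, the day-offset convention (randomization counted as day 0), and the Week 4 numbers are illustrative assumptions, not protocol values:

```python
from datetime import date
from typing import Optional


def visit_window_outcome(
    randomization_date: Optional[date],
    visit_date: Optional[date],
    target_day: int,
    tolerance_days: int,
) -> str:
    """Classify one visit as 'pass', 'fail', or 'cannot_evaluate'.

    Assumption: study days are counted from randomization (day 0).
    """
    # A missing operand is a third outcome, never a silent pass.
    if randomization_date is None or visit_date is None:
        return "cannot_evaluate"
    offset = (visit_date - randomization_date).days
    return "pass" if abs(offset - target_day) <= tolerance_days else "fail"


# Hypothetical Week 4 visit: target day 28, tolerance of 3 days either side.
week4 = dict(target_day=28, tolerance_days=3)
in_window = visit_window_outcome(date(2026, 1, 5), date(2026, 2, 3), **week4)
too_late = visit_window_outcome(date(2026, 1, 5), date(2026, 2, 9), **week4)
missing = visit_window_outcome(date(2026, 1, 5), None, **week4)
```

In a real catalog the target day and tolerance would come from the rule manifest rather than from literals in code, in line with the parameterization requirement above.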

The required standards knowledge overlaps with the rest of the synchronization path. Engineers should understand the CDISC Operational Data Model (ODM) hierarchy — StudyEventData → FormData → ItemGroupData → ItemData — because cross-form joins are expressed in those keys; the field-level semantics are covered in CDISC ODM vs CDASH schema mapping. The endpoint contract that delivers the deltas being validated is documented in EDC API Architecture for Clinical Trials, and the immutable-record expectations every rule firing must satisfy are defined in audit trail boundaries in EDC systems. Cross-form validation also consumes the cleaned, typed frames produced upstream — the normalization patterns in Pandas DataFrames for Clinical Data Cleaning are a hard prerequisite, because predicate evaluation against untyped or unaligned columns produces silent false negatives.

The reference implementation assumes a pinned, version-controlled dependency set so that validated behavior is reproducible across IQ/OQ/PQ environments:

Dependency	Pinned version	Role in cross-form validation
`python`	3.11.x	Runtime; structured `tomllib`/`dataclasses` rule loading
`polars`	0.20.x	Vectorized join + predicate evaluation on large subject graphs
`pandas`	2.2.x	Interop with legacy SDTM tooling and lab-extract readers
`pydantic`	2.7.x	Schema validation of the YAML rule manifest at load time
`pyyaml`	6.0.x	Parses the declarative rule catalog
`pytest`	8.2.x	GxP regression artifacts and shadow-mode fixtures

Implementation: The Rule Registry and Predicate Engine

Code auditability begins with a strict separation of rule definitions from execution logic. Rules live in a declarative manifest (see the configuration section below); the engine compiles each manifest entry into a callable predicate at load time and never embeds clinical thresholds in source. The registry validates every rule against a pydantic model so that a malformed predicate fails loudly at startup rather than silently skipping during a nightly run.

# ALCOA+ requirement: every rule carries an immutable id + version so each
# discrepancy is attributable to an exact logic state (Attributable, Original).
from dataclasses import dataclass
from typing import Callable
import polars as pl


@dataclass(frozen=True)
class CrossFormRule:
    rule_id: str            # e.g. "XF-TEMP-001"
    version: str            # bumped on any predicate change; maps to amendment
    severity: str           # "critical" | "major" | "minor"
    forms: tuple[str, ...]  # forms this rule joins, for lineage
    description: str
    # predicate returns a boolean column: True = PASS, False = discrepancy
    predicate: Callable[[pl.DataFrame], pl.Expr]


def evaluate_rule(graph: pl.DataFrame, rule: CrossFormRule) -> pl.DataFrame:
    """Evaluate one rule across the whole subject-visit graph in a single pass.
    Vectorized: no row-wise Python loop, so a 200k-row refresh stays sub-second."""
    passed = rule.predicate(graph)
    return (
        graph.with_columns(passed.alias("_passed"))
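        # Keep only failures -> discrepancies. Rows whose predicate is null are
        # dropped here as well, so unevaluable rows must be surfaced separately.
        .filter(~pl.col("_passed"))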
        .select(["subject_id", "visit_id", "event_sequence", *rule.forms])
        .with_columns(
            pl.lit(rule.rule_id).alias("rule_id"),
            pl.lit(rule.version).alias("rule_version"),
            pl.lit(rule.severity).alias("severity"),
        )
    )

The predicate itself is pure and declarative. A temporal-sequencing rule and a dose-lab reconciliation rule look like this when registered:

# ALCOA+ requirement: thresholds (ALT > 3x ULN) are parameters, not magic
# numbers, so a protocol amendment is a tracked manifest change (Contemporaneous).
TEMPORAL_CONSENT = CrossFormRule(
    rule_id="XF-TEMP-001", version="1.2.0", severity="critical",
    forms=("DM", "DS"),
    description="Randomization date must not precede informed consent date.",
    predicate=lambda df: pl.col("randomization_date") >= pl.col("consent_date"),
)

# Mirrors params.alt_multiplier in the rule manifest; shown as a constant here
# only to keep the example self-contained.
ALT_MULTIPLIER = 3

DOSE_HEPATIC = CrossFormRule(
    rule_id="XF-SAFE-014", version="2.0.1", severity="major",
    forms=("EX", "LB"),
    description="IP dosing continued despite ALT > 3x ULN without override flag.",
    predicate=lambda df: ~(
        (pl.col("ip_administered") == True)
        & (pl.col("alt_value") > ALT_MULTIPLIER * pl.col("alt_uln"))
        & (pl.col("hepatic_override").is_null())
    ),
)

Because each predicate returns a polars expression rather than a scalar, the engine evaluates the entire materialized graph in one columnar pass. This is what makes the pipeline deterministic and fast: the same frozen snapshot and the same rule version always yield byte-identical discrepancy rows, and re-running the engine is idempotent.

Implementation: Materialization, Missing Data, and Compliance Constraints

Predicate evaluation is only as trustworthy as the join that feeds it. Cross-form materialization constructs the subject-visit graph by joining each form on subject_id, visit_id, and event_sequence — the same keys the ODM exposes. The compliance-critical edge cases live here, not in the predicates: a late-arriving lab record, a form-version drift that renamed ALAT to ALT, or a left-join that quietly drops a subject whose treatment form has not yet synced. A dropped subject is the most dangerous failure mode in clinical validation because it produces a false negative — a missing query rather than a noisy one — and it is invisible without an explicit anti-join check.

# ALCOA+ requirement: Complete + Accurate. Materialization must surface subjects
# present in one form but absent from the join, never silently drop them.
def materialize_graph(forms: dict[str, pl.DataFrame]) -> tuple[pl.DataFrame, pl.DataFrame]:
    keys = ["subject_id", "visit_id", "event_sequence"]
    markers = [f"_in_{name}" for name in forms]

    # Tag every form with a presence marker before joining. outer_coalesce merges
    # the key columns, so after the join a null key can no longer reveal which
    # side a row came from; the marker columns can.
    tagged = {
        name: frame.with_columns(pl.lit(True).alias(f"_in_{name}"))
        for name, frame in forms.items()
    }
    graph = tagged["DM"]
    for name, frame in tagged.items():
        if name == "DM":
            continue
        graph = graph.join(frame, on=keys, how="outer_coalesce", suffix=f"_{name}")

    # Quarantine rows missing from any form, or carrying a null join key: these
    # are entity-resolution failures (form drift / late arrival), not clean passes.
    incomplete = pl.any_horizontal(
        [pl.col(c).is_null() for c in markers + keys]
    )
    unjoined = graph.filter(incomplete)
    clean = graph.filter(~incomplete).drop(markers)
    return clean, unjoined

Missing data handling must be explicit per ICH E6(R3): a predicate evaluated against a null operand must not resolve to a silent pass. The dose-hepatic rule above deliberately uses is_null() on the override flag rather than == None, and any rule comparing two values where one is missing should route the row to a cannot-evaluate bucket, a third outcome alongside pass and fail, so that an inspector can distinguish “rule did not fire because data is clean” from “rule could not fire because data is absent.” Conflating the two is a data-integrity finding waiting to happen. Conditional rules that depend on protocol-specific branching (for example, a hepatic threshold that only applies to subjects on the active arm) must read the branch condition from the same frozen graph, never from a live lookup, to preserve reproducibility.

Configuration and Parameterization

The rule catalog is a version-controlled YAML manifest, parsed and validated against a pydantic model at load time. Storing thresholds, severities, and applicable visit windows in configuration — rather than code — is what lets a clinical data manager review a protocol-amendment diff without reading Python, and what gives the audit ledger a stable reference for “which version of which rule fired.”

# rules/cross_form.yaml — committed; every change is a reviewed, tagged PR.
# Maps directly to protocol amendment IDs for regulatory traceability.
catalog_version: "2026.06.0"
rules:
  - rule_id: XF-TEMP-001
    version: "1.2.0"
    severity: critical
    forms: [DM, DS]
    amendment_ref: "PROTO-AMD-03"
    visit_windows: [SCREENING, RANDOMIZATION]
    description: "Randomization date must not precede informed consent date."
    expression: "randomization_date >= consent_date"

  - rule_id: XF-SAFE-014
    version: "2.0.1"
    severity: major
    forms: [EX, LB]
    amendment_ref: "PROTO-AMD-05"
    params:
      alt_multiplier: 3          # x ULN; tuned per safety review board
    description: "IP dosing continued despite ALT > 3x ULN without override."

# ALCOA+ requirement: config is validated at load so a malformed rule fails
# the pipeline (Legible, Accurate) rather than skipping silently in production.
import pydantic
import yaml


class RuleSpec(pydantic.BaseModel):
    rule_id: str
    version: str
    severity: pydantic.constr(pattern="^(critical|major|minor)$")
    forms: list[str]
    amendment_ref: str
    description: str
    expression: str | None = None
    params: dict[str, float] = {}
    visit_windows: list[str] = []


def load_catalog(path: str) -> list[RuleSpec]:
    # Context manager closes the file handle even if parsing raises.
    with open(path, encoding="utf-8") as fh:
        raw = yaml.safe_load(fh)
    return [RuleSpec(**r) for r in raw["rules"]]   # raises on any bad rule

Environment-specific values — database DSNs, the ledger bucket, the active catalog version — belong in environment variables, never in the manifest, so the same validated catalog promotes unchanged from DEV through PROD. The catalog file’s git SHA is recorded in every ledger entry, which is what binds a discrepancy to an exact, reviewable rule definition.
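A minimal sketch of reading those environment-specific values, assuming three settings and failing fast when one is absent. The variable names (`XF_DATABASE_DSN`, `XF_LEDGER_BUCKET`, `XF_CATALOG_VERSION`) and the `RuntimeConfig` type are hypothetical placeholders, as are the example values:

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; substitute the ones your deployment defines.
_ENV_NAMES = {
    "database_dsn": "XF_DATABASE_DSN",
    "ledger_bucket": "XF_LEDGER_BUCKET",
    "catalog_version": "XF_CATALOG_VERSION",
}


@dataclass(frozen=True)
class RuntimeConfig:
    database_dsn: str
    ledger_bucket: str
    catalog_version: str


def load_runtime_config(env: Mapping[str, str] = os.environ) -> RuntimeConfig:
    """Read required settings and fail fast if any is unset or empty."""
    missing = [var for var in _ENV_NAMES.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return RuntimeConfig(**{field: env[var] for field, var in _ENV_NAMES.items()})


# Passing a mapping explicitly keeps the example independent of the real environment.
cfg = load_runtime_config({
    "XF_DATABASE_DSN": "postgresql://example-host/edc",
    "XF_LEDGER_BUCKET": "example-ledger-bucket",
    "XF_CATALOG_VERSION": "2026.06.0",
})
```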

Testing and Validation

GxP validation expects every production rule to have a unit test asserting both a passing and a failing case against a deterministic fixture. Mock the materialized graph rather than the API, because the unit under test is the predicate, not the transport. The same fixtures double as IQ/OQ evidence: a reviewer can read the test and confirm the rule encodes the protocol.

# OQ requirement: documented evidence that each rule behaves to specification.
import polars as pl
from rules.engine import evaluate_rule, DOSE_HEPATIC


def test_dose_hepatic_flags_elevated_alt_without_override():
    graph = pl.DataFrame({
        "subject_id": ["S001", "S002"],
        "visit_id": ["V3", "V3"], "event_sequence": [1, 1],
        "EX": ["x", "x"], "LB": ["y", "y"],
        "ip_administered": [True, True],
        "alt_value": [180.0, 20.0], "alt_uln": [40.0, 40.0],  # S001: 4.5x ULN
        "hepatic_override": [None, None],
    })
    out = evaluate_rule(graph, DOSE_HEPATIC)
    assert out["subject_id"].to_list() == ["S001"]      # only the violation
    assert out["rule_version"][0] == "2.0.1"            # version pinned in ledger


def test_dose_hepatic_respects_documented_override():
    graph = pl.DataFrame({
        "subject_id": ["S001"], "visit_id": ["V3"], "event_sequence": [1],
        "EX": ["x"], "LB": ["y"], "ip_administered": [True],
        "alt_value": [180.0], "alt_uln": [40.0],
        "hepatic_override": ["MD-signed-2026-05-12"],   # clinician override present
    })
    assert evaluate_rule(graph, DOSE_HEPATIC).height == 0

Before any new rule generates live queries, run it in shadow mode: execute the rule against a historical, locked dataset and report its hit count, precision, and recall without raising a single query. This lets data managers assess a rule’s behavior on real data and is the safest path to production for rules whose precision is uncertain — the same discipline detailed in Reducing False Positives in Clinical Query Engines.
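Precision and recall only have meaning against a reference set, so the sketch below assumes that a data manager has adjudicated the true discrepancies in the locked dataset. The `shadow_report` helper, the `ShadowReport` type, and the sample keys are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

Key = Tuple[str, str]  # (subject_id, visit_id)


@dataclass(frozen=True)
class ShadowReport:
    hits: int
    precision: Optional[float]  # None when the rule flagged nothing
    recall: Optional[float]     # None when nothing was adjudicated as true


def shadow_report(flagged: Set[Key], adjudicated: Set[Key]) -> ShadowReport:
    """Summarize a shadow run without raising any query."""
    true_positives = len(flagged & adjudicated)
    precision = true_positives / len(flagged) if flagged else None
    recall = true_positives / len(adjudicated) if adjudicated else None
    return ShadowReport(len(flagged), precision, recall)


# Hypothetical shadow run: four rows flagged, three adjudicated as real issues.
flagged = {("S001", "V3"), ("S002", "V3"), ("S003", "V4"), ("S004", "V4")}
adjudicated = {("S001", "V3"), ("S002", "V3"), ("S005", "V5")}
report = shadow_report(flagged, adjudicated)
```

Here two of the four flagged rows are real, and one adjudicated issue was missed, which is the kind of evidence a data manager needs before a rule is promoted to live queries.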

Production Gotchas and Failure Modes

Silent subject drop on inner join. A rule that joins EX and LB with how="inner" will skip every subject whose lab has not yet synced, so the queries that should have fired are never raised. Remediation: materialize with the outer_coalesce join above, confirm coverage with an anti-join, and route unjoined rows to a quarantine queue, never to the clean-pass ledger.
Form-version drift renames a column. A protocol amendment renames ALAT → ALT; the predicate reads alt_value, finds nulls, and every hepatic rule silently passes. Remediation: pin field names in the entity-resolution layer and fail the run if an expected column is absent after mapping, rather than coalescing to null.
Null operand evaluates to a passing row. consent_date >= NULL is not False, it is unknown — but a naive filter treats unknown as pass. Remediation: route any predicate with a null operand to a cannot-evaluate bucket and surface it as a data-completeness discrepancy, not a clean pass.
Over-tight thresholds flood sites with queries. A rule firing on a one-day visit-window edge generates query fatigue and erodes site engagement. Remediation: tune tolerances with Discrepancy Threshold Tuning and add documented grace periods rather than loosening the predicate in code.
Non-idempotent reruns mutate the ledger. Re-running the engine after a partial failure appends duplicate discrepancies. Remediation: key each ledger row on (rule_id, rule_version, subject_id, visit_id, row_hash) so a rerun against the same snapshot is a no-op upsert.
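The idempotent ledger key from the last item can be sketched as follows. This is an illustration under stated assumptions: `row_hash` is taken to be a SHA-256 digest over a canonical JSON rendering of the evaluated field values, and an in-memory dict stands in for the real ledger store. All helper names are hypothetical:

```python
import hashlib
import json


def row_hash(values: dict) -> str:
    """Hash the evaluated values in a canonical form (sorted keys, fixed separators)."""
    canonical = json.dumps(values, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ledger_key(rule_id: str, rule_version: str, subject_id: str,
               visit_id: str, values: dict) -> tuple:
    return (rule_id, rule_version, subject_id, visit_id, row_hash(values))


def upsert(ledger: dict, key: tuple, entry: dict) -> bool:
    """Write the entry unless the key already exists; return True if a row was written."""
    if key in ledger:
        return False
    ledger[key] = entry
    return True


values = {"alt_value": 180.0, "alt_uln": 40.0, "hepatic_override": None}
key = ledger_key("XF-SAFE-014", "2.0.1", "S001", "V3", values)

ledger: dict = {}
first_write = upsert(ledger, key, {"severity": "major"})
second_write = upsert(ledger, key, {"severity": "major"})  # rerun on same snapshot
```

Because the hash covers the field values, a corrected source value produces a new key and therefore a new ledger row, while a rerun against an unchanged snapshot writes nothing.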

Compliance Checklist

Every rule carries an immutable rule_id and a version bumped on any predicate change.
The rule manifest is committed; the catalog git SHA is recorded in each ledger entry.
Each rule maps to a protocol amendment or SAP reference (amendment_ref).
Materialization surfaces unjoined subjects to a quarantine queue, never to the clean-pass ledger.
Null-operand predicates route to a cannot-evaluate bucket, not a silent pass.
Every production rule has passing and failing unit tests retained as OQ evidence.
New or changed rules run in shadow mode against a locked dataset before generating live queries.
Ledger rows are idempotent on (rule_id, rule_version, subject_id, visit_id, row_hash).
Discrepancy payloads carry the failing field values, expected range, and a source-CRF reference for downstream routing.

Related

Automated Clinical Query Generation — consumes the structured discrepancies this engine emits.
Reducing False Positives in Clinical Query Engines — shadow-mode testing and precision tuning for these rules.
Discrepancy Threshold Tuning in Clinical Trial Data Monitoring Pipelines — calibrating clinical tolerances to balance signal and site burden.
Deterministic Query Routing Workflows for CRAs — severity-based routing of the discrepancies these rules raise.
Clinical Query Generation & Discrepancy Management — the parent section this page belongs to.
