Mapping EDC Forms to CDASH Standards: A Step-by-Step Pipeline Guide

The symptom is familiar to anyone who has locked a multi-site database: an EDC form that looked clean in the vendor UI fails conformance the moment it is projected into CDASH, throwing orphaned columns, codelist misses, and VSTESTCD values that no SDTM tabulation will accept. This page is the tactical, form-by-form companion to the broader CDISC ODM vs CDASH Schema Mapping design, and it sits inside the wider Clinical Data Architecture & EDC Standards reference where the EDC is treated as an immutable, read-only source of truth. It is written for clinical data managers, Python ETL engineers, and regulatory teams who need a deterministic, defensible procedure for turning a Medidata Rave, Oracle InForm, or Veeva Vault form into a CDASH-aligned dataset without watering down the audit trail. Each step below isolates a single responsibility and carries the regulatory rationale alongside the code.

The Six-Step Mapping Flow

The pipeline resolves hierarchical ODM metadata into flat, controlled-terminology-aligned CDASH variables, embedding provenance and a validation gate before anything reaches a monitored environment.

Why EDC Forms Resist a Clean CDASH Projection

The mapping is hard because the two models disagree on shape and on meaning. ODM is hierarchical and vendor-flavored: Medidata Rave, Oracle InForm, and Veeva Vault EDC each serialize FormDef, ItemGroupDef, and ItemDef differently, frequently embedding custom attributes and dynamic OID suffixes that no CDASH variable expects. CDASH is flat and tabular, and it demands explicit nullability, controlled terminology, and a --TESTCD/--ORRES grain that EDC capture forms rarely express directly. A single EDC text box can carry both a measurement and its unit; conditional branching can produce mutually exclusive fields that must collapse into one CDASH column. Every one of these structural mismatches is a place where a silent data-integrity defect is introduced or prevented, which is why the procedure is deterministic rather than best-effort.

Step 1: Deconstruct EDC Metadata and Resolve Vendor-Specific ODM Artifacts

Start from the CDISC Operational Data Model (ODM) export, not the human-readable form. Parse it with a hardened XML reader and isolate the ItemDef, ItemGroupDef, and CodeList nodes, preserving the DataType and Length attributes and the Mandatory flag that CDASH needs for explicit nullability declarations. In ODM 1.3, Mandatory is declared on the ItemRef inside each ItemGroupDef rather than on the ItemDef, so group membership and nullability are read from the same place. Repeat groups whose ItemGroupOID carries dynamic suffixes (AE_01, AE_02) must be collapsed to a canonical domain identifier before any mapping runs.

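# 21 CFR Part 11 (Accurate): parse ODM with entity resolution disabled
# and preserve source attributes verbatim; never coerce metadata silently.
import re
from defusedxml.lxml import parse

ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

def deconstruct_items(odm_path: str) -> list[dict]:
    tree = parse(odm_path)
    # ItemDef is a sibling of ItemGroupDef under MetaDataVersion, so group
    # membership and the Mandatory flag are read from ItemGroupDef/ItemRef.
    refs: dict[str, tuple[str, str | None]] = {}
    for group in tree.iterfind(".//odm:ItemGroupDef", ODM_NS):
        for ref in group.iterfind("odm:ItemRef", ODM_NS):
            # First referencing group wins; items reused across groups need review
            refs.setdefault(ref.get("ItemOID"), (group.get("OID", ""), ref.get("Mandatory")))
    items = []
    for item in tree.iterfind(".//odm:ItemDef", ODM_NS):
        group_oid, mandatory = refs.get(item.get("OID"), ("", None))
        # Compare with None explicitly: a childless lxml element is falsy
        cl_ref = item.find("odm:CodeListRef", ODM_NS)
        items.append({
            "item_oid": item.get("OID"),
            # Collapse vendor repeat suffixes (AE_01 -> AE) to one domain key
            "domain": re.sub(r"_\d+$", "", group_oid).split(".")[-1],
            "data_type": item.get("DataType"),
            "length": item.get("Length"),
            "mandatory": mandatory or "No",
            "codelist_oid": cl_ref.get("CodeListOID") if cl_ref is not None else None,
        })
    return items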

The same read-only extraction discipline that governs upstream sync — documented in Python ETL for EDC Data Extraction — applies here: the ODM file is evidence, so it is parsed, never edited in place.
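A minimal ODM 1.3 fragment makes the element layout concrete. The sketch below is an assumption-laden illustration: the OIDs and names in `SAMPLE` are invented, and it uses the standard-library `xml.etree.ElementTree` only because the input is a trusted inline literal (a real export should still go through a hardened parser). It shows that `ItemDef` carries `DataType` and `Length`, while group membership and `Mandatory` come from the `ItemRef` inside an `ItemGroupDef`.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Invented sample metadata; OIDs and names are illustrative only.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="ST.DEMO">
    <MetaDataVersion OID="MDV.1" Name="demo">
      <ItemGroupDef OID="IG.VS" Name="Vital Signs" Repeating="No">
        <ItemRef ItemOID="IT.VS.BP_SYSTOLIC" Mandatory="Yes"/>
      </ItemGroupDef>
      <ItemDef OID="IT.VS.BP_SYSTOLIC" Name="Systolic BP" DataType="integer" Length="3"/>
    </MetaDataVersion>
  </Study>
</ODM>"""

root = ET.fromstring(SAMPLE)

# Map each referenced item to (owning group OID, Mandatory flag on the ItemRef).
group_of = {
    ref.get("ItemOID"): (group.get("OID"), ref.get("Mandatory"))
    for group in root.iterfind(".//odm:ItemGroupDef", NS)
    for ref in group.iterfind("odm:ItemRef", NS)
}

item = root.find(".//odm:ItemDef", NS)
item_summary = {
    "item_oid": item.get("OID"),
    "data_type": item.get("DataType"),
    "length": item.get("Length"),
    "mandatory_on_itemdef": item.get("Mandatory"),  # absent on ItemDef in this fragment
    "group_and_mandatory": group_of.get(item.get("OID")),
}
```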

Step 2: Align Variables to CDASH Domains via a Versioned Manifest

Mapping is a contract, not a guess. Bind each item_oid to its CDASH variable through a version-controlled manifest, never by column position, and validate the result with a typed model so unmapped fields are flagged rather than silently dropped.

# ALCOA+ (Attributable, Consistent): map by OID through a versioned manifest;
# an unmapped OID is surfaced for review, never coerced into a column.
from pydantic import BaseModel

class CdashTarget(BaseModel):
    domain: str          # e.g. "VS"
    variable: str        # e.g. "VSORRES"
    testcd: str | None = None   # e.g. "SYSBP"

MANIFEST_VERSION = "2026.06"
MANIFEST: dict[str, CdashTarget] = {
    "IT.VS.BP_SYSTOLIC": CdashTarget(domain="VS", variable="VSORRES", testcd="SYSBP"),
    "IT.VS.BP_SYS_UNIT": CdashTarget(domain="VS", variable="VSORRESU", testcd="SYSBP"),
    "IT.AE.AETERM":      CdashTarget(domain="AE", variable="AETERM"),
}

def align_to_cdash(items: list[dict]) -> tuple[list[dict], list[str]]:
    mapped, unmapped = [], []
    for it in items:
        target = MANIFEST.get(it["item_oid"])
        if target is None:
            unmapped.append(it["item_oid"])      # route to manual review, do not drop
            continue
        mapped.append({**it, **target.model_dump(), "manifest_version": MANIFEST_VERSION})
    return mapped, unmapped

Treat the manifest as a living artifact under pull-request review; consult the CDASH Implementation Guide for required-versus-expected variables before adding any new row. A combined EDC capture (a value and its unit entered together) resolves here into distinct VSORRES and VSORRESU targets sharing a VSTESTCD.
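Where a single text box really does hold both value and unit, the split declared in the manifest still has to be executed somewhere. The following is a minimal sketch of one way to do that; the function `split_value_unit`, the regular expression, and the `needs_review` flag are all hypothetical. Anything that does not parse cleanly is passed through verbatim and flagged rather than guessed at.

```python
import re

# A leading number, optional whitespace, then an optional unit that does not
# start with a digit. Deliberately strict: anything else goes to review.
_VALUE_UNIT = re.compile(
    r"^\s*(?P<value>[-+]?\d+(?:\.\d+)?)\s*(?P<unit>[^\d\s].*?)?\s*$"
)


def split_value_unit(raw: str, testcd: str) -> dict:
    """Split a combined capture such as '120 mmHg' into result and unit."""
    match = _VALUE_UNIT.match(raw)
    if match is None:
        # Keep the verbatim text and flag it; never invent a value or unit.
        return {"VSTESTCD": testcd, "VSORRES": raw, "VSORRESU": None, "needs_review": True}
    return {
        "VSTESTCD": testcd,
        "VSORRES": match.group("value"),   # kept as text, as captured
        "VSORRESU": match.group("unit"),   # None when no unit was typed
        "needs_review": False,
    }


parsed = split_value_unit("120 mmHg", "SYSBP")
```

The result is kept as a string so the captured precision is not altered by a float round trip.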

Step 3: Normalize Controlled Terminology and Resolve Semantic Drift

Controlled-terminology drift is a primary driver of submission rejections. EDC picklists and free text diverge from NCI-EVS and MedDRA, so harmonize local codes against a versioned dictionary with a strict left join, and route any miss into the discrepancy workflow instead of imputing a value.

# ALCOA+ (Accurate): codelist misses are tagged and queried, never imputed —
# fabricating a coded value is a data-integrity finding.
import pandas as pd

CT_VERSION = "NCI-EVS 2026-03-28"

def harmonize_terminology(df: pd.DataFrame, ct_map: pd.DataFrame) -> pd.DataFrame:
    out = df.merge(ct_map, on=["variable", "verbatim_value"], how="left")
    miss = out["submission_value"].isna()
    # Preserve the original vendor value for auditability in a SUPP-- qualifier
    out.loc[miss, "supp_qual"] = out.loc[miss, "verbatim_value"]
    out.loc[miss, "submission_value"] = "__UNMAPPED__:" + out.loc[miss, "verbatim_value"]
    out["ct_version"] = CT_VERSION
    return out

A custom severity scale (Mild/Moderate/Severe) maps to standardized AESEV values (MILD/MODERATE/SEVERE); the verbatim term is retained in a SUPP-- record. The same harmonization tables feed Automated Clinical Query Generation, so an unmapped term becomes a query rather than a corrupted safety record.
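The severity example can be run end to end on a toy dictionary. This sketch assumes a `ct_map` with the same three columns the merge above joins on; the sample rows are invented. It also passes `validate="many_to_one"` to `DataFrame.merge`, which makes pandas raise if the dictionary ever contains duplicate keys that would otherwise fan out rows in the left join.

```python
import pandas as pd

# Invented sample dictionary: one row per (variable, verbatim_value) pair.
ct_map = pd.DataFrame({
    "variable": ["AESEV", "AESEV", "AESEV"],
    "verbatim_value": ["Mild", "Moderate", "Severe"],
    "submission_value": ["MILD", "MODERATE", "SEVERE"],
})

# Invented captured values, including one that the dictionary does not cover.
captured = pd.DataFrame({
    "variable": ["AESEV", "AESEV", "AESEV"],
    "verbatim_value": ["Mild", "Severe", "Life-threatening"],
})

merged = captured.merge(
    ct_map,
    on=["variable", "verbatim_value"],
    how="left",
    validate="many_to_one",  # fail loudly on duplicate dictionary keys
)

# Rows with no dictionary hit are the ones that should become queries.
misses = merged[merged["submission_value"].isna()]
```

Here `misses` holds only the "Life-threatening" row, which is the input to the discrepancy workflow rather than a candidate for imputation.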

Step 4: Embed Provenance and Respect Audit-Trail Boundaries

Every transformed value must trace back to its source EDC record. Embed a content-addressed lineage hash per row and keep ETL-derived lineage strictly separate from the source audit trail — a distinction covered in depth under Audit Trail Boundaries in EDC Systems.

# 21 CFR Part 11 (Original, Enduring): deterministic SHA-256 over the source
# tuple gives a reconstructable lineage anchor for incremental change detection.
import hashlib

def lineage_hash(row: dict) -> str:
    payload = "|".join(str(row[k]) for k in (
        "item_oid", "subject_id", "visit", "repeat_index", "verbatim_value"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

Isolate the final adjudicated value and archive historical deltas in a SUPP-- domain or a separate audit table; never collapse the data-entry, modification, and query-resolution timestamps into a single column. Because a flat CDASH frame can bring together fields that the nested ODM kept compartmentalized, apply Role-Based Access Control for Clinical Data and field minimization before the frame leaves the trusted environment.
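One caveat with a delimiter-joined hash payload is that a delimiter character inside a source value can make two different tuples serialize to the same string. A possible variant, sketched here under the assumption that the same five keys identify a source record, serializes the tuple as a JSON array so field boundaries stay unambiguous. The name `lineage_hash_json` is hypothetical.

```python
import hashlib
import json

LINEAGE_KEYS = ("item_oid", "subject_id", "visit", "repeat_index", "verbatim_value")


def lineage_hash_json(row: dict) -> str:
    """SHA-256 over a JSON array of the source tuple (unambiguous boundaries)."""
    payload = json.dumps(
        [str(row[key]) for key in LINEAGE_KEYS],
        ensure_ascii=False,
        separators=(",", ":"),  # fixed separators keep the encoding deterministic
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two invented rows that differ only in where a "|" falls between fields.
row_a = {"item_oid": "IT.AE.AETERM", "subject_id": "001",
         "visit": "V1|1", "repeat_index": "x", "verbatim_value": "Headache"}
row_b = {"item_oid": "IT.AE.AETERM", "subject_id": "001",
         "visit": "V1", "repeat_index": "1|x", "verbatim_value": "Headache"}

pipe_a = "|".join(str(row_a[k]) for k in LINEAGE_KEYS)
pipe_b = "|".join(str(row_b[k]) for k in LINEAGE_KEYS)
```

With these rows the pipe-joined strings are identical, while the JSON-based hashes differ. Switching encodings changes every stored hash, so it would need to go through change control like any other mapping revision.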

Step 5: Validate with a Conformance Gate and Deterministic Recovery

A production mapping cannot rely on best-effort transforms. Run a schema contract that asserts the grain, types, and controlled-terminology membership before any commit, and enter a deterministic recovery state on failure.

# GxP test artifact: pandera contract is the executable CDASH conformance spec;
# its pass/fail report is archived as OQ evidence.
import pandera.pandas as pa

vs_schema = pa.DataFrameSchema({
    "VSTESTCD": pa.Column(str, pa.Check.isin(["SYSBP", "DIABP", "PULSE"])),
    "VSORRES":  pa.Column(str, nullable=True),
    "VSDTC":    pa.Column(str, pa.Check.str_matches(r"^\d{4}-\d{2}-\d{2}$")),  # ISO 8601
    "lineage_hash": pa.Column(str, pa.Check.str_length(64, 64)),
})

def validate_or_quarantine(df, schema=vs_schema):
    try:
        return schema.validate(df, lazy=True), None
    except pa.errors.SchemaErrors as exc:
        # Isolate offenders; preserve clean rows for resumable processing
        bad = exc.failure_cases["index"].dropna().astype(int).unique()
        df.loc[df.index.isin(bad)].to_parquet("quarantine.parquet")
        return df.loc[~df.index.isin(bad)], exc.failure_cases

The same pandas-based cleaning and type-coercion patterns described in Pandas DataFrames for Clinical Data Cleaning feed this gate; great_expectations is an equivalent choice where a shared expectation suite is preferred. Cross-domain referential checks — for example confirming every VS record resolves to a DM subject — belong here too and are detailed in Cross-Form Data Validation Rules.
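The VS-to-DM referential check mentioned above can be sketched without any framework. The function `orphaned_subjects` and the `SUBJID` key are assumptions for illustration; the real key depends on how the study identifies subjects in its CDASH frames.

```python
def orphaned_subjects(vs_rows: list[dict], dm_rows: list[dict], key: str = "SUBJID") -> list[str]:
    """Return the sorted subject identifiers present in VS but absent from DM."""
    known = {row[key] for row in dm_rows}
    return sorted({row[key] for row in vs_rows if row[key] not in known})


# Invented sample rows: subject 003 has vitals but no demographics record.
dm = [{"SUBJID": "001"}, {"SUBJID": "002"}]
vs = [
    {"SUBJID": "001", "VSTESTCD": "SYSBP"},
    {"SUBJID": "003", "VSTESTCD": "SYSBP"},
    {"SUBJID": "003", "VSTESTCD": "DIABP"},
]

orphans = orphaned_subjects(vs, dm)
```

A non-empty result would be treated like any other gate failure: the affected rows are quarantined and the finding is reported, not patched in place.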

Step 6: Deploy and Maintain Continuous Conformance

Promote the mapping logic into a monitored environment behind a CI/CD gate. A pinned container image keeps dependency resolution reproducible from run to run; the pipeline re-runs CDASH conformance against the latest FDA Study Data Technical Conformance Guide before any manifest change ships. Establish change control that tracks EDC form versioning, terminology dictionary updates, and CDASH specification revisions, and monitor validation-failure trends as an early-warning signal for architectural drift. The ODM payloads themselves arrive over the interfaces defined in EDC API Architecture for Clinical Trials, so version both ends of the contract together.

Verification and Audit Trail

Confirming the mapping is correct is itself a regulated activity. Capture the following evidence on every run so the transformation can be reconstructed and defended at inspection:

Determinism proof: run the mapping twice on identical ODM input and assert that the two outputs are equal (pandas.testing.assert_frame_equal); for a byte-level claim, also compare a hash of the serialized output. Archive the comparison result as OQ evidence.
Conformance report: persist the pandera validation report (pass and failure cases) alongside the MANIFEST_VERSION and CT_VERSION stamped in each row.
Lineage reconciliation: spot-check that each output lineage_hash recomputes from its source ODM node, proving no value was orphaned or fabricated.
Unmapped register: confirm the count of __UNMAPPED__ tags matches the queries raised, so no codelist miss was silently absorbed.
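The determinism check in the list above can be approximated with the standard library alone. In this sketch the helper names `frame_digest` and `assert_deterministic` are hypothetical, and the mapping function is assumed to return a `(rows, columns)` pair; a pandas-based pipeline would serialize its frame instead. The idea is to hash a canonical CSV serialization of each run and compare the digests.

```python
import csv
import hashlib
import io


def frame_digest(rows: list[dict], columns: list[str]) -> str:
    """SHA-256 of a canonical CSV serialization with a fixed column order."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


def assert_deterministic(mapping_fn, odm_input) -> str:
    """Run the mapping twice on the same input and require identical digests."""
    first = frame_digest(*mapping_fn(odm_input))
    second = frame_digest(*mapping_fn(odm_input))
    if first != second:
        raise AssertionError(f"non-deterministic output: {first} != {second}")
    return first  # archive this digest with the run evidence


# Stand-in mapping function used only to exercise the check.
def demo_mapping(_odm_input):
    rows = [{"VSTESTCD": "SYSBP", "VSORRES": "120"}]
    return rows, ["VSTESTCD", "VSORRES"]


digest = assert_deterministic(demo_mapping, None)
```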

Edge Cases and Vendor-Specific Gotchas

Medidata Rave repeat groups. Rave emits dynamic ItemGroupOID suffixes for log lines (AE_01, AE_02). Carry an explicit repeat_index into the flattened grain and sort with a stable mergesort; without it, adverse-event rows can reorder between runs and break determinism.
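A small sketch of the repeat-index handling follows. It assumes the suffix convention shown above (`AE_01`, `AE_02`); the helper names and the `item_group_oid` and `subject_id` keys are hypothetical. The index is parsed to an integer so that line 10 sorts after line 2, and Python's built-in `sorted` is used because it is a stable sort.

```python
import re

_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<index>\d+)$")


def split_repeat_oid(group_oid: str) -> tuple[str, int]:
    """Split 'AE_02' into ('AE', 2); OIDs without a suffix get index 0."""
    match = _SUFFIX.match(group_oid)
    if match is None:
        return group_oid, 0
    return match.group("base"), int(match.group("index"))


def order_log_lines(rows: list[dict]) -> list[dict]:
    """Attach domain and repeat_index, then sort on an explicit, total key."""
    keyed = []
    for row in rows:
        base, index = split_repeat_oid(row["item_group_oid"])
        keyed.append({**row, "domain": base, "repeat_index": index})
    # sorted() is stable, so rows with equal keys keep their input order.
    return sorted(keyed, key=lambda r: (r["subject_id"], r["domain"], r["repeat_index"]))


# Invented log lines arriving out of order.
ordered = order_log_lines([
    {"subject_id": "001", "item_group_oid": "AE_10", "AETERM": "Nausea"},
    {"subject_id": "001", "item_group_oid": "AE_02", "AETERM": "Headache"},
    {"subject_id": "001", "item_group_oid": "AE_01", "AETERM": "Rash"},
])
```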

Veeva Vault CDMS stringified numerics. Vault REST payloads frequently return numeric limits and values as JSON strings. Coerce types at the boundary with a typed model before any threshold check or numeric derivation (for example into VSSTRESN), or every comparison silently falls back to string semantics.

Oracle InForm omitted nullability. InForm exports often drop the Mandatory flag entirely. CDASH requires explicit nullability, so treat a defaulted "No" as a placeholder only: record that the flag was absent and flag the field for manifest review rather than assuming the value is optional.

Frequently Asked Questions

Should I map an EDC form straight to SDTM and skip CDASH?

No. CDASH is the acquisition-standard intermediate that keeps collection consistent and traceable. Mapping the form into CDASH first gives a vendor-neutral tabular contract that SDTM tabulation builds on, so an EDC form-build change does not ripple uncontrolled into submission datasets.

How do I keep a form-level mapping deterministic across repeating groups?

Carry an explicit repeat index from the ODM ItemGroupData into the flattened grain and sort with a stable mergesort before output. The same form then always produces the same rows in the same order, so identical ODM input yields byte-identical CDASH every run.

What happens when an EDC picklist value is not in the codelist?

Tag it explicitly (for example __UNMAPPED__:<verbatim>), preserve the original value in a SUPP-- qualifier, and route it into the discrepancy workflow. Imputing a coded value fabricates clinical meaning and is a data-integrity finding; raising a query preserves ALCOA+ Accurate and Consistent.

How does this mapping satisfy 21 CFR Part 11?

Acceptability comes from controls, not the tool: the mapping manifest is version-controlled, behavior is proven deterministic by regression tests, every output row carries a lineage hash back to its source ODM node, and the validation report is archived as IQ/OQ evidence.

Where do I split a single EDC field that holds a value and its unit?

Split it in Step 2, in the manifest. Bind the value box to --ORRES and the unit to --ORRESU, both sharing the same --TESTCD. Doing it in the mapping contract — rather than ad hoc in transform code — keeps the split reviewable and reproducible.

Related

CDISC ODM vs CDASH Schema Mapping: the parent design and the versioned manifest this walkthrough applies form by form.
Clinical Data Architecture & EDC Standards: the reference architecture this mapping layer sits within.
Audit Trail Boundaries in EDC Systems: keeping ETL-derived lineage separate from source audit records.
Role-Based Access Control for Clinical Data: gating the PHI a flattened CDASH frame can re-expose.
Automated Clinical Query Generation: where codelist misses and unmapped fields become discrepancies.
