# Design Document: Compliance Schema Drift Check

## Overview

This feature adds schema drift detection to the compliance xlsx upload flow. When a user uploads a weekly NTS_AEO report, the backend extracts the xlsx structural schema (sheet names, column headers, metric values) and compares it against a shared parser configuration file. The comparison produces a categorised drift report with three severity levels: breaking (blocks upload), silent-miss (warns but allows proceeding), and cosmetic (informational). The frontend displays these findings in a new drift review phase inside the upload modal, inserted between the upload spinner and the existing diff preview.

The parser configuration dicts (`METRIC_CATEGORIES`, `CORE_COLS`, `SKIP_SHEETS`) currently defined inline in `parse_compliance_xlsx.py` are extracted into a shared JSON file (`backend/scripts/compliance_config.json`) that both the Python parser and the Node.js drift checker read. This establishes a single source of truth for parser configuration.

### Design Decisions

1. **Shared JSON config over database storage**: The parser config is a developer-maintained mapping, not user data. A JSON file is version-controllable, diffable, and readable by both Python and Node.js without additional dependencies.
2. **Python subprocess for schema extraction**: The existing `dump_xlsx_schema.py` already uses openpyxl to extract xlsx structure. We adapt this into a new `extract_xlsx_schema.py` script that the Node.js backend invokes as a subprocess, consistent with how `parse_compliance_xlsx.py` is already called.
3. **Node.js drift comparison logic**: The drift comparison is pure object comparison (sets of strings) with no xlsx parsing. Implementing it in Node.js avoids a second Python subprocess call and keeps the logic co-located with the route handler.
4. **Graceful degradation**: If the drift check fails, the upload flow proceeds normally with `drift: null` and a `drift_error` message.
   The drift check is additive: a failure of the check itself must never block the existing workflow.

## Architecture

```mermaid
sequenceDiagram
    participant User
    participant Modal as ComplianceUploadModal
    participant API as POST /api/compliance/preview
    participant Schema as extract_xlsx_schema.py
    participant Drift as driftChecker (Node.js)
    participant Config as compliance_config.json
    participant Parser as parse_compliance_xlsx.py

    User->>Modal: Drops xlsx file
    Modal->>API: POST /preview (multipart)
    API->>Schema: spawn python3 extract_xlsx_schema.py
    Schema-->>API: JSON { sheets: [...] }
    API->>Config: fs.readFileSync(compliance_config.json)
    API->>Drift: compareSchemaToDrift(schema, config)
    Drift-->>API: { breaking: [...], silent_miss: [...], cosmetic: [...] }
    API->>Parser: spawn python3 parse_compliance_xlsx.py
    Parser->>Config: reads compliance_config.json
    Parser-->>API: JSON { items, summary, ... }
    API->>API: computeDiff(db, items)
    API-->>Modal: { drift, diff, tempFile, ... }

    alt drift has findings
        Modal->>User: Show drift review phase
        alt breaking findings exist
            Modal->>User: Block "Continue to Preview"
        else no breaking findings
            User->>Modal: Click "Continue to Preview"
            Modal->>User: Show diff preview
        end
    else no drift findings
        Modal->>User: Show diff preview directly
    end
```

### File Layout

```
backend/
  scripts/
    compliance_config.json      # NEW: shared parser config (single source of truth)
    extract_xlsx_schema.py      # NEW: extracts xlsx structure as JSON
    parse_compliance_xlsx.py    # MODIFIED: reads config from JSON file
    dump_xlsx_schema.py         # UNCHANGED: standalone diagnostic tool
  routes/
    compliance.js               # MODIFIED: drift check in /preview, new driftChecker module
  helpers/
    driftChecker.js             # NEW: compareSchemaToDrift() function
frontend/
  src/components/pages/
    ComplianceUploadModal.js    # MODIFIED: new drift-review phase
```

## Components and Interfaces

### 1. Shared Parser Configuration (`compliance_config.json`)

```json
{
  "metric_categories": {
    "2.3.4i": "Vulnerability Management",
    "2.3.6i": "Vulnerability Management",
    "5.2.4": "Access & MFA"
  },
  "core_cols": [
    "Preferred - Hostname",
    "GRANITE - IPv4_Address",
    "GRANITE - Type",
    "Team",
    "Compliant",
    "Source_Network",
    "Vertical",
    "GRANITE - Equip_Inst_ID",
    "GRANITE - RESPONSIBLE_TEAM"
  ],
  "skip_sheets": ["Summary", "CMDB_9box", "Vulns", "Aging Dashboard"]
}
```

### 2. Schema Extractor (`extract_xlsx_schema.py`)

**Input**: File path as CLI argument.

**Output** (stdout JSON):

```json
{
  "sheets": [
    {
      "name": "Summary",
      "columns": ["Metric", "Non-Compliant", "..."],
      "metric_values": ["2.3.4i", "5.2.4", "..."]
    },
    {
      "name": "2.3.4i",
      "columns": ["Preferred - Hostname", "GRANITE - IPv4_Address", "..."]
    }
  ]
}
```

- Uses openpyxl in read-only mode.
- Extracts sheet names, first-row column headers per sheet, and unique metric values from the Summary sheet (header at row 4, data from row 5 onward).
- On error, returns `{ "error": "..." }` on stdout and exits with a non-zero code.

### 3. Drift Checker (`backend/helpers/driftChecker.js`)

**Function**: `compareSchemaToDrift(schema, config) => DriftReport`

**Parameters**:

- `schema`: object returned by `extract_xlsx_schema.py`
- `config`: object parsed from `compliance_config.json`

**Returns** (`DriftReport`):

```javascript
{
  breaking: [
    {
      severity: 'breaking',
      message: 'Detail sheet "2.3.4i" is missing core column "Team"',
      value: 'Team',
      sheet: '2.3.4i'
    }
  ],
  silent_miss: [
    {
      severity: 'silent_miss',
      message: 'Unknown metric "9.1.2" in Summary (not in metric_categories)',
      value: '9.1.2'
    }
  ],
  cosmetic: [
    {
      severity: 'cosmetic',
      message: 'New column "Extra_Field" in sheet "2.3.4i" (will be captured in extra_json)',
      value: 'Extra_Field',
      sheet: '2.3.4i'
    }
  ]
}
```

**Drift rules**:

| Rule | Severity | Condition |
|---|---|---|
| Missing core column | `breaking` | A detail sheet (not in `skip_sheets`, present in xlsx) is missing a column from `core_cols` |
| Missing detail sheet | `breaking` | A sheet name in `metric_categories` (and not in `skip_sheets`) is absent from the xlsx |
| Unknown metric value | `silent_miss` | A metric value in the Summary sheet is not a key in `metric_categories` |
| Unknown sheet | `silent_miss` | An xlsx sheet is not in `skip_sheets` and not in `metric_categories` |
| New column in detail sheet | `cosmetic` | A detail sheet has columns not in `core_cols` |
| Stale metric category | `cosmetic` | A key in `metric_categories` does not appear in the Summary sheet's metric values |

### 4. Preview Endpoint Changes (`POST /api/compliance/preview`)

The existing `/preview` handler is modified to:

1. After receiving the uploaded file, spawn `extract_xlsx_schema.py` to get the xlsx schema.
2. Read `compliance_config.json` from disk.
3. Call `compareSchemaToDrift(schema, config)` to produce the drift report.
4. Proceed with the existing `parseXlsx()` call and `computeDiff()`.
5. Include `drift` (the DriftReport object) and optionally `drift_error` (string) in the response.
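The `compareSchemaToDrift` call in step 3 applies the six rules from the drift rules table. A minimal sketch of that comparison follows. It rests on two assumptions that the design does not spell out: the Summary sheet is located by its name, and a "detail sheet" is any xlsx sheet not listed in `skip_sheets`. The message wording is illustrative, not the final copy.

```javascript
// Sketch of compareSchemaToDrift(); message wording is illustrative only.
function compareSchemaToDrift(schema, config) {
  const report = { breaking: [], silent_miss: [], cosmetic: [] };
  const add = (severity, message, value, sheet = null) => {
    report[severity].push({ severity, message, value, sheet });
  };

  const skipSheets = new Set(config.skip_sheets);
  const coreCols = new Set(config.core_cols);
  const knownMetrics = new Set(Object.keys(config.metric_categories));
  const presentSheets = new Set(schema.sheets.map((s) => s.name));

  // Assumption: the Summary sheet is identified by name and is the only
  // sheet that carries metric_values.
  const summary = schema.sheets.find((s) => s.name === 'Summary');
  const summaryMetrics = new Set((summary && summary.metric_values) || []);

  // Assumption: a "detail sheet" is any xlsx sheet not listed in skip_sheets.
  for (const sheet of schema.sheets) {
    if (skipSheets.has(sheet.name)) continue;
    const columns = new Set(sheet.columns);

    for (const col of coreCols) {
      if (!columns.has(col)) {
        add('breaking', `Detail sheet "${sheet.name}" is missing core column "${col}"`, col, sheet.name);
      }
    }
    for (const col of columns) {
      if (!coreCols.has(col)) {
        add('cosmetic', `New column "${col}" in sheet "${sheet.name}"`, col, sheet.name);
      }
    }
    if (!knownMetrics.has(sheet.name)) {
      add('silent_miss', `Unknown sheet "${sheet.name}"`, sheet.name, sheet.name);
    }
  }

  for (const metricId of knownMetrics) {
    // Missing detail sheet: configured metric with no matching xlsx sheet.
    if (!skipSheets.has(metricId) && !presentSheets.has(metricId)) {
      add('breaking', `Detail sheet "${metricId}" is missing from the xlsx`, metricId);
    }
    // Stale metric category: configured metric not listed in Summary.
    if (!summaryMetrics.has(metricId)) {
      add('cosmetic', `Metric "${metricId}" is configured but absent from Summary`, metricId);
    }
  }
  for (const metric of summaryMetrics) {
    if (!knownMetrics.has(metric)) {
      add('silent_miss', `Unknown metric "${metric}" in Summary`, metric);
    }
  }
  return report;
}
```

Building a `Set` for each side keeps every rule a plain membership test, and it also collapses duplicate column headers so a repeated header yields one finding rather than several.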
If the schema extraction or drift check throws, set `drift: null` and set `drift_error` to the error message, then continue with the normal flow.

**Updated response shape**:

```json
{
  "drift": {
    "breaking": [],
    "silent_miss": [],
    "cosmetic": []
  },
  "drift_error": null,
  "diff": {
    "new_count": 5,
    "recurring_count": 120,
    "resolved_count": 3
  },
  "tempFile": "/path/to/temp.json",
  "filename": "NTS_AEO_2026_03_25.xlsx",
  "report_date": "2026-03-25",
  "total_items": 125
}
```

### 5. Upload Modal Changes (`ComplianceUploadModal.js`)

**New phase**: `drift-review` inserted between `uploading` and `preview`.

**Phase flow**:

```
idle → uploading → drift-review (if findings) → preview → committing → done
                 → preview (if no findings)
```

**Drift review UI**:

- Findings grouped by severity: breaking first, then silent-miss, then cosmetic.
- Each group has a header with severity label and count badge.
- Groups with more than 5 findings collapse with a "Show N more" toggle.
- Each finding shows the message text and the triggering value.
- Breaking findings: red text (`#EF4444`), red left-border accent.
- Silent-miss findings: amber text (`#F59E0B`), amber left-border accent.
- Cosmetic findings: muted text (`#94A3B8`), subtle left-border accent.
- "Cancel" button returns to idle. "Continue to Preview" button advances to diff preview.
- "Continue to Preview" is disabled when breaking findings exist, with a message explaining the block.
- When `drift` is `null` (drift check failed), skip drift-review and go straight to preview.
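The phase flow and grouping rules above reduce to a few pure functions that the modal could call after the preview response arrives. The sketch below is one possible shape. The function names (`nextPhaseAfterUpload`, `canContinueToPreview`, `buildDriftGroups`) are hypothetical, and showing the first five findings of a collapsed group is an assumption, since the design only says when a group collapses.

```javascript
// Hypothetical helpers for ComplianceUploadModal; names are illustrative.
const SEVERITY_ORDER = ['breaking', 'silent_miss', 'cosmetic'];
const COLLAPSE_THRESHOLD = 5;

function countFindings(drift) {
  return SEVERITY_ORDER.reduce((total, severity) => total + drift[severity].length, 0);
}

// Decide which phase follows `uploading`.
function nextPhaseAfterUpload(response) {
  const { drift } = response;
  // drift === null means the drift check failed: go straight to preview.
  if (!drift || countFindings(drift) === 0) return 'preview';
  return 'drift-review';
}

// "Continue to Preview" is enabled only when nothing is breaking.
function canContinueToPreview(drift) {
  return drift.breaking.length === 0;
}

// Build render-ready groups in severity order, omitting empty groups.
function buildDriftGroups(drift) {
  return SEVERITY_ORDER
    .filter((severity) => drift[severity].length > 0)
    .map((severity) => {
      const findings = drift[severity];
      const collapsed = findings.length > COLLAPSE_THRESHOLD;
      return {
        severity,
        count: findings.length,
        // Assumption: a collapsed group still shows its first five findings.
        visible: collapsed ? findings.slice(0, COLLAPSE_THRESHOLD) : findings,
        hiddenCount: collapsed ? findings.length - COLLAPSE_THRESHOLD : 0
      };
    });
}
```

Keeping these decisions outside the component makes the button state and the "Show N more" count testable without rendering the modal.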
## Data Models

### DriftFinding

```javascript
{
  severity: 'breaking' | 'silent_miss' | 'cosmetic',
  message: string,       // Human-readable description
  value: string,         // The specific column/sheet/metric that triggered the finding
  sheet: string|null     // Sheet name context (when applicable)
}
```

### DriftReport

```javascript
{
  breaking: DriftFinding[],
  silent_miss: DriftFinding[],
  cosmetic: DriftFinding[]
}
```

### ParserConfig

```javascript
{
  metric_categories: { [metricId: string]: string },  // metric ID → category name
  core_cols: string[],                                 // column names for main item fields
  skip_sheets: string[]                                // sheet names excluded from parsing
}
```

### XlsxSchema (output of extract_xlsx_schema.py)

```javascript
{
  sheets: [
    {
      name: string,
      columns: string[],
      metric_values?: string[]  // only present on Summary sheet
    }
  ]
}
```

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system; essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Breaking drift completeness

*For any* xlsx schema and parser config, the drift checker SHALL produce a breaking finding for every core column missing from every detail sheet, and for every detail sheet (present in `metric_categories` but not in `skip_sheets`) absent from the xlsx, and no other breaking findings. The set of breaking findings is exactly the union of missing-core-column findings and missing-detail-sheet findings.

**Validates: Requirements 3.1, 3.2, 3.3**

### Property 2: Silent-miss drift completeness

*For any* xlsx schema and parser config, the drift checker SHALL produce a silent-miss finding for every metric value in the Summary sheet not present in `metric_categories`, and for every xlsx sheet not in `skip_sheets` and not in `metric_categories`, and no other silent-miss findings.
The set of silent-miss findings is exactly the union of unknown-metric findings and unknown-sheet findings.

**Validates: Requirements 4.1, 4.2, 4.3**

### Property 3: Cosmetic drift completeness

*For any* xlsx schema and parser config, the drift checker SHALL produce a cosmetic finding for every column in a detail sheet not present in `core_cols`, and for every key in `metric_categories` not present in the Summary sheet's metric values, and no other cosmetic findings. The set of cosmetic findings is exactly the union of new-column findings and stale-metric findings.

**Validates: Requirements 5.1, 5.2, 5.3**

### Property 4: Drift severity ordering

*For any* drift report containing a mix of breaking, silent-miss, and cosmetic findings, the grouping function SHALL always return findings ordered by severity: all breaking findings first, then all silent-miss findings, then all cosmetic findings.

**Validates: Requirements 8.1**

## Error Handling

### Python Script Failures

| Failure | Handling |
|---|---|
| `extract_xlsx_schema.py` exits non-zero | Preview endpoint sets `drift: null` and `drift_error` to the error message, then continues with the normal parse flow |
| `extract_xlsx_schema.py` returns invalid JSON | Same as above: caught at `JSON.parse` and treated as a drift check failure |
| `compliance_config.json` missing or invalid (Node.js read) | Preview endpoint returns 500 with message "Configuration file could not be loaded" |
| `compliance_config.json` missing or invalid (Python parser read) | Parser exits non-zero, stderr describes the error, preview endpoint returns 500 with parse error |
| xlsx file cannot be opened by schema extractor | Schema extractor returns `{ "error": "..." }` on stdout, exits non-zero; drift check skipped gracefully |

### Frontend Error States

| Condition | Behavior |
|---|---|
| `drift` is `null` in preview response | Skip drift-review phase, proceed directly to diff preview |
| `drift_error` is present | Optionally display a subtle warning in the diff preview that drift check was skipped |
| Network error during upload | Existing error phase handling (unchanged) |

### Config File Validation

The Node.js config loader validates that:

- The file exists and is readable.
- The content parses as valid JSON.
- The parsed object contains `metric_categories` (object), `core_cols` (array), and `skip_sheets` (array).

If any check fails, the loader throws with a descriptive message. The preview handler catches this and returns a 500 response.

## Testing Strategy

### Unit Tests

**Drift checker (`driftChecker.js`)**:

- Breaking: missing core column produces finding with correct severity, message, value, and sheet.
- Breaking: missing detail sheet produces finding.
- Silent-miss: unknown metric value produces finding.
- Silent-miss: unknown sheet produces finding.
- Cosmetic: new column in detail sheet produces finding.
- Cosmetic: stale metric category produces finding.
- Empty schema (no sheets) produces appropriate findings.
- Config with empty metric_categories, core_cols, or skip_sheets.
- Schema and config that are perfectly aligned produce zero findings.

**Config loader**:

- Valid config file loads correctly.
- Missing file throws descriptive error.
- Invalid JSON throws descriptive error.
- Config missing required keys throws descriptive error.

**Frontend drift review component**:

- Drift review phase renders when findings exist.
- "Continue to Preview" button disabled when breaking findings present.
- "Continue to Preview" button enabled when no breaking findings.
- Groups with more than 5 findings collapse with the correct "Show N more" count.
- Cancel returns to idle phase.
- Skips drift review when drift is null or has no findings.
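As a reference point for the config loader tests listed above, the JSON and shape checks from Config File Validation could be written as a pure function over the file contents. This is a sketch under assumptions: the name `parseParserConfig` and the error wording are hypothetical, and reading the file from disk (where the missing-file case would be caught) is left to the caller.

```javascript
// Hypothetical validator for the text of compliance_config.json.
function parseParserConfig(text) {
  let config;
  try {
    config = JSON.parse(text);
  } catch (err) {
    throw new Error(`compliance_config.json is not valid JSON: ${err.message}`);
  }

  const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);

  if (!isPlainObject(config)) {
    throw new Error('compliance_config.json must contain a JSON object');
  }
  if (!isPlainObject(config.metric_categories)) {
    throw new Error('compliance_config.json: "metric_categories" must be an object');
  }
  if (!Array.isArray(config.core_cols)) {
    throw new Error('compliance_config.json: "core_cols" must be an array');
  }
  if (!Array.isArray(config.skip_sheets)) {
    throw new Error('compliance_config.json: "skip_sheets" must be an array');
  }
  return config;
}
```

Separating parsing from file access lets the invalid-JSON and missing-key tests run on inline strings, with only the missing-file test needing the filesystem.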
### Property-Based Tests

Property-based tests use `fast-check` (JavaScript) to verify the four correctness properties defined above. Each test generates random schema and config objects and verifies the drift checker output against the expected set-theoretic result.

**Configuration**:

- Minimum 100 iterations per property test.
- Each test tagged with: **Feature: compliance-schema-drift-check, Property {N}: {title}**

**Generators**:

- `arbitraryParserConfig`: generates random `metric_categories` (object with 0–20 string keys mapped to category strings), `core_cols` (array of 0–15 unique column name strings), `skip_sheets` (array of 0–5 unique sheet name strings).
- `arbitraryXlsxSchema`: generates random sheets array, each with a name, columns array, and optionally metric_values (for the Summary sheet). Sheet names, column names, and metric values drawn from a shared pool to ensure meaningful overlap with the config.

### Integration Tests

- Preview endpoint returns drift report alongside existing diff data.
- Preview endpoint returns 200 with breaking drift (does not error).
- Preview endpoint gracefully degrades when drift check fails (`drift: null`, `drift_error` present).
- Preview endpoint returns 500 when config file is missing.
- Python parser reads from `compliance_config.json` and produces same output as before.
- Commit endpoint is unchanged and does not reference drift.
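The degradation and 500 cases targeted by the integration tests above can be isolated in a small helper that takes its collaborators as arguments, so they can be stubbed. The sketch below is illustrative only: `runDriftCheck` and the `deps` fields are hypothetical names, and loading the config before extracting the schema is a simplification of the step order in the preview handler.

```javascript
// Hypothetical helper; deps are injected so tests can stub them.
//   deps.loadConfig()                -> ParserConfig, throws on a bad config
//   deps.extractSchema(path)         -> Promise<XlsxSchema>
//   deps.compareSchemaToDrift(s, c)  -> DriftReport
async function runDriftCheck(xlsxPath, deps) {
  // Config problems are not swallowed: they propagate so the route
  // handler can answer with a 500.
  const config = deps.loadConfig();

  try {
    const schema = await deps.extractSchema(xlsxPath);
    return {
      drift: deps.compareSchemaToDrift(schema, config),
      drift_error: null
    };
  } catch (err) {
    // Any extraction or comparison failure degrades to drift: null.
    return {
      drift: null,
      drift_error: err instanceof Error ? err.message : String(err)
    };
  }
}
```

The route handler would merge the returned `drift` and `drift_error` fields into the preview response after the existing parse and diff steps complete.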