Design Document: Compliance Schema Drift Check

Overview

This feature adds schema drift detection to the compliance xlsx upload flow. When a user uploads a weekly NTS_AEO report, the backend extracts the xlsx structural schema (sheet names, column headers, metric values) and compares it against a shared parser configuration file. The comparison produces a categorised drift report with three severity levels: breaking (blocks upload), silent-miss (warns but allows proceeding), and cosmetic (informational). The frontend displays these findings in a new drift review phase inside the upload modal, inserted between the upload spinner and the existing diff preview.

The parser configuration dicts (METRIC_CATEGORIES, CORE_COLS, SKIP_SHEETS) currently defined inline in parse_compliance_xlsx.py are extracted into a shared JSON file (backend/scripts/compliance_config.json) that both the Python parser and the Node.js drift checker read. This establishes a single source of truth for parser configuration.

Design Decisions

  1. Shared JSON config over database storage: The parser config is a developer-maintained mapping, not user data. A JSON file is version-controllable, diffable, and readable by both Python and Node.js without additional dependencies.

  2. Python subprocess for schema extraction: The existing dump_xlsx_schema.py already uses openpyxl to extract xlsx structure. We adapt this into a new extract_xlsx_schema.py script that the Node.js backend invokes as a subprocess, consistent with how parse_compliance_xlsx.py is already called.

  3. Node.js drift comparison logic: The drift comparison is pure object comparison (sets of strings) with no xlsx parsing. Implementing it in Node.js avoids a second Python subprocess call and keeps the logic co-located with the route handler.

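  4. Graceful degradation: If the drift check itself fails (for example, schema extraction errors out), the upload flow proceeds normally with drift: null and a drift_error message. A failed check must never block the existing workflow; only breaking findings from a successful check block the upload. The one exception is an unreadable compliance_config.json, which the parser also depends on and which results in a 500 response.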

Architecture

sequenceDiagram
    participant User
    participant Modal as ComplianceUploadModal
    participant API as POST /api/compliance/preview
    participant Schema as extract_xlsx_schema.py
    participant Drift as driftChecker (Node.js)
    participant Config as compliance_config.json
    participant Parser as parse_compliance_xlsx.py

    User->>Modal: Drops xlsx file
    Modal->>API: POST /preview (multipart)
    API->>Schema: spawn python3 extract_xlsx_schema.py <file>
    Schema-->>API: JSON { sheets: [...] }
    API->>Config: fs.readFileSync(compliance_config.json)
    API->>Drift: compareSchemaToDrift(schema, config)
    Drift-->>API: { breaking: [...], silent_miss: [...], cosmetic: [...] }
    API->>Parser: spawn python3 parse_compliance_xlsx.py <file>
    Parser->>Config: reads compliance_config.json
    Parser-->>API: JSON { items, summary, ... }
    API->>API: computeDiff(db, items)
    API-->>Modal: { drift, diff, tempFile, ... }
    alt drift has findings
        Modal->>User: Show drift review phase
        alt breaking findings exist
            Modal->>User: Block "Continue to Preview"
        else no breaking findings
            User->>Modal: Click "Continue to Preview"
            Modal->>User: Show diff preview
        end
    else no drift findings
        Modal->>User: Show diff preview directly
    end

File Layout

backend/
  scripts/
    compliance_config.json          # NEW — shared parser config (single source of truth)
    extract_xlsx_schema.py          # NEW — extracts xlsx structure as JSON
    parse_compliance_xlsx.py        # MODIFIED — reads config from JSON file
    dump_xlsx_schema.py             # UNCHANGED — standalone diagnostic tool
  routes/
    compliance.js                   # MODIFIED — drift check in /preview, new driftChecker module
  helpers/
    driftChecker.js                 # NEW — compareSchemaToDrift() function

frontend/
  src/components/pages/
    ComplianceUploadModal.js        # MODIFIED — new drift-review phase

Components and Interfaces

1. Shared Parser Configuration (compliance_config.json)

{
  "metric_categories": {
    "2.3.4i": "Vulnerability Management",
    "2.3.6i": "Vulnerability Management",
    "5.2.4": "Access & MFA"
  },
  "core_cols": [
    "Preferred - Hostname",
    "GRANITE - IPv4_Address",
    "GRANITE - Type",
    "Team",
    "Compliant",
    "Source_Network",
    "Vertical",
    "GRANITE - Equip_Inst_ID",
    "GRANITE - RESPONSIBLE_TEAM"
  ],
  "skip_sheets": ["Summary", "CMDB_9box", "Vulns", "Aging Dashboard"]
}

2. Schema Extractor (extract_xlsx_schema.py)

Input: File path as CLI argument.

Output (stdout JSON):

{
  "sheets": [
    {
      "name": "Summary",
      "columns": ["Metric", "Non-Compliant", "..."],
      "metric_values": ["2.3.4i", "5.2.4", "..."]
    },
    {
      "name": "2.3.4i",
      "columns": ["Preferred - Hostname", "GRANITE - IPv4_Address", "..."]
    }
  ]
}
  • Uses openpyxl in read-only mode.
  • Extracts sheet names, first-row column headers per sheet, and unique metric values from the Summary sheet (header at row 4, data from row 5 onward).
  • On error, prints { "error": "..." } to stdout and exits with a non-zero code.
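
A minimal sketch of how the Node.js side might invoke the extractor and handle the error contract described above. The function name runSchemaExtractor and its (command, args) parameters are hypothetical; taking the command as a parameter lets the sketch run without Python. The synchronous execFileSync call is used for brevity only; a real route handler would likely use the asynchronous variant.

```javascript
const { execFileSync } = require('child_process');

/**
 * Runs a schema-extractor command and returns its parsed JSON output.
 * In the design this would be: python3 extract_xlsx_schema.py <file>.
 */
function runSchemaExtractor(command, args) {
  let stdout;
  try {
    // execFileSync throws when the child exits with a non-zero code.
    stdout = execFileSync(command, args, {
      encoding: 'utf8',
      stdio: ['ignore', 'pipe', 'pipe'],
    });
  } catch (err) {
    // The extractor reports failures as { "error": "..." } on stdout.
    let detail = err.message;
    try {
      const parsed = JSON.parse(err.stdout || '');
      if (parsed && parsed.error) detail = parsed.error;
    } catch (_) {
      // stdout was not JSON; keep the process error message.
    }
    throw new Error(`Schema extraction failed: ${detail}`);
  }

  const schema = JSON.parse(stdout); // throws on invalid JSON
  if (!schema || !Array.isArray(schema.sheets)) {
    throw new Error('Schema extraction failed: missing "sheets" array');
  }
  return schema;
}
```

A caller would pass something like runSchemaExtractor('python3', [scriptPath, uploadedFilePath]), where both paths are supplied by the route handler.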

3. Drift Checker (backend/helpers/driftChecker.js)

Function: compareSchemaToDrift(schema, config) => DriftReport

Parameters:

  • schema — object returned by extract_xlsx_schema.py
  • config — object parsed from compliance_config.json

Returns (DriftReport):

{
  breaking: [
    { severity: 'breaking', message: 'Detail sheet "2.3.4i" is missing core column "Team"', value: 'Team', sheet: '2.3.4i' }
  ],
  silent_miss: [
    { severity: 'silent_miss', message: 'Unknown metric "9.1.2" in Summary — not in metric_categories', value: '9.1.2' }
  ],
  cosmetic: [
    { severity: 'cosmetic', message: 'New column "Extra_Field" in sheet "2.3.4i" — will be captured in extra_json', value: 'Extra_Field', sheet: '2.3.4i' }
  ]
}

Drift rules:

| Rule | Severity | Condition |
| --- | --- | --- |
| Missing core column | breaking | A detail sheet (not in skip_sheets, present in xlsx) is missing a column from core_cols |
| Missing detail sheet | breaking | A sheet name in metric_categories (and not in skip_sheets) is absent from the xlsx |
| Unknown metric value | silent_miss | A metric value in the Summary sheet is not a key in metric_categories |
| Unknown sheet | silent_miss | An xlsx sheet is not in skip_sheets and not in metric_categories |
| New column in detail sheet | cosmetic | A detail sheet has columns not in core_cols |
| Stale metric category | cosmetic | A key in metric_categories does not appear in the Summary sheet's metric values |
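
The six rules can be sketched as plain set comparisons. This is an illustration rather than the final implementation. It assumes that a "detail sheet" is any xlsx sheet not listed in skip_sheets, and that metric values are read from whichever sheet carries metric_values. The message wording is illustrative.

```javascript
/**
 * Sketch of compareSchemaToDrift(schema, config) applying the six drift rules.
 * Assumption: a detail sheet is any xlsx sheet that is not in skip_sheets.
 */
function compareSchemaToDrift(schema, config) {
  const report = { breaking: [], silent_miss: [], cosmetic: [] };
  const add = (severity, message, value, sheet = null) =>
    report[severity].push({ severity, message, value, sheet });

  const skip = new Set(config.skip_sheets);
  const core = new Set(config.core_cols);
  const known = new Set(Object.keys(config.metric_categories));
  const present = new Set(schema.sheets.map((s) => s.name));

  for (const sheet of schema.sheets) {
    if (skip.has(sheet.name)) continue;
    const cols = new Set(sheet.columns);

    // Missing core column (breaking)
    for (const col of config.core_cols) {
      if (!cols.has(col)) {
        add('breaking', `Detail sheet "${sheet.name}" is missing core column "${col}"`, col, sheet.name);
      }
    }
    // New column in detail sheet (cosmetic)
    for (const col of cols) {
      if (!core.has(col)) {
        add('cosmetic', `New column "${col}" in sheet "${sheet.name}"`, col, sheet.name);
      }
    }
    // Unknown sheet (silent_miss)
    if (!known.has(sheet.name)) {
      add('silent_miss', `Unknown sheet "${sheet.name}" is not in metric_categories`, sheet.name, sheet.name);
    }
  }

  // Missing detail sheet (breaking)
  for (const metric of known) {
    if (!skip.has(metric) && !present.has(metric)) {
      add('breaking', `Detail sheet "${metric}" is absent from the xlsx`, metric);
    }
  }

  // Metric values come from the sheet(s) that carry metric_values (the Summary sheet).
  const metricValues = new Set(schema.sheets.flatMap((s) => s.metric_values || []));

  // Unknown metric value (silent_miss)
  for (const value of metricValues) {
    if (!known.has(value)) {
      add('silent_miss', `Unknown metric "${value}" in Summary is not in metric_categories`, value);
    }
  }
  // Stale metric category (cosmetic)
  for (const metric of known) {
    if (!metricValues.has(metric)) {
      add('cosmetic', `Metric category "${metric}" does not appear in Summary`, metric);
    }
  }
  return report;
}
```

Because each rule is an independent loop over sets, the three finding arrays can be checked against set differences computed separately, which is what the correctness properties later in this document rely on.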

4. Preview Endpoint Changes (POST /api/compliance/preview)

The existing /preview handler is modified to:

  1. After receiving the uploaded file, spawn extract_xlsx_schema.py to get the xlsx schema.
  2. Read compliance_config.json from disk.
  3. Call compareSchemaToDrift(schema, config) to produce the drift report.
  4. Proceed with the existing parseXlsx() call and computeDiff().
  5. Include drift (the DriftReport object) and optionally drift_error (string) in the response.

If the schema extraction or drift check throws, set drift: null and drift_error: <message>, then continue with the normal flow.
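
The fallback behaviour can be sketched as a small wrapper around the check. The name runDriftCheck and the injected extractSchema and compare callbacks are hypothetical, used here so the sketch is self-contained. Config loading is assumed to happen outside this wrapper, since a missing or invalid config is reported as a 500 rather than degraded.

```javascript
/**
 * Runs the drift check and converts any failure into the degraded result
 * { drift: null, drift_error } so the caller can continue with parsing.
 * extractSchema(filePath) and compare(schema, config) are injected for illustration.
 */
function runDriftCheck(filePath, config, extractSchema, compare) {
  try {
    const schema = extractSchema(filePath);
    return { drift: compare(schema, config), drift_error: null };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { drift: null, drift_error: message };
  }
}
```

The handler would spread the returned object into its response next to diff and tempFile, so the response always carries both drift and drift_error.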

Updated response shape:

{
  "drift": {
    "breaking": [],
    "silent_miss": [],
    "cosmetic": []
  },
  "drift_error": null,
  "diff": { "new_count": 5, "recurring_count": 120, "resolved_count": 3 },
  "tempFile": "/path/to/temp.json",
  "filename": "NTS_AEO_2026_03_25.xlsx",
  "report_date": "2026-03-25",
  "total_items": 125
}

5. Upload Modal Changes (ComplianceUploadModal.js)

New phase: drift-review inserted between uploading and preview.

Phase flow:

idle → uploading → drift-review (if findings) → preview → committing → done
                 → preview (if no findings)
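
The branch after uploading could be expressed as a pure function of the /preview response. The helper names below are hypothetical. A null drift is treated the same as "no findings", as described for a failed drift check.

```javascript
/** Counts findings across all three severity groups. */
function countFindings(drift) {
  return drift.breaking.length + drift.silent_miss.length + drift.cosmetic.length;
}

/** Picks the phase that follows `uploading` for a /preview response. */
function phaseAfterUpload(response) {
  const { drift } = response;
  if (!drift || countFindings(drift) === 0) return 'preview';
  return 'drift-review';
}

/** "Continue to Preview" is only available when there are no breaking findings. */
function canContinueToPreview(drift) {
  return drift.breaking.length === 0;
}
```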

Drift review UI:

  • Findings grouped by severity: breaking first, then silent-miss, then cosmetic.
  • Each group has a header with severity label and count badge.
  • Groups with more than 5 findings collapse with a "Show N more" toggle.
  • Each finding shows the message text and the triggering value.
  • Breaking findings: red text (#EF4444), red left-border accent.
  • Silent-miss findings: amber text (#F59E0B), amber left-border accent.
  • Cosmetic findings: muted text (#94A3B8), subtle left-border accent.
  • "Cancel" button returns to idle. "Continue to Preview" button advances to diff preview.
  • "Continue to Preview" is disabled when breaking findings exist, with a message explaining the block.
  • When drift is null (drift check failed), skip drift-review and go straight to preview.
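
One possible shape for the collapse rule and the per-severity styling is sketched below. The names splitForDisplay and SEVERITY_STYLES are hypothetical. The sketch assumes that a collapsed group shows its first 5 findings and that N in "Show N more" is the number of hidden findings.

```javascript
const COLLAPSE_LIMIT = 5;

// Text colours taken from the drift review UI description.
const SEVERITY_STYLES = {
  breaking: { label: 'Breaking', color: '#EF4444' },
  silent_miss: { label: 'Silent-miss', color: '#F59E0B' },
  cosmetic: { label: 'Cosmetic', color: '#94A3B8' },
};

/**
 * Splits one severity group into the findings to render and the number
 * hidden behind the "Show N more" toggle.
 */
function splitForDisplay(findings, expanded = false, limit = COLLAPSE_LIMIT) {
  if (expanded || findings.length <= limit) {
    return { visible: findings, hiddenCount: 0 };
  }
  return {
    visible: findings.slice(0, limit),
    hiddenCount: findings.length - limit,
  };
}
```

A group of exactly 5 findings is shown in full; a group of 7 shows 5 and offers "Show 2 more".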

Data Models

DriftFinding

{
  severity: 'breaking' | 'silent_miss' | 'cosmetic',
  message: string,    // Human-readable description
  value: string,      // The specific column/sheet/metric that triggered the finding
  sheet: string|null   // Sheet name context; null or omitted when no sheet applies
}

DriftReport

{
  breaking: DriftFinding[],
  silent_miss: DriftFinding[],
  cosmetic: DriftFinding[]
}

ParserConfig

{
  metric_categories: { [metricId: string]: string },  // metric ID → category name
  core_cols: string[],                                  // column names for main item fields
  skip_sheets: string[]                                 // sheet names excluded from parsing
}

XlsxSchema (output of extract_xlsx_schema.py)

{
  sheets: [
    {
      name: string,
      columns: string[],
      metric_values?: string[]  // only present on Summary sheet
    }
  ]
}

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Breaking drift completeness

For any xlsx schema and parser config, the drift checker SHALL produce a breaking finding for every core column missing from every detail sheet, and for every detail sheet (present in metric_categories but not in skip_sheets) absent from the xlsx — and no other breaking findings. The set of breaking findings is exactly the union of missing-core-column findings and missing-detail-sheet findings.

Validates: Requirements 3.1, 3.2, 3.3

Property 2: Silent-miss drift completeness

For any xlsx schema and parser config, the drift checker SHALL produce a silent-miss finding for every metric value in the Summary sheet not present in metric_categories, and for every xlsx sheet not in skip_sheets and not in metric_categories — and no other silent-miss findings. The set of silent-miss findings is exactly the union of unknown-metric findings and unknown-sheet findings.

Validates: Requirements 4.1, 4.2, 4.3

Property 3: Cosmetic drift completeness

For any xlsx schema and parser config, the drift checker SHALL produce a cosmetic finding for every column in a detail sheet not present in core_cols, and for every key in metric_categories not present in the Summary sheet's metric values — and no other cosmetic findings. The set of cosmetic findings is exactly the union of new-column findings and stale-metric findings.

Validates: Requirements 5.1, 5.2, 5.3

Property 4: Drift severity ordering

For any drift report containing a mix of breaking, silent-miss, and cosmetic findings, the grouping function SHALL always return findings ordered by severity: all breaking findings first, then all silent-miss findings, then all cosmetic findings.

Validates: Requirements 8.1
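
As an illustration of Property 4, the grouping function could flatten a DriftReport in a fixed severity order, and a small predicate could check the ordering. The names orderFindings and isOrderedBySeverity are hypothetical.

```javascript
const SEVERITY_ORDER = ['breaking', 'silent_miss', 'cosmetic'];

/** Flattens a DriftReport into one list: breaking, then silent_miss, then cosmetic. */
function orderFindings(report) {
  return SEVERITY_ORDER.flatMap((severity) => report[severity] || []);
}

/** True when no finding appears before a finding of higher severity. */
function isOrderedBySeverity(findings) {
  const ranks = findings.map((f) => SEVERITY_ORDER.indexOf(f.severity));
  return ranks.every((rank, i) => i === 0 || ranks[i - 1] <= rank);
}
```

A property test would generate arbitrary reports and assert isOrderedBySeverity(orderFindings(report)).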

Error Handling

Python Script Failures

| Failure | Handling |
| --- | --- |
| extract_xlsx_schema.py exits non-zero | Preview endpoint sets drift: null, drift_error: <stderr message>, continues with normal parse flow |
| extract_xlsx_schema.py returns invalid JSON | Same as above: caught in JSON.parse, treated as drift check failure |
| compliance_config.json missing or invalid (Node.js read) | Preview endpoint returns 500 with message "Configuration file could not be loaded" |
| compliance_config.json missing or invalid (Python parser read) | Parser exits non-zero, stderr describes the error, preview endpoint returns 500 with parse error |
| xlsx file cannot be opened by schema extractor | Schema extractor returns { "error": "..." } on stdout, exits non-zero; drift check skipped gracefully |

Frontend Error States

| Condition | Behavior |
| --- | --- |
| drift is null in preview response | Skip drift-review phase, proceed directly to diff preview |
| drift_error is present | Optionally display a subtle warning in the diff preview that the drift check was skipped |
| Network error during upload | Existing error phase handling (unchanged) |

Config File Validation

The Node.js config loader validates that:

  • The file exists and is readable.
  • The content parses as valid JSON.
  • The parsed object contains metric_categories (object), core_cols (array), and skip_sheets (array).

If any check fails, the loader throws with a descriptive message. The preview handler catches this and returns a 500 response.
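
A minimal sketch of a loader performing the three checks listed above. The function names loadParserConfig and validateParserConfig and the exact error wording are assumptions.

```javascript
const fs = require('fs');

function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

/** Checks that the parsed config has the three required keys with the right types. */
function validateParserConfig(config) {
  if (!isPlainObject(config)) {
    throw new Error('Config must be a JSON object');
  }
  if (!isPlainObject(config.metric_categories)) {
    throw new Error('Config key "metric_categories" must be an object');
  }
  for (const key of ['core_cols', 'skip_sheets']) {
    if (!Array.isArray(config[key])) {
      throw new Error(`Config key "${key}" must be an array`);
    }
  }
  return config;
}

/** Reads, parses, and validates the config file, throwing a descriptive error on failure. */
function loadParserConfig(configPath) {
  let raw;
  try {
    raw = fs.readFileSync(configPath, 'utf8');
  } catch (err) {
    throw new Error(`Cannot read config file ${configPath}: ${err.message}`);
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`Config file ${configPath} is not valid JSON: ${err.message}`);
  }
  return validateParserConfig(parsed);
}
```

Keeping validation separate from file reading lets the unit tests exercise the shape checks without touching the filesystem.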

Testing Strategy

Unit Tests

Drift checker (driftChecker.js):

  • Breaking: missing core column produces finding with correct severity, message, value, and sheet.
  • Breaking: missing detail sheet produces finding.
  • Silent-miss: unknown metric value produces finding.
  • Silent-miss: unknown sheet produces finding.
  • Cosmetic: new column in detail sheet produces finding.
  • Cosmetic: stale metric category produces finding.
  • Empty schema (no sheets) produces appropriate findings.
  • Config with empty metric_categories, core_cols, or skip_sheets.
  • Schema and config that are perfectly aligned produce zero findings.

Config loader:

  • Valid config file loads correctly.
  • Missing file throws descriptive error.
  • Invalid JSON throws descriptive error.
  • Config missing required keys throws descriptive error.

Frontend drift review component:

  • Drift review phase renders when findings exist.
  • "Continue to Preview" button disabled when breaking findings present.
  • "Continue to Preview" button enabled when no breaking findings.
  • Groups with more than 5 findings collapse and show the correct "Show N more" count.
  • Cancel returns to idle phase.
  • Skips drift review when drift is null or has no findings.

Property-Based Tests

Property-based tests use fast-check (JavaScript) to verify the four correctness properties defined above. Each test generates random schema and config objects and verifies the drift checker output against the expected set-theoretic result.

Configuration:

  • Minimum 100 iterations per property test.
  • Each test tagged with: Feature: compliance-schema-drift-check, Property {N}: {title}

Generators:

  • arbitraryParserConfig: generates random metric_categories (object with 0 to 20 string keys mapped to category strings), core_cols (array of 0 to 15 unique column name strings), skip_sheets (array of 0 to 5 unique sheet name strings).
  • arbitraryXlsxSchema: generates random sheets array, each with a name, columns array, and optionally metric_values (for the Summary sheet). Sheet names, column names, and metric values drawn from a shared pool to ensure meaningful overlap with the config.
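
The shared-pool idea can be illustrated without any dependency. The sketch below is a stand-in for the fast-check arbitraries, not fast-check API: it uses a small seeded PRNG (mulberry32) and a hypothetical POOL of names. Because the pool here has only 9 entries, the upper bounds of 20 and 15 are never reached; the real generators would draw from a larger pool.

```javascript
// Small deterministic PRNG (mulberry32) so a failing case can be replayed from its seed.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical shared pool: config and schema both draw from it, so they overlap.
const POOL = ['2.3.4i', '2.3.6i', '5.2.4', '9.1.2', 'Summary', 'Team', 'Compliant', 'Vertical', 'Extra_Field'];

/** Picks between 0 and `max` unique names from the pool. */
function pickUnique(rand, max) {
  const count = Math.floor(rand() * (Math.min(max, POOL.length) + 1));
  const shuffled = [...POOL];
  // Fisher-Yates shuffle driven by the seeded PRNG.
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled.slice(0, count);
}

function arbitraryParserConfig(rand) {
  const metric_categories = {};
  for (const key of pickUnique(rand, 20)) metric_categories[key] = `Category ${key}`;
  return {
    metric_categories,
    core_cols: pickUnique(rand, 15),
    skip_sheets: pickUnique(rand, 5),
  };
}

function arbitraryXlsxSchema(rand) {
  const sheets = pickUnique(rand, POOL.length).map((name) => {
    const sheet = { name, columns: pickUnique(rand, POOL.length) };
    // metric_values is only present on the Summary sheet.
    if (name === 'Summary') sheet.metric_values = pickUnique(rand, POOL.length);
    return sheet;
  });
  return { sheets };
}
```

A property run would loop over at least 100 seeds, build a config and a schema from each, and compare the drift checker output with the expected set differences.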

Integration Tests

  • Preview endpoint returns drift report alongside existing diff data.
  • Preview endpoint returns 200 with breaking drift (does not error).
  • Preview endpoint gracefully degrades when drift check fails (drift: null, drift_error present).
  • Preview endpoint returns 500 when config file is missing.
  • Python parser reads from compliance_config.json and produces same output as before.
  • Commit endpoint is unchanged and does not reference drift.