
# Design Document: Compliance Schema Drift Check

## Overview

This feature adds schema drift detection to the compliance xlsx upload flow. When a user uploads a weekly NTS_AEO report, the backend extracts the xlsx structural schema (sheet names, column headers, metric values) and compares it against a shared parser configuration file. The comparison produces a categorised drift report with three severity levels: breaking (blocks upload), silent-miss (warns but allows proceeding), and cosmetic (informational). The frontend displays these findings in a new drift review phase inside the upload modal, inserted between the upload spinner and the existing diff preview.

The parser configuration dicts (`METRIC_CATEGORIES`, `CORE_COLS`, `SKIP_SHEETS`) currently defined inline in `parse_compliance_xlsx.py` are extracted into a shared JSON file (`backend/scripts/compliance_config.json`) that both the Python parser and the Node.js drift checker read. This establishes a single source of truth for parser configuration.

### Design Decisions

1. **Shared JSON config over database storage**: The parser config is a developer-maintained mapping, not user data. A JSON file is version-controllable, diffable, and readable by both Python and Node.js without additional dependencies.

2. **Python subprocess for schema extraction**: The existing `dump_xlsx_schema.py` already uses openpyxl to extract xlsx structure. We adapt this into a new `extract_xlsx_schema.py` script that the Node.js backend invokes as a subprocess, consistent with how `parse_compliance_xlsx.py` is already called.

3. **Node.js drift comparison logic**: The drift comparison is pure object comparison (sets of strings) with no xlsx parsing. Implementing it in Node.js avoids an additional Python subprocess call beyond schema extraction and parsing, and keeps the logic co-located with the route handler.

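4. **Graceful degradation**: If schema extraction or the drift comparison fails, the upload flow proceeds normally with `drift: null` and a `drift_error` message. The drift check is additive: a failure of the check itself must never block the existing workflow; only breaking findings from a successful check block the upload. The one exception is a missing or invalid `compliance_config.json`, which returns a 500 because the parser depends on the same file.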

## Architecture

```mermaid
sequenceDiagram
    participant User
    participant Modal as ComplianceUploadModal
    participant API as POST /api/compliance/preview
    participant Schema as extract_xlsx_schema.py
    participant Drift as driftChecker (Node.js)
    participant Config as compliance_config.json
    participant Parser as parse_compliance_xlsx.py

    User->>Modal: Drops xlsx file
    Modal->>API: POST /preview (multipart)
    API->>Schema: spawn python3 extract_xlsx_schema.py <file>
    Schema-->>API: JSON { sheets: [...] }
    API->>Config: fs.readFileSync(compliance_config.json)
    API->>Drift: compareSchemaToDrift(schema, config)
    Drift-->>API: { breaking: [...], silent_miss: [...], cosmetic: [...] }
    API->>Parser: spawn python3 parse_compliance_xlsx.py <file>
    Parser->>Config: reads compliance_config.json
    Parser-->>API: JSON { items, summary, ... }
    API->>API: computeDiff(db, items)
    API-->>Modal: { drift, diff, tempFile, ... }
    alt drift has findings
        Modal->>User: Show drift review phase
        alt breaking findings exist
            Modal->>User: Block "Continue to Preview"
        else no breaking findings
            User->>Modal: Click "Continue to Preview"
            Modal->>User: Show diff preview
        end
    else no drift findings
        Modal->>User: Show diff preview directly
    end
```

### File Layout

```
backend/
  scripts/
    compliance_config.json          # NEW — shared parser config (single source of truth)
    extract_xlsx_schema.py          # NEW — extracts xlsx structure as JSON
    parse_compliance_xlsx.py        # MODIFIED — reads config from JSON file
    dump_xlsx_schema.py             # UNCHANGED — standalone diagnostic tool
  routes/
    compliance.js                   # MODIFIED — drift check in /preview via helpers/driftChecker.js
  helpers/
    driftChecker.js                 # NEW — compareSchemaToDrift() function

frontend/
  src/components/pages/
    ComplianceUploadModal.js        # MODIFIED — new drift-review phase
```

## Components and Interfaces

### 1. Shared Parser Configuration (`compliance_config.json`)

```json
{
  "metric_categories": {
    "2.3.4i": "Vulnerability Management",
    "2.3.6i": "Vulnerability Management",
    "5.2.4": "Access & MFA"
  },
  "core_cols": [
    "Preferred - Hostname",
    "GRANITE - IPv4_Address",
    "GRANITE - Type",
    "Team",
    "Compliant",
    "Source_Network",
    "Vertical",
    "GRANITE - Equip_Inst_ID",
    "GRANITE - RESPONSIBLE_TEAM"
  ],
  "skip_sheets": ["Summary", "CMDB_9box", "Vulns", "Aging Dashboard"]
}
```

### 2. Schema Extractor (`extract_xlsx_schema.py`)

**Input**: File path as CLI argument.

**Output** (stdout JSON):
```json
{
  "sheets": [
    {
      "name": "Summary",
      "columns": ["Metric", "Non-Compliant", "..."],
      "metric_values": ["2.3.4i", "5.2.4", "..."]
    },
    {
      "name": "2.3.4i",
      "columns": ["Preferred - Hostname", "GRANITE - IPv4_Address", "..."]
    }
  ]
}
```

- Uses openpyxl in read-only mode.
- Extracts sheet names, first-row column headers per sheet, and unique metric values from the Summary sheet (header at row 4, data from row 5 onward).
- On error, returns `{ "error": "..." }` on stdout and exits with non-zero code.
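
On the Node.js side, this contract could be consumed roughly as follows. This is a minimal sketch, not the actual route code: the helper names `parseExtractorOutput` and `extractXlsxSchema` are hypothetical, and the script path is passed in by the caller. Because the extractor reports failures as `{ "error": "..." }` on stdout, the sketch inspects stdout before falling back to the exit code.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: interprets stdout and the exit code according to the
// extractor contract. Returns the schema object or throws an Error.
function parseExtractorOutput(stdout, exitCode) {
  let parsed = null;
  try {
    parsed = JSON.parse(stdout);
  } catch (err) {
    // Unparseable stdout with a zero exit code is a contract violation.
    if (exitCode === 0) throw new Error('Schema extractor returned invalid JSON');
  }
  if (exitCode !== 0) {
    const detail = parsed && typeof parsed.error === 'string'
      ? parsed.error
      : `exit code ${exitCode}`;
    throw new Error(`Schema extraction failed: ${detail}`);
  }
  if (!parsed || !Array.isArray(parsed.sheets)) {
    throw new Error('Schema extractor output is missing the "sheets" array');
  }
  return parsed;
}

// Hypothetical wrapper: runs the script with python3 and resolves to the schema.
function extractXlsxSchema(scriptPath, xlsxPath) {
  return new Promise((resolve, reject) => {
    execFile('python3', [scriptPath, xlsxPath], (err, stdout) => {
      // err.code is the numeric exit code when the process exits non-zero;
      // spawn failures (for example ENOENT) carry a string code instead.
      const exitCode = err ? (typeof err.code === 'number' ? err.code : 1) : 0;
      try {
        resolve(parseExtractorOutput(stdout, exitCode));
      } catch (parseErr) {
        reject(parseErr);
      }
    });
  });
}
```

Keeping the interpretation of stdout in a pure function means it can be unit-tested without a Python runtime.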

### 3. Drift Checker (`backend/helpers/driftChecker.js`)

**Function**: `compareSchemaToDrift(schema, config) => DriftReport`

**Parameters**:
- `schema` — object returned by `extract_xlsx_schema.py`
- `config` — object parsed from `compliance_config.json`

**Returns** (`DriftReport`):
```javascript
{
  breaking: [
    { severity: 'breaking', message: 'Detail sheet "2.3.4i" is missing core column "Team"', value: 'Team', sheet: '2.3.4i' }
  ],
  silent_miss: [
    { severity: 'silent_miss', message: 'Unknown metric "9.1.2" in Summary — not in metric_categories', value: '9.1.2' }
  ],
  cosmetic: [
    { severity: 'cosmetic', message: 'New column "Extra_Field" in sheet "2.3.4i" — will be captured in extra_json', value: 'Extra_Field', sheet: '2.3.4i' }
  ]
}
```

**Drift rules**:

| Rule | Severity | Condition |
|---|---|---|
| Missing core column | `breaking` | A detail sheet (not in `skip_sheets`, present in xlsx) is missing a column from `core_cols` |
| Missing detail sheet | `breaking` | A sheet name in `metric_categories` (and not in `skip_sheets`) is absent from the xlsx |
| Unknown metric value | `silent_miss` | A metric value in the Summary sheet is not a key in `metric_categories` |
| Unknown sheet | `silent_miss` | An xlsx sheet is not in `skip_sheets` and not in `metric_categories` |
| New column in detail sheet | `cosmetic` | A detail sheet has columns not in `core_cols` |
| Stale metric category | `cosmetic` | A key in `metric_categories` does not appear in the Summary sheet's metric values |
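
The six rules in the table can be sketched as plain set comparisons. The version below is illustrative rather than the final implementation, and it rests on two labeled assumptions: a "detail sheet" is taken to be any xlsx sheet not listed in `skip_sheets`, and Summary metric values are read from whichever sheet carries `metric_values`. Message wording is also illustrative.

```javascript
// Sketch of compareSchemaToDrift(schema, config); see assumptions in comments.
function compareSchemaToDrift(schema, config) {
  const report = { breaking: [], silent_miss: [], cosmetic: [] };
  const add = (severity, message, value, sheet = null) =>
    report[severity].push({ severity, message, value, sheet });

  const skip = new Set(config.skip_sheets);
  const coreCols = new Set(config.core_cols);
  const knownMetrics = new Set(Object.keys(config.metric_categories));
  const sheetNames = new Set(schema.sheets.map((s) => s.name));

  // Assumption: metric values live on the sheet(s) that carry metric_values.
  const summaryMetrics = new Set(
    schema.sheets.flatMap((s) => s.metric_values || [])
  );

  // Assumption: a detail sheet is any xlsx sheet not listed in skip_sheets.
  for (const sheet of schema.sheets.filter((s) => !skip.has(s.name))) {
    const cols = new Set(sheet.columns);
    for (const col of coreCols) {
      if (!cols.has(col)) {
        add('breaking', `Detail sheet "${sheet.name}" is missing core column "${col}"`, col, sheet.name);
      }
    }
    for (const col of cols) {
      if (!coreCols.has(col)) {
        add('cosmetic', `New column "${col}" in sheet "${sheet.name}"`, col, sheet.name);
      }
    }
    if (!knownMetrics.has(sheet.name)) {
      add('silent_miss', `Unknown sheet "${sheet.name}"`, sheet.name, sheet.name);
    }
  }

  for (const metric of knownMetrics) {
    if (!skip.has(metric) && !sheetNames.has(metric)) {
      add('breaking', `Detail sheet "${metric}" is absent from the xlsx`, metric, metric);
    }
    if (!summaryMetrics.has(metric)) {
      add('cosmetic', `Metric "${metric}" is configured but absent from Summary`, metric);
    }
  }

  for (const metric of summaryMetrics) {
    if (!knownMetrics.has(metric)) {
      add('silent_miss', `Unknown metric "${metric}" in Summary`, metric);
    }
  }
  return report;
}
```

Under these assumptions an unknown sheet is also checked for core columns, so a single unexpected sheet can produce both a silent-miss and breaking findings. If that is not the intended behavior, the detail-sheet filter is the place to narrow it.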

### 4. Preview Endpoint Changes (`POST /api/compliance/preview`)

The existing `/preview` handler is modified to:

1. After receiving the uploaded file, spawn `extract_xlsx_schema.py` to get the xlsx schema.
2. Read `compliance_config.json` from disk.
3. Call `compareSchemaToDrift(schema, config)` to produce the drift report.
4. Proceed with the existing `parseXlsx()` call and `computeDiff()`.
5. Include `drift` (the DriftReport object) and optionally `drift_error` (string) in the response.

If the schema extraction or drift check throws, set `drift: null` and `drift_error: <message>`, then continue with the normal flow.
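
The degradation rule above can be isolated in a small wrapper so the route handler never has to wrap the drift check in its own `try`/`catch`. This is a sketch under assumptions: `runDriftCheck` is a hypothetical name, and the extraction and comparison functions are injected as parameters so the example stays self-contained.

```javascript
// Hypothetical wrapper: runs schema extraction and comparison, never throws.
// extractSchema: () => Promise<schema>; compare: (schema, config) => DriftReport.
async function runDriftCheck(extractSchema, compare, config) {
  try {
    const schema = await extractSchema();
    return { drift: compare(schema, config), drift_error: null };
  } catch (err) {
    // Any failure degrades to "no drift information" plus a message.
    const message = err instanceof Error ? err.message : String(err);
    return { drift: null, drift_error: message };
  }
}
```

The handler could then merge the returned object into the response next to `diff` and `tempFile`. Loading the config stays outside this wrapper, since a missing or invalid config is reported as a 500 rather than degraded.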

**Updated response shape**:
```json
{
  "drift": {
    "breaking": [],
    "silent_miss": [],
    "cosmetic": []
  },
  "drift_error": null,
  "diff": { "new_count": 5, "recurring_count": 120, "resolved_count": 3 },
  "tempFile": "/path/to/temp.json",
  "filename": "NTS_AEO_2026_03_25.xlsx",
  "report_date": "2026-03-25",
  "total_items": 125
}
```

### 5. Upload Modal Changes (`ComplianceUploadModal.js`)

**New phase**: `drift-review` inserted between `uploading` and `preview`.

**Phase flow**:
```
idle → uploading → drift-review (if findings) → preview → committing → done
                 → preview (if no findings)
```

**Drift review UI**:
- Findings grouped by severity: breaking first, then silent-miss, then cosmetic.
- Each group has a header with severity label and count badge.
- Groups with more than 5 findings collapse with a "Show N more" toggle.
- Each finding shows the message text and the triggering value.
- Breaking findings: red text (`#EF4444`), red left-border accent.
- Silent-miss findings: amber text (`#F59E0B`), amber left-border accent.
- Cosmetic findings: muted text (`#94A3B8`), subtle left-border accent.
- "Cancel" button returns to idle. "Continue to Preview" button advances to diff preview.
- "Continue to Preview" is disabled when breaking findings exist, with a message explaining the block.
- When `drift` is `null` (drift check failed), skip drift-review and go straight to preview.
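
The phase transition and button rules above reduce to a few pure functions that the modal could call. The sketch below uses hypothetical helper names, and it assumes a collapsed group shows its first five findings, which the list above does not specify.

```javascript
const COLLAPSE_THRESHOLD = 5;

// Total number of findings across all severities; a null report counts as 0.
function countFindings(drift) {
  if (!drift) return 0;
  return drift.breaking.length + drift.silent_miss.length + drift.cosmetic.length;
}

// Phase to enter once the /preview response arrives.
function phaseAfterUpload(drift) {
  return countFindings(drift) > 0 ? 'drift-review' : 'preview';
}

// "Continue to Preview" is enabled only when there are no breaking findings.
function canContinueToPreview(drift) {
  return !drift || drift.breaking.length === 0;
}

// Splits one severity group into the visible findings and the count shown in
// the "Show N more" toggle. Assumption: the first five stay visible.
function collapseGroup(findings, expanded = false) {
  if (expanded || findings.length <= COLLAPSE_THRESHOLD) {
    return { visible: findings, hiddenCount: 0 };
  }
  return {
    visible: findings.slice(0, COLLAPSE_THRESHOLD),
    hiddenCount: findings.length - COLLAPSE_THRESHOLD,
  };
}
```

Keeping these decisions outside the component makes the unit tests for the modal independent of rendering.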

## Data Models

### DriftFinding

```javascript
{
  severity: 'breaking' | 'silent_miss' | 'cosmetic',
  message: string,    // Human-readable description
  value: string,      // The specific column/sheet/metric that triggered the finding
  sheet: string|null   // Sheet name context (when applicable)
}
```

### DriftReport

```javascript
{
  breaking: DriftFinding[],
  silent_miss: DriftFinding[],
  cosmetic: DriftFinding[]
}
```

### ParserConfig

```javascript
{
  metric_categories: { [metricId: string]: string },  // metric ID → category name
  core_cols: string[],                                  // column names for main item fields
  skip_sheets: string[]                                 // sheet names excluded from parsing
}
```

### XlsxSchema (output of extract_xlsx_schema.py)

```javascript
{
  sheets: [
    {
      name: string,
      columns: string[],
      metric_values?: string[]  // only present on Summary sheet
    }
  ]
}
```


## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Breaking drift completeness

*For any* xlsx schema and parser config, the drift checker SHALL produce a breaking finding for every core column missing from every detail sheet, and for every detail sheet (present in `metric_categories` but not in `skip_sheets`) absent from the xlsx — and no other breaking findings. The set of breaking findings is exactly the union of missing-core-column findings and missing-detail-sheet findings.

**Validates: Requirements 3.1, 3.2, 3.3**

### Property 2: Silent-miss drift completeness

*For any* xlsx schema and parser config, the drift checker SHALL produce a silent-miss finding for every metric value in the Summary sheet not present in `metric_categories`, and for every xlsx sheet not in `skip_sheets` and not in `metric_categories` — and no other silent-miss findings. The set of silent-miss findings is exactly the union of unknown-metric findings and unknown-sheet findings.

**Validates: Requirements 4.1, 4.2, 4.3**

### Property 3: Cosmetic drift completeness

*For any* xlsx schema and parser config, the drift checker SHALL produce a cosmetic finding for every column in a detail sheet not present in `core_cols`, and for every key in `metric_categories` not present in the Summary sheet's metric values — and no other cosmetic findings. The set of cosmetic findings is exactly the union of new-column findings and stale-metric findings.

**Validates: Requirements 5.1, 5.2, 5.3**

### Property 4: Drift severity ordering

*For any* drift report containing a mix of breaking, silent-miss, and cosmetic findings, the grouping function SHALL always return findings ordered by severity: all breaking findings first, then all silent-miss findings, then all cosmetic findings.

**Validates: Requirements 8.1**
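
One possible shape for the grouping function that Property 4 constrains is shown below. The names `groupBySeverity` and `isSeverityOrdered` are hypothetical, and dropping empty groups is an assumption rather than a stated requirement.

```javascript
const SEVERITY_ORDER = ['breaking', 'silent_miss', 'cosmetic'];

// Returns non-empty groups in fixed severity order.
function groupBySeverity(report) {
  return SEVERITY_ORDER
    .map((severity) => ({ severity, findings: report[severity] || [] }))
    .filter((group) => group.findings.length > 0);
}

// Predicate a property test could assert: group severities never go backwards.
function isSeverityOrdered(groups) {
  const ranks = groups.map((g) => SEVERITY_ORDER.indexOf(g.severity));
  return ranks.every((rank, i) => rank >= 0 && (i === 0 || ranks[i - 1] < rank));
}
```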

## Error Handling

### Script and Config Failures

| Failure | Handling |
|---|---|
| `extract_xlsx_schema.py` exits non-zero | Preview endpoint sets `drift: null`, `drift_error: <error message>`, continues with normal parse flow |
| `extract_xlsx_schema.py` returns invalid JSON | Same as above — caught in JSON.parse, treated as drift check failure |
| `compliance_config.json` missing or invalid (Node.js read) | Preview endpoint returns 500 with message "Configuration file could not be loaded" |
| `compliance_config.json` missing or invalid (Python parser read) | Parser exits non-zero, stderr describes the error, preview endpoint returns 500 with parse error |
| xlsx file cannot be opened by schema extractor | Schema extractor returns `{ "error": "..." }` on stdout, exits non-zero; drift check skipped gracefully |

### Frontend Error States

| Condition | Behavior |
|---|---|
| `drift` is `null` in preview response | Skip drift-review phase, proceed directly to diff preview |
| `drift_error` is present | Optionally display a subtle warning in the diff preview that drift check was skipped |
| Network error during upload | Existing error phase handling (unchanged) |

### Config File Validation

The Node.js config loader validates that:
- The file exists and is readable.
- The content parses as valid JSON.
- The parsed object contains `metric_categories` (object), `core_cols` (array), and `skip_sheets` (array).

If any check fails, the loader throws with a descriptive message. The preview handler catches this and returns a 500 response.
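
The three checks could be implemented along these lines. This is a sketch only: `loadParserConfig` and `validateParserConfig` are hypothetical names, and the error messages are illustrative.

```javascript
const fs = require('fs');

const isPlainObject = (v) =>
  v !== null && typeof v === 'object' && !Array.isArray(v);

// Checks the shape of an already-parsed config; throws with a descriptive message.
function validateParserConfig(config) {
  if (!isPlainObject(config)) {
    throw new Error('Compliance config must be a JSON object');
  }
  if (!isPlainObject(config.metric_categories)) {
    throw new Error('Compliance config: "metric_categories" must be an object');
  }
  for (const key of ['core_cols', 'skip_sheets']) {
    if (!Array.isArray(config[key])) {
      throw new Error(`Compliance config: "${key}" must be an array`);
    }
  }
  return config;
}

// Reads, parses, and validates the config file at configPath.
function loadParserConfig(configPath) {
  let raw;
  try {
    raw = fs.readFileSync(configPath, 'utf8');
  } catch (err) {
    throw new Error(`Cannot read ${configPath}: ${err.message}`);
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`Invalid JSON in ${configPath}: ${err.message}`);
  }
  return validateParserConfig(parsed);
}
```

Separating shape validation from file access lets the unit tests listed under the config loader exercise each failure mode independently.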

## Testing Strategy

### Unit Tests

**Drift checker (`driftChecker.js`)**:
- Breaking: missing core column produces finding with correct severity, message, value, and sheet.
- Breaking: missing detail sheet produces finding.
- Silent-miss: unknown metric value produces finding.
- Silent-miss: unknown sheet produces finding.
- Cosmetic: new column in detail sheet produces finding.
- Cosmetic: stale metric category produces finding.
- Empty schema (no sheets) produces appropriate findings.
- Config with empty metric_categories, core_cols, or skip_sheets.
- Schema and config that are perfectly aligned produce zero findings.

**Config loader**:
- Valid config file loads correctly.
- Missing file throws descriptive error.
- Invalid JSON throws descriptive error.
- Config missing required keys throws descriptive error.

**Frontend drift review component**:
- Drift review phase renders when findings exist.
- "Continue to Preview" button disabled when breaking findings present.
- "Continue to Preview" button enabled when no breaking findings.
- Groups with more than 5 findings collapse with the correct "Show N more" count; groups with 5 or fewer do not collapse.
- Cancel returns to idle phase.
- Skips drift review when drift is null or has no findings.

### Property-Based Tests

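Property-based tests use `fast-check` (JavaScript) to verify the four correctness properties defined above. For Properties 1–3, each test generates random schema and config objects and verifies the drift checker output against the expected set-theoretic result. For Property 4, each test generates a random drift report and checks the order returned by the grouping function.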

**Configuration**:
- Minimum 100 iterations per property test.
- Each test tagged with: **Feature: compliance-schema-drift-check, Property {N}: {title}**

**Generators**:
- `arbitraryParserConfig`: generates random `metric_categories` (object with 0–20 string keys mapped to category strings), `core_cols` (array of 0–15 unique column name strings), `skip_sheets` (array of 0–5 unique sheet name strings).
- `arbitraryXlsxSchema`: generates random sheets array, each with a name, columns array, and optionally metric_values (for the Summary sheet). Sheet names, column names, and metric values drawn from a shared pool to ensure meaningful overlap with the config.

### Integration Tests

- Preview endpoint returns drift report alongside existing diff data.
- Preview endpoint returns 200 with breaking drift (does not error).
- Preview endpoint gracefully degrades when drift check fails (`drift: null`, `drift_error` present).
- Preview endpoint returns 500 when config file is missing.
- Python parser reads from `compliance_config.json` and produces same output as before.
- Commit endpoint is unchanged and does not reference drift.