Files
cve-dashboard/.kiro/specs/compliance-schema-drift-check/design.md

365 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Design Document: Compliance Schema Drift Check
## Overview
This feature adds schema drift detection to the compliance xlsx upload flow. When a user uploads a weekly NTS_AEO report, the backend extracts the xlsx structural schema (sheet names, column headers, metric values) and compares it against a shared parser configuration file. The comparison produces a categorised drift report with three severity levels: breaking (blocks upload), silent-miss (warns but allows proceeding), and cosmetic (informational). The frontend displays these findings in a new drift review phase inside the upload modal, inserted between the upload spinner and the existing diff preview.
The parser configuration dicts (`METRIC_CATEGORIES`, `CORE_COLS`, `SKIP_SHEETS`) currently defined inline in `parse_compliance_xlsx.py` are extracted into a shared JSON file (`backend/scripts/compliance_config.json`) that both the Python parser and the Node.js drift checker read. This establishes a single source of truth for parser configuration.
### Design Decisions
1. **Shared JSON config over database storage**: The parser config is a developer-maintained mapping, not user data. A JSON file is version-controllable, diffable, and readable by both Python and Node.js without additional dependencies.
2. **Python subprocess for schema extraction**: The existing `dump_xlsx_schema.py` already uses openpyxl to extract xlsx structure. We adapt this into a new `extract_xlsx_schema.py` script that the Node.js backend invokes as a subprocess, consistent with how `parse_compliance_xlsx.py` is already called.
3. **Node.js drift comparison logic**: The drift comparison is pure object comparison (sets of strings) with no xlsx parsing. Implementing it in Node.js avoids a second Python subprocess call and keeps the logic co-located with the route handler.
4. **Graceful degradation**: If the drift check fails, the upload flow proceeds normally with `drift: null` and a `drift_error` message. The drift check is additive and must never block the existing workflow.
## Architecture
```mermaid
sequenceDiagram
participant User
participant Modal as ComplianceUploadModal
participant API as POST /api/compliance/preview
participant Schema as extract_xlsx_schema.py
participant Drift as driftChecker (Node.js)
participant Config as compliance_config.json
participant Parser as parse_compliance_xlsx.py
User->>Modal: Drops xlsx file
Modal->>API: POST /preview (multipart)
API->>Schema: spawn python3 extract_xlsx_schema.py <file>
Schema-->>API: JSON { sheets: [...] }
API->>Config: fs.readFileSync(compliance_config.json)
API->>Drift: compareSchemaToDrift(schema, config)
Drift-->>API: { breaking: [...], silent_miss: [...], cosmetic: [...] }
API->>Parser: spawn python3 parse_compliance_xlsx.py <file>
Parser->>Config: reads compliance_config.json
Parser-->>API: JSON { items, summary, ... }
API->>API: computeDiff(db, items)
API-->>Modal: { drift, diff, tempFile, ... }
alt drift has findings
Modal->>User: Show drift review phase
alt breaking findings exist
Modal->>User: Block "Continue to Preview"
else no breaking findings
User->>Modal: Click "Continue to Preview"
Modal->>User: Show diff preview
end
else no drift findings
Modal->>User: Show diff preview directly
end
```
### File Layout
```
backend/
scripts/
compliance_config.json # NEW — shared parser config (single source of truth)
extract_xlsx_schema.py # NEW — extracts xlsx structure as JSON
parse_compliance_xlsx.py # MODIFIED — reads config from JSON file
dump_xlsx_schema.py # UNCHANGED — standalone diagnostic tool
routes/
compliance.js # MODIFIED — drift check in /preview, new driftChecker module
helpers/
driftChecker.js # NEW — compareSchemaToDrift() function
frontend/
src/components/pages/
ComplianceUploadModal.js # MODIFIED — new drift-review phase
```
## Components and Interfaces
### 1. Shared Parser Configuration (`compliance_config.json`)
```json
{
"metric_categories": {
"2.3.4i": "Vulnerability Management",
"2.3.6i": "Vulnerability Management",
"5.2.4": "Access & MFA"
},
"core_cols": [
"Preferred - Hostname",
"GRANITE - IPv4_Address",
"GRANITE - Type",
"Team",
"Compliant",
"Source_Network",
"Vertical",
"GRANITE - Equip_Inst_ID",
"GRANITE - RESPONSIBLE_TEAM"
],
"skip_sheets": ["Summary", "CMDB_9box", "Vulns", "Aging Dashboard"]
}
```
### 2. Schema Extractor (`extract_xlsx_schema.py`)
**Input**: File path as CLI argument.
**Output** (stdout JSON):
```json
{
"sheets": [
{
"name": "Summary",
"columns": ["Metric", "Non-Compliant", "..."],
"metric_values": ["2.3.4i", "5.2.4", "..."]
},
{
"name": "2.3.4i",
"columns": ["Preferred - Hostname", "GRANITE - IPv4_Address", "..."]
}
]
}
```
- Uses openpyxl in read-only mode.
- Extracts sheet names, first-row column headers per sheet, and unique metric values from the Summary sheet (header at row 4, data from row 5 onward).
- On error, returns `{ "error": "..." }` on stdout and exits with non-zero code.
### 3. Drift Checker (`backend/helpers/driftChecker.js`)
**Function**: `compareSchemaToDrift(schema, config) => DriftReport`
**Parameters**:
- `schema` — object returned by `extract_xlsx_schema.py`
- `config` — object parsed from `compliance_config.json`
**Returns** (`DriftReport`):
```javascript
{
breaking: [
{ severity: 'breaking', message: 'Detail sheet "2.3.4i" is missing core column "Team"', value: 'Team', sheet: '2.3.4i' }
],
silent_miss: [
{ severity: 'silent_miss', message: 'Unknown metric "9.1.2" in Summary — not in metric_categories', value: '9.1.2' }
],
cosmetic: [
{ severity: 'cosmetic', message: 'New column "Extra_Field" in sheet "2.3.4i" — will be captured in extra_json', value: 'Extra_Field', sheet: '2.3.4i' }
]
}
```
**Drift rules**:
| Rule | Severity | Condition |
|---|---|---|
| Missing core column | `breaking` | A detail sheet (not in `skip_sheets`, present in xlsx) is missing a column from `core_cols` |
| Missing detail sheet | `breaking` | A sheet name in `metric_categories` (and not in `skip_sheets`) is absent from the xlsx |
| Unknown metric value | `silent_miss` | A metric value in the Summary sheet is not a key in `metric_categories` |
| Unknown sheet | `silent_miss` | An xlsx sheet is not in `skip_sheets` and not in `metric_categories` |
| New column in detail sheet | `cosmetic` | A detail sheet has columns not in `core_cols` |
| Stale metric category | `cosmetic` | A key in `metric_categories` does not appear in the Summary sheet's metric values |
### 4. Preview Endpoint Changes (`POST /api/compliance/preview`)
The existing `/preview` handler is modified to:
1. After receiving the uploaded file, spawn `extract_xlsx_schema.py` to get the xlsx schema.
2. Read `compliance_config.json` from disk.
3. Call `compareSchemaToDrift(schema, config)` to produce the drift report.
4. Proceed with the existing `parseXlsx()` call and `computeDiff()`.
5. Include `drift` (the DriftReport object) and optionally `drift_error` (string) in the response.
If the schema extraction or drift check throws, set `drift: null` and `drift_error: <message>`, then continue with the normal flow.
**Updated response shape**:
```json
{
"drift": {
"breaking": [],
"silent_miss": [],
"cosmetic": []
},
"drift_error": null,
"diff": { "new_count": 5, "recurring_count": 120, "resolved_count": 3 },
"tempFile": "/path/to/temp.json",
"filename": "NTS_AEO_2026_03_25.xlsx",
"report_date": "2026-03-25",
"total_items": 125
}
```
### 5. Upload Modal Changes (`ComplianceUploadModal.js`)
**New phase**: `drift-review` inserted between `uploading` and `preview`.
**Phase flow**:
```
idle → uploading → drift-review (if findings) → preview → committing → done
→ preview (if no findings)
```
**Drift review UI**:
- Findings grouped by severity: breaking first, then silent-miss, then cosmetic.
- Each group has a header with severity label and count badge.
- Groups with more than 5 findings collapse with a "Show N more" toggle.
- Each finding shows the message text and the triggering value.
- Breaking findings: red text (`#EF4444`), red left-border accent.
- Silent-miss findings: amber text (`#F59E0B`), amber left-border accent.
- Cosmetic findings: muted text (`#94A3B8`), subtle left-border accent.
- "Cancel" button returns to idle. "Continue to Preview" button advances to diff preview.
- "Continue to Preview" is disabled when breaking findings exist, with a message explaining the block.
- When `drift` is `null` (drift check failed), skip drift-review and go straight to preview.
## Data Models
### DriftFinding
```javascript
{
severity: 'breaking' | 'silent_miss' | 'cosmetic',
message: string, // Human-readable description
value: string, // The specific column/sheet/metric that triggered the finding
sheet: string|null // Sheet name context (when applicable)
}
```
### DriftReport
```javascript
{
breaking: DriftFinding[],
silent_miss: DriftFinding[],
cosmetic: DriftFinding[]
}
```
### ParserConfig
```javascript
{
metric_categories: { [metricId: string]: string }, // metric ID → category name
core_cols: string[], // column names for main item fields
skip_sheets: string[] // sheet names excluded from parsing
}
```
### XlsxSchema (output of extract_xlsx_schema.py)
```javascript
{
sheets: [
{
name: string,
columns: string[],
metric_values?: string[] // only present on Summary sheet
}
]
}
```
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Breaking drift completeness
*For any* xlsx schema and parser config, the drift checker SHALL produce a breaking finding for every core column missing from every detail sheet, and for every detail sheet (present in `metric_categories` but not in `skip_sheets`) absent from the xlsx — and no other breaking findings. The set of breaking findings is exactly the union of missing-core-column findings and missing-detail-sheet findings.
**Validates: Requirements 3.1, 3.2, 3.3**
### Property 2: Silent-miss drift completeness
*For any* xlsx schema and parser config, the drift checker SHALL produce a silent-miss finding for every metric value in the Summary sheet not present in `metric_categories`, and for every xlsx sheet not in `skip_sheets` and not in `metric_categories` — and no other silent-miss findings. The set of silent-miss findings is exactly the union of unknown-metric findings and unknown-sheet findings.
**Validates: Requirements 4.1, 4.2, 4.3**
### Property 3: Cosmetic drift completeness
*For any* xlsx schema and parser config, the drift checker SHALL produce a cosmetic finding for every column in a detail sheet not present in `core_cols`, and for every key in `metric_categories` not present in the Summary sheet's metric values — and no other cosmetic findings. The set of cosmetic findings is exactly the union of new-column findings and stale-metric findings.
**Validates: Requirements 5.1, 5.2, 5.3**
### Property 4: Drift severity ordering
*For any* drift report containing a mix of breaking, silent-miss, and cosmetic findings, the grouping function SHALL always return findings ordered by severity: all breaking findings first, then all silent-miss findings, then all cosmetic findings.
**Validates: Requirements 8.1**
## Error Handling
### Python Script Failures
| Failure | Handling |
|---|---|
| `extract_xlsx_schema.py` exits non-zero | Preview endpoint sets `drift: null`, `drift_error: <stderr message>`, continues with normal parse flow |
| `extract_xlsx_schema.py` returns invalid JSON | Same as above — caught in JSON.parse, treated as drift check failure |
| `compliance_config.json` missing or invalid (Node.js read) | Preview endpoint returns 500 with message "Configuration file could not be loaded" |
| `compliance_config.json` missing or invalid (Python parser read) | Parser exits non-zero, stderr describes the error, preview endpoint returns 500 with parse error |
| xlsx file cannot be opened by schema extractor | Schema extractor returns `{ "error": "..." }` on stdout, exits non-zero; drift check skipped gracefully |
### Frontend Error States
| Condition | Behavior |
|---|---|
| `drift` is `null` in preview response | Skip drift-review phase, proceed directly to diff preview |
| `drift_error` is present | Optionally display a subtle warning in the diff preview that drift check was skipped |
| Network error during upload | Existing error phase handling (unchanged) |
### Config File Validation
The Node.js config loader validates that:
- The file exists and is readable.
- The content parses as valid JSON.
- The parsed object contains `metric_categories` (object), `core_cols` (array), and `skip_sheets` (array).
If any check fails, the loader throws with a descriptive message. The preview handler catches this and returns a 500 response.
## Testing Strategy
### Unit Tests
**Drift checker (`driftChecker.js`)**:
- Breaking: missing core column produces finding with correct severity, message, value, and sheet.
- Breaking: missing detail sheet produces finding.
- Silent-miss: unknown metric value produces finding.
- Silent-miss: unknown sheet produces finding.
- Cosmetic: new column in detail sheet produces finding.
- Cosmetic: stale metric category produces finding.
- Empty schema (no sheets) produces appropriate findings.
- Config with empty metric_categories, core_cols, or skip_sheets.
- Schema and config that are perfectly aligned produce zero findings.
**Config loader**:
- Valid config file loads correctly.
- Missing file throws descriptive error.
- Invalid JSON throws descriptive error.
- Config missing required keys throws descriptive error.
**Frontend drift review component**:
- Drift review phase renders when findings exist.
- "Continue to Preview" button disabled when breaking findings present.
- "Continue to Preview" button enabled when no breaking findings.
- Groups collapse at 5+ findings with correct "Show N more" count.
- Cancel returns to idle phase.
- Skips drift review when drift is null or has no findings.
### Property-Based Tests
Property-based tests use `fast-check` (JavaScript) to verify the four correctness properties defined above. Each test generates random schema and config objects and verifies the drift checker output against the expected set-theoretic result.
**Configuration**:
- Minimum 100 iterations per property test.
- Each test tagged with: **Feature: compliance-schema-drift-check, Property {N}: {title}**
**Generators**:
- `arbitraryParserConfig`: generates random `metric_categories` (object with 020 string keys mapped to category strings), `core_cols` (array of 015 unique column name strings), `skip_sheets` (array of 05 unique sheet name strings).
- `arbitraryXlsxSchema`: generates random sheets array, each with a name, columns array, and optionally metric_values (for the Summary sheet). Sheet names, column names, and metric values drawn from a shared pool to ensure meaningful overlap with the config.
### Integration Tests
- Preview endpoint returns drift report alongside existing diff data.
- Preview endpoint returns 200 with breaking drift (does not error).
- Preview endpoint gracefully degrades when drift check fails (`drift: null`, `drift_error` present).
- Preview endpoint returns 500 when config file is missing.
- Python parser reads from `compliance_config.json` and produces same output as before.
- Commit endpoint is unchanged and does not reference drift.