Add admin page overhaul and compliance schema drift check specs, compliance upload improvements, drift checker helper

root
2026-04-20 20:12:12 +00:00
parent 6082721452
commit 043c85cc69
20 changed files with 56814 additions and 59 deletions

@@ -0,0 +1,364 @@
# Design Document: Compliance Schema Drift Check
## Overview
This feature adds schema drift detection to the compliance xlsx upload flow. When a user uploads a weekly NTS_AEO report, the backend extracts the xlsx structural schema (sheet names, column headers, metric values) and compares it against a shared parser configuration file. The comparison produces a categorised drift report with three severity levels: breaking (blocks upload), silent-miss (warns but allows proceeding), and cosmetic (informational). The frontend displays these findings in a new drift review phase inside the upload modal, inserted between the upload spinner and the existing diff preview.
The parser configuration dicts (`METRIC_CATEGORIES`, `CORE_COLS`, `SKIP_SHEETS`) currently defined inline in `parse_compliance_xlsx.py` are extracted into a shared JSON file (`backend/scripts/compliance_config.json`) that both the Python parser and the Node.js drift checker read. This establishes a single source of truth for parser configuration.
### Design Decisions
1. **Shared JSON config over database storage**: The parser config is a developer-maintained mapping, not user data. A JSON file is version-controllable, diffable, and readable by both Python and Node.js without additional dependencies.
2. **Python subprocess for schema extraction**: The existing `dump_xlsx_schema.py` already uses openpyxl to extract xlsx structure. We adapt this into a new `extract_xlsx_schema.py` script that the Node.js backend invokes as a subprocess, consistent with how `parse_compliance_xlsx.py` is already called.
3. **Node.js drift comparison logic**: The drift comparison is pure object comparison (sets of strings) with no xlsx parsing. Implementing it in Node.js avoids a second Python subprocess call and keeps the logic co-located with the route handler.
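4. **Graceful degradation**: If the drift check itself fails (for example, schema extraction exits non-zero or returns invalid JSON), the upload flow proceeds normally with `drift: null` and a `drift_error` message. The check is additive: a failure of the check must never block the existing workflow, and only breaking findings from a successful check block the upload. The one exception is a missing or invalid `compliance_config.json`, which returns a 500 because the parser depends on the same file.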
## Architecture
```mermaid
sequenceDiagram
participant User
participant Modal as ComplianceUploadModal
participant API as POST /api/compliance/preview
participant Schema as extract_xlsx_schema.py
participant Drift as driftChecker (Node.js)
participant Config as compliance_config.json
participant Parser as parse_compliance_xlsx.py
User->>Modal: Drops xlsx file
Modal->>API: POST /preview (multipart)
API->>Schema: spawn python3 extract_xlsx_schema.py <file>
Schema-->>API: JSON { sheets: [...] }
API->>Config: fs.readFileSync(compliance_config.json)
API->>Drift: compareSchemaToDrift(schema, config)
Drift-->>API: { breaking: [...], silent_miss: [...], cosmetic: [...] }
API->>Parser: spawn python3 parse_compliance_xlsx.py <file>
Parser->>Config: reads compliance_config.json
Parser-->>API: JSON { items, summary, ... }
API->>API: computeDiff(db, items)
API-->>Modal: { drift, diff, tempFile, ... }
alt drift has findings
Modal->>User: Show drift review phase
alt breaking findings exist
Modal->>User: Block "Continue to Preview"
else no breaking findings
User->>Modal: Click "Continue to Preview"
Modal->>User: Show diff preview
end
else no drift findings
Modal->>User: Show diff preview directly
end
```
### File Layout
```
backend/
  scripts/
    compliance_config.json      # NEW: shared parser config (single source of truth)
    extract_xlsx_schema.py      # NEW: extracts xlsx structure as JSON
    parse_compliance_xlsx.py    # MODIFIED: reads config from JSON file
    dump_xlsx_schema.py         # UNCHANGED: standalone diagnostic tool
  routes/
    compliance.js               # MODIFIED: drift check in /preview, uses the new driftChecker helper
  helpers/
    driftChecker.js             # NEW: compareSchemaToDrift() function
frontend/
  src/components/pages/
    ComplianceUploadModal.js    # MODIFIED: new drift-review phase
```
## Components and Interfaces
### 1. Shared Parser Configuration (`compliance_config.json`)
```json
{
  "metric_categories": {
    "2.3.4i": "Vulnerability Management",
    "2.3.6i": "Vulnerability Management",
    "5.2.4": "Access & MFA"
  },
  "core_cols": [
    "Preferred - Hostname",
    "GRANITE - IPv4_Address",
    "GRANITE - Type",
    "Team",
    "Compliant",
    "Source_Network",
    "Vertical",
    "GRANITE - Equip_Inst_ID",
    "GRANITE - RESPONSIBLE_TEAM"
  ],
  "skip_sheets": ["Summary", "CMDB_9box", "Vulns", "Aging Dashboard"]
}
```
### 2. Schema Extractor (`extract_xlsx_schema.py`)
**Input**: File path as CLI argument.
**Output** (stdout JSON):
```json
{
  "sheets": [
    {
      "name": "Summary",
      "columns": ["Metric", "Non-Compliant", "..."],
      "metric_values": ["2.3.4i", "5.2.4", "..."]
    },
    {
      "name": "2.3.4i",
      "columns": ["Preferred - Hostname", "GRANITE - IPv4_Address", "..."]
    }
  ]
}
```
- Uses openpyxl in read-only mode.
- Extracts sheet names, first-row column headers per sheet, and unique metric values from the Summary sheet (header at row 4, data from row 5 onward).
- On error, returns `{ "error": "..." }` on stdout and exits with non-zero code.
### 3. Drift Checker (`backend/helpers/driftChecker.js`)
**Function**: `compareSchemaToDrift(schema, config) => DriftReport`
**Parameters**:
- `schema` — object returned by `extract_xlsx_schema.py`
- `config` — object parsed from `compliance_config.json`
**Returns** (`DriftReport`):
```javascript
{
breaking: [
{ severity: 'breaking', message: 'Detail sheet "2.3.4i" is missing core column "Team"', value: 'Team', sheet: '2.3.4i' }
],
silent_miss: [
{ severity: 'silent_miss', message: 'Unknown metric "9.1.2" in Summary — not in metric_categories', value: '9.1.2' }
],
cosmetic: [
{ severity: 'cosmetic', message: 'New column "Extra_Field" in sheet "2.3.4i" — will be captured in extra_json', value: 'Extra_Field', sheet: '2.3.4i' }
]
}
```
**Drift rules**:
| Rule | Severity | Condition |
|---|---|---|
| Missing core column | `breaking` | A detail sheet (not in `skip_sheets`, present in xlsx) is missing a column from `core_cols` |
| Missing detail sheet | `breaking` | A sheet name in `metric_categories` (and not in `skip_sheets`) is absent from the xlsx |
| Unknown metric value | `silent_miss` | A metric value in the Summary sheet is not a key in `metric_categories` |
| Unknown sheet | `silent_miss` | An xlsx sheet is not in `skip_sheets` and not in `metric_categories` |
| New column in detail sheet | `cosmetic` | A detail sheet has columns not in `core_cols` |
| Stale metric category | `cosmetic` | A key in `metric_categories` does not appear in the Summary sheet's metric values |
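The six rules above can be sketched as one pass over the sheets plus one pass over the metric sets. This is a minimal sketch, not the shipped implementation. It assumes that a "detail sheet" is any xlsx sheet not in `skip_sheets` (the reading used in the table), that metric values come from whichever sheet carries `metric_values`, and that message wording is free-form. Only the function name and the finding shape come from this document.

```javascript
// Sketch of compareSchemaToDrift(); messages and internal names are illustrative.
function compareSchemaToDrift(schema, config) {
  const report = { breaking: [], silent_miss: [], cosmetic: [] };
  const add = (severity, message, value, sheet = null) =>
    report[severity].push({ severity, message, value, sheet });

  const skip = new Set(config.skip_sheets);
  const coreCols = new Set(config.core_cols);
  const knownMetrics = new Set(Object.keys(config.metric_categories));
  const sheets = schema.sheets || [];
  const present = new Set(sheets.map((s) => s.name));
  // Assumption: metric values live on whichever sheet carries `metric_values`.
  const summaryMetrics = new Set(sheets.flatMap((s) => s.metric_values || []));

  for (const sheet of sheets) {
    if (skip.has(sheet.name)) continue; // skipped sheets are never detail sheets
    const columns = new Set(sheet.columns || []);
    if (!knownMetrics.has(sheet.name)) {
      add('silent_miss', `Unknown sheet "${sheet.name}"`, sheet.name, sheet.name);
    }
    for (const col of coreCols) {
      if (!columns.has(col)) {
        add('breaking', `Detail sheet "${sheet.name}" is missing core column "${col}"`, col, sheet.name);
      }
    }
    for (const col of columns) {
      if (!coreCols.has(col)) {
        add('cosmetic', `New column "${col}" in sheet "${sheet.name}"`, col, sheet.name);
      }
    }
  }

  for (const metric of knownMetrics) {
    if (!skip.has(metric) && !present.has(metric)) {
      add('breaking', `Detail sheet "${metric}" is missing from the xlsx`, metric, metric);
    }
    if (!summaryMetrics.has(metric)) {
      add('cosmetic', `Metric "${metric}" is configured but absent from Summary`, metric);
    }
  }
  for (const metric of summaryMetrics) {
    if (!knownMetrics.has(metric)) {
      add('silent_miss', `Unknown metric "${metric}" in Summary`, metric);
    }
  }
  return report;
}
```

Building each side as a `Set` keeps every rule a plain membership test, which also makes the expected result easy to state set-theoretically in tests.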
### 4. Preview Endpoint Changes (`POST /api/compliance/preview`)
The existing `/preview` handler is modified to:
1. After receiving the uploaded file, spawn `extract_xlsx_schema.py` to get the xlsx schema.
2. Read `compliance_config.json` from disk.
3. Call `compareSchemaToDrift(schema, config)` to produce the drift report.
4. Proceed with the existing `parseXlsx()` call and `computeDiff()`.
5. Include `drift` (the DriftReport object) and optionally `drift_error` (string) in the response.
If the schema extraction or drift check throws, set `drift: null` and `drift_error: <message>`, then continue with the normal flow.
**Updated response shape**:
```json
{
"drift": {
"breaking": [],
"silent_miss": [],
"cosmetic": []
},
"drift_error": null,
"diff": { "new_count": 5, "recurring_count": 120, "resolved_count": 3 },
"tempFile": "/path/to/temp.json",
"filename": "NTS_AEO_2026_03_25.xlsx",
"report_date": "2026-03-25",
"total_items": 125
}
```
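The degradation behaviour described for the handler can be isolated in a small wrapper. The following is a sketch under stated assumptions: `runDriftCheck` and its injected `extractSchema` and `compare` parameters are hypothetical names chosen for illustration, and the config is assumed to be loaded by the caller so that a config failure can still surface as a 500.

```javascript
// Hypothetical wrapper: a failed drift check degrades to { drift: null, drift_error }.
// `extractSchema` stands in for whatever spawns extract_xlsx_schema.py;
// `compare` stands in for compareSchemaToDrift.
async function runDriftCheck({ filePath, config, extractSchema, compare }) {
  try {
    const schema = await extractSchema(filePath);
    return { drift: compare(schema, config), drift_error: null };
  } catch (err) {
    // Never rethrow: the parse and diff steps must still run after this.
    const message = err instanceof Error ? err.message : String(err);
    return { drift: null, drift_error: message };
  }
}
```

The handler could then spread the returned object into the response next to `diff`, `tempFile`, and the other existing fields. Injecting the two functions keeps the wrapper testable without a Python interpreter.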
### 5. Upload Modal Changes (`ComplianceUploadModal.js`)
**New phase**: `drift-review` inserted between `uploading` and `preview`.
**Phase flow**:
```
idle → uploading → drift-review (if findings) → preview → committing → done
                 → preview (if no findings)
```
**Drift review UI**:
- Findings grouped by severity: breaking first, then silent-miss, then cosmetic.
- Each group has a header with severity label and count badge.
- Groups with more than 5 findings collapse with a "Show N more" toggle.
- Each finding shows the message text and the triggering value.
- Breaking findings: red text (`#EF4444`), red left-border accent.
- Silent-miss findings: amber text (`#F59E0B`), amber left-border accent.
- Cosmetic findings: muted text (`#94A3B8`), subtle left-border accent.
- "Cancel" button returns to idle. "Continue to Preview" button advances to diff preview.
- "Continue to Preview" is disabled when breaking findings exist, with a message explaining the block.
- When `drift` is `null` (drift check failed), skip drift-review and go straight to preview.
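The phase and button rules above reduce to a few pure functions that the modal could call after the upload resolves. This is only a sketch: the helper names (`hasFindings`, `phaseAfterUpload`, `canContinueToPreview`, `splitForCollapse`) are hypothetical and not taken from the existing component.

```javascript
const SEVERITIES = ['breaking', 'silent_miss', 'cosmetic'];

// True when the drift report exists and at least one group is non-empty.
function hasFindings(drift) {
  return Boolean(drift) && SEVERITIES.some((s) => (drift[s] || []).length > 0);
}

// A null drift (check failed) or an empty report skips the drift-review phase.
function phaseAfterUpload(drift) {
  return hasFindings(drift) ? 'drift-review' : 'preview';
}

// "Continue to Preview" is disabled while any breaking finding exists.
function canContinueToPreview(drift) {
  return !drift || (drift.breaking || []).length === 0;
}

// Groups with more than `limit` findings show the first `limit` and hide the rest
// behind a "Show N more" toggle, where N is `hiddenCount`.
function splitForCollapse(findings, limit = 5) {
  if (findings.length <= limit) return { visible: findings, hiddenCount: 0 };
  return { visible: findings.slice(0, limit), hiddenCount: findings.length - limit };
}
```

Keeping these decisions outside the component body means they can be unit tested without rendering the modal.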
## Data Models
### DriftFinding
```javascript
{
severity: 'breaking' | 'silent_miss' | 'cosmetic',
message: string, // Human-readable description
value: string, // The specific column/sheet/metric that triggered the finding
sheet: string|null // Sheet name context (when applicable)
}
```
### DriftReport
```javascript
{
breaking: DriftFinding[],
silent_miss: DriftFinding[],
cosmetic: DriftFinding[]
}
```
### ParserConfig
```javascript
{
metric_categories: { [metricId: string]: string }, // metric ID → category name
core_cols: string[], // column names for main item fields
skip_sheets: string[] // sheet names excluded from parsing
}
```
### XlsxSchema (output of extract_xlsx_schema.py)
```javascript
{
sheets: [
{
name: string,
columns: string[],
metric_values?: string[] // only present on Summary sheet
}
]
}
```
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Breaking drift completeness
*For any* xlsx schema and parser config, the drift checker SHALL produce a breaking finding for every core column missing from every detail sheet, and for every detail sheet (present in `metric_categories` but not in `skip_sheets`) absent from the xlsx — and no other breaking findings. The set of breaking findings is exactly the union of missing-core-column findings and missing-detail-sheet findings.
**Validates: Requirements 3.1, 3.2, 3.3**
### Property 2: Silent-miss drift completeness
*For any* xlsx schema and parser config, the drift checker SHALL produce a silent-miss finding for every metric value in the Summary sheet not present in `metric_categories`, and for every xlsx sheet not in `skip_sheets` and not in `metric_categories` — and no other silent-miss findings. The set of silent-miss findings is exactly the union of unknown-metric findings and unknown-sheet findings.
**Validates: Requirements 4.1, 4.2, 4.3**
### Property 3: Cosmetic drift completeness
*For any* xlsx schema and parser config, the drift checker SHALL produce a cosmetic finding for every column in a detail sheet not present in `core_cols`, and for every key in `metric_categories` not present in the Summary sheet's metric values — and no other cosmetic findings. The set of cosmetic findings is exactly the union of new-column findings and stale-metric findings.
**Validates: Requirements 5.1, 5.2, 5.3**
### Property 4: Drift severity ordering
*For any* drift report containing a mix of breaking, silent-miss, and cosmetic findings, the grouping function SHALL always return findings ordered by severity: all breaking findings first, then all silent-miss findings, then all cosmetic findings.
**Validates: Requirements 8.1**
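As an illustration of Property 4, the grouping step and the ordering check a property test would assert can both be written in a few lines. The names `orderFindings` and `isOrderedBySeverity` are assumptions made for this sketch; the document does not name the grouping function.

```javascript
const SEVERITY_ORDER = ['breaking', 'silent_miss', 'cosmetic'];

// Flattens a DriftReport into one list: breaking, then silent_miss, then cosmetic.
// The order comes from SEVERITY_ORDER, not from the report's key order.
function orderFindings(drift) {
  return SEVERITY_ORDER.flatMap((severity) => drift[severity] || []);
}

// Oracle for the property: severity ranks never decrease along the list.
function isOrderedBySeverity(findings) {
  const ranks = findings.map((f) => SEVERITY_ORDER.indexOf(f.severity));
  return ranks.every((rank, i) => i === 0 || ranks[i - 1] <= rank);
}
```

A property test would generate arbitrary reports and assert `isOrderedBySeverity(orderFindings(report))` for each one.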
## Error Handling
### Python Script Failures
| Failure | Handling |
|---|---|
| `extract_xlsx_schema.py` exits non-zero | Preview endpoint sets `drift: null`, `drift_error: <stderr message>`, continues with normal parse flow |
| `extract_xlsx_schema.py` returns invalid JSON | Same as above — caught in JSON.parse, treated as drift check failure |
| `compliance_config.json` missing or invalid (Node.js read) | Preview endpoint returns 500 with message "Configuration file could not be loaded" |
| `compliance_config.json` missing or invalid (Python parser read) | Parser exits non-zero, stderr describes the error, preview endpoint returns 500 with parse error |
| xlsx file cannot be opened by schema extractor | Schema extractor returns `{ "error": "..." }` on stdout, exits non-zero; drift check skipped gracefully |
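The extractor failure rows in the table above can be handled by one function that turns a finished subprocess result into either a schema or a thrown error. This sketch assumes the caller has already collected `code`, `stdout`, and `stderr` from the child process; `interpretExtractorResult` is a hypothetical name, and the error wording is illustrative.

```javascript
// Converts a finished extract_xlsx_schema.py run into a schema object, or throws.
// The thrown message is what the preview handler would surface as drift_error.
function interpretExtractorResult({ code, stdout, stderr }) {
  let parsed = null;
  try {
    parsed = JSON.parse(stdout);
  } catch (_) {
    parsed = null; // invalid JSON is reported below
  }
  if (code !== 0) {
    // Prefer the script's own { "error": "..." } payload, then stderr, then the exit code.
    const detail = (parsed && parsed.error) || (stderr || '').trim() || `exit code ${code}`;
    throw new Error(`Schema extraction failed: ${detail}`);
  }
  if (!parsed || !Array.isArray(parsed.sheets)) {
    throw new Error('Schema extraction returned invalid JSON');
  }
  return parsed;
}
```

Because every failure path throws, the surrounding drift check only needs a single `try`/`catch` to fall back to `drift: null`.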
### Frontend Error States
| Condition | Behavior |
|---|---|
| `drift` is `null` in preview response | Skip drift-review phase, proceed directly to diff preview |
| `drift_error` is present | Optionally display a subtle warning in the diff preview that drift check was skipped |
| Network error during upload | Existing error phase handling (unchanged) |
### Config File Validation
The Node.js config loader validates that:
- The file exists and is readable.
- The content parses as valid JSON.
- The parsed object contains `metric_categories` (object), `core_cols` (array), and `skip_sheets` (array).
If any check fails, the loader throws with a descriptive message. The preview handler catches this and returns a 500 response.
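A loader covering the three checks might look like the sketch below. The function names `validateComplianceConfig` and `loadComplianceConfig` are assumptions for illustration, as is the exact error wording; only the required keys and their types come from this document.

```javascript
const fs = require('fs');

// Checks the shape of an already-parsed config object; throws on the first problem.
function validateComplianceConfig(parsed) {
  const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);
  if (!isPlainObject(parsed)) throw new Error('Config must be a JSON object');
  if (!isPlainObject(parsed.metric_categories)) {
    throw new Error('Config key "metric_categories" must be an object');
  }
  for (const key of ['core_cols', 'skip_sheets']) {
    if (!Array.isArray(parsed[key])) throw new Error(`Config key "${key}" must be an array`);
  }
  return parsed;
}

// Reads, parses, and validates the config file, wrapping each failure in a
// descriptive error that the preview handler can turn into a 500 response.
function loadComplianceConfig(configPath) {
  let raw;
  try {
    raw = fs.readFileSync(configPath, 'utf8');
  } catch (err) {
    throw new Error(`Config file could not be read: ${err.message}`);
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`Config file is not valid JSON: ${err.message}`);
  }
  return validateComplianceConfig(parsed);
}
```

Splitting validation from file access lets the shape checks be tested with plain objects, while the file-level tests only need a temporary file.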
## Testing Strategy
### Unit Tests
**Drift checker (`driftChecker.js`)**:
- Breaking: missing core column produces finding with correct severity, message, value, and sheet.
- Breaking: missing detail sheet produces finding.
- Silent-miss: unknown metric value produces finding.
- Silent-miss: unknown sheet produces finding.
- Cosmetic: new column in detail sheet produces finding.
- Cosmetic: stale metric category produces finding.
- Empty schema (no sheets) produces appropriate findings.
- Config with empty metric_categories, core_cols, or skip_sheets.
- Schema and config that are perfectly aligned produce zero findings.
**Config loader**:
- Valid config file loads correctly.
- Missing file throws descriptive error.
- Invalid JSON throws descriptive error.
- Config missing required keys throws descriptive error.
**Frontend drift review component**:
- Drift review phase renders when findings exist.
- "Continue to Preview" button disabled when breaking findings present.
- "Continue to Preview" button enabled when no breaking findings.
- Groups with more than 5 findings collapse, with the correct "Show N more" count.
- Cancel returns to idle phase.
- Skips drift review when drift is null or has no findings.
### Property-Based Tests
Property-based tests use `fast-check` (JavaScript) to verify the four correctness properties defined above. Each test generates random schema and config objects and verifies the drift checker output against the expected set-theoretic result.
**Configuration**:
- Minimum 100 iterations per property test.
- Each test tagged with: **Feature: compliance-schema-drift-check, Property {N}: {title}**
**Generators**:
- `arbitraryParserConfig`: generates random `metric_categories` (object with 0 to 20 string keys mapped to category strings), `core_cols` (array of 0 to 15 unique column name strings), `skip_sheets` (array of 0 to 5 unique sheet name strings).
- `arbitraryXlsxSchema`: generates random sheets array, each with a name, columns array, and optionally metric_values (for the Summary sheet). Sheet names, column names, and metric values drawn from a shared pool to ensure meaningful overlap with the config.
### Integration Tests
- Preview endpoint returns drift report alongside existing diff data.
- Preview endpoint returns 200 with breaking drift (does not error).
- Preview endpoint gracefully degrades when drift check fails (`drift: null`, `drift_error` present).
- Preview endpoint returns 500 when config file is missing.
- Python parser reads from `compliance_config.json` and produces same output as before.
- Commit endpoint is unchanged and does not reference drift.