cve-dashboard/docs/vcl-multi-vertical-design-brief.md
Jordan Ramos a2bc1ff564 Add metric sub-team intermediate drill-down view
Clicking a metric now shows a sub-team breakdown page with totals per team
(compliant, non-compliant, total, %) instead of jumping directly to a flat
device list. Clicking a sub-team then shows the device list filtered to
that team only.

Navigation flow: Overview → Vertical → Metric (sub-team totals) → Team (devices)

Backend: added optional ?team= query param to the device list endpoint for
filtered queries.

Frontend: added MetricSubTeamView component with metric-level stats bar and
clickable sub-team table. Updated navigation state to include selectedTeam.

Also updated design brief to reflect the new drill-down hierarchy.
2026-05-14 14:53:41 -06:00


VCL Multi-Vertical Upload — Design Brief

Purpose

This document summarizes the design decisions and architectural choices for the VCL Multi-Vertical Upload feature. It is intended as a reference for presenting the approach to stakeholders and the compliance team.


What We Are Building

A new upload flow on the STEAM Security Dashboard that accepts multiple per-vertical compliance xlsx files (one per organizational vertical), ingests them with vertical-scoped resolution logic, and generates an executive-level VCL compliance report across all organizations — with drill-down by vertical and by metric.

This is a POC. The compliance team currently exports data from CyberMetrics as xlsx files on a 24-hour cycle. This feature lets them upload those files and generate the same reports they currently build manually in PowerPoint/Excel for senior leadership.


The Problem It Solves

Today the compliance team has 14 separate xlsx files — one per vertical (NTS_AEO, SDIT_CISO, TSI, etc.). The existing dashboard upload flow accepts a single consolidated file and treats it as the complete compliance state. If you upload just one vertical's file, the system incorrectly marks every device from the other 13 verticals as "resolved."

There is no automated way to:

  • Ingest all 14 files and produce a unified report
  • Drill down from the organizational view into specific metrics and devices
  • Generate burndown forecasts across verticals

Key Architectural Decisions

1. Vertical-Scoped Resolution

Decision: When a file for vertical X is committed, only items belonging to vertical X are evaluated for resolution. All other verticals are untouched.

Why: This is the fundamental change that makes per-vertical uploads safe. Without it, uploading one file would destroy data from the other 13 verticals.

Implication: Verticals are independent. You can upload NTS_AEO on Monday and SDIT_CISO on Wednesday without interference. This also supports the daily upload cadence the compliance team wants.
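
A minimal sketch of the scoping rule, assuming open items are plain dicts with `vertical`, `hostname`, `metric_id`, and `resolved` keys. The function name `resolve_missing`, the field names, and the sample hostnames are illustrative and not taken from the actual schema.

```python
def resolve_missing(open_items, uploaded_keys, vertical):
    """Mark items resolved only when they belong to `vertical` and are absent from its upload.

    open_items: list of dicts (hypothetical keys: vertical, hostname, metric_id, resolved).
    uploaded_keys: set of (hostname, metric_id) pairs found in the uploaded file.
    """
    for item in open_items:
        if item["vertical"] != vertical:
            continue  # items from other verticals are never evaluated
        if (item["hostname"], item["metric_id"]) not in uploaded_keys:
            item["resolved"] = True
    return open_items


items = [
    {"vertical": "NTS_AEO", "hostname": "rtr-1", "metric_id": "5.5.4i", "resolved": False},
    {"vertical": "NTS_AEO", "hostname": "rtr-2", "metric_id": "5.5.4i", "resolved": False},
    {"vertical": "SDIT_CISO", "hostname": "srv-9", "metric_id": "7.1.1", "resolved": False},
]

# The NTS_AEO file no longer lists rtr-2, so only rtr-2 is resolved.
resolve_missing(items, {("rtr-1", "5.5.4i")}, "NTS_AEO")
```

The SDIT_CISO item stays open even though it is missing from the NTS_AEO file, which is the behavior the single-file flow gets wrong.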

2. Vertical Identity Comes From the Filename

Decision: The vertical code is extracted from the filename pattern <VERTICAL>_YYYY_MM_DD.xlsx, not from data inside the xlsx.

Why: The internal xlsx structure is identical across verticals — same Summary sheet, same metric detail sheets, same columns. The only differentiator is the filename. This also means the Python parser requires zero changes.

Implication: Filenames must follow the convention. If they don't, the system flags them as "unrecognized" and the user can manually assign a vertical. This is a reasonable tradeoff for a POC.
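
The filename convention could be parsed along these lines. This is a sketch under the assumption that vertical codes contain only letters, digits, and underscores; `parse_vertical_filename` is a hypothetical helper name. Returning `None` corresponds to the "unrecognized" case, where the user assigns the vertical manually.

```python
import re
from datetime import date

# Assumed shape: <VERTICAL>_YYYY_MM_DD.xlsx, where the vertical code may itself contain underscores.
FILENAME_RE = re.compile(
    r"^(?P<vertical>[A-Za-z0-9_]+)_(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})\.xlsx$"
)


def parse_vertical_filename(filename):
    """Return (vertical, report_date), or None if the name does not follow the convention."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    try:
        report_date = date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:
        return None  # digits in the right places but not a real calendar date
    return match["vertical"], report_date


parsed = parse_vertical_filename("NTS_AEO_2026_05_14.xlsx")
unrecognized = parse_vertical_filename("compliance export.xlsx")
```

Because the vertical group is greedy and the date suffix is anchored at the end, a code such as NTS_AEO is captured whole rather than being split at its first underscore.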

3. Separate From Existing AEO Upload

Decision: This is a new flow with its own endpoints (/api/compliance/vcl-multi/...), its own UI page, and its own nav entry. The existing AEO compliance upload is unchanged.

Why:

  • The existing flow works for the STEAM/ACCESS-ENG team's day-to-day operations
  • The compliance team may deploy this on a separate instance to experiment without affecting production
  • Different user groups with different needs — engineers vs. compliance analysts vs. senior leadership

Implication: There are now two ways to upload compliance data. They coexist via the vertical column — existing AEO data has vertical = NULL, multi-vertical data has a vertical code. The VCL report page can aggregate either or both.

4. Two-Dimensional Grouping (Vertical + Team)

Decision: vertical and team are separate fields. Vertical is the organizational unit (NTS_AEO, SDIT_CISO). Team is the sub-team within a vertical (STEAM, ACCESS-ENG, ACCESS-OPS).

Why: NTS_AEO contains multiple sub-teams. Senior leadership wants to see the vertical-level view. The STEAM team wants to see their team-level view. Both are valid groupings on the same data.

Implication: The cross-organizational report groups by vertical. Drilling into NTS_AEO still shows the STEAM/ACCESS-ENG/ACCESS-OPS breakdown because that data exists in the "Team" column inside the xlsx.

5. Summary Sheet Data Stored Separately

Decision: The parsed Summary sheet (metric-level health data) is stored in a dedicated vcl_multi_vertical_summary table, not just as JSON on the upload record.

Why: The metric drill-down view needs to query per-metric compliance percentages and targets efficiently. Storing structured rows enables filtering, sorting, and aggregation at the database level rather than parsing JSON blobs in application code.

Implication: Slightly more storage, but queries like "show me all metrics below target across all verticals" become plain SQL filters instead of JSON parsing in application code, and can avoid full-table scans once suitable indexes are in place.
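
To illustrate the kind of query this enables, here is a sketch that uses an in-memory SQLite database as a stand-in for the real one. Only the table name comes from this brief; the column names, the `is_rollup` flag, and the SDIT_CISO row are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vcl_multi_vertical_summary (
        vertical       TEXT NOT NULL,
        metric_id      TEXT NOT NULL,
        team           TEXT NOT NULL,
        compliant      INTEGER NOT NULL,
        non_compliant  INTEGER NOT NULL,
        total          INTEGER NOT NULL,
        compliance_pct REAL NOT NULL,
        target_pct     REAL NOT NULL,
        is_rollup      INTEGER NOT NULL
    )
    """
)

# Percentages are stored as decimals between 0 and 1, as in the Summary sheet.
rows = [
    ("NTS_AEO", "5.5.4i", "ALL: NTS-AEO", 64414, 1762, 66176, 0.97, 0.80, 1),
    ("SDIT_CISO", "7.1.1", "ALL: SDIT-CISO", 72, 28, 100, 0.72, 0.95, 1),  # made-up row
]
conn.executemany("INSERT INTO vcl_multi_vertical_summary VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# "Show me all metrics below target across all verticals."
below_target = conn.execute(
    """
    SELECT vertical, metric_id
    FROM vcl_multi_vertical_summary
    WHERE is_rollup = 1 AND compliance_pct < target_pct
    ORDER BY vertical, metric_id
    """
).fetchall()
```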

6. Batch Upload With Atomic Commit

Decision: All files in a batch are committed in a single database transaction. If any file fails, the entire batch rolls back.

Why: Partial commits would leave the report in an inconsistent state — some verticals updated, others stale. The compliance team uploads all 14 files together as a reporting cycle. It should either all succeed or all fail.

Implication: If one file has a parsing error, the user is shown the error in the preview phase (before commit). They can remove that file from the batch and commit the rest. Once they hit "Commit," it's all-or-nothing.
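
The all-or-nothing commit can be sketched with Python's `sqlite3` module, where using the connection as a context manager commits on success and rolls back if an exception escapes. The real database engine, the `commit_batch` name, and the reduced `compliance_items` columns shown here are assumptions for illustration.

```python
import sqlite3


def commit_batch(conn, batch):
    """Insert every file's rows in one transaction; roll back all of them on any failure.

    batch: dict mapping vertical code -> list of (hostname, metric_id) rows.
    """
    try:
        with conn:  # one transaction for the whole batch
            for vertical, rows in batch.items():
                conn.executemany(
                    "INSERT INTO compliance_items (vertical, hostname, metric_id) VALUES (?, ?, ?)",
                    [(vertical, hostname, metric_id) for hostname, metric_id in rows],
                )
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compliance_items (vertical TEXT, hostname TEXT NOT NULL, metric_id TEXT NOT NULL)"
)

# The second file carries a bad row (missing hostname), so nothing from the batch is kept.
bad_batch = {"NTS_AEO": [("rtr-1", "5.5.4i")], "TSI": [(None, "7.1.1")]}
bad_ok = commit_batch(conn, bad_batch)
count_after_bad = conn.execute("SELECT COUNT(*) FROM compliance_items").fetchone()[0]

# Removing the failing file and committing the rest succeeds.
good_ok = commit_batch(conn, {"NTS_AEO": [("rtr-1", "5.5.4i")]})
count_after_good = conn.execute("SELECT COUNT(*) FROM compliance_items").fetchone()[0]
```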

7. Daily Upload Support (Idempotent)

Decision: Re-uploading the same vertical on the same day produces the same final state as uploading it once. The system doesn't create duplicate records.

Why: CyberMetrics refreshes on a 24-hour cycle. The compliance team may want to upload daily to track movement. They shouldn't have to worry about "did I already upload today?"

Implication: The resolution logic uses vertical + hostname + metric_id as the identity key. Recurring items get their seen_count incremented and metadata updated. New items are inserted. Missing items are resolved. Same logic as today, just scoped to the vertical.
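
A sketch of how the identity key drives the three outcomes (new, recurring, resolved). The `last_seen` guard that skips the increment on a same-day re-upload is an assumption: it is one possible way to keep `seen_count` stable when the same file is uploaded twice in a day, not something this brief specifies. `apply_upload` and the item fields are hypothetical names.

```python
from datetime import date


def apply_upload(existing, uploaded, vertical, upload_date):
    """Update `existing` in place from one vertical's upload.

    existing: dict keyed by (vertical, hostname, metric_id).
    uploaded: iterable of (hostname, metric_id) pairs from the file.
    """
    seen = set()
    for hostname, metric_id in uploaded:
        key = (vertical, hostname, metric_id)
        seen.add(key)
        item = existing.get(key)
        if item is None:
            # New item
            existing[key] = {"seen_count": 1, "last_seen": upload_date, "resolved": False}
        elif item["last_seen"] != upload_date:
            # Recurring item, first upload of this day (assumed guard)
            item["seen_count"] += 1
            item["last_seen"] = upload_date
    for key, item in existing.items():
        # Missing items are resolved, but only inside the uploaded vertical
        if key[0] == vertical and key not in seen:
            item["resolved"] = True
    return existing


state = {}
day_one = [("rtr-1", "5.5.4i"), ("rtr-2", "5.5.4i")]
apply_upload(state, day_one, "NTS_AEO", date(2026, 5, 14))
after_first = {key: dict(value) for key, value in state.items()}

# Same file, same day: the state should not move.
apply_upload(state, day_one, "NTS_AEO", date(2026, 5, 14))
same_day_unchanged = state == after_first

# Next day rtr-2 is gone from the file, so it is resolved and rtr-1 recurs.
apply_upload(state, [("rtr-1", "5.5.4i")], "NTS_AEO", date(2026, 5, 15))
```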


Drill-Down Hierarchy

Executive Overview (all verticals aggregated)
  │
  ├── Stats: 2.1M devices, 97% compliant, target 95%
  ├── Trend: monthly compliance % with forecast
  ├── Donut: blocked vs in-progress (non-compliant devices)
  │
  └── Vertical Breakdown Table
        │
        ├── NTS_AEO — 99% — 2,163 non-compliant — click to drill down
        │     │
        │     ├── Team Filter: [All (Rollup)] [ACCESS-ENG] [ACCESS-OPS] [INTELDEV] [STEAM]
        │     │
        │     ├── Metric Breakdown (expandable rows)
        │     │     ├── ▸ 5.5.4i (Vulnerability Mgmt) — 97.0% — 1,762 NC — target 80%
        │     │     │     ├── └ ACCESS-ENG:  7 compliant, 1 NC, 8 total — 88.0%
        │     │     │     ├── └ ACCESS-OPS:  64,051 compliant, 1,746 NC, 65,797 total — 97.0%
        │     │     │     ├── └ INTELDEV:    233 compliant, 11 NC, 244 total — 95.0%
        │     │     │     └── └ STEAM:       123 compliant, 4 NC, 127 total — 97.0%
        │     │     │
        │     │     ├── Click metric ID → Metric Sub-Team View
        │     │     │     ├── Stats: total 66,176 | compliant 64,414 | NC 1,762 | 97% | target 80%
        │     │     │     └── Sub-Team Table:
        │     │     │           ├── ACCESS-ENG — 8 total — 88.0% → click
        │     │     │           │     └── Device list (filtered to ACCESS-ENG)
        │     │     │           ├── ACCESS-OPS — 65,797 total — 97.0% → click
        │     │     │           │     └── Device list (filtered to ACCESS-OPS)
        │     │     │           ├── INTELDEV — 244 total — 95.0% → click
        │     │     │           └── STEAM — 127 total — 97.0% → click
        │     │     └── ...
        │     │
        │     └── Burndown: blockers, with dates, projected clear date
        │
        ├── SDIT_CISO — 72% — 68 non-compliant
        └── ...

How Metrics Are Calculated

Data Sources

Each vertical's xlsx file contains two types of data:

  1. Summary sheet — one row per metric per sub-team, with pre-calculated totals (compliant, non-compliant, total, compliance %, target). This is the source of truth for aggregate numbers.

  2. Detail sheets — one sheet per metric, listing individual non-compliant devices (hostname, IP, device type, team). These feed the device-level drill-down.

The Double-Counting Problem (and How We Solve It)

The Summary sheet contains two levels of rows for each metric:

Row Type | Example | Purpose
--- | --- | ---
Sub-team rows | ACCESS-OPS, STEAM, INTELDEV | Individual team breakdown
Rollup row | ALL: NTS-AEO | Sum of all sub-teams for that metric

The rollup row already includes all sub-team totals. If you sum all rows naively, you count every device twice.

Solution: All aggregate calculations (stats bar, vertical breakdown, category totals, snapshots) use only the ALL: rollup rows. Sub-team rows are stored for drill-down display but never included in totals.
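
The effect is easy to see with the metric 5.5.4i numbers from the drill-down hierarchy above. This sketch assumes rollup rows can be recognized by a team label starting with "ALL:"; the dict layout and the `is_rollup` helper are illustrative.

```python
summary_rows = [
    {"team": "ACCESS-ENG", "compliant": 7, "non_compliant": 1, "total": 8},
    {"team": "ACCESS-OPS", "compliant": 64051, "non_compliant": 1746, "total": 65797},
    {"team": "INTELDEV", "compliant": 233, "non_compliant": 11, "total": 244},
    {"team": "STEAM", "compliant": 123, "non_compliant": 4, "total": 127},
    {"team": "ALL: NTS-AEO", "compliant": 64414, "non_compliant": 1762, "total": 66176},
]


def is_rollup(row):
    """Assumed convention: rollup rows carry a team label that starts with 'ALL:'."""
    return row["team"].startswith("ALL:")


# Naive sum counts every device twice: once in its sub-team row, once in the rollup.
naive_total = sum(row["total"] for row in summary_rows)

# Aggregates use rollup rows only.
rollup_total = sum(row["total"] for row in summary_rows if is_rollup(row))
rollup_compliant = sum(row["compliant"] for row in summary_rows if is_rollup(row))
compliance_pct = rollup_compliant / rollup_total * 100

# Sub-team rows remain available for drill-down and add up to the rollup.
sub_team_total = sum(row["total"] for row in summary_rows if not is_rollup(row))
```

Here the naive total is exactly double the rollup total, and the sub-team rows on their own sum to the rollup figure.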

What Each Number Means

Metric | Source | Meaning
--- | --- | ---
Total Devices | Sum of total from ALL: rows across all metrics for a vertical | Total device-metric pairs evaluated (a device appears once per metric it's measured against)
Compliant | Sum of compliant from ALL: rows | Device-metric pairs that pass the compliance check
Non-Compliant | Sum of non_compliant from ALL: rows | Device-metric pairs that fail
Compliance % | compliant / total * 100 | Percentage of device-metric pairs passing
Target % | Per-metric value from the spreadsheet (e.g., 95%, 80%, 75%) | The threshold set by the compliance program
Blockers | Non-compliant devices in compliance_items with no resolution_date | Devices with no committed remediation timeline
In-Progress | Non-compliant devices with a resolution_date set | Devices with a planned fix date

Important: "Total Devices" Is Not Unique Devices

A single physical device (hostname) can appear in multiple metrics. For example, one router might be measured against metric 5.5.4i (vulnerability scanning), 7.1.1 (logging), and 2.3.6i (patching). The "Total Devices" count is the sum of all device-metric evaluations, not unique hostnames.

This matches how CyberMetrics reports — each metric has its own scope of applicable devices, and the overall compliance percentage reflects performance across all metrics.
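
A small illustration of the difference, using made-up hostnames and the three metric IDs mentioned above:

```python
# Each tuple is one device-metric evaluation: (hostname, metric_id).
evaluations = [
    ("rtr-1", "5.5.4i"),
    ("rtr-1", "7.1.1"),
    ("rtr-1", "2.3.6i"),
    ("sw-2", "7.1.1"),
]

# What the dashboard labels "Total Devices": one count per device-metric pair.
total_devices = len(evaluations)

# Unique physical devices, which the dashboard does not report.
unique_hostnames = len({hostname for hostname, _ in evaluations})
```

Two physical devices produce a "Total Devices" figure of four because the router is measured against three metrics.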

Per-Metric Compliance Percentage

Each metric row shows its own compliance percentage, which comes directly from the Summary sheet's "Current Compliance" column. This is a decimal between 0 and 1 (displayed as 0–100% in the UI). The target is also per-metric: some metrics have a 95% target, others 80% or 75%, depending on the compliance program's priorities.

Category Aggregation

Metrics are grouped into categories (Logging & Monitoring, Vulnerability Management, Access & MFA, Endpoint Protection, etc.) based on a static mapping in compliance_config.json. The category cards in the drill-down view show the aggregate compliance % across all metrics in that category, using only rollup rows.
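
One way the category figure could be computed is as a weighted ratio: sum of compliant over sum of total across the category's rollup rows. Whether the dashboard weights by device count or averages the per-metric percentages is not stated here, so treat the weighting as an assumption. The mapping excerpt, the placement of 2.3.6i under Vulnerability Management, and the numbers for 2.3.6i and 7.1.1 are made up for the example.

```python
from collections import defaultdict

# Hypothetical excerpt of the static mapping kept in compliance_config.json.
METRIC_CATEGORY = {
    "5.5.4i": "Vulnerability Management",
    "2.3.6i": "Vulnerability Management",  # assumed category for the patching metric
    "7.1.1": "Logging & Monitoring",
}

# Rollup (ALL:) rows only; sub-team rows are excluded before this step.
rollup_rows = [
    {"metric_id": "5.5.4i", "compliant": 64414, "total": 66176},
    {"metric_id": "2.3.6i", "compliant": 900, "total": 1000},  # made-up numbers
    {"metric_id": "7.1.1", "compliant": 480, "total": 500},  # made-up numbers
]


def category_compliance(rows, mapping):
    """Return {category: compliance %} weighted by device-metric pairs."""
    sums = defaultdict(lambda: [0, 0])  # category -> [compliant, total]
    for row in rows:
        category = mapping.get(row["metric_id"], "Uncategorized")
        sums[category][0] += row["compliant"]
        sums[category][1] += row["total"]
    return {
        category: compliant / total * 100
        for category, (compliant, total) in sums.items()
        if total
    }


by_category = category_compliance(rollup_rows, METRIC_CATEGORY)
```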


Sub-Team Drill-Down

How It Works

When you click into a vertical (e.g., NTS_AEO), the metrics table shows the rollup totals by default — one row per metric with the ALL: numbers. Two mechanisms expose sub-team data:

1. Expand/Collapse (▸ arrow)

Click the arrow on any metric row to reveal sub-team rows inline beneath it. Each sub-team row shows that team's compliant/non-compliant/total/% for that specific metric. The sub-team rows are visually indented and teal-highlighted.

This is useful for: "Which team is dragging down metric 5.5.4i?"

2. Team Filter Buttons

A row of filter buttons appears above the metrics table showing all teams in that vertical (e.g., ACCESS-ENG, ACCESS-OPS, INTELDEV, STEAM). Click one to filter the entire table to show only that team's numbers per metric. The "All (Rollup)" button returns to the aggregated view.

This is useful for: "Show me STEAM's compliance across all metrics."

What "(Other)" Means

Some metrics have a team value of (Other) in the Summary sheet. This represents devices that don't map to a named sub-team. These are included in the ALL: rollup total but are not shown as a separate sub-team in the UI — they're noise for the compliance team's purposes.

Device-Level Drill-Down

Clicking a sub-team row in the metric sub-team view navigates to the device list — individual non-compliant hostnames for that vertical + metric + team combination. The device list is filtered to only show devices belonging to the selected team. This data comes from the detail sheets (not the Summary sheet) and shows:

  • Hostname, IP address, device type, team
  • Seen count (how many consecutive uploads this device has been non-compliant)
  • First seen / last seen dates
  • Resolution date (if set)
  • Remediation plan (if documented)

If a metric has no sub-team breakdown (e.g., only an "(Other)" team), a "View All Devices" button is shown instead, which loads the full unfiltered device list for that metric.

The full navigation path is:

Overview → Vertical → Metric (sub-team totals) → Team (device list)

Burndown Forecast

The burndown forecast answers: "When will this vertical reach compliance?"

How it works:

  1. Each non-compliant device can have a resolution_date set (target remediation date)
  2. Devices with dates are bucketed by month → "20 devices expected remediated in June, 35 in July"
  3. Devices without dates are counted as "blockers" — no committed timeline
  4. The trend chart uses linear regression on 3+ months of actual data to project a forecast line
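
The steps above can be sketched as two small helpers: one that buckets resolution dates by month and counts blockers, and one that fits an ordinary least-squares line to monthly compliance percentages. Both function names, the month-index x-axis, and the sample values are assumptions for illustration; the actual chart may fit the line differently.

```python
from collections import Counter
from datetime import date


def bucket_by_month(resolution_dates):
    """Return ({'YYYY-MM': count}, blockers) from a list of dates, with None meaning no date set."""
    dated = Counter(d.strftime("%Y-%m") for d in resolution_dates if d is not None)
    blockers = sum(1 for d in resolution_dates if d is None)
    return dict(dated), blockers


def linear_forecast(monthly_pct, months_ahead):
    """Fit y = intercept + slope * month_index by least squares and extend the line."""
    n = len(monthly_pct)
    if n < 3:
        raise ValueError("need at least 3 months of actual data")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_pct) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_pct)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n - 1 + k) for k in range(1, months_ahead + 1)]


buckets, blockers = bucket_by_month(
    [date(2026, 6, 3), date(2026, 6, 20), date(2026, 7, 1), None]
)
forecast = linear_forecast([90.0, 92.0, 94.0], months_ahead=2)
```

A straight-line fit will eventually project past 100%, so a real chart would presumably clamp or truncate the forecast line.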

What feeds it:

  • Resolution dates can be set manually (click device → set date) or via bulk upload (xlsx with Hostname + Resolution Date columns)
  • The existing bulk upload flow on the VCL page already supports this

What the compliance team sees:

  • Per-vertical: "NTS_AEO has 80 non-compliant, 25 are blockers, 55 have dates, projected clear by August 2026"
  • Aggregated: trend line showing whether the organization is on track to hit 95% target

What Does NOT Change

  • Existing AEO compliance upload (single file) — unchanged
  • Existing VCL report page (STEAM/ACCESS-ENG view) — unchanged
  • Existing compliance_items table structure — only adds a nullable vertical column
  • Python parser — reused as-is, no modifications
  • Auth model — same groups (Admin, Standard_User) required for upload

Deployment Options

Option | Description
--- | ---
Same instance | Add the feature to the existing dashboard. Multi-vertical data coexists with AEO data via the vertical column.
Separate instance | Deploy a fresh instance with its own database. Compliance team experiments freely. No risk to dev/production data.
Later: API integration | Replace xlsx upload with direct CyberMetrics API calls. Backend endpoints stay the same; only the client pushing data changes.

The architecture supports all three without code changes. The vertical column and scoped resolution logic work regardless of deployment topology.


Open Questions for the Meeting

  1. Vertical list — Are the 14 verticals in the screenshot the complete set, or do new verticals get added periodically? (Affects whether we hardcode a list or keep it dynamic.)

  2. Target % per vertical — Is the 95% target uniform across all verticals, or do different verticals have different targets?

  3. Access control — Should the compliance team have their own user accounts with a specific role, or do they use existing Admin/Standard_User groups?

  4. Naming — What should this page be called in the nav? "CCP Metrics", "VCL Multi-Vertical", "Compliance Reporting", something else?

  5. Retention — How long should historical upload data be kept? (Affects trend chart depth and storage.)


Timeline Estimate

Phase | Scope | Effort
--- | --- | ---
1. Migration + backend endpoints | Schema changes, upload flow, scoped resolution, stats/trend/drill-down APIs | 2–3 days
2. Frontend: upload modal | Multi-file drop, filename parsing, batch preview, commit | 1–2 days
3. Frontend: report page | Stats bar, vertical table, trend chart, donut, drill-down views | 2–3 days
4. Frontend: burndown | Per-vertical burndown chart, blocker counts, forecast | 1 day
5. Testing + polish | Property tests, edge cases, error handling, loading states | 1 day

Total: roughly 7–10 working days for the full POC.