Files

Jordan Ramos 8ebd7e4d5e Auto-sync .kiro/ from master (post-checkout hook)

2026-05-19 15:01:25 -06:00

47 KiB

Raw Blame History

Compliance Duplicate Failing Metrics Bugfix Design

Overview

Five compliance endpoints in backend/routes/compliance.js (GET /items, GET /items/:hostname, GET /vcl/stats, GET /mttr) and the compliance_snapshots block inside persistUpload() all share the same root cause: each one reads compliance_items with either no vertical filter or a filter that admits both legacy vertical IS NULL rows and multi-vertical vertical = 'NTS_AEO' rows, but does not deduplicate on (hostname, metric_id). When the same (hostname, metric_id) pair exists in two verticals (the documented multi-vertical workflow), the duplicate row distorts the response — it duplicates a metric chip in the UI, double-counts a device across teams, inflates aging buckets, inflates forecast row counts, and (in the snapshot block) lets a single hostname appear in both compliant and non_compliant columns of the same (snapshot_month, vertical) row.

The fix is uniform across endpoints: dedupe at the SQL layer using DISTINCT ON (hostname, metric_id) with a deterministic ORDER BY (highest seen_count, then most recent upload_id) so each unique violation contributes exactly one row to the aggregation, and rewrite the snapshot query so each hostname is classified by its MIN(status) (active wins over resolved) inside a CTE before the count. The groupByHostname helper and the /items/:hostname response builder also gain a defensive in-memory dedupe keyed by metric_id (or (metric_id, status) for the detail endpoint), so a duplicate row that slips through any unforeseen code path still cannot duplicate a chip in the UI.

The implementation is intentionally minimal: each fix changes a single SQL statement and (for two endpoints) a small JavaScript loop. No schema migration, no new column, no frontend change. The frontend ComplianceDetailPanel already keys metrics on metric_id for chip rendering and will render correctly once the API stops returning duplicate rows.

Glossary

Bug_Condition (C): The condition that triggers the bug — two or more rows in compliance_items share the same (hostname, metric_id) pair across verticals (i.e., one row with vertical IS NULL and another with vertical = 'NTS_AEO', or two non-null verticals that both pass an endpoint's WHERE clause).
Property (P): The desired behavior when C holds — each affected endpoint returns or counts each (hostname, metric_id) exactly once, using the row with the highest seen_count (with most recent upload_id as tiebreak) as the representative.
Preservation: Behavior on rows where (hostname, metric_id) is unique across verticals, on the empty-data response shape, and on unrelated query parameters (e.g., team and status on /items) — all must be byte-for-byte unchanged.
vertical: TEXT column on compliance_items and compliance_uploads identifying which xlsx (NTS_AEO, SDIT_CISO, TSI) the row originated from. NULL indicates a legacy AEO-only upload. The bug class is specifically about rows that share (hostname, metric_id) but differ in vertical.
groupByHostname(rows, noteHostnames): Helper in backend/routes/compliance.js (lines ~213–230) that flattens a list of joined compliance_items rows into one device object per hostname, pushing each row's metric_id onto failing_metrics. It performs no deduplication.
bucketAgingItems(items): Helper in backend/routes/compliance.js (lines ~234–254) that places each item into one of four seen_count buckets per team. It iterates rows directly, so duplicate rows produce double-counted buckets.
persistUpload(): Function in backend/routes/compliance.js (lines ~81–192) that writes a parsed upload to the DB and then writes per-vertical rows into compliance_snapshots. The snapshot query at the end of this function counts hostnames per team using COUNT(DISTINCT CASE WHEN status = 'X' THEN hostname END), which double-counts a hostname into both the compliant and non_compliant columns when the hostname has both active and resolved rows across verticals.
representative row: For a duplicated (hostname, metric_id), the row chosen by DISTINCT ON (hostname, metric_id) ... ORDER BY hostname, metric_id, seen_count DESC, upload_id DESC. This is deterministic and aligns with the "use the row with the highest seen_count or most recent upload_id" rule from the requirements.

Bug Details

Bug Condition

The bug manifests when two or more rows in compliance_items share the same (hostname, metric_id) pair across different vertical values. This happens whenever the same device fails the same metric in more than one vertical's xlsx — including the originally reported case where a legacy vertical IS NULL AEO upload and a newer vertical = 'NTS_AEO' multi-vertical upload both contain STEAM-INTERSIGHT failing 7.1.1.

Formal Specification:

FUNCTION isBugCondition(items)
  INPUT: items — list of compliance_items rows
  OUTPUT: boolean

  // The bug condition is triggered for any (hostname, metric_id) pair with more than one row
  GROUP items BY (hostname, metric_id) INTO groups
  RETURN EXISTS group IN groups WHERE COUNT(group) > 1
END FUNCTION

For a single endpoint response to be considered buggy, the API output must additionally fail one of the following invariants (the per-endpoint manifestation of the same root cause):

FUNCTION isBuggyResponse(endpoint, response)
  CASE endpoint OF
    '/items':              RETURN EXISTS device IN response.devices WHERE
                                  COUNT(device.failing_metrics) != COUNT(DISTINCT metric_id IN device.failing_metrics)
    '/items/:hostname':    RETURN EXISTS (metric_id, status) WITH COUNT(*) > 1 IN response.metrics
    '/vcl/stats heavy_hitters': RETURN SUM(hh.non_compliant FOR hh IN response.heavy_hitters) >
                                       COUNT(DISTINCT hostname WHERE has_active_violation)
    '/vcl/stats forecast': RETURN response.vertical_breakdown[i].blockers < 0
                                  FOR SOME i WITH duplicated (hostname, metric_id) AND non-null resolution_date
    '/mttr':               RETURN SUM(b.total FOR b IN response.aging) >
                                  COUNT(DISTINCT (hostname, metric_id) WHERE status = 'active')
    'persistUpload snapshot': RETURN EXISTS row IN compliance_snapshots WHERE
                                     compliant + non_compliant > total_devices
  END CASE
END FUNCTION

Examples

The originally reported case (GitLab issue #13, STEAM-INTERSIGHT / 172.16.30.40 / metric 7.1.1) and the four sibling manifestations:

/items — STEAM-INTERSIGHT has two active rows for metric_id = '7.1.1' (one with vertical IS NULL, one with vertical = 'NTS_AEO'). The handler's WHERE clause admits both via (ci.vertical IS NULL OR ci.vertical = 'NTS_AEO'), and groupByHostname() does an unconditional dev.failing_metrics.push(...) per row. The device's failing_metrics array contains two entries with metric_id = '7.1.1'. Expected: one entry per unique metric_id.
/items/:hostname — Same hostname, but the detail query has no vertical filter at all (just WHERE ci.hostname = $1). Both rows come back, the response builder maps metricRows.map(...) over both, and the frontend's MetricChip renders 7.1.1 twice in the "Failing Metrics" section. Expected: one entry per (metric_id, status) pair.
/vcl/stats heavy-hitters and per-team totals — A device whose team differs between verticals (e.g., STEAM in the legacy row and ACCESS-ENG in the NTS_AEO row) is counted under both teams by COUNT(DISTINCT hostname) ... GROUP BY team. The sum across heavy_hitters[*].non_compliant exceeds the dashboard's stats.non_compliant. Expected: the hostname is assigned to exactly one team — the team from its representative row — and the per-team sums reconcile with the global non-compliant count.
/vcl/stats forecast-burndown — For a team with one duplicated (hostname, metric_id) whose resolution_date is non-null in both rows, the forecast query SELECT resolution_date FROM compliance_items WHERE status = 'active' AND team = $1 AND resolution_date IS NOT NULL returns two rows. forecastItems.length is 2, but teamNonCompliant (from the de-team-counted DISTINCT-hostname query) is 1, so blockers = 1 - 2 = -1. The route then clamps to 0, hiding the underlying inconsistency. Expected: forecast is deduped by (hostname, metric_id) so the count matches teamNonCompliant's scoping and blockers is non-negative.
/mttr — Same duplicated (hostname, metric_id). bucketAgingItems() receives both rows, increments the bucket twice. The team total for STEAM (or whichever team appears in the duplicate row) is inflated. Expected: each unique (hostname, metric_id) contributes to exactly one bucket using a single representative seen_count.
Edge case — persistUpload() snapshot — A hostname has two rows: one with status = 'resolved' (legacy vertical) and one with status = 'active' (NTS_AEO vertical). The snapshot query
```
COUNT(DISTINCT CASE WHEN status = 'resolved' THEN hostname END) AS compliant,
COUNT(DISTINCT CASE WHEN status = 'active'   THEN hostname END) AS non_compliant
```
counts the hostname once in compliant and once in non_compliant, so compliant + non_compliant > total_devices. Expected: the hostname is classified by its worst-case status (active wins over resolved) inside a CTE so it appears in exactly one column.

Expected Behavior

Preservation Requirements

Unchanged Behaviors:

Rows where (hostname, metric_id) is unique across verticals: every endpoint returns the same numbers, in the same shape, in the same order as before the fix.
Empty-data responses: /items returns { devices: [], team, status }, /items/:hostname returns 404 for unknown hostnames, /vcl/stats returns its zero-state shape, /mttr returns { aging: [] }.
/items team and status query parameters: still validated against ALLOWED_TEAMS and ['active', 'resolved'], still reject invalid values with HTTP 400.
/items (ci.vertical IS NULL OR ci.vertical = 'NTS_AEO') predicate: kept as-is so AEO-only legacy data continues to surface alongside NTS_AEO multi-vertical data. The fix only adds dedup on top, it does not narrow the set of admitted verticals.
/items/:hostname ordering by status DESC, metric_id: unchanged so active metrics remain listed before resolved ones.
/vcl/stats donut categorization (blocked / in_progress by MAX(resolution_date) per hostname): already deduped by GROUP BY hostname and stays unchanged.
/vcl/stats global stats.compliant / stats.non_compliant: already use COUNT(DISTINCT hostname) and stay unchanged.
Frontend components (ComplianceDetailPanel, ComplianceCharts, CompliancePage): no changes required. They already render one chip per metric_id; the bug was purely an upstream data issue.
persistUpload() error handling: snapshot creation remains wrapped in try/catch that logs but does not fail the upload commit.

Scope: All endpoint inputs that do not involve cross-vertical duplicate (hostname, metric_id) rows must be byte-for-byte identical to the pre-fix output. The fix only changes what happens when two or more compliance_items rows share (hostname, metric_id) across verticals.

Hypothesized Root Cause

All five sites have the same shape of bug — missing dedup on (hostname, metric_id) after the multi-vertical migration admitted two-row scenarios — but with slightly different mechanics. Listing them explicitly so the test plan can confirm or refute each one:

/items — groupByHostname() pushes every row. The handler's WHERE clause (ci.vertical IS NULL OR ci.vertical = 'NTS_AEO') admits both verticals' rows. groupByHostname() then iterates each row and unconditionally calls dev.failing_metrics.push({ metric_id, ... }). There is no Set/Map keyed on metric_id, so a duplicate row produces a duplicate metric in the array.
/items/:hostname — no vertical filter, no dedup. The detail query is WHERE ci.hostname = $1 with no vertical predicate at all, so every vertical's row for that hostname is returned. The response builder does metricRows.map(r => ({ ...r })) — there is no dedup step. Every duplicate row becomes a duplicate entry in metrics.
/vcl/stats heavy-hitters and per-team totals — team chosen per row, not per hostname. The GROUP BY team ... COUNT(DISTINCT hostname) query is correct for "how many distinct hostnames does each team see," but a hostname with rows under two different teams (because team differs across verticals) is counted in both groups. The dashboard's global stats.non_compliant (a single COUNT(DISTINCT hostname) with no team scoping) does not match SUM(heavy_hitters[*].non_compliant).
/vcl/stats forecast-burndown — duplicate rows inflate forecast count. The query SELECT resolution_date FROM compliance_items WHERE status = 'active' AND team = $1 AND resolution_date IS NOT NULL returns one row per compliance_items row, not one row per (hostname, metric_id). The downstream blockers = teamNonCompliant - forecastItems.length calculation can go negative because forecastItems.length is inflated relative to the deduped teamNonCompliant.
/mttr — bucketAgingItems() iterates rows directly. The query SELECT seen_count, team FROM compliance_items WHERE status = 'active' returns every active row, and bucketAgingItems() does for (const item of items) { buckets[label].total += 1; ... }. There is no Set keyed on (hostname, metric_id), so each duplicate row increments its bucket twice.
persistUpload() snapshot — CASE WHEN status = double-counts. The snapshot query
```
COUNT(DISTINCT CASE WHEN status = 'resolved' THEN hostname END) AS compliant,
COUNT(DISTINCT CASE WHEN status = 'active'   THEN hostname END) AS non_compliant
```
counts a hostname in compliant if any of its rows is resolved AND in non_compliant if any of its rows is active. With duplicate rows that disagree on status across verticals, the same hostname lands in both columns.

The common structural cause is that the multi-vertical migration (add_vcl_multi_vertical.js) added a vertical column to compliance_items but did not retrofit existing read queries — which assumed (hostname, metric_id) was effectively a unique key — to either dedupe explicitly or scope to a single vertical.

Correctness Properties

Property 1: Bug Condition — /items failing_metrics array contains at most one entry per metric_id

For any set of compliance_items rows including cross-vertical duplicates of (hostname, metric_id), the response from GET /items SHALL contain, for every device in response.devices, exactly one entry per unique metric_id in device.failing_metrics. Formally: device.failing_metrics.length == new Set(device.failing_metrics.map(m => m.metric_id)).size.

Validates: Requirements 2.1

Property 2: Bug Condition — /items/:hostname returns one metric per (metric_id, status)

For any device with cross-vertical duplicate (hostname, metric_id) rows, the response from GET /items/:hostname SHALL contain exactly one entry per unique (metric_id, status) pair in response.metrics. Each entry's seen_count SHALL equal the maximum seen_count across the duplicate rows for that pair.

Validates: Requirements 2.2, 2.3

Property 3: Bug Condition — /vcl/stats per-team device counts equal COUNT(DISTINCT hostname)

For any set of compliance_items rows including cross-vertical duplicates where a hostname's team differs across verticals, the response from GET /vcl/stats SHALL satisfy SUM(heavy_hitters[*].non_compliant) == stats.non_compliant. Each hostname SHALL be assigned to exactly one team — the team from its representative row (highest seen_count, then most recent upload_id) — regardless of how many verticals contain rows for that hostname.

Validates: Requirements 2.5

Property 4: Bug Condition — /vcl/stats forecast-burndown is deduped by (hostname, metric_id) and blockers is non-negative

For any team with cross-vertical duplicate (hostname, metric_id) rows where both rows have a non-null resolution_date, the deduped forecast row count for that team SHALL contribute at most one entry per unique (hostname, metric_id), and blockers = teamNonCompliant - dedupedForecastCount SHALL be >= 0 (no clamp required to satisfy the invariant).

Validates: Requirements 2.6

Property 5: Bug Condition — /mttr aging buckets count each unique (hostname, metric_id) exactly once

For any set of active compliance_items rows including cross-vertical duplicates, the response from GET /mttr SHALL satisfy SUM(aging[*].total) == COUNT(DISTINCT (hostname, metric_id) WHERE status = 'active'). Each unique (hostname, metric_id) SHALL be bucketed exactly once using a single representative seen_count.

Validates: Requirements 2.7

Property 6: Bug Condition — persistUpload() snapshot rows satisfy compliant + non_compliant <= total_devices

For any persistUpload() invocation, every row written into compliance_snapshots for the current (snapshot_month, vertical) pair SHALL satisfy compliant + non_compliant <= total_devices. A hostname with both active and resolved rows for the same team SHALL be classified as non_compliant (active wins over resolved) and SHALL appear in exactly one of the two columns.

Validates: Requirements 2.4

Property 7: Preservation — non-duplicated rows are unchanged

For any set of compliance_items rows where every (hostname, metric_id) is unique across verticals, the responses from /items, /items/:hostname, /vcl/stats, /mttr, and the compliance_snapshots rows written by persistUpload() SHALL be identical to the pre-fix output for the same input. The fix SHALL NOT change behavior on the unique-(hostname, metric_id) case.

Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7

Property 8: Preservation — seen_count, first_seen, and last_seen are correctly aggregated for the representative row

For any duplicated (hostname, metric_id), the surviving entry SHALL carry seen_count = MAX(seen_count) across the duplicates, first_seen = MIN(first_seen) across the duplicates, and last_seen = MAX(last_seen) across the duplicates. Active/resolved status separation SHALL be preserved (a metric still appears in the active section if any of its duplicate rows is active).

Validates: Requirements 3.2, 3.3

Fix Implementation

Changes Required

All changes are in backend/routes/compliance.js. No schema migration, no new column, no frontend change. SQL-level dedup is preferred wherever possible because it avoids materialising duplicates into Node memory and centralises the representative-row policy at the data layer.

Fix 1: `GET /items` — `DISTINCT ON (hostname, metric_id)` in SQL, defensive dedupe in `groupByHostname`

File: backend/routes/compliance.js

Function: router.get('/items', ...) (around line 535) and helper groupByHostname (around line 213)

Specific Changes:

Rewrite the items query to use DISTINCT ON (hostname, metric_id) so each unique violation contributes exactly one row, choosing the representative row by highest seen_count and most recent upload_id:

SELECT DISTINCT ON (ci.hostname, ci.metric_id)
       ci.hostname, ci.ip_address, ci.device_type, ci.team,
       ci.metric_id, ci.metric_desc, ci.category, ci.status, ci.seen_count,
       fu.report_date AS first_seen,
       lu.report_date AS last_seen,
       ru.report_date AS resolved_on
FROM compliance_items ci
LEFT JOIN compliance_uploads fu ON ci.first_seen_upload_id = fu.id
LEFT JOIN compliance_uploads lu ON ci.upload_id = lu.id
LEFT JOIN compliance_uploads ru ON ci.resolved_upload_id = ru.id
WHERE ci.team = $1
  AND ci.status = $2
  AND (ci.vertical IS NULL OR ci.vertical = 'NTS_AEO')
ORDER BY ci.hostname, ci.metric_id, ci.seen_count DESC, ci.upload_id DESC

The ORDER BY lead expressions match the DISTINCT ON columns; the trailing expressions select the representative.

Add a defensive in-memory dedupe in groupByHostname() keyed on metric_id per device, so any future code path that bypasses the SQL dedupe still cannot duplicate a chip:

function groupByHostname(rows, noteHostnames) {
    const deviceMap = {};
    for (const row of rows) {
        if (!deviceMap[row.hostname]) {
            deviceMap[row.hostname] = {
                hostname: row.hostname, ip_address: row.ip_address || '', device_type: row.device_type || '',
                team: row.team || '', status: row.status, failing_metrics: [],
                _seenMetricIds: new Set(),
                seen_count: row.seen_count || 1, first_seen: row.first_seen || null,
                last_seen: row.last_seen || null, resolved_on: row.resolved_on || null,
                has_notes: noteHostnames.has(row.hostname),
            };
        }
        const dev = deviceMap[row.hostname];
        if (!dev._seenMetricIds.has(row.metric_id)) {
            dev._seenMetricIds.add(row.metric_id);
            dev.failing_metrics.push({ metric_id: row.metric_id, metric_desc: row.metric_desc || '', category: row.category || '' });
        }
        if ((row.seen_count || 1) > dev.seen_count) dev.seen_count = row.seen_count;
        if (row.first_seen && (!dev.first_seen || row.first_seen < dev.first_seen)) dev.first_seen = row.first_seen;
        if (row.last_seen && (!dev.last_seen || row.last_seen > dev.last_seen)) dev.last_seen = row.last_seen;
    }
    return Object.values(deviceMap).map(({ _seenMetricIds, ...dev }) => dev);
}

The _seenMetricIds set is stripped before the device is returned so the response shape is unchanged.

Fix 2: `GET /items/:hostname` — `DISTINCT ON (metric_id, status)` in SQL

File: backend/routes/compliance.js

Function: router.get('/items/:hostname', ...) (around line 575)

Specific Changes:

Rewrite the metrics query to dedupe on (metric_id, status) so a metric that appears as both active and resolved is preserved (one row each), but cross-vertical duplicates of the same (metric_id, status) collapse to one:

SELECT DISTINCT ON (ci.metric_id, ci.status)
       ci.metric_id, ci.metric_desc, ci.category, ci.status,
       ci.ip_address, ci.device_type, ci.team, ci.seen_count, ci.extra_json,
       ci.resolution_date, ci.remediation_plan,
       fu.report_date AS first_seen, fu.uploaded_at AS first_seen_at,
       lu.report_date AS last_seen,  lu.uploaded_at AS last_seen_at,
       ru.report_date AS resolved_on
FROM compliance_items ci
LEFT JOIN compliance_uploads fu ON ci.first_seen_upload_id = fu.id
LEFT JOIN compliance_uploads lu ON ci.upload_id = lu.id
LEFT JOIN compliance_uploads ru ON ci.resolved_upload_id = ru.id
WHERE ci.hostname = $1
ORDER BY ci.metric_id, ci.status, ci.seen_count DESC, ci.upload_id DESC

The existing post-query sort (status DESC for grouping, then metric_id) is rebuilt in JavaScript on the deduped result to preserve the response ordering:
```
metricRows.sort((a, b) => {
    if (a.status !== b.status) return b.status.localeCompare(a.status);  // 'resolved' < 'active' alphabetically; we want 'active' first
    return a.metric_id.localeCompare(b.metric_id);
});
```
(Note: the original SQL used ORDER BY ci.status DESC, ci.metric_id, which placed resolved before active because of DESC on the alphabetic order. The post-fix sort reproduces that exact ordering on the deduped rows.)
The identity lookup metricRows.find(r => r.status === 'active') || metricRows[0] is unchanged — it still picks the active representative for ip_address, device_type, team, resolution_date, and remediation_plan.

Fix 3: `GET /vcl/stats` heavy-hitters and per-team totals — dedupe to one team per hostname via CTE

File: backend/routes/compliance.js

Function: router.get('/vcl/stats', ...) (around line 990, the teamRows query)

Specific Changes:

Replace the heavy-hitters query with a CTE that first picks one representative row per hostname (using the same DISTINCT ON (hostname) policy: highest seen_count, then most recent upload_id), then groups by the representative's team:

WITH device_team AS (
    SELECT DISTINCT ON (hostname)
           hostname,
           COALESCE(team, 'Unknown') AS team,
           resolution_date
    FROM compliance_items
    WHERE status = 'active'
    ORDER BY hostname, seen_count DESC, upload_id DESC
)
SELECT team,
       COUNT(DISTINCT hostname)::int AS non_compliant,
       MAX(resolution_date)          AS compliance_date
FROM device_team
GROUP BY team
ORDER BY COUNT(DISTINCT hostname) DESC

Because the CTE already collapses each hostname to one row, COUNT(DISTINCT hostname) per team is equivalent to COUNT(*) per team, but DISTINCT is kept for defensive symmetry.

The dashboard's existing stats.non_compliant query (also COUNT(DISTINCT hostname) over all active rows, no team scoping) is unchanged — but now the property SUM(heavy_hitters[*].non_compliant) == stats.non_compliant holds because both numerators and denominators agree on "one hostname, one team."
The per-team-total query inside the for (const teamRow of teamRows) loop (SELECT COUNT(DISTINCT hostname) AS total FROM compliance_items WHERE COALESCE(team, 'Unknown') = $1) is similarly rewritten to use the same device_team-style CTE so the team's total_devices matches the team's non_compliant. The simplest form:
```
WITH device_team AS (
    SELECT DISTINCT ON (hostname)
           hostname,
           COALESCE(team, 'Unknown') AS team
    FROM compliance_items
    ORDER BY hostname, seen_count DESC, upload_id DESC
)
SELECT COUNT(*)::int AS total FROM device_team WHERE team = $1
```
Note: the loop runs once per team, so this CTE is computed N times — acceptable given the small team count (4 teams). If profiling shows this is a hot path, hoist the device_team CTE out of the loop and compute totals in a single grouped query.

Fix 4: `GET /vcl/stats` forecast-burndown — dedupe by `(hostname, metric_id)` in SQL

File: backend/routes/compliance.js

Function: router.get('/vcl/stats', ...) (around line 1015, the forecastItems query)

Specific Changes:

Rewrite the forecast query to DISTINCT ON (hostname, metric_id) so each unique violation contributes at most one resolution_date to the forecast bucketing:

SELECT DISTINCT ON (hostname, metric_id) resolution_date
FROM compliance_items
WHERE status = 'active'
  AND COALESCE(team, 'Unknown') = $1
  AND resolution_date IS NOT NULL
ORDER BY hostname, metric_id, seen_count DESC, upload_id DESC

The downstream blockers = teamNonCompliant - forecastItems.length calculation is unchanged. With both sides now deduped consistently — teamNonCompliant from the device_team CTE in Fix 3, forecastItems.length from the deduped query above — the difference is non-negative without any clamp.
The Math.max(blockers, 0) clamp at the existing call site is left in place as a belt-and-braces safeguard. It SHOULD be a no-op after the fix; if a regression introduces inconsistency again, the property test for Property 4 will catch it.

Fix 5: `GET /mttr` — dedupe by `(hostname, metric_id)` in SQL

File: backend/routes/compliance.js

Function: router.get('/mttr', ...) (around line 824)

Specific Changes:

Rewrite the query to DISTINCT ON (hostname, metric_id) so each unique active violation contributes exactly one (seen_count, team) row to bucketAgingItems():

SELECT DISTINCT ON (hostname, metric_id)
       COALESCE(seen_count, 1) AS seen_count,
       team
FROM compliance_items
WHERE status = 'active'
ORDER BY hostname, metric_id, seen_count DESC, upload_id DESC

bucketAgingItems() itself does not change. It already does the right thing when fed one row per unique violation.
SQL-level dedup is preferred over a JavaScript-side Set here because the helper is also called from /vcl/stats and changing its contract risks regressions in the other caller. Pushing the dedup into SQL keeps bucketAgingItems() pure and reusable.

Fix 6: `persistUpload()` snapshot block — classify hostnames via `MIN(status)` in a CTE

File: backend/routes/compliance.js

Function: persistUpload() (lines 81–192), specifically the verticalStats query at line 157

Specific Changes:

Rewrite the snapshot query so each hostname is classified by its worst-case status (active wins over resolved) inside a CTE before counting. Because 'active' < 'resolved' lexicographically, MIN(status) returns 'active' for a hostname that has any active row and 'resolved' only if all rows are 'resolved':

WITH hostname_status AS (
    SELECT team,
           hostname,
           MIN(status) AS status  -- 'active' beats 'resolved' alphabetically
    FROM compliance_items
    WHERE team IS NOT NULL
    GROUP BY team, hostname
)
SELECT team AS vertical,
       COUNT(*)::int                                                 AS total_devices,
       COUNT(*) FILTER (WHERE status = 'resolved')::int              AS compliant,
       COUNT(*) FILTER (WHERE status = 'active')::int                AS non_compliant
FROM hostname_status
GROUP BY team

Each hostname appears exactly once in hostname_status, so compliant + non_compliant == total_devices is structurally guaranteed.

The downstream INSERT ... ON CONFLICT (snapshot_month, vertical) DO UPDATE block is unchanged. It already keys snapshots on vertical, and the per-team counts now satisfy compliant + non_compliant <= total_devices.
The compliance_pct calculation Math.round((vs.compliant / total) * 100 * 100) / 100 is unchanged. It is now defended against the previous off-by-one: with the old query a hostname could be in both compliant and non_compliant, double-counting itself in compliant's numerator while only contributing once to total_devices's denominator. The fix removes that.

Note: Fix 6 is conceptually adjacent to but mechanically distinct from the persistUpload fix in compliance-duplicate-chart-entries. That spec adds a WHERE vertical = $1 filter to scope the snapshot to one vertical at a time. This spec adds the hostname_status CTE so that within whichever vertical is snapshotted, each hostname is classified once. Both fixes are independently necessary; landing them together is preferred but each is correct on its own.

Testing Strategy

Validation Approach

The bug condition is straightforward to construct: insert two compliance_items rows for the same (hostname, metric_id) with different vertical values, then call each affected endpoint. The two-phase approach is to first run the tests against the unfixed code to confirm the duplication counterexamples, then run the same tests against the fixed code and add property-based tests that explore the input space more broadly.

Exploratory Bug Condition Checking

Goal: Surface counterexamples that demonstrate each of the six manifestations BEFORE implementing the fix. Confirm or refute the root cause analysis for each endpoint independently — they share a structural cause but the SQL details differ.

Test Plan: Seed a clean test database with a fixture representing the original GitLab #13 scenario (STEAM-INTERSIGHT failing 7.1.1 in both vertical IS NULL and vertical = 'NTS_AEO'), plus targeted variants for each sibling endpoint. Call each affected endpoint and assert the buggy invariants. Run on UNFIXED code first.

Test Cases:

/items Duplicate Failing Metric Test — Insert two rows for (STEAM-INTERSIGHT, 7.1.1): one with vertical IS NULL, status = 'active', team = 'STEAM', another with vertical = 'NTS_AEO', status = 'active', team = 'STEAM'. Call GET /items?team=STEAM&status=active. Assert that response.devices[0].failing_metrics.filter(m => m.metric_id === '7.1.1').length === 1. (will fail on unfixed code — returns 2)
/items/:hostname Duplicate Metric Entry Test — Same fixture. Call GET /items/STEAM-INTERSIGHT. Assert response.metrics.filter(m => m.metric_id === '7.1.1' && m.status === 'active').length === 1. Assert response.metrics[0].seen_count === MAX(seen_count across the duplicate rows). (will fail on unfixed code — returns 2)
/vcl/stats Cross-Team Hostname Test — Insert two rows for (SOME-DEVICE, 7.1.1): one with team = 'STEAM', vertical IS NULL, another with team = 'ACCESS-ENG', vertical = 'NTS_AEO'. Call GET /vcl/stats. Assert SUM(heavy_hitters[*].non_compliant) === stats.non_compliant. (will fail on unfixed code — sum exceeds the global count by 1)
/vcl/stats Forecast Negative Blockers Test — Insert two rows for (SOME-DEVICE, 7.1.1) both with team = 'STEAM', status = 'active', both with resolution_date = '2025-09-30', but different verticals. Call GET /vcl/stats. Assert vertical_breakdown.find(v => v.team === 'STEAM').blockers >= 0 AND that the unclamped teamNonCompliant - dedupedForecastCount equals the reported blockers. (will fail on unfixed code — clamped from -1 to 0, hiding the inconsistency; the unclamped check makes the failure visible)
/mttr Inflated Bucket Test — Insert two rows for (SOME-DEVICE, 7.1.1) with seen_count = 5, team = 'STEAM', status = 'active', different verticals. Compare SUM(aging[*].total) to COUNT(DISTINCT (hostname, metric_id) WHERE status = 'active'). Assert equality. (will fail on unfixed code — bucket total exceeds distinct count by 1)
persistUpload() Snapshot Inconsistency Test — Pre-populate compliance_items with (SOME-DEVICE, 7.1.1) having one status = 'active' row and one status = 'resolved' row, both team = 'STEAM', different verticals. Call persistUpload() with a no-op upload. Read back the compliance_snapshots row for the current month and STEAM. Assert compliant + non_compliant <= total_devices. (will fail on unfixed code — the hostname is counted in both columns so the sum exceeds total_devices by 1)
Edge Case — Single-Vertical Regression Test — Insert a fixture where every (hostname, metric_id) is unique (no cross-vertical duplicates). Call all five read endpoints and capture responses. Apply the fix, re-run, and assert response equality (byte-for-byte). Run persistUpload() on a single-vertical fixture and assert the resulting compliance_snapshots rows are identical pre-fix and post-fix. (should pass on unfixed code; will pass on fixed code; protects the preservation property)

Expected Counterexamples:

/items returns failing_metrics arrays where length > Set(metric_ids).size. Cause: groupByHostname() pushes per row instead of per unique metric_id.
/items/:hostname returns metrics arrays where (metric_id, status) collisions exist. Cause: no vertical filter and no dedup in the detail query or response builder.
/vcl/stats returns heavy_hitters whose non_compliant values sum to more than the global stats.non_compliant. Cause: a hostname's team differs across verticals and the per-team COUNT(DISTINCT hostname) counts it under both teams.
/vcl/stats reports blockers clamped to 0 when the unclamped expression is negative. Cause: the forecast query is not deduped on (hostname, metric_id) so its row count exceeds teamNonCompliant.
/mttr returns aging whose total exceeds the distinct-violation count. Cause: bucketAgingItems() iterates rows directly with no dedup.
persistUpload() writes compliance_snapshots rows where compliant + non_compliant > total_devices. Cause: the snapshot query's CASE WHEN status = 'X' pattern lets a hostname appear in both columns.

Fix Checking

Goal: Verify that for all inputs where the bug condition holds (any (hostname, metric_id) shared by two or more rows across verticals), each fixed endpoint produces the expected deduped result.

Pseudocode:

FOR ALL items WHERE EXISTS (h, m) WITH COUNT(items WHERE hostname = h AND metric_id = m) > 1 DO
  items_response   := GET_items_fixed(items)
  detail_response  := GET_items_hostname_fixed(items, some_hostname_with_dups)
  stats_response   := GET_vcl_stats_fixed(items)
  mttr_response    := GET_mttr_fixed(items)
  snapshot_rows    := persistUpload_fixed(no_op_upload, items)

  ASSERT no_duplicate_metrics_per_device(items_response.devices)
  ASSERT no_duplicate_metric_status_pairs(detail_response.metrics)
  ASSERT sum_of_team_counts_equals_global(stats_response)
  ASSERT forecast_blockers_non_negative(stats_response.vertical_breakdown)
  ASSERT mttr_total_equals_distinct_violations(mttr_response.aging, items)
  ASSERT snapshot_invariant_holds(snapshot_rows)
END FOR

Preservation Checking

Goal: Verify that for all inputs where the bug condition does NOT hold (every (hostname, metric_id) is unique across verticals), the fixed endpoints produce results identical to the original endpoints.

Pseudocode:

FOR ALL items WHERE FORALL (h, m), COUNT(items WHERE hostname = h AND metric_id = m) <= 1 DO
  ASSERT GET_items_original(items)          = GET_items_fixed(items)
  ASSERT GET_items_hostname_original(items) = GET_items_hostname_fixed(items)
  ASSERT GET_vcl_stats_original(items)      = GET_vcl_stats_fixed(items)
  ASSERT GET_mttr_original(items)           = GET_mttr_fixed(items)
  ASSERT persistUpload_original(items).snapshots = persistUpload_fixed(items).snapshots
END FOR

Testing Approach: Property-based testing is the right fit for preservation checking here:

The unique-(hostname, metric_id) input space is large (any number of hostnames, any combination of metrics, any team and vertical mix) and exhaustive enumeration is impractical.
The preservation property is strict equality, which is well-suited to PBT shrinking — any counterexample is a small fixture demonstrating a behavior change.
The legacy AEO-only data shape (vertical IS NULL) must be exercised, which falls naturally out of generators that include null verticals.

Test Plan: Capture responses from the unfixed code on unique-(hostname, metric_id) fixtures (snapshot tests). After applying the fix, re-run the same fixtures and assert equality. Then run a property-based generator that produces random unique-key scenarios and asserts the same equality.

Test Cases:

Snapshot Equality — Empty State — Empty compliance_items. /items?team=STEAM returns { devices: [], team: 'STEAM', status: 'active' }. /items/:hostname returns 404. /vcl/stats returns its zero-state shape. /mttr returns { aging: [] }. persistUpload() writes no snapshot rows. Snapshot-test before and after the fix.
Snapshot Equality — Single AEO-Only Items — Items with vertical IS NULL only. Capture pre-fix responses, apply fix, assert equality across all five endpoints.
Snapshot Equality — Multiple Unique Verticals — A mix of items with vertical IS NULL and vertical = 'NTS_AEO', but no (hostname, metric_id) collision across verticals. Capture pre-fix responses, apply fix, assert equality.
/items Team and Status Filter Preservation — Active and resolved items mixed across teams. Assert ?team=STEAM&status=active, ?team=STEAM&status=resolved, and ?team=ACCESS-ENG&status=active each return the same devices pre-fix and post-fix on a unique-key fixture. Assert non-ALLOWED_TEAMS value (e.g., ?team=OTHER) returns HTTP 400.
/items/:hostname Active-Then-Resolved Ordering Preservation — A device with both active and resolved metrics. Assert the response's metrics array has all active entries before all resolved entries, sorted by metric_id within each group, identical to pre-fix.
/vcl/stats Donut Categorization Preservation — Active items with mixed null/non-null resolution_date. Assert donut.blocked and donut.in_progress counts match pre-fix on a unique-key fixture (this is already deduped via GROUP BY hostname in the existing query).
persistUpload() Snapshot Equality — Single-Status-Per-Hostname Fixture — Pre-populate compliance_items so every hostname has rows of only one status (all active or all resolved within a team). Run persistUpload(). Assert compliance_snapshots rows are identical pre-fix and post-fix.

Test Fixtures Needed

The following fixtures must be reusable across unit, integration, and property-based tests. Place them under backend/__tests__/fixtures/compliance-duplicate-failing-metrics/ (or inline as factory functions in the test file, matching the convention used in vcl-compliance-reporting.test.js).

fixtureCrossVerticalDuplicateActive — A single hostname with two rows for the same metric_id, both status = 'active', both team = 'STEAM', one with vertical IS NULL and one with vertical = 'NTS_AEO'. Different seen_count (e.g., 3 and 5) so the representative-row policy is exercised. Used by /items, /items/:hostname, /mttr, and the /vcl/stats forecast tests.
fixtureCrossVerticalTeamMismatch — A single hostname with two active rows for the same metric_id but different team across verticals (STEAM in legacy, ACCESS-ENG in NTS_AEO). Used by the /vcl/stats heavy-hitters and per-team-totals test (Property 3).
fixtureCrossVerticalStatusMismatch — A single hostname with two rows for the same metric_id and same team, but different status across verticals (active in legacy, resolved in NTS_AEO). Used by the persistUpload() snapshot test (Property 6).
fixtureMultiHostnameMixed — A combination of unique-key hostnames and one duplicated (hostname, metric_id). Used to verify that the dedup applies only to the duplicates and leaves unique entries untouched.
fixtureLegacyOnly — All rows with vertical IS NULL. Used for preservation checks that confirm legacy data flows are unchanged.
fixtureNTSAEOOnly — All rows with vertical = 'NTS_AEO'. Used for preservation checks that confirm multi-vertical-only data flows are unchanged.
fixtureForecastDuplicateResolutionDate — A duplicated (hostname, metric_id) where both rows have a non-null resolution_date (same value). Used to verify Property 4 (blockers >= 0).
fixtureSeenCountTiebreak — A duplicated (hostname, metric_id) where both rows have the same seen_count but different upload_id. Used to verify the upload_id DESC tiebreaker in the DISTINCT ON ... ORDER BY policy.

Unit Tests

/items: insert fixtureCrossVerticalDuplicateActive, assert failing_metrics.length === 1 and seen_count === 5.
/items/:hostname: insert fixtureCrossVerticalDuplicateActive, assert metrics.length === 1 for (7.1.1, active).
/items/:hostname ordering: insert a fixture with one active and one resolved metric, assert metrics[0].status === 'active' and metrics[1].status === 'resolved'.
/vcl/stats heavy-hitters: insert fixtureCrossVerticalTeamMismatch, assert hostname counted exactly once in the team derived from the representative row.
/vcl/stats forecast: insert fixtureForecastDuplicateResolutionDate, assert blockers >= 0 and forecastItems.length === teamNonCompliant.
/mttr: insert fixtureCrossVerticalDuplicateActive with seen_count = 5, assert exactly one increment in the 4–6 cycles bucket for STEAM.
persistUpload() snapshot: insert fixtureCrossVerticalStatusMismatch, run persistUpload(), assert the resulting snapshot row has compliant === 0, non_compliant === 1, total_devices === 1.
groupByHostname() direct: pass a list with two rows for the same (hostname, metric_id), assert one entry in failing_metrics. This protects the helper's contract independently of the SQL dedup.

Property-Based Tests

Use fast-check (already in use in backend/__tests__/*.property.test.js). Generators should mix vertical IS NULL and vertical = 'NTS_AEO' rows, with controllable rates of (hostname, metric_id) collision.

Property 1 — /items: For any list of compliance_items rows including cross-vertical duplicates, assert that for every device in the response, device.failing_metrics.length === new Set(device.failing_metrics.map(m => m.metric_id)).size.
Property 2 — /items/:hostname: For any duplicated (hostname, metric_id, status), assert response.metrics.filter(m => m.metric_id === target_metric && m.status === target_status).length === 1. Assert the surviving entry's seen_count is the maximum across duplicates.
Property 3 — /vcl/stats: For any input including team-mismatched cross-vertical duplicates, assert SUM(heavy_hitters[*].non_compliant) === stats.non_compliant. Generate inputs that vary the rate of team mismatch from 0% to 100%.
Property 4 — /vcl/stats forecast: For any input including duplicated (hostname, metric_id) with non-null resolution dates, assert vertical_breakdown[*].blockers >= 0 and assert the unclamped expression equals teamNonCompliant - dedupedForecastCount.
Property 5 — /mttr: For any input including cross-vertical duplicate active rows, assert SUM(aging[*].total) === COUNT(DISTINCT (hostname, metric_id) WHERE status = 'active') over the input. Per-team totals: assert SUM(aging[*][team]) over each team equals the distinct count for that team.
Property 6 — persistUpload() snapshot invariant: For any combination of cross-vertical status mismatches, assert every row written into compliance_snapshots satisfies compliant + non_compliant <= total_devices and compliant + non_compliant === total_devices (equality, since every hostname is classified into exactly one column).
Property 7 — Preservation: For any input where every (hostname, metric_id) is unique across verticals, the responses from all five endpoints SHALL equal the responses on the same input from the original (unfixed) implementations. Use fast-check's shrinking to surface the smallest counterexample if the property fails.
Property 8 — Representative-row policy: For any duplicated (hostname, metric_id), the surviving entry's seen_count SHALL equal max(duplicates.seen_count), and ties SHALL be broken by max(duplicates.upload_id).

Integration Tests

Full flow: upload a legacy AEO xlsx, then upload an NTS_AEO multi-vertical xlsx that overlaps on at least one (hostname, metric_id). Call /items, /items/:hostname, /vcl/stats, /mttr, and read compliance_snapshots for the current month. Assert all six post-fix invariants hold.
Cross-page consistency: load /vcl/stats and /items?team=STEAM for the same backing data. Assert that the device count from /vcl/stats heavy-hitters for STEAM equals the number of devices in /items?team=STEAM&status=active.
ComplianceDetailPanel render: load a device with cross-vertical duplicates via /items/:hostname, render ComplianceDetailPanel, assert exactly one MetricChip with the duplicated metric_id is present in the "Failing Metrics" section.
persistUpload() end-to-end: with cross-vertical status mismatch present, run a fresh upload and assert the new compliance_snapshots row's invariant holds.

47 KiB Raw Blame History Unescape Escape

Compliance Duplicate Failing Metrics Bugfix Design

Overview

Glossary

Bug Details

Bug Condition

Examples

Expected Behavior

Preservation Requirements

Hypothesized Root Cause

Correctness Properties

Fix Implementation

Changes Required

Fix 1: GET /items — DISTINCT ON (hostname, metric_id) in SQL, defensive dedupe in groupByHostname

Fix 2: GET /items/:hostname — DISTINCT ON (metric_id, status) in SQL

Fix 3: GET /vcl/stats heavy-hitters and per-team totals — dedupe to one team per hostname via CTE

Fix 4: GET /vcl/stats forecast-burndown — dedupe by (hostname, metric_id) in SQL

Fix 5: GET /mttr — dedupe by (hostname, metric_id) in SQL

Fix 6: persistUpload() snapshot block — classify hostnames via MIN(status) in a CTE

Testing Strategy

Validation Approach

Exploratory Bug Condition Checking

Fix Checking

Preservation Checking

Test Fixtures Needed

Unit Tests

Property-Based Tests

Integration Tests

47 KiB

Raw Blame History

Fix 1: `GET /items` — `DISTINCT ON (hostname, metric_id)` in SQL, defensive dedupe in `groupByHostname`

Fix 2: `GET /items/:hostname` — `DISTINCT ON (metric_id, status)` in SQL

Fix 3: `GET /vcl/stats` heavy-hitters and per-team totals — dedupe to one team per hostname via CTE

Fix 4: `GET /vcl/stats` forecast-burndown — dedupe by `(hostname, metric_id)` in SQL

Fix 5: `GET /mttr` — dedupe by `(hostname, metric_id)` in SQL

Fix 6: `persistUpload()` snapshot block — classify hostnames via `MIN(status)` in a CTE