Split Architecture Proposal: Collector + API Server

Author: Infrastructure Team
Date: 2026-06-08
Status: Draft — Pending Review
Scope: Scale CVE Dashboard from 2 teams / ~15 users to company-wide deployment (100+ users, 15+ teams)


Executive Summary

The STEAM Security Dashboard currently runs as a monolithic single-process Express application on CT107 (dashboard-dev, 71.85.90.9). This single process simultaneously serves the frontend, handles all API requests, and performs background data collection from Ivanti, Jira, CARD, Atlas, and NVD APIs.

At current scale (2 teams, <15 users, daily sync), this architecture works. At company-wide scale (15+ teams, hundreds of users, sub-hourly sync), it will not. This document proposes a phased transition to a Collector + API Server architecture that separates data ingestion from request serving.

Critical constraint: CT107 (71.85.90.9) has the firewall rules granting access to the production Ivanti, Jira, and CARD APIs. The collector component must remain on this machine or firewall rules must be extended.


Current Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    CT107 (dashboard-dev)                         │
│                    71.85.90.9 — 48 GB RAM, 250 GB Disk          │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              Express Process (port 3001/3100)             │  │
│  │                                                           │  │
│  │  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐  │  │
│  │  │  React SPA  │  │   API Routes │  │  Sync Workers  │  │  │
│  │  │  (static)   │  │  (50+ endpts)│  │  (setInterval) │  │  │
│  │  └─────────────┘  └──────────────┘  └────────────────┘  │  │
│  │                          │                    │           │  │
│  │                          │    Shared PG Pool (10 conn)    │  │
│  │                          │          │                     │  │
│  └──────────────────────────┼──────────┼─────────────────────┘  │
│                             │          │                         │
│  ┌──────────────────────────▼──────────▼─────────────────────┐  │
│  │         PostgreSQL 16 (Docker, port 5433)                 │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Firewall Access: Ivanti API, Jira DC, CARD API, Atlas API      │
└─────────────────────────────────────────────────────────────────┘

Key Metrics (Current)

| Metric | Current Value | Company-Wide Projection |
|---|---|---|
| Concurrent users | 5-15 | 100-300 |
| Teams tracked | 2 | 15+ |
| Ivanti findings (open) | ~200-500 | 2,000-10,000+ |
| Ivanti sync frequency | 24h | 1-4h desired |
| PG connection pool | 10 | Insufficient |
| Jira API rate limit | 1,440/day | Shared across all users |
| Data sources | 5 (Ivanti, NVD, Jira, Atlas, CARD) | 8+ (add CrowdStrike, Qualys, Tanium) |

Problem Statement

1. Sync Blocks the API Server

syncFindings() runs sequentially through:

  1. Fetch all open findings pages (100/page)
  2. Upsert findings batch into PostgreSQL
  3. Detect archive changes (compare all previous vs current)
  4. Fetch all closed findings pages
  5. Upsert closed findings
  6. Run BU drift checker (makes additional API calls per disappeared finding)
  7. Sync FP workflow counts (sweeps all closed pages again)
  8. Compute and store anomaly summary
  9. Record counts history
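
Step 3 (archive change detection) is essentially a set comparison between the previous and current sync. A minimal sketch, assuming findings are identified by a stable ID; the function name and arguments are hypothetical and not taken from the codebase:

```javascript
// Sketch only: names are illustrative, not the dashboard's actual API.
// Classifies finding IDs by comparing the previous open set with the current one.
function detectArchiveChanges(previousOpenIds, currentOpenIds, archivedIds) {
  const previous = new Set(previousOpenIds);
  const current = new Set(currentOpenIds);
  const archived = new Set(archivedIds);

  // Open last sync, absent now: candidates for archiving and the BU drift check.
  const disappeared = [...previous].filter((id) => !current.has(id));
  // Archived earlier, present again in the current open set.
  const returned = [...current].filter((id) => archived.has(id));

  return { disappeared, returned };
}

const changes = detectArchiveChanges(['F1', 'F2', 'F3'], ['F2', 'F3', 'F9'], ['F9']);
console.log(changes); // { disappeared: [ 'F1' ], returned: [ 'F9' ] }
```

Using sets keeps the comparison linear in the number of findings, which matters more at the projected 10,000-finding scale than at 500.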

At 500 findings, this takes 2-5 minutes. At 10,000 findings across 15 teams, this could take 15-30 minutes. During sync, the Express process is saturated: API responses slow down and requests contend for the connection pool.

2. Single Point of Failure

One process handles everything. A memory leak during sync, an unhandled promise rejection in the BU drift checker, or a runaway loop in archive detection crashes the entire dashboard for all users.

3. Connection Pool Exhaustion

10 connections shared between:

  • User-facing read queries (findings list, compliance items, charts)
  • Sync bulk upserts (batches of 100 rows × 18 columns)
  • User writes (notes, overrides, queue operations)

The pool already logs warnings at 8/10 active. At 100+ concurrent users issuing reads while a sync writes thousands of rows, this will deadlock or time out.

4. Rate Limits Shared Across Functions

Jira's 1,440/day limit is consumed by both background sync and user-initiated operations (lookups, ticket creation). A bulk sync could exhaust the daily budget, blocking users from creating tickets the rest of the day.

5. No Horizontal Scaling Path

Cannot add a second API server without also duplicating the sync scheduler, which would cause duplicate syncs, double-writes, and race conditions.

6. Firewall Constraint

CT107 has the only firewall access to production Ivanti, Jira, and CARD APIs. The collector (data fetcher) must run on this machine. The API server could potentially move elsewhere, but the collector cannot without firewall changes.


Proposed Architecture

Target State

┌─────────────────────────────────────────────────────────────────┐
│                    CT107 (dashboard-dev)                         │
│                    71.85.90.9 — 48 GB RAM, 250 GB Disk          │
│                    ★ Firewall access to prod APIs ★             │
│                                                                 │
│  ┌───────────────────────────────────┐  ┌─────────────────────┐│
│  │   API Server (Express, port 3001) │  │  Collector Service  ││
│  │                                   │  │  (Node.js worker)   ││
│  │  • React SPA serving             │  │                     ││
│  │  • All /api/* read endpoints     │  │  • Ivanti sync      ││
│  │  • User writes (notes, queue)    │  │  • Jira bulk sync   ││
│  │  • On-demand lookups (proxied)   │  │  • CARD cache sync  ││
│  │  • Triggers collector via        │  │  • Atlas cache sync ││
│  │    pg NOTIFY                     │  │  • NVD bulk sync    ││
│  │                                   │  │  • Archive detect   ││
│  │  Pool: 15 conn (reads + writes)  │  │  • BU drift checker ││
│  │                                   │  │  • Anomaly compute  ││
│  └───────────────┬───────────────────┘  │  • Compliance parse ││
│                  │                       │                     ││
│                  │                       │  Pool: 10 conn      ││
│                  │                       │  (bulk upserts)     ││
│                  │                       │                     ││
│                  │                       │  Listens:           ││
│                  │                       │    pg LISTEN         ││
│                  │                       │    'sync_trigger'    ││
│                  │                       └──────────┬──────────┘│
│                  │                                  │           │
│  ┌───────────────▼──────────────────────────────────▼─────────┐│
│  │              PostgreSQL 16 (Docker, port 5433)              ││
│  │              Pool: 25 total connections allocated           ││
│  └────────────────────────────────────────────────────────────┘│
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Component Responsibilities

API Server (cve-api.service)

| Responsibility | Details |
|---|---|
| Frontend serving | Static React build via express.static |
| Read endpoints | All GET routes: findings, compliance, charts, exports |
| User writes | Notes, overrides, queue items, ticket CRUD, KB uploads |
| On-demand lookups | Single NVD lookup, single Jira issue lookup, CARD real-time queries |
| Sync trigger | SELECT pg_notify('sync_trigger', '{"type":"findings","user":"admin"}') |
| Health/status | Expose collector status via sync_state table reads |

Collector (cve-collector.service)

| Responsibility | Details |
|---|---|
| Scheduled syncs | Ivanti findings (configurable interval), workflows (24h) |
| Bulk API operations | Jira JQL sync-all, Atlas cache refresh, NVD bulk sync |
| Post-sync processing | Archive detection, BU drift classification, closed-gone detection |
| Anomaly computation | Open/closed deltas, classification breakdown, significance flagging |
| Compliance parsing | Spawns Python subprocess for xlsx parsing on upload commit |
| Event-driven triggers | Runs LISTEN sync_trigger to receive on-demand requests |
| Rate budget management | Owns the Jira daily/burst counters; API server gets a reserved allocation |

Communication Pattern

User clicks "Sync" in UI
         │
         ▼
API Server receives POST /api/ivanti/findings/sync
         │
         ▼
API Server: SELECT pg_notify('sync_trigger', '{"type":"findings"}')
         │
         ▼
API Server responds: { status: 'sync_started', message: 'Check /sync-status' }
         │
         ▼
Collector receives NOTIFY, starts syncFindings()
         │
         ▼
Collector updates ivanti_sync_state (status='syncing')
         │
         ▼
Collector completes, updates ivanti_sync_state (status='success')
         │
         ▼
Frontend polls GET /api/ivanti/findings/sync-status → sees 'success' → refreshes

No Redis. No message broker. Just PostgreSQL LISTEN/NOTIFY — zero new infrastructure.
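
The trigger flow above can be sketched as two small functions. This is a sketch under assumptions, not the project's implementation: `pool` and `client` are assumed to follow the node-postgres (`pg`) interfaces (`query()` and the client `'notification'` event), and `requestSync`, `parseTrigger`, `listenForSync`, and the `handlers` map are hypothetical names.

```javascript
// Sketch only. `pool` and `client` are assumed to behave like node-postgres
// Pool and Client objects; every function name here is hypothetical.
const CHANNEL = 'sync_trigger';

// API server side: publish the trigger, then answer the user immediately.
async function requestSync(pool, payload) {
  await pool.query('SELECT pg_notify($1, $2)', [CHANNEL, JSON.stringify(payload)]);
  return { status: 'sync_started', message: 'Check /sync-status' };
}

// Turns a notification into a request object, or null if it should be ignored.
function parseTrigger(msg) {
  if (msg.channel !== CHANNEL) return null;
  try {
    return JSON.parse(msg.payload);
  } catch {
    return null; // Malformed payloads are dropped rather than crashing the collector.
  }
}

// Collector side: subscribe on a dedicated connection and dispatch by type.
async function listenForSync(client, handlers) {
  client.on('notification', (msg) => {
    const request = parseTrigger(msg);
    const handler = request && handlers[request.type];
    if (!handler) return;
    // Run the sync without blocking the notification callback; log failures.
    Promise.resolve()
      .then(() => handler(request))
      .catch((err) => console.error('sync failed:', err.message));
  });
  await client.query(`LISTEN ${CHANNEL}`);
}
```

Two properties of LISTEN/NOTIFY shape this design. The listener needs its own long-lived connection rather than one borrowed from the pool, and notifications are only delivered to sessions that are connected at the time, so the collector's regular schedule remains the fallback for a missed trigger.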


Phase Plan

Phase 0: Immediate Improvements (Weeks 1-2)

Goal: Reduce risk within the current monolith. No architectural changes.

| Task | Effort | Impact |
|---|---|---|
| Make POST /sync non-blocking (return immediately, let sync run in background) | 2h | Unblocks users during sync |
| Add GET /api/ivanti/findings/sync-status endpoint | 1h | Frontend can poll for completion |
| Increase PG pool from 10 → 20 connections | 10min | Headroom for concurrent operations |
| Add pg_stat_activity monitoring query to health endpoint | 30min | Visibility into pool pressure |
| Update frontend to poll sync-status instead of waiting | 2h | UX improvement |

Deliverables:

  • Updated ivantiFindings.js with async sync dispatch
  • New sync-status polling endpoint
  • Frontend ReportingPage sync UX updated
  • Pool configuration change in db.js
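
A minimal sketch of the non-blocking dispatch, assuming Express-style `req`/`res` objects. `createSyncRoutes` and the in-memory `state` object are hypothetical; `runSync` stands in for the existing `syncFindings()`.

```javascript
// Sketch only: `runSync` stands in for syncFindings(); the in-memory state
// object is an assumption that fits the Phase 0 single-process monolith.
function createSyncRoutes(runSync) {
  const state = { status: 'idle', startedAt: null, finishedAt: null, error: null };

  // POST handler: start the sync in the background and respond right away.
  function postSync(req, res) {
    if (state.status === 'syncing') {
      return res.status(409).json({ status: 'already_syncing' });
    }
    Object.assign(state, { status: 'syncing', startedAt: new Date(), finishedAt: null, error: null });
    Promise.resolve()
      .then(() => runSync())
      .then(() => { state.status = 'success'; })
      .catch((err) => { state.status = 'failed'; state.error = err.message; })
      .finally(() => { state.finishedAt = new Date(); });
    return res.status(202).json({ status: 'sync_started', message: 'Check /sync-status' });
  }

  // GET handler: the frontend polls this until status leaves 'syncing'.
  function getSyncStatus(req, res) {
    return res.json({ ...state });
  }

  return { postSync, getSyncStatus };
}
```

Wiring would look like `router.post('/sync', routes.postSync)` and `router.get('/sync-status', routes.getSyncStatus)`. In-memory state is lost on restart and is invisible to other processes, which is one reason Phase 1 moves status into a database table.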

Phase 1: Extract Collector (Weeks 3-4)

Goal: Separate data collection into its own process on CT107.

| Task | Effort | Impact |
|---|---|---|
| Create backend/collector.js (standalone Node process) | 4h | Fault isolation |
| Move sync functions from route files into shared lib/sync/ modules | 3h | Code reuse between collector and API |
| Implement pg LISTEN/NOTIFY trigger mechanism | 2h | API → Collector communication |
| Create cve-collector.service systemd unit | 30min | Process management |
| Add collector health check and status reporting | 1h | Observability |
| Update POST /sync routes to use pg_notify instead of inline sync | 1h | Complete decoupling |
| Add sync_jobs table for job tracking (queued, running, complete, failed) | 1h | Multi-user sync coordination |
| Update CI/CD pipeline to deploy collector service | 2h | Automated deployment |

Deliverables:

  • backend/collector.js — entry point for collector process
  • backend/lib/sync/ — shared sync logic (extracted from routes)
  • systemd/cve-collector.service — systemd unit
  • Updated .gitlab-ci.yml with collector deploy stage
  • sync_jobs table for job state tracking
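
One way the sync_jobs table could support multi-user coordination is by coalescing duplicate requests: if a job of the same type is already queued or running, a second click on "Sync" attaches to it instead of starting another. A sketch under that assumption, with an in-memory array standing in for the table and hypothetical field names:

```javascript
// Sketch only: `jobs` is an in-memory stand-in for the sync_jobs table, and
// the coalescing rule itself is an assumption, not a stated requirement.
const ACTIVE_STATES = new Set(['queued', 'running']);

function enqueueJob(jobs, type, requestedBy) {
  // Reuse an active job of the same type rather than queueing a duplicate.
  const active = jobs.find((job) => job.type === type && ACTIVE_STATES.has(job.status));
  if (active) return { job: active, created: false };

  const job = { id: jobs.length + 1, type, requestedBy, status: 'queued' };
  jobs.push(job);
  return { job, created: true };
}

const jobs = [];
console.log(enqueueJob(jobs, 'findings', 'alice').created); // true
console.log(enqueueJob(jobs, 'findings', 'bob').created);   // false
```

Against a real table the find-then-insert sequence would race between concurrent requests; a partial unique index on the job type restricted to the active states is one way PostgreSQL could enforce the same rule atomically.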

File structure after Phase 1:

backend/
├── server.js                # API server (unchanged entry point)
├── collector.js             # NEW — collector entry point
├── db.js                    # Shared pool config
├── lib/
│   └── sync/
│       ├── ivantiFindings.js    # Extracted from routes/ivantiFindings.js
│       ├── ivantiWorkflows.js   # Extracted from routes/ivantiWorkflows.js
│       ├── jiraBulkSync.js      # Extracted from routes/jiraTickets.js
│       ├── atlasCache.js        # Extracted from routes/atlas.js
│       ├── nvdBulkSync.js       # New — bulk NVD operations
│       ├── archiveDetection.js  # Extracted from routes/ivantiFindings.js
│       └── anomalyCompute.js    # Extracted from routes/ivantiFindings.js
├── routes/                  # API routes — now thin, read-heavy
└── helpers/                 # Shared API client helpers (unchanged)

Phase 2: Multi-Tenancy & Scale Hardening (Weeks 5-8)

Goal: Prepare for 15 teams and hundreds of users.

| Task | Effort | Impact |
|---|---|---|
| Per-team sync scheduling (stagger syncs to avoid API burst) | 3h | Spreads load |
| Jira rate budget partitioning (collector gets 80%, API gets 20%) | 2h | Prevents sync from starving users |
| Per-BU finding isolation (team users only see their findings) | 4h | Data scoping |
| Add connection pooling metrics endpoint (/api/admin/pool-stats) | 1h | Operational visibility |
| Implement sync queue with priority (user-triggered > scheduled) | 3h | Better UX |
| Add retry logic with exponential backoff to collector | 2h | Resilience |
| Partial-progress persistence (don't lose work on mid-sync failure) | 4h | Data integrity |
| PG connection pool separation: API pool (15) + Collector pool (10) | 1h | Isolation |
| Add PgBouncer or similar for connection multiplexing (optional) | 4h | Scale past 50 concurrent |
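
The retry task in the table above could look like the following. This is a sketch under assumptions: `withRetry`, `backoffDelayMs`, and the default values (4 retries, 1 s base, 60 s cap) are illustrative and not taken from the proposal.

```javascript
// Sketch only: function names and default values are illustrative assumptions.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 60000) {
  // attempt is 0-based: 1s, 2s, 4s, 8s, ... capped at maxMs.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

async function withRetry(task, { retries = 4, baseMs = 1000, maxMs = 60000, sleep } = {}) {
  // `sleep` is injectable so tests do not have to wait on real timers.
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task(attempt);
    } catch (err) {
      if (attempt >= retries) throw err; // Budget exhausted: surface the last error.
      await wait(backoffDelayMs(attempt, baseMs, maxMs));
    }
  }
}

console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n))); // [ 1000, 2000, 4000, 8000 ]
```

A collector step would wrap an outbound call as `withRetry(() => fetchPage(n))`, where `fetchPage` is a placeholder for the real API client. Adding random jitter to each delay is a common refinement when several team syncs can fail at the same moment.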

Deliverables:

  • Team-scoped sync scheduler in collector
  • Rate budget allocation system
  • Retry/backoff logic
  • Partial progress tracking
  • Pool separation
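
The rate budget allocation could be enforced with a small counter object. The 1,440/day limit and the 80/20 split come from this proposal; the `RateBudget` class and its method names are hypothetical.

```javascript
// Sketch only: class and method names are hypothetical. The daily limit and
// the 80/20 split are the figures stated in this proposal.
class RateBudget {
  constructor(dailyLimit = 1440, sharesPct = { collector: 80, api: 20 }) {
    this.limits = new Map();
    this.used = new Map();
    for (const [consumer, pct] of Object.entries(sharesPct)) {
      this.limits.set(consumer, Math.floor((dailyLimit * pct) / 100));
      this.used.set(consumer, 0);
    }
  }

  // Records the spend and returns true if the consumer still has budget.
  tryConsume(consumer, calls = 1) {
    if (!this.limits.has(consumer)) throw new Error(`unknown consumer: ${consumer}`);
    if (this.used.get(consumer) + calls > this.limits.get(consumer)) return false;
    this.used.set(consumer, this.used.get(consumer) + calls);
    return true;
  }

  remaining(consumer) {
    return this.limits.get(consumer) - this.used.get(consumer);
  }

  // Call at the daily rollover.
  reset() {
    for (const consumer of this.used.keys()) this.used.set(consumer, 0);
  }
}

const budget = new RateBudget();
console.log(budget.remaining('collector'), budget.remaining('api')); // 1152 288
```

Because the collector and the API server are separate processes after Phase 1, the counters would have to live somewhere both can reach, such as a PostgreSQL row; the in-memory maps here only illustrate the accounting.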

Phase 3: Additional Data Sources (Weeks 9-12)

Goal: Integrate CrowdStrike, Qualys, and Tanium feeds.

| Task | Effort | Impact |
|---|---|---|
| CrowdStrike Falcon API integration in collector | 8h | New vulnerability source |
| Qualys VMDR API integration in collector | 8h | New vulnerability source |
| Tanium asset inventory sync | 6h | Asset correlation |
| Cross-source finding deduplication logic | 6h | Data quality |
| Unified findings view (merged from all sources) | 4h | Single pane of glass |
| Source-specific sync schedules (configurable per source) | 2h | Flexibility |
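
Cross-source deduplication needs a matching key, which this proposal does not define. The sketch below assumes that a finding is the same across sources when the CVE ID and hostname match; that key, the field names (`cveId`, `hostname`, `source`), and the function name are all assumptions made for illustration.

```javascript
// Sketch only: the (cveId, hostname) key and every field name are assumptions.
function dedupeFindings(findings) {
  const merged = new Map();
  for (const finding of findings) {
    const cveId = finding.cveId.toUpperCase();
    const hostname = finding.hostname.toLowerCase();
    const key = `${cveId}|${hostname}`;

    const existing = merged.get(key);
    if (existing) {
      // Same vulnerability on the same host: remember every reporting source.
      if (!existing.sources.includes(finding.source)) existing.sources.push(finding.source);
    } else {
      merged.set(key, { cveId, hostname, sources: [finding.source] });
    }
  }
  return [...merged.values()];
}

const unified = dedupeFindings([
  { cveId: 'CVE-2024-0001', hostname: 'web01', source: 'ivanti' },
  { cveId: 'cve-2024-0001', hostname: 'WEB01', source: 'qualys' },
  { cveId: 'CVE-2024-0002', hostname: 'web01', source: 'crowdstrike' },
]);
console.log(unified.length); // 2
```

Hostname matching is fragile when sources disagree on naming (short name versus FQDN), so a real key may need the asset identifiers that the Tanium inventory sync is meant to provide.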

Note: All new API integrations go into the collector. The API server never makes outbound calls to external vulnerability platforms except for single-item on-demand lookups.

Firewall implications: CrowdStrike, Qualys, and Tanium API access will need firewall rules added to CT107 (71.85.90.9). Submit firewall requests in advance.


Phase 4: Horizontal Scaling (Weeks 13+)

Goal: Support 300+ concurrent users if company-wide adoption materializes.

| Task | Effort | Impact |
|---|---|---|
| Move API server to a separate LXC container (with more resources) | 4h | Dedicated API resources |
| Run multiple API server instances behind a load balancer | 8h | Horizontal scale |
| Keep collector on CT107 (firewall access) | 0h | No change needed |
| Add Redis for session store (replace PG sessions) | 4h | Multi-instance sessions |
| Add read replicas if PG becomes the bottleneck | 8h | Read scale |
| Evaluate moving PG to CT109 (zbl-indexer, 32GB/500GB) | 2h | Larger DB host |

Architecture at Phase 4:

                    ┌─────────────────┐
                    │  Load Balancer  │
                    │  (nginx/HAProxy)│
                    └────┬───────┬────┘
                         │       │
           ┌─────────────▼─┐   ┌─▼─────────────┐
           │  API Server 1 │   │  API Server 2 │   (New LXC or CT103)
           │  (Express)    │   │  (Express)    │
           └───────┬───────┘   └───────┬───────┘
                   │                   │
                   └─────────┬─────────┘
                             │
┌────────────────────────────▼──────────────────────────────────────┐
│                    CT107 (71.85.90.9)                              │
│                                                                   │
│  ┌─────────────────────────┐    ┌──────────────────────────────┐ │
│  │  Collector Service      │    │  PostgreSQL 16               │ │
│  │  (sole process with     │    │  (or moved to CT109)         │ │
│  │   firewall API access)  │    │                              │ │
│  └─────────────────────────┘    └──────────────────────────────┘ │
│                                                                   │
│  ★ Firewall: Ivanti, Jira, CARD, Atlas, CrowdStrike, Qualys ★   │
└───────────────────────────────────────────────────────────────────┘

Infrastructure Requirements

CT107 Resource Allocation (Current → Phase 2)

| Resource | Current | Phase 2 Target | Notes |
|---|---|---|---|
| RAM | 48 GB | 48 GB (sufficient) | Node processes use <2GB each |
| CPU | Shared | May need 4+ dedicated cores | Sync is CPU-intensive during transform |
| Disk | 250 GB | 250 GB (sufficient) | PG data + uploads + logs |
| PG connections | 10 | 25 (15 API + 10 collector) | Configure in postgresql.conf |
| Systemd services | 2 (backend + frontend) | 3 (api + collector + postgres) | Frontend served by API |

PostgreSQL Tuning (for 15 teams / hundreds of users)

# postgresql.conf changes
max_connections = 50            # Default is 100; 50 covers the 25 pooled connections with headroom
shared_buffers = 4GB            # 25% of the RAM budgeted for PG (implies ~16 GB of the 48 GB host)
effective_cache_size = 12GB     # 75% of that same budget; planner hint for OS-level caching
work_mem = 64MB                 # Per sort/hash operation; one query may use several at once
maintenance_work_mem = 512MB    # For VACUUM, CREATE INDEX
wal_level = replica             # Already the default in PG 16; required if read replicas are added later
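
A quick arithmetic check that the pool sizes fit under `max_connections`. The helper name is hypothetical; the pool sizes are the Phase 2 targets from this document, and 3 is PostgreSQL's default for `superuser_reserved_connections`.

```javascript
// Sketch only: helper name is hypothetical. PostgreSQL reserves
// superuser_reserved_connections (default 3) out of max_connections.
function connectionHeadroom({ maxConnections, superuserReserved = 3, pools }) {
  const pooled = Object.values(pools).reduce((sum, size) => sum + size, 0);
  const available = maxConnections - superuserReserved;
  return { pooled, available, headroom: available - pooled };
}

console.log(connectionHeadroom({ maxConnections: 50, pools: { api: 15, collector: 10 } }));
// { pooled: 25, available: 47, headroom: 22 }
```

The headroom has to absorb anything outside the two pools, for example the collector's dedicated LISTEN connection, ad hoc psql sessions, and monitoring queries.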

Firewall Dependencies

| Service | Endpoint | Required By | Current Access |
|---|---|---|---|
| Ivanti/RiskSense | platform4.risksense.com:443 | Collector | CT107 only |
| Jira Data Center | jira.charter.com:443 | Collector + API (lookups) | CT107 only |
| CARD API | card.charter.com:443 | API (real-time) | CT107 only |
| Atlas InfoSec | (internal) | Collector | CT107 only |
| NVD API | services.nvd.nist.gov:443 | Collector + API | Public |
| CrowdStrike | api.crowdstrike.com:443 | Collector | Firewall request needed |
| Qualys | qualysapi.qualys.com:443 | Collector | Firewall request needed |
| Tanium | (internal) | Collector | Firewall request needed |

Key constraint: If the API server moves off CT107 in Phase 4, you'll need firewall rules for the new host to reach Jira (for user lookups) and CARD (for real-time queries). Alternatively, the collector could proxy those on-demand requests — adds latency but avoids firewall changes.


Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Collector crash doesn't affect API users | | | This is the primary benefit of splitting |
| Collector and API race on DB writes | Medium | Low | Collector does bulk upserts; API does single-row writes. Different tables mostly. Use advisory locks for sync_state. |
| Sync trigger lost (pg NOTIFY missed) | Low | Medium | Collector also runs on a schedule. Missed trigger just delays to next interval. |
| Phase 1 introduces bugs in extraction | Medium | Medium | Comprehensive test suite exists. Run parallel (old monolith + new split) in staging for 1 week. |
| Firewall change delays block Phase 4 | High | Medium | Start firewall requests early. Phase 4 is optional: the single-machine split (Phases 1-3) works fine at 15 teams. |
| PG becomes bottleneck at 300+ users | Low | High | Phase 4 addresses with read replicas. CT109 (500GB, 32GB) available as larger DB host. |

Decision Points

These require team/leadership input before proceeding:

  1. Sync frequency target: Is 1-hour sync acceptable, or do teams need near-real-time (15 min)? This affects collector design complexity and API rate budget math.

  2. API server location: Keep everything on CT107, or move the API server to a separate container? Keeping it on CT107 is simpler (no firewall changes for CARD/Jira lookups) but limits scaling options.

  3. Database location: Keep PG on CT107, or move to CT109 (zbl-indexer, 500GB disk, 32GB RAM)? Moving adds network latency but gives more room for growth.

  4. CrowdStrike/Qualys/Tanium priority: Which new data sources are most urgent? This affects Phase 3 ordering and firewall request timing.

  5. Session management: At 300+ users, PG-backed sessions will be high-churn. Acceptable, or invest in Redis? Redis adds infrastructure but is the industry standard for session stores at scale.

  6. Multi-instance API: Is the goal to survive a single API server restart without downtime? If yes, Phase 4 (load balancer + multiple instances) is needed. If brief restarts during deploys are acceptable, single-instance on CT107 works through Phase 3.
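
For Decision Point 1, the rate budget math can be made concrete. The sketch below assumes the Jira limit of 1,440 calls/day and the 80% collector share proposed for Phase 2, and it assumes a Jira refresh would run on every sync cycle, which this document does not state. The function name is hypothetical.

```javascript
// Sketch only: assumes Jira calls are spent on every sync cycle, which is an
// illustration rather than a documented behavior of the dashboard.
function callsPerSync(dailyLimit, collectorSharePct, syncIntervalMinutes) {
  const collectorBudget = Math.floor((dailyLimit * collectorSharePct) / 100);
  const syncsPerDay = Math.floor((24 * 60) / syncIntervalMinutes);
  return {
    collectorBudget,
    syncsPerDay,
    callsPerSync: Math.floor(collectorBudget / syncsPerDay),
  };
}

console.log(callsPerSync(1440, 80, 60)); // { collectorBudget: 1152, syncsPerDay: 24, callsPerSync: 48 }
console.log(callsPerSync(1440, 80, 15)); // { collectorBudget: 1152, syncsPerDay: 96, callsPerSync: 12 }
```

Under these assumptions, moving from hourly to 15-minute syncs cuts the per-cycle Jira allowance from 48 calls to 12, which is the kind of trade-off the frequency decision has to weigh.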


Appendix: Current Data Flow Analysis

Data Collection Patterns

| Source | Trigger | Frequency | Data Volume | Processing |
|---|---|---|---|---|
| Ivanti Findings | Schedule + manual | 24h | 100-500 findings (all pages) | Extract, upsert, archive detect, BU drift, anomaly |
| Ivanti Workflows | Schedule + manual | 24h | 50 workflow batches | Store as JSON blob |
| Ivanti Closed Findings | During findings sync | 24h | All closed pages | Upsert + closed archive detection |
| Jira Bulk Sync | Manual (admin) | On-demand | All tracked tickets via JQL | Status/summary update per ticket |
| Jira Single Lookup | User action | Real-time | 1 issue | Proxy + display |
| NVD Lookup | User action | Real-time | 1 CVE | Proxy + optional save |
| NVD Bulk Sync | Manual | On-demand | All CVEs in DB | Batch update metadata |
| Atlas Action Plans | Cache refresh | Background | Per-host plan data | Cache in atlas_action_plans_cache |
| CARD Operations | User action | Real-time | 1 asset at a time | Proxy (confirm/decline/redirect) |
| Compliance xlsx | Manual upload | Weekly | 1 file → hundreds of rows | Python parse → PG upsert (transactional) |

What Moves to Collector vs Stays in API

| Operation | Collector | API Server | Rationale |
|---|---|---|---|
| Ivanti findings sync (all pages) | ✓ | | Heavy, multi-page, post-processing |
| Ivanti workflows sync | ✓ | | Scheduled background task |
| Ivanti closed sweep | ✓ | | Part of findings sync pipeline |
| Archive detection | ✓ | | CPU-intensive comparison |
| BU drift checker | ✓ | | Makes additional API calls |
| Anomaly computation | ✓ | | Depends on sync completion |
| Jira bulk sync-all | ✓ | | Consumes rate budget, multi-issue |
| NVD bulk sync | ✓ | | Multi-CVE, rate-limited |
| Atlas cache refresh | ✓ | | Background, per-host API calls |
| Compliance xlsx parse | ✓ | | Spawns Python, heavy DB writes |
| Single Jira lookup | | ✓ | User-initiated, real-time, 1 call |
| Single NVD lookup | | ✓ | User-initiated, real-time, 1 call |
| CARD operations | | ✓ | User-initiated, real-time |
| All GET /api/* reads | | ✓ | Pure DB queries, user-facing |
| Notes/overrides/queue | | ✓ | Small writes, user-facing |
| File uploads | | ✓ | User-initiated, disk I/O |

Sync Pipeline Detail (becomes collector's core loop)

┌──────────────────────────────────────────────────────────────────┐
│                    Collector Sync Pipeline                        │
│                                                                  │
│  ┌────────────────┐                                              │
│  │ 1. Fetch Open  │ ← Ivanti API (paginated, 100/page)          │
│  │    Findings    │                                              │
│  └───────┬────────┘                                              │
│          │                                                       │
│  ┌───────▼────────┐                                              │
│  │ 2. Extract &   │ ← Transform raw API → normalized rows       │
│  │    Transform   │                                              │
│  └───────┬────────┘                                              │
│          │                                                       │
│  ┌───────▼────────┐                                              │
│  │ 3. Upsert to   │ ← Batch INSERT ON CONFLICT (100/batch)      │
│  │    PG          │   Preserves notes + overrides                │
│  └───────┬────────┘                                              │
│          │                                                       │
│  ┌───────▼────────┐                                              │
│  │ 4. Archive     │ ← Compare previous IDs vs current IDs       │
│  │    Detection   │   Detect disappeared + returned findings     │
│  └───────┬────────┘                                              │
│          │                                                       │
│  ┌───────▼────────┐                                              │
│  │ 5. Fetch Closed│ ← Ivanti API (all closed pages)             │
│  │    Findings    │   Upsert as state='closed'                   │
│  └───────┬────────┘                                              │
│          │                                                       │
│  ┌───────▼────────┐                                              │
│  │ 6. BU Drift    │ ← Re-query Ivanti for disappeared IDs       │
│  │    Checker     │   Classify: BU reassign / severity / decom   │
│  └───────┬────────┘                                              │
│          │                                                       │
│  ┌───────▼────────┐                                              │
│  │ 7. FP Workflow │ ← Sweep closed findings for FP# tickets     │
│  │    Counts      │   Aggregate by state                         │
│  └───────┬────────┘                                              │
│          │                                                       │
│  ┌───────▼────────┐                                              │
│  │ 8. Anomaly     │ ← Compute deltas, write to anomaly_log      │
│  │    Summary     │                                              │
│  └───────┬────────┘                                              │
│          │                                                       │
│  ┌───────▼────────┐                                              │
│  │ 9. Update      │ ← sync_state status='success'               │
│  │    Sync State  │   Notify API server: pg_notify('sync_done')  │
│  └────────────────┘                                              │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
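
The pipeline above can be driven by a small runner that executes the steps in order and records progress after each one, which is also where partial-progress persistence would hook in. A sketch under assumptions: `runPipeline`, `resumeFrom`, and the `saveProgress` callback are hypothetical names, and each step is assumed to be an object with a `name` and an async `run(context)`.

```javascript
// Sketch only: all names are hypothetical. Each step is { name, run(context) }.
async function runPipeline(steps, context, saveProgress = async () => {}) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.run(context);
      completed.push(step.name);
      // Persist after every step so a crash does not discard finished work.
      await saveProgress({ lastCompleted: step.name, completed: [...completed] });
    } catch (err) {
      return { status: 'failed', failedStep: step.name, completed, error: err.message };
    }
  }
  return { status: 'success', completed };
}

// Returns the steps that still need to run after the last recorded success.
function resumeFrom(steps, lastCompleted) {
  const index = steps.findIndex((step) => step.name === lastCompleted);
  return index === -1 ? steps : steps.slice(index + 1);
}
```

Resuming mid-pipeline only works if later steps can rebuild what they need from the database rather than from in-memory results of earlier steps, for example the previous-ID snapshot used by archive detection.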

Timeline Summary

| Phase | Timeframe | Key Outcome | Required For |
|---|---|---|---|
| 0 | Weeks 1-2 | Non-blocking sync, pool increase | Immediate UX fix |
| 1 | Weeks 3-4 | Collector extracted, fault isolation | Multi-team onboarding |
| 2 | Weeks 5-8 | Multi-tenancy, rate budgeting, retries | 15 teams / 100+ users |
| 3 | Weeks 9-12 | New data sources (CS/Qualys/Tanium) | Full vuln coverage |
| 4 | Weeks 13+ | Horizontal scaling, load balancing | 300+ users (if needed) |

Phases 0-2 are recommended regardless of company-wide rollout. Phase 3 depends on data source priority decisions. Phase 4 is contingent on actual adoption numbers.


Next Steps

  1. Review this document and provide input on Decision Points
  2. Approve Phase 0 for immediate implementation
  3. Schedule Phase 1 kickoff once Phase 0 is validated in staging
  4. Submit firewall requests for CrowdStrike/Qualys/Tanium access to CT107 (long lead time)