Split Architecture Proposal: Collector + Indexer
Author: Infrastructure Team
Date: 2026-06-08
Status: Draft — Pending Review
Scope: Scale CVE Dashboard from 2 teams / ~15 users to company-wide deployment (100+ users, 15+ teams)
Executive Summary
The STEAM Security Dashboard currently runs as a monolithic single-process Express application on CT107 (dashboard-dev, 71.85.90.9). This single process simultaneously serves the frontend, handles all API requests, and performs background data collection from Ivanti, Jira, CARD, Atlas, and NVD APIs.
At current scale (2 teams, <15 users, daily sync), this architecture works. At company-wide scale (15+ teams, hundreds of users, sub-hourly sync), it will not. This document proposes a phased transition to a Collector + API Server architecture that separates data ingestion from request serving.
Critical constraint: CT107 (71.85.90.9) has the firewall rules granting access to the production Ivanti, Jira, and CARD APIs. The collector component must remain on this machine or firewall rules must be extended.
Table of Contents
- Current Architecture
- Problem Statement
- Proposed Architecture
- Phase Plan
- Infrastructure Requirements
- Risk Assessment
- Decision Points
- Appendix: Current Data Flow Analysis
Current Architecture
┌─────────────────────────────────────────────────────────────────┐
│ CT107 (dashboard-dev) │
│ 71.85.90.9 — 48 GB RAM, 250 GB Disk │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Express Process (port 3001/3100) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │ │
│ │ │ React SPA │ │ API Routes │ │ Sync Workers │ │ │
│ │ │ (static) │ │ (50+ endpts)│ │ (setInterval) │ │ │
│ │ └─────────────┘ └──────────────┘ └────────────────┘ │ │
│ │ │ │ │ │
│ │ │ Shared PG Pool (10 conn) │ │
│ │ │ │ │ │
│ └──────────────────────────┼──────────┼─────────────────────┘ │
│ │ │ │
│ ┌──────────────────────────▼──────────▼─────────────────────┐ │
│ │ PostgreSQL 16 (Docker, port 5433) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ Firewall Access: Ivanti API, Jira DC, CARD API, Atlas API │
└─────────────────────────────────────────────────────────────────┘
Key Metrics (Current)
| Metric | Current Value | Company-Wide Projection |
|---|---|---|
| Concurrent users | 5–15 | 100–300 |
| Teams tracked | 2 | 15+ |
| Ivanti findings (open) | ~200–500 | 2,000–10,000+ |
| Ivanti sync frequency | 24h | 1–4h desired |
| PG connection pool | 10 | Insufficient |
| Jira API rate limit | 1,440/day | Shared across all users |
| Data sources | 5 (Ivanti, NVD, Jira, Atlas, CARD) | 8+ (add CrowdStrike, Qualys, Tanium) |
Problem Statement
1. Sync Blocks the API Server
syncFindings() runs sequentially through:
- Fetch all open findings pages (100/page)
- Upsert findings batch into PostgreSQL
- Detect archive changes (compare all previous vs current)
- Fetch all closed findings pages
- Upsert closed findings
- Run BU drift checker (makes additional API calls per disappeared finding)
- Sync FP workflow counts (sweeps all closed pages again)
- Compute and store anomaly summary
- Record counts history
At 500 findings, this takes 2–5 minutes. At 10,000 findings across 15 teams, this could take 15–30 minutes. During sync, the Express process is saturated: API responses slow down and user requests compete with the sync for pool connections.
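The archive-detection step in the list above comes down to two set comparisons. The sketch below is illustrative only: the function name, the input arrays, and the ID format are assumptions, not the existing implementation.

```javascript
// Hypothetical helper: names and input shapes are assumptions, not the existing code.
function detectArchiveChanges(previousOpenIds, currentOpenIds, archivedIds) {
  const previous = new Set(previousOpenIds);
  const current = new Set(currentOpenIds);
  const archived = new Set(archivedIds);

  // Open at the last sync but absent now: candidates for archiving and BU drift checks.
  const disappeared = [...previous].filter((id) => !current.has(id));
  // Archived earlier but present again in the current pull.
  const returned = [...current].filter((id) => archived.has(id));

  return { disappeared, returned };
}

const changes = detectArchiveChanges(["F-1", "F-2", "F-3"], ["F-2", "F-3", "F-9"], ["F-9"]);
console.log(changes); // { disappeared: [ 'F-1' ], returned: [ 'F-9' ] }
```

Using sets keeps the comparison linear in the number of finding IDs.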
2. Single Point of Failure
One process handles everything. A memory leak during sync, an unhandled promise rejection in the BU drift checker, or a runaway loop in archive detection crashes the entire dashboard for all users.
3. Connection Pool Exhaustion
10 connections shared between:
- User-facing read queries (findings list, compliance items, charts)
- Sync bulk upserts (batches of 100 rows × 18 columns)
- User writes (notes, overrides, queue operations)
The pool already logs warnings at 8/10 active. At 100+ concurrent users issuing reads while a sync writes thousands of rows, requests will stall waiting for a free connection and eventually time out.
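A pool-pressure check along these lines could back the 8/10 warning. It is a minimal sketch that assumes the shared pool exposes `totalCount`, `idleCount`, and `waitingCount` counters, as a node-postgres `Pool` does; the function name and the 80% threshold are assumptions.

```javascript
// Assumption: `pool` exposes totalCount, idleCount and waitingCount (node-postgres Pool style).
function poolPressure(pool, max, warnRatio = 0.8) {
  const active = pool.totalCount - pool.idleCount;
  return {
    active,
    max,
    waiting: pool.waitingCount,
    // Warn when most connections are checked out or any caller is already queued.
    warn: active / max >= warnRatio || pool.waitingCount > 0,
  };
}

// Stand-in object with the same counters; in the app this would be the pool from db.js.
const samplePool = { totalCount: 10, idleCount: 2, waitingCount: 0 };
console.log(poolPressure(samplePool, 10));
```

A non-zero `waiting` count is the more urgent signal: it means requests are already queued behind the sync.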
4. Rate Limits Shared Across Functions
Jira's 1,440/day limit is consumed by both background sync and user-initiated operations (lookups, ticket creation). A bulk sync could exhaust the daily budget, blocking users from creating tickets the rest of the day.
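One way to keep a bulk sync from starving users is to split the daily limit into per-consumer budgets. The sketch below uses the 80/20 split proposed in Phase 2; the class, its method names, and the in-memory counters are hypothetical (a real version would persist counts and reset them daily).

```javascript
// Hypothetical budget tracker: class and method names are illustrative assumptions.
class RateBudget {
  constructor(dailyLimit, sharesPercent) {
    this.remaining = {};
    for (const [consumer, percent] of Object.entries(sharesPercent)) {
      this.remaining[consumer] = Math.floor((dailyLimit * percent) / 100);
    }
  }

  // Returns true and decrements the consumer's budget, or false if it is spent.
  tryConsume(consumer, calls = 1) {
    if ((this.remaining[consumer] ?? 0) < calls) return false;
    this.remaining[consumer] -= calls;
    return true;
  }
}

const jiraBudget = new RateBudget(1440, { collector: 80, api: 20 });
console.log(jiraBudget.remaining); // { collector: 1152, api: 288 }
```

With this split, a sync that burns through all 1,152 collector calls still leaves 288 calls for user lookups and ticket creation.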
5. No Horizontal Scaling Path
A second API server cannot be added without also duplicating the sync scheduler, which would cause duplicate syncs, double-writes, and race conditions.
6. Firewall Constraint
CT107 has the only firewall access to production Ivanti, Jira, and CARD APIs. The collector (data fetcher) must run on this machine. The API server could potentially move elsewhere, but the collector cannot without firewall changes.
Proposed Architecture
Target State
┌─────────────────────────────────────────────────────────────────┐
│ CT107 (dashboard-dev) │
│ 71.85.90.9 — 48 GB RAM, 250 GB Disk │
│ ★ Firewall access to prod APIs ★ │
│ │
│ ┌───────────────────────────────────┐ ┌─────────────────────┐│
│ │ API Server (Express, port 3001) │ │ Collector Service ││
│ │ │ │ (Node.js worker) ││
│ │ • React SPA serving │ │ ││
│ │ • All /api/* read endpoints │ │ • Ivanti sync ││
│ │ • User writes (notes, queue) │ │ • Jira bulk sync ││
│ │ • On-demand lookups (proxied) │ │ • CARD cache sync ││
│ │ • Triggers collector via │ │ • Atlas cache sync ││
│ │ pg NOTIFY │ │ • NVD bulk sync ││
│ │ │ │ • Archive detect ││
│ │ Pool: 15 conn (reads + writes) │ │ • BU drift checker ││
│ │ │ │ • Anomaly compute ││
│ └───────────────┬───────────────────┘ │ • Compliance parse ││
│ │ │ ││
│ │ │ Pool: 10 conn ││
│ │ │ (bulk upserts) ││
│ │ │ ││
│ │ │ Listens: ││
│ │ │ pg LISTEN ││
│ │ │ 'sync_trigger' ││
│ │ └──────────┬──────────┘│
│ │ │ │
│ ┌───────────────▼──────────────────────────────────▼─────────┐│
│ │ PostgreSQL 16 (Docker, port 5433) ││
│ │ Pool: 25 total connections allocated ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────┘
Component Responsibilities
API Server (cve-api.service)
| Responsibility | Details |
|---|---|
| Frontend serving | Static React build via express.static |
| Read endpoints | All GET routes — findings, compliance, charts, exports |
| User writes | Notes, overrides, queue items, ticket CRUD, KB uploads |
| On-demand lookups | Single NVD lookup, single Jira issue lookup, CARD real-time queries |
| Sync trigger | SELECT pg_notify('sync_trigger', '{"type":"findings","user":"admin"}') |
| Health/status | Expose collector status via sync_state table reads |
Collector (cve-collector.service)
| Responsibility | Details |
|---|---|
| Scheduled syncs | Ivanti findings (configurable interval), workflows (24h) |
| Bulk API operations | Jira JQL sync-all, Atlas cache refresh, NVD bulk sync |
| Post-sync processing | Archive detection, BU drift classification, closed-gone detection |
| Anomaly computation | Open/closed deltas, classification breakdown, significance flagging |
| Compliance parsing | Spawns Python subprocess for xlsx parsing on upload commit |
| Event-driven triggers | Listens on pg LISTEN sync_trigger for on-demand requests |
| Rate budget management | Owns the Jira daily/burst counters; API server gets a reserved allocation |
Communication Pattern
User clicks "Sync" in UI
│
▼
API Server receives POST /api/ivanti/findings/sync
│
▼
API Server: SELECT pg_notify('sync_trigger', '{"type":"findings"}')
│
▼
API Server responds: { status: 'sync_started', message: 'Check /sync-status' }
│
▼
Collector receives NOTIFY, starts syncFindings()
│
▼
Collector updates ivanti_sync_state (status='syncing')
│
▼
Collector completes, updates ivanti_sync_state (status='success')
│
▼
Frontend polls GET /api/ivanti/findings/sync-status → sees 'success' → refreshes
No Redis. No message broker. Just PostgreSQL LISTEN/NOTIFY — zero new infrastructure.
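The flow above can be sketched with two small functions, one per side. This assumes a node-postgres style client (`query(text, params)` and a `notification` event carrying `channel` and `payload`); the function names and the `handlers` map are hypothetical.

```javascript
const SYNC_CHANNEL = "sync_trigger";

// API side: publish a trigger. `db` is anything with a query(text, params) method.
function requestSync(db, payload) {
  return db.query("SELECT pg_notify($1, $2)", [SYNC_CHANNEL, JSON.stringify(payload)]);
}

// Collector side: `client` should be a dedicated, long-lived connection rather than a
// pooled one, because LISTEN is tied to the session that issued it.
async function listenForSyncTriggers(client, handlers) {
  client.on("notification", (msg) => {
    if (msg.channel !== SYNC_CHANNEL) return;
    let request;
    try {
      request = JSON.parse(msg.payload);
    } catch {
      return; // ignore malformed payloads
    }
    const handler = handlers[request.type];
    if (handler) handler(request);
  });
  await client.query(`LISTEN ${SYNC_CHANNEL}`);
}
```

Notifications are not queued for a listener that is disconnected, so the collector's own schedule remains the fallback. Payloads must also stay small: PostgreSQL limits them to just under 8000 bytes by default.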
Phase Plan
Phase 0: Immediate Improvements (Week 1–2)
Goal: Reduce risk within the current monolith. No architectural changes.
| Task | Effort | Impact |
|---|---|---|
| Make POST /sync non-blocking: return immediately, let sync run in background | 2h | Unblocks users during sync |
| Add GET /api/ivanti/findings/sync-status endpoint | 1h | Frontend can poll for completion |
| Increase PG pool from 10 → 20 connections | 10min | Headroom for concurrent operations |
| Add pg_stat_activity monitoring query to health endpoint | 30min | Visibility into pool pressure |
| Update frontend to poll sync-status instead of waiting | 2h | UX improvement |
Deliverables:
- Updated ivantiFindings.js with async sync dispatch
- New sync-status polling endpoint
- Frontend ReportingPage sync UX updated
- Pool configuration change in db.js
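The non-blocking dispatch and the sync-status endpoint could share one small controller. This is a sketch under stated assumptions: `createSyncController` and the status values other than 'syncing' and 'success' are hypothetical, and `runSync` stands in for the existing syncFindings().

```javascript
// Hypothetical controller: `runSync` stands in for the existing syncFindings().
function createSyncController(runSync) {
  const state = { status: "idle", startedAt: null, finishedAt: null, error: null };

  function trigger() {
    if (state.status === "syncing") return { status: "already_running" };
    Object.assign(state, { status: "syncing", startedAt: new Date(), finishedAt: null, error: null });
    // Deliberately not awaited: the HTTP handler can respond immediately.
    Promise.resolve()
      .then(runSync)
      .then(() => { state.status = "success"; })
      .catch((err) => { state.status = "failed"; state.error = String(err.message || err); })
      .finally(() => { state.finishedAt = new Date(); });
    return { status: "sync_started" };
  }

  return { trigger, getStatus: () => ({ ...state }) };
}

const sync = createSyncController(async () => { /* syncFindings() would run here */ });
console.log(sync.trigger()); // { status: 'sync_started' }

// Express wiring, sketched as comments (route paths are the ones named in this proposal):
// router.post("/api/ivanti/findings/sync", (req, res) => res.status(202).json(sync.trigger()));
// router.get("/api/ivanti/findings/sync-status", (req, res) => res.json(sync.getStatus()));
```

This only unblocks the HTTP response. The sync still shares the event loop and the connection pool with user requests, which is what Phase 1 addresses.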
Phase 1: Extract Collector (Weeks 3–4)
Goal: Separate data collection into its own process on CT107.
| Task | Effort | Impact |
|---|---|---|
| Create backend/collector.js: standalone Node process | 4h | Fault isolation |
| Move sync functions from route files into shared lib/sync/ modules | 3h | Code reuse between collector and API |
| Implement pg LISTEN/NOTIFY trigger mechanism | 2h | API → Collector communication |
| Create cve-collector.service systemd unit | 30min | Process management |
| Add collector health check and status reporting | 1h | Observability |
| Update POST /sync routes to use pg_notify instead of inline sync | 1h | Complete decoupling |
| Add sync_jobs table for job tracking (queued, running, complete, failed) | 1h | Multi-user sync coordination |
| Update CI/CD pipeline to deploy collector service | 2h | Automated deployment |
Deliverables:
- backend/collector.js: entry point for collector process
- backend/lib/sync/: shared sync logic (extracted from routes)
- systemd/cve-collector.service: systemd unit
- Updated .gitlab-ci.yml with collector deploy stage
- sync_jobs table for job state tracking
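The four sync_jobs states (queued, running, complete, failed) suggest a small state machine. The transition rules below are an assumption made for illustration, including the choice to let a failed job be re-queued; the helper names are hypothetical.

```javascript
// States come from the proposal; the allowed transitions are an assumption.
const SYNC_JOB_TRANSITIONS = {
  queued: ["running", "failed"],
  running: ["complete", "failed"],
  complete: [],
  failed: ["queued"], // allow a retry to re-queue the job
};

function canTransition(from, to) {
  return (SYNC_JOB_TRANSITIONS[from] || []).includes(to);
}

// Returns an updated copy of the job, or throws on an illegal move.
function advanceJob(job, to) {
  if (!canTransition(job.status, to)) {
    throw new Error(`Illegal sync_jobs transition: ${job.status} -> ${to}`);
  }
  return { ...job, status: to, updatedAt: new Date() };
}

const queuedJob = { id: 1, type: "findings", status: "queued" };
console.log(advanceJob(queuedJob, "running").status); // running
```

Rejecting illegal transitions in one place keeps two users' sync requests from both marking the same job as running.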
File structure after Phase 1:
backend/
├── server.js                     # API server (unchanged entry point)
├── collector.js                  # NEW: collector entry point
├── db.js                         # Shared pool config
├── lib/
│   └── sync/
│       ├── ivantiFindings.js     # Extracted from routes/ivantiFindings.js
│       ├── ivantiWorkflows.js    # Extracted from routes/ivantiWorkflows.js
│       ├── jiraBulkSync.js       # Extracted from routes/jiraTickets.js
│       ├── atlasCache.js         # Extracted from routes/atlas.js
│       ├── nvdBulkSync.js        # New: bulk NVD operations
│       ├── archiveDetection.js   # Extracted from routes/ivantiFindings.js
│       └── anomalyCompute.js     # Extracted from routes/ivantiFindings.js
├── routes/                       # API routes, now thin and read-heavy
└── helpers/                      # Shared API client helpers (unchanged)
Phase 2: Multi-Tenancy & Scale Hardening (Weeks 5–8)
Goal: Prepare for 15 teams and hundreds of users.
| Task | Effort | Impact |
|---|---|---|
| Per-team sync scheduling — stagger syncs to avoid API burst | 3h | Spreads load |
| Jira rate budget partitioning (collector gets 80%, API gets 20%) | 2h | Prevents sync from starving users |
| Per-BU finding isolation — team users only see their findings | 4h | Data scoping |
| Add connection pooling metrics endpoint (/api/admin/pool-stats) | 1h | Operational visibility |
| Implement sync queue with priority (user-triggered > scheduled) | 3h | Better UX |
| Add retry logic with exponential backoff to collector | 2h | Resilience |
| Partial-progress persistence — don't lose work on mid-sync failure | 4h | Data integrity |
| PG connection pool separation — API pool (15) + Collector pool (10) | 1h | Isolation |
| Add pg_bouncer or similar for connection multiplexing (optional) | 4h | Scale past 50 concurrent |
Deliverables:
- Team-scoped sync scheduler in collector
- Rate budget allocation system
- Retry/backoff logic
- Partial progress tracking
- Pool separation
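Two of these deliverables, staggered team scheduling and retry with backoff, are small enough to sketch together. All names, default delays, and the jitter range are assumptions chosen for illustration.

```javascript
// Spread team syncs evenly across one interval so they do not all hit the API at once.
function staggerOffsets(teams, intervalMs) {
  const step = Math.floor(intervalMs / teams.length);
  return teams.map((team, i) => ({ team, offsetMs: i * step }));
}

// Capped exponential backoff; jitter scales the delay into [0.5, 1.0) of the computed value.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000, random = Math.random) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.round(exponential * (0.5 + random() / 2));
}

// Retries `fn` up to `retries` times, sleeping between attempts; rethrows the last error.
async function withRetry(fn, { retries = 3, sleep = (ms) => new Promise((r) => setTimeout(r, ms)) } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt >= retries) throw err;
      await sleep(backoffDelay(attempt));
    }
  }
}

// Four hypothetical teams on a 1-hour interval start 15 minutes apart.
console.log(staggerOffsets(["team-a", "team-b", "team-c", "team-d"], 60 * 60 * 1000));
```

Jitter matters once several team syncs fail at the same moment, since it keeps their retries from landing on the API together.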
Phase 3: Additional Data Sources (Weeks 9–12)
Goal: Integrate CrowdStrike, Qualys, and Tanium feeds.
| Task | Effort | Impact |
|---|---|---|
| CrowdStrike Falcon API integration in collector | 8h | New vulnerability source |
| Qualys VMDR API integration in collector | 8h | New vulnerability source |
| Tanium asset inventory sync | 6h | Asset correlation |
| Cross-source finding deduplication logic | 6h | Data quality |
| Unified findings view (merged from all sources) | 4h | Single pane of glass |
| Source-specific sync schedules (configurable per source) | 2h | Flexibility |
Note: All new API integrations go into the collector. The API server never makes outbound calls to external vulnerability platforms except for single-item on-demand lookups.
Firewall implications: CrowdStrike, Qualys, and Tanium API access will need firewall rules added to CT107 (71.85.90.9). Submit firewall requests in advance.
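Cross-source deduplication depends on choosing a key that identifies the same finding in every feed. The sketch below assumes (CVE ID, hostname) as that key and uses illustrative field names; real feeds may need a different key such as an asset ID or IP address.

```javascript
// Assumption: a finding is identified across sources by (CVE ID, hostname).
// Field names (cveId, hostname, source) are illustrative, not an existing schema.
function dedupeFindings(findings) {
  const merged = new Map();
  for (const finding of findings) {
    const cveId = finding.cveId.toUpperCase();
    const hostname = finding.hostname.toLowerCase();
    const key = `${cveId}|${hostname}`;
    const existing = merged.get(key);
    if (existing) {
      // Same vulnerability on the same host: record the extra source only.
      if (!existing.sources.includes(finding.source)) existing.sources.push(finding.source);
    } else {
      merged.set(key, { cveId, hostname, sources: [finding.source] });
    }
  }
  return [...merged.values()];
}

console.log(dedupeFindings([
  { cveId: "CVE-2024-0001", hostname: "web01", source: "ivanti" },
  { cveId: "cve-2024-0001", hostname: "WEB01", source: "qualys" },
  { cveId: "CVE-2024-0002", hostname: "web01", source: "ivanti" },
]));
```

Keeping the list of sources on the merged record preserves which platforms reported the finding, which the unified view can then display.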
Phase 4: Horizontal Scaling (Weeks 13+)
Goal: Support 300+ concurrent users if company-wide adoption materializes.
| Task | Effort | Impact |
|---|---|---|
| Move API server to a separate LXC container (with more resources) | 4h | Dedicated API resources |
| Run multiple API server instances behind a load balancer | 8h | Horizontal scale |
| Keep collector on CT107 (firewall access) | 0h | No change needed |
| Add Redis for session store (replace PG sessions) | 4h | Multi-instance sessions |
| Add read replicas if PG becomes the bottleneck | 8h | Read scale |
| Evaluate moving PG to CT109 (zbl-indexer, 32GB/500GB) | 2h | Larger DB host |
Architecture at Phase 4:
┌─────────────────┐
│ Load Balancer │
│ (nginx/HAProxy)│
└────┬───────┬────┘
│ │
┌─────────────▼─┐ ┌─▼─────────────┐
│ API Server 1 │ │ API Server 2 │ (New LXC or CT103)
│ (Express) │ │ (Express) │
└───────┬───────┘ └───────┬───────┘
│ │
└─────────┬─────────┘
│
┌────────────────────────────▼──────────────────────────────────────┐
│ CT107 (71.85.90.9) │
│ │
│ ┌─────────────────────────┐ ┌──────────────────────────────┐ │
│ │ Collector Service │ │ PostgreSQL 16 │ │
│ │ (sole process with │ │ (or moved to CT109) │ │
│ │ firewall API access) │ │ │ │
│ └─────────────────────────┘ └──────────────────────────────┘ │
│ │
│ ★ Firewall: Ivanti, Jira, CARD, Atlas, CrowdStrike, Qualys ★ │
└───────────────────────────────────────────────────────────────────┘
Infrastructure Requirements
CT107 Resource Allocation (Current → Phase 2)
| Resource | Current | Phase 2 Target | Notes |
|---|---|---|---|
| RAM | 48 GB | 48 GB (sufficient) | Node processes use <2GB each |
| CPU | Shared | May need 4+ dedicated cores | Sync is CPU-intensive during transform |
| Disk | 250 GB | 250 GB (sufficient) | PG data + uploads + logs |
| PG connections | 10 | 25 (15 API + 10 collector) | Configure in postgresql.conf |
| Systemd services | 2 (backend + frontend) | 3 (api + collector + postgres) | Frontend served by API |
PostgreSQL Tuning (for 15 teams / hundreds of users)
# postgresql.conf changes
max_connections = 50 # Lower than the PostgreSQL default of 100; still double the 25 pooled connections
shared_buffers = 4GB # 25% of available RAM for PG
effective_cache_size = 12GB # 75% of RAM PG can expect from OS
work_mem = 64MB # Per-sort/hash operation
maintenance_work_mem = 512MB # For VACUUM, CREATE INDEX
wal_level = replica # If read replicas needed later
Firewall Dependencies
| Service | Endpoint | Required By | Current Access |
|---|---|---|---|
| Ivanti/RiskSense | platform4.risksense.com:443 | Collector | ✅ CT107 only |
| Jira Data Center | jira.charter.com:443 | Collector + API (lookups) | ✅ CT107 only |
| CARD API | card.charter.com:443 | API (real-time) | ✅ CT107 only |
| Atlas InfoSec | (internal) | Collector | ✅ CT107 only |
| NVD API | services.nvd.nist.gov:443 | Collector + API | ✅ Public |
| CrowdStrike | api.crowdstrike.com:443 | Collector | ❌ Firewall request needed |
| Qualys | qualysapi.qualys.com:443 | Collector | ❌ Firewall request needed |
| Tanium | (internal) | Collector | ❌ Firewall request needed |
Key constraint: If the API server moves off CT107 in Phase 4, you'll need firewall rules for the new host to reach Jira (for user lookups) and CARD (for real-time queries). Alternatively, the collector could proxy those on-demand requests — adds latency but avoids firewall changes.
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Collector crash doesn't affect API users | — | — | This is the primary benefit of splitting |
| Collector and API race on DB writes | Medium | Low | Collector does bulk upserts; API does single-row writes. Different tables mostly. Use advisory locks for sync_state. |
| Sync trigger lost (pg NOTIFY missed) | Low | Medium | Collector also runs on a schedule. Missed trigger just delays to next interval. |
| Phase 1 introduces bugs in extraction | Medium | Medium | Comprehensive test suite exists. Run parallel (old monolith + new split) in staging for 1 week. |
| Firewall change delays block Phase 4 | High | Medium | Start firewall requests early. Phase 4 is optional — single-machine split (Phases 1–3) works fine at 15 teams. |
| PG becomes bottleneck at 300+ users | Low | High | Phase 4 addresses with read replicas. CT109 (500GB, 32GB) available as larger DB host. |
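The advisory-lock mitigation could look like the sketch below, which guards a sync run with PostgreSQL's `pg_try_advisory_lock` and `pg_advisory_unlock`. The lock key and function name are hypothetical, and `client` is assumed to be a node-postgres style connection.

```javascript
// Hypothetical lock key; any application-chosen 64-bit integer works.
const SYNC_LOCK_KEY = 107001;

// `client` must be one dedicated connection: session-level advisory locks belong to the
// session that took them, so lock and unlock have to run on the same connection.
async function withSyncLock(client, fn) {
  const { rows } = await client.query("SELECT pg_try_advisory_lock($1) AS locked", [SYNC_LOCK_KEY]);
  if (!rows[0].locked) return { ran: false }; // another sync already holds the lock
  try {
    return { ran: true, result: await fn() };
  } finally {
    await client.query("SELECT pg_advisory_unlock($1)", [SYNC_LOCK_KEY]);
  }
}
```

The non-blocking `try` variant is used so that a second trigger returns immediately instead of queuing behind a long-running sync.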
Decision Points
These require team/leadership input before proceeding:
- Sync frequency target: Is 1-hour sync acceptable, or do teams need near-real-time (15 min)? This affects collector design complexity and API rate budget math.
- API server location: Keep everything on CT107, or move the API server to a separate container? Keeping it on CT107 is simpler (no firewall changes for CARD/Jira lookups) but limits scaling options.
- Database location: Keep PG on CT107, or move to CT109 (zbl-indexer, 500GB disk, 32GB RAM)? Moving adds network latency but gives more room for growth.
- CrowdStrike/Qualys/Tanium priority: Which new data sources are most urgent? This affects Phase 3 ordering and firewall request timing.
- Session management: At 300+ users, PG-backed sessions will be high-churn. Acceptable, or invest in Redis? Redis adds infrastructure but is the industry standard for session stores at scale.
- Multi-instance API: Is the goal to survive a single API server restart without downtime? If yes, Phase 4 (load balancer + multiple instances) is needed. If brief restarts during deploys are acceptable, single-instance on CT107 works through Phase 3.
Appendix: Current Data Flow Analysis
Data Collection Patterns
| Source | Trigger | Frequency | Data Volume | Processing |
|---|---|---|---|---|
| Ivanti Findings | Schedule + manual | 24h | 100–500 findings (all pages) | Extract, upsert, archive detect, BU drift, anomaly |
| Ivanti Workflows | Schedule + manual | 24h | 50 workflow batches | Store as JSON blob |
| Ivanti Closed Findings | During findings sync | 24h | All closed pages | Upsert + closed archive detection |
| Jira Bulk Sync | Manual (admin) | On-demand | All tracked tickets via JQL | Status/summary update per ticket |
| Jira Single Lookup | User action | Real-time | 1 issue | Proxy + display |
| NVD Lookup | User action | Real-time | 1 CVE | Proxy + optional save |
| NVD Bulk Sync | Manual | On-demand | All CVEs in DB | Batch update metadata |
| Atlas Action Plans | Cache refresh | Background | Per-host plan data | Cache in atlas_action_plans_cache |
| CARD Operations | User action | Real-time | 1 asset at a time | Proxy (confirm/decline/redirect) |
| Compliance xlsx | Manual upload | Weekly | 1 file → hundreds of rows | Python parse → PG upsert (transactional) |
What Moves to Collector vs Stays in API
| Operation | Collector | API Server | Rationale |
|---|---|---|---|
| Ivanti findings sync (all pages) | ✅ | | Heavy, multi-page, post-processing |
| Ivanti workflows sync | ✅ | | Scheduled background task |
| Ivanti closed sweep | ✅ | | Part of findings sync pipeline |
| Archive detection | ✅ | | CPU-intensive comparison |
| BU drift checker | ✅ | | Makes additional API calls |
| Anomaly computation | ✅ | | Depends on sync completion |
| Jira bulk sync-all | ✅ | | Consumes rate budget, multi-issue |
| NVD bulk sync | ✅ | | Multi-CVE, rate-limited |
| Atlas cache refresh | ✅ | | Background, per-host API calls |
| Compliance xlsx parse | ✅ | | Spawns Python, heavy DB writes |
| Single Jira lookup | | ✅ | User-initiated, real-time, 1 call |
| Single NVD lookup | | ✅ | User-initiated, real-time, 1 call |
| CARD operations | | ✅ | User-initiated, real-time |
| All GET /api/* reads | | ✅ | Pure DB queries, user-facing |
| Notes/overrides/queue | | ✅ | Small writes, user-facing |
| File uploads | | ✅ | User-initiated, disk I/O |
Sync Pipeline Detail (becomes collector's core loop)
┌──────────────────────────────────────────────────────────────────┐
│ Collector Sync Pipeline │
│ │
│ ┌────────────────┐ │
│ │ 1. Fetch Open │ ← Ivanti API (paginated, 100/page) │
│ │ Findings │ │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ 2. Extract & │ ← Transform raw API → normalized rows │
│ │ Transform │ │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ 3. Upsert to │ ← Batch INSERT ON CONFLICT (100/batch) │
│ │ PG │ Preserves notes + overrides │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ 4. Archive │ ← Compare previous IDs vs current IDs │
│ │ Detection │ Detect disappeared + returned findings │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ 5. Fetch Closed│ ← Ivanti API (all closed pages) │
│ │ Findings │ Upsert as state='closed' │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ 6. BU Drift │ ← Re-query Ivanti for disappeared IDs │
│ │ Checker │ Classify: BU reassign / severity / decom │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ 7. FP Workflow │ ← Sweep closed findings for FP# tickets │
│ │ Counts │ Aggregate by state │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ 8. Anomaly │ ← Compute deltas, write to anomaly_log │
│ │ Summary │ │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ 9. Update │ ← sync_state status='success' │
│ │ Sync State │ Notify API server: pg_notify('sync_done', '') │
│ └────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
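The pipeline above can be driven by a small step runner that records how far a run got, which also gives the partial-progress tracking from Phase 2 something to persist. The runner and the placeholder step bodies are a sketch; only the step names come from the diagram.

```javascript
// Runs named async steps in order and reports which steps completed.
async function runPipeline(steps, context = {}) {
  const completed = [];
  for (const [name, step] of steps) {
    try {
      await step(context);
      completed.push(name);
    } catch (err) {
      // Stop at the first failure and report where the run got to.
      return { status: "failed", failedStep: name, completed, error: String(err.message || err) };
    }
  }
  return { status: "success", completed };
}

// Placeholder steps: real ones would call the extracted lib/sync/ modules.
const pipelineSteps = [
  ["fetchOpenFindings", async (ctx) => { ctx.open = []; }],
  ["upsertFindings", async () => {}],
  ["archiveDetection", async () => {}],
  ["fetchClosedFindings", async () => {}],
  ["buDriftChecker", async () => {}],
  ["fpWorkflowCounts", async () => {}],
  ["anomalySummary", async () => {}],
  ["updateSyncState", async () => {}],
];

runPipeline(pipelineSteps).then((result) => console.log(result.status, result.completed.length));
```

Because the result names the failed step, a retry could resume from that step instead of refetching every page.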
Timeline Summary
| Phase | Timeframe | Key Outcome | Required For |
|---|---|---|---|
| 0 | Weeks 1–2 | Non-blocking sync, pool increase | Immediate UX fix |
| 1 | Weeks 3–4 | Collector extracted, fault isolation | Multi-team onboarding |
| 2 | Weeks 5–8 | Multi-tenancy, rate budgeting, retries | 15 teams / 100+ users |
| 3 | Weeks 9–12 | New data sources (CS/Qualys/Tanium) | Full vuln coverage |
| 4 | Weeks 13+ | Horizontal scaling, load balancing | 300+ users (if needed) |
Phases 0–2 are recommended regardless of company-wide rollout. Phase 3 depends on data source priority decisions. Phase 4 is contingent on actual adoption numbers.
Next Steps
- Review this document and provide input on Decision Points
- Approve Phase 0 for immediate implementation
- Schedule Phase 1 kickoff once Phase 0 is validated in staging
- Submit firewall requests for CrowdStrike/Qualys/Tanium access to CT107 (long lead time)