# Proxmox Migration Plan: Dell R620 → Cisco UCS C240 M5
**Created:** 2026-03-14
**Updated:** 2026-03-14
**Status:** Pre-Migration — Backups Running, Awaiting C240 M5 Power-On
**Strategy:** Option C — Wipe R620 Drives → Install in C240 → Restore from PBS
---
## 1. Current Environment Summary
### Source Server: Dell PowerEdge R620
| Component | Details |
|-----------|---------|
| **Proxmox VE** | Latest (verify version on next SSH) |
| **RAID Controller** | LSI SAS1068E (Fusion MPT SAS) — **NOT a Dell PERC** |
| **Boot Drive** | `/dev/sda` — 146 GB SAS (Seagate ST914603SSUN146G) — Proxmox OS on LVM |
| **Data Pool** | ZFS "Vault" — 4.36 TB on `/dev/sdb` (RAID 0 virtual disk — 4x 1.2TB NETAPP drives) |
| **Pool Usage** | 108 GB used / 4.25 TB free — HEALTHY, 0 errors |
| **Last Scrub** | Mar 8, 2026 — clean |
### ⚠️ RAID 0 Warning
The "Vault" ZFS pool sits on a **RAID 0 stripe** (4 drives, no redundancy). If any single drive fails, all data is lost. This is another strong reason to get fresh backups before touching anything.
### Physical Drive Inventory — R620 (6 Drives)
| Slot | Vendor | Model | Capacity | RPM | Interface | Serial | Current Use |
|------|--------|-------|----------|-----|-----------|--------|-------------|
| 0 | SEAGATE | ST914602SSUN146G | 146 GB | 10,025 | 2.5" SAS | 2896MNAS | **Unused** (no block device assigned) |
| 1 | SEAGATE | ST914603SSUN146G | 146 GB | 10,000 | 2.5" SAS | 00110282EXXH | **sda** — Proxmox boot (LVM) |
| 2 | NETAPP | X425_SIRMN1T2A10 | 1.20 TB | 10,500 | 2.5" SAS | S3L1GAHC | **sdb** — RAID 0 member → ZFS "Vault" |
| 3 | NETAPP | X425_SIRMN1T2A10 | 1.20 TB | 10,500 | 2.5" SAS | S3L1TPXN | **sdb** — RAID 0 member → ZFS "Vault" |
| 4 | NETAPP | X425_SIRMN1T2A10 | 1.20 TB | 10,500 | 2.5" SAS | S3L1YV7T | **sdb** — RAID 0 member → ZFS "Vault" |
| 5 | NETAPP | X425_SIRMN1T2A10 | 1.20 TB | 10,500 | 2.5" SAS | S3L1TTA2 | **sdb** — RAID 0 member → ZFS "Vault" |
**Note:** NETAPP X425 drives are Seagate-manufactured 1.2TB 10K SAS drives (rebranded for NetApp storage shelves).
### Workloads (12 total: 7 running, 5 stopped)
| VMID | Name | Type | Status | RAM | Disk | Priority |
|------|------|------|--------|-----|------|----------|
| 100 | docker-hub | VM | 🟢 Running | 8.2 GB | 100 GB | HIGH |
| 101 | monitoring-docker | VM | 🟢 Running | 8 GB | 50 GB | HIGH |
| 102 | CML | VM | 🟢 Running | 32 GB | 200 GB | HIGH |
| 105 | pfSense-Firewall | VM | 🟢 Running | 2 GB | 16 GB | CRITICAL |
| 114 | haos | VM | 🟢 Running | 4 GB | 50 GB | HIGH |
| 109 | caddy | LXC | 🟢 Running | — | — | HIGH |
| 112 | twingate-connector | LXC | 🟢 Running | — | — | HIGH |
| 104 | ubuntu-dev | VM | ⚫ Stopped | 5 GB | 32 GB | LOW |
| 106 | Ansible-Control | VM | ⚫ Stopped | 4 GB | 32 GB | LOW |
| 107 | ubuntu-docker | VM | ⚫ Stopped | 4 GB | 50 GB | LOW |
| 113 | n8n | LXC | ⚫ Stopped | — | — | LOW |
| 117 | test-cve-database | LXC | ⚫ Stopped | — | — | LOW |
### Backup Server
| Component | Details |
|-----------|---------|
| **PBS Host** | 192.168.2.151 (container on TrueNAS 192.168.2.150) |
| **Storage** | `PBS-Backups` — 292 GB used / 962 GB total |
| **Status** | ✅ Online (restored 2026-03-14 — fixed macvtap collision) |
| **Fresh Backups** | 🔄 Running as of 2026-03-14 |
---
## 2. Target Server: Cisco UCS C240 M5
### Known Specs
| Component | Details |
|-----------|---------|
| **Chassis** | Cisco UCS C240 M5 (2U rack) |
| **New Drives** | 2x 960 GB (SSD — likely SATA or SAS, verify on power-on) |
| **Reused Drives** | 6x drives from R620 (2x 146GB SAS + 4x 1.2TB SAS) |
| **Total Drive Count** | **8 drives** (2 new + 6 from R620) |
| **CPUs** | TBD — power on to check (C240 M5 supports 2x Xeon Scalable) |
| **RAM** | TBD — power on to check (C240 M5 supports up to 3 TB) |
| **Drive Bays** | C240 M5 has 24x 2.5" SFF or 12x 3.5" LFF depending on config |
| **CIMC** | Cisco Integrated Management Controller (equivalent to iDRAC/iLO) |
### ⚠️ Items to Verify on Power-On
1. **CPU model & count** — Need to confirm sufficient cores/threads
2. **Total RAM installed** — Current R620 workloads need ~62 GB minimum (CML alone uses 32 GB)
3. **Drive bay form factor** — Should be 2.5" SFF to accept the R620 SAS drives
4. **RAID controller or HBA** — Need HBA/IT mode for ZFS (NOT hardware RAID)
5. **NIC configuration** — How many ports, speed, VLAN capability
6. **CIMC IP/access** — For remote management
7. **Firmware version** — May need BIOS/CIMC update
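
As a rough cross-check of item 2, the RAM column of the five running VMs in the workloads table can be summed. This is only a sketch: the LXC rows list no RAM figure, so the ~62 GB estimate presumably also covers the containers and host overhead (an assumption, not something the table states).

```bash
# RAM (GB) of the running VMs from the workloads table:
# 100 docker-hub, 101 monitoring-docker, 102 CML, 105 pfSense, 114 haos
running_vm_ram_gb="8.2 8 32 2 4"

# Sum the values with awk (shell arithmetic is integer-only).
# The variable is left unquoted on purpose so each value becomes its own line.
running_total_gb=$(printf '%s\n' $running_vm_ram_gb | awk '{ s += $1 } END { printf "%.1f", s }')
echo "Running VMs need ${running_total_gb} GB"
```

The listed VMs come to 54.2 GB, which leaves roughly 8 GB of the ~62 GB estimate for containers and the Proxmox host itself.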
---
## 3. Migration Strategy — Option C: Wipe & Restore
### Why This Approach
The R620's "Vault" pool sits on a RAID 0 virtual disk behind an LSI SAS1068E controller. The RAID metadata is tied to that controller — the drives aren't directly portable as a ZFS pool. Rather than fighting controller compatibility, we'll:
1. **Back everything up to PBS** (running now)
2. **Wipe the R620 drives** (the RAID 0 virtual disk only exists on the LSI controller, so its on-disk metadata has no value on the C240)
3. **Install drives in C240** with a proper HBA/IT mode controller
4. **Create a fresh ZFS pool** on the clean drives
5. **Restore all VMs/CTs from PBS**
### Benefits
| Benefit | Details |
|---------|---------|
| **More storage** | 2x 960GB SSDs (boot mirror) + 4x 1.2TB drives = separate OS and data pools |
| **Clean ZFS** | No RAID controller metadata — native ZFS from the start |
| **Better redundancy** | Can use RAIDZ1 instead of RAID 0 (lose 1 drive worth of capacity, gain fault tolerance) |
| **Full rollback** | R620 untouched until drives are pulled; PBS has all backups |
| **No wasted drives** | Reusing all existing hardware |
### Target Drive Layout
```
┌───────────────────────────────────────────────────────────────┐
│ UCS C240 M5 │
├─────────────────────┬─────────────────────────────────────────┤
│ Boot Pool │ Data Pool ("Vault") │
│ 2x 960GB SSD │ 4x 1.2TB NETAPP SAS (from R620) │
│ ZFS Mirror (RAID1) │ ZFS RAIDZ1 = ~3.6TB usable │
│ Proxmox OS + │ OR ZFS Stripe = ~4.8TB (no redundancy) │
│ local templates │ VM/CT storage │
├─────────────────────┴─────────────────────────────────────────┤
│ Spare: 2x 146GB Seagate SAS (from R620) │
│ Options: ZIL/SLOG, L2ARC, small utility pool, or don't use │
└───────────────────────────────────────────────────────────────┘
```
### ZFS Pool Decision
| Option | Usable Space | Fault Tolerance | Recommendation |
|--------|-------------|-----------------|----------------|
| **4x RAIDZ1** | ~3.6 TB | Survives 1 drive failure | ✅ **RECOMMENDED** |
| **2x Mirror pairs** | ~2.4 TB | Survives 1 per pair, better IOPS | Good if space isn't tight |
| **4x Stripe (RAID0)** | ~4.8 TB | NO redundancy (current R620 setup) | ❌ Don't repeat this mistake |
**RAIDZ1 is the way to go.** You only have ~108 GB of data currently, so 3.6 TB is more than enough. And you gain drive failure protection you don't have today.
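
The usable-space column follows from simple arithmetic on the drive count and size. The helper below is a hypothetical sketch (the name `usable_tb` is not from any tool) that reproduces the nominal figures. It ignores ZFS metadata and padding overhead and the TB/TiB difference, so real pools will report somewhat less.

```bash
# Nominal usable capacity for a ZFS layout.
# $1 = layout (raidz1|mirror|stripe), $2 = drive count, $3 = drive size in TB
usable_tb() {
  awk -v layout="$1" -v n="$2" -v size="$3" 'BEGIN {
    if (layout == "raidz1") data = n - 1;       # one drive worth of parity
    else if (layout == "mirror") data = n / 2;  # two-way mirror pairs
    else if (layout == "stripe") data = n;      # no redundancy
    else exit 1;                                # unknown layout
    printf "%.1f\n", data * size
  }'
}

for layout in raidz1 mirror stripe; do
  echo "$layout: $(usable_tb "$layout" 4 1.2) TB"
done
```

For four 1.2 TB drives this prints 3.6, 2.4, and 4.8 TB, matching the table.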
### What About the 2x 146GB Seagate Drives?
These are small and old but still functional. Options:
- **ZFS SLOG (write log)** — marginal benefit for home lab, skip unless doing sync writes
- **L2ARC (read cache)** — 146GB of SAS cache, minor benefit with only 108GB of data
- **Leave them out** — simplest option, fewer failure points
- **Small utility pool** — ISOs, templates, scratch space
**Recommendation:** Leave them out for now. Keep them as spares. You can always add them later.
---
## 4. Detailed Phase Breakdown
### Phase 1: Prepare (Before Migration Day)
#### 1.1 — Power On C240 M5 & Inventory
```
Action: Power on, access CIMC (default IP via console or DHCP)
Check: CPUs, RAM, drive bays, RAID controller model, NIC ports
Goal: Confirm hardware meets requirements (64+ GB RAM, 2.5" SFF bays, HBA capable)
```
#### 1.2 — RAID Controller Configuration
```
CRITICAL: ZFS needs raw disk access, NOT disks behind a hardware RAID controller
If C240 M5 has Cisco 12G SAS Modular RAID Controller:
  → Flash to IT mode (HBA passthrough), OR
  → Configure JBOD mode in BIOS/CIMC, OR
  → Last resort: create an individual RAID-0 virtual disk per drive (JBOD workaround)
If C240 M5 has a simple HBA:
  → No action needed, ZFS will see raw disks
```
#### 1.3 — Firmware Updates
```
Action: Check CIMC firmware version, update if below 4.x
Tool: Cisco Host Upgrade Utility (HUU) — bootable ISO
Note: Do this BEFORE installing Proxmox
```
#### 1.4 — Verify Backups
```
Action: Confirm all 7 running workloads backed up successfully
Check: tail -f /tmp/backup_all.log (running now)
Verify: pvesm list PBS-Backups (from Proxmox shell)
```
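
The check in 1.4 can be scripted so that a missing backup is hard to overlook. The sketch below assumes PBS volume IDs of the form `<storage>:backup/<vm|ct>/<id>/<timestamp>` in the `pvesm list` output; `missing_backups` is a hypothetical helper name, and it only confirms that a backup exists, not that it is recent or restorable.

```bash
# Print every guest ID from "$@" that has no backup in the listing read from stdin.
# Assumes volume IDs shaped like <storage>:backup/<vm|ct>/<id>/<timestamp>.
missing_backups() {
  local listing id
  listing=$(cat)
  for id in "$@"; do
    printf '%s\n' "$listing" | grep -Eq "backup/(vm|ct)/${id}/" || echo "$id"
  done
}

# On the R620 (requires the pvesm CLI); no output means every ID has a backup:
#   pvesm list PBS-Backups | missing_backups 100 101 102 105 109 112 114
```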
---
### Phase 2: Install Proxmox on C240 M5
#### 2.1 — Proxmox Boot Drive Setup
```
Config: ZFS Mirror (RAID-1) on the 2x 960GB SSDs
Why: Boot drive redundancy — if one SSD dies, system keeps running
Installer: Select "zfs (RAID1)" during Proxmox install
Bonus: ~900GB usable for OS + local storage (ISOs, templates, etc.)
```
#### 2.2 — Network Configuration During Install
```
Management IP: Pick a new IP (e.g., 192.168.2.141) — keep R620 at .140 as fallback
Gateway: 192.168.2.1 (or whatever pfSense assigns)
DNS: Match current R620 config
Hostname: pve-c240 (or whatever you prefer)
Bridge: vmbr0 on primary NIC
```
#### 2.3 — Post-Install Configuration
```bash
# Add PBS storage (copy datastore, user, and fingerprint from the R620's /etc/pve/storage.cfg)
pvesm add pbs PBS-Backups \
    --server 192.168.2.151 \
    --datastore <datastore-name> \
    --username <pbs-user> \
    --password <pbs-password> \
    --fingerprint <pbs-fingerprint> \
    --content backup
# Verify connectivity (PBS-Backups should be listed as "active")
pvesm status
# Add any needed repos (no-subscription, etc.)
# Match /etc/apt/sources.list from R620
---
### Phase 3: Migrate Data (The Big Move)
#### 3.1 — Pre-Migration Checklist
```
□ All backups verified on PBS (all 7 running workloads)
□ pfSense config exported as XML (Diagnostics → Backup & Restore)
□ Proxmox configs backed up (tar czf /tmp/pve-configs.tar.gz /etc/pve/)
□ C240 M5 Proxmox installed and accessible
□ PBS storage connected on C240
□ RAID controller in HBA/IT mode on C240
□ Drive bays confirmed compatible (2.5" SFF SAS)
□ Maintenance window planned (Home Assistant, pfSense will be down)
```
#### 3.2 — Shutdown Sequence (R620)
```bash
# Stop VMs/CTs in reverse dependency order
# pfSense LAST (everything depends on it for networking)
qm shutdown 102 # CML (resource heavy, shut down first)
qm shutdown 114 # haos
qm shutdown 100 # docker-hub
qm shutdown 101 # monitoring-docker
pct shutdown 109 # caddy
pct shutdown 112 # twingate-connector
qm shutdown 105 # pfSense — LAST
# Wait for all to stop
qm list && pct list
# Power off R620
shutdown -h now
```
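
The "wait for all to stop" step can be made explicit before powering off. This is a minimal sketch, assuming `qm status <id>` and `pct status <id>` print a line of the form `status: stopped`; the helper name `wait_stopped` and the 5-second poll interval are illustrative choices.

```bash
# Block until the given guest reports "stopped", or give up after a timeout.
# $1 = tool (qm for VMs, pct for containers), $2 = guest ID, $3 = timeout in seconds
wait_stopped() {
  local tool="$1" id="$2" timeout="${3:-300}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if "$tool" status "$id" | grep -q 'status: stopped'; then
      return 0
    fi
    sleep 5
    waited=$(( waited + 5 ))
  done
  echo "Guest $id still running after ${timeout}s" >&2
  return 1
}

# Example: only power off once pfSense (the last guest) is confirmed down
#   wait_stopped qm 105 && shutdown -h now
```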
#### 3.3 — Physical Drive Migration
```
1. Power off R620 completely (already done in 3.2)
2. Pull the 4x NETAPP 1.2TB SAS drives (slots 2-5)
3. Optionally pull 2x Seagate 146GB SAS drives (slots 0-1)
4. Insert drives into C240 M5 drive bays
5. Power on C240 M5
6. Verify drives visible in CIMC/Proxmox: lsblk -d -o NAME,SIZE,MODEL,SERIAL
```
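
Step 6 can be checked against the serial numbers from the drive inventory table instead of by eye. The sketch below uses a hypothetical helper, `missing_serials`, and assumes the controller passes the real drive serials through to `lsblk`; behind a RAID virtual disk they may not appear at all, which is itself a useful signal.

```bash
# Print each expected serial that does NOT appear in the lsblk output on stdin.
missing_serials() {
  local inventory serial
  inventory=$(cat)
  for serial in "$@"; do
    printf '%s\n' "$inventory" | grep -qw -- "$serial" || echo "$serial"
  done
}

# On the C240 (NETAPP serials from the R620 inventory); no output means all four are visible:
#   lsblk -d -n -o NAME,SERIAL | missing_serials S3L1GAHC S3L1TPXN S3L1YV7T S3L1TTA2
```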
#### 3.4 — Create Fresh ZFS Pool on C240
```bash
# Identify the 4x 1.2TB NETAPP drives (will have new device names)
lsblk -d -o NAME,SIZE,MODEL,SERIAL
# Wipe any leftover RAID metadata
wipefs -a /dev/sdX /dev/sdY /dev/sdZ /dev/sdW # replace with actual device names
# Create RAIDZ1 pool (RECOMMENDED — 1 drive fault tolerance)
zpool create -f \
    -o ashift=12 \
    -O atime=off \
    -O compression=lz4 \
    -O recordsize=64k \
    Vault raidz1 \
    /dev/disk/by-id/<drive1> \
    /dev/disk/by-id/<drive2> \
    /dev/disk/by-id/<drive3> \
    /dev/disk/by-id/<drive4>
# Always use /dev/disk/by-id/ paths — they're stable across reboots
# Verify pool
zpool status Vault
zpool list Vault
# Add to Proxmox as storage
pvesm add zfspool Vault-data --pool Vault --content images,rootdir
```
---
### Phase 4: Restore & Verify
#### 4.1 — Restore from PBS
```bash
# Restore each VM/CT from PBS backup
# Easiest via Proxmox Web UI: Storage → PBS-Backups → Select backup → Restore
# CLI examples if preferred. PBS volume IDs use the form backup/<vm|ct>/<id>/<timestamp>;
# list the exact IDs with: pvesm list PBS-Backups
# VM 105 (pfSense): RESTORE FIRST
qmrestore PBS-Backups:backup/vm/105/<timestamp> 105 \
    --storage Vault-data
# LXC 109 (caddy)
pct restore 109 PBS-Backups:backup/ct/109/<timestamp> \
    --storage Vault-data
# Repeat for: 100, 101, 102, 112, 114
# Also restore stopped VMs/CTs if needed: 104, 106, 107, 113, 117
```
#### 4.2 — Startup Sequence (CRITICAL ORDER)
```
1. pfSense (105) — FIRST — everything needs networking
2. caddy (109) — reverse proxy for services
3. twingate-connector (112) — remote access
4. docker-hub (100) — core services
5. monitoring-docker (101) — observability
6. haos (114) — Home Assistant
7. CML (102) — Cisco Modeling Labs (resource heavy, LAST)
```
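
The startup order above can be encoded once so it is not retyped under pressure on migration day. This is a sketch only: `start_in_order`, `DRY_RUN`, and `START_DELAY` are hypothetical names, and the 30-second default pause between guests is an arbitrary placeholder, not a tested value.

```bash
# Startup order from the list above, as "tool:id" pairs (qm = VM, pct = container).
START_ORDER="qm:105 pct:109 pct:112 qm:100 qm:101 qm:114 qm:102"

# Start each guest in order; stop at the first failure so later guests
# do not come up without their dependencies.
# Set DRY_RUN=1 to print the commands instead of running them.
start_in_order() {
  local entry tool id
  for entry in $START_ORDER; do
    tool=${entry%%:*}
    id=${entry##*:}
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$tool start $id"
    else
      "$tool" start "$id" || return 1
      sleep "${START_DELAY:-30}"
    fi
  done
}
```

Running it with `DRY_RUN=1` first shows the exact commands and their order before anything is started.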
#### 4.3 — Post-Migration Verification Checklist
```
□ All VMs/CTs start successfully
□ pfSense routing/firewall rules intact
□ pfSense WAN/LAN interfaces mapped correctly to new NIC names
□ Home Assistant devices reconnected
□ Docker containers running (check docker-hub VM)
□ Monitoring/Grafana dashboards loading
□ Caddy reverse proxy serving sites
□ Twingate remote access working
□ PBS backup jobs reconfigured on new Proxmox host
□ ZFS pool healthy (zpool status Vault)
□ No disk errors in dmesg
□ SMART health on all drives (smartctl -a /dev/sdX)
```
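
The pool-health item in the checklist lends itself to a scripted check. The sketch below relies on `zpool status -x <pool>` printing `pool '<name>' is healthy` for a pool with no problems; `pool_healthy` is a hypothetical helper name.

```bash
# Succeed only if `zpool status -x` reports the pool as healthy.
# For a healthy pool the command prints: pool 'Vault' is healthy
pool_healthy() {
  local pool="${1:-Vault}"
  zpool status -x "$pool" | grep -q "is healthy"
}

# Example:
#   pool_healthy Vault && echo "Vault OK" || zpool status Vault
```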
---
## 5. Rollback Plan
```
UNTIL you pull drives from R620, rollback is trivial:
1. Power off C240 M5
2. Power on R620
3. Everything is exactly as it was
AFTER drives are pulled and wiped:
1. You cannot restore the R620 to original state
2. BUT: PBS has full backups of everything
3. If C240 fails: re-insert drives in R620, install fresh Proxmox, restore from PBS
4. OR: put drives back in C240 and troubleshoot
KEY SAFETY NET: PBS on TrueNAS (192.168.2.150/151) is independent of both servers.
As long as TrueNAS stays up, your backups are safe regardless of what happens.
```
---
## 6. Estimated Timeline
| Phase | Duration | Notes |
|-------|----------|-------|
| Phase 1: Prepare | 1-2 hours | CIMC setup, firmware, verify hardware, HBA config |
| Phase 2: Install Proxmox | 30-45 min | Proxmox install on SSD mirror + basic config |
| Phase 3: Migrate drives + ZFS pool | 30-60 min | Physical drive swap + create RAIDZ1 pool |
| Phase 4: Restore from PBS | 1-3 hours | Depends on data size (~108 GB across all VMs) |
| Phase 4: Verify | 1-2 hours | Start everything, test services |
| **Total** | **~4-9 hours** | Sum of the phase ranges above (4 to 8.75 hours); plan for a full-day window |
---
## 7. Risk Matrix
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| C240 RAM insufficient (<64 GB) | HIGH | MEDIUM | Check CIMC before starting; current workloads need ~62 GB, so 64 GB is the practical minimum |
| RAID controller doesn't support HBA/IT mode | HIGH | LOW | Most C240 M5 configs have this; JBOD workaround available |
| Drive bay incompatible (3.5" LFF chassis) | HIGH | LOW | C240 M5 SFF variant uses 2.5" — verify on power-on |
| PBS goes down during migration | HIGH | LOW | macvtap collision fixed on 2026-03-14; verify PBS is online before starting |
| pfSense NIC mapping changes | MEDIUM | MEDIUM | NICs will have different names on C240; remap in pfSense console |
| Drive failure during migration | HIGH | LOW | RAID 0 has zero redundancy today — fresh backups are the safety net |
| Firmware incompatibility | LOW | LOW | Update CIMC/BIOS first via HUU |
---
## 8. Pre-Migration Bonus Tasks (Do Before Migration Day)
```bash
# 1. Export pfSense config (CRITICAL — do from pfSense Web UI)
# Diagnostics → Backup & Restore → Download configuration as XML
# Save to local machine AND to TrueNAS
# 2. Document current network config (run on R620)
ip addr show
cat /etc/network/interfaces
cat /etc/hosts
cat /etc/resolv.conf
# 3. Save Proxmox configs
tar czf /tmp/proxmox-configs-backup.tar.gz /etc/pve/
# 4. Copy to TrueNAS for safekeeping
scp /tmp/proxmox-configs-backup.tar.gz truenas_admin@192.168.2.150:/mnt/data/backups/
# 5. Note down PBS connection details for re-adding on new Proxmox
grep -A 10 PBS /etc/pve/storage.cfg
# 6. Record current VM disk locations
for vmid in 100 101 102 104 105 106 107 114; do
    echo "=== VM $vmid ==="; qm config "$vmid" | grep -E "scsi|virtio|ide|efidisk"
done
for ctid in 109 112 113 117; do
    echo "=== CT $ctid ==="; pct config "$ctid" | grep rootfs
done
```
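
Since the config archive from step 3 becomes the only copy of `/etc/pve/` once the drives are wiped, it is worth confirming that it actually contains the expected files before copying it off. A minimal sketch, with `archive_contains` as a hypothetical helper name:

```bash
# Succeed if the gzip-compressed tar archive $1 contains the exact member path $2.
archive_contains() {
  tar tzf "$1" | grep -qx -- "$2"
}

# Example (tar strips the leading "/" from member names when creating the archive):
#   archive_contains /tmp/proxmox-configs-backup.tar.gz etc/pve/storage.cfg \
#     && echo "storage.cfg is in the archive"
```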
---
## 9. Open Questions (Resolve on Power-On)
1. **C240 M5 drive bay form factor?** — Need 2.5" SFF for the R620 SAS drives
2. **RAID controller model?** — Determines HBA/IT mode procedure
3. **Total RAM?** — Minimum 64 GB needed (CML = 32 GB alone)
4. **CPU specs?** — Should be fine, but confirm core count
5. **Individual R620 drive sizes?** — Jordan to double-check (currently showing 2x 146GB + 4x 1.2TB)
6. **ZFS pool layout preference?** — RAIDZ1 recommended (~3.6TB), stripe (~4.8TB) if you need space
7. **Keep the 2x 146GB Seagates?** — Recommend leaving out; they're small and old
8. **Same IP (.140) or new IP for C240?**
9. **Hostname preference?** `pve`, `pve-c240`, or something else?
---
*Plan authored by Garvis — 2026-03-14*
*Updated: Option C strategy (wipe drives, restore from PBS), added full drive inventory.*
*Will be updated once C240 M5 hardware inventory is complete.*