Production Deployment Guide

⚠️ FUTURE WORK REFERENCE - This document is for planning and reference only

Status: 📋 Analysis complete, implementation NOT started
Purpose: Reference material for future production deployment implementation
Last Updated: 2025-01-17

🚧 IMPORTANT NOTES:

  1. DNS Architecture Changing: See GitHub Issue #30 - PowerDNS will be replaced with CoreDNS file plugin
  2. Not Production Ready: Scripts and procedures need to be created and tested
  3. Architecture May Evolve: Recommendations here are based on the current dev setup and may change
  4. Use for Planning: Use this as a reference when implementing production deployment

📋 Table of Contents

Quick Reference

  1. Executive Summary
  2. Key Decisions
  3. Development vs Production Comparison
  4. Implementation Roadmap

Detailed Analysis

  1. Architecture Analysis
  2. Component-by-Component Breakdown
  3. Security Considerations
  4. Deployment Procedures (Future)
  5. Operations & Maintenance
  6. Cost & Resource Planning

Executive Summary

This document provides analysis and planning for deploying Holistix Forge to production on an Ubuntu VPS. The analysis estimates that 85% of development scripts and 90% of the architecture can be reused for production with minimal adaptation.

Key Findings

Good News:

  • ✅ Development environment designed with production parity in mind
  • ✅ Most scripts work with minor modifications
  • ✅ Architecture is production-ready
  • ✅ Main differences are simplifications (fewer components in production)

Important Caveats:

  • ⚠️ DNS architecture is changing (see Issue #30)
  • ⚠️ Production scripts need to be created
  • ⚠️ Full deployment testing required
  • ⚠️ Security hardening needs implementation

Deployment Strategy

Approach: Maximize reuse from development setup, implement only necessary differences.

Main Differences:

  1. ❌ No dev container - Install directly on Ubuntu VPS
  2. ⚠️ DNS: PowerDNS on port 53 (but changing to CoreDNS file plugin per #30)
  3. ✏️ SSL: Let's Encrypt with DNS-01 challenge (instead of mkcert)
  4. ➕ systemd services for process management
  5. ➕ Security hardening (firewall, SSH, passwords)
  6. ➕ Monitoring alerts and automated backups

Key Decisions

Decision 1: Direct Install (No Dev Container) ✅

Recommendation: Install services directly on Ubuntu VPS

Reasoning:

  • Dev container provides no functional benefit in production
  • Simplifies operations and debugging
  • Better performance (no container overhead)
  • Standard Linux administration tools

Impact: Need to adapt scripts that assume container environment


Decision 2: DNS Architecture ⚠️ CHANGING

⚠️ CRITICAL NOTE: This decision will change when Issue #30 is implemented.

Current Plan (in this doc): PowerDNS on port 53
Future Plan (Issue #30): CoreDNS file plugin with wildcard DNS
Result: Production will be even simpler (no database, no dynamic DNS operations)

Current Recommendation: PowerDNS on port 53 directly

Current Reasoning:

  • Production has domain delegation (simpler than dev)
  • CoreDNS only needed for local dev forwarding
  • Single DNS server instead of two

Future Recommendation (after #30):

  • Use CoreDNS with file plugin
  • Static zone files with wildcard DNS (*.domain)
  • No PowerDNS, no database, no dynamic operations
  • Even simpler architecture

Impact: Wait for Issue #30 before implementing DNS setup
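
As a rough illustration of the future plan, the sketch below renders a static zone file with a wildcard record and a matching Corefile stanza. The domain, the IP address (taken from the 203.0.113.0/24 documentation range), the ns1 label, the serial, and the file paths are all placeholders rather than decided values; the real layout depends on the outcome of Issue #30.

```bash
# Sketch only: renders a wildcard zone and a CoreDNS Corefile stanza to stdout.
# All names, addresses, and paths are illustrative assumptions.

render_zone() {
    # $1 = domain (e.g. your-domain.com), $2 = public IPv4 of the VPS, $3 = serial
    cat <<EOF
\$ORIGIN $1.
\$TTL 300
@    IN SOA ns1.$1. admin.$1. ( $3 7200 3600 1209600 300 )
@    IN NS  ns1.$1.
@    IN A   $2
ns1  IN A   $2
*    IN A   $2
EOF
}

render_corefile() {
    # $1 = domain, $2 = path to the zone file
    cat <<EOF
$1:53 {
    file $2
    errors
    log
}
EOF
}

render_zone your-domain.com 203.0.113.10 2025011701
render_corefile your-domain.com /etc/coredns/zones/db.your-domain.com
```

The single `*` record is what would replace per-gateway dynamic records: any name under the domain resolves to the VPS, and Nginx routes by host name.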


Decision 3: Let's Encrypt SSL ✅

Recommendation: Let's Encrypt with DNS-01 challenge for wildcard certificates

Reasoning:

  • Free and fully automated
  • Wildcard support (critical for dynamic gateways/containers)
  • Industry standard
  • Automatic renewal

Requirements:

  • DNS provider API access (Cloudflare, Route53, etc.)
  • Certbot with DNS plugin

Impact: Need DNS provider API credentials
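
For orientation, a wildcard request with the DNS-01 challenge might look like the sketch below. It assumes Cloudflare as the DNS provider and the certbot-dns-cloudflare plugin; the credentials file path is a placeholder. The function only prints the command so it can be reviewed before anything is run.

```bash
# Sketch only: prints the certbot invocation for a wildcard certificate via DNS-01.
# Assumes the certbot-dns-cloudflare plugin; the credentials path is a placeholder.

certbot_wildcard_cmd() {
    # $1 = domain, $2 = path to the Cloudflare API credentials file (mode 0600)
    printf '%s ' certbot certonly \
        --dns-cloudflare \
        --dns-cloudflare-credentials "$2" \
        --dns-cloudflare-propagation-seconds 30 \
        -d "$1" -d "'*.$1'"
    printf '\n'
}

certbot_wildcard_cmd your-domain.com /etc/holistix/secrets/cloudflare.ini
```

Both the apex and the wildcard name are requested because `*.your-domain.com` does not cover `your-domain.com` itself.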


Decision 4: Pre-Built Artifacts ✅

Recommendation: Build locally or in CI/CD, deploy artifacts only

Reasoning:

  • No source code on production server
  • Faster deployments
  • No build tools needed on production
  • Better security

Impact: Need deployment pipeline or local build process
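
One possible shape for the artifact step is sketched below: package the build output together with a SHA-256 checksum, then verify the checksum on the server before unpacking. The function names and file layout are hypothetical and not part of the existing scripts.

```bash
# Sketch only: package a build directory with a checksum, and verify before unpacking.
# Function names and file layout are illustrative, not existing project scripts.

package_artifact() {
    # $1 = build directory (e.g. dist), $2 = output tarball path
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")" &&
        (cd "$(dirname "$2")" && sha256sum "$(basename "$2")" > "$(basename "$2").sha256")
}

verify_artifact() {
    # $1 = tarball path; expects "$1.sha256" next to it. Non-zero exit on mismatch.
    (cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1)
}
```

Typical use would be `package_artifact dist /tmp/holistix-abc123.tar.gz` on the build machine, then `verify_artifact /tmp/holistix-abc123.tar.gz && tar -xzf ...` on the server after copying both files.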


Decision 5: systemd Services ✅

Recommendation: Use systemd for all service management

Reasoning:

  • Auto-restart on crash
  • Start on boot
  • Resource limits
  • Standard logging (journalctl)
  • Standard operations (systemctl)

Impact: Need to create systemd service files


Development vs Production Comparison

Quick Reference Table

| Component | Development | Production | Changes Required |
|---|---|---|---|
| Host Environment | Dev Container (Ubuntu) | Ubuntu VPS | ❌ Remove container layer |
| PostgreSQL | apt install | apt install | ✅ Same install; ✏️ Harden config |
| Nginx | apt install | apt install | ✅ Same install; ➕ Security headers |
| DNS | CoreDNS + PowerDNS | ⚠️ TBD (see #30) | ⚠️ Wait for Issue #30 |
| SSL | mkcert | Let's Encrypt | ✏️ Change SSL automation |
| Services | Manual start | systemd | ➕ Create service files |
| Node.js | NodeSource 24.x | NodeSource 24.x | ✅ Same |
| Docker | Docker Desktop | Docker Engine | ✅ Same (for gateways) |
| Monitoring | Optional | Required | ✅ Same stack + alerts |

DNS Architecture Comparison

⚠️ NOTE: This comparison assumes current architecture. See Issue #30 for planned changes.

| Aspect | Development | Production (Current) | Production (Future #30) |
|---|---|---|---|
| Tiers | Two-tier | Single-tier | Single-tier |
| DNS Servers | CoreDNS + PowerDNS | PowerDNS only | CoreDNS only |
| Port | 53 (CoreDNS), 5300 (PowerDNS) | 53 (PowerDNS) | 53 (CoreDNS) |
| Database | PostgreSQL for PowerDNS | PostgreSQL for PowerDNS | None! |
| Dynamic DNS | Yes (via API) | Yes (via API) | No (wildcard) |
| Complexity | Medium | Low | Very Low |

Why Different?

  • Dev: Need both local (*.domain.local) and external DNS forwarding
  • Prod (current): Domain delegation handles routing, no forwarding needed
  • Prod (future): Wildcard DNS eliminates need for dynamic records!

SSL/TLS Comparison

| Aspect | Development | Production |
|---|---|---|
| Tool | mkcert | Let's Encrypt (certbot) |
| Certificate Type | Signed by local mkcert CA | Publicly trusted CA |
| Wildcard | ✅ *.domain.local | ✅ *.your-domain.com |
| Challenge | N/A | DNS-01 (for wildcard) |
| Renewal | Manual re-issue | Automatic (certificates valid for 90 days) |
| Client Trust | Manual CA install | Automatic (browser trusted) |
| Cost | Free | Free |

Service Management Comparison

| Aspect | Development | Production |
|---|---|---|
| Ganymede | Manual node main.js & | systemd service |
| DNS | Manual start | systemd service |
| Nginx | System service | systemd service |
| Auto-start | ❌ Manual | ✅ On boot |
| Restart on Crash | ❌ No | ✅ Yes |
| Resource Limits | ❌ None | ✅ systemd limits |
| Logging | Files | journalctl + files |

Security Comparison

| Aspect | Development | Production |
|---|---|---|
| Firewall | ❌ Not configured | ✅ ufw with strict rules |
| SSH | Default | ✅ Hardened (no root, key-only) |
| DB Password | devpassword | Strong random (32 chars) |
| DB User | postgres superuser | ✅ Limited app user |
| SSL/TLS | Local mkcert CA | Publicly trusted CA |
| Rate Limiting | ❌ None | ✅ Nginx limits |
| Security Headers | ❌ None | ✅ X-Frame, CSP, etc. |
| Auto Updates | Manual | ✅ unattended-upgrades |

Implementation Roadmap

Phase 1: Core Infrastructure (Week 1)

Goal: Get VPS ready with basic services

Tasks:

  • [ ] Provision Ubuntu 24.04 VPS
  • [ ] Configure SSH hardening
  • [ ] Setup firewall (ufw)
  • [ ] Configure DNS at domain registrar
  • [ ] Install Node.js, PostgreSQL, Nginx, Docker
  • [ ] Setup Let's Encrypt SSL

Deliverables:

  • Accessible VPS with hardened SSH
  • Domain pointing to VPS
  • SSL certificate working
  • Core dependencies installed

Estimated Time: 8 hours (+ 24-48h DNS propagation wait)


Phase 2: Script Adaptation (Week 2)

Goal: Create production-specific scripts

Tasks:

  • [ ] Wait for Issue #30 DNS architecture decision
  • [ ] Create scripts/production/setup-production.sh
  • [ ] Create systemd service files
  • [ ] Adapt create-env.sh for production
  • [ ] Create scripts/production/deploy.sh
  • [ ] Create backup scripts
  • [ ] Document all procedures

Deliverables:

  • Production setup script
  • Production environment creation script
  • Deployment automation
  • Backup automation
  • systemd service templates

Estimated Time: 16 hours


Phase 3: Deployment & Testing (Week 3)

Goal: Deploy and verify full stack

Tasks:

  • [ ] Build application artifacts
  • [ ] Run production setup
  • [ ] Create production environment
  • [ ] Deploy artifacts
  • [ ] Start services
  • [ ] Test all functionality
  • [ ] Fix issues
  • [ ] Security audit

Deliverables:

  • Working production deployment
  • Test results documentation
  • Issue tracking for bugs

Estimated Time: 24 hours


Phase 4: Operations Setup (Week 4)

Goal: Production-ready operations

Tasks:

  • [ ] Configure Grafana alerts
  • [ ] Setup external uptime monitoring
  • [ ] Test backup/restore procedures
  • [ ] Create runbooks
  • [ ] Load testing
  • [ ] Disaster recovery plan
  • [ ] CI/CD integration

Deliverables:

  • Monitoring and alerting configured
  • Tested backup/restore procedures
  • Operational runbooks
  • CI/CD pipeline

Estimated Time: 24 hours

Total Timeline: ~4 weeks (about 72 hours of estimated effort across the four phases)


Architecture Analysis

What We Have (Development)

Components:

  • Main dev container (Ubuntu 24.04)
  • PostgreSQL database
  • PowerDNS (port 5300) + CoreDNS (port 53)
  • Nginx for SSL and routing
  • Ganymede API (Express.js)
  • Gateway pool (Docker containers)
  • User containers (Docker)
  • Monitoring stack (Grafana, Loki, Tempo)

Strengths:

  • ✅ Complete local development environment
  • ✅ Production parity in architecture
  • ✅ Comprehensive automation scripts
  • ✅ Well-documented setup

Production Gaps:

  • ⚠️ mkcert SSL (need Let's Encrypt)
  • ⚠️ Manual process management (need systemd)
  • ⚠️ Weak security defaults
  • ⚠️ No monitoring alerts
  • ⚠️ No automated backups

Component Reusability Matrix

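| Component | Reusability | Notes |
|---|---|---|
| PostgreSQL setup | 85% | Add hardening steps |
| DNS (PowerDNS) | ⚠️ TBD | Wait for Issue #30 |
| DNS (CoreDNS) | ⚠️ TBD | May keep with file plugin |
| Nginx config | 85% | Change SSL paths, add headers |
| Ganymede app | 95% | No code changes |
| Gateway pool | 95% | Scripts work as-is; add health checks and resource limits |
| Frontend build | 95% | Build script unchanged; add production flag and cache headers |
| Monitoring | 100% | Add alerts |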

Overall Reusability: 85%


Component-by-Component Breakdown

PostgreSQL

Development:

  • Installed via apt install postgresql
  • Default configuration
  • Weak password (devpassword)
  • Superuser used directly

Production Adaptations:

  1. ✅ Keep apt install postgresql (same)
  2. ✏️ Generate strong random password
  3. ✏️ Create limited application user (already in create-env.sh!)
  4. ➕ Configure connection limits
  5. ➕ Enable SSL/TLS for connections
  6. ➕ Setup automated backups
  7. ➕ Add monitoring

Script Impact:

  • setup-postgres.sh - Add hardening (85% reusable)
  • create-env.sh - Already creates app user ✅

DNS (⚠️ Architecture Changing)

See Issue #30 for planned architecture changes.

Current Plan (may be obsolete):

  • PowerDNS on port 53
  • Remove CoreDNS

Future Plan (Issue #30):

  • CoreDNS with file plugin on port 53
  • Static zone files with wildcard DNS
  • Remove PowerDNS entirely

Recommendation: Wait for Issue #30 before implementing DNS in production


Nginx

Development:

  • SSL termination with mkcert
  • Proxy to Ganymede and gateways
  • Basic configuration

Production Adaptations:

  1. ✏️ Use Let's Encrypt SSL certificates
  2. ➕ Add security headers (X-Frame-Options, CSP, etc.)
  3. ➕ Add rate limiting
  4. ➕ Add gzip compression
  5. ✅ Keep proxy configuration (same)
  6. ✅ Keep dynamic gateway configs (same)

Security Headers:

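add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
# X-XSS-Protection is deprecated and ignored by current browsers; it only affects legacy clients
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
# 'unsafe-inline' weakens the CSP; remove it once inline scripts are no longer needed
add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline';" always;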

Reusability: 85%
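
Once the headers are configured, a quick check along the following lines could confirm that they are actually served. This is a sketch: the function name is made up, and it only tests for presence, not for sensible values.

```bash
# Sketch only: reports which expected security headers are missing from a response.
# Reads raw response headers on stdin (for example from `curl -sI`).

missing_security_headers() {
    headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
    for name in x-frame-options x-content-type-options \
                referrer-policy content-security-policy; do
        printf '%s\n' "$headers" | grep -q "^$name:" || echo "$name"
    done
}
```

For example, `curl -sI https://your-domain.com | missing_security_headers` prints one header name per line for anything missing, and nothing when all four are present.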


Ganymede (API Server)

Development:

  • Runs manually via node main.js &
  • Logs to file
  • No restart on crash

Production Adaptations:

  1. ✏️ Run via systemd service
  2. ✏️ Use production environment variables
  3. ➕ Add resource limits (systemd MemoryMax, CPUQuota)
  4. ➕ Add security sandboxing (systemd directives)
  5. ✅ Keep application code (no changes needed)
  6. ✅ Keep database schema (no changes)

systemd Service Example:

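[Unit]
Description=Ganymede API server
After=network.target postgresql.service

[Service]
Type=simple
User=holistix
WorkingDirectory=/opt/holistix/prod
EnvironmentFile=/opt/holistix/prod/.env.ganymede

# Security
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
# ProtectSystem=strict makes the file system read-only; allow the log directory
ReadWritePaths=/opt/holistix/prod/logs

# Resources
MemoryMax=2G
CPUQuota=200%

# Restart
Restart=on-failure
RestartSec=5s

ExecStart=/usr/bin/node dist/packages/app-ganymede/main.js

[Install]
# Start on boot
WantedBy=multi-user.target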

Reusability: 95%
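
The operations commands later in this guide refer to the unit as ganymede@prod, which suggests a systemd template unit. A minimal sketch of such a template follows, assuming each environment lives under /opt/holistix/<name>; the helper function is hypothetical and the hardening and resource directives are omitted for brevity.

```bash
# Sketch only: writes a systemd template unit (ganymede@.service) so one file can
# serve several environments, e.g. ganymede@prod. The template approach is an
# assumption based on the ganymede@prod unit name used elsewhere in this guide.

write_ganymede_template() {
    # $1 = directory to write into (/etc/systemd/system on a real host)
    cat > "$1/ganymede@.service" <<'EOF'
[Unit]
Description=Ganymede API (%i)
After=network-online.target postgresql.service
Wants=network-online.target

[Service]
Type=simple
User=holistix
WorkingDirectory=/opt/holistix/%i
EnvironmentFile=/opt/holistix/%i/.env.ganymede
ExecStart=/usr/bin/node dist/packages/app-ganymede/main.js
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
EOF
}

# On the server (as root), roughly:
#   write_ganymede_template /etc/systemd/system
#   systemctl daemon-reload
#   systemctl enable --now ganymede@prod
```

systemd replaces `%i` with the instance name after the `@`, so `ganymede@prod` runs from /opt/holistix/prod.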


Gateway Pool

Development:

  • Docker containers
  • HTTP build distribution
  • Dynamic allocation

Production Adaptations:

  1. ✅ Keep Docker containers (same)
  2. ✅ Keep allocation logic (same)
  3. ✏️ Use pre-built artifacts instead of HTTP distribution
  4. ➕ Add container health checks
  5. ➕ Add resource limits (Docker --memory, --cpus)
  6. ✅ Keep lifecycle management (same)

Reusability: 95%
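
To make the health-check and resource-limit items concrete, the sketch below prints a `docker run` command that uses Docker's `--memory`, `--cpus`, and `--health-*` flags. The image name, the port, and the /health endpoint are hypothetical; how gateway-pool.sh actually starts containers is not shown in this document.

```bash
# Sketch only: prints a `docker run` command with memory/CPU limits and a health check.
# The image name, port, and /health endpoint are hypothetical.

gateway_run_cmd() {
    # $1 = container name, $2 = image, $3 = memory limit (e.g. 300m), $4 = CPU limit
    echo "docker run -d --name $1 --restart unless-stopped" \
         "--memory $3 --cpus $4" \
         "--health-cmd 'curl -fsS http://localhost:8080/health || exit 1'" \
         "--health-interval 30s --health-retries 3 $2"
}

gateway_run_cmd gw-pool-0 holistix/gateway:latest 300m 0.5
```

The 300m value mirrors the per-gateway figure used in the resource planning section.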


Frontend

Development:

  • Built with Vite
  • Served by Nginx
  • Hot reload in dev mode

Production Adaptations:

  1. ✅ Keep Vite build process (same)
  2. ✏️ Build with --configuration=production
  3. ➕ Add cache headers in Nginx
  4. ➕ Add CDN integration (optional)
  5. ✅ Keep Nginx serving (same)

Reusability: 95%


Security Considerations

Firewall Configuration

Required Ports:

# Default policy: block incoming, allow outgoing
ufw default deny incoming
ufw default allow outgoing

# SSH (allow before enabling the firewall to avoid locking yourself out)
ufw allow 22/tcp

# HTTP/HTTPS
ufw allow 80/tcp
ufw allow 443/tcp

# DNS
ufw allow 53/tcp
ufw allow 53/udp

ufw enable
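
If the rules end up in a hardening script, generating them from a port list keeps them reviewable. The sketch below only prints commands; the function name is made up, and `--force` is used so that `ufw enable` does not stop for an interactive confirmation.

```bash
# Sketch only: prints the ufw commands for a list of allowed ports, defaults first.
# Review the output, then pipe it to `sh` as root to apply it.

ufw_rules() {
    # Arguments: port/protocol pairs, e.g. 22/tcp 80/tcp 443/tcp 53/tcp 53/udp
    echo "ufw default deny incoming"
    echo "ufw default allow outgoing"
    for rule in "$@"; do
        echo "ufw allow $rule"
    done
    echo "ufw --force enable"   # --force skips the interactive confirmation
}

ufw_rules 22/tcp 80/tcp 443/tcp 53/tcp 53/udp
```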

SSH Hardening

# /etc/ssh/sshd_config
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
LoginGraceTime 20
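
Before restarting SSH it is worth confirming that the file really contains these settings. The check below is a simplified sketch with a hypothetical function name: it reads one file literally and does not follow Include directives or Match blocks.

```bash
# Sketch only: checks that an sshd_config-style file sets the key hardening options.
# It reads the file literally and does not follow Include directives or Match blocks.

check_sshd_hardening() {
    # $1 = path to the config file; prints each missing setting, non-zero exit if any
    missing=0
    for setting in "PermitRootLogin no" "PasswordAuthentication no" \
                   "PubkeyAuthentication yes"; do
        key=${setting% *}
        value=${setting#* }
        if ! grep -Eiq "^[[:space:]]*$key[[:space:]]+$value[[:space:]]*$" "$1"; then
            echo "missing: $setting"
            missing=1
        fi
    done
    return $missing
}
```

A cautious sequence is `check_sshd_hardening /etc/ssh/sshd_config && sshd -t`, followed by restarting the SSH service while keeping an existing session open in case key-based login fails.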

Database Security

Strong Passwords:

# Generate a 32-character random password (24 random bytes, base64-encoded)
DB_PASSWORD=$(openssl rand -base64 24)

Limited Privileges:

-- App user has only the necessary permissions (schema assumed to be public)
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
-- No CREATE, DROP, or user management
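
A slightly fuller set of grants, rendered from shell so it can be piped to psql, might look like the sketch below. The database, role, and schema names are placeholders, and create-env.sh may already cover some of this.

```bash
# Sketch only: prints SQL that gives an application role data access without DDL rights.
# Role, database, and schema names are placeholders; pipe the output to psql to apply.

render_app_grants() {
    # $1 = database, $2 = application role, $3 = schema (usually public)
    cat <<EOF
GRANT CONNECT ON DATABASE $1 TO $2;
GRANT USAGE ON SCHEMA $3 TO $2;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA $3 TO $2;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA $3 TO $2;
-- Cover tables created later by the role that runs this statement
ALTER DEFAULT PRIVILEGES IN SCHEMA $3
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO $2;
EOF
}

render_app_grants ganymede_prod app_user public
```

Applying it could be as simple as `render_app_grants ganymede_prod app_user public | sudo -u postgres psql -v ON_ERROR_STOP=1 -d ganymede_prod`.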

SSL Enforcement:

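# postgresql.conf: enable SSL (this alone does not require it)
ssl = on
# To require SSL, use hostssl (not host) entries for remote connections in pg_hba.conf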

Secrets Management

DO NOT:

  • ❌ Commit secrets to git
  • ❌ Store secrets in world-readable plain-text files

DO:

  • ✅ Use environment files with 0600 permissions
  • ✅ Store in /etc/holistix/secrets/
  • ✅ Consider secret management tools (Vault, AWS Secrets Manager)
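
As one way to combine the password and file-permission advice, the sketch below creates an env file that only its owner can read. The function name and the example file name are illustrative; DB_PASSWORD matches the variable used earlier.

```bash
# Sketch only: generates a 32-character password and stores it in a 0600 env file.
# If the target file already exists, its current permissions are not changed.

write_db_secret() {
    # $1 = env file to create, e.g. /etc/holistix/secrets/.env.db
    (
        umask 077                              # new files are created as 0600
        password=$(openssl rand -base64 24)    # 24 random bytes -> 32 base64 chars
        printf 'DB_PASSWORD=%s\n' "$password" > "$1"
    )
}
```

The subshell keeps the tightened umask and the password variable from leaking into the calling script.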

Rate Limiting

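# Nginx rate limiting
# In the http {} context: 10 requests/second per client IP, tracked in a 10MB zone
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

# In a server {} block: queue bursts of up to 20 requests, reject the rest
location / {
    limit_req zone=api burst=20;
}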

Deployment Procedures (Future)

⚠️ NOTE: These procedures are for reference only. They need to be tested and refined before use.

Prerequisites

VPS Requirements:

  • Ubuntu 24.04 LTS
  • 4 vCPU, 8GB RAM, 100GB SSD (recommended for production; see Cost & Resource Planning for a smaller testing tier)
  • Static public IP
  • Cost: ~$35-50/month

Domain Requirements:

  • Owned domain name
  • DNS registrar access
  • Ability to configure NS records

DNS Provider API:

  • Cloudflare (recommended)
  • Route 53 (AWS)
  • Or other certbot-supported provider

Deployment Steps (High-Level)

  1. VPS Provisioning
     • Create VPS instance
     • Configure SSH hardening
     • Setup firewall
  2. DNS Configuration
     • Configure domain delegation
     • Wait for propagation
  3. Application Setup
     • Install dependencies
     • Setup PostgreSQL
     • Setup DNS (wait for Issue #30)
     • Setup Let's Encrypt
  4. Environment Creation
     • Build applications
     • Create production environment
     • Configure systemd services
  5. Verification
     • Test DNS resolution
     • Test HTTPS
     • Test API
     • Test frontend
  6. Operations Setup
     • Configure monitoring
     • Setup backups
     • Configure alerts

Detailed procedures: Create after testing deployment.
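
Until the detailed procedures exist, the verification step could start from a very small check runner like the sketch below, which might later grow into scripts/production/health-check.sh. The runner itself is generic; the host names in the commented examples are placeholders and untested.

```bash
# Sketch only: a tiny runner for verification checks. Each check is a label plus a
# command; the example commands at the bottom are placeholders, not tested procedures.

failures=0

run_check() {
    # $1 = label, remaining arguments = command to run
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $label"
    else
        echo "FAIL  $label"
        failures=$((failures + 1))
    fi
}

# Example checks (hypothetical host names):
#   run_check "DNS resolution" getent hosts app.your-domain.com
#   run_check "HTTPS"          curl -fsS https://your-domain.com/
#   exit "$failures"
```

Exiting with the failure count makes the script usable from cron or a CI job, where any non-zero status signals a problem.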


Operations & Maintenance

Monitoring

Components to Monitor:

  • System metrics (CPU, RAM, disk, network)
  • Application metrics (API requests, response times)
  • Database metrics (connections, queries)
  • Gateway pool status
  • Container metrics

Tools:

  • Grafana (dashboards)
  • Loki (logs)
  • Tempo (traces)
  • OTLP Collector
  • UptimeRobot (external uptime)

Alert Rules:

  • Gateway pool exhausted
  • Disk usage > 80%
  • High memory usage
  • SSL certificate expiring (< 30 days)
  • API errors > threshold
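
As a simple illustration of the disk usage rule, a shell check could look like the sketch below. In practice the alert would more likely be defined in Grafana; the function names here are made up.

```bash
# Sketch only: a disk-usage check matching the "> 80%" rule above.
# In practice this would feed an alerting tool rather than print to stdout.

disk_usage_percent() {
    # $1 = mount point; prints the used percentage as an integer
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

over_threshold() {
    # $1 = current value, $2 = threshold; succeeds when value > threshold
    [ "$1" -gt "$2" ]
}

usage=$(disk_usage_percent /)
if over_threshold "$usage" 80; then
    echo "ALERT: / is ${usage}% full"
fi
```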

Backups

What to Backup:

  • PostgreSQL databases (all ganymede_*)
  • Organization data (org-data/ directory)
  • Nginx configurations
  • Environment files (.env.*)
  • SSL certificates (auto-renewed, but backup for safety)

Backup Schedule:

  • Daily backups at 2 AM
  • Keep last 7 days
  • Weekly backups kept for 4 weeks
  • Monthly backups kept for 12 months

Backup Script (Example):

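#!/bin/bash
set -euo pipefail

BACKUP_DIR="/opt/holistix/backups"
DATE=$(date +%Y%m%d_%H%M%S)

# Make sure the target directories exist
mkdir -p "$BACKUP_DIR/postgres" "$BACKUP_DIR/org-data"

# Backup PostgreSQL (run as a user allowed to read ganymede_prod)
pg_dump ganymede_prod | gzip > "$BACKUP_DIR/postgres/ganymede_prod_${DATE}.sql.gz"

# Backup org-data
tar -czf "$BACKUP_DIR/org-data/org-data_${DATE}.tar.gz" /opt/holistix/prod/org-data/

# Delete backup files older than 7 days (daily tier only; the weekly and
# monthly retention from the schedule above is not implemented here)
find "$BACKUP_DIR" -type f -mtime +7 -delete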
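
The schedule above calls for daily, weekly, and monthly retention. One way to implement it is sketched below. The tier rules (first of the month is monthly, Sunday is weekly) and the per-tier directory layout are assumptions, and the date handling relies on GNU date.

```bash
# Sketch only: one way to implement the daily/weekly/monthly retention schedule.
# Tier rules (1st of month = monthly, Sunday = weekly) are assumptions. Uses GNU date.

backup_tier() {
    # $1 = date as YYYY-MM-DD; prints monthly, weekly, or daily
    if [ "$(date -d "$1" +%d)" = "01" ]; then
        echo monthly
    elif [ "$(date -d "$1" +%u)" = "7" ]; then
        echo weekly
    else
        echo daily
    fi
}

retention_days() {
    # $1 = tier; prints how many days backups in that tier are kept
    case "$1" in
        daily)   echo 7 ;;
        weekly)  echo 28 ;;    # 4 weeks
        monthly) echo 365 ;;   # 12 months
    esac
}

prune_tier() {
    # $1 = backup root, $2 = tier; deletes files older than the tier's retention
    find "$1/$2" -type f -mtime +"$(retention_days "$2")" -delete
}
```

A backup script would write each archive into `$BACKUP_DIR/$(backup_tier "$(date +%F)")` and then call `prune_tier` for each tier. A cron entry such as `0 2 * * *` followed by the script path would run it daily at 2 AM.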

Common Operations

Deploy Updates:

# Build locally
npx nx run-many --target=build --all --configuration=production
ARTIFACT="holistix-$(git rev-parse --short HEAD).tar.gz"
tar -czf "$ARTIFACT" dist/

# Deploy to VPS (restarting a system service requires sudo for the holistix user)
scp "$ARTIFACT" holistix@VPS:/tmp/
ssh holistix@VPS "cd /opt/holistix/prod && tar -xzf /tmp/$ARTIFACT"
ssh holistix@VPS "sudo systemctl restart ganymede@prod"

Scale Gateway Pool:

# Add 5 more gateways
ENV_NAME=prod DOMAIN=your-domain.com \
  ./scripts/local-dev/gateway-pool.sh create 5 /opt/holistix/monorepo

View Logs:

# System logs
journalctl -u ganymede@prod -f

# Application logs
tail -f /opt/holistix/prod/logs/ganymede.log

# Gateway logs
docker logs -f gw-pool-0

Cost & Resource Planning

VPS Cost Estimates

Minimum (testing):

  • 2 vCPU, 4GB RAM, 50GB SSD
  • $15-25/month
  • Suitable for: Testing, small deployment

Recommended (production):

  • 4 vCPU, 8GB RAM, 100GB SSD
  • $35-50/month
  • Suitable for: Production, 10-50 users

High Performance:

  • 8 vCPU, 16GB RAM, 200GB SSD
  • $80-120/month
  • Suitable for: Large deployment, 100+ users

Resource Distribution (8GB VPS)

PostgreSQL:     2GB
Ganymede:       2GB
Gateway Pool:   3GB (10 gateways @ 300MB each)
User Containers: 1GB (2-4 containers)
System:         1GB (OS overhead)
Total:          9GB (exceeds the 8GB available; the allocations need rebalancing)
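
A quick arithmetic check of an allocation table like this one can catch overcommitment early. The sketch below sums the per-component figures and compares them with the host's RAM; the function name is made up.

```bash
# Sketch only: sums per-component RAM allocations (in GB) and compares to the host.
# The figures passed in the example are the ones listed in this section.

ram_budget() {
    # $1 = total RAM in GB, remaining arguments = allocations in GB
    total=$1
    shift
    sum=0
    for gb in "$@"; do
        sum=$((sum + gb))
    done
    if [ "$sum" -gt "$total" ]; then
        echo "over budget: ${sum}GB allocated, ${total}GB available"
    else
        echo "ok: ${sum}GB of ${total}GB allocated"
    fi
}

ram_budget 8 2 2 3 1 1    # PostgreSQL, Ganymede, gateways, user containers, system
```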

Storage Planning

Application:     500MB (dist + node_modules)
PostgreSQL:     1-5GB (depends on usage)
Logs:           1-2GB (with rotation)
Backups:        5-10GB (7 days of DB backups)
User Data:      Variable (org-data files)
Total:          ~10-20GB typical

Bandwidth Estimation

Per User Per Day:

  • Initial load: ~2MB (frontend bundle)
  • WebSocket: ~1MB (collaboration)
  • API requests: ~1MB

Example: 100 active users:

  • Daily: ~400MB
  • Monthly: ~12GB
  • Well within typical 4TB bandwidth limits
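
The estimate can be reproduced, and re-run for other user counts, with a few lines of shell arithmetic. The 30-day month is an assumption.

```bash
# Sketch only: reproduces the bandwidth arithmetic from this section.
# Per-user figures are the estimates listed; 30 days per month is an assumption.

monthly_bandwidth_mb() {
    # $1 = active users, $2 = MB per user per day
    echo $(( $1 * $2 * 30 ))
}

per_user_mb=$((2 + 1 + 1))                 # initial load + WebSocket + API
monthly_bandwidth_mb 100 "$per_user_mb"    # prints 12000 (MB, roughly 12GB)
```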

Script Reusability Summary

Scripts That Work As-Is (100%)

  • install-node.sh - Node.js installation
  • build-images.sh - Gateway Docker image
  • gateway-pool.sh - Gateway pool management
  • envctl-monitor.sh - Environment monitoring
  • build-frontend.sh - Frontend build

Scripts Needing Minor Changes (85-95%)

  • setup-postgres.sh - Add hardening steps
  • create-env.sh - Replace mkcert with Let's Encrypt, adapt paths
  • envctl.sh - Add systemd support
  • install-system-deps.sh - Minor tweaks

Scripts Not Needed in Production

  • setup-coredns.sh - DNS architecture changing (Issue #30)
  • update-coredns.sh - DNS architecture changing
  • install-mkcert.sh - Using Let's Encrypt instead

New Scripts Needed

  • scripts/production/setup-production.sh - Main production setup
  • scripts/production/setup-letsencrypt.sh - SSL automation
  • scripts/production/create-systemd-services.sh - Service files
  • scripts/production/harden-system.sh - Security hardening
  • scripts/production/deploy.sh - Deployment automation
  • scripts/production/backup-all.sh - Backup automation
  • scripts/production/restore.sh - Disaster recovery
  • scripts/production/health-check.sh - Deep health check

Timeline Estimates

Development Setup (First Time)

  • Create dev container: 10 min
  • Run setup-all.sh: 15-20 min
  • Create environment: 5 min
  • Build frontend: 5 min
  • Configure host DNS: 10 min
  • Total: 45-50 minutes

Production Setup (Estimated)

  • Provision VPS: 10 min
  • DNS delegation: 5 min (+ 24-48h wait)
  • SSH & security: 30 min
  • Run production setup: 20-30 min
  • SSL certificate: 5 min
  • Create environment: 10 min
  • Deploy artifacts: 10 min
  • Testing: 30 min
  • Monitoring setup: 30 min
  • Total: 2.5-3 hours (+ DNS propagation wait)

Risk Assessment

Development Risks (Low)

  • Dev container crash → Restart
  • Data loss → Not production data
  • Security breach → Local network only

Production Risks (High)

| Risk | Impact | Mitigation |
|---|---|---|
| VPS crash | HIGH | Monitoring + alerts + backups |
| Database corruption | HIGH | Daily backups + replication |
| Security breach | HIGH | Hardening + updates + monitoring |
| SSL expiry | MEDIUM | Auto-renewal + alerts |
| DNS failure | MEDIUM | Health checks |
| Disk full | MEDIUM | Monitoring + log rotation |
| Gateway exhaustion | MEDIUM | Pool size alerts |

Next Steps

Before Starting Implementation

  1. ✅ Read this document - Understand architecture and decisions
  2. ⚠️ Wait for Issue #30 - DNS architecture decision
  3. 📋 Create GitHub issue - Track production deployment work
  4. 🧪 Plan testing strategy - How to verify deployment

Implementation Order

  1. Phase 1: Core infrastructure (VPS, security, dependencies)
  2. Phase 2: Script adaptation (after Issue #30)
  3. Phase 3: Deployment testing (staging environment)
  4. Phase 4: Operations setup (monitoring, backups)

Success Criteria

  • [ ] Production deployment works end-to-end
  • [ ] All services auto-start on boot
  • [ ] Monitoring and alerts configured
  • [ ] Backups tested and working
  • [ ] Security audit passed
  • [ ] Load testing passed
  • [ ] Documentation complete

Conclusion

The Holistix Forge local development environment is architecturally close to production, although no production scripts or procedures exist yet. The main work required is:

  1. Wait for DNS simplification (Issue #30)
  2. Remove dev container layer (install directly)
  3. Add Let's Encrypt SSL (instead of mkcert)
  4. Create systemd services (proper management)
  5. Implement security hardening (firewall, SSH, etc.)
  6. Setup operations (monitoring, backups, alerts)

Key Insight: 85% of development work transfers to production. The architecture is solid, the foundation is there. The main task is creating and testing the production-specific scripts and procedures.



Document Status: 📋 Planning/Reference Only
Implementation Status: Not started - waiting for Issue #30
Next Action: Create GitHub issue to track implementation work
Maintainer: Core team