Logging & Observability Architecture

This document describes the current, implemented logging and observability design (backend + frontend).


Goals

  • Centralized telemetry: all important logs and traces from app-ganymede, app-gateway, app-frontend go to the same stack.
  • Flow tracking: follow a request end‑to‑end across HTTP, WebSocket, and SQL using traces.
  • Error classification: distinguish app bugs vs. user errors in logs.
  • AI-/API-friendly: logs and traces are structured and consistently tagged (future: AI/agent access API).

Stack Overview

  • OpenTelemetry SDK (Node + Browser)
    • Distributed tracing, context propagation, log correlation.
  • OTLP Collector
    • Receives OTLP traces/logs from apps and forwards them.
  • Tempo
    • Trace storage and query (TraceQL, Grafana Tempo datasource).
  • Loki
    • Log storage and query (LogQL, Grafana Loki datasource).
  • Grafana
    • Unified UI over Loki (logs) and Tempo (traces).

High level:

Apps (Node + Browser)
  ↓ OTLP (HTTP/gRPC)
Collector
  ↓
Tempo (traces) + Loki (logs)
  ↓
Grafana dashboards / APIs

Core Principles

1. Set Context Once, Correlate Everywhere

  • Context lives on spans, not in every log record:
    • user.id
    • organization.id
    • project.id
    • gateway IDs, etc.
  • Logs only need:
    • trace_id
    • span_id
    • plus local log metadata (category, message, error info).
  • Correlation is done in Grafana/Loki/Tempo via trace_id.
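
To make the correlation step concrete, here is a minimal in-memory sketch of the join that Grafana performs between a log record and its span. The SpanRecord and LogRecord shapes and the function names are hypothetical, introduced only for illustration; real records live in Tempo and Loki.

```typescript
// Hypothetical record shapes; real data is stored in Tempo (spans) and Loki (logs).
interface SpanRecord {
  traceId: string;
  spanId: string;
  attributes: Record<string, string>;
}

interface LogRecord {
  traceId: string;
  spanId: string;
  category: string;
  message: string;
}

// Index spans by "traceId/spanId" so each log can find its span directly.
function indexSpans(spans: SpanRecord[]): Map<string, SpanRecord> {
  const index = new Map<string, SpanRecord>();
  for (const span of spans) {
    index.set(`${span.traceId}/${span.spanId}`, span);
  }
  return index;
}

// Recover the context for a log from its span. The log itself never carried
// user.id or project.id; it only carried the two IDs.
function contextForLog(
  log: LogRecord,
  index: Map<string, SpanRecord>,
): Record<string, string> {
  const span = index.get(`${log.traceId}/${log.spanId}`);
  return span ? { ...span.attributes } : {};
}

const spans: SpanRecord[] = [
  {
    traceId: 'a'.repeat(32),
    spanId: 'b'.repeat(16),
    attributes: { 'user.id': 'u-1', 'project.id': 'p-9' },
  },
];

const logRecord: LogRecord = {
  traceId: 'a'.repeat(32),
  spanId: 'b'.repeat(16),
  category: 'API_CALL',
  message: 'request failed',
};

const context = contextForLog(logRecord, indexSpans(spans));
```

The log stays small, and the user, organization, and project context is looked up only when someone inspects the trace.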

2. Backend Logs are First‑Class; Frontend Logs are Secondary

  • Backend (app-ganymede, app-gateway, shared libs):
    • Logs go to Loki via OTLP, with trace context and error classification.
  • Frontend (app-frontend + frontend-data):
    • Primary telemetry = traces (browser OTel SDK).
    • Frontend “logs” are currently console + optional span events, not shipped to Loki yet.

3. One Shared Initialization per Platform

  • Node apps call a shared initializer from @holistix/observability.
  • Browser app calls a shared initializer from @holistix/observability.
  • All apps share:
    • OTLP endpoint config
    • resource attributes (service.name, deployment.environment, version)
    • auto‑instrumentation setup.

Backend: Observability & Logging

1. Node Observability (@holistix/observability)

Usage in apps (simplified):

import { initializeNodeObservability } from '@holistix/observability';

initializeNodeObservability({
  serviceName: 'ganymede', // or 'gateway', etc.
  environment: process.env.OTEL_DEPLOYMENT_ENVIRONMENT,
  version: process.env.SERVICE_VERSION,
});

What it does (see packages/observability/src/node/init.ts):

  • Reads OTLP endpoint config (HTTP) from environment.
  • Creates a Resource with:
    • service.name
    • deployment.environment
    • service.version.
  • Configures:
    • trace exporter → OTLP /v1/traces.
    • log exporter → OTLP /v1/logs.
  • Enables Node auto‑instrumentations:
    • Express, HTTP, PostgreSQL, etc.
  • Starts NodeSDK once per process.
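
The steps above can be sketched without the SDK itself. The following is a dependency-free illustration, not the contents of init.ts: the option shape mirrors the usage shown earlier, while buildResourceAttributes, buildOtlpUrls, and startOnce are hypothetical helper names.

```typescript
// Mirrors the options passed to initializeNodeObservability above.
interface NodeObservabilityOptions {
  serviceName: string;
  environment?: string;
  version?: string;
}

// Build the resource attributes shared by every span and log record.
function buildResourceAttributes(
  options: NodeObservabilityOptions,
): Record<string, string> {
  const attributes: Record<string, string> = {
    'service.name': options.serviceName,
  };
  if (options.environment) {
    attributes['deployment.environment'] = options.environment;
  }
  if (options.version) {
    attributes['service.version'] = options.version;
  }
  return attributes;
}

// Derive the per-signal OTLP/HTTP URLs from one base endpoint.
function buildOtlpUrls(baseEndpoint: string): { traces: string; logs: string } {
  const base = baseEndpoint.replace(/\/+$/, ''); // drop trailing slashes
  return { traces: `${base}/v1/traces`, logs: `${base}/v1/logs` };
}

// Guard so the start step runs at most once per process.
let started = false;
function startOnce(start: () => void): boolean {
  if (started) {
    return false;
  }
  started = true;
  start();
  return true;
}

const resource = buildResourceAttributes({
  serviceName: 'ganymede',
  environment: 'development',
});
const urls = buildOtlpUrls('http://localhost:4318/');
```

In the real initializer, the function passed to the once-only guard would be the call that starts NodeSDK.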

2. Logger (@holistix/log)

Key pieces (packages/log/src/lib/log.ts):

  • Initialization (Node‑only OTLP export):
import { Logger } from '@holistix/log';

Logger.initialize({
  // optional; usually read from env
  otlpEndpointHttp: 'http://localhost:4318',
  serviceName: 'ganymede',
});
  • Severity / priority:
export enum EPriority {
  Emergency = 'emergency',
  Alert = 'alert',
  Critical = 'critical',
  Error = 'error',
  Warning = 'warning',
  Notice = 'notice',
  Info = 'info',
  Debug = 'debug',
}
  • Core API:
import { EPriority, log, error } from '@holistix/log';

log(EPriority.Info, 'CATEGORY', 'message', { structured: 'data' });
error('CATEGORY', 'message', { structured: 'data' }); // convenience wrapper
  • What log() does in Node:
    • Uses the OpenTelemetry logs SDK with an OTLP exporter.
    • Extracts the active span (trace.getActiveSpan()):
      • Attaches trace_id and span_id to log attributes (if a span exists).
    • Flattens primitive data fields into log.data.* attributes.
    • Emits a log record to Loki via the collector.
  • What log() does in the browser (today):
    • Logger.initialize() bails out in the browser (to avoid bundling the Node‑only log exporter).
    • Logger._otlpLogger therefore stays null.
    • The OTLP export branch is skipped → no log record is sent.
    • We rely on traces and the separate browserLog helper (see below) for frontend diagnostics.
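
As an illustration of the Node path, the sketch below builds the attribute set for one log record. It is an approximation under stated assumptions rather than the code in log.ts: buildLogAttributes and ActiveSpanContext are hypothetical names, and silently skipping non-primitive values is an assumption about how flattening behaves.

```typescript
type Primitive = string | number | boolean;

// Hypothetical stand-in for the IDs read from trace.getActiveSpan().
interface ActiveSpanContext {
  traceId: string;
  spanId: string;
}

function buildLogAttributes(
  data: Record<string, unknown>,
  spanContext?: ActiveSpanContext,
): Record<string, Primitive> {
  const attributes: Record<string, Primitive> = {};

  // Attach correlation IDs only when a span is active.
  if (spanContext) {
    attributes['trace_id'] = spanContext.traceId;
    attributes['span_id'] = spanContext.spanId;
  }

  // Flatten primitive fields into log.data.*.
  // Assumption: objects, arrays, null, and undefined are skipped.
  for (const key of Object.keys(data)) {
    const value = data[key];
    if (
      typeof value === 'string' ||
      typeof value === 'number' ||
      typeof value === 'boolean'
    ) {
      attributes[`log.data.${key}`] = value;
    }
  }
  return attributes;
}

const attrs = buildLogAttributes(
  { projectName: 'demo', retries: 2, nested: { ignored: true } },
  { traceId: 'a'.repeat(32), spanId: 'b'.repeat(16) },
);
```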

3. Error Classification (Exception classes)

Location: packages/log/src/lib/exceptions.ts

  • All application errors extend a root Exception type.
  • Exception includes:
    • _uuid (error instance ID)
    • httpStatus
    • _errors (public/private details)
    • errorCategory: APP_ERROR | USER_ERROR | SYSTEM_ERROR.

The category is derived mostly from the concrete subclass:

  • UserException, ForbiddenException, NotFoundException → USER_ERROR.
  • SystemException → SYSTEM_ERROR.
  • Everything else defaults to APP_ERROR.
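
The subclass-to-category mapping can be sketched as follows. This is a simplified illustration, not the contents of exceptions.ts: the constructor signatures and default HTTP status codes are assumptions, _uuid and _errors are omitted, and ConfigException is a hypothetical subclass used only to show the default.

```typescript
type ErrorCategory = 'APP_ERROR' | 'USER_ERROR' | 'SYSTEM_ERROR';

// Simplified root type; the constructor shape is an assumption.
class Exception extends Error {
  constructor(
    message: string,
    readonly httpStatus: number = 500,
    readonly errorCategory: ErrorCategory = 'APP_ERROR',
  ) {
    super(message);
    // Keep instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class UserException extends Exception {
  constructor(message: string, httpStatus: number = 400) {
    super(message, httpStatus, 'USER_ERROR');
  }
}

class NotFoundException extends Exception {
  constructor(message: string) {
    super(message, 404, 'USER_ERROR');
  }
}

class SystemException extends Exception {
  constructor(message: string) {
    super(message, 500, 'SYSTEM_ERROR');
  }
}

// Hypothetical subclass that declares no category: it falls back to APP_ERROR.
class ConfigException extends Exception {}

const categories = [
  new UserException('invalid input').errorCategory,
  new NotFoundException('no such project').errorCategory,
  new SystemException('disk full').errorCategory,
  new ConfigException('missing setting').errorCategory,
];
```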

Express error handler (backend-engine/app-setup.ts) logs:

  • error_uuid
  • error_category
  • http_status
  • optional raw error string (for non‑Exception errors)
  • plus any span context (trace/span, user/org/project attributes).

This gives Grafana/Loki clear dimensions to filter and aggregate by category.
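
A rough sketch of how the handler might derive these fields from an arbitrary thrown value is shown below. The ClassifiedError interface, the errorLogFields function, and the raw_error key are hypothetical, and treating non-Exception errors as APP_ERROR with status 500 is an assumption rather than confirmed behaviour.

```typescript
type ErrorCategory = 'APP_ERROR' | 'USER_ERROR' | 'SYSTEM_ERROR';

// Structural view of the fields the handler reads from an Exception.
interface ClassifiedError {
  _uuid: string;
  httpStatus: number;
  errorCategory: ErrorCategory;
}

// Duck-typed check so the sketch does not depend on the real class.
function isClassifiedError(err: unknown): err is ClassifiedError {
  if (typeof err !== 'object' || err === null) {
    return false;
  }
  const candidate = err as Record<string, unknown>;
  return (
    typeof candidate['_uuid'] === 'string' &&
    typeof candidate['httpStatus'] === 'number' &&
    typeof candidate['errorCategory'] === 'string'
  );
}

function errorLogFields(err: unknown): Record<string, string | number> {
  if (isClassifiedError(err)) {
    return {
      error_uuid: err._uuid,
      error_category: err.errorCategory,
      http_status: err.httpStatus,
    };
  }
  // Assumption: unknown errors count as application bugs (APP_ERROR, 500)
  // and carry the raw error string for debugging.
  return {
    error_category: 'APP_ERROR',
    http_status: 500,
    raw_error: String(err),
  };
}

const classified = errorLogFields({
  _uuid: 'e-123',
  httpStatus: 404,
  errorCategory: 'USER_ERROR',
});
const unclassified = errorLogFields(new TypeError('boom'));
```

Trace and span IDs are not added here because the logger attaches them from the active span.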

4. Span Enrichment & Middleware

Key idea: enrich spans progressively as middleware learns more context (sequential enrichment).

  • In setupBasicExpressApp (shared backend engine):
    • We set generic HTTP attributes early:
      • http.method
      • http.route
      • http.url.
    • Optionally, we can infer project.id from URL/body when obvious.
  • In authentication middleware:
    • app-gateway (jwt-auth.ts) sets span attributes:
      • user.id
      • organization.id
      • gateway.id.
    • app-ganymede (auth.ts):
      • For user tokens: user.id, user.username.
      • For org/gateway tokens: organization.id, gateway.id.
  • This matches the “set context once, correlate everywhere” principle from the original audit:
    • Logs don’t redundantly store user_id / organization_id / project_id.
    • Instead, they carry trace_id / span_id, and Grafana can look up the span attributes.
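
Sequential enrichment can be sketched with a minimal middleware chain. The SpanLike, RequestLike, and Middleware types and the RecordingSpan class are hypothetical stand-ins for the Express request and the OpenTelemetry active span; only the attribute names come from the lists above.

```typescript
// Minimal stand-in for the active span.
interface SpanLike {
  setAttribute(key: string, value: string): void;
}

// Minimal stand-in for an Express request after token verification.
interface RequestLike {
  method: string;
  route: string;
  url: string;
  user?: { id: string; username: string };
}

type Middleware = (req: RequestLike, span: SpanLike, next: () => void) => void;

// Early middleware: generic HTTP attributes.
const httpAttributes: Middleware = (req, span, next) => {
  span.setAttribute('http.method', req.method);
  span.setAttribute('http.route', req.route);
  span.setAttribute('http.url', req.url);
  next();
};

// Later middleware: identity attributes, once the token has been checked.
const authAttributes: Middleware = (req, span, next) => {
  if (req.user) {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('user.username', req.user.username);
  }
  next();
};

// Run the middlewares in order, each one adding to the same span.
function runChain(
  middlewares: Middleware[],
  req: RequestLike,
  span: SpanLike,
): void {
  const step = (i: number): void => {
    if (i < middlewares.length) {
      middlewares[i](req, span, () => step(i + 1));
    }
  };
  step(0);
}

// Test double that records attributes instead of exporting them.
class RecordingSpan implements SpanLike {
  attributes: Record<string, string> = {};
  setAttribute(key: string, value: string): void {
    this.attributes[key] = value;
  }
}

const span = new RecordingSpan();
runChain(
  [httpAttributes, authAttributes],
  {
    method: 'GET',
    route: '/projects/:id',
    url: '/projects/42',
    user: { id: 'u-1', username: 'ada' },
  },
  span,
);
```

Any log emitted after the chain runs needs only the span's IDs, since the identity attributes are already on the span.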

5. WebSocket Logging (Gateway)

Location: packages/app-gateway/src/websocket.ts

We implemented:

  • A websocket.upgrade span for the HTTP upgrade handshake.
  • A websocket.connection span for each YJS WebSocket connection, with attributes:
    • websocket.room_id
    • project.id (via ProjectRoomsManager)
    • user.id (from JWT)
    • websocket.connected flag
    • Close code + reason on shutdown.
  • Structured logs for:
    • Connection success, token expiry, auth failures, and upgrade errors.
    • Classification:
      • Auth failures → effectively USER_ERROR in terms of semantics.
      • Upgrade failures → APP_ERROR/system side.

This matches the WebSocket cases in LOGGING_AUDIT.md without repeating all details here.
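
For orientation, the connection attributes and the failure classification can be expressed as two small functions. This is an illustrative sketch, not the code in websocket.ts: the function names and the WebSocketFailure type are hypothetical.

```typescript
type ErrorCategory = 'APP_ERROR' | 'USER_ERROR' | 'SYSTEM_ERROR';

// Hypothetical labels for the two failure kinds classified above.
type WebSocketFailure = 'auth_failure' | 'upgrade_failure';

// Auth failures are the caller's problem; upgrade failures are ours.
function classifyWebSocketFailure(kind: WebSocketFailure): ErrorCategory {
  return kind === 'auth_failure' ? 'USER_ERROR' : 'APP_ERROR';
}

// Attributes set on the websocket.connection span once a client is connected.
function connectionSpanAttributes(
  roomId: string,
  projectId: string,
  userId: string,
): Record<string, string | boolean> {
  return {
    'websocket.room_id': roomId,
    'project.id': projectId,
    'user.id': userId,
    'websocket.connected': true,
  };
}

const connectionAttributes = connectionSpanAttributes('room-7', 'p-9', 'u-1');
const authCategory = classifyWebSocketFailure('auth_failure');
const upgradeCategory = classifyWebSocketFailure('upgrade_failure');
```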


Frontend: Traces, browserLog, and Limits

1. Browser Observability (@holistix/observability)

Usage in app-frontend/src/main.tsx:

import { initializeBrowserObservability } from '@holistix/observability';

initializeBrowserObservability({
  serviceName: 'frontend',
  environment:
    (window as any).OTEL_DEPLOYMENT_ENVIRONMENT ??
    import.meta.env.VITE_ENVIRONMENT ??
    'development',
});

Implementation (packages/observability/src/browser/init.ts):

  • Uses WebTracerProvider (OTel browser tracing).
  • Exports traces via OTLPTraceExporter → OTLP /v1/traces.
  • Enables web auto‑instrumentations:
    • fetch
    • XMLHttpRequest
    • user interaction instrumentation.
  • Registers the tracer provider globally.

Result: frontend traces exist and correlate with backend traces via trace_id.
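
This document does not spell out the propagation mechanism. Assuming the default W3C Trace Context propagation, the browser instrumentation adds a traceparent header to outgoing requests and the backend continues the same trace from it. The sketch below parses such a header to show where the shared trace_id comes from; parseTraceparent is a hypothetical helper, not part of @holistix/observability.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// Format: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) {
    return null;
  }
  const [, version, traceId, parentSpanId, flags] = match;
  // All-zero trace and span IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  return {
    version,
    traceId,
    parentSpanId,
    // Lowest flag bit marks the trace as sampled.
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}

const parsed = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
```

The traceId recovered here is the value that appears as trace_id on backend spans and logs for that request.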

2. browserLog Helper (@holistix/frontend-data)

We introduced a small, browser‑only logging shim that we can later hook into OTLP if needed.

Location: packages/frontend-data/src/lib/browser-log.ts
Exports: browserLog via @holistix/frontend-data.

API:

declare function browserLog(
  level: 'debug' | 'info' | 'warn' | 'error',
  category: string,
  message: string,
  options?: {
    data?: unknown;
    asSpanEvent?: boolean;
  }
): void;

Behavior:

  • Logs a single structured object to the browser console:
    • { level, category, message, data? }.
  • Optionally (asSpanEvent: true), also attaches a span event to the active span:
    • span.addEvent('browser.log', attributes), where attributes holds log.level, log.category, log.message, and optionally log.data.
  • Today, we do not send browser logs to Loki:
    • This avoids bundler issues with the logs SDK and keeps volume manageable.
    • We rely on traces as the primary frontend telemetry.
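
The behavior above can be approximated with the sketch below. It is not the code in browser-log.ts: browserLogSketch takes the span and the console writer as injected dependencies so it can run anywhere, whereas the real helper presumably reads the active span itself. Serializing data with JSON.stringify is also an assumption.

```typescript
type BrowserLogLevel = 'debug' | 'info' | 'warn' | 'error';

interface BrowserLogOptions {
  data?: unknown;
  asSpanEvent?: boolean;
}

// Minimal stand-in for the active span.
interface SpanLike {
  addEvent(name: string, attributes: Record<string, string>): void;
}

// Hypothetical injection points, used here to keep the sketch testable.
interface BrowserLogDeps {
  span?: SpanLike;
  write?: (entry: Record<string, unknown>) => void;
}

function browserLogSketch(
  level: BrowserLogLevel,
  category: string,
  message: string,
  options: BrowserLogOptions = {},
  deps: BrowserLogDeps = {},
): void {
  // 1. Always write one structured object to the console.
  const entry: Record<string, unknown> = { level, category, message };
  if (options.data !== undefined) {
    entry['data'] = options.data;
  }
  const write =
    deps.write ?? ((e: Record<string, unknown>) => console[level](e));
  write(entry);

  // 2. Only when asked, mirror the entry onto the active span as an event.
  if (options.asSpanEvent && deps.span) {
    const attributes: Record<string, string> = {
      'log.level': level,
      'log.category': category,
      'log.message': message,
    };
    if (options.data !== undefined) {
      // Assumption: data is serialized to a string attribute.
      attributes['log.data'] = JSON.stringify(options.data);
    }
    deps.span.addEvent('browser.log', attributes);
  }
}
```

Because asSpanEvent defaults to false, a plain debug call costs one console write and adds nothing to the trace.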

Current usages:

  • frontend-data:
    • LOCAL_STORAGE_STORE debug (local storage state machine).
    • API_CALL debug around token logic in GanymedeApi.
    • STORY_API_CONTEXT mock API fetch logs.
  • app-frontend:
    • PROJECT_CONTEXT debug updates (project state + user snapshot).

All of these are low‑value debug signals, so asSpanEvent is left at its default (false).

If we later decide certain frontend failures are critical (e.g. global error boundary, “project load failed” page), we can:

  • Switch those specific calls to asSpanEvent: true.
  • Or implement a small OTLP logs exporter on top of browserLog without touching call sites.