Logging & Observability Architecture
This document describes the current, implemented logging and observability design (backend + frontend).
Goals
- Centralized telemetry: all important logs and traces from app-ganymede, app-gateway, and app-frontend go to the same stack.
- Flow tracking: follow a request end-to-end across HTTP, WebSocket, and SQL using traces.
- Error classification: distinguish app bugs vs. user errors in logs.
- AI-/API-friendly: logs and traces are structured and consistently tagged (future: AI/agent access API).
Stack Overview
- OpenTelemetry SDK (Node + Browser)
  - Distributed tracing, context propagation, log correlation.
- OTLP Collector
  - Receives OTLP traces/logs from apps and forwards them.
- Tempo
  - Trace storage and query (TraceQL, Grafana Tempo datasource).
- Loki
  - Log storage and query (LogQL, Grafana Loki datasource).
- Grafana
  - Unified UI over Loki (logs) and Tempo (traces).
High level:
Apps (Node + Browser)
↓ OTLP (HTTP/gRPC)
Collector
↓
Tempo (traces) + Loki (logs)
↓
Grafana dashboards / APIs
Core Principles
1. Set Context Once, Correlate Everywhere
- Context lives on spans, not in every log record:
  - user.id
  - organization.id
  - project.id
  - gateway IDs, etc.
- Logs only need:
  - trace_id
  - span_id
  - plus local log metadata (category, message, error info).
- Correlation is done in Grafana/Loki/Tempo via trace_id.
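To illustrate the principle, the sketch below resolves the context of a log line by looking up its span through trace_id and span_id, which is roughly what happens when jumping from a Loki log line to a Tempo trace in Grafana. The record shapes (SpanRecord, LogRecordLite) and the helper contextForLog are hypothetical and only model the idea; they are not the Loki or Tempo data model.

```typescript
// Hypothetical, simplified shapes; the real stores hold much richer records.
interface SpanRecord {
  traceId: string;
  spanId: string;
  attributes: Record<string, string>;
}

interface LogRecordLite {
  trace_id: string;
  span_id: string;
  category: string;
  message: string;
}

// Resolve the context of a log line from its span instead of duplicating
// user/organization/project IDs on every log record.
function contextForLog(
  log: LogRecordLite,
  spans: SpanRecord[],
): Record<string, string> | undefined {
  const span = spans.find(
    (s) => s.traceId === log.trace_id && s.spanId === log.span_id,
  );
  return span?.attributes;
}

const exampleSpans: SpanRecord[] = [
  {
    traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
    spanId: '00f067aa0ba902b7',
    attributes: { 'user.id': 'u-1', 'project.id': 'p-9' },
  },
];

const exampleLog: LogRecordLite = {
  trace_id: '4bf92f3577b34da6a3ce929d0e0e4736',
  span_id: '00f067aa0ba902b7',
  category: 'PROJECT',
  message: 'project loaded',
};

// The log record itself carries no user or project ID; both come from the span.
const resolvedContext = contextForLog(exampleLog, exampleSpans);
```

The log record stays small, and any attribute added to the span later becomes available to every log line in that span without touching the logging call sites.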
2. Backend Logs are First‑Class; Frontend Logs are Secondary
- Backend (app-ganymede, app-gateway, shared libs):
  - Logs go to Loki via OTLP, with trace context and error classification.
- Frontend (app-frontend + frontend-data):
  - Primary telemetry = traces (browser OTel SDK).
  - Frontend “logs” are currently console + optional span events, not shipped to Loki yet.
3. One Shared Initialization per Platform
- Node apps call a shared initializer from @holistix/observability.
- Browser app calls a shared initializer from @holistix/observability.
- All apps share:
  - OTLP endpoint config
  - resource attributes (service.name, deployment.environment, version)
  - auto-instrumentation setup.
Backend: Observability & Logging
1. Node Observability (@holistix/observability)
Usage in apps (simplified):
import { initializeNodeObservability } from '@holistix/observability';

initializeNodeObservability({
  // One service name per app, e.g. 'ganymede' or 'gateway'.
  serviceName: 'ganymede',
  environment: process.env.OTEL_DEPLOYMENT_ENVIRONMENT,
  version: process.env.SERVICE_VERSION,
});
What it does (see packages/observability/src/node/init.ts):
- Reads OTLP endpoint config (HTTP) from environment.
- Creates a Resource with:
  - service.name
  - deployment environment
  - service.version.
- Configures:
  - trace exporter → OTLP /v1/traces.
  - log exporter → OTLP /v1/logs.
- Enables Node auto-instrumentations:
  - Express, HTTP, PostgreSQL, etc.
- Starts NodeSDK once per process.
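The configuration part of these steps can be sketched as a pure function that derives the two exporter URLs and the resource attributes. This is not the code in init.ts: the function name resolveOtlpSettings is hypothetical, and reading the base endpoint from the standard OTEL_EXPORTER_OTLP_ENDPOINT variable with a fallback of http://localhost:4318 is an assumption, since this document does not name the variable the initializer reads.

```typescript
interface ObservabilityOptions {
  serviceName: string;
  environment?: string;
  version?: string;
}

interface OtlpSettings {
  tracesUrl: string;
  logsUrl: string;
  resourceAttributes: Record<string, string>;
}

// Assumption: the base endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT.
function resolveOtlpSettings(
  options: ObservabilityOptions,
  env: Record<string, string | undefined>,
): OtlpSettings {
  const configured = env['OTEL_EXPORTER_OTLP_ENDPOINT'] ?? 'http://localhost:4318';
  // Drop trailing slashes so the signal paths can be appended safely.
  const base = configured.replace(/\/+$/, '');

  const resourceAttributes: Record<string, string> = {
    'service.name': options.serviceName,
  };
  if (options.environment) {
    resourceAttributes['deployment.environment'] = options.environment;
  }
  if (options.version) {
    resourceAttributes['service.version'] = options.version;
  }

  return {
    tracesUrl: `${base}/v1/traces`,
    logsUrl: `${base}/v1/logs`,
    resourceAttributes,
  };
}

// Example: a staging ganymede process pointing at a collector host.
const ganymedeSettings = resolveOtlpSettings(
  { serviceName: 'ganymede', environment: 'staging' },
  { OTEL_EXPORTER_OTLP_ENDPOINT: 'http://collector:4318/' },
);
```

Keeping this derivation in one place is what lets every app share the same endpoint and resource conventions.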
2. Logger (@holistix/log)
Key pieces (packages/log/src/lib/log.ts):
- Initialization (Node‑only OTLP export):
import { Logger } from '@holistix/log';

Logger.initialize({
  // optional; usually read from env
  otlpEndpointHttp: 'http://localhost:4318',
  serviceName: 'ganymede',
});
- Severity / priority:
export enum EPriority {
  Emergency = 'emergency',
  Alert = 'alert',
  Critical = 'critical',
  Error = 'error',
  Warning = 'warning',
  Notice = 'notice',
  Info = 'info',
  Debug = 'debug',
}
- Core API:
import { EPriority, log, error } from '@holistix/log';
log(EPriority.Info, 'CATEGORY', 'message', { structured: 'data' });
error('CATEGORY', 'message', { ... }); // convenience wrapper
- What log() does in Node:
  - Uses the OpenTelemetry logs SDK with an OTLP exporter.
  - Extracts the active span (trace.getActiveSpan()):
    - Attaches trace_id and span_id to log attributes (if a span exists).
  - Flattens primitive data fields into log.data.* attributes.
  - Emits a log record to Loki via the collector.
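The flattening step can be sketched as follows. The helper name flattenLogData is hypothetical, and skipping nested objects and arrays is an assumption drawn from the word "primitive" above; the actual logic lives in packages/log/src/lib/log.ts.

```typescript
type LogPrimitive = string | number | boolean;

// Copy only the primitive fields of `data` into `log.data.<key>` attributes.
function flattenLogData(
  data: Record<string, unknown>,
): Record<string, LogPrimitive> {
  const attributes: Record<string, LogPrimitive> = {};
  for (const key of Object.keys(data)) {
    const value = data[key];
    if (
      typeof value === 'string' ||
      typeof value === 'number' ||
      typeof value === 'boolean'
    ) {
      attributes[`log.data.${key}`] = value;
    }
  }
  return attributes;
}

const flattened = flattenLogData({
  projectName: 'demo',
  attempt: 2,
  cached: false,
  nested: { ignored: true }, // non-primitive, skipped in this sketch
});
```

Flat, primitive-valued attributes are straightforward to filter on in log queries, which is presumably why only primitives are promoted.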
- What log() does in the browser (today):
  - Logger.initialize() bails out in the browser (to avoid bundling the Node-only log exporter).
  - Logger._otlpLogger therefore stays null.
  - The OTLP export branch is skipped → no log record is sent.
  - We rely on traces and the separate browserLog helper (see below) for frontend diagnostics.
3. Error Classification (Exception classes)
Location: packages/log/src/lib/exceptions.ts
- All application errors extend a root Exception type.
- Exception includes:
  - _uuid (error instance ID)
  - httpStatus
  - _errors (public/private details)
  - errorCategory: APP_ERROR | USER_ERROR | SYSTEM_ERROR.
The category is derived mostly from the concrete subclass:
- UserException, ForbiddenException, NotFoundException → USER_ERROR.
- SystemException → SYSTEM_ERROR.
- Everything else defaults to APP_ERROR.
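A minimal sketch of how a subclass can fix its own category. Only the class names, the field names, and the category mapping come from the list above; the class bodies, constructor signatures, HTTP status defaults, and the placeholder ID generator are assumptions made for illustration, and _errors is omitted for brevity.

```typescript
type ErrorCategory = 'APP_ERROR' | 'USER_ERROR' | 'SYSTEM_ERROR';

class Exception extends Error {
  // Placeholder instance ID; a real implementation would use a proper UUID.
  readonly _uuid: string = Math.random().toString(36).slice(2);
  // Subclasses override this; anything else counts as an application bug.
  readonly errorCategory: ErrorCategory = 'APP_ERROR';
  readonly httpStatus: number;

  constructor(message: string, httpStatus = 500) {
    super(message);
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
    this.httpStatus = httpStatus;
  }
}

class UserException extends Exception {
  override readonly errorCategory: ErrorCategory = 'USER_ERROR';
  constructor(message: string, httpStatus = 400) {
    super(message, httpStatus);
  }
}

class NotFoundException extends Exception {
  override readonly errorCategory: ErrorCategory = 'USER_ERROR';
  constructor(message: string) {
    super(message, 404);
  }
}

class SystemException extends Exception {
  override readonly errorCategory: ErrorCategory = 'SYSTEM_ERROR';
}

// A plain Exception falls through to the APP_ERROR default.
const categories = [
  new UserException('invalid project name'),
  new NotFoundException('project not found'),
  new SystemException('database unreachable'),
  new Exception('unexpected state'),
].map((e) => e.errorCategory);
```

Because the category is a property of the class, throw sites do not have to remember to classify each error.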
Express error handler (backend-engine/app-setup.ts) logs:
- error_uuid
- error_category
- http_status
- optional raw error string (for non-Exception errors)
- plus any span context (trace/span, user/org/project attributes).
This gives Grafana/Loki clear dimensions to filter and aggregate by category.
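The field set above can be sketched as a pure function that an error handler would call before logging. The names buildErrorLogFields, ClassifiedError, and raw_error are hypothetical, as are the fallback values for errors that are not Exception instances (APP_ERROR and HTTP 500); the actual handler is in backend-engine/app-setup.ts.

```typescript
interface ClassifiedError {
  _uuid: string;
  httpStatus: number;
  errorCategory: 'APP_ERROR' | 'USER_ERROR' | 'SYSTEM_ERROR';
}

// Duck-typed check so this sketch does not depend on the real Exception class.
function isClassifiedError(err: unknown): err is ClassifiedError {
  if (typeof err !== 'object' || err === null) return false;
  const candidate = err as Record<string, unknown>;
  return (
    typeof candidate['_uuid'] === 'string' &&
    typeof candidate['httpStatus'] === 'number' &&
    typeof candidate['errorCategory'] === 'string'
  );
}

function buildErrorLogFields(err: unknown): Record<string, string | number> {
  if (isClassifiedError(err)) {
    return {
      error_uuid: err._uuid,
      error_category: err.errorCategory,
      http_status: err.httpStatus,
    };
  }
  // Assumed fallback: unknown errors are treated as application bugs.
  return {
    error_category: 'APP_ERROR',
    http_status: 500,
    raw_error: String(err),
  };
}

const classifiedFields = buildErrorLogFields({
  _uuid: 'e-1',
  httpStatus: 404,
  errorCategory: 'USER_ERROR',
});
const unclassifiedFields = buildErrorLogFields(new Error('boom'));
```

Trace and span IDs are deliberately absent here: per the logger description earlier, they are attached from the active span rather than passed by the handler.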
4. Span Enrichment & Middleware
Key idea: enrich spans progressively as middleware learns more context (sequential enrichment).
- In setupBasicExpressApp (shared backend engine):
  - We set generic HTTP attributes early:
    - http.method
    - http.route
    - http.url.
  - Optionally, we can infer project.id from URL/body when obvious.
- In authentication middleware:
  - app-gateway (jwt-auth.ts):
    - Sets span attributes: user.id, organization.id, gateway.id.
  - app-ganymede (auth.ts):
    - For user tokens: user.id, user.username.
    - For org/gateway tokens: organization.id, gateway.id.
- This matches the “set context once, correlate everywhere” principle from the original audit:
  - Logs don’t redundantly store user_id / organization_id / project_id.
  - Instead, they carry trace_id / span_id, and Grafana can look up the span attributes.
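Sequential enrichment can be sketched with two middleware-like steps writing to the same span. To stay self-contained, the example uses a minimal in-memory stand-in (RecordingSpan) instead of the OpenTelemetry span returned by trace.getActiveSpan(). The step functions and the token shape are hypothetical, while the attribute keys are the ones listed above.

```typescript
type AttributeValue = string | number | boolean;

// Stand-in for an OpenTelemetry span; only setAttribute is modelled.
class RecordingSpan {
  readonly attributes: Record<string, AttributeValue> = {};

  setAttribute(key: string, value: AttributeValue): this {
    this.attributes[key] = value;
    return this;
  }
}

interface RequestInfo {
  method: string;
  route: string;
  url: string;
}

// Hypothetical decoded-token shape; real tokens carry more fields.
interface DecodedToken {
  userId?: string;
  organizationId?: string;
  gatewayId?: string;
}

// Step 1 (app setup): generic HTTP attributes, known before authentication.
function enrichWithHttp(span: RecordingSpan, req: RequestInfo): void {
  span
    .setAttribute('http.method', req.method)
    .setAttribute('http.route', req.route)
    .setAttribute('http.url', req.url);
}

// Step 2 (auth middleware): identity attributes, known once the token is verified.
function enrichWithAuth(span: RecordingSpan, token: DecodedToken): void {
  if (token.userId) span.setAttribute('user.id', token.userId);
  if (token.organizationId) span.setAttribute('organization.id', token.organizationId);
  if (token.gatewayId) span.setAttribute('gateway.id', token.gatewayId);
}

const requestSpan = new RecordingSpan();
enrichWithHttp(requestSpan, {
  method: 'GET',
  route: '/projects/:id',
  url: '/projects/42',
});
enrichWithAuth(requestSpan, { userId: 'u-1', organizationId: 'org-7' });
```

Each layer adds only what it knows, so later layers and log calls never need to pass identity fields around explicitly.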
5. WebSocket Logging (Gateway)
Location: packages/app-gateway/src/websocket.ts
We implemented:
- A websocket.upgrade span for the HTTP upgrade handshake.
- A websocket.connection span for each YJS WebSocket connection:
  - Attributes:
    - websocket.room_id
    - project.id (via ProjectRoomsManager)
    - user.id (from JWT)
    - websocket.connected flag
  - Close code + reason on shutdown.
- Structured logs for:
  - Connection success, token expiry, auth failures, and upgrade errors.
- Classification:
  - Auth failures → effectively USER_ERROR in terms of semantics.
  - Upgrade failures → APP_ERROR / system side.
This matches the WebSocket cases in LOGGING_AUDIT.md without repeating all details here.
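The classification rule above could be expressed as a small lookup. The event names and the function classifyWebSocketEvent are hypothetical labels for the four logged cases. Token expiry is not classified in the list above, so grouping it with auth failures is an assumption, as is leaving successful connections unclassified.

```typescript
type WsEvent =
  | 'connection_success'
  | 'token_expired'
  | 'auth_failure'
  | 'upgrade_error';

type WsErrorCategory = 'USER_ERROR' | 'APP_ERROR';

// Map a logged WebSocket event to an error category, if it is an error at all.
function classifyWebSocketEvent(event: WsEvent): WsErrorCategory | undefined {
  switch (event) {
    case 'auth_failure':
    case 'token_expired': // assumption: treated like an auth failure
      return 'USER_ERROR';
    case 'upgrade_error':
      return 'APP_ERROR';
    case 'connection_success':
      return undefined; // not an error, so no category
  }
}

const upgradeCategory = classifyWebSocketEvent('upgrade_error');
```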
Frontend: Traces, browserLog, and Limits
1. Browser Observability (@holistix/observability)
Usage in app-frontend/src/main.tsx:
import { initializeBrowserObservability } from '@holistix/observability';

initializeBrowserObservability({
  serviceName: 'frontend',
  environment:
    (window as any).OTEL_DEPLOYMENT_ENVIRONMENT ??
    import.meta.env.VITE_ENVIRONMENT ??
    'development',
});
Implementation (packages/observability/src/browser/init.ts):
- Uses WebTracerProvider (OTel browser tracing).
- Exports traces via OTLPTraceExporter → OTLP /v1/traces.
- Enables web auto-instrumentations:
  - fetch
  - XMLHttpRequest
  - user interaction instrumentation.
- Registers the tracer provider globally.
Result: frontend traces exist and correlate with backend traces via trace_id.
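Frontend and backend spans end up in the same trace because the instrumented fetch/XHR calls forward the trace context in a request header. Assuming the W3C Trace Context propagator (the OpenTelemetry default) is in use, that header is traceparent, and the sketch below shows how the shared trace_id can be read from it. The parser name parseTraceparent is hypothetical, it only handles version 00 of the header, and backend auto-instrumentation normally does this work for us.

```typescript
interface TraceContext {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// Parse a version-00 traceparent header: 00-<32 hex>-<16 hex>-<2 hex flags>.
function parseTraceparent(header: string): TraceContext | undefined {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (!match) return undefined;

  const [, traceId, parentSpanId, flags] = match;
  if (traceId === undefined || parentSpanId === undefined || flags === undefined) {
    return undefined;
  }
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return undefined;

  return {
    traceId,
    parentSpanId,
    // Bit 0 of the flags byte is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

const incoming = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
```

The traceId extracted here is the same value that appears as trace_id on backend log records, which is what makes the frontend-to-backend jump possible in Grafana.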
2. browserLog Helper (@holistix/frontend-data)
We introduced a small, browser‑only logging shim that we can later hook into OTLP if needed.
Location: packages/frontend-data/src/lib/browser-log.ts
Exports: browserLog via @holistix/frontend-data.
API:
browserLog(
  level: 'debug' | 'info' | 'warn' | 'error',
  category: string,
  message: string,
  options?: {
    data?: unknown;
    asSpanEvent?: boolean;
  }
);
Behavior:
- Logs a single structured object to the browser console: { level, category, message, data? }.
- Optionally (asSpanEvent: true), also attaches a span event to the active span:
  - span.addEvent('browser.log', { 'log.level', 'log.category', 'log.message', 'log.data'? }).
- Today, we do not send browser logs to Loki:
  - This avoids bundler issues with the logs SDK and keeps volume manageable.
  - We rely on traces as the primary frontend telemetry.
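The behavior above can be approximated as follows. This is a sketch, not the contents of browser-log.ts: the function is named browserLogSketch to avoid confusion with the real export, the active span is passed in explicitly through a hypothetical SpanEventTarget interface rather than fetched from the OpenTelemetry API, and serializing data with JSON.stringify for the span event is an assumption.

```typescript
type BrowserLogLevel = 'debug' | 'info' | 'warn' | 'error';

// Minimal stand-in for the part of a span this helper needs.
interface SpanEventTarget {
  addEvent(name: string, attributes: Record<string, string>): void;
}

interface BrowserLogOptions {
  data?: unknown;
  asSpanEvent?: boolean;
}

function browserLogSketch(
  level: BrowserLogLevel,
  category: string,
  message: string,
  options: BrowserLogOptions = {},
  activeSpan?: SpanEventTarget,
): void {
  // Always log one structured object to the console.
  const entry =
    options.data === undefined
      ? { level, category, message }
      : { level, category, message, data: options.data };
  console[level](entry);

  // Only attach a span event when explicitly requested and a span exists.
  if (options.asSpanEvent && activeSpan) {
    const attributes: Record<string, string> = {
      'log.level': level,
      'log.category': category,
      'log.message': message,
    };
    if (options.data !== undefined) {
      // Assumption: data is serialized to a string for the event attribute.
      attributes['log.data'] = JSON.stringify(options.data);
    }
    activeSpan.addEvent('browser.log', attributes);
  }
}

// Console-only usage, matching the default asSpanEvent: false.
browserLogSketch('debug', 'LOCAL_STORAGE_STORE', 'state changed', {
  data: { state: 'ready' },
});
```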
Current usages:
- frontend-data:
  - LOCAL_STORAGE_STORE debug (local storage state machine).
  - API_CALL debug around token logic in GanymedeApi.
  - STORY_API_CONTEXT mock API fetch logs.
- app-frontend:
  - PROJECT_CONTEXT debug updates (project state + user snapshot).
All of these are low‑value debug signals, so asSpanEvent is left at its default (false).
If we later decide certain frontend failures are critical (e.g. global error boundary, “project load failed” page), we can:
- Switch those specific calls to asSpanEvent: true.
- Or implement a small OTLP logs exporter on top of browserLog without touching call sites.
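One way to keep call sites untouched is to route every log entry through a registry of sinks, so that an OTLP exporter can be added later as one more sink. The sketch below is hypothetical throughout: LogSink, registerLogSink, emitBrowserLog, and the in-memory array that stands in for an exporter are not part of @holistix/frontend-data.

```typescript
interface FrontendLogEntry {
  level: 'debug' | 'info' | 'warn' | 'error';
  category: string;
  message: string;
  data?: unknown;
}

type LogSink = (entry: FrontendLogEntry) => void;

const logSinks: LogSink[] = [];

// Register a sink and return a function that removes it again.
function registerLogSink(sink: LogSink): () => void {
  logSinks.push(sink);
  return () => {
    const index = logSinks.indexOf(sink);
    if (index >= 0) logSinks.splice(index, 1);
  };
}

// Call sites keep calling one function; sinks decide what happens to the entry.
function emitBrowserLog(entry: FrontendLogEntry): void {
  // Iterate over a copy so a sink may unregister itself while handling an entry.
  for (const sink of [...logSinks]) sink(entry);
}

// Stand-in for a future OTLP exporter: only error-level entries are queued,
// which keeps the exported volume low.
const pendingExport: FrontendLogEntry[] = [];
const unregisterExportSink = registerLogSink((entry) => {
  if (entry.level === 'error') pendingExport.push(entry);
});

emitBrowserLog({ level: 'debug', category: 'API_CALL', message: 'token refreshed' });
emitBrowserLog({
  level: 'error',
  category: 'PROJECT_CONTEXT',
  message: 'project load failed',
});
```

With this shape, moving from console-only logging to exported logs is a change in one registration call rather than in every caller.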