Velum Docs
Everything you need to integrate, configure, and understand Velum, the Product Healing Agent.
Quick Start
Prerequisites
- Go 1.23+
- PostgreSQL
- LLM API key (optional — enables AI features) — Groq (default), OpenAI, or any OpenAI-compatible provider
Install & Run
$ git clone https://github.com/gokulnair2001/Velum.git
$ cd Velum
$ go mod tidy
$ cp example.config.yaml config.yaml # edit with your Postgres credentials
$ go run cmd/velum/main.go
✓ velum listening on :8080
Docker
$ docker network create velum-network
$ docker compose up --build
Velum creates all required database tables automatically on first connection. No migrations needed.
Try It
Send a few test events to see Velum detect patterns in real time:
$ curl -X POST http://localhost:8080/api/v1/analyze \
-H "Content-Type: application/json" \
-H "X-Project-ID: my-app" \
-d '{
"events": [
{
"event": "checkout_page_view",
"ts": 1707500000000,
"user_id": "usr-101",
"session_id": "sess-abc",
"device": "mobile"
},
{
"event": "checkout_payment_click",
"ts": 1707500015000,
"user_id": "usr-101",
"session_id": "sess-abc",
"error_code": "card_declined"
},
{
"event": "checkout_payment_click",
"ts": 1707500045000,
"user_id": "usr-101",
"session_id": "sess-abc",
"error_code": "card_declined"
}
]
}'
Velum auto-detects which field is the event name, user ID, timestamp, etc. You don't need to configure your schema — the Context Enricher (Layer 0) handles it via AI.
Optional: Build a Baseline
Want trend comparisons ("retry storms increased 21% vs. last 28 days")? Feed historical events to /api/v1/baseline first. Without this step, /analyze still detects all patterns — you just won't get trend data.
$ curl -X POST http://localhost:8080/api/v1/baseline \
-H "Content-Type: application/json" \
-H "X-Project-ID: my-app" \
-d @historical_events.json
Processing Pipeline
Events flow through an 8-layer sequential pipeline. Each layer implements a Layer interface and processes the output of the previous layer. One API call triggers the full pipeline.
Context Enricher
Classifies event properties as dimension, target, condition, or measure via LLM (Groq default)
Vocab Enricher
Tokenizes event names, classifies unknown words as surface / status / flow / noise
Event Adapter
Normalizes raw events into canonical form using vocab + property lookups
Session Flow Reconstructor
Groups events by user + session. Splits at 30-min gaps. Builds flow instances with retry-cycle merging.
Behavior Analyzer
Tags flows with behavioral signals: explore, attempt, succeed, retry, abandon, hesitate
Pattern Detector
Aggregates behaviors across all users into named anti-patterns with impact ratios
Baseline Comparator
Compares current patterns against historical snapshots. Flags new, increasing, or anomalous trends.
AI Analyzer
Generates natural-language summaries with actionable hypotheses grounded in data
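The per-layer contract can be sketched in Go. The `Layer` interface name comes from the text above, but the method signatures here are illustrative assumptions, not Velum's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// Layer is a sketch of the per-stage contract: each layer consumes the
// previous layer's output. The signatures are assumptions for illustration.
type Layer interface {
	Name() string
	Process(input any) (any, error)
}

// uppercase is a toy stage standing in for a real layer.
type uppercase struct{}

func (uppercase) Name() string { return "uppercase" }

func (uppercase) Process(in any) (any, error) {
	s, ok := in.(string)
	if !ok {
		return nil, fmt.Errorf("uppercase: want string, got %T", in)
	}
	return strings.ToUpper(s), nil
}

// runPipeline feeds each layer's output into the next, the way one
// /analyze call drives all 8 layers sequentially.
func runPipeline(layers []Layer, input any) (any, error) {
	cur := input
	for _, l := range layers {
		next, err := l.Process(cur)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", l.Name(), err)
		}
		cur = next
	}
	return cur, nil
}

func main() {
	out, _ := runPipeline([]Layer{uppercase{}}, "retry_storm")
	fmt.Println(out) // RETRY_STORM
}
```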
Layer Deep Dives
Layer 0: Context Enricher
Purpose: Identifies the role of each property/field in your event JSON — which field is the user ID, timestamp, event name, etc.
Different products send events in completely different schemas. The Property Agent eliminates the need for per-customer configuration by using LLM inference (Groq by default, or any configured provider) to classify fields automatically.
How it works
- Samples events from the batch
- Sends them to Groq LLM with a classification prompt
- LLM returns a field → role mapping (event_name, user_id, timestamp, etc.)
- Mapping is cached and used by Layer 2 for field extraction
// Product A
{"event_name": "checkout", "uid": "u1", "time": 170850000}
// Product B
{"action": "purchase", "user": "u1", "timestamp": "2024-02-21T10:00:00Z"}
// Velum detects both automatically — zero config needed
Without the Property Agent, Velum would need a config file per customer. This layer makes Velum truly zero-config.
Layer 1: Vocab Enricher
Purpose: Learns the meaning of unknown words in your event names by classifying them into semantic roles.
| Category | Examples | Meaning |
|---|---|---|
| Surface | checkout, payment, ride, playback | The "what" / "where" |
| Status | failed, success, click, initiated | The "state" |
| Flow | cart, auth, booking, registration | Higher-level grouping |
| Noise | the, a, total, count | Irrelevant |
How it works
- Tokenizes every event name (payment_failed → ["payment", "failed"])
- Checks each token against PostgreSQL vocab storage
- Unknown tokens are batched and sent to Groq LLM for classification
- Classifications are stored back to PostgreSQL for future requests
Event names in snake_case, camelCase, kebab-case, and dot.notation are all tokenized and classified automatically.
Layer 2: Event Adapter
Purpose: Transforms raw JSON events into canonical events — Velum's internal standardized format.
Dual-Lookup Guarantee
Tokens are resolved using a two-tier system for reliability:
- PostgreSQL first — AI-learned vocab from Layers 0/1
- Static vocabulary fallback — hardcoded common words
- If both miss → uncategorized (learned on next request)
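The two-tier resolution can be sketched as follows. The maps stand in for PostgreSQL and the static vocabulary; the function and variable names are illustrative:

```go
package main

import "fmt"

// learnedVocab stands in for the AI-learned vocab stored in PostgreSQL.
var learnedVocab = map[string]string{"payment": "surface"}

// staticVocab stands in for the hardcoded common-word fallback.
var staticVocab = map[string]string{"failed": "status", "success": "status"}

// resolveToken sketches the dual-lookup guarantee: learned vocab first,
// static fallback second, "uncategorized" when both miss.
func resolveToken(tok string) string {
	if role, ok := learnedVocab[tok]; ok { // tier 1: PostgreSQL-backed vocab
		return role
	}
	if role, ok := staticVocab[tok]; ok { // tier 2: static fallback
		return role
	}
	return "uncategorized" // learned on the next request
}

func main() {
	fmt.Println(resolveToken("payment")) // surface
	fmt.Println(resolveToken("failed"))  // status
	fmt.Println(resolveToken("blorp"))   // uncategorized
}
```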
CanonicalEvent{
Flow: "payment", // from Surface token
Action: "payment", // from Surface token
Status: "failed", // from Status token
UserID: "u1",
SessionID: "s1",
Timestamp: 1708500015000,
RawProperties: { ... } // original JSON preserved
}
Layer 3: Session Flow Reconstructor
Purpose: Groups canonical events into user sessions and flow instances.
Processing Steps
- Group by user — collect all events per user_id
- Split into sessions — 30-minute inactivity gap threshold
- Build flow instances — group contiguous events by flow name
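The session-splitting step can be sketched as a cut wherever the gap between consecutive events exceeds the 30-minute threshold (a simplified model of step 2, assuming per-user timestamps sorted ascending):

```go
package main

import "fmt"

const sessionGapMs = 30 * 60 * 1000 // 30-minute inactivity threshold

// splitSessions cuts one user's sorted event timestamps (epoch ms) into
// sessions at every gap larger than sessionGapMs.
func splitSessions(timestamps []int64) [][]int64 {
	var sessions [][]int64
	var cur []int64
	for i, ts := range timestamps {
		if i > 0 && ts-timestamps[i-1] > sessionGapMs {
			sessions = append(sessions, cur)
			cur = nil
		}
		cur = append(cur, ts)
	}
	if len(cur) > 0 {
		sessions = append(sessions, cur)
	}
	return sessions
}

func main() {
	// A gap of ~32 minutes after the second event splits the batch in two.
	ts := []int64{0, 60_000, 2_000_000, 2_060_000}
	fmt.Println(len(splitSessions(ts))) // 2
}
```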
Key Rules
| Rule | Example |
|---|---|
| Same flow contiguous → same instance | playback_started, playback_error → 1 flow |
| Entry status → new instance | booking_requested starts new booking flow |
| Lifecycle events filtered | session_end, app_closed → no flow created |
| Surface-fallback folding | driver_assigned folds into active booking flow |
| Retry cycle merging | fail → re-attempt collapses into 1 flow instance |
When a flow fails and the user re-attempts (e.g., booking → driver_cancelled → booking), Velum merges these into a single flow instance with retry evidence. This prevents inflated flow counts.
Layer 4: Behavior Analyzer
Purpose: Tags each flow instance with behavioral signals — what the user was doing.
| Behavior | Meaning | Detection |
|---|---|---|
| explore | User looked around | Entry/view events |
| attempt | User tried to do something | Action events (submit, pay) |
| succeed | User completed the goal | Success/complete status |
| retry | User tried again after failure | Error → same action repeated |
| abandon | User left without completing | No success + session ends |
| hesitate | User paused before acting | Long delay between events |
| progress | User moved to next step | Flow transition to deeper step |
Each flow also receives an intent classification: transact (intended to complete a transaction), browse (just looking), or unknown.
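As one concrete signal, hesitate detection reduces to a gap check between consecutive events. The 30-second threshold below is an assumption for illustration, not Velum's actual value:

```go
package main

import "fmt"

// tagHesitate sketches the "hesitate" signal from the table above: the
// flow is flagged when any delay between consecutive events exceeds the
// threshold. Threshold value is a hypothetical, not Velum's real default.
func tagHesitate(timestamps []int64, thresholdMs int64) bool {
	for i := 1; i < len(timestamps); i++ {
		if timestamps[i]-timestamps[i-1] > thresholdMs {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(tagHesitate([]int64{0, 45_000}, 30_000)) // true: 45s pause
	fmt.Println(tagHesitate([]int64{0, 5_000}, 30_000))  // false
}
```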
Layer 5: Pattern Detector
Purpose: Detects anti-patterns across all users — systemic problems, not individual quirks.
How it works
- Groups all flow instances by (flowName, contextKey)
- Runs 5 detection algorithms across each group
- Each detector calculates an impact ratio against configurable thresholds
Patterns are detected across multiple flow instances per user, not just within a single flow. For example: user fails booking, creates new booking → counts as retry across instances. This is critical for accuracy.
{
"pattern": "retry_storm",
"flow": "booking",
"affected_users": 3,
"total_flows": 4,
"impact_ratio": 0.75,
"description": "High frequency of retry attempts in booking flow"
}
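The impact ratio in the example above is a simple quotient, sketched here as a hypothetical helper (affected users over total flow instances in the group):

```go
package main

import "fmt"

// impactRatio sketches how a detector scores a pattern within one
// (flow, context) group: affected users / total flow instances. The
// function name is illustrative, not from the Velum source.
func impactRatio(affectedUsers, totalFlows int) float64 {
	if totalFlows == 0 {
		return 0
	}
	return float64(affectedUsers) / float64(totalFlows)
}

func main() {
	// Matches the retry_storm example: 3 affected users, 4 booking flows.
	fmt.Println(impactRatio(3, 4)) // 0.75
}
```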
Layer 6: Baseline Comparator
Purpose: Compares current patterns against historical baselines stored in PostgreSQL to detect trends.
Trend Classifications
- New — first time this pattern was observed
- Significant Increase — impact ratio increased beyond std_dev_multiplier × σ
- Significant Decrease — impact ratio decreased significantly
- Stable — within normal historical range
// Current observation
retry_storm on booking: impact_ratio = 0.75
// Historical baseline (last 7 observations)
average = 0.35, σ = 0.08
// Deviation calculation
(0.75 - 0.35) / 0.08 = 5.0 standard deviations
threshold = 2.0
→ SIGNIFICANT INCREASE
Layer 7: AI Analyzer
Purpose: Generates natural language analysis of detected patterns using your configured LLM provider (Groq by default).
Enriched Prompt Includes
- Pattern metadata (name, flow, affected users, impact ratios)
- Error code distributions from event evidence
- Sample user journeys (actual event sequences)
- Baseline comparison results
Strict Quality Rules
The system prompt enforces: numbers and percentages in every detail, error codes must be cited, hypotheses must reference specific data points, no generic filler.
{
"summary": "Retry storm affecting 42% of playback users...",
"details": ["retry_storm in playback: 2/12 users. Errors: buffer_timeout (2), drm_license_failed (3)"],
"hypotheses": ["DRM licensing may be broken for IN region — 3 of 5 failures are drm_license_failed on mobile"],
"confidence_note": "These are hypotheses based on observed behavioral changes."
}
Detected Patterns
Velum detects seven behavioral anti-patterns across users. Detection is cross-instance — patterns are found across multiple flow instances per user for maximum accuracy.
Retry Storm
Users retrying the same action repeatedly after failures. Indicates broken UX, missing feedback, or backend errors.
Masked Failure
Users failing but eventually succeeding — hiding real friction. The success masks the underlying problem.
Silent Abandonment
Users explore but never attempt any action. They arrive, see, and leave without interacting.
Early Dropoff
Users bounce immediately after starting a flow. No attempt, no interaction — something repelled them.
Confusion Loop
Same event repeated 3+ times — users going in circles, unable to find what they need.
Bypass Behavior
Users skip expected steps in a flow. Indicates confusing UX or users finding shortcuts around intended journeys.
Funnel Dropoff
Significant user loss between defined funnel steps. Indicates revenue-impacting friction in conversion flows.
Severity & Significance
Pattern severity is weighted by pattern type and flow intent:
| Pattern | Weight | Notes |
|---|---|---|
| Retry Storm | 1.0 | Most impactful |
| Masked Failure | 0.9 | Hidden friction |
| Funnel Dropoff | 0.8 | Revenue impact |
| Silent Abandonment | 0.7 | Lost engagement |
| Early Dropoff | 0.6 | May be expected |
| Confusion Loop | 0.5 | UX friction |
| Bypass Behavior | 0.4 | Least impactful |
Transactional flows get a 1.5× multiplier; browse flows get 0.7×.
Baseline significance is capped at low when affected users are below min_affected_users (default: 5). The pattern is still reported with low_volume: true so dashboards can filter or display it, but it won't trigger high-priority alerts on statistically thin data.
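The weighting rule can be sketched directly from the table and multipliers above (helper names are illustrative):

```go
package main

import "fmt"

// patternWeight mirrors the severity table above.
var patternWeight = map[string]float64{
	"retry_storm":        1.0,
	"masked_failure":     0.9,
	"funnel_dropoff":     0.8,
	"silent_abandonment": 0.7,
	"early_dropoff":      0.6,
	"confusion_loop":     0.5,
	"bypass_behavior":    0.4,
}

// severity scales the base weight by the flow-intent multiplier:
// 1.5x for transact, 0.7x for browse, 1.0x for unknown intent.
func severity(pattern, intent string) float64 {
	mult := 1.0
	switch intent {
	case "transact":
		mult = 1.5
	case "browse":
		mult = 0.7
	}
	return patternWeight[pattern] * mult
}

func main() {
	fmt.Println(severity("retry_storm", "transact")) // 1.5
	fmt.Println(severity("bypass_behavior", "unknown")) // 0.4
}
```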
API Reference
Endpoints
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /health | None | Health check (DB connectivity) |
| POST | /api/v1/analyze | X-Infra-Key | Analyze events against stored baselines (read-only, no baseline writes) |
| POST | /api/v1/baseline | X-Infra-Key | Ingest events and store baseline snapshots (optional — enables trend comparison) |
/api/v1/analyze is the core endpoint — it detects all patterns in your event batch and works standalone. /api/v1/baseline is optional — feed it historical data (on a schedule or as a one-time backfill) so that /analyze can compare current patterns against past trends and report whether things are getting better or worse.
Headers
| Header | Required | Description |
|---|---|---|
| X-Project-ID | Always | Project identifier (1–64 chars, alphanumeric/hyphens/underscores). Scopes storage. |
| X-Infra-Key | When security.enabled: true | Raw API key (server compares SHA-256 hash) |
Request Body
{
"events": [
{
"event": "checkout_payment_click", // required — event name
"ts": 1707500000000, // required — epoch milliseconds
"user_id": "usr-123", // required — user identifier
"session_id": "sess-abc", // optional — session identifier
"device": "mobile", // auto-classified as dimension
"error_code": "card_declined" // auto-classified as condition
}
]
}
Any additional properties beyond event, ts, user_id, and session_id are automatically classified into roles: dimension (device, country), target (product_id), condition (error_code), or measure (cart_value).
Response Shape
{
"success": true,
"message": "Behavioral analysis complete",
"request_id": "...",
"data": {
"ai_analysis": {
"summary": "Booking flow shows 75% retry storm rate...",
"details": ["..."],
"hypotheses": ["Driver supply insufficient for long-distance rides..."],
"confidence_note": "Hypotheses based on observed behavioral changes."
}
}
}
When AI is disabled, data.patterns is returned instead of data.ai_analysis, containing the raw detected patterns array.
Configuration
Config is loaded from config.yaml → config.yml → /etc/velum/config.yaml (first found wins), then overridden by VELUM_* environment variables.
Minimal Config
server:
port: "8080"
environment: "development"
storage:
type: "postgres"
postgres:
host: "localhost"
port: 5432
database: "velum"
user: "velum_user"
password: "your_password"
security:
enabled: false
Config Sections
| Section | Purpose |
|---|---|
| server | Port, host, environment, timeouts |
| storage | PostgreSQL connection details, retention days |
| security | API key auth toggle + SHA-256 hash of key |
| cors | Allowed origins, methods, headers |
| resiliency | Rate limit (req/s), circuit breaker settings |
| baseline | Window days, min days, computation mode, trend thresholds |
| ai_analyzer | Enable + provider + API key + model for Layer 7 |
| vocab_agent | Enable + provider + API key + model for Layer 1 |
| context_agent | Enable + provider + API key + model for Layer 0 |
| data_mapping | Declarative field mapping for custom schemas |
Security
security:
enabled: true
api_key_hash: "<sha256-hash-of-your-key>"
Generate a hash:
$ printf "my-secret-key" | shasum -a 256
Then pass X-Infra-Key: my-secret-key on every request. Velum compares using constant-time SHA-256 comparison.
AI Features
All three AI layers support any OpenAI-compatible API — Groq (default), OpenAI, Together, Mistral, Fireworks, and more. The API URL is auto-resolved from the provider name.
Baseline Detection
baseline:
window_days: 28 # Days of history for baseline computation
min_days: 7 # Minimum days before baseline is valid
min_affected_users: 5 # Below this, significance is capped at "low"
computation_mode: "daily" # "daily" (cached) or "always" (per-request)
trend_threshold: 0.10 # 10% delta to flag increasing/decreasing
high_significance_threshold: 0.15 # 15% delta for high significance
std_deviation_multiplier: 2.0 # Multiplier for std-based significance
min_affected_users prevents low-volume patterns (e.g., 1 user with 100% impact ratio) from being flagged as high significance. Set to 1 to disable the guard.
How Baseline Works
Every analysis request:
- Detects patterns in the current batch (stateless, works for any time window)
- Compares each pattern's impact ratio against the stored historical average (last window_days)
- Stores the current snapshot via upsert — keyed on (date, pattern_type, flow, context_key)
| Baseline Status | Condition | Behavior |
|---|---|---|
| first_observation | 0 historical snapshots | Stores snapshot, returns unknown trend |
| insufficient_data | 1–6 days of history | Stores snapshot, returns unknown trend |
| sufficient | ≥7 days of history | Computes avg + stddev, returns trend + significance |
| out_of_window | Data older than 28 days | Skips storage and comparison entirely |
Trend is classified by delta percentage: ≥10% increase → increasing, ≥10% decrease → decreasing, otherwise stable.
Significance uses standard deviation when available (delta ≥ 2×stddev → high), falls back to absolute threshold (delta ≥ 0.15 → high). Capped at low when affected users < min_affected_users.
Data Ingestion Guidelines
- Consistent windows: For meaningful baseline comparisons, send the same time window each ingestion (e.g., always a full day). Inconsistent window sizes produce different denominators, making ratio comparisons noisy.
- No overlap: Avoid sending overlapping event batches for the same day. The last batch overwrites the snapshot (upsert), so overlapping batches cause the stored ratio to reflect only the last batch.
- Re-processing: Sending the same complete batch again is safe — the upsert overwrites with identical values.
- Ad-hoc analysis: The /api/v1/analyze endpoint never writes to baseline history, so investigative queries with non-standard windows are always safe.
Retention & Cleanup
Snapshots are auto-deleted after retention_days (default: 90 days). A background goroutine runs cleanup on startup and every 24 hours.
storage:
retention_days: 90 # Snapshots older than this are deleted
| Time Boundary | Default | Purpose |
|---|---|---|
| baseline.window_days | 28 days | How far back to look for comparison |
| baseline.min_days | 7 days | Minimum history before comparison is valid |
| storage.retention_days | 90 days | When data is permanently deleted |
The 62-day gap between window_days and retention_days means historical snapshots are preserved in case you widen the baseline window later.
ai_analyzer:
enabled: true
provider: "groq" # URL auto-resolved (default if omitted)
api_key: "gsk_..." # or env: VELUM_AI_API_KEY
model: "llama-3.1-8b-instant"
ai_analyzer:
enabled: true
provider: "openai" # URL auto-resolved
api_key: "sk-..." # or env: VELUM_AI_API_KEY
model: "gpt-4o-mini"
ai_analyzer:
enabled: true
provider: "custom"
base_url: "http://localhost:11434/v1/chat/completions"
model: "llama3"
The same provider, api_key, model, and base_url fields are available on all three agents (ai_analyzer, vocab_agent, context_agent). The base_url field is only needed for custom endpoints. If provider is omitted, it defaults to "groq".
Data Mapping
If your events use a non-standard schema, map fields declaratively:
data_mapping:
enabled: true
mapping:
event:
paths: ["payload.event.action", "event_name"]
required: true
ts:
paths: ["meta.time", "timestamp"]
format: "epoch_ms"
required: true
user_id:
paths: ["context.user.id", "user_id"]
required: true
session_id:
paths: ["context.session.id", "session_id"]
Supports dot-notation paths with fallback order. Extra properties pass through automatically.
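The fallback resolution can be sketched as a walk over each path in order against a decoded JSON object (a simplified model; the real resolver may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// lookupPath tries each dot-notation path in order against a decoded JSON
// object and returns the first value found, mirroring the fallback order
// in the mapping config above.
func lookupPath(event map[string]any, paths []string) (any, bool) {
	for _, p := range paths {
		var cur any = event
		found := true
		for _, seg := range strings.Split(p, ".") {
			m, ok := cur.(map[string]any)
			if !ok {
				found = false
				break
			}
			v, ok := m[seg]
			if !ok {
				found = false
				break
			}
			cur = v
		}
		if found {
			return cur, true
		}
	}
	return nil, false
}

func main() {
	ev := map[string]any{
		"payload": map[string]any{"event": map[string]any{"action": "checkout"}},
	}
	v, ok := lookupPath(ev, []string{"payload.event.action", "event_name"})
	fmt.Println(v, ok) // checkout true
}
```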
Environment Variables
| Variable | Overrides |
|---|---|
| VELUM_PORT | server.port |
| VELUM_ENV | server.environment |
| VELUM_DB_HOST | storage.postgres.host |
| VELUM_DB_USER | storage.postgres.user |
| VELUM_DB_PASSWORD | storage.postgres.password |
| VELUM_DB_NAME | storage.postgres.database |
| VELUM_DB_PORT | storage.postgres.port |
| VELUM_DB_SSL_MODE | storage.postgres.ssl_mode |
| VELUM_AI_API_KEY | ai_analyzer.api_key |
| VELUM_VOCAB_AGENT_API_KEY | vocab_agent.api_key |
| VELUM_CONTEXT_AGENT_API_KEY | context_agent.api_key |
| VELUM_API_KEY_HASH | security.api_key_hash |
Full Example: Ride-Hailing App
Here's a complete walkthrough showing what each pipeline layer does with a real ride-hailing scenario.
Input Events
{
"events": [
{"event":"app_opened","ts":1708700000000,"user_id":"u1","session_id":"s1","device":"mobile","country":"IN"},
{"event":"booking_requested","ts":1708700045000,"user_id":"u1","session_id":"s1","fare":790},
{"event":"driver_assigned","ts":1708700060000,"user_id":"u1","session_id":"s1","driver_id":"d01"},
{"event":"driver_cancelled","ts":1708700090000,"user_id":"u1","session_id":"s1","cancel_reason":"too_far"},
{"event":"booking_requested","ts":1708700095000,"user_id":"u1","session_id":"s1","fare":790},
{"event":"driver_assigned","ts":1708700110000,"user_id":"u1","session_id":"s1","driver_id":"d02"},
{"event":"ride_started","ts":1708700180000,"user_id":"u1","session_id":"s1"},
{"event":"ride_completed","ts":1708701000000,"user_id":"u1","session_id":"s1"}
]
}
Layer-by-Layer Trace
Layer 0 (Context Enricher):
Schema detected: event="event", timestamp="ts", user_id="user_id"
Layer 2 (Event Adapter):
"booking_requested" → {Flow:"booking", Status:"request"}
"driver_assigned" → {Flow:"driver", Status:"assigned"}
"driver_cancelled" → {Flow:"driver", Status:"cancelled"}
"ride_completed" → {Flow:"ride", Status:"completed"}
Layer 3 (Session Flow):
u1 → Flow 1: booking [booking_requested, driver_*, booking_requested, driver_*]
↑ driver events folded in, retry cycle merged
Flow 2: ride [ride_started, ride_completed]
Layer 4 (Behavior):
booking: [attempt, retry, progress]
ride: [attempt, succeed]
Layer 5 (Patterns — across all users):
retry_storm on booking: 3/4 users (75%) → DETECTED
masked_failure on booking: 2/4 users → DETECTED
Layer 6 (Baseline):
retry_storm: FIRST OBSERVATION → establishing baseline
masked_failure: FIRST OBSERVATION → establishing baseline
Layer 7 (AI):
"75% retry storm in booking — driver_cancelled with cancel_reason=too_far
in 2 of 3 cases. Driver supply may be insufficient for long-distance rides."
Multi-Tenancy
Each X-Project-ID gets isolated storage:
- Pattern baselines are stored in per-project tables (pattern_snapshots_{project_id})
- Vocabulary and property registry are shared across projects
- A background goroutine runs daily cleanup of snapshots older than retention_days
Database Tables
| Table | Scope | Purpose |
|---|---|---|
| vocabulary | Shared | AI-learned word classifications |
| property_registry | Shared | AI-learned property classifications |
| pattern_snapshots_{id} | Per-project | Historical pattern observations for baselines |
Resiliency
Velum is built for production reliability with multiple layers of protection:
| Feature | Details |
|---|---|
| Circuit Breaker | Wraps all LLM calls. After N failures, circuit opens — AI layers degrade gracefully until reset timeout. |
| Rate Limiting | Configurable req/s via go-chi/httprate. Returns HTTP 429 when exceeded. |
| Body Limit | 10 MB max request body to prevent OOM attacks. |
| Graceful Shutdown | Catches SIGTERM/SIGINT, waits for active requests to finish (configurable timeout). |
| Background Cleanup | Daily goroutine deletes old snapshots per retention policy. Cancellable on shutdown. |
| Health Check | /health endpoint checks DB connectivity. Returns "degraded" if unreachable. |
Tech Stack
External Dependencies
| Dependency | Purpose |
|---|---|
| github.com/go-chi/chi/v5 | HTTP router |
| github.com/go-chi/cors | CORS middleware |
| github.com/go-chi/httprate | Rate limiting middleware |
Build & Test
$ go test ./... # All tests
$ go test ./... -v # Verbose
$ go test ./... -cover # With coverage
$ go build -o velum cmd/velum/main.go # Build binary