Last updated · April 2026
SLIs, SLOs & Review Workflows
Definitions, targets, and operational procedures for the DataVibe AI interception platform. Operators and reviewers should read this before handling the approval queue in a production environment.
Service Level Indicators (SLIs)
SLIs are the raw measurements that feed the SLO calculations. The slo-watchdog cron recomputes each SLI every 15 minutes, and the results appear on the system health page.
| SLI | What we measure | Unit | Source |
|---|---|---|---|
| Review latency p50 | Median time from submission to reviewer decision | minutes | gate_submissions.reviewed_at − created_at |
| Review latency p95 | 95th-percentile review time — worst-of-normal latency | minutes | gate_submissions (percentile) |
| SLA compliance rate | % of reviewed submissions resolved before breach threshold | % | gate_submissions.sla_breached_at IS NULL |
| SLA breach count | Submissions that exceeded the breach threshold | count | gate_submissions.sla_breached_at IS NOT NULL |
| Escalation rate | % of reviews escalated (approaching SLA + autoEscalate=true) | % | gate_submissions.escalated_at IS NOT NULL |
| Block rate | % of submissions blocked by policy (hard BLOCK rule) | % | gate_submissions.status = BLOCKED |
| Queue depth | Total QUEUED submissions at time of measurement | count | gate_submissions.status = QUEUED |
| Oldest queue age | Age of the oldest QUEUED submission | seconds | NOW() − MIN(created_at) WHERE status = QUEUED |
| False-positive proxy | % of WARN-only submissions that were ultimately approved | % | approved / queued WHERE no BLOCK violations |
| Audit chain integrity | Whether the tamper-evident hash chain is unbroken | boolean | GET /api/audit/verify-chain |
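The latency and compliance rows in the table above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the `Submission` class is a hypothetical in-memory stand-in for a gate_submissions row, and the nearest-rank percentile method is an assumption (the real watchdog may compute percentiles in SQL or interpolate).

```python
import math
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Submission:
    """Hypothetical in-memory stand-in for a gate_submissions row."""
    created_at: datetime
    reviewed_at: Optional[datetime] = None
    sla_breached_at: Optional[datetime] = None


def percentile(values, pct):
    """Nearest-rank percentile (an assumption; the platform may interpolate)."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def review_slis(rows):
    """Compute latency and compliance SLIs from a list of submissions."""
    reviewed = [r for r in rows if r.reviewed_at is not None]
    # Review latency in minutes: reviewed_at minus created_at.
    latencies = [
        (r.reviewed_at - r.created_at).total_seconds() / 60 for r in reviewed
    ]
    # A reviewed submission is compliant when sla_breached_at is NULL.
    compliant = sum(1 for r in reviewed if r.sla_breached_at is None)
    return {
        "p50_minutes": percentile(latencies, 50),
        "p95_minutes": percentile(latencies, 95),
        "compliance_pct": 100 * compliant / len(reviewed) if reviewed else None,
        "breach_count": sum(1 for r in rows if r.sla_breached_at is not None),
    }
```

Submissions that are still QUEUED have no reviewed_at, so they are excluded from the latency and compliance figures but still counted in the breach count if they have already crossed the threshold.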
Service Level Objectives (SLOs)
Default SLO targets are listed below. Workspaces can override the warn and breach thresholds via Settings → SLA Configuration or via PUT /api/workspaces/:slug/sla.
| SLO | Default target | Configurable? | Alert when |
|---|---|---|---|
| Review latency — warn | ≤ 10 min p50 | Yes (warnMinutes) | Any submission exceeds this in the QUEUED state |
| Review latency — breach | ≤ 30 min p95 | Yes (breachMinutes) | Any submission exceeds this → sla_breached_at set |
| SLA compliance rate (7d) | ≥ 90% | No (improvement target) | Drops below 90% on the /sla dashboard |
| Queue depth | < 100 QUEUED | Via QUEUE_BACKUP_THRESHOLD env | Threshold crossed → banner in queue UI |
| Audit chain integrity | 100% (no breaks) | No | GET /api/audit/verify-chain returns ok: false |
| DLQ depth | 0 queued items | No | > 10 items → slo.dlq.depth_high system event |
| Core API availability | 99.5% / 30d | No | Non-200 health response → slo.core_api.unhealthy event |
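To make the targets above concrete, the sketch below converts an availability target into a downtime budget and checks a snapshot of SLI values against the default thresholds. The `snapshot` dictionary and its key names are hypothetical, chosen only for illustration; the threshold values come from the table.

```python
def downtime_budget_minutes(target_pct, window_days):
    """Minutes of unavailability that a target permits over the window."""
    return window_days * 24 * 60 * (100 - target_pct) / 100


def slo_violations(snapshot, queue_threshold=100):
    """Return the names of default SLOs that the snapshot violates.

    `snapshot` is a hypothetical dict of current SLI values;
    `queue_threshold` mirrors QUEUE_BACKUP_THRESHOLD.
    """
    violations = []
    if snapshot["compliance_pct_7d"] < 90:
        violations.append("sla_compliance_rate")
    # The target is "< 100 QUEUED", so reaching the threshold counts.
    if snapshot["queue_depth"] >= queue_threshold:
        violations.append("queue_depth")
    if not snapshot["audit_chain_ok"]:
        violations.append("audit_chain_integrity")
    # The table alerts on DLQ depth only above 10 items.
    if snapshot["dlq_depth"] > 10:
        violations.append("dlq_depth")
    return violations
```

With the default 99.5% target over 30 days, `downtime_budget_minutes(99.5, 30)` works out to 216 minutes, or 3.6 hours of permitted unavailability per window.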
Review workflow
Every AI-generated action that triggers a WARN or BLOCK policy rule is routed to the human approval queue before dispatch. The workflow is:
- Submission queued. The edge gateway (or Core API) writes a gate_submissions row with status = QUEUED and notifies reviewers via SSE pubsub + optional Slack/Teams channel.
- SLA clock starts. The slo-watchdog cron tracks age against the workspace SLA config. At warn threshold: yellow badge. At breach threshold: sla_breached_at is set, a system event is emitted, and an optional escalation notification is sent.
- Reviewer action. A REVIEWER, ADMIN, or OWNER opens the queue item, reads the policy violations, and approves or rejects with an optional note.
- Dispatch or closure. Approval fires the downstream dispatch (email provider, webhook, etc.) via the reliability layer. Rejection closes the item with no outbound action.
- Audit record. Every decision is appended to the tamper-evident audit chain with reviewed_by, reviewed_at, and the decision. Verifiable at GET /api/audit/verify-chain.
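The SLA clock step can be sketched as a small classification function. This is an illustrative sketch under stated assumptions: a submission is represented as a plain dict (a hypothetical shape), the thresholds default to the documented 10 and 30 minutes, and "exceeds" is read as strictly greater than the threshold.

```python
from datetime import datetime


def sla_state(created_at, now, warn_minutes=10, breach_minutes=30):
    """Classify a QUEUED submission's age as 'ok', 'warn', or 'breach'."""
    age_minutes = (now - created_at).total_seconds() / 60
    if age_minutes > breach_minutes:
        return "breach"
    if age_minutes > warn_minutes:
        return "warn"
    return "ok"


def watchdog_tick(submission, now, warn_minutes=10, breach_minutes=30):
    """One watchdog pass over a QUEUED submission (hypothetical dict shape).

    Sets sla_breached_at the first time the breach threshold is crossed
    and leaves it untouched on later passes.
    """
    state = sla_state(submission["created_at"], now, warn_minutes, breach_minutes)
    if state == "breach" and submission.get("sla_breached_at") is None:
        submission["sla_breached_at"] = now
    return state
```

Recording sla_breached_at only once keeps the breach timestamp stable across repeated cron runs, which matters because the SLA compliance SLI is derived from that column.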
Assigning SLAs and escalation targets
For high-stakes submissions, reviewers can set a hard deadline and an escalation contact:
# Set a 2-hour SLA on a specific submission with a named escalation contact
PATCH /api/gate/{submissionId}/sla
{
"dueAt": "2026-05-14T18:00:00.000Z",
"escalateTo": "user_compliance_lead",
"note": "Finance team deal — must clear before 6pm."
}
# Configure workspace-wide default SLA
PUT /api/workspaces/{slug}/sla
{
"warnMinutes": 10,
"breachMinutes": 30,
"autoEscalate": true,
"escalateTo": null // null = broadcast to all ADMINs
}
On-call runbook
When the slo-watchdog emits a slo.gate_queue.sla_breach system event:
- Go to /queue and sort by oldest first. Identify submissions with a red "SLA BREACH" badge.
- Triage: does the violation require legal review, or is it a clear pass? Reject with a note or approve.
- If queue depth > backup threshold: page the designated on-call reviewer (configured in workspace SLA → escalateTo). Do not batch-approve to clear the queue — each submission needs individual review.
- If the Core API is down (slo.core_api.unreachable): new submissions are blocked at the edge gateway. Existing QUEUED items can still be reviewed and approved from the dashboard. Check the Render dashboard for core-api status.
- After resolution: verify audit chain integrity at /audit → "Verify chain". Ensure the result shows ok: true.
Audit chain break runbook
- Call GET /api/audit/verify-chain?workspace=<slug>. The response includes firstBreakAt (the row ID and timestamp where the chain diverges).
- Identify the break: was it a manual DB edit, a failed write, or a race condition? Check systemEvent rows for kind = audit_log_write_failed near the break timestamp.
- Do NOT delete or modify the broken rows; this is an immutable log. Open an incident ticket and escalate to engineering.
- Communicate to affected workspace admins that the integrity break has been detected and is under investigation.
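For readers unfamiliar with how a tamper-evident chain detects a break, here is a generic sketch of the technique. It is not DataVibe's implementation: the SHA-256 construction, the row layout, and the helper names are all assumptions, and firstBreakAt is simplified to a row ID (the real response also carries a timestamp).

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first row


def entry_hash(prev_hash, payload):
    """Hash the previous row's hash together with this row's payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain, payload):
    """Append a row that commits to everything written before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "id": len(chain) + 1,
        "payload": payload,
        "prev_hash": prev,
        "hash": entry_hash(prev, payload),
    })


def verify_chain(chain):
    """Walk the chain in order and report the first row that fails."""
    prev = GENESIS
    for row in chain:
        expected = entry_hash(prev, row["payload"])
        if row["prev_hash"] != prev or row["hash"] != expected:
            return {"ok": False, "firstBreakAt": row["id"]}
        prev = row["hash"]
    return {"ok": True, "firstBreakAt": None}
```

Because each hash covers the previous one, editing any earlier row invalidates that row's stored hash, which is why a manual DB edit surfaces as a break at the edited row rather than silently passing.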
Reviewer efficiency signals
The SLA performance dashboard shows per-reviewer metrics. Three signals warrant follow-up:
- Rejection rate = 0% with > 10 decisions (rubber-stamping). A reviewer approving everything without rejections may not be reading submissions carefully. Review their queue history and consider spot-checking approved items.
- Rejection rate > 80% (over-blocking). The reviewer may have a misaligned understanding of policy intent. Schedule a calibration session with the policy owner.
- p95 decision time > 2× breach threshold. The reviewer is a bottleneck. Consider adding a second reviewer to the workspace or adjusting the escalation path.
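The three signals above reduce to simple threshold checks. The function below is a hedged sketch: its name, parameters, and flag labels are hypothetical, and it assumes per-reviewer counts and p95 decision time have already been aggregated elsewhere.

```python
def reviewer_flags(decisions, rejections, p95_minutes, breach_minutes=30):
    """Return follow-up flags for one reviewer's aggregated metrics."""
    flags = []
    if decisions == 0:
        return flags  # nothing to evaluate yet
    rejection_rate = rejections / decisions
    # Rubber-stamping: more than 10 decisions and not a single rejection.
    if decisions > 10 and rejections == 0:
        flags.append("rubber-stamping")
    # Over-blocking: more than 80% of decisions are rejections.
    if rejection_rate > 0.8:
        flags.append("over-blocking")
    # Bottleneck: p95 decision time above twice the breach threshold.
    if p95_minutes > 2 * breach_minutes:
        flags.append("bottleneck")
    return flags
```

A flag is a prompt for a human follow-up such as a spot check or calibration session, not an automatic judgment about the reviewer.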