DataSnipe API

What you can build

Three building blocks, one base URL.

Everything lives under https://datasnipe.app/api/v1. A workflow isn't a prompt you resend — it's a reusable extraction asset you define, then execute against documents as often as you like. Authenticate with a Bearer API key and the workflow and artifact endpoints are yours to automate.

API keys

Scoped, revocable keys you create in the dashboard. Grant read, write, or run access, attach a key to yourself or your organization, and rotate it whenever you need.

Workflow API

Create, read, update, and delete reusable extraction workflows — a field schema, a model, and an optional context prompt — then run a workflow against one or many uploaded files.

Artifacts API

Read the complete result of a run — every value with its citation — then download it collated as CSV or TSV, with control over how rows are grouped and a confidence cutoff for low-quality cells.

A workflow, not a prompt.

What a workflow buys you over sending a prompt per document:

	Sending a prompt	Running a workflow
Unit of work	A prompt you rewrite and resend per document	A typed workflow defined once, then reused
Output	Free text or JSON you parse and validate yourself	A declared schema rendered as typed columns — same shape every run
Provenance	Take the answer on faith	Every value carries its `confidence`, `sourceSnippet`, and `pageNumber`
Operation	Fire and hope	Observable jobs — chunk progress, token usage, and cost
Reliability	Re-run means re-charge	Idempotent runs — a duplicate submission replays, never double-charges

Quickstart

Zero to a CSV in four calls.

Create a key in the dashboard, then run the full loop — in your shell or in your language of choice. No SDK, just the HTTP client you already have.

End-to-end

# 0. Create a key at https://datasnipe.app/api-keys with the
#    workflows:read, workflows:write, and workflows:run scopes.
export DATASNIPE_API_KEY="dsk_…"

# 1. Create a workflow.
WORKFLOW_ID=$(curl -s https://datasnipe.app/api/v1/workflows \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "name": "Quickstart", "extractionSchema": [
        { "name": "title", "type": "string" } ] }' \
  | jq -r '.workflow.id')

# 2. Run it against your documents.
GROUP_ID=$(curl -s https://datasnipe.app/api/v1/workflows/$WORKFLOW_ID/runs \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" \
  -F "files=@paper-1.pdf" -F "files=@paper-2.pdf" \
  | jq -r '.groupId')

# 3. Poll every 3s until every job is done or failed.
until curl -s https://datasnipe.app/api/v1/job-groups/$GROUP_ID \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" \
  | jq -e '[.jobs[].status] | all(. == "done" or . == "failed")' >/dev/null; do
  sleep 3
done

# 4. Download the collated table.
curl -s "https://datasnipe.app/api/v1/job-groups/$GROUP_ID/artifact.csv" \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" -o results.csv

// Node.js 20+, run as an ES module (a .mjs file): uses top-level await and fs.openAsBlob.
import { openAsBlob } from "node:fs";

const BASE = "https://datasnipe.app/api/v1";
const headers = { Authorization: `Bearer ${process.env.DATASNIPE_API_KEY}` };

// 1. Create a workflow.
const { workflow } = await fetch(`${BASE}/workflows`, {
  method: "POST",
  headers: { ...headers, "Content-Type": "application/json" },
  body: JSON.stringify({
    name: "Quickstart",
    extractionSchema: [{ name: "title", type: "string" }],
  }),
}).then((r) => r.json());

// 2. Run it against a file.
const form = new FormData();
form.append("files", await openAsBlob("paper-1.pdf"), "paper-1.pdf");
const { groupId } = await fetch(`${BASE}/workflows/${workflow.id}/runs`, {
  method: "POST",
  headers,
  body: form,
}).then((r) => r.json());

// 3. Poll every 3s until every job is terminal.
let group;
do {
  await new Promise((r) => setTimeout(r, 3000));
  group = await fetch(`${BASE}/job-groups/${groupId}`, { headers }).then((r) => r.json());
} while (group.jobs.some((j) => j.status !== "done" && j.status !== "failed"));

// 4. Download the collated CSV.
const csv = await fetch(`${BASE}/job-groups/${groupId}/artifact.csv`, { headers })
  .then((r) => r.text());

# Python 3 with the third-party requests package (pip install requests).
import os
import time

import requests

BASE = "https://datasnipe.app/api/v1"
headers = {"Authorization": f"Bearer {os.environ['DATASNIPE_API_KEY']}"}

# 1. Create a workflow.
workflow = requests.post(
    f"{BASE}/workflows",
    headers=headers,
    json={"name": "Quickstart",
          "extractionSchema": [{"name": "title", "type": "string"}]},
).json()["workflow"]

# 2. Run it against a file.
with open("paper-1.pdf", "rb") as f:
    group_id = requests.post(
        f"{BASE}/workflows/{workflow['id']}/runs",
        headers=headers,
        files=[("files", ("paper-1.pdf", f, "application/pdf"))],
    ).json()["groupId"]

# 3. Poll every 3s until every job is terminal.
while True:
    time.sleep(3)
    group = requests.get(f"{BASE}/job-groups/{group_id}", headers=headers).json()
    if all(j["status"] in ("done", "failed") for j in group["jobs"]):
        break

# 4. Download the collated CSV.
csv = requests.get(f"{BASE}/job-groups/{group_id}/artifact.csv", headers=headers).text


Authentication

Bearer keys, scoped on purpose.

Create and manage keys from the protected API keys page in your dashboard. This public reference covers only the wire contract: send the key as a Bearer token, and give it the scopes your integration uses.

Send the key on every request

API keys are prefixed with dsk_. Store the secret outside your codebase and pass it in the Authorization header.

HTTP header

Authorization: Bearer dsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
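A minimal sketch of loading the key from the environment and building that header. The helper name `auth_headers` is hypothetical (there is no SDK); the `DATASNIPE_API_KEY` variable matches the Quickstart, and the prefix check only mirrors the `dsk_` convention stated above.

```python
from __future__ import annotations

import os
from typing import Optional


def auth_headers(api_key: Optional[str] = None) -> dict:
    """Build the Authorization header for a DataSnipe request.

    Falls back to the DATASNIPE_API_KEY environment variable when no key is
    passed. `auth_headers` is a hypothetical helper, not an official API.
    """
    key = api_key if api_key is not None else os.environ.get("DATASNIPE_API_KEY", "")
    if not key.startswith("dsk_"):
        # Catch an empty or mistyped secret before it turns into a 401.
        raise ValueError("expected an API key starting with 'dsk_'")
    return {"Authorization": f"Bearer {key}"}
```

Keeping the check local means a missing environment variable fails with a clear message instead of a 401 from the server.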

Workflow API

Define once, run anywhere.

A workflow bundles an extraction schema, a model, and an optional context prompt under a name. Manage workflows with these endpoints.

GET /api/v1/workflows List your workflows · workflows:read

POST /api/v1/workflows Create a workflow · workflows:write

GET /api/v1/workflows/:id Fetch one workflow · workflows:read

DELETE /api/v1/workflows/:id Delete a workflow · workflows:write

POST /api/v1/workflows/:id/runs Run a saved workflow against uploaded files · workflows:run

POST /api/v1/runs Run a one-off config — no saved workflow · workflows:run

Create a workflow

Send a JSON body with a name and an inline extraction schema. Alternatively, omit the inline schema and pass a sourceGroupId to clone the schema, model, and prompt from a previous run.

Body

Field	Type	Required	Description
name	string	required	1–120 characters. Must be unique within the owner.
extractionSchema	Field[]	required*	At least one field. *Required unless `sourceGroupId` is given.
model	string	optional	Defaults to `claude-sonnet-5`. See models below.
contextPrompt	string	optional	Extra context passed to the extraction phase.
fewShotExamples	Example[]	optional	Each `{ input, output }` pair guides the model.
ownerType	string	optional	`user` or `organization`. Must match the key's owner.
sourceGroupId	string	optional	Clone metadata from a prior job group instead of an inline schema.

Extraction field

Fields are explicit: a name, a type of string, number, boolean, date, or list, and an optional description. A list field also needs itemFields — the scalar columns of each item.

POST /api/v1/workflows

curl https://datasnipe.app/api/v1/workflows \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Clinical trial extractor",
    "model": "claude-sonnet-5",
    "extractionSchema": [
      { "name": "title", "type": "string", "description": "Paper title" },
      { "name": "sample_size", "type": "number", "description": "Participants enrolled" },
      { "name": "double_blind", "type": "boolean" },
      {
        "name": "arms",
        "type": "list",
        "description": "Each treatment arm",
        "itemFields": [
          { "name": "label", "type": "string" },
          { "name": "dose_mg", "type": "number" }
        ]
      }
    ]
  }'

Responds 201 with { "workflow": { … } }. The workflow object includes id, name, ownerType, ownerId, extractionSchema, model, contextPrompt, fewShotExamples, createdAt, and updatedAt.
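The body and field rules above can be checked client-side before posting, which saves a round trip on an obvious 400. The sketch below encodes only the rules stated in this reference (name length, at least one field unless `sourceGroupId` is given, the five field types, scalar `itemFields` on a list). `validate_workflow_body` is a hypothetical helper, and the server may enforce more than this.

```python
from __future__ import annotations

SCALAR_TYPES = {"string", "number", "boolean", "date"}


def validate_workflow_body(body: dict) -> list:
    """Return a list of problems; an empty list means the body looks valid.

    Mirrors only the rules stated in the reference. The server remains the
    source of truth.
    """
    problems = []
    name = body.get("name")
    if not isinstance(name, str) or not 1 <= len(name) <= 120:
        problems.append("name must be a string of 1-120 characters")

    schema = body.get("extractionSchema")
    if schema is None:
        if not body.get("sourceGroupId"):
            problems.append("extractionSchema is required unless sourceGroupId is given")
        return problems
    if not schema:
        problems.append("extractionSchema needs at least one field")

    for field in schema:
        label = field.get("name") or "<unnamed>"
        if not field.get("name"):
            problems.append("every field needs a name")
        ftype = field.get("type")
        if ftype == "list":
            items = field.get("itemFields") or []
            if not items:
                problems.append(f"{label}: a list field needs itemFields")
            for item in items:
                if item.get("type") not in SCALAR_TYPES:
                    problems.append(f"{label}.{item.get('name')}: item fields must be scalar")
        elif ftype not in SCALAR_TYPES:
            problems.append(f"{label}: unknown type {ftype!r}")
    return problems
```

Name uniqueness cannot be checked locally; that still surfaces as a 409 from the API.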

Run a workflow

Runs are multipart/form-data: attach one or more files parts. DataSnipe processes the bytes in memory, persists them for later re-processing, and returns a group id plus a job per file.

POST /api/v1/workflows/:id/runs

curl https://datasnipe.app/api/v1/workflows/$WORKFLOW_ID/runs \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" \
  -F "files=@paper-1.pdf" \
  -F "files=@paper-2.pdf"

202 Accepted

{
  "workflowId": "0b1c…",
  "groupId": "9f2a…",
  "jobs": [
    { "jobId": "a1…", "fileName": "paper-1.pdf" },
    { "jobId": "b2…", "fileName": "paper-2.pdf" }
  ]
}

Hold on to groupId — it's how you poll the run and download artifacts.

One-off runs

To extract without saving a workflow, post to /api/v1/runs with a JSON config part alongside the files parts. The config takes the same extractionSchema, model, contextPrompt, and fewShotExamples fields as a workflow. The response is a groupId and jobs — but no workflowId, since nothing is persisted.

POST /api/v1/runs

curl https://datasnipe.app/api/v1/runs \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" \
  -F 'config={
        "extractionSchema": [ { "name": "title", "type": "string" } ],
        "model": "claude-sonnet-5"
      };type=application/json' \
  -F "files=@paper-1.pdf"

Poll the returned groupId and download artifacts with the same job-group endpoints used for saved runs.
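For a Python client, the multipart body can be assembled as a list of parts. This is a sketch that assumes the third-party `requests` package sends the request: its `files=` argument accepts a `(None, text, content_type)` tuple for a plain form field with an explicit content type, which plays the role of the curl `;type=application/json` suffix. `build_run_parts` is a hypothetical helper name.

```python
from __future__ import annotations

import json


def build_run_parts(config: dict, files: dict) -> list:
    """Build the multipart parts for POST /api/v1/runs.

    `files` maps a file name to its bytes. The result is shaped for the
    `files=` argument of `requests.post`.
    """
    parts = [("config", (None, json.dumps(config), "application/json"))]
    for file_name, data in files.items():
        # No content type is declared for files; the reference says
        # formats are sniffed from content, not the declared type.
        parts.append(("files", (file_name, data)))
    return parts
```

A call would then look like `requests.post(f"{BASE}/runs", headers=headers, files=build_run_parts(config, {"paper-1.pdf": pdf_bytes}))`, with `BASE` and `headers` as in the Quickstart.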

Duplicate run protection

Both run endpoints are billable. DataSnipe automatically protects against accidental duplicate submissions by fingerprinting the billing account, run configuration, and file metadata.

POST /api/v1/runs

curl https://datasnipe.app/api/v1/runs \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" \
  -F 'config={ ... };type=application/json' \
  -F "files=@paper-1.pdf"

Same request body → replays the original 202 (same groupId), no second submission.
Different body → creates a new run.
Retry while the first request is still running → 409 idempotency_conflict; wait and retry.
If the first attempt failed before anything was billed (400/402/429), the duplicate-protection claim is released so a corrected retry can proceed.

Duplicate-protection claims are scoped to the billing account and expire after about 10 minutes.
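A sketch of how a client might lean on these rules when a submission times out and the outcome is unknown: resend the identical request, treat 409 as "still in progress", and accept the replayed 202. The `submit` callable, the attempt count, and the 5-second wait are all assumptions for illustration; the reference only says to wait and retry.

```python
from __future__ import annotations

import time
from typing import Callable


def submit_with_replay(
    submit: Callable,
    attempts: int = 5,
    wait_seconds: float = 5.0,
    sleep: Callable = time.sleep,
) -> dict:
    """Resend an identical run request until it yields a 202 body.

    `submit` is a caller-supplied function that sends the same request each
    time and returns (status_code, json_body). A 409 means the original is
    still being processed, so wait and resend; the replayed 202 carries the
    original groupId.
    """
    for attempt in range(attempts):
        status, body = submit()
        if status == 202:
            return body
        if status != 409:
            # 400/402/429 need a corrected request or a pause chosen by the
            # caller; resending the same bytes would not help.
            raise RuntimeError(f"run rejected with {status}: {body}")
        if attempt < attempts - 1:
            sleep(wait_seconds)
    raise TimeoutError("original submission is still in progress")
```

Because the fingerprint covers the run configuration and file metadata, the retry has to send the same config and the same files for the replay to apply.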

Models

Pass a model id as model. The default is claude-sonnet-5. The ids your account can use are listed in the workflow builder in your dashboard, which always reflects what is currently available to you. Pick one from there rather than from a hardcoded list, which drifts as models are added or retired.

Artifacts API

Read the result, download the table.

One endpoint returns the complete output of a run — the group's metadata and, per file, every extracted value with its citation. Because a run is asynchronous, you call it until every job is done or failed (that's all polling is), then optionally download the collated CSV/TSV.

GET /api/v1/job-groups/:groupId The complete group result — status & every extraction · workflows:read

GET /api/v1/job-groups/:groupId/artifact.csv Download collated CSV · workflows:read

GET /api/v1/job-groups/:groupId/artifact.tsv Download collated TSV · workflows:read

The group result

This is the canonical output of a run. It returns the group's metadata and, for each file, the full set of extractions — every value with its confidence and citation — alongside token usage and cost. While a run is still working the same endpoint reports progress: each job moves through queued → ready → summarizing → extracting → done, or ends as failed with an errorReason. Call it until every job is done or failed.

GET /api/v1/job-groups/:groupId

curl https://datasnipe.app/api/v1/job-groups/$GROUP_ID \
  -H "Authorization: Bearer $DATASNIPE_API_KEY"

200 OK

{
  "groupId": "9f2a…",
  "model": "claude-sonnet-5",
  "contextPrompt": null,
  "createdAt": "2026-01-04 10:32:00",
  "extractionSchema": [ /* the fields you ran */ ],
  "jobs": [
    {
      "jobId": "a1…",
      "fileName": "paper-1.pdf",
      "status": "done",
      "errorReason": null,
      "createdAt": "2026-01-04 10:32:00",
      "chunks": { "total": 12, "done": 12, "failed": 0 },
      "extractionMode": "text",
      "usage": { "inputTokens": 18452, "outputTokens": 1203, "costUsd": 0.0712 },
      "extractions": [
        {
          "id": "ext_3f…",
          "fieldName": "title",
          "value": "A randomized trial of …",
          "confidence": 0.98,
          "sourceSnippet": "A randomized, double-blind trial of …",
          "pageNumber": 1
        },
        {
          "id": "ext_9d…",
          "fieldName": "arms",
          "value": "[{\"label\":\"Treatment\",\"dose_mg\":50},{\"label\":\"Placebo\",\"dose_mg\":0}]",
          "confidence": 0.86,
          "sourceSnippet": "assigned to a 50 mg treatment arm or placebo",
          "pageNumber": 4
        }
      ],
      "summarySelections": [
        { "jobId": "a1…", "fieldName": "title", "pageNumber": 1, "extractionId": "ext_3f…" }
      ]
    }
  ]
}

A job is finished when its status is done or failed. usage reports per-job token counts and costUsd (rounded to four decimals), and chunks tracks page-chunk progress while a job runs. extractionMode reports how the document was read, which the backend decides, not you: text when its text layer is parsed directly, ocr when scanned text is recognized, or vision when pages are processed as images. A job that spans more than one mode reports mixed, and extractionMode is null until the first page finishes.
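As an illustration of reading these fields, the sketch below condenses a group response into a finished flag, the failed files, chunk progress, and total cost. `summarize_group` is a hypothetical helper. It also assumes `chunks` or `usage` might be missing or null on jobs that have not started, which the reference does not state, so both are handled defensively.

```python
from __future__ import annotations

TERMINAL = {"done", "failed"}


def summarize_group(group: dict) -> dict:
    """Condense a job-group response into progress and cost figures."""
    jobs = group["jobs"]
    chunks_total = sum((j.get("chunks") or {}).get("total", 0) for j in jobs)
    chunks_done = sum((j.get("chunks") or {}).get("done", 0) for j in jobs)
    cost = sum((j.get("usage") or {}).get("costUsd", 0.0) for j in jobs)
    return {
        "finished": all(j["status"] in TERMINAL for j in jobs),
        "failed": {j["fileName"]: j.get("errorReason") for j in jobs if j["status"] == "failed"},
        "chunks": f"{chunks_done}/{chunks_total}",
        # Per-job costs are already rounded to four decimals; round the sum
        # the same way to avoid float noise.
        "costUsd": round(cost, 4),
    }
```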

Every extraction carries its citation

Each entry in a job's extractions array is a single extracted value with the provenance behind it. This is where confidence and source location live — the CSV export flattens it away.

Field	Type	Description
id	string	Stable identifier for this extraction. `summarySelections[].extractionId` points back at it — it's the key that joins the two arrays.
fieldName	string	The schema field this value answers.
value	string	Always a string. Scalars are rendered as text (`"412"`, `"true"`); a `list` field's value is a JSON-encoded array of its items.
confidence	0–1	The model's confidence in this value.
sourceSnippet	string	The passage the value was read from — the citation text.
pageNumber	number	The 1-based page the snippet sits on.

The same field can appear more than once — different pages or chunks may each yield a candidate. Each entry in summarySelections records a pinned choice for a cell: the specific candidate chosen as the best occurrence for a given field and page (for example, after review in the dashboard). Its extractionId references the id of one extraction above, and its fieldName / pageNumber identify the cell. The array is empty until a choice is pinned, so a run you never review comes back with none. When the CSV export builds a cell it uses the pinned choice if there is one, and otherwise falls back to the highest-confidence candidate.
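The selection rule can be approximated on the client when you want one value per field while keeping access to the citation data. The sketch below is a per-document approximation of the rule described above, not the server's implementation: it keeps the first pinned choice per field and otherwise the highest-confidence candidate. `best_values` and its `list_fields` parameter are hypothetical names.

```python
from __future__ import annotations

import json


def best_values(job: dict, list_fields: frozenset = frozenset()) -> dict:
    """Pick one value per field: the pinned candidate, else the most confident.

    Fields named in `list_fields` are JSON-decoded, since a list field's
    value arrives as a JSON-encoded array.
    """
    by_id = {ext["id"]: ext for ext in job["extractions"]}
    chosen = {}

    # Pinned selections win outright; the first pin per field is kept.
    for sel in job.get("summarySelections") or []:
        pinned = by_id.get(sel["extractionId"])
        if pinned is not None:
            chosen.setdefault(sel["fieldName"], pinned)
    pinned_fields = set(chosen)

    # Everything else falls back to the highest-confidence candidate.
    for ext in job["extractions"]:
        name = ext["fieldName"]
        if name in pinned_fields:
            continue
        if name not in chosen or ext["confidence"] > chosen[name]["confidence"]:
            chosen[name] = ext

    return {
        name: json.loads(ext["value"]) if name in list_fields else ext["value"]
        for name, ext in chosen.items()
    }
```

To keep provenance alongside each value, return the whole extraction entry instead of just its `value`.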

Download artifacts

Both artifact.csv and artifact.tsv accept the same query parameters.

Param	Values	Default	Description
collateBy	document · page · none	document	One row per file, per page, or per individual extraction occurrence.
confidenceCutoff	0–1	0	Drops collated cells below the cutoff. Defaults to `0` — every value is kept. Ignored when `collateBy=none`.
format	standard · normalized	standard	`normalized` explodes a single list field into one row per item. Requires the schema to have exactly one list field, otherwise `422`.

GET /api/v1/job-groups/:groupId/artifact.csv

curl "https://datasnipe.app/api/v1/job-groups/$GROUP_ID/artifact.csv?collateBy=document&confidenceCutoff=0.5" \
  -H "Authorization: Bearer $DATASNIPE_API_KEY" \
  -o results.csv
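The same download URL can be built programmatically, with the parameter values checked against the table above before any request is made. `artifact_url` and its keyword names are hypothetical; the query parameter names and allowed values come from this reference.

```python
from __future__ import annotations

from urllib.parse import quote, urlencode

BASE = "https://datasnipe.app/api/v1"


def artifact_url(
    group_id: str,
    ext: str = "csv",
    collate_by: str = "document",
    confidence_cutoff: float = 0.0,
    fmt: str = "standard",
) -> str:
    """Build a download URL for artifact.csv or artifact.tsv."""
    if ext not in ("csv", "tsv"):
        raise ValueError("ext must be 'csv' or 'tsv'")
    if collate_by not in ("document", "page", "none"):
        raise ValueError("collate_by must be document, page, or none")
    if not 0 <= confidence_cutoff <= 1:
        raise ValueError("confidence_cutoff must be between 0 and 1")
    if fmt not in ("standard", "normalized"):
        raise ValueError("fmt must be standard or normalized")
    query = urlencode({
        "collateBy": collate_by,
        "confidenceCutoff": confidence_cutoff,
        "format": fmt,
    })
    return f"{BASE}/job-groups/{quote(group_id, safe='')}/artifact.{ext}?{query}"
```

Note that `fmt="normalized"` can still come back as 422 when the schema does not have exactly one list field; that check needs the schema and is left to the server here.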

What the CSV looks like

One column per schema field, plus a leading File column. When collateBy is page a Page column is added, and when it is none an Extraction column is added. Each cell holds the single best value for that field: the pinned summarySelections choice when one exists, otherwise the highest-confidence candidate (subject to confidenceCutoff). A list field's items are flattened into the one cell, one item per line; within a line, the item's own fields are comma-separated in itemFields order. Because such a cell contains commas and newlines it's wrapped in quotes, per RFC 4180.

artifact.csv (collateBy=document)

File,title,sample_size,double_blind,arms
paper-1.pdf,A randomized trial of …,412,true,"Treatment,50
Placebo,0"
paper-2.pdf,Effects of …,318,false,"Drug A,100
Drug B,200"

The CSV is values only — citations live in the group result. Confidence, source snippet, and page number are not columns in the current export. To keep the provenance, read each value from the job-group endpoint (GET /api/v1/job-groups/:groupId). Citation columns in the CSV are on the roadmap.

confidenceCutoff is an opt-in filter. It defaults to 0, so the export keeps every value unless you ask otherwise. Raise it — e.g. confidenceCutoff=0.5 — to drop cells the model was less sure about. Dropped cells come back blank, so filter deliberately.
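A sketch of reading the export back with Python's standard `csv` module, which handles the quoted multi-line list cells. `read_artifact` is a hypothetical helper and `SAMPLE` is a shortened stand-in for a real download. The list-cell split is deliberately naive: it assumes item values contain no commas of their own, which the reference does not guarantee.

```python
import csv
import io

# Shortened stand-in for a downloaded artifact.csv (collateBy=document).
SAMPLE = (
    "File,title,sample_size,double_blind,arms\n"
    'paper-1.pdf,A randomized trial,412,true,"Treatment,50\nPlacebo,0"\n'
)


def read_artifact(text, list_columns=()):
    """Parse artifact.csv text into dicts, splitting list cells into items."""
    rows = []
    # newline="" lets the csv module handle line breaks inside quoted cells.
    for row in csv.DictReader(io.StringIO(text, newline="")):
        for col in list_columns:
            cell = row.get(col) or ""
            # One item per line, item fields comma-separated (naive split).
            row[col] = [line.split(",") for line in cell.splitlines()]
        rows.append(row)
    return rows
```

Every cell comes back as a string, and a cell dropped by `confidenceCutoff` is an empty string, so convert and null-check downstream.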

Limits & conventions

The numbers to design around.

The constraints worth knowing before you write the upload-and-poll loop — what we accept, how much you can have in flight at once, and how to poll without guessing.

File constraints

Constraint	Value	Notes
Accepted formats	PDF · JPEG · PNG · WebP	Sniffed by content, not filename or declared type. Each image counts as a single-page document.
Pages per file	100	PDFs over the limit are rejected (see the note below).
File size	50 MB	Soft per-file ceiling. Processing is in-memory, so oversized files are rejected before any work is done (see the note below).
Files per run	≥ 1	At least one `files` part. Each file becomes a job, and a run is admitted only if it fits within the in-flight cap below — so a single run can carry at most as many files as that cap. Larger batches must be split across runs.

A bad file fails its own job, not the whole run. When a run uploads a file that's the wrong format, over the page limit, or over the size cap, the run still returns 202 — that file becomes a job whose status is failed, with the reason in errorReason, while the rest of the batch proceeds. So an unsupported or unsniffable file doesn't surface as a request-level 4xx; check each job's status in the group result. (The 402/429 rejections below are request-level — they're decided before any job is created.)
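Since a bad file only shows up later as a failed job, a local preflight can catch the obvious cases before uploading. The sketch below checks size and the well-known leading bytes of the four accepted formats. It is an approximation: the server's own sniffing may differ, the page limit is not checked (that needs a PDF parser), and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption. Both function names are hypothetical.

```python
from __future__ import annotations

from typing import Optional

MAX_BYTES = 50 * 1024 * 1024  # assumes "50 MB" means 50 MiB


def sniff_format(data: bytes) -> Optional[str]:
    """Identify an accepted format from its leading bytes, else None."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def preflight(data: bytes) -> Optional[str]:
    """Return a reason the file would likely fail, or None if it looks fine."""
    if len(data) > MAX_BYTES:
        return "over the 50 MB size cap"
    if sniff_format(data) is None:
        return "not a PDF, JPEG, PNG, or WebP by content"
    return None
```

Passing the preflight is no guarantee of success, so each job's status and errorReason still need checking after the run.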

Throughput

An account may have a fixed number of jobs in flight at once — counting every job that hasn't reached a terminal state across all runs: queued, ready, summarizing, and extracting all count; done and failed don't. The check runs on arrival against the whole batch: if the jobs you're submitting would push the account's in-flight count past the cap, the run is rejected whole — nothing is queued — with 429, a Retry-After: 30 header, and a body of { error, limit, outstanding, retryAfterSeconds }. The cap comes back as limit in that body (and is shown in your dashboard), so read it from the response rather than hardcoding a number. Wait for jobs to finish — or honour Retry-After — then resend. The limit is per owner, so a personal key and an organization key draw from separate budgets.
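A sketch of the client-side bookkeeping this implies: split a large batch into runs no bigger than the cap, and after a 429 work out how long to wait and how many jobs would currently fit. All three helpers are hypothetical. Reading `outstanding` as the account's current in-flight count is an assumption based on the field name.

```python
from __future__ import annotations


def split_into_runs(files: list, limit: int) -> list:
    """Split a batch so no single run exceeds the in-flight cap."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [files[i:i + limit] for i in range(0, len(files), limit)]


def retry_delay(headers: dict, body: dict, default: float = 30.0) -> float:
    """Seconds to wait after a 429, preferring the Retry-After header."""
    for value in (headers.get("Retry-After"), body.get("retryAfterSeconds")):
        try:
            return float(value)
        except (TypeError, ValueError):
            continue
    return default


def free_slots(body: dict) -> int:
    """How many more jobs the account could admit, per the 429 body."""
    return max(body["limit"] - body["outstanding"], 0)
```

`retry_delay` does a case-sensitive lookup on a plain dict; HTTP clients such as `requests` expose case-insensitive header mappings, which work here unchanged.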

Polling & webhooks

Read endpoints — the job-group result and the artifact downloads — are not rate-limited. Poll GET /api/v1/job-groups/:groupId on a fixed interval until every job is done or failed; every 2–3 seconds is a good default. A run's wall-clock grows with page count, so for large batches back off to a longer interval rather than hammering — there's nothing to gain from a tight loop.

Polling is the only completion signal. There are no webhooks today. If you need a callback-style integration, poll from a worker and fan out yourself.
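A polling loop along these lines might look like the sketch below. It starts at 3 seconds and stretches the interval for long runs; the growth factor, the 30-second ceiling, and the one-hour timeout are illustrative choices, not values from this reference. `fetch_group` is a caller-supplied function that performs the GET and returns the parsed JSON.

```python
from __future__ import annotations

import time
from typing import Callable


def poll_group(
    fetch_group: Callable,
    interval: float = 3.0,
    max_interval: float = 30.0,
    growth: float = 1.5,
    timeout: float = 3600.0,
    sleep: Callable = time.sleep,
) -> dict:
    """Call `fetch_group` until every job is done or failed.

    Starts at `interval` seconds and stretches towards `max_interval`, so
    long runs are polled less often. Raises TimeoutError after roughly
    `timeout` seconds of waiting.
    """
    waited = 0.0
    while True:
        group = fetch_group()
        if all(j["status"] in ("done", "failed") for j in group["jobs"]):
            return group
        if waited >= timeout:
            raise TimeoutError("job group did not finish in time")
        sleep(interval)
        waited += interval
        interval = min(interval * growth, max_interval)
```

Injecting `sleep` keeps the loop testable, and the same function can run inside a worker that fans results out to your own callbacks.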

Pagination

GET /api/v1/workflows returns your full list in a single response — there are no limit, offset, or cursor parameters today. Don't build paging logic against it; expect the complete set.

Dates & value formats

Two date-shaped things are easy to conflate — they have different formats:

Where	Format	Example
Resource timestamps (`createdAt`, `updatedAt`)	YYYY-MM-DD HH:MM:SS, UTC	2026-01-04 10:32:00
Extracted `date` field values	As written in the document — not normalized	"January 4, 2026"

Resource timestamps are plain SQLite-style UTC strings, not ISO 8601 with a T/Z. An extracted date value comes back as the string the model read off the page, so normalize it downstream if you need a canonical format.
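For the resource timestamps, a small parser is enough to get a timezone-aware value. This sketch uses only the standard library; `parse_timestamp` is a hypothetical helper name. Extracted date values are left alone here because their format depends on the source document.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse a createdAt/updatedAt string into an aware UTC datetime."""
    # The API format has a space separator and no zone suffix, so the UTC
    # zone is attached explicitly after parsing.
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
```

Calling `.isoformat()` on the result gives an ISO 8601 string if downstream systems expect one.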

Errors

Predictable status codes.

Errors return a JSON body with an error field and, where useful, extra context.

Status	Meaning	Body
400	Invalid request body or query parameters.	{ error: [ … ] }
401	Missing or invalid API key.	{ error: "Unauthorized" }
402	Not enough credits to start the run.	{ error: "insufficient_credits", available, required }
403	The key lacks the required scope.	{ error: "insufficient_scope", required }
404	Workflow or job group not found (or not visible to the key).	{ error: "…" }
409	A workflow with that name already exists for the owner, or a duplicate run was retried while the first request is still running (`idempotency_conflict`).	{ error: "…" }
422	Normalized export needs exactly one list field in the schema.	{ error: "…" }
429	Too many in-flight jobs for the account. Respect `Retry-After`.	{ error, limit, outstanding, retryAfterSeconds }
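One way to act on this table in client code is to turn non-2xx responses into a single exception that knows whether a retry makes sense. The sketch below is illustrative: `DataSnipeError` and `raise_for_error` are hypothetical names, and it assumes the 409 for an in-progress duplicate run carries `idempotency_conflict` in the `error` field.

```python
from __future__ import annotations


class DataSnipeError(Exception):
    """Hypothetical exception type carrying the status and parsed body."""

    def __init__(self, status: int, body: dict):
        super().__init__(f"HTTP {status}: {body.get('error')}")
        self.status = status
        self.body = body

    @property
    def retryable(self) -> bool:
        # 429 clears once in-flight jobs finish. 409 is retryable only for
        # a duplicate run still in progress, not for a workflow name clash.
        if self.status == 429:
            return True
        return self.status == 409 and self.body.get("error") == "idempotency_conflict"


def raise_for_error(status: int, body: dict) -> dict:
    """Return the body on 2xx, otherwise raise DataSnipeError."""
    if 200 <= status < 300:
        return body
    raise DataSnipeError(status, body)
```

Everything else in the table (400, 401, 402, 403, 404, 422) points at the request, the key, or the account, so those are surfaced rather than retried.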
