Multi-Modal AI features are Coming Soon and not yet available.
Bota’s AI pipeline processes recordings through three stages: transcription (audio → text), summarization (text → structured output), and multi-modal analysis (text + visual media → context-aware output).

Transcription

Bota transcribes audio recordings using Automatic Speech Recognition (ASR). Transcription is asynchronous — you submit a job and receive results via webhook or polling.
  1. Upload a recording — see Quickstart
  2. Create a transcription — specify the recording and optional language hint
  3. Wait for completion — poll or listen for the transcription.completed webhook
  4. Retrieve results — structured output with timestamps, speaker labels, and confidence scores
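The polling path in step 3 could look roughly like the sketch below. It is not the official client: the `fetch` callable stands in for whatever call retrieves a transcription by ID, and the `"processing"` and `"failed"` status values are assumptions (only `"completed"` appears in this guide).

```python
import time
from typing import Callable

# "completed" appears in the sample output; "failed" is an assumed status value.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_transcription(
    fetch: Callable[[str], dict],
    transcription_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the transcription reaches a terminal status or the timeout expires.

    `fetch` is a hypothetical callable that returns the transcription as a dict.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch(transcription_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{transcription_id} not finished after {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch` and `sleep` keeps the loop independent of any HTTP library and easy to test. For production workloads, the transcription.completed webhook avoids repeated requests.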

Output Format

A completed transcription includes a full text string and time-stamped segments with speaker diarization:
{
  "id": "txn_abc123",
  "status": "completed",
  "full_text": "Good morning. I'd like to discuss the project timeline...",
  "segments": [
    {
      "start": 0.0,
      "end": 1.2,
      "text": "Good morning.",
      "speaker": "Speaker 1",
      "confidence": 0.95
    },
    {
      "start": 1.5,
      "end": 4.8,
      "text": "I'd like to discuss the project timeline.",
      "speaker": "Speaker 1",
      "confidence": 0.92
    }
  ],
  "word_count": 42,
  "confidence": 0.93,
  "language": "en"
}
Each segment includes:
| Field | Description |
|---|---|
| start / end | Timestamps in seconds |
| text | Transcribed text for this segment |
| speaker | Speaker label (e.g., Speaker 1, Speaker 2) |
| confidence | Per-segment confidence score (0–1) |
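As an illustration of working with these fields, the sketch below totals speaking time per speaker and flags segments that may need manual review. The helper names and the 0.9 review threshold are illustrative choices, not part of the API.

```python
from collections import defaultdict


def speaking_time_by_speaker(transcription: dict) -> dict[str, float]:
    """Sum segment durations (end - start, in seconds) for each speaker label."""
    totals: dict[str, float] = defaultdict(float)
    for segment in transcription["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


def low_confidence_segments(transcription: dict, threshold: float = 0.9) -> list[dict]:
    """Return segments whose confidence is below the threshold (an arbitrary cutoff)."""
    return [s for s in transcription["segments"] if s["confidence"] < threshold]
```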

ASR Providers

| Provider | Best For |
|---|---|
| whisper | General purpose, multilingual support |
| deepgram | Low latency, real-time processing |
| assemblyai | Speaker diarization, content analysis |
| elevenlabs | High accuracy transcription |
You can specify a provider when creating a transcription, or let Bota use the default configured for your project.

Language Support

Transcription supports 50+ languages. Provide a language hint (e.g., en, es, zh) to improve accuracy, or omit it for automatic detection.
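A request body reflecting these options might be assembled as in the sketch below. The field names `recording_id`, `provider`, and `language` are assumptions made for illustration; check the Transcription API Reference for the actual schema. The provider names are the four ASR providers this guide lists.

```python
from typing import Optional

# Provider identifiers listed in this guide.
KNOWN_PROVIDERS = {"whisper", "deepgram", "assemblyai", "elevenlabs"}


def build_transcription_request(
    recording_id: str,
    provider: Optional[str] = None,
    language: Optional[str] = None,
) -> dict:
    """Build a request body; field names here are hypothetical."""
    body = {"recording_id": recording_id}
    if provider is not None:
        if provider not in KNOWN_PROVIDERS:
            raise ValueError(f"Unknown ASR provider: {provider!r}")
        body["provider"] = provider  # omitted: the project default is used
    if language is not None:
        body["language"] = language  # omitted: automatic detection
    return body
```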

See the Transcription API Reference for endpoint details.

Summarization

Bota generates structured summaries from transcriptions using LLM providers. Use built-in templates for common formats (SOAP notes, sales calls, legal memos) or provide custom prompts.

Templates vs Custom Prompts

  • Template — Use a built-in template for standardized, structured output. Best for repeatable workflows.
  • Custom Prompt — Provide your own instructions for flexible, ad-hoc summarization.
Provide either a template or a custom prompt, not both.
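The either-or rule can be checked on the client before a request is sent. The sketch below is one way to do that; the parameter and field names (`transcription_id`, `template_id`, `custom_prompt`) are assumptions for illustration rather than the documented schema.

```python
from typing import Optional


def build_summary_request(
    transcription_id: str,
    template_id: Optional[str] = None,
    custom_prompt: Optional[str] = None,
) -> dict:
    """Build a summary request with exactly one of template_id or custom_prompt."""
    # True when both are missing or both are present; either case is invalid.
    if (template_id is None) == (custom_prompt is None):
        raise ValueError("Provide exactly one of template_id or custom_prompt")
    body = {"transcription_id": transcription_id}
    if template_id is not None:
        body["template_id"] = template_id
    else:
        body["custom_prompt"] = custom_prompt
    return body
```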

Built-in Templates

General Notes

Ideal for meetings, discussions, and team syncs. Extracts key points, action items, decisions, and participants.
{
  "overview": "Team discussed Q2 roadmap priorities...",
  "key_points": ["Launch new API version by March", "Hire 2 engineers"],
  "action_items": [
    { "task": "Draft API migration guide", "owner": "Sarah", "deadline": "2025-02-01" }
  ],
  "decisions": ["Postpone mobile app to Q3"],
  "participants": ["Sarah", "Mike", "Lisa"]
}
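
Structured output like this is straightforward to post-process. As a small sketch, the helper below lists action items due on or before a cutoff date. It assumes every `deadline` is an ISO 8601 date string, as in the sample above; the function name is illustrative.

```python
from datetime import date


def action_items_due_by(summary: dict, cutoff: date) -> list[dict]:
    """Return action items with a deadline on or before the cutoff, earliest first."""
    due = [
        item
        for item in summary.get("action_items", [])
        if date.fromisoformat(item["deadline"]) <= cutoff
    ]
    # ISO date strings sort chronologically when compared as text.
    return sorted(due, key=lambda item: item["deadline"])
```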

Sales Call

Captures pain points, budget, next steps, and deal sentiment from sales conversations.
{
  "pain_points": ["Current solution too slow", "No API access"],
  "budget": { "range": "$50k-75k", "timeline": "Q2 2025" },
  "next_steps": ["Send proposal by Friday", "Schedule demo with CTO"],
  "key_quotes": [
    { "quote": "We need this integrated by April", "speaker": "Prospect", "context": "Timeline discussion" }
  ],
  "sentiment": "positive",
  "deal_probability": 0.7
}

Clinical SOAP

Generates structured SOAP notes from healthcare encounters.
{
  "chief_complaint": "Patient reports persistent lower back pain for 2 weeks",
  "subjective": "Pain rated 6/10, worse with sitting...",
  "objective": "BP 120/80, ROM limited in lumbar flexion...",
  "assessment": "Lumbar strain, likely mechanical origin",
  "plan": "Physical therapy 2x/week, NSAIDs as needed, follow up in 2 weeks"
}

Legal Memo

Summarizes legal proceedings, depositions, and client meetings into structured memos with facts, issues, and analysis.

Template Reference

| Template | ID | Use Case |
|---|---|---|
| General Notes | tmpl_general_notes | Meetings, discussions |
| Sales Call | tmpl_sales_call | Sales conversations |
| Clinical SOAP | tmpl_clinical_soap | Healthcare encounters |
| Legal Memo | tmpl_legal_memo | Legal proceedings |

LLM Providers

| Provider | Best For |
|---|---|
| gemini | Fast processing, good general quality |
| openai | High accuracy, structured output |
| claude | Nuanced analysis, long transcripts |

See the Summarization API Reference for endpoint details.

Multi-Modal Analysis

Multi-Modal extends the pipeline with visual context from the Bota Pin Pro. The Pin Pro captures images and video alongside audio, enabling AI that understands both what was said and what was seen.

Media Types

| Type | Format | Best For |
|---|---|---|
| Images | JPEG, PNG | Periodic snapshots, whiteboard captures, document scans, equipment photos |
| Video clips | MP4 (H.264) | Short scene captures, demonstrations, walkthroughs |
Media is captured based on configurable triggers:
| Trigger | Description |
|---|---|
| Periodic | Capture at fixed intervals (e.g., every 30 seconds, every 5 minutes) |
| Motion | Capture when significant scene change is detected |
| Manual | Capture on button press |
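
To make the Periodic trigger concrete, the sketch below computes the offsets at which captures would occur for a given recording length. It assumes the first capture happens at the start of the recording, which this guide does not specify, and the function is purely illustrative rather than a device setting.

```python
def periodic_capture_times(duration_s: float, interval_s: float) -> list[float]:
    """Offsets (in seconds) of captures at a fixed interval.

    Assumes the first capture is at t = 0; actual device behavior may differ.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if duration_s < 0:
        raise ValueError("duration_s must not be negative")
    count = int(duration_s // interval_s)
    # Multiply rather than accumulate to avoid floating-point drift.
    return [i * interval_s for i in range(count + 1)]
```

For a 10-minute recording, a 30-second interval yields 21 captures under this assumption, while a 5-minute interval yields 3.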

Video Summary

Video Summary identifies key frames in video clips, generates captions, and produces a timeline of visual highlights, so you can review a long recording without watching the entire video. See Create Video Summary for the API reference.

Use Cases

Field Inspection

Inspector narrates findings while the camera captures equipment and damage. Video summary highlights key visual moments alongside the transcript.

Clinical Encounter

Doctor-patient conversation captured alongside video of the examination. Transcript + video summary provide a complete record.

Meeting + Whiteboard

Discussion transcript combined with video of whiteboard diagrams. Video summary extracts key frames for quick review.

Training Session

Trainer’s spoken instructions paired with video of demonstrations. Video summary creates a visual timeline of the session.

End-to-End Flow

A typical multi-modal workflow:
  1. Record — End user wears Pin Pro, presses button to start. Audio records continuously; camera captures video.
  2. Upload — Device uploads audio and video via the Upload URL endpoint (repeated per file), then calls Complete Upload.
  3. Transcribe — Create a transcription from the audio.
  4. Summarize — Create a summary from the transcript.
  5. Video Summary — Create a video summary for visual highlights.
  6. Deliver — Results delivered via webhook or polling.
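
Steps 3 through 5 could be orchestrated along the lines of the sketch below. The `BotaClient` interface and its method names are hypothetical, and each method is assumed to block until its job finishes (for example by polling internally). Multi-modal features are not yet available, so treat the video summary call as a placeholder.

```python
from typing import Protocol


class BotaClient(Protocol):
    """Hypothetical client interface; not the official SDK."""

    def create_transcription(self, recording_id: str) -> dict: ...
    def create_summary(self, transcription_id: str, template_id: str) -> dict: ...
    def create_video_summary(self, recording_id: str) -> dict: ...


def run_pipeline(
    client: BotaClient,
    recording_id: str,
    template_id: str = "tmpl_general_notes",
) -> dict:
    """Transcribe, summarize, then request a video summary for one recording."""
    transcription = client.create_transcription(recording_id)
    if transcription.get("status") != "completed":
        # The summary depends on the transcript, so stop early on failure.
        raise RuntimeError(f"Transcription did not complete for {recording_id}")
    summary = client.create_summary(transcription["id"], template_id)
    video_summary = client.create_video_summary(recording_id)
    return {
        "transcription": transcription,
        "summary": summary,
        "video_summary": video_summary,
    }
```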

BYO API Keys

All AI processing supports bringing your own provider API keys. This gives you control over costs, rate limits, and model selection.
  1. Register your provider API key through the Integrations API
  2. Test the key to verify it works
  3. Bota automatically uses your key when you select that provider
Keys are encrypted at rest (AES-256-GCM) and never exposed in API responses. You can rotate or delete keys at any time.

Webhooks

| Event | Description |
|---|---|
| transcription.completed | Transcription finished successfully |
| transcription.failed | Transcription encountered an error |
| summary.completed | Summary generated successfully |
| summary.failed | Summary encountered an error |
See Webhook Events for payload details.
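
A receiver for these events might route them as in the sketch below. The payload shape is an assumption: the sketch reads the event name from a top-level `type` field, which may not match the real payload, so confirm the field names against Webhook Events before relying on it. The event names are the four listed above.

```python
import json
from typing import Callable

# Event names listed in this guide.
KNOWN_EVENTS = {
    "transcription.completed",
    "transcription.failed",
    "summary.completed",
    "summary.failed",
}


def handle_webhook(raw_body: bytes, handlers: dict[str, Callable[[dict], None]]) -> bool:
    """Dispatch a webhook payload to its handler; return False if it was ignored.

    Assumes the event name is in a top-level "type" field (hypothetical).
    """
    event = json.loads(raw_body)
    event_type = event.get("type")
    if event_type not in KNOWN_EVENTS:
        return False
    handler = handlers.get(event_type)
    if handler is None:
        return False
    handler(event)
    return True
```

Returning False for unrecognized events, instead of raising, lets the receiver acknowledge deliveries it does not handle yet.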