Multi-Modal AI features are coming soon and are not yet available.
Bota’s AI pipeline processes recordings through three stages: transcription (audio → text), summarization (text → structured output), and multi-modal analysis (text + visual media → context-aware output).
Transcription
Bota transcribes audio recordings using Automatic Speech Recognition (ASR). Transcription is asynchronous — you submit a job and receive results via webhook or polling.
1. Upload a recording (see Quickstart).
2. Create a transcription: specify the recording and an optional language hint.
3. Wait for completion: poll, or listen for the transcription.completed webhook.
4. Retrieve results: structured output with timestamps, speaker labels, and confidence scores.
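The polling option in step 3 can be sketched as follows. This is a minimal illustration, not the Bota SDK: `fetch` stands in for whatever call retrieves a transcription by ID, and the `processing` and `failed` status values are assumptions (only `completed` appears in the sample response on this page).

```python
import time
from typing import Any, Callable, Dict


def wait_for_transcription(
    fetch: Callable[[str], Dict[str, Any]],  # hypothetical: returns the transcription object
    transcription_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the transcription reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch(transcription_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":  # assumed status name, mirroring transcription.failed
            raise RuntimeError(f"transcription {transcription_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"transcription {transcription_id} still {status!r}")
        sleep(interval_s)


# Stand-in for a real API call: first poll is pending, second is done.
_responses = iter([
    {"id": "txn_abc123", "status": "processing"},
    {"id": "txn_abc123", "status": "completed"},
])
result = wait_for_transcription(
    lambda _id: next(_responses), "txn_abc123", sleep=lambda _s: None
)
print(result["status"])
```

Injecting `fetch` and `sleep` keeps the loop testable without network access. In production, a webhook listener avoids polling entirely.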
A completed transcription includes a full text string and time-stamped segments with speaker diarization:
{
  "id": "txn_abc123",
  "status": "completed",
  "full_text": "Good morning. I'd like to discuss the project timeline...",
  "segments": [
    {
      "start": 0.0,
      "end": 1.2,
      "text": "Good morning.",
      "speaker": "Speaker 1",
      "confidence": 0.95
    },
    {
      "start": 1.5,
      "end": 4.8,
      "text": "I'd like to discuss the project timeline.",
      "speaker": "Speaker 1",
      "confidence": 0.92
    }
  ],
  "word_count": 42,
  "confidence": 0.93,
  "language": "en"
}
Each segment includes:
| Field | Description |
| --- | --- |
| start / end | Timestamps in seconds |
| text | Transcribed text for this segment |
| speaker | Speaker label (e.g., Speaker 1, Speaker 2) |
| confidence | Per-segment confidence score (0–1) |
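As an illustration of working with these fields, the sketch below renders segments as time-stamped lines and lists segments whose confidence falls below a threshold. The helper names and the threshold are illustrative choices, not part of the Bota API. The segment data is copied from the sample response above.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]

segments: List[Segment] = [
    {"start": 0.0, "end": 1.2, "text": "Good morning.",
     "speaker": "Speaker 1", "confidence": 0.95},
    {"start": 1.5, "end": 4.8, "text": "I'd like to discuss the project timeline.",
     "speaker": "Speaker 1", "confidence": 0.92},
]


def format_transcript(segs: List[Segment]) -> List[str]:
    """Render each segment as '[start-end] speaker: text'."""
    return [
        f"[{s['start']:.1f}-{s['end']:.1f}] {s['speaker']}: {s['text']}"
        for s in segs
    ]


def low_confidence_segments(segs: List[Segment], threshold: float) -> List[Segment]:
    """Return segments whose per-segment confidence is below the threshold."""
    return [s for s in segs if s["confidence"] < threshold]


for line in format_transcript(segments):
    print(line)
# [0.0-1.2] Speaker 1: Good morning.
# [1.5-4.8] Speaker 1: I'd like to discuss the project timeline.

print(len(low_confidence_segments(segments, threshold=0.93)))  # 1
```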
ASR Providers
| Provider | Best For |
| --- | --- |
| whisper | General purpose, multilingual support |
| deepgram | Low latency, real-time processing |
| assemblyai | Speaker diarization, content analysis |
| elevenlabs | High accuracy transcription |
You can specify a provider when creating a transcription, or let Bota use the default configured for your project.
Language Support
Transcription supports 50+ languages. Provide a language hint (e.g., en, es, zh) to improve accuracy, or omit it for automatic detection.
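A request body with an optional provider and language hint might be assembled like this. The field names (`recording_id`, `language`, `provider`) and the recording ID format are assumptions for illustration. Check the Transcription API Reference for the actual schema. The provider names come from the table above.

```python
from typing import Dict, Optional

# Provider identifiers as listed in the ASR Providers table.
ASR_PROVIDERS = {"whisper", "deepgram", "assemblyai", "elevenlabs"}


def build_transcription_request(
    recording_id: str,
    language: Optional[str] = None,
    provider: Optional[str] = None,
) -> Dict[str, str]:
    """Build a request body; omitted options fall back to server-side defaults."""
    body = {"recording_id": recording_id}  # hypothetical field name
    if language is not None:
        body["language"] = language  # hint such as "en", "es", "zh"
    if provider is not None:
        if provider not in ASR_PROVIDERS:
            raise ValueError(f"unknown ASR provider: {provider!r}")
        body["provider"] = provider
    return body


# Omit both options to use automatic language detection and the project default provider.
print(build_transcription_request("rec_123"))
print(build_transcription_request("rec_123", language="es", provider="deepgram"))
```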
See the Transcription API Reference.
Summarization
Bota generates structured summaries from transcriptions using LLM providers. Use built-in templates for common formats (SOAP notes, sales calls, legal memos) or provide custom prompts.
Templates vs Custom Prompts
Template — Use a built-in template for standardized, structured output. Best for repeatable workflows.
Custom Prompt — Provide your own instructions for flexible, ad-hoc summarization.
Provide either a template or a custom prompt, not both.
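A client-side check for this rule could look like the sketch below. The field names (`transcription_id`, `template_id`, `custom_prompt`) are hypothetical, and the sketch assumes that exactly one of the two options is required.

```python
from typing import Dict, Optional


def build_summary_request(
    transcription_id: str,
    template_id: Optional[str] = None,
    custom_prompt: Optional[str] = None,
) -> Dict[str, str]:
    """Build a summary request that carries a template or a custom prompt, never both."""
    # True when both are missing or both are present; either case is rejected.
    if (template_id is None) == (custom_prompt is None):
        raise ValueError("provide exactly one of template_id or custom_prompt")
    body = {"transcription_id": transcription_id}  # hypothetical field name
    if template_id is not None:
        body["template_id"] = template_id
    else:
        body["custom_prompt"] = custom_prompt
    return body


print(build_summary_request("txn_abc123", template_id="tmpl_general_notes"))
print(build_summary_request("txn_abc123", custom_prompt="List open questions."))
```

Validating before sending gives a clearer error than waiting for the API to reject the request.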
Built-in Templates
General Notes
Ideal for meetings, discussions, and team syncs. Extracts key points, action items, decisions, and participants.
{
  "overview": "Team discussed Q2 roadmap priorities...",
  "key_points": ["Launch new API version by March", "Hire 2 engineers"],
  "action_items": [
    { "task": "Draft API migration guide", "owner": "Sarah", "deadline": "2025-02-01" }
  ],
  "decisions": ["Postpone mobile app to Q3"],
  "participants": ["Sarah", "Mike", "Lisa"]
}
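Structured output like this is straightforward to post-process. As a sketch, the helper below picks out action items whose deadline has passed. It assumes deadlines are always ISO 8601 dates (`YYYY-MM-DD`), as in the sample. The helper name is illustrative.

```python
from datetime import date
from typing import Any, Dict, List

summary: Dict[str, Any] = {
    "action_items": [
        {"task": "Draft API migration guide", "owner": "Sarah", "deadline": "2025-02-01"}
    ],
}


def overdue_action_items(notes: Dict[str, Any], today: date) -> List[Dict[str, str]]:
    """Return action items whose deadline is strictly before `today`."""
    return [
        item
        for item in notes.get("action_items", [])
        if date.fromisoformat(item["deadline"]) < today
    ]


for item in overdue_action_items(summary, today=date(2025, 2, 2)):
    print(f"{item['owner']}: {item['task']} (due {item['deadline']})")
# Sarah: Draft API migration guide (due 2025-02-01)
```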
Sales Call
Captures pain points, budget, next steps, and deal sentiment from sales conversations.
{
  "pain_points": ["Current solution too slow", "No API access"],
  "budget": { "range": "$50k-75k", "timeline": "Q2 2025" },
  "next_steps": ["Send proposal by Friday", "Schedule demo with CTO"],
  "key_quotes": [
    { "quote": "We need this integrated by April", "speaker": "Prospect", "context": "Timeline discussion" }
  ],
  "sentiment": "positive",
  "deal_probability": 0.7
}
Clinical SOAP
Generates structured SOAP notes from healthcare encounters.
{
  "chief_complaint": "Patient reports persistent lower back pain for 2 weeks",
  "subjective": "Pain rated 6/10, worse with sitting...",
  "objective": "BP 120/80, ROM limited in lumbar flexion...",
  "assessment": "Lumbar strain, likely mechanical origin",
  "plan": "Physical therapy 2x/week, NSAIDs as needed, follow up in 2 weeks"
}
Legal Memo
Summarizes legal proceedings, depositions, and client meetings into structured memos with facts, issues, and analysis.
Template Reference
| Template | ID | Use Case |
| --- | --- | --- |
| General Notes | tmpl_general_notes | Meetings, discussions |
| Sales Call | tmpl_sales_call | Sales conversations |
| Clinical SOAP | tmpl_clinical_soap | Healthcare encounters |
| Legal Memo | tmpl_legal_memo | Legal proceedings |
LLM Providers
| Provider | Best For |
| --- | --- |
| gemini | Fast processing, good general quality |
| openai | High accuracy, structured output |
| claude | Nuanced analysis, long transcripts |

See the Summarization API Reference.
Multi-Modal Analysis
Multi-Modal analysis extends the pipeline with visual context from the Bota Pin Pro. The Pin Pro captures images and video alongside audio, so the AI can draw on both what was said and what was seen.

| Type | Format | Best For |
| --- | --- | --- |
| Images | JPEG, PNG | Periodic snapshots, whiteboard captures, document scans, equipment photos |
| Video clips | MP4 (H.264) | Short scene captures, demonstrations, walkthroughs |
Media is captured based on configurable triggers:
| Trigger | Description |
| --- | --- |
| Periodic | Capture at fixed intervals (e.g., every 30 seconds, every 5 minutes) |
| Motion | Capture when significant scene change is detected |
| Manual | Capture on button press |
Video Summary
Generates a visual summary from video clips by identifying key frames, generating captions, and producing a timeline of visual highlights. Useful for quickly reviewing long recordings without watching the entire video.
See Create Video Summary for the API reference.
Use Cases
Field Inspection: Inspector narrates findings while the camera captures equipment and damage. Video summary highlights key visual moments alongside the transcript.
Clinical Encounter: Doctor-patient conversation captured alongside video of the examination. Transcript + video summary provide a complete record.
Meeting + Whiteboard: Discussion transcript combined with video of whiteboard diagrams. Video summary extracts key frames for quick review.
Training Session: Trainer’s spoken instructions paired with video of demonstrations. Video summary creates a visual timeline of the session.
End-to-End Flow
A typical multi-modal workflow:
1. Record: End user wears Pin Pro, presses button to start. Audio records continuously; camera captures video.
2. Upload: Device uploads audio and video via the Upload URL endpoint (repeated per file), then calls Complete Upload.
3. Transcribe: Create a transcription from the audio.
4. Summarize: Create a summary from the transcript.
5. Video Summary: Create a video summary for visual highlights.
6. Deliver: Results delivered via webhook or polling.
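The processing steps of this flow (Transcribe, Summarize, Video Summary) can be sketched as a small orchestration function. The three callables are hypothetical stand-ins for API calls, each assumed to block until its job completes. Their names and signatures are not the Bota SDK. The sketch shows the dependency order: the summary needs the transcription ID, while the video summary depends only on the recording.

```python
from typing import Any, Callable, Dict

Job = Dict[str, Any]


def run_pipeline(
    recording_id: str,
    create_transcription: Callable[[str], Job],   # hypothetical: recording ID -> transcription
    create_summary: Callable[[str, str], Job],    # hypothetical: (transcription ID, template ID) -> summary
    create_video_summary: Callable[[str], Job],   # hypothetical: recording ID -> video summary
    template_id: str = "tmpl_general_notes",
) -> Dict[str, Job]:
    """Run transcription, then summarization, then video summary for one recording."""
    transcription = create_transcription(recording_id)
    summary = create_summary(transcription["id"], template_id)
    video_summary = create_video_summary(recording_id)
    return {
        "transcription": transcription,
        "summary": summary,
        "video_summary": video_summary,
    }


# Stub callables so the sketch runs without network access.
results = run_pipeline(
    "rec_123",
    create_transcription=lambda rec: {"id": "txn_abc123", "recording": rec},
    create_summary=lambda txn, tmpl: {"transcription": txn, "template": tmpl},
    create_video_summary=lambda rec: {"recording": rec},
)
print(results["summary"])
```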
BYO API Keys
All AI processing supports bringing your own provider API keys. This gives you control over costs, rate limits, and model selection.
1. Register your provider API key through the Integrations API.
2. Test the key to verify it works.
3. Bota automatically uses your key when you select that provider.
Keys are encrypted at rest (AES-256-GCM) and never exposed in API responses. You can rotate or delete keys at any time.
Webhooks
| Event | Description |
| --- | --- |
| transcription.completed | Transcription finished successfully |
| transcription.failed | Transcription encountered an error |
| summary.completed | Summary generated successfully |
| summary.failed | Summary encountered an error |
See Webhook Events for payload details.
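A receiver typically routes each event to a handler by event name. The sketch below assumes the event name arrives in a `type` field of the parsed payload. That field name is an assumption, so confirm it against Webhook Events. The event names are the four listed above.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[Event], bool]:
    """Return a function that routes an event to its handler by event name."""

    def dispatch(event: Event) -> bool:
        handler = handlers.get(event.get("type", ""))  # "type" is an assumed field name
        if handler is None:
            return False  # unknown event: ignore rather than fail the delivery
        handler(event)
        return True

    return dispatch


completed: List[Event] = []
failed: List[Event] = []

dispatch = make_dispatcher({
    "transcription.completed": completed.append,
    "transcription.failed": failed.append,
    "summary.completed": completed.append,
    "summary.failed": failed.append,
})

print(dispatch({"type": "transcription.completed", "id": "txn_abc123"}))  # True
print(dispatch({"type": "something.else"}))  # False
```

Returning `False` for unrecognized events lets the receiver acknowledge the delivery while staying tolerant of event types added later.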