Multi-Modal AI (Coming Soon)

Multi-Modal AI features are Coming Soon and not yet available.

Bota’s AI pipeline processes recordings through three stages: transcription (audio → text), summarization (text → structured output), and multi-modal analysis (text + visual media → context-aware output).

Transcription

Bota transcribes audio recordings using Automatic Speech Recognition (ASR). Transcription is asynchronous — you submit a job and receive results via webhook or polling.

Upload a recording — see Quickstart
Create a transcription — specify the recording and optional language hint
Wait for completion — poll or listen for the transcription.completed webhook
Retrieve results — structured output with timestamps, speaker labels, and confidence scores
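Because transcription is asynchronous, a client that does not use webhooks needs a polling loop. The sketch below is minimal and makes several assumptions. You supply `get_transcription`, a hypothetical function that fetches a transcription object by ID with your own HTTP client. The backoff values are arbitrary. Of the status values, only `completed` appears in the documented output, and `failed` mirrors the `transcription.failed` webhook.

```python
import time
from typing import Any, Callable, Dict

# Terminal states assumed for illustration: "completed" is documented in the
# output example; "failed" mirrors the transcription.failed webhook event.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_transcription(
    get_transcription: Callable[[str], Dict[str, Any]],
    transcription_id: str,
    interval: float = 2.0,
    max_interval: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the transcription reaches a terminal status.

    `get_transcription` is a caller-supplied (hypothetical) fetch function.
    """
    waited = 0.0
    while True:
        txn = get_transcription(transcription_id)
        if txn.get("status") in TERMINAL_STATES:
            return txn
        if waited >= timeout:
            raise TimeoutError(
                f"{transcription_id} still {txn.get('status')!r} after {waited:.0f}s"
            )
        sleep(interval)
        waited += interval
        # Exponential backoff, capped, so long jobs are not polled too often.
        interval = min(interval * 2, max_interval)
```

Listening for the `transcription.completed` webhook avoids polling altogether. A loop like this is mainly useful as a fallback or in scripts that cannot receive inbound HTTP requests.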

Output Format

A completed transcription includes a full text string and time-stamped segments with speaker diarization:

{
  "id": "txn_abc123",
  "status": "completed",
  "full_text": "Good morning. I'd like to discuss the project timeline...",
  "segments": [
    {
      "start": 0.0,
      "end": 1.2,
      "text": "Good morning.",
      "speaker": "Speaker 1",
      "confidence": 0.95
    },
    {
      "start": 1.5,
      "end": 4.8,
      "text": "I'd like to discuss the project timeline.",
      "speaker": "Speaker 1",
      "confidence": 0.92
    }
  ],
  "word_count": 42,
  "confidence": 0.93,
  "language": "en"
}

Each segment includes:

Field	Description
`start` / `end`	Segment start and end times, in seconds from the beginning of the recording
`text`	Transcribed text for this segment
`speaker`	Speaker label (e.g., `Speaker 1`, `Speaker 2`)
`confidence`	Per-segment confidence score (0–1)
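The segment fields above are enough for common post-processing. The following sketch totals speaking time per speaker label and flags segments below a confidence cutoff, for example to route them to human review. It relies only on the documented `segments` fields. The threshold value is an arbitrary choice, not a Bota recommendation.

```python
from collections import defaultdict
from typing import Any, Dict, List


def speaker_talk_time(transcription: Dict[str, Any]) -> Dict[str, float]:
    """Total seconds attributed to each speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in transcription["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def low_confidence_segments(
    transcription: Dict[str, Any], threshold: float = 0.8
) -> List[Dict[str, Any]]:
    """Segments whose per-segment confidence falls below `threshold`."""
    return [seg for seg in transcription["segments"] if seg["confidence"] < threshold]
```

Applied to the example output, `Speaker 1` accounts for about 4.5 seconds (1.2 s + 3.3 s). Note that gaps between segments, such as 1.2 s to 1.5 s, are not counted toward any speaker.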

ASR Providers

Provider	Best For
`whisper`	General purpose, multilingual support
`deepgram`	Low latency, real-time processing
`assemblyai`	Speaker diarization, content analysis
`elevenlabs`	High accuracy transcription

You can specify a provider when creating a transcription, or let Bota use the default configured for your project.

Language Support

Transcription supports 50+ languages. Provide a language hint (e.g., en, es, zh) to improve accuracy, or omit it for automatic detection.
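Both the provider and the language hint are optional, and leaving them out falls back to the project default and automatic detection. The sketch below builds a request body that only includes the keys you set. The field names `recording_id`, `language`, and `provider` are assumptions for illustration; check the Transcription API Reference for the actual schema.

```python
from typing import Any, Dict, Optional

# Provider identifiers from the ASR Providers table.
ASR_PROVIDERS = {"whisper", "deepgram", "assemblyai", "elevenlabs"}


def build_transcription_request(
    recording_id: str,
    language: Optional[str] = None,
    provider: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a request body; field names are hypothetical."""
    body: Dict[str, Any] = {"recording_id": recording_id}
    if language is not None:
        # Language hint such as "en", "es", "zh"; omit for auto-detection.
        body["language"] = language
    if provider is not None:
        if provider not in ASR_PROVIDERS:
            raise ValueError(f"unknown ASR provider: {provider!r}")
        body["provider"] = provider
    # With no provider key, Bota uses the project's default provider.
    return body
```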

Transcription API Reference

Summarization

Bota generates structured summaries from transcriptions using LLM providers. Use built-in templates for common formats (SOAP notes, sales calls, legal memos) or provide custom prompts.

Templates vs Custom Prompts

Template — Use a built-in template for standardized, structured output. Best for repeatable workflows.
Custom Prompt — Provide your own instructions for flexible, ad-hoc summarization.

Provide either a template or a custom prompt, not both.
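The "one or the other" rule is easy to enforce on the client before a request is sent. This sketch assumes that exactly one of the two is required. It also uses hypothetical field names (`transcription_id`, `template_id`, `prompt`, `provider`), so consult the Summarization API Reference for the real request shape.

```python
from typing import Any, Dict, Optional

# Template IDs from the Template Reference table.
TEMPLATE_IDS = {"tmpl_general_notes", "tmpl_sales_call", "tmpl_clinical_soap", "tmpl_legal_memo"}


def build_summary_request(
    transcription_id: str,
    template_id: Optional[str] = None,
    prompt: Optional[str] = None,
    provider: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a summary request body; field names are hypothetical."""
    # Reject both-missing and both-present: exactly one must be supplied.
    if (template_id is None) == (prompt is None):
        raise ValueError("provide exactly one of template_id or prompt")
    body: Dict[str, Any] = {"transcription_id": transcription_id}
    if template_id is not None:
        if template_id not in TEMPLATE_IDS:
            raise ValueError(f"unknown template: {template_id!r}")
        body["template_id"] = template_id
    else:
        body["prompt"] = prompt
    if provider is not None:
        body["provider"] = provider  # e.g. "gemini", "openai", "claude"
    return body
```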

Built-in Templates

General Notes

Ideal for meetings, discussions, and team syncs. Extracts key points, action items, decisions, and participants.

{
  "overview": "Team discussed Q2 roadmap priorities...",
  "key_points": ["Launch new API version by March", "Hire 2 engineers"],
  "action_items": [
    { "task": "Draft API migration guide", "owner": "Sarah", "deadline": "2025-02-01" }
  ],
  "decisions": ["Postpone mobile app to Q3"],
  "participants": ["Sarah", "Mike", "Lisa"]
}
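Template output is plain JSON, so it can be mapped onto typed objects for downstream tools. The sketch below parses the `action_items` array from General Notes output and lists items whose deadline has passed. It assumes that `owner` and `deadline` may be absent and that deadlines use the ISO `YYYY-MM-DD` form shown in the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict, List, Optional


@dataclass
class ActionItem:
    task: str
    owner: Optional[str]
    deadline: Optional[date]


def parse_action_items(notes: Dict[str, Any]) -> List[ActionItem]:
    """Convert General Notes action items into ActionItem objects."""
    items = []
    for raw in notes.get("action_items", []):
        deadline = raw.get("deadline")
        items.append(ActionItem(
            task=raw["task"],
            owner=raw.get("owner"),
            deadline=date.fromisoformat(deadline) if deadline else None,
        ))
    return items


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Items with a deadline strictly before `today`."""
    return [i for i in items if i.deadline is not None and i.deadline < today]
```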

Sales Call

Captures pain points, budget, next steps, and deal sentiment from sales conversations.

{
  "pain_points": ["Current solution too slow", "No API access"],
  "budget": { "range": "$50k-75k", "timeline": "Q2 2025" },
  "next_steps": ["Send proposal by Friday", "Schedule demo with CTO"],
  "key_quotes": [
    { "quote": "We need this integrated by April", "speaker": "Prospect", "context": "Timeline discussion" }
  ],
  "sentiment": "positive",
  "deal_probability": 0.7
}

Clinical SOAP

Generates structured SOAP notes from healthcare encounters.

{
  "chief_complaint": "Patient reports persistent lower back pain for 2 weeks",
  "subjective": "Pain rated 6/10, worse with sitting...",
  "objective": "BP 120/80, ROM limited in lumbar flexion...",
  "assessment": "Lumbar strain, likely mechanical origin",
  "plan": "Physical therapy 2x/week, NSAIDs as needed, follow up in 2 weeks"
}

Legal Memo

Summarizes legal proceedings, depositions, and client meetings into structured memos with facts, issues, and analysis.

Template Reference

Template	ID	Use Case
General Notes	`tmpl_general_notes`	Meetings, discussions
Sales Call	`tmpl_sales_call`	Sales conversations
Clinical SOAP	`tmpl_clinical_soap`	Healthcare encounters
Legal Memo	`tmpl_legal_memo`	Legal proceedings

LLM Providers

Provider	Best For
`gemini`	Fast processing, good general quality
`openai`	High accuracy, structured output
`claude`	Nuanced analysis, long transcripts

Summarization API Reference

Multi-Modal Analysis

Multi-modal analysis extends the pipeline with visual context from the Bota Pin Pro. The Pin Pro captures images and video alongside audio, enabling AI that understands both what was said and what was seen.

Media Types

Type	Format	Best For
Images	JPEG, PNG	Periodic snapshots, whiteboard captures, document scans, equipment photos
Video clips	MP4 (H.264)	Short scene captures, demonstrations, walkthroughs

Media is captured based on configurable triggers:

Trigger	Description
Periodic	Capture at fixed intervals (e.g., every 30 seconds, every 5 minutes)
Motion	Capture when a significant scene change is detected
Manual	Capture on button press
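To illustrate how the three triggers can combine, here is a decision function that a capture loop might call on each tick. It is a hypothetical sketch, not the Pin Pro firmware. The priority order (manual, then motion, then periodic), the 0 to 1 `scene_change` score, and the default values are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriggerConfig:
    # Seconds between periodic captures (e.g. 30 s or 300 s); None disables.
    periodic_interval_s: Optional[float] = 30.0
    # Hypothetical 0-1 scene-change score that counts as "significant"; None disables.
    motion_threshold: Optional[float] = 0.3


def should_capture(
    cfg: TriggerConfig,
    now_s: float,
    last_capture_s: Optional[float],
    scene_change: float,
    button_pressed: bool,
) -> Optional[str]:
    """Return the name of the trigger that fires, or None if none does."""
    if button_pressed:
        return "manual"
    if cfg.motion_threshold is not None and scene_change >= cfg.motion_threshold:
        return "motion"
    if cfg.periodic_interval_s is not None and (
        last_capture_s is None or now_s - last_capture_s >= cfg.periodic_interval_s
    ):
        return "periodic"
    return None
```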

Video Summary

Generates a visual summary from video clips by identifying key frames, generating captions, and producing a timeline of visual highlights. Useful for quickly reviewing long recordings without watching the entire video. See Create Video Summary for the API reference.

Use Cases

Field Inspection

An inspector narrates findings while the camera captures the equipment and any damage. The video summary highlights key visual moments alongside the transcript.

Clinical Encounter

Doctor-patient conversation captured alongside video of the examination. Transcript + video summary provide a complete record.

Meeting + Whiteboard

Discussion transcript combined with video of whiteboard diagrams. Video summary extracts key frames for quick review.

Training Session

Trainer’s spoken instructions paired with video of demonstrations. Video summary creates a visual timeline of the session.

End-to-End Flow

A typical multi-modal workflow:

Record — End user wears Pin Pro, presses button to start. Audio records continuously; camera captures video.
Upload — Device uploads audio and video via the Upload URL endpoint (repeated per file), then calls Complete Upload.
Transcribe — Create a transcription from the audio.
Summarize — Create a summary from the transcript.
Video Summary — Create a video summary for visual highlights.
Deliver — Results delivered via webhook or polling.
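The workflow above can be sketched as one orchestration function. This is a sketch under stated assumptions. `BotaClient` is a hypothetical interface whose method names are placeholders for the Upload URL, Complete Upload, transcription, summary, and video summary endpoints, not a published SDK. The function waits for each result in turn; in production, the webhooks described below can replace those waits.

```python
from typing import Any, Dict, List, Protocol


class BotaClient(Protocol):
    """Hypothetical client interface: method names are placeholders, not a Bota SDK."""

    def upload_file(self, recording_id: str, path: str) -> None: ...
    def complete_upload(self, recording_id: str) -> None: ...
    def create_transcription(self, recording_id: str) -> str: ...
    def create_summary(self, transcription_id: str, template_id: str) -> str: ...
    def create_video_summary(self, recording_id: str) -> str: ...
    def wait(self, job_id: str) -> Dict[str, Any]: ...


def process_recording(
    client: BotaClient,
    recording_id: str,
    files: List[str],
    template_id: str = "tmpl_general_notes",
) -> Dict[str, Dict[str, Any]]:
    # Upload: one call per audio/video file, then mark the upload complete.
    for path in files:
        client.upload_file(recording_id, path)
    client.complete_upload(recording_id)

    # Transcribe and wait, because summarization needs the finished transcript.
    transcript = client.wait(client.create_transcription(recording_id))

    # Summarize the transcript and request the visual summary of the video.
    summary_id = client.create_summary(transcript["id"], template_id)
    video_id = client.create_video_summary(recording_id)

    # Deliver: collect all results for the caller.
    return {
        "transcription": transcript,
        "summary": client.wait(summary_id),
        "video_summary": client.wait(video_id),
    }
```

The video summary does not depend on the transcript, so a real implementation could request it immediately after Complete Upload and run it concurrently with transcription.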

BYO API Keys

All AI processing supports bringing your own provider API keys. This gives you control over costs, rate limits, and model selection.

Register your provider API key through the Integrations API
Test the key to verify it works
Bota automatically uses your key when you select that provider

Keys are encrypted at rest (AES-256-GCM) and never exposed in API responses. You can rotate or delete keys at any time.

Webhooks

Event	Description
`transcription.completed`	Transcription finished successfully
`transcription.failed`	Transcription encountered an error
`summary.completed`	Summary generated successfully
`summary.failed`	Summary encountered an error

See Webhook Events for payload details.
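A webhook receiver usually routes each event name to its own handler. The sketch below is framework-agnostic. It assumes that the payload carries the event name in a `type` field and the resource in a `data` field, but those field names are placeholders, so check Webhook Events for the actual payload shape. The sketch also omits request authentication.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], None]


class WebhookRouter:
    """Route webhook payloads to handlers by event name (payload shape assumed)."""

    KNOWN_EVENTS = {
        "transcription.completed",
        "transcription.failed",
        "summary.completed",
        "summary.failed",
    }

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, event: str) -> Callable[[Handler], Handler]:
        """Decorator that registers a handler for one documented event."""
        if event not in self.KNOWN_EVENTS:
            raise ValueError(f"unknown event: {event}")

        def register(fn: Handler) -> Handler:
            self._handlers[event] = fn
            return fn

        return register

    def dispatch(self, payload: Dict[str, Any]) -> bool:
        """Call the matching handler; return False if no handler is registered."""
        handler = self._handlers.get(payload.get("type", ""))
        if handler is None:
            return False
        handler(payload.get("data", {}))
        return True
```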

Related

Bota Pin Pro — Multi-modal hardware
Upload API Reference — File upload documentation
Streaming Upload — Upload while recording for faster processing
