Media Generation
Generate images and videos with AI using multiple providers including BytePlus, OpenAI, and Google Gemini.
The desktop app includes a provider-agnostic media generation service that agents use to create images and videos.
Image Generation
Supported Providers
| Provider | Model | Key Features |
|---|---|---|
| BytePlus | Seedream | High-quality, fast generation |
| OpenAI | DALL-E | Versatile image creation |
| Google | Imagen / Gemini | Multimodal understanding |
How It Works
Image generation is synchronous -- the agent requests an image and receives it when ready. The agent can:
- Describe the desired image in natural language
- Specify aspect ratio, style, and mood
- Receive the generated image directly
- Save it to the workspace or Library
Using Image Generation
Images can be generated in two ways:
- Through the agent: The agent uses media generation MCP tools during task execution
- Directly: Use the image generation feature in the web platform
Video Generation
Supported Providers
| Provider | Model | Key Features |
|---|---|---|
| BytePlus | Seedance | Motion and animation |
| OpenAI | Sora | High-quality video generation |
| Google | Veo | Multimodal video creation |
How It Works
Video generation is asynchronous because it takes longer:
- The agent submits a video generation request
- A task ID is returned immediately
- The system polls for completion until the task's TTL expires (up to 1 hour)
- When ready, the video is saved and available in the Library
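The submit-then-poll flow above can be sketched as a small loop. This is a minimal illustration, not the app's actual implementation: `wait_for_video`, the `get_status` callback, and the `state` / `video_url` fields are hypothetical names chosen for the example. Only the 1-hour TTL comes from the description above.

```python
import time
from typing import Callable, Optional

TTL_SECONDS = 60 * 60  # the 1-hour TTL described above


def wait_for_video(
    task_id: str,
    get_status: Callable[[str], dict],  # hypothetical provider status lookup
    poll_interval: float = 5.0,
    ttl: float = TTL_SECONDS,
) -> Optional[str]:
    """Poll until the task finishes or the TTL expires.

    Returns the video URL on success, or None if the TTL ran out.
    Raises RuntimeError if the provider reports a failure.
    """
    deadline = time.monotonic() + ttl
    while time.monotonic() < deadline:
        status = get_status(task_id)
        if status["state"] == "succeeded":
            return status["video_url"]
        if status["state"] == "failed":
            raise RuntimeError(f"video task {task_id} failed")
        time.sleep(poll_interval)
    return None  # TTL expired before the task completed
```

Returning `None` on expiry rather than raising keeps a timed-out task distinct from a provider-side failure, which is one reasonable design choice among several.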
Provider Configuration
Configure media providers in Settings > General:
- Enter the API key for your preferred provider
- The system automatically detects which providers are available
- Tools are registered based on configured providers
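As a rough sketch of how detection and registration could fit together, the example below derives the tool list from whichever API keys are present. The settings keys and tool names (`openai_api_key`, `generate_image`, and so on) are assumptions made for illustration. The real setting names and MCP tool names are not specified here.

```python
from typing import Dict, List, Mapping, Optional

# Hypothetical mapping from a settings key to the tools that provider enables.
PROVIDER_TOOLS: Dict[str, List[str]] = {
    "byteplus_api_key": ["generate_image", "generate_video"],
    "openai_api_key": ["generate_image", "generate_video"],
    "gemini_api_key": ["generate_image", "generate_video"],
}


def detect_providers(settings: Mapping[str, Optional[str]]) -> List[str]:
    """Return the settings keys that hold a non-empty API key."""
    return [key for key in PROVIDER_TOOLS if (settings.get(key) or "").strip()]


def register_tools(settings: Mapping[str, Optional[str]]) -> List[str]:
    """Collect the tools enabled by configured providers, without duplicates.

    Returns an empty list when no provider is configured, so nothing
    is exposed to the agent in that case.
    """
    tools: List[str] = []
    for key in detect_providers(settings):
        for tool in PROVIDER_TOOLS[key]:
            if tool not in tools:
                tools.append(tool)
    return tools
```

With this shape, an empty or whitespace-only key counts as "not configured", and an unconfigured install registers no media tools at all.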
Pattern-Based Adapter
The system uses a pattern-based adapter factory that routes requests to the correct provider based on URL patterns. This means:
- You can use multiple providers simultaneously
- Switching providers doesn't require code changes
- New providers can be added without disrupting existing ones
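One way to picture a pattern-based adapter factory is a list of (URL pattern, adapter) pairs checked in registration order. The sketch below is illustrative only: the `AdapterFactory` class, the adapter signature, and the `example.com` URLs are placeholders, not the app's real classes or provider endpoints.

```python
import re
from typing import Callable, List, Tuple

# An adapter here is simply a function from prompt to result (hypothetical shape).
Adapter = Callable[[str], str]


class AdapterFactory:
    """Routes a request URL to the first adapter whose pattern matches."""

    def __init__(self) -> None:
        self._routes: List[Tuple[re.Pattern, Adapter]] = []

    def register(self, url_pattern: str, adapter: Adapter) -> None:
        # Adding a provider only appends a route; existing routes are untouched.
        self._routes.append((re.compile(url_pattern), adapter))

    def resolve(self, url: str) -> Adapter:
        for pattern, adapter in self._routes:
            if pattern.search(url):
                return adapter
        raise LookupError(f"no adapter registered for {url}")


# Placeholder endpoints; real provider URLs will differ.
factory = AdapterFactory()
factory.register(r"^https://openai\.example\.com/", lambda prompt: f"openai:{prompt}")
factory.register(r"^https://byteplus\.example\.com/", lambda prompt: f"byteplus:{prompt}")

adapter = factory.resolve("https://byteplus.example.com/v1/images")
result = adapter("a lighthouse at dusk")  # handled by the BytePlus placeholder adapter
```

Because callers only ever hand the factory a URL, switching providers amounts to changing configuration rather than calling code.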
Graceful Degradation
If no media provider is configured:
- Media generation tools are not exposed to the agent
- The agent can still perform all other tasks
- No errors are thrown -- the feature is simply unavailable
Voice I/O
The desktop app also supports voice interaction:
Text-to-Speech (TTS)
Hear agent responses spoken aloud:
- Sentence-by-sentence streaming synthesis
- Multiple providers: ElevenLabs, Azure, Google Cloud, OpenAI, local models
- Cached embeddings for offline support
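Sentence-by-sentence streaming generally means buffering incoming text and handing each sentence to the synthesizer as soon as it is complete. The sketch below shows only that buffering step, using a naive punctuation-based split that will mis-split abbreviations such as "e.g.". The function name `stream_sentences` is hypothetical, and the call to an actual TTS provider is left out.

```python
import re
from typing import Iterable, Iterator

# Naive boundary: whitespace that follows ., !, or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in streamed text."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next chunk
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```

Each yielded sentence could then be sent for synthesis while later text is still arriving, which is what lets playback start before the full response is available.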
Speech-to-Text (STT)
Dictate your prompts:
- Real-time streaming transcription via WebSocket
- Multiple providers: Azure, Google Cloud, OpenAI Whisper, local models
- Automatic language detection
Local Speech Models
For offline use, the app bundles Sherpa ONNX:
- No API key required
- Works without internet
- Bundled as a native addon in the Tauri app
Learn More
- Agent System -- How agents use media tools
- Desktop Application -- Overview and setup