Media Generation
Generate images and videos with AI using multiple providers including BytePlus, OpenAI, and Google Gemini.
The desktop app includes a media generation service that lets agents create images and videos using multiple AI providers.
This page covers media the agent creates for you. To browse existing self-hosted libraries or stock catalogs, see Cloud Storage.

Image Generation
Supported Providers
| Provider | Model | Key Features |
|---|---|---|
| BytePlus | Seedream | High-quality, fast generation |
| OpenAI | DALL-E / GPT-Image | Versatile generation and edits |
| Google | Imagen / Gemini | Multimodal understanding; Nano Banana when enabled |
How It Works
Image generation runs inline: the agent requests an image and receives the result in the same step, with no background job. The agent can:
- Describe the desired image in natural language
- Specify aspect ratio, style, and mood
- Receive the generated image directly
- Save it to the workspace or Library
Using Image Generation
Images can be generated in two ways:
- Through the agent: The agent uses media generation tools during task execution
- Directly: Use the image generation feature in the web platform
Video Generation
Supported Providers
| Provider | Model | Key Features |
|---|---|---|
| BytePlus | Seedance 2.0 (default) | Fast motion generation with first/last-frame anchoring |
| OpenAI | Sora | High-quality video generation |
| Google | Veo | Multimodal video creation |
New BytePlus setups default to Seedance 2.0 Fast for quicker turnaround at 720p. The earlier Seedance 1.5 and 1.0 models remain available in Settings > Providers > BytePlus if you prefer them.
How It Works
Video generation runs in the background because it takes longer:
- The agent submits a video generation request
- You're notified immediately that it's being processed
- The system checks for completion automatically (up to 1 hour)
- When ready, the video is saved and available in the Library
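The background flow above can be pictured as a polling loop with a deadline. The following is a minimal sketch, not the app's actual implementation: `wait_for_video`, the `check_status` callback, and the 10-second interval are all hypothetical; only the 1-hour ceiling comes from the text above.

```python
import time
from typing import Callable, Optional


def wait_for_video(
    check_status: Callable[[], Optional[str]],
    timeout_s: float = 3600.0,   # "up to 1 hour", as stated above
    interval_s: float = 10.0,    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll check_status until it returns a result, or give up at the deadline.

    check_status is a hypothetical callback that returns the saved video's
    path once the provider reports completion, and None while still pending.
    """
    deadline = clock() + timeout_s
    while True:
        result = check_status()
        if result is not None:
            return result
        # Stop if the next check would fall past the deadline.
        if clock() + interval_s > deadline:
            return None
        sleep(interval_s)
```

The `clock` and `sleep` parameters are injectable so the loop can be exercised in tests without waiting in real time.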
Reference Images
Supply a single reference image to preserve identity and style, or supply both a first-frame and a last-frame image to have Seedance 2.0 animate the transition between them. Providing only a last-frame image is not supported and will be ignored with a warning.
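The reference-image rules can be summarized as a small decision function. This is an illustrative sketch only: `plan_reference_images`, `ReferencePlan`, and the mode names are hypothetical, and it assumes that a lone image in the first slot is treated as the single reference image.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferencePlan:
    mode: str                          # "none", "reference", or "first_last"
    first_image: Optional[str] = None
    last_image: Optional[str] = None


def plan_reference_images(
    first_image: Optional[str] = None,
    last_image: Optional[str] = None,
) -> ReferencePlan:
    """Decide how the supplied images are used, following the rules above."""
    if first_image and last_image:
        # Both frames: animate the transition between them.
        return ReferencePlan("first_last", first_image, last_image)
    if first_image:
        # Assumption: a single image acts as the identity/style reference.
        return ReferencePlan("reference", first_image)
    if last_image:
        # Last frame alone is unsupported: warn and ignore it.
        warnings.warn("last-frame image supplied without a first frame; ignoring it")
    return ReferencePlan("none")
```

For example, `plan_reference_images("start.png", "end.png").mode` evaluates to `"first_last"`, while passing only `last_image` emits a warning and falls back to `"none"`.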
Provider Configuration
Configure media providers in Settings > General:
- Enter the API key for your preferred provider
- The app automatically detects which providers are available
- Tools are registered based on configured providers
You can use multiple providers simultaneously, and switching providers requires no other changes.
Image edits that use a reference image can take longer than simple generation. The app allows a longer timeout for OpenAI image requests and bounds reference-image fetching so that a failed source cannot block the request indefinitely.
Graceful Degradation
If no media provider is configured:
- Media generation tools are not shown to the agent
- The agent can still perform all other tasks
- No errors are thrown -- the feature is simply unavailable
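Provider detection and graceful degradation can be sketched together: tools are derived from whichever API keys are present, and an empty configuration simply yields no tools. The provider keys and tool names below (`byteplus`, `generate_image`, and so on) are hypothetical placeholders, not the app's real identifiers.

```python
from typing import Dict, List, Mapping, Optional

# Hypothetical mapping from provider setting name to the tools it enables.
PROVIDER_TOOLS: Dict[str, List[str]] = {
    "byteplus": ["generate_image", "generate_video"],
    "openai": ["generate_image", "generate_video"],
    "google": ["generate_image", "generate_video"],
}


def available_providers(api_keys: Mapping[str, Optional[str]]) -> List[str]:
    """Return providers that have a non-empty API key configured."""
    return [name for name in PROVIDER_TOOLS if (api_keys.get(name) or "").strip()]


def register_media_tools(api_keys: Mapping[str, Optional[str]]) -> List[str]:
    """Return the de-duplicated tool names to expose to the agent.

    With no providers configured this returns an empty list rather than
    raising, so the agent just never sees the media tools.
    """
    tools: List[str] = []
    for provider in available_providers(api_keys):
        for tool in PROVIDER_TOOLS[provider]:
            if tool not in tools:
                tools.append(tool)
    return tools
```

Under these assumptions, `register_media_tools({})` returns an empty list, and adding a key for any one provider is enough to expose both tools.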
Voice I/O
The desktop app also supports voice interaction:
Text-to-Speech (TTS)
Hear agent responses spoken aloud:
- Sentence-by-sentence streaming for natural pacing
- Multiple providers: ElevenLabs, Azure, Google Cloud, OpenAI, MiniMax, and local models
- Works offline with bundled local models
- MiniMax supports a language boost setting for clearer multilingual pronunciation
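Sentence-by-sentence streaming can be illustrated with a buffer that releases each sentence as soon as its terminator arrives, so speech can start before the full response is complete. This is a simplified sketch with a hypothetical `sentences_from_stream` helper; it splits on `.`, `!`, and `?` followed by whitespace and does not handle abbreviations or non-Latin punctuation.

```python
import re
from typing import Iterable, Iterator

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from incrementally arriving text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # All parts except the last are complete; the last may be partial.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would then be handed to the configured TTS provider in order.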
Speech-to-Text (STT)
Dictate your prompts:
- Real-time streaming transcription
- Multiple providers: Azure, Google Cloud, OpenAI Whisper, and local models
- Automatic language detection
Local Speech Models
For offline use, the app includes built-in speech models:
- No API key required
- Works without internet
- Bundled with the desktop app
Learn More
- Agent System -- How agents use media tools
- Cloud Storage -- Browse and attach existing media from connected libraries
- Publish Pipeline -- Publish generated media to configured destinations
- Voice and Speech -- Transcription and text-to-speech workflows
- Desktop Application -- Overview and setup