Media Generation

Generate images and videos with AI using multiple providers including BytePlus, OpenAI, and Google Gemini.

The desktop app includes a media generation service that lets agents create images and videos using multiple AI providers.

This page covers media the agent creates for you. To browse existing self-hosted libraries or stock catalogs, see Cloud Storage.

Figure: DesignMode image project setup with model and aspect ratio controls. Media generation projects expose image and video controls directly in DesignMode before the agent starts work.

Media generation starts from the same structured project panel, with image-specific model and aspect controls.

Image Generation

Supported Providers

Provider	Model	Key Features
BytePlus	Seedream	High-quality, fast generation
OpenAI	DALL-E / GPT-Image	Versatile generation and edits
Google	Imagen / Gemini	Multimodal understanding; Nano Banana when enabled

How It Works

Image generation completes inline -- the agent requests an image and receives it as soon as it is ready, without a background job. The agent can:

Describe the desired image in natural language
Specify aspect ratio, style, and mood
Receive the generated image directly
Save it to the workspace or Library

Using Image Generation

Images can be generated in two ways:

Through the agent: The agent uses media generation tools during task execution
Directly: Use the image generation feature in the web platform

Video Generation

Supported Providers

Provider	Model	Key Features
BytePlus	Seedance 2.0 (default)	Fast motion generation with first/last-frame anchoring
OpenAI	Sora	High-quality video generation
Google	Veo	Multimodal video creation

New BytePlus setups default to Seedance 2.0 Fast for quicker turnaround at 720p. Existing Seedance 1.5 and 1.0 models remain available in Settings > Providers > BytePlus if you prefer them.

How It Works

Video generation runs in the background because it takes longer:

The agent submits a video generation request
You're notified immediately that it's being processed
The system checks for completion automatically (up to 1 hour)
When ready, the video is saved and available in the Library
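
The background flow above can be pictured as a polling loop with a deadline. The sketch below is illustrative only and is not the app's actual implementation: `poll_until_done`, `check_status`, and the 10-second interval are hypothetical names and values, and only the 1-hour ceiling comes from this page.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    check_status: Callable[[], Optional[str]],
    interval_s: float = 10.0,      # assumed polling interval, not documented
    max_wait_s: float = 3600.0,    # the "up to 1 hour" ceiling described above
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until check_status returns a result or max_wait_s elapses.

    check_status is a hypothetical callable that returns the finished
    video's location, or None while the job is still running.
    """
    deadline = clock() + max_wait_s
    while True:
        result = check_status()
        if result is not None:
            return result          # job finished: hand back the video location
        if clock() >= deadline:
            return None            # gave up: the job did not finish in time
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters is a design choice that lets the loop be exercised in tests without actually waiting.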

Reference Images

Supply a single reference image to preserve identity and style, or supply both a first-frame and a last-frame image to have Seedance 2.0 animate the transition between them. Providing only a last-frame image is not supported and will be ignored with a warning.
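
The three cases above (single reference, first and last frame, last frame only) can be expressed as a small validation step. This is a minimal sketch under stated assumptions: the function name, the mode strings, and the choice to pass a single reference image in the `first` slot are all hypothetical and not taken from the app.

```python
import warnings
from typing import List, Optional, Tuple


def resolve_reference_mode(
    first: Optional[str], last: Optional[str]
) -> Tuple[str, List[str]]:
    """Decide how reference images are used for a video request.

    Returns a (mode, images) pair. The mode names are illustrative only.
    """
    if first and last:
        # Both anchors supplied: animate the transition between them.
        return "first_last_transition", [first, last]
    if first:
        # One image supplied: use it to preserve identity and style.
        return "single_reference", [first]
    if last:
        # Last frame alone is unsupported: warn and drop it.
        warnings.warn("last-frame image supplied without a first frame; ignoring it")
    return "text_only", []
```

For example, `resolve_reference_mode(None, "end.png")` emits a warning and falls back to a request with no reference images.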

Provider Configuration

Configure media providers in Settings > General:

Enter the API key for your preferred provider
The app automatically detects which providers are available
Tools are registered based on configured providers

You can use multiple providers simultaneously, and switching providers requires no other changes.
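
One way to picture key-based detection and tool registration is the sketch below. It is an assumption-laden illustration, not the app's code: the provider identifiers, the tool names `generate_image` and `generate_video`, and both function names are hypothetical.

```python
from typing import Dict, List, Mapping, Optional

# Hypothetical mapping from provider to the tools it would enable.
PROVIDER_TOOLS: Dict[str, List[str]] = {
    "byteplus": ["generate_image", "generate_video"],
    "openai": ["generate_image", "generate_video"],
    "google": ["generate_image", "generate_video"],
}


def available_providers(api_keys: Mapping[str, Optional[str]]) -> List[str]:
    """A provider counts as available when it has a non-empty API key."""
    return [name for name in PROVIDER_TOOLS if (api_keys.get(name) or "").strip()]


def register_tools(api_keys: Mapping[str, Optional[str]]) -> Dict[str, List[str]]:
    """Map each tool name to the configured providers that can serve it."""
    tools: Dict[str, List[str]] = {}
    for provider in available_providers(api_keys):
        for tool in PROVIDER_TOOLS[provider]:
            tools.setdefault(tool, []).append(provider)
    return tools
```

With no keys configured, `register_tools({})` returns an empty mapping, so no media tools would be exposed to the agent and nothing raises.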

Image edits that use a reference image can take longer than simple generation. The app allows a longer timeout for OpenAI image requests and bounds reference-image fetching, so a failed source cannot block a request indefinitely.

Graceful Degradation

If no media provider is configured:

Media generation tools are not shown to the agent
The agent can still perform all other tasks
No errors are thrown -- the feature is simply unavailable

Voice I/O

The desktop app also supports voice interaction:

Text-to-Speech (TTS)

Hear agent responses spoken aloud:

Sentence-by-sentence streaming for natural pacing
Multiple providers: ElevenLabs, Azure, Google Cloud, OpenAI, MiniMax, and local models
Works offline with bundled local models
MiniMax supports a language boost setting for clearer multilingual pronunciation
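
Sentence-by-sentence streaming can be sketched as a buffer that releases each sentence as soon as its terminator arrives. This is a simplified illustration, not the app's implementation: `stream_sentences` is a hypothetical name, and the punctuation-based split is a naive assumption that will mis-split abbreviations such as "e.g.".

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation (naive rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from incrementally arriving text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next chunk
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends
```

Each yielded sentence could then be handed to a speech provider while later text is still arriving, which is what gives the spoken output its pacing.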

Speech-to-Text (STT)

Dictate your prompts:

Real-time streaming transcription
Multiple providers: Azure, Google Cloud, OpenAI Whisper, and local models
Automatic language detection

Local Speech Models

For offline use, the app includes built-in speech models:

No API key required
Works without internet
Bundled with the desktop app

Learn More

Agent System -- How agents use media tools
Cloud Storage -- Browse and attach existing media from connected libraries
Publish Pipeline -- Publish generated media to configured destinations
Voice and Speech -- Transcription and text-to-speech workflows
Desktop Application -- Overview and setup
