Agentic Video Editing Without the Multimodal Token Tax

How browser-use's video-use compiles raw footage using structured transcripts, targeted visual checks, and self-evaluating code agents.

Priya Nair
AI & Developer Experience Writer · Jul 4, 2026 · 5 min read

Feeding raw video directly into a large multimodal model is a recipe for high latency and massive API bills. At 30 frames per second, a single minute of video is 1,800 frames, and roughly 17 minutes of footage reaches 30,000. If an agent tries to process every one of those frames at roughly 1,500 tokens apiece, a simple editing task quickly balloons into a 45-million-token nightmare. Worse, the model still lacks the frame-accurate precision required to make clean cuts.
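
The 45-million-token figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 1,500 tokens-per-frame value is the rough estimate quoted above, and the function names are hypothetical.

```python
# Back-of-the-envelope cost of sending every frame to a multimodal model.
# TOKENS_PER_FRAME is the article's rough estimate, not a measured value.
TOKENS_PER_FRAME = 1_500


def frame_count(duration_s: float, fps: float = 30.0) -> int:
    """Number of frames in a clip of the given duration."""
    return round(duration_s * fps)


def naive_token_cost(duration_s: float, fps: float = 30.0) -> int:
    """Tokens consumed if every single frame is passed to the model."""
    return frame_count(duration_s, fps) * TOKENS_PER_FRAME


if __name__ == "__main__":
    print(frame_count(60))         # frames in one minute at 30 fps
    print(naive_token_cost(60))    # tokens for that one minute
    print(naive_token_cost(1000))  # tokens for 1,000 s (about 17 minutes)
```

At 30 fps, 30,000 frames corresponds to about 1,000 seconds of footage, which is where the 45-million-token total comes from.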

The open-source project video-use, developed by the team behind Browser Use, bypasses this token trap entirely. Instead of forcing an LLM to watch raw footage, it treats video editing as a compilation problem. By decoupling the audio transcript from the visual timeline, the system allows lightweight coding agents, such as Claude Code, to edit complex video files using structured text, targeted visual queries, and deterministic rendering engines.

The Two-Layer Architecture: Text First, Visuals on Demand

To keep token counts low and execution speeds high, the system splits video representation into two distinct layers. The LLM primarily interacts with a highly compressed text format, pulling in visual data only when it needs to resolve specific ambiguities.

flowchart TD
    A[Raw Footage] --> B[ElevenLabs Scribe]
    B --> C[takes_packed.md ~12KB]
    C --> D[LLM Reasoning]
    D -->|On-Demand Visuals| E[timeline_view PNG]
    E --> D
    D --> F[Edit Decision List]
    F --> G[FFmpeg Render]
    G --> H[Self-Eval Loop]
    H -->|Issue Found| D
    H -->|Pass| I[final.mp4]

Layer 1: The Audio Transcript

Audio is treated as the primary editing surface. When raw footage is dropped into the project directory, the system runs a single transcription pass using ElevenLabs Scribe. This pass generates word-level timestamps, speaker diarization, and markers for non-verbal audio events like laughter, applause, or sighs.

These details are packed into a single markdown file, typically around 12KB, called takes_packed.md. This file serves as the agent's primary view of the media:

## C0103 (duration: 43.0s, 8 phrases)
[002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
[006.08-006.74] S0 We fixed this.

Because the agent has precise word-level timestamps, it can plan cuts directly on the text. It can identify and strip out filler words, silence gaps, and false starts without ever looking at a single pixel.
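
To make the idea of planning cuts on text concrete, here is a minimal sketch that parses lines in the shape shown in the excerpt and reports the silences between them. It assumes the `[start-end] speaker text` layout from the sample; the `Phrase` class, the function names, and the 0.5-second threshold are hypothetical and are not taken from the project.

```python
import re
from dataclasses import dataclass

# Matches lines shaped like "[002.52-005.36] S0 Some text." (layout assumed
# from the excerpt; the real file may contain other line types).
PHRASE_RE = re.compile(r"^\[(\d+\.\d+)-(\d+\.\d+)\]\s+(S\d+)\s+(.*)$")

SAMPLE = """\
## C0103 (duration: 43.0s, 8 phrases)
[002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
[006.08-006.74] S0 We fixed this.
"""


@dataclass
class Phrase:
    start: float
    end: float
    speaker: str
    text: str


def parse_phrases(packed: str) -> list[Phrase]:
    """Extract timestamped phrases, skipping headers and blank lines."""
    phrases = []
    for line in packed.splitlines():
        m = PHRASE_RE.match(line.strip())
        if m:
            phrases.append(Phrase(float(m[1]), float(m[2]), m[3], m[4]))
    return phrases


def find_gaps(phrases: list[Phrase], min_gap_s: float = 0.5) -> list[tuple[float, float]]:
    """Return (gap_start, gap_end) pairs for silences of at least min_gap_s."""
    gaps = []
    for prev, nxt in zip(phrases, phrases[1:]):
        if nxt.start - prev.end >= min_gap_s:
            gaps.append((prev.end, nxt.start))
    return gaps


if __name__ == "__main__":
    print(find_gaps(parse_phrases(SAMPLE)))
```

On the two sample lines, the only candidate is the pause between the end of the first phrase (5.36 s) and the start of the second (6.08 s).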

Layer 2: On-Demand Visual Composites

An agent cannot edit on audio alone. It needs to check for visual jumps, verify framing, and ensure that cuts do not happen mid-blink. To solve this without dumping raw frames, the system introduces a tool called timeline_view.

When the agent reaches a decision point, such as verifying a cut boundary, it requests a targeted visual composite for a specific time range. The system generates a single PNG containing a filmstrip, an audio waveform, and word labels. The agent inspects this composite to make its final decision. This approach mirrors how browser-use provides an LLM with a structured DOM instead of a continuous stream of raw screenshots.
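
The article does not describe how timeline_view chooses which frames go into the filmstrip. As one plausible approach, the sketch below picks evenly spaced sample times across the requested range; the function name and the default of eight frames are assumptions made for illustration.

```python
def filmstrip_timestamps(start_s: float, end_s: float, n_frames: int = 8) -> list[float]:
    """Evenly spaced sample times across [start_s, end_s], endpoints included.

    Hypothetical helper: the real tool may sample frames differently.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames to span a range")
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    step = (end_s - start_s) / (n_frames - 1)
    return [round(start_s + i * step, 3) for i in range(n_frames)]


if __name__ == "__main__":
    # Five thumbnails around a one-second cut boundary.
    print(filmstrip_timestamps(5.0, 6.0, n_frames=5))
```

Sampling a fixed, small number of frames per request keeps the visual cost of each check bounded regardless of how long the source clip is.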

The Compilation Pipeline and Self-Evaluation

Once the agent decides on an editing strategy, it does not write video bytes directly. Instead, it generates an Edit Decision List (EDL) and compiles it using ffmpeg.
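
The project's actual EDL format is not shown here, so the following is only a sketch of the general technique: turning a list of keep-ranges into an ffmpeg `filter_complex` graph built from the standard `trim`, `atrim`, `setpts`, `asetpts`, and `concat` filters. The `Cut` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cut:
    """One kept range of the source clip, in seconds (hypothetical EDL entry)."""
    start: float
    end: float


def build_filter_complex(cuts: list[Cut]) -> str:
    """Trim each range from input 0, reset timestamps, then concatenate."""
    parts = []
    labels = []
    for i, c in enumerate(cuts):
        parts.append(f"[0:v]trim=start={c.start}:end={c.end},setpts=PTS-STARTPTS[v{i}]")
        parts.append(f"[0:a]atrim=start={c.start}:end={c.end},asetpts=PTS-STARTPTS[a{i}]")
        labels.append(f"[v{i}][a{i}]")
    # concat expects each segment's video stream followed by its audio stream.
    parts.append(f"{''.join(labels)}concat=n={len(cuts)}:v=1:a=1[outv][outa]")
    return ";".join(parts)


def build_ffmpeg_args(src: str, cuts: list[Cut], dst: str) -> list[str]:
    """Argument list suitable for subprocess.run; nothing is executed here."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-filter_complex", build_filter_complex(cuts),
        "-map", "[outv]", "-map", "[outa]",
        dst,
    ]


if __name__ == "__main__":
    edl = [Cut(2.52, 5.36), Cut(6.08, 6.74)]
    print(" ".join(build_ffmpeg_args("C0103.mp4", edl, "final.mp4")))
```

Because the EDL is plain data and the render step is deterministic, the same list of cuts always produces the same output, which is what makes the later re-render loop practical.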

To ensure the output meets professional standards, the pipeline enforces several strict engineering rules:

  • Audio Smoothing: The system automatically applies 30ms audio fades at every cut boundary to eliminate digital pops and clicks.
  • Automated Color Grading: Every segment is processed through customizable ffmpeg chains to normalize color profiles across different takes.
  • Parallel Sub-Agents: For complex additions like animation overlays, the system spawns parallel sub-agents using tools like Manim, Remotion, or PIL. Each sub-agent renders its asset independently before the main agent composites them onto the timeline.
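
As an illustration of the audio smoothing rule, the sketch below builds an ffmpeg `afade` chain that fades a segment in and out over 30 ms. It assumes the segment's timestamps have already been reset to start at zero (for example with `asetpts=PTS-STARTPTS`); the function name is hypothetical, and the project's own filter chain may differ.

```python
FADE_S = 0.030  # 30 ms, the fade length cited in the list above


def fade_chain(segment_duration_s: float, fade_s: float = FADE_S) -> str:
    """Build an afade filter chain for one segment whose audio starts at t=0."""
    if segment_duration_s <= 2 * fade_s:
        raise ValueError("segment is too short to hold both fades")
    # The fade-out must begin fade_s seconds before the segment ends.
    out_start = round(segment_duration_s - fade_s, 3)
    return (
        f"afade=t=in:st=0:d={fade_s},"
        f"afade=t=out:st={out_start}:d={fade_s}"
    )


if __name__ == "__main__":
    # A 2.84-second segment, e.g. the 2.52 s to 5.36 s phrase.
    print(fade_chain(2.84))
```

Fades this short are inaudible as volume changes but remove the waveform discontinuity that causes a click at a hard cut.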

Crucially, the pipeline includes a self-evaluation loop. Before presenting the final video to the user, the agent runs timeline_view on the rendered output at every cut boundary. It inspects the transitions for visual jumps, audio pops, or misaligned subtitles. If it detects an issue, it adjusts the EDL, re-renders, and evaluates again, repeating this loop up to three times if necessary.
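
The control flow of that loop can be sketched in a few lines. This is a generic render, inspect, revise pattern rather than the project's code: the `render`, `inspect`, and `revise` callables are hypothetical stand-ins, and reading "up to three times" as three revision passes after the first render is an assumption.

```python
from typing import Any, Callable

MAX_PASSES = 3  # assumed meaning of "up to three times"


def render_with_self_eval(
    edl: Any,
    render: Callable[[Any], Any],
    inspect: Callable[[Any], list],
    revise: Callable[[Any, list], Any],
    max_passes: int = MAX_PASSES,
) -> tuple[Any, Any, list]:
    """Render, inspect, and revise until inspection passes or passes run out.

    Returns (final_edl, final_output, remaining_issues). An empty issue list
    means the last render passed inspection.
    """
    output = render(edl)
    issues = inspect(output)
    passes = 0
    while issues and passes < max_passes:
        edl = revise(edl, issues)   # adjust the edit decision list
        output = render(edl)        # deterministic re-render
        issues = inspect(output)    # re-check every cut boundary
        passes += 1
    return edl, output, issues
```

Returning the remaining issues, rather than raising when the budget is exhausted, lets the caller decide whether to present the result with caveats or ask the user for guidance.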

The Developer Angle: Integrating Agentic Media Workflows

For developers building internal media tools or automated content pipelines, this project offers a highly modular blueprint. It runs locally and integrates directly into existing terminal-based agents.

Installation and Setup

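To set up the environment, you need Python, ffmpeg, and an ElevenLabs API key for the transcription pass. The commands below also assume the uv package manager and, for the ffmpeg install, Homebrew on macOS; on other platforms, install ffmpeg with your system's package manager. Clone the repository and symlink it into your agent's skills directory: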

git clone https://github.com/browser-use/video-use ~/Developer/video-use
ln -sfn ~/Developer/video-use ~/.claude/skills/video-use

cd ~/Developer/video-use
uv sync
brew install ffmpeg

Next, copy the environment template and add your API credentials:

cp .env.example .env

Inside the .env file, configure your transcription provider:

ELEVENLABS_API_KEY=your_api_key_here

Running an Editing Session

Once the skill is registered, you can start an editing session by pointing your agent at a directory of raw footage. The agent will read the files, inventory the assets, and propose an editing strategy:

cd /path/to/your/raw/footage
claude

During the session, you can issue high-level instructions in natural language:

"Edit these takes into a 60-second launch video. Cut out the long pauses, add warm cinematic color grading, and burn in uppercase subtitles."

The agent writes its progress and state to a local project.md file. This file persists session memory, allowing you to stop the agent and resume the editing session later without losing context or re-transcribing the source files.

A Pragmatic Shift in Media Automation

This architecture marks a practical shift in how developers can approach AI-driven media creation. Rather than waiting for massive, expensive multimodal models to natively process hours of video, the framework shows that combining lightweight text models with targeted visual feedback and deterministic command-line tools is a more efficient path.

While it will not replace human editors for high-end creative storytelling, it is well suited to structured, repetitive video tasks. For developers building automated product tutorials, internal training videos, or programmatic social media clips, this text-first compilation pipeline offers a fast, cost-effective, and predictable alternative to manual work in traditional video editing suites.

