If you are building an AI agent that produces video ads, you have two real problems the API itself does not solve for you: how to expose endpoints to the model as tools, and how to handle the fact that video renders are asynchronous and take 30–180 seconds. This piece walks through both, with concrete schemas for Claude and OpenAI, a webhook-based async pattern, and the engine-routing logic that determines whether your agent looks like a pro or like a toy.
What kind of agent we are building
Concretely: a Slack or chat-based agent that takes a brief like "Generate me 10 ads for this Shopify product, mix of testimonial and demo, TikTok-ready" and returns finished MP4 URLs in 5–10 minutes. The agent calls the UGC Copilot API for everything — persona, script, image (via Nano Banana 2 (Gemini)), render across Sora 2, Veo 3.1, Kling 3.0, or Seedance 2.0, and overlays — and waits for completion via webhooks.
Tool schema design
Each public endpoint becomes a tool the model can call. You do not need to expose all 26 — for an ad-producing agent, six or seven cover the surface. Here are the tools that matter, with their action types and credit cost notes:
- generate_persona → proxyGenerateIdealInfluencer (free, text-only)
- generate_script → proxyGenerateViralScript (1 credit)
- regenerate_script_tone → proxyRegenerateScriptWithTone (1 credit)
- generate_scene_image → proxyGenerateSceneImage via Nano Banana 2 (1 std / 2 hq)
- start_video_render → proxyStartVideoGeneration (engine-dependent, 18–130 credits)
- check_video_status → proxyCheckVideoStatus (free, polling)
- apply_text_overlay → proxyApplyTextOverlay (1 credit)
Claude tool definition example
{
"name": "start_video_render",
"description": "Start an asynchronous video render. Returns an operationName immediately. Use webhooks or check_video_status to retrieve the finished MP4. Engine selection rules: 'seedance' for cheap volume tests, 'sora' for dialogue/lip-sync, 'kling' for image-to-video from a fixed reference, 'veo' for cinematic hero spots.",
"input_schema": {
"type": "object",
"required": ["engine", "quality", "duration", "scriptText", "sceneImageUrl"],
"properties": {
"engine": { "type": "string", "enum": ["sora", "veo", "kling", "seedance"] },
"quality": { "type": "string", "enum": ["std", "hq"] },
"duration": { "type": "integer", "minimum": 4, "maximum": 30 },
"scriptText": { "type": "string" },
"sceneImageUrl": { "type": "string", "format": "uri" },
"personaId": { "type": "string" },
"idempotencyKey": { "type": "string" }
}
}
}
OpenAI function tool definition
{
"type": "function",
"function": {
"name": "start_video_render",
"description": "...same as above...",
"parameters": { ...same schema... }
}
}
The schema is the same for both providers; only the wrapper shape differs.
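If you keep each schema in one place, producing both wrappers is a mechanical transform. A minimal sketch; the helper names here are our own, not part of either provider's SDK:

// One shared JSON Schema, two provider wrapper shapes (helper names are our own)
function toClaudeTool(name, description, schema) {
  return { name, description, input_schema: schema };
}

function toOpenAITool(name, description, schema) {
  return {
    type: 'function',
    function: { name, description, parameters: schema },
  };
}

// Usage: const claudeTools = toolDefs.map(t => toClaudeTool(t.name, t.description, t.schema));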
The engine-routing system prompt
This is the prompt fragment that turns an agent from "uses the cheapest engine for everything" into "actually picks the right tool." Drop it into your agent's system message:
When choosing a video engine, follow these rules:
- Default to Seedance 2.0 std for the test phase of any campaign,
drop-shipper variant tests, and any batch > 20 videos.
- Use Sora 2 std for ads where dialogue and lip-sync matter
(testimonials, POV creator content, monologue hooks).
- Use Kling 3.0 std when the brief includes a specific reference
image that must be matched closely (image-to-video).
- Use Veo 3.1 std for hero spots, longer cuts (15s+), and any
cinematic/branded content where fixed-cost-regardless-of-duration
works in our favor.
- Reserve HQ tiers for the 1–2 winning variants from a test batch.
This kind of explicit engine logic in the system prompt is the difference between an agent that produces output and one that produces good output. Without it, every model defaults to the engine name it has seen most often in training data, which is almost never the right pick for the brief.
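If you want a hard guardrail behind the prompt, the same rules compress into a deterministic router you can use to validate or override the model's pick. A sketch; the field names on the brief object are illustrative, not part of the API:

// Mirrors the system-prompt routing rules; `brief` field names are illustrative.
function routeEngine(brief) {
  if (brief.phase === 'test' || brief.batchSize > 20) return 'seedance';
  if (brief.referenceImageUrl) return 'kling';   // fixed reference image to match
  if (brief.hasDialogue) return 'sora';          // lip-sync / testimonial content
  if (brief.durationSeconds >= 15 || brief.cinematic) return 'veo';
  return 'seedance';                             // cheap default
}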
The async problem
Video generation is the only step that does not return synchronously. Typical render times by engine in practice:
- Seedance 2.0 std — 20–45 seconds (fastest)
- Sora 2 std — 35–90 seconds
- Kling 3.0 std — 45–120 seconds
- Veo 3.1 std — 60–180 seconds (slowest)
A naive agent calls start_video_render, then loops on check_video_status in the same conversation turn. This works for 1–3 videos. It breaks for 10+ because the agent thread is now blocked, the user has no feedback, and one stuck render stalls the entire batch.
The right pattern is webhook-driven. The agent submits all renders, persists the operationNames to its memory store, and ends the turn. When webhooks arrive, your backend updates the agent's memory and sends a follow-up message ("3 of 10 ads are ready"). The user gets progress; the agent keeps moving.
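In code, the submit phase of that pattern looks roughly like this; the `ugc` client and `agentMemory` store are placeholders for your own implementations:

// Submit everything, persist operationNames, end the turn without polling.
async function submitBatch(variants) {
  for (const v of variants) {
    const { operationName } = await ugc.startVideoRender({
      engine: v.engine,
      quality: v.quality,
      duration: v.duration,
      scriptText: v.script,
      sceneImageUrl: v.imageUrl,
      idempotencyKey: v.idempotencyKey,
    });
    await agentMemory.save({ operationName, engine: v.engine, status: 'pending' });
  }
  // The webhook handler, not this function, flips status to completed/failed.
}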
Webhook setup
Register webhooks under Profile → Webhooks. You can register up to 5 active endpoints per user. The signing scheme mirrors Stripe: each delivery includes an X-Webhook-Signature header in the form t=<unix_seconds>,v1=<hex_hmac_sha256>. The HMAC input is ${t}.${rawBody}, signed with the whsec_-prefixed secret you got at endpoint creation. Replay window: ±5 minutes. Retries: 1m, 5m, 30m, 2h, 6h (5 attempts after the initial — 6 total, ~9h ceiling). The events:
- video.completed — event envelope includes id, type, apiVersion; data fields include operationName, engine, modelName, creditCost, and videoUrl
- video.failed — same envelope; data fields include operationName, engine, modelName, creditCost, and error
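Based on those fields, a video.completed delivery looks roughly like this (all values illustrative):

{
  "id": "evt_123",
  "type": "video.completed",
  "apiVersion": "...",
  "data": {
    "operationName": "op_abc",
    "engine": "sora",
    "modelName": "...",
    "creditCost": 20,
    "videoUrl": "https://..."
  }
}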
// Express webhook handler — Stripe-style signature with timestamp
import express from 'express';
import crypto from 'crypto';

const app = express();
const WEBHOOK_SECRET = process.env.UGC_WEBHOOK_SECRET; // the whsec_-prefixed secret

app.post('/api/ugc-callbacks', express.raw({ type: '*/*' }), async (req, res) => {
  // Parse "t=<unix_seconds>,v1=<hex_hmac_sha256>" into { t, v1 }
  const header = req.header('X-Webhook-Signature') || '';
  const parts = Object.fromEntries(header.split(',').map(p => p.split('=')));
  const { t, v1 } = parts;
  if (!t || !v1) return res.status(401).end();

  // Reject anything older than 5 minutes (replay protection)
  if (Math.abs(Math.floor(Date.now() / 1000) - Number(t)) > 300) {
    return res.status(401).end();
  }

  // HMAC input is `${t}.${rawBody}`, signed with the whsec_ secret
  const rawBody = req.body.toString('utf8');
  const expected = crypto.createHmac('sha256', WEBHOOK_SECRET)
    .update(`${t}.${rawBody}`)
    .digest('hex');

  // timingSafeEqual throws on length mismatch, so check length first
  const ok = v1.length === expected.length &&
    crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(v1));
  if (!ok) return res.status(401).end();

  const event = JSON.parse(rawBody);
  if (event.type === 'video.completed') {
    // operationName and videoUrl live under event.data (see the envelope above)
    await agentMemory.markVideoReady(event.data.operationName, event.data.videoUrl);
    await maybeSendProgressUpdate(event.data.operationName);
  }
  // Handle video.failed the same way: mark the record failed, refresh credits.
  res.status(200).end();
});
Idempotency for retried tool calls
LLM agents retry tool calls. If a network blip makes the agent retry start_video_render, you do not want a double charge. The API supports an Idempotency-Key header — same key, same response, no second deduction. TTL is 24 hours.
The right pattern: derive the key deterministically from the agent turn and the variant index. Something like ${agentSessionId}_${turnId}_v${variantIndex}. The agent retries; the API returns the cached response; you do not get billed twice.
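Concretely (sketch; the endpoint URL is a placeholder, the header name is per the API docs above):

// Deterministic key: a retried call reuses the key and gets the cached response
const idempotencyKey = `${agentSessionId}_${turnId}_v${variantIndex}`;

const res = await fetch('https://api.example.com/proxyStartVideoGeneration', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Idempotency-Key': idempotencyKey, // honored for 24 hours
  },
  body: JSON.stringify(renderParams),
});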
Concurrency etiquette
Tier concurrency caps (3 / 8 / 25 in-flight slots for Creator / Pro / Business) apply per API key, not per agent. If your agent submits 20 renders simultaneously on a Pro key (8 slots), the 9th onward will fail with a concurrency-exhausted 429.
Two patterns:
- Submit-and-wait queue. The agent's tool implementation maintains a small queue (set queue depth = your tier's slot count). Tool calls return immediately, but new submissions wait until a slot frees.
- Backpressure on 429. Catch concurrency 429s, respect Retry-After, and resubmit (see the sketch below). Simpler, slightly slower than #1, fine for most agents.
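A sketch of the backpressure pattern; `submit` stands in for whatever issues your HTTP call and returns a fetch-style Response:

// Retry concurrency 429s, honoring Retry-After.
async function submitWithBackpressure(submit, params, maxAttempts = 10) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await submit(params);
    if (res.status !== 429) return res;
    const waitSec = Number(res.headers.get('retry-after')) || 5; // default 5s if absent
    await new Promise(resolve => setTimeout(resolve, waitSec * 1000));
  }
  throw new Error('concurrency slots still exhausted after retries');
}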
Read-only endpoints (check_video_status, list_twins, etc.) do not acquire concurrency slots — your status polls cannot starve out renders.
Memory pattern: persisting state across turns
Agents lose context between conversations. To handle a long-running batch (kicked off Monday, completed Tuesday), persist three things in your memory store:
- operationName per submitted variant, plus the engine and quality used.
- idempotencyKey per submission, in case the agent retries the same variant.
- status updated by the webhook handler. Either pending, completed, or failed.
When the user comes back and asks "how are my ads doing?", the agent reads from memory rather than re-calling status endpoints for every video. Cheaper, faster, and survives agent-process restarts.
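One record per variant is enough. The shape below is our own convention, not mandated by the API:

// Example memory record for one variant
const record = {
  operationName: 'op_abc',             // from start_video_render
  engine: 'sora',
  quality: 'std',
  idempotencyKey: 'sess123_turn4_v0',  // deterministic, per the pattern above
  status: 'pending',                   // 'pending' | 'completed' | 'failed'
  videoUrl: null,                      // filled in by the webhook handler
};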
End-to-end agent loop
Putting it together. User sends "Generate 10 ads for this Shopify product, TikTok, mix of testimonial and demo." The agent:
- Calls generate_persona with the product brief.
- Calls generate_script for two base scripts (one testimonial, one demo).
- Calls regenerate_script_tone four times each, producing 10 total scripts.
- Calls generate_scene_image 10 times in parallel via Nano Banana 2 (Pro tier: 8 in-flight cap, so at least one batch waits).
- Picks the engine for each render based on the system-prompt rules — testimonials go to Sora 2, demo shots with a fixed reference go to Kling 3.0, edge cases default to Seedance 2.0 std for cost.
- Submits all 10 renders with deterministic Idempotency-Key values.
- Persists operationNames to memory; ends the turn with "Submitted 10 renders, will message back in 5–10 minutes."
- Webhook handler receives video.completed events, updates memory, applies overlays, posts the finished MP4 URLs to the original chat thread.
Total agent thinking time: under 60 seconds. Total wall-clock time: 5–10 minutes for the renders to complete. Total credit cost: roughly 250–350 credits depending on engine mix ≈ $17–$28 at the volume pack rate.
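Condensed into code, the loop reads like this. Every client call is a placeholder for your own tool implementations, and routeEngine and submitBatch are the sketches from earlier sections:

// End-to-end sketch: brief in, pending renders out, webhooks finish the job.
async function runCampaign(brief, session) {
  const persona = await ugc.generatePersona({ brief });
  // Two base scripts, four tone variations each => 10 scripts
  const bases = await Promise.all([
    ugc.generateScript({ persona, style: 'testimonial' }),
    ugc.generateScript({ persona, style: 'demo' }),
  ]);
  const scripts = [...bases];
  for (const base of bases) {
    for (const tone of ['casual', 'urgent', 'playful', 'expert']) { // illustrative tones
      scripts.push(await ugc.regenerateScriptTone({ base, tone }));
    }
  }
  // Scene images (throttle to your tier's in-flight cap in production)
  const variants = await Promise.all(scripts.map(async (script, i) => ({
    script: script.text,
    imageUrl: (await ugc.generateSceneImage({ persona, script })).imageUrl,
    engine: routeEngine({ ...brief, hasDialogue: script.style === 'testimonial' }),
    quality: 'std',
    duration: brief.durationSeconds,
    idempotencyKey: `${session.id}_${session.turnId}_v${i}`,
  })));
  await submitBatch(variants); // from the async-pattern sketch above
  return 'Submitted 10 renders, will message back in 5–10 minutes.';
}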
Where this gets hard
Some honest counterweights to keep in mind before you ship this to paying customers:
- Tool-calling drift. Models occasionally pass invalid engine values or skip required fields. Validate every tool input on your side before forwarding to the API; do not trust the model to honor the schema 100% of the time (see the sketch after this list).
- Persona consistency. If the agent re-generates a persona on every brief, your customer's brand recognition resets every conversation. Persist the AI Twin once per customer and reuse the twinId across all renders.
- Quality review still matters. Even with engine-selection rules, expect 10–15% of generated ads to need a re-render. Build a "regenerate this one" tool the agent can call when the user flags an output.
- Refunds happen. Failed renders auto-refund credits. If you display credit balance to the user, refresh after webhook events; do not deduct optimistically.
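A minimal validation guardrail, mirroring the tool schema from earlier (in production you would use a schema validator, but the check is this simple):

// Validate model-produced tool input before forwarding to the API
const VALID_ENGINES = new Set(['sora', 'veo', 'kling', 'seedance']);

function validateRenderInput(input) {
  const required = ['engine', 'quality', 'duration', 'scriptText', 'sceneImageUrl'];
  const missing = required.filter(k => input[k] == null);
  if (missing.length) throw new Error(`missing fields: ${missing.join(', ')}`);
  if (!VALID_ENGINES.has(input.engine)) throw new Error(`invalid engine: ${input.engine}`);
  if (input.duration < 4 || input.duration > 30) throw new Error('duration out of range');
  return input;
}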
Where to take this next
The agent pattern above is the technical baseline. The business question — what to charge for an agent like this and which workflows actually convert — is in How to Make Money with the UGC Copilot API. The lower-level pipeline that the agent ultimately calls is documented step-by-step in From Product URL to 50 Ad Variants.
Frequently Asked Questions
Should the agent decide engine selection, or should I hardcode it?
Both. Give the agent rules in the system prompt and let it pick — but expose an engine override on the start_video_render tool so an experienced operator can force a specific model. The hybrid works better than either extreme.
How many tools should I expose?
Six to eight is the sweet spot. More tools mean more model latency on each turn (longer schema = bigger system prompt). The seven listed in this article cover an ad-producing agent end-to-end.
Can I run the agent on Claude Haiku or GPT-4o-mini for cost?
Yes for the routing/orchestration layer. Engine selection and tool calling work fine on smaller models. Use a larger model (Claude Opus 4.7, GPT-4 Turbo) only for the script-quality review step if you want a model-as-judge gate before kicking off renders.
What happens if my webhook endpoint goes down?
UGC Copilot retries failed deliveries up to 6 times with exponential backoff (~9-hour ceiling). If you miss all retries, the videos still complete and are still retrievable via proxyCheckVideoStatus. Your agent should have a "reconcile pending videos" job that runs hourly as a backstop.
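A sketch of that backstop; the client and memory calls are placeholders, and the shape of the status response is an assumption:

// Hourly reconcile job: anything still pending gets a free status check
async function reconcilePendingVideos() {
  const pending = await agentMemory.findByStatus('pending');
  for (const rec of pending) {
    const status = await ugc.checkVideoStatus(rec.operationName); // free, no concurrency slot
    if (status.state === 'completed') {             // field name assumed
      await agentMemory.markVideoReady(rec.operationName, status.videoUrl);
    } else if (status.state === 'failed') {
      await agentMemory.markVideoFailed(rec.operationName, status.error);
    }
  }
}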
Can the agent do video analysis (e.g., "make me an ad similar to this competitor's")?
Yes — expose proxyAnalyzeReferenceVideo as a tool. It costs 3 credits standard or 4 credits in clone-deep mode. The output structures the reference into an internal brief the agent can pass straight into the script step.