It sees the video.
Not another OpusClip. This one sees the video: gemma4-31b reads a sampled frame every 2.5 seconds, whisper-large-v3 catches every word, and the crop tracks the speaker's face across every shot. Each clip gets a six-axis score and a per-platform virality read, and the pipeline refuses to ship a mediocre one.
Three ideas set this apart from every OpusClip-shaped tool.
Every 2.5 seconds a frame is sent to the vision model. Face bounding box, shot type, gesture, emotion — all fed back into the crop so the speaker never drifts out of frame.
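To make the sampling-and-crop loop concrete, here is a minimal sketch, not the pipeline's actual code. The names `Box`, `sample_times`, and `crop_for_face` are hypothetical, and the 9:16 full-height crop is an assumption. It shows which timestamps would be sampled at a 2.5-second interval and how a face bounding box could steer a vertical crop that stays inside the frame.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float  # left edge in pixels
    y: float  # top edge in pixels
    w: float  # width in pixels
    h: float  # height in pixels


def sample_times(duration_s: float, interval_s: float = 2.5) -> list[float]:
    """Timestamps (seconds) at which a frame would go to the vision model."""
    count = int(duration_s // interval_s) + 1
    return [round(i * interval_s, 3) for i in range(count)]


def crop_for_face(face: Box, frame_w: int, frame_h: int,
                  aspect: float = 9 / 16) -> Box:
    """Full-height crop centred horizontally on the face, clamped to the frame.

    Assumption: the output is a vertical 9:16 window; only the face's
    horizontal position is used here.
    """
    crop_w = min(frame_w, frame_h * aspect)
    center_x = face.x + face.w / 2
    left = min(max(center_x - crop_w / 2, 0), frame_w - crop_w)
    return Box(left, 0, crop_w, frame_h)
```

A speaker near the right edge of a 1920x1080 frame gets a crop pinned to that edge rather than one that runs off it. A real tracker would probably also smooth the crop position between samples so it does not jump every 2.5 seconds.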
Six intrinsic axes plus a platform read. TikTok, Shorts, and Reels get different scores because the platforms behave differently. The pipeline tells you where the clip actually belongs.
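One way to picture the platform read is as a re-weighting of the same six axis scores. The sketch below is illustrative only: the axis names and weights are invented for the example and are not the product's real rubric, and `platform_scores` and `best_platform` are hypothetical helpers.

```python
# Hypothetical axis names; the real six axes are not listed here.
AXES = ("hook", "clarity", "emotion", "pacing", "payoff", "visuals")

# Made-up emphasis per platform; any axis not listed gets weight 1.
PLATFORM_WEIGHTS = {
    "tiktok": {"hook": 3, "pacing": 2},
    "shorts": {"clarity": 2, "payoff": 2},
    "reels": {"visuals": 3, "emotion": 2},
}


def platform_scores(axes: dict[str, float]) -> dict[str, float]:
    """Weighted average of the six axis scores, computed once per platform."""
    scores = {}
    for platform, boosts in PLATFORM_WEIGHTS.items():
        weights = {name: boosts.get(name, 1) for name in AXES}
        total = sum(weights.values())
        scores[platform] = sum(axes[n] * weights[n] for n in AXES) / total
    return scores


def best_platform(axes: dict[str, float]) -> str:
    """Platform with the highest weighted score."""
    scores = platform_scores(axes)
    return max(scores, key=scores.get)
```

With these invented weights, a clip that is strong on hook and pacing but weak elsewhere lands on TikTok, while a clip that is even across all six axes scores the same on every platform.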
Reject-first prompt. Rather ship one strong clip than five mid ones. Intros, filler, sponsor reads, throat-clearing — all trimmed automatically before you see the result.
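The reject-first idea lives in the prompt, but the same policy can be sketched as plain post-processing. Everything named here is an assumption made for illustration: the segment labels, the `Segment` type, the `trim_filler` and `select_clips` helpers, and the 8.0 cut-off.

```python
from dataclasses import dataclass

# Hypothetical labels for material that should never reach the final clip.
FILLER = {"intro", "filler", "sponsor", "throat_clearing"}


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # "content" or one of the FILLER labels


def trim_filler(segments: list[Segment]) -> list[Segment]:
    """Drop every segment whose label marks it as filler."""
    return [s for s in segments if s.label not in FILLER]


def select_clips(scored: dict[str, float], min_score: float = 8.0) -> list[str]:
    """Reject by default: keep only clips at or above the bar, best first."""
    keep = [name for name, score in scored.items() if score >= min_score]
    return sorted(keep, key=lambda name: scored[name], reverse=True)
```

Note that `select_clips` can return an empty list. Under a reject-first policy, shipping nothing is a valid outcome when no candidate clears the bar.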