Written by
Asrah Mohammed
Videos are ultimately about storytelling, and they rely on strong imagery and footage to tell those stories well. While a script provides the narrative framework and direction, the visuals are what engage an audience and resonate with them emotionally.
Of course, compiling and assembling those visuals takes resources. It may mean sourcing images and footage from stock photo and video libraries, hiring a professional photographer or videographer to capture custom content, or some combination of both. Even then, most creative production requires someone to select the right assets and work them into the narrative.
We wanted to simplify all that with AI. More specifically, with a model called BLIP (Bootstrapping Language-Image Pre-training).
We incorporated a computer vision model to condense what could be an expensive, cumbersome effort into a step handled in seconds.
BLIP uses a process known as image-text retrieval to save both time and expense. Guided by the text our LLM generates, it sifts through the images pulled in during web scraping and slots them in at the right points in the final commercial. This saves hours of manual work organizing images and handles the storyboarding process end to end.
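To give a sense of what image-text retrieval looks like in code, here is a minimal sketch using an open-source BLIP image-text matching checkpoint from Hugging Face. The model name (Salesforce/blip-itm-base-coco) and the rank_images helper are illustrative assumptions, not our production pipeline:

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

# Assumed open-source checkpoint for illustration; not necessarily the model we run in production.
processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
model.eval()

def rank_images(scene_description: str, image_paths: list[str]) -> list[tuple[float, str]]:
    """Score each candidate image against one scene description and return them best-first."""
    scored = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, text=scene_description, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # itm_score holds logits for (no match, match); softmax turns them into a match probability.
        match_prob = torch.softmax(outputs.itm_score, dim=1)[0, 1].item()
        scored.append((match_prob, path))
    return sorted(scored, reverse=True)

# Hypothetical usage: rank scraped images against one beat of the generated script.
# best = rank_images("a barista pouring latte art", ["frame1.jpg", "frame2.jpg"])
```

Calling a helper like this once per scene in the script yields a ranked shortlist of scraped images for each beat, which is the step that replaces hours of manual storyboarding.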
Even with BLIP assembling the imagery, we wanted to ensure the widest possible variety of choices. That's why we supplement the tailored assets BLIP assembles with fully licensed footage and stock image libraries. Users can choose to add these when refining their video in the final stages. It's the best of both worlds: immediate access to tailored and complementary creative collections that users can combine at will.