This post is by Iain Anderson from ProVideo Coalition

Over the last year or so, the world has gone AI-mad, and many people across the video industry are worried that their jobs will be replaced by a computer. Here, I’ll tell you why that — with a few exceptions — probably won’t happen. Along the way, we’ll find out what AI can do today, what its strengths are, and what it’s really not good at.

Defining several kinds of AI

Rather than listing a ton of acronyms, here I’ll focus on the uses you’re likely to find for AI today and in the near future. They are:

  • Generative AI, which creates new content based on existing content or prompts. This encompasses text output from ChatGPT, videos created from prompts by Sora, scripts written without human input, Content-Aware Fill in stills and video in Adobe apps, and artificial voiceovers that imitate humans, to name just a few examples.
  • Recognition AI, which recognizes data or patterns and then presents it somehow. This is one of the main tasks chatbots and AI-enhanced search engines perform today.
  • Automation AI, which performs actions on your behalf. Tesla’s (not really) Full Self Driving cars fall into this category, as do the promised future services that may eventually be provided by the Rabbit R1 or Humane AI pin.

There is of course some crossover between these categories, such as when you ask ChatGPT to make an image for you, or the generative component involved in preparing an English-language summary of a recognition task. Still, the categories are useful for understanding the broad capabilities and limitations of AI.

You may have also heard the term Machine Learning, a more specific term referring to how computers are trained to perform tasks. When a computer is trained to recognize what a person looks like, or how a good paragraph of text should be structured, that process uses machine learning. Since most of the new AI-based tools use machine learning, the terms are used somewhat interchangeably, but I won’t dwell on the distinction here.

Generative AI attracts the most attention, both positive and negative, so let’s start there.

Generative AI

Adobe implemented generative AI for images a long time ago, and the most obvious early breakthrough was the Content-Aware Fill feature (also found in the Spot Healing Brush) back in Photoshop CS5. The ability to simply paint over part of an image and have it filled in automatically with “something that looks right” most of the time was a big deal back then, and has changed the way images are processed. With video-friendly Content-Aware Fill in After Effects, and equivalent features in other plug-ins and apps, we can quickly accomplish what would have seemed like dark sorcery a decade ago.

Fast forward to today’s Photoshop, which embraces GenAI more fully, and just about anyone can create novel content simply by describing what they want. Content-Aware Fill was no less magical on its debut, but it wasn’t nearly so accessible — generative AI is widespread and available from many providers. The next Premiere Pro will include not just Adobe’s own GenAI tech, but links to third-party offerings.

The main reason GenAI has so captivated the public is that it’s approachable, it’s novel, and it enables people who aren’t skilled in creating images to create something — even if it’s not perfect. GenAI is still much better at stills than it is at video, but its weaknesses are exposed more easily when used to create a moving image. Here’s an assessment of the state of the art, and it’s not always pretty.

One massive problem is due to the way these models work: they are predictive. If you train a model by showing it thousands of pictures of dogs, and then ask for a picture of a dog, it’s going to be able to do it. But when you ask for the same dog moving at 24 frames per second, issues around persistence crop right up. The models don’t really understand how the world works, they’re just trying to imitate what they’ve seen before. The more you ask of them, the more they fail. The more you know what an image should look like, the more you see wrong with it.
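
The “predictive imitation” idea can be sketched in a few lines of Python: a toy word-level model that only records which word followed which in its training text, then samples from those counts. It has no concept of dogs or motion; it just replays statistics. This is a deliberately tiny stand-in of my own, not how any production model is actually built, but the failure mode is the same in spirit:

```python
import random
from collections import defaultdict

def train(text):
    # The entire "model" is a table of which word followed which
    # in the training text -- pure statistics, no understanding.
    follows = defaultdict(list)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current].append(nxt)
    return follows

def generate(follows, start, length=5, seed=0):
    # Predict each next word by sampling from what followed it before.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        out.append(rng.choice(options))
    return " ".join(out)

model = train("the dog runs and the dog sleeps and the cat runs")
print(generate(model, "the"))
```

Every pair of words it emits appeared somewhere in its training data, which is exactly why the output looks locally plausible while having no larger structure behind it.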

This is not a photo of a friendly dog

GenAI outputs today are usually superficial, looking OK to the untrained eye but not to anyone paying attention. They feel undirected and empty. People in these models don’t have emotions, or a reason they’re doing what they’re doing. There’s no structure beneath the fakery, and that’s going to be a difficult or impossible problem to fix. The Sora-made diving example that plays first in the video below is proudly shown off, but the diver isn’t moving remotely correctly, and there are no bubbles being released. It’s a poor imitation made by an AI that doesn’t understand reality, and it doesn’t stand up to deeper scrutiny.

@openai: “Saturdays are for the new Sora drops. These videos were generated by our text-to-video model, Sora, without modification.
Prompt 1: a scuba diver discovers a hidden futuristic shipwreck, with cybernetic marine life and advanced alien technology
Prompt 2: a man BASE jumping over tropical hawaii waters. His pet macaw flies alongside him
Prompt 3: Close-up of a majestic white dragon with pearlescent, silver-edged scales, icy blue eyes, elegant ivory horns, and misty breath. Focus on detailed facial features and textured scales, set against a softly blurred background
Prompt 4: a red panda and a toucan are best friends taking a stroll through santorini during the blue hour
Prompt 5: a dark neon rainforest aglow with fantastical fauna and animals
What would you like to see us make with Sora next? *Sora is not yet available to the public. We’re sharing our research progress early to learn from feedback and give the public a sense of what AI capabilities are on the horizon.”

But it’s not all bad. If you focus a GenAI model on a specific task, clearly defining what to change and how to change it, a model can do really well. On the other hand, creating an entire video from a text prompt is a party trick that doesn’t scale well, because an AI doesn’t even really understand what’s going on in a single frame. We can certainly expect a host of AI-generated short films and low-quality ads, but it’s not going to replace actual long-form video.

Capabilities aside, ethics is another issue which most GenAI creators have put to one side. If you’re training your models on copyrighted material, without the express permission of its creators, many people already don’t want to watch the resulting content. And while it’s certainly possible to train a model exclusively on copyright-cleared material (as Adobe has claimed to do), if your tools put people out of work, you’ll still put some people offside. Unions have power. Hollywood went on strike for good reason, and there will be a line drawn in the sand somewhere.

Recognition AI

Placed firmly in the middle of what Michael Cioni’s Strada calls “Utility AI” is AI-based transcription, and tasks like this are where AI excels. I’ve used AI-based transcription to re-do captions created by live human captioners, and there’s simply no contest — the AI did a better job in far less time than a human could. A human is definitely still needed, because you can’t tell how someone spells their name just by hearing it. Meanwhile, I can use another AI-based recognition engine (built into the Mac) to copy text that’s written in a video or a photo.

I would love to see more automatic AI-based classification of video content, very much like the “Tanalyze” that Strada is showing off, but most NLEs simply aren’t ready for the time-based metadata that these systems can generate. Final Cut Pro’s keyword system has been ready since 10.0, but the other major players don’t have a good UI to show this. And fair play to Strada — their system does a better job of showing exactly where all the detected keywords are than any NLE.

Color correction and grading with help from AI is possible today using Colourlab.ai, making a complex job easier. Similarly, AI-based people recognition algorithms in plug-ins like Keyper or Resolve’s Magic Mask are making new keying workflows possible. Similar tech is even in modern cameras, helping them to recognize and then auto-focus on people or animals.

Another boring job that few people are paid to do is summarizing, and AIs are great at it. Again, they don’t need to do a perfect job, because no human is going to do that job better, and certainly not in the few seconds that an AI takes. The speed at which an AI can process information and connect dots that humans simply cannot means that this isn’t going away any time soon.

Accessibility is one place in which nobody’s going to question the utility of AI. If a blind person can now hold up a camera to a scene, and ask an AI model to describe it to them, that’s wonderful. If an AI model provides automatic, accurate, live captioning for a Deaf person, that’s similarly life-changing.

There are many tasks in the post-production space which Recognition AI can help with: color correction, audio cleanup, classification, clip syncing, and more. Many of these are tasks an Assistant Editor might perform, but an AI will make their job easier, allowing them to do more in less time rather than replacing them.

Automation AI

Siri, Alexa and Google Assistant are three examples of some form of AI acting as your assistant. They’re often useful for specific tasks, but because they were built on an older model, their capabilities are limited and they often fail to complete tasks. As the newer chat-based AIs appear to be far more capable, there’s an expectation that we’ll soon have much smarter assistants.

So far, that hasn’t happened. ChatGPT and models like it don’t connect to the internet, they don’t have access to your personal data, and they won’t do things for you. Partly, this is because Apple and Google hold the keys to your data, and they’re going to be the ones to ask your permission to connect an AI to it. But partly, it’s because this is a hard problem that hasn’t yet been solved.

Lots of promises, not all kept — a common thread in AI

The Rabbit R1 is a device which promised to be able to do many kinds of automation. Without a direct link to your computer, the advertised plan at launch was to use the device’s camera to take a picture of your screen or even a hand-drawn table on a piece of paper, then do something smart with that data, and output a new spreadsheet. None of that works yet. Of course, the process would work so much more smoothly if it was all performed on-device, with local access to your data, but that’s not possible.

In such a system, the potential for faster workflows is huge, but so too is the potential for an AI to make a mistake with your data, or worse, with your money. Apple and others have plans to teach an AI to navigate a UI on your behalf, so we could yet connect these dots, but this is a more difficult problem than many think it is.

Why? Because AIs make mistakes that humans generally don’t.

Imperfection is the common thread

If an AI is imperfect, and they’re all imperfect, you can only trust them on limited tasks. Trusting an AI to plan your holiday would be like trusting a junior travel agent without supervision, and yet “booking your holiday for you” is precisely the kind of example being used to sell AI assistants today. Worse, because AIs don’t learn like humans do, it’s often impossible to find out why a mistake was made, or to stop that mistake being made again. Sometimes these mistakes are trivial, but often they’re more serious.

For example, from time to time I pop into a chatbot to ask it the answer to a question I know the answer to, such as (please excuse the plug) “Who wrote Final Cut Pro Efficient Editing?”. This is a book I wrote in 2020, well before the 2022 knowledge cutoff of many of these models, and this is information easily found by Google. But ChatGPT fails almost every time, confidently presenting nonsense as truth, returning a random collection of other authors, colleagues and strangers as the author of my book. So far, so weird, and though Google’s latest Gemini 1.5 Pro does actually get this right, if you can’t be sure you’re being told the truth, there’s no point asking the question.

This is the right answer, but chatbots have told me that Diana Weynand, Michael Wohl, Jeff Greenberg and others wrote my book

Generative AI for images and videos often creates nonsense, too. The examples posted online are the best of the best, picked by humans from a huge collection of mostly-bad outputs. If you’re a consumer wanting to make a cartoon image of your kid, great — you’ll be able to create a “good enough” option pretty quickly. For concept art and previz, GenAI will probably save you time. But if you need something perfect, photo real, and quite specific, you may never get there.

Worse, while AI has improved a whole lot recently, the pace of change has slowed markedly. It’s become very expensive to train AI models, and as a model grows, you need more and more examples to improve that model. Eventually, this becomes impossible, due to the expense, or simply because it’s too hard to keep teaching new concepts to the same machine learning model. We are reaching hard limits. That’s bad news for a self-driving car, bad news for a client who wants to create a TV series by asking for it, but good news for video professionals. AI can still be useful.

What’s an imperfect AI good for?

Automation AIs that handle boring, repetitive tasks are great, because not everyone can write a macro or a batch script. Generative AIs that can replace unwanted objects in shots are great; this is something we can do already, and AI would just make it faster and better. AIs that can translate text or speech into other languages improve accessibility for most humans. AIs that improve image quality can restore films that most people don’t want to watch any more, or remove glitches from damaged recordings. AIs that can categorize and classify huge data sets can make them far more accessible — and there’s way too much data out there for humans to do the job.
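
To make the “boring, repetitive” point concrete, here’s a minimal sketch of the kind of batch script meant here: sequentially renaming a set of clips. The `plan_renames` helper and its naming scheme are my own illustration, not from any particular tool — the point is simply that this is the tedious, rule-based work that automation (scripted or AI-driven) takes off an editor’s plate:

```python
from pathlib import Path

def plan_renames(files, prefix="clip", pad=3):
    # Map each source file to a sequentially numbered name,
    # keeping the original extension and the sorted order.
    mapping = {}
    for i, name in enumerate(sorted(files), start=1):
        mapping[name] = f"{prefix}_{i:0{pad}d}{Path(name).suffix}"
    return mapping

renames = plan_renames(["b-roll 2.mov", "b-roll 10.mov", "interview.mov"])
print(renames)
```

Note the plain string sort puts “b-roll 10” before “b-roll 2” — even trivial automation has sharp edges, which is exactly why a human still reviews the result.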

Happily, these things are happening already. I’m using Peakto’s latest AI update to classify hundreds of thousands of images based on their contents, making my family’s life in photos far easier to explore. A host of AI features are turning up in NLEs, including audio transcription, classification and extension, speech enhancement and soon content-aware fill in Premiere Pro. DaVinci Resolve has transcription, magic mask, noise reduction, voice isolation and audio remixing, which can isolate stems in music beds. Final Cut Pro has voice isolation, automatic color correction, noise reduction, object tracking and (soon!) super slow-mo — and of course, all these apps will evolve as they compete with one another over time. Third-party apps like TopazAI and plug-ins from many companies will fill in the gaps, using AI a little or a lot; whatever gets the job done.

Do we have to be worried?

AI will definitely be used to make low-end content, but it can’t do a good job on its own. Flashy effects don’t write stories, and the AIs that do write stories don’t write good ones. In every creative field, we will still need experts to make good content. Consistency over time is important. A clear direction is important. If you need to tell a story about a real person or a real thing for a client, that’s going to be best done by a human, because there’s a huge gulf between creating a 10-second concept and anything longer. Consumers and scummy advertisers will continue to make junk content (AKA slop) with AI, but these are not jobs we were going to get anyway. It’s a whole new low end for the market.

Disruption from progress isn’t new, and while it absolutely happens, it rarely happens in the way people fear it will. Canva templates have made it easier for anyone to create better-looking birthday party invitations, but they haven’t made it any easier to make annual reports, and designers are still employed to create them. 

However, if your job is boring, or repetitive, or you’re making content that people don’t value, then yes, AI could disrupt it. If your job is to create concept art that people won’t see in a finished product, then yes, I would expect the industry to need fewer people to do that job. Pre-viz is a perfect example of where “near enough” probably is “good enough”, but it’s also something more people might do if it was more affordable. More people will be making pre-viz, but they won’t be drawing each frame by hand.

Voice generation sits in a similar place. Real voices are better, because even good AI voices make occasional weird mistakes, but if you use them to fill in a few missing words, nobody will notice, not even Mr Beast. Again, the low end of the market might use artificial voices for entire videos, but they were never going to pay you to help them.

One area that’s at least a brief shining light for generative AI is abstract video art like music videos. This music video is pretty interesting, but it doesn’t show techniques that would be useful in a less abstract context.