Upload to Multi-Actor Projects
Watch to learn how to use Multi-Actor projects
NOTE: If you are dubbing one actor but have additional training footage, you'll need to click on “Multi-Actor Projects”. There is no separate section to upload training footage for “Single-Actor Projects”.
Step 1: Upload to Multi-Actor Projects
Hi, I'm Jordan. I'm the Associate Product Manager for onboarding at LipDub, and I'm here to help you get onboarded to the platform. Thank you so much for choosing us. Now let's get started. If you have any questions after watching this video, check out our Help Center, which I've linked in the description below; it goes into much more detail, while this video is just a brief introduction. Also, don't hesitate to reach out to support@lipdub.ai. We're happy to help.
Okay, so when you get started with LipDub, your projects will probably be pretty empty. So let's create a new one here. You can name your project; we'll call this Project 1.1. What is the source language of the video you'd like to lip dub? For my content, I believe it is Russian. And what language do you want to lip dub into? For me, it's English.
Now you'll be asked to select this checkbox, which just says you have the legal rights to lip dub and lip sync the actors found in this content. So now we can select “Create Project.” When you first open your project, you'll see this UI, and if you look on the left-hand screen, every project is broken up into different scenes. Within every scene, there are four main steps you need to go through before you can actually generate a lip dub video result. This is what I'm going to walk you through.
Before I walk you through these four steps, let me just briefly explain what a scene is. LipDub studies every frame where a person's face appears on screen. A person's face on screen can vary widely. For example, if I upload a full feature film like Mission: Impossible, the person's face may be on a mountain, in a deep dark cave, or in a nightclub with flashing lights. These are all very different lighting environments and backgrounds, and visually the actor's face will look very different even though it's the same actor. That's why content is broken up into different scenes.
For films and complex content, this is by no means a hard limit. You can freely upload a video with multiple scene changes, lighting changes, and complex content all in one scene; it just increases the chance of artifacts appearing in your result. In my case, though, the content I'm working on is just one scene, so I'm going to keep it that way. Even though the camera angle changes and the camera moves a bit, it's still the same scene, same location, and same lighting, so I'll keep it all under one scene.
Let me upload the footage I want to lip dub. Great, so now I've uploaded the one video that I want to lip sync. If I wanted to add more videos, I could. I do want to mention that there are two sections within the upload video step. There's "Lipdub Footage," where you upload the videos you want to lip sync, and then there's "Additional Footage for Training." This is an important distinction, because we assume you do not want the training videos to be lip synced. They're purely extra data that helps the platform study the frames of each actor's face when we build a dedicated model for every actor you want to lip sync. It's by no means necessary if you have enough data in your lipdub footage, but it's encouraged if you don't.
Let's say the lipdub footage is only 15 seconds long; that is below our recommended amount. The recommended amount of data is one to two minutes per camera angle, per actor. The point of providing extra data on top of the clip you want to lip dub is that LipDub studies every frame in which an actor appears on screen and how their face looks. If I only give LipDub a 15-second commercial to sync, that's not a lot of data, especially if the actor only says one line; that might be only two seconds of actual talking. Uploading an extra minute of footage adds more context about what that actor's mouth should look like and makes it easier to recreate when we generate a result.
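To make the guideline concrete, here is a tiny sketch of the arithmetic. The function name and threshold are illustrative only, not part of the LipDub platform:

```python
# Lower bound of the recommended 1-2 minutes per actor, per camera angle.
RECOMMENDED_SECONDS = 60

def training_gap(lipdub_seconds, extra_seconds=0):
    """How many more seconds of footage this actor/angle needs; 0 if enough."""
    return max(0, RECOMMENDED_SECONDS - (lipdub_seconds + extra_seconds))

print(training_gap(15))      # a 15-second commercial alone: 45 seconds short
print(training_gap(15, 60))  # plus one minute of training footage: 0
```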
Before uploading your footage, it's really important to understand the limitations and where LipDub struggles to lip dub a face. Side profiles can be difficult to lip dub, and high-texture features such as big beards can be hard to replicate with 100% accuracy, especially in closeups. These limitations, along with the preferred video specifications for the LipDub platform, are all detailed in our help center, which I will link below.
So that's the first step: upload footage. Now we wait for it to finish processing. Okay, I'm back. It's been about five to ten minutes, and my one-minute video has finished processing on the platform. I don't see any errors, and I can play the video back; it's still in the source language I uploaded. I can now move on to the second step.
Step 2: Label Actors
So once you've uploaded your footage, this is the next step: label actors. As soon as you upload your videos to LipDub, the platform will automatically detect the face of every person that appears on screen.
You can imagine if there is a crowd scene and there are a thousand faces in the background and you just have the main speaker, it can take quite a long time for the platform to identify each and every person. That’s why uploading videos with lots of faces in the background could take longer than a video with only a few clearly identifiable faces and no background clutter.
It's important to know that automatic face detection and tracking sometimes get it wrong. The platform might even miss some face tracks. That's why it's crucial for you to quickly double-check and QC the footage, confirming, for example, that all the face tracks belong to the correct person. If everything looks right for one actor, you can move on and do the same for the others.
Now, I want to make something clear: the machine learning (ML) technology doesn't always get it 100% right. Sometimes a face track is missed, and it's up to you as the end user to assign it to the correct person. That's why we've given you the functionality to assign tracks manually.
For example, you can go to the "unassigned tracks" section from the dropdown menu. You’ll be able to select the tracks of a particular person’s face and assign them. If the dropdown is empty, it’s likely because you haven’t labeled any actors yet. So go ahead and label one—say, call him Charlie. Once you label the identity as Charlie, it won’t automatically merge all of his other face tracks, but you can select them from the dropdown and assign them to Charlie. That will group all of Charlie’s faces together.
If you label someone else—say, Fred—you’ll then be able to assign Fred’s face tracks in the same way. Select the face in "unassigned tracks," assign it, and Fred’s name will appear in the dropdown list. Confirm the assignment, and the facial track will move from the unassigned section to Fred’s grouping in the UI.
Now, let’s say you accidentally assign Charlie’s face to Fred’s group. You’ll see Charlie’s face under Fred’s group, which is incorrect. To fix it, simply remove the selected face from Fred’s group, and it will return to the unassigned section. Then, reassign it to Charlie to correct the grouping.
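The labeling workflow above can be pictured as moving track IDs between an unassigned pool and named groups. This is a toy model for intuition, with hypothetical names; it is not the LipDub data model:

```python
# Toy model of the label-actors workflow: face tracks stay "unassigned"
# until you create an identity and assign tracks to it by hand.
class ActorLabels:
    def __init__(self, detected_tracks):
        self.unassigned = set(detected_tracks)
        self.actors = {}                    # actor name -> set of track ids

    def label(self, name):
        # Labeling creates an identity; its tracks are merged manually.
        self.actors.setdefault(name, set())

    def assign(self, track, name):
        self.unassigned.discard(track)
        # Also pull the track out of any wrong grouping before reassigning.
        for tracks in self.actors.values():
            tracks.discard(track)
        self.actors[name].add(track)

    def remove(self, track, name):
        # Removing a track sends it back to the unassigned section.
        self.actors[name].discard(track)
        self.unassigned.add(track)

labels = ActorLabels(["t1", "t2", "t3"])
labels.label("Charlie")
labels.label("Fred")
labels.assign("t1", "Charlie")
labels.assign("t2", "Charlie")   # oops, t2 is actually Fred's face
labels.remove("t2", "Charlie")   # back to unassigned
labels.assign("t2", "Fred")      # corrected grouping
print(labels.actors)             # {'Charlie': {'t1'}, 'Fred': {'t2'}}
print(labels.unassigned)         # {'t3'}
```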
It’s also important to know that when you move on to the third step—"Train AI"—you will only see the actors you have labeled. If someone isn’t labeled, you won’t be able to train the model on them. That’s why it’s essential to name and label all your actors before continuing.
To do that, label the actor and click "Add." Now all the face tracks under that grouping will be recognized as one identity—say, Mabel—and you’ll be able to train the AI to recognize her.
So that’s the "label actors" step. Now you understand that as soon as you upload your video, just do a quick QC of all the tracked faces. If there are more than three, quickly go through the ones that matter. You don’t need to review all of them—just the ones you plan to lip sync.
If you only want to lip sync three characters and there are a hundred face detections, you only need to label and name the three faces you’re using.
Perfect. That is the second step: label actors.
Step 3: Train AI
Great. So now you have all the actors that you want to lip sync, you've labeled them correctly, and you've double checked that all of their facial tracks are exactly them, with no other actors cluttering their data.
You can then go to Train AI. This is the third step and a really important one, because this is where we actually kick off the models that train on every frame in which these actors appear on screen, so that their face textures can be recreated when we go to generate the result. In this Train AI section, please select all the actors you want to lip sync.
Generally, you only need to label the actors whose mouths you want to lip sync and change. So normally, you would just select all and train all. But in some cases you only want to train models for a couple of them and lip sync just those few actors; in that case, you don't actually need to label the rest. If you do want to train only a couple, you can select them individually, or select them all, and click Train.
Training the AI models for every actor is the longest step in the LipDub process. It takes about four to six hours on average to train a model for each actor. The good news is that training jobs run in parallel, so even with three actors it won't take six hours each sequentially; all three models train at the same time.
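The parallel-versus-sequential point can be sketched with a toy simulation. The durations and Python tooling here are purely illustrative, not how LipDub actually schedules training jobs:

```python
# Toy illustration: jobs run concurrently, so wall-clock time is driven by
# the slowest actor, not the sum. "Hours" are scaled down to milliseconds.
from concurrent.futures import ThreadPoolExecutor
import time

def train(actor, hours):
    time.sleep(hours * 0.001)   # simulate an `hours`-long training job
    return actor

jobs = {"Mabel": 6, "Fred": 5, "Charlie": 4}
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    done = list(pool.map(train, jobs.keys(), jobs.values()))

print(done)   # ['Mabel', 'Fred', 'Charlie']
# Total time is roughly max(6, 5, 4) scaled, not 6 + 5 + 4.
```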
Once I click Train, a popup shows how many credits each of these actors will cost me. I simply acknowledge the cost and confirm to start training.
Now, keep in mind that these models are specific to these actors, Mabel, Fred, and Charlie, and they fall under the umbrella of Scene One. If I create another scene called Scene Two, I will need to upload new footage and new training data, and no actors will carry over. These models and videos are specific to this scene: I cannot reuse Mabel's and Fred's models from Scene One in Scene Two.
Step 4: Upload Audio & Generate
Okay, I'm back. It's been about six and a half hours and I got an email notification saying that all of my actors have completed training. You can see by this green check mark that this is correct. All three of my actors—Mabel, Fred, and Charlie—have been trained, and I can move on to the next step.
Just keep in mind, if you see any of your actors with a red X next to their name and a message saying training cannot be completed, feel free to reach out to our support team. We're happy to help.
In this section, we select the video that we want to lip dub. For my use case, it's pretty straightforward: I just have one video, so I can select it and play it to see the source video. If you uploaded multiple videos at once, you'll see multiple options here in the lipdub footage section. You will not see the training footage you uploaded in this section, only the lipdub videos.
Once you've selected the video you want to lip sync, scroll down and choose the language you want to lip sync into. This was built for localization, so you can add languages after your project has been created. This project was originally in Russian, so I can add English, Arabic, Korean, etc. If I simply select “Add,” I’ll add them all as options.
Now, let’s say I have my French audio. I can select each actor, click “Upload Audio,” and upload the audio track for that actor. I can listen back to make sure the French audio is what I've uploaded, and then double-check that the timing actually syncs with the video I want to lip sync—this is really important.
There are three things to keep in mind when uploading audio to LipDub. First, it should be a dialogue stem with no background noise or sound effects. LipDub is highly sensitive and will lip dub any and all sounds present in the audio file. That's why it's important to have only the person's voice, with no background sounds or music.
Second, the audio should be individualized for each actor. That means separate audio stems for Mabel, Fred, and Charlie. For example, Mabel’s track should contain only her speaking throughout the clip, even if Fred and Charlie also speak in the video. Same for Fred and Charlie—each track should contain just that actor’s lines.
Third, the length of the audio should match the length of the video you want to lip dub. LipDub doesn't automatically align an actor's audio with when they begin speaking in the video. If Mabel starts speaking ten seconds into the video but her audio file starts at zero, she'll start speaking at zero in the dubbed video, which is incorrect. That's why it's important to upload an audio file that matches the length of the video and has the proper timing already built in.
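If your dialogue stems start at zero, you can prepend silence yourself before uploading. A minimal sketch using Python's standard-library wave module; the file names and offsets are made up for illustration:

```python
import wave
import struct

def pad_with_silence(src_path, dst_path, offset_seconds):
    """Prepend silence so the dialogue starts at the right timestamp."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    n_silent = int(params.framerate * offset_seconds)
    silence = b"\x00" * (n_silent * params.sampwidth * params.nchannels)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + frames)

# Build a 5-second mono 16-bit stand-in stem for Mabel's line...
with wave.open("mabel_raw.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<h", 1000) * 16000 * 5)

# ...and shift it so she starts speaking 10 seconds into the video.
pad_with_silence("mabel_raw.wav", "mabel_timed.wav", 10.0)
with wave.open("mabel_timed.wav", "rb") as w:
    print(w.getnframes() / w.getframerate())   # 15.0
```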
Once you’ve uploaded the audio files for every actor, you can click the “Generate Result” button. After about 15 to 30 minutes, the video will be finished generating. You’ll then be able to view the video you’ve generated, give it a rating, and download it off the platform.
That's how you use LipDub. If you have any questions, don't hesitate to reach out. We're happy to help. I'll link our Help Center down below.