Video Editing via Factorized Diffusion Distillation

Uriel Singer*, Amit Zohar*, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, Yaniv Taigman
Meta AI
ECCV 2024 (Oral)
*Equal Contribution
Teaser videos, each paired with an edit instruction: "In an aquarium" · "Sitting on a red bench" · "Set it in winter wonderland" · "Replace with a panda" · "In Minecraft style" · "Paint it pink and blue"

Abstract

We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters for video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We use it to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach for unlocking other capabilities, we align additional combinations of adapters.
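The distillation idea in the abstract, where a student is aligned by jointly receiving signals from an image-editing teacher (per-frame edit fidelity) and a video-generation teacher (temporal consistency), can be illustrated with a toy numerical sketch. Everything below is an illustrative assumption, not the paper's actual networks or objectives: the student is reduced to a single additive edit vector `theta` applied to every frame, and the frozen teachers are stand-in gradient oracles derived from simple quadratic losses.

```python
import numpy as np

# Toy sketch of factorized distillation, under heavy simplification.
# The "student" edits a video (a stack of frame vectors) by adding theta;
# two frozen "teachers" each supply a gradient signal that is distilled
# into the student jointly. target_edit and the 0.5 weighting are
# illustrative assumptions.

rng = np.random.default_rng(0)
num_frames, dim = 4, 8
frames_in = rng.normal(size=(num_frames, dim))   # input video frames
target_edit = frames_in + 1.0                    # per-frame edit the image teacher prefers
theta = np.zeros(dim)                            # student adapter parameters

def image_teacher_grad(edited):
    """Gradient pulling each edited frame toward the teacher's edit."""
    return edited - target_edit

def video_teacher_grad(edited):
    """Gradient pulling neighboring frames together (temporal consistency)."""
    g = np.zeros_like(edited)
    g[1:] += edited[1:] - edited[:-1]
    g[:-1] -= edited[1:] - edited[:-1]
    return g

lr = 0.1
for _ in range(200):
    edited = frames_in + theta                   # student edits every frame
    grad = image_teacher_grad(edited) + 0.5 * video_teacher_grad(edited)
    theta -= lr * grad.mean(axis=0)              # distill both teachers into the student
```

In this toy setting the student converges to the teacher's preferred edit (theta approaches the all-ones vector) while the temporal term keeps neighboring edited frames consistent; the real procedure replaces these quadratic oracles with diffusion-model score signals and an adversarial objective.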

Precise Video Editing

Each input video is shown alongside three edited results. The edit instructions are:

Input 1: "With pyramids in the back" · "Add a balloon" · "In cartoon style"
Input 2: "With a santa hat" · "Add a beard" · "At Tokyo"
Input 3: "Cover it with flowers" · "Transform into a penguin" · "Turn the floor into glass"
Input 4: "At a show stage" · "Cartoon style" · "Depth map"
Input 5: "Add a red bench" · "Convert to a sketch" · "In pink colors"
Input 6: "Dressed as a princess" · "With a fireman uniform" · "Replace with a penguin"
Input 7: "Remove the plant" · "Make the mushroom pink" · "Paint the sofa gold"
Input 8: "Make it a doctor" · "With a policeman uniform" · "At New York"
Input 9: "Convert to a glass bowl" · "Cartoon style" · "Generate a sketch"
Input 10: "In the snow" · "Cover with mud" · "As a pop art painting"
Input 11: "Remove glasses" · "Extract a pose map" · "Replace with a penguin"
Input 12: "Make the water green" · "Paint the tail with rainbow colors" · "Futuristic style"
Input 13: "Add metal gloves" · "At an amusement park" · "Make it snow"
Input 14: "Add fireworks in the back" · "Make the suit green" · "Make it autumn"

TGVE+ Comparison

A comparison of our model with the previous state of the art, InstructVid2Vid, on the TGVE+ benchmark. Each prompt below is shown edited by both models (Ours vs. InstructVid2Vid):

"Add a group of dolphins swimming in the water"
"Change the background to a beach"
"Make the sun red"
"Change horse to turtle"
"Remove the plane and its contrail"
"Change green pants to blue"
"Make the style expressionist"

Ablations

Each prompt below is shown for our full model and five ablated variants: Fixed bins, No SDS, No adversarial, No alignment, and Random adapters.

"Add flowers around the bird"
"Make it misty morning"
"Change cat to dog"
"Make the style Minecraft"

Acknowledgements

We extend our gratitude to the following people for their contributions (in alphabetical order):
Andrew Brown, Bichen Wu, Ishan Misra, Saketh Rambhatla, Xiaoliang Dai, Zijian He.

BibTeX

@article{singer2024video,
  title={Video Editing via Factorized Diffusion Distillation},
  author={Singer, Uriel and Zohar, Amit and Kirstain, Yuval and Sheynin, Shelly and Polyak, Adam and Parikh, Devi and Taigman, Yaniv},
  journal={arXiv preprint arXiv:2403.09334},
  year={2024}
}