Video Editing via Factorized Diffusion Distillation

Uriel Singer^*, Amit Zohar^*, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, Yaniv Taigman

Meta AI

ECCV 2024 (Oral)


^*Equal Contribution

In an aquarium

Sitting on a red bench

Set it in winter wonderland

Replace with a panda

In Minecraft style

Paint it pink and blue

Abstract

We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge from (i) the image editing adapter, to precisely edit each individual frame, and (ii) the video generation adapter, to ensure temporal consistency among the edited frames. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.
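To make the idea of distilling from two teachers at once more concrete, here is a toy sketch in Python with NumPy. It is not the EVE implementation. It assumes a score-distillation-style update (the ablation labels on this page mention SDS and an adversarial term; only an SDS-style term is sketched), it updates the video tensor directly as a stand-in for the student's parameters, and both teachers are hypothetical closed-form denoisers: `make_frame_teacher` pulls every frame towards a target frame, and `temporal_teacher` pulls frames towards their temporal mean. All function names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def add_noise(x, noise, alpha):
    """Forward diffusion at one noise level: sqrt(alpha) * x + sqrt(1 - alpha) * noise."""
    return np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * noise

def make_frame_teacher(target_frame):
    """Hypothetical stand-in for the image editing teacher.

    It assumes every clean frame equals target_frame and returns the
    noise prediction implied by that assumption.
    """
    def predict_noise(x_t, alpha):
        return (x_t - np.sqrt(alpha) * target_frame) / np.sqrt(1.0 - alpha)
    return predict_noise

def temporal_teacher(x_t, alpha):
    """Hypothetical stand-in for the video teacher.

    It assumes the clean video is constant over time (axis 0) and
    estimates it from the temporal mean of the noisy input.
    """
    clean_estimate = x_t.mean(axis=0, keepdims=True) / np.sqrt(alpha)
    return (x_t - np.sqrt(alpha) * clean_estimate) / np.sqrt(1.0 - alpha)

def distill_step(video, teachers, weights, rng, alpha=0.5, lr=0.1):
    """One SDS-style step: noise the student output once, query every
    teacher on the same noisy input, and follow the weighted sum of
    (predicted noise - true noise)."""
    noise = rng.standard_normal(video.shape)
    x_t = add_noise(video, noise, alpha)
    direction = np.zeros_like(video)
    for teacher, weight in zip(teachers, weights):
        direction += weight * (teacher(x_t, alpha) - noise)
    return video - lr * direction

# Toy run: 8 frames of 4x4 "pixels", starting from random values.
rng = np.random.default_rng(0)
target = np.ones((4, 4))
video = rng.standard_normal((8, 4, 4))
teachers = [make_frame_teacher(target), temporal_teacher]
for _ in range(50):
    video = distill_step(video, teachers, [1.0, 1.0], rng)

print("mean temporal std:", video.std(axis=0).mean())
print("mean abs error to target:", np.abs(video - target).mean())
```

In this toy setting the first teacher alone would fix the content of each frame, while the second shrinks frame-to-frame variation; summing their update directions plays the role of the joint distillation described above. A real system would backpropagate this direction into the student network rather than update pixels directly.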

Precise Video Editing

Input

With pyramids in the back

Add a balloon

In cartoon style

Input

With a santa hat

Add a beard

At Tokyo

Input

Cover it with flowers

Transform into a penguin

Turn the floor into glass

Input

At a show stage

Cartoon style

Depth map

Input

Add a red bench

Convert to a sketch

In pink colors

Input

Dressed as a princess

With a fireman uniform

Replace with a penguin

Input

Remove the plant

Make the mushroom pink

Paint the sofa gold

Input

Make it a doctor

With a policeman uniform

At New York

Input

Convert to a glass bowl

Cartoon style

Generate a sketch

Input

In the snow

Cover with mud

As a pop art painting

Input

Remove glasses

Extract a pose map

Replace with a penguin

Input

Make the water green

Paint the tail with rainbow colors

Futuristic style

Input

Add metal gloves

At an amusement park

Make it snow

Input

Add fireworks in the back

Make the suit green

Make it autumn

TGVE+ Comparison

A comparison of our model with the previous state of the art, InstructVid2Vid (labeled InsVid2Vid below), on TGVE+.

Add a group of dolphins swimming in the water

Ours

InsVid2Vid

Change the background to a beach

Ours

InsVid2Vid

Make the sun red

Ours

InsVid2Vid

Change horse to turtle

Ours

InsVid2Vid

Remove the plane and its contrail

Ours

InsVid2Vid

Change green pants to blue

Ours

InsVid2Vid

Make the style expressionist

Ours

InsVid2Vid

Ablations

Add flowers around the bird

Ours

Fixed bins

No SDS

No adversarial

No alignment

Random adapters

Make it misty morning

Ours

Fixed bins

No SDS

No adversarial

No alignment

Random adapters

Change cat to dog

Ours

Fixed bins

No SDS

No adversarial

No alignment

Random adapters

Make the style Minecraft

Ours

Fixed bins

No SDS

No adversarial

No alignment

Random adapters

Acknowledgements

We extend our gratitude to the following people for their contributions (alphabetical order):
Andrew Brown, Bichen Wu, Ishan Misra, Saketh Rambhatla, Xiaoliang Dai, Zijian He.

Bibtex


@article{singer2024video,
  title={Video Editing via Factorized Diffusion Distillation},
  author={Singer, Uriel and Zohar, Amit and Kirstain, Yuval and Sheynin, Shelly and Polyak, Adam and Parikh, Devi and Taigman, Yaniv},
  journal={arXiv preprint arXiv:2403.09334},
  year={2024}
}