Live Action Animating – Proof of Concept. (Experiment #1)
I set out to explore how I might create a live action animation using machine learning – a fake news video that would be indistinguishable from the real thing.
To achieve this there are numerous aspects to unpick. The first proof of concept was to be able to create a video sequence that I could control.
It seemed obvious to use a subject that is at the centre of the fake news storm – plus there is the added advantage of copious amounts of data available for training. Hence Donald.
The Pix2Pix library (Isola, Zhu, Zhou and Efros – https://phillipi.github.io/pix2pix/) had shown some great example results from small datasets, and as this was only a proof of concept it seemed like the right tech. Pix2Pix enables image-to-image translation using conditional adversarial nets: these networks map from an input image to an output image and learn a loss function that trains the mapping.
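To make the "learn a loss function" idea concrete, here is a minimal sketch (not the library's actual code) of the Pix2Pix generator objective, L = L_cGAN + λ·L1, where the L1 term keeps the generated image close to its paired target. The discriminator probabilities are stood in by plain numbers, and the λ of 100 matches the value reported in the Pix2Pix paper:

```python
import math

def generator_loss(d_fake_prob, fake_px, target_px, l1_weight=100.0, eps=1e-12):
    """Generator wants the discriminator to score its fake as real (prob
    near 1) while staying close to the target in mean absolute error."""
    gan_term = -math.log(d_fake_prob + eps)
    l1_term = sum(abs(f - t) for f, t in zip(fake_px, target_px)) / len(fake_px)
    return gan_term + l1_weight * l1_term

def discriminator_loss(d_real_prob, d_fake_prob, eps=1e-12):
    """Discriminator wants real (input, target) pairs scored near 1 and
    fake (input, generated) pairs scored near 0."""
    return -math.log(d_real_prob + eps) - math.log(1.0 - d_fake_prob + eps)

# A generator whose output matches the target and fools the
# discriminator gets a near-zero loss.
print(round(generator_loss(0.99, [0, 0, 0], [0, 0, 0]), 4))  # 0.0101
```

In the real networks the pixel arrays are tensors and the discriminator is a PatchGAN, but the shape of the objective is the same.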
I used a Python port of the Pix2Pix library (AffineLayer – https://github.com/affinelayer/pix2pix-tensorflow), which worked very well for my purposes.
I started off by trying to train face-to-face translations. This is a good direction of travel – but not with the time and processing resources I had available (using my laptop to train models overnight). I’d need much more data and therefore much more time.
Inspired by the work of Mario Klingemann (@quasimondo), I recognised that simpler forms of images would get to the results I wanted much more quickly. To this end, I manually traced the subject and created 101 image pairs at 128x128px.
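The pix2pix-tensorflow port expects each training example as a single image with the input (A) and target (B) side by side. A toy sketch of that pairing step, modelling images as rows of pixel values for illustration (in practice you’d do the same thing with Pillow’s `Image.new` and `paste`; the helper name is my own):

```python
def pair_images(traced, photo):
    """Concatenate a traced input frame and its photo target
    horizontally, producing one A|B training image."""
    assert len(traced) == len(photo), "images must share a height"
    return [row_a + row_b for row_a, row_b in zip(traced, photo)]

# Toy 2x2 'images': traced line drawing on the left, photo on the right.
traced = [[0, 1], [1, 0]]
photo = [[5, 6], [7, 8]]
print(pair_images(traced, photo))  # [[0, 1, 5, 6], [1, 0, 7, 8]]
```

For the 128x128px pairs described above, each combined training image would end up 256px wide by 128px tall.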
I trained the model for 50 epochs – this took a few hours overnight on my setup.
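Training and testing with the pix2pix-tensorflow port is driven from the command line. A sketch of the sort of invocations involved – the flag names are from the repo’s README, but the directory names here are placeholders, not taken from this project:

```shell
# Train on the A|B paired images for 50 epochs.
python pix2pix.py \
  --mode train \
  --output_dir trump_model \
  --max_epochs 50 \
  --input_dir pairs/train \
  --which_direction AtoB

# Run the trained model over the traced test frames.
python pix2pix.py \
  --mode test \
  --output_dir trump_test \
  --input_dir pairs/test \
  --checkpoint trump_model
```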
To create a test set I captured video of myself saying some words and then traced my face into a template.
Running the test set gave me an image sequence that I then composited back against the video and audio of my recording to produce an output video (see bottom for video).
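One way to do that final compositing step is with ffmpeg, muxing the generated frame sequence back together with the original audio. A hedged sketch – the filenames, frame rate and codec settings are assumptions, not details from this project:

```shell
# Join the generated frames (numbered PNGs) with the recorded audio
# into a single playable video.
ffmpeg -framerate 25 -i generated_%04d.png \
       -i original_recording.wav \
       -c:v libx264 -pix_fmt yuv420p -shortest output.mp4
```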
The results were very good (for such a small data set – this was, after all, just a proof of concept).
Artefacts can be seen in the movement of the face and the body against the background – these can be traced to inaccuracies in the test data composition (tracing blunders).
The output loses fidelity compared with the original images. I’d look to improve the quality with bigger training sets, bigger images, and tweaks to the learning rate, number of steps and number of epochs.
I want to re-look at image-to-image translation – it may be possible to achieve real-time processing. There are many ways a full fake news production system might work: casting the net wider from image-to-image to incorporate Seq2Seq models would mean production could be driven by audio, performance or even text-to-video.
Lots of possibilities to explore.