Crash Course On GANs

[Header image – image credit: Dev Nag]

This post is not necessarily a crash course on GANs. It is at least a record of me giving myself a crash course on GANs. Adding to this as I go.

Intro/Motivation

I’ve been wanting to grasp the seeming-magic of Generative Adversarial Networks (GANs) since I started seeing handbags turned into shoes and brunettes turned into blondes… and seeing Magic Pony’s image super-resolution results, and hearing that Yann LeCun had called GANs the most important innovation in machine learning in recent years.

Finally, seeing Google’s Cat-Pig Sketch-Drawing Math

…broke me, and so…I need to ‘get’ this.

I’ve noticed that, although people use GANs with great success for images, not many have tried using them for audio yet (Note: see the SEGAN paper, below). Maybe with already-successful generative audio systems like WaveNet, SampleRNN (listen to those piano sounds!!) and Tacotron, there’s less of a push for trying GANs. Or maybe GANs just suck for audio. Guess I’ll find out…

Steps I Took

Day 1:

  1. Gathered list of some prominent papers (below).
  2. Watched video of Ian Goodfellow’s Berkeley lecture (notes below).
  3. Started reading the EBGAN paper (notes below)…
  4. …but soon switched to the BEGAN paper – because wow! Look at these generated images: [sample images]
  5. Googled for Keras-based BEGAN implementations and other code repositories (below)…Noticed SEGAN
  6. …Kept reading BEGAN, making notes as I went (below).
  7. Finished paper, started looking through BEGAN codes from GitHub (below) & began trying them out…
    a. Cloned @mokemokechicken’s Keras repo, grabbed suggested LFW database, converted images via script, ran training… Takes 140 seconds per epoch on Titan X Pascal.

    b. Cloned @carpedm20’s Tensorflow repo, looked through it, got the CelebA dataset, started running the code.

  8. Leaving codes to train overnight. Next time, I’d like to try to better understand the use of an autoencoder as the discriminator.

Day 2:

  1. My office is hot. Two Titan X GPUs pulling ~230 W for 10 hours straight have put the cards up to annoyingly high temperatures, around 85 °C! My previous nightly runs wouldn’t even go above 60 °C. But the results – especially from the straight-Tensorflow code trained on the CelebA dataset – are as incredible as advertised! (Not that I understand them yet. LOL.) The Keras version, despite claiming to be a BEGAN implementation, seems to suffer from “mode collapse,” i.e. too many very similar images get generated.
  2. Fished around a little more on the web for audio GAN applications. Found an RNN-GAN application to MIDI, and found actual audio examples of what not to do: don’t try to just produce spectrograms with DCGAN and convert them to audio. The latter authors seem to have decided to switch to a SampleRNN approach. Perhaps it would be wise to heed their example? ;-)
  3. Since EBGAN implemented autoencoders as discriminators before BEGAN did, I went back and read that part of the EBGAN paper – section “2.3 – Using AutoEncoders” (page 4; see notes below).
  4. Ok, I basically get the autoencoder-discriminator thing now (see the sketch below). :-)
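
Since I’ll forget this otherwise: here’s a minimal numpy sketch of the BEGAN objective as I currently understand it, with an autoencoder standing in as the discriminator. The D below is just identity-plus-noise so the snippet actually runs (a real one would be a convolutional encoder/decoder), and the hyperparameter values are the paper’s defaults as I recall them:

```python
import numpy as np

def recon_loss(v, D):
    """BEGAN's L(v) = mean |v - D(v)|: pixel-wise L1 reconstruction error of
    the autoencoder 'discriminator' D on a batch of images v."""
    return np.mean(np.abs(v - D(v)))

# Stand-in autoencoder (identity + noise), just so this runs end-to-end.
D = lambda v: v + 0.1 * np.random.randn(*v.shape)

x_real = np.random.rand(16, 64, 64, 3)   # batch of real images
x_fake = np.random.rand(16, 64, 64, 3)   # batch of generated images G(z)

k_t, lambda_k, gamma = 0.0, 0.001, 0.5   # BEGAN's equilibrium bookkeeping

# D is trained to reconstruct real images well but fakes badly:
loss_D = recon_loss(x_real, D) - k_t * recon_loss(x_fake, D)
# G is trained to make images the autoencoder reconstructs well:
loss_G = recon_loss(x_fake, D)
# k_t gets nudged each step to keep the two players balanced:
k_t += lambda_k * (gamma * recon_loss(x_real, D) - recon_loss(x_fake, D))
```

The k_t update is what keeps D from overpowering G – that seems to be the “boundary equilibrium” part of the name.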

Day 3:

“Life” intervened. :-/ Hope to pick this up later.

Papers

I haven’t read many of these yet; just gathering them here for reference:

Videos

  • Ian Goodfellow (original GAN author), guest lecture on GANs for UC Berkeley CS295 (Oct 2016). 1 hour 27 minutes. NOTE: it actually starts at 4:33. Watch at 1.25× speed.
    • Remarks/Notes:
    • This is at a fairly “high” level, which may be too much for some viewers: if hearing the words “probability distribution” over and over makes you tune out, or if you don’t know what a Jacobian is, you may not want to watch this.
    • His “Taxonomy of Generative Models” is GREAT!
    • The discriminator is just an ordinary classifier.
    • So the generator’s cost function can be just the negative of the discriminator’s cost function (i.e. it tries to “mess up” the discriminator). However, that can saturate (i.e. produce small gradients), so instead they “maximize the probability that the discriminator will make a mistake” (44:12). (See the numeric sketch just after this list.)
    • “KL divergence” is a measure of the ‘difference’ between two probability distributions (PDs).
    • “Logit” is the inverse of the sigmoid/logistic function. (logit <–> sigmoid :: arctan <–> tan)
    • Jensen-Shannon divergence is a measure of the ‘similarity’ between two PDs. Jensen-Shannon produces better results for GANs than KL/maximum likelihood. (Quick numeric check below, too.)
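
To make the saturation point from 44:12 concrete, here’s a tiny numeric sketch (plain numpy, toy numbers of my own choosing) of the gradient each version of the generator loss sees when the discriminator confidently rejects a fake:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Early in training, D confidently rejects fakes: its logit on G(z) is very negative.
l = -6.0                                    # D(G(z)) = sigmoid(-6) ~ 0.0025

# Minimax generator loss log(1 - D(G(z))): gradient w.r.t. the logit is -sigmoid(l),
# which vanishes as D gets confident -- this is the saturation problem.
grad_minimax = -sigmoid(l)                  # ~ -0.0025

# Non-saturating loss -log D(G(z)): gradient is -(1 - sigmoid(l)),
# which stays large exactly when the generator most needs a training signal.
grad_non_saturating = -(1.0 - sigmoid(l))   # ~ -0.9975

print(grad_minimax, grad_non_saturating)
```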
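And a quick numeric check on the last three bullets, assuming scipy is available (jensenshannon needs scipy ≥ 1.2, and note it returns the square root of the JS divergence):

```python
import numpy as np
from scipy.special import expit, logit        # expit = sigmoid; logit = its inverse
from scipy.stats import entropy               # entropy(p, q) computes KL(p || q)
from scipy.spatial.distance import jensenshannon

assert abs(logit(expit(0.73)) - 0.73) < 1e-12  # logit really does invert the sigmoid

p = np.array([0.1, 0.4, 0.5])   # two toy probability distributions
q = np.array([0.3, 0.3, 0.4])

print(entropy(p, q), entropy(q, p))  # KL is asymmetric: these two differ
print(jensenshannon(p, q) ** 2)      # JS divergence: symmetric and bounded
```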

Web Posts/Tutorials

Code

Keras:

Tensorflow:

PyTorch:

Datasets

More References (Lists)