Generative AI

An Intro to the Deep Generative Models Behind GANs

Caitlyn Coloma
12 min read · Nov 27, 2019

A lot of AI as we know it today has to do with deep learning classifiers or predictive algorithms. When it comes to manipulating large amounts of information, it is clear how AI helps humans do the otherwise tedious or difficult computing tasks of sorting large amounts of data. AI can even make predictions or insights about data that are not immediately apparent to humans.

But when AI goes beyond mere classification—when the degree to which AI is “artificial” increases—machines begin to blur the lines between what’s real and what’s fake, and that’s where it starts to get interesting.

Now that’s not to say deep learning classifiers aren’t interesting. The artificial neural networks behind these classifiers are extremely good at identifying patterns among hundreds and even thousands of data inputs and then extracting those patterns in an output layer in a way that is useful to humans.

But AI can be used to do so much more than just extract patterns from data. In contrast to deep learning classifiers, other subset methods of machine learning called deep generative models allow us to use those patterns to learn an underlying distribution and generate new data that fits the pattern. That means computers aren’t just looking at our data anymore; they’re making their own artificial data.

Okay, good for them [computers]. So what?

Have you watched any action movies lately? Disney movies? Have you switched between satellite view and default view on your Maps app before? Have you ever used filters on a photo? Like that one filter that makes you age fifty years? Or maybe one that automatically retouched your selfie?

If you answered yes to any of those questions, you were using computer-generated data: Many action movies and animated films, like Disney’s, use some sort of CGI (computer-generated images); switching between map views involves image-to-image or pixel-to-pixel translation; automatic photo editors, like aging filters and retouching, rely on different types of image generation too.

And the common denominator between all this image generation? Deep generative models.

These deep generative models have enjoyed some of the largest and most recent advances in AI. Generative adversarial networks (GANs) are the leading deep generative model at just five years old, having been introduced in a 2014 paper by Ian Goodfellow and other machine learning researchers at the University of Montreal.

Computer-generated data opens up a realm of possibilities with AI at the same time it narrows the gap between reality and illusion. Understanding deep generative models, especially GANs, helps us see where the future of AI is headed and is crucial to making sure our human intelligence does not get outpaced by AI.

Deep Generative Models

Deep generative models are a type of latent variable model. A latent variable is a feature of the data that we can’t directly observe, but is used to inform and generate the data that we can observe. For this reason, latent variables are also sometimes called hidden variables.

The Shadow Analogy

The generated, observable data is like a shadow that we see, and the latent variable is like the object that produced the shadow, assuming the object is behind us. We don’t know what the object itself looks like because we can’t see it, but we can see some of its features through its shadow.

Let’s say the object is a ball, mounted behind an observer and in front of a light source such that the observer can see the ball’s shadow but not the ball itself.

If asked about the ball’s features, a human observer could only say that the object has a circular outline because all this person can see is the ball’s shadow. A human would have no clue what color the ball is, to what sport it belongs, or any other relevant detail.

The observer can only view the ball’s shadow, but cannot see the basketball itself because it is hidden, just like latent variables are.

A machine, on the other hand, would be able to look at the ball’s shadow and identify that not only does it have a circular outline, but that it’s spherical, is orange with black lines, and is a basketball. A machine knows enough about the ball to even reconstruct the ball entirely—all from only viewing its shadow!

Conceptually, this is what deep generative models are doing: they identify hidden latent variables based only on observable data, and then use those latent variables to generate new data!

But how?

Generative Models Learn Unsupervised

Generative models are a subset of unsupervised machine learning. To understand why, we first have to look at the constraints of supervised learning and why those constraints make it ineffective for generating new data.

Supervised Learning: Learning with Labels

Supervised learning takes place through classification (sorting data into labeled categories) or regression (fitting target data to a relationship between the data’s features).

Classification = sorting into categories. Regression = fitting data to a relationship (like a line of best fit).

The supervision taking place with classification or regression has to do with labels. With supervised learning, data comes with labels from predetermined categories, such as color or shape, and the model learns to sort new data into those categories. Supervised learning is essentially a functional mapping from input data to output labels.

The problem is that supervised learning is confined to preset labels. If we’re classifying by color, for example, and the neural net only knows red and blue but encounters new data that is green, it would (inaccurately) sort this green data point into either the red or blue category.

Or we would have to manually add in a third category for green data. For something as seemingly simple as color, accounting for the vast amounts of colors and shades for categorizing requires several iterations to build an artificial model that maps inputs to correct outputs.
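To make the preset-label problem concrete, here's a minimal sketch (with made-up RGB data) of a nearest-centroid classifier trained only on "red" and "blue" examples. When it meets a green point, it has no choice but to file it under one of the labels it already knows:

```python
import numpy as np

# Hypothetical toy data: RGB color vectors labeled "red" or "blue".
train_x = np.array([[0.9, 0.1, 0.1],   # reds
                    [0.8, 0.2, 0.1],
                    [0.1, 0.1, 0.9],   # blues
                    [0.2, 0.1, 0.8]])
train_y = np.array(["red", "red", "blue", "blue"])

# Nearest-centroid classifier: one centroid per preset label.
centroids = {label: train_x[train_y == label].mean(axis=0)
             for label in np.unique(train_y)}

def classify(x):
    # Forced to choose among the labels seen in training -- nothing else.
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

green = np.array([0.1, 0.9, 0.1])
print(classify(green))  # "red" or "blue" -- never "green"
```

The model isn't broken; it's doing exactly what supervision asks of it. The limitation is baked into the label set itself.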

Where unsupervised learning really separates itself from supervised learning is the way it encounters new data.

Outlier Detection with Unsupervised Learning

Unsupervised learning, the method used by deep generative models, happens through clustering or association, with no labels for the clusters or associations.

Going back to our shadow analogy, the deep generative model wouldn’t know that what it’s generating is called a “basketball” specifically, but instead only knows that it’s generating an orange sphere with black stripes. In this way, deep generative models are able to learn the underlying patterns or features (latent variables) without needing labels for those patterns or features.

An unsupervised model only clusters, or groups, similar pieces of data. A supervised model would label those groups.

The goal for an unsupervised learning model is to mimic the natural distribution of the input dataset, as opposed to creating an artificial one using supervision.

When an unsupervised learning model encounters new or rare data (outliers), this disrupts the distribution of the dataset. Generative models use unsupervised learning for outlier detection, which leads the model to change the data distribution to accommodate for outliers and improve the accuracy of the model.
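Here's a minimal sketch of distribution-based outlier detection, with planted numbers of my own choosing: fit a simple distribution to the data (here, just a mean and standard deviation), then flag points the fitted distribution considers improbable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model the data's distribution, then flag points the model finds improbable.
data = rng.normal(loc=5.0, scale=1.0, size=1000)   # "typical" samples
data = np.append(data, [12.0, -4.0])               # two planted outliers

mu, sigma = data.mean(), data.std()

def is_outlier(x, threshold=3.0):
    # A point more than `threshold` standard deviations from the mean
    # is unlikely under the fitted distribution.
    return abs(x - mu) / sigma > threshold

outliers = data[[is_outlier(x) for x in data]]
print(outliers)  # the planted points (and perhaps a rare natural straggler)
```

A real generative model fits something far richer than a single Gaussian, but the principle is the same: points that don't fit the learned distribution stand out, and the model can then adjust its distribution to account for them.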

Outlier detection is an important process for de-biasing data. Sometimes, neural nets are trained solely on a dataset that is not representative of the data you wish to pull from. This creates a bias: since the neural net is familiar with only part of the data, it doesn’t know that certain features are misrepresented in the distribution.

Generative models remove bias from biased classification models.

When the data is taken in context of more sensitive features like race or sex, the importance of de-biasing becomes clear.

Let’s go back to the idea of observable data vs. hidden data (latent variables), now in the context of a deep learning classifier that predicts income. If the training set itself is biased, like if the data shows women with lower incomes than their male counterparts, the model would inadvertently begin to link being female with having a low income when the two are independent.

This model’s predictions would thus become biased against females, which makes moving the prediction’s dependency on observable features to latent ones an important step in de-biasing deep learning models. This would be like ignoring the shadow’s more superficial features and instead focusing on the basketball’s relevant features.

In much the same way, a generative model would minimize sex as a feature that influences income and would move instead to find the latent variables more relevant to predicting income, like level of education or zip code. Not only does generative modeling de-bias the model, but it makes the model’s predictions more accurate.

Types of Deep Generative Models

The unsupervised learning that deep generative models employ makes them highly effective at reproducing data that accurately represents a training set. Now we'll focus on three types of deep generative models: autoencoders, variational autoencoders (VAEs), and GANs, and we'll see why GANs ultimately come out on top.

Autoencoders

Encoding means mapping input data to a lower-dimensional latent space. In other words, autoencoders take input data and use convolutional layers to narrow it down to only the most meaningful features (the latent variables). Meaningful features are the ones that most contribute to the underlying, or latent, structure of the entire dataset.

By discarding the details of individual images, autoencoders focus in on the rich details of the underlying structure of the dataset taken as a whole. This means that the dimensions (size) of each convolutional layer get smaller and smaller until only the relevant latent features remain.

From there, the latent features that have been identified are used to reconstruct the original data. The dimensions of each convolutional layer get larger and larger until the image can be reconstructed.

The shape of the overall neural network (convolutional layers decreasing in dimension before the latent space and increasing after it) makes it so that autoencoders are often depicted as a pair of trapezoids that bottleneck at the latent space.

The first set of convolutional layers is known collectively as the encoder because it encodes the input data into a low dimension latent space. The second set of convolutional layers is known collectively as the decoder because it decodes the latent space into a higher dimension reconstruction.

The convolutional layers in autoencoders are represented by trapezoids because of the way they get narrow toward the lower dimension latent space and get larger toward the higher dimension reconstruction.

The autoencoder learns by comparing the input data to the reconstructed output; any difference between the two forces the network to learn the most relevant features of the input so it can best reconstruct the original.

The bottlenecking shape of an autoencoder depicts the process of compression. The nature of data compression makes it so that the reconstructed input is usually not identical to the original input. The quality of this reconstruction is determined by the dimensions of the latent space.

A smaller latent space means a lower-quality reconstruction: when an autoencoder bottlenecks at a very small latent space, fine details such as hard edges are lost in the convolutional layers, while only the overall structure survives through the latent space.

As a result, autoencoders capture overall structure well but produce low-quality, often blurry reproductions of their inputs.
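The encode-bottleneck-decode loop above can be sketched in a few lines. This is a deliberately stripped-down toy, with synthetic data and a *linear* autoencoder instead of convolutional layers: 8-dimensional points that secretly live on a 2-D subspace get squeezed through a 2-D latent space and reconstructed, with gradient descent shrinking the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points in 8-D that really live on a 2-D subspace,
# so a 2-D latent space can capture their underlying structure.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent_true @ mixing

# Linear autoencoder: encoder compresses 8-D -> 2-D, decoder expands back.
W_enc = rng.normal(scale=0.1, size=(8, 2))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(2, 8))   # decoder weights

lr = 0.01
losses = []
for _ in range(500):
    z = X @ W_enc            # encode: project into the 2-D latent space
    X_hat = z @ W_dec        # decode: reconstruct the 8-D input from z
    err = X_hat - X
    losses.append((err ** 2).mean())
    # Gradient descent on the squared reconstruction error.
    grad_dec = z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(losses[0], losses[-1])  # reconstruction error shrinks as training proceeds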

Variational Autoencoders (VAEs)

In generative modeling, it’s usually not very useful to reproduce the same exact image over and over again—especially if the reproduction is just a low-quality version of the original, as is the case for traditional autoencoders. To produce data that exhibits some degree of variation from original data, we look to variational autoencoders.

The difference between traditional autoencoders and VAEs has to do with the mathematical functions that determine their latent spaces. In a traditional autoencoder, the mapping into the latent space is a deterministic function, meaning that feeding the network the same input will always generate the same output.

With VAEs, that deterministic function is replaced with a probabilistic one. Now, instead of one input mapping to exactly one output, one input has a probability of mapping to several different outputs. Feeding the same image through the network will produce different reproductions, which is far more useful for generative modeling.

Usually, the probabilistic function of a VAE computes a mean and standard deviation for the latent variables, then samples from that distribution to incorporate a degree of randomness into the autoencoder. This also means the latent representation will not always be the same for the same input, as was the case with traditional autoencoders.

VAEs use probability to incorporate a degree of variation between the input and the reconstructed input.
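The mean-plus-noise sampling step can be sketched directly. The numbers below are made up stand-ins for what a trained encoder might predict for one image; the point is that the same input yields a different latent code on every pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output for one input image: instead of a single
# latent point, a VAE's encoder predicts a mean and a standard deviation
# for each latent dimension.
mu = np.array([0.5, -1.2, 0.3])       # predicted latent means
sigma = np.array([0.1, 0.2, 0.05])    # predicted latent standard deviations

def sample_latent():
    # z = mu + sigma * noise (the "reparameterization trick"), so the same
    # input maps to a slightly different latent point every time.
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps

z1, z2 = sample_latent(), sample_latent()
print(z1, z2)  # two different latent codes for the same input
```

Decoding `z1` and `z2` would then give two slightly different reconstructions of the same original image, which is exactly the variation VAEs are after.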

One cool thing we can do with VAEs is disentanglement. We can disentangle, or isolate, latent variables from each other so that we can manipulate them how we want. We can narrow down the features of an image so that each one has semantic meaning, or affects something visually that us humans care about.

Want a face that smiles? Doesn’t smile? Want the male version of the female face generated? The face looks left when you want it to look right? We can change all these semantic features by simply manipulating the latent variables. This manipulation, called perturbation, allows us to change faces feature by feature! That’s pretty sick!

The VAE-generated faces above make small changes to underlying latent features to vary the output according to things like color or sex.

Notice that while VAEs are pretty good at feature perturbation and producing a variety of outputs, these computer-generated faces don’t yet appear to be real. They’re blurry, and the backgrounds aren’t realistic.

Is this what I meant when I said machines blur the lines between what’s real and what’s fake?

No way! I haven’t gotten to GANs yet!

Generative Adversarial Networks (GANs)

Where VAEs were concerned with estimating mathematical distributions of data in order to generate new data fitting those distributions, GANs circumvent the estimation step entirely to go straight from data sampling to data generation.

The generator takes random noise z and transforms it into fake data. The discriminator then compares the generated data to real samples m and decides whether each is real or fake. GANs are trained until the discriminator can no longer tell the generated data from the real data.

Adversarial: What Doesn’t Kill You Makes You Stronger

A GAN actually consists of a pair of competing neural networks.

  • The generator, which takes in random noise and transforms it into imitations of the real data, just well enough to trick the discriminator
  • The discriminator, which compares the generator’s imitations to the real data

This relationship makes it so that as the discriminator gets better at distinguishing between real and fake data, the generator has to also get better at producing realistic fake data to pass off as real. The competitive, or adversarial, interaction between neural networks in GANs can be described by a minimax game.

The discriminator tries to maximize the likelihood of correctly labeling real data as real and fake data as fake, while the generator tries to minimize it. The GAN is successful when the discriminator cannot tell the difference between what's real and fake.

…And neither can we. GANs produce such highly realistic, vividly-detailed computer-generated data that I wouldn’t know that all the people below aren’t real.

You’re telling me none of these people are real?!
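The minimax game described above can be sketched numerically. Below are the two standard loss functions evaluated on some made-up discriminator outputs, where D(x) is the discriminator's estimated probability that x is real. (For the generator I use the "non-saturating" loss common in practice, which maximizes log D(G(z)) rather than minimizing log(1 − D(G(z))); both push in the same direction.)

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D wants d_real -> 1 and d_fake -> 0: maximize log D(x) + log(1 - D(G(z))),
    # written here as a loss to minimize.
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # G wants the discriminator fooled: push d_fake -> 1.
    return -np.mean(np.log(d_fake))

# Hypothetical outputs early in training: D easily spots the fakes.
early = generator_loss(np.array([0.05, 0.10, 0.08]))
# Later: the generator's fakes fool D about half the time.
late = generator_loss(np.array([0.45, 0.55, 0.50]))
print(early, late)  # generator loss falls as its fakes become convincing
```

When the discriminator's outputs hover around 0.5 for both real and fake inputs, it is effectively guessing, which is the equilibrium the training process is driving toward.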

Advantages of GANs

  • Where VAEs try to find the best approximate fit, GANs aim to reproduce the data as it is. VAEs use a maximum likelihood approach that summarizes latent features, while GANs use a minimax formulation that captures details directly.
  • GANs increase the spatial resolution of generated images, preserving the hard edges that VAEs often compromise. Yay for no blurry pictures!
  • GANs generate highly varied outputs. The generator operates on random noise samples, so each output will be different without compromising vivid detail.
VAEs turn out to produce less accurate, overly smoothed, and noisier representations of the data than GANs.

Deep Generative Models: TL;DR

  • DGMs don’t just extract patterns; they use patterns to generate new data.
  • DGMs use hidden variables called latent variables to identify only the most meaningful features of the data.
  • Supervised learning = mapping inputs to labeled outputs. Unsupervised learning = finding a pattern from the inputs. DGMs use unsupervised learning.
  • Three types of DGMs are: autoencoders, variational autoencoders (VAEs), and generative adversarial networks (GANs).
  • Autoencoders narrow down the details of a dataset to just the latent variables to reconstruct the input (almost identically). VAEs use probability to add variation to the reconstructed data.
  • Autoencoders and VAEs produce often blurry or unrealistic images.
  • GANs are made up of a competing generator that makes fake data from random noise, and a discriminator that compares the generated (fake) data to the sample (real) data. When the discriminator can no longer tell the difference, the GAN has been properly trained.
  • GANs are the leading DGM because they produce highly realistic images—so realistic even the trained eye would have trouble distinguishing between real and fake.

Deep generative models are a super interesting and new field of AI. DGMs like autoencoders and GANs can be used creatively for visual art or even music generation.

On the flip side, this power can be abused, and we can use DGMs to produce highly realistic artificial images or videos, often termed deep fakes, to misinform the masses.

My hope is that with this article, we can strengthen our understanding of the machine learning that takes place behind these models, even when our eyes begin to deceive us.

Thanks for reading! Applause and feedback are welcome and much appreciated. Follow me on Medium for more articles like this, and join me on my journey as I use emerging tech to change the world. Let’s connect on LinkedIn or email me (caitayc@gmail.com), and subscribe to my monthly newsletter!
