Deep Convolutional Generative Adversarial Network (DCGAN)

Note

This short tutorial explores GANs and how they work. These are my working notes as I explore the generative models landscape. Code is in PyTorch and adapted from the PyTorch DCGAN tutorial and annotated_deep_learning_paper_implementations.

1 Simply, what are GANs?

GANs are a class of generative models that aim to learn and mimic the underlying probability distribution of the real data so that they can generate samples that closely resemble real data samples. This is achieved by simultaneously training two neural networks, a generator and a discriminator, in an adversarial manner. These two networks engage in a game where the generator tries to fool the discriminator by mapping Gaussian noise z to fake samples so realistic that the discriminator cannot differentiate them from real data. The ideal outcome is a point of equilibrium where the generator has successfully approximated the true data distribution and the discriminator can no longer distinguish between real and fake samples.

MIT 6.S191: Deep Generative Modeling

Let x be a real data sample, let D represent the discriminator function and G the generator function, and let z be a latent vector sampled from a standard normal distribution.

The discriminator evaluates whether the input data is real or fake.

The probability that x is real is D(x); likewise, the probability that the generated sample G(z) is real is D(G(z)).

Therefore, intuitively, the discriminator is encouraged to output high values (close to 1) when presented with real data, i.e. D(x) \rightarrow 1, and low values (close to 0) when presented with generated data, i.e. D(G(z)) \rightarrow 0 or equivalently [1 - D(G(z))] \rightarrow 1.

  • For real data x, the discriminator is penalized if it predicts x as fake, so the discriminator tries to maximize log(D(x)).

  • For generated data G(z), the discriminator is penalized if it predicts G(z) as real, so the discriminator tries to maximize log(1 - D(G(z))).

  • The generator in turn tries to minimize the probability that the discriminator predicts G(z) as fake, so the generator tries to maximize log(D(G(z))) or equivalently minimize log(1 - D(G(z))).

Therefore the GAN training objective is given by this combined minimax value function:

$$
\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]
$$

Note

\mathbb{E}_{x \sim p_{data}(x)} is read as the expected value of x sampled from the real data distribution p_{data}(x).

\mathbb{E}_{z \sim p_{z}(z)} is read as the expected value of z sampled from the latent space distribution p_{z}(z).
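As a sanity check, here is a minimal sketch (with hypothetical discriminator outputs) showing that PyTorch's nn.BCELoss with labels 1 and 0 recovers the -log(D(x)) and -log(1 - D(G(z))) terms that the two networks push in opposite directions:

import torch
import torch.nn as nn

bce = nn.BCELoss()

# Hypothetical discriminator outputs
d_real = torch.tensor([0.9]) # D(x): fairly confident the real sample is real
d_fake = torch.tensor([0.2]) # D(G(z)): fairly confident the fake is fake

# BCE with label 1 is -log(D(x))
loss_real = bce(d_real, torch.ones(1))  # -log(0.9) ≈ 0.105
# BCE with label 0 is -log(1 - D(G(z)))
loss_fake = bce(d_fake, torch.zeros(1)) # -log(0.8) ≈ 0.223

print(loss_real.item(), loss_fake.item())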

2 What is a DCGAN?

DCGANs are a class of GANs that use convolutional neural networks (CNNs) for both the generator and discriminator. They were introduced to address limitations (most notably training instability) in earlier GAN architectures through a set of architectural guidelines:

  • Replace any pooling layers with strided convolutions in the discriminator and fractionally-strided (transposed) convolutions in the generator.

  • Use batch normalization in both the generator and the discriminator.

  • Remove fully connected hidden layers for deeper architectures.

  • Use ReLU activation in the generator for all layers except the output, which uses Tanh.

  • Use LeakyReLU activation in the discriminator for all layers.

DCGANs paper

3 Implementation

Now, let’s implement a DCGAN that produces faces of shape 3 x 64 x 64.

import torch
import torch.nn as nn
import torch.nn.functional as F
import random


# Set seed for reproducibility
manualSeed = 999
random.seed(manualSeed)
torch.manual_seed(manualSeed)
torch.use_deterministic_algorithms(True)

# Root directory for dataset
dataroot = "data/celeba"

# Number of workers for dataloader
workers = 0

# Batch size during training
batch_size = 128

# Image size
image_size = 64

# Number of channels in the training images
nc = 3

# Size of latent vector z
nz = 100

# Size of feature maps in generator
ngf = 64

# Size of feature maps in discriminator
ndf = 64

# Number of training epochs
num_epochs = 5

# Learning rate for optimizers
lr = 0.0002

# Beta1 hyperparam for Adam optimizers
beta1 = 0.5

# Number of GPUs available. Use 0 for CPU mode.
ngpu = 1

3.1 Data

We’ll be using the Celeb-A faces dataset.

import torch
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader
from torchvision.utils import make_grid



# Define transformations that will be applied to the images
transform = transforms.Compose([
    transforms.CenterCrop(178),
    transforms.Resize(image_size),
    transforms.ToTensor(), # ToTensor() already scales the images to [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5),
                         (0.5, 0.5, 0.5)) # Shift [0, 1] to [-1, 1] to match the generator's Tanh output range
])

# Create celeba dataset
# The built-in download doesn't work, so the dataset was downloaded manually from: https://drive.google.com/file/d/0B7EVK8r0v71pZjFTYXZWM3FlRnM/view?usp=drive_link&resourcekey=0-dYn9z10tMJOBAkviAcfdyQ
#celeba_dataset = datasets.CelebA(root = dataroot, download = True, split = 'train', transform = transform)
celeba_dataset = datasets.ImageFolder(root = dataroot, transform = transform)

# Create dataloader to load images in batches
dataloader = DataLoader(dataset = celeba_dataset, batch_size = batch_size, shuffle = True, num_workers=workers)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Let's plot some training images
import matplotlib.pyplot as plt
import numpy as np
batch = next(iter(dataloader))

plt.figure(figsize = (8, 8))
plt.axis("off")
plt.title("Training Images")
plt.imshow(torch.permute(torchvision.utils.make_grid(batch[0].to(device)[:64], padding = 2, normalize = True), (1, 2, 0)).cpu())
plt.show()

3.2 Weight initialization

From the DCGAN paper, the authors initialized weights from a Normal distribution with mean = 0 and stddev = 0.02.

# Initialize convolution, convolution transpose and batch normalization layers
def init_weights(module):
    classname = module.__class__.__name__

    # Convolutional layers
    if classname.find('Conv') != -1:
        nn.init.normal_(module.weight.data, 0.0, 0.02)

    # BatchNorm layers
    elif classname.find('BatchNorm') != -1:
        nn.init.normal_(module.weight.data, 1.0, 0.02)
        nn.init.constant_(module.bias.data, 0)
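As a quick check (using a hypothetical standalone conv layer), the sample standard deviation of the initialized weights should come out close to 0.02:

conv = nn.Conv2d(3, 64, 4)
init_weights(conv)
print(conv.weight.data.std()) # ≈ 0.02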

3.3 Generator

Great! Now, let’s implement the generator. Remember, the goal of the generator is to learn the underlying distribution of the training data and generate samples that closely resemble real data samples and can fool the discriminator. It does this by transforming a latent vector z (random noise) sampled from a Gaussian distribution into synthetic data G(z) through a series of transposed convolution layers.

DCGAN paper
Note

Transposed convolutions are used to upsample the input by inserting zeros between input elements and then convolving with a filter. The resulting output size is given by the formula:

$$
\text{Output size} = (\text{Input size} - 1) \times \text{stride} - 2 \times \text{padding} + (\text{kernel size} - 1) + \text{output padding} + 1
$$
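As a quick shape check of this formula (a sketch using the same layer hyperparameters as the generator below), the first layer maps a 1 x 1 input to (1 - 1) × 1 - 2 × 0 + (4 - 1) + 0 + 1 = 4, and each subsequent stride-2, padding-1 layer doubles the spatial size:

layer1 = nn.ConvTranspose2d(100, 512, kernel_size = 4, stride = 1, padding = 0)
print(layer1(torch.randn(1, 100, 1, 1)).shape) # torch.Size([1, 512, 4, 4])

# (4 - 1) * 2 - 2 * 1 + (4 - 1) + 0 + 1 = 8
layer2 = nn.ConvTranspose2d(512, 256, kernel_size = 4, stride = 2, padding = 1)
print(layer2(torch.randn(1, 512, 4, 4)).shape) # torch.Size([1, 256, 8, 8])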

# Implementing the generator

class Generator(nn.Module):
    def __init__(self, nz = 100, ngf = 64, nc = 3, ngpu = 1):
        super().__init__()
        self.ngpu = ngpu
        self.nz = nz
        self.ngf = ngf
        self.nc = nc
        self.gen = nn.Sequential(
            # Z going through a transposed conv
            # 100 x 1 x 1 -> 512 x 4 x 4
            nn.ConvTranspose2d(in_channels = self.nz, out_channels = self.ngf * 8, kernel_size = 4, stride = 1, padding = 0, bias = False),
            nn.BatchNorm2d(self.ngf * 8),
            nn.ReLU(True),

            # 512 x 4 x 4 -> 256 x 8 x 8
            nn.ConvTranspose2d(in_channels = self.ngf * 8, out_channels = self.ngf * 4, kernel_size = 4, stride = 2, padding = 1, bias = False),
            nn.BatchNorm2d(self.ngf * 4),
            nn.ReLU(True),

            # 256 x 8 x 8 -> 128 x 16 x 16
            nn.ConvTranspose2d(in_channels = self.ngf * 4, out_channels = self.ngf * 2, kernel_size = 4, stride = 2, padding = 1, bias = False),
            nn.BatchNorm2d(self.ngf * 2),
            nn.ReLU(True),

            # 128 x 16 x 16 -> 64 x 32 x 32
            nn.ConvTranspose2d(in_channels = self.ngf * 2, out_channels = self.ngf, kernel_size = 4, stride = 2, padding = 1, bias = False),
            nn.BatchNorm2d(self.ngf),
            nn.ReLU(True),

            # 64 x 32 x 32 -> 3 x 64 x 64
            nn.ConvTranspose2d(in_channels = self.ngf, out_channels = self.nc, kernel_size = 4, stride = 2, padding = 1, bias = False),
            nn.Tanh()

        )

    def forward(self, input):
        return self.gen(input)
        
# Create generator
netG = Generator().to(device)

# Initialize weights
# .apply applies the function to all the modules in the network
netG = netG.apply(init_weights)
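A quick sanity check (assuming the variables defined above): a batch of latent vectors should map to 3 x 64 x 64 images:

z = torch.randn(16, nz, 1, 1, device = device)
print(netG(z).shape) # torch.Size([16, 3, 64, 64])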

3.4 Discriminator

The discriminator is a CNN-based image classifier that takes a 3 x 64 x 64 image as input, processes it through a series of Conv2d, BatchNorm2d, and LeakyReLU layers, and outputs a scalar probability that the input image is real (as opposed to fake). The discriminator is trained to maximize the probability of assigning the correct label to both real and generated data.

# Define the discriminator
class Discriminator(nn.Module):
    def __init__(self, ndf = 64, nc = 3, ngpu = 1):
        super().__init__()
        self.ngpu = ngpu
        self.ndf = ndf
        self.nc = nc
        self.disc = nn.Sequential(
            # 3 x 64 x 64 -> 64 x 32 x 32
            nn.Conv2d(in_channels = self.nc, out_channels = self.ndf, kernel_size = 4, stride = 2, padding = 1, bias = False),
            nn.LeakyReLU(0.2, inplace = True),

            # 64 x 32 x 32 -> 128 x 16 x 16
            nn.Conv2d(in_channels = self.ndf, out_channels = self.ndf * 2, kernel_size = 4, stride = 2, padding = 1, bias = False),
            nn.BatchNorm2d(self.ndf * 2),
            nn.LeakyReLU(0.2, inplace = True),

            # 128 x 16 x 16 -> 256 x 8 x 8
            nn.Conv2d(in_channels = self.ndf * 2, out_channels = self.ndf * 4, kernel_size = 4, stride = 2, padding = 1, bias = False),
            nn.BatchNorm2d(self.ndf * 4),
            nn.LeakyReLU(0.2, inplace = True),

            # 256 x 8 x 8 -> 512 x 4 x 4
            nn.Conv2d(in_channels = self.ndf * 4, out_channels = self.ndf * 8, kernel_size = 4, stride = 2, padding = 1, bias = False),
            nn.BatchNorm2d(self.ndf * 8),
            nn.LeakyReLU(0.2, inplace = True),

            # 512 x 4 x 4 -> 1 x 1 x 1
            nn.Conv2d(in_channels = self.ndf * 8, out_channels = 1, kernel_size = 4, stride = 1, padding = 0, bias = False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.disc(input)

# Create discriminator
netD = Discriminator().to(device)

# Initialize weights
netD = netD.apply(init_weights)
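And the mirror-image sanity check: a batch of 3 x 64 x 64 images should map to one scalar probability each:

imgs = torch.randn(16, nc, image_size, image_size, device = device)
print(netD(imgs).shape) # torch.Size([16, 1, 1, 1]); flattened with .view(-1) during training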

3.5 Loss functions and optimizers

The loss function used for both networks is the Binary Cross Entropy (BCE) loss. Both optimizers are Adam with a learning rate of 0.0002 and beta1 = 0.5, as recommended in the DCGAN paper.

# initialize BCELoss function
criterion = nn.BCELoss()

# Create fixed Gaussian noise to visualize the generator's progress
fixed_noise = torch.randn(64, nz, 1, 1, device = device)

# Real/fake labels
real_label = 1
fake_label = 0

# Setup Adam optimizers for both G and D
optimizerD = torch.optim.Adam(netD.parameters(), lr = lr, betas = (beta1, 0.999))
optimizerG = torch.optim.Adam(netG.parameters(), lr = lr, betas = (beta1, 0.999))

3.6 Training

  • To train the discriminator, we first pass a batch of real data, calculate the loss from log(D(x)) and then compute the gradients. Next, we pass a batch of generated data, calculate the loss from log(1 - D(G(z))) and compute the gradients. With the gradients accumulated from both the all-real and all-fake batches, we then call a step of the discriminator’s optimizer.

  • To train the generator, we set the labels to real, pass the generated data through the discriminator, compute the loss from log(D(G(z))) and then compute the gradients. We then use the gradients to update the generator’s parameters. This way, the generator is incentivised to generate data that the discriminator classifies as real.

import torchvision
# Training loop

# Lists to keep track of progress
img_list = []
G_losses = []
D_losses = []
iters = 0

for epoch in range(num_epochs):
    for idx, batch in enumerate(dataloader):
        ## Train discriminator

        ### Train with all real batch
        real_cpu = batch[0].to(device)
        b_size = real_cpu.shape[0]
        # Label all real images as real
        label = torch.full((b_size, ), real_label, device = device, dtype = torch.float)
        # Forward pass real batch through D
        output = netD(real_cpu).view(-1) # .view(-1) flattens the output
        # Calculate loss on all real batch
        loss_dreal = criterion(output, label)
        # Zero out previously accumulated gradients
        optimizerD.zero_grad()
        # Calculate gradients of real_loss wrt D's parameters
        loss_dreal.backward()
        D_x = output.mean().item() # Average output of D for real images

        ### Train with all fake batch
        # Generate batch of latent vectors: b_size x 100 x 1 x 1
        noise = torch.randn((b_size, nz, 1, 1), device = device)
        # Generate fake image batch with G: b_size x 3 x 64 x 64
        fake = netG(noise)
        # Label all fake images as fake
        label.fill_(fake_label)
        # Classify all fake batch with D
        output = netD(fake.detach()).view(-1) # .detach() detaches the fake tensor from the computational graph
        # Calculate D's loss on all fake batch
        loss_dfake = criterion(output, label)
        # Calculate gradients of fake_loss wrt D params in backward pass,
        # These will be accumulated with previous real gradients
        loss_dfake.backward()
        D_G_z1 = output.mean().item() # Average output of D for fake images
        # Compute total error for this batch
        loss_d = loss_dreal + loss_dfake

        #### Update D's weights with gradients accumulated from real and fake batches
        optimizerD.step()



        ## Train generator
        # Set the labels to real to push the generator toward real-looking images i.e maximize log(D(G(z)))
        label.fill_(real_label)
        # Forward pass the previously generated fake batch through D
        output = netD(fake).view(-1)
        D_G_z2 = output.mean().item() # Average output of D for fake images
        # Calculate G's loss based on this output
        loss_g = criterion(output, label)
        # Zero out previously accumulated gradients
        optimizerG.zero_grad()
        # Calculate gradients of G's loss wrt G's parameters
        loss_g.backward()
        # Update G's weights to minimize loss_g
        optimizerG.step()

        # Output training stats
        if idx % 50 == 0:
            print(f"[{epoch}/{num_epochs}] [{idx}/{len(dataloader)}] Loss_D: {loss_d:.4f} Loss_G: {loss_g:.4f} D(x): {D_x:.4f} D(G(z)): {D_G_z1:.4f}/{D_G_z2:.4f}")

        # Save losses for plotting later
        G_losses.append(loss_g.item())
        D_losses.append(loss_d.item())

        # Check how the generator is doing by saving G's output on fixed_noise
        if (iters % 500 == 0) or ((epoch == num_epochs - 1) and (idx == len(dataloader) - 1)):
            with torch.no_grad():
                fake = netG(fixed_noise).detach().cpu()
            img_list.append(torchvision.utils.make_grid(fake, padding = 2, normalize = True))

        iters += 1
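With training done, the lists populated above can be used to inspect progress; a minimal plotting sketch:

# Plot generator and discriminator losses recorded during training
plt.figure(figsize = (10, 5))
plt.title("Generator and Discriminator Loss During Training")
plt.plot(G_losses, label = "G")
plt.plot(D_losses, label = "D")
plt.xlabel("iterations")
plt.ylabel("loss")
plt.legend()
plt.show()

# Show the generator's output on fixed_noise after the last epoch
plt.figure(figsize = (8, 8))
plt.axis("off")
plt.title("Generated Images")
plt.imshow(torch.permute(img_list[-1], (1, 2, 0)))
plt.show()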