Optimization

Date: Jun 20, 2023 05:25 AM
Field: Machine Learning



Key Points

💡
  • Optimization is how Deep Learning models are trained; a well-chosen optimizer helps the model converge to good parameters, although convergence is not guaranteed in general
  • Stochastic Gradient Descent and Momentum are two commonly used optimization techniques
  • RMSProp is an adaptive ("autotune") method which utilises a per-dimension learning rate
  • A poor choice of optimization objective can lead to unforeseen, undesirable consequences

Function Collections

  • Clear the gradients before each backward call (zero_grad)
  • Autotune (adaptive optimizers such as RMSprop)

This note answers the following questions:
  • What is optimization about?
  • What do we optimize?
  • How do we optimize it?

Set up

Here we will use an MLP to recognise handwritten digits, and use different methods to optimize the model.
# Imports
import copy
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import time
import torch
import torchvision
import torchvision.datasets as datasets
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from tqdm.auto import tqdm
Load the data:
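The data-loading cell was not captured in this note. The rest of the note assumes the MNIST training images and labels are available as tensors X and y. A minimal sketch, assuming torchvision's MNIST dataset is used directly (the tutorial's own load_mnist_data helper may differ in normalisation):

# Minimal data-loading sketch (assumption: equivalent to the tutorial's helper).
train_set = datasets.MNIST(root='.', train=True, download=True)
test_set = datasets.MNIST(root='.', train=False, download=True)

# Float images in [0, 1] and integer class labels
X = train_set.data.float() / 255.  # Shape: (60000, 28, 28)
y = train_set.targets              # Shape: (60000,)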

Set up the model:

class MLP(nn.Module):
  """
  This class implements MLPs in Pytorch with an arbitrary number of hidden
  layers of potentially different sizes. Since we concentrate on
  classification tasks in this tutorial, we have a log_softmax layer at
  prediction time.
  """

  def __init__(self, in_dim=784, out_dim=10, hidden_dims=[], use_bias=True):
    """
    Constructs a MultiLayerPerceptron

    Args:
      in_dim: Integer
        Dimensionality of input data (784)
      out_dim: Integer
        Number of classes (10)
      hidden_dims: List
        Dimensions of the hidden layers; an empty list corresponds to a
        linear model (in_dim, out_dim)

    Returns:
      Nothing
    """
    super(MLP, self).__init__()

    self.in_dim = in_dim
    self.out_dim = out_dim

    # If we have no hidden layer, just initialize a linear model (e.g. in logistic regression)
    if len(hidden_dims) == 0:
      layers = [nn.Linear(in_dim, out_dim, bias=use_bias)]
    else:
      # 'Actual' MLP with dimensions in_dim - num_hidden_layers*[hidden_dim] - out_dim
      layers = [nn.Linear(in_dim, hidden_dims[0], bias=use_bias), nn.ReLU()]

      # Loop until before the last layer
      for i, hidden_dim in enumerate(hidden_dims[:-1]):
        layers += [nn.Linear(hidden_dim, hidden_dims[i + 1], bias=use_bias),
                   nn.ReLU()]

      # Add final layer to the number of classes
      layers += [nn.Linear(hidden_dims[-1], out_dim, bias=use_bias)]

    self.main = nn.Sequential(*layers)

  def forward(self, x):
    """
    Defines the network structure and flow from input to output

    Args:
      x: Tensor
        Image to be processed by the network

    Returns:
      output: Tensor
        Log-probabilities over the out_dim classes for each input image
    """
    # Flatten each image into a 'vector'
    transformed_x = x.view(-1, self.in_dim)
    hidden_output = self.main(transformed_x)
    output = F.log_softmax(hidden_output, dim=1)
    return output
Linear models constitute a very special kind of MLP: they are equivalent to an MLP with zero hidden layers. This is simply an affine transformation, in other words a 'linear' map $W x$ with an 'offset' $b$, followed by a softmax function.
Here $W \in \mathbb{R}^{10 \times 784}$, $x \in \mathbb{R}^{784}$ and $b \in \mathbb{R}^{10}$. Notice that the dimensions of the weight matrix are $10 \times 784$ because the input tensors are flattened images, i.e., $28 \times 28 = 784$-dimensional tensors, and the output layer consists of $10$ nodes. Also, note that PyTorch's nn.Linear encapsulates both $W$ and $b$ in a single layer and maps the rows of the input instead of the columns; that is, the $i$-th row of the output is the mapping of the $i$-th row of the input under $W$, plus the bias term (see the short check below). Refer to affine maps here: https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html#affine-maps
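A short check of the row-wise affine map (my own sketch, using the same 784-input, 10-output shapes; the seed is only for reproducibility):

# nn.Linear applies x @ W.T + b to each row of a batch.
torch.manual_seed(0)
affine = nn.Linear(784, 10, bias=True)   # W has shape (10, 784), b has shape (10,)
batch = torch.randn(32, 784)             # 32 flattened 28x28 'images'
out = affine(batch)                      # Shape: (32, 10)

# Row 0 of the output equals W @ batch[0] + b
manual = batch[0] @ affine.weight.T + affine.bias
print(torch.allclose(out[0], manual))  # True (up to floating-point error)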
# Empty hidden_dims means we take a model with zero hidden layers.
model = MLP(in_dim=784, out_dim=10, hidden_dims=[])

# We print the model structure with 784 inputs and 10 outputs
print(model)

Loss

While we care about the accuracy of the model, the 'discrete' nature of the 0-1 loss makes it challenging to optimize. In order to learn good parameters for this model, we will use the cross-entropy loss (negative log-likelihood), which you saw in the last lecture, as a surrogate objective to be minimized.
loss_fn = F.nll_loss
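As a quick sanity check (a hedged sketch, assuming the X and y tensors loaded earlier): a model that spreads probability roughly uniformly over the 10 classes has a loss of about $\log 10 \approx 2.303$, which is the 'Init loss' printed in the training cell below.

# Uniform predictions give a loss of log(10) ~= 2.303.
uniform_logprobs = torch.full((len(y), 10), float(np.log(1 / 10)))
print(loss_fn(uniform_logprobs, y).item())  # ~2.303

untrained_model = MLP(in_dim=784, out_dim=10, hidden_dims=[])
print(loss_fn(untrained_model(X), y).item())  # Also close to 2.303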
Train the model
partial_trained_model = MLP(in_dim=784, out_dim=10, hidden_dims=[])

cell_verbose = True  # Toggle to print the initial loss
if cell_verbose:
  print('Init loss', loss_fn(partial_trained_model(X), y).item())  # This matches around np.log(10 = # of classes)

# Invoke an optimizer using Adaptive gradient and Momentum (more about this in Section 7)
optimizer = optim.Adam(partial_trained_model.parameters(), lr=7e-4)

for _ in range(200):
  loss = loss_fn(partial_trained_model(X), y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

Gradient Descent vs. Random Search

Gradient update

def zero_grad(params):
  """
  Clear gradients as they accumulate on successive backward calls

  Args:
    params: Iterable
      An iterator over tensors, i.e., the weights and biases being updated

  Returns:
    Nothing
  """
  for par in params:
    if not (par.grad is None):
      par.grad.data.zero_()


def gradient_update(loss, params, lr=1e-3):
  """
  Perform a gradient descent update on a given loss over a collection of parameters

  Args:
    loss: Tensor
      A scalar tensor containing the loss through which the gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    lr: Float
      Scalar specifying the learning rate or step-size for the update

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)

  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for par in params:
      # Here we work with the 'data' attribute of the parameter rather than the
      # parameter itself, and use the learning rate and .grad.data for the update
      par.data -= lr * par.grad.data


set_seed(seed=SEED)
model1 = MLP(in_dim=784, out_dim=10, hidden_dims=[])

print('\n The model1 parameters before the update are: \n')
print_params(model1)

loss = loss_fn(model1(X), y)
gradient_update(loss, list(model1.parameters()), lr=1e-1)

print('\n The model1 parameters after the update are: \n')
print_params(model1)

Random update

def random_update(model, noise_scale=0.1, normalized=False):
  """
  Performs a random update on the parameters of the model, to help understand
  how effective updating in random directions is for optimizing the
  parameters of a high-dimensional linear model.

  Args:
    model: nn.Module derived class
      The model whose parameters are to be updated
    noise_scale: Float
      Specifies the magnitude of the random perturbation
    normalized: Bool
      If True, the random direction is normalised to unit norm before scaling

  Returns:
    Nothing
  """
  for par in model.parameters():
    noise = torch.randn_like(par)
    if normalized:
      noise /= torch.norm(noise)
    par.data += noise_scale * noise

Comparison

 
[Figure: comparison of gradient descent and random search updates]
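The original comparison plot is not reproduced here. The sketch below is my own way of generating such a comparison, reusing gradient_update, random_update and the X, y tensors assumed earlier:

# Hedged sketch: loss under gradient updates vs. random updates,
# starting from identical linear models.
model_gd = MLP(in_dim=784, out_dim=10, hidden_dims=[])
model_rand = copy.deepcopy(model_gd)

gd_losses, rand_losses = [], []
for _ in range(100):
  # Gradient descent step
  loss_gd = loss_fn(model_gd(X), y)
  gradient_update(loss_gd, list(model_gd.parameters()), lr=1e-1)
  gd_losses.append(loss_gd.item())

  # Random-direction step
  with torch.no_grad():
    rand_losses.append(loss_fn(model_rand(X), y).item())
  random_update(model_rand, noise_scale=1e-2, normalized=True)

plt.plot(gd_losses, label='Gradient descent')
plt.plot(rand_losses, label='Random search')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.legend()
plt.show()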
 

Poor conditioning

What is poor conditioning

💡
Conditioning measures how rapidly the output of a function changes with tiny changes in its input.
For example, for a linear system $Ax = b$, we can use the inverse matrix $A^{-1}$ to solve for $x = A^{-1}b$.
Nevertheless, this is not commonly done in machine learning because $A^{-1}$ is slow to compute and, worse, may amplify input errors rapidly.
For the function $f(x) = A^{-1}x$ (with $A$ admitting an eigenvalue decomposition), the condition number is defined as:
$$\kappa(A) = \max_{i, j} \left| \frac{\lambda_i}{\lambda_j} \right|,$$
the ratio of the magnitudes of the largest and smallest eigenvalues of $A$.
A poorly conditioned matrix $A$ is a matrix with a high condition number: $A^{-1}$ amplifies input errors, so small errors in $x$ can change the output of $A^{-1}x$ rapidly.
Other methods, such as matrix factorization, can replace explicit matrix inversion to improve numerical stability.
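A small numerical illustration (my own sketch, not from the original note) of how a high condition number amplifies input errors:

# A poorly conditioned 2x2 matrix amplifies small perturbations of b.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
print(np.linalg.cond(A))  # Condition number ~1e6

b = np.array([1.0, 1.0])
b_noisy = b + np.array([0.0, 1e-4])  # Tiny perturbation of the input

x = np.linalg.solve(A, b)
x_noisy = np.linalg.solve(A, b_noisy)
print(x, x_noisy)  # The second component changes by 100, although b changed by only 1e-4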

An intuitive illustration

 
 

Solution: Momentum

We implement the momentum update given by:
$$w_{t+1} = w_t - \eta \nabla L(w_t) + \beta (w_t - w_{t-1})$$
It is convenient to re-express this update rule in terms of a recursion. For that, we define the 'velocity' as the quantity:
$$v_t \equiv w_{t+1} - w_t$$
which leads to the two-step update rule:
$$v_t = -\eta \nabla L(w_t) + \beta v_{t-1}$$
$$w_{t+1} = w_t + v_t$$
Pay attention to the positive sign of the update in the last equation, given the definition of $v_t$ above.
 
def momentum_update(loss, params, grad_vel, lr=1e-3, beta=0.8):
  """
  Perform a momentum update over a collection of parameters given a loss and velocities

  Args:
    loss: Tensor
      A scalar tensor containing the loss through which the gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    grad_vel: Iterable
      Collection containing the 'velocity' v_t for each parameter
    lr: Float
      Scalar specifying the learning rate or step-size for the update
    beta: Float
      Scalar 'momentum' parameter

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)

  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for (par, vel) in zip(params, grad_vel):
      # Update 'velocity': v_t = -lr * grad + beta * v_{t-1}
      vel.data = -lr * par.grad.data + beta * vel.data

      # Update parameters: w_{t+1} = w_t + v_t
      par.data += vel.data
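A hedged usage sketch (my own, reusing the MLP class, X and y from earlier): the velocities must be initialised to zeros with the same shapes as the parameters.

# One velocity tensor per parameter, initialised to zero.
model_mom = MLP(in_dim=784, out_dim=10, hidden_dims=[])
velocities = [torch.zeros_like(par) for par in model_mom.parameters()]

for step in range(100):
  loss = loss_fn(model_mom(X), y)
  momentum_update(loss, list(model_mom.parameters()), velocities, lr=1e-2, beta=0.9)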
 

Non-convexity

 
 

Mini-batches

def sample_minibatch(input_data, target_data, num_points=100):
  """
  Sample a minibatch of size num_points from the provided input-target data

  Args:
    input_data: Tensor
      Multi-dimensional tensor containing the input data
    target_data: Tensor
      1D tensor containing the class labels
    num_points: Integer
      Number of elements to be included in the minibatch, default=100

  Returns:
    batch_inputs: Tensor
      Minibatch inputs
    batch_targets: Tensor
      Minibatch targets
  """
  # Sample a collection of IID indices from the existing data
  batch_indices = np.random.choice(len(input_data), num_points)

  # Use batch_indices to extract entries from the input and target data tensors
  batch_inputs = input_data[batch_indices, :]
  batch_targets = target_data[batch_indices]

  return batch_inputs, batch_targets
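A hedged sketch (my own) of plain SGD built from the pieces above: sample a minibatch, compute the loss on it, and apply gradient_update.

# SGD sketch combining sample_minibatch and gradient_update
# (assumes X, y, loss_fn and the MLP class from earlier in this note).
sgd_model = MLP(in_dim=784, out_dim=10, hidden_dims=[])

for step in range(500):
  batch_X, batch_y = sample_minibatch(X, y, num_points=100)
  loss = loss_fn(sgd_model(batch_X), batch_y)
  gradient_update(loss, list(sgd_model.parameters()), lr=1e-1)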
 

Adaptive Methods / Autotune

 
 

RMSprop

def rmsprop_update(loss, params, grad_sq, lr=1e-3, alpha=0.8, epsilon=1e-8):
  """
  Perform an RMSprop update on a collection of parameters

  Args:
    loss: Tensor
      A scalar tensor containing the loss whose gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    grad_sq: Iterable
      Moving average of squared gradients, one tensor per parameter
    lr: Float
      Scalar specifying the learning rate or step-size for the update
    alpha: Float
      Moving average parameter
    epsilon: Float
      Small constant added for numerical stability

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)

  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for (par, gsq) in zip(params, grad_sq):
      # Update the moving average of squared gradients
      gsq.data = alpha * gsq.data + (1 - alpha) * par.grad**2

      # Update parameters with a per-dimension effective learning rate
      par.data -= lr * (par.grad / (epsilon + gsq.data)**0.5)
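As with momentum, the running squared-gradient buffers are initialised to zeros. A hedged usage sketch (my own, not from the original note):

# One squared-gradient buffer per parameter, initialised to zero.
model_rms = MLP(in_dim=784, out_dim=10, hidden_dims=[])
grad_sq = [torch.zeros_like(par) for par in model_rms.parameters()]

for step in range(100):
  batch_X, batch_y = sample_minibatch(X, y, num_points=100)
  loss = loss_fn(model_rms(batch_X), batch_y)
  rmsprop_update(loss, list(model_rms.parameters()), grad_sq, lr=1e-3, alpha=0.9)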
 

A Testing Framework

# Define helper function to evaluate models
def eval_model(model, data_loader, num_batches=np.inf, device='cpu'):
  """
  Evaluate a given model

  Args:
    model: nn.Module derived class
      The model which is to be evaluated
    data_loader: Iterable
      A configured dataloading utility
    num_batches: Integer
      Maximum number of minibatches to evaluate on
    device: String
      Sets the device. CUDA if available, CPU otherwise

  Returns:
    Mean loss and mean accuracy over the evaluated minibatches
  """
  loss_log, acc_log = [], []
  model.to(device=device)

  # We are just evaluating the model, no need to compute gradients
  with torch.no_grad():
    for batch_id, batch in enumerate(data_loader):
      # If we only evaluate a number of batches, stop after we reach that number
      if batch_id > num_batches:
        break

      # Extract minibatch data
      data, labels = batch[0].to(device), batch[1].to(device)

      # Evaluate model and loss on minibatch
      preds = model(data)
      loss_log.append(loss_fn(preds, labels).item())
      acc_log.append(torch.mean(1. * (preds.argmax(dim=1) == labels)).item())

  return np.mean(loss_log), np.mean(acc_log)

Set up and tune the model

# Create MLP object and update weights with those of saved model
benchmark_model = MLP(in_dim=784, out_dim=10, hidden_dims=[200, 100, 50]).to(DEVICE)
benchmark_model.load_state_dict(benchmark_state_dict)

#################################################
## Adjust training settings ##
# The three parameters below are in your full control
MAX_EPOCHS = 2   # Select number of epochs to train
LR = 1e-5        # Choose the step size
BATCH_SIZE = 64  # Number of examples per minibatch

# Define the model and associated optimizer -- you may change its architecture!
my_model = MLP(in_dim=784, out_dim=10, hidden_dims=[200, 100, 50]).to(DEVICE)

# You can take your pick from many different optimizers
# Check the optimizer documentation and hyperparameter meaning before using!
# More details on Pytorch optimizers: https://pytorch.org/docs/stable/optim.html
# optimizer = torch.optim.SGD(my_model.parameters(), lr=LR, momentum=0.9)
# optimizer = torch.optim.RMSprop(my_model.parameters(), lr=LR, alpha=0.99)
# optimizer = torch.optim.Adagrad(my_model.parameters(), lr=LR)
optimizer = torch.optim.Adam(my_model.parameters(), lr=LR)
#################################################

Train

set_seed(seed=SEED)

# Print training stats every LOG_FREQ minibatches
LOG_FREQ = 200
# Frequency for evaluating the validation metrics
VAL_FREQ = 200

# Load data using a Pytorch Dataset
train_set_orig, test_set_orig = load_mnist_data(change_tensors=False)

# We separate 10,000 training samples to create a validation set
train_set_orig, val_set_orig = torch.utils.data.random_split(train_set_orig, [50000, 10000])

# Create the corresponding DataLoaders for training, validation and test
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

train_loader = torch.utils.data.DataLoader(train_set_orig,
                                           shuffle=True,
                                           batch_size=BATCH_SIZE,
                                           num_workers=2,
                                           worker_init_fn=seed_worker,
                                           generator=g_seed)
val_loader = torch.utils.data.DataLoader(val_set_orig,
                                         shuffle=True,
                                         batch_size=256,
                                         num_workers=2,
                                         worker_init_fn=seed_worker,
                                         generator=g_seed)
test_loader = torch.utils.data.DataLoader(test_set_orig,
                                          batch_size=256,
                                          num_workers=2,
                                          worker_init_fn=seed_worker,
                                          generator=g_seed)

# Run training
metrics = {'train_loss': [],
           'train_acc': [],
           'val_loss': [],
           'val_acc': [],
           'val_idx': []}

step_idx = 0

for epoch in tqdm(range(MAX_EPOCHS)):
  running_loss, running_acc = 0., 0.

  for batch_id, batch in enumerate(train_loader):
    step_idx += 1

    # Extract minibatch data and labels
    data, labels = batch[0].to(DEVICE), batch[1].to(DEVICE)

    # Just like before, refresh gradient accumulators.
    # Note that this is now a method of the optimizer.
    optimizer.zero_grad()

    # Evaluate model and loss on minibatch
    preds = my_model(data)
    loss = loss_fn(preds, labels)
    acc = torch.mean(1.0 * (preds.argmax(dim=1) == labels))

    # Compute gradients
    loss.backward()

    # Update parameters
    # Note how all the magic in the update of the parameters is encapsulated by
    # the optimizer class.
    optimizer.step()

    # Log metrics for plotting
    metrics['train_loss'].append(loss.cpu().item())
    metrics['train_acc'].append(acc.cpu().item())

    if batch_id % VAL_FREQ == (VAL_FREQ - 1):
      # Get an estimate of the validation accuracy with 100 batches
      val_loss, val_acc = eval_model(my_model, val_loader, num_batches=100, device=DEVICE)
      metrics['val_idx'].append(step_idx)
      metrics['val_loss'].append(val_loss)
      metrics['val_acc'].append(val_acc)

      print(f"[VALID] Epoch {epoch + 1} - Batch {batch_id + 1} - "
            f"Loss: {val_loss:.3f} - Acc: {100*val_acc:.3f}%")

    # Print statistics
    running_loss += loss.cpu().item()
    running_acc += acc.cpu().item()

    # Print every LOG_FREQ minibatches
    if batch_id % LOG_FREQ == (LOG_FREQ - 1):
      print(f"[TRAIN] Epoch {epoch + 1} - Batch {batch_id + 1} - "
            f"Loss: {running_loss / LOG_FREQ:.3f} - "
            f"Acc: {100 * running_acc / LOG_FREQ:.3f}%")
      running_loss, running_acc = 0., 0.
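Once training finishes, the held-out test set can be scored with the eval_model helper defined above (a short hedged addition, not in the original note):

# Final evaluation on the held-out test set.
test_loss, test_acc = eval_model(my_model, test_loader, device=DEVICE)
print(f"[TEST] Loss: {test_loss:.3f} - Acc: {100 * test_acc:.3f}%")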

Visualization

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].plot(range(len(metrics['train_loss'])), metrics['train_loss'], alpha=0.8, label='Train')
ax[0].plot(metrics['val_idx'], metrics['val_loss'], label='Valid')
ax[0].set_xlabel('Iteration')
ax[0].set_ylabel('Loss')
ax[0].legend()

ax[1].plot(range(len(metrics['train_acc'])), metrics['train_acc'], alpha=0.8, label='Train')
ax[1].plot(metrics['val_idx'], metrics['val_acc'], label='Valid')
ax[1].set_xlabel('Iteration')
ax[1].set_ylabel('Accuracy')
ax[1].legend()

plt.tight_layout()
plt.show()
[Figure: training and validation loss and accuracy curves produced by the code above]