Optimization

Date: Jun 20, 2023 05:25 AM
Field: Machine Learning



Key Points

💡
  • Optimization is how Deep Learning models are trained; a well-chosen optimizer helps the model converge to good parameters, although convergence is not guaranteed in general
  • Stochastic Gradient Descent and Momentum are two commonly used optimization techniques
  • RMSProp is an adaptive ("autotune") method which utilises a per-dimension learning rate
  • A poor choice of optimization objective can lead to unforeseen, undesirable consequences

Function Collections

  • Clear the gradients before each backward call (zero_grad)
  • Autotune (adaptive optimizers such as RMSprop)

This note answers the following questions:
  • What is optimization about?
  • What do we optimize?
  • How do we optimize it?

Set up

Here we will use an MLP to recognise handwritten digits, and use different methods to optimize the model.
# Imports
import copy
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import time
import torch
import torchvision
import torchvision.datasets as datasets
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from tqdm.auto import tqdm
Load the data:
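The data-loading cell was not captured in this note. The rest of the note assumes the MNIST training images and labels are available as tensors X and y. A minimal sketch, assuming torchvision's MNIST dataset is used directly (the tutorial's own load_mnist_data helper may differ in normalisation):

# Minimal data-loading sketch (assumption: equivalent to the tutorial's helper).
train_set = datasets.MNIST(root='.', train=True, download=True)
test_set = datasets.MNIST(root='.', train=False, download=True)

# Float images in [0, 1] and integer class labels
X = train_set.data.float() / 255.  # Shape: (60000, 28, 28)
y = train_set.targets              # Shape: (60000,)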

Set up the model:

class MLP(nn.Module):
  """
  This class implements MLPs in Pytorch with an arbitrary number of hidden
  layers of potentially different sizes. Since we concentrate on
  classification tasks in this tutorial, we have a log_softmax layer at
  prediction time.
  """

  def __init__(self, in_dim=784, out_dim=10, hidden_dims=[], use_bias=True):
    """
    Constructs a MultiLayerPerceptron

    Args:
      in_dim: Integer
        Dimensionality of input data (784)
      out_dim: Integer
        Number of classes (10)
      hidden_dims: List
        Dimensions of the hidden layers; an empty list corresponds to a
        linear model (in_dim, out_dim)

    Returns:
      Nothing
    """
    super(MLP, self).__init__()

    self.in_dim = in_dim
    self.out_dim = out_dim

    # If we have no hidden layer, just initialize a linear model (e.g. in logistic regression)
    if len(hidden_dims) == 0:
      layers = [nn.Linear(in_dim, out_dim, bias=use_bias)]
    else:
      # 'Actual' MLP with dimensions in_dim - num_hidden_layers*[hidden_dim] - out_dim
      layers = [nn.Linear(in_dim, hidden_dims[0], bias=use_bias), nn.ReLU()]

      # Loop until before the last layer
      for i, hidden_dim in enumerate(hidden_dims[:-1]):
        layers += [nn.Linear(hidden_dim, hidden_dims[i + 1], bias=use_bias),
                   nn.ReLU()]

      # Add final layer to the number of classes
      layers += [nn.Linear(hidden_dims[-1], out_dim, bias=use_bias)]

    self.main = nn.Sequential(*layers)

  def forward(self, x):
    """
    Defines the network structure and flow from input to output

    Args:
      x: Tensor
        Image to be processed by the network

    Returns:
      output: Tensor
        Log-probabilities over the out_dim classes for each input image
    """
    # Flatten each image into a 'vector'
    transformed_x = x.view(-1, self.in_dim)
    hidden_output = self.main(transformed_x)
    output = F.log_softmax(hidden_output, dim=1)
    return output
Linear models constitute a very special kind of MLP: they are equivalent to an MLP with zero hidden layers. This is simply an affine transformation, in other words a 'linear' map $W x$ with an 'offset' $b$, followed by a softmax function.
Here $W \in \mathbb{R}^{10 \times 784}$, $x \in \mathbb{R}^{784}$ and $b \in \mathbb{R}^{10}$. Notice that the dimensions of the weight matrix are $10 \times 784$ because the input tensors are flattened images, i.e., $28 \times 28 = 784$-dimensional tensors, and the output layer consists of $10$ nodes. Also, note that PyTorch's nn.Linear encapsulates both $W$ and $b$ in a single layer and maps the rows of the input instead of the columns; that is, the $i$-th row of the output is the mapping of the $i$-th row of the input under $W$, plus the bias term (see the short check below). Refer to affine maps here: https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html#affine-maps
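A short check of the row-wise affine map (my own sketch, using the same 784-input, 10-output shapes; the seed is only for reproducibility):

# nn.Linear applies x @ W.T + b to each row of a batch.
torch.manual_seed(0)
affine = nn.Linear(784, 10, bias=True)   # W has shape (10, 784), b has shape (10,)
batch = torch.randn(32, 784)             # 32 flattened 28x28 'images'
out = affine(batch)                      # Shape: (32, 10)

# Row 0 of the output equals W @ batch[0] + b
manual = batch[0] @ affine.weight.T + affine.bias
print(torch.allclose(out[0], manual))  # True (up to floating-point error)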
# Empty hidden_dims means we take a model with zero hidden layers.
model = MLP(in_dim=784, out_dim=10, hidden_dims=[])

# We print the model structure with 784 inputs and 10 outputs
print(model)

Loss

While we care about the accuracy of the model, the 'discrete' nature of the 0-1 loss makes it challenging to optimize. In order to learn good parameters for this model, we will use the cross-entropy loss (negative log-likelihood), which you saw in the last lecture, as a surrogate objective to be minimized.
loss_fn = F.nll_loss
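As a quick sanity check (a hedged sketch, assuming the X and y tensors loaded earlier): a model that spreads probability roughly uniformly over the 10 classes has a loss of about $\log 10 \approx 2.303$, which is the 'Init loss' printed in the training cell below.

# Uniform predictions give a loss of log(10) ~= 2.303.
uniform_logprobs = torch.full((len(y), 10), float(np.log(1 / 10)))
print(loss_fn(uniform_logprobs, y).item())  # ~2.303

untrained_model = MLP(in_dim=784, out_dim=10, hidden_dims=[])
print(loss_fn(untrained_model(X), y).item())  # Also close to 2.303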
Train the model
partial_trained_model = MLP(in_dim=784, out_dim=10, hidden_dims=[])

cell_verbose = True  # Toggle to print the initial loss
if cell_verbose:
  print('Init loss', loss_fn(partial_trained_model(X), y).item())  # This matches around np.log(10 = # of classes)

# Invoke an optimizer using Adaptive gradient and Momentum (more about this in Section 7)
optimizer = optim.Adam(partial_trained_model.parameters(), lr=7e-4)

for _ in range(200):
  loss = loss_fn(partial_trained_model(X), y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

Gradient Descent vs. Random Search

Gradient update

def zero_grad(params):
  """
  Clear gradients as they accumulate on successive backward calls

  Args:
    params: Iterable
      An iterator over tensors, i.e., the weights and biases being updated

  Returns:
    Nothing
  """
  for par in params:
    if not (par.grad is None):
      par.grad.data.zero_()


def gradient_update(loss, params, lr=1e-3):
  """
  Perform a gradient descent update on a given loss over a collection of parameters

  Args:
    loss: Tensor
      A scalar tensor containing the loss through which the gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    lr: Float
      Scalar specifying the learning rate or step-size for the update

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)

  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for par in params:
      # Here we work with the 'data' attribute of the parameter rather than the
      # parameter itself, and use the learning rate and .grad.data for the update
      par.data -= lr * par.grad.data


set_seed(seed=SEED)
model1 = MLP(in_dim=784, out_dim=10, hidden_dims=[])

print('\n The model1 parameters before the update are: \n')
print_params(model1)

loss = loss_fn(model1(X), y)
gradient_update(loss, list(model1.parameters()), lr=1e-1)

print('\n The model1 parameters after the update are: \n')
print_params(model1)

Random update

def random_update(model, noise_scale=0.1, normalized=False):
  """
  Performs a random update on the parameters of the model, to help understand
  how effective updating in random directions is for optimizing the
  parameters of a high-dimensional linear model.

  Args:
    model: nn.Module derived class
      The model whose parameters are to be updated
    noise_scale: Float
      Specifies the magnitude of the random perturbation
    normalized: Bool
      If True, the random direction is normalised to unit norm before scaling

  Returns:
    Nothing
  """
  for par in model.parameters():
    noise = torch.randn_like(par)
    if normalized:
      noise /= torch.norm(noise)
    par.data += noise_scale * noise

Comparison

 
[Figure: comparison of gradient descent and random search updates]
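The original comparison plot is not reproduced here. The sketch below is my own way of generating such a comparison, reusing gradient_update, random_update and the X, y tensors assumed earlier:

# Hedged sketch: loss under gradient updates vs. random updates,
# starting from identical linear models.
model_gd = MLP(in_dim=784, out_dim=10, hidden_dims=[])
model_rand = copy.deepcopy(model_gd)

gd_losses, rand_losses = [], []
for _ in range(100):
  # Gradient descent step
  loss_gd = loss_fn(model_gd(X), y)
  gradient_update(loss_gd, list(model_gd.parameters()), lr=1e-1)
  gd_losses.append(loss_gd.item())

  # Random-direction step
  with torch.no_grad():
    rand_losses.append(loss_fn(model_rand(X), y).item())
  random_update(model_rand, noise_scale=1e-2, normalized=True)

plt.plot(gd_losses, label='Gradient descent')
plt.plot(rand_losses, label='Random search')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.legend()
plt.show()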
 

Poor conditioning

What is poor conditioning

💡
Conditioning measures how rapidly the output of a function changes with tiny changes in its input.
For example, for a linear system $Ax = b$, we can use the inverse matrix $A^{-1}$ to solve for $x = A^{-1}b$.
Nevertheless, this is not commonly done in machine learning because $A^{-1}$ is slow to compute and, worse, may amplify input errors rapidly.
For the function $f(x) = A^{-1}x$ (with $A$ admitting an eigenvalue decomposition), the condition number is defined as:
$$\kappa(A) = \max_{i, j} \left| \frac{\lambda_i}{\lambda_j} \right|,$$
the ratio of the magnitudes of the largest and smallest eigenvalues of $A$.
A poorly conditioned matrix $A$ is a matrix with a high condition number: $A^{-1}$ amplifies input errors, so small errors in $x$ can change the output of $A^{-1}x$ rapidly.
Other methods, such as matrix factorization, can replace explicit matrix inversion to improve numerical stability.
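A small numerical illustration (my own sketch, not from the original note) of how a high condition number amplifies input errors:

# A poorly conditioned 2x2 matrix amplifies small perturbations of b.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
print(np.linalg.cond(A))  # Condition number ~1e6

b = np.array([1.0, 1.0])
b_noisy = b + np.array([0.0, 1e-4])  # Tiny perturbation of the input

x = np.linalg.solve(A, b)
x_noisy = np.linalg.solve(A, b_noisy)
print(x, x_noisy)  # The second component changes by 100, although b changed by only 1e-4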

An intuitive illustration

 
 

Solution: Momentum

We implement the momentum update given by:
$$w_{t+1} = w_t - \eta \nabla L(w_t) + \beta (w_t - w_{t-1})$$
It is convenient to re-express this update rule in terms of a recursion. For that, we define the 'velocity' as the quantity:
$$v_t \equiv w_{t+1} - w_t$$
which leads to the two-step update rule:
$$v_t = -\eta \nabla L(w_t) + \beta v_{t-1}$$
$$w_{t+1} = w_t + v_t$$
Pay attention to the positive sign of the update in the last equation, given the definition of $v_t$ above.
 
def momentum_update(loss, params, grad_vel, lr=1e-3, beta=0.8):
  """
  Perform a momentum update over a collection of parameters given a loss and velocities

  Args:
    loss: Tensor
      A scalar tensor containing the loss through which the gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    grad_vel: Iterable
      Collection containing the 'velocity' v_t for each parameter
    lr: Float
      Scalar specifying the learning rate or step-size for the update
    beta: Float
      Scalar 'momentum' parameter

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)

  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for (par, vel) in zip(params, grad_vel):
      # Update 'velocity': v_t = -lr * grad + beta * v_{t-1}
      vel.data = -lr * par.grad.data + beta * vel.data

      # Update parameters: w_{t+1} = w_t + v_t
      par.data += vel.data
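A hedged usage sketch (my own, reusing the MLP class, X and y from earlier): the velocities must be initialised to zeros with the same shapes as the parameters.

# One velocity tensor per parameter, initialised to zero.
model_mom = MLP(in_dim=784, out_dim=10, hidden_dims=[])
velocities = [torch.zeros_like(par) for par in model_mom.parameters()]

for step in range(100):
  loss = loss_fn(model_mom(X), y)
  momentum_update(loss, list(model_mom.parameters()), velocities, lr=1e-2, beta=0.9)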
 

Non-convexity

 
 

Mini-batches

def sample_minibatch(input_data, target_data, num_points=100):
  """
  Sample a minibatch of size num_points from the provided input-target data

  Args:
    input_data: Tensor
      Multi-dimensional tensor containing the input data
    target_data: Tensor
      1D tensor containing the class labels
    num_points: Integer
      Number of elements to be included in the minibatch, default=100

  Returns:
    batch_inputs: Tensor
      Minibatch inputs
    batch_targets: Tensor
      Minibatch targets
  """
  # Sample a collection of IID indices from the existing data
  batch_indices = np.random.choice(len(input_data), num_points)

  # Use batch_indices to extract entries from the input and target data tensors
  batch_inputs = input_data[batch_indices, :]
  batch_targets = target_data[batch_indices]

  return batch_inputs, batch_targets
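A hedged sketch (my own) of plain SGD built from the pieces above: sample a minibatch, compute the loss on it, and apply gradient_update.

# SGD sketch combining sample_minibatch and gradient_update
# (assumes X, y, loss_fn and the MLP class from earlier in this note).
sgd_model = MLP(in_dim=784, out_dim=10, hidden_dims=[])

for step in range(500):
  batch_X, batch_y = sample_minibatch(X, y, num_points=100)
  loss = loss_fn(sgd_model(batch_X), batch_y)
  gradient_update(loss, list(sgd_model.parameters()), lr=1e-1)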
 

Adaptive Methods / Autotune

 
 

RMSprop

def rmsprop_update(loss, params, grad_sq, lr=1e-3, alpha=0.8, epsilon=1e-8):
  """
  Perform an RMSprop update on a collection of parameters

  Args:
    loss: Tensor
      A scalar tensor containing the loss whose gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    grad_sq: Iterable
      Moving average of squared gradients, one tensor per parameter
    lr: Float
      Scalar specifying the learning rate or step-size for the update
    alpha: Float
      Moving average parameter
    epsilon: Float
      Small constant added for numerical stability

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)

  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for (par, gsq) in zip(params, grad_sq):
      # Update the moving average of squared gradients
      gsq.data = alpha * gsq.data + (1 - alpha) * par.grad**2

      # Update parameters with a per-dimension effective learning rate
      par.data -= lr * (par.grad / (epsilon + gsq.data)**0.5)
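As with momentum, the running squared-gradient buffers are initialised to zeros. A hedged usage sketch (my own, not from the original note):

# One squared-gradient buffer per parameter, initialised to zero.
model_rms = MLP(in_dim=784, out_dim=10, hidden_dims=[])
grad_sq = [torch.zeros_like(par) for par in model_rms.parameters()]

for step in range(100):
  batch_X, batch_y = sample_minibatch(X, y, num_points=100)
  loss = loss_fn(model_rms(batch_X), batch_y)
  rmsprop_update(loss, list(model_rms.parameters()), grad_sq, lr=1e-3, alpha=0.9)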
 

A Testing Framework

# Define helper function to evaluate models
def eval_model(model, data_loader, num_batches=np.inf, device='cpu'):
  """
  Evaluate a given model

  Args:
    model: nn.Module derived class
      The model which is to be evaluated
    data_loader: Iterable
      A configured dataloading utility
    num_batches: Integer
      Maximum number of minibatches to evaluate on
    device: String
      Sets the device. CUDA if available, CPU otherwise

  Returns:
    Mean loss and mean accuracy over the evaluated minibatches
  """
  loss_log, acc_log = [], []
  model.to(device=device)

  # We are just evaluating the model, no need to compute gradients
  with torch.no_grad():
    for batch_id, batch in enumerate(data_loader):
      # If we only evaluate a number of batches, stop after we reach that number
      if batch_id > num_batches:
        break

      # Extract minibatch data
      data, labels = batch[0].to(device), batch[1].to(device)

      # Evaluate model and loss on minibatch
      preds = model(data)
      loss_log.append(loss_fn(preds, labels).item())
      acc_log.append(torch.mean(1. * (preds.argmax(dim=1) == labels)).item())

  return np.mean(loss_log), np.mean(acc_log)

Set up and tune the model

# Create MLP object and update weights with those of saved model
benchmark_model = MLP(in_dim=784, out_dim=10, hidden_dims=[200, 100, 50]).to(DEVICE)
benchmark_model.load_state_dict(benchmark_state_dict)

#################################################
## Adjust training settings ##
# The three parameters below are in your full control
MAX_EPOCHS = 2   # Select number of epochs to train
LR = 1e-5        # Choose the step size
BATCH_SIZE = 64  # Number of examples per minibatch

# Define the model and associated optimizer -- you may change its architecture!
my_model = MLP(in_dim=784, out_dim=10, hidden_dims=[200, 100, 50]).to(DEVICE)

# You can take your pick from many different optimizers
# Check the optimizer documentation and hyperparameter meaning before using!
# More details on Pytorch optimizers: https://pytorch.org/docs/stable/optim.html
# optimizer = torch.optim.SGD(my_model.parameters(), lr=LR, momentum=0.9)
# optimizer = torch.optim.RMSprop(my_model.parameters(), lr=LR, alpha=0.99)
# optimizer = torch.optim.Adagrad(my_model.parameters(), lr=LR)
optimizer = torch.optim.Adam(my_model.parameters(), lr=LR)
#################################################

Train

set_seed(seed=SEED)

# Print training stats every LOG_FREQ minibatches
LOG_FREQ = 200
# Frequency for evaluating the validation metrics
VAL_FREQ = 200

# Load data using a Pytorch Dataset
train_set_orig, test_set_orig = load_mnist_data(change_tensors=False)

# We separate 10,000 training samples to create a validation set
train_set_orig, val_set_orig = torch.utils.data.random_split(train_set_orig, [50000, 10000])

# Create the corresponding DataLoaders for training, validation and test
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

train_loader = torch.utils.data.DataLoader(train_set_orig,
                                           shuffle=True,
                                           batch_size=BATCH_SIZE,
                                           num_workers=2,
                                           worker_init_fn=seed_worker,
                                           generator=g_seed)
val_loader = torch.utils.data.DataLoader(val_set_orig,
                                         shuffle=True,
                                         batch_size=256,
                                         num_workers=2,
                                         worker_init_fn=seed_worker,
                                         generator=g_seed)
test_loader = torch.utils.data.DataLoader(test_set_orig,
                                          batch_size=256,
                                          num_workers=2,
                                          worker_init_fn=seed_worker,
                                          generator=g_seed)

# Run training
metrics = {'train_loss': [],
           'train_acc': [],
           'val_loss': [],
           'val_acc': [],
           'val_idx': []}

step_idx = 0

for epoch in tqdm(range(MAX_EPOCHS)):
  running_loss, running_acc = 0., 0.

  for batch_id, batch in enumerate(train_loader):
    step_idx += 1

    # Extract minibatch data and labels
    data, labels = batch[0].to(DEVICE), batch[1].to(DEVICE)

    # Just like before, refresh gradient accumulators.
    # Note that this is now a method of the optimizer.
    optimizer.zero_grad()

    # Evaluate model and loss on minibatch
    preds = my_model(data)
    loss = loss_fn(preds, labels)
    acc = torch.mean(1.0 * (preds.argmax(dim=1) == labels))

    # Compute gradients
    loss.backward()

    # Update parameters
    # Note how all the magic in the update of the parameters is encapsulated by
    # the optimizer class.
    optimizer.step()

    # Log metrics for plotting
    metrics['train_loss'].append(loss.cpu().item())
    metrics['train_acc'].append(acc.cpu().item())

    if batch_id % VAL_FREQ == (VAL_FREQ - 1):
      # Get an estimate of the validation accuracy with 100 batches
      val_loss, val_acc = eval_model(my_model, val_loader, num_batches=100, device=DEVICE)
      metrics['val_idx'].append(step_idx)
      metrics['val_loss'].append(val_loss)
      metrics['val_acc'].append(val_acc)

      print(f"[VALID] Epoch {epoch + 1} - Batch {batch_id + 1} - "
            f"Loss: {val_loss:.3f} - Acc: {100*val_acc:.3f}%")

    # Print statistics
    running_loss += loss.cpu().item()
    running_acc += acc.cpu().item()

    # Print every LOG_FREQ minibatches
    if batch_id % LOG_FREQ == (LOG_FREQ - 1):
      print(f"[TRAIN] Epoch {epoch + 1} - Batch {batch_id + 1} - "
            f"Loss: {running_loss / LOG_FREQ:.3f} - "
            f"Acc: {100 * running_acc / LOG_FREQ:.3f}%")
      running_loss, running_acc = 0., 0.
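Once training finishes, the held-out test set can be scored with the eval_model helper defined above (a short hedged addition, not in the original note):

# Final evaluation on the held-out test set.
test_loss, test_acc = eval_model(my_model, test_loader, device=DEVICE)
print(f"[TEST] Loss: {test_loss:.3f} - Acc: {100 * test_acc:.3f}%")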

Visualization

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].plot(range(len(metrics['train_loss'])), metrics['train_loss'], alpha=0.8, label='Train')
ax[0].plot(metrics['val_idx'], metrics['val_loss'], label='Valid')
ax[0].set_xlabel('Iteration')
ax[0].set_ylabel('Loss')
ax[0].legend()

ax[1].plot(range(len(metrics['train_acc'])), metrics['train_acc'], alpha=0.8, label='Train')
ax[1].plot(metrics['val_idx'], metrics['val_acc'], label='Valid')
ax[1].set_xlabel('Iteration')
ax[1].set_ylabel('Accuracy')
ax[1].legend()

plt.tight_layout()
plt.show()
[Figure: training and validation loss and accuracy curves produced by the code above]