Regularization

Date
Jun 20, 2023 05:25 AM
Field
Machine Learning
Key Points

💡
Shrinkage is Regularization
  • The goal of supervised learning is generalization
  • Regularization controls overfitting in overparameterized models
  • We split the data into train, validation, test to prevent test set overfitting
  • Regularization methods: shrinkage, L1 and L2 penalties, early stopping, data augmentation, dropout, and SGD itself
  • Hyperparameter tuning is critical and expensive
  • The learning rate of SGD can act as a regularizer

Function Collections


Intro

notion image
Too simple a model “underfits”: it fails to capture the signal in the data.
Too complex a model “overfits”: it fits the noise in the data and so generalizes poorly.
Why regularize?
  • To get smaller weights and thus smoother models
  • To prevent overfitting
How?
  • L2 penalties shrink the weights toward zero
  • L1 penalties can set some of them exactly to zero

Regularization as Shrinkage

A key idea of neural nets is that they use models that are "too complex" - complex enough to fit all the noise in the data. One then needs to "regularize" them so that they are complex enough, but not too complex. The more complex the model, the better it fits the training data; but if it is too complex, it generalizes less well: it memorizes the training data but is less accurate on future test data.
One way to think about regularization is in terms of the overall magnitude of the model's weights. A model with big weights can fit the training data perfectly, whereas a model with smaller weights tends to underperform on the train set but can, perhaps surprisingly, do very well on the test set. Making the weights too small can also be an issue, as the model can then underfit.
In these tutorials, we use the Frobenius norm over all the tensors in the model as a measure of the "size of the model".

Frobenius Norm

💡
Measures how big a matrix is.
Before we start, let's define the Frobenius norm (sometimes also called the Euclidean norm) of an $m \times n$ matrix $A$ as the square root of the sum of the absolute squares of its elements.
 
💡
Square root of the sum of the squares of every element in the matrix.
This is just a measure of how big the matrix is, analogous to how big a vector is.
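Written out, the definition above reads:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2}$$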
def calculate_frobenius_norm(model):
    """
    Calculate the Frobenius norm of a model's parameters

    Args:
      model: nn.Module
        Neural network instance

    Returns:
      norm: float
        Frobenius norm over all parameters
      ws: list
        Norm of each parameter tensor
      labels: list
        Name of each parameter tensor
    """
    # Initialization of variables
    norm, ws, labels = 0.0, [], []

    # Sum the squares of all the parameters
    for name, parameters in model.named_parameters():
        p = torch.sum(parameters**2)
        norm += p
        ws.append((p**0.5).cpu().detach().numpy())
        labels.append(name)

    # Take the square root of the sum of squares of all the parameters
    norm = (norm**0.5).cpu().detach().numpy()
    return norm, ws, labels


set_seed(SEED)
net = nn.Linear(10, 1)
norm, ws, labels = calculate_frobenius_norm(net)
print(f'Frobenius norm of Single Linear Layer: {norm:.4f}')

# Plot the weights
plot_weights(norm, labels, ws)

Overfit

Overfit train set

notion image
💡
Performs perfectly on the train set but poorly on the test set, because the model is learning the noise.
notion image

Overfitting the test dataset

In principle, we should not touch our test set until choosing all our hyperparameters. Were we to use the test data in the model selection process, there is a risk that we might overfit the test data, and then we will be in serious trouble. If we overfit our training data, there is always an evaluation using the test data to keep us honest. But if we overfit the test data, how would we ever know?
Note that there is another kind of overfitting: you do "honest" fitting on one set of images or posts or medical records, but it may not generalize to other images, posts, or medical records.
Validation Dataset
A common practice to address this problem is to split our data three ways, using a validation dataset (or validation set) to tune the hyperparameters. Ideally, we would only touch the test data once, to assess the very best model or to compare a small number of models to each other; in practice, however, real-world test data is seldom discarded after just one use.
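As a rough sketch of such a three-way split (the toy dataset and split sizes here are purely illustrative, not the tutorial's actual data):

import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 1000 labelled examples (illustrative only)
X = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(X, y)

# Three-way split: tune hyperparameters on val_set, touch test_set only once at the end
g = torch.Generator().manual_seed(0)
train_set, val_set, test_set = random_split(dataset, [700, 150, 150], generator=g)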

Memorization

Given sufficiently large networks and enough training, neural networks can achieve almost 100% train accuracy by memorizing each training example. This is bad because it means the model will fail when presented with new data.
In this section, we train three MLPs; one each on:
1. Animal Faces Dataset
2. A Completely Noisy Dataset (Random shuffling of all labels)
3. A partially Noisy Dataset (Random shuffling of 15% labels)
notion image
Isn't it surprising to see that the ANN was able to achieve 100% training accuracy on randomly shuffled labels? This is one of the reasons why training accuracy is not a good indicator of model performance.
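A minimal sketch of how such noisy labels can be created by shuffling (the tensor name and fraction are illustrative):

import torch

# Illustrative labels for a 3-class problem
labels = torch.randint(0, 3, (1000,))

# Shuffle a fraction of the labels; frac=1.0 gives a completely noisy dataset,
# frac=0.15 a partially noisy one as in the experiment above
frac = 0.15
idx = torch.randperm(len(labels))[: int(frac * len(labels))]
labels[idx] = labels[idx][torch.randperm(len(idx))]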
 

Regularization Techniques

Adding a regularization term to the loss makes the parameters smaller, giving simpler models that overfit less.
 

Early Stopping

Now that we have established that the validation accuracy reaches its peak well before the model overfits, we want to stop training early. You should also have observed from the plots above that the train/test loss on real data is not very smooth, so the choice of epoch can play a crucial role in the validation/test accuracy.
💡
Early stopping stops training when the validation accuracy stops increasing.
notion image
Define a main function with early stopping:
def early_stopping_main(args, model, train_loader, val_loader):
    """
    Function to simulate early stopping

    Args:
      args: dictionary
        Training settings (epochs, lr, momentum, device)
      model: nn.Module
        Neural network instance
      train_loader: torch.utils.data.DataLoader
        Training set
      val_loader: torch.utils.data.DataLoader
        Validation set

    Returns:
      val_acc_list: list
        Validation accuracy log until the early-stopping point
      train_acc_list: list
        Training accuracy log until the early-stopping point
      best_model: nn.Module
        Best-performing model under early stopping
      best_epoch: int
        Epoch at which early stopping occurs
    """
    device = args['device']
    model = model.to(device)
    optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])

    best_acc = 0.0
    best_epoch = 0
    # Number of successive epochs to wait before stopping the training process
    patience = 20
    # Number of epochs during which val_acc has been below best_acc
    wait = 0

    val_acc_list, train_acc_list = [], []
    for epoch in tqdm(range(args['epochs'])):
        # Train the model for one epoch
        trained_model = train(args, model, train_loader, optimizer)
        # Calculate training accuracy
        train_acc = test(trained_model, train_loader, device=device)
        # Calculate validation accuracy
        val_acc = test(trained_model, val_loader, device=device)
        if val_acc > best_acc:
            best_acc = val_acc
            best_epoch = epoch
            best_model = copy.deepcopy(trained_model)
            wait = 0
        else:
            wait += 1
        if wait > patience:
            print(f'Early stopped on epoch: {epoch}')
            break
        train_acc_list.append(train_acc)
        val_acc_list.append(val_acc)

    return val_acc_list, train_acc_list, best_model, best_epoch


# Set the arguments
args = {
    'epochs': 200,
    'lr': 5e-4,
    'momentum': 0.99,
    'device': DEVICE
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNet()

# Run training with early stopping
val_acc_earlystop, train_acc_earlystop, best_model, best_epoch = early_stopping_main(args, model, train_loader, val_loader)
print(f'Maximum Validation Accuracy is reached at epoch: {best_epoch:2d}')

with plt.xkcd():
    early_stop_plot(train_acc_earlystop, val_acc_earlystop, best_epoch)

L1 and L2 Regularization

💡
The unregularized loss (sometimes written $L_0$) is the usual data-fitting term, e.g. the least-squares error (LSE) or cross-entropy; L1 and L2 add penalty terms to it.

L1 (LASSO)

💡
Sum of the absolute value of each parameter.
💡
LASSO means Least Absolute Shrinkage and Selection Operator, the same theory behind the LASSO regression
L1 Regularization (or LASSO) uses a penalty which is the sum of the absolute values of all the weights in the Deep Learning architecture, resulting in the following loss function ($L$ is the usual Cross-Entropy loss):

$$L_R = L + \lambda \sum_{r} \sum_{i,j} \left| w_{ij}^{(r)} \right|$$

where $r$ denotes the layer, and $ij$ the specific weight in that layer.
At a high level, L1 Regularization is similar to L2 Regularization since it leads to smaller weights. It results in the following weight update equation when using Stochastic Gradient Descent:

$$w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \eta \frac{\partial L}{\partial w_{ij}^{(r)}} - \eta \lambda \, \mathrm{sgn}\!\left(w_{ij}^{(r)}\right)$$

where $\mathrm{sgn}$ is the sign function ($+1$ for positive weights, $-1$ for negative weights, and $0$ at zero):
notion image
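A minimal sketch of an L1 penalty in PyTorch; the helper name l1_reg mirrors the one used in the implementation further below, but the actual tutorial helper may differ:

import torch
import torch.nn as nn

def l1_reg(model):
    """L1 penalty: sum of the absolute values of all model parameters."""
    return sum(p.abs().sum() for p in model.parameters())

# Illustrative usage with a toy model and batch
model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
lambda1 = 1e-3  # regularization strength (illustrative value)
loss = criterion(model(x), y) + lambda1 * l1_reg(model)
loss.backward()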

L2 (Ridge)

💡
Sum of the square of each parameter.
L2 Regularization (or Ridge), also referred to as “Weight Decay”, is widely used. It works by adding a quadratic penalty term to the Cross-Entropy Loss Function $L$, which results in a new Loss Function $L_R$ given by:

$$L_R = L + \frac{\lambda}{2} \sum_{r} \sum_{i,j} \left( w_{ij}^{(r)} \right)^2$$

where, again, the $r$ superscript denotes the layer, and $ij$ the specific weight in that layer.
To get further insight into L2 Regularization, we investigate its effect on the Gradient Descent based update equations for the weight and bias parameters. Taking the derivative on both sides of the above equation, we obtain

$$\frac{\partial L_R}{\partial w_{ij}^{(r)}} = \frac{\partial L}{\partial w_{ij}^{(r)}} + \lambda\, w_{ij}^{(r)}$$

Thus the weight update rule becomes:

$$w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \eta \frac{\partial L}{\partial w_{ij}^{(r)}} - \eta \lambda\, w_{ij}^{(r)} = (1 - \eta \lambda)\, w_{ij}^{(r)} - \eta \frac{\partial L}{\partial w_{ij}^{(r)}}$$

where $\eta$ is the learning rate.
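A minimal sketch of an L2 penalty, analogous to the L1 sketch above. Note that PyTorch optimizers can also apply the equivalent weight decay directly via their weight_decay argument; the helper name l2_reg mirrors the implementation below but is not necessarily the tutorial's exact code.

import torch
import torch.nn as nn
import torch.optim as optim

def l2_reg(model):
    """L2 penalty: sum of the squares of all model parameters."""
    return sum((p ** 2).sum() for p in model.parameters())

# Option 1: add the penalty to the loss explicitly (toy model and batch for illustration)
model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
lambda2 = 1e-3  # illustrative value
loss = criterion(model(x), y) + lambda2 * l2_reg(model)
loss.backward()

# Option 2: equivalent-in-spirit weight decay built into the optimizer
optimizer = optim.SGD(model.parameters(), lr=5e-3, momentum=0.99, weight_decay=lambda2)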

Implementation

Implement all the regularization functions together:
args3 = {
    'test_batch_size': 1000,
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'device': DEVICE,
    'lambda1': 0.001,
    'lambda2': 0.001
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNet()

# Train with both L1 and L2 regularization
val_acc_l1l2reg, train_acc_l1l2reg, param_norm_l1l2reg, _ = main(args3,
                                                                 model,
                                                                 train_loader,
                                                                 val_loader,
                                                                 img_test_dataset,
                                                                 reg_function1=l1_reg,
                                                                 reg_function2=l2_reg)
notion image
notion image

Dropout

💡
Typically used in CNNs
With Dropout, we literally drop out (zero out) some neurons during training. On each training iteration, standard dropout zeros out a random fraction (usually 50%) of the nodes in each layer before computing the subsequent layer. Randomly selecting a different subset to drop out on every iteration introduces noise into the process and reduces overfitting.
notion image
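A quick illustration of the behaviour (not part of the tutorial code): PyTorch's nn.Dropout zeroes activations only in training mode and rescales the survivors by 1/(1-p); in eval mode it is a no-op.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()      # training mode: roughly half the entries become 0, the rest become 2.0 (scaled by 1/(1-p))
print(drop(x))

drop.eval()       # evaluation mode: dropout does nothing
print(drop(x))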
Now let's revisit the toy dataset we generated above to visualize how the Dropout stabilizes training on a noisy dataset. We will slightly modify the architecture we used above to add dropout layers.

Implementation

class AnimalNetDropout(nn.Module):
    """
    Network Class - Animal Faces with the following structure:
    nn.Linear(3*32*32, 248) + leaky_relu(self.dropout1(self.fc1(x)))  # First fully connected layer with 0.5 dropout
    nn.Linear(248, 210) + leaky_relu(self.dropout2(self.fc2(x)))  # Second fully connected layer with 0.3 dropout
    nn.Linear(210, 3)  # Final fully connected layer
    """

    def __init__(self):
        """
        Initialize parameters of AnimalNetDropout

        Args:
          None

        Returns:
          Nothing
        """
        super(AnimalNetDropout, self).__init__()
        self.fc1 = nn.Linear(3*32*32, 248)
        self.fc2 = nn.Linear(248, 210)
        self.fc3 = nn.Linear(210, 3)
        self.dropout1 = nn.Dropout(p=0.5)  # Dropout layer after fc1
        self.dropout2 = nn.Dropout(p=0.3)  # Dropout layer after fc2

    def forward(self, x):
        # Forward pass following the structure described in the class docstring
        # (assumes F = torch.nn.functional is imported)
        x = x.view(x.shape[0], -1)  # Flatten the image batch
        x = F.leaky_relu(self.dropout1(self.fc1(x)))
        x = F.leaky_relu(self.dropout2(self.fc2(x)))
        output = self.fc3(x)
        return output

Data Augmentation

💡
Typically used in image classification problems
Data augmentation is often used to increase the number of training samples. Now we will explore the effect of data augmentation on regularization. Here, regularization is achieved by adding noise to the training data (via random transformations) every epoch.
PyTorch's torchvision module provides a few built-in data augmentation techniques, which we can use on image datasets. Some of the techniques we most frequently use are:
  • Random Crop
  • Random Rotate
  • Vertical Flip
  • Horizontal Flip
 
Define a DataLoader using torchvision.transforms, which randomly augments the data for us.

an example

# Data Augmentation using transforms
new_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.1),
    transforms.RandomVerticalFlip(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

data_path = pathlib.Path('.')/'afhq'  # Using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=new_transforms)

# Splitting dataset
new_train_data, _, _ = torch.utils.data.random_split(img_dataset, [250, 100, 14280])

# For reproducibility
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

# Creating train_loader
new_train_loader = torch.utils.data.DataLoader(new_train_data,
                                               batch_size=batch_size,
                                               worker_init_fn=seed_worker,
                                               generator=g_seed)

Stochastic Gradient Descent

  • Initialize with small random weights
  • Weights get bigger as one iterates
  • Use early stopping to avoid overfitting
SGD finds ‘good’ minima:
  • Deep learning often uses more parameters than observations, so it ‘should’ massively overfit
  • Yet deep learning on CIFAR10 and ImageNet gives 0 training error with small test error, and 0 training error even with randomized labels
  • SGD converges to flat minima, which generalize better
Gradient descent is magic: it tends to find ‘good’ (smooth/regularized) solutions (Zhang, Bengio, Hardt, Recht, and Vinyals, 2017).
SGD is one of the best choices in this scenario.
 
  • Smaller learning rates regularize less and slowly converge to deep minima.
  • Larger learning rates regularize more by missing local minima and converging to broader, flatter minima, which often generalize better.
💡
Again, learning rate is very important
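As a small, purely illustrative sketch (the model and values are hypothetical): the learning rate is simply the lr argument of the optimizer, and per the points above, a larger value makes SGD's updates noisier and biases it toward flatter minima.

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# Small learning rate: slow convergence into deep, narrow minima; weaker implicit regularization
opt_small = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# Larger learning rate: noisier updates that tend to settle in broader, flatter minima
opt_large = optim.SGD(model.parameters(), lr=1e-1, momentum=0.9)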

batch size vs learning rate

notion image

Hyperparameter Tuning

Hyperparameter tuning is often tricky and time-consuming, and it is a vital part of training any Deep Learning model to give good generalization. There are a few techniques that we can use to guide us during the search.
  • **Grid Search**: Try all possible combinations of hyperparameters
    • This method is typically too expensive in practice
  • **Random Search**: Randomly try different combinations of hyperparameters
    • does not sound good 🤣
  • **Coordinate-wise Gradient Descent**: Start at one set of hyperparameters and try changing one at a time, accepting any changes that reduce your validation error
    • This one is widely used
  • **Bayesian Optimization / Auto ML**: Start from a set of hyperparameters that have worked well on a similar problem, and then do some sort of local exploration (e.g., gradient descent) from there.
    • Automated machine learning based on previously successful deep nets
There are many choices, like what range to explore over, which parameter to optimize first, etc. Some hyperparameters don’t matter much (people use a dropout of either 0.5 or 0.2, but not much else). Others can matter a lot more (e.g., size and depth of the neural net). The key is to see what worked on similar problems.
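A minimal sketch of random search over such a space (the search space, budget, and helper are illustrative; train_and_validate is a hypothetical placeholder for whatever training/validation routine is available):

import random

# Hypothetical search space
space = {
    'lr': [1e-4, 5e-4, 1e-3, 5e-3],
    'dropout': [0.2, 0.5],
    'momentum': [0.9, 0.99],
}

def train_and_validate(cfg):
    """Placeholder: in practice, train a model with cfg and return its validation accuracy."""
    return random.random()

best_acc, best_cfg = 0.0, None
for _ in range(10):  # fixed budget of 10 random trials
    cfg = {k: random.choice(v) for k, v in space.items()}
    val_acc = train_and_validate(cfg)
    if val_acc > best_acc:
        best_acc, best_cfg = val_acc, cfg

print(best_cfg, best_acc)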
One can automate the process of tuning the network architecture using so-called *Neural Architecture Search (NAS)*. NAS designs new architectures using a few building blocks (Linear layers, Convolutional layers, etc.) and optimizes the design based on performance, using a wide range of techniques such as Grid Search, Reinforcement Learning, Gradient Descent, Evolutionary Algorithms, etc. This obviously requires very high computing power. Read this [article](https://lilianweng.github.io/lil-log/2020/08/06/neural-architecture-search.html) to learn more about NAS.

Further Readings