Regularization

Date
Jun 20, 2023 05:25 AM
Field
Machine Learning
Key Points

💡
Shrinkage is Regularization
  • The goal of supervised learning is generalization
  • Regularization controls overfitting in overparameterized models
  • We split the data into train, validation, test to prevent test set overfitting
  • Regularization methods: shrinkage, L1 and L2 penalties, early stopping, data augmentation, dropout, and SGD itself
  • Hyperparameter tuning is critical and expensive
  • The learning rate of SGD can act as a regularizer

Function Collections


Intro

notion image
Too simple a model “underfits”: it fails to capture the signal in the data.
Too complex a model “overfits”: it fits the noise in the data and so generalizes poorly.
Why regularize?
  • To get smaller weights and thus smoother models
  • To prevent overfitting
How?
  • L2 penalties shrink the weights toward zero
  • L1 penalties can set some of them exactly to zero

Regularization as Shrinkage

A key idea of neural nets is that they use models that are "too complex" - complex enough to fit all the noise in the data. One then needs to "regularize" them so that they are complex enough, but not too complex. The more complex the model, the better it fits the training data; but if it is too complex, it generalizes less well: it memorizes the training data but is less accurate on future test data.
One way to think about regularization is in terms of the overall magnitude of the model's weights. A model with big weights can fit the training data perfectly, whereas a model with smaller weights tends to underperform on the train set but can, perhaps surprisingly, do very well on the test set. Making the weights too small can also be an issue, as the model can then underfit.
In these tutorials, we use the Frobenius norm over all the tensors in the model as a measure of the "size of the model".

Frobenius Norm

💡
Measures how big a matrix is.
Before we start, let's define the Frobenius norm (sometimes also called the Euclidean norm) of an $m \times n$ matrix $A$ as the square root of the sum of the absolute squares of its elements.
 
💡
Square root of the sum of the squares of every element in the matrix.
This is just a measure of how big the matrix is, analogous to how big a vector is.
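Written out, the definition above reads:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2}$$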
def calculate_frobenius_norm(model):
    """
    Calculate the Frobenius norm of a model's parameters

    Args:
      model: nn.Module
        Neural network instance

    Returns:
      norm: float
        Frobenius norm over all parameters
      ws: list
        Norm of each parameter tensor
      labels: list
        Name of each parameter tensor
    """
    # Initialization of variables
    norm, ws, labels = 0.0, [], []

    # Sum the squares of all the parameters
    for name, parameters in model.named_parameters():
        p = torch.sum(parameters**2)
        norm += p
        ws.append((p**0.5).cpu().detach().numpy())
        labels.append(name)

    # Take the square root of the sum of squares of all the parameters
    norm = (norm**0.5).cpu().detach().numpy()
    return norm, ws, labels


set_seed(SEED)
net = nn.Linear(10, 1)
norm, ws, labels = calculate_frobenius_norm(net)
print(f'Frobenius norm of Single Linear Layer: {norm:.4f}')

# Plot the weights
plot_weights(norm, labels, ws)

Overfit

Overfit train set

notion image
💡
Performs perfectly on the train set but poorly on the test set, because the model is learning the noise.
notion image

Overfitting the test dataset

In principle, we should not touch our test set until choosing all our hyperparameters. Were we to use the test data in the model selection process, there is a risk that we might overfit the test data, and then we will be in serious trouble. If we overfit our training data, there is always an evaluation using the test data to keep us honest. But if we overfit the test data, how would we ever know?
Note that there is another kind of overfitting: you do "honest" fitting on one set of images or posts or medical records, but it may not generalize to other images, posts, or medical records.
Validation Dataset
A common practice to address this problem is to split our data three ways, using a validation dataset (or validation set) to tune the hyperparameters. Ideally, we would only touch the test data once, to assess the very best model or to compare a small number of models to each other; in practice, however, real-world test data is seldom discarded after just one use.
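As a rough sketch of such a three-way split (the toy dataset and split sizes here are purely illustrative, not the tutorial's actual data):

import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 1000 labelled examples (illustrative only)
X = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(X, y)

# Three-way split: tune hyperparameters on val_set, touch test_set only once at the end
g = torch.Generator().manual_seed(0)
train_set, val_set, test_set = random_split(dataset, [700, 150, 150], generator=g)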

Memorization

Given sufficiently large networks and enough training, neural networks can achieve almost 100% train accuracy by memorizing each training example. This is bad because it means the model will fail when presented with new data.
In this section, we train three MLPs; one each on:
1. Animal Faces Dataset
2. A Completely Noisy Dataset (Random shuffling of all labels)
3. A partially Noisy Dataset (Random shuffling of 15% labels)
notion image
Isn't it surprising to see that the ANN was able to achieve 100% training accuracy on randomly shuffled labels? This is one of the reasons why training accuracy is not a good indicator of model performance.
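A minimal sketch of how such noisy labels can be created by shuffling (the tensor name and fraction are illustrative):

import torch

# Illustrative labels for a 3-class problem
labels = torch.randint(0, 3, (1000,))

# Shuffle a fraction of the labels; frac=1.0 gives a completely noisy dataset,
# frac=0.15 a partially noisy one as in the experiment above
frac = 0.15
idx = torch.randperm(len(labels))[: int(frac * len(labels))]
labels[idx] = labels[idx][torch.randperm(len(idx))]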
 

Regularization Techniques

Adding a regularization term to the loss makes the parameters smaller, giving simpler models that overfit less.
 

Early Stopping

Now that we have established that the validation accuracy reaches its peak well before the model overfits, we want to stop training early. You should also have observed from the plots above that the train/test loss on real data is not very smooth, so the choice of epoch can play a crucial role in the validation/test accuracy.
💡
Early stopping stops training when the validation accuracy stops increasing.
notion image
Define a main function with early stopping:
def early_stopping_main(args, model, train_loader, val_loader):
    """
    Function to simulate early stopping

    Args:
      args: dictionary
        Training settings (epochs, lr, momentum, device)
      model: nn.Module
        Neural network instance
      train_loader: torch.utils.data.DataLoader
        Training set
      val_loader: torch.utils.data.DataLoader
        Validation set

    Returns:
      val_acc_list: list
        Validation accuracy log until the early-stopping point
      train_acc_list: list
        Training accuracy log until the early-stopping point
      best_model: nn.Module
        Best-performing model under early stopping
      best_epoch: int
        Epoch at which early stopping occurs
    """
    device = args['device']
    model = model.to(device)
    optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])

    best_acc = 0.0
    best_epoch = 0
    # Number of successive epochs to wait before stopping the training process
    patience = 20
    # Number of epochs during which val_acc has been below best_acc
    wait = 0

    val_acc_list, train_acc_list = [], []
    for epoch in tqdm(range(args['epochs'])):
        # Train the model for one epoch
        trained_model = train(args, model, train_loader, optimizer)
        # Calculate training accuracy
        train_acc = test(trained_model, train_loader, device=device)
        # Calculate validation accuracy
        val_acc = test(trained_model, val_loader, device=device)
        if val_acc > best_acc:
            best_acc = val_acc
            best_epoch = epoch
            best_model = copy.deepcopy(trained_model)
            wait = 0
        else:
            wait += 1
        if wait > patience:
            print(f'Early stopped on epoch: {epoch}')
            break
        train_acc_list.append(train_acc)
        val_acc_list.append(val_acc)

    return val_acc_list, train_acc_list, best_model, best_epoch


# Set the arguments
args = {
    'epochs': 200,
    'lr': 5e-4,
    'momentum': 0.99,
    'device': DEVICE
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNet()

# Run training with early stopping
val_acc_earlystop, train_acc_earlystop, best_model, best_epoch = early_stopping_main(args, model, train_loader, val_loader)
print(f'Maximum Validation Accuracy is reached at epoch: {best_epoch:2d}')

with plt.xkcd():
    early_stop_plot(train_acc_earlystop, val_acc_earlystop, best_epoch)

L1 and L2 Regularization

💡
The unregularized loss (sometimes written $L_0$) is the usual data-fitting term, e.g. the least-squares error (LSE) or cross-entropy; L1 and L2 add penalty terms to it.

L1 (LASSO)

💡
Sum of the absolute value of each parameter.
💡
LASSO means Least Absolute Shrinkage and Selection Operator, the same theory behind the LASSO regression
L1 Regularization (or LASSO) uses a penalty which is the sum of the absolute values of all the weights in the Deep Learning architecture, resulting in the following loss function ($L$ is the usual Cross-Entropy loss):

$$L_R = L + \lambda \sum_{r} \sum_{i,j} \left| w_{ij}^{(r)} \right|$$

where $r$ denotes the layer, and $ij$ the specific weight in that layer.
At a high level, L1 Regularization is similar to L2 Regularization since it leads to smaller weights. It results in the following weight update equation when using Stochastic Gradient Descent:

$$w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \eta \frac{\partial L}{\partial w_{ij}^{(r)}} - \eta \lambda \, \mathrm{sgn}\!\left(w_{ij}^{(r)}\right)$$

where $\mathrm{sgn}$ is the sign function ($+1$ for positive weights, $-1$ for negative weights, and $0$ at zero):
notion image
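A minimal sketch of an L1 penalty in PyTorch; the helper name l1_reg mirrors the one used in the implementation further below, but the actual tutorial helper may differ:

import torch
import torch.nn as nn

def l1_reg(model):
    """L1 penalty: sum of the absolute values of all model parameters."""
    return sum(p.abs().sum() for p in model.parameters())

# Illustrative usage with a toy model and batch
model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
lambda1 = 1e-3  # regularization strength (illustrative value)
loss = criterion(model(x), y) + lambda1 * l1_reg(model)
loss.backward()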

L2 (Ridge)

💡
Sum of the square of each parameter.
L2 Regularization (or Ridge), also referred to as “Weight Decay”, is widely used. It works by adding a quadratic penalty term to the Cross-Entropy Loss Function $L$, which results in a new Loss Function $L_R$ given by:

$$L_R = L + \frac{\lambda}{2} \sum_{r} \sum_{i,j} \left( w_{ij}^{(r)} \right)^2$$

where, again, the $r$ superscript denotes the layer, and $ij$ the specific weight in that layer.
To get further insight into L2 Regularization, we investigate its effect on the Gradient Descent based update equations for the weight and bias parameters. Taking the derivative on both sides of the above equation, we obtain

$$\frac{\partial L_R}{\partial w_{ij}^{(r)}} = \frac{\partial L}{\partial w_{ij}^{(r)}} + \lambda\, w_{ij}^{(r)}$$

Thus the weight update rule becomes:

$$w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \eta \frac{\partial L}{\partial w_{ij}^{(r)}} - \eta \lambda\, w_{ij}^{(r)} = (1 - \eta \lambda)\, w_{ij}^{(r)} - \eta \frac{\partial L}{\partial w_{ij}^{(r)}}$$

where $\eta$ is the learning rate.
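A minimal sketch of an L2 penalty, analogous to the L1 sketch above. Note that PyTorch optimizers can also apply the equivalent weight decay directly via their weight_decay argument; the helper name l2_reg mirrors the implementation below but is not necessarily the tutorial's exact code.

import torch
import torch.nn as nn
import torch.optim as optim

def l2_reg(model):
    """L2 penalty: sum of the squares of all model parameters."""
    return sum((p ** 2).sum() for p in model.parameters())

# Option 1: add the penalty to the loss explicitly (toy model and batch for illustration)
model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
lambda2 = 1e-3  # illustrative value
loss = criterion(model(x), y) + lambda2 * l2_reg(model)
loss.backward()

# Option 2: equivalent-in-spirit weight decay built into the optimizer
optimizer = optim.SGD(model.parameters(), lr=5e-3, momentum=0.99, weight_decay=lambda2)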

Implementation

Implement all the regularization functions together:
args3 = {
    'test_batch_size': 1000,
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'device': DEVICE,
    'lambda1': 0.001,
    'lambda2': 0.001
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNet()

# Train with both L1 and L2 regularization
val_acc_l1l2reg, train_acc_l1l2reg, param_norm_l1l2reg, _ = main(args3,
                                                                 model,
                                                                 train_loader,
                                                                 val_loader,
                                                                 img_test_dataset,
                                                                 reg_function1=l1_reg,
                                                                 reg_function2=l2_reg)
notion image
notion image

Dropout

💡
Typically used in CNNs
With Dropout, we literally drop out (zero out) some neurons during training. On each training iteration, standard dropout zeros out a random fraction (usually 50%) of the nodes in each layer before computing the subsequent layer. Randomly selecting a different subset to drop out on every iteration introduces noise into the process and reduces overfitting.
notion image
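A quick illustration of the behaviour (not part of the tutorial code): PyTorch's nn.Dropout zeroes activations only in training mode and rescales the survivors by 1/(1-p); in eval mode it is a no-op.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()      # training mode: roughly half the entries become 0, the rest become 2.0 (scaled by 1/(1-p))
print(drop(x))

drop.eval()       # evaluation mode: dropout does nothing
print(drop(x))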
Now let's revisit the toy dataset we generated above to visualize how the Dropout stabilizes training on a noisy dataset. We will slightly modify the architecture we used above to add dropout layers.

Implementation

class AnimalNetDropout(nn.Module):
    """
    Network Class - Animal Faces with the following structure:
    nn.Linear(3*32*32, 248) + leaky_relu(self.dropout1(self.fc1(x)))  # First fully connected layer with 0.5 dropout
    nn.Linear(248, 210) + leaky_relu(self.dropout2(self.fc2(x)))  # Second fully connected layer with 0.3 dropout
    nn.Linear(210, 3)  # Final fully connected layer
    """

    def __init__(self):
        """
        Initialize parameters of AnimalNetDropout

        Args:
          None

        Returns:
          Nothing
        """
        super(AnimalNetDropout, self).__init__()
        self.fc1 = nn.Linear(3*32*32, 248)
        self.fc2 = nn.Linear(248, 210)
        self.fc3 = nn.Linear(210, 3)
        self.dropout1 = nn.Dropout(p=0.5)  # Dropout layer after fc1
        self.dropout2 = nn.Dropout(p=0.3)  # Dropout layer after fc2

    def forward(self, x):
        # Forward pass following the structure described in the class docstring
        # (assumes F = torch.nn.functional is imported)
        x = x.view(x.shape[0], -1)  # Flatten the image batch
        x = F.leaky_relu(self.dropout1(self.fc1(x)))
        x = F.leaky_relu(self.dropout2(self.fc2(x)))
        output = self.fc3(x)
        return output

Data Augmentation

💡
Typically used in image classification problems
Data augmentation is often used to increase the number of training samples. Now we will explore the effect of data augmentation on regularization. Here, regularization is achieved by adding noise to the training data (via random transformations) every epoch.
PyTorch's torchvision module provides a few built-in data augmentation techniques, which we can use on image datasets. Some of the techniques we most frequently use are:
  • Random Crop
  • Random Rotate
  • Vertical Flip
  • Horizontal Flip
 
Define a DataLoader using torchvision.transforms, which randomly augments the data for us.

an example

# Data Augmentation using transforms
new_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.1),
    transforms.RandomVerticalFlip(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

data_path = pathlib.Path('.')/'afhq'  # Using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=new_transforms)

# Splitting dataset
new_train_data, _, _ = torch.utils.data.random_split(img_dataset, [250, 100, 14280])

# For reproducibility
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

# Creating train_loader
new_train_loader = torch.utils.data.DataLoader(new_train_data,
                                               batch_size=batch_size,
                                               worker_init_fn=seed_worker,
                                               generator=g_seed)

Stochastic Gradient Descent

  • Initialize with small random weights
  • Weights get bigger as one iterates
  • Use early stopping to avoid overfitting
SGD finds ‘good’ minima:
  • Deep learning often uses more parameters than observations, so it ‘should’ massively overfit
  • Yet deep learning on CIFAR10 and ImageNet gives 0 training error with small test error, and 0 training error even with randomized labels
  • SGD converges to flat minima, which generalize better
Gradient descent is magic: it tends to find ‘good’ (smooth/regularized) solutions (Zhang, Bengio, Hardt, Recht, and Vinyals, 2017).
SGD is one of the best choices in this scenario.
 
  • Smaller learning rates regularize less and slowly converge to deep minima.
  • Larger learning rates regularize more by missing local minima and converging to broader, flatter minima, which often generalize better.
💡
Again, learning rate is very important
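As a small, purely illustrative sketch (the model and values are hypothetical): the learning rate is simply the lr argument of the optimizer, and per the points above, a larger value makes SGD's updates noisier and biases it toward flatter minima.

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# Small learning rate: slow convergence into deep, narrow minima; weaker implicit regularization
opt_small = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# Larger learning rate: noisier updates that tend to settle in broader, flatter minima
opt_large = optim.SGD(model.parameters(), lr=1e-1, momentum=0.9)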

batch size vs learning rate

notion image

Hyperparameter Tuning

Hyperparameter tuning is often tricky and time-consuming, and it is a vital part of training any Deep Learning model to give good generalization. There are a few techniques that we can use to guide us during the search.
  • **Grid Search**: Try all possible combinations of hyperparameters
    • This method is typically too expensive in practice
  • **Random Search**: Randomly try different combinations of hyperparameters
    • does not sound good 🤣
  • **Coordinate-wise Gradient Descent**: Start at one set of hyperparameters and try changing one at a time, accepting any changes that reduce your validation error
    • This one is widely used
  • **Bayesian Optimization / Auto ML**: Start from a set of hyperparameters that have worked well on a similar problem, and then do some sort of local exploration (e.g., gradient descent) from there.
    • Automated machine learning based on previously successful deep nets
There are many choices, like what range to explore over, which parameter to optimize first, etc. Some hyperparameters don’t matter much (people use a dropout of either 0.5 or 0.2, but not much else). Others can matter a lot more (e.g., size and depth of the neural net). The key is to see what worked on similar problems.
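A minimal sketch of random search over such a space (the search space, budget, and helper are illustrative; train_and_validate is a hypothetical placeholder for whatever training/validation routine is available):

import random

# Hypothetical search space
space = {
    'lr': [1e-4, 5e-4, 1e-3, 5e-3],
    'dropout': [0.2, 0.5],
    'momentum': [0.9, 0.99],
}

def train_and_validate(cfg):
    """Placeholder: in practice, train a model with cfg and return its validation accuracy."""
    return random.random()

best_acc, best_cfg = 0.0, None
for _ in range(10):  # fixed budget of 10 random trials
    cfg = {k: random.choice(v) for k, v in space.items()}
    val_acc = train_and_validate(cfg)
    if val_acc > best_acc:
        best_acc, best_cfg = val_acc, cfg

print(best_cfg, best_acc)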
One can automate the process of tuning the network architecture using so-called *Neural Architecture Search (NAS)*. NAS designs new architectures using a few building blocks (Linear layers, Convolutional layers, etc.) and optimizes the design based on performance, using a wide range of techniques such as Grid Search, Reinforcement Learning, Gradient Descent, Evolutionary Algorithms, etc. This obviously requires very high computing power. Read this [article](https://lilianweng.github.io/lil-log/2020/08/06/neural-architecture-search.html) to learn more about NAS.

Further Readings