Gradient Descent and AutoGrad

Key Points

💡

Learning is an optimization problem: we adjust the weights of the target function (the network) so that the loss function is minimized and the model produces the best output.

Gradient descent (GD) moves the weights in the direction that most rapidly reduces the loss (for infinitesimal steps).

Concept	Mathematical Expression	PyTorch Function
ANN (Target Function)		nn.Module, nn.Sequential, nn.Linear, nn.ReLU
Loss Function		nn.MSELoss, nn.CrossEntropyLoss
Optimization Problem		torch.optim.SGD, torch.optim.Adam
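
As a rough illustration of how the three rows of the table fit together, the sketch below wires a tiny network, a loss, and an optimizer into a single gradient step. The layer sizes, learning rate, and random data are arbitrary assumptions chosen only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Target function: a small ANN (layer sizes are arbitrary assumptions)
model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
# Loss function
loss_fn = nn.MSELoss()
# Optimizer that solves the optimization problem
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Hypothetical data: 16 samples with 3 features each
X = torch.randn(16, 3)
y = torch.randn(16, 1)

loss_before = loss_fn(model(X), y)
optimizer.zero_grad()    # clear any stale gradients
loss_before.backward()   # compute d(loss)/d(weights)
optimizer.step()         # move the weights against the gradient
loss_after = loss_fn(model(X), y)

print(loss_before.item(), loss_after.item())
```

With a small enough learning rate, the loss after the step should be slightly lower than before it.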

Function Collections

WideNet

A simple WideNet

Gradient Descent Algorithm

💡

It’s like taking derivatives. “Make a small change in weights that most rapidly improves task performance”

Note that in practice the change is not necessarily small: the actual size of each step is set by the learning rate.

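Let $f(\mathbf{w}): \mathbb{R}^d \rightarrow \mathbb{R}$ be a differentiable function. Gradient Descent is an iterative algorithm for minimizing the function $f$, starting with an initial value for the variables $\mathbf{w}$ and taking steps of size $\eta$ (learning rate) in the direction of the negative gradient at the current point to update the variables $\mathbf{w}$:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla f(\mathbf{w}^{(t)})$$

where $\eta > 0$ and $\nabla f(\mathbf{w}) = \left( \frac{\partial f(\mathbf{w})}{\partial w_1}, \dots, \frac{\partial f(\mathbf{w})}{\partial w_d} \right)$. Since the negative gradient always points locally in the direction of steepest descent, the algorithm takes a small step at each point towards a (local) minimum.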

Vanilla Algorithm

Inputs: initial guess $\mathbf{w}^{(0)}$, step size $\eta > 0$, number of steps $T$.

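For $t = 0, 1, \dots, T-1$ do
$\quad \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla f(\mathbf{w}^{(t)})$
end

Return: $\mathbf{w}^{(T)}$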
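
The vanilla algorithm can be sketched in a few lines of plain Python. The function name gradient_descent and the quadratic objective are assumptions made for illustration; any differentiable function with a known gradient would do.

```python
def gradient_descent(grad_f, w0, eta, T):
    """Run T steps of vanilla gradient descent and return the final iterate."""
    w = list(w0)
    for _ in range(T):
        g = grad_f(w)
        # w^(t+1) = w^(t) - eta * grad f(w^(t))
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical objective: f(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at (3, -1)
def grad_f(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w_final = gradient_descent(grad_f, w0=[0.0, 0.0], eta=0.1, T=100)
print(w_final)  # very close to [3.0, -1.0]
```

For this particular quadratic each step shrinks the distance to the minimum by a factor of |1 - 2η|, so any step size above 1 makes the iterates diverge. This is one way to see why the learning rate matters.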

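Hence, all we need is to calculate the gradient of the loss function with respect to the learnable parameters (i.e., weights):

$$\frac{\partial \text{Loss}}{\partial \mathbf{w}} = \left[ \frac{\partial \text{Loss}}{\partial w_1}, \frac{\partial \text{Loss}}{\partial w_2}, \dots, \frac{\partial \text{Loss}}{\partial w_d} \right]^{\top}$$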

Computation Graph

💡

A computation graph is a way to compute the derivatives of a complicated function by breaking it into simple operations. Neural networks compute their gradients in the same way.

For example, consider a function like the one below:

Forward

Explicitly represent and store intermediate variables a,b,c,d,e

Nodes in the graph correspond to intermediate variables.

Backward

Starting from the top, pass backward through the graph. Each edge stores the partial derivative of the head of the edge with respect to the tail.

Conveniently, the partial derivatives can often be expressed using the intermediate variables calculated in the forward pass (a,b,c,d,e).

For more info:

Calculus on Computational Graphs: Backpropagation

Compute Gradients

The gradient can easily be computed using the chain rule.
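
To make the forward and backward passes concrete, here is a minimal hand-written sketch. The function f(x, y, z) = tanh(ln(1 + z · 2x / sin(y))) is an assumption chosen because it decomposes into intermediate variables a, b, c, d, e; it is not necessarily the function from the original figure.

```python
import math


def forward(x, y, z):
    """Forward pass: compute and keep every intermediate variable."""
    a = 2.0 * x
    b = math.sin(y)
    c = a / b
    d = c * z
    e = math.log(1.0 + d)
    f = math.tanh(e)
    return a, b, c, d, e, f


def backward(x, y, z):
    """Backward pass: multiply local partial derivatives along each path."""
    a, b, c, d, e, f = forward(x, y, z)
    df_de = 1.0 - f ** 2                 # f = tanh(e), reusing f itself
    de_dd = 1.0 / (1.0 + d)              # e = ln(1 + d)
    dd_dc, dd_dz = z, c                  # d = c * z
    dc_da, dc_db = 1.0 / b, -a / b ** 2  # c = a / b
    da_dx = 2.0                          # a = 2x
    db_dy = math.cos(y)                  # b = sin(y)

    df_dc = df_de * de_dd * dd_dc
    df_dx = df_dc * dc_da * da_dx
    df_dy = df_dc * dc_db * db_dy
    df_dz = df_de * de_dd * dd_dz
    return df_dx, df_dy, df_dz


grads = backward(1.0, 1.0, 1.0)
print(grads)
```

Every local derivative is written in terms of values already computed in the forward pass, which is why those values are stored.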

Auto Differentiation

A data structure for storing the intermediate values and partial derivatives needed to compute gradients.

● Node v represents a variable

○ Stores its value

○ Stores its gradient

○ Stores the function that created the node

● A directed edge from v to u represents the partial derivative of u w.r.t. v

● To compute the gradient $\partial L / \partial v$, find the unique path from L to v and multiply the edge weights, where L is the overall loss. (When more than one path connects L and v, the products are summed over all paths.)
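
The data structure described above can be sketched as a tiny scalar autodiff class. The class name Node and its fields are hypothetical and only mirror the bullet points; this is not how PyTorch implements autograd internally.

```python
class Node:
    """A scalar variable in the graph: value, gradient, and how it was created."""

    def __init__(self, value, parents=(), op="leaf"):
        self.value = value      # value stored during the forward pass
        self.grad = 0.0         # dL/d(this node), filled in by backward()
        self.parents = parents  # pairs (parent node, local partial derivative)
        self.op = op            # name of the function that created the node

    def __add__(self, other):
        return Node(self.value + other.value,
                    ((self, 1.0), (other, 1.0)), "add")

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)), "mul")

    def backward(self):
        # Order nodes so that every node comes after the nodes it depends on
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL
        # Walk from the loss back to the leaves, accumulating edge products
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


w, x, b = Node(2.0), Node(3.0), Node(1.0)
L = w * x + b
L.backward()
print(w.grad, x.grad, b.grad)  # 3.0 2.0 1.0
```

Because gradients are accumulated with +=, a node that reaches the loss through several paths receives the sum of the products along each path.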

AutoGrad in Pytorch

When we perform operations on PyTorch tensors, PyTorch does not simply calculate the output. Instead, each operation is added to the computational graph. PyTorch can then do a forward and backward pass through the graph, storing the necessary intermediate variables, and yield any gradients we need.

Often only some parameters are trainable and require gradients. We indicate tensors that require gradients by setting requires_grad=True.

PyTorch keeps adding to the graph as your code winds through functions and classes.

Example

In PyTorch, Tensor and Function are interconnected and build up an acyclic graph that encodes a complete history of computation. Each tensor has a grad_fn attribute that references the function that created it (except for tensors created by the user, which have None as grad_fn). The example below shows that the tensor c = a + b is created by the Add operation and that its gradient function is the object <AddBackward...>. Replace + with other single operations (e.g., c = a * b or c = torch.sin(a)) and examine the results.
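
A minimal sketch of the grad_fn behavior described above, with tensor values assumed for illustration:

```python
import torch

a = torch.tensor([1.0], requires_grad=True)
b = torch.tensor([2.0], requires_grad=True)
c = a + b

print(a.grad_fn)        # None, because a was created by the user
print(c.grad_fn)        # an AddBackward object (exact repr depends on the version)
print(c.requires_grad)  # True
```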

Now let's kick off the backward pass to calculate the gradients by calling .backward() on the tensor we wish to initiate the backpropagation from. Often, .backward() is called on the loss, which is the last node on the graph. Before doing that, let's calculate the loss gradients by hand:

Here the hand-derived expression depends on the target (true label) and the prediction (model output). We can then compare it to the gradients PyTorch computes, which can be obtained by reading .grad on the relevant tensors.
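
The comparison can be sketched as follows. The model y_p = tanh(w·x + b), the squared-error loss, and all numeric values are assumptions made for illustration; the names y_t and y_p are hypothetical labels for the target and the prediction. Under those assumptions the chain rule gives ∂Loss/∂w = -2 (y_t - y_p)(1 - y_p²) x and ∂Loss/∂b = -2 (y_t - y_p)(1 - y_p²).

```python
import torch

x = torch.tensor(1.5)                       # input (assumed value)
y_t = torch.tensor(0.5)                     # target, i.e. the true label
w = torch.tensor(-0.3, requires_grad=True)  # learnable weight
b = torch.tensor(0.8, requires_grad=True)   # learnable bias

y_p = torch.tanh(w * x + b)                 # prediction, i.e. the model output
loss = (y_t - y_p) ** 2
loss.backward()                             # fills w.grad and b.grad

# Gradients derived by hand for this assumed model and loss
with torch.no_grad():
    dloss_dw = -2 * (y_t - y_p) * (1 - y_p ** 2) * x
    dloss_db = -2 * (y_t - y_p) * (1 - y_p ** 2)

print(dloss_dw.item(), w.grad.item())
print(dloss_db.item(), b.grad.item())
```

Each printed pair should agree up to floating-point error.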

Important Notes

Learnable parameters (i.e., tensors with requires_grad=True) are "contagious". Consider a simple example: Y = W @ X, where X is the feature tensor and W is the weight tensor (learnable parameters, requires_grad=True). The newly generated output tensor Y will also have requires_grad=True, so any operation applied to Y becomes part of the computational graph. Therefore, if we need to plot or store a tensor that requires gradients, we must first remove it from the graph by calling the .detach() method on that tensor.
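
A short sketch of this behavior, with tensor shapes assumed for illustration:

```python
import torch

X = torch.randn(3, 4)                      # features, no gradient needed
W = torch.randn(2, 3, requires_grad=True)  # learnable weights
Y = W @ X

print(X.requires_grad, W.requires_grad, Y.requires_grad)  # False True True

Y_plain = Y.detach()          # shares data with Y but is cut off from the graph
print(Y_plain.requires_grad)  # False
values = Y_plain.numpy()      # now safe to convert for plotting or storage
```

Calling .numpy() directly on Y would raise an error, because Y still requires gradients.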

.backward() accumulates gradients in the leaf nodes (i.e., the input nodes to the node of interest). We can call .zero_grad() on the model (an nn.Module) or on the optimizer to reset all .grad attributes (see autograd.backward for more information).
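
The accumulation can be seen in a small sketch like the one below (the values and the choice of SGD are assumptions for illustration). Whether optimizer.zero_grad() leaves .grad as None or as a zero tensor depends on the PyTorch version and on its set_to_none argument.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
g1 = w.grad.item()   # 3.0

(3 * w).backward()   # the new gradient is added to the existing .grad
g2 = w.grad.item()   # 6.0

w.grad.zero_()       # in-place reset for a bare tensor
g3 = w.grad.item()   # 0.0

optimizer = torch.optim.SGD([w], lr=0.1)
(3 * w).backward()
optimizer.zero_grad()  # resets .grad for every parameter the optimizer holds
after_reset = w.grad   # None or a zero tensor, depending on the version

print(g1, g2, g3, after_reset)
```

This is why training loops typically reset the gradients once per iteration, before calling .backward().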

Recall that in Python we can access an object's variables and methods with dot notation, e.g., .method_name. You can use dir(my_object) to list all variables and methods of an object, e.g., dir(simple_graph.w).

References and more:

A gentle introduction to torch.autograd

Automatic Differentiation package - torch.autograd

Autograd mechanics

Automatic Differentiation with torch.autograd