Key Points
Learning is framed as an optimization problem: the weights are adjusted until the network produces the best possible output.
- Optimize the target function by minimizing the loss function
- Learning is an optimization problem
- Gradient descent (GD) moves in the direction that most rapidly reduces the loss (for infinitesimal steps)
Concept | Mathematical Expression | PyTorch Function |
ANN (Target Function) | | nn.Module, nn.Sequential, nn.Linear, nn.ReLU |
Loss Function | | nn.MSELoss, nn.CrossEntropyLoss |
Optimization problem | | torch.optim.SGD, torch.optim.Adam |
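As a rough illustration of how the three rows of the table fit together, the sketch below runs a few training steps. The layer sizes, learning rate, number of steps, and random data are arbitrary assumptions chosen only for the demonstration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the toy data reproducible

# Target function (ANN): a small multilayer perceptron (sizes are assumptions)
model = nn.Sequential(
    nn.Linear(3, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

# Loss function: mean squared error between prediction and target
loss_fn = nn.MSELoss()

# Optimization problem: plain gradient descent on the model's weights
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Hypothetical toy data: 16 samples with 3 features each
x = torch.randn(16, 3)
y = torch.randn(16, 1)

losses = []
for step in range(50):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # backward pass: compute gradients
    optimizer.step()               # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Swapping `torch.optim.SGD` for `torch.optim.Adam` changes only the optimizer line; the rest of the loop stays the same.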
Function Collections
Gradient Descent Algorithm
Gradient descent is built on derivatives. The idea is: “Make a small change in weights that most rapidly improves task performance.”
Note that in practice the change is not necessarily small: its actual size depends on the learning rate.
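Let $f: \mathbb{R}^d \to \mathbb{R}$ be a differentiable function. Gradient descent is an iterative algorithm for minimizing the function $f$, starting with an initial value for the variables $\mathbf{w}^{(0)}$ and taking steps of size $\eta$ (the learning rate) in the direction of the negative gradient at the current point to update the variables:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla f(\mathbf{w}^{(t)})$$
where $\eta > 0$ and $\nabla f = \left(\frac{\partial f}{\partial w_1}, \dots, \frac{\partial f}{\partial w_d}\right)$. Since the negative gradient always points locally in the direction of steepest descent, the algorithm takes a step at each point towards a (local) minimum.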
Vanilla Algorithm
Inputs: initial guess $\mathbf{w}^{(0)}$, step size $\eta > 0$, number of steps $T$.
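For $t = 0, 1, \dots, T-1$ do $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla f(\mathbf{w}^{(t)})$ end
Return: $\mathbf{w}^{(T)}$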
Hence, all we need is the gradient of the loss function with respect to the learnable parameters (i.e., the weights).
Among all directions, the negative gradient is the one that most rapidly reduces the loss (for infinitesimal steps).
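The vanilla algorithm can be sketched in a few lines of plain Python. The objective, its hand-written gradient, the step size, and the number of steps below are hypothetical choices for illustration; they do not come from the original notes.

```python
def gradient_descent(grad_f, w0, eta, T):
    """Vanilla gradient descent: repeat w <- w - eta * grad_f(w) for T steps."""
    w = list(w0)
    for _ in range(T):
        g = grad_f(w)
        w = [w_i - eta * g_i for w_i, g_i in zip(w, g)]
    return w


# Hypothetical objective: f(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at (3, -1)
def grad_f(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w_final = gradient_descent(grad_f, w0=[0.0, 0.0], eta=0.1, T=100)
print(w_final)  # close to [3.0, -1.0]
```

With a step size that is too large (for this objective, anything above 1.0), the iterates overshoot and diverge, which is why the actual change depends so strongly on the learning rate.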
Computation Graph
The computation graph is a way to calculate the derivatives of a complex function, and it is also how neural networks compute them.
For example, consider a function like the one below:
Forward
Explicitly represent and store intermediate variables a,b,c,d,e
Nodes in the graph correspond to intermediate variables.
Backward
Starting from the top, pass backward. Each edge stores the partial derivative of the head of the edge with respect to the tail.
Conveniently, the partial derivatives can often be expressed using the intermediate variables calculated in the forward pass (a,b,c,d,e).
Compute Gradients
The gradients can easily be computed using the chain rule.
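The forward and backward passes can be written out by hand for a small function. The function below is a hypothetical stand-in (the one from the original figure is not available); the intermediate variables are named a, b, c, d, e to match the description above.

```python
import math

# Hypothetical function: e = (sin(x * y) + x**2) ** 2


def forward(x, y):
    """Forward pass: compute and keep every intermediate variable."""
    a = x * y
    b = math.sin(a)
    c = x ** 2
    d = b + c
    e = d ** 2
    return a, b, c, d, e


def backward(x, y, a, b, c, d, e):
    """Backward pass: local partial derivatives on each edge, then chain rule."""
    de_dd = 2 * d           # e = d^2
    dd_db = 1.0             # d = b + c
    dd_dc = 1.0
    db_da = math.cos(a)     # b = sin(a), reuses a from the forward pass
    da_dx = y               # a = x * y
    da_dy = x
    dc_dx = 2 * x           # c = x^2
    # x reaches e along two paths (through a and through c): sum both.
    de_dx = de_dd * (dd_db * db_da * da_dx + dd_dc * dc_dx)
    de_dy = de_dd * dd_db * db_da * da_dy
    return de_dx, de_dy


x, y = 1.5, 0.5
a, b, c, d, e = forward(x, y)
de_dx, de_dy = backward(x, y, a, b, c, d, e)
print(de_dx, de_dy)
```

Note how the backward pass reuses `a` and `d` from the forward pass instead of recomputing them.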
Auto Differentiation
A data structure for storing intermediate values and partial derivatives needed to compute gradients.
● Node v represents variable
○ Stores value
○ Gradient
○ The function that created the node
● Directed edge from v to u represents the partial derivative of u w.r.t. v
● To compute the gradient $\partial L / \partial v$, find the path from L to v and multiply the edge weights, where L is the overall loss. If several paths lead from L to v, sum these products over all paths.
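The data structure described above can be sketched as a small class. This is a toy illustration, not how PyTorch is implemented: the class name `Node`, its attributes, and the two supported operations are assumptions made for the example.

```python
class Node:
    """A variable in the computation graph."""

    def __init__(self, value, parents=(), op="leaf"):
        self.value = value      # value computed in the forward pass
        self.grad = 0.0         # dL/d(this node), filled in by backward()
        self.op = op            # name of the function that created the node
        self.parents = parents  # edges: (parent_node, d(this)/d(parent))

    def __add__(self, other):
        return Node(self.value + other.value,
                    ((self, 1.0), (other, 1.0)), "add")

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)), "mul")

    def backward(self):
        # Order the graph so each node is handled after all nodes that use it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL
        for node in reversed(order):
            for parent, local_grad in node.parents:
                # Chain rule: multiply along the edge, sum over all paths.
                parent.grad += node.grad * local_grad


w, x, b = Node(2.0), Node(3.0), Node(1.0)
L = w * x + w * b          # w is used twice, so two paths lead from L to w
L.backward()
print(L.value, w.grad, x.grad, b.grad)  # 8.0 4.0 2.0 2.0
```

Because `w` feeds into both products, its gradient is the sum over both paths: `x + b = 4.0`.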
AutoGrad in PyTorch
When we perform operations on PyTorch Tensors, PyTorch does not simply calculate the output. Instead, each operation is added to the computational graph.
PyTorch can then do a forward and backward pass through the graph, storing the necessary intermediate variables, and yield any gradients we need.
Often only some parameters are trainable and require gradients. We indicate tensors that require gradients by setting requires_grad=True.
PyTorch can keep adding to the graph as your code winds through functions and classes.
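A minimal sketch of these points follows. The model (a single tanh unit), the squared-error loss, and all numeric values are assumptions chosen for illustration, so the hand-derived gradients in the comments apply only to this assumed loss.

```python
import torch

# Leaf tensors created by the user; requires_grad=True marks them as trainable.
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(-0.2, requires_grad=True)
x = torch.tensor(1.5)   # input: no gradient needed
y = torch.tensor(0.8)   # target (true label)


def predict(x):
    # Operations inside functions are still recorded in the graph.
    return torch.tanh(w * x + b)


y_hat = predict(x)            # prediction (model output)
loss = (y - y_hat) ** 2       # assumed loss: squared error

print(w.grad_fn)              # None, because w was created by the user
print(y_hat.grad_fn)          # a Tanh backward function object
print(loss.grad_fn)           # a Pow backward function object

loss.backward()               # backward pass starting from the loss

# Gradients by hand for this assumed loss, using the chain rule:
#   dL/dw = -2 * x * (y - y_hat) * (1 - y_hat^2)
#   dL/db = -2 * (y - y_hat) * (1 - y_hat^2)
with torch.no_grad():
    dw_hand = -2 * x * (y - y_hat) * (1 - y_hat ** 2)
    db_hand = -2 * (y - y_hat) * (1 - y_hat ** 2)

print(w.grad, dw_hand)        # the two values should agree
print(b.grad, db_hand)
```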
Example
In PyTorch, Tensor and Function are interconnected and build up an acyclic graph that encodes a complete history of computation. Each tensor has a grad_fn attribute that references the function that created the Tensor (except for Tensors created by the user, which have None as grad_fn). For example, the tensor c = a + b is created by the Add operation, and its gradient function is the object <AddBackward...>. Replace + with other single operations (e.g., c = a * b or c = torch.sin(a)) and examine the results.
Now let's kick off the backward pass to calculate the gradients by calling .backward() on the tensor we wish to initiate the backpropagation from. Often, .backward() is called on the loss, which is the last node on the graph. Before doing that, let's calculate the loss gradients by hand, expressed in terms of the target (true label) and the prediction (model output). We can then compare them to the PyTorch gradients, which can be obtained by calling .grad on the relevant tensors.
Important Notes
- Learnable parameters (i.e., requires_grad tensors) are "contagious". Let's look at a simple example: Y = W @ X, where X is the feature tensor and W is the weight tensor (learnable parameters, requires_grad). The newly generated output tensor Y will also be requires_grad, so any operation applied to Y will be part of the computational graph. Therefore, if we need to plot or store a tensor that is requires_grad, we must first detach it from the graph by calling the .detach() method on that tensor.
- .backward() accumulates gradients in the leaf nodes (i.e., the input nodes to the node of interest). We can call .zero_grad() on the model (nn.Module) or the optimizer to reset all .grad attributes (see autograd.backward for more information).
- Recall that in Python we can access variables and associated methods with .method_name. You can use the command dir(my_object) to list all variables and associated methods of your object, e.g., dir(simple_graph.w).
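The three notes above can be checked with a short script. The tensor shapes, the use of a sum as a stand-in loss, and the optimizer settings are arbitrary assumptions for the demonstration.

```python
import torch

torch.manual_seed(0)

W = torch.randn(2, 3, requires_grad=True)  # learnable weights
X = torch.randn(3, 4)                      # features: no gradient needed

# Note 1: requires_grad is "contagious"
Y = W @ X
print(Y.requires_grad)            # True, inherited from W

Y_plain = Y.detach()              # same data, cut off from the graph
print(Y_plain.requires_grad)      # False, safe to store or plot

# Note 2: .backward() accumulates gradients in the leaf tensors
Y.sum().backward()
first = W.grad.clone()
(W @ X).sum().backward()          # second backward pass adds to W.grad
print(torch.allclose(W.grad, 2 * first))  # True

optimizer = torch.optim.SGD([W], lr=0.1)
optimizer.zero_grad()             # reset W.grad before the next pass
print(W.grad)                     # None or all zeros, depending on the PyTorch version

# Note 3: dir() lists the attributes and methods of an object
grad_related = [name for name in dir(W) if name.startswith("grad")]
print(grad_related)
```

Forgetting to reset the gradients between training steps is a common source of bugs, because each new backward pass is silently added to the previous one.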