Text Generation With LSTM Recurrent Neural Networks in Python with Keras: wanted for PyTorch as well
PyTorch is an open-source deep learning framework with dynamic computational graphs, emphasizing flexibility and research. It is similar to TensorFlow.
Framework pieces:
- torch: a general-purpose array library similar to NumPy that can do computations on the GPU when the tensor type is cast to torch.cuda.FloatTensor
- torch.autograd: a package for building a computational graph and automatically obtaining gradients
- torch.nn: a neural network library with common layers and loss functions
- torch.optim: an optimization package with common optimization algorithms like Stochastic Gradient Descent
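A minimal sketch showing these pieces imported together (the tensor values are illustrative, and the GPU cast only runs if CUDA is available):

```python
import torch                    # tensor library, similar to NumPy, GPU-capable
import torch.nn as nn           # neural network layers and loss functions
import torch.optim as optim     # optimization algorithms such as SGD
# torch.autograd is used implicitly whenever a tensor has requires_grad=True

t = torch.ones(3)               # a float32 tensor on the CPU
if torch.cuda.is_available():
    t = t.type(torch.cuda.FloatTensor)   # cast to the CUDA tensor type
    # the more common modern form of the same move: t = t.to('cuda')
```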
Basics of PyTorch
Tensors (arrays)
PyTorch uses tensors, which are similar to NumPy arrays, but with GPU acceleration.
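A quick illustration (the values here are arbitrary):

```python
import torch
import numpy as np

# Create a tensor directly and from a NumPy array.
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.from_numpy(np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32))

print(a + b)              # element-wise addition, just like NumPy
print(a.shape, a.dtype)   # torch.Size([2, 2]) torch.float32

# Move the tensor to the GPU if one is available.
if torch.cuda.is_available():
    a = a.to('cuda')
```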
Automatic Differentiation
PyTorch can compute gradients automatically with `autograd`.
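The original code is not reproduced in these notes; the following reconstruction matches the values discussed below (in particular `x = [1.0, 2.0, 3.0]`, which is implied by the gradients `[2.0, 4.0, 6.0]` reported later):

```python
import torch

# Tensor with gradient tracking enabled.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = x ** 2       # element-wise squares: [1.0, 4.0, 9.0]
z = y.sum()      # scalar: 1.0 + 4.0 + 9.0 = 14.0

z.backward()     # autograd applies the chain rule to compute dz/dx
print(x.grad)    # tensor([2., 4., 6.])
```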
The gradient is calculated by the chain rule:
- For $y_i = x_i^2$, the derivative of $y_i$ with respect to $x_i$ is $2x_i$.
- ==Since $z$ is the sum of the elements of $y$, the gradient of $z$ with respect to each element in $x$ is $2x_i$ for each element $x_i$ in $x$.==
Note: $z$ is a single scalar formula, but `backward()` calculates the derivative with respect to $x$, i.e. it chains the derivative of $z$ with respect to $y$ with the derivative of $y$ with respect to $x$.
So, the gradients for each element in $x = [1.0, 2.0, 3.0]$ are:

$$\frac{\partial z}{\partial x} = 2x = [2.0, 4.0, 6.0]$$
The gradients are stored in `x.grad`. After calling `z.backward()`, `x.grad` contains the derivative of $z$ with respect to each element of $x$, which is `[2.0, 4.0, 6.0]`.
Gradient Computation (Summary):
- The gradient represents how much the output $z$ changes for a small change in $x$. In our case, if we slightly increase $x_1$, $x_2$, or $x_3$, the change in $z$ can be predicted using these gradients.
- This gradient information is used to update weights in neural networks during training. For example, in optimization algorithms like SGD (Stochastic Gradient Descent), these gradients are used to adjust model parameters in the direction that minimizes the loss function (see the sketch below).
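As a small illustration of that last point, here is one manual gradient-descent step on a single parameter; the toy loss and learning rate are made-up values for illustration:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)   # a single trainable parameter
lr = 0.1                                    # learning rate (illustrative)

loss = (w - 1.0) ** 2                       # toy loss, minimized at w = 1
loss.backward()                             # dloss/dw = 2 * (w - 1) = 4.0

with torch.no_grad():                       # update without tracking gradients
    w -= lr * w.grad                        # step in the direction that lowers the loss
    w.grad.zero_()                          # reset the gradient for the next step

print(w)                                    # tensor(2.6000, requires_grad=True)
```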
Confusion: the total derivative is not the same as the partial derivatives
The confusion here comes from the distinction between partial derivatives (for each element in a tensor) and total derivatives. Let me clarify this.
In the context of:
- `x` is a tensor: `[1.0, 2.0, 3.0]`
- `y = x ** 2` computes element-wise squares: `[1.0, 4.0, 9.0]`
- `z = y.sum()` adds the elements in `y`: `z = 14.0`
Now, you’re asking about the derivative of `z` with respect to `x`, specifically whether `dz/dx` should be the sum of the derivatives and equal to 4.
Derivative with Respect to Each Element (Partial Derivative)
When you calculate the gradient of $z$ with respect to $x$, you’re computing the partial derivatives of $z$ with respect to each element in $x$. The function $z$ depends on each element of $x$ individually.
For each element $x_i$ in $x$, we are calculating:

$$\frac{\partial z}{\partial x_i} = 2x_i$$

This gives:

$$\frac{\partial z}{\partial x_1} = 2 \cdot 1.0 = 2.0, \quad \frac{\partial z}{\partial x_2} = 2 \cdot 2.0 = 4.0, \quad \frac{\partial z}{\partial x_3} = 2 \cdot 3.0 = 6.0$$

These are the gradients stored in `x.grad` after calling `z.backward()`.
Total Derivative vs Partial Derivatives
If we are talking about partial derivatives, we get a gradient for each individual component of $x$:

$$\frac{\partial z}{\partial x_1} = 2.0, \quad \frac{\partial z}{\partial x_2} = 4.0, \quad \frac{\partial z}{\partial x_3} = 6.0$$

These partial derivatives form the gradient vector: `[2.0, 4.0, 6.0]`.
Why Is It Not Just 4?
If you’re thinking of the total derivative, that would be different from what we are calculating here. The sum of the derivatives of $z$ with respect to all components of $x$ is:

$$\frac{\partial z}{\partial x_1} + \frac{\partial z}{\partial x_2} + \frac{\partial z}{\partial x_3} = 2.0 + 4.0 + 6.0 = 12.0$$

However, this total derivative is not what we are computing here. We are computing the partial derivatives for each element of $x$ separately, which results in the gradient vector `[2.0, 4.0, 6.0]`.
In Summary:
- Gradient (`x.grad`): a vector of partial derivatives of $z$ with respect to each element in $x$, giving us `[2.0, 4.0, 6.0]`.
- Sum of Gradients: the sum of the elements of the gradient vector is $12.0$, but that is not the gradient of $z$ with respect to $x$ as a whole; it is just a summation of the partial derivatives (see the check below).
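A quick check of this distinction in code, using the same tensor values as above:

```python
import torch

# The gradient is a vector of partial derivatives; its sum is a separate scalar.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = (x ** 2).sum()
z.backward()

print(x.grad)        # tensor([2., 4., 6.])  <- the gradient
print(x.grad.sum())  # tensor(12.)           <- just the sum of the partials
```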
Basic Neural Network Implementation
A simple feedforward network using PyTorch.
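The network itself is not shown in these notes; below is a minimal sketch consistent with the two-feature input used next (the class name `SimpleNet` and the single `nn.Linear(2, 1)` layer are assumptions):

```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    """A minimal feedforward network: one linear layer mapping 2 inputs to 1 output."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 1)   # two input features -> one output

    def forward(self, x):
        return self.fc(x)

net = SimpleNet()
x = torch.tensor([[1.0, 2.0]])      # 2D input tensor with two features
output = net(x)                     # forward pass through the linear layer
print(output)
```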
Input Tensor and Forward Pass
- `torch.tensor([[1.0, 2.0]])`: defines a 2D input tensor with two features, which corresponds to the two input nodes in the network.
- `net(x)`: performs a forward pass, feeding the input tensor `x` into the network. The linear layer applies the learned weights and bias to compute the output.
Output
The output of the network is a tensor from the linear layer, which corresponds to the result of the operation $y = w_1 x_1 + w_2 x_2 + b$, where:
- $w_1$ and $w_2$ are the learned weights for each input feature.
- $b$ is the learned bias.
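To confirm that the output really is $w_1 x_1 + w_2 x_2 + b$, the layer's parameters can be compared against a manual computation (the standalone `nn.Linear(2, 1)` layer here is an assumption mirroring the sketch above):

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)                  # same shape as the linear layer above
x = torch.tensor([[1.0, 2.0]])

w = layer.weight.detach()                # shape (1, 2): contains w_1 and w_2
b = layer.bias.detach()                  # shape (1,):   contains b
manual = x @ w.t() + b                   # y = w_1*x_1 + w_2*x_2 + b
print(torch.allclose(layer(x), manual))  # True
```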
Use Cases for a Simple Neural Network Like This
Training a Simple Model
An example of training a linear regression model (sketched below): the model finds the best $w$ and $b$ in $y = wx + b$.
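A sketch of such a training loop; the data, learning rate, and epoch count are illustrative assumptions (the toy data follows $y = 2x + 1$):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy data generated from y = 2x + 1.
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
Y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])

model = nn.Linear(1, 1)                        # learns w and b in y = w*x + b
criterion = nn.MSELoss()                       # mean squared error loss
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    optimizer.zero_grad()       # clear gradients from the previous step
    pred = model(X)             # forward pass
    loss = criterion(pred, Y)   # compute the loss
    loss.backward()             # backward pass: compute gradients
    optimizer.step()            # update w and b with SGD

print(model.weight.data, model.bias.data)      # should approach w=2, b=1
```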
Moving Tensors to GPU
Summary:
- speeds up training, e.g. of a neural network
- uses parallelism for simultaneous calculations
- a GPU can run larger batches of computations, which is better for memory usage and gives better Gradient Descent estimates
PyTorch makes it easy to move computations to a GPU.
Both the model and the data are moved to the GPU (`device='cuda'`). All computations, including the forward pass, loss calculation, backward pass, and optimizer step, happen on the GPU.
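A minimal sketch of this pattern (the model and data are illustrative; a CPU fallback is included so the snippet also runs without a GPU):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Pick the GPU if available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Move the model and the data to the chosen device.
model = nn.Linear(1, 1).to(device)
X = torch.tensor([[1.0], [2.0], [3.0]], device=device)
Y = torch.tensor([[2.0], [4.0], [6.0]], device=device)

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Forward pass, loss, backward pass, and optimizer step all run on the device.
optimizer.zero_grad()
loss = criterion(model(X), Y)
loss.backward()
optimizer.step()
print(loss.item(), next(model.parameters()).device)
```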