Pytorch-2-Autogradient

Posted on 2021-08-04 Edited on 2021-08-05 In Notes, Python module, Pytorch

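1. What is PyTorch?

PyTorch is a Python-based scientific computing library with the following characteristics:

It is similar to NumPy, but it can use a GPU.
It can be used to define deep learning models and to train and use them flexibly.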

1.1 Tensors

A Tensor is similar to NumPy's ndarray; the key difference is that a Tensor can also run on a GPU to accelerate computation.

import torch
import numpy as np

Construct an uninitialized 5x3 matrix:

x = torch.empty(5, 3)
x

Results:

tensor([[1.0194e-38, 9.1837e-39, 8.4490e-39],
        [9.6429e-39, 8.4490e-39, 9.6429e-39],
        [9.2755e-39, 1.0286e-38, 9.0919e-39],
        [8.9082e-39, 9.2755e-39, 8.4490e-39],
        [1.0194e-38, 9.0919e-39, 8.4490e-39]])

Construct a randomly initialized matrix:

x = torch.rand(5, 3)
x

Results:

tensor([[0.1782, 0.5218, 0.1660],
        [0.4009, 0.0275, 0.8139],
        [0.6904, 0.4813, 0.7811],
        [0.4067, 0.6289, 0.5081],
        [0.0305, 0.9687, 0.0834]])

Construct a matrix filled with zeros, of dtype long:

x = torch.zeros(5, 3, dtype=torch.long)
x1 = torch.zeros(5, 3).long()  # equivalent
x
x1.dtype

Results:

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])
torch.int64

Construct a tensor directly from data:

x = torch.tensor([5.5, 3])
x

Results:

tensor([5.5000, 3.0000])

Matrix of ones

x = x.new_ones(5, 3, dtype=torch.double)
x

Results:

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64)

randn_like

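You can also construct a tensor from an existing tensor. These methods reuse the properties of the original tensor, such as its data type, unless new values are provided.

x = torch.randn_like(x, dtype=torch.float)
x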

Results:

tensor([[-0.3863, -1.4149, -0.1054],
        [ 2.3531,  0.2044, -1.5104],
        [ 0.2127, -0.5231, -0.7806],
        [-0.9173, -0.0242,  0.1667],
        [-1.3101,  1.2451, -0.3665]])

Get the shape of a tensor:

x.shape
x.size()

Results:

torch.Size([5, 3])
torch.Size([5, 3])

Note: torch.Size is a subclass of tuple, so it supports all tuple operations.
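As a small illustration of the note above, the sketch below treats a shape like any other tuple. The variable names are chosen here for illustration only.

```python
import torch

x = torch.zeros(5, 3)

# A shape can be unpacked like any tuple.
rows, cols = x.shape

# torch.Size is a tuple subclass, so it compares equal to a plain tuple.
is_tuple = isinstance(x.shape, tuple)
equal_to_tuple = x.size() == (5, 3)

print(rows, cols, is_tuple, equal_to_tuple)
```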

1.2 Operations

There are many kinds of tensor operations. We start with addition.

y = torch.rand(5, 3)

Addition

x + y # Way 1
torch.add(x, y) # Way 2

# Store the Results into a new variable
result = torch.empty(5,3)
torch.add(x, y, out=result)

Results:

tensor([[-0.0267, -1.1529,  0.7544],
        [ 2.8445,  1.0922, -1.3458],
        [ 0.6769, -0.2879, -0.7209],
        [-0.1200,  0.8094,  0.3603],
        [-0.3267,  2.1897, -0.3474]])

In-place addition

y.add_(x)
y

Results:

tensor([[-0.0267, -1.1529,  0.7544],
        [ 2.8445,  1.0922, -1.3458],
        [ 0.6769, -0.2879, -0.7209],
        [-0.1200,  0.8094,  0.3603],
        [-0.3267,  2.1897, -0.3474]])

Note: every in-place operation ends with _. For example, x.copy_(y) and x.t_() are in-place methods that modify x.
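The difference between add and add_ can be checked directly. The following is a minimal sketch, not part of the original notes, that compares the storage address of y before and after the in-place call.

```python
import torch

x = torch.ones(3)
y = torch.zeros(3)

# Out-of-place: returns a new tensor and leaves y unchanged.
z = y.add(x)
y_after_add = y.tolist()

# In-place: writes the result into y's existing storage.
ptr_before = y.data_ptr()
y.add_(x)
same_storage = y.data_ptr() == ptr_before

print(y_after_add, y.tolist(), same_storage)
```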

Index

All the usual NumPy-style indexing works on PyTorch tensors.

x[1:, 1:]

Results:

tensor([[ 0.2044, -1.5104],
        [-0.5231, -0.7806],
        [-0.0242,  0.1667],
        [ 1.2451, -0.3665]])

Resizing

To resize/reshape a tensor, you can use the Tensor.view method:

x = torch.randn(4,4)
y = x.view(16)
z = x.view(-1,8)
z

Results:

tensor([[ 1.1355, -1.1149, -0.1322, -0.8217,  0.7920,  0.6061,  0.7453,  1.1177],
        [ 0.7566,  1.3975, -0.8014,  0.5999, -0.1476, -0.5695, -1.3861, -0.4741]])
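Two properties of view are easy to miss: a dimension given as -1 is inferred from the others, and the result shares memory with the original tensor. The sketch below illustrates both, and also shows that view refuses a non-contiguous tensor, where reshape can be used instead.

```python
import torch

x = torch.arange(16.0).view(4, 4)

# -1 is inferred from the remaining sizes: 16 / 8 = 2 rows.
z = x.view(-1, 8)

# A view shares memory with the original tensor.
z[0, 0] = 100.0
shares_memory = x[0, 0].item() == 100.0

# view needs a compatible memory layout; a transposed tensor is not contiguous.
try:
    x.t().view(16)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape falls back to copying when a view is not possible.
flat = x.t().reshape(16)

print(tuple(z.shape), shares_memory, view_failed, tuple(flat.shape))
```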

Getting a Python number

If you have a tensor with only one element, the .item() method converts its value to a Python number.

x = torch.randn(1)
x
x.item()

Results:

tensor([-1.4227])
-1.422684907913208

Transpose

z.transpose(1, 0)
z.t()  # equivalent for a 2-D tensor

Results:

tensor([[ 1.1355,  0.7566],
        [-1.1149,  1.3975],
        [-0.1322, -0.8014],
        [-0.8217,  0.5999],
        [ 0.7920, -0.1476],
        [ 0.6061, -0.5695],
        [ 0.7453, -1.3861],
        [ 1.1177, -0.4741]])

Further reading

The full range of Tensor operations, including transposing, indexing, slicing,
mathematical operations, linear algebra, and random numbers, is described in the
PyTorch Documentation.

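1.3 Converting between NumPy and Tensor

Converting between a Torch Tensor and a NumPy array is easy.

Note: a Torch Tensor on the CPU and the NumPy array it is converted to or from share memory, so changing one also changes the other.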

Tensor to ndarray

a = torch.ones(5)  # a = tensor([1., 1., 1., 1., 1.])
b = a.numpy()      # b = array([1., 1., 1., 1., 1.], dtype=float32)

Change a value in the NumPy array:

b[1] = 2
# b = array([1., 2., 1., 1., 1.], dtype=float32)
# a = tensor([1., 2., 1., 1., 1.])

ndarray to Tensor

a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b.numpy())  # b shares memory with a, so it changed too

[2. 2. 2. 2. 2.]
[2. 2. 2. 2. 2.]

Note: all Tensors on the CPU support conversion to NumPy and back.
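When the shared memory is not wanted, one option is to copy the data instead. The sketch below contrasts torch.from_numpy, which shares the buffer, with torch.tensor, which copies it.

```python
import numpy as np
import torch

a = np.ones(3)
shared = torch.from_numpy(a)  # shares the buffer of a
copied = torch.tensor(a)      # copies the data

a += 1  # in-place update of the NumPy array

print(shared.tolist())  # follows the change to a
print(copied.tolist())  # keeps the original values
```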

1.4 CUDA Tensors

With the .to method, a Tensor can be moved to another device.

if torch.cuda.is_available():
    device = torch.device("cuda")    
    y = torch.ones_like(x, device=device)    
    x = x.to(device)    
    z = x + y    
    print(z)    
    print(z.to("cpu", torch.double))

Results:

tensor([-0.4227], device='cuda:0')
tensor([-0.4227], dtype=torch.float64)

y.to("cpu").data.numpy()
y.cpu().data.numpy()  # equivalent

Results:

array([1.], dtype=float32)

2. Bi-Linear NN with numpy

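A fully connected ReLU network with one hidden layer and no bias, trained to predict y from x with an L2 loss.

$h = W_1 X$
$a = \max(0, h)$
$\hat{y} = W_2 a$

This implementation uses only NumPy to compute the forward pass, the loss, and the backward pass: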

forward pass
loss
backward pass

A NumPy ndarray is a plain n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is just a data structure for numerical computation.
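Because NumPy provides no gradients, the backward pass has to be derived by hand. The following is a sketch of that derivation under the batch (row) convention used by the code in this section, where each row of $X$ is one sample, so $H = X W_1$ rather than $W_1 X$. The symbols $L$, $H$, $A$, $\odot$ (element-wise product) and the indicator $\mathbb{1}[\cdot]$ are notation introduced here, not taken from the original notes.

$$
\begin{aligned}
H &= X W_1, \qquad A = \max(0, H), \qquad \hat{Y} = A W_2, \qquad L = \lVert \hat{Y} - Y \rVert_F^2 \\
\frac{\partial L}{\partial \hat{Y}} &= 2(\hat{Y} - Y) \\
\frac{\partial L}{\partial W_2} &= A^\top \frac{\partial L}{\partial \hat{Y}} \\
\frac{\partial L}{\partial A} &= \frac{\partial L}{\partial \hat{Y}} W_2^\top \\
\frac{\partial L}{\partial H} &= \frac{\partial L}{\partial A} \odot \mathbb{1}[H \ge 0] \\
\frac{\partial L}{\partial W_1} &= X^\top \frac{\partial L}{\partial H}
\end{aligned}
$$

These five gradients correspond, in order, to grad_y_pred, grad_w2, grad_h_relu, grad_h, and grad_w1 in the code. The indicator is written with $\ge$ because the code zeroes only the entries where h < 0.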

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for it in range(500):
    # Forward pass
    h = x.dot(w1)  # N * H
    h_relu = np.maximum(h, 0)  # N * H
    y_pred = h_relu.dot(w2)  # N * D_out

    # compute loss
    loss = np.square(y_pred - y).sum()
    if it % 50 == 0:
        print(it, loss)

    # Backward pass
    # compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

Results:

0 30767183.859595705
50 19309.70472025325
100 927.7120364511059
150 65.96723021925985
200 5.451127574769099
250 0.4979445419484826
300 0.04972286885320139
350 0.0053918752712101775
400 0.0006287349147432031
450 7.778529921424317e-05

h = x.dot(w1)
h_relu = np.maximum(h, 0)  # N * H
y_pred = h_relu.dot(w2)  # N * D_out
ab_loss = y_pred - y
ab_loss[:1]

Results:

array([[-2.31700647e-05, -6.28731777e-05,  4.77222071e-05,
         1.18624504e-05, -4.80998287e-05,  1.52820208e-06,
         6.16901569e-06, -8.37944967e-05, -2.06058305e-06,
         1.23783963e-06]])

The difference between the predictions and the targets is very small, so the model fits the training data well.
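A hand-written backward pass is easy to get wrong, so it can help to compare it against numerical gradients. The sketch below is not part of the original notes: it repeats the same gradient formulas on a much smaller problem and checks them with central finite differences. The helper names loss_fn, manual_grads, and numeric_grad, the sizes, and the step eps are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 6, 5, 3  # small sizes keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn():
    # Same forward pass and L2 loss as in the training loop.
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def manual_grads():
    # Same backward pass as in the training loop.
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2

def numeric_grad(w, eps=1e-6):
    # Central differences: perturb one entry of w at a time, in place.
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = loss_fn()
        w[idx] = old - eps
        f_minus = loss_fn()
        w[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

grad_w1, grad_w2 = manual_grads()
num_w1 = numeric_grad(w1)
num_w2 = numeric_grad(w2)
print(np.max(np.abs(grad_w1 - num_w1)), np.max(np.abs(grad_w2 - num_w2)))
```

If the two printed differences are tiny compared with the gradient values, the manual formulas agree with the numerical estimate.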

3. Bi-Linear NN with PyTorch

3.1 Implementation

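This time we use PyTorch tensors to build the forward pass, compute the loss, and run the backward pass.

A PyTorch Tensor is very similar to a NumPy ndarray. The biggest difference is that a PyTorch Tensor can run on either the CPU or the GPU. To compute on the GPU, the Tensor has to be moved to a CUDA device.

Methods that differ from NumPy:

mm: matrix multiplication, corresponds to dot
clamp: clamps values into a range; clamp(min=0) corresponds to np.maximum(h, 0)
pow: power; pow(2) corresponds to np.square
t: transpose, corresponds to .T
clone: copy, corresponds to copy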
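The correspondences in this list can be checked numerically. The sketch below uses two small example matrices (chosen here for illustration) and compares each PyTorch call with its NumPy counterpart.

```python
import numpy as np
import torch

a_np = np.array([[1.0, -2.0], [3.0, 4.0]])
b_np = np.array([[0.5, 1.0], [-1.0, 2.0]])
a_t, b_t = torch.tensor(a_np), torch.tensor(b_np)

# Each pair holds (PyTorch result, NumPy result) for one row of the list.
pairs = [
    (a_t.mm(b_t), a_np.dot(b_np)),            # mm    <-> dot
    (a_t.clamp(min=0), np.maximum(a_np, 0)),  # clamp <-> maximum
    (a_t.pow(2), np.square(a_np)),            # pow   <-> square
    (a_t.t(), a_np.T),                        # t     <-> .T
    (a_t.clone(), a_np.copy()),               # clone <-> copy
]
all_match = all(np.allclose(t.numpy(), n) for t, n in pairs)
print(all_match)
```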

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H)
w2 = torch.randn(H, D_out)

learning_rate = 1e-6
for it in range(500):
    # Forward pass
    h = x.mm(w1)  # N * H
    h_relu = h.clamp(min=0)  # N * H
    y_pred = h_relu.mm(w2)  # N * D_out

    # compute loss
    loss = (y_pred - y).pow(2).sum().item()
    if it % 50 == 0:
        print(it, loss)

    # Backward pass
    # compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

Results:

0 31964224.0
50 8884.4111328125
100 245.90003967285156
150 12.271601676940918
200 0.7670953273773193
250 0.05322442948818207
300 0.0041369786486029625
350 0.0005290982662700117
400 0.00013705584569834173
450 5.566925392486155e-05

A simple autograd example

x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

y = w*x + b  # y = 2*1+3

y.backward()

# dy / dw = x
print(w.grad)
print(x.grad)
print(b.grad)

Results:

tensor(1.)
tensor(2.)
tensor(1.)

3.2 PyTorch: Tensor and autograd

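An important feature of PyTorch is autograd: once the forward pass is defined and the loss is computed, PyTorch can automatically compute the gradients of all model parameters.

A PyTorch Tensor represents a node in the computation graph. If x is a Tensor with x.requires_grad=True, then after backward() is called x.grad is another Tensor holding the gradient of some scalar (usually the loss) with respect to x.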
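One detail matters for training loops built on autograd: backward() adds to x.grad instead of overwriting it, which is why gradients are reset after each update. A minimal sketch with a single scalar parameter (the values are chosen here for illustration):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()        # d(3w)/dw = 3

(3 * w).backward()           # no reset in between
accumulated = w.grad.item()  # 3 + 3 = 6: gradients are summed

w.grad.zero_()               # reset, as a training loop does
(3 * w).backward()
after_reset = w.grad.item()  # back to 3

print(first, accumulated, after_reset)
```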

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for it in range(500):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # compute loss
    loss = (y_pred - y).pow(2).sum()  # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    # Backward pass
    loss.backward()

    # update weights of w1 and w2
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()  # reset the gradients to zero
        w2.grad.zero_()

Results:

0 40281260.0
50 16265.544921875
100 770.28857421875
150 60.70436477661133
200 5.996867656707764
250 0.6809620261192322
300 0.08519172668457031
350 0.011621471494436264
400 0.0019496456952765584
450 0.0004842539201490581

3.3 PyTorch: nn

This time we use PyTorch's nn package to build the network. PyTorch autograd still builds the computation graph and computes the gradients for us automatically.

import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False),  # w_1 * x, no bias term
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False),  # default bias is True
)

# Initialize the weights
torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for it in range(500):
    # Forward pass
    y_pred = model(x)  # model.forward()

    # compute loss
    loss = loss_fn(y_pred, y)  # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    # Backward pass
    loss.backward()

    # update the model weights
    with torch.no_grad():
        for param in model.parameters():  # param (tensor, grad)
            param -= learning_rate * param.grad

    model.zero_grad()

Results:

0 37942064.0
50 13557.75390625
100 482.86322021484375
150 27.34326934814453
200 1.863478183746338
250 0.1420218050479889
300 0.01187755074352026
350 0.0013162594987079501
400 0.00026860935031436384
450 9.016783587867394e-05

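model[0].weight gives the weights of the first layer, and model[0].bias gives its bias term (None here, because the layers were created with bias=False).

model[0].weight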

Results:

Parameter containing:
tensor([[-0.0218,  0.0212,  0.0243,  ...,  0.0230,  0.0247,  0.0168],
        [-0.0144,  0.0177, -0.0221,  ...,  0.0161,  0.0098, -0.0172],
        [ 0.0086, -0.0122, -0.0298,  ..., -0.0236, -0.0187,  0.0295],
        ...,
        [ 0.0266, -0.0008, -0.0141,  ...,  0.0018,  0.0319, -0.0129],
        [ 0.0296, -0.0005,  0.0115,  ...,  0.0141, -0.0088, -0.0106],
        [ 0.0289, -0.0077,  0.0239,  ..., -0.0166, -0.0156, -0.0235]],
       requires_grad=True)
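To see how layer parameters are laid out, the following sketch builds a tiny Sequential model (the layer sizes are chosen here for illustration) and inspects its parameters. Note that nn.Linear stores its weight with shape (out_features, in_features).

```python
import torch

with_bias = torch.nn.Linear(4, 3)            # bias=True by default
no_bias = torch.nn.Linear(3, 2, bias=False)
model = torch.nn.Sequential(with_bias, torch.nn.ReLU(), no_bias)

# The weight is stored as (out_features, in_features).
weight_shape = tuple(model[0].weight.shape)

# A layer created with bias=False has bias set to None.
missing_bias = model[2].bias is None

# Parameters are named by their position in the Sequential container.
names = [name for name, _ in model.named_parameters()]

print(weight_shape, missing_bias, names)
```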

3.4 PyTorch: optim

This time we no longer update the model's weights by hand; instead we use the optim package to update the parameters.
The optim package provides a variety of optimization methods, including SGD+momentum, RMSProp, Adam, and so on.

import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False),  # w_1 * x, no bias term
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False),
)

torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()

loss_fn = nn.MSELoss(reduction='sum')

# With Adam, the manual weight initialization above can be skipped
# learning_rate = 1e-4
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

learning_rate = 1e-6
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for it in range(500):
    # Forward pass
    y_pred = model(x)  # model.forward()

    # compute loss
    loss = loss_fn(y_pred, y)  # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    optimizer.zero_grad()
    # Backward pass
    loss.backward()

    # update model parameters
    optimizer.step()

Results:

0 40010880.0
50 8527.947265625
100 276.97821044921875
150 18.40827751159668
200 1.5830364227294922
250 0.15364140272140503
300 0.016014395281672478
350 0.0019916763994842768
400 0.0004044498491566628
450 0.0001368314551655203
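For plain SGD (no momentum or weight decay), optimizer.step() performs the same update that the earlier sections wrote by hand. The sketch below checks this on a two-element parameter; the parameter values and learning rate are chosen here for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()  # the gradient of this loss is 2 * w
optimizer.zero_grad()
loss.backward()

# Manual rule from the earlier sections: w - lr * grad.
expected = (w - lr * w.grad).detach().clone()

optimizer.step()
matches = torch.allclose(w.detach(), expected)
print(w.detach().tolist(), matches)
```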

3.5 PyTorch: Custom nn Modules

We can define a model as a class that inherits from nn.Module. A model more complex than what Sequential can express has to be defined as an nn.Module subclass.

import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # define the model architecture
        self.linear1 = torch.nn.Linear(D_in, H, bias=False)
        self.linear2 = torch.nn.Linear(H, D_out, bias=False)

    def forward(self, x):
        y_pred = self.linear2(self.linear1(x).clamp(min=0))
        return y_pred

model = TwoLayerNet(D_in, H, D_out)
loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for it in range(500):
    # Forward pass
    y_pred = model(x)  # model.forward()

    # compute loss
    loss = loss_fn(y_pred, y)  # computation graph
    if it % 100 == 0:
        print(it, loss.item())

    optimizer.zero_grad()
    # Backward pass
    loss.backward()

    # update model parameters
    optimizer.step()

Results:

0 692.8556518554688
100 50.59251403808594
200 0.6749072074890137
300 0.01346816960722208
400 0.0009340611286461353
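As one example of a model that a plain Sequential chain cannot express, the sketch below adds the input back to the output (a skip connection). The class name ResidualNet and its sizes are hypothetical and only serve as illustration; they are not from the original notes.

```python
import torch

class ResidualNet(torch.nn.Module):  # hypothetical name, for illustration
    def __init__(self, dim, hidden):
        super().__init__()
        self.linear1 = torch.nn.Linear(dim, hidden)
        self.linear2 = torch.nn.Linear(hidden, dim)

    def forward(self, x):
        # The input is added back to the output, which needs custom forward logic.
        return x + self.linear2(self.linear1(x).clamp(min=0))

model = ResidualNet(dim=8, hidden=16)
out = model(torch.randn(5, 8))

# Submodules assigned in __init__ are registered automatically,
# so model.parameters() can be handed straight to an optimizer.
num_params = sum(p.numel() for p in model.parameters())
print(tuple(out.shape), num_params)
```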

4. FizzBuzz

FizzBuzz is a simple game. The rules: count upward from 1; say fizz for a multiple of 3, buzz for a multiple of 5, and fizzbuzz for a multiple of 15; otherwise say the number itself.

We can write a small program that decides whether to return the number itself, fizz, buzz, or fizzbuzz.

# Encode the desired output as a class index: [number, "fizz", "buzz", "fizzbuzz"]
def fizz_buzz_encode(i):
    if   i % 15 == 0: return 3
    elif i % 5  == 0: return 2
    elif i % 3  == 0: return 1
    else:             return 0

def fizz_buzz_decode(i, prediction):
    return [str(i), "fizz", "buzz", "fizzbuzz"][prediction]

print(fizz_buzz_decode(1, fizz_buzz_encode(1)))
print(fizz_buzz_decode(2, fizz_buzz_encode(2)))
print(fizz_buzz_decode(5, fizz_buzz_encode(5)))
print(fizz_buzz_decode(12, fizz_buzz_encode(12)))
print(fizz_buzz_decode(15, fizz_buzz_encode(15)))

Results:

1
2
buzz
fizz
fizzbuzz

We first define the model's inputs and outputs (the training data).

import numpy as np
import torch

NUM_DIGITS = 10

# Represent each input by an array of its binary digits.
def binary_encode(i, num_digits):
    return np.array([i >> d & 1 for d in range(num_digits)])

trX = torch.Tensor([binary_encode(i, NUM_DIGITS) for i in range(101, 2 ** NUM_DIGITS)])
trY = torch.LongTensor([fizz_buzz_encode(i) for i in range(101, 2 ** NUM_DIGITS)])
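The binary encoding puts the least significant bit first, which is easy to overlook. A small sketch, using 10 and 4 digits as an illustrative input:

```python
import numpy as np

def binary_encode(i, num_digits):
    # Bit d of i, for d = 0 (least significant) up to num_digits - 1.
    return np.array([i >> d & 1 for d in range(num_digits)])

# 10 is 0b1010; with the least significant bit first this reads 0, 1, 0, 1.
bits = binary_encode(10, 4).tolist()

# Rebuilding the number from the bits confirms the ordering.
decoded = sum(bit << d for d, bit in enumerate(bits))

# The training range 101 .. 2**10 - 1 excludes the test numbers 1 .. 100.
num_training_samples = len(range(101, 2 ** 10))

print(bits, decoded, num_training_samples)
```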

Then we define the model with PyTorch.

# Define the model
NUM_HIDDEN = 100
model = torch.nn.Sequential(
    torch.nn.Linear(NUM_DIGITS, NUM_HIDDEN),
    torch.nn.ReLU(),
    torch.nn.Linear(NUM_HIDDEN, 4)
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)

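For the model to learn the FizzBuzz game, we need to define a loss function and an optimization algorithm.
The optimizer repeatedly reduces the loss so that the model reaches as low a loss as possible on this task.
A low loss usually means the model performs well; a high loss means it performs poorly.
Because FizzBuzz is essentially a classification problem, we use the cross-entropy loss.
For the optimizer we use stochastic gradient descent.

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)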
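CrossEntropyLoss expects raw, unnormalized scores (logits) and integer class indices, which is why the model ends in a Linear layer without a softmax. The sketch below, with made-up logits for two samples and four classes, shows that the loss equals the mean negative log-softmax probability of the correct class.

```python
import torch
import torch.nn.functional as F

# Made-up logits for 2 samples and 4 classes, with integer class labels.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0],
                       [0.1, 0.2, 3.0, -0.5]])
targets = torch.tensor([0, 2])

loss = torch.nn.CrossEntropyLoss()(logits, targets)

# Equivalent: log-softmax followed by the negative log-likelihood loss.
via_nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Equivalent: pick the log-probability of the correct class and average.
log_probs = torch.log_softmax(logits, dim=1)
by_hand = -log_probs[torch.arange(2), targets].mean()

print(loss.item(), via_nll.item(), by_hand.item())
```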

The training code is as follows:

# Start training it
BATCH_SIZE = 128
for epoch in range(10000):
    for start in range(0, len(trX), BATCH_SIZE):
        end = start + BATCH_SIZE
        batchX = trX[start:end].to(device)
        batchY = trY[start:end].to(device)

        y_pred = model(batchX)
        loss = loss_fn(y_pred, batchY)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Find loss on training data
    loss = loss_fn(model(trX.to(device)), trY.to(device)).cpu().item()
    if epoch%1000 == 0:
        print('Epoch:', epoch, 'Loss:', loss)

Results:

Epoch: 0 Loss: 1.1463667154312134
Epoch: 1000 Loss: 0.4962996244430542
Epoch: 2000 Loss: 0.15307655930519104
Epoch: 3000 Loss: 0.07736020535230637
Epoch: 4000 Loss: 0.044787853956222534
Epoch: 5000 Loss: 0.029076775535941124
Epoch: 6000 Loss: 0.020609091967344284
Epoch: 7000 Loss: 0.01560244057327509
Epoch: 8000 Loss: 0.012386537156999111
Epoch: 9000 Loss: 0.010154682211577892

Finally, we use the trained model to play FizzBuzz on the numbers from 1 to 100.

# Output now
testX = torch.Tensor([binary_encode(i, NUM_DIGITS) for i in range(1, 101)])
with torch.no_grad():
    testY = model(testX.to(device))
predictions = zip(range(1, 101), list(testY.max(1)[1].data.tolist()))

print([fizz_buzz_decode(i, x) for (i, x) in predictions])

Results:

['1', '2', 'fizz', '4', 'buzz', 'fizz', '7', '8', 'fizz', '10', '11',
  'fizz', '13', '14', 'fizzbuzz', '16', '17', 'fizz', '19', 'buzz',
  'fizz', '22', '23', 'fizz', 'buzz', '26', 'fizz', '28', '29',
  'fizzbuzz', '31', '32', 'fizz', '34', 'buzz', 'fizz', '37', '38',
  'fizz', 'buzz', '41', '42', '43', '44', 'fizzbuzz', '46', '47',
  'fizz', '49', 'buzz', 'fizz', '52', '53', 'fizz', 'buzz', '56',
  'fizz', '58', '59', 'fizzbuzz', '61', '62', 'fizz', '64', 'buzz',
  'fizz', '67', '68', '69', 'buzz', '71', 'fizz', '73', '74', 'fizzbuzz',
  '76', '77', 'fizz', '79', 'buzz', 'fizz', '82', '83', '84', 'buzz',
  '86', 'fizz', '88', '89', 'fizzbuzz', '91', '92', '93', '94',
  'buzz', 'fizz', '97', '98', 'fizz', 'buzz']

print(np.sum(testY.cpu().max(1)[1].numpy() == np.array([fizz_buzz_encode(i) for i in range(1,101)])))
testY.cpu().max(1)[1].numpy() == np.array([fizz_buzz_encode(i) for i in range(1,101)])

Results:

95
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,       
        False,  True,  True,  True,  True,  True,  True,  True,  True,        
        True,  True,  True,  True,  True,  True,  True,  True,  True,        
        True,  True,  True,  True,  True,  True,  True,  True,  True,        
        True,  True,  True,  True,  True, False,  True,  True,  True,        
        True,  True,  True,  True,  True,  True,  True,  True,  True,        
        True,  True,  True,  True,  True,  True,  True,  True,  True,        
        True,  True,  True,  True,  True, False,  True,  True,  True,        
        True,  True,  True,  True,  True,  True,  True,  True,  True,        
        True,  True, False,  True,  True,  True,  True,  True,  True,        
        True,  True, False,  True,  True,  True,  True,  True,  True,        
        True])
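The same accuracy count can be done without leaving PyTorch. The sketch below uses made-up logits and labels rather than the trained model, and relies on argmax(dim=1) returning the same indices as max(1)[1].

```python
import torch

# Made-up model outputs for 3 samples and 4 classes, with their true labels.
logits = torch.tensor([[0.1, 2.0, 0.3, 0.0],
                       [1.5, 0.2, 0.1, 0.0],
                       [0.0, 0.1, 0.2, 3.0]])
labels = torch.tensor([1, 0, 2])

predicted = logits.argmax(dim=1)
same_as_max = torch.equal(predicted, logits.max(1)[1])

num_correct = (predicted == labels).sum().item()
accuracy = num_correct / len(labels)

print(predicted.tolist(), num_correct, accuracy)
```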