Code snippets for PyTorch

January 4, 2019 - 12 minute read

I am switching to PyTorch, and here are some pieces of code that I think will be useful to copy-paste later.


Data

PyTorch handles data quite cleanly: datasets are loaded as tensors and then iterated over with a DataLoader.

Example 1: CIFAR-10 dataset

import numpy as np
import torch
from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data.sampler import SubsetRandomSampler

# number of subprocesses to use for data loading
num_workers = 0
# how many samples per batch to load
batch_size = 20
# percentage of training set to use as validation
valid_size = 0.2

# convert data to a normalized torch.FloatTensor
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

# choose the training and test datasets
train_data = datasets.CIFAR10('data', train=True,
                              download=True, transform=transform)
test_data = datasets.CIFAR10('data', train=False,
                             download=True, transform=transform)

# obtain training indices that will be used for validation
num_train = len(train_data)
indices = list(range(num_train))
np.random.shuffle(indices)
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]
 
# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

# prepare data loaders (combine dataset and sampler)
# when sampler is specified, shuffle parameter must be False 
# because sampler already specifies how data is drawn from the dataset.
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
    sampler=train_sampler, num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, 
    sampler=valid_sampler, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, 
    num_workers=num_workers)

# specify the image classes
classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

In the example above, the three loaders train_loader, valid_loader, and test_loader iterate over (data, target) pairs. The training process for a given model, optimizer, and criterion looks like this:

# number of epochs to train the model
n_epochs = 8
# train on GPU if CUDA is available
train_on_gpu = torch.cuda.is_available()
# track change in validation loss
valid_loss_min = np.Inf

# start training
for epoch in range(1, n_epochs+1):
    # keep track of training and validation loss
    train_loss = 0.0
    valid_loss = 0.0
    ###################
    # train the model #
    ###################
    model.train()
    for data, target in train_loader:
        # move tensors to GPU if CUDA is available
        if train_on_gpu:
            data, target = data.cuda(), target.cuda()
        # clear the gradients of all optimized variables; by default gradients
        # are accumulated (not overwritten) on every backward pass, a design
        # that is convenient for recurrent neural networks
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the batch loss
        loss = criterion(output, target)
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update training loss
        train_loss += loss.item()*data.size(0)
    ######################    
    # validate the model #
    ######################
    model.eval()
    # gradients are not needed for validation
    with torch.no_grad():
        for data, target in valid_loader:
            # move tensors to GPU if CUDA is available
            if train_on_gpu:
                data, target = data.cuda(), target.cuda()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the batch loss
            loss = criterion(output, target)
            # update average validation loss
            valid_loss += loss.item()*data.size(0)
    # calculate average losses
    train_loss = train_loss/len(train_loader.dataset)
    valid_loss = valid_loss/len(valid_loader.dataset)
    # print training/validation statistics 
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
        epoch, train_loss, valid_loss))
    # save model if validation loss has decreased
    if valid_loss <= valid_loss_min:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
            valid_loss_min, valid_loss))
        torch.save(model.state_dict(), 'model_cifar.pt') # state_dict() is a function call 
        valid_loss_min = valid_loss
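
To reuse the checkpoint later, the saved state_dict can be loaded back into a model with the same architecture. A minimal sketch, assuming the same model object and the model_cifar.pt file from above:

# load the weights that achieved the lowest validation loss
model.load_state_dict(torch.load('model_cifar.pt'))
# switch to evaluation mode before running inference
model.eval()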

In this paper, the author views the Szemerédi Regularity Lemma from an information theoretic perspective. Such a point of view not only clarifies the Regularity Lemma, but also allows other techniques (from information theory and ergodic theory) to be applied to proving and improving it.

The Regularity Lemma is a building block of Szemerédi's Theorem, which states that every subset of the natural numbers with positive density contains arbitrarily long arithmetic progressions. This result relates strongly to ergodic theory.
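
For reference, the density form of the statement (a standard formulation, written here in my own words): if $A \subseteq \mathbb{N}$ has positive upper density, i.e. $\limsup_{N \to \infty} |A \cap \{1, \dots, N\}| / N > 0$, then for every $k \ge 1$ the set $A$ contains an arithmetic progression of length $k$.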

Instead of studying dense graphs directly, the author uses graphs and bipartite graphs as an analogue of random variables. Suppose there are two random variables $x_1$ and $x_2$ defined on the vertex sets $V_1$ and $V_2$, each equipped with some probability measure.

There is a recent paper from Cambridge (probably submitted to ICML 2019) that connects ergodic theory with machine learning. It was published on arXiv with the title “The Theory and Algorithm of Ergodic Inference”.

(Dec 26 – paused)

1. Introduction

Main ideas and contributions

  1. Starting from a skewed stacked RNN architecture, the authors propose a novel RNN in which each hidden unit is parameterized by a higher-rank tensor (rank ≥ 2).

    sRNN and tRNN

    The image above compares a skewed recurrent neural network with three hidden layers (depth L = 3) to the proposed model. The tRNN model here has only one hidden layer, but its hidden units are parameterized by a P-by-M matrix rather than a single vector. Furthermore, the interactions between hidden units are captured by the convolution operation (defined later) to take advantage of the tensor representation. According to the authors, the first dimension of the hidden units is “locally connected” in order to share parameters, while the second dimension is “fully connected” for global interaction. This idea will become clearer when we discuss the convolution. In general, a new hidden tensor is constructed by concatenating the (projected) input with the previous hidden tensor, applying the convolution, and passing the result through the activation function; see the sketch after this list.

    Note that the actual output at a given time step is computed using only the last slice (in this case, the last row vector) of the hidden tensor. This matches the figure above, where the output is computed from the last (top) layer of the skewed stacked network. Also, the implicit depth P = L = 3 leads us to the same computation.

  2. The paper introduces cross-layer convolution and memory cell convolution (for the LSTM extension). The explanation of the convolution is presented in the second section. Note that in this post, I moved the time index to a superscript when convenient, so that the subscript contains only indexing variables.

  3. The 2D-tRNN model is further extended to LSTM and higher order tensors (3D).

    This extension to LSTM is pretty straightforward, as each gate is computed using the convolution operator in lieu of the standard matrix multiplication. For the higher-order tensor extension, the concatenated tensor is constructed by appending the projection (multiplied with weights, added with a bias) of the input $x^t$ to one corner of the hidden tensor. This can be understood as going down the tensor dimension by dimension: when you reach a 2D matrix with row size M, append the projected input to it. In the same way, the output $y^t$ is generated from the opposite corner.
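
To make the construction in point 1 concrete, here is a rough PyTorch sketch of a single tRNN time step as I understand it. The variable names and the use of F.conv1d to realize the convolution along the first dimension are my own choices, not the authors' implementation:

import torch
import torch.nn.functional as F

P, M, R, K = 3, 16, 8, 5                 # tensor size, channels, input size, kernel size
x_t = torch.randn(R)                     # input vector at time t
hidden = torch.zeros(P, M)               # previous hidden tensor, shape P x M
W_x, b_x = torch.randn(R, M), torch.zeros(M)   # input projection
W_h = torch.randn(M, M, K)               # kernel: M output channels, M input channels, width K
b_h = torch.zeros(M)

# 1. append the projected input to the previous hidden tensor -> (P+1) x M
h_cat = torch.cat([(x_t @ W_x + b_x).unsqueeze(0), hidden], dim=0)

# 2. convolve along the first ("locally connected") dimension, mixing the M channels
a_t = F.conv1d(h_cat.t().unsqueeze(0),           # shape 1 x M x (P+1)
               W_h, bias=b_h, padding=K // 2)    # zero padding keeps the length
a_t = a_t.squeeze(0).t()                         # back to (P+1) x M

# 3. activation gives the new hidden tensor (keeping the first P rows is my simplification)
hidden_new = torch.tanh(a_t[:P])

The first dimension of the hidden tensor is the one being scanned by the kernel, while the M channels play the role of the “fully connected” dimension.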

Dissecting convolution

It took me some time to fully grasp the authors' idea :dizzy_face:. Despite that “some time”, my understanding can be captured in a single image (and now I feel silly). In the case of 2D hidden units, the operation and the dimensionality of each tensor are as follows.
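
In my notation (the symbols are my own choices, so they may not match the paper exactly), the operation is

$A^{t} = H^{\mathrm{cat}} \circledast \{W, b\}$

with $H^{\mathrm{cat}} \in \mathbb{R}^{(P+1) \times M}$, $W \in \mathbb{R}^{K \times M^i \times M^o}$, and $b \in \mathbb{R}^{M^o}$.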

Here, $A^t$ is the activation tensor, and $H^{\mathrm{cat}}$ represents the concatenation of a hidden unit's rank-2 tensor output (a matrix) and the input vector from the layer above it (in this case, the input vector $x^t$). From the skewed sRNN figure above, the concatenation is pretty clear: each hidden node is pointed to by two arrows (from the node in the same layer and the node in the layer below), hence the tensor used in computing it is the concatenation of those two outputs. The same principle applies to the tRNN. The convolution kernel consists of a weight tensor $W$ and a bias vector $b$; $W$ is a rank-3 tensor of K filters, where each filter has $M^i$ input channels and $M^o$ output channels. In this paper, the authors note that they let $M^i = M^o = M$ for simplicity. The detail of the convolution is given in the supplementary document.
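
Written element-wise (again in my notation; the exact indexing may differ slightly from the paper's supplementary material), the entry of the activation selected by $(p, m^o)$ is

$A^{t}_{p, m^o} = \sum_{k=1}^{K} \sum_{m^i=1}^{M^i} H^{\mathrm{cat}}_{p + k - \lceil K/2 \rceil,\, m^i} \, W_{k, m^i, m^o} + b_{m^o},$

where rows of $H^{\mathrm{cat}}$ that fall outside the valid range are treated as zeros (zero padding).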

It might sound obvious, but it is easier for me to think of the indices as “selectors”. For example, $W_{k, m^i, m^o}$ is the element of the kernel tensor $W$ that is selected by the triple $(k, m^i, m^o)$. Usually, a “selector” is lowercase and the total number of elements in that dimension is uppercase. In the figure below, I circled the “selectors” in red.

sRNN and tRNN

It might help to tell this as a kind of story. The activation matrix $A^t$ is what we need in order to move to the next time step, and we set out to compute each of its elements. Each element $A^t_{p, m^o}$ is given by the convolution operator defined in the formula above; in the figure, the small gray box represents it. The two selectors $p$ and $m^o$ control, respectively, the center of the convolution in $H^{\mathrm{cat}}$ and which output channel to pick in $W$. For example, if $K = 5$ and we want to compute the element at $(p, m^o)$, then from each of the 5 slices of $W$ the column associated with $m^o$ is picked out. Next, on $H^{\mathrm{cat}}$, 5 rows are selected as well; since $K = 5$, their indices run from $p-2$ to $p+2$ (centered at $p$). These 5 pairs of vectors are indexed by $k$, and the sum of the dot products of the pairs is $A^t_{p, m^o}$. Note that the authors use zero padding to keep the shape of the output the same as that of the input. In the case of the memory cell convolution, the padding values are the border values instead, to avoid interference with the memory values.
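
As a sanity check on my reading of the operation, here is a small NumPy sketch that computes the activation exactly the way the story above describes. The names and shapes are my own, not the authors' code:

import numpy as np

P, M, K = 3, 16, 5                     # tensor size, channels, kernel size
H_cat = np.random.randn(P + 1, M)      # concatenated hidden tensor
W = np.random.randn(K, M, M)           # kernel: K slices, each M^i x M^o
b = np.zeros(M)

def activation_element(p, m_o):
    """A[p, m_o] as the sum of K dot products, centered at row p, with zero padding."""
    total = 0.0
    for k in range(K):
        row = p + k - K // 2           # row of H_cat selected by this slice
        if 0 <= row < H_cat.shape[0]:  # out-of-range rows are zero-padded
            total += H_cat[row] @ W[k, :, m_o]
    return total + b[m_o]

A = np.array([[activation_element(p, m_o) for m_o in range(M)] for p in range(P + 1)])
print(A.shape)                         # (P + 1, M): same shape as the input, thanks to the padding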

Evaluations

The authors use three main tasks to demonstrate the effectiveness of their proposed model:

  1. Algorithmic tasks consist of learning to sum two 15-digit numbers and learning to repeat the input sequence.

      Sum two 15-digit integers    ||   Repeat the input
      Input : --2545-9542-------   ||   Input : -hiworld----------
      Output: -----------12087--   ||   Output: --------hiworld--- 
    

    The advantage of the proposed model (3D-tLSTM with Channel Normalization) is that it requires far fewer training samples to reach more than 99 percent accuracy compared with Stacked LSTM and Grid LSTM. However, there is no mention of the training time or of other modifications of tLSTM. Furthermore, the best-performing models are 7 and 10 layers deep, and there is not much explanation for these hyper-parameters. The 7-layer model requires fewer training samples than the 10-layer model on the addition task, but the 10-layer model in turn requires far fewer samples on memorization; I expected the 7-layer model to require fewer samples on both tasks.

  2. MNIST classification tasks consist of normal MNIST and a randomized-pixel-sequence variant called pMNIST. In these tasks, the pixels of a hand-written digit image are fed into the neural nets as a sequence. Here, I do not see any advantage of using tLSTM compared with state-of-the-art methods such as DilatedGRU.

  3. The Wikipedia language modeling task deals with next-character prediction on the Hutter Prize Wikipedia dataset. Similar to the aforementioned MNIST tasks, there is no clear win for tLSTM compared with the Large FS-LSTM-4 method (in terms of both BPC and number of parameters).

Why tensors?

It is true that we can design a “vector” recurrent neural net that works in the same way as the proposed tRNN here; after all, the variables are stored in arrays on our computer. However, tensors give us a concrete way to think about a “block of numbers”, and more importantly, we can define and reason about convolutions easily with them. Let's take image data as an example. The image below shows the popular 2D convolution operation. The tensor representation gives us a concrete way to think about color channels and their corresponding channels in each convolution kernel. Furthermore, notice that we “scan” each convolution kernel along the intensity (spatial) dimensions of the matrices but not along the color dimension; as such, the weights in the intensity dimensions are shared. Similarly, in recurrent neural networks, the tensor hidden unit in some sense enables a more abstract representation and allows weight sharing.

[Figure: the popular 2D convolution operation on image data]
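
As a toy illustration of this channel structure (my example, not from the paper), a standard 2D convolution in PyTorch has one kernel channel per input color channel, and the same kernel weights are reused at every spatial location:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5, padding=2)

image = torch.randn(1, 3, 32, 32)   # a batch containing one 3-channel (RGB) image
features = conv(image)

print(conv.weight.shape)   # torch.Size([8, 3, 5, 5]): one 5x5 kernel channel per color channel
print(features.shape)      # torch.Size([1, 8, 32, 32]): the same weights are scanned over all pixel positions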

Conclusion

This paper proposes a nice way to increase the structural depth of a recurrent neural network. Instead of parameterizing hidden units with vectors, tRNN models hidden units as tensors for more efficient weight sharing and implicit depth. Such an approach greatly reduces the number of parameters in a recurrent neural network while maintaining the depth. The tRNN model is also extended with an LSTM module and higher-order (3D) tensors for better performance and data abstraction. The effectiveness of tLSTM is demonstrated on three different tasks, on which tLSTM achieves state-of-the-art performance with some improvement in the number of parameters (minor) and in the required number of training samples. On the other hand, while the proposed model might improve running time and parameter count, there was no discussion of the training time and training complexity of tLSTM. It would be interesting to implement the 2D and 3D models to understand the benefits of tLSTM better.

In this post, I wrote about my understanding of and comments on the paper “Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning” by Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. I left out some details, such as Channel Normalization and the constraints on the convolution kernel; these minor optimization tricks can be found in the paper. After this post, I plan to implement this technique to see if it is suitable for my current work. In particular, I would like to see if the training cost can be reduced for a stacked LSTM approach to modeling product names.