In this post, we study the result of Song Han's work on AlexNet. Since the encoding code is not provided, we analyze the decompression code available on the author's GitHub repository to get a clear understanding of the compression scheme. Two main techniques contribute to the small size of the compressed AlexNet:
- Value clustering - Instead of allowing an arbitrary distinct value for every weight, each layer is limited to only 256 distinct values (convolutional) or 16 distinct values (fully connected). These values are encoded using 8-bit and 4-bit integers respectively.
- Running-sum encoding for the index array - Each weight matrix is stored as a sparse array (only non-zero elements are stored). To avoid large index values (up to several million), the index array stores, for each non-zero element, only the gap from the previous non-zero element instead of the absolute index. (This scheme makes Huffman coding effective later.)
Binary file layout
The provided binary file AlexNet_compressed.net is organized into a header containing the number of non-zero elements in each layer (nz_num) and a body containing data for each layer. Each layer has 4 main parts: codebook contains the distinct float values of the layer, bias contains the bias values (no compression for biases), spm_stream contains the integer encoding for each non-zero element in the weight matrix, and ind_stream contains the index of each non-zero element.
In the figure above, each part is named according to the naming in the provided decode.py file. Below each name is the size of the array (we will provide details in the following sections). Yellow stands for the unsigned integer data type, blue stands for the float data type.
Header
The file header contains 8 32-bit unsigned integers representing the number of
non-zero elements in each layer. There are 8 layers in AlexNet. In the
decompression code, this array is named nz_num.
layers = ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'fc6', 'fc7', 'fc8']
# Read 8 uint32 values from fin (AlexNet_compressed.net)
nz_num = np.fromfile(fin, dtype = np.uint32, count = len(layers))
# nz_num = array([29388, 118492, 309138, 247913, 163904, 4665474, 1959380, 1061645],
#                dtype=uint32)
Compressed data for each layer
There are two types of layers in AlexNet, namely convolutional and fully connected. As we mentioned above, each convolutional layer has 256 distinct values, while each fully connected layer has 16 distinct values. These distinct values are stored in the codebook section as 32-bit floats. The encoding is stored as 8-bit or 4-bit unsigned integers. We take the first convolutional layer conv1 as an example.
Per layer storage layout
+----------+----------+---------------+---------------+
| codebook | biases | ind_stream | spm_stream |
+----------+----------+---------------+---------------+
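The sections below walk through each of these four blocks. As a preview, here is a minimal sketch of how one layer's blocks could be read in the order shown, with sizes that follow from the descriptions in the next sections (the exact expressions in the author's decode.py may differ). Here idx is the layer's position in layers and fin is the open file.
# Sketch: read one layer's blocks in the order shown above
num_nz = nz_num[idx] # number of non-zero weights in this layer
codebook_size = 256 if 'conv' in layer else 16
codebook = np.fromfile(fin, dtype = np.float32, count = codebook_size)
bias = np.fromfile(fin, dtype = np.float32, count = net.params[layer][1].data.size)
ind_stream = np.fromfile(fin, dtype = np.uint8, count = (num_nz + 1) // 2) # packed 4-bit indices
spm_bytes = num_nz if 'conv' in layer else (num_nz + 1) // 2 # 8-bit or packed 4-bit keys
spm_stream = np.fromfile(fin, dtype = np.uint8, count = spm_bytes)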
Since conv1 is a convolutional layer, each element in the weight matrix is encoded using 8 bits. The codebook has 256 entries (32-bit floats):
# Read codebook from file, codebook_size=256 in this case
codebook = np.fromfile(fin, dtype = np.float32, count = codebook_size)
The bias of each convolutional layer is read and copied to the network. In this case, there are 96 32-bit float biases corresponding to the 96 convolutional kernels. No compression is applied to the biases.
bias = np.fromfile(fin, dtype = np.float32, count = net.params[layer][1].data.size)
np.copyto(net.params[layer][1].data, bias)
The majority of the information is stored in ind_stream and spm_stream. In the decoding process, ind_stream is read as an array of unsigned 8-bit integers. This array stores the indices of the non-zero elements in the flattened weight matrix. To save storage space, ind_stream is actually stored as 4-bit integers, so each 8-bit value is read as two 4-bit indices. Furthermore, the indexing is stored as the gap from the previous non-zero element (the running-sum encoding mentioned above) rather than as absolute positions. The following example makes it clear:
data = [1, 3, 1, 0, 0, 0, 2, 0, 1]
ind = [0, 1, 2, 6, 8] # Only store indices of non-zero elements
ind_stream = [0, 0, 0, 3, 1] # Number of zeros skipped before each non-zero element
# We can get back the original indices as follows:
ind = ind_stream + 1 # [1, 1, 1, 4, 2]
ind = np.cumsum(ind) # [1, 2, 3, 7, 9] - Accumulating sum
ind = ind - 1 # [0, 1, 2, 6, 8]
ind == [0, 1, 2, 6, 8] # The original indexing
Recovering the two 4-bit indices packed into each 8-bit value is done with a modulo and an integer division (equivalent to masking the low bits and shifting out the high bits):
# Here, num_nz is the number of non-zero elements
ind[np.arange(0, num_nz, 2)] = ind_stream % (2**4)   # low 4 bits
ind[np.arange(1, num_nz, 2)] = ind_stream // (2**4)  # high 4 bits
The second large chunk of memory is spm_stream, which stores the keys into the codebook, where the real weight values live. conv1 has 256 distinct values, hence spm_stream is an array of 8-bit unsigned integers.
# Create a numpy array of size num_nz filled with zeros
spm = np.zeros(num_nz, np.uint8)
spm = spm_stream # for 8-bit layers, each byte is already one codebook key
In the case of a fully connected layer, only 16 distinct values are used, so only 4 bits are needed per element:
spm = np.zeros(num_nz, np.uint8)
spm[np.arange(0, num_nz, 2)] = spm_stream % (2**4)   # low 4 bits
spm[np.arange(1, num_nz, 2)] = spm_stream // (2**4)  # high 4 bits
Each element of spm_stream points to a real value in codebook. For example, if we need data[i], where i is inferred from ind_stream, we look up spm[i], and then data[i] = codebook[spm[i]]. Here we can clearly see that the values of the fully connected layers are divided into 16 bins, and those of the convolutional layers into 256 bins. This is where the main compression happens.
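To make the lookup concrete, here is a minimal sketch of how the dense weight matrix could be reconstructed and copied into the network once ind, spm, and codebook have been decoded as above. This is our own illustration, not a verbatim excerpt from decode.py.
# Sketch: scatter the decoded non-zero values back into a dense weight array
# (assumes ind, spm, codebook and net are defined as in the snippets above)
data = np.zeros(net.params[layer][0].data.size, dtype = np.float32)
data[ind] = codebook[spm] # data[i] = codebook[spm[i]]
np.copyto(net.params[layer][0].data, data.reshape(net.params[layer][0].data.shape))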
Huffman coding
Although the code provided by the author does not implement Huffman coding, we can see that Huffman coding can help reduce the coding length of spm_stream. According to Han, the compressed model's size is further reduced by 1MB with the use of Huffman coding. Furthermore, it is also possible to compress ind_stream, since it contains long runs of zeros.
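As a rough sanity check of how much Huffman coding could help, we can compare the fixed-length cost of spm_stream with its entropy, which lower-bounds the average length of any prefix code such as Huffman. The sketch below is our own illustration; spm and the 8-bit width for conv1 come from the snippets above.
# Entropy-based estimate of what prefix coding (e.g. Huffman) could achieve
def estimate_coded_bytes(symbols, fixed_bits):
    counts = np.bincount(symbols)
    probs = counts[counts > 0] / float(symbols.size)
    entropy = -(probs * np.log2(probs)).sum() # bits per symbol
    return symbols.size * entropy / 8, symbols.size * fixed_bits / 8

# e.g. for conv1 (8-bit codebook keys):
# est_bytes, fixed_bytes = estimate_coded_bytes(spm, fixed_bits = 8)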
Discussion
Taking a look at the weight value distribution of each layer gives some insight into the design decisions behind the compressed file. (For more on layer encoding and clustering, see: weight-clustering-notebook)
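As a quick way to reproduce these observations, the sketch below computes the standard deviation and value range of each layer's weights; it assumes net is the uncompressed reference Caffe model and layers is the list defined earlier.
# Sketch: per-layer weight statistics from the reference model
for layer in layers:
    w = net.params[layer][0].data.ravel()
    print('%s: std=%.4f, min=%.4f, max=%.4f' % (layer, w.std(), w.min(), w.max()))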
The first thing we can observe here is the difference between the standard deviations of the convolutional layers (conv) and the fully connected layers (fc). Moreover, the value ranges of the convolutional layers are also much larger than those of the fully connected layers. This observation suggests that we need a "finer" quantization for convolutional layers. As it turns out, to preserve accuracy we need to quantize the weights of a convolutional layer into 256 values, while 16 discrete values are enough for a fully connected layer.
This design decision is good for two reasons:
- Larger coverage for convolutional layers. These layers are small (less than a million parameters each), so we can afford to encode them using 8 bits (256 values).
- Storage savings for fully connected layers. Since a fully connected layer has up to 25 million parameters, storing each value as a 4-bit key greatly helps to compress the size. (A sketch of how such a codebook can be built by clustering follows this list.)
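The binning itself amounts to clustering each layer's non-zero weights into 256 (or 16) centroids, which become the codebook. Below is a minimal sketch of that idea using scikit-learn's KMeans; this is our own illustration under those assumptions, not the author's training code.
# Sketch: build a codebook by clustering the non-zero weights of a layer
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(weights, bits):
    nonzero = weights[weights != 0].reshape(-1, 1)
    kmeans = KMeans(n_clusters = 2**bits, n_init = 10).fit(nonzero)
    codebook = kmeans.cluster_centers_.ravel().astype(np.float32)
    keys = kmeans.predict(nonzero).astype(np.uint8) # per-weight codebook key
    return codebook, keys

# 8 bits (256 values) for a convolutional layer, 4 bits (16) for a fully connected one:
# codebook, keys = build_codebook(net.params['conv1'][0].data.ravel(), bits = 8)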
As a closing note on this compression scheme, there are several interesting points to discuss:
- Pruning parameters. We do not know the pruning threshold or the number of pruning-retraining iterations. We are implementing our own pruning, clustering and codebook training on Caffe.
- The sensitivity of the convolutional kernels to quantization. It seems that fully connected layers are more robust to quantization (they tolerate fewer distinct values) than the convolutional kernels. We need to study the robustness of the pruning and quantization process with respect to the number of discrete values, and the sensitivity of the extracted model's accuracy to the values in the codebook.
- The main storage lies in ind_stream (non-zero indices) and spm_stream (look-up keys for the real values in codebook). Can we compress these data further? One idea is to sort the codebook so that the keys preserve the locality of the stored values (similar keys store similar values). From there, we could store spm_stream in a lossy integer format (inspired by the Fourier transform). However, Huffman coding might become less effective in this scheme. (A minimal sketch of the codebook-sorting idea follows.)
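To illustrate the codebook-sorting idea, the sketch below reorders an existing codebook by value and remaps the spm keys accordingly, so that nearby keys point to nearby weight values. It assumes codebook and spm are decoded as above and is only our own illustration of the idea.
# Sketch: sort the codebook and remap keys so similar keys store similar values
order = np.argsort(codebook) # permutation that sorts the codebook
sorted_codebook = codebook[order]
remap = np.empty_like(order)
remap[order] = np.arange(order.size) # old key -> new key
sorted_spm = remap[spm].astype(spm.dtype)
# Decoding is unchanged: codebook[spm[i]] == sorted_codebook[sorted_spm[i]]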