Video Frame Prediction using ConvLSTM Network in PyTorch

Rohit Panda
6 min read · Jun 14, 2021


Why predict video frames

Courtesy: A Review on Deep Learning Techniques for Video Prediction, Oprea et al., 2020

Imagine an autonomous vehicle driving on the road. Its input video feed consists of a sequence of frames; predicting a future frame could help it decide in real time when to brake, slow down, or speed up!

Video frame prediction

Given a sequence of frames, we wish to build a model which can learn some hidden representation of this sequence, and then use this representation to predict a future frame.

A video is a spatiotemporal sequence: it has both spatial and temporal correlations, which need to be captured in order to predict a frame.

State of video prediction

The first approaches to learning unsupervised video representations, by van Hateren & Ruderman, 1998, were based on Independent Component Analysis.

In Unsupervised Learning of Video Representations using LSTMs, Srivastava et al., 2015 use multilayer LSTM encoder-decoder networks to learn video representations.

In this post we will be looking at the Convolutional LSTM unit, a novel architecture proposed by Shi et al., 2015 in Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting.

The authors have shown that this model performs better on spatiotemporal data than earlier purely LSTM-based approaches, thanks to its additional ability to capture spatial correlations.

The number of parameters is also hugely reduced compared to a fully connected LSTM network, completely analogous to how the parameter count of a convolutional network differs from that of a fully connected network.

There has been extensive development since, with the advent of newer models such as PredRNN, PredRNN++, Eidetic 3D LSTM and many more.

Our approach

As mentioned, we look at the Convolutional LSTM unit.

This post is inspired by the excellent tutorial Next-Frame Video Prediction with Convolutional LSTMs by Amogh Joshi, which uses the out-of-the-box ConvLSTM2D layer available in the Keras layers API.

However, a ConvLSTM layer is not available in PyTorch as of now, so we'll build one ourselves.

Some more background

Let us start by looking at an LSTM unit. Originally introduced by Hochreiter & Schmidhuber, 1997 in their paper Long Short-Term Memory, it has undergone some modifications, but we will look at the formulation by Alex Graves, 2013.

Courtesy: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting

Here, i, f, o, c and h are the input gate, forget gate, output gate, cell activation and hidden state respectively; all are vectors of the same size. The W terms are weight matrices and x is the input. '∘' denotes the Hadamard (element-wise) product.
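
Written out, the formulation (with peephole connections, as given in Shi et al., 2015) is:

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \circ c_{t-1} + b_i\right)\\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \circ c_{t-1} + b_f\right)\\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right)\\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \circ c_t + b_o\right)\\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$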

A video is a spatiotemporal sequence. Although the LSTM architecture above handles temporal correlations very effectively, it cannot capture spatial information well, since the input, hidden states, cells and gates are all vectors.

Meet the ConvLSTM unit, proposed by Shi et al., 2015.

Courtesy: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting

The only things that differ from the LSTM unit are listed below, followed by the full ConvLSTM equations:

  • The matrix multiplication between the weight matrices W and the input X is replaced by the convolution operation, i.e., W is now a learnable set of filters/kernels.
  • Similarly, the matrix multiplication between the weight matrices W and the hidden state at the previous time step is replaced by convolution.
  • The input X, hidden states H, cell activations C and gates i, f, o are now all 3D tensors, whose first dimension is the number of channels/filters and whose last two dimensions are spatial (height and width), allowing them to capture spatial information.
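
Concretely, the ConvLSTM equations from Shi et al., 2015 (with ∗ denoting convolution and ∘ the Hadamard product) read:

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} \ast X_t + W_{hi} \ast H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right)\\
f_t &= \sigma\left(W_{xf} \ast X_t + W_{hf} \ast H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} \ast X_t + W_{hc} \ast H_{t-1} + b_c\right)\\
o_t &= \sigma\left(W_{xo} \ast X_t + W_{ho} \ast H_{t-1} + W_{co} \circ C_t + b_o\right)\\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
$$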

Designing the model

Let's get coding! We'll begin by implementing a single time step of a ConvLSTM layer, i.e., a code representation of the equations above. I have made provisions for both 'tanh' and 'relu' activations in the code below.

The idea of using a single convolutional layer for all four gate convolutions is adapted from Andrea Palazzi; it is computationally more efficient than four separate convolutions.
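
Below is a minimal sketch of such a cell (the complete version is in the linked repo). Treating the peephole weights W_ci, W_cf, W_co as learnable per-pixel parameters is an implementation choice here, not something mandated by the paper:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One time step of a ConvLSTM: a direct translation of the equations above."""

    def __init__(self, in_channels, out_channels, kernel_size, padding,
                 activation, frame_size):
        super().__init__()
        self.activation = torch.tanh if activation == "tanh" else torch.relu

        # One convolution computes all four gate pre-activations at once,
        # taking [X_t, H_{t-1}] stacked along the channel dimension.
        self.conv = nn.Conv2d(
            in_channels=in_channels + out_channels,
            out_channels=4 * out_channels,
            kernel_size=kernel_size,
            padding=padding,
        )

        # Peephole weights W_ci, W_cf, W_co, modeled here as learnable
        # per-pixel parameters; frame_size is the spatial size (H, W).
        self.W_ci = nn.Parameter(torch.zeros(out_channels, *frame_size))
        self.W_cf = nn.Parameter(torch.zeros(out_channels, *frame_size))
        self.W_co = nn.Parameter(torch.zeros(out_channels, *frame_size))

    def forward(self, X, H_prev, C_prev):
        # Split the joint convolution output into the four gate pre-activations.
        i_conv, f_conv, c_conv, o_conv = torch.chunk(
            self.conv(torch.cat([X, H_prev], dim=1)), chunks=4, dim=1)

        input_gate = torch.sigmoid(i_conv + self.W_ci * C_prev)
        forget_gate = torch.sigmoid(f_conv + self.W_cf * C_prev)
        C = forget_gate * C_prev + input_gate * self.activation(c_conv)
        output_gate = torch.sigmoid(o_conv + self.W_co * C)
        H = output_gate * self.activation(C)
        return H, C
```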

Now we'll extend this by unrolling over time steps; the code below constitutes a single ConvLSTM layer in our network.
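
A sketch of the layer, assuming input of shape (batch, channels, time, height, width):

```python
class ConvLSTM(nn.Module):
    """A single ConvLSTM layer: unrolls a ConvLSTMCell over the time dimension."""

    def __init__(self, in_channels, out_channels, kernel_size, padding,
                 activation, frame_size):
        super().__init__()
        self.out_channels = out_channels
        self.cell = ConvLSTMCell(in_channels, out_channels, kernel_size,
                                 padding, activation, frame_size)

    def forward(self, X):
        # X: (batch, channels, time, height, width)
        batch_size, _, seq_len, height, width = X.size()

        # Hidden state and cell activation start at zero.
        H = torch.zeros(batch_size, self.out_channels, height, width,
                        device=X.device)
        C = torch.zeros_like(H)

        outputs = []
        for t in range(seq_len):
            H, C = self.cell(X[:, :, t], H, C)
            outputs.append(H)

        # Stack hidden states back into a (batch, channels, time, H, W) tensor.
        return torch.stack(outputs, dim=2)
```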

We finish off the network by stacking up a few ConvLSTM layers, each followed by a BatchNorm3d layer, and finally a Conv2d layer, which takes the hidden state at the last time step of the last layer and predicts a frame. Finally, we pass this through a sigmoid activation to get pixel values between 0 and 1.
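
A sketch of the full model (the class name Seq2Seq is my own label for it):

```python
class Seq2Seq(nn.Module):
    """Stacked ConvLSTM + BatchNorm3d layers, finished by a Conv2d that maps
    the last hidden state to the predicted frame."""

    def __init__(self, num_channels, num_kernels, kernel_size, padding,
                 activation, frame_size, num_layers):
        super().__init__()
        self.sequential = nn.Sequential()

        # First layer maps the input channels to num_kernels feature maps.
        self.sequential.add_module("convlstm1", ConvLSTM(
            num_channels, num_kernels, kernel_size, padding,
            activation, frame_size))
        self.sequential.add_module("batchnorm1", nn.BatchNorm3d(num_kernels))

        # Remaining layers keep num_kernels channels.
        for l in range(2, num_layers + 1):
            self.sequential.add_module(f"convlstm{l}", ConvLSTM(
                num_kernels, num_kernels, kernel_size, padding,
                activation, frame_size))
            self.sequential.add_module(f"batchnorm{l}",
                                       nn.BatchNorm3d(num_kernels))

        # Predict a frame from the hidden state at the last time step.
        self.conv = nn.Conv2d(num_kernels, num_channels,
                              kernel_size=kernel_size, padding=padding)

    def forward(self, X):
        output = self.sequential(X)            # (batch, num_kernels, time, H, W)
        output = self.conv(output[:, :, -1])   # use the last time step only
        return torch.sigmoid(output)           # pixel values in [0, 1]
```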

Dataset we’ll use

We'll train our network on the Moving MNIST dataset, introduced by Srivastava et al., 2015 in Unsupervised Learning of Video Representations using LSTMs. It contains 10,000 samples, each consisting of 20 grayscale frames of size 64 × 64.
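
The dataset ships as a single NumPy array; below is a loading sketch, assuming the file name and (time, sample, height, width) layout of the original Toronto release:

```python
import numpy as np

# Moving MNIST as one array of shape (20, 10000, 64, 64),
# i.e. (time, sample, height, width); file name assumed
# from the original release.
data = np.load("mnist_test_seq.npy")
data = data.transpose(1, 0, 2, 3)    # -> (sample, time, height, width)
data = data[:, np.newaxis] / 255.0   # add channel dim, scale to [0, 1]
```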

Training our model

For training, we randomly choose 10 frames from among the 20 and ask our network to predict the 11th frame. We will use the Adam optimizer and binary cross-entropy loss: in the original dataset each pixel value is either 0 or 255, so predicting each pixel is a two-class classification problem.
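
A training-loop sketch; train_loader, num_epochs and the hyperparameter values below are placeholders for illustration, not the exact settings behind the results shown:

```python
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical hyperparameters for illustration.
model = Seq2Seq(num_channels=1, num_kernels=64, kernel_size=(3, 3),
                padding=(1, 1), activation="relu", frame_size=(64, 64),
                num_layers=3).to(device)

optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCELoss(reduction="sum")

for epoch in range(num_epochs):  # num_epochs assumed defined
    model.train()
    for inputs, target in train_loader:  # (B, 1, 10, 64, 64), (B, 1, 64, 64)
        optimizer.zero_grad()
        output = model(inputs.to(device))
        loss = criterion(output.flatten(), target.to(device).flatten())
        loss.backward()
        optimizer.step()
```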

Qualitative Analysis

Finally, to visualize what our model has learned so far, we will use the following approach on some test samples:

  • Give the first 10 frames as input and ask the model to predict the 11th.
  • Give frames 2 to 11 as input and ask the model to predict the 12th, and so on.
  • Append the model's output frames 11 to 20 into a gif, and compare this with frames 11 to 20 of the original video (a code sketch of this rollout follows the list).
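
A sketch of this sliding-window rollout (the model always sees ground-truth frames, as described above):

```python
def rollout(model, video, device):
    # video: one test sample of shape (1, 1, 20, 64, 64), values in [0, 1]
    model.eval()
    predictions = []
    with torch.no_grad():
        for t in range(10):
            window = video[:, :, t:t + 10].to(device)  # frames t+1 .. t+10
            predictions.append(model(window).cpu())    # predicts frame t+11
    # (1, 1, 10, 64, 64): predicted frames 11 .. 20
    return torch.stack(predictions, dim=2)
```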

Let us look at some of the real gifs.

And the corresponding outputs generated by our model.

Some Concluding Remarks

Ideally we would like our model to predict not just a frame, but a sequence of frames at a time.

However, this network can predict at most 2-3 frames after being fed a sequence: the error in each predicted frame is high enough that, if we feed predictions back in to predict further, the quality degrades drastically until the model outputs just black frames.

The solution is to reduce the error of a single predicted frame to the point where we can treat it as input and keep predicting further, incrementally. This is the future direction I would like to work on, by making this baseline model more powerful.

Thank you for your patient read if you have made it this far!

View the complete code at sladewinter/ConvLSTM (github.com)

If you find the code useful and decide to use it publicly and/or academically, kindly give credit.

Since this is my first blog post, some errors may have inadvertently crept in; my apologies in advance.

If you wish to suggest improvements or corrections, or have any doubts that need clarification, drop me a mail at sladewinter@gmail.com.
