This is a short tutorial on the following topics in Deep Learning: Neural Networks, Recurrent Neural Networks, Long Short Term Memory Networks, Variational Auto-encoders, and Conditional Variational Auto-encoders. The full code for this tutorial can be found here.

Neural Networks

Consider a deep neural network with two hidden layers. Here, $$\mathbf{x}_p \in \mathbb{R}^{N \times 1}$$ denotes dimension $$p=1,\ldots,P$$ of the input data $$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_P] \in \mathbb{R}^{N \times P}$$. The first hidden layer is given by

$\mathbf{h}_q^{(1)} = h(\sum_{p=1}^P a_{p q}^{(1)} \mathbf{x}_p + b_q^{(1)}),\ \ \ q = 1,\ldots,Q^{(1)},$

where $$\mathbf{h}_q^{(1)} \in \mathbb{R}^{N\times 1}$$ denotes dimension $$q=1,\ldots,Q^{(1)}$$ of the matrix

$\mathbf{H}^{(1)} = [\mathbf{h}_1^{(1)}, \ldots, \mathbf{h}_{Q^{(1)}}^{(1)}] \in \mathbb{R}^{N \times Q^{(1)}}.$

In matrix-vector notations we obtain $$\mathbf{H}^{(1)} = h(\mathbf{X} A^{(1)} + b^{(1)})$$ with $$A^{(1)} = [a_{pq}^{(1)}]_{p=1,\ldots,P,q=1,\ldots,Q^{(1)}}$$ being a $$P \times Q^{(1)}$$ matrix of multipliers and $$b^{(1)} = [b_1^{(1)},\ldots,b^{(1)}_{Q^{(1)}}] \in \mathbb{R}^{1 \times Q^{(1)}}$$ denoting the bias vector. Here, $$h(x)$$ is the activation function given explicitly by $$h(x) = \tanh(x)$$. Similarly, the second hidden layer $$\mathbf{H}^{(2)} \in \mathbb{R}^{N \times Q^{(2)}}$$ is given by $$\mathbf{H}^{(2)} = h(\mathbf{H}^{(1)} A^{(2)} + b^{(2)})$$ where $$A^{(2)}$$ is a $$Q^{(1)} \times Q^{(2)}$$ matrix of multipliers and $$b^{(2)} \in \mathbb{R}^{1 \times Q^{(2)}}$$ denotes the bias vector. The output of the neural network $$\mathbf{F} \in \mathbb{R}^{N \times R}$$ is given by $$\mathbf{F} = \mathbf{H}^{(2)} A + b$$ with $$A \in \mathbb{R}^{Q^{(2)} \times R}$$ and $$b \in \mathbb{R}^{1 \times R}$$. Moreover, we assume a Gaussian noise model

$\mathbf{y}_r = \mathbf{f}_r + \mathbf{\epsilon}_r, \ \ \ r = 1, \ldots, R,$

with mutually independent $$\mathbf{\epsilon}_r \sim \mathcal{N}(\mathbf{0}, \sigma_r^2 \mathbf{I})$$. Letting $$\mathbf{y}:= [\mathbf{y}_1;\ldots;\mathbf{y}_R] \in \mathbb{R}^{N R \times 1}$$, we obtain the following likelihood

$\mathbf{y} \sim \mathcal{N}(\mathbf{y} | \mathbf{f}, \Sigma \otimes \mathbf{I}),$

where $$\Sigma = \text{diag}(\sigma_1^2,\ldots,\sigma_R^2)$$ and $$\mathbf{f}:= [\mathbf{f}_1;\ldots;\mathbf{f}_R] \in \mathbb{R}^{N R \times 1}$$. One can train the parameters of the neural network by minimizing the resulting negative log likelihood.
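The forward pass and the negative log likelihood above can be sketched in a few lines of NumPy. The dimensions, random parameter values, and unit noise variances below are illustrative placeholders; in practice the parameters are fit by gradient-based minimization of the negative log likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: N data points, P inputs, two hidden widths, R outputs.
N, P, Q1, Q2, R = 100, 3, 8, 8, 2

# Parameters (random here; trained by minimizing the negative log likelihood).
A1, b1 = rng.normal(size=(P, Q1)), np.zeros((1, Q1))
A2, b2 = rng.normal(size=(Q1, Q2)), np.zeros((1, Q2))
A,  b  = rng.normal(size=(Q2, R)),  np.zeros((1, R))

X = rng.normal(size=(N, P))

# Forward pass: H1 = h(X A1 + b1), H2 = h(H1 A2 + b2), F = H2 A + b.
H1 = np.tanh(X @ A1 + b1)
H2 = np.tanh(H1 @ A2 + b2)
F  = H2 @ A + b                       # shape (N, R)

# Gaussian noise model with per-output variances sigma_r^2 (placeholders).
sigma2 = np.ones(R)
Y = F + rng.normal(size=(N, R)) * np.sqrt(sigma2)

# Negative log likelihood of y under N(f, Sigma ⊗ I).
nll = 0.5 * np.sum((Y - F) ** 2 / sigma2 + np.log(2 * np.pi * sigma2))
```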

Illustrative Example

The following figure depicts a neural network fit to a synthetic dataset generated by random perturbations of a simple one-dimensional function.

*Figure: Neural network fitting a simple one-dimensional function.*

Recurrent Neural Networks

Let us consider a time series dataset of the form $$\{\mathbf{y}_t: t=1,\ldots,T\}$$. We can employ a recurrent neural network to model the next value $$\hat{\mathbf{y}}_t$$ of the variable of interest as a function of its own lagged values $$\mathbf{y}_{t-1}$$ and $$\mathbf{y}_{t-2}$$; i.e., $$\hat{\mathbf{y}}_t = f(\mathbf{y}_{t-1}, \mathbf{y}_{t-2})$$. Here, $$\hat{\mathbf{y}}_t = \mathbf{h}_t V + c$$, $$\mathbf{h}_t = \tanh\left(\mathbf{h}_{t-1} W + \mathbf{y}_{t-1} U + b\right)$$, $$\mathbf{h}_{t-1} = \tanh\left(\mathbf{h}_{t-2} W + \mathbf{y}_{t-2} U + b\right)$$, and $$\mathbf{h}_{t-2} = \mathbf{0}$$. The parameters $$U, V, W, b,$$ and $$c$$ of the recurrent neural network can be trained by minimizing the mean squared error

$\mathcal{MSE} := \frac{1}{T-2}\sum_{t=3}^T |\mathbf{y}_t - \hat{\mathbf{y}}_t|^2.$
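A minimal NumPy sketch of this two-lag recurrence follows; the scalar sine-wave series and the randomly initialized (untrained) parameters are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: scalar observations, hidden state of width H.
T, H = 50, 16
y = np.sin(0.3 * np.arange(T)).reshape(T, 1)   # toy time series, shape (T, 1)

# Shared recurrent parameters (random here; trained by minimizing the MSE).
U = rng.normal(scale=0.1, size=(1, H))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(H, 1))
b, c = np.zeros((1, H)), np.zeros((1, 1))

def predict(y_lag2, y_lag1):
    """Two-step unrolled RNN: y_hat_t = f(y_{t-1}, y_{t-2})."""
    h = np.zeros((1, H))                        # h_{t-2} = 0
    h = np.tanh(h @ W + y_lag2 @ U + b)         # h_{t-1}
    h = np.tanh(h @ W + y_lag1 @ U + b)         # h_t
    return h @ V + c                            # y_hat_t = h_t V + c

# Predictions and MSE over t = 3, ..., T (0-indexed: t = 2, ..., T-1).
y_hat = np.vstack([predict(y[t - 2:t - 1], y[t - 1:t]) for t in range(2, T)])
mse = np.mean((y[2:] - y_hat) ** 2)
```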

Illustrative Example

The following figure depicts a recurrent neural network (with $5$ lags) learning and predicting the dynamics of a simple sine wave.

*Figure: Recurrent neural network predicting the dynamics of a simple sine wave.*

Long Short Term Memory (LSTM) Networks

A long short term memory (LSTM) network replaces the units $$\mathbf{h}_t = \tanh\left(\mathbf{h}_{t-1} W + \mathbf{y}_{t-1} U + b\right)$$ of a recurrent neural network with

$\mathbf{h}_{t} = \mathbf{o}_{t} \odot \tanh(\mathbf{s}_{t}),$

where

$\mathbf{o}_{t} = \sigma\left(\mathbf{h}_{t-1} W_o + \mathbf{y}_{t-1} U_o + b_o\right),$

is the output gate and

$\mathbf{s}_{t} = \mathbf{f}_{t} \odot \mathbf{s}_{t-1} + \mathbf{i}_t \odot \widetilde{\mathbf{s}}_t,$

is the cell state. Here,

$\widetilde{\mathbf{s}}_t = \tanh\left(\mathbf{h}_{t-1} W_s + \mathbf{y}_{t-1} U_s + b_s\right).$

Moreover,

$\mathbf{i}_t = \sigma\left(\mathbf{h}_{t-1} W_i + \mathbf{y}_{t-1} U_i + b_i\right)$

is the external input gate while

$\mathbf{f}_t = \sigma\left(\mathbf{h}_{t-1} W_f + \mathbf{y}_{t-1} U_f + b_f\right),$

is the forget gate.
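The gate equations above can be collected into a single update function. The following NumPy sketch uses illustrative sizes and random, untrained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: scalar input, hidden/cell state of width H.
H = 16

# One (W_g, U_g, b_g) triple per gate: output, input, cell candidate, forget.
params = {g: (rng.normal(scale=0.1, size=(H, H)),   # W_g
              rng.normal(scale=0.1, size=(1, H)),   # U_g
              np.zeros((1, H)))                     # b_g
          for g in "oisf"}

def lstm_step(h_prev, s_prev, y_prev):
    """One LSTM update following the gate equations above."""
    def gate(g, act):
        W, U, b = params[g]
        return act(h_prev @ W + y_prev @ U + b)
    o = gate("o", sigmoid)                  # output gate
    i = gate("i", sigmoid)                  # external input gate
    f = gate("f", sigmoid)                  # forget gate
    s_tilde = gate("s", np.tanh)            # candidate cell state
    s = f * s_prev + i * s_tilde            # cell state
    h = o * np.tanh(s)                      # hidden state
    return h, s

# Roll the cell over a short toy sequence.
h, s = np.zeros((1, H)), np.zeros((1, H))
for y_prev in np.sin(0.3 * np.arange(10)):
    h, s = lstm_step(h, s, np.array([[y_prev]]))
```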

Illustrative Example

The following figure depicts a long short term memory network (with $10$ lags) learning and predicting the dynamics of a simple sine wave.

*Figure: Long short term memory network predicting the dynamics of a simple sine wave.*

Variational Auto-encoders

Let us start with the prior assumption that

$p(\mathbf{z}) = \mathcal{N}\left(\mathbf{z}|\mathbf{0}, I\right),$

where $$\mathbf{z}$$ is a latent variable. Moreover, let us assume

$p(\mathbf{y}|\mathbf{z}) = \mathcal{N}\left(\mathbf{y} | \mu_2(\mathbf{z}), \Sigma_2(\mathbf{z})\right),$

where $$\mu_2(\mathbf{z})$$ and $$\Sigma_2(\mathbf{z})$$ are modeled as deep neural networks. Here, $$\Sigma_2(\mathbf{z})$$ is constrained to be a diagonal matrix. We are interested in minimizing the negative log likelihood $$-\log p(\mathbf{y})$$, where

$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{z})p(\mathbf{z})d\mathbf{z}.$

However, $$-\log p(\mathbf{y})$$ is not analytically tractable. To deal with this issue, one could employ a variational distribution

$q(\mathbf{z}|\mathbf{y})$

and compute the following Kullback-Leibler divergence; i.e.,

$\mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z}|\mathbf{y})\right] = \int \log \frac{q(\mathbf{z}|\mathbf{y})}{p(\mathbf{z}|\mathbf{y})}q(\mathbf{z}|\mathbf{y}) d\mathbf{z} = \int \left[\log q(\mathbf{z}|\mathbf{y}) - \log p(\mathbf{z}|\mathbf{y}) \right] q(\mathbf{z}|\mathbf{y}) d\mathbf{z}.$

Using the Bayes rule

$p(\mathbf{z}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{y})},$

one obtains

$\mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z}|\mathbf{y})\right] = \int \left[\log q(\mathbf{z}|\mathbf{y}) - \log p(\mathbf{y}|\mathbf{z}) - \log p(\mathbf{z}) + \log p(\mathbf{y}) \right] q(\mathbf{z}|\mathbf{y}) d\mathbf{z}.$

Therefore,

$\mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z}|\mathbf{y})\right] = \log p(\mathbf{y}) + \mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z})\right] - \int \log p(\mathbf{y}|\mathbf{z}) q(\mathbf{z}|\mathbf{y}) d\mathbf{z}.$

Rearranging the terms yields

$-\log p(\mathbf{y}) + \mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z}|\mathbf{y})\right] = -\int \log p(\mathbf{y}|\mathbf{z}) q(\mathbf{z}|\mathbf{y}) d\mathbf{z} + \mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z})\right].$

A variational auto-encoder proceeds by minimizing the two terms on the right hand side of the above equation. Since the Kullback-Leibler divergence is non-negative, the right hand side is an upper bound on $$-\log p(\mathbf{y})$$; its negative is known as the evidence lower bound (ELBO). Moreover, let us assume that

$q(\mathbf{z}|\mathbf{y}) = \mathcal{N}\left(\mathbf{z} | \mu_1(\mathbf{y}), \Sigma_1(\mathbf{y})\right),$

where $$\mu_1(\mathbf{y})$$ and $$\Sigma_1(\mathbf{y})$$ are modeled as deep neural networks. Here, $$\Sigma_1(\mathbf{y})$$ is constrained to be a diagonal matrix. One can use

$\mathbf{z} = \mu_1(\mathbf{y}) + \mathbf{\epsilon}\, \Sigma_1(\mathbf{y})^{1/2}, \ \ \ \mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, I),$

to generate samples from $$q(\mathbf{z}|\mathbf{y})$$; this is known as the reparameterization trick, and it keeps the sampling step differentiable with respect to the parameters of $$\mu_1$$ and $$\Sigma_1$$.
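For a single data point, the two terms minimized by the variational auto-encoder can be sketched as follows. The encoder outputs and the stand-in decoder below are illustrative placeholders (in practice both are neural networks), and the KL term uses the standard closed form for a diagonal Gaussian against $$\mathcal{N}(\mathbf{0}, I)$$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data point and encoder outputs for a 2-dimensional latent;
# mu1 and log_var1 would normally come from the encoder network q(z|y).
y = np.array([0.5, -1.0])
mu1, log_var1 = np.array([0.1, -0.2]), np.array([-1.0, -1.0])

def decoder(z):
    # Stand-in for the decoder network producing mu2(z) and log sigma2(z)^2.
    return np.tanh(z), np.zeros_like(z)

# Monte Carlo estimate of -E_q[log p(y|z)] via the reparameterization trick.
eps = rng.normal(size=2)
z = mu1 + eps * np.exp(0.5 * log_var1)
mu2, log_var2 = decoder(z)
rec = 0.5 * np.sum((y - mu2) ** 2 / np.exp(log_var2)
                   + log_var2 + np.log(2 * np.pi))

# Closed-form KL[q(z|y) || p(z)] for diagonal Gaussians with p(z) = N(0, I).
kl = 0.5 * np.sum(np.exp(log_var1) + mu1 ** 2 - 1.0 - log_var1)

neg_elbo = rec + kl   # the quantity minimized by the variational auto-encoder
```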

Illustrative Example

The following figure depicts the training data and the samples generated by a variational auto-encoder.

*Figure: Training data and samples generated by a variational auto-encoder.*

Conditional Variational Auto-encoders

Conditional variational auto-encoders, rather than making the assumption that

$p(\mathbf{z}) = \mathcal{N}\left(\mathbf{z}|\mathbf{0}, I\right),$

start by assuming that

$p(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\mathbf{z}|\mu_0(\mathbf{x}), \Sigma_0(\mathbf{x})\right),$

where $$\mu_0(\mathbf{x})$$ and $$\Sigma_0(\mathbf{x})$$ are modeled as deep neural networks. Here, $$\Sigma_0(\mathbf{x})$$ is constrained to be a diagonal matrix.
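With this conditional prior, the KL term of the objective becomes a divergence between two diagonal Gaussians, which also has a standard closed form. A small sketch follows; the helper name is ours, and the numeric inputs are placeholders for encoder and prior network outputs.

```python
import numpy as np

def kl_diag_gaussians(mu1, log_var1, mu0, log_var0):
    """KL[ N(mu1, diag(exp(log_var1))) || N(mu0, diag(exp(log_var0))) ].

    This replaces the KL[q(z|y) || N(0, I)] term of the plain VAE once the
    prior becomes p(z|x) = N(mu0(x), Sigma0(x)).
    """
    return 0.5 * np.sum(
        log_var0 - log_var1
        + (np.exp(log_var1) + (mu1 - mu0) ** 2) / np.exp(log_var0)
        - 1.0
    )

# Sanity check: the divergence of a distribution from itself is zero.
mu, lv = np.array([0.3, -0.7]), np.array([-0.5, 0.2])
assert abs(kl_diag_gaussians(mu, lv, mu, lv)) < 1e-12
```

Setting $$\mu_0 = \mathbf{0}$$ and $$\Sigma_0 = I$$ recovers the KL term of the plain variational auto-encoder.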

Illustrative Example

The following figure depicts the training data and the samples generated by a conditional variational auto-encoder.

*Figure: Training data and samples generated by a conditional variational auto-encoder.*

All data and codes are publicly available on GitHub.