This is a short tutorial on the following topics in Deep Learning: Neural Networks, Recurrent Neural Networks, Long Short Term Memory Networks, Variational Auto-encoders, and Conditional Variational Auto-encoders. The full code for this tutorial can be found here.

Neural Networks

Consider a deep neural network with two hidden layers. Here, $$\mathbf{x}_p \in \mathbb{R}^{N \times 1}$$ denotes dimension $$p=1,\ldots,P$$ of the input data $$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_P] \in \mathbb{R}^{N \times P}$$. The first hidden layer is given by

$\mathbf{h}_q^{(1)} = h(\sum_{p=1}^P a_{p q}^{(1)} \mathbf{x}_p + b_q^{(1)}),\ \ \ q = 1,\ldots,Q^{(1)},$

where $$\mathbf{h}_q^{(1)} \in \mathbb{R}^{N\times 1}$$ denotes dimension $$q=1,\ldots,Q^{(1)}$$ of the matrix

$\mathbf{H}^{(1)} = [\mathbf{h}_1^{(1)}, \ldots, \mathbf{h}_{Q^{(1)}}^{(1)}] \in \mathbb{R}^{N \times Q^{(1)}}.$

In matrix-vector notations we obtain $$\mathbf{H}^{(1)} = h(\mathbf{X} A^{(1)} + b^{(1)})$$ with $$A^{(1)} = [a_{pq}^{(1)}]_{p=1,\ldots,P,q=1,\ldots,Q^{(1)}}$$ being a $$P \times Q^{(1)}$$ matrix of multipliers and $$b^{(1)} = [b_1^{(1)},\ldots,b^{(1)}_{Q^{(1)}}] \in \mathbb{R}^{1 \times Q^{(1)}}$$ denoting the bias vector. Here, $$h(x)$$ is the activation function given explicitly by $$h(x) = \tanh(x)$$. Similarly, the second hidden layer $$\mathbf{H}^{(2)} \in \mathbb{R}^{N \times Q^{(2)}}$$ is given by $$\mathbf{H}^{(2)} = h(\mathbf{H}^{(1)} A^{(2)} + b^{(2)})$$ where $$A^{(2)}$$ is a $$Q^{(1)} \times Q^{(2)}$$ matrix of multipliers and $$b^{(2)} \in \mathbb{R}^{1 \times Q^{(2)}}$$ denotes the bias vector. The output of the neural network $$\mathbf{F} \in \mathbb{R}^{N \times R}$$ is given by $$\mathbf{F} = \mathbf{H}^{(2)} A + b$$ with $$A \in \mathbb{R}^{Q^{(2)} \times R}$$ and $$b \in \mathbb{R}^{1 \times R}$$. Moreover, we assume a Gaussian noise model

$\mathbf{y}_r = \mathbf{f}_r + \mathbf{\epsilon}_r, \ \ \ r = 1, \ldots, R,$

with mutually independent $$\mathbf{\epsilon}_r \sim \mathcal{N}(\mathbf{0}, \sigma_r^2 \mathbf{I})$$. Letting $$\mathbf{y}:= [\mathbf{y}_1;\ldots;\mathbf{y}_R] \in \mathbb{R}^{N R \times 1}$$, we obtain the following likelihood

$\mathbf{y} \sim \mathcal{N}(\mathbf{y} | \mathbf{f}, \Sigma \otimes \mathbf{I}),$

where $$\Sigma = \text{diag}(\sigma_1^2,\ldots,\sigma_R^2)$$ and $$\mathbf{f}:= [\mathbf{f}_1;\ldots;\mathbf{f}_R] \in \mathbb{R}^{N R \times 1}$$. One can train the parameters of the neural network by minimizing the resulting negative log likelihood.
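The forward pass and the negative log likelihood above can be sketched in a few lines of NumPy. The dimensions, random parameter values, and unit noise variances below are illustrative placeholders; in practice the parameters are fit by gradient-based minimization of the negative log likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: N data points, P inputs, two hidden widths, R outputs.
N, P, Q1, Q2, R = 100, 3, 8, 8, 2

# Parameters (random here; trained by minimizing the negative log likelihood).
A1, b1 = rng.normal(size=(P, Q1)), np.zeros((1, Q1))
A2, b2 = rng.normal(size=(Q1, Q2)), np.zeros((1, Q2))
A,  b  = rng.normal(size=(Q2, R)),  np.zeros((1, R))

X = rng.normal(size=(N, P))

# Forward pass: H1 = h(X A1 + b1), H2 = h(H1 A2 + b2), F = H2 A + b.
H1 = np.tanh(X @ A1 + b1)
H2 = np.tanh(H1 @ A2 + b2)
F  = H2 @ A + b                       # shape (N, R)

# Gaussian noise model with per-output variances sigma_r^2 (placeholders).
sigma2 = np.ones(R)
Y = F + rng.normal(size=(N, R)) * np.sqrt(sigma2)

# Negative log likelihood of y under N(f, Sigma ⊗ I).
nll = 0.5 * np.sum((Y - F) ** 2 / sigma2 + np.log(2 * np.pi * sigma2))
```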

Illustrative Example

The following figure depicts a neural network fit to a synthetic dataset generated by random perturbations of a simple one-dimensional function.

*Figure: Neural network fitting a simple one-dimensional function.*

Recurrent Neural Networks

Let us consider a time series dataset of the form $$\{\mathbf{y}_t: t=1,\ldots,T\}$$. We can employ a recurrent neural network to model the next value $$\hat{\mathbf{y}}_t$$ of the variable of interest as a function of its own lagged values $$\mathbf{y}_{t-1}$$ and $$\mathbf{y}_{t-2}$$; i.e., $$\hat{\mathbf{y}}_t = f(\mathbf{y}_{t-1}, \mathbf{y}_{t-2})$$. Here, $$\hat{\mathbf{y}}_t = \mathbf{h}_t V + c$$, $$\mathbf{h}_t = \tanh\left(\mathbf{h}_{t-1} W + \mathbf{y}_{t-1} U + b\right)$$, $$\mathbf{h}_{t-1} = \tanh\left(\mathbf{h}_{t-2} W + \mathbf{y}_{t-2} U + b\right)$$, and $$\mathbf{h}_{t-2} = \mathbf{0}$$. The parameters $$U, V, W, b,$$ and $$c$$ of the recurrent neural network can be trained by minimizing the mean squared error

$\mathcal{MSE} := \frac{1}{T-2}\sum_{t=3}^T |\mathbf{y}_t - \hat{\mathbf{y}}_t|^2.$
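A minimal NumPy sketch of this two-lag recurrence follows; the scalar sine-wave series and the randomly initialized (untrained) parameters are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: scalar observations, hidden state of width H.
T, H = 50, 16
y = np.sin(0.3 * np.arange(T)).reshape(T, 1)   # toy time series, shape (T, 1)

# Shared recurrent parameters (random here; trained by minimizing the MSE).
U = rng.normal(scale=0.1, size=(1, H))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(H, 1))
b, c = np.zeros((1, H)), np.zeros((1, 1))

def predict(y_lag2, y_lag1):
    """Two-step unrolled RNN: y_hat_t = f(y_{t-1}, y_{t-2})."""
    h = np.zeros((1, H))                        # h_{t-2} = 0
    h = np.tanh(h @ W + y_lag2 @ U + b)         # h_{t-1}
    h = np.tanh(h @ W + y_lag1 @ U + b)         # h_t
    return h @ V + c                            # y_hat_t = h_t V + c

# Predictions and MSE over t = 3, ..., T (0-indexed: t = 2, ..., T-1).
y_hat = np.vstack([predict(y[t - 2:t - 1], y[t - 1:t]) for t in range(2, T)])
mse = np.mean((y[2:] - y_hat) ** 2)
```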

Illustrative Example

The following figure depicts a recurrent neural network (with $5$ lags) learning and predicting the dynamics of a simple sine wave.

*Figure: Recurrent neural network predicting the dynamics of a simple sine wave.*

Long Short Term Memory (LSTM) Networks

A long short term memory (LSTM) network replaces the units $$\mathbf{h}_t = \tanh\left(\mathbf{h}_{t-1} W + \mathbf{y}_{t-1} U + b\right)$$ of a recurrent neural network with

$\mathbf{h}_{t} = \mathbf{o}_{t} \odot \tanh(\mathbf{s}_{t}),$

where

$\mathbf{o}_{t} = \sigma\left(\mathbf{h}_{t-1} W_o + \mathbf{y}_{t-1} U_o + b_o\right),$

is the output gate and

$\mathbf{s}_{t} = \mathbf{f}_{t} \odot \mathbf{s}_{t-1} + \mathbf{i}_t \odot \widetilde{\mathbf{s}}_t,$

is the cell state. Here,

$\widetilde{\mathbf{s}}_t = \tanh\left(\mathbf{h}_{t-1} W_s + \mathbf{y}_{t-1} U_s + b_s\right).$

Moreover,

$\mathbf{i}_t = \sigma\left(\mathbf{h}_{t-1} W_i + \mathbf{y}_{t-1} U_i + b_i\right)$

is the external input gate while

$\mathbf{f}_t = \sigma\left(\mathbf{h}_{t-1} W_f + \mathbf{y}_{t-1} U_f + b_f\right),$

is the forget gate.
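The gate equations above can be collected into a single update function. The following NumPy sketch uses illustrative sizes and random, untrained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: scalar input, hidden/cell state of width H.
H = 16

# One (W_g, U_g, b_g) triple per gate: output, input, cell candidate, forget.
params = {g: (rng.normal(scale=0.1, size=(H, H)),   # W_g
              rng.normal(scale=0.1, size=(1, H)),   # U_g
              np.zeros((1, H)))                     # b_g
          for g in "oisf"}

def lstm_step(h_prev, s_prev, y_prev):
    """One LSTM update following the gate equations above."""
    def gate(g, act):
        W, U, b = params[g]
        return act(h_prev @ W + y_prev @ U + b)
    o = gate("o", sigmoid)                  # output gate
    i = gate("i", sigmoid)                  # external input gate
    f = gate("f", sigmoid)                  # forget gate
    s_tilde = gate("s", np.tanh)            # candidate cell state
    s = f * s_prev + i * s_tilde            # cell state
    h = o * np.tanh(s)                      # hidden state
    return h, s

# Roll the cell over a short toy sequence.
h, s = np.zeros((1, H)), np.zeros((1, H))
for y_prev in np.sin(0.3 * np.arange(10)):
    h, s = lstm_step(h, s, np.array([[y_prev]]))
```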

Illustrative Example

The following figure depicts a long short term memory network (with $10$ lags) learning and predicting the dynamics of a simple sine wave.

*Figure: Long short term memory network predicting the dynamics of a simple sine wave.*

Variational Auto-encoders

Let us start with the prior assumption that

$p(\mathbf{z}) = \mathcal{N}\left(\mathbf{z}|\mathbf{0}, I\right),$

where $$\mathbf{z}$$ is a latent variable. Moreover, let us assume

$p(\mathbf{y}|\mathbf{z}) = \mathcal{N}\left(\mathbf{y} | \mu_2(\mathbf{z}), \Sigma_2(\mathbf{z})\right),$

where $$\mu_2(\mathbf{z})$$ and $$\Sigma_2(\mathbf{z})$$ are modeled as deep neural networks. Here, $$\Sigma_2(\mathbf{z})$$ is constrained to be a diagonal matrix. We are interested in minimizing the negative log likelihood $$-\log p(\mathbf{y})$$, where

$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{z})p(\mathbf{z})d\mathbf{z}.$

However, $$-\log p(\mathbf{y})$$ is not analytically tractable. To deal with this issue, one could employ a variational distribution

$q(\mathbf{z}|\mathbf{y})$

and compute the following Kullback-Leibler divergence; i.e.,

$\mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z}|\mathbf{y})\right] = \int \log \frac{q(\mathbf{z}|\mathbf{y})}{p(\mathbf{z}|\mathbf{y})}q(\mathbf{z}|\mathbf{y}) d\mathbf{z} = \int \left[\log q(\mathbf{z}|\mathbf{y}) - \log p(\mathbf{z}|\mathbf{y}) \right] q(\mathbf{z}|\mathbf{y}) d\mathbf{z}.$

Using the Bayes rule

$p(\mathbf{z}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{y})},$

one obtains

$\mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z}|\mathbf{y})\right] = \int \left[\log q(\mathbf{z}|\mathbf{y}) - \log p(\mathbf{y}|\mathbf{z}) - \log p(\mathbf{z}) + \log p(\mathbf{y}) \right] q(\mathbf{z}|\mathbf{y}) d\mathbf{z}.$

Therefore,

$\mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z}|\mathbf{y})\right] = \log p(\mathbf{y}) + \mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z})\right] - \int \log p(\mathbf{y}|\mathbf{z}) q(\mathbf{z}|\mathbf{y}) d\mathbf{z}.$

Rearranging the terms yields

$-\log p(\mathbf{y}) + \mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z}|\mathbf{y})\right] = -\int \log p(\mathbf{y}|\mathbf{z}) q(\mathbf{z}|\mathbf{y}) d\mathbf{z} + \mathbb{KL}\left[q(\mathbf{z}|\mathbf{y})\ ||\ p(\mathbf{z})\right].$

A variational auto-encoder proceeds by minimizing the two terms on the right hand side of the above equation. Since the Kullback-Leibler divergence is non-negative, the right hand side is an upper bound on $$-\log p(\mathbf{y})$$; its negative is known as the evidence lower bound (ELBO). Moreover, let us assume that

$q(\mathbf{z}|\mathbf{y}) = \mathcal{N}\left(\mathbf{z} | \mu_1(\mathbf{y}), \Sigma_1(\mathbf{y})\right),$

where $$\mu_1(\mathbf{y})$$ and $$\Sigma_1(\mathbf{y})$$ are modeled as deep neural networks. Here, $$\Sigma_1(\mathbf{y})$$ is constrained to be a diagonal matrix. One can use

$\mathbf{z} = \mu_1(\mathbf{y}) + \mathbf{\epsilon}\, \Sigma_1(\mathbf{y})^{1/2}, \ \ \ \mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, I),$

to generate samples from $$q(\mathbf{z}|\mathbf{y})$$; this is known as the reparameterization trick, and it keeps the sampling step differentiable with respect to the parameters of $$\mu_1$$ and $$\Sigma_1$$.
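For a single data point, the two terms minimized by the variational auto-encoder can be sketched as follows. The encoder outputs and the stand-in decoder below are illustrative placeholders (in practice both are neural networks), and the KL term uses the standard closed form for a diagonal Gaussian against $$\mathcal{N}(\mathbf{0}, I)$$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data point and encoder outputs for a 2-dimensional latent;
# mu1 and log_var1 would normally come from the encoder network q(z|y).
y = np.array([0.5, -1.0])
mu1, log_var1 = np.array([0.1, -0.2]), np.array([-1.0, -1.0])

def decoder(z):
    # Stand-in for the decoder network producing mu2(z) and log sigma2(z)^2.
    return np.tanh(z), np.zeros_like(z)

# Monte Carlo estimate of -E_q[log p(y|z)] via the reparameterization trick.
eps = rng.normal(size=2)
z = mu1 + eps * np.exp(0.5 * log_var1)
mu2, log_var2 = decoder(z)
rec = 0.5 * np.sum((y - mu2) ** 2 / np.exp(log_var2)
                   + log_var2 + np.log(2 * np.pi))

# Closed-form KL[q(z|y) || p(z)] for diagonal Gaussians with p(z) = N(0, I).
kl = 0.5 * np.sum(np.exp(log_var1) + mu1 ** 2 - 1.0 - log_var1)

neg_elbo = rec + kl   # the quantity minimized by the variational auto-encoder
```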

Illustrative Example

The following figure depicts the training data and the samples generated by a variational auto-encoder.

*Figure: Training data and samples generated by a variational auto-encoder.*

Conditional Variational Auto-encoders

Conditional variational auto-encoders, rather than making the assumption that

$p(\mathbf{z}) = \mathcal{N}\left(\mathbf{z}|\mathbf{0}, I\right),$

start by assuming that

$p(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\mathbf{z}|\mu_0(\mathbf{x}), \Sigma_0(\mathbf{x})\right),$

where $$\mu_0(\mathbf{x})$$ and $$\Sigma_0(\mathbf{x})$$ are modeled as deep neural networks. Here, $$\Sigma_0(\mathbf{x})$$ is constrained to be a diagonal matrix.
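With this conditional prior, the KL term of the objective becomes a divergence between two diagonal Gaussians, which also has a standard closed form. A small sketch follows; the helper name is ours, and the numeric inputs are placeholders for encoder and prior network outputs.

```python
import numpy as np

def kl_diag_gaussians(mu1, log_var1, mu0, log_var0):
    """KL[ N(mu1, diag(exp(log_var1))) || N(mu0, diag(exp(log_var0))) ].

    This replaces the KL[q(z|y) || N(0, I)] term of the plain VAE once the
    prior becomes p(z|x) = N(mu0(x), Sigma0(x)).
    """
    return 0.5 * np.sum(
        log_var0 - log_var1
        + (np.exp(log_var1) + (mu1 - mu0) ** 2) / np.exp(log_var0)
        - 1.0
    )

# Sanity check: the divergence of a distribution from itself is zero.
mu, lv = np.array([0.3, -0.7]), np.array([-0.5, 0.2])
assert abs(kl_diag_gaussians(mu, lv, mu, lv)) < 1e-12
```

Setting $$\mu_0 = \mathbf{0}$$ and $$\Sigma_0 = I$$ recovers the KL term of the plain variational auto-encoder.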

Illustrative Example

The following figure depicts the training data and the samples generated by a conditional variational auto-encoder.

*Figure: Training data and samples generated by a conditional variational auto-encoder.*

All data and codes are publicly available on GitHub.