Abstract
This is a short tutorial on the following topics in Deep Learning: Neural Networks, Recurrent Neural Networks, Long Short Term Memory Networks, Variational Auto-encoders, and Conditional Variational Auto-encoders. The full code for this tutorial can be found here.
Neural Networks
Consider the following deep neural network with two hidden layers.
Here, $D$ denotes the dimension of the input data $x \in \mathbb{R}^D$. The first hidden layer is given by

$$z^1_i = \sigma\!\left(\sum_{j=1}^{D} W^1_{ij}\, x_j + b^1_i\right), \qquad i = 1, \ldots, N_1,$$

where $N_1$ denotes the row dimension of the matrix $W^1 \in \mathbb{R}^{N_1 \times D}$. In matrix-vector notation we obtain

$$z^1 = \sigma\!\left(W^1 x + b^1\right),$$

with $W^1$ being a matrix of multipliers and $b^1 \in \mathbb{R}^{N_1}$ denoting the bias vector. Here, $\sigma(\cdot)$ is the activation function, applied element-wise; a common choice is the hyperbolic tangent $\sigma(u) = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$. Similarly, the second hidden layer is given by

$$z^2 = \sigma\!\left(W^2 z^1 + b^2\right),$$

where $W^2 \in \mathbb{R}^{N_2 \times N_1}$ is a matrix of multipliers and $b^2 \in \mathbb{R}^{N_2}$ denotes the bias vector. The output of the neural network is given by

$$f(x) = W^3 z^2 + b^3,$$

with $W^3 \in \mathbb{R}^{1 \times N_2}$ and $b^3 \in \mathbb{R}$. Moreover, we assume a Gaussian noise model

$$y_n = f(x_n) + \epsilon_n, \qquad n = 1, \ldots, N,$$

with mutually independent $\epsilon_n \sim \mathcal{N}(0, \sigma_\epsilon^2)$. Letting $\mathbf{y} := (y_1, \ldots, y_N)^\top$ and $\mathbf{f} := (f(x_1), \ldots, f(x_N))^\top$, we obtain the following likelihood

$$p(\mathbf{y} \mid \mathbf{X}) = \mathcal{N}\!\left(\mathbf{f}, \boldsymbol{\Sigma}\right),$$

where $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_\epsilon^2, \ldots, \sigma_\epsilon^2)$ and $\mathbf{X} := (x_1, \ldots, x_N)$. One can train the parameters of the neural network by minimizing the resulting negative log likelihood.
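To make the training procedure concrete, here is a minimal PyTorch sketch of the construction above; the layer widths, the $\tanh$ activation, the noise standard deviation, the learning rate, and the synthetic sine data are illustrative assumptions rather than the tutorial's exact setup.

```python
import math

import torch
import torch.nn as nn


# Two-hidden-layer network f(x) from the equations above.
class TwoLayerNet(nn.Module):
    def __init__(self, d_in=1, n1=50, n2=50):
        super().__init__()
        self.hidden1 = nn.Linear(d_in, n1)  # W^1, b^1
        self.hidden2 = nn.Linear(n1, n2)    # W^2, b^2
        self.output = nn.Linear(n2, 1)      # W^3, b^3

    def forward(self, x):
        z1 = torch.tanh(self.hidden1(x))
        z2 = torch.tanh(self.hidden2(z1))
        return self.output(z2)


def gaussian_nll(y, f, noise_std=0.1):
    # Negative log likelihood of y_n = f(x_n) + eps_n with eps_n ~ N(0, noise_std^2).
    return (0.5 * ((y - f) ** 2).sum() / noise_std**2
            + 0.5 * y.numel() * math.log(2 * math.pi * noise_std**2))


# Fit the network to synthetic 1-D data by minimizing the negative log likelihood.
x = torch.linspace(-1.0, 1.0, 200).unsqueeze(-1)
y = torch.sin(3.0 * x) + 0.1 * torch.randn_like(x)

model = TwoLayerNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    optimizer.zero_grad()
    loss = gaussian_nll(y, model(x))
    loss.backward()
    optimizer.step()
```

With a fixed noise variance, minimizing this negative log likelihood is equivalent to minimizing the mean squared error up to an additive constant.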
Illustrative Example
The following figure depicts a neural network fit to a synthetic dataset generated by random perturbations of a simple one dimensional function.
Neural network fitting a simple one dimensional function.
Recurrent Neural Networks
Let us consider a time series dataset of the form $\{x_1, x_2, \ldots, x_T\}$. We can employ the following recurrent neural network

$$h_l = \tanh\!\left(W h_{l-1} + U x_{t-L+l-1} + b\right), \quad l = 1, \ldots, L, \qquad h_0 = 0,$$

$$\hat{x}_t = V h_L + c,$$

to model the next value of the variable of interest as a function of its own lagged values $x_{t-L}, \ldots, x_{t-1}$ and the hidden states $h_1, \ldots, h_L$; i.e., $\hat{x}_t \approx x_t$. Here, $L$ is the number of lags, $h_l \in \mathbb{R}^{N_h}$, $W \in \mathbb{R}^{N_h \times N_h}$, $U \in \mathbb{R}^{N_h \times 1}$, $V \in \mathbb{R}^{1 \times N_h}$, $b \in \mathbb{R}^{N_h}$, and $c \in \mathbb{R}$. The parameters $W$, $U$, $V$, $b$, and $c$ of the recurrent neural network can be trained by minimizing the mean squared error

$$\mathcal{L} = \frac{1}{T - L} \sum_{t = L + 1}^{T} \left(\hat{x}_t - x_t\right)^2.$$
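As a concrete (hypothetical) realization, the following PyTorch sketch builds lagged windows of a sine wave and trains a small recurrent network on the mean squared error; the hidden size, learning rate, and synthetic data are illustrative choices rather than the tutorial's exact configuration.

```python
import torch
import torch.nn as nn


# A plain recurrent network that consumes a window of L lagged values
# x_{t-L}, ..., x_{t-1} and predicts x_t.
class LaggedRNN(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, window):            # window: (batch, L, 1)
        _, h_last = self.rnn(window)      # h_last: (1, batch, hidden)
        return self.readout(h_last[-1])   # (batch, 1)


def make_windows(series, lags):
    # Stack overlapping windows of `lags` past values together with their targets.
    xs = torch.stack([series[i:i + lags] for i in range(len(series) - lags)])
    ys = series[lags:]
    return xs.unsqueeze(-1), ys.unsqueeze(-1)


# Mean squared error training on a sine wave, mirroring the example below (5 lags).
t = torch.linspace(0.0, 20.0, 500)
series = torch.sin(t)
X, Y = make_windows(series, lags=5)

model = LaggedRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    optimizer.zero_grad()
    loss = ((model(X) - Y) ** 2).mean()
    loss.backward()
    optimizer.step()
```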
Illustrative Example
The following figure depicts a recurrent neural network (with $5$ lags) learning and predicting the dynamics of a simple sine wave.
Recurrent neural network predicting the dynamics of a simple sine wave.
Long Short Term Memory (LSTM) Networks
A long short term memory (LSTM) network replaces the units of a recurrent neural network with

$$h_t = o_t \odot \tanh(c_t),$$

where

$$o_t = \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right)$$

is the output gate and

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

is the cell state. Here,

$$\tilde{c}_t = \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right)$$

is the candidate cell update. Moreover,

$$i_t = \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right)$$

is the external input gate while

$$f_t = \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right)$$

is the forget gate. Throughout, $\sigma(\cdot)$ denotes the logistic sigmoid function and $\odot$ denotes element-wise multiplication.
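The gate equations above can be written out directly. The sketch below is a plain PyTorch transcription in which the weight shapes, the random initialization, and the demo input are illustrative assumptions; in practice one would typically rely on a library implementation such as `torch.nn.LSTM`.

```python
import torch


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update written out gate by gate, mirroring the equations above.

    p holds weight matrices W_*, U_* and bias vectors b_* for the external
    input gate (i), forget gate (f), output gate (o), and cell candidate (g).
    """
    i = torch.sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # external input gate
    f = torch.sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    o = torch.sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate
    g = torch.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])     # candidate cell update
    c_t = f * c_prev + i * g       # cell state
    h_t = o * torch.tanh(c_t)      # unit output (hidden state)
    return h_t, c_t


# Tiny demo with a 1-dimensional input and a 4-dimensional hidden state.
d_x, d_h = 1, 4
params = {}
for gate in ["i", "f", "o", "g"]:
    params[f"W_{gate}"] = torch.randn(d_h, d_x) * 0.1
    params[f"U_{gate}"] = torch.randn(d_h, d_h) * 0.1
    params[f"b_{gate}"] = torch.zeros(d_h)

h, c = torch.zeros(d_h), torch.zeros(d_h)
for x_t in torch.sin(torch.linspace(0.0, 6.28, 50)):
    h, c = lstm_step(x_t.unsqueeze(0), h, c, params)
```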
Illustrative Example
The following figure depicts a long short term memory network (with $10$ lags) learning and predicting the dynamics of a simple sine wave.
Long short term memory network predicting the dynamics of a simple sine wave.
Variational Auto-encoders
Let us start with the prior assumption that

$$p(z) = \mathcal{N}(0, I),$$

where $z$ is a latent variable. Moreover, let us assume

$$p(x \mid z) = \mathcal{N}\!\left(\mu(z), \Sigma(z)\right),$$

where $\mu(z)$ and $\Sigma(z)$ are modeled as deep neural networks. Here, $\Sigma(z)$ is constrained to be a diagonal matrix. We are interested in minimizing the negative log likelihood $-\log p(x)$, where

$$p(x) = \int p(x \mid z)\, p(z)\, dz.$$

However, $p(x)$ is not analytically tractable. To deal with this issue, one could employ a variational distribution $q(z \mid x)$ and compute the following Kullback-Leibler divergence; i.e.,

$$\mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z \mid x)\right) = \mathbb{E}_{q(z \mid x)}\!\left[\log q(z \mid x) - \log p(z \mid x)\right].$$

Using the Bayes rule

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)},$$

one obtains

$$\mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z \mid x)\right) = \mathbb{E}_{q(z \mid x)}\!\left[\log q(z \mid x) - \log p(x \mid z) - \log p(z)\right] + \log p(x).$$

Therefore,

$$\log p(x) = \mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z \mid x)\right) + \mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right] - \mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z)\right).$$

Rearranging the terms yields

$$-\log p(x) + \mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z \mid x)\right) = -\mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right] + \mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z)\right).$$

A variational auto-encoder proceeds by minimizing the terms on the right hand side of the above equation; since the Kullback-Leibler divergence is non-negative, this amounts to minimizing an upper bound on the negative log likelihood. Moreover, let us assume that

$$q(z \mid x) = \mathcal{N}\!\left(\mu(x), \Sigma(x)\right),$$

where $\mu(x)$ and $\Sigma(x)$ are modeled as deep neural networks. Here, $\Sigma(x)$ is constrained to be a diagonal matrix. One can use

$$z \sim p(z), \qquad x \sim p(x \mid z),$$

to generate samples from $p(x)$.
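A minimal PyTorch sketch of this construction is given below; the encoder/decoder architectures, the fixed-variance Gaussian decoder (which reduces the reconstruction term to a squared error), and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class VAE(nn.Module):
    # Encoder q(z|x) = N(mu(x), diag(sigma^2(x))) and decoder mean for p(x|z).
    def __init__(self, d_x=2, d_z=2, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, d_z)
        self.enc_logvar = nn.Linear(hidden, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, hidden), nn.Tanh(), nn.Linear(hidden, d_x))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar


def vae_loss(x, x_rec, mu, logvar):
    # Negative evidence lower bound: reconstruction term (fixed-variance Gaussian
    # decoder, i.e. squared error) plus KL(q(z|x) || p(z)) with p(z) = N(0, I).
    rec = ((x - x_rec) ** 2).sum()
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum()
    return rec + kl


# Generating new samples: draw z ~ N(0, I) and push it through the decoder.
model = VAE()
with torch.no_grad():
    new_samples = model.dec(torch.randn(100, 2))
```

The `kl` term uses the closed-form Kullback-Leibler divergence between $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and $\mathcal{N}(0, I)$, which is what makes the diagonal-covariance constraint convenient.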
Illustrative Example
The following figure depicts the training data and the samples generated by a variational auto-encoder.
Training data and samples generated by a variational auto-encoder.
Conditional Variational Auto-encoders
Conditional variational auto-encoders, rather than making the assumption that

$$p(x \mid z) = \mathcal{N}\!\left(\mu(z), \Sigma(z)\right),$$

start by assuming that

$$p(x \mid z, y) = \mathcal{N}\!\left(\mu(z, y), \Sigma(z, y)\right),$$

where $y$ is the conditioning variable and $\mu(z, y)$ and $\Sigma(z, y)$ are modeled as deep neural networks. Here, $\Sigma(z, y)$ is constrained to be a diagonal matrix.
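The sketch below adapts the variational auto-encoder code above by concatenating the conditioning variable $y$ to the inputs of both the encoder and the decoder, which is one common construction; the dimensions and architecture are again illustrative assumptions.

```python
import torch
import torch.nn as nn


class CVAE(nn.Module):
    # Same structure as the VAE above, but both the encoder and the decoder
    # receive a conditioning variable y (e.g., a label or covariate).
    def __init__(self, d_x=2, d_y=1, d_z=2, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x + d_y, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, d_z)
        self.enc_logvar = nn.Linear(hidden, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z + d_y, hidden), nn.Tanh(), nn.Linear(hidden, d_x))

    def forward(self, x, y):
        h = self.enc(torch.cat([x, y], dim=-1))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(torch.cat([z, y], dim=-1)), mu, logvar


# Sampling conditioned on y: draw z ~ N(0, I) and decode the pair (z, y).
model = CVAE()
with torch.no_grad():
    y = torch.zeros(100, 1)  # chosen conditioning value
    new_samples = model.dec(torch.cat([torch.randn(100, 2), y], dim=-1))
```

The loss is the same negative evidence lower bound as before, with every distribution additionally conditioned on $y$.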
Illustrative Example
The following figure depicts the training data and the samples generated by a conditional variational auto-encoder.
Training data and samples generated by a conditional variational auto-encoder.
All data and code are publicly available on GitHub.