Derivatives of ESN reservoirs with respect to input weights, reservoir weights, and leakage

Overview

I recently worked on a project aimed at reducing redundancy and multicollinearity in echo state neural networks (ESNs) by linearly decorrelating the features in their reservoir. In the process, I solved for the derivatives of the reservoir states with respect to the input weights, the reservoir weights, and the leakage rate. These derivatives are useful because they make it possible to impose desired dynamics onto the reservoir. My results add to previous work by Jaeger et al. (2007) and Palangi et al. (2013).


A quick intro to echo state neural networks

ESNs are recurrent neural networks that consist of an input layer, a reservoir, and an output layer. Quite unconventionally, ESNs are trained with ridge regression rather than backpropagation, which leaves all parameters except those in the output layer at their random initial values. Nevertheless, ESNs tend to outperform LSTMs and GRUs in both training time and prediction accuracy when applied to chaotic time series (Shahi et al., 2022).

The reservoir states encode information from past states and are updated iteratively as follows

\[ \mathbf{r_t} = (1-\alpha) \mathbf{r_{t-1}} + \alpha \mathrm{tanh}(W_{\mathrm{res}} \mathbf{r_{t-1}} + W_{\mathrm{in}} \mathbf{x_t}) \]

where \(\mathbf{x_t}\), \(W_\mathrm{in}\), \(W_\mathrm{res}\), and \( \alpha \) denote the input vector, input weights, reservoir weights, and leakage rate, respectively. During training, the reservoir state vectors are collected, concatenated, and used in ridge regression to determine the optimal output weights and biases that map the reservoir states to the desired output

\[ \mathbf{y_t} = W_{\mathrm{out}} \mathbf{r_t} + \mathbf{b_{\mathrm{out}}} \]

Lukoševicius et al. (2006) provide a more detailed description of the training procedure.
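
For concreteness, here is a minimal NumPy sketch of the leaky state update and the ridge-regression readout. The function names (`update_state`, `fit_readout`), the toy dimensions, and the regularization strength are my own illustration, not taken from any particular ESN implementation.

```python
import numpy as np

def update_state(r_prev, x_t, W_in, W_res, alpha):
    """Leaky reservoir update: r_t = (1 - alpha) r_{t-1} + alpha tanh(W_res r_{t-1} + W_in x_t)."""
    return (1 - alpha) * r_prev + alpha * np.tanh(W_res @ r_prev + W_in @ x_t)

def fit_readout(R, Y, ridge=1e-6):
    """Ridge regression of targets Y (T x n_out) on collected states R (T x N).
    Returns output weights W_out (n_out x N) and bias b_out (n_out,)."""
    R_aug = np.hstack([R, np.ones((R.shape[0], 1))])      # append a bias column
    A = R_aug.T @ R_aug + ridge * np.eye(R_aug.shape[1])
    W_aug = np.linalg.solve(A, R_aug.T @ Y).T             # shape (n_out, N + 1)
    return W_aug[:, :-1], W_aug[:, -1]

# Toy run: a random reservoir driven by a scalar input series (no spectral-radius tuning).
rng = np.random.default_rng(0)
N, D, T, alpha = 50, 1, 200, 0.3
W_in = rng.normal(scale=0.5, size=(N, D))
W_res = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))
x = rng.normal(size=(T, D))

r, states = np.zeros(N), []
for t in range(T):
    r = update_state(r, x[t], W_in, W_res, alpha)
    states.append(r)
R = np.array(states)

W_out, b_out = fit_readout(R, x)          # e.g. reconstruct the input as a sanity check
y_hat = R @ W_out.T + b_out
```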

Echo state network derivatives

The derivatives below are based on ESNs that use the update rules summarized above, and I have verified the results numerically. The derivatives with respect to the reservoir and input weights are vectorized (denoted \(\mathrm{vec}\)) so as to avoid working with third-order tensors. Let \(\otimes\) denote the Kronecker product.

The derivative (Jacobian) of the element-wise hyperbolic tangent with respect to its pre-activation input is

\[ H_{t} = \mathrm{diag}\left(\mathbf{1} - \mathrm{tanh}(W_{\mathrm{res}} \mathbf{r_{t-1}} + W_{\mathrm{in}} \mathbf{x_t})^2\right) \]
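
Continuing the sketch above, \(H_t\) can be formed directly from the pre-activation; `tanh_jacobian` is an illustrative helper name.

```python
def tanh_jacobian(r_prev, x_t, W_in, W_res):
    """Diagonal Jacobian H_t = diag(1 - tanh(W_res r_{t-1} + W_in x_t)^2)."""
    z = W_res @ r_prev + W_in @ x_t       # pre-activation at time t
    return np.diag(1.0 - np.tanh(z) ** 2)
```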

The derivative of the reservoir states with respect to the reservoir weights is

\[ \footnotesize \frac{\partial \mathbf{r_0}}{\partial \mathrm{vec} W_{\mathrm{res}}} = 0 \]

\[ \footnotesize \frac{\partial \mathbf{r_t}}{\partial \mathrm{vec} W_{\mathrm{res}}} = (1-\alpha) \frac{\partial \mathbf{r_{t-1}}}{\partial \mathrm{vec} W_{\mathrm{res}}} + \alpha \left(\mathbf{r_{t-1}}^\top \otimes H_{t} + H_{t} W_{\mathrm{res}} \frac{\partial \mathbf{r_{t-1}}}{\partial \mathrm{vec} W_{\mathrm{res}} }\right) \]
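
A sketch of this recursion, building on the helpers above. It assumes column-major (Fortran-order) vectorization of \(W_{\mathrm{res}}\), under which `np.kron(r_prev, H)` with a 1-D `r_prev` realizes the \(\mathbf{r_{t-1}}^\top \otimes H_t\) term; the function name is mine.

```python
def step_dr_dWres(dr_prev, r_prev, x_t, W_in, W_res, alpha):
    """One step of dr_t/dvec(W_res); dr_prev has shape (N, N*N) and is all zeros at t = 0."""
    H = tanh_jacobian(r_prev, x_t, W_in, W_res)
    return (1 - alpha) * dr_prev + alpha * (np.kron(r_prev, H) + H @ W_res @ dr_prev)
```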

The derivative of the reservoir states with respect to the input weights is

\[ \footnotesize \frac{\partial \mathbf{r_0}}{\partial \mathrm{vec} W_{\mathrm{in}}} = 0 \]

\[ \footnotesize \frac{\partial \mathbf{r_t}}{\partial \mathrm{vec} W_{\mathrm{in}}} = (1-\alpha) \frac{\partial \mathbf{r_{t-1}}}{\partial \mathrm{vec} W_{\mathrm{in}}} + \alpha \left( \mathbf{x_t}^\top \otimes H_t + H_t W_{\mathrm{res}} \frac{\partial \mathbf{r_{t-1}}}{\partial \mathrm{vec} W_{\mathrm{in}}} \right) \]
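
The analogous sketch for the input weights, under the same column-major vec convention; here `np.kron(x_t, H)` realizes the \(\mathbf{x_t}^\top \otimes H_t\) term.

```python
def step_dr_dWin(dr_prev, r_prev, x_t, W_in, W_res, alpha):
    """One step of dr_t/dvec(W_in); dr_prev has shape (N, N*D) and is all zeros at t = 0."""
    H = tanh_jacobian(r_prev, x_t, W_in, W_res)
    return (1 - alpha) * dr_prev + alpha * (np.kron(x_t, H) + H @ W_res @ dr_prev)
```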

And finally, the derivative of the reservoir states with respect to the leakage rate is

\[ \footnotesize \frac{\partial \mathbf{r_0}}{\partial \alpha} = 0 \]

\[ \footnotesize \frac{\partial \mathbf{r_t}}{\partial \alpha} = (1-\alpha) \frac{\partial \mathbf{r_{t-1}}}{\partial \alpha} - \mathbf{r_{t-1}} + \mathrm{tanh}(W_{\mathrm{res}} \mathbf{r_{t-1}} + W_{\mathrm{in}} \mathbf{x_t}) + \alpha H_t W_{\mathrm{res}} \frac{\partial \mathbf{r_{t-1}}}{\partial \alpha} \]
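
To close, here is a sketch of the leakage recursion together with a central finite-difference check in the spirit of the numerical verification mentioned above. It reuses `update_state` and the toy variables (`N`, `x`, `W_in`, `W_res`, `alpha`) from the first code block.

```python
def step_dr_dalpha(dr_prev, r_prev, x_t, W_in, W_res, alpha):
    """One step of dr_t/dalpha; dr_prev is a length-N vector and is all zeros at t = 0."""
    z = W_res @ r_prev + W_in @ x_t
    H = np.diag(1.0 - np.tanh(z) ** 2)
    return (1 - alpha) * dr_prev - r_prev + np.tanh(z) + alpha * (H @ W_res @ dr_prev)

# Central finite differences on the state after a few steps, as a sanity check.
def run_states(a, steps=20):
    r = np.zeros(N)
    for t in range(steps):
        r = update_state(r, x[t], W_in, W_res, a)
    return r

eps = 1e-6
numeric = (run_states(alpha + eps) - run_states(alpha - eps)) / (2 * eps)

r, dr = np.zeros(N), np.zeros(N)
for t in range(20):
    dr = step_dr_dalpha(dr, r, x[t], W_in, W_res, alpha)   # uses r_{t-1}
    r = update_state(r, x[t], W_in, W_res, alpha)

print(np.max(np.abs(numeric - dr)))   # should be on the order of 1e-8 or smaller
```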