Derivatives of ESN reservoirs with respect to input weights, reservoir weights, and leakage
Overview
I recently worked on a project that aimed at minimizing redundancies and multicollinearity in echo state neural networks (ESNs) by linearly decorrelating the features in their reservoir. In the process, I needed the derivatives of the reservoir states with respect to the input weights, the reservoir weights, and the leakage, which I collect below.
Click here to skip ahead to the derivatives.
A quick intro to echo state neural networks
ESNs are recurrent neural networks that consist of an input layer, a reservoir, and an output layer. Quite unconventionally, ESNs are trained with ridge regression rather than backpropagation, which leaves all parameters except those in the output layer randomly initialized. Nevertheless, ESNs tend to outperform LSTMs and GRUs in both training time and prediction accuracy when applied to chaotic time series (Shahi et al., 2022).
The reservoir states encode information from past states and are updated iteratively as follows
\[ \mathbf{r_t} = (1-\alpha) \mathbf{r_{t-1}} + \alpha \mathrm{tanh}(W_{\mathrm{res}} \mathbf{r_{t-1}} + W_{\mathrm{in}} \mathbf{x_t}) \]
where \(\mathbf{x_t}\), \(W_\mathrm{in}\), \(W_\mathrm{res}\), and \( \alpha \) denote the input vector, input weights, reservoir weights, and leakage, respectively. During training, the reservoir state vectors are collected, concatenated, and used in ridge regression to determine the optimal output weights and biases that map the reservoir states to the desired output
\[ \mathbf{y_t} = W_{\mathrm{out}} \mathbf{r_t} + \mathbf{b_{\mathrm{out}}} \]
Lukoševičius et al. (2006) provide a more detailed description of the training procedure.
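To make the update rule and the readout concrete, here is a minimal NumPy sketch. The dimensions, weight scalings, and function names (esn_step, collect_states, fit_readout) are arbitrary choices for illustration rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes and hyperparameters, chosen only for illustration
N, D, alpha, ridge = 100, 3, 0.3, 1e-6
W_in = rng.normal(scale=0.5, size=(N, D))
W_res = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))

def esn_step(r_prev, x, W_in, W_res, alpha):
    """Leaky reservoir update: r_t = (1 - a) r_{t-1} + a tanh(W_res r_{t-1} + W_in x_t)."""
    return (1 - alpha) * r_prev + alpha * np.tanh(W_res @ r_prev + W_in @ x)

def collect_states(X, W_in, W_res, alpha):
    """Run the reservoir over an input sequence X (T x D) and stack the states (T x N)."""
    r = np.zeros(N)
    states = []
    for x in X:
        r = esn_step(r, x, W_in, W_res, alpha)
        states.append(r)
    return np.stack(states)

def fit_readout(R, Y, ridge):
    """Ridge regression for W_out and b_out mapping reservoir states R to targets Y."""
    R1 = np.hstack([R, np.ones((len(R), 1))])  # append a bias column
    W = np.linalg.solve(R1.T @ R1 + ridge * np.eye(R1.shape[1]), R1.T @ Y)
    return W[:-1].T, W[-1]                     # W_out, b_out

# Example: fit the readout for one-step-ahead prediction of a toy input sequence
X = rng.normal(size=(200, D))
R = collect_states(X, W_in, W_res, alpha)
W_out, b_out = fit_readout(R[:-1], X[1:], ridge)  # y_t = W_out r_t + b_out
```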
Echo state network derivatives
The derivatives are based on ESNs that use the update rules summarized above, and I have verified the following results numerically. The derivatives with respect to the reservoir and input weights are vectorized (denoted \(\mathrm{vec}\), stacking columns) so as to avoid working with third-order tensors. Let \(\otimes\) denote the Kronecker product and define
\[ H_{t} = \mathrm{diag}\left(\mathbf{1} - \mathrm{tanh}(W_{\mathrm{in}} \mathbf{x_t} + W_{\mathrm{res}} \mathbf{r_{t-1}})^2\right) \]
\[ \footnotesize \frac{\partial \mathbf{r_0}}{\partial \mathrm{vec} W_{\mathrm{res}}} = 0 \]
\[ \footnotesize \frac{\partial \mathbf{r_t}}{\partial \mathrm{vec} W_{\mathrm{res}}} = (1-\alpha) \frac{\partial \mathbf{r_{t-1}}}{\partial \mathrm{vec} W_{\mathrm{res}}} + \alpha \left(\mathbf{r_{t-1}}^\top \otimes H_{t} + H_{t} W_{\mathrm{res}} \frac{\partial \mathbf{r_{t-1}}}{\partial \mathrm{vec} W_{\mathrm{res}} }\right) \]
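Continuing the sketch above, the recursion for \( \partial \mathbf{r_t} / \partial \mathrm{vec} W_{\mathrm{res}} \) can be propagated alongside the reservoir state. The sketch assumes the column-wise \(\mathrm{vec}\) convention, and the function name is again just illustrative.

```python
def reservoir_jacobian_wrt_Wres(X, W_in, W_res, alpha):
    """Propagate dr_t / dvec(W_res) with the recursion above (column-wise vec)."""
    N = W_res.shape[0]
    r = np.zeros(N)
    J = np.zeros((N, N * N))  # dr_0 / dvec(W_res) = 0
    for x in X:
        pre = W_res @ r + W_in @ x            # uses r_{t-1}
        H = np.diag(1.0 - np.tanh(pre) ** 2)  # H_t
        # r_{t-1}^T (x) H_t  +  H_t W_res dr_{t-1}/dvec(W_res)
        J = (1 - alpha) * J + alpha * (np.kron(r[None, :], H) + H @ W_res @ J)
        r = (1 - alpha) * r + alpha * np.tanh(pre)
    return r, J
```

The same column-wise convention has to be used if the Jacobian is checked entry-wise against finite differences on \(W_{\mathrm{res}}\).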
\[ \footnotesize \frac{\partial \mathbf{r_0}}{\partial \mathrm{vec} W_{\mathrm{in}}} = 0 \]
\[ \footnotesize \frac{\partial \mathbf{r_t}}{\partial \mathrm{vec} W_{\mathrm{in}}} = (1-\alpha) \frac{\partial \mathbf{r_{t-1}}}{\partial \mathrm{vec} W_{\mathrm{in}}} + \alpha \left( \mathbf{x_t}^\top \otimes H_t + H_t W_{\mathrm{res}} \frac{\partial \mathbf{r_{t-1}}}{\partial \mathrm{vec} W_{\mathrm{in}}} \right) \]
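The corresponding recursion for \( \partial \mathbf{r_t} / \partial \mathrm{vec} W_{\mathrm{in}} \) only swaps the Kronecker term; a sketch under the same assumptions:

```python
def reservoir_jacobian_wrt_Win(X, W_in, W_res, alpha):
    """Propagate dr_t / dvec(W_in); x_t^T replaces r_{t-1}^T in the Kronecker term."""
    N, D = W_in.shape
    r = np.zeros(N)
    J = np.zeros((N, N * D))  # dr_0 / dvec(W_in) = 0
    for x in X:
        pre = W_res @ r + W_in @ x
        H = np.diag(1.0 - np.tanh(pre) ** 2)
        J = (1 - alpha) * J + alpha * (np.kron(x[None, :], H) + H @ W_res @ J)
        r = (1 - alpha) * r + alpha * np.tanh(pre)
    return r, J
```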
\[ \footnotesize \frac{\partial \mathbf{r_0}}{\partial \alpha} = 0 \]
\[ \footnotesize \frac{\partial \mathbf{r_t}}{\partial \alpha} = (1-\alpha) \frac{\partial \mathbf{r_{t-1}}}{\partial \alpha} - \mathbf{r_{t-1}} + \mathrm{tanh}(W_{\mathrm{res}} \mathbf{r_{t-1}} + W_{\mathrm{in}} \mathbf{x_t}) + \alpha H_t W_{\mathrm{res}} \frac{\partial \mathbf{r_{t-1}}}{\partial \alpha} \]
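As a rough illustration of the numerical verification mentioned above, the leakage derivative can be propagated alongside the state and compared against a central finite difference. The toy sequence, step size, and tolerance below are ad hoc; collect_states and the weights come from the first sketch.

```python
def reservoir_grad_wrt_alpha(X, W_in, W_res, alpha):
    """Propagate dr_t / dalpha alongside the reservoir state."""
    N = W_res.shape[0]
    r, g = np.zeros(N), np.zeros(N)  # dr_0 / dalpha = 0
    for x in X:
        pre = W_res @ r + W_in @ x
        H = np.diag(1.0 - np.tanh(pre) ** 2)
        g = (1 - alpha) * g - r + np.tanh(pre) + alpha * (H @ W_res @ g)
        r = (1 - alpha) * r + alpha * np.tanh(pre)
    return r, g

# Central-difference check on a toy sequence (step size and tolerance are ad hoc)
X_check = rng.normal(size=(50, D))
eps = 1e-6
r_plus = collect_states(X_check, W_in, W_res, alpha + eps)[-1]
r_minus = collect_states(X_check, W_in, W_res, alpha - eps)[-1]
_, g = reservoir_grad_wrt_alpha(X_check, W_in, W_res, alpha)
assert np.allclose(g, (r_plus - r_minus) / (2 * eps), atol=1e-5)
```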