LSTM Networks (Long Short-Term Memory Networks)#

Introduction#

Humans don't start thinking from scratch every second. As you read this document, you understand each word based on your understanding of the previous words. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks cannot do this, which seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at each point in a movie. It is unclear how a traditional neural network could use its reasoning about earlier events in the film to inform later ones.

Recurrent neural networks (RNNs) address this problem. They are networks with loops that allow information to persist.

Rolled-up diagram of a piece of a recurrent neural network

Image taken from Understanding LSTM Networks

The figure above shows a rolled-up diagram of a piece of a recurrent neural network. The neuron \(A\) looks at some input \(x_t\) and outputs a value \(h_t\). A loop allows information to pass from one step of the network to the next.

These loops make recurrent neural networks seem mysterious.

However, if you think about it a bit more, it turns out they are not so different from an ordinary neural network.

A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

Unrolled diagram of a piece of a recurrent neural network

Image taken from Understanding LSTM Networks

This chain-like nature reveals that recurrent neural networks are intimately related to sequences.

They are the natural neural network architecture to use for data that is sequential in nature.

From a statistical point of view they can be seen as sequence models closely related to stochastic processes, more precisely to hidden Markov models.
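
As a concrete illustration of the unrolled view, the following minimal NumPy sketch (with illustrative sizes and weight names W, U, b that are not from the original text) applies the same recurrence \(h_t = \tanh(W x_t + U h_{t-1} + b)\) at every time step:

import numpy as np

rng = np.random.default_rng(0)

n, p, T = 8, 4, 10                      # illustrative input size, hidden size, sequence length

W = rng.normal(scale=0.1, size=(p, n))  # input-to-hidden weights
U = rng.normal(scale=0.1, size=(p, p))  # hidden-to-hidden weights (the loop)
b = np.zeros(p)

x = rng.normal(size=(T, n))             # a toy input sequence
h = np.zeros(p)                         # initial hidden state

hidden_states = []
for t in range(T):                      # the same "copy" of the network at every step
    h = np.tanh(W @ x[t] + U @ h + b)
    hidden_states.append(h)

print(np.stack(hidden_states).shape)    # (10, 4): one hidden state per time step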

In recent years, there has been incredible success applying RNNs to a variety of problems:

  1. speech recognition,

  2. language modeling,

  3. translation,

  4. image captioning,

  5. … The list goes on.

I will leave the discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks http://karpathy.github.io/2015/05/21/rnn-effectiveness/ . But they really are quite amazing.

Essential to these successes is the use of LSTM networks, a very special kind of recurrent neural network that works, for many tasks, much better than the standard version. Almost all of the exciting results based on recurrent neural networks are achieved with them.

The problem of long-term dependencies#

Short-term dependencies#

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using earlier video frames to inform the understanding of the current frame.

If RNNs could do this, they would be extremely useful. But can they?

It depends.

Sometimes we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones.

If we are trying to predict the last word in

the clouds are in the…,

we don't need any further context; it is pretty obvious that the next word will be

sky.

In such cases, where the gap between the relevant information and the place it is needed is small, RNNs can learn to use past information.

RNN: short-term dependencies. \(h_3\) depends on \(x_0, x_1\)

Image taken from Understanding LSTM Networks

But there are also cases where we need more context.

Longer-term context#

Consider the problem of trying to predict the last word in the text

"I grew up in France … I speak fluent French."

Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back.

It is entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

RNN: long-term dependencies. \(h_{t+1}\) depends on \(x_0, x_1\)

Image taken from Understanding LSTM Networks

Neural networks struggle with long-term dependencies. In theory, RNNs are absolutely capable of handling such "long-term dependencies."

A human could carefully pick parameters to solve toy problems of this form.

Sadly, in practice, RNNs do not seem to be able to learn them.

The vanishing or exploding gradient problem#

The problem was explored in depth by Hochreiter (1991) and Bengio, et al. (1994), who found some fundamental reasons why it might be difficult.

The fundamental problem shown by Hochreiter is that the gradient tends to vanish or explode because backpropagation through an RNN repeatedly multiplies by the same weight matrices and by the derivative of the activation function (the sigmoid) applied to the same affine transformations. See reference 2 for more details.
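
A minimal numerical sketch of this effect (an illustrative scalar recurrence, not the author's exact analysis): backpropagation through time multiplies the gradient by one nearly identical factor per step, so the product shrinks exponentially when that factor is below 1 and blows up when it is above 1.

def backprop_factor(w, T):
    # Product of T identical per-step factors w * sigma'(z); the logistic
    # sigmoid's derivative is at most 0.25.
    sigmoid_deriv_max = 0.25
    return (w * sigmoid_deriv_max) ** T

for w in [1.0, 3.0, 8.0]:               # illustrative recurrent weights
    print(w, [backprop_factor(w, T) for T in (1, 10, 50)])
# w = 1.0 or 3.0: the per-step factor is below 1, so the gradient vanishes with depth.
# w = 8.0: the per-step factor is 2, so the gradient explodes with depth.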

Fortunately, LSTM networks do not have this problem!

LSTM networks#

Long short-term memory networks, usually just called "LSTMs," are a special kind of RNN capable of learning long-term dependencies.

They were introduced by Hochreiter and Schmidhuber (1997), and were refined and popularized by many people in later work.

They work tremendously well on a large variety of problems and are now widely used.

LSTM networks are explicitly designed to deal with the long-term dependency problem.

  • Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network.

In standard RNNs, this repeating module has a very simple structure, such as a single \(\tanh\) layer.

The repeating module in a standard RNN, with a single layer.

Image taken from Understanding LSTM Networks

LSTM networks also have this chain-like structure, but the repeating module has a different structure. Instead of a single neural network layer, there are four, interacting in a very special way.

The repeating module in an LSTM network, with four interacting layers.

Image taken from Understanding LSTM Networks

Don't worry about the details of what is going on. We will walk through the LSTM diagram step by step later. For now, let's just try to get comfortable with the notation we will be using.

Notation for the objects in LSTM networks

Image taken from Understanding LSTM Networks

In the diagram above, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, such as vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes that its content is copied and the copies go to different locations.

The core idea behind LSTM networks#

The key to LSTM networks is the cell state, the horizontal line running across the top of the diagram.

The cell state is like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It is very easy for information to flow along it unchanged.

Conveyor belt in LSTM networks

Image taken from Understanding LSTM Networks

An LSTM network has the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a neural network layer with a sigmoid activation and a pointwise multiplication operation.

Gate in an LSTM network

Image taken from Understanding LSTM Networks

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through.

  • A value of zero means "let nothing through,"

  • while a value of one means "let everything through."

An LSTM network has three of these gates, to protect and control the cell state.
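
A tiny sketch of what a single gate does, using illustrative arrays that are not from the original text: the sigmoid produces per-component factors in (0, 1), and the pointwise multiplication scales each component of the incoming vector by its factor.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

gate_preactivation = np.array([-6.0, 0.0, 6.0])  # illustrative pre-activations
incoming = np.array([10.0, 10.0, 10.0])          # vector to be filtered

gate = sigmoid(gate_preactivation)               # factors close to 0, 0.5, 1
print(gate)                                      # ~[0.0025, 0.5, 0.9975]
print(gate * incoming)                           # ~[0.02, 5.0, 9.98]: blocked, halved, passed through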

Walking through an LSTM network, step by step#

Forget gate#

The first step in our LSTM network is to decide what information we are going to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at \( h_{t − 1} \) and \( x_t \), and outputs a number between 0 and 1 for each number in the cell state \( C_{t − 1} \). A 1 represents "keep this completely," while a 0 represents "get rid of this completely."

Let's go back to our example of a language model trying to predict the next word based on all the previous ones.

In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

Forget gate

Image taken from Understanding LSTM Networks
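
In the notation of the mathematical section below, the forget gate computes

\[ f_t = \sigma(W_f\cdot [h_{t-1}, x_t] + b_f). \]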

Input gate#

The next step is to decide what new information we are going to store in the cell state. This has two parts.

  • First, a sigmoid layer called the input gate decides which values we will update.

  • Then, a \(\tanh\) layer creates a vector of new candidate values, \(\tilde{C}_t\), that could be added to the state. In the next step, we will combine these two to create an update to the state.

In the example of our language model, we want to add the gender of the new subject to the cell state, to replace the old one we are forgetting.

Input gate

Image taken from Understanding LSTM Networks
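
In the notation of the mathematical section below, these two parts are

\[\begin{split} \begin{align*} i_t &= \sigma(W_i\cdot [h_{t-1}, x_t] + b_i)\\ \tilde{C}_t &= \tanh(W_c\cdot [h_{t-1}, x_t] + b_c). \end{align*} \end{split}\]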

Updating the cell state#

It is now time to update the old cell state, \( C_{t − 1} \), into the new cell state \( C_t \). The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by \( f_t \), forgetting the things we decided to forget earlier. Then we add \( i_t \odot \tilde{C}_t \). These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we would actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.

Updating the cell state

Image taken from Understanding LSTM Networks
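
In the notation of the mathematical section below, this update is

\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t. \]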

Output#

Finally, we need to decide what we are going to output. This output will be based on our cell state, but will be a filtered version of it. First, we run a sigmoid layer that decides which parts of the cell state we are going to output. Then, we put the cell state through \(\tanh\) (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it has just seen a subject, it might want to output information relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so that we know which form a verb should be conjugated into if that is what follows.

Output

Image taken from Understanding LSTM Networks
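
In the notation of the mathematical section below, the output step is

\[\begin{split} \begin{align*} o_t &= \sigma(W_o\cdot [h_{t-1}, x_t] + b_o)\\ h_t &= o_t \odot \tanh(C_t). \end{align*} \end{split}\]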

Mathematical structure of an LSTM network#

In an LSTM network, each gate is an artificial neural network. An LSTM layer has several blocks, and each block has two outputs: \( C_t \), the updated cell state, and \( h_t \), the updated hidden state.

\( C_t \) is updated as follows

\[\begin{split} \begin{align*} f_t &= \sigma(W_f\cdot [h_{t-1}, x_t] + b_f)\\ i_t &= \sigma(W_i\cdot [h_{t-1}, x_t] + b_i)\\ \tilde{C}_t &= \tanh(W_c\cdot [h_{t-1}, x_t] + b_c)\\ C_t &= f_t \odot C_{t-1} + i_t\odot \tilde{C}_t,\\ \end{align*} \end{split}\]

where \( \odot \) is the Hadamard product, i.e. the element-wise product of vectors.

On the other hand, \( h_t \) is updated, once \( C_t \) has been updated, as follows

\[\begin{split} \begin{align*} o_t &= \sigma(W_o\cdot [h_{t-1}, x_t] + b_o)\\ h_t &= o_t \odot \tanh(C_t).\\ \end{align*} \end{split}\]

These are the forward-pass equations of an LSTM layer; differentiating through them yields the gradients of the layer's parameters.
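
As a compact check of these formulas, here is a minimal NumPy sketch of one forward step of a single LSTM block, written directly from the equations above; the sizes and random weights are illustrative only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    # One LSTM step following the equations above; each W_* has shape (p, p + n).
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_c @ z + b_c)         # candidate values
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
n, p = 8, 4                                  # illustrative input and output sizes
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(p, p + n)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(p) for _ in range(4))

h, C = np.zeros(p), np.zeros(p)
h, C = lstm_step(rng.normal(size=n), h, C, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o)
print(h.shape, C.shape)                      # (4,) (4,)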

Computing the number of parameters of an LSTM layer#

Suppose the input size is \(n\) and the output size of the LSTM layer is \(p\).

Then \(h\), the output, has size \(p\), and \(x\) has size \(n\). Therefore, for the operations above to be well defined, we need that

  1. \(W_f\), \(W_c\), \(W_i\) and \(W_o\) have size \(p\times (p+n)\). That is, \(4(p\cdot p + p\cdot n)\) weights in total.

  2. If biases are included (which is the usual case), there are \(4p\) additional parameters, so in total

\[ \text{Number of parameters of an LSTM layer} = 4(p^2+ pn +p) \]
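
A quick sanity check of this formula against Keras, with the illustrative sizes \(n = 8\) and \(p = 4\) (calling the layer once builds its weights, and count_params() then reports the total):

import tensorflow as tf

n, p = 8, 4
layer = tf.keras.layers.LSTM(p)
_ = layer(tf.random.normal([1, 10, n]))   # build the weights with a dummy batch

print(layer.count_params())               # 208
print(4 * (p**2 + p*n + p))               # 208, matching the formula above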

The following image shows the typical structure of a gate in the LSTM model. Both the sigmoid and the tanh activations are illustrated; keep in mind that in the branch that produces the candidate update the activation is tanh.

Image: Alvaro Montenegro

Finally, the following image shows the blueprint of an LSTM layer.

Image: Alvaro Montenegro

Computation in a Keras LSTM layer#

Every Keras layer expects a batch of data. In the case of an LSTM layer, Keras expects 3D tensors of the following form

  • [batch_size, time_step, feature]

For example, Keras interprets an input tensor of size [32, 10, 8] as

  • batch_size = 32, i.e. 32 examples.

  • time_step = 10, i.e. input sequences of length 10. For example, in a time series this is the size of the input window.

  • feature = 8, i.e. each input element has size 8. For example, in a univariate time series, feature = 1. In a multivariate series with 8 variables, feature = 8. In natural language models, feature is the size of the representation of each token, usually the embedding size.

The output size of the layer corresponds to the size of the hidden state. For example, if the hidden state has size 4, the output of the layer has shape [batch_size, 4].

Example#

import tensorflow as tf
inputs = tf.random.normal([32, 10, 8])
lstm = tf.keras.layers.LSTM(4)  # lstm is a layer with output size 4.
output = lstm(inputs)
print(output.shape)
(32, 4)

Retrieving the whole sequence of hidden states#

In some cases it is necessary to have the value of the hidden state for every element of the input sequence. This sequence has shape [batch_size, time_step, output_size]. In the following example:

  • with return_sequences=True, whole_seq_output contains the hidden state at every time step,

  • with return_state=True, final_memory_state is the last value of the hidden state,

  • final_carry_state is the last value of the conveyor belt (the cell state).

lstm = tf.keras.layers.LSTM(4, return_sequences=True, return_state=True)
whole_seq_output, final_memory_state, final_carry_state = lstm(inputs)
print(whole_seq_output.shape)
print(final_memory_state.shape)
print(final_carry_state.shape)
(32, 10, 4)
(32, 4)
(32, 4)
whole_seq_output[:,-1,:] == final_memory_state  # the last step of the full sequence equals the final hidden state
<tf.Tensor: shape=(32, 4), dtype=bool, numpy=
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])>

Extracting the layer weights#
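
The help(lstm) cell below prints the full documentation of the layer. As a sketch of the weight extraction itself (assuming the lstm layer built above, with feature = 8 and units = 4), get_weights() on a Keras LSTM typically returns three arrays: the input kernel, the recurrent kernel, and the bias, with the four gate blocks concatenated along the last axis.

kernel, recurrent_kernel, bias = lstm.get_weights()

print(kernel.shape)             # (8, 16): input-to-gates weights, feature x 4*units
print(recurrent_kernel.shape)   # (4, 16): hidden-to-gates weights, units x 4*units
print(bias.shape)               # (16,):   one bias per gate and unit
print(kernel.size + recurrent_kernel.size + bias.size)   # 208 = 4*(p^2 + pn + p)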

help(lstm)
Help on LSTM in module keras.layers.rnn.lstm object:

class LSTM(keras.layers.rnn.dropout_rnn_cell_mixin.DropoutRNNCellMixin, keras.layers.rnn.base_rnn.RNN, keras.engine.base_layer.BaseRandomLayer)
 |  LSTM(units, activation='tanh', recurrent_activation='sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False, go_backwards=False, stateful=False, time_major=False, unroll=False, **kwargs)
 |  
 |  Long Short-Term Memory layer - Hochreiter 1997.
 |  
 |  See [the Keras RNN API guide](https://www.tensorflow.org/guide/keras/rnn)
 |  for details about the usage of RNN API.
 |  
 |  Based on available runtime hardware and constraints, this layer
 |  will choose different implementations (cuDNN-based or pure-TensorFlow)
 |  to maximize the performance. If a GPU is available and all
 |  the arguments to the layer meet the requirement of the cuDNN kernel
 |  (see below for details), the layer will use a fast cuDNN implementation.
 |  
 |  The requirements to use the cuDNN implementation are:
 |  
 |  1. `activation` == `tanh`
 |  2. `recurrent_activation` == `sigmoid`
 |  3. `recurrent_dropout` == 0
 |  4. `unroll` is `False`
 |  5. `use_bias` is `True`
 |  6. Inputs, if use masking, are strictly right-padded.
 |  7. Eager execution is enabled in the outermost context.
 |  
 |  For example:
 |  
 |  >>> inputs = tf.random.normal([32, 10, 8])
 |  >>> lstm = tf.keras.layers.LSTM(4)
 |  >>> output = lstm(inputs)
 |  >>> print(output.shape)
 |  (32, 4)
 |  >>> lstm = tf.keras.layers.LSTM(4, return_sequences=True, return_state=True)
 |  >>> whole_seq_output, final_memory_state, final_carry_state = lstm(inputs)
 |  >>> print(whole_seq_output.shape)
 |  (32, 10, 4)
 |  >>> print(final_memory_state.shape)
 |  (32, 4)
 |  >>> print(final_carry_state.shape)
 |  (32, 4)
 |  
 |  Args:
 |    units: Positive integer, dimensionality of the output space.
 |    activation: Activation function to use.
 |      Default: hyperbolic tangent (`tanh`). If you pass `None`, no activation
 |      is applied (ie. "linear" activation: `a(x) = x`).
 |    recurrent_activation: Activation function to use for the recurrent step.
 |      Default: sigmoid (`sigmoid`). If you pass `None`, no activation is
 |      applied (ie. "linear" activation: `a(x) = x`).
 |    use_bias: Boolean (default `True`), whether the layer uses a bias vector.
 |    kernel_initializer: Initializer for the `kernel` weights matrix, used for
 |      the linear transformation of the inputs. Default: `glorot_uniform`.
 |    recurrent_initializer: Initializer for the `recurrent_kernel` weights
 |      matrix, used for the linear transformation of the recurrent state.
 |      Default: `orthogonal`.
 |    bias_initializer: Initializer for the bias vector. Default: `zeros`.
 |    unit_forget_bias: Boolean (default `True`). If True, add 1 to the bias of
 |      the forget gate at initialization. Setting it to true will also force
 |      `bias_initializer="zeros"`. This is recommended in [Jozefowicz et
 |          al.](http://www.jmlr.org/proceedings/papers/v37/jozefowicz15.pdf).
 |    kernel_regularizer: Regularizer function applied to the `kernel` weights
 |      matrix. Default: `None`.
 |    recurrent_regularizer: Regularizer function applied to the
 |      `recurrent_kernel` weights matrix. Default: `None`.
 |    bias_regularizer: Regularizer function applied to the bias vector. Default:
 |      `None`.
 |    activity_regularizer: Regularizer function applied to the output of the
 |      layer (its "activation"). Default: `None`.
 |    kernel_constraint: Constraint function applied to the `kernel` weights
 |      matrix. Default: `None`.
 |    recurrent_constraint: Constraint function applied to the `recurrent_kernel`
 |      weights matrix. Default: `None`.
 |    bias_constraint: Constraint function applied to the bias vector. Default:
 |      `None`.
 |    dropout: Float between 0 and 1. Fraction of the units to drop for the linear
 |      transformation of the inputs. Default: 0.
 |    recurrent_dropout: Float between 0 and 1. Fraction of the units to drop for
 |      the linear transformation of the recurrent state. Default: 0.
 |    return_sequences: Boolean. Whether to return the last output. in the output
 |      sequence, or the full sequence. Default: `False`.
 |    return_state: Boolean. Whether to return the last state in addition to the
 |      output. Default: `False`.
 |    go_backwards: Boolean (default `False`). If True, process the input sequence
 |      backwards and return the reversed sequence.
 |    stateful: Boolean (default `False`). If True, the last state for each sample
 |      at index i in a batch will be used as initial state for the sample of
 |      index i in the following batch.
 |    time_major: The shape format of the `inputs` and `outputs` tensors.
 |      If True, the inputs and outputs will be in shape
 |      `[timesteps, batch, feature]`, whereas in the False case, it will be
 |      `[batch, timesteps, feature]`. Using `time_major = True` is a bit more
 |      efficient because it avoids transposes at the beginning and end of the
 |      RNN calculation. However, most TensorFlow data is batch-major, so by
 |      default this function accepts input and emits output in batch-major
 |      form.
 |    unroll: Boolean (default `False`). If True, the network will be unrolled,
 |      else a symbolic loop will be used. Unrolling can speed-up a RNN, although
 |      it tends to be more memory-intensive. Unrolling is only suitable for short
 |      sequences.
 |  
 |  Call arguments:
 |    inputs: A 3D tensor with shape `[batch, timesteps, feature]`.
 |    mask: Binary tensor of shape `[batch, timesteps]` indicating whether
 |      a given timestep should be masked (optional, defaults to `None`).
 |      An individual `True` entry indicates that the corresponding timestep
 |      should be utilized, while a `False` entry indicates that the corresponding
 |      timestep should be ignored.
 |    training: Python boolean indicating whether the layer should behave in
 |      training mode or in inference mode. This argument is passed to the cell
 |      when calling it. This is only relevant if `dropout` or
 |      `recurrent_dropout` is used (optional, defaults to `None`).
 |    initial_state: List of initial state tensors to be passed to the first
 |      call of the cell (optional, defaults to `None` which causes creation
 |      of zero-filled initial state tensors).
 |  
 |  Method resolution order:
 |      LSTM
 |      keras.layers.rnn.dropout_rnn_cell_mixin.DropoutRNNCellMixin
 |      keras.layers.rnn.base_rnn.RNN
 |      keras.engine.base_layer.BaseRandomLayer
 |      keras.engine.base_layer.Layer
 |      tensorflow.python.module.module.Module
 |      tensorflow.python.training.tracking.autotrackable.AutoTrackable
 |      tensorflow.python.training.tracking.base.Trackable
 |      keras.utils.version_utils.LayerVersionSelector
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, units, activation='tanh', recurrent_activation='sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False, go_backwards=False, stateful=False, time_major=False, unroll=False, **kwargs)
 |      Initialize the BaseRandomLayer.
 |      
 |      Note that the constructor is annotated with
 |      @no_automatic_dependency_tracking. This is to skip the auto
 |      tracking of self._random_generator instance, which is an AutoTrackable.
 |      The backend.RandomGenerator could contain a tf.random.Generator instance
 |      which will have tf.Variable as the internal state. We want to avoid saving
 |      that state into model.weights and checkpoints for backward compatibility
 |      reason. In the meantime, we still need to make them visible to SavedModel
 |      when it is tracing the tf.function for the `call()`.
 |      See _list_extra_dependencies_for_serialization below for more details.
 |      
 |      Args:
 |        seed: optional integer, used to create RandomGenerator.
 |        force_generator: boolean, default to False, whether to force the
 |          RandomGenerator to use the code branch of tf.random.Generator.
 |        **kwargs: other keyword arguments that will be passed to the parent class
 |  
 |  call(self, inputs, mask=None, training=None, initial_state=None)
 |      This is where the layer's logic lives.
 |      
 |      The `call()` method may not create state (except in its first invocation,
 |      wrapping the creation of variables or other resources in `tf.init_scope()`).
 |      It is recommended to create state in `__init__()`, or the `build()` method
 |      that is called automatically before `call()` executes the first time.
 |      
 |      Args:
 |        inputs: Input tensor, or dict/list/tuple of input tensors.
 |          The first positional `inputs` argument is subject to special rules:
 |          - `inputs` must be explicitly passed. A layer cannot have zero
 |            arguments, and `inputs` cannot be provided via the default value
 |            of a keyword argument.
 |          - NumPy array or Python scalar values in `inputs` get cast as tensors.
 |          - Keras mask metadata is only collected from `inputs`.
 |          - Layers are built (`build(input_shape)` method)
 |            using shape info from `inputs` only.
 |          - `input_spec` compatibility is only checked against `inputs`.
 |          - Mixed precision input casting is only applied to `inputs`.
 |            If a layer has tensor arguments in `*args` or `**kwargs`, their
 |            casting behavior in mixed precision should be handled manually.
 |          - The SavedModel input specification is generated using `inputs` only.
 |          - Integration with various ecosystem packages like TFMOT, TFLite,
 |            TF.js, etc is only supported for `inputs` and not for tensors in
 |            positional and keyword arguments.
 |        *args: Additional positional arguments. May contain tensors, although
 |          this is not recommended, for the reasons above.
 |        **kwargs: Additional keyword arguments. May contain tensors, although
 |          this is not recommended, for the reasons above.
 |          The following optional keyword arguments are reserved:
 |          - `training`: Boolean scalar tensor of Python boolean indicating
 |            whether the `call` is meant for training or inference.
 |          - `mask`: Boolean input mask. If the layer's `call()` method takes a
 |            `mask` argument, its default value will be set to the mask generated
 |            for `inputs` by the previous layer (if `input` did come from a layer
 |            that generated a corresponding mask, i.e. if it came from a Keras
 |            layer with masking support).
 |      
 |      Returns:
 |        A tensor or list/tuple of tensors.
 |  
 |  get_config(self)
 |      Returns the config of the layer.
 |      
 |      A layer config is a Python dictionary (serializable)
 |      containing the configuration of a layer.
 |      The same layer can be reinstantiated later
 |      (without its trained weights) from this configuration.
 |      
 |      The config of a layer does not include connectivity
 |      information, nor the layer class name. These are handled
 |      by `Network` (one layer of abstraction above).
 |      
 |      Note that `get_config()` does not guarantee to return a fresh copy of dict
 |      every time it is called. The callers should make a copy of the returned dict
 |      if they want to modify it.
 |      
 |      Returns:
 |          Python dictionary.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  from_config(config) from builtins.type
 |      Creates a layer from its config.
 |      
 |      This method is the reverse of `get_config`,
 |      capable of instantiating the same layer from the config
 |      dictionary. It does not handle layer connectivity
 |      (handled by Network), nor weights (handled by `set_weights`).
 |      
 |      Args:
 |          config: A Python dictionary, typically the
 |              output of get_config.
 |      
 |      Returns:
 |          A layer instance.
 |  
 |  ----------------------------------------------------------------------
 |  Readonly properties defined here:
 |  
 |  activation
 |  
 |  bias_constraint
 |  
 |  bias_initializer
 |  
 |  bias_regularizer
 |  
 |  dropout
 |  
 |  implementation
 |  
 |  kernel_constraint
 |  
 |  kernel_initializer
 |  
 |  kernel_regularizer
 |  
 |  recurrent_activation
 |  
 |  recurrent_constraint
 |  
 |  recurrent_dropout
 |  
 |  recurrent_initializer
 |  
 |  recurrent_regularizer
 |  
 |  unit_forget_bias
 |  
 |  units
 |  
 |  use_bias
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from keras.layers.rnn.dropout_rnn_cell_mixin.DropoutRNNCellMixin:
 |  
 |  __getstate__(self)
 |  
 |  __setstate__(self, state)
 |  
 |  get_dropout_mask_for_cell(self, inputs, training, count=1)
 |      Get the dropout mask for RNN cell's input.
 |      
 |      It will create mask based on context if there isn't any existing cached
 |      mask. If a new mask is generated, it will update the cache in the cell.
 |      
 |      Args:
 |        inputs: The input tensor whose shape will be used to generate dropout
 |          mask.
 |        training: Boolean tensor, whether its in training mode, dropout will be
 |          ignored in non-training mode.
 |        count: Int, how many dropout mask will be generated. It is useful for cell
 |          that has internal weights fused together.
 |      Returns:
 |        List of mask tensor, generated or cached mask based on context.
 |  
 |  get_recurrent_dropout_mask_for_cell(self, inputs, training, count=1)
 |      Get the recurrent dropout mask for RNN cell.
 |      
 |      It will create mask based on context if there isn't any existing cached
 |      mask. If a new mask is generated, it will update the cache in the cell.
 |      
 |      Args:
 |        inputs: The input tensor whose shape will be used to generate dropout
 |          mask.
 |        training: Boolean tensor, whether its in training mode, dropout will be
 |          ignored in non-training mode.
 |        count: Int, how many dropout mask will be generated. It is useful for cell
 |          that has internal weights fused together.
 |      Returns:
 |        List of mask tensor, generated or cached mask based on context.
 |  
 |  reset_dropout_mask(self)
 |      Reset the cached dropout masks if any.
 |      
 |      This is important for the RNN layer to invoke this in it `call()` method so
 |      that the cached mask is cleared before calling the `cell.call()`. The mask
 |      should be cached across the timestep within the same batch, but shouldn't
 |      be cached between batches. Otherwise it will introduce unreasonable bias
 |      against certain index of data within the batch.
 |  
 |  reset_recurrent_dropout_mask(self)
 |      Reset the cached recurrent dropout masks if any.
 |      
 |      This is important for the RNN layer to invoke this in it call() method so
 |      that the cached mask is cleared before calling the cell.call(). The mask
 |      should be cached across the timestep within the same batch, but shouldn't
 |      be cached between batches. Otherwise it will introduce unreasonable bias
 |      against certain index of data within the batch.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from keras.layers.rnn.dropout_rnn_cell_mixin.DropoutRNNCellMixin:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from keras.layers.rnn.base_rnn.RNN:
 |  
 |  __call__(self, inputs, initial_state=None, constants=None, **kwargs)
 |      Wraps `call`, applying pre- and post-processing steps.
 |      
 |      Args:
 |        *args: Positional arguments to be passed to `self.call`.
 |        **kwargs: Keyword arguments to be passed to `self.call`.
 |      
 |      Returns:
 |        Output tensor(s).
 |      
 |      Note:
 |        - The following optional keyword arguments are reserved for specific uses:
 |          * `training`: Boolean scalar tensor of Python boolean indicating
 |            whether the `call` is meant for training or inference.
 |          * `mask`: Boolean input mask.
 |        - If the layer's `call` method takes a `mask` argument (as some Keras
 |          layers do), its default value will be set to the mask generated
 |          for `inputs` by the previous layer (if `input` did come from
 |          a layer that generated a corresponding mask, i.e. if it came from
 |          a Keras layer with masking support.
 |        - If the layer is not built, the method will call `build`.
 |      
 |      Raises:
 |        ValueError: if the layer's `call` method returns None (an invalid value).
 |        RuntimeError: if `super().__init__()` was not called in the constructor.
 |  
 |  build(self, input_shape)
 |      Creates the variables of the layer (optional, for subclass implementers).
 |      
 |      This is a method that implementers of subclasses of `Layer` or `Model`
 |      can override if they need a state-creation step in-between
 |      layer instantiation and layer call. It is invoked automatically before
 |      the first execution of `call()`.
 |      
 |      This is typically used to create the weights of `Layer` subclasses
 |      (at the discretion of the subclass implementer).
 |      
 |      Args:
 |        input_shape: Instance of `TensorShape`, or list of instances of
 |          `TensorShape` if the layer expects a list of inputs
 |          (one instance per input).
 |  
 |  compute_mask(self, inputs, mask)
 |      Computes an output mask tensor.
 |      
 |      Args:
 |          inputs: Tensor or list of tensors.
 |          mask: Tensor or list of tensors.
 |      
 |      Returns:
 |          None or a tensor (or list of tensors,
 |              one per output tensor of the layer).
 |  
 |  compute_output_shape(self, input_shape)
 |      Computes the output shape of the layer.
 |      
 |      This method will cause the layer's state to be built, if that has not
 |      happened before. This requires that the layer will later be used with
 |      inputs that match the input shape provided here.
 |      
 |      Args:
 |          input_shape: Shape tuple (tuple of integers)
 |              or list of shape tuples (one per output tensor of the layer).
 |              Shape tuples can include None for free dimensions,
 |              instead of an integer.
 |      
 |      Returns:
 |          An input shape tuple.
 |  
 |  get_initial_state(self, inputs)
 |  
 |  reset_states(self, states=None)
 |      Reset the recorded states for the stateful RNN layer.
 |      
 |      Can only be used when RNN layer is constructed with `stateful` = `True`.
 |      Args:
 |        states: Numpy arrays that contains the value for the initial state, which
 |          will be feed to cell at the first time step. When the value is None,
 |          zero filled numpy array will be created based on the cell state size.
 |      
 |      Raises:
 |        AttributeError: When the RNN layer is not stateful.
 |        ValueError: When the batch size of the RNN layer is unknown.
 |        ValueError: When the input numpy array is not compatible with the RNN
 |          layer state, either size wise or dtype wise.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from keras.layers.rnn.base_rnn.RNN:
 |  
 |  states
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from keras.engine.base_layer.Layer:
 |  
 |  __delattr__(self, name)
 |      Implement delattr(self, name).
 |  
 |  __setattr__(self, name, value)
 |      Support self.foo = trackable syntax.
 |  
 |  add_loss(self, losses, **kwargs)
 |      Add loss tensor(s), potentially dependent on layer inputs.
 |      
 |      Some losses (for instance, activity regularization losses) may be dependent
 |      on the inputs passed when calling a layer. Hence, when reusing the same
 |      layer on different inputs `a` and `b`, some entries in `layer.losses` may
 |      be dependent on `a` and some on `b`. This method automatically keeps track
 |      of dependencies.
 |      
 |      This method can be used inside a subclassed layer or model's `call`
 |      function, in which case `losses` should be a Tensor or list of Tensors.
 |      
 |      Example:
 |      
 |      ```python
 |      class MyLayer(tf.keras.layers.Layer):
 |        def call(self, inputs):
 |          self.add_loss(tf.abs(tf.reduce_mean(inputs)))
 |          return inputs
 |      ```
 |      
 |      This method can also be called directly on a Functional Model during
 |      construction. In this case, any loss Tensors passed to this Model must
 |      be symbolic and be able to be traced back to the model's `Input`s. These
 |      losses become part of the model's topology and are tracked in `get_config`.
 |      
 |      Example:
 |      
 |      ```python
 |      inputs = tf.keras.Input(shape=(10,))
 |      x = tf.keras.layers.Dense(10)(inputs)
 |      outputs = tf.keras.layers.Dense(1)(x)
 |      model = tf.keras.Model(inputs, outputs)
 |      # Activity regularization.
 |      model.add_loss(tf.abs(tf.reduce_mean(x)))
 |      ```
 |      
 |      If this is not the case for your loss (if, for example, your loss references
 |      a `Variable` of one of the model's layers), you can wrap your loss in a
 |      zero-argument lambda. These losses are not tracked as part of the model's
 |      topology since they can't be serialized.
 |      
 |      Example:
 |      
 |      ```python
 |      inputs = tf.keras.Input(shape=(10,))
 |      d = tf.keras.layers.Dense(10)
 |      x = d(inputs)
 |      outputs = tf.keras.layers.Dense(1)(x)
 |      model = tf.keras.Model(inputs, outputs)
 |      # Weight regularization.
 |      model.add_loss(lambda: tf.reduce_mean(d.kernel))
 |      ```
 |      
 |      Args:
 |        losses: Loss tensor, or list/tuple of tensors. Rather than tensors, losses
 |          may also be zero-argument callables which create a loss tensor.
 |        **kwargs: Used for backwards compatibility only.
 |  
 |  add_metric(self, value, name=None, **kwargs)
 |      Adds metric tensor to the layer.
 |      
 |      This method can be used inside the `call()` method of a subclassed layer
 |      or model.
 |      
 |      ```python
 |      class MyMetricLayer(tf.keras.layers.Layer):
 |        def __init__(self):
 |          super(MyMetricLayer, self).__init__(name='my_metric_layer')
 |          self.mean = tf.keras.metrics.Mean(name='metric_1')
 |      
 |        def call(self, inputs):
 |          self.add_metric(self.mean(inputs))
 |          self.add_metric(tf.reduce_sum(inputs), name='metric_2')
 |          return inputs
 |      ```
 |      
 |      This method can also be called directly on a Functional Model during
 |      construction. In this case, any tensor passed to this Model must
 |      be symbolic and be able to be traced back to the model's `Input`s. These
 |      metrics become part of the model's topology and are tracked when you
 |      save the model via `save()`.
 |      
 |      ```python
 |      inputs = tf.keras.Input(shape=(10,))
 |      x = tf.keras.layers.Dense(10)(inputs)
 |      outputs = tf.keras.layers.Dense(1)(x)
 |      model = tf.keras.Model(inputs, outputs)
 |      model.add_metric(math_ops.reduce_sum(x), name='metric_1')
 |      ```
 |      
 |      Note: Calling `add_metric()` with the result of a metric object on a
 |      Functional Model, as shown in the example below, is not supported. This is
 |      because we cannot trace the metric result tensor back to the model's inputs.
 |      
 |      ```python
 |      inputs = tf.keras.Input(shape=(10,))
 |      x = tf.keras.layers.Dense(10)(inputs)
 |      outputs = tf.keras.layers.Dense(1)(x)
 |      model = tf.keras.Model(inputs, outputs)
 |      model.add_metric(tf.keras.metrics.Mean()(x), name='metric_1')
 |      ```
 |      
 |      Args:
 |        value: Metric tensor.
 |        name: String metric name.
 |        **kwargs: Additional keyword arguments for backward compatibility.
 |          Accepted values:
 |          `aggregation` - When the `value` tensor provided is not the result of
 |          calling a `keras.Metric` instance, it will be aggregated by default
 |          using a `keras.Metric.Mean`.
 |  
 |  add_update(self, updates)
 |      Add update op(s), potentially dependent on layer inputs.
 |      
 |      Weight updates (for instance, the updates of the moving mean and variance
 |      in a BatchNormalization layer) may be dependent on the inputs passed
 |      when calling a layer. Hence, when reusing the same layer on
 |      different inputs `a` and `b`, some entries in `layer.updates` may be
 |      dependent on `a` and some on `b`. This method automatically keeps track
 |      of dependencies.
 |      
 |      This call is ignored when eager execution is enabled (in that case, variable
 |      updates are run on the fly and thus do not need to be tracked for later
 |      execution).
 |      
 |      Args:
 |        updates: Update op, or list/tuple of update ops, or zero-arg callable
 |          that returns an update op. A zero-arg callable should be passed in
 |          order to disable running the updates by setting `trainable=False`
 |          on this Layer, when executing in Eager mode.
 |  
 |  add_variable(self, *args, **kwargs)
 |      Deprecated, do NOT use! Alias for `add_weight`.
 |  
 |  add_weight(self, name=None, shape=None, dtype=None, initializer=None, regularizer=None, trainable=None, constraint=None, use_resource=None, synchronization=<VariableSynchronization.AUTO: 0>, aggregation=<VariableAggregationV2.NONE: 0>, **kwargs)
 |      Adds a new variable to the layer.
 |      
 |      Args:
 |        name: Variable name.
 |        shape: Variable shape. Defaults to scalar if unspecified.
 |        dtype: The type of the variable. Defaults to `self.dtype`.
 |        initializer: Initializer instance (callable).
 |        regularizer: Regularizer instance (callable).
 |        trainable: Boolean, whether the variable should be part of the layer's
 |          "trainable_variables" (e.g. variables, biases)
 |          or "non_trainable_variables" (e.g. BatchNorm mean and variance).
 |          Note that `trainable` cannot be `True` if `synchronization`
 |          is set to `ON_READ`.
 |        constraint: Constraint instance (callable).
 |        use_resource: Whether to use `ResourceVariable`.
 |        synchronization: Indicates when a distributed a variable will be
 |          aggregated. Accepted values are constants defined in the class
 |          `tf.VariableSynchronization`. By default the synchronization is set to
 |          `AUTO` and the current `DistributionStrategy` chooses
 |          when to synchronize. If `synchronization` is set to `ON_READ`,
 |          `trainable` must not be set to `True`.
 |        aggregation: Indicates how a distributed variable will be aggregated.
 |          Accepted values are constants defined in the class
 |          `tf.VariableAggregation`.
 |        **kwargs: Additional keyword arguments. Accepted values are `getter`,
 |          `collections`, `experimental_autocast` and `caching_device`.
 |      
 |      Returns:
 |        The variable created.
 |      
 |      Raises:
 |        ValueError: When giving unsupported dtype and no initializer or when
 |          trainable has been set to True with synchronization set as `ON_READ`.
 |  
 |  compute_output_signature(self, input_signature)
 |      Compute the output tensor signature of the layer based on the inputs.
 |      
 |      Unlike a TensorShape object, a TensorSpec object contains both shape
 |      and dtype information for a tensor. This method allows layers to provide
 |      output dtype information if it is different from the input dtype.
 |      For any layer that doesn't implement this function,
 |      the framework will fall back to use `compute_output_shape`, and will
 |      assume that the output dtype matches the input dtype.
 |      
 |      Args:
 |        input_signature: Single TensorSpec or nested structure of TensorSpec
 |          objects, describing a candidate input for the layer.
 |      
 |      Returns:
 |        Single TensorSpec or nested structure of TensorSpec objects, describing
 |          how the layer would transform the provided input.
 |      
 |      Raises:
 |        TypeError: If input_signature contains a non-TensorSpec object.
 |  
 |  count_params(self)
 |      Count the total number of scalars composing the weights.
 |      
 |      Returns:
 |          An integer count.
 |      
 |      Raises:
 |          ValueError: if the layer isn't yet built
 |            (in which case its weights aren't yet defined).
 |  
 |  finalize_state(self)
 |      Finalizes the layers state after updating layer weights.
 |      
 |      This function can be subclassed in a layer and will be called after updating
 |      a layer weights. It can be overridden to finalize any additional layer state
 |      after a weight update.
 |      
 |      This function will be called after weights of a layer have been restored
 |      from a loaded model.
 |  
 |  get_input_at(self, node_index)
 |      Retrieves the input tensor(s) of a layer at a given node.
 |      
 |      Args:
 |          node_index: Integer, index of the node
 |              from which to retrieve the attribute.
 |              E.g. `node_index=0` will correspond to the
 |              first input node of the layer.
 |      
 |      Returns:
 |          A tensor (or list of tensors if the layer has multiple inputs).
 |      
 |      Raises:
 |        RuntimeError: If called in Eager mode.
 |  
 |  get_input_mask_at(self, node_index)
 |      Retrieves the input mask tensor(s) of a layer at a given node.
 |      
 |      Args:
 |          node_index: Integer, index of the node
 |              from which to retrieve the attribute.
 |              E.g. `node_index=0` will correspond to the
 |              first time the layer was called.
 |      
 |      Returns:
 |          A mask tensor
 |          (or list of tensors if the layer has multiple inputs).
 |  
 |  get_input_shape_at(self, node_index)
 |      Retrieves the input shape(s) of a layer at a given node.
 |      
 |      Args:
 |          node_index: Integer, index of the node
 |              from which to retrieve the attribute.
 |              E.g. `node_index=0` will correspond to the
 |              first time the layer was called.
 |      
 |      Returns:
 |          A shape tuple
 |          (or list of shape tuples if the layer has multiple inputs).
 |      
 |      Raises:
 |        RuntimeError: If called in Eager mode.
 |  
 |  get_output_at(self, node_index)
 |      Retrieves the output tensor(s) of a layer at a given node.
 |      
 |      Args:
 |          node_index: Integer, index of the node
 |              from which to retrieve the attribute.
 |              E.g. `node_index=0` will correspond to the
 |              first output node of the layer.
 |      
 |      Returns:
 |          A tensor (or list of tensors if the layer has multiple outputs).
 |      
 |      Raises:
 |        RuntimeError: If called in Eager mode.
 |  
 |  get_output_mask_at(self, node_index)
 |      Retrieves the output mask tensor(s) of a layer at a given node.
 |      
 |      Args:
 |          node_index: Integer, index of the node
 |              from which to retrieve the attribute.
 |              E.g. `node_index=0` will correspond to the
 |              first time the layer was called.
 |      
 |      Returns:
 |          A mask tensor
 |          (or list of tensors if the layer has multiple outputs).
 |  
 |  get_output_shape_at(self, node_index)
 |      Retrieves the output shape(s) of a layer at a given node.
 |      
 |      Args:
 |          node_index: Integer, index of the node
 |              from which to retrieve the attribute.
 |              E.g. `node_index=0` will correspond to the
 |              first time the layer was called.
 |      
 |      Returns:
 |          A shape tuple
 |          (or list of shape tuples if the layer has multiple outputs).
 |      
 |      Raises:
 |        RuntimeError: If called in Eager mode.
 |  
 |  get_weights(self)
 |      Returns the current weights of the layer, as NumPy arrays.
 |      
 |      The weights of a layer represent the state of the layer. This function
 |      returns both trainable and non-trainable weight values associated with this
 |      layer as a list of NumPy arrays, which can in turn be used to load state
 |      into similarly parameterized layers.
 |      
 |      For example, a `Dense` layer returns a list of two values: the kernel matrix
 |      and the bias vector. These can be used to set the weights of another
 |      `Dense` layer:
 |      
 |      >>> layer_a = tf.keras.layers.Dense(1,
 |      ...   kernel_initializer=tf.constant_initializer(1.))
 |      >>> a_out = layer_a(tf.convert_to_tensor([[1., 2., 3.]]))
 |      >>> layer_a.get_weights()
 |      [array([[1.],
 |             [1.],
 |             [1.]], dtype=float32), array([0.], dtype=float32)]
 |      >>> layer_b = tf.keras.layers.Dense(1,
 |      ...   kernel_initializer=tf.constant_initializer(2.))
 |      >>> b_out = layer_b(tf.convert_to_tensor([[10., 20., 30.]]))
 |      >>> layer_b.get_weights()
 |      [array([[2.],
 |             [2.],
 |             [2.]], dtype=float32), array([0.], dtype=float32)]
 |      >>> layer_b.set_weights(layer_a.get_weights())
 |      >>> layer_b.get_weights()
 |      [array([[1.],
 |             [1.],
 |             [1.]], dtype=float32), array([0.], dtype=float32)]
 |      
 |      Returns:
 |          Weights values as a list of NumPy arrays.
 |  
 |  set_weights(self, weights)
 |      Sets the weights of the layer, from NumPy arrays.
 |      
 |      The weights of a layer represent the state of the layer. This function
 |      sets the weight values from numpy arrays. The weight values should be
 |      passed in the order they are created by the layer. Note that the layer's
 |      weights must be instantiated before calling this function, by calling
 |      the layer.
 |      
 |      For example, a `Dense` layer returns a list of two values: the kernel matrix
 |      and the bias vector. These can be used to set the weights of another
 |      `Dense` layer:
 |      
 |      >>> layer_a = tf.keras.layers.Dense(1,
 |      ...   kernel_initializer=tf.constant_initializer(1.))
 |      >>> a_out = layer_a(tf.convert_to_tensor([[1., 2., 3.]]))
 |      >>> layer_a.get_weights()
 |      [array([[1.],
 |             [1.],
 |             [1.]], dtype=float32), array([0.], dtype=float32)]
 |      >>> layer_b = tf.keras.layers.Dense(1,
 |      ...   kernel_initializer=tf.constant_initializer(2.))
 |      >>> b_out = layer_b(tf.convert_to_tensor([[10., 20., 30.]]))
 |      >>> layer_b.get_weights()
 |      [array([[2.],
 |             [2.],
 |             [2.]], dtype=float32), array([0.], dtype=float32)]
 |      >>> layer_b.set_weights(layer_a.get_weights())
 |      >>> layer_b.get_weights()
 |      [array([[1.],
 |             [1.],
 |             [1.]], dtype=float32), array([0.], dtype=float32)]
 |      
 |      Args:
 |        weights: a list of NumPy arrays. The number
 |          of arrays and their shape must match
 |          number of the dimensions of the weights
 |          of the layer (i.e. it should match the
 |          output of `get_weights`).
 |      
 |      Raises:
 |        ValueError: If the provided weights list does not match the
 |          layer's specifications.
 |  
 |  ----------------------------------------------------------------------
 |  Readonly properties inherited from keras.engine.base_layer.Layer:
 |  
 |  compute_dtype
 |      The dtype of the layer's computations.
 |      
 |      This is equivalent to `Layer.dtype_policy.compute_dtype`. Unless
 |      mixed precision is used, this is the same as `Layer.dtype`, the dtype of
 |      the weights.
 |      
 |      Layers automatically cast their inputs to the compute dtype, which causes
 |      computations and the output to be in the compute dtype as well. This is done
 |      by the base Layer class in `Layer.__call__`, so you do not have to insert
 |      these casts if implementing your own layer.
 |      
 |      Layers often perform certain internal computations in higher precision when
 |      `compute_dtype` is float16 or bfloat16 for numeric stability. The output
 |      will still typically be float16 or bfloat16 in such cases.
 |      
 |      Returns:
 |        The layer's compute dtype.
 |  
 |  dtype
 |      The dtype of the layer weights.
 |      
 |      This is equivalent to `Layer.dtype_policy.variable_dtype`. Unless
 |      mixed precision is used, this is the same as `Layer.compute_dtype`, the
 |      dtype of the layer's computations.
 |  
 |  dtype_policy
 |      The dtype policy associated with this layer.
 |      
 |      This is an instance of a `tf.keras.mixed_precision.Policy`.
 |  
 |  dynamic
 |      Whether the layer is dynamic (eager-only); set in the constructor.
 |  
 |  inbound_nodes
 |      Return Functional API nodes upstream of this layer.
 |  
 |  input
 |      Retrieves the input tensor(s) of a layer.
 |      
 |      Only applicable if the layer has exactly one input,
 |      i.e. if it is connected to one incoming layer.
 |      
 |      Returns:
 |          Input tensor or list of input tensors.
 |      
 |      Raises:
 |        RuntimeError: If called in Eager mode.
 |        AttributeError: If no inbound nodes are found.
 |  
 |  input_mask
 |      Retrieves the input mask tensor(s) of a layer.
 |      
 |      Only applicable if the layer has exactly one inbound node,
 |      i.e. if it is connected to one incoming layer.
 |      
 |      Returns:
 |          Input mask tensor (potentially None) or list of input
 |          mask tensors.
 |      
 |      Raises:
 |          AttributeError: if the layer is connected to
 |          more than one incoming layers.
 |  
 |  input_shape
 |      Retrieves the input shape(s) of a layer.
 |      
 |      Only applicable if the layer has exactly one input,
 |      i.e. if it is connected to one incoming layer, or if all inputs
 |      have the same shape.
 |      
 |      Returns:
 |          Input shape, as an integer shape tuple
 |          (or list of shape tuples, one tuple per input tensor).
 |      
 |      Raises:
 |          AttributeError: if the layer has no defined input_shape.
 |          RuntimeError: if called in Eager mode.
 |  
 |  losses
 |      List of losses added using the `add_loss()` API.
 |      
 |      Variable regularization tensors are created when this property is accessed,
 |      so it is eager safe: accessing `losses` under a `tf.GradientTape` will
 |      propagate gradients back to the corresponding variables.
 |      
 |      Examples:
 |      
 |      >>> class MyLayer(tf.keras.layers.Layer):
 |      ...   def call(self, inputs):
 |      ...     self.add_loss(tf.abs(tf.reduce_mean(inputs)))
 |      ...     return inputs
 |      >>> l = MyLayer()
 |      >>> l(np.ones((10, 1)))
 |      >>> l.losses
 |      [1.0]
 |      
 |      >>> inputs = tf.keras.Input(shape=(10,))
 |      >>> x = tf.keras.layers.Dense(10)(inputs)
 |      >>> outputs = tf.keras.layers.Dense(1)(x)
 |      >>> model = tf.keras.Model(inputs, outputs)
 |      >>> # Activity regularization.
 |      >>> len(model.losses)
 |      0
 |      >>> model.add_loss(tf.abs(tf.reduce_mean(x)))
 |      >>> len(model.losses)
 |      1
 |      
 |      >>> inputs = tf.keras.Input(shape=(10,))
 |      >>> d = tf.keras.layers.Dense(10, kernel_initializer='ones')
 |      >>> x = d(inputs)
 |      >>> outputs = tf.keras.layers.Dense(1)(x)
 |      >>> model = tf.keras.Model(inputs, outputs)
 |      >>> # Weight regularization.
 |      >>> model.add_loss(lambda: tf.reduce_mean(d.kernel))
 |      >>> model.losses
 |      [<tf.Tensor: shape=(), dtype=float32, numpy=1.0>]
 |      
 |      Returns:
 |        A list of tensors.
 |  
 |  metrics
 |      List of metrics added using the `add_metric()` API.
 |      
 |      Example:
 |      
 |      >>> input = tf.keras.layers.Input(shape=(3,))
 |      >>> d = tf.keras.layers.Dense(2)
 |      >>> output = d(input)
 |      >>> d.add_metric(tf.reduce_max(output), name='max')
 |      >>> d.add_metric(tf.reduce_min(output), name='min')
 |      >>> [m.name for m in d.metrics]
 |      ['max', 'min']
 |      
 |      Returns:
 |        A list of `Metric` objects.
 |  
 |  name
 |      Name of the layer (string), set in the constructor.
 |  
 |  non_trainable_variables
 |      Sequence of non-trainable variables owned by this module and its submodules.
 |      
 |      Note: this method uses reflection to find variables on the current instance
 |      and submodules. For performance reasons you may wish to cache the result
 |      of calling this method if you don't expect the return value to change.
 |      
 |      Returns:
 |        A sequence of variables for the current module (sorted by attribute
 |        name) followed by variables from all submodules recursively (breadth
 |        first).
 |  
 |  non_trainable_weights
 |      List of all non-trainable weights tracked by this layer.
 |      
 |      Non-trainable weights are *not* updated during training. They are expected
 |      to be updated manually in `call()`.
 |      
 |      Returns:
 |        A list of non-trainable variables.
 |  
 |  outbound_nodes
 |      Return Functional API nodes downstream of this layer.
 |  
 |  output
 |      Retrieves the output tensor(s) of a layer.
 |      
 |      Only applicable if the layer has exactly one output,
 |      i.e. if it is connected to one incoming layer.
 |      
 |      Returns:
 |        Output tensor or list of output tensors.
 |      
 |      Raises:
 |        AttributeError: if the layer is connected to more than one incoming
 |          layers.
 |        RuntimeError: if called in Eager mode.
 |  
 |  output_mask
 |      Retrieves the output mask tensor(s) of a layer.
 |      
 |      Only applicable if the layer has exactly one inbound node,
 |      i.e. if it is connected to one incoming layer.
 |      
 |      Returns:
 |          Output mask tensor (potentially None) or list of output
 |          mask tensors.
 |      
 |      Raises:
 |          AttributeError: if the layer is connected to
 |          more than one incoming layers.
 |  
 |  output_shape
 |      Retrieves the output shape(s) of a layer.
 |      
 |      Only applicable if the layer has one output,
 |      or if all outputs have the same shape.
 |      
 |      Returns:
 |          Output shape, as an integer shape tuple
 |          (or list of shape tuples, one tuple per output tensor).
 |      
 |      Raises:
 |          AttributeError: if the layer has no defined output shape.
 |          RuntimeError: if called in Eager mode.
 |  
 |  trainable_variables
 |      Sequence of trainable variables owned by this module and its submodules.
 |      
 |      Note: this method uses reflection to find variables on the current instance
 |      and submodules. For performance reasons you may wish to cache the result
 |      of calling this method if you don't expect the return value to change.
 |      
 |      Returns:
 |        A sequence of variables for the current module (sorted by attribute
 |        name) followed by variables from all submodules recursively (breadth
 |        first).
 |  
 |  trainable_weights
 |      List of all trainable weights tracked by this layer.
 |      
 |      Trainable weights are updated via gradient descent during training.
 |      
 |      Returns:
 |        A list of trainable variables.
 |  
 |  updates
 |  
 |  variable_dtype
 |      Alias of `Layer.dtype`, the dtype of the weights.
 |  
 |  variables
 |      Returns the list of all layer variables/weights.
 |      
 |      Alias of `self.weights`.
 |      
 |      Note: This will not track the weights of nested `tf.Modules` that are not
 |      themselves Keras layers.
 |      
 |      Returns:
 |        A list of variables.
 |  
 |  weights
 |      Returns the list of all layer variables/weights.
 |      
 |      Returns:
 |        A list of variables.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from keras.engine.base_layer.Layer:
 |  
 |  activity_regularizer
 |      Optional regularizer function for the output of this layer.
 |  
 |  input_spec
 |      `InputSpec` instance(s) describing the input format for this layer.
 |      
 |      When you create a layer subclass, you can set `self.input_spec` to enable
 |      the layer to run input compatibility checks when it is called.
 |      Consider a `Conv2D` layer: it can only be called on a single input tensor
 |      of rank 4. As such, you can set, in `__init__()`:
 |      
 |      ```python
 |      self.input_spec = tf.keras.layers.InputSpec(ndim=4)
 |      ```
 |      
 |      Now, if you try to call the layer on an input that isn't rank 4
 |      (for instance, an input of shape `(2,)`, it will raise a nicely-formatted
 |      error:
 |      
 |      ```
 |      ValueError: Input 0 of layer conv2d is incompatible with the layer:
 |      expected ndim=4, found ndim=1. Full shape received: [2]
 |      ```
 |      
 |      Input checks that can be specified via `input_spec` include:
 |      - Structure (e.g. a single input, a list of 2 inputs, etc)
 |      - Shape
 |      - Rank (ndim)
 |      - Dtype
 |      
 |      For more information, see `tf.keras.layers.InputSpec`.
 |      
 |      Returns:
 |        A `tf.keras.layers.InputSpec` instance, or nested structure thereof.
 |  
 |  stateful
 |  
 |  supports_masking
 |      Whether this layer supports computing a mask using `compute_mask`.
 |  
 |  trainable
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from tensorflow.python.module.module.Module:
 |  
 |  with_name_scope(method) from builtins.type
 |      Decorator to automatically enter the module name scope.
 |      
 |      >>> class MyModule(tf.Module):
 |      ...   @tf.Module.with_name_scope
 |      ...   def __call__(self, x):
 |      ...     if not hasattr(self, 'w'):
 |      ...       self.w = tf.Variable(tf.random.normal([x.shape[1], 3]))
 |      ...     return tf.matmul(x, self.w)
 |      
 |      Using the above module would produce `tf.Variable`s and `tf.Tensor`s whose
 |      names included the module name:
 |      
 |      >>> mod = MyModule()
 |      >>> mod(tf.ones([1, 2]))
 |      <tf.Tensor: shape=(1, 3), dtype=float32, numpy=..., dtype=float32)>
 |      >>> mod.w
 |      <tf.Variable 'my_module/Variable:0' shape=(2, 3) dtype=float32,
 |      numpy=..., dtype=float32)>
 |      
 |      Args:
 |        method: The method to wrap.
 |      
 |      Returns:
 |        The original method wrapped such that it enters the module's name scope.
 |  
 |  ----------------------------------------------------------------------
 |  Readonly properties inherited from tensorflow.python.module.module.Module:
 |  
 |  name_scope
 |      Returns a `tf.name_scope` instance for this class.
 |  
 |  submodules
 |      Sequence of all sub-modules.
 |      
 |      Submodules are modules which are properties of this module, or found as
 |      properties of modules which are properties of this module (and so on).
 |      
 |      >>> a = tf.Module()
 |      >>> b = tf.Module()
 |      >>> c = tf.Module()
 |      >>> a.b = b
 |      >>> b.c = c
 |      >>> list(a.submodules) == [b, c]
 |      True
 |      >>> list(b.submodules) == [c]
 |      True
 |      >>> list(c.submodules) == []
 |      True
 |      
 |      Returns:
 |        A sequence of all submodules.
 |  
 |  ----------------------------------------------------------------------
 |  Static methods inherited from keras.utils.version_utils.LayerVersionSelector:
 |  
 |  __new__(cls, *args, **kwargs)
 |      Create and return a new object.  See help(type) for accurate signature.
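As a complement to the `set_weights` / `get_weights` documentation above, the weight list returned by `get_weights()` can be copied into any layer that has the same configuration. Below is a minimal sketch, not part of the original notebook: it assumes a hypothetical second layer `lstm_b` built with the same configuration as the `lstm` layer used here (4 units, inputs of dimension 8); the name `lstm_b` and the dummy batch are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Hypothetical second LSTM layer with the same configuration as `lstm`:
# 4 units and inputs of dimension 8.
lstm_b = tf.keras.layers.LSTM(4)
lstm_b(tf.zeros((1, 3, 8)))   # call it on a dummy batch so its weights are created

# Copy the parameters: the list of NumPy arrays returned by get_weights()
# must match the shapes expected by the receiving layer.
lstm_b.set_weights(lstm.get_weights())

# Both layers now hold identical parameters.
for w_a, w_b in zip(lstm.get_weights(), lstm_b.get_weights()):
    assert np.allclose(w_a, w_b)
```

Inspecting that same weight list for the `lstm` layer gives three arrays: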
weights = lstm.get_weights()
len(weights)        # three arrays: input kernel W, recurrent kernel U and bias b
3
weights[0].shape    # W: (input_dim, 4*units) = (8, 16)
(8, 16)
weights[1].shape    # U: (units, 4*units) = (4, 16)
(4, 16)
weights[2].shape    # b: (4*units,) = (16,)
(16,)

Note that the weights of the LSTM layer are organized by gate: each array concatenates one block of `units` columns (or entries, in the case of the bias) per gate, in the order used by Keras: input, forget, cell and output.

units = 4  # size of the hidden state

# Input kernel W: one block of `units` columns per gate, in Keras order (input, forget, cell, output)
W = lstm.get_weights()[0]
W_x_i = W[:, :units]             # input gate
W_x_f = W[:, units:units*2]      # forget gate
W_x_c = W[:, units*2:units*3]    # candidate cell state
W_x_o = W[:, units*3:]           # output gate

# Recurrent kernel U: same gate layout, applied to the hidden state h_{t-1}
U = lstm.get_weights()[1]
U_x_i = U[:, :units]
U_x_f = U[:, units:units*2]
U_x_c = U[:, units*2:units*3]
U_x_o = U[:, units*3:]

# Bias b: one block of `units` entries per gate
b = lstm.get_weights()[2]
b_x_i = b[:units]
b_x_f = b[units:units*2]
b_x_c = b[units*2:units*3]
b_x_o = b[units*3:]
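
With these slices one can reproduce a single LSTM step by hand and compare it with the layer's output. The following is a minimal sketch, assuming the standard LSTM equations with the default Keras activations (sigmoid for the gates, tanh for the candidate and the output), zero initial states, and a hypothetical input vector `x` of dimension 8.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(8).astype(np.float32)     # hypothetical input vector
h_prev = np.zeros(units, dtype=np.float32)    # previous hidden state (zeros)
c_prev = np.zeros(units, dtype=np.float32)    # previous cell state (zeros)

i = sigmoid(x @ W_x_i + h_prev @ U_x_i + b_x_i)          # input gate
f = sigmoid(x @ W_x_f + h_prev @ U_x_f + b_x_f)          # forget gate
c_tilde = np.tanh(x @ W_x_c + h_prev @ U_x_c + b_x_c)    # candidate cell state
o = sigmoid(x @ W_x_o + h_prev @ U_x_o + b_x_o)          # output gate

c = f * c_prev + i * c_tilde      # new cell state
h = o * np.tanh(c)                # new hidden state, shape (units,)

# For a one-step sequence the layer output should agree (up to numerical precision):
# lstm(x.reshape(1, 1, 8)) ≈ h
```

If the comparison with `lstm(x.reshape(1, 1, 8))` does not match, the most likely causes are a different gate ordering or a non-default recurrent activation in the layer's configuration.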