Optimizer
The optimizer node is used to define your loss function. During training, the optimizer node attempts to minimize the value connected to it by adjusting the parameters (weights and biases) of the nodes it optimizes. When evaluating the model, optimizer nodes are not executed and have no effect.
At least one of the values connected to an optimizer node must record gradient info. You can tell whether a value is recording gradient info by checking if the line (the connection from one node to another) is blue instead of grey. For example, the output from a constant does not contain any gradient information and can therefore not be the only value provided to an optimizer, since it is not possible to modify the value of a constant. On the other hand, the output from a Conv1D node is optimizable, since a convolutional node has weights and biases that can be optimized. See the Training page for more information.
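The same distinction shows up in code-based frameworks. Below is a minimal PyTorch-style sketch, used only as an analogy to the tool's blue/grey connections (the shapes and layer choice are hypothetical): a plain constant tensor carries no gradient information, while the output of a Conv1d layer does, because the layer's weights and biases are trainable parameters.

```python
import torch
import torch.nn as nn

# A "constant" node: a plain tensor with no gradient tracking.
constant = torch.ones(1, 4, 16)   # requires_grad is False by default
print(constant.requires_grad)     # False -> grey connection, not optimizable on its own

# A Conv1D node: its weights and biases are trainable parameters,
# so its output records gradient information and can be optimized.
conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3)
out = conv(constant)
print(out.requires_grad)          # True -> blue connection, optimizable
```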
Multiple optimizer nodes
You can use multiple optimizer nodes to define multiple loss functions and optimize them separately. This is for example used in generative adversarial networks (GANs), where the generator and discriminator are trained in two separate steps. For each batch, each optimizer is stepped in order. The order in which the optimizers are processed can be changed by clicking an optimizer and adjusting its order on the panel on the right.
The steps when training are:
1. Start epoch 1.
2. Load batch 1.
3. Calculate the loss function for the first optimizer, using batch 1 as the input, and step it. That is, perform the optimization.
4. Calculate the loss function for the second optimizer, using batch 1 as the input, and step it.
5. Repeat steps 3-4 for any remaining optimizers.
6. If there are more batches to load for this epoch, go to step 2 with the next batch. If there are more epochs, go to step 1 with the next epoch. Otherwise, training is complete.
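As an illustration of this loop, here is a minimal PyTorch-style sketch with two optimizers, roughly mirroring a GAN setup. The models, losses and dataset are hypothetical stand-ins rather than the tool's actual implementation; the point is only that each optimizer's loss is computed and stepped in order for every batch.

```python
import torch
import torch.nn as nn

# Toy generator/discriminator pair; layer sizes and losses are hypothetical.
generator = nn.Linear(8, 16)
discriminator = nn.Linear(16, 1)

# One optimizer per loss function; the list order corresponds to the
# "Order" setting of each optimizer node.
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)

def loss_d(batch):
    # Toy discriminator loss: score real samples high, generated samples low.
    real = discriminator(batch)
    fake = discriminator(generator(torch.randn(batch.size(0), 8)))
    return fake.mean() - real.mean()

def loss_g(batch):
    # Toy generator loss: make generated samples score high.
    fake = discriminator(generator(torch.randn(batch.size(0), 8)))
    return -fake.mean()

data_loader = [torch.randn(32, 16) for _ in range(10)]  # stand-in dataset

for epoch in range(2):                                   # step 1: start epoch
    for batch in data_loader:                            # step 2: load batch
        for opt, loss_fn in [(opt_d, loss_d), (opt_g, loss_g)]:
            opt.zero_grad()
            loss = loss_fn(batch)                        # steps 3-5: each loss...
            loss.backward()
            opt.step()                                   # ...and step, in order
```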
Select nodes to optimize
Each optimizer contains an internal list of all the nodes it is optimizing. All nodes that are not included will not be modified by the optimizer when training.
You can select which nodes to optimize by clicking the optimizer node, and on the properties panel clicking "Select nodes".
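In a code-based framework this corresponds to constructing an optimizer over only the parameters of the selected nodes. A minimal PyTorch-style sketch (the layers here are hypothetical examples):

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(4, 8, kernel_size=3)
dense = nn.Linear(8, 2)

# Only the parameters of the "selected" node (conv) are handed to the optimizer;
# the dense layer's weights and biases will not be modified when stepping.
optimizer = torch.optim.SGD(conv.parameters(), lr=0.01)
```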
Configuration
By clicking the node, the following parameters can be configured on the panel on the right:
- Optimizer: One of SGD (stochastic gradient descent) or Adam. The algorithm used to perform the optimization. Depending on the choice, the following additional parameters can be configured:
- Momentum (SGD): A momentum term that acts as a moving average over the gradients, which in effect dampens oscillations.
- Use Nesterov momentum (SGD): Whether to use Nesterov momentum, as described in On the importance of initialization and momentum in deep learning.
- beta1 (Adam): The exponential decay rate for the first moment estimates.
- beta2 (Adam): The exponential decay rate for the second moment estimates.
- eps (Adam): A small constant added to improve numerical stability.
- Order: In case there are multiple optimizers in your network, this field configures the order of processing for the optimizers. See the training steps example above.
- Learning rate: The learning rate configures how aggressive the optimizer should be. A larger learning rate may speed up training and improve final model accuracy, but it can also have the exact opposite effect. You should experiment with different values to see what works best for your particular model.
- Weight decay: Also known as L2 regularization. This modifies the loss function to become \(\text{Loss} = \text{PreviousLoss} + \text{WD} * \text{sum}(\text{weights}^2)\), where \(\text{WD}\) is the weight decay factor and \(\text{weights}\) are all of the model's parameters. In effect, this causes the parameters to decrease in magnitude, which can help against overfitting but may cause underfitting if the weight decay is too large.
- Loss: Configures how to optimize the provided value. There are a few modes:
- Mean: In this mode, the optimizer takes one input and minimizes the mean of the value provided.
- Mean absolute error: In this mode, the optimizer takes two inputs A and B, and minimizes the mean of the absolute value of the difference between input B and input A: \(\text{mean}(\text{abs}(B - A))\)
- Mean squared error: In this mode, the optimizer also takes two inputs, but minimizes the mean of the square of the difference between input B and input A: \(\text{mean}((B - A)^2)\)
- Finally, there is a "Select nodes" button which, as described above, allows you to configure which nodes to optimize.
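For reference, the settings above map onto familiar optimizer and loss constructions in libraries such as PyTorch. The sketch below is only an analogy, not the tool's internal implementation; the model and the example values a and b are hypothetical stand-ins for the selected nodes and the optimizer's inputs.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)          # stand-in for the selected nodes

# SGD with momentum, Nesterov momentum and weight decay (L2 regularization).
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                      nesterov=True, weight_decay=1e-4)

# Adam with explicit beta1, beta2 and eps.
adam = torch.optim.Adam(model.parameters(), lr=0.001,
                        betas=(0.9, 0.999), eps=1e-8)

# The three loss modes, for two example values a and b.
a = model(torch.randn(4, 10))
b = torch.randn(4, 1)
mean_loss = a.mean()              # Mean: mean(A)
mae_loss = nn.L1Loss()(a, b)      # Mean absolute error: mean(abs(B - A))
mse_loss = nn.MSELoss()(a, b)     # Mean squared error: mean((B - A)^2)
```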