Advanced Deep Learning

Non-Linear Activation Functions

Activation in Output Layer

  • Sigmoid — Used when each element of the prediction tensor should be independently mapped to a value between 0.0 and 1.0.

  • Softmax — Constrains the elements of the prediction tensor so that they sum to 1.0, which makes it suitable when the output represents a probability distribution over mutually exclusive classes.

  • Tanh — Maps its input to the range -1.0 to 1.0. This is important when the output can swing between positive and negative values. The tanh function is more popularly used in the internal layers of recurrent neural networks, but it has also been used as an output-layer activation.

For example, 8-bit pixel values in the range 0 to 255 can be rescaled to the tanh output range of -1.0 to 1.0:

scaled = (value - 127.5) / 127.5
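A minimal sketch of the rescaling and of the three output-activation choices in Keras (the number of outputs, n_outputs = 10, is an assumed placeholder, not taken from the original example):

import numpy as np
from tensorflow.keras.layers import Dense

# Rescale 8-bit pixel values (0 to 255) into the tanh range of -1.0 to 1.0.
pixels = np.array([0.0, 127.5, 255.0])
scaled = (pixels - 127.5) / 127.5  # -> [-1.0, 0.0, 1.0]

# Output-layer activations described above (n_outputs is an assumed placeholder).
n_outputs = 10
sigmoid_output = Dense(n_outputs, activation='sigmoid')  # independent values in 0.0 to 1.0
softmax_output = Dense(n_outputs, activation='softmax')  # elements sum to 1.0
tanh_output = Dense(n_outputs, activation='tanh')        # values in -1.0 to 1.0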

Regularization

A neural network has a tendency to memorize its training data, especially if it has more than enough capacity. In that case, the network fails catastrophically when subjected to the test data. This is the classic case of the network failing to generalize. To counter this tendency, the model uses a regularizing layer or function, which discourages learning an overly complex or flexible model and thus reduces the risk of overfitting.

This is also related to shortcut learning, where the network latches onto patterns that fit the training data but do not transfer to unseen data.

Dropout

The idea of dropout is simple. Given a dropout rate, the Dropout layer randomly removes that fraction of units from participating in the next layer. For example, if the first layer has 256 units and dropout = 0.45 is applied, only (1 - 0.45) × 256 ≈ 140 units from layer 1 participate in layer 2.

The Dropout layer makes neural networks robust to unforeseen input data because the network is trained to predict correctly even when some units are missing. It is worth noting that dropout is not used in the output layer, and that it is active only during training; it is not applied when making predictions.
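A minimal sketch of how a Dropout layer is typically inserted in Keras (the layer sizes and input dimension here are illustrative assumptions, not taken from a specific example):

from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Dense(256, activation='relu', input_dim=784))
model.add(Dropout(0.45))   # roughly 45% of the 256 units are dropped each training step
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.45))
model.add(Dense(10, activation='softmax'))   # no dropout on the output layer

Keras handles the training/inference switch automatically: dropout is applied during model.fit() and disabled during model.predict().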

L1 and L2 Regularization

Regularizers other than dropout, such as l1 and l2, can also be used. In Keras, the bias, weight, and activation outputs can be regularized per layer. l1 and l2 favor smaller parameter values by adding a penalty function: a fraction of the sum of the absolute values (l1) or of the squares (l2) of the parameters. In other words, the penalty forces the optimizer to find parameter values that are small. Neural networks with small parameter values are less sensitive to noise in the input data.

from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

# apply an l2 weight (kernel) penalty with a factor of 0.001 on this layer
model.add(Dense(hidden_units,
                kernel_regularizer=l2(0.001),
                input_dim=input_size))
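Since the text notes that the bias and activation outputs can also be regularized per layer, here is a hedged sketch of the corresponding Dense arguments (the penalty factors are illustrative):

from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2

model.add(Dense(hidden_units,
                kernel_regularizer=l2(0.001),     # penalty on the weights
                bias_regularizer=l2(0.001),       # penalty on the bias terms
                activity_regularizer=l1(0.001)))  # penalty on the layer's output activations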
