Shrink to Win: Regularization’s Role in Model Success

Overfitting? Shrink your model! Regularization prevents memorization, enabling true learning & robust success. Discover how.

A central problem in machine learning is to develop algorithms that work well both on training data and on new inputs (test data).

Most machine learning tasks can be framed as estimating a function \(\hat{f}(X)\) that maps input variables \(X\) to an output variable \(Y\), so that \(Y \approx \hat{f}(X)\). The goal is to predict \(Y\) for new values of \(X\) without knowing the exact form of the true underlying function \(f(X)\). We would like \(\hat{f}(X)\) to be as close as possible to \(f(X)\).

For a random variable \(f\) and its estimate \(\hat{f}\), the relationship between MSE, bias and variance is as follows:

Bias refers to the error introduced by approximating a real-world problem, which may be complex, with a simplified model. High bias can cause an algorithm to miss relevant relations between features and target outputs (underfitting). The bias is calculated as

$$Bias(\hat{f}) = E[\hat{f}] - f \quad \quad (1)$$

A model with high bias pays little attention to the training data and oversimplifies the problem.

Variance measures how much the predictions for a given point vary between different realizations of the model. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting). Variance is calculated as

$$Var(\hat{f}) = E \left[ \left( \hat{f} - E[\hat{f}] \right)^2 \right] \quad \quad (2)$$

A model with high variance pays too much attention to the training data and captures noise along with the underlying data distribution.

The relationship between bias, variance, and mean squared error (MSE) can be visually and conceptually understood as a triangular relationship, often referred to as the bias-variance tradeoff [1]. This relationship highlights how these three components interact in the context of model performance in machine learning.

Figure 1: Relationship between variance, bias and MSE

We can always think of the triangular relationship between the functions \(\hat{f}\), \(f\) and the expectation \(E(\hat{f})\). The distance between \(f\) and \(E(\hat{f})\) is the bias \(Bias(\hat{f})\). The distance between \(\hat{f}\) and \(E(\hat{f})\) is \(\sqrt{Var(\hat{f})}\). The distance between \(\hat{f}\) and \(f\) is \(\sqrt{MSE}\). We always want \(\hat{f}\) and \(f\) to be as close as possible, that is, the MSE should be as low as possible.

$$ MSE(\hat{f}) = Var(\hat{f}) + Bias^2(\hat{f}) \quad \quad (3) $$
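The decomposition in equation (3) can be checked numerically. The sketch below is only an illustration: the true value, the noise level and the deliberately biased "shrunken mean" estimator are all assumptions made for the demo, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

f_true = 2.0        # assumed "true" value being estimated
n_trials = 10_000   # number of independent training sets
n_samples = 5       # observations per training set
shrink = 0.8        # hypothetical shrinkage factor (makes the estimator biased)

# Each row is one noisy training set drawn around the true value.
y = f_true + rng.normal(0.0, 1.0, size=(n_trials, n_samples))

# One estimate per training set: a shrunken sample mean.
f_hat = shrink * y.mean(axis=1)

bias = f_hat.mean() - f_true                      # equation (1)
variance = np.mean((f_hat - f_hat.mean()) ** 2)   # equation (2)
mse = np.mean((f_hat - f_true) ** 2)              # left-hand side of equation (3)

print(f"bias^2 + variance = {bias**2 + variance:.6f}")
print(f"mse               = {mse:.6f}")
```

The two printed values agree up to floating-point rounding, because the identity holds for the empirical moments as well as for the expectations.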

Tradeoffs

Low Bias, High Variance

A complex machine learning model (like deep neural networks) may fit training data very well but may not generalize to unseen data due to high variance.

High Bias, Low Variance

A simple model (like linear regression) may not capture all patterns in the training data, leading to underfitting.

Optimal Model: Training Error and True Error

The goal is to find the balance between bias and variance that gives the lowest MSE. This often involves tuning hyperparameters or selecting appropriate models.

The definitions of bias, variance and MSE in equations (1), (2) and (3) also apply when \(f(X)\) is the true underlying function (model) and \(\hat{f}(X)\) is the estimated model.

The performance of the estimated model is described by the true error and the test error. The true error, also known as the generalization error, is the expected error of the model on all possible unseen data. It represents the model’s performance on the entire population of data, not just the specific test set used for evaluation. Since it is impossible to evaluate the model on all possible data, the true error is a theoretical quantity that is never known exactly.

Test error is an estimate of how well a model generalizes to unseen data. Test error is estimated by evaluating the trained model on a test dataset, which is a set of data that the model has not seen during training. In essence, test error is an estimate of the true error that is used as a proxy to evaluate the model’s ability to generalize.

The goal of machine learning is to minimize the true error, but since it cannot be directly measured, the test error is used as a practical approximation.

Figure 2 illustrates how the training and test errors relate to overfitting and underfitting.

Figure 2: True error and test error, underfitting and overfitting

In the first stages of training, the test error and the true error both decrease; however, after some training, the model becomes more complex and moves toward overfitting.

When the model is too simple, it is unable to properly capture the patterns and relationships in the data. The underfit model performs poorly on both the training and the test data. An underfit model has high bias and low variance.

When the model is too complex, it fits the training data too closely, capturing both relevant patterns and noise. The model performs very well on the training data but performs poorly on the test data. An overfit model has low bias and high variance.

The aim is to achieve good training and test accuracy. The goal is to find the “sweet spot” between underfitting and overfitting so that the model can establish a dominant trend and apply it broadly to new datasets.
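Underfitting and overfitting can be seen by fitting polynomials of increasing degree to noisy data. The following is a minimal sketch: the sine-shaped ground truth, the noise level and the chosen degrees are assumptions picked purely for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)

def true_fn(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(np.pi * x)

# Small noisy training set and a larger noisy test set.
x_train = np.linspace(-1.0, 1.0, 20)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_test = np.linspace(-0.95, 0.95, 200)
y_test = true_fn(x_test) + rng.normal(0.0, 0.2, size=x_test.shape)

def mse(y, y_pred):
    return float(np.mean((y - y_pred) ** 2))

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree.
    coeffs = P.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(y_train, P.polyval(x_train, coeffs))
    test_err[degree] = mse(y_test, P.polyval(x_test, coeffs))
    print(f"degree {degree}: train MSE = {train_err[degree]:.4f}, "
          f"test MSE = {test_err[degree]:.4f}")
```

The training error can only go down as the degree grows, since each lower-degree model is a special case of the higher-degree one. The test error need not follow, and the gap between the two is the practical symptom of overfitting.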

Mathematics behind the curve in Figure 2

From equation (33) in Ghojogh and Crowley [1], the relationship between the true error \(Err_{true}\) and the test error \(err_{test}\) is

$$Err_{true} = err_{test} - n \sigma^2 + 2 \sigma^2 \sum_{i=1}^{n} \frac{\partial \hat{f}_i}{\partial y_i} \quad \quad (4)$$

where the last term in equation (4) measures the complexity of the model. This equation shows that the empirical test error alone is not a good representation of the true error, because the two differ by the complexity term. As the complexity of the model increases, the complexity term grows and eventually dominates, so the true error rises after some point (see Figure 2). This is the reason for overfitting.

Using optimization, the true error can be minimized [1]:

$$\text{minimize} \; (Err_{true}) = \text{minimize} \left\{ err_{test} - n \sigma^2 + 2 \sigma^2 \sum_{i=1}^{n} \frac{\partial \hat{f}_i}{\partial y_i} \right\}\quad \quad (5) $$

One way to avoid overfitting is to add a penalty function which is an approximation of the complexity term such that the penalty increases as the complexity of the model increases. This forms the basis for regularization.
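To make the complexity term concrete, consider a fit that is linear in the observations, \(\hat{y} = Hy\). Then \(\partial \hat{f}_i / \partial y_i = H_{ii}\), and the sum in equation (4) is the trace of \(H\). The sketch below assumes an L2-penalized (ridge) linear fit, discussed later in this article, with a made-up random design matrix, and shows how the penalty strength shrinks this term.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))   # hypothetical design matrix: 50 samples, 5 features

def complexity_term(X, lam):
    """Sum of d(y_hat_i)/d(y_i) for a ridge fit, i.e. the trace of the hat matrix."""
    d = X.shape[1]
    # Hat matrix H maps observations y to fitted values: y_hat = H @ y.
    hat = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    return float(np.trace(hat))

lams = (0.0, 1.0, 10.0, 100.0)
values = [complexity_term(X, lam) for lam in lams]
for lam, value in zip(lams, values):
    print(f"lambda = {lam:6.1f}  complexity term = {value:.3f}")
```

Without a penalty the term equals the number of free weights (5 here), and it decreases steadily as \(\lambda\) grows. This is one way to see why a penalty can act as a stand-in for the complexity term.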

General Form of a Regularized Optimization Problem

A regularized optimization problem is a mathematical formulation where a penalty term (regularization term) is added to the objective function to control the complexity of the solution [2].

The general form of a regularized optimization problem can be expressed as:

$$\min \; \mathcal{L}(\theta) + \lambda R(\theta) \quad \quad (6)$$

Where:

\(\mathcal{L}(\theta)\) is the loss function or data term, which measures the error between the model’s predictions and the observed data. \(\theta\) is the parameter of the model that we wish to optimize, and \((x,y)\) are the input to the model and the observed data respectively. Examples of loss functions include

  • Mean Squared Error (MSE) : \( \displaystyle{\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2}\)
  • Cross-Entropy Loss for classification problems.

\(R(\theta)\) is the regularization term, which encourages simpler models by penalizing certain properties of the model parameter \(\theta\), such as large coefficients or non-sparsity. The regularization term comes in different forms, such as

  • L1 Regularization
  • L2 Regularization
  • Elastic Net

\(\lambda\) is the regularization parameter that balances the importance of minimizing the loss function \(\mathcal{L}(\theta)\) against enforcing the regularization \(R(\theta)\). A higher \(\lambda\) emphasizes simplicity (stronger regularization), while a lower \(\lambda\) focuses on fitting the data.
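The structure of equation (6) maps directly onto code. The following is a small sketch assuming a linear model and an MSE data term; the function names (`regularized_objective`, `l1_penalty`, `l2_penalty`) and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def l1_penalty(theta):
    # Sum of absolute values of the parameters.
    return float(np.sum(np.abs(theta)))

def l2_penalty(theta):
    # Sum of squared parameters.
    return float(np.sum(theta ** 2))

def regularized_objective(theta, X, y, lam, penalty):
    """Equation (6): data loss plus lambda times a penalty on theta."""
    residual = y - X @ theta
    data_loss = float(np.mean(residual ** 2))   # MSE as the loss L(theta)
    return data_loss + lam * penalty(theta)

# Toy data that theta = [1, 2] fits exactly, so the data loss is zero
# and only the penalty contributes to the objective.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])

print(regularized_objective(theta, X, y, lam=0.0, penalty=l2_penalty))
print(regularized_objective(theta, X, y, lam=0.1, penalty=l1_penalty))
print(regularized_objective(theta, X, y, lam=0.1, penalty=l2_penalty))
```

Passing the penalty as a function keeps the data term and the regularizer separate, which mirrors how \(\mathcal{L}(\theta)\) and \(R(\theta)\) appear as independent pieces in equation (6).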

Types of Regularization

The following examples illustrate the different types of regularization applied to regression.

No Regularization (no penalty)

Consider a simple linear regression model that uses the Mean Squared Error (MSE) to update the model weights. The MSE is a loss function without a penalty:

$$L(w) = \sum_{i=1}^n \left( y_i - wx_i\right)^2 = \left\Vert y - Xw \right\Vert_2^2 \quad \quad (7)$$
Figure 3: A linear regression model using MSE loss function to update model weights

The search surface illustrating the optimization of a two weight system based on MSE is given in Figure 4.

Figure 4: MSE surface plot for a regression model with two weights
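For a two-weight system, the minimum of the unpenalized loss in equation (7) can be found directly with a least-squares solver. This is a minimal sketch; the synthetic data and the "true" weights are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic two-weight regression problem (values assumed for illustration).
X = rng.normal(size=(100, 2))
w_true = np.array([1.5, -0.5])
y = X @ w_true + rng.normal(0.0, 0.1, size=100)

def squared_error_loss(w, X, y):
    """Equation (7): sum of squared residuals, no penalty."""
    return float(np.sum((y - X @ w) ** 2))

# Least-squares solution: the bottom of the bowl-shaped loss surface.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient 2 X^T (Xw - y) vanishes at the minimizer.
gradient = 2.0 * X.T @ (X @ w_ols - y)

print("estimated weights:", w_ols)
print("loss at minimum  :", squared_error_loss(w_ols, X, y))
print("gradient norm    :", np.linalg.norm(gradient))
```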

L2 Regularization

One of the most popular regularization techniques in neural networks is L2 regularization (also known as weight decay). The weights of the network are still found using backpropagation, but instead of optimizing the cost function (for example, the squared error between predicted and target values) alone, we add a penalty function and optimize the sum of the two.

$$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \lambda \left\Vert \theta \right\Vert_{2}^2 \quad \quad (8) $$

This optimization problem, when applied to regression, is called Ridge regression [3]. Ridge regression minimizes the squared error with an L2 penalty.

L2 penalty

The L2 penalty is based on the L2 norm. The L2 norm is the Euclidean distance of a vector from the origin of the vector space. The L2 norm of a vector \(\mathbf{x}=[x_1,x_2, \cdots, x_n]\) is defined mathematically as \(\displaystyle{\left\Vert \mathbf{x} \right\Vert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}} \)

Applying the L2 penalty to the regression problem changes the loss function as follows:

$$\tilde{\mathcal{L}}(w) = \sum_{i=1}^n \left( y_i - wx_i\right)^2 + \lambda \sum_{j=0}^d w_j^2= \left\Vert y - Xw \right\Vert_2^2 + \lambda \left\Vert w \right\Vert_2^2 \quad \quad (9)$$

For a system with two weights, the penalty term is

$$ \lambda \left\Vert w \right\Vert_2^2 = \lambda \left( w_1^2 + w_2^2\right) \quad \quad (10)$$

The L2 penalty function for a regression system with two weights is plotted in Figure 5 for various values of \(\lambda\). The width of the bowl changes with \(\lambda\).

Figure 5: L2 regularization penalty for various \(\lambda\)
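For a linear model, the Ridge loss in equation (9) has a closed-form minimizer, \(w = (X^T X + \lambda I)^{-1} X^T y\). The sketch below uses it to show the shrinking effect of \(\lambda\). The data are synthetic, and the helper name `ridge_fit` is an assumption for this example; every weight is penalized and no intercept is fitted.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - Xw||_2^2 + lam * ||w||_2^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(5)

# Synthetic two-weight problem (values assumed for illustration).
X = rng.normal(size=(30, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(0.0, 0.5, size=30)

lams = (0.0, 1.0, 10.0, 100.0)
norms = []
for lam in lams:
    w = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda = {lam:6.1f}  w = {w}  ||w||_2 = {norms[-1]:.4f}")
```

As \(\lambda\) increases, the weights are pulled toward zero but are generally not set exactly to zero.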

L1 Regularization

L1 regularization minimizes the squared error with an L1 penalty, which is based on the L1 norm.

$$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \lambda \left\Vert \theta \right\Vert_{1} \quad \quad (11) $$

L1 norm

The L1 norm is a measure of the magnitude of a vector. It is defined as the sum of the absolute values of the vector’s components. The L1 norm of a vector \(\mathbf{x}=[x_1,x_2, \cdots, x_n]\) is defined mathematically as \(\displaystyle{\left\Vert \mathbf{x} \right\Vert_1 =\sum_{i=1}^{n} |x_i|} \)

L1 regularization, when applied to regression problems, is called Lasso regression [4]. The loss function for L1 regularization is

$$\tilde{\mathcal{L}}(w) = \sum_{i=1}^n \left( y_i - wx_i\right)^2 + \lambda \sum_{j=0}^d |w_j|= \left\Vert y - Xw \right\Vert_2^2 + \lambda \left\Vert w \right\Vert_1 \quad \quad (12)$$

The surface plot of L1 penalty term for a regression system with two weights for \(\lambda=0.5\) is shown in Figure 6.

Figure 6: L1 Regularization penalty for a regression system with two weights \(\lambda = 0.5\)
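Unlike Ridge regression, the Lasso loss in equation (12) has no general closed-form solution, because the L1 norm is not differentiable at zero. One common approach is proximal gradient descent (ISTA), which alternates a gradient step on the squared error with a soft-thresholding step. The sketch below is one possible implementation, not the method used by any particular library; the function names and the synthetic data are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t, setting small entries to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize ||y - Xw||_2^2 + lam * ||w||_1 by proximal gradient descent."""
    w = np.zeros(X.shape[1])
    # Step size 1/L, where L = 2 * sigma_max(X)^2 is the Lipschitz constant
    # of the gradient of the squared-error term.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)                   # gradient of the squared error
        w = soft_threshold(w - step * grad, step * lam)  # proximal step for the L1 term
    return w

rng = np.random.default_rng(6)

# Synthetic problem where only two of five features matter (assumed for the demo).
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ w_true + rng.normal(0.0, 0.1, size=100)

w_lasso = lasso_ista(X, y, lam=20.0)
print("lasso weights:", w_lasso)
print("non-zero weights:", np.count_nonzero(w_lasso))
```

The soft-thresholding step is what produces exact zeros, which is the mechanism behind the sparsity associated with the L1 penalty.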

Elastic Net

Elastic Net Regression is a regularized regression technique that combines the strengths of both Lasso Regression (L1 penalty) and Ridge Regression (L2 penalty) [5]. The complete Elastic Net loss function is

$$\tilde{\mathcal{L}}(w) = \left\Vert y - Xw \right\Vert_2^2 + \alpha \cdot \lambda \cdot \left\Vert w \right\Vert_1 + (1- \alpha) \cdot \frac{\lambda}{2} \cdot \left\Vert w \right\Vert_2^2 \quad \quad (13)$$

where,

  • \(\lambda\) is the regularization strength. It controls the overall amount of regularization.
    • A larger \(\lambda\) increases the regularization strength, leading to smaller weights and potentially more feature selection.
    • A smaller \(\lambda\) reduces regularization, making the model closer to standard linear regression.
  • \(\alpha\) is the mixing parameter. It determines the balance between L1 and L2 regularization (\(0 \leq \alpha \leq 1\)).
    • \(\alpha =0\): Elastic Net becomes Ridge regression (L2 regularization).
    • \(\alpha = 1\): Elastic Net becomes Lasso regression (L1 regularization).
    • \(0 < \alpha < 1\): Elastic Net combines the benefits of both L1 and L2 regularization.

Elastic Net is particularly useful when dealing with datasets where features are highly correlated or when there are more predictors than observations. The L1 norm \(\left\Vert w \right\Vert_1\) encourages sparsity by shrinking some coefficients to exactly zero. The squared L2 norm \(\left\Vert w \right\Vert_2^2\) shrinks all coefficients but does not eliminate them.

By combining Lasso (L1 regularization) and Ridge (L2 regularization) regressions, Elastic Net regression gains some of the stability that Ridge regression offers when the data is rotated or transformed.

The surface plot of the penalty term for an Elastic Net regression system with two weights, for various values of \(\lambda\) and \(\alpha\), is shown in Figure 7. Observe how the shape of the Elastic Net penalty surface changes as \(\lambda\) and \(\alpha\) vary. When \(\alpha\) is close to 0, the surface resembles the L2 penalty (a bowl shape). When \(\alpha\) is close to 1, the surface resembles the L1 penalty (a diamond shape). Varying \(\lambda\) controls the overall steepness of the penalty.

Figure 7: Regularization penalty for an Elastic Net regression system with two weights for various values of \(\lambda\) and \(\alpha\).
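The penalty part of equation (13) is simple to evaluate, which makes the role of \(\alpha\) easy to inspect. The following sketch uses a hypothetical helper `elastic_net_penalty` and an arbitrary example weight vector.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """Penalty terms of equation (13): alpha*lam*||w||_1 + (1-alpha)*(lam/2)*||w||_2^2."""
    l1 = np.sum(np.abs(w))
    l2_squared = np.sum(w ** 2)
    return float(alpha * lam * l1 + (1.0 - alpha) * 0.5 * lam * l2_squared)

w = np.array([3.0, -4.0])   # arbitrary example weights: ||w||_1 = 7, ||w||_2^2 = 25

for alpha in (0.0, 0.5, 1.0):
    value = elastic_net_penalty(w, lam=0.5, alpha=alpha)
    print(f"alpha = {alpha:.1f}  penalty = {value}")
```

With \(\alpha = 0\) only the squared L2 term remains, with \(\alpha = 1\) only the L1 term remains, and intermediate values blend the two.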

Explicit and Implicit Regularization

The regularization methods seen so far involve adding a specific term to the optimization problem. This is explicit regularization. This term imposes a cost on the objective function to make the optimal solution unique.

Implicit regularization includes techniques like early stopping, using a robust loss function, and discarding outliers. These methods indirectly control model complexity without adding explicit penalty terms to the loss function [6].
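Early stopping can be sketched with plain gradient descent on an unpenalized squared-error loss. Starting from zero weights, the iterates grow in norm as training proceeds, so stopping early keeps the weights small without any penalty term. The data, the patience value and the validation split below are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic problem: 20 features, only 3 of which matter (assumed for the demo).
n, d = 40, 20
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(0.0, 1.0, size=n)
X_val = rng.normal(size=(200, d))
y_val = X_val @ w_true + rng.normal(0.0, 1.0, size=200)

def val_mse(w):
    return float(np.mean((y_val - X_val @ w) ** 2))

# Step size 1/L for the loss ||y - Xw||^2, with L = 2 * sigma_max(X)^2.
step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)

w = np.zeros(d)
best_w, best_err = w.copy(), val_mse(w)
patience, wait = 20, 0     # hypothetical patience: stop after 20 non-improving steps
norms = []

for iteration in range(5000):
    w = w - step * 2.0 * X.T @ (X @ w - y)   # plain gradient step, no penalty term
    norms.append(float(np.linalg.norm(w)))
    err = val_mse(w)
    if err < best_err:
        best_w, best_err, wait = w.copy(), err, 0
    else:
        wait += 1
        if wait >= patience:
            break

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("iterations run         :", len(norms))
print("||w|| at early stopping:", np.linalg.norm(best_w))
print("||w|| of full LS fit   :", np.linalg.norm(w_ols))
```

The early-stopped weights have a norm no larger than that of the fully converged least-squares fit, which is similar in spirit to the shrinkage produced by an explicit L2 penalty.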

Summary

In summary, regularized optimization modifies the loss function to include a penalty term that discourages complex models. The choice of regularization technique and the value of the regularization parameter (\(\lambda\)) are crucial for balancing model fit and generalization performance. Cross-validation is often used to select the optimal \(\lambda\).
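Selecting \(\lambda\) by cross-validation can be sketched as follows, assuming a Ridge model with its closed-form solution and a simple k-fold split. The helper names, the candidate grid and the synthetic data are assumptions for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - Xw||_2^2 + lam * ||w||_2^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Average held-out MSE over k contiguous folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for fold in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[fold] = False                  # hold this fold out
        w = ridge_fit(X[train_mask], y[train_mask], lam)
        errors.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(7)

# Synthetic data: 10 features, 3 informative (assumed for the demo).
X = rng.normal(size=(60, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(0.0, 0.5, size=60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]   # hypothetical candidate grid
cv_errors = [cv_error(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]

for lam, err in zip(lambdas, cv_errors):
    print(f"lambda = {lam:7.2f}  CV MSE = {err:.4f}")
print("selected lambda:", best_lam)
```

In practice the grid is usually spaced logarithmically, as here, because the useful range of \(\lambda\) can span several orders of magnitude.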

    Citations:

    1. Ghojogh, Benyamin, and Mark Crowley. “The theory behind overfitting, cross validation, regularization, bagging, and boosting: tutorial.” 2019, arXiv:1905.12787.
    2. Burnham, K.P. and Anderson, D.R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd Edition, Springer-Verlag, New York. https://doi.org/10.1007/b97636
    3. Hoerl, Arthur E., and Robert W. Kennard. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics, vol. 12, no. 1, 1970, pp. 55–67. JSTOR, https://doi.org/10.2307/1267351. Accessed 1 Mar. 2025.
    4. Tibshirani, Robert. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, 1996, pp. 267–88. JSTOR, http://www.jstor.org/stable/2346178. Accessed 1 Mar. 2025.
    5. Zou, Hui, and Trevor Hastie. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society. Series B (Statistical Methodology), vol. 67, no. 2, 2005, pp. 301–20. JSTOR, http://www.jstor.org/stable/3647580. Accessed 1 Mar. 2025.
    6. Behnam Neyshabur, “Implicit Regularization in Deep Learning”, PhD Thesis, 2017, [arXiv:1709.01953]

    Published by

    Mathuranathan

    Mathuranathan Viswanathan, is an author @ gaussianwaves.com that has garnered worldwide readership. He is a masters in communication engineering and has 12 years of technical expertise in channel modeling and has worked in various technologies ranging from read channel, OFDM, MIMO, 3GPP PHY layer, Data Science & Machine learning.
