Why regularization
Machine learning is all about fitting a model to data. To generalize well to new, unseen data (outside of the training set), the model should not be too complex. If the model is too complex, it will start to fit the noise in the data rather than the signal. This is where regularization comes in.
What is regularization
To understand regularization, we first need to understand overfitting. Overfitting occurs when our model is too complex and/or too specific to the training data, so it doesn’t generalize well to new data. This often happens when we have many features relative to the number of training examples, or when our features are highly correlated with each other.
Regularization is a technique that helps us avoid overfitting by penalizing certain types of complexity in our model. For example, L1 regularization penalizes models based on the sum of the absolute values of their coefficients, while L2 regularization penalizes models based on the sum of the squares of their coefficients.
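To make the two penalties concrete, here is a minimal sketch in plain Python. The function names and the parameter name lam (for the regularization parameter) are illustrative choices, not part of any particular library.

```python
def l1_penalty(coefficients, lam):
    """L1 penalty: lam times the sum of the absolute coefficient values."""
    return lam * sum(abs(w) for w in coefficients)


def l2_penalty(coefficients, lam):
    """L2 penalty: lam times the sum of the squared coefficient values."""
    return lam * sum(w * w for w in coefficients)


def regularized_loss(training_error, coefficients, lam, kind="l2"):
    """Training error plus the chosen penalty: the quantity we minimize."""
    if kind == "l1":
        return training_error + l1_penalty(coefficients, lam)
    if kind == "l2":
        return training_error + l2_penalty(coefficients, lam)
    raise ValueError("kind must be 'l1' or 'l2'")
```

With lam set to 0 both penalties vanish and the regularized loss is just the training error; the larger lam is, the more large coefficients cost.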
The regularization parameter controls the trade-off between our two goals: fitting the training data well (i.e., minimizing the training error) and generalizing well to new data (i.e., minimizing the test error). A smaller regularization parameter means that we are more concerned with fitting the training data well, while a larger regularization parameter means that we penalize complexity more heavily in the hope of generalizing better. If the parameter is too large, the model becomes too simple and fits neither the training data nor new data well.
The trade-off
Regularization is a trade-off between simplicity and performance. We want our model to be both accurate and simple. Unfortunately, these two goals are often at odds with each other. More accuracy on the training data usually requires a more complex model, while a simpler model often fits the training data less closely.
This is closely related to the bias-variance trade-off, one of the fundamental challenges in machine learning. A model that is too simple tends to have high bias (it underfits), while a model that is too complex tends to have high variance (it overfits). In a nutshell, the goal of regularization is to find the right balance between these two competing needs.
There are two main types of regularization: L1 and L2. Each has its own advantages and disadvantages, which we will explore in more detail below.
The two goals
In machine learning, we typically have two conflicting goals:
Goal 1: Fit the training data well
We want our hypothesis function hθ(x) to accurately predict the values of y for the training examples x. Ideally, we would like our hθ(x) to predict the value of y exactly for every x in our training set. However, it is often impossible to find a hypothesis function hθ(x) that fits the training data perfectly. One reason is that a simple model may not be able to describe the relationship between x and y; another is that the data usually contain noise. A model with more parameters can potentially describe this relationship much better, but at the cost of increased complexity.
Goal 2: Generalize well to new examples
We don’t only care about how well our hypothesis function hθ(x) does on the training data. If we could find some hθ(x) that predicted whether a microchip was defective without looking at any training data (just common sense), then we would say that it generalized quite well! Unfortunately, this is rarely possible, and we usually have to settle for a “slightly” inaccurate hθ(x) learned from data. Our objective is to find the best hypothesis function hθ(x): the one that makes the fewest mistakes on new data (outside of our training set).
In statistics and machine learning, the goal is often to learn a model from data that can be used to make predictions about unseen data. This is known as generalization. For example, if we are building a machine learning model to predict the price of a house based on square footage, we want our model to generalize from the training data to unseen data (e.g., future houses). We don’t want our model to just memorize the training data.
In order to measure how well our model is generalizing, we split our data into two parts: a training set and a test set. The training set is used to fit the model (i.e., find the best parameters), and the test set is used to evaluate how well the model performs on unseen data.
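The split itself can be sketched in a few lines of plain Python. The function name train_test_split, the 20% test fraction, and the house data below are hypothetical, chosen only for illustration.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them as the test set."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(examples)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (square footage, price) pairs, made up for this example.
houses = [(800 + 100 * i, 100_000 + 12_000 * i) for i in range(10)]
train_set, test_set = train_test_split(houses)
```

The model is fit using only train_set; test_set is kept aside until the end so that the error measured on it reflects data the model has never seen.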
The goal of regularization is to find a balance between two competing objectives:
- Fit the training data well (i.e., minimize the training error)
- Generalize well to unseen data (i.e., minimize the test error)
The regularization parameter
The regularization parameter is a hyperparameter that controls the trade-off between our two goals: minimizing the training error and minimizing the generalization error. This trade-off is closely tied to the bias-variance trade-off.
How the regularization parameter controls the trade-off
A regularization parameter is a control variable that we can use to tune the performance of our machine learning models. It allows us to trade off between two competing goals:
- Goal 1: We want our model to have low bias.
- Goal 2: We want our model to have low variance.
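If we increase the value of the regularization parameter, we are effectively telling our algorithm to give more weight to goal 2 (low variance) and less weight to goal 1 (low bias): the coefficients are pushed toward zero, so the model becomes simpler and less sensitive to the particular training set, but it may underfit. Conversely, if we decrease the value of the regularization parameter, we are telling our algorithm to give more weight to goal 1 and less weight to goal 2: the model can fit the training data more closely, but it may overfit.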
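The effect of the parameter is easiest to see in the simplest possible case. The sketch below assumes a model with a single coefficient w, no intercept, and an L2 penalty, for which the best w has a closed form; the names fit_ridge_1d and lam and the data points are made up for illustration.

```python
def fit_ridge_1d(xs, ys, lam):
    """Return the w that minimizes sum((w*x - y)**2) + lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mean_squared_error(w, xs, ys):
    """Average squared prediction error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 10.0, 100.0):
    w = fit_ridge_1d(xs, ys, lam)
    error = mean_squared_error(w, xs, ys)
    print(f"lam={lam:6.1f}  w={w:.3f}  training MSE={error:.3f}")
```

As lam grows, w shrinks toward zero and the training error rises. This shows only the cost side of the trade-off (a worse fit to the training data); the benefit, lower sensitivity to the particular training set, only shows up when the fit is repeated on different samples.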
How to choose the regularization parameter
The regularization parameter is a hyperparameter of the model that controls the trade-off between our two goals: minimizing the training error and minimizing the generalization error. The bigger the regularization parameter is, the more heavily we penalize complexity and the less we care about minimizing the training error, and vice versa.
There are a few different ways to choose the regularization parameter:
- Trial and error: We can hold out part of the training data as a validation set, train the model with different values of the regularization parameter on the rest, and see which value gives us the best results on the validation set.
- Cross-validation: We can split the training set into k parts (folds), train the model on k-1 folds and validate it on the remaining fold, repeat this so that each fold serves as the validation set once, and choose the value of the regularization parameter with the best average validation result.
- Analytical approach: We can use mathematical analysis to derive an expression for the generalization error as a function of the regularization parameter, and then choose the value of the regularization parameter that minimizes this function.
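As a rough sketch of the cross-validation option, the code below tries a few candidate values and keeps the one with the lowest average validation error over k folds. It assumes the same toy setting of a single coefficient with an L2 penalty; the function names, the candidate values, and the data are all hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Return the w that minimizes sum((w*x - y)**2) + lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mean_squared_error(w, xs, ys):
    """Average squared prediction error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_cv_error(xs, ys, lam, k=3):
    """Average validation error over k folds for one value of lam."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        val_idx = set(range(fold, n, k))   # every k-th example is held out
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        val_x = [xs[i] for i in sorted(val_idx)]
        val_y = [ys[i] for i in sorted(val_idx)]
        w = fit_ridge_1d(train_x, train_y, lam)   # fit on k-1 folds
        total += mean_squared_error(w, val_x, val_y)   # score on the held-out fold
    return total / k


def choose_lambda(xs, ys, candidates, k=3):
    """Return the candidate with the lowest cross-validation error, plus all errors."""
    errors = {lam: k_fold_cv_error(xs, ys, lam, k) for lam in candidates}
    best = min(errors, key=errors.get)
    return best, errors


# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
best_lam, cv_errors = choose_lambda(xs, ys, candidates=[0.0, 0.1, 1.0, 10.0, 100.0])
```

The test set plays no part in this search. It stays untouched until a value has been chosen, so that the final test error is still an honest estimate of how the model does on unseen data.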