# Machine Learning - Cost Function
This article explains what a cost function is in machine learning.
## Cost Function
In machine learning, the cost function plays a pivotal role. It is a mathematical formula that measures the performance of a machine learning model on a given dataset. The cost function quantifies the error between predicted values and expected values and presents it as a single real number.
Depending on the problem, this could be the difference between the predicted class and the actual class in classification problems, or the difference between estimated values and actual values in regression problems.
When training a model, our goal is to find the best parameters (coefficients or weights) that minimize the cost function.
Let’s take a look at the following example:
| size in feet\(^2\) (x) | price in $1000’s (y) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |
The chosen model for our training set is a simple linear regression model defined as \(f_{w,b}(x) = wx+b\) .
Here, \(w,b\) are the parameters of the model, which we will adjust to fit our model to the data best. These parameters are crucial as they will determine the slope of the line (weight \(w\)) and the point where the line crosses the y-axis (bias \(b\)).
The objective during training is to find the optimal values for these parameters so that our model’s predictions, \( \hat{y} \), are as close as possible to the actual outcomes, \(y\), for all provided training examples \( (x^{(i)}, y^{(i)}) \).
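As a minimal sketch, the model and its predictions on the training set above can be written out directly. The parameter values `w = 0.2` and `b = 50.0` are arbitrary illustrative choices, not fitted values:

```python
# Training examples from the table above.
x_train = [2104, 1416, 1534, 852]   # size in square feet
y_train = [460, 232, 315, 178]      # price in $1000's

def predict(x, w, b):
    """Linear model prediction: f_{w,b}(x) = w*x + b."""
    return w * x + b

# Hypothetical parameters, chosen only to illustrate the model.
w, b = 0.2, 50.0
y_hat = [predict(x, w, b) for x in x_train]  # predictions ŷ for each example
```

Each entry of `y_hat` is the model's estimate \( \hat{y}^{(i)} \) for the corresponding \( x^{(i)} \); training will adjust `w` and `b` so these estimates move closer to `y_train`.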
To achieve this, we introduce a cost function, more specifically, the Squared Error Cost Function, which is represented as:
\(J(w,b) = \frac{1}{2m} \displaystyle\sum_{i=1}^m(f_{w,b}(x^{(i)})-y^{(i)})^2 \)
Where:
- \(J(w,b)\) is the cost function.
- \(m\) is the number of training examples.
- \( \hat{y}^{(i)} = f_{w,b}(x^{(i)}) \) is the prediction of the model for the \(i^{th}\) example.
- \( y^{(i)}\) is the actual value for the \(i^{th}\) example.
The factor of \(\frac{1}{2}\) is used to simplify the derivative calculation of the cost function. Our goal in training is to minimize \(J(w,b)\), which will result in the best possible values for \(w\) and \(b\).
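The formula above translates directly into code. This is a straightforward sketch of the squared error cost on the training examples from the table; the parameter values passed in are illustrative, not optimal:

```python
def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = (1/2m) * sum((w*x_i + b - y_i)^2)."""
    m = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        err = (w * xi + b) - yi  # prediction error for one example
        total += err ** 2
    return total / (2 * m)

x_train = [2104, 1416, 1534, 852]
y_train = [460, 232, 315, 178]

# Cost for one hypothetical parameter setting (w=0.2, b=50).
cost = compute_cost(x_train, y_train, w=0.2, b=50.0)
```

Minimizing this value over `w` and `b` is exactly the training objective described above.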
Plotting \(f_{w,b}(x)\) for different values of \(w\) and \(b\) produces different candidate lines through the training data.
By adjusting \(w\) and \(b\), we can observe how the line fits the data points. The better the fit, the lower our cost function value, and hence the more accurate predictions our model will make.
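The link between fit quality and cost can be checked numerically. In this sketch, two hypothetical parameter settings are compared on the training data; the closer fit yields the smaller value of \(J\):

```python
def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = (1/2m) * sum((w*x_i + b - y_i)^2)."""
    m = len(x)
    return sum(((w * xi + b) - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

x_train = [2104, 1416, 1534, 852]
y_train = [460, 232, 315, 178]

# Two illustrative parameter settings: a rough guess and a closer fit.
rough = compute_cost(x_train, y_train, w=0.1, b=0.0)
closer = compute_cost(x_train, y_train, w=0.2, b=0.0)
# The line that tracks the data more closely has the lower cost.
```

Here `closer < rough`, matching the intuition that a better-fitting line drives the cost down.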