Cenk Yildiran

Writing about lots of unrelated topics…


Linear Regression and Gradient Descent

Linear Regression Model

In a simple linear regression problem, our goal is to fit a straight line to data points (x^{(i)}, y^{(i)}) that best predicts the output y from an input x.
The model is expressed as:

f_{w,b}(x) = wx + b

  • w — the weight (slope of the line)
  • b — the bias (intercept)
  • f_{w,b}(x) — the predicted value of y for a given x

The objective of linear regression is to find parameters w and b that make f_{w,b}(x) as close as possible to the actual y values in the training data.
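The model above can be sketched in a few lines of code; the parameter values below are hypothetical, chosen only for illustration:

```python
def predict(x, w, b):
    """Linear model f_{w,b}(x) = w*x + b."""
    return w * x + b

# With hypothetical parameters w = 2, b = 1, the input x = 3 maps to 7:
print(predict(3.0, w=2.0, b=1.0))  # 7.0
```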


Squared Error Cost Function

To measure how well our model fits the data, we use the Mean Squared Error (MSE) cost function:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} ( f_{w,b}(x^{(i)}) - y^{(i)} )^2

Substituting f_{w,b}(x) = wx + b gives:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} )^2

Here, m is the number of training examples. The goal is to find the values of w and b that minimize this cost.
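A direct translation of the cost formula (using the same 1/(2m) convention as above, with small made-up data) looks like this:

```python
def compute_cost(xs, ys, w, b):
    """Mean squared error cost J(w, b) with the 1/(2m) convention."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A perfect fit to data generated by y = 2x gives zero cost:
print(compute_cost([1, 2, 3], [2, 4, 6], w=2.0, b=0.0))  # 0.0
```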


Gradient Descent Algorithm

Gradient Descent is an iterative optimization algorithm that finds the parameters (w,b) minimizing J(w,b) by repeatedly updating them in the direction opposite to the gradient:

$latex
\begin{aligned}
w &:= w - \alpha \frac{\partial J(w,b)}{\partial w} \\
b &:= b - \alpha \frac{\partial J(w,b)}{\partial b}
\end{aligned}
$

where \alpha is the learning rate controlling how large a step is taken on each iteration. Both gradients are evaluated at the current (w,b), so w and b are updated simultaneously.


Gradient Derivations

To implement gradient descent, we need the partial derivatives of J(w,b) with respect to both w and b.

Partial Derivative with Respect to w

Start with:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2

Differentiate with respect to w:

\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} ) x^{(i)}

Partial Derivative with Respect to b

Similarly, differentiating with respect to b:

\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} )
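The two derivatives can be computed together in a single pass over the data. A minimal sketch (function name is mine, not from the original):

```python
def compute_gradients(xs, ys, w, b):
    """Partial derivatives of J(w,b) with respect to w and b."""
    m = len(xs)
    # Prediction errors f_{w,b}(x^(i)) - y^(i) for each training example.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
    dj_db = sum(errors) / m
    return dj_dw, dj_db

# At the true parameters of y = 2x, both gradients vanish:
print(compute_gradients([1, 2, 3], [2, 4, 6], w=2.0, b=0.0))  # (0.0, 0.0)
```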


Gradient Descent Update Rules

Substituting the gradients into the update equations gives:

$latex
\begin{aligned}
w &:= w - \alpha \frac{1}{m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} ) x^{(i)} \\
b &:= b - \alpha \frac{1}{m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} )
\end{aligned}
$

These equations iteratively adjust w and b to minimize the cost function.
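Putting the update rules into a loop gives the full algorithm. This is a minimal sketch with a fixed iteration count and hypothetical data generated from y = 2x + 1; a production version would also monitor the cost for convergence:

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Run batch gradient descent on the MSE cost, starting from w = b = 0."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(iters):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        # Simultaneous update: both gradients use the pre-update (w, b).
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# Data from y = 2x + 1; the fit should recover w ≈ 2, b ≈ 1.
w, b = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])
print(round(w, 3), round(b, 3))
```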



Understanding Gradient Descent Behavior

To build intuition for gradient descent, consider the cost surface J(w,b).
For linear regression, J(w,b) is a smooth, bowl-shaped surface in 3D space.
Each point on this surface corresponds to a different pair of parameters (w,b).

Gradient descent starts at some initial guess (w_0,b_0) and moves iteratively down the slope of the cost surface until it reaches the bottom — the global minimum.

In general machine learning problems, gradient descent can sometimes converge to a local minimum instead of a global minimum.
However, for linear regression, the cost function:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} )^2

is a convex function.
This means it has a single global minimum and no local minima — gradient descent will always converge to that global minimum if the learning rate \alpha is chosen appropriately.


Convexity and Convergence

Definition of Convexity:

J(\theta_1 x_1 + \theta_2 x_2) \leq \theta_1 J(x_1) + \theta_2 J(x_2), \quad \forall \theta_1, \theta_2 \ge 0, \ \theta_1 + \theta_2 = 1

Intuitively, this means the line segment connecting any two points on the graph of J lies on or above the function itself.
For convex functions, gradient descent is guaranteed to reach the global minimum if \alpha is properly chosen.

  • The cost function in linear regression is convex.
  • Gradient descent converges to the global minimum (there are no local minima to get stuck in).
  • The choice of \alpha affects convergence speed and stability.
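The role of \alpha can be seen in a small experiment (the data and step sizes below are hypothetical): a modest step size shrinks the cost on every iteration, while an overly large one overshoots the minimum and makes the cost grow.

```python
def step(xs, ys, w, b, alpha):
    """One gradient descent step on the MSE cost."""
    m = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    w -= alpha * sum(e * x for e, x in zip(errors, xs)) / m
    b -= alpha * sum(errors) / m
    return w, b

def cost(xs, ys, w, b):
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [2, 4, 6]

# Small alpha: the cost decreases monotonically.
w, b = 1.0, 0.0
for _ in range(5):
    w2, b2 = step(xs, ys, w, b, alpha=0.1)
    assert cost(xs, ys, w2, b2) < cost(xs, ys, w, b)
    w, b = w2, b2

# Too-large alpha: a single step overshoots and the cost increases.
wd, bd = step(xs, ys, 1.0, 0.0, alpha=1.0)
```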

Visualization (Conceptual)

Imagine the cost function surface as a smooth 3D bowl:

  • The horizontal axes represent parameters w and b.
  • The vertical axis represents the cost J(w,b).
  • The global minimum is at the bottom of the bowl.

Gradient descent starts at some initial point and takes small steps downhill until it reaches the minimum.

Figure: Visualization of the convex cost function surface J(w,b) — a bowl-shaped curve with a single global minimum.




