Cenk Yildiran

Writing about lots of unrelated topics…


Linear Regression and Gradient Descent

Linear Regression Model

In a simple linear regression problem, our goal is to fit a straight line to data points (x^{(i)}, y^{(i)}) that best predicts the output y from an input x.
The model is expressed as:

f_{w,b}(x) = wx + b

  • w — the weight (slope of the line)
  • b — the bias (intercept)
  • f_{w,b}(x) — the predicted value of y for a given x

The objective of linear regression is to find parameters w and b that make f_{w,b}(x) as close as possible to the actual y values in the training data.
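The model above can be sketched in a few lines of code; the parameter values below are hypothetical, chosen only for illustration:

```python
def predict(x, w, b):
    """Linear model f_{w,b}(x) = w*x + b."""
    return w * x + b

# With hypothetical parameters w = 2, b = 1, the input x = 3 maps to 7:
print(predict(3.0, w=2.0, b=1.0))  # 7.0
```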


Squared Error Cost Function

To measure how well our model fits the data, we use the Mean Squared Error (MSE) cost function:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} ( f_{w,b}(x^{(i)}) - y^{(i)} )^2

Substituting f_{w,b}(x) = wx + b gives:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} )^2

Here, m is the number of training examples. The goal is to find the values of w and b that minimize this cost.
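A direct translation of the cost formula (using the same 1/(2m) convention as above, with small made-up data) looks like this:

```python
def compute_cost(xs, ys, w, b):
    """Mean squared error cost J(w, b) with the 1/(2m) convention."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A perfect fit to data generated by y = 2x gives zero cost:
print(compute_cost([1, 2, 3], [2, 4, 6], w=2.0, b=0.0))  # 0.0
```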


Gradient Descent Algorithm

Gradient Descent is an iterative optimization algorithm that finds the parameters (w,b) minimizing J(w,b) by repeatedly updating them in the direction opposite to the gradient:

$latex
\begin{aligned}
w &:= w - \alpha \frac{\partial J(w,b)}{\partial w} \\
b &:= b - \alpha \frac{\partial J(w,b)}{\partial b}
\end{aligned}
$

where \alpha is the learning rate controlling how large a step is taken on each iteration. Both gradients are evaluated at the current (w,b), so w and b are updated simultaneously.


Gradient Derivations

To implement gradient descent, we need the partial derivatives of J(w,b) with respect to both w and b.

Partial Derivative with Respect to w

Start with:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2

Differentiate with respect to w:

\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} ) x^{(i)}

Partial Derivative with Respect to b

Similarly, differentiating with respect to b:

\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} )
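The two derivatives can be computed together in a single pass over the data. A minimal sketch (function name is mine, not from the original):

```python
def compute_gradients(xs, ys, w, b):
    """Partial derivatives of J(w,b) with respect to w and b."""
    m = len(xs)
    # Prediction errors f_{w,b}(x^(i)) - y^(i) for each training example.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
    dj_db = sum(errors) / m
    return dj_dw, dj_db

# At the true parameters of y = 2x, both gradients vanish:
print(compute_gradients([1, 2, 3], [2, 4, 6], w=2.0, b=0.0))  # (0.0, 0.0)
```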


Gradient Descent Update Rules

Substituting the gradients into the update equations gives:

$latex
\begin{aligned}
w &:= w - \alpha \frac{1}{m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} ) x^{(i)} \\
b &:= b - \alpha \frac{1}{m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} )
\end{aligned}
$

These equations iteratively adjust w and b to minimize the cost function.
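Putting the update rules into a loop gives the full algorithm. This is a minimal sketch with a fixed iteration count and hypothetical data generated from y = 2x + 1; a production version would also monitor the cost for convergence:

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Run batch gradient descent on the MSE cost, starting from w = b = 0."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(iters):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        # Simultaneous update: both gradients use the pre-update (w, b).
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# Data from y = 2x + 1; the fit should recover w ≈ 2, b ≈ 1.
w, b = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])
print(round(w, 3), round(b, 3))
```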



Understanding Gradient Descent Behavior

To build intuition for gradient descent, consider the cost surface J(w,b).
For linear regression, J(w,b) is a smooth, bowl-shaped surface in 3D space.
Each point on this surface corresponds to a different pair of parameters (w,b).

Gradient descent starts at some initial guess (w_0,b_0) and moves iteratively down the slope of the cost surface until it reaches the bottom — the global minimum.

In general machine learning problems, gradient descent can sometimes converge to a local minimum instead of a global minimum.
However, for linear regression, the cost function:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} ( wx^{(i)} + b - y^{(i)} )^2

is a convex function.
This means it has a single global minimum and no local minima — gradient descent will always converge to that global minimum if the learning rate \alpha is chosen appropriately.


Convexity and Convergence

Definition of Convexity:

J(\theta_1 x_1 + \theta_2 x_2) \leq \theta_1 J(x_1) + \theta_2 J(x_2), \quad \forall \theta_1, \theta_2 \ge 0, \ \theta_1 + \theta_2 = 1

Intuitively, this means the line segment connecting any two points on the graph of J lies on or above the function itself.
For convex functions, gradient descent is guaranteed to reach the global minimum if \alpha is properly chosen.

  • The cost function in linear regression is convex.
  • Gradient descent converges to the global minimum (there are no local minima to get stuck in).
  • The choice of \alpha affects convergence speed and stability.
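The role of \alpha can be seen in a small experiment (the data and step sizes below are hypothetical): a modest step size shrinks the cost on every iteration, while an overly large one overshoots the minimum and makes the cost grow.

```python
def step(xs, ys, w, b, alpha):
    """One gradient descent step on the MSE cost."""
    m = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    w -= alpha * sum(e * x for e, x in zip(errors, xs)) / m
    b -= alpha * sum(errors) / m
    return w, b

def cost(xs, ys, w, b):
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [2, 4, 6]

# Small alpha: the cost decreases monotonically.
w, b = 1.0, 0.0
for _ in range(5):
    w2, b2 = step(xs, ys, w, b, alpha=0.1)
    assert cost(xs, ys, w2, b2) < cost(xs, ys, w, b)
    w, b = w2, b2

# Too-large alpha: a single step overshoots and the cost increases.
wd, bd = step(xs, ys, 1.0, 0.0, alpha=1.0)
```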

Visualization (Conceptual)

Imagine the cost function surface as a smooth 3D bowl:

  • The horizontal axes represent parameters w and b.
  • The vertical axis represents the cost J(w,b).
  • The global minimum is at the bottom of the bowl.

Gradient descent starts at some initial point and takes small steps downhill until it reaches the minimum.

Figure: Visualization of the convex cost function surface J(w,b) — a bowl-shaped curve with a single global minimum.




