5.2 Least Squares Linear Regression

In this Section we formally describe the problem of linear regression: the fitting of a representative line (or, in higher dimensions, a hyperplane) to a set of input/output data points. Regression in general may be performed for a variety of reasons: to produce a so-called trend line (or, more generally, a curve) that helps visually summarize or drive home a particular point about the data under study, or to learn a model from which precise predictions can be made regarding future output values.

Data for regression problems comes in the form of a set of $P$ input/output observation pairs

\begin{equation} \left(\mathbf{x}_{1},y_{1}\right),\,\left(\mathbf{x}_{2},y_{2}\right),\,...,\,\left(\mathbf{x}_{P},y_{P}\right)\end{equation}

or $\left\{ \left(\mathbf{x}_{p},y_{p}\right)\right\} _{p=1}^{P}$ for short, where $\mathbf{x}_{p}$ and $y_{p}$ denote the $p^{\textrm{th}}$ input and output respectively. In simple instances the input is scalar-valued (the output will always be considered scalar-valued here), and hence the linear regression problem is, geometrically speaking, one of fitting a line to the associated scatter of data points in 2-dimensional space. In general, however, each input $\mathbf{x}_{p}$ may be a column vector of length $N$

\begin{equation}\mathbf{x}_{p}=\begin{bmatrix}x_{1,p}\\x_{2,p}\\\vdots\\x_{N,p}\end{bmatrix} \end{equation}

in which case the linear regression problem is analogously one of fitting a hyperplane to a scatter of points in $N+1$ dimensional space.

In the case of scalar input, fitting a line to the data requires that we determine a vertical intercept $w_0$ and slope $w_1$ so that the following approximate linear relationship holds between the input/output data

\begin{equation}w_{0}+x_{p}w_{1}\approx y_{p},\quad p=1,...,P.\end{equation}

Notice that we have used the approximately equal sign because we cannot be sure that all the data lies completely on a single line. More generally, when dealing with $N$ dimensional input we have a bias and $N$ associated slope weights to tune properly in order to fit a hyperplane, with the analogous linear relationship written as

\begin{equation}w_{0}+ x_{1,p}w_{1} + x_{2,p}w_{2} + \cdots + x_{N,p}w_{N} \approx y_{p} ,\quad p=1,...,P.\end{equation}

Because each dimension of the input is referred to as a feature or input feature in machine learning parlance, we will often refer to $w_{1},\,w_{2},\,...,w_{N}$ as the feature-touching weights (the only weight not touching a feature is the bias $w_0$).

For any $N$ we can write the above more compactly - in particular by using the notation $\mathring{\mathbf{x}}$ to denote an input $\mathbf{x}$ with a $1$ placed on top of it - by defining

\begin{equation}\mathbf{w}=\begin{bmatrix}w_{0}\\w_{1}\\w_{2}\\\vdots\\w_{N}\end{bmatrix}\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\mathring{\mathbf{x}}_{\,}=\begin{bmatrix}1 \\x_{1}\\x_{2}\\\vdots\\x_{N}\end{bmatrix}.\end{equation}

In particular, this means that we stack a $1$ on top of each of our input points $\mathbf{x}_p$ as

\begin{equation}\mathring{\mathbf{x}}_p=\begin{bmatrix}1 \\x_{1,p}\\x_{2,p}\\\vdots\\x_{N,p}\end{bmatrix},\,\,\,\, p = 1,...,P\end{equation}

so that we may write our desired linear relationships in equation (4) more compactly as

\begin{equation}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,} \approx \overset{\,}{y}_{p}^{\,}, \quad p=1,...,P.\end{equation}
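As a concrete illustration of this notation, the short numpy sketch below (using hypothetical values with $N=3$) stacks a $1$ on top of a raw input point and evaluates the linear combination $\mathring{\mathbf{x}}_{p}^T\mathbf{w}$.

    import numpy as np

    # a hypothetical N = 3 dimensional input point and a weight vector w_0,...,w_N
    x_p = np.array([1.5, -0.2, 4.0])
    w = np.array([0.5, 2.0, -1.0, 0.25])

    # stack a 1 on top of the raw input to form the adjusted input
    x_ring_p = np.concatenate(([1.0], x_p))

    # the linear combination x_ring_p^T w, which should approximate the output y_p
    y_hat = np.dot(x_ring_p, w)
    print(y_hat)   # 0.5 + 1.5*2.0 + (-0.2)*(-1.0) + 4.0*0.25 = 4.7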

The bottom $N$ elements of an input vector $\mathring{\mathbf{x}}_{p}$ are referred to as input features to a regression problem. For instance, in the GDP growth rate data described in the Example below the first element of the input feature vector might contain the feature unemployment rate (that is, the unemployment data from each country under study), the second might contain the feature education level, and so on.

Example 1: Predicting Gross Domestic Product (GDP) growth rates

As an example of a regression problem with vector-valued input, consider the problem of predicting the growth rate of a country's Gross Domestic Product (GDP), which is the value of all goods and services produced within a country during a single year. Economists are often interested in understanding the factors (e.g., unemployment rate, education level, population count, land area, income level, investment rate, life expectancy, etc.) that determine a country's GDP growth rate, in order to inform better financial policy making. To understand how these various features of a country relate to its GDP growth rate, economists often perform linear regression.

In the Figure below we show a heat map of the world where countries are color-coded based on their GDP growth rate in 2013, as reported by the International Monetary Fund (IMF).

The Least Squares cost function

To find the parameters of the hyperplane which best fits a regression dataset, it is common practice to first form the Least Squares cost function. For a given set of parameters $\mathbf{w}$ this cost function computes the total squared error between the associated hyperplane and the data (as illustrated pictorially in the Figure below), giving a good measure of how well the particular linear model fits the dataset. Naturally then, the best fitting hyperplane is the one whose parameters minimize this error.

We want to find a weight vector $\mathbf{w}$ so that each of the $P$ approximate equalities

\begin{equation}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,} \approx y_{p}\end{equation}

holds as well as possible. Another way of stating the above is to say that the error between $\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,}$ and $y_{p}$ is small. One natural way to measure the error between two quantities like these is via its square (so that both negative and positive errors are treated equally) as

\begin{equation}g_p\left(\mathbf{w}\right) = \left(\mathring{\mathbf{x}}_{p}^{T}\mathbf{w} - \overset{\,}{y}_p^{\,}\right)^2.\end{equation}
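As a quick numerical check using hypothetical values, take a scalar input $x_p = 2$ with output $y_p = 3$, and weights $w_0 = 1$ and $w_1 = 2$; the squared error of this linear model on the point is then

\begin{equation}g_p\left(\mathbf{w}\right) = \left(w_0 + x_p w_1 - y_p\right)^2 = \left(1 + 2\cdot 2 - 3\right)^2 = 4.\end{equation}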

This squared error $g_p\left(\cdot\right)$ is one example of a point-wise cost that measures the error of a model (here a linear one) on the point $\left\{\mathbf{x}_p,y_p\right\}$.

Since we want all $P$ such values to be small we can take their average - forming a Least Squares cost function

\begin{equation}\,g\left(\mathbf{w}\right)=\frac{1}{P}\sum_{p=1}^{P} g_p\left(\mathbf{w}\right) = \frac{1}{P}\sum_{p=1}^{P}\left(\mathring{\mathbf{x}}_{p}^{T}\mathbf{w}^{\,} - \overset{\,}{y}_p^{\,}\right)^{2}\end{equation}

for linear regression.

Note that the Least Squares cost function is not just a function of the weights $\mathbf{w}$, but of the data as well - however, when we express the function in mathematical shorthand as $g\left(\mathbf{w}\right)$ (as we do on the left-hand side above) we only show dependency on the weights $\mathbf{w}$. Why do we do this? Largely for notational simplicity: if we showed the full dependency in our functional shorthand and wrote $g\left(\mathbf{w} ; \left\{\mathring{\mathbf{x}}_{p},\,y_p\right\}_{p=1}^{P} \right)$ things would start to get too messy. Moreover, for a given dataset the weights $\mathbf{w}$ are the important input, since they are what we need to tune in order to produce a good fit. Because of this we will often refer to the Least Squares cost using the notation $g\left(\mathbf{w}\right)$, but the reader should keep in mind the subtle point that it is indeed a function of the data as well. We will make this sort of notational simplification for virtually all future machine learning cost functions we study as well.

Indeed we want to tune our parameters $\mathbf{w}$ to minimize the Least Squares cost, since the larger this value becomes the larger the squared error between the corresponding linear model and the data, and hence the poorer we represent the given dataset using a linear model. In other words, we want to determine the value of the weights $\mathbf{w}$ that minimizes $g\left(\mathbf{w}\right)$, or written formally we want to solve the unconstrained problem

\begin{equation}\underset{\mathbf{w}}{\mbox{minimize}}\,\,\frac{1}{P}\underset{p=1}{\overset{P}{\sum}}\left(\mathring{\mathbf{x}}_{p}^{T}\mathbf{w}^{\,}-\overset{\,}{y}_p^{\,}\right)^{2}\end{equation}

Implementing the Least Squares cost in Python

When implementing a cost function like Least Squares it is helpful to think modularly, breaking the cost down into a few distinct pieces in order to lighten the amount of mental 'bookkeeping' required. Here we can break things down into two chunks: we have our model - a linear combination of the input - and the cost (squared error) itself.

Our linear model - a function of our input and weights - is worthy of its own notation. We will write it as

\begin{equation}\text{model}\left(\mathbf{x}_{p},\mathbf{w}\right) = \mathring{\mathbf{x}}_{p}^T \mathbf{w}.\end{equation}

Going back and using this model notation, we could equally well express, for example, our ideal setting of the weights in equation (7) as

\begin{equation}\text{model}\left(\mathbf{x}_{p},\mathbf{w}\right) \approx y_p\end{equation}

and likewise our Least Squares cost function itself as

\begin{equation}\,g\left(\mathbf{w}\right)=\frac{1}{P}\sum_{p=1}^{P}\left(\text{model}\left(\mathbf{x}_{p},\mathbf{w}\right) -y_{p}^{\,}\right)^{2}.\end{equation}

This kind of simple deconstruction of the Least Squares cost lends itself to an organized and modular implementation. First we can implement the linear model, as shown below. Note that while it is more convenient mathematically to write the linear combination $\mathring{\mathbf{x}}_{p}^T \mathbf{w}$, in implementing it we need not form the adjusted input $\mathring{\mathbf{x}}_{p}$ (by tacking a $1$ on top of the raw input $\mathbf{x}_p$) and can more easily compute the linear combination by exposing the bias and feature-touching weights separately as

\begin{equation}\mathring{\mathbf{x}}_{p}^T \mathbf{w} = b + \mathbf{x}_p^T\boldsymbol{\omega}.\end{equation}

Here we use the following bias / feature weight notation

\begin{equation}\text{(bias):}\,\, b = w_0 \,\,\,\,\,\,\,\, \text{(feature-touching weights):} \,\,\,\,\,\, \boldsymbol{\omega} = \begin{bmatrix}w_1 \\w_2 \\ \vdots \\w_N\end{bmatrix}.\end{equation}

Remember that $w_0$ is called the bias since it controls where our linear model crosses the $y$ axis, and $w_1,\,w_2,...,w_N$ are called feature-touching weights because they touch each individual dimension of the input (which - in the jargon of machine learning - are called features).

This notation is used to match the Pythonic slicing operation (as shown in the implementation given below), which we implement in Python analogously as

    a = w[0] + np.dot(x_p.T,w[1:])

That is, $b = $ w[0] denotes the bias and $\boldsymbol{\omega} = $ w[1:] denotes the remaining $N$ feature-touching weights $w_1,\,w_2,\,...,w_N$. Another reason to implement things this way is that the linear combination $\mathbf{x}_p^T\boldsymbol{\omega}$ - implemented using np.dot as np.dot(x_p.T,w[1:]) - is especially efficient, since numpy's np.dot operation is far faster than constructing a linear combination in Python via an explicit for loop.
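Putting this together, a minimal sketch of the linear model in Python - assuming the input point x_p is stored as an $N\times 1$ numpy array and the weights w as an $\left(N+1\right)\times 1$ array - might look as follows.

    import numpy as np

    # a linear model of the input: computes w_0 + x_p^T w_[1:]
    def model(x_p, w):
        # expose the bias w[0] and feature-touching weights w[1:] separately
        a = w[0] + np.dot(x_p.T, w[1:])
        return a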

Then, with our linear model implemented, we can easily use it to form the associated Least Squares cost function, as shown below. Notice that we explicitly show all of the inputs to the cost function here, not just the $\left(N+1\right) \times 1$ weights $\mathbf{w}$ - whose Python variable is denoted w. The Least Squares cost also takes in the entire set of inputs - which, because our model exposes the bias separately, need not have ones stacked on top of them, and which together we denote by the $N \times P$ Python variable x - as well as the entire set of corresponding outputs, which we denote by the $1 \times P$ variable y.
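One possible loop-based sketch of this cost - building directly on the model function above (with numpy imported as np), and only an illustration of the idea rather than the only way to write it - is given below.

    # a loop-based Least Squares cost function for linear regression
    def least_squares(w, x, y):
        cost = 0
        # loop over the P points, accumulating the squared error on each
        for p in range(y.size):
            # get pth input/output pair
            x_p = x[:, p][:, np.newaxis]
            y_p = y[:, p]

            # add the squared error of the linear model on this pair
            cost += (model(x_p, w) - y_p)**2

        # return the average squared error over the dataset
        return cost / float(y.size)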

Notice that this really is a direct implementation of the algebraic form of the cost written in terms of the model above, where we think of the cost modularly as the average of the squared errors of a linear model of the input against its corresponding output. However, explicit for loops (including list comprehensions) written in Python are rather slow due to the very nature of the language (e.g., it being a dynamically typed, interpreted language).

It is easy to get around most of this inefficiency by replacing explicit for loops with numerically equivalent computations performed using functions from the numpy library. numpy is an API for some very efficient vector/matrix manipulation libraries written in C. In fact, Python code making heavy use of numpy functions can often execute almost as fast as a raw C implementation itself.

Broadly speaking, when writing a Pythonic function like this one with heavy use of numpy functionality, one tries to package each step of computation - previously performed sequentially over each data point - together for the entire dataset simultaneously. This means we do away with the explicit for loop over each of our $P$ points and make the same computations (numerically speaking) for every point simultaneously. Below we provide one such numpy-heavy version of the Least Squares implementation shown previously, which is far more efficient.
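One such vectorized sketch is given below. As assumptions of our own, we import autograd's thinly wrapped version of numpy (so that the cost can later be differentiated automatically, as discussed next) and treat the data x and y as fixed variables defined in the enclosing scope; the model is also rewritten to act on the entire input array at once.

    import autograd.numpy as np   # autograd-wrapped numpy; a plain numpy import works equally well for evaluation

    # a linear model acting on the entire N x P input array x at once
    def model(x, w):
        a = w[0] + np.dot(x.T, w[1:])
        return a.T

    # a vectorized Least Squares cost; x and y are taken from the enclosing scope
    def least_squares(w):
        cost = np.sum((model(x, w) - y)**2)
        return cost / float(y.size)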

Note that in using these functions the input variable x (containing the entire set of $P$ inputs) is size $N \times P$, and its corresponding output y is size $1\times P$. Here we have written this code - and in particular the model function - to mirror its respective formula as closely as possible.

Notice too that for simplicity we write the Pythonic Least Squares cost function as least_squares(w) instead of least_squares(w,x,y), where in the latter case we explicitly list its other two arguments: the input x and output y data. This is done for notational simplicity - we do this with our math notation as well, denoting our Least Squares cost $g\left(\mathbf{w}\right)$ instead of $g\left(\mathbf{w},\mathbf{x},\mathbf{y}\right)$ - and either format is perfectly fine practically speaking, as autograd will correctly differentiate both forms (since by default it computes the gradient of a Python function with respect to its first input only). We will use this kind of simplified Pythonic notation when introducing future machine learning cost functions as well.
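To make the point about autograd concrete, the small sketch below (assuming autograd is installed, and that the vectorized least_squares along with the data x and y from above are already defined) forms the gradient of the cost with respect to the weights.

    from autograd import grad

    # autograd differentiates a Python function with respect to its first argument by default,
    # so this returns the gradient of the Least Squares cost with respect to w
    gradient = grad(least_squares)

    # evaluate the gradient at a hypothetical all-zero weight vector
    w_init = np.zeros((x.shape[0] + 1, 1))
    print(gradient(w_init))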

Minimization of the Least Squares cost function

Now, determining the overall shape of a cost function - i.e., whether or not it is convex - helps us choose which optimization method(s) can be applied to efficiently find its ideal parameters. In the case of the Least Squares cost function for linear regression it is easy to check that the cost is always convex, regardless of the dataset.

For small input dimensions (e.g., $N=1$) we can empirically verify this claim for any given dataset by simply plotting the function $g$ - as a surface and/or contour plot - as we do in the example below.

Example 2: Visually verifying the convexity of the cost function for a toy dataset

In this example we plot the contour and surface plot of the Least Squares cost function for linear regression on a toy dataset. This toy dataset consists of 50 points randomly selected off of the line $y = x$, with a small amount of Gaussian noise added to each. Notice that the data is packaged in a $\left(N+1\right)\times P$ array, with the inputs in the top $N$ rows and the corresponding outputs in the last row.
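A sketch of how such a toy dataset and a contour plot of its Least Squares cost might be generated is shown below; the noise level, grid ranges, and use of matplotlib are our own illustrative assumptions.

    import numpy as np
    import matplotlib.pyplot as plt

    # create a toy dataset: P = 50 points off the line y = x with small Gaussian noise
    P = 50
    x = np.random.rand(1, P)                  # 1 x P input
    y = x + 0.1*np.random.randn(1, P)         # 1 x P output
    data = np.vstack((x, y))                  # (N+1) x P array: inputs on top, outputs in the last row

    # evaluate the Least Squares cost over a grid of (w_0, w_1) values
    w0_vals = np.linspace(-1, 2, 100)
    w1_vals = np.linspace(-1, 2, 100)
    W0, W1 = np.meshgrid(w0_vals, w1_vals)
    G = np.zeros(W0.shape)
    for i in range(W0.shape[0]):
        for j in range(W0.shape[1]):
            G[i, j] = np.mean((W0[i, j] + W1[i, j]*x - y)**2)

    # contour plot of the cost - its elliptical contours reflect the convexity of the cost
    plt.contour(W0, W1, G, levels=20)
    plt.xlabel('$w_0$'); plt.ylabel('$w_1$')
    plt.show()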
