The Least Squares Regression Method – How to Find the Line of Best Fit (2024)

Diogo Spínola

Would you like to know how to predict the future with a simple formula and some data?

There are multiple ways to tackle the problem of attempting to predict the future. But we're going to look into the theory of how we could do it with the formula Y = a + b * X.

After we cover the theory we're going to be creating a JavaScript project. This will help us more easily visualize the formula in action using Chart.js to represent the data.

What is the Least Squares Regression method and why use it?

Least squares is a method to apply linear regression. It helps us predict results based on an existing set of data and spot anomalies in that data. Anomalies are values that are too good, or bad, to be true or that represent rare cases.

For example, say we have a list of how many topics future engineers here at freeCodeCamp can solve if they invest 1, 2, or 3 hours continuously. Then we can predict how many topics will be covered after 4 hours of continuous study even without that data being available to us.

This method is used by a multitude of professionals, for example statisticians, accountants, managers, and engineers (like in machine learning problems).

Setting up an example

Before we jump into the formula and code, let's define the data we're going to use.

To do that let's expand on the example mentioned earlier.

Let's assume that our objective is to figure out how many topics are covered by a student per hour of learning.

Each pair (X, Y) will represent a student. Since we all have different rates of learning, the number of topics solved can be higher or lower for the same time invested.

Hours (X) | Topics Solved (Y)
--------- | -----------------
1         | 1.5
1.2       | 2
1.5       | 3
2         | 1.8
2.3       | 2.7
2.5       | 4.7
2.7       | 7.1
3         | 10
3.1       | 6
3.2       | 5
3.6       | 8.9

You can read it like this: "Someone spent 1.2 hours and solved 2 topics" or "One student after 3 hours solved 10 topics".

In a graph these points look like this:

[Figure: scatter plot of the (X, Y) pairs listed in the table]

Disclaimer: This data is fictional and was made by hitting random keys. I have no idea of the actual values.

The formula

Y = a + bX

The formula, for those unfamiliar with it, probably looks underwhelming – even more so given the fact that we already have the values for Y and X in our example.

Having said that, and now that we're not scared by the formula, we just need to figure out the a and b values.

To give some context as to what they mean:

  • a is the intercept, in other words the value of Y that the line gives when X is 0. One hour is the least amount of time we're going to accept into our example data set, so a describes where the line starts rather than a student we actually observed.
  • b is the slope or coefficient, in other words how many more topics we expect to be solved for each additional hour (X) of study. b itself stays constant: as the hours (X) increase, the predicted Y grows by b for every extra hour.

Calculating "b"

b = ∑(x - x̄)(y - ȳ) / ∑(x - x̄)²

x and y are the values from our earlier table. When they have a bar (macron) above them, as in x̄ and ȳ, it means we should use the average, which we obtain by summing all the values and dividing by how many there are:

x̄ -> (1 + 1.2 + 1.5 + 2 + 2.3 + 2.5 + 2.7 + 3 + 3.1 + 3.2 + 3.6) / 11 = 2.37

ȳ -> (1.5 + 2 + 3 + 1.8 + 2.7 + 4.7 + 7.1 + 10 + 6 + 5 + 8.9) / 11 = 4.79
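As a quick check of the two averages, here is a minimal JavaScript sketch. The names hours, topics, and mean are placeholders chosen for this example and are not part of the project built later on.

```javascript
// Data from the table above: hours studied (X) and topics solved (Y).
const hours = [1, 1.2, 1.5, 2, 2.3, 2.5, 2.7, 3, 3.1, 3.2, 3.6];
const topics = [1.5, 2, 3, 1.8, 2.7, 4.7, 7.1, 10, 6, 5, 8.9];

// Average: sum all the values and divide by how many there are.
function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

const meanX = mean(hours);
const meanY = mean(topics);

console.log(meanX.toFixed(2)); // 2.37
console.log(meanY.toFixed(2)); // 4.79
```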

Now that we have the average we can expand our table to include the new results:

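Hours (X) | Topics Solved (Y) | (x - x̄) | (y - ȳ) | (x - x̄)*(y - ȳ) | (x - x̄)²
--------- | ----------------- | ------- | ------- | --------------- | --------
1         | 1.5               | -1.37   | -3.29   | 4.51            | 1.88
1.2       | 2                 | -1.17   | -2.79   | 3.26            | 1.37
1.5       | 3                 | -0.87   | -1.79   | 1.56            | 0.76
2         | 1.8               | -0.37   | -2.99   | 1.11            | 0.14
2.3       | 2.7               | -0.07   | -2.09   | 0.15            | 0.00
2.5       | 4.7               | 0.13    | -0.09   | -0.01           | 0.02
2.7       | 7.1               | 0.33    | 2.31    | 0.76            | 0.11
3         | 10                | 0.63    | 5.21    | 3.28            | 0.40
3.1       | 6                 | 0.73    | 1.21    | 0.88            | 0.53
3.2       | 5                 | 0.83    | 0.21    | 0.17            | 0.69
3.6       | 8.9               | 1.23    | 4.11    | 5.06            | 1.51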

The weird symbol sigma (∑) tells us to sum everything up:

∑(x - x̄)*(y - ȳ) -> 4.51 + 3.26 + 1.56 + 1.11 + 0.15 + (-0.01) + 0.76 + 3.28 + 0.88 + 0.17 + 5.06 = 20.73

∑(x - x̄)² -> 1.88 + 1.37 + 0.76 + 0.14 + 0.00 + 0.02 + 0.11 + 0.40 + 0.53 + 0.69 + 1.51 = 7.41

And finally we do 20.73 / 7.41 and we get b ≈ 2.8.
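The deviations and the two sums can also be reproduced with a short JavaScript sketch. The variable names (pairs, sumOfProducts, sumOfSquares) are made up for this example. It keeps full precision instead of rounding every cell, so the second sum comes out as 7.40 rather than the hand-rounded 7.41, while the slope still rounds to 2.8.

```javascript
// (X, Y) pairs from the table above.
const pairs = [
  [1, 1.5], [1.2, 2], [1.5, 3], [2, 1.8], [2.3, 2.7], [2.5, 4.7],
  [2.7, 7.1], [3, 10], [3.1, 6], [3.2, 5], [3.6, 8.9],
];

const meanX = pairs.reduce((sum, [x]) => sum + x, 0) / pairs.length;
const meanY = pairs.reduce((sum, [, y]) => sum + y, 0) / pairs.length;

// Numerator: sum of (x - mean of x) * (y - mean of y)
const sumOfProducts = pairs.reduce(
  (sum, [x, y]) => sum + (x - meanX) * (y - meanY), 0);

// Denominator: sum of (x - mean of x) squared
const sumOfSquares = pairs.reduce(
  (sum, [x]) => sum + (x - meanX) ** 2, 0);

const b = sumOfProducts / sumOfSquares;

console.log(sumOfProducts.toFixed(2)); // 20.73
console.log(sumOfSquares.toFixed(2)); // 7.40
console.log(b.toFixed(1)); // 2.8
```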

Note: When using an expression input calculator, like the one that's available in Ubuntu, -2² returns -4 instead of 4. To avoid that input (-2)².
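JavaScript, which we use for the project later on, sidesteps this ambiguity: writing -2 ** 2 is a syntax error, so the parentheses have to be explicit. A short illustration (the variable names are only for this example):

```javascript
// In JavaScript `-2 ** 2` does not parse at all (SyntaxError),
// so we have to say which of the two meanings we want.
const squareOfNegative = (-2) ** 2; // square of -2
const negativeOfSquare = -(2 ** 2); // negative of 2 squared

console.log(squareOfNegative); // 4
console.log(negativeOfSquare); // -4

// With a variable there is no ambiguity: the stored value is squared.
const deviation = -1.37;
console.log((deviation ** 2).toFixed(2)); // 1.88
```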

Calculating "a"

All that is left is a, for which the formula is ȳ = a + b * x̄. We've already obtained all those other values, so we can substitute them and we get:

  • 4.79 = a + 2.8*2.37
  • 4.79 = a + 6.64
  • a = -6.64+4.79
  • a = -1.85
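The same substitution in JavaScript, as a small sketch that uses the rounded values from the steps above (the variable names are chosen for this example):

```javascript
// Rounded values taken from the previous steps.
const meanX = 2.37;
const meanY = 4.79;
const b = 2.8;

// Rearranging (mean of y) = a + b * (mean of x) gives
// a = (mean of y) - b * (mean of x).
const a = meanY - b * meanX;

console.log(a.toFixed(2)); // -1.85
```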

The result

Our final formula becomes:

Y = -1.85 + 2.8*X

Now we replace the X in our formula with each value that we have:

Hours (X) | -1.85 + 2.8 * X
--------- | ---------------
1         | 0.95
1.2       | 1.51
1.5       | 2.35
2         | 3.75
2.3       | 4.59
2.5       | 5.15
2.7       | 5.71
3         | 6.55
3.1       | 6.83
3.2       | 7.11
3.6       | 8.23

Which is a graph that looks something like this:

[Figure: the data points with the line Y = -1.85 + 2.8*X drawn through them]

If we want to predict how many topics we expect a student to solve with 8 hours of study, we replace it in our formula:

  • Y = -1.85 + 2.8*8
  • Y = 20.55
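Both the table of fitted values and the 8 hour prediction come from plugging X into the same formula. A minimal sketch, assuming the rounded a and b from above (predict and observedHours are names chosen for this example):

```javascript
// Intercept and slope from the final formula.
const a = -1.85;
const b = 2.8;

// Predicted number of topics solved after `hours` hours of study.
function predict(hours) {
  return a + b * hours;
}

const observedHours = [1, 1.2, 1.5, 2, 2.3, 2.5, 2.7, 3, 3.1, 3.2, 3.6];

// Reproduce the table of fitted values, one row per observed X.
for (const hours of observedHours) {
  console.log(hours, predict(hours).toFixed(2));
}

// Predict a value beyond the data we collected.
console.log(predict(8).toFixed(2)); // 20.55
```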

And in a graph we can see:

[Figure: the regression line extended to X = 8]

Limitations

Always bear in mind the limitations of a method. This will hopefully help you avoid incorrect results.

And this method, like any other, has its limitations. Here are a couple:

  • It doesn't take into account the complexity of the topics solved. A topic covered at the start of the "Responsive Web Design Certification" will most likely take less time to learn and solve than doing one of the final projects. So if the data we have is from different starting points of a course, the predictions won't be accurate.
  • It's impossible for someone to study 240 hours continuously or to solve more topics than those available. Regardless, the method allows us to predict those values. The formula will happily return a number for them, but that number no longer means anything because it describes an impossible situation.
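The second limitation can be made visible in code by flagging predictions that fall outside the range of hours we actually observed. The sketch below is only an illustration: predictWithRangeCheck and the shape of the object it returns are hypothetical and not part of the project.

```javascript
const a = -1.85;
const b = 2.8;
const observedHours = [1, 1.2, 1.5, 2, 2.3, 2.5, 2.7, 3, 3.1, 3.2, 3.6];

const minHours = Math.min(...observedHours);
const maxHours = Math.max(...observedHours);

// Returns the prediction together with a flag that tells the caller
// whether the line is being extended beyond the observed data.
function predictWithRangeCheck(hours) {
  return {
    topics: a + b * hours,
    extrapolated: hours < minHours || hours > maxHours,
  };
}

console.log(predictWithRangeCheck(2).extrapolated); // false
console.log(predictWithRangeCheck(240).extrapolated); // true
console.log(predictWithRangeCheck(240).topics.toFixed(2)); // 670.15
```

A flag like this does not make the prediction any better. It only reminds whoever reads the result that the further X is from the data, the less the number should be trusted.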

Example JavaScript Project

Doing this by hand is not necessary. We can create our project where we input the X and Y values, it draws a graph with those points, and applies the linear regression formula.

The project folder will have the following contents:

    src/
      |-public // folder with the content that we will feed to the browser
        |-index.html
        |-style.css
        |-least-squares.js
      package.json
      server.js // our Node.js server

And package.json:

    {
      "name": "least-squares-regression",
      "version": "1.0.0",
      "description": "Visualize linear least squares",
      "main": "server.js",
      "scripts": {
        "start": "node server.js",
        "server-debug": "nodemon --inspect server.js"
      },
      "author": "daspinola",
      "license": "MIT",
      "devDependencies": {
        "nodemon": "2.0.4"
      },
      "dependencies": {
        "express": "4.17.1"
      }
    }

Once we have the package.json and we run npm install we will have Express and nodemon available. You can switch them out for others as you prefer, but I use these out of convenience.

In server.js:

    const express = require('express')
    const path = require('path')

    const app = express()
    const port = 5000

    // Serve everything inside the public folder as static files
    app.use(express.static(path.join(__dirname, 'public')))

    app.get('/', function(req, res) {
      res.sendFile(path.join(__dirname, 'public/index.html'))
    })

    app.listen(port, function () {
      console.log(`Listening on port ${port}!`)
    })

This tiny server lets us open our page by visiting localhost:5000 in the browser. Before we run it let's create the remaining files:

public/index.html

    <html>
      <head>
        <title>Least Squares Regression</title>
        <script src="https://cdn.jsdelivr.net/npm/chart.js@2.9.3/dist/Chart.min.js"></script>
        <link rel="stylesheet" href="style.css">
      </head>
      <body>
        <div class="container">
          <div class="left-half">
            <div>
              <input type="number" class="input-x" placeholder="X">
              <input type="number" class="input-y" placeholder="Y">
              <button class="btn-update-graph">Add</button>
            </div>
            <div>
              <span class="span-formula"></span>
            </div>
            <div>
              <table class="table-pairs">
                <thead>
                  <tr>
                    <th> X </th>
                    <th> Y </th>
                  </tr>
                </thead>
                <tbody></tbody>
              </table>
            </div>
          </div>
          <div class="right-half">
            <canvas id="myChart"></canvas>
          </div>
        </div>
        <!-- least-squares.js sits next to index.html in the public folder -->
        <script src="least-squares.js"></script>
      </body>
    </html>

We create our elements:

  • Two inputs for our pairs, one for X and one for Y
  • A button to add those values to a table
  • A span to show the current formula as values are added
  • A table to show the pairs we've been adding
  • And a canvas for our chart

We also import the Chart.js library with a CDN and add our CSS and JavaScript files.

public/style.css

    .container {
      display: grid;
    }

    .left-half {
      grid-column: 1;
    }

    .right-half {
      grid-column: 2;
    }

We add some rules so we have our inputs and table to the left and our graph to the right. This takes advantage of CSS grid.

public/least-squares.js

    document.addEventListener('DOMContentLoaded', init, false);

    function init() {
      // State shared by the functions we add in the next steps
      const currentData = {
        pairs: [],
        slope: 0,
        coeficient: 0,
        line: [],
      };

      const chart = initChart();
    }

    function initChart() {
      const ctx = document.getElementById('myChart').getContext('2d');

      return new Chart(ctx, {
        type: 'scatter',
        data: {
          datasets: [{
            // Dataset 0: the (X, Y) pairs drawn as dots
            label: 'Scatter Dataset',
            backgroundColor: 'rgb(125,67,120)',
            data: [],
          }, {
            // Dataset 1: the regression line
            label: 'Line Dataset',
            fill: false,
            data: [],
            type: 'line',
          }],
        },
        options: {
          scales: {
            xAxes: [{
              type: 'linear',
              position: 'bottom',
              display: true,
              scaleLabel: {
                display: true,
                labelString: '(X)',
              },
            }],
            yAxes: [{
              type: 'linear',
              // The vertical axis belongs on the left, not the bottom
              position: 'left',
              display: true,
              scaleLabel: {
                display: true,
                labelString: '(Y)',
              },
            }],
          },
        },
      });
    }

And finally, we initialize our graph. At the start, it should be empty since we haven't added any data to it just yet.

Now if we run npm run server-debug and open our browser on localhost:5000 we should see something like this:

[Figure: the page with the two inputs and the Add button on the left and an empty chart on the right]

Adding functionality

The next step is to make the "Add" button do something. In our case we want to achieve:

  • Add the X and Y values to the table
  • Update the formula when we add more than one pair (we need at least 2 pairs to create a line)
  • Update the graph with the points and the line
  • Clean the inputs, just so it's easier to keep introducing data

Add the values to the table

public/least-squares.js

    document.addEventListener('DOMContentLoaded', init, false);

    function init() {
      const currentData = {
        pairs: [],
        slope: 0,
        coeficient: 0,
        line: [],
      };

      const btnUpdateGraph = document.querySelector('.btn-update-graph');
      const tablePairs = document.querySelector('.table-pairs');
      const spanFormula = document.querySelector('.span-formula');
      const inputX = document.querySelector('.input-x');
      const inputY = document.querySelector('.input-y');

      const chart = initChart();

      btnUpdateGraph.addEventListener('click', () => {
        // Input values are strings, so convert them to numbers first
        const x = parseFloat(inputX.value);
        const y = parseFloat(inputY.value);

        updateTable(x, y);
      });

      // Appends a new row with the pair to the table body
      function updateTable(x, y) {
        const tr = document.createElement('tr');
        const tdX = document.createElement('td');
        const tdY = document.createElement('td');

        tdX.innerHTML = x;
        tdY.innerHTML = y;

        tr.appendChild(tdX);
        tr.appendChild(tdY);

        tablePairs.querySelector('tbody').appendChild(tr);
      }
    }

    // ... rest of the code as it was

We get all of the elements we will use shortly and add an event on the "Add" button. That event will grab the current values and update our table visually.

We need to parse the amount since we get a string. It will be important for the next step when we have to apply the formula.
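To see why the parsing matters, and what happens when an input is left empty, here is a small sketch. parseFloat and Number.isNaN are standard JavaScript. The helper parsePair is a hypothetical name used only for this illustration and is not part of the project code.

```javascript
// Input values always arrive as strings.
console.log('1.5' + '2'); // 1.52 (string concatenation, not a sum)
console.log(parseFloat('1.5') + parseFloat('2')); // 3.5

// An empty input cannot be parsed and yields NaN.
console.log(Number.isNaN(parseFloat(''))); // true

// Hypothetical helper: returns the parsed pair, or null when either
// value is not a usable number.
function parsePair(rawX, rawY) {
  const x = parseFloat(rawX);
  const y = parseFloat(rawY);

  if (Number.isNaN(x) || Number.isNaN(y)) {
    return null;
  }

  return { x, y };
}

console.log(parsePair('2.5', '4.7')); // { x: 2.5, y: 4.7 }
console.log(parsePair('', '4.7')); // null
```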


Make the calculations

All the math we were talking about earlier (getting the average of X and Y, calculating b, and calculating a) should now be turned into code. We will also display the a and b values so we see them changing as we add values.

public/least-squares.js

    // ... rest of the code as it was

    btnUpdateGraph.addEventListener('click', () => {
      const x = parseFloat(inputX.value);
      const y = parseFloat(inputY.value);

      updateTable(x, y);
      updateFormula(x, y);
    });

    function updateFormula(x, y) {
      currentData.pairs.push({ x, y });
      const pairsAmount = currentData.pairs.length;

      const sum = currentData.pairs.reduce((acc, pair) => ({
        x: acc.x + pair.x,
        y: acc.y + pair.y,
      }), { x: 0, y: 0 });

      const average = {
        x: sum.x / pairsAmount,
        y: sum.y / pairsAmount,
      };

      // Numerator of b: sum of (x - average x) * (y - average y)
      const slopeDividend = currentData.pairs
        .reduce((acc, pair) => acc + ((pair.x - average.x) * (pair.y - average.y)), 0);
      // Denominator of b: sum of (x - average x) squared
      const slopeDivisor = currentData.pairs
        .reduce((acc, pair) => acc + (pair.x - average.x) ** 2, 0);

      // Avoid dividing by zero while all X values are still the same
      const slope = slopeDivisor !== 0
        ? parseFloat((slopeDividend / slopeDivisor).toFixed(2))
        : 0;

      // a = average y - b * average x
      const coeficient = parseFloat(
        (-(slope * average.x) + average.y).toFixed(2),
      );

      // One point on the line for every X we have
      currentData.line = currentData.pairs
        .map((pair) => ({
          x: pair.x,
          y: parseFloat((coeficient + (slope * pair.x)).toFixed(2)),
        }));

      spanFormula.innerHTML = `Formula: Y = ${coeficient} + ${slope} * X`;
    }

    // ... rest of the code as it was

There isn't much to be said about the code here since it's all the theory that we've been through earlier. We loop through the values to get sums, averages, and all the other values we need to obtain the intercept (a, stored in the coeficient variable) and the slope (b).
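The same calculation can be pulled out into a function with no DOM access, which makes it easier to check against the numbers from the theory section. This is a sketch rather than the article's code: the name leastSquares and the choice to skip the toFixed rounding are assumptions of this example.

```javascript
// Fits Y = a + b * X to a non-empty array of { x, y } pairs.
function leastSquares(pairs) {
  const n = pairs.length;
  const meanX = pairs.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = pairs.reduce((sum, p) => sum + p.y, 0) / n;

  const dividend = pairs.reduce(
    (sum, p) => sum + (p.x - meanX) * (p.y - meanY), 0);
  const divisor = pairs.reduce(
    (sum, p) => sum + (p.x - meanX) ** 2, 0);

  // When every X is the same there is no unique line, so fall back
  // to a flat line through the average Y.
  const slope = divisor !== 0 ? dividend / divisor : 0;
  const intercept = meanY - slope * meanX;

  return { slope, intercept };
}

// The data from the theory section.
const xs = [1, 1.2, 1.5, 2, 2.3, 2.5, 2.7, 3, 3.1, 3.2, 3.6];
const ys = [1.5, 2, 3, 1.8, 2.7, 4.7, 7.1, 10, 6, 5, 8.9];
const pairs = xs.map((x, i) => ({ x, y: ys[i] }));

const { slope, intercept } = leastSquares(pairs);
console.log(slope.toFixed(2), intercept.toFixed(2)); // 2.80 -1.85
```

Keeping the math separate from the DOM also means the rounding can be left to the code that displays the formula.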

We have the pairs and the line stored in the currentData variable, so we use them in the next step to update our chart.

Update the graph and clean inputs

public/least-squares.js

    // ... rest of the code as it was

    btnUpdateGraph.addEventListener('click', () => {
      const x = parseFloat(inputX.value);
      const y = parseFloat(inputY.value);

      updateTable(x, y);
      updateFormula(x, y);
      updateChart();
      clearInputs();
    });

    function updateChart() {
      // Dataset 0 holds the dots, dataset 1 holds the regression line
      chart.data.datasets[0].data = currentData.pairs;
      chart.data.datasets[1].data = currentData.line;

      chart.update();
    }

    function clearInputs() {
      inputX.value = '';
      inputY.value = '';
    }

    // ... rest of the code as it was

Updating the chart and cleaning the inputs of X and Y is very straightforward. We have two datasets, the first one (position zero) is for our pairs, so we show the dot on the graph. The second one (position one) is for our regression line.

We have to grab our instance of the chart and call update so we see the new values being taken into account.


Adding some style

We can change our layout a bit so it's more manageable. It's nothing major and just serves as a reminder that we can update the UI at any point.

public/style.css

    .container {
      display: grid;
    }

    .left-half {
      grid-column: 1;
    }

    .right-half {
      grid-column: 2;
    }

    .pairs-style input[type="number"],
    .pairs-style button {
      margin: 5px 0px;
    }

    .table-pairs {
      border-collapse: collapse;
      width: 100%;
    }

    .table-pairs td {
      text-align: center;
    }

    .table-pairs,
    .table-pairs th,
    .table-pairs td {
      margin: 10px 0px;
      border: 1px solid black;
    }

public/index.html

    <html>
      <head>
        <title>Least Squares Regression</title>
        <script src="https://cdn.jsdelivr.net/npm/chart.js@2.9.3/dist/Chart.min.js"></script>
        <link rel="stylesheet" href="style.css">
      </head>
      <body>
        <div class="container">
          <div class="left-half">
            <div class="pairs-style">
              <div>
                <input type="number" class="input-x" placeholder="X">
              </div>
              <div>
                <input type="number" class="input-y" placeholder="Y">
              </div>
              <button class="btn-update-graph">Add</button>
            </div>
            <div>
              <span class="span-formula">Formula: Y = a + b * X</span>
            </div>
            <div>
              <table class="table-pairs">
                <thead>
                  <tr>
                    <th> X </th>
                    <th> Y </th>
                  </tr>
                </thead>
                <tbody></tbody>
              </table>
            </div>
          </div>
          <div class="right-half">
            <canvas id="myChart"></canvas>
          </div>
        </div>
        <!-- least-squares.js sits next to index.html in the public folder -->
        <script src="least-squares.js"></script>
      </body>
    </html>

Proof of Concept

[Figure: demonstration of the finished project]

Final remarks

For brevity's sake, I cut out a lot that can be taken as an exercise to vastly improve the project. For example:

  • Add checks for empty values and the like
  • Make it so we can remove data that we wrongly inserted
  • Add an input for X or Y and apply the current data formula to "predict the future", similar to the last example of the theory
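As a starting point for the last exercise, the formula can be applied in both directions: from X to Y, and from Y back to X by rearranging it to X = (Y - a) / b. The function names below are hypothetical, and the values of a and b are the ones from the theory section rather than anything read from the page.

```javascript
// A fitted line, here with the values from the theory section.
const line = { a: -1.85, b: 2.8 };

// Topics we expect after a given number of hours.
function predictY(fitted, x) {
  return fitted.a + fitted.b * x;
}

// Hours we expect to be needed for a given number of topics.
// A flat line (b = 0) cannot be inverted, so we return null.
function predictX(fitted, y) {
  if (fitted.b === 0) {
    return null;
  }
  return (y - fitted.a) / fitted.b;
}

console.log(predictY(line, 8).toFixed(2)); // 20.55
console.log(predictX(line, 20.55).toFixed(2)); // 8.00
console.log(predictX({ a: 3, b: 0 }, 5)); // null
```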

Regardless, predicting the future is a fun concept even if, in reality, the most we can hope to predict is an approximation based on past data points.

It's a powerful formula and if you build any project using it I would love to see it.

I hope this article was helpful to serve as an introduction to this concept. The code used in the article can be found in my GitHub here.

See you in the next one, in the meantime, go code something!


FAQs

How do you find the regression equation for the line of best fit?

The line of best fit has the form y = mx + b. A rough estimate can be made with the point-slope method: take two points, usually the first and the last ones given, and find the slope and y-intercept of the line through them. This only approximates the line. The least squares method picks the slope and intercept that minimize the squared errors over all the points.

How to calculate the least squares regression line?

The least squares regression line is y = mx + b, where m is the slope, equal to (N∑(xy) - ∑x∑y) / (N∑(x²) - (∑x)²), and b is the y-intercept, equal to (∑y - m∑x) / N. N is the number of data points, and x and y are the coordinates of the data points. Note that in this notation m is the slope and b is the intercept, whereas the article writes the line as Y = a + bX with b as the slope.

How to find a best regression line for a regression problem?

The sum of squared errors is used as the cost function for linear regression. For each candidate line, calculate the sum of the squares of the errors. The line with the smallest sum of squared errors is the best fit line.

How to find the regression line?

To work out the regression line the following values need to be calculated: a = ȳ - b * x̄ and b = Sxy / Sxx. The easiest way of calculating them is by using a table. Start off by working out the mean of the independent and dependent variables.

How to calculate the least squares regression line in R?

In R, least squares models are fitted with the lm() function. For a multiple linear regression, the syntax lm(y ~ x1 + x2 + x3) fits a model with three predictors, x1, x2, and x3. The summary() function then outputs the regression coefficients for all the predictors.

What is the formula for the least squares estimator?

Mathematically, the least (sum of) squares criterion that is minimized to obtain the parameter estimates is Q = ∑_{i=1}^{n} [y_i - f(x_i; β̂)]². Here β0, β1, … are treated as the variables in the optimization, and the predictor variable values x1, x2, … are treated as coefficients.

How to tell if a regression line is a good fit?

To be precise, linear regression finds the smallest sum of squared residuals that is possible for the dataset. Statisticians say that a regression model fits the data well if the differences between the observations and the predicted values are small and unbiased.

What is the most common method for finding the best fitting regression line?

The least squares method is a statistical method used to find a regression line, or best-fit line, for a given set of data. The line is described by an equation with specific parameters.

What is the least squares method of fit?

The least squares method is a statistical procedure to find the best fit for a set of data points. The method works by minimizing the sum of the squared offsets, or residuals, of the points from the plotted curve. Least squares regression is used to predict the behavior of dependent variables.

Is a line of best fit calculated with a linear regression?

Linear regression is used to model the relationship between two variables and estimate the value of a response by using a line of best fit.

What is the method of least square for fitting a regression line?

The method of least squares is a parameter estimation method in regression analysis based on minimizing the sum of the squares of the residuals (a residual being the difference between an observed value and the fitted value provided by a model) made in the results of each individual equation.

What is the line of best fit in multiple regression?

In multiple linear regression, the "line" of best fit is really a plane or hyperplane: the model picks the coefficients that minimize the sum of squared residuals between the observed and the predicted values of the dependent variable. Because the model is linear in its coefficients, it is a linear model.

Can the best line of fit in linear regression be found by ordinary least square?

The OLS method identifies the line which fits best for the given data. This is called the "line of best fit" and is determined by identifying the line, out of all of the possible lines, which results in the least squared difference between the observed data points and the line.
