Least Squares Regression

April 16, 2023

Basics of Least Squares Regression

Least Squares Regression is a statistical method for estimating and analyzing relationships between variables.
“Least Squares” refers to the method of seeking to minimize the sum of squared distances between predicted and observed values.
Simple Linear Regression models the relationship between an independent (predictor) variable and a dependent (response) variable using the Least Squares method.
The equation for a Simple Linear Regression line takes the form \(\text{y} = \beta_0 + \beta_1\text{x} + \varepsilon\) where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is some amount of error.
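To make the model equation concrete, here is a minimal sketch that simulates data from \(\text{y} = \beta_0 + \beta_1\text{x} + \varepsilon\) and then recovers the intercept and slope with `lm()`. The parameter values, sample size, and seed are made up purely for illustration and do not come from any dataset used here.

```r
# Simulate data from y = beta_0 + beta_1 * x + epsilon.
# All values below are hypothetical, chosen only for illustration.
set.seed(42)                       # arbitrary seed, for reproducibility only
n      <- 200
beta_0 <- 2                        # assumed "true" intercept
beta_1 <- 0.5                      # assumed "true" slope

x       <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1)   # random error term
y       <- beta_0 + beta_1 * x + epsilon

# lm() estimates the intercept and slope by least squares.
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimated coefficients should land near the assumed values of 2 and 0.5, with the gap shrinking as `n` grows.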

Conditions

There are some general conditions that must be met for Least Squares Regression to perform as intended:

1. Linearity - The data should show a roughly linear trend. A regression line can be fit to almost any set of data; however, it would make little sense to fit a straight line to data with an obvious curve, a circular shape, or another clearly non-linear pattern.

2. Nearly Normal Residuals - The residuals from the fitted line should be approximately normally distributed. If some residuals fall far outside the range of the rest, those outliers may be pulling heavily on the slope of the line, making predictions less accurate than desired. In that case, another form of prediction may be a wiser choice.

3. Constant Variability - The spread of the residuals should be roughly the same along the whole fitted line. If the residuals tend to fan out or narrow as the predictor increases, this would be another indication that a different method of prediction could better suit the data.

4. Independent Observations - Observations should be independent of one another. Data with another underlying structure, such as Time Series data where consecutive observations are related, are not good candidates for fitting a straight line.

When these conditions are met, the regression line is generally a reasonable representation of the data.

Dataset `mtcars`

We will use the built-in mtcars dataset for our Regression analysis.

data(mtcars)
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Visualizing Data

First, we make a scatterplot of the data. Here we plot mpg against weight (wt) from the mtcars dataset. The data appear roughly linear and show no obvious violations of the conditions listed earlier.
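The scatterplot gives a first impression, but the conditions on the residuals are easier to judge from diagnostic plots. The following is one possible sketch using base R graphics. The object names `fit` and `res` are our own, and the choice of these two plots is an assumption about what is most useful here rather than a required procedure.

```r
data(mtcars)

# Fit mpg on wt so the residuals can be inspected.
fit <- lm(mpg ~ wt, data = mtcars)
res <- resid(fit)

op <- par(mfrow = c(1, 2))         # two plots side by side

# Residuals vs fitted values: look for curvature (linearity)
# and for a fan shape (non-constant variability).
plot(fitted(fit), res,
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest
# nearly normal residuals.
qqnorm(res)
qqline(res)

par(op)                            # restore previous plot settings
```

Independence cannot be read off these plots. It depends on how the data were collected.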

Determining the Least Squares Line

The Least Squares line can be found using summary statistics of the data.

Mean of wt: 3.21725 
Mean of mpg: 20.09062 
Standard deviation of wt: 0.9784574 
Standard deviation of mpg: 6.026948 
Correlation between wt and mpg: -0.8676594
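The summary statistics above can be reproduced with base R functions. This is a small sketch. The variable names (`x_bar`, `s_x`, and so on) are our own choices to mirror the notation used for the slope and intercept.

```r
data(mtcars)

# Sample means of the predictor (wt) and response (mpg)
x_bar <- mean(mtcars$wt)
y_bar <- mean(mtcars$mpg)

# Sample standard deviations
s_x <- sd(mtcars$wt)
s_y <- sd(mtcars$mpg)

# Correlation between predictor and response
R <- cor(mtcars$wt, mtcars$mpg)

cat("Mean of wt:", x_bar, "\n")
cat("Mean of mpg:", y_bar, "\n")
cat("Standard deviation of wt:", s_x, "\n")
cat("Standard deviation of mpg:", s_y, "\n")
cat("Correlation between wt and mpg:", R, "\n")
```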

The slope of the line is given by \(b_1 = \frac{s_y}{s_x}R\) where \(b_1\) is the slope, \(s_y\) and \(s_x\) are the sample standard deviations of the response and predictor variables (respectively), and R is the correlation between the two variables.

Given that the means create a point \((\bar{x}, \bar{y})\) on the least squares line, these values can be used in the point-slope formula \(y - \bar{y} = b_1(x - \bar{x})\) to determine the y-intercept and, ultimately, the regression line for the data. Our data yield the equation:

\[\widehat{mpg} = 37.29 - 5.34(wt)\]
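As a check on the arithmetic, the slope and intercept can be computed directly from the summary statistics and compared with the coefficients returned by `lm()`. This is a sketch only. The names `b0`, `b1`, and `fit` are our own.

```r
data(mtcars)

x_bar <- mean(mtcars$wt)
y_bar <- mean(mtcars$mpg)
s_x   <- sd(mtcars$wt)
s_y   <- sd(mtcars$mpg)
R     <- cor(mtcars$wt, mtcars$mpg)

# Slope: b1 = (s_y / s_x) * R
b1 <- (s_y / s_x) * R

# Intercept from the point-slope form through (x_bar, y_bar):
# y - y_bar = b1 * (x - x_bar)  =>  b0 = y_bar - b1 * x_bar
b0 <- y_bar - b1 * x_bar

round(c(b0 = b0, b1 = b1), 2)      # 37.29 and -5.34

# The same coefficients from R's built-in least squares fit
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```

Both routes give the same line, since `lm()` performs the same least squares minimization.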

Fitting a Least Squares Line

Now, we add the Least Squares line fitted to the data. This line minimizes the sum of squared differences between the observed and predicted values, which is given by: \[\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
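To illustrate what "minimizes" means here, the sketch below computes the sum of squared differences for the fitted line and for two slightly altered lines. The helper `sse()` and the size of the perturbations are hypothetical, chosen only for demonstration.

```r
data(mtcars)

# Sum of squared differences between observed mpg and the
# values predicted by the line b0 + b1 * wt.
# (sse is a helper defined here for illustration.)
sse <- function(b0, b1) {
  predicted <- b0 + b1 * mtcars$wt
  sum((mtcars$mpg - predicted)^2)
}

fit <- lm(mpg ~ wt, data = mtcars)
b   <- unname(coef(fit))           # b[1] = intercept, b[2] = slope

sse_fit     <- sse(b[1], b[2])         # the least squares line
sse_shifted <- sse(b[1] + 1, b[2])     # intercept moved up by 1
sse_tilted  <- sse(b[1], b[2] + 0.5)   # slope increased by 0.5

c(fit = sse_fit, shifted = sse_shifted, tilted = sse_tilted)
```

Any line other than the fitted one gives a larger sum, so both altered lines score worse than `sse_fit`.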

Evaluating Strength of Fit Using \(R^2\)

Strength of fit for linear models is frequently evaluated using \(R^2\), the square of the correlation coefficient used to determine the linear model. For our model that becomes:

\[R^2 = (-0.8676594)^2 \approx 0.7528\]
This value describes how closely the observed values fall to the fitted regression line. In other words, it is the proportion of variation in the response variable that is explained by the linear model.

For our model, it can be said that approximately 75.28% of variation in the mpg variable is explained by the fitted Least Squares Regression line.
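The two readings of \(R^2\), as a squared correlation and as a proportion of explained variation, can be compared numerically. The following is a sketch with variable names of our own choosing. It also reads the value that `summary()` reports for the fitted model.

```r
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

# 1. Squared correlation between predictor and response
r_squared_cor <- cor(mtcars$wt, mtcars$mpg)^2

# 2. Proportion of variation explained: 1 - SSE / SST
sse <- sum(resid(fit)^2)                          # unexplained variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)     # total variation
r_squared_var <- 1 - sse / sst

# 3. Value reported by summary() for the fitted model
r_squared_lm <- summary(fit)$r.squared

round(c(r_squared_cor, r_squared_var, r_squared_lm), 4)
```

For simple linear regression with an intercept, all three agree.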

Discrete and Categorical Variables

Least Squares can also be useful for exploring discrete or one-hot encoded categorical data. Here it can be seen that as the number of cylinders in a car increases, so too does the horsepower. It would not be good practice to use this fit for prediction, and typically a categorical predictor with only two levels would be compared this way (our data work fine with the three cylinder counts), but it can help to visualize trends in the data. Miles per gallon is added to this chart as a color scale, for further data exploration.
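The trend described here can also be checked numerically. As a rough sketch, the code below averages hp and mpg within each cylinder count and fits a line of hp on cyl. The names `group_means` and `cyl_fit` are our own, and treating cyl as a numeric predictor is a simplification for exploration only.

```r
data(mtcars)

# Mean horsepower and mpg for each cylinder count (4, 6, 8)
group_means <- aggregate(cbind(hp, mpg) ~ cyl, data = mtcars, FUN = mean)
group_means

# Least squares line of hp on cyl, with cyl treated as numeric.
# A positive slope matches the upward trend in horsepower.
cyl_fit <- lm(hp ~ cyl, data = mtcars)
coef(cyl_fit)
```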

Code used for Plots

Scatter Plot:

library(ggplot2)
# Scatter plot of mpg against weight
LSR <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon")

Adding Least Squares Line to Scatter Plot:

LSR <- LSR + geom_smooth(method = "lm", se = FALSE)

Interactive Plot of Discrete Variables:

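library(plotly)  # provides ggplotly(); also attaches ggplot2
# Horsepower against cylinders, colored by mpg, with a least squares line
P_LSR <- ggplot(mtcars, aes(x = cyl, y = hp, color = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Cylinders", y = "Horsepower")
ggplotly(P_LSR, height = 380, width = 780)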

Thank You