2023-04-15

Linear Regression

Linear Regression is a statistical method for predicting the value of a response variable \(y\) from a predictor variable \(x\)

We will review the most basic version of Linear Regression: Simple Linear Regression

Overview

Linear Regression assumes an approximately linear relationship between two variables \(x\) and \(y\)

Once the model is fitted, it provides a y-intercept (\(\beta_{0}\)) and a value for the coefficient of \(x\) (\(\beta_{1}\)) that can be used to form the linear formula \[y \approx \beta_{0} + \beta_{1}x\]

Least Squares

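The best fit for Regression is found by minimizing the sum of the squared residuals \(\sum_{i} (y_{i} - \hat{y_{i}})^{2}\), where each residual \(y_{i} - \hat{y_{i}}\) is the vertical distance between a point and the linear model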

(\(y_{i}\) represents the actual y-value for \(x_{i}\), while \(\hat{y_{i}}\) represents the predicted y-value for \(x_{i}\))

Mathematically, this can be found by calculating the Least Squares Solution for the dataset
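For a single predictor, the Least Squares Solution has a well-known closed form: \(\beta_{1} = \sum_{i}(x_{i} - \bar{x})(y_{i} - \bar{y}) / \sum_{i}(x_{i} - \bar{x})^{2}\) and \(\beta_{0} = \bar{y} - \beta_{1}\bar{x}\). The sketch below applies these formulas to a small made-up dataset (the values of x and y and the helper rss() are hypothetical, chosen only for illustration) and checks that nudging either coefficient makes the sum of squared residuals larger.

```r
# Toy data, invented for illustration only (not from the swiss dataset).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope: co-variation of x and y divided by the variation of x.
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: makes the line pass through (mean(x), mean(y)).
beta0 <- mean(y) - beta1 * mean(x)

# Sum of squared residuals for any candidate line b0 + b1 * x.
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

c(beta0 = beta0, beta1 = beta1)
rss(beta0, beta1)        # smallest achievable value
rss(beta0, beta1 + 0.1)  # larger: the slope was moved away from the optimum
rss(beta0 + 0.1, beta1)  # larger: the intercept was moved away from the optimum

# R's built-in lm() should return the same coefficients.
coef(lm(y ~ x))
```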

Simple Linear Regression

Simple Linear Regression (SLR) refers to a Linear Regression model involving only one predictor variable \(x\)

Fitting the SLR model yields a formula that can be used to predict \(y\) for any given \(x\) value

swiss dataset

We will use the swiss dataset from the R Datasets Package to observe SLR in action.

data(swiss)
str(swiss)
'data.frame':   47 obs. of  6 variables:
 $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
 $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
 $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

swiss plot

Below is the scatter plot of Education (\(x\)) and Examination (\(y\)) from this dataset (each point represents a different province):
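A scatter plot like this one can be drawn with base R graphics. This is a minimal sketch, not necessarily how the original plot was produced; the axis labels and title are assumptions based on the variable descriptions in the swiss help page.

```r
data(swiss)

# One point per province: Education on the x-axis, Examination on the y-axis.
plot(swiss$Education, swiss$Examination,
     xlab = "Education (% beyond primary school)",
     ylab = "Examination (% receiving highest mark)",
     main = "swiss: Examination vs. Education",
     pch = 19)
```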

SLR Example

We will try to predict the percentage of Swiss army draftees from a province who received the highest mark on the army exam, based on the percentage of draftees educated beyond primary school (the data were collected around 1888)

After running our Least Squares calculation, we obtain our coefficients:

(Intercept)   Education 
 10.1274801   0.5794737 
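Coefficients in this form are what R's lm() function returns. A minimal sketch, assuming the model was fitted with the formula Examination ~ Education (the object name fit is our own choice):

```r
data(swiss)

# Fit Examination as a linear function of Education by least squares.
fit <- lm(Examination ~ Education, data = swiss)

# Named vector holding the intercept (beta_0) and the slope (beta_1).
coef(fit)
```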

Using the coefficients provided by our model, we can use this formula to predict \(y\) for any given \(x\): \[y \approx 10.1274801 + 0.5794737x \]
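As a worked example, a province with Education = 10 would have a predicted Examination value of \(10.1274801 + 0.5794737 \times 10 = 15.9222171\). The sketch below wraps the formula in a small helper (predict_examination is a hypothetical name introduced here, with the coefficients copied from the output above).

```r
# Coefficients copied from the fitted model's output.
beta0 <- 10.1274801
beta1 <- 0.5794737

# Hypothetical helper: predicted Examination (%) for a given Education (%).
predict_examination <- function(education) {
  beta0 + beta1 * education
}

# Works on a single value or a vector of values.
predict_examination(10)
predict_examination(c(5, 10, 20))
```

With a fitted lm object, predict() with a newdata data frame performs the same calculation without copying the coefficients by hand.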

swiss SLR Model Graph

Below is an interactive graph showing the original data along with the fitted model:
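A static version of this graph can be sketched in base R by drawing the data and then overlaying the fitted line with abline(). This is a stand-in under our own assumptions: the library used for the interactive graph is not stated, and the colors and labels here are arbitrary.

```r
data(swiss)
fit <- lm(Examination ~ Education, data = swiss)

# Original data points.
plot(Examination ~ Education, data = swiss,
     pch = 19,
     main = "swiss: data with fitted SLR model")

# abline() reads the intercept and slope directly from the fitted model.
abline(fit, col = "red", lwd = 2)
```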

Residuals

This graph shows the residuals: the vertical distance between each observed point and the fitted model (\(y_{i} - \hat{y_{i}}\))
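The residuals can be extracted with resid() and drawn as vertical segments from each point to the line. This is a minimal base R sketch, not necessarily how the original graph was built; the colors are arbitrary.

```r
data(swiss)
fit <- lm(Examination ~ Education, data = swiss)

# Residual for each province: observed y minus predicted y.
res <- resid(fit)

plot(Examination ~ Education, data = swiss, pch = 19,
     main = "swiss: residuals of the SLR model")
abline(fit, col = "red", lwd = 2)

# One vertical segment per point, from the observed value to the fitted value.
segments(x0 = swiss$Education, y0 = swiss$Examination,
         x1 = swiss$Education, y1 = fitted(fit),
         col = "gray40")
```

Because the model includes an intercept, the residuals sum to zero up to floating-point error, so points above and below the line balance out.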

Summary Statistics

Finally, we will use the summary() function to review some summary statistics from our model

              Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 10.1274801 1.28588895 7.875859 5.228601e-10
Education    0.5794737 0.08851978 6.546262 4.811397e-08
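A coefficient table in this form can be pulled out of the summary object. In this sketch the names fit and tab are our own, and the model formula is assumed to be Examination ~ Education. The last line shows how the t value column is derived: each estimate divided by its standard error.

```r
data(swiss)
fit <- lm(Examination ~ Education, data = swiss)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|).
tab <- coef(summary(fit))
tab

# The t statistic is the estimate divided by its standard error.
tab[, "Estimate"] / tab[, "Std. Error"]
```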

We can see from the p-value for Education (about \(4.8 \times 10^{-8}\), far below the usual 0.05 threshold) that Education is a statistically significant predictor of the percentage of draftees who receive the highest mark on their exam
