Linear Regression is a statistical tool that is used to predict a value for \(y\) given a predictor variable \(x\)
We will review the most basic version of Linear Regression: Simple Linear Regression
2023-04-15
Linear Regression assumes an approximately linear relationship between two variables \(x\) and \(y\)
Once the model is fitted, it provides a y-intercept (\(\beta_{0}\)) and a value for the coefficient of \(x\) (\(\beta_{1}\)) that can be used to form the linear formula \[y \approx \beta_{0} + \beta_{1}x\]
The best-fit line is found by minimizing the residuals \(y_{i} - \hat{y}_{i}\) between each point and the linear model; more precisely, by minimizing the sum of the squared residuals \(\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}\)
(\(y_{i}\) represents the actual y-value for \(x_{i}\), while \(\hat{y}_{i}\) represents the predicted y-value for \(x_{i}\))
Mathematically, this can be found by calculating the Least Squares Solution for the dataset
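With a single predictor, that least squares solution has a closed form in terms of the sample means \(\bar{x}\) and \(\bar{y}\): \[\hat{\beta}_{1} = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}, \qquad \hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x}\]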
Simple Linear Regression (SLR) refers to a Linear Regression model involving only one \(x\) value
The SLR model calculates a formula that can be used to predict other \(y\) values given an \(x\) value
swiss dataset
We will use the swiss dataset from the R Datasets Package to observe SLR in action.
data(swiss)
str(swiss)
'data.frame':	47 obs. of  6 variables:
 $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
 $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
 $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
swiss plot
Below is the scatter plot of Education (\(x\)) and Examination (\(y\)) from this dataset (each point represents a different province):
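A minimal base-R sketch that would reproduce a similar scatter plot (the axis labels are my own wording, not taken from the original post):
# Scatter plot of Education (x) against Examination (y) from the swiss dataset
data(swiss)
plot(swiss$Education, swiss$Examination,
     xlab = "Education (% beyond primary school)",
     ylab = "Examination (% receiving highest mark)",
     main = "swiss plot")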
We will try to predict the percentage of a province's Swiss army draftees who received the highest mark on the army exam in 1888 (Examination) based on the percentage of draftees with education beyond primary school (Education)
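One way to carry out this fit in R is with lm(); a minimal sketch, assuming the model object name slr_model (my own label, not from the original post):
# Fit Examination ~ Education by ordinary least squares
slr_model <- lm(Examination ~ Education, data = swiss)
coef(slr_model)  # y-intercept (beta_0) and Education slope (beta_1)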
After carrying out the least squares calculation, we obtain our coefficients:
(Intercept)   Education 
 10.1274801   0.5794737 
Using the coefficients provided by our model, we can use this formula to predict \(y\) for any given \(x\): \[y \approx 10.1274801 + 0.5794737x \]
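The same prediction can be obtained from the fitted object itself; a small sketch using the hypothetical slr_model from above:
# Predicted Examination percentage for a province with Education = 20
predict(slr_model, newdata = data.frame(Education = 20))
# equivalent to 10.1274801 + 0.5794737 * 20, about 21.72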
swiss SLR Model Graph
Below is an interactive graph showing the original data along with the fitted model:
This graph shows the distance between each observed point and the fitted model, i.e., the residual \(y_{i} - \hat{y}_{i}\)
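Those residuals can be recovered directly from the fitted object; a brief sketch (again using the hypothetical slr_model name):
# Residuals y_i - y_hat_i for each province
swiss$Examination - fitted(slr_model)
residuals(slr_model)  # the same values via the built-in extractor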
Finally, we will use the summary() function to review some summary statistics from our model
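A minimal sketch of that call, using the hypothetical slr_model name from above:
summary(slr_model)$coefficients  # estimates, standard errors, t values, p-values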
              Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 10.1274801 1.28588895 7.875859 5.228601e-10
Education    0.5794737 0.08851978 6.546262 4.811397e-08
We can see from the p-value for Education (about \(4.8 \times 10^{-8}\), far below the conventional 0.05 threshold) that the Education variable plays a statistically significant role in determining the percentage of draftees who receive the highest mark on their exam
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed., pp. 60–61). Springer.