Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between numeric input and output variables. More specifically, it is a linear model: it assumes that the single output variable (y) can be calculated from a linear combination of the input variables (x). As such, both the input values (x) and the output value (y) are numeric.
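In equation form, the model is y = B0 + B1*x1 + B2*x2 + ... + Bn*xn, where B0 is the intercept and B1 through Bn are the coefficients estimated from the data.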
In addition to being a statistical algorithm, linear regression is also used in machine learning for making predictions. There are a number of techniques that can be used to prepare (or train) the linear regression equation from data, four of which I demonstrate in this analysis: Ordinary Least Squares regression, stepwise linear regression, Principal Component Regression (PCR), and Partial Least Squares (PLS) regression.
I’ll be applying these techniques to the longley dataset, which records 7 economic variables observed annually from 1947 to 1962; six of them are used to predict the seventh, the number of people employed each year.
# Load the pls package (provides the pcr() and plsr() functions used below)
library(pls)
# Load data
data(longley)
# Examine structure and variable types
str(longley)
'data.frame': 16 obs. of 7 variables:
$ GNP.deflator: num 83 88.5 88.2 89.5 96.2 ...
$ GNP : num 234 259 258 285 329 ...
$ Unemployed : num 236 232 368 335 210 ...
$ Armed.Forces: num 159 146 162 165 310 ...
$ Population : num 108 109 110 111 112 ...
$ Year : int 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
$ Employed : num 60.3 61.1 60.2 61.2 63.2 ...
# View first few lines
head(longley)
GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
1950 89.5 284.599 335.1 165.0 110.929 1950 61.187
1951 96.2 328.975 209.9 309.9 112.075 1951 63.221
1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
# Summarize the data set
summary(longley)
GNP.deflator GNP Unemployed Armed.Forces
Min. : 83.00 Min. :234.3 Min. :187.0 Min. :145.6
1st Qu.: 94.53 1st Qu.:317.9 1st Qu.:234.8 1st Qu.:229.8
Median :100.60 Median :381.4 Median :314.4 Median :271.8
Mean :101.68 Mean :387.7 Mean :319.3 Mean :260.7
3rd Qu.:111.25 3rd Qu.:454.1 3rd Qu.:384.2 3rd Qu.:306.1
Max. :116.90 Max. :554.9 Max. :480.6 Max. :359.4
Population Year Employed
Min. :107.6 Min. :1947 Min. :60.17
1st Qu.:111.8 1st Qu.:1951 1st Qu.:62.71
Median :116.8 Median :1954 Median :65.50
Mean :117.4 Mean :1954 Mean :65.32
3rd Qu.:122.3 3rd Qu.:1958 3rd Qu.:68.29
Max. :130.1 Max. :1962 Max. :70.55
# Visualize each variable with a boxplot
x<-longley[, 1:7]
par(mfrow=c(1, 7))
for(i in 1:7) {
  boxplot(x[,i], main=names(longley)[i])
}
When there is more than one input variable, Ordinary Least Squares (OLS) regression can be used to estimate the values of the coefficients. OLS is a linear model that seeks a set of coefficients for a line/hyperplane that minimizes the sum of the squared errors. In other words, given a regression line through the data, it calculates the distance from each data point to the line, squares it, and sums all of the squared errors together. This sum is the quantity that ordinary least squares minimizes.
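Before turning to lm(), here is a minimal sketch of that least-squares problem solved directly with qr.solve() (for illustration only; lm() performs an equivalent QR-based fit internally):
# Design matrix: an intercept column plus the 6 predictors
X <- cbind(Intercept = 1, as.matrix(longley[, 1:6]))
y <- longley$Employed
# qr.solve(X, y) returns the coefficients minimizing sum((y - X %*% b)^2)
betaHat <- qr.solve(X, y)
sum((y - X %*% betaHat)^2)  # the minimized residual sum of squares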
# Fit model
olsFit<-lm(Employed ~., longley)
# Summarize the fit
summary(olsFit)
Call:
lm(formula = Employed ~ ., data = longley)
Residuals:
Min 1Q Median 3Q Max
-0.41011 -0.15767 -0.02816 0.10155 0.45539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.482e+03 8.904e+02 -3.911 0.003560 **
GNP.deflator 1.506e-02 8.492e-02 0.177 0.863141
GNP -3.582e-02 3.349e-02 -1.070 0.312681
Unemployed -2.020e-02 4.884e-03 -4.136 0.002535 **
Armed.Forces -1.033e-02 2.143e-03 -4.822 0.000944 ***
Population -5.110e-02 2.261e-01 -0.226 0.826212
Year 1.829e+00 4.555e-01 4.016 0.003037 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10
# Make predictions
olsPredictions<-predict(olsFit, longley)
# Summarize accuracy (MSE on the training data)
olsMSE<-mean((longley$Employed - olsPredictions)^2)
print(olsMSE)
[1] 0.0522765
Stepwise Linear Regression is a method that uses linear regression to discover which subset of attributes in the dataset results in the best performing model. It is step-wise because each iteration of the method changes the set of attributes and fits a model to evaluate the new set; R's step() function drops (or re-adds) one term at a time, keeping whichever change most improves the Akaike Information Criterion (AIC), as the trace below shows.
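As an aside, step() searches in both directions by default (dropping terms and reconsidering ones it has dropped); a strictly backward elimination could be requested explicitly, though I don't use that variant here:
# Hypothetical alternative: backward-only elimination
# slrFitBackward<-step(base, direction = "backward")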
# Fit model
base<-lm(Employed ~., longley)
# Summarize the fit
summary(base)
Call:
lm(formula = Employed ~ ., data = longley)
Residuals:
Min 1Q Median 3Q Max
-0.41011 -0.15767 -0.02816 0.10155 0.45539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.482e+03 8.904e+02 -3.911 0.003560 **
GNP.deflator 1.506e-02 8.492e-02 0.177 0.863141
GNP -3.582e-02 3.349e-02 -1.070 0.312681
Unemployed -2.020e-02 4.884e-03 -4.136 0.002535 **
Armed.Forces -1.033e-02 2.143e-03 -4.822 0.000944 ***
Population -5.110e-02 2.261e-01 -0.226 0.826212
Year 1.829e+00 4.555e-01 4.016 0.003037 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10
# Perform step-wise feature selection
slrFit<-step(base)
Start: AIC=-33.22
Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population +
Year
Df Sum of Sq RSS AIC
- GNP.deflator 1 0.00292 0.83935 -35.163
- Population 1 0.00475 0.84117 -35.129
- GNP 1 0.10631 0.94273 -33.305
<none> 0.83642 -33.219
- Year 1 1.49881 2.33524 -18.792
- Unemployed 1 1.59014 2.42656 -18.178
- Armed.Forces 1 2.16091 2.99733 -14.798
Step: AIC=-35.16
Employed ~ GNP + Unemployed + Armed.Forces + Population + Year
Df Sum of Sq RSS AIC
- Population 1 0.01933 0.8587 -36.799
<none> 0.8393 -35.163
- GNP 1 0.14637 0.9857 -34.592
- Year 1 1.52725 2.3666 -20.578
- Unemployed 1 2.18989 3.0292 -16.628
- Armed.Forces 1 2.39752 3.2369 -15.568
Step: AIC=-36.8
Employed ~ GNP + Unemployed + Armed.Forces + Year
Df Sum of Sq RSS AIC
<none> 0.8587 -36.799
- GNP 1 0.4647 1.3234 -31.879
- Year 1 1.8980 2.7567 -20.137
- Armed.Forces 1 2.3806 3.2393 -17.556
- Unemployed 1 4.0491 4.9077 -10.908
# Summarize the selected model (GNP.deflator and Population have been dropped)
summary(slrFit)
Call:
lm(formula = Employed ~ GNP + Unemployed + Armed.Forces + Year,
data = longley)
Residuals:
Min 1Q Median 3Q Max
-0.42165 -0.12457 -0.02416 0.08369 0.45268
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.599e+03 7.406e+02 -4.859 0.000503 ***
GNP -4.019e-02 1.647e-02 -2.440 0.032833 *
Unemployed -2.088e-02 2.900e-03 -7.202 1.75e-05 ***
Armed.Forces -1.015e-02 1.837e-03 -5.522 0.000180 ***
Year 1.887e+00 3.828e-01 4.931 0.000449 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2794 on 11 degrees of freedom
Multiple R-squared: 0.9954, Adjusted R-squared: 0.9937
F-statistic: 589.8 on 4 and 11 DF, p-value: 9.5e-13
# Make predictions
slrPredictions<-predict(slrFit, longley)
# Summarize accuracy
slrMSE<-mean((longley$Employed - slrPredictions)^2)
print(slrMSE)
[1] 0.05366753
Principal Component Regression (PCR) first transforms the predictors with a Principal Component Analysis (PCA) and then fits a linear regression model on the resulting components. Because the components are uncorrelated by construction, PCR is useful when the data has highly correlated predictors.
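To make the idea concrete, here is a by-hand sketch of PCR using prcomp() (illustration only; the pcr() call below handles the transformation itself). Note that keeping all 6 components reproduces the ordinary least-squares fit, which is why the PCR MSE further down matches the OLS MSE exactly:
# PCA on the 6 predictors, then regress Employed on the component scores
pc<-prcomp(longley[, 1:6])
pcrByHand<-lm(longley$Employed ~ pc$x)
# A reduced model would regress on the first k scores only, e.g. pc$x[, 1:3]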
# Fit model
pcrFit<-pcr(Employed ~ ., data = longley, validation = "CV")
# Summarize the fit
summary(pcrFit)
Data: X dimension: 16 6
Y dimension: 16 1
Fit method: svdpc
Number of components considered: 6
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
X 64.96 94.90 99.99 100.00 100.00 100.00
Employed 78.42 89.73 98.51 98.56 98.83 99.55
# Make predictions
pcrPredictions<-predict(pcrFit, longley, ncomp = 6)
# Summarize accuracy
pcrMSE<-mean((longley$Employed - pcrPredictions)^2)
print(pcrMSE)
[1] 0.0522765
Partial Least Squares (PLS) Regression, like PCR, creates a linear model of the data in a transformed projection of the problem space; unlike PCR, it constructs the components to have high covariance with the output variable rather than merely summarizing the inputs. Like PCR, PLS is appropriate for data with highly correlated predictors.
# Fit model
plsFit<-plsr(Employed ~., data = longley, validation = "CV")
# Summarize the fit
summary(plsFit)
Data: X dimension: 16 6
Y dimension: 16 1
Fit method: kernelpls
Number of components considered: 6
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 3.627 1.461 1.007 0.5194 0.5844 0.4937 0.4459
adjCV 3.627 1.434 1.014 0.5128 0.5726 0.4865 0.4341
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
X 63.88 93.35 99.99 100.00 100.00 100.00
Employed 87.91 93.70 98.51 98.65 99.16 99.55
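The CV row above is the cross-validated RMSEP for each number of components; to eyeball where it levels off, the pls package can plot it (an optional step I'm adding for inspection):
# Plot cross-validated RMSEP against the number of components
validationplot(plsFit, val.type = "RMSEP")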
# Make predictions
plsPredictions<-predict(plsFit, longley, ncomp = 6)
# Summarize accuracy
plsMSE<-mean((longley$Employed - plsPredictions)^2)
print(plsMSE)
[1] 0.0522765
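Finally, since all four MSE objects are still in the workspace, they can be compared side by side (a small convenience step of my own, using only objects created above):
# Compare the training MSE of the four models
data.frame(model = c("OLS", "Stepwise", "PCR", "PLS"),
           mse = c(olsMSE, slrMSE, pcrMSE, plsMSE))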