Chapter 2 of “Mastering Machine Learning with R” focuses on linear regression, often regarded as the foundation of machine learning. The chapter explores the key concepts and techniques involved in linear regression, emphasizing its importance in understanding and implementing machine learning models.
The chapter covers simple (univariate) linear regression, multiple linear regression, qualitative features and dummy coding, interaction terms, and model evaluation and diagnostics.
The chapter also analyses the Anscombe data to show why data should be visualized rather than judged from summary statistics alone; see Anscombe's original paper for a detailed review.
Linear regression is defined as a statistical approach to modeling the relationship between a dependent variable and one or more independent variables. It is a supervised learning algorithm used for both prediction and inference.
Simple linear regression predicts a dependent variable from a single independent variable. The chapter explains how to fit a simple linear regression model by minimizing the sum of squared errors, how to interpret the results, and how to assess the model’s goodness of fit.
Let \(Y\) be the dependent variable, \(\beta_0\) be the intercept, \(\beta_1\) be the coefficient of the independent variable \(X\), and \(\varepsilon\) be the error term.
The equation can be expressed as:
\[Y = \beta_0 + \beta_1X + \varepsilon\]
This is a simple linear regression equation relating the dependent variable \(Y\) to the independent variable \(X\) with an error term \(\varepsilon\).
For model evaluation, we interpret the coefficients and their significance, assess model fit through measures such as R-squared and adjusted R-squared, and check the assumptions of simple linear regression.
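To make the least-squares idea concrete, here is a minimal sketch (not from the chapter; the simulated variables x and y are made up) that computes the slope and intercept minimizing the sum of squared errors by hand and checks them against lm():
set.seed(1)                                    # for reproducibility
x <- runif(50, 0, 10)                          # simulated predictor
y <- 2 + 0.5 * x + rnorm(50)                   # true beta0 = 2, beta1 = 0.5, plus noise
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # least-squares slope
b0 <- mean(y) - b1 * mean(x)                   # least-squares intercept
c(intercept = b0, slope = b1)
coef(lm(y ~ x))                                # lm() returns the same estimates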
To demonstrate the analytics of univariate (simple) linear regression, we will use the Snake River Watershed data set for Wyoming contained in the alr3 package. Our hypothesis is to model water yield as a function of the water content of the year's snowfall.
We start the analysis by loading the necessary library. If you are unable to install the alr3 package directly from R, you can download the zip file and install it manually.
#install.packages("alr3")   # install once if needed
library(alr3)               # the snake data set is in the alr3 package
data(snake)
An initial view of the data structure can be achieved using the following code
attach(snake)
dim(snake)
## [1] 17 2
head(snake)
names(snake) = c("content", "yield")
attach(snake) #reattach data with new names
head(snake)
The data contains 17 rows and two columns, now labeled content and yield (head() shows the first six rows). To get a better glimpse of the data, we visualize it with a scatter plot.
#produce a scatterplot
plot(content, yield, xlab="water content of snow", ylab="water yield",
main="Scatterplot of Snow vs. Yield")
From the scatter plot, we can see a linear pattern in the distribution of the data points, so we will fit a linear model to the data.
#build a linear model
yield.fit = lm(yield~content)
summary(yield.fit)
##
## Call:
## lm(formula = yield ~ content)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1793 -1.5149 -0.3624 1.6276 3.1973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.72538 1.54882 0.468 0.646
## content 0.49808 0.04952 10.058 4.63e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.743 on 15 degrees of freedom
## Multiple R-squared: 0.8709, Adjusted R-squared: 0.8623
## F-statistic: 101.2 on 1 and 15 DF, p-value: 4.632e-08
plot(content, yield)
abline(yield.fit, lwd=3, col="black")
par(mfrow = c(2, 2), oma = c(0, 0, 0, 0))   # arrange the four diagnostic plots in a 2x2 grid
plot(yield.fit)
library(car)        # qqPlot() comes from the car package
qqPlot(yield.fit)
## [1] 7 10
For better understanding, here is an explanation of what the code above does. It performs a linear regression analysis and produces graphical and statistical output for the Snake River Watershed data.
yield.fit = lm(yield ~ content): This line fits a linear regression model to the data, where yield is the dependent variable and content is the independent variable. The model finds the best-fitting line describing the relationship between yield and content.
summary(yield.fit): This line displays the summary of the fitted model, including statistical information such as the coefficients, standard errors, t-values, and p-values.
plot(content, yield): This line creates a scatter plot of the data, with content on the x-axis and yield on the y-axis, allowing a visual inspection of the relationship between the two variables.
abline(yield.fit, lwd=3, col="black"): This line adds the best-fitting regression line to the scatter plot, drawn from the coefficients of the fitted model (yield.fit); the lwd and col parameters set the line width and color.
par(mfrow=c(2,2)): This line sets the plotting layout to a 2x2 grid so that the following diagnostic plots are displayed together.
plot(yield.fit): This line produces a series of diagnostic plots for the linear regression model, which help assess the assumptions of linear regression and the goodness of fit.
qqPlot(yield.fit): This line creates a quantile-quantile (QQ) plot, one of the diagnostic plots used to check whether the residuals of the model follow a normal distribution; deviations from the straight line may indicate departures from normality.
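As a complement to the QQ plot, and not part of the chapter's own code, a formal check of residual normality can be run with the Shapiro-Wilk test; a one-line sketch:
shapiro.test(residuals(yield.fit))   # null hypothesis: the residuals are normally distributed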
The following interpretation can be drawn from the analysis:
The coefficient of content (β₁) is estimated to be 0.49808 with a standard error of 0.04952. The p-value for content is very low (4.63e-08), flagged by '***', which means it is highly statistically significant.
The multiple R-squared of 0.8709 indicates that roughly 87% of the variability in the dependent variable (yield) can be explained by the independent variable (content).
plot(content, yield) shows the scatter plot of the data points, with the regression line added using abline(yield.fit, lwd=3, col="black").
par(mfrow=c(2,2)) sets the plotting layout to a 2x2 grid for the diagnostic plots.
plot(yield.fit) and qqPlot(yield.fit) generate diagnostic plots to check for homoscedasticity and normality of residuals, respectively.
Overall, the analysis suggests that content has a significant effect on yield, and the model explains a substantial amount of the variability in the response variable.
Multiple linear regression extends the concept of simple linear regression by incorporating multiple independent variables. We will discuss how to build and interpret multiple linear regression models, including assessing multicollinearity and selecting significant variables.
The regression equation is given by:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \varepsilon\]
In this case, we also assess multicollinearity among the independent variables using techniques such as the variance inflation factor (VIF), and select significant variables through methods like best subset or stepwise regression, guided by domain knowledge.
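The worked example that follows uses best subset selection; as a hedged alternative sketch (not from the book), stepwise selection can be run with base R's step() on a hypothetical data frame df with response y:
full.fit <- lm(y ~ ., data = df)                 # hypothetical full model with all candidate predictors
step.fit <- step(full.fit, direction = "both",   # stepwise search, adding and dropping terms
                 k = log(nrow(df)))              # k = log(n) penalizes like BIC
summary(step.fit)                                # inspect the retained predictors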
To demonstrate this concept, we will use the water data set in the alr3 package. The data contains 43 years of snow precipitation measured at six different sites in the Owens Valley. The response variable is stream runoff volume near Bishop, California, which feeds the Owens Valley aqueduct and, eventually, the Los Angeles Aqueduct. To begin, here is an explanation of the code we will use for the analysis and the tasks it performs:
data(water): Loads the water dataset into the R environment.
dim(water): Returns the dimensions (number of rows and columns) of the water dataset.
str(water): Provides the structure of the water dataset, showing the data types and a summary of each variable.
head(water): Displays the first few rows of the water dataset.
socal.water = water[, -1]: Creates a new dataset, socal.water, by removing the first column (the Year identifier) from the water dataset.
library(corrplot): Loads the corrplot package for correlation analysis.
water.cor = cor(socal.water): Computes the correlation matrix for the variables in the socal.water dataset.
corrplot(water.cor, method="ellipse"): Plots the correlation matrix using ellipses to represent the correlation strength between variables.
library(leaps): Loads the leaps package for model selection.
fit=lm(BSAAM~., data=socal.water): Fits a linear regression model with the dependent variable BSAAM and all other variables in socal.water as predictors.
sub.fit = regsubsets(BSAAM~., data=socal.water): Performs best subset selection using the regsubsets function, where BSAAM is the response variable and "." indicates that all other variables are candidate predictors.
best.summary = summary(sub.fit): Extracts the summary information from the best subset selection.
names(best.summary): Displays the names of the available summary components.
which.min(best.summary$rss): Identifies the subset with the minimum residual sum of squares (RSS) from the best subset selection.
par(mfrow=c(1,2)): Sets the plotting layout to a 1x2 grid for the following two plots.
plot(best.summary$cp, xlab="number of features", ylab="cp"): Plots the Cp statistic against the number of features in the model, aiming to find the model with the lowest Cp value.
plot(sub.fit, scale="Cp"): Plots the best subset selection results based on the Cp statistic.
which.min(best.summary$bic): Identifies the subset with the minimum Bayesian Information Criterion (BIC) from the best subset selection.
which.max(best.summary$adjr2): Identifies the subset with the highest adjusted R-squared value from the best subset selection.
data(water)
dim(water)
## [1] 43 8
str(water)
## 'data.frame': 43 obs. of 8 variables:
## $ Year : int 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 ...
## $ APMAM : num 9.13 5.28 4.2 4.6 7.15 9.7 5.02 6.7 10.5 9.1 ...
## $ APSAB : num 3.58 4.82 3.77 4.46 4.99 5.65 1.45 7.44 5.85 6.13 ...
## $ APSLAKE: num 3.91 5.2 3.67 3.93 4.88 4.91 1.77 6.51 3.38 4.08 ...
## $ OPBPC : num 4.1 7.55 9.52 11.14 16.34 ...
## $ OPRC : num 7.43 11.11 12.2 15.15 20.05 ...
## $ OPSLAKE: num 6.47 10.26 11.35 11.13 22.81 ...
## $ BSAAM : int 54235 67567 66161 68094 107080 67594 65356 67909 92715 70024 ...
head(water)
socal.water = water[ ,-1]
head(socal.water)
library(corrplot)   # load the corrplot package if you haven't done so previously
water.cor = cor(socal.water)
water.cor
## APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE
## APMAM 1.0000000 0.82768637 0.81607595 0.12238567 0.1544155 0.10754212
## APSAB 0.8276864 1.00000000 0.90030474 0.03954211 0.1056396 0.02961175
## APSLAKE 0.8160760 0.90030474 1.00000000 0.09344773 0.1063836 0.10058669
## OPBPC 0.1223857 0.03954211 0.09344773 1.00000000 0.8647073 0.94334741
## OPRC 0.1544155 0.10563959 0.10638359 0.86470733 1.0000000 0.91914467
## OPSLAKE 0.1075421 0.02961175 0.10058669 0.94334741 0.9191447 1.00000000
## BSAAM 0.2385695 0.18329499 0.24934094 0.88574778 0.9196270 0.93843604
## BSAAM
## APMAM 0.2385695
## APSAB 0.1832950
## APSLAKE 0.2493409
## OPBPC 0.8857478
## OPRC 0.9196270
## OPSLAKE 0.9384360
## BSAAM 1.0000000
corrplot(water.cor, method="ellipse")
library(leaps)      # load the leaps library
fit=lm(BSAAM~., data=socal.water)
sub.fit = regsubsets(BSAAM~., data=socal.water)
best.summary = summary(sub.fit)
names(best.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
which.min(best.summary$rss)
## [1] 6
par(mfrow=c(1,2))
plot(best.summary$cp, xlab="number of features", ylab="cp")
plot(sub.fit, scale="Cp")
which.min(best.summary$bic)
## [1] 3
which.max(best.summary$adjr2)
## [1] 3
The given R code performs the following tasks. The corrplot function is used to visualize the correlation matrix, with ellipses representing the strength of each correlation. A detailed interpretation of the correlation result follows.
Interpreting the correlation matrix
APMAM and APSAB: These variables have a strong positive correlation of approximately 0.828. This indicates that as the value of one variable (APMAM) increases, the value of the other variable (APSAB) tends to increase as well.
APMAM and APSLAKE: APMAM and APSLAKE also have a strong positive correlation of approximately 0.816. As the value of APMAM increases, the value of APSLAKE tends to increase too.
APSAB and APSLAKE: APSAB and APSLAKE have a strong positive correlation of approximately 0.900. This indicates that these two variables tend to increase or decrease together.
APMAM and OPBPC: APMAM and OPBPC have a weak positive correlation of approximately 0.122. The correlation is closer to zero, suggesting a weaker relationship between these two variables.
APMAM and OPRC: APMAM and OPRC have a weak positive correlation of approximately 0.154. Similar to the previous correlation, it indicates a weak relationship between the two variables.
APMAM and OPSLAKE: APMAM and OPSLAKE have a weak positive correlation of approximately 0.108, indicating a weak relationship.
APMAM and BSAAM: APMAM and BSAAM have a positive correlation of approximately 0.239, indicating a weak to moderate relationship between these two variables.
APSAB and OPBPC: APSAB and OPBPC have a weak positive correlation of approximately 0.040, suggesting a weak relationship between them.
APSAB and OPRC: APSAB and OPRC have a weak positive correlation of approximately 0.106.
APSAB and OPSLAKE: APSAB and OPSLAKE have a weak positive correlation of approximately 0.030.
APSAB and BSAAM: APSAB and BSAAM have a positive correlation of approximately 0.183, indicating a weak to moderate relationship.
APSLAKE and OPBPC: APSLAKE and OPBPC have a weak positive correlation of approximately 0.093.
APSLAKE and OPRC: APSLAKE and OPRC have a weak positive correlation of approximately 0.106.
APSLAKE and OPSLAKE: APSLAKE and OPSLAKE have a weak positive correlation of approximately 0.101.
APSLAKE and BSAAM: APSLAKE and BSAAM have a positive correlation of approximately 0.249, indicating a weak to moderate relationship.
OPBPC and OPRC: OPBPC and OPRC have a strong positive correlation of approximately 0.865.
OPBPC and OPSLAKE: OPBPC and OPSLAKE have a strong positive correlation of approximately 0.943.
OPBPC and BSAAM: OPBPC and BSAAM have a strong positive correlation of approximately 0.886.
OPRC and OPSLAKE: OPRC and OPSLAKE have a strong positive correlation of approximately 0.919.
OPRC and BSAAM: OPRC and BSAAM have a strong positive correlation of approximately 0.920.
OPSLAKE and BSAAM: OPSLAKE and BSAAM have a strong positive correlation of approximately 0.938.
The correlation matrix provides insights into the relationships between the variables in the water dataset. Variables with strong positive correlations tend to move together, while weak correlations indicate little relationship between them.
The regsubsets function from the leaps package is used to perform best subset selection. It evaluates various subsets of predictors to find the combination of features that contributes most to the model's performance.
The subset with the minimum residual sum of squares (RSS) is identified with which.min(best.summary$rss); this subset includes all 6 predictors.
par(mfrow=c(1,2)) sets the plotting layout to a 1x2 grid for the two selection plots.
which.min(best.summary$bic) identifies the subset with the minimum Bayesian Information Criterion (BIC), which includes 3 predictors.
which.max(best.summary$adjr2) identifies the subset with the highest adjusted R-squared value, which also includes 3 predictors.
With the above results, we need to examine the chosen three-predictor model and test its assumptions.
best.fit = lm(BSAAM~APSLAKE+OPRC+OPSLAKE, data=socal.water)
summary(best.fit)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = socal.water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
par(oma = c(0, 0, 0, 0))
plot(best.fit)
vif(best.fit)
## APSLAKE OPRC OPSLAKE
## 1.011499 6.452569 6.444748
plot(socal.water$OPRC, socal.water$OPSLAKE, xlab="OPRC", ylab="OPSLAKE")
best.summary$adjr2
## [1] 0.8777515 0.9001619 0.9185369 0.9168706 0.9146772 0.9123079
fit.2 = lm(BSAAM~APSLAKE+OPSLAKE, data=socal.water)
summary(fit.2)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE, data = socal.water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13335.8 -5893.2 -171.8 4219.5 19500.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19144.9 3812.0 5.022 1.1e-05 ***
## APSLAKE 1768.8 553.7 3.194 0.00273 **
## OPSLAKE 3689.5 196.0 18.829 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.9002
## F-statistic: 190.3 on 2 and 40 DF, p-value: < 2.2e-16
## Produce diagnostic plots
par(oma = c(0, 0, 0, 0))
plot(fit.2)
### Perform variance inflation factor analysis
vif(fit.2)
## APSLAKE OPSLAKE
## 1.010221 1.010221
library(lmtest)     # bptest() comes from the lmtest package
bptest(fit.2)
##
## studentized Breusch-Pagan test
##
## data: fit.2
## BP = 0.0046205, df = 2, p-value = 0.9977
plot(fit.2$fitted.values, socal.water$BSAAM, xlab="predicted", ylab="actual", main="Predicted vs. Actual")
Analysis Interpretation
The given output shows the results of two linear regression models applied to the socal.water dataset, focusing on predicting the variable BSAAM using different combinations of predictor variables. We will interpret each model's results and the diagnostic plots.
Model 1 (best.fit):
The first model includes three predictor variables, “APSLAKE,” “OPRC,” and “OPSLAKE,” to predict “BSAAM.”
Coefficients: The coefficients represent the estimated relationship between each predictor and the response variable.
Significance: The coefficients of all predictor variables are statistically significant, as indicated by the stars (***, **) in the “Pr(>|t|)” column.
Model Performance: The model’s adjusted R-squared value is 0.9185, which suggests that approximately 91.85% of the variability in “BSAAM” is explained by the predictors in the model.
Residuals: The residual standard error is 7284, which measures the average distance between the observed values and the predicted values from the model.
Model 2 (fit.2):
The second model includes two predictor variables, “APSLAKE” and “OPSLAKE,” to predict “BSAAM.”
Coefficients: The coefficients represent the estimated relationship between each predictor and the response variable.
Significance: Both coefficients are statistically significant, as indicated by the stars (***, **) in the “Pr(>|t|)” column.
Model Performance: The model’s adjusted R-squared value is 0.9002, which suggests that approximately 90.02% of the variability in “BSAAM” is explained by the predictors in the model.
Residuals: The residual standard error is 8063, which measures the average distance between the observed values and the predicted values from the model.
Diagnostic Plots:
The code produces diagnostic plots for each model. These plots help assess the model’s assumptions and check the validity of the linear regression analysis. The diagnostic plots include:
Residuals vs. Fitted: This plot checks for the presence of patterns in the residuals (vertical spread) and identifies potential outliers.
Normal Q-Q Plot: This plot assesses whether the residuals follow a normal distribution. A straight line indicates normality.
Scale-Location (Sqrt(|Residuals|) vs. Fitted): This plot helps check the assumption of constant variance (homoscedasticity). A horizontal line indicates homoscedasticity.
Cook’s Distance: This plot identifies influential observations that might significantly impact the model’s fit.
Variance Inflation Factor (VIF):
The VIF analysis checks for multicollinearity between predictor variables. In both models, the VIF values for “APSLAKE” and “OPSLAKE” are close to 1, suggesting no significant multicollinearity issues.
Breusch-Pagan Test:
The Breusch-Pagan test is performed to check for heteroscedasticity. In both models, the test does not reject the null hypothesis of homoscedasticity, indicating no significant evidence of heteroscedasticity.
Predicted vs. Actual Plot:
The code creates a plot showing the predicted values vs. the actual values of “BSAAM.” This plot helps visually assess how well the model predictions align with the observed values.
Both models show promising results in predicting “BSAAM,” with statistically significant predictor variables and satisfactory model performance. The diagnostic plots suggest that the models meet the assumptions of linear regression.
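Because fit.2 is nested inside best.fit (it simply drops OPRC), the two models can also be compared with a partial F-test; this one-liner is an added sketch, not part of the chapter's code:
anova(fit.2, best.fit)   # tests whether adding OPRC significantly improves the fit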
Before we proceed, we will add the predicted values to our data frame socal.water. We will also copy BSAAM into a new column named Actual and create a Forecast column to hold the model's predictions. The following code does just that.
socal.water["Actual"] = water$BSAAM
socal.water["Forecast"] = NA
socal.water$Forecast = predict(fit.2)
head(socal.water)
Using ggplot2, we can create a more polished version of this plot.
library(ggplot2)    # load ggplot2 if you haven't done so previously
ggplot(socal.water, aes(x=Forecast, y=Actual)) + geom_point() + geom_smooth(method=lm) + labs(title = "Forecast versus Actuals")
### Leave-one-out (LOO) model selection
The idea behind LOO model selection is captured well by this quote from Professor Tarpey of Wright State University:
Often we use regression models to predict future observations. We can use our data to fit the model. However, it is cheating to then assess how well the model predicts responses using the same data that was used to estimate the model – this will tend to give overly optimistic results in terms of how well a model is able to predict future observations. If we leave out an observation, fit the model and then predict the left out response, then this will give a less biased idea of how well the model predicts.
In linear models, LOOCV can easily be carried out by examining the Prediction Error Sum of Squares (PRESS) statistic and selecting the model with the lowest value. The following R code chunk does this for both models.
library(MPV)        # PRESS() comes from the MPV package
PRESS(best.fit)
## [1] 2426757258
PRESS(fit.2)
## [1] 2992801411
PRESS.best = sum((resid(best.fit)/(1-hatvalues(best.fit)))^2)
PRESS.fit.2 = sum((resid(fit.2)/(1-hatvalues(fit.2)))^2)
PRESS.best
## [1] 2426757258
PRESS.fit.2
## [1] 2992801411
The given R code calculates the PRESS (Predicted Residual Sum of Squares) values for two linear regression models, “best.fit” and “fit.2,” using the MPV package. PRESS is a measure of the predictive power of a regression model, and it assesses how well the model predicts new observations.
Here’s the interpretation of the results:
PRESS(best.fit): The PRESS value for the “best.fit” model is approximately 2,426,757,258. This value represents the sum of the squared differences between the observed response variable (“BSAAM”) and the predicted values from the model. A lower PRESS value indicates better predictive performance, meaning that the “best.fit” model predicts the response variable more accurately.
PRESS(fit.2): The PRESS value for the “fit.2” model is approximately 2,992,801,411. This value represents the sum of the squared differences between the observed response variable (“BSAAM”) and the predicted values from the model. Comparing this value to the PRESS for “best.fit,” we can see that the “fit.2” model has a higher PRESS, indicating less accurate predictions compared to the “best.fit” model.
PRESS.best: The variable PRESS.best is calculated manually by summing the squared leave-one-out residuals, resid/(1 - hat value), for the best.fit model. The resulting value is approximately 2,426,757,258, which matches the result obtained from the PRESS function and confirms that the calculation is consistent.
PRESS.fit.2: The variable PRESS.fit.2 is calculated in the same way for the fit.2 model. The resulting value is approximately 2,992,801,411, which also matches the result obtained from the PRESS function, confirming the consistency of the calculation for fit.2.
From the above result, the PRESS values give an indication of the predictive performance of the regression models. A lower PRESS value indicates better predictive accuracy, suggesting that the “best.fit” model performs better in predicting the response variable “BSAAM” compared to the “fit.2” model. It’s important to consider the PRESS values in the context of the data and the modeling goals to make informed decisions about model selection and evaluation.
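To tie the PRESS shortcut back to the quote above, here is a hedged sketch (not from the book) that carries out leave-one-out cross-validation by brute force, refitting the model with each observation held out; the helper name loo_press is made up:
loo_press <- function(formula, data) {
  sq.err <- sapply(seq_len(nrow(data)), function(i) {
    fit.i  <- lm(formula, data = data[-i, ])                     # refit without row i
    pred.i <- predict(fit.i, newdata = data[i, , drop = FALSE])  # predict the held-out row
    (data[i, all.vars(formula)[1]] - pred.i)^2                   # squared prediction error
  })
  sum(sq.err)
}
loo_press(BSAAM ~ APSLAKE + OPSLAKE, socal.water)   # should reproduce PRESS(fit.2), since PRESS is the sum of squared leave-one-out errors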
A qualitative feature (also called a factor) can take on two or more levels, for example ON/OFF or Bad/Neutral/Good. If we have a feature with two levels, we can create what is known as an indicator or dummy feature, arbitrarily coding one level as zero and the other as one.
The linear model with the indicator would still follow the same formulation as before, that is, \[Y = \beta_0 + \beta_1X + \varepsilon\]
If we code the feature so that OFF equals zero and ON equals one, then the expected value for OFF is just the intercept, \(\beta_0\), while for ON it is \(\beta_0 + \beta_1\). A feature with more than two levels requires n-1 indicators. We will use the Carseats dataset from the ISLR package to demonstrate this concept.
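As a quick illustration of how R builds such indicators (a minimal sketch; the switch.state variable is made up and not part of the Carseats example), model.matrix() shows the 0/1 column created for a two-level factor:
switch.state <- factor(c("OFF", "ON", "ON", "OFF"))   # hypothetical two-level factor
model.matrix(~ switch.state)                          # intercept plus a 0/1 indicator for ON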
library(ISLR)       # the Carseats data set is in the ISLR package
data(Carseats)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
sales.fit = lm(Sales~Advertising+ShelveLoc, data=Carseats)
summary(sales.fit)
##
## Call:
## lm(formula = Sales ~ Advertising + ShelveLoc, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6480 -1.6198 -0.0476 1.5308 6.4098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.89662 0.25207 19.426 < 2e-16 ***
## Advertising 0.10071 0.01692 5.951 5.88e-09 ***
## ShelveLocGood 4.57686 0.33479 13.671 < 2e-16 ***
## ShelveLocMedium 1.75142 0.27475 6.375 5.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.244 on 396 degrees of freedom
## Multiple R-squared: 0.3733, Adjusted R-squared: 0.3685
## F-statistic: 78.62 on 3 and 396 DF, p-value: < 2.2e-16
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
The given R code performs a linear regression analysis on the “Carseats” dataset, which contains information about the sales of car seats in different stores. The model predicts “Sales” based on two predictor variables: “Advertising” and “ShelveLoc” (Shelf Location). Here’s the interpretation of the results:
Coefficients: The coefficients represent the estimated relationship between each predictor and the response variable (“Sales”).
Significance: All coefficients have very low p-values (< 0.001), marked ’***’ in the coefficients table of the summary output. This indicates strong evidence of a significant relationship between the predictor variables and sales.
Residuals: The residuals represent the differences between the observed sales and the predicted sales from the model. The minimum and maximum residuals are approximately -6.6480 and 6.4098, respectively. The residuals have a median close to zero, suggesting that the model has captured the overall trend in sales well.
Model Performance: The multiple R-squared value is 0.3733, indicating that about 37.33% of the variability in sales is explained by the model. The Adjusted R-squared value is 0.3685, which takes into account the number of predictor variables in the model.
F-statistic: The F-statistic tests whether the model’s overall regression fit is significant. In this case, the F-statistic is 78.62 with an extremely low p-value (< 0.001), indicating that the model as a whole is statistically significant.
The regression model indicates that both “Advertising” and “ShelveLoc” are important predictors of “Sales.” The “Good” and “Medium” shelf locations are associated with higher sales compared to the “Bad” shelf location. Additionally, increasing the advertising expenditure is expected to boost sales. However, it’s important to note that the model explains only a moderate amount of the variability in sales, suggesting that there might be other factors influencing sales that are not included in the model.
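To see the dummy coding at work, here is a hedged sketch (not part of the chapter's code) that predicts sales for three hypothetical stores with the same advertising budget but different shelf locations; the new.stores data frame is made up:
new.stores <- data.frame(
  Advertising = c(10, 10, 10),                                  # identical advertising spend
  ShelveLoc   = factor(c("Bad", "Medium", "Good"),
                       levels = levels(Carseats$ShelveLoc))     # match the factor levels in the data
)
predict(sales.fit, newdata = new.stores)                        # differences reflect only the ShelveLoc dummies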
If the effect of one feature on the prediction depends on the value of another feature, this is termed an interaction. The linear regression model with an interaction term can be written as:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \varepsilon\]
Where \(Y\) is the response (dependent) variable, \(\beta_0\) is the intercept term, \(\beta_1\) and \(\beta_2\) are the coefficients of the predictors \(X_1\) and \(X_2\), \(\beta_3\) is the coefficient of the interaction term \(X_1X_2\), and \(\varepsilon\) represents the error term (residuals).
The example we will use here is the Boston dataset from the MASS library. The response is median home value, which is medv in the output; we will use two features, the percentage of homes with a low socioeconomic status, termed lstat, and the age of the home in years, termed age.
library(MASS)       # the Boston data set is in the MASS package
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
value.fit = lm(medv~lstat*age, data=Boston)
summary(value.fit)
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
The above result performs a multiple linear regression analysis on the “Boston” dataset, which contains various attributes related to housing in Boston. The model predicts the “medv” variable (median value of owner-occupied homes in $1000s) based on two predictor variables: “lstat” (percentage of lower status of the population) and “age” (proportion of owner-occupied units built prior to 1940), as well as their interaction term “lstat:age”. Further interpretation of the results are given below:
Coefficients: The coefficients represent the estimated relationships between each predictor variable and the response variable (“medv”), as well as their interaction effect.
Significance: The significance of each coefficient is indicated by the asterisks next to it in the summary output. Three asterisks (’***’) next to the intercept and lstat indicate that these coefficients are highly significant (p-value < 0.001). A single asterisk (’*’) next to lstat:age indicates that the interaction is statistically significant (p-value = 0.0252), while age on its own is not significant (p-value = 0.9711).
Residuals: The residuals represent the differences between the observed median home values and the predicted values from the model. The residuals have a median close to zero, suggesting that the model has captured the overall trend in median home values relatively well.
Model Performance: The multiple R-squared value is 0.5557, indicating that approximately 55.57% of the variability in median home values is explained by the model. The Adjusted R-squared value is 0.5531, which takes into account the number of predictor variables in the model.
F-statistic: The F-statistic tests whether the model’s overall regression fit is significant. In this case, the F-statistic is 209.3 with an extremely low p-value (< 0.001), indicating that the model as a whole is statistically significant.
Overall, the regression model suggests that “lstat” has a strong negative effect on median home values, while “age” has little to no impact after considering the interaction effect of “lstat” and “age.” The model provides a reasonable fit to the data, explaining approximately 55.57% of the variability in median home values. However, as with any statistical model, it’s essential to consider other factors and perform further analysis to understand the complete picture.
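One way to read the interaction (a hedged sketch, not part of the chapter's code) is to compute the effective slope of lstat at a few example ages from the fitted coefficients:
b <- coef(value.fit)                                   # fitted coefficients
ages <- c(30, 60, 90)                                  # example ages
data.frame(age = ages,
           lstat.slope = b["lstat"] + b["lstat:age"] * ages)   # slope of lstat at each age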
In machine learning, we train and test a model to predict or forecast an outcome. In this chapter, we took a close look at the simple but highly effective method of linear regression for predicting a quantitative response. Many of the advanced techniques covered in later chapters are simply extensions of what we learned here. We also discussed the danger of relying on summary statistics alone, without visually inspecting the dataset, when selecting a model.
Evaluating the performance of linear regression models is crucial. Evaluation metrics such as mean squared error (MSE), the coefficient of determination (R-squared), adjusted R-squared, and hypothesis tests for individual coefficients (t-tests) are all encouraged. The following points are worth keeping in mind; a small worked sketch follows the list.
Mean squared error (MSE) measures the average squared difference between predicted and actual values.
R-squared represents the proportion of variance explained by the model.
Adjusted R-squared adjusts for the number of predictors in the model.
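As a hedged sketch (the helper name model_metrics is made up), these quantities can be computed by hand for any fitted lm object and checked against summary():
model_metrics <- function(fit) {
  y    <- model.response(model.frame(fit))   # observed response
  yhat <- fitted(fit)                        # fitted values
  n    <- length(y)
  p    <- length(coef(fit)) - 1              # number of predictors
  mse  <- mean((y - yhat)^2)                 # mean squared error
  r2   <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
  adj  <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
  c(MSE = mse, R.squared = r2, Adj.R.squared = adj)
}
model_metrics(fit.2)   # e.g. the two-predictor water model; matches summary(fit.2)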
Linear regression relies on several assumptions about the data. The chapter covers assumptions such as linearity, independence, homoscedasticity, and normality, along with diagnostic techniques to check if these assumptions are met.
Finally, feature engineering involves transforming and selecting relevant features to improve model performance. The chapter explains techniques such as polynomial regression, interaction terms, and the handling of categorical variables in the context of linear regression.