Introduction

Chapter 2 of “Mastering Machine Learning with R” focuses on linear regression, which is considered the foundation of machine learning. The chapter explores the key concepts and techniques involved in linear regression, emphasizing its importance in understanding and implementing machine learning models.

The text also analyses the Anscombe data, which illustrates why data should be inspected visually rather than judged by summary statistics alone; for a detailed review of the Anscombe data and the original paper, check here.

The chapter covers the following topics:

1. Introduction to Linear Regression

Linear regression is defined as a statistical approach to modeling the relationship between a dependent variable and one or more independent variables. It is a supervised learning algorithm used for both prediction and inference.

2. Simple Linear Regression

Simple linear regression involves predicting a dependent variable using a single independent variable. The chapter explains how to fit a simple linear regression model and interpret the results, including assessing the model’s goodness of fit by minimizing the sum of squared errors.

Let \(Y\) be the dependent variable, \(\beta_0\) be the intercept, \(\beta_1\) be the coefficient of the independent variable \(X\), and \(\varepsilon\) be the error term.

The equation can be expressed as:

\[Y = \beta_0 + \beta_1X + \varepsilon\]

This is a simple linear regression equation relating the dependent variable \(Y\) to the independent variable \(X\) with an error term \(\varepsilon\).

For model evaluation, we interpret the coefficients and their significance, assess model fit through measures such as R-squared and adjusted R-squared, and finally examine the assumptions of simple linear regression.
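As a quick illustration of what minimizing the sum of squared errors means in practice, the following sketch (on a small simulated data set, not data from the chapter) computes the closed-form least-squares estimates by hand and compares them with lm().

#A minimal sketch on simulated data (not from the chapter) showing that
#the closed-form least-squares estimates match what lm() returns.
set.seed(123)
x = runif(30, 0, 10)
y = 2 + 0.5 * x + rnorm(30, sd = 1)

b1 = cov(x, y) / var(x)       #slope that minimizes the sum of squared errors
b0 = mean(y) - b1 * mean(x)   #intercept
c(intercept = b0, slope = b1)
coef(lm(y ~ x))               #should agree with the hand-computed values

sum((y - (b0 + b1 * x))^2)    #the minimized sum of squared errors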

To demonstrate univariate (simple) linear regression, we will use the Snake River watershed (Wyoming) data set contained in the alr3 package. We hypothesize that water yield can be modeled as a function of the water content of the year's snowfall.

We start the analysis by loading the necessary library. If you are unable to install the alr3 package directly from CRAN, you can download the package archive and install it manually.

#install.packages("alr3")
library(alr3)  #provides the snake data set
data(snake)

An initial view of the data structure can be obtained with the following code:

attach(snake)
dim(snake)
## [1] 17  2
head(snake)
names(snake) = c("content", "yield")
attach(snake) #reattach data with new names
head(snake)

The data contains 17 observations of two variables, which we rename content and yield; head() displays the first six rows. To get a better glimpse of the data, we will visualize it with a scatter plot.

#produce a scatterplot
plot(content, yield, xlab="water content of snow", ylab="water yield",
     main="Scatterplot of Snow vs. Yield")

From the scatter plot, we can see a linear pattern in the distribution of the data point. Hence we will build a linear model over the data.

#build a linear model
yield.fit = lm(yield~content)
summary(yield.fit)
## 
## Call:
## lm(formula = yield ~ content)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1793 -1.5149 -0.3624  1.6276  3.1973 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.72538    1.54882   0.468    0.646    
## content      0.49808    0.04952  10.058 4.63e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.743 on 15 degrees of freedom
## Multiple R-squared:  0.8709, Adjusted R-squared:  0.8623 
## F-statistic: 101.2 on 1 and 15 DF,  p-value: 4.632e-08
plot(content, yield)
abline(yield.fit, lwd=3, col="black")

par(oma = c(0, 0, 0, 0), mfrow = c(2, 2))  #reset outer margins and use a 2x2 grid for the diagnostic plots
plot(yield.fit)

library(car)  #qqPlot() comes from the car package
qqPlot(yield.fit)

## [1]  7 10

For a better understanding, here is an explanation of what the code above does:

This code performs a linear regression analysis and presents graphical and statistical outputs for the Snake River watershed data.

  1. yield.fit = lm(yield ~ content): This line fits a linear regression model to the data, where yield is the dependent variable, and content is the independent variable. The model aims to find the best-fitting line that describes the relationship between yield and content.

  2. summary(yield.fit): This line displays the summary of the linear regression model fitted in the previous step. The summary includes statistical information such as the coefficients, standard errors, t-values, and p-values for the model.

  3. plot(content, yield): This line creates a scatter plot of the data, where the x-axis represents the values of the content variable, and the y-axis represents the yield variable. This plot allows us to visually inspect the relationship between the two variables.

  4. abline(yield.fit, lwd=3, col="black"): This line adds the best-fitting regression line to the scatter plot created in the previous step. The line is drawn based on the coefficients obtained from the linear regression model (yield.fit). The lwd and col parameters specify the line width and color, respectively.

  5. par(oma = c(0, 0, 0, 0), mfrow = c(2, 2)): This line resets the outer plot margins and sets the plotting layout to a 2x2 grid, which means that the following diagnostic plots will be displayed in a 2x2 arrangement.

  6. plot(yield.fit): This line creates a series of diagnostic plots for the linear regression model (yield.fit). These diagnostic plots help assess the assumptions of linear regression and the goodness of fit.

  7. qqPlot(yield.fit): This line creates a quantile-quantile (QQ) plot, which is one of the diagnostic plots used to check if the residuals of the linear regression model follow a normal distribution. Deviations from the straight line in the QQ plot may indicate departures from normality.

The following interpretation can be drawn from the analysis:

  1. Model Coefficients:
    • The intercept (β₀) is estimated to be 0.72538 with a standard error of 1.54882. The p-value for the intercept is 0.646, which indicates that it is not statistically significant at common significance levels (α = 0.05).
    • The coefficient for the variable content (β₁) is estimated to be 0.49808 with a standard error of 0.04952. The p-value for content is very low (4.63e-08), indicated by ’***,’ which means it is highly statistically significant.
  2. Residuals:
    • Residuals represent the difference between the observed values of the dependent variable (yield) and the predicted values from the regression model. The residuals have a minimum value of -2.1793 and a maximum value of 3.1973.
  3. Goodness of Fit:
    • The multiple R-squared value is 0.8709, indicating that approximately 87.09% of the variability in the dependent variable (yield) can be explained by the independent variable (content).
    • The adjusted R-squared value is 0.8623, which takes into account the number of predictors in the model and provides a more accurate measure of the model’s goodness of fit.
  4. F-Statistic and p-value:
    • The F-statistic is 101.2 with degrees of freedom (DF) of 1 and 15. This test assesses whether the overall model is statistically significant. The low p-value (4.632e-08) indicates that the model is highly significant, confirming that the predictor has a significant effect on the response variable.
  5. Diagnostic Plots:
    • The code provides diagnostic plots to assess the model’s assumptions and check the validity of the regression analysis.
    • The plot created using plot(content, yield) shows the scatter plot of the data points with the regression line added using abline(yield.fit, lwd=3, col="black").
    • The call par(oma = c(0, 0, 0, 0), mfrow = c(2, 2)) sets the plotting layout to a 2x2 grid for the diagnostic plots.
    • plot(yield.fit) and qqPlot(yield.fit) generate diagnostic plots to check for homoscedasticity and normality of residuals, respectively.

Overall, the analysis suggests that the variable content has a significant effect on the yield, and the model explains a substantial amount of the variability in the response variable.
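Since linear regression is used for prediction as well as inference, the short sketch below (an addition, not from the chapter) shows how the fitted model yield.fit can forecast yield for a few hypothetical water-content values, with confidence and prediction intervals.

#Hedged example: predicting yield for hypothetical water-content values
#with the yield.fit model fitted above.
new.content = data.frame(content = c(20, 30, 40))
predict(yield.fit, newdata = new.content, interval = "confidence")  #interval for the mean yield
predict(yield.fit, newdata = new.content, interval = "prediction")  #interval for an individual year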

3. Multiple Linear Regression

Multiple linear regression extends the concept of simple linear regression by incorporating multiple independent variables. We will discuss how to build and interpret multiple linear regression models, including assessing multicollinearity and selecting significant variables.

The regression equation is given by:

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \varepsilon\]

In this case, we assess multicollinearity among the independent variables using techniques like the variance inflation factor (VIF) and select significant variables through methods like stepwise regression or domain knowledge.
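The chapter itself uses best subsets selection (shown below); as a hedged sketch of the stepwise alternative mentioned here, base R's step() performs AIC-guided backward elimination. The example uses the built-in mtcars data purely for illustration, not the water data analysed next.

#Illustrative only: backward stepwise selection with base R's step(),
#demonstrated on the built-in mtcars data rather than the water data.
full.fit = lm(mpg ~ ., data = mtcars)
step.fit = step(full.fit, direction = "backward", trace = 0)  #AIC-guided elimination
summary(step.fit)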

To demonstrate this concept, the water data set in the alr3 package will be used. The data contains 43 years of snow precipitation measured at six different sites in the Owens Valley. The response variable is stream runoff volume near Bishop, California, which feeds the Owens Valley aqueduct and, eventually, the Los Angeles Aqueduct. To begin, here is an explanation of the code we will use for the analysis and the tasks it performs:

  1. Load and inspect the dataset:
    • data(water): Loads the dataset named “water” into the R environment.
    • dim(water): Returns the dimensions (number of rows and columns) of the “water” dataset.
    • str(water): Provides the structure of the “water” dataset, showing the data types and a summary of each variable.
    • head(water): Displays the first few rows of the “water” dataset.
  2. Data preprocessing:
    • socal.water = water[, -1]: Creates a new dataset “socal.water” by removing the first column (the Year variable) from the “water” dataset.
  3. Correlation analysis:
    • library(corrplot): Loads the “corrplot” package for correlation analysis.
    • water.cor = cor(socal.water): Computes the correlation matrix for the variables in the “socal.water” dataset.
    • corrplot(water.cor, method="ellipse"): Plots the correlation matrix using ellipses to represent the correlation strength between variables.
  4. Multiple linear regression modeling:
    • library(leaps): Loads the “leaps” package for model selection.
    • fit=lm(BSAAM~., data=socal.water): Fits a linear regression model with the dependent variable “BSAAM” and all other variables as predictors from the “socal.water” dataset.
    • sub.fit = regsubsets(BSAAM~., data=socal.water): Performs a best subset selection using the “regsubsets” function, where “BSAAM” is the response variable, and “.” indicates that all other variables are potential predictors.
  5. Model selection and evaluation:
    • best.summary = summary(sub.fit): Extracts the summary information for the best subset selection.
    • names(best.summary): Displays the names of the summary information available.
    • which.min(best.summary$rss): Identifies the subset with the minimum residual sum of squares (RSS) from the best subset selection.
    • par(mfrow=c(1,2)): Sets the plotting layout to a 1x2 grid for the following two plots.
    • plot(best.summary$cp, xlab="number of features", ylab="cp"): Plots the Cp statistic against the number of features in the model, aiming to find the model with the lowest Cp value.
    • plot(sub.fit, scale="Cp"): Plots the best subset selection results based on the Cp statistic.
    • which.min(best.summary$bic): Identifies the subset with the minimum Bayesian Information Criterion (BIC) from the best subset selection.
    • which.max(best.summary$adjr2): Identifies the subset with the highest adjusted R-squared value from the best subset selection.
data(water)
dim(water)
## [1] 43  8
str(water)
## 'data.frame':    43 obs. of  8 variables:
##  $ Year   : int  1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 ...
##  $ APMAM  : num  9.13 5.28 4.2 4.6 7.15 9.7 5.02 6.7 10.5 9.1 ...
##  $ APSAB  : num  3.58 4.82 3.77 4.46 4.99 5.65 1.45 7.44 5.85 6.13 ...
##  $ APSLAKE: num  3.91 5.2 3.67 3.93 4.88 4.91 1.77 6.51 3.38 4.08 ...
##  $ OPBPC  : num  4.1 7.55 9.52 11.14 16.34 ...
##  $ OPRC   : num  7.43 11.11 12.2 15.15 20.05 ...
##  $ OPSLAKE: num  6.47 10.26 11.35 11.13 22.81 ...
##  $ BSAAM  : int  54235 67567 66161 68094 107080 67594 65356 67909 92715 70024 ...
head(water)
socal.water = water[ ,-1]
head(socal.water)
library(corrplot)  #load the corrplot package if you haven't done so previously
water.cor = cor(socal.water)
water.cor
##             APMAM      APSAB    APSLAKE      OPBPC      OPRC    OPSLAKE
## APMAM   1.0000000 0.82768637 0.81607595 0.12238567 0.1544155 0.10754212
## APSAB   0.8276864 1.00000000 0.90030474 0.03954211 0.1056396 0.02961175
## APSLAKE 0.8160760 0.90030474 1.00000000 0.09344773 0.1063836 0.10058669
## OPBPC   0.1223857 0.03954211 0.09344773 1.00000000 0.8647073 0.94334741
## OPRC    0.1544155 0.10563959 0.10638359 0.86470733 1.0000000 0.91914467
## OPSLAKE 0.1075421 0.02961175 0.10058669 0.94334741 0.9191447 1.00000000
## BSAAM   0.2385695 0.18329499 0.24934094 0.88574778 0.9196270 0.93843604
##             BSAAM
## APMAM   0.2385695
## APSAB   0.1832950
## APSLAKE 0.2493409
## OPBPC   0.8857478
## OPRC    0.9196270
## OPSLAKE 0.9384360
## BSAAM   1.0000000
corrplot(water.cor, method="ellipse")

library(leaps)  #load the leaps library
fit=lm(BSAAM~., data=socal.water)
sub.fit = regsubsets(BSAAM~., data=socal.water)
best.summary = summary(sub.fit)
names(best.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
which.min(best.summary$rss)
## [1] 6
par(mfrow=c(1,2))
plot(best.summary$cp, xlab="number of features", ylab="cp")
plot(sub.fit, scale="Cp")

which.min(best.summary$bic)
## [1] 3
which.max(best.summary$adjr2)
## [1] 3

Model Evaluation and Interpretation

The given R code performs the following tasks:

  1. Data Preparation and Exploration:
    • The dataset named “water” has 43 observations and 8 variables.
    • The variables in the dataset include “Year,” “APMAM,” “APSAB,” “APSLAKE,” “OPBPC,” “OPRC,” “OPSLAKE,” and “BSAAM.”
    • Displays the first few rows of the “water” dataset and creates a new dataset named “socal.water” by excluding the “Year” variable.
  2. Correlation Analysis:
    • The code calculates the correlation matrix (pairwise correlations) between the variables in the “socal.water” dataset.
    • The correlation coefficients range from -1 to 1, with values closer to 1 indicating stronger positive correlations, and values closer to -1 indicating stronger negative correlations. Correlation values close to 0 indicate weak or no correlation between variables.
    • The corrplot function is used to visualize the correlation matrix using ellipses. The following is a detailed interpretation of the correlation results:
    The given output shows the correlation matrix between the variables in the “water” dataset. The correlation matrix is a square matrix where each cell represents the correlation coefficient between two variables. The diagonal elements of the matrix show the correlation of each variable with itself, which is always 1 since a variable is perfectly correlated with itself.

Interpreting the correlation matrix

  1. APMAM and APSAB: These variables have a strong positive correlation of approximately 0.828. This indicates that as the value of one variable (APMAM) increases, the value of the other variable (APSAB) tends to increase as well.

  2. APMAM and APSLAKE: APMAM and APSLAKE also have a strong positive correlation of approximately 0.816. As the value of APMAM increases, the value of APSLAKE tends to increase too.

  3. APSAB and APSLAKE: APSAB and APSLAKE have a strong positive correlation of approximately 0.900. This indicates that these two variables tend to increase or decrease together.

  4. APMAM and OPBPC: APMAM and OPBPC have a weak positive correlation of approximately 0.122. The correlation is closer to zero, suggesting a weaker relationship between these two variables.

  5. APMAM and OPRC: APMAM and OPRC have a weak positive correlation of approximately 0.154. Similar to the previous correlation, it indicates a weak relationship between the two variables.

  6. APMAM and OPSLAKE: APMAM and OPSLAKE have a weak positive correlation of approximately 0.108, indicating a weak relationship.

  7. APMAM and BSAAM: APMAM and BSAAM have a positive correlation of approximately 0.239, indicating a weak to moderate relationship between these two variables.

  8. APSAB and OPBPC: APSAB and OPBPC have a weak positive correlation of approximately 0.040, suggesting a weak relationship between them.

  9. APSAB and OPRC: APSAB and OPRC have a weak positive correlation of approximately 0.106.

  10. APSAB and OPSLAKE: APSAB and OPSLAKE have a weak positive correlation of approximately 0.030.

  11. APSAB and BSAAM: APSAB and BSAAM have a positive correlation of approximately 0.183, indicating a weak to moderate relationship.

  12. APSLAKE and OPBPC: APSLAKE and OPBPC have a weak positive correlation of approximately 0.093.

  13. APSLAKE and OPRC: APSLAKE and OPRC have a weak positive correlation of approximately 0.106.

  14. APSLAKE and OPSLAKE: APSLAKE and OPSLAKE have a weak positive correlation of approximately 0.101.

  15. APSLAKE and BSAAM: APSLAKE and BSAAM have a positive correlation of approximately 0.249, indicating a weak to moderate relationship.

  16. OPBPC and OPRC: OPBPC and OPRC have a strong positive correlation of approximately 0.865.

  17. OPBPC and OPSLAKE: OPBPC and OPSLAKE have a strong positive correlation of approximately 0.943.

  18. OPBPC and BSAAM: OPBPC and BSAAM have a positive correlation of approximately 0.886, indicating a moderate relationship.

  19. OPRC and OPSLAKE: OPRC and OPSLAKE have a strong positive correlation of approximately 0.919.

  20. OPRC and BSAAM: OPRC and BSAAM have a positive correlation of approximately 0.920, indicating a moderate relationship.

  21. OPSLAKE and BSAAM: OPSLAKE and BSAAM have a positive correlation of approximately 0.938, indicating a moderate to strong relationship.

The correlation matrix provides insights into the relationships between the variables in the “water” dataset. Variables with strong positive correlations tend to move together, while weak correlations indicate a weaker relationship.

  3. Multiple Linear Regression Modeling:
    • The code fits a linear regression model with the response variable “BSAAM” and all other variables in the “socal.water” dataset as predictors.
    • The regsubsets function from the “leaps” package is used to perform best subset selection. It evaluates various subsets of predictors to find the best combination of features that contribute most to the model’s performance.
    • The results of the subset selection are stored in the “best.summary” object, which includes information about the selected predictors based on different model evaluation criteria. The algorithm used for the selection is exhaustive.
  4. Model Evaluation:
    • The names of the elements in the “best.summary” object are displayed.
    • The code identifies the subset with the minimum residual sum of squares (RSS) using which.min(best.summary$rss). The selected subset includes 6 predictors.
  5. Plots:
    • The plotting layout is set to a 1x2 grid using par(mfrow=c(1,2)).
    • The code creates two plots to visualize the Cp statistic for different numbers of features in the model (see the resulting plots).
    • which.min(best.summary$bic) identifies the subset with the minimum Bayesian Information Criterion (BIC), which includes 3 predictors.
    • which.max(best.summary$adjr2) identifies the subset with the highest adjusted R-squared value, which also includes 3 predictors.

With the above result, we need to examine the model and test the assumptions.

best.fit = lm(BSAAM~APSLAKE+OPRC+OPSLAKE, data=socal.water)
summary(best.fit)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = socal.water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16
par(oma = c(0, 0, 0, 0))
plot(best.fit)

vif(best.fit)
##  APSLAKE     OPRC  OPSLAKE 
## 1.011499 6.452569 6.444748
plot(socal.water$OPRC, socal.water$OPSLAKE, xlab="OPRC", ylab="OPSLAKE")

best.summary$adjr2
## [1] 0.8777515 0.9001619 0.9185369 0.9168706 0.9146772 0.9123079
fit.2 = lm(BSAAM~APSLAKE+OPSLAKE, data=socal.water)
summary(fit.2)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE, data = socal.water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13335.8  -5893.2   -171.8   4219.5  19500.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19144.9     3812.0   5.022  1.1e-05 ***
## APSLAKE       1768.8      553.7   3.194  0.00273 ** 
## OPSLAKE       3689.5      196.0  18.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.9002 
## F-statistic: 190.3 on 2 and 40 DF,  p-value: < 2.2e-16
## Produce diagnostic plots
par(oma = c(0, 0, 0, 0))
plot(fit.2)

### Perform variance inflation factor analysis
vif(fit.2)
##  APSLAKE  OPSLAKE 
## 1.010221 1.010221
library(lmtest)  #bptest() comes from the lmtest package
bptest(fit.2)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit.2
## BP = 0.0046205, df = 2, p-value = 0.9977
plot(fit.2$fitted.values, socal.water$BSAAM, xlab="predicted", ylab="actual", main="Predicted vs. Actual")

Analysis Interpretation

The given output shows the results of two linear regression models applied to the “socal.water” dataset, focusing on predicting the variable “BSAAM” using different combinations of predictor variables. We will interpret each model’s results and the diagnostic plots.

Model 1 (best.fit):

The first model includes three predictor variables, “APSLAKE,” “OPRC,” and “OPSLAKE,” to predict “BSAAM.”

  • Coefficients: The coefficients represent the estimated relationship between each predictor and the response variable.

    • “APSLAKE”: For a one-unit increase in “APSLAKE,” “BSAAM” is estimated to increase by approximately 1712.5 units. This coefficient is statistically significant at the 0.01 level.
    • “OPRC”: For a one-unit increase in “OPRC,” “BSAAM” is estimated to increase by approximately 1797.5 units. This coefficient is also statistically significant at the 0.01 level.
    • “OPSLAKE”: For a one-unit increase in “OPSLAKE,” “BSAAM” is estimated to increase by approximately 2389.8 units. This coefficient is highly significant at the 0.001 level.
  • Significance: The coefficients of all predictor variables are statistically significant, as indicated by the stars (***, **) in the “Pr(>|t|)” column.

  • Model Performance: The model’s adjusted R-squared value is 0.9185, which suggests that approximately 91.85% of the variability in “BSAAM” is explained by the predictors in the model.

  • Residuals: The residual standard error is 7284, which measures the average distance between the observed values and the predicted values from the model.

Model 2 (fit.2):

The second model includes two predictor variables, “APSLAKE” and “OPSLAKE,” to predict “BSAAM.”

  • Coefficients: The coefficients represent the estimated relationship between each predictor and the response variable.

    • “APSLAKE”: For a one-unit increase in “APSLAKE,” “BSAAM” is estimated to increase by approximately 1768.8 units. This coefficient is statistically significant at the 0.01 level.
    • “OPSLAKE”: For a one-unit increase in “OPSLAKE,” “BSAAM” is estimated to increase by approximately 3689.5 units. This coefficient is highly significant at the 0.001 level.
  • Significance: Both coefficients are statistically significant, as indicated by the stars (***, **) in the “Pr(>|t|)” column.

  • Model Performance: The model’s adjusted R-squared value is 0.9002, which suggests that approximately 90.02% of the variability in “BSAAM” is explained by the predictors in the model.

  • Residuals: The residual standard error is 8063, which measures the average distance between the observed values and the predicted values from the model.

Diagnostic Plots:

The code produces diagnostic plots for each model. These plots help assess the model’s assumptions and check the validity of the linear regression analysis. The diagnostic plots include:

  1. Residuals vs. Fitted: This plot checks for the presence of patterns in the residuals (vertical spread) and identifies potential outliers.

  2. Normal Q-Q Plot: This plot assesses whether the residuals follow a normal distribution. A straight line indicates normality.

  3. Scale-Location (Sqrt(|Residuals|) vs. Fitted): This plot helps check the assumption of constant variance (homoscedasticity). A horizontal line indicates homoscedasticity.

  4. Cook’s Distance: This plot identifies influential observations that might significantly impact the model’s fit.

  • In both models, the diagnostic plots show no major violations of assumptions, indicating that the models are generally well-behaved.

Variance Inflation Factor (VIF):

The VIF analysis checks for multicollinearity between predictor variables. In the three-variable model (best.fit), “APSLAKE” has a VIF near 1, but “OPRC” and “OPSLAKE” both have VIF values above 6, reflecting the strong correlation between those two predictors. In the two-variable model (fit.2), the VIF values for “APSLAKE” and “OPSLAKE” are both close to 1, suggesting no significant multicollinearity issues.

Breusch-Pagan Test:

The Breusch-Pagan test is performed on fit.2 to check for heteroscedasticity. The test does not reject the null hypothesis of homoscedasticity (p-value = 0.9977), indicating no significant evidence of heteroscedasticity.

Predicted vs. Actual Plot:

The code creates a plot showing the predicted values vs. the actual values of “BSAAM.” This plot helps visually assess how well the model predictions align with the observed values.

Both models show promising results in predicting “BSAAM,” with statistically significant predictor variables and satisfactory model performance. The diagnostic plots suggest that the models meet the assumptions of linear regression.

Before we proceed, we will add the predicted values to our data frame socal.water: BSAAM is copied into a new column named Actual, and the model forecasts are stored in a new column named Forecast. The following code does just that.

socal.water["Actual"] = water$BSAAM
socal.water["Forecast"] = NA
socal.water$Forecast = predict(fit.2)
head(socal.water)

Using ggplot2, we will create a more polished plot.

library(ggplot2)  #load ggplot2 if not already loaded
ggplot(socal.water, aes(x=Forecast, y=Actual)) + geom_point() + geom_smooth(method=lm) + labs(title = "Forecast versus Actuals")

Leave-One-Out (LOO) Model Selection

The concept of the LOO model selection technique is well captured by this quote from Professor Tarpey of Wright State University:

Often we use regression models to predict future observations. We can use our data to fit the model. However, it is cheating to then assess how well the model predicts responses using the same data that was used to estimate the model – this will tend to give overly optimistic results in terms of how well a model is able to predict future observations. If we leave out an observation, fit the model and then predict the left out response, then this will give a less biased idea of how well the model predicts.

In linear models, LOOCV is easily carried out by examining the Prediction Error Sum of Squares (PRESS) statistic and selecting the model that has the lowest value. The following R code chunk does exactly that.

library(MPV)  #PRESS() comes from the MPV package
PRESS(best.fit)
## [1] 2426757258
PRESS(fit.2)
## [1] 2992801411
PRESS.best = sum((resid(best.fit)/(1-hatvalues(best.fit)))^2)
PRESS.fit.2 = sum((resid(fit.2)/(1-hatvalues(fit.2)))^2)
PRESS.best
## [1] 2426757258
PRESS.fit.2
## [1] 2992801411

The given R code calculates the PRESS (Predicted Residual Sum of Squares) values for two linear regression models, “best.fit” and “fit.2,” using the MPV package. PRESS is a measure of the predictive power of a regression model, and it assesses how well the model predicts new observations.

Here’s the interpretation of the results:

  1. PRESS(best.fit): The PRESS value for the “best.fit” model is approximately 2,426,757,258. This value represents the sum of the squared differences between the observed response variable (“BSAAM”) and the predicted values from the model. A lower PRESS value indicates better predictive performance, meaning that the “best.fit” model predicts the response variable more accurately.

  2. PRESS(fit.2): The PRESS value for the “fit.2” model is approximately 2,992,801,411. This value represents the sum of the squared differences between the observed response variable (“BSAAM”) and the predicted values from the model. Comparing this value to the PRESS for “best.fit,” we can see that the “fit.2” model has a higher PRESS, indicating less accurate predictions compared to the “best.fit” model.

  3. PRESS.best: The variable “PRESS.best” is calculated manually for the “best.fit” model by dividing each residual by (1 - its hat value), squaring, and summing. The resulting value is approximately 2,426,757,258, which matches the result obtained using the PRESS function, confirming that the PRESS calculation is consistent.

  4. PRESS.fit.2: The variable “PRESS.fit.2” is calculated in the same way for the “fit.2” model. The resulting value is approximately 2,992,801,411, which also matches the result obtained using the PRESS function, confirming the consistency of the PRESS calculation for “fit.2.”

From the above result, the PRESS values give an indication of the predictive performance of the regression models. A lower PRESS value indicates better predictive accuracy, suggesting that the “best.fit” model performs better in predicting the response variable “BSAAM” compared to the “fit.2” model. It’s important to consider the PRESS values in the context of the data and the modeling goals to make informed decisions about model selection and evaluation.
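To make the leave-one-out idea concrete, here is a short sketch (an addition, assuming fit.2 and socal.water from above) that refits the model with each observation left out in turn and verifies that the sum of the squared held-out prediction errors reproduces the PRESS value.

#Explicit leave-one-out loop for fit.2; the result should match PRESS(fit.2).
n = nrow(socal.water)
loo.errors = numeric(n)
for (i in 1:n) {
  fit.i = lm(BSAAM ~ APSLAKE + OPSLAKE, data = socal.water[-i, ])    #drop row i and refit
  pred.i = predict(fit.i, newdata = socal.water[i, , drop = FALSE])  #predict the held-out row
  loo.errors[i] = socal.water$BSAAM[i] - pred.i
}
sum(loo.errors^2)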

4. Other Models: Qualitative Feature and Interaction Term

A qualitative feature (i.e., a factor) can take on two or more levels, for example ON/OFF or Bad/Neutral/Good. If we have a feature with two levels, we create what is known as an indicator or dummy feature, arbitrarily assigning one level the value 0 and the other the value 1.

The linear model with the indicator would still follow the same formulation as before, that is, \[Y = \beta_0 + \beta_1X + \varepsilon\]

If we code the feature as OFF equal to zero and ON equal to one, then the expectation for OFF would just be the intercept, \(\beta_0\), while for ON it would be \(\beta_0 + \beta_1\). A feature with more than two levels calls for n-1 indicators. We shall be using the Carseats dataset from the ISLR package to demonstrate this concept.
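Before turning to the Carseats data, here is a small hedged sketch (toy data, not from the chapter) of how R expands a two-level factor into an indicator column via model.matrix(), so the intercept estimates the OFF mean and the dummy coefficient estimates the ON minus OFF difference.

#Toy illustration of indicator (dummy) coding for a two-level factor.
status = factor(c("OFF", "ON", "ON", "OFF", "ON"))
y = c(1.2, 3.4, 3.1, 0.9, 3.6)
model.matrix(~ status)  #intercept column plus a 0/1 indicator for "ON" (OFF is the baseline)
lm(y ~ status)          #intercept = OFF mean; statusON = ON minus OFF difference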

library(ISLR)  #provides the Carseats data set
data(Carseats)
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
sales.fit = lm(Sales~Advertising+ShelveLoc, data=Carseats)
summary(sales.fit)
## 
## Call:
## lm(formula = Sales ~ Advertising + ShelveLoc, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6480 -1.6198 -0.0476  1.5308  6.4098 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.89662    0.25207  19.426  < 2e-16 ***
## Advertising      0.10071    0.01692   5.951 5.88e-09 ***
## ShelveLocGood    4.57686    0.33479  13.671  < 2e-16 ***
## ShelveLocMedium  1.75142    0.27475   6.375 5.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.244 on 396 degrees of freedom
## Multiple R-squared:  0.3733, Adjusted R-squared:  0.3685 
## F-statistic: 78.62 on 3 and 396 DF,  p-value: < 2.2e-16
contrasts(Carseats$ShelveLoc)
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1

The given R code performs a linear regression analysis on the “Carseats” dataset, which contains information about the sales of car seats in different stores. The model predicts “Sales” based on two predictor variables: “Advertising” and “ShelveLoc” (Shelf Location). Here’s the interpretation of the results:

  1. Coefficients: The coefficients represent the estimated relationship between each predictor and the response variable (“Sales”).

    • “Advertising”: For each additional unit increase in “Advertising” expenditure, the sales are estimated to increase by approximately 0.10071 units. This coefficient is highly significant (p-value < 0.001), indicating that advertising has a significant positive effect on sales.
    • “ShelveLocGood”: The “Good” shelf location category is associated with an increase in sales. Compared to the “Bad” shelf location (baseline), the “Good” shelf location is estimated to increase sales by approximately 4.57686 units. This coefficient is highly significant (p-value < 0.001).
    • “ShelveLocMedium”: The “Medium” shelf location category is also associated with an increase in sales. Compared to the “Bad” shelf location (baseline), the “Medium” shelf location is estimated to increase sales by approximately 1.75142 units. This coefficient is highly significant (p-value < 0.001).
  2. Significance: All coefficients have very low p-values (< 0.001), denoted by the ’***’ in the “Signif. codes” column of the summary output. This indicates strong evidence of a significant relationship between the predictor variables and sales.

  3. Residuals: The residuals represent the differences between the observed sales and the predicted sales from the model. The minimum and maximum residuals are approximately -6.6480 and 6.4098, respectively. The residuals have a median close to zero, suggesting that the model has captured the overall trend in sales well.

  4. Model Performance: The multiple R-squared value is 0.3733, indicating that about 37.33% of the variability in sales is explained by the model. The Adjusted R-squared value is 0.3685, which takes into account the number of predictor variables in the model.

  5. F-statistic: The F-statistic tests whether the model’s overall regression fit is significant. In this case, the F-statistic is 78.62 with an extremely low p-value (< 0.001), indicating that the model as a whole is statistically significant.

The regression model indicates that both “Advertising” and “ShelveLoc” are important predictors of “Sales.” The “Good” and “Medium” shelf locations are associated with higher sales compared to the “Bad” shelf location. Additionally, increasing the advertising expenditure is expected to boost sales. However, it’s important to note that the model explains only a moderate amount of the variability in sales, suggesting that there might be other factors influencing sales that are not included in the model.
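To see how the dummy coefficients shift the predictions, the sketch below (an addition, using the sales.fit model from above with an arbitrary advertising level of 10) predicts Sales for each shelf location; the differences between the predictions equal the ShelveLoc coefficients.

#Predicted Sales at Advertising = 10 for each shelf location (illustrative).
new.stores = data.frame(Advertising = 10,
                        ShelveLoc = factor(c("Bad", "Medium", "Good"),
                                           levels = levels(Carseats$ShelveLoc)))
predict(sales.fit, newdata = new.stores)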

If the effect of one feature on the prediction depends on the value of another feature, this is termed an interaction. The linear regression model with an interaction term can be written as:

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \varepsilon\]

Where: - \(Y\) is the response variable (dependent variable). - \(\beta_0\) is the intercept term. - \(\beta_1\) and \(\beta_2\) are the coefficients of the predictor variables \(X_1\) and \(X_2\). - \(\beta_3\) is the coefficient of the interaction term \(X_1X_2\). - \(\varepsilon\) represents the error term (residuals).

The example we will use here is the Boston dataset from the MASS library. The response is median home value, which is medv in the output; we will use two features, the percentage of homes with a low socioeconomic status, which is termed as lstat, and the age of the home in years, which is termed as age in the following output:

library(MASS)  #provides the Boston data set
data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
value.fit = lm(medv~lstat*age, data=Boston)
summary(value.fit)
## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

The above result performs a multiple linear regression analysis on the “Boston” dataset, which contains various attributes related to housing in Boston. The model predicts the “medv” variable (median value of owner-occupied homes in $1000s) based on two predictor variables: “lstat” (percentage of lower status of the population) and “age” (proportion of owner-occupied units built prior to 1940), as well as their interaction term “lstat:age”. Further interpretation of the results are given below:

  1. Coefficients: The coefficients represent the estimated relationships between each predictor variable and the response variable (“medv”), as well as their interaction effect.

    • “Intercept”: The estimated intercept is approximately 36.09. This represents the predicted median value of owner-occupied homes (in $1000s) when both “lstat” and “age” are zero. However, since both “lstat” and “age” cannot be exactly zero in this dataset, the intercept is mainly used for reference.
    • “lstat”: For each one-unit increase in the percentage of lower status population (“lstat”), the median value of owner-occupied homes (“medv”) is estimated to decrease by approximately $1.39 (in $1000s). This coefficient is highly significant (p-value < 0.001), indicating that “lstat” has a significant negative effect on the median home value.
    • “age”: The coefficient for “age” is very close to zero (-0.0007209) and not statistically significant (p-value = 0.9711). This suggests that the proportion of older owner-occupied units has little to no impact on the median home value after considering other variables in the model.
    • “lstat:age”: The interaction term “lstat:age” allows the effect of “lstat” to depend on “age.” Its coefficient is approximately 0.0042, so the slope of “lstat” is estimated as -1.392 + 0.0042 × age; the negative effect of “lstat” therefore weakens slightly as “age” increases (see the sketch after this interpretation). This interaction term is significant (p-value = 0.0252), indicating that the combined effect of “lstat” and “age” has a significant impact on the median home value.
  2. Significance: The significance of each coefficient is indicated by the asterisks in the summary output. Three asterisks (***) next to the intercept and “lstat” indicate that these coefficients are highly significant (p-value < 0.001). One asterisk (*) next to “lstat:age” indicates that this coefficient is statistically significant at the 0.05 level (p-value = 0.0252), while “age” on its own is not significant.

  3. Residuals: The residuals represent the differences between the observed median home values and the predicted values from the model. The residuals have a median close to zero, suggesting that the model has captured the overall trend in median home values relatively well.

  4. Model Performance: The multiple R-squared value is 0.5557, indicating that approximately 55.57% of the variability in median home values is explained by the model. The Adjusted R-squared value is 0.5531, which takes into account the number of predictor variables in the model.

  5. F-statistic: The F-statistic tests whether the model’s overall regression fit is significant. In this case, the F-statistic is 209.3 with an extremely low p-value (< 0.001), indicating that the model as a whole is statistically significant.

Overall, the regression model suggests that “lstat” has a strong negative effect on median home values, while “age” has little to no impact after considering the interaction effect of “lstat” and “age.” The model provides a reasonable fit to the data, explaining approximately 55.57% of the variability in median home values. However, as with any statistical model, it’s essential to consider other factors and perform further analysis to understand the complete picture.
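Because the interaction makes the slope of lstat depend on age, the short sketch below (an addition, assuming value.fit from above) computes the estimated effect of lstat at a few illustrative ages using the fitted coefficients.

#The slope of lstat at a given age is beta_lstat + beta_interaction * age.
b = coef(value.fit)
ages = c(20, 50, 80)                              #illustrative ages
lstat.slope = b["lstat"] + b["lstat:age"] * ages
data.frame(age = ages, lstat.slope = round(lstat.slope, 3))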

Conclusion

In machine learning, we train and test a model to predict or forecast an outcome. In this chapter, we took a close look at the simple but highly effective method of linear regression for predicting a quantitative response. Many of the advanced techniques covered in later chapters are simply extensions of what we learned here. We also discussed the danger of relying on summary statistics alone, rather than visual inspection of the data, to guide model selection.

Evaluating the performance of linear regression models is crucial. Evaluation metrics such as mean squared error (MSE), the coefficient of determination (R-squared), and adjusted R-squared, along with hypothesis tests for individual coefficients (t-tests), are all encouraged. Two further points are worth taking into consideration:

Linear regression relies on several assumptions about the data. The chapter covers assumptions such as linearity, independence, homoscedasticity, and normality, along with diagnostic techniques to check whether these assumptions are met.

Feature engineering involves transforming and selecting relevant features to improve model performance. The chapter explains techniques like polynomial regression, interaction terms, and handling categorical variables in the context of linear regression.