Linear regression is a very useful tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches that are extensions of regression. Some of the R libraries that we will need for this lesson are provided below.
library(MASS)
library(ISLR2)
library(car)
# Load Advertising Data from local computer
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
advertising <- read.csv(".\\Advertising.csv")
# Check variable names
names(advertising)
## [1] "X" "TV" "Radio" "Newspaper" "Sales"
# Check first 6 rows of data
head(advertising)
## X TV Radio Newspaper Sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
# Generate scatter plot matrix
plot(advertising[,c(2:5)], col="#69b3a2")
# Generate the linear regression results using the lm() function
# Store the results inside "lm.fit"
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
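From this output, the fitted regression equation can be read off directly from the coefficient estimates:
\[ \widehat{\text{Sales}} = 2.939 + 0.0458\,\text{TV} + 0.1885\,\text{Radio} - 0.0010\,\text{Newspaper}. \]
Holding the other predictors fixed, each one-unit increase in TV spending is associated with an estimated 0.046-unit increase in Sales, while the Newspaper coefficient is small and not statistically significant.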
# Save the fitted values and compare with observed sales
predict.sales <- lm.fit$fitted.values
plot(predict.sales, advertising$Sales)
# Generate the standard diagnostic plots for the fitted model
# (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
par(mfrow = c(1,2))
plot(lm.fit)
residuals <- lm.fit$residuals
# Testing for Multicollinearity
vif(lm.fit)
## TV Radio Newspaper
## 1.004611 1.144952 1.145187
# Testing for Autocorrelation
durbinWatsonTest(lm.fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.04687792 2.083648 0.532
## Alternative hypothesis: rho != 0
# Testing for Normality
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.91767, p-value = 3.939e-09
# Testing for Equality of Variance
# leveneTest() compares variances across groups; for a fitted regression,
# a score test for non-constant error variance is more direct
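One option, shown here as a sketch (this call and its output were not part of the original run), is the ncvTest() function from the car package loaded earlier, which performs a score test for non-constant error variance on a fitted lm object; bptest() from the lmtest package is a similar alternative.
# Score (Breusch-Pagan-type) test for non-constant error variance
ncvTest(lm.fit)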
There are times when the model validation results suggest that the data need to be transformed. Transformation is often useful in addressing nonconstant variance and non-normality. As an example, let's consider log(Sales) as the new response variable.
# Transform sales using the log() function
advertising$log.Sales <- log(advertising$Sales)
lm.fit2 <- lm(log.Sales ~ TV + Radio + Newspaper, data=advertising)
summary(lm.fit2)
##
## Call:
## lm(formula = log.Sales ~ TV + Radio + Newspaper, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.74201 -0.05892 0.04715 0.10553 0.20666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7390020 0.0345727 50.300 <2e-16 ***
## TV 0.0036697 0.0001546 23.735 <2e-16 ***
## Radio 0.0118019 0.0009545 12.365 <2e-16 ***
## Newspaper 0.0003544 0.0006508 0.545 0.587
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1868 on 196 degrees of freedom
## Multiple R-squared: 0.7998, Adjusted R-squared: 0.7967
## F-statistic: 260.9 on 3 and 196 DF, p-value: < 2.2e-16
# Fit a polynomial regression of Sales on TV, adding quadratic and cubic terms via I()
lm.fit3 <- lm(Sales ~ TV + I(TV^2) + I(TV^3), data=advertising)
summary(lm.fit3)
##
## Call:
## lm(formula = Sales ~ TV + I(TV^2) + I(TV^3), data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9734 -1.8900 -0.0897 2.0189 7.3765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.420e+00 8.641e-01 6.272 2.23e-09 ***
## TV 9.643e-02 2.580e-02 3.738 0.000243 ***
## I(TV^2) -3.152e-04 2.022e-04 -1.559 0.120559
## I(TV^3) 5.572e-07 4.494e-07 1.240 0.216519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared: 0.622, Adjusted R-squared: 0.6162
## F-statistic: 107.5 on 3 and 196 DF, p-value: < 2.2e-16
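As a side note (a sketch, not part of the original code; lm.fit3b is just an illustrative name), the same cubic fit can be written more compactly with the poly() function; setting raw = TRUE uses the raw powers of TV and reproduces the coefficients of the I() formulation above.
# Equivalent cubic fit using poly() with raw (non-orthogonal) polynomial terms
lm.fit3b <- lm(Sales ~ poly(TV, 3, raw = TRUE), data = advertising)
summary(lm.fit3b)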
# Fit a model with an interaction term between TV and Radio
# (in a formula, TV*Radio expands to TV + Radio + TV:Radio)
lm.fit4 <- lm(Sales ~ TV + Radio + TV*Radio, data=advertising)
summary(lm.fit4)
##
## Call:
## lm(formula = Sales ~ TV + Radio + TV * Radio, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## Radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:Radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
# Compare fitted values from the additive model (lm.fit) and the interaction model (lm.fit4)
# against the observed Sales
predict.w.inter <- lm.fit4$fitted.values
par(mfrow=c(1,2))
plot(predict.sales, advertising$Sales)
plot(predict.w.inter, advertising$Sales)
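As a quick numeric check (a sketch, not part of the original output), the residual standard errors of the two models can be compared with the sigma() function; from the summaries above these are roughly 1.69 for the additive model and 0.94 for the model with the interaction term.
# Compare residual standard errors of the additive and interaction models
c(additive = sigma(lm.fit), interaction = sigma(lm.fit4))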
The linear regression model discussed in Chapter 3 assumes that the response variable \(Y\) is quantitative. But in many situations, the response variable is instead qualitative/categorical, like eye color or type of disease. In this section we study classification models in R. There are many possible classification techniques. Some of the widely used classifiers are logistic regression, linear discriminant analysis, naive Bayes, and K-nearest neighbors. This discussion will largely focus on logistic regression as a modification of linear regression.
To illustrate the concept of classification we will use the Default data set. We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance, for a subset of 10,000 individuals.
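For reference, logistic regression models the probability that an individual defaults as a function of the predictors using the logistic function, which constrains the fitted probabilities to lie between 0 and 1:
\[ p(X) = \Pr(\text{default} = \text{Yes} \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}. \]
The coefficients are estimated by maximum likelihood, which is what the glm() function with family = binomial does below.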
library(ISLR2)
library(ggplot2)
# Load the data set "Default" from ISLR2 package
data(Default) # Load the data into the R Session
head(Default) # See the first six rows of the data
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
names(Default) # Check for variable names in the data
## [1] "default" "student" "balance" "income"
# Generate Grouped Scatter Plot
ggplot(Default, aes(x=balance, y=income, color=default)) +
geom_point() + scale_color_manual(values=c("cyan4", "coral2")) +
theme_minimal()
# Generate Simple Boxplot
par(mfrow=c(1,2)) # This creates a 1x2 grid in the plots window
colors = c("cyan4", "coral2")
attach(Default)
boxplot(balance ~ default,
        data=Default,
        col=colors, ylab="Balance", xlab="Default Status")
boxplot(income ~ default,
        data=Default,
        col=colors, ylab="Income", xlab="Default Status")
par(mfrow = c(1,1))
detach(Default)
fit <- glm(default ~ balance + income + student,
data = Default, family = binomial)
summary(fit)
##
## Call:
## glm(formula = default ~ balance + income + student, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
# Predict the probability of customer default using the predict() function
class.prob <- predict(fit, type="response")
# Find the optimal cutoff using Recall, Precision, and F1 (the higher the better)
# Create a loop that calculates all metrics for each cut off value
library(Metrics)
cut.off.d <- seq(0.05, 0.95, by = 0.05)   # candidate cutoff values
Recall <- Precision <- F1 <- numeric(length(cut.off.d))
actual <- ifelse(Default$default == "Yes", 1, 0)
for (i in seq_along(cut.off.d)) {
  predicted <- ifelse(class.prob > cut.off.d[i], 1, 0)
  Recall[i] <- Metrics::recall(actual, predicted)
  Precision[i] <- Metrics::precision(actual, predicted)
  F1[i] <- 2 * Precision[i] * Recall[i] / (Precision[i] + Recall[i])
}
# Find the optimal cutoff by printing this table
df <- data.frame(cut.off.d, Recall, Precision, F1)
print(df)
## cut.off.d Recall Precision F1
## 1 0.05 0.84384384 0.2255217 0.35592147
## 2 0.10 0.74474474 0.3069307 0.43470640
## 3 0.15 0.67267267 0.3684211 0.47608927
## 4 0.20 0.60960961 0.4229167 0.49938499
## 5 0.25 0.55255255 0.4919786 0.52050919
## 6 0.30 0.50750751 0.5522876 0.52895149
## 7 0.35 0.45945946 0.6144578 0.52577320
## 8 0.40 0.40240240 0.6291080 0.49084249
## 9 0.45 0.34834835 0.7116564 0.46774194
## 10 0.50 0.31531532 0.7241379 0.43933054
## 11 0.55 0.28828829 0.7741935 0.42013129
## 12 0.60 0.24324324 0.7788462 0.37070938
## 13 0.65 0.21321321 0.7977528 0.33649289
## 14 0.70 0.17117117 0.8260870 0.28358209
## 15 0.75 0.12012012 0.8163265 0.20942408
## 16 0.80 0.09009009 0.8823529 0.16348774
## 17 0.85 0.05705706 0.8260870 0.10674157
## 18 0.90 0.03003003 0.8333333 0.05797101
## 19 0.95 0.01201201 0.8000000 0.02366864
# Based on the table, F1 peaks at a cutoff of about 0.30, where Recall and Precision are most balanced
# Note that the optimal cutoff is rarely exactly 0.5
# For illustration, use the conventional 0.5 cutoff to create the final confusion matrix
predicted <- as.factor(ifelse(class.prob > 0.5, "Yes", "No"))
actual <- as.factor(Default$default)
table(predicted, actual)
## actual
## predicted No Yes
## No 9627 228
## Yes 40 105
# Consider the data from 5 new customers
# Which of these customers will most likely default?
student <- c("No", "No", "Yes", "Yes", "No")
balance <- c(555.6, 1363.52, 1180.4, 584.9, 2189.4)
income <- c(40837.7, 55713.1, 24063.3, 22429.9, 20655.2)
new.client.data <- data.frame(student,balance,income)
class.prob.new <- predict(fit, newdata=new.client.data, type="response")
class.pred.new <- ifelse(class.prob.new > 0.5, "Default", "No Default")
class.pred.new
## 1 2 3 4 5
## "No Default" "No Default" "No Default" "No Default" "Default"
K-nearest neighbors (KNN) can also be used when the response is quantitative. The knnreg() function in the caret package fits a KNN regression, which predicts each observation as the average response of its K nearest neighbors in predictor space. We return to the advertising data and fit the model for several values of K.
library(caret)
# Try different values of K
fit.k1 <- knnreg(Sales ~ TV + Radio + Newspaper, data=advertising, k=1)
fit.k4 <- knnreg(Sales ~ TV + Radio + Newspaper, data=advertising, k=4)
fit.k8 <- knnreg(Sales ~ TV + Radio + Newspaper, data=advertising, k=8)
fit.k12 <- knnreg(Sales ~ TV + Radio + Newspaper, data=advertising, k=12)
# Plot the Sales and Predicted Sales Using KNN with K=12
Predicted <- predict(fit.k12, newdata = advertising)
plot(advertising$Sales,Predicted)
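As a rough comparison (a sketch, not part of the original output; rmse here is just a helper defined for illustration), the in-sample root mean squared error of each fit can be computed as below. Keep in mind that with K = 1 each training point is its own nearest neighbor, so the in-sample error is misleadingly small; a proper comparison would use held-out data.
# In-sample RMSE for each choice of K
rmse <- function(fit) sqrt(mean((advertising$Sales - predict(fit, newdata = advertising))^2))
sapply(list(k1 = fit.k1, k4 = fit.k4, k8 = fit.k8, k12 = fit.k12), rmse)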
The linear discriminant analysis (LDA) classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \(\sigma^2\) (with several predictors, a common covariance matrix). The word linear stems from the fact that the discriminant functions are linear functions of \(x\), as opposed to more complex functions. We again consider the Default credit data set and apply LDA to classify customers based on the predictors balance, income, and student.
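For example, with a single predictor the discriminant function for class \(k\) is
\[ \delta_k(x) = \frac{x\,\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k), \]
which is linear in \(x\); an observation is assigned to the class with the largest \(\delta_k(x)\).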
# Load Default data again in R session
library(ISLR2)
library(MASS)
data(Default)
# Using the function lda() to generate the linear discriminant analysis
# lda() uses the same syntax as lm()
lda.fit <- lda(default ~ balance + income + student, data=Default)
lda.fit
## Call:
## lda(default ~ balance + income + student, data = Default)
##
## Prior probabilities of groups:
## No Yes
## 0.9667 0.0333
##
## Group means:
## balance income studentYes
## No 803.9438 33566.17 0.2914037
## Yes 1747.8217 32089.15 0.3813814
##
## Coefficients of linear discriminants:
## LD1
## balance 2.243541e-03
## income 3.367310e-06
## studentYes -1.746631e-01
# The plot() function produces plots of the linear discriminant
plot(lda.fit)
# The predict() function returns a list that contains the predicted classes
lda.pred <- predict(lda.fit, Default)
lda.prob <- lda.pred$posterior[,2]
# Suppose we want a 0.3 threshold in classifying default
# The predictions can then be assessed using a confusion matrix
predicted <- ifelse(lda.prob > 0.3, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9571 181
## Yes 96 152
# Prediction assessment
library(Metrics)
predicted <- ifelse(lda.prob > 0.3, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- Metrics::recall(actual,predicted); Recall
## [1] 0.4564565
Precision <- Metrics::precision(actual,predicted); Precision
## [1] 0.6129032
F1 <- 2*Precision*Recall/(Precision + Recall); F1
## [1] 0.5232358
Quadratic discriminant analysis (QDA) provides an alternative approach to classifying the outcome variable. Like LDA, the QDA classifier results from assuming that the observations within each class are drawn from a Gaussian distribution. Unlike LDA, however, QDA assumes that each class has its own covariance matrix; that is, an observation from the kth class is assumed to be drawn from \(X \sim N(\mu_k, \Sigma_k)\), where \(\Sigma_k\) is the covariance matrix of the kth class. Using QDA, we can again model the probability of default from the Default data.
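Under this assumption the discriminant function takes the form
\[ \delta_k(x) = -\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k, \]
which is quadratic in \(x\), hence the name.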
# Using the function qda() to generate the quadratic discriminant analysis
# qda() uses the same syntax as lda()
qda.fit <- qda(default ~ balance + income + student, data=Default)
qda.fit
## Call:
## qda(default ~ balance + income + student, data = Default)
##
## Prior probabilities of groups:
## No Yes
## 0.9667 0.0333
##
## Group means:
## balance income studentYes
## No 803.9438 33566.17 0.2914037
## Yes 1747.8217 32089.15 0.3813814
qda.pred <- predict(qda.fit, Default)
qda.prob <- qda.pred$posterior[,2]
# Suppose we want a 0.3 threshold in classifying default
# The predictions can then be assessed using a confusion matrix
predicted <- ifelse(qda.prob > 0.3, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9507 163
## Yes 160 170
The naive Bayes classifier takes a different approach to estimating \(f_1(x), \dots, f_K(x)\). Instead of assuming that these functions belong to a particular family of distributions, we make a single assumption: within the kth class, the p predictors are independent. The naiveBayes() function is found in the e1071 library.
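Under this independence assumption, the density for the kth class factors into a product of one-dimensional densities,
\[ f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p), \]
so each predictor's distribution can be estimated separately within each class.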
library(e1071)
nbc.fit <- naiveBayes(default ~ balance + income + student, data=Default)
nbc.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.9667 0.0333
##
## Conditional probabilities:
## balance
## Y [,1] [,2]
## No 803.9438 456.4762
## Yes 1747.8217 341.2668
##
## income
## Y [,1] [,2]
## No 33566.17 13318.25
## Yes 32089.15 13804.22
##
## student
## Y No Yes
## No 0.7085963 0.2914037
## Yes 0.6186186 0.3813814
# the predict() function needs the type="raw" argument to generate the probabilities
nbc.pred <- predict(nbc.fit, Default, type="raw")
nbc.prob <- nbc.pred[,2]
predicted <- ifelse(nbc.prob > 0.3, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9473 163
## Yes 194 170
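To wrap up, the three classifiers can be compared at the same 0.3 threshold using the recall, precision, and F1 metrics introduced earlier. This is a sketch (not part of the original output) that reuses the lda.prob, qda.prob, and nbc.prob vectors computed above.
# Compare LDA, QDA, and naive Bayes at the 0.3 probability threshold
actual <- ifelse(Default$default == "Yes", 1, 0)
compare <- sapply(list(LDA = lda.prob, QDA = qda.prob, NaiveBayes = nbc.prob),
                  function(p) {
                    predicted <- ifelse(p > 0.3, 1, 0)
                    r <- Metrics::recall(actual, predicted)
                    pr <- Metrics::precision(actual, predicted)
                    c(Recall = r, Precision = pr, F1 = 2 * pr * r / (pr + r))
                  })
round(compare, 3)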