Linear regression is a very useful tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches that extend regression. Some of the R libraries that we will need for this lesson are loaded below.
library(MASS)
library(ISLR2)
library(car)
# Load Advertising Data from local computer
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
advertising <- read.csv(".\\Advertising.csv")
# Check variable names
names(advertising)
## [1] "X" "TV" "Radio" "Newspaper" "Sales"
# Check first 6 rows of data
head(advertising)
## X TV Radio Newspaper Sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
# Generate scatter plot matrix
plot(advertising[,c(2:5)], col="#69b3a2")
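As a numerical companion to the scatter plot matrix, we can also look at the pairwise correlations among the variables. A quick check such as the sketch below (output not shown) often helps in spotting strongly related predictors.
# Pairwise correlations among the advertising variables and Sales
# (rounded to two decimals for readability)
round(cor(advertising[, c("TV", "Radio", "Newspaper", "Sales")]), 2)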
# Specify the formula of the regression equation
formula <- Sales ~ TV + Radio + Newspaper
# Generate the linear regression results using the lm() function
# Store the results inside "lm.fit"
lm.fit <- lm(formula, data=advertising)
summary(lm.fit)
##
## Call:
## lm(formula = formula, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
# Plot predicted vs actual values
predicted <- lm.fit$fitted.values
actual <- advertising$Sales # Change this line depending on the data
plot(actual, predicted)
# Plot residual diagnostics to validate the model
par(mfrow=c(2,2)) # show all four diagnostic plots in one window
plot(lm.fit); par(mfrow=c(1,1))
# Generate the mean squared error (MSE) of the linear model
error <- lm.fit$residuals
mse <- mean(error^2); mse
## [1] 2.784126
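Since the car package was loaded at the start of the lesson, one optional follow-up is to check the predictors for multicollinearity and to obtain confidence intervals for the coefficients. The sketch below assumes lm.fit still holds the model fitted above (output not shown).
# Variance inflation factors: values close to 1 suggest little multicollinearity
vif(lm.fit)
# 95% confidence intervals for the regression coefficients
confint(lm.fit, level = 0.95)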
There are times when the model validation results suggest that the data need to be transformed. This is often useful in addressing issues with non-constant variance and non-normality. As an example, let us consider log(Sales) as the new response variable.
# Transform sales using the log() function
advertising$log.Sales <- log(advertising$Sales)
formula <- log.Sales ~ TV + Radio + Newspaper
lm.fit <- lm(formula, data=advertising)
summary(lm.fit)
##
## Call:
## lm(formula = formula, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.74201 -0.05892 0.04715 0.10553 0.20666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7390020 0.0345727 50.300 <2e-16 ***
## TV 0.0036697 0.0001546 23.735 <2e-16 ***
## Radio 0.0118019 0.0009545 12.365 <2e-16 ***
## Newspaper 0.0003544 0.0006508 0.545 0.587
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1868 on 196 degrees of freedom
## Multiple R-squared: 0.7998, Adjusted R-squared: 0.7967
## F-statistic: 260.9 on 3 and 196 DF, p-value: < 2.2e-16
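When the response has been log-transformed, the fitted values are on the log scale and need to be back-transformed before they can be compared with the original Sales values. A rough sketch, using a simple exponential back-transformation (output not shown):
# Back-transform the fitted values from the log scale to the Sales scale
# (note: exp() of a fitted log-value is only an approximate prediction of Sales)
predicted.sales <- exp(lm.fit$fitted.values)
plot(advertising$Sales, predicted.sales)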
Another useful extension of the linear model is to include an interaction term. Here we allow the TV and Radio advertising budgets to interact:
# Specify a model with an interaction between TV and Radio
formula <- Sales ~ TV + Radio + TV*Radio
lm.fit <- lm(formula, data=advertising)
summary(lm.fit)
##
## Call:
## lm(formula = formula, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## Radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:Radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
predicted <- lm.fit$fitted.values
plot(actual, predicted)
error <- lm.fit$residuals
mse <- mean(error^2); mse
## [1] 0.8724169
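To check more formally whether the interaction term is worth keeping, the main-effects model and the interaction model can be compared with a partial F-test. A minimal sketch, refitting both models so the comparison is self-contained (output not shown):
# Compare the nested models Sales ~ TV + Radio and Sales ~ TV * Radio
fit.main <- lm(Sales ~ TV + Radio, data = advertising)
fit.int <- lm(Sales ~ TV * Radio, data = advertising)
anova(fit.main, fit.int) # F-test for the added TV:Radio term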
The linear regression model discussed in the previous section assumes that the response variable \(Y\) is quantitative. But in many situations the response variable is instead qualitative (categorical), like eye color or type of disease. In this section we study classification models in R. There are many possible classification techniques. Some of the widely used classifiers are logistic regression, linear discriminant analysis, naive Bayes, and K-nearest neighbors. This discussion will largely focus on logistic regression as a modification of linear regression.
To illustrate the concept of classification we will use the Default data set. We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income, monthly credit card balance, and whether the client is a student, for a subset of 10,000 individuals.
library(ISLR2)
library(ggplot2)
# Load the data set "Default" from ISLR2 package
data(Default) # Load the data into the R Session
head(Default) # See the first six rows of the data
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
names(Default) # Check for variable names in the data
## [1] "default" "student" "balance" "income"
This part may be skipped, as some of the code is more difficult for non-programmers.
# Generate Grouped Scatter Plot
ggplot(Default, aes(x=balance, y=income, color=default)) +
geom_point() + scale_color_manual(values=c("cyan4", "coral2")) +
theme_minimal()
# Generate Simple Boxplot
par(mfrow=c(1,2)) # This creates a 1x2 grid in the plots window
colors = c("cyan4", "coral2")
boxplot(balance ~ default,
        data=Default,
        col=colors, ylab="Balance", xlab="Default Status")
boxplot(income ~ default,
        data=Default,
        col=colors, ylab="Income", xlab="Default Status")
par(mfrow = c(1,1))
We now fit a logistic regression model for default using the glm() function with the argument family = binomial.
formula <- default ~ balance + income + student
logit.fit <- glm(formula, data = Default, family = binomial)
summary(logit.fit)
##
## Call:
## glm(formula = formula, family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
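Because the coefficients of a logistic regression are on the log-odds scale, they are often easier to interpret after exponentiation. A small sketch (output not shown):
# Express the logistic regression coefficients as odds ratios
# (e.g., the value for balance is the multiplicative change in the odds of default
# for a one-unit increase in balance)
exp(coef(logit.fit))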
We can now generate the predicted probabilities of default, but the problem of classification still remains: we need to decide on a cut-off point between 0 and 1 that serves as the threshold separating the “default” clients from the “no default” clients.
# Save the predicted probabilities of default
probability <- predict(logit.fit, type="response")
library(Metrics)
# Try 0.2 as cut off
cut.off <- 0.2
predicted <- ifelse(probability > cut.off, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- recall(actual, predicted)
Precision <- precision(actual, predicted)
F1 <- 2*Precision*Recall/(Precision + Recall)
cbind(cut.off, Recall, Precision, F1)
## cut.off Recall Precision F1
## [1,] 0.2 0.6096096 0.4229167 0.499385
# Try 0.3 as cut off
cut.off <- 0.3
predicted <- ifelse(probability > cut.off, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- recall(actual, predicted)
Precision <- precision(actual, predicted)
F1 <- 2*Precision*Recall/(Precision + Recall)
cbind(cut.off, Recall, Precision, F1)
## cut.off Recall Precision F1
## [1,] 0.3 0.5075075 0.5522876 0.5289515
# Try 0.4 as cut off
cut.off <- 0.4
predicted <- ifelse(probability > cut.off, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- recall(actual, predicted)
Precision <- precision(actual, predicted)
F1 <- 2*Precision*Recall/(Precision + Recall)
cbind(cut.off, Recall, Precision, F1)
## cut.off Recall Precision F1
## [1,] 0.4 0.4024024 0.629108 0.4908425
# Try 0.5 as cut off
cut.off <- 0.5
predicted <- ifelse(probability > cut.off, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- recall(actual, predicted)
Precision <- precision(actual, predicted)
F1 <- 2*Precision*Recall/(Precision + Recall)
cbind(cut.off, Recall, Precision, F1)
## cut.off Recall Precision F1
## [1,] 0.5 0.3153153 0.7241379 0.4393305
The above code can be repeated for different cut-off values. In this example, a cut-off of 0.3 gives the best balance between Recall and Precision, as summarized by the F1 value.
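As a quick sanity check, we can also tabulate the predictions at the 0.3 cut-off against the observed defaults; a minimal sketch (output not shown):
# Confusion matrix at the 0.3 cut-off
table(Predicted = ifelse(probability > 0.3, "Yes", "No"),
      Actual = Default$default)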
The next chunk of code in this section might be too complex for non-programmers. Alternatively, you may still use the code shown in section (2.3) to find the optimal cut-off.
# Predict customer default probabilities using the predict() function
probability <- predict(logit.fit, type="response")
# Find the optimal cutoff using Recall, Precision, and F1 (the higher the better)
# Create a loop that calculates all metrics for each cut off value
library(Metrics)
# Candidate cut-off values from 0.05 to 0.95 in steps of 0.05
cut.off.d <- seq(0.05, 0.95, by = 0.05)
Recall <- vector()
Precision <- vector()
F1 <- vector()
actual <- ifelse(Default$default == "Yes", 1, 0)
for(cut.off in cut.off.d) {
  predicted <- ifelse(probability > cut.off, 1, 0)
  Recall.new <- Metrics::recall(actual, predicted)
  Recall <- c(Recall, Recall.new)
  Precision.new <- Metrics::precision(actual, predicted)
  Precision <- c(Precision, Precision.new)
  F1.new <- 2*Precision.new*Recall.new/(Precision.new + Recall.new)
  F1 <- c(F1, F1.new)
}
# Find the optimal cutoff by printing this table
df <- data.frame(cut.off.d, Recall, Precision, F1)
print(df)
## cut.off.d Recall Precision F1
## 1 0.05 0.84384384 0.2255217 0.35592147
## 2 0.10 0.74474474 0.3069307 0.43470640
## 3 0.15 0.67267267 0.3684211 0.47608927
## 4 0.20 0.60960961 0.4229167 0.49938499
## 5 0.25 0.55255255 0.4919786 0.52050919
## 6 0.30 0.50750751 0.5522876 0.52895149
## 7 0.35 0.45945946 0.6144578 0.52577320
## 8 0.40 0.40240240 0.6291080 0.49084249
## 9 0.45 0.34834835 0.7116564 0.46774194
## 10 0.50 0.31531532 0.7241379 0.43933054
## 11 0.55 0.28828829 0.7741935 0.42013129
## 12 0.60 0.24324324 0.7788462 0.37070938
## 13 0.65 0.21321321 0.7977528 0.33649289
## 14 0.70 0.17117117 0.8260870 0.28358209
## 15 0.75 0.12012012 0.8163265 0.20942408
## 16 0.80 0.09009009 0.8823529 0.16348774
## 17 0.85 0.05705706 0.8260870 0.10674157
## 18 0.90 0.03003003 0.8333333 0.05797101
## 19 0.95 0.01201201 0.8000000 0.02366864
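Rather than scanning the table by eye, the row with the highest F1 can also be picked out programmatically; a small convenience step using the data frame df built above (output not shown):
# Cut-off with the highest F1 value
df[which.max(df$F1), ]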
# From the table, F1 peaks at a cut-off of about 0.30, with 0.35 a close second.
# Note that the optimal cut-off will differ from data set to data set.
# Use the 0.35 cut-off to create the final confusion matrix
predicted <- as.factor(ifelse(probability > 0.35, "Yes", "No"))
actual <- as.factor(Default$default)
table(predicted, actual)
## actual
## predicted No Yes
## No 9571 180
## Yes 96 153
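From this confusion matrix we can also compute the overall accuracy, i.e., the proportion of correctly classified clients; a quick sketch that re-tabulates the same matrix (output not shown):
# Overall accuracy from the confusion matrix above
cm <- table(predicted, actual)
sum(diag(cm)) / sum(cm)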
# Consider the data from 5 new customers in the Default-new.csv file
new.client <- read.csv(".\\Default-new.csv")
new.client
## student balance income
## 1 No 555.6 40837.7
## 2 No 1363.5 55713.1
## 3 Yes 1180.4 24063.3
## 4 Yes 584.9 22429.9
## 5 No 2189.4 20655.2
# Apply the classification model stored in "logit.fit"
prob.new <- predict(logit.fit, newdata=new.client, type="response")
# Predict default using 0.35 cut off
class.pred.new <- ifelse(prob.new > 0.35, "Default", "No Default")
# Which of these customers will most likely default?
class.pred.new
## 1 2 3 4 5
## "No Default" "No Default" "No Default" "No Default" "Default"
The K-nearest neighbors (KNN) approach can also be used for regression: the prediction for a given observation is the average response of its K nearest neighbors in the predictor space. The knnreg() function in the caret package implements KNN regression; here we apply it to the Advertising data.
library(caret)
# Try different values of K
formula <- Sales ~ TV + Radio + Newspaper
fit.k1 <- knnreg(formula, data=advertising, k=1)
fit.k4 <- knnreg(formula, data=advertising, k=4)
fit.k8 <- knnreg(formula, data=advertising, k=8)
fit.k12 <- knnreg(formula, data=advertising, k=12)
# Plot the Sales and Predicted Sales Using KNN with K=12
predicted <- predict(fit.k12, newdata = advertising)
actual <- advertising$Sales
plot(actual,predicted)
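To compare the different choices of K, one option is to compute the training MSE of each fit. A sketch is given below; keep in mind that training error will generally favor smaller K, so a proper comparison would use a hold-out set or cross-validation.
# Training MSE for each value of K
fits <- list(k1 = fit.k1, k4 = fit.k4, k8 = fit.k8, k12 = fit.k12)
sapply(fits, function(fit) mean((advertising$Sales - predict(fit, newdata = advertising))^2))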
The linear discriminant analysis (LDA) classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance (with several predictors, a class-specific mean vector \(\mu_k\) and a common covariance matrix \(\Sigma\)). The word linear stems from the fact that the discriminant functions are linear functions of \(x\), as opposed to more complex functions. We again consider the Default credit data set and apply LDA to classify customers based on the predictors balance, income, and student.
# Load Default data again in R session
library(ISLR2)
library(MASS)
data(Default)
# Using the function lda() to generate the linear discriminant analysis
# lda() uses the same syntax as lm()
formula <- default ~ balance + income + student
lda.fit <- lda(formula, data=Default)
lda.fit
## Call:
## lda(formula, data = Default)
##
## Prior probabilities of groups:
## No Yes
## 0.9667 0.0333
##
## Group means:
## balance income studentYes
## No 803.9438 33566.17 0.2914037
## Yes 1747.8217 32089.15 0.3813814
##
## Coefficients of linear discriminants:
## LD1
## balance 2.243541e-03
## income 3.367310e-06
## studentYes -1.746631e-01
# The plot() function produces plots of the linear discriminant
plot(lda.fit)
# The predict() function returns the predicted probabilities of default
lda.prob <- predict(lda.fit, Default)$posterior[,2]
# Suppose we want a 0.35 threshold in classifying default
# The predictions can then be assessed using a confusion matrix
predicted <- ifelse(lda.prob > 0.35, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9607 207
## Yes 60 126
# Prediction assessment
library(Metrics)
predicted <- ifelse(lda.prob > 0.35, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- Metrics::recall(actual,predicted); Recall
## [1] 0.3783784
Precision <- Metrics::precision(actual,predicted); Precision
## [1] 0.6774194
F1 <- 2*Precision*Recall/(Precision + Recall); F1
## [1] 0.4855491
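For comparison, predict() also returns a class assignment directly, which for a two-class problem corresponds to the default 0.5 threshold on the posterior probability; a quick check (output not shown):
# Default LDA classification (0.5 threshold on the posterior probability)
lda.class <- predict(lda.fit, Default)$class
table(lda.class, Default$default)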
Quadratic discriminant analysis (QDA) provides an alternative approach to classifying the outcome variable. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution. However, unlike LDA, QDA assumes that each class has its own covariance matrix; that is, it assumes that an observation from the kth class is of the form \(X \sim N(\mu_k, \Sigma_k)\), where \(\Sigma_k\) is the covariance matrix for the kth class. Using QDA, we can also model the probability of default from the Default data.
# qda() uses the same syntax as lda()
qda.fit <- qda(default ~ balance + income + student, data=Default)
qda.fit
## Call:
## qda(default ~ balance + income + student, data = Default)
##
## Prior probabilities of groups:
## No Yes
## 0.9667 0.0333
##
## Group means:
## balance income studentYes
## No 803.9438 33566.17 0.2914037
## Yes 1747.8217 32089.15 0.3813814
qda.prob <- predict(qda.fit, Default)$posterior[,2]
# Suppose we want a 0.35 threshold in classifying default
# The predictions can then be assessed using a confusion matrix
predicted <- ifelse(qda.prob > 0.35, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9569 175
## Yes 98 158
The naive Bayes classifier takes a different approach to estimating \(f_1(x),...,f_K(x)\). Instead of assuming that these functions belong to a particular family of distributions, we make a single assumption: within the kth class, the p predictors are independent. The naiveBayes() function is found in the e1071 library.
library(e1071)
formula <- default ~ balance + income + student
nbc.fit <- naiveBayes(formula, data=Default)
nbc.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.9667 0.0333
##
## Conditional probabilities:
## balance
## Y [,1] [,2]
## No 803.9438 456.4762
## Yes 1747.8217 341.2668
##
## income
## Y [,1] [,2]
## No 33566.17 13318.25
## Yes 32089.15 13804.22
##
## student
## Y No Yes
## No 0.7085963 0.2914037
## Yes 0.6186186 0.3813814
# the predict() function needs the argument type="raw" to generate the probabilities
nbc.prob <- predict(nbc.fit, Default, type="raw")[,2]
predicted <- ifelse(nbc.prob > 0.35, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9520 185
## Yes 147 148
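Finally, the three classifiers can be put side by side using the same 0.35 cut-off and the F1 measure. The sketch below assumes lda.prob, qda.prob, and nbc.prob from the fits above are still in the R session (output not shown):
# Compare LDA, QDA, and naive Bayes at the 0.35 cut-off using F1
actual <- ifelse(Default$default == "Yes", 1, 0)
f1.at.cutoff <- function(prob, cut.off = 0.35) {
  pred <- ifelse(prob > cut.off, 1, 0)
  r <- Metrics::recall(actual, pred)
  p <- Metrics::precision(actual, pred)
  2 * p * r / (p + r)
}
c(LDA = f1.at.cutoff(lda.prob), QDA = f1.at.cutoff(qda.prob), NB = f1.at.cutoff(nbc.prob))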