1. Linear Regression Analysis

Linear regression is a very useful tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches that are extensions of regression. The R libraries that we will need for this lesson are loaded below.

library(MASS)
library(ISLR2)
library(car)

1.2 Model Building & Validation

# Specify the formula of the regression equation
formula <- Sales ~ TV + Radio + Newspaper
# Generate the linear regression results using the lm() function
# Store the results inside "lm.fit" 
lm.fit <- lm(formula, data=advertising)
summary(lm.fit)
## 
## Call:
## lm(formula = formula, data = advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
# Plot predicted vs actual values
predicted <- lm.fit$fitted.values
actual <- advertising$Sales    # Change this line depending on the data
plot(actual, predicted)

# Plot the regression diagnostics to validate the model
par(mfrow=c(2,2))   # 2x2 grid so all four diagnostic plots appear on one page
plot(lm.fit); par(mfrow=c(1,1))

# Generate the mean squared error (MSE) of the linear model
error <- lm.fit$residuals
mse <- mean(error^2); mse
## [1] 2.784126
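
Since the car package was loaded at the start of the lesson, one quick additional validation step is to check for multicollinearity among the predictors with the vif() function; variance inflation factors well above 5 or 10 are usually taken as a warning sign. A minimal sketch:

# Check multicollinearity with variance inflation factors (car package)
library(car)
vif(lm.fit)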

1.3 Transformation of Response Variable

There are times when the model validation results suggest that the data need to be transformed. This is often useful in addressing non-constant variance and non-normality of the residuals. As an example, let's consider log(Sales) as the new response variable.

# Transform sales using the log() function
advertising$log.Sales <- log(advertising$Sales)
formula <- log.Sales ~ TV + Radio + Newspaper
lm.fit <- lm(formula, data=advertising)
summary(lm.fit)
## 
## Call:
## lm(formula = formula, data = advertising)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.74201 -0.05892  0.04715  0.10553  0.20666 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.7390020  0.0345727  50.300   <2e-16 ***
## TV          0.0036697  0.0001546  23.735   <2e-16 ***
## Radio       0.0118019  0.0009545  12.365   <2e-16 ***
## Newspaper   0.0003544  0.0006508   0.545    0.587    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1868 on 196 degrees of freedom
## Multiple R-squared:  0.7998, Adjusted R-squared:  0.7967 
## F-statistic: 260.9 on 3 and 196 DF,  p-value: < 2.2e-16
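
Since the response is now on the log scale, the fitted values can be back-transformed with exp() if we want to compare this model with the original fit on the Sales scale. A short sketch (the exact numbers will depend on your data):

# Back-transform fitted log(Sales) to the original Sales scale
predicted <- exp(lm.fit$fitted.values)
actual <- advertising$Sales
plot(actual, predicted)
# MSE on the original Sales scale, roughly comparable to the earlier model's MSE
mse <- mean((actual - predicted)^2); mse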

1.4 Interaction Effects

# TV*Radio expands to TV + Radio + TV:Radio, so both main effects and the interaction are included
formula <- Sales ~ TV * Radio
lm.fit <- lm(formula, data=advertising)
summary(lm.fit)
## 
## Call:
## lm(formula = formula, data = advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## Radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:Radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16
predicted <- lm.fit$fitted.values
plot(actual, predicted)

error <- lm.fit$residuals
mse <- mean(error^2); mse
## [1] 0.8724169
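
The interaction model can also be used to predict Sales for new budget allocations with the predict() function; the TV and Radio values below are purely illustrative:

# Predict Sales for a hypothetical advertising budget (illustrative values only)
new.budget <- data.frame(TV = 150, Radio = 25)
predict(lm.fit, newdata = new.budget)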

2. Logistic Regression

The linear regression model discussed in the previous section assumes that the response variable \(Y\) is quantitative. But in many situations the response variable is instead qualitative (categorical), like eye color or type of disease. In this section we study classification models in R. There are many possible classification techniques; some of the widely used classifiers are logistic regression, linear discriminant analysis, naive Bayes, and K-nearest neighbors. This discussion will largely focus on logistic regression as a modification of linear regression.
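
Instead of modeling \(Y\) directly, logistic regression models the probability that \(Y\) belongs to the positive class through the logistic function,

\[ p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}, \]

so that the log-odds \(\log\big(p(X)/(1 - p(X))\big)\) is a linear function of the predictors.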

2.1 “Default” Dataset

To illustrate the concept of classification we will use the Default data set. We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income, monthly credit card balance, and whether the client is a student, for a sample of 10,000 individuals.

library(ISLR2)
library(ggplot2)
# Load the data set "Default" from ISLR2 package
data(Default)    # Load the data into the R Session
head(Default)    # See the first six rows of the data
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
names(Default)   # Check for variable names in the data
## [1] "default" "student" "balance" "income"

2.2 (Optional) Exploratory Analysis

This part may be skipped; some of the code may be difficult for non-programmers.

# Generate Grouped Scatter Plot
ggplot(Default, aes(x=balance, y=income, color=default)) +
  geom_point() + scale_color_manual(values=c("cyan4", "coral2")) +
  theme_minimal()

# Generate Simple Boxplot
par(mfrow=c(1,2))              # This creates a 1x2 grid in the plots window
colors = c("cyan4", "coral2")
attach(Default)
boxplot(balance ~ default,
        data=Default,
        col=colors, ylab="Balance", xlab="Default Status")
boxplot(income ~ default,
        data=Default,
        col=colors, ylab="Income", xlab="Default Status")

par(mfrow = c(1,1))
detach(Default)

2.3 Fitting a Logistic Regression Model

formula <- default ~ balance + income + student
logit.fit <- glm(formula, data = Default, family = binomial)
summary(logit.fit)
## 
## Call:
## glm(formula = formula, family = binomial, data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8
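
One common way to read the fitted coefficients is on the odds scale: exponentiating a coefficient gives the multiplicative change in the odds of default for a one-unit increase in that predictor.

# Coefficients on the odds-ratio scale
exp(coef(logit.fit))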

2.4 Optimal Cut-off - Manual Approach

We can now generate the predicted probabilities of default, but the problem of classification still remains. Here, we need to decide on an optimal cut-off point between 0 and 1 that serves as the threshold separating the "default" and "no default" clients.

# Save the predicted probabilities of default
probability <- predict(logit.fit, type="response")
library(Metrics)
# Try 0.2 as cut off
cut.off <- 0.2
predicted <- ifelse(probability > cut.off, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- recall(actual, predicted)
Precision <- precision(actual, predicted)
F1 <- 2*Precision*Recall/(Precision + Recall)
cbind(cut.off, Recall, Precision, F1)
##      cut.off    Recall Precision       F1
## [1,]     0.2 0.6096096 0.4229167 0.499385
# Try 0.3 as cut off
cut.off <- 0.3
predicted <- ifelse(probability > cut.off, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- recall(actual, predicted)
Precision <- precision(actual, predicted)
F1 <- 2*Precision*Recall/(Precision + Recall)
cbind(cut.off, Recall, Precision, F1)
##      cut.off    Recall Precision        F1
## [1,]     0.3 0.5075075 0.5522876 0.5289515
# Try 0.4 as cut off
cut.off <- 0.4
predicted <- ifelse(probability > cut.off, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- recall(actual, predicted)
Precision <- precision(actual, predicted)
F1 <- 2*Precision*Recall/(Precision + Recall)
cbind(cut.off, Recall, Precision, F1)
##      cut.off    Recall Precision        F1
## [1,]     0.4 0.4024024  0.629108 0.4908425
# Try 0.5 as cut off
cut.off <- 0.5
predicted <- ifelse(probability > cut.off, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- recall(actual, predicted)
Precision <- precision(actual, predicted)
F1 <- 2*Precision*Recall/(Precision + Recall)
cbind(cut.off, Recall, Precision, F1)
##      cut.off    Recall Precision        F1
## [1,]     0.5 0.3153153 0.7241379 0.4393305

The above code can be run multiple times with different cut-off values. In this example, a cut-off of 0.3 gives the best combination of Recall, Precision, and F1 among the values tried.
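
If you prefer not to copy the block above for every cut-off, the repeated steps can be wrapped in a small helper function. The function below is only a convenience sketch (evaluate.cutoff is a name chosen here, not part of any package):

# Helper that computes Recall, Precision, and F1 for a given cut-off
evaluate.cutoff <- function(cut.off, probability, actual) {
  predicted <- ifelse(probability > cut.off, 1, 0)
  Recall <- Metrics::recall(actual, predicted)
  Precision <- Metrics::precision(actual, predicted)
  F1 <- 2*Precision*Recall/(Precision + Recall)
  cbind(cut.off, Recall, Precision, F1)
}
actual <- ifelse(Default$default == "Yes", 1, 0)
evaluate.cutoff(0.25, probability, actual)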

2.5 (Optional) Automated Identification of Optimal Cut-off

The code in this section might be too complex for non-programmers. Alternatively, you may still use the code shown in Section 2.4 to find the optimal cut-off.

# Predict customer default probabilities using the predict() function
probability <- predict(logit.fit, type="response")

# Find the optimal cutoff using Recall, Precision, and F1 (the higher the better)
# Create a loop that calculates all metrics for each cut off value
library(Metrics)
Recall <- vector()
Precision <- vector()
F1 <- vector()
cut.off.d <- seq(0.05, 0.95, by = 0.05)   # cut-off values to evaluate
actual <- ifelse(Default$default == "Yes", 1, 0)

for(cut.off in cut.off.d) {
  predicted <- ifelse(probability > cut.off, 1, 0)
  Recall.new <- Metrics::recall(actual, predicted)
  Precision.new <- Metrics::precision(actual, predicted)
  F1.new <- 2*Precision.new*Recall.new/(Precision.new + Recall.new)
  Recall <- c(Recall, Recall.new)
  Precision <- c(Precision, Precision.new)
  F1 <- c(F1, F1.new)
}

# Find the optimal cutoff by printing this table
df <- data.frame(cut.off.d, Recall, Precision, F1)
print(df)
##    cut.off.d     Recall Precision         F1
## 1       0.05 0.84384384 0.2255217 0.35592147
## 2       0.10 0.74474474 0.3069307 0.43470640
## 3       0.15 0.67267267 0.3684211 0.47608927
## 4       0.20 0.60960961 0.4229167 0.49938499
## 5       0.25 0.55255255 0.4919786 0.52050919
## 6       0.30 0.50750751 0.5522876 0.52895149
## 7       0.35 0.45945946 0.6144578 0.52577320
## 8       0.40 0.40240240 0.6291080 0.49084249
## 9       0.45 0.34834835 0.7116564 0.46774194
## 10      0.50 0.31531532 0.7241379 0.43933054
## 11      0.55 0.28828829 0.7741935 0.42013129
## 12      0.60 0.24324324 0.7788462 0.37070938
## 13      0.65 0.21321321 0.7977528 0.33649289
## 14      0.70 0.17117117 0.8260870 0.28358209
## 15      0.75 0.12012012 0.8163265 0.20942408
## 16      0.80 0.09009009 0.8823529 0.16348774
## 17      0.85 0.05705706 0.8260870 0.10674157
## 18      0.90 0.03003003 0.8333333 0.05797101
## 19      0.95 0.01201201 0.8000000 0.02366864
# Based on the table, F1 peaks at a cut-off of about 0.30-0.35.
# Note that the optimal cut-off will generally differ from one data set to another.

# Use 0.35 cutoff to create your final Confusion Matrix
predicted <- as.factor(ifelse(probability > 0.35, "Yes", "No"))
actual <- as.factor(Default$default)
table(predicted, actual)
##          actual
## predicted   No  Yes
##       No  9571  180
##       Yes   96  153
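
From the confusion matrix we can also read off the overall accuracy, i.e., the proportion of customers classified correctly. With the predicted and actual factors from above this is simply:

# Overall accuracy: proportion of correctly classified customers
mean(predicted == actual)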

2.6 Predicting Default of New Customers

# Consider the data from 5 new customers in the Default-new.csv file
new.client <- read.csv("./Default-new.csv")
new.client
##   student balance  income
## 1      No   555.6 40837.7
## 2      No  1363.5 55713.1
## 3     Yes  1180.4 24063.3
## 4     Yes   584.9 22429.9
## 5      No  2189.4 20655.2
# Apply the classification model stored in "logit.fit"
prob.new <- predict(logit.fit, newdata=new.client, type="response")
# Predict default using 0.35 cut off
class.pred.new <- ifelse(prob.new > 0.35, "Default", "No Default")
# Which of these customers will most likely default?
class.pred.new
##            1            2            3            4            5 
## "No Default" "No Default" "No Default" "No Default"    "Default"

3. Other Regression and Classification Methods

3.1 KNN Regression

library(caret)
# Try different values of K
formula <- Sales ~ TV + Radio + Newspaper

fit.k1 <- knnreg(formula, data=advertising, k=1)
fit.k4 <- knnreg(formula, data=advertising, k=4)
fit.k8 <- knnreg(formula, data=advertising, k=8) 
fit.k12 <- knnreg(formula, data=advertising, k=12)

# Plot the Sales and Predicted Sales Using KNN with K=12
predicted <- predict(fit.k12, newdata = advertising)
actual <- advertising$Sales
plot(actual,predicted)
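
To compare the fits, one rough check is the training-set mean squared error for each value of K. Keep in mind that small K (especially K = 1) will always look good on the training data, so a fair comparison would use held-out data. A minimal sketch:

# Training-set MSE for each fitted K (small K will overfit the training data)
fits <- list(k1 = fit.k1, k4 = fit.k4, k8 = fit.k8, k12 = fit.k12)
sapply(fits, function(fit) {
  predicted <- predict(fit, newdata = advertising)
  mean((advertising$Sales - predicted)^2)
})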

3.2 Linear Discriminant Analysis (LDA)

The linear discriminant analysis (LDA) classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \(\sigma^2\) (a common covariance matrix when there is more than one predictor). The word linear stems from the fact that the discriminant functions are linear functions of \(x\), as opposed to more complex functions.
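
For a single predictor, the discriminant function has the form

\[ \delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k), \]

which is indeed linear in \(x\); with several predictors the same form holds with mean vectors and the shared covariance matrix, and an observation is assigned to the class with the largest \(\delta_k(x)\). We again consider the Default credit data set and apply LDA to classify customers based on the predictors balance, income, and student status.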

# Load Default data again in R session
library(ISLR2)
library(MASS)
data(Default)
# Using the function lda() to generate the linear discriminant analysis
# lda() uses the same syntax as lm()
formula <- default ~ balance + income + student
lda.fit <- lda(formula, data=Default)
lda.fit
## Call:
## lda(formula, data = Default)
## 
## Prior probabilities of groups:
##     No    Yes 
## 0.9667 0.0333 
## 
## Group means:
##       balance   income studentYes
## No   803.9438 33566.17  0.2914037
## Yes 1747.8217 32089.15  0.3813814
## 
## Coefficients of linear discriminants:
##                      LD1
## balance     2.243541e-03
## income      3.367310e-06
## studentYes -1.746631e-01
# The plot() function produces plots of the linear discriminant
plot(lda.fit)

# The predict() function returns the predicted probabilities of default
lda.prob <- predict(lda.fit, Default)$posterior[,2]

# Suppose we want a 0.35 threshold in classifying default
# The predictions can then be assessed using a confusion matrix
predicted <- ifelse(lda.prob > 0.35, "Yes", "No")
observed <- Default$default
table(predicted, observed)
##          observed
## predicted   No  Yes
##       No  9607  207
##       Yes   60  126
# Prediction assessment
library(Metrics)
predicted <- ifelse(lda.prob > 0.35, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- Metrics::recall(actual,predicted); Recall   
## [1] 0.3783784
Precision <- Metrics::precision(actual,predicted); Precision
## [1] 0.6774194
F1 <- 2*Precision*Recall/(Precision + Recall); F1
## [1] 0.4855491

3.3 Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis (QDA) provides an alternative approach to classifying the outcome variable. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution. However, unlike LDA, QDA assumes that each class has its own covariance matrix; that is, it assumes that an observation from the kth class is of the form \(X \sim N(\mu_k, \Sigma_k)\), where \(\Sigma_k\) is the covariance matrix for the kth class.
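
Under these assumptions the discriminant function is quadratic in \(x\):

\[ \delta_k(x) = -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k). \]

Using QDA, we can also model the probability of default from the Default data.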

# qda() has the same syntax as lda()
qda.fit <- qda(default ~ balance + income + student, data=Default)
qda.fit
## Call:
## qda(default ~ balance + income + student, data = Default)
## 
## Prior probabilities of groups:
##     No    Yes 
## 0.9667 0.0333 
## 
## Group means:
##       balance   income studentYes
## No   803.9438 33566.17  0.2914037
## Yes 1747.8217 32089.15  0.3813814
qda.prob <- predict(qda.fit, Default)$posterior[,2]

# Suppose we want a 0.35 threshold in classifying default
# The predictions can then be assessed using a confusion matrix
predicted <- ifelse(qda.prob > 0.35, "Yes", "No")
observed <- Default$default
table(predicted, observed)
##          observed
## predicted   No  Yes
##       No  9569  175
##       Yes   98  158
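
The same Recall, Precision, and F1 assessment used for LDA can be applied to the QDA predictions so that the two classifiers can be compared on equal footing:

# Prediction assessment for QDA, mirroring the LDA section
library(Metrics)
predicted <- ifelse(qda.prob > 0.35, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- Metrics::recall(actual, predicted); Recall
Precision <- Metrics::precision(actual, predicted); Precision
F1 <- 2*Precision*Recall/(Precision + Recall); F1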

3.4 Naive Bayes Classifier (NBC)

The naive Bayes classifier takes a different tack for estimating \(f_1(x), \dots, f_K(x)\). Instead of assuming that these functions belong to a particular family of distributions, we make a single assumption: within the kth class, the p predictors are independent.
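
Formally, this independence assumption lets each class density factor into a product of one-dimensional densities,

\[ f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p), \]

where \(f_{kj}\) is the density of the jth predictor within the kth class. The naiveBayes() function is found in the e1071 library.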

library(e1071)
formula <- default ~ balance + income + student
nbc.fit <- naiveBayes(formula, data=Default)
nbc.fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     No    Yes 
## 0.9667 0.0333 
## 
## Conditional probabilities:
##      balance
## Y          [,1]     [,2]
##   No   803.9438 456.4762
##   Yes 1747.8217 341.2668
## 
##      income
## Y         [,1]     [,2]
##   No  33566.17 13318.25
##   Yes 32089.15 13804.22
## 
##      student
## Y            No       Yes
##   No  0.7085963 0.2914037
##   Yes 0.6186186 0.3813814
# the predict() function needs the type=raw argument to generate the probabilities
nbc.prob <- predict(nbc.fit, Default, type="raw")[,2]
predicted <- ifelse(nbc.prob > 0.35, "Yes", "No")
observed <- Default$default
table(predicted, observed)
##          observed
## predicted   No  Yes
##       No  9520  185
##       Yes  147  148