Linear regression is a very useful tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches that are extensions of regression. Some of the R libraries that we will need for this lesson are provided below.
library(MASS)
library(ISLR2)
library(car)
# Load Advertising Data from local computer
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
advertising <- read.csv(".\\Advertising.csv")
# Check variable names
names(advertising)
## [1] "X" "TV" "Radio" "Newspaper" "Sales"
# Check first 6 rows of data
head(advertising)
## X TV Radio Newspaper Sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
# Generate scatter plot matrix
plot(advertising[,c(2:5)], col="#69b3a2")
# Generate the linear regression results using the lm() function
# Store the results inside "lm.fit"
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
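From this output, the fitted regression equation can be read off directly from the coefficient estimates:
\[ \widehat{\text{Sales}} = 2.939 + 0.0458\,\text{TV} + 0.1885\,\text{Radio} - 0.0010\,\text{Newspaper}. \]
Holding the other predictors fixed, each one-unit increase in TV spending is associated with an estimated 0.046-unit increase in Sales, while the Newspaper coefficient is small and not statistically significant.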
# Save the fitted values and compare with observed sales
predict.sales <- lm.fit$fitted.values
plot(predict.sales, advertising$Sales)
# Generate the standard diagnostic plots for the fitted model
# (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
par(mfrow = c(1,2))
plot(lm.fit)
residuals <- lm.fit$residuals
# Testing for Multicollinearity
vif(lm.fit)
## TV Radio Newspaper
## 1.004611 1.144952 1.145187
# Testing for Autocorrelation
durbinWatsonTest(lm.fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.04687792 2.083648 0.532
## Alternative hypothesis: rho != 0
# Testing for Normality
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.91767, p-value = 3.939e-09
# Testing for Equality of Variance
# leveneTest() compares variances across groups; for a fitted regression,
# a score test for non-constant error variance is more direct
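One option, shown here as a sketch (this call and its output were not part of the original run), is the ncvTest() function from the car package loaded earlier, which performs a score test for non-constant error variance on a fitted lm object; bptest() from the lmtest package is a similar alternative.
# Score (Breusch-Pagan-type) test for non-constant error variance
ncvTest(lm.fit)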
There are times when the model validation results suggest that the data need to be transformed. Transformation is often useful in addressing nonconstant variance and non-normality. As an example, let's consider log(Sales) as the new response variable.
# Transform sales using the log() function
advertising$log.Sales <- log(advertising$Sales)
lm.fit2 <- lm(log.Sales ~ TV + Radio + Newspaper, data=advertising)
summary(lm.fit2)
##
## Call:
## lm(formula = log.Sales ~ TV + Radio + Newspaper, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.74201 -0.05892 0.04715 0.10553 0.20666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7390020 0.0345727 50.300 <2e-16 ***
## TV 0.0036697 0.0001546 23.735 <2e-16 ***
## Radio 0.0118019 0.0009545 12.365 <2e-16 ***
## Newspaper 0.0003544 0.0006508 0.545 0.587
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1868 on 196 degrees of freedom
## Multiple R-squared: 0.7998, Adjusted R-squared: 0.7967
## F-statistic: 260.9 on 3 and 196 DF, p-value: < 2.2e-16
# Fit a polynomial regression of Sales on TV, adding quadratic and cubic terms via I()
lm.fit3 <- lm(Sales ~ TV + I(TV^2) + I(TV^3), data=advertising)
summary(lm.fit3)
##
## Call:
## lm(formula = Sales ~ TV + I(TV^2) + I(TV^3), data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9734 -1.8900 -0.0897 2.0189 7.3765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.420e+00 8.641e-01 6.272 2.23e-09 ***
## TV 9.643e-02 2.580e-02 3.738 0.000243 ***
## I(TV^2) -3.152e-04 2.022e-04 -1.559 0.120559
## I(TV^3) 5.572e-07 4.494e-07 1.240 0.216519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared: 0.622, Adjusted R-squared: 0.6162
## F-statistic: 107.5 on 3 and 196 DF, p-value: < 2.2e-16
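As a side note (a sketch, not part of the original code; lm.fit3b is just an illustrative name), the same cubic fit can be written more compactly with the poly() function; setting raw = TRUE uses the raw powers of TV and reproduces the coefficients of the I() formulation above.
# Equivalent cubic fit using poly() with raw (non-orthogonal) polynomial terms
lm.fit3b <- lm(Sales ~ poly(TV, 3, raw = TRUE), data = advertising)
summary(lm.fit3b)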
# Fit a model with an interaction term between TV and Radio
# (in a formula, TV*Radio expands to TV + Radio + TV:Radio)
lm.fit4 <- lm(Sales ~ TV + Radio + TV*Radio, data=advertising)
summary(lm.fit4)
##
## Call:
## lm(formula = Sales ~ TV + Radio + TV * Radio, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## Radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:Radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
# Compare fitted values from the additive model (lm.fit) and the interaction model (lm.fit4)
# against the observed Sales
predict.w.inter <- lm.fit4$fitted.values
par(mfrow=c(1,2))
plot(predict.sales, advertising$Sales)
plot(predict.w.inter, advertising$Sales)
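As a quick numeric check (a sketch, not part of the original output), the residual standard errors of the two models can be compared with the sigma() function; from the summaries above these are roughly 1.69 for the additive model and 0.94 for the model with the interaction term.
# Compare residual standard errors of the additive and interaction models
c(additive = sigma(lm.fit), interaction = sigma(lm.fit4))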
The linear regression model discussed in Chapter 3 assumes that the response variable \(Y\) is quantitative. But in many situations, the response variable is instead qualitative/categorical, like eye color or type of disease. In this section we study classification models in R. There are many possible classification techniques. Some of the widely used classifiers are logistic regression, linear discriminant analysis, naive Bayes, and K-nearest neighbors. This discussion will largely focus on logistic regression as a modification of linear regression.
To illustrate the concept of classification we will use the Default data set. We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance, for a subset of 10,000 individuals.
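For reference, logistic regression models the probability that an individual defaults as a function of the predictors using the logistic function, which constrains the fitted probabilities to lie between 0 and 1:
\[ p(X) = \Pr(\text{default} = \text{Yes} \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}. \]
The coefficients are estimated by maximum likelihood, which is what the glm() function with family = binomial does below.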
library(ISLR2)
library(ggplot2)
# Load the data set "Default" from ISLR2 package
data(Default) # Load the data into the R Session
head(Default) # See the first six rows of the data
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
names(Default) # Check for variable names in the data
## [1] "default" "student" "balance" "income"
# Generate Grouped Scatter Plot
ggplot(Default, aes(x=balance, y=income, color=default)) +
geom_point() + scale_color_manual(values=c("cyan4", "coral2")) +
theme_minimal()
# Generate Simple Boxplot
par(mfrow=c(1,2)) # This creates a 1x2 grid in the plots window
colors = c("cyan4", "coral2")
attach(Default)
boxplot(balance ~ default,
        data=Default,
        col=colors, ylab="Balance", xlab="Default Status")
boxplot(income ~ default,
        data=Default,
        col=colors, ylab="Income", xlab="Default Status")
par(mfrow = c(1,1))
detach(Default)
fit <- glm(default ~ balance + income + student,
data = Default, family = binomial)
summary(fit)
##
## Call:
## glm(formula = default ~ balance + income + student, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
# Predict the probability of customer default using the predict() function
class.prob <- predict(fit, type="response")
# Find the optimal cutoff using Recall, Precision, and F1 (the higher the better)
# Create a loop that calculates all metrics for each cut off value
library(Metrics)
cut.off.d <- seq(0.05, 0.95, by = 0.05)   # candidate cutoff values
Recall <- Precision <- F1 <- numeric(length(cut.off.d))
actual <- ifelse(Default$default == "Yes", 1, 0)
for (i in seq_along(cut.off.d)) {
  predicted <- ifelse(class.prob > cut.off.d[i], 1, 0)
  Recall[i] <- Metrics::recall(actual, predicted)
  Precision[i] <- Metrics::precision(actual, predicted)
  F1[i] <- 2 * Precision[i] * Recall[i] / (Precision[i] + Recall[i])
}
# Find the optimal cutoff by printing this table
df <- data.frame(cut.off.d, Recall, Precision, F1)
print(df)
## cut.off.d Recall Precision F1
## 1 0.05 0.84384384 0.2255217 0.35592147
## 2 0.10 0.74474474 0.3069307 0.43470640
## 3 0.15 0.67267267 0.3684211 0.47608927
## 4 0.20 0.60960961 0.4229167 0.49938499
## 5 0.25 0.55255255 0.4919786 0.52050919
## 6 0.30 0.50750751 0.5522876 0.52895149
## 7 0.35 0.45945946 0.6144578 0.52577320
## 8 0.40 0.40240240 0.6291080 0.49084249
## 9 0.45 0.34834835 0.7116564 0.46774194
## 10 0.50 0.31531532 0.7241379 0.43933054
## 11 0.55 0.28828829 0.7741935 0.42013129
## 12 0.60 0.24324324 0.7788462 0.37070938
## 13 0.65 0.21321321 0.7977528 0.33649289
## 14 0.70 0.17117117 0.8260870 0.28358209
## 15 0.75 0.12012012 0.8163265 0.20942408
## 16 0.80 0.09009009 0.8823529 0.16348774
## 17 0.85 0.05705706 0.8260870 0.10674157
## 18 0.90 0.03003003 0.8333333 0.05797101
## 19 0.95 0.01201201 0.8000000 0.02366864
# Based on the table, F1 peaks at a cutoff of about 0.30, where Recall and Precision are most balanced
# Note that the optimal cutoff is rarely exactly 0.5
# For illustration, use the conventional 0.5 cutoff to create the final confusion matrix
predicted <- as.factor(ifelse(class.prob > 0.5, "Yes", "No"))
actual <- as.factor(Default$default)
table(predicted, actual)
## actual
## predicted No Yes
## No 9627 228
## Yes 40 105
# Consider the data from 5 new customers
# Which of these customers will most likely default?
student <- c("No", "No", "Yes", "Yes", "No")
balance <- c(555.6, 1363.52, 1180.4, 584.9, 2189.4)
income <- c(40837.7, 55713.1, 24063.3, 22429.9, 20655.2)
new.client.data <- data.frame(student,balance,income)
class.prob.new <- predict(fit, newdata=new.client.data, type="response")
class.pred.new <- ifelse(class.prob.new > 0.5, "Default", "No Default")
class.pred.new
## 1 2 3 4 5
## "No Default" "No Default" "No Default" "No Default" "Default"
K-nearest neighbors (KNN) can also be used when the response is quantitative. The knnreg() function in the caret package fits a KNN regression, which predicts each observation as the average response of its K nearest neighbors in predictor space. We return to the advertising data and fit the model for several values of K.
library(caret)
# Try different values of K
fit.k1 <- knnreg(Sales ~ TV + Radio + Newspaper, data=advertising, k=1)
fit.k4 <- knnreg(Sales ~ TV + Radio + Newspaper, data=advertising, k=4)
fit.k8 <- knnreg(Sales ~ TV + Radio + Newspaper, data=advertising, k=8)
fit.k12 <- knnreg(Sales ~ TV + Radio + Newspaper, data=advertising, k=12)
# Plot the Sales and Predicted Sales Using KNN with K=12
Predicted <- predict(fit.k12, newdata = advertising)
plot(advertising$Sales,Predicted)
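As a rough comparison (a sketch, not part of the original output; rmse here is just a helper defined for illustration), the in-sample root mean squared error of each fit can be computed as below. Keep in mind that with K = 1 each training point is its own nearest neighbor, so the in-sample error is misleadingly small; a proper comparison would use held-out data.
# In-sample RMSE for each choice of K
rmse <- function(fit) sqrt(mean((advertising$Sales - predict(fit, newdata = advertising))^2))
sapply(list(k1 = fit.k1, k4 = fit.k4, k8 = fit.k8, k12 = fit.k12), rmse)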
The linear discriminant analysis (LDA) classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \(\sigma^2\) (with several predictors, a common covariance matrix). The word linear stems from the fact that the discriminant functions are linear functions of \(x\), as opposed to more complex functions. We again consider the Default credit data set and apply LDA to classify customers based on the predictors balance, income, and student.
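For example, with a single predictor the discriminant function for class \(k\) is
\[ \delta_k(x) = \frac{x\,\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k), \]
which is linear in \(x\); an observation is assigned to the class with the largest \(\delta_k(x)\).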
# Load Default data again in R session
library(ISLR2)
library(MASS)
data(Default)
# Using the function lda() to generate the linear discriminant analysis
# lda() uses the same syntax as lm()
lda.fit <- lda(default ~ balance + income + student, data=Default)
lda.fit
## Call:
## lda(default ~ balance + income + student, data = Default)
##
## Prior probabilities of groups:
## No Yes
## 0.9667 0.0333
##
## Group means:
## balance income studentYes
## No 803.9438 33566.17 0.2914037
## Yes 1747.8217 32089.15 0.3813814
##
## Coefficients of linear discriminants:
## LD1
## balance 2.243541e-03
## income 3.367310e-06
## studentYes -1.746631e-01
# The plot() function produces plots of the linear discriminant
plot(lda.fit)
# The predict() function returns a list that contains the predicted classes
lda.pred <- predict(lda.fit, Default)
lda.prob <- lda.pred$posterior[,2]
# Suppose we want a 0.3 threshold in classifying default
# The predictions can then be assessed using a confusion matrix
predicted <- ifelse(lda.prob > 0.3, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9571 181
## Yes 96 152
# Prediction assessment
library(Metrics)
predicted <- ifelse(lda.prob > 0.3, 1, 0)
actual <- ifelse(Default$default == "Yes", 1, 0)
Recall <- Metrics::recall(actual,predicted); Recall
## [1] 0.4564565
Precision <- Metrics::precision(actual,predicted); Precision
## [1] 0.6129032
F1 <- 2*Precision*Recall/(Precision + Recall); F1
## [1] 0.5232358
Quadratic discriminant analysis (QDA) provides an alternative approach to classifying the outcome variable. Like LDA, the QDA classifier results from assuming that the observations within each class are drawn from a Gaussian distribution. Unlike LDA, however, QDA assumes that each class has its own covariance matrix; that is, an observation from the kth class is assumed to be drawn from \(X \sim N(\mu_k, \Sigma_k)\), where \(\Sigma_k\) is the covariance matrix of the kth class. Using QDA, we can again model the probability of default from the Default data.
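Under this assumption the discriminant function takes the form
\[ \delta_k(x) = -\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k, \]
which is quadratic in \(x\), hence the name.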
# Using the function qda() to generate the quadratic discriminant analysis
# qda() uses the same syntax as lda()
qda.fit <- qda(default ~ balance + income + student, data=Default)
qda.fit
## Call:
## qda(default ~ balance + income + student, data = Default)
##
## Prior probabilities of groups:
## No Yes
## 0.9667 0.0333
##
## Group means:
## balance income studentYes
## No 803.9438 33566.17 0.2914037
## Yes 1747.8217 32089.15 0.3813814
qda.pred <- predict(qda.fit, Default)
qda.prob <- qda.pred$posterior[,2]
# Suppose we want a 0.3 threshold in classifying default
# The predictions can then be assessed using a confusion matrix
predicted <- ifelse(qda.prob > 0.3, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9507 163
## Yes 160 170
The naive Bayes classifier takes a different approach to estimating \(f_1(x), \dots, f_K(x)\). Instead of assuming that these functions belong to a particular family of distributions, we make a single assumption: within the kth class, the p predictors are independent. The naiveBayes() function is found in the e1071 library.
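Under this independence assumption, the density for the kth class factors into a product of one-dimensional densities,
\[ f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p), \]
so each predictor's distribution can be estimated separately within each class.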
library(e1071)
nbc.fit <- naiveBayes(default ~ balance + income + student, data=Default)
nbc.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.9667 0.0333
##
## Conditional probabilities:
## balance
## Y [,1] [,2]
## No 803.9438 456.4762
## Yes 1747.8217 341.2668
##
## income
## Y [,1] [,2]
## No 33566.17 13318.25
## Yes 32089.15 13804.22
##
## student
## Y No Yes
## No 0.7085963 0.2914037
## Yes 0.6186186 0.3813814
# the predict() function needs the type="raw" argument to generate the probabilities
nbc.pred <- predict(nbc.fit, Default, type="raw")
nbc.prob <- nbc.pred[,2]
predicted <- ifelse(nbc.prob > 0.3, "Yes", "No")
observed <- Default$default
table(predicted, observed)
## observed
## predicted No Yes
## No 9473 163
## Yes 194 170
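To wrap up, the three classifiers can be compared at the same 0.3 threshold using the recall, precision, and F1 metrics introduced earlier. This is a sketch (not part of the original output) that reuses the lda.prob, qda.prob, and nbc.prob vectors computed above.
# Compare LDA, QDA, and naive Bayes at the 0.3 probability threshold
actual <- ifelse(Default$default == "Yes", 1, 0)
compare <- sapply(list(LDA = lda.prob, QDA = qda.prob, NaiveBayes = nbc.prob),
                  function(p) {
                    predicted <- ifelse(p > 0.3, 1, 0)
                    r <- Metrics::recall(actual, predicted)
                    pr <- Metrics::precision(actual, predicted)
                    c(Recall = r, Precision = pr, F1 = 2 * pr * r / (pr + r))
                  })
round(compare, 3)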