R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Load the necessary packages.

library(ISLR) 
library(rpart) 
library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) 
library(dplyr) 
library(rpart.plot) 
library(caret)
## 載入需要的套件:lattice
## 
## 載入套件:'caret'
## 
## 下列物件被遮斷自 'package:purrr':
## 
##     lift

Question 1: Poly and Tree Model

Loading the data.

data(Credit)

Setting seed and performing a 75-25 train-test split on the data. Then check the data splitting using dim().

set.seed(123) 
sample <- sample(
  c(T, F), 
  length(Credit$Rating), 
  replace = T, 
  prob = c(0.75, 0.25)
  )

training <- Credit[sample, ] 
testing <- Credit[!sample, ]

dim(training) 
## [1] 303  12
dim(testing)
## [1] 97 12

Check for NA first.

sum(is.na(Credit$Rating)) 
## [1] 0

No NA values.

Now fit a polynomial regression model, using Rating as Y, an orthogonal transformed Income with up to 5 degrees as X, as well as Age, Education, Balance as control variables.

poly.fit1 <- lm(
  Rating ~ poly(Income, 5, raw = F) + Age + Education + Balance, 
  data = training
  )

As the result shows, Income only exhibits significance at the 1st order,while also showing some significance at the 5th degree. With the 2nd, 3rd and 4th degree of Income all show no significance, this indicate that the relationship between Rating and Income could largely be explained by the 1st order of Income, meaning that the relationship between Rating and Income is very much linear.

summary(poly.fit1)
## 
## Call:
## lm(formula = Rating ~ poly(Income, 5, raw = F) + Age + Education + 
##     Balance, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -104.96  -19.14   14.58   25.74   57.35 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.414e+02  1.260e+01  19.161   <2e-16 ***
## poly(Income, 5, raw = F)1  1.291e+03  4.369e+01  29.555   <2e-16 ***
## poly(Income, 5, raw = F)2  3.024e+00  3.877e+01   0.078    0.938    
## poly(Income, 5, raw = F)3  4.650e+00  3.875e+01   0.120    0.905    
## poly(Income, 5, raw = F)4  5.343e+01  3.888e+01   1.374    0.170    
## poly(Income, 5, raw = F)5 -5.087e+01  3.867e+01  -1.316    0.189    
## Age                        1.594e-01  1.312e-01   1.215    0.225    
## Education                 -6.455e-01  7.171e-01  -0.900    0.369    
## Balance                    2.102e-01  5.401e-03  38.911   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.65 on 294 degrees of freedom
## Multiple R-squared:  0.9355, Adjusted R-squared:  0.9338 
## F-statistic: 533.1 on 8 and 294 DF,  p-value: < 2.2e-16

Plotting the result using ggplot().

ggplot(training, aes(Income, Rating)) + 
  geom_point(alpha = 0.2) + 
  labs(title = "Relationship between Rating and Income") + 
  stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = F), col = "blue")

Fit a tree model using the same features.

tree.fit1 <- rpart(
  Rating ~ Income + Age + Education + Balance, 
  data = training
  ) 

set.seed(123)

Take a look at the model output. The threshold value for Income that splits the 6th and 7th nodes is 57.6815, where samples that are smaller than 57.6815 is split to the 6th node, equal to or larger than 57.6815, node 7th.

tree.fit1
## n= 303 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 303 6809949.00 348.8845  
##    2) Balance< 452 150  970858.50 241.4200  
##      4) Balance< 98.5 84  260761.00 192.1905  
##        8) Income< 42.47 70  106021.80 174.7857 *
##        9) Income>=42.47 14   27510.36 279.2143 *
##      5) Balance>=98.5 66  247420.60 304.0758  
##       10) Income< 60.1 57   92384.67 286.6667 *
##       11) Income>=60.1 9   28350.00 414.3333 *
##    3) Balance>=452 153 2408472.00 454.2418  
##      6) Income< 57.6815 102  304479.00 387.5686  
##       12) Balance< 879 64  109973.80 360.9375 *
##       13) Balance>=879 38   72669.26 432.4211 *
##      7) Income>=57.6815 51  743726.40 587.5882  
##       14) Balance< 1418.5 42  199956.10 546.7381  
##         28) Income< 115.3215 33   80714.06 522.2424 *
##         29) Income>=115.3215 9   26836.22 636.5556 *
##       15) Balance>=1418.5 9  146611.60 778.2222 *

Take a look at the CP table of the model. The minimum number of splits to obtain an error rate smaller than the cross-validation error rate is 2, where it has the lowest CV error rate.

tree.fit1$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.50376565      0 1.0000000 1.0064719 0.10772248
## 2 0.19974697      1 0.4962343 0.5180209 0.07002870
## 3 0.06794133      2 0.2964874 0.3479229 0.04179905
## 4 0.05832036      3 0.2285461 0.2824120 0.03803238
## 5 0.01868278      4 0.1702257 0.2158172 0.02487199
## 6 0.01860307      5 0.1515429 0.2048530 0.02190454
## 7 0.01789088      6 0.1329398 0.2032444 0.02188778
## 8 0.01356924      7 0.1150490 0.1855566 0.02082012
## 9 0.01000000      8 0.1014797 0.1691961 0.02021567
CPtable <- as.data.frame(tree.fit1$cptable)
CPtable$nsplit[CPtable$"rel error" + CPtable$"xstd" < CPtable$"xerror"][1]
## [1] 2

Plot the tree model CP and mark the number of trees with the lowest error rate.

plotcp(tree.fit1)
abline(v = 2, lty = "dashed", col = "red")

Make predictions using both models and compare the RMSE and R2.

poly.pred <- poly.fit1 %>% predict(testing) 
tree.pred <- tree.fit1 %>% predict(testing)

comparison <- data.frame(
  model = c("Tree Regression", "Polynomial Regression"), 
  RMSE = c(RMSE(tree.pred, testing$Rating), RMSE(poly.pred, testing$Rating)), 
  R2 = c(R2(tree.pred, testing$Rating), R2(poly.pred, testing$Rating)) 
  )

By this table, I prefer the Poly model, for it has a smaller RMSE and a better model performance based on its R squared value.

library(gridExtra) 
## 
## 載入套件:'gridExtra'
## 下列物件被遮斷自 'package:dplyr':
## 
##     combine
grid.table(comparison)

Question 2: SVM

Load Social_Network_Ads data using the code provided in the assignment.

Social_Network_Ads <- read.csv( "https://www.dropbox.com/s/11rmse4ay9uu8vu/Social_Network_Ads.csv?dl=1")

Subsetting the data by Age, EstimatedSalary, and Purchased.

subset_SNA <- Social_Network_Ads[, c("Age", "EstimatedSalary", "Purchased")]

Convert Purchased as factor variable using as.factor(), then check its class using class().

subset_SNA$Purchased <- as.factor(subset_SNA$Purchased) 
class(subset_SNA$Purchased)
## [1] "factor"

Setting seed to 123 and perform a 75-25 train-test split using createDataPartition() on Purchased. Then check the data splitting using dim().

set.seed(123) 
sample <- createDataPartition(subset_SNA$Purchased, p = 0.75, list = F) 
training <- subset_SNA[sample, ] 
testing <- subset_SNA[-sample, ]

dim(training) 
## [1] 301   3
dim(testing)
## [1] 99  3

Check for NAs.

sum(is.na(training))
## [1] 0

Now fit an SVM on training data, while Purchased as outcome and Age, EstimatedSalary as explanatory variables.

library(e1071) ## For us to use svm()


svm.fit1 <- svm(
  Purchased ~ Age + EstimatedSalary, 
  scale = T, 
  type = "C-classification", 
  kernel = "linear", 
  data = training
  )

There are 116 support vectors in the model.

svm.fit1
## 
## Call:
## svm(formula = Purchased ~ Age + EstimatedSalary, data = training, 
##     type = "C-classification", kernel = "linear", scale = T)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  116

Now we predict on the testing data using the SVM model.

svm.pred <- predict(svm.fit1, testing)

Generating confusion matrix, show the prediction accuracy. The prediction accuracy is 83.26263%.

confmat.svm1 <- table(
  Predicted = svm.pred, Actual = testing$Purchased
  ) 
confmat.svm1 ## confusion matrix
##          Actual
## Predicted  0  1
##         0 57  9
##         1  7 26
(confmat.svm1[1, 1] + confmat.svm1[2, 2] / sum(confmat.svm1) * 100) ## 83%
## [1] 83.26263

Plotting the SVM model.

plot(svm.fit1, subset_SNA, col = c("slategray1", "paleturquoise4"), )

Hyperparameter tuning using the above SVM.

svm.fit.tuned <- tune.svm(
  Purchased ~ Age + EstimatedSalary, 
  type = "C-classification", 
  kernel = "polynomial", 
  gamma = seq(1, 2, 0.1), ## 3 different gamma values 
  cost = seq(1, 4, 0.2), ## 3 different cost(C) values 
  tunecontrol = tune.control(cross = 5), data = training
  )

There are 176 unique combinations of hyperparameters (1, 1), (1.1, 1)……(2, 4) that I need to tune over.

length(seq(1, 2, 0.1)) * length(seq(1, 4, 0.2)) ## = 176.
## [1] 176

The best model is the 3rd degree SVM with Cost = 1, gamma = 1.2. It has 116 support vectors.

svm.fit.tuned$best.model ## Optimal degree is 3.
## 
## Call:
## best.svm(x = Purchased ~ Age + EstimatedSalary, data = training, 
##     gamma = seq(1, 2, 0.1), cost = seq(1, 4, 0.2), type = "C-classification", 
##     kernel = "polynomial", tunecontrol = tune.control(cross = 5))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  91
min(svm.fit.tuned$performances$error) ## Lowest CV error is 0.1296175.
## [1] 0.1264481
svm.fit.tuned$best.parameters$cost ## Optimal C is 1.
## [1] 1
svm.fit.tuned$best.parameters$gamma ## Optimal gamma is 1.2.
## [1] 1

Showing all possible C and gamma values under the same CV error.

svm.fit.tuned$performances[
  (svm.fit.tuned$performances$error == min(svm.fit.tuned$performances$error)), ]
##    gamma cost     error dispersion
## 1    1.0  1.0 0.1264481 0.03503527
## 2    1.1  1.0 0.1264481 0.03503527
## 3    1.2  1.0 0.1264481 0.03503527
## 12   1.0  1.2 0.1264481 0.03503527
## 13   1.1  1.2 0.1264481 0.03503527
## 23   1.0  1.4 0.1264481 0.03503527
## 24   1.1  1.4 0.1264481 0.03503527
## 34   1.0  1.6 0.1264481 0.03503527
## 45   1.0  1.8 0.1264481 0.03503527
plot(
  svm.fit.tuned$best.model, 
  subset_SNA, 
  col = c("slategray1", "paleturquoise4") 
  ) 

I get different results every time I refit the tuned SVM, why?

Question 3: LASSO Regression

Load the necessary packages and “Hitters” data.

library(ISLR)
library(glmnet)
## 載入需要的套件:Matrix
## 
## 載入套件:'Matrix'
## 下列物件被遮斷自 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
library(plotmo)
## 載入需要的套件:Formula
## 載入需要的套件:plotrix
data(Hitters)

Removing NAs using na.omit().

dim(Hitters) ## Check how many rows and columns Hitters has. 
## [1] 322  20
sum(is.na(Hitters)) ## How many rows in Hitters that contain NAs? 59. 
## [1] 59
Hitters <- na.omit(Hitters)

Generate a matrix, in which Salary is Y, the rest of the variables are X.

X = model.matrix(Salary ~ .-1, data = Hitters) 
Y = Hitters$Salary

Fit a LASSO model using glmnet().

set.seed(100) 
lasso.fit <- glmnet(X, Y, alpha = 1) 
print(lasso.fit)
## 
## Call:  glmnet(x = X, y = Y, alpha = 1) 
## 
##    Df  %Dev  Lambda
## 1   0  0.00 255.300
## 2   1  5.46 232.600
## 3   2 10.05 211.900
## 4   2 13.91 193.100
## 5   3 17.61 176.000
## 6   4 21.91 160.300
## 7   4 25.68 146.100
## 8   4 28.82 133.100
## 9   4 31.42 121.300
## 10  4 33.58 110.500
## 11  5 35.41 100.700
## 12  5 37.30  91.740
## 13  5 38.87  83.590
## 14  5 40.17  76.170
## 15  6 41.51  69.400
## 16  6 42.73  63.240
## 17  6 43.75  57.620
## 18  6 44.60  52.500
## 19  6 45.30  47.840
## 20  6 45.88  43.590
## 21  6 46.36  39.710
## 22  6 46.76  36.190
## 23  6 47.10  32.970
## 24  6 47.37  30.040
## 25  6 47.60  27.370
## 26  6 47.79  24.940
## 27  6 47.95  22.730
## 28  6 48.08  20.710
## 29  6 48.19  18.870
## 30  7 48.29  17.190
## 31  8 48.40  15.660
## 32  8 48.48  14.270
## 33 10 48.57  13.000
## 34 10 48.66  11.850
## 35 10 48.73  10.800
## 36 10 48.79   9.837
## 37 10 48.84   8.963
## 38 12 49.12   8.167
## 39 12 49.60   7.442
## 40 13 50.27   6.781
## 41 13 50.82   6.178
## 42 14 51.32   5.629
## 43 14 51.77   5.129
## 44 14 52.14   4.674
## 45 14 52.45   4.258
## 46 14 52.70   3.880
## 47 14 52.91   3.535
## 48 14 53.09   3.221
## 49 14 53.24   2.935
## 50 14 53.36   2.674
## 51 14 53.46   2.437
## 52 15 53.55   2.220
## 53 16 53.62   2.023
## 54 16 53.72   1.843
## 55 18 53.84   1.680
## 56 18 53.96   1.530
## 57 18 54.06   1.394
## 58 18 54.15   1.271
## 59 18 54.22   1.158
## 60 18 54.28   1.055
## 61 18 54.32   0.961
## 62 18 54.36   0.876
## 63 18 54.40   0.798
## 64 18 54.43   0.727
## 65 19 54.45   0.662
## 66 19 54.48   0.604
## 67 19 54.49   0.550
## 68 18 54.51   0.501
## 69 19 54.52   0.457
## 70 19 54.53   0.416
## 71 19 54.54   0.379
## 72 19 54.55   0.345
## 73 19 54.56   0.315
## 74 19 54.57   0.287
## 75 19 54.57   0.261
## 76 19 54.58   0.238
## 77 19 54.58   0.217
## 78 19 54.59   0.198
## 79 19 54.59   0.180
## 80 20 54.59   0.164

Getting the best lambda value lambda.1se, which is not the lambda that produces minimum mean CV error (lambda.min), but the largest lambda value that is the error is within 1 s.e. of the CV error of lambda.min.

cv.glmnet(X, Y)$lambda.1se ## 63.23532.
## [1] 91.74363

At this optimized lambda, we see that we have only 5 variables left, ranging them from their beta magnitude is:

Walks: at 1.39817511;

Hits: at 1.29269756;

CRBI: at 0.32192558

CRuns: at 0.14167760;

PutOuts: at 0.04675463.

coef(lasso.fit, s = cv.glmnet(X, Y)$lambda.1se)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 2.220974e+02
## AtBat       .           
## Hits        1.129009e+00
## HmRun       .           
## Runs        .           
## RBI         .           
## Walks       1.172062e+00
## Years       .           
## CAtBat      .           
## CHits       .           
## CHmRun      .           
## CRuns       1.147170e-01
## CRBI        3.085475e-01
## CWalks      .           
## LeagueA     .           
## LeagueN     .           
## DivisionW   .           
## PutOuts     1.763115e-03
## Assists     .           
## Errors      .           
## NewLeagueN  .

Plot using plot_glmnet(). In the charts, we see that as penalty (lambda) increases, the magnitude of most estimated coefficients (beta) decreases, so that the number of X variables that still have non-zero beta also drops. Meaning that as lambda increases, DF drops. As LASSO suggests, the variable whose beta is the last one to become 0 contributes to the outcome the most. In this case, it is “Hits” who is the most important of all.

par(mfrow = c(1, 2))

plot_glmnet(
  lasso.fit, 
  xvar = "lambda", 
  label = 5, 
  xlab = expression(paste("log(", lambda, ")")), 
  ylab = expression(beta) 
  ) 

plot_glmnet(
  lasso.fit, 
  label = 5, 
  xlab = expression(paste("log(", lambda, ")")), 
  ylab = expression(beta) 
  ) 

Now we calculate the difference between lambda.min and lambda.1se. According to the authors of the package, using lambda.1se as the optimized lambda value produces the the most parsimonious list of features, which is the simplest model, while its accuracy is also comparable with choosing lambda.min as optimized lambda.

cv.glmnet(X, Y)$lambda.min - cv.glmnet(X, Y)$lambda.1se ## -67 at first run.
## [1] -80.919

When log(lambda) = 0……? log(lambda) = 0, means that lambda = 1. We can set the lambda value to 1 in our model. After setting lambda to 1, the first 5 variables becomes:

Hits: at 6.771856e+00;

Years: at -7.706032e+00;

NewLeagN: at -1.015361e+01;

LeagueA: at -4.653101e+01;

DivisinW: at -1.167282e+02.

lasso.fit1 <- glmnet(X, Y, lambda = 1)

dev.off()
## null device 
##           1
plot_glmnet( 
  lasso.fit1, 
  xvar = "lambda",
  label = 5, 
  xlab = expression(paste("log(", lambda, ")")), 
  ylab = expression(beta) 
  )

End of Homework 3.