Supervised Learning in R: Classification

Chapter 1: k-Nearest Neighbors (kNN)

Classification with Nearest Neighbors

Recognizing a road sign with kNN

After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.

As it begins to drive away, its camera captures the following image:

Stop Sign

Can you apply a kNN classifier to help the car recognize this sign?

The dataset signs is loaded in your workspace along with the dataframe next_sign, which holds the observation you want to classify.

Load the class package. Create a vector of sign labels to use with kNN by extracting the column sign_type from signs. Identify the next_sign using the knn() function. Set the train argument equal to the signs data frame without the first column. Set the test argument equal to the data frame next_sign. Use the vector of labels you created as the cl argument.

# Load the 'class' package
library(class)
## 
## Attaching package: 'class'
## The following object is masked from 'package:igraph':
## 
##     knn
# Import the training data

signs <- read.csv("./DATABASE/knn_traffic_signs.csv")


# Build the next_sign test observation by hand (DataCamp does not provide it)
signs_next_colnames <- c("r1","g1","b1","r2","g2","b2","r3","g3","b3","r4","g4","b4","r5","g5","b5","r6","g6","b6","r7","g7","b7","r8","g8","b8","r9","g9","b9","r10","g10","b10","r11","g11","b11","r12","g12","b12","r13","g13","b13","r14","g14","b14","r15","g15","b15","r16","g16","b16")

signs_values <- c(206,204,227,220,196,59,51,202,67,59,204,227,220,236,250,234,242,252,235,205,148,131,190,50,43,179,70,57,242,229,212,190,50,43,193,51,44,170,197,196,190,50,43,190,47,41,165,195)

# One-row, all-numeric data frame holding the observation to classify
next_sign <- as.data.frame(t(signs_values))
colnames(next_sign) <- signs_next_colnames




# Create a vector of labels
sign_types <- signs$sign_type

# Drop the id, sample, and sign_type columns so signs holds only the numeric features
signs <- signs %>% select(-id, -sample, -sign_type)

# Classify the next sign observed
knn(train = signs, test = next_sign, cl = sign_types)
## [1] stop
## Levels: pedestrian speed stop

Exploring the traffic sign dataset

To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.

Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue level for each of the 16 center pixels is recorded as illustrated here.

Use the str() function to examine the signs dataset. Use table() to count the number of observations of each sign type by passing it the column containing the labels. Run the provided aggregate() command to see whether the average red level might vary by sign type.

# Examine the structure of the signs dataset
#str(signs)

# Count the number of signs of each type
# (sign_type was dropped from signs above, so use the saved label vector)
table(sign_types)
# Check r10's average red level by sign type
aggregate(r10 ~ sign_types, data = signs, mean)
##   sign_types       r10
## 1 pedestrian 108.78462
## 2      speed  83.08571
## 3       stop 142.50704

Classifying a collection of road signs

Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.

The test course includes 59 additional road signs divided into three types:

Stop Sign
Speed Limit Sign
Pedestrian Sign

At the conclusion of the trial, you are asked to measure the car’s overall performance at recognizing these signs.

The class package and the dataset signs are already loaded in your workspace. So is the dataframe test_signs, which holds a set of observations you’ll test your model on.

Classify the test_signs data using knn(). Set train equal to the observations in signs without labels. Use test_signs for the test argument, again without labels. For the cl argument, use the vector of labels provided for you. Use table() to explore the classifier’s performance at identifying the three sign types (the confusion matrix). Create the vector signs_actual by extracting the labels from test_signs. Pass the vector of predictions and the vector of actual signs to table() to cross tabulate them. Compute the overall accuracy of the kNN learner using the mean() function.

# Use kNN to identify the test road signs

# Load the test data (not provided by DataCamp)
test_signs <- read.delim("./DATABASE/sign_test.txt", sep = ",")

head(test_signs)
##    X  sign_type  r1  g1  b1  r2  g2 b2  r3  g3  b3  r4  g4  b4  r5  g5  b5  r6
## 1  8 pedestrian 118 105  69 244 245 67 132 123  12 138 123  85 254 254  92 241
## 2 13 pedestrian 221 244 237  52  45 26 205 233 229 203 230 227 203 221 213  18
## 3 14 pedestrian  44  50  43  98  69 25 170 182 172 170 182 172  35  29  18  85
## 4 19 pedestrian  78 106 102  98 125 82  65  91  75 100 122 116  84 107  88 102
## 5 20 pedestrian 163 181 172  53  51 36 170 181 171  44  51  44  90  86  66 128
## 6 22 pedestrian 117 137 132 116 105 67  58  53  27  37  49  45 115 101  59 131
##    g6  b6  r7  g7 b7  r8  g8  b8  r9  g9  b9 r10 g10 b10 r11 g11 b11 r12 g12
## 1 240 108 254 254 99 108 106  27 135 123  40 254 254 115 254 254  99 138 123
## 2  21  20  27  27 20 193 222 220  76  53  18  12  18  20  19  19  13  70  49
## 3  60  19  26  26 20  84  68  42  94  65  20  22  25  21  21  22  19  97  67
## 4 139  38  27  43 28 146 174 170 182 196 176 108 145  42 107 144  38  45  58
## 5  90  28  68  53 27  90  82  60  90  78  52 125  89  27  59  52  34  57  62
## 6 102  28  19  28 26  27  37  34  19  28  26 139 107  28 139 107  28  14  22
##   b12 r13 g13 b13 r14 g14 b14 r15 g15 b15 r16 g16 b16
## 1  85 118 105  75 131 124   5 106  94  53 101  91  59
## 2  13  74  51  13  74  51  13  34  30  18  50  38  14
## 3  21  94  65  20  94  65  20  97  67  21  49  38  19
## 4  52  67  84  77  91 108 100  49  62  57  45  58  52
## 5  54  60  67  59  53  51  36  67  75  67  74  81  70
## 6  21  15  24  22  18  26  22  19  28  26  19  28  26
# Classify the test signs, dropping the row-ID and sign_type columns
signs_pred <- knn(train = signs, test = test_signs[, c(-1, -2)], cl = sign_types)
# Create a confusion matrix of the predicted versus actual values
signs_actual <- test_signs$sign_type

table(signs_pred, signs_actual)
##             signs_actual
## signs_pred   pedestrian speed stop
##   pedestrian         19     0    0
##   speed               0    21    0
##   stop                0     0   19
# Compute the accuracy
mean(signs_pred == signs_actual)
## [1] 1

Testing other ‘k’ values

By default, the knn() function in the class package uses only the single nearest neighbor.

Setting the k parameter allows the algorithm to consider additional nearby neighbors, enlarging the collection of neighbors that vote on the predicted class.

Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.

The class package is already loaded in your workspace along with the datasets signs, signs_test, and sign_types. The object signs_actual holds the true values of the signs.

Compute the accuracy of the default k = 1 model using the given code, then find the accuracy of the model using mean() to compare signs_actual and the model’s predictions.

signs_test <- read.delim("DATABASE/sign_test.txt", sep = ",")
head(signs_test)
##    X  sign_type  r1  g1  b1  r2  g2 b2  r3  g3  b3  r4  g4  b4  r5  g5  b5  r6
## 1  8 pedestrian 118 105  69 244 245 67 132 123  12 138 123  85 254 254  92 241
## 2 13 pedestrian 221 244 237  52  45 26 205 233 229 203 230 227 203 221 213  18
## 3 14 pedestrian  44  50  43  98  69 25 170 182 172 170 182 172  35  29  18  85
## 4 19 pedestrian  78 106 102  98 125 82  65  91  75 100 122 116  84 107  88 102
## 5 20 pedestrian 163 181 172  53  51 36 170 181 171  44  51  44  90  86  66 128
## 6 22 pedestrian 117 137 132 116 105 67  58  53  27  37  49  45 115 101  59 131
##    g6  b6  r7  g7 b7  r8  g8  b8  r9  g9  b9 r10 g10 b10 r11 g11 b11 r12 g12
## 1 240 108 254 254 99 108 106  27 135 123  40 254 254 115 254 254  99 138 123
## 2  21  20  27  27 20 193 222 220  76  53  18  12  18  20  19  19  13  70  49
## 3  60  19  26  26 20  84  68  42  94  65  20  22  25  21  21  22  19  97  67
## 4 139  38  27  43 28 146 174 170 182 196 176 108 145  42 107 144  38  45  58
## 5  90  28  68  53 27  90  82  60  90  78  52 125  89  27  59  52  34  57  62
## 6 102  28  19  28 26  27  37  34  19  28  26 139 107  28 139 107  28  14  22
##   b12 r13 g13 b13 r14 g14 b14 r15 g15 b15 r16 g16 b16
## 1  85 118 105  75 131 124   5 106  94  53 101  91  59
## 2  13  74  51  13  74  51  13  34  30  18  50  38  14
## 3  21  94  65  20  94  65  20  97  67  21  49  38  19
## 4  52  67  84  77  91 108 100  49  62  57  45  58  52
## 5  54  60  67  59  53  51  36  67  75  67  74  81  70
## 6  21  15  24  22  18  26  22  19  28  26  19  28  26
signs_test <- signs_test %>% select(-X, -sign_type)
# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs, test = signs_test, cl = sign_types)
mean(signs_actual == k_1)
## [1] 1

Modify the knn() function call by setting k = 7 and again find the accuracy value.

# Modify the above to set k = 7
k_7 <- knn(train = signs, test = signs_test, cl = sign_types, k = 7)
mean(signs_actual == k_7)
## [1] 0.9661017

Revise the code once more by setting k = 15, then find the accuracy value one more time.

# Set k = 15 and compare to the above
k_15 <- knn(train = signs, test = signs_test, cl = sign_types, k = 15)
mean(signs_actual == k_15)
## [1] 0.9661017
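
If you want to scan more candidate values of k at once, a short helper like the one below avoids repeating the call by hand. This is only a sketch that reuses the signs, signs_test, sign_types, and signs_actual objects defined above.

# Evaluate several candidate k values in one pass
k_values <- c(1, 5, 7, 11, 15)
k_accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = signs, test = signs_test, cl = sign_types, k = k)
  mean(pred == signs_actual)
})
data.frame(k = k_values, accuracy = k_accuracy)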

Seeing how the neighbors voted

When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.

For example, knowing more about the voters' confidence in the classification could allow an autonomous vehicle to use caution if there is any chance at all that a stop sign is ahead.

In this exercise, you will learn how to obtain the voting results from the knn() function.

The class package has already been loaded in your workspace along with the datasets signs, sign_types, and signs_test.

Build a kNN model with the prob = TRUE parameter to compute the vote proportions. Set k = 7.

# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs, test = signs_test, cl = sign_types, k = 7, prob = TRUE)

Use the attr() function to obtain the vote proportions for the predicted class. These are stored in the attribute “prob”.

# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")

# Examine the first several predictions
head(sign_pred)
## [1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
## Levels: pedestrian speed stop

Examine the first several vote outcomes and percentages using the head() function to see how the confidence varies from sign to sign.

# Examine the proportion of votes for the winning class
head(sign_prob)
## [1] 0.5714286 0.7142857 0.8571429 0.4285714 1.0000000 0.8571429
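
To line each prediction up with its vote share, the two vectors created above can be combined into a small data frame (a quick sketch; sign_confidence is just an illustrative name):

# Pair each test prediction with the proportion of neighbors that voted for it
sign_confidence <- data.frame(predicted = sign_pred, prop_votes = sign_prob)
head(sign_confidence)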

Data preparation for kNN

A min-max normalization function rescales each feature to the 0-1 range:

normalizar <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
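
The function can then be applied column by column with lapply(). A minimal sketch, assuming the feature-only signs data frame from Chapter 1 is still in the workspace (for these pixel features normalization is optional, since every column already shares the same 0-255 scale):

# Rescale every pixel feature of signs to the 0-1 range
signs_normalized <- as.data.frame(lapply(signs, normalizar))
summary(signs_normalized$r1)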

Chapter 2: Naive Bayes

Understanding Bayesian methods

Computing probabilities

The where9am data frame contains 91 days (thirteen weeks) worth of data in which Brett recorded his location at 9am each day as well as whether the daytype was a weekend or weekday.

Using the conditional probability formula below, you can compute the probability that Brett is working in the office, given that it is a weekday.

P(office | weekday) = P(office and weekday) / P(weekday)

Find P(office) using nrow() and subset() to count rows in the dataset and save the result as p_A.

where9am <- read.delim("DATABASE/where9am.txt", sep = ",")
head(where9am)
##     X daytype location
## 1  10 weekday   office
## 2  34 weekday   office
## 3  58 weekday   office
## 4  82 weekend     home
## 5 106 weekend     home
## 6 130 weekday   campus

Find P(weekday), using nrow() and subset() again, and save the result as p_B.

Use nrow() and subset() a final time to find P(office and weekday). Save the result as p_AB.

Compute P(office | weekday) and save the result as p_A_given_B.

# Compute P(A) 
p_A <- nrow(subset(where9am, location == "office")) / nrow(where9am)

# Compute P(B)
p_B <- nrow(subset(where9am, daytype == "weekday")) / nrow(where9am)

# Compute the observed P(A and B)
p_AB <- nrow(subset(where9am, location == "office" & daytype == "weekday")) / nrow(where9am)

Print the value of p_A_given_B.

# Compute P(A | B) and print its value
p_A_given_B <- p_AB / p_B
p_A_given_B
## [1] 0.6
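
The same conditional probabilities can be read off a cross-tabulation; a short sketch using base R (margin = 1 converts counts to row proportions, i.e. conditioning on daytype):

# Cross-tabulate daytype by location and condition on daytype
prop.table(table(where9am$daytype, where9am$location), margin = 1)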

A simple Naive Bayes location model

The previous exercises showed that the probability that Brett is at work or at home at 9am is highly dependent on whether it is the weekend or a weekday.

To see this finding in action, use the where9am data frame to build a Naive Bayes model on the same data.

You can then use this model to predict the future: where does the model think that Brett will be at 9am on Thursday and at 9am on Saturday?

The data frame where9am is available in your workspace. This dataset contains information about Brett’s location at 9am on different days.

Load the naivebayes package. Use naive_bayes() with a formula like y ~ x to build a model of location as a function of daytype.

# Load the naivebayes package
library(naivebayes)

# Build the location prediction model
locmodel <- naive_bayes(location ~  daytype, data = where9am)
## Warning: naive_bayes(): Feature daytype - zero probabilities are present.
## Consider Laplace smoothing.

Forecast the Thursday 9am location using predict() with the thursday9am object as the newdata argument.

thursday9am <- read.delim("DATABASE/thursday9am.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/thursday9am.txt'
head(thursday9am)
##   X daytype
## 1 1 weekday
thursday9am$daytype <- as.character(thursday9am$daytype)
# Predict Thursday's 9am location
predict(locmodel, thursday9am)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] office
## Levels: appointment campus home office

Do the same for predicting the saturday9am location.

saturday9am <- read.delim("DATABASE/saturday9am.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/saturday9am.txt'
head(saturday9am)
##   X daytype
## 1 1 weekend
saturday9am$daytype <- as.character(saturday9am$daytype)
# Predict Saturday's 9am location
predict(locmodel, saturday9am)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] home
## Levels: appointment campus home office

Examining “raw” probabilities

The naivebayes package offers several ways to peek inside a Naive Bayes model.

Typing the name of the model object provides the a priori (overall) and conditional probabilities of each of the model's predictors. If you were so inclined, you could use these to calculate posterior (predicted) probabilities by hand.

Alternatively, R will compute the posterior probabilities for you if the type = “prob” parameter is supplied to the predict() function.

Using these methods, examine how the model’s predicted 9am location probability varies from day-to-day. The model locmodel that you fit in the previous exercise is in your workspace.

Print the locmodel object to the console to view the computed a priori and conditional probabilities. Use the predict() function similarly to the previous exercise, but with type = “prob” to see the predicted probabilities for Thursday at 9am.

# The 'naivebayes' package is loaded into the workspace
# and the Naive Bayes 'locmodel' has been built

# Examine the location prediction model
locmodel
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = location ~ daytype, data = where9am)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
## appointment      campus        home      office 
##  0.01098901  0.10989011  0.45054945  0.42857143 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: daytype (Bernoulli) 
## --------------------------------------------------------------------------------- 
##          
## daytype   appointment    campus      home    office
##   weekday   1.0000000 1.0000000 0.3658537 1.0000000
##   weekend   0.0000000 0.0000000 0.6341463 0.0000000
## 
## ---------------------------------------------------------------------------------
# Obtain the predicted probabilities for Thursday at 9am
predict(locmodel, thursday9am , type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##      appointment    campus      home office
## [1,]  0.01538462 0.1538462 0.2307692    0.6

Compare these to the predicted probabilities for Saturday at 9am.

# Obtain the predicted probabilities for Saturday at 9am
predict(locmodel, saturday9am , type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##       appointment       campus      home      office
## [1,] 3.838772e-05 0.0003838772 0.9980806 0.001497121

Understanding NB’s “naivety”

A more sophisticated location model

The locations dataset records Brett’s location every hour for 13 weeks. Each hour, the tracking information includes the daytype (weekend or weekday) as well as the hourtype (morning, afternoon, evening, or night).

Using this data, build a more sophisticated model to see how Brett’s predicted location not only varies by the day of week but also by the time of day. The dataset locations is already loaded in your workspace.

You can specify additional independent variables in your formula using the + sign (e.g. y ~ x + b).

Use the R formula interface to build a model where location depends on both daytype and hourtype. Recall that the function naive_bayes() takes 2 arguments: formula and data. Predict Brett’s location on a weekday afternoon using the dataframe weekday_afternoon and the predict() function.

# The 'naivebayes' package is loaded into the workspace already
locations <- read.csv("DATABASE/locations.csv", sep = ",")

head(locations)
##   month day   weekday daytype hour hourtype location
## 1     1   4 wednesday weekday    0    night     home
## 2     1   4 wednesday weekday    1    night     home
## 3     1   4 wednesday weekday    2    night     home
## 4     1   4 wednesday weekday    3    night     home
## 5     1   4 wednesday weekday    4    night     home
## 6     1   4 wednesday weekday    5    night     home
weekday_afternoon <- read.delim("DATABASE/weekday_afternoon.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'DATABASE/
## weekday_afternoon.txt'
head(weekday_afternoon)
##    X daytype  hourtype location
## 1 13 weekday afternoon   office
weekday_afternoon$daytype <- as.character(
weekday_afternoon$daytype)


weekday_afternoon$hourtype <- as.character(weekday_afternoon$hourtype)
weekday_afternoon$location <- as.character(weekday_afternoon$location)
# Build a NB model of location
locmodel <- naive_bayes(location ~ daytype + hourtype, data = locations)
## Warning: naive_bayes(): Feature daytype - zero probabilities are present.
## Consider Laplace smoothing.
## Warning: naive_bayes(): Feature hourtype - zero probabilities are present.
## Consider Laplace smoothing.
# Predict Brett's location on a weekday afternoon
predict(locmodel, weekday_afternoon)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] office
## Levels: appointment campus home office restaurant store theater

Do the same for a weekday_evening.

weekday_evening <- read.delim("DATABASE/weekday_evening.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/weekday_evening.txt'
head(weekday_evening)
##    X daytype hourtype location
## 1 19 weekday  evening     home
weekday_evening$daytype <- as.character(weekday_evening$daytype)
weekday_evening$hourtype <- as.character(weekday_evening$hourtype)
weekday_evening$location <- as.character(weekday_evening$location)
# Predict Brett's location on a weekday evening
predict(locmodel, weekday_evening)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] home
## Levels: appointment campus home office restaurant store theater

Preparing for unforeseen circumstances

While Brett was tracking his location over 13 weeks, he never went into the office during the weekend. Consequently, the joint probability of P(office and weekend) = 0.

Explore how this impacts the predicted probability that Brett may go to work on the weekend in the future. Additionally, you can see how using the Laplace correction will allow a small chance for these types of unforeseen circumstances.

The model locmodel is already in your workspace, along with the dataframe weekend_afternoon.

Use the locmodel to output predicted probabilities for a weekend afternoon by using the predict() function. Remember to set the type argument.

# The 'naivebayes' package is loaded into the workspace already
# The Naive Bayes location model (locmodel) has already been built
weekend_afternoon <- read.delim("DATABASE/weekend_afternoon.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'DATABASE/
## weekend_afternoon.txt'
head(weekend_afternoon)
##    X daytype  hourtype location
## 1 85 weekend afternoon     home
# Convert the factor columns to character, matching the other test data frames
# weekend_afternoon$daytype  <- as.character(weekend_afternoon$daytype)
# weekend_afternoon$hourtype <- as.character(weekend_afternoon$hourtype)
# weekend_afternoon$location <- as.character(weekend_afternoon$location)
#
# # Observe the predicted probabilities for a weekend afternoon
# predict(locmodel, weekend_afternoon, type = "prob")

Create a new naive Bayes model with the Laplace smoothing parameter set to 1. You can do this by setting the laplace argument in your call to naive_bayes(). Save this as locmodel2.

# Build a new model using the Laplace correction
locmodel2 <- naive_bayes(location ~ daytype + hourtype, data = locations, laplace = 1)

See how the new predicted probabilities compare by using the predict() function on your new model.

weekend_afternoon$daytype <- as.character(weekend_afternoon$daytype)
weekend_afternoon$hourtype <- as.character(weekend_afternoon$hourtype)
weekend_afternoon$location <- as.character(weekend_afternoon$location)
# Observe the new predicted probabilities for a weekend afternoon
predict(locmodel2, weekend_afternoon, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##      appointment      campus      home      office restaurant      store
## [1,]  0.02013872 0.006187715 0.8308154 0.007929249  0.1098743 0.01871085
##          theater
## [1,] 0.006343697

Chapter 3: Logistic Regression

Making binary predictions with regression

Building simple logistic regression models

The donors dataset contains 93,462 examples of people who were mailed a fundraising solicitation for paralyzed military veterans. The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise. This binary outcome will be the dependent variable for the logistic regression model.

The remaining columns are features of the prospective donors that may influence their donation behavior. These are the model’s independent variables.

When building a regression model, it is often helpful to form a hypothesis about which independent variables will be predictive of the dependent variable. The bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation. Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving.

In this exercise, you will use these three factors to create a simple model of donation behavior. The dataset donors is available in your workspace.

Examine donors using the str() function.

donors <- read.csv("DATABASE/donors.csv")
head(donors)
##   donated veteran bad_address age has_children wealth_rating interest_veterans
## 1       0       0           0  60            0             0                 0
## 2       0       0           0  46            1             3                 0
## 3       0       0           0  NA            0             1                 0
## 4       0       0           0  70            0             2                 0
## 5       0       0           0  78            1             1                 0
## 6       0       0           0  NA            0             0                 0
##   interest_religion pet_owner catalog_shopper recency  frequency  money
## 1                 0         0               0 CURRENT   FREQUENT MEDIUM
## 2                 0         0               0 CURRENT   FREQUENT   HIGH
## 3                 0         0               0 CURRENT   FREQUENT MEDIUM
## 4                 0         0               0 CURRENT   FREQUENT MEDIUM
## 5                 1         0               1 CURRENT   FREQUENT MEDIUM
## 6                 0         0               0 CURRENT INFREQUENT MEDIUM
# Examine the dataset to identify potential independent variables
str(donors)
## 'data.frame':    93462 obs. of  13 variables:
##  $ donated          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ veteran          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bad_address      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age              : int  60 46 NA 70 78 NA 38 NA NA 65 ...
##  $ has_children     : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ wealth_rating    : int  0 3 1 2 1 0 2 3 1 0 ...
##  $ interest_veterans: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ interest_religion: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pet_owner        : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ catalog_shopper  : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ recency          : Factor w/ 2 levels "CURRENT","LAPSED": 1 1 1 1 1 1 1 1 1 1 ...
##  $ frequency        : Factor w/ 2 levels "FREQUENT","INFREQUENT": 1 1 1 1 1 2 2 1 2 2 ...
##  $ money            : Factor w/ 2 levels "HIGH","MEDIUM": 2 1 2 2 2 2 2 2 2 2 ...

Count the number of occurrences of each level of the donated variable using the table() function.

# Explore the dependent variable
table(donors$donated)
## 
##     0     1 
## 88751  4711

Fit a logistic regression model using the formula interface and the three independent variables described above.

Call glm() with the formula as its first argument and the dataframe as the data argument.

Save the result as donation_model.

# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion +  interest_veterans, 
                      data = donors, family = "binomial")

Summarize the model object with summary().

# Summarize the model results
summary(donation_model)
## 
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5

Making a binary prediction

In the previous exercise, you used the glm() function to build a logistic regression model of donor behavior. As with many of R’s machine learning methods, you can apply the predict() function to the model object to forecast future behavior. By default, predict() outputs predictions in terms of log odds unless type = “response” is specified. This converts the log odds to probabilities.
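
As a quick check of that relationship (a minimal sketch; no output shown), the default log-odds predictions can be converted to probabilities with the logistic function and compared against type = "response":

# Log odds (the default scale) versus probabilities for the first prospects
log_odds <- head(predict(donation_model))
probs <- head(predict(donation_model, type = "response"))
# The logistic transformation recovers the probabilities from the log odds
all.equal(as.numeric(probs), as.numeric(1 / (1 + exp(-log_odds))))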

Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. You must balance the extremes of being too cautious versus being too aggressive. For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities who would still donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.

The dataset donors and the model donation_model are already loaded in your workspace.

Use the predict() function to estimate each person's donation probability. Use the type argument to get probabilities. Assign the predictions to a new column called donation_prob.

# Estimate the donation probability
donors$donation_prob <- predict(donation_model, type = "response")

Find the actual probability that an average person would donate by passing the mean() function the appropriate column of the donors dataframe.

# Find the donation probability of the average prospect
mean(donors$donated)
## [1] 0.05040551

Use ifelse() to predict a donation if their predicted donation probability is greater than average. Assign the predictions to a new column called donation_pred.

# Predict a donation if probability of donation is greater than average
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)

Use the mean() function to calculate the model’s accuracy.

# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
## [1] 0.794815
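
For context, a baseline that predicts "no donation" for every prospect would be correct about 95% of the time (1 - 0.0504), which is higher than the model's 79% accuracy. That is the motivation for the ROC analysis in the next section; a one-line check:

# Accuracy of an "always predict no donation" baseline
mean(donors$donated == 0)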

Model performance tradeoffs

Calculating ROC Curves and AUC

The previous exercises have demonstrated that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model's performance better illustrates the tradeoff between a model that is overly aggressive and one that is overly passive.

In this exercise you will create a ROC curve and compute the area under the curve (AUC) to evaluate the logistic regression model of donations you built earlier.

The dataset donors, with the column of predicted probabilities donation_prob, is already loaded in your workspace.

Load the pROC package. Create a ROC curve with roc() and the columns of actual and predicted donations. Store the result as ROC.

# Load the pROC package
library(pROC)

# Create a ROC curve
ROC <- roc(donors$donated, donors$donation_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Use plot() to draw the ROC object. Specify col = “blue” to color the curve blue.

# Plot the ROC curve
plot(ROC, col = "blue")

Compute the area under the curve with auc().

# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.5102

Dummy variables, missing data and interactions

Coding categorical features

Sometimes a dataset contains numeric values that represent a categorical feature.

In the donors dataset, wealth_rating uses numbers to indicate the donor’s wealth level:

0 = Unknown
1 = Low
2 = Medium
3 = High

This exercise illustrates how to prepare this type of categorical feature and examines its impact on a logistic regression model. The dataframe donors is loaded in your workspace.

Create a factor wealth_levels from the numeric wealth_rating with labels as shown above by passing the factor() function the column you want to convert, the individual levels, and the labels.

# Convert the wealth rating to a factor
donors$wealth_levels <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))

Use relevel() to change the reference category to Medium. The first argument should be your new factor column.

# Use relevel() to change reference category
donors$wealth_levels <- relevel(donors$wealth_levels, ref = "Medium")

Build a logistic regression model using the column wealth_levels to predict donated and display the result with summary().

# See how our factor coding impacts the model
summary(glm(donated ~ wealth_levels, data = donors, family = "binomial"))
## 
## Call:
## glm(formula = donated ~ wealth_levels, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3320  -0.3243  -0.3175  -0.3175   2.4582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
## wealth_levelsUnknown -0.04373    0.04243  -1.031    0.303    
## wealth_levelsLow     -0.05245    0.05332  -0.984    0.325    
## wealth_levelsHigh     0.04804    0.04768   1.008    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37323  on 93458  degrees of freedom
## AIC: 37331
## 
## Number of Fisher Scoring iterations: 5

Handling missing data

Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.

One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.

The dataframe donors is loaded in your workspace.

Use summary() on donors$age to find the average age of prospects with non-missing data. Use ifelse() and the test is.na(donors$age) to impute the average (rounded to 2 decimal places) for cases with missing age. Be sure to ignore NAs (na.rm = TRUE) when computing the mean.

# Find the average age among non-missing values
summary(donors$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   48.00   62.00   61.65   75.00   98.00   22546
# Impute missing age values with the mean age
donors$imputed_age <- ifelse(is.na(donors$age), round(mean(donors$age, na.rm = TRUE), 2), donors$age)

Create a binary dummy variable named missing_age indicating the presence of missing data using another ifelse() call and the same test.

# Create missing value indicator for age
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)

Building a more sophisticated model

One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M:

Recency
Frequency
Money

Donors that have given both recently and frequently may be especially likely to give again; in other words, the combined impact of recency and frequency may be greater than the sum of the separate effects.

Because these predictors together have a greater impact on the dependent variable, their joint effect must be modeled as an interaction. The donors dataset has been loaded for you.

Create a logistic regression model of donated as a function of money plus the interaction of recency and frequency. Use * to add the interaction term.

# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ recency * frequency + money, data = donors, family = "binomial")

Examine the model’s summary() to confirm the interaction effect was added. Save the model’s predicted probabilities as rfm_prob. Use the predict() function, and remember to set the type argument.

# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)
## 
## Call:
## glm(formula = donated ~ recency * frequency + money, family = "binomial", 
##     data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3696  -0.3696  -0.2895  -0.2895   2.7924  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.01142    0.04279 -70.375   <2e-16 ***
## recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364 *  
## frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16 ***
## moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 36938  on 93457  degrees of freedom
## AIC: 36948
## 
## Number of Fisher Scoring iterations: 6
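
In R's formula interface, recency * frequency expands to both main effects plus their interaction, so the same model could equivalently be specified with the interaction written out (rfm_model2 is just an illustrative name):

# Equivalent model with the interaction term spelled out explicitly
rfm_model2 <- glm(donated ~ recency + frequency + recency:frequency + money,
                  data = donors, family = "binomial")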

Plot a ROC curve by using the function roc(). Remember, this function takes the column of outcomes and the vector of predictions. Compute the AUC for the new model with the function auc() and compare performance to the simpler model.

# Compute predicted probabilities for the RFM model
# (note: predict() takes newdata, not data, for the observations to score)
rfm_prob <- predict(rfm_model, newdata = donors, type = "response")

# Plot the ROC curve for the new model
library(pROC)
ROC <- roc(donors$donated, rfm_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")

auc(ROC)
## Area under the curve: 0.5785

Automatic feature selection

Building a stepwise regression model

In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.

In this exercise, you will use a forward stepwise approach to add predictors to the model one-by-one until no additional benefit is seen. The donors dataset has been loaded for you.

Use the R formula interface with glm() to specify the base model with no predictors. Set the explanatory variable equal to 1.

# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")

Use the R formula interface again with glm() to specify the model with all predictors.

# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")

Apply step() to these models to perform forward stepwise regression. Set the first argument to null_model and set direction = “forward”. This might take a while (up to 10 or 15 seconds) as your computer has to fit quite a few different models to perform stepwise selection.

# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
## Start:  AIC=37332.13
## donated ~ 1
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + frequency          1    28502 37122
## + money              1    28621 37241
## + wealth_rating      1    28705 37326
## + has_children       1    28705 37326
## + age                1    28707 37328
## + imputed_age        1    28707 37328
## + wealth_levels      3    28704 37328
## + interest_veterans  1    28709 37330
## + donation_prob      1    28710 37330
## + donation_pred      1    28710 37330
## + catalog_shopper    1    28710 37330
## + pet_owner          1    28711 37331
## <none>                    28714 37332
## + interest_religion  1    28712 37333
## + recency            1    28713 37333
## + bad_address        1    28714 37334
## + veteran            1    28714 37334
## 
## Step:  AIC=37024.77
## donated ~ frequency
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + money              1    28441 36966
## + wealth_rating      1    28493 37018
## + wealth_levels      3    28490 37019
## + has_children       1    28494 37019
## + donation_prob      1    28498 37023
## + interest_veterans  1    28498 37023
## + catalog_shopper    1    28499 37024
## + donation_pred      1    28499 37024
## + age                1    28499 37024
## + imputed_age        1    28499 37024
## + pet_owner          1    28499 37024
## <none>                    28502 37025
## + interest_religion  1    28501 37026
## + recency            1    28501 37026
## + bad_address        1    28502 37026
## + veteran            1    28502 37027
## 
## Step:  AIC=36949.71
## donated ~ frequency + money
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + wealth_levels      3    28427 36942
## + wealth_rating      1    28431 36942
## + has_children       1    28432 36943
## + interest_veterans  1    28438 36948
## + donation_prob      1    28438 36949
## + catalog_shopper    1    28438 36949
## + donation_pred      1    28438 36949
## + age                1    28438 36949
## + imputed_age        1    28438 36949
## + pet_owner          1    28439 36949
## <none>                    28441 36950
## + interest_religion  1    28440 36951
## + recency            1    28440 36951
## + bad_address        1    28441 36951
## + veteran            1    28441 36952
## 
## Step:  AIC=36945.48
## donated ~ frequency + money + wealth_levels
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + has_children       1    28416 36937
## + age                1    28424 36944
## + imputed_age        1    28424 36944
## + interest_veterans  1    28424 36945
## + donation_prob      1    28424 36945
## + catalog_shopper    1    28424 36945
## + donation_pred      1    28425 36945
## <none>                    28427 36945
## + pet_owner          1    28425 36946
## + interest_religion  1    28426 36947
## + recency            1    28426 36947
## + bad_address        1    28427 36947
## + veteran            1    28427 36947
## 
## Step:  AIC=36938.4
## donated ~ frequency + money + wealth_levels + has_children
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + pet_owner          1    28413 36937
## + donation_prob      1    28413 36937
## + catalog_shopper    1    28413 36937
## + interest_veterans  1    28413 36937
## + donation_pred      1    28414 36938
## <none>                    28416 36938
## + interest_religion  1    28415 36939
## + age                1    28416 36940
## + imputed_age        1    28416 36940
## + recency            1    28416 36940
## + bad_address        1    28416 36940
## + veteran            1    28416 36940
## 
## Step:  AIC=36932.25
## donated ~ frequency + money + wealth_levels + has_children + 
##     pet_owner
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## <none>                    28413 36932
## + donation_prob      1    28411 36932
## + interest_veterans  1    28411 36932
## + catalog_shopper    1    28412 36933
## + donation_pred      1    28412 36933
## + age                1    28412 36933
## + imputed_age        1    28412 36933
## + recency            1    28413 36934
## + interest_religion  1    28413 36934
## + bad_address        1    28413 36934
## + veteran            1    28413 36934

Create a vector of predicted probabilities using the predict() function.

# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")

Plot the ROC curve with roc() and plot() and compute the AUC of the stepwise model with auc().

# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors$donated, step_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")

auc(ROC)
## Area under the curve: 0.5849


Chapter 4: Classification Trees

Making decisions with trees

Building a simple decision tree

The loans dataset contains 11,312 randomly selected people who applied for, and later received, loans from Lending Club, a US-based peer-to-peer lending company.

You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.

Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.

The dataset loans is already in your workspace.

Load the rpart package. Fit a decision tree model with the function rpart(). Supply the R formula that specifies outcome as a function of loan_amount and credit_score as the first argument. Leave the control argument alone for now. (You’ll learn more about that later!) Use predict() with the resulting loan model to predict the outcome for the good_credit applicant. Use the type argument to predict the “class” of the outcome. Do the same for the bad_credit applicant.

# Load the rpart package
library(rpart)

loans <- read.delim("DATABASE/loans.txt", sep = ",")
head(loans)
##    X loan_amount  emp_length home_ownership income       loan_purpose
## 1  1         LOW   10+ years           RENT    LOW        credit_card
## 2  2         LOW   < 2 years           RENT    LOW                car
## 3  8         LOW 6 - 9 years           RENT MEDIUM                car
## 4  9      MEDIUM 2 - 5 years            OWN MEDIUM     small_business
## 5 10         LOW   < 2 years           RENT    LOW              other
## 6 13      MEDIUM   < 2 years           RENT    LOW debt_consolidation
##   debt_to_income credit_score recent_inquiry delinquent credit_accounts
## 1           HIGH      AVERAGE            YES      NEVER             FEW
## 2            LOW      AVERAGE            YES      NEVER             FEW
## 3            LOW          LOW            YES      NEVER             FEW
## 4            LOW      AVERAGE            YES      NEVER         AVERAGE
## 5        AVERAGE      AVERAGE             NO      NEVER             FEW
## 6        AVERAGE      AVERAGE            YES      NEVER             FEW
##   bad_public_record credit_utilization past_bankrupt outcome
## 1                NO               HIGH            NO  repaid
## 2                NO                LOW            NO default
## 3                NO               HIGH            NO  repaid
## 4                NO             MEDIUM            NO default
## 5                NO             MEDIUM            NO default
## 6                NO               HIGH            NO default
# Build a lending model predicting loan outcome versus loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))

good_credit <- read.delim("DATABASE/good_credit.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/good_credit.txt'
head(good_credit)
##   X loan_amount emp_length home_ownership income   loan_purpose debt_to_income
## 1 1         LOW  10+ years       MORTGAGE   HIGH major_purchase        AVERAGE
##   credit_score recent_inquiry delinquent credit_accounts bad_public_record
## 1         HIGH             NO      NEVER            MANY                NO
##   credit_utilization past_bankrupt outcome
## 1                LOW            NO  repaid
# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
##      1 
## repaid 
## Levels: default repaid
bad_credit <- read.delim("DATABASE/bad_credit.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/bad_credit.txt'
head(bad_credit)
##   X loan_amount  emp_length home_ownership income loan_purpose debt_to_income
## 1 1         LOW 6 - 9 years           RENT MEDIUM          car            LOW
##   credit_score recent_inquiry delinquent credit_accounts bad_public_record
## 1          LOW            YES      NEVER             FEW                NO
##   credit_utilization past_bankrupt outcome
## 1               HIGH            NO  repaid
# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
##       1 
## default 
## Levels: default repaid

Visualizing classification trees

Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.

The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions. The model loan_model that you fit in the last exercise is in your workspace.

Type loan_model to see a text representation of the classification tree. Load the rpart.plot package. Apply the rpart.plot() function to the loan model to visualize the tree. See how changing other plotting parameters impacts the visualization by running the supplied command.

# Examine the loan_model object
loan_model
## n= 11312 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 11312 5654 repaid (0.4998232 0.5001768)  
##    2) credit_score=AVERAGE,LOW 9490 4437 default (0.5324552 0.4675448)  
##      4) credit_score=LOW 1667  631 default (0.6214757 0.3785243) *
##      5) credit_score=AVERAGE 7823 3806 default (0.5134859 0.4865141)  
##       10) loan_amount=HIGH 2472 1079 default (0.5635113 0.4364887) *
##       11) loan_amount=LOW,MEDIUM 5351 2624 repaid (0.4903756 0.5096244)  
##         22) loan_amount=LOW 1810  874 default (0.5171271 0.4828729) *
##         23) loan_amount=MEDIUM 3541 1688 repaid (0.4767015 0.5232985) *
##    3) credit_score=HIGH 1822  601 repaid (0.3298573 0.6701427) *
# Load the rpart.plot package
library(rpart.plot)

# Plot the loan_model with default settings
rpart.plot(loan_model)

# Plot the loan_model with customized settings
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)

Growing larger classification trees

Creating random test datasets

Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.

You can use 75% of the observations for training and 25% for testing the model.

The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.

Use the resulting vector of row IDs to subset the loans into training and testing datasets. The dataset loans is loaded in your workspace.
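Because sample() draws rows at random, the exact split (and therefore the accuracy figures reported below) will differ from run to run. As an optional aside (not part of the original exercise), you could fix the random seed before sampling to make the split reproducible:

# Optional: fix the seed so the 75/25 split is reproducible (any fixed value works)
set.seed(123)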

# Determine the number of rows for training
nrow(loans) * 0.75
## [1] 8484
# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans) * 0.75)

# Create the training dataset
loans_train <- loans[sample_rows, ]

# Create the test dataset
loans_test <- loans[-sample_rows, ]

Building and evaluating a larger tree

Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.

Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.

Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.

The rpart package is loaded into the workspace and the loans_train and loans_test datasets have been created.

Use rpart() to build a loan model using the training dataset and all of the available predictors. Again, leave the control argument alone. Applying the predict() function to the testing dataset, create a vector of predicted outcomes. Don’t forget the type argument. Create a table() to compare the predicted values to the actual outcome values. Compute the accuracy of the predictions using the mean() function.

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
##          
##           default repaid
##   default     795    718
##   repaid      594    721
# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5360679
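As a check, the same accuracy can be read off the confusion matrix as the proportion of cases on its diagonal: (795 + 721) / (795 + 718 + 594 + 721) = 1516 / 2828 ≈ 0.536, which matches the mean() computation above.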

Tending to classification trees

Preventing overgrown trees

The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.

Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.

rpart is loaded.

Use rpart() to build a loan model using the training dataset and all of the available predictors. Set the model controls using rpart.control() with parameters cp set to 0 and maxdepth set to 6. See how the test set accuracy of the simpler model compares to the original accuracy of 58.3%. First create a vector of predictions using the predict() function. Compare the predictions to the actual outcomes and use mean() to calculate the accuracy.

# Grow a tree with maxdepth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, maxdepth = 6))

# Make a class prediction on the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Compute the accuracy of the simpler tree
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5848656

In the model controls, remove maxdepth and add a minimum split parameter, minsplit, set to 500.

# Swap maxdepth for a minimum split of 500 
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, minsplit = 500))

# Run this. How does the accuracy change?
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5880481

Creating a nicely pruned tree

Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.

By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.

In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.

The rpart package is loaded into the workspace, along with loans_test and loans_train.

Use all of the applicant variables and no pre-pruning to create an overly complex tree. Make sure to set cp = 0 in rpart.control() to prevent pre-pruning. Create a complexity plot by using plotcp() on the model.

# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Examine the complexity plot
plotcp(loan_model)
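If you prefer to pick cp numerically rather than reading it off the complexity plot by eye, a rough alternative (an aside, not part of the original exercise) is to inspect the cross-validation table that plotcp() is based on and take the CP value with the smallest cross-validated error (xerror); the value chosen this way may differ slightly from the 0.0014 used below.

# The table behind plotcp(): columns CP, nsplit, rel error, xerror, xstd
head(loan_model$cptable)

# One common heuristic: the CP value with the lowest cross-validated error
loan_model$cptable[which.min(loan_model$cptable[, "xerror"]), "CP"]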

Based on the complexity plot, prune the tree to a complexity of 0.0014 using the prune() function with the tree and the complexity parameter. Compare the accuracy of the pruned tree to the original accuracy of 58.3%. To calculate the accuracy use the predict() and mean() functions.

# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5975955
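For comparison, on this particular train/test split the fully grown tree scored about 0.536, the maxdepth = 6 tree 0.585, the minsplit = 500 tree 0.588, and the pruned tree 0.598, so post-pruning gives the best test accuracy of the four.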

Seeing the forest from the trees

Building a random forest model

Although a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly tuned tree.

Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.

Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.

Load the randomForest package. Build a random forest model using all of the loan application variables; the randomForest() function also uses the formula interface. Compute the accuracy of the random forest model, to compare to the original tree's accuracy of 58.3%, using predict() and mean().

# Load the randomForest package
# library(randomForest)
# 
# # Build a random forest model
# loan_model <- randomForest(outcome ~ ., data = loans_train)
# 
# # Compute the accuracy of the random forest
# loans_test$pred <- predict(loan_model, loans_test)
# mean(loans_test$pred == loans_test$outcome)
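The chunk above is left commented out, so no forest is grown here. A minimal sketch of how it could be run, assuming the randomForest package is installed (loan_forest and pred_rf are just new names chosen to avoid overwriting the earlier objects; the printed accuracy depends on the random seed and on the train/test split, so none is shown):

library(randomForest)

# Make the forest reproducible (arbitrary seed)
set.seed(214)

# Grow a forest from all predictors in the training data
loan_forest <- randomForest(outcome ~ ., data = loans_train)

# Accuracy on the held-out test set
loans_test$pred_rf <- predict(loan_forest, loans_test)
mean(loans_test$pred_rf == loans_test$outcome)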

Supervised Learning in R: Regression

What is Regression?

Welcome and Introduction

Linear regression - the fundamental method

Code a simple one-variable regression

For the first coding exercise, you’ll create a formula to define a one-variable modeling task, and then fit a linear model to the data. You are given the rates of male and female unemployment in the United States over several years (Source).

The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.

The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.

Recall the calling interface for lm() is:

lm(formula, data = ___)

The data frame unemployment is in your workspace.

Define a formula that expresses female_unemployment as a function of male_unemployment. Assign the formula to the variable fmla and print it.

unemployment <- read.delim("DATABASE/unemployment.txt", sep = ",")
head(unemployment)
##   X male_unemployment female_unemployment
## 1 1               2.9                 4.0
## 2 2               6.7                 7.4
## 3 3               4.9                 5.0
## 4 4               7.9                 7.2
## 5 5               9.8                 7.9
## 6 6               6.9                 6.1
# unemployment is loaded in the workspace
summary(unemployment)
##        X      male_unemployment female_unemployment
##  Min.   : 1   Min.   :2.900     Min.   :4.000      
##  1st Qu.: 4   1st Qu.:4.900     1st Qu.:4.400      
##  Median : 7   Median :6.000     Median :5.200      
##  Mean   : 7   Mean   :5.954     Mean   :5.569      
##  3rd Qu.:10   3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :13   Max.   :9.800     Max.   :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment

# Print it
fmla
## female_unemployment ~ male_unemployment

Then use lm() and fmla to fit a linear model to predict female unemployment from male unemployment using the data set unemployment. Print the model. Is the coefficient for male unemployment consistent with what you would expect? Does female unemployment increase as male unemployment does?

# Use the formula to fit a model: unemployment_model
unemployment_model <-  lm(fmla, data = unemployment)

# Print it 
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

Examining a model

Let’s look at the model unemployment_model that you have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().

The object unemployment_model is in your workspace.

Print unemployment_model again. What information does it report? Call summary() on unemployment_model. In addition to the coefficient values, you get standard errors on the coefficient estimates, and some goodness-of-fit metrics like R-squared.

# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

Call glance() on the model to see the performance metrics in an orderly data frame. Can you match the information from summary() to the columns of glance()?

# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.821         0.805 0.580      50.6 1.97e-5     1  -10.3  26.6  28.3
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Now call wrapFTest() on the model to see the R-squared again.

# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."

Predicting once you fit a model

Predicting from the unemployment model

In this exercise, you will use your unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. You will also use your model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.

The predict() interface for lm models takes the form

predict(model, newdata)

You will use the ggplot2 package to make the plots, so you will add the prediction column to the unemployment data frame. You will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is, when the outcome is equal to the predicted value).

The ggplot2 command to plot a scatterplot of dframe$outcome versus dframe$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred, is as follows:

ggplot(dframe, aes(x = pred, y = outcome)) + 
  geom_point() +
  geom_abline(color = "blue")

The objects unemployment, unemployment_model and newrates are in your workspace.

Use predict() to predict female unemployment rates from the unemployment data. Assign it to a new column: prediction.

newrates <- read.delim("DATABASE/newrates.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/newrates.txt'
head(newrates)
##   X male_unemployment
## 1 1                 5
# unemployment is in your workspace
summary(unemployment)
##        X      male_unemployment female_unemployment
##  Min.   : 1   Min.   :2.900     Min.   :4.000      
##  1st Qu.: 4   1st Qu.:4.900     1st Qu.:4.400      
##  Median : 7   Median :6.000     Median :5.200      
##  Mean   : 7   Mean   :5.954     Mean   :5.569      
##  3rd Qu.:10   3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :13   Max.   :9.800     Max.   :7.900
# newrates is in your workspace
newrates
##   X male_unemployment
## 1 1                 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model)

Use the library() command to load the ggplot2 package. Use ggplot() to compare the predictions to actual unemployment rates. Put the predictions on the x axis. How close are the results to the line of perfect prediction? Use the data frame newrates to predict expected female unemployment rate when male unemployment is 5%. Assign the answer to the variable pred and print it.

# load the ggplot2 package
library(ggplot2)

# Make a plot to compare predictions to actual (prediction on x axis)
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata = newrates)
# Print it
pred
##        1 
## 4.906757

Multivariate linear regression (Part 1)

In this exercise, you will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.

The data frame bloodpressure is in the workspace.

Define a formula that expresses blood_pressure explicitly as a function of age and weight. Assign the formula to the variable fmla and print it.

bloodpressure <- read.delim("DATABASE/bloodpressure.txt", sep = ",")
head(bloodpressure)
##   X blood_pressure age weight
## 1 1            132  52    173
## 2 2            143  59    184
## 3 3            153  67    194
## 4 4            162  73    211
## 5 5            154  64    196
## 6 6            168  74    220
# bloodpressure is in the workspace
summary(bloodpressure)
##        X        blood_pressure       age            weight   
##  Min.   : 1.0   Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.: 3.5   1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median : 6.0   Median :153.0   Median :64.00   Median :194  
##  Mean   : 6.0   Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.: 8.5   3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :11.0   Max.   :168.0   Max.   :74.00   Max.   :220
head(bloodpressure)
##   X blood_pressure age weight
## 1 1            132  52    173
## 2 2            143  59    184
## 3 3            153  67    194
## 4 4            162  73    211
## 5 5            154  64    196
## 6 6            168  74    220

Use fmla to fit a linear model to predict blood_pressure from age and weight in the data set bloodpressure. Call the model bloodpressure_model.

# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla
## blood_pressure ~ age + weight

Print the model and call summary() on it. Does blood pressure increase or decrease with age? With weight?

# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary()
bloodpressure_model
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349
summary(bloodpressure_model)
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4640 -1.1949 -0.4078  1.8511  2.6981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  30.9941    11.9438   2.595  0.03186 * 
## age           0.8614     0.2482   3.470  0.00844 **
## weight        0.3349     0.1307   2.563  0.03351 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9711 
## F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07

Multivariate linear regression (Part 2)

Now you will make predictions using the blood pressure model bloodpressure_model that you fit in the previous exercise.

You will also compare the predictions to outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:

ggplot(dframe, aes(x = pred, y = outcome)) + 
  geom_point() + 
  geom_abline(color = "blue")

The objects bloodpressure and bloodpressure_model are in the workspace.

# bloodpressure is in your workspace
summary(bloodpressure)
##        X        blood_pressure       age            weight   
##  Min.   : 1.0   Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.: 3.5   1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median : 6.0   Median :153.0   Median :64.00   Median :194  
##  Mean   : 6.0   Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.: 8.5   3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :11.0   Max.   :168.0   Max.   :74.00   Max.   :220
# bloodpressure_model is in your workspace
bloodpressure_model
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349

Use predict() to predict blood pressure in the bloodpressure dataset. Assign the predictions to the column prediction.

# Predict blood pressure using bloodpressure_model: prediction
bloodpressure$prediction <- predict(bloodpressure_model, newdata = bloodpressure)
head(bloodpressure)
##   X blood_pressure age weight prediction
## 1 1            132  52    173   133.7183
## 2 2            143  59    184   143.4317
## 3 3            153  67    194   153.6716
## 4 4            162  73    211   164.5327
## 5 5            154  64    196   151.7570
## 6 6            168  74    220   168.4078

Graphically compare the predictions to actual blood pressures. Put predictions on the x axis. How close are the results to the line of perfect prediction?

# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")

Training and Evaluating Regression Models

Evaluating a model graphically

Graphically evaluate the unemployment model

In this exercise you will graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment.

You will plot the model’s predictions against the actual female_unemployment; recall the command is of the form

ggplot(dframe, aes(x = pred, y = outcome)) + 
  geom_point() +
  geom_abline()

Then you will calculate the residuals:

residuals <- actual outcome - predicted outcome

and plot predictions against residuals. The residual graph takes a slightly different form: you compare the residuals to the horizontal line residuals == 0 (using geom_hline()) rather than to the line x = y. The command will be provided.

The data frame unemployment and model unemployment_model are available in the workspace.

# unemployment, unemployment_model are in the workspace
summary(unemployment)
##        X      male_unemployment female_unemployment   prediction   
##  Min.   : 1   Min.   :2.900     Min.   :4.000       Min.   :3.448  
##  1st Qu.: 4   1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837  
##  Median : 7   Median :6.000     Median :5.200       Median :5.601  
##  Mean   : 7   Mean   :5.954     Mean   :5.569       Mean   :5.569  
##  3rd Qu.:10   3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087  
##  Max.   :13   Max.   :9.800     Max.   :7.900       Max.   :8.240
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
head(unemployment)
##   X male_unemployment female_unemployment prediction
## 1 1               2.9                 4.0   3.448245
## 2 2               6.7                 7.4   6.087456
## 3 3               4.9                 5.0   4.837304
## 4 4               7.9                 7.2   6.920891
## 5 5               9.8                 7.9   8.240497
## 6 6               6.9                 6.1   6.226362

Use predict() to get the model predictions and add them to unemployment as the column predictions.

# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, newdata = unemployment)

Plot predictions (on the x-axis) versus actual female unemployment rates. Are the predictions near the x = y line?

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment  )) + 
  geom_point() + 
  geom_abline()

The gain curve to evaluate the unemployment model

In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals. Now, you will also plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.

For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.

Calls to the function GainCurvePlot() look like:

GainCurvePlot(frame, xvar, truthvar, title)

where

frame is a data frame
xvar and truthvar are strings naming the prediction and actual outcome columns of frame
title is the title of the plot

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.

The data frame unemployment and the model unemployment_model are in the workspace.

# unemployment is in the workspace (with predictions)
summary(unemployment)
##        X      male_unemployment female_unemployment   prediction   
##  Min.   : 1   Min.   :2.900     Min.   :4.000       Min.   :3.448  
##  1st Qu.: 4   1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837  
##  Median : 7   Median :6.000     Median :5.200       Median :5.601  
##  Mean   : 7   Mean   :5.954     Mean   :5.569       Mean   :5.569  
##  3rd Qu.:10   3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087  
##  Max.   :13   Max.   :9.800     Max.   :7.900       Max.   :8.240  
##   predictions   
##  Min.   :3.448  
##  1st Qu.:4.837  
##  Median :5.601  
##  Mean   :5.569  
##  3rd Qu.:6.087  
##  Max.   :8.240
# unemployment_model is in the workspace
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

Load the package WVPlots using library(). Plot the gain curve. Give the plot the title “Unemployment model”. Do the model’s predictions sort correctly?

# Load the package WVPlots
library(WVPlots)

# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")

Root Mean Squared Error (RMSE)

Calculate RMSE

In this exercise you will calculate the RMSE of your unemployment model. In the previous coding exercise you added the model's predictions to the unemployment dataset as the predictions column; here you will also compute the residuals between the predictions and the outcome. You can calculate the RMSE from a vector of residuals, res, as:

rmse = sqrt(mean(res^2))

You want RMSE to be small. How small is "small"? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.

# unemployment is in the workspace
summary(unemployment)
##        X      male_unemployment female_unemployment   prediction   
##  Min.   : 1   Min.   :2.900     Min.   :4.000       Min.   :3.448  
##  1st Qu.: 4   1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837  
##  Median : 7   Median :6.000     Median :5.200       Median :5.601  
##  Mean   : 7   Mean   :5.954     Mean   :5.569       Mean   :5.569  
##  3rd Qu.:10   3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087  
##  Max.   :13   Max.   :9.800     Max.   :7.900       Max.   :8.240  
##   predictions   
##  Min.   :3.448  
##  1st Qu.:4.837  
##  Median :5.601  
##  Mean   :5.569  
##  3rd Qu.:6.087  
##  Max.   :8.240
# For convenience put the residuals (actual minus predicted) in the variable res
res <- unemployment$female_unemployment - unemployment$predictions

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271

R-Squared

Calculate R-Squared

Now that you've calculated the RMSE of your model's predictions, you will examine how well the model fits the data: that is, how much variance it explains. You can do this using R-squared (R^2).

Suppose y is the true outcome, pred is the prediction from the model, and res = y - pred are the residuals of the predictions.

Then the total sum of squares ("total variance") of the data is:

tss = sum((y - ybar)^2)

where ybar is the mean value of y.

The residual sum of squared errors of the model, rss, is:

rss = sum(res^2)

R^2 (R-Squared), the "variance explained" by the model, is then:

rsq = 1 - rss/tss

After you calculate R^2, you will compare what you computed with the R^2 reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the R^2 of the model on the training data.

# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231

The data frame unemployment is in your workspace, with the predictions column and the residuals vector res that you calculated in the previous exercise.

# Calculate total sum of squares: tss. Print it
(tss <- sum( (unemployment$female_unemployment - fe_mean)^2 ))
## [1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum(res^2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - (rss/tss))
## [1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157

Correlation and R-squared

The linear correlation rho of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively:

the outcomes of a regression model that minimizes squared-error (like linear regression), and
the true outcomes of the training data,

then the square of the correlation, rho^2, is the same as R^2. You will verify that in this exercise.

# unemployment is in your workspace
summary(unemployment)
##        X      male_unemployment female_unemployment   prediction   
##  Min.   : 1   Min.   :2.900     Min.   :4.000       Min.   :3.448  
##  1st Qu.: 4   1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837  
##  Median : 7   Median :6.000     Median :5.200       Median :5.601  
##  Mean   : 7   Mean   :5.954     Mean   :5.569       Mean   :5.569  
##  3rd Qu.:10   3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087  
##  Max.   :13   Max.   :9.800     Max.   :7.900       Max.   :8.240  
##   predictions   
##  Min.   :3.448  
##  1st Qu.:4.837  
##  Median :5.601  
##  Mean   :5.569  
##  3rd Qu.:6.087  
##  Max.   :8.240
# unemployment_model is in the workspace
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

Use cor() to get the correlation between the predictions and female unemployment. Assign it to the variable rho and print it. Make sure you use Pearson correlation (the default). Square rho and assign it to rho2. Print it. Compare rho2 to the R-squared reported by glance() for the model. Is it the same?

# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
## [1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho ^ 2)
## [1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157

Properly Training a Model

Generating a random test/train split

For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a data set dframe of size N, and you want a random subset of approximately X*100% of N (where X is between 0 and 1), then:

Generate a vector of uniform random numbers: gp = runif(N).
dframe[gp < X, ] will be about the right size.
dframe[gp >= X, ] will be the complement.

The data frame mpg is in the workspace.

# mpg is in the workspace
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
dim(mpg)
## [1] 234  11

Use the function nrow to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.

# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
## [1] 234

Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.

# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))
## [1] 176

Use runif() to generate a vector of N uniform random numbers, called gp.

# Create the vector of N uniform random variables: gp
gp <- runif(N)

Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]

Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
## [1] 175
nrow(mpg_test)
## [1] 59

Train a model using test/train split

Now that you have split the mpg dataset into mpg_train and mpg_test, you will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).

The data frame mpg_train is in the workspace.

# mpg_train is in the workspace
summary(mpg_train)
##  manufacturer          model               displ           year     
##  Length:175         Length:175         Min.   :1.60   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.40   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.30   Median :2008  
##                                        Mean   :3.47   Mean   :2004  
##                                        3rd Qu.:4.60   3rd Qu.:2008  
##                                        Max.   :6.50   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:175         Length:175         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :16.00  
##  Mean   :5.886                                         Mean   :16.85  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :33.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:175         Length:175        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :25.00   Mode  :character   Mode  :character  
##  Mean   :23.47                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
# create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
## cty ~ hwy

Create a formula fmla that expresses cty as a function of hwy. Print it.

# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, data = mpg_train)

Train a model mpg_model on mpg_train to predict cty from hwy using fmla and lm(). Use summary() to examine the model.

# Use summary() to examine the model
summary(mpg_model)
## 
## Call:
## lm(formula = fmla, data = mpg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8710 -0.8615 -0.1837  0.7901  4.7830 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.16468    0.40553   2.872  0.00459 ** 
## hwy          0.66825    0.01678  39.834  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.281 on 173 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.9011 
## F-statistic:  1587 on 1 and 173 DF,  p-value: < 2.2e-16

Evaluate a model using test/train split

Now you will test the model mpg_model on the test data, mpg_test. Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience:

rmse(predcol, ycol)
r_squared(predcol, ycol)

where:

predcol: the predicted values
ycol: the actual outcome

You will also plot the predictions vs. the outcome.

Generally, model performance is better on the training data than the test data (though sometimes the test set “gets lucky”). A slight difference in performance is okay; if the performance on training is significantly better, there is a problem.

The data frames mpg_train and mpg_test, and the model mpg_model are in the workspace, along with the functions rmse() and r_squared().

# # Examine the objects in the workspace
# ls.str()

Predict city fuel efficiency from hwy on the mpg_train data. Assign the predictions to the column pred.

# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model)

Predict city fuel efficiency from hwy on the mpg_test data. Assign the predictions to the column pred.

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, newdata = mpg_test)

Use rmse() to evaluate rmse for both the test and training sets. Compare. Are the performances similar?

# RMSE helper: predcol = predicted values, ycol = actual outcome
rmse <- function(predcol, ycol) {
  sqrt(mean((predcol - ycol)^2))
}
# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
## [1] 1.273331
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.180521
# R-squared helper: square of the correlation between predictions and outcome
r_squared <- function(predcol, ycol) {
  cor(predcol, ycol)^2
}
# Evaluate the r-squared on both training and test data.and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
## [1] 0.9016886
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.9437755

Do the same with r_squared(). Are the performances similar? Use ggplot2 to plot the predictions against cty on the test data.

# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()
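
Note that on this split the test-set RMSE and R-squared come out slightly better than the training-set values; with only 59 test rows this is the "test set gets lucky" situation mentioned above, not a sign of a problem.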

Create a cross validation plan

There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)

where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

train: the indices of dframe that will form the training set
app: the indices of dframe that will form the test (or application) set

In this exercise you will create a 3-fold cross-validation plan for the data set mpg.

Load the package vtreat.

# Load the package vtreat
library(vtreat)

# mpg is in the workspace
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

Get the number of rows in mpg and assign it to the variable nRows.

# Get the number of rows in mpg
nRows <- nrow(mpg)

Call kWayCrossValidation to create a 3-fold cross validation plan and assign it to the variable splitPlan. You can set the last two arguments of the function to NULL. Call str() to examine the structure of splitPlan.

# Implement the 3-fold cross-validation plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)

# Examine the split plan
str(splitPlan)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 3 6 8 9 11 13 15 16 ...
##   ..$ app  : int [1:78] 209 141 175 128 32 117 38 185 140 85 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 4 5 6 7 9 10 12 14 ...
##   ..$ app  : int [1:78] 195 19 129 21 8 183 50 57 230 142 ...
##  $ :List of 2
##   ..$ train: int [1:156] 2 4 5 7 8 10 11 12 13 14 ...
##   ..$ app  : int [1:78] 154 224 3 234 111 102 197 25 173 56 ...
##  - attr(*, "splitmethod")= chr "kwaycross"

Evaluate a modeling procedure using n-fold cross-validation

In this exercise you will use splitPlan, the 3-fold cross validation plan from the previous exercise, to make predictions from a model that predicts mpg$cty from mpg$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

# Initialize a column of the appropriate length
dframe$pred.cv <- 0

# k is the number of folds
# splitPlan is the cross validation plan

for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data from this split (lm, in this case)
  model <- lm(fmla, data = dframe[split$train, ])

  # Make predictions on the application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app, ])
}

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.

The data frame mpg, the cross validation plan splitPlan, and the function to calculate RMSE (rmse()) from one of the previous exercises are available in your workspace.

# mpg is in the workspace
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
# splitPlan is in the workspace
str(splitPlan)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 3 6 8 9 11 13 15 16 ...
##   ..$ app  : int [1:78] 209 141 175 128 32 117 38 185 140 85 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 4 5 6 7 9 10 12 14 ...
##   ..$ app  : int [1:78] 195 19 129 21 8 183 50 57 230 142 ...
##  $ :List of 2
##   ..$ train: int [1:156] 2 4 5 7 8 10 11 12 13 14 ...
##   ..$ app  : int [1:78] 154 224 3 234 111 102 197 25 173 56 ...
##  - attr(*, "splitmethod")= chr "kwaycross"

Run the 3-fold cross validation plan from splitPlan and put the predictions in the column mpg$pred.cv. Use lm() and the formula cty ~ hwy. Create a linear regression model on all the mpg data (formula cty ~ hwy) and assign the predictions to mpg$pred. Use rmse() to get the root mean squared error of the predictions from the full model (mpg$pred). Recall that rmse() takes two arguments, the predicted values, and the actual outcome.

# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}
# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045

Get the root mean squared error of the cross-validation predictions. Are the two values about the same?

# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.258478

Issues to Consider

Before moving on to more sophisticated regression techniques, we will look at some other modeling issues: modeling with categorical inputs, interactions between variables, and when you might consider transforming inputs and outputs before modeling. While more sophisticated regression techniques manage some of these issues automatically, it’s important to be aware of them, in order to understand which methods best handle various issues – and which issues you must still manage yourself.

Categorical inputs

Examining the structure of categorical inputs

For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace. It has the following columns:

Flowers: the average number of flowers on a meadowfoam plant
Intensity: the intensity of a light treatment applied to the plant
Time: a categorical variable indicating when (Late or Early) in the lifecycle the light treatment occurred

The ultimate goal is to predict Flowers as a function of Time and Intensity.

The data frame flowers is in your workspace.

Call the str() function on flowers to see the types of each column.

flowers <- read.delim("DATABASE/flowers.txt", sep = ",")
head(flowers)
##   X Flowers Time Intensity
## 1 1    62.3 Late       150
## 2 2    77.4 Late       150
## 3 3    55.3 Late       300
## 4 4    54.2 Late       300
## 5 5    49.6 Late       450
## 6 6    61.9 Late       450
# Call str on flowers to see the types of each column
str(flowers)
## 'data.frame':    24 obs. of  4 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
##  $ Time     : Factor w/ 2 levels "Early","Late": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...
flowers
##     X Flowers  Time Intensity
## 1   1    62.3  Late       150
## 2   2    77.4  Late       150
## 3   3    55.3  Late       300
## 4   4    54.2  Late       300
## 5   5    49.6  Late       450
## 6   6    61.9  Late       450
## 7   7    39.4  Late       600
## 8   8    45.7  Late       600
## 9   9    31.3  Late       750
## 10 10    44.9  Late       750
## 11 11    36.8  Late       900
## 12 12    41.9  Late       900
## 13 13    77.8 Early       150
## 14 14    75.6 Early       150
## 15 15    69.1 Early       300
## 16 16    78.0 Early       300
## 17 17    57.0 Early       450
## 18 18    71.1 Early       450
## 19 19    62.9 Early       600
## 20 20    52.2 Early       600
## 21 21    60.3 Early       750
## 22 22    45.6 Early       750
## 23 23    52.6 Early       900
## 24 24    44.4 Early       900

Use the unique() function on the column flowers$Time to see the possible values that Time takes. How many unique values are there?

# Use unique() to see how many possible values Time takes
unique(flowers$Time)
## [1] Late  Early
## Levels: Early Late

Create a formula to express Flowers as a function of Intensity and Time. Assign it to the variable fmla and print it.

# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
## Flowers ~ Intensity + Time

Use fmla and model.matrix() to create the model matrix for the data frame flowers. Assign it to the variable mmat.

# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, flowers)

Use head() to examine the first 20 lines of flowers. Now examine the first 20 lines of mmat.

# Examine the first 20 lines of flowers
head(flowers,20)
##     X Flowers  Time Intensity
## 1   1    62.3  Late       150
## 2   2    77.4  Late       150
## 3   3    55.3  Late       300
## 4   4    54.2  Late       300
## 5   5    49.6  Late       450
## 6   6    61.9  Late       450
## 7   7    39.4  Late       600
## 8   8    45.7  Late       600
## 9   9    31.3  Late       750
## 10 10    44.9  Late       750
## 11 11    36.8  Late       900
## 12 12    41.9  Late       900
## 13 13    77.8 Early       150
## 14 14    75.6 Early       150
## 15 15    69.1 Early       300
## 16 16    78.0 Early       300
## 17 17    57.0 Early       450
## 18 18    71.1 Early       450
## 19 19    62.9 Early       600
## 20 20    52.2 Early       600

Is the numeric column Intensity different? What happened to the categorical column Time from flowers? How is Time == 'Early' represented? And Time == 'Late'?

# Examine the first 20 lines of mmat
head(mmat,20)
##    (Intercept) Intensity TimeLate
## 1            1       150        1
## 2            1       150        1
## 3            1       300        1
## 4            1       300        1
## 5            1       450        1
## 6            1       450        1
## 7            1       600        1
## 8            1       600        1
## 9            1       750        1
## 10           1       750        1
## 11           1       900        1
## 12           1       900        1
## 13           1       150        0
## 14           1       150        0
## 15           1       300        0
## 16           1       300        0
## 17           1       450        0
## 18           1       450        0
## 19           1       600        0
## 20           1       600        0

Modeling with categorical inputs

For this exercise you will fit a linear model to the flowers data, to predict Flowers as a function of Time and Intensity.

The model formula fmla that you created in the previous exercise is still in your workspace, as is the model matrix mmat.

Use fmla and lm to train a linear model that predicts Flowers from Intensity and Time. Assign the model to the variable flower_model.

# flowers is in the workspace
str(flowers)
## 'data.frame':    24 obs. of  4 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
##  $ Time     : Factor w/ 2 levels "Early","Late": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...
# fmla is in the workspace
fmla
## Flowers ~ Intensity + Time
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, flowers)

Use summary() to remind yourself of the structure of mmat.

# Use summary on mmat to remind yourself of its structure
summary(mmat)
##   (Intercept)   Intensity      TimeLate  
##  Min.   :1    Min.   :150   Min.   :0.0  
##  1st Qu.:1    1st Qu.:300   1st Qu.:0.0  
##  Median :1    Median :525   Median :0.5  
##  Mean   :1    Mean   :525   Mean   :0.5  
##  3rd Qu.:1    3rd Qu.:750   3rd Qu.:1.0  
##  Max.   :1    Max.   :900   Max.   :1.0

Use summary() to examine the flower_model. Do the variables match what you saw in mmat? Use flower_model to predict the number of flowers. Add the predictions to flowers as the column predictions.

# Use summary to examine flower_model 
summary(flower_model)
## 
## Call:
## lm(formula = fmla, data = flowers)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.652 -4.139 -1.558  5.632 12.165 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.464167   3.273772  25.495  < 2e-16 ***
## Intensity    -0.040471   0.005132  -7.886 1.04e-07 ***
## TimeLate    -12.158333   2.629557  -4.624 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:   0.78 
## F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08

Fill in the blanks to plot predictions vs. actual flowers (predictions on the x-axis).

# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model, flowers)

# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue") 

Interactions

Modeling an interaction

In this exercise you will use interactions to model the effect of gender and gastric activity on alcohol metabolism.

The data frame alcohol has columns:

Metabol: the alcohol metabolism rate
Gastric: the rate of gastric alcohol dehydrogenase activity
Sex: the sex of the drinker (Male or Female)

In the video, we fit three models to the alcohol data:

one with only additive (main effect) terms: Metabol ~ Gastric + Sex
two models, each with interactions between gastric activity and sex

We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model.

Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.

x*y = x + y + x:y
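
As a quick check of this expansion (an illustration only, not part of the exercise), you can inspect the terms of a formula directly:

# Not part of the exercise: the terms of x*y are the main effects plus the interaction
attr(terms(~ x * y), "term.labels")
## [1] "x"   "y"   "x:y"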

The data frame alcohol is in your workspace.

Write a formula that expresses Metabol as a function of Gastric and Sex with no interactions.

alcohol <- read.delim("DATABASE/alcohol.txt", sep = ",")
head(alcohol)
##   X Subject Metabol Gastric    Sex       Alcohol
## 1 1       1     0.6     1.0 Female     Alcoholic
## 2 2       2     0.6     1.6 Female     Alcoholic
## 3 3       3     1.5     1.5 Female     Alcoholic
## 4 4       4     0.4     2.2 Female Non-alcoholic
## 5 5       5     0.1     1.1 Female Non-alcoholic
## 6 6       6     0.2     1.2 Female Non-alcoholic
# alcohol is in the workspace
summary(alcohol)
##        X            Subject         Metabol          Gastric          Sex    
##  Min.   : 1.00   Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18  
##  1st Qu.: 8.75   1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14  
##  Median :16.50   Median :16.50   Median : 1.700   Median :1.600              
##  Mean   :16.50   Mean   :16.50   Mean   : 2.422   Mean   :1.863              
##  3rd Qu.:24.25   3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200              
##  Max.   :32.00   Max.   :32.00   Max.   :12.300   Max.   :5.200              
##           Alcohol  
##  Alcoholic    : 8  
##  Non-alcoholic:24  
##                    
##                    
##                    
## 

Assign the formula to the variable fmla_add and print it. Write a formula that expresses Metabol as a function of the interaction between Gastric and Sex.

# Create the formula with main effects only
(fmla_add <- as.formula('Metabol~Gastric+Sex')  )
## Metabol ~ Gastric + Sex

Add Gastric as a main effect, but not Sex.

# Create the formula with interactions
(fmla_interaction <- as.formula('Metabol~Gastric+Gastric:Sex')  )
## Metabol ~ Gastric + Gastric:Sex

Assign the formula to the variable fmla_interaction and print it.

# Fit the main effects only model
model_add <- lm(fmla_add,alcohol)

# Fit the interaction model
model_interaction <- lm(fmla_interaction, alcohol)

# Call summary on both models and compare
summary(model_add)
## 
## Call:
## lm(formula = fmla_add, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2779 -0.6328 -0.0966  0.5783  4.5703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.9466     0.5198  -3.745 0.000796 ***
## Gastric       1.9656     0.2674   7.352 4.24e-08 ***
## SexMale       1.6174     0.5114   3.163 0.003649 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.331 on 29 degrees of freedom
## Multiple R-squared:  0.7654, Adjusted R-squared:  0.7492 
## F-statistic: 47.31 on 2 and 29 DF,  p-value: 7.41e-10

Fit a linear model with only main effects: model_add to the data. Fit a linear model with the interaction: model_interaction to the data. Call summary() on both models. Which has a better R-squared?

summary(model_interaction)
## 
## Call:
## lm(formula = fmla_interaction, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4656 -0.5091  0.0143  0.5660  4.0668 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.7504     0.5310  -1.413 0.168236    
## Gastric           1.1489     0.3450   3.331 0.002372 ** 
## Gastric:SexMale   1.0422     0.2412   4.321 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.204 on 29 degrees of freedom
## Multiple R-squared:  0.8081, Adjusted R-squared:  0.7948 
## F-statistic: 61.05 on 2 and 29 DF,  p-value: 4.033e-11

Modeling an interaction (2)

In this exercise, you will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.

You will begin to use the dplyr package to do calculations.

mutate() adds new columns to a tbl (a type of data frame)
group_by() specifies how rows are grouped in a tbl
summarize() computes summary statistics of a column

You will also use tidyr’s gather() which takes multiple columns and collapses them into key-value pairs.
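
As a small illustration (a toy data frame, not part of the exercise), gather() reshapes two prediction columns into key-value pairs:

library(tidyr)

toy <- data.frame(y = c(1, 2), pred_a = c(1.1, 1.9), pred_b = c(0.9, 2.2))
gather(toy, key = modeltype, value = pred, pred_a, pred_b)
##   y modeltype pred
## 1 1    pred_a  1.1
## 2 2    pred_a  1.9
## 3 1    pred_b  0.9
## 4 2    pred_b  2.2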

The data frame alcohol and the formulas fmla_add and fmla_interaction are in the workspace.

# alcohol is in the workspace
summary(alcohol)
##        X            Subject         Metabol          Gastric          Sex    
##  Min.   : 1.00   Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18  
##  1st Qu.: 8.75   1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14  
##  Median :16.50   Median :16.50   Median : 1.700   Median :1.600              
##  Mean   :16.50   Mean   :16.50   Mean   : 2.422   Mean   :1.863              
##  3rd Qu.:24.25   3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200              
##  Max.   :32.00   Max.   :32.00   Max.   :12.300   Max.   :5.200              
##           Alcohol  
##  Alcoholic    : 8  
##  Non-alcoholic:24  
##                    
##                    
##                    
## 
# Both the formulae are in the workspace
fmla_add
## Metabol ~ Gastric + Sex
fmla_interaction
## Metabol ~ Gastric + Gastric:Sex

Use kWayCrossValidation() to create a splitting plan for a 3-fold cross validation. The first argument is the number of rows to be split.

# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)

The second argument is the number of folds for the cross-validation. You can set the 3rd and 4th arguments of the function to NULL.
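
Each element of splitPlan describes one fold. A sketch of its structure (not run here), based on how the loops below use it:

# split <- splitPlan[[1]]
# split$train   # row indices used to fit the model for this fold
# split$app     # row indices the fitted model is applied to (the hold-out rows)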

Examine and run the sample code to get the 3-fold cross-validation predictions of a model with no interactions and assign them to the column pred_add.

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

Get the 3-fold cross-validation predictions of the model with interactions. Assign the predictions to the column pred_interaction. The sample code shows you the procedure.

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}

Use the same splitPlan that you already created. Fill in the blanks to gather the predictions into a single column pred, add a column of residuals (actual outcome minus predicted outcome), and get the RMSE of the cross-validation predictions for each model type. Compare the RMSEs. Based on these results, which model should you use?

# Get RMSE
alcohol %>% 
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol - pred) %>%
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 x 2
##   modeltype         rmse
## * <chr>            <dbl>
## 1 pred_add          1.38
## 2 pred_interaction  1.30

Transforming the response before modeling

Relative error

In this exercise, you will compare relative error to absolute error. For the purposes of modeling, we will define relative error as

rel = (pred - y) / y

that is, the error is relative to the true outcome. You will measure the overall relative error of a model using root mean squared relative error:

rmse_rel = sqrt(mean(rel^2))

where mean(rel^2) is the mean of the squared relative error.
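
A toy illustration of these two definitions (hypothetical values, not taken from fdata):

y_ex    <- c(10, 200, 5)            # true outcomes
pred_ex <- c(12, 190, 4)            # predictions
rel_ex  <- (pred_ex - y_ex) / y_ex  # relative errors: 0.2, -0.05, -0.2
sqrt(mean(rel_ex^2))                # root mean squared relative error
## [1] 0.1658312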

The example (toy) dataset fdata is loaded in your workspace. It includes the columns:

y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.
pred: the predictions of a model that predicts y.
label: categorical; whether y comes from a population that makes small purchases, or large ones.

You want to know which model does “better”: the one predicting the small purchases, or the one predicting large ones.

The data frame fdata is in the workspace.

fdata <- read.delim("DATABASE/fdata.txt", sep = ",")
head(fdata)
##   X         y     pred           label
## 1 1  9.149694 6.430583 small purchases
## 2 2  1.902521 3.473332 small purchases
## 3 3 -3.859826 1.594509 small purchases
## 4 4  2.388970 3.764175 small purchases
## 5 5  1.541572 9.509294 small purchases
## 6 6 13.561878 6.931725 small purchases
# fdata is in the workspace
summary(fdata)
##        X                y                 pred                      label   
##  Min.   :  1.00   Min.   :  -5.894   Min.   :   1.072   large purchases:50  
##  1st Qu.: 25.75   1st Qu.:   5.407   1st Qu.:   6.373   small purchases:50  
##  Median : 50.50   Median :  57.374   Median :  55.693                       
##  Mean   : 50.50   Mean   : 306.204   Mean   : 305.905                       
##  3rd Qu.: 75.25   3rd Qu.: 550.903   3rd Qu.: 547.886                       
##  Max.   :100.00   Max.   :1101.619   Max.   :1098.896
fdata
##       X            y        pred           label
## 1     1    9.1496941    6.430583 small purchases
## 2     2    1.9025206    3.473332 small purchases
## 3     3   -3.8598260    1.594509 small purchases
## 4     4    2.3889703    3.764175 small purchases
## 5     5    1.5415715    9.509294 small purchases
## 6     6   13.5618785    6.931725 small purchases
## 7     7   10.1987859    8.191798 small purchases
## 8     8    1.1044627    1.514578 small purchases
## 9     9    3.9354788    8.986364 small purchases
## 10   10    9.0407978    6.149792 small purchases
## 11   11    1.7276491    8.498107 small purchases
## 12   12   15.7238410   10.941225 small purchases
## 13   13    2.2578444    6.003761 small purchases
## 14   14   -1.9750566    1.071922 small purchases
## 15   15    1.1006461    4.420715 small purchases
## 16   16   18.6282945   10.518787 small purchases
## 17   17    3.6786547    5.746765 small purchases
## 18   18    3.0863171    7.092900 small purchases
## 19   19    8.6913910    7.837105 small purchases
## 20   20    7.9086741    4.307022 small purchases
## 21   21    5.4401589    6.001545 small purchases
## 22   22   14.7850358    8.309529 small purchases
## 23   23    9.0220444    8.625899 small purchases
## 24   24    3.9766558    2.978443 small purchases
## 25   25    2.6675677    4.040108 small purchases
## 26   26    7.6815442    7.460930 small purchases
## 27   27   11.9348659    9.082721 small purchases
## 28   28    5.3087341    6.523833 small purchases
## 29   29   13.0587206   10.342933 small purchases
## 30   30    2.2285365    4.240767 small purchases
## 31   31   15.4031317    8.833006 small purchases
## 32   32   -0.8775657    1.165411 small purchases
## 33   33    7.6111939    4.743669 small purchases
## 34   34    9.8625104    8.426791 small purchases
## 35   35    4.3625009   10.191528 small purchases
## 36   36    3.8418327    4.325049 small purchases
## 37   37   11.3425371    6.389882 small purchases
## 38   38   17.1292595   10.997543 small purchases
## 39   39   16.1650339    7.524615 small purchases
## 40   40   -5.8934988    2.448673 small purchases
## 41   41   12.6387974    9.514229 small purchases
## 42   42    6.4500322    5.211945 small purchases
## 43   43    2.9682093    3.973336 small purchases
## 44   44    4.0760785    6.321008 small purchases
## 45   45    5.5235922   10.166002 small purchases
## 46   46    4.8320823    3.847532 small purchases
## 47   47    6.7230129    6.304504 small purchases
## 48   48    1.8421342    3.647446 small purchases
## 49   49    3.1992212    2.518993 small purchases
## 50   50   10.8161739    8.023910 small purchases
## 51   51 1026.4000266 1027.192723 large purchases
## 52   52  202.3892007  194.521504 large purchases
## 53   53  833.3530129  826.245477 large purchases
## 54   54 1075.4121192 1081.438809 large purchases
## 55   55   96.1198136  100.388039 large purchases
## 56   56  438.2357889  430.291572 large purchases
## 57   57  911.3332595  912.567621 large purchases
## 58   58  542.5643273  533.946165 large purchases
## 59   59  686.3323692  691.793193 large purchases
## 60   60  494.4655725  498.008558 large purchases
## 61   61  422.8093644  423.202790 large purchases
## 62   62 1033.8816216 1032.662338 large purchases
## 63   63  161.9940611  168.545765 large purchases
## 64   64  491.4299932  492.239389 large purchases
## 65   65  575.9199260  589.705853 large purchases
## 66   66  384.7702130  377.148966 large purchases
## 67   67  720.0291839  730.110724 large purchases
## 68   68  963.9350562  967.650520 large purchases
## 69   69  159.7865024  159.207010 large purchases
## 70   70  765.4002297  767.185617 large purchases
## 71   71  246.4217424  250.132558 large purchases
## 72   72 1097.9219860 1098.895762 large purchases
## 73   73 1050.4012681 1048.867067 large purchases
## 74   74 1069.6224565 1057.992036 large purchases
## 75   75  116.8100415  119.282157 large purchases
## 76   76  523.5162673  524.564769 large purchases
## 77   77  457.7430632  459.769560 large purchases
## 78   78 1060.5638650 1053.648585 large purchases
## 79   79  761.9174356  751.113173 large purchases
## 80   80  969.3237051  966.379570 large purchases
## 81   81  522.6803675  520.614066 large purchases
## 82   82  475.8733078  467.724627 large purchases
## 83   83  368.5448093  364.211669 large purchases
## 84   84 1101.6186402 1097.071859 large purchases
## 85   85 1052.9226814 1054.626093 large purchases
## 86   86  663.0351321  664.365996 large purchases
## 87   87  136.6886558  137.485828 large purchases
## 88   88  331.8120748  326.087523 large purchases
## 89   89  921.9615947  929.965382 large purchases
## 90   90  773.5812495  772.131253 large purchases
## 91   91  458.3325781  456.426629 large purchases
## 92   92  643.4683247  646.674567 large purchases
## 93   93  738.6477366  747.002823 large purchases
## 94   94  846.4213303  842.529168 large purchases
## 95   95  413.6645906  411.887935 large purchases
## 96   96  180.9942090  175.770888 large purchases
## 97   97  695.6228990  687.864916 large purchases
## 98   98  164.8980048  165.811990 large purchases
## 99   99  106.5062642  108.484750 large purchases
## 100 100  358.3557044  363.866527 large purchases
# Examine the data: generate the summaries for the groups large and small:
fdata %>% 
    group_by(label) %>%     # group by small/large purchases
    summarize(min  = min(y),   # min of y
              mean = mean(y),   # mean of y
              max  = max(y))  # max of y
## # A tibble: 2 x 4
##   label             min   mean    max
## * <fct>           <dbl>  <dbl>  <dbl>
## 1 large purchases 96.1  606.   1102. 
## 2 small purchases -5.89   6.48   18.6

Fill in the blanks to examine the data. Notice that large purchases tend to be about 100 times larger than small ones. Fill in the blanks to create error columns:

# Fill in the blanks to add error columns
fdata2 <- fdata %>% 
         group_by(label) %>%       # group by label
           mutate(residual = pred-y,  # Residual
                  relerr   = residual/y)  # Relative error

Define residual as pred - y, as in the code above. Define relative error as residual / y.

# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>% 
  group_by(label) %>% 
  summarize(rmse     = sqrt(mean((pred-y)^2)),   # RMSE
            rmse.rel = sqrt(mean(((pred-y)/y)^2)))    # Root mean squared relative error
## # A tibble: 2 x 3
##   label            rmse rmse.rel
## * <fct>           <dbl>    <dbl>
## 1 large purchases  5.54   0.0147
## 2 small purchases  4.01   1.25

Fill in the blanks to calculate and compare RMSE and relative RMSE. How do the absolute errors compare? The relative errors? Examine the plot of predictions versus outcome. In your opinion, which model does “better”?

# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) + 
  geom_point() + 
  geom_abline() + 
  facet_wrap(~ label, ncol = 1, scales = "free") + 
  ggtitle("Outcome vs prediction")

Modeling log-transformed monetary output

In this exercise, you will practice modeling on log-transformed monetary output, and then transforming the “log-money” predictions back into monetary units. The data loaded into your workspace records subjects’ incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:

Arith
Word
Parag
Math
AFQT (Percentile on the Armed Forces Qualifying Test)

The data have already been split into training and test sets (income_train and income_test respectively) and are in the workspace. You will build a model of log(income) from the inputs, and then convert log(income) back into income.

Call summary() on income_train$Income2005 to see the summary statistics of income in the training set.

income_train <- read.delim("DATABASE/income_train.txt", sep = ",")
head(income_train)
##   X Subject Arith Word Parag Math   AFQT Income2005
## 1 1       2     8   15     6    6  6.841       5500
## 2 2       6    30   35    15   23 99.393      65000
## 3 4       8    13   35    12    4 44.022      36000
## 4 5       9    21   28    10   13 59.683      65000
## 5 7      16    17   30    12   17 50.283      71000
## 6 8      17    29   33    13   21 89.669      43000
# Examine Income2005 in the training set
summary(income_train$Income2005)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      63   23000   39000   49894   61500  703637

Write a formula to express log(Income2005) as a function of the five tests as the variable fmla.log. Print it.

# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005)~Arith+Word+Parag+Math+AFQT)
## log(Income2005) ~ Arith + Word + Parag + Math + AFQT

Fit a linear model of log(Income2005) to the income_train data: model.log.

# Fit the linear model
model.log <-  lm(fmla.log,data=income_train)

Use model.log to predict income on the income_test dataset. Put it in the column logpred.

# 
# income_test <- read.delim("DATABASE/income_test.txt", sep = ",")
# # Make predictions on income_test
# income_test$logpred <- predict(model.log,newdata=income_test)
# summary(income_test$logpred)

Check summary() of logpred to see that the magnitudes are much different from those of Income2005. Reverse the log transformation to put the predictions into “monetary units”: exp(income_test$logpred). Check summary() of pred.income and see that the magnitudes are now similar to Income2005 magnitudes. Fill in the blanks to plot a scatter plot of predicted income vs income on the test set.
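
A sketch of these remaining steps, kept commented out like the prediction code above (it assumes income_test was loaded as sketched there):

# income_test$logpred     <- predict(model.log, newdata = income_test)  # log-scale predictions
# income_test$pred.income <- exp(income_test$logpred)                   # back-transform to dollars
# summary(income_test$pred.income)
# ggplot(income_test, aes(x = pred.income, y = Income2005)) +
#   geom_point() +
#   geom_abline(color = "blue")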

Transforming inputs before modeling

Input transforms: the “hockey stick”

In this exercise, we will build a model to predict price from a measure of the house’s size (surface area). The data set houseprice has the columns:

price : house price in units of $1000 size: surface area

A scatterplot of the data shows that the relationship is quite non-linear: a sort of “hockey stick”, where price is fairly flat for smaller houses but rises steeply as the house gets larger. Quadratics and cubics are often good functional forms to express hockey-stick-like relationships. Note that there may not be a “physical” reason that price is related to the square of the size; a quadratic is simply a closed-form approximation of the observed relationship.

You will fit a model to predict price as a function of the squared size, and look at its fit on the training data.

Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.

exampleFormula = y ~ I(x^2)
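
A quick check (no data needed, not part of the exercise) of why I() is necessary:

# Without I(), ^ is read as a formula operator: x^2 collapses to just x
attr(terms(y ~ x^2), "term.labels")
## [1] "x"
attr(terms(y ~ I(x^2)), "term.labels")
## [1] "I(x^2)"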

The data set houseprice is in the workspace.

houseprice <- read.delim("DATABASE/houseprice.txt", sep = ",")
head(houseprice)
##   X size price
## 1 1   72   156
## 2 2   98   153
## 3 3   92   230
## 4 4   90   152
## 5 5   44    42
## 6 6   46   157

Write a formula, fmla_sqr, to express price as a function of squared size. Print it. Fit a model model_sqr to the data using fmla_sqr

# houseprice is in the workspace
summary(houseprice)
##        X              size           price      
##  Min.   : 1.00   Min.   : 44.0   Min.   : 42.0  
##  1st Qu.:10.75   1st Qu.: 73.5   1st Qu.:164.5  
##  Median :20.50   Median : 91.0   Median :203.5  
##  Mean   :20.50   Mean   : 94.3   Mean   :249.2  
##  3rd Qu.:30.25   3rd Qu.:118.5   3rd Qu.:287.8  
##  Max.   :40.00   Max.   :150.0   Max.   :573.0
houseprice
##     X size price
## 1   1   72   156
## 2   2   98   153
## 3   3   92   230
## 4   4   90   152
## 5   5   44    42
## 6   6   46   157
## 7   7   90   113
## 8   8  150   573
## 9   9   94   202
## 10 10   90   261
## 11 11   90   175
## 12 12   66   212
## 13 13  142   486
## 14 14   74   109
## 15 15   86   220
## 16 16   46   186
## 17 17   54   133
## 18 18  130   360
## 19 19  122   283
## 20 20  118   380
## 21 21  100   185
## 22 22   74   186
## 23 23  146   459
## 24 24   92   167
## 25 25  100   171
## 26 26  140   547
## 27 27   94   170
## 28 28   90   286
## 29 29  120   293
## 30 30   70   109
## 31 31  100   205
## 32 32  132   514
## 33 33   58   175
## 34 34   92   249
## 35 35   76   234
## 36 36   90   242
## 37 37   66   177
## 38 38  134   399
## 39 39  140   511
## 40 40   64   107
# Create the formula for price as a function of squared size
(fmla_sqr <- price~I(size^2))
## price ~ I(size^2)

For comparison, fit a linear model model_lin to the data using the formula price ~ size. Fill in the blanks to make predictions on the training data from the two models.

# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data = houseprice)

# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, data = houseprice)

Gather the predictions into a single column pred. Graphically compare the predictions of the two models to the data. Which fits better?

# Make predictions and compare
houseprice %>% 
    mutate(pred_lin = predict(model_lin),       # predictions from linear model
           pred_sqr = predict(model_sqr)) %>%   # predictions from quadratic model 
    gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
    ggplot(aes(x = size)) + 
       geom_point(aes(y = price)) +                   # actual prices
       geom_line(aes(y = pred, color = modeltype)) + # the predictions
       scale_color_brewer(palette = "Dark2")

Dealing with Non-Linear Responses

Logistic regression to predict probabilities

Fit a model of sparrow survival probability

In this exercise, you will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow. The dataset sparrow is loaded into your workspace. The outcome to be predicted is status (“Survived”, “Perished”). The variables we will consider are:

total_length: length of the bird from tip of beak to tip of tail (mm)
weight: in grams
humerus: length of humerus (the “upper arm bone” that connects the wing to the body) (inches)

Remember that when using glm() to create a logistic regression model, you must explicitly specify that family = binomial:

glm(formula, data = data, family = binomial)

You will call summary() and broom::glance() to see different functions for examining a logistic regression model. One of the diagnostics that you will look at is the analog to R-squared, called pseudo-R-squared.

You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. The pseudo-R-squared is analogous to R-squared for standard regression: R-squared is a measure of the “variance explained” of a regression model, and the pseudo-R-squared is a measure of the “deviance explained”.
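
In code, the pseudo-R-squared is one minus the ratio of residual deviance to null deviance; a sketch for a fitted glm object (the exercise below computes the same quantity from glance() output):

# pseudoR2 <- 1 - model$deviance / model$null.deviance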

The data frame sparrow and the package broom are loaded in the workspace.

sparrow <- read.delim("DATABASE/sparrow.txt", sep = ",")
head(sparrow)
##   X   status   age total_length wingspan weight beak_head humerus femur legbone
## 1 1 Survived adult          154      241   24.5      31.2    0.69  0.67    1.02
## 2 2 Survived adult          160      252   26.9      30.8    0.74  0.71    1.18
## 3 3 Survived adult          155      243   26.9      30.6    0.73  0.70    1.15
## 4 4 Survived adult          154      245   24.3      31.7    0.74  0.69    1.15
## 5 5 Survived adult          156      247   24.1      31.5    0.71  0.71    1.13
## 6 6 Survived adult          161      253   26.5      31.8    0.78  0.74    1.14
##   skull sternum
## 1  0.59    0.83
## 2  0.60    0.84
## 3  0.60    0.85
## 4  0.58    0.84
## 5  0.57    0.82
## 6  0.61    0.89
# sparrow is in the workspace
summary(sparrow)
##        X             status         age      total_length      wingspan    
##  Min.   : 1.0   Perished:36   adult   :59   Min.   :153.0   Min.   :236.0  
##  1st Qu.:22.5   Survived:51   juvenile:28   1st Qu.:158.0   1st Qu.:245.0  
##  Median :44.0                               Median :160.0   Median :247.0  
##  Mean   :44.0                               Mean   :160.4   Mean   :247.5  
##  3rd Qu.:65.5                               3rd Qu.:162.5   3rd Qu.:251.0  
##  Max.   :87.0                               Max.   :167.0   Max.   :256.0  
##      weight       beak_head        humerus           femur       
##  Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
##  1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
##  Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
##  Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
##  3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
##  Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
##     legbone          skull           sternum      
##  Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
##  1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
##  Median :1.130   Median :0.6000   Median :0.8500  
##  Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
##  3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
##  Max.   :1.230   Max.   :0.6400   Max.   :0.9300
# Create the survived column
sparrow$survived <- sparrow$status=="Survived"

As suggested in the video, you will predict on the outcomes TRUE and FALSE. Create a new column survived in the sparrow data frame that is TRUE when status == “Survived”. Create the formula fmla that expresses survived as a function of the variables of interest. Print it.

# Create the formula
(fmla <- survived~total_length+weight+humerus)
## survived ~ total_length + weight + humerus

Fit a logistic regression model to predict the probability of sparrow survival. Assign the model to the variable sparrow_model.

# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = "binomial")

# Call summary
summary(sparrow_model)
## 
## Call:
## glm(formula = fmla, family = "binomial", data = sparrow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1117  -0.6026   0.2871   0.6577   1.7082  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
## total_length  -0.5435     0.1409  -3.858 0.000115 ***
## weight        -0.5689     0.2771  -2.053 0.040060 *  
## humerus       75.4610    19.1586   3.939 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  75.094  on 83  degrees of freedom
## AIC: 83.094
## 
## Number of Fisher Scoring iterations: 5

Call summary() to see the coefficients of the model, the deviance and the null deviance.

# Call glance
(perf <- glance(sparrow_model))
## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          118.      86  -37.5  83.1  93.0     75.1          83    87

Call glance() on the model to see the deviances and other diagnostics in a data frame. Assign the output from glance() to the variable perf. Calculate the pseudo-R-squared.

# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
## [1] 0.3636526

Predict sparrow survival

In this exercise you will predict the probability of survival using the sparrow survival model from the previous exercise.

Recall that when calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:

predict(model, type = "response")

Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.
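
For reference, a sketch (not run) of recovering probabilities from the default log-odds output:

# logodds <- predict(sparrow_model)       # default is type = "link": log-odds
# prob    <- 1 / (1 + exp(-logodds))      # logistic transform, same as type = "response"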

You will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model’s gain curve is close to the ideal (“wizard”) gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:

frame: data frame with prediction column and ground truth column
xvar: the name of the column of predictions (as a string)
truthVar: the name of the column with actual outcome (as a string)
title: a title for the plot (as a string)

GainCurvePlot(frame, xvar, truthVar, title)

The dataframe sparrow and the model sparrow_model are in the workspace.

# sparrow is in the workspace
summary(sparrow)
##        X             status         age      total_length      wingspan    
##  Min.   : 1.0   Perished:36   adult   :59   Min.   :153.0   Min.   :236.0  
##  1st Qu.:22.5   Survived:51   juvenile:28   1st Qu.:158.0   1st Qu.:245.0  
##  Median :44.0                               Median :160.0   Median :247.0  
##  Mean   :44.0                               Mean   :160.4   Mean   :247.5  
##  3rd Qu.:65.5                               3rd Qu.:162.5   3rd Qu.:251.0  
##  Max.   :87.0                               Max.   :167.0   Max.   :256.0  
##      weight       beak_head        humerus           femur       
##  Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
##  1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
##  Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
##  Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
##  3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
##  Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
##     legbone          skull           sternum        survived      
##  Min.   :1.010   Min.   :0.5600   Min.   :0.7700   Mode :logical  
##  1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300   FALSE:36       
##  Median :1.130   Median :0.6000   Median :0.8500   TRUE :51       
##  Mean   :1.131   Mean   :0.6032   Mean   :0.8511                  
##  3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800                  
##  Max.   :1.230   Max.   :0.6400   Max.   :0.9300
# sparrow_model is in the workspace
summary(sparrow_model)
## 
## Call:
## glm(formula = fmla, family = "binomial", data = sparrow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1117  -0.6026   0.2871   0.6577   1.7082  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
## total_length  -0.5435     0.1409  -3.858 0.000115 ***
## weight        -0.5689     0.2771  -2.053 0.040060 *  
## humerus       75.4610    19.1586   3.939 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  75.094  on 83  degrees of freedom
## AIC: 83.094
## 
## Number of Fisher Scoring iterations: 5

Create a new column in sparrow called pred that contains the predictions on the training data. Call GainCurvePlot() to create the gain curve of predictions. Does the model do a good job of sorting the sparrows by whether or not they actually survived?

# Make predictions
sparrow$pred <- predict(sparrow_model, type = "response")

# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")

Poisson and quasipoisson regression to predict counts

Fit a model to predict bike rental counts

In this exercise you will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

The data frame has the columns:

cnt: the number of bikes rented in that hour (the outcome)
hr: the hour of the day (0-23, as a factor)
holiday: TRUE/FALSE
workingday: TRUE if neither a holiday nor a weekend, else FALSE
weathersit: categorical, “Clear to partly cloudy”/“Light Precipitation”/“Misty”
temp: normalized temperature in Celsius
atemp: normalized “feeling” temperature in Celsius
hum: normalized humidity
windspeed: normalized windspeed
instant: the time index – number of hours since beginning of data set (not a variable)
mnth and yr: month and year indices (not variables)

Remember that you must specify family = poisson or family = quasipoisson when using glm() to fit a count model.

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in variables, and use paste() to assemble a string representing the model formula.

The data frame bikesJuly is in the workspace. The names of the outcome variable and the input variables are also in the workspace as the variables outcome and vars respectively.

bikesJuly <- read.delim("DATABASE/bikesJuly.txt", sep = ",") %>% select(., c(-"X"))
head(bikesJuly)
##   hr holiday workingday             weathersit temp  atemp  hum windspeed cnt
## 1  0   FALSE      FALSE Clear to partly cloudy 0.76 0.7273 0.66    0.0000 149
## 2  1   FALSE      FALSE Clear to partly cloudy 0.74 0.6970 0.70    0.1343  93
## 3  2   FALSE      FALSE Clear to partly cloudy 0.72 0.6970 0.74    0.0896  90
## 4  3   FALSE      FALSE Clear to partly cloudy 0.72 0.7121 0.84    0.1343  33
## 5  4   FALSE      FALSE Clear to partly cloudy 0.70 0.6667 0.79    0.1940   4
## 6  5   FALSE      FALSE Clear to partly cloudy 0.68 0.6364 0.79    0.1045  10
##   instant mnth yr
## 1   13004    7  1
## 2   13005    7  1
## 3   13006    7  1
## 4   13007    7  1
## 5   13008    7  1
## 6   13009    7  1
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: Factor w/ 3 levels "Clear to partly cloudy",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.

vars <- read.delim("DATABASE/vars.txt", sep = ",")
head(vars)
##   X          x
## 1 1         hr
## 2 2    holiday
## 3 3 workingday
## 4 4 weathersit
## 5 5       temp
## 6 6      atemp
outcome <- read.delim("DATABASE/outcome.txt", sep  = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/outcome.txt'
# The outcome column
outcome 
##   X   x
## 1 1 cnt
# The inputs to use
vars 
##   X          x
## 1 1         hr
## 2 2    holiday
## 3 3 workingday
## 4 4 weathersit
## 5 5       temp
## 6 6      atemp
## 7 7        hum
## 8 8  windspeed

Calculate the mean (mean()) and variance (var()) of bikesJuly$cnt.

# Create the formula string for bikes rented as a function of the inputs
# (outcome and vars were read in as data frames, so use their x column)
(fmla <- paste(outcome$x, "~", paste(vars$x, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"

Should you use poisson or quasipoisson regression?

# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
## [1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
## [1] 45863.84

Use glm() to fit a model to the bikesJuly data: bike_model.

# # Fit the model
# bike_model <- glm(fmla, data=bikesJuly, family=quasipoisson)

Use glance() to look at the model’s fit statistics. Assign the output of glance() to the variable perf.

# # Call glance
# (perf <- glance(bike_model))

Calculate the pseudo-R-squared of the model.

# # Calculate pseudo-R-squared
# (pseudoR2 <- 1 - perf$deviance/perf$null.deviance)

Predict bike rentals on new data

In this exercise you will use the model you built in the previous exercise to make predictions for the month of August. The data set bikesAugust has the same columns as bikesJuly.

Recall that you must specify type = “response” with predict() when predicting counts from a glm poisson or quasipoisson model.

bikesAugust <- read.delim("DATABASE/bikesAugust.txt", sep = ",")
head(bikesAugust)
##   X hr holiday workingday             weathersit temp  atemp  hum windspeed cnt
## 1 1  0   FALSE       TRUE Clear to partly cloudy 0.68 0.6364 0.79    0.1642  47
## 2 2  1   FALSE       TRUE Clear to partly cloudy 0.66 0.6061 0.83    0.0896  33
## 3 3  2   FALSE       TRUE Clear to partly cloudy 0.64 0.5758 0.83    0.1045  13
## 4 4  3   FALSE       TRUE Clear to partly cloudy 0.64 0.5758 0.83    0.1045   7
## 5 5  4   FALSE       TRUE                  Misty 0.64 0.5909 0.78    0.1343   4
## 6 6  5   FALSE       TRUE                  Misty 0.64 0.5909 0.78    0.1343  49
##   instant mnth yr
## 1   13748    8  1
## 2   13749    8  1
## 3   13750    8  1
## 4   13751    8  1
## 5   13752    8  1
## 6   13753    8  1

The model bike_model and the data frame bikesAugust are in the workspace.

# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  13 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: Factor w/ 3 levels "Clear to partly cloudy",..: 1 1 1 1 3 3 1 3 3 3 ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

Use predict to predict the number of bikes per hour on the bikesAugust data. Assign the predictions to the column bikesAugust$pred.

# Make predictions on August data
# bikesAugust$pred  <- predict(bike_model, type = "response", newdata = bikesAugust)
# head(bikesAugust)

Fill in the blanks to get the RMSE of the predictions on the August data.

# # Calculate the RMSE
# bikesAugust %>% 
#   mutate(residual = pred - cnt) %>%
#   summarize(rmse  = sqrt(mean(residual^2)))

Fill in the blanks to generate the plot of predictions to actual counts. Do any of the predictions appear negative?

# # Plot predictions vs cnt (pred on x-axis)
# ggplot(bikesAugust, aes(x = pred, y = cnt)) +
#   geom_point() + 
#   geom_abline(color = "darkblue")

GAM to learn non-linear transforms

Model soybean growth with GAM

In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting). As you will see, the soybean plant doesn’t grow at a steady rate, but rather has a “growth spurt” that eventually tapers off. Hence, leaf weight is not well described by a linear model.

Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:

y ~ s(x)

Also remember that gam() from the package mgcv has the calling interface

gam(formula, family, data)

For standard regression, use family = gaussian (the default).

The soybean training data, soybean_train, is loaded into your workspace. It includes the outcome weight and the input variable Time. For comparison, the linear model model.lin, which was fit using the formula weight ~ Time, has already been loaded into the workspace as well.

Fill in the blanks to plot weight versus Time (Time on x-axis). Does the relationship look linear?

soybean_train <- read.delim("DATABASE/soybean_train.txt", sep = ",")
# soybean_train is in the workspace
summary(soybean_train)
##        X              Plot     Variety      Year           Time      
##  Min.   :  1.0   1988F6 : 10   F:161   Min.   :1988   Min.   :14.00  
##  1st Qu.:101.2   1988F7 :  9   P:169   1st Qu.:1988   1st Qu.:27.00  
##  Median :207.5   1988P1 :  9           Median :1989   Median :42.00  
##  Mean   :207.3   1988P2 :  9           Mean   :1989   Mean   :43.56  
##  3rd Qu.:307.8   1988P8 :  9           3rd Qu.:1990   3rd Qu.:56.00  
##  Max.   :412.0   1988F3 :  8           Max.   :1990   Max.   :84.00  
##                  (Other):276                                         
##      weight       
##  Min.   : 0.0290  
##  1st Qu.: 0.6663  
##  Median : 3.5233  
##  Mean   : 6.1645  
##  3rd Qu.:10.3808  
##  Max.   :27.3700  
## 
head(soybean_train)
##   X   Plot Variety Year Time weight
## 1 1 1988F1       F 1988   14  0.106
## 2 2 1988F1       F 1988   21  0.261
## 3 3 1988F1       F 1988   28  0.666
## 4 4 1988F1       F 1988   35  2.110
## 5 6 1988F1       F 1988   49  6.230
## 6 8 1988F1       F 1988   63 13.350
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point() 

Load the package mgcv. Create the formula fmla.gam to express weight as a non-linear function of Time. Print it. Fit a generalized additive model on soybean_train using fmla.gam.

# Load the package mgcv
library(mgcv)

# Create the formula 
(fmla.gam <- weight ~ s(Time))
## weight ~ s(Time)
# Fit the GAM Model
model.gam <- gam(fmla.gam, family = gaussian, data = soybean_train)

Tree-Based Methods

In this chapter we will look at modeling algorithms that do not assume linearity or additivity, and that can learn limited types of interactions among input variables. These algorithms are tree-based methods that work by combining ensembles of decision trees that are learned from the training data.

Random forests

Build a random forest model for bike rentals

In this exercise you will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

You will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:

formula
data
num.trees: the number of trees in the forest.
respect.unordered.factors: specifies how to treat unordered factor variables. We recommend setting this to “order” for regression.
seed: because this is a random algorithm, you will set the seed to get reproducible results.

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.

The data frame bikesJuly is in the workspace. The sample code specifies the names of the outcome and input variables.

Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.

# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: Factor w/ 3 levels "Clear to partly cloudy",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
seed <- read.delim("DATABASE/seed.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/seed.txt'
# Random seed to reproduce results
seed
##   X      x
## 1 1 423563
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"

Load the package ranger.

Use ranger to fit a model to the bikesJuly data: bike_model_rf. The first argument to ranger() is the formula, fmla. Use 500 trees and respect.unordered.factors = “order”. Set the seed to seed for reproducible results. Print the model. What is the R-squared?

# # Load the package ranger
# library(ranger)
# 
# # Fit and print the random forest model
# (bike_model_rf <- ranger(fmla, # formula 
#                          bikesJuly, # data
#                          num.trees = 500, 
#                          respect.unordered.factors = "order", 
#                          seed = seed))

Predict bike rentals with the random forest model

In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.

The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:

predict(model, data)$predictions

The model bike_model_rf and the dataset bikesAugust (for evaluation) are loaded into your workspace.

Call predict() on bikesAugust to predict the number of bikes rented in August (cnt). Add the predictions to bikesAugust as the column pred. Fill in the blanks to calculate the root mean squared error of the predictions. The poisson model you built for this data gave an RMSE of about 112.6. How does this model compare? Fill in the blanks to plot actual bike rental counts (cnt) versus the predictions (pred on x-axis)

# # bikesAugust is in the workspace
# str(bikesAugust)
# 
# # bike_model_rf is in the workspace
# bike_model_rf
# 
# # Make predictions on the August data
# bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions
# 
# # Calculate the RMSE of the predictions
# bikesAugust %>% 
#   mutate(residual = cnt - pred)  %>%        # calculate the residual
#   summarize(rmse  = sqrt(mean(residual^2))) # calculate rmse
# 
# # Plot actual outcome vs predictions (predictions on x-axis)
# ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
#   geom_point() + 
#   geom_abline()

Visualize random forest bike model predictions

In the previous exercise, you saw that the random forest bike model did better on the August data than the quasipoisson model, in terms of RMSE.

In this exercise you will visualize the random forest model’s August predictions as a function of time. The corresponding plot from the quasipoisson model that you built in a previous exercise is in the workspace for you to compare.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

The data frame bikesAugust (with predictions) is in the workspace. The plot quasipoisson_plot of quasipoisson model predictions as a function of time is shown.

Fill in the blanks to plot the predictions and actual counts by hour for the first 14 days of August. gather the cnt and pred columns into a column called value, with a key called valuetype. Plot value as a function of instant (day). How does the random forest model compare?

# first_two_weeks <- bikesAugust %>% 
#   # Set start to 0, convert unit to days
#   mutate(instant = (instant - min(instant)) / 24) %>% 
#   # Gather cnt and pred into a column named value with key valuetype
#   gather(key = valuetype, value = value, cnt, pred) %>%
#   # Filter for rows in the first two weeks
#   filter(instant < 14) 
# head(first_two_weeks)
# # Plot predictions and cnt by date/time 
# ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
#   geom_point() + 
#   geom_line() + 
#   scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
#   scale_color_brewer(palette = "Dark2") + 
#   ggtitle("Predicted August bike rentals, Random Forest plot")

One-Hot-Encoding Categorical Variables

vtreat on a small example

In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded “lev”), and to clean bad values out of numerical variables (coded “clean”).

To design a treatment plan, use the function designTreatmentsZ():

treatplan <- designTreatmentsZ(data, varlist)

data: the original training data frame
varlist: a vector of input variables to be treated (as strings).

designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

scoreFrame <- treatplan %>% magrittr::use_series(scoreFrame) %>% select(varName, origName, code)

varName: the name of the new treated variable
origName: the name of the original variable that the treated variable comes from
code: the type of the new variable.
“clean”: a numerical variable with no NAs or NaNs
“lev”: an indicator variable for a specific level of the original categorical variable.

(magrittr::use_series() is an alias for $ that you can use in pipes.)

For these exercises, we want varName where code is either “clean” or “lev”:

newvarlist <- scoreFrame %>% filter(code %in% c("clean", "lev")) %>% magrittr::use_series(varName)

To transform the data set into all numerical and one-hot-encoded variables, use prepare():
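
The line below is a sketch of that call; the object names (treatplan, dframe, newvars) mirror the commented exercise code further down:

# dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars)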

The data frame dframe and the package magrittr are loaded in the workspace.

Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.
Create a vector called vars with the names of the input variables (as strings).
Load the package vtreat. Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.
Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables. You only need the columns varName, origName and code. What are the names of the new indicator variables? Of the continuous variable?
Create a vector newvars that contains the variable varName where code is either clean or lev. Print it.
Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column). Print it and compare to dframe.

# # dframe is in the workspace
# dframe
# 
# # Create a vector of variable names
# (vars <- c("color", "size"))
# 
# # Load the package vtreat
# library(vtreat)
# 
# # Create the treatment plan
# treatplan <- designTreatmentsZ(dframe, vars)
# 
# # Examine the scoreFrame
# (scoreFrame <- treatplan %>%
#     use_series(scoreFrame) %>%
#     select(varName, origName, code))
# 
# # We only want the rows with codes "clean" or "lev"
# (newvars <- scoreFrame %>%
#     filter(code %in% c("clean", "lev")) %>%
#     use_series(varName))
# 
# # Create the treated training data
# (dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))

Print dframe and testframe. Are there colors in testframe that didn’t appear in dframe? Call prepare() to create a one-hot-encoded version of testframe (without the outcome). Call it testframe.treat and print it. Use the varRestriction argument to restrict to only the variables in newvars. How are the yellow rows encoded?

# # treatplan is in the workspace
# summary(treatplan)
# 
# # newvars is in the workspace
# newvars
# 
# # Print dframe and testframe
# dframe
# testframe
# 
# # Use prepare() to one-hot-encode testframe
# (testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))

Unsupervised Learning in R

Unsupervised learning in R

k-means clustering

We have created some two-dimensional data and stored it in a variable called x in your workspace. A scatter plot of x gives a visual representation of the data.

In this exercise, your task is to create a k-means model of the x data using 3 clusters, then to look at the structure of the resulting model using the summary() function.

Fit a k-means model to x using 3 centers and run the k-means algorithm 20 times. Store the result in km.out. Inspect the result with the summary() function.

x <- read.delim("DATABASE/x.txt", sep = ",")
head(x)
##   X       V1       V2
## 1 1 3.370958 1.995379
## 2 2 1.435302 2.760242
## 3 3 2.363128 2.038991
## 4 4 2.632863 2.735072
## 5 5 2.404268 1.853527
## 6 6 1.893875 1.942113
# Create the k-means model: km.out

km.out <- kmeans(x, 3, nstart = 20)
# Inspect the result
summary(km.out)
##              Length Class  Mode   
## cluster      300    -none- numeric
## centers        9    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

Results of kmeans()

The kmeans() function produces several outputs. In the video, we discussed one output of modeling, the cluster membership.

In this exercise, you will access the cluster component directly. This is useful anytime you need the cluster membership for each observation of the data used to build the clustering model. A future exercise will show an example of how this cluster membership might be used to help communicate the results of k-means modeling.

k-means models also have a print method to give a human friendly output of basic modeling results. This is available by using print() or simply typing the name of the model.

The k-means model you built in the last exercise, km.out, is still available in your workspace.

Print a list of the cluster membership to the console. Use a print method to print out the km.out model.

# Print the cluster membership component of the model
km.out$cluster
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1
# Print the km.out object
km.out
## K-means clustering with 3 clusters of sizes 100, 100, 100
## 
## Cluster means:
##       X        V1       V2
## 1 250.5 -2.010368 1.002364
## 2  50.5  2.032515 2.032936
## 3 150.5 -5.087484 1.882169
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 84556.96 83508.36 83508.93
##  (between_SS / total_SS =  88.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Visualizing and interpreting results of kmeans()

One of the more intuitive ways to interpret the results of k-means models is by plotting the data as a scatter plot and using color to label the samples’ cluster membership. In this exercise, you will use the standard plot() function to accomplish this.

To create a scatter plot, you can pass data with two features (i.e. columns) to plot() with an extra argument col = km.out$cluster, which sets the color of each point in the scatter plot according to its cluster membership.

x and km.out are available in your workspace. Using the plot() function to create a scatter plot of data x:

Color the dots on the scatterplot by setting the col argument to the cluster component in km.out. Title the plot “k-means with 3 clusters” using the main argument to plot(). Ensure there are no axis labels by specifying "" for both the xlab and ylab arguments to plot().

# Scatter plot of x, colored by cluster membership and with axis labels suppressed
plot(x,
  col = km.out$cluster,
  main = "k-means with 3 clusters",
  xlab = "", ylab = "")

How k-means works and practical matters

Handling random algorithms

In the video, you saw how kmeans() randomly initializes the centers of clusters. This random initialization can result in assigning observations to different cluster labels. Also, the random initialization can result in finding different local minima for the k-means algorithm. This exercise will demonstrate both results.

At the top of each plot, the measure of model quality (the total within-cluster sum of squares error) will be plotted. Look for the model(s) with the lowest error: those runs found the best results.

Because kmeans() initializes observations to random clusters, it is important to set the random number generator seed for reproducibility.

The data, x, is still available in your workspace. Your task is to generate six kmeans() models on the data, plotting the results of each, in order to see the impact of random initializations on model results.

Set the random number seed to 1 with set.seed().

# Set up 2 x 3 plotting grid
par(mfrow = c(2, 3))

# Set seed
set.seed(1)

For each iteration of the for loop, run kmeans() on x. Assume the number of clusters is 3 and number of starts (nstart) is 1. Visualize the cluster memberships using the col argument to plot(). Observe how the measure of quality and cluster assignments vary among the six model runs.

for(i in 1:6) {
  # Run kmeans() on x with three clusters and one start
  km.out <- kmeans(x, 3, nstart = 1)
  
  # Plot clusters
  plot(x, col = km.out$cluster, 
       main = km.out$tot.withinss)
}
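
The practical fix for this sensitivity is to increase nstart, since kmeans() then keeps only the best of the random starts. The comparison below is a minimal sketch (not part of the course exercise) using the same x data:

# Total within-cluster sum of squares across repeated runs
set.seed(1)
single_start <- replicate(6, kmeans(x, centers = 3, nstart = 1)$tot.withinss)
multi_start  <- replicate(6, kmeans(x, centers = 3, nstart = 20)$tot.withinss)

single_start  # varies from run to run
multi_start   # (almost always) identical across runs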

Selecting number of clusters

The k-means algorithm assumes the number of clusters as part of the input. If you know the number of clusters in advance (e.g. due to certain business constraints) this makes setting the number of clusters easy. However, as you saw in the video, if you do not know the number of clusters and need to determine it, you will need to run the algorithm multiple times, each time with a different number of clusters. From this, you can observe how a measure of model quality changes with the number of clusters.

In this exercise, you will run kmeans() multiple times to see how model quality changes as the number of clusters changes. Plots displaying this information help to determine the number of clusters and are often referred to as scree plots.

The ideal plot will have an elbow where the quality measure improves more slowly as the number of clusters increases. This indicates that the quality of the model is no longer improving substantially as the model complexity (i.e. number of clusters) increases. In other words, the elbow indicates the number of clusters inherent in the data.

The data, x, is still available in your workspace.

Build 15 kmeans() models on x, each with a different number of clusters (ranging from 1 to 15). Set nstart = 20 for all model runs and save the total within cluster sum of squares for each model to the ith element of wss.

# Initialize total within sum of squares error: wss
wss <- 0

# For 1 to 15 cluster centers
for (i in 1:15) {
  km.out <- kmeans(x, centers = i, nstart  = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}

Run the code provided to create a scree plot of the wss for all 15 models.

# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

Take a look at your scree plot. How many clusters are inherent in the data? Set k equal to the number of clusters at the location of the elbow.

# Set k equal to the number of clusters corresponding to the elbow location
k <- 2
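
If you want a numeric check to go with the visual one, you can look at the relative drop in wss from each added cluster; the drops flatten out after the elbow. A small sketch using the wss vector built above:

# Percent decrease in total within-cluster sum of squares at each step
round(100 * -diff(wss) / head(wss, -1), 1)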

Introduction to the Pokemon data

Practical matters: working with real data

Dealing with real data is often more challenging than dealing with synthetic data. Synthetic data helps with learning new concepts and techniques, but the next few exercises will deal with data that is closer to the type of real data you might find in your professional or academic pursuits.

The first challenge with the Pokemon data is that there is no pre-determined number of clusters. You will determine the appropriate number of clusters, keeping in mind that in real data the elbow in the scree plot tends to be less sharp than in synthetic data. Use your judgement when choosing the number of clusters.

The second part of this exercise includes plotting the outcomes of the clustering on two dimensions, or features, of the data. These features were chosen somewhat arbitrarily for this exercise. Think about how you would use plotting and clustering to communicate interesting groups of Pokemon to other people.

An additional note: this exercise utilizes the iter.max argument to kmeans(). As you’ve seen, kmeans() is an iterative algorithm, repeating over and over until some stopping criterion is reached. The default number of iterations for kmeans() is 10, which is not enough for the algorithm to converge and reach its stopping criterion here, so we’ll set the number of iterations to 50 to overcome this issue. To see what happens when kmeans() does not converge, try running the example with a lower number of iterations (e.g. 3). This is another example of the kinds of issues you may run into when working with real data.

The pokemon dataset, which contains observations of 800 Pokemon characters on 6 dimensions (i.e. features), is available in your workspace.

Using kmeans() with nstart = 20, determine the total within sum of square errors for different numbers of clusters (between 1 and 15).

pokemon <- read.csv("DATABASE/Pokemon.csv")
head(pokemon)
##   Number                  Name Type1  Type2 Total HitPoints Attack Defense
## 1      1             Bulbasaur Grass Poison   318        45     49      49
## 2      2               Ivysaur Grass Poison   405        60     62      63
## 3      3              Venusaur Grass Poison   525        80     82      83
## 4      3 VenusaurMega Venusaur Grass Poison   625        80    100     123
## 5      4            Charmander  Fire          309        39     52      43
## 6      5            Charmeleon  Fire          405        58     64      58
##   SpecialAttack SpecialDefense Speed Generation Legendary
## 1            65             65    45          1     False
## 2            80             80    60          1     False
## 3           100            100    80          1     False
## 4           122            120    80          1     False
## 5            60             50    65          1     False
## 6            80             65    80          1     False
# Keep only the six numeric battle statistics (columns 6 through 11)
pokemon <- pokemon[, 6:11]
# Initialize total within sum of squares error: wss
wss <- 0
head(pokemon)
##   HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1        45     49      49            65             65    45
## 2        60     62      63            80             80    60
## 3        80     82      83           100            100    80
## 4        80    100     123           122            120    80
## 5        39     52      43            60             50    65
## 6        58     64      58            80             65    80
## 761        50     60      60            60             60    30
## 762        65     75      90            97            123    44
## 763        50     53      62            58             63    44
## 764        71     73      88           120             89    59
## 765        44     38      33            61             43    70
## 766        62     55      52           109             94   109
## 767        58     89      77            45             45    48
## 768        82    121     119            69             59    71
## 769        77     59      50            67             63    46
## 770       123     77      72            99             92    58
## 771        95     65      65           110            130    60
## 772        78     92      75            74             63   118
## 773        67     58      57            81             67   101
## 774        50     50     150            50            150    50
## 775        45     50      35            55             75    40
## 776        68     75      53            83            113    60
## 777        90    100      70           110            150    80
## 778        57     80      91            80             87    75
## 779        43     70      48            50             60    38
## 780        85    110      76            65             82    56
## 781        49     66      70            44             55    51
## 782        44     66      70            44             55    56
## 783        54     66      70            44             55    46
## 784        59     66      70            44             55    41
## 785        65     90     122            58             75    84
## 786        55     85     122            58             75    99
## 787        75     95     122            58             75    69
## 788        85    100     122            58             75    54
## 789        55     69      85            32             35    28
## 790        95    117     184            44             46    28
## 791        40     30      35            45             40    55
## 792        85     70      80            97             80   123
## 793       126    131      95           131             98    99
## 794       126    131      95           131             98    99
## 795       108    100     121            81             95    95
## 796        50    100     150           100            150    50
## 797        50    160     110           160            110   110
## 798        80    110      60           150            130    70
## 799        80    160      60           170            130    80
## 800        80    110     120           130             90    70
# Initialize the vector of total within-cluster sums of squares: wss
wss <- numeric(15)

# Look over 1 to 15 possible clusters
for (i in 1:15) {
  # Fit the model: km.out
  km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  # Save the within cluster sum of squares
  wss[i] <- km.out$tot.withinss
}

Pick an appropriate number of clusters based on these results from the first instruction and assign that number to k. Create a k-means model using k clusters and assign it to the km.out variable.

# Produce a scree plot
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

# Select number of clusters
k <- 4

# Build model with k clusters: km.out
km.out <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)

# View the resulting model
km.out
## K-means clustering with 4 clusters of sizes 115, 114, 283, 288
## 
## Cluster means:
##   HitPoints    Attack   Defense SpecialAttack SpecialDefense     Speed
## 1  71.30435  92.91304 121.42609      63.89565       88.23478  52.36522
## 2  89.20175 121.09649  92.73684     120.45614       97.67544 100.44737
## 3  50.29682  54.03180  51.62898      47.90459       49.15548  49.74912
## 4  79.18056  81.31944  69.19097      82.01042       77.53125  80.10417
## 
## Clustering vector:
##   [1] 3 4 4 2 3 4 4 2 2 3 4 4 2 3 3 4 3 3 4 4 3 3 4 2 3 4 3 4 3 4 3 4 3 1 3 3 4
##  [38] 3 3 4 3 4 3 4 3 4 3 4 3 4 4 3 1 3 4 3 4 3 4 3 4 3 4 3 2 3 3 4 3 4 4 2 3 4
##  [75] 1 3 4 4 3 4 3 1 1 4 4 3 1 1 3 4 3 3 4 3 4 3 4 3 1 3 4 4 2 1 3 4 3 1 3 4 3
## [112] 4 3 1 4 1 3 3 1 3 1 4 4 4 2 3 4 3 4 3 4 4 4 4 4 4 1 2 4 3 4 2 4 3 3 4 4 2
## [149] 4 3 1 3 1 4 2 4 2 2 2 3 4 2 2 2 2 2 3 4 4 3 4 4 3 4 4 3 4 3 4 3 4 3 3 4 3
## [186] 4 3 3 3 3 4 3 4 3 3 4 2 1 3 4 1 4 3 3 4 3 3 4 4 3 1 4 1 4 4 4 3 4 4 3 1 4
## [223] 1 1 1 3 4 4 1 1 1 4 1 4 3 4 3 1 3 4 3 3 4 3 4 1 3 4 2 4 3 1 4 4 3 3 1 3 3
## [260] 3 4 4 2 2 1 3 4 2 2 2 2 2 3 4 4 2 3 4 2 2 3 4 4 2 3 4 3 4 3 3 4 3 3 3 3 4
## [297] 3 3 4 3 4 3 4 3 3 4 2 3 4 3 4 3 4 2 3 4 3 3 3 4 3 4 3 1 3 3 3 1 3 1 3 1 1
## [334] 1 3 4 4 3 4 2 4 4 4 4 4 3 4 3 4 2 4 4 3 4 2 1 3 4 3 3 3 4 3 4 3 4 2 4 4 4
## [371] 4 3 4 3 4 3 1 3 1 3 1 3 4 4 1 3 4 2 3 1 4 4 4 2 3 3 4 2 3 4 4 3 1 1 1 3 3
## [408] 1 2 2 3 1 1 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 3 1 1 3 4 4 3 4 4 3 3 4
## [445] 3 4 3 3 3 3 4 3 4 3 4 1 1 3 1 1 1 4 3 1 4 3 4 3 4 3 4 4 3 4 3 4 2 4 4 3 4
## [482] 3 3 4 3 1 3 3 3 4 1 3 4 2 2 4 3 2 2 3 1 3 1 3 4 4 3 4 3 3 4 2 4 1 1 1 1 2
## [519] 2 4 4 1 4 1 4 4 4 2 1 1 4 4 4 4 4 4 4 1 2 2 2 2 2 2 2 2 1 4 2 2 2 2 2 2 3
## [556] 4 4 3 4 4 3 4 4 3 4 3 3 1 3 4 3 4 3 4 3 4 3 4 3 3 4 3 4 3 1 1 3 4 3 4 4 1
## [593] 3 1 1 3 4 4 1 4 3 1 4 3 3 4 3 4 3 4 4 3 3 4 3 4 4 4 3 1 3 1 4 3 1 1 1 4 2
## [630] 3 4 3 4 3 4 3 4 4 3 3 4 3 4 3 4 4 3 4 4 3 1 3 4 3 4 4 3 4 3 1 3 1 1 3 4 4
## [667] 3 4 3 3 4 3 4 2 3 4 4 3 4 4 3 4 1 3 1 3 1 1 3 4 3 1 4 1 3 4 2 3 4 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 1 1 3 4 4 3 4 4 3 4 3 3 4 3 3 4 3 4 3 3 4
## [741] 3 4 3 4 4 3 4 4 3 1 2 1 3 4 3 4 3 4 3 1 3 1 3 4 3 4 3 1 3 4 4 4 4 1 3 4 2
## [778] 4 3 4 3 3 3 3 1 1 1 1 3 1 3 4 2 2 2 1 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 455965.7 408271.9 513476.6 871531.3
##  (between_SS / total_SS =  47.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Create a scatter plot of Defense vs. Speed, showing cluster membership for each observation.

# Plot of Defense vs. Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
     col = km.out$cluster,
     main = paste("k-means clustering of Pokemon with", k, "clusters"),
     xlab = "Defense", ylab = "Speed")

Hierarchical clustering

Introduction to hierarchical clustering

Hierarchical clustering with results

In this exercise, you will create your first hierarchical clustering model using the hclust() function.

We have created some data that has two dimensions and placed it in a variable called x. Your task is to create a hierarchical clustering model of x. Remember from the video that the first step to hierarchical clustering is determining the similarity between observations, which you will do with the dist() function.

You will look at the structure of the resulting model using the summary() function.

Fit a hierarchical clustering model to x using the hclust() function. Store the result in hclust.out. Inspect the result with the summary() function.

# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x) , method = "complete")

# Inspect the result
summary(hclust.out)
##             Length Class  Mode     
## merge       598    -none- numeric  
## height      299    -none- numeric  
## order       300    -none- numeric  
## labels        0    -none- NULL     
## method        1    -none- character
## call          3    -none- call     
## dist.method   1    -none- character

Selecting number of clusters

Cutting the tree

Remember from the video that cutree() is the R function that cuts a hierarchical model. The h and k arguments to cutree() allow you to cut the tree based on a certain height h or a certain number of clusters k.

In this exercise, you will use cutree() to cut the hierarchical model you created earlier based on each of these two criteria.

The hclust.out model you created earlier is available in your workspace.

Cut the hclust.out model at height 7.

# Cut by height
cutree(hclust.out, h = 7)
##   [1]  1  1  2  2  2  2  2  2  3  3  3  3  4  4  4  4  4  5  5  5  5  5  6  6  6
##  [26]  6  6  6  7  7  7  7  7  7  7  8  8  8  8  9  9  9  9 10 10 10 10 10 11 11
##  [51] 11 11 11 11 11 12 12 12 12 13 13 13 13 13 14 14 14 14 14 14 15 15 15 16 16
##  [76] 16 16 16 17 17 17 17 17 17 18 18 18 19 19 19 19 19 20 20 20 20 20 20 21 21
## [101] 22 22 22 22 23 23 23 23 23 24 24 24 24 24 24 24 25 25 25 26 26 26 26 26 26
## [126] 27 27 27 27 28 28 28 28 28 28 29 29 29 29 30 30 30 30 31 31 31 31 31 31 31
## [151] 32 32 32 32 32 32 33 33 33 33 33 33 33 34 34 34 34 34 35 35 35 35 35 36 36
## [176] 36 36 36 37 37 37 38 38 38 38 38 39 39 39 39 39 39 39 40 40 40 40 41 41 41
## [201] 41 41 42 42 42 42 42 42 42 43 43 43 43 43 44 44 44 44 44 44 44 45 45 45 45
## [226] 46 46 46 46 46 46 46 47 47 47 47 48 48 48 48 48 49 49 49 50 50 50 50 50 50
## [251] 51 51 51 51 51 51 51 52 52 52 52 52 52 52 53 53 53 53 53 53 53 54 54 54 54
## [276] 54 54 54 55 55 55 55 56 56 56 56 56 56 56 57 57 57 57 58 58 58 58 59 59 59

Cut the hclust.out model to create 3 clusters.

# Cut by number of clusters
cutree(hclust.out, k = 3)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 3 3 3 3

Clustering linkage and practical matters

Linkage methods

In this exercise, you will produce hierarchical clustering models using different linkages and plot the dendrogram for each, observing the overall structure of the trees.

You’ll be asked to interpret the results in the next exercise.

Produce three hierarchical clustering models on x using the “complete”, “average”, and “single” linkage methods, respectively.

# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")
# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")

# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")
# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")

# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")

# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")

Practical matters: scaling

Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.

In this exercise, you will go back to working with “real” data, the pokemon dataset introduced in the first chapter. You will observe the distribution (mean and standard deviation) of each feature, scale the data accordingly, then produce a hierarchical clustering model using the complete linkage method.

The data is stored in the pokemon object in your workspace.

Observe the mean of each variable in pokemon using the colMeans() function.

# View column means
colMeans(pokemon)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       69.25875       79.00125       73.84250       72.82000       71.90250 
##          Speed 
##       68.27750

Observe the standard deviation of each variable using the apply() and sd() functions. Since the variables are the columns of your matrix, make sure to specify 2 as the MARGIN argument to apply().

# View column standard deviations
apply(pokemon, 2, sd)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047

Scale the pokemon data using the scale() function and store the result in pokemon.scaled.

# Scale the data
pokemon.scaled <- scale(pokemon)

Create a hierarchical clustering model of the pokemon.scaled data using the complete linkage method. Manually specify the method argument and store the result in hclust.pokemon.

# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon <- hclust(dist(pokemon.scaled) , method = "complete")

Comparing kmeans() and hclust()

Comparing k-means and hierarchical clustering, you’ll see the two methods produce different cluster memberships. This is because the two algorithms make different assumptions about how the data is generated. In a more advanced course, we could choose to use one model over another based on the quality of the models’ assumptions, but for now, it’s enough to observe that they are different.

This exercise will have you compare results from the two models on the pokemon dataset to see how they differ.

The results from running k-means clustering on the pokemon data (for 3 clusters) are stored as km.pokemon. The hierarchical clustering model you created in the previous exercise is still available as hclust.pokemon.

Using cutree() on hclust.pokemon, assign cluster membership to each observation. Assume three clusters and assign the result to a vector called cut.pokemon. Using table(), compare cluster membership between the two clustering methods. Recall that the different components of k-means model objects can be accessed with the $ operator.

# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon <- cutree(hclust.pokemon, k = 3)


# Select number of clusters
k <- 3

# Build k-means model with k clusters: km.pokemon
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)

# Compare methods
table(km.pokemon$cluster, cut.pokemon)
##    cut.pokemon
##       1   2   3
##   1 171   3   1
##   2 350   5   0
##   3 267   3   0

Dimensionality reduction with PCA

Introduction to PCA

PCA using prcomp()

In this exercise, you will create your first PCA model and observe the diagnostic results.

We have loaded the Pokemon data from earlier, which has six dimensions, and placed it in a variable called pokemon. Your task is to create a PCA model of the data, then to inspect the resulting model using the summary() function.

Create a PCA model of the data in pokemon, setting scale to TRUE. Store the result in pr.out. Inspect the result with the summary() function.

# Perform scaled PCA: pr.out
pr.out <- prcomp(x = pokemon, scale = TRUE)

# Inspect model output
summary(pr.out)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
## Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
## Cumulative Proportion  0.4519 0.6342 0.7640 0.8841 0.95549 1.00000

Additional results of PCA

PCA models in R produce additional diagnostic and output components:

center: the column means used to center the data, or FALSE if the data weren't centered.
scale: the column standard deviations used to scale the data, or FALSE if the data weren't scaled.
rotation: the directions of the principal component vectors in terms of the original features/variables. This information allows you to define new data in terms of the original principal components.
x: the value of each observation in the original dataset projected onto the principal components.

You can access these the same way as other model components. For example, use pr.out$rotation to access the rotation component.
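
For example, here is a quick look at each of these components of the pr.out model fit above:

# Means and standard deviations used for centering and scaling
pr.out$center
pr.out$scale

# Loadings and the projected observations
pr.out$rotation
head(pr.out$x)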

Which of the following statements is not correct regarding the pr.out model fit on the pokemon data?

Visualizing and interpreting PCA results

Interpreting biplots (1)

As stated in the video, the biplot() function plots both the principal component loadings and the mapping of the observations to their first two principal component values. The next couple of exercises will check your interpretation of the biplot() visualization.

Using the biplot() of the pr.out model, which two original variables have approximately the same loadings in the first two principal components?

biplot(pr.out)
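
If the biplot is hard to read, the loadings can also be inspected numerically; a quick check on the same pr.out model:

# Loadings on the first two principal components
round(pr.out$rotation[, 1:2], 2)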

Variance explained

The second common plot type for understanding PCA models is a scree plot. A scree plot shows the variance explained as the number of principal components increases. Sometimes the cumulative variance explained is plotted as well.

In this and the next exercise, you will prepare data from the pr.out model you created at the beginning of the chapter for use in a scree plot. Preparing the data for plotting is required because there is not a built-in function in R to create this type of plot.

pr.out and the pokemon data are still available in your workspace.

Assign to the variable pr.var the square of the standard deviations of the principal components (i.e. the variance). The standard deviation of the principal components is available in the sdev component of the PCA model object. Assign to the variable pve the proportion of the variance explained, calculated by dividing pr.var by the total variance explained by all principal components.

# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)

Visualize variance explained

Now you will create a scree plot showing the proportion of variance explained by each principal component, as well as the cumulative proportion of variance explained.

Recall from the video that these plots can help to determine the number of principal components to retain. One way to determine the number of principal components to retain is by looking for an elbow in the scree plot showing that as the number of principal components increases, the rate at which variance is explained decreases substantially. In the absence of a clear elbow, you can use the scree plot as a guide for setting a threshold.

The proportion of variance explained is still available in the pve object you created in the last exercise.

Use plot() to plot the proportion of variance explained by each principal component.

# Plot variance explained for each principal component
head(pve)
## [1] 0.45190665 0.18225358 0.12979086 0.12011089 0.07142337 0.04451466
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

Use plot() and cumsum() (cumulative sum) to plot the cumulative proportion of variance explained as a function of the number of principal components.

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

Practical issues with PCA

1. Scaling the data.
2. Missing values: drop observations with missing values, or impute/estimate the missing values.
3. Categorical data: do not use categorical features, or encode categorical features as numbers (see the sketch below).
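
As a rough illustration of points 2 and 3, here is a minimal sketch using a small hypothetical data frame df (not part of the course data):

# Hypothetical data with a missing value and a categorical feature
df <- data.frame(height = c(1.7, NA, 1.8),
                 type   = c("grass", "fire", "water"))

# Option A: drop observations with missing values
df_complete <- na.omit(df)

# Option B: impute missing values with the column mean
df$height[is.na(df$height)] <- mean(df$height, na.rm = TRUE)

# Encode the categorical feature as numeric dummy variables
df_encoded <- model.matrix(~ type - 1, data = df)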

Practical issues: scaling

You saw in the video that scaling your data before doing PCA changes the results of the PCA modeling. Here, you will perform PCA with and without scaling, then visualize the results using biplots.

Sometimes scaling is appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Making the decision to use scaling is an important step in performing a principal component analysis.

The same Pokemon dataset is available in your workspace as pokemon. (The original exercise adds an extra Total variable, but it is not present in this version of the data, as the output below shows.)

There is some code at the top of the editor to calculate the mean and standard deviation of each variable in the model. Run this code to see how the scale of the variables differs in the original data.

# Mean of each variable
colMeans(pokemon)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       69.25875       79.00125       73.84250       72.82000       71.90250 
##          Speed 
##       68.27750
# Standard deviation of each variable
apply(pokemon, 2, sd)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047

Create a PCA model of pokemon with scaling, assigning the result to pr.with.scaling.

# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(x = pokemon, scale = TRUE)

# Create biplots of both for comparison

biplot(pr.with.scaling)

Create a PCA model of pokemon without scaling, assigning the result to pr.without.scaling. Use biplot() to plot both models (one at a time) and compare their outputs.

# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(x = pokemon, scale = FALSE)



biplot(pr.without.scaling)

Putting it all together with a case study

The goal of this chapter is to guide you through a complete analysis using the unsupervised learning techniques covered in the first three chapters. You’ll extend what you’ve learned by combining PCA as a preprocessing step to clustering using data that consist of measurements of cell nuclei of human breast masses.

Preparing the data

Unlike prior chapters, where we prepared the data for you for unsupervised learning, the goal of this chapter is to step you through a more realistic and complete workflow.

Recall from the video that the first step is to download and prepare the data.

Use read.csv() function to download the CSV (comma-separated values) file containing the data from the URL provided. Assign the result to wisc.df.

url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"

# Download the data: wisc.df
wisc.df <- read.csv(url)

Use as.matrix() to convert the features of the data (in columns 3 through 32) to a matrix. Store this in a variable called wisc.data.

# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[,c(3:32)])

Assign the row names of wisc.data the values currently contained in the id column of wisc.df. While not strictly required, this will help you keep track of the different observations throughout the modeling process.

# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id

Finally, set a vector called diagnosis to be 1 if a diagnosis is malignant (“M”) and 0 otherwise. Note that R coerces TRUE to 1 and FALSE to 0.

# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis == "M")

Exploratory data analysis

The first step of any data analysis, unsupervised or supervised, is to familiarize yourself with the data.

The variables you created before, wisc.data and diagnosis, are still available in your workspace. Explore the data to answer the following questions:

How many observations are in this dataset? How many variables/features in the data are suffixed with _mean? How many of the observations have a malignant diagnosis?

# The dim(), colnames(), and table() functions may prove useful here!
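
One possible exploration answering the three questions, sketched with the wisc.data and diagnosis objects created above:

dim(wisc.data)                            # number of observations and features
sum(grepl("_mean$", colnames(wisc.data))) # features suffixed with _mean
table(diagnosis)                          # benign (0) vs. malignant (1) counts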

Performing PCA

The next step in your analysis is to perform PCA on wisc.data.

You saw in the last chapter that it’s important to check if the data need to be scaled before performing PCA. Recall two common reasons for scaling data:

The input variables use different units of measurement. The input variables have significantly different variances.

The variables you created before, wisc.data and diagnosis, are still available in your workspace.

Check the mean and standard deviation of the features of the data to determine if the data should be scaled. Use the colMeans() and apply() functions like you’ve done before.

# Check column means and standard deviations
colMeans(wisc.data)
##             radius_mean            texture_mean          perimeter_mean 
##            1.412729e+01            1.928965e+01            9.196903e+01 
##               area_mean         smoothness_mean        compactness_mean 
##            6.548891e+02            9.636028e-02            1.043410e-01 
##          concavity_mean     concave.points_mean           symmetry_mean 
##            8.879932e-02            4.891915e-02            1.811619e-01 
##  fractal_dimension_mean               radius_se              texture_se 
##            6.279761e-02            4.051721e-01            1.216853e+00 
##            perimeter_se                 area_se           smoothness_se 
##            2.866059e+00            4.033708e+01            7.040979e-03 
##          compactness_se            concavity_se       concave.points_se 
##            2.547814e-02            3.189372e-02            1.179614e-02 
##             symmetry_se    fractal_dimension_se            radius_worst 
##            2.054230e-02            3.794904e-03            1.626919e+01 
##           texture_worst         perimeter_worst              area_worst 
##            2.567722e+01            1.072612e+02            8.805831e+02 
##        smoothness_worst       compactness_worst         concavity_worst 
##            1.323686e-01            2.542650e-01            2.721885e-01 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##            1.146062e-01            2.900756e-01            8.394582e-02
apply(wisc.data, 2, sd)
##             radius_mean            texture_mean          perimeter_mean 
##            3.524049e+00            4.301036e+00            2.429898e+01 
##               area_mean         smoothness_mean        compactness_mean 
##            3.519141e+02            1.406413e-02            5.281276e-02 
##          concavity_mean     concave.points_mean           symmetry_mean 
##            7.971981e-02            3.880284e-02            2.741428e-02 
##  fractal_dimension_mean               radius_se              texture_se 
##            7.060363e-03            2.773127e-01            5.516484e-01 
##            perimeter_se                 area_se           smoothness_se 
##            2.021855e+00            4.549101e+01            3.002518e-03 
##          compactness_se            concavity_se       concave.points_se 
##            1.790818e-02            3.018606e-02            6.170285e-03 
##             symmetry_se    fractal_dimension_se            radius_worst 
##            8.266372e-03            2.646071e-03            4.833242e+00 
##           texture_worst         perimeter_worst              area_worst 
##            6.146258e+00            3.360254e+01            5.693570e+02 
##        smoothness_worst       compactness_worst         concavity_worst 
##            2.283243e-02            1.573365e-01            2.086243e-01 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##            6.573234e-02            6.186747e-02            1.806127e-02

Execute PCA on the wisc.data, scaling if appropriate, and assign the model to wisc.pr.

# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale = TRUE)

Inspect a summary of the results with the summary() function.

# Look at summary of results
summary(wisc.pr)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
## Cumulative Proportion  0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
##                            PC8    PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
## Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
## Cumulative Proportion  0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
##                           PC15    PC16    PC17    PC18    PC19    PC20   PC21
## Standard deviation     0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
## Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
## Cumulative Proportion  0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
##                           PC22    PC23   PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion  0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
##                           PC29    PC30
## Standard deviation     0.02736 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion  1.00000 1.00000

Interpreting PCA results

Now you’ll use some visualizations to better understand your PCA model. You were introduced to one of these visualizations, the biplot, in an earlier chapter.

You’ll run into some common challenges with using biplots on real-world data containing a non-trivial number of observations and variables, then you’ll look at some alternative visualizations. You are encouraged to experiment with additional visualizations before moving on to the next exercise.

The variables you created before, wisc.data, diagnosis, and wisc.pr, are still available.

Create a biplot of the wisc.pr data. What stands out to you about this plot? Is it easy or difficult to understand? Why?

# Create a biplot of wisc.pr
biplot(wisc.pr)

names(wisc.pr)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

Execute the code to scatter plot each observation by principal components 1 and 2, coloring the points by the diagnosis.

# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC2")

Repeat the same for principal components 1 and 3. What do you notice about these plots?

# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC3")

# Do additional data exploration of your choosing below (optional)

plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1), 
     xlab = "PC2", ylab = "PC3")

Variance explained

In this exercise, you will produce scree plots showing the proportion of variance explained as the number of principal components increases. The data from PCA must be prepared for these plots, as there is not a built-in function in R to create them directly from the PCA model.

As you look at these plots, ask yourself if there’s an elbow in the amount of variance explained that might lead you to pick a natural number of principal components. If an obvious elbow does not exist, as is typical in real-world datasets, consider how else you might determine the number of principal components to retain based on the scree plot.

The variables you created before, wisc.data, diagnosis, and wisc.pr, are still available.

Calculate the variance of each principal component by squaring the sdev component of wisc.pr. Save the result as an object called pr.var.

# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))

# Calculate variability of each component
pr.var <- wisc.pr$sdev^2

Calculate the variance explained by each principal component by dividing by the total variance explained of all principal components. Assign this to a variable called pve.

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)

Create a plot of variance explained for each principal component.

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

Using the cumsum() function, create a plot of cumulative proportion of variance explained.

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

PCA review and next steps

Communicating PCA results

This exercise will check your understanding of the PCA results, in particular the loadings and variance explained. The loadings, represented as vectors, explain the mapping from the original features to the principal components. The principal components are naturally ordered from the most variance explained to the least variance explained.

The variables you created before—wisc.data, diagnosis, wisc.pr, and pve—are still available.

For the first principal component, what is the component of the loading vector for the feature concave.points_mean? What is the minimum number of principal components required to explain 80% of the variance of the data?

wisc.pr$rotation[, 1:3]
##                                 PC1          PC2          PC3
## radius_mean             -0.21890244  0.233857132 -0.008531243
## texture_mean            -0.10372458  0.059706088  0.064549903
## perimeter_mean          -0.22753729  0.215181361 -0.009314220
## area_mean               -0.22099499  0.231076711  0.028699526
## smoothness_mean         -0.14258969 -0.186113023 -0.104291904
## compactness_mean        -0.23928535 -0.151891610 -0.074091571
## concavity_mean          -0.25840048 -0.060165363  0.002733838
## concave.points_mean     -0.26085376  0.034767500 -0.025563541
## symmetry_mean           -0.13816696 -0.190348770 -0.040239936
## fractal_dimension_mean  -0.06436335 -0.366575471 -0.022574090
## radius_se               -0.20597878  0.105552152  0.268481387
## texture_se              -0.01742803 -0.089979682  0.374633665
## perimeter_se            -0.21132592  0.089457234  0.266645367
## area_se                 -0.20286964  0.152292628  0.216006528
## smoothness_se           -0.01453145 -0.204430453  0.308838979
## compactness_se          -0.17039345 -0.232715896  0.154779718
## concavity_se            -0.15358979 -0.197207283  0.176463743
## concave.points_se       -0.18341740 -0.130321560  0.224657567
## symmetry_se             -0.04249842 -0.183848000  0.288584292
## fractal_dimension_se    -0.10256832 -0.280092027  0.211503764
## radius_worst            -0.22799663  0.219866379 -0.047506990
## texture_worst           -0.10446933  0.045467298 -0.042297823
## perimeter_worst         -0.23663968  0.199878428 -0.048546508
## area_worst              -0.22487053  0.219351858 -0.011902318
## smoothness_worst        -0.12795256 -0.172304352 -0.259797613
## compactness_worst       -0.21009588 -0.143593173 -0.236075625
## concavity_worst         -0.22876753 -0.097964114 -0.173057335
## concave.points_worst    -0.25088597  0.008257235 -0.170344076
## symmetry_worst          -0.12290456 -0.141883349 -0.271312642
## fractal_dimension_worst -0.13178394 -0.275339469 -0.232791313
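
Both questions can also be answered directly from the objects created earlier (pve was computed from wisc.pr above):

# Loading of concave.points_mean on the first principal component
wisc.pr$rotation["concave.points_mean", 1]

# Minimum number of components explaining at least 80% of the variance
which(cumsum(pve) >= 0.80)[1]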

Hierarchical clustering of case data

The goal of this exercise is to do hierarchical clustering of the observations. Recall from Chapter 2 that this type of clustering does not assume in advance the number of natural groups that exist in the data.

As part of the preparation for hierarchical clustering, the distances between all pairs of observations are computed. Furthermore, there are different ways to link clusters together, with single, complete, and average being the most common linkage methods.

The variables you created before, wisc.data, diagnosis, wisc.pr, and pve, are available in your workspace.

Scale the wisc.data data and assign the result to data.scaled.

# Scale the wisc.data data: data.scaled
head(wisc.data)
##          radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302         17.99        10.38         122.80    1001.0         0.11840
## 842517         20.57        17.77         132.90    1326.0         0.08474
## 84300903       19.69        21.25         130.00    1203.0         0.10960
## 84348301       11.42        20.38          77.58     386.1         0.14250
## 84358402       20.29        14.34         135.10    1297.0         0.10030
## 843786         12.45        15.70          82.57     477.1         0.12780
##          compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302            0.27760         0.3001             0.14710        0.2419
## 842517            0.07864         0.0869             0.07017        0.1812
## 84300903          0.15990         0.1974             0.12790        0.2069
## 84348301          0.28390         0.2414             0.10520        0.2597
## 84358402          0.13280         0.1980             0.10430        0.1809
## 843786            0.17000         0.1578             0.08089        0.2087
##          fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302                  0.07871    1.0950     0.9053        8.589  153.40
## 842517                  0.05667    0.5435     0.7339        3.398   74.08
## 84300903                0.05999    0.7456     0.7869        4.585   94.03
## 84348301                0.09744    0.4956     1.1560        3.445   27.23
## 84358402                0.05883    0.7572     0.7813        5.438   94.44
## 843786                  0.07613    0.3345     0.8902        2.217   27.19
##          smoothness_se compactness_se concavity_se concave.points_se
## 842302        0.006399        0.04904      0.05373           0.01587
## 842517        0.005225        0.01308      0.01860           0.01340
## 84300903      0.006150        0.04006      0.03832           0.02058
## 84348301      0.009110        0.07458      0.05661           0.01867
## 84358402      0.011490        0.02461      0.05688           0.01885
## 843786        0.007510        0.03345      0.03672           0.01137
##          symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302       0.03003             0.006193        25.38         17.33
## 842517       0.01389             0.003532        24.99         23.41
## 84300903     0.02250             0.004571        23.57         25.53
## 84348301     0.05963             0.009208        14.91         26.50
## 84358402     0.01756             0.005115        22.54         16.67
## 843786       0.02165             0.005082        15.47         23.75
##          perimeter_worst area_worst smoothness_worst compactness_worst
## 842302            184.60     2019.0           0.1622            0.6656
## 842517            158.80     1956.0           0.1238            0.1866
## 84300903          152.50     1709.0           0.1444            0.4245
## 84348301           98.87      567.7           0.2098            0.8663
## 84358402          152.20     1575.0           0.1374            0.2050
## 843786            103.40      741.6           0.1791            0.5249
##          concavity_worst concave.points_worst symmetry_worst
## 842302            0.7119               0.2654         0.4601
## 842517            0.2416               0.1860         0.2750
## 84300903          0.4504               0.2430         0.3613
## 84348301          0.6869               0.2575         0.6638
## 84358402          0.4000               0.1625         0.2364
## 843786            0.5355               0.1741         0.3985
##          fractal_dimension_worst
## 842302                   0.11890
## 842517                   0.08902
## 84300903                 0.08758
## 84348301                 0.17300
## 84358402                 0.07678
## 843786                   0.12440
data.scaled <- scale(wisc.data)

Calculate the (Euclidean) distances between all pairs of observations in the new scaled dataset and assign the result to data.dist.

# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)

Create a hierarchical clustering model using complete linkage. Manually specify the method argument to hclust() and assign the results to wisc.hclust.

# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method  = "complete")

Results of hierarchical clustering

Let’s use the hierarchical clustering model you just created to determine a height (or distance between clusters) where a certain number of clusters exists. The variables you created before—wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust—are all available in your workspace.

Using the plot() function, what is the height at which the clustering model has 4 clusters?

plot(wisc.hclust)
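
A horizontal cut line can make the height easier to read off the dendrogram. The value h = 20 below is only an assumption for illustration; pick the height at which the tree splits into 4 branches on your plot:

abline(h = 20, col = "red")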

Selecting number of clusters

In this exercise, you will compare the outputs from your hierarchical clustering model to the actual diagnoses. Normally when performing unsupervised learning like this, a target variable isn’t available. We do have it with this dataset, however, so it can be used to check the performance of the clustering model.

When performing supervised learning—that is, when you’re trying to predict some target variable of interest and that target variable is available in the original data—using clustering to create new features may or may not improve the performance of the final model. This exercise will help you determine if, in this case, hierarchical clustering provides a promising new feature.

wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust are available in your workspace.

Use cutree() to cut the tree so that it has 4 clusters. Assign the output to the variable wisc.hclust.clusters.

# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)

Use the table() function to compare the cluster membership to the actual diagnoses.

# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
##                     diagnosis
## wisc.hclust.clusters   0   1
##                    1  12 165
##                    2   2   5
##                    3 343  40
##                    4   0   2

k-means clustering and comparing results

As you now know, there are two main types of clustering: hierarchical and k-means.

In this exercise, you will create a k-means clustering model on the Wisconsin breast cancer data and compare the results to the actual diagnoses and the results of your hierarchical clustering model. Take some time to see how each clustering model performs in terms of separating the two diagnoses and how the clustering models compare to each other.

wisc.data, diagnosis, and wisc.hclust.clusters are still available.

Create a k-means model on wisc.data, assigning the result to wisc.km. Be sure to create 2 clusters, corresponding to the actual number of diagnoses. Also, remember to scale the data and repeat the algorithm 20 times to find a well-performing model.

# Create a k-means model on wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), 2, nstart = 20)

Use the table() function to compare the cluster membership of the k-means model to the actual diagnoses contained in the diagnosis vector. How well does k-means separate the two diagnoses?

# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
##    diagnosis
##       0   1
##   1 343  37
##   2  14 175

Use the table() function to compare the cluster membership of the k-means model to the hierarchical clustering model. Recall the cluster membership of the hierarchical clustering model is contained in wisc.hclust.clusters.

# Compare k-means to hierarchical clustering
table(wisc.hclust.clusters, wisc.km$cluster)
##                     
## wisc.hclust.clusters   1   2
##                    1  17 160
##                    2   0   7
##                    3 363  20
##                    4   0   2

Clustering on PCA results

In this final exercise, you will put together several steps you used earlier and, in doing so, you will experience some of the creativity that is typical in unsupervised learning.

Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing data and potentially avoiding overfitting, PCA also uncorrelates the variables, sometimes improving the performance of other modeling techniques.
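
The claim that PCA uncorrelates the variables is easy to verify on the model you already fit; a quick sketch:

# Off-diagonal correlations between principal components are essentially zero
round(cor(wisc.pr$x[, 1:5]), 3)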

Let’s see if PCA improves or degrades the performance of hierarchical clustering.

wisc.pr, diagnosis, wisc.hclust.clusters, and wisc.km are still available in your workspace.

Using the minimum number of principal components required to describe at least 90% of the variability in the data, create a hierarchical clustering model with complete linkage. Assign the results to wisc.pr.hclust.

# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")

Cut this hierarchical clustering model into 4 clusters and assign the results to wisc.pr.hclust.clusters.

# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)

Using table(), compare the results from your new hierarchical clustering model with the actual diagnoses. How well does the newly created model with four clusters separate out the two diagnoses?

# Compare to actual diagnoses
table(diagnosis, wisc.pr.hclust.clusters)
##          wisc.pr.hclust.clusters
## diagnosis   1   2   3   4
##         0   5 350   2   0
##         1 113  97   0   2

How well do the k-means and hierarchical clustering models you created in previous exercises do in terms of separating the diagnoses? Again, use the table() function to compare the output of each model with the vector containing the actual diagnoses.

# Compare to k-means and hierarchical
table(diagnosis, wisc.hclust.clusters)
##          wisc.hclust.clusters
## diagnosis   1   2   3   4
##         0  12   2 343   0
##         1 165   5  40   2
table(diagnosis, wisc.km$cluster)
##          
## diagnosis   1   2
##         0 343  14
##         1  37 175

Machine learning with Caret in R

Regression models: fitting them and evaluating their performance

Toolbox

In-sample RMSE for linear regression on diamonds

As you saw in the video, included in the course is the diamonds dataset, which is a classic dataset from the ggplot2 package. The dataset contains physical attributes of diamonds as well as the price they sold for. One interesting modeling challenge is predicting diamond price from these attributes using something like a linear regression.

Recall that to fit a linear regression, you use the lm() function in the following format:

mod <- lm(y ~ x, my_data)

To make predictions using mod on the original data, you call the predict() function:

pred <- predict(mod, my_data)

Fit a linear model on the diamonds dataset predicting price using all other variables as predictors (i.e. price ~ .). Save the result to model. Make predictions using model on the full original dataset and save the result to p. Compute the errors as the predicted values minus the actual prices (error = p - price). Save the result to error. Compute RMSE using the formula you learned in the video and print it to the console.

# Fit lm model: model
model <- lm(price ~ ., diamonds)

# Predict on full data: p
p <- predict(model, diamonds)

# Compute errors: error

error <- p - diamonds[["price"]]

# Calculate RMSE
sqrt(mean(error ^ 2))
## [1] 1129.843

Out-of-sample error measures

Randomly order the data frame

One way you can take a train/test split of a dataset is to order the dataset randomly, then divide it into the two sets. This ensures that the training set and test set are both random samples and that any biases in the ordering of the dataset (e.g. if it had originally been ordered by price or size) are not retained in the samples you take for training and testing your models. You can think of this like shuffling a brand new deck of playing cards before dealing hands.

First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:

set.seed(42)

Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset:

rows <- sample(nrow(diamonds))

Finally, you can use this random vector to reorder the diamonds dataset:

diamonds <- diamonds[rows, ]

Set the random seed to 42. Make a vector of row indices called rows. Randomly reorder the diamonds data frame, assigning to shuffled_diamonds.

# Set seed
set.seed(42)

# Shuffle row indices: rows
rows <- sample(nrow(diamonds))

# Randomly order data
shuffled_diamonds <- diamonds[rows, ]

Try an 80/20 split

Now that your dataset is randomly ordered, you can split the first 80% of it into a training set, and the last 20% into a test set. You can do this by choosing a split point approximately 80% of the way through your data:

split <- round(nrow(mydata) * 0.80)

You can then use this point to break off the first 80% of the dataset as a training set:

mydata[1:split, ]

And then you can use that same point to determine the test set:

mydata[(split + 1):nrow(mydata), ]

Choose a row index to split on so that the split point is approximately 80% of the way through the diamonds dataset. Call this index split. Create a training set called train using that index. Create a test set called test using that index.

# Determine row to split on: split
split <- round(nrow(diamonds) * 0.80)

# Create train
train <- diamonds[1:split, ]


# Create test

test <- diamonds[(split + 1):nrow(diamonds), ]

Predict on test set

Now that you have a randomly split training set and test set, you can use the lm() function as you did in the first exercise to fit a model to your training set, rather than the entire dataset. Recall that you can use the formula interface to the linear regression function to fit a model with a specified target variable using all other variables in the dataset as predictors:

mod <- lm(y ~ ., training_data)

You can use the predict() function to make predictions from that model on new data. The new dataset must have all of the columns from the training data, but they can be in a different order with different values. Here, rather than re-predicting on the training set, you can predict on the test set, which you did not use for training the model. This will allow you to determine the out-of-sample error for the model in the next exercise:

p <- predict(model, new_data)

Fit an lm() model called model to predict price using all other variables as covariates. Be sure to use the training set, train. Predict on the test set, test, using predict(). Store these values in a vector called p.

# Fit lm model on train: model
model <- lm(price ~., train)

# Predict on test: p

p <- predict(model, test)

Calculate test set RMSE by hand

Now that you have predictions on the test set, you can use these predictions to calculate an error metric (in this case RMSE) on the test set and see how the model performs out-of-sample, rather than in-sample as you did in the first exercise. You first do this by calculating the errors between the predicted diamond prices and the actual diamond prices; the code below takes predictions minus actual values, and since the errors are squared, the direction of the subtraction does not affect RMSE.

Once you have an error vector, calculating RMSE is as simple as squaring it, taking the mean, then taking the square root:

sqrt(mean(error^2))

test, model, and p are loaded in your workspace.

Calculate the error between the predictions on the test set and the actual diamond prices in the test set. Call this error. Calculate RMSE using this error vector, just printing the result to the console.

# Compute errors: error

error <- p - test[["price"]]

# Calculate RMSE
sqrt(mean(error^2))
## [1] 796.8922

Cross-validation

10-fold cross-validation

As you saw in the video, a better approach to validating models is to use multiple systematic test sets, rather than a single random train/test split. Fortunately, the caret package makes this very easy to do:

model <- train(y ~ ., my_data)

caret supports many types of cross-validation, and you can specify which type of cross-validation and the number of cross-validation folds with the trainControl() function, which you pass to the trControl argument in train():

model <- train(
  y ~ .,
  my_data,
  method = "lm",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  )
)

It's important to note that you pass the method for modeling to the main train() function and the method for cross-validation to the trainControl() function.

Fit a linear regression to model price using all other variables in the diamonds dataset as predictors. Use the train() function and 10-fold cross-validation. (The original exercise uses a subset of diamonds to speed this up; here the full dataset of 53,940 rows is used, as the output below shows.) Print the model to the console and examine the results.

# Fit lm model using 10-fold CV: model
model <- train(
  price ~ ., 
  diamonds,
  method = "lm",
  trControl = trainControl(
    method = "cv", 
    number = 10,
    verboseIter = TRUE
  )
)
## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set
# Print model to console
print(model)
## Linear Regression 
## 
## 53940 samples
##     9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 48547, 48546, 48546, 48547, 48545, 48547, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1131.015  0.9196398  740.6117
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
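
If you want to see how much the error varies from fold to fold, the fitted train object keeps the resample-level metrics. A quick sketch, using the model object just created:

# Per-fold RMSE, R-squared and MAE for the 10 resamples
model$resample

# Spread of the fold-level RMSE values
summary(model$resample$RMSE)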

5-fold cross-validation

In this course, you will use a wide variety of datasets to explore the full flexibility of the caret package. Here, you will use the famous Boston housing dataset, where the goal is to predict median home values in various Boston suburbs.

You can use exactly the same code as in the previous exercise, but change the dataset used by the model:

model <- train(
  medv ~ .,
  Boston, # <- new!
  method = "lm",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  )
)

Next, you can reduce the number of cross-validation folds from 10 to 5 using the number argument to trainControl():

trControl = trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE
)

Fit an lm() model to the Boston housing dataset, such that medv is the response variable and all other variables are explanatory variables. Use 5-fold cross-validation rather than 10-fold cross-validation. Print the model to the console and inspect the results. (The code below goes one step further and repeats the 5-fold cross-validation five times, i.e. method = "repeatedcv" with repeats = 5.)

Boston <- read.delim("DATABASE/Boston.txt", sep = ",")
head(Boston)
##   X    crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# Fit lm model using 5 x 5-fold CV: model
model <- train(
  medv ~ ., 
  Boston,
  method = "lm",
  trControl = trainControl(
    method = "repeatedcv", 
    number = 5,
    repeats = 5, 
    verboseIter = TRUE
  )
)
## + Fold1.Rep1: intercept=TRUE 
## - Fold1.Rep1: intercept=TRUE 
## + Fold2.Rep1: intercept=TRUE 
## - Fold2.Rep1: intercept=TRUE 
## + Fold3.Rep1: intercept=TRUE 
## - Fold3.Rep1: intercept=TRUE 
## + Fold4.Rep1: intercept=TRUE 
## - Fold4.Rep1: intercept=TRUE 
## + Fold5.Rep1: intercept=TRUE 
## - Fold5.Rep1: intercept=TRUE 
## + Fold1.Rep2: intercept=TRUE 
## - Fold1.Rep2: intercept=TRUE 
## + Fold2.Rep2: intercept=TRUE 
## - Fold2.Rep2: intercept=TRUE 
## + Fold3.Rep2: intercept=TRUE 
## - Fold3.Rep2: intercept=TRUE 
## + Fold4.Rep2: intercept=TRUE 
## - Fold4.Rep2: intercept=TRUE 
## + Fold5.Rep2: intercept=TRUE 
## - Fold5.Rep2: intercept=TRUE 
## + Fold1.Rep3: intercept=TRUE 
## - Fold1.Rep3: intercept=TRUE 
## + Fold2.Rep3: intercept=TRUE 
## - Fold2.Rep3: intercept=TRUE 
## + Fold3.Rep3: intercept=TRUE 
## - Fold3.Rep3: intercept=TRUE 
## + Fold4.Rep3: intercept=TRUE 
## - Fold4.Rep3: intercept=TRUE 
## + Fold5.Rep3: intercept=TRUE 
## - Fold5.Rep3: intercept=TRUE 
## + Fold1.Rep4: intercept=TRUE 
## - Fold1.Rep4: intercept=TRUE 
## + Fold2.Rep4: intercept=TRUE 
## - Fold2.Rep4: intercept=TRUE 
## + Fold3.Rep4: intercept=TRUE 
## - Fold3.Rep4: intercept=TRUE 
## + Fold4.Rep4: intercept=TRUE 
## - Fold4.Rep4: intercept=TRUE 
## + Fold5.Rep4: intercept=TRUE 
## - Fold5.Rep4: intercept=TRUE 
## + Fold1.Rep5: intercept=TRUE 
## - Fold1.Rep5: intercept=TRUE 
## + Fold2.Rep5: intercept=TRUE 
## - Fold2.Rep5: intercept=TRUE 
## + Fold3.Rep5: intercept=TRUE 
## - Fold3.Rep5: intercept=TRUE 
## + Fold4.Rep5: intercept=TRUE 
## - Fold4.Rep5: intercept=TRUE 
## + Fold5.Rep5: intercept=TRUE 
## - Fold5.Rep5: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set
# Print model to console
print(model)
## Linear Regression 
## 
## 506 samples
##  14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 405, 405, 406, 403, 405, 405, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.832232  0.7267225  3.387277
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Making predictions on new data

Finally, the model you fit with the train() function has the exact same predict() interface as the linear regression models you fit earlier in this chapter.

After fitting a model with train(), you can simply call predict() with new data, e.g:

predict(my_model, new_data)

Use the predict() function to make predictions with model on the full Boston housing dataset. Print the result to the console.

# Predict on full Boston dataset
predict(model, Boston)
##          1          2          3          4          5          6          7 
## 30.4213971 25.3220127 30.9378788 29.0007664 28.3136725 25.5910930 23.4177645 
##          8          9         10         11         12         13         14 
## 19.8600309 11.8007974 19.2522173 19.3175764 21.9341754 21.3733506 19.9713263 
##         15         16         17         18         19         20         21 
## 19.6447192 19.7247705 21.0287699 17.2668383 16.6419097 18.7941419 12.8145402 
##         22         23         24         25         26         27         28 
## 18.0016671 16.1531737 14.0888828 15.9770628 13.6900098 15.7607755 15.0107143 
##         29         30         31         32         33         34         35 
## 19.8556750 21.2106131 11.7246428 18.3364835  9.1054171 14.5558713 13.9762818 
##         36         37         38         39         40         41         42 
## 24.1647275 22.7010178 23.5128590 23.3489552 31.8552690 34.7419294 28.4264028 
##         43         44         45         46         47         48         49 
## 25.5783150 24.9813730 23.2150411 22.3830353 20.7057509 18.1691248  9.1721403 
##         50         51         52         53         54         55         56 
## 17.3796893 21.5507482 24.2035644 28.0125574 24.3828602 15.8526357 31.6059808 
##         57         58         59         60         61         62         63 
## 25.2343702 33.5301356 22.1939726 21.4598617 18.1786155 18.7542586 24.3144538 
##         64         65         66         67         68         69         70 
## 22.9343416 23.5307430 30.8451833 25.9540369 21.4788612 17.7284963 21.1139438 
##         71         72         73         74         75         76         77 
## 25.5867504 22.0784682 24.9261981 24.4150982 25.9854780 24.3345998 23.2360217 
##         78         79         80         81         82         83         84 
## 23.7188115 21.5838538 22.7960969 28.7330272 27.2165387 26.3468749 25.3238055 
##         85         86         87         88         89         90         91 
## 24.9773238 27.9767435 22.3544738 26.0668128 30.8090297 31.0259444 27.2851150 
##         92         93         94         95         96         97         98 
## 27.5563055 29.1986076 29.3969536 27.2216110 28.7961440 24.8467036 35.9478334 
##         99        100        101        102        103        104        105 
## 35.3745965 32.4257726 24.8577652 25.8945310 20.0234767 20.5399137 21.6613488 
##        106        107        108        109        110        111        112 
## 18.7367697 17.3891258 20.9834562 22.8550425 19.9776419 20.9482741 26.8162563 
##        113        114        115        116        117        118        119 
## 21.0049048 20.9386114 25.4412391 20.6635600 23.6593791 23.9411319 20.5996368 
##        120        121        122        123        124        125        126 
## 21.0696566 21.9403464 22.4593238 20.5154386 16.2997503 20.5059446 22.4511999 
##        127        128        129        130        131        132        133 
## 14.5391764 15.3583330 19.1323297 14.2267327 20.2228498 19.5916877 20.2423441 
##        134        135        136        137        138        139        140 
## 15.9186338 13.3974835 17.4296953 16.0457108 19.5346063 13.9618837 16.6068265 
##        141        142        143        144        145        146        147 
## 13.7315683  4.0835715 14.7484549 12.2815728  8.8457219 12.1721188 15.9539903 
##        148        149        150        151        152        153        154 
##  8.6288047  9.8413333 14.9433808 20.9946136 18.4266891 20.2862241 17.4113351 
##        155        156        157        158        159        160        161 
## 22.5281360 20.2874363 13.7274820 33.3710249 29.1052432 25.7118459 32.8216511 
##        162        163        164        165        166        167        168 
## 36.9118870 40.6994787 42.0107478 24.8458952 25.4352715 37.3219932 23.1629716 
##        169        170        171        172        173        174        175 
## 26.4542271 26.7057604 22.5825696 24.3145292 22.9267424 29.0630777 26.5327726 
##        176        177        178        179        180        181        182 
## 30.8426367 25.6679672 29.1356698 31.4508559 32.8731002 34.6281092 27.6745162 
##        183        184        185        186        187        188        189 
## 33.7497414 30.8216926 22.5175049 24.6304689 35.8497696 33.6123720 32.6106922 
##        190        191        192        193        194        195        196 
## 34.7034414 30.9514253 30.4475975 33.1022602 32.1716589 31.5862502 40.9681136 
##        197        198        199        200        201        202        203 
## 36.1975118 32.7162544 34.7541890 30.3067892 30.8627316 29.3437360 37.2339678 
##        204        205        206        207        208        209        210 
## 42.1386202 43.3036319 22.6900051 23.6057361 17.6953851 23.4018367 16.7770827 
##        211        212        213        214        215        216        217 
## 22.2083492 16.8607583 22.6515912 25.1883983 11.0923572 24.4423556 26.5352080 
##        218        219        220        221        222        223        224 
## 28.2115396 24.7422737 29.5291180 33.1406193 23.6780584 32.1103351 29.6824002 
##        225        226        227        228        229        230        231 
## 38.3706887 39.8118741 37.5482578 32.3455056 35.5819043 31.3122192 24.4037882 
##        232        233        234        235        236        237        238 
## 33.2330858 38.0289521 37.1476624 31.6730250 25.1963690 30.0154411 32.6553381 
##        239        240        241        242        243        244        245 
## 28.4199611 28.3597433 27.1902933 23.5833775 23.9864171 27.3942320 16.1271991 
##        246        247        248        249        250        251        252 
## 13.2041989 19.9381865 19.6661077 21.1802307 24.0658346 24.2035245 25.0485418 
##        253        254        255        256        257        258        259 
## 24.9180668 29.9758080 23.8137167 21.5581326 37.4365474 43.1834117 36.2816017 
##        260        261        262        263        264        265        266 
## 34.7705344 34.6432081 36.9868293 40.8267578 34.2381243 35.6343948 28.0667667 
##        267        268        269        270        271        272        273 
## 31.0213983 40.7049947 39.1928354 25.4288898 22.0433608 27.0300453 28.2404088 
##        274        275        276        277        278        279        280 
## 35.2668753 35.9918616 33.6352794 35.4473079 34.7187802 30.1964344 35.1119646 
##        281        282        283        284        285        286        287 
## 38.5325430 34.0997634 40.1132750 44.5162382 31.4439635 27.1054228 19.8118385 
##        288        289        290        291        292        293        294 
## 26.8747774 27.0086626 26.7696527 33.3184314 34.2923600 31.7142462 25.5707274 
##        295        296        297        298        299        300        301 
## 24.1112609 28.1833931 27.0280339 19.1431196 28.9936512 31.8360616 30.5901850 
##        302        303        304        305        306        307        308 
## 28.7753710 28.7676762 32.7039838 33.0225376 30.5250497 35.3122299 32.4409582 
##        309        310        311        312        313        314        315 
## 28.2970206 23.2336540 18.2599021 26.5987027 22.8855503 25.1723172 25.0942280 
##        316        317        318        319        320        321        322 
## 20.1404231 17.1951987 17.9822942 23.9407224 20.9752702 24.5613208 24.5350755 
##        323        324        325        326        327        328        329 
## 22.5299248 19.0291790 24.8122027 24.4161265 23.3841792 18.9867119 20.9341528 
##        330        331        332        333        334        335        336 
## 24.0496354 21.3277529 19.5971179 22.9771045 21.7470455 21.1562541 20.2267958 
##        337        338        339        340        341        342        343 
## 19.7413676 18.8205653 21.7805566 20.8357482 20.9786329 29.8906293 21.6326994 
##        344        345        346        347        348        349        350 
## 27.4394616 28.3483229 16.0842434 14.3028812 25.0223965 27.2282308 21.7519441 
##        351        352        353        354        355        356        357 
## 20.0214208 20.2298277 16.5849496 24.9575189 14.0035094 16.2885138 19.8312318 
##        358        359        360        361        362        363        364 
## 22.9357796 22.4273631 19.4172888 22.8720711 19.1171207 18.3821986 20.4286409 
##        365        366        367        368        369        370        371 
## 37.7947890 14.3965585 15.6735748 10.8748779 23.9089225 32.8252766 34.7973862 
##        372        373        374        375        376        377        378 
## 25.0707950 26.1704791  6.1825229  0.7994699 25.4566451 17.8674501 20.3550913 
##        379        380        381        382        383        384        385 
## 15.9334458 16.9250487 14.4532418 18.5737954 13.4970016 13.1284655  3.3116058 
##        386        387        388        389        390        391        392 
##  8.1060891  6.1455984  5.6662902  6.4717757 14.2599673 17.2791215 17.4051843 
##        393        394        395        396        397        398        399 
##  9.9203114 20.3086380 18.0071116 20.3767602 19.3708126 16.3805388  6.5462544 
##        400        401        402        403        404        405        406 
## 10.9717939 11.9003936 17.8561787 18.3063776 12.9907747  7.3954034  8.1841032 
##        407        408        409        410        411        412        413 
##  8.0223287 19.9985952 13.6918056 19.8630157 15.1779485 16.9509242  1.6235558 
##        414        415        416        417        418        419        420 
## 11.7443051 -4.3860489  9.5557801 13.3806201  6.8634382  6.0686868 14.6659520 
##        421        422        423        424        425        426        427 
## 19.6054378 18.1294931 18.5066776 13.1540087 14.6247126  9.8729101 16.3776642 
##        428        429        430        431        432        433        434 
## 14.0712442 14.2673860 13.0017716 18.1284693 18.6497714 21.5306971 17.0199231 
##        435        436        437        438        439        440        441 
## 15.9184301 13.3285688 14.4873630  8.7453146  4.8106754 13.0156277 12.6533148 
##        442        443        444        445        446        447        448 
## 17.2540829 18.6923358 18.0097591 11.4391117 11.9091549 17.6313826 18.0693223 
##        449        450        451        452        453        454        455 
## 17.4419398 17.1532195 16.4578791 19.3439991 18.5190681 22.4327564 15.1938128 
##        456        457        458        459        460        461        462 
## 15.7536724 12.5921290 12.7939656 17.1169188 18.4415471 18.9734228 20.0990734 
##        463        464        465        466        467        468        469 
## 19.7069995 22.3519805 20.2732171 17.8673461 14.2477401 16.7912868 16.8750355 
##        470        471        472        473        474        475        476 
## 18.5176506 20.0360997 22.8208224 22.3568697 25.5284901 16.2027533 15.9310144 
##        477        478        479        480        481        482        483 
## 20.3893507 11.3353814 19.0360604 21.7234998 23.3544766 26.9759487 28.4478220 
##        484        485        486        487        488        489        490 
## 21.0027705 19.3836943 22.1339240 19.4861514 21.2293617 11.3131631  7.6531788 
##        491        492        493        494        495        496        497 
##  3.0716459 13.2072754 15.4198389 19.9828024 20.0002371 16.2862261 13.2822879 
##        498        499        500        501        502        503        504 
## 18.4013892 20.6192034 17.7358813 19.7435816 22.6801964 21.4906326 26.7343473 
##        505        506 
## 25.2261743 21.4359702
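
These predictions can be compared against the observed medv values to get an overall error. Note that this is essentially an in-sample check, because the final model returned by train() is refit on the full Boston data. A minimal sketch:

# RMSE of the predictions against the observed median home values
p_boston <- predict(model, Boston)
sqrt(mean((p_boston - Boston[["medv"]])^2))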

Classification models: fitting them and evaluating their performance

Logistic regression on sonar

Try a 60/40 split

As you saw in the video, you’ll be working with the Sonar dataset in this chapter, using a 60% training set and a 40% test set. We’ll practice making a train/test split one more time, just to be sure you have the hang of it. Recall that you can use the sample() function to get a random permutation of the row indices in a dataset, to use when making train/test splits, e.g.:

n_obs <- nrow(my_data)
permuted_rows <- sample(n_obs)

And then use those row indices to randomly reorder the dataset, e.g.:

my_data <- my_data[permuted_rows, ]

Once your dataset is randomly ordered, you can split off the first 60% as a training set and the last 40% as a test set.

Get the number of observations (rows) in Sonar, assigning to n_obs. Shuffle the row indices of Sonar and store the result in permuted_rows. Use permuted_rows to randomly reorder the rows of Sonar, saving as Sonar_shuffled. Identify the proper row to split on for a 60/40 split. Store this row number as split. Save the first 60% of Sonar_shuffled as a training set. Save the last 40% of Sonar_shuffled as the test set.

Sonar <- read.delim("DATABASE/Sonar.txt", sep = ",")
head(Sonar)
##   X     V1     V2     V3     V4     V5     V6     V7     V8     V9    V10
## 1 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
##      V11    V12    V13    V14    V15    V16    V17    V18    V19    V20    V21
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074
##      V22    V23    V24    V25    V26    V27    V28    V29    V30    V31    V32
## 1 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604
## 2 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947
## 3 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045
## 4 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340
## 5 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459
## 6 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778
##      V33    V34    V35    V36    V37    V38    V39    V40    V41    V42    V43
## 1 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825
## 2 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401
## 3 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129
## 4 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210
## 5 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847
## 6 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723
##      V44    V45    V46    V47    V48    V49    V50    V51    V52    V53    V54
## 1 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159
## 2 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048
## 3 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095
## 4 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150
## 5 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105
## 6 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038
##      V55    V56    V57    V58    V59    V60 Class
## 1 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032     R
## 2 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044     R
## 3 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078     R
## 4 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117     R
## 5 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094     R
## 6 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062     R
# Get the number of observations
n_obs <- nrow(Sonar)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data: Sonar
Sonar_shuffled <- Sonar[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.6)

# Create train
train <- Sonar_shuffled[1:split, ]

# Create test
test <- Sonar_shuffled[(split + 1):n_obs, ]

Fit a logistic regression model

Once you have your random training and test sets you can fit a logistic regression model to your training set using the glm() function. glm() is a more advanced version of lm() that allows for more varied types of regression models, aside from plain vanilla ordinary least squares regression.

Be sure to pass the argument family = “binomial” to glm() to specify that you want to do logistic (rather than linear) regression. For example:

glm(Target ~ ., family = "binomial", dataset)

Don't worry about warnings like glm.fit: algorithm did not converge or glm.fit: fitted probabilities numerically 0 or 1 occurred. These are common on smaller datasets and usually don't cause any issues. They typically mean your dataset is perfectly separable, which can cause problems for the math behind the model, but R's glm() function is almost always robust enough to handle this case with no problems.

Once you have a glm() model fit to your dataset, you can predict the outcome (e.g. rock or mine) on the test set using the predict() function with the argument type = “response”:

predict(my_model, test, type = "response")

Fit a logistic regression called model to predict Class using all other variables as predictors. Use the training set for Sonar. Predict on the test set using that model. Call the result p like you’ve done before.

# Fit glm model: model
# (Class should be a factor for glm() and confusionMatrix(); if it was read in as
# character, convert it first, e.g. train$Class <- factor(train$Class), and likewise for test)
model <- glm(Class ~ ., family = "binomial", train)

# Predict on test: p
p <- predict(model, test, type = "response")

Confusion matrix

Calculate a confusion matrix

As you saw in the video, a confusion matrix is a very useful tool for calibrating the output of a model and examining all possible outcomes of your predictions (true positive, true negative, false positive, false negative).

Before you make your confusion matrix, you need to “cut” your predicted probabilities at a given threshold to turn probabilities into a factor of class predictions. Combine ifelse() with factor() as follows:

pos_or_neg <- ifelse(probability_prediction > threshold, positive_class, negative_class)
p_class <- factor(pos_or_neg, levels = levels(test_values))

confusionMatrix() in caret improves on table() from base R by adding lots of useful ancillary statistics in addition to the base rates in the table. You can calculate the confusion matrix (and the associated statistics) using the predicted outcomes as well as the actual outcomes, e.g.:

confusionMatrix(p_class, test_values)

Use ifelse() to create a character vector, m_or_r that is the positive class, “M”, when p is greater than 0.5, and the negative class, “R”, otherwise. Convert m_or_r to be a factor, p_class, with levels the same as those of test[[“Class”]]. Make a confusion matrix with confusionMatrix(), passing p_class and the “Class” column from the test dataset.

# If p exceeds threshold of 0.5, M else R: m_or_r
m_or_r <- ifelse(p > 0.5, "M", "R")

# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))

# Create confusion matrix
#confusionMatrix(p_class, test[["Class"]])
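
Because the confusionMatrix() call above is commented out, here is a small self-contained sketch on made-up labels, just to show the input format caret expects: both arguments should be factors with the same levels, predictions first and actual values second.

library(caret)

# Hypothetical predicted and actual classes
predicted <- factor(c("M", "R", "R", "M", "R", "M"), levels = c("M", "R"))
actual    <- factor(c("M", "R", "M", "M", "R", "R"), levels = c("M", "R"))

confusionMatrix(predicted, actual)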

Class probabilities and predictions

Try another threshold

In the previous exercises, you used a threshold of 0.50 to cut your predicted probabilities to make class predictions (rock vs mine). However, this classification threshold does not always align with the goals for a given modeling problem.

For example, pretend you want to identify the objects you are really certain are mines. In this case, you might want to use a probability threshold of 0.90 to get fewer predicted mines, but with greater confidence in each prediction.

The code pattern for cutting probabilities into predicted classes, then calculating a confusion matrix, was shown in Exercise 7 of this chapter.

Use ifelse() to create a character vector, m_or_r that is the positive class, “M”, when p is greater than 0.9, and the negative class, “R”, otherwise. Convert m_or_r to be a factor, p_class, with levels the same as those of test[[“Class”]]. Make a confusion matrix with confusionMatrix(), passing p_class and the “Class” column from the test dataset.

# If p exceeds threshold of 0.9, M else R: m_or_r
m_or_r <- ifelse(p > 0.9, "M", "R")

# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))

# Create confusion matrix
#confusionMatrix(p_class, test[["Class"]])

From probabilities to confusion matrix

Conversely, say you want to be really certain that your model correctly identifies all the mines as mines. In this case, you might use a prediction threshold of 0.10, instead of 0.90.

The code pattern for cutting probabilities into predicted classes, then calculating a confusion matrix, was shown in Exercise 7 of this chapter.

Use ifelse() to create a character vector, m_or_r that is the positive class, “M”, when p is greater than 0.1, and the negative class, “R”, otherwise. Convert m_or_r to be a factor, p_class, with levels the same as those of test[[“Class”]]. Make a confusion matrix with confusionMatrix(), passing p_class and the “Class” column from the test dataset.

# If p exceeds threshold of 0.1, M else R: m_or_r
m_or_r <- ifelse(p > 0.1, "M", "R")


# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))

# Create confusion matrix
#confusionMatrix(p_class, test[["Class"]])
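
The same pattern can be wrapped in a short loop to see how the predictions shift as the threshold moves. A rough sketch, assuming the probabilities p and the Sonar test set from above:

# Cross-tabulate predictions against the actual classes at several thresholds
for (thr in c(0.1, 0.5, 0.9)) {
  pred <- ifelse(p > thr, "M", "R")
  cat("Threshold:", thr, "\n")
  print(table(predicted = pred, actual = test[["Class"]]))
}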

Introducing the ROC curve

Plot an ROC curve

As you saw in the video, an ROC curve is a really useful shortcut for summarizing the performance of a classifier over all possible thresholds. This saves you a lot of tedious work computing class predictions for many different thresholds and examining the confusion matrix for each.

My favorite package for computing ROC curves is caTools, which contains a function called colAUC(). This function is very user-friendly and can actually calculate ROC curves for multiple predictors at once. In this case, you only need to calculate the ROC curve for one predictor, e.g.:

colAUC(predicted_probabilities, actual, plotROC = TRUE)

The function will return a score called AUC (more on that later) and the plotROC = TRUE argument will return the plot of the ROC curve for visual inspection.

model, test, and train from the last exercise using the sonar data are loaded in your workspace.

Predict probabilities (i.e. type = “response”) on the test set, then store the result as p. Make an ROC curve using the predicted test set probabilities.

# Predict on test: p
#p <- predict(model, test, type = "response")

# Make ROC curve
#colAUC(p, test[["Class"]], plotROC = TRUE)
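
If you just want to see what colAUC() returns without rerunning the Sonar model, here is a tiny self-contained sketch on simulated probabilities and labels (the numbers are made up purely for illustration):

library(caTools)

# Simulated predicted probabilities and noisy true labels
set.seed(1)
probs  <- runif(200)
labels <- factor(ifelse(probs + rnorm(200, sd = 0.3) > 0.5, "M", "R"))

# Returns the AUC and, with plotROC = TRUE, draws the ROC curve
colAUC(probs, labels, plotROC = TRUE)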

Area under the curve (AUC)

Customizing trainControl

As you saw in the video, area under the ROC curve is a very useful, single-number summary of a model’s ability to discriminate the positive from the negative class (e.g. mines from rocks). An AUC of 0.5 is no better than random guessing, an AUC of 1.0 is a perfectly predictive model, and an AUC of 0.0 is perfectly anti-predictive (which rarely happens).

This is often a much more useful metric than simply ranking models by their accuracy at a set threshold, as different models might require different calibration steps (looking at a confusion matrix at each step) to find the optimal classification threshold for that model.

You can use the trainControl() function in caret to use AUC (instead of accuracy) to tune the parameters of your models. The twoClassSummary() convenience function allows you to do this easily.

When using twoClassSummary(), be sure to always include the argument classProbs = TRUE or your model will throw an error! (You cannot calculate AUC with just class predictions. You need to have class probabilities as well.)

Customize the trainControl object to use twoClassSummary rather than defaultSummary. Use 10-fold cross-validation. Be sure to tell trainControl() to return class probabilities.

# Create trainControl object: myControl
myControl <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE, # IMPORTANT!
  verboseIter = TRUE
)

Using custom trainControl

Now that you have a custom trainControl object, it’s easy to fit caret models that use AUC rather than accuracy to tune and evaluate the model. You can just pass your custom trainControl object to the train() function via the trControl argument, e.g.:

train(y ~ ., my_data, trControl = myControl)

This syntax gives you a convenient way to store a lot of custom modeling parameters and then use them across multiple different calls to train(). You will make extensive use of this trick in Chapter 5.

Use train() to predict Class from all other variables in the Sonar data (that is, Class ~ .). It should be a glm model (that is, set method to "glm") using your custom trainControl object, myControl. Save the result to model. Print the model to the console and examine its output.

# Train glm with custom trainControl: model
model <- train(
  Class ~ ., 
  Sonar, 
  method = "glm",
  trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold01: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold01: parameter=none 
## + Fold02: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold02: parameter=none 
## + Fold03: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold03: parameter=none 
## + Fold04: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold04: parameter=none 
## + Fold05: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold05: parameter=none 
## + Fold06: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold06: parameter=none 
## + Fold07: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold07: parameter=none 
## + Fold08: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold08: parameter=none 
## + Fold09: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold09: parameter=none 
## + Fold10: parameter=none
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold10: parameter=none 
## Aggregating results
## Fitting final model on full training set
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Print model to console
model
## Generalized Linear Model 
## 
## 208 samples
##  61 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 188, 188, 187, 188, 187, 187, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.9926263  0.9636364  0.9688889

The ROC is almost 1, which looks very good. Treat it with some caution, though: the Sonar data frame still contains the X row-index column (note the 61 predictors above, versus 60 sonar readings), and because the file appears to be ordered by class, that index alone can separate mines from rocks almost perfectly, which likely inflates the AUC.

Tuning model parameters to improve performance

Random forests and wine

Fit a random forest

As you saw in the video, random forest models are much more flexible than linear models, and can model complicated nonlinear effects as well as automatically capture interactions between variables. They tend to give very good results on real world data, so let’s try one out on the wine quality dataset, where the goal is to predict the human-evaluated quality of a batch of wine, given some of the machine-measured chemical and physical properties of that batch.

Fitting a random forest model is exactly the same as fitting a generalized linear regression model, as you did in the previous chapter. You simply change the method argument in the train function to be “ranger”. The ranger package is a rewrite of R’s classic randomForest package and fits models much faster, but gives almost exactly the same results. We suggest that all beginners use the ranger package for random forest modeling.

Train a random forest called model on the wine quality dataset, wine, such that quality is the response variable and all other variables are explanatory variables. Use method = “ranger”. Use a tuneLength of 1. Use 5 CV folds. Print model to the console.

wine <- read.delim("DATABASE/wine.txt", sep = ",")
head(wine)
##      X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 5944           6.7             0.27        0.69            1.2     0.176
## 2 6088           6.7             0.48        0.49            2.9     0.030
## 3 1859           5.8             0.36        0.38            0.9     0.037
## 4 5393           6.3             0.32        0.26           12.0     0.049
## 5 4167           6.6             0.24        0.28            1.8     0.028
## 6 3370           7.8             0.39        0.26            9.9     0.059
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  36                  106 0.99288 2.96      0.43     9.2
## 2                  28                  122 0.98926 3.13      0.40    13.0
## 3                   3                   75 0.99040 3.28      0.34    11.4
## 4                  63                  170 0.99610 3.14      0.55     9.9
## 5                  39                  132 0.99182 3.34      0.46    11.4
## 6                  33                  181 0.99550 3.04      0.42    10.9
##   quality color
## 1       6 white
## 2       6 white
## 3       4 white
## 4       6 white
## 5       5 white
## 6       6 white
# Fit random forest: model
model <- train(
  quality ~ .,
  tuneLength = 1,
  data = wine, 
  method = "ranger",
  trControl = trainControl(
    method = "cv", 
    number = 5, 
    verboseIter = TRUE
  )
)
## + Fold1: mtry=3, min.node.size=5, splitrule=variance 
## - Fold1: mtry=3, min.node.size=5, splitrule=variance 
## + Fold1: mtry=3, min.node.size=5, splitrule=extratrees 
## - Fold1: mtry=3, min.node.size=5, splitrule=extratrees 
## + Fold2: mtry=3, min.node.size=5, splitrule=variance 
## - Fold2: mtry=3, min.node.size=5, splitrule=variance 
## + Fold2: mtry=3, min.node.size=5, splitrule=extratrees 
## - Fold2: mtry=3, min.node.size=5, splitrule=extratrees 
## + Fold3: mtry=3, min.node.size=5, splitrule=variance 
## - Fold3: mtry=3, min.node.size=5, splitrule=variance 
## + Fold3: mtry=3, min.node.size=5, splitrule=extratrees 
## - Fold3: mtry=3, min.node.size=5, splitrule=extratrees 
## + Fold4: mtry=3, min.node.size=5, splitrule=variance 
## - Fold4: mtry=3, min.node.size=5, splitrule=variance 
## + Fold4: mtry=3, min.node.size=5, splitrule=extratrees 
## - Fold4: mtry=3, min.node.size=5, splitrule=extratrees 
## + Fold5: mtry=3, min.node.size=5, splitrule=variance 
## - Fold5: mtry=3, min.node.size=5, splitrule=variance 
## + Fold5: mtry=3, min.node.size=5, splitrule=extratrees 
## - Fold5: mtry=3, min.node.size=5, splitrule=extratrees 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 3, splitrule = variance, min.node.size = 5 on full training set
# Print model to console
print(model)
## Random Forest 
## 
## 100 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 81, 80, 80, 80, 79 
## Resampling results across tuning parameters:
## 
##   splitrule   RMSE       Rsquared   MAE      
##   variance    0.6587570  0.3188962  0.4888864
##   extratrees  0.6865412  0.2675681  0.5076654
## 
## Tuning parameter 'mtry' was held constant at a value of 3
## Tuning
##  parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 3, splitrule = variance
##  and min.node.size = 5.

Explore a wider model space

Try a longer tune length

Recall from the video that random forest models have a primary tuning parameter of mtry, which controls how many variables are exposed to the splitting search routine at each split. For example, suppose that a tree has a total of 10 splits and mtry = 2. This means that each time one of those 10 splits is evaluated, a random sample of 2 predictors is considered as candidates for the split.

Use a larger tuning grid this time, but stick to the defaults provided by the train() function. Try a tuneLength of 3, rather than 1, to explore some more potential models, and plot the resulting model using the plot function.

Train a random forest model, model, using the wine dataset on the quality variable with all other variables as explanatory variables. (This will take a few seconds to run, so be patient!) Use method = “ranger”. Change the tuneLength to 3. Use 5 CV folds. Print model to the console. Plot the model after fitting it.

# Fit random forest: model
model <- train(
  quality ~ .,
  tuneLength = 3,
  data = wine, 
  method = "ranger",
  trControl = trainControl(
    method = "cv", 
    number = 5, 
    verboseIter = TRUE
  )
)
## + Fold1: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold1: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold1: mtry= 7, min.node.size=5, splitrule=variance 
## - Fold1: mtry= 7, min.node.size=5, splitrule=variance 
## + Fold1: mtry=13, min.node.size=5, splitrule=variance 
## - Fold1: mtry=13, min.node.size=5, splitrule=variance 
## + Fold1: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold1: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold1: mtry= 7, min.node.size=5, splitrule=extratrees 
## - Fold1: mtry= 7, min.node.size=5, splitrule=extratrees 
## + Fold1: mtry=13, min.node.size=5, splitrule=extratrees 
## - Fold1: mtry=13, min.node.size=5, splitrule=extratrees 
## + Fold2: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold2: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold2: mtry= 7, min.node.size=5, splitrule=variance 
## - Fold2: mtry= 7, min.node.size=5, splitrule=variance 
## + Fold2: mtry=13, min.node.size=5, splitrule=variance 
## - Fold2: mtry=13, min.node.size=5, splitrule=variance 
## + Fold2: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold2: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold2: mtry= 7, min.node.size=5, splitrule=extratrees 
## - Fold2: mtry= 7, min.node.size=5, splitrule=extratrees 
## + Fold2: mtry=13, min.node.size=5, splitrule=extratrees 
## - Fold2: mtry=13, min.node.size=5, splitrule=extratrees 
## + Fold3: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold3: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold3: mtry= 7, min.node.size=5, splitrule=variance 
## - Fold3: mtry= 7, min.node.size=5, splitrule=variance 
## + Fold3: mtry=13, min.node.size=5, splitrule=variance 
## - Fold3: mtry=13, min.node.size=5, splitrule=variance 
## + Fold3: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold3: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold3: mtry= 7, min.node.size=5, splitrule=extratrees 
## - Fold3: mtry= 7, min.node.size=5, splitrule=extratrees 
## + Fold3: mtry=13, min.node.size=5, splitrule=extratrees 
## - Fold3: mtry=13, min.node.size=5, splitrule=extratrees 
## + Fold4: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold4: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold4: mtry= 7, min.node.size=5, splitrule=variance 
## - Fold4: mtry= 7, min.node.size=5, splitrule=variance 
## + Fold4: mtry=13, min.node.size=5, splitrule=variance 
## - Fold4: mtry=13, min.node.size=5, splitrule=variance 
## + Fold4: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold4: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold4: mtry= 7, min.node.size=5, splitrule=extratrees 
## - Fold4: mtry= 7, min.node.size=5, splitrule=extratrees 
## + Fold4: mtry=13, min.node.size=5, splitrule=extratrees 
## - Fold4: mtry=13, min.node.size=5, splitrule=extratrees 
## + Fold5: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold5: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold5: mtry= 7, min.node.size=5, splitrule=variance 
## - Fold5: mtry= 7, min.node.size=5, splitrule=variance 
## + Fold5: mtry=13, min.node.size=5, splitrule=variance 
## - Fold5: mtry=13, min.node.size=5, splitrule=variance 
## + Fold5: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold5: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold5: mtry= 7, min.node.size=5, splitrule=extratrees 
## - Fold5: mtry= 7, min.node.size=5, splitrule=extratrees 
## + Fold5: mtry=13, min.node.size=5, splitrule=extratrees 
## - Fold5: mtry=13, min.node.size=5, splitrule=extratrees 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 13, splitrule = variance, min.node.size = 5 on full training set
# Print model to console
print(model)
## Random Forest 
## 
## 100 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 80, 79, 80, 80, 81 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE      
##    2    variance    0.6467175  0.3281169  0.4885136
##    2    extratrees  0.6885613  0.2353371  0.5082042
##    7    variance    0.6355183  0.3238600  0.4904450
##    7    extratrees  0.6811706  0.2270125  0.5148996
##   13    variance    0.6343619  0.3261168  0.4914816
##   13    extratrees  0.6845100  0.2212215  0.5214518
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 13, splitrule = variance
##  and min.node.size = 5.
# Plot model
plot(model)
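
If you would rather read the chosen tuning values off the object than off the plot, the train object stores them directly. A quick sketch using the model just fitted:

# Best combination of mtry, splitrule and min.node.size found by caret
model$bestTune

# Full table of resampled results for every candidate
model$results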

Custom tuning grids

Fit a random forest with custom tuning

Now that you’ve explored the default tuning grids provided by the train() function, let’s customize your models a bit more.

You can provide any number of values for mtry, from 2 up to the number of columns in the dataset. In practice, there are diminishing returns for much larger values of mtry, so you will use a custom tuning grid that explores 2 simple models (mtry = 2 and mtry = 3) as well as one more complicated model (mtry = 7).

Define a custom tuning grid. Set the number of variables to possibly split at each node, .mtry, to a vector of 2, 3, and 7. Set the rule to split on, .splitrule, to “variance”. Set the minimum node size, .min.node.size, to 5.

# Define the tuning grid: tuneGrid
tuneGrid <- data.frame(
  .mtry = c(2,3,7),
  .splitrule = "variance",
  .min.node.size = 5
)

Train another random forest model, model, using the wine dataset on the quality variable with all other variables as explanatory variables. Use method = “ranger”. Use the custom tuneGrid. Use 5 CV folds. Print model to the console. Plot the model after fitting it using plot().

# Fit random forest: model
model <- train(
  quality ~. ,
  tuneGrid = tuneGrid,
  data = wine, 
  method = "ranger",
  trControl = trainControl(
    method = "cv", 
    number = 5, 
    verboseIter = TRUE
  )
)
## + Fold1: mtry=2, splitrule=variance, min.node.size=5 
## - Fold1: mtry=2, splitrule=variance, min.node.size=5 
## + Fold1: mtry=3, splitrule=variance, min.node.size=5 
## - Fold1: mtry=3, splitrule=variance, min.node.size=5 
## + Fold1: mtry=7, splitrule=variance, min.node.size=5 
## - Fold1: mtry=7, splitrule=variance, min.node.size=5 
## + Fold2: mtry=2, splitrule=variance, min.node.size=5 
## - Fold2: mtry=2, splitrule=variance, min.node.size=5 
## + Fold2: mtry=3, splitrule=variance, min.node.size=5 
## - Fold2: mtry=3, splitrule=variance, min.node.size=5 
## + Fold2: mtry=7, splitrule=variance, min.node.size=5 
## - Fold2: mtry=7, splitrule=variance, min.node.size=5 
## + Fold3: mtry=2, splitrule=variance, min.node.size=5 
## - Fold3: mtry=2, splitrule=variance, min.node.size=5 
## + Fold3: mtry=3, splitrule=variance, min.node.size=5 
## - Fold3: mtry=3, splitrule=variance, min.node.size=5 
## + Fold3: mtry=7, splitrule=variance, min.node.size=5 
## - Fold3: mtry=7, splitrule=variance, min.node.size=5 
## + Fold4: mtry=2, splitrule=variance, min.node.size=5 
## - Fold4: mtry=2, splitrule=variance, min.node.size=5 
## + Fold4: mtry=3, splitrule=variance, min.node.size=5 
## - Fold4: mtry=3, splitrule=variance, min.node.size=5 
## + Fold4: mtry=7, splitrule=variance, min.node.size=5 
## - Fold4: mtry=7, splitrule=variance, min.node.size=5 
## + Fold5: mtry=2, splitrule=variance, min.node.size=5 
## - Fold5: mtry=2, splitrule=variance, min.node.size=5 
## + Fold5: mtry=3, splitrule=variance, min.node.size=5 
## - Fold5: mtry=3, splitrule=variance, min.node.size=5 
## + Fold5: mtry=7, splitrule=variance, min.node.size=5 
## - Fold5: mtry=7, splitrule=variance, min.node.size=5 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7, splitrule = variance, min.node.size = 5 on full training set
# Print model to console
print(model)
## Random Forest 
## 
## 100 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 80, 80, 80, 80, 80 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##   2     0.6559787  0.3284769  0.4936593
##   3     0.6519735  0.3292476  0.4946083
##   7     0.6498748  0.3164634  0.4926127
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 7, splitrule = variance
##  and min.node.size = 5.
# Plot model
plot(model)

Introducing glmnet

Make a custom trainControl

The wine quality dataset was a regression problem, but now you are looking at a classification problem. This is a simulated dataset based on the "don't overfit" competition on Kaggle a number of years ago.

Classification problems are a little more complicated than regression problems because you have to provide a custom summaryFunction to the train() function to use the AUC metric to rank your models. Start by making a custom trainControl, as you did in the previous chapter. Be sure to set classProbs = TRUE, otherwise the twoClassSummary for summaryFunction will break.

Make a custom trainControl called myControl for classification using the trainControl function.

Use 10 CV folds. Use twoClassSummary for the summaryFunction. Be sure to set classProbs = TRUE.

# Create custom trainControl: myControl
myControl <- trainControl(
  method = "cv", 
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE, # IMPORTANT!
  verboseIter = TRUE
)

Fit glmnet with custom trainControl

Now that you have a custom trainControl object, fit a glmnet model to the "don't overfit" dataset. Recall from the video that glmnet is an extension of the generalized linear regression model (or glm) that places constraints on the magnitude of the coefficients to prevent overfitting. This is more commonly known as "penalized" regression modeling and is a very useful technique on datasets with many predictors and few observations.

glmnet is capable of fitting two different kinds of penalized models, controlled by the alpha parameter:

Ridge regression (or alpha = 0)
Lasso regression (or alpha = 1)

You'll now fit a glmnet model to the "don't overfit" dataset using the defaults provided by the caret package.
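
For reference, here is a self-contained sketch of how an alpha/lambda grid for glmnet is typically set up in caret. The grid values and the tiny simulated dataset are purely illustrative (they are not part of the course exercise), and the glmnet package must be installed for method = "glmnet" to work:

library(caret)

# Illustrative alpha/lambda grid (0 = ridge, 1 = lasso, in between = elastic net)
glmnet_grid <- expand.grid(
  alpha  = c(0, 0.5, 1),
  lambda = seq(0.0001, 1, length.out = 10)
)

# A tiny simulated two-class dataset, just so the call below can run on its own
set.seed(42)
toy   <- data.frame(matrix(rnorm(100 * 5), ncol = 5))
toy$y <- factor(ifelse(toy$X1 + rnorm(100) > 0, "class1", "class2"))

toy_model <- train(
  y ~ ., toy,
  method    = "glmnet",
  tuneGrid  = glmnet_grid,
  metric    = "ROC",
  trControl = trainControl(
    method = "cv", number = 5,
    summaryFunction = twoClassSummary, classProbs = TRUE
  )
)

# The alpha/lambda combination with the highest AUC
toy_model$bestTune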

Train a glmnet model called model on the overfit data. Use the custom trainControl from the previous exercise (myControl). The variable y is the response variable and all other variables are explanatory variables. Print the model to the console. Use the max() function to find the maximum of the ROC statistic contained somewhere in model[[“results”]].

overfit <- read.csv("DATABASE/overfit.csv")
head(overfit)
##        y        X1         X2        X3         X4         X5         X6
## 1 class2 0.9148060 0.33423133 0.1365052 0.24492099 0.84829322 0.73592037
## 2 class2 0.9370754 0.18843433 0.1771364 0.08763591 0.06274633 0.75178575
## 3 class2 0.2861395 0.26971618 0.5195605 0.39110850 0.81984509 0.33261448
## 4 class1 0.8304476 0.53074408 0.8111208 0.18256143 0.53936029 0.05754862
## 5 class2 0.6417455 0.02145023 0.1153620 0.13362478 0.49902010 0.67441545
## 6 class2 0.5190959 0.79876031 0.8934218 0.25746291 0.02222732 0.04157017
##           X7        X8        X9        X10        X11       X12       X13
## 1 0.05391100 0.1651787 0.9899656 0.24640458 0.06038098 0.4088231 0.2737925
## 2 0.95509577 0.7277811 0.4384936 0.02302811 0.93300437 0.3158619 0.9441967
## 3 0.02560094 0.2061579 0.6999032 0.28421418 0.34894162 0.4947276 0.4459833
## 4 0.92076314 0.5864655 0.8890770 0.81289268 0.41179789 0.6489300 0.5417872
## 5 0.36666474 0.9135460 0.8341595 0.71891839 0.96114648 0.6498625 0.1617545
## 6 0.69839255 0.2069666 0.7344215 0.52957674 0.53107291 0.5023026 0.6930535
##         X14        X15       X16        X17       X18        X19       X20
## 1 0.3133691 0.19940178 0.8555082 0.59892983 0.3641020 0.29991068 0.5124174
## 2 0.6887178 0.56649540 0.2165558 0.96894761 0.4326160 0.07423252 0.4984634
## 3 0.5323920 0.16805282 0.3174699 0.39053533 0.7288679 0.92831706 0.1367773
## 4 0.7599945 0.94363115 0.5413980 0.85281451 0.2627979 0.40114898 0.4507146
## 5 0.6498432 0.50440809 0.4303394 0.04228755 0.7689716 0.66537518 0.3133368
## 6 0.4968844 0.04706947 0.3649031 0.42472235 0.5102272 0.02496750 0.7338139
##         X21        X22       X23       X24        X25        X26       X27
## 1 0.7314937 0.88671723 0.8427694 0.2265345 0.24646324 0.03078282 0.2439920
## 2 0.5609527 0.78957374 0.7690579 0.2930370 0.53030519 0.91007049 0.8861334
## 3 0.4981884 0.54639519 0.9188407 0.1260773 0.21397210 0.14263568 0.3657405
## 4 0.8410595 0.06316993 0.3557249 0.7101987 0.02584904 0.43959793 0.2502275
## 5 0.4636445 0.57615483 0.9376524 0.8479341 0.34200247 0.44986382 0.7712136
## 6 0.1989107 0.21231243 0.5919335 0.9862026 0.39413527 0.65888420 0.4224395
##          X28        X29        X30         X31       X32       X33       X34
## 1 0.44031431 0.61425763 0.60700663 0.776540211 0.2101436 0.4436156 0.2398912
## 2 0.44528186 0.87616254 0.03163189 0.023826756 0.7116446 0.3283460 0.8914540
## 3 0.08088818 0.09615188 0.93982109 0.431787197 0.4351647 0.2078517 0.1483158
## 4 0.15290506 0.17540170 0.74308481 0.366538520 0.9759542 0.1919214 0.4238083
## 5 0.82051754 0.45457902 0.14370810 0.395403284 0.7004988 0.3723860 0.7290697
## 6 0.34739227 0.73957960 0.38408691 0.007261852 0.5107656 0.7262031 0.7877473
##         X35       X36       X37       X38       X39        X40       X41
## 1 0.6363019 0.5722824 0.4197569 0.9500444 0.3587282 0.49694824 0.5283896
## 2 0.6835961 0.4618257 0.8261398 0.1127617 0.4972427 0.99363492 0.6463879
## 3 0.4627288 0.1354304 0.1123009 0.8579319 0.2506536 0.68541077 0.8340490
## 4 0.5929640 0.5824622 0.2039221 0.2698933 0.7749550 0.71740832 0.3457626
## 5 0.1989480 0.1145797 0.4922106 0.2721556 0.1394774 0.51023013 0.6217329
## 6 0.7350154 0.9524530 0.7657882 0.7173460 0.3119602 0.09114151 0.6631953
##          X42         X43        X44       X45        X46       X47        X48
## 1 0.53637297 0.003498503 0.03108833 0.7895760 0.90801477 0.5743560 0.04658951
## 2 0.02224803 0.470631273 0.37119592 0.6076812 0.22839369 0.3472568 0.81377534
## 3 0.97147909 0.447381378 0.28906588 0.9859969 0.03609499 0.5428427 0.06971775
## 4 0.60075658 0.418609717 0.14279648 0.6505679 0.75519239 0.2701957 0.45726551
## 5 0.41735487 0.865590246 0.64557472 0.3890771 0.14964131 0.7032647 0.93253494
## 6 0.35575149 0.089314849 0.83856468 0.7514495 0.15831393 0.2215922 0.56264178
##         X49        X50        X51       X52       X53        X54        X55
## 1 0.5686084 0.70674255 0.35659105 0.6176513 0.8937959 0.09195006 0.40665568
## 2 0.2178806 0.93513637 0.20800169 0.3523196 0.4806338 0.04368308 0.90273489
## 3 0.1015155 0.32386434 0.10062093 0.6447219 0.1601698 0.03980983 0.18040390
## 4 0.6981023 0.42993684 0.13811304 0.2108743 0.9060602 0.67766094 0.05992027
## 5 0.1927258 0.26389910 0.02372436 0.9677561 0.7480467 0.32464831 0.37985999
## 6 0.8629308 0.06350469 0.38926066 0.9384980 0.3468387 0.84224017 0.50614379
##          X56        X57        X58        X59        X60        X61       X62
## 1 0.82231507 0.92166222 0.70143425 0.57086475 0.22403629 0.61544117 0.3942098
## 2 0.46761749 0.02544768 0.20914840 0.31707423 0.71769261 0.81500369 0.9481347
## 3 0.88325679 0.71131622 0.30949109 0.00622546 0.23348081 0.34300935 0.7528206
## 4 0.23977046 0.08098793 0.75071442 0.32235315 0.11093189 0.38827650 0.8657734
## 5 0.08607674 0.83676312 0.83867712 0.76597286 0.01509131 0.01461812 0.3746096
## 6 0.01749129 0.13737931 0.09458564 0.07863530 0.41161897 0.15780173 0.9028142
##         X63       X64        X65        X66       X67        X68       X69
## 1 0.8888228 0.6099593 0.47709234 0.55782963 0.6912194 0.79023679 0.5307927
## 2 0.5172033 0.8014616 0.19922838 0.58179578 0.3595974 0.29431606 0.6447462
## 3 0.3488447 0.5248672 0.40166670 0.35338641 0.5171902 0.32632441 0.1399816
## 4 0.8560230 0.6720612 0.57517642 0.91979352 0.4984536 0.02167086 0.3166792
## 5 0.1388359 0.2903211 0.06373411 0.14328077 0.1268600 0.47265368 0.8713632
## 6 0.4936084 0.7054787 0.47593348 0.03223118 0.1145575 0.51155653 0.6391481
##         X70       X71        X72          X73        X74        X75       X76
## 1 0.6121108 0.9927446 0.66891022 0.1782290782 0.65168895 0.13139016 0.6016485
## 2 0.2246952 0.6600863 0.76956286 0.5898728392 0.09957865 0.20797646 0.5147675
## 3 0.4450516 0.9908713 0.04286123 0.3100116646 0.61739505 0.65724972 0.9524858
## 4 0.4417733 0.8435938 0.09322869 0.9048300500 0.37862555 0.05183997 0.6899449
## 5 0.8194539 0.9318627 0.37029750 0.0009339715 0.02331053 0.89112706 0.6909429
## 6 0.2539138 0.4160950 0.47901167 0.8450782229 0.84258516 0.97169850 0.3970206
##          X77        X78       X79       X80       X81       X82       X83
## 1 0.80038216 0.03375497 0.6148970 0.5501955 0.8776074 0.8771290 0.4591744
## 2 0.37053159 0.19464773 0.0756750 0.9074589 0.4554355 0.5586081 0.9691399
## 3 0.43891339 0.64983446 0.5525176 0.7823757 0.4245626 0.6889456 0.7962230
## 4 0.08944712 0.21965484 0.4636439 0.6054795 0.9387996 0.6737048 0.1111056
## 5 0.16717954 0.85817933 0.9300710 0.9142197 0.3861163 0.1886578 0.4308363
## 6 0.99779615 0.88751072 0.5320078 0.3934037 0.4441870 0.9797902 0.1132264
##          X84        X85       X86        X87       X88        X89       X90
## 1 0.02265618 0.82100470 0.9451794 0.40031711 0.5063815 0.38419708 0.5978431
## 2 0.51662527 0.09394557 0.6691676 0.36259803 0.4020096 0.38109557 0.3437839
## 3 0.17372073 0.17385249 0.3524983 0.93234951 0.1412987 0.67869493 0.3873761
## 4 0.84811563 0.08457514 0.2725157 0.21494830 0.6899701 0.75592458 0.8271274
## 5 0.48302059 0.99570426 0.7055095 0.18531663 0.1049531 0.06213957 0.7161128
## 6 0.62567603 0.25286650 0.1484418 0.01903933 0.6162294 0.89707813 0.1780118
##         X91        X92        X93       X94        X95        X96         X97
## 1 0.3178428 0.63577017 0.16898034 0.4363109 0.74558734 0.91218140 0.521450796
## 2 0.9580893 0.18256200 0.16977626 0.5074746 0.88742323 0.74437113 0.002551967
## 3 0.2691170 0.61096669 0.05505954 0.7045412 0.05273777 0.24672669 0.774404447
## 4 0.6020122 0.09389158 0.65476919 0.6609233 0.12751347 0.74064849 0.037558054
## 5 0.5659497 0.50872714 0.88375557 0.2501756 0.43159573 0.89100560 0.598648517
## 6 0.9331284 0.15435468 0.21043950 0.9714310 0.02401225 0.08361286 0.029283952
##          X98       X99      X100      X101      X102         X103      X104
## 1 0.09177065 0.7341170 0.3207706 0.3410881 0.6124809 0.0006371774 0.5474337
## 2 0.72371277 0.2045022 0.5937234 0.0769237 0.1278179 0.9994430819 0.4792716
## 3 0.47358313 0.7541783 0.8585231 0.1764347 0.2325618 0.3747437985 0.1774841
## 4 0.06946549 0.3030019 0.7570058 0.4399086 0.2011980 0.0339846939 0.6359555
## 5 0.41871902 0.1878361 0.5910866 0.9375751 0.3795888 0.3704907512 0.5783285
## 6 0.62829984 0.4720801 0.3647425 0.6538159 0.9011956 0.0729143203 0.4993629
##         X105       X106       X107      X108       X109      X110      X111
## 1 0.03589639 0.64268733 0.23045683 0.5939719 0.53344042 0.9527778 0.9500149
## 2 0.96449950 0.34132692 0.25149597 0.5282314 0.03137963 0.6503522 0.8069633
## 3 0.45762041 0.03583915 0.04275977 0.8099685 0.04959282 0.2623343 0.7691251
## 4 0.68747009 0.19993296 0.98033429 0.5882637 0.23380904 0.9604940 0.7486602
## 5 0.96663348 0.65268200 0.48489585 0.2208072 0.89631450 0.7733771 0.5731922
## 6 0.79437829 0.10197330 0.93174456 0.6425360 0.98148244 0.7171185 0.3608518
##         X112       X113       X114       X115        X116      X117      X118
## 1 0.44486670 0.01083590 0.01484451 0.06289119 0.581997880 0.5924448 0.4623531
## 2 0.34772122 0.12578299 0.17502226 0.89214340 0.665076771 0.8878863 0.8693493
## 3 0.80746171 0.50185677 0.25231490 0.28707657 0.875653535 0.5235236 0.2247654
## 4 0.03664129 0.18031296 0.99640743 0.64055932 0.070306693 0.3749164 0.5402184
## 5 0.89283247 0.05301032 0.86648632 0.13174253 0.004420665 0.8752077 0.6109594
## 6 0.80284080 0.06297881 0.43661718 0.43528020 0.232604207 0.7159711 0.2163525
##         X119      X120       X121        X122      X123       X124      X125
## 1 0.45086799 0.1943601 0.15381438 0.982242217 0.2947328 0.11021659 0.4058102
## 2 0.52969808 0.9761348 0.40933156 0.520649012 0.2511160 0.08025240 0.7747704
## 3 0.15883508 0.9871921 0.22537445 0.445695450 0.8000173 0.62539483 0.3647271
## 4 0.06271411 0.5119375 0.35218675 0.462261930 0.2342284 0.72221652 0.4875607
## 5 0.44944508 0.1983814 0.11016184 0.002215227 0.1068260 0.07613934 0.7666030
## 6 0.23194646 0.6491826 0.06810542 0.450323362 0.5336411 0.28059341 0.1265137
##        X126      X127       X128      X129      X130       X131        X132
## 1 0.5527449 0.2878878 0.55946411 0.6899293 0.6057740 0.63405048 0.041380533
## 2 0.8601255 0.1305359 0.21613476 0.6588805 0.1262838 0.91769299 0.404302723
## 3 0.7557376 0.6784908 0.64567381 0.6979758 0.4577847 0.91924084 0.766801183
## 4 0.4862461 0.7317664 0.05425676 0.2821157 0.3138897 0.71750403 0.005256724
## 5 0.9917929 0.8525171 0.77481876 0.3364385 0.6482185 0.57506495 0.354915085
## 6 0.4760391 0.0567125 0.88117459 0.8124043 0.2068774 0.02557602 0.520244388
##         X133      X134       X135      X136      X137      X138      X139
## 1 0.11652459 0.5825725 0.70794842 0.2909239 0.5437972 0.1286449 0.8094877
## 2 0.03312483 0.5669765 0.48408234 0.4611460 0.7934200 0.2398162 0.8214692
## 3 0.12827443 0.9846029 0.77778677 0.8627647 0.2294719 0.5779939 0.1852677
## 4 0.82357734 0.5245908 0.02017853 0.6642912 0.3530981 0.6174764 0.2548854
## 5 0.30659273 0.5602921 0.01777780 0.3422838 0.3047963 0.4213162 0.9207230
## 6 0.93538770 0.8365798 0.37046091 0.6586622 0.5051089 0.6448513 0.7427806
##        X140      X141      X142       X143       X144       X145       X146
## 1 0.4016311 0.9367281 0.2746436 0.59995153 0.61420305 0.84734796 0.21923829
## 2 0.7927715 0.9362274 0.2936810 0.16334854 0.70992172 0.31810196 0.15674629
## 3 0.1173834 0.8976075 0.6272968 0.17710410 0.07593862 0.07364713 0.08922226
## 4 0.3649660 0.4590170 0.4494853 0.26653590 0.77857178 0.02329525 0.74250158
## 5 0.5366805 0.5760531 0.1779100 0.00149341 0.63019430 0.92182790 0.10475932
## 6 0.4860038 0.7332851 0.6615094 0.77484637 0.30370821 0.74226482 0.15595412
##         X147       X148      X149       X150      X151       X152       X153
## 1 0.82867974 0.94290037 0.5083185 0.25527703 0.7526830 0.01497518 0.96331279
## 2 0.97047839 0.64720576 0.7175615 0.03806863 0.8967593 0.04082960 0.80074165
## 3 0.47023973 0.55771586 0.4559021 0.45961098 0.7685767 0.70849826 0.39650955
## 4 0.93823417 0.39179048 0.6270691 0.29450119 0.2282020 0.07697517 0.63353348
## 5 0.95659470 0.09407949 0.8008656 0.69764193 0.7520311 0.10761805 0.09724402
## 6 0.04171886 0.51430294 0.3555205 0.01282597 0.5723066 0.31757350 0.39974798
##         X154      X155      X156       X157      X158      X159      X160
## 1 0.16493185 0.5734490 0.6172273 0.01984267 0.9923874 0.8431326 0.4516252
## 2 0.83442190 0.7298989 0.4894689 0.71974924 0.7642040 0.9155692 0.3701119
## 3 0.96948408 0.5114832 0.9822653 0.13222445 0.1225179 0.2006651 0.2851132
## 4 0.69882495 0.9638506 0.4711525 0.15887531 0.8788914 0.6495318 0.7246809
## 5 0.08477651 0.8876985 0.2279777 0.68196120 0.9136540 0.5322480 0.6712590
## 6 0.89601757 0.6552185 0.4086666 0.91309850 0.2051752 0.9216390 0.7176894
##         X161       X162       X163        X164       X165      X166      X167
## 1 0.27610230 0.74179679 0.48524905 0.003358962 0.35415790 0.8967225 0.8834287
## 2 0.30687340 0.30933747 0.36506517 0.837234325 0.13497580 0.4668467 0.5235297
## 3 0.41455622 0.66539232 0.05123008 0.249701695 0.25971016 0.5343797 0.3086893
## 4 0.91913345 0.06219363 0.49290181 0.994318883 0.60124062 0.7753268 0.7970673
## 5 0.53340060 0.44950719 0.15271272 0.331355318 0.29420728 0.6948136 0.8076349
## 6 0.03427269 0.85936296 0.12930984 0.494159771 0.08425433 0.1823070 0.3097774
##        X168      X169       X170        X171       X172      X173       X174
## 1 0.8061288 0.1292087 0.19116865 0.285561043 0.88923767 0.5870494 0.88877075
## 2 0.6243856 0.6522644 0.93251897 0.198662249 0.11641914 0.1335043 0.85607289
## 3 0.3250510 0.5548370 0.66663916 0.803086133 0.75219803 0.9061797 0.46824726
## 4 0.4519654 0.4108911 0.99667439 0.394731248 0.70289791 0.3921670 0.32963398
## 5 0.8714402 0.9276495 0.08040045 0.715471951 0.01533145 0.9418107 0.04961305
## 6 0.6398163 0.6567273 0.79273983 0.005634914 0.56296260 0.8835058 0.40821738
##        X175       X176      X177      X178      X179       X180      X181
## 1 0.1514306 0.21224566 0.1572156 0.4175002 0.7225748 0.75372096 0.3955216
## 2 0.4241027 0.51659550 0.8845839 0.1575032 0.0794626 0.61866392 0.8728076
## 3 0.4330624 0.04400484 0.4311277 0.9490000 0.9359233 0.22093972 0.1652158
## 4 0.8379206 0.01883298 0.9475947 0.7788227 0.7653506 0.49391558 0.1522405
## 5 0.6613413 0.27919607 0.1502022 0.4259150 0.9541247 0.31082164 0.9833864
## 6 0.4268377 0.75931980 0.2362647 0.5131131 0.5040790 0.09395412 0.8566445
##        X182       X183       X184      X185       X186        X187       X188
## 1 0.4887184 0.50930862 0.40526502 0.8997269 0.20133410 0.993279049 0.27672405
## 2 0.9823968 0.39664574 0.04680688 0.4138554 0.02252797 0.551354672 0.80417281
## 3 0.8082559 0.99125548 0.85048518 0.1409671 0.88720811 0.001778496 0.03994549
## 4 0.7771462 0.34878537 0.88635905 0.7680195 0.64712794 0.927141098 0.17404809
## 5 0.5740921 0.07616622 0.25642704 0.4703455 0.02694613 0.934355612 0.27263610
## 6 0.3805437 0.27785715 0.24521403 0.2338873 0.30423026 0.806766985 0.03654309
##        X189      X190      X191      X192        X193       X194      X195
## 1 0.4524145 0.6911940 0.5406514 0.7222597 0.002576062 0.25223405 0.5568160
## 2 0.7580112 0.7804734 0.1411782 0.1310911 0.411775569 0.52042257 0.4986688
## 3 0.5278438 0.2509410 0.2106741 0.5895710 0.891389510 0.77904746 0.3912340
## 4 0.8901789 0.5280285 0.8211162 0.5913061 0.767070756 0.08815411 0.8240591
## 5 0.7148812 0.4105001 0.8930576 0.9879969 0.598619170 0.60512811 0.5665962
## 6 0.6873584 0.3161568 0.1698084 0.1546299 0.127565006 0.08847341 0.6644925
##         X196      X197      X198      X199         X200
## 1 0.08803946 0.7399642 0.3738620 0.6715633 0.0001012264
## 2 0.67626884 0.9724303 0.8393751 0.7468173 0.8653565315
## 3 0.09020876 0.6599458 0.8371015 0.9785843 0.3703812116
## 4 0.41487972 0.3337511 0.5536796 0.7041028 0.6651317959
## 5 0.01701286 0.2368524 0.7016140 0.7000800 0.3351982590
## 6 0.96472272 0.3598008 0.9455406 0.2009973 0.2989820768
# Fit glmnet model: model
model <- train(
  y ~ ., 
  overfit,
  method = "glmnet",
  trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold01: alpha=0.10, lambda=0.01013 
## - Fold01: alpha=0.10, lambda=0.01013 
## + Fold01: alpha=0.55, lambda=0.01013 
## - Fold01: alpha=0.55, lambda=0.01013 
## + Fold01: alpha=1.00, lambda=0.01013 
## - Fold01: alpha=1.00, lambda=0.01013 
## + Fold02: alpha=0.10, lambda=0.01013 
## - Fold02: alpha=0.10, lambda=0.01013 
## + Fold02: alpha=0.55, lambda=0.01013 
## - Fold02: alpha=0.55, lambda=0.01013 
## + Fold02: alpha=1.00, lambda=0.01013 
## - Fold02: alpha=1.00, lambda=0.01013 
## + Fold03: alpha=0.10, lambda=0.01013 
## - Fold03: alpha=0.10, lambda=0.01013 
## + Fold03: alpha=0.55, lambda=0.01013 
## - Fold03: alpha=0.55, lambda=0.01013 
## + Fold03: alpha=1.00, lambda=0.01013 
## - Fold03: alpha=1.00, lambda=0.01013 
## + Fold04: alpha=0.10, lambda=0.01013 
## - Fold04: alpha=0.10, lambda=0.01013 
## + Fold04: alpha=0.55, lambda=0.01013 
## - Fold04: alpha=0.55, lambda=0.01013 
## + Fold04: alpha=1.00, lambda=0.01013 
## - Fold04: alpha=1.00, lambda=0.01013 
## + Fold05: alpha=0.10, lambda=0.01013 
## - Fold05: alpha=0.10, lambda=0.01013 
## + Fold05: alpha=0.55, lambda=0.01013 
## - Fold05: alpha=0.55, lambda=0.01013 
## + Fold05: alpha=1.00, lambda=0.01013 
## - Fold05: alpha=1.00, lambda=0.01013 
## + Fold06: alpha=0.10, lambda=0.01013 
## - Fold06: alpha=0.10, lambda=0.01013 
## + Fold06: alpha=0.55, lambda=0.01013 
## - Fold06: alpha=0.55, lambda=0.01013 
## + Fold06: alpha=1.00, lambda=0.01013 
## - Fold06: alpha=1.00, lambda=0.01013 
## + Fold07: alpha=0.10, lambda=0.01013 
## - Fold07: alpha=0.10, lambda=0.01013 
## + Fold07: alpha=0.55, lambda=0.01013 
## - Fold07: alpha=0.55, lambda=0.01013 
## + Fold07: alpha=1.00, lambda=0.01013 
## - Fold07: alpha=1.00, lambda=0.01013 
## + Fold08: alpha=0.10, lambda=0.01013 
## - Fold08: alpha=0.10, lambda=0.01013 
## + Fold08: alpha=0.55, lambda=0.01013 
## - Fold08: alpha=0.55, lambda=0.01013 
## + Fold08: alpha=1.00, lambda=0.01013 
## - Fold08: alpha=1.00, lambda=0.01013 
## + Fold09: alpha=0.10, lambda=0.01013 
## - Fold09: alpha=0.10, lambda=0.01013 
## + Fold09: alpha=0.55, lambda=0.01013 
## - Fold09: alpha=0.55, lambda=0.01013 
## + Fold09: alpha=1.00, lambda=0.01013 
## - Fold09: alpha=1.00, lambda=0.01013 
## + Fold10: alpha=0.10, lambda=0.01013 
## - Fold10: alpha=0.10, lambda=0.01013 
## + Fold10: alpha=0.55, lambda=0.01013 
## - Fold10: alpha=0.55, lambda=0.01013 
## + Fold10: alpha=1.00, lambda=0.01013 
## - Fold10: alpha=1.00, lambda=0.01013 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 0.000101 on full training set
# Print model to console
model
## glmnet 
## 
## 250 samples
## 200 predictors
##   2 classes: 'class1', 'class2' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 225, 224, 226, 225, 225, 225, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        ROC        Sens  Spec     
##   0.10   0.0001012745  0.4321558  0     0.9612319
##   0.10   0.0010127448  0.4320652  0     0.9655797
##   0.10   0.0101274483  0.4251812  0     0.9740942
##   0.55   0.0001012745  0.3866848  0     0.9443841
##   0.55   0.0010127448  0.3712862  0     0.9442029
##   0.55   0.0101274483  0.4058877  0     0.9699275
##   1.00   0.0001012745  0.3579710  0     0.9141304
##   1.00   0.0010127448  0.3469203  0     0.9228261
##   1.00   0.0101274483  0.3952899  0     0.9784420
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.0001012745.
# Print maximum ROC statistic
max(model[["results"]][["ROC"]])
## [1] 0.4321558

glmnet with custom tuning grid

glmnet with custom trainControl and tuning

As you saw in the video, the glmnet model actually fits many models at once (one of the great things about the package). You can exploit this by passing a large number of lambda values, which control the amount of penalization in the model. train() is smart enough to only fit one model per alpha value and pass all of the lambda values at once for simultaneous fitting.

My favorite tuning grid for glmnet models is:

expand.grid(alpha = 0:1, lambda = seq(0.0001, 1, length = 100))

This grid explores a large number of lambda values (100, in fact), from a very small one to a very large one. (You could increase the maximum lambda to 10, but in this exercise 1 is a good upper bound.)

If you want to explore fewer models, you can use a shorter lambda sequence. For example, lambda = seq(0.0001, 1, length = 10) would evaluate only 10 lambda values per value of alpha.

This tuneGrid also covers the two forms of penalized regression: ridge regression and lasso regression. alpha = 0 is pure ridge regression, and alpha = 1 is pure lasso regression. You can fit a mixture of the two models (i.e. an elastic net) using an alpha between 0 and 1. For example, alpha = 0.05 would be 95% ridge regression and 5% lasso regression.
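For reference, a grid that also covers intermediate mixtures might look like the sketch below (mix_grid is just an illustrative name, not part of the exercise):

# A minimal sketch of a tuneGrid that includes elastic-net mixtures as well as
# the two extremes (alpha = 0 is pure ridge, alpha = 1 is pure lasso)
mix_grid <- expand.grid(
  alpha  = c(0, 0.05, 0.5, 0.95, 1),
  lambda = seq(0.0001, 1, length = 20)
)
dim(mix_grid)  # 5 alpha values x 20 lambda values = 100 candidate parameter combinations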

In this problem you’ll just explore the 2 extremes – pure ridge and pure lasso regression – for the purpose of illustrating their differences.

Train a glmnet model on the overfit data such that y is the response variable and all other variables are explanatory variables. Make sure to use your custom trainControl from the previous exercise (myControl). Also, use a custom tuneGrid to explore alpha = 0:1 and 20 values of lambda between 0.0001 and 1 per value of alpha. Print model to the console. Print the max() of the ROC statistic in model[["results"]]. You can access it using model[["results"]][["ROC"]].

# Train glmnet with custom trainControl and tuning: model
model <- train(
  y ~ ., 
  overfit,
  tuneGrid = expand.grid(
    alpha = 0:1,
    lambda = seq(0.0001, 1, length = 20)
  ),
  method = "glmnet",
  trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold01: alpha=0, lambda=1 
## - Fold01: alpha=0, lambda=1 
## + Fold01: alpha=1, lambda=1 
## - Fold01: alpha=1, lambda=1 
## + Fold02: alpha=0, lambda=1 
## - Fold02: alpha=0, lambda=1 
## + Fold02: alpha=1, lambda=1 
## - Fold02: alpha=1, lambda=1 
## + Fold03: alpha=0, lambda=1 
## - Fold03: alpha=0, lambda=1 
## + Fold03: alpha=1, lambda=1 
## - Fold03: alpha=1, lambda=1 
## + Fold04: alpha=0, lambda=1 
## - Fold04: alpha=0, lambda=1 
## + Fold04: alpha=1, lambda=1 
## - Fold04: alpha=1, lambda=1 
## + Fold05: alpha=0, lambda=1 
## - Fold05: alpha=0, lambda=1 
## + Fold05: alpha=1, lambda=1 
## - Fold05: alpha=1, lambda=1 
## + Fold06: alpha=0, lambda=1 
## - Fold06: alpha=0, lambda=1 
## + Fold06: alpha=1, lambda=1 
## - Fold06: alpha=1, lambda=1 
## + Fold07: alpha=0, lambda=1 
## - Fold07: alpha=0, lambda=1 
## + Fold07: alpha=1, lambda=1 
## - Fold07: alpha=1, lambda=1 
## + Fold08: alpha=0, lambda=1 
## - Fold08: alpha=0, lambda=1 
## + Fold08: alpha=1, lambda=1 
## - Fold08: alpha=1, lambda=1 
## + Fold09: alpha=0, lambda=1 
## - Fold09: alpha=0, lambda=1 
## + Fold09: alpha=1, lambda=1 
## - Fold09: alpha=1, lambda=1 
## + Fold10: alpha=0, lambda=1 
## - Fold10: alpha=0, lambda=1 
## + Fold10: alpha=1, lambda=1 
## - Fold10: alpha=1, lambda=1 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0.0527 on full training set
# Print model to console
model
## glmnet 
## 
## 250 samples
## 200 predictors
##   2 classes: 'class1', 'class2' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 225, 226, 224, 224, 226, 224, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      ROC        Sens  Spec     
##   0      0.00010000  0.3722826  0.05  0.9871377
##   0      0.05272632  0.3791667  0.00  0.9956522
##   0      0.10535263  0.3770833  0.00  1.0000000
##   0      0.15797895  0.3708333  0.00  1.0000000
##   0      0.21060526  0.3794384  0.00  1.0000000
##   0      0.26323158  0.3900362  0.00  1.0000000
##   0      0.31585789  0.3921196  0.00  1.0000000
##   0      0.36848421  0.3963768  0.00  1.0000000
##   0      0.42111053  0.3963768  0.00  1.0000000
##   0      0.47373684  0.3984601  0.00  1.0000000
##   0      0.52636316  0.3984601  0.00  1.0000000
##   0      0.57898947  0.4026268  0.00  1.0000000
##   0      0.63161579  0.4026268  0.00  1.0000000
##   0      0.68424211  0.4047101  0.00  1.0000000
##   0      0.73686842  0.4047101  0.00  1.0000000
##   0      0.78949474  0.4047101  0.00  1.0000000
##   0      0.84212105  0.4047101  0.00  1.0000000
##   0      0.89474737  0.4047101  0.00  1.0000000
##   0      0.94737368  0.4090580  0.00  1.0000000
##   0      1.00000000  0.4090580  0.00  1.0000000
##   1      0.00010000  0.2921196  0.05  0.9532609
##   1      0.05272632  0.5108696  0.00  1.0000000
##   1      0.10535263  0.5000000  0.00  1.0000000
##   1      0.15797895  0.5000000  0.00  1.0000000
##   1      0.21060526  0.5000000  0.00  1.0000000
##   1      0.26323158  0.5000000  0.00  1.0000000
##   1      0.31585789  0.5000000  0.00  1.0000000
##   1      0.36848421  0.5000000  0.00  1.0000000
##   1      0.42111053  0.5000000  0.00  1.0000000
##   1      0.47373684  0.5000000  0.00  1.0000000
##   1      0.52636316  0.5000000  0.00  1.0000000
##   1      0.57898947  0.5000000  0.00  1.0000000
##   1      0.63161579  0.5000000  0.00  1.0000000
##   1      0.68424211  0.5000000  0.00  1.0000000
##   1      0.73686842  0.5000000  0.00  1.0000000
##   1      0.78949474  0.5000000  0.00  1.0000000
##   1      0.84212105  0.5000000  0.00  1.0000000
##   1      0.89474737  0.5000000  0.00  1.0000000
##   1      0.94737368  0.5000000  0.00  1.0000000
##   1      1.00000000  0.5000000  0.00  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.05272632.
# Print maximum ROC statistic
max(model[["results"]][["ROC"]])
## [1] 0.5108696

Preprocessing your data

Median imputation

Apply median imputation

In this chapter, you’ll be using a version of the Wisconsin Breast Cancer dataset. This dataset presents a classic binary classification problem: 50% of the samples are benign, 50% are malignant, and the challenge is to identify which are which.

This dataset is interesting because many of the predictors contain missing values and most rows of the dataset have at least one missing value. This presents a modeling challenge, because most machine learning algorithms cannot handle missing values out of the box. For example, your first instinct might be to fit a logistic regression model to this data, but prior to doing this you need a strategy for handling the NAs.

Fortunately, the train() function in caret contains an argument called preProcess, which allows you to specify that median imputation should be used to fill in the missing values. In previous chapters, you created models with the train() function using formulas such as y ~ .. An alternative way is to specify the x and y arguments to train(), where x is an object with samples in rows and features in columns and y is a numeric or factor vector containing the outcomes. Said differently, x is a matrix or data frame that contains the whole dataset you’d use for the data argument to the lm() call, for example, but excludes the response variable column; y is a vector that contains just the response variable column.
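To make the two calling conventions concrete, here is a minimal sketch using the built-in mtcars data (purely an illustration, not the breast cancer data); both calls fit the same linear model:

# Formula interface: response on the left of ~, predictors on the right
library(caret)
m_formula <- train(mpg ~ ., data = mtcars, method = "lm")

# x/y interface: x holds only the predictors, y holds the response vector
m_xy <- train(x = mtcars[, -1], y = mtcars$mpg, method = "lm")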

For this exercise, the argument x to train() is loaded in your workspace as breast_cancer_x and y as breast_cancer_y.

Use the train() function to fit a glm model called median_model to the breast cancer dataset. Use preProcess = "medianImpute" to handle the missing values. Print median_model to the console.

# breast_cancer_y <- read.delim("DATABASE/breast_cancer_y.txt", sep = ",")
# breast_cancer_x <- read.delim("DATABASE/breast_cancer_x.txt", sep = ",")
# # Apply median imputation: median_model
# median_model <- train(
#   x = breast_cancer_x, 
#   y = breast_cancer_y,
#   method = "glm",
#   trControl = myControl,
#   preProcess = "medianImpute"
# )
# 
# # Print median_model to console
# median_model

KNN imputation

Use KNN imputation

In the previous exercise, you used median imputation to fill in missing values in the breast cancer dataset, but that is not the only possible method for dealing with missing data.

An alternative to median imputation is k-nearest neighbors, or KNN, imputation. This is a more advanced form of imputation where missing values are replaced with values from other rows that are similar to the current row. While this is a lot more complicated to implement in practice than simple median imputation, it is very easy to explore in caret using the preProcess argument to train(). You can simply use preProcess = "knnImpute" to change the method of imputation used prior to model fitting.

breast_cancer_x and breast_cancer_y are loaded in your workspace.

Use the train() function to fit a glm model called knn_model to the breast cancer dataset. Use KNN imputation to handle missing values.

# Apply KNN imputation: knn_model
# knn_model <- train(
#   x = breast_cancer_x, 
#   y = breast_cancer_y,
#   method = "glm",
#   trControl = myControl,
#   preProcess = "knnImpute"
# )

# Print knn_model to console
# knn_model
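Since the exercise code above is commented out, here is a small self-contained sketch of knnImpute applied through caret's preProcess() directly (toy_x is a made-up data frame, used only for illustration):

library(caret)
set.seed(42)
toy_x <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
toy_x$a[3] <- NA                       # introduce a single missing value

pp_knn    <- preProcess(toy_x, method = "knnImpute", k = 5)
toy_clean <- predict(pp_knn, toy_x)    # the NA is filled in from the 5 most similar rows
toy_clean$a[3]                         # imputed value (note: knnImpute also centers and scales)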

Multiple preprocessing methods

Combining preprocessing methods

The preProcess argument to train() doesn't just limit you to imputing missing values. It also supports a wide variety of other preprocessing techniques to make your life as a data scientist much easier. You can read a full list of them by typing ?preProcess and reading the help page for this function.

One set of preprocessing functions that is particularly useful for fitting regression models is standardization: centering and scaling. You first center by subtracting the mean of each column from each value in that column, then you scale by dividing by the standard deviation.

Standardization transforms your data such that for each column, the mean is 0 and the standard deviation is 1. This makes it easier for regression models to find a good solution.
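As a quick illustration of what these two steps do, the sketch below standardizes the built-in mtcars data with preProcess() (used for demonstration only, unrelated to the breast cancer data):

library(caret)
pp_std     <- preProcess(mtcars, method = c("center", "scale"))
mtcars_std <- predict(pp_std, mtcars)

round(colMeans(mtcars_std), 10)  # every column now has mean 0
apply(mtcars_std, 2, sd)         # ...and standard deviation 1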

breast_cancer_x and breast_cancer_y are loaded in your workspace. Fit a logistic regression model called model to the breast cancer data using median imputation, then print it to the console.

# # Fit glm with median imputation
# model <- train(
#   x = breast_cancer_x, 
#   y = breast_cancer_y,
#   method = "glm",
#   trControl = myControl,
#   preProcess = "medianImpute"
# )
# 
# # Print model
# model

Update the model to include two more pre-processing steps: centering and scaling.

# # Update model with standardization
# model <- train(
#   x = breast_cancer_x, 
#   y = breast_cancer_y,
#   method = "glm",
#   trControl = myControl,
#   preProcess = c("medianImpute", "center", "scale")
# )
# 
# # Print updated model
# model

Handling low-information predictors

Remove near zero variance predictors

As you saw in the video, for the next set of exercises, you’ll be using the blood-brain dataset. This is a biochemical dataset in which the task is to predict the following value for a set of biochemical compounds:

log((concentration of compound in brain) / (concentration of compound in blood))

This gives a quantitative metric of the compound’s ability to cross the blood-brain barrier (0 means equal concentrations on both sides; positive values mean the compound is more concentrated in the brain), and is useful for understanding the biological properties of that barrier.

One interesting aspect of this dataset is that it contains many variables and many of these variables have extremely low variances. This means that there is very little information in these variables because they mostly consist of a single value (e.g. zero).

Fortunately, caret contains a utility function called nearZeroVar() for removing such variables to save time during modeling.

nearZeroVar() takes in data x, then looks at the ratio of the most common value to the second most common value, freqCut, and the percentage of distinct values out of the number of total samples, uniqueCut. By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. I like to be a little more aggressive and use freqCut = 2 and uniqueCut = 20 when calling nearZeroVar().
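As a toy illustration of how these cutoffs work, consider the sketch below (toy_df is made up for this example): the first column’s frequency ratio is 95/5 = 19 (greater than freqCut = 2) and only 2% of its values are distinct (less than uniqueCut = 20), so it is flagged, while the second column is kept.

library(caret)
set.seed(1)
toy_df <- data.frame(
  mostly_zero = c(rep(0, 95), rep(1, 5)),  # low information: almost always the same value
  informative = rnorm(100)                 # plenty of variation
)
nearZeroVar(toy_df, freqCut = 2, uniqueCut = 20, names = TRUE)  # should return "mostly_zero"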

bloodbrain_x and bloodbrain_y are loaded in your workspace.

Identify the near zero variance predictors by running nearZeroVar() on the blood-brain dataset. Store the result as an object called remove_cols. Use freqCut = 2 and uniqueCut = 20 in the call to nearZeroVar(). Use names() to create a vector containing all column names of bloodbrain_x. Call this all_cols. Make a new data frame called bloodbrain_x_small with the near-zero variance variables removed. Use setdiff() to isolate the column names that you wish to keep (i.e. that you don’t want to remove.)

bloodbrain_x <- read.delim("DATABASE/bloodbrain_x.txt", sep = ",")
head(bloodbrain_x)
##   X  tpsa nbasic   vsa_hyd a_aro  weight peoe_vsa.0 peoe_vsa.1 peoe_vsa.2
## 1 1 12.03      1 167.06700     0 156.293   76.94749   43.44619    0.00000
## 2 2 49.33      0  92.64243     6 151.165   38.24339   25.52006    0.00000
## 3 3 50.53      1 295.16700    15 366.485   58.05473  124.74020   21.65084
## 4 4 37.39      0 319.11220    15 382.552   62.23933  124.74020   13.19232
## 5 5 37.39      1 299.65800    12 326.464   74.80064  118.04060   33.00190
## 6 6 37.39      1 289.77770    11 332.492   74.80064  109.50990   13.19232
##   peoe_vsa.3 peoe_vsa.4 peoe_vsa.5 peoe_vsa.6 peoe_vsa.0.1 peoe_vsa.1.1
## 1   0.000000    0.00000          0  17.238030     18.74768     43.50657
## 2   8.619013   23.27370          0   0.000000     49.01962      0.00000
## 3   8.619013   17.44054          0   8.619013     83.82487     49.01962
## 4  21.785640    0.00000          0   8.619013     83.82487     68.78024
## 5   0.000000    0.00000          0   8.619013     83.82487     36.76471
## 6  21.785640    0.00000          0   8.619013     73.54603     44.27042
##   peoe_vsa.2.1 peoe_vsa.3.1 peoe_vsa.4.1 peoe_vsa.5.1 peoe_vsa.6.1 a_acc a_acid
## 1            0            0     0.000000     0.000000     0.000000     0      0
## 2            0            0     0.000000    13.566920     7.904431     2      0
## 3            0            0     5.682576     2.503756     2.640647     2      0
## 4            0            0     5.682576     0.000000     2.640647     2      0
## 5            0            0     5.682576     0.136891     2.503756     2      0
## 6            0            0     5.682576     0.000000     2.640647     2      0
##   a_base   vsa_acc vsa_acid vsa_base  vsa_don vsa_other  vsa_pol slogp_vsa0
## 1      1  0.000000        0 5.682576 5.682576   0.00000  0.00000   18.01075
## 2      0 13.566920        0 0.000000 5.682576  28.10760 13.56692   25.38523
## 3      1  8.186332        0 0.000000 5.682576  43.56089  0.00000   14.12420
## 4      1  8.186332        0 0.000000 5.682576  28.32470  0.00000   14.12420
## 5      1  8.186332        0 0.000000 5.682576  19.64908  0.00000   14.12420
## 6      1  8.186332        0 0.000000 5.682576  21.62514  0.00000   14.12420
##   slogp_vsa1 slogp_vsa2 slogp_vsa3 slogp_vsa4 slogp_vsa5 slogp_vsa6 slogp_vsa7
## 1    0.00000   3.981969      0.000   4.410796  32.897190          0    0.00000
## 2   23.26954  23.862220      0.000   0.000000   0.000000          0   70.57274
## 3   34.79628   0.000000     76.245   3.185575   9.507346          0  148.12580
## 4   34.79628   0.000000     76.245   3.185575   0.000000          0  144.03540
## 5   34.79628   0.000000     76.245   3.185575   0.000000          0  140.71660
## 6   34.79628   0.000000     76.245   3.185575   0.000000          0  103.60310
##   slogp_vsa8 slogp_vsa9  smr_vsa0 smr_vsa1 smr_vsa2 smr_vsa3 smr_vsa4  smr_vsa5
## 1  113.21040   33.32602  0.000000 18.01075 4.410796 3.981969  0.00000 113.21040
## 2    0.00000   41.32619 23.862220 25.38523 0.000000 5.243428 20.76750  70.57274
## 3   75.47363   28.27417 12.631660 27.78542 0.000000 8.429003 29.58226 235.05870
## 4   75.47363   55.46144  3.124314 27.78542 0.000000 8.429003 21.40142 235.05870
## 5   75.47363   26.01093  3.124314 27.78542 0.000000 8.429003 20.33867 234.62990
## 6   75.47363   55.46144  3.124314 27.78542 0.000000 8.429003 18.51150 197.51630
##    smr_vsa6 smr_vsa7 tpsa.1 logp.o.w. frac.anion7. frac.cation7. andrewbind
## 1  0.000000 66.22321  16.61     2.948        0.000         0.999        3.4
## 2  5.258784 33.32602  49.33     0.889        0.001         0.000       -3.3
## 3 76.245000  0.00000  51.73     4.439        0.000         0.986       12.8
## 4 76.245000 31.27769  38.59     5.254        0.000         0.986       12.8
## 5 76.245000  0.00000  38.59     3.800        0.000         0.986       10.3
## 6 76.245000 31.27769  38.59     3.608        0.000         0.986       10.0
##   rotatablebonds   mlogp    clogp       mw nocount hbdnr rule.of.5violations
## 1              3 2.50245 2.970000 155.2856       1     1                   0
## 2              2 1.05973 0.494000 151.1664       3     2                   0
## 3              8 4.66091 5.136999 365.4794       5     1                   1
## 4              8 3.82458 5.877599 381.5440       4     1                   1
## 5              8 3.27214 4.367000 325.4577       4     1                   0
## 6              8 2.89481 4.283600 331.4835       4     1                   0
##   prx  ub pol inthb    adistm  adistd polar_area nonpolar_area psa_npsa   tcsa
## 1   0 0.0   0     0    0.0000  0.0000    21.1242      379.0702   0.0557 0.0097
## 2   1 3.0   2     0  395.3757 10.8921   117.4081      247.5371   0.4743 0.0134
## 3   6 5.3   3     0 1364.5514 25.6784    82.0943      637.7242   0.1287 0.0111
## 4   2 5.3   3     0  702.6387 10.0232    65.0890      667.9713   0.0974 0.0108
## 5   2 4.2   2     0  745.5096 10.5753    66.1754      601.7463   0.1100 0.0118
## 6   2 3.6   2     0  779.2914 10.7712    69.0895      588.6569   0.1174 0.0111
##     tcpa   tcnp ovality surface_area    volume most_negative_charge
## 1 0.1842 0.0103  1.0960     400.1944  656.0650              -0.6174
## 2 0.0417 0.0198  1.1173     364.9453  555.0969              -0.8397
## 3 0.0972 0.0125  1.3005     719.8185 1224.4553              -0.8012
## 4 0.1218 0.0119  1.3013     733.0603 1257.2002              -0.7608
## 5 0.1186 0.0130  1.2711     667.9218 1132.6826              -0.8567
## 6 0.1061 0.0125  1.2642     657.7465 1115.8672              -0.7672
##   most_positive_charge sum_absolute_charge dipole_moment    homo    lumo
## 1               0.3068              3.8918        1.1898 -9.6672  3.4038
## 2               0.4967              4.8925        4.2109 -8.9618  0.1942
## 3               0.5414              7.9796        3.5234 -8.6271  0.0589
## 4               0.4800              7.9308        3.1463 -8.5592 -0.2651
## 5               0.4547              7.8516        3.2676 -8.6732  0.3149
## 6               0.4349              7.3305        3.2845 -8.6843 -0.0310
##   hardness    ppsa1     ppsa2   ppsa3    pnsa1     pnsa2    pnsa3  fpsa1  fpsa2
## 1   6.5355 349.1390  679.3832 30.9705  51.0554  -99.3477 -10.4876 0.8724 1.6976
## 2   4.5780 223.1310  545.8328 42.3030 141.8143 -346.9123 -44.0368 0.6114 1.4957
## 3   4.3430 517.8218 2066.0186 63.9503 201.9967 -805.9311 -43.7587 0.7194 2.8702
## 4   4.1471 507.6144 2012.9060 61.6890 225.4459 -893.9880 -42.0328 0.6925 2.7459
## 5   4.4940 509.1635 1998.8743 61.5645 158.7582 -623.2529 -39.8413 0.7623 2.9927
## 6   4.3266 473.5681 1735.7426 58.4993 184.1784 -675.0588 -41.2100 0.7200 2.6389
##    fpsa3  fnsa1   fnsa2   fnsa3    wpsa1     wpsa2   wpsa3    wnsa1     wnsa2
## 1 0.0774 0.1276 -0.2482 -0.0262 139.7235  271.8854 12.3942  20.4321  -39.7584
## 2 0.1159 0.3886 -0.9506 -0.1207  81.4306  199.1991 15.4383  51.7544 -126.6040
## 3 0.0888 0.2806 -1.1196 -0.0608 372.7377 1487.1583 46.0326 145.4010 -580.1241
## 4 0.0842 0.3075 -1.2195 -0.0573 372.1120 1475.5815 45.2218 165.2654 -655.3471
## 5 0.0922 0.2377 -0.9331 -0.0596 340.0814 1335.0917 41.1203 106.0381 -416.2842
## 6 0.0889 0.2800 -1.0263 -0.0627 311.4878 1141.6785 38.4777 121.1427 -444.0175
##      wnsa3    dpsa1     dpsa2    dpsa3   rpcg   rncg   wpcs   wncs   sadh1
## 1  -4.1971 298.0836  778.7310  41.4580 0.1577 0.3173 2.3805 1.9117 15.0988
## 2 -16.0710  81.3167  892.7451  86.3398 0.2030 0.3433 1.3116 2.2546 45.2163
## 3 -31.4983 315.8251 2871.9497 107.7089 0.1357 0.2008 1.1351 1.5725 16.7192
## 4 -30.8126 282.1685 2906.8940 103.7218 0.1210 0.1919 0.7623 1.5302 17.2491
## 5 -26.6109 350.4053 2622.1272 101.4058 0.1158 0.2182 0.7884 1.6795 16.0252
## 6 -27.1057 289.3897 2410.8013  99.7093 0.1187 0.2093 2.0505 1.6760 17.2815
##     sadh2  sadh3  chdh1  chdh2  chdh3   scdh1  scdh2  scdh3   saaa1   saaa2
## 1 15.0988 0.0377 0.3068 0.3068 0.0008  4.6321 4.6321 0.0116  6.0255  6.0255
## 2 22.6082 0.1239 0.7960 0.3980 0.0022 17.6195 8.8098 0.0483 65.6236 32.8118
## 3 16.7192 0.0232 0.4550 0.4550 0.0006  7.6077 7.6077 0.0106 57.5440 14.3860
## 4 17.2491 0.0235 0.4354 0.4354 0.0006  7.5102 7.5102 0.0102 39.8638 13.2879
## 5 16.0252 0.0240 0.4366 0.4366 0.0007  6.9970 6.9970 0.0105 42.4544 14.1515
## 6 17.2815 0.0263 0.4349 0.4349 0.0007  7.5157 7.5157 0.0114 43.8012 14.6004
##    saaa3   chaa1   chaa2   chaa3    scaa1    scaa2   scaa3 ctdh ctaa   mchg
## 1 0.0151 -0.6174 -0.6174 -0.0015  -3.7199  -3.7199 -0.0093    1    1 0.9241
## 2 0.1798 -0.8371 -0.4185 -0.0023 -27.5143 -13.7571 -0.0754    2    2 1.2685
## 3 0.0799 -1.3671 -0.3418 -0.0019 -21.7898  -5.4475 -0.0303    1    4 1.2562
## 4 0.0544 -1.2332 -0.4111 -0.0017 -17.5957  -5.8652 -0.0240    1    3 1.1962
## 5 0.0636 -1.1480 -0.3827 -0.0017 -17.0447  -5.6816 -0.0255    1    3 1.2934
## 6 0.0666 -1.2317 -0.4106 -0.0019 -19.8513  -6.6171 -0.0302    1    3 1.2021
##     achg   rdta   n_sp2   n_sp3   o_sp2   o_sp3
## 1 0.9241 1.0000  0.0000  6.0255  0.0000  0.0000
## 2 1.0420 1.0000  0.0000  6.5681 32.0102 33.6135
## 3 1.2562 0.2500 26.9733 10.8567  0.0000 27.5451
## 4 1.1962 0.3333 21.7065 11.0017  0.0000 15.1316
## 5 1.2934 0.3333 24.2061 10.8109  0.0000 15.1333
## 6 1.2021 0.3333 25.5529 11.1218  0.0000 15.1333
# Identify near zero variance predictors: remove_cols
remove_cols <- nearZeroVar(bloodbrain_x, names = TRUE, 
                           freqCut = 2, uniqueCut = 20)

# Get all column names from bloodbrain_x: all_cols
all_cols <- names(bloodbrain_x)

# Remove from data: bloodbrain_x_small
bloodbrain_x_small <- bloodbrain_x[ , setdiff(all_cols, remove_cols)]

Fit model on reduced blood-brain data

Now that you’ve reduced your dataset, you can fit a glm model to it using the train() function. This model will run faster than using the full dataset and will yield very similar predictive accuracy.

Furthermore, zero variance variables can cause problems with cross-validation (e.g. if one fold ends up with only a single unique value for that variable), so removing them prior to modeling means you are less likely to get errors during the fitting process.
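A tiny base-R sketch of the problem: if a variable is nearly constant, a resampling fold can easily end up where it is exactly constant, leaving nothing for the model to use (the vector x below is made up for illustration):

x    <- c(rep(0, 9), 1)   # near-zero variance: a single 1 among nine 0s
fold <- 1:9               # suppose the lone 1 falls outside this training fold
sd(x[fold])               # 0: the variable is constant within the fold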

bloodbrain_x, bloodbrain_y, remove_cols, and bloodbrain_x_small are loaded in your workspace.

Fit a glm model using the train() function and the reduced blood-brain dataset you created in the previous exercise. Print the result to the console.

bloodbrain_x_small <- read.delim("DATABASE/bloodbrain_x_small.txt", sep = ",")
head(bloodbrain_x_small)
##   X  tpsa nbasic   vsa_hyd a_aro  weight peoe_vsa.0 peoe_vsa.1 peoe_vsa.2
## 1 1 12.03      1 167.06700     0 156.293   76.94749   43.44619    0.00000
## 2 2 49.33      0  92.64243     6 151.165   38.24339   25.52006    0.00000
## 3 3 50.53      1 295.16700    15 366.485   58.05473  124.74020   21.65084
## 4 4 37.39      0 319.11220    15 382.552   62.23933  124.74020   13.19232
## 5 5 37.39      1 299.65800    12 326.464   74.80064  118.04060   33.00190
## 6 6 37.39      1 289.77770    11 332.492   74.80064  109.50990   13.19232
##   peoe_vsa.6 peoe_vsa.0.1 peoe_vsa.1.1 a_acc a_base   vsa_acc  vsa_don
## 1  17.238030     18.74768     43.50657     0      1  0.000000 5.682576
## 2   0.000000     49.01962      0.00000     2      0 13.566920 5.682576
## 3   8.619013     83.82487     49.01962     2      1  8.186332 5.682576
## 4   8.619013     83.82487     68.78024     2      1  8.186332 5.682576
## 5   8.619013     83.82487     36.76471     2      1  8.186332 5.682576
## 6   8.619013     73.54603     44.27042     2      1  8.186332 5.682576
##   vsa_other slogp_vsa0 slogp_vsa1 slogp_vsa3 slogp_vsa4 slogp_vsa5 slogp_vsa7
## 1   0.00000   18.01075    0.00000      0.000   4.410796  32.897190    0.00000
## 2  28.10760   25.38523   23.26954      0.000   0.000000   0.000000   70.57274
## 3  43.56089   14.12420   34.79628     76.245   3.185575   9.507346  148.12580
## 4  28.32470   14.12420   34.79628     76.245   3.185575   0.000000  144.03540
## 5  19.64908   14.12420   34.79628     76.245   3.185575   0.000000  140.71660
## 6  21.62514   14.12420   34.79628     76.245   3.185575   0.000000  103.60310
##   slogp_vsa9  smr_vsa0 smr_vsa1 smr_vsa2 smr_vsa3  smr_vsa5  smr_vsa6 smr_vsa7
## 1   33.32602  0.000000 18.01075 4.410796 3.981969 113.21040  0.000000 66.22321
## 2   41.32619 23.862220 25.38523 0.000000 5.243428  70.57274  5.258784 33.32602
## 3   28.27417 12.631660 27.78542 0.000000 8.429003 235.05870 76.245000  0.00000
## 4   55.46144  3.124314 27.78542 0.000000 8.429003 235.05870 76.245000 31.27769
## 5   26.01093  3.124314 27.78542 0.000000 8.429003 234.62990 76.245000  0.00000
## 6   55.46144  3.124314 27.78542 0.000000 8.429003 197.51630 76.245000 31.27769
##   tpsa.1 logp.o.w. frac.cation7. andrewbind rotatablebonds   mlogp    clogp
## 1  16.61     2.948         0.999        3.4              3 2.50245 2.970000
## 2  49.33     0.889         0.000       -3.3              2 1.05973 0.494000
## 3  51.73     4.439         0.986       12.8              8 4.66091 5.136999
## 4  38.59     5.254         0.986       12.8              8 3.82458 5.877599
## 5  38.59     3.800         0.986       10.3              8 3.27214 4.367000
## 6  38.59     3.608         0.986       10.0              8 2.89481 4.283600
##         mw nocount hbdnr prx  ub pol    adistm  adistd polar_area nonpolar_area
## 1 155.2856       1     1   0 0.0   0    0.0000  0.0000    21.1242      379.0702
## 2 151.1664       3     2   1 3.0   2  395.3757 10.8921   117.4081      247.5371
## 3 365.4794       5     1   6 5.3   3 1364.5514 25.6784    82.0943      637.7242
## 4 381.5440       4     1   2 5.3   3  702.6387 10.0232    65.0890      667.9713
## 5 325.4577       4     1   2 4.2   2  745.5096 10.5753    66.1754      601.7463
## 6 331.4835       4     1   2 3.6   2  779.2914 10.7712    69.0895      588.6569
##   psa_npsa   tcsa   tcpa   tcnp ovality surface_area    volume
## 1   0.0557 0.0097 0.1842 0.0103  1.0960     400.1944  656.0650
## 2   0.4743 0.0134 0.0417 0.0198  1.1173     364.9453  555.0969
## 3   0.1287 0.0111 0.0972 0.0125  1.3005     719.8185 1224.4553
## 4   0.0974 0.0108 0.1218 0.0119  1.3013     733.0603 1257.2002
## 5   0.1100 0.0118 0.1186 0.0130  1.2711     667.9218 1132.6826
## 6   0.1174 0.0111 0.1061 0.0125  1.2642     657.7465 1115.8672
##   most_negative_charge most_positive_charge sum_absolute_charge dipole_moment
## 1              -0.6174               0.3068              3.8918        1.1898
## 2              -0.8397               0.4967              4.8925        4.2109
## 3              -0.8012               0.5414              7.9796        3.5234
## 4              -0.7608               0.4800              7.9308        3.1463
## 5              -0.8567               0.4547              7.8516        3.2676
## 6              -0.7672               0.4349              7.3305        3.2845
##      homo    lumo hardness    ppsa1     ppsa2   ppsa3    pnsa1     pnsa2
## 1 -9.6672  3.4038   6.5355 349.1390  679.3832 30.9705  51.0554  -99.3477
## 2 -8.9618  0.1942   4.5780 223.1310  545.8328 42.3030 141.8143 -346.9123
## 3 -8.6271  0.0589   4.3430 517.8218 2066.0186 63.9503 201.9967 -805.9311
## 4 -8.5592 -0.2651   4.1471 507.6144 2012.9060 61.6890 225.4459 -893.9880
## 5 -8.6732  0.3149   4.4940 509.1635 1998.8743 61.5645 158.7582 -623.2529
## 6 -8.6843 -0.0310   4.3266 473.5681 1735.7426 58.4993 184.1784 -675.0588
##      pnsa3  fpsa1  fpsa2  fpsa3  fnsa1   fnsa2   fnsa3    wpsa1     wpsa2
## 1 -10.4876 0.8724 1.6976 0.0774 0.1276 -0.2482 -0.0262 139.7235  271.8854
## 2 -44.0368 0.6114 1.4957 0.1159 0.3886 -0.9506 -0.1207  81.4306  199.1991
## 3 -43.7587 0.7194 2.8702 0.0888 0.2806 -1.1196 -0.0608 372.7377 1487.1583
## 4 -42.0328 0.6925 2.7459 0.0842 0.3075 -1.2195 -0.0573 372.1120 1475.5815
## 5 -39.8413 0.7623 2.9927 0.0922 0.2377 -0.9331 -0.0596 340.0814 1335.0917
## 6 -41.2100 0.7200 2.6389 0.0889 0.2800 -1.0263 -0.0627 311.4878 1141.6785
##     wpsa3    wnsa1     wnsa2    wnsa3    dpsa1     dpsa2    dpsa3   rpcg   rncg
## 1 12.3942  20.4321  -39.7584  -4.1971 298.0836  778.7310  41.4580 0.1577 0.3173
## 2 15.4383  51.7544 -126.6040 -16.0710  81.3167  892.7451  86.3398 0.2030 0.3433
## 3 46.0326 145.4010 -580.1241 -31.4983 315.8251 2871.9497 107.7089 0.1357 0.2008
## 4 45.2218 165.2654 -655.3471 -30.8126 282.1685 2906.8940 103.7218 0.1210 0.1919
## 5 41.1203 106.0381 -416.2842 -26.6109 350.4053 2622.1272 101.4058 0.1158 0.2182
## 6 38.4777 121.1427 -444.0175 -27.1057 289.3897 2410.8013  99.7093 0.1187 0.2093
##     wpcs   wncs   sadh1   sadh2  sadh3  chdh1  chdh2   scdh1  scdh2  scdh3
## 1 2.3805 1.9117 15.0988 15.0988 0.0377 0.3068 0.3068  4.6321 4.6321 0.0116
## 2 1.3116 2.2546 45.2163 22.6082 0.1239 0.7960 0.3980 17.6195 8.8098 0.0483
## 3 1.1351 1.5725 16.7192 16.7192 0.0232 0.4550 0.4550  7.6077 7.6077 0.0106
## 4 0.7623 1.5302 17.2491 17.2491 0.0235 0.4354 0.4354  7.5102 7.5102 0.0102
## 5 0.7884 1.6795 16.0252 16.0252 0.0240 0.4366 0.4366  6.9970 6.9970 0.0105
## 6 2.0505 1.6760 17.2815 17.2815 0.0263 0.4349 0.4349  7.5157 7.5157 0.0114
##     saaa1   saaa2  saaa3   chaa1   chaa2   chaa3    scaa1    scaa2   scaa3 ctdh
## 1  6.0255  6.0255 0.0151 -0.6174 -0.6174 -0.0015  -3.7199  -3.7199 -0.0093    1
## 2 65.6236 32.8118 0.1798 -0.8371 -0.4185 -0.0023 -27.5143 -13.7571 -0.0754    2
## 3 57.5440 14.3860 0.0799 -1.3671 -0.3418 -0.0019 -21.7898  -5.4475 -0.0303    1
## 4 39.8638 13.2879 0.0544 -1.2332 -0.4111 -0.0017 -17.5957  -5.8652 -0.0240    1
## 5 42.4544 14.1515 0.0636 -1.1480 -0.3827 -0.0017 -17.0447  -5.6816 -0.0255    1
## 6 43.8012 14.6004 0.0666 -1.2317 -0.4106 -0.0019 -19.8513  -6.6171 -0.0302    1
##   ctaa   mchg   achg   rdta   n_sp2   n_sp3   o_sp2   o_sp3
## 1    1 0.9241 0.9241 1.0000  0.0000  6.0255  0.0000  0.0000
## 2    2 1.2685 1.0420 1.0000  0.0000  6.5681 32.0102 33.6135
## 3    4 1.2562 1.2562 0.2500 26.9733 10.8567  0.0000 27.5451
## 4    3 1.1962 1.1962 0.3333 21.7065 11.0017  0.0000 15.1316
## 5    3 1.2934 1.2934 0.3333 24.2061 10.8109  0.0000 15.1333
## 6    3 1.2021 1.2021 0.3333 25.5529 11.1218  0.0000 15.1333
# bloodbrain_y <- read.delim("DATABASE/bloodbrain_y.txt", sep = ",")
# # Fit model on reduced data: model
# model <- train(
#   x = bloodbrain_x_small, 
#   y = bloodbrain_y, 
#   method = "glm"
# )

# Print model to console
# model
# (note: the glm fit above is commented out, so `model` here would still refer to
#  the glmnet model from the earlier overfit exercise)

Principal components analysis (PCA)

Using PCA as an alternative to nearZeroVar()

An alternative to removing low-variance predictors is to run PCA on your dataset. This is sometimes preferable because it does not throw out all of your data: many different low variance predictors may end up combined into one high variance PCA variable, which might have a positive impact on your model’s accuracy.

This is an especially good trick for linear models: the pca option in the preProcess argument will center and scale your data, combine low variance variables, and ensure that all of your predictors are orthogonal. This creates an ideal dataset for linear regression modeling, and can often improve the accuracy of your models.
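To see the "pca" step in isolation, the sketch below applies it to the built-in mtcars data via preProcess() (illustration only; the exercise itself uses the blood-brain data inside train()):

library(caret)
pp_pca     <- preProcess(mtcars, method = "pca", thresh = 0.95)
mtcars_pca <- predict(pp_pca, mtcars)

ncol(mtcars)      # original number of columns
ncol(mtcars_pca)  # fewer, orthogonal principal components covering 95% of the variance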

bloodbrain_x and bloodbrain_y are loaded in your workspace.

Fit a glm model to the full blood-brain dataset using the "pca" option to preProcess. Print the model to the console and inspect the result.

# # Fit glm model using PCA: model
# model <- train(
#   x = bloodbrain_x, 
#   y = bloodbrain_y,
#   method = "glm", 
#   preProcess = "pca"
# )
# 
# # Print model to console
# model


Selecting models: a case study in churn prediction

Reusing a trainControl

Make custom train/test indices

As you saw in the video, for this chapter you will focus on a real-world dataset that brings together all of the concepts discussed in the previous chapters.

The churn dataset contains data on a variety of telecom customers and the modeling challenge is to predict which customers will cancel their service (or churn).

In this chapter, you will be exploring two different types of predictive models: glmnet and rf, so the first order of business is to create a reusable trainControl object you can use to reliably compare them.

churn_x and churn_y are loaded in your workspace.

Use createFolds() to create 5 CV folds on churn_y, your target variable for this exercise. Pass them to trainControl() to create a reusable trainControl for comparing models.

churn_y <- read.delim("DATABASE/churn_y.txt", sep = ",")
head(churn_y)
##   X  x
## 1 1 no
## 2 2 no
## 3 3 no
## 4 4 no
## 5 5 no
## 6 6 no
churn_x <- read.delim("DATABASE/churn_x.txt", sep = ",")
head(churn_x)
##      X stateAK stateAL stateAR stateAZ stateCA stateCO stateCT stateDC stateDE
## 1 4575       0       0       0       0       0       0       0       0       0
## 2 4685       0       0       0       0       0       0       0       0       0
## 3 1431       0       0       0       0       0       0       0       0       0
## 4 4150       0       0       0       0       0       0       0       0       1
## 5 3207       0       0       0       0       0       0       0       0       0
## 6 2593       0       0       0       0       0       0       0       0       0
##   stateFL stateGA stateHI stateIA stateID stateIL stateIN stateKS stateKY
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
## 6       0       0       0       0       0       0       0       0       0
##   stateLA stateMA stateMD stateME stateMI stateMN stateMO stateMS stateMT
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       1       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
## 6       0       0       0       0       0       0       0       0       0
##   stateNC stateND stateNE stateNH stateNJ stateNM stateNV stateNY stateOH
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
## 6       0       0       0       0       0       0       0       0       0
##   stateOK stateOR statePA stateRI stateSC stateSD stateTN stateTX stateUT
## 1       0       0       0       0       1       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
## 6       0       0       0       0       0       0       0       0       0
##   stateVA stateVT stateWA stateWI stateWV stateWY account_length
## 1       0       0       0       0       0       0            137
## 2       0       0       0       0       0       0             83
## 3       0       0       0       0       1       0             48
## 4       0       0       0       0       0       0             67
## 5       0       0       1       0       0       0            143
## 6       1       0       0       0       0       0            163
##   area_codearea_code_415 area_codearea_code_510 international_planyes
## 1                      0                      1                     0
## 2                      1                      0                     0
## 3                      1                      0                     0
## 4                      1                      0                     0
## 5                      0                      1                     0
## 6                      1                      0                     0
##   voice_mail_planyes number_vmail_messages total_day_minutes total_day_calls
## 1                  0                     0             109.8             112
## 2                  0                     0             196.7             117
## 3                  1                    34             198.0              70
## 4                  0                     0             164.5              79
## 5                  0                     0             133.4             107
## 6                  0                     0             202.9             100
##   total_day_charge total_eve_minutes total_eve_calls total_eve_charge
## 1            18.67             223.5              88            19.00
## 2            33.44             272.0              89            23.12
## 3            33.66             273.7             121            23.26
## 4            27.97             110.3             108             9.38
## 5            22.68             223.9             117            19.03
## 6            34.49             178.6              46            15.18
##   total_night_minutes total_night_calls total_night_charge total_intl_minutes
## 1               247.5                96              11.14               17.8
## 2               199.9                62               9.00               10.1
## 3               217.9                71               9.81                7.6
## 4               203.9               102               9.18                9.8
## 5               180.4                85               8.12               10.2
## 6               203.8               116               9.17               12.8
##   total_intl_calls total_intl_charge number_customer_service_calls
## 1                2              4.81                             1
## 2               11              2.73                             3
## 3                4              2.05                             1
## 4                2              2.65                             1
## 5               13              2.75                             1
## 6                3              3.46                             5
# Create custom indices: myFolds
# (churn_y was read in above as a data frame, so pass its outcome column: the yes/no labels)
myFolds <- createFolds(churn_y$x, k = 5)

# Create reusable trainControl object: myControl
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE, # IMPORTANT!
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds
)

Reintroducing glmnet

Fit the baseline model

Now that you have a reusable trainControl object called myControl, you can start fitting different predictive models to your churn dataset and evaluate their predictive accuracy.

You’ll start with one of my favorite models, glmnet, which penalizes linear and logistic regression models on the size and number of coefficients to help prevent overfitting.

Fit a glmnet model to the churn dataset called model_glmnet. Make sure to use myControl, which you created in the first exercise and is available in your workspace, as the trainControl object.

# # Fit glmnet model: model_glmnet
# model_glmnet <- train(
#   x = churn_x, 
#   y = churn_y,
#   metric = "ROC",
#   method = "glmnet",
#   trControl = myControl
# )

Reintroducing random forest

Random forest with custom trainControl

Another one of my favorite models is the random forest, which combines an ensemble of non-linear decision trees into a highly flexible (and usually quite accurate) model.

Rather than using the classic randomForest package, you’ll be using the ranger package, which is a re-implementation of randomForest that produces almost the exact same results, but is faster, more stable, and uses less memory. I highly recommend it as a starting point for random forest modeling in R.

churn_x and churn_y are loaded in your workspace.

Fit a random forest model to the churn dataset. Be sure to use myControl as the trainControl like you’ve done before and implement the “ranger” method.

# Fit random forest: model_rf
# model_rf <- train(
#   x = churn_x, 
#   y = churn_y,
#   metric = "ROC",
#   method = "ranger",
#   trControl = myControl
# )

Comparing models

Create a resamples object

Now that you have fit two models to the churn dataset, it’s time to compare their out-of-sample predictions and choose which one is the best model for your dataset.

You can compare models in caret using the resamples() function, provided they have the same training data and use the same trainControl object with preset cross-validation folds. resamples() takes as input a list of models and can be used to compare dozens of models at once (though in this case you are only comparing two models).

model_glmnet and model_rf are loaded in your workspace.

Create a list() containing the glmnet model as item1 and the ranger model as item2. Pass this list to the resamples() function and save the resulting object as resamples. Summarize the results by calling summary() on resamples.

# # Create model_list
# model_list <- list(item1 = model_glmnet, item2 = model_rf)
# 
# # Pass model_list to resamples(): resamples
# resamples <- resamples(model_list)
# 
# # Summarize the results
# summary(resamples)
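Once a resamples object exists, caret also provides lattice-based plotting methods for a visual comparison; a sketch is shown below (left commented out, like the code above, since the models themselves are not fit in this document):

# # Visual comparison of the resampled ROC values
# bwplot(resamples, metric = "ROC")   # box-and-whisker plot across the CV folds
# dotplot(resamples, metric = "ROC")  # dot plot of the same comparison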