Libraries

library(tidyverse)
library(class)
library(openintro)
library(assertive)
library(fst)
library(broom)
library(naivebayes)
library(pROC)
library(rpart)
library(rpart.plot)
library(sigr)
library(WVPlots)
library(vtreat)

Data

signs <- read.csv("signs.csv")
# write.csv(signs, "/Users/arifmathiq/R Data/signs.csv", row.names= FALSE)
locations <- read.csv("locations.csv")
donors <- read_csv("donors.csv")
## Rows: 93462 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): recency, frequency, money
## dbl (10): donated, veteran, bad_address, age, has_children, wealth_rating, i...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
loans <- read_csv("loans.csv")
## Rows: 39732 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): loan_amount, emp_length, home_ownership, income, loan_purpose, deb...
## dbl  (3): keep, rand, default
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Course Description

This beginner-level introduction to machine learning covers four of the most common classification algorithms. You will come away with a basic understanding of how each algorithm approaches a learning task, as well as learn the R functions needed to apply these tools to your own work.

1. k-Nearest Neighbors (kNN)

As the kNN algorithm literally “learns by example” it is a case in point for starting to understand supervised machine learning. This chapter will introduce classification while working through the application of kNN to self-driving vehicle road sign recognition.

1.1 Recognizing a road sign with kNN

After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.

next_sign <- signs %>% filter(r1 == 204 & b1 == 220) %>% select(-id, -sample, -sign_type)
signs <- signs %>% filter(sample == "train") %>% select(-id, -sample)
# Create a vector of labels
sign_types <- signs$sign_type

# Classify the next sign observed
knn(train = signs[-1], test = next_sign, cl = sign_types)
## [1] stop
## Levels: pedestrian speed stop
# Count the number of signs of each type
table(signs$sign_type)
## 
## pedestrian      speed       stop 
##         46         49         51
# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = signs, mean)
##    sign_type       r10
## 1 pedestrian 113.71739
## 2      speed  80.63265
## 3       stop 132.39216

1.2 Classifying a collection of road signs

Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.

test_signs <- read.csv("signs.csv")
test_signs <- test_signs %>% filter(sample == "test") %>% select(-id, -sample)

# Use kNN to identify the test road signs
sign_types <- signs$sign_type
signs_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)

# Create a confusion matrix of the predicted versus actual values
signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)
##             signs_actual
## signs_pred   pedestrian speed stop
##   pedestrian         19     2    0
##   speed               0    17    0
##   stop                0     2   19

1.3 Testing other ‘k’ values

By default, the knn() function in the class package uses only the single nearest neighbor. Setting a k parameter allows the algorithm to consider additional nearby neighbors. This enlarges the collection of neighbors which will vote on the predicted class. Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.

signs_test <- test_signs
# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types)
mean(signs_actual == k_1)
## [1] 0.9322034
# Modify the above to set k = 7
k_7 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 7)
mean(signs_actual == k_7)
## [1] 0.9661017
# Set k = 15 and compare to the above
k_15 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 15)
mean(signs_actual == k_15)
## [1] 0.8983051

1.4 Seeing how the neighbors voted

When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.

For example, knowing more about the voters’ confidence in the classification could allow an autonomous vehicle to use caution in the case there is any chance at all that a stop sign is ahead.

# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 7, prob = TRUE)

# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")

# Examine the first several predictions
head(sign_pred)
## [1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
## Levels: pedestrian speed stop
# Examine the proportion of votes for the winning class
head(sign_prob)
## [1] 0.5714286 0.5714286 0.8571429 0.5714286 0.8571429 0.5714286

2. Naive Bayes

Naive Bayes uses principles from the field of statistics to make predictions. This chapter will introduce the basics of Bayesian methods while exploring how to apply these techniques to iPhone-like destination suggestions.

2.1 Computing probabilities

The where9am data frame contains 91 days (thirteen weeks) worth of data in which Brett recorded his location at 9am each day as well as whether the daytype was a weekend or weekday.

where9am <- locations %>% 
              filter(hour == 9) %>% 
              select(daytype, location) %>% 
              mutate(across(everything(), as.factor))
head(where9am)
##   daytype location
## 1 weekday   office
## 2 weekday   office
## 3 weekday   office
## 4 weekend     home
## 5 weekend     home
## 6 weekday   campus
# Compute P(A) 
p_A <- nrow(subset(where9am, location == "office")) / nrow(where9am)

# Compute P(B)
p_B <- nrow(subset(where9am, daytype == "weekday")) / nrow(where9am)

# Compute the observed P(A and B)
p_AB <- nrow(subset(where9am, location == "office" & daytype == "weekday")) / nrow(where9am)

# Compute P(A | B) and print its value
p_A_given_B <- p_AB / p_B
p_A_given_B
## [1] 0.6

2.2 A simple Naive Bayes location model

The previous exercises showed that the probability that Brett is at work or at home at 9am is highly dependent on whether it is the weekend or a weekday. To see this finding in action, use the where9am data frame to build a Naive Bayes model on the same data.

thursday9am <- where9am[3:4,] %>% filter(daytype == "weekday") %>% select(-location)
saturday9am <- where9am[3:4,] %>% filter(daytype == "weekend") %>% select(-location)

# Build the location prediction model
locmodel <- naive_bayes(location ~ daytype, data = where9am)
## Warning: naive_bayes(): Feature daytype - zero probabilities are present.
## Consider Laplace smoothing.
# Predict Thursday's 9am location
predict(locmodel, thursday9am)
## [1] office
## Levels: appointment campus home office restaurant store theater
# Predict Saturdays's 9am location
predict(locmodel, saturday9am)
## [1] home
## Levels: appointment campus home office restaurant store theater
# Examine the location prediction model
locmodel
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = location ~ daytype, data = where9am)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
## appointment      campus        home      office  restaurant       store 
##  0.01098901  0.10989011  0.45054945  0.42857143  0.00000000  0.00000000 
##     theater 
##  0.00000000 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: daytype (Bernoulli) 
## --------------------------------------------------------------------------------- 
##          
## daytype   appointment    campus      home    office restaurant store theater
##   weekday   1.0000000 1.0000000 0.3658537 1.0000000                         
##   weekend   0.0000000 0.0000000 0.6341463 0.0000000                         
## 
## ---------------------------------------------------------------------------------
# Obtain the predicted probabilities for Thursday at 9am
predict(locmodel, thursday9am , type = "prob")
##      appointment campus home office restaurant store theater
## [1,]         NaN    NaN  NaN    NaN        NaN   NaN     NaN
# Obtain the predicted probabilities for Saturday at 9am
predict(locmodel, saturday9am , type = "prob")
##      appointment campus home office restaurant store theater
## [1,]         NaN    NaN  NaN    NaN        NaN   NaN     NaN

** Example** The Naive Bayes algorithm got its name because it makes a “naive” assumption about event independence.

What is the purpose of making this assumption?

* Independent events can never have a joint probability of zero.
* [*] The joint probability calculation is simpler for independent events.
* Conditional probability is undefined for dependent events.
* Dependent events cannot be used to make predictions.
weekday_afternoon <- locations[13,]
weekday_evening <- locations[19,]

# Build a NB model of location
locmodel <- naive_bayes(location ~ daytype + hourtype, data = locations)
## Warning: naive_bayes(): Feature daytype - zero probabilities are present.
## Consider Laplace smoothing.
## Warning: naive_bayes(): Feature hourtype - zero probabilities are present.
## Consider Laplace smoothing.
# Predict Brett's location on a weekday afternoon
predict(locmodel, weekday_afternoon)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] office
## Levels: appointment campus home office restaurant store theater
# Predict Brett's location on a weekday evening
predict(locmodel, weekday_evening)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] home
## Levels: appointment campus home office restaurant store theater

2.3 Preparing for unforeseen circumstances

While Brett was tracking his location over 13 weeks, he never went into the office during the weekend. Consequently, the joint probability of P(office and weekend) = 0.

Explore how this impacts the predicted probability that Brett may go to work on the weekend in the future. Additionally, you can see how using the Laplace correction will allow a small chance for these types of unforeseen circumstances.

weekend_afternoon <- locations[85,] %>% select(daytype, hourtype, location)
# Observe the predicted probabilities for a weekend afternoon
predict(locmodel, weekend_afternoon, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##      appointment       campus      home      office restaurant      store
## [1,]  0.02462883 0.0004802622 0.8439145 0.003349521  0.1111338 0.01641922
##          theater
## [1,] 7.38865e-05
# Build a new model using the Laplace correction
locmodel2 <- naive_bayes(location ~ daytype + hourtype, data = locations, laplace = 1)

# Observe the new predicted probabilities for a weekend afternoon
predict(locmodel2, weekend_afternoon, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##      appointment      campus      home      office restaurant      store
## [1,]  0.02013872 0.006187715 0.8308154 0.007929249  0.1098743 0.01871085
##          theater
## [1,] 0.006343697

** Example** By default, the naive_bayes() function in the naivebayes package does not use the Laplace correction. What is the risk of leaving this parameter unset?

* [*] Some potential outcomes may be predicted to be impossible.
* The algorithm may have a divide by zero error.
* Naive Bayes will ignore features with zero values.
* The model may not estimate probabilities for some cases.

Building simple logistic regression models

When building a regression model, it is often helpful to form a hypothesis about which independent variables will be predictive of the dependent variable. The bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation.

# Examine the dataset to identify potential independent variables
str(donors)
## spec_tbl_df [93,462 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ donated          : num [1:93462] 0 0 0 0 0 0 0 0 0 0 ...
##  $ veteran          : num [1:93462] 0 0 0 0 0 0 0 0 0 0 ...
##  $ bad_address      : num [1:93462] 0 0 0 0 0 0 0 0 0 0 ...
##  $ age              : num [1:93462] 60 46 NA 70 78 NA 38 NA NA 65 ...
##  $ has_children     : num [1:93462] 0 1 0 0 1 0 1 0 0 0 ...
##  $ wealth_rating    : num [1:93462] 0 3 1 2 1 0 2 3 1 0 ...
##  $ interest_veterans: num [1:93462] 0 0 0 0 0 0 0 0 0 0 ...
##  $ interest_religion: num [1:93462] 0 0 0 0 1 0 0 0 0 0 ...
##  $ pet_owner        : num [1:93462] 0 0 0 0 0 0 1 0 0 0 ...
##  $ catalog_shopper  : num [1:93462] 0 0 0 0 1 0 0 0 0 0 ...
##  $ recency          : chr [1:93462] "CURRENT" "CURRENT" "CURRENT" "CURRENT" ...
##  $ frequency        : chr [1:93462] "FREQUENT" "FREQUENT" "FREQUENT" "FREQUENT" ...
##  $ money            : chr [1:93462] "MEDIUM" "HIGH" "MEDIUM" "MEDIUM" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   donated = col_double(),
##   ..   veteran = col_double(),
##   ..   bad_address = col_double(),
##   ..   age = col_double(),
##   ..   has_children = col_double(),
##   ..   wealth_rating = col_double(),
##   ..   interest_veterans = col_double(),
##   ..   interest_religion = col_double(),
##   ..   pet_owner = col_double(),
##   ..   catalog_shopper = col_double(),
##   ..   recency = col_character(),
##   ..   frequency = col_character(),
##   ..   money = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Explore the dependent variable
table(donors$donated)
## 
##     0     1 
## 88751  4711
# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans, 
                      data = donors, family = "binomial")

# Summarize the model results
summary(donation_model)
## 
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5

3. Logistic Regression

3.1 Building simple logistic regression models

In the previous exercise, you used the glm() function to build a logistic regression model of donor behavior. As with many of R’s machine learning methods, you can apply the predict() function to the model object to forecast future behavior. By default, predict() outputs predictions in terms of log odds unless type = “response” is specified. This converts the log odds to probabilities.

Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. One must balance the extremes of being too cautious versus being too aggressive. For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities that still choose to donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.

# Explore the dependent variable
table(donors$donated)
## 
##     0     1 
## 88751  4711
# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans, 
                      data = donors, family = "binomial")

# Summarize the model results
summary(donation_model)
## 
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5

3.2 Making a binary prediction

# Estimate the donation probability
donors$donation_prob <- predict(donation_model, type = "response")

# Find the donation probability of the average prospect
mean(donors$donated)
## [1] 0.05040551
# Predict a donation if probability of donation is greater than average (0.0504)
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)

# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
## [1] 0.794815

3.3 Calculating ROC Curves and AUC

The previous exercises have demonstrated that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model’s performance better illustrates the trade off between a model that is overly aggressive and one that is overly passive.

In this exercise you will create a ROC curve and compute the area under the curve (AUC) to evaluate the logistic regression model of donations you built earlier.

# Create a ROC curve
ROC <- roc(donors$donated, donors$donation_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot the ROC curve
plot(ROC, col = "blue")

# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.5102

3.4 Coding Categorical Features

Sometimes a dataset contains numeric values that represent a categorical feature. In the donors dataset, wealth_rating uses numbers to indicate the donor’s wealth level: * 0 = Unknown * 1 = Low * 2 = Medium * 3 = High

# Convert the wealth rating to a factor
donors$wealth_levels <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))

# Use relevel() to change reference category
donors$wealth_levels <- relevel(donors$wealth_levels, ref = "Medium")

# See how our factor coding impacts the model
summary(glm(donated ~ wealth_levels, data = donors, family = "binomial"))
## 
## Call:
## glm(formula = donated ~ wealth_levels, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3320  -0.3243  -0.3175  -0.3175   2.4582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
## wealth_levelsUnknown -0.04373    0.04243  -1.031    0.303    
## wealth_levelsLow     -0.05245    0.05332  -0.984    0.325    
## wealth_levelsHigh     0.04804    0.04768   1.008    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37323  on 93458  degrees of freedom
## AIC: 37331
## 
## Number of Fisher Scoring iterations: 5

3.5 Handling Missing Data

Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.

One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.

# Find the average age among non-missing values
summary(donors$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   48.00   62.00   61.65   75.00   98.00   22546
# Impute missing age values with the mean age
donors$imputed_age <- ifelse(is.na(donors$age), round(mean(donors$age, na.rm = TRUE), 2), donors$age)

# Create missing value indicator for age
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)

3.6 Building a more Sophisticated Model

One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M: - Recency - Frequency - Money Donors that haven’t given both recently and frequently may be especially likely to give again; in other words, the combined impact of recency and frequency may be greater than the sum of the separate effects.

# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ recency * frequency + money, data = donors, family = "binomial")

# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)
## 
## Call:
## glm(formula = donated ~ recency * frequency + money, family = "binomial", 
##     data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3696  -0.3696  -0.2895  -0.2895   2.7924  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.01142    0.04279 -70.375   <2e-16 ***
## recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364 *  
## frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16 ***
## moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 36938  on 93457  degrees of freedom
## AIC: 36948
## 
## Number of Fisher Scoring iterations: 6
# Compute predicted probabilities for the RFM model
rfm_prob <- predict(rfm_model, data = donors, type = "response")

# Plot the ROC curve and find AUC for the new model
library(pROC)
ROC <- roc(donors$donated, rfm_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")

auc(ROC)
## Area under the curve: 0.5785

3.7 Building a stepwise regression model

In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.

donors <- donors %>% select(-age, -wealth_levels, -donation_prob, -donation_pred) %>% drop_na()

# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")

# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")

# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
## Start:  AIC=37332.13
## donated ~ 1
## 
##                     Df Deviance   AIC
## + frequency          1    37021 37025
## + money              1    37209 37213
## + pet_owner          1    37322 37326
## + interest_veterans  1    37323 37327
## + has_children       1    37323 37327
## + imputed_age        1    37323 37327
## + wealth_rating      1    37324 37328
## + missing_age        1    37325 37329
## + bad_address        1    37325 37329
## + catalog_shopper    1    37327 37331
## + interest_religion  1    37327 37331
## + recency            1    37328 37332
## <none>                    37330 37332
## + veteran            1    37330 37334
## 
## Step:  AIC=37024.77
## donated ~ frequency
## 
##                     Df Deviance   AIC
## + money              1    36944 36950
## + pet_owner          1    37014 37020
## + has_children       1    37014 37020
## + bad_address        1    37014 37020
## + wealth_rating      1    37015 37021
## + interest_veterans  1    37015 37021
## + missing_age        1    37015 37021
## + catalog_shopper    1    37018 37024
## + interest_religion  1    37018 37024
## + imputed_age        1    37018 37024
## + recency            1    37018 37024
## <none>                    37021 37025
## + veteran            1    37021 37027
## 
## Step:  AIC=36949.71
## donated ~ frequency + money
## 
##                     Df Deviance   AIC
## + has_children       1    36936 36944
## + pet_owner          1    36937 36945
## + wealth_rating      1    36937 36945
## + missing_age        1    36938 36946
## + bad_address        1    36938 36946
## + interest_veterans  1    36938 36946
## + catalog_shopper    1    36941 36949
## + imputed_age        1    36941 36949
## + interest_religion  1    36942 36950
## <none>                    36944 36950
## + recency            1    36942 36950
## + veteran            1    36944 36952
## 
## Step:  AIC=36944.32
## donated ~ frequency + money + has_children
## 
##                     Df Deviance   AIC
## + pet_owner          1    36927 36937
## + wealth_rating      1    36928 36938
## + missing_age        1    36928 36938
## + interest_veterans  1    36930 36940
## + bad_address        1    36930 36940
## + catalog_shopper    1    36933 36943
## + interest_religion  1    36933 36943
## <none>                    36936 36944
## + recency            1    36934 36944
## + imputed_age        1    36936 36946
## + veteran            1    36936 36946
## 
## Step:  AIC=36937.12
## donated ~ frequency + money + has_children + pet_owner
## 
##                     Df Deviance   AIC
## + wealth_rating      1    36920 36932
## + missing_age        1    36921 36933
## + bad_address        1    36921 36933
## + interest_veterans  1    36924 36936
## <none>                    36927 36937
## + recency            1    36925 36937
## + imputed_age        1    36926 36938
## + interest_religion  1    36926 36938
## + catalog_shopper    1    36926 36938
## + veteran            1    36927 36939
## 
## Step:  AIC=36932.08
## donated ~ frequency + money + has_children + pet_owner + wealth_rating
## 
##                     Df Deviance   AIC
## + bad_address        1    36915 36929
## + missing_age        1    36916 36930
## + interest_veterans  1    36918 36932
## <none>                    36920 36932
## + recency            1    36918 36932
## + interest_religion  1    36919 36933
## + imputed_age        1    36919 36933
## + catalog_shopper    1    36920 36934
## + veteran            1    36920 36934
## 
## Step:  AIC=36928.65
## donated ~ frequency + money + has_children + pet_owner + wealth_rating + 
##     bad_address
## 
##                     Df Deviance   AIC
## + missing_age        1    36911 36927
## + interest_veterans  1    36912 36928
## <none>                    36915 36929
## + recency            1    36913 36929
## + imputed_age        1    36914 36930
## + interest_religion  1    36914 36930
## + catalog_shopper    1    36914 36930
## + veteran            1    36915 36931
## 
## Step:  AIC=36926.94
## donated ~ frequency + money + has_children + pet_owner + wealth_rating + 
##     bad_address + missing_age
## 
##                     Df Deviance   AIC
## + interest_veterans  1    36909 36927
## <none>                    36911 36927
## + recency            1    36909 36927
## + imputed_age        1    36910 36928
## + interest_religion  1    36910 36928
## + catalog_shopper    1    36911 36929
## + veteran            1    36911 36929
## 
## Step:  AIC=36926.91
## donated ~ frequency + money + has_children + pet_owner + wealth_rating + 
##     bad_address + missing_age + interest_veterans
## 
##                     Df Deviance   AIC
## <none>                    36909 36927
## + recency            1    36907 36927
## + imputed_age        1    36908 36928
## + interest_religion  1    36908 36928
## + catalog_shopper    1    36909 36929
## + veteran            1    36909 36929
# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")

# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors$donated, step_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")

auc(ROC)
## Area under the curve: 0.5871

4. Classification Trees

Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed, such as in loan approval. We’ll use the Lending Club dataset to simulate this scenario.

4.1 Building a simple decision tree

good_credit <- loans[18,]
bad_credit <- loans[8,]

loans <- loans %>% mutate(outcome = factor(!default, labels = c("default", "repaid"))) %>% 
                   filter(keep == 1) %>% 
                   select(-keep, -rand, -default)

# Build a lending model predicting loan outcome versus loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))

# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
##      1 
## repaid 
## Levels: default repaid
# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
##       1 
## default 
## Levels: default repaid

4.2 Visualizing classification trees

Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.

The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions.

# Examine the loan_model object
loan_model
## n= 11312 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 11312 5654 repaid (0.4998232 0.5001768)  
##    2) credit_score=AVERAGE,LOW 9490 4437 default (0.5324552 0.4675448)  
##      4) credit_score=LOW 1667  631 default (0.6214757 0.3785243) *
##      5) credit_score=AVERAGE 7823 3806 default (0.5134859 0.4865141)  
##       10) loan_amount=HIGH 2472 1079 default (0.5635113 0.4364887) *
##       11) loan_amount=LOW,MEDIUM 5351 2624 repaid (0.4903756 0.5096244)  
##         22) loan_amount=LOW 1810  874 default (0.5171271 0.4828729) *
##         23) loan_amount=MEDIUM 3541 1688 repaid (0.4767015 0.5232985) *
##    3) credit_score=HIGH 1822  601 repaid (0.3298573 0.6701427) *
# Load the rpart.plot package
library(rpart.plot)

# Plot the loan_model with default settings
rpart.plot(loan_model)

# Plot the loan_model with customized settings
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)

4.3 Creating Random Test Datasets

Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.

As depicted in the following image, you can use 75% of the observations for training and 25% for testing the model. The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.

# Determine the number of rows for training
nrow(loans) * 0.75
## [1] 8484
# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans) * 0.75)

# Create the training dataset
loans_train <- loans[sample_rows, ]

# Create the test dataset
loans_test <- loans[-sample_rows, ]

4.4 Building and Evaluating a Larger Tree

Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.

Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions. Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.

The rpart package is loaded into the workspace and the loans_train and loans_test datasets have been created.

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
##          
##           default repaid
##   default     801    608
##   repaid      603    816
# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5717822

4.5 Preventing Overgrown Trees

The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret. Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.

# Grow a tree with maxdepth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, maxdepth = 6))

# Make a class prediction on the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Compute the accuracy of the simpler tree
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5862801
# Swap maxdepth for a minimum split of 500 
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, minsplit = 500))

# Run this. How does the accuracy change?
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.594413

4.6 Creating a Nicely Pruned Tree

Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later. By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.

In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.

# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Examine the complexity plot
plotcp(loan_model)

# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5834512

4.7 Building a Random Forest Model

In spite of the fact that a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly-tuned tree. Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.

Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.

# Build a random forest model
## loan_model <- randomForest(outcome ~ ., data = loans_train)

# Compute the accuracy of the random forest
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)
## [1] 0

5. Supervised Learning in R: Regression

From a machine learning perspective, regression is the task of predicting numerical outcomes from various inputs. In this course, you’ll learn about different regression models, how to train these models in R, how to evaluate the models you train and use them to make predictions.

setwd("/Users/arifmathiq/R Data")
unemployment <- readRDS("unemployment.rds")
bloodpressure <- readRDS("bloodpressure.rds")

summary(unemployment)
##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)
# Print it
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

5.1 Examining a Model

Let’s look at the model unemployment_model that you have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().

# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.821         0.805 0.580      50.6 0.0000197     1  -10.3  26.6  28.3
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."

5.2 Predicting from the unemployment model

newrates<-data.frame("male_unemployment"=5)

# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model, unemployment)

# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
   geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model,newrates)
# Print it
pred
##        1 
## 4.906757

5.3 Multivariate Linear Regression

# bloodpressure is in the workspace
summary(bloodpressure)
##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220
# Create the formula and print it
fmla <- blood_pressure~age+weight
fmla
## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla,data=bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349
summary(bloodpressure_model)
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4640 -1.1949 -0.4078  1.8511  2.6981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  30.9941    11.9438   2.595  0.03186 * 
## age           0.8614     0.2482   3.470  0.00844 **
## weight        0.3349     0.1307   2.563  0.03351 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9711 
## F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)

# plot the results
ggplot(bloodpressure, aes(x=prediction, y=blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")

6. Training and Evaluating Regression Models

Now that we have learned how to fit basic linear regression models, we will learn how to evaluate how well our models perform. We will review evaluating a model graphically, and look at two basic metrics for regression models. We will also learn how to train a model that will perform well in the wild, not just on training data. Although we will demonstrate these techniques using linear regression, all these concepts apply to models fit with any regression algorithm.

# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, unemployment)

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) + 
  geom_point() + 
  geom_abline()

# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment-unemployment$predictions

# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

6.1 The Gain Curve to Evaluate the Unemployment Model

In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals. Now, you will also plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.

# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")

6.2 Root Mean Squared Error (RMSE)

We want RMSE to be small, One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.

# For convenience put the residuals in the variable res
res <- unemployment$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271

6.3 Calculate R-squared

Now that you’ve calculated the RMSE of your model’s predictions, you will examine how well the model fits the data: that is, how much variance does it explain.

# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum( (unemployment$female_unemployment - fe_mean)^2 ))
## [1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals^2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - (rss/tss))
## [1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157

6.4 Correlation and R-squared

The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively: * the outcomes of a regression model that minimizes squared-error (like linear regression) and * the true outcomes of the training data, * then the square of the correlation is the same as R^2

# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
## [1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho ^ 2)
## [1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157

6.5 Generating a random test/train split

For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

dim(mpg)
## [1] 234  11
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
## [1] 234
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(0.75 * N))
## [1] 176
# Create the vector of N uniform random variables: gp
gp <- runif(N)

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp<0.75,]
mpg_test <-  mpg[gp>=0.75,]

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
## [1] 180
nrow(mpg_test)
## [1] 54
# mpg_train is in the workspace
summary(mpg_train)
##  manufacturer          model               displ            year     
##  Length:180         Length:180         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.100   Median :2004  
##                                        Mean   :3.392   Mean   :2004  
##                                        3rd Qu.:4.450   3rd Qu.:2008  
##                                        Max.   :6.200   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:180         Length:180         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.778                                         Mean   :17.02  
##  3rd Qu.:8.000                                         3rd Qu.:20.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:180         Length:180        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :25.00   Mode  :character   Mode  :character  
##  Mean   :23.68                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
## cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, mpg_train)

# Use summary() to examine the model
summary(mpg_model)
## 
## Call:
## lm(formula = fmla, data = mpg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9479 -0.7726  0.0527  0.6526  4.4517 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.44642    0.39228   1.138    0.257    
## hwy          0.70006    0.01606  43.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.29 on 178 degrees of freedom
## Multiple R-squared:  0.9143, Adjusted R-squared:  0.9138 
## F-statistic:  1900 on 1 and 178 DF,  p-value: < 2.2e-16

6.6 Evaluate a model using test/train split

Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience: predcol: The predicted values, ycol: The actual outcome. Generally, model performance is better on the training data than the test data (though sometimes the test set “gets lucky”).

rmse<-function(predcol, ycol) {
  res <- predcol - ycol
  sqrt(mean(res^2))
}

r_squared<-function(predcol, ycol) {
  tss = sum( (ycol - mean(ycol))^2 )
  rss = sum( (predcol - ycol)^2 )
  1 - rss/tss
}

# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model, mpg_train)

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, mpg_test)

# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred,mpg_train$cty))
## [1] 1.282898
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.138501
# Evaluate the r-squared on both training and test data.and print them
(rsq_train <- r_squared(mpg_train$pred,mpg_train$cty))
## [1] 0.9143272
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.9056238
ggplot(mpg_test, aes(x = pred, y = cty)) +
  geom_point() +
  geom_abline()

6.7 Create a cross validation plan

There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call: splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y) where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

train: the indices of dframe that will form the training set app: the indices of dframe that will form the test (or application) set.

# Get the number of rows in mpg
nRows <- nrow(mpg)

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3,NULL,NULL)

# Examine the split plan
str(splitPlan)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 3 4 5 7 8 9 10 12 ...
##   ..$ app  : int [1:78] 172 80 52 217 96 130 196 157 30 75 ...
##  $ :List of 2
##   ..$ train: int [1:156] 2 6 7 8 11 14 18 19 20 21 ...
##   ..$ app  : int [1:78] 12 9 84 1 104 3 115 38 188 93 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 4 5 6 9 10 11 12 13 ...
##   ..$ app  : int [1:78] 162 98 163 47 92 81 64 161 44 139 ...
##  - attr(*, "splitmethod")= chr "kwaycross"

6.8 Evaluate a modeling procedure using n-fold cross-validation

the 3-fold cross validation to make predictions from a model that predicts mpgctyfrommpghwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

Initialize a column of the appropriate length dframe$pred.cv <- 0 k is the number of folds splitPlan is the cross validation plan

# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.288596

7. Issues to Consider

Before moving on to more sophisticated regression techniques, we will look at some other modeling issues: modeling with categorical inputs, interactions between variables, and when you might consider transforming inputs and outputs before modeling. While more sophisticated regression techniques manage some of these issues automatically, it’s important to be aware of them, in order to understand which methods best handle various issues – and which issues you must still manage yourself.

7.1 Examining the sStructure of Categorical Inputs

For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace. It has the following columns:

library(Sleuth3)
library(dplyr)
flowers <- case0901
# Call str on flowers to see the types of each column
str(flowers)
## 'data.frame':    24 obs. of  3 variables:
##  $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
##  $ Time     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
## [1] 1 2
(fmla <- as.formula("Flowers ~ Intensity + Time"))
## Flowers ~ Intensity + Time
mmat <- model.matrix(fmla, flowers)

# Examine the first 20 lines of flowers
head(flowers, n = 20)
##    Flowers Time Intensity
## 1     62.3    1       150
## 2     77.4    1       150
## 3     55.3    1       300
## 4     54.2    1       300
## 5     49.6    1       450
## 6     61.9    1       450
## 7     39.4    1       600
## 8     45.7    1       600
## 9     31.3    1       750
## 10    44.9    1       750
## 11    36.8    1       900
## 12    41.9    1       900
## 13    77.8    2       150
## 14    75.6    2       150
## 15    69.1    2       300
## 16    78.0    2       300
## 17    57.0    2       450
## 18    71.1    2       450
## 19    62.9    2       600
## 20    52.2    2       600
# Examine the first 20 lines of mmat
head(mmat, n = 20)
##    (Intercept) Intensity Time
## 1            1       150    1
## 2            1       150    1
## 3            1       300    1
## 4            1       300    1
## 5            1       450    1
## 6            1       450    1
## 7            1       600    1
## 8            1       600    1
## 9            1       750    1
## 10           1       750    1
## 11           1       900    1
## 12           1       900    1
## 13           1       150    2
## 14           1       150    2
## 15           1       300    2
## 16           1       300    2
## 17           1       450    2
## 18           1       450    2
## 19           1       600    2
## 20           1       600    2
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <-  lm(fmla, data = flowers)

# Use summary on mmat to remind yourself of its structure
summary(mmat)
##   (Intercept)   Intensity        Time    
##  Min.   :1    Min.   :150   Min.   :1.0  
##  1st Qu.:1    1st Qu.:300   1st Qu.:1.0  
##  Median :1    Median :525   Median :1.5  
##  Mean   :1    Mean   :525   Mean   :1.5  
##  3rd Qu.:1    3rd Qu.:750   3rd Qu.:2.0  
##  Max.   :1    Max.   :900   Max.   :2.0
# Use summary to examine the flower_model
summary(flower_model)
## 
## Call:
## lm(formula = fmla, data = flowers)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.652 -4.139 -1.558  5.632 12.165 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 59.147500   4.954465  11.938 8.01e-11 ***
## Intensity   -0.040471   0.005132  -7.886 1.04e-07 ***
## Time        12.158333   2.629557   4.624 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:   0.78 
## F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08
# predict the number of flowers on each plant
flowers$predictions <- predict(flower_model)

# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue")

7.2 Modeling an Interaction

In this exercise you will use interactions to model the effect of gender and gastric activity on alcohol metabolism.We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model. Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.

alcohol <- case1101
# alcohol is in the workspace
summary(alcohol)
##     Subject         Metabol          Gastric          Sex    
##  Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18  
##  1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14  
##  Median :16.50   Median : 1.700   Median :1.600              
##  Mean   :16.50   Mean   : 2.422   Mean   :1.863              
##  3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200              
##  Max.   :32.00   Max.   :12.300   Max.   :5.200              
##           Alcohol  
##  Alcoholic    : 8  
##  Non-alcoholic:24  
##                    
##                    
##                    
## 
# Create the formula with main effects only
(fmla_add <- Metabol ~ Gastric + Sex)
## Metabol ~ Gastric + Sex
# Create the formula with interactions
(fmla_interaction <- Metabol ~  Gastric + Gastric:Sex)
## Metabol ~ Gastric + Gastric:Sex
# Fit the main effects only model
model_add <- lm(fmla_add, data = alcohol)

# Fit the interaction model
model_interaction <- lm(fmla_interaction, data = alcohol)

# Call summary on both models and compare
summary(model_add)
## 
## Call:
## lm(formula = fmla_add, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2779 -0.6328 -0.0966  0.5783  4.5703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.9466     0.5198  -3.745 0.000796 ***
## Gastric       1.9656     0.2674   7.352 4.24e-08 ***
## SexMale       1.6174     0.5114   3.163 0.003649 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.331 on 29 degrees of freedom
## Multiple R-squared:  0.7654, Adjusted R-squared:  0.7492 
## F-statistic: 47.31 on 2 and 29 DF,  p-value: 7.41e-10
summary(model_interaction)
## 
## Call:
## lm(formula = fmla_interaction, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4656 -0.5091  0.0143  0.5660  4.0668 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.7504     0.5310  -1.413 0.168236    
## Gastric           1.1489     0.3450   3.331 0.002372 ** 
## Gastric:SexMale   1.0422     0.2412   4.321 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.204 on 29 degrees of freedom
## Multiple R-squared:  0.8081, Adjusted R-squared:  0.7948 
## F-statistic: 61.05 on 2 and 29 DF,  p-value: 4.033e-11

7.3 Modeling an Interaction (2)

In this exercise, you will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.

# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}

# Get RMSE
alcohol %>% 
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol - pred) %>%
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 × 2
##   modeltype         rmse
##   <chr>            <dbl>
## 1 pred_add          1.38
## 2 pred_interaction  1.30

7.4 Transforming the Response Before Modeling

load("~/R Data/Income.RData")
income_train <- incometrain
income_test <- incometest

# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005) ~ Arith + Word + Parag + Math + AFQT)
## log(Income2005) ~ Arith + Word + Parag + Math + AFQT
# Fit the linear model
model.log <- lm(fmla.log, data = income_train)

# Make predictions on income_test
income_test$logpred <- predict(model.log, newdata = income_test)
summary(income_test$logpred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.766  10.133  10.423  10.419  10.705  11.006
# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17432   25167   33615   35363   44566   60217
#  Plot predicted income (x axis) vs income
ggplot(income_test, aes(x = pred.income, y = Income2005)) + 
  geom_point() + 
  geom_abline(color = "blue")

7.5 Comparing RMSE and root-mean-squared Relative Error

In this exercise, you will show that log-transforming a monetary output before modeling improves mean relative error (but increases RMSE) compared to modeling the monetary output directly. You will compare the results of model.log from the previous exercise to a model (model.abs) that directly fits income.

# Write the formula for income as a function of the tests and print it
fmla.abs <- Income2005 ~ Arith + Word + Parag + Math + AFQT

# Fit the linear model
model.abs <- lm(fmla.abs, data = income_train)

# fmla.abs is in the workspace
fmla.abs
## Income2005 ~ Arith + Word + Parag + Math + AFQT
# model.abs is in the workspace
summary(model.abs)
## 
## Call:
## lm(formula = fmla.abs, data = income_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78728 -24137  -6979  11964 648573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17516.7     6420.1   2.728  0.00642 ** 
## Arith         1552.3      303.4   5.116 3.41e-07 ***
## Word          -132.3      265.0  -0.499  0.61754    
## Parag        -1155.1      618.3  -1.868  0.06189 .  
## Math           725.5      372.0   1.950  0.05127 .  
## AFQT           177.8      144.1   1.234  0.21734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45500 on 2063 degrees of freedom
## Multiple R-squared:  0.1165, Adjusted R-squared:  0.1144 
## F-statistic:  54.4 on 5 and 2063 DF,  p-value: < 2.2e-16
# Add predictions to the test set
income_test <- income_test %>%
  mutate(pred.absmodel = predict(model.abs, income_test),      # predictions from model.abs
         pred.logmodel = exp(predict(model.log, income_test))) # predictions from model.log

# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>% 
  gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
  mutate(residual = pred - Income2005,     # residuals
         relerr   = residual / Income2005) # relative error

# Calculate RMSE and relative RMSE and compare
income_long %>% 
  group_by(modeltype) %>%                       # group by modeltype
  summarize(rmse     = sqrt(mean(residual^2)),  # RMSE
            rmse.rel = sqrt(mean(relerr^2)))    # Root mean squared relative error
## # A tibble: 2 × 3
##   modeltype       rmse rmse.rel
##   <chr>          <dbl>    <dbl>
## 1 pred.absmodel 37448.     3.18
## 2 pred.logmodel 39235.     2.22

7.6 Input Transforms: the “hockey stick”

In this exercise, we will build a model to predict price from a measure of the house’s size (surface area).

houseprice <- readRDS("~/R Data/houseprice.rds")

# houseprice is in the workspace
summary(houseprice)
##       size           price      
##  Min.   : 44.0   Min.   : 42.0  
##  1st Qu.: 73.5   1st Qu.:164.5  
##  Median : 91.0   Median :203.5  
##  Mean   : 94.3   Mean   :249.2  
##  3rd Qu.:118.5   3rd Qu.:287.8  
##  Max.   :150.0   Max.   :573.0
# Create the formula for price as a function of squared size
(fmla_sqr <- price ~ I(size^2))
## price ~ I(size^2)
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, houseprice)

# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, houseprice)

# Make predictions and compare
houseprice %>% 
    mutate(pred_lin = predict(model_lin),       # predictions from linear model
           pred_sqr = predict(model_sqr)) %>%   # predictions from quadratic model 
    gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
    ggplot(aes(x = size)) + 
       geom_point(aes(y = price)) +                  # actual prices
       geom_line(aes(y = pred, color = modeltype)) + # the predictions
       scale_color_brewer(palette = "Dark2")

7.7 Input Transforms: the “hockey stick” (2)

In the last exercise you saw that a quadratic model seems to fit the houseprice data better than a linear model. In this exercise you will confirm whether the quadratic model would perform better on out-of-sample data. Since this data set is small, you will use cross-validation.

# Create a splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(houseprice), 3, NULL, NULL)

# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train, ])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app, ])
}

# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
  houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}

# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
  mutate(residuals = pred - price)

# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 × 2
##   modeltype  rmse
##   <chr>     <dbl>
## 1 pred_lin   73.1
## 2 pred_sqr   60.3

8. Dealing with Non-Linear Responses

Now that we have mastered linear models, we will begin to look at techniques for modeling situations that don’t meet the assumptions of linearity. This includes predicting probabilities and frequencies (values bounded between 0 and 1); predicting counts (nonnegative integer values, and associated rates); and responses that have a non-linear but additive relationship to the inputs. These algorithms are variations on the standard linear model.

8.1 Logistic regression to predict probabilities

Now that we have mastered linear models, we will begin to look at techniques for modeling situations that don’t meet the assumptions of linearity. This includes predicting probabilities and frequencies (values bounded between 0 and 1); predicting counts (nonnegative integer values, and associated rates); and responses that have a non-linear but additive relationship to the inputs. These algorithms are variations on the standard linear model.

# sparrow is in the workspace
sparrow <- readRDS("~/R Data/sparrow.rds")
summary(sparrow)
##       status       age             total_length      wingspan    
##  Perished:36   Length:87          Min.   :153.0   Min.   :236.0  
##  Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0  
##                Mode  :character   Median :160.0   Median :247.0  
##                                   Mean   :160.4   Mean   :247.5  
##                                   3rd Qu.:162.5   3rd Qu.:251.0  
##                                   Max.   :167.0   Max.   :256.0  
##      weight       beak_head        humerus           femur       
##  Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
##  1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
##  Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
##  Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
##  3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
##  Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
##     legbone          skull           sternum      
##  Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
##  1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
##  Median :1.130   Median :0.6000   Median :0.8500  
##  Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
##  3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
##  Max.   :1.230   Max.   :0.6400   Max.   :0.9300
# Create the survived column
sparrow$survived <- sparrow$status == "Survived"

# Create the formula
(fmla <- survived ~ total_length + weight + humerus)
## survived ~ total_length + weight + humerus
# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = binomial)

# Call summary
summary(sparrow_model)
## 
## Call:
## glm(formula = fmla, family = binomial, data = sparrow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1117  -0.6026   0.2871   0.6577   1.7082  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
## total_length  -0.5435     0.1409  -3.858 0.000115 ***
## weight        -0.5689     0.2771  -2.053 0.040060 *  
## humerus       75.4610    19.1586   3.939 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  75.094  on 83  degrees of freedom
## AIC: 83.094
## 
## Number of Fisher Scoring iterations: 5
# Call glance
(perf <- glance(sparrow_model))
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          118.      86  -37.5  83.1  93.0     75.1          83    87
# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance/perf$null.deviance)
## [1] 0.3636526
predict(model, type = "response")
##         1         2         3         4         5         6         7         8 
## 20.397635 21.705665 21.051650 18.435590 18.435590 17.781575 19.743620 19.089605 
##         9        10        11        12        13        14        15        16 
## 17.781575 17.781575 17.781575 17.781575 17.127560 17.781575 14.511499 11.241424 
##        17        18        19        20        21        22        23        24 
## 14.511499 12.549454 18.435590 16.473545 18.435590 17.781575 17.127560 13.857484 
##        25        26        27        28        29        30        31        32 
## 10.587409 11.241424 12.549454 21.051650 18.435590 18.435590 17.127560 15.819530 
##        33        34        35        36        37        38        39        40 
## 17.127560 15.819530 16.473545 12.549454 13.857484  9.279379 11.241424 12.549454 
##        41        42        43        44        45        46        47        48 
## 12.549454  9.279379 12.549454 11.895439 13.203469 11.895439  9.279379  9.279379 
##        49        50        51        52        53        54        55        56 
## 12.549454 12.549454 12.549454 12.549454 12.549454 12.549454 12.549454 12.549454 
##        57        58        59        60        61        62        63        64 
## 11.895439 11.241424 18.435590 18.435590 17.127560 15.165514 15.819530 16.473545 
##        65        66        67        68        69        70        71        72 
## 14.511499 23.013695 22.359680 20.397635 22.359680 24.975741 24.975741 19.089605 
##        73        74        75        76        77        78        79        80 
## 21.051650 21.705665 18.435590 18.435590 19.743620 18.435590 20.397635 19.743620 
##        81        82        83        84        85        86        87        88 
## 17.127560 13.857484 14.511499 12.549454  9.279379 13.857484 13.203469 10.587409 
##        89        90        91        92        93        94        95        96 
## 11.241424 13.203469 13.203469 11.241424 12.549454 12.549454 12.549454 22.359680 
##        97        98        99       100       101       102       103       104 
## 19.089605 18.435590 12.549454 12.549454 14.511499 13.203469 18.435590 18.435590 
##       105       106       107       108       109       110       111       112 
## 19.089605 19.743620 17.781575 17.781575 18.435590 16.473545 18.435590 18.435590 
##       113       114       115       116       117       118       119       120 
## 18.435590 17.781575 19.089605 17.781575 19.089605 14.511499 13.857484 12.549454 
##       121       122       123       124       125       126       127       128 
## 12.549454 19.089605 21.705665 21.705665 18.435590 18.435590 19.089605 20.397635 
##       129       130       131       132       133       134       135       136 
## 21.705665 21.705665 19.089605 23.013695 24.321726 24.321726 11.241424 14.511499 
##       137       138       139       140       141       142       143       144 
## 15.819530 12.549454 13.857484 14.511499 20.397635 18.435590 20.397635 30.207861 
##       145       146       147       148       149       150       151       152 
## 20.397635 18.435590 20.397635 20.397635 17.127560 28.245816 18.435590 20.397635 
##       153       154       155       156 
## 19.743620 20.397635 18.435590 18.435590
# GainCurvePlot(frame, xvar, truthVar, title)

8.2 Poisson and quasipoisson regression to predict counts

Linear regression values in [−∞,∞], Counts: integers in range [0,∞] * Poisson/Quasipoisson Regression * glm(formula, data, family) * family: either poisson or quasipoisson * inputs additive and linear in log(count

load("~/R Data/Bikes.RData")
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The inputs to use
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp","atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
## [1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
## [1] 45863.84
# Fit the model
bike_model <- glm(fmla, data = bikesJuly, family = quasipoisson)

# Call glance
(perf <- glance(bike_model))
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1       133365.     743     NA    NA    NA   28775.         712   744
# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance/perf$null.deviance)
## [1] 0.7842393

8.3 Predict bike rentals on new data

In this exercise you will use the model you built in the previous exercise to make predictions for the month of August. The data set bikesAugust has the same columns as bikesJuly.

Recall that you must specify type = “response” with predict() when predicting counts from a glm poisson or quasipoisson model.

# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# bike_model is in the workspace
summary(bike_model)
## 
## Call:
## glm(formula = fmla, family = quasipoisson, data = bikesJuly)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -21.6117   -4.3121   -0.7223    3.5507   16.5079  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.934986   0.439027  13.519  < 2e-16 ***
## hr1                           -0.580055   0.193354  -3.000 0.002794 ** 
## hr2                           -0.892314   0.215452  -4.142 3.86e-05 ***
## hr3                           -1.662342   0.290658  -5.719 1.58e-08 ***
## hr4                           -2.350204   0.393560  -5.972 3.71e-09 ***
## hr5                           -1.084289   0.230130  -4.712 2.96e-06 ***
## hr6                            0.211945   0.156476   1.354 0.176012    
## hr7                            1.211135   0.132332   9.152  < 2e-16 ***
## hr8                            1.648361   0.127177  12.961  < 2e-16 ***
## hr9                            1.155669   0.133927   8.629  < 2e-16 ***
## hr10                           0.993913   0.137096   7.250 1.09e-12 ***
## hr11                           1.116547   0.136300   8.192 1.19e-15 ***
## hr12                           1.282685   0.134769   9.518  < 2e-16 ***
## hr13                           1.273010   0.135872   9.369  < 2e-16 ***
## hr14                           1.237721   0.136386   9.075  < 2e-16 ***
## hr15                           1.260647   0.136144   9.260  < 2e-16 ***
## hr16                           1.515893   0.132727  11.421  < 2e-16 ***
## hr17                           1.948404   0.128080  15.212  < 2e-16 ***
## hr18                           1.893915   0.127812  14.818  < 2e-16 ***
## hr19                           1.669277   0.128471  12.993  < 2e-16 ***
## hr20                           1.420732   0.131004  10.845  < 2e-16 ***
## hr21                           1.146763   0.134042   8.555  < 2e-16 ***
## hr22                           0.856182   0.138982   6.160 1.21e-09 ***
## hr23                           0.479197   0.148051   3.237 0.001265 ** 
## holidayTRUE                    0.201598   0.079039   2.551 0.010961 *  
## workingdayTRUE                 0.116798   0.033510   3.485 0.000521 ***
## weathersitLight Precipitation -0.214801   0.072699  -2.955 0.003233 ** 
## weathersitMisty               -0.010757   0.038600  -0.279 0.780572    
## temp                          -3.246001   1.148270  -2.827 0.004833 ** 
## atemp                          2.042314   0.953772   2.141 0.032589 *  
## hum                           -0.748557   0.236015  -3.172 0.001581 ** 
## windspeed                      0.003277   0.148814   0.022 0.982439    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 38.98949)
## 
##     Null deviance: 133365  on 743  degrees of freedom
## Residual deviance:  28775  on 712  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
# Make predictions on August data
bikesAugust$pred <- predict(bike_model, newdata = bikesAugust, type = "response")

# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred - cnt) %>%
  summarize(rmse  = sqrt(mean(residual^2)))
##       rmse
## 1 112.5815
# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")

8.4 Visualize the bike rental predictions

In the previous exercise, you visualized the bike model’s predictions using the standard “outcome vs. prediction” scatter plot. Since the bike rental data is time series data, you might be interested in how the model performs as a function of time. In this exercise, you will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August.

# Plot predictions and cnt by date/time
(quasipoisson_plot <- bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restrict to first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model")
)

8.5 GAM to learn non-linear transforms

gam() in the mgcv package - gam(formula, family, data) - family: - gaussian (default): “regular” regression - binomial: probabilities - - poisson/quasipoisson: counts best for larger data sets

In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting). As you will see, the soybean plant doesn’t grow at a steady rate, but rather has a “growth spurt” that eventually tapers off. Hence, leaf weight is not well described by a linear model.

Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:

library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
## 
## Attaching package: 'mgcv'
## The following object is masked from 'package:wrapr':
## 
##     %.%
load("~/R Data/Soybean.RData")

fmla.lin <- weight ~ Time
model.lin <- lm(fmla.lin, data = soybean_train)

# soybean_train is in the workspace
summary(soybean_train)
##       Plot     Variety   Year          Time           weight       
##  1988F6 : 10   F:161   1988:124   Min.   :14.00   Min.   : 0.0290  
##  1988F7 :  9   P:169   1989:102   1st Qu.:27.00   1st Qu.: 0.6663  
##  1988P1 :  9           1990:104   Median :42.00   Median : 3.5233  
##  1988P8 :  9                      Mean   :43.56   Mean   : 6.1645  
##  1988P2 :  9                      3rd Qu.:56.00   3rd Qu.:10.3808  
##  1988F3 :  8                      Max.   :84.00   Max.   :27.3700  
##  (Other):276
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point() 

# Create the formula
(fmla.gam <- weight ~ s(Time))
## weight ~ s(Time)
# Fit the GAM Model
model.gam <- gam(fmla.gam, family = gaussian, data = soybean_train)

# Call summary() on model.lin and look for R-squared
summary(model.lin)
## 
## Call:
## lm(formula = fmla.lin, data = soybean_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3933 -1.7100 -0.3909  1.9056 11.4381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.559283   0.358527  -18.30   <2e-16 ***
## Time         0.292094   0.007444   39.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.778 on 328 degrees of freedom
## Multiple R-squared:  0.8244, Adjusted R-squared:  0.8238 
## F-statistic:  1540 on 1 and 328 DF,  p-value: < 2.2e-16
# Call summary() on model.gam and look for R-squared
summary(model.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## weight ~ s(Time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.1645     0.1143   53.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Time) 8.495   8.93 338.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.902   Deviance explained = 90.4%
## GCV = 4.4395  Scale est. = 4.3117    n = 330
# Call plot() on model.gam
plot(model.gam)

8.6 Model soybean growth with GAM

In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting). As you will see, the soybean plant doesn’t grow at a steady rate, but rather has a “growth spurt” that eventually tapers off. Hence, leaf weight is not well described by a linear model.

# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)

# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))

# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
  gather(key = modeltype, value = pred, pred.lin, pred.gam)

# Calculate the rmse
soybean_long %>%
  mutate(residual = weight - pred) %>%     # residuals
  group_by(modeltype) %>%                  # group by modeltype
  summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE
## # A tibble: 2 × 2
##   modeltype  rmse
##   <chr>     <dbl>
## 1 pred.gam   2.29
## 2 pred.lin   3.19
# Compare the predictions against actual weights on the test data
soybean_long %>%
  ggplot(aes(x = Time)) +                          # the column for the x axis
  geom_point(aes(y = weight)) +                    # the y-column for the scatterplot
  geom_point(aes(y = pred, color = modeltype)) +   # the y-column for the point-and-line plot
  geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
  scale_color_brewer(palette = "Dark2")

9. Tree-Based Methods

In this chapter we will look at modeling algorithms that do not assume linearity or additivity, and that can learn limited types of interactions among input variables. These algorithms are tree-based methods that work by combining ensembles of decision trees that are learned from the training data.

Multiple diverse decision trees averaged together - reduces overfit - increases model expressiveness - finer grain predictions

9.1 Building a Random Forest Model

  1. draw bootstrapped sample from training data
  2. for each sample grow a tree at each node, pick best variable to split on (from a random subset of all variables) continue until tree is grown
  3. to score a datum, evaluate it with all the trees and average the results

In this exercise you will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

You will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger(formula, data, num.trees, respect.unordered.factors) call are: - num.trees: the number of trees in the forest. - respect.unordered.factors : Specifies how to treat unordered factor variables.

We recommend setting this to “order” for regression. - seed: because this is a random algorithm, you will set the seed to get reproducible results

library(ranger)

# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed <- 423563

# the outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)

# Fit and print the random forest model.
(bike_model_rf <- ranger(fmla, 
                         bikesJuly, 
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       8299.851 
## R squared (OOB):                  0.8190328

9.2 Predict bike rentals with the random forest model

In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August. The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. predict(model, data)$predictions

# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt - pred)  %>%        # calculate the residual
  summarize(rmse  = sqrt(mean(residual^2))) # calculate rmse
##       rmse
## 1 96.66032
# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

9.3 Visualize random forest bike model predictions

In the previous exercise, you saw that the random forest bike model did better on the August data than the quasiposson model, in terms of RMSE.

In this exercise you will visualize the random forest model’s August predictions as a function of time. The corresponding plot from the quasipoisson model that you built in a previous exercise is in the workspace for you to compare.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

first_two_weeks <- bikesAugust %>% 
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>% 
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two
  filter(instant < 14) 

# Plot predictions and cnt by date/time 
(randomforest_plot <- ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot"))

9.4 vtreat on a small example

In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded “lev”), and to clean bad values out of numerical variables (coded “clean”).

# Load the package vtreat
library(vtreat)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following objects are masked from 'package:assertive':
## 
##     is_greater_than, is_less_than
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
color <- factor(c("b", "r", "r", "r", "r", "b", "r", "g", "b", "b"))
size <- c(13, 11, 15, 14, 13, 11, 9, 12, 7, 12)
popularity <- c(1.0785088, 1.3956245, 0.9217988, 1.2025453, 1.0838662, 0.8043527, 1.1035440, 0.8746332, 0.6947058, 0.8832502)

dframe <- data.frame(color, size, popularity)

# dframe is in the workspace
dframe
##    color size popularity
## 1      b   13  1.0785088
## 2      r   11  1.3956245
## 3      r   15  0.9217988
## 4      r   14  1.2025453
## 5      r   13  1.0838662
## 6      b   11  0.8043527
## 7      r    9  1.1035440
## 8      g   12  0.8746332
## 9      b    7  0.6947058
## 10     b   12  0.8832502
# dframe is in the workspace
dframe
##    color size popularity
## 1      b   13  1.0785088
## 2      r   11  1.3956245
## 3      r   15  0.9217988
## 4      r   14  1.2025453
## 5      r   13  1.0838662
## 6      b   11  0.8043527
## 7      r    9  1.1035440
## 8      g   12  0.8746332
## 9      b    7  0.6947058
## 10     b   12  0.8832502
# Create and print a vector of variable names
(vars <- c("color", "size"))
## [1] "color" "size"
# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars)
## [1] "vtreat 1.6.3 inspecting inputs Mon Apr 18 23:08:55 2022"
## [1] "designing treatments Mon Apr 18 23:08:55 2022"
## [1] " have initial level statistics Mon Apr 18 23:08:55 2022"
## [1] " scoring treatments Mon Apr 18 23:08:55 2022"
## [1] "have treatment plan Mon Apr 18 23:08:55 2022"
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code))
##         varName origName  code
## 1    color_catP    color  catP
## 2          size     size clean
## 3 color_lev_x_b    color   lev
## 4 color_lev_x_g    color   lev
## 5 color_lev_x_r    color   lev
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName))
## [1] "size"          "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"
# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
##    size color_lev_x_b color_lev_x_g color_lev_x_r
## 1    13             1             0             0
## 2    11             0             0             1
## 3    15             0             0             1
## 4    14             0             0             1
## 5    13             0             0             1
## 6    11             1             0             0
## 7     9             0             0             1
## 8    12             0             1             0
## 9     7             1             0             0
## 10   12             1             0             0

9.5 Novel levels

When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.

vtreat is a “safer” alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).

In this exercise you will see how vtreat handles categorical values that did not appear in the training set. The treatment plan treatplan and the set of variables newvars from the previous exercise are still in your workspace. dframe and a new data frame testframe are also in your workspace.

color <- factor(c("g", "g", "y", "g", "g", "y", "b", "g", "g", "r"))
size <- c(7, 8, 10, 12, 6, 8, 12, 12, 12, 8)
popularity <- c(0.9733920, 0.9122529, 1.4217153, 1.1905828, 0.9866464, 1.3697515, 1.0959387, 0.9161547, 1.0000460, 1.3137360)
testframe <- data.frame(color, size, popularity)

# treatplan is in the workspace
summary(treatplan)
##               Length Class           Mode     
## treatments    3      -none-          list     
## scoreFrame    8      data.frame      list     
## outcomename   1      -none-          character
## vtreatVersion 1      package_version list     
## outcomeType   1      -none-          character
## outcomeTarget 1      -none-          character
## meanY         1      -none-          logical  
## splitmethod   1      -none-          character
# treatplan is in the workspace
summary(treatplan)
##               Length Class           Mode     
## treatments    3      -none-          list     
## scoreFrame    8      data.frame      list     
## outcomename   1      -none-          character
## vtreatVersion 1      package_version list     
## outcomeType   1      -none-          character
## outcomeTarget 1      -none-          character
## meanY         1      -none-          logical  
## splitmethod   1      -none-          character
# newvars is in the workspace
newvars
## [1] "size"          "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"
# Print dframe and testframe
dframe
##    color size popularity
## 1      b   13  1.0785088
## 2      r   11  1.3956245
## 3      r   15  0.9217988
## 4      r   14  1.2025453
## 5      r   13  1.0838662
## 6      b   11  0.8043527
## 7      r    9  1.1035440
## 8      g   12  0.8746332
## 9      b    7  0.6947058
## 10     b   12  0.8832502
testframe
##    color size popularity
## 1      g    7  0.9733920
## 2      g    8  0.9122529
## 3      y   10  1.4217153
## 4      g   12  1.1905828
## 5      g    6  0.9866464
## 6      y    8  1.3697515
## 7      b   12  1.0959387
## 8      g   12  0.9161547
## 9      g   12  1.0000460
## 10     r    8  1.3137360
# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
##    size color_lev_x_b color_lev_x_g color_lev_x_r
## 1     7             0             1             0
## 2     8             0             1             0
## 3    10             0             0             0
## 4    12             0             1             0
## 5     6             0             1             0
## 6     8             0             0             0
## 7    12             1             0             0
## 8    12             0             1             0
## 9    12             0             1             0
## 10    8             0             0             1

9.6 vtreat the bike rental data

In this exercise you will create one-hot-encoded data frames of the July/August bike data, for use with xgboost later on.

# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Load the package vtreat
library(vtreat)

# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly, vars, verbose = FALSE)

# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
  use_series(scoreFrame) %>%               
  filter(code %in% c("clean", "lev")) %>%  # get the variables you care about
  use_series(varName))                     # get the varName column
##  [1] "holiday"                                
##  [2] "workingday"                             
##  [3] "temp"                                   
##  [4] "atemp"                                  
##  [5] "hum"                                    
##  [6] "windspeed"                              
##  [7] "hr_lev_x_0"                             
##  [8] "hr_lev_x_1"                             
##  [9] "hr_lev_x_10"                            
## [10] "hr_lev_x_11"                            
## [11] "hr_lev_x_12"                            
## [12] "hr_lev_x_13"                            
## [13] "hr_lev_x_14"                            
## [14] "hr_lev_x_15"                            
## [15] "hr_lev_x_16"                            
## [16] "hr_lev_x_17"                            
## [17] "hr_lev_x_18"                            
## [18] "hr_lev_x_19"                            
## [19] "hr_lev_x_2"                             
## [20] "hr_lev_x_20"                            
## [21] "hr_lev_x_21"                            
## [22] "hr_lev_x_22"                            
## [23] "hr_lev_x_23"                            
## [24] "hr_lev_x_3"                             
## [25] "hr_lev_x_4"                             
## [26] "hr_lev_x_5"                             
## [27] "hr_lev_x_6"                             
## [28] "hr_lev_x_7"                             
## [29] "hr_lev_x_8"                             
## [30] "hr_lev_x_9"                             
## [31] "weathersit_lev_x_Clear_to_partly_cloudy"
## [32] "weathersit_lev_x_Light_Precipitation"   
## [33] "weathersit_lev_x_Misty"
# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly,  varRestriction = newvars)

# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust, varRestriction = newvars)

9.7 ind the right number of trees for a gradient boosting machine

In this exercise you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July.

The July data is loaded into your workspace. Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt.

You will use the xgboost package to fit the random forest model. The function xgb.cv() uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.

library(xgboost)

# Run xgb.cv
cv <- xgb.cv(data = as.matrix(bikesJuly.treat), 
            label = bikesJuly$cnt,
            nrounds = 100,
            nfold = 5,
            objective = "reg:linear",
            eta = 0.3,
            max_depth = 6,
            early_stopping_rounds = 10,
            verbose = 0   # silent
)
## [23:08:55] WARNING: amalgamation/../src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
## [23:08:55] WARNING: amalgamation/../src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
## [23:08:55] WARNING: amalgamation/../src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
## [23:08:55] WARNING: amalgamation/../src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
## [23:08:55] WARNING: amalgamation/../src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
# Get the evaluation log
elog <- cv$evaluation_log

# Determine and print how many trees minimize training and test error
elog %>% 
  summarize(ntrees.train = which.min(train_rmse_mean),   # find the index of min(train_rmse_mean)
            ntrees.test  = which.min(test_rmse_mean))    # find the index of min(test_rmse_mean)
## Source: local data table [1 x 2]
## Call:   `_DT1`[, .(ntrees.train = which.min(train_rmse_mean), ntrees.test = which.min(test_rmse_mean))]
## 
##   ntrees.train ntrees.test
##          <int>       <int>
## 1           78          68
## 
## # Use as.data.table()/as.data.frame()/as_tibble() to access results

9.8 Fit an xgboost bike rental model and predict

In this exercise you will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July and predict on data for the month of August.

The datasets for July and August are loaded into your workspace. Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column).

For convenience, the number of trees to use, ntrees from the previous exercise is in the workspace.

# The number of trees to use, as determined by xgb.cv
ntrees <- 84

# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
                   label = bikesJuly$cnt,  # column of outcomes
                   nrounds = ntrees,       # number of trees to build
                   objective = "reg:linear", # objective
                   eta = 0.3,
                   depth = 6,
                   verbose = 0  # silent
)
## [23:08:55] WARNING: amalgamation/../src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
## [23:08:55] WARNING: amalgamation/../src/learner.cc:576: 
## Parameters: { "depth" } might not be used.
## 
##   This could be a false alarm, with some parameters getting used by language bindings but
##   then being mistakenly passed down to XGBoost core, or some parameter actually being used
##   but getting flagged wrongly here. Please open an issue if you find any such cases.
# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))

# Plot predictions vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

# Calculate RMSE
bikesAugust %>%
  mutate(residuals = cnt - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
##       rmse
## 1 76.73242

9.9 Visualize the xgboost bike rental model

You’ve now seen three different ways to model the bike rental data. For this example, you’ve seen that the gradient boosting model had the smallest RMSE. To finish up the course, let’s compare the gradient boosting model’s predictions to the other two models as a function of time.

# Print quasipoisson_plot
quasipoisson_plot

# Print randomforest_plot
randomforest_plot

# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Gradient Boosting model")

The End.

Thanks DataCamp