After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.
As it begins to drive away, its camera captures the following image:
Stop Sign
Can you apply a kNN classifier to help the car recognize this sign?
The dataset signs is loaded in your workspace along with the dataframe next_sign, which holds the observation you want to classify.
Load the class package. Create a vector of sign labels to use with kNN by extracting the column sign_type from signs. Identify the next_sign using the knn() function. Set the train argument equal to the signs data frame without the first column. Set the test argument equal to the data frame next_sign. Use the vector of labels you created as the cl argument.
# Load the 'class' package
library(class)
##
## Attaching package: 'class'
## The following object is masked from 'package:igraph':
##
## knn
# Import the data
signs <- read.csv("./DATABASE/knn_traffic_signs.csv")
# Build the kNN test observation by hand, since DataCamp does not provide it
signs_next_colnames <- c("r1","g1","b1","r2","g2","b2","r3","g3","b3","r4","g4","b4",
                         "r5","g5","b5","r6","g6","b6","r7","g7","b7","r8","g8","b8",
                         "r9","g9","b9","r10","g10","b10","r11","g11","b11","r12","g12","b12",
                         "r13","g13","b13","r14","g14","b14","r15","g15","b15","r16","g16","b16")
signs_values <- c(206,204,227,220,196,59,51,202,67,59,204,227,220,236,250,234,242,252,
                  235,205,148,131,190,50,43,179,70,57,242,229,212,190,50,43,193,51,
                  44,170,197,196,190,50,43,190,47,41,165,195)
next_sign <- as.data.frame(t(signs_values))
names(next_sign) <- signs_next_colnames
# Create a vector of labels
sign_types <- signs$sign_type
library(dplyr)
# Drop the id, sample, and sign_type columns so only the pixel features remain
signs <- signs %>% select(-id, -sample, -sign_type)
# Classify the next sign observed
knn(train = signs, test = next_sign, cl = sign_types)
## [1] stop
## Levels: pedestrian speed stop
To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.
Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue level for each of the 16 center pixels is recorded as illustrated here.
Use the str() function to examine the signs dataset. Use table() to count the number of observations of each sign type by passing it the column containing the labels. Run the provided aggregate() command to see whether the average red level might vary by sign type.
# Examine the structure of the signs dataset
# str(signs)  # output omitted for brevity: 48 numeric pixel columns (r1, g1, b1, ..., b16)
# Count the number of signs of each type; sign_type was dropped from signs
# above, so pass the label vector instead
table(sign_types)
# Check r10's average red level by sign type
aggregate(r10 ~ sign_types, data = signs, mean)
## sign_types r10
## 1 pedestrian 108.78462
## 2 speed 83.08571
## 3 stop 142.50704
Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.
The test course includes 59 additional road signs divided into three types: stop signs, speed limit signs, and pedestrian signs.
At the conclusion of the trial, you are asked to measure the car’s overall performance at recognizing these signs.
The class package and the dataset signs are already loaded in your workspace. So is the dataframe test_signs, which holds a set of observations you’ll test your model on.
Classify the test_signs data using knn(). Set train equal to the observations in signs without labels. Use test_signs for the test argument, again without labels. For the cl argument, use the vector of labels provided for you. Use table() to explore the classifier’s performance at identifying the three sign types (the confusion matrix). Create the vector signs_actual by extracting the labels from test_signs. Pass the vector of predictions and the vector of actual signs to table() to cross tabulate them. Compute the overall accuracy of the kNN learner using the mean() function.
# Use kNN to identify the test road signs
# Load the test data that DataCamp does not provide
test_signs <- read.delim("./DATABASE/sign_test.txt", sep = ",")
head(test_signs)
## X sign_type r1 g1 b1 r2 g2 b2 r3 g3 b3 r4 g4 b4 r5 g5 b5 r6
## 1 8 pedestrian 118 105 69 244 245 67 132 123 12 138 123 85 254 254 92 241
## 2 13 pedestrian 221 244 237 52 45 26 205 233 229 203 230 227 203 221 213 18
## 3 14 pedestrian 44 50 43 98 69 25 170 182 172 170 182 172 35 29 18 85
## 4 19 pedestrian 78 106 102 98 125 82 65 91 75 100 122 116 84 107 88 102
## 5 20 pedestrian 163 181 172 53 51 36 170 181 171 44 51 44 90 86 66 128
## 6 22 pedestrian 117 137 132 116 105 67 58 53 27 37 49 45 115 101 59 131
## g6 b6 r7 g7 b7 r8 g8 b8 r9 g9 b9 r10 g10 b10 r11 g11 b11 r12 g12
## 1 240 108 254 254 99 108 106 27 135 123 40 254 254 115 254 254 99 138 123
## 2 21 20 27 27 20 193 222 220 76 53 18 12 18 20 19 19 13 70 49
## 3 60 19 26 26 20 84 68 42 94 65 20 22 25 21 21 22 19 97 67
## 4 139 38 27 43 28 146 174 170 182 196 176 108 145 42 107 144 38 45 58
## 5 90 28 68 53 27 90 82 60 90 78 52 125 89 27 59 52 34 57 62
## 6 102 28 19 28 26 27 37 34 19 28 26 139 107 28 139 107 28 14 22
## b12 r13 g13 b13 r14 g14 b14 r15 g15 b15 r16 g16 b16
## 1 85 118 105 75 131 124 5 106 94 53 101 91 59
## 2 13 74 51 13 74 51 13 34 30 18 50 38 14
## 3 21 94 65 20 94 65 20 97 67 21 49 38 19
## 4 52 67 84 77 91 108 100 49 62 57 45 58 52
## 5 54 60 67 59 53 51 36 67 75 67 74 81 70
## 6 21 15 24 22 18 26 22 19 28 26 19 28 26
signs_pred <- knn(train = signs, test = test_signs[,c(-1,-2)], cl = sign_types)
# Create a confusion matrix of the predicted versus actual values
signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)
## signs_actual
## signs_pred pedestrian speed stop
## pedestrian 19 0 0
## speed 0 21 0
## stop 0 0 19
# Compute the accuracy
mean(signs_pred == signs_actual)
## [1] 1
By default, the knn() function in the class package uses only the single nearest neighbor.
Setting a k parameter allows the algorithm to consider additional nearby neighbors. This enlarges the collection of neighbors which will vote on the predicted class.
Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.
The class package is already loaded in your workspace along with the datasets signs, signs_test, and sign_types. The object signs_actual holds the true values of the signs.
Compute the accuracy of the default k = 1 model using the given code, then find the accuracy of the model using mean() to compare signs_actual and the model’s predictions.
signs_test <- read.delim("DATABASE/sign_test.txt", sep = ",")
head(signs_test)
## X sign_type r1 g1 b1 r2 g2 b2 r3 g3 b3 r4 g4 b4 r5 g5 b5 r6
## 1 8 pedestrian 118 105 69 244 245 67 132 123 12 138 123 85 254 254 92 241
## 2 13 pedestrian 221 244 237 52 45 26 205 233 229 203 230 227 203 221 213 18
## 3 14 pedestrian 44 50 43 98 69 25 170 182 172 170 182 172 35 29 18 85
## 4 19 pedestrian 78 106 102 98 125 82 65 91 75 100 122 116 84 107 88 102
## 5 20 pedestrian 163 181 172 53 51 36 170 181 171 44 51 44 90 86 66 128
## 6 22 pedestrian 117 137 132 116 105 67 58 53 27 37 49 45 115 101 59 131
## g6 b6 r7 g7 b7 r8 g8 b8 r9 g9 b9 r10 g10 b10 r11 g11 b11 r12 g12
## 1 240 108 254 254 99 108 106 27 135 123 40 254 254 115 254 254 99 138 123
## 2 21 20 27 27 20 193 222 220 76 53 18 12 18 20 19 19 13 70 49
## 3 60 19 26 26 20 84 68 42 94 65 20 22 25 21 21 22 19 97 67
## 4 139 38 27 43 28 146 174 170 182 196 176 108 145 42 107 144 38 45 58
## 5 90 28 68 53 27 90 82 60 90 78 52 125 89 27 59 52 34 57 62
## 6 102 28 19 28 26 27 37 34 19 28 26 139 107 28 139 107 28 14 22
## b12 r13 g13 b13 r14 g14 b14 r15 g15 b15 r16 g16 b16
## 1 85 118 105 75 131 124 5 106 94 53 101 91 59
## 2 13 74 51 13 74 51 13 34 30 18 50 38 14
## 3 21 94 65 20 94 65 20 97 67 21 49 38 19
## 4 52 67 84 77 91 108 100 49 62 57 45 58 52
## 5 54 60 67 59 53 51 36 67 75 67 74 81 70
## 6 21 15 24 22 18 26 22 19 28 26 19 28 26
signs_test <- signs_test %>% select(-X, -sign_type)
# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs, test = signs_test, cl = sign_types)
mean(signs_actual == k_1)
## [1] 1
Modify the knn() function call by setting k = 7 and find the accuracy value again.
# Modify the above to set k = 7
k_7 <- knn(train = signs, test = signs_test, cl = sign_types, k = 7)
mean(signs_actual == k_7)
## [1] 0.9661017
Revise the code once more by setting k = 15, and find the accuracy value one more time.
# Set k = 15 and compare to the above
k_15 <- knn(train = signs, test = signs_test, cl = sign_types, k = 15)
mean(signs_actual == k_15)
## [1] 0.9661017
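If you wanted to scan a wider range of k values than the three above, a short sapply() loop works. This is just a sketch, not part of the original exercise; it reuses the signs, signs_test, sign_types, and signs_actual objects already in the workspace.
# Record the test-set accuracy for a range of candidate k values
ks <- c(1, 3, 5, 7, 9, 11, 15)
acc <- sapply(ks, function(k) {
  pred <- knn(train = signs, test = signs_test, cl = sign_types, k = k)
  mean(signs_actual == pred)
})
data.frame(k = ks, accuracy = acc)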
When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or split.
For example, knowing more about the voters’ confidence in the classification could allow an autonomous vehicle to use caution in the case there is any chance at all that a stop sign is ahead.
In this exercise, you will learn how to obtain the voting results from the knn() function.
The class package has already been loaded in your workspace along with the datasets signs, sign_types, and signs_test.
Build a kNN model with the prob = TRUE parameter to compute the vote proportions. Set k = 7.
# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs, test = signs_test, cl = sign_types, k = 7, prob = TRUE)
Use the attr() function to obtain the vote proportions for the predicted class. These are stored in the attribute “prob”.
# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")
# Examine the first several predictions
head(sign_pred)
## [1] pedestrian pedestrian pedestrian stop pedestrian pedestrian
## Levels: pedestrian speed stop
Examine the first several vote outcomes and percentages using the head() function to see how the confidence varies from sign to sign.
# Examine the proportion of votes for the winning class
head(sign_prob)
## [1] 0.5714286 0.7142857 0.8571429 0.4285714 1.0000000 0.8571429
A function to min-max normalize data to the [0, 1] range:
normalizar <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
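As a minimal usage sketch (not part of the original exercise), you could apply it to every pixel column of signs so that all features share the same [0, 1] scale before running kNN:
# Rescale each pixel column of signs to the [0, 1] range
signs_norm <- as.data.frame(lapply(signs, normalizar))
summary(signs_norm$r1)  # the minimum is now 0 and the maximum 1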
The where9am data frame contains 91 days (thirteen weeks) worth of data in which Brett recorded his location at 9am each day as well as whether the daytype was a weekend or weekday.
Using the conditional probability formula P(A | B) = P(A and B) / P(B), you can compute the probability that Brett is working in the office, given that it is a weekday.
Find P(office) using nrow() and subset() to count rows in the dataset and save the result as p_A.
where9am <- read.delim("DATABASE/where9am.txt", sep = ",")
head(where9am)
## X daytype location
## 1 10 weekday office
## 2 34 weekday office
## 3 58 weekday office
## 4 82 weekend home
## 5 106 weekend home
## 6 130 weekday campus
Find P(weekday), using nrow() and subset() again, and save the result as p_B.
Use nrow() and subset() a final time to find P(office and weekday). Save the result as p_AB.
Compute P(office | weekday) and save the result as p_A_given_B.
# Compute P(A)
p_A <- nrow(subset(where9am, location == "office")) / nrow(where9am)
# Compute P(B)
p_B <- nrow(subset(where9am, daytype == "weekday")) / nrow(where9am)
# Compute the observed P(A and B)
p_AB <- nrow(subset(where9am, location == "office" & daytype == "weekday")) / nrow(where9am)
Print the value of p_A_given_B.
# Compute P(A | B) and print its value
p_A_given_B <- p_AB / p_B
p_A_given_B
## [1] 0.6
The previous exercises showed that the probability that Brett is at work or at home at 9am is highly dependent on whether it is the weekend or a weekday.
To see this finding in action, use the where9am data frame to build a Naive Bayes model on the same data.
You can then use this model to predict the future: where does the model think that Brett will be at 9am on Thursday and at 9am on Saturday?
The data frame where9am is available in your workspace. This dataset contains information about Brett’s location at 9am on different days.
Load the naivebayes package. Use naive_bayes() with a formula like y ~ x to build a model of location as a function of daytype.
# Load the naivebayes package
library(naivebayes)
# Build the location prediction model
locmodel <- naive_bayes(location ~ daytype, data = where9am)
## Warning: naive_bayes(): Feature daytype - zero probabilities are present.
## Consider Laplace smoothing.
Forecast the Thursday 9am location using predict() with the thursday9am object as the newdata argument.
thursday9am <- read.delim("DATABASE/thursday9am.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/thursday9am.txt'
head(thursday9am)
## X daytype
## 1 1 weekday
thursday9am$daytype <- as.character(thursday9am$daytype)
# Predict Thursday's 9am location
predict(locmodel, thursday9am)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] office
## Levels: appointment campus home office
Do the same for predicting the saturday9am location.
saturday9am <- read.delim("DATABASE/saturday9am.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/saturday9am.txt'
head(saturday9am)
## X daytype
## 1 1 weekend
saturday9am$daytype <- as.character(saturday9am$daytype)
# Predict Saturday's 9am location
predict(locmodel, saturday9am)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] home
## Levels: appointment campus home office
The naivebayes package offers several ways to peek inside a Naive Bayes model.
Typing the name of the model object provides the a priori (overall) and conditional probabilities of each of the model's predictors. If you were so inclined, you could use these to calculate posterior (predicted) probabilities by hand.
Alternatively, R will compute the posterior probabilities for you if the type = “prob” parameter is supplied to the predict() function.
Using these methods, examine how the model's predicted 9am location probability varies from day to day. The model locmodel that you fit in the previous exercise is in your workspace.
Print the locmodel object to the console to view the computed a priori and conditional probabilities. Use the predict() function similarly to the previous exercise, but with type = “prob” to see the predicted probabilities for Thursday at 9am.
# The 'naivebayes' package is loaded into the workspace
# and the Naive Bayes 'locmodel' has been built
# Examine the location prediction model
locmodel
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = location ~ daytype, data = where9am)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## appointment campus home office
## 0.01098901 0.10989011 0.45054945 0.42857143
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: daytype (Bernoulli)
## ---------------------------------------------------------------------------------
##
## daytype appointment campus home office
## weekday 1.0000000 1.0000000 0.3658537 1.0000000
## weekend 0.0000000 0.0000000 0.6341463 0.0000000
##
## ---------------------------------------------------------------------------------
# Obtain the predicted probabilities for Thursday at 9am
predict(locmodel, thursday9am , type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## appointment campus home office
## [1,] 0.01538462 0.1538462 0.2307692 0.6
Compare these to the predicted probabilities for Saturday at 9am.
# Obtain the predicted probabilities for Saturday at 9am
predict(locmodel, saturday9am , type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## appointment campus home office
## [1,] 3.838772e-05 0.0003838772 0.9980806 0.001497121
The locations dataset records Brett’s location every hour for 13 weeks. Each hour, the tracking information includes the daytype (weekend or weekday) as well as the hourtype (morning, afternoon, evening, or night).
Using this data, build a more sophisticated model to see how Brett’s predicted location not only varies by the day of week but also by the time of day. The dataset locations is already loaded in your workspace.
You can specify additional independent variables in your formula using the + sign (e.g. y ~ x + b).
Use the R formula interface to build a model where location depends on both daytype and hourtype. Recall that the function naive_bayes() takes 2 arguments: formula and data. Predict Brett’s location on a weekday afternoon using the dataframe weekday_afternoon and the predict() function.
# The 'naivebayes' package is loaded into the workspace already
locations <- read.csv("DATABASE/locations.csv", sep = ",")
head(locations)
## month day weekday daytype hour hourtype location
## 1 1 4 wednesday weekday 0 night home
## 2 1 4 wednesday weekday 1 night home
## 3 1 4 wednesday weekday 2 night home
## 4 1 4 wednesday weekday 3 night home
## 5 1 4 wednesday weekday 4 night home
## 6 1 4 wednesday weekday 5 night home
weekday_afternoon <- read.delim("DATABASE/weekday_afternoon.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'DATABASE/
## weekday_afternoon.txt'
head(weekday_afternoon)
## X daytype hourtype location
## 1 13 weekday afternoon office
weekday_afternoon$daytype <- as.character(
weekday_afternoon$daytype)
weekday_afternoon$hourtype <- as.character(weekday_afternoon$hourtype)
weekday_afternoon$location <- as.character(weekday_afternoon$location)
# Build a NB model of location
locmodel <- naive_bayes(location ~ daytype + hourtype, data = locations)
## Warning: naive_bayes(): Feature daytype - zero probabilities are present.
## Consider Laplace smoothing.
## Warning: naive_bayes(): Feature hourtype - zero probabilities are present.
## Consider Laplace smoothing.
# Predict Brett's location on a weekday afternoon
predict(locmodel, weekday_afternoon)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] office
## Levels: appointment campus home office restaurant store theater
Do the same for a weekday_evening.
weekday_evening <- read.delim("DATABASE/weekday_evening.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/weekday_evening.txt'
head(weekday_evening)
## X daytype hourtype location
## 1 19 weekday evening home
weekday_evening$daytype <- as.character(weekday_evening$daytype)
weekday_evening$hourtype <- as.character(weekday_evening$hourtype)
weekday_evening$location <- as.character(weekday_evening$location)
# Predict Brett's location on a weekday evening
predict(locmodel, weekday_evening)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] home
## Levels: appointment campus home office restaurant store theater
While Brett was tracking his location over 13 weeks, he never went into the office during the weekend. Consequently, the joint probability of P(office and weekend) = 0.
Explore how this impacts the predicted probability that Brett may go to work on the weekend in the future. Additionally, you can see how using the Laplace correction will allow a small chance for these types of unforeseen circumstances.
The model locmodel is already in your workspace, along with the dataframe weekend_afternoon.
Use the locmodel to output predicted probabilities for a weekend afternoon by using the predict() function. Remember to set the type argument.
# The 'naivebayes' package is loaded into the workspace already
# The Naive Bayes location model (locmodel) has already been built
weekend_afternoon <- read.delim("DATABASE/weekend_afternoon.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'DATABASE/
## weekend_afternoon.txt'
head(weekend_afternoon)
## X daytype hourtype location
## 1 85 weekend afternoon home
weekend_afternoon$daytype <- as.character(weekend_afternoon$daytype)
weekend_afternoon$hourtype <- as.character(weekend_afternoon$hourtype)
weekend_afternoon$location <- as.character(weekend_afternoon$location)
# Observe the predicted probabilities for a weekend afternoon
predict(locmodel, weekend_afternoon, type = "prob")
Create a new naive Bayes model with the Laplace smoothing parameter set to 1. You can do this by setting the laplace argument in your call to naive_bayes(). Save this as locmodel2.
# Build a new model using the Laplace correction
locmodel2 <- naive_bayes(location ~ daytype + hourtype, data = locations, laplace = 1)
See how the new predicted probabilities compare by using the predict() function on your new model.
# Observe the new predicted probabilities for a weekend afternoon
predict(locmodel2, weekend_afternoon, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## appointment campus home office restaurant store
## [1,] 0.02013872 0.006187715 0.8308154 0.007929249 0.1098743 0.01871085
## theater
## [1,] 0.006343697
The donors dataset contains 93,462 examples of people who were mailed a fundraising solicitation for paralyzed military veterans. The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise. This binary outcome will be the dependent variable for the logistic regression model.
The remaining columns are features of the prospective donors that may influence their donation behavior. These are the model’s independent variables.
When building a regression model, it is often helpful to form a hypothesis about which independent variables will be predictive of the dependent variable. The bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation. Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving.
In this exercise, you will use these three factors to create a simple model of donation behavior. The dataset donors is available in your workspace.
Examine donors using the str() function.
donors <- read.csv("DATABASE/donors.csv")
head(donors)
## donated veteran bad_address age has_children wealth_rating interest_veterans
## 1 0 0 0 60 0 0 0
## 2 0 0 0 46 1 3 0
## 3 0 0 0 NA 0 1 0
## 4 0 0 0 70 0 2 0
## 5 0 0 0 78 1 1 0
## 6 0 0 0 NA 0 0 0
## interest_religion pet_owner catalog_shopper recency frequency money
## 1 0 0 0 CURRENT FREQUENT MEDIUM
## 2 0 0 0 CURRENT FREQUENT HIGH
## 3 0 0 0 CURRENT FREQUENT MEDIUM
## 4 0 0 0 CURRENT FREQUENT MEDIUM
## 5 1 0 1 CURRENT FREQUENT MEDIUM
## 6 0 0 0 CURRENT INFREQUENT MEDIUM
# Examine the dataset to identify potential independent variables
str(donors)
## 'data.frame': 93462 obs. of 13 variables:
## $ donated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ veteran : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bad_address : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 60 46 NA 70 78 NA 38 NA NA 65 ...
## $ has_children : int 0 1 0 0 1 0 1 0 0 0 ...
## $ wealth_rating : int 0 3 1 2 1 0 2 3 1 0 ...
## $ interest_veterans: int 0 0 0 0 0 0 0 0 0 0 ...
## $ interest_religion: int 0 0 0 0 1 0 0 0 0 0 ...
## $ pet_owner : int 0 0 0 0 0 0 1 0 0 0 ...
## $ catalog_shopper : int 0 0 0 0 1 0 0 0 0 0 ...
## $ recency : Factor w/ 2 levels "CURRENT","LAPSED": 1 1 1 1 1 1 1 1 1 1 ...
## $ frequency : Factor w/ 2 levels "FREQUENT","INFREQUENT": 1 1 1 1 1 2 2 1 2 2 ...
## $ money : Factor w/ 2 levels "HIGH","MEDIUM": 2 1 2 2 2 2 2 2 2 2 ...
Count the number of occurrences of each level of the donated variable using the table() function.
# Explore the dependent variable
table(donors$donated)
##
## 0 1
## 88751 4711
Fit a logistic regression model using the formula interface and the three independent variables described above.
Call glm() with the formula as its first argument and the dataframe as the data argument.
Save the result as donation_model.
# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans,
data = donors, family = "binomial")
Summarize the model object with summary().
# Summarize the model results
summary(donation_model)
##
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans,
## family = "binomial", data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3480 -0.3192 -0.3192 -0.3192 2.5678
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.95139 0.01652 -178.664 <2e-16 ***
## bad_address -0.30780 0.14348 -2.145 0.0319 *
## interest_religion 0.06724 0.05069 1.327 0.1847
## interest_veterans 0.11009 0.04676 2.354 0.0186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 37316 on 93458 degrees of freedom
## AIC: 37324
##
## Number of Fisher Scoring iterations: 5
In the previous exercise, you used the glm() function to build a logistic regression model of donor behavior. As with many of R’s machine learning methods, you can apply the predict() function to the model object to forecast future behavior. By default, predict() outputs predictions in terms of log odds unless type = “response” is specified. This converts the log odds to probabilities.
Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. One must balance the extremes of being too cautious versus being too aggressive. For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities who would still choose to donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.
The dataset donors and the model donation_model are already loaded in your workspace.
# Estimate the donation probability
donors$donation_prob <- predict(donation_model, type = "response")
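As a quick check of the log odds conversion (a sketch, not part of the original exercise), you can verify that plogis(), base R's logistic function, maps the default predictions onto the type = "response" probabilities:
# Without a type argument, predict() on a glm returns the log odds
log_odds <- predict(donation_model)
# plogis() computes 1 / (1 + exp(-x)), recovering the probabilities
all.equal(unname(plogis(log_odds)), unname(donors$donation_prob))
## [1] TRUE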
Use the predict() function to estimate each person’s donation probability.
Use the type argument to get probabilities. Assign the predictions to a new column called donation_prob.
Find the actual probability that an average person would donate by passing the mean() function the appropriate column of the donors dataframe.
# Find the donation probability of the average prospect
mean(donors$donated)
## [1] 0.05040551
Use ifelse() to predict a donation if their predicted donation probability is greater than average. Assign the predictions to a new column called donation_pred.
# Predict a donation if probability of donation is greater than average
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)
Use the mean() function to calculate the model’s accuracy.
# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
## [1] 0.794815
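To make this concrete (a sketch, not part of the original exercise): with the conventional 0.50 cutoff, no prospect is predicted to donate at all, because the largest predicted probability from this model is only about 0.06. Accuracy then collapses to the majority-class rate of roughly 0.95 while identifying zero donors:
# Every predicted probability falls below 0.50, so this predicts 0 for everyone
mean(ifelse(donors$donation_prob > 0.50, 1, 0) == donors$donated)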
The previous exercises have demonstrated that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model's performance better illustrates the tradeoff between a model that is overly aggressive and one that is overly passive.
In this exercise you will create a ROC curve and compute the area under the curve (AUC) to evaluate the logistic regression model of donations you built earlier.
The dataset donors, with the column of predicted probabilities donation_prob, is already loaded in your workspace.
Load the pROC package. Create a ROC curve with roc() and the columns of actual and predicted donations. Store the result as ROC.
# Load the pROC package
library(pROC)
# Create a ROC curve
ROC <- roc(donors$donated, donors$donation_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Use plot() to draw the ROC object. Specify col = “blue” to color the curve blue.
# Plot the ROC curve
plot(ROC, col = "blue")
Compute the area under the curve with auc().
# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.5102
Sometimes a dataset contains numeric values that represent a categorical feature.
In the donors dataset, wealth_rating uses numbers to indicate the donor’s wealth level:
0 = Unknown, 1 = Low, 2 = Medium, 3 = High.
This exercise illustrates how to prepare this type of categorical feature and examines its impact on a logistic regression model. The dataframe donors is loaded in your workspace.
Create a factor wealth_levels from the numeric wealth_rating with labels as shown above by passing the factor() function the column you want to convert, the individual levels, and the labels.
# Convert the wealth rating to a factor
donors$wealth_levels <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))
Use relevel() to change the reference category to Medium. The first argument should be your new factor column.
# Use relevel() to change reference category
donors$wealth_levels <- relevel(donors$wealth_levels, ref = "Medium")
Build a logistic regression model using the column wealth_levels to predict donated and display the result with summary().
# See how our factor coding impacts the model
summary(glm(donated ~ wealth_levels, data = donors, family = "binomial"))
##
## Call:
## glm(formula = donated ~ wealth_levels, family = "binomial", data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3320 -0.3243 -0.3175 -0.3175 2.4582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.91894 0.03614 -80.772 <2e-16 ***
## wealth_levelsUnknown -0.04373 0.04243 -1.031 0.303
## wealth_levelsLow -0.05245 0.05332 -0.984 0.325
## wealth_levelsHigh 0.04804 0.04768 1.008 0.314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 37323 on 93458 degrees of freedom
## AIC: 37331
##
## Number of Fisher Scoring iterations: 5
Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.
One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.
The dataframe donors is loaded in your workspace.
Use summary() on donors$age to find the average age of prospects with non-missing data. Use ifelse() and the test is.na(donors$age) to impute the average (rounded to 2 decimal places) for cases with missing age. Be sure to also ignore NAs.
# Find the average age among non-missing values
summary(donors$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 48.00 62.00 61.65 75.00 98.00 22546
# Impute missing age values with the mean age
donors$imputed_age <- ifelse(is.na(donors$age), round(mean(donors$age, na.rm = TRUE), 2), donors$age)
Create a binary dummy variable named missing_age indicating the presence of missing data using another ifelse() call and the same test.
# Create missing value indicator for age
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)
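To see how the two constructed variables work together, you could include both in a model. This is a sketch, not part of the original exercise: the missing_age indicator lets prospects with unknown age shift the outcome independently, while imputed_age carries the age effect for everyone else.
# Model using both the imputed age and the missingness indicator
summary(glm(donated ~ imputed_age + missing_age, data = donors, family = "binomial"))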
One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M: Recency, Frequency, and Money.
Donors that have given both recently and frequently may be especially likely to give again; in other words, the combined impact of recency and frequency may be greater than the sum of the separate effects.
Because these predictors together have a greater impact on the dependent variable, their joint effect must be modeled as an interaction. The donors dataset has been loaded for you.
Create a logistic regression model of donated as a function of money plus the interaction of recency and frequency. Use * to add the interaction term.
# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ recency * frequency + money, data = donors, family = "binomial")
Examine the model’s summary() to confirm the interaction effect was added. Save the model’s predicted probabilities as rfm_prob. Use the predict() function, and remember to set the type argument.
# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)
##
## Call:
## glm(formula = donated ~ recency * frequency + money, family = "binomial",
## data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3696 -0.3696 -0.2895 -0.2895 2.7924
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.01142 0.04279 -70.375 <2e-16 ***
## recencyLAPSED -0.86677 0.41434 -2.092 0.0364 *
## frequencyINFREQUENT -0.50148 0.03107 -16.143 <2e-16 ***
## moneyMEDIUM 0.36186 0.04300 8.415 <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT 1.01787 0.51713 1.968 0.0490 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 36938 on 93457 degrees of freedom
## AIC: 36948
##
## Number of Fisher Scoring iterations: 6
Plot a ROC curve by using the function roc(). Remember, this function takes the column of outcomes and the vector of predictions. Compute the AUC for the new model with the function auc() and compare performance to the simpler model.
# Compute predicted probabilities for the RFM model
rfm_prob <- predict(rfm_model, type = "response")
# Plot the ROC curve for the new model
library(pROC)
ROC <- roc(donors$donated, rfm_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")
auc(ROC)
## Area under the curve: 0.5785
In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.
In this exercise, you will use a forward stepwise approach to add predictors to the model one-by-one until no additional benefit is seen. The donors dataset has been loaded for you.
Use the R formula interface with glm() to specify the base model with no predictors. Set the explanatory variable equal to 1.
# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")
Use the R formula interface again with glm() to specify the model with all predictors.
# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")
Apply step() to these models to perform forward stepwise regression. Set the first argument to null_model and set direction = “forward”. This might take a while (up to 10 or 15 seconds) as your computer has to fit quite a few different models to perform stepwise selection.
# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
## Start: AIC=37332.13
## donated ~ 1
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + frequency 1 28502 37122
## + money 1 28621 37241
## + wealth_rating 1 28705 37326
## + has_children 1 28705 37326
## + age 1 28707 37328
## + imputed_age 1 28707 37328
## + wealth_levels 3 28704 37328
## + interest_veterans 1 28709 37330
## + donation_prob 1 28710 37330
## + donation_pred 1 28710 37330
## + catalog_shopper 1 28710 37330
## + pet_owner 1 28711 37331
## <none> 28714 37332
## + interest_religion 1 28712 37333
## + recency 1 28713 37333
## + bad_address 1 28714 37334
## + veteran 1 28714 37334
##
## Step: AIC=37024.77
## donated ~ frequency
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + money 1 28441 36966
## + wealth_rating 1 28493 37018
## + wealth_levels 3 28490 37019
## + has_children 1 28494 37019
## + donation_prob 1 28498 37023
## + interest_veterans 1 28498 37023
## + catalog_shopper 1 28499 37024
## + donation_pred 1 28499 37024
## + age 1 28499 37024
## + imputed_age 1 28499 37024
## + pet_owner 1 28499 37024
## <none> 28502 37025
## + interest_religion 1 28501 37026
## + recency 1 28501 37026
## + bad_address 1 28502 37026
## + veteran 1 28502 37027
##
## Step: AIC=36949.71
## donated ~ frequency + money
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + wealth_levels 3 28427 36942
## + wealth_rating 1 28431 36942
## + has_children 1 28432 36943
## + interest_veterans 1 28438 36948
## + donation_prob 1 28438 36949
## + catalog_shopper 1 28438 36949
## + donation_pred 1 28438 36949
## + age 1 28438 36949
## + imputed_age 1 28438 36949
## + pet_owner 1 28439 36949
## <none> 28441 36950
## + interest_religion 1 28440 36951
## + recency 1 28440 36951
## + bad_address 1 28441 36951
## + veteran 1 28441 36952
##
## Step: AIC=36945.48
## donated ~ frequency + money + wealth_levels
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + has_children 1 28416 36937
## + age 1 28424 36944
## + imputed_age 1 28424 36944
## + interest_veterans 1 28424 36945
## + donation_prob 1 28424 36945
## + catalog_shopper 1 28424 36945
## + donation_pred 1 28425 36945
## <none> 28427 36945
## + pet_owner 1 28425 36946
## + interest_religion 1 28426 36947
## + recency 1 28426 36947
## + bad_address 1 28427 36947
## + veteran 1 28427 36947
##
## Step: AIC=36938.4
## donated ~ frequency + money + wealth_levels + has_children
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + pet_owner 1 28413 36937
## + donation_prob 1 28413 36937
## + catalog_shopper 1 28413 36937
## + interest_veterans 1 28413 36937
## + donation_pred 1 28414 36938
## <none> 28416 36938
## + interest_religion 1 28415 36939
## + age 1 28416 36940
## + imputed_age 1 28416 36940
## + recency 1 28416 36940
## + bad_address 1 28416 36940
## + veteran 1 28416 36940
##
## Step: AIC=36932.25
## donated ~ frequency + money + wealth_levels + has_children +
## pet_owner
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
## Df Deviance AIC
## <none> 28413 36932
## + donation_prob 1 28411 36932
## + interest_veterans 1 28411 36932
## + catalog_shopper 1 28412 36933
## + donation_pred 1 28412 36933
## + age 1 28412 36933
## + imputed_age 1 28412 36933
## + recency 1 28413 36934
## + interest_religion 1 28413 36934
## + bad_address 1 28413 36934
## + veteran 1 28413 36934
Create a vector of predicted probabilities using the predict() function.
# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")
Plot the ROC curve with roc() and plot() and compute the AUC of the stepwise model with auc().
# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors$donated, step_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")
auc(ROC)
## Area under the curve: 0.5849
The loans dataset contains 11,312 randomly selected people who applied for and later received loans from Lending Club, a US-based peer-to-peer lending company.
You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.
Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.
The dataset loans is already in your workspace.
Load the rpart package. Fit a decision tree model with the function rpart(). Supply the R formula that specifies outcome as a function of loan_amount and credit_score as the first argument. Leave the control argument alone for now. (You’ll learn more about that later!) Use predict() with the resulting loan model to predict the outcome for the good_credit applicant. Use the type argument to predict the “class” of the outcome. Do the same for the bad_credit applicant.
# Load the rpart package
library(rpart)
loans <- read.delim("DATABASE/loans.txt", sep = ",")
head(loans)
## X loan_amount emp_length home_ownership income loan_purpose
## 1 1 LOW 10+ years RENT LOW credit_card
## 2 2 LOW < 2 years RENT LOW car
## 3 8 LOW 6 - 9 years RENT MEDIUM car
## 4 9 MEDIUM 2 - 5 years OWN MEDIUM small_business
## 5 10 LOW < 2 years RENT LOW other
## 6 13 MEDIUM < 2 years RENT LOW debt_consolidation
## debt_to_income credit_score recent_inquiry delinquent credit_accounts
## 1 HIGH AVERAGE YES NEVER FEW
## 2 LOW AVERAGE YES NEVER FEW
## 3 LOW LOW YES NEVER FEW
## 4 LOW AVERAGE YES NEVER AVERAGE
## 5 AVERAGE AVERAGE NO NEVER FEW
## 6 AVERAGE AVERAGE YES NEVER FEW
## bad_public_record credit_utilization past_bankrupt outcome
## 1 NO HIGH NO repaid
## 2 NO LOW NO default
## 3 NO HIGH NO repaid
## 4 NO MEDIUM NO default
## 5 NO MEDIUM NO default
## 6 NO HIGH NO default
# Build a lending model predicting loan outcome versus loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))
good_credit <- read.delim("DATABASE/good_credit.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/good_credit.txt'
head(good_credit)
## X loan_amount emp_length home_ownership income loan_purpose debt_to_income
## 1 1 LOW 10+ years MORTGAGE HIGH major_purchase AVERAGE
## credit_score recent_inquiry delinquent credit_accounts bad_public_record
## 1 HIGH NO NEVER MANY NO
## credit_utilization past_bankrupt outcome
## 1 LOW NO repaid
# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
## 1
## repaid
## Levels: default repaid
bad_credit <- read.delim("DATABASE/bad_credit.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/bad_credit.txt'
head(bad_credit)
## X loan_amount emp_length home_ownership income loan_purpose debt_to_income
## 1 1 LOW 6 - 9 years RENT MEDIUM car LOW
## credit_score recent_inquiry delinquent credit_accounts bad_public_record
## 1 LOW YES NEVER FEW NO
## credit_utilization past_bankrupt outcome
## 1 HIGH NO repaid
# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
## 1
## default
## Levels: default repaid
Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.
The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions. The model loan_model that you fit in the last exercise is in your workspace.
Type loan_model to see a text representation of the classification tree. Load the rpart.plot package. Apply the rpart.plot() function to the loan model to visualize the tree. See how changing other plotting parameters impacts the visualization by running the supplied command.
# Examine the loan_model object
loan_model
## n= 11312
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 11312 5654 repaid (0.4998232 0.5001768)
## 2) credit_score=AVERAGE,LOW 9490 4437 default (0.5324552 0.4675448)
## 4) credit_score=LOW 1667 631 default (0.6214757 0.3785243) *
## 5) credit_score=AVERAGE 7823 3806 default (0.5134859 0.4865141)
## 10) loan_amount=HIGH 2472 1079 default (0.5635113 0.4364887) *
## 11) loan_amount=LOW,MEDIUM 5351 2624 repaid (0.4903756 0.5096244)
## 22) loan_amount=LOW 1810 874 default (0.5171271 0.4828729) *
## 23) loan_amount=MEDIUM 3541 1688 repaid (0.4767015 0.5232985) *
## 3) credit_score=HIGH 1822 601 repaid (0.3298573 0.6701427) *
# Load the rpart.plot package
library(rpart.plot)
# Plot the loan_model with default settings
rpart.plot(loan_model)
# Plot the loan_model with customized settings
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)
Creating random test datasets
Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants. You can use 75% of the observations for training and 25% for testing the model.
The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.
Use the resulting vector of row IDs to subset the loans into training and testing datasets. The dataset loans is loaded in your workspace.
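Note that sample() draws a random subset, so the accuracy figures in the rest of this chapter vary from run to run; the outputs below were produced without a fixed seed, which is why they differ from the 58.3% quoted in the original exercise. For a reproducible split you could set the seed first, as in this sketch (the seed value is arbitrary, and was not used for the outputs shown below):
# Optional: fix the RNG seed so the same split is drawn on every run
set.seed(123)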
# Determine the number of rows for training
nrow(loans) * 0.75
## [1] 8484
# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans) * 0.75)
# Create the training dataset
loans_train <- loans[sample_rows, ]
# Create the test dataset
loans_test <- loans[-sample_rows, ]
Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.
Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.
Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.
The rpart package is loaded into the workspace and the loans_train and loans_test datasets have been created.
Use rpart() to build a loan model using the training dataset and all of the available predictors. Again, leave the control argument alone. Applying the predict() function to the testing dataset, create a vector of predicted outcomes. Don’t forget the type argument. Create a table() to compare the predicted values to the actual outcome values. Compute the accuracy of the predictions using the mean() function.
# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))
# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")
# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
##
## default repaid
## default 795 718
## repaid 594 721
# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5360679
The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.
Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.
rpart is loaded.
Use rpart() to build a loan model using the training dataset and all of the available predictors. Set the model controls using rpart.control() with parameters cp set to 0 and maxdepth set to 6. See how the test set accuracy of the simpler model compares to the accuracy of the original, fully grown tree (58.3% in the original exercise; 53.6% in the unseeded run above). First create a vector of predictions using the predict() function. Compare the predictions to the actual outcomes and use mean() to calculate the accuracy.
# Grow a tree with maxdepth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, maxdepth = 6))
# Make a class prediction on the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")
# Compute the accuracy of the simpler tree
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5848656
In the model controls, remove maxdepth and add a minimum split parameter, minsplit, set to 500.
# Swap maxdepth for a minimum split of 500
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, minsplit = 500))
# Run this. How does the accuracy change?
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5880481
Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.
By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.
In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.
The rpart package is loaded into the workspace, along with loans_test and loans_train.
Use all of the applicant variables and no pre-pruning to create an overly complex tree. Make sure to set cp = 0 in rpart.control() to prevent pre-pruning. Create a complexity plot by using plotcp() on the model.
# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))
# Examine the complexity plot
plotcp(loan_model)
Based on the complexity plot, prune the tree to a complexity of 0.0014 using the prune() function with the tree and the complexity parameter. Compare the accuracy of the pruned tree to the original accuracy of 58.3%. To calculate the accuracy use the predict() and mean() functions.
# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)
# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5975955
Although a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly tuned tree.
Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.
Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest. Load the randomForest package. Build a random forest model using all of the loan application variables. The randomForest function also uses the formula interface. Compute the accuracy of the random forest model to compare to the original tree’s accuracy of 58.3% using predict() and mean().
# Load the randomForest package
library(randomForest)

# Build a random forest model
loan_model <- randomForest(outcome ~ ., data = loans_train)

# Compute the accuracy of the random forest
# (results vary slightly from run to run; output not shown here)
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)
For the first coding exercise, you’ll create a formula to define a one-variable modeling task, and then fit a linear model to the data. You are given the rates of male and female unemployment in the United States over several years (Source).
The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.
The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.
Recall the calling interface for lm() is:
lm(formula, data = ___)
The data frame unemployment is in your workspace.
Define a formula that expresses female_unemployment as a function of male_unemployment. Assign the formula to the variable fmla and print it.
unemployment <- read.delim("DATABASE/unemployment.txt", sep = ",")
head(unemployment)
## X male_unemployment female_unemployment
## 1 1 2.9 4.0
## 2 2 6.7 7.4
## 3 3 4.9 5.0
## 4 4 7.9 7.2
## 5 5 9.8 7.9
## 6 6 6.9 6.1
# unemployment is loaded in the workspace
summary(unemployment)
## X male_unemployment female_unemployment
## Min. : 1 Min. :2.900 Min. :4.000
## 1st Qu.: 4 1st Qu.:4.900 1st Qu.:4.400
## Median : 7 Median :6.000 Median :5.200
## Mean : 7 Mean :5.954 Mean :5.569
## 3rd Qu.:10 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :13 Max. :9.800 Max. :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment
# Print it
fmla
## female_unemployment ~ male_unemployment
Then use lm() and fmla to fit a linear model to predict female unemployment from male unemployment using the data set unemployment. Print the model. Is the coefficient for male unemployment consistent with what you would expect? Does female unemployment increase as male unemployment does?
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)
# Print it
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
Let’s look at the model unemployment_model that you have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().
The object unemployment_model is in your workspace.
Print unemployment_model again. What information does it report? Call summary() on unemployment_model. In addition to the coefficient values, you get standard errors on the coefficient estimates, and some goodness-of-fit metrics like R-squared.
# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
Call glance() on the model to see the performance metrics in an orderly data frame. Can you match the information from summary() to the columns of glance()?
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.821 0.805 0.580 50.6 1.97e-5 1 -10.3 26.6 28.3
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Now call wrapFTest() on the model to see the R-squared again.
# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."
In this exercise, you will use your unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. You will also use your model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.
The predict() interface for lm models takes the form

predict(model, newdata)

You will use the ggplot2 package to make the plots, so you will add the prediction column to the unemployment data frame. You will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is, when the outcome is equal to the predicted value).

The ggplot2 command to plot a scatterplot of dframe$outcome versus dframe$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred, is as follows:

ggplot(dframe, aes(x = pred, y = outcome)) +
  geom_point() +
  geom_abline(color = "blue")
The objects unemployment, unemployment_model and newrates are in your workspace.
Use predict() to predict female unemployment rates from the unemployment data. Assign it to a new column: prediction.
newrates <- read.delim("DATABASE/newrates.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/newrates.txt'
head(newrates)
## X male_unemployment
## 1 1 5
# unemployment is in your workspace
summary(unemployment)
## X male_unemployment female_unemployment
## Min. : 1 Min. :2.900 Min. :4.000
## 1st Qu.: 4 1st Qu.:4.900 1st Qu.:4.400
## Median : 7 Median :6.000 Median :5.200
## Mean : 7 Mean :5.954 Mean :5.569
## 3rd Qu.:10 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :13 Max. :9.800 Max. :7.900
# newrates is in your workspace
newrates
## X male_unemployment
## 1 1 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model)
Use the library() command to load the ggplot2 package. Use ggplot() to compare the predictions to actual unemployment rates. Put the predictions on the x axis. How close are the results to the line of perfect prediction? Use the data frame newrates to predict expected female unemployment rate when male unemployment is 5%. Assign the answer to the variable pred and print it.
# load the ggplot2 package
library(ggplot2)
# Make a plot to compare predictions to actual (prediction on x axis)
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
geom_point() +
geom_abline(color = "blue")
# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata = newrates)
# Print it
pred
## 1
## 4.906757
In this exercise, you will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.
The data frame bloodpressure is in the workspace.
Define a formula that expresses blood_pressure explicitly as a function of age and weight. Assign the formula to the variable fmla and print it.
bloodpressure <- read.delim("DATABASE/bloodpressure.txt", sep = ",")
head(bloodpressure)
## X blood_pressure age weight
## 1 1 132 52 173
## 2 2 143 59 184
## 3 3 153 67 194
## 4 4 162 73 211
## 5 5 154 64 196
## 6 6 168 74 220
# bloodpressure is in the workspace
summary(bloodpressure)
## X blood_pressure age weight
## Min. : 1.0 Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.: 3.5 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median : 6.0 Median :153.0 Median :64.00 Median :194
## Mean : 6.0 Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.: 8.5 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :11.0 Max. :168.0 Max. :74.00 Max. :220
head(bloodpressure)
## X blood_pressure age weight
## 1 1 132 52 173
## 2 2 143 59 184
## 3 3 153 67 194
## 4 4 162 73 211
## 5 5 154 64 196
## 6 6 168 74 220
Use fmla to fit a linear model to predict blood_pressure from age and weight in the data set bloodpressure. Call the model bloodpressure_model.
# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla
## blood_pressure ~ age + weight
Print the model and call summary() on it. Does blood pressure increase or decrease with age? With weight?
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)
# Print bloodpressure_model and call summary()
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
summary(bloodpressure_model)
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## age 0.8614 0.2482 3.470 0.00844 **
## weight 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
Now you will make predictions using the blood pressure model bloodpressure_model that you fit in the previous exercise.
You will also compare the predictions to outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:

ggplot(dframe, aes(x = pred, y = outcome)) +
  geom_point() +
  geom_abline(color = "blue")

The objects bloodpressure and bloodpressure_model are in the workspace.
# bloodpressure is in your workspace
summary(bloodpressure)
## X blood_pressure age weight
## Min. : 1.0 Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.: 3.5 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median : 6.0 Median :153.0 Median :64.00 Median :194
## Mean : 6.0 Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.: 8.5 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :11.0 Max. :168.0 Max. :74.00 Max. :220
# bloodpressure_model is in your workspace
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
Use predict() to predict blood pressure in the bloodpressure dataset. Assign the predictions to the column prediction.
# Predict blood pressure using bloodpressure_model: prediction
bloodpressure$prediction <- predict(bloodpressure_model, newdata = bloodpressure)
head(bloodpressure)
## X blood_pressure age weight prediction
## 1 1 132 52 173 133.7183
## 2 2 143 59 184 143.4317
## 3 3 153 67 194 153.6716
## 4 4 162 73 211 164.5327
## 5 5 154 64 196 151.7570
## 6 6 168 74 220 168.4078
Graphically compare the predictions to actual blood pressures. Put predictions on the x axis. How close are the results to the line of perfect prediction?
# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) +
geom_point() +
geom_abline(color = "blue")
In this exercise you will graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment.
You will plot the model’s predictions against the actual female_unemployment; recall the command is of the form

ggplot(dframe, aes(x = pred, y = outcome)) +
  geom_point() +
  geom_abline()

Then you will calculate the residuals:

residuals <- actual outcome - predicted outcome

and plot predictions against residuals. The residual graph will take a slightly different form: you compare the residuals to the horizontal line residuals == 0 (using geom_hline()) rather than to the line x == y. The command will be provided.
The data frame unemployment and model unemployment_model are available in the workspace.
# unemployment, unemployment_model are in the workspace
summary(unemployment)
## X male_unemployment female_unemployment prediction
## Min. : 1 Min. :2.900 Min. :4.000 Min. :3.448
## 1st Qu.: 4 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837
## Median : 7 Median :6.000 Median :5.200 Median :5.601
## Mean : 7 Mean :5.954 Mean :5.569 Mean :5.569
## 3rd Qu.:10 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087
## Max. :13 Max. :9.800 Max. :7.900 Max. :8.240
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
head(unemployment)
## X male_unemployment female_unemployment prediction
## 1 1 2.9 4.0 3.448245
## 2 2 6.7 7.4 6.087456
## 3 3 4.9 5.0 4.837304
## 4 4 7.9 7.2 6.920891
## 5 5 9.8 7.9 8.240497
## 6 6 6.9 6.1 6.226362
Use predict() to get the model predictions and add them to unemployment as the column predictions.
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, newdata = unemployment)
Plot predictions (on the x-axis) versus actual female unemployment rates. Are the predictions near the x = y line?
# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment )) +
geom_point() +
geom_abline()
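The residual plot command mentioned above does not appear in this write-up; a minimal version, computing the residuals inline (actual outcome minus prediction) and comparing them to the horizontal line at zero, might look like this:

# Plot predictions (x-axis) versus residuals, with a horizontal reference line at 0
ggplot(unemployment, aes(x = predictions, y = female_unemployment - predictions)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue")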
In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals. Now, you will also plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.
For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.
Calls to the function GainCurvePlot() look like:
GainCurvePlot(frame, xvar, truthvar, title)

where

frame is a data frame
xvar and truthvar are strings naming the prediction and actual outcome columns of frame
title is the title of the plot

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.
The data frame unemployment and the model unemployment_model are in the workspace.
# unemployment is in the workspace (with predictions)
summary(unemployment)
## X male_unemployment female_unemployment prediction
## Min. : 1 Min. :2.900 Min. :4.000 Min. :3.448
## 1st Qu.: 4 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837
## Median : 7 Median :6.000 Median :5.200 Median :5.601
## Mean : 7 Mean :5.954 Mean :5.569 Mean :5.569
## 3rd Qu.:10 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087
## Max. :13 Max. :9.800 Max. :7.900 Max. :8.240
## predictions
## Min. :3.448
## 1st Qu.:4.837
## Median :5.601
## Mean :5.569
## 3rd Qu.:6.087
## Max. :8.240
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
Load the package WVPlots using library(). Plot the gain curve. Give the plot the title “Unemployment model”. Do the model’s predictions sort correctly?
# Load the package WVPlots
library(WVPlots)
# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")
Calculate RMSE

In this exercise you will calculate the RMSE of your unemployment model. In a previous coding exercise, you added the model’s predictions to the unemployment dataset as the column predictions; the residuals are the differences between the actual outcome and those predictions.

You can calculate the RMSE from a vector of residuals, $res$, as:

$$RMSE = \sqrt{\overline{res^2}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} res_i^2}$$

You want RMSE to be small. How small is “small”? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.
# unemployment is in the workspace
summary(unemployment)
## X male_unemployment female_unemployment prediction
## Min. : 1 Min. :2.900 Min. :4.000 Min. :3.448
## 1st Qu.: 4 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837
## Median : 7 Median :6.000 Median :5.200 Median :5.601
## Mean : 7 Mean :5.954 Mean :5.569 Mean :5.569
## 3rd Qu.:10 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087
## Max. :13 Max. :9.800 Max. :7.900 Max. :8.240
## predictions
## Min. :3.448
## 1st Qu.:4.837
## Median :5.601
## Mean :5.569
## 3rd Qu.:6.087
## Max. :8.240
# For convenience put the residuals in the variable res
res <- unemployment$female_unemployment - unemployment$predictions
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271
Now that you’ve calculated the RMSE of your model’s predictions, you will examine how well the model fits the data: that is, how much variance it explains. You can do this using $R^2$.

Suppose $y$ is the true outcome, $p$ is the prediction from the model, and $res = y - p$ are the residuals of the predictions.

Then the total sum of squares (“total variance”) of the data is:

$$TSS = \sum_i (y_i - \bar{y})^2$$

where $\bar{y}$ is the mean value of $y$.

The residual sum of squared errors of the model, $RSS$, is:

$$RSS = \sum_i res_i^2$$

$R^2$ (R-Squared), the “variance explained” by the model, is then:

$$R^2 = 1 - \frac{RSS}{TSS}$$

After you calculate $R^2$, you will compare what you computed with the $R^2$ reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the $R^2$ of the model on the training data.
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231
The data frame unemployment is in your workspace, with the predictions column that you calculated in a previous exercise.
# Calculate total sum of squares: tss. Print it
(tss <- sum( (unemployment$female_unemployment - fe_mean)^2 ))
## [1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum(res^2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - (rss/tss))
## [1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157
The linear correlation of two variables, $x$ and $y$, measures the strength of the linear relationship between them. When $x$ and $y$ are respectively:

the predictions of a regression model that minimizes squared-error (like linear regression), and
the true outcomes of the training data,

then the square of the correlation, $\rho^2$, is the same as $R^2$. You will verify that in this exercise.
# unemployment is in your workspace
summary(unemployment)
## X male_unemployment female_unemployment prediction
## Min. : 1 Min. :2.900 Min. :4.000 Min. :3.448
## 1st Qu.: 4 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837
## Median : 7 Median :6.000 Median :5.200 Median :5.601
## Mean : 7 Mean :5.954 Mean :5.569 Mean :5.569
## 3rd Qu.:10 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087
## Max. :13 Max. :9.800 Max. :7.900 Max. :8.240
## predictions
## Min. :3.448
## 1st Qu.:4.837
## Median :5.601
## Mean :5.569
## 3rd Qu.:6.087
## Max. :8.240
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
Use cor() to get the correlation between the predictions and female unemployment. Assign it to the variable rho and print it. Make sure you use Pearson correlation (the default). Square rho and assign it to rho2. Print it. Compare rho2 to the $R^2$ from the model (using glance()). Is it the same?
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
## [1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho ^ 2)
## [1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157
For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.
In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().
If you have a data set dframe of size $N$, and you want a random subset of approximately $X \cdot 100\%$ of $N$ (where $X$ is between 0 and 1), then:

Generate a vector of uniform random numbers: gp = runif(N).
dframe[gp < X, ] will be about the right size.
dframe[gp >= X, ] will be the complement.
The data frame mpg is in the workspace.
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
dim(mpg)
## [1] 234 11
Use the function nrow to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
## [1] 234
Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))
## [1] 176
Use runif() to generate a vector of N uniform random numbers, called gp.
# Create the vector of N uniform random variables: gp
gp <- runif(N)
Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]
Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?
# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
## [1] 175
nrow(mpg_test)
## [1] 59
Now that you have split the mpg dataset into mpg_train and mpg_test, you will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).
The data frame mpg_train is in the workspace.
# mpg_train is in the workspace
summary(mpg_train)
## manufacturer model displ year
## Length:175 Length:175 Min. :1.60 Min. :1999
## Class :character Class :character 1st Qu.:2.40 1st Qu.:1999
## Mode :character Mode :character Median :3.30 Median :2008
## Mean :3.47 Mean :2004
## 3rd Qu.:4.60 3rd Qu.:2008
## Max. :6.50 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:175 Length:175 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :16.00
## Mean :5.886 Mean :16.85
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :33.00
## hwy fl class
## Min. :12.00 Length:175 Length:175
## 1st Qu.:18.00 Class :character Class :character
## Median :25.00 Mode :character Mode :character
## Mean :23.47
## 3rd Qu.:27.00
## Max. :44.00
Create a formula fmla that expresses the relationship cty as a function of hwy. Print it.
# Create a formula to express cty as a function of hwy: fmla. Print it
(fmla <- cty ~ hwy)
## cty ~ hwy
Train a model mpg_model on mpg_train to predict cty from hwy using fmla and lm(). Use summary() to examine the model.
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy
mpg_model <- lm(fmla, data = mpg_train)
# Use summary() to examine the model
summary(mpg_model)
##
## Call:
## lm(formula = fmla, data = mpg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8710 -0.8615 -0.1837 0.7901 4.7830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.16468 0.40553 2.872 0.00459 **
## hwy 0.66825 0.01678 39.834 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.281 on 173 degrees of freedom
## Multiple R-squared: 0.9017, Adjusted R-squared: 0.9011
## F-statistic: 1587 on 1 and 173 DF, p-value: < 2.2e-16
Now you will test the model mpg_model on the test data, mpg_test. Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience:

rmse(predcol, ycol)
r_squared(predcol, ycol)

where:

predcol: the predicted values
ycol: the actual outcome

You will also plot the predictions vs. the outcome.
Generally, model performance is better on the training data than the test data (though sometimes the test set “gets lucky”). A slight difference in performance is okay; if the performance on training is significantly better, there is a problem.
The data frames mpg_train and mpg_test, and the model mpg_model are in the workspace, along with the functions rmse() and r_squared().
# # Examine the objects in the workspace
# ls.str()
Predict city fuel efficiency from hwy on the mpg_train data. Assign the predictions to the column pred.
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model)
Predict city fuel efficiency from hwy on the mpg_test data. Assign the predictions to the column pred.
# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, newdata = mpg_test)
Use rmse() to evaluate rmse for both the test and training sets. Compare. Are the performances similar?
# Define rmse(): root mean squared error between predictions and outcomes
rmse <- function(predcol, ycol) {
  sqrt(mean((predcol - ycol)^2))
}
# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
## [1] 1.273331
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.180521
# Define r_squared(): squared correlation between predictions and outcomes
r_squared <- function(predcol, ycol) {
  cor(predcol, ycol)^2
}
# Evaluate the r-squared on both training and test data.and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
## [1] 0.9016886
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.9437755
Do the same with r_squared(). Are the performances similar? Use ggplot2 to plot the predictions against cty on the test data.
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) +
geom_point() +
geom_abline()
There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.
kWayCrossValidation() creates a cross validation plan with the following call:
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)

where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

train: the indices of dframe that will form the training set
app: the indices of dframe that will form the test (or application) set

In this exercise you will create a 3-fold cross-validation plan for the data set mpg.
Load the package vtreat.
# Load the package vtreat
library(vtreat)
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
Get the number of rows in mpg and assign it to the variable nRows.
# Get the number of rows in mpg
nRows <- nrow(mpg)
Call kWayCrossValidation to create a 3-fold cross validation plan and assign it to the variable splitPlan. You can set the last two arguments of the function to NULL. Call str() to examine the structure of splitPlan.
# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)
# Examine the split plan
str(splitPlan)
## List of 3
## $ :List of 2
## ..$ train: int [1:156] 1 2 3 6 8 9 11 13 15 16 ...
## ..$ app : int [1:78] 209 141 175 128 32 117 38 185 140 85 ...
## $ :List of 2
## ..$ train: int [1:156] 1 3 4 5 6 7 9 10 12 14 ...
## ..$ app : int [1:78] 195 19 129 21 8 183 50 57 230 142 ...
## $ :List of 2
## ..$ train: int [1:156] 2 4 5 7 8 10 11 12 13 14 ...
## ..$ app : int [1:78] 154 224 3 234 111 102 197 25 173 56 ...
## - attr(*, "splitmethod")= chr "kwaycross"
In this exercise you will use splitPlan, the 3-fold cross validation plan from the previous exercise, to make predictions from a model that predicts mpg$cty from mpg$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

# Initialize a column of the appropriate length
dframe$pred.cv <- 0

# k is the number of folds
# splitPlan is the cross validation plan

for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]
  # Build a model on the training data from this split (lm, in this case)
  model <- lm(fmla, data = dframe[split$train, ])
  # Make predictions on the application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app, ])
}

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.
The data frame mpg, the cross validation plan splitPlan, and the function to calculate RMSE (rmse()) from one of the previous exercises is available in your workspace.
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
# splitPlan is in the workspace
str(splitPlan)
## List of 3
## $ :List of 2
## ..$ train: int [1:156] 1 2 3 6 8 9 11 13 15 16 ...
## ..$ app : int [1:78] 209 141 175 128 32 117 38 185 140 85 ...
## $ :List of 2
## ..$ train: int [1:156] 1 3 4 5 6 7 9 10 12 14 ...
## ..$ app : int [1:78] 195 19 129 21 8 183 50 57 230 142 ...
## $ :List of 2
## ..$ train: int [1:156] 2 4 5 7 8 10 11 12 13 14 ...
## ..$ app : int [1:78] 154 224 3 234 111 102 197 25 173 56 ...
## - attr(*, "splitmethod")= chr "kwaycross"
Run the 3-fold cross validation plan from splitPlan and put the predictions in the column mpg$pred.cv. Use lm() and the formula cty ~ hwy. Create a linear regression model on all the mpg data (formula cty ~ hwy) and assign the predictions to mpg$pred. Use rmse() to get the root mean squared error of the predictions from the full model (mpg$pred). Recall that rmse() takes two arguments, the predicted values and the actual outcome.
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0
for(i in 1:k) {
split <- splitPlan[[i]]
model <- lm(cty ~ hwy, data = mpg[split$train, ])
mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}
# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))
# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045
Get the root mean squared error of the cross-validation predictions. Are the two values about the same?
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.258478
Before moving on to more sophisticated regression techniques, we will look at some other modeling issues: modeling with categorical inputs, interactions between variables, and when you might consider transforming inputs and outputs before modeling. While more sophisticated regression techniques manage some of these issues automatically, it’s important to be aware of them, in order to understand which methods best handle various issues – and which issues you must still manage yourself.
For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace. It has the following columns:
Flowers: the average number of flowers on a meadowfoam plant
Intensity: the intensity of a light treatment applied to the plant
Time: a categorical variable - when (Late or Early) in the lifecycle the light treatment occurred

The ultimate goal is to predict Flowers as a function of Time and Intensity.
The data frame flowers is in your workspace.
Call the str() function on flowers to see the types of each column.
flowers <- read.delim("DATABASE/flowers.txt", sep = ",")
head(flowers)
## X Flowers Time Intensity
## 1 1 62.3 Late 150
## 2 2 77.4 Late 150
## 3 3 55.3 Late 300
## 4 4 54.2 Late 300
## 5 5 49.6 Late 450
## 6 6 61.9 Late 450
# Call str on flowers to see the types of each column
str(flowers)
## 'data.frame': 24 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Flowers : num 62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
## $ Time : Factor w/ 2 levels "Early","Late": 2 2 2 2 2 2 2 2 2 2 ...
## $ Intensity: int 150 150 300 300 450 450 600 600 750 750 ...
flowers
## X Flowers Time Intensity
## 1 1 62.3 Late 150
## 2 2 77.4 Late 150
## 3 3 55.3 Late 300
## 4 4 54.2 Late 300
## 5 5 49.6 Late 450
## 6 6 61.9 Late 450
## 7 7 39.4 Late 600
## 8 8 45.7 Late 600
## 9 9 31.3 Late 750
## 10 10 44.9 Late 750
## 11 11 36.8 Late 900
## 12 12 41.9 Late 900
## 13 13 77.8 Early 150
## 14 14 75.6 Early 150
## 15 15 69.1 Early 300
## 16 16 78.0 Early 300
## 17 17 57.0 Early 450
## 18 18 71.1 Early 450
## 19 19 62.9 Early 600
## 20 20 52.2 Early 600
## 21 21 60.3 Early 750
## 22 22 45.6 Early 750
## 23 23 52.6 Early 900
## 24 24 44.4 Early 900
Use the unique() function on the column flowers$Time to see the possible values that Time takes. How many unique values are there?
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
## [1] Late Early
## Levels: Early Late
Create a formula to express Flowers as a function of Intensity and Time. Assign it to the variable fmla and print it.
# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
## Flowers ~ Intensity + Time
Use fmla and model.matrix() to create the model matrix for the data frame flowers. Assign it to the variable mmat.
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, flowers)
Use head() to examine the first 20 lines of flowers. Now examine the first 20 lines of mmat.
# Examine the first 20 lines of flowers
head(flowers,20)
## X Flowers Time Intensity
## 1 1 62.3 Late 150
## 2 2 77.4 Late 150
## 3 3 55.3 Late 300
## 4 4 54.2 Late 300
## 5 5 49.6 Late 450
## 6 6 61.9 Late 450
## 7 7 39.4 Late 600
## 8 8 45.7 Late 600
## 9 9 31.3 Late 750
## 10 10 44.9 Late 750
## 11 11 36.8 Late 900
## 12 12 41.9 Late 900
## 13 13 77.8 Early 150
## 14 14 75.6 Early 150
## 15 15 69.1 Early 300
## 16 16 78.0 Early 300
## 17 17 57.0 Early 450
## 18 18 71.1 Early 450
## 19 19 62.9 Early 600
## 20 20 52.2 Early 600
Is the numeric column Intensity different? What happened to the categorical column Time from flowers? How is Time == 'Early' represented? And Time == 'Late'?
# Examine the first 20 lines of mmat
head(mmat,20)
## (Intercept) Intensity TimeLate
## 1 1 150 1
## 2 1 150 1
## 3 1 300 1
## 4 1 300 1
## 5 1 450 1
## 6 1 450 1
## 7 1 600 1
## 8 1 600 1
## 9 1 750 1
## 10 1 750 1
## 11 1 900 1
## 12 1 900 1
## 13 1 150 0
## 14 1 150 0
## 15 1 300 0
## 16 1 300 0
## 17 1 450 0
## 18 1 450 0
## 19 1 600 0
## 20 1 600 0
For this exercise you will fit a linear model to the flowers data, to predict Flowers as a function of Time and Intensity.
The model formula fmla that you created in the previous exercise is still in your workspace, as is the model matrix mmat.
Use fmla and lm to train a linear model that predicts Flowers from Intensity and Time. Assign the model to the variable flower_model.
# flowers is in the workspace
str(flowers)
## 'data.frame': 24 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Flowers : num 62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
## $ Time : Factor w/ 2 levels "Early","Late": 2 2 2 2 2 2 2 2 2 2 ...
## $ Intensity: int 150 150 300 300 450 450 600 600 750 750 ...
# fmla is in the workspace
fmla
## Flowers ~ Intensity + Time
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, flowers)
Use summary() to remind yourself of the structure of mmat.
# Use summary on mmat to remind yourself of its structure
summary(mmat)
## (Intercept) Intensity TimeLate
## Min. :1 Min. :150 Min. :0.0
## 1st Qu.:1 1st Qu.:300 1st Qu.:0.0
## Median :1 Median :525 Median :0.5
## Mean :1 Mean :525 Mean :0.5
## 3rd Qu.:1 3rd Qu.:750 3rd Qu.:1.0
## Max. :1 Max. :900 Max. :1.0
Use summary() to examine the flower_model. Do the variables match what you saw in mmat? Use flower_model to predict the number of flowers. Add the predictions to flowers as the column predictions.
# Use summary to examine flower_model
summary(flower_model)
##
## Call:
## lm(formula = fmla, data = flowers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.652 -4.139 -1.558 5.632 12.165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.464167 3.273772 25.495 < 2e-16 ***
## Intensity -0.040471 0.005132 -7.886 1.04e-07 ***
## TimeLate -12.158333 2.629557 -4.624 0.000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared: 0.7992, Adjusted R-squared: 0.78
## F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08
Fill in the blanks to plot predictions vs. actual flowers (predictions on the x-axis).
# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model, flowers)
# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) +
geom_point() +
geom_abline(color = "blue")
In this exercise you will use interactions to model the effect of gender and gastric activity on alcohol metabolism.
The data frame alcohol has columns:
Metabol: the alcohol metabolism rate
Gastric: the rate of gastric alcohol dehydrogenase activity
Sex: the sex of the drinker (Male or Female)

In the video, we fit three models to the alcohol data:

one with only additive (main effect) terms: Metabol ~ Gastric + Sex
two models, each with interactions between gastric activity and sex

We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model.
Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.
x*y = x + y + x:y
The data frame alcohol is in your workspace.
Write a formula that expresses Metabol as a function of Gastric and Sex with no interactions.
alcohol <- read.delim("DATABASE/alcohol.txt", sep = ",")
head(alcohol)
## X Subject Metabol Gastric Sex Alcohol
## 1 1 1 0.6 1.0 Female Alcoholic
## 2 2 2 0.6 1.6 Female Alcoholic
## 3 3 3 1.5 1.5 Female Alcoholic
## 4 4 4 0.4 2.2 Female Non-alcoholic
## 5 5 5 0.1 1.1 Female Non-alcoholic
## 6 6 6 0.2 1.2 Female Non-alcoholic
# alcohol is in the workspace
summary(alcohol)
## X Subject Metabol Gastric Sex
## Min. : 1.00 Min. : 1.00 Min. : 0.100 Min. :0.800 Female:18
## 1st Qu.: 8.75 1st Qu.: 8.75 1st Qu.: 0.600 1st Qu.:1.200 Male :14
## Median :16.50 Median :16.50 Median : 1.700 Median :1.600
## Mean :16.50 Mean :16.50 Mean : 2.422 Mean :1.863
## 3rd Qu.:24.25 3rd Qu.:24.25 3rd Qu.: 2.925 3rd Qu.:2.200
## Max. :32.00 Max. :32.00 Max. :12.300 Max. :5.200
## Alcohol
## Alcoholic : 8
## Non-alcoholic:24
##
##
##
##
Assign the formula to the variable fmla_add and print it.
# Create the formula with main effects only
(fmla_add <- as.formula("Metabol ~ Gastric + Sex"))
## Metabol ~ Gastric + Sex
Now write a formula that expresses Metabol as a function of the interaction between Gastric and Sex, adding Gastric as a main effect but not Sex. Assign the formula to the variable fmla_interaction and print it.
# Create the formula with interactions
(fmla_interaction <- as.formula("Metabol ~ Gastric + Gastric:Sex"))
## Metabol ~ Gastric + Gastric:Sex
Fit a linear model with only main effects (model_add) and a linear model with the interaction (model_interaction) to the data. Call summary() on both models. Which has a better R-squared?
# Fit the main effects only model
model_add <- lm(fmla_add,alcohol)
# Fit the interaction model
model_interaction <- lm(fmla_interaction, alcohol)
# Call summary on both models and compare
summary(model_add)
##
## Call:
## lm(formula = fmla_add, data = alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2779 -0.6328 -0.0966 0.5783 4.5703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.9466 0.5198 -3.745 0.000796 ***
## Gastric 1.9656 0.2674 7.352 4.24e-08 ***
## SexMale 1.6174 0.5114 3.163 0.003649 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.331 on 29 degrees of freedom
## Multiple R-squared: 0.7654, Adjusted R-squared: 0.7492
## F-statistic: 47.31 on 2 and 29 DF, p-value: 7.41e-10
summary(model_interaction)
##
## Call:
## lm(formula = fmla_interaction, data = alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4656 -0.5091 0.0143 0.5660 4.0668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7504 0.5310 -1.413 0.168236
## Gastric 1.1489 0.3450 3.331 0.002372 **
## Gastric:SexMale 1.0422 0.2412 4.321 0.000166 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.204 on 29 degrees of freedom
## Multiple R-squared: 0.8081, Adjusted R-squared: 0.7948
## F-statistic: 61.05 on 2 and 29 DF, p-value: 4.033e-11
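As a sketch of the * operator in practice, a model with main effects for both variables plus their interaction could be fit in one step; this full model is not one of the two compared above, and its summary is not shown here:

# Metabol ~ Gastric * Sex expands to Metabol ~ Gastric + Sex + Gastric:Sex
model_full <- lm(Metabol ~ Gastric * Sex, data = alcohol)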
In this exercise, you will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.
You will begin to use the dplyr package to do calculations:

mutate() adds new columns to a tbl (a type of data frame)
group_by() specifies how rows are grouped in a tbl
summarize() computes summary statistics of a column

You will also use tidyr’s gather(), which takes multiple columns and collapses them into key-value pairs.
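To see what gather() does, here is a tiny illustrative sketch; the data frame df and its column names are invented for this example:

# Collapse the wide columns a and b into key-value pairs
library(tidyr)
df <- data.frame(id = 1:2, a = c(10, 20), b = c(1, 2))
gather(df, key = colname, value = value, a, b)
##   id colname value
## 1  1       a    10
## 2  2       a    20
## 3  1       b     1
## 4  2       b     2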
The data frame alcohol and the formulas fmla_add and fmla_interaction are in the workspace.
# alcohol is in the workspace
summary(alcohol)
## X Subject Metabol Gastric Sex
## Min. : 1.00 Min. : 1.00 Min. : 0.100 Min. :0.800 Female:18
## 1st Qu.: 8.75 1st Qu.: 8.75 1st Qu.: 0.600 1st Qu.:1.200 Male :14
## Median :16.50 Median :16.50 Median : 1.700 Median :1.600
## Mean :16.50 Mean :16.50 Mean : 2.422 Mean :1.863
## 3rd Qu.:24.25 3rd Qu.:24.25 3rd Qu.: 2.925 3rd Qu.:2.200
## Max. :32.00 Max. :32.00 Max. :12.300 Max. :5.200
## Alcohol
## Alcoholic : 8
## Non-alcoholic:24
##
##
##
##
# Both the formulae are in the workspace
fmla_add
## Metabol ~ Gastric + Sex
fmla_interaction
## Metabol ~ Gastric + Gastric:Sex
Use kWayCrossValidation() to create a splitting plan for a 3-fold cross validation. The first argument is the number of rows to be split; the second argument is the number of folds for the cross-validation. You can set the 3rd and 4th arguments of the function to NULL.
# Create the splitting plan for 3-fold cross validation
set.seed(34245) # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)
Examine and run the sample code to get the 3-fold cross-validation predictions of a model with no interactions and assign them to the column pred_add.
# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_add <- lm(fmla_add, data = alcohol[split$train, ])
alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}
Get the 3-fold cross-validation predictions of the model with interactions. Assign the predictions to the column pred_interaction. The sample code shows you the procedure.
# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}
Use the same splitPlan that you already created. Fill in the blanks to:

gather the predictions into a single column pred
add a column of residuals (actual outcome - predicted outcome)
get the RMSE of the cross-validation predictions for each model type

Compare the RMSEs. Based on these results, which model should you use?
# Get RMSE
alcohol %>%
gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
mutate(residuals = Metabol - pred) %>%
group_by(modeltype) %>%
summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 x 2
## modeltype rmse
## * <chr> <dbl>
## 1 pred_add 1.38
## 2 pred_interaction 1.30
In this exercise, you will compare relative error to absolute error. For the purposes of modeling, we will define relative error as

$$relerr = \frac{y - pred}{y}$$

that is, the error is relative to the true outcome. You will measure the overall relative error of a model using root mean squared relative error:

$$RMS_{rel} = \sqrt{\overline{relerr^2}}$$

where $\overline{relerr^2}$ is the mean value of $relerr^2$.
The example (toy) dataset fdata is loaded in your workspace. It includes the columns:

y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store
pred: the predictions of a model that predicts y
label: categorical: whether y comes from a population that makes small purchases, or large ones

You want to know which model does “better”: the one predicting the small purchases, or the one predicting large ones.
The data frame fdata is in the workspace.
fdata <- read.delim("DATABASE/fdata.txt", sep = ",")
head(fdata)
## X y pred label
## 1 1 9.149694 6.430583 small purchases
## 2 2 1.902521 3.473332 small purchases
## 3 3 -3.859826 1.594509 small purchases
## 4 4 2.388970 3.764175 small purchases
## 5 5 1.541572 9.509294 small purchases
## 6 6 13.561878 6.931725 small purchases
# fdata is in the workspace
summary(fdata)
## X y pred label
## Min. : 1.00 Min. : -5.894 Min. : 1.072 large purchases:50
## 1st Qu.: 25.75 1st Qu.: 5.407 1st Qu.: 6.373 small purchases:50
## Median : 50.50 Median : 57.374 Median : 55.693
## Mean : 50.50 Mean : 306.204 Mean : 305.905
## 3rd Qu.: 75.25 3rd Qu.: 550.903 3rd Qu.: 547.886
## Max. :100.00 Max. :1101.619 Max. :1098.896
fdata
## X y pred label
## 1 1 9.1496941 6.430583 small purchases
## 2 2 1.9025206 3.473332 small purchases
## 3 3 -3.8598260 1.594509 small purchases
## 4 4 2.3889703 3.764175 small purchases
## 5 5 1.5415715 9.509294 small purchases
## 6 6 13.5618785 6.931725 small purchases
## 7 7 10.1987859 8.191798 small purchases
## 8 8 1.1044627 1.514578 small purchases
## 9 9 3.9354788 8.986364 small purchases
## 10 10 9.0407978 6.149792 small purchases
## 11 11 1.7276491 8.498107 small purchases
## 12 12 15.7238410 10.941225 small purchases
## 13 13 2.2578444 6.003761 small purchases
## 14 14 -1.9750566 1.071922 small purchases
## 15 15 1.1006461 4.420715 small purchases
## 16 16 18.6282945 10.518787 small purchases
## 17 17 3.6786547 5.746765 small purchases
## 18 18 3.0863171 7.092900 small purchases
## 19 19 8.6913910 7.837105 small purchases
## 20 20 7.9086741 4.307022 small purchases
## 21 21 5.4401589 6.001545 small purchases
## 22 22 14.7850358 8.309529 small purchases
## 23 23 9.0220444 8.625899 small purchases
## 24 24 3.9766558 2.978443 small purchases
## 25 25 2.6675677 4.040108 small purchases
## 26 26 7.6815442 7.460930 small purchases
## 27 27 11.9348659 9.082721 small purchases
## 28 28 5.3087341 6.523833 small purchases
## 29 29 13.0587206 10.342933 small purchases
## 30 30 2.2285365 4.240767 small purchases
## 31 31 15.4031317 8.833006 small purchases
## 32 32 -0.8775657 1.165411 small purchases
## 33 33 7.6111939 4.743669 small purchases
## 34 34 9.8625104 8.426791 small purchases
## 35 35 4.3625009 10.191528 small purchases
## 36 36 3.8418327 4.325049 small purchases
## 37 37 11.3425371 6.389882 small purchases
## 38 38 17.1292595 10.997543 small purchases
## 39 39 16.1650339 7.524615 small purchases
## 40 40 -5.8934988 2.448673 small purchases
## 41 41 12.6387974 9.514229 small purchases
## 42 42 6.4500322 5.211945 small purchases
## 43 43 2.9682093 3.973336 small purchases
## 44 44 4.0760785 6.321008 small purchases
## 45 45 5.5235922 10.166002 small purchases
## 46 46 4.8320823 3.847532 small purchases
## 47 47 6.7230129 6.304504 small purchases
## 48 48 1.8421342 3.647446 small purchases
## 49 49 3.1992212 2.518993 small purchases
## 50 50 10.8161739 8.023910 small purchases
## 51 51 1026.4000266 1027.192723 large purchases
## 52 52 202.3892007 194.521504 large purchases
## 53 53 833.3530129 826.245477 large purchases
## 54 54 1075.4121192 1081.438809 large purchases
## 55 55 96.1198136 100.388039 large purchases
## 56 56 438.2357889 430.291572 large purchases
## 57 57 911.3332595 912.567621 large purchases
## 58 58 542.5643273 533.946165 large purchases
## 59 59 686.3323692 691.793193 large purchases
## 60 60 494.4655725 498.008558 large purchases
## 61 61 422.8093644 423.202790 large purchases
## 62 62 1033.8816216 1032.662338 large purchases
## 63 63 161.9940611 168.545765 large purchases
## 64 64 491.4299932 492.239389 large purchases
## 65 65 575.9199260 589.705853 large purchases
## 66 66 384.7702130 377.148966 large purchases
## 67 67 720.0291839 730.110724 large purchases
## 68 68 963.9350562 967.650520 large purchases
## 69 69 159.7865024 159.207010 large purchases
## 70 70 765.4002297 767.185617 large purchases
## 71 71 246.4217424 250.132558 large purchases
## 72 72 1097.9219860 1098.895762 large purchases
## 73 73 1050.4012681 1048.867067 large purchases
## 74 74 1069.6224565 1057.992036 large purchases
## 75 75 116.8100415 119.282157 large purchases
## 76 76 523.5162673 524.564769 large purchases
## 77 77 457.7430632 459.769560 large purchases
## 78 78 1060.5638650 1053.648585 large purchases
## 79 79 761.9174356 751.113173 large purchases
## 80 80 969.3237051 966.379570 large purchases
## 81 81 522.6803675 520.614066 large purchases
## 82 82 475.8733078 467.724627 large purchases
## 83 83 368.5448093 364.211669 large purchases
## 84 84 1101.6186402 1097.071859 large purchases
## 85 85 1052.9226814 1054.626093 large purchases
## 86 86 663.0351321 664.365996 large purchases
## 87 87 136.6886558 137.485828 large purchases
## 88 88 331.8120748 326.087523 large purchases
## 89 89 921.9615947 929.965382 large purchases
## 90 90 773.5812495 772.131253 large purchases
## 91 91 458.3325781 456.426629 large purchases
## 92 92 643.4683247 646.674567 large purchases
## 93 93 738.6477366 747.002823 large purchases
## 94 94 846.4213303 842.529168 large purchases
## 95 95 413.6645906 411.887935 large purchases
## 96 96 180.9942090 175.770888 large purchases
## 97 97 695.6228990 687.864916 large purchases
## 98 98 164.8980048 165.811990 large purchases
## 99 99 106.5062642 108.484750 large purchases
## 100 100 358.3557044 363.866527 large purchases
# Examine the data: generate the summaries for the groups large and small:
fdata %>%
group_by(label) %>% # group by small/large purchases
summarize(min = min(y), # min of y
mean = mean(y), # mean of y
max = max(y)) # max of y
## # A tibble: 2 x 4
## label min mean max
## * <fct> <dbl> <dbl> <dbl>
## 1 large purchases 96.1 606. 1102.
## 2 small purchases -5.89 6.48 18.6
Fill in the blanks to examine the data. Notice that large purchases tend to be about 100 times larger than small ones. Fill in the blanks to create error columns:
# Fill in the blanks to add error columns
fdata2 <- fdata %>%
group_by(label) %>% # group by label
mutate(residual = pred - y, # residual: prediction minus actual outcome
relerr = residual/y) # relative error
Define residual as pred - y. Define relative error as residual / y.
# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>%
group_by(label) %>%
summarize(rmse = sqrt(mean((pred-y)^2)), # RMSE
rmse.rel = sqrt(mean(((pred-y)/y)^2))) # Root mean squared relative error
## # A tibble: 2 x 3
## label rmse rmse.rel
## * <fct> <dbl> <dbl>
## 1 large purchases 5.54 0.0147
## 2 small purchases 4.01 1.25
Fill in the blanks to calculate and compare RMSE and relative RMSE. How do the absolute errors compare? The relative errors? Examine the plot of predictions versus outcome. In your opinion, which model does “better”?
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) +
geom_point() +
geom_abline() +
facet_wrap(~ label, ncol = 1, scales = "free") +
ggtitle("Outcome vs prediction")
In this exercise, you will practice modeling on log-transformed monetary output, and then transforming the “log-money” predictions back into monetary units. The data loaded into your workspace records subjects’ incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:
Arith, Word, Parag, Math, and AFQT (percentile on the Armed Forces Qualifying Test).
The data have already been split into training and test sets (income_train and income_test, respectively) and are in the workspace. You will build a model of log(income) from the inputs, and then convert log(income) back into income.
Call summary() on income_train$Income2005 to see the summary statistics of income in the training set.
income_train <- read.delim("DATABASE/income_train.txt", sep = ",")
head(income_train)
## X Subject Arith Word Parag Math AFQT Income2005
## 1 1 2 8 15 6 6 6.841 5500
## 2 2 6 30 35 15 23 99.393 65000
## 3 4 8 13 35 12 4 44.022 36000
## 4 5 9 21 28 10 13 59.683 65000
## 5 7 16 17 30 12 17 50.283 71000
## 6 8 17 29 33 13 21 89.669 43000
# Examine Income2005 in the training set
summary(income_train$Income2005)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63 23000 39000 49894 61500 703637
Write a formula to express log(Income2005) as a function of the five tests as the variable fmla.log. Print it.
# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005) ~ Arith + Word + Parag + Math + AFQT)
## log(Income2005) ~ Arith + Word + Parag + Math + AFQT
Fit a linear model of log(Income2005) to the income_train data: model.log.
# Fit the linear model
model.log <- lm(fmla.log, data = income_train)
Use model.log to predict income on the income_test dataset. Put it in the column logpred.
# This chunk is left unevaluated because income_test is not provided here:
# income_test <- read.delim("DATABASE/income_test.txt", sep = ",")
#
# # Make predictions on income_test
# income_test$logpred <- predict(model.log, newdata = income_test)
# summary(income_test$logpred)
Check summary() of logpred to see that the magnitudes are much different from those of Income2005. Reverse the log transformation to put the predictions into “monetary units”: exp(income_test$logpred). Check summary() of pred.income and see that the magnitudes are now similar to Income2005 magnitudes. Fill in the blanks to plot a scatter plot of predicted income vs income on the test set.
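A minimal sketch of those steps, left commented like the chunk above because income_test is not available here (pred.income is the column name the exercise asks for):
# # Reverse the log transformation to get predictions in monetary units
# income_test$pred.income <- exp(income_test$logpred)
# summary(income_test$pred.income)
#
# # Plot predicted income (x-axis) vs actual income on the test set
# ggplot(income_test, aes(x = pred.income, y = Income2005)) +
#   geom_point() +
#   geom_abline()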
Input transforms: the "hockey stick"
In this exercise, we will build a model to predict price from a measure of the house's size (surface area). The data set houseprice has the columns:
price: house price in units of $1000
size: surface area
A scatterplot of the data shows that the data is quite non-linear: a sort of "hockey stick" where price is fairly flat for smaller houses, but rises steeply as the house gets larger. Quadratics and cubics are often good functional forms to express hockey-stick-like relationships. Note that there may not be a "physical" reason that price is related to the square of the size; a quadratic is simply a closed-form approximation of the observed relationship.
You will fit a model to predict price as a function of the squared size, and look at its fit on the training data.
Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.
exampleFormula = y ~ I(x^2)
The data set houseprice is in the workspace.
houseprice <- read.delim("DATABASE/houseprice.txt", sep = ",")
head(houseprice)
## X size price
## 1 1 72 156
## 2 2 98 153
## 3 3 92 230
## 4 4 90 152
## 5 5 44 42
## 6 6 46 157
Write a formula, fmla_sqr, to express price as a function of squared size. Print it. Fit a model model_sqr to the data using fmla_sqr
# houseprice is in the workspace
summary(houseprice)
## X size price
## Min. : 1.00 Min. : 44.0 Min. : 42.0
## 1st Qu.:10.75 1st Qu.: 73.5 1st Qu.:164.5
## Median :20.50 Median : 91.0 Median :203.5
## Mean :20.50 Mean : 94.3 Mean :249.2
## 3rd Qu.:30.25 3rd Qu.:118.5 3rd Qu.:287.8
## Max. :40.00 Max. :150.0 Max. :573.0
houseprice
## X size price
## 1 1 72 156
## 2 2 98 153
## 3 3 92 230
## 4 4 90 152
## 5 5 44 42
## 6 6 46 157
## 7 7 90 113
## 8 8 150 573
## 9 9 94 202
## 10 10 90 261
## 11 11 90 175
## 12 12 66 212
## 13 13 142 486
## 14 14 74 109
## 15 15 86 220
## 16 16 46 186
## 17 17 54 133
## 18 18 130 360
## 19 19 122 283
## 20 20 118 380
## 21 21 100 185
## 22 22 74 186
## 23 23 146 459
## 24 24 92 167
## 25 25 100 171
## 26 26 140 547
## 27 27 94 170
## 28 28 90 286
## 29 29 120 293
## 30 30 70 109
## 31 31 100 205
## 32 32 132 514
## 33 33 58 175
## 34 34 92 249
## 35 35 76 234
## 36 36 90 242
## 37 37 66 177
## 38 38 134 399
## 39 39 140 511
## 40 40 64 107
# Create the formula for price as a function of squared size
(fmla_sqr <- price ~ I(size^2))
## price ~ I(size^2)
For comparison, fit a linear model model_lin to the data using the formula price ~ size. Fill in the blanks to make predictions on the training data from the two models.
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data = houseprice)
# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, data = houseprice)
Use gather() to collect the predictions into a single column pred, then graphically compare the predictions of the two models to the data. Which fits better?
# Make predictions and compare
houseprice %>%
mutate(pred_lin = predict(model_lin), # predictions from linear model
pred_sqr = predict(model_sqr)) %>% # predictions from quadratic model
gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
ggplot(aes(x = size)) +
geom_point(aes(y = price)) + # actual prices
geom_line(aes(y = pred, color = modeltype)) + # the predictions
scale_color_brewer(palette = "Dark2")
Fit a model of sparrow survival probability
In this exercise, you will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow. The dataset sparrow is loaded into your workspace. The outcome to be predicted is status ("Survived", "Perished"). The variables we will consider are:
total_length: length of the bird from tip of beak to tip of tail (mm)
weight: in grams
humerus: length of the humerus (the "upper arm bone" that connects the wing to the body) (inches)
Remember that when using glm() to create a logistic regression model, you must explicitly specify that family = binomial:
glm(formula, data = data, family = binomial)
You will call summary() and broom::glance() to see different functions for examining a logistic regression model. One of the diagnostics that you will look at is the analog to R-squared, called pseudo-R-squared.
You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. R-squared is a measure of the "variance explained" of a standard regression model; analogously, the pseudo-R-squared is a measure of the "deviance explained", computed as 1 - deviance/null.deviance.
The data frame sparrow and the package broom are loaded in the workspace.
sparrow <- read.delim("DATABASE/sparrow.txt", sep = ",")
head(sparrow)
## X status age total_length wingspan weight beak_head humerus femur legbone
## 1 1 Survived adult 154 241 24.5 31.2 0.69 0.67 1.02
## 2 2 Survived adult 160 252 26.9 30.8 0.74 0.71 1.18
## 3 3 Survived adult 155 243 26.9 30.6 0.73 0.70 1.15
## 4 4 Survived adult 154 245 24.3 31.7 0.74 0.69 1.15
## 5 5 Survived adult 156 247 24.1 31.5 0.71 0.71 1.13
## 6 6 Survived adult 161 253 26.5 31.8 0.78 0.74 1.14
## skull sternum
## 1 0.59 0.83
## 2 0.60 0.84
## 3 0.60 0.85
## 4 0.58 0.84
## 5 0.57 0.82
## 6 0.61 0.89
# sparrow is in the workspace
summary(sparrow)
## X status age total_length wingspan
## Min. : 1.0 Perished:36 adult :59 Min. :153.0 Min. :236.0
## 1st Qu.:22.5 Survived:51 juvenile:28 1st Qu.:158.0 1st Qu.:245.0
## Median :44.0 Median :160.0 Median :247.0
## Mean :44.0 Mean :160.4 Mean :247.5
## 3rd Qu.:65.5 3rd Qu.:162.5 3rd Qu.:251.0
## Max. :87.0 Max. :167.0 Max. :256.0
## weight beak_head humerus femur
## Min. :23.2 Min. :29.80 Min. :0.6600 Min. :0.6500
## 1st Qu.:24.7 1st Qu.:31.40 1st Qu.:0.7250 1st Qu.:0.7000
## Median :25.8 Median :31.70 Median :0.7400 Median :0.7100
## Mean :25.8 Mean :31.64 Mean :0.7353 Mean :0.7134
## 3rd Qu.:26.7 3rd Qu.:32.10 3rd Qu.:0.7500 3rd Qu.:0.7300
## Max. :31.0 Max. :33.00 Max. :0.7800 Max. :0.7600
## legbone skull sternum
## Min. :1.010 Min. :0.5600 Min. :0.7700
## 1st Qu.:1.110 1st Qu.:0.5900 1st Qu.:0.8300
## Median :1.130 Median :0.6000 Median :0.8500
## Mean :1.131 Mean :0.6032 Mean :0.8511
## 3rd Qu.:1.160 3rd Qu.:0.6100 3rd Qu.:0.8800
## Max. :1.230 Max. :0.6400 Max. :0.9300
# Create the survived column
sparrow$survived <- sparrow$status=="Survived"
As suggested in the video, you will predict on the outcomes TRUE and FALSE. Create a new column survived in the sparrow data frame that is TRUE when status == “Survived”. Create the formula fmla that expresses survived as a function of the variables of interest. Print it.
# Create the formula
(fmla <- survived ~ total_length + weight + humerus)
## survived ~ total_length + weight + humerus
Fit a logistic regression model to predict the probability of sparrow survival. Assign the model to the variable sparrow_model.
# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = "binomial")
# Call summary
summary(sparrow_model)
##
## Call:
## glm(formula = fmla, family = "binomial", data = sparrow)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1117 -0.6026 0.2871 0.6577 1.7082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 46.8813 16.9631 2.764 0.005715 **
## total_length -0.5435 0.1409 -3.858 0.000115 ***
## weight -0.5689 0.2771 -2.053 0.040060 *
## humerus 75.4610 19.1586 3.939 8.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 118.008 on 86 degrees of freedom
## Residual deviance: 75.094 on 83 degrees of freedom
## AIC: 83.094
##
## Number of Fisher Scoring iterations: 5
Call summary() to see the coefficients of the model, the deviance and the null deviance.
# Call glance
(perf <- glance(sparrow_model))
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 118. 86 -37.5 83.1 93.0 75.1 83 87
Call glance() on the model to see the deviances and other diagnostics in a data frame. Assign the output from glance() to the variable perf. Calculate the pseudo-R-squared.
# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
## [1] 0.3636526
In this exercise you will predict the probability of survival using the sparrow survival model from the previous exercise.
Recall that when calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:
predict(model, type = "response")
Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.
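As a quick sanity check (a small sketch using the sparrow_model fit above), the two prediction scales are related by plogis(), the inverse of the logit/log-odds transform:
# Compare the two prediction scales from the same glm model
link <- predict(sparrow_model)                     # predicted log-odds
prob <- predict(sparrow_model, type = "response")  # predicted probabilities
all.equal(prob, plogis(link))                      # plogis() inverts the log-odds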
You will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model’s gain curve is close to the ideal (“wizard”) gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:
frame: data frame with prediction column and ground truth column
xvar: the name of the column of predictions (as a string)
truthVar: the name of the column with the actual outcome (as a string)
title: a title for the plot (as a string)
GainCurvePlot(frame, xvar, truthVar, title)
The dataframe sparrow and the model sparrow_model are in the workspace.
# sparrow is in the workspace
summary(sparrow)
## X status age total_length wingspan
## Min. : 1.0 Perished:36 adult :59 Min. :153.0 Min. :236.0
## 1st Qu.:22.5 Survived:51 juvenile:28 1st Qu.:158.0 1st Qu.:245.0
## Median :44.0 Median :160.0 Median :247.0
## Mean :44.0 Mean :160.4 Mean :247.5
## 3rd Qu.:65.5 3rd Qu.:162.5 3rd Qu.:251.0
## Max. :87.0 Max. :167.0 Max. :256.0
## weight beak_head humerus femur
## Min. :23.2 Min. :29.80 Min. :0.6600 Min. :0.6500
## 1st Qu.:24.7 1st Qu.:31.40 1st Qu.:0.7250 1st Qu.:0.7000
## Median :25.8 Median :31.70 Median :0.7400 Median :0.7100
## Mean :25.8 Mean :31.64 Mean :0.7353 Mean :0.7134
## 3rd Qu.:26.7 3rd Qu.:32.10 3rd Qu.:0.7500 3rd Qu.:0.7300
## Max. :31.0 Max. :33.00 Max. :0.7800 Max. :0.7600
## legbone skull sternum survived
## Min. :1.010 Min. :0.5600 Min. :0.7700 Mode :logical
## 1st Qu.:1.110 1st Qu.:0.5900 1st Qu.:0.8300 FALSE:36
## Median :1.130 Median :0.6000 Median :0.8500 TRUE :51
## Mean :1.131 Mean :0.6032 Mean :0.8511
## 3rd Qu.:1.160 3rd Qu.:0.6100 3rd Qu.:0.8800
## Max. :1.230 Max. :0.6400 Max. :0.9300
# sparrow_model is in the workspace
summary(sparrow_model)
##
## Call:
## glm(formula = fmla, family = "binomial", data = sparrow)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1117 -0.6026 0.2871 0.6577 1.7082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 46.8813 16.9631 2.764 0.005715 **
## total_length -0.5435 0.1409 -3.858 0.000115 ***
## weight -0.5689 0.2771 -2.053 0.040060 *
## humerus 75.4610 19.1586 3.939 8.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 118.008 on 86 degrees of freedom
## Residual deviance: 75.094 on 83 degrees of freedom
## AIC: 83.094
##
## Number of Fisher Scoring iterations: 5
Create a new column in sparrow called pred that contains the predictions on the training data. Call GainCurvePlot() to create the gain curve of predictions. Does the model do a good job of sorting the sparrows by whether or not they actually survived?
# Make predictions
sparrow$pred <- predict(sparrow_model, type = "response")
# Look at the gain curve (GainCurvePlot() comes from the WVPlots package)
library(WVPlots)
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")
In this exercise you will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.
The data frame has the columns:
cnt: the number of bikes rented in that hour (the outcome)
hr: the hour of the day (0-23, as a factor)
holiday: TRUE/FALSE
workingday: TRUE if neither a holiday nor a weekend, else FALSE
weathersit: categorical, "Clear to partly cloudy"/"Light Precipitation"/"Misty"
temp: normalized temperature in Celsius
atemp: normalized "feeling" temperature in Celsius
hum: normalized humidity
windspeed: normalized windspeed
instant: the time index, the number of hours since the beginning of the data set (not a variable)
mnth and yr: month and year indices (not variables)
Remember that you must specify family = poisson or family = quasipoisson when using glm() to fit a count model.
Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in variables, and use paste() to assemble a string representing the model formula.
The data frame bikesJuly is in the workspace. The names of the outcome variable and the input variables are also in the workspace as the variables outcome and vars respectively.
bikesJuly <- read.delim("DATABASE/bikesJuly.txt", sep = ",") %>% select(., c(-"X"))
head(bikesJuly)
## hr holiday workingday weathersit temp atemp hum windspeed cnt
## 1 0 FALSE FALSE Clear to partly cloudy 0.76 0.7273 0.66 0.0000 149
## 2 1 FALSE FALSE Clear to partly cloudy 0.74 0.6970 0.70 0.1343 93
## 3 2 FALSE FALSE Clear to partly cloudy 0.72 0.6970 0.74 0.0896 90
## 4 3 FALSE FALSE Clear to partly cloudy 0.72 0.7121 0.84 0.1343 33
## 5 4 FALSE FALSE Clear to partly cloudy 0.70 0.6667 0.79 0.1940 4
## 6 5 FALSE FALSE Clear to partly cloudy 0.68 0.6364 0.79 0.1045 10
## instant mnth yr
## 1 13004 7 1
## 2 13005 7 1
## 3 13006 7 1
## 4 13007 7 1
## 5 13008 7 1
## 6 13009 7 1
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: Factor w/ 3 levels "Clear to partly cloudy",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.
vars <- read.delim("DATABASE/vars.txt", sep = ",")
head(vars)
## X x
## 1 1 hr
## 2 2 holiday
## 3 3 workingday
## 4 4 weathersit
## 5 5 temp
## 6 6 atemp
outcome <- read.delim("DATABASE/outcome.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/outcome.txt'
# The outcome column
outcome
## X x
## 1 1 cnt
# The inputs to use
vars
## X x
## 1 1 hr
## 2 2 holiday
## 3 3 workingday
## 4 4 weathersit
## 5 5 temp
## 6 6 atemp
## 7 7 hum
## 8 8 windspeed
Calculate the mean (mean()) and variance (var()) of bikesJuly$cnt.
# Create the formula string for bikes rented as a function of the inputs
# (outcome and vars were read in above as data frames, so use their x columns)
(fmla <- paste(outcome$x, "~", paste(vars$x, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
Should you use poisson or quasipoisson regression?
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
## [1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
## [1] 45863.84
Since the variance of cnt (about 45,864) is far larger than its mean (about 274), the counts are overdispersed, so quasipoisson is the appropriate family. Use glm() to fit a model to the bikesJuly data: bike_model.
# # Fit the model
# bike_model <- glm(fmla, data=bikesJuly, family=quasipoisson)
Use glance() to look at the model’s fit statistics. Assign the output of glance() to the variable perf.
# # Call glance
# (perf <- glance(bike_model))
Calculate the pseudo-R-squared of the model.
# # Calculate pseudo-R-squared
# (pseudoR2 <- 1 - perf$deviance/perf$null.deviance)
In this exercise you will use the model you built in the previous exercise to make predictions for the month of August. The data set bikesAugust has the same columns as bikesJuly.
Recall that you must specify type = “response” with predict() when predicting counts from a glm poisson or quasipoisson model.
bikesAugust <- read.delim("DATABASE/bikesAugust.txt", sep = ",")
head(bikesAugust)
## X hr holiday workingday weathersit temp atemp hum windspeed cnt
## 1 1 0 FALSE TRUE Clear to partly cloudy 0.68 0.6364 0.79 0.1642 47
## 2 2 1 FALSE TRUE Clear to partly cloudy 0.66 0.6061 0.83 0.0896 33
## 3 3 2 FALSE TRUE Clear to partly cloudy 0.64 0.5758 0.83 0.1045 13
## 4 4 3 FALSE TRUE Clear to partly cloudy 0.64 0.5758 0.83 0.1045 7
## 5 5 4 FALSE TRUE Misty 0.64 0.5909 0.78 0.1343 4
## 6 6 5 FALSE TRUE Misty 0.64 0.5909 0.78 0.1343 49
## instant mnth yr
## 1 13748 8 1
## 2 13749 8 1
## 3 13750 8 1
## 4 13751 8 1
## 5 13752 8 1
## 6 13753 8 1
The model bike_model and the data frame bikesAugust are in the workspace.
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame': 744 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: Factor w/ 3 levels "Clear to partly cloudy",..: 1 1 1 1 3 3 1 3 3 3 ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
Use predict to predict the number of bikes per hour on the bikesAugust data. Assign the predictions to the column bikesAugust$pred.
# Make predictions on August data
# bikesAugust$pred <- predict(bike_model, type = "response", newdata = bikesAugust)
# head(bikesAugust)
Fill in the blanks to get the RMSE of the predictions on the August data.
# # Calculate the RMSE
# bikesAugust %>%
# mutate(residual = pred - cnt) %>%
# summarize(rmse = sqrt(mean(residual^2)))
Fill in the blanks to generate the plot of predictions to actual counts. Do any of the predictions appear negative?
# # Plot predictions vs cnt (pred on x-axis)
# ggplot(bikesAugust, aes(x = pred, y = cnt)) +
# geom_point() +
# geom_abline(color = "darkblue")
In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting). As you will see, the soybean plant doesn’t grow at a steady rate, but rather has a “growth spurt” that eventually tapers off. Hence, leaf weight is not well described by a linear model.
Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:
y ~ s(x)
Also remember that gam() from the package mgcv has the calling interface
gam(formula, family, data)
For standard regression, use family = gaussian (the default).
The soybean training data, soybean_train is loaded into your workspace. It has two columns: the outcome weight and the variable Time. For comparison, the linear model model.lin, which was fit using the formula weight ~ Time has already been loaded into the workspace as well.
Fill in the blanks to plot weight versus Time (Time on x-axis). Does the relationship look linear?
soybean_train <- read.delim("DATABASE/soybean_train.txt", sep = ",")
# soybean_train is in the workspace
summary(soybean_train)
## X Plot Variety Year Time
## Min. : 1.0 1988F6 : 10 F:161 Min. :1988 Min. :14.00
## 1st Qu.:101.2 1988F7 : 9 P:169 1st Qu.:1988 1st Qu.:27.00
## Median :207.5 1988P1 : 9 Median :1989 Median :42.00
## Mean :207.3 1988P2 : 9 Mean :1989 Mean :43.56
## 3rd Qu.:307.8 1988P8 : 9 3rd Qu.:1990 3rd Qu.:56.00
## Max. :412.0 1988F3 : 8 Max. :1990 Max. :84.00
## (Other):276
## weight
## Min. : 0.0290
## 1st Qu.: 0.6663
## Median : 3.5233
## Mean : 6.1645
## 3rd Qu.:10.3808
## Max. :27.3700
##
head(soybean_train)
## X Plot Variety Year Time weight
## 1 1 1988F1 F 1988 14 0.106
## 2 2 1988F1 F 1988 21 0.261
## 3 3 1988F1 F 1988 28 0.666
## 4 4 1988F1 F 1988 35 2.110
## 5 6 1988F1 F 1988 49 6.230
## 6 8 1988F1 F 1988 63 13.350
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) +
geom_point()
Load the package mgcv. Create the formula fmla.gam to express weight as a non-linear function of Time. Print it. Fit a generalized additive model on soybean_train using fmla.gam.
# Load the package mgcv
library(mgcv)
# Create the formula
(fmla.gam <- weight ~ s(Time))
## weight ~ s(Time)
# Fit the GAM Model
model.gam <- gam(fmla.gam, family = gaussian, data = soybean_train)
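A typical next step would be to predict on held-out data; a minimal sketch, assuming a soybean_test data frame with the same columns (not loaded here, so the chunk is left commented):
# # Predict weight on the (hypothetical) test set; predict() on a gam object
# # returns a matrix-like object, so convert it to a plain vector
# soybean_test$pred <- as.numeric(predict(model.gam, newdata = soybean_test))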
In this chapter we will look at modeling algorithms that do not assume linearity or additivity, and that can learn limited types of interactions among input variables. These algorithms are tree-based methods that work by combining ensembles of decision trees that are learned from the training data.
In this exercise you will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.
You will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:
formula
data
num.trees: the number of trees in the forest
respect.unordered.factors: specifies how to treat unordered factor variables; we recommend setting this to "order" for regression
seed: because this is a random algorithm, you will set the seed to get reproducible results
Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.
The data frame bikesJuly is in the workspace. The sample code specifies the names of the outcome and input variables.
Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: Factor w/ 3 levels "Clear to partly cloudy",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
seed <- read.delim("DATABASE/seed.txt", sep = ",")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'DATABASE/seed.txt'
# Random seed to reproduce results
seed
## X x
## 1 1 423563
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr" "holiday" "workingday" "weathersit" "temp"
## [6] "atemp" "hum" "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
Load the package ranger.
Use ranger to fit a model to the bikesJuly data: bike_model_rf. The first argument to ranger() is the formula, fmla. Use 500 trees and respect.unordered.factors = “order”. Set the seed to seed for reproducible results. Print the model. What is the R-squared?
# # Load the package ranger
# library(ranger)
#
# # Fit and print the random forest model
# (bike_model_rf <- ranger(fmla, # formula
# bikesJuly, # data
# num.trees = 500,
# respect.unordered.factors = "order",
# seed = seed$x)) # seed was read in as a data frame above, so use its x column
In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.
The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:
predict(model, data)$predictions
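As a self-contained illustration of this interface (a small sketch using the built-in mtcars data rather than the bike data):
library(ranger)
# Fit a small random forest and extract the predictions vector from the list
rf_demo <- ranger(mpg ~ wt + hp, data = mtcars, num.trees = 100, seed = 1)
head(predict(rf_demo, mtcars)$predictions)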
The model bike_model_rf and the dataset bikesAugust (for evaluation) are loaded into your workspace.
Call predict() on bikesAugust to predict the number of bikes rented in August (cnt). Add the predictions to bikesAugust as the column pred. Fill in the blanks to calculate the root mean squared error of the predictions. The poisson model you built for this data gave an RMSE of about 112.6. How does this model compare? Fill in the blanks to plot actual bike rental counts (cnt) versus the predictions (pred on x-axis)
# # bikesAugust is in the workspace
# str(bikesAugust)
#
# # bike_model_rf is in the workspace
# bike_model_rf
#
# # Make predictions on the August data
# bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions
#
# # Calculate the RMSE of the predictions
# bikesAugust %>%
# mutate(residual = cnt - pred) %>% # calculate the residual
# summarize(rmse = sqrt(mean(residual^2))) # calculate rmse
#
# # Plot actual outcome vs predictions (predictions on x-axis)
# ggplot(bikesAugust, aes(x = pred, y = cnt)) +
# geom_point() +
# geom_abline()
In the previous exercise, you saw that the random forest bike model did better on the August data than the quasipoisson model, in terms of RMSE.
In this exercise you will visualize the random forest model’s August predictions as a function of time. The corresponding plot from the quasipoisson model that you built in a previous exercise is in the workspace for you to compare.
Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.
The data frame bikesAugust (with predictions) is in the workspace. The plot quasipoisson_plot of quasipoisson model predictions as a function of time is shown.
Fill in the blanks to plot the predictions and actual counts by hour for the first 14 days of August. Use gather() to combine the cnt and pred columns into a column called value, with a key called valuetype. Plot value as a function of instant (day). How does the random forest model compare?
# first_two_weeks <- bikesAugust %>%
# # Set start to 0, convert unit to days
# mutate(instant = (instant - min(instant)) / 24) %>%
# # Gather cnt and pred into a column named value with key valuetype
# gather(key = valuetype, value = value, cnt, pred) %>%
# # Filter for rows in the first two weeks
# filter(instant < 14)
# head(first_two_weeks)
# # Plot predictions and cnt by date/time
# ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
# geom_point() +
# geom_line() +
# scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
# scale_color_brewer(palette = "Dark2") +
# ggtitle("Predicted August bike rentals, Random Forest plot")
In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded “lev”), and to clean bad values out of numerical variables (coded “clean”).
To design a treatment plan use the function designTreatmentsZ()
treatplan <- designTreatmentsZ(data, varlist)
data: the original training data frame
varlist: a vector of input variables to be treated (as strings)
designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:
scoreFrame <- treatplan %>%
  magrittr::use_series(scoreFrame) %>%
  select(varName, origName, code)
varName: the name of the new treated variable
origName: the name of the original variable that the treated variable comes from
code: the type of the new variable; "clean" is a numerical variable with no NAs or NaNs, and "lev" is an indicator variable for a specific level of the original categorical variable
(magrittr::use_series() is an alias for $ that you can use in pipes.)
For these exercises, we want varName where code is either "clean" or "lev":
newvarlist <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  magrittr::use_series(varName)
To transform the data set into all numerical and one-hot-encoded variables, use prepare():
dframe.treat <- prepare(treatplan, dframe, varRestriction = newvarlist)
The data frame dframe and the package magrittr are loaded in the workspace.
Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted. Create a vector called vars with the names of the input variables (as strings). Load the package vtreat. Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan. Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables. You only need the columns varName, origName and code. What are the names of the new indicator variables? Of the continuous variable? Create a vector newvars that contains the variable varName where code is either clean or lev. Print it. Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column). Print it and compare to dframe.
# # dframe is in the workspace
# dframe
#
# # Create a vector of variable names
# (vars <- c("color", "size"))
#
# # Load the package vtreat
# library(vtreat)
#
# # Create the treatment plan
# treatplan <- designTreatmentsZ(dframe, vars)
#
# # Examine the scoreFrame
# (scoreFrame <- treatplan %>%
# use_series(scoreFrame) %>%
# select(varName, origName, code))
#
# # We only want the rows with codes "clean" or "lev"
# (newvars <- scoreFrame %>%
# filter(code %in% c("clean", "lev")) %>%
# use_series(varName))
#
# # Create the treated training data
# (dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
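Because DataCamp's dframe is not shipped with this document, here is a self-contained toy run of the same steps; the data values are made up purely for illustration:
# Hypothetical stand-in for the DataCamp dframe
library(vtreat)
library(magrittr)
library(dplyr)
dframe <- data.frame(color = c("b", "r", "r", "g", "b", "r"),
                     size = c(13, 11, 15, 14, 13, 9),
                     popularity = c(1.08, 1.40, 0.92, 1.20, 0.80, 1.10))
vars <- c("color", "size")
# Design the treatment plan and inspect the variable mapping
treatplan <- designTreatmentsZ(dframe, vars)
(scoreFrame <- treatplan %>%
  use_series(scoreFrame) %>%
  select(varName, origName, code))
# Keep the "clean" and "lev" variables, then one-hot-encode the data
(newvars <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName))
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))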
Print dframe and testframe. Are there colors in testframe that didn’t appear in dframe? Call prepare() to create a one-hot-encoded version of testframe (without the outcome). Call it testframe.treat and print it. Use the varRestriction argument to restrict to only the variables in newvars. How are the yellow rows encoded?
# # treatplan is in the workspace
# summary(treatplan)
#
# # newvars is in the workspace
# newvars
#
# # Print dframe and testframe
# dframe
# testframe
#
# # Use prepare() to one-hot-encode testframe
# (testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
We have created some two-dimensional data and stored it in a variable called x in your workspace. The scatter plot on the right is a visual representation of the data.
In this exercise, your task is to create a k-means model of the x data using 3 clusters, then to look at the structure of the resulting model using the summary() function.
Fit a k-means model to x using 3 centers and run the k-means algorithm 20 times. Store the result in km.out. Inspect the result with the summary() function.
x <- read.delim("DATABASE/x.txt", sep = ",")
head(x)
## X V1 V2
## 1 1 3.370958 1.995379
## 2 2 1.435302 2.760242
## 3 3 2.363128 2.038991
## 4 4 2.632863 2.735072
## 5 5 2.404268 1.853527
## 6 6 1.893875 1.942113
# Create the k-means model: km.out
# (note: x still contains the row-index column X read from the file, which
# also enters the clustering; dropping it with x[, -1] would be cleaner)
km.out <- kmeans(x, 3, nstart = 20)
# Inspect the result
summary(km.out)
## Length Class Mode
## cluster 300 -none- numeric
## centers 9 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
The kmeans() function produces several outputs. In the video, we discussed one output of modeling, the cluster membership.
In this exercise, you will access the cluster component directly. This is useful anytime you need the cluster membership for each observation of the data used to build the clustering model. A future exercise will show an example of how this cluster membership might be used to help communicate the results of k-means modeling.
k-means models also have a print method that gives a human-friendly summary of basic modeling results. This is available by using print() or simply typing the name of the model.
The k-means model you built in the last exercise, km.out, is still available in your workspace.
Print a list of the cluster membership to the console. Use a print method to print out the km.out model.
# Print the cluster membership component of the model
km.out$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1
# Print the km.out object
km.out
## K-means clustering with 3 clusters of sizes 100, 100, 100
##
## Cluster means:
## X V1 V2
## 1 250.5 -2.010368 1.002364
## 2 50.5 2.032515 2.032936
## 3 150.5 -5.087484 1.882169
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 84556.96 83508.36 83508.93
## (between_SS / total_SS = 88.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
One of the more intuitive ways to interpret the results of k-means models is by plotting the data as a scatter plot and using color to label the samples’ cluster membership. In this exercise, you will use the standard plot() function to accomplish this.
To create a scatter plot, you can pass data with two features (i.e. columns) to plot() with an extra argument col = km.out$cluster, which sets the color of each point in the scatter plot according to its cluster membership.
x and km.out are available in your workspace. Using the plot() function to create a scatter plot of data x:
Color the dots on the scatterplot by setting the col argument to the cluster component in km.out. Title the plot “k-means with 3 clusters” using the main argument to plot(). Ensure there are no axis labels by specifying "" for both the xlab and ylab arguments to plot().
# Scatter plot of x
plot(x,
col = km.out$cluster,
main = "k-means with 3 clusters")
In the video, you saw how kmeans() randomly initializes the centers of clusters. This random initialization can result in assigning observations to different cluster labels. Also, the random initialization can result in finding different local minima for the k-means algorithm. This exercise will demonstrate both results.
At the top of each plot, the measure of model quality (the total within cluster sum of squares error) will be plotted. Look for the model(s) with the lowest error to find the models with the best results.
Because kmeans() initializes observations to random clusters, it is important to set the random number generator seed for reproducibility.
The data, x, is still available in your workspace. Your task is to generate six kmeans() models on the data, plotting the results of each, in order to see the impact of random initializations on model results.
Set the random number seed to 1 with set.seed().
# Set up 2 x 3 plotting grid
par(mfrow = c(2, 3))
# Set seed
set.seed(1)
For each iteration of the for loop, run kmeans() on x. Assume the number of clusters is 3 and number of starts (nstart) is 1. Visualize the cluster memberships using the col argument to plot(). Observe how the measure of quality and cluster assignments vary among the six model runs.
for(i in 1:6) {
# Run kmeans() on x with three clusters and one start
km.out <- kmeans(x, 3, nstart = 1)
# Plot clusters
plot(x, col = km.out$cluster,
main = km.out$tot.withinss)
}
The k-means algorithm assumes the number of clusters as part of the input. If you know the number of clusters in advance (e.g. due to certain business constraints) this makes setting the number of clusters easy. However, as you saw in the video, if you do not know the number of clusters and need to determine it, you will need to run the algorithm multiple times, each time with a different number of clusters. From this, you can observe how a measure of model quality changes with the number of clusters.
In this exercise, you will run kmeans() multiple times to see how model quality changes as the number of clusters changes. Plots displaying this information help to determine the number of clusters and are often referred to as scree plots.
The ideal plot will have an elbow where the quality measure improves more slowly as the number of clusters increases. This indicates that the quality of the model is no longer improving substantially as the model complexity (i.e. number of clusters) increases. In other words, the elbow indicates the number of clusters inherent in the data.
The data, x, is still available in your workspace.
Build 15 kmeans() models on x, each with a different number of clusters (ranging from 1 to 15). Set nstart = 20 for all model runs and save the total within cluster sum of squares for each model to the ith element of wss.
# Initialize total within sum of squares error: wss
wss <- 0
# For 1 to 15 cluster centers
for (i in 1:15) {
km.out <- kmeans(x, centers = i, nstart = 20)
# Save total within sum of squares to wss variable
wss[i] <- km.out$tot.withinss
}
Run the code provided to create a scree plot of the wss for all 15 models.
# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
Take a look at your scree plot. How many clusters are inherent in the data? Set k equal to the number of clusters at the location of the elbow.
# Set k equal to the number of clusters corresponding to the elbow location
k <- 2
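Having chosen k, you can refit and visualize the final clustering (a short sketch reusing x and k from above):
# Refit k-means with the chosen k and color points by cluster
km.out <- kmeans(x, centers = k, nstart = 20)
plot(x, col = km.out$cluster,
     main = paste("k-means with", k, "clusters"))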
Dealing with real data is often more challenging than dealing with synthetic data. Synthetic data helps with learning new concepts and techniques, but the next few exercises will deal with data that is closer to the type of real data you might find in your professional or academic pursuits.
The first challenge with the Pokemon data is that there is no pre-determined number of clusters. You will determine the appropriate number of clusters, keeping in mind that in real data the elbow in the scree plot might be less sharp than in synthetic data. Use your judgement to determine the number of clusters.
The second part of this exercise includes plotting the outcomes of the clustering on two dimensions, or features, of the data. These features were chosen somewhat arbitrarily for this exercise. Think about how you would use plotting and clustering to communicate interesting groups of Pokemon to other people.
An additional note: this exercise utilizes the iter.max argument to kmeans(). As you’ve seen, kmeans() is an iterative algorithm, repeating over and over until some stopping criterion is reached. The default number of iterations for kmeans() is 10, which is not enough for the algorithm to converge and reach its stopping criterion, so we’ll set the number of iterations to 50 to overcome this issue. To see what happens when kmeans() does not converge, try running the example with a lower number of iterations (e.g. 3). This is another example of what might happen when you encounter real data and use real cases.
The pokemon dataset, which contains observations of 800 Pokemon characters on 6 dimensions (i.e. features), is available in your workspace.
Using kmeans() with nstart = 20, determine the total within sum of square errors for different numbers of clusters (between 1 and 15).
pokemon <- read.csv("DATABASE/Pokemon.csv")
head(pokemon)
## Number Name Type1 Type2 Total HitPoints Attack Defense
## 1 1 Bulbasaur Grass Poison 318 45 49 49
## 2 2 Ivysaur Grass Poison 405 60 62 63
## 3 3 Venusaur Grass Poison 525 80 82 83
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123
## 5 4 Charmander Fire 309 39 52 43
## 6 5 Charmeleon Fire 405 58 64 58
## SpecialAttack SpecialDefense Speed Generation Legendary
## 1 65 65 45 1 False
## 2 80 80 60 1 False
## 3 100 100 80 1 False
## 4 122 120 80 1 False
## 5 60 50 65 1 False
## 6 80 65 80 1 False
pokemon <- pokemon[,c(6:11)]
# Initialize total within sum of squares error: wss
wss <- 0
pokemon
## HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1 45 49 49 65 65 45
## 2 60 62 63 80 80 60
## 3 80 82 83 100 100 80
## 4 80 100 123 122 120 80
## 5 39 52 43 60 50 65
## 6 58 64 58 80 65 80
## 7 78 84 78 109 85 100
## 8 78 130 111 130 85 100
## 9 78 104 78 159 115 100
## 10 44 48 65 50 64 43
## 11 59 63 80 65 80 58
## 12 79 83 100 85 105 78
## 13 79 103 120 135 115 78
## 14 45 30 35 20 20 45
## 15 50 20 55 25 25 30
## 16 60 45 50 90 80 70
## 17 40 35 30 20 20 50
## 18 45 25 50 25 25 35
## 19 65 90 40 45 80 75
## 20 65 150 40 15 80 145
## 21 40 45 40 35 35 56
## 22 63 60 55 50 50 71
## 23 83 80 75 70 70 101
## 24 83 80 80 135 80 121
## 25 30 56 35 25 35 72
## 26 55 81 60 50 70 97
## 27 40 60 30 31 31 70
## 28 65 90 65 61 61 100
## 29 35 60 44 40 54 55
## 30 60 85 69 65 79 80
## 31 35 55 40 50 50 90
## 32 60 90 55 90 80 110
## 33 50 75 85 20 30 40
## 34 75 100 110 45 55 65
## 35 55 47 52 40 40 41
## 36 70 62 67 55 55 56
## 37 90 92 87 75 85 76
## 38 46 57 40 40 40 50
## 39 61 72 57 55 55 65
## 40 81 102 77 85 75 85
## 41 70 45 48 60 65 35
## 42 95 70 73 95 90 60
## 43 38 41 40 50 65 65
## 44 73 76 75 81 100 100
## 45 115 45 20 45 25 20
## 46 140 70 45 85 50 45
## 47 40 45 35 30 40 55
## 48 75 80 70 65 75 90
## 49 45 50 55 75 65 30
## 50 60 65 70 85 75 40
## 51 75 80 85 110 90 50
## 52 35 70 55 45 55 25
## 53 60 95 80 60 80 30
## 54 60 55 50 40 55 45
## 55 70 65 60 90 75 90
## 56 10 55 25 35 45 95
## 57 35 80 50 50 70 120
## 58 40 45 35 40 40 90
## 59 65 70 60 65 65 115
## 60 50 52 48 65 50 55
## 61 80 82 78 95 80 85
## 62 40 80 35 35 45 70
## 63 65 105 60 60 70 95
## 64 55 70 45 70 50 60
## 65 90 110 80 100 80 95
## 66 40 50 40 40 40 90
## 67 65 65 65 50 50 90
## 68 90 95 95 70 90 70
## 69 25 20 15 105 55 90
## 70 40 35 30 120 70 105
## 71 55 50 45 135 95 120
## 72 55 50 65 175 95 150
## 73 70 80 50 35 35 35
## 74 80 100 70 50 60 45
## 75 90 130 80 65 85 55
## 76 50 75 35 70 30 40
## 77 65 90 50 85 45 55
## 78 80 105 65 100 70 70
## 79 40 40 35 50 100 70
## 80 80 70 65 80 120 100
## 81 40 80 100 30 30 20
## 82 55 95 115 45 45 35
## 83 80 120 130 55 65 45
## 84 50 85 55 65 65 90
## 85 65 100 70 80 80 105
## 86 90 65 65 40 40 15
## 87 95 75 110 100 80 30
## 88 95 75 180 130 80 30
## 89 25 35 70 95 55 45
## 90 50 60 95 120 70 70
## 91 52 65 55 58 62 60
## 92 35 85 45 35 35 75
## 93 60 110 70 60 60 100
## 94 65 45 55 45 70 45
## 95 90 70 80 70 95 70
## 96 80 80 50 40 50 25
## 97 105 105 75 65 100 50
## 98 30 65 100 45 25 40
## 99 50 95 180 85 45 70
## 100 30 35 30 100 35 80
## 101 45 50 45 115 55 95
## 102 60 65 60 130 75 110
## 103 60 65 80 170 95 130
## 104 35 45 160 30 45 70
## 105 60 48 45 43 90 42
## 106 85 73 70 73 115 67
## 107 30 105 90 25 25 50
## 108 55 130 115 50 50 75
## 109 40 30 50 55 55 100
## 110 60 50 70 80 80 140
## 111 60 40 80 60 45 40
## 112 95 95 85 125 65 55
## 113 50 50 95 40 50 35
## 114 60 80 110 50 80 45
## 115 50 120 53 35 110 87
## 116 50 105 79 35 110 76
## 117 90 55 75 60 75 30
## 118 40 65 95 60 45 35
## 119 65 90 120 85 70 60
## 120 80 85 95 30 30 25
## 121 105 130 120 45 45 40
## 122 250 5 5 35 105 50
## 123 65 55 115 100 40 60
## 124 105 95 80 40 80 90
## 125 105 125 100 60 100 100
## 126 30 40 70 70 25 60
## 127 55 65 95 95 45 85
## 128 45 67 60 35 50 63
## 129 80 92 65 65 80 68
## 130 30 45 55 70 55 85
## 131 60 75 85 100 85 115
## 132 40 45 65 100 120 90
## 133 70 110 80 55 80 105
## 134 65 50 35 115 95 95
## 135 65 83 57 95 85 105
## 136 65 95 57 100 85 93
## 137 65 125 100 55 70 85
## 138 65 155 120 65 90 105
## 139 75 100 95 40 70 110
## 140 20 10 55 15 20 80
## 141 95 125 79 60 100 81
## 142 95 155 109 70 130 81
## 143 130 85 80 85 95 60
## 144 48 48 48 48 48 48
## 145 55 55 50 45 65 55
## 146 130 65 60 110 95 65
## 147 65 65 60 110 95 130
## 148 65 130 60 95 110 65
## 149 65 60 70 85 75 40
## 150 35 40 100 90 55 35
## 151 70 60 125 115 70 55
## 152 30 80 90 55 45 55
## 153 60 115 105 65 70 80
## 154 80 105 65 60 75 130
## 155 80 135 85 70 95 150
## 156 160 110 65 65 110 30
## 157 90 85 100 95 125 85
## 158 90 90 85 125 90 100
## 159 90 100 90 125 85 90
## 160 41 64 45 50 50 50
## 161 61 84 65 70 70 70
## 162 91 134 95 100 100 80
## 163 106 110 90 154 90 130
## 164 106 190 100 154 100 130
## 165 106 150 70 194 120 140
## 166 100 100 100 100 100 100
## 167 45 49 65 49 65 45
## 168 60 62 80 63 80 60
## 169 80 82 100 83 100 80
## 170 39 52 43 60 50 65
## 171 58 64 58 80 65 80
## 172 78 84 78 109 85 100
## 173 50 65 64 44 48 43
## 174 65 80 80 59 63 58
## 175 85 105 100 79 83 78
## 176 35 46 34 35 45 20
## 177 85 76 64 45 55 90
## 178 60 30 30 36 56 50
## 179 100 50 50 76 96 70
## 180 40 20 30 40 80 55
## 181 55 35 50 55 110 85
## 182 40 60 40 40 40 30
## 183 70 90 70 60 60 40
## 184 85 90 80 70 80 130
## 185 75 38 38 56 56 67
## 186 125 58 58 76 76 67
## 187 20 40 15 35 35 60
## 188 50 25 28 45 55 15
## 189 90 30 15 40 20 15
## 190 35 20 65 40 65 20
## 191 55 40 85 80 105 40
## 192 40 50 45 70 45 70
## 193 65 75 70 95 70 95
## 194 55 40 40 65 45 35
## 195 70 55 55 80 60 45
## 196 90 75 85 115 90 55
## 197 90 95 105 165 110 45
## 198 75 80 95 90 100 50
## 199 70 20 50 20 50 40
## 200 100 50 80 60 80 50
## 201 70 100 115 30 65 30
## 202 90 75 75 90 100 70
## 203 35 35 40 35 55 50
## 204 55 45 50 45 65 80
## 205 75 55 70 55 95 110
## 206 55 70 55 40 55 85
## 207 30 30 30 30 30 30
## 208 75 75 55 105 85 30
## 209 65 65 45 75 45 95
## 210 55 45 45 25 25 15
## 211 95 85 85 65 65 35
## 212 65 65 60 130 95 110
## 213 95 65 110 60 130 65
## 214 60 85 42 85 42 91
## 215 95 75 80 100 110 30
## 216 60 60 60 85 85 85
## 217 48 72 48 72 48 48
## 218 190 33 58 33 58 33
## 219 70 80 65 90 65 85
## 220 50 65 90 35 35 15
## 221 75 90 140 60 60 40
## 222 100 70 70 65 65 45
## 223 65 75 105 35 65 85
## 224 75 85 200 55 65 30
## 225 75 125 230 55 95 30
## 226 60 80 50 40 40 30
## 227 90 120 75 60 60 45
## 228 65 95 75 55 55 85
## 229 70 130 100 55 80 65
## 230 70 150 140 65 100 75
## 231 20 10 230 10 230 5
## 232 80 125 75 40 95 85
## 233 80 185 115 40 105 75
## 234 55 95 55 35 75 115
## 235 60 80 50 50 50 40
## 236 90 130 75 75 75 55
## 237 40 40 40 70 40 20
## 238 50 50 120 80 80 30
## 239 50 50 40 30 30 50
## 240 100 100 80 60 60 50
## 241 55 55 85 65 85 35
## 242 35 65 35 65 35 65
## 243 75 105 75 105 75 45
## 244 45 55 45 65 45 75
## 245 65 40 70 80 140 70
## 246 65 80 140 40 70 70
## 247 45 60 30 80 50 65
## 248 75 90 50 110 80 95
## 249 75 90 90 140 90 115
## 250 75 95 95 95 95 85
## 251 90 60 60 40 40 40
## 252 90 120 120 60 60 50
## 253 85 80 90 105 95 60
## 254 73 95 62 85 65 85
## 255 55 20 35 20 45 75
## 256 35 35 35 35 35 35
## 257 50 95 95 35 110 70
## 258 45 30 15 85 65 65
## 259 45 63 37 65 55 95
## 260 45 75 37 70 55 83
## 261 95 80 105 40 70 100
## 262 255 10 10 75 135 55
## 263 90 85 75 115 100 115
## 264 115 115 85 90 75 100
## 265 100 75 115 90 115 85
## 266 50 64 50 45 50 41
## 267 70 84 70 65 70 51
## 268 100 134 110 95 100 61
## 269 100 164 150 95 120 71
## 270 106 90 130 90 154 110
## 271 106 130 90 110 154 90
## 272 100 100 100 100 100 100
## 273 40 45 35 65 55 70
## 274 50 65 45 85 65 95
## 275 70 85 65 105 85 120
## 276 70 110 75 145 85 145
## 277 45 60 40 70 50 45
## 278 60 85 60 85 60 55
## 279 80 120 70 110 70 80
## 280 80 160 80 130 80 100
## 281 50 70 50 50 50 40
## 282 70 85 70 60 70 50
## 283 100 110 90 85 90 60
## 284 100 150 110 95 110 70
## 285 35 55 35 30 30 35
## 286 70 90 70 60 60 70
## 287 38 30 41 30 41 60
## 288 78 70 61 50 61 100
## 289 45 45 35 20 30 20
## 290 50 35 55 25 25 15
## 291 60 70 50 100 50 65
## 292 50 35 55 25 25 15
## 293 60 50 70 50 90 65
## 294 40 30 30 40 50 30
## 295 60 50 50 60 70 50
## 296 80 70 70 90 100 70
## 297 40 40 50 30 30 30
## 298 70 70 40 60 40 60
## 299 90 100 60 90 60 80
## 300 40 55 30 30 30 85
## 301 60 85 60 50 50 125
## 302 40 30 30 55 30 85
## 303 60 50 100 85 70 65
## 304 28 25 25 45 35 40
## 305 38 35 35 65 55 50
## 306 68 65 65 125 115 80
## 307 68 85 65 165 135 100
## 308 40 30 32 50 52 65
## 309 70 60 62 80 82 60
## 310 60 40 60 40 60 35
## 311 60 130 80 60 60 70
## 312 60 60 60 35 35 30
## ... (rows 313-800 of the pokemon dataset printout omitted) ...
# Initialize the total within-cluster sum of squares vector: wss
wss <- numeric(15)
# Loop over 1 to 15 possible clusters
for (i in 1:15) {
  # Fit the model: km.out
  km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  # Save the total within-cluster sum of squares
  wss[i] <- km.out$tot.withinss
}
Pick an appropriate number of clusters based on these results from the first instruction and assign that number to k. Create a k-means model using k clusters and assign it to the km.out variable.
# Produce a scree plot
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
# Select number of clusters
k <- 4
# Build model with k clusters: km.out
km.out <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
# View the resulting model
km.out
## K-means clustering with 4 clusters of sizes 115, 114, 283, 288
##
## Cluster means:
## HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1 71.30435 92.91304 121.42609 63.89565 88.23478 52.36522
## 2 89.20175 121.09649 92.73684 120.45614 97.67544 100.44737
## 3 50.29682 54.03180 51.62898 47.90459 49.15548 49.74912
## 4 79.18056 81.31944 69.19097 82.01042 77.53125 80.10417
##
## Clustering vector:
## [1] 3 4 4 2 3 4 4 2 2 3 4 4 2 3 3 4 3 3 4 4 3 3 4 2 3 4 3 4 3 4 3 4 3 1 3 3 4
## [38] 3 3 4 3 4 3 4 3 4 3 4 3 4 4 3 1 3 4 3 4 3 4 3 4 3 4 3 2 3 3 4 3 4 4 2 3 4
## [75] 1 3 4 4 3 4 3 1 1 4 4 3 1 1 3 4 3 3 4 3 4 3 4 3 1 3 4 4 2 1 3 4 3 1 3 4 3
## [112] 4 3 1 4 1 3 3 1 3 1 4 4 4 2 3 4 3 4 3 4 4 4 4 4 4 1 2 4 3 4 2 4 3 3 4 4 2
## [149] 4 3 1 3 1 4 2 4 2 2 2 3 4 2 2 2 2 2 3 4 4 3 4 4 3 4 4 3 4 3 4 3 4 3 3 4 3
## [186] 4 3 3 3 3 4 3 4 3 3 4 2 1 3 4 1 4 3 3 4 3 3 4 4 3 1 4 1 4 4 4 3 4 4 3 1 4
## [223] 1 1 1 3 4 4 1 1 1 4 1 4 3 4 3 1 3 4 3 3 4 3 4 1 3 4 2 4 3 1 4 4 3 3 1 3 3
## [260] 3 4 4 2 2 1 3 4 2 2 2 2 2 3 4 4 2 3 4 2 2 3 4 4 2 3 4 3 4 3 3 4 3 3 3 3 4
## [297] 3 3 4 3 4 3 4 3 3 4 2 3 4 3 4 3 4 2 3 4 3 3 3 4 3 4 3 1 3 3 3 1 3 1 3 1 1
## [334] 1 3 4 4 3 4 2 4 4 4 4 4 3 4 3 4 2 4 4 3 4 2 1 3 4 3 3 3 4 3 4 3 4 2 4 4 4
## [371] 4 3 4 3 4 3 1 3 1 3 1 3 4 4 1 3 4 2 3 1 4 4 4 2 3 3 4 2 3 4 4 3 1 1 1 3 3
## [408] 1 2 2 3 1 1 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 3 1 1 3 4 4 3 4 4 3 3 4
## [445] 3 4 3 3 3 3 4 3 4 3 4 1 1 3 1 1 1 4 3 1 4 3 4 3 4 3 4 4 3 4 3 4 2 4 4 3 4
## [482] 3 3 4 3 1 3 3 3 4 1 3 4 2 2 4 3 2 2 3 1 3 1 3 4 4 3 4 3 3 4 2 4 1 1 1 1 2
## [519] 2 4 4 1 4 1 4 4 4 2 1 1 4 4 4 4 4 4 4 1 2 2 2 2 2 2 2 2 1 4 2 2 2 2 2 2 3
## [556] 4 4 3 4 4 3 4 4 3 4 3 3 1 3 4 3 4 3 4 3 4 3 4 3 3 4 3 4 3 1 1 3 4 3 4 4 1
## [593] 3 1 1 3 4 4 1 4 3 1 4 3 3 4 3 4 3 4 4 3 3 4 3 4 4 4 3 1 3 1 4 3 1 1 1 4 2
## [630] 3 4 3 4 3 4 3 4 4 3 3 4 3 4 3 4 4 3 4 4 3 1 3 4 3 4 4 3 4 3 1 3 1 1 3 4 4
## [667] 3 4 3 3 4 3 4 2 3 4 4 3 4 4 3 4 1 3 1 3 1 1 3 4 3 1 4 1 3 4 2 3 4 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 1 1 3 4 4 3 4 4 3 4 3 3 4 3 3 4 3 4 3 3 4
## [741] 3 4 3 4 4 3 4 4 3 1 2 1 3 4 3 4 3 4 3 1 3 1 3 4 3 4 3 1 3 4 4 4 4 1 3 4 2
## [778] 4 3 4 3 3 3 3 1 1 1 1 3 1 3 4 2 2 2 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 455965.7 408271.9 513476.6 871531.3
## (between_SS / total_SS = 47.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Create a scatter plot of Defense vs. Speed, showing cluster membership for each observation.
# Plot of Defense vs. Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
col = km.out$cluster,
main = paste("k-means clustering of Pokemon with", k, "clusters"),
xlab = "Defense", ylab = "Speed")
In this exercise, you will create your first hierarchical clustering model using the hclust() function.
We have created some data that has two dimensions and placed it in a variable called x. Your task is to create a hierarchical clustering model of x. Remember from the video that the first step to hierarchical clustering is determining the similarity between observations, which you will do with the dist() function.
You will look at the structure of the resulting model using the summary() function.
Fit a hierarchical clustering model to x using the hclust() function. Store the result in hclust.out. Inspect the result with the summary() function.
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x), method = "complete")
# Inspect the result
summary(hclust.out)
## Length Class Mode
## merge 598 -none- numeric
## height 299 -none- numeric
## order 300 -none- numeric
## labels 0 -none- NULL
## method 1 -none- character
## call 3 -none- call
## dist.method 1 -none- character
Remember from the video that cutree() is the R function that cuts a hierarchical model. The h and k arguments to cutree() allow you to cut the tree based on a certain height h or a certain number of clusters k.
In this exercise, you will use cutree() to cut the hierarchical model you created earlier based on each of these two criteria.
The hclust.out model you created earlier is available in your workspace.
Cut the hclust.out model at height 7.
# Cut by height
cutree(hclust.out, h = 7)
## [1] 1 1 2 2 2 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
## [26] 6 6 6 7 7 7 7 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10 10 11 11
## [51] 11 11 11 11 11 12 12 12 12 13 13 13 13 13 14 14 14 14 14 14 15 15 15 16 16
## [76] 16 16 16 17 17 17 17 17 17 18 18 18 19 19 19 19 19 20 20 20 20 20 20 21 21
## [101] 22 22 22 22 23 23 23 23 23 24 24 24 24 24 24 24 25 25 25 26 26 26 26 26 26
## [126] 27 27 27 27 28 28 28 28 28 28 29 29 29 29 30 30 30 30 31 31 31 31 31 31 31
## [151] 32 32 32 32 32 32 33 33 33 33 33 33 33 34 34 34 34 34 35 35 35 35 35 36 36
## [176] 36 36 36 37 37 37 38 38 38 38 38 39 39 39 39 39 39 39 40 40 40 40 41 41 41
## [201] 41 41 42 42 42 42 42 42 42 43 43 43 43 43 44 44 44 44 44 44 44 45 45 45 45
## [226] 46 46 46 46 46 46 46 47 47 47 47 48 48 48 48 48 49 49 49 50 50 50 50 50 50
## [251] 51 51 51 51 51 51 51 52 52 52 52 52 52 52 53 53 53 53 53 53 53 54 54 54 54
## [276] 54 54 54 55 55 55 55 56 56 56 56 56 56 56 57 57 57 57 58 58 58 58 59 59 59
Cut the hclust.out model to create 3 clusters.
# Cut by number of clusters
cutree(hclust.out, k = 3)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 3 3 3 3
In this exercise, you will produce hierarchical clustering models using different linkages and plot the dendrogram for each, observing the overall structure of the trees.
You’ll be asked to interpret the results in the next exercise.
Produce three hierarchical clustering models on x using the “complete”, “average”, and “single” linkage methods, respectively.
# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")
# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")
# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")
# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")
# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")
# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")
Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.
In this exercise, you will go back to working with “real” data, the pokemon dataset introduced in the first chapter. You will observe the distribution (mean and standard deviation) of each feature, scale the data accordingly, then produce a hierarchical clustering model using the complete linkage method.
The data is stored in the pokemon object in your workspace.
Observe the mean of each variable in pokemon using the colMeans() function.
# View column means
colMeans(pokemon)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
Observe the standard deviation of each variable using the apply() and sd() functions. Since the variables are the columns of your matrix, make sure to specify 2 as the MARGIN argument to apply().
# View column standard deviations
apply(pokemon, 2, sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
Scale the pokemon data using the scale() function and store the result in pokemon.scaled.
# Scale the data
pokemon.scaled <- scale(pokemon)
Create a hierarchical clustering model of the pokemon.scaled data using the complete linkage method. Manually specify the method argument and store the result in hclust.pokemon.
# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
Comparing k-means and hierarchical clustering, you’ll see the two methods produce different cluster memberships. This is because the two algorithms make different assumptions about how the data is generated. In a more advanced course, we could choose to use one model over another based on the quality of the models’ assumptions, but for now, it’s enough to observe that they are different.
This exercise will have you compare results from the two models on the pokemon dataset to see how they differ.
The results from running k-means clustering on the pokemon data (for 3 clusters) are stored as km.pokemon. The hierarchical clustering model you created in the previous exercise is still available as hclust.pokemon.
Using cutree() on hclust.pokemon, assign cluster membership to each observation. Assume three clusters and assign the result to a vector called cut.pokemon. Using table(), compare cluster membership between the two clustering methods. Recall that the different components of k-means model objects can be accessed with the $ operator.
# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon <- cutree(hclust.pokemon, k = 3)
# Select number of clusters
k <- 3
# Build model with k clusters: km.pokemon
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
# Compare methods
table(km.pokemon$cluster, cut.pokemon)
## cut.pokemon
## 1 2 3
## 1 171 3 1
## 2 350 5 0
## 3 267 3 0
In this exercise, you will create your first PCA model and observe the diagnostic results.
We have loaded the Pokemon data from earlier, which has six dimensions, and placed it in a variable called pokemon. Your task is to create a PCA model of the data, then to inspect the resulting model using the summary() function.
Create a PCA model of the data in pokemon, setting scale to TRUE. Store the result in pr.out. Inspect the result with the summary() function.
# Perform scaled PCA: pr.out
pr.out <- prcomp(x = pokemon,
scale = TRUE)
# Inspect model output
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
## Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
## Cumulative Proportion 0.4519 0.6342 0.7640 0.8841 0.95549 1.00000
PCA models in R produce additional diagnostic and output components:
- center: the column means used to center the data, or FALSE if the data weren't centered
- scale: the column standard deviations used to scale the data, or FALSE if the data weren't scaled
- rotation: the directions of the principal component vectors in terms of the original features/variables; this information allows you to express new data in terms of the principal components
- x: the value of each observation in the original dataset projected onto the principal components

You can access these in the same way as other model components. For example, use pr.out$rotation to access the rotation component.
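As a quick illustration (a sketch, not part of the original exercise), each of these components can be pulled off the fitted object directly:

# Sketch: accessing PCA model components (assumes pr.out from above)
pr.out$center          # column means used for centering
pr.out$scale           # column standard deviations used for scaling
head(pr.out$rotation)  # loadings of each original variable on the PCs
head(pr.out$x)         # observations projected onto the PCs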
Which of the following statements is not correct regarding the pr.out model fit on the pokemon data?
Interpreting biplots (1)

As stated in the video, the biplot() function plots both the principal component loadings and the mapping of the observations to their first two principal component values. The next couple of exercises will check your interpretation of the biplot() visualization.
Using the biplot() of the pr.out model, which two original variables have approximately the same loadings in the first two principal components?
biplot(pr.out)
The second common plot type for understanding PCA models is a scree plot. A scree plot shows the variance explained as the number of principal components increases. Sometimes the cumulative variance explained is plotted as well.
In this and the next exercise, you will prepare data from the pr.out model you created at the beginning of the chapter for use in a scree plot. Preparing the data for plotting is required because there is not a built-in function in R to create this type of plot.
pr.out and the pokemon data are still available in your workspace.
Assign to the variable pr.var the square of the standard deviations of the principal components (i.e. the variance). The standard deviation of the principal components is available in the sdev component of the PCA model object. Assign to the variable pve the proportion of the variance explained, calculated by dividing pr.var by the total variance explained by all principal components.
# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
Now you will create a scree plot showing the proportion of variance explained by each principal component, as well as the cumulative proportion of variance explained.
Recall from the video that these plots can help to determine the number of principal components to retain. One way to determine the number of principal components to retain is by looking for an elbow in the scree plot showing that as the number of principal components increases, the rate at which variance is explained decreases substantially. In the absence of a clear elbow, you can use the scree plot as a guide for setting a threshold.
The proportion of variance explained is still available in the pve object you created in the last exercise.
Use plot() to plot the proportion of variance explained by each principal component.
# Plot variance explained for each principal component
head(pve)
## [1] 0.45190665 0.18225358 0.12979086 0.12011089 0.07142337 0.04451466
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
Use plot() and cumsum() (cumulative sum) to plot the cumulative proportion of variance explained as a function of the number of principal components.
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
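In the absence of an obvious elbow, a common alternative is the threshold approach mentioned above: retain the smallest number of components whose cumulative variance exceeds a chosen cutoff. A minimal sketch, assuming an illustrative 90% threshold that the exercise itself does not specify:

# Sketch: smallest number of PCs whose cumulative variance meets a threshold
threshold <- 0.90  # assumed cutoff for illustration
min(which(cumsum(pve) >= threshold))  # 5 here, per the summary above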
Practical considerations before performing PCA:

1. Scaling the data.
2. Missing values: drop observations with missing values, or impute/estimate the missing values.
3. Categorical data: do not use categorical features, or encode categorical features as numbers.
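A minimal sketch of points 2 and 3 on a hypothetical toy data frame (the column names and values are invented for illustration):

# Hypothetical toy data with a missing value and a categorical feature
toy <- data.frame(height = c(1.7, NA, 1.8),
                  type = c("fire", "water", "fire"))
# Impute the missing value with the column mean
toy$height[is.na(toy$height)] <- mean(toy$height, na.rm = TRUE)
# Encode the categorical feature as numbers (one-hot columns)
model.matrix(~ type - 1, data = toy)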
You saw in the video that scaling your data before doing PCA changes the results of the PCA modeling. Here, you will perform PCA with and without scaling, then visualize the results using biplots.
Sometimes scaling is appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Making the decision to use scaling is an important step in performing a principal component analysis.
The same Pokemon dataset is available in your workspace as pokemon, but one new variable has been added: Total.
There is some code at the top of the editor to calculate the mean and standard deviation of each variable in the model. Run this code to see how the scale of the variables differs in the original data.
# Mean of each variable
colMeans(pokemon)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
# Standard deviation of each variable
apply(pokemon, 2, sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
Create a PCA model of pokemon with scaling, assigning the result to pr.with.scaling.
# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(x = pokemon,
scale = TRUE)
# Create biplots of both for comparison
biplot(pr.with.scaling)
Create a PCA model of pokemon without scaling, assigning the result to pr.without.scaling. Use biplot() to plot both models (one at a time) and compare their outputs.
# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(x = pokemon,
scale = FALSE)
biplot(pr.without.scaling)
The goal of this chapter is to guide you through a complete analysis using the unsupervised learning techniques covered in the first three chapters. You’ll extend what you’ve learned by combining PCA as a preprocessing step to clustering using data that consist of measurements of cell nuclei of human breast masses.
Unlike prior chapters, where the data were prepared for you, the goal of this chapter is to step you through a more realistic and complete workflow.
Recall from the video that the first step is to download and prepare the data.
Use the read.csv() function to download the CSV (comma-separated values) file containing the data from the URL provided. Assign the result to wisc.df.
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"
# Download the data: wisc.df
wisc.df <- read.csv(url)
Use as.matrix() to convert the features of the data (in columns 3 through 32) to a matrix. Store this in a variable called wisc.data.
# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[,c(3:32)])
Assign the row names of wisc.data the values currently contained in the id column of wisc.df. While not strictly required, this will help you keep track of the different observations throughout the modeling process.
# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id
Finally, set a vector called diagnosis to be 1 if a diagnosis is malignant (“M”) and 0 otherwise. Note that R coerces TRUE to 1 and FALSE to 0.
# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis == "M")
The first step of any data analysis, unsupervised or supervised, is to familiarize yourself with the data.
The variables you created before, wisc.data and diagnosis, are still available in your workspace. Explore the data to answer the following questions:
How many observations are in this dataset? How many variables/features in the data are suffixed with _mean? How many of the observations have a malignant diagnosis?
# The dim(), colnames(), and table() functions may prove useful here!
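One possible set of commands for answering these questions (a sketch; the expected counts can be cross-checked against the diagnosis tables later in the chapter):

dim(wisc.data)                             # observations and features (569 x 30)
sum(grepl("_mean$", colnames(wisc.data)))  # features suffixed with _mean
table(diagnosis)                           # 1 = malignant, 0 = benign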
The next step in your analysis is to perform PCA on wisc.data.
You saw in the last chapter that it’s important to check if the data need to be scaled before performing PCA. Recall two common reasons for scaling data:
The input variables use different units of measurement. The input variables have significantly different variances.
The variables you created before, wisc.data and diagnosis, are still available in your workspace.
Check the mean and standard deviation of the features of the data to determine if the data should be scaled. Use the colMeans() and apply() functions like you’ve done before.
# Check column means and standard deviations
colMeans(wisc.data)
## radius_mean texture_mean perimeter_mean
## 1.412729e+01 1.928965e+01 9.196903e+01
## area_mean smoothness_mean compactness_mean
## 6.548891e+02 9.636028e-02 1.043410e-01
## concavity_mean concave.points_mean symmetry_mean
## 8.879932e-02 4.891915e-02 1.811619e-01
## fractal_dimension_mean radius_se texture_se
## 6.279761e-02 4.051721e-01 1.216853e+00
## perimeter_se area_se smoothness_se
## 2.866059e+00 4.033708e+01 7.040979e-03
## compactness_se concavity_se concave.points_se
## 2.547814e-02 3.189372e-02 1.179614e-02
## symmetry_se fractal_dimension_se radius_worst
## 2.054230e-02 3.794904e-03 1.626919e+01
## texture_worst perimeter_worst area_worst
## 2.567722e+01 1.072612e+02 8.805831e+02
## smoothness_worst compactness_worst concavity_worst
## 1.323686e-01 2.542650e-01 2.721885e-01
## concave.points_worst symmetry_worst fractal_dimension_worst
## 1.146062e-01 2.900756e-01 8.394582e-02
apply(wisc.data, 2, sd)
## radius_mean texture_mean perimeter_mean
## 3.524049e+00 4.301036e+00 2.429898e+01
## area_mean smoothness_mean compactness_mean
## 3.519141e+02 1.406413e-02 5.281276e-02
## concavity_mean concave.points_mean symmetry_mean
## 7.971981e-02 3.880284e-02 2.741428e-02
## fractal_dimension_mean radius_se texture_se
## 7.060363e-03 2.773127e-01 5.516484e-01
## perimeter_se area_se smoothness_se
## 2.021855e+00 4.549101e+01 3.002518e-03
## compactness_se concavity_se concave.points_se
## 1.790818e-02 3.018606e-02 6.170285e-03
## symmetry_se fractal_dimension_se radius_worst
## 8.266372e-03 2.646071e-03 4.833242e+00
## texture_worst perimeter_worst area_worst
## 6.146258e+00 3.360254e+01 5.693570e+02
## smoothness_worst compactness_worst concavity_worst
## 2.283243e-02 1.573365e-01 2.086243e-01
## concave.points_worst symmetry_worst fractal_dimension_worst
## 6.573234e-02 6.186747e-02 1.806127e-02
Execute PCA on the wisc.data, scaling if appropriate, and assign the model to wisc.pr.
# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale = TRUE)
Inspect a summary of the results with the summary() function.
# Look at summary of results
summary(wisc.pr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
## Cumulative Proportion 0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
## Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
## Cumulative Proportion 0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
## Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
## Cumulative Proportion 0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion 0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
## PC29 PC30
## Standard deviation 0.02736 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion 1.00000 1.00000
Now you’ll use some visualizations to better understand your PCA model. You were introduced to one of these visualizations, the biplot, in an earlier chapter.
You’ll run into some common challenges with using biplots on real-world data containing a non-trivial number of observations and variables, then you’ll look at some alternative visualizations. You are encouraged to experiment with additional visualizations before moving on to the next exercise.
The variables you created before, wisc.data, diagnosis, and wisc.pr, are still available.
Create a biplot of the wisc.pr data. What stands out to you about this plot? Is it easy or difficult to understand? Why?
# Create a biplot of wisc.pr
biplot(wisc.pr)
names(wisc.pr)
## [1] "sdev" "rotation" "center" "scale" "x"
Execute the code to scatter plot each observation by principal components 1 and 2, coloring the points by the diagnosis.
# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC2")
Repeat the same for principal components 1 and 3. What do you notice about these plots?
# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC3")
# Do additional data exploration of your choosing below (optional)
plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1),
xlab = "PC2", ylab = "PC3")
In this exercise, you will produce scree plots showing the proportion of variance explained as the number of principal components increases. The data from PCA must be prepared for these plots, as there is not a built-in function in R to create them directly from the PCA model.
As you look at these plots, ask yourself if there’s an elbow in the amount of variance explained that might lead you to pick a natural number of principal components. If an obvious elbow does not exist, as is typical in real-world datasets, consider how else you might determine the number of principal components to retain based on the scree plot.
The variables you created before, wisc.data, diagnosis, and wisc.pr, are still available.
Calculate the variance of each principal component by squaring the sdev component of wisc.pr. Save the result as an object called pr.var.
# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))
# Calculate variability of each component
pr.var <- wisc.pr$sdev^2
Calculate the variance explained by each principal component by dividing by the total variance explained of all principal components. Assign this to a variable called pve.
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
Create a plot of variance explained for each principal component.
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
Using the cumsum() function, create a plot of cumulative proportion of variance explained.
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
This exercise will check your understanding of the PCA results, in particular the loadings and variance explained. The loadings, represented as vectors, explain the mapping from the original features to the principal components. The principal components are naturally ordered from the most variance explained to the least variance explained.
The variables you created before—wisc.data, diagnosis, wisc.pr, and pve—are still available.
For the first principal component, what is the component of the loading vector for the feature concave.points_mean? What is the minimum number of principal components required to explain 80% of the variance of the data?
wisc.pr$rotation[, 1:3]
## PC1 PC2 PC3
## radius_mean -0.21890244 0.233857132 -0.008531243
## texture_mean -0.10372458 0.059706088 0.064549903
## perimeter_mean -0.22753729 0.215181361 -0.009314220
## area_mean -0.22099499 0.231076711 0.028699526
## smoothness_mean -0.14258969 -0.186113023 -0.104291904
## compactness_mean -0.23928535 -0.151891610 -0.074091571
## concavity_mean -0.25840048 -0.060165363 0.002733838
## concave.points_mean -0.26085376 0.034767500 -0.025563541
## symmetry_mean -0.13816696 -0.190348770 -0.040239936
## fractal_dimension_mean -0.06436335 -0.366575471 -0.022574090
## radius_se -0.20597878 0.105552152 0.268481387
## texture_se -0.01742803 -0.089979682 0.374633665
## perimeter_se -0.21132592 0.089457234 0.266645367
## area_se -0.20286964 0.152292628 0.216006528
## smoothness_se -0.01453145 -0.204430453 0.308838979
## compactness_se -0.17039345 -0.232715896 0.154779718
## concavity_se -0.15358979 -0.197207283 0.176463743
## concave.points_se -0.18341740 -0.130321560 0.224657567
## symmetry_se -0.04249842 -0.183848000 0.288584292
## fractal_dimension_se -0.10256832 -0.280092027 0.211503764
## radius_worst -0.22799663 0.219866379 -0.047506990
## texture_worst -0.10446933 0.045467298 -0.042297823
## perimeter_worst -0.23663968 0.199878428 -0.048546508
## area_worst -0.22487053 0.219351858 -0.011902318
## smoothness_worst -0.12795256 -0.172304352 -0.259797613
## compactness_worst -0.21009588 -0.143593173 -0.236075625
## concavity_worst -0.22876753 -0.097964114 -0.173057335
## concave.points_worst -0.25088597 0.008257235 -0.170344076
## symmetry_worst -0.12290456 -0.141883349 -0.271312642
## fractal_dimension_worst -0.13178394 -0.275339469 -0.232791313
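Rather than scanning the printed matrix, both answers can be computed directly. A sketch, using the pve vector created in the scree-plot exercise:

# Loading of concave.points_mean on the first principal component
wisc.pr$rotation["concave.points_mean", "PC1"]  # -0.26085376, per the matrix above
# Minimum number of components explaining at least 80% of the variance
min(which(cumsum(pve) >= 0.80))                 # 5, per the summary above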
The goal of this exercise is to do hierarchical clustering of the observations. Recall from Chapter 2 that this type of clustering does not assume in advance the number of natural groups that exist in the data.
As part of the preparation for hierarchical clustering, the distances between all pairs of observations are computed. Furthermore, there are different ways to link clusters together, with single, complete, and average being the most common linkage methods.
The variables you created before, wisc.data, diagnosis, wisc.pr, and pve, are available in your workspace.
Scale the wisc.data data and assign the result to data.scaled.
# Scale the wisc.data data: data.scaled
head(wisc.data)
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302 17.99 10.38 122.80 1001.0 0.11840
## 842517 20.57 17.77 132.90 1326.0 0.08474
## 84300903 19.69 21.25 130.00 1203.0 0.10960
## 84348301 11.42 20.38 77.58 386.1 0.14250
## 84358402 20.29 14.34 135.10 1297.0 0.10030
## 843786 12.45 15.70 82.57 477.1 0.12780
## compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302 0.27760 0.3001 0.14710 0.2419
## 842517 0.07864 0.0869 0.07017 0.1812
## 84300903 0.15990 0.1974 0.12790 0.2069
## 84348301 0.28390 0.2414 0.10520 0.2597
## 84358402 0.13280 0.1980 0.10430 0.1809
## 843786 0.17000 0.1578 0.08089 0.2087
## fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302 0.07871 1.0950 0.9053 8.589 153.40
## 842517 0.05667 0.5435 0.7339 3.398 74.08
## 84300903 0.05999 0.7456 0.7869 4.585 94.03
## 84348301 0.09744 0.4956 1.1560 3.445 27.23
## 84358402 0.05883 0.7572 0.7813 5.438 94.44
## 843786 0.07613 0.3345 0.8902 2.217 27.19
## smoothness_se compactness_se concavity_se concave.points_se
## 842302 0.006399 0.04904 0.05373 0.01587
## 842517 0.005225 0.01308 0.01860 0.01340
## 84300903 0.006150 0.04006 0.03832 0.02058
## 84348301 0.009110 0.07458 0.05661 0.01867
## 84358402 0.011490 0.02461 0.05688 0.01885
## 843786 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302 0.03003 0.006193 25.38 17.33
## 842517 0.01389 0.003532 24.99 23.41
## 84300903 0.02250 0.004571 23.57 25.53
## 84348301 0.05963 0.009208 14.91 26.50
## 84358402 0.01756 0.005115 22.54 16.67
## 843786 0.02165 0.005082 15.47 23.75
## perimeter_worst area_worst smoothness_worst compactness_worst
## 842302 184.60 2019.0 0.1622 0.6656
## 842517 158.80 1956.0 0.1238 0.1866
## 84300903 152.50 1709.0 0.1444 0.4245
## 84348301 98.87 567.7 0.2098 0.8663
## 84358402 152.20 1575.0 0.1374 0.2050
## 843786 103.40 741.6 0.1791 0.5249
## concavity_worst concave.points_worst symmetry_worst
## 842302 0.7119 0.2654 0.4601
## 842517 0.2416 0.1860 0.2750
## 84300903 0.4504 0.2430 0.3613
## 84348301 0.6869 0.2575 0.6638
## 84358402 0.4000 0.1625 0.2364
## 843786 0.5355 0.1741 0.3985
## fractal_dimension_worst
## 842302 0.11890
## 842517 0.08902
## 84300903 0.08758
## 84348301 0.17300
## 84358402 0.07678
## 843786 0.12440
data.scaled <- scale(wisc.data)
Calculate the (Euclidean) distances between all pairs of observations in the new scaled dataset and assign the result to data.dist.
# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)
Create a hierarchical clustering model using complete linkage. Manually specify the method argument to hclust() and assign the results to wisc.hclust.
# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")
Let’s use the hierarchical clustering model you just created to determine a height (or distance between clusters) where a certain number of clusters exists. The variables you created before—wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust—are all available in your workspace.
Using the plot() function, what is the height at which the clustering model has 4 clusters?
plot(wisc.hclust)
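One way to read the height off the dendrogram is to overlay a candidate cut height and verify the resulting cluster count numerically. A sketch; the value 20 below is an illustrative guess, not an answer given in the text:

# Sketch: test a candidate height and count the clusters it produces
abline(h = 20, col = "red")
length(unique(cutree(wisc.hclust, h = 20)))  # 4 clusters if the guess is right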
In this exercise, you will compare the outputs from your hierarchical clustering model to the actual diagnoses. Normally when performing unsupervised learning like this, a target variable isn’t available. We do have it with this dataset, however, so it can be used to check the performance of the clustering model.
When performing supervised learning—that is, when you’re trying to predict some target variable of interest and that target variable is available in the original data—using clustering to create new features may or may not improve the performance of the final model. This exercise will help you determine if, in this case, hierarchical clustering provides a promising new feature.
wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust are available in your workspace.
Use cutree() to cut the tree so that it has 4 clusters. Assign the output to the variable wisc.hclust.clusters.
# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)
Use the table() function to compare the cluster membership to the actual diagnoses.
# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
## diagnosis
## wisc.hclust.clusters 0 1
## 1 12 165
## 2 2 5
## 3 343 40
## 4 0 2
As you now know, there are two main types of clustering: hierarchical and k-means.
In this exercise, you will create a k-means clustering model on the Wisconsin breast cancer data and compare the results to the actual diagnoses and the results of your hierarchical clustering model. Take some time to see how each clustering model performs in terms of separating the two diagnoses and how the clustering models compare to each other.
wisc.data, diagnosis, and wisc.hclust.clusters are still available.
Create a k-means model on wisc.data, assigning the result to wisc.km. Be sure to create 2 clusters, corresponding to the actual number of diagnoses. Also, remember to scale the data and repeat the algorithm 20 times to find a well-performing model.
# Create a k-means model on wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), 2, nstart = 20)
Use the table() function to compare the cluster membership of the k-means model to the actual diagnoses contained in the diagnosis vector. How well does k-means separate the two diagnoses?
# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
## diagnosis
## 0 1
## 1 343 37
## 2 14 175
Use the table() function to compare the cluster membership of the k-means model to the hierarchical clustering model. Recall the cluster membership of the hierarchical clustering model is contained in wisc.hclust.clusters.
# Compare k-means to hierarchical clustering
table(wisc.hclust.clusters, wisc.km$cluster)
##
## wisc.hclust.clusters 1 2
## 1 17 160
## 2 0 7
## 3 363 20
## 4 0 2
In this final exercise, you will put together several steps you used earlier and, in doing so, you will experience some of the creativity that is typical in unsupervised learning.
Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing data and potentially avoiding overfitting, PCA also uncorrelates the variables, sometimes improving the performance of other modeling techniques.
Let’s see if PCA improves or degrades the performance of hierarchical clustering.
wisc.pr, diagnosis, wisc.hclust.clusters, and wisc.km are still available in your workspace.
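Before clustering on the components, you can check the decorrelation claim directly. A quick sketch, not part of the original exercise:

# Sketch: correlations between the first few PC scores are essentially zero
round(cor(wisc.pr$x[, 1:5]), 3)  # off-diagonal entries should be ~0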
Using the minimum number of principal components required to describe at least 90% of the variability in the data, create a hierarchical clustering model with complete linkage. Assign the results to wisc.pr.hclust.
# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")
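For reference, the choice of 7 components can be checked against the cumulative variance (the PCA summary earlier shows PC7 reaching about 91%). A one-line sketch:

# Sketch: smallest number of PCs covering at least 90% of the variance
min(which(cumsum(wisc.pr$sdev^2) / sum(wisc.pr$sdev^2) >= 0.90))  # 7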
Cut this hierarchical clustering model into 4 clusters and assign the results to wisc.pr.hclust.clusters.
# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)
Using table(), compare the results from your new hierarchical clustering model with the actual diagnoses. How well does the newly created model with four clusters separate out the two diagnoses?
# Compare to actual diagnoses
table(diagnosis, wisc.pr.hclust.clusters)
## wisc.pr.hclust.clusters
## diagnosis 1 2 3 4
## 0 5 350 2 0
## 1 113 97 0 2
How well do the k-means and hierarchical clustering models you created in previous exercises do in terms of separating the diagnoses? Again, use the table() function to compare the output of each model with the vector containing the actual diagnoses.
# Compare to k-means and hierarchical
table(diagnosis, wisc.hclust.clusters)
## wisc.hclust.clusters
## diagnosis 1 2 3 4
## 0 12 2 343 0
## 1 165 5 40 2
table(diagnosis, wisc.km$cluster)
##
## diagnosis 1 2
## 0 343 14
## 1 37 175
As you saw in the video, included in the course is the diamonds dataset, which is a classic dataset from the ggplot2 package. The dataset contains physical attributes of diamonds as well as the price they sold for. One interesting modeling challenge is predicting a diamond's price from its attributes using something like a linear regression.
Recall that to fit a linear regression, you use the lm() function in the following format:
mod <- lm(y ~ x, my_data)

To make predictions using mod on the original data, you call the predict() function:
pred <- predict(mod, my_data)
Fit a linear model on the diamonds dataset predicting price using all other variables as predictors (i.e. price ~ .). Save the result to model. Make predictions using model on the full original dataset and save the result to p. Compute the errors as the predictions minus the actual prices. Save the result to error. Compute RMSE using the formula you learned in the video and print it to the console.
# Fit lm model: model
model <- lm(price ~ ., diamonds)
# Predict on full data: p
p <- predict(model, diamonds)
# Compute errors: error
error <- p - diamonds[["price"]]
# Calculate RMSE
sqrt(mean(error ^ 2))
## [1] 1129.843
One way you can take a train/test split of a dataset is to order the dataset randomly, then divide it into the two sets. This ensures that the training set and test set are both random samples and that any biases in the ordering of the dataset (e.g. if it had originally been ordered by price or size) are not retained in the samples we take for training and testing your models. You can think of this like shuffling a brand new deck of playing cards before dealing hands.
First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:
set.seed(42)

Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset.
rows <- sample(nrow(diamonds))

Finally, you can use this random vector to reorder the diamonds dataset:
diamonds <- diamonds[rows, ]
Set the random seed to 42. Make a vector of row indices called rows. Randomly reorder the diamonds data frame, assigning to shuffled_diamonds.
# Set seed
set.seed(42)
# Shuffle row indices: rows
rows <- sample(nrow(diamonds))
# Randomly order data
shuffled_diamonds <- diamonds[rows, ]
Now that your dataset is randomly ordered, you can split the first 80% of it into a training set, and the last 20% into a test set. You can do this by choosing a split point approximately 80% of the way through your data:
split <- round(nrow(mydata) * 0.80)

You can then use this point to break off the first 80% of the dataset as a training set:
mydata[1:split, ]

And then you can use that same point to determine the test set:
mydata[(split + 1):nrow(mydata), ]
Choose a row index to split on so that the split point is approximately 80% of the way through the diamonds dataset. Call this index split. Create a training set called train using that index. Create a test set called test using that index.
# Determine row to split on: split
split <- round(nrow(diamonds) * 0.80)
# Create train
train <- diamonds[1:split, ]
# Create test
test <- diamonds[(split + 1):nrow(diamonds), ]
Now that you have a randomly split training set and test set, you can use the lm() function as you did in the first exercise to fit a model to your training set, rather than the entire dataset. Recall that you can use the formula interface to the linear regression function to fit a model with a specified target variable using all other variables in the dataset as predictors:
mod <- lm(y ~ ., training_data)

You can use the predict() function to make predictions from that model on new data. The new dataset must have all of the columns from the training data, but they can be in a different order with different values. Here, rather than re-predicting on the training set, you can predict on the test set, which you did not use for training the model. This will allow you to determine the out-of-sample error for the model in the next exercise:
p <- predict(model, new_data)
Fit an lm() model called model to predict price using all other variables as covariates. Be sure to use the training set, train. Predict on the test set, test, using predict(). Store these values in a vector called p.
# Fit lm model on train: model
model <- lm(price ~ ., train)
# Predict on test: p
p <- predict(model, test)
Now that you have predictions on the test set, you can use them to calculate an error metric (in this case RMSE) on the test set and see how the model performs out-of-sample, rather than in-sample as you did in the first exercise. You do this by first calculating the errors between the predicted and actual diamond prices, subtracting the actual values from the predictions.
Once you have an error vector, calculating RMSE is as simple as squaring it, taking the mean, then taking the square root:
sqrt(mean(error^2))
test, model, and p are loaded in your workspace.
Calculate the error between the predictions on the test set and the actual diamond prices in the test set. Call this error. Calculate RMSE using this error vector, just printing the result to the console.
# Compute errors: error
error <- p - test[["price"]]
# Calculate RMSE
sqrt(mean(error^2))
## [1] 796.8922
As you saw in the video, a better approach to validating models is to use multiple systematic test sets, rather than a single random train/test split. Fortunately, the caret package makes this very easy to do:
model <- train(y ~ ., my_data)

caret supports many types of cross-validation, and you can specify which type of cross-validation and the number of cross-validation folds with the trainControl() function, which you pass to the trControl argument in train():

model <- train(
  y ~ .,
  my_data,
  method = "lm",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  )
)

It's important to note that you pass the method for modeling to the main train() function and the method for cross-validation to the trainControl() function.
Fit a linear regression to model price using all other variables in the diamonds dataset as predictors. Use the train() function and 10-fold cross-validation. (Note that we’ve taken a subset of the full diamonds dataset to speed up this operation, but it’s still named diamonds.) Print the model to the console and examine the results.
# Fit lm model using 10-fold CV: model
model <- train(
price ~ .,
diamonds,
method = "lm",
trControl = trainControl(
method = "cv",
number = 10,
verboseIter = TRUE
)
)
## + Fold01: intercept=TRUE
## - Fold01: intercept=TRUE
## + Fold02: intercept=TRUE
## - Fold02: intercept=TRUE
## + Fold03: intercept=TRUE
## - Fold03: intercept=TRUE
## + Fold04: intercept=TRUE
## - Fold04: intercept=TRUE
## + Fold05: intercept=TRUE
## - Fold05: intercept=TRUE
## + Fold06: intercept=TRUE
## - Fold06: intercept=TRUE
## + Fold07: intercept=TRUE
## - Fold07: intercept=TRUE
## + Fold08: intercept=TRUE
## - Fold08: intercept=TRUE
## + Fold09: intercept=TRUE
## - Fold09: intercept=TRUE
## + Fold10: intercept=TRUE
## - Fold10: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
# Print model to console
print(model)
## Linear Regression
##
## 53940 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 48547, 48546, 48546, 48547, 48545, 48547, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1131.015 0.9196398 740.6117
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
In this course, you will use a wide variety of datasets to explore the full flexibility of the caret package. Here, you will use the famous Boston housing dataset, where the goal is to predict median home values in various Boston suburbs.
You can use exactly the same code as in the previous exercise, but change the dataset used by the model:
model <- train(
  medv ~ .,
  Boston,  # <- new!
  method = "lm",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  )
)

Next, you can reduce the number of cross-validation folds from 10 to 5 using the number argument to trainControl():

trControl = trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE
)
Fit an lm() model to the Boston housing dataset, such that medv is the response variable and all other variables are explanatory variables. Use 5-fold cross-validation rather than 10-fold cross-validation. Print the model to the console and inspect the results.
Boston <- read.delim("DATABASE/Boston.txt", sep = ",")
head(Boston)
## X crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# Fit lm model using 5 x 5-fold CV: model
model <- train(
medv ~ .,
Boston,
method = "lm",
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = TRUE
)
)
## + Fold1.Rep1: intercept=TRUE
## - Fold1.Rep1: intercept=TRUE
## + Fold2.Rep1: intercept=TRUE
## - Fold2.Rep1: intercept=TRUE
## + Fold3.Rep1: intercept=TRUE
## - Fold3.Rep1: intercept=TRUE
## + Fold4.Rep1: intercept=TRUE
## - Fold4.Rep1: intercept=TRUE
## + Fold5.Rep1: intercept=TRUE
## - Fold5.Rep1: intercept=TRUE
## + Fold1.Rep2: intercept=TRUE
## - Fold1.Rep2: intercept=TRUE
## + Fold2.Rep2: intercept=TRUE
## - Fold2.Rep2: intercept=TRUE
## + Fold3.Rep2: intercept=TRUE
## - Fold3.Rep2: intercept=TRUE
## + Fold4.Rep2: intercept=TRUE
## - Fold4.Rep2: intercept=TRUE
## + Fold5.Rep2: intercept=TRUE
## - Fold5.Rep2: intercept=TRUE
## + Fold1.Rep3: intercept=TRUE
## - Fold1.Rep3: intercept=TRUE
## + Fold2.Rep3: intercept=TRUE
## - Fold2.Rep3: intercept=TRUE
## + Fold3.Rep3: intercept=TRUE
## - Fold3.Rep3: intercept=TRUE
## + Fold4.Rep3: intercept=TRUE
## - Fold4.Rep3: intercept=TRUE
## + Fold5.Rep3: intercept=TRUE
## - Fold5.Rep3: intercept=TRUE
## + Fold1.Rep4: intercept=TRUE
## - Fold1.Rep4: intercept=TRUE
## + Fold2.Rep4: intercept=TRUE
## - Fold2.Rep4: intercept=TRUE
## + Fold3.Rep4: intercept=TRUE
## - Fold3.Rep4: intercept=TRUE
## + Fold4.Rep4: intercept=TRUE
## - Fold4.Rep4: intercept=TRUE
## + Fold5.Rep4: intercept=TRUE
## - Fold5.Rep4: intercept=TRUE
## + Fold1.Rep5: intercept=TRUE
## - Fold1.Rep5: intercept=TRUE
## + Fold2.Rep5: intercept=TRUE
## - Fold2.Rep5: intercept=TRUE
## + Fold3.Rep5: intercept=TRUE
## - Fold3.Rep5: intercept=TRUE
## + Fold4.Rep5: intercept=TRUE
## - Fold4.Rep5: intercept=TRUE
## + Fold5.Rep5: intercept=TRUE
## - Fold5.Rep5: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
# Print model to console
print(model)
## Linear Regression
##
## 506 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 405, 405, 406, 403, 405, 405, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.832232 0.7267225 3.387277
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Finally, the model you fit with the train() function has the exact same predict() interface as the linear regression models you fit earlier in this chapter.
After fitting a model with train(), you can simply call predict() with new data, e.g.:
predict(my_model, new_data)
Use the predict() function to make predictions with model on the full Boston housing dataset. Print the result to the console.
# Predict on full Boston dataset
predict(model, Boston)
## 1 2 3 4 5 6 7
## 30.4213971 25.3220127 30.9378788 29.0007664 28.3136725 25.5910930 23.4177645
## 8 9 10 11 12 13 14
## 19.8600309 11.8007974 19.2522173 19.3175764 21.9341754 21.3733506 19.9713263
## 15 16 17 18 19 20 21
## 19.6447192 19.7247705 21.0287699 17.2668383 16.6419097 18.7941419 12.8145402
## 22 23 24 25 26 27 28
## 18.0016671 16.1531737 14.0888828 15.9770628 13.6900098 15.7607755 15.0107143
## 29 30 31 32 33 34 35
## 19.8556750 21.2106131 11.7246428 18.3364835 9.1054171 14.5558713 13.9762818
## 36 37 38 39 40 41 42
## 24.1647275 22.7010178 23.5128590 23.3489552 31.8552690 34.7419294 28.4264028
## 43 44 45 46 47 48 49
## 25.5783150 24.9813730 23.2150411 22.3830353 20.7057509 18.1691248 9.1721403
## 50 51 52 53 54 55 56
## 17.3796893 21.5507482 24.2035644 28.0125574 24.3828602 15.8526357 31.6059808
## 57 58 59 60 61 62 63
## 25.2343702 33.5301356 22.1939726 21.4598617 18.1786155 18.7542586 24.3144538
## 64 65 66 67 68 69 70
## 22.9343416 23.5307430 30.8451833 25.9540369 21.4788612 17.7284963 21.1139438
## 71 72 73 74 75 76 77
## 25.5867504 22.0784682 24.9261981 24.4150982 25.9854780 24.3345998 23.2360217
## 78 79 80 81 82 83 84
## 23.7188115 21.5838538 22.7960969 28.7330272 27.2165387 26.3468749 25.3238055
## 85 86 87 88 89 90 91
## 24.9773238 27.9767435 22.3544738 26.0668128 30.8090297 31.0259444 27.2851150
## 92 93 94 95 96 97 98
## 27.5563055 29.1986076 29.3969536 27.2216110 28.7961440 24.8467036 35.9478334
## 99 100 101 102 103 104 105
## 35.3745965 32.4257726 24.8577652 25.8945310 20.0234767 20.5399137 21.6613488
## 106 107 108 109 110 111 112
## 18.7367697 17.3891258 20.9834562 22.8550425 19.9776419 20.9482741 26.8162563
## 113 114 115 116 117 118 119
## 21.0049048 20.9386114 25.4412391 20.6635600 23.6593791 23.9411319 20.5996368
## 120 121 122 123 124 125 126
## 21.0696566 21.9403464 22.4593238 20.5154386 16.2997503 20.5059446 22.4511999
## 127 128 129 130 131 132 133
## 14.5391764 15.3583330 19.1323297 14.2267327 20.2228498 19.5916877 20.2423441
## 134 135 136 137 138 139 140
## 15.9186338 13.3974835 17.4296953 16.0457108 19.5346063 13.9618837 16.6068265
## 141 142 143 144 145 146 147
## 13.7315683 4.0835715 14.7484549 12.2815728 8.8457219 12.1721188 15.9539903
## 148 149 150 151 152 153 154
## 8.6288047 9.8413333 14.9433808 20.9946136 18.4266891 20.2862241 17.4113351
## 155 156 157 158 159 160 161
## 22.5281360 20.2874363 13.7274820 33.3710249 29.1052432 25.7118459 32.8216511
## 162 163 164 165 166 167 168
## 36.9118870 40.6994787 42.0107478 24.8458952 25.4352715 37.3219932 23.1629716
## 169 170 171 172 173 174 175
## 26.4542271 26.7057604 22.5825696 24.3145292 22.9267424 29.0630777 26.5327726
## 176 177 178 179 180 181 182
## 30.8426367 25.6679672 29.1356698 31.4508559 32.8731002 34.6281092 27.6745162
## 183 184 185 186 187 188 189
## 33.7497414 30.8216926 22.5175049 24.6304689 35.8497696 33.6123720 32.6106922
## 190 191 192 193 194 195 196
## 34.7034414 30.9514253 30.4475975 33.1022602 32.1716589 31.5862502 40.9681136
## 197 198 199 200 201 202 203
## 36.1975118 32.7162544 34.7541890 30.3067892 30.8627316 29.3437360 37.2339678
## 204 205 206 207 208 209 210
## 42.1386202 43.3036319 22.6900051 23.6057361 17.6953851 23.4018367 16.7770827
## 211 212 213 214 215 216 217
## 22.2083492 16.8607583 22.6515912 25.1883983 11.0923572 24.4423556 26.5352080
## 218 219 220 221 222 223 224
## 28.2115396 24.7422737 29.5291180 33.1406193 23.6780584 32.1103351 29.6824002
## 225 226 227 228 229 230 231
## 38.3706887 39.8118741 37.5482578 32.3455056 35.5819043 31.3122192 24.4037882
## 232 233 234 235 236 237 238
## 33.2330858 38.0289521 37.1476624 31.6730250 25.1963690 30.0154411 32.6553381
## 239 240 241 242 243 244 245
## 28.4199611 28.3597433 27.1902933 23.5833775 23.9864171 27.3942320 16.1271991
## 246 247 248 249 250 251 252
## 13.2041989 19.9381865 19.6661077 21.1802307 24.0658346 24.2035245 25.0485418
## 253 254 255 256 257 258 259
## 24.9180668 29.9758080 23.8137167 21.5581326 37.4365474 43.1834117 36.2816017
## 260 261 262 263 264 265 266
## 34.7705344 34.6432081 36.9868293 40.8267578 34.2381243 35.6343948 28.0667667
## 267 268 269 270 271 272 273
## 31.0213983 40.7049947 39.1928354 25.4288898 22.0433608 27.0300453 28.2404088
## 274 275 276 277 278 279 280
## 35.2668753 35.9918616 33.6352794 35.4473079 34.7187802 30.1964344 35.1119646
## 281 282 283 284 285 286 287
## 38.5325430 34.0997634 40.1132750 44.5162382 31.4439635 27.1054228 19.8118385
## 288 289 290 291 292 293 294
## 26.8747774 27.0086626 26.7696527 33.3184314 34.2923600 31.7142462 25.5707274
## 295 296 297 298 299 300 301
## 24.1112609 28.1833931 27.0280339 19.1431196 28.9936512 31.8360616 30.5901850
## 302 303 304 305 306 307 308
## 28.7753710 28.7676762 32.7039838 33.0225376 30.5250497 35.3122299 32.4409582
## 309 310 311 312 313 314 315
## 28.2970206 23.2336540 18.2599021 26.5987027 22.8855503 25.1723172 25.0942280
## 316 317 318 319 320 321 322
## 20.1404231 17.1951987 17.9822942 23.9407224 20.9752702 24.5613208 24.5350755
## 323 324 325 326 327 328 329
## 22.5299248 19.0291790 24.8122027 24.4161265 23.3841792 18.9867119 20.9341528
## 330 331 332 333 334 335 336
## 24.0496354 21.3277529 19.5971179 22.9771045 21.7470455 21.1562541 20.2267958
## 337 338 339 340 341 342 343
## 19.7413676 18.8205653 21.7805566 20.8357482 20.9786329 29.8906293 21.6326994
## 344 345 346 347 348 349 350
## 27.4394616 28.3483229 16.0842434 14.3028812 25.0223965 27.2282308 21.7519441
## 351 352 353 354 355 356 357
## 20.0214208 20.2298277 16.5849496 24.9575189 14.0035094 16.2885138 19.8312318
## 358 359 360 361 362 363 364
## 22.9357796 22.4273631 19.4172888 22.8720711 19.1171207 18.3821986 20.4286409
## 365 366 367 368 369 370 371
## 37.7947890 14.3965585 15.6735748 10.8748779 23.9089225 32.8252766 34.7973862
## 372 373 374 375 376 377 378
## 25.0707950 26.1704791 6.1825229 0.7994699 25.4566451 17.8674501 20.3550913
## 379 380 381 382 383 384 385
## 15.9334458 16.9250487 14.4532418 18.5737954 13.4970016 13.1284655 3.3116058
## 386 387 388 389 390 391 392
## 8.1060891 6.1455984 5.6662902 6.4717757 14.2599673 17.2791215 17.4051843
## 393 394 395 396 397 398 399
## 9.9203114 20.3086380 18.0071116 20.3767602 19.3708126 16.3805388 6.5462544
## 400 401 402 403 404 405 406
## 10.9717939 11.9003936 17.8561787 18.3063776 12.9907747 7.3954034 8.1841032
## 407 408 409 410 411 412 413
## 8.0223287 19.9985952 13.6918056 19.8630157 15.1779485 16.9509242 1.6235558
## 414 415 416 417 418 419 420
## 11.7443051 -4.3860489 9.5557801 13.3806201 6.8634382 6.0686868 14.6659520
## 421 422 423 424 425 426 427
## 19.6054378 18.1294931 18.5066776 13.1540087 14.6247126 9.8729101 16.3776642
## 428 429 430 431 432 433 434
## 14.0712442 14.2673860 13.0017716 18.1284693 18.6497714 21.5306971 17.0199231
## 435 436 437 438 439 440 441
## 15.9184301 13.3285688 14.4873630 8.7453146 4.8106754 13.0156277 12.6533148
## 442 443 444 445 446 447 448
## 17.2540829 18.6923358 18.0097591 11.4391117 11.9091549 17.6313826 18.0693223
## 449 450 451 452 453 454 455
## 17.4419398 17.1532195 16.4578791 19.3439991 18.5190681 22.4327564 15.1938128
## 456 457 458 459 460 461 462
## 15.7536724 12.5921290 12.7939656 17.1169188 18.4415471 18.9734228 20.0990734
## 463 464 465 466 467 468 469
## 19.7069995 22.3519805 20.2732171 17.8673461 14.2477401 16.7912868 16.8750355
## 470 471 472 473 474 475 476
## 18.5176506 20.0360997 22.8208224 22.3568697 25.5284901 16.2027533 15.9310144
## 477 478 479 480 481 482 483
## 20.3893507 11.3353814 19.0360604 21.7234998 23.3544766 26.9759487 28.4478220
## 484 485 486 487 488 489 490
## 21.0027705 19.3836943 22.1339240 19.4861514 21.2293617 11.3131631 7.6531788
## 491 492 493 494 495 496 497
## 3.0716459 13.2072754 15.4198389 19.9828024 20.0002371 16.2862261 13.2822879
## 498 499 500 501 502 503 504
## 18.4013892 20.6192034 17.7358813 19.7435816 22.6801964 21.4906326 26.7343473
## 505 506
## 25.2261743 21.4359702
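As a quick sanity check (not part of the original exercise), you can compare the in-sample RMSE of these predictions with the cross-validated RMSE of about 4.83 reported above; the in-sample number is usually a little optimistic:
# In-sample RMSE on the full Boston data (pred_boston is a hypothetical
# helper name used only for this check)
pred_boston <- predict(model, Boston)
sqrt(mean((pred_boston - Boston$medv)^2))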
As you saw in the video, you’ll be working with the Sonar dataset in this chapter, using a 60% training set and a 40% test set. We’ll practice making a train/test split one more time, just to be sure you have the hang of it. Recall that you can use the sample() function to get a random permutation of the row indices in a dataset, to use when making train/test splits, e.g.:
n_obs <- nrow(my_data)
permuted_rows <- sample(n_obs)
And then use those row indices to randomly reorder the dataset, e.g.:
my_data <- my_data[permuted_rows, ]
Once your dataset is randomly ordered, you can split off the first 60% as a training set and the last 40% as a test set.
Get the number of observations (rows) in Sonar, assigning to n_obs. Shuffle the row indices of Sonar and store the result in permuted_rows. Use permuted_rows to randomly reorder the rows of Sonar, saving as Sonar_shuffled. Identify the proper row to split on for a 60/40 split. Store this row number as split. Save the first 60% of Sonar_shuffled as a training set. Save the last 40% of Sonar_shuffled as the test set.
Sonar <- read.delim("DATABASE/Sonar.txt", sep = ",")
head(Sonar)
## X V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 1 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604
## 2 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947
## 3 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045
## 4 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340
## 5 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459
## 6 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778
## V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43
## 1 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825
## 2 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401
## 3 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129
## 4 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210
## 5 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847
## 6 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723
## V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54
## 1 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159
## 2 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048
## 3 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095
## 4 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150
## 5 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105
## 6 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038
## V55 V56 V57 V58 V59 V60 Class
## 1 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032 R
## 2 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044 R
## 3 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078 R
## 4 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117 R
## 5 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094 R
## 6 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062 R
# Get the number of observations
n_obs <- nrow(Sonar)
# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)
# Randomly order data: Sonar
Sonar_shuffled <- Sonar[permuted_rows, ]
# Identify row to split on: split
split <- round(n_obs * 0.6)
# Create train
train <- Sonar_shuffled[1:split, ]
# Create test
test <- Sonar_shuffled[(split + 1):n_obs, ]
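As a small check (not required by the exercise), you can confirm the split proportions:
# Confirm the 60/40 split
nrow(train) / n_obs
nrow(test) / n_obs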
Once you have your random training and test sets you can fit a logistic regression model to your training set using the glm() function. glm() is a more advanced version of lm() that allows for more varied types of regression models, aside from plain vanilla ordinary least squares regression.
Be sure to pass the argument family = "binomial" to glm() to specify that you want to do logistic (rather than linear) regression. For example:
glm(Target ~ ., family = "binomial", dataset)
Don't worry about warnings like glm.fit: algorithm did not converge or glm.fit: fitted probabilities numerically 0 or 1 occurred. These are common on smaller datasets and usually don't cause any issues. They typically mean your dataset is perfectly separable, which can cause problems for the math behind the model, but R's glm() function is almost always robust enough to handle this case with no problems.
Once you have a glm() model fit to your dataset, you can predict the outcome (e.g. rock or mine) on the test set using the predict() function with the argument type = "response":
predict(my_model, test, type = "response")
Fit a logistic regression called model to predict Class using all other variables as predictors. Use the training set, train, from the Sonar data. Predict on the test set using that model. Call the result p, like you've done before.
# Fit glm model: model
# (read.delim left Class as a character column, so convert it to a factor
# first; otherwise glm() with family = "binomial" errors out)
train$Class <- factor(train$Class)
test$Class <- factor(test$Class, levels = levels(train$Class))
model <- glm(Class ~ ., family = "binomial", train)
# Predict on test: p
p <- predict(model, test, type = "response")
As you saw in the video, a confusion matrix is a very useful tool for calibrating the output of a model and examining all possible outcomes of your predictions (true positive, true negative, false positive, false negative).
Before you make your confusion matrix, you need to “cut” your predicted probabilities at a given threshold to turn probabilities into a factor of class predictions. Combine ifelse() with factor() as follows:
pos_or_neg <- ifelse(probability_prediction > threshold, positive_class, negative_class)
p_class <- factor(pos_or_neg, levels = levels(test_values))
confusionMatrix() in caret improves on table() from base R by adding lots of useful ancillary statistics in addition to the base rates in the table. You can calculate the confusion matrix (and the associated statistics) using the predicted outcomes as well as the actual outcomes, e.g.:
confusionMatrix(p_class, test_values)
Use ifelse() to create a character vector, m_or_r, that is the positive class, "M", when p is greater than 0.5, and the negative class, "R", otherwise. Convert m_or_r to be a factor, p_class, with levels the same as those of test[["Class"]]. Make a confusion matrix with confusionMatrix(), passing p_class and the "Class" column from the test dataset.
# If p exceeds threshold of 0.5, M else R: m_or_r
m_or_r <- ifelse(p > 0.5, "M", "R")
# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))
# Create confusion matrix
#confusionMatrix(p_class, test[["Class"]])
In the previous exercises, you used a threshold of 0.50 to cut your predicted probabilities to make class predictions (rock vs mine). However, this classification threshold does not always align with the goals for a given modeling problem.
For example, pretend you want to identify the objects you are really certain are mines. In this case, you might want to use a probability threshold of 0.90 to get fewer predicted mines, but with greater confidence in each prediction.
The code pattern for cutting probabilities into predicted classes, then calculating a confusion matrix, was shown in Exercise 7 of this chapter.
Use ifelse() to create a character vector, m_or_r, that is the positive class, "M", when p is greater than 0.9, and the negative class, "R", otherwise. Convert m_or_r to be a factor, p_class, with levels the same as those of test[["Class"]]. Make a confusion matrix with confusionMatrix(), passing p_class and the "Class" column from the test dataset.
# If p exceeds threshold of 0.9, M else R: m_or_r
m_or_r <- ifelse(p > 0.9, "M", "R")
# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))
# Create confusion matrix
#confusionMatrix(p_class, test[["Class"]])
Conversely, say you want to be really certain that your model correctly identifies all the mines as mines. In this case, you might use a prediction threshold of 0.10, instead of 0.90.
The code pattern for cutting probabilities into predicted classes, then calculating a confusion matrix, was shown in Exercise 7 of this chapter.
Use ifelse() to create a character vector, m_or_r, that is the positive class, "M", when p is greater than 0.1, and the negative class, "R", otherwise. Convert m_or_r to be a factor, p_class, with levels the same as those of test[["Class"]]. Make a confusion matrix with confusionMatrix(), passing p_class and the "Class" column from the test dataset.
# If p exceeds threshold of 0.1, M else R: m_or_r
m_or_r <- ifelse(p > 0.1, "M", "R")
# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))
# Create confusion matrix
#confusionMatrix(p_class, test[["Class"]])
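To see this trade-off across several cutoffs at once, here is a small sketch (assuming p and test from above) that prints the predicted-versus-actual table at each threshold; base table() keeps it simple:
# Sketch: predicted vs. actual counts at several probability cutoffs
for (th in c(0.1, 0.5, 0.9)) {
  cat("Threshold:", th, "\n")
  print(table(predicted = ifelse(p > th, "M", "R"),
              actual = test[["Class"]]))
}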
As you saw in the video, an ROC curve is a really useful shortcut for summarizing the performance of a classifier over all possible thresholds. This saves you a lot of tedious work computing class predictions for many different thresholds and examining the confusion matrix for each.
My favorite package for computing ROC curves is caTools, which contains a function called colAUC(). This function is very user-friendly and can actually calculate ROC curves for multiple predictors at once. In this case, you only need to calculate the ROC curve for one predictor, e.g.:
colAUC(predicted_probabilities, actual, plotROC = TRUE)
The function will return a score called AUC (more on that later) and the plotROC = TRUE argument will return the plot of the ROC curve for visual inspection.
model, test, and train from the last exercise using the Sonar data are loaded in your workspace.
Predict probabilities (i.e. type = "response") on the test set, then store the result as p. Make an ROC curve using the predicted test set probabilities.
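# colAUC() lives in the caTools package, so load it before running these lines
#library(caTools)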
# Predict on test: p
#p <- predict(model, test, type = "response")
# Make ROC curve
#colAUC(p, test[["Class"]], plotROC = TRUE)
As you saw in the video, area under the ROC curve is a very useful, single-number summary of a model’s ability to discriminate the positive from the negative class (e.g. mines from rocks). An AUC of 0.5 is no better than random guessing, an AUC of 1.0 is a perfectly predictive model, and an AUC of 0.0 is perfectly anti-predictive (which rarely happens).
This is often a much more useful metric than simply ranking models by their accuracy at a set threshold, as different models might require different calibration steps (looking at a confusion matrix at each step) to find the optimal classification threshold for that model.
You can use the trainControl() function in caret to use AUC (instead of accuracy) to tune the parameters of your models. The twoClassSummary() convenience function allows you to do this easily.
When using twoClassSummary(), be sure to always include the argument classProbs = TRUE or your model will throw an error! (You cannot calculate AUC with just class predictions. You need to have class probabilities as well.)
Customize the trainControl object to use twoClassSummary rather than defaultSummary. Use 10-fold cross-validation. Be sure to tell trainControl() to return class probabilities.
# Create trainControl object: myControl
myControl <- trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE
)
Now that you have a custom trainControl object, it’s easy to fit caret models that use AUC rather than accuracy to tune and evaluate the model. You can just pass your custom trainControl object to the train() function via the trControl argument, e.g.:
train(
  Class ~ .,
  Sonar,
  method = "glm",
  trControl = myControl
)
This syntax gives you a convenient way to store a lot of custom modeling parameters and then use them across multiple different calls to train(). You will make extensive use of this trick in Chapter 5. Use train() to predict Class from all other variables in the Sonar data (that is, Class ~ .). It should be a glm model (that is, set method to "glm") using your custom trainControl object, myControl. Save the result to model. Print the model to the console and examine its output.
# Train glm with custom trainControl: model
model <- train(
Class ~ .,
Sonar,
method = "glm",
trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold01: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold01: parameter=none
## + Fold02: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold02: parameter=none
## + Fold03: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold03: parameter=none
## + Fold04: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold04: parameter=none
## + Fold05: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold05: parameter=none
## + Fold06: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold06: parameter=none
## + Fold07: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold07: parameter=none
## + Fold08: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold08: parameter=none
## + Fold09: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold09: parameter=none
## + Fold10: parameter=none
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## - Fold10: parameter=none
## Aggregating results
## Fitting final model on full training set
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Print model to console
model
## Generalized Linear Model
##
## 208 samples
## 61 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 188, 188, 187, 188, 187, 187, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9926263 0.9636364 0.9688889
The ROC is almost 1, which is very good.
As you saw in the video, random forest models are much more flexible than linear models, and can model complicated nonlinear effects as well as automatically capture interactions between variables. They tend to give very good results on real world data, so let’s try one out on the wine quality dataset, where the goal is to predict the human-evaluated quality of a batch of wine, given some of the machine-measured chemical and physical properties of that batch.
Fitting a random forest model is exactly the same as fitting a generalized linear regression model, as you did in the previous chapter. You simply change the method argument in the train function to be “ranger”. The ranger package is a rewrite of R’s classic randomForest package and fits models much faster, but gives almost exactly the same results. We suggest that all beginners use the ranger package for random forest modeling.
Train a random forest called model on the wine quality dataset, wine, such that quality is the response variable and all other variables are explanatory variables. Use method = "ranger". Use a tuneLength of 1. Use 5 CV folds. Print model to the console.
wine <- read.delim("DATABASE/wine.txt", sep = ",")
head(wine)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 5944 6.7 0.27 0.69 1.2 0.176
## 2 6088 6.7 0.48 0.49 2.9 0.030
## 3 1859 5.8 0.36 0.38 0.9 0.037
## 4 5393 6.3 0.32 0.26 12.0 0.049
## 5 4167 6.6 0.24 0.28 1.8 0.028
## 6 3370 7.8 0.39 0.26 9.9 0.059
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 36 106 0.99288 2.96 0.43 9.2
## 2 28 122 0.98926 3.13 0.40 13.0
## 3 3 75 0.99040 3.28 0.34 11.4
## 4 63 170 0.99610 3.14 0.55 9.9
## 5 39 132 0.99182 3.34 0.46 11.4
## 6 33 181 0.99550 3.04 0.42 10.9
## quality color
## 1 6 white
## 2 6 white
## 3 4 white
## 4 6 white
## 5 5 white
## 6 6 white
# Fit random forest: model
model <- train(
quality ~ .,
tuneLength = 1,
data = wine,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
)
## + Fold1: mtry=3, min.node.size=5, splitrule=variance
## - Fold1: mtry=3, min.node.size=5, splitrule=variance
## + Fold1: mtry=3, min.node.size=5, splitrule=extratrees
## - Fold1: mtry=3, min.node.size=5, splitrule=extratrees
## + Fold2: mtry=3, min.node.size=5, splitrule=variance
## - Fold2: mtry=3, min.node.size=5, splitrule=variance
## + Fold2: mtry=3, min.node.size=5, splitrule=extratrees
## - Fold2: mtry=3, min.node.size=5, splitrule=extratrees
## + Fold3: mtry=3, min.node.size=5, splitrule=variance
## - Fold3: mtry=3, min.node.size=5, splitrule=variance
## + Fold3: mtry=3, min.node.size=5, splitrule=extratrees
## - Fold3: mtry=3, min.node.size=5, splitrule=extratrees
## + Fold4: mtry=3, min.node.size=5, splitrule=variance
## - Fold4: mtry=3, min.node.size=5, splitrule=variance
## + Fold4: mtry=3, min.node.size=5, splitrule=extratrees
## - Fold4: mtry=3, min.node.size=5, splitrule=extratrees
## + Fold5: mtry=3, min.node.size=5, splitrule=variance
## - Fold5: mtry=3, min.node.size=5, splitrule=variance
## + Fold5: mtry=3, min.node.size=5, splitrule=extratrees
## - Fold5: mtry=3, min.node.size=5, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 3, splitrule = variance, min.node.size = 5 on full training set
# Print model to console
print(model)
## Random Forest
##
## 100 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 81, 80, 80, 80, 79
## Resampling results across tuning parameters:
##
## splitrule RMSE Rsquared MAE
## variance 0.6587570 0.3188962 0.4888864
## extratrees 0.6865412 0.2675681 0.5076654
##
## Tuning parameter 'mtry' was held constant at a value of 3
## Tuning
## parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 3, splitrule = variance
## and min.node.size = 5.
Recall from the video that random forest models have a primary tuning parameter of mtry, which controls how many variables are exposed to the splitting search routine at each split. For example, suppose that a tree has a total of 10 splits and mtry = 2. This means that there are 10 samples of 2 predictors each time a split is evaluated.
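As a side note (not part of the exercise), you can inspect which candidate values caret tried and which combination it selected directly on the fitted train object:
# Peek at the tuning candidates and the selected combination from the fit above
model$results[, c("mtry", "splitrule", "RMSE")]
model$bestTune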
Use a larger tuning grid this time, but stick to the defaults provided by the train() function. Try a tuneLength of 3, rather than 1, to explore some more potential models, and plot the resulting model using the plot function.
Train a random forest model, model, using the wine dataset on the quality variable with all other variables as explanatory variables. (This will take a few seconds to run, so be patient!) Use method = "ranger". Change the tuneLength to 3. Use 5 CV folds. Print model to the console. Plot the model after fitting it.
# Fit random forest: model
model <- train(
quality ~ .,
tuneLength = 3,
data = wine,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
)
## + Fold1: mtry= 2, min.node.size=5, splitrule=variance
## - Fold1: mtry= 2, min.node.size=5, splitrule=variance
## + Fold1: mtry= 7, min.node.size=5, splitrule=variance
## - Fold1: mtry= 7, min.node.size=5, splitrule=variance
## + Fold1: mtry=13, min.node.size=5, splitrule=variance
## - Fold1: mtry=13, min.node.size=5, splitrule=variance
## + Fold1: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold1: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold1: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold1: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold1: mtry=13, min.node.size=5, splitrule=extratrees
## - Fold1: mtry=13, min.node.size=5, splitrule=extratrees
## + Fold2: mtry= 2, min.node.size=5, splitrule=variance
## - Fold2: mtry= 2, min.node.size=5, splitrule=variance
## + Fold2: mtry= 7, min.node.size=5, splitrule=variance
## - Fold2: mtry= 7, min.node.size=5, splitrule=variance
## + Fold2: mtry=13, min.node.size=5, splitrule=variance
## - Fold2: mtry=13, min.node.size=5, splitrule=variance
## + Fold2: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold2: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold2: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold2: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold2: mtry=13, min.node.size=5, splitrule=extratrees
## - Fold2: mtry=13, min.node.size=5, splitrule=extratrees
## + Fold3: mtry= 2, min.node.size=5, splitrule=variance
## - Fold3: mtry= 2, min.node.size=5, splitrule=variance
## + Fold3: mtry= 7, min.node.size=5, splitrule=variance
## - Fold3: mtry= 7, min.node.size=5, splitrule=variance
## + Fold3: mtry=13, min.node.size=5, splitrule=variance
## - Fold3: mtry=13, min.node.size=5, splitrule=variance
## + Fold3: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold3: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold3: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold3: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold3: mtry=13, min.node.size=5, splitrule=extratrees
## - Fold3: mtry=13, min.node.size=5, splitrule=extratrees
## + Fold4: mtry= 2, min.node.size=5, splitrule=variance
## - Fold4: mtry= 2, min.node.size=5, splitrule=variance
## + Fold4: mtry= 7, min.node.size=5, splitrule=variance
## - Fold4: mtry= 7, min.node.size=5, splitrule=variance
## + Fold4: mtry=13, min.node.size=5, splitrule=variance
## - Fold4: mtry=13, min.node.size=5, splitrule=variance
## + Fold4: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold4: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold4: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold4: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold4: mtry=13, min.node.size=5, splitrule=extratrees
## - Fold4: mtry=13, min.node.size=5, splitrule=extratrees
## + Fold5: mtry= 2, min.node.size=5, splitrule=variance
## - Fold5: mtry= 2, min.node.size=5, splitrule=variance
## + Fold5: mtry= 7, min.node.size=5, splitrule=variance
## - Fold5: mtry= 7, min.node.size=5, splitrule=variance
## + Fold5: mtry=13, min.node.size=5, splitrule=variance
## - Fold5: mtry=13, min.node.size=5, splitrule=variance
## + Fold5: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold5: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold5: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold5: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold5: mtry=13, min.node.size=5, splitrule=extratrees
## - Fold5: mtry=13, min.node.size=5, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 13, splitrule = variance, min.node.size = 5 on full training set
# Print model to console
print(model)
## Random Forest
##
## 100 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 80, 79, 80, 80, 81
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 0.6467175 0.3281169 0.4885136
## 2 extratrees 0.6885613 0.2353371 0.5082042
## 7 variance 0.6355183 0.3238600 0.4904450
## 7 extratrees 0.6811706 0.2270125 0.5148996
## 13 variance 0.6343619 0.3261168 0.4914816
## 13 extratrees 0.6845100 0.2212215 0.5214518
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 13, splitrule = variance
## and min.node.size = 5.
# Plot model
plot(model)
Now that you’ve explored the default tuning grids provided by the train() function, let’s customize your models a bit more.
You can provide any number of values for mtry, from 2 up to the number of columns in the dataset. In practice, there are diminishing returns for much larger values of mtry, so you will use a custom tuning grid that explores 2 simple models (mtry = 2 and mtry = 3) as well as one more complicated model (mtry = 7).
Define a custom tuning grid. Set the number of variables to possibly split at each node, .mtry, to a vector of 2, 3, and 7. Set the rule to split on, .splitrule, to "variance". Set the minimum node size, .min.node.size, to 5.
# Define the tuning grid: tuneGrid
tuneGrid <- data.frame(
.mtry = c(2, 3, 7),
.splitrule = "variance",
.min.node.size = 5
)
Train another random forest model, model, using the wine dataset on the quality variable with all other variables as explanatory variables. Use method = “ranger”. Use the custom tuneGrid. Use 5 CV folds. Print model to the console. Plot the model after fitting it using plot().
# Fit random forest: model
model <- train(
quality ~ .,
tuneGrid = tuneGrid,
data = wine,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
)
## + Fold1: mtry=2, splitrule=variance, min.node.size=5
## - Fold1: mtry=2, splitrule=variance, min.node.size=5
## + Fold1: mtry=3, splitrule=variance, min.node.size=5
## - Fold1: mtry=3, splitrule=variance, min.node.size=5
## + Fold1: mtry=7, splitrule=variance, min.node.size=5
## - Fold1: mtry=7, splitrule=variance, min.node.size=5
## + Fold2: mtry=2, splitrule=variance, min.node.size=5
## - Fold2: mtry=2, splitrule=variance, min.node.size=5
## + Fold2: mtry=3, splitrule=variance, min.node.size=5
## - Fold2: mtry=3, splitrule=variance, min.node.size=5
## + Fold2: mtry=7, splitrule=variance, min.node.size=5
## - Fold2: mtry=7, splitrule=variance, min.node.size=5
## + Fold3: mtry=2, splitrule=variance, min.node.size=5
## - Fold3: mtry=2, splitrule=variance, min.node.size=5
## + Fold3: mtry=3, splitrule=variance, min.node.size=5
## - Fold3: mtry=3, splitrule=variance, min.node.size=5
## + Fold3: mtry=7, splitrule=variance, min.node.size=5
## - Fold3: mtry=7, splitrule=variance, min.node.size=5
## + Fold4: mtry=2, splitrule=variance, min.node.size=5
## - Fold4: mtry=2, splitrule=variance, min.node.size=5
## + Fold4: mtry=3, splitrule=variance, min.node.size=5
## - Fold4: mtry=3, splitrule=variance, min.node.size=5
## + Fold4: mtry=7, splitrule=variance, min.node.size=5
## - Fold4: mtry=7, splitrule=variance, min.node.size=5
## + Fold5: mtry=2, splitrule=variance, min.node.size=5
## - Fold5: mtry=2, splitrule=variance, min.node.size=5
## + Fold5: mtry=3, splitrule=variance, min.node.size=5
## - Fold5: mtry=3, splitrule=variance, min.node.size=5
## + Fold5: mtry=7, splitrule=variance, min.node.size=5
## - Fold5: mtry=7, splitrule=variance, min.node.size=5
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7, splitrule = variance, min.node.size = 5 on full training set
# Print model to console
print(model)
## Random Forest
##
## 100 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 80, 80, 80, 80, 80
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6559787 0.3284769 0.4936593
## 3 0.6519735 0.3292476 0.4946083
## 7 0.6498748 0.3164634 0.4926127
##
## Tuning parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 7, splitrule = variance
## and min.node.size = 5.
# Plot model
plot(model)
The wine quality dataset was a regression problem, but now you are looking at a classification problem. This is a simulated dataset based on the "don't overfit" competition on Kaggle a number of years ago.
Classification problems are a little more complicated than regression problems because you have to provide a custom summaryFunction to the train() function to use the AUC metric to rank your models. Start by making a custom trainControl, as you did in the previous chapter. Be sure to set classProbs = TRUE, otherwise the twoClassSummary for summaryFunction will break.
Make a custom trainControl called myControl for classification using the trainControl function.
Use 10 CV folds. Use twoClassSummary for the summaryFunction. Be sure to set classProbs = TRUE.
# Create custom trainControl: myControl
myControl <- trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE
)
Now that you have a custom trainControl object, fit a glmnet model to the "don't overfit" dataset. Recall from the video that glmnet is an extension of the generalized linear regression model (or glm) that places constraints on the magnitude of the coefficients to prevent overfitting. This is more commonly known as "penalized" regression modeling and is a very useful technique on datasets with many predictors and few observations.
glmnet is capable of fitting two different kinds of penalized models, controlled by the alpha parameter:
Ridge regression (or alpha = 0)
Lasso regression (or alpha = 1)
You'll now fit a glmnet model to the "don't overfit" dataset using the defaults provided by the caret package.
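(For reference, the penalty glmnet adds to the loss is lambda * ((1 - alpha) / 2 * ||beta||_2^2 + alpha * ||beta||_1), so alpha = 0 keeps only the ridge (L2) term and alpha = 1 keeps only the lasso (L1) term.)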
Train a glmnet model called model on the overfit data. Use the custom trainControl from the previous exercise (myControl). The variable y is the response variable and all other variables are explanatory variables. Print the model to the console. Use the max() function to find the maximum of the ROC statistic contained somewhere in model[["results"]].
overfit <- read.csv("DATABASE/overfit.csv")
head(overfit)
## y X1 X2 X3 X4 X5 X6
## 1 class2 0.9148060 0.33423133 0.1365052 0.24492099 0.84829322 0.73592037
## 2 class2 0.9370754 0.18843433 0.1771364 0.08763591 0.06274633 0.75178575
## 3 class2 0.2861395 0.26971618 0.5195605 0.39110850 0.81984509 0.33261448
## 4 class1 0.8304476 0.53074408 0.8111208 0.18256143 0.53936029 0.05754862
## 5 class2 0.6417455 0.02145023 0.1153620 0.13362478 0.49902010 0.67441545
## 6 class2 0.5190959 0.79876031 0.8934218 0.25746291 0.02222732 0.04157017
## X7 X8 X9 X10 X11 X12 X13
## 1 0.05391100 0.1651787 0.9899656 0.24640458 0.06038098 0.4088231 0.2737925
## 2 0.95509577 0.7277811 0.4384936 0.02302811 0.93300437 0.3158619 0.9441967
## 3 0.02560094 0.2061579 0.6999032 0.28421418 0.34894162 0.4947276 0.4459833
## 4 0.92076314 0.5864655 0.8890770 0.81289268 0.41179789 0.6489300 0.5417872
## 5 0.36666474 0.9135460 0.8341595 0.71891839 0.96114648 0.6498625 0.1617545
## 6 0.69839255 0.2069666 0.7344215 0.52957674 0.53107291 0.5023026 0.6930535
## X14 X15 X16 X17 X18 X19 X20
## 1 0.3133691 0.19940178 0.8555082 0.59892983 0.3641020 0.29991068 0.5124174
## 2 0.6887178 0.56649540 0.2165558 0.96894761 0.4326160 0.07423252 0.4984634
## 3 0.5323920 0.16805282 0.3174699 0.39053533 0.7288679 0.92831706 0.1367773
## 4 0.7599945 0.94363115 0.5413980 0.85281451 0.2627979 0.40114898 0.4507146
## 5 0.6498432 0.50440809 0.4303394 0.04228755 0.7689716 0.66537518 0.3133368
## 6 0.4968844 0.04706947 0.3649031 0.42472235 0.5102272 0.02496750 0.7338139
## X21 X22 X23 X24 X25 X26 X27
## 1 0.7314937 0.88671723 0.8427694 0.2265345 0.24646324 0.03078282 0.2439920
## 2 0.5609527 0.78957374 0.7690579 0.2930370 0.53030519 0.91007049 0.8861334
## 3 0.4981884 0.54639519 0.9188407 0.1260773 0.21397210 0.14263568 0.3657405
## 4 0.8410595 0.06316993 0.3557249 0.7101987 0.02584904 0.43959793 0.2502275
## 5 0.4636445 0.57615483 0.9376524 0.8479341 0.34200247 0.44986382 0.7712136
## 6 0.1989107 0.21231243 0.5919335 0.9862026 0.39413527 0.65888420 0.4224395
## X28 X29 X30 X31 X32 X33 X34
## 1 0.44031431 0.61425763 0.60700663 0.776540211 0.2101436 0.4436156 0.2398912
## 2 0.44528186 0.87616254 0.03163189 0.023826756 0.7116446 0.3283460 0.8914540
## 3 0.08088818 0.09615188 0.93982109 0.431787197 0.4351647 0.2078517 0.1483158
## 4 0.15290506 0.17540170 0.74308481 0.366538520 0.9759542 0.1919214 0.4238083
## 5 0.82051754 0.45457902 0.14370810 0.395403284 0.7004988 0.3723860 0.7290697
## 6 0.34739227 0.73957960 0.38408691 0.007261852 0.5107656 0.7262031 0.7877473
## X35 X36 X37 X38 X39 X40 X41
## 1 0.6363019 0.5722824 0.4197569 0.9500444 0.3587282 0.49694824 0.5283896
## 2 0.6835961 0.4618257 0.8261398 0.1127617 0.4972427 0.99363492 0.6463879
## 3 0.4627288 0.1354304 0.1123009 0.8579319 0.2506536 0.68541077 0.8340490
## 4 0.5929640 0.5824622 0.2039221 0.2698933 0.7749550 0.71740832 0.3457626
## 5 0.1989480 0.1145797 0.4922106 0.2721556 0.1394774 0.51023013 0.6217329
## 6 0.7350154 0.9524530 0.7657882 0.7173460 0.3119602 0.09114151 0.6631953
## X42 X43 X44 X45 X46 X47 X48
## 1 0.53637297 0.003498503 0.03108833 0.7895760 0.90801477 0.5743560 0.04658951
## 2 0.02224803 0.470631273 0.37119592 0.6076812 0.22839369 0.3472568 0.81377534
## 3 0.97147909 0.447381378 0.28906588 0.9859969 0.03609499 0.5428427 0.06971775
## 4 0.60075658 0.418609717 0.14279648 0.6505679 0.75519239 0.2701957 0.45726551
## 5 0.41735487 0.865590246 0.64557472 0.3890771 0.14964131 0.7032647 0.93253494
## 6 0.35575149 0.089314849 0.83856468 0.7514495 0.15831393 0.2215922 0.56264178
## X49 X50 X51 X52 X53 X54 X55
## 1 0.5686084 0.70674255 0.35659105 0.6176513 0.8937959 0.09195006 0.40665568
## 2 0.2178806 0.93513637 0.20800169 0.3523196 0.4806338 0.04368308 0.90273489
## 3 0.1015155 0.32386434 0.10062093 0.6447219 0.1601698 0.03980983 0.18040390
## 4 0.6981023 0.42993684 0.13811304 0.2108743 0.9060602 0.67766094 0.05992027
## 5 0.1927258 0.26389910 0.02372436 0.9677561 0.7480467 0.32464831 0.37985999
## 6 0.8629308 0.06350469 0.38926066 0.9384980 0.3468387 0.84224017 0.50614379
## X56 X57 X58 X59 X60 X61 X62
## 1 0.82231507 0.92166222 0.70143425 0.57086475 0.22403629 0.61544117 0.3942098
## 2 0.46761749 0.02544768 0.20914840 0.31707423 0.71769261 0.81500369 0.9481347
## 3 0.88325679 0.71131622 0.30949109 0.00622546 0.23348081 0.34300935 0.7528206
## 4 0.23977046 0.08098793 0.75071442 0.32235315 0.11093189 0.38827650 0.8657734
## 5 0.08607674 0.83676312 0.83867712 0.76597286 0.01509131 0.01461812 0.3746096
## 6 0.01749129 0.13737931 0.09458564 0.07863530 0.41161897 0.15780173 0.9028142
## X63 X64 X65 X66 X67 X68 X69
## 1 0.8888228 0.6099593 0.47709234 0.55782963 0.6912194 0.79023679 0.5307927
## 2 0.5172033 0.8014616 0.19922838 0.58179578 0.3595974 0.29431606 0.6447462
## 3 0.3488447 0.5248672 0.40166670 0.35338641 0.5171902 0.32632441 0.1399816
## 4 0.8560230 0.6720612 0.57517642 0.91979352 0.4984536 0.02167086 0.3166792
## 5 0.1388359 0.2903211 0.06373411 0.14328077 0.1268600 0.47265368 0.8713632
## 6 0.4936084 0.7054787 0.47593348 0.03223118 0.1145575 0.51155653 0.6391481
## X70 X71 X72 X73 X74 X75 X76
## 1 0.6121108 0.9927446 0.66891022 0.1782290782 0.65168895 0.13139016 0.6016485
## 2 0.2246952 0.6600863 0.76956286 0.5898728392 0.09957865 0.20797646 0.5147675
## 3 0.4450516 0.9908713 0.04286123 0.3100116646 0.61739505 0.65724972 0.9524858
## 4 0.4417733 0.8435938 0.09322869 0.9048300500 0.37862555 0.05183997 0.6899449
## 5 0.8194539 0.9318627 0.37029750 0.0009339715 0.02331053 0.89112706 0.6909429
## 6 0.2539138 0.4160950 0.47901167 0.8450782229 0.84258516 0.97169850 0.3970206
## X77 X78 X79 X80 X81 X82 X83
## 1 0.80038216 0.03375497 0.6148970 0.5501955 0.8776074 0.8771290 0.4591744
## 2 0.37053159 0.19464773 0.0756750 0.9074589 0.4554355 0.5586081 0.9691399
## 3 0.43891339 0.64983446 0.5525176 0.7823757 0.4245626 0.6889456 0.7962230
## 4 0.08944712 0.21965484 0.4636439 0.6054795 0.9387996 0.6737048 0.1111056
## 5 0.16717954 0.85817933 0.9300710 0.9142197 0.3861163 0.1886578 0.4308363
## 6 0.99779615 0.88751072 0.5320078 0.3934037 0.4441870 0.9797902 0.1132264
## X84 X85 X86 X87 X88 X89 X90
## 1 0.02265618 0.82100470 0.9451794 0.40031711 0.5063815 0.38419708 0.5978431
## 2 0.51662527 0.09394557 0.6691676 0.36259803 0.4020096 0.38109557 0.3437839
## 3 0.17372073 0.17385249 0.3524983 0.93234951 0.1412987 0.67869493 0.3873761
## 4 0.84811563 0.08457514 0.2725157 0.21494830 0.6899701 0.75592458 0.8271274
## 5 0.48302059 0.99570426 0.7055095 0.18531663 0.1049531 0.06213957 0.7161128
## 6 0.62567603 0.25286650 0.1484418 0.01903933 0.6162294 0.89707813 0.1780118
## X91 X92 X93 X94 X95 X96 X97
## 1 0.3178428 0.63577017 0.16898034 0.4363109 0.74558734 0.91218140 0.521450796
## 2 0.9580893 0.18256200 0.16977626 0.5074746 0.88742323 0.74437113 0.002551967
## 3 0.2691170 0.61096669 0.05505954 0.7045412 0.05273777 0.24672669 0.774404447
## 4 0.6020122 0.09389158 0.65476919 0.6609233 0.12751347 0.74064849 0.037558054
## 5 0.5659497 0.50872714 0.88375557 0.2501756 0.43159573 0.89100560 0.598648517
## 6 0.9331284 0.15435468 0.21043950 0.9714310 0.02401225 0.08361286 0.029283952
## X98 X99 X100 X101 X102 X103 X104
## 1 0.09177065 0.7341170 0.3207706 0.3410881 0.6124809 0.0006371774 0.5474337
## 2 0.72371277 0.2045022 0.5937234 0.0769237 0.1278179 0.9994430819 0.4792716
## 3 0.47358313 0.7541783 0.8585231 0.1764347 0.2325618 0.3747437985 0.1774841
## 4 0.06946549 0.3030019 0.7570058 0.4399086 0.2011980 0.0339846939 0.6359555
## 5 0.41871902 0.1878361 0.5910866 0.9375751 0.3795888 0.3704907512 0.5783285
## 6 0.62829984 0.4720801 0.3647425 0.6538159 0.9011956 0.0729143203 0.4993629
## X105 X106 X107 X108 X109 X110 X111
## 1 0.03589639 0.64268733 0.23045683 0.5939719 0.53344042 0.9527778 0.9500149
## 2 0.96449950 0.34132692 0.25149597 0.5282314 0.03137963 0.6503522 0.8069633
## 3 0.45762041 0.03583915 0.04275977 0.8099685 0.04959282 0.2623343 0.7691251
## 4 0.68747009 0.19993296 0.98033429 0.5882637 0.23380904 0.9604940 0.7486602
## 5 0.96663348 0.65268200 0.48489585 0.2208072 0.89631450 0.7733771 0.5731922
## 6 0.79437829 0.10197330 0.93174456 0.6425360 0.98148244 0.7171185 0.3608518
## X112 X113 X114 X115 X116 X117 X118
## 1 0.44486670 0.01083590 0.01484451 0.06289119 0.581997880 0.5924448 0.4623531
## 2 0.34772122 0.12578299 0.17502226 0.89214340 0.665076771 0.8878863 0.8693493
## 3 0.80746171 0.50185677 0.25231490 0.28707657 0.875653535 0.5235236 0.2247654
## 4 0.03664129 0.18031296 0.99640743 0.64055932 0.070306693 0.3749164 0.5402184
## 5 0.89283247 0.05301032 0.86648632 0.13174253 0.004420665 0.8752077 0.6109594
## 6 0.80284080 0.06297881 0.43661718 0.43528020 0.232604207 0.7159711 0.2163525
## X119 X120 X121 X122 X123 X124 X125
## 1 0.45086799 0.1943601 0.15381438 0.982242217 0.2947328 0.11021659 0.4058102
## 2 0.52969808 0.9761348 0.40933156 0.520649012 0.2511160 0.08025240 0.7747704
## 3 0.15883508 0.9871921 0.22537445 0.445695450 0.8000173 0.62539483 0.3647271
## 4 0.06271411 0.5119375 0.35218675 0.462261930 0.2342284 0.72221652 0.4875607
## 5 0.44944508 0.1983814 0.11016184 0.002215227 0.1068260 0.07613934 0.7666030
## 6 0.23194646 0.6491826 0.06810542 0.450323362 0.5336411 0.28059341 0.1265137
## X126 X127 X128 X129 X130 X131 X132
## 1 0.5527449 0.2878878 0.55946411 0.6899293 0.6057740 0.63405048 0.041380533
## 2 0.8601255 0.1305359 0.21613476 0.6588805 0.1262838 0.91769299 0.404302723
## 3 0.7557376 0.6784908 0.64567381 0.6979758 0.4577847 0.91924084 0.766801183
## 4 0.4862461 0.7317664 0.05425676 0.2821157 0.3138897 0.71750403 0.005256724
## 5 0.9917929 0.8525171 0.77481876 0.3364385 0.6482185 0.57506495 0.354915085
## 6 0.4760391 0.0567125 0.88117459 0.8124043 0.2068774 0.02557602 0.520244388
## X133 X134 X135 X136 X137 X138 X139
## 1 0.11652459 0.5825725 0.70794842 0.2909239 0.5437972 0.1286449 0.8094877
## 2 0.03312483 0.5669765 0.48408234 0.4611460 0.7934200 0.2398162 0.8214692
## 3 0.12827443 0.9846029 0.77778677 0.8627647 0.2294719 0.5779939 0.1852677
## 4 0.82357734 0.5245908 0.02017853 0.6642912 0.3530981 0.6174764 0.2548854
## 5 0.30659273 0.5602921 0.01777780 0.3422838 0.3047963 0.4213162 0.9207230
## 6 0.93538770 0.8365798 0.37046091 0.6586622 0.5051089 0.6448513 0.7427806
## X140 X141 X142 X143 X144 X145 X146
## 1 0.4016311 0.9367281 0.2746436 0.59995153 0.61420305 0.84734796 0.21923829
## 2 0.7927715 0.9362274 0.2936810 0.16334854 0.70992172 0.31810196 0.15674629
## 3 0.1173834 0.8976075 0.6272968 0.17710410 0.07593862 0.07364713 0.08922226
## 4 0.3649660 0.4590170 0.4494853 0.26653590 0.77857178 0.02329525 0.74250158
## 5 0.5366805 0.5760531 0.1779100 0.00149341 0.63019430 0.92182790 0.10475932
## 6 0.4860038 0.7332851 0.6615094 0.77484637 0.30370821 0.74226482 0.15595412
## X147 X148 X149 X150 X151 X152 X153
## 1 0.82867974 0.94290037 0.5083185 0.25527703 0.7526830 0.01497518 0.96331279
## 2 0.97047839 0.64720576 0.7175615 0.03806863 0.8967593 0.04082960 0.80074165
## 3 0.47023973 0.55771586 0.4559021 0.45961098 0.7685767 0.70849826 0.39650955
## 4 0.93823417 0.39179048 0.6270691 0.29450119 0.2282020 0.07697517 0.63353348
## 5 0.95659470 0.09407949 0.8008656 0.69764193 0.7520311 0.10761805 0.09724402
## 6 0.04171886 0.51430294 0.3555205 0.01282597 0.5723066 0.31757350 0.39974798
## X154 X155 X156 X157 X158 X159 X160
## 1 0.16493185 0.5734490 0.6172273 0.01984267 0.9923874 0.8431326 0.4516252
## 2 0.83442190 0.7298989 0.4894689 0.71974924 0.7642040 0.9155692 0.3701119
## 3 0.96948408 0.5114832 0.9822653 0.13222445 0.1225179 0.2006651 0.2851132
## 4 0.69882495 0.9638506 0.4711525 0.15887531 0.8788914 0.6495318 0.7246809
## 5 0.08477651 0.8876985 0.2279777 0.68196120 0.9136540 0.5322480 0.6712590
## 6 0.89601757 0.6552185 0.4086666 0.91309850 0.2051752 0.9216390 0.7176894
## X161 X162 X163 X164 X165 X166 X167
## 1 0.27610230 0.74179679 0.48524905 0.003358962 0.35415790 0.8967225 0.8834287
## 2 0.30687340 0.30933747 0.36506517 0.837234325 0.13497580 0.4668467 0.5235297
## 3 0.41455622 0.66539232 0.05123008 0.249701695 0.25971016 0.5343797 0.3086893
## 4 0.91913345 0.06219363 0.49290181 0.994318883 0.60124062 0.7753268 0.7970673
## 5 0.53340060 0.44950719 0.15271272 0.331355318 0.29420728 0.6948136 0.8076349
## 6 0.03427269 0.85936296 0.12930984 0.494159771 0.08425433 0.1823070 0.3097774
## X168 X169 X170 X171 X172 X173 X174
## 1 0.8061288 0.1292087 0.19116865 0.285561043 0.88923767 0.5870494 0.88877075
## 2 0.6243856 0.6522644 0.93251897 0.198662249 0.11641914 0.1335043 0.85607289
## 3 0.3250510 0.5548370 0.66663916 0.803086133 0.75219803 0.9061797 0.46824726
## 4 0.4519654 0.4108911 0.99667439 0.394731248 0.70289791 0.3921670 0.32963398
## 5 0.8714402 0.9276495 0.08040045 0.715471951 0.01533145 0.9418107 0.04961305
## 6 0.6398163 0.6567273 0.79273983 0.005634914 0.56296260 0.8835058 0.40821738
## X175 X176 X177 X178 X179 X180 X181
## 1 0.1514306 0.21224566 0.1572156 0.4175002 0.7225748 0.75372096 0.3955216
## 2 0.4241027 0.51659550 0.8845839 0.1575032 0.0794626 0.61866392 0.8728076
## 3 0.4330624 0.04400484 0.4311277 0.9490000 0.9359233 0.22093972 0.1652158
## 4 0.8379206 0.01883298 0.9475947 0.7788227 0.7653506 0.49391558 0.1522405
## 5 0.6613413 0.27919607 0.1502022 0.4259150 0.9541247 0.31082164 0.9833864
## 6 0.4268377 0.75931980 0.2362647 0.5131131 0.5040790 0.09395412 0.8566445
## X182 X183 X184 X185 X186 X187 X188
## 1 0.4887184 0.50930862 0.40526502 0.8997269 0.20133410 0.993279049 0.27672405
## 2 0.9823968 0.39664574 0.04680688 0.4138554 0.02252797 0.551354672 0.80417281
## 3 0.8082559 0.99125548 0.85048518 0.1409671 0.88720811 0.001778496 0.03994549
## 4 0.7771462 0.34878537 0.88635905 0.7680195 0.64712794 0.927141098 0.17404809
## 5 0.5740921 0.07616622 0.25642704 0.4703455 0.02694613 0.934355612 0.27263610
## 6 0.3805437 0.27785715 0.24521403 0.2338873 0.30423026 0.806766985 0.03654309
## X189 X190 X191 X192 X193 X194 X195
## 1 0.4524145 0.6911940 0.5406514 0.7222597 0.002576062 0.25223405 0.5568160
## 2 0.7580112 0.7804734 0.1411782 0.1310911 0.411775569 0.52042257 0.4986688
## 3 0.5278438 0.2509410 0.2106741 0.5895710 0.891389510 0.77904746 0.3912340
## 4 0.8901789 0.5280285 0.8211162 0.5913061 0.767070756 0.08815411 0.8240591
## 5 0.7148812 0.4105001 0.8930576 0.9879969 0.598619170 0.60512811 0.5665962
## 6 0.6873584 0.3161568 0.1698084 0.1546299 0.127565006 0.08847341 0.6644925
## X196 X197 X198 X199 X200
## 1 0.08803946 0.7399642 0.3738620 0.6715633 0.0001012264
## 2 0.67626884 0.9724303 0.8393751 0.7468173 0.8653565315
## 3 0.09020876 0.6599458 0.8371015 0.9785843 0.3703812116
## 4 0.41487972 0.3337511 0.5536796 0.7041028 0.6651317959
## 5 0.01701286 0.2368524 0.7016140 0.7000800 0.3351982590
## 6 0.96472272 0.3598008 0.9455406 0.2009973 0.2989820768
# Fit glmnet model: model
model <- train(
y ~ .,
overfit,
method = "glmnet",
trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold01: alpha=0.10, lambda=0.01013
## - Fold01: alpha=0.10, lambda=0.01013
## + Fold01: alpha=0.55, lambda=0.01013
## - Fold01: alpha=0.55, lambda=0.01013
## + Fold01: alpha=1.00, lambda=0.01013
## - Fold01: alpha=1.00, lambda=0.01013
## + Fold02: alpha=0.10, lambda=0.01013
## - Fold02: alpha=0.10, lambda=0.01013
## + Fold02: alpha=0.55, lambda=0.01013
## - Fold02: alpha=0.55, lambda=0.01013
## + Fold02: alpha=1.00, lambda=0.01013
## - Fold02: alpha=1.00, lambda=0.01013
## + Fold03: alpha=0.10, lambda=0.01013
## - Fold03: alpha=0.10, lambda=0.01013
## + Fold03: alpha=0.55, lambda=0.01013
## - Fold03: alpha=0.55, lambda=0.01013
## + Fold03: alpha=1.00, lambda=0.01013
## - Fold03: alpha=1.00, lambda=0.01013
## + Fold04: alpha=0.10, lambda=0.01013
## - Fold04: alpha=0.10, lambda=0.01013
## + Fold04: alpha=0.55, lambda=0.01013
## - Fold04: alpha=0.55, lambda=0.01013
## + Fold04: alpha=1.00, lambda=0.01013
## - Fold04: alpha=1.00, lambda=0.01013
## + Fold05: alpha=0.10, lambda=0.01013
## - Fold05: alpha=0.10, lambda=0.01013
## + Fold05: alpha=0.55, lambda=0.01013
## - Fold05: alpha=0.55, lambda=0.01013
## + Fold05: alpha=1.00, lambda=0.01013
## - Fold05: alpha=1.00, lambda=0.01013
## + Fold06: alpha=0.10, lambda=0.01013
## - Fold06: alpha=0.10, lambda=0.01013
## + Fold06: alpha=0.55, lambda=0.01013
## - Fold06: alpha=0.55, lambda=0.01013
## + Fold06: alpha=1.00, lambda=0.01013
## - Fold06: alpha=1.00, lambda=0.01013
## + Fold07: alpha=0.10, lambda=0.01013
## - Fold07: alpha=0.10, lambda=0.01013
## + Fold07: alpha=0.55, lambda=0.01013
## - Fold07: alpha=0.55, lambda=0.01013
## + Fold07: alpha=1.00, lambda=0.01013
## - Fold07: alpha=1.00, lambda=0.01013
## + Fold08: alpha=0.10, lambda=0.01013
## - Fold08: alpha=0.10, lambda=0.01013
## + Fold08: alpha=0.55, lambda=0.01013
## - Fold08: alpha=0.55, lambda=0.01013
## + Fold08: alpha=1.00, lambda=0.01013
## - Fold08: alpha=1.00, lambda=0.01013
## + Fold09: alpha=0.10, lambda=0.01013
## - Fold09: alpha=0.10, lambda=0.01013
## + Fold09: alpha=0.55, lambda=0.01013
## - Fold09: alpha=0.55, lambda=0.01013
## + Fold09: alpha=1.00, lambda=0.01013
## - Fold09: alpha=1.00, lambda=0.01013
## + Fold10: alpha=0.10, lambda=0.01013
## - Fold10: alpha=0.10, lambda=0.01013
## + Fold10: alpha=0.55, lambda=0.01013
## - Fold10: alpha=0.55, lambda=0.01013
## + Fold10: alpha=1.00, lambda=0.01013
## - Fold10: alpha=1.00, lambda=0.01013
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 0.000101 on full training set
# Print model to console
model
## glmnet
##
## 250 samples
## 200 predictors
## 2 classes: 'class1', 'class2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 225, 224, 226, 225, 225, 225, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0.10 0.0001012745 0.4321558 0 0.9612319
## 0.10 0.0010127448 0.4320652 0 0.9655797
## 0.10 0.0101274483 0.4251812 0 0.9740942
## 0.55 0.0001012745 0.3866848 0 0.9443841
## 0.55 0.0010127448 0.3712862 0 0.9442029
## 0.55 0.0101274483 0.4058877 0 0.9699275
## 1.00 0.0001012745 0.3579710 0 0.9141304
## 1.00 0.0010127448 0.3469203 0 0.9228261
## 1.00 0.0101274483 0.3952899 0 0.9784420
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.0001012745.
# Print maximum ROC statistic
max(model[["results"]][["ROC"]])
## [1] 0.4321558
As you saw in the video, the glmnet model actually fits many models at once (one of the great things about the package). You can exploit this by passing a large number of lambda values, which control the amount of penalization in the model. train() is smart enough to only fit one model per alpha value and pass all of the lambda values at once for simultaneous fitting.
My favorite tuning grid for glmnet models is:
expand.grid(alpha = 0:1, lambda = seq(0.0001, 1, length = 100))
This grid explores a large number of lambda values (100, in fact), from a very small one to a very large one. (You could increase the maximum lambda to 10, but in this exercise 1 is a good upper bound.)
If you want to explore fewer models, you can use a shorter lambda sequence. For example, lambda = seq(0.0001, 1, length = 10) would fit 10 models per value of alpha.
This tuneGrid also lets you explore the two forms of penalized models: ridge regression and lasso regression. alpha = 0 is pure ridge regression, and alpha = 1 is pure lasso regression. You can fit a mixture of the two (i.e. an elastic net) using an alpha between 0 and 1. For example, alpha = 0.05 would be 95% ridge regression and 5% lasso regression.
In this problem you’ll just explore the 2 extremes – pure ridge and pure lasso regression – for the purpose of illustrating their differences.
Train a glmnet model on the overfit data such that y is the response variable and all other variables are explanatory variables. Make sure to use your custom trainControl from the previous exercise (myControl). Also, use a custom tuneGrid to explore alpha = 0:1 and 20 values of lambda between 0.0001 and 1 per value of alpha. Print model to the console. Print the max() of the ROC statistic in model[["results"]]. You can access it using model[["results"]][["ROC"]].
# Train glmnet with custom trainControl and tuning: model
model <- train(
y ~ .,
overfit,
tuneGrid = expand.grid(
alpha = 0:1,
lambda = seq(0.0001, 1, length = 20)
),
method = "glmnet",
trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold01: alpha=0, lambda=1
## - Fold01: alpha=0, lambda=1
## + Fold01: alpha=1, lambda=1
## - Fold01: alpha=1, lambda=1
## + Fold02: alpha=0, lambda=1
## - Fold02: alpha=0, lambda=1
## + Fold02: alpha=1, lambda=1
## - Fold02: alpha=1, lambda=1
## + Fold03: alpha=0, lambda=1
## - Fold03: alpha=0, lambda=1
## + Fold03: alpha=1, lambda=1
## - Fold03: alpha=1, lambda=1
## + Fold04: alpha=0, lambda=1
## - Fold04: alpha=0, lambda=1
## + Fold04: alpha=1, lambda=1
## - Fold04: alpha=1, lambda=1
## + Fold05: alpha=0, lambda=1
## - Fold05: alpha=0, lambda=1
## + Fold05: alpha=1, lambda=1
## - Fold05: alpha=1, lambda=1
## + Fold06: alpha=0, lambda=1
## - Fold06: alpha=0, lambda=1
## + Fold06: alpha=1, lambda=1
## - Fold06: alpha=1, lambda=1
## + Fold07: alpha=0, lambda=1
## - Fold07: alpha=0, lambda=1
## + Fold07: alpha=1, lambda=1
## - Fold07: alpha=1, lambda=1
## + Fold08: alpha=0, lambda=1
## - Fold08: alpha=0, lambda=1
## + Fold08: alpha=1, lambda=1
## - Fold08: alpha=1, lambda=1
## + Fold09: alpha=0, lambda=1
## - Fold09: alpha=0, lambda=1
## + Fold09: alpha=1, lambda=1
## - Fold09: alpha=1, lambda=1
## + Fold10: alpha=0, lambda=1
## - Fold10: alpha=0, lambda=1
## + Fold10: alpha=1, lambda=1
## - Fold10: alpha=1, lambda=1
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0.0527 on full training set
# Print model to console
model
## glmnet
##
## 250 samples
## 200 predictors
## 2 classes: 'class1', 'class2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 225, 226, 224, 224, 226, 224, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0 0.00010000 0.3722826 0.05 0.9871377
## 0 0.05272632 0.3791667 0.00 0.9956522
## 0 0.10535263 0.3770833 0.00 1.0000000
## 0 0.15797895 0.3708333 0.00 1.0000000
## 0 0.21060526 0.3794384 0.00 1.0000000
## 0 0.26323158 0.3900362 0.00 1.0000000
## 0 0.31585789 0.3921196 0.00 1.0000000
## 0 0.36848421 0.3963768 0.00 1.0000000
## 0 0.42111053 0.3963768 0.00 1.0000000
## 0 0.47373684 0.3984601 0.00 1.0000000
## 0 0.52636316 0.3984601 0.00 1.0000000
## 0 0.57898947 0.4026268 0.00 1.0000000
## 0 0.63161579 0.4026268 0.00 1.0000000
## 0 0.68424211 0.4047101 0.00 1.0000000
## 0 0.73686842 0.4047101 0.00 1.0000000
## 0 0.78949474 0.4047101 0.00 1.0000000
## 0 0.84212105 0.4047101 0.00 1.0000000
## 0 0.89474737 0.4047101 0.00 1.0000000
## 0 0.94737368 0.4090580 0.00 1.0000000
## 0 1.00000000 0.4090580 0.00 1.0000000
## 1 0.00010000 0.2921196 0.05 0.9532609
## 1 0.05272632 0.5108696 0.00 1.0000000
## 1 0.10535263 0.5000000 0.00 1.0000000
## 1 0.15797895 0.5000000 0.00 1.0000000
## 1 0.21060526 0.5000000 0.00 1.0000000
## 1 0.26323158 0.5000000 0.00 1.0000000
## 1 0.31585789 0.5000000 0.00 1.0000000
## 1 0.36848421 0.5000000 0.00 1.0000000
## 1 0.42111053 0.5000000 0.00 1.0000000
## 1 0.47373684 0.5000000 0.00 1.0000000
## 1 0.52636316 0.5000000 0.00 1.0000000
## 1 0.57898947 0.5000000 0.00 1.0000000
## 1 0.63161579 0.5000000 0.00 1.0000000
## 1 0.68424211 0.5000000 0.00 1.0000000
## 1 0.73686842 0.5000000 0.00 1.0000000
## 1 0.78949474 0.5000000 0.00 1.0000000
## 1 0.84212105 0.5000000 0.00 1.0000000
## 1 0.89474737 0.5000000 0.00 1.0000000
## 1 0.94737368 0.5000000 0.00 1.0000000
## 1 1.00000000 0.5000000 0.00 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.05272632.
# Print maximum ROC statistic
max(model[["results"]][["ROC"]])
## [1] 0.5108696
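A hedged follow-up: the selected tuning parameters are also stored on the train object itself, which is often handier than reading them off the printed table (model$bestTune is a standard caret accessor; the coef() call is the usual glmnet way to pull coefficients at a given lambda).
# Inspect the winning parameter combination
model$bestTune
# Optionally, coefficients of the final fit at the chosen lambda:
# coef(model$finalModel, s = model$bestTune$lambda)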
In this chapter, you’ll be using a version of the Wisconsin Breast Cancer dataset. This dataset presents a classic binary classification problem: 50% of the samples are benign, 50% are malignant, and the challenge is to identify which are which.
This dataset is interesting because many of the predictors contain missing values and most rows of the dataset have at least one missing value. This presents a modeling challenge, because most machine learning algorithms cannot handle missing values out of the box. For example, your first instinct might be to fit a logistic regression model to this data, but prior to doing this you need a strategy for handling the NAs.
Fortunately, the train() function in caret contains an argument called preProcess, which allows you to specify that median imputation should be used to fill in the missing values. In previous chapters, you created models with the train() function using formulas such as y ~ .. An alternative way is to specify the x and y arguments to train(), where x is an object with samples in rows and features in columns and y is a numeric or factor vector containing the outcomes. Said differently, x is a matrix or data frame that contains the whole dataset you’d use for the data argument to the lm() call, for example, but excludes the response variable column; y is a vector that contains just the response variable column.
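As a minimal sketch of that x/y split (the data frame df and its response column outcome are invented names for illustration, not objects from this course):
# Split a data frame into predictors (x) and response (y) for train()
x <- df[, setdiff(names(df), "outcome")]  # every column except the response
y <- df$outcome                           # just the response vector
# model <- train(x = x, y = y, method = "glm")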
For this exercise, the argument x to train() is loaded in your workspace as breast_cancer_x and y as breast_cancer_y.
Use the train() function to fit a glm model called median_model to the breast cancer dataset. Use preProcess = "medianImpute" to handle the missing values. Print median_model to the console.
# breast_cancer_y <- read.delim("DATABASE/breast_cancer_y.txt", sep = ",")
# breast_cancer_x <- read.delim("DATABASE/breast_cancer_x.txt", sep = ",")
# # Apply median imputation: median_model
# median_model <- train(
# x = breast_cancer_x,
# y = breast_cancer_y,
# method = "glm",
# trControl = myControl,
# preProcess = "medianImpute"
# )
#
# # Print median_model to console
# median_model
In the previous exercise, you used median imputation to fill in missing values in the breast cancer dataset, but that is not the only possible method for dealing with missing data.
An alternative to median imputation is k-nearest neighbors, or KNN, imputation. This is a more advanced form of imputation where missing values are replaced with values from other rows that are similar to the current row. While this is a lot more complicated to implement in practice than simple median imputation, it is very easy to explore in caret using the preProcess argument to train(). You can simply use preProcess = "knnImpute" to change the method of imputation used prior to model fitting.
breast_cancer_x and breast_cancer_y are loaded in your workspace.
Use the train() function to fit a glm model called knn_model to the breast cancer dataset. Use KNN imputation to handle missing values.
# Apply KNN imputation: knn_model
# knn_model <- train(
# x = breast_cancer_x,
# y = breast_cancer_y,
# method = "glm",
# trControl = myControl,
# preProcess = "knnImpute"
# )
# Print knn_model to console
# knn_model
The preProcess argument to train() doesn't just limit you to imputing missing values. It also includes a wide variety of other preprocessing techniques to make your life as a data scientist much easier. You can read a full list of them by typing ?preProcess and reading the help page for this function.
One set of preprocessing functions that is particularly useful for fitting regression models is standardization: centering and scaling. You first center by subtracting the mean of each column from each value in that column, then you scale by dividing by the standard deviation.
Standardization transforms your data such that for each column, the mean is 0 and the standard deviation is 1. This makes it easier for regression models to find a good solution.
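To make the arithmetic concrete, here is a toy base-R sketch of standardizing a single column (invented numbers; in practice preProcess = c("center", "scale") does this to every column for you):
# Center, then scale, one numeric vector
x <- c(2, 4, 6, 8)
centered <- x - mean(x)           # mean(centered) is 0
standardized <- centered / sd(x)  # sd(standardized) is 1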
breast_cancer_x and breast_cancer_y are loaded in your workspace.
Fit a logistic regression model called model to the breast cancer data, using median imputation to handle the missing values, then print it to the console.
# # Fit glm with median imputation
# model <- train(
# x = breast_cancer_x,
# y = breast_cancer_y,
# method = "glm",
# trControl = myControl,
# preProcess = "medianImpute"
# )
#
# # Print model
# model
Update the model to include two more pre-processing steps: centering and scaling.
# # Update model with standardization
# model <- train(
# x = breast_cancer_x,
# y = breast_cancer_y,
# method = "glm",
# trControl = myControl,
# preProcess = c("medianImpute", "center", "scale")
# )
#
# # Print updated model
# model
As you saw in the video, for the next set of exercises, you’ll be using the blood-brain dataset. This is a biochemical dataset in which the task is to predict the following value for a set of biochemical compounds:
log((concentration of compound in brain) / (concentration of compound in blood))
This gives a quantitative metric of the compound's ability to cross the blood-brain barrier, and is useful for understanding the biological properties of that barrier.
One interesting aspect of this dataset is that it contains many variables, and many of these variables have extremely low variances. This means that there is very little information in these variables because they mostly consist of a single value (e.g. zero).
Fortunately, caret contains a utility function called nearZeroVar() for removing such variables to save time during modeling.
nearZeroVar() takes in data x, then looks at the ratio of the most common value to the second most common value, freqCut, and the percentage of distinct values out of the number of total samples, uniqueCut. A variable is flagged as near zero variance when its frequency ratio exceeds freqCut and its percentage of unique values falls below uniqueCut. By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. I like to be a little more aggressive and use freqCut = 2 and uniqueCut = 20 when calling nearZeroVar().
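As a hedged toy example of those cutoffs in action (invented data, not the blood-brain predictors; assumes caret is loaded):
# mostly_zero is 19 parts one value to 1 part another:
# freqRatio = 19 and only 10% unique values, so it gets flagged
toy <- data.frame(
  mostly_zero = c(rep(0, 19), 1),
  varied = rnorm(20)  # all values distinct, so it survives
)
nearZeroVar(toy, freqCut = 2, uniqueCut = 20, names = TRUE)
# expected: "mostly_zero"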
bloodbrain_x and bloodbrain_y are loaded in your workspace.
Identify the near zero variance predictors by running nearZeroVar() on the blood-brain dataset. Store the result as an object called remove_cols. Use freqCut = 2 and uniqueCut = 20 in the call to nearZeroVar(). Use names() to create a vector containing all column names of bloodbrain_x. Call this all_cols. Make a new data frame called bloodbrain_x_small with the near-zero variance variables removed. Use setdiff() to isolate the column names that you wish to keep (i.e. that you don't want to remove).
bloodbrain_x <- read.delim("DATABASE/bloodbrain_x.txt", sep = ",")
head(bloodbrain_x)
## X tpsa nbasic vsa_hyd a_aro weight peoe_vsa.0 peoe_vsa.1 peoe_vsa.2
## 1 1 12.03 1 167.06700 0 156.293 76.94749 43.44619 0.00000
## 2 2 49.33 0 92.64243 6 151.165 38.24339 25.52006 0.00000
## 3 3 50.53 1 295.16700 15 366.485 58.05473 124.74020 21.65084
## 4 4 37.39 0 319.11220 15 382.552 62.23933 124.74020 13.19232
## 5 5 37.39 1 299.65800 12 326.464 74.80064 118.04060 33.00190
## 6 6 37.39 1 289.77770 11 332.492 74.80064 109.50990 13.19232
## peoe_vsa.3 peoe_vsa.4 peoe_vsa.5 peoe_vsa.6 peoe_vsa.0.1 peoe_vsa.1.1
## 1 0.000000 0.00000 0 17.238030 18.74768 43.50657
## 2 8.619013 23.27370 0 0.000000 49.01962 0.00000
## 3 8.619013 17.44054 0 8.619013 83.82487 49.01962
## 4 21.785640 0.00000 0 8.619013 83.82487 68.78024
## 5 0.000000 0.00000 0 8.619013 83.82487 36.76471
## 6 21.785640 0.00000 0 8.619013 73.54603 44.27042
## peoe_vsa.2.1 peoe_vsa.3.1 peoe_vsa.4.1 peoe_vsa.5.1 peoe_vsa.6.1 a_acc a_acid
## 1 0 0 0.000000 0.000000 0.000000 0 0
## 2 0 0 0.000000 13.566920 7.904431 2 0
## 3 0 0 5.682576 2.503756 2.640647 2 0
## 4 0 0 5.682576 0.000000 2.640647 2 0
## 5 0 0 5.682576 0.136891 2.503756 2 0
## 6 0 0 5.682576 0.000000 2.640647 2 0
## a_base vsa_acc vsa_acid vsa_base vsa_don vsa_other vsa_pol slogp_vsa0
## 1 1 0.000000 0 5.682576 5.682576 0.00000 0.00000 18.01075
## 2 0 13.566920 0 0.000000 5.682576 28.10760 13.56692 25.38523
## 3 1 8.186332 0 0.000000 5.682576 43.56089 0.00000 14.12420
## 4 1 8.186332 0 0.000000 5.682576 28.32470 0.00000 14.12420
## 5 1 8.186332 0 0.000000 5.682576 19.64908 0.00000 14.12420
## 6 1 8.186332 0 0.000000 5.682576 21.62514 0.00000 14.12420
## slogp_vsa1 slogp_vsa2 slogp_vsa3 slogp_vsa4 slogp_vsa5 slogp_vsa6 slogp_vsa7
## 1 0.00000 3.981969 0.000 4.410796 32.897190 0 0.00000
## 2 23.26954 23.862220 0.000 0.000000 0.000000 0 70.57274
## 3 34.79628 0.000000 76.245 3.185575 9.507346 0 148.12580
## 4 34.79628 0.000000 76.245 3.185575 0.000000 0 144.03540
## 5 34.79628 0.000000 76.245 3.185575 0.000000 0 140.71660
## 6 34.79628 0.000000 76.245 3.185575 0.000000 0 103.60310
## slogp_vsa8 slogp_vsa9 smr_vsa0 smr_vsa1 smr_vsa2 smr_vsa3 smr_vsa4 smr_vsa5
## 1 113.21040 33.32602 0.000000 18.01075 4.410796 3.981969 0.00000 113.21040
## 2 0.00000 41.32619 23.862220 25.38523 0.000000 5.243428 20.76750 70.57274
## 3 75.47363 28.27417 12.631660 27.78542 0.000000 8.429003 29.58226 235.05870
## 4 75.47363 55.46144 3.124314 27.78542 0.000000 8.429003 21.40142 235.05870
## 5 75.47363 26.01093 3.124314 27.78542 0.000000 8.429003 20.33867 234.62990
## 6 75.47363 55.46144 3.124314 27.78542 0.000000 8.429003 18.51150 197.51630
## smr_vsa6 smr_vsa7 tpsa.1 logp.o.w. frac.anion7. frac.cation7. andrewbind
## 1 0.000000 66.22321 16.61 2.948 0.000 0.999 3.4
## 2 5.258784 33.32602 49.33 0.889 0.001 0.000 -3.3
## 3 76.245000 0.00000 51.73 4.439 0.000 0.986 12.8
## 4 76.245000 31.27769 38.59 5.254 0.000 0.986 12.8
## 5 76.245000 0.00000 38.59 3.800 0.000 0.986 10.3
## 6 76.245000 31.27769 38.59 3.608 0.000 0.986 10.0
## rotatablebonds mlogp clogp mw nocount hbdnr rule.of.5violations
## 1 3 2.50245 2.970000 155.2856 1 1 0
## 2 2 1.05973 0.494000 151.1664 3 2 0
## 3 8 4.66091 5.136999 365.4794 5 1 1
## 4 8 3.82458 5.877599 381.5440 4 1 1
## 5 8 3.27214 4.367000 325.4577 4 1 0
## 6 8 2.89481 4.283600 331.4835 4 1 0
## prx ub pol inthb adistm adistd polar_area nonpolar_area psa_npsa tcsa
## 1 0 0.0 0 0 0.0000 0.0000 21.1242 379.0702 0.0557 0.0097
## 2 1 3.0 2 0 395.3757 10.8921 117.4081 247.5371 0.4743 0.0134
## 3 6 5.3 3 0 1364.5514 25.6784 82.0943 637.7242 0.1287 0.0111
## 4 2 5.3 3 0 702.6387 10.0232 65.0890 667.9713 0.0974 0.0108
## 5 2 4.2 2 0 745.5096 10.5753 66.1754 601.7463 0.1100 0.0118
## 6 2 3.6 2 0 779.2914 10.7712 69.0895 588.6569 0.1174 0.0111
## tcpa tcnp ovality surface_area volume most_negative_charge
## 1 0.1842 0.0103 1.0960 400.1944 656.0650 -0.6174
## 2 0.0417 0.0198 1.1173 364.9453 555.0969 -0.8397
## 3 0.0972 0.0125 1.3005 719.8185 1224.4553 -0.8012
## 4 0.1218 0.0119 1.3013 733.0603 1257.2002 -0.7608
## 5 0.1186 0.0130 1.2711 667.9218 1132.6826 -0.8567
## 6 0.1061 0.0125 1.2642 657.7465 1115.8672 -0.7672
## most_positive_charge sum_absolute_charge dipole_moment homo lumo
## 1 0.3068 3.8918 1.1898 -9.6672 3.4038
## 2 0.4967 4.8925 4.2109 -8.9618 0.1942
## 3 0.5414 7.9796 3.5234 -8.6271 0.0589
## 4 0.4800 7.9308 3.1463 -8.5592 -0.2651
## 5 0.4547 7.8516 3.2676 -8.6732 0.3149
## 6 0.4349 7.3305 3.2845 -8.6843 -0.0310
## hardness ppsa1 ppsa2 ppsa3 pnsa1 pnsa2 pnsa3 fpsa1 fpsa2
## 1 6.5355 349.1390 679.3832 30.9705 51.0554 -99.3477 -10.4876 0.8724 1.6976
## 2 4.5780 223.1310 545.8328 42.3030 141.8143 -346.9123 -44.0368 0.6114 1.4957
## 3 4.3430 517.8218 2066.0186 63.9503 201.9967 -805.9311 -43.7587 0.7194 2.8702
## 4 4.1471 507.6144 2012.9060 61.6890 225.4459 -893.9880 -42.0328 0.6925 2.7459
## 5 4.4940 509.1635 1998.8743 61.5645 158.7582 -623.2529 -39.8413 0.7623 2.9927
## 6 4.3266 473.5681 1735.7426 58.4993 184.1784 -675.0588 -41.2100 0.7200 2.6389
## fpsa3 fnsa1 fnsa2 fnsa3 wpsa1 wpsa2 wpsa3 wnsa1 wnsa2
## 1 0.0774 0.1276 -0.2482 -0.0262 139.7235 271.8854 12.3942 20.4321 -39.7584
## 2 0.1159 0.3886 -0.9506 -0.1207 81.4306 199.1991 15.4383 51.7544 -126.6040
## 3 0.0888 0.2806 -1.1196 -0.0608 372.7377 1487.1583 46.0326 145.4010 -580.1241
## 4 0.0842 0.3075 -1.2195 -0.0573 372.1120 1475.5815 45.2218 165.2654 -655.3471
## 5 0.0922 0.2377 -0.9331 -0.0596 340.0814 1335.0917 41.1203 106.0381 -416.2842
## 6 0.0889 0.2800 -1.0263 -0.0627 311.4878 1141.6785 38.4777 121.1427 -444.0175
## wnsa3 dpsa1 dpsa2 dpsa3 rpcg rncg wpcs wncs sadh1
## 1 -4.1971 298.0836 778.7310 41.4580 0.1577 0.3173 2.3805 1.9117 15.0988
## 2 -16.0710 81.3167 892.7451 86.3398 0.2030 0.3433 1.3116 2.2546 45.2163
## 3 -31.4983 315.8251 2871.9497 107.7089 0.1357 0.2008 1.1351 1.5725 16.7192
## 4 -30.8126 282.1685 2906.8940 103.7218 0.1210 0.1919 0.7623 1.5302 17.2491
## 5 -26.6109 350.4053 2622.1272 101.4058 0.1158 0.2182 0.7884 1.6795 16.0252
## 6 -27.1057 289.3897 2410.8013 99.7093 0.1187 0.2093 2.0505 1.6760 17.2815
## sadh2 sadh3 chdh1 chdh2 chdh3 scdh1 scdh2 scdh3 saaa1 saaa2
## 1 15.0988 0.0377 0.3068 0.3068 0.0008 4.6321 4.6321 0.0116 6.0255 6.0255
## 2 22.6082 0.1239 0.7960 0.3980 0.0022 17.6195 8.8098 0.0483 65.6236 32.8118
## 3 16.7192 0.0232 0.4550 0.4550 0.0006 7.6077 7.6077 0.0106 57.5440 14.3860
## 4 17.2491 0.0235 0.4354 0.4354 0.0006 7.5102 7.5102 0.0102 39.8638 13.2879
## 5 16.0252 0.0240 0.4366 0.4366 0.0007 6.9970 6.9970 0.0105 42.4544 14.1515
## 6 17.2815 0.0263 0.4349 0.4349 0.0007 7.5157 7.5157 0.0114 43.8012 14.6004
## saaa3 chaa1 chaa2 chaa3 scaa1 scaa2 scaa3 ctdh ctaa mchg
## 1 0.0151 -0.6174 -0.6174 -0.0015 -3.7199 -3.7199 -0.0093 1 1 0.9241
## 2 0.1798 -0.8371 -0.4185 -0.0023 -27.5143 -13.7571 -0.0754 2 2 1.2685
## 3 0.0799 -1.3671 -0.3418 -0.0019 -21.7898 -5.4475 -0.0303 1 4 1.2562
## 4 0.0544 -1.2332 -0.4111 -0.0017 -17.5957 -5.8652 -0.0240 1 3 1.1962
## 5 0.0636 -1.1480 -0.3827 -0.0017 -17.0447 -5.6816 -0.0255 1 3 1.2934
## 6 0.0666 -1.2317 -0.4106 -0.0019 -19.8513 -6.6171 -0.0302 1 3 1.2021
## achg rdta n_sp2 n_sp3 o_sp2 o_sp3
## 1 0.9241 1.0000 0.0000 6.0255 0.0000 0.0000
## 2 1.0420 1.0000 0.0000 6.5681 32.0102 33.6135
## 3 1.2562 0.2500 26.9733 10.8567 0.0000 27.5451
## 4 1.1962 0.3333 21.7065 11.0017 0.0000 15.1316
## 5 1.2934 0.3333 24.2061 10.8109 0.0000 15.1333
## 6 1.2021 0.3333 25.5529 11.1218 0.0000 15.1333
# Identify near zero variance predictors: remove_cols
remove_cols <- nearZeroVar(bloodbrain_x, names = TRUE,
freqCut = 2, uniqueCut = 20)
# Get all column names from bloodbrain_x: all_cols
all_cols <- names(bloodbrain_x)
# Remove from data: bloodbrain_x_small
bloodbrain_x_small <- bloodbrain_x[ , setdiff(all_cols, remove_cols)]
Now that you’ve reduced your dataset, you can fit a glm model to it using the train() function. This model will run faster than using the full dataset and will yield very similar predictive accuracy.
Furthermore, zero variance variables can cause problems with cross-validation (e.g. if one fold ends up with only a single unique value for that variable), so removing them prior to modeling means you are less likely to get errors during the fitting process.
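To see the failure mode in miniature, here is a hedged toy illustration (invented data, not bloodbrain_x): a predictor that never varies is collinear with the intercept, so glm() cannot estimate a coefficient for it.
# 'constant' never varies, so its coefficient is unidentifiable
d <- data.frame(y = rnorm(10), constant = 1, x = rnorm(10))
coef(glm(y ~ ., data = d))
# the 'constant' entry comes back NA (rank-deficient model matrix)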
bloodbrain_x, bloodbrain_y, remove_cols, and bloodbrain_x_small are loaded in your workspace.
Fit a glm model using the train() function and the reduced blood-brain dataset you created in the previous exercise. Print the result to the console.
bloodbrain_x_small <- read.delim("DATABASE/bloodbrain_x_small.txt", sep = ",")
head(bloodbrain_x_small)
## X tpsa nbasic vsa_hyd a_aro weight peoe_vsa.0 peoe_vsa.1 peoe_vsa.2
## 1 1 12.03 1 167.06700 0 156.293 76.94749 43.44619 0.00000
## 2 2 49.33 0 92.64243 6 151.165 38.24339 25.52006 0.00000
## 3 3 50.53 1 295.16700 15 366.485 58.05473 124.74020 21.65084
## 4 4 37.39 0 319.11220 15 382.552 62.23933 124.74020 13.19232
## 5 5 37.39 1 299.65800 12 326.464 74.80064 118.04060 33.00190
## 6 6 37.39 1 289.77770 11 332.492 74.80064 109.50990 13.19232
## peoe_vsa.6 peoe_vsa.0.1 peoe_vsa.1.1 a_acc a_base vsa_acc vsa_don
## 1 17.238030 18.74768 43.50657 0 1 0.000000 5.682576
## 2 0.000000 49.01962 0.00000 2 0 13.566920 5.682576
## 3 8.619013 83.82487 49.01962 2 1 8.186332 5.682576
## 4 8.619013 83.82487 68.78024 2 1 8.186332 5.682576
## 5 8.619013 83.82487 36.76471 2 1 8.186332 5.682576
## 6 8.619013 73.54603 44.27042 2 1 8.186332 5.682576
## vsa_other slogp_vsa0 slogp_vsa1 slogp_vsa3 slogp_vsa4 slogp_vsa5 slogp_vsa7
## 1 0.00000 18.01075 0.00000 0.000 4.410796 32.897190 0.00000
## 2 28.10760 25.38523 23.26954 0.000 0.000000 0.000000 70.57274
## 3 43.56089 14.12420 34.79628 76.245 3.185575 9.507346 148.12580
## 4 28.32470 14.12420 34.79628 76.245 3.185575 0.000000 144.03540
## 5 19.64908 14.12420 34.79628 76.245 3.185575 0.000000 140.71660
## 6 21.62514 14.12420 34.79628 76.245 3.185575 0.000000 103.60310
## slogp_vsa9 smr_vsa0 smr_vsa1 smr_vsa2 smr_vsa3 smr_vsa5 smr_vsa6 smr_vsa7
## 1 33.32602 0.000000 18.01075 4.410796 3.981969 113.21040 0.000000 66.22321
## 2 41.32619 23.862220 25.38523 0.000000 5.243428 70.57274 5.258784 33.32602
## 3 28.27417 12.631660 27.78542 0.000000 8.429003 235.05870 76.245000 0.00000
## 4 55.46144 3.124314 27.78542 0.000000 8.429003 235.05870 76.245000 31.27769
## 5 26.01093 3.124314 27.78542 0.000000 8.429003 234.62990 76.245000 0.00000
## 6 55.46144 3.124314 27.78542 0.000000 8.429003 197.51630 76.245000 31.27769
## tpsa.1 logp.o.w. frac.cation7. andrewbind rotatablebonds mlogp clogp
## 1 16.61 2.948 0.999 3.4 3 2.50245 2.970000
## 2 49.33 0.889 0.000 -3.3 2 1.05973 0.494000
## 3 51.73 4.439 0.986 12.8 8 4.66091 5.136999
## 4 38.59 5.254 0.986 12.8 8 3.82458 5.877599
## 5 38.59 3.800 0.986 10.3 8 3.27214 4.367000
## 6 38.59 3.608 0.986 10.0 8 2.89481 4.283600
## mw nocount hbdnr prx ub pol adistm adistd polar_area nonpolar_area
## 1 155.2856 1 1 0 0.0 0 0.0000 0.0000 21.1242 379.0702
## 2 151.1664 3 2 1 3.0 2 395.3757 10.8921 117.4081 247.5371
## 3 365.4794 5 1 6 5.3 3 1364.5514 25.6784 82.0943 637.7242
## 4 381.5440 4 1 2 5.3 3 702.6387 10.0232 65.0890 667.9713
## 5 325.4577 4 1 2 4.2 2 745.5096 10.5753 66.1754 601.7463
## 6 331.4835 4 1 2 3.6 2 779.2914 10.7712 69.0895 588.6569
## psa_npsa tcsa tcpa tcnp ovality surface_area volume
## 1 0.0557 0.0097 0.1842 0.0103 1.0960 400.1944 656.0650
## 2 0.4743 0.0134 0.0417 0.0198 1.1173 364.9453 555.0969
## 3 0.1287 0.0111 0.0972 0.0125 1.3005 719.8185 1224.4553
## 4 0.0974 0.0108 0.1218 0.0119 1.3013 733.0603 1257.2002
## 5 0.1100 0.0118 0.1186 0.0130 1.2711 667.9218 1132.6826
## 6 0.1174 0.0111 0.1061 0.0125 1.2642 657.7465 1115.8672
## most_negative_charge most_positive_charge sum_absolute_charge dipole_moment
## 1 -0.6174 0.3068 3.8918 1.1898
## 2 -0.8397 0.4967 4.8925 4.2109
## 3 -0.8012 0.5414 7.9796 3.5234
## 4 -0.7608 0.4800 7.9308 3.1463
## 5 -0.8567 0.4547 7.8516 3.2676
## 6 -0.7672 0.4349 7.3305 3.2845
## homo lumo hardness ppsa1 ppsa2 ppsa3 pnsa1 pnsa2
## 1 -9.6672 3.4038 6.5355 349.1390 679.3832 30.9705 51.0554 -99.3477
## 2 -8.9618 0.1942 4.5780 223.1310 545.8328 42.3030 141.8143 -346.9123
## 3 -8.6271 0.0589 4.3430 517.8218 2066.0186 63.9503 201.9967 -805.9311
## 4 -8.5592 -0.2651 4.1471 507.6144 2012.9060 61.6890 225.4459 -893.9880
## 5 -8.6732 0.3149 4.4940 509.1635 1998.8743 61.5645 158.7582 -623.2529
## 6 -8.6843 -0.0310 4.3266 473.5681 1735.7426 58.4993 184.1784 -675.0588
## pnsa3 fpsa1 fpsa2 fpsa3 fnsa1 fnsa2 fnsa3 wpsa1 wpsa2
## 1 -10.4876 0.8724 1.6976 0.0774 0.1276 -0.2482 -0.0262 139.7235 271.8854
## 2 -44.0368 0.6114 1.4957 0.1159 0.3886 -0.9506 -0.1207 81.4306 199.1991
## 3 -43.7587 0.7194 2.8702 0.0888 0.2806 -1.1196 -0.0608 372.7377 1487.1583
## 4 -42.0328 0.6925 2.7459 0.0842 0.3075 -1.2195 -0.0573 372.1120 1475.5815
## 5 -39.8413 0.7623 2.9927 0.0922 0.2377 -0.9331 -0.0596 340.0814 1335.0917
## 6 -41.2100 0.7200 2.6389 0.0889 0.2800 -1.0263 -0.0627 311.4878 1141.6785
## wpsa3 wnsa1 wnsa2 wnsa3 dpsa1 dpsa2 dpsa3 rpcg rncg
## 1 12.3942 20.4321 -39.7584 -4.1971 298.0836 778.7310 41.4580 0.1577 0.3173
## 2 15.4383 51.7544 -126.6040 -16.0710 81.3167 892.7451 86.3398 0.2030 0.3433
## 3 46.0326 145.4010 -580.1241 -31.4983 315.8251 2871.9497 107.7089 0.1357 0.2008
## 4 45.2218 165.2654 -655.3471 -30.8126 282.1685 2906.8940 103.7218 0.1210 0.1919
## 5 41.1203 106.0381 -416.2842 -26.6109 350.4053 2622.1272 101.4058 0.1158 0.2182
## 6 38.4777 121.1427 -444.0175 -27.1057 289.3897 2410.8013 99.7093 0.1187 0.2093
## wpcs wncs sadh1 sadh2 sadh3 chdh1 chdh2 scdh1 scdh2 scdh3
## 1 2.3805 1.9117 15.0988 15.0988 0.0377 0.3068 0.3068 4.6321 4.6321 0.0116
## 2 1.3116 2.2546 45.2163 22.6082 0.1239 0.7960 0.3980 17.6195 8.8098 0.0483
## 3 1.1351 1.5725 16.7192 16.7192 0.0232 0.4550 0.4550 7.6077 7.6077 0.0106
## 4 0.7623 1.5302 17.2491 17.2491 0.0235 0.4354 0.4354 7.5102 7.5102 0.0102
## 5 0.7884 1.6795 16.0252 16.0252 0.0240 0.4366 0.4366 6.9970 6.9970 0.0105
## 6 2.0505 1.6760 17.2815 17.2815 0.0263 0.4349 0.4349 7.5157 7.5157 0.0114
## saaa1 saaa2 saaa3 chaa1 chaa2 chaa3 scaa1 scaa2 scaa3 ctdh
## 1 6.0255 6.0255 0.0151 -0.6174 -0.6174 -0.0015 -3.7199 -3.7199 -0.0093 1
## 2 65.6236 32.8118 0.1798 -0.8371 -0.4185 -0.0023 -27.5143 -13.7571 -0.0754 2
## 3 57.5440 14.3860 0.0799 -1.3671 -0.3418 -0.0019 -21.7898 -5.4475 -0.0303 1
## 4 39.8638 13.2879 0.0544 -1.2332 -0.4111 -0.0017 -17.5957 -5.8652 -0.0240 1
## 5 42.4544 14.1515 0.0636 -1.1480 -0.3827 -0.0017 -17.0447 -5.6816 -0.0255 1
## 6 43.8012 14.6004 0.0666 -1.2317 -0.4106 -0.0019 -19.8513 -6.6171 -0.0302 1
## ctaa mchg achg rdta n_sp2 n_sp3 o_sp2 o_sp3
## 1 1 0.9241 0.9241 1.0000 0.0000 6.0255 0.0000 0.0000
## 2 2 1.2685 1.0420 1.0000 0.0000 6.5681 32.0102 33.6135
## 3 4 1.2562 1.2562 0.2500 26.9733 10.8567 0.0000 27.5451
## 4 3 1.1962 1.1962 0.3333 21.7065 11.0017 0.0000 15.1316
## 5 3 1.2934 1.2934 0.3333 24.2061 10.8109 0.0000 15.1333
## 6 3 1.2021 1.2021 0.3333 25.5529 11.1218 0.0000 15.1333
# bloodbrain_y <- read.delim("DATABASE/bloodbrain_y.txt", sep = ",")
# # Fit model on reduced data: model
# model <- train(
# x = bloodbrain_x_small,
# y = bloodbrain_y,
# method = "glm"
# )
# # Print model to console
# model
An alternative to removing low-variance predictors is to run PCA on your dataset. This is sometimes preferable because it does not throw out all of your data: many different low variance predictors may end up combined into one high variance PCA variable, which might have a positive impact on your model’s accuracy.
This is an especially good trick for linear models: the pca option in the preProcess argument will center and scale your data, combine low variance variables, and ensure that all of your predictors are orthogonal. This creates an ideal dataset for linear regression modeling, and can often improve the accuracy of your models.
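As a rough sketch of the underlying idea (base R's prcomp() on invented data; inside train(), preProcess = "pca" does all of this for you):
# Toy PCA after centering and scaling (not caret internals)
m <- matrix(rnorm(100 * 5), ncol = 5)
pc <- prcomp(m, center = TRUE, scale. = TRUE)
summary(pc)                # proportion of variance captured by each component
cor(pc$x[, 1], pc$x[, 2])  # components are orthogonal, so roughly 0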
bloodbrain_x and bloodbrain_y are loaded in your workspace.
Fit a glm model to the full blood-brain dataset using the "pca" option to preProcess. Print the model to the console and inspect the result.
# # Fit glm model using PCA: model
# model <- train(
# x = bloodbrain_x,
# y = bloodbrain_y,
# method = "glm",
# preProcess = "pca"
# )
#
# # Print model to console
# model
As you saw in the video, for this chapter you will focus on a real-world dataset that brings together all of the concepts discussed in the previous chapters.
The churn dataset contains data on a variety of telecom customers and the modeling challenge is to predict which customers will cancel their service (or churn).
In this chapter, you will be exploring two different types of predictive models: glmnet and rf, so the first order of business is to create a reusable trainControl object you can use to reliably compare them.
churn_x and churn_y are loaded in your workspace.
Use createFolds() to create 5 CV folds on churn_y, your target variable for this exercise. Pass them to trainControl() to create a reusable trainControl for comparing models.
churn_y <- read.delim("DATABASE/churn_y.txt", sep = ",")
head(churn_y)
## X x
## 1 1 no
## 2 2 no
## 3 3 no
## 4 4 no
## 5 5 no
## 6 6 no
churn_x <- read.delim("DATABASE/churn_x.txt", sep = ",")
head(churn_x)
## X stateAK stateAL stateAR stateAZ stateCA stateCO stateCT stateDC stateDE
## 1 4575 0 0 0 0 0 0 0 0 0
## 2 4685 0 0 0 0 0 0 0 0 0
## 3 1431 0 0 0 0 0 0 0 0 0
## 4 4150 0 0 0 0 0 0 0 0 1
## 5 3207 0 0 0 0 0 0 0 0 0
## 6 2593 0 0 0 0 0 0 0 0 0
## stateFL stateGA stateHI stateIA stateID stateIL stateIN stateKS stateKY
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## stateLA stateMA stateMD stateME stateMI stateMN stateMO stateMS stateMT
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 1 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## stateNC stateND stateNE stateNH stateNJ stateNM stateNV stateNY stateOH
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## stateOK stateOR statePA stateRI stateSC stateSD stateTN stateTX stateUT
## 1 0 0 0 0 1 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## stateVA stateVT stateWA stateWI stateWV stateWY account_length
## 1 0 0 0 0 0 0 137
## 2 0 0 0 0 0 0 83
## 3 0 0 0 0 1 0 48
## 4 0 0 0 0 0 0 67
## 5 0 0 1 0 0 0 143
## 6 1 0 0 0 0 0 163
## area_codearea_code_415 area_codearea_code_510 international_planyes
## 1 0 1 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 0 1 0
## 6 1 0 0
## voice_mail_planyes number_vmail_messages total_day_minutes total_day_calls
## 1 0 0 109.8 112
## 2 0 0 196.7 117
## 3 1 34 198.0 70
## 4 0 0 164.5 79
## 5 0 0 133.4 107
## 6 0 0 202.9 100
## total_day_charge total_eve_minutes total_eve_calls total_eve_charge
## 1 18.67 223.5 88 19.00
## 2 33.44 272.0 89 23.12
## 3 33.66 273.7 121 23.26
## 4 27.97 110.3 108 9.38
## 5 22.68 223.9 117 19.03
## 6 34.49 178.6 46 15.18
## total_night_minutes total_night_calls total_night_charge total_intl_minutes
## 1 247.5 96 11.14 17.8
## 2 199.9 62 9.00 10.1
## 3 217.9 71 9.81 7.6
## 4 203.9 102 9.18 9.8
## 5 180.4 85 8.12 10.2
## 6 203.8 116 9.17 12.8
## total_intl_calls total_intl_charge number_customer_service_calls
## 1 2 4.81 1
## 2 11 2.73 3
## 3 4 2.05 1
## 4 2 2.65 1
## 5 13 2.75 1
## 6 3 3.46 5
# Create custom indices: myFolds
myFolds <- createFolds(churn_y, k = 5)
# Create reusable trainControl object: myControl
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds
)
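Note the index = myFolds argument: it pins the exact train/test splits inside trainControl, so every model fit with myControl is evaluated on identical folds. That is what makes the resamples() comparison at the end of this chapter an apples-to-apples one.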
Now that you have a reusable trainControl object called myControl, you can start fitting different predictive models to your churn dataset and evaluate their predictive accuracy.
You’ll start with one of my favorite models, glmnet, which penalizes linear and logistic regression models on the size and number of coefficients to help prevent overfitting.
Fit a glmnet model to the churn dataset called model_glmnet. Make sure to use myControl, which you created in the first exercise and is available in your workspace, as the trainControl object.
# # Fit glmnet model: model_glmnet
# model_glmnet <- train(
# x = churn_x,
# y = churn_y,
# metric = "ROC",
# method = "glmnet",
# trControl = myControl
# )
Another one of my favorite models is the random forest, which combines an ensemble of non-linear decision trees into a highly flexible (and usually quite accurate) model.
Rather than using the classic randomForest package, you’ll be using the ranger package, which is a re-implementation of randomForest that produces almost the exact same results, but is faster, more stable, and uses less memory. I highly recommend it as a starting point for random forest modeling in R.
churn_x and churn_y are loaded in your workspace.
Fit a random forest model to the churn dataset. Be sure to use myControl as the trainControl like you’ve done before and implement the “ranger” method.
# Fit random forest: model_rf
# model_rf <- train(
# x = churn_x,
# y = churn_y,
# metric = "ROC",
# method = "ranger",
# trControl = myControl
# )
Now that you have fit two models to the churn dataset, it’s time to compare their out-of-sample predictions and choose which one is the best model for your dataset.
You can compare models in caret using the resamples() function, provided they have the same training data and use the same trainControl object with preset cross-validation folds. resamples() takes as input a list of models and can be used to compare dozens of models at once (though in this case you are only comparing two models).
model_glmnet and model_rf are loaded in your workspace.
Create a list() containing the glmnet model as item1 and the ranger model as item2. Pass this list to the resamples() function and save the resulting object as resamples. Summarize the results by calling summary() on resamples.
# # Create model_list
# model_list <- list(item1 = model_glmnet, item2 = model_rf)
#
# # Pass model_list to resamples(): resamples
# resamples <- resamples(model_list)
#
# # Summarize the results
# summary(resamples)
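If the models above were actually fit, caret's lattice methods give a one-line visual comparison of the resampled ROC distributions (kept commented here, matching the chunk above):
# # Box-and-whisker comparison of resampled ROC
# bwplot(resamples, metric = "ROC")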