Course Description
Machine learning is the study and application of algorithms that learn from and make predictions on data. From search results to self-driving cars, it has manifested itself in all areas of our lives and is one of the most exciting and fast-growing fields of research in the world of data science. This course teaches the big ideas in machine learning: how to build and evaluate predictive models, how to tune them for optimal performance, how to preprocess data for better results, and much more. The popular caret R package, which provides a consistent interface to all of R’s most powerful machine learning facilities, is used throughout the course.
library(dplyr)              # data manipulation
library(ggplot2)            # plotting; also provides the diamonds dataset
install.packages("mlbench") # only needed once
library(mlbench)            # example datasets (e.g. Sonar)
install.packages("caret")   # only needed once
library(caret)              # unified train()/predict() interface used throughout
library(caTools)            # helpers such as colAUC(), used later in the course
#source('create_datasets.R')
In the first chapter of this course, you’ll fit regression models with train() and evaluate their out-of-sample performance using cross-validation and root-mean-square error (RMSE).
RMSE is commonly calculated in-sample on your training set. What’s a potential drawback to calculating training set error?
Answer the question
50 XP
Possible Answers
There’s no potential drawback to calculating training set error, but you should calculate R2 instead of RMSE.
You have no idea how well your model generalizes to new data (i.e. overfitting). [ans]
You should manually inspect your model to validate its coefficients and calculate RMSE.
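To make the drawback concrete, here is a small illustrative sketch (not part of the exercise): fit an overly flexible model to simulated data and compare the in-sample RMSE with the RMSE on rows the model never saw. The simulated data, the 70/30 split, and the polynomial degree are arbitrary choices for illustration.
set.seed(1)
n <- 100
sim <- data.frame(x = runif(n))
sim$y <- sin(2 * pi * sim$x) + rnorm(n, sd = 0.3)
train_sim <- sim[1:70, ]    # rows used to fit the model
test_sim  <- sim[71:100, ]  # held-out rows
fit <- lm(y ~ poly(x, 15), data = train_sim)  # deliberately over-flexible
sqrt(mean((predict(fit) - train_sim$y)^2))                     # in-sample RMSE (optimistic)
sqrt(mean((predict(fit, newdata = test_sim) - test_sim$y)^2))  # held-out RMSE (usually larger)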
head(diamonds)
# Fit lm model: model
model <- lm(price ~ ., diamonds)
# Predict on full data: p
p <- predict(model)
# Compute errors: error
error <- p - diamonds$price
# Calculate RMSE
RMSE <- sqrt(mean(error^2))
RMSE
[1] 1129.843
What is the advantage of using a train/test split rather than just validating your model in-sample on the training set?
Answer the question
50 XP
Possible Answers
It takes less time to calculate error on the test set, since it is smaller than the training set.
There is no advantage to using a test set. You can just use adjusted R2 on your training set.
It gives you an estimate of how well your model performs on new data. [ans]
One way you can take a train/test split of a dataset is to order the dataset randomly, then divide it into the two sets. This ensures that the training set and test set are both random samples and that any biases in the ordering of the dataset (e.g. if it had originally been ordered by price or size) are not retained in the samples you use for training and testing your models. You can think of this like shuffling a brand new deck of playing cards before dealing hands.
First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:
set.seed(42)
Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset.
rows <- sample(nrow(diamonds))
Finally, you can use this random vector to reorder the diamonds dataset:
diamonds <- diamonds[rows, ]
Instructions
100 XP
Set the random seed to 42.
Make a vector of row indices called rows.
Randomly reorder the diamonds data frame.
# Set seed
set.seed(42)
# Shuffle row indices: rows
rows <- sample(nrow(diamonds))
# Randomly order data
diamonds <- diamonds[rows, ]
Now that your dataset is randomly ordered, you can split the first 80% of it into a training set, and the last 20% into a test set. You can do this by choosing a split point approximately 80% of the way through your data:
split <- round(nrow(mydata) * .80)
You can then use this point to break off the first 80% of the dataset as a training set:
mydata[1:split, ]
And then you can use that same point to determine the test set:
mydata[(split + 1):nrow(mydata), ]
Instructions
100 XP
Choose a row index to split on so that the split point is approximately 80% of the way through the diamonds dataset. Call this index split.
Create a training set called train using that index.
Create a test set called test using that index.
# Determine row to split on: split
split <- round(nrow(diamonds) * .80)
# Create train
train <- diamonds[1:split, ]
# Create test
test <- diamonds[(split + 1):nrow(diamonds), ]
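As an optional alternative to the manual split above, caret also provides createDataPartition(), which draws a split stratified on the outcome rather than a purely random cut. The 0.8 proportion mirrors the split above; in_train, train2, and test2 are just illustrative names.
# Optional alternative: stratified split with caret
in_train <- createDataPartition(diamonds$price, p = 0.8, list = FALSE)[, 1]
train2 <- diamonds[in_train, ]
test2  <- diamonds[-in_train, ]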
Now that you have a randomly split training set and test set, you can use the lm() function as you did in the first exercise to fit a model to your training set, rather than the entire dataset. Recall that you can use the formula interface to the linear regression function to fit a model with a specified target variable using all other variables in the dataset as predictors:
mod <- lm(y ~ ., training_data)
You can use the predict() function to make predictions from that model on new data. The new dataset must have all of the columns from the training data, but they can be in a different order with different values. Here, rather than re-predicting on the training set, you can predict on the test set, which you did not use for training the model. This will allow you to determine the out-of-sample error for the model in the next exercise:
p <- predict(model, new_data)
Instructions
100 XP
Fit an lm() model called model to predict price using all other variables as covariates. Be sure to use the training set, train.
Predict on the test set, test, using predict(). Store these values in a vector called p.
# Fit lm model on train: model
model <- lm(price ~ ., train)
# Predict on test: p
p <- predict(model, newdata = test)
Now that you have predictions on the test set, you can use these predictions to calculate an error metric (in this case RMSE) on the test set and see how the model performs out-of-sample, rather than in-sample as you did in the first exercise. You first do this by calculating the errors between the predicted diamond prices and the actual diamond prices by subtracting the predictions from the actual values.
Once you have an error vector, calculating RMSE is as simple as squaring it, taking the mean, then taking the square root:
sqrt(mean(error^2))
Instructions
100 XP
test, model, and p are loaded in your workspace.
Calculate the error between the predictions on the test set and the actual diamond prices in the test set. Call this error.
Calculate RMSE using this error vector, just printing the result to the console.
# Compute errors: error
error <- p - test$price
# Calculate RMSE
sqrt(mean(error^2))
[1] 1136.596
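If you prefer not to compute the metric by hand, caret ships helpers that do the same arithmetic; this is just an optional alternative to the manual calculation above.
# Optional: caret's built-in metric helpers on the same predictions
postResample(pred = p, obs = test$price)  # RMSE, R-squared and MAE together
RMSE(p, test$price)                       # just the root-mean-square error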
Why is the test set RMSE higher than the training set RMSE?
Answer the question
50 XP
Possible Answers
Because you overfit the training set and the test set contains data the model hasn’t seen before. [ans]
Because you should not use a test set at all and instead just look at error on the training set.
Because the test set has a smaller sample size than the training set and thus the mean error is lower.
Remark: Though the test set has the smaller sample size, the mean error is not necessarily lower.
What is the advantage of cross-validation over a single train/test split?
Answer the question
50 XP
Possible Answers
There is no advantage to cross-validation, just as there is no advantage to a single train/test split. You should be validating your models in-sample with a metric like adjusted R2.
You can pick the best test set to minimize the reported RMSE of your model.
It gives you multiple estimates of out-of-sample error, rather than a single estimate. [ans]
Remark: If all of your estimates give similar outputs, you can be more certain of the model’s accuracy. If your estimates give different outputs, that tells you the model does not perform consistently and suggests a problem with it.
As you saw in the video, a better approach to validating models is to use multiple systematic test sets, rather than a single random train/test split. Fortunately, the caret package makes this very easy to do:
model <- train(y ~ ., my_data)
caret supports many types of cross-validation, and you can specify which type of cross-validation and the number of cross-validation folds with the trainControl() function, which you pass to the trControl argument in train():
model <- train(
  y ~ ., my_data,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE
  )
)
It’s important to note that you pass the method for modeling to the main train() function and the method for cross-validation to the trainControl() function.
Instructions
100 XP
Use the train() function and 10-fold cross-validation. (Note that we’ve taken a subset of the full diamonds dataset to speed up this operation, but it’s still named diamonds.)
Print the model to the console and examine the results.
# Fit lm model using 10-fold CV: model
model <- train(
price ~ ., diamonds,
method = "lm",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE
)
)
+ Fold01: intercept=TRUE
- Fold01: intercept=TRUE
+ Fold02: intercept=TRUE
- Fold02: intercept=TRUE
+ Fold03: intercept=TRUE
- Fold03: intercept=TRUE
+ Fold04: intercept=TRUE
- Fold04: intercept=TRUE
+ Fold05: intercept=TRUE
- Fold05: intercept=TRUE
+ Fold06: intercept=TRUE
- Fold06: intercept=TRUE
+ Fold07: intercept=TRUE
- Fold07: intercept=TRUE
+ Fold08: intercept=TRUE
- Fold08: intercept=TRUE
+ Fold09: intercept=TRUE
- Fold09: intercept=TRUE
+ Fold10: intercept=TRUE
- Fold10: intercept=TRUE
Aggregating results
Fitting final model on full training set
# Print model to console
model
Linear Regression
53940 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 48547, 48546, 48546, 48545, 48545, 48545, ...
Resampling results:
RMSE Rsquared MAE
1130.658 0.9197492 740.4646
Tuning parameter 'intercept' was held constant at a value of TRUE
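The train() object also keeps the fold-by-fold results, so you can check how much the error varies across the 10 folds; this is just an inspection aid, not part of the exercise.
# Per-fold resampling results stored on the model
model$resample
# Spread of the fold-level RMSE values
summary(model$resample$RMSE)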
In this course, you will use a wide variety of datasets to explore the full flexibility of the caret package. Here, you will use the famous Boston housing dataset, where the goal is to predict median home values in various Boston suburbs.
You can use exactly the same code as in the previous exercise, but change the dataset used by the model:
model <- train(
  medv ~ ., Boston,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE
  )
)
Next, you can reduce the number of cross-validation folds from 10 to 5 using the number argument to trainControl():
trControl = trainControl(
  method = "cv", number = 5,
  verboseIter = TRUE
)
Instructions
100 XP
Fit an lm() model to the Boston housing dataset, such that medv is the response variable and all other variables are explanatory variables.
Use 5-fold cross-validation rather than 10-fold cross-validation.
Print the model to the console and inspect the results.
library(MASS) # For loading the Boston dataset
# Fit lm model using 5-fold CV: model
model <- train(
medv ~ ., Boston,
method = "lm",
trControl = trainControl(
method = "cv", number = 5,
verboseIter = TRUE
)
)
+ Fold1: intercept=TRUE
- Fold1: intercept=TRUE
+ Fold2: intercept=TRUE
- Fold2: intercept=TRUE
+ Fold3: intercept=TRUE
- Fold3: intercept=TRUE
+ Fold4: intercept=TRUE
- Fold4: intercept=TRUE
+ Fold5: intercept=TRUE
- Fold5: intercept=TRUE
Aggregating results
Fitting final model on full training set
# Print model to console
model
Linear Regression
506 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 405, 405, 404, 405, 405
Resampling results:
RMSE Rsquared MAE
4.875484 0.7316537 3.425015
Tuning parameter 'intercept' was held constant at a value of TRUE
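If you want the underlying lm fit itself (coefficients, standard errors, and so on), train() stores it in the finalModel slot; a quick way to look at it:
# The final lm model that caret refit on all 506 rows
summary(model$finalModel)
coef(model$finalModel)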
You can do more than just one iteration of cross-validation: repeating the entire cross-validation procedure gives you a better estimate of the test-set error. It takes longer, but it gives you many more out-of-sample datasets to look at and a much more precise assessment of how well the model performs.
One of the awesome things about the train() function in caret is how easy it is to run very different models or methods of cross-validation just by tweaking a few simple arguments to the function call. For example, you could repeat your entire cross-validation procedure 5 times for greater confidence in your estimates of the model’s out-of-sample accuracy, e.g.:
trControl = trainControl(
  method = "cv", number = 5,
  repeats = 5, verboseIter = TRUE
)
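One caveat worth knowing (and the reason for the warning in the output below): with method = "cv", caret ignores the repeats argument. For the repeats to actually take effect, trainControl() needs method = "repeatedcv". A minimal sketch of that control object:
# Control object for genuinely repeated CV: 5 folds, repeated 5 times
trControl = trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5,
  verboseIter = TRUE
)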
Instructions
100 XP
Re-fit the linear regression model to the Boston housing dataset.
Use 5 repeats of 5-fold cross-validation.
Print the model to the console.
# Fit lm model using 5 x 5-fold CV: model
model <- train(
medv ~ ., Boston,
method = "lm",
trControl = trainControl(
method = "cv", number = 5,
repeats = 5, verboseIter = TRUE
)
)
`repeats` has no meaning for this resampling method.
+ Fold1: intercept=TRUE
- Fold1: intercept=TRUE
+ Fold2: intercept=TRUE
- Fold2: intercept=TRUE
+ Fold3: intercept=TRUE
- Fold3: intercept=TRUE
+ Fold4: intercept=TRUE
- Fold4: intercept=TRUE
+ Fold5: intercept=TRUE
- Fold5: intercept=TRUE
Aggregating results
Fitting final model on full training set
# Print model to console
model
Linear Regression
506 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 404, 406, 406, 403, 405
Resampling results:
RMSE Rsquared MAE
4.858793 0.7247119 3.39237
Tuning parameter 'intercept' was held constant at a value of TRUE
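Because every train() fit stores its resampling results, you can also line this model up against a second caret model with resamples(). The tree model below is purely illustrative: method = "rpart" is just an example method, and for a strict comparison you would fix the folds, e.g. via the index argument of trainControl() or a common seed before each fit.
# Illustrative comparison of two caret models on Boston
model_tree <- train(
  medv ~ ., Boston,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5)
)
comparison <- resamples(list(lm = model, tree = model_tree))
summary(comparison)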
Finally, the model you fit with the train() function has the exact same predict() interface as the linear regression models you fit earlier in this chapter.
After fitting a model with train(), you can simply call predict() with new data, e.g.:
predict(my_model, new_data)
Instructions
100 XP
# Predict on full Boston dataset
p <- predict(model, Boston)
p
1 2 3 4 5 6 7 8 9
30.0038434 25.0255624 30.5675967 28.6070365 27.9435242 25.2562845 23.0018083 19.5359884 11.5236369
10 11 12 13 14 15 16 17 18
18.9202621 18.9994965 21.5867957 20.9065215 19.5529028 19.2834821 19.2974832 20.5275098 16.9114013
... (506 fitted values in total; remaining output truncated)
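As a quick sanity check, you can run the chapter's RMSE recipe on these predictions; note that this is effectively an in-sample number, because the final model was refit on all 506 Boston rows.
# In-sample RMSE of the cross-validated model's final fit
error <- p - Boston$medv
sqrt(mean(error^2))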