# Fit a linear model on the breast_cancer_data.
linear_model <- lm(concavity_mean ~ symmetry_mean, data = breast_cancer_data)
# Look at the summary of the linear_model.
summary(linear_model)
##
## Call:
## lm(formula = concavity_mean ~ symmetry_mean, data = breast_cancer_data)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.201877 -0.039201 -0.008432  0.030655  0.226150
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.15311    0.04086  -3.747 0.000303 ***
## symmetry_mean  1.33366    0.21257   6.274 9.57e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06412 on 98 degrees of freedom
## Multiple R-squared: 0.2866, Adjusted R-squared: 0.2793
## F-statistic: 39.36 on 1 and 98 DF, p-value: 9.575e-09
# Extract the coefficients.
coefficients(linear_model)
##   (Intercept) symmetry_mean
##    -0.1531055     1.3336568
Which of the following is a hyperparameter in the linear model from your last exercise? The breast_cancer_data has again been loaded and the linear model has been built just as before:
# linear_model <- lm(concavity_mean ~ symmetry_mean,
#                    data = breast_cancer_data)
Note that hyperparameters can be found in the help section for a function, while model parameters are part of the output of a function.
help(lm)
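To make the distinction concrete, here is a minimal sketch: the coefficients are model parameters estimated from the data during fitting, while arguments of lm() listed in its help page, such as method (default "qr"), are hyperparameters we choose before fitting. The name linear_model_qr below is only illustrative.
# Model parameters: estimated from the data.
coefficients(linear_model)
# Hyperparameters: set before fitting, e.g. the fitting method.
linear_model_qr <- lm(concavity_mean ~ symmetry_mean,
                      data = breast_cancer_data,
                      method = "qr")  # "qr" is the default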
library(ggplot2)
# Plot linear relationship.
ggplot(data = breast_cancer_data,
       aes(x = symmetry_mean, y = concavity_mean)) +
  geom_point(color = "grey") +
  geom_abline(slope = linear_model$coefficients[2],
              intercept = linear_model$coefficients[1])
Machine Learning with caret - splitting data
# Load caret and set seed
library(caret)
## Loading required package: lattice
set.seed(42)
# Create caret partition index
index <- createDataPartition(breast_cancer_data$diagnosis, p = 0.70, list = FALSE)
# Subset 'breast_cancer_data' with index
bc_train_data <- breast_cancer_data[index, ]
bc_test_data <- breast_cancer_data[-index, ]
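As a quick sanity check: createDataPartition() samples within each class of diagnosis, so the class proportions should be roughly the same in the full data and in both subsets.
# Compare class proportions across the full data, train and test sets.
prop.table(table(breast_cancer_data$diagnosis))
prop.table(table(bc_train_data$diagnosis))
prop.table(table(bc_test_data$diagnosis))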
Train a machine learning model with caret
Set up cross-validation:
library(caret)
library(tictoc)
# 3-fold cross-validation, repeated 5 times.
fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
Train a Random Forest model:
tic()
set.seed(42)
rf_model <- train(diagnosis ~ ., data = bc_train_data, method = "rf", trControl = fitControl, verbose = FALSE)
toc()
## 0.855 sec elapsed
rf_model
## Random Forest
##
## 30 samples
## 10 predictors
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times)
## Summary of sample sizes: 20, 20, 20, 20, 20, 20, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy  Kappa
##    2    0.88      0.76
##    6    0.88      0.76
##   10    0.88      0.76
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
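The winning hyperparameter values can also be pulled straight from the trained object, a minimal sketch using the rf_model from above:
# Best hyperparameter combination found during resampling.
rf_model$bestTune
# Final model, refit on the full training set with that combination.
rf_model$finalModel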
# Create partition index
index <- createDataPartition(breast_cancer_data$diagnosis, p = 0.7, list = FALSE)
# Subset `breast_cancer_data` with index
bc_train_data <- breast_cancer_data[index, ]
bc_test_data <- breast_cancer_data[-index, ]
# Define 3x5 folds repeated cross-validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# # Run the train() function
# gbm_model <- caret::train(diagnosis ~ .,
#                           data = bc_train_data,
#                           method = "gbm",
#                           trControl = fitControl,
#                           verbose = FALSE)
#
# # Look at the model
# gbm_model
In the previous exercise, you defined a 3x5 folds repeated cross-validation resampling scheme with the following code:
# fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
Which of the following is NOT a valid resampling method in caret? Note: the caret package has already been loaded for you.
help(trainControl)
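For reference, the method argument of trainControl() accepts, among others, the following values (from help(trainControl)):
# "boot", "boot632", "optimism_boot", "cv", "repeatedcv", "LOOCV",
# "LGOCV", "none", "oob", "timeslice", "adaptive_cv", "adaptive_boot"
# and "adaptive_LGOCV"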
Hyperparameters are specific to model algorithms
# modelLookup(model)
# https://topepo.github.io/caret/available-models.html
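modelLookup() takes the method name as a string and returns its tunable hyperparameters; a quick sketch for the random forest model trained above:
# "rf" tunes a single hyperparameter: mtry, the number of randomly
# sampled predictors considered at each split.
# modelLookup("rf")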
Hyperparameters in Support Vector Machines (SVM)
# library(caret)
# library(tictoc)
# fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
# tic()
# set.seed(42)
# svm_model <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, verbose = FALSE)
# toc()
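svmPoly tunes three hyperparameters, which can be confirmed with modelLookup():
# degree (polynomial degree), scale and C (cost).
# modelLookup("svmPoly")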
Defining hyperparameters for automatic tuning
tuneLength
# tic()
# set.seed(42)
# svm_model_2 <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, verbose = FALSE, tuneLength = 5)
# toc()
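tuneLength sets how many candidate values caret generates per hyperparameter, so for svmPoly's three hyperparameters a tuneLength of 5 spans a grid of up to 5 x 5 x 5 combinations (caret may cap individual parameters, so the exact count can be smaller), which is why this run takes noticeably longer. The grid actually evaluated can be inspected afterwards:
# Number of hyperparameter combinations evaluated during tuning.
# nrow(svm_model_2$results)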
Manual hyperparameter tuning in caret
tuneGrid + expand.grid
# library(caret)
# library(tictoc)
# hyperparams <- expand.grid(degree = 4, scale = 1, C = 1)
# tic()
# set.seed(42)
# svm_model_3 <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, tuneGrid = hyperparams, verbose = FALSE)
# toc()
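expand.grid() can just as well span several values per hyperparameter; caret then trains and resamples one model per combination. A sketch with hypothetical values (here 3 x 1 x 2 = 6 combinations):
# hyperparams_grid <- expand.grid(degree = c(2, 3, 4),
#                                 scale = 1,
#                                 C = c(0.1, 1))
# nrow(hyperparams_grid)  # 6 combinations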
In the previous lesson, you built a Stochastic Gradient Boosting model in caret. A similar model as the one from before has been preloaded as gbm_model. In order to optimize this model, you want to tune its hyperparameters. Which of the following is NOT a hyperparameter of the gbm method?
Note: The library caret has also been preloaded.
# # Set seed.
# set.seed(42)
# # Start timer.
# tic()
# # Train model.
# gbm_model <- train(diagnosis ~ .,
#                    data = bc_train_data,
#                    method = "gbm",
#                    trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
#                    verbose = FALSE,
#                    tuneLength = 4)
# # Stop timer.
# toc()
# # Define hyperparameter grid.
# hyperparams <- expand.grid(n.trees = 200,
#                            interaction.depth = 1,
#                            shrinkage = 0.1,
#                            n.minobsinnode = 10)
#
# # Apply hyperparameter grid to train().
# set.seed(42)
# gbm_model <- train(diagnosis ~ .,
#                    data = bc_train_data,
#                    method = "gbm",
#                    trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
#                    verbose = FALSE,
#                    tuneGrid = hyperparams)
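The grid above covers exactly the four hyperparameters that caret tunes for the gbm method, which doubles as a hint for the question above:
# n.trees, interaction.depth, shrinkage and n.minobsinnode are tunable;
# any other gbm() argument is passed through but not tuned.
# modelLookup("gbm")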
mlr
What is H2O?
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 18 minutes 56 seconds
## H2O cluster timezone: America/Denver
## H2O data parsing timezone: UTC
## H2O cluster version: 3.28.0.2
## H2O cluster version age: 11 days
## H2O cluster name: H2O_started_from_R_michael_clm836
## H2O cluster total nodes: 1
## H2O cluster total memory: 2.00 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.2 (2019-12-12)
Preparing the data for modeling with H2O
Data as H2O frame
# seeds_data_hf <- as.h2o(seeds_data)
Define features and target variable
# y <- "seed_type"
# x <- setdiff(colnames(seeds_data_hf), y)
For classification, the target should be a factor
# seeds_data_hf[, y] <- as.factor(seeds_data_hf[, y])
Training, validation and test sets
# sframe <- h2o.splitFrame(data = seeds_data_hf,
#                          ratios = c(0.7, 0.15),
#                          seed = 42)
# train <- sframe[[1]]
# valid <- sframe[[2]]
# test <- sframe[[3]]
# summary(train$seed_type, exact_quantiles = TRUE)
# summary(test$seed_type, exact_quantiles = TRUE)
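h2o.splitFrame() splits probabilistically rather than exactly, so ratios = c(0.7, 0.15) yields only approximately 70% training, 15% validation and 15% test data; a quick check following the commented code above:
# nrow(train) / nrow(seeds_data_hf)  # ~0.70
# nrow(valid) / nrow(seeds_data_hf)  # ~0.15
# nrow(test)  / nrow(seeds_data_hf)  # ~0.15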
Model training with H2O
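As a minimal sketch of this step, a Gradient Boosting Machine could be trained on the frames defined above (arguments are illustrative; see help(h2o.gbm)):
# gbm_model <- h2o.gbm(x = x, y = y,
#                      training_frame = train,
#                      validation_frame = valid,
#                      seed = 42)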
Evaluate model performance with H2O
Model performance
# perf <- h2o.performance(gbm_model, test)
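h2o.performance() returns an H2OModelMetrics object from which individual metrics can then be extracted, for example:
# h2o.confusionMatrix(perf)
# h2o.logloss(perf)
# h2o.mean_per_class_error(perf)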
# Initialise h2o cluster
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 18 minutes 57 seconds
## H2O cluster timezone: America/Denver
## H2O data parsing timezone: UTC
## H2O cluster version: 3.28.0.2
## H2O cluster version age: 11 days
## H2O cluster name: H2O_started_from_R_michael_clm836
## H2O cluster total nodes: 1
## H2O cluster total memory: 2.00 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.2 (2019-12-12)
# Convert data to h2o frame
seeds_train_data_hf <- as.h2o(seeds_train_data)
## |======================================================================| 100%
# Identify target and features
y <- "seed_type"
x <- setdiff(colnames(seeds_train_data_hf), y)
# Split data into train & validation sets
sframe <- h2o.splitFrame(seeds_train_data_hf, seed = 42)
train <- sframe[[1]]
valid <- sframe[[2]]
# Check the distribution of the target variable in the training set
summary(train$seed_type, exact_quantiles = TRUE)
## seed_type
## Min. :1
## 1st Qu.:1
## Median :2
## Mean :2
## 3rd Qu.:3
## Max. :3
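Note that seed_type is still numeric here, which is why summary() reports quartiles rather than class counts; for classification it would first be converted to a factor, as in the commented code earlier:
# train$seed_type <- as.factor(train$seed_type)
# h2o.table(train$seed_type)  # counts per seed type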
Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate for an MS in Applied Analytics at Columbia University.