[Video]
From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values. For example, predicting a home's sale price from its size and location is a regression task, while predicting whether it sells within a month is not (that is classification).
Which of the following tasks are regression problems?
[Video]
Linear Regression in R: lm()
# cmodel <- lm(temperature ~ chirps_per_sec, data = cricket)
Formulas
fmla_1 <- temperature ~ chirps_per_sec
fmla_2 <- blood_pressure ~ age + weight
fmla_1 <- as.formula("temperature ~ chirps_per_sec")
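Formulas can also be assembled from strings, which is convenient when the variable names are stored in a character vector; a small sketch (the same paste()/as.formula() pattern appears again in the ranger exercise later in these notes):
outcome <- "blood_pressure"
vars <- c("age", "weight")
# Paste the pieces into a string, then convert the string to a formula object
(fmla_3 <- as.formula(paste(outcome, "~", paste(vars, collapse = " + "))))
## blood_pressure ~ age + weight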
Looking at the Model
# cmodel
More Information about the Model
# summary(cmodel)
#
# broom::glance(cmodel)
#
# sigr::wrapFTest(cmodel)
# unemployment is loaded in the workspace
summary(unemployment)
## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment
# Print it
fmla
## female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)
# Print it
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
broom::glance(unemployment_model)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.821 0.805 0.580 50.6 1.97e-5 2 -10.3 26.6 28.3
## # … with 2 more variables: deviance <dbl>, df.residual <int>
# Call wrapFTest() on unemployment_model to see the most relevant details
sigr::wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."
[Video]
Predicting From the Training Data
# cricket$prediction <- predict(model)
Looking at the Predictions
# ggplot(cricket, aes(x = prediction, y = temperature)) +
# geom_point() +
# geom_abline(color = "darkblue") +
# ggtitle("temperature vs. linear model prediction")
Predicting on New Data
# newchirps <- data.frame(chirps_per_sec = 16.5)
# newchirps$prediction <- predict(cmodel, newdata = newchirps)
# newchirps
# unemployment is in your workspace
summary(unemployment)
## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# newrates is in your workspace
newrates
## male_unemployment
## 1 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model, newdata = unemployment)
# load the ggplot2 package
library(ggplot2)
# Make a plot to compare predictions to actual (prediction on x axis).
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
geom_point() +
geom_abline(color = "blue")
# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata = newrates)
# Print it
pred
## 1
## 4.906757
# bloodpressure is in the workspace
summary(bloodpressure)
## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla
## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)
# Print bloodpressure_model and call summary()
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
summary(bloodpressure_model)
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## age 0.8614 0.2482 3.470 0.00844 **
## weight 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
# bloodpressure is in your workspace
summary(bloodpressure)
## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# bloodpressure_model is in your workspace
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
# Predict blood pressure using bloodpressure_model: prediction
bloodpressure$prediction <- predict(bloodpressure_model, newdata = bloodpressure)
# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) +
geom_point() +
geom_abline(color = "blue")
[Video]
[Video]
Reading the Gain Curve
# GainCurvePlot(houseprices, "prediction", "price", "Home price model")
The x-axis is the fraction of houses, sorted from highest to lowest predicted price; the y-axis is the cumulative fraction of total sale price accumulated. A model that sorts the data well produces a gain curve that hugs the "wizard curve" (the curve you would get by sorting on the actual outcome).
# unemployment, unemployment_model are in the workspace
summary(unemployment)
## male_unemployment female_unemployment prediction
## Min. :2.900 Min. :4.000 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, newdata = unemployment)
# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) +
geom_point() +
geom_abline()
# From previous step
unemployment$predictions <- predict(unemployment_model)
# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions
# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("residuals vs. linear model prediction")
# unemployment is in the workspace (with predictions)
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Load the package WVPlots
library(WVPlots)
# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")
[Video]
RMSE of the Home Sales Price Model
# Calculate error
# err <- houseprices$prediction - houseprices$price
# Square the error vector
# err2 <- err ^ 2
# Take the mean, and sqrt it
# (rmse <- sqrt(mean(err2)))
Is the RMSE Large or Small?
RMSE ≈ 58.3; sd(price) ≈ 135
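The exercises below use an rmse() helper that is pre-loaded in the workspace (the workspace listing later in these notes shows its (prediction, actual) signature); a minimal sketch of an equivalent function:
# Root mean squared error of predictions against actual outcomes
rmse <- function(prediction, actual) {
  sqrt(mean((prediction - actual)^2))
}
# e.g. rmse(houseprices$prediction, houseprices$price) for the home price model;
# the model is doing useful work when its RMSE is well below sd(price).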
# unemployment is in the workspace
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# For convenience put the residuals in the variable res
res <- unemployment$residuals
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res ^ 2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271
[Video]
What is R^2?
A measure of how well the model fits or explains the data
A value between 0 and 1:
near 1: the model fits well
near 0: no better than guessing the average value
Calculating R^2
R^2 is the variance explained by the model:
R^2 = 1 − RSS / SStot
where
RSS = ∑(y − prediction)^2 is the residual sum of squares (variance from the model), and
SStot = ∑(y − ȳ)^2 is the total sum of squares (variance of the data).
Calculate R^2 of the House Price Model: RSS
Calculate error
# err <- houseprices$prediction - houseprices$price
Square it and take the sum
# rss <- sum(err^2)
price: column of actual sale prices (in thousands)
pred: column of predicted sale prices (in thousands)
RSS ≈ 136138
Calculate R^2 of the House Price Model: SStot
Take the difference of prices from the mean price
# toterr <- houseprices$price - mean(houseprices$price)
Square it and take the sum
# sstot <- sum(toterr^2)
RSS ≈ 136138; SStot ≈ 713615
Calculate R^2 of the House Price Model
# (r_squared <- 1 - (rss/sstot))
RSS ≈ 136138; SStot ≈ 713615; R^2 ≈ 0.809
Reading R^2 from the lm() model
# # From summary()
# summary(hmodel)
# summary(hmodel)$r.squared
# # From glance()
# glance(hmodel)$r.squared
Correlation and R^2
# rho <- cor(houseprices$prediction, houseprices$price)
# rho^2
ρ = cor(prediction, price) = 0.8995709
ρ^2 = 0.8092278 = R^2
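The later exercises also use a pre-loaded r_squared() helper; a sketch consistent with the R^2 formula above (same (prediction, actual) signature as in the workspace listing):
# Fraction of variance explained: 1 - RSS/SStot
r_squared <- function(prediction, actual) {
  rss <- sum((actual - prediction)^2)      # residual sum of squares
  sstot <- sum((actual - mean(actual))^2)  # total sum of squares
  1 - rss / sstot
}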
# unemployment is in your workspace
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
## [1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals ^ 2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - rss / tss)
## [1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- broom::glance(unemployment_model)$r.squared)
## [1] 0.8213157
# unemployment is in your workspace
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
## [1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho ^ 2)
## [1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- broom::glance(unemployment_model)$r.squared)
## [1] 0.8213157
[Video]
Create a cross-validation plan
# library(vtreat)
# splitPlan <- kWayCrossValidation(nRows, nSplits, NULL, NULL)
# library(vtreat)
# splitPlan <- kWayCrossValidation(10, 3, NULL, NULL)
First fold (A and B to train, C to test)
# splitPlan[[1]]
Train A and B, test on C, etc.
# split <- splitPlan[[1]]
# model <- lm(fmla, data = df[split$train, ])
# df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
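Putting the pieces together over all the folds, a self-contained sketch with synthetic data (df, x, and y here are illustrative, not from the course):
library(vtreat)
# Synthetic data, purely for illustration
df <- data.frame(x = runif(100))
df$y <- 2 * df$x + rnorm(100, sd = 0.1)
# Create a 3-fold cross-validation plan
splitPlan <- kWayCrossValidation(nrow(df), 3, NULL, NULL)
# Fit on each fold's training rows, predict on its held-out ("app") rows
df$pred.cv <- 0
for (i in seq_along(splitPlan)) {
  split <- splitPlan[[i]]
  model <- lm(y ~ x, data = df[split$train, ])
  df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
}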
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
dim(mpg)
## [1] 234 11
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
## [1] 234
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))
## [1] 176
# Create the vector of N uniform random variables: gp
gp <- runif(N)
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]
# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
## [1] 185
nrow(mpg_test)
## [1] 49
# mpg_train is in the workspace
summary(mpg_train)
## manufacturer model displ year
## Length:185 Length:185 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :1999
## Mean :3.438 Mean :2003
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :6.200 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:185 Length:185 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.854 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:185 Length:185
## 1st Qu.:18.00 Class :character Class :character
## Median :25.00 Mode :character Mode :character
## Mean :23.45
## 3rd Qu.:27.00
## Max. :44.00
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
## cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy
mpg_model <- lm(fmla, data = mpg_train)
# Use summary() to examine the model
summary(mpg_model)
##
## Call:
## lm(formula = fmla, data = mpg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9294 -0.6836 -0.0437 0.6449 4.5621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.71513 0.38349 1.865 0.0638 .
## hwy 0.68857 0.01586 43.413 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.267 on 183 degrees of freedom
## Multiple R-squared: 0.9115, Adjusted R-squared: 0.911
## F-statistic: 1885 on 1 and 183 DF, p-value: < 2.2e-16
# Examine the objects in the workspace
ls.str()
## bloodpressure : 'data.frame': 11 obs. of 4 variables:
## $ blood_pressure: int 132 143 153 162 154 168 137 149 159 128 ...
## $ age : int 52 59 67 73 64 74 54 61 65 46 ...
## $ weight : int 173 184 194 211 196 220 188 188 207 167 ...
## $ prediction : num 134 143 154 165 152 ...
## bloodpressure_model : List of 12
## $ coefficients : Named num [1:3] 30.994 0.861 0.335
## $ residuals : Named num [1:11] -1.718 -0.432 -0.672 -2.533 2.243 ...
## $ effects : Named num [1:11] -497.8 42.17 5.94 -2.11 2.65 ...
## $ rank : int 3
## $ fitted.values: Named num [1:11] 134 143 154 165 152 ...
## $ assign : int [1:3] 0 1 2
## $ qr :List of 5
## $ df.residual : int 8
## $ xlevels : Named list()
## $ call : language lm(formula = fmla, data = bloodpressure)
## $ terms :Classes 'terms', 'formula' language blood_pressure ~ age + weight
## $ model :'data.frame': 11 obs. of 3 variables:
## fe_mean : num 5.57
## fmla : Class 'formula' language cty ~ hwy
## fmla_1 : Class 'formula' language temperature ~ chirps_per_sec
## fmla_2 : Class 'formula' language blood_pressure ~ age + weight
## gp : num [1:234] 0.3815 0.0202 0.2264 0.5662 0.4676 ...
## mpg_model : List of 12
## $ coefficients : Named num [1:2] 0.715 0.689
## $ residuals : Named num [1:185] -2.684 0.316 -2.061 -0.372 -2.618 ...
## $ effects : Named num [1:185] -229.387 -55.007 -1.919 -0.225 -2.447 ...
## $ rank : int 2
## $ fitted.values: Named num [1:185] 20.7 20.7 22.1 21.4 18.6 ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## $ df.residual : int 183
## $ xlevels : Named list()
## $ call : language lm(formula = fmla, data = mpg_train)
## $ terms :Classes 'terms', 'formula' language cty ~ hwy
## $ model :'data.frame': 185 obs. of 2 variables:
## mpg_test : Classes 'tbl_df', 'tbl' and 'data.frame': 49 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "chevrolet" "chevrolet" ...
## $ model : chr "a4" "a4 quattro" "corvette" "corvette" ...
## $ displ : num 3.1 2 5.7 7 6.5 2.4 3 3.3 3.7 3.9 ...
## $ year : int 2008 2008 1999 2008 1999 2008 1999 1999 2008 1999 ...
## $ cyl : int 6 4 8 8 8 4 6 6 6 6 ...
## $ trans : chr "auto(av)" "manual(m6)" "manual(m6)" "manual(m6)" ...
## $ drv : chr "f" "4" "r" "r" ...
## $ cty : int 18 20 16 15 14 22 17 16 15 14 ...
## $ hwy : int 27 28 26 24 17 30 24 22 19 17 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "2seater" "2seater" ...
## mpg_train : Classes 'tbl_df', 'tbl' and 'data.frame': 185 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 1.8 1.8 2 2.8 ...
## $ year : int 1999 1999 2008 2008 1999 1999 1999 1999 2008 1999 ...
## $ cyl : int 4 4 4 4 6 6 4 4 4 6 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 16 19 15 ...
## $ hwy : int 29 29 31 30 26 26 26 25 27 25 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
## mpgtest : Classes 'tbl_df', 'tbl' and 'data.frame': 54 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4 quattro" "a4 quattro" ...
## $ displ : num 2 3.1 1.8 2 4.2 5.3 6.5 2.4 3.7 5.2 ...
## $ year : int 2008 2008 1999 2008 2008 2008 1999 1999 2008 1999 ...
## $ cyl : int 4 6 4 4 8 8 8 4 6 8 ...
## $ trans : chr "auto(av)" "auto(av)" "auto(l5)" "auto(s6)" ...
## $ drv : chr "f" "f" "4" "4" ...
## $ cty : int 21 18 16 19 16 11 14 19 15 11 ...
## $ hwy : int 30 27 25 27 23 14 17 27 19 17 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
## mpgtrain : Classes 'tbl_df', 'tbl' and 'data.frame': 180 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2.8 2.8 1.8 2 2.8 2.8 3.1 ...
## $ year : int 1999 1999 2008 1999 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 6 6 4 4 6 6 6 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(l5)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 16 18 18 20 15 17 17 ...
## $ hwy : int 29 29 31 26 26 26 28 25 25 25 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
## N : int 234
## newrates : 'data.frame': 1 obs. of 1 variable:
## $ male_unemployment: int 5
## pred : Named num 4.91
## r_squared : function (prediction, actual)
## res : num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
## rho : num 0.906
## rho2 : num 0.821
## rmse : function (prediction, actual)
## rsq : num 0.821
## rsq_glance : num 0.821
## rss : num 3.7
## sd_unemployment : num 1.31
## target : num 176
## tss : num 20.7
## unemployment : 'data.frame': 13 obs. of 5 variables:
## $ male_unemployment : num 2.9 6.7 4.9 7.9 9.8 ...
## $ female_unemployment: num 4 7.4 5 7.2 7.9 ...
## $ prediction : num 3.45 6.09 4.84 6.92 8.24 ...
## $ predictions : num 3.45 6.09 4.84 6.92 8.24 ...
## $ residuals : num 0.552 1.313 0.163 0.279 -0.34 ...
## unemployment_model : List of 12
## $ coefficients : Named num [1:2] 1.434 0.695
## $ residuals : Named num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
## $ effects : Named num [1:13] -20.08 -4.126 0.106 -0.264 -1.192 ...
## $ rank : int 2
## $ fitted.values: Named num [1:13] 3.45 6.09 4.84 6.92 8.24 ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## $ df.residual : int 11
## $ xlevels : Named list()
## $ call : language lm(formula = fmla, data = unemployment)
## $ terms :Classes 'terms', 'formula' language female_unemployment ~ male_unemployment
## $ model :'data.frame': 13 obs. of 2 variables:
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model, newdata = mpg_train)
# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, newdata = mpg_test)
# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
## [1] 1.260217
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.198051
# Evaluate the R-squared on both training and test data and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
## [1] 0.9114938
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.9219146
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) +
geom_point() +
geom_abline()
# Load the package vtreat
library(vtreat)
## Loading required package: wrapr
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
# Get the number of rows in mpg
nRows <- nrow(mpg)
# Implement the 3-fold cross-validation plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)
# Examine the split plan
str(splitPlan)
## List of 3
## $ :List of 2
## ..$ train: int [1:156] 1 3 6 8 11 13 14 15 16 19 ...
## ..$ app : int [1:78] 223 9 54 77 27 99 67 18 119 103 ...
## $ :List of 2
## ..$ train: int [1:156] 1 2 4 5 6 7 8 9 10 12 ...
## ..$ app : int [1:78] 19 23 159 219 138 154 204 139 124 101 ...
## $ :List of 2
## ..$ train: int [1:156] 2 3 4 5 7 9 10 11 12 14 ...
## ..$ app : int [1:78] 85 150 217 51 122 175 135 31 76 60 ...
## - attr(*, "splitmethod")= chr "kwaycross"
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
# splitPlan is in the workspace
str(splitPlan)
## List of 3
## $ :List of 2
## ..$ train: int [1:156] 1 3 6 8 11 13 14 15 16 19 ...
## ..$ app : int [1:78] 223 9 54 77 27 99 67 18 119 103 ...
## $ :List of 2
## ..$ train: int [1:156] 1 2 4 5 6 7 8 9 10 12 ...
## ..$ app : int [1:78] 19 23 159 219 138 154 204 139 124 101 ...
## $ :List of 2
## ..$ train: int [1:156] 2 3 4 5 7 9 10 11 12 14 ...
## ..$ app : int [1:78] 85 150 217 51 122 175 135 31 76 60 ...
## - attr(*, "splitmethod")= chr "kwaycross"
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0
for (i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}
# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))
# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.289782
[Video]
Here you see the decision tree learned from the brain data set shown in the previous video. The tree predicts the expected intelligence (humans have an intelligence of 1) of several mammals, as a function of gestation time (in days) and average litter size for that species.
The leaf nodes show the expected intelligence of the data points in each node, as well as the percentage of the data points that fall into that node.
You want to predict the intelligence of a gorilla, which has a gestation time of 265 days and an average litter size of 1.
What relative intelligence does this tree model predict?
[Video]
Random Forests with ranger()
# model <- ranger(fmla, bikesJan, num.trees = 500, respect.unordered.factors = "order")
Predicting with a ranger() model
# bikesFeb$pred <- predict(model, bikesFeb)$predictions
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed <- 423563
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr" "holiday" "workingday" "weathersit" "temp"
## [6] "atemp" "hum" "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)
# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula
bikesJuly, # data
num.trees = 500,
respect.unordered.factors = "order",
seed = seed))
## Ranger result
##
## Call:
## ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
##
## Type: Regression
## Number of trees: 500
## Sample size: 744
## Number of independent variables: 8
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 8299.851
## R squared (OOB): 0.8190328
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# bike_model_rf is in the workspace
bike_model_rf
## Ranger result
##
## Call:
## ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
##
## Type: Regression
## Number of trees: 500
## Sample size: 744
## Number of independent variables: 8
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 8299.851
## R squared (OOB): 0.8190328
# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions
# Calculate the RMSE of the predictions
# (dplyr is already loaded in the workspace)
bikesAugust %>%
  mutate(residual = cnt - pred) %>%            # calculate the residual
  summarize(rmse = sqrt(mean(residual ^ 2)))   # calculate rmse
## rmse
## 1 96.66032
# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline()
# tidyr is already loaded in the workspace (for gather())
first_two_weeks <- bikesAugust %>%
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>%
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two weeks
  filter(instant < 14)
# Plot predictions and cnt by date/time
ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Random Forest plot")
[Video]
Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470