knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = TRUE)
# Remark: If need Math font, 'Save with encoding as UTF' which makes the R Markdown compile

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Chapter 1: What is Regression?

In this chapter we introduce the concept of regression from a machine learning point of view. We will present the fundamental regression method: linear regression. We will show how to fit a linear regression model and to make predictions from the model.

1.1: Identify the regression tasks

From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values.

Which of the following tasks are regression problems?

Answer the question

50 XP

Possible Answers

  1. Predict (yes/no) whether a student will pass the final exam, given scores on midterms and homework.

  2. Predict the student’s score (0 - 100) on the final exam, given scores on midterms and homework. [ans]

  3. Predict the student’s final grade (A, B, C, D, F) in the class given scores on midterms and homework (before they’ve taken the final exam).

1.2: Code a simple one-variable regression

For the first coding exercise, you’ll create a formula to define a one-variable modeling task, and then fit a linear model to the data. You are given the rates of male and female unemployment in the United States over several years (Source).

The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.

The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.

Recall the calling interface for lm() is:

lm(formula, data = ___)

Instructions

100 XP

  • The data frame unemployment is in your workspace.

  • Define a formula that expresses female_unemployment as a function of male_unemployment. Assign the formula to the variable fmla and print it.

  • Then use lm() and fmla to fit a linear model to predict female unemployment from male unemployment using the data set unemployment.

  • Print the model. Is the coefficent for male unemployment consistent with what you would expect? Does female unemployment increase as male unemployment does?

# unemployment is loaded in the workspace
unemployment<-readRDS("unemployment.rds")
summary(unemployment)
 male_unemployment female_unemployment
 Min.   :2.900     Min.   :4.000      
 1st Qu.:4.900     1st Qu.:4.400      
 Median :6.000     Median :5.200      
 Mean   :5.954     Mean   :5.569      
 3rd Qu.:6.700     3rd Qu.:6.100      
 Max.   :9.800     Max.   :7.900      
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment
# Print it
fmla
female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)
# Print it
unemployment_model

Call:
lm(formula = fmla, data = unemployment)

Coefficients:
      (Intercept)  male_unemployment  
           1.4341             0.6945  

1.3: Examining a model

Let’s look at the model unemployment_model that you have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().

Instructions

100 XP

  • The object unemployment_model is in your workspace.

  • Print unemployment_model again. What information does it report?

  • Call summary() on unemployment_model. In addition to the coefficient values, you get standard errors on the coefficient estimates, and some goodness-of-fit metrics like R-squared.

  • Call glance() on the model to see the performance metrics in an orderly data frame. Can you match the information from summary() to the columns of glance()?

  • Now call wrapFTest() on the model to see the R-squared again.

# broom and sigr are already loaded in your workspace
# Print unemployment_model
install.packages("nlme")
Warning in install.packages :
  unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'
trying URL 'https://mran.microsoft.com/snapshot/2018-08-01/bin/windows/contrib/3.5/nlme_3.1-137.zip'
Content type 'application/zip' length 2351049 bytes (2.2 MB)
downloaded 2.2 MB
package ‘nlme’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\cweiqiang\AppData\Local\Temp\RtmpueSsh9\downloaded_packages
install.packages("broom")
Warning in install.packages :
  unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'
trying URL 'https://mran.microsoft.com/snapshot/2018-08-01/bin/windows/contrib/3.5/broom_0.5.0.zip'
Content type 'application/zip' length 2009854 bytes (1.9 MB)
downloaded 1.9 MB
package ‘broom’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\cweiqiang\AppData\Local\Temp\RtmpueSsh9\downloaded_packages
install.packages("sigr")
Warning in install.packages :
  unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'
trying URL 'https://mran.microsoft.com/snapshot/2018-08-01/bin/windows/contrib/3.5/sigr_0.2.8.zip'
Content type 'application/zip' length 200307 bytes (195 KB)
downloaded 195 KB
package ‘sigr’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\cweiqiang\AppData\Local\Temp\RtmpueSsh9\downloaded_packages
library(nlme)
library(broom)
library(sigr)
unemployment_model

Call:
lm(formula = fmla, data = unemployment)

Coefficients:
      (Intercept)  male_unemployment  
           1.4341             0.6945  
# Call summary() on unemployment_model to get more details
summary(unemployment_model)

Call:
lm(formula = fmla, data = unemployment)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77621 -0.34050 -0.09004  0.27911  1.31254 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.43411    0.60340   2.377   0.0367 *  
male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5803 on 11 degrees of freedom
Multiple R-squared:  0.8213,    Adjusted R-squared:  0.8051 
F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
[1] "F Test summary: (R2=0.82, F(1,11)=51, p=2e-05)."

Remark: here are several different ways to get diagnostics for your model. Use the one that suits your needs or preferences the best.

1.4: Predicting from the unemployment model

In this exercise, you will use your unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. You will also use your model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.

The predict() interface for lm models takes the form

predict(model, newdata)

You will use the ggplot2 package to make the plots, so you will add the prediction column to the unemployment data frame. You will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is when the outcome is equal to the predicted value).

The ggplot2 command to plot a scatterplot of dframe\(outcome versus dframe\)pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred is as follows:

ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() +
geom_abline(color = “blue”)

Instructions

100 XP

  • The objects unemployment, unemployment_model and newrates are in your workspace.

  • Use predict() to predict female unemployment rates from the unemployment data. Assign it to a new column: prediction.

  • Use the library() command to load the ggplot2 package.

  • Use ggplot() to compare the predictions to actual unemployment rates. Put the predictions on the x axis. How close are the results to the line of perfect prediction?

  • Use the data frame newrates to predict expected female unemployment rate when male unemployment is 5%. Assign the answer to the variable pred and print it.

newrates<-read.csv("new_rates.csv")
# unemployment is in your workspace
summary(unemployment)
 male_unemployment female_unemployment
 Min.   :2.900     Min.   :4.000      
 1st Qu.:4.900     1st Qu.:4.400      
 Median :6.000     Median :5.200      
 Mean   :5.954     Mean   :5.569      
 3rd Qu.:6.700     3rd Qu.:6.100      
 Max.   :9.800     Max.   :7.900      
# newrates is in your workspace
newrates
# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model)
# load the ggplot2 package
library(ggplot2)

Attaching package: 㤼㸱ggplot2㤼㸲

The following object is masked _by_ 㤼㸱.GlobalEnv㤼㸲:

    mpg
# Make a plot to compare predictions to actual (prediction on x axis)
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata = newrates)
# Print it
pred
       1 
4.906757 

1.5: Multivariate linear regression (Part 1)

In this exercise, you will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.

Instructions

100 XP

  • The data frame bloodpressure is in the workspace.

  • Define a formula that expresses blood_pressure explicitly as a function of age and weight. Assign the formula to the variable fmla and print it.

  • Use fmla to fit a linear model to predict blood_pressure from age and weight in the data set bloodpressure. Call the model bloodpressure_model.

  • Print the model and call summary() on it. Does blood pressure increase or decrease with age? With weight?

# bloodpressure is in the workspace
bloodpressure<-readRDS("bloodpressure.rds")
summary(bloodpressure)
 blood_pressure       age            weight   
 Min.   :128.0   Min.   :46.00   Min.   :167  
 1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
 Median :153.0   Median :64.00   Median :194  
 Mean   :150.1   Mean   :62.45   Mean   :195  
 3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
 Max.   :168.0   Max.   :74.00   Max.   :220  
# Create the formula and print it
fmla <- blood_pressure~age+weight
fmla
blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla,data=bloodpressure)
# Print bloodpressure_model and call summary() 
bloodpressure_model

Call:
lm(formula = fmla, data = bloodpressure)

Coefficients:
(Intercept)          age       weight  
    30.9941       0.8614       0.3349  
summary(bloodpressure_model)

Call:
lm(formula = fmla, data = bloodpressure)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4640 -1.1949 -0.4078  1.8511  2.6981 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  30.9941    11.9438   2.595  0.03186 * 
age           0.8614     0.2482   3.470  0.00844 **
weight        0.3349     0.1307   2.563  0.03351 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.318 on 8 degrees of freedom
Multiple R-squared:  0.9768,    Adjusted R-squared:  0.9711 
F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07

1.6: Multivariate linear regression (Part 2)

Now you will make predictions using the blood pressure model bloodpressure_model that you fit in the previous exercise.

You will also compare the predictions to outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:

ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() + geom_abline(color = “blue”) `

Instructions

100 XP

  • The objects bloodpressure and bloodpressure_model are in the workspace.

  • Use predict() to predict blood pressure in the bloodpressure dataset. Assign the predictions to the column prediction.

  • Graphically compare the predictions to actual blood pressures. Put predictions on the x axis. How close are the results to the line of perfect prediction?

# bloodpressure is in your workspace
summary(bloodpressure)
 blood_pressure       age            weight   
 Min.   :128.0   Min.   :46.00   Min.   :167  
 1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
 Median :153.0   Median :64.00   Median :194  
 Mean   :150.1   Mean   :62.45   Mean   :195  
 3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
 Max.   :168.0   Max.   :74.00   Max.   :220  
# bloodpressure_model is in your workspace
bloodpressure_model

Call:
lm(formula = fmla, data = bloodpressure)

Coefficients:
(Intercept)          age       weight  
    30.9941       0.8614       0.3349  
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model)
# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure))+ 
    geom_point() +
    geom_abline(color = "blue")

Chapter 2: Training and Evaluating Regression Models

Now that we have learned how to fit basic linear regression models, we will learn how to evaluate how well our models perform. We will review evaluating a model graphically, and look at two basic metrics for regression models. We will also learn how to train a model that will perform well in the wild, not just on training data. Although we will demonstrate these techniques using linear regression, all these concepts apply to models fit with any regression algorithm.

2.1: Graphically evaluate the unemployment model

In this exercise you will graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment.

You will plot the model’s predictions against the actual female_unemployment; recall the command is of the form

ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() +
geom_abline() Then you will calculate the residuals:

residuals <- actual outcome - predicted outcome and plot predictions against residuals. The residual graph will take a slightly different form: you compare the residuals to the horizonal line x=0 (using geom_hline()) rather than to the line x=y. The command will be provided.

Instructions

100 XP

  • The data frame unemployment and model unemployment_model are available in the workspace.

  • Use predict() to get the model predictions and add them to unemployment as the column predictions.

  • Plot predictions (on the x-axis) versus actual female unemployment rates. Are the predictions near the x=y line?

  • Calculate the residuals between the predictions and actual unemployment rates. Add these residuals to unemployment as the column residuals.

  • Fill in the blanks to plot predictions (on the x-axis) versus residuals (on the y-axis). This gives you a different view of the model’s predictions as compared to ground truth.

# unemployment is in the workspace
summary(unemployment)
 male_unemployment female_unemployment   prediction   
 Min.   :2.900     Min.   :4.000       Min.   :3.448  
 1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837  
 Median :6.000     Median :5.200       Median :5.601  
 Mean   :5.954     Mean   :5.569       Mean   :5.569  
 3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087  
 Max.   :9.800     Max.   :7.900       Max.   :8.240  
# unemployment_model is in the workspace
summary(unemployment_model)

Call:
lm(formula = fmla, data = unemployment)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77621 -0.34050 -0.09004  0.27911  1.31254 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.43411    0.60340   2.377   0.0367 *  
male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5803 on 11 degrees of freedom
Multiple R-squared:  0.8213,    Adjusted R-squared:  0.8051 
F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model)
# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) + 
  geom_point() + 
  geom_abline()

# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions
# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

2.2: The gain curve to evaluate the unemployment model

In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals. Now, you will also plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.

For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.

Calls to the function GainCurvePlot() look like:

GainCurvePlot(frame, xvar, truthvar, title)

where

  • frame is a data frame
  • xvar and truthvar are strings naming the prediction and actual outcome columns of frame
  • title is the title of the plot

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.

Instructions

100 XP

  • The data frame unemployment and the model unemployment_model are in the workspace.

  • Load the package WVPlots.

  • Plot the gain curve. Give the plot the title “Unemployment model”. Do the model’s predictions sort correctly?

# unemployment is in the workspace (with predictions)
summary(unemployment)
 male_unemployment female_unemployment   prediction     predictions      residuals       
 Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448   Min.   :-0.77621  
 1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837   1st Qu.:-0.34050  
 Median :6.000     Median :5.200       Median :5.601   Median :5.601   Median :-0.09004  
 Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569   Mean   : 0.00000  
 3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087   3rd Qu.: 0.27911  
 Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240   Max.   : 1.31254  
# unemployment_model is in the workspace
summary(unemployment_model)

Call:
lm(formula = fmla, data = unemployment)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77621 -0.34050 -0.09004  0.27911  1.31254 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.43411    0.60340   2.377   0.0367 *  
male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5803 on 11 degrees of freedom
Multiple R-squared:  0.8213,    Adjusted R-squared:  0.8051 
F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Load the package WVPlots
install.packages("WVPlots")
Warning in install.packages :
  unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'
trying URL 'https://mran.microsoft.com/snapshot/2018-08-01/bin/windows/contrib/3.5/WVPlots_1.0.2.zip'
Content type 'application/zip' length 2193056 bytes (2.1 MB)
downloaded 2.1 MB
package ‘WVPlots’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\cweiqiang\AppData\Local\Temp\RtmpueSsh9\downloaded_packages
library(WVPlots)
# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")

2.3: Calculate RMSE

In this exercise you will calculate the RMSE of your unemployment model. In the previous coding exercises, you added two columns to the unemployment dataset:

the model’s predictions (predictions column) the residuals between the predictions and the outcome (residuals column) You can calculate the RMSE from a vector of residuals, res, as:

RMSE=sqrt{mean(res 2)}

You want RMSE to be small. How small is “small”? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.

Instructions

100 XP

  • The data frame unemployment is in your workspace.

  • Review the unemployment data from the previous exercise.

  • For convenience, assign the residuals column from unemployment to the variable res.

  • Calculate RMSE: square res, take its mean, and then square root it. Assign this to the variable rmse and print it.

  • Tip: you can do this in one step by wrapping the assignment in parentheses: (rmse <- ___)

  • Calculate the standard deviation of female_unemployment and assign it to the variable sd_unemployment.

  • Print it. How does the rmse of the model compare to the standard deviation of the data?

# unemployment is in the workspace
summary(unemployment)
 male_unemployment female_unemployment   prediction     predictions      residuals       
 Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448   Min.   :-0.77621  
 1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837   1st Qu.:-0.34050  
 Median :6.000     Median :5.200       Median :5.601   Median :5.601   Median :-0.09004  
 Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569   Mean   : 0.00000  
 3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087   3rd Qu.: 0.27911  
 Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240   Max.   : 1.31254  
# For convenience put the residuals in the variable res
res <- unemployment$predictions-unemployment$female_unemployment
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
[1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
[1] 1.314271
sd_unemployment
[1] 1.314271

2.4: Calculate R-Squared

Now that you’ve calculated the RMSE of your model’s predictions, you will examine how well the model fits the data: that is, how much variance does it explain. You can do this using R2.

Suppose y is the true outcome, p is the prediction from the model, and res=y−p are the residuals of the predictions.

  • Then the total sum of squares tss (“total variance”) of the data is: \(tss=\sum (y−\bar{y})^{2}\) where \(\bar{y}\) is the mean value of y.

  • The residual sum of squared errors of the model, rss is: \(rss=\sum res^{2}\)

  • \(R^{2}\) (R-Squared), the “variance explained” by the model, is then: \(1−\frac{rss}{tss}\)

After you calculate \(R^{2}\) , you will compare what you computed with the \(R^{2}\) reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the R2 of the model on the training data.

The data frame unemployment is in your workspace, with the columns predictions and residuals that you calculated in a previous exercise.

Instructions

100 XP

  • The data frame unemployment and the model unemployment_model are in the workspace.

  • Calculate the mean female_unemployment and assign it to the variable fe_mean.

  • Calculate the total sum of squares and assign it to the variable tss.

  • Calculate the residual sum of squares and assign it to the variable rss.

  • Calculate R2. Is it a good fit (R2 near 1)?

  • Use glance() to get R2 from the model. Is it the same as what you calculated?

# unemployment is in your workspace
summary(unemployment)
 male_unemployment female_unemployment   prediction     predictions      residuals       
 Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448   Min.   :-0.77621  
 1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837   1st Qu.:-0.34050  
 Median :6.000     Median :5.200       Median :5.601   Median :5.601   Median :-0.09004  
 Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569   Mean   : 0.00000  
 3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087   3rd Qu.: 0.27911  
 Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240   Max.   : 1.31254  
# unemployment_model is in the workspace
summary(unemployment_model)

Call:
lm(formula = fmla, data = unemployment)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77621 -0.34050 -0.09004  0.27911  1.31254 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.43411    0.60340   2.377   0.0367 *  
male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5803 on 11 degrees of freedom
Multiple R-squared:  0.8213,    Adjusted R-squared:  0.8051 
F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
[1] 5.569231
fe_mean
[1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
[1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum((unemployment$female_unemployment - unemployment$predictions)^2))
[1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1-rss/tss)
[1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
[1] 0.8213157

2.5: Correlation and R-squared

The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively:

the outcomes of a regression model that minimizes squared-error (like linear regression) and the true outcomes of the training data, then the square of the correlation is the same as R2. You will verify that in this exercise.

Instructions

100 XP

  • Use cor() to get the correlation between the predictions and female unemployment. Assign it to the variable rho and print it. Make sure you use Pearson correlation (the default).

  • Square rho and assign it to rho2. Print it.

  • Compare rho2 to R2 from the model (using glance()). Is it the same?

# unemployment is in your workspace
summary(unemployment)
 male_unemployment female_unemployment   prediction     predictions      residuals       
 Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448   Min.   :-0.77621  
 1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837   1st Qu.:-0.34050  
 Median :6.000     Median :5.200       Median :5.601   Median :5.601   Median :-0.09004  
 Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569   Mean   : 0.00000  
 3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087   3rd Qu.: 0.27911  
 Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240   Max.   : 1.31254  
# unemployment_model is in the workspace
summary(unemployment_model)

Call:
lm(formula = fmla, data = unemployment)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77621 -0.34050 -0.09004  0.27911  1.31254 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.43411    0.60340   2.377   0.0367 *  
male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5803 on 11 degrees of freedom
Multiple R-squared:  0.8213,    Adjusted R-squared:  0.8051 
F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
[1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho^2)
[1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
[1] 0.8213157

2.6: Generating a random test/train split

For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a data set dframe of size N, and you want a random subset of approximately size 100∗X% of N (where X is between 0 and 1), then:

Generate a vector of uniform random numbers: gp = runif(N). dframe[gp < X,] will be about the right size. dframe[gp >= X,] will be the complement.

Instructions

100 XP

  • The data frame mpg is in the workspace.

  • Use the function nrow to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.

  • Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.

  • Use runif() to generate a vector of N uniform random numbers, called gp.

-Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).

  • Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?
# mpg is in the workspace
summary(mpg)
 manufacturer          model               displ            year           cyl       
 Length:234         Length:234         Min.   :1.600   Min.   :1999   Min.   :4.000  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000  
 Mode  :character   Mode  :character   Median :3.300   Median :2004   Median :6.000  
                                       Mean   :3.472   Mean   :2004   Mean   :5.889  
                                       3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000  
                                       Max.   :7.000   Max.   :2008   Max.   :8.000  
    trans               drv                 cty             hwy             fl           
 Length:234         Length:234         Min.   : 9.00   Min.   :12.00   Length:234        
 Class :character   Class :character   1st Qu.:14.00   1st Qu.:18.00   Class :character  
 Mode  :character   Mode  :character   Median :17.00   Median :24.00   Mode  :character  
                                       Mean   :16.86   Mean   :23.44                     
                                       3rd Qu.:19.00   3rd Qu.:27.00                     
                                       Max.   :35.00   Max.   :44.00                     
    class              pred.cv            pred       
 Length:234         Min.   : 8.827   Min.   : 9.043  
 Class :character   1st Qu.:13.177   1st Qu.:13.142  
 Mode  :character   Median :17.313   Median :17.241  
                    Mean   :16.858   Mean   :16.859  
                    3rd Qu.:19.405   3rd Qu.:19.291  
                    Max.   :30.338   Max.   :30.906  
dim(mpg)
[1] 234  13
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
[1] 234
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(0.75*N))
[1] 176
# Create the vector of N uniform random variables: gp
gp <- runif(N, min = 0, max = 1)
select.training<-gp<0.75
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[select.training,]
mpg_test <- mpg[!select.training,]
# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
[1] 179
nrow(mpg_test)
[1] 55

2.7: Train a model using test/train split

Now that you have split the mpg dataset into mpg_train and mpg_test, you will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).

Instructions

100 XP

  • The data frame mpg_train is in the workspace.

  • Create a formula fmla that expresses the relationship cty as a function of hwy. Print it.

  • Train a model mpg_model on mpg_train to predict cty from hwy using fmla and lm().

  • Use summary() to examine the model.

# mpg_train is in the workspace
summary(mpg_train)
 manufacturer          model               displ           year           cyl       
 Length:179         Length:179         Min.   :1.60   Min.   :1999   Min.   :4.000  
 Class :character   Class :character   1st Qu.:2.50   1st Qu.:1999   1st Qu.:4.000  
 Mode  :character   Mode  :character   Median :3.40   Median :2008   Median :6.000  
                                       Mean   :3.57   Mean   :2004   Mean   :6.006  
                                       3rd Qu.:4.70   3rd Qu.:2008   3rd Qu.:8.000  
                                       Max.   :7.00   Max.   :2008   Max.   :8.000  
    trans               drv                 cty             hwy            fl           
 Length:179         Length:179         Min.   : 9.00   Min.   :12.0   Length:179        
 Class :character   Class :character   1st Qu.:13.00   1st Qu.:17.0   Class :character  
 Mode  :character   Mode  :character   Median :16.00   Median :24.0   Mode  :character  
                                       Mean   :16.44   Mean   :22.9                     
                                       3rd Qu.:19.00   3rd Qu.:27.0                     
                                       Max.   :29.00   Max.   :41.0                     
    class              pred.cv            pred       
 Length:179         Min.   : 8.827   Min.   : 9.043  
 Class :character   1st Qu.:12.580   1st Qu.:12.459  
 Mode  :character   Median :17.289   Median :17.241  
                    Mean   :16.501   Mean   :16.490  
                    3rd Qu.:19.221   3rd Qu.:19.291  
                    Max.   :28.355   Max.   :28.856  
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty~hwy)
cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla,data=mpg_train)
# Use summary() to examine the model
summary(mpg_model)

Call:
lm(formula = fmla, data = mpg_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8605 -0.8068 -0.1043  0.6663  4.7345 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.96995    0.36128   2.685  0.00795 ** 
hwy          0.67562    0.01529  44.182  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.189 on 177 degrees of freedom
Multiple R-squared:  0.9169,    Adjusted R-squared:  0.9164 
F-statistic:  1952 on 1 and 177 DF,  p-value: < 2.2e-16

2.8: Evaluate a model using test/train split

Now you will test the model mpg_model on the test data, mpg_test. Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience:

rmse(predcol, ycol)

r_squared(predcol, ycol)

where:

  • predcol: The predicted values

  • ycol: The actual outcome

You will also plot the predictions vs. the outcome.

Generally, model performance is better on the training data than the test data (though sometimes the test set “gets lucky”). A slight difference in performance is okay; if the performance on training is significantly better, there is a problem.

Instructions

100 XP

The data frames mpg_train and mpg_test, and the model mpg_model are in the workspace, along with the functions rmse() and r_squared().

  • Predict city fuel efficiency from hwy on the mpg_train data. Assign the predictions to the column pred.

  • Predict city fuel efficiency from hwy on the mpg_test data. Assign the predictions to the column pred.

  • Use rmse() to evaluate rmse for both the test and training sets. Compare. Are the performances similar?

  • Do the same with r_squared(). Are the performances similar?

  • Use ggplot2 to plot the predictions against cty on the test data.

# Examine the objects in the workspace
#ls.str()
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model)
# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model,newdata=mpg_test)
#install.packages("Metrics")
library(Metrics)
# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$cty,mpg_train$pred))
[1] 1.182395
(rmse_test <- rmse(mpg_test$cty,mpg_test$pred))
[1] 1.444506
# Evaluate the r-squared on both training and test data.and print them
#(rsq_train <- r_squared(mpg_train$pred,mpg_train$cty))
#(rsq_test <- r_squared(mpg_test$pred,mpg_test$cty))
(rsq_train <- 1-sse(mpg_train$pred,mpg_train$cty)/sse(mean(mpg_train$cty),mpg_train$cty))
[1] 0.9168634
(rsq_test <- 1-sse(mpg_test$pred,mpg_test$cty)/sse(mean(mpg_test$cty),mpg_test$cty))
[1] 0.8934799
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()

2.9: Create a cross validation plan

There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y) where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

train: the indices of dframe that will form the training set app: the indices of dframe that will form the test (or application) set In this exercise you will create a 3-fold cross-validation plan for the data set mpg.

Instructions

100 XP

  • Load the package vtreat.

  • Get the number of rows in mpg and assign it to the variable nRows.

  • Call kWayCrossValidation to create a 3-fold cross validation plan and assign it to the variable splitPlan. You can set the last two arguments of the function to NULL.

  • Call str() to examine the structure of splitPlan.

# Load the package vtreat
install.packages("vtreat")
Error in install.packages : Updating loaded packages
library(vtreat)
# mpg is in the workspace
summary(mpg)
 manufacturer          model               displ            year           cyl       
 Length:234         Length:234         Min.   :1.600   Min.   :1999   Min.   :4.000  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000  
 Mode  :character   Mode  :character   Median :3.300   Median :2004   Median :6.000  
                                       Mean   :3.472   Mean   :2004   Mean   :5.889  
                                       3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000  
                                       Max.   :7.000   Max.   :2008   Max.   :8.000  
    trans               drv                 cty             hwy             fl           
 Length:234         Length:234         Min.   : 9.00   Min.   :12.00   Length:234        
 Class :character   Class :character   1st Qu.:14.00   1st Qu.:18.00   Class :character  
 Mode  :character   Mode  :character   Median :17.00   Median :24.00   Mode  :character  
                                       Mean   :16.86   Mean   :23.44                     
                                       3rd Qu.:19.00   3rd Qu.:27.00                     
                                       Max.   :35.00   Max.   :44.00                     
    class              pred.cv            pred       
 Length:234         Min.   : 8.827   Min.   : 9.043  
 Class :character   1st Qu.:13.177   1st Qu.:13.142  
 Mode  :character   Median :17.313   Median :17.241  
                    Mean   :16.858   Mean   :16.859  
                    3rd Qu.:19.405   3rd Qu.:19.291  
                    Max.   :30.338   Max.   :30.906  
# Get the number of rows in mpg
nRows <- nrow(mpg)
# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)
# Examine the split plan
str(splitPlan)
List of 3
 $ :List of 2
  ..$ train: int [1:156] 2 3 4 6 7 8 9 11 13 14 ...
  ..$ app  : int [1:78] 229 204 40 100 41 67 5 54 12 21 ...
 $ :List of 2
  ..$ train: int [1:156] 1 4 5 6 8 10 12 13 15 17 ...
  ..$ app  : int [1:78] 184 203 81 78 112 93 143 174 165 139 ...
 $ :List of 2
  ..$ train: int [1:156] 1 2 3 5 7 9 10 11 12 14 ...
  ..$ app  : int [1:78] 4 64 101 150 28 176 147 45 57 82 ...
 - attr(*, "splitmethod")= chr "kwaycross"

2.10: Evaluate a modeling procedure using n-fold cross-validation

In this exercise you will use splitPlan, the 3-fold cross validation plan from the previous exercise, to make predictions from a model that predicts mpg\(cty from mpg\)hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

  • Initialize a column of the appropriate length

dframe$pred.cv <- 0

  • k is the number of folds

  • splitPlan is the cross validation plan

for(i in 1:k) { # Get the ith split split <- splitPlan[[i]]

# Build a model on the training data # from this split # (lm, in this case) model <- lm(fmla, data = dframe[split$train,])

# make predictions on the # application data from this split dframe\(pred.cv[split\)app] <- predict(model, newdata = dframe[split$app,]) }

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.

Instructions

100 XP

  • The data frame mpg, the cross validation plan splitPlan, and the function to calculate RMSE (rmse()) from one of the previous exercises is available in your workspace.

  • Run the 3-fold cross validation plan from splitPlan and put the predictions in the column mpg$pred.cv.

  • Use lm() and the formula cty ~ hwy.

  • Create a linear regression model on all the mpg data (formula cty ~ hwy) and assign the predictions to mpg$pred.

  • Use rmse() to get the root mean squared error of the predictions from the full model (mpg$pred). Recall that rmse() takes two arguments, the predicted values, and the actual outcome.

  • Get the root mean squared error of the cross-validation predictions. Are the two values about the same?

# mpg is in the workspace
summary(mpg)
 manufacturer          model               displ            year           cyl       
 Length:234         Length:234         Min.   :1.600   Min.   :1999   Min.   :4.000  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000  
 Mode  :character   Mode  :character   Median :3.300   Median :2004   Median :6.000  
                                       Mean   :3.472   Mean   :2004   Mean   :5.889  
                                       3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000  
                                       Max.   :7.000   Max.   :2008   Max.   :8.000  
    trans               drv                 cty             hwy             fl           
 Length:234         Length:234         Min.   : 9.00   Min.   :12.00   Length:234        
 Class :character   Class :character   1st Qu.:14.00   1st Qu.:18.00   Class :character  
 Mode  :character   Mode  :character   Median :17.00   Median :24.00   Mode  :character  
                                       Mean   :16.86   Mean   :23.44                     
                                       3rd Qu.:19.00   3rd Qu.:27.00                     
                                       Max.   :35.00   Max.   :44.00                     
    class              pred.cv            pred       
 Length:234         Min.   : 8.827   Min.   : 9.043  
 Class :character   1st Qu.:13.177   1st Qu.:13.142  
 Mode  :character   Median :17.313   Median :17.241  
                    Mean   :16.858   Mean   :16.859  
                    3rd Qu.:19.405   3rd Qu.:19.291  
                    Max.   :30.338   Max.   :30.906  
# splitPlan is in the workspace
str(splitPlan)
List of 3
 $ :List of 2
  ..$ train: int [1:156] 2 3 4 6 7 8 9 11 13 14 ...
  ..$ app  : int [1:78] 229 204 40 100 41 67 5 54 12 21 ...
 $ :List of 2
  ..$ train: int [1:156] 1 4 5 6 8 10 12 13 15 17 ...
  ..$ app  : int [1:78] 184 203 81 78 112 93 143 174 165 139 ...
 $ :List of 2
  ..$ train: int [1:156] 1 2 3 5 7 9 10 11 12 14 ...
  ..$ app  : int [1:78] 4 64 101 150 28 176 147 45 57 82 ...
 - attr(*, "splitmethod")= chr "kwaycross"
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty~hwy, data = mpg[split$train,])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app,])
}
# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))
# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
[1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
[1] 1.253743

Chapter 3:

3.1: Examining the structure of categorical inputs

For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace. It has the following columns:

Flowers: the average number of flowers on a meadowfoam plant Intensity: the intensity of a light treatment applied to the plant Time: A categorical variable - when (Late or Early) in the lifecycle the light treatment occurred The ultimate goal is to predict Flowers as a function of Time and Intensity.

Instructions

50 XP

  • The data frame flowers is in your workspace.

  • Call the str() function on flowers to see the types of each column.

  • Use the unique() function on the column flowers$Time to see the possible values that Time takes. How many unique values are there?

  • Create a formula to express Flowers as a function of Intensity and Time. Assign it to the variable fmla and print it.

  • Use fmla and model.matrix() to create the model matrix for the data frame flowers. Assign it to the variable mmat.

  • Use head() to examine the first 20 lines of flowers.

  • Now examine the first 20 lines of mmat.

  • Is the numeric column Intensity different?

  • What happened to the categorical column Time from flowers?
  • How is Time == ‘Early’ represented? And Time = ‘Late’?

# Call str on flowers to see the types of each column
flowers<-read.csv("flowers.csv")
str(flowers)
'data.frame':   24 obs. of  4 variables:
 $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
 $ Time     : Factor w/ 2 levels "Early","Late": 2 2 2 2 2 2 2 2 2 2 ...
 $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
[1] Late  Early
Levels: Early Late
# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
Flowers ~ Intensity + Time
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, data = flowers)
# Examine the first 20 lines of flowers
head(flowers, n=20)
# Examine the first 20 lines of mmat
head(mmat, n=20)
   (Intercept) Intensity TimeLate
1            1       150        1
2            1       150        1
3            1       300        1
4            1       300        1
5            1       450        1
6            1       450        1
7            1       600        1
8            1       600        1
9            1       750        1
10           1       750        1
11           1       900        1
12           1       900        1
13           1       150        0
14           1       150        0
15           1       300        0
16           1       300        0
17           1       450        0
18           1       450        0
19           1       600        0
20           1       600        0

3.3: Modeling with categorical inputs

For this exercise you will fit a linear model to the flowers data, to predict Flowers as a function of Time and Intensity.

The model formula fmla that you created in the previous exercise is still in your workspace, as is the model matrix mmat.

Instructions 100 XP

  • Use fmla and lm to train a linear model that predicts Flowers from Intensity and Time. Assign the model to the variable flower_model.

  • Use summary() to remind yourself of the structure of mmat.

-Use summary() to examine the flower_model. Do the variables match what you saw in mmat?

  • Use flower_model to predict the number of flowers. Add the predictions to flowers as the column predictions.

  • Fill in the blanks to plot predictions vs. actual flowers (predictions on the x-axis).

# flowers in is the workspace
str(flowers)
'data.frame':   24 obs. of  4 variables:
 $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
 $ Time     : Factor w/ 2 levels "Early","Late": 2 2 2 2 2 2 2 2 2 2 ...
 $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...
# fmla is in the workspace
fmla
Flowers ~ Intensity + Time
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla,data=flowers)
# Use summary on mmat to remind yourself of its structure
summary(mmat)
  (Intercept)   Intensity      TimeLate  
 Min.   :1    Min.   :150   Min.   :0.0  
 1st Qu.:1    1st Qu.:300   1st Qu.:0.0  
 Median :1    Median :525   Median :0.5  
 Mean   :1    Mean   :525   Mean   :0.5  
 3rd Qu.:1    3rd Qu.:750   3rd Qu.:1.0  
 Max.   :1    Max.   :900   Max.   :1.0  
# Use summary to examine flower_model 
summary(flower_model)

Call:
lm(formula = fmla, data = flowers)

Residuals:
   Min     1Q Median     3Q    Max 
-9.652 -4.139 -1.558  5.632 12.165 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  83.464167   3.273772  25.495  < 2e-16 ***
Intensity    -0.040471   0.005132  -7.886 1.04e-07 ***
TimeLate    -12.158333   2.629557  -4.624 0.000146 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.441 on 21 degrees of freedom
Multiple R-squared:  0.7992,    Adjusted R-squared:   0.78 
F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08
# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model)
# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue") 

3.4: Modeling an interaction

In this exercise you will use interactions to model the effect of gender and gastric activity on alcohol metabolism.

The data frame alcohol has columns:

  1. Metabol: the alcohol metabolism rate

  2. Gastric: the rate of gastric alcohol dehydrogenase activity

  3. Sex: the sex of the drinker (Male or Female)

In the video, we fit three models to the alcohol data:

We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model.

Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.

x*y = x + y + x:y

Instructions

100 XP

# alcohol is in the workspace
alcohol<-read.csv("alcohol.csv")
summary(alcohol)
       X            Subject         Metabol          Gastric          Sex              Alcohol  
 Min.   : 1.00   Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18   Alcoholic    : 8  
 1st Qu.: 8.75   1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14   Non-alcoholic:24  
 Median :16.50   Median :16.50   Median : 1.700   Median :1.600                                 
 Mean   :16.50   Mean   :16.50   Mean   : 2.422   Mean   :1.863                                 
 3rd Qu.:24.25   3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200                                 
 Max.   :32.00   Max.   :32.00   Max.   :12.300   Max.   :5.200                                 
# Create the formula with main effects only
(fmla_add <- as.formula(Metabol~Gastric+Sex))
Metabol ~ Gastric + Sex
# Create the formula with interactions
(fmla_interaction <- as.formula(Metabol~Gastric+Gastric:Sex))
Metabol ~ Gastric + Gastric:Sex
# Fit the main effects only model
model_add <- lm(fmla_add,data=alcohol)
# Fit the interaction model
model_interaction <- lm(fmla_interaction,data=alcohol)
# Call summary on both models and compare
summary(model_add)

Call:
lm(formula = fmla_add, data = alcohol)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2779 -0.6328 -0.0966  0.5783  4.5703 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.9466     0.5198  -3.745 0.000796 ***
Gastric       1.9656     0.2674   7.352 4.24e-08 ***
SexMale       1.6174     0.5114   3.163 0.003649 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.331 on 29 degrees of freedom
Multiple R-squared:  0.7654,    Adjusted R-squared:  0.7492 
F-statistic: 47.31 on 2 and 29 DF,  p-value: 7.41e-10
summary(model_interaction)

Call:
lm(formula = fmla_interaction, data = alcohol)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4656 -0.5091  0.0143  0.5660  4.0668 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.7504     0.5310  -1.413 0.168236    
Gastric           1.1489     0.3450   3.331 0.002372 ** 
Gastric:SexMale   1.0422     0.2412   4.321 0.000166 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.204 on 29 degrees of freedom
Multiple R-squared:  0.8081,    Adjusted R-squared:  0.7948 
F-statistic: 61.05 on 2 and 29 DF,  p-value: 4.033e-11

3.5: Modeling an interaction (2)

In this exercise, you will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.

You will begin to use the dplyr package to do calculations.

mutate() adds new columns to a tbl (a type of data frame) group_by() specifies how rows are grouped in a tbl summarize() computes summary statistics of a column You will also use tidyr’s gather() which takes multiple columns and collapses them into key-value pairs.

Instructions

100 XP

  • The data frame alcohol and the formulas fmla_add and fmla_interaction are in the workspace.

  • Use kWayCrossValidation() to create a splitting plan for a 3-fold cross validation.

  • The first argument is the number of rows to be split.

  • The second argument is the number of folds for the cross-validation.

  • You can set the 3rd and 4th arguments of the function to NULL.

  • Examine and run the sample code to get the 3-fold cross-validation predictions of a model with no interactions and assign them to the column pred_add.

  • Get the 3-fold cross-validation predictions of the model with interactions. Assign the predictions to the column pred_interaction.

  • The sample code shows you the procedure.

  • Use the same splitPlan that you already created.

  • Fill in the blanks to gather the predictions into a single column pred.

  • add a column of residuals (actual outcome - predicted outcome).

  • get the RMSE of the cross-validation predictions for each model type.

  • Compare the RMSEs. Based on these results, which model should you use?

# alcohol is in the workspace
summary(alcohol)
       X            Subject         Metabol          Gastric          Sex              Alcohol  
 Min.   : 1.00   Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18   Alcoholic    : 8  
 1st Qu.: 8.75   1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14   Non-alcoholic:24  
 Median :16.50   Median :16.50   Median : 1.700   Median :1.600                                 
 Mean   :16.50   Mean   :16.50   Mean   : 2.422   Mean   :1.863                                 
 3rd Qu.:24.25   3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200                                 
 Max.   :32.00   Max.   :32.00   Max.   :12.300   Max.   :5.200                                 
# Both the formulae are in the workspace
fmla_add
Metabol ~ Gastric + Sex
fmla_interaction
Metabol ~ Gastric + Gastric:Sex
# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)
# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}
# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train,])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}
install.packages("magrittr")
Error in install.packages : Updating loaded packages
library(magrittr) # for %>% filter function
library('tidyr')

Attaching package: 㤼㸱tidyr㤼㸲

The following object is masked from 㤼㸱package:magrittr㤼㸲:

    extract
example('gather')

gather> library(dplyr)

Attaching package: 㤼㸱dplyr㤼㸲

The following object is masked from 㤼㸱package:nlme㤼㸲:

    collapse

The following objects are masked from 㤼㸱package:stats㤼㸲:

    filter, lag

The following objects are masked from 㤼㸱package:base㤼㸲:

    intersect, setdiff, setequal, union

gather> # From http://stackoverflow.com/questions/1181060
gather> stocks <- tibble(
gather+   time = as.Date('2009-01-01') + 0:9,
gather+   X = rnorm(10, 0, 1),
gather+   Y = rnorm(10, 0, 2),
gather+   Z = rnorm(10, 0, 4)
gather+ )

gather> gather(stocks, stock, price, -time)

gather> stocks %>% gather(stock, price, -time)

gather> # get first observation for each Species in iris data -- base R
gather> mini_iris <- iris[c(1, 51, 101), ]

gather> # gather Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
gather> gather(mini_iris, key = flower_att, value = measurement,
gather+        Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
      Species   flower_att measurement
1      setosa Sepal.Length         5.1
2  versicolor Sepal.Length         7.0
3   virginica Sepal.Length         6.3
4      setosa  Sepal.Width         3.5
5  versicolor  Sepal.Width         3.2
6   virginica  Sepal.Width         3.3
7      setosa Petal.Length         1.4
8  versicolor Petal.Length         4.7
9   virginica Petal.Length         6.0
10     setosa  Petal.Width         0.2
11 versicolor  Petal.Width         1.4
12  virginica  Petal.Width         2.5

gather> # same result but less verbose
gather> gather(mini_iris, key = flower_att, value = measurement, -Species)
      Species   flower_att measurement
1      setosa Sepal.Length         5.1
2  versicolor Sepal.Length         7.0
3   virginica Sepal.Length         6.3
4      setosa  Sepal.Width         3.5
5  versicolor  Sepal.Width         3.2
6   virginica  Sepal.Width         3.3
7      setosa Petal.Length         1.4
8  versicolor Petal.Length         4.7
9   virginica Petal.Length         6.0
10     setosa  Petal.Width         0.2
11 versicolor  Petal.Width         1.4
12  virginica  Petal.Width         2.5

gather> # repeat iris example using dplyr and the pipe operator
gather> library(dplyr)

gather> mini_iris <-
gather+   iris %>%
gather+   group_by(Species) %>%
gather+   slice(1)

gather> mini_iris %>% gather(key = flower_att, value = measurement, -Species)
# Get RMSE
alcohol %>% 
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol-pred) %>%      
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))

3.6: Relative error

In this exercise, you will compare relative error to absolute error. For the purposes of modeling, we will define relative error as

\(rel=\frac{(pred−y)}{y}\)

that is, the error is relative to the true outcome. You will measure the overall relative error of a model using root mean squared relative error:

\(rmse_{rel}=\sqrt{\bar{rel^{2}}}\) where \(\bar{rel^{2}}\) is the mean of \(rel^{2}\).

The example (toy) dataset fdata is loaded in your workspace. It includes the columns:

  • y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.

  • pred: the predictions of a model that predicts y.

  • label: categorical: whether y comes from a population that makes small purchases, or large ones.

You want to know which model does “better”: the one predicting the small purchases, or the one predicting large ones.

Instructions

100 XP

  • The data frame fdata is in the workspace.

  • Fill in the blanks to examine the data. Notice that large purchases tend to be about 100 times larger than small ones.

  • Fill in the blanks to create error columns:

  • Define residual as pred - y.

  • Define relative error as residual / y.

  • Fill in the blanks to calculate and compare RMSE and relative RMSE.

  • How do the absolute errors compare? The relative errors?

  • Examine the plot of predictions versus outcome. In your opinion, which model does “better”?

# fdata is in the workspace
fdata<-read.csv("fdata.csv")
summary(fdata)
       X                y                 pred                      label   
 Min.   :  1.00   Min.   :  -5.894   Min.   :   1.072   large purchases:50  
 1st Qu.: 25.75   1st Qu.:   5.407   1st Qu.:   6.373   small purchases:50  
 Median : 50.50   Median :  57.374   Median :  55.693                       
 Mean   : 50.50   Mean   : 306.204   Mean   : 305.905                       
 3rd Qu.: 75.25   3rd Qu.: 550.903   3rd Qu.: 547.886                       
 Max.   :100.00   Max.   :1101.619   Max.   :1098.896                       
# Examine the data: generate the summaries for the groups large and small:
fdata %>% 
    group_by(label) %>%     # group by small/large purchases
    summarize(min  = min(y),   # min of y
              mean = mean(y),   # mean of y
              max  = max(y))   # max of y
# Fill in the blanks to add error columns
fdata2 <- fdata %>% 
         group_by(label) %>%       # group by label
           mutate(residual = pred-y,  # Residual
                  relerr   = residual/y)  # Relative error
# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>% 
  group_by(label) %>% 
  summarize(rmse     = sqrt(mean((pred-y)^2)),   # RMSE
            rmse.rel = sqrt(mean(((pred-y)/y)^2)))   # Root mean squared relative error
            
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) + 
  geom_point() + 
  geom_abline() + 
  facet_wrap(~ label, ncol = 1, scales = "free") + 
  ggtitle("Outcome vs prediction")

Remark: Notice from this example how a model with larger RMSE might still be better, if relative errors are more important than absolute errors.

3.7: Modeling log-transformed monetary output

In this exercise, you will practice modeling on log-transformed monetary output, and then transforming the “log-money” predictions back into monetary units. The data loaded into your workspace records subjects’ incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:

  • Arith

  • Word

  • Parag

  • Math

  • AFQT (Percentile on the Armed Forces Qualifying Test)

The data have already been split into training and test sets (income_train and income_test respectively) and are in the workspace. You will build a model of log(income) from the inputs, and then convert log(income) back into income.

Instructions

100 XP

  • Call summary() on income_train$Income2005 to see the summary statistics of income in the training set.

  • Write a formula to express log(Income2005) as a function of the five tests as the variable fmla.log. Print it.

  • Fit a linear model of log(Income2005) to the income_train data: model.log.

  • Use model.log to predict income on the income_test dataset. Put it in the column logpred.

  • Check summary() of logpred to see that the magnitudes are much different from those of Income2005.

  • Reverse the log transformation to put the predictions into “monetary units”: exp(income_test$logpred).

  • Check summary() of pred.income and see that the magnitudes are now similar to Income2005 magnitudes.

  • Fill in the blanks to plot a scatter plot of predicted income vs income on the test set.

# Examine Income2005 in the training set
income_train<-read.csv("income_train.csv")
income_test<-read.csv("income_test.csv")
summary(income_train$Income2005)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     63   23000   39000   49894   61500  703637 
# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005)~Arith+Word+Parag+Math+AFQT)
log(Income2005) ~ Arith + Word + Parag + Math + AFQT
# Fit the linear model
model.log <-  lm(fmla.log,data=income_train)
# Make predictions on income_test
income_test$logpred <- predict(model.log,newdata=income_test)
summary(income_test$logpred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  9.766  10.133  10.423  10.419  10.705  11.006 
# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17432   25167   33615   35363   44566   60217 
#  Plot predicted income (x axis) vs income
ggplot(income_test, aes(x = pred.income, y = Income2005)) + 
  geom_point() + 
  geom_abline(color = "blue")

3.8: Comparing RMSE and root-mean-squared Relative Error

In this exercise, you will show that log-transforming a monetary output before modeling improves mean relative error (but increases RMSE) compared to modeling the monetary output directly. You will compare the results of model.log from the previous exercise to a model (model.abs) that directly fits income.

The income_train and income_test datasets are loaded in your workspace, along with your model, model.log.

Also in the workspace:

model.abs: a model that directly fits income to the inputs using the formula

Income2005 ~ Arith + Word + Parag + Math + AFQT

Instructions

100 XP

  • Fill in the blanks to add predictions from the models to income_test.

  • Don’t forget to take the exponent of the predictions from model.log to undo the log transform!

  • Fill in the blanks to gather() the predictions and calculate the residuals and relative error.

  • Fill in the blanks to calculate the RMSE and relative RMSE for predictions.

  • Which model has larger absolute error? Larger relative error?

# fmla.abs is in the workspace
fmla.abs<-as.formula(Income2005 ~ Arith + Word + Parag + Math + AFQT)
model.abs<-lm(formula = fmla.abs, data = income_train)
# model.abs is in the workspace
summary(model.abs)

Call:
lm(formula = fmla.abs, data = income_train)

Residuals:
   Min     1Q Median     3Q    Max 
-78728 -24137  -6979  11964 648573 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  17516.7     6420.1   2.728  0.00642 ** 
Arith         1552.3      303.4   5.116 3.41e-07 ***
Word          -132.3      265.0  -0.499  0.61754    
Parag        -1155.1      618.3  -1.868  0.06189 .  
Math           725.5      372.0   1.950  0.05127 .  
AFQT           177.8      144.1   1.234  0.21734    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 45500 on 2063 degrees of freedom
Multiple R-squared:  0.1165,    Adjusted R-squared:  0.1144 
F-statistic:  54.4 on 5 and 2063 DF,  p-value: < 2.2e-16
# Add predictions to the test set
income_test <- income_test %>%
  mutate(pred.absmodel = predict(model.abs, income_test),        # predictions from model.abs
         pred.logmodel = exp(predict(model.log, income_test)))   # predictions from model.log
# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>% 
  gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
  mutate(residual = pred-Income2005,   # residuals
         relerr   = residual/Income2005)   # relative error
# Calculate RMSE and relative RMSE and compare
income_long %>% 
  group_by(modeltype) %>%      # group by modeltype
  summarize(rmse     = sqrt(mean(residual^2)),    # RMSE
            rmse.rel = sqrt(mean(relerr^2)))    # Root mean squared relative error

3.9: Input transforms: the “hockey stick”

In this exercise, we will build a model to predict price from a measure of the house’s size (surface area). The data set houseprice has the columns:

  • price : house price in units of $1000

  • size: surface area

A scatterplot of the data shows that the data is quite non-linear: a sort of “hockey-stick” where price is fairly flat for smaller houses, but rises steeply as the house gets larger. Quadratics and tritics are often good functional forms to express hockey-stick like relationships. Note that there may not be a “physical” reason that price is related to the square of the size; a quadratic is simply a closed form approximation of the observed relationship.

scatterplot

You will fit a model to predict price as a function of the squared size, and look at its fit on the training data.

Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.

exampleFormula = y ~ I(x^2)

Instructions

100 XP

  • The data set houseprice is in the workspace.

  • Write a formula, fmla_sqr, to express price as a function of squared size. Print it.

  • Fit a model model_sqr to the data using fmla_sqr

  • For comparison, fit a linear model model_lin to the data using the formula price ~ size.

  • Fill in the blanks to make predictions from the training data from the two models,

  • gather the predictions into a single column pred

  • graphically compare the predictions of the two models to the data. Which fits better?

# houseprice is in the workspace
houseprice<-readRDS("houseprice.rds")
summary(houseprice)
      size           price      
 Min.   : 44.0   Min.   : 42.0  
 1st Qu.: 73.5   1st Qu.:164.5  
 Median : 91.0   Median :203.5  
 Mean   : 94.3   Mean   :249.2  
 3rd Qu.:118.5   3rd Qu.:287.8  
 Max.   :150.0   Max.   :573.0  
# Create the formula for price as a function of squared size
(fmla_sqr <- price~I(size^2))
price ~ I(size^2)
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data=houseprice)
# Fit a model of price as a linear function of size
model_lin <- lm(price~size, data=houseprice)
# Make predictions and compare
houseprice %>% 
    mutate(pred_lin = predict(model_lin),       # predictions from linear model
           pred_sqr = predict(model_sqr)) %>%   # predictions from quadratic model 
    gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
    ggplot(aes(x = size)) + 
       geom_point(aes(y = price)) +                   # actual prices
       geom_line(aes(y = pred, color = modeltype)) + # the predictions
       scale_color_brewer(palette = "Dark2")

3.10: Input transforms: the “hockey stick” Part (2)

In this exercise, we will build a model to predict price from a measure of the house’s size (surface area). The data set houseprice has the columns:

  • price : house price in units of $1000

  • size: surface area

A scatterplot of the data shows that the data is quite non-linear: a sort of “hockey-stick” where price is fairly flat for smaller houses, but rises steeply as the house gets larger. Quadratics and tritics are often good functional forms to express hockey-stick like relationships. Note that there may not be a “physical” reason that price is related to the square of the size; a quadratic is simply a closed form approximation of the observed relationship.

scatterplot

You will fit a model to predict price as a function of the squared size, and look at its fit on the training data.

Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.

exampleFormula = y ~ I(x^2)

Instructions

100 XP

  • The data set houseprice is in the workspace.

  • Write a formula, fmla_sqr, to express price as a function of squared size. Print it.

  • Fit a model model_sqr to the data using fmla_sqr

  • For comparison, fit a linear model model_lin to the data using the formula price ~ size.

  • Fill in the blanks to make predictions from the training data from the two models

  • gather the predictions into a single column pred

  • graphically compare the predictions of the two models to the data. Which fits better?

# houseprice is in the workspace
summary(houseprice)
      size           price      
 Min.   : 44.0   Min.   : 42.0  
 1st Qu.: 73.5   1st Qu.:164.5  
 Median : 91.0   Median :203.5  
 Mean   : 94.3   Mean   :249.2  
 3rd Qu.:118.5   3rd Qu.:287.8  
 Max.   :150.0   Max.   :573.0  
# fmla_sqr is in the workspace
fmla_sqr
price ~ I(size^2)
# Create a splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <-  kWayCrossValidation(nrow(houseprice),3,NULL,NULL)
# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train,])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}
# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
  houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}
# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
  mutate(residuals = pred-price)
# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>% # group by modeltype
  summarize(rmse = sqrt(mean(residuals^2)))

Chapter 4: Dealing with Non-Linear Responses

Now that we have mastered linear models, we will begin to look at techniques for modeling situations that don’t meet the assumptions of linearity. This includes predicting probabilities and frequencies (values bounded between 0 and 1); predicting counts (nonnegative integer values, and associated rates); and responses that have a non-linear but additive relationship to the inputs. These algorithms are variations on the standard linear model.

4.1: Fit a model of sparrow survival probability

In this exercise, you will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow. The dataset sparrow is loaded into your workspace. The outcome to be predicted is status (“Survived”, “Perished”). The variables we will consider are:

  • total_length: length of the bird from tip of beak to tip of tail (mm)

  • weight: in grams

  • humerus : length of humerus (“upper arm bone” that connects the wing to the body) (inches)

Remember that when using glm() to create a logistic regression model, you must explicitly specify that family = binomial:

glm(formula, data = data, family = binomial)

You will call summary(), broom::glance() to see different functions for examining a logistic regression model. One of the diagnostics that you will look at is the analog to R2, called pseudo-R2.

pseudoR2=1−deviancenull.deviance

You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. The pseudo-R2 is analogous to R2 for standard regression: R2 is a measure of the “variance explained” of a regression model. The pseudo-R2 is a measure of the “deviance explained”.

Instructions

100 XP

  • The data frame sparrow and the package broom are loaded in the workspace.

  • As suggested in the video, you will predict on the outcomes TRUE and FALSE. Create a new column survived in the sparrow data frame that is TRUE when status == “Survived”.

  • Create the formula fmla that expresses survived as a function of the variables of interest. Print it.

  • Fit a logistic regression model to predict the probability of sparrow survival. Assign the model to the variable sparrow_model.

  • Call summary() to see the coefficients of the model, the deviance and the null deviance.

  • Call glance() on the model to see the deviances and other diagnostics in a data frame. Assign the output from glance() to the variable perf.

  • Calculate the pseudo-R2.

# sparrow is in the workspace
sparrow<-readRDS("sparrow.rds")
summary(sparrow)
      status       age             total_length      wingspan         weight       beak_head    
 Perished:36   Length:87          Min.   :153.0   Min.   :236.0   Min.   :23.2   Min.   :29.80  
 Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0   1st Qu.:24.7   1st Qu.:31.40  
               Mode  :character   Median :160.0   Median :247.0   Median :25.8   Median :31.70  
                                  Mean   :160.4   Mean   :247.5   Mean   :25.8   Mean   :31.64  
                                  3rd Qu.:162.5   3rd Qu.:251.0   3rd Qu.:26.7   3rd Qu.:32.10  
                                  Max.   :167.0   Max.   :256.0   Max.   :31.0   Max.   :33.00  
    humerus           femur           legbone          skull           sternum      
 Min.   :0.6600   Min.   :0.6500   Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
 1st Qu.:0.7250   1st Qu.:0.7000   1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
 Median :0.7400   Median :0.7100   Median :1.130   Median :0.6000   Median :0.8500  
 Mean   :0.7353   Mean   :0.7134   Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
 3rd Qu.:0.7500   3rd Qu.:0.7300   3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
 Max.   :0.7800   Max.   :0.7600   Max.   :1.230   Max.   :0.6400   Max.   :0.9300  
# Create the survived column
sparrow$survived <- sparrow$status=="Survived"
# Create the formula
(fmla <- survived~total_length+weight+humerus)
survived ~ total_length + weight + humerus
# Fit the logistic regression model
sparrow_model <- glm(fmla,data=sparrow, family="binomial")
# Call summary
summary(sparrow_model)

Call:
glm(formula = fmla, family = "binomial", data = sparrow)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1117  -0.6026   0.2871   0.6577   1.7082  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   46.8813    16.9631   2.764 0.005715 ** 
total_length  -0.5435     0.1409  -3.858 0.000115 ***
weight        -0.5689     0.2771  -2.053 0.040060 *  
humerus       75.4610    19.1586   3.939 8.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 118.008  on 86  degrees of freedom
Residual deviance:  75.094  on 83  degrees of freedom
AIC: 83.094

Number of Fisher Scoring iterations: 5
# Call glance
(perf <- glance(sparrow_model))
# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
[1] 0.3636526

4.2: Predict sparrow survival

In this exercise you will predict the probability of survival using the sparrow survival model from the previous exercise.

Recall that when calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:

predict(model, type = “response”) Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.

You will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model’s gain curve is close to the ideal (“wizard”) gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:

  • frame: data frame with prediction column and ground truth column

  • xvar: the name of the column of predictions (as a string)

  • truthVar: the name of the column with actual outcome (as a string)

  • title: a title for the plot (as a string)

  • GainCurvePlot(frame, xvar, truthVar, title)

Instructions

100 XP

  • The dataframe sparrow and the model sparrow_model are in the workspace.

  • Create a new column in sparrow called pred that contains the predictions on the training data.

  • Call GainCurvePlot() to create the gain curve of predictions. Does the model do a good job of sorting the sparrows by whether or not they actually survived?

# sparrow is in the workspace
summary(sparrow)
      status       age             total_length      wingspan         weight       beak_head    
 Perished:36   Length:87          Min.   :153.0   Min.   :236.0   Min.   :23.2   Min.   :29.80  
 Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0   1st Qu.:24.7   1st Qu.:31.40  
               Mode  :character   Median :160.0   Median :247.0   Median :25.8   Median :31.70  
                                  Mean   :160.4   Mean   :247.5   Mean   :25.8   Mean   :31.64  
                                  3rd Qu.:162.5   3rd Qu.:251.0   3rd Qu.:26.7   3rd Qu.:32.10  
                                  Max.   :167.0   Max.   :256.0   Max.   :31.0   Max.   :33.00  
    humerus           femur           legbone          skull           sternum      
 Min.   :0.6600   Min.   :0.6500   Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
 1st Qu.:0.7250   1st Qu.:0.7000   1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
 Median :0.7400   Median :0.7100   Median :1.130   Median :0.6000   Median :0.8500  
 Mean   :0.7353   Mean   :0.7134   Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
 3rd Qu.:0.7500   3rd Qu.:0.7300   3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
 Max.   :0.7800   Max.   :0.7600   Max.   :1.230   Max.   :0.6400   Max.   :0.9300  
  survived      
 Mode :logical  
 FALSE:36       
 TRUE :51       
                
                
                
# sparrow_model is in the workspace
summary(sparrow_model)

Call:
glm(formula = fmla, family = "binomial", data = sparrow)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1117  -0.6026   0.2871   0.6577   1.7082  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   46.8813    16.9631   2.764 0.005715 ** 
total_length  -0.5435     0.1409  -3.858 0.000115 ***
weight        -0.5689     0.2771  -2.053 0.040060 *  
humerus       75.4610    19.1586   3.939 8.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 118.008  on 86  degrees of freedom
Residual deviance:  75.094  on 83  degrees of freedom
AIC: 83.094

Number of Fisher Scoring iterations: 5
# Make predictions
sparrow$pred <- predict(sparrow_model,type="response")
# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")

Remark: You see from the gain curve that the model follows the wizard curve for about the first 30% of the data, identifying about 45% of the surviving sparrows with only a few false positives.

4.3: Poisson or quasipoisson

One of the assumptions of Poisson regression to predict counts is that the event you are counting is Poisson distributed: the average count per unit time is the same as the variance of the count. In practice, “the same” means that the mean and the variance should be of a similar order of magnitude.

When the variance is much larger than the mean, the Poisson assumption doesn’t apply, and one solution is to use quasipoisson regression, which does not assume that variance=mean.

For each of the following situations, decide if poisson regression would be suitable, or if you should use quasipoisson regression.

For which situations can you use poisson regression?

  1. Number of days students are absent: mean 5.9, variance 49

  2. Number of awards a student wins: mean 0.6, variance 1.1

  3. Number of hits per website page: mean 108.2, variance 108.5

  4. Number of bikes rented per day: mean 273, variance 45863.84

Answer the question

50 XP

Possible Answers

  1. All of them

  2. 1 and 4

  3. 2 and 3

  4. 1 and 3

  5. 2 and 4

4.4: Fit a model to predict bike rental counts

In this exercise you will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

The data frame has the columns:

  • cnt: the number of bikes rented in that hour (the outcome)

  • hr: the hour of the day (0-23, as a factor)

  • holiday: TRUE/FALSE

  • workingday: TRUE if neither a holiday nor a weekend, else FALSE

  • weathersit: categorical, “Clear to partly cloudy”/“Light Precipitation”/“Misty”

  • temp: normalized temperature in Celsius

  • atemp: normalized “feeling” temperature in Celsius

  • hum: normalized humidity

  • windspeed: normalized windspeed

  • instant: the time index – number of hours since beginning of data set (not a variable) mnth and yr: month and year indices (not variables)

Remember that you must specify family = poisson or family = quasipoisson when using glm() to fit a count model.

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in variables, and use paste() to assemble a string representing the model formula.

Instructions

100 XP

  • The data frame bikesJuly is in the workspace. The names of the outcome variable and the input variables are also in the workspace as the variables outcome and vars respectively.

  • Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.

  • Calculate the mean (mean()) and variance (var()) of bikesJuly$cnt.

  • Should you use poisson or quasipoisson regression?

  • Use glm() to fit a model to the bikesJuly data: bike_model.

  • Use glance() to look at the model’s fit statistics. Assign the output of glance() to the variable perf.

  • Calculate the pseudo-R-squared of the model.

bikesJuly<-read.csv("bikesJuly.csv")
# bikesJuly is in the workspace
str(bikesJuly)
'data.frame':   744 obs. of  12 variables:
 $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ weathersit: Factor w/ 3 levels "Clear to partly cloudy",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
 $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
 $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
 $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
 $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
 $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
 $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# The outcome column
outcome<-c("cnt")
# The inputs to use
vars<-c("hr","holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
[1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
[1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
[1] 45863.84
# Fit the model
bike_model <- glm(fmla,data=bikesJuly,family=quasipoisson)
# Call glance
(perf <- glance(bike_model))
# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
[1] 0.3472145

4.5: Predict bike rentals on new data

In this exercise you will use the model you built in the previous exercise to make predictions for the month of August. The data set bikesAugust has the same columns as bikesJuly.

Recall that you must specify type = “response” with predict() when predicting counts from a glm poisson or quasipoisson model.

Instructions

100 XP

  • The model bike_model and the data frame bikesAugust are in the workspace.

  • Use predict to predict the number of bikes per hour on the bikesAugust data. Assign the predictions to the column bikesAugust$pred.

  • Fill in the blanks to get the RMSE of the predictions on the August data.

  • Fill in the blanks to generate the plot of predictions to actual counts.

  • Do any of the predictions appear negative?

bikesAugust<-read.csv("bikesAugust.csv")
# bikesAugust is in the workspace
str(bikesAugust)
'data.frame':   744 obs. of  13 variables:
 $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ weathersit: Factor w/ 3 levels "Clear to partly cloudy",..: 1 1 1 1 3 3 1 3 3 3 ...
 $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
 $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
 $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
 $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
 $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
 $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
 $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pred      : num  94.9 51.7 37.9 17.5 9.3 ...
# bike_model is in the workspace
summary(bike_model)

Call:
glm(formula = fmla, family = quasipoisson, data = bikesJuly)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-20.613   -9.712   -3.379    4.536   34.847  

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.783187   0.716088   8.076 2.74e-15 ***
hr                             0.052373   0.004225  12.395  < 2e-16 ***
holidayTRUE                    0.050045   0.142381   0.351   0.7253    
workingdayTRUE                 0.041671   0.060260   0.692   0.4895    
weathersitLight Precipitation -0.049635   0.126787  -0.391   0.6956    
weathersitMisty                0.114739   0.065897   1.741   0.0821 .  
temp                          -2.231584   1.859082  -1.200   0.2304    
atemp                          2.266270   1.573259   1.440   0.1502    
hum                           -1.563113   0.390401  -4.004 6.87e-05 ***
windspeed                      0.590418   0.263472   2.241   0.0253 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 129.1499)

    Null deviance: 133365  on 743  degrees of freedom
Residual deviance:  87059  on 734  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
# Make predictions on August data
bikesAugust$pred  <- predict(bike_model,newdata=bikesAugust, type="response")
# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred-cnt) %>%
  summarize(rmse  = sqrt(mean(residual^2)))
# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")

4.6: Visualize the Bike Rental Predictions

In the previous exercise, you visualized the bike model’s predictions using the standard “outcome vs. prediction” scatter plot. Since the bike rental data is time series data, you might be interested in how the model performs as a function of time. In this exercise, you will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August.

To create the plot you will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust in a single column. gather() takes as arguments:

  • The “wide” data frame to be gathered (implicit in a pipe)

  • The name of the key column to be created - contains the names of the gathered columns.

  • The name of the value column to be created - contains the values of the gathered columns.

  • The names of the columns to be gathered into a single column.

You’ll use the gathered data frame to compare the actual and predicted rental counts as a function of time. The time index, instant counts the number of observations since the beginning of data collection. The sample code converts the instants to daily units, starting from 0.

Instructions

100 XP

  • The data frame bikesAugust, with the predictions (bikesAugust$pred) is in the workspace.

  • Fill in the blanks to plot the predictions and actual counts by hour for the first 14 days of August.

  • convert instant to be in day units, rather than hour

  • gather() the cnt and pred columns into a column called value, with a key called valuetype.

  • filter() for the first two weeks of August

  • Plot value as a function of instant (day).

  • Does the model see the general time patterns in bike rentals?

# Plot predictions and cnt by date/time
quasipoisson_plot<-bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restric to first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model")
quasipoisson_plot

Remark: This model mostly identifies the slow and busy hours of the day, although it often underestimates peak demand.

4.7: Writing formulas for GAM models

When using gam() to model outcome as an additive function of the inputs, you can use the s() function inside formulas to designate that you want use a spline to model the non-linear relationship of a continuous variable to the outcome.

Suppose that you want to predict how much weight (Wtloss) a dieter will lose over a 2-year diet plan as a function of:

  • Diet type (categorical)

  • Sex (categorical)

  • Age at beginning of diet (continuous)

  • BMI (body mass index) at beginning of diet (continuous)

You do not want to assume that any of the relationships are linear.

Which is the most appropriate formula?

Answer the question

50 XP

Possible Answers

  1. Wtloss ~ Diet + Sex + Age + BMI

  2. Wtloss ~ s(Diet) + s(Sex) + s(Age) + s(BMI)

  3. Wtloss ~ Diet + Sex + s(Age) + s(BMI) [ans]

Remark: Categorial variables cannot be ‘splined’, piecewise polynomially modelled

4.8: Writing formulas for GAM models (2)

Suppose that in the diet problem from the previous exercise, you now also want to take into account

the dieter’s resting metabolic rate (BMR – continuous) and the dieter’s average number hours of aerobic exercise per day (E – continuous) at the beginning of the study.

You have reason to believe that the relationship between BMR and weight loss is linear (and you want to model it that way), but not necessarily the relationship between aerobic exercise and weight loss.

Which is the most appropriate formula?

Answer the question

50 XP

Possible Answers

  1. Wtloss ~ Diet + Sex + s(Age) + s(BMI) + s(BMR) + s(E)

  2. Wtloss ~ Diet + Sex + s(Age) + s(BMI) + BMR + s(E) [ans]

  3. Wtloss ~ Diet + Sex + s(Age) + s(BMI) + s(BMR) + E

4.9: Model soybean growth with GAM

In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting). As you will see, the soybean plant doesn’t grow at a steady rate, but rather has a “growth spurt” that eventually tapers off. Hence, leaf weight is not well described by a linear model.

Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:

y ~ s(x)

Also remember that gam() from the package mgcv has the calling interface

gam(formula, family, data)

For standard regression, use family = gaussian (the default).

The soybean training data, soybean_train is loaded into your workspace. It has two columns: the outcome weight and the variable Time. For comparison, the linear model model.lin, which was fit using the formula weight ~ Time has already been loaded into the workspace as well.

Instructions

100 XP

Instructions

100 XP

  • Fill in the blanks to plot weight versus Time (Time on x-axis). Does the relationship look linear?

  • Load the package mgcv.

  • Create the formula fmla.gam to express weight as a non-linear function of Time. Print it.

  • Fit a generalized additive model on soybean_train using fmla.gam.

  • Call summary() on the linear model model.lin (already in your workspace). What is the R2?

  • Call summary() on ’model.gam. The “deviance explained” reports the model’s unadjusted R2. What is the R2?

  • Which model appears to be a better fit to the training data?

  • Call plot() on model.gam to see the derived relationship between Time and weight.

# soybean_train is in the workspace
soybean_train<-read.csv("soybean_train.csv")
summary(soybean_train)
       X              Plot     Variety      Year           Time           weight       
 Min.   :  1.0   1988F6 : 10   F:161   Min.   :1988   Min.   :14.00   Min.   : 0.0290  
 1st Qu.:101.2   1988F7 :  9   P:169   1st Qu.:1988   1st Qu.:27.00   1st Qu.: 0.6663  
 Median :207.5   1988P1 :  9           Median :1989   Median :42.00   Median : 3.5233  
 Mean   :207.3   1988P2 :  9           Mean   :1989   Mean   :43.56   Mean   : 6.1645  
 3rd Qu.:307.8   1988P8 :  9           3rd Qu.:1990   3rd Qu.:56.00   3rd Qu.:10.3808  
 Max.   :412.0   1988F3 :  8           Max.   :1990   Max.   :84.00   Max.   :27.3700  
                 (Other):276                                                           
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point()
# Load the package mgcv
#install.packages("nlme")
#install.packages("mgcv")
#library(nlme)
#library(mgcv)
install.packages('gam')
Warning in install.packages :
  unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'
trying URL 'https://mran.microsoft.com/snapshot/2018-08-01/bin/windows/contrib/3.5/gam_1.16.zip'
Content type 'application/zip' length 409705 bytes (400 KB)
downloaded 400 KB
package ‘gam’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\cweiqiang\AppData\Local\Temp\RtmpueSsh9\downloaded_packages

Restarting R session...

4.10: Predict with the soybean model on test data

In this exercise you will apply the soybean models from the previous exercise (model.lin and model.gam, already in your workspace) to new data: soybean_test.

Instructions

100 XP

  • The data frame soybean.test and the models model.lin and model.gam are in the workspace.

  • Create a column soybean_test$pred.lin with predictions from the linear model model.lin.

  • Create a column soybean_test$pred.gam with predictions from the gam model model.gam.

  • For GAM models, the predict() method returns a matrix, so use as.numeric() to convert the matrix to a vector.

  • Fill in the blanks to gather() the prediction columns into a single value column pred with key column modeltype. Call the long dataframe soybean_long.

  • Calculate and compare the RMSE of both models.

  • Which model does better?

  • Run the code to compare the predictions of each model against the actual average leaf weights.

  • A scatter plot of weight as a function of Time.

  • Point-and-line plots of the predictions (pred) as a function of Time.

  • Notice that the linear model sometimes predicts negative weights! Does the gam model?

Remark: The GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative.

Chapter 5: Tree-Based Methods

In this chapter we will look at modeling algorithms that do not assume linearity or additivity, and that can learn limited types of interactions among input variables. These algorithms are tree-based methods that work by combining ensembles of decision trees that are learned from the training data.

5.1: Build a random forest model for bike rentals

In this exercise you will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

You will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:

  • formula

  • data

  • num.trees: the number of trees in the forest.

  • respect.unordered.factors : Specifies how to treat unordered factor variables. We recommend setting this to “order” for regression.

  • seed: because this is a random algorithm, you will set the seed to get reproducible results

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.

Instructions

100 XP

The data frame bikesJuly is in the workspace. The sample code specifies the names of the outcome and input variables.

  • Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.

  • Load the package ranger.

  • Use ranger to fit a model to the bikesJuly data: bike_model_rf.

  • The first argument to ranger() is the formula, fmla.

  • Use 500 trees and respect.unordered.factors = “order”.

  • Set the seed to seed for reproducible results.

  • Print the model. What is the R-squared?

5.2: Predict bike rentals with the random forest model

In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.

The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:

predict(model, data)$predictions

Instructions

100 XP

  • The model bike_model_rf and the dataset bikesAugust (for evaluation) are loaded into your workspace.

  • Call predict() on bikesAugust to predict the number of bikes rented in August (cnt). Add the predictions to bikesAugust as the column pred.

  • Fill in the blanks to calculate the root mean squared error of the predictions.

  • The poisson model you built for this data gave an RMSE of about 112.6. How does this model compare?

  • Fill in the blanks to plot actual bike rental counts (cnt) versus the predictions (pred on x-axis).

5.3: Visualize random forest bike model predictions

In the previous exercise, you saw that the random forest bike model did better on the August data than the quasiposson model, in terms of RMSE.

In this exercise you will visualize the random forest model’s August predictions as a function of time. The corresponding plot from the quasipoisson model that you built in a previous exercise is in the workspace for you to compare.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

Instructions

100 XP

  • The data frame bikesAugust (with predictions) is in the workspace. The plot quasipoisson_plot of quasipoisson model predictions as a function of time is also in the workspace.

  • Print quasipoisson_plot to review the quasipoisson model’s behavior.

  • Fill in the blanks to plot the predictions and actual counts by hour for the first 14 days of August. gather the cnt and pred columns into a column called value, with a key called valuetype.

  • Plot value as a function of instant (day).

  • How does the random forest model compare?

5.4: vtreat on a small example

In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded “lev”), and to clean bad values out of numerical variables (coded “clean”).

To design a treatment plan use the function designTreatmentsZ()

  • treatplan <- designTreatmentsZ(data, varlist)

  • data: the original training data frame

  • varlist: a vector of input variables to be treated (as strings).

  • designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

scoreFrame <- treatplan %>% magrittr::use_series(scoreFrame) %>% select(varName, origName, code)

  • varName: the name of the new treated variable

  • origName: the name of the original variable that the treated variable comes from
  • code: the type of the new variable.

  • “clean”: a numerical variable with no NAs or NaNs
  • “lev”: an indicator variable for a specific level of the original categorical variable. (magrittr::use_series() is an alias for $ that you can use in pipes.)

For these exercises, we want varName where code is either “clean” or “lev”:

newvarlist <- scoreFrame %>% filter(code %in% c(“clean”, “lev”) %>% magrittr::use_series(varName)

To transform the data set into all numerical and one-hot-encoded variables, use prepare():

data.treat <- prepare(treatplan, data, varRestrictions = newvarlist)

  • treatplan: the treatment plan

  • data: the data frame to be treated

  • varRestrictions: the variables desired in the treated data

Instructions

100 XP

The data frame dframe and the package magrittr are loaded in the workspace.

  1. Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.

  2. Create a vector called vars with the names of the input variables (as strings).

  3. Load the package vtreat.

  4. Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.

  5. Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables.

  6. You only need the columns varName, origName and code.

  7. What are the names of the new indicator variables? Of the continuous variable?

  8. Create a vector newvars that contains the variable varName where code is either clean or lev. Print it.

  9. Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column).

10.Print it and compare to dframe.

5.5: Novel levels

When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.

vtreat is a “safer” alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).

In this exercise you will see how vtreat handles categorical values that did not appear in the training set. The treatment plan treatplan and the set of variables newvars from the previous exercise are still in your workspace. dframe and a new data frame testframe are also in your workspace.

Instructions

100 XP

  • Print dframe and testframe.

  • Are there colors in testframe that didn’t appear in dframe?

  • Call prepare() to create a one-hot-encoded version of testframe (without the outcome). Call it testframe.treat and print it.

  • Use the varRestriction argument to restrict to only the variables in newvars.

  • How are the yellow rows encoded?

5.6: vtreat the bike rental data

In this exercise you will create one-hot-encoded data frames of the July/August bike data, for use with xgboost later on.

The data frames bikesJuly and bikesAugust are in the workspace.

For your convenience, we have defined the variable vars with the list of variable columns for the model.

Instructions

100 XP

  • Load the package vtreat.

  • Use designTreatmentsZ() to create a treatment plan treatplan for the variables in vars from bikesJuly (the training data).

  • Set the flag verbose=FALSE to prevent the function from printing too many messages.

  • Fill in the blanks to create a vector newvars that contains only the names of the clean and lev transformed variables. Print it.

  • Use prepare() to create a one-hot-encoded training data frame bikesJuly.treat.

  • Use the varRestrictions argument to restrict the variables you will use to newvars.

  • Use prepare() to create a one-hot-encoded test frame bikesAugust.treat from bikesAugust in the same way.

  • Call str() on both prepared test frames to see the structure.

5.7: Find the right number of trees for a gradient boosting machine

In this exercise you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July.

The July data is loaded into your workspace. Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt.

You will use the xgboost package to fit the random forest model. The function xgb.cv() uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.

For this exercise, the key arguments to the xgb.cv() call are:

  • data: a numeric matrix.

  • label: vector of outcomes (also numeric).

  • nrounds: the maximum number of rounds (trees to build).

  • nfold: the number of folds for the cross-validation. 5 is a good number.

  • objective: “reg:linear” for continuous outcomes.

  • eta: the learning rate.

  • max_depth: depth of trees.

  • early_stopping_rounds: after this many rounds without improvement, stop.

  • verbose: 0 to stay silent.

Instructions

100 XP

The data frames bikesJuly and bikesJuly.treat are in the workspace.

  1. Load the package xgboost.

  2. Fill in the blanks to run xgb.cv() on the treated training data; assign the output to the variable cv.

  3. Use as.matrix() to convert the vtreated data frame to a matrix.

  4. Use 100 rounds, and 5-fold cross validation.

  5. Set early_stopping_rounds to 10.

  6. Set eta to 0.3, max_depth to 6.

  7. Get the data frame evaluation_log from cv and assign it to the variable elog. Each row of the evaluation_log corresponds to an additional tree, so the row number tells you the number of trees in the model.

  8. Fill in the blanks to get the number of trees with the minimum value of the columns train_rmse_mean and test_rmse_mean.

  9. which.min() returns the index of the minimum value in a vector. How many trees do you need?

5.8: Fit an xgboost bike rental model and predict

In this exercise you will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July and predict on data for the month of August.

The datasets for July and August are loaded into your workspace. Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column).

For convenience, the number of trees to use, ntrees from the previous exercise is in the workspace.

The arguments to xgboost() are similar to those of xgb.cv().

Instructions

100 XP

  • The data frames bikesJuly, bikesJuly.treat, bikesAugust and bikesAugust.treat are in the workspace. The number of trees ntrees is in the workspace.

  • Fill in the blanks to run xgboost() on the July data. Assign the model to the variable model.

  • Use as.matrix() to convert the vtreated data frame to a matrix.

  • The objective should be “reg:linear”.

  • Use ntrees rounds.

  • Set eta to 0.3, depth to 6, and verbose to 0 (silent).

  • Now call predict() on bikesAugust.treat to predict the number of bikes rented in August.

  • Use as.matrix() to convert the vtreat-ed test data into a matrix.

  • Add the predictions tobikesAugust as the column pred.

  • Fill in the blanks to plot actual bike rental counts versus the predictions (predictions on the x-axis). Do you see a possible problem with the predictions?

5.9: Evaluate the xgboost bike rental model

In this exercise you will evaluate the gradient boosting model bike_model_xgb that you fit in the last exercise, using data from the month of August. You’ll compare this model’s RMSE for August to the RMSE of previous models that you’ve built.

The dataset bikesAugust is in the workspace. You have already made predictions using the xgboost model; they are in the column pred.

Instructions

100 XP

  • Fill in the blanks to calculate the RMSE of the predictions.

  • How does it compare to the RMSE from the poisson model (approx. 112.6) and the random forest model (approx. 96.7)?

5.10: Visualize the xgboost bike rental model

You’ve now seen three different ways to model the bike rental data. For this example, you’ve seen that the gradient boosting model had the smallest RMSE. To finish up the course, let’s compare the gradient boosting model’s predictions to the other two models as a function of time.

On completing this exercise, you will have completed the course. Congratulations! Now you have the tools to apply a variety of approaches to your regression tasks.

Instructions

100 XP

  • The data frame bikesAugust with predictions, is in the workspace. The plots quasipoisson_plot and randomforest_plot are also in the workspace.

  • Print quasipoisson_plot to review the quasipoisson model’s behavior.

  • Print randomforest_plot to review the random forest model’s behavior.

  • Fill in the blanks to plot the gradient boosting predictions and actual counts by hour for the first 14 days of August.

  • gather() the cnt and pred columns into a column called value, with a key called valuetype.

  • Plot value as a function of instant (day). How does the gradient boosting model compare to the previous models?

Remark: The gradient boosting pattern captures rental variations due to time of day and other factors better than the previous models.

---
title: "Datacamp R - Supervised Learning in R Regression"
author: "Chen Weiqiang"
date: "November 26, 2018"
output: html_notebook
---

```{r setup, include=TRUE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = TRUE)
# Remark: If need Math font, 'Save with encoding as UTF' which makes the R Markdown compile
```

# R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:


# Chapter 1: What is Regression?

In this chapter we introduce the concept of regression from a machine learning point of view. We will present the fundamental regression method: linear regression. We will show how to fit a linear regression model and to make predictions from the model.

## 1.1: Identify the regression tasks

From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values.

Which of the following tasks are regression problems?

Answer the question

50 XP

Possible Answers

1. Predict (yes/no) whether a student will pass the final exam, given scores on midterms and homework.

2. Predict the student's score (0 - 100) on the final exam, given scores on midterms and homework. [ans]

3. Predict the student's final grade (A, B, C, D, F) in the class given scores on midterms and homework (before they've taken the final exam).

## 1.2: Code a simple one-variable regression

For the first coding exercise, you'll create a formula to define a one-variable modeling task, and then fit a linear model to the data. You are given the rates of male and female unemployment in the United States over several years (Source).

The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.

The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.

Recall the calling interface for lm() is:

lm(formula, data = ___)

Instructions

100 XP

- The data frame unemployment is in your workspace.

- Define a formula that expresses female_unemployment as a function of male_unemployment. Assign the formula to the variable fmla and print it.

- Then use lm() and fmla to fit a linear model to predict female unemployment from male unemployment using the data set unemployment.

- Print the model. Is the coefficent for male unemployment consistent with what you would expect? Does female unemployment increase as male unemployment does?

```{r}
# unemployment is loaded in the workspace
unemployment<-readRDS("unemployment.rds")
summary(unemployment)

# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment

# Print it
fmla

# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)

# Print it
unemployment_model
```

## 1.3: Examining a model

Let's look at the model unemployment_model that you have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().

Instructions

100 XP

- The object unemployment_model is in your workspace.

- Print unemployment_model again. What information does it report?

- Call summary() on unemployment_model. In addition to the coefficient values, you get standard errors on the coefficient estimates, and some goodness-of-fit metrics like R-squared.

- Call glance() on the model to see the performance metrics in an orderly data frame. Can you match the information from summary() to the columns of glance()?

- Now call wrapFTest() on the model to see the R-squared again.

```{r}
# broom and sigr are already loaded in your workspace
# Print unemployment_model
install.packages("nlme")
install.packages("broom")
install.packages("sigr")
library(nlme)
library(broom)
library(sigr)
unemployment_model

# Call summary() on unemployment_model to get more details
summary(unemployment_model)

# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)

# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
```
Remark: here are several different ways to get diagnostics for your model. Use the one that suits your needs or preferences the best.

## 1.4: Predicting from the unemployment model

In this exercise, you will use your unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. You will also use your model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.

The predict() interface for lm models takes the form

predict(model, newdata)

You will use the ggplot2 package to make the plots, so you will add the prediction column to the unemployment data frame. You will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is when the outcome is equal to the predicted value).

The ggplot2 command to plot a scatterplot of dframe$outcome versus dframe$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred is as follows:

ggplot(dframe, aes(x = pred, y = outcome)) + 
       geom_point() +  
       geom_abline(color = "blue")

Instructions

100 XP

- The objects unemployment, unemployment_model and newrates are in your workspace.

- Use predict() to predict female unemployment rates from the unemployment data. Assign it to a new column: prediction.

- Use the library() command to load the ggplot2 package.


- Use ggplot() to compare the predictions to actual unemployment rates. Put the predictions on the x axis.
How close are the results to the line of perfect prediction?

- Use the data frame newrates to predict expected female unemployment rate when male unemployment is 5%. Assign the answer to the variable pred and print it.

```{r}
newrates<-read.csv("new_rates.csv")

# unemployment is in your workspace
summary(unemployment)

# newrates is in your workspace
newrates

# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model)

# load the ggplot2 package
library(ggplot2)

# Make a plot to compare predictions to actual (prediction on x axis)
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata = newrates)
# Print it
pred

```

## 1.5: Multivariate linear regression (Part 1)

In this exercise, you will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.

Instructions

100 XP

- The data frame bloodpressure is in the workspace.

- Define a formula that expresses blood_pressure explicitly as a function of age and weight. Assign the formula to the variable fmla and print it.

- Use fmla to fit a linear model to predict blood_pressure from age and weight in the data set bloodpressure. Call the model bloodpressure_model.

- Print the model and call summary() on it. Does blood pressure increase or decrease with age? With weight?

```{r}
# bloodpressure is in the workspace
bloodpressure<-readRDS("bloodpressure.rds")
summary(bloodpressure)

# Create the formula and print it
fmla <- blood_pressure~age+weight
fmla

# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla,data=bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model
summary(bloodpressure_model)
```

## 1.6: Multivariate linear regression (Part 2)

Now you will make predictions using the blood pressure model bloodpressure_model that you fit in the previous exercise.

You will also compare the predictions to outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:

ggplot(dframe, aes(x = pred, y = outcome)) + 
     geom_point() + 
     geom_abline(color = "blue")
`

Instructions

100 XP

- The objects bloodpressure and bloodpressure_model are in the workspace.

- Use predict() to predict blood pressure in the bloodpressure dataset. Assign the predictions to the column prediction.

- Graphically compare the predictions to actual blood pressures. Put predictions on the x axis. How close are the results to the line of perfect prediction?

```{r}
# bloodpressure is in your workspace
summary(bloodpressure)

# bloodpressure_model is in your workspace
bloodpressure_model

# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model)

# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure))+ 
    geom_point() +
    geom_abline(color = "blue")
```

# Chapter 2: Training and Evaluating Regression Models

Now that we have learned how to fit basic linear regression models, we will learn how to evaluate how well our models perform. We will review evaluating a model graphically, and look at two basic metrics for regression models. We will also learn how to train a model that will perform well in the wild, not just on training data. Although we will demonstrate these techniques using linear regression, all these concepts apply to models fit with any regression algorithm.

## 2.1: Graphically evaluate the unemployment model

In this exercise you will graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment.

You will plot the model's predictions against the actual female_unemployment; recall the command is of the form

ggplot(dframe, aes(x = pred, y = outcome)) + 
       geom_point() +  
       geom_abline()
Then you will calculate the residuals:

residuals <- actual outcome - predicted outcome
and plot predictions against residuals. The residual graph will take a slightly different form: you compare the residuals to the horizonal line x=0 (using geom_hline()) rather than to the line x=y. The command will be provided.

Instructions

100 XP

- The data frame unemployment and model unemployment_model are available in the workspace.

- Use predict() to get the model predictions and add them to unemployment as the column predictions.

- Plot predictions (on the x-axis) versus actual female unemployment rates. Are the predictions near the x=y line?

- Calculate the residuals between the predictions and actual unemployment rates. Add these residuals to unemployment as the column residuals.

- Fill in the blanks to plot predictions (on the x-axis) versus residuals (on the y-axis). This gives you a different view of the model's predictions as compared to ground truth.

```{r}
# unemployment is in the workspace
summary(unemployment)

# unemployment_model is in the workspace
summary(unemployment_model)

# Make predictions from the model
unemployment$predictions <- predict(unemployment_model)

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) + 
  geom_point() + 
  geom_abline()

# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions

# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")
```

## 2.2: The gain curve to evaluate the unemployment model

In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals. Now, you will also plot the gain curve of the unemployment_model's predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.

For situations where order is more important than exact values, the gain curve helps you check if the model's predictions sort in the same order as the true outcome.

Calls to the function GainCurvePlot() look like:

GainCurvePlot(frame, xvar, truthvar, title)

where

- frame is a data frame
- xvar and truthvar are strings naming the prediction and actual outcome columns of frame
- title is the title of the plot

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.

Instructions

100 XP

- The data frame unemployment and the model unemployment_model are in the workspace.

- Load the package WVPlots.

- Plot the gain curve. Give the plot the title "Unemployment model". Do the model's predictions sort correctly?

```{r}
# unemployment is in the workspace (with predictions)
summary(unemployment)

# unemployment_model is in the workspace
summary(unemployment_model)

# Load the package WVPlots
install.packages("WVPlots")
library(WVPlots)

# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")
```

## 2.3: Calculate RMSE

In this exercise you will calculate the RMSE of your unemployment model. In the previous coding exercises, you added two columns to the unemployment dataset:

the model's predictions (predictions column)
the residuals between the predictions and the outcome (residuals column)
You can calculate the RMSE from a vector of residuals, res, as:

RMSE=sqrt{mean(res 2)}

You want RMSE to be small. How small is "small"? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.

Instructions

100 XP

- The data frame unemployment is in your workspace.

- Review the unemployment data from the previous exercise.

- For convenience, assign the residuals column from unemployment to the variable res.

- Calculate RMSE: square res, take its mean, and then square root it. Assign this to the variable rmse and print it.

- Tip: you can do this in one step by wrapping the assignment in parentheses: (rmse <- ___)

- Calculate the standard deviation of female_unemployment and assign it to the variable sd_unemployment.

- Print it. How does the rmse of the model compare to the standard deviation of the data?

```{r}
# unemployment is in the workspace
summary(unemployment)

# For convenience put the residuals in the variable res
res <- unemployment$predictions-unemployment$female_unemployment

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))

# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
sd_unemployment
```

## 2.4: Calculate R-Squared

Now that you've calculated the RMSE of your model's predictions, you will examine how well the model fits the data: that is, how much variance does it explain. You can do this using R2.

Suppose y is the true outcome, p is the prediction from the model, and res=y−p are the residuals of the predictions.

- Then the total sum of squares tss ("total variance") of the data is: $tss=\sum (y−\bar{y})^{2}$ where $\bar{y}$ is the mean value of y.

- The residual sum of squared errors of the model, rss is: $rss=\sum res^{2}$


- $R^{2}$ (R-Squared), the "variance explained" by the model, is then: $1−\frac{rss}{tss}$

After you calculate $R^{2}$ , you will compare what you computed with the $R^{2}$  reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the R2 of the model on the training data.

The data frame unemployment is in your workspace, with the columns predictions and residuals that you calculated in a previous exercise.

Instructions

100 XP

- The data frame unemployment and the model unemployment_model are in the workspace.

- Calculate the mean female_unemployment and assign it to the variable fe_mean.

- Calculate the total sum of squares and assign it to the variable tss.

- Calculate the residual sum of squares and assign it to the variable rss.

- Calculate R2. Is it a good fit (R2 near 1)?

- Use glance() to get R2 from the model. Is it the same as what you calculated?


```{r}
# unemployment is in your workspace
summary(unemployment)

# unemployment_model is in the workspace
summary(unemployment_model)

# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
fe_mean
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))

# Calculate residual sum of squares: rss. Print it
(rss <- sum((unemployment$female_unemployment - unemployment$predictions)^2))

# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1-rss/tss)

# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
```

## 2.5: Correlation and R-squared

The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively:

the outcomes of a regression model that minimizes squared-error (like linear regression) and
the true outcomes of the training data,
then the square of the correlation is the same as R2. You will verify that in this exercise.

Instructions

100 XP

- Use cor() to get the correlation between the predictions and female unemployment. Assign it to the variable rho and print it. Make sure you use Pearson correlation (the default).

- Square rho and assign it to rho2. Print it.

- Compare rho2 to R2 from the model (using glance()). Is it the same?

```{r}
# unemployment is in your workspace
summary(unemployment)

# unemployment_model is in the workspace
summary(unemployment_model)

# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))

# Square rho: rho2 and print it
(rho2 <- rho^2)

# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
```
## 2.6: Generating a random test/train split

For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a data set dframe of size N, and you want a random subset of approximately size 100∗X% of N (where X is between 0 and 1), then:

Generate a vector of uniform random numbers: gp = runif(N).
dframe[gp < X,] will be about the right size.
dframe[gp >= X,] will be the complement.

Instructions

100 XP

- The data frame mpg is in the workspace.

- Use the function nrow to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.

- Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.

- Use runif() to generate a vector of N uniform random numbers, called gp.

-Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).

- Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?



```{r}
# mpg is in the workspace
summary(mpg)
dim(mpg)

# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))

# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(0.75*N))

# Create the vector of N uniform random variables: gp
gp <- runif(N, min = 0, max = 1)
select.training<-gp<0.75
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[select.training,]
mpg_test <- mpg[!select.training,]

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
nrow(mpg_test)
```

## 2.7: Train a model using test/train split

Now that you have split the mpg dataset into mpg_train and mpg_test, you will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).

Instructions

100 XP

- The data frame mpg_train is in the workspace.

- Create a formula fmla that expresses the relationship cty as a function of hwy. Print it.

- Train a model mpg_model on mpg_train to predict cty from hwy using fmla and lm().

- Use summary() to examine the model.


```{r}
# mpg_train is in the workspace
summary(mpg_train)

# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty~hwy)

# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla,data=mpg_train)

# Use summary() to examine the model
summary(mpg_model)
```

## 2.8: Evaluate a model using test/train split

Now you will test the model mpg_model on the test data, mpg_test. Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience:

rmse(predcol, ycol)

r_squared(predcol, ycol)

where:

- predcol: The predicted values

- ycol: The actual outcome

You will also plot the predictions vs. the outcome.

Generally, model performance is better on the training data than the test data (though sometimes the test set "gets lucky"). A slight difference in performance is okay; if the performance on training is significantly better, there is a problem.

Instructions

100 XP

The data frames mpg_train and mpg_test, and the model mpg_model are in the workspace, along with the functions rmse() and r_squared().

- Predict city fuel efficiency from hwy on the mpg_train data. Assign the predictions to the column pred.

- Predict city fuel efficiency from hwy on the mpg_test data. Assign the predictions to the column pred.

- Use rmse() to evaluate rmse for both the test and training sets. Compare. Are the performances similar?

- Do the same with r_squared(). Are the performances similar?

- Use ggplot2 to plot the predictions against cty on the test data.

```{r}
# Examine the objects in the workspace
#ls.str()

# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model)

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model,newdata=mpg_test)

#install.packages("Metrics")
library(Metrics)

# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$cty,mpg_train$pred))
(rmse_test <- rmse(mpg_test$cty,mpg_test$pred))


# Evaluate the r-squared on both training and test data.and print them
#(rsq_train <- r_squared(mpg_train$pred,mpg_train$cty))
#(rsq_test <- r_squared(mpg_test$pred,mpg_test$cty))
(rsq_train <- 1-sse(mpg_train$pred,mpg_train$cty)/sse(mean(mpg_train$cty),mpg_train$cty))
(rsq_test <- 1-sse(mpg_test$pred,mpg_test$cty)/sse(mean(mpg_test$cty),mpg_test$cty))
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()
```
## 2.9: Create a cross validation plan

There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren't used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

train: the indices of dframe that will form the training set
app: the indices of dframe that will form the test (or application) set
In this exercise you will create a 3-fold cross-validation plan for the data set mpg.

Instructions

100 XP

- Load the package vtreat.

- Get the number of rows in mpg and assign it to the variable nRows.

- Call kWayCrossValidation to create a 3-fold cross validation plan and assign it to the variable splitPlan. You can set the last two arguments of the function to NULL.

- Call str() to examine the structure of splitPlan.


```{r}
# Load the package vtreat
install.packages("vtreat")
library(vtreat)

# mpg is in the workspace
summary(mpg)

# Get the number of rows in mpg
nRows <- nrow(mpg)

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)

# Examine the split plan
str(splitPlan)
```

## 2.10: Evaluate a modeling procedure using n-fold cross-validation

In this exercise you will use splitPlan, the 3-fold cross validation plan from the previous exercise, to make predictions from a model that predicts mpg$cty from mpg$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

- Initialize a column of the appropriate length

dframe$pred.cv <- 0 

- k is the number of folds

- splitPlan is the cross validation plan

for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data 
  # from this split 
  # (lm, in this case)
  model <- lm(fmla, data = dframe[split$train,])

  # make predictions on the 
  # application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}


Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.

Instructions

100 XP

- The data frame mpg, the cross validation plan splitPlan, and the function to calculate RMSE (rmse()) from one of the previous exercises is available in your workspace.

- Run the 3-fold cross validation plan from splitPlan and put the predictions in the column mpg$pred.cv.

- Use lm() and the formula cty ~ hwy.

- Create a linear regression model on all the mpg data (formula cty ~ hwy) and assign the predictions to 
mpg$pred.

- Use rmse() to get the root mean squared error of the predictions from the full model (mpg$pred). Recall that rmse() takes two arguments, the predicted values, and the actual outcome.

- Get the root mean squared error of the cross-validation predictions. Are the two values about the same?

```{r}
# mpg is in the workspace
summary(mpg)

# splitPlan is in the workspace
str(splitPlan)

# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty~hwy, data = mpg[split$train,])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app,])
}

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)

# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
```

# Chapter 3: 

## 3.1: Examining the structure of categorical inputs

For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace. It has the following columns:

Flowers: the average number of flowers on a meadowfoam plant
Intensity: the intensity of a light treatment applied to the plant
Time: A categorical variable - when (Late or Early) in the lifecycle the light treatment occurred
The ultimate goal is to predict Flowers as a function of Time and Intensity.

Instructions

50 XP


- The data frame flowers is in your workspace.

- Call the str() function on flowers to see the types of each column.

- Use the unique() function on the column flowers$Time to see the possible values that Time takes. How many unique values are there?

- Create a formula to express Flowers as a function of Intensity and Time. Assign it to the variable fmla and print it.

- Use fmla and model.matrix() to create the model matrix for the data frame flowers. Assign it to the variable mmat.

- Use head() to examine the first 20 lines of flowers.

- Now examine the first 20 lines of mmat.

- Is the numeric column Intensity different?

- What happened to the categorical column Time from flowers?
- How is Time == 'Early' represented? And Time = 'Late'?

```{r}
# Call str on flowers to see the types of each column

flowers<-read.csv("flowers.csv")
str(flowers)

# Use unique() to see how many possible values Time takes
unique(flowers$Time)

# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))

# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, data = flowers)

# Examine the first 20 lines of flowers
head(flowers, n=20)

# Examine the first 20 lines of mmat
head(mmat, n=20)
```

## 3.3: Modeling with categorical inputs

For this exercise you will fit a linear model to the flowers data, to predict Flowers as a function of Time and Intensity.

The model formula fmla that you created in the previous exercise is still in your workspace, as is the model matrix mmat.

Instructions
100 XP

- Use fmla and lm to train a linear model that predicts Flowers from Intensity and Time. Assign the model to the variable flower_model.

- Use summary() to remind yourself of the structure of mmat.

-Use summary() to examine the flower_model. Do the variables match what you saw in mmat?

- Use flower_model to predict the number of flowers. Add the predictions to flowers as the column predictions.

- Fill in the blanks to plot predictions vs. actual flowers (predictions on the x-axis).

```{r}
# flowers in is the workspace
str(flowers)

# fmla is in the workspace
fmla

# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla,data=flowers)

# Use summary on mmat to remind yourself of its structure
summary(mmat)

# Use summary to examine flower_model 
summary(flower_model)

# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model)

# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue") 
```

# 3.4: Modeling an interaction

In this exercise you will use interactions to model the effect of gender and gastric activity on alcohol metabolism.

The data frame alcohol has columns:

1. Metabol: the alcohol metabolism rate

2. Gastric: the rate of gastric alcohol dehydrogenase activity

3. Sex: the sex of the drinker (Male or Female)

In the video, we fit three models to the alcohol data:

- one with only additive (main effect) terms : Metabol ~ Gastric + Sex

- two models, each with interactions between gastric activity and sex

We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model.

Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.

x*y = x + y + x:y

Instructions

100 XP


- The data frame alcohol is in your workspace.

- Write a formula that expresses Metabol as a function of Gastric and Sex with no interactions.

- Assign the formula to the variable fmla_add and print it.

- Write a formula that expresses Metabol as a function of the interaction between Gastric and Sex. Add Gastric as a main effect, but not Sex.

- Assign the formula to the variable fmla_interaction and print it.

- Fit a linear model with only main effects: model_add to the data.

- Fit a linear model with the interaction: model_interaction to the data.

- Call summary() on both models. Which has a better R-squared?

```{r}
# alcohol is in the workspace
alcohol<-read.csv("alcohol.csv")
summary(alcohol)

# Create the formula with main effects only
(fmla_add <- as.formula(Metabol~Gastric+Sex))

# Create the formula with interactions
(fmla_interaction <- as.formula(Metabol~Gastric+Gastric:Sex))
# Fit the main effects only model
model_add <- lm(fmla_add,data=alcohol)

# Fit the interaction model
model_interaction <- lm(fmla_interaction,data=alcohol)

# Call summary on both models and compare
summary(model_add)
summary(model_interaction)
```

## 3.5: Modeling an interaction (2)

In this exercise, you will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.

You will begin to use the dplyr package to do calculations.

mutate() adds new columns to a tbl (a type of data frame)
group_by() specifies how rows are grouped in a tbl
summarize() computes summary statistics of a column
You will also use tidyr's gather() which takes multiple columns and collapses them into key-value pairs.

Instructions

100 XP

- The data frame alcohol and the formulas fmla_add and fmla_interaction are in the workspace.

- Use kWayCrossValidation() to create a splitting plan for a 3-fold cross validation.

- The first argument is the number of rows to be split.

- The second argument is the number of folds for the cross-validation.

- You can set the 3rd and 4th arguments of the function to NULL.

- Examine and run the sample code to get the 3-fold cross-validation predictions of a model with no interactions and assign them to the column pred_add.

- Get the 3-fold cross-validation predictions of the model with interactions. Assign the predictions to the column pred_interaction.

- The sample code shows you the procedure.

- Use the same splitPlan that you already created.

- Fill in the blanks to
gather the predictions into a single column pred.

- add a column of residuals (actual outcome - predicted outcome).

- get the RMSE of the cross-validation predictions for each model type.

- Compare the RMSEs. Based on these results, which model should you use?

```{r}
# alcohol is in the workspace
summary(alcohol)

# Both the formulae are in the workspace
fmla_add
fmla_interaction

# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train,])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}

install.packages("magrittr")
library(magrittr) # for %>% filter function

library('tidyr')
example('gather')
# Get RMSE
alcohol %>% 
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol-pred) %>%      
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
```
## 3.6: Relative error

In this exercise, you will compare relative error to absolute error. For the purposes of modeling, we will define relative error as

$rel=\frac{(pred−y)}{y}$

that is, the error is relative to the true outcome. You will measure the overall relative error of a model using root mean squared relative error:

$rmse_{rel}=\sqrt{\bar{rel^{2}}}$
where $\bar{rel^{2}}$ is the mean of $rel^{2}$.

The example (toy) dataset fdata is loaded in your workspace. It includes the columns:

- y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.

- pred: the predictions of a model that predicts y.

- label: categorical: whether y comes from a population that makes small purchases, or large ones.

You want to know which model does "better": the one predicting the small purchases, or the one predicting large ones.

Instructions

100 XP

- The data frame fdata is in the workspace.

- Fill in the blanks to examine the data. Notice that large purchases tend to be about 100 times larger than small ones.

- Fill in the blanks to create error columns:

- Define residual as pred - y.

- Define relative error as residual / y.

- Fill in the blanks to calculate and compare RMSE and relative RMSE.

- How do the absolute errors compare? The relative errors?

- Examine the plot of predictions versus outcome. In your opinion, which model does "better"?

```{r}
# fdata is in the workspace
fdata<-read.csv("fdata.csv")
summary(fdata)

# Examine the data: generate the summaries for the groups large and small:
fdata %>% 
    group_by(label) %>%     # group by small/large purchases
    summarize(min  = min(y),   # min of y
              mean = mean(y),   # mean of y
              max  = max(y))   # max of y

# Fill in the blanks to add error columns
fdata2 <- fdata %>% 
         group_by(label) %>%       # group by label
           mutate(residual = pred-y,  # Residual
                  relerr   = residual/y)  # Relative error

# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>% 
  group_by(label) %>% 
  summarize(rmse     = sqrt(mean((pred-y)^2)),   # RMSE
            rmse.rel = sqrt(mean(((pred-y)/y)^2)))   # Root mean squared relative error
            
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) + 
  geom_point() + 
  geom_abline() + 
  facet_wrap(~ label, ncol = 1, scales = "free") + 
  ggtitle("Outcome vs prediction")
```
Remark: Notice from this example how a model with larger RMSE might still be better, if relative errors are more important than absolute errors.

## 3.7: Modeling log-transformed monetary output

In this exercise, you will practice modeling on log-transformed monetary output, and then transforming the "log-money" predictions back into monetary units. The data loaded into your workspace records subjects' incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:

- Arith

- Word

- Parag

- Math

- AFQT (Percentile on the Armed Forces Qualifying Test)

The data have already been split into training and test sets (income_train and income_test respectively) and are in the workspace. You will build a model of log(income) from the inputs, and then convert log(income) back into income.

Instructions

100 XP

- Call summary() on income_train$Income2005 to see the summary statistics of income in the training set.

- Write a formula to express log(Income2005) as a function of the five tests as the variable fmla.log. Print it.

- Fit a linear model of log(Income2005) to the income_train data: model.log.

- Use model.log to predict income on the income_test dataset. Put it in the column logpred.

- Check summary() of logpred to see that the magnitudes are much different from those of Income2005.

- Reverse the log transformation to put the predictions into "monetary units": exp(income_test$logpred).

- Check summary() of pred.income and see that the magnitudes are now similar to Income2005 magnitudes.

- Fill in the blanks to plot a scatter plot of predicted income vs income on the test set.

```{r}
# Examine Income2005 in the training set
income_train<-read.csv("income_train.csv")
income_test<-read.csv("income_test.csv")
summary(income_train$Income2005)

# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005)~Arith+Word+Parag+Math+AFQT)

# Fit the linear model
model.log <-  lm(fmla.log,data=income_train)

# Make predictions on income_test
income_test$logpred <- predict(model.log,newdata=income_test)
summary(income_test$logpred)

# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)

#  Plot predicted income (x axis) vs income
ggplot(income_test, aes(x = pred.income, y = Income2005)) + 
  geom_point() + 
  geom_abline(color = "blue")
```

## 3.8: Comparing RMSE and root-mean-squared Relative Error

In this exercise, you will show that log-transforming a monetary output before modeling improves mean relative error (but increases RMSE) compared to modeling the monetary output directly. You will compare the results of model.log from the previous exercise to a model (model.abs) that directly fits income.

The income_train and income_test datasets are loaded in your workspace, along with your model, model.log.

Also in the workspace:

model.abs: a model that directly fits income to the inputs using the formula

Income2005 ~ Arith + Word + Parag + Math + AFQT

Instructions

100 XP

- Fill in the blanks to add predictions from the models to income_test.

- Don’t forget to take the exponent of the predictions from model.log to undo the log transform!

- Fill in the blanks to gather() the predictions and calculate the residuals and relative error.

- Fill in the blanks to calculate the RMSE and relative RMSE for predictions.

- Which model has larger absolute error? Larger relative error?

```{r}
# fmla.abs is in the workspace
fmla.abs<-as.formula(Income2005 ~ Arith + Word + Parag + Math + AFQT)
model.abs<-lm(formula = fmla.abs, data = income_train)
# model.abs is in the workspace
summary(model.abs)

# Add predictions to the test set
income_test <- income_test %>%
  mutate(pred.absmodel = predict(model.abs, income_test),        # predictions from model.abs
         pred.logmodel = exp(predict(model.log, income_test)))   # predictions from model.log

# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>% 
  gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
  mutate(residual = pred-Income2005,   # residuals
         relerr   = residual/Income2005)   # relative error

# Calculate RMSE and relative RMSE and compare
income_long %>% 
  group_by(modeltype) %>%      # group by modeltype
  summarize(rmse     = sqrt(mean(residual^2)),    # RMSE
            rmse.rel = sqrt(mean(relerr^2)))    # Root mean squared relative error
```

## 3.9: Input transforms: the "hockey stick"

In this exercise, we will build a model to predict price from a measure of the house's size (surface area). The data set houseprice has the columns:

- price : house price in units of $1000

- size: surface area

A scatterplot of the data shows that the data is quite non-linear: a sort of "hockey-stick" where price is fairly flat for smaller houses, but rises steeply as the house gets larger. Quadratics and tritics are often good functional forms to express hockey-stick like relationships. Note that there may not be a "physical" reason that price is related to the square of the size; a quadratic is simply a closed form approximation of the observed relationship.

scatterplot

You will fit a model to predict price as a function of the squared size, and look at its fit on the training data.

Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.

exampleFormula = y ~ I(x^2)

Instructions

100 XP

- The data set houseprice is in the workspace.

- Write a formula, fmla_sqr, to express price as a function of squared size. Print it.

- Fit a model model_sqr to the data using fmla_sqr

- For comparison, fit a linear model model_lin to the data using the formula price ~ size.

- Fill in the blanks to make predictions from the training data from the two models, 

- gather the predictions into a single column pred

- graphically compare the predictions of the two models to the data. Which fits better?

```{r}
# houseprice is in the workspace
houseprice<-readRDS("houseprice.rds")
summary(houseprice)

# Create the formula for price as a function of squared size
(fmla_sqr <- price~I(size^2))

# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data=houseprice)

# Fit a model of price as a linear function of size
model_lin <- lm(price~size, data=houseprice)

# Make predictions and compare
houseprice %>% 
    mutate(pred_lin = predict(model_lin),       # predictions from linear model
           pred_sqr = predict(model_sqr)) %>%   # predictions from quadratic model 
    gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
    ggplot(aes(x = size)) + 
       geom_point(aes(y = price)) +                   # actual prices
       geom_line(aes(y = pred, color = modeltype)) + # the predictions
       scale_color_brewer(palette = "Dark2")
```

## 3.10: Input transforms: the "hockey stick" Part (2)

In this exercise, we will build a model to predict price from a measure of the house's size (surface area). The data set houseprice has the columns:

- price : house price in units of $1000

- size: surface area

A scatterplot of the data shows that the data is quite non-linear: a sort of "hockey-stick" where price is fairly flat for smaller houses, but rises steeply as the house gets larger. Quadratics and tritics are often good functional forms to express hockey-stick like relationships. Note that there may not be a "physical" reason that price is related to the square of the size; a quadratic is simply a closed form approximation of the observed relationship.

scatterplot

You will fit a model to predict price as a function of the squared size, and look at its fit on the training data.

Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.

exampleFormula = y ~ I(x^2)

Instructions

100 XP

- The data set houseprice is in the workspace.

- Write a formula, fmla_sqr, to express price as a function of squared size. Print it.

- Fit a model model_sqr to the data using fmla_sqr

- For comparison, fit a linear model model_lin to the data using the formula price ~ size.

- Fill in the blanks to make predictions from the training data from the two models 

- gather the predictions into a single column pred 

- graphically compare the predictions of the two models to the data. Which fits better?

```{r}
# houseprice is in the workspace
summary(houseprice)

# fmla_sqr is in the workspace
fmla_sqr

# Create a splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <-  kWayCrossValidation(nrow(houseprice),3,NULL,NULL)

# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train,])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}

# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
  houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}

# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
  mutate(residuals = pred-price)

# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>% # group by modeltype
  summarize(rmse = sqrt(mean(residuals^2)))
```

# Chapter 4: Dealing with Non-Linear Responses

Now that we have mastered linear models, we will begin to look at techniques for modeling situations that don't meet the assumptions of linearity. This includes predicting probabilities and frequencies (values bounded between 0 and 1); predicting counts (nonnegative integer values, and associated rates); and responses that have a non-linear but additive relationship to the inputs. These algorithms are variations on the standard linear model.

## 4.1: Fit a model of sparrow survival probability

In this exercise, you will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow. The dataset sparrow is loaded into your workspace. The outcome to be predicted is status ("Survived", "Perished"). The variables we will consider are:

- total_length: length of the bird from tip of beak to tip of tail (mm)

- weight: in grams

- humerus : length of humerus ("upper arm bone" that connects the wing to the body) (inches)

Remember that when using glm() to create a logistic regression model, you must explicitly specify that family = binomial:

glm(formula, data = data, family = binomial)

You will call summary(), broom::glance() to see different functions for examining a logistic regression model. One of the diagnostics that you will look at is the analog to R2, called pseudo-R2.

pseudoR2=1−deviancenull.deviance

You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. The pseudo-R2 is analogous to R2 for standard regression: R2 is a measure of the "variance explained" of a regression model. The pseudo-R2 is a measure of the "deviance explained".

Instructions

100 XP

- The data frame sparrow and the package broom are loaded in the workspace.

- As suggested in the video, you will predict on the outcomes TRUE and FALSE. Create a new column survived in the sparrow data frame that is TRUE when status == "Survived".

- Create the formula fmla that expresses survived as a function of the variables of interest. Print it.

- Fit a logistic regression model to predict the probability of sparrow survival. Assign the model to the variable sparrow_model.

- Call summary() to see the coefficients of the model, the deviance and the null deviance.

- Call glance() on the model to see the deviances and other diagnostics in a data frame. Assign the output from glance() to the variable perf.

- Calculate the pseudo-R2.

```{r}
# sparrow is in the workspace
sparrow<-readRDS("sparrow.rds")
summary(sparrow)

# Create the survived column
sparrow$survived <- sparrow$status=="Survived"

# Create the formula
(fmla <- survived~total_length+weight+humerus)

# Fit the logistic regression model
sparrow_model <- glm(fmla,data=sparrow, family="binomial")

# Call summary
summary(sparrow_model)

# Call glance
(perf <- glance(sparrow_model))

# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
```
## 4.2: Predict sparrow survival

In this exercise you will predict the probability of survival using the sparrow survival model from the previous exercise.

Recall that when calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:

predict(model, type = "response")
Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.

You will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model's gain curve is close to the ideal ("wizard") gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:

- frame: data frame with prediction column and ground truth column

- xvar: the name of the column of predictions (as a string)

- truthVar: the name of the column with actual outcome (as a string)

- title: a title for the plot (as a string)

- GainCurvePlot(frame, xvar, truthVar, title)

Instructions

100 XP

- The dataframe sparrow and the model sparrow_model are in the workspace.

- Create a new column in sparrow called pred that contains the predictions on the training data.

- Call GainCurvePlot() to create the gain curve of predictions. Does the model do a good job of sorting the sparrows by whether or not they actually survived?
```{r}
# sparrow is in the workspace
summary(sparrow)

# sparrow_model is in the workspace
summary(sparrow_model)

# Make predictions
sparrow$pred <- predict(sparrow_model,type="response")

# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")
```
Remark: You see from the gain curve that the model follows the wizard curve for about the first 30% of the data, identifying about 45% of the surviving sparrows with only a few false positives.

##  4.3: Poisson or quasipoisson

One of the assumptions of Poisson regression to predict counts is that the event you are counting is Poisson distributed: the average count per unit time is the same as the variance of the count. In practice, "the same" means that the mean and the variance should be of a similar order of magnitude.

When the variance is much larger than the mean, the Poisson assumption doesn't apply, and one solution is to use quasipoisson regression, which does not assume that variance=mean.

For each of the following situations, decide if poisson regression would be suitable, or if you should use quasipoisson regression.

For which situations can you use poisson regression?

1. Number of days students are absent: mean 5.9, variance 49

2. Number of awards a student wins: mean 0.6, variance 1.1

3. Number of hits per website page: mean 108.2, variance 108.5

4. Number of bikes rented per day: mean 273, variance 45863.84

Answer the question

50 XP

Possible Answers

1. All of them


2. 1 and 4

3. 2 and 3


4. 1 and 3

5. 2 and 4


## 4.4: Fit a model to predict bike rental counts

In this exercise you will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

The data frame has the columns:

- cnt: the number of bikes rented in that hour (the outcome)

- hr: the hour of the day (0-23, as a factor)

- holiday: TRUE/FALSE

- workingday: TRUE if neither a holiday nor a weekend, else FALSE

- weathersit: categorical, "Clear to partly cloudy"/"Light Precipitation"/"Misty"

- temp: normalized temperature in Celsius

- atemp: normalized "feeling" temperature in Celsius

- hum: normalized humidity

- windspeed: normalized windspeed

- instant: the time index -- number of hours since beginning of data set (not a variable)
mnth and yr: month and year indices (not variables)

Remember that you must specify family = poisson or family = quasipoisson when using glm() to fit a count model.

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in variables, and use paste() to assemble a string representing the model formula.

Instructions

100 XP

- The data frame bikesJuly is in the workspace. The names of the outcome variable and the input variables are also in the workspace as the variables outcome and vars respectively.

- Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.

- Calculate the mean (mean()) and variance (var()) of bikesJuly$cnt.

- Should you use poisson or quasipoisson regression?

- Use glm() to fit a model to the bikesJuly data: bike_model.

- Use glance() to look at the model's fit statistics. Assign the output of glance() to the variable perf.

- Calculate the pseudo-R-squared of the model.

```{r}
bikesJuly<-read.csv("bikesJuly.csv")
# bikesJuly is in the workspace
str(bikesJuly)
# The outcome column
outcome<-c("cnt")

# The inputs to use
vars<-c("hr","holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))

# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
(var_bikes <- var(bikesJuly$cnt))

# Fit the model
bike_model <- glm(fmla,data=bikesJuly,family=quasipoisson)

# Call glance
(perf <- glance(bike_model))

# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
```

## 4.5: Predict bike rentals on new data

In this exercise you will use the model you built in the previous exercise to make predictions for the month of August. The data set bikesAugust has the same columns as bikesJuly.

Recall that you must specify type = "response" with predict() when predicting counts from a glm poisson or quasipoisson model.

Instructions

100 XP

- The model bike_model and the data frame bikesAugust are in the workspace.

- Use predict to predict the number of bikes per hour on the bikesAugust data. Assign the predictions to the column bikesAugust$pred.

- Fill in the blanks to get the RMSE of the predictions on the August data.

- Fill in the blanks to generate the plot of predictions to actual counts.

- Do any of the predictions appear negative?

```{r}
bikesAugust<-read.csv("bikesAugust.csv")
# bikesAugust is in the workspace
str(bikesAugust)

# bike_model is in the workspace
summary(bike_model)

# Make predictions on August data
bikesAugust$pred  <- predict(bike_model,newdata=bikesAugust, type="response")

# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred-cnt) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")
```

## 4.6: Visualize the Bike Rental Predictions

In the previous exercise, you visualized the bike model's predictions using the standard "outcome vs. prediction" scatter plot. Since the bike rental data is time series data, you might be interested in how the model performs as a function of time. In this exercise, you will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August.

To create the plot you will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust in a single column. gather() takes as arguments:

- The "wide" data frame to be gathered (implicit in a pipe)

- The name of the key column to be created - contains the names of the gathered columns.

- The name of the value column to be created - contains the values of the gathered columns.

- The names of the columns to be gathered into a single column.

You'll use the gathered data frame to compare the actual and predicted rental counts as a function of time. The time index, instant counts the number of observations since the beginning of data collection. The sample code converts the instants to daily units, starting from 0.

Instructions

100 XP

- The data frame bikesAugust, with the predictions (bikesAugust$pred) is in the workspace.

- Fill in the blanks to plot the predictions and actual counts by hour for the first 14 days of August.

- convert instant to be in day units, rather than hour

- gather() the cnt and pred columns into a column called value, with a key called valuetype.

- filter() for the first two weeks of August

- Plot value as a function of instant (day).

- Does the model see the general time patterns in bike rentals?

```{r}
# Plot predictions and cnt by date/time
quasipoisson_plot<-bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restric to first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model")
quasipoisson_plot
```

Remark: This model mostly identifies the slow and busy hours of the day, although it often underestimates peak demand.

## 4.7: Writing formulas for GAM models

When using gam() to model outcome as an additive function of the inputs, you can use the s() function inside formulas to designate that you want use a spline to model the non-linear relationship of a continuous variable to the outcome.

Suppose that you want to predict how much weight (Wtloss) a dieter will lose over a 2-year diet plan as a function of:

- Diet type (categorical)

- Sex (categorical)

- Age at beginning of diet (continuous)

- BMI (body mass index) at beginning of diet (continuous)

You do not want to assume that any of the relationships are linear.

Which is the most appropriate formula?

Answer the question

50 XP

Possible Answers

1. Wtloss ~ Diet + Sex + Age + BMI

2. Wtloss ~ s(Diet) + s(Sex) + s(Age) + s(BMI)

3. Wtloss ~ Diet + Sex + s(Age) + s(BMI) [ans]

Remark: Categorial variables cannot be 'splined', piecewise polynomially modelled

## 4.8: Writing formulas for GAM models (2)

Suppose that in the diet problem from the previous exercise, you now also want to take into account

the dieter's resting metabolic rate (BMR -- continuous) and
the dieter's average number hours of aerobic exercise per day (E -- continuous) at the beginning of the study.

You have reason to believe that the relationship between BMR and weight loss is linear (and you want to model it that way), but not necessarily the relationship between aerobic exercise and weight loss.

Which is the most appropriate formula?

Answer the question

50 XP

Possible Answers

1. Wtloss ~ Diet + Sex + s(Age) + s(BMI) + s(BMR) + s(E)

2. Wtloss ~ Diet + Sex + s(Age) + s(BMI) + BMR + s(E) [ans]

3. Wtloss ~ Diet + Sex + s(Age) + s(BMI) + s(BMR) + E

## 4.9: Model soybean growth with GAM

In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting). As you will see, the soybean plant doesn't grow at a steady rate, but rather has a "growth spurt" that eventually tapers off. Hence, leaf weight is not well described by a linear model.

Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:

y ~ s(x)

Also remember that gam() from the package mgcv has the calling interface

gam(formula, family, data)

For standard regression, use family = gaussian (the default).

The soybean training data, soybean_train is loaded into your workspace. It has two columns: the outcome weight and the variable Time. For comparison, the linear model model.lin, which was fit using the formula weight ~ Time has already been loaded into the workspace as well.

Instructions

100 XP

Instructions

100 XP

- Fill in the blanks to plot weight versus Time (Time on x-axis). Does the relationship look linear?

- Load the package mgcv.

- Create the formula fmla.gam to express weight as a non-linear function of Time. Print it.

- Fit a generalized additive model on soybean_train using fmla.gam.

- Call summary() on the linear model model.lin (already in your workspace). What is the R2?

- Call summary() on 'model.gam. The "deviance explained" reports the model's unadjusted R2. What is the R2?

- Which model appears to be a better fit to the training data?

- Call plot() on model.gam to see the derived relationship between Time and weight.

```{r}
# soybean_train is in the workspace
soybean_train<-read.csv("soybean_train.csv")
summary(soybean_train)

# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point()

# Load the package mgcv
#install.packages("nlme")
#install.packages("mgcv")
#library(nlme)
#library(mgcv)

install.packages('gam')
library(gam)
# Create the formula 
(fmla.gam <- weight~s(Time) )
(fmla.lin <- weight~Time )
# Fit the GAM Model
model.gam <- gam(fmla.gam, data=soybean_train)
model.lin <- gam(fmla.lin, data=soybean_train)
# Call summary() on model.lin and look for R-squared
summary(model.lin)

# Call summary() on model.gam and look for R-squared
summary(model.gam)

# Call plot() on model.gam
plot(model.gam)
```

## 4.10: Predict with the soybean model on test data

In this exercise you will apply the soybean models from the previous exercise (model.lin and model.gam, already in your workspace) to new data: soybean_test.

Instructions

100 XP

- The data frame soybean.test and the models model.lin and model.gam are in the workspace.

- Create a column soybean_test$pred.lin with predictions from the linear model model.lin.

- Create a column soybean_test$pred.gam with predictions from the gam model model.gam.

- For GAM models, the predict() method returns a matrix, so use as.numeric() to convert the matrix to a vector.

- Fill in the blanks to gather() the prediction columns into a single value column pred with key column modeltype. Call the long dataframe soybean_long.

- Calculate and compare the RMSE of both models.

- Which model does better?

- Run the code to compare the predictions of each model against the actual average leaf weights.

- A scatter plot of weight as a function of Time.

- Point-and-line plots of the predictions (pred) as a function of Time.

- Notice that the linear model sometimes predicts negative weights! Does the gam model?
```{r}
#library(mgcv)
# soybean_test is in the workspace
soybean_test<-read.csv("soybean_test.csv")
summary(soybean_test)

# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)

# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))

# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
  gather(key = modeltype, value = pred, pred.lin, pred.gam)

# Calculate the rmse
soybean_long %>%
  mutate(residual = weight - pred) %>%     # residuals
  group_by(modeltype) %>%                  # group by modeltype
  summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE

# Compare the predictions against actual weights on the test data
soybean_long %>%
  ggplot(aes(x = Time)) +                          # the column for the x axis
  geom_point(aes(y = weight)) +                    # the y-column for the scatterplot
  geom_point(aes(y = pred, color = modeltype)) +   # the y-column for the point-and-line plot
  geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
  scale_color_brewer(palette = "Dark2")
  
```
Remark: The GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative.

# Chapter 5: Tree-Based Methods

In this chapter we will look at modeling algorithms that do not assume linearity or additivity, and that can learn limited types of interactions among input variables. These algorithms are *tree-based* methods that work by combining ensembles of *decision trees* that are learned from the training data.

## 5.1: Build a random forest model for bike rentals

In this exercise you will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

You will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:

- formula

- data

- num.trees: the number of trees in the forest.

- respect.unordered.factors : Specifies how to treat unordered factor variables. We recommend setting this to "order" for regression.

- seed: because this is a random algorithm, you will set the seed to get reproducible results

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.

Instructions

100 XP

The data frame bikesJuly is in the workspace. The sample code specifies the names of the outcome and input variables.

- Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.

- Load the package ranger.

- Use ranger to fit a model to the bikesJuly data: bike_model_rf.

- The first argument to ranger() is the formula, fmla.

- Use 500 trees and respect.unordered.factors = "order".

- Set the seed to seed for reproducible results.

- Print the model. What is the R-squared?

```{r}
# bikesJuly is in the workspace
str(bikesJuly)

# Random seed to reproduce results
seed<-423563

# The outcome column
(outcome <- "cnt")

# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))

# Load the package ranger
install.packages("ranger")
library(ranger)

# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula 
                         bikesJuly, # data
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
```

## 5.2: Predict bike rentals with the random forest model

In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.

The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:

predict(model, data)$predictions

Instructions

100 XP

- The model bike_model_rf and the dataset bikesAugust (for evaluation) are loaded into your workspace.

- Call predict() on bikesAugust to predict the number of bikes rented in August (cnt). Add the predictions to bikesAugust as the column pred.

- Fill in the blanks to calculate the root mean squared error of the predictions.

- The poisson model you built for this data gave an RMSE of about 112.6. How does this model compare?

- Fill in the blanks to plot actual bike rental counts (cnt) versus the predictions (pred on x-axis).
```{r}
# bikesAugust is in the workspace
str(bikesAugust)

# bike_model_rf is in the workspace
bike_model_rf

# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt-pred)  %>% # calculate the residual
  summarize(rmse  = sqrt(mean(residual^2)))      # calculate rmse

# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()
```

## 5.3: Visualize random forest bike model predictions

In the previous exercise, you saw that the random forest bike model did better on the August data than the quasiposson model, in terms of RMSE.

In this exercise you will visualize the random forest model's August predictions as a function of time. The corresponding plot from the quasipoisson model that you built in a previous exercise is in the workspace for you to compare.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

Instructions

100 XP

- The data frame bikesAugust (with predictions) is in the workspace. The plot quasipoisson_plot of quasipoisson model predictions as a function of time is also in the workspace.

- Print quasipoisson_plot to review the quasipoisson model's behavior.

- Fill in the blanks to plot the predictions and actual counts by hour for the first 14 days of August.
gather the cnt and pred columns into a column called value, with a key called valuetype.

- Plot value as a function of instant (day).

- How does the random forest model compare?

```{r}
# Print quasipoisson_plot
quasipoisson_plot

# Plot predictions and cnt by date/time
randomforest_plot<-bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot")

randomforest_plot

```

## 5.4: vtreat on a small example

In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").

To design a treatment plan use the function designTreatmentsZ()

- treatplan <- designTreatmentsZ(data, varlist)

- data: the original training data frame

- varlist: a vector of input variables to be treated (as strings).

- designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

scoreFrame <- treatplan %>% 
            magrittr::use_series(scoreFrame) %>% 
            select(varName, origName, code)

- varName: the name of the new treated variable

- origName: the name of the original variable that the treated variable comes from
- code: the type of the new variable.

- "clean": a numerical variable with no NAs or NaNs
- "lev": an indicator variable for a specific level of the original categorical variable.
(magrittr::use_series() is an alias for $ that you can use in pipes.)

For these exercises, we want varName where code is either "clean" or "lev":

newvarlist <- scoreFrame %>% 
             filter(code %in% c("clean", "lev") %>%
             magrittr::use_series(varName)

To transform the data set into all numerical and one-hot-encoded variables, use prepare():

data.treat <- prepare(treatplan, data, varRestrictions = newvarlist)

- treatplan: the treatment plan

- data: the data frame to be treated

- varRestrictions: the variables desired in the treated data

Instructions

100 XP

The data frame dframe and the package magrittr are loaded in the workspace.

1. Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.

2. Create a vector called vars with the names of the input variables (as strings).

3. Load the package vtreat.

4. Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.

5. Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables.

6. You only need the columns varName, origName and code.

7. What are the names of the new indicator variables? Of the continuous variable?

8. Create a vector newvars that contains the variable varName where code is either clean or lev. Print it.

9. Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column).

10.Print it and compare to dframe.

```{r}
dframe<-read.csv("dframe.csv")
# dframe is in the workspace
dframe

# Create and print a vector of variable names
(vars <- c("color", "size"))

# Load the package vtreat
library(vtreat)

# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars)

# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code))

# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName))

# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
```

## 5.5: Novel levels

When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.

vtreat is a "safer" alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).

In this exercise you will see how vtreat handles categorical values that did not appear in the training set. The treatment plan treatplan and the set of variables newvars from the previous exercise are still in your workspace. dframe and a new data frame testframe are also in your workspace.

Instructions

100 XP

- Print dframe and testframe.

- Are there colors in testframe that didn't appear in dframe?

- Call prepare() to create a one-hot-encoded version of testframe (without the outcome). Call it testframe.treat and print it.

- Use the varRestriction argument to restrict to only the variables in newvars.

- How are the yellow rows encoded?

```{r}
summary(treatplan)
newvars

testframe<-read.csv("testframe.csv")
# Print dframe and testframe
dframe
testframe

# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
```

## 5.6: vtreat the bike rental data

In this exercise you will create one-hot-encoded data frames of the July/August bike data, for use with xgboost later on.

The data frames bikesJuly and bikesAugust are in the workspace.

For your convenience, we have defined the variable vars with the list of variable columns for the model.

Instructions

100 XP

- Load the package vtreat.

- Use designTreatmentsZ() to create a treatment plan treatplan for the variables in vars from bikesJuly (the training data).

- Set the flag verbose=FALSE to prevent the function from printing too many messages.

- Fill in the blanks to create a vector newvars that contains only the names of the clean and lev transformed variables. Print it.

- Use prepare() to create a one-hot-encoded training data frame bikesJuly.treat.

- Use the varRestrictions argument to restrict the variables you will use to newvars.

- Use prepare() to create a one-hot-encoded test frame bikesAugust.treat from bikesAugust in the same way.

- Call str() on both prepared test frames to see the structure.

```{r}
# The outcome column
(outcome <- "cnt")

# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))

# Load the package vtreat
library(vtreat)

# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly,vars, verbose = FALSE)

# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
  use_series(scoreFrame) %>%        
  filter(code %in% c("clean","lev")) %>%  # get the rows you care about
  use_series(varName))           # get the varName column


# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly,  varRestriction = newvars)

# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust,  varRestriction = newvars)

# Call str() on the treated data
str(bikesJuly.treat)
str(bikesAugust.treat)
```

## 5.7: Find the right number of trees for a gradient boosting machine

In this exercise you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July.

The July data is loaded into your workspace. Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt.

You will use the xgboost package to fit the random forest model. The function xgb.cv() uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.

For this exercise, the key arguments to the xgb.cv() call are:

- data: a numeric matrix.

- label: vector of outcomes (also numeric).

- nrounds: the maximum number of rounds (trees to build).

- nfold: the number of folds for the cross-validation. 5 is a good number.

- objective: "reg:linear" for continuous outcomes.

- eta: the learning rate.

- max_depth: depth of trees.

- early_stopping_rounds: after this many rounds without improvement, stop.

- verbose: 0 to stay silent.

Instructions

100 XP


The data frames bikesJuly and bikesJuly.treat are in the workspace.

1. Load the package xgboost.

2. Fill in the blanks to run xgb.cv() on the treated training data; assign the output to the variable cv.

3. Use as.matrix() to convert the vtreated data frame to a matrix.

4. Use 100 rounds, and 5-fold cross validation.

5. Set early_stopping_rounds to 10.

6. Set eta to 0.3, max_depth to 6.

7. Get the data frame evaluation_log from cv and assign it to the variable elog. Each row of the evaluation_log corresponds to an additional tree, so the row number tells you the number of trees in the model.

8. Fill in the blanks to get the number of trees with the minimum value of the columns train_rmse_mean and test_rmse_mean.

9. which.min() returns the index of the minimum value in a vector.
How many trees do you need?

```{r}
# The July data is in the workspace
#ls()

# Load the package xgboost
install.packages("xgboost")
library(xgboost)

# Run xgb.cv
cv <- xgb.cv(data = as.matrix(bikesJuly.treat), 
            label = bikesJuly$cnt,
            nrounds = 100,
            nfold = 5,
            objective = "reg:linear",
            eta = 0.3,
            max_depth = 6,
            early_stopping_rounds = 10,
            verbose = 0    # silent
)

# Get the evaluation log 
elog <- cv$evaluation_log

# Determine and print how many trees minimize training and test error
elog %>% 
   summarize(ntrees.train = which.min(train_rmse_mean),   # find the index of min(train_rmse_mean)
             ntrees.test  = which.min(test_rmse_mean) )  # find the index of min(test_rmse_mean)
```

## 5.8: Fit an xgboost bike rental model and predict

In this exercise you will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July and predict on data for the month of August.

The datasets for July and August are loaded into your workspace. Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column).

For convenience, the number of trees to use, ntrees from the previous exercise is in the workspace.

The arguments to xgboost() are similar to those of xgb.cv().

Instructions

100 XP

- The data frames bikesJuly, bikesJuly.treat, bikesAugust and bikesAugust.treat are in the workspace. The number of trees ntrees is in the workspace.


- Fill in the blanks to run xgboost() on the July data. Assign the model to the variable model.

- Use as.matrix() to convert the vtreated data frame to a matrix.

- The objective should be "reg:linear".

- Use ntrees rounds.

- Set eta to 0.3, depth to 6, and verbose to 0 (silent).

- Now call predict() on bikesAugust.treat to predict the number of bikes rented in August.

- Use as.matrix() to convert the vtreat-ed test data into a matrix.

- Add the predictions tobikesAugust as the column pred.

- Fill in the blanks to plot actual bike rental counts versus the predictions (predictions on the x-axis). Do you see a possible problem with the predictions?

```{r}
# Examine the workspace
#ls()

# The number of trees to use, as determined by xgb.cv
#ntrees<-elog$ntrees.train+elog$ntrees.test
ntrees<-84
# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
                   label = bikesJuly$cnt,  # column of outcomes
                   nrounds = ntrees,       # number of trees to build
                   objective = "reg:linear", # objective
                   eta = 0.3,
                   depth = 6,
                   verbose = 0  # silent
)

# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))

# Plot predictions (on x axis) vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()
```

## 5.9: Evaluate the xgboost bike rental model

In this exercise you will evaluate the gradient boosting model bike_model_xgb that you fit in the last exercise, using data from the month of August. You'll compare this model's RMSE for August to the RMSE of previous models that you've built.

The dataset bikesAugust is in the workspace. You have already made predictions using the xgboost model; they are in the column pred.

Instructions

100 XP

- Fill in the blanks to calculate the RMSE of the predictions.

- How does it compare to the RMSE from the poisson model (approx. 112.6) and the random forest model (approx. 96.7)?

```{r}
# bikesAugust is in the workspace
str(bikesAugust)

# Calculate RMSE
bikesAugust %>%
  mutate(residuals = cnt - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
```


## 5.10: Visualize the xgboost bike rental model

You've now seen three different ways to model the bike rental data. For this example, you've seen that the gradient boosting model had the smallest RMSE. To finish up the course, let's compare the gradient boosting model's predictions to the other two models as a function of time.

On completing this exercise, you will have completed the course. Congratulations! Now you have the tools to apply a variety of approaches to your regression tasks.

Instructions

100 XP

- The data frame bikesAugust with predictions, is in the workspace. The plots quasipoisson_plot and randomforest_plot are also in the workspace.

- Print quasipoisson_plot to review the quasipoisson model's behavior.

- Print randomforest_plot to review the random forest model's behavior.

- Fill in the blanks to plot the gradient boosting predictions and actual counts by hour for the first 14 days of August.

- gather() the cnt and pred columns into a column called value, with a key called valuetype.

- Plot value as a function of instant (day).
How does the gradient boosting model compare to the previous models?
```{r}
# Print quasipoisson_plot
quasipoisson_plot

# Print randomforest_plot
randomforest_plot

# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Gradient Boosting model")
```
Remark: The gradient boosting pattern captures rental variations due to time of day and other factors better than the previous models.

