Regression

library(broom)
library(sigr)

1 Simple Regression

Create a formula to define a one-variable modeling task, and then fit a linear model to the data. The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.

# unemployment is loaded in the workspace
summary(unemployment)
##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment

# Print it
fmla
## female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data=unemployment)

# Print it
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

1.1 Examining a model

There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().

# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.821         0.805 0.580      50.6 1.97e-5     2  -10.3  26.6  28.3
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."
# Create a one-row data frame with the new male unemployment rate to predict from
newrates <- data.frame(male_unemployment = 5)
str(newrates)
## 'data.frame':    1 obs. of  1 variable:
##  $ male_unemployment: num 5
newrates
##   male_unemployment
## 1                 5
# unemployment is in your workspace
summary(unemployment)
##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900
# newrates is in your workspace
newrates
##   male_unemployment
## 1                 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model, unemployment)

# load the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point ()+
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)
# Print it
pred
##        1 
## 4.906757
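
predict() also accepts a data frame with several new values at once; a minimal sketch (the rates below are illustrative, not part of the exercise):

# Predict female unemployment for several hypothetical male rates
more_rates <- data.frame(male_unemployment = c(4, 5, 6))
predict(unemployment_model, newdata = more_rates)
# approximately 4.21, 4.91, 5.60, given the coefficients above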

2 Multivariate linear regression

We will work with the blood pressure dataset and model blood_pressure as a function of weight and age.

# bloodpressure is in the workspace
summary(bloodpressure)
##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220
# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla
## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349
summary(bloodpressure_model)
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4640 -1.1949 -0.4078  1.8511  2.6981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  30.9941    11.9438   2.595  0.03186 * 
## age           0.8614     0.2482   3.470  0.00844 **
## weight        0.3349     0.1307   2.563  0.03351 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9711 
## F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07

Make predictions using the blood pressure model bloodpressure_model and compare the predictions to the outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:

ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() + geom_abline(color = "blue")

# bloodpressure is in your workspace
summary(bloodpressure)
##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220
# bloodpressure_model is in your workspace
bloodpressure_model
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349
# Predict blood pressure using bloodpressure_model: prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)

# plot the results
ggplot(bloodpressure, aes(x=prediction, y=blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")

3 Evaluating a model graphically

3.1 Graphically evaluate the unemployment model

Graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment. Plot the model’s predictions against the actual female_unemployment; recall the command is of the form

ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() + geom_abline()

Then you will calculate the residuals:

residuals <- actual outcome - predicted outcome

# unemployment, unemployment_model are in the workspace
summary(unemployment)
##  male_unemployment female_unemployment   prediction   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, unemployment)

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) + 
  geom_point() + 
  geom_abline()

# From previous step
unemployment$predictions <- predict(unemployment_model)

# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment-unemployment$predictions

# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

3.2 The gain curve to evaluate the unemployment model

Plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function. When the predictions sort the outcomes well, the model’s gain curve stays close to the ideal “wizard” curve and the relative Gini coefficient shown on the plot is close to 1.

# unemployment is in the workspace (with predictions)
summary(unemployment)
##  male_unemployment female_unemployment   prediction     predictions   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601   Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240  
##    residuals       
##  Min.   :-0.77621  
##  1st Qu.:-0.34050  
##  Median :-0.09004  
##  Mean   : 0.00000  
##  3rd Qu.: 0.27911  
##  Max.   : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Load the package WVPlots
library(WVPlots)
## Warning: package 'WVPlots' was built under R version 3.4.4
# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")


3.3 Root Mean Squared Error (RMSE)

\(RMSE = \sqrt{mean(res^2)}\)

We want the RMSE to be small. One heuristic is to compare the RMSE to the standard deviation of the outcome; with a good model, the RMSE should be smaller.

# unemployment is in the workspace
summary(unemployment)
##  male_unemployment female_unemployment   prediction     predictions   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601   Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240  
##    residuals       
##  Min.   :-0.77621  
##  1st Qu.:-0.34050  
##  Median :-0.09004  
##  Mean   : 0.00000  
##  3rd Qu.: 0.27911  
##  Max.   : 1.31254
# For convenience put the residuals in the variable res
res <- unemployment$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271

An RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.
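
To make the heuristic concrete, look at the ratio of the two quantities computed above (a quick sketch using the rmse and sd_unemployment values already in the workspace):

# Ratio of RMSE to the outcome's standard deviation;
# values well below 1 mean the model does much better than predicting the mean
rmse / sd_unemployment
# about 0.406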

3.4 R-Squared

R-squared measures how much of the variance in the outcome the model explains.

Suppose \(y\) is the true outcome, \(p\) is the prediction from the model, and \(res = y - p\) are the residuals of the predictions.

Then the total sum of squares tss (“total variance”) of the data is: \(tss = \sum{(y - \bar{y})^2}\), where \(\bar{y}\) is the mean of \(y\).

The residual sum of squared errors (rss) is: \(rss = \sum{res^2}\)

R-squared is then: \(R^2 = 1 - rss/tss\)

# unemployment is in your workspace
summary(unemployment)
##  male_unemployment female_unemployment   prediction     predictions   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601   Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240  
##    residuals       
##  Min.   :-0.77621  
##  1st Qu.:-0.34050  
##  Median :-0.09004  
##  Mean   : 0.00000  
##  3rd Qu.: 0.27911  
##  Max.   : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
## [1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals^2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1-rss/tss)
## [1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157

3.5 Correlation and R-squared

The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively:

the outcomes of a regression model that minimizes squared-error (like linear regression) and
the true outcomes of the training data.

# unemployment is in your workspace
summary(unemployment)
##  male_unemployment female_unemployment   prediction     predictions   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601   Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240  
##    residuals       
##  Min.   :-0.77621  
##  1st Qu.:-0.34050  
##  Median :-0.09004  
##  Mean   : 0.00000  
##  3rd Qu.: 0.27911  
##  Max.   : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
## [1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho^2)
## [1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157

3.6 Properly Training a Model

The mpg data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency (cty) from highway fuel efficiency (hwy). You will randomly split the data into:
  • a training set mpg_train (75% of the data);
  • a test set mpg_test (the remaining 25%).

    # mpg is in the workspace
    summary(mpg)
    ##  manufacturer          model               displ            year     
    ##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
    ##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
    ##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
    ##                                        Mean   :3.472   Mean   :2004  
    ##                                        3rd Qu.:4.600   3rd Qu.:2008  
    ##                                        Max.   :7.000   Max.   :2008  
    ##       cyl           trans               drv                 cty       
    ##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
    ##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
    ##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
    ##  Mean   :5.889                                         Mean   :16.86  
    ##  3rd Qu.:8.000                                         3rd Qu.:19.00  
    ##  Max.   :8.000                                         Max.   :35.00  
    ##       hwy             fl               class          
    ##  Min.   :12.00   Length:234         Length:234        
    ##  1st Qu.:18.00   Class :character   Class :character  
    ##  Median :24.00   Mode  :character   Mode  :character  
    ##  Mean   :23.44                                        
    ##  3rd Qu.:27.00                                        
    ##  Max.   :44.00
    dim(mpg)
    ## [1] 234  11
    # Use nrow to get the number of rows in mpg (N) and print it
    (N <- nrow(mpg))
    ## [1] 234
    # Calculate how many rows 75% of N should be and print it
    # Hint: use round() to get an integer
    (target <- round(0.75*N))
    ## [1] 176
    # Create the vector of N uniform random variables: gp
    gp <- runif(N)
    
    # Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
    mpg_train <- mpg[gp < 0.75, ]
    mpg_test  <- mpg[gp >= 0.75, ]
    
    # Use nrow() to examine mpg_train and mpg_test
    nrow(mpg_train)
    ## [1] 164
    nrow(mpg_test)
    ## [1] 70
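
Note that because gp is random, the realized split rarely matches the 176-row target exactly (here mpg_train has 164 rows). If an exact 75/25 split were wanted, one alternative sketch is to sample row indices directly (the seed is purely illustrative):

# Draw exactly `target` training rows by sampling indices without replacement
set.seed(1234)
idx <- sample.int(N, size = target)
mpg_train_exact <- mpg[idx, ]
mpg_test_exact  <- mpg[-idx, ]
nrow(mpg_train_exact)  # exactly 176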

3.7 Train a model using test/train split

Use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).

# mpg_train is in the workspace
summary(mpg_train)
##  manufacturer          model               displ            year     
##  Length:164         Length:164         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.100   Median :1999  
##                                        Mean   :3.455   Mean   :2003  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:164         Length:164         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.841                                         Mean   :16.99  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:164         Length:164        
##  1st Qu.:17.00   Class :character   Class :character  
##  Median :25.00   Mode  :character   Mode  :character  
##  Mean   :23.62                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty~hwy)
## cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, data = mpg_train)

# Use summary() to examine the model
summary(mpg_model)
## 
## Call:
## lm(formula = fmla, data = mpg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9180 -0.8490 -0.0554  0.7570  4.6817 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.04216    0.41296   2.524   0.0126 *  
## hwy          0.67503    0.01692  39.891   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.328 on 162 degrees of freedom
## Multiple R-squared:  0.9076, Adjusted R-squared:  0.907 
## F-statistic:  1591 on 1 and 162 DF,  p-value: < 2.2e-16

3.8 Evaluate a model using test/train split

Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience:

rmse(predcol, ycol)
r_squared(predcol, ycol)

where:

predcol: the predicted values
ycol: the actual outcome

Generally, model performance is better on the training data than on the test data (though sometimes the test set “gets lucky”).

# Root mean squared error of predictions
rmse <- function(predcol, ycol) {
  res <- predcol - ycol           # residuals
  sqrt(mean(res^2))
}

# R-squared: 1 - rss/tss
r_squared <- function(predcol, ycol) {
  tss <- sum((ycol - mean(ycol))^2)   # total sum of squares
  rss <- sum((predcol - ycol)^2)      # residual sum of squares
  1 - rss/tss
}
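
As a quick sanity check, applying these helpers to the unemployment predictions from earlier should reproduce the values computed by hand (assuming unemployment is still in the workspace):

rmse(unemployment$predictions, unemployment$female_unemployment)
# 0.5337612, matching the earlier RMSE calculation
r_squared(unemployment$predictions, unemployment$female_unemployment)
# 0.8213157, matching R-squared from summary() and glance()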

# Examine the objects in the workspace
ls.str()
## Bikes :  chr [1:4] "bikesJuly" "bikesAugust" "bikesJuly" "bikesAugust"
## bikesAugust : 'data.frame':  744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
## bikesJuly : 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
## bloodpressure : 'data.frame':    11 obs. of  4 variables:
##  $ blood_pressure: int  132 143 153 162 154 168 137 149 159 128 ...
##  $ age           : int  52 59 67 73 64 74 54 61 65 46 ...
##  $ weight        : int  173 184 194 211 196 220 188 188 207 167 ...
##  $ prediction    : num  134 143 154 165 152 ...
## bloodpressure_model : List of 12
##  $ coefficients : Named num [1:3] 30.994 0.861 0.335
##  $ residuals    : Named num [1:11] -1.718 -0.432 -0.672 -2.533 2.243 ...
##  $ effects      : Named num [1:11] -497.8 42.17 5.94 -2.11 2.65 ...
##  $ rank         : int 3
##  $ fitted.values: Named num [1:11] 134 143 154 165 152 ...
##  $ assign       : int [1:3] 0 1 2
##  $ qr           :List of 5
##  $ df.residual  : int 8
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = fmla, data = bloodpressure)
##  $ terms        :Classes 'terms', 'formula'  language blood_pressure ~ age + weight
##  $ model        :'data.frame':   11 obs. of  3 variables:
## cricket : 'data.frame':  15 obs. of  2 variables:
##  $ chirps_per_sec: num  20 16 19.8 18.4 17.1 ...
##  $ temperature   : num  88.6 71.6 93.3 84.3 80.6 ...
## fe_mean :  num 5.57
## fmla : Class 'formula'  language cty ~ hwy
## gp :  num [1:234] 0.1173 0.2835 0.2417 0.5256 0.0908 ...
## houseprice : 'data.frame':   40 obs. of  2 variables:
##  $ size : int  72 98 92 90 44 46 90 150 94 90 ...
##  $ price: int  156 153 230 152 42 157 113 573 202 261 ...
## Income :  chr [1:2] "incometrain" "incometest"
## incometest : 'data.frame':   515 obs. of  7 variables:
##  $ Subject   : int  7 13 47 62 73 78 89 105 106 107 ...
##  $ Arith     : int  14 30 26 12 18 25 24 28 30 19 ...
##  $ Word      : int  27 29 33 25 34 35 34 32 35 31 ...
##  $ Parag     : int  8 13 13 10 13 14 15 14 12 11 ...
##  $ Math      : int  11 24 16 10 18 21 17 20 23 16 ...
##  $ AFQT      : num  47.4 72.3 75.5 36.4 81.5 ...
##  $ Income2005: int  19000 8000 66309 30000 186135 14657 62000 40000 105000 43000 ...
## incometrain : 'data.frame':  2069 obs. of  7 variables:
##  $ Subject   : int  2 6 8 9 16 17 18 20 21 22 ...
##  $ Arith     : int  8 30 13 21 17 29 30 17 29 27 ...
##  $ Word      : int  15 35 35 28 30 33 35 28 33 31 ...
##  $ Parag     : int  6 15 12 10 12 13 14 14 13 14 ...
##  $ Math      : int  6 23 4 13 17 21 23 20 25 22 ...
##  $ AFQT      : num  6.84 99.39 44.02 59.68 50.28 ...
##  $ Income2005: int  5500 65000 36000 65000 71000 43000 120000 64000 253043 45300 ...
## Mpg :  chr [1:2] "mpgtrain" "mpgtest"
## mpg_model : List of 12
##  $ coefficients : Named num [1:2] 1.042 0.675
##  $ residuals    : Named num [1:164] -2.618 0.382 -1.968 -0.293 -2.593 ...
##  $ effects      : Named num [1:164] -217.55 -52.974 -1.826 -0.145 -2.417 ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:164] 20.6 20.6 22 21.3 18.6 ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##  $ df.residual  : int 162
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = fmla, data = mpg_train)
##  $ terms        :Classes 'terms', 'formula'  language cty ~ hwy
##  $ model        :'data.frame':   164 obs. of  2 variables:
## mpg_test : Classes 'tbl_df', 'tbl' and 'data.frame': 70 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "chevrolet" "chevrolet" "chevrolet" ...
##  $ model       : chr  "a6 quattro" "c1500 suburban 2wd" "corvette" "k1500 tahoe 4wd" ...
##  $ displ       : num  4.2 5.3 5.7 5.3 2.4 3.6 3 3.3 3.3 3.8 ...
##  $ year        : int  2008 2008 1999 2008 2008 2008 1999 1999 2008 1999 ...
##  $ cyl         : int  8 8 8 8 4 6 6 6 6 6 ...
##  $ trans       : chr  "auto(s6)" "auto(l4)" "auto(l4)" "auto(l4)" ...
##  $ drv         : chr  "4" "r" "r" "4" ...
##  $ cty         : int  16 14 15 11 22 17 17 16 17 15 ...
##  $ hwy         : int  23 20 23 14 30 26 24 22 24 22 ...
##  $ fl          : chr  "p" "r" "p" "e" ...
##  $ class       : chr  "midsize" "suv" "2seater" "suv" ...
## mpg_train : Classes 'tbl_df', 'tbl' and 'data.frame':    164 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...
## mpgtest : Classes 'tbl_df', 'tbl' and 'data.frame':  54 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4 quattro" "a4 quattro" ...
##  $ displ       : num  2 3.1 1.8 2 4.2 5.3 6.5 2.4 3.7 5.2 ...
##  $ year        : int  2008 2008 1999 2008 2008 2008 1999 1999 2008 1999 ...
##  $ cyl         : int  4 6 4 4 8 8 8 4 6 8 ...
##  $ trans       : chr  "auto(av)" "auto(av)" "auto(l5)" "auto(s6)" ...
##  $ drv         : chr  "f" "f" "4" "4" ...
##  $ cty         : int  21 18 16 19 16 11 14 19 15 11 ...
##  $ hwy         : int  30 27 25 27 23 14 17 27 19 17 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...
## mpgtrain : Classes 'tbl_df', 'tbl' and 'data.frame': 180 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2.8 2.8 1.8 2 2.8 2.8 3.1 ...
##  $ year        : int  1999 1999 2008 1999 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 6 6 4 4 6 6 6 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(l5)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 16 18 18 20 15 17 17 ...
##  $ hwy         : int  29 29 31 26 26 26 28 25 25 25 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...
## N :  int 234
## newrates : 'data.frame': 1 obs. of  1 variable:
##  $ male_unemployment: num 5
## pred :  Named num 4.91
## r_squared : function (predcol, ycol)  
## res :  num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
## rho :  num 0.906
## rho2 :  num 0.821
## rmse : function (predcol, ycol)  
## rsq :  num 0.821
## rsq_glance :  num 0.821
## rss :  num 3.7
## sd_unemployment :  num 1.31
## Soybean :  chr [1:2] "soybean_train" "soybean_test"
## soybean_test : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':    82 obs. of  5 variables:
##  $ Plot   : Ord.factor w/ 41 levels "1988F4"<"1988F2"<..: 3 3 3 2 2 2 6 6 1 1 ...
##  $ Variety: Factor w/ 2 levels "F","P": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year   : Factor w/ 3 levels "1988","1989",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time   : num  42 56 70 14 42 77 49 63 14 35 ...
##  $ weight : num  3.56 8.71 16.342 0.104 2.93 ...
## soybean_train : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   330 obs. of  5 variables:
##  $ Plot   : Ord.factor w/ 48 levels "1988F4"<"1988F2"<..: 3 3 3 3 3 3 3 2 2 2 ...
##  $ Variety: Factor w/ 2 levels "F","P": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year   : Factor w/ 3 levels "1988","1989",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time   : num  14 21 28 35 49 63 77 21 28 35 ...
##  $ weight : num  0.106 0.261 0.666 2.11 6.23 ...
## sparrow : 'data.frame':  87 obs. of  11 variables:
##  $ status      : Factor w/ 2 levels "Perished","Survived": 2 2 2 2 2 2 2 2 2 2 ...
##  $ age         : chr  "adult" "adult" "adult" "adult" ...
##  $ total_length: int  154 160 155 154 156 161 157 159 158 158 ...
##  $ wingspan    : int  241 252 243 245 247 253 251 247 247 252 ...
##  $ weight      : num  24.5 26.9 26.9 24.3 24.1 26.5 24.6 24.2 23.6 26.2 ...
##  $ beak_head   : num  31.2 30.8 30.6 31.7 31.5 31.8 31.1 31.4 29.8 32 ...
##  $ humerus     : num  0.69 0.74 0.73 0.74 0.71 0.78 0.74 0.73 0.7 0.75 ...
##  $ femur       : num  0.67 0.71 0.7 0.69 0.71 0.74 0.74 0.72 0.67 0.74 ...
##  $ legbone     : num  1.02 1.18 1.15 1.15 1.13 1.14 1.15 1.13 1.08 1.15 ...
##  $ skull       : num  0.59 0.6 0.6 0.58 0.57 0.61 0.61 0.61 0.6 0.61 ...
##  $ sternum     : num  0.83 0.84 0.85 0.84 0.82 0.89 0.86 0.79 0.82 0.86 ...
## target :  num 176
## tss :  num 20.7
## unemployment : 'data.frame': 13 obs. of  5 variables:
##  $ male_unemployment  : num  2.9 6.7 4.9 7.9 9.8 ...
##  $ female_unemployment: num  4 7.4 5 7.2 7.9 ...
##  $ prediction         : num  3.45 6.09 4.84 6.92 8.24 ...
##  $ predictions        : num  3.45 6.09 4.84 6.92 8.24 ...
##  $ residuals          : num  0.552 1.313 0.163 0.279 -0.34 ...
## unemployment_model : List of 12
##  $ coefficients : Named num [1:2] 1.434 0.695
##  $ residuals    : Named num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
##  $ effects      : Named num [1:13] -20.08 -4.126 0.106 -0.264 -1.192 ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:13] 3.45 6.09 4.84 6.92 8.24 ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##  $ df.residual  : int 11
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = fmla, data = unemployment)
##  $ terms        :Classes 'terms', 'formula'  language female_unemployment ~ male_unemployment
##  $ model        :'data.frame':   13 obs. of  2 variables:
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model, mpg_train)

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, mpg_test)

# Evaluate the RMSE on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
## [1] 1.319855
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.060794
# Evaluate the R-squared on both training and test data and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
## [1] 0.9076025
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.929624
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) +
  geom_point() +
  geom_abline()

There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)

where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

train: the indices of dframe that will form the training set
app: the indices of dframe that will form the test (or application) set

# Load the package vtreat
library(vtreat)
## Warning: package 'vtreat' was built under R version 3.4.4
# mpg is in the workspace
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
# Get the number of rows in mpg
nRows <- nrow(mpg)

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)

# Examine the split plan
str(splitPlan)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 3 5 7 8 10 11 12 13 ...
##   ..$ app  : int [1:78] 161 221 148 205 120 191 162 32 122 31 ...
##  $ :List of 2
##   ..$ train: int [1:156] 3 4 6 7 9 12 13 15 16 18 ...
##   ..$ app  : int [1:78] 17 80 193 146 77 123 126 131 76 214 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 4 5 6 8 9 10 11 14 ...
##   ..$ app  : int [1:78] 228 104 20 75 28 57 16 198 229 7 ...
##  - attr(*, "splitmethod")= chr "kwaycross"
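
A useful property of the plan: across the folds, every row index appears exactly once in an app (application) set, so each row receives exactly one cross-validation prediction. A small sketch to verify this:

# Collect the application indices from every fold and check they cover all rows
app_rows <- sort(unlist(lapply(splitPlan, function(s) s$app)))
all(app_rows == 1:nRows)  # TRUE: each row is held out exactly once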

3.9 Evaluate a modeling procedure using n-fold cross-validation

Use the 3-fold cross validation plan to make predictions from a model that predicts mpg$cty from mpg$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows: initialize a column of the appropriate length with dframe$pred.cv <- 0, then loop over the folds, fitting a model on each fold’s train indices and filling in predictions for its app indices. In the code below, k is the number of folds and splitPlan is the cross validation plan.

# mpg is in the workspace
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
# splitPlan is in the workspace
str(splitPlan)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 3 5 7 8 10 11 12 13 ...
##   ..$ app  : int [1:78] 161 221 148 205 120 191 162 32 122 31 ...
##  $ :List of 2
##   ..$ train: int [1:156] 3 4 6 7 9 12 13 15 16 18 ...
##   ..$ app  : int [1:78] 17 80 193 146 77 123 126 131 76 214 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 4 5 6 8 9 10 11 14 ...
##   ..$ app  : int [1:78] 228 104 20 75 28 57 16 198 229 7 ...
##  - attr(*, "splitmethod")= chr "kwaycross"
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.275298

4 Categorical inputs

4.1 Examining the structure of categorical inputs

Use model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling.

flowers <- read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/flowers.csv", sep = ";")
# Call str on flowers to see the types of each column
str(flowers)
## 'data.frame':    24 obs. of  3 variables:
##  $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
##  $ Time     : Factor w/ 2 levels "Early","Late": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
## [1] Late  Early
## Levels: Early Late
# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
## Flowers ~ Intensity + Time
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, flowers)

# Examine the first 20 lines of flowers
head(flowers,20)
##    Flowers  Time Intensity
## 1     62.3  Late       150
## 2     77.4  Late       150
## 3     55.3  Late       300
## 4     54.2  Late       300
## 5     49.6  Late       450
## 6     61.9  Late       450
## 7     39.4  Late       600
## 8     45.7  Late       600
## 9     31.3  Late       750
## 10    44.9  Late       750
## 11    36.8  Late       900
## 12    41.9  Late       900
## 13    77.8 Early       150
## 14    75.6 Early       150
## 15    69.1 Early       300
## 16    78.0 Early       300
## 17    57.0 Early       450
## 18    71.1 Early       450
## 19    62.9 Early       600
## 20    52.2 Early       600
# Examine the first 20 lines of mmat
head(mmat,20)
##    (Intercept) Intensity TimeLate
## 1            1       150        1
## 2            1       150        1
## 3            1       300        1
## 4            1       300        1
## 5            1       450        1
## 6            1       450        1
## 7            1       600        1
## 8            1       600        1
## 9            1       750        1
## 10           1       750        1
## 11           1       900        1
## 12           1       900        1
## 13           1       150        0
## 14           1       150        0
## 15           1       300        0
## 16           1       300        0
## 17           1       450        0
## 18           1       450        0
## 19           1       600        0
## 20           1       600        0
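
Note that there is no TimeEarly column: with R's default treatment contrasts, the first level of a factor (here Early) becomes the reference level, so TimeLate = 1 encodes Late and TimeLate = 0 encodes Early. A small sketch showing where the reference level comes from:

# The first factor level is the reference level in the model matrix
levels(flowers$Time)
# "Early" "Late"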
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, flowers)

# Use summary on mmat to remind yourself of its structure
summary(mmat)
##   (Intercept)   Intensity      TimeLate  
##  Min.   :1    Min.   :150   Min.   :0.0  
##  1st Qu.:1    1st Qu.:300   1st Qu.:0.0  
##  Median :1    Median :525   Median :0.5  
##  Mean   :1    Mean   :525   Mean   :0.5  
##  3rd Qu.:1    3rd Qu.:750   3rd Qu.:1.0  
##  Max.   :1    Max.   :900   Max.   :1.0
# Use summary to examine flower_model 
summary(flower_model)
## 
## Call:
## lm(formula = fmla, data = flowers)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.652 -4.139 -1.558  5.632 12.165 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.464167   3.273772  25.495  < 2e-16 ***
## Intensity    -0.040471   0.005132  -7.886 1.04e-07 ***
## TimeLate    -12.158333   2.629557  -4.624 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:   0.78 
## F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08
# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model, flowers)

# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue") 

4.2 Modeling an interaction (2)

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(vtreat)
library(broom)
library(sigr)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
alcohol <- read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/alcohol.csv", sep = ";")
# Call str on alcohol to see the types of each column
str(alcohol)
## 'data.frame':    32 obs. of  5 variables:
##  $ Subject: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Metabol: num  0.6 0.6 1.5 0.4 0.1 0.2 0.3 0.3 0.4 1 ...
##  $ Gastric: num  1 1.6 1.5 2.2 1.1 1.2 0.9 0.8 1.5 0.9 ...
##  $ Sex    : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol: Factor w/ 2 levels "Alcoholic","Non-alcoholic": 1 1 1 2 2 2 2 2 2 2 ...
# alcohol is in the workspace
summary(alcohol)
##     Subject         Metabol          Gastric          Sex    
##  Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18  
##  1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14  
##  Median :16.50   Median : 1.700   Median :1.600              
##  Mean   :16.50   Mean   : 2.422   Mean   :1.863              
##  3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200              
##  Max.   :32.00   Max.   :12.300   Max.   :5.200              
##           Alcohol  
##  Alcoholic    : 8  
##  Non-alcoholic:24  
##                    
##                    
##                    
## 
# Create the formula with main effects only
(fmla_add <- as.formula('Metabol ~ Gastric + Sex'))
## Metabol ~ Gastric + Sex
# Create the formula with interactions
(fmla_interaction <- as.formula('Metabol ~ Gastric + Gastric:Sex'))
## Metabol ~ Gastric + Gastric:Sex
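
As a reminder of R's formula shorthand: the : operator adds only the interaction term, while * expands to both main effects plus the interaction. A tiny sketch with placeholder variable names (x, z, y are not in the data):

# y ~ x*z is shorthand for y ~ x + z + x:z
attr(terms(y ~ x*z), "term.labels")
# "x" "z" "x:z"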
# Fit the main effects only model
model_add <- lm(fmla_add, data = alcohol)

# Fit the interaction model
model_interaction <- lm(fmla_interaction, alcohol)

# Call summary on both models and compare
summary(model_add)
## 
## Call:
## lm(formula = fmla_add, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2779 -0.6328 -0.0966  0.5783  4.5703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.9466     0.5198  -3.745 0.000796 ***
## Gastric       1.9656     0.2674   7.352 4.24e-08 ***
## SexMale       1.6174     0.5114   3.163 0.003649 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.331 on 29 degrees of freedom
## Multiple R-squared:  0.7654, Adjusted R-squared:  0.7492 
## F-statistic: 47.31 on 2 and 29 DF,  p-value: 7.41e-10
summary(model_interaction)
## 
## Call:
## lm(formula = fmla_interaction, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4656 -0.5091  0.0143  0.5660  4.0668 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.7504     0.5310  -1.413 0.168236    
## Gastric           1.1489     0.3450   3.331 0.002372 ** 
## Gastric:SexMale   1.0422     0.2412   4.321 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.204 on 29 degrees of freedom
## Multiple R-squared:  0.8081, Adjusted R-squared:  0.7948 
## F-statistic: 61.05 on 2 and 29 DF,  p-value: 4.033e-11
# alcohol is in the workspace
summary(alcohol)
##     Subject         Metabol          Gastric          Sex    
##  Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18  
##  1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14  
##  Median :16.50   Median : 1.700   Median :1.600              
##  Mean   :16.50   Mean   : 2.422   Mean   :1.863              
##  3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200              
##  Max.   :32.00   Max.   :12.300   Max.   :5.200              
##           Alcohol  
##  Alcoholic    : 8  
##  Non-alcoholic:24  
##                    
##                    
##                    
## 
# Both the formulae are in the workspace
fmla_add
## Metabol ~ Gastric + Sex
fmla_interaction
## Metabol ~ Gastric + Gastric:Sex
# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}

# Get RMSE
alcohol %>% 
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol-pred) %>%      
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 x 2
##   modeltype         rmse
##   <chr>            <dbl>
## 1 pred_add          1.64
## 2 pred_interaction  1.38

4.3 Relative error

In this exercise you will compare RMSE with root mean squared relative error on data where the outcome ranges over several orders of magnitude.

fdata <- read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/fdata.csv", sep = ";")
# fdata is in the workspace
summary(fdata)
##        y                 pred                      label   
##  Min.   :  -5.894   Min.   :   1.072   large purchases:50  
##  1st Qu.:   5.407   1st Qu.:   6.373   small purchases:50  
##  Median :  57.374   Median :  55.693                       
##  Mean   : 306.204   Mean   : 305.905                       
##  3rd Qu.: 550.903   3rd Qu.: 547.886                       
##  Max.   :1101.619   Max.   :1098.896
library(ggplot2)
library(tidyr)
library(dplyr)
library(vtreat)
library(broom)
library(sigr)
library(tidyr)

# Examine the data: generate the summaries for the groups large and small:
fdata %>% 
  group_by(label) %>%     # group by small/large purchases
  summarize(min  = min(y),   # min of y
            mean = mean(y),   # mean of y
            max  = max(y))  # max of y
## # A tibble: 2 x 4
##   label              min   mean    max
##   <fct>            <dbl>  <dbl>  <dbl>
## 1 large purchases  96.1  606.   1102. 
## 2 small purchases  -5.89   6.48   18.6
# Fill in the blanks to add error columns
fdata2 <- fdata %>% 
  group_by(label) %>%       # group by label
  mutate(residual = pred-y,  # Residual
         relerr   = residual/y)  # Relative error

# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>% 
  group_by(label) %>% 
  summarize(rmse     = sqrt(mean(residual^2)),   # RMSE
            rmse.rel = sqrt(mean(relerr^2)) )  # Root mean squared relative error
## # A tibble: 2 x 3
##   label            rmse rmse.rel
##   <fct>           <dbl>    <dbl>
## 1 large purchases  5.54   0.0147
## 2 small purchases  4.01   1.25
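
The pattern in the table is easiest to see with a tiny worked example: the same absolute error is negligible relative to a large outcome but enormous relative to a small one (the numbers are illustrative):

err <- 4     # an absolute error of roughly the size seen above
err / 1000   # relative error 0.004 on a large purchase
err / 5      # relative error 0.8 on a small purchase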
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) + 
  geom_point() + 
  geom_abline() + 
  facet_wrap(~ label, ncol = 1, scales = "free") + 
  ggtitle("Outcome vs prediction")

4.4 Modeling log-transformed monetary output

income_train <- read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/income_train.csv", sep = ";")
income_test <- read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/income_test.csv", sep = ";")

# income_test is in the workspace
summary(income_test)
##     Subject          Arith            Word           Parag      
##  Min.   :    7   Min.   : 3.00   Min.   : 4.00   Min.   : 0.00  
##  1st Qu.: 1638   1st Qu.:12.00   1st Qu.:22.00   1st Qu.: 9.00  
##  Median : 3303   Median :19.00   Median :28.00   Median :12.00  
##  Mean   : 3467   Mean   :18.33   Mean   :26.19   Mean   :10.99  
##  3rd Qu.: 4690   3rd Qu.:24.00   3rd Qu.:32.00   3rd Qu.:13.00  
##  Max.   :12106   Max.   :30.00   Max.   :35.00   Max.   :15.00  
##       Math            AFQT          Income2005    
##  Min.   : 1.00   Min.   :  0.56   Min.   :  1200  
##  1st Qu.: 9.00   1st Qu.: 30.40   1st Qu.: 22000  
##  Median :13.00   Median : 55.12   Median : 36323  
##  Mean   :13.89   Mean   : 53.09   Mean   : 47500  
##  3rd Qu.:19.00   3rd Qu.: 76.83   3rd Qu.: 60000  
##  Max.   :25.00   Max.   :100.00   Max.   :368302
library(ggplot2)
library(tidyr)
library(dplyr)
library(vtreat)
library(broom)
library(sigr)
library(tidyr)

# Examine Income2005 in the training set
summary(income_train$Income2005)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      63   23000   39000   49894   61500  703637
# Write the formula for log income as a function of the tests and print it
(fmla.log <- as.formula("log(Income2005) ~ Arith + Word + Parag + Math + AFQT"))
## log(Income2005) ~ Arith + Word + Parag + Math + AFQT
# Fit the linear model
model.log <-  lm(fmla.log, income_train)

# Make predictions on income_test
income_test$logpred <- predict(model.log, income_test)
summary(income_test$logpred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.766  10.133  10.423  10.419  10.705  11.006
# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17432   25167   33615   35363   44566   60217
#  Plot predicted income (x axis) vs income
ggplot(income_test, aes(x = pred.income, y = Income2005)) + 
  geom_point() + 
  geom_abline(color = "blue")

fmla.abs <- as.formula("Income2005 ~ Arith + Word + Parag + Math + AFQT")
model.abs <- lm(fmla.abs, data = income_train)

# fmla.abs is in the workspace
fmla.abs
## Income2005 ~ Arith + Word + Parag + Math + AFQT
# model.abs is in the workspace
summary(model.abs)
## 
## Call:
## lm(formula = fmla.abs, data = income_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78728 -24137  -6979  11964 648573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17516.7     6420.1   2.728  0.00642 ** 
## Arith         1552.3      303.4   5.116 3.41e-07 ***
## Word          -132.3      265.0  -0.499  0.61754    
## Parag        -1155.1      618.3  -1.868  0.06189 .  
## Math           725.5      372.0   1.950  0.05127 .  
## AFQT           177.8      144.1   1.234  0.21734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45500 on 2063 degrees of freedom
## Multiple R-squared:  0.1165, Adjusted R-squared:  0.1144 
## F-statistic:  54.4 on 5 and 2063 DF,  p-value: < 2.2e-16
# Add predictions to the test set
income_test <- income_test %>%
  mutate(pred.absmodel = predict(model.abs, income_test),        # predictions from model.abs
         pred.logmodel = exp(predict(model.log, income_test)))   # predictions from model.log

# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>% 
  gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
  mutate(residual = pred-Income2005,   # residuals
         relerr   = residual/Income2005)   # relative error

# Calculate RMSE and relative RMSE and compare
income_long %>% 
  group_by(modeltype) %>%      # group by modeltype
  summarize(rmse     = sqrt(mean(residual^2)),    # RMSE
            rmse.rel = sqrt(mean(relerr^2)))    # Root mean squared relative error
## # A tibble: 2 x 3
##   modeltype       rmse rmse.rel
##   <chr>          <dbl>    <dbl>
## 1 pred.absmodel 37448.     3.18
## 2 pred.logmodel 39235.     2.22

The model trained on absolute income has the lower RMSE, while the model trained on log(income) has the lower relative RMSE: predicting log income reduces errors relative to each subject's income, at the cost of somewhat larger errors in absolute dollars.

4.4 Transforming inputs before modeling

Bikes <- load("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/Bikes.RData")
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
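# Soybean was loaded from an .RData file in the same way as Bikes (the call is
# not shown); like Bikes, it holds the names of the objects that load() restored: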
str(Soybean)
##  chr [1:2] "soybean_train" "soybean_test"
# soybean_train is in the workspace
summary(soybean_train)
##       Plot     Variety   Year          Time           weight       
##  1988F6 : 10   F:161   1988:124   Min.   :14.00   Min.   : 0.0290  
##  1988F7 :  9   P:169   1989:102   1st Qu.:27.00   1st Qu.: 0.6663  
##  1988P1 :  9           1990:104   Median :42.00   Median : 3.5233  
##  1988P8 :  9                      Mean   :43.56   Mean   : 6.1645  
##  1988P2 :  9                      3rd Qu.:56.00   3rd Qu.:10.3808  
##  1988F3 :  8                      Max.   :84.00   Max.   :27.3700  
##  (Other):276
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point() 

# Load the package mgcv
library(mgcv)
## Warning: package 'mgcv' was built under R version 3.4.4
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Soybean
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
# Create the formula 
(fmla.gam <- as.formula(weight ~ s(Time)))
## weight ~ s(Time)
# Fit the GAM model
model.gam <- gam(fmla.gam, data = soybean_train, family = gaussian)

# Fit a linear model for comparison
fmla.lin <- weight ~ Time
model.lin <- lm(fmla.lin, data = soybean_train)

# Call summary() on model.lin and look for R-squared
summary(model.lin)
## 
## Call:
## lm(formula = fmla.lin, data = soybean_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3933 -1.7100 -0.3909  1.9056 11.4381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.559283   0.358527  -18.30   <2e-16 ***
## Time         0.292094   0.007444   39.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.778 on 328 degrees of freedom
## Multiple R-squared:  0.8244, Adjusted R-squared:  0.8238 
## F-statistic:  1540 on 1 and 328 DF,  p-value: < 2.2e-16
# Call summary() on model.gam and look for R-squared
summary(model.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## weight ~ s(Time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.1645     0.1143   53.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Time) 8.495   8.93 338.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.902   Deviance explained = 90.4%
## GCV = 4.4395  Scale est. = 4.3117    n = 330
# Call plot() on model.gam
plot(model.gam)

# soybean_test is in the workspace
summary(soybean_test)
##       Plot    Variety   Year         Time           weight       
##  1988F8 : 4   F:43    1988:32   Min.   :14.00   Min.   : 0.0380  
##  1988P7 : 4   P:39    1989:26   1st Qu.:23.00   1st Qu.: 0.4248  
##  1989F8 : 4           1990:24   Median :41.00   Median : 3.0025  
##  1990F8 : 4                     Mean   :44.09   Mean   : 7.1576  
##  1988F4 : 3                     3rd Qu.:69.00   3rd Qu.:15.0113  
##  1988F2 : 3                     Max.   :84.00   Max.   :30.2717  
##  (Other):60
# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)

# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))
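# (as.numeric() flattens the result; predict() on a gam object can return a
# 1-column array rather than a plain vector)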

# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
  gather(key = modeltype, value = pred, pred.lin, pred.gam)

# Calculate the rmse
soybean_long %>%
  mutate(residual = weight - pred) %>%     # residuals
  group_by(modeltype) %>%                  # group by modeltype
  summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE
## # A tibble: 2 x 2
##   modeltype  rmse
##   <chr>     <dbl>
## 1 pred.gam   2.29
## 2 pred.lin   3.19
# Compare the predictions against actual weights on the test data
soybean_long %>%
  ggplot(aes(x = Time)) +                          # the column for the x axis
  geom_point(aes(y = weight)) +                    # the actual weights as points
  geom_point(aes(y = pred, color = modeltype)) +   # the predictions as points
  geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the predictions as lines
  scale_color_brewer(palette = "Dark2")

For this data, the GAM fits better than the linear model, both by R-squared on the training data and by RMSE on the test data. The GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative.

# houseprice is in the workspace
summary(houseprice)
##       size           price      
##  Min.   : 44.0   Min.   : 42.0  
##  1st Qu.: 73.5   1st Qu.:164.5  
##  Median : 91.0   Median :203.5  
##  Mean   : 94.3   Mean   :249.2  
##  3rd Qu.:118.5   3rd Qu.:287.8  
##  Max.   :150.0   Max.   :573.0
# Create the formula for price as a function of squared size
# (I() is needed because ^ has a special meaning inside a formula)
(fmla_sqr <- as.formula(price~I(size^2)))
## price ~ I(size^2)
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data = houseprice)

# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, data = houseprice)

# Make predictions and compare
houseprice %>% 
  mutate(pred_lin = predict(model_lin,houseprice),       # predictions from linear model
         pred_sqr = predict(model_sqr,houseprice)) %>%   # predictions from quadratic model 
  gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
  ggplot(aes(x = size)) + 
  geom_point(aes(y = price)) +                   # actual prices
  geom_line(aes(y = pred, color = modeltype)) + # the predictions
  scale_color_brewer(palette = "Dark2")

# fmla_sqr is in the workspace
fmla_sqr
## price ~ I(size^2)
# Create a splitting plan for 3-fold cross validation
# (kWayCrossValidation takes nRows and nSplits; the last two arguments are
# ignored by this splitter)
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(houseprice), 3, NULL, NULL)

# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train, ])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app, ])
}

# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
  houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}

# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
  mutate(residuals = pred - price)

# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 x 2
##   modeltype  rmse
##   <chr>     <dbl>
## 1 pred_lin   74.3
## 2 pred_sqr   63.7
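
Cross-validation confirms that the quadratic model gives the lower out-of-sample RMSE (63.7 vs. 74.3). Note that the two loops above differ only in the formula they fit; as a sketch (assuming the split plan structure returned by vtreat::kWayCrossValidation(), where each split carries $train and $app row indices), the repetition can be factored into a helper:

# Cross-validated predictions for any lm formula under a vtreat split plan
cv_predict <- function(fmla, data, splitPlan) {
  pred <- numeric(nrow(data))     # initialize the prediction vector
  for (split in splitPlan) {
    model <- lm(fmla, data = data[split$train, ])
    pred[split$app] <- predict(model, newdata = data[split$app, ])
  }
  pred
}

# Equivalent to the loops above:
# houseprice$pred_lin <- cv_predict(price ~ size, houseprice, splitPlan)
# houseprice$pred_sqr <- cv_predict(fmla_sqr, houseprice, splitPlan)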

5 Logistic regression to predict probabilities

library(WVPlots)
# sparrow is in the workspace
summary(sparrow)
##       status       age             total_length      wingspan    
##  Perished:36   Length:87          Min.   :153.0   Min.   :236.0  
##  Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0  
##                Mode  :character   Median :160.0   Median :247.0  
##                                   Mean   :160.4   Mean   :247.5  
##                                   3rd Qu.:162.5   3rd Qu.:251.0  
##                                   Max.   :167.0   Max.   :256.0  
##      weight       beak_head        humerus           femur       
##  Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
##  1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
##  Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
##  Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
##  3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
##  Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
##     legbone          skull           sternum      
##  Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
##  1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
##  Median :1.130   Median :0.6000   Median :0.8500  
##  Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
##  3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
##  Max.   :1.230   Max.   :0.6400   Max.   :0.9300
# Create the survived column
sparrow$survived <- sparrow$status=="Survived"

# Create the formula
(fmla <- as.formula("survived~total_length+weight+humerus"))
## survived ~ total_length + weight + humerus
# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = binomial)

# Call summary
summary(sparrow_model)
## 
## Call:
## glm(formula = fmla, family = binomial, data = sparrow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1117  -0.6026   0.2871   0.6577   1.7082  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
## total_length  -0.5435     0.1409  -3.858 0.000115 ***
## weight        -0.5689     0.2771  -2.053 0.040060 *  
## humerus       75.4610    19.1586   3.939 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  75.094  on 83  degrees of freedom
## AIC: 83.094
## 
## Number of Fisher Scoring iterations: 5
# Call glance
(perf <- glance(sparrow_model))
## # A tibble: 1 x 7
##   null.deviance df.null logLik   AIC   BIC deviance df.residual
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1          118.      86  -37.5  83.1  93.0     75.1          83
# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
## [1] 0.3636526
# Make predictions
sparrow$pred <- predict(sparrow_model, type="response")
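# (type = "response" returns probabilities; the default "link" scale would
# return log-odds)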


# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")


When looking at the pseudo-R2 of a logistic regression model, you hope to see a value close to 1; this model achieves about 0.36. The gain curve shows that the model follows the wizard curve for roughly the first 30% of the data, identifying about 45% of the surviving sparrows with only a few false positives.
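
As an additional check (not part of the original exercise; the 0.5 classification threshold is an assumption), the predicted probabilities can be thresholded and compared against the actual outcomes:

# Confusion matrix at an (assumed) 0.5 threshold
confmat <- table(pred = sparrow$pred > 0.5, actual = sparrow$survived)
confmat

# Overall accuracy: fraction of sparrows classified correctly
sum(diag(confmat)) / sum(confmat)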

5.1 Poisson and quasipoisson regression to predict counts

# The outcome column
outcome <- "cnt"

# The inputs to use
vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# The outcome column
outcome 
## [1] "cnt"
# The inputs to use
vars 
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
## [1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
## [1] 45863.84
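# The variance (~45864) is far greater than the mean (~274): the counts are
# overdispersed, so quasipoisson is a better choice than poisson here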
# Fit the model
bike_model <- glm(fmla, data = bikesJuly, family = quasipoisson)

# Call glance
(perf <- glance(bike_model))
## # A tibble: 1 x 7
##   null.deviance df.null logLik   AIC   BIC deviance df.residual
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1       133365.     743     NA    NA    NA   28775.         712
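# Note: logLik, AIC, and BIC are NA because quasipoisson models are fit by
# quasi-likelihood rather than maximum likelihood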
# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
## [1] 0.7842393
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# bike_model is in the workspace
summary(bike_model)
## 
## Call:
## glm(formula = fmla, family = quasipoisson, data = bikesJuly)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -21.6117   -4.3121   -0.7223    3.5507   16.5079  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.934986   0.439027  13.519  < 2e-16 ***
## hr1                           -0.580055   0.193354  -3.000 0.002794 ** 
## hr2                           -0.892314   0.215452  -4.142 3.86e-05 ***
## hr3                           -1.662342   0.290658  -5.719 1.58e-08 ***
## hr4                           -2.350204   0.393560  -5.972 3.71e-09 ***
## hr5                           -1.084289   0.230130  -4.712 2.96e-06 ***
## hr6                            0.211945   0.156476   1.354 0.176012    
## hr7                            1.211135   0.132332   9.152  < 2e-16 ***
## hr8                            1.648361   0.127177  12.961  < 2e-16 ***
## hr9                            1.155669   0.133927   8.629  < 2e-16 ***
## hr10                           0.993913   0.137096   7.250 1.09e-12 ***
## hr11                           1.116547   0.136300   8.192 1.19e-15 ***
## hr12                           1.282685   0.134769   9.518  < 2e-16 ***
## hr13                           1.273010   0.135872   9.369  < 2e-16 ***
## hr14                           1.237721   0.136386   9.075  < 2e-16 ***
## hr15                           1.260647   0.136144   9.260  < 2e-16 ***
## hr16                           1.515893   0.132727  11.421  < 2e-16 ***
## hr17                           1.948404   0.128080  15.212  < 2e-16 ***
## hr18                           1.893915   0.127812  14.818  < 2e-16 ***
## hr19                           1.669277   0.128471  12.993  < 2e-16 ***
## hr20                           1.420732   0.131004  10.845  < 2e-16 ***
## hr21                           1.146763   0.134042   8.555  < 2e-16 ***
## hr22                           0.856182   0.138982   6.160 1.21e-09 ***
## hr23                           0.479197   0.148051   3.237 0.001265 ** 
## holidayTRUE                    0.201598   0.079039   2.551 0.010961 *  
## workingdayTRUE                 0.116798   0.033510   3.485 0.000521 ***
## weathersitLight Precipitation -0.214801   0.072699  -2.955 0.003233 ** 
## weathersitMisty               -0.010757   0.038600  -0.279 0.780572    
## temp                          -3.246001   1.148270  -2.827 0.004833 ** 
## atemp                          2.042314   0.953772   2.141 0.032589 *  
## hum                           -0.748557   0.236015  -3.172 0.001581 ** 
## windspeed                      0.003277   0.148814   0.022 0.982439    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 38.98949)
## 
##     Null deviance: 133365  on 743  degrees of freedom
## Residual deviance:  28775  on 712  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
# Make predictions on August data
bikesAugust$pred <- predict(bike_model, newdata = bikesAugust, type = "response")
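# (type = "response" returns predictions on the scale of the counts, not the
# link scale)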

# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred-cnt) %>%
  summarize(rmse  = sqrt(mean(residual^2)))
##       rmse
## 1 112.5815
# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")

# Plot predictions and cnt by date/time
bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restrict to the first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model")

As with logistic regression, a pseudo-R2 close to 1 indicates a good fit; this quasipoisson model achieves about 0.78. (Quasi)poisson models predict non-negative rates, making them useful for count or frequency data. The model mostly identifies the slow and busy hours of the day, although it often underestimates peak demand.
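
To see where the underestimation occurs (a quick sketch, not part of the original exercise; splitting at the median count is an assumption), the RMSE can be computed separately for high-demand and low-demand hours:

# Compare RMSE on high-demand vs. low-demand hours
bikesAugust %>%
  mutate(residual = pred - cnt,
         demand = ifelse(cnt > median(cnt), "high", "low")) %>%
  group_by(demand) %>%
  summarize(rmse = sqrt(mean(residual^2)))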

5.2 Random forests

# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed <- 423563

# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)
## Warning: package 'ranger' was built under R version 3.4.4
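# respect.unordered.factors = "order" has ranger order factor levels by their
# mean response before splitting: a fast, safe way to use categorical inputs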
# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula 
                         bikesJuly, # data
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       8230.568 
## R squared (OOB):                  0.8205434
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  13 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pred      : num  94.96 51.74 37.98 17.58 9.36 ...
# bike_model_rf is in the workspace
bike_model_rf
## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       8230.568 
## R squared (OOB):                  0.8205434
# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt-pred)  %>% # calculate the residual
  summarize(rmse  = sqrt(mean(residual^2)))      # calculate rmse
##       rmse
## 1 97.18347
# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

first_two_weeks <- bikesAugust %>% 
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>% 
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two weeks
  filter(instant < 14) 

# Plot predictions and cnt by date/time 
ggplot(first_two_weeks,aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot")

The random forest model captured the day-to-day variations in peak demand better than the quasipoisson model, but it still underestimates peak demand and overestimates minimum demand, so there is still room for improvement.
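
One way to probe the model further (a sketch, not part of the original exercise; importance = "permutation" is a standard ranger option) is to refit with variable importance enabled and see which inputs drive the predictions:

# Refit with permutation-based variable importance
bike_model_imp <- ranger(fmla, bikesJuly,
                         num.trees = 500,
                         respect.unordered.factors = "order",
                         importance = "permutation",
                         seed = seed)

# Inputs ranked from most to least important
sort(bike_model_imp$variable.importance, decreasing = TRUE)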

5.3 One-Hot-Encoding Categorical Variables

dframe <- read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/dframe.csv", sep = ";")
testframe <- read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/testframe.csv", sep = ";")
str(dframe)
## 'data.frame':    10 obs. of  3 variables:
##  $ color     : Factor w/ 3 levels "b","g","r": 1 3 3 3 3 1 3 2 1 1
##  $ size      : int  13 11 15 14 13 11 9 12 7 12
##  $ popularity: num  1.079 1.396 0.922 1.203 1.084 ...
library(vtreat)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
# dframe is in the workspace
dframe
##    color size popularity
## 1      b   13  1.0785088
## 2      r   11  1.3956245
## 3      r   15  0.9217988
## 4      r   14  1.2025453
## 5      r   13  1.0838662
## 6      b   11  0.8043527
## 7      r    9  1.1035440
## 8      g   12  0.8746332
## 9      b    7  0.6947058
## 10     b   12  0.8832502
# Create and print a vector of variable names
(vars <- c("color","size"))
## [1] "color" "size"
# Load the package vtreat
library(vtreat)
library(dplyr)

# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars, verbose = FALSE)

# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code))
##         varName origName  code
## 1    color_catP    color  catP
## 2          size     size clean
## 3 color_lev_x_b    color   lev
## 4 color_lev_x_g    color   lev
## 5 color_lev_x_r    color   lev
# We only want the rows with codes "clean" (cleaned numeric variables) or
# "lev" (level indicators); "catP" is a category-prevalence variable we won't use
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName))
## [1] "size"          "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"
# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
##    size color_lev_x_b color_lev_x_g color_lev_x_r
## 1    13             1             0             0
## 2    11             0             0             1
## 3    15             0             0             1
## 4    14             0             0             1
## 5    13             0             0             1
## 6    11             1             0             0
## 7     9             0             0             1
## 8    12             0             1             0
## 9     7             1             0             0
## 10   12             1             0             0
# Print dframe and testframe
dframe
##    color size popularity
## 1      b   13  1.0785088
## 2      r   11  1.3956245
## 3      r   15  0.9217988
## 4      r   14  1.2025453
## 5      r   13  1.0838662
## 6      b   11  0.8043527
## 7      r    9  1.1035440
## 8      g   12  0.8746332
## 9      b    7  0.6947058
## 10     b   12  0.8832502
testframe
##    color size popularity
## 1      g    7  0.9733920
## 2      g    8  0.9122529
## 3      y   10  1.4217153
## 4      g   12  1.1905828
## 5      g    6  0.9866464
## 6      y    8  1.3697515
## 7      b   12  1.0959387
## 8      g   12  0.9161547
## 9      g   12  1.0000460
## 10     r    8  1.3137360
# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
##    size color_lev_x_b color_lev_x_g color_lev_x_r
## 1     7             0             1             0
## 2     8             0             1             0
## 3    10             0             0             0
## 4    12             0             1             0
## 5     6             0             1             0
## 6     8             0             0             0
## 7    12             1             0             0
## 8    12             0             1             0
## 9    12             0             1             0
## 10    8             0             0             1
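
Note that testframe contains a color level ("y") that never appears in dframe; prepare() encodes such novel levels as all zeros in the indicator columns (rows 3 and 6 above). A quick sketch of why this matters: base R's model.matrix() derives indicator columns from whatever levels are present in the data it is given, so training and test encodings can disagree:

# model.matrix() produces different column sets for the two frames,
# while prepare() guarantees the layout fixed by the treatment plan
colnames(model.matrix(~ color + size, dframe))
colnames(model.matrix(~ color + size, testframe))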

Applying vtreat to the bike rental data

# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Load the package vtreat
library(vtreat)

# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly, vars, verbose = FALSE)

# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
  use_series(scoreFrame) %>%        
  filter(code %in% c("clean","lev")) %>%  # get the rows you care about
  use_series(varName))           # get the varName column
##  [1] "holiday"                                
##  [2] "workingday"                             
##  [3] "temp"                                   
##  [4] "atemp"                                  
##  [5] "hum"                                    
##  [6] "windspeed"                              
##  [7] "hr_lev_x_0"                             
##  [8] "hr_lev_x_1"                             
##  [9] "hr_lev_x_10"                            
## [10] "hr_lev_x_11"                            
## [11] "hr_lev_x_12"                            
## [12] "hr_lev_x_13"                            
## [13] "hr_lev_x_14"                            
## [14] "hr_lev_x_15"                            
## [15] "hr_lev_x_16"                            
## [16] "hr_lev_x_17"                            
## [17] "hr_lev_x_18"                            
## [18] "hr_lev_x_19"                            
## [19] "hr_lev_x_2"                             
## [20] "hr_lev_x_20"                            
## [21] "hr_lev_x_21"                            
## [22] "hr_lev_x_22"                            
## [23] "hr_lev_x_23"                            
## [24] "hr_lev_x_3"                             
## [25] "hr_lev_x_4"                             
## [26] "hr_lev_x_5"                             
## [27] "hr_lev_x_6"                             
## [28] "hr_lev_x_7"                             
## [29] "hr_lev_x_8"                             
## [30] "hr_lev_x_9"                             
## [31] "weathersit_lev_x_Clear_to_partly_cloudy"
## [32] "weathersit_lev_x_Light_Precipitation"   
## [33] "weathersit_lev_x_Misty"
# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly,  varRestriction = newvars)

# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust,  varRestriction = newvars)

# Call str() on the treated data
str(bikesJuly.treat )
## 'data.frame':    744 obs. of  33 variables:
##  $ holiday                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ temp                                   : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp                                  : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum                                    : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed                              : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ hr_lev_x_0                             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_1                             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_10                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_11                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_12                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_13                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_14                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_15                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_16                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_17                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_18                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_19                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_2                             : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_20                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_21                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_22                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_23                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_3                             : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ hr_lev_x_4                             : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ hr_lev_x_5                             : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ hr_lev_x_6                             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ hr_lev_x_7                             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ hr_lev_x_8                             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ hr_lev_x_9                             : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ weathersit_lev_x_Clear_to_partly_cloudy: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weathersit_lev_x_Light_Precipitation   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit_lev_x_Misty                 : num  0 0 0 0 0 0 0 0 0 0 ...
str(bikesAugust.treat)
## 'data.frame':    744 obs. of  33 variables:
##  $ holiday                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday                             : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ temp                                   : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp                                  : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum                                    : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed                              : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ hr_lev_x_0                             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_1                             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_10                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_11                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_12                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_13                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_14                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_15                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_16                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_17                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_18                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_19                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_2                             : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_20                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_21                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_22                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_23                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x_3                             : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ hr_lev_x_4                             : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ hr_lev_x_5                             : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ hr_lev_x_6                             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ hr_lev_x_7                             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ hr_lev_x_8                             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ hr_lev_x_9                             : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ weathersit_lev_x_Clear_to_partly_cloudy: num  1 1 1 1 0 0 1 0 0 0 ...
##  $ weathersit_lev_x_Light_Precipitation   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit_lev_x_Misty                 : num  0 0 0 0 1 1 0 1 1 1 ...
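
As a final check (a sketch, not part of the original exercise), the treated training and test frames should share an identical column layout, which is what makes them safe to pass to algorithms that require numeric inputs:

# The treatment plan guarantees the same columns in the same order
identical(names(bikesJuly.treat), names(bikesAugust.treat))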