library(broom)
library(sigr)
Create a formula to define a one-variable modeling task, and then fit a linear model to the data.The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.
# unemployment is loaded in the workspace
summary(unemployment)
## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <-female_unemployment~male_unemployment
# Print it
fmla
## female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data=unemployment)
# Print it
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().
# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.821 0.805 0.580 50.6 1.97e-5 2 -10.3 26.6 28.3
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."
newrates<-data.frame("male_unemployment"=5)
str(newrates)
## 'data.frame': 1 obs. of 1 variable:
## $ male_unemployment: num 5
newrates
## male_unemployment
## 1 5
# unemployment is in your workspace
summary(unemployment)
## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# newrates is in your workspace
newrates
## male_unemployment
## 1 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model,unemployment)
# load the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
# Make a plot to compare predictions to actual (prediction on x axis).
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
geom_point ()+
geom_abline(color = "blue")
# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model,newrates)
# Print it
pred
## 1
## 4.906757
will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.
# bloodpressure is in the workspace
summary(bloodpressure)
## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# Create the formula and print it
fmla <- blood_pressure~age+weight
fmla
## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla,data=bloodpressure)
# Print bloodpressure_model and call summary()
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
summary(bloodpressure_model)
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## age 0.8614 0.2482 3.470 0.00844 **
## weight 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
make predictions using the blood pressure model bloodpressure_model. compare the predictions to outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:
ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() + geom_abline(color = “blue”)
# bloodpressure is in your workspace
summary(bloodpressure)
## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# bloodpressure_model is in your workspace
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)
# plot the results
ggplot(bloodpressure, aes(x=prediction, y=blood_pressure)) +
geom_point() +
geom_abline(color = "blue")
graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment. plot the model’s predictions against the actual female_unemployment; recall the command is of the form
ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() +
geom_abline() Then you will calculate the residuals:
residuals <- actual outcome - predicted outcome
# unemployment, unemployment_model are in the workspace
summary(unemployment)
## male_unemployment female_unemployment prediction
## Min. :2.900 Min. :4.000 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, unemployment)
# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) +
geom_point() +
geom_abline()
# From previous step
unemployment$predictions <- predict(unemployment_model)
# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment-unemployment$predictions
# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("residuals vs. linear model prediction")
plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.
# unemployment is in the workspace (with predictions)
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Load the package WVPlots
library(WVPlots)
## Warning: package 'WVPlots' was built under R version 3.4.4
# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")
\(RMSE=\sqrt{mean(res^2)}\)
want RMSE to be small, One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.
# unemployment is in the workspace
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# For convenience put the residuals in the variable res
res <- unemployment$residuals
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271
An RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.
how much variance does the model explain
Suppose \(y\) is the true outcome, \(p\) is the prediction from the model, and \(res=y???p\) are the residuals of the predictions.
Then the total sum of squares tss (“total variance”) of the data is: \(tss=\sum_{(y-\tilde{y})^2}\)
Residual sum of squared errors(rss):
\(rss=\sum_(1-rss/tss)\)
# unemployment is in your workspace
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231
fe_mean
## [1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
## [1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals^2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1-rss/tss)
## [1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157
The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively:
the outcomes of a regression model that minimizes squared-error (like linear regression) and
the true outcomes of the training data.
# unemployment is in your workspace
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions,unemployment$female_unemployment))
## [1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho^2)
## [1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157
training set mpg_train (75% of the data);
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
dim(mpg)
## [1] 234 11
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
## [1] 234
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(0.75*N))
## [1] 176
# Create the vector of N uniform random variables: gp
gp <- runif(N)
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp<0.75,]
mpg_test <- mpg[gp>=0.75,]
# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
## [1] 164
nrow(mpg_test)
## [1] 70
use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).
# mpg_train is in the workspace
summary(mpg_train)
## manufacturer model displ year
## Length:164 Length:164 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.100 Median :1999
## Mean :3.455 Mean :2003
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:164 Length:164 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.841 Mean :16.99
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:164 Length:164
## 1st Qu.:17.00 Class :character Class :character
## Median :25.00 Mode :character Mode :character
## Mean :23.62
## 3rd Qu.:27.00
## Max. :44.00
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty~hwy)
## cty ~ hwy
fmla
## cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy
mpg_model <-lm(fmla, data=mpg_train)
# Use summary() to examine the model
summary(mpg_model)
##
## Call:
## lm(formula = fmla, data = mpg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9180 -0.8490 -0.0554 0.7570 4.6817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.04216 0.41296 2.524 0.0126 *
## hwy 0.67503 0.01692 39.891 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.328 on 162 degrees of freedom
## Multiple R-squared: 0.9076, Adjusted R-squared: 0.907
## F-statistic: 1591 on 1 and 162 DF, p-value: < 2.2e-16
Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience:
rmse(predcol, ycol) r_squared(predcol, ycol) where:
predcol: The predicted values ycol: The actual outcome Generally, model performance is better on the training data than the test data (though sometimes the test set “gets lucky”).
rmse<-function(predcol, ycol) {
res <- predcol - ycol
sqrt(mean(res^2))
}
r_squared<-function(predcol, ycol) {
tss = sum( (ycol - mean(ycol))^2 )
rss = sum( (predcol - ycol)^2 )
1 - rss/tss
}
# Examine the objects in the workspace
ls.str()
## Bikes : chr [1:4] "bikesJuly" "bikesAugust" "bikesJuly" "bikesAugust"
## bikesAugust : 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
## bikesJuly : 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
## bloodpressure : 'data.frame': 11 obs. of 4 variables:
## $ blood_pressure: int 132 143 153 162 154 168 137 149 159 128 ...
## $ age : int 52 59 67 73 64 74 54 61 65 46 ...
## $ weight : int 173 184 194 211 196 220 188 188 207 167 ...
## $ prediction : num 134 143 154 165 152 ...
## bloodpressure_model : List of 12
## $ coefficients : Named num [1:3] 30.994 0.861 0.335
## $ residuals : Named num [1:11] -1.718 -0.432 -0.672 -2.533 2.243 ...
## $ effects : Named num [1:11] -497.8 42.17 5.94 -2.11 2.65 ...
## $ rank : int 3
## $ fitted.values: Named num [1:11] 134 143 154 165 152 ...
## $ assign : int [1:3] 0 1 2
## $ qr :List of 5
## $ df.residual : int 8
## $ xlevels : Named list()
## $ call : language lm(formula = fmla, data = bloodpressure)
## $ terms :Classes 'terms', 'formula' language blood_pressure ~ age + weight
## $ model :'data.frame': 11 obs. of 3 variables:
## cricket : 'data.frame': 15 obs. of 2 variables:
## $ chirps_per_sec: num 20 16 19.8 18.4 17.1 ...
## $ temperature : num 88.6 71.6 93.3 84.3 80.6 ...
## fe_mean : num 5.57
## fmla : Class 'formula' language cty ~ hwy
## gp : num [1:234] 0.1173 0.2835 0.2417 0.5256 0.0908 ...
## houseprice : 'data.frame': 40 obs. of 2 variables:
## $ size : int 72 98 92 90 44 46 90 150 94 90 ...
## $ price: int 156 153 230 152 42 157 113 573 202 261 ...
## Income : chr [1:2] "incometrain" "incometest"
## incometest : 'data.frame': 515 obs. of 7 variables:
## $ Subject : int 7 13 47 62 73 78 89 105 106 107 ...
## $ Arith : int 14 30 26 12 18 25 24 28 30 19 ...
## $ Word : int 27 29 33 25 34 35 34 32 35 31 ...
## $ Parag : int 8 13 13 10 13 14 15 14 12 11 ...
## $ Math : int 11 24 16 10 18 21 17 20 23 16 ...
## $ AFQT : num 47.4 72.3 75.5 36.4 81.5 ...
## $ Income2005: int 19000 8000 66309 30000 186135 14657 62000 40000 105000 43000 ...
## incometrain : 'data.frame': 2069 obs. of 7 variables:
## $ Subject : int 2 6 8 9 16 17 18 20 21 22 ...
## $ Arith : int 8 30 13 21 17 29 30 17 29 27 ...
## $ Word : int 15 35 35 28 30 33 35 28 33 31 ...
## $ Parag : int 6 15 12 10 12 13 14 14 13 14 ...
## $ Math : int 6 23 4 13 17 21 23 20 25 22 ...
## $ AFQT : num 6.84 99.39 44.02 59.68 50.28 ...
## $ Income2005: int 5500 65000 36000 65000 71000 43000 120000 64000 253043 45300 ...
## Mpg : chr [1:2] "mpgtrain" "mpgtest"
## mpg_model : List of 12
## $ coefficients : Named num [1:2] 1.042 0.675
## $ residuals : Named num [1:164] -2.618 0.382 -1.968 -0.293 -2.593 ...
## $ effects : Named num [1:164] -217.55 -52.974 -1.826 -0.145 -2.417 ...
## $ rank : int 2
## $ fitted.values: Named num [1:164] 20.6 20.6 22 21.3 18.6 ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## $ df.residual : int 162
## $ xlevels : Named list()
## $ call : language lm(formula = fmla, data = mpg_train)
## $ terms :Classes 'terms', 'formula' language cty ~ hwy
## $ model :'data.frame': 164 obs. of 2 variables:
## mpg_test : Classes 'tbl_df', 'tbl' and 'data.frame': 70 obs. of 11 variables:
## $ manufacturer: chr "audi" "chevrolet" "chevrolet" "chevrolet" ...
## $ model : chr "a6 quattro" "c1500 suburban 2wd" "corvette" "k1500 tahoe 4wd" ...
## $ displ : num 4.2 5.3 5.7 5.3 2.4 3.6 3 3.3 3.3 3.8 ...
## $ year : int 2008 2008 1999 2008 2008 2008 1999 1999 2008 1999 ...
## $ cyl : int 8 8 8 8 4 6 6 6 6 6 ...
## $ trans : chr "auto(s6)" "auto(l4)" "auto(l4)" "auto(l4)" ...
## $ drv : chr "4" "r" "r" "4" ...
## $ cty : int 16 14 15 11 22 17 17 16 17 15 ...
## $ hwy : int 23 20 23 14 30 26 24 22 24 22 ...
## $ fl : chr "p" "r" "p" "e" ...
## $ class : chr "midsize" "suv" "2seater" "suv" ...
## mpg_train : Classes 'tbl_df', 'tbl' and 'data.frame': 164 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
## mpgtest : Classes 'tbl_df', 'tbl' and 'data.frame': 54 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4 quattro" "a4 quattro" ...
## $ displ : num 2 3.1 1.8 2 4.2 5.3 6.5 2.4 3.7 5.2 ...
## $ year : int 2008 2008 1999 2008 2008 2008 1999 1999 2008 1999 ...
## $ cyl : int 4 6 4 4 8 8 8 4 6 8 ...
## $ trans : chr "auto(av)" "auto(av)" "auto(l5)" "auto(s6)" ...
## $ drv : chr "f" "f" "4" "4" ...
## $ cty : int 21 18 16 19 16 11 14 19 15 11 ...
## $ hwy : int 30 27 25 27 23 14 17 27 19 17 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
## mpgtrain : Classes 'tbl_df', 'tbl' and 'data.frame': 180 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2.8 2.8 1.8 2 2.8 2.8 3.1 ...
## $ year : int 1999 1999 2008 1999 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 6 6 4 4 6 6 6 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(l5)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 16 18 18 20 15 17 17 ...
## $ hwy : int 29 29 31 26 26 26 28 25 25 25 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
## N : int 234
## newrates : 'data.frame': 1 obs. of 1 variable:
## $ male_unemployment: num 5
## pred : Named num 4.91
## r_squared : function (predcol, ycol)
## res : num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
## rho : num 0.906
## rho2 : num 0.821
## rmse : function (predcol, ycol)
## rsq : num 0.821
## rsq_glance : num 0.821
## rss : num 3.7
## sd_unemployment : num 1.31
## Soybean : chr [1:2] "soybean_train" "soybean_test"
## soybean_test : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 82 obs. of 5 variables:
## $ Plot : Ord.factor w/ 41 levels "1988F4"<"1988F2"<..: 3 3 3 2 2 2 6 6 1 1 ...
## $ Variety: Factor w/ 2 levels "F","P": 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 3 levels "1988","1989",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : num 42 56 70 14 42 77 49 63 14 35 ...
## $ weight : num 3.56 8.71 16.342 0.104 2.93 ...
## soybean_train : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 330 obs. of 5 variables:
## $ Plot : Ord.factor w/ 48 levels "1988F4"<"1988F2"<..: 3 3 3 3 3 3 3 2 2 2 ...
## $ Variety: Factor w/ 2 levels "F","P": 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 3 levels "1988","1989",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : num 14 21 28 35 49 63 77 21 28 35 ...
## $ weight : num 0.106 0.261 0.666 2.11 6.23 ...
## sparrow : 'data.frame': 87 obs. of 11 variables:
## $ status : Factor w/ 2 levels "Perished","Survived": 2 2 2 2 2 2 2 2 2 2 ...
## $ age : chr "adult" "adult" "adult" "adult" ...
## $ total_length: int 154 160 155 154 156 161 157 159 158 158 ...
## $ wingspan : int 241 252 243 245 247 253 251 247 247 252 ...
## $ weight : num 24.5 26.9 26.9 24.3 24.1 26.5 24.6 24.2 23.6 26.2 ...
## $ beak_head : num 31.2 30.8 30.6 31.7 31.5 31.8 31.1 31.4 29.8 32 ...
## $ humerus : num 0.69 0.74 0.73 0.74 0.71 0.78 0.74 0.73 0.7 0.75 ...
## $ femur : num 0.67 0.71 0.7 0.69 0.71 0.74 0.74 0.72 0.67 0.74 ...
## $ legbone : num 1.02 1.18 1.15 1.15 1.13 1.14 1.15 1.13 1.08 1.15 ...
## $ skull : num 0.59 0.6 0.6 0.58 0.57 0.61 0.61 0.61 0.6 0.61 ...
## $ sternum : num 0.83 0.84 0.85 0.84 0.82 0.89 0.86 0.79 0.82 0.86 ...
## target : num 176
## tss : num 20.7
## unemployment : 'data.frame': 13 obs. of 5 variables:
## $ male_unemployment : num 2.9 6.7 4.9 7.9 9.8 ...
## $ female_unemployment: num 4 7.4 5 7.2 7.9 ...
## $ prediction : num 3.45 6.09 4.84 6.92 8.24 ...
## $ predictions : num 3.45 6.09 4.84 6.92 8.24 ...
## $ residuals : num 0.552 1.313 0.163 0.279 -0.34 ...
## unemployment_model : List of 12
## $ coefficients : Named num [1:2] 1.434 0.695
## $ residuals : Named num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
## $ effects : Named num [1:13] -20.08 -4.126 0.106 -0.264 -1.192 ...
## $ rank : int 2
## $ fitted.values: Named num [1:13] 3.45 6.09 4.84 6.92 8.24 ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## $ df.residual : int 11
## $ xlevels : Named list()
## $ call : language lm(formula = fmla, data = unemployment)
## $ terms :Classes 'terms', 'formula' language female_unemployment ~ male_unemployment
## $ model :'data.frame': 13 obs. of 2 variables:
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model,mpg_train)
# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, mpg_test)
# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred,mpg_train$cty))
## [1] 1.319855
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.060794
# Evaluate the r-squared on both training and test data.and print them
(rsq_train <- r_squared(mpg_train$pred,mpg_train$cty))
## [1] 0.9076025
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.929624
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) +
geom_point() +
geom_abline()
There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.
kWayCrossValidation() creates a cross validation plan with the following call: splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y) where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.
Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.
The resulting splitPlan is a list of nSplits elements; each element contains two vectors:
train: the indices of dframe that will form the training set app: the indices of dframe that will form the test (or application) set
# Load the package vtreat
library(vtreat)
## Warning: package 'vtreat' was built under R version 3.4.4
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
# Get the number of rows in mpg
nRows <- nrow(mpg)
# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3,NULL,NULL)
# Examine the split plan
str(splitPlan)
## List of 3
## $ :List of 2
## ..$ train: int [1:156] 1 2 3 5 7 8 10 11 12 13 ...
## ..$ app : int [1:78] 161 221 148 205 120 191 162 32 122 31 ...
## $ :List of 2
## ..$ train: int [1:156] 3 4 6 7 9 12 13 15 16 18 ...
## ..$ app : int [1:78] 17 80 193 146 77 123 126 131 76 214 ...
## $ :List of 2
## ..$ train: int [1:156] 1 2 4 5 6 8 9 10 11 14 ...
## ..$ app : int [1:78] 228 104 20 75 28 57 16 198 229 7 ...
## - attr(*, "splitmethod")= chr "kwaycross"
the 3-fold cross validation to make predictions from a model that predicts mpg\(cty from mpg\)hwy.
If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:
Initialize a column of the appropriate length dframe$pred.cv <- 0
k is the number of folds splitPlan is the cross validation plan
# mpg is in the workspace
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
# splitPlan is in the workspace
str(splitPlan)
## List of 3
## $ :List of 2
## ..$ train: int [1:156] 1 2 3 5 7 8 10 11 12 13 ...
## ..$ app : int [1:78] 161 221 148 205 120 191 162 32 122 31 ...
## $ :List of 2
## ..$ train: int [1:156] 3 4 6 7 9 12 13 15 16 18 ...
## ..$ app : int [1:78] 17 80 193 146 77 123 126 131 76 214 ...
## $ :List of 2
## ..$ train: int [1:156] 1 2 4 5 6 8 9 10 11 14 ...
## ..$ app : int [1:78] 228 104 20 75 28 57 16 198 229 7 ...
## - attr(*, "splitmethod")= chr "kwaycross"
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0
for(i in 1:k) {
split <- splitPlan[[i]]
model <- lm(cty ~ hwy, data = mpg[split$train, ])
mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}
# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))
# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.275298
model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling.
flowers<-read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/flowers.csv", sep=";")# Call str on flowers to see the types of each column
str(flowers)
## 'data.frame': 24 obs. of 3 variables:
## $ Flowers : num 62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
## $ Time : Factor w/ 2 levels "Early","Late": 2 2 2 2 2 2 2 2 2 2 ...
## $ Intensity: int 150 150 300 300 450 450 600 600 750 750 ...
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
## [1] Late Early
## Levels: Early Late
# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
## Flowers ~ Intensity + Time
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, flowers)
# Examine the first 20 lines of flowers
head(flowers,20)
## Flowers Time Intensity
## 1 62.3 Late 150
## 2 77.4 Late 150
## 3 55.3 Late 300
## 4 54.2 Late 300
## 5 49.6 Late 450
## 6 61.9 Late 450
## 7 39.4 Late 600
## 8 45.7 Late 600
## 9 31.3 Late 750
## 10 44.9 Late 750
## 11 36.8 Late 900
## 12 41.9 Late 900
## 13 77.8 Early 150
## 14 75.6 Early 150
## 15 69.1 Early 300
## 16 78.0 Early 300
## 17 57.0 Early 450
## 18 71.1 Early 450
## 19 62.9 Early 600
## 20 52.2 Early 600
# Examine the first 20 lines of mmat
head(mmat,20)
## (Intercept) Intensity TimeLate
## 1 1 150 1
## 2 1 150 1
## 3 1 300 1
## 4 1 300 1
## 5 1 450 1
## 6 1 450 1
## 7 1 600 1
## 8 1 600 1
## 9 1 750 1
## 10 1 750 1
## 11 1 900 1
## 12 1 900 1
## 13 1 150 0
## 14 1 150 0
## 15 1 300 0
## 16 1 300 0
## 17 1 450 0
## 18 1 450 0
## 19 1 600 0
## 20 1 600 0
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, flowers)
# Use summary on mmat to remind yourself of its structure
summary(mmat)
## (Intercept) Intensity TimeLate
## Min. :1 Min. :150 Min. :0.0
## 1st Qu.:1 1st Qu.:300 1st Qu.:0.0
## Median :1 Median :525 Median :0.5
## Mean :1 Mean :525 Mean :0.5
## 3rd Qu.:1 3rd Qu.:750 3rd Qu.:1.0
## Max. :1 Max. :900 Max. :1.0
# Use summary to examine flower_model
summary(flower_model)
##
## Call:
## lm(formula = fmla, data = flowers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.652 -4.139 -1.558 5.632 12.165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.464167 3.273772 25.495 < 2e-16 ***
## Intensity -0.040471 0.005132 -7.886 1.04e-07 ***
## TimeLate -12.158333 2.629557 -4.624 0.000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared: 0.7992, Adjusted R-squared: 0.78
## F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08
# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model, flowers)
# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) +
geom_point() +
geom_abline(color = "blue")
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(vtreat)
library(broom)
library(sigr)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
alcohol<-read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/alcohol.csv", sep=";")# Call str on flowers to see the types of each column
str(alcohol)
## 'data.frame': 32 obs. of 5 variables:
## $ Subject: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Metabol: num 0.6 0.6 1.5 0.4 0.1 0.2 0.3 0.3 0.4 1 ...
## $ Gastric: num 1 1.6 1.5 2.2 1.1 1.2 0.9 0.8 1.5 0.9 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol: Factor w/ 2 levels "Alcoholic","Non-alcoholic": 1 1 1 2 2 2 2 2 2 2 ...
# alcohol is in the workspace
summary(alcohol)
## Subject Metabol Gastric Sex
## Min. : 1.00 Min. : 0.100 Min. :0.800 Female:18
## 1st Qu.: 8.75 1st Qu.: 0.600 1st Qu.:1.200 Male :14
## Median :16.50 Median : 1.700 Median :1.600
## Mean :16.50 Mean : 2.422 Mean :1.863
## 3rd Qu.:24.25 3rd Qu.: 2.925 3rd Qu.:2.200
## Max. :32.00 Max. :12.300 Max. :5.200
## Alcohol
## Alcoholic : 8
## Non-alcoholic:24
##
##
##
##
# Create the formula with main effects only
(fmla_add <- as.formula('Metabol~Gastric+Sex') )
## Metabol ~ Gastric + Sex
# Create the formula with interactions
(fmla_interaction <- as.formula('Metabol~Gastric+Gastric:Sex') )
## Metabol ~ Gastric + Gastric:Sex
# Fit the main effects only model
model_add <- lm(fmla_add,alcohol)
# Fit the interaction model
model_interaction <- lm(fmla_interaction, alcohol)
# Call summary on both models and compare
summary(model_add)
##
## Call:
## lm(formula = fmla_add, data = alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2779 -0.6328 -0.0966 0.5783 4.5703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.9466 0.5198 -3.745 0.000796 ***
## Gastric 1.9656 0.2674 7.352 4.24e-08 ***
## SexMale 1.6174 0.5114 3.163 0.003649 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.331 on 29 degrees of freedom
## Multiple R-squared: 0.7654, Adjusted R-squared: 0.7492
## F-statistic: 47.31 on 2 and 29 DF, p-value: 7.41e-10
summary(model_interaction)
##
## Call:
## lm(formula = fmla_interaction, data = alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4656 -0.5091 0.0143 0.5660 4.0668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7504 0.5310 -1.413 0.168236
## Gastric 1.1489 0.3450 3.331 0.002372 **
## Gastric:SexMale 1.0422 0.2412 4.321 0.000166 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.204 on 29 degrees of freedom
## Multiple R-squared: 0.8081, Adjusted R-squared: 0.7948
## F-statistic: 61.05 on 2 and 29 DF, p-value: 4.033e-11
# alcohol is in the workspace
summary(alcohol)
## Subject Metabol Gastric Sex
## Min. : 1.00 Min. : 0.100 Min. :0.800 Female:18
## 1st Qu.: 8.75 1st Qu.: 0.600 1st Qu.:1.200 Male :14
## Median :16.50 Median : 1.700 Median :1.600
## Mean :16.50 Mean : 2.422 Mean :1.863
## 3rd Qu.:24.25 3rd Qu.: 2.925 3rd Qu.:2.200
## Max. :32.00 Max. :12.300 Max. :5.200
## Alcohol
## Alcoholic : 8
## Non-alcoholic:24
##
##
##
##
# Both the formulae are in the workspace
fmla_add
## Metabol ~ Gastric + Sex
fmla_interaction
## Metabol ~ Gastric + Gastric:Sex
# Create the splitting plan for 3-fold cross validation
set.seed(34245) # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)
# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_add <- lm(fmla_add, data = alcohol[split$train, ])
alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}
# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}
# Get RMSE
alcohol %>%
gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
mutate(residuals = Metabol-pred) %>%
group_by(modeltype) %>%
summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 x 2
## modeltype rmse
## <chr> <dbl>
## 1 pred_add 1.64
## 2 pred_interaction 1.38
fdata<-read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/fdata.csv", sep=";")# Call str on flowers to see the types of each column
# fdata is in the workspace
summary(fdata)
## y pred label
## Min. : -5.894 Min. : 1.072 large purchases:50
## 1st Qu.: 5.407 1st Qu.: 6.373 small purchases:50
## Median : 57.374 Median : 55.693
## Mean : 306.204 Mean : 305.905
## 3rd Qu.: 550.903 3rd Qu.: 547.886
## Max. :1101.619 Max. :1098.896
library(ggplot2)
library(tidyr)
library(dplyr)
library(vtreat)
library(broom)
library(sigr)
library(tidyr)
# Examine the data: generate the summaries for the groups large and small:
fdata %>%
group_by(label) %>% # group by small/large purchases
summarize(min = min(y), # min of y
mean = mean(y), # mean of y
max = max(y)) # max of y
## # A tibble: 2 x 4
## label min mean max
## <fct> <dbl> <dbl> <dbl>
## 1 large purchases 96.1 606. 1102.
## 2 small purchases -5.89 6.48 18.6
# Fill in the blanks to add error columns
fdata2 <- fdata %>%
group_by(label) %>% # group by label
mutate(residual = pred-y, # Residual
relerr = residual/y) # Relative error
# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>%
group_by(label) %>%
summarize(rmse = sqrt(mean(residual^2)), # RMSE
rmse.rel = sqrt(mean(relerr^2)) ) # Root mean squared relative error
## # A tibble: 2 x 3
## label rmse rmse.rel
## <fct> <dbl> <dbl>
## 1 large purchases 5.54 0.0147
## 2 small purchases 4.01 1.25
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) +
geom_point() +
geom_abline() +
facet_wrap(~ label, ncol = 1, scales = "free") +
ggtitle("Outcome vs prediction")
## Modeling log-transformed monetary output
income_train<-read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/income_train.csv", sep=";")# Call str on flowers to see the types of each column
income_test<-read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/income_test.csv", sep=";")# Call str on flowers to see the types of each column
# fdata is in the workspace
summary(income_test)
## Subject Arith Word Parag
## Min. : 7 Min. : 3.00 Min. : 4.00 Min. : 0.00
## 1st Qu.: 1638 1st Qu.:12.00 1st Qu.:22.00 1st Qu.: 9.00
## Median : 3303 Median :19.00 Median :28.00 Median :12.00
## Mean : 3467 Mean :18.33 Mean :26.19 Mean :10.99
## 3rd Qu.: 4690 3rd Qu.:24.00 3rd Qu.:32.00 3rd Qu.:13.00
## Max. :12106 Max. :30.00 Max. :35.00 Max. :15.00
## Math AFQT Income2005
## Min. : 1.00 Min. : 0.56 Min. : 1200
## 1st Qu.: 9.00 1st Qu.: 30.40 1st Qu.: 22000
## Median :13.00 Median : 55.12 Median : 36323
## Mean :13.89 Mean : 53.09 Mean : 47500
## 3rd Qu.:19.00 3rd Qu.: 76.83 3rd Qu.: 60000
## Max. :25.00 Max. :100.00 Max. :368302
library(ggplot2)
library(tidyr)
library(dplyr)
library(vtreat)
library(broom)
library(sigr)
library(tidyr)
# Examine Income2005 in the training set
summary(income_train$Income2005)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63 23000 39000 49894 61500 703637
# Write the formula for log income as a function of the tests and print it
(fmla.log <- as.formula("log(Income2005)~Arith+Word+Parag+Math+AFQT"))
## log(Income2005) ~ Arith + Word + Parag + Math + AFQT
# Fit the linear model
model.log <- lm(fmla.log, income_train)
# Make predictions on income_test
income_test$logpred <- predict(model.log, income_test)
summary(income_test$logpred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.766 10.133 10.423 10.419 10.705 11.006
# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17432 25167 33615 35363 44566 60217
# Plot predicted income (x axis) vs income
ggplot(income_test, aes(x = pred.income, y = Income2005)) +
geom_point() +
geom_abline(color = "blue")
fmla.abs<-as.formula("Income2005 ~ Arith + Word + Parag + Math + AFQT")
model.abs<-lm(fmla.abs, income_train)
# fmla.abs is in the workspace
fmla.abs
## Income2005 ~ Arith + Word + Parag + Math + AFQT
# model.abs is in the workspace
summary(model.abs)
##
## Call:
## lm(formula = fmla.abs, data = income_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78728 -24137 -6979 11964 648573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17516.7 6420.1 2.728 0.00642 **
## Arith 1552.3 303.4 5.116 3.41e-07 ***
## Word -132.3 265.0 -0.499 0.61754
## Parag -1155.1 618.3 -1.868 0.06189 .
## Math 725.5 372.0 1.950 0.05127 .
## AFQT 177.8 144.1 1.234 0.21734
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45500 on 2063 degrees of freedom
## Multiple R-squared: 0.1165, Adjusted R-squared: 0.1144
## F-statistic: 54.4 on 5 and 2063 DF, p-value: < 2.2e-16
# Add predictions to the test set
income_test <- income_test %>%
mutate(pred.absmodel = predict(model.abs, income_test), # predictions from model.abs
pred.logmodel = exp(predict(model.log, income_test))) # predictions from model.log
# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>%
gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
mutate(residual = pred-Income2005, # residuals
relerr = residual/Income2005) # relative error
# Calculate RMSE and relative RMSE and compare
income_long %>%
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residual^2)), # RMSE
rmse.rel = sqrt(mean(relerr^2))) # Root mean squared relative error
## # A tibble: 2 x 3
## modeltype rmse rmse.rel
## <chr> <dbl> <dbl>
## 1 pred.absmodel 37448. 3.18
## 2 pred.logmodel 39235. 2.22
Bikes<-load("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/Bikes.RData")
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
str(Soybean)
## chr [1:2] "soybean_train" "soybean_test"
# soybean_train is in the workspace
summary(soybean_train)
## Plot Variety Year Time weight
## 1988F6 : 10 F:161 1988:124 Min. :14.00 Min. : 0.0290
## 1988F7 : 9 P:169 1989:102 1st Qu.:27.00 1st Qu.: 0.6663
## 1988P1 : 9 1990:104 Median :42.00 Median : 3.5233
## 1988P8 : 9 Mean :43.56 Mean : 6.1645
## 1988P2 : 9 3rd Qu.:56.00 3rd Qu.:10.3808
## 1988F3 : 8 Max. :84.00 Max. :27.3700
## (Other):276
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) +
geom_point()
# Load the package mgcv
library(mgcv)
## Warning: package 'mgcv' was built under R version 3.4.4
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked _by_ '.GlobalEnv':
##
## Soybean
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
# Create the formula
(fmla.gam <- as.formula(weight~s(Time)) )
## weight ~ s(Time)
print(fmla.gam)
## weight ~ s(Time)
# Fit the GAM Model
model.gam <- gam(fmla.gam, soybean_train, family=gaussian)
# From previous step
library(mgcv)
fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, data = soybean_train, family = gaussian)
fmla.lin<-weight ~ Time
model.lin<-lm(fmla.lin, data = soybean_train)
# Call summary() on model.lin and look for R-squared
summary(model.lin)
##
## Call:
## lm(formula = fmla.lin, data = soybean_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3933 -1.7100 -0.3909 1.9056 11.4381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.559283 0.358527 -18.30 <2e-16 ***
## Time 0.292094 0.007444 39.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778 on 328 degrees of freedom
## Multiple R-squared: 0.8244, Adjusted R-squared: 0.8238
## F-statistic: 1540 on 1 and 328 DF, p-value: < 2.2e-16
# Call summary() on model.gam and look for R-squared
summary(model.gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## weight ~ s(Time)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1645 0.1143 53.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Time) 8.495 8.93 338.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.902 Deviance explained = 90.4%
## GCV = 4.4395 Scale est. = 4.3117 n = 330
# Call plot() on model.gam
plot(model.gam)
# soybean_test is in the workspace
summary(soybean_test)
## Plot Variety Year Time weight
## 1988F8 : 4 F:43 1988:32 Min. :14.00 Min. : 0.0380
## 1988P7 : 4 P:39 1989:26 1st Qu.:23.00 1st Qu.: 0.4248
## 1989F8 : 4 1990:24 Median :41.00 Median : 3.0025
## 1990F8 : 4 Mean :44.09 Mean : 7.1576
## 1988F4 : 3 3rd Qu.:69.00 3rd Qu.:15.0113
## 1988F2 : 3 Max. :84.00 Max. :30.2717
## (Other):60
# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)
# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))
# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
gather(key = modeltype, value = pred, pred.lin, pred.gam)
# Calculate the rmse
soybean_long %>%
mutate(residual = weight - pred) %>% # residuals
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE
## # A tibble: 2 x 2
## modeltype rmse
## <chr> <dbl>
## 1 pred.gam 2.29
## 2 pred.lin 3.19
# Compare the predictions against actual weights on the test data
soybean_long %>%
ggplot(aes(x = Time)) + # the column for the x axis
geom_point(aes(y = weight)) + # the y-column for the scatterplot
geom_point(aes(y = pred, color = modeltype)) + # the y-column for the point-and-line plot
geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
scale_color_brewer(palette = "Dark2")
# houseprice is in the workspace
summary(houseprice)
## size price
## Min. : 44.0 Min. : 42.0
## 1st Qu.: 73.5 1st Qu.:164.5
## Median : 91.0 Median :203.5
## Mean : 94.3 Mean :249.2
## 3rd Qu.:118.5 3rd Qu.:287.8
## Max. :150.0 Max. :573.0
# Create the formula for price as a function of squared size
(fmla_sqr <- as.formula(price~I(size^2)))
## price ~ I(size^2)
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data=houseprice)
# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, data=houseprice)
# Make predictions and compare
houseprice %>%
mutate(pred_lin = predict(model_lin,houseprice), # predictions from linear model
pred_sqr = predict(model_sqr,houseprice)) %>% # predictions from quadratic model
gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
ggplot(aes(x = size)) +
geom_point(aes(y = price)) + # actual prices
geom_line(aes(y = pred, color = modeltype)) + # the predictions
scale_color_brewer(palette = "Dark2")
# fmla_sqr is in the workspace
fmla_sqr
## price ~ I(size^2)
# Create a splitting plan for 3-fold cross validation
set.seed(34245) # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(houseprice), 3, NULL, NULL)
# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_lin <- lm(price ~ size, data = houseprice[split$train, ])
houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app, ])
}
# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}
# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
mutate(residuals = pred - price)
# Compare the cross-validated RMSE for the two models
houseprice_long %>%
group_by(modeltype) %>%
summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 x 2
## modeltype rmse
## <chr> <dbl>
## 1 pred_lin 74.3
## 2 pred_sqr 63.7
Bikes<-load("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/Bikes.RData")
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
library(WVPlots)
# sparrow is in the workspace
summary(sparrow)
## status age total_length wingspan
## Perished:36 Length:87 Min. :153.0 Min. :236.0
## Survived:51 Class :character 1st Qu.:158.0 1st Qu.:245.0
## Mode :character Median :160.0 Median :247.0
## Mean :160.4 Mean :247.5
## 3rd Qu.:162.5 3rd Qu.:251.0
## Max. :167.0 Max. :256.0
## weight beak_head humerus femur
## Min. :23.2 Min. :29.80 Min. :0.6600 Min. :0.6500
## 1st Qu.:24.7 1st Qu.:31.40 1st Qu.:0.7250 1st Qu.:0.7000
## Median :25.8 Median :31.70 Median :0.7400 Median :0.7100
## Mean :25.8 Mean :31.64 Mean :0.7353 Mean :0.7134
## 3rd Qu.:26.7 3rd Qu.:32.10 3rd Qu.:0.7500 3rd Qu.:0.7300
## Max. :31.0 Max. :33.00 Max. :0.7800 Max. :0.7600
## legbone skull sternum
## Min. :1.010 Min. :0.5600 Min. :0.7700
## 1st Qu.:1.110 1st Qu.:0.5900 1st Qu.:0.8300
## Median :1.130 Median :0.6000 Median :0.8500
## Mean :1.131 Mean :0.6032 Mean :0.8511
## 3rd Qu.:1.160 3rd Qu.:0.6100 3rd Qu.:0.8800
## Max. :1.230 Max. :0.6400 Max. :0.9300
# Create the survived column
sparrow$survived <- sparrow$status=="Survived"
# Create the formula
(fmla <- as.formula("survived~total_length+weight+humerus"))
## survived ~ total_length + weight + humerus
# Fit the logistic regression model
sparrow_model <- glm(fmla, sparrow, family = binomial)
# Call summary
summary(sparrow_model)
##
## Call:
## glm(formula = fmla, family = binomial, data = sparrow)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1117 -0.6026 0.2871 0.6577 1.7082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 46.8813 16.9631 2.764 0.005715 **
## total_length -0.5435 0.1409 -3.858 0.000115 ***
## weight -0.5689 0.2771 -2.053 0.040060 *
## humerus 75.4610 19.1586 3.939 8.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 118.008 on 86 degrees of freedom
## Residual deviance: 75.094 on 83 degrees of freedom
## AIC: 83.094
##
## Number of Fisher Scoring iterations: 5
# Call glance
(perf <- glance(sparrow_model))
## # A tibble: 1 x 7
## null.deviance df.null logLik AIC BIC deviance df.residual
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 118. 86 -37.5 83.1 93.0 75.1 83
# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
## [1] 0.3636526
# Make predictions
sparrow$pred <- predict(sparrow_model, type="response")
# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")
looking at the pseudo-R2 of a logistic regression model, you should hope to see a value close to 1. the gain curve that the model follows the wizard curve for about the first 30% of the data, identifying about 45% of the surviving sparrows with only a few false positives.
# The outcome column
outcome <-"cnt"
# The inputs to use
vars <-c("hr","holiday","workingday","weathersit","temp","atemp","hum" ,"windspeed")
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# The outcome column
outcome <-"cnt"
# The inputs to use
vars <-c("hr","holiday","workingday","weathersit","temp","atemp","hum" ,"windspeed")
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# The outcome column
outcome
## [1] "cnt"
# The inputs to use
vars
## [1] "hr" "holiday" "workingday" "weathersit" "temp"
## [6] "atemp" "hum" "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
## [1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
## [1] 45863.84
# Fit the model
bike_model <- glm(fmla, bikesJuly, family=quasipoisson)
# Call glance
(perf <- glance(bike_model))
## # A tibble: 1 x 7
## null.deviance df.null logLik AIC BIC deviance df.residual
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 133365. 743 NA NA NA 28775. 712
# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
## [1] 0.7842393
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# bike_model is in the workspace
summary(bike_model)
##
## Call:
## glm(formula = fmla, family = quasipoisson, data = bikesJuly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -21.6117 -4.3121 -0.7223 3.5507 16.5079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.934986 0.439027 13.519 < 2e-16 ***
## hr1 -0.580055 0.193354 -3.000 0.002794 **
## hr2 -0.892314 0.215452 -4.142 3.86e-05 ***
## hr3 -1.662342 0.290658 -5.719 1.58e-08 ***
## hr4 -2.350204 0.393560 -5.972 3.71e-09 ***
## hr5 -1.084289 0.230130 -4.712 2.96e-06 ***
## hr6 0.211945 0.156476 1.354 0.176012
## hr7 1.211135 0.132332 9.152 < 2e-16 ***
## hr8 1.648361 0.127177 12.961 < 2e-16 ***
## hr9 1.155669 0.133927 8.629 < 2e-16 ***
## hr10 0.993913 0.137096 7.250 1.09e-12 ***
## hr11 1.116547 0.136300 8.192 1.19e-15 ***
## hr12 1.282685 0.134769 9.518 < 2e-16 ***
## hr13 1.273010 0.135872 9.369 < 2e-16 ***
## hr14 1.237721 0.136386 9.075 < 2e-16 ***
## hr15 1.260647 0.136144 9.260 < 2e-16 ***
## hr16 1.515893 0.132727 11.421 < 2e-16 ***
## hr17 1.948404 0.128080 15.212 < 2e-16 ***
## hr18 1.893915 0.127812 14.818 < 2e-16 ***
## hr19 1.669277 0.128471 12.993 < 2e-16 ***
## hr20 1.420732 0.131004 10.845 < 2e-16 ***
## hr21 1.146763 0.134042 8.555 < 2e-16 ***
## hr22 0.856182 0.138982 6.160 1.21e-09 ***
## hr23 0.479197 0.148051 3.237 0.001265 **
## holidayTRUE 0.201598 0.079039 2.551 0.010961 *
## workingdayTRUE 0.116798 0.033510 3.485 0.000521 ***
## weathersitLight Precipitation -0.214801 0.072699 -2.955 0.003233 **
## weathersitMisty -0.010757 0.038600 -0.279 0.780572
## temp -3.246001 1.148270 -2.827 0.004833 **
## atemp 2.042314 0.953772 2.141 0.032589 *
## hum -0.748557 0.236015 -3.172 0.001581 **
## windspeed 0.003277 0.148814 0.022 0.982439
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 38.98949)
##
## Null deviance: 133365 on 743 degrees of freedom
## Residual deviance: 28775 on 712 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
# Make predictions on August data
bikesAugust$pred <- predict(bike_model, newdata = bikesAugust,type = "response")
# Calculate the RMSE
bikesAugust %>%
mutate(residual = pred-cnt) %>%
summarize(rmse = sqrt(mean(residual^2)))
## rmse
## 1 112.5815
# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline(color = "darkblue")
# Plot predictions and cnt by date/time
bikesAugust %>%
# set start to 0, convert unit to days
mutate(instant = (instant - min(instant))/24) %>%
# gather cnt and pred into a value column
gather(key = valuetype, value = value, cnt, pred) %>%
filter(instant < 14) %>% # restric to first 14 days
# plot value by instant
ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Quasipoisson model")
a (quasi)poisson model to predict counts! a pseudo-R2 is near 1.Great! (Quasi)poisson models predict non-negative rates, making them useful for count or frequency data.
This model mostly identifies the slow and busy hours of the day, although it often underestimates peak demand
# soybean_train is in the workspace
summary(soybean_train)
## Plot Variety Year Time weight
## 1988F6 : 10 F:161 1988:124 Min. :14.00 Min. : 0.0290
## 1988F7 : 9 P:169 1989:102 1st Qu.:27.00 1st Qu.: 0.6663
## 1988P1 : 9 1990:104 Median :42.00 Median : 3.5233
## 1988P8 : 9 Mean :43.56 Mean : 6.1645
## 1988P2 : 9 3rd Qu.:56.00 3rd Qu.:10.3808
## 1988F3 : 8 Max. :84.00 Max. :27.3700
## (Other):276
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) +
geom_point()
# Load the package mgcv
library(mgcv)
# Create the formula
(fmla.gam <- as.formula(weight~s(Time)) )
## weight ~ s(Time)
print(fmla.gam)
## weight ~ s(Time)
# Fit the GAM Model
model.gam <- gam(fmla.gam, soybean_train, family=gaussian)
# From previous step
library(mgcv)
fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, data = soybean_train, family = gaussian)
# Call summary() on model.lin and look for R-squared
#summary(model.lin)
# Call summary() on model.gam and look for R-squared
summary(model.gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## weight ~ s(Time)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1645 0.1143 53.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Time) 8.495 8.93 338.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.902 Deviance explained = 90.4%
## GCV = 4.4395 Scale est. = 4.3117 n = 330
# Call plot() on model.gam
plot(model.gam)
# soybean_test is in the workspace
summary(soybean_test)
## Plot Variety Year Time weight
## 1988F8 : 4 F:43 1988:32 Min. :14.00 Min. : 0.0380
## 1988P7 : 4 P:39 1989:26 1st Qu.:23.00 1st Qu.: 0.4248
## 1989F8 : 4 1990:24 Median :41.00 Median : 3.0025
## 1990F8 : 4 Mean :44.09 Mean : 7.1576
## 1988F4 : 3 3rd Qu.:69.00 3rd Qu.:15.0113
## 1988F2 : 3 Max. :84.00 Max. :30.2717
## (Other):60
## pred.lin pred.gam
## Min. :-2.4700 Min. : 0.1084
## 1st Qu.: 0.1589 1st Qu.: 0.4156
## Median : 5.4166 Median : 3.6137
## Mean : 6.3178 Mean : 6.8382
## 3rd Qu.:13.5952 3rd Qu.:14.7402
## Max. :17.9766 Max. :18.6845
##
# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)
# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))
# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
gather(key = modeltype, value = pred, pred.lin, pred.gam)
# Calculate the rmse
soybean_long %>%
mutate(residual = weight - pred) %>% # residuals
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE
## # A tibble: 2 x 2
## modeltype rmse
## <chr> <dbl>
## 1 pred.gam 2.29
## 2 pred.lin 3.19
# Compare the predictions against actual weights on the test data
soybean_long %>%
ggplot(aes(x = Time)) + # the column for the x axis
geom_point(aes(y = weight)) + # the y-column for the scatterplot
geom_point(aes(y = pred, color = modeltype)) + # the y-column for the point-and-line plot
geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
scale_color_brewer(palette = "Dark2")
For this data, the GAM appears to fit the data better than a linear model, as measured by the R-squared.The GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative.
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed<-423563
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr" "holiday" "workingday" "weathersit" "temp"
## [6] "atemp" "hum" "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)
## Warning: package 'ranger' was built under R version 3.4.4
# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula
bikesJuly, # data
num.trees = 500,
respect.unordered.factors = "order",
seed = seed))
## Ranger result
##
## Call:
## ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
##
## Type: Regression
## Number of trees: 500
## Sample size: 744
## Number of independent variables: 8
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 8230.568
## R squared (OOB): 0.8205434
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame': 744 obs. of 13 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pred : num 94.96 51.74 37.98 17.58 9.36 ...
# bike_model_rf is in the workspace
bike_model_rf
## Ranger result
##
## Call:
## ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
##
## Type: Regression
## Number of trees: 500
## Sample size: 744
## Number of independent variables: 8
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 8230.568
## R squared (OOB): 0.8205434
# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions
# Calculate the RMSE of the predictions
bikesAugust %>%
mutate(residual = cnt-pred) %>% # calculate the residual
summarize(rmse = sqrt(mean(residual^2))) # calculate rmse
## rmse
## 1 97.18347
# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline()
first_two_weeks <- bikesAugust %>%
# Set start to 0, convert unit to days
mutate(instant = (instant - min(instant)) / 24) %>%
# Gather cnt and pred into a column named value with key valuetype
gather(key = valuetype, value = value, cnt, pred) %>%
# Filter for rows in the first two
filter(instant < 14)
# Plot predictions and cnt by date/time
ggplot(first_two_weeks,aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Random Forest plot")
The random forest model captured the day-to-day variations in peak demand better than the quasipoisson model, but it still underestmates peak demand, and also overestimates minimum demand. So there is still room for improvement.
dframe<-read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/dframe.csv", sep=";")
testframe<-read.csv("C:/Users/maiam/Dropbox/PROFESSIONAL DEVELOPMENT/DATA SCIENCE/01_R/supervised learning in r regression/testframe.csv", sep=";")
str(dframe)
## 'data.frame': 10 obs. of 3 variables:
## $ color : Factor w/ 3 levels "b","g","r": 1 3 3 3 3 1 3 2 1 1
## $ size : int 13 11 15 14 13 11 9 12 7 12
## $ popularity: num 1.079 1.396 0.922 1.203 1.084 ...
library(vtreat)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
# dframe is in the workspace
dframe
## color size popularity
## 1 b 13 1.0785088
## 2 r 11 1.3956245
## 3 r 15 0.9217988
## 4 r 14 1.2025453
## 5 r 13 1.0838662
## 6 b 11 0.8043527
## 7 r 9 1.1035440
## 8 g 12 0.8746332
## 9 b 7 0.6947058
## 10 b 12 0.8832502
# Create and print a vector of variable names
(vars <- c("color","size"))
## [1] "color" "size"
# Load the package vtreat
library(vtreat)
library(dplyr)
# Create the treatment plan
treatplan <- designTreatmentsZ(dframe,vars, verbose=FALSE)
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
use_series(scoreFrame) %>%
select(varName, origName, code))
## varName origName code
## 1 color_catP color catP
## 2 size size clean
## 3 color_lev_x_b color lev
## 4 color_lev_x_g color lev
## 5 color_lev_x_r color lev
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
filter(code %in% c("clean", "lev")) %>%
use_series(varName))
## [1] "size" "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"
# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
## size color_lev_x_b color_lev_x_g color_lev_x_r
## 1 13 1 0 0
## 2 11 0 0 1
## 3 15 0 0 1
## 4 14 0 0 1
## 5 13 0 0 1
## 6 11 1 0 0
## 7 9 0 0 1
## 8 12 0 1 0
## 9 7 1 0 0
## 10 12 1 0 0
# Print dframe and testframe
dframe
## color size popularity
## 1 b 13 1.0785088
## 2 r 11 1.3956245
## 3 r 15 0.9217988
## 4 r 14 1.2025453
## 5 r 13 1.0838662
## 6 b 11 0.8043527
## 7 r 9 1.1035440
## 8 g 12 0.8746332
## 9 b 7 0.6947058
## 10 b 12 0.8832502
testframe
## color size popularity
## 1 g 7 0.9733920
## 2 g 8 0.9122529
## 3 y 10 1.4217153
## 4 g 12 1.1905828
## 5 g 6 0.9866464
## 6 y 8 1.3697515
## 7 b 12 1.0959387
## 8 g 12 0.9161547
## 9 g 12 1.0000460
## 10 r 8 1.3137360
# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
## size color_lev_x_b color_lev_x_g color_lev_x_r
## 1 7 0 1 0
## 2 8 0 1 0
## 3 10 0 0 0
## 4 12 0 1 0
## 5 6 0 1 0
## 6 8 0 0 0
## 7 12 1 0 0
## 8 12 0 1 0
## 9 12 0 1 0
## 10 8 0 0 1
vtreat the bike rental data
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr" "holiday" "workingday" "weathersit" "temp"
## [6] "atemp" "hum" "windspeed"
# Load the package vtreat
library(vtreat)
# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly,vars, verbose = FALSE)
# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
use_series(scoreFrame) %>%
filter(code %in% c("clean","lev")) %>% # get the rows you care about
use_series(varName)) # get the varName column
## [1] "holiday"
## [2] "workingday"
## [3] "temp"
## [4] "atemp"
## [5] "hum"
## [6] "windspeed"
## [7] "hr_lev_x_0"
## [8] "hr_lev_x_1"
## [9] "hr_lev_x_10"
## [10] "hr_lev_x_11"
## [11] "hr_lev_x_12"
## [12] "hr_lev_x_13"
## [13] "hr_lev_x_14"
## [14] "hr_lev_x_15"
## [15] "hr_lev_x_16"
## [16] "hr_lev_x_17"
## [17] "hr_lev_x_18"
## [18] "hr_lev_x_19"
## [19] "hr_lev_x_2"
## [20] "hr_lev_x_20"
## [21] "hr_lev_x_21"
## [22] "hr_lev_x_22"
## [23] "hr_lev_x_23"
## [24] "hr_lev_x_3"
## [25] "hr_lev_x_4"
## [26] "hr_lev_x_5"
## [27] "hr_lev_x_6"
## [28] "hr_lev_x_7"
## [29] "hr_lev_x_8"
## [30] "hr_lev_x_9"
## [31] "weathersit_lev_x_Clear_to_partly_cloudy"
## [32] "weathersit_lev_x_Light_Precipitation"
## [33] "weathersit_lev_x_Misty"
# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly, varRestriction = newvars)
# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust, varRestriction = newvars)
# Call str() on the treated data
str(bikesJuly.treat )
## 'data.frame': 744 obs. of 33 variables:
## $ holiday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ hr_lev_x_0 : num 1 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_1 : num 0 1 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_10 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_11 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_12 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_13 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_14 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_15 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_16 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_17 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_18 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_19 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_2 : num 0 0 1 0 0 0 0 0 0 0 ...
## $ hr_lev_x_20 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_21 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_22 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_23 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_3 : num 0 0 0 1 0 0 0 0 0 0 ...
## $ hr_lev_x_4 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ hr_lev_x_5 : num 0 0 0 0 0 1 0 0 0 0 ...
## $ hr_lev_x_6 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ hr_lev_x_7 : num 0 0 0 0 0 0 0 1 0 0 ...
## $ hr_lev_x_8 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ hr_lev_x_9 : num 0 0 0 0 0 0 0 0 0 1 ...
## $ weathersit_lev_x_Clear_to_partly_cloudy: num 1 1 1 1 1 1 1 1 1 1 ...
## $ weathersit_lev_x_Light_Precipitation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit_lev_x_Misty : num 0 0 0 0 0 0 0 0 0 0 ...
str(bikesAugust.treat)
## 'data.frame': 744 obs. of 33 variables:
## $ holiday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ hr_lev_x_0 : num 1 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_1 : num 0 1 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_10 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_11 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_12 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_13 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_14 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_15 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_16 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_17 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_18 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_19 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_2 : num 0 0 1 0 0 0 0 0 0 0 ...
## $ hr_lev_x_20 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_21 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_22 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_23 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x_3 : num 0 0 0 1 0 0 0 0 0 0 ...
## $ hr_lev_x_4 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ hr_lev_x_5 : num 0 0 0 0 0 1 0 0 0 0 ...
## $ hr_lev_x_6 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ hr_lev_x_7 : num 0 0 0 0 0 0 0 1 0 0 ...
## $ hr_lev_x_8 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ hr_lev_x_9 : num 0 0 0 0 0 0 0 0 0 1 ...
## $ weathersit_lev_x_Clear_to_partly_cloudy: num 1 1 1 1 0 0 1 0 0 0 ...
## $ weathersit_lev_x_Light_Precipitation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit_lev_x_Misty : num 0 0 0 0 1 1 0 1 1 1 ...