This data set contains bike share data. As someone who strives to reduce my own carbon footprint, and who grew up in Seattle, where a citywide bike share program has taken hold, I find this data set of particular interest. Bike share programs have great potential to benefit cities in numerous ways, including reduced carbon emissions, reduced traffic congestion, and increased accessibility of public transportation.

I remember that a few years back, when Seattle first implemented its bike share program, many were skeptical about whether it would take hold in the community. In that first year, one unforeseen issue was that, because Seattle is notoriously hilly, most users rode the bikes only downhill. The city then had to expend time, money, and resources driving the bikes back up to the hilltops where demand was highest, which likely offset, if not negated, any reduction in carbon emissions the program initially created. This example illustrates why it is necessary to understand the variables that might affect the popularity of such a program, and the magnitude of their effects.

This particular data set focuses on a bike share program in Portugal, but the findings of this exploration can likely be extrapolated to bike share programs all over the world. Three questions arose after an initial glimpse at the data:

1. What is the seasonal effect on the total number of users?
2. Do some aspects of the local climate (temperature, humidity, wind speed) have a greater impact on the total number of users than others?
3. How do the numbers of registered vs. casual users differ across seasons?
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
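The section never shows the data being read in; a minimal import sketch, assuming the data come from the UCI Bike Sharing day.csv file (the file name and path are assumptions):

bikeShare <- read_csv("day.csv")  # assumed file name; 731 daily records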
# total users (cnt) vs. temperature, colored by humidity
ggplot(bikeShare, aes(temp, cnt, color = hum)) +
  geom_point() +
  theme_bw()

# total users by weather situation
ggplot(bikeShare, aes(as.factor(weathersit), cnt, fill = as.factor(weathersit))) +
  geom_boxplot() +
  theme_bw()

# registered users by season
ggplot(bikeShare, aes(as.factor(season), registered, fill = as.factor(season))) +
  geom_boxplot() +
  theme_bw()

# day-of-week composition within each season
ggplot(bikeShare, aes(x = as.factor(season), fill = as.factor(weekday))) +
  geom_bar(position = "fill") +
  theme_bw()

# weather situation counts within each season
ggplot(bikeShare, aes(x = as.factor(season), fill = as.factor(weathersit))) +
  geom_bar(position = "dodge") +
  theme_bw()
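For presentation, axis and legend labels could be added with labs(); for example, for the first scatterplot (a sketch using the same variables):

ggplot(bikeShare, aes(temp, cnt, color = hum)) +
  geom_point() +
  labs(x = "Normalized temperature", y = "Total users (cnt)", color = "Humidity") +
  theme_bw()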
The variables used in these graphics are mostly climatic (season, weather situation, temperature, and humidity), plus day of the week, and their relationship with the number of users, whether total, registered, or casual. In the first plot there appears to be an obvious but nonlinear relationship between temperature and total users; the points seem to group into two arches, a trend worth exploring down the line. Additionally, weather situation and season appear strongly related to the number of users, as shown by the two box plots. Warmer seasons and warmer weather seem highly correlated with more bike share users.
mod <- lm(cnt~temp, bikeShare)
summary(mod)
##
## Call:
## lm(formula = cnt ~ temp, data = bikeShare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4615.3 -1134.9 -104.4 1044.3 3737.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1214.6 161.2 7.537 1.43e-13 ***
## temp 6640.7 305.2 21.759 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared: 0.3937, Adjusted R-squared: 0.3929
## F-statistic: 473.5 on 1 and 729 DF, p-value: < 2.2e-16
Reference distribution: t
Degrees of freedom: 729
Test statistic: t = 21.76 (equivalently, F = t^2 = 473.5 on 1 and 729 df)
P-value: < 2.2e-16
Conclusion: We reject the null hypothesis at the α = 0.05 significance level. There is strong evidence of a linear relationship between temperature and total bike share users (i.e., the slope parameter is not equal to zero).
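As a quick arithmetic check (not shown in the original output), the slope's t statistic squared reproduces the model F statistic, as expected in simple linear regression:

21.759^2  # equals the F statistic reported above
## [1] 473.4541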
aov.out = aov(cnt~temp, bikeShare)
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## temp 1 1.079e+09 1.079e+09 473.5 <2e-16 ***
## Residuals 729 1.661e+09 2.278e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod)
Based on the residuals vs. fitted values plot, the points appear evenly distributed above and below zero, and there is no increasing or decreasing vertical spread as you move horizontally across the plot (no fan shape). These characteristics indicate that the model fits the data relatively well. In the normal Q-Q plot, most points lie on or very near the diagonal line, with some deviation at the tails; since the bulk of the points follow the line, the residuals are approximately normal. In the residuals vs. leverage plot, there appear to be a handful of outliers, though none stand out sharply, because the data points are relatively spread out rather than clustered on the left of the plot. Toward the right, however, the pattern does not appear as strong as on the left.
Response variable: total users (cnt)
Categorical predictor: season
Numeric predictor: temperature (temp)
mod1 <- lm(cnt~temp, bikeShare)
summary(mod1)
##
## Call:
## lm(formula = cnt ~ temp, data = bikeShare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4615.3 -1134.9 -104.4 1044.3 3737.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1214.6 161.2 7.537 1.43e-13 ***
## temp 6640.7 305.2 21.759 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared: 0.3937, Adjusted R-squared: 0.3929
## F-statistic: 473.5 on 1 and 729 DF, p-value: < 2.2e-16
ggplot(bikeShare, aes(x=temp, y=cnt))+
geom_point()+
geom_abline(intercept = mod1$coefficients[1], slope=mod1$coefficients[2])+
theme_bw()
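An equivalent way to draw the fitted line is to let ggplot2 fit it directly; a sketch:

ggplot(bikeShare, aes(x = temp, y = cnt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # fits lm(cnt ~ temp) internally
  theme_bw()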
## Step 3: Dummy Variable
bikeShare$season<-as.factor(bikeShare$season)
contrasts(bikeShare$season)
## 2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
# winter
winter <- bikeShare %>%
  filter(season == 1) %>%
  mutate(seasonLab = "Winter")
# spring
spring <- bikeShare %>%
  filter(season == 2) %>%
  mutate(seasonLab = "Spring")
# summer
summer <- bikeShare %>%
  filter(season == 3) %>%
  mutate(seasonLab = "Summer")
# fall
fall <- bikeShare %>%
  filter(season == 4) %>%
  mutate(seasonLab = "Fall")
bikeShare2 <- rbind(winter, spring, summer, fall)
bikeShare2$seasonLab <- as.factor(bikeShare2$seasonLab)
contrasts(bikeShare2$seasonLab)
## Spring Summer Winter
## Fall 0 0 0
## Spring 1 0 0
## Summer 0 1 0
## Winter 0 0 1
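The four filter()/rbind() steps above could be collapsed into a single mutate(); a sketch (bikeShare2_alt is a hypothetical name). Note that as.factor() on a character vector orders levels alphabetically, which is why Fall ends up as the reference level, and that the rbind() version also regroups the rows by season, which the train/test sampling later in this document depends on:

bikeShare2_alt <- bikeShare %>%
  mutate(seasonLab = as.factor(case_when(season == 1 ~ "Winter",
                                         season == 2 ~ "Spring",
                                         season == 3 ~ "Summer",
                                         season == 4 ~ "Fall")))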
mod2 <- lm(cnt~seasonLab, bikeShare2)
summary(mod2)
##
## Call:
## lm(formula = cnt ~ seasonLab, data = bikeShare2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4706.2 -1065.6 -181.3 1219.2 5231.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4728.2 117.6 40.213 < 2e-16 ***
## seasonLabSpring 264.2 164.9 1.602 0.11
## seasonLabSummer 916.1 164.1 5.584 3.31e-08 ***
## seasonLabWinter -2124.0 165.6 -12.827 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1569 on 727 degrees of freedom
## Multiple R-squared: 0.347, Adjusted R-squared: 0.3443
## F-statistic: 128.8 on 3 and 727 DF, p-value: < 2.2e-16
ggplot(bikeShare2, aes(x=seasonLab, y=cnt))+
geom_boxplot()+
theme_bw()
mod3 <- lm(cnt~temp + seasonLab, bikeShare2)
summary(mod3)
##
## Call:
## lm(formula = cnt ~ temp + seasonLab, data = bikeShare2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4812.9 -996.8 -271.3 1240.9 3881.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2088.7 244.0 8.559 < 2e-16 ***
## temp 6241.3 518.1 12.046 < 2e-16 ***
## seasonLabSpring -494.1 163.3 -3.026 0.00256 **
## seasonLabSummer -852.7 209.8 -4.064 5.35e-05 ***
## seasonLabWinter -1342.9 164.6 -8.159 1.49e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1433 on 726 degrees of freedom
## Multiple R-squared: 0.4558, Adjusted R-squared: 0.4528
## F-statistic: 152 on 4 and 726 DF, p-value: < 2.2e-16
ggplot(bikeShare2, aes(x = temp, y = cnt, color = seasonLab)) +
  geom_point() +
  # fall (reference level)
  geom_abline(intercept = mod3$coefficients[1], slope = mod3$coefficients[2], color = "tomato1") +
  # spring
  geom_abline(intercept = mod3$coefficients[1] + mod3$coefficients[3], slope = mod3$coefficients[2], color = "#66A61E") +
  # summer
  geom_abline(intercept = mod3$coefficients[1] + mod3$coefficients[4], slope = mod3$coefficients[2], color = "cyan3") +
  # winter
  geom_abline(intercept = mod3$coefficients[1] + mod3$coefficients[5], slope = mod3$coefficients[2], color = "mediumorchid") +
  theme_bw()
mod4 <- lm(cnt~temp*seasonLab, bikeShare2)
summary(mod4)
##
## Call:
## lm(formula = cnt ~ temp * seasonLab, data = bikeShare2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4813.3 -1016.3 -184.3 1221.7 3258.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2077.5 427.8 4.857 1.46e-06 ***
## temp 6267.8 980.3 6.394 2.90e-10 ***
## seasonLabSpring -675.1 637.7 -1.059 0.290121
## seasonLabSummer 4044.4 1115.7 3.625 0.000309 ***
## seasonLabWinter -2188.5 535.0 -4.091 4.78e-05 ***
## temp:seasonLabSpring 326.5 1295.9 0.252 0.801179
## temp:seasonLabSummer -6944.0 1751.6 -3.964 8.09e-05 ***
## temp:seasonLabWinter 2851.2 1414.9 2.015 0.044262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1406 on 723 degrees of freedom
## Multiple R-squared: 0.478, Adjusted R-squared: 0.473
## F-statistic: 94.59 on 7 and 723 DF, p-value: < 2.2e-16
ggplot(bikeShare2, aes(x = temp, y = cnt, color = seasonLab)) +
  geom_point() +
  # fall (reference level)
  geom_abline(intercept = mod4$coefficients[1], slope = mod4$coefficients[2], color = "tomato1") +
  # spring: both intercept and slope shift via the interaction terms
  geom_abline(intercept = mod4$coefficients[1] + mod4$coefficients[3], slope = mod4$coefficients[2] + mod4$coefficients[6], color = "#66A61E") +
  # summer
  geom_abline(intercept = mod4$coefficients[1] + mod4$coefficients[4], slope = mod4$coefficients[2] + mod4$coefficients[7], color = "cyan3") +
  # winter
  geom_abline(intercept = mod4$coefficients[1] + mod4$coefficients[5], slope = mod4$coefficients[2] + mod4$coefficients[8], color = "mediumorchid") +
  theme_bw()
## Step 7: Model Comparisons
Mean squared error (MSE) of the residuals for each model:
# mod1
mean(residuals(mod1)^2)
## [1] 2272020
# mod2
mean(mod2$residuals^2)
## [1] 2447250
# mod3
mean(mod3$residuals^2)
## [1] 2039615
# mod4
mean(mod4$residuals^2)
## [1] 1956178
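In-sample MSE always favors the more flexible model, so a penalized criterion is a useful cross-check; a sketch (not computed in the original):

# AIC penalizes parameter count; lower is better
AIC(mod1, mod2, mod3, mod4)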
Based on the model output from step 2, temperature does appear to have a strong relationship with total users; the relationship is statistically significant, with a very small p-value. In the plot of this model, the points cluster near the fitted regression line and trend in a positive direction. The model output from step 4 fits a linear model using a categorical predictor (season) instead of a numeric one (temperature). When each season acts as its own term, spring stands out as non-significant, with a much higher p-value than the other seasons; if I were to adjust this model to improve the fit, I would drop the spring category. When numeric and categorical variables are combined in a single model, spring again proves to be a relatively weak term, though it is more significant than in the model that used only the categorical predictor. Finally, in the final model, which accounts for the interaction between the numeric and categorical variables, the spring main effect and the temp:spring interaction are far from significant, and the temp:winter interaction is only marginally significant (p ≈ 0.044); these three terms have much higher p-values than the others.
Overall, the climatic variables appear to have a highly significant impact on the total number of users. That significance seems slightly weaker during seasons that are perhaps more variable in their weather patterns: summer, for example, remains highly significant throughout all the models, while the significance of spring is erratic across models. The relationship between climatic variables and total users needs further investigation, because some plots suggest that although a relationship is there, it is not necessarily linear.
In order to apply classification models to this project, I have chosen to dichotomize the response: a new variable flags whether the total number of users (cnt) exceeds 6,000. Temperature remains a key numeric predictor; the given data set description describes these temperature values as normalized temperature in Celsius, derived via $(t - t_{min})/(t_{max} - t_{min})$ with $t_{min} = -8$ and $t_{max} = +39$ (only in hourly scale).
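Given that formula, a normalized temperature can be mapped back to degrees Celsius, which helps when reading the tree's split points below; a small helper sketch (denorm_temp is my name, not from the original):

# invert the normalization: t = t_min + t_norm * (t_max - t_min)
denorm_temp <- function(t_norm, t_min = -8, t_max = 39) {
  t_min + t_norm * (t_max - t_min)
}
denorm_temp(0.4233335)  # the tree's first split point, about 11.9 degrees C
## [1] 11.89667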
# Tree model
attach(bikeShare2)
# new variable if total users (cnt) is greater than 6,000
High <- ifelse(cnt<=6000, "No", "Yes")
bikeShare3 <- data.frame(bikeShare2, High)%>%
select(-c("instant", "dteday", "registered", "casual"))
detach(bikeShare2)
attach(bikeShare3)
## The following object is masked _by_ .GlobalEnv:
##
## High
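The attach()/detach() calls are what produce the masking message above, and they are easy to trip over; an equivalent pipeline-style construction (a sketch; with this version High lives only inside bikeShare3, so later code would reference bikeShare3$High instead of the global High):

bikeShare3 <- bikeShare2 %>%
  mutate(High = ifelse(cnt <= 6000, "No", "Yes")) %>%
  select(-c(instant, dteday, registered, casual))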
# visualization
ggplot(data=bikeShare3, aes(x=temp, color=High))+
geom_density()+
theme_bw()
# test and train
set.seed(2)
dim(bikeShare3)
## [1] 731 14
# split roughly in half: 360 of the 731 rows for training, the rest for testing
train <- sample(1:nrow(bikeShare3), 360)
library(rpart)
# using method = "class" for classification
tree.bikeShare2 <- rpart(High~seasonLab+holiday+weekday+workingday+
weathersit+temp+hum+windspeed, data = bikeShare3,
subset = train,
method = "class")
summary(tree.bikeShare2)
## Call:
## rpart(formula = High ~ seasonLab + holiday + weekday + workingday +
## weathersit + temp + hum + windspeed, data = bikeShare3, subset = train,
## method = "class")
## n= 360
##
## CP nsplit rel error xerror xstd
## 1 0.04868914 0 1.0000000 1.000000 0.09196835
## 2 0.03370787 8 0.5955056 1.089888 0.09458505
## 3 0.02247191 9 0.5617978 1.101124 0.09489043
## 4 0.01123596 11 0.5168539 1.146067 0.09606577
## 5 0.01000000 12 0.5056180 1.123596 0.09548726
##
## Variable importance
## temp seasonLab hum windspeed workingday weekday weathersit
## 39 23 20 9 4 3 2
## holiday
## 1
##
## Node number 1: 360 observations, complexity param=0.04868914
## predicted class=No expected loss=0.2472222 P(node) =1
## class counts: 271 89
## probabilities: 0.753 0.247
## left son=2 (133 obs) right son=3 (227 obs)
## Primary splits:
## temp < 0.4233335 to the left, improve=24.238560, (0 missing)
## seasonLab splits as RRRL, improve=11.731060, (0 missing)
## hum < 0.741875 to the right, improve= 6.678023, (0 missing)
## windspeed < 0.286808 to the right, improve= 2.938739, (0 missing)
## weathersit < 1.5 to the right, improve= 2.364121, (0 missing)
## Surrogate splits:
## seasonLab splits as LRRL, agree=0.825, adj=0.526, (0 split)
## windspeed < 0.0628229 to the left, agree=0.639, adj=0.023, (0 split)
## hum < 0.943913 to the right, agree=0.636, adj=0.015, (0 split)
##
## Node number 2: 133 observations
## predicted class=No expected loss=0.007518797 P(node) =0.3694444
## class counts: 132 1
## probabilities: 0.992 0.008
##
## Node number 3: 227 observations, complexity param=0.04868914
## predicted class=No expected loss=0.3876652 P(node) =0.6305556
## class counts: 139 88
## probabilities: 0.612 0.388
## left son=6 (52 obs) right son=7 (175 obs)
## Primary splits:
## hum < 0.741875 to the right, improve=11.463890, (0 missing)
## windspeed < 0.286808 to the right, improve= 3.076912, (0 missing)
## temp < 0.52125 to the left, improve= 2.307238, (0 missing)
## weathersit < 1.5 to the right, improve= 2.203131, (0 missing)
## seasonLab splits as RRRL, improve= 1.433514, (0 missing)
## Surrogate splits:
## weathersit < 2.5 to the right, agree=0.793, adj=0.096, (0 split)
## temp < 0.4416665 to the left, agree=0.775, adj=0.019, (0 split)
##
## Node number 6: 52 observations
## predicted class=No expected loss=0.09615385 P(node) =0.1444444
## class counts: 47 5
## probabilities: 0.904 0.096
##
## Node number 7: 175 observations, complexity param=0.04868914
## predicted class=No expected loss=0.4742857 P(node) =0.4861111
## class counts: 92 83
## probabilities: 0.526 0.474
## left son=14 (17 obs) right son=15 (158 obs)
## Primary splits:
## windspeed < 0.286808 to the right, improve=3.3400530, (0 missing)
## temp < 0.5208335 to the left, improve=2.2400000, (0 missing)
## seasonLab splits as RRRL, improve=2.0455180, (0 missing)
## hum < 0.7339585 to the left, improve=1.0806930, (0 missing)
## workingday < 0.5 to the right, improve=0.6366874, (0 missing)
##
## Node number 14: 17 observations
## predicted class=No expected loss=0.1764706 P(node) =0.04722222
## class counts: 14 3
## probabilities: 0.824 0.176
##
## Node number 15: 158 observations, complexity param=0.04868914
## predicted class=Yes expected loss=0.4936709 P(node) =0.4388889
## class counts: 78 80
## probabilities: 0.494 0.506
## left son=30 (7 obs) right son=31 (151 obs)
## Primary splits:
## temp < 0.79125 to the right, improve=1.9353080, (0 missing)
## hum < 0.7339585 to the left, improve=1.4064320, (0 missing)
## seasonLab splits as RRLL, improve=1.2819600, (0 missing)
## windspeed < 0.1669835 to the right, improve=1.2819600, (0 missing)
## workingday < 0.5 to the right, improve=0.7352409, (0 missing)
##
## Node number 30: 7 observations
## predicted class=No expected loss=0.1428571 P(node) =0.01944444
## class counts: 6 1
## probabilities: 0.857 0.143
##
## Node number 31: 151 observations, complexity param=0.04868914
## predicted class=Yes expected loss=0.4768212 P(node) =0.4194444
## class counts: 72 79
## probabilities: 0.477 0.523
## left son=62 (63 obs) right son=63 (88 obs)
## Primary splits:
## windspeed < 0.1822145 to the right, improve=1.9351510, (0 missing)
## temp < 0.5208335 to the left, improve=1.8952400, (0 missing)
## hum < 0.5395835 to the right, improve=1.6667130, (0 missing)
## workingday < 0.5 to the right, improve=0.8447414, (0 missing)
## seasonLab splits as RRLL, improve=0.7496782, (0 missing)
## Surrogate splits:
## hum < 0.510625 to the left, agree=0.636, adj=0.127, (0 split)
## seasonLab splits as RRRL, agree=0.609, adj=0.063, (0 split)
## weekday < 5.5 to the right, agree=0.596, adj=0.032, (0 split)
## temp < 0.4391665 to the left, agree=0.596, adj=0.032, (0 split)
##
## Node number 62: 63 observations, complexity param=0.04868914
## predicted class=No expected loss=0.4285714 P(node) =0.175
## class counts: 36 27
## probabilities: 0.571 0.429
## left son=124 (31 obs) right son=125 (32 obs)
## Primary splits:
## seasonLab splits as RRLL, improve=6.742224, (0 missing)
## temp < 0.665 to the right, improve=5.772886, (0 missing)
## windspeed < 0.2092645 to the left, improve=3.061794, (0 missing)
## workingday < 0.5 to the right, improve=2.991758, (0 missing)
## weekday < 1.5 to the right, improve=1.679365, (0 missing)
## Surrogate splits:
## temp < 0.6483335 to the right, agree=0.778, adj=0.548, (0 split)
## windspeed < 0.190915 to the left, agree=0.603, adj=0.194, (0 split)
## workingday < 0.5 to the right, agree=0.587, adj=0.161, (0 split)
## weekday < 2.5 to the right, agree=0.571, adj=0.129, (0 split)
## hum < 0.638125 to the left, agree=0.571, adj=0.129, (0 split)
##
## Node number 63: 88 observations, complexity param=0.04868914
## predicted class=Yes expected loss=0.4090909 P(node) =0.2444444
## class counts: 36 52
## probabilities: 0.409 0.591
## left son=126 (57 obs) right son=127 (31 obs)
## Primary splits:
## temp < 0.7 to the left, improve=4.4469830, (0 missing)
## hum < 0.5395835 to the right, improve=2.6597400, (0 missing)
## windspeed < 0.08365415 to the left, improve=2.0454550, (0 missing)
## seasonLab splits as LLRR, improve=1.0570820, (0 missing)
## weekday < 1.5 to the left, improve=0.4694756, (0 missing)
## Surrogate splits:
## seasonLab splits as LLRL, agree=0.807, adj=0.452, (0 split)
## windspeed < 0.164185 to the left, agree=0.659, adj=0.032, (0 split)
##
## Node number 124: 31 observations
## predicted class=No expected loss=0.1935484 P(node) =0.08611111
## class counts: 25 6
## probabilities: 0.806 0.194
##
## Node number 125: 32 observations, complexity param=0.01123596
## predicted class=Yes expected loss=0.34375 P(node) =0.08888889
## class counts: 11 21
## probabilities: 0.344 0.656
## left son=250 (17 obs) right son=251 (15 obs)
## Primary splits:
## workingday < 0.5 to the right, improve=2.5002450, (0 missing)
## hum < 0.6766665 to the right, improve=1.9102270, (0 missing)
## weekday < 1.5 to the right, improve=1.7284090, (0 missing)
## temp < 0.5158335 to the left, improve=1.6875000, (0 missing)
## windspeed < 0.2431625 to the left, improve=0.8790584, (0 missing)
## Surrogate splits:
## weekday < 5.5 to the left, agree=0.750, adj=0.467, (0 split)
## windspeed < 0.1980815 to the right, agree=0.656, adj=0.267, (0 split)
## weathersit < 1.5 to the right, agree=0.625, adj=0.200, (0 split)
## hum < 0.5875 to the right, agree=0.625, adj=0.200, (0 split)
## holiday < 0.5 to the left, agree=0.594, adj=0.133, (0 split)
##
## Node number 126: 57 observations, complexity param=0.04868914
## predicted class=No expected loss=0.4736842 P(node) =0.1583333
## class counts: 30 27
## probabilities: 0.526 0.474
## left son=252 (48 obs) right son=253 (9 obs)
## Primary splits:
## hum < 0.510208 to the right, improve=3.6849420, (0 missing)
## temp < 0.638333 to the right, improve=2.2259310, (0 missing)
## windspeed < 0.08365415 to the left, improve=1.7467670, (0 missing)
## seasonLab splits as RRLR, improve=0.8020050, (0 missing)
## weekday < 1.5 to the left, improve=0.5718463, (0 missing)
## Surrogate splits:
## temp < 0.4491665 to the right, agree=0.877, adj=0.222, (0 split)
## windspeed < 0.179421 to the left, agree=0.877, adj=0.222, (0 split)
## seasonLab splits as LLLR, agree=0.860, adj=0.111, (0 split)
##
## Node number 127: 31 observations
## predicted class=Yes expected loss=0.1935484 P(node) =0.08611111
## class counts: 6 25
## probabilities: 0.194 0.806
##
## Node number 250: 17 observations
## predicted class=No expected loss=0.4705882 P(node) =0.04722222
## class counts: 9 8
## probabilities: 0.529 0.471
##
## Node number 251: 15 observations
## predicted class=Yes expected loss=0.1333333 P(node) =0.04166667
## class counts: 2 13
## probabilities: 0.133 0.867
##
## Node number 252: 48 observations, complexity param=0.03370787
## predicted class=No expected loss=0.3958333 P(node) =0.1333333
## class counts: 29 19
## probabilities: 0.604 0.396
## left son=504 (41 obs) right son=505 (7 obs)
## Primary splits:
## hum < 0.730625 to the left, improve=1.6621660, (0 missing)
## temp < 0.5425 to the left, improve=1.4583330, (0 missing)
## windspeed < 0.0867583 to the left, improve=1.0489260, (0 missing)
## seasonLab splits as RLL-, improve=0.5208333, (0 missing)
## weathersit < 1.5 to the left, improve=0.4289216, (0 missing)
## Surrogate splits:
## holiday < 0.5 to the left, agree=0.896, adj=0.286, (0 split)
##
## Node number 253: 9 observations
## predicted class=Yes expected loss=0.1111111 P(node) =0.025
## class counts: 1 8
## probabilities: 0.111 0.889
##
## Node number 504: 41 observations, complexity param=0.02247191
## predicted class=No expected loss=0.3414634 P(node) =0.1138889
## class counts: 27 14
## probabilities: 0.659 0.341
## left son=1008 (19 obs) right son=1009 (22 obs)
## Primary splits:
## temp < 0.5425 to the left, improve=1.2141440, (0 missing)
## seasonLab splits as RLL-, improve=1.0686540, (0 missing)
## hum < 0.695833 to the right, improve=0.9314486, (0 missing)
## windspeed < 0.1508065 to the right, improve=0.7662971, (0 missing)
## weekday < 1.5 to the left, improve=0.4390244, (0 missing)
## Surrogate splits:
## seasonLab splits as LLR-, agree=0.707, adj=0.368, (0 split)
## hum < 0.594167 to the left, agree=0.707, adj=0.368, (0 split)
## windspeed < 0.146773 to the right, agree=0.659, adj=0.263, (0 split)
## weekday < 4.5 to the right, agree=0.610, adj=0.158, (0 split)
## weathersit < 1.5 to the left, agree=0.585, adj=0.105, (0 split)
##
## Node number 505: 7 observations
## predicted class=Yes expected loss=0.2857143 P(node) =0.01944444
## class counts: 2 5
## probabilities: 0.286 0.714
##
## Node number 1008: 19 observations
## predicted class=No expected loss=0.2105263 P(node) =0.05277778
## class counts: 15 4
## probabilities: 0.789 0.211
##
## Node number 1009: 22 observations, complexity param=0.02247191
## predicted class=No expected loss=0.4545455 P(node) =0.06111111
## class counts: 12 10
## probabilities: 0.545 0.455
## left son=2018 (12 obs) right son=2019 (10 obs)
## Primary splits:
## temp < 0.6504165 to the right, improve=2.2090910, (0 missing)
## weekday < 1.5 to the left, improve=1.6441340, (0 missing)
## seasonLab splits as RRL-, improve=1.4545450, (0 missing)
## hum < 0.6470835 to the right, improve=0.3636364, (0 missing)
## workingday < 0.5 to the left, improve=0.1590909, (0 missing)
## Surrogate splits:
## seasonLab splits as RLL-, agree=0.682, adj=0.3, (0 split)
## hum < 0.6554165 to the right, agree=0.636, adj=0.2, (0 split)
## weekday < 3.5 to the left, agree=0.591, adj=0.1, (0 split)
## windspeed < 0.0923583 to the right, agree=0.591, adj=0.1, (0 split)
##
## Node number 2018: 12 observations
## predicted class=No expected loss=0.25 P(node) =0.03333333
## class counts: 9 3
## probabilities: 0.750 0.250
##
## Node number 2019: 10 observations
## predicted class=Yes expected loss=0.3 P(node) =0.02777778
## class counts: 3 7
## probabilities: 0.300 0.700
# plotting the tree
# install.packages("rpart.plot")
library(rpart.plot)
prp(tree.bikeShare2, faclen = 0, cex = 0.7, extra = 1, space = 0.5, main = "To Rent a Bike or Not?")
Description: This tree model essentially describes the conditions that might lead to a person's decision to rent or not rent a bike, and the variables that influence this decision most are weather variables. The model predicts that if the temperature is too low, a person is unlikely to rent. Even if the temperature is high, high humidity makes renting less likely. If the temperature is high and the humidity low, a strong wind or excessive heat can still deter riders. If the wind speed is low, people are still less likely to rent in summer or winter, and within those seasons the likelihood drops further on a working day. This is how one would read the tree model above.
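A standard follow-up (not done in the original) is to prune the tree at the complexity parameter with the lowest cross-validated error. Notably, in the cptable printed above, xerror is smallest at zero splits, a hint that this tree may not generalize well; a sketch:

# prune to the CP value minimizing cross-validated error (xerror)
best.cp <- tree.bikeShare2$cptable[which.min(tree.bikeShare2$cptable[, "xerror"]), "CP"]
tree.pruned <- prune(tree.bikeShare2, cp = best.cp)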
# this is what we actually observe (testing set)
high.test<-High[-train]
# this is what we predict for the testing set using the training model
tree.pred<-predict(tree.bikeShare2, newdata = bikeShare3, type="class")
tree.test<-tree.pred[-train]
# the confusion matrix matches these up to see how well we did
cm<-table(tree.test, high.test)
cm
## high.test
## tree.test No Yes
## No 245 49
## Yes 35 42
sum(diag(cm))/sum(cm)
## [1] 0.7735849
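Two follow-up numbers fall directly out of the same confusion matrix (a sketch): the test misclassification rate is 1 minus the accuracy, and the sensitivity shows the model catches fewer than half of the true high-use days.

1 - sum(diag(cm))/sum(cm)  # test misclassification rate
## [1] 0.2264151
cm["Yes", "Yes"]/sum(cm[, "Yes"])  # sensitivity for the "Yes" class
## [1] 0.4615385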