Project Milestone 2

Loading the Bike Share CSV

# read the daily bike share records
bikeShare <- read.csv("day.csv",
                        header=TRUE)
# View(bikeShare)
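
A quick structural check helps confirm the file loaded as expected; a minimal sketch (assuming day.csv is the daily file from the UCI Bike Sharing Dataset):

# basic sanity checks on the loaded data
str(bikeShare)          # column names and types
summary(bikeShare$cnt)  # distribution of total users
anyNA(bikeShare)        # confirm there are no missing values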

Data Set Introduction

This data set contains bike share data. As someone who strives to reduce my own carbon footprint, and as someone who grew up in Seattle, where a citywide bike share program has taken hold, this data set is of particular interest to me. Bike share programs have great potential to benefit cities in numerous ways, including reduced carbon emissions, reduced traffic congestion, and increased accessibility of public transportation. I remember that a few years back, when Seattle first implemented its bike share program, many were skeptical about whether it would take hold in the community. In that first year, one unforeseen issue arose: because Seattle is notoriously hilly, most users only rode the bikes downhill. The city then had to expend time, money, and resources to drive the bikes back up to the hilltops where demand was highest, which likely offset, if not negated, any reduction in carbon emissions the program initially created. This example illustrates why it is necessary to understand the variables that might affect the popularity of such a program, and the magnitude of their effects. This particular data set covers a bike share program in Washington, D.C., but the findings of this exploration can likely be extrapolated to bike share programs all over the world. Three questions arose after an initial glimpse at the data:

1. What is the seasonal effect on the total number of users?
2. Do some aspects of the local climate (temperature, humidity, wind speed) have a greater impact on the total number of users than others?
3. How do the numbers of registered vs. casual users differ across seasons?

Graphics

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# total users vs. temperature, colored by humidity
ggplot(bikeShare, aes(temp, cnt, color = hum))+
  geom_point()+
  theme_bw()

# total users by weather situation
ggplot(bikeShare, aes(as.factor(weathersit), cnt, fill = as.factor(weathersit)))+
  geom_boxplot()+
  theme_bw()

# registered users by season
ggplot(bikeShare, aes(as.factor(season), registered, fill = as.factor(season)))+
  geom_boxplot()+
  theme_bw()

# day-of-week composition of each season (proportion of days)
ggplot(bikeShare, aes(x = as.factor(season), fill = as.factor(weekday))) +
  geom_bar(position = "fill")+
  theme_bw()

# number of days with each weather situation, by season
ggplot(bikeShare, aes(x = as.factor(season), fill = as.factor(weathersit))) +
  geom_bar(position = "dodge")+
  theme_bw()

Variable Relationships

The variables used in these graphics are mostly climatic variables (season, weather situation, temperature, and humidity), along with day of the week, and their relationships with the number of users, be that total, registered, or casual users. In the first plot, there appears to be a clear but nonlinear relationship between temperature and total users. The data appear to group into two arcs, a trend that will be worth exploring down the line. Additionally, weather situation and season seem to have strong relationships with the number of users, as shown by the two box plots. Warmer seasons and warmer weather appear highly correlated with more bike share users.
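
As a rough numeric companion to these plots, pairwise correlations give a first-pass ranking of the climate variables (they measure linear association only, so they will understate the arced temperature pattern noted above); a minimal sketch:

# linear correlations between total users and each climate variable
cor(bikeShare[, c("cnt", "temp", "hum", "windspeed")])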

Project Milestone 3

Step 1: Determining a Fitted Model

mod <- lm(cnt~temp, bikeShare)
summary(mod)
## 
## Call:
## lm(formula = cnt ~ temp, data = bikeShare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4615.3 -1134.9  -104.4  1044.3  3737.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1214.6      161.2   7.537 1.43e-13 ***
## temp          6640.7      305.2  21.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929 
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16

Step 2: Test for Slope

Reference Distribution: t distribution
Degrees of Freedom: 729
Test Statistic: t = 21.759 (equivalently, F = t² = 473.5)
P-value: < 2.2e-16
Conclusion: We reject the null hypothesis (p < 2.2e-16) at the α = 0.05 significance level. There is strong evidence of a linear relationship between temperature and total bike share users (i.e., the slope parameter is not equal to zero).
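
A confidence interval for the slope tells the same story; a minimal check:

# 95% CI for the temperature slope; an interval excluding 0 is
# consistent with rejecting H0: slope = 0
confint(mod, "temp", level = 0.95)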

Step 3: ANOVA Table

aov.out = aov(cnt~temp, bikeShare)
summary(aov.out)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## temp          1 1.079e+09 1.079e+09   473.5 <2e-16 ***
## Residuals   729 1.661e+09 2.278e+06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
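
With a single predictor, the ANOVA F statistic is simply the squared slope t statistic; a quick check of that identity:

# anova() on the fitted lm reproduces the table above
anova(mod)
# F = t^2 for a one-predictor model: 21.759^2 is approximately 473.5
summary(mod)$coefficients["temp", "t value"]^2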

Step 4: Diagnostic Plots

plot(mod)
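
By default, plot(mod) pages through the four diagnostic plots one at a time; showing them in a single panel is a small convenience, sketched below:

# display all four diagnostic plots in one 2x2 panel
par(mfrow = c(2, 2))
plot(mod)
par(mfrow = c(1, 1))  # reset the plotting layout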

Step 5: Summary of Findings

Based on the residuals vs. fitted values plot, the points are distributed fairly evenly above and below zero (the mean of the residuals). Additionally, the vertical spread of the points does not appear to increase or decrease as you move horizontally across the plot (there is no fan shape). These characteristics indicate that the model fits our data relatively well. In the normal Q-Q plot, most of the points lie on or very near the diagonal line, with some deviation at the tails; since ideally all points would fall on the line and most of them do, the residuals appear approximately normal. In the residuals vs. leverage plot, a handful of points have higher leverage than the rest, although no obviously influential outliers stand out, in part because the data points are relatively spread out rather than tightly clustered on the left side of the plot.

Project Milestone 4

Step 1: Variable Identification

Response variable: total users (cnt)
Categorical predictor: season
Numeric predictor: temperature (temp)

Step 2: Fitted Model w/ Numeric Predictor

mod1 <- lm(cnt~temp, bikeShare)
summary(mod1)
## 
## Call:
## lm(formula = cnt ~ temp, data = bikeShare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4615.3 -1134.9  -104.4  1044.3  3737.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1214.6      161.2   7.537 1.43e-13 ***
## temp          6640.7      305.2  21.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929 
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
ggplot(bikeShare, aes(x=temp, y=cnt))+
  geom_point()+
  geom_abline(intercept = mod1$coefficients[1], slope=mod1$coefficients[2])+
  theme_bw()

Step 3: Dummy Variable

bikeShare$season<-as.factor(bikeShare$season)
contrasts(bikeShare$season)
##   2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1

Releveling

# winter
winter<-bikeShare%>%
  filter(season==1)%>%
  mutate(seasonLab="Winter")

# spring
spring<-bikeShare%>%
  filter(season==2)%>%
  mutate(seasonLab="Spring")

# summer
summer<-bikeShare%>%
  filter(season==3)%>%
  mutate(seasonLab="Summer")

#fall
fall<-bikeShare%>%
  filter(season==4)%>%
  mutate(seasonLab="Fall")

bikeShare2<-rbind(winter, spring, summer, fall)
bikeShare2$seasonLab<-as.factor(bikeShare2$seasonLab)
contrasts(bikeShare2$seasonLab)
##        Spring Summer Winter
## Fall        0      0      0
## Spring      1      0      0
## Summer      0      1      0
## Winter      0      0      1
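
The filter-and-rbind approach works, but the same labeling and releveling can be done in one step; a minimal alternative sketch (bikeShare2_alt is a hypothetical name, assuming the same 1 = Winter through 4 = Fall coding used above):

# equivalent one-step releveling without splitting the data frame
bikeShare2_alt <- bikeShare %>%
  mutate(seasonLab = relevel(factor(season, levels = c(1, 2, 3, 4),
                                    labels = c("Winter", "Spring", "Summer", "Fall")),
                             ref = "Fall"))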

Step 4: Fitted Model w/ Categorical Predictor

mod2 <- lm(cnt~seasonLab, bikeShare2)
summary(mod2)
## 
## Call:
## lm(formula = cnt ~ seasonLab, data = bikeShare2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4706.2 -1065.6  -181.3  1219.2  5231.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4728.2      117.6  40.213  < 2e-16 ***
## seasonLabSpring    264.2      164.9   1.602     0.11    
## seasonLabSummer    916.1      164.1   5.584 3.31e-08 ***
## seasonLabWinter  -2124.0      165.6 -12.827  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1569 on 727 degrees of freedom
## Multiple R-squared:  0.347,  Adjusted R-squared:  0.3443 
## F-statistic: 128.8 on 3 and 727 DF,  p-value: < 2.2e-16
ggplot(bikeShare2, aes(x=seasonLab, y=cnt))+
  geom_boxplot()+
  theme_bw()
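
Since the only predictor is categorical, the fitted values are just the seasonal means; a quick check that the coefficients recover them:

# observed seasonal means
tapply(bikeShare2$cnt, bikeShare2$seasonLab, mean)
# means implied by the model: the intercept (Fall) plus each season's offset
coef(mod2)[1] + c(Fall = 0, coef(mod2)[-1])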

Step 5: Combining Steps 2 & 4

mod3 <- lm(cnt~temp + seasonLab, bikeShare2)
summary(mod3)
## 
## Call:
## lm(formula = cnt ~ temp + seasonLab, data = bikeShare2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4812.9  -996.8  -271.3  1240.9  3881.1 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2088.7      244.0   8.559  < 2e-16 ***
## temp              6241.3      518.1  12.046  < 2e-16 ***
## seasonLabSpring   -494.1      163.3  -3.026  0.00256 ** 
## seasonLabSummer   -852.7      209.8  -4.064 5.35e-05 ***
## seasonLabWinter  -1342.9      164.6  -8.159 1.49e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1433 on 726 degrees of freedom
## Multiple R-squared:  0.4558, Adjusted R-squared:  0.4528 
## F-statistic:   152 on 4 and 726 DF,  p-value: < 2.2e-16
ggplot(bikeShare2, aes(x=temp, y=cnt, color=seasonLab))+
  geom_point()+
  # FALL (reference)
  geom_abline(intercept = mod3$coefficients[1], slope=mod3$coefficients[2], color = "tomato1")+
  # SPRING
  geom_abline(intercept = mod3$coefficients[1]+mod3$coefficients[3], slope=mod3$coefficients[2], color = "#66A61E")+
  # summer
  geom_abline(intercept = mod3$coefficients[1]+mod3$coefficients[4], slope=mod3$coefficients[2], color = "cyan3")+
  # winter
  geom_abline(intercept = mod3$coefficients[1]+mod3$coefficients[5], slope=mod3$coefficients[2], color = "mediumorchid")+
  theme_bw()
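
An alternative to assembling each intercept by hand is to let predict() generate the parallel fits; a minimal sketch (fit3 is a hypothetical column name):

# draw the four parallel regression lines from the model's own predictions
bikeShare2$fit3 <- predict(mod3)
ggplot(bikeShare2, aes(x = temp, y = cnt, color = seasonLab)) +
  geom_point() +
  geom_line(aes(y = fit3)) +
  theme_bw()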

Step 6: Variable Interactions

mod4 <- lm(cnt~temp*seasonLab, bikeShare2)
summary(mod4)
## 
## Call:
## lm(formula = cnt ~ temp * seasonLab, data = bikeShare2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4813.3 -1016.3  -184.3  1221.7  3258.3 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2077.5      427.8   4.857 1.46e-06 ***
## temp                   6267.8      980.3   6.394 2.90e-10 ***
## seasonLabSpring        -675.1      637.7  -1.059 0.290121    
## seasonLabSummer        4044.4     1115.7   3.625 0.000309 ***
## seasonLabWinter       -2188.5      535.0  -4.091 4.78e-05 ***
## temp:seasonLabSpring    326.5     1295.9   0.252 0.801179    
## temp:seasonLabSummer  -6944.0     1751.6  -3.964 8.09e-05 ***
## temp:seasonLabWinter   2851.2     1414.9   2.015 0.044262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1406 on 723 degrees of freedom
## Multiple R-squared:  0.478,  Adjusted R-squared:  0.473 
## F-statistic: 94.59 on 7 and 723 DF,  p-value: < 2.2e-16
ggplot(bikeShare2, aes(x=temp, y=cnt, color=seasonLab))+
  geom_point()+
  # fall
  geom_abline(intercept = mod4$coefficients[1], slope=mod4$coefficients[2], color="tomato1")+
  # spring
  geom_abline(intercept = mod4$coefficients[1]+mod4$coefficients[3], slope=mod4$coefficients[2]+mod4$coefficients[6], color="#66A61E")+
  # summer
  geom_abline(intercept = mod4$coefficients[1]+mod4$coefficients[4], slope=mod4$coefficients[2]+mod4$coefficients[7], color = "cyan3")+
  # winter
  geom_abline(intercept = mod4$coefficients[1]+mod4$coefficients[5], slope=mod4$coefficients[2]+mod4$coefficients[8], color = "mediumorchid")+
  theme_bw()

Step 7: Model Comparisons

Mean Squared Error

# mod1
mean(residuals(mod1)^2)
## [1] 2272020
# mod2
mean(residuals(mod2)^2)
## [1] 2447250
# mod3
mean(residuals(mod3)^2)
## [1] 2039615
# mod4
mean(residuals(mod4)^2)
## [1] 1956178
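
Training MSE always favors the model with more terms, so a nested F test is a useful complement; a minimal sketch comparing the additive and interaction fits:

# does the temp-by-season interaction significantly improve on mod3?
anova(mod3, mod4)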

Based on the model output from Step 2, temperature does appear to have a strong, statistically significant relationship with total users, and accordingly our p-value is very low. In the plot of this model, the points cluster near the fitted regression line and trend in a positive direction. The model output from Step 4 fits a linear model to the data using a categorical predictor (season) instead of a numeric one (temperature). When each season acts as its own dummy variable, spring stands out as not statistically significant, with a very high p-value compared to the other seasons. If I were to adjust this model to improve the fit, I would take out the spring category (collapsing it into the reference level). Again, when combining the numeric and categorical variables in a single model, spring proves to be a relatively weak term, though it is more significant than in the previous, season-only model. Finally, in the interaction model, the spring main effect and the spring and winter interaction terms proved far less significant than the other terms, with much higher p-values. The mean squared errors computed above bear this out: adding season to temperature (mod3) lowers the MSE relative to either single-predictor model, and the interaction model (mod4) achieves the lowest MSE of all.

Step 8: Conclusions

It seems that climatic variables have a highly significant impact on the total number of users. This significance appears somewhat weaker in seasons with more variable weather patterns: for example, summer remains highly significant across all the models, while the significance of spring is erratic from model to model. The relationship between climatic variables and total users warrants further investigation, because some plots suggest that although a relationship exists, it is not necessarily linear.

Project Milestone 5

Step 1:

In order to apply classification models to this project, I have chosen to dichotomize the numeric response variable, total users (cnt), into a binary high/low indicator (constructed in Step 2 below). Temperature remains a numeric predictor; the data set description defines it as normalized temperature in Celsius, derived via (t - t_min)/(t_max - t_min) with t_min = -8 and t_max = +39 (only in hourly scale).
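
For reference, the normalization can be inverted to recover raw degrees Celsius; a minimal sketch using the stated bounds (tempC is a hypothetical column name):

# invert the normalization: t = temp * (t_max - t_min) + t_min
bikeShare2$tempC <- bikeShare2$temp * (39 - (-8)) + (-8)
range(bikeShare2$tempC)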

Step 2:

# Tree model
attach(bikeShare2)

# High = "Yes" when total users (cnt) exceeds 6,000, otherwise "No"
High <- ifelse(cnt<=6000, "No", "Yes")

bikeShare3 <- data.frame(bikeShare2, High)%>%
  select(-c("instant", "dteday", "registered", "casual"))
detach(bikeShare2)
attach(bikeShare3)
## The following object is masked _by_ .GlobalEnv:
## 
##     High
# visualization
ggplot(data=bikeShare3, aes(x=temp, color=High))+
  geom_density()+
  theme_bw()

# test and train
set.seed(2)
dim(bikeShare3)
## [1] 731  14
# training set: 360 of the 731 days (roughly half)
train <- sample(1:nrow(bikeShare3), 360)
library(rpart)
# using method = "class" for classification
tree.bikeShare2 <- rpart(High~seasonLab+holiday+weekday+workingday+
                           weathersit+temp+hum+windspeed, data = bikeShare3,
                         subset = train, 
                         method = "class")
summary(tree.bikeShare2)
## Call:
## rpart(formula = High ~ seasonLab + holiday + weekday + workingday + 
##     weathersit + temp + hum + windspeed, data = bikeShare3, subset = train, 
##     method = "class")
##   n= 360 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.04868914      0 1.0000000 1.000000 0.09196835
## 2 0.03370787      8 0.5955056 1.089888 0.09458505
## 3 0.02247191      9 0.5617978 1.101124 0.09489043
## 4 0.01123596     11 0.5168539 1.146067 0.09606577
## 5 0.01000000     12 0.5056180 1.123596 0.09548726
## 
## Variable importance
##       temp  seasonLab        hum  windspeed workingday    weekday weathersit 
##         39         23         20          9          4          3          2 
##    holiday 
##          1 
## 
## Node number 1: 360 observations,    complexity param=0.04868914
##   predicted class=No   expected loss=0.2472222  P(node) =1
##     class counts:   271    89
##    probabilities: 0.753 0.247 
##   left son=2 (133 obs) right son=3 (227 obs)
##   Primary splits:
##       temp       < 0.4233335  to the left,  improve=24.238560, (0 missing)
##       seasonLab  splits as  RRRL, improve=11.731060, (0 missing)
##       hum        < 0.741875   to the right, improve= 6.678023, (0 missing)
##       windspeed  < 0.286808   to the right, improve= 2.938739, (0 missing)
##       weathersit < 1.5        to the right, improve= 2.364121, (0 missing)
##   Surrogate splits:
##       seasonLab splits as  LRRL, agree=0.825, adj=0.526, (0 split)
##       windspeed < 0.0628229  to the left,  agree=0.639, adj=0.023, (0 split)
##       hum       < 0.943913   to the right, agree=0.636, adj=0.015, (0 split)
## 
## Node number 2: 133 observations
##   predicted class=No   expected loss=0.007518797  P(node) =0.3694444
##     class counts:   132     1
##    probabilities: 0.992 0.008 
## 
## Node number 3: 227 observations,    complexity param=0.04868914
##   predicted class=No   expected loss=0.3876652  P(node) =0.6305556
##     class counts:   139    88
##    probabilities: 0.612 0.388 
##   left son=6 (52 obs) right son=7 (175 obs)
##   Primary splits:
##       hum        < 0.741875   to the right, improve=11.463890, (0 missing)
##       windspeed  < 0.286808   to the right, improve= 3.076912, (0 missing)
##       temp       < 0.52125    to the left,  improve= 2.307238, (0 missing)
##       weathersit < 1.5        to the right, improve= 2.203131, (0 missing)
##       seasonLab  splits as  RRRL, improve= 1.433514, (0 missing)
##   Surrogate splits:
##       weathersit < 2.5        to the right, agree=0.793, adj=0.096, (0 split)
##       temp       < 0.4416665  to the left,  agree=0.775, adj=0.019, (0 split)
## 
## Node number 6: 52 observations
##   predicted class=No   expected loss=0.09615385  P(node) =0.1444444
##     class counts:    47     5
##    probabilities: 0.904 0.096 
## 
## Node number 7: 175 observations,    complexity param=0.04868914
##   predicted class=No   expected loss=0.4742857  P(node) =0.4861111
##     class counts:    92    83
##    probabilities: 0.526 0.474 
##   left son=14 (17 obs) right son=15 (158 obs)
##   Primary splits:
##       windspeed  < 0.286808   to the right, improve=3.3400530, (0 missing)
##       temp       < 0.5208335  to the left,  improve=2.2400000, (0 missing)
##       seasonLab  splits as  RRRL, improve=2.0455180, (0 missing)
##       hum        < 0.7339585  to the left,  improve=1.0806930, (0 missing)
##       workingday < 0.5        to the right, improve=0.6366874, (0 missing)
## 
## Node number 14: 17 observations
##   predicted class=No   expected loss=0.1764706  P(node) =0.04722222
##     class counts:    14     3
##    probabilities: 0.824 0.176 
## 
## Node number 15: 158 observations,    complexity param=0.04868914
##   predicted class=Yes  expected loss=0.4936709  P(node) =0.4388889
##     class counts:    78    80
##    probabilities: 0.494 0.506 
##   left son=30 (7 obs) right son=31 (151 obs)
##   Primary splits:
##       temp       < 0.79125    to the right, improve=1.9353080, (0 missing)
##       hum        < 0.7339585  to the left,  improve=1.4064320, (0 missing)
##       seasonLab  splits as  RRLL, improve=1.2819600, (0 missing)
##       windspeed  < 0.1669835  to the right, improve=1.2819600, (0 missing)
##       workingday < 0.5        to the right, improve=0.7352409, (0 missing)
## 
## Node number 30: 7 observations
##   predicted class=No   expected loss=0.1428571  P(node) =0.01944444
##     class counts:     6     1
##    probabilities: 0.857 0.143 
## 
## Node number 31: 151 observations,    complexity param=0.04868914
##   predicted class=Yes  expected loss=0.4768212  P(node) =0.4194444
##     class counts:    72    79
##    probabilities: 0.477 0.523 
##   left son=62 (63 obs) right son=63 (88 obs)
##   Primary splits:
##       windspeed  < 0.1822145  to the right, improve=1.9351510, (0 missing)
##       temp       < 0.5208335  to the left,  improve=1.8952400, (0 missing)
##       hum        < 0.5395835  to the right, improve=1.6667130, (0 missing)
##       workingday < 0.5        to the right, improve=0.8447414, (0 missing)
##       seasonLab  splits as  RRLL, improve=0.7496782, (0 missing)
##   Surrogate splits:
##       hum       < 0.510625   to the left,  agree=0.636, adj=0.127, (0 split)
##       seasonLab splits as  RRRL, agree=0.609, adj=0.063, (0 split)
##       weekday   < 5.5        to the right, agree=0.596, adj=0.032, (0 split)
##       temp      < 0.4391665  to the left,  agree=0.596, adj=0.032, (0 split)
## 
## Node number 62: 63 observations,    complexity param=0.04868914
##   predicted class=No   expected loss=0.4285714  P(node) =0.175
##     class counts:    36    27
##    probabilities: 0.571 0.429 
##   left son=124 (31 obs) right son=125 (32 obs)
##   Primary splits:
##       seasonLab  splits as  RRLL, improve=6.742224, (0 missing)
##       temp       < 0.665      to the right, improve=5.772886, (0 missing)
##       windspeed  < 0.2092645  to the left,  improve=3.061794, (0 missing)
##       workingday < 0.5        to the right, improve=2.991758, (0 missing)
##       weekday    < 1.5        to the right, improve=1.679365, (0 missing)
##   Surrogate splits:
##       temp       < 0.6483335  to the right, agree=0.778, adj=0.548, (0 split)
##       windspeed  < 0.190915   to the left,  agree=0.603, adj=0.194, (0 split)
##       workingday < 0.5        to the right, agree=0.587, adj=0.161, (0 split)
##       weekday    < 2.5        to the right, agree=0.571, adj=0.129, (0 split)
##       hum        < 0.638125   to the left,  agree=0.571, adj=0.129, (0 split)
## 
## Node number 63: 88 observations,    complexity param=0.04868914
##   predicted class=Yes  expected loss=0.4090909  P(node) =0.2444444
##     class counts:    36    52
##    probabilities: 0.409 0.591 
##   left son=126 (57 obs) right son=127 (31 obs)
##   Primary splits:
##       temp      < 0.7        to the left,  improve=4.4469830, (0 missing)
##       hum       < 0.5395835  to the right, improve=2.6597400, (0 missing)
##       windspeed < 0.08365415 to the left,  improve=2.0454550, (0 missing)
##       seasonLab splits as  LLRR, improve=1.0570820, (0 missing)
##       weekday   < 1.5        to the left,  improve=0.4694756, (0 missing)
##   Surrogate splits:
##       seasonLab splits as  LLRL, agree=0.807, adj=0.452, (0 split)
##       windspeed < 0.164185   to the left,  agree=0.659, adj=0.032, (0 split)
## 
## Node number 124: 31 observations
##   predicted class=No   expected loss=0.1935484  P(node) =0.08611111
##     class counts:    25     6
##    probabilities: 0.806 0.194 
## 
## Node number 125: 32 observations,    complexity param=0.01123596
##   predicted class=Yes  expected loss=0.34375  P(node) =0.08888889
##     class counts:    11    21
##    probabilities: 0.344 0.656 
##   left son=250 (17 obs) right son=251 (15 obs)
##   Primary splits:
##       workingday < 0.5        to the right, improve=2.5002450, (0 missing)
##       hum        < 0.6766665  to the right, improve=1.9102270, (0 missing)
##       weekday    < 1.5        to the right, improve=1.7284090, (0 missing)
##       temp       < 0.5158335  to the left,  improve=1.6875000, (0 missing)
##       windspeed  < 0.2431625  to the left,  improve=0.8790584, (0 missing)
##   Surrogate splits:
##       weekday    < 5.5        to the left,  agree=0.750, adj=0.467, (0 split)
##       windspeed  < 0.1980815  to the right, agree=0.656, adj=0.267, (0 split)
##       weathersit < 1.5        to the right, agree=0.625, adj=0.200, (0 split)
##       hum        < 0.5875     to the right, agree=0.625, adj=0.200, (0 split)
##       holiday    < 0.5        to the left,  agree=0.594, adj=0.133, (0 split)
## 
## Node number 126: 57 observations,    complexity param=0.04868914
##   predicted class=No   expected loss=0.4736842  P(node) =0.1583333
##     class counts:    30    27
##    probabilities: 0.526 0.474 
##   left son=252 (48 obs) right son=253 (9 obs)
##   Primary splits:
##       hum       < 0.510208   to the right, improve=3.6849420, (0 missing)
##       temp      < 0.638333   to the right, improve=2.2259310, (0 missing)
##       windspeed < 0.08365415 to the left,  improve=1.7467670, (0 missing)
##       seasonLab splits as  RRLR, improve=0.8020050, (0 missing)
##       weekday   < 1.5        to the left,  improve=0.5718463, (0 missing)
##   Surrogate splits:
##       temp      < 0.4491665  to the right, agree=0.877, adj=0.222, (0 split)
##       windspeed < 0.179421   to the left,  agree=0.877, adj=0.222, (0 split)
##       seasonLab splits as  LLLR, agree=0.860, adj=0.111, (0 split)
## 
## Node number 127: 31 observations
##   predicted class=Yes  expected loss=0.1935484  P(node) =0.08611111
##     class counts:     6    25
##    probabilities: 0.194 0.806 
## 
## Node number 250: 17 observations
##   predicted class=No   expected loss=0.4705882  P(node) =0.04722222
##     class counts:     9     8
##    probabilities: 0.529 0.471 
## 
## Node number 251: 15 observations
##   predicted class=Yes  expected loss=0.1333333  P(node) =0.04166667
##     class counts:     2    13
##    probabilities: 0.133 0.867 
## 
## Node number 252: 48 observations,    complexity param=0.03370787
##   predicted class=No   expected loss=0.3958333  P(node) =0.1333333
##     class counts:    29    19
##    probabilities: 0.604 0.396 
##   left son=504 (41 obs) right son=505 (7 obs)
##   Primary splits:
##       hum        < 0.730625   to the left,  improve=1.6621660, (0 missing)
##       temp       < 0.5425     to the left,  improve=1.4583330, (0 missing)
##       windspeed  < 0.0867583  to the left,  improve=1.0489260, (0 missing)
##       seasonLab  splits as  RLL-, improve=0.5208333, (0 missing)
##       weathersit < 1.5        to the left,  improve=0.4289216, (0 missing)
##   Surrogate splits:
##       holiday < 0.5        to the left,  agree=0.896, adj=0.286, (0 split)
## 
## Node number 253: 9 observations
##   predicted class=Yes  expected loss=0.1111111  P(node) =0.025
##     class counts:     1     8
##    probabilities: 0.111 0.889 
## 
## Node number 504: 41 observations,    complexity param=0.02247191
##   predicted class=No   expected loss=0.3414634  P(node) =0.1138889
##     class counts:    27    14
##    probabilities: 0.659 0.341 
##   left son=1008 (19 obs) right son=1009 (22 obs)
##   Primary splits:
##       temp      < 0.5425     to the left,  improve=1.2141440, (0 missing)
##       seasonLab splits as  RLL-, improve=1.0686540, (0 missing)
##       hum       < 0.695833   to the right, improve=0.9314486, (0 missing)
##       windspeed < 0.1508065  to the right, improve=0.7662971, (0 missing)
##       weekday   < 1.5        to the left,  improve=0.4390244, (0 missing)
##   Surrogate splits:
##       seasonLab  splits as  LLR-, agree=0.707, adj=0.368, (0 split)
##       hum        < 0.594167   to the left,  agree=0.707, adj=0.368, (0 split)
##       windspeed  < 0.146773   to the right, agree=0.659, adj=0.263, (0 split)
##       weekday    < 4.5        to the right, agree=0.610, adj=0.158, (0 split)
##       weathersit < 1.5        to the left,  agree=0.585, adj=0.105, (0 split)
## 
## Node number 505: 7 observations
##   predicted class=Yes  expected loss=0.2857143  P(node) =0.01944444
##     class counts:     2     5
##    probabilities: 0.286 0.714 
## 
## Node number 1008: 19 observations
##   predicted class=No   expected loss=0.2105263  P(node) =0.05277778
##     class counts:    15     4
##    probabilities: 0.789 0.211 
## 
## Node number 1009: 22 observations,    complexity param=0.02247191
##   predicted class=No   expected loss=0.4545455  P(node) =0.06111111
##     class counts:    12    10
##    probabilities: 0.545 0.455 
##   left son=2018 (12 obs) right son=2019 (10 obs)
##   Primary splits:
##       temp       < 0.6504165  to the right, improve=2.2090910, (0 missing)
##       weekday    < 1.5        to the left,  improve=1.6441340, (0 missing)
##       seasonLab  splits as  RRL-, improve=1.4545450, (0 missing)
##       hum        < 0.6470835  to the right, improve=0.3636364, (0 missing)
##       workingday < 0.5        to the left,  improve=0.1590909, (0 missing)
##   Surrogate splits:
##       seasonLab splits as  RLL-, agree=0.682, adj=0.3, (0 split)
##       hum       < 0.6554165  to the right, agree=0.636, adj=0.2, (0 split)
##       weekday   < 3.5        to the left,  agree=0.591, adj=0.1, (0 split)
##       windspeed < 0.0923583  to the right, agree=0.591, adj=0.1, (0 split)
## 
## Node number 2018: 12 observations
##   predicted class=No   expected loss=0.25  P(node) =0.03333333
##     class counts:     9     3
##    probabilities: 0.750 0.250 
## 
## Node number 2019: 10 observations
##   predicted class=Yes  expected loss=0.3  P(node) =0.02777778
##     class counts:     3     7
##    probabilities: 0.300 0.700
# plotting the tree

# install.packages("rpart.plot")
library(rpart.plot)

prp(tree.bikeShare2, faclen = 0, cex = 0.7, extra = 1, space = 0.5, main = "To Rent a Bike or Not?")
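
In the cptable above, the cross-validated error (xerror) is actually lowest at the root, a hint that the deep tree may be overfitting; a pruning sketch (the cp cutoff is illustrative) keeps only the strongest splits:

# keep only splits whose complexity parameter exceeds the cutoff;
# per the cptable above, cp = 0.04 retains the first eight splits
tree.pruned <- prune(tree.bikeShare2, cp = 0.04)
prp(tree.pruned, faclen = 0, cex = 0.7, extra = 1, space = 0.5,
    main = "Pruned: To Rent a Bike or Not?")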

Description: This tree model essentially describes the conditions that might lead to a person's decision to rent or not rent a bike, and the variables that matter most are weather variables. The model predicts that if the temperature is too low, a person is unlikely to rent. Even if the temperature is high, high humidity makes renting less likely. And even with high temperature and low humidity, excessive wind speed or excessive heat can deter riders as well. If the wind speed is low, people are still less likely to rent in summer or winter, and in those seasons the likelihood of renting drops further on working days. This is how one would read the tree model above.

Step 3:

# this is what we actually observe (testing set)
high.test<-High[-train]
# this is what we predict for the testing set using the training model
tree.pred<-predict(tree.bikeShare2, newdata = bikeShare3, type="class")
tree.test<-tree.pred[-train]
# the confusion matrix matches these up to see how well we did
cm<-table(tree.test, high.test)
cm
##          high.test
## tree.test  No Yes
##       No  245  49
##       Yes  35  42
# overall accuracy on the test set
sum(diag(cm))/sum(cm)
## [1] 0.7735849
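
Overall accuracy can hide imbalance between the two classes; the same confusion matrix yields per-class rates, sketched below:

# rows of cm are predictions, columns are observed classes
cm["No", "No"] / sum(cm[, "No"])    # specificity: observed "No" days predicted correctly
cm["Yes", "Yes"] / sum(cm[, "Yes"]) # sensitivity: observed "Yes" days predicted correctly
1 - sum(diag(cm)) / sum(cm)         # misclassification rate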