This data set contains bike share data from the Capital Bike Share program, compiled at the Laboratory of Artificial Intelligence and Decision Support (LIAAD) at the University of Porto, Portugal. The popularity of the bike share model has taken off in recent years as cities across the world transition toward more sustainable metropolises. Reduced reliance on fossil-fuel-powered cars will be necessary both to cut carbon emissions and to save space in densely populated areas. Although the bike share model is innovative and holds great potential, pilot programs must be monitored to determine whether the idea is worth pursuing, which is why data on bike share popularity and use is so valuable. This data set helps us understand what factors might influence consumer decisions to use a bike share program.

At first glance, most of the explanatory variables are local climatic variables, though others, such as the date and the type of day (working day, weekend, or holiday), can also be classified as explanatory. The response variables are the numbers of casual users, registered users, and total users. Early questions that emerged from examining these variables mostly concerned how climatic variables affect use of the bike share program. Specifically, this paper analyzes the impact of temperature and seasonal change on the total number of users. In summary, temperature appeared to have a significant but non-linear relationship with the total number of users. Seasonality for the most part had a significant relationship with the total number of users, but upon further inspection of the model outputs, spring, a time of year with highly variable weather, was not significant. The remainder of this paper outlines the potential applications of these data, the statistical methods used, and the results of those methods. The paper ends with the main conclusions as well as suggestions for further data collection and analysis. All of the code used to analyze this data set is available in the appendix.
The following graphics were developed during early exploration of this data set in RStudio (see Figures 1-5). In the first plot, there appears to be a positive but non-linear relationship between temperature and total users; the data cluster into two concave-down arches. Additionally, the weather situation and the season appear to have strong relationships with the number of users, as shown by the two box plots. While spring and fall are not very distinguishable from one another in terms of median or spread, summer and winter clearly correlate with more and fewer registered users, respectively. The fourth plot shows that there is no season during which any given day of the week is more popular than the others; for example, renting a bike on a weekend is just as popular in summer as in winter. The final plot shows that although people prefer to rent bikes on clear days throughout the year, demand is greatest on clear days in the summertime.
All of the statistical analysis of this data set was performed in RStudio. To begin, an exploratory data analysis (EDA) was performed: examining the structure of the data set, learning what the variables mean and how they are coded, and getting an initial look at how they relate to one another. Exploring these relationships meant creating numerous graphics with ggplot2, from the tidyverse package, to visualize the data. Then a simple linear regression model was fitted and assessed using an analysis of variance (ANOVA) table and diagnostic plots. The diagnostic plots include the Residuals vs. Fitted plot, to assess the spread of the data; the QQ plot, to assess normality; and the Residuals vs. Leverage plot, to assess the presence of outliers. A multiple linear regression model was then used to examine the interaction between the numeric variable temperature and the categorical variable season. Finally, to analyze consumer decisions to rent or not rent a bike from a bike share program, a classification model, specifically a tree model, was used. To do this, the numeric response variable, total users, was dichotomized: all observations in the data set were divided into two groups according to whether the total number of users was above or below 6,000, and the conditions that create the branches of the tree are based on this split. The tree model was then assessed by creating a training set and a test set: the model was fit to the training set, and a confusion matrix was created from the test set, from which the accuracy of the tree model can be calculated.
The simple linear regression model, which models the relationship between temperature and total users, provides some key insights (see Figure 7). A graphic of this model shows a seemingly positive but non-linear relationship between the two variables; the points cluster in two vertically stacked arches. The model output reveals a p-value < 2.2e-16, so the null hypothesis can be rejected at the α = 0.05 significance level. There is strong evidence to suggest a linear relationship between temperature and total bike share users.

Diagnostic plots (see Figures 8-12) help assess how well this simple linear regression model fits the data. The diagnostic plots analyzed here are the Residuals vs. Fitted plot, the Normal QQ plot, and the Residuals vs. Leverage plot. In the Residuals vs. Fitted plot, the points are distributed evenly above and below the mean residual of 0, and the vertical spread does not increase or decrease as you move horizontally across the plot (there is no fan shape). These characteristics indicate that the model fits the data relatively well. In the Normal QQ plot, most of the points lie on or very near the diagonal line, with some deviation at the tails; since the ideal is for all points to fall on the line and most do, the residuals are approximately normal. In the Residuals vs. Leverage plot, there appear to be a handful of outliers, though none stand out sharply because the points are relatively spread out rather than tightly clustered on the left side of the plot; moving from left to right, the pattern is weaker than on the left side.

The multiple linear regression models (see Figures 15 & 16) examine the relationship between temperature and total bike share users while taking into account a categorical variable, season, and its interaction with temperature, a numeric explanatory variable. A parallel slopes model shows the expected difference in the number of bike share users between Fall (the reference group) and each other season when temperature is held constant. A non-parallel slopes model has a similar function but also reveals the interaction between temperature and each season. The model outputs show that Spring stands out as insignificant, with a high p-value compared to the other seasons. To improve the fit, one adjustment would be to remove the Spring category. In context, Spring likely has a much weaker relationship with total bike share users because of the highly variable weather normally associated with springtime. Summer, Fall, and Winter, on the other hand, maintained high significance, and accordingly low p-values relative to Spring, throughout all of the models run during this exploration. In both the parallel and non-parallel slopes models, there are hints that a linear model may not be the best fit for these data. A fitted model and graphic of only the relationship between season and total users clearly demonstrate that spring and summer are the more popular seasons in which to rent bikes.
However, the parallel slopes model output states that the expected difference between Fall and each other season, when temperature is held constant, is negative. This would indicate that, controlling for temperature, there are fewer expected users in Spring and Summer than in Fall, which is inconsistent with the earlier model. Additionally, in the non-parallel slopes model the slope for Summer is negative while the other three seasons have positive slopes, which again suggests that a linear model may not be the best fit.

Finally, a classification model (see Figure 18) was applied. It reads as follows: the conditions that split the data are bolded; if the condition is true, move to the left, and if not, move to the right. The leaves of the tree, or terminal nodes, represent the final outcome (whether or not a consumer rented a bike) given all of the conditions met on the way down the tree to that node. This tree model essentially describes the conditions that might lead to a person's decision to rent or not rent a bike, and the variables that influence this decision most are climatic. The model predicts that if the season is Fall or Winter, a person is less likely to rent; even outside Fall and Winter, humidity that is too high is a further disincentive to rent. If the season is Spring or Summer, the humidity is low, and the temperature is high, it appears a person would rent. If the season is Spring or Summer, the humidity is low, and the temperature is low, then a few more conditions must hold for someone to be likely to rent a bike: the season being Summer specifically, the weather situation being clear or partly cloudy, and the temperature not being too hot. A confusion matrix was used to assess this classification model, and the accuracy was determined to be 84%.
All cities will face their own challenges depending on local climate, geography, and infrastructure. This data set was especially revealing of the ways in which climatic variables affect bike share use. The main conclusions of this analysis are that there are statistically significant relationships between temperature, seasonality, and the total number of users, although a linear model may not be the best fit for the data for the aforementioned reasons. Analysis of this kind is important for assessing whether bike share programs have the potential to take off in cities with sustainable development goals. To truly answer this question, however, the data set must be expanded to include variables describing the natural and built environment. For example, a very hilly city, a city with no bike lanes, or a city with high levels of street crime could all be places where a bike share program is unpopular regardless of the weather.
# Loading the data set
bikeShare <- read.csv("day.csv", header = TRUE)
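A quick structure check immediately after loading confirms how the variables are coded (for instance, temp appears to be normalized to a 0-1 scale, as the model outputs below suggest); a minimal sketch:
# inspect dimensions and variable types/coding
str(bikeShare)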
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Creating labels for weather situation and season
clear <- bikeShare %>%
  filter(weathersit == 1) %>%
  mutate(weatherLab = "Clear")
partCloud <- bikeShare %>%
  filter(weathersit == 2) %>%
  mutate(weatherLab = "Partly Cloudy")
precipitating <- bikeShare %>%
  filter(weathersit == 3) %>%
  mutate(weatherLab = "Precipitating")
bikeShare2 <- rbind(clear, partCloud, precipitating)
bikeShare2$weatherLab <- as.factor(bikeShare2$weatherLab)
contrasts(bikeShare2$weatherLab)
## Partly Cloudy Precipitating
## Clear 0 0
## Partly Cloudy 1 0
## Precipitating 0 1
# winter
winter <- bikeShare2 %>%
  filter(season == 1) %>%
  mutate(seasonLab = "Winter")
# spring
spring <- bikeShare2 %>%
  filter(season == 2) %>%
  mutate(seasonLab = "Spring")
# summer
summer <- bikeShare2 %>%
  filter(season == 3) %>%
  mutate(seasonLab = "Summer")
# fall
fall <- bikeShare2 %>%
  filter(season == 4) %>%
  mutate(seasonLab = "Fall")
bikeShare3 <- rbind(winter, spring, summer, fall)
bikeShare3$seasonLab <- as.factor(bikeShare3$seasonLab)
contrasts(bikeShare3$seasonLab)
## Spring Summer Winter
## Fall 0 0 0
## Spring 1 0 0
## Summer 0 1 0
## Winter 0 0 1
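An equivalent way to build both sets of labels without splitting and re-binding the data (which reorders the rows) is dplyr::case_when; a minimal sketch, where bikeShare3_alt is an illustrative alternative to bikeShare3:
# label weather situation and season in place, preserving the original row order
bikeShare3_alt <- bikeShare %>%
  mutate(weatherLab = as.factor(case_when(weathersit == 1 ~ "Clear",
                                          weathersit == 2 ~ "Partly Cloudy",
                                          weathersit == 3 ~ "Precipitating")),
         seasonLab = as.factor(case_when(season == 1 ~ "Winter",
                                         season == 2 ~ "Spring",
                                         season == 3 ~ "Summer",
                                         season == 4 ~ "Fall")))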
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ stringr 1.4.0
## ✓ tidyr 1.1.2 ✓ forcats 0.5.1
## ✓ readr 1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# EDA graphics
ggplot(bikeShare3, aes(temp, cnt, color = hum)) +
  geom_point() +
  theme_bw() +
  labs(title = "Temperature, Humidity and Total Users", color = "Humidity") +
  xlab("Temperature") +
  ylab("Total Users")
ggplot(bikeShare3, aes(weatherLab, cnt, fill = weatherLab)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "Weather Situation and Total Users", fill = "Weather Situation") +
  xlab("Weather Situation") +
  ylab("Total Users")
ggplot(bikeShare3, aes(seasonLab, registered, fill = seasonLab)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "Seasonality and Registered Users", fill = "Season") +
  xlab("Season") +
  ylab("Registered Users")
ggplot(bikeShare3, aes(x = seasonLab, fill = as.factor(weekday))) +
  geom_bar(position = "fill") +
  theme_bw() +
  labs(title = "Program Use by Day and Season", fill = "Weekday") +
  xlab("Season") +
  ylab("Bike Share Use")
ggplot(bikeShare3, aes(x = seasonLab, fill = weatherLab)) +
  geom_bar(position = "dodge") +
  theme_bw() +
  labs(title = "Seasonality and Weather Situation", fill = "Weather Situation") +
  xlab("Season") +
  ylab("Bike Share Use")
# Determining a SLR fitted model: Model 1
mod <- lm(cnt~temp, bikeShare3)
summary(mod)
##
## Call:
## lm(formula = cnt ~ temp, data = bikeShare3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4615.3 -1134.9 -104.4 1044.3 3737.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1214.6 161.2 7.537 1.43e-13 ***
## temp 6640.7 305.2 21.759 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared: 0.3937, Adjusted R-squared: 0.3929
## F-statistic: 473.5 on 1 and 729 DF, p-value: < 2.2e-16
Test for Slope:
Reference Distribution: t, with 729 degrees of freedom
Test Statistic: t = 21.759 (equivalently, F = 473.5)
P-value: < 2.2e-16
Conclusion: We reject the null hypothesis, with a p-value < 2.2e-16, at the α = 0.05 significance level. There is strong evidence to suggest a linear relationship between temperature and total bike share users (i.e., the slope parameter is not equal to zero).
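The same quantities can be pulled straight from the fitted model object rather than read off the printed summary; a minimal sketch:
# extract the slope row: estimate, std. error, t value, Pr(>|t|)
coef(summary(mod))["temp", ]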
# Creating an ANOVA table
aov.out = aov(cnt~temp, bikeShare3)
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## temp 1 1.079e+09 1.079e+09 473.5 <2e-16 ***
## Residuals 729 1.661e+09 2.278e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Assessing model fit
plot(mod)
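plot(mod) cycles through four default diagnostics; the three discussed in the text can be requested directly with the which argument; a sketch:
# 1 = residuals vs. fitted, 2 = normal QQ, 5 = residuals vs. leverage
plot(mod, which = c(1, 2, 5))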
# Determining a SLR fitted model w/ numeric predictor: Model 2
# (same specification as Model 1; refit as mod1 for the visualization below)
mod1 <- lm(cnt~temp, bikeShare3)
summary(mod1)
##
## Call:
## lm(formula = cnt ~ temp, data = bikeShare3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4615.3 -1134.9 -104.4 1044.3 3737.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1214.6 161.2 7.537 1.43e-13 ***
## temp 6640.7 305.2 21.759 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared: 0.3937, Adjusted R-squared: 0.3929
## F-statistic: 473.5 on 1 and 729 DF, p-value: < 2.2e-16
# Visualizing Model 2
ggplot(bikeShare3, aes(x = temp, y = cnt)) +
  geom_point() +
  geom_abline(intercept = mod1$coefficients[1], slope = mod1$coefficients[2]) +
  theme_bw() +
  labs(title = "Temperature and Total Users") +
  xlab("Temperature") +
  ylab("Total Users")
# Determining a SLR fitted model w/ categorical predictor: Model 3
mod2 <- lm(cnt~seasonLab, bikeShare3)
summary(mod2)
##
## Call:
## lm(formula = cnt ~ seasonLab, data = bikeShare3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4706.2 -1065.6 -181.3 1219.2 5231.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4728.2 117.6 40.213 < 2e-16 ***
## seasonLabSpring 264.2 164.9 1.602 0.11
## seasonLabSummer 916.1 164.1 5.584 3.31e-08 ***
## seasonLabWinter -2124.0 165.6 -12.827 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1569 on 727 degrees of freedom
## Multiple R-squared: 0.347, Adjusted R-squared: 0.3443
## F-statistic: 128.8 on 3 and 727 DF, p-value: < 2.2e-16
# Visualizing Model 3
ggplot(bikeShare3, aes(x = seasonLab, y = cnt)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "Seasonality and Total Users") +
  xlab("Season") +
  ylab("Total Users")
# Determining a MLR parallel slopes model: Model 4
mod3 <- lm(cnt~temp + seasonLab, bikeShare3)
summary(mod3)
##
## Call:
## lm(formula = cnt ~ temp + seasonLab, data = bikeShare3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4812.9 -996.8 -271.3 1240.9 3881.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2088.7 244.0 8.559 < 2e-16 ***
## temp 6241.3 518.1 12.046 < 2e-16 ***
## seasonLabSpring -494.1 163.3 -3.026 0.00256 **
## seasonLabSummer -852.7 209.8 -4.064 5.35e-05 ***
## seasonLabWinter -1342.9 164.6 -8.159 1.49e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1433 on 726 degrees of freedom
## Multiple R-squared: 0.4558, Adjusted R-squared: 0.4528
## F-statistic: 152 on 4 and 726 DF, p-value: < 2.2e-16
# Visualizing Model 4
ggplot(bikeShare3, aes(x = temp, y = cnt, color = seasonLab)) +
  geom_point() +
  # fall (reference)
  geom_abline(intercept = mod3$coefficients[1], slope = mod3$coefficients[2], color = "tomato1") +
  # spring
  geom_abline(intercept = mod3$coefficients[1] + mod3$coefficients[3], slope = mod3$coefficients[2], color = "#66A61E") +
  # summer
  geom_abline(intercept = mod3$coefficients[1] + mod3$coefficients[4], slope = mod3$coefficients[2], color = "cyan3") +
  # winter
  geom_abline(intercept = mod3$coefficients[1] + mod3$coefficients[5], slope = mod3$coefficients[2], color = "mediumorchid") +
  theme_bw() +
  labs(title = "Seasonality, Temperature, and Total Users", color = "Season") +
  xlab("Temperature") +
  ylab("Total Users")
# Determining a MLR non-parallel slopes model: Model 5
mod4 <- lm(cnt~temp*seasonLab, bikeShare3)
summary(mod4)
##
## Call:
## lm(formula = cnt ~ temp * seasonLab, data = bikeShare3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4813.3 -1016.3 -184.3 1221.7 3258.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2077.5 427.8 4.857 1.46e-06 ***
## temp 6267.8 980.3 6.394 2.90e-10 ***
## seasonLabSpring -675.1 637.7 -1.059 0.290121
## seasonLabSummer 4044.4 1115.7 3.625 0.000309 ***
## seasonLabWinter -2188.5 535.0 -4.091 4.78e-05 ***
## temp:seasonLabSpring 326.5 1295.9 0.252 0.801179
## temp:seasonLabSummer -6944.0 1751.6 -3.964 8.09e-05 ***
## temp:seasonLabWinter 2851.2 1414.9 2.015 0.044262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1406 on 723 degrees of freedom
## Multiple R-squared: 0.478, Adjusted R-squared: 0.473
## F-statistic: 94.59 on 7 and 723 DF, p-value: < 2.2e-16
# Visualizing Model 5
ggplot(bikeShare3, aes(x = temp, y = cnt, color = seasonLab)) +
  geom_point() +
  # fall (reference)
  geom_abline(intercept = mod4$coefficients[1], slope = mod4$coefficients[2], color = "tomato1") +
  # spring
  geom_abline(intercept = mod4$coefficients[1] + mod4$coefficients[3], slope = mod4$coefficients[2] + mod4$coefficients[6], color = "#66A61E") +
  # summer
  geom_abline(intercept = mod4$coefficients[1] + mod4$coefficients[4], slope = mod4$coefficients[2] + mod4$coefficients[7], color = "cyan3") +
  # winter
  geom_abline(intercept = mod4$coefficients[1] + mod4$coefficients[5], slope = mod4$coefficients[2] + mod4$coefficients[8], color = "mediumorchid") +
  theme_bw() +
  labs(title = "Seasonality, Temperature, and Total Users", color = "Season") +
  xlab("Temperature") +
  ylab("Total Users")
Mean Squared Error
# Model comparisons
# mod1
mean(residuals(mod1)^2)
## [1] 2272020
# mod2
mean(mod2$residuals^2)
## [1] 2447250
# mod3
mean(mod3$residuals^2)
## [1] 2039615
# mod4
mean(mod4$residuals^2)
## [1] 1956178
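In-sample MSE necessarily shrinks as terms are added, so a nested F-test and AIC give a fairer comparison of the parallel and non-parallel slopes models; a minimal sketch:
# does the interaction term significantly improve on parallel slopes?
anova(mod3, mod4)
# information criteria across all four models (lower is better)
AIC(mod1, mod2, mod3, mod4)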
# Creating a classification tree model
# new variable: "Yes" if total users (cnt) is greater than 6,000, else "No"
# (computed from bikeShare3 so the labels line up with its row order;
# bikeShare2's rows are ordered differently after the rbind above)
High <- ifelse(bikeShare3$cnt <= 6000, "No", "Yes")
bikeShare4 <- data.frame(bikeShare3, High) %>%
  select(-c("instant", "dteday", "registered", "casual"))
attach(bikeShare4)
## The following object is masked _by_ .GlobalEnv:
##
## High
# visualization
ggplot(data = bikeShare4, aes(x = temp, color = High)) +
  geom_density() +
  theme_bw() +
  labs(title = "To Rent a Bike or Not?", color = "Did They Rent?") +
  xlab("Temperature") +
  ylab("Density")
# test and train
set.seed(2)
dim(bikeShare4)
## [1] 731 15
# split roughly in half: 360 of the 731 days form the training set
train <- sample(1:nrow(bikeShare4), 360)
library(rpart)
# using method = "class" for classification
tree.bikeShare2 <- rpart(High ~ seasonLab + holiday + weekday + workingday +
                           weathersit + temp + hum + windspeed,
                         data = bikeShare4, subset = train,
                         method = "class")
summary(tree.bikeShare2)
## Call:
## rpart(formula = High ~ seasonLab + holiday + weekday + workingday +
## weathersit + temp + hum + windspeed, data = bikeShare4, subset = train,
## method = "class")
## n= 360
##
## CP nsplit rel error xerror xstd
## 1 0.03745318 0 1.0000000 1.0000000 0.09196835
## 2 0.03370787 6 0.6516854 0.8988764 0.08863041
## 3 0.01685393 8 0.5842697 0.9213483 0.08941041
## 4 0.01123596 10 0.5505618 0.9213483 0.08941041
## 5 0.01000000 11 0.5393258 0.9101124 0.08902323
##
## Variable importance
## temp hum seasonLab weathersit windspeed weekday
## 31 23 21 15 9 1
##
## Node number 1: 360 observations, complexity param=0.03745318
## predicted class=No expected loss=0.2472222 P(node) =1
## class counts: 271 89
## probabilities: 0.753 0.247
## left son=2 (168 obs) right son=3 (192 obs)
## Primary splits:
## seasonLab splits as LRRL, improve=13.434920, (0 missing)
## temp < 0.637917 to the left, improve= 9.652903, (0 missing)
## hum < 0.544167 to the left, improve= 4.088565, (0 missing)
## windspeed < 0.186256 to the right, improve= 3.073263, (0 missing)
## holiday < 0.5 to the right, improve= 1.128348, (0 missing)
## Surrogate splits:
## temp < 0.5308335 to the left, agree=0.844, adj=0.667, (0 split)
## windspeed < 0.201504 to the right, agree=0.572, adj=0.083, (0 split)
## hum < 0.7577085 to the right, agree=0.544, adj=0.024, (0 split)
## holiday < 0.5 to the right, agree=0.542, adj=0.018, (0 split)
## weathersit < 1.5 to the right, agree=0.542, adj=0.018, (0 split)
##
## Node number 2: 168 observations
## predicted class=No expected loss=0.1011905 P(node) =0.4666667
## class counts: 151 17
## probabilities: 0.899 0.101
##
## Node number 3: 192 observations, complexity param=0.03745318
## predicted class=No expected loss=0.375 P(node) =0.5333333
## class counts: 120 72
## probabilities: 0.625 0.375
## left son=6 (61 obs) right son=7 (131 obs)
## Primary splits:
## hum < 0.546875 to the left, improve=3.7850080, (0 missing)
## temp < 0.432174 to the right, improve=3.3767440, (0 missing)
## weathersit < 1.5 to the left, improve=3.1723190, (0 missing)
## windspeed < 0.2095875 to the right, improve=0.9222011, (0 missing)
## weekday < 3.5 to the right, improve=0.6857143, (0 missing)
## Surrogate splits:
## temp < 0.8154165 to the right, agree=0.708, adj=0.082, (0 split)
## windspeed < 0.246277 to the right, agree=0.703, adj=0.066, (0 split)
##
## Node number 6: 61 observations, complexity param=0.03370787
## predicted class=No expected loss=0.2295082 P(node) =0.1694444
## class counts: 47 14
## probabilities: 0.770 0.230
## left son=12 (35 obs) right son=13 (26 obs)
## Primary splits:
## temp < 0.62875 to the left, improve=4.8792650, (0 missing)
## seasonLab splits as -LR-, improve=4.1435900, (0 missing)
## windspeed < 0.1343355 to the right, improve=2.6588770, (0 missing)
## hum < 0.5008335 to the right, improve=0.8519660, (0 missing)
## weekday < 0.5 to the right, improve=0.6266805, (0 missing)
## Surrogate splits:
## seasonLab splits as -LR-, agree=0.902, adj=0.769, (0 split)
## windspeed < 0.180348 to the right, agree=0.721, adj=0.346, (0 split)
## hum < 0.526458 to the left, agree=0.623, adj=0.115, (0 split)
##
## Node number 7: 131 observations, complexity param=0.03745318
## predicted class=No expected loss=0.4427481 P(node) =0.3638889
## class counts: 73 58
## probabilities: 0.557 0.443
## left son=14 (119 obs) right son=15 (12 obs)
## Primary splits:
## temp < 0.4339675 to the right, improve=5.933938, (0 missing)
## hum < 0.58625 to the right, improve=3.687975, (0 missing)
## seasonLab splits as -RL-, improve=3.316233, (0 missing)
## windspeed < 0.230723 to the left, improve=2.404240, (0 missing)
## weekday < 3.5 to the right, improve=0.761202, (0 missing)
##
## Node number 12: 35 observations
## predicted class=No expected loss=0.05714286 P(node) =0.09722222
## class counts: 33 2
## probabilities: 0.943 0.057
##
## Node number 13: 26 observations, complexity param=0.03370787
## predicted class=No expected loss=0.4615385 P(node) =0.07222222
## class counts: 14 12
## probabilities: 0.538 0.462
## left son=26 (18 obs) right son=27 (8 obs)
## Primary splits:
## hum < 0.445833 to the right, improve=3.950855000, (0 missing)
## temp < 0.774167 to the right, improve=3.380593000, (0 missing)
## windspeed < 0.1343355 to the right, improve=1.923077000, (0 missing)
## weekday < 3.5 to the right, improve=0.307692300, (0 missing)
## workingday < 0.5 to the left, improve=0.008044243, (0 missing)
## Surrogate splits:
## windspeed < 0.1977625 to the left, agree=0.769, adj=0.250, (0 split)
## temp < 0.7154165 to the right, agree=0.731, adj=0.125, (0 split)
##
## Node number 14: 119 observations, complexity param=0.03745318
## predicted class=No expected loss=0.394958 P(node) =0.3305556
## class counts: 72 47
## probabilities: 0.605 0.395
## left son=28 (98 obs) right son=29 (21 obs)
## Primary splits:
## hum < 0.58625 to the right, improve=5.2004800, (0 missing)
## temp < 0.7708335 to the left, improve=1.4375860, (0 missing)
## seasonLab splits as -RL-, improve=1.0407160, (0 missing)
## windspeed < 0.0876815 to the left, improve=0.9453782, (0 missing)
## weekday < 3.5 to the right, improve=0.6110388, (0 missing)
## Surrogate splits:
## temp < 0.7708335 to the left, agree=0.866, adj=0.238, (0 split)
##
## Node number 15: 12 observations
## predicted class=Yes expected loss=0.08333333 P(node) =0.03333333
## class counts: 1 11
## probabilities: 0.083 0.917
##
## Node number 26: 18 observations
## predicted class=No expected loss=0.2777778 P(node) =0.05
## class counts: 13 5
## probabilities: 0.722 0.278
##
## Node number 27: 8 observations
## predicted class=Yes expected loss=0.125 P(node) =0.02222222
## class counts: 1 7
## probabilities: 0.125 0.875
##
## Node number 28: 98 observations, complexity param=0.03745318
## predicted class=No expected loss=0.3265306 P(node) =0.2722222
## class counts: 66 32
## probabilities: 0.673 0.327
## left son=56 (59 obs) right son=57 (39 obs)
## Primary splits:
## seasonLab splits as -RL-, improve=2.3614930, (0 missing)
## temp < 0.6558335 to the right, improve=1.7703910, (0 missing)
## windspeed < 0.1265645 to the right, improve=1.0207620, (0 missing)
## hum < 0.691712 to the left, improve=0.9159749, (0 missing)
## weekday < 5.5 to the left, improve=0.5463847, (0 missing)
## Surrogate splits:
## temp < 0.6225 to the right, agree=0.776, adj=0.436, (0 split)
##
## Node number 29: 21 observations, complexity param=0.01123596
## predicted class=Yes expected loss=0.2857143 P(node) =0.05833333
## class counts: 6 15
## probabilities: 0.286 0.714
## left son=58 (9 obs) right son=59 (12 obs)
## Primary splits:
## temp < 0.6858335 to the left, improve=2.2936510, (0 missing)
## hum < 0.568125 to the left, improve=0.7936508, (0 missing)
## windspeed < 0.193408 to the left, improve=0.6675824, (0 missing)
## seasonLab splits as -LR-, improve=0.4285714, (0 missing)
## weekday < 2.5 to the right, improve=0.2805195, (0 missing)
## Surrogate splits:
## seasonLab splits as -LR-, agree=0.810, adj=0.556, (0 split)
## hum < 0.55 to the left, agree=0.667, adj=0.222, (0 split)
## windspeed < 0.1156745 to the left, agree=0.667, adj=0.222, (0 split)
##
## Node number 56: 59 observations, complexity param=0.01685393
## predicted class=No expected loss=0.2372881 P(node) =0.1638889
## class counts: 45 14
## probabilities: 0.763 0.237
## left son=112 (25 obs) right son=113 (34 obs)
## Primary splits:
## weathersit < 1.5 to the right, improve=4.885344, (0 missing)
## hum < 0.72375 to the right, improve=2.122599, (0 missing)
## temp < 0.6400905 to the left, improve=1.522599, (0 missing)
## windspeed < 0.1682225 to the right, improve=1.381573, (0 missing)
## weekday < 3.5 to the right, improve=1.020218, (0 missing)
## Surrogate splits:
## hum < 0.72375 to the right, agree=0.814, adj=0.56, (0 split)
## temp < 0.6941665 to the left, agree=0.763, adj=0.44, (0 split)
## windspeed < 0.1946625 to the right, agree=0.763, adj=0.44, (0 split)
## weekday < 4.5 to the right, agree=0.593, adj=0.04, (0 split)
##
## Node number 57: 39 observations, complexity param=0.03745318
## predicted class=No expected loss=0.4615385 P(node) =0.1083333
## class counts: 21 18
## probabilities: 0.538 0.462
## left son=114 (19 obs) right son=115 (20 obs)
## Primary splits:
## weathersit < 1.5 to the left, improve=9.4056680, (0 missing)
## hum < 0.7525 to the left, improve=4.7920230, (0 missing)
## temp < 0.6545835 to the right, improve=1.5512820, (0 missing)
## weekday < 4.5 to the left, improve=0.5156499, (0 missing)
## windspeed < 0.151429 to the left, improve=0.3528694, (0 missing)
## Surrogate splits:
## hum < 0.6420835 to the left, agree=0.718, adj=0.421, (0 split)
## temp < 0.66375 to the right, agree=0.692, adj=0.368, (0 split)
## windspeed < 0.2294835 to the left, agree=0.641, adj=0.263, (0 split)
## weekday < 3.5 to the right, agree=0.564, adj=0.105, (0 split)
##
## Node number 58: 9 observations
## predicted class=No expected loss=0.4444444 P(node) =0.025
## class counts: 5 4
## probabilities: 0.556 0.444
##
## Node number 59: 12 observations
## predicted class=Yes expected loss=0.08333333 P(node) =0.03333333
## class counts: 1 11
## probabilities: 0.083 0.917
##
## Node number 112: 25 observations
## predicted class=No expected loss=0 P(node) =0.06944444
## class counts: 25 0
## probabilities: 1.000 0.000
##
## Node number 113: 34 observations, complexity param=0.01685393
## predicted class=No expected loss=0.4117647 P(node) =0.09444444
## class counts: 20 14
## probabilities: 0.588 0.412
## left son=226 (27 obs) right son=227 (7 obs)
## Primary splits:
## hum < 0.7010415 to the left, improve=1.6134450, (0 missing)
## windspeed < 0.150808 to the right, improve=1.2134450, (0 missing)
## weekday < 1.5 to the right, improve=1.0918000, (0 missing)
## temp < 0.7020835 to the right, improve=0.5061438, (0 missing)
## workingday < 0.5 to the right, improve=0.2205882, (0 missing)
##
## Node number 114: 19 observations
## predicted class=No expected loss=0.1052632 P(node) =0.05277778
## class counts: 17 2
## probabilities: 0.895 0.105
##
## Node number 115: 20 observations
## predicted class=Yes expected loss=0.2 P(node) =0.05555556
## class counts: 4 16
## probabilities: 0.200 0.800
##
## Node number 226: 27 observations
## predicted class=No expected loss=0.3333333 P(node) =0.075
## class counts: 18 9
## probabilities: 0.667 0.333
##
## Node number 227: 7 observations
## predicted class=Yes expected loss=0.2857143 P(node) =0.01944444
## class counts: 2 5
## probabilities: 0.286 0.714
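The CP table near the top of the summary suggests the tree could be pruned; one common rule is to cut at the complexity parameter with the smallest cross-validated error (xerror). A minimal sketch, where best.cp and pruned.bikeShare2 are illustrative names:
# prune at the CP value minimizing cross-validated error
best.cp <- tree.bikeShare2$cptable[which.min(tree.bikeShare2$cptable[, "xerror"]), "CP"]
pruned.bikeShare2 <- prune(tree.bikeShare2, cp = best.cp)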
# plotting the tree
# install.packages("rpart.plot")
library(rpart.plot)
prp(tree.bikeShare2, faclen = 0, cex = 0.7, extra = 1, space = 0.5, main = "To Rent a Bike or Not?")
# this is what we actually observe (testing set)
high.test<-High[-train]
# this is what we predict for the testing set using the training model
tree.pred<-predict(tree.bikeShare2, newdata = bikeShare4, type="class")
tree.test<-tree.pred[-train]
# the confusion matrix matches these up to see how well we did
cm<-table(tree.test, high.test)
cm
## high.test
## tree.test No Yes
## No 267 47
## Yes 13 44
sum(diag(cm))/sum(cm)
## [1] 0.8382749
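Accuracy alone can be misleading when the classes are unbalanced, as they are here (280 "No" days vs. 91 "Yes" days in the test set); sensitivity and specificity can be read off the same confusion matrix. A minimal sketch, assuming cm as computed above:
# sensitivity: of days that truly exceeded 6,000 users, the fraction predicted "Yes"
cm["Yes", "Yes"] / sum(cm[, "Yes"])
# specificity: of days that did not exceed 6,000 users, the fraction predicted "No"
cm["No", "No"] / sum(cm[, "No"])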