Introduction

Background

Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership through rental and return, has been automated. Through these systems, a user can easily rent a bike at one location and return it at another. Today there is great interest in these systems due to their important role in traffic, environmental, and health issues.

Dataset Description

The dataset contains the daily count of rental bikes for the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. It includes weather attributes such as temperature, humidity, and windspeed, as well as seasonal attributes such as season, month, weekday, and working day. Number of observations: 731. Number of attributes: 16. Response variable: count of daily bike rentals (cnt).

Variable Names

“instant” “dteday” “season” “yr” “mnth” “holiday” “weekday” “workingday” “weathersit” “temp” “atemp” “hum” “windspeed” “casual” “registered” “cnt”

Goal

We will try to determine the factors that influence the total number of bike rentals on any particular day. In other words: what factors influence the number of bike rentals on a given day?

Exploratory Data Analysis

First, we checked the data for missing values and found that there is no missing data in our dataset.

Read the data file and set variable names that convey an apt description

download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip","Bike.zip") 
day <- read.table(unz("Bike.zip", "day.csv"), header=T, quote="\"", sep=",")
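The original code assumes several packages are already attached; for completeness, these are the library() calls that supply the functions used below:

library(DataExplorer)  # plot_histogram()
library(corrplot)      # corrplot()
library(ggplot2)       # ggplot()
library(dplyr)         # %>%, group_by(), summarise()
library(forcats)       # fct_infreq()
library(ade4)          # acm.disjonctif()
library(caret)         # R2(), RMSE()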

#Variable type Identification
str(day)
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
#Rescale temp, atemp, hum and windspeed back to their original units (they are stored normalized in the dataset).
day$temp<- day$temp*41 
day$atemp<- day$atemp*50
day$hum<- day$hum*100
day$windspeed<-day$windspeed*67

#Find missing values in data if any. 
table(is.na(day))
## 
## FALSE 
## 11696
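An equivalent per-column check (a small addition, not in the original) confirms the same:

colSums(is.na(day))  # every column should report 0 missing values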

Frequency Distribution

plot_histogram(day)

A few inferences can be drawn from these histograms: season has four categories of roughly equal size; weather condition 1 contributes more observations than 2 and 3; and most of the numeric variables are approximately normally distributed.

Correlation Plot of Bike Sharing Data

num_cols <- unlist(lapply(day, is.numeric))  # logical index of the numeric columns
nums <- day[, num_cols]
par(mfrow=c(1,1)) 
corrplot(cor(nums))

We can say the registered and casual variables are highly correlated with the target variable because the target variable is the sum of the two. Temp, atemp, humidity, and windspeed can be good predictors. Humidity and windspeed have a negative correlation with the count. The variables “temp” and “atemp” show a very high positive correlation (0.99), so both cannot be used in the analysis because of multicollinearity (singularity); in other words, actual temperature and feels-like temperature change together in the same direction.
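The temp/atemp correlation cited above can be verified directly; a linear rescaling does not change it:

cor(day$temp, day$atemp)  # ~0.99, confirming near-perfect collinearity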

Daily weather conditions influence the daily bike rental count

par(mfrow=c(2,2))

plot(day$cnt~day$temp ,type = 'p', col= 'violetred', xlab = 'Temperature', ylab = 'Total Count')

abline(lm(day$cnt~day$temp))

plot(day$cnt~day$atemp ,type = 'p', col= 'royalblue', xlab = 'Feels Like Temp', ylab = 'Total Count')

abline(lm(day$cnt~day$atemp)) 

plot(day$cnt~day$windspeed ,type = 'p', col= 'lightsalmon3', xlab = 'Windspeed', ylab = 'Total Count')

abline(lm(day$cnt~day$windspeed)) 

plot(day$cnt~day$hum ,type = 'p', col= 'darkslategray4', xlab = 'Humidity', ylab = 'Total Count')

abline(lm(day$cnt~day$hum))

Our dependent variable (count of users) increases with temperature and decreases as humidity and windspeed rise. When we cross-checked this, we found that more people rent bikes on “good” (clear/few clouds) weather days than on “bad” (heavy rain/fog/thunderstorm) weather days.
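One quick way to run that cross-check, a sketch that works while weathersit is still coded 1 (clear) to 3 (rain/snow) at this point in the script:

aggregate(cnt ~ weathersit, data = day, FUN = mean)  # mean daily rentals per weather class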

Visualizing Categorical Variables

day$season<- factor(day$season
                    ,levels = c(1,2,3,4)
                    ,labels = c("spring", "summer", "fall", "winter")
)


day$weathersit<-factor(day$weathersit
                       ,levels = c(3,2,1)
                       ,labels = c("Bad", "Normal", "Good")
                       ,ordered = TRUE)

day$holiday<- factor(day$holiday
                     ,levels = c(0,1)
                     ,labels = c("noholiday", "holiday")
)

day$workingday<-factor(day$workingday
                       ,levels = c(0,1)
                       ,labels = c("nonworking", "working")
)

day$weekday<-factor(day$weekday
                    ,levels = c(0,1,2,3,4,5,6)
                    ,labels = c("sun", "mon","tue","wed","thur","fri","sat")
)

##Now we will look at the relationship between cnt (the dependent variable) and the categorical independent variables.

ggplot(day, aes(x = fct_infreq(season), y = cnt, fill = season))+ geom_bar(stat = "identity")+ labs(title = "Bike count vs Season", x = "Season", y = "Count of bikes") + theme(legend.position = "right") 

ggplot(day, aes(x=fct_infreq(weathersit), y=cnt, fill=weathersit)) + geom_bar(stat="identity")+labs(title = "Bike count vs Weather Condition", x = "Weather", y = "Count of bikes") + theme(legend.position = "right")

Avg_Bike_Rental1 <- day %>% group_by(holiday) %>% summarise(mean = mean(cnt)) 

ggplot(Avg_Bike_Rental1, aes(x = holiday, y = mean, fill = holiday))+ geom_bar(stat = "identity")+theme_minimal()+ labs(title = "Average Daily Bike Rentals", x = "Holiday", y = "Count of bikes") + scale_fill_manual(name = "Holiday", labels = c("Noholiday", "Holiday"), values = c("hotpink2", "cyan3"))

Avg_Bike_Rental <- day %>% group_by(workingday) %>% summarise(mean = mean(cnt))

ggplot(Avg_Bike_Rental, aes(x = workingday, y = mean, fill = workingday)) + geom_bar(stat = "identity", position = "dodge")+ labs(title = "Average Daily Bike Rentals", x = "Workingday", y = "Count of bikes") + scale_fill_manual(name = "Workingday", labels = c("Nonworking", "Working"), values = c("hotpink2", "cyan3"))

The above graphs show that people like to ride in good weather, that spring has the fewest bike users, and that working people use bikes more than non-working people on weekdays. Users rent bikes mostly in good weather, moderately in normal weather, and least in bad weather. There is only a small difference in average daily bike rentals between working days and holidays.

Compare casual and registered users with working days

ggplot(day, aes(x = casual, y = registered, color = workingday))+
  geom_point()+
  labs(title = "Relation Between Bike Counts (Casual & Registered) on Working vs Non-working Days")+
   scale_color_manual(values=c("deeppink", "turquoise2")) +
  xlab("Casual Bike Counts") +
  ylab("Registered Bike Counts")

The graph shows that working people are mostly registered users and ride mainly on weekdays, while non-working people are mostly casual riders who prefer weekends and holidays.

Average user counts by Monthly & Daily

#Monthly
boxplot(day$cnt~day$mnth,xlab="mnth", ylab="count of users", col= "violetred4", 
        main = "Monthly Bike Users")

#Daily
date=substr(day$dteday,1,10)
days<-weekdays(as.Date(date))
day$days=days

par(mfrow=c(1,1))
boxplot(day$cnt~day$days,xlab="days", ylab="count of users", col = "steelblue4",
        main = "Daily Bike Users")

The bike rental trend initially increases and then decreases over the months, showing cyclical behavior. Rentals rise through spring, peak in summer and early fall, then decline through late fall and winter back toward the spring level.

Time to move to the next step, data manipulation. In this step we change and adjust a few data values to make the data more sensible, based on our EDA and prior knowledge of the subject.

Data Manipulation and Preparation

Once we completed our EDA and had a fair understanding of the data, we went ahead and manipulated it to make it ready for modelling. We:
• changed categorical variables to dummy variables using one-hot encoding
• dropped columns: instant (simply a row count), dteday (date of the row), casual (number of casual rentals on a day), registered (number of registered rentals on a day); atemp (feels-like temperature) is excluded later in the models because of its collinearity with temp
• transformed months into quarters for ease of modelling
• brought all the columns back together so the final data is ready for modelling

#Lets remove variables which are not important. 

day$instant<-NULL
day$dteday<- NULL
day$casual<-NULL
day$registered<- NULL

#We removed casual, registered, dteday, and instant from the data before fitting the linear
#regression: casual and registered are components of cnt, and dteday is not a single independent variable.

##Transform Month into quarters for dummy variables

day$Quarter <- ceiling(as.numeric(day$mnth) / 3)
day$Quarter<- factor(day$Quarter)
day$mnth = NULL
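For clarity, the ceiling(month / 3) transform maps the twelve months onto four quarters; a few example inputs:

ceiling(c(1, 3, 4, 7, 12) / 3)  # returns 1 1 2 3 4: Jan-Mar -> Q1, ..., Oct-Dec -> Q4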

##Convert workingday and holiday to numeric: they are binary, so we don't need dummy variables for them. (Note: as.numeric() on a factor returns the level index, i.e. 1/2 here rather than the original 0/1, which is why the model output later shows holiday2 and workingday2.)

day$holiday<- as.numeric(day$holiday)
day$workingday<- as.numeric(day$workingday)

#Here we will create dummy variables for factor variables

factor_variables <- sapply(day,is.factor)
day_factor <- day[,factor_variables]

factor.names <- names(day_factor)
day_factor <- as.data.frame(day_factor)
day_factor <- acm.disjonctif(day_factor)

#Now we will merge this data with our original data
day <- day[,-which(names(day) %in% factor.names)]

day <- cbind(day,day_factor)

rm(day_factor,factor_variables,factor.names)

nums <- unlist(lapply(day, is.numeric))  
day<-day[,nums]

day$cnt<- as.numeric(day$cnt)
day$yr<- as.factor(day$yr)

##Again we will transform holiday and workingday as factor for modeling

day$holiday<- as.factor(day$holiday)
day$workingday<- as.factor(day$workingday)
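As a quick illustration of what acm.disjonctif() did above: it expands each factor into one 0/1 indicator column per level, named variable.level, which is where names like season.spring and weathersit.Bad in the model output come from. A toy example:

# Toy factor expanded into indicator columns (season.fall, season.spring)
acm.disjonctif(data.frame(season = factor(c("spring", "fall", "spring"))))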

Check the Assumptions

Before building models, we checked the assumptions for linear regression, namely:
• normality
• linearity
• additivity
• homogeneity/homoscedasticity

#1. Linearity
linear<- lm(cnt~ ., data = day)
summary(linear)
## 
## Call:
## lm(formula = cnt ~ ., data = day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3602.3  -370.9    70.4   486.0  3118.3 
## 
## Coefficients: (5 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3256.594    256.230  12.710  < 2e-16 ***
## yr1                2024.596     60.791  33.304  < 2e-16 ***
## holiday2           -621.598    215.983  -2.878  0.00412 ** 
## workingday2          -8.478    112.427  -0.075  0.93991    
## temp                 82.327     34.003   2.421  0.01572 *  
## atemp                32.252     30.166   1.069  0.28537    
## hum                 -11.888      2.944  -4.038 5.98e-05 ***
## windspeed           -39.696      6.397  -6.205 9.27e-10 ***
## season.spring     -1898.018    166.722 -11.384  < 2e-16 ***
## season.summer      -846.833    193.271  -4.382 1.36e-05 ***
## season.fall       -1143.082    182.589  -6.260 6.64e-10 ***
## season.winter            NA         NA      NA       NA    
## weekday.sun        -453.128    111.939  -4.048 5.73e-05 ***
## weekday.mon        -233.476    114.920  -2.032  0.04256 *  
## weekday.tue        -141.674    112.803  -1.256  0.20955    
## weekday.wed         -66.537    113.160  -0.588  0.55673    
## weekday.thur        -42.771    112.531  -0.380  0.70400    
## weekday.fri              NA         NA      NA       NA    
## weekday.sat              NA         NA      NA       NA    
## weathersit.Bad    -1948.840    205.379  -9.489  < 2e-16 ***
## weathersit.Normal  -458.701     80.269  -5.715 1.62e-08 ***
## weathersit.Good          NA         NA      NA       NA    
## Quarter.1           436.168    163.015   2.676  0.00763 ** 
## Quarter.2           548.426    203.325   2.697  0.00716 ** 
## Quarter.3           602.136    187.064   3.219  0.00135 ** 
## Quarter.4                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 808.3 on 710 degrees of freedom
## Multiple R-squared:  0.8307, Adjusted R-squared:  0.8259 
## F-statistic: 174.2 on 20 and 710 DF,  p-value: < 2.2e-16
#Create studentized residuals and check linearity with a Q-Q plot
standardized = rstudent(linear) 
qqnorm(standardized) 
abline(0,1) 

#2 Normality

hist(standardized, breaks = 15)

mean(linear$residuals)
## [1] 1.668786e-14
#3 Homogeneity/Homoscedasticity
fitvalues = scale(linear$fitted.values)
plot(fitvalues, standardized)
abline(0,0) 
abline(v = 0) 

plot(linear, 1)
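As a formal complement to these visual checks (an optional addition, assuming the lmtest package is installed):

library(lmtest)
bptest(linear)                  # Breusch-Pagan test for heteroscedasticity
shapiro.test(rstudent(linear))  # Shapiro-Wilk test on the studentized residuals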

Results: the residuals are close to normally distributed, and the data satisfies the other assumptions of linear regression reasonably well.

Model Building process

Split final dataset into train and test dataset
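The split itself is not shown in the original code. A minimal sketch consistent with the model output below (548 training rows, matching the residual degrees of freedom plus coefficients in the summaries, and 183 test rows, i.e. roughly 75/25; the seed is illustrative):

set.seed(123)  # illustrative seed; the original seed/split is not shown
train_idx <- sample(seq_len(nrow(day)), size = floor(0.75 * nrow(day)))  # 548 rows
train <- day[train_idx, ]
test <- day[-train_idx, ]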

#We selected a multiple linear regression model because our response variable is numeric and we use more than two predictor variables.

##Building a model without dteday, casual, registered, and instant: cnt includes both casual and registered, and dteday is not an independent variable but consists of information that overlaps with variables such as month, working day, and holiday.

model1<- lm(cnt ~ temp +atemp+ hum +windspeed, data = train)
summary(model1)
## 
## Call:
## lm(formula = cnt ~ temp + atemp + hum + windspeed, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4808.0 -1005.9  -160.8  1131.6  3489.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3553.600    450.064   7.896 1.61e-14 ***
## temp        -142.794    103.647  -1.378  0.16887    
## atemp        274.438     95.917   2.861  0.00438 ** 
## hum          -29.487      4.425  -6.664 6.55e-11 ***
## windspeed    -64.790     12.830  -5.050 6.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1420 on 543 degrees of freedom
## Multiple R-squared:  0.4484, Adjusted R-squared:  0.4444 
## F-statistic: 110.4 on 4 and 543 DF,  p-value: < 2.2e-16
prediction<- predict(model1, newdata = train)
prediction1<- predict(model1, newdata = test)
R2(prediction, train$cnt)
## [1] 0.4484263
R2(prediction1, test$cnt)
## [1] 0.4718517
mean((test$cnt- prediction1 )^2)
## [1] 2199846
RMSE(prediction1, test$cnt)
## [1] 1483.188
AIC(model1)
## [1] 9517.083
BIC(model1)
## [1] 9542.92
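Since the same block of metrics is repeated for every model below, a small helper (hypothetical, not part of the original analysis) could collect them in one call:

# Gather test-set R2/RMSE plus AIC/BIC for any fitted lm
eval_model <- function(fit, test_df) {
  pred <- predict(fit, newdata = test_df)
  c(R2 = R2(pred, test_df$cnt), RMSE = RMSE(pred, test_df$cnt),
    AIC = AIC(fit), BIC = BIC(fit))
}
eval_model(model1, test)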
#model2
model2<- lm(cnt~ temp+ hum+ windspeed, data= day)
summary(model2)
## 
## Call:
## lm(formula = cnt ~ temp + hum + windspeed, data = day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4780.5 -1082.6   -62.2  1056.5  3653.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4084.363    337.862  12.089  < 2e-16 ***
## temp         161.598      7.148  22.606  < 2e-16 ***
## hum          -31.001      3.840  -8.073 2.83e-15 ***
## windspeed    -71.745     10.581  -6.781 2.48e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1425 on 727 degrees of freedom
## Multiple R-squared:  0.4609, Adjusted R-squared:  0.4587 
## F-statistic: 207.2 on 3 and 727 DF,  p-value: < 2.2e-16
prediction02<- predict(model2, newdata = train)
prediction2<- predict(model2, newdata = test)
R2(prediction02, train$cnt)
## [1] 0.4394869
R2(prediction2, test$cnt)
## [1] 0.5262855
mean((test$cnt - prediction2)^2)
## [1] 1983840
RMSE(prediction2, test$cnt)
## [1] 1408.489
AIC(model2)
## [1] 12697.73
BIC(model2)
## [1] 12720.7
#model3
model3<- lm(cnt~ .-atemp, data = day) 
summary(model3)
## 
## Call:
## lm(formula = cnt ~ . - atemp, data = day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3621.8  -361.0    66.7   484.8  3130.6 
## 
## Coefficients: (5 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3320.007    249.295  13.318  < 2e-16 ***
## yr1                2023.652     60.791  33.289  < 2e-16 ***
## holiday2           -637.249    215.508  -2.957  0.00321 ** 
## workingday2         -14.233    112.309  -0.127  0.89919    
## temp                117.670      7.963  14.776  < 2e-16 ***
## hum                 -11.702      2.939  -3.981 7.55e-05 ***
## windspeed           -40.954      6.289  -6.512 1.40e-10 ***
## season.spring     -1901.722    166.703 -11.408  < 2e-16 ***
## season.summer      -843.252    193.261  -4.363 1.47e-05 ***
## season.fall       -1153.137    182.365  -6.323 4.52e-10 ***
## season.winter            NA         NA      NA       NA    
## weekday.sun        -451.536    111.940  -4.034 6.08e-05 ***
## weekday.mon        -223.863    114.579  -1.954  0.05112 .  
## weekday.tue        -134.416    112.610  -1.194  0.23302    
## weekday.wed         -62.752    113.116  -0.555  0.57923    
## weekday.thur        -36.108    112.370  -0.321  0.74805    
## weekday.fri              NA         NA      NA       NA    
## weekday.sat              NA         NA      NA       NA    
## weathersit.Bad    -1962.833    204.982  -9.576  < 2e-16 ***
## weathersit.Normal  -461.834     80.223  -5.757 1.27e-08 ***
## weathersit.Good          NA         NA      NA       NA    
## Quarter.1           432.972    163.004   2.656  0.00808 ** 
## Quarter.2           538.345    203.126   2.650  0.00822 ** 
## Quarter.3           586.993    186.546   3.147  0.00172 ** 
## Quarter.4                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 808.4 on 711 degrees of freedom
## Multiple R-squared:  0.8304, Adjusted R-squared:  0.8259 
## F-statistic: 183.2 on 19 and 711 DF,  p-value: < 2.2e-16
prediction03<- predict(model3, newdata = train)
prediction3<- predict(model3, newdata = test)
R2(prediction03, train$cnt)
## [1] 0.8258697
R2(prediction3, test$cnt)
## [1] 0.8444828
mean(( test$cnt - prediction3)^2)
## [1] 648419.4
RMSE(prediction3, test$cnt)
## [1] 805.2449
AIC(model3)
## [1] 11884.31
BIC(model3)
## [1] 11980.79
#model4
model4<- lm(cnt~ .- atemp-workingday-weekday.tue-weekday.mon-weekday.fri- weekday.tue-weekday.wed-weekday.thur-weekday.sat, data = train)
summary(model4)
## 
## Call:
## lm(formula = cnt ~ . - atemp - workingday - weekday.tue - weekday.mon - 
##     weekday.fri - weekday.tue - weekday.wed - weekday.thur - 
##     weekday.sat, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3340.9  -360.8    41.6   466.0  3244.6 
## 
## Coefficients: (3 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3302.388    277.250  11.911  < 2e-16 ***
## yr1                2016.846     70.087  28.776  < 2e-16 ***
## holiday2           -727.096    227.812  -3.192 0.001498 ** 
## temp                104.412      9.047  11.541  < 2e-16 ***
## hum                  -9.201      3.324  -2.768 0.005839 ** 
## windspeed           -41.064      7.246  -5.667 2.38e-08 ***
## season.spring     -1936.103    191.145 -10.129  < 2e-16 ***
## season.summer      -935.975    225.211  -4.156 3.77e-05 ***
## season.fall       -1256.592    220.033  -5.711 1.87e-08 ***
## season.winter            NA         NA      NA       NA    
## weekday.sun        -418.416     97.828  -4.277 2.24e-05 ***
## weathersit.Bad    -2044.731    227.786  -8.977  < 2e-16 ***
## weathersit.Normal  -473.299     91.218  -5.189 3.02e-07 ***
## weathersit.Good          NA         NA      NA       NA    
## Quarter.1           480.468    186.940   2.570 0.010435 *  
## Quarter.2           743.765    237.675   3.129 0.001848 ** 
## Quarter.3           885.809    226.696   3.907 0.000105 ***
## Quarter.4                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 806.8 on 533 degrees of freedom
## Multiple R-squared:  0.8251, Adjusted R-squared:  0.8206 
## F-statistic: 179.7 on 14 and 533 DF,  p-value: < 2.2e-16
prediction04<- predict(model4, newdata = train)
prediction4<- predict(model4, newdata = test)
R2(prediction04, train$cnt)
## [1] 0.8251444
R2(prediction4, test$cnt)
## [1] 0.8361154
mean(( test$cnt - prediction4 )^2)
## [1] 686811.8
RMSE(prediction4, test$cnt)
## [1] 828.7411
AIC(model4)
## [1] 8907.532
BIC(model4)
## [1] 8976.432

Interpretation of Models

The first model shows that humidity and windspeed are good predictors (p < 0.05), with an adjusted R² of 0.44. Temperature and feels-like temperature show multicollinearity because they are highly correlated.

To avoid multicollinearity, we removed feels-like temperature in model 2 and got slightly better results, with an adjusted R² of about 0.46 (note that model 2 was fit on the full dataset rather than the training set). To get better predictions, we try additional variables in the next model.

The third model includes all variables except feels-like temperature (fit, like model 2, on the full dataset) and fits well, accounting for about 83% of the variance. But it has an RSE of 808 with 19 predictors, several of which are not significant. So we run model 4 with the insignificant variables removed.

We ran one more model with fewer variables, and it also fits very well, with an adjusted R² of 0.82, meaning about 82% of the variance is explained by model 4 (its test-set R² is 0.84). All retained variables are significant at p < 0.05. Model 4 also has a lower AIC and BIC than model 1, the other model fit on the training set, and a lower AIC/BIC indicates a more parsimonious model relative to one with a higher AIC. Based on these results, we select model 4.
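Because models 2 and 3 were fit on the full dataset, their AIC/BIC values are not directly comparable to the train-only models; a like-for-like comparison (a sketch, not part of the original analysis) would refit them on the training set first:

model2_train <- update(model2, data = train)
model3_train <- update(model3, data = train)
AIC(model1, model2_train, model3_train, model4)  # now on a common dataset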

The Best Model

plot(model4, col = "gold")

We built multiple linear regressions by putting all variables against the response variable and then removing the insignificant predictors from the earlier models. The best-fit model achieved an adjusted R² of 0.82 on the training set (0.84 on the test set), which indicates a good fit, and the p values for almost all predictors are significant. Checking the residual plot and Q-Q plot, we can see that the residuals show no strong pattern and are approximately normally distributed; the residual plot shows a slight curve but stays close to a straight line, which means the model fits the data well.

Prediction

par(mfrow= c(1,1)) 
model4_step<- step(model4)
## Start:  AIC=7350.38
## cnt ~ (yr + holiday + workingday + temp + atemp + hum + windspeed + 
##     season.spring + season.summer + season.fall + season.winter + 
##     weekday.sun + weekday.mon + weekday.tue + weekday.wed + weekday.thur + 
##     weekday.fri + weekday.sat + weathersit.Bad + weathersit.Normal + 
##     weathersit.Good + Quarter.1 + Quarter.2 + Quarter.3 + Quarter.4) - 
##     atemp - workingday - weekday.tue - weekday.mon - weekday.fri - 
##     weekday.tue - weekday.wed - weekday.thur - weekday.sat
## 
## 
## Step:  AIC=7350.38
## cnt ~ yr + holiday + temp + hum + windspeed + season.spring + 
##     season.summer + season.fall + season.winter + weekday.sun + 
##     weathersit.Bad + weathersit.Normal + weathersit.Good + Quarter.1 + 
##     Quarter.2 + Quarter.3
## 
## 
## Step:  AIC=7350.38
## cnt ~ yr + holiday + temp + hum + windspeed + season.spring + 
##     season.summer + season.fall + season.winter + weekday.sun + 
##     weathersit.Bad + weathersit.Normal + Quarter.1 + Quarter.2 + 
##     Quarter.3
## 
## 
## Step:  AIC=7350.38
## cnt ~ yr + holiday + temp + hum + windspeed + season.spring + 
##     season.summer + season.fall + weekday.sun + weathersit.Bad + 
##     weathersit.Normal + Quarter.1 + Quarter.2 + Quarter.3
## 
##                     Df Sum of Sq       RSS    AIC
## <none>                           346926771 7350.4
## - Quarter.1          1   4299679 351226450 7355.1
## - hum                1   4986587 351913358 7356.2
## - Quarter.2          1   6374048 353300819 7358.4
## - holiday            1   6630394 353557165 7358.7
## - Quarter.3          1   9938086 356864857 7363.9
## - season.summer      1  11242414 358169185 7365.9
## - weekday.sun        1  11907042 358833813 7366.9
## - weathersit.Normal  1  17523628 364450399 7375.4
## - windspeed          1  20902027 367828798 7380.4
## - season.fall        1  21228613 368155384 7380.9
## - weathersit.Bad     1  52447862 399374633 7425.5
## - season.spring      1  66779628 413706399 7444.8
## - temp               1  86695162 433621934 7470.6
## - yr                 1 538988591 885915362 7862.1
plot(test$cnt, main = "Linear Model", ylab = "Test Set Rental Count", pch = 20)
points(predict(model4_step, newdata = test), col = "red", pch = 20)
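A legend (a small addition, not in the original plot) makes the overlay easier to read:

legend("topleft", legend = c("actual", "predicted (model4_step)"),
       col = c("black", "red"), pch = 20)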

When we plot the predictions of the best-fit model (model 4) on the test dataset against the true values, the graph shows that the spread of the predictions closely tracks the response variable. Still, we cannot rely on this too heavily: we worked with a small dataset, and it lacks variables such as hourly usage and bike-station information that could further improve accuracy.

Conclusion

“An ideal day for the highest bike rentals would be a warm working day in summer/fall with low humidity and light wind.” We can conclude that the count of bike rentals on a day depends on a number of factors, both seasonal and weather related.
● How well can you predict your response variable? Our model is a good fit and can predict the response variable with good accuracy, as the graph above of actual test data against the predictions shows.
● What are the caveats to your analysis? We did not incorporate the company's growth plans in our analysis. Bike rentals increased from 2011 to 2012 over the same time period, so using our model to predict the future would not be right, as it does not incorporate the company's expansion plans.
● Does this data set lack information that you would have liked to use? We did not have hourly data, which could have validated our finding that working people who are registered users ride bikes to commute to the office. Station-level data could also help improve the model.