This project conducts Exploratory Data Analysis and runs Linear Regression on the Bike Sharing Demand data set, which was provided by Hadi Fanaee Tork using data from Capital Bikeshare.
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The main objective of this project is to explore the data and build a Linear Regression model that tries to predict bike sharing demand.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bikeshare <- read.csv("C:/Users/aaror/Desktop/DataAnalytics-Notes/Predictive Analytics And Data Science/R Prog -data-science-and-machine-learning-bootcamp/CSV files for ML Projects/bikeshare.csv")
View(bikeshare)
print(head(bikeshare))
## datetime season holiday workingday weather temp atemp
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.395
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.635
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.635
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.395
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.395
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.880
## humidity windspeed casual registered count
## 1 81 0.0000 3 13 16
## 2 80 0.0000 8 32 40
## 3 80 0.0000 5 27 32
## 4 75 0.0000 3 10 13
## 5 75 0.0000 0 1 1
## 6 75 6.0032 0 1 1
The data set contains the following features:
weather -
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
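The weather codes are categories rather than quantities, so one option is to store them as a labelled factor. The sketch below is only an illustration: the code vector is made up, and the short labels are my own shorthand for the four categories listed above, not names from the data set.

```r
# Toy weather codes (made up for illustration).
weather_codes <- c(1, 1, 2, 3, 4, 2)

# Hypothetical short labels standing in for the four categories above.
weather_f <- factor(weather_codes,
                    levels = 1:4,
                    labels = c("clear", "mist", "light precip", "heavy precip"))

# Count how often each category occurs.
print(table(weather_f))
```

Passed to lm() as a factor, each category would get its own coefficient instead of a single slope across the codes 1 to 4.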
We are trying to predict the count variable, i.e. the total number of rentals, which serves as the dependent variable in our analysis.
First I will conduct exploratory data analysis, which is an essential step in understanding the data.
library(ggplot2)
ggplot(data = bikeshare, aes(temp,count)) + geom_point(alpha = 0.3, aes(color = temp)) + theme_bw()
The scatter plot above shows that as the temperature increases, the count (the total number of rentals) also increases.
bikeshare$datetime <- as.POSIXct(bikeshare$datetime)
pl <- ggplot(bikeshare,aes(datetime,count)) + geom_point(aes(color=temp),alpha=0.5)
pl + scale_color_continuous(low = '#55D8CE',high = '#FF6E2E') + theme_bw()
There is a clear seasonal trend: total rentals seem to decrease during winter, i.e. January and February, and increase during summer.
The other evident trend is that the number of rentals is increasing year over year from 2011 onward.
cor(bikeshare[,c('temp','count')])
## temp count
## temp 1.0000000 0.3944536
## count 0.3944536 1.0000000
The correlation between temp and count is positive but not very strong (about 0.39).
ggplot(bikeshare,aes(factor(season),count)) + geom_boxplot(aes(color = factor(season))) + theme_bw()
The box plot of rental count by season shows that a straight line cannot capture the non-linear relationship between season and count, and that there are more rentals in winter than in spring.
As part of feature engineering I have added an hour column in the dataset.
bikeshare$hour <- sapply(bikeshare$datetime,function(x){format(x,"%H")})
bikeshare$hour <- sapply(bikeshare$hour,as.numeric)
print(head(bikeshare))
## datetime season holiday workingday weather temp atemp
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.395
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.635
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.635
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.395
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.395
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.880
## humidity windspeed casual registered count hour
## 1 81 0.0000 3 13 16 0
## 2 80 0.0000 8 32 40 1
## 3 80 0.0000 5 27 32 2
## 4 75 0.0000 3 10 13 3
## 5 75 0.0000 0 1 1 4
## 6 75 6.0032 0 1 1 5
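The hour can also be extracted without sapply, since format() is vectorized over date-times. The sketch below uses the six timestamps shown in the output above; the `tz = "UTC"` argument is an assumption I make here so the result does not depend on the local time zone.

```r
# The six timestamps shown in the head() output above.
stamps <- c("2011-01-01 00:00:00", "2011-01-01 01:00:00",
            "2011-01-01 02:00:00", "2011-01-01 03:00:00",
            "2011-01-01 04:00:00", "2011-01-01 05:00:00")

# Assumption: parse in UTC so the extracted hour is the same on every machine.
dt <- as.POSIXct(stamps, tz = "UTC")

# Option 1: format the whole vector at once, then convert to numeric.
hour_fmt <- as.numeric(format(dt, "%H"))

# Option 2: read the hour field of the POSIXlt representation.
hour_lt <- as.POSIXlt(dt)$hour

print(hour_fmt)
print(hour_lt)
```

Both options return one hour per row and avoid calling a function once per element.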
pl1 <- ggplot(filter(bikeshare,workingday == 1), aes(hour,count))
## Warning: package 'bindrcpp' was built under R version 3.3.3
pl1 <- pl1+ geom_point()
print(pl1)
This scatter plot shows an interesting trend: on working days the count of rented bikes peaks in the morning around 8 AM, when people leave for the office, and in the evening around 5 PM, when people leave the office.
pl1 <- pl1 + geom_point(position=position_jitter(w=1,h=0),aes(color = temp),alpha=0.5)
pl1 <- pl1 + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
print(pl1 + theme_bw())
This plot gives an interesting finding regarding temperature and rental count: rentals increase as the temperature rises and decline at cold temperatures.
pl2 <- ggplot(filter(bikeshare,workingday == 0), aes(hour,count))
pl2 <- pl2+ geom_point()
pl2 <- pl2 + geom_point(position=position_jitter(w=1,h=0),aes(color = temp),alpha=0.5)
pl2 <- pl2 + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
print(pl2 + theme_bw())
On non-working days there are very few rentals in the morning hours, and the count increases in the afternoon.
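The hourly patterns read off the two plots can also be checked numerically by averaging count per hour within each day type. This is a minimal sketch on made-up data: the `toy` data frame and its peak sizes are invented for illustration and are not taken from bikeshare.csv.

```r
# Made-up data: one row per hour for a working day (1) and a non-working day (0).
toy <- data.frame(
  hour = rep(0:23, times = 2),
  workingday = rep(c(1, 0), each = 24)
)

# Invented counts: commute peaks at 8 and 17 on working days,
# a midday plateau on non-working days.
toy$count <- ifelse(toy$workingday == 1,
                    50 + 200 * (toy$hour %in% c(8, 17)),
                    50 + 100 * (toy$hour >= 11 & toy$hour <= 16))

# Mean count for every hour / day-type combination.
hourly <- aggregate(count ~ hour + workingday, data = toy, FUN = mean)

# Hours with the highest mean count on working days.
work <- hourly[hourly$workingday == 1, ]
peak_work <- work$hour[work$count == max(work$count)]
print(peak_work)
```

On the real data the same aggregate() call would be applied to bikeshare instead of `toy`.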
The first model predicts the bike rental count from the temp variable alone.
temp.model <- lm(count ~ temp, bikeshare)
print(summary(temp.model))
##
## Call:
## lm(formula = count ~ temp, data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.32 -112.36 -33.36 78.98 741.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0462 4.4394 1.362 0.173
## temp 9.1705 0.2048 44.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared: 0.1556, Adjusted R-squared: 0.1555
## F-statistic: 2006 on 1 and 10884 DF, p-value: < 2.2e-16
Based on the intercept of 6.0462, the linear regression model predicts about 6 bike rentals when the temperature is 0.
For the temp variable, the estimated coefficient is 9.1705, which signifies that a temperature increase of 1 degree Celsius, holding all else equal, is associated with an increase of about 9.2 rentals.
These findings do not establish causation. Beta 1 (the temp coefficient) would be negative if an increase in temperature were associated with a decrease in rentals.
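For a regression with a single predictor, R-squared equals the squared correlation between the predictor and the response, which ties the summary above to the correlation computed earlier. The sketch below first checks this with the two reported figures, then repeats it on synthetic data; `x`, `y`, and their parameters are made up for illustration.

```r
# Figures reported above: cor(temp, count) and Multiple R-squared.
r_reported <- 0.3944536
r2_reported <- 0.1556
print(round(r_reported^2, 4))

# The same identity on synthetic data (invented for this sketch).
set.seed(42)
x <- runif(200, 0, 40)
y <- 6 + 9 * x + rnorm(200, sd = 150)

fit <- lm(y ~ x)
r2_from_fit <- summary(fit)$r.squared  # R-squared from the model
r2_from_cor <- cor(x, y)^2             # squared Pearson correlation

print(all.equal(r2_from_fit, r2_from_cor))
```

So the modest correlation and the low R-squared of the temp-only model are two views of the same fact: temp alone explains roughly 16% of the variance in count.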
Next, we want to know how many bikes the model predicts will be rented at a temperature of 25 degrees Celsius.
6.0462 + 9.1705 * 25
## [1] 235.3087
temp.test <- data.frame(temp=c(25))
predict(temp.model,temp.test)
## 1
## 235.3097
Both calculations predict about 235 bike rentals at 25 degrees Celsius. The small difference between 235.3087 and 235.3097 comes from using the rounded coefficients in the manual calculation.
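A point prediction says nothing about its uncertainty, and predict() can also return a prediction interval. The sketch below repeats the manual arithmetic with the rounded coefficients and then fits a model on synthetic data; `toy` and `toy.model` are hypothetical stand-ins, so the interval printed here is not the one for bikeshare.csv.

```r
# Arithmetic from this section, using the rounded coefficients from summary().
manual_rounded <- 6.0462 + 9.1705 * 25
print(manual_rounded)

# Synthetic stand-in data (made up for illustration).
set.seed(7)
toy <- data.frame(temp = runif(300, 0, 40))
toy$count <- 6 + 9 * toy$temp + rnorm(300, sd = 100)
toy.model <- lm(count ~ temp, data = toy)

new_obs <- data.frame(temp = 25)

# Intercept + slope * 25 with full-precision coefficients matches predict().
by_hand <- unname(coef(toy.model)[1] + coef(toy.model)[2] * 25)
by_predict <- unname(predict(toy.model, new_obs))
print(all.equal(by_hand, by_predict))

# 95% prediction interval for a single new observation at temp = 25.
pred_int <- predict(toy.model, new_obs, interval = "prediction", level = 0.95)
print(pred_int)
```

With a residual standard error as large as the one reported above, the interval around a single prediction should be expected to be wide.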
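The next model uses all remaining features, including the new hour column (entered here as a numeric variable, not a factor).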
model <- lm(count ~ . -casual - registered - datetime - atemp, bikeshare)
print(summary(model))
##
## Call:
## lm(formula = count ~ . - casual - registered - datetime - atemp,
## data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -324.61 -96.88 -31.01 55.27 688.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.91369 8.45147 5.551 2.91e-08 ***
## season 21.70333 1.35409 16.028 < 2e-16 ***
## holiday -10.29914 8.79069 -1.172 0.241
## workingday -0.71781 3.14463 -0.228 0.819
## weather -3.20909 2.49731 -1.285 0.199
## temp 7.01953 0.19135 36.684 < 2e-16 ***
## humidity -2.21174 0.09083 -24.349 < 2e-16 ***
## windspeed 0.20271 0.18639 1.088 0.277
## hour 7.61283 0.21688 35.102 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared: 0.3344, Adjusted R-squared: 0.3339
## F-statistic: 683 on 8 and 10877 DF, p-value: < 2.2e-16
This sort of model doesn't work well given the seasonal, time-series nature of our data. We need a model that can account for this type of trend. Otherwise the model gets thrown off by the growth in our data set, accidentally attributing it to the winter season instead of recognizing that overall demand is simply growing.
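One possible direction, offered only as a sketch and not as the method used in this analysis, is to let the model see the year separately from the season and to treat hour and season as factors. Everything below is an assumption for illustration: the `toy` data, the effect sizes, and the `year` column (which bikeshare.csv does not contain and would have to be derived from datetime).

```r
# Synthetic data with seasonal effects, commute peaks, and year-over-year growth.
set.seed(123)
toy <- expand.grid(hour = 0:23, season = 1:4, year = c(2011, 2012), rep = 1:5)
season_effect <- c(0, 60, 80, 40)[toy$season]     # made-up seasonal offsets
peak <- ifelse(toy$hour %in% c(8, 17), 200, 0)    # made-up commute peaks
growth <- ifelse(toy$year == 2012, 80, 0)         # made-up overall growth
toy$count <- 50 + season_effect + peak + growth + rnorm(nrow(toy), sd = 20)

# Baseline: season and hour as single numeric slopes, no year term.
fit_base <- lm(count ~ season + hour, data = toy)

# Extended: categorical season and hour, plus a separate year effect.
fit_ext <- lm(count ~ factor(season) + factor(hour) + factor(year), data = toy)

r2_base <- summary(fit_base)$r.squared
r2_ext <- summary(fit_ext)$r.squared
print(c(baseline = r2_base, extended = r2_ext))
```

In this toy setting the extended model can attribute growth to the year term rather than to a season. Whether the same holds on the real data would need to be checked, ideally on a hold-out period that comes later in time than the training data.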