About Project

This project conducts exploratory data analysis and fits linear regression models on the Bike Sharing Demand data set, which was provided by Hadi Fanaee Tork using data from Capital Bikeshare.

Objective

The main objective of this project is to explore the data and build a linear regression model that tries to predict bike sharing demand.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

bikeshare <- read.csv("C:/Users/aaror/Desktop/DataAnalytics-Notes/Predictive Analytics And Data Science/R Prog -data-science-and-machine-learning-bootcamp/CSV files for ML Projects/bikeshare.csv")
View(bikeshare)

Features of Data

print(head(bikeshare))

##              datetime season holiday workingday weather temp  atemp
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880
##   humidity windspeed casual registered count
## 1       81    0.0000      3         13    16
## 2       80    0.0000      8         32    40
## 3       80    0.0000      5         27    32
## 4       75    0.0000      3         10    13
## 5       75    0.0000      0          1     1
## 6       75    6.0032      0          1     1

The data set contains the following features:

datetime - hourly date + timestamp

season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

holiday - whether the day is considered a holiday

workingday - whether the day is neither a weekend nor holiday

weather -

 1: Clear, Few clouds, Partly cloudy 
 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 
 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 
 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

temp - temperature in Celsius

atemp - “feels like” temperature in Celsius

humidity - relative humidity

windspeed - wind speed

casual - number of non-registered user rentals initiated

registered - number of registered user rentals initiated

count - number of total rentals
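The integer codes listed above can be turned into labelled factors so that plots and summaries show names instead of numbers. A minimal sketch, assuming a small made-up vector of season codes (season_code and season_label are illustrative names, not objects from this project):

```r
# Toy codes for illustration only; the real values come from bikeshare.csv.
season_code <- c(1, 2, 3, 4, 1)

# Map the integer codes to the labels given in the feature description.
season_label <- factor(season_code,
                       levels = 1:4,
                       labels = c("spring", "summer", "fall", "winter"))

print(season_label)
print(table(season_label))
```

The same pattern would apply to the weather codes, with shorter labels of your own choosing.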

What are we trying to predict?

We are trying to predict the count variable, i.e. the total number of rentals, which serves as the dependent variable in our analysis.
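In the six rows printed above, count is exactly casual + registered. If that holds for the whole file (an assumption, since only these rows are shown here), the two columns are components of the target rather than independent predictors, which would explain why they are excluded from the regression formulas. A small check on the printed rows (rentals and components_add_up are illustrative names):

```r
# casual, registered and count for the first six rows, copied from the head() output.
rentals <- data.frame(
  casual     = c(3, 8, 5, 3, 0, 0),
  registered = c(13, 32, 27, 10, 1, 1),
  count      = c(16, 40, 32, 13, 1, 1)
)

# TRUE when every row's total equals the sum of its two components.
components_add_up <- all(rentals$casual + rentals$registered == rentals$count)
print(components_add_up)
```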

Exploratory Data Analysis

First I will conduct exploratory data analysis, which is an essential step to understand the data.

Loading the ggplot library

library(ggplot2)

Scatter Plot to show the relationship between count (number of total rentals) and temp (temperature in Celsius)

ggplot(data = bikeshare, aes(temp,count)) + geom_point(alpha = 0.3, aes(color = temp)) + theme_bw()

The scatter plot above shows that the count, i.e. the total number of rentals, tends to increase as the temperature increases.

Scatter Plot to show the relationship between count (number of total rentals) and date time.

bikeshare$datetime <- as.POSIXct(bikeshare$datetime)
pl <- ggplot(bikeshare,aes(datetime,count)) + geom_point(aes(color=temp),alpha=0.5)
pl + scale_color_continuous(low = '#55D8CE',high = '#FF6E2E') + theme_bw()

There is a clear seasonal trend: total rentals decrease during winter, i.e. January and February, and increase during summer.

The other evident trend is that rental counts increase from 2011 to 2013.

Correlation between temperature and count.

cor(bikeshare[,c('temp','count')])

##            temp     count
## temp  1.0000000 0.3944536
## count 0.3944536 1.0000000

The correlation between temp and count is positive but not very strong (about 0.39).
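For a linear model with a single predictor, the squared correlation equals the model's R-squared, so a correlation of 0.3944536 implies that temp alone can explain only about 16% of the variance in count. The sketch below checks the identity on made-up data (x and y are stand-ins, not the bikeshare columns) and then squares the reported correlation:

```r
set.seed(1)  # toy data, not the bikeshare values
x <- runif(200, min = 0, max = 40)     # stand-in for temp
y <- 6 + 9 * x + rnorm(200, sd = 150)  # stand-in for count

r <- cor(x, y)
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared

# For a one-predictor linear model these two numbers coincide.
print(c(from_cor = r^2, from_lm = r_squared))

# Applied to the correlation reported above.
reported <- 0.3944536^2
print(round(reported, 4))
```

The rounded value, 0.1556, is the same figure that appears as Multiple R-squared in the summary of the temperature-only model.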

Box Plot

ggplot(bikeshare,aes(factor(season),count)) + geom_boxplot(aes(color = factor(season))) + theme_bw()

The box plot of bike rentals by season shows that a straight line cannot capture the non-linear relationship between season and count, and that there are more rentals in winter than in spring.

Feature Engineering

As part of feature engineering I have added an hour column in the dataset.

bikeshare$hour <- sapply(bikeshare$datetime,function(x){format(x,"%H")})
bikeshare$hour <- sapply(bikeshare$hour,as.numeric)
print(head(bikeshare))

##              datetime season holiday workingday weather temp  atemp
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880
##   humidity windspeed casual registered count hour
## 1       81    0.0000      3         13    16    0
## 2       80    0.0000      8         32    40    1
## 3       80    0.0000      5         27    32    2
## 4       75    0.0000      3         10    13    3
## 5       75    0.0000      0          1     1    4
## 6       75    6.0032      0          1     1    5
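Because format() works on a whole vector of timestamps, the hour can also be extracted without sapply(). A minimal sketch using three timestamps from the output above; the UTC time zone and the names stamps and hour_of_day are assumptions made for this example:

```r
# Timestamps copied from the head() output. UTC is assumed here so the
# result does not depend on the machine's local time zone.
stamps <- as.POSIXct(c("2011-01-01 00:00:00",
                       "2011-01-01 01:00:00",
                       "2011-01-01 05:00:00"),
                     tz = "UTC")

# format() is vectorized, so one call handles every timestamp.
hour_of_day <- as.numeric(format(stamps, "%H"))
print(hour_of_day)
```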

Relationship between hour of the working day and the count of bikes rented.

pl1 <- ggplot(filter(bikeshare,workingday == 1), aes(hour,count))

## Warning: package 'bindrcpp' was built under R version 3.3.3

pl1 <- pl1+ geom_point()
print(pl1)

This scatter plot shows an interesting trend: on working days the count of rented bikes peaks in the morning around 8 AM, when people leave for the office, and in the evening around 5 PM, when they leave the office.

pl1 <- pl1 + geom_point(position=position_jitter(w=1,h=0),aes(color = temp),alpha=0.5)

pl1 <- pl1 + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
print(pl1 + theme_bw())

This plot gives an interesting finding regarding temperature and bike rental count: as the temperature increases the rental count increases, and at cold temperatures the rental count declines.

Relationship between hour of the non-working day and the count of bikes rented.

pl2 <- ggplot(filter(bikeshare,workingday == 0), aes(hour,count))
pl2 <- pl2+ geom_point()
pl2 <- pl2 + geom_point(position=position_jitter(w=1,h=0),aes(color = temp),alpha=0.5)

pl2 <- pl2 + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
print(pl2 + theme_bw())

On non-working days there are very few rentals during the morning hours, and the count increases after noon.

Model Building

This model will be predicting the count of the bike rental based on the temp variable.

temp.model <- lm(count ~ temp, bikeshare)
print(summary(temp.model))

## 
## Call:
## lm(formula = count ~ temp, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.32 -112.36  -33.36   78.98  741.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0462     4.4394   1.362    0.173    
## temp          9.1705     0.2048  44.783   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared:  0.1556, Adjusted R-squared:  0.1555 
## F-statistic:  2006 on 1 and 10884 DF,  p-value: < 2.2e-16

Model Interpretation

Based on the Intercept value of 6.0462, the linear regression model predicts about 6 bike rentals when the temperature is 0 degrees Celsius.

For the temp variable the Estimate is 9.1705, which signifies that a temperature increase of 1 degree Celsius, holding all else equal, is associated with an increase of about 9.17 rentals.

These findings describe an association, not causation. Beta 1 would be negative if an increase in temperature were associated with a decrease in rentals.
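The two coefficients can be read directly off the fitted line: the intercept is the value of the line at 0 degrees, and the slope is the change in the prediction for each additional degree. A small sketch using the rounded coefficients from the summary output (predict_count is a hypothetical helper written for this example, not part of the project code):

```r
# Coefficients copied from the summary(temp.model) output above.
intercept <- 6.0462
slope     <- 9.1705

# Hypothetical helper: the fitted line evaluated at a temperature in Celsius.
predict_count <- function(temp) intercept + slope * temp

at_zero    <- predict_count(0)                       # the intercept
per_degree <- predict_count(11) - predict_count(10)  # the slope

print(at_zero)
print(per_degree)
```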

Next, we want to know how many bike rentals the model predicts at a temperature of 25 degrees Celsius.

How many rented bikes at temperature 25 degrees celsius

6.0462 + 9.1705 * 25

## [1] 235.3087

temp.test <- data.frame(temp=c(25))
predict(temp.model,temp.test)

##        1 
## 235.3097

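Based on the above calculation, the model predicts about 235 bike rentals at a temperature of 25 degrees Celsius. The small difference between the two results (235.3087 versus 235.3097) comes from the rounded coefficients used in the manual calculation.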
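A point prediction says nothing about its uncertainty. As a rough sketch, the residual standard error of 166.5 from the model summary can be used to approximate a 95% range around the prediction. This assumes normally distributed errors with constant spread and ignores the uncertainty in the coefficients themselves, so it is only an approximation:

```r
fitted_25 <- 235.3097  # predict() result shown above
rse       <- 166.5     # residual standard error from summary(temp.model)

z <- qnorm(0.975)      # about 1.96 for a 95% range
lower <- fitted_25 - z * rse
upper <- fitted_25 + z * rse

print(round(c(lower = lower, upper = upper), 1))
```

The lower end falls below zero, which is impossible for a rental count and hints at how loosely this model fits. With the fitted model available, predict(temp.model, temp.test, interval = "prediction") would give the exact interval.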

Building Second Model with more features

The second model attempts to predict count from the following features:

season

holiday

workingday

weather

temp

humidity

windspeed

hour (entered as a numeric variable in the fit below)

model <- lm(count ~ . -casual - registered - datetime - atemp, bikeshare)
print(summary(model))

## 
## Call:
## lm(formula = count ~ . - casual - registered - datetime - atemp, 
##     data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -324.61  -96.88  -31.01   55.27  688.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.91369    8.45147   5.551 2.91e-08 ***
## season       21.70333    1.35409  16.028  < 2e-16 ***
## holiday     -10.29914    8.79069  -1.172    0.241    
## workingday   -0.71781    3.14463  -0.228    0.819    
## weather      -3.20909    2.49731  -1.285    0.199    
## temp          7.01953    0.19135  36.684  < 2e-16 ***
## humidity     -2.21174    0.09083 -24.349  < 2e-16 ***
## windspeed     0.20271    0.18639   1.088    0.277    
## hour          7.61283    0.21688  35.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared:  0.3344, Adjusted R-squared:  0.3339 
## F-statistic:   683 on 8 and 10877 DF,  p-value: < 2.2e-16
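In the formula above, `.` stands for every column of the data frame other than the response, and each `- name` term removes a column from that set. The sketch below shows which predictors remain, using a made-up one-row data frame that carries only a subset of the bikeshare column names (toy and predictors are illustrative names):

```r
# Toy data frame: only the column names matter for this example.
toy <- data.frame(datetime = "2011-01-01 00:00:00", season = 1, temp = 9.84,
                  atemp = 14.395, casual = 3, registered = 13,
                  count = 16, hour = 0)

f <- count ~ . - casual - registered - datetime - atemp

# terms() expands the dot against the data and applies the removals.
predictors <- attr(terms(f, data = toy), "term.labels")
print(predictors)
```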

Important Finding

This sort of model does not work well for our seasonal, time series data. We need a model that can account for this type of trend. Otherwise the model gets thrown off by the growth in our data set, accidentally attributing the increase to the winter season instead of recognizing that overall demand is growing.
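One limitation that is easy to illustrate is the single numeric slope for hour, which cannot follow the morning and evening peaks seen in the hourly plots. A possible alternative is to enter hour as a factor so that each hour gets its own level. The sketch below uses made-up data with two artificial peaks. It is not fitted on the bikeshare data and does not address the growth in demand over time:

```r
set.seed(42)  # toy data, not the bikeshare values
toy <- data.frame(hour = rep(0:23, times = 20))

# Made-up demand curve with peaks near 8 and 17, plus noise.
toy$count <- 50 + 200 * exp(-(toy$hour - 8)^2 / 4) +
  250 * exp(-(toy$hour - 17)^2 / 4) + rnorm(nrow(toy), sd = 20)

fit_numeric <- lm(count ~ hour, data = toy)          # one slope for hour
fit_factor  <- lm(count ~ factor(hour), data = toy)  # one level per hour

n_coef_numeric <- length(coef(fit_numeric))  # intercept + 1 slope
n_coef_factor  <- length(coef(fit_factor))   # intercept + 23 hour levels

print(c(numeric = n_coef_numeric, factor = n_coef_factor))
print(c(numeric = summary(fit_numeric)$r.squared,
        factor  = summary(fit_factor)$r.squared))
```

On this toy data the factor version fits the hourly shape far better, at the cost of 22 additional coefficients.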

Predicting Bike Sharing Demand

Ashish Arora

January 9, 2018
