About Project

This project conducts exploratory data analysis and fits linear regression models on the Bike Sharing Demand data set, which was provided by Hadi Fanaee Tork using data from Capital Bikeshare.

What Are Bike Sharing Systems?

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, renting, and returning a bike is automated via a network of kiosk locations throughout a city. Using these systems, people can rent a bike at one location and return it at a different one on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

Objective

The main objective of this project is to explore the data and build a linear regression model that predicts bike sharing demand.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
bikeshare <- read.csv("C:/Users/aaror/Desktop/DataAnalytics-Notes/Predictive Analytics And Data Science/R Prog -data-science-and-machine-learning-bootcamp/CSV files for ML Projects/bikeshare.csv")
View(bikeshare)

Features of Data

print(head(bikeshare))
##              datetime season holiday workingday weather temp  atemp
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880
##   humidity windspeed casual registered count
## 1       81    0.0000      3         13    16
## 2       80    0.0000      8         32    40
## 3       80    0.0000      5         27    32
## 4       75    0.0000      3         10    13
## 5       75    0.0000      0          1     1
## 6       75    6.0032      0          1     1

The data set contains the following features:

  • datetime - hourly date + timestamp
  • season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday - whether the day is considered a holiday
  • workingday - whether the day is neither a weekend nor holiday
  • weather -

     1: Clear, Few clouds, Partly cloudy
     2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
     3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
     4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
     
  • temp - temperature in Celsius
  • atemp - “feels like” temperature in Celsius
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - number of non-registered user rentals initiated
  • registered - number of registered user rentals initiated
  • count - number of total rentals
    What are we trying to predict?

    We are trying to predict the count variable, i.e. the total number of rentals, which is the dependent variable in this analysis.
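
In the rows printed by head() earlier, count is the sum of casual and registered, which is presumably why those two columns are left out as predictors when count is modelled. A small check on those six rows, retyped by hand (the names `first_rows` and `count_is_sum` are illustrative, not part of the original analysis):

```r
# The six rows shown in the head() output, retyped by hand.
first_rows <- data.frame(
  casual     = c(3, 8, 5, 3, 0, 0),
  registered = c(13, 32, 27, 10, 1, 1),
  count      = c(16, 40, 32, 13, 1, 1)
)

# TRUE when count equals casual + registered in every row.
count_is_sum <- all(first_rows$casual + first_rows$registered == first_rows$count)
print(count_is_sum)
```

This only checks the six displayed rows; running the same comparison on the full bikeshare data frame would be needed to confirm it for the whole data set.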

    Exploratory Data Analysis

    First, I conduct exploratory data analysis, which is an essential step in understanding the data.

    Loading the ggplot library

    library(ggplot2)

    Scatter Plot to show the relationship between count (number of total rentals) and temp (temperature in Celsius)

    ggplot(data = bikeshare, aes(temp,count)) + geom_point(alpha = 0.3, aes(color = temp)) + theme_bw()

    The scatter plot above shows that as the temperature increases, the count (the total number of rentals) also tends to increase.

    Scatter Plot to show the relationship between count (number of total rentals) and date time.

    bikeshare$datetime <- as.POSIXct(bikeshare$datetime)
    pl <- ggplot(bikeshare,aes(datetime,count)) + geom_point(aes(color=temp),alpha=0.5)
    pl + scale_color_continuous(low = '#55D8CE',high = '#FF6E2E') + theme_bw()

    There is a clear seasonal pattern: rentals drop during winter (January and February) and rise during summer.

    The other evident trend is that rental counts increase from 2011 to 2013.

    Correlation between temperature and count.

    cor(bikeshare[,c('temp','count')])
    ##            temp     count
    ## temp  1.0000000 0.3944536
    ## count 0.3944536 1.0000000

    The correlation between temp and count is positive but fairly weak (about 0.39).
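
One way to read this correlation: in a regression with a single predictor, the squared correlation equals the share of variance the fitted line explains. The sketch below squares the reported value and then checks the identity on made-up data (the `toy` data frame is invented for illustration and is not the bikeshare data):

```r
# Correlation between temp and count reported above.
r <- 0.3944536
r_squared <- r^2
print(round(r_squared, 4))

# Check the identity R^2 = r^2 on made-up data (not the bikeshare data).
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 3 + 2 * toy$x + rnorm(50, sd = 20)
fit <- lm(y ~ x, data = toy)
identity_holds <- isTRUE(all.equal(summary(fit)$r.squared, cor(toy$x, toy$y)^2))
print(identity_holds)
```

So a straight-line fit on temp alone can account for only about 16% of the variation in count.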

    Box Plot

    ggplot(bikeshare,aes(factor(season),count)) + geom_boxplot(aes(color = factor(season))) + theme_bw()

    The box plot of rentals by season shows that a straight line cannot capture the non-linear relationship between season and count, and that there are more rentals in winter than in spring.

    Feature Engineering

    As part of feature engineering I have added an hour column in the dataset.

    bikeshare$hour <- sapply(bikeshare$datetime,function(x){format(x,"%H")})
    bikeshare$hour <- sapply(bikeshare$hour,as.numeric)
    print(head(bikeshare))
    ##              datetime season holiday workingday weather temp  atemp
    ## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395
    ## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635
    ## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635
    ## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395
    ## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395
    ## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880
    ##   humidity windspeed casual registered count hour
    ## 1       81    0.0000      3         13    16    0
    ## 2       80    0.0000      8         32    40    1
    ## 3       80    0.0000      5         27    32    2
    ## 4       75    0.0000      3         10    13    3
    ## 5       75    0.0000      0          1     1    4
    ## 6       75    6.0032      0          1     1    5
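
The same hour column can probably be produced without sapply(), because format() works on a whole vector of date-times at once. A minimal sketch on two hand-typed timestamps (the `stamps` and `hours` names and the UTC time zone are assumptions made for this example):

```r
# Two timestamps in the same format as the datetime column.
# tz = "UTC" is an assumption so the result does not depend on the local time zone.
stamps <- as.POSIXct(c("2011-01-01 00:00:00", "2011-01-01 05:00:00"), tz = "UTC")

# format() is vectorized over date-times, so no sapply() is needed.
hours <- as.numeric(format(stamps, "%H"))
print(hours)
```

Applied to the data set, this would read `bikeshare$hour <- as.numeric(format(bikeshare$datetime, "%H"))`.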

    Relationship between hour of the working day and the count of bikes rented.

    pl1 <- ggplot(filter(bikeshare,workingday == 1), aes(hour,count))
    ## Warning: package 'bindrcpp' was built under R version 3.3.3
    pl1 <- pl1+ geom_point()
    print(pl1)

    This scatter plot shows an interesting pattern: on working days, rentals peak in the morning around 8 AM, when people leave for the office, and in the evening around 5 PM, when they leave the office.

    pl1 <- pl1 + geom_point(position=position_jitter(w=1,h=0),aes(color = temp),alpha=0.5)
    pl1 <- pl1 + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
    print(pl1 + theme_bw())

    This plot adds temperature to the picture: rentals are higher at warmer temperatures and lower at colder ones.

    Relationship between hour of the non-working day and the count of bikes rented.

    pl2 <- ggplot(filter(bikeshare,workingday == 0), aes(hour,count))
    pl2 <- pl2+ geom_point()
    pl2 <- pl2 + geom_point(position=position_jitter(w=1,h=0),aes(color = temp),alpha=0.5)
    pl2 <- pl2 + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
    print(pl2 + theme_bw())

    On non-working days there are very few rentals in the morning hours, and the count rises in the afternoon.
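
The patterns read off the two plots could also be summarized numerically by averaging count per hour, separately for working and non-working days. The sketch below uses a tiny made-up data frame (all values in `toy` are invented for illustration) and base R's aggregate():

```r
# Made-up rentals for a few hours on working (1) and non-working (0) days.
toy <- data.frame(
  workingday = c(1, 1, 1, 1, 0, 0, 0, 0),
  hour       = c(8, 8, 17, 17, 8, 8, 14, 14),
  count      = c(300, 340, 420, 400, 40, 60, 250, 270)
)

# Mean count for every combination of workingday and hour.
hourly_mean <- aggregate(count ~ workingday + hour, data = toy, FUN = mean)
print(hourly_mean)
```

The same call with `data = bikeshare` would give one mean per hour and day type for the real data.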

    Model Building

    This model will be predicting the count of the bike rental based on the temp variable.

    temp.model <- lm(count ~ temp, bikeshare)
    print(summary(temp.model))
    ## 
    ## Call:
    ## lm(formula = count ~ temp, data = bikeshare)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -293.32 -112.36  -33.36   78.98  741.44 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   6.0462     4.4394   1.362    0.173    
    ## temp          9.1705     0.2048  44.783   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 166.5 on 10884 degrees of freedom
    ## Multiple R-squared:  0.1556, Adjusted R-squared:  0.1555 
    ## F-statistic:  2006 on 1 and 10884 DF,  p-value: < 2.2e-16

    Model Interpretation

  • The intercept is 6.0462, so the model predicts about 6 rentals when the temperature is 0 degrees Celsius.
  • The estimated coefficient for temp is 9.1705, which means that a 1 degree Celsius increase in temperature is associated with an increase of about 9.17 rentals.
  • This association is not evidence of causation. The coefficient (Beta 1) would be negative if an increase in temperature were associated with a decrease in rentals.

    Next, we want to know how many rentals the model predicts at a temperature of 25 degrees Celsius.

    Predicted rentals at 25 degrees Celsius

    6.0462 + 9.1705 * 25
    ## [1] 235.3087
    temp.test <- data.frame(temp=c(25))
    predict(temp.model,temp.test)
    ##        1 
    ## 235.3097

    Based on the above calculation, the model predicts about 235 rentals at a temperature of 25 degrees Celsius.
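
The hand calculation and predict() differ in the last digits above, most likely because the hand calculation uses the rounded coefficients from the printed summary. The sketch below reproduces this on made-up data (`toy` and `toy_model` are invented for illustration and are not the bikeshare model):

```r
# Made-up data with a roughly linear relationship between temp and count.
set.seed(42)
toy <- data.frame(temp = runif(100, min = 0, max = 40))
toy$count <- 6 + 9 * toy$temp + rnorm(100, sd = 50)
toy_model <- lm(count ~ temp, data = toy)

# Prediction at 25 degrees from the full-precision coefficients and from predict().
b <- coef(toy_model)
by_hand <- unname(b["(Intercept)"] + b["temp"] * 25)
by_predict <- unname(predict(toy_model, data.frame(temp = 25)))
print(c(by_hand = by_hand, by_predict = by_predict))

# Rounding the coefficients to four decimals first shifts the result slightly.
by_hand_rounded <- round(b[["(Intercept)"]], 4) + round(b[["temp"]], 4) * 25
print(by_hand_rounded - by_predict)
```

With full-precision coefficients the two approaches agree, so predict() is the safer choice when exact values matter.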

    Building Second Model with more features

    The second model attempts to predict count from the following features:
  • season
  • holiday
  • workingday
  • weather
  • temp
  • humidity
  • windspeed
  • hour (numeric, as created above)

    model <- lm(count ~ . -casual - registered - datetime - atemp, bikeshare)
    print(summary(model))
    ## 
    ## Call:
    ## lm(formula = count ~ . - casual - registered - datetime - atemp, 
    ##     data = bikeshare)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -324.61  -96.88  -31.01   55.27  688.83 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  46.91369    8.45147   5.551 2.91e-08 ***
    ## season       21.70333    1.35409  16.028  < 2e-16 ***
    ## holiday     -10.29914    8.79069  -1.172    0.241    
    ## workingday   -0.71781    3.14463  -0.228    0.819    
    ## weather      -3.20909    2.49731  -1.285    0.199    
    ## temp          7.01953    0.19135  36.684  < 2e-16 ***
    ## humidity     -2.21174    0.09083 -24.349  < 2e-16 ***
    ## windspeed     0.20271    0.18639   1.088    0.277    
    ## hour          7.61283    0.21688  35.102  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 147.8 on 10877 degrees of freedom
    ## Multiple R-squared:  0.3344, Adjusted R-squared:  0.3339 
    ## F-statistic:   683 on 8 and 10877 DF,  p-value: < 2.2e-16
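
The summary above has a single coefficient for hour, so the model treats hour as a number and fits one straight slope across the day. A hedged sketch of the alternative, treating hour as a categorical variable with factor(), on made-up data with commute-style peaks (the `toy` data and all values in it are invented for illustration):

```r
# Made-up demand with peaks at 8 and 17, loosely mimicking a commute pattern.
set.seed(7)
toy <- data.frame(hour = rep(0:23, times = 20))
peak <- ifelse(toy$hour %in% c(8, 17), 300, 50)
toy$count <- peak + rnorm(nrow(toy), sd = 20)

# One slope for hour versus one level per hour.
numeric_fit <- lm(count ~ hour, data = toy)
factor_fit  <- lm(count ~ factor(hour), data = toy)

r2_numeric <- summary(numeric_fit)$r.squared
r2_factor  <- summary(factor_fit)$r.squared
print(c(numeric = r2_numeric, factor = r2_factor))
```

The factor version spends 23 extra coefficients, but it can represent peaks at specific hours, which a single slope cannot. The same reasoning would apply to season and weather, which are also coded as integers.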
    Important Finding

    A plain linear model like this does not work well for data with seasonality and a time trend. Overall demand grows over the period, and the model can mistakenly attribute that growth to the season variable (for example to winter) instead of recognizing it as general growth in demand. A model that accounts for this kind of trend is needed.
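
The confusion between growth and season can be illustrated on made-up data in which demand grows steadily and has no seasonal effect at all. In the sketch below, `t` is a hypothetical month index and season is coded 1 to 4 in calendar order, as in the data set. All values are invented, and adding a time index is only one possible remedy.

```r
# Two years of monthly data: demand grows by 10 per month, with no seasonal effect.
toy <- data.frame(t = 1:24)
toy$season <- rep(rep(1:4, each = 3), times = 2)
toy$count <- 10 * toy$t

# Without a time index, the growth within each year is credited to season.
season_only <- unname(coef(lm(count ~ season, data = toy))["season"])

# With a time index, the season coefficient drops to (numerically) zero.
season_with_trend <- unname(coef(lm(count ~ season + t, data = toy))["season"])

print(c(season_only = season_only, season_with_trend = season_with_trend))
```

Because season numbers rise through the calendar year, a model without a time variable reads the steady growth as a season effect.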