1 About the Project

An open-source bike-sharing rental dataset published on Kaggle was chosen for this project. Bike sharing is an in-demand service in which users download a phone application, locate bicycles, and rent one when needed. This project applies Linear Regression and Model Checking using Residual Analysis to the variables cnt (daily bike count), temp (temperature), and windspeed (wind speed) available in the dataset. Two models were built: one for cnt vs temp and the other for cnt vs windspeed. We test both models against the primary assumptions of Linear Regression and conclude with the model that should be preferred for explaining the variation in bike count.

2 Source and Description of Data

The dataset describes the daily rental count of bikes across the four seasons of the year (Spring, Summer, Fall, Winter) for the years 2011-2012. It was obtained from Kaggle - Bike Sharing Data.

The dataset day.csv has 731 rows and 16 fields, which are listed in the structure output in Section 3.

3 Data Importing and Wrangling

## spec_tbl_df [731 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ instant   : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date[1:731], format: "2011-01-01" "2011-01-02" ...
##  $ season    : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: num [1:731] 654 670 1229 1454 1518 ...
##  $ cnt       : num [1:731] 985 801 1349 1562 1600 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   instant = col_double(),
##   ..   dteday = col_date(format = ""),
##   ..   season = col_double(),
##   ..   yr = col_double(),
##   ..   mnth = col_double(),
##   ..   holiday = col_double(),
##   ..   weekday = col_double(),
##   ..   workingday = col_double(),
##   ..   weathersit = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   hum = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   cnt = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
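
The import code itself is not shown in this report. The `spec_tbl_df` class in the output above suggests the file was read with readr and inspected with `str()`. The following is a minimal base-R sketch of an equivalent import, not the original code. To stay runnable without the real file it first writes a small stand-in CSV; the rows are the first five values visible in the output above, and the dates after the second row are assumed to be consecutive.

```r
# Hypothetical stand-in for day.csv: the first five rows of a few columns,
# taken from the structure output above (dates after row 2 are assumed).
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "instant,dteday,season,temp,windspeed,cnt",
  "1,2011-01-01,1,0.344,0.160,985",
  "2,2011-01-02,1,0.363,0.249,801",
  "3,2011-01-03,1,0.196,0.248,1349",
  "4,2011-01-04,1,0.200,0.160,1562",
  "5,2011-01-05,1,0.227,0.187,1600"
), sample_path)

# Read the file; replace sample_path with "day.csv" for the real data.
bike <- read.csv(sample_path, stringsAsFactors = FALSE)

# Wrangling: parse the date column and treat season as categorical.
bike$dteday <- as.Date(bike$dteday, format = "%Y-%m-%d")
bike$season <- factor(bike$season)

# Inspect column types and the first few values of each column.
str(bike)
```

The data frame name `bike` is an assumption used only for illustration.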

3.1 Missing Values

  • There are no missing values in any of the columns in the dataset.
##    instant     dteday     season         yr       mnth    holiday    weekday 
##          0          0          0          0          0          0          0 
## workingday weathersit       temp      atemp        hum  windspeed     casual 
##          0          0          0          0          0          0          0 
## registered        cnt 
##          0          0
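
A per-column count like the one above can be produced with `colSums(is.na(...))`. The sketch below assumes a data frame named `bike` (a hypothetical name) and uses three rows of values from the structure output as a stand-in for the full dataset.

```r
# Small stand-in for the bike data (first three rows of three columns).
bike <- data.frame(
  temp      = c(0.344, 0.363, 0.196),
  windspeed = c(0.160, 0.249, 0.248),
  cnt       = c(985, 801, 1349)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(bike))
print(na_counts)

# For contrast: a copy with one value removed reports that gap in its column.
bike_with_gap <- bike
bike_with_gap$temp[2] <- NA
gap_counts <- colSums(is.na(bike_with_gap))
print(gap_counts)
```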

3.2 Creating new columns and dataframes

New columns can give more insight into the data. Here a column year is created by extracting the year from dteday, and the total bike count is then computed for each year and season.

## # A tibble: 8 x 3
## # Groups:   year [2]
##   year  season  count
##   <chr> <fct>   <dbl>
## 1 2011  1      150000
## 2 2011  2      347316
## 3 2011  3      419650
## 4 2011  4      326137
## 5 2012  1      321348
## 6 2012  2      571273
## 7 2012  3      641479
## 8 2012  4      515476
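
The grouped tibble above (note the `Groups: year [2]` line) was probably produced with dplyr, but the code is not shown. As a rough base-R sketch of the same two steps, extracting the year and summing `cnt` per year and season, the following uses made-up rows; the numbers are hypothetical and do not come from the dataset.

```r
# Hypothetical rows, for illustration only.
bike <- data.frame(
  dteday = as.Date(c("2011-01-01", "2011-07-01", "2011-07-02",
                     "2012-01-01", "2012-07-01")),
  season = factor(c(1, 3, 3, 1, 3)),
  cnt    = c(100, 400, 500, 200, 700)
)

# New column: the four-digit year as a character string.
bike$year <- format(bike$dteday, "%Y")

# Total bike count for each combination of year and season.
season_totals <- aggregate(cnt ~ year + season, data = bike, FUN = sum)
names(season_totals)[names(season_totals) == "cnt"] <- "count"

# Sort by year, then season, to match the layout of the table above.
season_totals <- season_totals[order(season_totals$year, season_totals$season), ]
print(season_totals)
```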

4 Data Visualization

The plot below shows the seasonal count of rented bikes for 2011 and 2012. The Fall season (season 3) has the highest number of rented bikes in both years.

5 Correlation between various variables of concern

We compute the correlation coefficients for the continuous variables of interest: cnt, temp, atemp, hum, windspeed.

##                  cnt       temp      atemp        hum  windspeed
## cnt        1.0000000  0.6274940  0.6310657 -0.1006586 -0.2345450
## temp       0.6274940  1.0000000  0.9917016  0.1269629 -0.1579441
## atemp      0.6310657  0.9917016  1.0000000  0.1399881 -0.1836430
## hum       -0.1006586  0.1269629  0.1399881  1.0000000 -0.2484891
## windspeed -0.2345450 -0.1579441 -0.1836430 -0.2484891  1.0000000
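
A matrix of this form is what `cor()` returns when given the selected columns; by default it computes Pearson correlations. The sketch below assumes a data frame named `bike` and uses only the first five rows visible in the structure output, so its numbers will not match the full-data matrix above.

```r
# First five rows of the five variables, as shown (rounded) in the str() output.
bike <- data.frame(
  cnt       = c(985, 801, 1349, 1562, 1600),
  temp      = c(0.344, 0.363, 0.196, 0.200, 0.227),
  atemp     = c(0.364, 0.354, 0.189, 0.212, 0.229),
  hum       = c(0.806, 0.696, 0.437, 0.590, 0.437),
  windspeed = c(0.160, 0.249, 0.248, 0.160, 0.187)
)

# Pairwise Pearson correlations between the selected columns.
vars <- c("cnt", "temp", "atemp", "hum", "windspeed")
cor_matrix <- cor(bike[, vars])
print(round(cor_matrix, 3))
```

In the full-data matrix, temp and atemp are almost perfectly correlated (0.99), so only one of them needs to be considered as a predictor.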

6 Linear Regression Models

6.1 Model-1 for Bike Count vs Temperature

## 
## Call:
## lm(formula = cnt ~ temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4615.3 -1134.9  -104.4  1044.3  3737.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1214.6      161.2   7.537 1.43e-13 ***
## temp          6640.7      305.2  21.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929 
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
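  • From the above output, the estimated slope is 6640.7: each unit increase in temp is associated with an average increase of 6640.7 in the number of rented bikes. Because temp is stored on a normalized scale (the values shown earlier are all below 1), a more practical reading is that an increase of 0.1 in temp corresponds to about 664 more rented bikes per day.
  • The p-value for the slope is very close to 0, so we reject the null hypothesis that the slope is zero: there is a significant linear relationship between temperature and the bike count.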
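
To make the slope easier to read, the short sketch below plugs the coefficients reported in the Model-1 summary into the fitted line and compares two temperatures. The function name `fitted_cnt` is introduced here for illustration only.

```r
# Coefficients and standard error reported in the Model-1 summary above.
intercept  <- 1214.6
slope_temp <- 6640.7
se_temp    <- 305.2

# Fitted daily bike count at a given (normalized) temperature.
fitted_cnt <- function(temp) intercept + slope_temp * temp

# Difference between two days whose temp differs by 0.1: slope * 0.1.
diff_per_tenth <- fitted_cnt(0.4) - fitted_cnt(0.3)
print(diff_per_tenth)

# The t value in the summary is the estimate divided by its standard error.
t_value <- slope_temp / se_temp
print(t_value)
```

The small gap between this ratio and the reported t value of 21.759 comes from the rounding of the printed coefficients.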

6.1.1 Scatterplot for Bike Count vs Temperature

6.2 Model-2 for Bike Count vs Windspeed

Now, let us build the second model for cnt vs windspeed.

## 
## Call:
## lm(formula = cnt ~ windspeed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4522.7 -1374.7   -74.6  1461.8  4544.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5621.2      185.1  30.374  < 2e-16 ***
## windspeed    -5862.9      900.0  -6.514 1.36e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1884 on 729 degrees of freedom
## Multiple R-squared:  0.05501,    Adjusted R-squared:  0.05372 
## F-statistic: 42.44 on 1 and 729 DF,  p-value: 1.36e-10
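  • From the above output, the estimated slope is -5862.9: each unit increase in windspeed is associated with an average decrease of 5862.9 in the rented bike count. Because windspeed is also stored on a normalized scale, an increase of 0.1 in windspeed corresponds to about 586 fewer rented bikes per day.
  • The p-value for the slope is very close to 0, so we reject the null hypothesis that the slope is zero: there is a significant linear relationship between bike count and windspeed. The R-squared of about 0.055 shows, however, that windspeed explains only a small share of the variation in bike count.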
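
In a regression with a single predictor, the Multiple R-squared equals the squared correlation between the response and the predictor. The sketch below checks this using the values reported in the correlation matrix and the two model summaries; the variable names are chosen here for illustration.

```r
# Correlations with cnt, copied from the correlation matrix in Section 5.
r_cnt_temp <- 0.6274940
r_cnt_wind <- -0.2345450

# Squared correlations, to compare with the Multiple R-squared values
# reported in the summaries (0.3937 for Model-1, 0.05501 for Model-2).
rsq_temp <- r_cnt_temp^2
rsq_wind <- r_cnt_wind^2
print(c(Model1 = rsq_temp, Model2 = rsq_wind))
```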

6.2.1 Scatterplot for Bike Count vs Windspeed

7 Residual Diagnostics

7.1 Residual plots for both Models

  • Displaying the first few values in res_hat1.
##          1          2          3          4          5          6 
## -2515.1554 -2827.3941 -1169.6385  -980.7841 -1121.7977  -965.6579
  • The histogram below suggests that the residuals for Model 1 are not normally distributed.

  • Displaying the first few values in res_hat2.

##         1         2         3         4         5         6 
## -3695.472 -3362.990 -2816.339 -3119.351 -2925.374 -3490.040
  • The histogram below suggests that the residuals for Model 2 are not normally distributed.
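
The residual vectors and histograms can be obtained along the following lines. This is a sketch on simulated data, not the original code: the name `res_hat1` follows the report, while `model1` and the simulated `temp` and `cnt` values are assumptions made so that the example runs on its own.

```r
# Simulated stand-in for the bike data (values are not from the dataset).
set.seed(1)
temp <- runif(100)
cnt  <- 1200 + 6600 * temp + rnorm(100, sd = 1500)

# Fit the simple linear regression and extract its residuals.
model1   <- lm(cnt ~ temp)
res_hat1 <- residuals(model1)

# First few residuals, named by observation index as in the output above.
print(head(res_hat1))

# Histogram of the residuals as a first visual check of their distribution.
hist(res_hat1, breaks = 20,
     main = "Residuals of Model 1", xlab = "Residual")
```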

7.1.1 Checking the assumption of Normality

  • The Shapiro-Wilk normality test is performed on the residual vectors of both models.

  • From the below output, the p-value for Model-1 is close to 0, so we reject the null hypothesis that the residuals come from a normal distribution.

## 
##  Shapiro-Wilk normality test
## 
## data:  res_hat1
## W = 0.98671, p-value = 3.392e-06
  • From the below output, the p-value for Model-2 is close to 0, so we reject the null hypothesis that the residuals come from a normal distribution.
## 
##  Shapiro-Wilk normality test
## 
## data:  res_hat2
## W = 0.98616, p-value = 2.129e-06
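
A sketch of how such a test can be run with `shapiro.test()` from base R is given below. The data are simulated with deliberately skewed errors, so a small p-value is to be expected; the names `fit` and `sw` are illustrative. A normal Q-Q plot is added as a complementary visual check, which the report itself does not use.

```r
# Simulated regression with skewed (shifted exponential) errors.
set.seed(2)
x <- runif(200)
y <- 2 + 3 * x + (rexp(200) - 1)

fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
print(sw$statistic)
print(sw$p.value)

# Normal Q-Q plot: points far from the reference line indicate non-normality.
qqnorm(res)
qqline(res)
```

With a sample as large as 731 observations, the test can flag fairly mild departures from normality, so it is worth reading the p-value together with a plot.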

7.1.2 Checking the assumption of a Zero Mean

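  • From the below output, the mean of the Model-1 residuals is essentially 0 (6.43e-14) and the p-value is 1, so we fail to reject the null hypothesis that the true mean is 0. The assumption of a zero mean is satisfied for Model-1.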
## 
##  One Sample t-test
## 
## data:  res_hat1
## t = 1.1531e-15, df = 730, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -109.525  109.525
## sample estimates:
##    mean of x 
## 6.433118e-14
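  • From the below output, the mean of the Model-2 residuals is essentially 0 (4.34e-13) and the p-value is 1, so we fail to reject the null hypothesis that the true mean is 0. The assumption of a zero mean is satisfied for Model-2.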
## 
##  One Sample t-test
## 
## data:  res_hat2
## t = 6.2361e-15, df = 730, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -136.7415  136.7415
## sample estimates:
##    mean of x 
## 4.343524e-13
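
The t-tests above return a mean that is zero up to floating-point error and a p-value of 1. This is expected for any least-squares fit that includes an intercept, because the fitting procedure forces the residuals to sum to zero. The sketch below illustrates this on simulated data (all names and values are illustrative) and contrasts it with a fit through the origin, where the property no longer holds.

```r
# Simulated data with a non-zero true intercept.
set.seed(3)
x <- runif(150)
y <- 5 - 2 * x + rnorm(150)

# Model with an intercept: residuals average to zero by construction.
fit <- lm(y ~ x)
res <- residuals(fit)
print(mean(res))

# One-sample t-test of the residual mean against 0.
tt <- t.test(res, mu = 0)
print(tt$p.value)

# Model without an intercept: the residual mean is no longer forced to zero.
fit0 <- lm(y ~ 0 + x)
res0 <- residuals(fit0)
print(mean(res0))
```

For this reason the one-sample t-test on residuals carries little information for models such as Model-1 and Model-2, which both include an intercept.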

7.1.3 Assumption of Constant Variance

  • From the below plots, we can see that the residuals for Model-1 show a pattern (increasing and decreasing) around 0 rather than an even band. So, the assumption of constant variance is not met.

  • From the below output, the residuals for Model-2 seem to be roughly symmetrical around 0, with no obvious pattern. So, this model appears to fulfil the assumption of constant variance.
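
The report judges constant variance visually. A residuals-versus-fitted plot can be drawn as sketched below, and the visual impression can be backed by a numeric check. The check shown here is a hand-computed studentized (Koenker) form of the Breusch-Pagan test, which is not part of the original analysis: squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The data are simulated with increasing spread, and all names are illustrative.

```r
# Simulated data whose error spread grows with x (heteroscedastic by design).
set.seed(4)
x <- runif(300)
y <- 1 + 2 * x + rnorm(300, sd = 0.2 + 2 * x)

fit <- lm(y ~ x)
res <- residuals(fit)

# Residuals vs fitted values: constant variance shows as an even band around 0.
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Auxiliary regression of squared residuals on the predictor.
res_sq <- res^2
aux    <- lm(res_sq ~ x)

# Test statistic n * R^2, chi-squared with 1 degree of freedom (one predictor).
bp_stat <- length(res) * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_pval))
```

A small p-value here would point to non-constant variance; a large one means the test found no evidence against it.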

8 New functions used in this project

9 Conclusion

##                                   Model-1   Model-2
## Adjusted R-Squared                 0.3929   0.05372
## Assumption of Normality          Violated  Violated
## Assumption of Zero Mean         Satisfied Satisfied
## Assumption of Constant Variance  Violated Satisfied

Model-1 (cnt vs temp) should be preferred for explaining the variation in bike count: it explains about 39% of the variation, compared with about 5% for Model-2 (cnt vs windspeed). Neither model satisfies every assumption, so the results from both should be interpreted with caution.