Introduction

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems because of their important role in traffic, environmental, and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data these systems generate make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

How did I get the dataset?

I downloaded the Bike Sharing Dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

rm(list = ls())
day <- read.csv("~/Desktop/UCSC//UCSC Data Analysis - R/Project/Bike-Sharing-Dataset/day.csv")
bk_sh_dy <- day
head(bk_sh_dy)
##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606

Dimension of Data

dim(bk_sh_dy)
## [1] 731  16

What are the attributes in the dataset?

names(bk_sh_dy)  ## names of the columns 
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "hum"        "windspeed"  "casual"     "registered"
## [16] "cnt"
is.null(bk_sh_dy) ## TRUE only if the object itself is NULL; this does not detect missing values
## [1] FALSE
is.integer(bk_sh_dy)
## [1] FALSE
bk_sh_dy<- data.frame(bk_sh_dy)
str(bk_sh_dy)
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
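
Note that `is.null()` only reports whether the object itself is `NULL`; it says nothing about missing entries inside a data frame. A minimal sketch of a per-column missing-value check follows, using a small made-up data frame (`toy_df` and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical toy data frame with one missing value, for illustration only
toy_df <- data.frame(
  temp = c(0.34, NA, 0.20),
  cnt  = c(985L, 801L, 1349L)
)

# TRUE if any cell in the data frame is NA
has_missing <- anyNA(toy_df)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy_df))

print(has_missing)
print(missing_per_column)
```

Running the same two calls on `bk_sh_dy` would show whether any column needs cleaning before modeling.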

Attribute Information:

instant: record index

dteday: date

season: season (1:spring, 2:summer, 3:fall, 4:winter)

yr: year (0: 2011, 1: 2012)

mnth: month (1 to 12)

holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

weekday: day of the week

workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0.

weathersit:

1: Clear, Few clouds, Partly cloudy

2: Mist and Cloudy, Mist and Broken clouds, Mist and Few clouds, Mist

3: Light Snow, Light Rain and Thunderstorm and Scattered clouds, Light Rain and Scattered clouds

4: Heavy Rain and Ice Pellets and Thunderstorm and Mist, Snow and Fog

temp: Normalized temperature in Celsius. The values are divided by 41 (max)

atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)

hum: Normalized humidity. The values are divided by 100 (max)

windspeed: Normalized wind speed. The values are divided by 67 (max)

casual: count of casual users

registered: count of registered users

cnt: count of total rental bikes including both casual and registered
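
The `workingday` definition can be checked against the other columns. The sketch below rebuilds it from `weekday` and `holiday` for the first ten rows printed in the `str()` output above. It assumes that `weekday` values 0 and 6 denote the weekend, which is consistent with those rows but is not stated in the attribute list.

```r
# First ten rows of the relevant columns, copied from the str() output
first_rows <- data.frame(
  weekday    = c(6L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 0L, 1L),
  holiday    = rep(0L, 10),
  workingday = c(0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L)
)

# Assumption: weekday 0 and 6 are the weekend days
is_weekend <- first_rows$weekday %in% c(0L, 6L)

# A working day is neither a weekend day nor a holiday
derived_workingday <- as.integer(!is_weekend & first_rows$holiday == 0L)

# Compare the derived flag with the recorded one
print(identical(derived_workingday, first_rows$workingday))
```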

Updated Attributes:

We created new attributes that convert the normalized values back to their original scale, since the normalized values were very small, and we converted the categorical attributes to factors.

actual_temp: Converted normalized temperature in Celsius

actual_windspeed: Converted normalized windspeed

actual_humidity: Converted normalized humidity

actual_feel_temp: Converted normalized feeling temperature in Celsius

mean_acttemp_feeltemp: Created a mean of actual temperature and feel temperature
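
As a quick worked example of these conversions, the sketch below rescales the first record's normalized values (taken from the `head()` output above) using the maxima listed in the attribute information. The helper name `denormalize` is ours, introduced only for illustration.

```r
# Hypothetical helper: undo a "divide by max" normalization
denormalize <- function(x, max_value) x * max_value

# Normalized values of the first record (2011-01-01), from head()
first_row <- list(temp = 0.344167, atemp = 0.363625,
                  hum = 0.805833, windspeed = 0.1604460)

actual_temp      <- denormalize(first_row$temp, 41)
actual_feel_temp <- denormalize(first_row$atemp, 50)
actual_humidity  <- denormalize(first_row$hum, 100)
actual_windspeed <- denormalize(first_row$windspeed, 67)

# Mean of actual and feeling temperature
mean_acttemp_feeltemp <- (actual_temp + actual_feel_temp) / 2

print(round(c(actual_temp, actual_feel_temp, actual_humidity,
              actual_windspeed, mean_acttemp_feeltemp), 2))
```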

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
library(ggplot2)
library(stats)

bk_sh_dy$season <- factor(bk_sh_dy$season,
                          levels = 1:4, labels = c("Spring","Summer","Fall","Winter"))
table(bk_sh_dy$season)
## 
## Spring Summer   Fall Winter 
##    181    184    188    178
## Note: level 0 means "not a holiday" (weekends included), although it is labeled "Working Day"
bk_sh_dy$holiday <- factor(bk_sh_dy$holiday,
                          levels = 0:1, labels = c("Working Day","Holiday"))
table(bk_sh_dy$holiday)
## 
## Working Day     Holiday 
##         710          21
bk_sh_dy$weathersit <- factor(bk_sh_dy$weathersit,
                          levels = 1:4,
               labels = c("Good:Clear/Sunny","Moderate:Cloudy/Mist","Bad: Rain/Snow/Fog","Worse: Heavy Rain/Snow/Fog"))
table(bk_sh_dy$weathersit)
## 
##           Good:Clear/Sunny       Moderate:Cloudy/Mist 
##                        463                        247 
##         Bad: Rain/Snow/Fog Worse: Heavy Rain/Snow/Fog 
##                         21                          0
bk_sh_dy$yr <- factor(bk_sh_dy$yr,
                          levels = 0:1, labels = c("2011","2012"))
table(bk_sh_dy$yr)
## 
## 2011 2012 
##  365  366
bk_sh_dy$actual_temp <- bk_sh_dy$temp*41
bk_sh_dy$actual_feel_temp <- bk_sh_dy$atemp*50
bk_sh_dy$actual_windspeed <- bk_sh_dy$windspeed*67
bk_sh_dy$actual_humidity <- bk_sh_dy$hum*100
bk_sh_dy$mean_acttemp_feeltemp <- (bk_sh_dy$actual_temp+bk_sh_dy$actual_feel_temp)/2
str(bk_sh_dy)
## 'data.frame':    731 obs. of  21 variables:
##  $ instant              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday               : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season               : Factor w/ 4 levels "Spring","Summer",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr                   : Factor w/ 2 levels "2011","2012": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mnth                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday              : Factor w/ 2 levels "Working Day",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday              : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday           : int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit           : Factor w/ 4 levels "Good:Clear/Sunny",..: 2 2 1 1 1 1 2 2 1 1 ...
##  $ temp                 : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp                : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum                  : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed            : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual               : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered           : int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt                  : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
##  $ actual_temp          : num  14.11 14.9 8.05 8.2 9.31 ...
##  $ actual_feel_temp     : num  18.18 17.69 9.47 10.61 11.46 ...
##  $ actual_windspeed     : num  10.7 16.7 16.6 10.7 12.5 ...
##  $ actual_humidity      : num  80.6 69.6 43.7 59 43.7 ...
##  $ mean_acttemp_feeltemp: num  16.15 16.29 8.76 9.4 10.38 ...
summary(bk_sh_dy)
##     instant             dteday       season       yr           mnth      
##  Min.   :  1.0   2011-01-01:  1   Spring:181   2011:365   Min.   : 1.00  
##  1st Qu.:183.5   2011-01-02:  1   Summer:184   2012:366   1st Qu.: 4.00  
##  Median :366.0   2011-01-03:  1   Fall  :188              Median : 7.00  
##  Mean   :366.0   2011-01-04:  1   Winter:178              Mean   : 6.52  
##  3rd Qu.:548.5   2011-01-05:  1                           3rd Qu.:10.00  
##  Max.   :731.0   2011-01-06:  1                           Max.   :12.00  
##                  (Other)   :725                                          
##         holiday       weekday        workingday   
##  Working Day:710   Min.   :0.000   Min.   :0.000  
##  Holiday    : 21   1st Qu.:1.000   1st Qu.:0.000  
##                    Median :3.000   Median :1.000  
##                    Mean   :2.997   Mean   :0.684  
##                    3rd Qu.:5.000   3rd Qu.:1.000  
##                    Max.   :6.000   Max.   :1.000  
##                                                   
##                       weathersit       temp             atemp        
##  Good:Clear/Sunny          :463   Min.   :0.05913   Min.   :0.07907  
##  Moderate:Cloudy/Mist      :247   1st Qu.:0.33708   1st Qu.:0.33784  
##  Bad: Rain/Snow/Fog        : 21   Median :0.49833   Median :0.48673  
##  Worse: Heavy Rain/Snow/Fog:  0   Mean   :0.49538   Mean   :0.47435  
##                                   3rd Qu.:0.65542   3rd Qu.:0.60860  
##                                   Max.   :0.86167   Max.   :0.84090  
##                                                                      
##       hum           windspeed           casual         registered  
##  Min.   :0.0000   Min.   :0.02239   Min.   :   2.0   Min.   :  20  
##  1st Qu.:0.5200   1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497  
##  Median :0.6267   Median :0.18097   Median : 713.0   Median :3662  
##  Mean   :0.6279   Mean   :0.19049   Mean   : 848.2   Mean   :3656  
##  3rd Qu.:0.7302   3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776  
##  Max.   :0.9725   Max.   :0.50746   Max.   :3410.0   Max.   :6946  
##                                                                    
##       cnt        actual_temp     actual_feel_temp actual_windspeed
##  Min.   :  22   Min.   : 2.424   Min.   : 3.953   Min.   : 1.500  
##  1st Qu.:3152   1st Qu.:13.820   1st Qu.:16.892   1st Qu.: 9.042  
##  Median :4548   Median :20.432   Median :24.337   Median :12.125  
##  Mean   :4504   Mean   :20.311   Mean   :23.718   Mean   :12.763  
##  3rd Qu.:5956   3rd Qu.:26.872   3rd Qu.:30.430   3rd Qu.:15.625  
##  Max.   :8714   Max.   :35.328   Max.   :42.045   Max.   :34.000  
##                                                                   
##  actual_humidity mean_acttemp_feeltemp
##  Min.   : 0.00   Min.   : 3.189       
##  1st Qu.:52.00   1st Qu.:15.251       
##  Median :62.67   Median :22.347       
##  Mean   :62.79   Mean   :22.014       
##  3rd Qu.:73.02   3rd Qu.:28.664       
##  Max.   :97.25   Max.   :38.413       
## 

Exploratory Analysis with Plots

h <- hist(bk_sh_dy$cnt, breaks = 25, ylab = 'Frequency of Rental', xlab = 'Total Bike Rental Count', main = 'Distribution of Total Bike Rental Count', col = 'blue' )

xfit <- seq(min(bk_sh_dy$cnt),max(bk_sh_dy$cnt), length = 50)
yfit <- dnorm(xfit, mean =mean(bk_sh_dy$cnt),sd=sd(bk_sh_dy$cnt))
yfit <- yfit*diff(h$mids[1:2])*length(bk_sh_dy$cnt)
lines(xfit,yfit, col='red', lwd= 3)

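First, we looked at how the response variable Total Bike Rentals (cnt) is distributed.

From the histogram above, the total number of rented bikes appears to follow a roughly normal distribution. Counts are often modeled as Poisson, where the mean equals the variance and the distribution approaches a normal distribution as the mean grows. For this data, however, the spread (from 22 to 8714 rentals per day) is much wider than a Poisson distribution with a mean of about 4504 would produce, so the near-normal shape should be read as an empirical observation rather than a Poisson property.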
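
Whether the mean and variance of a count variable are comparable is easy to check directly. The sketch below does so for the first ten daily totals shown in the `str()` output above; ten values are far too few for a real conclusion, so this only illustrates the check.

```r
# First ten daily rental totals, copied from the str() output
cnt_sample <- c(985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321)

sample_mean <- mean(cnt_sample)
sample_var  <- var(cnt_sample)

# For Poisson-like counts this ratio would be close to 1
dispersion_ratio <- sample_var / sample_mean

cat("mean:", sample_mean, "\n")
cat("variance:", sample_var, "\n")
cat("variance / mean:", dispersion_ratio, "\n")
```

A ratio far above 1 indicates overdispersion relative to a Poisson distribution. Replacing `cnt_sample` with `bk_sh_dy$cnt` runs the same check on the full data.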

Next, we looked at the relationship between the response variable and each explanatory variable. We selected a few plots that show clear patterns, as shown below.

Distribution of Categorical Variables

par(mfcol = c(2, 2))  # 2x2 grid of plots, filled column by column

boxplot(cnt ~ season,
        data = bk_sh_dy,
        main = "Total Bike Rentals Vs Season",
        xlab = "Season",
        ylab = "Total Bike Rentals",
        col = c("coral", "coral1", "coral2", "coral3"))

boxplot(cnt ~ holiday,
        data = bk_sh_dy,
        main = "Total Bike Rentals Vs Holiday/Working Day",
        xlab = "Holiday/Working Day",
        ylab = "Total Bike Rentals",
        col = c("pink", "pink1"))

boxplot(cnt ~ weathersit,
        data = bk_sh_dy,
        main = "Total Bike Rentals Vs Weather Situation",
        xlab = "Weather Situation",
        ylab = "Total Bike Rentals",
        col = c("purple", "purple1", "purple2", "purple3"))

# dteday is read as a factor (or character), so convert it to Date;
# plotting a factor on the x axis would draw one box per date instead of points
plot(as.Date(bk_sh_dy$dteday), bk_sh_dy$cnt, type = "p",
     main = "Total Bike Rentals Vs DateDay",
     xlab = "Date",
     ylab = "Total Bike Rentals",
     col  = "orange",
     pch  = 19)

The first boxplot shows the relationship between Total Bike Rentals (cnt) and season. Typical daily rentals are highest during summer and fall.

The second boxplot shows the relationship between Total Bike Rentals (cnt) and holiday. Typical rentals on non-holiday days (labeled "Working Day", which here includes weekends) are higher than on holidays.

The third boxplot shows the relationship between Total Bike Rentals (cnt) and weather. Bike rentals clearly decrease as the weather gets worse.

The last plot shows Total Bike Rentals (cnt) over time. The overall trend is upward across the two-year span, and within each year rentals peak during the summer and fall seasons.
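
The comparisons read off the boxplots can also be expressed as numbers by computing a median per group. A minimal sketch with base R's `aggregate()` follows; the data frame `toy` and its counts are made up for illustration and are not taken from the dataset.

```r
# Hypothetical toy data: two made-up daily counts per season
toy <- data.frame(
  season = factor(c("Spring", "Spring", "Summer", "Summer", "Fall", "Fall"),
                  levels = c("Spring", "Summer", "Fall")),
  cnt    = c(1000, 1400, 4800, 5200, 5500, 5900)
)

# Median rentals per season (the line drawn inside each box of a boxplot)
median_by_season <- aggregate(cnt ~ season, data = toy, FUN = median)
print(median_by_season)
```

The same call with `data = bk_sh_dy` would give the per-season medians for the real data, and swapping `season` for `holiday` or `weathersit` covers the other boxplots.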

Distribution of Numerical Variables

par(mfrow=c(2,2))

plot(bk_sh_dy$actual_temp, bk_sh_dy$cnt ,type = 'h', col= 'yellow', xlab = 'Actual Temperature', ylab = 'Total Bike Rentals')
   
plot(bk_sh_dy$actual_feel_temp, bk_sh_dy$cnt ,type = 'h', col= 'yellow', xlab = 'Actual Feel Temperature', ylab = 'Total Bike Rentals')
   
plot(bk_sh_dy$actual_windspeed, bk_sh_dy$cnt ,type = 'h', col= 'yellow', xlab = 'Actual Windspeed', ylab = 'Total Bike Rentals')
   
plot(bk_sh_dy$actual_humidity, bk_sh_dy$cnt ,type = 'h', col= 'yellow', xlab = 'Actual Humidity', ylab = 'Total Bike Rentals')

The numerical variables appear to be reasonably well distributed across their ranges.

Correlation Plots

Correlations between Bike Rental Count, Actual Temp, Feel Temp, Mean Actual Temp Feel Temp, Windspeed, and Humidity.

Cor_actual_temp<-cor(x = bk_sh_dy$actual_temp, y = bk_sh_dy$cnt)
Cor_actual_feel_temp <- cor(x = bk_sh_dy$actual_feel_temp, y =bk_sh_dy$cnt)
bk_sh_dy_cor<- bk_sh_dy %>% select (cnt,actual_temp,actual_feel_temp,mean_acttemp_feeltemp,actual_humidity,actual_windspeed)
bk_sh_dy_cor<- data.frame(bk_sh_dy_cor)
  
colnames(bk_sh_dy_cor)[1] <- "Total Number of Bike Rentals"
colnames(bk_sh_dy_cor)[2] <- "Temperature"
colnames(bk_sh_dy_cor)[3] <- "Feel Temperature"
colnames(bk_sh_dy_cor)[4] <- "Mean Actual Temp Feel Temp"
colnames(bk_sh_dy_cor)[5] <- "Humidity"
colnames(bk_sh_dy_cor)[6] <- "Windspeed"

cor(bk_sh_dy_cor)
##                              Total Number of Bike Rentals Temperature
## Total Number of Bike Rentals                    1.0000000   0.6274940
## Temperature                                     0.6274940   1.0000000
## Feel Temperature                                0.6310657   0.9917016
## Mean Actual Temp Feel Temp                      0.6306607   0.9977489
## Humidity                                       -0.1006586   0.1269629
## Windspeed                                      -0.2345450  -0.1579441
##                              Feel Temperature Mean Actual Temp Feel Temp
## Total Number of Bike Rentals        0.6310657                  0.6306607
## Temperature                         0.9917016                  0.9977489
## Feel Temperature                    1.0000000                  0.9980905
## Mean Actual Temp Feel Temp          0.9980905                  1.0000000
## Humidity                            0.1399881                  0.1340209
## Windspeed                          -0.1836430                 -0.1716773
##                                Humidity  Windspeed
## Total Number of Bike Rentals -0.1006586 -0.2345450
## Temperature                   0.1269629 -0.1579441
## Feel Temperature              0.1399881 -0.1836430
## Mean Actual Temp Feel Temp    0.1340209 -0.1716773
## Humidity                      1.0000000 -0.2484891
## Windspeed                    -0.2484891  1.0000000
corplot_bk_sh <- cor(bk_sh_dy_cor)
corrplot(corplot_bk_sh, method="number")

From the correlation plot above, the temperature variables are the most strongly correlated with bike rentals (r of about 0.63), while windspeed (r of about -0.23) and humidity (r of about -0.10) are weakly negatively correlated.
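
The values above are correlation coefficients; a formal test additionally reports a p-value and a confidence interval. A minimal sketch with base R's `cor.test()` follows, applied to the six rows shown in the `head()` output. Six rows are far too few to draw conclusions, and the sign in such a small sample need not match the full dataset.

```r
# Temperature (rescaled to Celsius) and rental counts for the six head() rows
temp_c <- c(0.344167, 0.363478, 0.196364, 0.200000, 0.226957, 0.204348) * 41
cnt    <- c(985, 801, 1349, 1562, 1600, 1606)

# Pearson correlation test: estimate, p-value, and 95% confidence interval
result <- cor.test(temp_c, cnt, method = "pearson")

print(result$estimate)
print(result$p.value)
print(result$conf.int)
```

On the full data, `cor.test(bk_sh_dy$actual_temp, bk_sh_dy$cnt)` would run the same test.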

Scatter Plot between Bike Rentals and Actual Temperature

library(ggplot2)

## Use bare column names inside aes(); the intercept in the title is taken from the lm() fit below (1214.642)
ggplot_Temp_Rent<- ggplot(bk_sh_dy, aes(x=actual_temp,y=cnt))+geom_point(shape=1)+geom_smooth(method="lm")+ xlab("Actual Temp. in Celsius")+ylab("Bike Rentals")
ggplot_Temp_Rent+scale_y_continuous(breaks=c(0,1215,2500,3500,5000,6000,7000,8000))+labs(title="Total Bike Rentals Vs Actual Temperature | Intercept = 1215")

lm_test<- lm(bk_sh_dy$cnt~bk_sh_dy$actual_temp)
summary(lm_test)
## 
## Call:
## lm(formula = bk_sh_dy$cnt ~ bk_sh_dy$actual_temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4615.3 -1134.9  -104.4  1044.3  3737.8 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1214.642    161.164   7.537 1.43e-13 ***
## bk_sh_dy$actual_temp  161.969      7.444  21.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929 
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
plot(lm_test, col = "green")

From the linear regression of bike rentals (cnt) on actual temperature, we found an R-squared value of about 39%, and the p-value for actual temperature is highly significant (< 2e-16).
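
For simple linear regression, R-squared equals the square of the Pearson correlation, so the value above can be cross-checked against the correlation matrix. The sketch below does that and also evaluates the fitted line at an arbitrary temperature of 20 degrees Celsius (a hypothetical input). All coefficients are copied from the output above.

```r
# Coefficients from summary(lm_test)
intercept <- 1214.642
slope     <- 161.969

# Predicted daily rentals for a given actual temperature in Celsius
predict_cnt <- function(temp_c) intercept + slope * temp_c

# Hypothetical example input: a 20 degree day
pred_at_20 <- predict_cnt(20)

# Correlation between cnt and actual_temp, from the correlation matrix
r <- 0.6274940
r_squared <- r^2

cat("Predicted rentals at 20 C:", pred_at_20, "\n")
cat("Squared correlation:", round(r_squared, 4), "\n")
```

The squared correlation agrees with the Multiple R-squared reported by `summary(lm_test)`.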

Linear Regression Model

Linear Regression between Total Bike Rentals, Temperature, Windspeed and Humidity

lm_test1<- lm(sqrt(bk_sh_dy$cnt)~bk_sh_dy$actual_temp+bk_sh_dy$actual_humidity+bk_sh_dy$actual_windspeed)

lm_test1
## 
## Call:
## lm(formula = sqrt(bk_sh_dy$cnt) ~ bk_sh_dy$actual_temp + bk_sh_dy$actual_humidity + 
##     bk_sh_dy$actual_windspeed)
## 
## Coefficients:
##               (Intercept)       bk_sh_dy$actual_temp  
##                   61.6726                     1.3374  
##  bk_sh_dy$actual_humidity  bk_sh_dy$actual_windspeed  
##                   -0.2531                    -0.6035
summary(lm_test1)
## 
## Call:
## lm(formula = sqrt(bk_sh_dy$cnt) ~ bk_sh_dy$actual_temp + bk_sh_dy$actual_humidity + 
##     bk_sh_dy$actual_windspeed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.460  -8.065   0.531   7.811  25.632 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               61.67260    2.70399  22.808  < 2e-16 ***
## bk_sh_dy$actual_temp       1.33744    0.05721  23.378  < 2e-16 ***
## bk_sh_dy$actual_humidity  -0.25313    0.03073  -8.237 8.21e-16 ***
## bk_sh_dy$actual_windspeed -0.60348    0.08468  -7.127 2.48e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.41 on 727 degrees of freedom
## Multiple R-squared:  0.4781, Adjusted R-squared:  0.4759 
## F-statistic:   222 on 3 and 727 DF,  p-value: < 2.2e-16
lm_test2<- lm(((bk_sh_dy$cnt)^2)~bk_sh_dy$actual_temp+bk_sh_dy$actual_humidity+bk_sh_dy$actual_windspeed)

lm_test2
## 
## Call:
## lm(formula = ((bk_sh_dy$cnt)^2) ~ bk_sh_dy$actual_temp + bk_sh_dy$actual_humidity + 
##     bk_sh_dy$actual_windspeed)
## 
## Coefficients:
##               (Intercept)       bk_sh_dy$actual_temp  
##                  22179942                    1347024  
##  bk_sh_dy$actual_humidity  bk_sh_dy$actual_windspeed  
##                   -280650                    -617458
summary(lm_test2)
## 
## Call:
## lm(formula = ((bk_sh_dy$cnt)^2) ~ bk_sh_dy$actual_temp + bk_sh_dy$actual_humidity + 
##     bk_sh_dy$actual_windspeed)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -38825404  -9628562  -3271989   9996519  45699800 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               22179942    3299775   6.722 3.63e-11 ***
## bk_sh_dy$actual_temp       1347024      69816  19.294  < 2e-16 ***
## bk_sh_dy$actual_humidity   -280650      37503  -7.483 2.10e-13 ***
## bk_sh_dy$actual_windspeed  -617458     103337  -5.975 3.60e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13920000 on 727 degrees of freedom
## Multiple R-squared:  0.3875, Adjusted R-squared:  0.3849 
## F-statistic: 153.3 on 3 and 727 DF,  p-value: < 2.2e-16
lm_test3<- lm((log(bk_sh_dy$cnt))~bk_sh_dy$actual_temp+bk_sh_dy$actual_humidity+bk_sh_dy$actual_windspeed)

lm_test3
## 
## Call:
## lm(formula = (log(bk_sh_dy$cnt)) ~ bk_sh_dy$actual_temp + bk_sh_dy$actual_humidity + 
##     bk_sh_dy$actual_windspeed)
## 
## Coefficients:
##               (Intercept)       bk_sh_dy$actual_temp  
##                  8.225902                   0.046859  
##  bk_sh_dy$actual_humidity  bk_sh_dy$actual_windspeed  
##                 -0.009481                  -0.023428
summary(lm_test3)
## 
## Call:
## lm(formula = (log(bk_sh_dy$cnt)) ~ bk_sh_dy$actual_temp + bk_sh_dy$actual_humidity + 
##     bk_sh_dy$actual_windspeed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5836 -0.2396  0.0637  0.2787  0.7905 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.225902   0.103620   79.39  < 2e-16 ***
## bk_sh_dy$actual_temp       0.046859   0.002192   21.37  < 2e-16 ***
## bk_sh_dy$actual_humidity  -0.009481   0.001178   -8.05 3.37e-15 ***
## bk_sh_dy$actual_windspeed -0.023428   0.003245   -7.22 1.31e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4371 on 727 degrees of freedom
## Multiple R-squared:  0.4408, Adjusted R-squared:  0.4385 
## F-statistic:   191 on 3 and 727 DF,  p-value: < 2.2e-16
lm_final<- lm(bk_sh_dy$cnt~bk_sh_dy$actual_temp+bk_sh_dy$actual_humidity+bk_sh_dy$actual_windspeed)

lm_final
## 
## Call:
## lm(formula = bk_sh_dy$cnt ~ bk_sh_dy$actual_temp + bk_sh_dy$actual_humidity + 
##     bk_sh_dy$actual_windspeed)
## 
## Coefficients:
##               (Intercept)       bk_sh_dy$actual_temp  
##                   4084.36                     161.60  
##  bk_sh_dy$actual_humidity  bk_sh_dy$actual_windspeed  
##                    -31.00                     -71.75
summary(lm_final)
## 
## Call:
## lm(formula = bk_sh_dy$cnt ~ bk_sh_dy$actual_temp + bk_sh_dy$actual_humidity + 
##     bk_sh_dy$actual_windspeed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4780.5 -1082.6   -62.2  1056.5  3653.5 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4084.363    337.862  12.089  < 2e-16 ***
## bk_sh_dy$actual_temp       161.598      7.148  22.606  < 2e-16 ***
## bk_sh_dy$actual_humidity   -31.001      3.840  -8.073 2.83e-15 ***
## bk_sh_dy$actual_windspeed  -71.745     10.581  -6.781 2.48e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1425 on 727 degrees of freedom
## Multiple R-squared:  0.4609, Adjusted R-squared:  0.4587 
## F-statistic: 207.2 on 3 and 727 DF,  p-value: < 2.2e-16
plot(lm_final,col = "gold", main = "Linear Regression: Bike Rentals, Temp, Windspeed and Humidity")
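
Because the four fits use different response scales (square root, square, log, and the raw count), their coefficients and R-squared values are not directly comparable. One way to compare them is to back-transform a prediction to the count scale. The sketch below does this with the coefficients printed above for an arbitrary, hypothetical day (20 degrees Celsius, 60% humidity, windspeed 12). Naively inverting a transform gives a biased estimate of the mean count, so this is only a rough comparison.

```r
# Hypothetical input day, chosen only for illustration
temp <- 20
hum  <- 60
wind <- 12

# Linear predictor for a coefficient vector (intercept, temp, humidity, windspeed)
linear_pred <- function(b) b[1] + b[2] * temp + b[3] * hum + b[4] * wind

# Coefficients copied from the summaries above, back-transformed to counts
pred_raw  <- linear_pred(c(4084.363, 161.598, -31.001, -71.745))         # lm_final
pred_sqrt <- linear_pred(c(61.67260, 1.33744, -0.25313, -0.60348))^2     # lm_test1
pred_sq   <- sqrt(linear_pred(c(22179942, 1347024, -280650, -617458)))   # lm_test2
pred_log  <- exp(linear_pred(c(8.225902, 0.046859, -0.009481, -0.023428))) # lm_test3

print(round(c(raw = pred_raw, sqrt = pred_sqrt,
              squared = pred_sq, log = pred_log), 1))
```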

Conclusion:

Because the correlation plots showed that humidity and windspeed were weakly related to bike rentals, in addition to temperature, we fit a linear model with all three predictors. Its R-squared value is 46%, and the p-values for all three variables are significant.

However, checking the residual plot and the Q-Q plot, we can see that the residuals show a pattern and are not normally distributed, which means the linear model does not fit the data very well.
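
The residual checks described here can be reproduced with a few base R calls. The sketch below uses simulated data (the data frame `sim`, its coefficients, and the noise level are all made up) so that it runs on its own; the Shapiro-Wilk test is our addition as a numeric complement to the Q-Q plot and is not part of the analysis above.

```r
set.seed(1)

# Simulated data for illustration only: a linear trend plus normal noise
sim <- data.frame(temp = runif(100, min = 0, max = 35))
sim$cnt <- 1000 + 150 * sim$temp + rnorm(100, sd = 500)

fit <- lm(cnt ~ temp, data = sim)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
print(sw$p.value)

# Residuals vs fitted (look for patterns) and normal Q-Q plot
par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 2)
```

Applying the same calls to `lm_final` would give a numeric counterpart to the visual checks.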

References

  1. https://ww2.coastal.edu/kingw/statistics/R-tutorials/
  2. http://www.cookbook-r.com/
  3. https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
  4. https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/residuals-least-squares-rsquared/a/introduction-to-residuals