Linear Regression Project

This project is based on the Bike Sharing Demand Kaggle challenge. The main point of this project is to practice Exploratory Data Analysis and begin to get an understanding of the power and limitations of certain ML approaches.

Unlike the original Kaggle challenge, we will not take the seasonality of the data into account. In this case, we will discover that Linear Regression may not be the best choice for our data.

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data Loading

We will start by loading the provided .csv file that contains the observations.

Expected features:

- datetime - hourly date + timestamp
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor a holiday
- weather
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - “feels like” temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals

bike <- read_csv("bikeshare.csv")
## Rows: 10886 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl  (11): season, holiday, workingday, weather, temp, atemp, humidity, wind...
## dttm  (1): datetime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(bike, 10)
## # A tibble: 10 x 12
##    datetime            season holiday workingday weather  temp atemp humidity
##    <dttm>               <dbl>   <dbl>      <dbl>   <dbl> <dbl> <dbl>    <dbl>
##  1 2011-01-01 00:00:00      1       0          0       1  9.84  14.4       81
##  2 2011-01-01 01:00:00      1       0          0       1  9.02  13.6       80
##  3 2011-01-01 02:00:00      1       0          0       1  9.02  13.6       80
##  4 2011-01-01 03:00:00      1       0          0       1  9.84  14.4       75
##  5 2011-01-01 04:00:00      1       0          0       1  9.84  14.4       75
##  6 2011-01-01 05:00:00      1       0          0       2  9.84  12.9       75
##  7 2011-01-01 06:00:00      1       0          0       1  9.02  13.6       80
##  8 2011-01-01 07:00:00      1       0          0       1  8.2   12.9       86
##  9 2011-01-01 08:00:00      1       0          0       1  9.84  14.4       75
## 10 2011-01-01 09:00:00      1       0          0       1 13.1   17.4       76
## # ... with 4 more variables: windspeed <dbl>, casual <dbl>, registered <dbl>,
## #   count <dbl>
str(bike)
## spec_tbl_df [10,886 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ datetime  : POSIXct[1:10886], format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
##  $ season    : num [1:10886] 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : num [1:10886] 0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: num [1:10886] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : num [1:10886] 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num [1:10886] 9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num [1:10886] 14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : num [1:10886] 81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num [1:10886] 0 0 0 0 0 ...
##  $ casual    : num [1:10886] 3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: num [1:10886] 13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : num [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   datetime = col_datetime(format = ""),
##   ..   season = col_double(),
##   ..   holiday = col_double(),
##   ..   workingday = col_double(),
##   ..   weather = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   humidity = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   count = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The variable that we will try to predict is the “count” of trips in the bike sharing system. To understand how we might achieve this, let’s do some exploratory analysis.

Exploratory Data Analysis

Rental Count vs. Temperature

ggplot(bike, aes(x=temp, y=count)) + 
      geom_point(alpha=.5, shape=21, size=2.5, aes(fill=temp))

Rental Count vs. Datetime and Temperature

ggplot(bike, aes(x=datetime, y=count)) + 
      geom_point(alpha=.7, shape=21, size=2.5, aes(fill=temp))+
      scale_fill_gradient(low = "cyan", high = "red")

From this initial exploration we can observe that there is in fact a seasonal pattern in the data. Rental counts seem to increase with higher temperatures and decrease with lower ones. Let’s take a quick look at the pros and cons of Linear Regression before going further:

Pros:

- Simple to explain
- Highly interpretable
- Model training and prediction are fast
- No tuning is required (excluding regularization)
- Features don’t need scaling
- Can perform well with a small number of observations
- Well-understood

Cons:

- Assumes a linear relationship between the features and the response
- Performance is (generally) not competitive with the best supervised learning methods due to high bias
- Can’t automatically learn feature interactions

We’ll keep this in mind as we continue on.

What is the correlation between temp and count?

cor(bike$count, bike$temp) # correlation between count and temperature
## [1] 0.3944536
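As a small extension (a sketch using only the numeric columns already loaded), the full correlation matrix can be inspected in one call:

# Sketch: pairwise correlations across the numeric features
round(cor(select(bike, temp, atemp, humidity, windspeed, casual, registered, count)), 2)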

Let’s explore the season data using a box plot, with the y axis indicating count and the x axis showing a box for each season.

ggplot(bike, aes(x=as.factor(season), y=count))+
      geom_boxplot(aes(color=as.factor(season)))
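As a complement to the box plot, a quick numeric summary per season (a sketch using the already-loaded dplyr) makes the same comparison in table form:

bike %>%
      group_by(season) %>%
      summarise(mean_count = mean(count), median_count = median(count), n = n())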

Feature Engineering

The low correlation between temperature and count tells us that it cannot be the only explanatory variable used to predict travel demand. We will now do some feature engineering based on domain knowledge to see if we can come up with a better model.

Timestamp and day variable

bike <- bike %>%
      mutate(hour = format(datetime, "%H")) %>%
      mutate(day = weekdays(datetime)) %>%                  # weekday names come back in the session locale (Spanish here)
      mutate(nonwd = (day == "sábado" | day == "domingo"))  # TRUE on Saturdays and Sundays
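Note that weekdays() returns names in the session locale (Spanish here, hence “sábado”/“domingo”). A locale-independent sketch of the same weekend flag, using the ISO weekday number instead of the weekday name, could look like this:

# Sketch: locale-independent weekend flag (%u gives 1 = Monday ... 7 = Sunday)
bike <- bike %>%
      mutate(wdaynum = as.numeric(format(datetime, "%u"))) %>%
      mutate(nonwd = wdaynum >= 6)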

Hour vs. Count

ggplot(bike, aes(x=hour, y=count)) + 
      geom_point(alpha=.5, shape=21, size=4, position=position_jitter(w=1, h=0), aes(fill=temp))+
      scale_fill_gradientn(colors=c("navy", "blue", "green", "yellow", "red"))

Hour vs. Count on non-working days

ggplot(filter(bike, nonwd==TRUE), aes(x=hour, y=count)) + 
      geom_point(alpha=.5, shape=21, size=4, position=position_jitter(w=1, h=0), aes(fill=temp))+
      scale_fill_gradientn(colors=c("navy", "blue", "green", "yellow", "red"))
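For comparison, the working-day counterpart (a sketch mirroring the plot above):

ggplot(filter(bike, nonwd==FALSE), aes(x=hour, y=count)) + 
      geom_point(alpha=.5, shape=21, size=4, position=position_jitter(w=1, h=0), aes(fill=temp))+
      scale_fill_gradientn(colors=c("navy", "blue", "green", "yellow", "red"))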

As we can see, working days have peak activity in the morning (~8am) and right after work gets out (~5pm), with some lunchtime activity, while non-working days show a steady rise and fall through the afternoon.

Building the Model

In our first attempt we will build the model using only the temperature feature.

bkmod.1 <- lm(data = bike, count ~ temp)
summary(bkmod.1)
## 
## Call:
## lm(formula = count ~ temp, data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.32 -112.36  -33.36   78.98  741.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0462     4.4394   1.362    0.173    
## temp          9.1705     0.2048  44.783   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared:  0.1556, Adjusted R-squared:  0.1555 
## F-statistic:  2006 on 1 and 10884 DF,  p-value: < 2.2e-16

From this linear model we can interpret that a one-unit (1 °C) increase in temperature is associated with an increase of about 9.17 in rental count (demand). This relationship is highly statistically significant; however, the model only explains about 15.6% of the variance in rental counts (R-squared = 0.1556).
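One quick way to visualize how much spread this single-feature model leaves unexplained is to overlay the fitted line on the original scatter plot (a sketch; geom_smooth with method = "lm" fits the same count ~ temp regression as bkmod.1):

ggplot(bike, aes(x=temp, y=count)) +
      geom_point(alpha=.3) +
      geom_smooth(method="lm", se=FALSE, color="red")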

How many bike rentals would we expect at an average temperature of 25 °C?

predict(bkmod.1, data.frame(temp=c(25)))
##        1 
## 235.3097
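The same number can be reproduced by hand from the fitted coefficients:

coef(bkmod.1)[1] + coef(bkmod.1)[2] * 25  # 6.0462 + 9.1705 * 25, roughly 235.31

Before building a richer model, the hour string is also converted to a numeric hournum column below.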
bike <- mutate(bike, hournum = as.numeric(hour))

Finally, let’s build a model that attempts to predict count based on the following features:

- season
- holiday
- workingday
- weather
- temp
- humidity
- windspeed
- hournum (the hour converted to a numeric value)

bkmod.2 <- lm(count ~ season+holiday+workingday+weather+temp+humidity+windspeed+hournum,
              bike)
summary(bkmod.2)
## 
## Call:
## lm(formula = count ~ season + holiday + workingday + weather + 
##     temp + humidity + windspeed + hournum, data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -324.61  -96.88  -31.01   55.27  688.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.91369    8.45147   5.551 2.91e-08 ***
## season       21.70333    1.35409  16.028  < 2e-16 ***
## holiday     -10.29914    8.79069  -1.172    0.241    
## workingday   -0.71781    3.14463  -0.228    0.819    
## weather      -3.20909    2.49731  -1.285    0.199    
## temp          7.01953    0.19135  36.684  < 2e-16 ***
## humidity     -2.21174    0.09083 -24.349  < 2e-16 ***
## windspeed     0.20271    0.18639   1.088    0.277    
## hournum       7.61283    0.21688  35.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared:  0.3344, Adjusted R-squared:  0.3339 
## F-statistic:   683 on 8 and 10877 DF,  p-value: < 2.2e-16
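The hourly demand pattern seen earlier has two commute peaks, so a single linear slope on hournum cannot capture it. A variant worth trying (a sketch, not run here; its coefficients and fit would need to be inspected) is to enter the hour as a factor, giving each hour its own level effect:

# Sketch: same predictors, but with hour as a factor instead of a numeric term
bkmod.3 <- lm(count ~ season + holiday + workingday + weather + temp + humidity + windspeed + factor(hour),
              data = bike)
summary(bkmod.3)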

This project aimed to test how well linear regression can explain complex real-world data describing user demand for a mobility service in a given city. The model summary shows that the model is not performing well (adjusted R-squared of about 0.33).

A likely better approach would be to treat the observations as a time series, so that training data is not drawn as a random sample but chosen with the ordering of past and future data points in mind.
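A minimal sketch of such a chronological split (train on the earlier 80% of observations, evaluate on the later 20%; the cutoff fraction is an arbitrary choice here):

# Sketch: chronological train/test split instead of a random sample
bike_sorted <- arrange(bike, datetime)
cutoff <- floor(0.8 * nrow(bike_sorted))
train  <- bike_sorted[1:cutoff, ]
test   <- bike_sorted[(cutoff + 1):nrow(bike_sorted), ]

mod_train <- lm(count ~ season + temp + humidity + hournum, data = train)
sqrt(mean((test$count - predict(mod_train, test))^2))  # out-of-sample RMSE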