This project is based on the Bike Sharing Demand Kaggle challenge. The main point of this project is to practice Exploratory Data Analysis and begin to get an understanding the power and limitations of certain ML approaches.
Unlike the original project in the Kaggle challenge, we will not take into account seasonality of the data. In this case, we will discover that Linear Regression may not be the best choice given our data.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We will start by loading the provided .csv file that contains the observations.
Expected features: - datetime - hourly date + timestamp - season - 1 = spring, 2 = summer, 3 = fall, 4 = winter - holiday - whether the day is considered a holiday - workingday - whether the day is neither a weekend nor holiday - weather - - 1: Clear, Few clouds, Partly cloudy, Partly cloudy - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog - temp - temperature in Celsius - atemp - “feels like” temperature in Celsius - humidity - relative humidity - windspeed - wind speed - casual - number of non-registered user rentals initiated - registered - number of registered user rentals initiated - count - number of total rentals
bike <- read_csv("bikeshare.csv")
## Rows: 10886 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (11): season, holiday, workingday, weather, temp, atemp, humidity, wind...
## dttm (1): datetime
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(bike, 10)
## # A tibble: 10 x 12
## datetime season holiday workingday weather temp atemp humidity
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.4 81
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.6 80
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.6 80
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.4 75
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.4 75
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.9 75
## 7 2011-01-01 06:00:00 1 0 0 1 9.02 13.6 80
## 8 2011-01-01 07:00:00 1 0 0 1 8.2 12.9 86
## 9 2011-01-01 08:00:00 1 0 0 1 9.84 14.4 75
## 10 2011-01-01 09:00:00 1 0 0 1 13.1 17.4 76
## # ... with 4 more variables: windspeed <dbl>, casual <dbl>, registered <dbl>,
## # count <dbl>
str(bike)
## spec_tbl_df [10,886 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ datetime : POSIXct[1:10886], format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
## $ season : num [1:10886] 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : num [1:10886] 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: num [1:10886] 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : num [1:10886] 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num [1:10886] 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num [1:10886] 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : num [1:10886] 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num [1:10886] 0 0 0 0 0 ...
## $ casual : num [1:10886] 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: num [1:10886] 13 32 27 10 1 1 0 2 7 6 ...
## $ count : num [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
## - attr(*, "spec")=
## .. cols(
## .. datetime = col_datetime(format = ""),
## .. season = col_double(),
## .. holiday = col_double(),
## .. workingday = col_double(),
## .. weather = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. humidity = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. count = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The variable that we will try to predict is the “count” of trips in the bike sharing system. To gain an understanding about how we will achieve this, let’s do some exploratory analysis.
ggplot(bike, aes(x=temp, y=count)) +
geom_point(alpha=.5, shape=21, size=2.5, aes(fill=temp))
ggplot(bike, aes(x=datetime, y=count)) +
geom_point(alpha=.7, shape=21, size=2.5, aes(fill=temp))+
scale_fill_gradient(low = "cyan", high = "red")
From this initial exploration we can observe that there is in fact a seasonality to the data. Rental count seem so increase with higher temperatures and decrease with lower ones. Let’s have a quick overview of pros and cons right now of Linear Regression:
Pros: - Simple to explain - Highly interpretable - Model training and prediction are fast - No tuning is required (excluding regularization) - Features don’t need scaling - Can perform well with a small number of observations - Well-understood
Cons: - Assumes a linear relationship between the features and the response - Performance is (generally) not competitive with the best supervised learning methods due to high bias - Can’t automatically learn feature interactions
We’ll keep this in mind as we continue on.
What is the correlation between temp and count?
cor(bike$count, bike$temp) # construir matriz de correlación
## [1] 0.3944536
Let’s explore the season data using a box plot, with the y axis indicating count and the x axis begin a box for each season.
ggplot(bike, aes(x=as.factor(season), y=count))+
geom_boxplot(aes(color=as.factor(season)))
## Feature Engineering
The low correlation between temperature and count tells us that it can be the only explanatory variable to predict travel demand. We will now proceed to do some feature engineering based on current domain knowledge to see if we can come up with a better model.
Timestamp and day variable
bike <- bike %>%
mutate(hour=format(datetime, "%H"))%>%
mutate(day=(weekdays(bike$datetime))) %>%
mutate(nonwd=(day=="sábado" | day=="domingo"))
ggplot(bike, aes(x=hour, y=count)) +
geom_point(alpha=.5, shape=21, size=4, position=position_jitter(w=1, h=0), aes(fill=temp))+
scale_fill_gradientn(colors=c("navy", "blue", "green", "yellow", "red"))
ggplot(filter(bike, nonwd==TRUE), aes(x=hour, y=count)) +
geom_point(alpha=.5, shape=21, size=4, position=position_jitter(w=1, h=0), aes(fill=temp))+
scale_fill_gradientn(colors=c("navy", "blue", "green", "yellow", "red"))
As we can see, working days have peak activity during the morning (~8am) and right after work gets out (~5pm), with some lunchtime activity. While the non-work days have a steady rise and fall for the afternoon.
In our first attempt we will be based only in the temperature feature to build the model.
bkmod.1 <- lm(data = bike, count ~ temp)
summary(bkmod.1)
##
## Call:
## lm(formula = count ~ temp, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.32 -112.36 -33.36 78.98 741.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0462 4.4394 1.362 0.173
## temp 9.1705 0.2048 44.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared: 0.1556, Adjusted R-squared: 0.1555
## F-statistic: 2006 on 1 and 10884 DF, p-value: < 2.2e-16
As a result of this linear model we can interpret that a one unit increase in temperature will result in 9.1707 increase in rental count (demand). This relation has high statistical significance, however, it only explains 15.56% of the real model.
How many bike rentals we would have with an average temperature of 25°?
predict(bkmod.1, data.frame(temp=c(25)))
## 1
## 235.3097
bike <- mutate(bike, hournum = as.numeric(hour))
Finally, lets build a model that attempts to predict count based off of the following features. - season - holiday - workingday - weather - temp - humidity - windspeed - hour (factor)
bkmod.2 <- lm(count ~ season+holiday+workingday+weather+temp+humidity+windspeed+hournum,
bike)
summary(bkmod.2)
##
## Call:
## lm(formula = count ~ season + holiday + workingday + weather +
## temp + humidity + windspeed + hournum, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -324.61 -96.88 -31.01 55.27 688.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.91369 8.45147 5.551 2.91e-08 ***
## season 21.70333 1.35409 16.028 < 2e-16 ***
## holiday -10.29914 8.79069 -1.172 0.241
## workingday -0.71781 3.14463 -0.228 0.819
## weather -3.20909 2.49731 -1.285 0.199
## temp 7.01953 0.19135 36.684 < 2e-16 ***
## humidity -2.21174 0.09083 -24.349 < 2e-16 ***
## windspeed 0.20271 0.18639 1.088 0.277
## hournum 7.61283 0.21688 35.102 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared: 0.3344, Adjusted R-squared: 0.3339
## F-statistic: 683 on 8 and 10877 DF, p-value: < 2.2e-16
This project aimed to test the relevance of linear regression to explain real world complex data that involves user demand of a mobility service in a given city. The statistic summary shows us that the model is not performing well.
A likely better approach would be to have training data coming not from a random sample, but taking into account past and future data points.