1 Introduction

I carried out a project to predict the rental volume of 따릉이, a public bicycle in Seoul. Data is used for a year from December 2017 to November 2018, before the corona virus period and was available on the Seoul public data website. Also, the weather data and fine dust data that are expected to affect bicycle rental were collected by the Korea Meteorological Administration. XGBoost model algorithm is applied. Baseline model result shows 315 of RMSE and 76% of \(R^{2}\), and I have made a substantial improvement with the result of 182 of RMSE and 92% of \(R^{2}\).



2 Feature destription

date : 날짜

bike_cnt : 매시간 공공자전거 대여수

hour : 시간

temperature_c : 기온(섭씨)

humidity_percent : 습도(%)

windspeed_ms : 풍속(ms/s)

visibility_10m : 가시성

dew_point_temperature : 이슬점

solar_Radiation : 일사량

rainfall_mm : 강수량(mm)

snowfall_cm : 강설량(cm)

seasons : 계절

finedust : 미세먼지(μg)

3 Loading library and dataset

library(tidyverse)
library(lubridate)
library(funModeling)
library(gridExtra)
library(scales)
library(rsample)
library(recipes)
library(tidymodels)
library(doParallel)
library(vip)
bike <- read.csv("./dataset/SeoulBikeData_modified.csv")

# Tidy variable names
bike <- janitor::clean_names(bike)

# For building a baseline model later on.
bike_baseline <- bike


4 EDA

4.1 Missing values

sapply(bike, function(x) sum(is.na(x)))
##                     x                  date                  hour 
##                     0                     0                     0 
##              bike_cnt         temperature_c      humidity_percent 
##                     0                     0                     0 
##          windspeed_ms        visibility_10m dew_point_temperature 
##                     0                     0                     0 
##       solar_radiation           rainfall_mm           snowfall_cm 
##                     0                     0                     0 
##               seasons             fine_dust 
##                     0                     0

No missing values are found in the data set.

4.2 Descriptive EDA

head(bike)
##   x           date hour bike_cnt temperature_c humidity_percent windspeed_ms
## 1 1 12/1/2017 0:00    0      254          -5.2               37          2.2
## 2 2 12/1/2017 1:00    1      204          -5.5               38          0.8
## 3 3 12/1/2017 2:00    2      173          -6.0               39          1.0
## 4 4 12/1/2017 3:00    3      107          -6.2               40          0.9
## 5 5 12/1/2017 4:00    4       78          -6.0               36          2.3
## 6 6 12/1/2017 5:00    5      100          -6.4               37          1.5
##   visibility_10m dew_point_temperature solar_radiation rainfall_mm snowfall_cm
## 1           2000                   -18               0           0           0
## 2           2000                   -18               0           0           0
## 3           2000                   -18               0           0           0
## 4           2000                   -18               0           0           0
## 5           2000                   -19               0           0           0
## 6           2000                   -19               0           0           0
##   seasons fine_dust
## 1  Winter        15
## 2  Winter        11
## 3  Winter        17
## 4  Winter        21
## 5  Winter        18
## 6  Winter        14

.

glimpse(bike)
## Rows: 8,760
## Columns: 14
## $ x                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1~
## $ date                  <chr> "12/1/2017 0:00", "12/1/2017 1:00", "12/1/2017 2~
## $ hour                  <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14~
## $ bike_cnt              <int> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490,~
## $ temperature_c         <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, ~
## $ humidity_percent      <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, ~
## $ windspeed_ms          <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5~
## $ visibility_10m        <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dew_point_temperature <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5,~
## $ solar_radiation       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ~
## $ rainfall_mm           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall_cm           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons               <chr> "Winter", "Winter", "Winter", "Winter", "Winter"~
## $ fine_dust             <int> 15, 11, 17, 21, 18, 14, 26, 12, 24, 17, 21, 21, ~

Without the x’s, the dataframe consists of 13 predictors and our response variable bike_cnt.

summary(bike)
##        x            date                hour         bike_cnt    temperature_c
##  Min.   :   1   Length:8760        Min.   : 0.0   Min.   :   0   Min.   :-18  
##  1st Qu.:2191   Class :character   1st Qu.: 5.8   1st Qu.: 191   1st Qu.:  4  
##  Median :4380   Mode  :character   Median :11.5   Median : 504   Median : 14  
##  Mean   :4380                      Mean   :11.5   Mean   : 705   Mean   : 13  
##  3rd Qu.:6570                      3rd Qu.:17.2   3rd Qu.:1065   3rd Qu.: 22  
##  Max.   :8760                      Max.   :23.0   Max.   :3556   Max.   : 39  
##  humidity_percent  windspeed_ms visibility_10m dew_point_temperature
##  Min.   : 0       Min.   :0.0   Min.   :  27   Min.   :-30.6        
##  1st Qu.:42       1st Qu.:0.9   1st Qu.: 940   1st Qu.: -4.7        
##  Median :57       Median :1.5   Median :1698   Median :  5.1        
##  Mean   :58       Mean   :1.7   Mean   :1437   Mean   :  4.1        
##  3rd Qu.:74       3rd Qu.:2.3   3rd Qu.:2000   3rd Qu.: 14.8        
##  Max.   :98       Max.   :7.4   Max.   :2000   Max.   : 27.2        
##  solar_radiation  rainfall_mm  snowfall_cm    seasons            fine_dust  
##  Min.   :0.0     Min.   : 0   Min.   :0.0   Length:8760        Min.   :  2  
##  1st Qu.:0.0     1st Qu.: 0   1st Qu.:0.0   Class :character   1st Qu.: 19  
##  Median :0.0     Median : 0   Median :0.0   Mode  :character   Median : 29  
##  Mean   :0.6     Mean   : 0   Mean   :0.1                      Mean   : 35  
##  3rd Qu.:0.9     3rd Qu.: 0   3rd Qu.:0.0                      3rd Qu.: 45  
##  Max.   :3.5     Max.   :35   Max.   :8.8                      Max.   :304
profiling_num(bike)
##                 variable     mean std_dev variation_coef  p_01  p_05   p_25
## 1                      x 4380.500 2528.94           0.58  88.6 439.0 2190.8
## 2                   hour   11.500    6.92           0.60   0.0   1.0    5.8
## 3               bike_cnt  704.602  645.00           0.92   0.0  22.0  191.0
## 4          temperature_c   12.883   11.94           0.93 -12.7  -7.1    3.5
## 5       humidity_percent   58.226   20.36           0.35  17.0  27.0   42.0
## 6           windspeed_ms    1.725    1.04           0.60   0.1   0.4    0.9
## 7         visibility_10m 1436.826  608.30           0.42 173.0 300.0  940.0
## 8  dew_point_temperature    4.074   13.06           3.21 -24.8 -19.5   -4.7
## 9        solar_radiation    0.569    0.87           1.53   0.0   0.0    0.0
## 10           rainfall_mm    0.149    1.13           7.59   0.0   0.0    0.0
## 11           snowfall_cm    0.075    0.44           5.82   0.0   0.0    0.0
## 12             fine_dust   35.441   24.18           0.68   5.0   8.0   19.0
##       p_50    p_75   p_95   p_99 skewness kurtosis     iqr         range_98
## 1  4380.50 6570.25 8322.0 8672.4     0.00      1.8 4379.50 [88.59, 8672.41]
## 2    11.50   17.25   22.0   23.0     0.00      1.8   11.50          [0, 23]
## 3   504.50 1065.25 2043.0 2526.2     1.15      3.9  874.25     [0, 2526.23]
## 4    13.70   22.50   30.7   35.1    -0.20      2.2   19.00  [-12.741, 35.1]
## 5    57.00   74.00   94.0   97.0     0.06      2.2   32.00         [17, 97]
## 6     1.50    2.30    3.7    4.7     0.89      3.7    1.40       [0.1, 4.7]
## 7  1698.00 2000.00 2000.0 2000.0    -0.70      2.0 1060.00      [173, 2000]
## 8     5.10   14.80   22.4   24.7    -0.37      2.2   19.50    [-24.8, 24.7]
## 9     0.01    0.93    2.6    3.2     1.50      4.1    0.93        [0, 3.17]
## 10    0.00    0.00    0.4    4.0    14.53    287.8    0.00           [0, 4]
## 11    0.00    0.00    0.2    2.5     8.44     96.7    0.00         [0, 2.5]
## 12   29.00   45.00   82.0  119.0     2.03     12.1   26.00         [5, 119]
##           range_80
## 1  [876.9, 7884.1]
## 2          [2, 21]
## 3     [64, 1671.1]
## 4       [-3.7, 28]
## 5         [32, 86]
## 6       [0.6, 3.2]
## 7    [436.9, 2000]
## 8      [-15.3, 21]
## 9       [0, 2.051]
## 10          [0, 0]
## 11          [0, 0]
## 12        [12, 68]
bike <- bike %>% 
  mutate(date = mdy_hm(date, tz = "Asia/Seoul"))

4.3 Distributions of each variable

Numeric variables

With histogram plots, I explored the distributions of numeric variables including the response variable.

# Only numeric variables included

numVars <- profiling_num(bike) %>% 
  .$variable

numVars
##  [1] "x"                     "hour"                  "bike_cnt"             
##  [4] "temperature_c"         "humidity_percent"      "windspeed_ms"         
##  [7] "visibility_10m"        "dew_point_temperature" "solar_radiation"      
## [10] "rainfall_mm"           "snowfall_cm"           "fine_dust"
nn <- list()

for(i in 1:length(numVars)) {
  x = bike[, numVars][,i]
  nn[[i]] <- ggplot(data = data.frame(x),
                    aes(x)) +
    geom_histogram(bins = 100)
}

grid.arrange(nn[[1]], nn[[2]], nn[[3]],
             nn[[4]], nn[[5]], nn[[6]],
             nn[[7]], nn[[8]], nn[[9]],
             nn[[10]], nn[[11]], nn[[12]], ncol = 3)


Nominal Variables

With bar plots, I explored the distributions of categorical variables

# Only categorical variables

catVars <- select(bike, -numVars) %>% 
  colnames()

catVars
## [1] "date"    "seasons"
cc <- list()
for(i in 1:length(catVars)) {
  x = bike[, catVars][, i]
  cc[[i]] <- ggplot(data.frame(x), aes(x)) +
    geom_bar()
}

grid.arrange(cc[[1]], cc[[2]])

I found some variables are highly skewed ; especially rainfall_mm and snowfall_mm. This was expected as it doesn’t rain or snow the most of the time.

4.4 Relationship between each variable

Add variables related to calender components and Korean holidays

To embrace the characteristics of time series, I made month, day, hour and weekend, etc variables and a holiday variable that shows Korean public holidays and weekends.

# Calendar components
bike <- bike %>% 
  mutate(month = month(date, label = T),
         wday = wday(date, label = T),
         mday = mday(date),
         yday = yday(date),
         weekend = ifelse(wday %in% c("Sat", "Sun"), 1, 0),
         hour = hour(date))
# Add Korean National Holidays
holiday <- c(
  "20171225", "20180101", "20180215", "20180216", "20180301", "20180507", 
  "20180522", "20180606", "20180613", "20180815", "20180924", 
  "20180925", "20180926", "20181003", "20181009"
)

holiday <- ymd(holiday)

bike <- bike %>% 
  mutate(holiday = NULL,
         date2 = as.Date(date, tz = "Asia/Seoul"),
         holiday = case_when(
           weekend == 1 ~ 1,
           date2 %in% holiday ~ 1,
           TRUE ~ 0))

Korean holidays and weekends such as National Liberation Day, Christmas, and Children’s Day, and Saturdays and Sundays were added as holiday variables.

4.4.1 Correlation plot

Cor_data <- round(cor(bike[, numVars]), 2)

Cor_high <- Cor_data[, "bike_cnt"] %>% 
  abs() %>% 
  sort(decreasing = T) %>% 
  names()

Cor_data <- Cor_data[Cor_high, Cor_high]

Cor_data[lower.tri(Cor_data)] <- NA

melted_cor <- reshape2::melt(Cor_data, na.rm = T)

ggplot(melted_cor, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "skyblue",
                       high = "red",
                       mid = "white",
                       midpoint = 0,
                       limit = c(-1, 1),
                       name = "Peason\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, vjust = 1 , hjust = 1, size = 9)) +
  coord_fixed() +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
  theme(panel.grid.major = element_blank()) +
  labs(x = NULL, y = NULL)

Before diving into the relationships between variables, I checked the Pearson correlations between variables. Seemingly temperature and hour pattern variables have high correlations with the bike rental volume.

4.4.2 hourly rental pattern

bike %>% 
  ggplot(aes(hour, bike_cnt)) +
  geom_boxplot(aes(group = hour)) +
  geom_smooth(se = F)

I plotted the distribution of hourly rental volume of the bicycle with a boxplot. It shows a relatively high distribution at around 8 am and 19 pm. An object (x = 4923, date = 2018-06-24) at 2 a.m. that appears to be an outlier was discovered and in-depth analysis was conducted.

bike %>%
  filter(hour == 2& bike_cnt > 1000) %>% 
  select(x, date, hour, bike_cnt)
##      x                date hour bike_cnt
## 1 4923 2018-06-24 02:00:00    2     1254

The distribution of all the rental amount of bicycle at 2 a.m.

bike %>% 
  filter(hour == 2) %>% 
  
  ggplot(aes(bike_cnt)) +
  geom_histogram(binwidth = 10)

On June 24, 2018, compared to the amount of rental at other times, the amount of rental at 2 a.m. is considered unusual and is suspected to be abnormal.

bike %>% 
  filter(date2 == ymd("20180624")) %>% 
  select(date, hour, bike_cnt)
##                   date hour bike_cnt
## 1  2018-06-24 00:00:00    0      849
## 2  2018-06-24 01:00:00    1      545
## 3  2018-06-24 02:00:00    2     1254
## 4  2018-06-24 03:00:00    3      644
## 5  2018-06-24 04:00:00    4      286
## 6  2018-06-24 05:00:00    5      235
## 7  2018-06-24 06:00:00    6      258
## 8  2018-06-24 07:00:00    7      369
## 9  2018-06-24 08:00:00    8      561
## 10 2018-06-24 09:00:00    9      734
## 11 2018-06-24 10:00:00   10      763
## 12 2018-06-24 11:00:00   11      870
## 13 2018-06-24 12:00:00   12     1042
## 14 2018-06-24 13:00:00   13     1062
## 15 2018-06-24 14:00:00   14     1064
## 16 2018-06-24 15:00:00   15     1213
## 17 2018-06-24 16:00:00   16     1471
## 18 2018-06-24 17:00:00   17     1754
## 19 2018-06-24 18:00:00   18     2094
## 20 2018-06-24 19:00:00   19     2290
## 21 2018-06-24 20:00:00   20     2387
## 22 2018-06-24 21:00:00   21     2153
## 23 2018-06-24 22:00:00   22     1801
## 24 2018-06-24 23:00:00   23     1329

Bicycle rental on 2018 June 24th (Sunday).

bike %>% 
  filter(date2 == ymd("20180624")) %>% 
  ggplot(aes(hour, bike_cnt)) +
  geom_line() +
  geom_point() +
  geom_point(data = bike %>% 
               filter(date2 == ymd("20180624") &
                        hour == 2), color = "red") +
  geom_line(data = bike %>% 
               filter(date2 == ymd("20180624") &
                        hour == 2), color = "red")

Bicycle rental on 2018 June 17th (Sunday), a week earlier.

bike %>% 
  filter(date2 == ymd("20180617")) %>% 
  ggplot(aes(hour, bike_cnt)) +
  geom_line() +
  geom_point()

Compared to the hourly rental history on the day a week earlier, the object appears to be abnormal at 2 am. I see this as an outlier and deal with it later on.



요일별 패턴

bike %>% 
  ggplot(aes(hour, bike_cnt, color = wday)) +
  geom_smooth(se = F, size = 1) +
  scale_color_discrete("")

It shows that the pattern of use of 따릉이 is different on weekends and weekdays. On weekdays, the usage seems high during rush hours.


Comparison of Holidays and Weekdays excluding Weekends (Saturday and Sunday)

bike %>% 
  filter(!wday %in% c("Sat", "Sun")) %>% 
  mutate(holiday = factor(holiday)) %>% 
  ggplot(aes(hour, bike_cnt, color = holiday)) +
  geom_smooth(se = F, size = 1)

This shows that the holiday shows a similar pattern with the above even if the weekend is excluded.

4.4.3 Temperature

bike %>% 
  ggplot(aes(temperature_c, bike_cnt)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = F)

It shows that the higher the temperature, the higher the rental amount of bicycles, and the lower the rental amount if the temperature rises above a certain temperature.

t <- bike %>% 
  mutate(jitter_times = date + minutes(round(runif(nrow(bike), min = 0, max = 59))),
         times = as.POSIXct(hms::parse_hm(format(jitter_times, format = "%H:%M")))) %>%

  ggplot(aes(x = times, y = bike_cnt, color = temperature_c)) +
  geom_point() +
  scale_colour_gradientn("Temp (°C)", colours=c("#5e4fa2", "#3288bd", "#66c2a5", "#abdda4", "#e6f598", "#fee08b", "#fdae61", "#f46d43", "#d53e4f", "#9e0142")) +
  scale_x_datetime(breaks = date_breaks("4 hours"), labels = date_format("%H:%M")) +
  theme(axis.text.x = element_text(angle = 40, hjust = 1))
t

The temperature over time and the resulting bicycle rental patterns were visualized the whole and by seasons. They show the obvious different patterns by temperature.

Rent by Seasonal Temperature and Time

t + facet_wrap(~ seasons,  ncol = 2)

4.4.4 humidity_percent

bike %>% 
  ggplot(aes(humidity_percent, bike_cnt)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = F)

As the humidity increases, the amount of bicycle rental also decreases.

4.4.5 Fine dust

bike %>% 
  ggplot(aes(fine_dust, bike_cnt)) +
  geom_point(alpha = 0.3)

The fine dust and bicycle rental volume do not seem to have much to do with, but we will find out in-depth about the days when fine dust is severe (the figure is over 100 or higher).

# I add a finedust_alert variable that is 1 if fine_dust is over 100.
bike <- bike %>% 
  mutate(finedust_alert = ifelse(fine_dust >= 100, 1, 0))

bike %>% 
  count(month, finedust_alert) %>% 
  filter(finedust_alert == 1)
##   month finedust_alert  n
## 1   Jan              1 19
## 2   Feb              1 11
## 3   Mar              1 16
## 4   Apr              1 29
## 5   May              1 25
## 6   Jun              1  2
## 7   Oct              1  1
## 8   Nov              1 39
## 9   Dec              1 34

A fine dust alert variable is added when the fine dust level is more than 100. April and November have the most severe fine dust days.

bike %>% 
  filter(month == "Apr" | month == "Nov") %>% 
  mutate(finedust_alert = factor(finedust_alert)) %>% 
  ggplot(aes(finedust_alert, bike_cnt, fill = finedust_alert)) +
  geom_bar(stat = "summary", fun = mean)

When calculating the average amount of rentals between days with severe fine dust in April and November and days without, it can be seen that the average amount of rentals on days with severe fine dust is lower on average.

4.4.6 rainfall_mm

bike %>% 
  filter(rainfall_mm > 0) %>% 
  ggplot(aes(rainfall_mm, bike_cnt)) +
  geom_point(alpha = .4)

The more rainfall, the lower the bicycle rental.

bike <- bike %>% 
  mutate(raining = ifelse(rainfall_mm > 0, 1, 0))

Raining variables were added during rainy hours.

4.4.7 snowfall_mm

bike %>% 
  filter(snowfall_cm > 0) %>% 
  ggplot(aes(snowfall_cm, bike_cnt)) +
  geom_point(alpha = .4)

The higher the snowfall, the lower the bicycle rental.

bike <- bike %>% 
  mutate(snowing = ifelse(snowfall_cm>0, 1, 0))

Snowing variables were added during the snowy hours.

bike %>% 
  filter(snowing == 1) %>% 
  count(month)
##   month   n
## 1   Jan 192
## 2   Feb  38
## 3   Nov  51
## 4   Dec 162

It shows that it snowed frequently in January and December. I compared the number of bicycle rentals at snowy hours with the number of bicycle rentals at non-snowy hours in January and December.

bike %>% 
  filter(month %in% c("Jan", "Dec")) %>% 
  mutate(snowing = factor(snowing)) %>% 
  
  ggplot(aes(month, bike_cnt, fill = snowing)) +
  geom_bar(stat = "summary", fun = mean, position = "dodge")

It can be seen that the amount of rental at the snowy time hours is significantly lower than that at the non-snowy hours.

4.4.8 Solar radiation

The relationship between the rental amount of the bicycle and the solar radiation.

bike %>% 
  filter(solar_radiation >0) %>% 
  ggplot(aes(solar_radiation, bike_cnt)) +

  geom_point(alpha = 0.3) +
  geom_smooth(se = F)

I added the no_sunny variable for the time hours(night, cloudy day, etc.) with zero solar radiation and compared the hours with zero solar radiation to those without.

bike <- bike %>% 
  mutate(no_sunny = ifelse(solar_radiation == 0, 1, 0))
bike %>% 
  mutate(no_sunny = factor(no_sunny)) %>% 
  ggplot(aes(hour, bike_cnt, color = no_sunny)) +
  geom_smooth(se = F)

It shows that bicycle rental amount is lower on average on days when there is no solar radiation.


5 Preprocessing features

5.1 Outlier

I plotted the pattern of the bicycle rent at 2 am in May, June and July and double checked the 4923th object was an outlier and removed it.

bike[4923, ]
##         x                date hour bike_cnt temperature_c humidity_percent
## 4923 4923 2018-06-24 02:00:00    2     1254            21               87
##      windspeed_ms visibility_10m dew_point_temperature solar_radiation
## 4923          1.8            222                    19               0
##      rainfall_mm snowfall_cm seasons fine_dust month wday mday yday weekend
## 4923           0           0  Summer        63   Jun  Sun   24  175       1
##      holiday      date2 finedust_alert raining snowing no_sunny
## 4923       1 2018-06-24              0       0       0        1
bike %>% 
  filter(month %in% c("May","Jun", "Jul") & hour ==2) %>% 
  ggplot(aes(date2, bike_cnt)) +
  geom_point() +
  geom_point(data = bike[4923, ], col = "red", size = 3)

Bicycle rental for May, June, and July at 2 a.m.

bike <- bike[-4923,]

The 4923rd object was determined to be an outlier and deleted.

5.2 General feature engineering

bike <- bike %>% 
  mutate(hour = factor(hour),
         month = factor(as.character(month)),
         wday = factor(as.character(wday))) %>% 
  select(-x, -date, -date2)

I turned hour, month, wday variables into a factor type and removed variables that were useless for modeling.

5.3 Data split

I divided the data set into train and test sets.

set.seed(1234)
split <- initial_split(bike, prop = 0.6, strata = bike_cnt)
train <- training(split)
test <- testing(split)


train %>% glimpse
## Rows: 5,257
## Columns: 22
## $ hour                  <fct> 0, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 1~
## $ bike_cnt              <int> 254, 173, 107, 78, 100, 181, 460, 930, 490, 449,~
## $ temperature_c         <dbl> -5.2, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, -7.6, ~
## $ humidity_percent      <int> 37, 39, 40, 36, 37, 35, 38, 37, 27, 23, 25, 26, ~
## $ windspeed_ms          <dbl> 2.2, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5, 1.4~
## $ visibility_10m        <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dew_point_temperature <dbl> -17.6, -17.7, -17.6, -18.6, -18.7, -19.5, -19.3,~
## $ solar_radiation       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, ~
## $ rainfall_mm           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall_cm           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons               <fct> Winter, Winter, Winter, Winter, Winter, Winter, ~
## $ fine_dust             <int> 15, 17, 21, 18, 14, 26, 12, 24, 17, 24, 26, 29, ~
## $ month                 <fct> Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec~
## $ wday                  <fct> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri~
## $ mday                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ yday                  <dbl> 335, 335, 335, 335, 335, 335, 335, 335, 335, 335~
## $ weekend               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ holiday               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ finedust_alert        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ raining               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowing               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_sunny              <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~

5.4 Feature engineering for tree-based models

With recipes package, I transformed some variables to advance the performance of modeling.

recipe_trees <- train %>% 
  recipe(bike_cnt ~.) %>% 
  step_other(all_nominal(), threshold = 0.01) %>% 
  step_nzv(all_nominal()) %>% 
  prep()

trainE <- bake(recipe_trees, new_data = train)
testE <- bake(recipe_trees, new_data = test)

6 Modeling

6.1 Baseline

I created a baseline model to compare. I omitted the code chunks and showed the performance result.

bike_baseline %>% glimpse
## Rows: 8,760
## Columns: 14
## $ x                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1~
## $ date                  <chr> "12/1/2017 0:00", "12/1/2017 1:00", "12/1/2017 2~
## $ hour                  <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14~
## $ bike_cnt              <int> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490,~
## $ temperature_c         <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, ~
## $ humidity_percent      <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, ~
## $ windspeed_ms          <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5~
## $ visibility_10m        <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dew_point_temperature <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5,~
## $ solar_radiation       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ~
## $ rainfall_mm           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall_cm           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons               <chr> "Winter", "Winter", "Winter", "Winter", "Winter"~
## $ fine_dust             <int> 15, 11, 17, 21, 18, 14, 26, 12, 24, 17, 21, 21, ~

Evaluating the performance

xgb_test_pred_baseline <- predict(xgb_baseline_mod, new_data = testE_baseline) %>% 
  bind_cols(test_baseline %>% 
              select(bike_cnt, everything()))

xgb_test_score_baseline <- xgb_test_pred_baseline %>% 
  metrics(bike_cnt, .pred)

.

xgb_test_score_baseline
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     315.   
## 2 rsq     standard       0.764
## 3 mae     standard     189.

The rmse of the baseline is 315, \(R^{2}\) is 0.76, and mae is 189.

6.2 XGBoost

Hyperparameter tuning and model fitting

xgb_spec <- boost_tree(
  mode = "regression",
  trees = 5000,
  
  min_n = 34,
  tree_depth = 10,
  learn_rate = 0.0329,
  loss_reduction = 15.5,
  stop_iter = 30
) %>% 
  set_engine("xgboost", objective = "reg:squarederror")
set.seed(234)
xgb_mod <- 
  xgb_spec %>% 
  fit(bike_cnt ~., data = trainE)

Evaluating the model performance

xgb_test_pred <- predict(xgb_mod, new_data = testE) %>% 
  bind_cols(test %>% 
              select(bike_cnt, everything()))

xgb_test_score <- xgb_test_pred %>% 
  metrics(bike_cnt, .pred)
xgb_test_score
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     182.   
## 2 rsq     standard       0.921
## 3 mae     standard     114.

The results of the XGB model, which performed performance evaluations on the test data after fitting them to the training data, were significantly improved over the baseline model with 182 of RMSE, 0.92 of \(R^{2}\) and 114 of MAE.

xgb_test_pred %>% 
  ggplot(aes(x = bike_cnt, y = .pred)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "blue", size = 1)

7 Feature importance

vip(xgb_mod, num_features = 20)

The variable that had the biggest impact on the rental of 따릉이 was temperature, as EDA found out. Humidity and solar radiation also have a high impact. And certain times of the day (18:00, 19:00, 20:00) seem to have a relatively large impact on forecasting rental volumes compared to other times.

This concludes the prediction analysis of rental volume of public rental bicycles in Seoul.