I carried out a project to predict the hourly rental volume of 따릉이, Seoul's public bicycle sharing service. The data cover one year, from December 2017 to November 2018 (before the coronavirus period), and come from the Seoul open data portal. Weather and fine dust data, which are expected to affect bicycle rentals, were collected from the Korea Meteorological Administration. An XGBoost model was applied: the baseline model achieved an RMSE of 315 and an \(R^{2}\) of 0.76, and the improved model achieved an RMSE of 182 and an \(R^{2}\) of 0.92.
date : date
bike_cnt : number of public bicycles rented each hour
hour : hour of the day
temperature_c : temperature (°C)
humidity_percent : humidity (%)
windspeed_ms : wind speed (m/s)
visibility_10m : visibility (10 m)
dew_point_temperature : dew point temperature (°C)
solar_Radiation : solar radiation
rainfall_mm : rainfall (mm)
snowfall_cm : snowfall (cm)
seasons : season
finedust : fine dust (μg/m³)
library(tidyverse)
library(lubridate)
library(funModeling)
library(gridExtra)
library(scales)
library(rsample)
library(recipes)
library(tidymodels)
library(doParallel)
library(vip)
bike <- read.csv("./dataset/SeoulBikeData_modified.csv")
# Tidy variable names
bike <- janitor::clean_names(bike)
# For building a baseline model later on.
bike_baseline <- bike
sapply(bike, function(x) sum(is.na(x)))
## x date hour
## 0 0 0
## bike_cnt temperature_c humidity_percent
## 0 0 0
## windspeed_ms visibility_10m dew_point_temperature
## 0 0 0
## solar_radiation rainfall_mm snowfall_cm
## 0 0 0
## seasons fine_dust
## 0 0
No missing values are found in the data set.
head(bike)
## x date hour bike_cnt temperature_c humidity_percent windspeed_ms
## 1 1 12/1/2017 0:00 0 254 -5.2 37 2.2
## 2 2 12/1/2017 1:00 1 204 -5.5 38 0.8
## 3 3 12/1/2017 2:00 2 173 -6.0 39 1.0
## 4 4 12/1/2017 3:00 3 107 -6.2 40 0.9
## 5 5 12/1/2017 4:00 4 78 -6.0 36 2.3
## 6 6 12/1/2017 5:00 5 100 -6.4 37 1.5
## visibility_10m dew_point_temperature solar_radiation rainfall_mm snowfall_cm
## 1 2000 -18 0 0 0
## 2 2000 -18 0 0 0
## 3 2000 -18 0 0 0
## 4 2000 -18 0 0 0
## 5 2000 -19 0 0 0
## 6 2000 -19 0 0 0
## seasons fine_dust
## 1 Winter 15
## 2 Winter 11
## 3 Winter 17
## 4 Winter 21
## 5 Winter 18
## 6 Winter 14
glimpse(bike)
## Rows: 8,760
## Columns: 14
## $ x <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1~
## $ date <chr> "12/1/2017 0:00", "12/1/2017 1:00", "12/1/2017 2~
## $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14~
## $ bike_cnt <int> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490,~
## $ temperature_c <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, ~
## $ humidity_percent <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, ~
## $ windspeed_ms <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5~
## $ visibility_10m <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dew_point_temperature <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5,~
## $ solar_radiation <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ~
## $ rainfall_mm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall_cm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons <chr> "Winter", "Winter", "Winter", "Winter", "Winter"~
## $ fine_dust <int> 15, 11, 17, 21, 18, 14, 26, 12, 24, 17, 21, 21, ~
Excluding the row index x, the data frame consists of 12 predictors and the response variable bike_cnt.
summary(bike)
## x date hour bike_cnt temperature_c
## Min. : 1 Length:8760 Min. : 0.0 Min. : 0 Min. :-18
## 1st Qu.:2191 Class :character 1st Qu.: 5.8 1st Qu.: 191 1st Qu.: 4
## Median :4380 Mode :character Median :11.5 Median : 504 Median : 14
## Mean :4380 Mean :11.5 Mean : 705 Mean : 13
## 3rd Qu.:6570 3rd Qu.:17.2 3rd Qu.:1065 3rd Qu.: 22
## Max. :8760 Max. :23.0 Max. :3556 Max. : 39
## humidity_percent windspeed_ms visibility_10m dew_point_temperature
## Min. : 0 Min. :0.0 Min. : 27 Min. :-30.6
## 1st Qu.:42 1st Qu.:0.9 1st Qu.: 940 1st Qu.: -4.7
## Median :57 Median :1.5 Median :1698 Median : 5.1
## Mean :58 Mean :1.7 Mean :1437 Mean : 4.1
## 3rd Qu.:74 3rd Qu.:2.3 3rd Qu.:2000 3rd Qu.: 14.8
## Max. :98 Max. :7.4 Max. :2000 Max. : 27.2
## solar_radiation rainfall_mm snowfall_cm seasons fine_dust
## Min. :0.0 Min. : 0 Min. :0.0 Length:8760 Min. : 2
## 1st Qu.:0.0 1st Qu.: 0 1st Qu.:0.0 Class :character 1st Qu.: 19
## Median :0.0 Median : 0 Median :0.0 Mode :character Median : 29
## Mean :0.6 Mean : 0 Mean :0.1 Mean : 35
## 3rd Qu.:0.9 3rd Qu.: 0 3rd Qu.:0.0 3rd Qu.: 45
## Max. :3.5 Max. :35 Max. :8.8 Max. :304
profiling_num(bike)
## variable mean std_dev variation_coef p_01 p_05 p_25
## 1 x 4380.500 2528.94 0.58 88.6 439.0 2190.8
## 2 hour 11.500 6.92 0.60 0.0 1.0 5.8
## 3 bike_cnt 704.602 645.00 0.92 0.0 22.0 191.0
## 4 temperature_c 12.883 11.94 0.93 -12.7 -7.1 3.5
## 5 humidity_percent 58.226 20.36 0.35 17.0 27.0 42.0
## 6 windspeed_ms 1.725 1.04 0.60 0.1 0.4 0.9
## 7 visibility_10m 1436.826 608.30 0.42 173.0 300.0 940.0
## 8 dew_point_temperature 4.074 13.06 3.21 -24.8 -19.5 -4.7
## 9 solar_radiation 0.569 0.87 1.53 0.0 0.0 0.0
## 10 rainfall_mm 0.149 1.13 7.59 0.0 0.0 0.0
## 11 snowfall_cm 0.075 0.44 5.82 0.0 0.0 0.0
## 12 fine_dust 35.441 24.18 0.68 5.0 8.0 19.0
## p_50 p_75 p_95 p_99 skewness kurtosis iqr range_98
## 1 4380.50 6570.25 8322.0 8672.4 0.00 1.8 4379.50 [88.59, 8672.41]
## 2 11.50 17.25 22.0 23.0 0.00 1.8 11.50 [0, 23]
## 3 504.50 1065.25 2043.0 2526.2 1.15 3.9 874.25 [0, 2526.23]
## 4 13.70 22.50 30.7 35.1 -0.20 2.2 19.00 [-12.741, 35.1]
## 5 57.00 74.00 94.0 97.0 0.06 2.2 32.00 [17, 97]
## 6 1.50 2.30 3.7 4.7 0.89 3.7 1.40 [0.1, 4.7]
## 7 1698.00 2000.00 2000.0 2000.0 -0.70 2.0 1060.00 [173, 2000]
## 8 5.10 14.80 22.4 24.7 -0.37 2.2 19.50 [-24.8, 24.7]
## 9 0.01 0.93 2.6 3.2 1.50 4.1 0.93 [0, 3.17]
## 10 0.00 0.00 0.4 4.0 14.53 287.8 0.00 [0, 4]
## 11 0.00 0.00 0.2 2.5 8.44 96.7 0.00 [0, 2.5]
## 12 29.00 45.00 82.0 119.0 2.03 12.1 26.00 [5, 119]
## range_80
## 1 [876.9, 7884.1]
## 2 [2, 21]
## 3 [64, 1671.1]
## 4 [-3.7, 28]
## 5 [32, 86]
## 6 [0.6, 3.2]
## 7 [436.9, 2000]
## 8 [-15.3, 21]
## 9 [0, 2.051]
## 10 [0, 0]
## 11 [0, 0]
## 12 [12, 68]
bike <- bike %>%
mutate(date = mdy_hm(date, tz = "Asia/Seoul"))
Numeric variables
With histogram plots, I explored the distributions of numeric variables including the response variable.
# Only numeric variables included
numVars <- profiling_num(bike) %>%
.$variable
numVars
## [1] "x" "hour" "bike_cnt"
## [4] "temperature_c" "humidity_percent" "windspeed_ms"
## [7] "visibility_10m" "dew_point_temperature" "solar_radiation"
## [10] "rainfall_mm" "snowfall_cm" "fine_dust"
nn <- list()
for(i in 1:length(numVars)) {
x = bike[, numVars][,i]
nn[[i]] <- ggplot(data = data.frame(x),
aes(x)) +
geom_histogram(bins = 100)
}
grid.arrange(nn[[1]], nn[[2]], nn[[3]],
nn[[4]], nn[[5]], nn[[6]],
nn[[7]], nn[[8]], nn[[9]],
nn[[10]], nn[[11]], nn[[12]], ncol = 3)
Nominal Variables
With bar plots, I explored the distributions of the categorical variables.
# Only categorical variables
catVars <- select(bike, -all_of(numVars)) %>%
colnames()
catVars
## [1] "date" "seasons"
cc <- list()
for(i in 1:length(catVars)) {
x = bike[, catVars][, i]
cc[[i]] <- ggplot(data.frame(x), aes(x)) +
geom_bar()
}
grid.arrange(cc[[1]], cc[[2]])
I found that some variables are highly skewed, especially rainfall_mm and snowfall_cm. This was expected, as it doesn't rain or snow most of the time.
Add variables related to calendar components and Korean holidays
To capture the time series characteristics, I created calendar variables such as month, day of the week, day of the month, and weekend, as well as a holiday variable that marks Korean public holidays and weekends.
# Calendar components
bike <- bike %>%
mutate(month = month(date, label = T),
wday = wday(date, label = T),
mday = mday(date),
yday = yday(date),
weekend = ifelse(wday %in% c("Sat", "Sun"), 1, 0),
hour = hour(date))
# Add Korean National Holidays
holiday <- c(
"20171225", "20180101", "20180215", "20180216", "20180301", "20180507",
"20180522", "20180606", "20180613", "20180815", "20180924",
"20180925", "20180926", "20181003", "20181009"
)
holiday <- ymd(holiday)
bike <- bike %>%
mutate(holiday = NULL,
date2 = as.Date(date, tz = "Asia/Seoul"),
holiday = case_when(
weekend == 1 ~ 1,
date2 %in% holiday ~ 1,
TRUE ~ 0))
Korean public holidays such as National Liberation Day, Christmas, and Children's Day, together with Saturdays and Sundays, are flagged by the holiday variable.
Cor_data <- round(cor(bike[, numVars]), 2)
Cor_high <- Cor_data[, "bike_cnt"] %>%
abs() %>%
sort(decreasing = T) %>%
names()
Cor_data <- Cor_data[Cor_high, Cor_high]
Cor_data[lower.tri(Cor_data)] <- NA
melted_cor <- reshape2::melt(Cor_data, na.rm = T)
ggplot(melted_cor, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "skyblue",
high = "red",
mid = "white",
midpoint = 0,
limit = c(-1, 1),
name = "Peason\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, vjust = 1 , hjust = 1, size = 9)) +
coord_fixed() +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
theme(panel.grid.major = element_blank()) +
labs(x = NULL, y = NULL)
Before diving into the relationships between variables, I checked the Pearson correlations among the numeric variables. Temperature and hour appear to have the strongest correlations with bike rental volume.
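For reference, the values in the heatmap are the standard sample Pearson correlation coefficients:
\[
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}
\]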
bike %>%
ggplot(aes(hour, bike_cnt)) +
geom_boxplot(aes(group = hour)) +
geom_smooth(se = F)
I plotted the distribution of hourly bicycle rental volume with boxplots. Rentals are relatively high around 8 a.m. and 7 p.m. (19:00). An observation (x = 4923, date = 2018-06-24) at 2 a.m. appeared to be an outlier, so I examined it in more detail.
bike %>%
filter(hour == 2 & bike_cnt > 1000) %>%
select(x, date, hour, bike_cnt)
## x date hour bike_cnt
## 1 4923 2018-06-24 02:00:00 2 1254
The distribution of all bicycle rental counts at 2 a.m.:
bike %>%
filter(hour == 2) %>%
ggplot(aes(bike_cnt)) +
geom_histogram(binwidth = 10)
Compared with rentals at 2 a.m. on other days, the rental count at 2 a.m. on June 24, 2018 is unusually high and is suspected to be abnormal.
bike %>%
filter(date2 == ymd("20180624")) %>%
select(date, hour, bike_cnt)
## date hour bike_cnt
## 1 2018-06-24 00:00:00 0 849
## 2 2018-06-24 01:00:00 1 545
## 3 2018-06-24 02:00:00 2 1254
## 4 2018-06-24 03:00:00 3 644
## 5 2018-06-24 04:00:00 4 286
## 6 2018-06-24 05:00:00 5 235
## 7 2018-06-24 06:00:00 6 258
## 8 2018-06-24 07:00:00 7 369
## 9 2018-06-24 08:00:00 8 561
## 10 2018-06-24 09:00:00 9 734
## 11 2018-06-24 10:00:00 10 763
## 12 2018-06-24 11:00:00 11 870
## 13 2018-06-24 12:00:00 12 1042
## 14 2018-06-24 13:00:00 13 1062
## 15 2018-06-24 14:00:00 14 1064
## 16 2018-06-24 15:00:00 15 1213
## 17 2018-06-24 16:00:00 16 1471
## 18 2018-06-24 17:00:00 17 1754
## 19 2018-06-24 18:00:00 18 2094
## 20 2018-06-24 19:00:00 19 2290
## 21 2018-06-24 20:00:00 20 2387
## 22 2018-06-24 21:00:00 21 2153
## 23 2018-06-24 22:00:00 22 1801
## 24 2018-06-24 23:00:00 23 1329
Bicycle rentals on June 24, 2018 (Sunday):
bike %>%
filter(date2 == ymd("20180624")) %>%
ggplot(aes(hour, bike_cnt)) +
geom_line() +
geom_point() +
geom_point(data = bike %>%
filter(date2 == ymd("20180624") &
hour == 2), color = "red") +
geom_line(data = bike %>%
filter(date2 == ymd("20180624") &
hour == 2), color = "red")
Bicycle rentals on June 17, 2018 (Sunday), one week earlier:
bike %>%
filter(date2 == ymd("20180617")) %>%
ggplot(aes(hour, bike_cnt)) +
geom_line() +
geom_point()
Compared to the hourly rental pattern one week earlier, the 2 a.m. value appears abnormal. I treat it as an outlier and deal with it later on.
Pattern by day of the week
bike %>%
ggplot(aes(hour, bike_cnt, color = wday)) +
geom_smooth(se = F, size = 1) +
scale_color_discrete("")
The usage pattern of 따릉이 differs between weekends and weekdays. On weekdays, usage peaks during rush hours.
Comparison of Holidays and Weekdays excluding Weekends (Saturday and Sunday)
bike %>%
filter(!wday %in% c("Sat", "Sun")) %>%
mutate(holiday = factor(holiday)) %>%
ggplot(aes(hour, bike_cnt, color = holiday)) +
geom_smooth(se = F, size = 1)
Even with weekends excluded, holidays show a rental pattern similar to the weekend pattern above.
bike %>%
ggplot(aes(temperature_c, bike_cnt)) +
geom_point(alpha = 0.3) +
geom_smooth(se = F)
Bicycle rental volume increases with temperature, but declines once the temperature rises above a certain point.
t <- bike %>%
mutate(jitter_times = date + minutes(round(runif(nrow(bike), min = 0, max = 59))),
times = as.POSIXct(hms::parse_hm(format(jitter_times, format = "%H:%M")))) %>%
ggplot(aes(x = times, y = bike_cnt, color = temperature_c)) +
geom_point() +
scale_colour_gradientn("Temp (°C)", colours=c("#5e4fa2", "#3288bd", "#66c2a5", "#abdda4", "#e6f598", "#fee08b", "#fdae61", "#f46d43", "#d53e4f", "#9e0142")) +
scale_x_datetime(breaks = date_breaks("4 hours"), labels = date_format("%H:%M")) +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
t
Temperature over the course of the day and the resulting bicycle rental patterns were visualized for the whole year and by season. The patterns clearly differ by temperature.
Rent by Seasonal Temperature and Time
t + facet_wrap(~ seasons, ncol = 2)
bike %>%
ggplot(aes(humidity_percent, bike_cnt)) +
geom_point(alpha = 0.3) +
geom_smooth(se = F)
As humidity increases, bicycle rental volume decreases.
bike %>%
ggplot(aes(fine_dust, bike_cnt)) +
geom_point(alpha = 0.3)
Fine dust and bicycle rental volume do not appear strongly related overall, so I looked more closely at the hours when fine dust is severe (100 or higher).
# Add a finedust_alert variable that is 1 if fine_dust is 100 or higher
bike <- bike %>%
mutate(finedust_alert = ifelse(fine_dust >= 100, 1, 0))
bike %>%
count(month, finedust_alert) %>%
filter(finedust_alert == 1)
## month finedust_alert n
## 1 Jan 1 19
## 2 Feb 1 11
## 3 Mar 1 16
## 4 Apr 1 29
## 5 May 1 25
## 6 Jun 1 2
## 7 Oct 1 1
## 8 Nov 1 39
## 9 Dec 1 34
The finedust_alert variable flags hours when the fine dust level is 100 or higher. November and April are among the months with the most hours of severe fine dust.
bike %>%
filter(month == "Apr" | month == "Nov") %>%
mutate(finedust_alert = factor(finedust_alert)) %>%
ggplot(aes(finedust_alert, bike_cnt, fill = finedust_alert)) +
geom_bar(stat = "summary", fun = mean)
Comparing the average rental volume in April and November between hours with severe fine dust and hours without, rentals are lower on average when fine dust is severe.
bike %>%
filter(rainfall_mm > 0) %>%
ggplot(aes(rainfall_mm, bike_cnt)) +
geom_point(alpha = .4)
The more rainfall, the lower the bicycle rental.
bike <- bike %>%
mutate(raining = ifelse(rainfall_mm > 0, 1, 0))
A raining indicator variable was added, marking hours with rainfall.
bike %>%
filter(snowfall_cm > 0) %>%
ggplot(aes(snowfall_cm, bike_cnt)) +
geom_point(alpha = .4)
The higher the snowfall, the lower the bicycle rental.
bike <- bike %>%
mutate(snowing = ifelse(snowfall_cm > 0, 1, 0))
A snowing indicator variable was added, marking hours with snowfall.
bike %>%
filter(snowing == 1) %>%
count(month)
## month n
## 1 Jan 192
## 2 Feb 38
## 3 Nov 51
## 4 Dec 162
It snowed most frequently in January and December. I compared bicycle rentals during snowy hours with those during non-snowy hours in these two months.
bike %>%
filter(month %in% c("Jan", "Dec")) %>%
mutate(snowing = factor(snowing)) %>%
ggplot(aes(month, bike_cnt, fill = snowing)) +
geom_bar(stat = "summary", fun = mean, position = "dodge")
Rental volume during snowy hours is markedly lower than during non-snowy hours.
The relationship between bicycle rental volume and solar radiation.
bike %>%
filter(solar_radiation > 0) %>%
ggplot(aes(solar_radiation, bike_cnt)) +
geom_point(alpha = 0.3) +
geom_smooth(se = F)
I added a no_sunny variable for hours with zero solar radiation (night, cloudy hours, etc.) and compared those hours with the rest.
bike <- bike %>%
mutate(no_sunny = ifelse(solar_radiation == 0, 1, 0))
bike %>%
mutate(no_sunny = factor(no_sunny)) %>%
ggplot(aes(hour, bike_cnt, color = no_sunny)) +
geom_smooth(se = F)
Bicycle rental volume is lower on average during hours with no solar radiation.
I plotted the 2 a.m. rental pattern for May, June, and July to double-check that the 4,923rd observation was an outlier, and then removed it.
bike[4923, ]
## x date hour bike_cnt temperature_c humidity_percent
## 4923 4923 2018-06-24 02:00:00 2 1254 21 87
## windspeed_ms visibility_10m dew_point_temperature solar_radiation
## 4923 1.8 222 19 0
## rainfall_mm snowfall_cm seasons fine_dust month wday mday yday weekend
## 4923 0 0 Summer 63 Jun Sun 24 175 1
## holiday date2 finedust_alert raining snowing no_sunny
## 4923 1 2018-06-24 0 0 0 1
bike %>%
filter(month %in% c("May", "Jun", "Jul") & hour == 2) %>%
ggplot(aes(date2, bike_cnt)) +
geom_point() +
geom_point(data = bike[4923, ], col = "red", size = 3)
Bicycle rental for May, June, and July at 2 a.m.
bike <- bike[-4923,]
The 4,923rd observation was determined to be an outlier and removed.
bike <- bike %>%
mutate(hour = factor(hour),
month = factor(as.character(month)),
wday = factor(as.character(wday))) %>%
select(-x, -date, -date2)
I converted the hour, month, and wday variables to factors and removed variables not needed for modeling (x, date, date2).
I divided the data set into train and test sets.
set.seed(1234)
split <- initial_split(bike, prop = 0.6, strata = bike_cnt)
train <- training(split)
test <- testing(split)
train %>% glimpse
## Rows: 5,257
## Columns: 22
## $ hour <fct> 0, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 1~
## $ bike_cnt <int> 254, 173, 107, 78, 100, 181, 460, 930, 490, 449,~
## $ temperature_c <dbl> -5.2, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, -7.6, ~
## $ humidity_percent <int> 37, 39, 40, 36, 37, 35, 38, 37, 27, 23, 25, 26, ~
## $ windspeed_ms <dbl> 2.2, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5, 1.4~
## $ visibility_10m <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dew_point_temperature <dbl> -17.6, -17.7, -17.6, -18.6, -18.7, -19.5, -19.3,~
## $ solar_radiation <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, ~
## $ rainfall_mm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall_cm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons <fct> Winter, Winter, Winter, Winter, Winter, Winter, ~
## $ fine_dust <int> 15, 17, 21, 18, 14, 26, 12, 24, 17, 24, 26, 29, ~
## $ month <fct> Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec~
## $ wday <fct> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri~
## $ mday <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ yday <dbl> 335, 335, 335, 335, 335, 335, 335, 335, 335, 335~
## $ weekend <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ finedust_alert <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ raining <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_sunny <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
With the recipes package, I transformed some variables to improve modeling performance.
recipe_trees <- train %>%
recipe(bike_cnt ~.) %>%
step_other(all_nominal(), threshold = 0.01) %>%
step_nzv(all_nominal()) %>%
prep()
trainE <- bake(recipe_trees, new_data = train)
testE <- bake(recipe_trees, new_data = test)
I created a baseline model for comparison, built on the raw data saved earlier as bike_baseline. The original code chunks are omitted; a sketch of what they might look like follows the glimpse below, and then the performance result is shown.
bike_baseline %>% glimpse
## Rows: 8,760
## Columns: 14
## $ x <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1~
## $ date <chr> "12/1/2017 0:00", "12/1/2017 1:00", "12/1/2017 2~
## $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14~
## $ bike_cnt <int> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490,~
## $ temperature_c <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, ~
## $ humidity_percent <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, ~
## $ windspeed_ms <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5~
## $ visibility_10m <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dew_point_temperature <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5,~
## $ solar_radiation <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ~
## $ rainfall_mm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall_cm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons <chr> "Winter", "Winter", "Winter", "Winter", "Winter"~
## $ fine_dust <int> 15, 11, 17, 21, 18, 14, 26, 12, 24, 17, 21, 21, ~
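Since the baseline code chunk is not reproduced here, below is a minimal sketch of what it could look like, assuming the same 60/40 stratified split and the same light recipe as the main model but none of the engineered features, plus an untuned XGBoost spec. The object names (train_baseline, test_baseline, testE_baseline, xgb_baseline_mod) are chosen to match the evaluation code that follows; everything else is an assumption rather than the original code.
# Sketch of the omitted baseline chunk (assumed, not the original code)
# Drop the row index and the raw date string; keep the other raw columns as-is
bike_baseline <- bike_baseline %>%
  select(-x, -date)
set.seed(1234)
split_baseline <- initial_split(bike_baseline, prop = 0.6, strata = bike_cnt)
train_baseline <- training(split_baseline)
test_baseline <- testing(split_baseline)
# Same style of recipe as the main model, without the engineered features
recipe_baseline <- train_baseline %>%
  recipe(bike_cnt ~ .) %>%
  step_other(all_nominal(), threshold = 0.01) %>%
  step_nzv(all_nominal()) %>%
  prep()
trainE_baseline <- bake(recipe_baseline, new_data = train_baseline)
testE_baseline <- bake(recipe_baseline, new_data = test_baseline)
# Untuned XGBoost with mostly default hyperparameters as the baseline
xgb_baseline_mod <- boost_tree(mode = "regression", trees = 1000) %>%
  set_engine("xgboost", objective = "reg:squarederror") %>%
  fit(bike_cnt ~ ., data = trainE_baseline)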
Evaluating the performance
xgb_test_pred_baseline <- predict(xgb_baseline_mod, new_data = testE_baseline) %>%
bind_cols(test_baseline %>%
select(bike_cnt, everything()))
xgb_test_score_baseline <- xgb_test_pred_baseline %>%
metrics(bike_cnt, .pred)
xgb_test_score_baseline
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 315.
## 2 rsq standard 0.764
## 3 mae standard 189.
The baseline RMSE is 315, \(R^{2}\) is 0.76, and MAE is 189.
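For reference, the three values reported by metrics() follow the usual definitions, where \(y_i\) is the observed rental count and \(\hat{y}_i\) the prediction (yardstick's rsq is the squared correlation between the two):
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad
R^{2} = \operatorname{cor}(y, \hat{y})^{2}.
\]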
Hyperparameter tuning and model fitting
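The hyperparameter values hard-coded in the spec below (min_n = 34, tree_depth = 10, learn_rate = 0.0329, loss_reduction = 15.5) look like the outcome of a tuning run whose code is not shown. As a rough sketch only, with the grid size, fold count, and choice of tuned parameters as my own assumptions, such a search could be set up with tune() placeholders and cross-validation:
# Sketch of a possible tuning run (assumed; the original tuning code is not shown)
xgb_tune_spec <- boost_tree(
  mode = "regression",
  trees = 5000,
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  stop_iter = 30
) %>%
  set_engine("xgboost", objective = "reg:squarederror")
set.seed(234)
folds <- vfold_cv(trainE, v = 5, strata = bike_cnt)
# Run the grid search in parallel over the cross-validation folds
doParallel::registerDoParallel()
xgb_tuned <- tune_grid(
  workflow() %>%
    add_model(xgb_tune_spec) %>%
    add_formula(bike_cnt ~ .),
  resamples = folds,
  grid = 30,
  metrics = metric_set(rmse, rsq, mae)
)
show_best(xgb_tuned, metric = "rmse")
The best combination from such a search would then be written into the final spec, which is what the fixed values below represent.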
xgb_spec <- boost_tree(
mode = "regression",
trees = 5000,
min_n = 34,
tree_depth = 10,
learn_rate = 0.0329,
loss_reduction = 15.5,
stop_iter = 30
) %>%
set_engine("xgboost", objective = "reg:squarederror")
set.seed(234)
xgb_mod <-
xgb_spec %>%
fit(bike_cnt ~., data = trainE)
Evaluating the model performance
xgb_test_pred <- predict(xgb_mod, new_data = testE) %>%
bind_cols(test %>%
select(bike_cnt, everything()))
xgb_test_score <- xgb_test_pred %>%
metrics(bike_cnt, .pred)
xgb_test_score
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 182.
## 2 rsq standard 0.921
## 3 mae standard 114.
After fitting the XGBoost model on the training data and evaluating it on the test data, the results improved substantially over the baseline model: an RMSE of 182, an \(R^{2}\) of 0.92, and an MAE of 114.
xgb_test_pred %>%
ggplot(aes(x = bike_cnt, y = .pred)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "blue", size = 1)
vip(xgb_mod, num_features = 20)
The variable with the biggest impact on 따릉이 rentals was temperature, as the EDA suggested. Humidity and solar radiation also have a large impact, and certain hours of the day (18:00, 19:00, 20:00) contribute relatively more to forecasting rental volume than other hours.
This concludes the analysis of predicting the rental volume of Seoul's public bicycles.