This project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, providing a structured framework for our analysis. CRISP-DM guides us through the stages of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. By adhering to this methodology, we ensure a comprehensive approach to predicting bike sharing demand.
A bike-sharing system is a transport service in which bicycles are rented to individuals for short periods at a low cost. It allows users to pick up a bike at one bike station and return it at another station within the same network. Bike sharing has gained traction in recent years as part of initiatives to promote green commuting, and it has a significant influence on a city's transportation, public health and environment.
The earliest bike-sharing scheme started in 1965 with 50 bicycles, and bike sharing has expanded rapidly over the last decade thanks to advances in technology. As of August 2021, there were 10 million bikes and over 3,000 bike-share systems across the world. According to Statista, revenue in bike sharing is expected to reach 9.21 billion USD in 2023 and surge to 13.53 billion USD within 4 years.
This great opportunity in bike sharing makes predicting demand essential for keeping the business working and growing. In this research, bike-sharing demand is predicted using machine learning algorithms. In addition, the relationships between bike-sharing demand and other factors such as time, weather and climate are explored.
options(warn = -1) # Suppress warning messages in the output
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(readr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
## here() starts at C:/Users/Yi Kwong/Documents
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(Tmisc)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(purrr)
The dataset, titled “Bike Sharing Dataset”, was obtained from Kaggle at https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset. Published 4 years ago, it contains the hourly count of rental bikes for the years 2011 and 2012 in the Capital Bikeshare system. The purpose of this dataset is to predict bike-sharing demand from the features it records, for example temperature, wind speed and humidity.
df <- read_csv("C:/Users/Yi Kwong/Documents/bike_sharing.csv")
## Rows: 17379 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): instant, season, yr, mnth, hr, holiday, weekday, workingday, weat...
## date (1): dteday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df) %>%
kable() %>%
kable_styling()
| instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0000 | 3 | 13 | 16 |
| 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 8 | 32 | 40 |
| 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 5 | 27 | 32 |
| 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 3 | 10 | 13 |
| 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 0 | 1 | 1 |
| 6 | 2011-01-01 | 1 | 0 | 1 | 5 | 0 | 6 | 0 | 2 | 0.24 | 0.2576 | 0.75 | 0.0896 | 0 | 1 | 1 |
glimpse(df)
## Rows: 17,379
## Columns: 17
## $ instant <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01…
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hr <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weekday <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weathersit <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
## $ temp <dbl> 0.24, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.20, 0.24, 0.32,…
## $ atemp <dbl> 0.2879, 0.2727, 0.2727, 0.2879, 0.2879, 0.2576, 0.2727, 0.2…
## $ hum <dbl> 0.81, 0.80, 0.80, 0.75, 0.75, 0.75, 0.80, 0.86, 0.75, 0.76,…
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0896, 0.0000, 0.0…
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
## $ cnt <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
# Class, type and dimensions of the data object
class(df)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
typeof(df)
## [1] "list"
dim(df)
## [1] 17379 17
#Structure of data
str(df)
## spc_tbl_ [17,379 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ instant : num [1:17379] 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Date[1:17379], format: "2011-01-01" "2011-01-01" ...
## $ season : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : num [1:17379] 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: num [1:17379] 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : num [1:17379] 16 40 32 13 1 1 2 3 8 14 ...
## - attr(*, "spec")=
## .. cols(
## .. instant = col_double(),
## .. dteday = col_date(format = ""),
## .. season = col_double(),
## .. yr = col_double(),
## .. mnth = col_double(),
## .. hr = col_double(),
## .. holiday = col_double(),
## .. weekday = col_double(),
## .. workingday = col_double(),
## .. weathersit = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. hum = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. cnt = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
#data summary
summary(df)
## instant dteday season yr
## Min. : 1 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 1st Qu.:2011-07-04 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 Median :2012-01-02 Median :3.000 Median :1.0000
## Mean : 8690 Mean :2012-01-02 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 3rd Qu.:2012-07-02 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
# Attributes of the data
length(df)
## [1] 17
names(df)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "casual"
## [16] "registered" "cnt"
instant: record index
dteday: date
season: season (1:spring, 2:summer, 3:fall, 4:winter)
yr: year (0: 2011, 1:2012)
mnth: month ( 1 to 12)
hr: hour (0 to 23)
holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday: day of the week
workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit:
1: Clear, Few clouds, Partly cloudy
2: Mist and Cloudy, Mist and Broken clouds, Mist and Few clouds, Mist
3: Light Snow, Light Rain and Thunderstorm and Scattered clouds, Light Rain and Scattered clouds
4: Heavy Rain and Ice Pallets and Thunderstorm and Mist, Snow and Fog
temp: Normalized temperature in Celsius. The values are divided by 41 (max)
atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
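Several of these fields are integer-coded or normalized. Purely as a readability aid, the short sketch below maps season to a labeled factor and recovers the raw units using the divisors from the description above (the df_labeled, season_f and *_raw names are illustrative, not part of the original dataset):
# Illustrative only: labeled season factor and de-normalized measurement columns
df_labeled <- df %>%
  mutate(
    season_f  = factor(season, levels = 1:4,
                       labels = c("spring", "summer", "fall", "winter")),
    temp_raw  = temp * 41,      # temperature back in degrees Celsius
    atemp_raw = atemp * 50,     # feeling temperature in degrees Celsius
    hum_raw   = hum * 100,      # relative humidity in percent
    wind_raw  = windspeed * 67  # wind speed in original units
  )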
Total Number of Rides over date
df$dteday <- as.Date(df$dteday)
dt_aggregate_data <-
df %>%
group_by(dteday) %>%
summarize(rides = sum(cnt))
ggplot(dt_aggregate_data, aes(x = dteday, y = rides)) +
geom_line() +
geom_smooth(se = FALSE) + # Add a smoothed line
labs(x = "Date", y = "Total Number of Rides") +
ggtitle("Total Number of Rides over date")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The plot shows the relationship between the total number of rides (cnt) and date. We can observe the number of rides from day to day, and, in general, the total number of rides increases over the period.
Trend of Total Number of Registered and Casual Rides over Date
dt_aggregate_data2 <-
df %>%
group_by(dteday) %>%
summarize(registered = sum(registered),
unregistered = sum(casual))
ggplot(dt_aggregate_data2, aes(x = dteday)) +
geom_bar(aes(y = registered, fill = "Registered"), stat = "identity", width = 0.5) +
geom_bar(aes(y = unregistered, fill = "Casual"), stat = "identity", width = 0.5) +
geom_smooth(aes(y = registered, color = "Registered"), se = FALSE, color="blue") +
geom_smooth(aes(y = unregistered, color = "Casual"), se = FALSE, color="red") +
labs(x = "Date", y = "Total Number of Rides", fill = "User Type") +
scale_fill_manual(values = c("Registered" = "blue", "Casual" = "red")) +
ggtitle("Total Number of registered and casual over date")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The plot shows the relationship between the total numbers of registered and casual rides and date. Generally, both are increasing, but registered rides grow faster than casual rides. The gap between them widens over the period, indicating that the share of registered users is getting higher.
Histogram
histogram <- hist(df$cnt, breaks = 25, ylab = 'Frequency of Rental', xlab = 'Total Bike Rental Count', main = 'Distribution of Total Bike Rental Count', col = 'indianred' )
xfit <- seq(min(df$cnt),max(df$cnt), length = 50)
yfit <- dnorm(xfit, mean =mean(df$cnt),sd=sd(df$cnt))
yfit <- yfit*diff(histogram$mids[1:2])*length(df$cnt)
lines(xfit,yfit, col='yellow', lwd= 3)
From the histogram above, the distribution of the total number of rented bikes appears positively skewed: the median is less than the mean, because the mean is pulled to the right by the long tail. The data also have outliers on the right side, extreme values that deviate significantly from the majority of the data and inflate the mean relative to the median.
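To back the visual impression with numbers, a small check computes the sample skewness by hand (so no extra package is needed at this point); it should come out positive, consistent with the summary above, where the mean (189.5) exceeds the median (142):
# Moment estimator of skewness: mean of cubed standardized values
z <- (df$cnt - mean(df$cnt)) / sd(df$cnt)
mean(z^3)  # > 0 indicates a right (positive) skew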
Trend of Total Bike Rentals on Seasonal Factor
boxplot(df$cnt ~df$season,
data = df,
main = "Total Bike Rentals Vs Season",
xlab = "Season",
ylab = "Total Bike Rentals",
col = c("darkblue", "blue", "lightblue", "skyblue"))
The plot shows the relationship between total bike rentals (cnt) and season. The average number of bike rentals is highest during summer and fall.
Trend of Total Bike Rentals on Days of the Week Factor
boxplot(df$cnt ~df$weekday,
data = df,
main = "Total Bike Rentals Vs Days of the Week",
xlab = "Days of the Week",
ylab = "Total Bike Rentals",
col = c("red", "yellow", "purple", "blue", "green","orange"))
The plot shows the relationship between total bike rentals (cnt) and day of the week. The median number of bike rentals is almost the same each day, but slightly higher on weekdays.
Trend of Total Bike Rentals on Time Factor
boxplot(df$cnt ~df$hr,
data = df,
main = "Total Bike Rentals Vs Time",
xlab = "Time",
ylab = "Total Bike Rentals",
col = c("pink"))
The plot shows the relationship between total bike rentals (cnt) and time of day. The peak numbers of bike rentals occur at 8:00 am and 5:00 pm, the typical commuting hours.
Trend of Total Bike Rentals on Weather Factor
boxplot(df$cnt ~df$weathersit,
data = df,
main = "Total Bike Rentals Vs Weather",
xlab = "Weather",
ylab = "Total Bike Rentals",
col = c("lightgreen"))
The plot shows the relationship between total bike rentals (cnt) and weather. The median number of bike rentals on clear, fine days is the highest among the weather conditions.
Classify the cnt column into a new column named classes
df_classes <- df
# Compute the quartile breakpoints once rather than on every loop iteration
q25 <- quantile(df_classes$cnt, probs = 0.25)
q50 <- quantile(df_classes$cnt, probs = 0.50)
q75 <- quantile(df_classes$cnt, probs = 0.75)
for (i in 1:nrow(df_classes)) {
  # Check the conditions and assign the corresponding class
  if (df_classes$cnt[i] < q25) {
    df_classes$classes[i] <- 1
  } else if (df_classes$cnt[i] < q50) {
    df_classes$classes[i] <- 2
  } else if (df_classes$cnt[i] < q75) {
    df_classes$classes[i] <- 3
  } else {
    df_classes$classes[i] <- 4
  }
}
df_classes$classes <- as.factor(df_classes$classes)
Classes are assigned to the total number of rides based on its quartiles, dividing the rides into 4 classes: class 1 is the bottom 25%, class 2 is 25% to 50%, class 3 is 50% to 75%, and class 4 is the top 25% of the number of rides.
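The same assignment can also be written as a single vectorized call; a sketch of an equivalent cut() version, producing the same factor as the loop above:
# Bin cnt by its quartiles; right = FALSE gives [lower, upper) intervals,
# matching the < / >= conditions in the loop
q <- quantile(df_classes$cnt, probs = c(0.25, 0.5, 0.75))
df_classes$classes <- cut(df_classes$cnt,
                          breaks = c(-Inf, q, Inf),
                          labels = 1:4, right = FALSE)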
Count of Classes
#Plot bar chart for count of classes
classes_aggregate_data <- df_classes %>%
group_by(classes)%>%
summarize(count = n())
ggplot(classes_aggregate_data, aes(x = as.factor(classes), y = count, fill = classes)) +
geom_col(position = "dodge", width = 0.5) +
labs(x = "Classes", y = "Count", fill = "Class") +
ggtitle("Count of classes")
classes_aggregate_data2 <- df_classes %>%
group_by(season, classes) %>%
summarize(count = n()) %>%
mutate(classes = factor(classes),
season=factor(season))
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
The plot shows the count of each class. The classes are distributed almost uniformly, as expected for quartile-based classes.
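A quick tabulation backs this up; any small deviations from a perfect 25% split come from ties at the quartile boundaries:
# Class sizes and shares of the total
table(df_classes$classes)
round(prop.table(table(df_classes$classes)), 3)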
Count of Classes by Season
#Plot bar chart for count of classes by season
ggplot(classes_aggregate_data2, aes(x = interaction(classes, season), y = count, fill = classes)) +
geom_col(position = "dodge", width = 0.5) +
labs(x = "Classes and Season", y = "Count", fill = "Class") +
scale_fill_manual(values = c("1" = "blue", "2" = "red", "3" = "green", "4" = "orange")) +
ggtitle("Count of classes by season")
The plot shows the count of classes by season. It is clear that classes 3 and 4 occur less in spring compared to the other 3 seasons.
Average Feeling Temperature by Class
#Plot bar chart of average feeling temperature by class
classes_aggregate_data3 <-
df_classes %>%
group_by(classes) %>%
summarize(atemp = mean(atemp))
ggplot(classes_aggregate_data3, aes(x = classes, y = atemp, fill = classes)) +
geom_col(position = "dodge", width = 0.5) +
labs(x = "Classes", y = "Atemp", fill = "Class") +
scale_fill_manual(values = c("1" = "blue", "2" = "red", "3" = "green", "4" = "orange")) +
ggtitle("Temperature of classes")
The plot shows the relationship between the classes and the average feeling temperature (atemp). The average temperature is highest in class 4.
#Measure correlation between all variables:
df_numeric = select_if(df,is.numeric)
data_corr<-cor(df_numeric)
data_corr
## instant season yr mnth hr
## instant 1.000000000 0.404045721 0.866014049 0.489163831 -0.004774815
## season 0.404045721 1.000000000 -0.010742486 0.830385892 -0.006116901
## yr 0.866014049 -0.010742486 1.000000000 -0.010472929 -0.003867005
## mnth 0.489163831 0.830385892 -0.010472929 1.000000000 -0.005771909
## hr -0.004774815 -0.006116901 -0.003867005 -0.005771909 1.000000000
## holiday 0.014723494 -0.009584526 0.006691617 0.018430325 0.000479136
## weekday 0.001356820 -0.002335350 -0.004484851 0.010400061 -0.003497739
## workingday -0.003415559 0.013743102 -0.002196005 -0.003476922 0.002284998
## weathersit -0.014197603 -0.014523552 -0.019156853 0.005399522 -0.020202528
## temp 0.136178007 0.312025237 0.040913380 0.201691494 0.137603494
## atemp 0.137614610 0.319379811 0.039221595 0.208096131 0.133749965
## hum 0.009576774 0.150624745 -0.083546421 0.164411443 -0.276497828
## windspeed -0.074504540 -0.149772751 -0.008739533 -0.135386323 0.137251568
## casual 0.158295401 0.120206447 0.142778528 0.068457301 0.301201730
## registered 0.282045777 0.174225633 0.253684310 0.122272967 0.374140710
## cnt 0.278378694 0.178055731 0.250494899 0.120637760 0.394071498
## holiday weekday workingday weathersit temp
## instant 0.014723494 0.001356820 -0.003415559 -0.014197603 0.136178007
## season -0.009584526 -0.002335350 0.013743102 -0.014523552 0.312025237
## yr 0.006691617 -0.004484851 -0.002196005 -0.019156853 0.040913380
## mnth 0.018430325 0.010400061 -0.003476922 0.005399522 0.201691494
## hr 0.000479136 -0.003497739 0.002284998 -0.020202528 0.137603494
## holiday 1.000000000 -0.102087791 -0.252471370 -0.017036113 -0.027340477
## weekday -0.102087791 1.000000000 0.035955071 0.003310740 -0.001794927
## workingday -0.252471370 0.035955071 1.000000000 0.044672224 0.055390317
## weathersit -0.017036113 0.003310740 0.044672224 1.000000000 -0.102639936
## temp -0.027340477 -0.001794927 0.055390317 -0.102639936 1.000000000
## atemp -0.030972737 -0.008820945 0.054667235 -0.105563108 0.987672139
## hum -0.010588465 -0.037158268 0.015687512 0.418130329 -0.069881391
## windspeed 0.003987632 0.011501545 -0.011829789 0.026225652 -0.023125262
## casual 0.031563628 0.032721415 -0.300942486 -0.152627885 0.459615646
## registered -0.047345424 0.021577888 0.134325791 -0.120965520 0.335360849
## cnt -0.030927303 0.026899860 0.030284368 -0.142426138 0.404772276
## atemp hum windspeed casual registered
## instant 0.137614610 0.009576774 -0.074504540 0.15829540 0.28204578
## season 0.319379811 0.150624745 -0.149772751 0.12020645 0.17422563
## yr 0.039221595 -0.083546421 -0.008739533 0.14277853 0.25368431
## mnth 0.208096131 0.164411443 -0.135386323 0.06845730 0.12227297
## hr 0.133749965 -0.276497828 0.137251568 0.30120173 0.37414071
## holiday -0.030972737 -0.010588465 0.003987632 0.03156363 -0.04734542
## weekday -0.008820945 -0.037158268 0.011501545 0.03272142 0.02157789
## workingday 0.054667235 0.015687512 -0.011829789 -0.30094249 0.13432579
## weathersit -0.105563108 0.418130329 0.026225652 -0.15262788 -0.12096552
## temp 0.987672139 -0.069881391 -0.023125262 0.45961565 0.33536085
## atemp 1.000000000 -0.051917696 -0.062336043 0.45408007 0.33255864
## hum -0.051917696 1.000000000 -0.290104895 -0.34702809 -0.27393312
## windspeed -0.062336043 -0.290104895 1.000000000 0.09028678 0.08232085
## casual 0.454080065 -0.347028093 0.090286775 1.00000000 0.50661770
## registered 0.332558635 -0.273933118 0.082320847 0.50661770 1.00000000
## cnt 0.400929304 -0.322910741 0.093233784 0.69456408 0.97215073
## cnt
## instant 0.27837869
## season 0.17805573
## yr 0.25049490
## mnth 0.12063776
## hr 0.39407150
## holiday -0.03092730
## weekday 0.02689986
## workingday 0.03028437
## weathersit -0.14242614
## temp 0.40477228
## atemp 0.40092930
## hum -0.32291074
## windspeed 0.09323378
## casual 0.69456408
## registered 0.97215073
## cnt 1.00000000
corrplot(data_corr, method="color")
# Check for missing values
sum(is.na(df))
## [1] 0
# Create new dataset excluding dteday, instant, casual and registered variables for regression
df<-subset(df,select=-c(dteday,instant,casual,registered))
head(df,5)
## # A tibble: 5 × 13
## season yr mnth hr holiday weekday workingday weathersit temp atemp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 0 0 6 0 1 0.24 0.288
## 2 1 0 1 1 0 6 0 1 0.22 0.273
## 3 1 0 1 2 0 6 0 1 0.22 0.273
## 4 1 0 1 3 0 6 0 1 0.24 0.288
## 5 1 0 1 4 0 6 0 1 0.24 0.288
## # ℹ 3 more variables: hum <dbl>, windspeed <dbl>, cnt <dbl>
# Drop dteday, instant, casual, registered and cnt for classification
df_classes<-subset(df_classes,select=-c(dteday,instant,casual,registered,cnt))
head(df_classes,5)
## # A tibble: 5 × 13
## season yr mnth hr holiday weekday workingday weathersit temp atemp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 0 0 6 0 1 0.24 0.288
## 2 1 0 1 1 0 6 0 1 0.22 0.273
## 3 1 0 1 2 0 6 0 1 0.22 0.273
## 4 1 0 1 3 0 6 0 1 0.24 0.288
## 5 1 0 1 4 0 6 0 1 0.24 0.288
## # ℹ 3 more variables: hum <dbl>, windspeed <dbl>, classes <fct>
str(df_classes)
## tibble [17,379 × 13] (S3: tbl_df/tbl/data.frame)
## $ season : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ classes : Factor w/ 4 levels "1","2","3","4": 1 2 1 1 1 1 1 1 1 1 ...
There are 2 parts in modelling: regression and classification. In regression, 5 models are selected: random forest (RF), XGBoost (XGB), support vector machine (SVM), linear regression (LR) and decision tree (DT), and every model's hyperparameters are tuned using random search. In classification, RF, SVM and DT are selected.
Regression
Import all the required libraries for modelling
library("tidyverse")
library("here")
library("skimr")
library("janitor")
library("dplyr")
library("Tmisc")
library(ggplot2)
library(caret)
library(purrr)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(e1071)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(rpart)
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
Split the dataset into 80% training data and 20% testing data
set.seed(123)
trainIndex <- createDataPartition(df$cnt, p = 0.8, list = FALSE)
train <- df[trainIndex,]
test <- df[-trainIndex,]
x_train <- subset(train, select = -cnt)
y_train <-train$cnt
x_test <- subset(test, select = -cnt)
y_test <- test$cnt
Define three functions to calculate RMSE, R squared and MAE
RMSE_func <- function(actual, prediction){
sqrt(mean((actual - prediction)^2))
}
RSquare_func <- function(actual, prediction){
  tss <- sum((actual - mean(actual)) ^ 2)
  rss <- sum((actual - prediction) ^ 2)
  1 - (rss / tss)  # return the value visibly rather than via a dangling assignment
}
MAE_func <- function(actual, prediction){
mean(abs(actual - prediction))
}
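As a quick sanity check on these helpers, a tiny worked example with toy numbers (not from the dataset): each prediction is off by exactly 1, so RMSE and MAE are both 1, and with tss = 200 and rss = 3 the R squared is 1 - 3/200 = 0.985.
# Toy check of the metric functions
actual <- c(10, 20, 30)
pred   <- c(11, 21, 31)
RMSE_func(actual, pred)    # 1
MAE_func(actual, pred)     # 1
RSquare_func(actual, pred) # 0.985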
Define a data frame to store the performance of every model
model_performance_df <- data.frame(
Model = c("Linear Regression", "Random Forest", "SVM", "Decision Tree", "XG Boost"),
RMSE = numeric(5),
RSquare = numeric(5),
MAE = numeric(5),
Best_Params = NA_character_,
Random_HT_RMSE = numeric(5),
Random_HT_RSquare = numeric(5),
Random_HT_MAE = numeric(5)
)
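One caveat: bestTune is a one-row data frame, so assigning it directly into a single Best_Params cell keeps only its first column for models tuned over several hyperparameters (as happens for SVM and XG Boost below). A small sketch of a helper that collapses the whole row into one readable string (format_best_params is our name, not a caret function):
# Collapse e.g. data.frame(sigma = 0.3, C = 2) into "sigma = 0.3, C = 2"
format_best_params <- function(best_tune) {
  paste(names(best_tune), unlist(best_tune), sep = " = ", collapse = ", ")
}
# Usage: model_performance_df$Best_Params[3] <- format_best_params(cv_svm_model$bestTune)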
Define the control object and tune length for all machine learning models
control <- trainControl(method="cv", number=5, search = "random")
tuneLgth <- 3
Linear Regression
# Train the model
lr_model = lm(y_train ~ .,data = x_train)
# Test with the x_test data
predictions <- predict(lr_model, newdata = x_test)
# Store the performance into dataframe
model_performance_df$RMSE[1] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[1] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[1] <- MAE_func(y_test, predictions)
# Random Search Hyperparameter tuning
tuning_params <- expand.grid(intercept = c(TRUE, FALSE))
cv_lr_model <- train(x = x_train, y = y_train, method = "lm", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
predictions <- predict(cv_lr_model$finalModel, newdata = x_test)
# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[1] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[1] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[1] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[1] <- cv_lr_model$bestTune
Random Forest
# Train the model
rf_model <- randomForest(y_train~ ., data = x_train)
# Test with the x_test data
predictions <- predict(rf_model, newdata = x_test)
# Store the performance into dataframe
model_performance_df$RMSE[2] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[2] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[2] <- MAE_func(y_test, predictions)
# Random Search Hyperparameter Tuning
tuning_params <- expand.grid(mtry = seq(2,5,by=1))
cv_rf_model <- train(x = x_train, y = y_train, method = "rf", tuneLength = tuneLgth ,trControl = control, tuneGrid = tuning_params)
predictions <- predict(cv_rf_model$finalModel, newdata = x_test)
# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[2] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[2] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[2] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[2] <- cv_rf_model$bestTune
SVM
# Train the model
svm_model <- svm(y_train~ ., data = x_train)
# Test with the x_test data
predictions <- predict(svm_model, newdata = x_test)
# Store the performance into dataframe
model_performance_df$RMSE[3] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[3] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[3] <- MAE_func(y_test, predictions)
# Random Search Hyperparameter Tuning
tuning_params<- expand.grid(sigma = seq(0.1,0.3,by=0.1), C = seq(1,3,by=1))
cv_svm_model <- train(x = x_train, y = y_train, method = "svmRadial", tuneLength = tuneLgth, trControl = control , tuneGrid = tuning_params)
predictions <- predict(cv_svm_model$finalModel, newdata = x_test)
# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[3] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[3] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[3] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[3] <- cv_svm_model$bestTune
Decision Tree
# Train the model
dt_model <- rpart(y_train~ ., data = x_train)
# Test with the x_test data
predictions <- predict(dt_model, x_test)
# Store the performance into dataframe
model_performance_df$RMSE[4] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[4] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[4] <- MAE_func(y_test, predictions)
# Random Search Hyperparameter Tuning
tuning_params <- expand.grid(cp = seq(0.005, 0.01, by = 0.001))
cv_dt_model <- train(x = x_train, y = y_train, method = "rpart", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
predictions <- predict(cv_dt_model$finalModel, newdata = x_test)
# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[4] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[4] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[4] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[4] <- cv_dt_model$bestTune
XG Boost
# Convert the training and testing data to DMatrix
data_test <- xgb.DMatrix(data = as.matrix(x_test))
data_train <- xgb.DMatrix(data = as.matrix(x_train), label = y_train)
# Train the model
xgb_model <- xgb.train(data = data_train, nrounds = 100)
# Test with the x_test data
predictions <- predict(xgb_model, newdata = data_test)
# Store the performance into dataframe
model_performance_df$RMSE[5] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[5] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[5] <- MAE_func(y_test, predictions)
# Define the random search hyperparameter tuning parameters
tuning_params <- expand.grid(
nrounds = c(50, 100),
max_depth = c(3, 6),
eta = c(0.05, 0.1),
gamma = c(0.05, 0.1),
colsample_bytree = c(0.6, 0.8),
min_child_weight = c(1, 3),
subsample=c(0.6,0.8)
)
# Random Search Hyperparameter Tuning
cv_xgb_model <- train(x = x_train, y = y_train, method = "xgbTree", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params, verbosity =0)
predictions <- predict(cv_xgb_model$finalModel, newdata = data_test)
# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[5] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[5] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[5] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[5] <- cv_xgb_model$bestTune
# Create a data frame to store classification models' performance
model_class_performance_df <- data.frame(
Model = c("SVM", "Random Forest", "Decision Tree"),
train_accuracy = numeric(3),
test_accuracy = numeric(3)
)
Data Splitting
#Split the dataset into 80% training data and 20% testing data
set.seed(123)
trainIndex <- createDataPartition(df$cnt, p = 0.8, list = FALSE)
train <- df_classes[trainIndex,]
test <- df_classes[-trainIndex,]
x_train <- subset(train, select = -classes)
y_train <-train$classes
x_test <- subset(test, select = -classes)
y_test <- test$classes
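Note the partition index is built from df$cnt rather than the class labels; since the seed and data are the same, the rows match the regression split. A quick proportion check confirms the four classes remain roughly balanced in both subsets:
# Share of each class in the training and testing sets
round(prop.table(table(y_train)), 3)
round(prop.table(table(y_test)), 3)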
SVM
# Support Vector Machine Model
# Train the model (y_train is a factor, so svm() performs C-classification)
svm_class_model <- svm(x = x_train, y = y_train, type = "C-classification")
# Make predictions with the x_test dataset
predictions <- predict(svm_class_model, newdata = x_test)
train_predictions <- predict(svm_class_model, newdata = x_train)
# Evaluate the model
model_class_performance_df$train_accuracy[1] <- sum(train_predictions == y_train) / length(y_train)
model_class_performance_df$test_accuracy[1] <- sum(predictions == y_test) / length(y_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 717 231 26 5
## 2 144 476 185 58
## 3 1 134 451 187
## 4 1 29 206 624
##
## Overall Statistics
##
## Accuracy : 0.6527
## 95% CI : (0.6366, 0.6685)
## No Information Rate : 0.2515
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5369
##
## Mcnemar's Test P-Value : 4.88e-12
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.8308 0.5471 0.5196 0.7140
## Specificity 0.8997 0.8514 0.8765 0.9093
## Pos Pred Value 0.7324 0.5516 0.5834 0.7256
## Neg Pred Value 0.9415 0.8492 0.8457 0.9044
## Prevalence 0.2483 0.2504 0.2498 0.2515
## Detection Rate 0.2063 0.1370 0.1298 0.1796
## Detection Prevalence 0.2817 0.2483 0.2224 0.2475
## Balanced Accuracy 0.8653 0.6993 0.6980 0.8116
table(predictions,y_test)
## y_test
## predictions 1 2 3 4
## 1 717 231 26 5
## 2 144 476 185 58
## 3 1 134 451 187
## 4 1 29 206 624
Random Forest
# Random Forest Model
# Train the model (factor response, so randomForest runs in classification mode)
rf_class_model <- randomForest(y_train ~ ., data = x_train, ntree = 100)
# Make predictions with the x_test dataset
predictions <- predict(rf_class_model, newdata = x_test)
train_predictions <- predict(rf_class_model, newdata = x_train)
# Evaluate the model
model_class_performance_df$train_accuracy[2] <- sum(train_predictions == y_train) / length(y_train)
model_class_performance_df$test_accuracy[2] <- sum(predictions == y_test) / length(y_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 785 75 1 0
## 2 75 654 143 5
## 3 3 136 610 110
## 4 0 5 114 759
##
## Overall Statistics
##
## Accuracy : 0.8081
## 95% CI : (0.7946, 0.821)
## No Information Rate : 0.2515
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7441
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.9096 0.7517 0.7028 0.8684
## Specificity 0.9709 0.9144 0.9045 0.9542
## Pos Pred Value 0.9117 0.7457 0.7101 0.8645
## Neg Pred Value 0.9702 0.9169 0.9014 0.9557
## Prevalence 0.2483 0.2504 0.2498 0.2515
## Detection Rate 0.2259 0.1882 0.1755 0.2184
## Detection Prevalence 0.2478 0.2524 0.2472 0.2527
## Balanced Accuracy 0.9403 0.8331 0.8036 0.9113
table(predictions,y_test)
## y_test
## predictions 1 2 3 4
## 1 785 75 1 0
## 2 75 654 143 5
## 3 3 136 610 110
## 4 0 5 114 759
# Identify the important features
importance(rf_class_model)
## MeanDecreaseGini
## season 274.56662
## yr 370.04220
## mnth 479.23582
## hr 4270.30608
## holiday 49.44674
## weekday 566.60579
## workingday 329.33875
## weathersit 258.08345
## temp 805.59874
## atemp 855.43716
## hum 922.06924
## windspeed 655.57018
randomForest::varImpPlot(rf_class_model,
sort=TRUE,
main="Importamce Feature Ranking")
Decision Tree
# Decision Tree Model
# Train the model
dt_class_model <- rpart(y_train ~ ., data = x_train, method = 'class')
# Make predictions with the x_test dataset
predictions <- predict(dt_class_model, newdata = x_test, type = "class")
train_predictions <- predict(dt_class_model, newdata = x_train, type = "class")
# Evaluate the model
model_class_performance_df$train_accuracy[3] <- sum(train_predictions == y_train) / length(y_train)
model_class_performance_df$test_accuracy[3] <- sum(predictions == y_test) / length(y_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 684 67 1 0
## 2 153 616 332 126
## 3 14 126 388 122
## 4 12 61 147 626
##
## Overall Statistics
##
## Accuracy : 0.6659
## 95% CI : (0.6499, 0.6816)
## No Information Rate : 0.2515
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5545
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.7926 0.7080 0.4470 0.7162
## Specificity 0.9740 0.7655 0.8995 0.9154
## Pos Pred Value 0.9096 0.5020 0.5969 0.7400
## Neg Pred Value 0.9343 0.8870 0.8301 0.9057
## Prevalence 0.2483 0.2504 0.2498 0.2515
## Detection Rate 0.1968 0.1773 0.1117 0.1801
## Detection Prevalence 0.2164 0.3531 0.1871 0.2435
## Balanced Accuracy 0.8833 0.7367 0.6733 0.8158
table(predictions,y_test)
## y_test
## predictions 1 2 3 4
## 1 684 67 1 0
## 2 153 616 332 126
## 3 14 126 388 122
## 4 12 61 147 626
In regression, RMSE, MAE and R squared are used to evaluate the performance of the models, while in classification, train accuracy and test accuracy are used. In regression, the performance of RF, SVM and DT improved after hyperparameter tuning. However, XGB without hyperparameter tuning still has the best performance, with an RMSE of 39.70, R squared of 0.95 and MAE of 24.62. In classification, RF has the best performance, with a train accuracy of about 99.73% and a test accuracy of 80.81%.
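As a visual check on the winning model, the sketch below plots XGBoost predictions against the actual counts on the regression hold-out. Because y_test was reassigned to class labels in the classification part, the regression targets are rebuilt from the partition index (same seed, so the split is unchanged); data_test is the DMatrix created earlier:
# Rebuild regression hold-out targets and compare with XGB predictions
reg_y_test <- df$cnt[-trainIndex]
reg_pred <- predict(xgb_model, newdata = data_test)
plot(reg_y_test, reg_pred,
     xlab = "Actual Total Rentals", ylab = "Predicted Total Rentals",
     main = "XGB: Predicted vs Actual (Test Set)", col = "steelblue")
abline(0, 1, lwd = 2, col = "red")  # reference line for perfect predictions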
# Display the performance of each model before tuning and after tuning
model_performance_df %>%
kable() %>%
kable_styling()
| Model | RMSE | RSquare | MAE | Best_Params | Random_HT_RMSE | Random_HT_RSquare | Random_HT_MAE |
|---|---|---|---|---|---|---|---|
| Linear Regression | 138.33235 | 0.4072316 | 103.23875 | TRUE | 138.33235 | 0.4072316 | 103.23875 |
| Random Forest | 48.57959 | 0.9268953 | 31.41334 | 5 | 43.27478 | 0.9419894 | 27.52897 |
| SVM | 107.77084 | 0.6402175 | 64.46379 | 0.3 | 95.25184 | 0.7189496 | 57.60480 |
| Decision Tree | 94.54635 | 0.7230974 | 69.25203 | 0.005 | 85.27825 | 0.7747245 | 61.68893 |
| XG Boost | 39.69831 | 0.9511818 | 24.61818 | 100 | 41.17549 | 0.9474812 | 26.41195 |
model_class_performance_df %>%
kable() %>%
kable_styling()
| Model | train_accuracy | test_accuracy |
|---|---|---|
| SVM | 0.6899453 | 0.6526619 |
| Random Forest | 0.9973389 | 0.8080576 |
| Decision Tree | 0.6749856 | 0.6658993 |
For deployment, we use Shiny to deploy our model. The XGB model is saved to a “PredictionModel.RData” file, which is then loaded in the Shiny app. Saving the model as RData means it does not need to be retrained every time; the web application simply loads the RData file to use the model.
# Save the XGB model into RData file
save(xgb_model, file="PredictionModel.RData")
save(rf_class_model, file="PredictionModelClass.RData")
library(shiny)
library(xgboost)
# install.packages('rsconnect')
# Load the trained model
load("PredictionModel.Rdata")
# Define the user interface
ui <- fluidPage(
titlePanel("Bike Sharing Demand Prediction"),
sidebarLayout(
sidebarPanel(
# Input fields for the model features
numericInput("temp", "Temperature:", value = 0),
numericInput("atemp", "ATemperature:", value = 0),
numericInput("hum", "Humidity:", value = 0),
numericInput("windspeed", "Wind Speed:", value = 0),
numericInput("season", "Season:", value = 0),
numericInput("yr", "Year:", value = 0),
numericInput("mnth", "Month:", value = 0),
numericInput("hr", "Hour:", value = 0),
numericInput("holiday", "Holiday:", value = 0),
numericInput("weekday", "Weekday:", value = 0),
numericInput("workingday", "Working Day:", value = 0),
numericInput("weathersit", "Weather Condition:",value = 0),
actionButton("predictBtn", "Predict")
),
mainPanel(
h4("Predicted Demand Total Count:"),
# Output to display the predicted bike demand
verbatimTextOutput("prediction")
)
)
)
# Define the server function
server <- function(input, output) {
# Function to make predictions
makePredictions <- function(input_data) {
# print(input_data)
data_predict <- xgb.DMatrix(data = as.matrix(input_data))
# Load the XGBoost model saved earlier
load("PredictionModel.RData")
# Use the XGBoost model for prediction
prediction <- predict(xgb_model, newdata = data_predict)
# Return the predicted bike demand
return(round(prediction))
}
# Event handler for the predict button
observeEvent(input$predictBtn, {
# Create a data frame with user inputs
input_data <- data.frame(
season=input$season,
yr=input$yr,
mnth=input$mnth,
hr=input$hr,
holiday=input$holiday,
weekday=input$weekday,
workingday=input$workingday,
weathersit=input$weathersit,
temp=input$temp,
atemp=input$atemp,
hum=input$hum,
windspeed = input$windspeed
)
output$prediction <- renderText({
makePredictions(input_data)
})
})
}
# Run the Shiny app
shinyApp(ui = ui, server = server)
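To publish the app to shinyapps.io, the commented-out rsconnect package can be used. A sketch, assuming the app file and PredictionModel.RData sit in the current directory; the account name, token and secret are placeholders to copy from the shinyapps.io dashboard:
# One-time account setup with credentials from the shinyapps.io dashboard
rsconnect::setAccountInfo(name = "your-account",
                          token = "YOUR_TOKEN",
                          secret = "YOUR_SECRET")
# Deploy the directory containing the app and the saved model file
rsconnect::deployApp(appDir = ".")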