This project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, providing a structured framework for our analysis. CRISP-DM guides us through stages such as Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. By adhering to this methodology, we ensure a comprehensive approach to predicting bike sharing demand.
A bike-sharing system is a transport service where bicycles are rent
to individuals for short period at a low cost. It allows users to rent a
bike in a bike station and return it at another bike station within a
same network. Bike sharing has gained attraction in recent years as part
of initiatives to promote green commuting. Bike-sharing system has
significant influence on the transportation, public health as well as
environment of the city.
The earliest bike sharing was
started in 1965 with 50 bicycles, and bike sharing has rapidly expanded
through enhancement of technology in the last decade. As of August 2021,
there are 10 millions bike and over 3000 bike share systems across the
world. According to Statistia, the revenue in bike sharing is expected
to reach 9.21billions USD in 2023 and surge to 13.53 billion USD in 4
years.
The great opportunity in bike sharing system
necessitates the prediction of the bike sharing demand to make this
business constantly work and grow. In this research, the bike sharing
demand will be predicted by using machine learning algorithm. Besides,
the relationship between bike sharing demand and other factors such as
time, weather and climate will be explored.
options(warn = -1) # To disable warning message to print
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(readr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(here)
## here() starts at D:/MDS UM/WQD7004/bikesharing dataset
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(Tmisc)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(purrr)
Dataset with title “Bike Sharing Dataset” is gained from Kaggle with url https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset. Dataset is published 4 years ago, containing hourly count of rental bikes sharing data for the year of 2011 and 2012 in the Capital bike share system. Purpose of this dataset is predicting bike sharing demand with the features in the dataset, for example, temperature, windspeed and humidity.
df <- read_csv("D:/MDS UM/WQD7004/bikesharing dataset/hour.csv")
## Rows: 17379 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): instant, season, yr, mnth, hr, holiday, weekday, workingday, weat...
## date (1): dteday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df) %>%
kable() %>%
kable_styling()
| instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0000 | 3 | 13 | 16 |
| 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 8 | 32 | 40 |
| 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 5 | 27 | 32 |
| 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 3 | 10 | 13 |
| 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 0 | 1 | 1 |
| 6 | 2011-01-01 | 1 | 0 | 1 | 5 | 0 | 6 | 0 | 2 | 0.24 | 0.2576 | 0.75 | 0.0896 | 0 | 1 | 1 |
glimpse(df)
## Rows: 17,379
## Columns: 17
## $ instant <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01…
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hr <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weekday <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weathersit <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
## $ temp <dbl> 0.24, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.20, 0.24, 0.32,…
## $ atemp <dbl> 0.2879, 0.2727, 0.2727, 0.2879, 0.2879, 0.2576, 0.2727, 0.2…
## $ hum <dbl> 0.81, 0.80, 0.80, 0.75, 0.75, 0.75, 0.80, 0.86, 0.75, 0.76,…
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0896, 0.0000, 0.0…
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
## $ cnt <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
#Classification
class(df)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
typeof(df)
## [1] "list"
dim(df)
## [1] 17379 17
#Structure of data
str(df)
## spc_tbl_ [17,379 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ instant : num [1:17379] 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Date[1:17379], format: "2011-01-01" "2011-01-01" ...
## $ season : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : num [1:17379] 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: num [1:17379] 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : num [1:17379] 16 40 32 13 1 1 2 3 8 14 ...
## - attr(*, "spec")=
## .. cols(
## .. instant = col_double(),
## .. dteday = col_date(format = ""),
## .. season = col_double(),
## .. yr = col_double(),
## .. mnth = col_double(),
## .. hr = col_double(),
## .. holiday = col_double(),
## .. weekday = col_double(),
## .. workingday = col_double(),
## .. weathersit = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. hum = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. cnt = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
#data summary
summary(df)
## instant dteday season yr
## Min. : 1 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 1st Qu.:2011-07-04 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 Median :2012-01-02 Median :3.000 Median :1.0000
## Mean : 8690 Mean :2012-01-02 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 3rd Qu.:2012-07-02 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
#attribute of data
length(df)
## [1] 17
names(df)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "casual"
## [16] "registered" "cnt"
instant: record index
dteday: date
season: season (1:spring, 2:summer, 3:fall, 4:winter)
yr: year (0: 2011, 1:2012)
mnth: month ( 1 to 12)
hr: hour (0 to 23)
holiday: weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday: day of the week
workingday: if day is neither weekend nor holiday is 1, otherwise is 0.
weathersit:
1: Clear, Few clouds, Partly cloudy
2: Mist and Cloudy, Mist and Broken clouds, Mist and Few clouds, Mist
3: Light Snow, Light Rain and Thunderstorm and Scattered clouds, Light Rain and Scattered clouds
4: Heavy Rain and Ice Pallets and Thunderstorm and Mist, Snow and Fog
temp: Normalized temperature in Celsius. The values are divided to 41 (max)
atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
hum: Normalized humidity. The values are divided to 100 (max)
windspeed: Normalized wind speed. The values are divided to 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
Total Number of Rides over date
df$dteday <- as.Date(df$dteday)
dt_aggregate_data <-
df %>%
group_by(dteday) %>%
summarize(rides = sum(cnt))
ggplot(dt_aggregate_data, aes(x = dteday, y = rides)) +
geom_line() +
geom_smooth(se = FALSE) + # Add a smoothed line
labs(x = "Date", y = "Total Number of Rides") +
ggtitle("Total Number of Rides over date")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The plot shows the relationship between Total Number of
Rides(cnt) variable and Date. We can observed the number of rides from
day to day from the dataset. Generally, the number of total rides are
increasing over the period.
Trend of Total Number of registered and casual over date
dt_aggrregate_data2 <-
df %>%
group_by(dteday) %>%
summarize(registered = sum(registered),
unregistered = sum(casual))
ggplot(dt_aggrregate_data2, aes(x = dteday)) +
geom_bar(aes(y = registered, fill = "Registered"), stat = "identity", width = 0.5) +
geom_bar(aes(y = unregistered, fill = "Casual"), stat = "identity", width = 0.5) +
geom_smooth(aes(y = registered, color = "Registered"), se = FALSE, color="blue") +
geom_smooth(aes(y = unregistered, color = "Casual"), se = FALSE, color="red") +
labs(x = "Date", y = "Total Number of Rides", fill = "User Type") +
scale_fill_manual(values = c("Registered" = "blue", "Casual" = "red")) +
ggtitle("Total Number of registered and casual over date")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The plot shows the relationship between Total Number of registered
and casual rides variable and Date. Generally, both the number of
registered and casual rides are increasing but the number of registered
users is increasing faster than the casual rides. Their difference is
become higher and higher over the period, indicating that the percentage
of registered is getting higher.
Histogram
histogram <- hist(df$cnt, breaks = 25, ylab = 'Frequency of Rental', xlab = 'Total Bike Rental Count', main = 'Distribution of Total Bike Rental Count', col = 'indianred' )
xfit <- seq(min(df$cnt),max(df$cnt), length = 50)
yfit <- dnorm(xfit, mean =mean(df$cnt),sd=sd(df$cnt))
yfit <- yfit*diff(histogram$mids[1:2])*length(df$cnt)
lines(xfit,yfit, col='yellow', lwd= 3)
From the histogram above, it seems that the number of total rented bikes are positively skewed distribution. The median is generally less than the mean. The mean is typically pulled towards the right by the presence of the long tail. The data have outliers on the right side, which are extreme values that deviate significantly from the majority of the data. These outliers can influence the mean and make it larger than the median.
Trend of Total Bike Rentals on Seasonal Factor
boxplot(df$cnt ~df$season,
data = df,
main = "Total Bike Rentals Vs Season",
xlab = "Season",
ylab = "Total Bike Rentals",
col = c("darkblue", "blue", "lightblue", "skyblue"))
The plot shows the relationship between Total Bike Rentals(cnt)
variable and season. The average numbers of bike rentals are the highest
during summer and fall.
Trend of Total Bike Rentals on Days of the Week Factor
boxplot(df$cnt ~df$weekday,
data = df,
main = "Total Bike Rentals Vs Days of the Week",
xlab = "Days of the Week",
ylab = "Total Bike Rentals",
col = c("red", "yellow", "purple", "blue", "green","orange"))
The plot shows the relationship between Total Bike Rentals(cnt)
variable and Days of The Week. We can see that the median of bike
rentals almost the same each day but slightly higher during the
weekdays.
Trend of Total Bike Rentals on Time Factor
boxplot(df$cnt ~df$hr,
data = df,
main = "Total Bike Rentals Vs Time",
xlab = "Time",
ylab = "Total Bike Rentals",
col = c("pink"))
The plot shows the relationship between Total Bike Rentals(cnt)
variable and Time. We can see that the peak average number of bike
rentals are at 8:00 am and 5:00pm.
Trend of Total Bike Rentals on Weather Factor
boxplot(df$cnt ~df$weathersit,
data = df,
main = "Total Bike Rentals Vs Weather",
xlab = "Weather",
ylab = "Total Bike Rentals",
col = c("lightgreen"))
The plot shows the relationship between Total Bike Rentals(cnt)
variable and weather. We can see that the median number of bike rentals
on clear and fine day are the highest among the others.
Classify the cnt column into a new column name classes
df_classes <- df
for (i in 1:nrow(df_classes)) {
# Check the conditions and assign the corresponding value
if (df_classes$cnt[i] < quantile(df_classes$cnt, probs = 0.25)) {
df_classes$classes[i] <- 1
} else if (df_classes$cnt[i] >= quantile(df_classes$cnt, probs = 0.25) & df_classes$cnt[i] < quantile(df_classes$cnt, probs = 0.5)) {
df_classes$classes[i] <- 2
} else if (df_classes$cnt[i] >= quantile(df_classes$cnt, probs = 0.5) & df_classes$cnt[i] < quantile(df_classes$cnt, probs = 0.75)) {
df_classes$classes[i] <- 3
} else {
df_classes$classes[i] <- 4
}
}
df_classes$classes<-as.factor(df_classes$classes)
Assign classes to total number of rides based on the quartiles.
Total number of rides is divided into 4 classes. Class 1 is the bottom
25%, class 2 is 25% to 50% and class 3 is 50% to 75% and class 4 is the
top 25% of the number of rides.
Classify the cnt column
into a new column name classes
#Plot bar chart for count of classes
classes_aggregate_data <- df_classes %>%
group_by(classes)%>%
summarize(count = n())
ggplot(classes_aggregate_data, aes(x = as.factor(classes), y = count, fill = classes)) +
geom_col(position = "dodge", width = 0.5) +
labs(x = "Classes", y = "Count", fill = "Class") +
ggtitle("Count of classes")
classes_aggregate_data2 <- df_classes %>%
group_by(season, classes) %>%
summarize(count = n()) %>%
mutate(classes = factor(classes),
season=factor(season))
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
The plot shows the count of each class. They are distributed
uniformly.
Classify the cnt column into a new column name
classes
#Plot bar chart for count of classes by season
ggplot(classes_aggregate_data2, aes(x = interaction(classes, season), y = count, fill = classes)) +
geom_col(position = "dodge", width = 0.5) +
labs(x = "Classes and Season", y = "Count", fill = "Class") +
scale_fill_manual(values = c("1" = "blue", "2" = "red", "3" = "green", "4" = "orange")) +
ggtitle("Count of classes by season")
The plot shows the count of classes by season. It is clear that
class 3 and class 4 is less in spring compared to other 3 seasons.
Classify the cnt column into a new column name
classes
#Plot bar chart for count of classes by season
classes_aggregate_data3 <-
df_classes %>%
group_by(classes) %>%
summarize(atemp = mean(atemp))
ggplot(classes_aggregate_data3, aes(x = classes, y = atemp, fill = classes)) +
geom_col(position = "dodge", width = 0.5) +
labs(x = "Classes", y = "Atemp", fill = "Class") +
scale_fill_manual(values = c("1" = "blue", "2" = "red", "3" = "green", "4" = "orange")) +
ggtitle("Temperature of classes")
The plot shows the relation ship between classes and
average(temperature). The average temperature is highest in class 4.
#Measure correlation between all variable:
df_numeric = select_if(df,is.numeric)
data_corr<-cor(df_numeric)
data_corr
## instant season yr mnth hr
## instant 1.000000000 0.404045721 0.866014049 0.489163831 -0.004774815
## season 0.404045721 1.000000000 -0.010742486 0.830385892 -0.006116901
## yr 0.866014049 -0.010742486 1.000000000 -0.010472929 -0.003867005
## mnth 0.489163831 0.830385892 -0.010472929 1.000000000 -0.005771909
## hr -0.004774815 -0.006116901 -0.003867005 -0.005771909 1.000000000
## holiday 0.014723494 -0.009584526 0.006691617 0.018430325 0.000479136
## weekday 0.001356820 -0.002335350 -0.004484851 0.010400061 -0.003497739
## workingday -0.003415559 0.013743102 -0.002196005 -0.003476922 0.002284998
## weathersit -0.014197603 -0.014523552 -0.019156853 0.005399522 -0.020202528
## temp 0.136178007 0.312025237 0.040913380 0.201691494 0.137603494
## atemp 0.137614610 0.319379811 0.039221595 0.208096131 0.133749965
## hum 0.009576774 0.150624745 -0.083546421 0.164411443 -0.276497828
## windspeed -0.074504540 -0.149772751 -0.008739533 -0.135386323 0.137251568
## casual 0.158295401 0.120206447 0.142778528 0.068457301 0.301201730
## registered 0.282045777 0.174225633 0.253684310 0.122272967 0.374140710
## cnt 0.278378694 0.178055731 0.250494899 0.120637760 0.394071498
## holiday weekday workingday weathersit temp
## instant 0.014723494 0.001356820 -0.003415559 -0.014197603 0.136178007
## season -0.009584526 -0.002335350 0.013743102 -0.014523552 0.312025237
## yr 0.006691617 -0.004484851 -0.002196005 -0.019156853 0.040913380
## mnth 0.018430325 0.010400061 -0.003476922 0.005399522 0.201691494
## hr 0.000479136 -0.003497739 0.002284998 -0.020202528 0.137603494
## holiday 1.000000000 -0.102087791 -0.252471370 -0.017036113 -0.027340477
## weekday -0.102087791 1.000000000 0.035955071 0.003310740 -0.001794927
## workingday -0.252471370 0.035955071 1.000000000 0.044672224 0.055390317
## weathersit -0.017036113 0.003310740 0.044672224 1.000000000 -0.102639936
## temp -0.027340477 -0.001794927 0.055390317 -0.102639936 1.000000000
## atemp -0.030972737 -0.008820945 0.054667235 -0.105563108 0.987672139
## hum -0.010588465 -0.037158268 0.015687512 0.418130329 -0.069881391
## windspeed 0.003987632 0.011501545 -0.011829789 0.026225652 -0.023125262
## casual 0.031563628 0.032721415 -0.300942486 -0.152627885 0.459615646
## registered -0.047345424 0.021577888 0.134325791 -0.120965520 0.335360849
## cnt -0.030927303 0.026899860 0.030284368 -0.142426138 0.404772276
## atemp hum windspeed casual registered
## instant 0.137614610 0.009576774 -0.074504540 0.15829540 0.28204578
## season 0.319379811 0.150624745 -0.149772751 0.12020645 0.17422563
## yr 0.039221595 -0.083546421 -0.008739533 0.14277853 0.25368431
## mnth 0.208096131 0.164411443 -0.135386323 0.06845730 0.12227297
## hr 0.133749965 -0.276497828 0.137251568 0.30120173 0.37414071
## holiday -0.030972737 -0.010588465 0.003987632 0.03156363 -0.04734542
## weekday -0.008820945 -0.037158268 0.011501545 0.03272142 0.02157789
## workingday 0.054667235 0.015687512 -0.011829789 -0.30094249 0.13432579
## weathersit -0.105563108 0.418130329 0.026225652 -0.15262788 -0.12096552
## temp 0.987672139 -0.069881391 -0.023125262 0.45961565 0.33536085
## atemp 1.000000000 -0.051917696 -0.062336043 0.45408007 0.33255864
## hum -0.051917696 1.000000000 -0.290104895 -0.34702809 -0.27393312
## windspeed -0.062336043 -0.290104895 1.000000000 0.09028678 0.08232085
## casual 0.454080065 -0.347028093 0.090286775 1.00000000 0.50661770
## registered 0.332558635 -0.273933118 0.082320847 0.50661770 1.00000000
## cnt 0.400929304 -0.322910741 0.093233784 0.69456408 0.97215073
## cnt
## instant 0.27837869
## season 0.17805573
## yr 0.25049490
## mnth 0.12063776
## hr 0.39407150
## holiday -0.03092730
## weekday 0.02689986
## workingday 0.03028437
## weathersit -0.14242614
## temp 0.40477228
## atemp 0.40092930
## hum -0.32291074
## windspeed 0.09323378
## casual 0.69456408
## registered 0.97215073
## cnt 1.00000000
corrplot(data_corr, method="color")
#missing value
sum(is.na(df))
## [1] 0
# #Create new dataset excluding instant,casual,registered variables
# df<-subset(df,select=-c(dteday,instant,casual,registered))
# head(df,5)
# str(df)
Import all the required library for modelling
library("tidyverse")
library("here")
library("skimr")
library("janitor")
library("dplyr")
library("Tmisc")
library(ggplot2)
library(caret)
library(purrr)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(e1071)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(rpart)
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
Import all the required library for modelling
#Create new dataset excluding instant,casual,registered variables
df<-subset(df,select=-c(dteday,instant,casual,registered))
head(df,5)
## # A tibble: 5 × 13
## season yr mnth hr holiday weekday workingday weathersit temp atemp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 0 0 6 0 1 0.24 0.288
## 2 1 0 1 1 0 6 0 1 0.22 0.273
## 3 1 0 1 2 0 6 0 1 0.22 0.273
## 4 1 0 1 3 0 6 0 1 0.24 0.288
## 5 1 0 1 4 0 6 0 1 0.24 0.288
## # ℹ 3 more variables: hum <dbl>, windspeed <dbl>, cnt <dbl>
str(df)
## tibble [17,379 × 13] (S3: tbl_df/tbl/data.frame)
## $ season : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ cnt : num [1:17379] 16 40 32 13 1 1 2 3 8 14 ...
Split the dataset into 80% of training data and 20% of testing data
set.seed(123)
trainIndex <- createDataPartition(df$cnt, p = 0.8, list = FALSE)
train <- df[trainIndex,]
test <- df[-trainIndex,]
x_train <- subset(train, select = -cnt)
y_train <-train$cnt
x_test <- subset(test, select = -cnt)
y_test <- test$cnt
Defined three function to calculate RMSE, R square and MAE
RMSE_func <- function(actual, prediction){
sqrt(mean((actual - prediction)^2))
}
RSquare_func <- function(actual, prediction){
tss <- sum((actual - mean(actual)) ^ 2)
rss <- sum((actual - prediction) ^ 2)
rsq <- 1 - (rss / tss)
}
MAE_func <- function(actual, prediction){
mean(abs(actual - prediction))
}
Define a data frame to store all model performance
model_performance_df <- data.frame(
Model = c("Linear Regression", "Random Forest", "SVM", "Decision Tree", "XG Boost"),
RMSE = numeric(5),
RSquare = numeric(5),
MAE = numeric(5),
Best_Params = 'NAN',
Random_HT_RMSE = numeric(5),
Random_HT_RSquare = numeric(5),
Random_HT_MAE = numeric(5)
)
Define the control value and tunelength for all machine learning model
control <- trainControl(method="cv", number=5, search = "random")
tuneLgth <- 3
Linear Regression
# Train the model
lr_model = lm(y_train ~ .,data = x_train)
# Tested with X_test data
predictions <- predict(lr_model, newdata = x_test)
# Store the performance into dataframe
model_performance_df$RMSE[1] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[1] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[1] <- MAE_func(y_test, predictions)
# Random Search Hyperparameter tuning
tuning_params <- expand.grid(intercept = c(TRUE, FALSE))
cv_lr_model <- train(x = x_train, y = y_train, method = "lm", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
predictions <- predict(cv_lr_model$finalModel, newdata = x_test)
# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[1] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[1] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[1] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[1] <- cv_lr_model$bestTune
Random Forest
# # Train the model
# rf_model <- randomForest(y_train~ ., data = x_train)
#
# # Tested with X_test data
# predictions <- predict(rf_model, newdata = x_test)
#
# # Store the performance into dataframe
# model_performance_df$RMSE[2] <- RMSE_func(y_test, predictions)
# model_performance_df$RSquare[2] <- RSquare_func(y_test, predictions)
# model_performance_df$MAE[2] <- MAE_func(y_test, predictions)
#
# # Random Search Hyperparameter Tuning
# tuning_params <- expand.grid(mtry = seq(2,5))
# cv_rf_model <- train(x = x_train, y = y_train, method = "rf", tuneLength = tuneLgth ,trControl = control, tuneGrid = tuning_params)
# predictions <- predict(cv_rf_model$finalModel, newdata = x_test)
#
# # Store the performance into dataframe
# model_performance_df$Random_HT_RMSE[2] <- RMSE_func(y_test, predictions)
# model_performance_df$Random_HT_RSquare[2] <- RSquare_func(y_test, predictions)
# model_performance_df$Random_HT_MAE[2] <- MAE_func(y_test, predictions)
# model_performance_df$Best_Params[2] <- cv_rf_model$bestTune
SVM
# # Train the model
# svm_model <- svm(y_train~ ., data = x_train)
#
# # Tested with X_test data
# predictions <- predict(svm_model, newdata = x_test)
#
# # Store the performance into dataframe
# model_performance_df$RMSE[3] <- RMSE_func(y_test, predictions)
# model_performance_df$RSquare[3] <- RSquare_func(y_test, predictions)
# model_performance_df$MAE[3] <- MAE_func(y_test, predictions)
#
# # Random Search Hyperparameter Tuning
# tuning_params<- expand.grid(sigma = seq(0.1,0.3,by=0.1), C = seq(1,3))
# cv_svm_model <- train(x = x_train, y = y_train, method = "svmRadial", tuneLength = tuneLgth, trControl = control , tuneGrid = tuning_params)
# predictions <- predict(cv_svm_model$finalModel, newdata = x_test)
#
# # Store the performance into dataframe
# model_performance_df$Random_HT_RMSE[3] <- RMSE_func(y_test, predictions)
# model_performance_df$Random_HT_RSquare[3] <- RSquare_func(y_test, predictions)
# model_performance_df$Random_HT_MAE[3] <- MAE_func(y_test, predictions)
# model_performance_df$Best_Params[3] <- cv_svm_model$bestTune
Decision Tree
# # Train the model
# dt_model <- rpart(y_train~ ., data = x_train)
#
# # Tested with X_test data
# predictions <- predict(dt_model, x_test)
#
# # Store the performance into dataframe
# model_performance_df$RMSE[4] <- RMSE_func(y_test, predictions)
# model_performance_df$RSquare[4] <- RSquare_func(y_test, predictions)
# model_performance_df$MAE[4] <- MAE_func(y_test, predictions)
#
# # Random Search Hyperparameter Tuning
# tuning_params <- expand.grid(cp = seq(0.005, 0.01, by = 0.001))
# cv_dt_model <- train(x = x_train, y = y_train, method = "rpart", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
# predictions <- predict(cv_dt_model$finalModel, newdata = x_test)
#
# # Store the performance into dataframe
# model_performance_df$Random_HT_RMSE[4] <- RMSE_func(y_test, predictions)
# model_performance_df$Random_HT_RSquare[4] <- RSquare_func(y_test, predictions)
# model_performance_df$Random_HT_MAE[4] <- MAE_func(y_test, predictions)
# model_performance_df$Best_Params[4] <- cv_dt_model$bestTune
XG Boost
# # Convert the training and testing data to DMatrix
# data_test <- xgb.DMatrix(data = as.matrix(x_test))
# data_train <- xgb.DMatrix(data = as.matrix(x_train), label = y_train)
#
# # Train the model
# xgb_model <- xgb.train(data = data_train, nrounds = 100)
#
# # Tested with X_test data
# predictions <- predict(xgb_model, newdata = data_test)
#
# # Store the performance into dataframe
# model_performance_df$RMSE[5] <- RMSE_func(y_test, predictions)
# model_performance_df$RSquare[5] <- RSquare_func(y_test, predictions)
# model_performance_df$MAE[5] <- MAE_func(y_test, predictions)
#
# # Define the random search hyperparameter tuning parameters
# tuning_params <- expand.grid(
# nrounds = c(50, 100),
# max_depth = c(3, 6),
# eta = c(0.05, 0.1),
# gamma = c(0.05, 0.1),
# colsample_bytree = c(0.6, 0.8),
# min_child_weight = c(1, 3),
# subsample=c(0.6,0.8)
# )
#
# # Random Search Hyperparameter Tuning
# cv_xgb_model <- train(x = x_train, y = y_train, method = "xgbTree", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
# predictions <- predict(cv_xgb_model$finalModel, newdata = data_test)
#
# # Store the performance into dataframe
# model_performance_df$Random_HT_RMSE[5] <- RMSE_func(y_test, predictions)
# model_performance_df$Random_HT_RSquare[5] <- RSquare_func(y_test, predictions)
# model_performance_df$Random_HT_MAE[5] <- MAE_func(y_test, predictions)
# model_performance_df$Best_Params[5] <- cv_xgb_model$bestTune
SVM
# # Support Vector Machine Model
# # Train the model
# svm_model <- svm(x = x_train, y = y_train, method="class")
#
# # Make predictions with the x_test dataset
# predictions <- predict(svm_model, newdata = x_test)
#
# # Evaluate the model
# confusionMatrix(predictions, y_test)
# table(predictions,predictions)
# ```
# ***Random Forest***
# ```{r predictionmodel7}
# # Random Forest Model
# # Train the model
# rf_model <- randomForest(y_train ~ ., data = x_train, ntree=100,method = 'class')
#
# # Make predictions with the x_test dataset
# predictions <- predict(rf_model, newdata = x_test)
# head(predictions)
#
# # Evaluate the model
# confusionMatrix(predictions, y_test)
# table(predictions,predictions)
#
# # Identify the importance features
# importance(rf_model)
#
# randomForest::varImpPlot(rf_model,
# sort=TRUE,
# main="Importamce Feature Ranking")
# ```
#
# *** Decision Tree ***
# ```{r predictionmodel8}
# # Decision Tree Model
# # Train the model
# dt_model <- rpart(y_train ~ ., data = x_train, method = 'class')
#
# # Make predictions with the x_test dataset
# predictions <- predict(dt_model, newdata = x_test, type = "class")
#
# # Evaluate the model
# confusionMatrix(predictions, y_test)
# table(predictions,predictions)
# Display the performance of each model before tuning and after tuning
view(model_performance_df) %>%
kable() %>%
kable_styling()
| Model | RMSE | RSquare | MAE | Best_Params | Random_HT_RMSE | Random_HT_RSquare | Random_HT_MAE |
|---|---|---|---|---|---|---|---|
| Linear Regression | 138.3324 | 0.4072316 | 103.2388 | TRUE | 138.3324 | 0.4072316 | 103.2388 |
| Random Forest | 0.0000 | 0.0000000 | 0.0000 | NAN | 0.0000 | 0.0000000 | 0.0000 |
| SVM | 0.0000 | 0.0000000 | 0.0000 | NAN | 0.0000 | 0.0000000 | 0.0000 |
| Decision Tree | 0.0000 | 0.0000000 | 0.0000 | NAN | 0.0000 | 0.0000000 | 0.0000 |
| XG Boost | 0.0000 | 0.0000000 | 0.0000 | NAN | 0.0000 | 0.0000000 | 0.0000 |
[Shiny App]