CRISP-DM

This project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, which provides a structured framework for our analysis. CRISP-DM guides us through six stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment. By adhering to this methodology, we ensure a comprehensive approach to predicting bike-sharing demand.

Introduction/Business Understanding

A bike-sharing system is a transport service in which bicycles are rented to individuals for short periods at low cost. Users rent a bike at one station and return it at another station within the same network. Bike sharing has gained traction in recent years as part of initiatives to promote green commuting, and bike-sharing systems significantly influence a city's transportation, public health and environment.

The earliest bike-sharing scheme started in 1965 with 50 bicycles, and bike sharing has expanded rapidly over the last decade thanks to advances in technology. As of August 2021, there were over 10 million bikes and more than 3,000 bike-share systems across the world. According to Statista, revenue in bike sharing is expected to reach 9.21 billion USD in 2023 and grow to 13.53 billion USD within four years.

This great opportunity makes predicting bike-sharing demand essential for the business to operate and grow sustainably. In this research, bike-sharing demand is predicted using machine learning algorithms, and the relationships between demand and factors such as time, weather and climate are explored.

Business Goals & KPIs

  1. To attract more users and increase the number of rides taken. KPIs: the total number of rides and its growth rate.
  2. To optimise the usage of available bikes. KPI: the number of rides per hour.
  3. To increase the percentage of registered users. KPI: the ratio of registered users to casual users.

Data Mining Goals & KPIs

  1. To build a model that forecasts bike-sharing demand from factors such as season, weather and time. KPI: model accuracy.
  2. To identify the relationships between the various features and the number of users, and the feature with the most significant impact. KPIs: the correlations between the features and the target, and the most important feature in the model.
  3. To identify the trend in the total number of users over time. KPI: trend analysis of the total number of users by date.

Initialization

options(warn = -1) # Suppress warning messages in the rendered output
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(readr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()          masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
## here() starts at C:/Users/Yi Kwong/Documents
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(Tmisc)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(purrr)

Data Understanding

The “Bike Sharing Dataset” was obtained from Kaggle (https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset). Published four years ago, it contains hourly counts of rental bikes for 2011 and 2012 from the Capital Bikeshare system. The purpose of the dataset is to predict bike-sharing demand from the features it records, for example temperature, wind speed and humidity.

df <- read_csv("C:/Users/Yi Kwong/Documents/bike_sharing.csv")
## Rows: 17379 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (16): instant, season, yr, mnth, hr, holiday, weekday, workingday, weat...
## date  (1): dteday
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df) %>%
kable() %>%
kable_styling()
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0000 3 13 16
2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0000 8 32 40
3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0000 5 27 32
4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0000 3 10 13
5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0000 0 1 1
6 2011-01-01 1 0 1 5 0 6 0 2 0.24 0.2576 0.75 0.0896 0 1 1
glimpse(df)
## Rows: 17,379
## Columns: 17
## $ instant    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday     <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01…
## $ season     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hr         <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weekday    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weathersit <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
## $ temp       <dbl> 0.24, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.20, 0.24, 0.32,…
## $ atemp      <dbl> 0.2879, 0.2727, 0.2727, 0.2879, 0.2879, 0.2576, 0.2727, 0.2…
## $ hum        <dbl> 0.81, 0.80, 0.80, 0.75, 0.75, 0.75, 0.80, 0.86, 0.75, 0.76,…
## $ windspeed  <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0896, 0.0000, 0.0…
## $ casual     <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
## $ cnt        <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
# Class, type and dimensions of the data object
class(df)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
typeof(df)
## [1] "list"
dim(df)
## [1] 17379    17
#Structure of data
str(df)
## spc_tbl_ [17,379 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ instant   : num [1:17379] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date[1:17379], format: "2011-01-01" "2011-01-01" ...
##  $ season    : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : num [1:17379] 3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: num [1:17379] 13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : num [1:17379] 16 40 32 13 1 1 2 3 8 14 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   instant = col_double(),
##   ..   dteday = col_date(format = ""),
##   ..   season = col_double(),
##   ..   yr = col_double(),
##   ..   mnth = col_double(),
##   ..   hr = col_double(),
##   ..   holiday = col_double(),
##   ..   weekday = col_double(),
##   ..   workingday = col_double(),
##   ..   weathersit = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   hum = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   cnt = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
#data summary
summary(df)
##     instant          dteday               season            yr        
##  Min.   :    1   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   1st Qu.:2011-07-04   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Median :2012-01-02   Median :3.000   Median :1.0000  
##  Mean   : 8690   Mean   :2012-01-02   Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034   3rd Qu.:2012-07-02   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0
# Attributes of the data
length(df)
## [1] 17
names(df)
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"

Attribute Information

instant: record index

dteday: date

season: season (1:spring, 2:summer, 3:fall, 4:winter)

yr: year (0: 2011, 1:2012)

mnth: month ( 1 to 12)

hr: hour (0 to 23)

holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

weekday: day of the week

workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0.

weathersit:

1: Clear, Few clouds, Partly cloudy

2: Mist and Cloudy, Mist and Broken clouds, Mist and Few clouds, Mist

3: Light Snow, Light Rain and Thunderstorm and Scattered clouds, Light Rain and Scattered clouds

4: Heavy Rain and Ice Pallets and Thunderstorm and Mist, Snow and Fog

temp: Normalized temperature in Celsius. The values are divided by 41 (max)

atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)

hum: Normalized humidity. The values are divided by 100 (max)

windspeed: Normalized wind speed. The values are divided by 67 (max)

casual: count of casual users

registered: count of registered users

cnt: count of total rental bikes including both casual and registered
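
Because the weather variables are stored in normalized form, the raw units can be recovered by multiplying back by the stated maxima. A minimal sketch (the raw_units data frame and its column names are our own, for illustration only):

# Recover raw units from the normalized columns (not used downstream)
raw_units <- transform(df,
                       temp_c   = temp * 41,       # temperature in Celsius
                       atemp_c  = atemp * 50,      # feeling temperature in Celsius
                       hum_pct  = hum * 100,       # relative humidity in percent
                       wind_raw = windspeed * 67)  # wind speed
head(raw_units[c("temp_c", "atemp_c", "hum_pct", "wind_raw")])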

Data Exploration

Total Number of Rides over date

df$dteday <- as.Date(df$dteday)

dt_aggregate_data <- 
  df %>%
  group_by(dteday) %>%
  summarize(rides = sum(cnt))

ggplot(dt_aggregate_data, aes(x = dteday, y = rides)) +
  geom_line() +
  geom_smooth(se = FALSE) +  # Add a smoothed line
  labs(x = "Date", y = "Total Number of Rides") +
  ggtitle("Total Number of Rides over date")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


The plot shows the relationship between the total number of rides (cnt) and the date. We can observe the day-to-day variation in rides; overall, the total number of rides increases over the period.

Trend of Total Number of Registered and Casual Rides over Date

dt_aggregate_data2 <- 
  df %>%
  group_by(dteday) %>%
  summarize(registered = sum(registered),
            unregistered = sum(casual))

ggplot(dt_aggregate_data2, aes(x = dteday)) +
  geom_bar(aes(y = registered, fill = "Registered"), stat = "identity", width = 0.5) +
  geom_bar(aes(y = unregistered, fill = "Casual"), stat = "identity", width = 0.5) +
  geom_smooth(aes(y = registered), se = FALSE, color = "blue") +
  geom_smooth(aes(y = unregistered), se = FALSE, color = "red") +
  labs(x = "Date", y = "Total Number of Rides", fill = "User Type") +
  scale_fill_manual(values = c("Registered" = "blue", "Casual" = "red")) +
  ggtitle("Total Number of Registered and Casual Rides over Date")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


The plot shows the total numbers of registered and casual rides over time. Both are increasing, but registered rides are growing faster than casual rides. The gap between them widens over the period, indicating that the share of registered users is rising.

Histogram

# Histogram of total rentals per hourly record
histogram <- hist(df$cnt, breaks = 25, ylab = 'Frequency of Rental', xlab = 'Total Bike Rental Count', main = 'Distribution of Total Bike Rental Count', col = 'indianred')

# Overlay a normal curve with the sample mean and standard deviation,
# scaled from a density to expected counts (bin width x number of observations)
xfit <- seq(min(df$cnt), max(df$cnt), length = 50)
yfit <- dnorm(xfit, mean = mean(df$cnt), sd = sd(df$cnt))
yfit <- yfit * diff(histogram$mids[1:2]) * length(df$cnt)
lines(xfit, yfit, col = 'yellow', lwd = 3)

From the histogram above, the distribution of total bike rentals is positively skewed: the long right tail pulls the mean above the median, and the extreme values on the right are outliers that inflate the mean relative to the majority of the data.
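
As a quick numerical check of the skew, we can compare the mean against the median (about 189.5 vs 142 from the summary above) and compute the moment coefficient of skewness, which should be positive:

# Positive skew: the mean exceeds the median and the skewness coefficient is > 0
mean(df$cnt) - median(df$cnt)
mean((df$cnt - mean(df$cnt))^3) / sd(df$cnt)^3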

Trend of Total Bike Rentals on Seasonal Factor

boxplot(df$cnt ~df$season,
        data = df,
        main = "Total Bike Rentals Vs Season",
        xlab = "Season",
        ylab = "Total Bike Rentals",
        col = c("darkblue", "blue", "lightblue", "skyblue"))


The plot shows the relationship between total bike rentals (cnt) and season. The median number of rentals is highest during summer and fall.

Trend of Total Bike Rentals on Days of the Week Factor

boxplot(df$cnt ~df$weekday,
        data = df,
        main = "Total Bike Rentals Vs Days of the Week",
        xlab = "Days of the Week",
        ylab = "Total Bike Rentals",
        col = c("red", "yellow", "purple", "blue", "green", "orange", "brown")) # one colour per weekday (0-6)


The plot shows the relationship between total bike rentals (cnt) and the day of the week. The median number of rentals is almost the same each day, though slightly higher on weekdays.

Trend of Total Bike Rentals on Time Factor

boxplot(df$cnt ~df$hr,
        data = df,
        main = "Total Bike Rentals Vs Time",
        xlab = "Time",
        ylab = "Total Bike Rentals",
        col = c("pink")) 


The plot shows the relationship between total bike rentals (cnt) and the hour of the day. Rentals peak at 8:00 am and 5:00 pm, matching commuting hours.
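
A minimal sketch to confirm the peaks numerically, averaging cnt within each hour and listing the hours in descending order (hourly_means is our own name):

# Mean rentals per hour; the commuting hours should appear at the top
hourly_means <- df %>%
  group_by(hr) %>%
  summarize(mean_cnt = mean(cnt)) %>%
  arrange(desc(mean_cnt))
head(hourly_means)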

Trend of Total Bike Rentals on Weather Factor

boxplot(df$cnt ~df$weathersit,
        data = df,
        main = "Total Bike Rentals Vs Weather",
        xlab = "Weather",
        ylab = "Total Bike Rentals",
        col = c("lightgreen")) 


The plot shows the relationship between total bike rentals (cnt) and weather. The median number of rentals on clear days is the highest.

Classify the cnt column into a new column named classes

df_classes <- df

# Compute the quartile breakpoints once, then bin cnt into four classes:
# [min, Q1) -> 1, [Q1, median) -> 2, [median, Q3) -> 3, [Q3, max] -> 4
quartiles <- quantile(df_classes$cnt, probs = c(0.25, 0.5, 0.75))
df_classes$classes <- cut(df_classes$cnt,
                          breaks = c(-Inf, quartiles, Inf),
                          labels = 1:4,
                          right = FALSE)  # cut() already returns a factor


Classes are assigned to the total number of rides based on its quartiles: class 1 is the bottom 25%, class 2 covers 25% to 50%, class 3 covers 50% to 75%, and class 4 is the top 25% of the number of rides.
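
A one-line check that the quartile-based binning produced four groups of roughly equal size (ties at the breakpoints make the counts only approximately equal):

table(df_classes$classes)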
Count of Classes

#Plot bar chart for count of classes
classes_aggregate_data <- df_classes %>%
  group_by(classes)%>%
  summarize(count = n())

ggplot(classes_aggregate_data, aes(x = as.factor(classes), y = count, fill = classes)) +
  geom_col(position = "dodge", width = 0.5) +
  labs(x = "Classes", y = "Count", fill = "Class") +
  ggtitle("Count of classes")

classes_aggregate_data2 <- df_classes %>%
  group_by(season, classes) %>%
  summarize(count = n()) %>%
  mutate(classes = factor(classes),
         season=factor(season))
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.


The plot shows the count of each class. As expected from a quartile-based split, the classes are distributed almost uniformly.
Count of Classes by Season

#Plot bar chart for count of classes by season
ggplot(classes_aggregate_data2, aes(x = interaction(classes, season), y = count, fill = classes)) +
  geom_col(position = "dodge", width = 0.5) +
  labs(x = "Classes and Season", y = "Count", fill = "Class") +
  scale_fill_manual(values = c("1" = "blue", "2" = "red", "3" = "green", "4" = "orange")) +
  ggtitle("Count of classes by season")


The plot shows the count of classes by season. Classes 3 and 4 clearly occur less often in spring than in the other three seasons.
Average Temperature by Class

# Plot bar chart of the average feeling temperature (atemp) for each class
classes_aggregate_data3 <- 
  df_classes %>%
  group_by(classes) %>%
  summarize(atemp = mean(atemp))

ggplot(classes_aggregate_data3, aes(x = classes, y = atemp, fill = classes)) +
    geom_col(position = "dodge", width = 0.5) +
    labs(x = "Classes", y = "Atemp", fill = "Class") +
    scale_fill_manual(values = c("1" = "blue", "2" = "red", "3" = "green", "4" = "orange")) +
    ggtitle("Temperature of classes")


The plot shows the relationship between the classes and the average feeling temperature (atemp). The average temperature is highest in class 4.

# Measure the correlation between all numeric variables
df_numeric <- select_if(df, is.numeric)
data_corr<-cor(df_numeric)
data_corr
##                 instant       season           yr         mnth           hr
## instant     1.000000000  0.404045721  0.866014049  0.489163831 -0.004774815
## season      0.404045721  1.000000000 -0.010742486  0.830385892 -0.006116901
## yr          0.866014049 -0.010742486  1.000000000 -0.010472929 -0.003867005
## mnth        0.489163831  0.830385892 -0.010472929  1.000000000 -0.005771909
## hr         -0.004774815 -0.006116901 -0.003867005 -0.005771909  1.000000000
## holiday     0.014723494 -0.009584526  0.006691617  0.018430325  0.000479136
## weekday     0.001356820 -0.002335350 -0.004484851  0.010400061 -0.003497739
## workingday -0.003415559  0.013743102 -0.002196005 -0.003476922  0.002284998
## weathersit -0.014197603 -0.014523552 -0.019156853  0.005399522 -0.020202528
## temp        0.136178007  0.312025237  0.040913380  0.201691494  0.137603494
## atemp       0.137614610  0.319379811  0.039221595  0.208096131  0.133749965
## hum         0.009576774  0.150624745 -0.083546421  0.164411443 -0.276497828
## windspeed  -0.074504540 -0.149772751 -0.008739533 -0.135386323  0.137251568
## casual      0.158295401  0.120206447  0.142778528  0.068457301  0.301201730
## registered  0.282045777  0.174225633  0.253684310  0.122272967  0.374140710
## cnt         0.278378694  0.178055731  0.250494899  0.120637760  0.394071498
##                 holiday      weekday   workingday   weathersit         temp
## instant     0.014723494  0.001356820 -0.003415559 -0.014197603  0.136178007
## season     -0.009584526 -0.002335350  0.013743102 -0.014523552  0.312025237
## yr          0.006691617 -0.004484851 -0.002196005 -0.019156853  0.040913380
## mnth        0.018430325  0.010400061 -0.003476922  0.005399522  0.201691494
## hr          0.000479136 -0.003497739  0.002284998 -0.020202528  0.137603494
## holiday     1.000000000 -0.102087791 -0.252471370 -0.017036113 -0.027340477
## weekday    -0.102087791  1.000000000  0.035955071  0.003310740 -0.001794927
## workingday -0.252471370  0.035955071  1.000000000  0.044672224  0.055390317
## weathersit -0.017036113  0.003310740  0.044672224  1.000000000 -0.102639936
## temp       -0.027340477 -0.001794927  0.055390317 -0.102639936  1.000000000
## atemp      -0.030972737 -0.008820945  0.054667235 -0.105563108  0.987672139
## hum        -0.010588465 -0.037158268  0.015687512  0.418130329 -0.069881391
## windspeed   0.003987632  0.011501545 -0.011829789  0.026225652 -0.023125262
## casual      0.031563628  0.032721415 -0.300942486 -0.152627885  0.459615646
## registered -0.047345424  0.021577888  0.134325791 -0.120965520  0.335360849
## cnt        -0.030927303  0.026899860  0.030284368 -0.142426138  0.404772276
##                   atemp          hum    windspeed      casual  registered
## instant     0.137614610  0.009576774 -0.074504540  0.15829540  0.28204578
## season      0.319379811  0.150624745 -0.149772751  0.12020645  0.17422563
## yr          0.039221595 -0.083546421 -0.008739533  0.14277853  0.25368431
## mnth        0.208096131  0.164411443 -0.135386323  0.06845730  0.12227297
## hr          0.133749965 -0.276497828  0.137251568  0.30120173  0.37414071
## holiday    -0.030972737 -0.010588465  0.003987632  0.03156363 -0.04734542
## weekday    -0.008820945 -0.037158268  0.011501545  0.03272142  0.02157789
## workingday  0.054667235  0.015687512 -0.011829789 -0.30094249  0.13432579
## weathersit -0.105563108  0.418130329  0.026225652 -0.15262788 -0.12096552
## temp        0.987672139 -0.069881391 -0.023125262  0.45961565  0.33536085
## atemp       1.000000000 -0.051917696 -0.062336043  0.45408007  0.33255864
## hum        -0.051917696  1.000000000 -0.290104895 -0.34702809 -0.27393312
## windspeed  -0.062336043 -0.290104895  1.000000000  0.09028678  0.08232085
## casual      0.454080065 -0.347028093  0.090286775  1.00000000  0.50661770
## registered  0.332558635 -0.273933118  0.082320847  0.50661770  1.00000000
## cnt         0.400929304 -0.322910741  0.093233784  0.69456408  0.97215073
##                    cnt
## instant     0.27837869
## season      0.17805573
## yr          0.25049490
## mnth        0.12063776
## hr          0.39407150
## holiday    -0.03092730
## weekday     0.02689986
## workingday  0.03028437
## weathersit -0.14242614
## temp        0.40477228
## atemp       0.40092930
## hum        -0.32291074
## windspeed   0.09323378
## casual      0.69456408
## registered  0.97215073
## cnt         1.00000000
corrplot(data_corr, method="color")
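
To read the strongest relationships straight off the matrix, the cnt column can be sorted by absolute correlation (a small sketch; cnt_corr is our own name):

# Rank the features by the strength of their correlation with cnt
cnt_corr <- data_corr[, "cnt"]
cnt_corr <- cnt_corr[names(cnt_corr) != "cnt"]
cnt_corr[order(abs(cnt_corr), decreasing = TRUE)]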

Data Preparation

# Check for missing values
sum(is.na(df))
## [1] 0
# Create a new dataset for regression, excluding the dteday, instant, casual and registered variables
df<-subset(df,select=-c(dteday,instant,casual,registered))
head(df,5)
## # A tibble: 5 × 13
##   season    yr  mnth    hr holiday weekday workingday weathersit  temp atemp
##    <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl> <dbl> <dbl>
## 1      1     0     1     0       0       6          0          1  0.24 0.288
## 2      1     0     1     1       0       6          0          1  0.22 0.273
## 3      1     0     1     2       0       6          0          1  0.22 0.273
## 4      1     0     1     3       0       6          0          1  0.24 0.288
## 5      1     0     1     4       0       6          0          1  0.24 0.288
## # ℹ 3 more variables: hum <dbl>, windspeed <dbl>, cnt <dbl>
# Drop the dteday, instant, casual, registered and cnt columns for classification
df_classes<-subset(df_classes,select=-c(dteday,instant,casual,registered,cnt))
head(df_classes,5)
## # A tibble: 5 × 13
##   season    yr  mnth    hr holiday weekday workingday weathersit  temp atemp
##    <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl> <dbl> <dbl>
## 1      1     0     1     0       0       6          0          1  0.24 0.288
## 2      1     0     1     1       0       6          0          1  0.22 0.273
## 3      1     0     1     2       0       6          0          1  0.22 0.273
## 4      1     0     1     3       0       6          0          1  0.24 0.288
## 5      1     0     1     4       0       6          0          1  0.24 0.288
## # ℹ 3 more variables: hum <dbl>, windspeed <dbl>, classes <fct>
str(df_classes)
## tibble [17,379 × 13] (S3: tbl_df/tbl/data.frame)
##  $ season    : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ classes   : Factor w/ 4 levels "1","2","3","4": 1 2 1 1 1 1 1 1 1 1 ...

Modelling


Modelling has two parts: regression and classification. For regression, five models are selected: Random Forest (RF), XGBoost (XGB), Support Vector Machine (SVM), Linear Regression (LR) and Decision Tree (DT). Each model's hyperparameters are tuned using random search. For classification, RF, SVM and DT are selected.
Regression

Import all the required libraries for modelling

library("tidyverse")
library("here")
library("skimr")
library("janitor")
library("dplyr")
library("Tmisc")
library(ggplot2)
library(caret)
library(purrr)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(e1071)
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(rpart)
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice

Split the dataset into 80% training data and 20% testing data

set.seed(123)  
trainIndex <- createDataPartition(df$cnt, p = 0.8, list = FALSE)
train <- df[trainIndex,]
test <- df[-trainIndex,]

x_train <- subset(train, select = -cnt)
y_train <-train$cnt
x_test <- subset(test, select = -cnt)
y_test <- test$cnt
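
A quick sanity check on the split sizes; createDataPartition() stratifies on cnt, so the counts are only approximately 80% and 20% of the 17,379 rows:

nrow(train)  # roughly 13,900 training rows
nrow(test)   # roughly 3,475 test rows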

Define three functions to calculate RMSE, R-squared and MAE

RMSE_func <- function(actual, prediction){
  sqrt(mean((actual - prediction)^2))
}

RSquare_func <- function(actual, prediction){
  tss <- sum((actual - mean(actual)) ^ 2)
  rss <- sum((actual - prediction) ^ 2)
  1 - (rss / tss)
}

MAE_func <- function(actual, prediction){
  mean(abs(actual - prediction))
}
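
A quick sanity check of the three helpers on a toy vector, where perfect predictions should give RMSE = 0, R-squared = 1 and MAE = 0:

RMSE_func(c(1, 2, 3), c(1, 2, 3))     # 0
RSquare_func(c(1, 2, 3), c(1, 2, 3))  # 1
MAE_func(c(1, 2, 3), c(1, 2, 3))      # 0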

Define a data frame to store the performance of every model

model_performance_df <- data.frame(
  Model = c("Linear Regression", "Random Forest", "SVM", "Decision Tree", "XG Boost"),
  RMSE = numeric(5),
  RSquare = numeric(5),
  MAE = numeric(5),
  Best_Params = NA_character_,  # placeholder, filled in after tuning
  Random_HT_RMSE = numeric(5),
  Random_HT_RSquare = numeric(5),
  Random_HT_MAE = numeric(5)
)

Define the cross-validation control and tune length for all machine learning models

control <- trainControl(method="cv", number=5, search = "random")
tuneLgth <- 3
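
One caveat, based on caret's documented behaviour: search = "random" only takes effect when no tuneGrid is passed to train(); once an explicit grid is supplied, exactly those candidates are evaluated. A purely random search would therefore omit the grid, as in this sketch for the random forest (cv_rf_random is our own name):

# Let caret draw tuneLgth random candidate settings instead of a fixed grid
cv_rf_random <- train(x = x_train, y = y_train, method = "rf",
                      trControl = control, tuneLength = tuneLgth)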

Linear Regression

# Train the model
lr_model <- lm(y_train ~ ., data = x_train)

# Predict on the test set
predictions <- predict(lr_model, newdata = x_test)

# Store the performance into dataframe
model_performance_df$RMSE[1] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[1] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[1] <- MAE_func(y_test, predictions)

# Random Search Hyperparameter tuning
tuning_params <- expand.grid(intercept = c(TRUE, FALSE))
cv_lr_model <- train(x = x_train, y = y_train, method = "lm", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
predictions <- predict(cv_lr_model$finalModel, newdata = x_test)

# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[1] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[1] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[1] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[1] <- cv_lr_model$bestTune

Random Forest

# Train the model
rf_model <- randomForest(y_train~ ., data = x_train)

# Predict on the test set
predictions <- predict(rf_model, newdata = x_test)

# Store the performance into dataframe
model_performance_df$RMSE[2] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[2] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[2] <- MAE_func(y_test, predictions)

# Random Search Hyperparameter Tuning
tuning_params <- expand.grid(mtry = seq(2,5,by=1))
cv_rf_model <- train(x = x_train, y = y_train, method = "rf", tuneLength = tuneLgth ,trControl = control, tuneGrid = tuning_params)
predictions <- predict(cv_rf_model$finalModel, newdata = x_test)

# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[2] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[2] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[2] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[2] <- cv_rf_model$bestTune

SVM

# Train the model
svm_model <- svm(y_train~ ., data = x_train)

# Predict on the test set
predictions <- predict(svm_model, newdata = x_test)

# Store the performance into dataframe
model_performance_df$RMSE[3] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[3] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[3] <- MAE_func(y_test, predictions)

# Random Search Hyperparameter Tuning
tuning_params<- expand.grid(sigma = seq(0.1,0.3,by=0.1), C = seq(1,3,by=1))
cv_svm_model <- train(x = x_train, y = y_train, method = "svmRadial", tuneLength = tuneLgth, trControl = control , tuneGrid = tuning_params)
predictions <- predict(cv_svm_model$finalModel, newdata = x_test)

# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[3] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[3] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[3] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[3] <- cv_svm_model$bestTune

Decision Tree

# Train the model
dt_model <- rpart(y_train~ ., data = x_train)

# Predict on the test set
predictions <- predict(dt_model, x_test)

# Store the performance into dataframe
model_performance_df$RMSE[4] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[4] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[4] <- MAE_func(y_test, predictions)

# Random Search Hyperparameter Tuning
tuning_params <- expand.grid(cp = seq(0.005, 0.01, by = 0.001))
cv_dt_model <- train(x = x_train, y = y_train, method = "rpart", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
predictions <- predict(cv_dt_model$finalModel, newdata = x_test)

# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[4] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[4] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[4] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[4] <- cv_dt_model$bestTune

XG Boost

# Convert the training and testing data to DMatrix
data_test <- xgb.DMatrix(data = as.matrix(x_test))
data_train <- xgb.DMatrix(data = as.matrix(x_train), label = y_train)

# Train the model
xgb_model <- xgb.train(data = data_train, nrounds = 100)

# Predict on the test set
predictions <- predict(xgb_model, newdata = data_test)

# Store the performance into dataframe
model_performance_df$RMSE[5] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[5] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[5] <- MAE_func(y_test, predictions)

# Define the random search hyperparameter tuning parameters
tuning_params <- expand.grid(
  nrounds = c(50, 100),
  max_depth = c(3, 6),
  eta = c(0.05, 0.1),
  gamma = c(0.05, 0.1),
  colsample_bytree = c(0.6, 0.8),
  min_child_weight = c(1, 3),
  subsample=c(0.6,0.8)
)
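# Note: two values for each of the seven parameters above give
# 2^7 = 128 candidate combinations, each fitted 5 times under 5-fold CV
nrow(tuning_params)  # 128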

# Random Search Hyperparameter Tuning
cv_xgb_model <- train(x = x_train, y = y_train, method = "xgbTree", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params, verbosity =0)
predictions <- predict(cv_xgb_model$finalModel, newdata = data_test)

# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[5] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[5] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[5] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[5] <- cv_xgb_model$bestTune

Classification

# Create a data frame to store classification models' performance
model_class_performance_df <- data.frame(
  Model = c("SVM", "Random Forest", "Decision Tree"),
  train_accuracy = numeric(3),
  test_accuracy = numeric(3)
)

Data Splitting

# Split the dataset into 80% training and 20% testing data, reusing the same
# seed and cnt-based partition as in regression so both tasks use the same rows
set.seed(123)
trainIndex <- createDataPartition(df$cnt, p = 0.8, list = FALSE)
train <- df_classes[trainIndex,]
test <- df_classes[-trainIndex,]

x_train <- subset(train, select = -classes)
y_train <-train$classes
x_test <- subset(test, select = -classes)
y_test <- test$classes
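
Because the partition is stratified on cnt rather than on the classes themselves, it is worth verifying that the four classes stay balanced across the two sets (a quick check; the prevalences in the confusion matrices below are all close to 0.25):

prop.table(table(y_train))
prop.table(table(y_test))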

SVM

# Support Vector Machine Model
# Train the model (y_train is a factor, so svm() defaults to C-classification)
svm_class_model <- svm(x = x_train, y = y_train)

# Make predictions with the x_test dataset
predictions <- predict(svm_class_model, newdata = x_test)
train_predictions <- predict(svm_class_model, newdata = x_train)


# Evaluate the model
model_class_performance_df$train_accuracy[1] <- sum(train_predictions == y_train) / length(y_train)
model_class_performance_df$test_accuracy[1] <- sum(predictions == y_test) / length(y_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 717 231  26   5
##          2 144 476 185  58
##          3   1 134 451 187
##          4   1  29 206 624
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6527          
##                  95% CI : (0.6366, 0.6685)
##     No Information Rate : 0.2515          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5369          
##                                           
##  Mcnemar's Test P-Value : 4.88e-12        
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.8308   0.5471   0.5196   0.7140
## Specificity            0.8997   0.8514   0.8765   0.9093
## Pos Pred Value         0.7324   0.5516   0.5834   0.7256
## Neg Pred Value         0.9415   0.8492   0.8457   0.9044
## Prevalence             0.2483   0.2504   0.2498   0.2515
## Detection Rate         0.2063   0.1370   0.1298   0.1796
## Detection Prevalence   0.2817   0.2483   0.2224   0.2475
## Balanced Accuracy      0.8653   0.6993   0.6980   0.8116
table(predictions,y_test)
##            y_test
## predictions   1   2   3   4
##           1 717 231  26   5
##           2 144 476 185  58
##           3   1 134 451 187
##           4   1  29 206 624

Random Forest

# Random Forest Model
# Train the model (y_train is a factor, so randomForest() performs classification)
rf_class_model <- randomForest(y_train ~ ., data = x_train, ntree = 100)

# Make predictions with the x_test dataset
predictions <- predict(rf_class_model, newdata = x_test)
train_predictions <- predict(rf_class_model, newdata = x_train)

# Evaluate the model
model_class_performance_df$train_accuracy[2] <- sum(train_predictions == y_train) / length(y_train)
model_class_performance_df$test_accuracy[2] <- sum(predictions == y_test) / length(y_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 785  75   1   0
##          2  75 654 143   5
##          3   3 136 610 110
##          4   0   5 114 759
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8081         
##                  95% CI : (0.7946, 0.821)
##     No Information Rate : 0.2515         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7441         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9096   0.7517   0.7028   0.8684
## Specificity            0.9709   0.9144   0.9045   0.9542
## Pos Pred Value         0.9117   0.7457   0.7101   0.8645
## Neg Pred Value         0.9702   0.9169   0.9014   0.9557
## Prevalence             0.2483   0.2504   0.2498   0.2515
## Detection Rate         0.2259   0.1882   0.1755   0.2184
## Detection Prevalence   0.2478   0.2524   0.2472   0.2527
## Balanced Accuracy      0.9403   0.8331   0.8036   0.9113
table(predictions,y_test)
##            y_test
## predictions   1   2   3   4
##           1 785  75   1   0
##           2  75 654 143   5
##           3   3 136 610 110
##           4   0   5 114 759
# Identify the important features
importance(rf_class_model)
##            MeanDecreaseGini
## season            274.56662
## yr                370.04220
## mnth              479.23582
## hr               4270.30608
## holiday            49.44674
## weekday           566.60579
## workingday        329.33875
## weathersit        258.08345
## temp              805.59874
## atemp             855.43716
## hum               922.06924
## windspeed         655.57018
randomForest::varImpPlot(rf_class_model,
                         sort=TRUE,
                         main="Feature Importance Ranking")

Decision Tree

# Decision Tree Model
# Train the model
dt_class_model <- rpart(y_train ~ ., data = x_train, method = 'class')

# Make predictions with the x_test dataset
predictions <- predict(dt_class_model, newdata = x_test, type = "class")
train_predictions <- predict(dt_class_model, newdata = x_train, type = "class")

# Evaluate the model

model_class_performance_df$train_accuracy[3] <- sum(train_predictions == y_train) / length(y_train)
model_class_performance_df$test_accuracy[3] <- sum(predictions == y_test) / length(y_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 684  67   1   0
##          2 153 616 332 126
##          3  14 126 388 122
##          4  12  61 147 626
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6659          
##                  95% CI : (0.6499, 0.6816)
##     No Information Rate : 0.2515          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5545          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.7926   0.7080   0.4470   0.7162
## Specificity            0.9740   0.7655   0.8995   0.9154
## Pos Pred Value         0.9096   0.5020   0.5969   0.7400
## Neg Pred Value         0.9343   0.8870   0.8301   0.9057
## Prevalence             0.2483   0.2504   0.2498   0.2515
## Detection Rate         0.1968   0.1773   0.1117   0.1801
## Detection Prevalence   0.2164   0.3531   0.1871   0.2435
## Balanced Accuracy      0.8833   0.7367   0.6733   0.8158
table(predictions,y_test)
##            y_test
## predictions   1   2   3   4
##           1 684  67   1   0
##           2 153 616 332 126
##           3  14 126 388 122
##           4  12  61 147 626

Evaluation


In regression, RMSE, MAE and R-squared are used to evaluate the models' performance, while in classification training accuracy and test accuracy are used. In regression, the performance of RF, SVM and DT improved after hyperparameter tuning; however, XGB without hyperparameter tuning still performs best, with an RMSE of 39.70, an R-squared of 0.95 and an MAE of 24.62. In classification, RF performs best, with a training accuracy of about 99.73% and a test accuracy of 80.81%; the large gap between the two suggests some overfitting to the training data.
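
The same conclusion can be read off the performance table programmatically; this sketch simply picks the model whose best RMSE, tuned or untuned, is lowest:

# Best regression model by the lower of the untuned and tuned RMSE
best_rmse <- pmin(model_performance_df$RMSE, model_performance_df$Random_HT_RMSE)
model_performance_df$Model[which.min(best_rmse)]  # "XG Boost"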

# Display the performance of each model before and after tuning
model_performance_df %>%
kable() %>%
kable_styling()
Model RMSE RSquare MAE Best_Params Random_HT_RMSE Random_HT_RSquare Random_HT_MAE
Linear Regression 138.33235 0.4072316 103.23875 TRUE 138.33235 0.4072316 103.23875
Random Forest 48.57959 0.9268953 31.41334 5 43.27478 0.9419894 27.52897
SVM 107.77084 0.6402175 64.46379 0.3 95.25184 0.7189496 57.60480
Decision Tree 94.54635 0.7230974 69.25203 0.005 85.27825 0.7747245 61.68893
XG Boost 39.69831 0.9511818 24.61818 100 41.17549 0.9474812 26.41195
model_class_performance_df %>%
kable() %>%
kable_styling()
Model train_accuracy test_accuracy
SVM 0.6899453 0.6526619
Random Forest 0.9973389 0.8080576
Decision Tree 0.6749856 0.6658993

Deployment


For deployment, we use Shiny. The XGB model is saved to a “PredictionModel.RData” file and loaded in the Shiny app. Saving the model as an RData file means it does not need to be retrained every time; the web application simply loads the file and uses the model.

We use GitHub to host our model. To access it, run the command runGitHub(“RAssignment”,“RSSST”) in RStudio to view our product.

# Save the XGB and RF models into RData files
save(xgb_model, file="PredictionModel.RData")
save(rf_class_model, file="PredictionModelClass.RData")
library(shiny)
library(xgboost)
# install.packages('rsconnect')

# Load the trained model
load("PredictionModel.Rdata")

# Define the user interface
ui <- fluidPage(
  titlePanel("Bike Sharing Demand Prediction"),
  sidebarLayout(
    sidebarPanel(
      # Input fields for the model features
      numericInput("temp", "Temperature:", value = 0),
      numericInput("atemp", "ATemperature:", value = 0),
      numericInput("hum", "Humidity:", value = 0),
      numericInput("windspeed", "Wind Speed:", value = 0),
      numericInput("season", "Season:", value = 0),
      numericInput("yr", "Year:", value = 0),
      numericInput("mnth", "Month:", value = 0),
      numericInput("hr", "Hour:", value = 0),
      numericInput("holiday", "Holiday:", value = 0),
      numericInput("weekday", "Weekday:", value = 0),
      numericInput("workingday", "Working Day:", value = 0),
      numericInput("weathersit", "Weather Condition:",value = 0),
      actionButton("predictBtn", "Predict")
    ),
    mainPanel(
      h4("Predicted Demand Total Count:"),
      # Output to display the predicted bike demand
      verbatimTextOutput("prediction")
    )
  )
)

# Define the server function
server <- function(input, output) {
  
  # Function to make predictions
  makePredictions <- function(input_data) {
    # Convert the user inputs to the DMatrix format expected by xgboost
    data_predict <- xgb.DMatrix(data = as.matrix(input_data))
    # Load the saved XGBoost model (already loaded above; reloading here keeps
    # the function self-contained)
    load("PredictionModel.RData")
    # Use the XGBoost model for prediction
    prediction <- predict(xgb_model, newdata = data_predict)
    
    # Return the predicted bike demand
    return(round(prediction))
  }

  
  # Event handler for the predict button
  observeEvent(input$predictBtn, {
    # Create a data frame with user inputs
    input_data <- data.frame(
      season=input$season,
      yr=input$yr,
      mnth=input$mnth,
      hr=input$hr,
      holiday=input$holiday,
      weekday=input$weekday,
      workingday=input$workingday,
      weathersit=input$weathersit,
      temp=input$temp,
      atemp=input$atemp,
      hum=input$hum,
      windspeed = input$windspeed
    )
    output$prediction <- renderText({
      makePredictions(input_data)
    })
  })
}

# Run the Shiny app
shinyApp(ui = ui, server = server)
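
To launch the hosted app locally, the runGitHub() command quoted above can be run directly in RStudio (assuming the shiny package is installed and the repository is public):

# Fetch and run the app straight from GitHub
library(shiny)
runGitHub("RAssignment", "RSSST")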

Conclusion


In conclusion, we conducted our study following CRISP-DM and fulfilled all the data mining goals. The trend of bike sharing was plotted, models to predict bike-sharing demand were built, and the most relevant feature was found to be “hr”. The best regression model is XGB, while the best classification model is RF; both are deployed using the Shiny package.