CRISP-DM

This project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, which provides a structured framework for our analysis. CRISP-DM guides us through six stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Adhering to this methodology ensures a comprehensive approach to predicting bike-sharing demand.

Introduction/Business Understanding

A bike-sharing system is a transport service in which bicycles are rented to individuals for short periods at a low cost. Users pick up a bike at one station and return it at another station within the same network. Bike sharing has gained traction in recent years as part of initiatives to promote green commuting, and bike-sharing systems significantly influence a city's transportation, public health, and environment.

The earliest bike-sharing programme started in 1965 with 50 bicycles, and bike sharing has expanded rapidly over the last decade thanks to advances in technology. As of August 2021, there were 10 million bikes and over 3,000 bike-share systems across the world. According to Statista, bike-sharing revenue is expected to reach USD 9.21 billion in 2023 and surge to USD 13.53 billion within four years.

This large opportunity makes demand prediction essential for running and growing a bike-sharing business. In this research, bike-sharing demand is predicted using machine learning algorithms, and the relationship between demand and factors such as time, weather and climate is explored.

Business Goals & KPIs

  1. To attract more users and increase the number of rides taken. KPI: the total number of rides and its growth rate.
  2. To optimise the usage of available bikes. KPI: the number of rides per hour.
  3. To increase the percentage of registered users. KPI: the ratio of registered users to casual users (see the sketch after this list).
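
As a rough illustration, all three KPIs can be computed directly from the hourly data. A minimal dplyr sketch, assuming the hour.csv data introduced under Data Understanding below has been loaded as df (the name kpis is hypothetical):

kpis <- df %>%
  summarize(total_rides = sum(cnt),                              # KPI 1: total number of rides
            rides_per_hour = mean(cnt),                          # KPI 2: average rides per hour
            registered_to_casual = sum(registered) / sum(casual))# KPI 3: user-type ratio
kpis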

Data Mining Goals & KPIs

  1. To build a model that forecasts bike-sharing demand from factors such as season, weather and time. KPI: model accuracy.
  2. To identify the relationship between the various features and the number of users, and the feature with the most significant impact. KPI: the correlation of each feature with the target and the most important feature in the model.
  3. To identify the trend in the total number of users over time. KPI: the trend analysis of total users by date.

Initialization

options(warn = -1) # Suppress warning messages in the knitted output
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(readr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()          masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
## here() starts at D:/MDS UM/WQD7004/bikesharing dataset
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(Tmisc)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(purrr)

Data Understanding

The “Bike Sharing Dataset” was obtained from Kaggle (https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset). Published four years ago, it contains hourly counts of rental bikes for 2011 and 2012 in the Capital Bikeshare system. The dataset is intended for predicting bike-sharing demand from features such as temperature, wind speed and humidity.

df <- read_csv("D:/MDS UM/WQD7004/bikesharing dataset/hour.csv")
## Rows: 17379 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (16): instant, season, yr, mnth, hr, holiday, weekday, workingday, weat...
## date  (1): dteday
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df) %>%
kable() %>%
kable_styling()
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0000 3 13 16
2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0000 8 32 40
3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0000 5 27 32
4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0000 3 10 13
5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0000 0 1 1
6 2011-01-01 1 0 1 5 0 6 0 2 0.24 0.2576 0.75 0.0896 0 1 1
glimpse(df)
## Rows: 17,379
## Columns: 17
## $ instant    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday     <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01…
## $ season     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hr         <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weekday    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weathersit <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
## $ temp       <dbl> 0.24, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.20, 0.24, 0.32,…
## $ atemp      <dbl> 0.2879, 0.2727, 0.2727, 0.2879, 0.2879, 0.2576, 0.2727, 0.2…
## $ hum        <dbl> 0.81, 0.80, 0.80, 0.75, 0.75, 0.75, 0.80, 0.86, 0.75, 0.76,…
## $ windspeed  <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0896, 0.0000, 0.0…
## $ casual     <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
## $ cnt        <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
# Object class, type and dimensions of the data
class(df)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
typeof(df)
## [1] "list"
dim(df)
## [1] 17379    17
#Structure of data
str(df)
## spc_tbl_ [17,379 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ instant   : num [1:17379] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date[1:17379], format: "2011-01-01" "2011-01-01" ...
##  $ season    : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : num [1:17379] 3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: num [1:17379] 13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : num [1:17379] 16 40 32 13 1 1 2 3 8 14 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   instant = col_double(),
##   ..   dteday = col_date(format = ""),
##   ..   season = col_double(),
##   ..   yr = col_double(),
##   ..   mnth = col_double(),
##   ..   hr = col_double(),
##   ..   holiday = col_double(),
##   ..   weekday = col_double(),
##   ..   workingday = col_double(),
##   ..   weathersit = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   hum = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   cnt = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
#data summary
summary(df)
##     instant          dteday               season            yr        
##  Min.   :    1   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   1st Qu.:2011-07-04   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Median :2012-01-02   Median :3.000   Median :1.0000  
##  Mean   : 8690   Mean   :2012-01-02   Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034   3rd Qu.:2012-07-02   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0
# Attributes of the data
length(df)
## [1] 17
names(df)
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"

Attribute Information

instant: record index

dteday: date

season: season (1:spring, 2:summer, 3:fall, 4:winter)

yr: year (0: 2011, 1:2012)

mnth: month ( 1 to 12)

hr: hour (0 to 23)

holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

weekday: day of the week (0 to 6)

workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0.

weathersit:

1: Clear, Few clouds, Partly cloudy

2: Mist and Cloudy, Mist and Broken clouds, Mist and Few clouds, Mist

3: Light Snow, Light Rain and Thunderstorm and Scattered clouds, Light Rain and Scattered clouds

4: Heavy Rain and Ice Pellets and Thunderstorm and Mist, Snow and Fog

temp: Normalized temperature in Celsius. The values are divided by 41 (max)

atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)

hum: Normalized humidity. The values are divided by 100 (max)

windspeed: Normalized wind speed. The values are divided by 67 (max)

casual: count of casual users

registered: count of registered users

cnt: count of total rental bikes including both casual and registered
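
Since these columns are simple rescalings, the normalization can be inverted to recover physical units. A minimal sketch (the temp_c, atemp_c, hum_pct and windspeed_raw names are hypothetical):

# Invert the normalization described above to recover raw units
df_raw <- df %>%
  mutate(temp_c = temp * 41,              # temperature in Celsius
         atemp_c = atemp * 50,            # feeling temperature in Celsius
         hum_pct = hum * 100,             # relative humidity in percent
         windspeed_raw = windspeed * 67)  # wind speed on the dataset's original scale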

Data Exploration

Total Number of Rides over date

df$dteday <- as.Date(df$dteday)

dt_aggregate_data <- 
  df %>%
  group_by(dteday) %>%
  summarize(rides = sum(cnt))

ggplot(dt_aggregate_data, aes(x = dteday, y = rides)) +
  geom_line() +
  geom_smooth(se = FALSE) +  # Add a smoothed line
  labs(x = "Date", y = "Total Number of Rides") +
  ggtitle("Total Number of Rides over date")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


The plot shows the total number of rides (cnt) over time. We can observe the day-to-day ride counts, and overall the total number of rides increases over the period.

Trend of Total Number of Registered and Casual Rides over Date

dt_aggregate_data2 <- 
  df %>%
  group_by(dteday) %>%
  summarize(registered = sum(registered),
            casual = sum(casual))

ggplot(dt_aggregate_data2, aes(x = dteday)) +
  geom_bar(aes(y = registered, fill = "Registered"), stat = "identity", width = 0.5) +
  geom_bar(aes(y = casual, fill = "Casual"), stat = "identity", width = 0.5) +
  geom_smooth(aes(y = registered), se = FALSE, color = "blue") +
  geom_smooth(aes(y = casual), se = FALSE, color = "red") +
  labs(x = "Date", y = "Total Number of Rides", fill = "User Type") +
  scale_fill_manual(values = c("Registered" = "blue", "Casual" = "red")) +
  ggtitle("Total Number of Registered and Casual Rides over Date")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


The plot shows the total number of registered and casual rides over time. Both are generally increasing, but registered rides grow faster than casual rides; the gap widens over the period, indicating that the share of registered users is rising.
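
To check this claim numerically, the share of registered rides can be computed per month; a small sketch:

# Share of registered rides by month; the share should rise over the period
df %>%
  mutate(month = format(dteday, "%Y-%m")) %>%
  group_by(month) %>%
  summarize(registered_share = sum(registered) / sum(cnt))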

Histogram

# Histogram of total rentals with a normal curve overlaid for reference
histogram <- hist(df$cnt, breaks = 25, ylab = 'Frequency of Rental', xlab = 'Total Bike Rental Count', main = 'Distribution of Total Bike Rental Count', col = 'indianred')

# Overlay a normal density scaled from density to expected counts per bin
xfit <- seq(min(df$cnt), max(df$cnt), length = 50)
yfit <- dnorm(xfit, mean = mean(df$cnt), sd = sd(df$cnt))
yfit <- yfit * diff(histogram$mids[1:2]) * length(df$cnt)  # bin width times number of observations
lines(xfit, yfit, col = 'yellow', lwd = 3)

The histogram shows that the distribution of total bike rentals is positively skewed: the median is less than the mean, since the long right tail pulls the mean upward. The data also have outliers on the right side, extreme values that deviate significantly from the bulk of the data and further inflate the mean relative to the median.
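
A quick numeric check of these claims, assuming the e1071 package (loaded later for modelling) is installed:

mean(df$cnt)             # 189.5, per the summary above
median(df$cnt)           # 142, below the mean as expected for right skew
e1071::skewness(df$cnt)  # a positive value confirms the positive skew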

Trend of Total Bike Rentals on Seasonal Factor

boxplot(df$cnt ~df$season,
        data = df,
        main = "Total Bike Rentals Vs Season",
        xlab = "Season",
        ylab = "Total Bike Rentals",
        col = c("darkblue", "blue", "lightblue", "skyblue"))


The plot shows the relationship between total bike rentals (cnt) and season. The median number of bike rentals is highest during summer and fall.

Trend of Total Bike Rentals on Days of the Week Factor

boxplot(df$cnt ~ df$weekday,
        data = df,
        main = "Total Bike Rentals Vs Days of the Week",
        xlab = "Days of the Week",
        ylab = "Total Bike Rentals",
        col = c("red", "yellow", "purple", "blue", "green", "orange", "brown"))  # one color per weekday (0-6)


The plot shows the relationship between total bike rentals (cnt) and day of the week. The median number of rentals is almost the same each day, slightly higher on weekdays.

Trend of Total Bike Rentals on Time Factor

boxplot(df$cnt ~df$hr,
        data = df,
        main = "Total Bike Rentals Vs Time",
        xlab = "Time",
        ylab = "Total Bike Rentals",
        col = c("pink")) 


The plot shows the relationship between total bike rentals (cnt) and hour of day. Rentals peak around 8:00 am and 5:00 pm, consistent with commuting hours.
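
The peaks can be confirmed by averaging rentals per hour; a small sketch:

# Mean rentals by hour of day, highest first
df %>%
  group_by(hr) %>%
  summarize(mean_cnt = mean(cnt)) %>%
  arrange(desc(mean_cnt)) %>%
  head(5)  # the top hours cluster around the morning and evening commutes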

Trend of Total Bike Rentals on Weather Factor

boxplot(df$cnt ~df$weathersit,
        data = df,
        main = "Total Bike Rentals Vs Weather",
        xlab = "Weather",
        ylab = "Total Bike Rentals",
        col = c("lightgreen")) 


The plot shows the relationship between total bike rentals (cnt) and weather. The median number of rentals on clear days is the highest among the four weather categories.

Classify the cnt column into a new column named classes

df_classes <- df

# Quartile boundaries of the total ride count, computed once
q <- quantile(df_classes$cnt, probs = c(0.25, 0.5, 0.75))

# Assign class 1-4 according to the quartile interval cnt falls into:
# [min, Q1) -> 1, [Q1, Q2) -> 2, [Q2, Q3) -> 3, [Q3, max] -> 4
df_classes$classes <- cut(df_classes$cnt,
                          breaks = c(-Inf, q, Inf),
                          labels = 1:4,
                          right = FALSE)


Classes are assigned to the total number of rides based on its quartiles, dividing it into four classes: class 1 is the bottom 25%, class 2 spans 25% to 50%, class 3 spans 50% to 75%, and class 4 is the top 25% of the number of rides.
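
An equivalent shortcut is dplyr::ntile(), which produces four near-equal groups directly, though it breaks ties slightly differently from the quantile thresholds above; a sketch (classes_ntile is a hypothetical name):

# Quartile binning in one call; differs from the cut() above only at tied values
df_classes$classes_ntile <- factor(ntile(df_classes$cnt, 4))
table(df_classes$classes_ntile)  # four groups of roughly 17379/4 rows each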
Count of Classes

#Plot bar chart for count of classes
classes_aggregate_data <- df_classes %>%
  group_by(classes)%>%
  summarize(count = n())

ggplot(classes_aggregate_data, aes(x = classes, y = count, fill = classes)) +
  geom_col(position = "dodge", width = 0.5) +
  labs(x = "Classes", y = "Count", fill = "Class") +
  ggtitle("Count of classes")

classes_aggregate_data2 <- df_classes %>%
  group_by(season, classes) %>%
  summarize(count = n()) %>%
  mutate(classes = factor(classes),
         season=factor(season))
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.


The plot shows the count of each class. The classes are roughly uniform in size, as expected from quartile-based binning.
Count of Classes by Season

#Plot bar chart for count of classes by season
ggplot(classes_aggregate_data2, aes(x = interaction(classes, season), y = count, fill = classes)) +
  geom_col(position = "dodge", width = 0.5) +
  labs(x = "Classes and Season", y = "Count", fill = "Class") +
  scale_fill_manual(values = c("1" = "blue", "2" = "red", "3" = "green", "4" = "orange")) +
  ggtitle("Count of classes by season")


The plot shows the count of classes by season. Classes 3 and 4 are clearly less frequent in spring than in the other three seasons.
Average Feeling Temperature by Class

#Plot bar chart of average feeling temperature (atemp) by class
classes_aggregate_data3 <- 
  df_classes %>%
  group_by(classes) %>%
  summarize(atemp = mean(atemp))

ggplot(classes_aggregate_data3, aes(x = classes, y = atemp, fill = classes)) +
    geom_col(position = "dodge", width = 0.5) +
    labs(x = "Classes", y = "Atemp", fill = "Class") +
    scale_fill_manual(values = c("1" = "blue", "2" = "red", "3" = "green", "4" = "orange")) +
    ggtitle("Temperature of classes")


The plot shows the relationship between class and average feeling temperature (atemp). The average temperature is highest in class 4, suggesting that high-demand hours tend to be warmer.

#Measure the correlation between all numeric variables:

df_numeric <- select_if(df, is.numeric)
data_corr <- cor(df_numeric)
data_corr
##                 instant       season           yr         mnth           hr
## instant     1.000000000  0.404045721  0.866014049  0.489163831 -0.004774815
## season      0.404045721  1.000000000 -0.010742486  0.830385892 -0.006116901
## yr          0.866014049 -0.010742486  1.000000000 -0.010472929 -0.003867005
## mnth        0.489163831  0.830385892 -0.010472929  1.000000000 -0.005771909
## hr         -0.004774815 -0.006116901 -0.003867005 -0.005771909  1.000000000
## holiday     0.014723494 -0.009584526  0.006691617  0.018430325  0.000479136
## weekday     0.001356820 -0.002335350 -0.004484851  0.010400061 -0.003497739
## workingday -0.003415559  0.013743102 -0.002196005 -0.003476922  0.002284998
## weathersit -0.014197603 -0.014523552 -0.019156853  0.005399522 -0.020202528
## temp        0.136178007  0.312025237  0.040913380  0.201691494  0.137603494
## atemp       0.137614610  0.319379811  0.039221595  0.208096131  0.133749965
## hum         0.009576774  0.150624745 -0.083546421  0.164411443 -0.276497828
## windspeed  -0.074504540 -0.149772751 -0.008739533 -0.135386323  0.137251568
## casual      0.158295401  0.120206447  0.142778528  0.068457301  0.301201730
## registered  0.282045777  0.174225633  0.253684310  0.122272967  0.374140710
## cnt         0.278378694  0.178055731  0.250494899  0.120637760  0.394071498
##                 holiday      weekday   workingday   weathersit         temp
## instant     0.014723494  0.001356820 -0.003415559 -0.014197603  0.136178007
## season     -0.009584526 -0.002335350  0.013743102 -0.014523552  0.312025237
## yr          0.006691617 -0.004484851 -0.002196005 -0.019156853  0.040913380
## mnth        0.018430325  0.010400061 -0.003476922  0.005399522  0.201691494
## hr          0.000479136 -0.003497739  0.002284998 -0.020202528  0.137603494
## holiday     1.000000000 -0.102087791 -0.252471370 -0.017036113 -0.027340477
## weekday    -0.102087791  1.000000000  0.035955071  0.003310740 -0.001794927
## workingday -0.252471370  0.035955071  1.000000000  0.044672224  0.055390317
## weathersit -0.017036113  0.003310740  0.044672224  1.000000000 -0.102639936
## temp       -0.027340477 -0.001794927  0.055390317 -0.102639936  1.000000000
## atemp      -0.030972737 -0.008820945  0.054667235 -0.105563108  0.987672139
## hum        -0.010588465 -0.037158268  0.015687512  0.418130329 -0.069881391
## windspeed   0.003987632  0.011501545 -0.011829789  0.026225652 -0.023125262
## casual      0.031563628  0.032721415 -0.300942486 -0.152627885  0.459615646
## registered -0.047345424  0.021577888  0.134325791 -0.120965520  0.335360849
## cnt        -0.030927303  0.026899860  0.030284368 -0.142426138  0.404772276
##                   atemp          hum    windspeed      casual  registered
## instant     0.137614610  0.009576774 -0.074504540  0.15829540  0.28204578
## season      0.319379811  0.150624745 -0.149772751  0.12020645  0.17422563
## yr          0.039221595 -0.083546421 -0.008739533  0.14277853  0.25368431
## mnth        0.208096131  0.164411443 -0.135386323  0.06845730  0.12227297
## hr          0.133749965 -0.276497828  0.137251568  0.30120173  0.37414071
## holiday    -0.030972737 -0.010588465  0.003987632  0.03156363 -0.04734542
## weekday    -0.008820945 -0.037158268  0.011501545  0.03272142  0.02157789
## workingday  0.054667235  0.015687512 -0.011829789 -0.30094249  0.13432579
## weathersit -0.105563108  0.418130329  0.026225652 -0.15262788 -0.12096552
## temp        0.987672139 -0.069881391 -0.023125262  0.45961565  0.33536085
## atemp       1.000000000 -0.051917696 -0.062336043  0.45408007  0.33255864
## hum        -0.051917696  1.000000000 -0.290104895 -0.34702809 -0.27393312
## windspeed  -0.062336043 -0.290104895  1.000000000  0.09028678  0.08232085
## casual      0.454080065 -0.347028093  0.090286775  1.00000000  0.50661770
## registered  0.332558635 -0.273933118  0.082320847  0.50661770  1.00000000
## cnt         0.400929304 -0.322910741  0.093233784  0.69456408  0.97215073
##                    cnt
## instant     0.27837869
## season      0.17805573
## yr          0.25049490
## mnth        0.12063776
## hr          0.39407150
## holiday    -0.03092730
## weekday     0.02689986
## workingday  0.03028437
## weathersit -0.14242614
## temp        0.40477228
## atemp       0.40092930
## hum        -0.32291074
## windspeed   0.09323378
## casual      0.69456408
## registered  0.97215073
## cnt         1.00000000
corrplot(data_corr, method="color")
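
To read the matrix more easily, the predictors can be ranked by the absolute value of their correlation with the target; a small sketch:

# Rank features by absolute correlation with cnt
sort(abs(data_corr[, "cnt"]), decreasing = TRUE)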

Data Preparation

#Check for missing values
sum(is.na(df))
## [1] 0
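If any missing values were present, a per-column breakdown would locate them; a one-line sketch:

colSums(is.na(df))  # count of NAs in each column (all zero here)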
# The dteday, instant, casual and registered columns are dropped at the start of Modelling below.

Modelling

Regression

Import all the required libraries for modelling

library("tidyverse")
library("here")
library("skimr")
library("janitor")
library("dplyr")
library("Tmisc")
library(ggplot2)
library(caret)
library(purrr)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(e1071)
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(rpart)
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice

Create the modelling dataset by excluding unused variables

#Create new dataset excluding dteday, instant, casual and registered variables
df<-subset(df,select=-c(dteday,instant,casual,registered))
head(df,5)
## # A tibble: 5 × 13
##   season    yr  mnth    hr holiday weekday workingday weathersit  temp atemp
##    <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl> <dbl> <dbl>
## 1      1     0     1     0       0       6          0          1  0.24 0.288
## 2      1     0     1     1       0       6          0          1  0.22 0.273
## 3      1     0     1     2       0       6          0          1  0.22 0.273
## 4      1     0     1     3       0       6          0          1  0.24 0.288
## 5      1     0     1     4       0       6          0          1  0.24 0.288
## # ℹ 3 more variables: hum <dbl>, windspeed <dbl>, cnt <dbl>
str(df)
## tibble [17,379 × 13] (S3: tbl_df/tbl/data.frame)
##  $ season    : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ cnt       : num [1:17379] 16 40 32 13 1 1 2 3 8 14 ...

Split the dataset into 80% training data and 20% testing data

set.seed(123)  
trainIndex <- createDataPartition(df$cnt, p = 0.8, list = FALSE)
train <- df[trainIndex,]
test <- df[-trainIndex,]

x_train <- subset(train, select = -cnt)
y_train <-train$cnt
x_test <- subset(test, select = -cnt)
y_test <- test$cnt
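
A quick check that the partition matches the intended proportions (createDataPartition stratifies on cnt, so the split is approximate):

nrow(train) / nrow(df)  # ~0.80
nrow(test) / nrow(df)   # ~0.20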

Define three functions to calculate RMSE, R-squared and MAE

RMSE_func <- function(actual, prediction){
  sqrt(mean((actual - prediction)^2))
}

RSquare_func <- function(actual, prediction){
  tss <- sum((actual - mean(actual)) ^ 2)
  rss <- sum((actual - prediction) ^ 2)
  1 - (rss / tss)  # return R-squared directly
}

MAE_func <- function(actual, prediction){
  mean(abs(actual - prediction))
}
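
A sanity check of the three metrics on toy vectors:

RMSE_func(c(1, 2, 3), c(1, 2, 3))     # 0 for perfect predictions
RSquare_func(c(1, 2, 3), c(1, 2, 3))  # 1 for perfect predictions
MAE_func(c(1, 2, 3), c(2, 2, 2))      # 2/3: mean of |1-2|, |2-2|, |3-2|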

Define a data frame to store the performance of every model

model_performance_df <- data.frame(
  Model = c("Linear Regression", "Random Forest", "SVM", "Decision Tree", "XG Boost"),
  RMSE = numeric(5),
  RSquare = numeric(5),
  MAE = numeric(5),
  Best_Params = 'NAN',  # placeholder string, overwritten with bestTune after tuning
  Random_HT_RMSE = numeric(5),
  Random_HT_RSquare = numeric(5),
  Random_HT_MAE = numeric(5)
)

Define the cross-validation control and tune length for all machine learning models

control <- trainControl(method="cv", number=5, search = "random")
tuneLgth <- 3

Linear Regression

# Train the model ('.' expands to all columns of x_train; y_train is found in the calling environment)
lr_model <- lm(y_train ~ ., data = x_train)

# Predict on the held-out test data
predictions <- predict(lr_model, newdata = x_test)

# Store the performance into dataframe
model_performance_df$RMSE[1] <- RMSE_func(y_test, predictions)
model_performance_df$RSquare[1] <- RSquare_func(y_test, predictions)
model_performance_df$MAE[1] <- MAE_func(y_test, predictions)

# Hyperparameter tuning with 5-fold cross-validation over the supplied grid
tuning_params <- expand.grid(intercept = c(TRUE, FALSE))
cv_lr_model <- train(x = x_train, y = y_train, method = "lm", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
predictions <- predict(cv_lr_model$finalModel, newdata = x_test)

# Store the performance into dataframe
model_performance_df$Random_HT_RMSE[1] <- RMSE_func(y_test, predictions)
model_performance_df$Random_HT_RSquare[1] <- RSquare_func(y_test, predictions)
model_performance_df$Random_HT_MAE[1] <- MAE_func(y_test, predictions)
model_performance_df$Best_Params[1] <- cv_lr_model$bestTune
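
As a usage note: predicting through the caret wrapper rather than finalModel lets caret apply any preprocessing consistently; a sketch (predictions_cv is a hypothetical name):

# Should be equivalent here, since lm uses no preprocessing, but safer in general
predictions_cv <- predict(cv_lr_model, newdata = x_test)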

Random Forest

# # Train the model
# rf_model <- randomForest(y_train~ ., data = x_train)
# 
# # Tested with X_test data
# predictions <- predict(rf_model, newdata = x_test)
# 
# # Store the performance into dataframe
# model_performance_df$RMSE[2] <- RMSE_func(y_test, predictions)
# model_performance_df$RSquare[2] <- RSquare_func(y_test, predictions)
# model_performance_df$MAE[2] <- MAE_func(y_test, predictions)
# 
# # Random Search Hyperparameter Tuning
# tuning_params <- expand.grid(mtry = seq(2,5))
# cv_rf_model <- train(x = x_train, y = y_train, method = "rf", tuneLength = tuneLgth ,trControl = control, tuneGrid = tuning_params)
# predictions <- predict(cv_rf_model$finalModel, newdata = x_test)
# 
# # Store the performance into dataframe
# model_performance_df$Random_HT_RMSE[2] <- RMSE_func(y_test, predictions)
# model_performance_df$Random_HT_RSquare[2] <- RSquare_func(y_test, predictions)
# model_performance_df$Random_HT_MAE[2] <- MAE_func(y_test, predictions)
# model_performance_df$Best_Params[2] <- cv_rf_model$bestTune

SVM

# # Train the model
# svm_model <- svm(y_train~ ., data = x_train)
# 
# # Tested with X_test data
# predictions <- predict(svm_model, newdata = x_test)
# 
# # Store the performance into dataframe
# model_performance_df$RMSE[3] <- RMSE_func(y_test, predictions)
# model_performance_df$RSquare[3] <- RSquare_func(y_test, predictions)
# model_performance_df$MAE[3] <- MAE_func(y_test, predictions)
# 
# # Random Search Hyperparameter Tuning
# tuning_params<- expand.grid(sigma = seq(0.1,0.3,by=0.1), C = seq(1,3))
# cv_svm_model <- train(x = x_train, y = y_train, method = "svmRadial", tuneLength = tuneLgth, trControl = control , tuneGrid = tuning_params)
# predictions <- predict(cv_svm_model$finalModel, newdata = x_test)
# 
# # Store the performance into dataframe
# model_performance_df$Random_HT_RMSE[3] <- RMSE_func(y_test, predictions)
# model_performance_df$Random_HT_RSquare[3] <- RSquare_func(y_test, predictions)
# model_performance_df$Random_HT_MAE[3] <- MAE_func(y_test, predictions)
# model_performance_df$Best_Params[3] <- cv_svm_model$bestTune

Decision Tree

# # Train the model
# dt_model <- rpart(y_train~ ., data = x_train)
# 
# # Tested with X_test data
# predictions <- predict(dt_model, x_test)
# 
# # Store the performance into dataframe
# model_performance_df$RMSE[4] <- RMSE_func(y_test, predictions)
# model_performance_df$RSquare[4] <- RSquare_func(y_test, predictions)
# model_performance_df$MAE[4] <- MAE_func(y_test, predictions)
# 
# # Random Search Hyperparameter Tuning
# tuning_params <- expand.grid(cp = seq(0.005, 0.01, by = 0.001))
# cv_dt_model <- train(x = x_train, y = y_train, method = "rpart", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
# predictions <- predict(cv_dt_model$finalModel, newdata = x_test)
# 
# # Store the performance into dataframe
# model_performance_df$Random_HT_RMSE[4] <- RMSE_func(y_test, predictions)
# model_performance_df$Random_HT_RSquare[4] <- RSquare_func(y_test, predictions)
# model_performance_df$Random_HT_MAE[4] <- MAE_func(y_test, predictions)
# model_performance_df$Best_Params[4] <- cv_dt_model$bestTune

XG Boost

# # Convert the training and testing data to DMatrix
# data_test <- xgb.DMatrix(data = as.matrix(x_test))
# data_train <- xgb.DMatrix(data = as.matrix(x_train), label = y_train)
# 
# # Train the model
# xgb_model <- xgb.train(data = data_train, nrounds = 100)
# 
# # Tested with X_test data
# predictions <- predict(xgb_model, newdata = data_test)
# 
# # Store the performance into dataframe
# model_performance_df$RMSE[5] <- RMSE_func(y_test, predictions)
# model_performance_df$RSquare[5] <- RSquare_func(y_test, predictions)
# model_performance_df$MAE[5] <- MAE_func(y_test, predictions)
# 
# # Define the random search hyperparameter tuning parameters 
# tuning_params <- expand.grid(
#   nrounds = c(50, 100),
#   max_depth = c(3, 6),
#   eta = c(0.05, 0.1),
#   gamma = c(0.05, 0.1),
#   colsample_bytree = c(0.6, 0.8),
#   min_child_weight = c(1, 3),
#   subsample=c(0.6,0.8)
# )
# 
# # Random Search Hyperparameter Tuning
# cv_xgb_model <- train(x = x_train, y = y_train, method = "xgbTree", tuneLength = tuneLgth, trControl = control, tuneGrid = tuning_params)
# predictions <- predict(cv_xgb_model$finalModel, newdata = data_test)
# 
# # Store the performance into dataframe
# model_performance_df$Random_HT_RMSE[5] <- RMSE_func(y_test, predictions)
# model_performance_df$Random_HT_RSquare[5] <- RSquare_func(y_test, predictions)
# model_performance_df$Random_HT_MAE[5] <- MAE_func(y_test, predictions)
# model_performance_df$Best_Params[5] <- cv_xgb_model$bestTune

Classification

SVM

# # Support Vector Machine Model
# # Note: for classification the target should be the factor df_classes$classes
# # (built earlier), not the numeric cnt used in the regression split above
# # Train the model
# svm_model <- svm(x = x_train, y = y_train, method="class")
# 
# # Make predictions with the x_test dataset
# predictions <- predict(svm_model, newdata = x_test)
# 
# # Evaluate the model
# confusionMatrix(predictions, y_test)
# table(predictions, y_test)

Random Forest

# # Random Forest Model
# # Train the model
# rf_model <- randomForest(y_train ~ ., data = x_train, ntree=100,method = 'class')
# 
# # Make predictions with the x_test dataset
# predictions <- predict(rf_model, newdata = x_test)
# head(predictions)
# 
# # Evaluate the model
# confusionMatrix(predictions, y_test)
# table(predictions, y_test)
# 
# # Identify the importance features
# importance(rf_model)
# 
# randomForest::varImpPlot(rf_model,
#                          sort=TRUE,
#                          main="Importance Feature Ranking")

Decision Tree

# # Decision Tree Model
# # Train the model
# dt_model <- rpart(y_train ~ ., data = x_train, method = 'class')
# 
# # Make predictions with the x_test dataset
# predictions <- predict(dt_model, newdata = x_test, type = "class")
# 
# # Evaluate the model
# confusionMatrix(predictions, y_test)
# table(predictions, y_test)

Evaluation

# Display the performance of each model before and after tuning
model_performance_df %>%
kable() %>%
kable_styling()
Model             RMSE     RSquare   MAE      Best_Params Random_HT_RMSE Random_HT_RSquare Random_HT_MAE
Linear Regression 138.3324 0.4072316 103.2388 TRUE        138.3324       0.4072316         103.2388
Random Forest     0.0000   0.0000000 0.0000   NAN         0.0000         0.0000000         0.0000
SVM               0.0000   0.0000000 0.0000   NAN         0.0000         0.0000000         0.0000
Decision Tree     0.0000   0.0000000 0.0000   NAN         0.0000         0.0000000         0.0000
XG Boost          0.0000   0.0000000 0.0000   NAN         0.0000         0.0000000         0.0000

The all-zero rows correspond to the models whose code is commented out above; only Linear Regression was run in this knit. Its tuned and untuned metrics coincide because the selected hyperparameter (intercept = TRUE) matches the default fit.

Deployment

[Shiny App]