1. Introduction

1.1. Greetings

Hi !! Welcome to my Rmd for data visualization. This time, I am looking dataset from external source (kaggle). Please enjoy it.

1.2. Briefing

Love Bike is the middle level company, which working at rental bike bussiness. The company give two kind service, such as : - hour rental bike - daily rental bike

The company hope to develop their business. So the management want to analize data : - What time the company get the highest visitors / customers (hour)? - What weekdays the company get the highest visitors / customers (day)? - Is the company need to focus adding the employee on time above ? - Is the company need do other improvement?

2. Data Set Up

This is set up library

library(lubridate)
library(ggplot2)
library(tidyverse)
library(ggpubr)
library(scales)
library(glue)
library(plotly)

3. Data Pre - Processing

3.1. Read Data

rent_day <- read.csv("data_input/day.csv")
rent_hour <- read.csv("data_input/hour.csv")
#check data
#head(rent_day)
tail(rent_day)
#check data
#head(rent_hour)
tail(rent_hour)
# check data

dim(rent_hour)
#> [1] 17379    17
names(rent_hour)
#>  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
#>  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
#> [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
#> [16] "registered" "cnt"
dim(rent_day)
#> [1] 731  16
names(rent_day)
#>  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
#>  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
#> [11] "atemp"      "hum"        "windspeed"  "casual"     "registered"
#> [16] "cnt"

========================================= Dataset characteristics =========================================
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

- instant       : record index
- dteday        : date
- season        : season (1:springer, 2:summer, 3:fall, 4:winter)
- yr            : year (0: 2011, 1:2012)
- mnth          : month ( 1 to 12)
- hr            : hour (0 to 23)
- holiday       : weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday       : day of the week
- workingday    : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit    : 
    + 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    + 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    + 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    + 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp          : Normalized temperature in Celsius. The values are divided to 41 (max)
- atemp         : Normalized feeling temperature in Celsius. The values are divided to 50 (max)
- hum           : Normalized humidity. The values are divided to 100 (max)
- windspeed     : Normalized wind speed. The values are divided to 67 (max)
- casual        : count of casual users
- registered    : count of registered users
- cnt           : count of total rental bikes including both casual and registered

3.2. Check Data Type

Check data type for each column using the str() function.

The str() function will do a sanity check on the structure and show sample data for each variable.

rent_hour$dteday <- ymd(rent_hour$dteday)
rent_day$dteday <- ymd(rent_day$dteday)
str(rent_hour)
#> 'data.frame':    17379 obs. of  17 variables:
#>  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ dteday    : Date, format: "2011-01-01" "2011-01-01" ...
#>  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
#>  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
#>  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
#>  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
#>  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
#>  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
#>  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
#>  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
#>  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
#>  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
rent_hour$yr <- as.factor(rent_hour$yr)
rent_hour$holiday <- as.factor(rent_hour$holiday)
rent_hour$workingday <- as.factor(rent_hour$workingday)
str(rent_hour)
#> 'data.frame':    17379 obs. of  17 variables:
#>  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ dteday    : Date, format: "2011-01-01" "2011-01-01" ...
#>  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ yr        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
#>  $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
#>  $ workingday: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
#>  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
#>  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
#>  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
#>  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
#>  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
#>  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
#>  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
str(rent_day)
#> 'data.frame':    731 obs. of  16 variables:
#>  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ dteday    : Date, format: "2011-01-01" "2011-01-02" ...
#>  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
#>  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
#>  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
#>  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
#>  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
#>  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
#>  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
#>  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
#>  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
#>  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
rent_day$yr <- as.factor(rent_day$yr)
rent_day$holiday <- as.factor(rent_day$holiday)
rent_day$workingday <- as.factor(rent_day$workingday)
str(rent_day)
#> 'data.frame':    731 obs. of  16 variables:
#>  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ dteday    : Date, format: "2011-01-01" "2011-01-02" ...
#>  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ yr        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
#>  $ workingday: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
#>  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
#>  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
#>  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
#>  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
#>  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
#>  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
#>  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
#>  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

3.3. Check for Missing Value

We can use colSums() and anyNA() function to check if there is missing value in our data.

colSums(is.na(rent_hour))
#>    instant     dteday     season         yr       mnth         hr    holiday 
#>          0          0          0          0          0          0          0 
#>    weekday workingday weathersit       temp      atemp        hum  windspeed 
#>          0          0          0          0          0          0          0 
#>     casual registered        cnt 
#>          0          0          0
colSums(is.na(rent_day))
#>    instant     dteday     season         yr       mnth    holiday    weekday 
#>          0          0          0          0          0          0          0 
#> workingday weathersit       temp      atemp        hum  windspeed     casual 
#>          0          0          0          0          0          0          0 
#> registered        cnt 
#>          0          0

colSums(is.na()) is for check how many missing value in our data. If the result is zero (0), so our data have no missing value.

anyNA(rent_hour)
#> [1] FALSE
anyNA(rent_day)
#> [1] FALSE

anyNA() is for check missing value in our data. If the result is “FALSE”, so our data have no missing value / NA.

Now, rent_hour and rent_day dataset is ready to be processed and analyzed.

3.4 Data Explanation

The summary() function will do a sanity check on the structure and show IQR and class data for each variable.

summary(rent_hour)
#>     instant          dteday               season      yr            mnth       
#>  Min.   :    1   Min.   :2011-01-01   Min.   :1.000   0:8645   Min.   : 1.000  
#>  1st Qu.: 4346   1st Qu.:2011-07-04   1st Qu.:2.000   1:8734   1st Qu.: 4.000  
#>  Median : 8690   Median :2012-01-02   Median :3.000            Median : 7.000  
#>  Mean   : 8690   Mean   :2012-01-02   Mean   :2.502            Mean   : 6.538  
#>  3rd Qu.:13034   3rd Qu.:2012-07-02   3rd Qu.:3.000            3rd Qu.:10.000  
#>  Max.   :17379   Max.   :2012-12-31   Max.   :4.000            Max.   :12.000  
#>        hr        holiday      weekday      workingday   weathersit   
#>  Min.   : 0.00   0:16879   Min.   :0.000   0: 5514    Min.   :1.000  
#>  1st Qu.: 6.00   1:  500   1st Qu.:1.000   1:11865    1st Qu.:1.000  
#>  Median :12.00             Median :3.000              Median :1.000  
#>  Mean   :11.55             Mean   :3.004              Mean   :1.425  
#>  3rd Qu.:18.00             3rd Qu.:5.000              3rd Qu.:2.000  
#>  Max.   :23.00             Max.   :6.000              Max.   :4.000  
#>       temp           atemp             hum           windspeed     
#>  Min.   :0.020   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.340   1st Qu.:0.3333   1st Qu.:0.4800   1st Qu.:0.1045  
#>  Median :0.500   Median :0.4848   Median :0.6300   Median :0.1940  
#>  Mean   :0.497   Mean   :0.4758   Mean   :0.6272   Mean   :0.1901  
#>  3rd Qu.:0.660   3rd Qu.:0.6212   3rd Qu.:0.7800   3rd Qu.:0.2537  
#>  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :0.8507  
#>      casual         registered         cnt       
#>  Min.   :  0.00   Min.   :  0.0   Min.   :  1.0  
#>  1st Qu.:  4.00   1st Qu.: 34.0   1st Qu.: 40.0  
#>  Median : 17.00   Median :115.0   Median :142.0  
#>  Mean   : 35.68   Mean   :153.8   Mean   :189.5  
#>  3rd Qu.: 48.00   3rd Qu.:220.0   3rd Qu.:281.0  
#>  Max.   :367.00   Max.   :886.0   Max.   :977.0
summary(rent_day)
#>     instant          dteday               season      yr           mnth      
#>  Min.   :  1.0   Min.   :2011-01-01   Min.   :1.000   0:365   Min.   : 1.00  
#>  1st Qu.:183.5   1st Qu.:2011-07-02   1st Qu.:2.000   1:366   1st Qu.: 4.00  
#>  Median :366.0   Median :2012-01-01   Median :3.000           Median : 7.00  
#>  Mean   :366.0   Mean   :2012-01-01   Mean   :2.497           Mean   : 6.52  
#>  3rd Qu.:548.5   3rd Qu.:2012-07-01   3rd Qu.:3.000           3rd Qu.:10.00  
#>  Max.   :731.0   Max.   :2012-12-31   Max.   :4.000           Max.   :12.00  
#>  holiday    weekday      workingday   weathersit         temp        
#>  0:710   Min.   :0.000   0:231      Min.   :1.000   Min.   :0.05913  
#>  1: 21   1st Qu.:1.000   1:500      1st Qu.:1.000   1st Qu.:0.33708  
#>          Median :3.000              Median :1.000   Median :0.49833  
#>          Mean   :2.997              Mean   :1.395   Mean   :0.49538  
#>          3rd Qu.:5.000              3rd Qu.:2.000   3rd Qu.:0.65542  
#>          Max.   :6.000              Max.   :3.000   Max.   :0.86167  
#>      atemp              hum           windspeed           casual      
#>  Min.   :0.07907   Min.   :0.0000   Min.   :0.02239   Min.   :   2.0  
#>  1st Qu.:0.33784   1st Qu.:0.5200   1st Qu.:0.13495   1st Qu.: 315.5  
#>  Median :0.48673   Median :0.6267   Median :0.18097   Median : 713.0  
#>  Mean   :0.47435   Mean   :0.6279   Mean   :0.19049   Mean   : 848.2  
#>  3rd Qu.:0.60860   3rd Qu.:0.7302   3rd Qu.:0.23321   3rd Qu.:1096.0  
#>  Max.   :0.84090   Max.   :0.9725   Max.   :0.50746   Max.   :3410.0  
#>    registered        cnt      
#>  Min.   :  20   Min.   :  22  
#>  1st Qu.:2497   1st Qu.:3152  
#>  Median :3662   Median :4548  
#>  Mean   :3656   Mean   :4504  
#>  3rd Qu.:4776   3rd Qu.:5956  
#>  Max.   :6946   Max.   :8714

It works well, but a more complete function is the skim() function from the “skimr” package. It breaks down the variables by type with relevant summary information, PLUS a small histogram for each numeric variable.

library(skimr)
skim(rent_hour)
Data summary
Name rent_hour
Number of rows 17379
Number of columns 17
_______________________
Column type frequency:
Date 1
factor 3
numeric 13
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
dteday 0 1 2011-01-01 2012-12-31 2012-01-02 731

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
yr 0 1 FALSE 2 1: 8734, 0: 8645
holiday 0 1 FALSE 2 0: 16879, 1: 500
workingday 0 1 FALSE 2 1: 11865, 0: 5514

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
instant 0 1 8690.00 5017.03 1.00 4345.50 8690.00 13034.50 17379.00 ▇▇▇▇▇
season 0 1 2.50 1.11 1.00 2.00 3.00 3.00 4.00 ▇▇▁▇▇
mnth 0 1 6.54 3.44 1.00 4.00 7.00 10.00 12.00 ▇▆▆▅▇
hr 0 1 11.55 6.91 0.00 6.00 12.00 18.00 23.00 ▇▇▆▇▇
weekday 0 1 3.00 2.01 0.00 1.00 3.00 5.00 6.00 ▇▃▃▃▇
weathersit 0 1 1.43 0.64 1.00 1.00 1.00 2.00 4.00 ▇▃▁▁▁
temp 0 1 0.50 0.19 0.02 0.34 0.50 0.66 1.00 ▂▇▇▇▁
atemp 0 1 0.48 0.17 0.00 0.33 0.48 0.62 1.00 ▁▆▇▆▁
hum 0 1 0.63 0.19 0.00 0.48 0.63 0.78 1.00 ▁▃▇▇▆
windspeed 0 1 0.19 0.12 0.00 0.10 0.19 0.25 0.85 ▇▆▂▁▁
casual 0 1 35.68 49.31 0.00 4.00 17.00 48.00 367.00 ▇▁▁▁▁
registered 0 1 153.79 151.36 0.00 34.00 115.00 220.00 886.00 ▇▃▁▁▁
cnt 0 1 189.46 181.39 1.00 40.00 142.00 281.00 977.00 ▇▃▁▁▁
skim(rent_day)
Data summary
Name rent_day
Number of rows 731
Number of columns 16
_______________________
Column type frequency:
Date 1
factor 3
numeric 12
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
dteday 0 1 2011-01-01 2012-12-31 2012-01-01 731

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
yr 0 1 FALSE 2 1: 366, 0: 365
holiday 0 1 FALSE 2 0: 710, 1: 21
workingday 0 1 FALSE 2 1: 500, 0: 231

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
instant 0 1 366.00 211.17 1.00 183.50 366.00 548.50 731.00 ▇▇▇▇▇
season 0 1 2.50 1.11 1.00 2.00 3.00 3.00 4.00 ▇▇▁▇▇
mnth 0 1 6.52 3.45 1.00 4.00 7.00 10.00 12.00 ▇▅▅▅▇
weekday 0 1 3.00 2.00 0.00 1.00 3.00 5.00 6.00 ▇▃▃▃▇
weathersit 0 1 1.40 0.54 1.00 1.00 1.00 2.00 3.00 ▇▁▅▁▁
temp 0 1 0.50 0.18 0.06 0.34 0.50 0.66 0.86 ▂▇▇▇▅
atemp 0 1 0.47 0.16 0.08 0.34 0.49 0.61 0.84 ▂▇▆▇▂
hum 0 1 0.63 0.14 0.00 0.52 0.63 0.73 0.97 ▁▁▆▇▂
windspeed 0 1 0.19 0.08 0.02 0.13 0.18 0.23 0.51 ▃▇▅▁▁
casual 0 1 848.18 686.62 2.00 315.50 713.00 1096.00 3410.00 ▇▆▂▁▁
registered 0 1 3656.17 1560.26 20.00 2497.00 3662.00 4776.50 6946.00 ▂▅▇▅▃
cnt 0 1 4504.35 1937.21 22.00 3152.00 4548.00 5956.00 8714.00 ▂▅▇▅▃

Summary : A. data :rent_hour 1. casual => sd value = 49.30 and mean = 35.70 2. registered => sd value = 151.00 and mean = 154.00 3. cnt => sd value = 181.00 and mean = 189.00

B. data :rent_day 1. casual => sd value = 687.00 and mean = 848.00 2. registered => sd value = 1560.00 and mean = 3656.00 3. cnt => sd value = 1937.00 and mean = 4504.00

3.5. Check the Outlier

  boxplot(rent_hour$casual, rent_hour$registered, rent_hour$cnt , main="Boxplot for Hour Rental Bike (continuos var)")

boxplot(rent_day$casual, rent_day$registered, rent_day$cnt , main="Boxplot for Daily Rental Bike (continuos var)")

From result above, we find posibilities for the outliers, but from our calculation, sd value in cnt is not far above mean value (in my oppinion its still be tolerated), so the process may continue.

4. Data Processing and Plotting

  1. Please add column with year and weekdays for data rent_hour and rent_day!
rent_hour$year <- year(rent_hour$dteday)
rent_hour$weekdays <- wday(rent_hour$dteday, # data
                          label = T, # nama (T)/digit (F)
                          abbr = F, # disingkat/tidak
                          week_start = 1, # minggu dimulai di hari ke berapa (Senin = 1)
                          locale = "IND") # penggantian bahasa

head(rent_hour)
rent_day$year <- year(rent_day$dteday)
rent_day$weekdays <- wday(rent_day$dteday, # data
                          label = T, # nama (T)/digit (F)
                          abbr = F, # disingkat/tidak
                          week_start = 1, # minggu dimulai di hari ke berapa (Senin = 1)
                          locale = "IND") # penggantian bahasa

head(rent_day)
  1. Analize weekdays that highest visitors (use column cnt) on data rent_day! Please make plotting !
wday <- aggregate(cnt ~ year + weekdays,
                        data = rent_day,
                        FUN = mean)
wday
#Plotting
g1 <- ggplot(data = wday, mapping = aes(x = cnt, y =reorder(weekdays,cnt) ),
             text = glue("Total Rent Bike by {weekdays}: {(cnt)}")) + 
  geom_col(aes(fill = year), position = "dodge") +
  labs(x = "Weekdays", y = NULL,
       fill = NULL,
       title = "Top Visitors Rental Bike by Weekdays") +
  facet_wrap(~year, 
             scales = "free_y")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1))

g1

ggplotly(g1, tooltip = "text")

Insight : - On the year 2011, “Love Bike” got highest visitors on Jumat (3500.115), Selasa (3468.038) and Senin (3465.788) - On the year 2012, “Love Bike” got highest visitors on Kamis (5977.750), Jumat (5880.462) dan Rabu (5843.827)

  1. Analize the hour highest visitors (use column cnt) on data rent_hour! Please make plotting !
hhour <- aggregate(cnt ~ year + hr,
                        data = rent_hour,
                        FUN = mean)
hhour
# Plotting

g2 <- ggplot(data = hhour, mapping = aes(x = hr, y = cnt)) + 
  geom_line(aes(group = year,
                col = year)) + 
  geom_point(aes(col = year)) + 
  labs(x = "Hours", y = NULL,
       fill = NULL,
       title = "Top Visitors Rental Bike by Hours") +
  scale_x_continuous(breaks = seq(0,23,1)) +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1))

g2

Insight : - On the year 2011, “Love Bike” got highest visitors at hour : 17:00 (349.676712), 18:00 (322.272727) and 08:00 (262.994490)

  • On the year 2012, “Love Bike” got highest visitors at hour : 17:00 (573.227397), 18:00 (528.183562) dan 08:00 (454.763736)

5. Recomendation

Based on the plotting on above, we can recommend : 1) “Love Bike” can increase the number of employees to working on day : Kamis, Jumat and Rabu at hour : 08:00, 17:00 and 18:00.

  1. The company can do maintenance/manual checking of Bikes at the other day of the high visitors days on above. With doing this periodically, the company can give the best service and increase customer satisfaction.

We hope that the analize data can help the improvement of both income and service at “Love Bike” company.