1 Introduction

Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, is automated. Users can rent a bike at one location and return it at another. These systems have drawn attention for their role in easing traffic congestion, reducing environmental impact, and promoting public health.

Beyond these real-world applications, the data generated by bike-sharing systems has characteristics that make it attractive for research. Because travel duration, departure and arrival positions, and the total number of rented bikes are all recorded, these systems act as a virtual sensor network that gives insight into urban mobility. Monitoring this data is therefore expected to reveal significant events in the city.

Capital Bikeshare operates over 4300 bikes at 500 locations across 7 jurisdictions. The network gives residents and visitors a convenient, enjoyable, and cost-effective way to get from point A to point B, whether for daily commutes, errands, appointments, or social engagements.

2 Data Pre-Processing

# load libraries
library(dplyr)
library(data.table)
library(tidyr)
library(lubridate)
library(GGally)

# read dataset
bike <- read.csv("data_input/day.csv")
head(bike)

1️⃣ Selecting Columns

# copying dataset
prepro <- copy(bike)

# parse dteday as a Date and use it as row names (these label the points in the biplot)
prepro$dteday <- ymd(prepro$dteday)
rownames(prepro) <- prepro$dteday

# keep only the numeric weather columns
prepro <- prepro %>% select(temp, atemp, hum, windspeed)

# checking NA values
colSums(is.na(prepro))

#>      temp     atemp       hum windspeed 
#>         0         0         0         0

# checking the value range of each column
summary(prepro)

#>       temp             atemp              hum           windspeed      
#>  Min.   :0.05913   Min.   :0.07907   Min.   :0.0000   Min.   :0.02239  
#>  1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200   1st Qu.:0.13495  
#>  Median :0.49833   Median :0.48673   Median :0.6267   Median :0.18097  
#>  Mean   :0.49538   Mean   :0.47435   Mean   :0.6279   Mean   :0.19049  
#>  3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302   3rd Qu.:0.23321  
#>  Max.   :0.86167   Max.   :0.84090   Max.   :0.9725   Max.   :0.50746

2️⃣ Checking Variances

var(prepro)

#>                   temp        atemp          hum    windspeed
#> temp       0.033507667  0.029582662  0.003310151 -0.002240605
#> atemp      0.029582662  0.026556346  0.003249181 -0.002319254
#> hum        0.003310151  0.003249181  0.020286047 -0.002742811
#> windspeed -0.002240605 -0.002319254 -0.002742811  0.006005920
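The reason for checking variances before PCA is that `prcomp()` centers the data but does not rescale it unless `scale. = TRUE` is set, so a variable with a much larger variance would dominate the first component. A minimal sketch of that check follows, with the covariance matrix printed above re-entered by hand. The names `cov_mat`, `var_ratio`, and `cor_mat` are illustrative and not part of the original analysis.

```r
vars <- c("temp", "atemp", "hum", "windspeed")

# Covariance matrix copied from the var(prepro) output (rounded as printed)
cov_mat <- matrix(
  c( 0.033507667,  0.029582662,  0.003310151, -0.002240605,
     0.029582662,  0.026556346,  0.003249181, -0.002319254,
     0.003310151,  0.003249181,  0.020286047, -0.002742811,
    -0.002240605, -0.002319254, -0.002742811,  0.006005920),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Ratio of the largest to the smallest variance: how unbalanced the scales are
var_ratio <- max(diag(cov_mat)) / min(diag(cov_mat))
var_ratio

# Correlation matrix: what PCA would decompose if scale. = TRUE were used
cor_mat <- cov2cor(cov_mat)
round(cor_mat, 3)
```

The largest variance is less than six times the smallest, and all four variables lie between 0 and 1 in the summary above, so running PCA without scaling is a defensible choice here. With variables measured in very different units, `scale. = TRUE` would usually be the safer default.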

3 PCA

pca <- prcomp(prepro)

# checking the eigenvectors (loadings: the contribution of each column to each PC)
pca$rotation

#>                   PC1         PC2            PC3          PC4
#> temp       0.73994694  0.10390141 -0.05872877866 -0.661992414
#> atemp      0.65904681  0.07475057  0.00009875324  0.748378005
#> hum        0.11830922 -0.97858880 -0.16830230039 -0.006420051
#> windspeed -0.06433315  0.16118561 -0.98398437817  0.040684000
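Each column of `pca$rotation` is a unit-length eigenvector, so the squared loadings within a column sum to 1 and can be read as each variable's share of that component. The sketch below illustrates this reading with the rotation matrix printed above typed in by hand. The object names `rot`, `share`, and `dominant` are illustrative.

```r
vars <- c("temp", "atemp", "hum", "windspeed")

# Rotation matrix copied from the pca$rotation output (rounded as printed)
rot <- matrix(
  c( 0.73994694,  0.10390141, -0.05872877866, -0.661992414,
     0.65904681,  0.07475057,  0.00009875324,  0.748378005,
     0.11830922, -0.97858880, -0.16830230039, -0.006420051,
    -0.06433315,  0.16118561, -0.98398437817,  0.040684000),
  nrow = 4, byrow = TRUE,
  dimnames = list(vars, paste0("PC", 1:4))
)

# Squared loadings: share of each variable within each component
share <- rot^2
round(share, 3)

# Each column should sum to approximately 1
colSums(share)

# Variable with the largest share in each component
dominant <- vars[apply(share, 2, which.max)]
names(dominant) <- colnames(rot)
dominant
```

On this reading, temp and atemp together account for about 98% of PC1, and hum accounts for about 96% of PC2.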

# checking the variance of each PC by squaring sdev (how much information each PC captures)
pca$sdev^2

#> [1] 0.0605800482 0.0201381752 0.0054032881 0.0002344684
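The values of `pca$sdev^2` are the eigenvalues of the covariance matrix returned by `var(prepro)`, which is why they can be read as the variance captured by each PC. The sketch below checks this by hand using the printed covariance matrix and the printed variances. Both are rounded to the digits shown, so the agreement is only approximate, and `cov_mat`, `eig_vals`, and `printed_var` are illustrative names.

```r
vars <- c("temp", "atemp", "hum", "windspeed")

# Covariance matrix copied from the var(prepro) output
cov_mat <- matrix(
  c( 0.033507667,  0.029582662,  0.003310151, -0.002240605,
     0.029582662,  0.026556346,  0.003249181, -0.002319254,
     0.003310151,  0.003249181,  0.020286047, -0.002742811,
    -0.002240605, -0.002319254, -0.002742811,  0.006005920),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Eigenvalues of the covariance matrix, largest first
eig_vals <- eigen(cov_mat, symmetric = TRUE)$values

# Variances copied from the pca$sdev^2 output
printed_var <- c(0.0605800482, 0.0201381752, 0.0054032881, 0.0002344684)

# Largest disagreement between the two; should be tiny (rounding only)
max(abs(eig_vals - printed_var))

# Total variance is preserved: the eigenvalues sum to the trace of the matrix
c(sum(eig_vals), sum(diag(cov_mat)))
```

Because the total is preserved, dividing each eigenvalue by the sum gives the "Proportion of Variance" row reported by `summary(pca)`.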

summary(pca)

#> Importance of components:
#>                           PC1    PC2     PC3     PC4
#> Standard deviation     0.2461 0.1419 0.07351 0.01531
#> Proportion of Variance 0.7015 0.2332 0.06257 0.00272
#> Cumulative Proportion  0.7015 0.9347 0.99728 1.00000

If we would like to keep more than 90% of the information in our data, we can keep only PC1 and PC2, which together capture 93.47% of the variance.
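The 90% rule can be written as a small helper that returns the smallest number of components whose cumulative proportion of variance reaches a threshold. This is a sketch rather than part of the original analysis: `n_components()` is a hypothetical helper, and the standard deviations are rebuilt from the variances printed earlier instead of being read from the fitted `pca` object.

```r
# Hypothetical helper: smallest number of PCs reaching a variance threshold
n_components <- function(sdev, threshold = 0.9) {
  prop <- sdev^2 / sum(sdev^2)       # proportion of variance per PC
  cum_prop <- cumsum(prop)           # cumulative proportion
  which(cum_prop >= threshold)[1]    # first PC count that reaches the threshold
}

# Standard deviations rebuilt from the printed pca$sdev^2 values
sdev <- sqrt(c(0.0605800482, 0.0201381752, 0.0054032881, 0.0002344684))

k <- n_components(sdev, threshold = 0.9)
k

# Cumulative proportions, for comparison with summary(pca)
round(cumsum(sdev^2) / sum(sdev^2), 4)
```

With the fitted object available, the same call would be `n_components(pca$sdev)`. Raising the threshold to 0.95 would require a third component, since PC1 and PC2 stop at 93.47%.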

# biplot of PC1 vs PC2; scale = 0 plots unscaled scores and the raw loadings
biplot(pca,
       cex = 0.7,
       scale = 0)

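> The observation on 2011-03-10 is an outlier in the biplot. Windspeed is an unlikely cause, since it loads only weakly on PC1 and PC2 (-0.06 and 0.16); humidity dominates PC2 and its minimum in the summary is 0, so an extreme humidity value on that day is the more plausible explanation and is worth confirming in the raw data.
> Humidity contributes most to PC2 (loading -0.98).
> Temp and atemp contribute most to PC1 (loadings 0.74 and 0.66).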
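A biplot shows where a point lies but not which variable put it there, so an outlier is worth tracing back to its raw values. The sketch below does this on made-up data with one planted extreme. The data frame `toy`, its dates, and the planted value are all hypothetical; the same steps could be applied to `prepro` and `pca`.

```r
set.seed(1)

# Hypothetical data: 50 days of three weather-like variables
toy <- data.frame(
  temp      = runif(50, 0.2, 0.8),
  hum       = runif(50, 0.4, 0.8),
  windspeed = runif(50, 0.1, 0.3)
)
rownames(toy) <- as.character(as.Date("2011-01-01") + 0:49)

# Plant one extreme value so there is a known outlier to find
toy["2011-01-20", "hum"] <- 0

pca_toy <- prcomp(toy)

# Distance of each observation from the origin in the PC1-PC2 plane,
# which is the plane a default biplot displays
dist12 <- sqrt(rowSums(pca_toy$x[, 1:2]^2))
outlier <- names(which.max(dist12))
outlier

# Standardized raw values of that row: the largest |z| points to the culprit
z <- scale(toy)[outlier, ]
round(z, 2)
names(which.max(abs(z)))
```

Comparing the standardized values of the flagged row separates a genuinely extreme variable from one that merely points in the same direction on the plot.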

Unsupervised Learning

Taufan Anggoro Adhi

2023-12-01
