The objective of this study is to analyze a dataset from the Vélib system in Paris (a bike sharing system). The data are the loading profiles of the bike stations over one week, collected every hour during the period Sunday 1st Sept. - Sunday 7th Sept., 2014.
The dataset consists of:

- data: the loading profiles (number of available bikes / number of bike docks) of the 1189 stations at 181 time points. A loading value of 1 means that the station is full of bikes, while a value of 0 means that the station has no available bikes.
- position: the longitude and latitude of the 1189 bike stations.
- dates: the download dates.
- bonus: indicates whether the station is on a hill (bonus = 1).
- names: the names of the stations.
Loading the data:
library(kableExtra)
load('C:/Users/parin/Downloads/velib.Rdata')
summary(velib) %>% kable() %>% kable_styling()
|          | Length | Class      | Mode      |
|----------|--------|------------|-----------|
| data     | 181    | data.frame | list      |
| position | 2      | data.frame | list      |
| dates    | 181    | -none-     | character |
| bonus    | 1189   | -none-     | numeric   |
| names    | 1189   | -none-     | character |
pos = velib$position
dat = velib$data
dates = velib$dates
bonus = velib$bonus
name = velib$names
Let us check whether there are any missing values or duplicated stations in the dataset.
sum(is.na(dat))
## [1] 0
sum(is.na(pos))
## [1] 0
anyDuplicated(name)
## [1] 210
anyDuplicated(pos)
## [1] 0
We can see that there are no missing values. Some stations share exactly the same name but have different positions, so we keep all the data.
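As a quick sanity check, we can inspect a few of the stations whose names are duplicated, together with their coordinates; this is a minimal sketch using the name and pos objects defined above.
dups = unique(name[duplicated(name)])                      # names occurring more than once
head(data.frame(name = name, pos)[name %in% dups, ], 10)   # their rows, with coordinates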
We can plot the stations on a map using their GPS coordinates, thanks to the leaflet package (we use two colors to distinguish the stations on a hill, bonus = 1, from the others):
library(leaflet)
palette = colorFactor("RdYlBu", domain = NULL)
leaflet(pos) %>% addTiles() %>%
  addCircleMarkers(radius = 3, color = palette(bonus),
                   stroke = FALSE, fillOpacity = 0.9)
It is worth mentioning that blue indicates the stations on a hill, since stations with bonus = 1 are the minority.
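If the implicit level-to-color mapping of colorFactor is unclear, we can make it explicit; the sketch below assumes we want red for bonus = 0 and blue for bonus = 1.
palette2 = colorFactor(c("red", "blue"), domain = c(0, 1))  # explicit mapping: 0 -> red, 1 -> blue
leaflet(pos) %>% addTiles() %>%
  addCircleMarkers(radius = 3, color = palette2(bonus),
                   stroke = FALSE, fillOpacity = 0.9)
Now, let us look at the loading profiles of some stations.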
# Loading profiles of four individual stations
par(mfrow = c(2, 2))
ids = c(1, 11, 20, 50)
cols = c(2, 3, 4, 5)
for (j in seq_along(ids)) {
  matplot(t(dat[ids[j], ]), main = name[ids[j]], type = 'l', lty = 1,
          col = cols[j], xaxt = 'n', lwd = 2, ylim = c(0, 1))
  axis(1, at = seq(5, 181, 6), labels = dates[seq(5, 181, 6)], las = 2)
}
# The same four profiles overlaid on a single plot
par(mfrow = c(1, 1))
matplot(t(dat[ids, ]), main = 'Loading profiles of 4 different stations',
        type = 'l', lty = 1, col = cols, xaxt = 'n', lwd = 2, ylim = c(0, 1))
axis(1, at = seq(5, 181, 6), labels = dates[seq(5, 181, 6)], las = 2)
We can see that some stations have a higher ratio of available bikes than others, which can be related to the size and the position of the stations. From the combined plot, it looks like people use bikes more on Sundays, as fewer bikes are then available in the stations.
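To check this apparent Sunday effect beyond four stations, a minimal sketch is to average the loading over all stations at each time point (this assumes the columns of dat are in chronological order, as the dates vector indicates):
avg_load = colMeans(dat)   # mean loading over the 1189 stations at each time point
plot(avg_load, type = 'l', lwd = 2, xaxt = 'n', xlab = '', ylab = 'mean loading')
axis(1, at = seq(5, 181, 6), labels = dates[seq(5, 181, 6)], las = 2)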
We can check the position of these stations on the map as well.
library(mapview)
df = pos[c(1, 11, 20, 50), ]
leaflet(df) %>% addTiles() %>%
  addCircleMarkers(label = ~name[c(1, 11, 20, 50)], radius = 6,
                   stroke = FALSE, color = 'red', fillOpacity = 0.9)
We can look at the loading profiles of 20 stations on 3 different days (Sunday, Monday and Tuesday) at noon.
par(mfrow = c(2, 2))
matplot(dat[1:20, 2],  main = "Sunday, 12 pm",  type = 'l', lty = 1, col = 2, xaxt = 'n', lwd = 2, ylim = c(0, 1))
matplot(dat[1:20, 26], main = "Monday, 12 pm",  type = 'l', lty = 1, col = 3, xaxt = 'n', lwd = 2, ylim = c(0, 1))
matplot(dat[1:20, 50], main = "Tuesday, 12 pm", type = 'l', lty = 1, col = 4, xaxt = 'n', lwd = 2, ylim = c(0, 1))
matplot(dat[1:20, c(2, 26, 50)],
        main = "Loading of 20 stations on Sunday, Monday and Tuesday at 12 pm",
        type = 'l', lty = 1, col = c(2, 3, 4), xaxt = 'n', lwd = 2, ylim = c(0, 1))
With this plot, we can see that there is not much difference between the days of the week: the rate of available bikes is roughly the same on all three days, except for a few stations.
Now let us check for differences between the times of day. We select three different times (morning: 9 am, afternoon: 1 pm, night: 10 pm) on Monday.
par(mfrow = c(2, 2))
matplot(dat[1:20, 23], main = "Monday, 9 am",  type = 'l', lty = 1, col = 2, xaxt = 'n', lwd = 2, ylim = c(0, 1))
matplot(dat[1:20, 27], main = "Monday, 1 pm",  type = 'l', lty = 1, col = 3, xaxt = 'n', lwd = 2, ylim = c(0, 1))
matplot(dat[1:20, 36], main = "Monday, 10 pm", type = 'l', lty = 1, col = 4, xaxt = 'n', lwd = 2, ylim = c(0, 1))
matplot(dat[1:20, c(23, 27, 36)],
        main = "Loading of 20 stations on Monday at 9 am, 1 pm and 10 pm",
        type = 'l', lty = 1, col = c(2, 3, 4), xaxt = 'n', lwd = 2, ylim = c(0, 1))
The plot shows that the number of available bikes at a given station can change significantly across the day. For example, the second station has no available bikes in the morning and at lunch time, but at night it is almost full. However, some stations are always full no matter the time, which can be related to their position or to their small size.
In the following, to better understand the behavior of the system, we apply different clustering methods to the loading profile data. First, we apply PCA as a dimension reduction technique, in order to visualize the data, which live in a high-dimensional space.
pc = princomp(dat)                            # PCA on the loading profiles
plot(pc, main = "Contribution of variables")  # variance of each principal component
plot(predict(pc), type = 'p', pch = 19)       # scores on the first two components
biplot(pc, main = "Correlation circle")       # observations and variables together
Looking at the bar plot of the variances, we see that the first two principal components carry the highest contribution, having the largest variances. The correlation circle then shows how the original variables correlate with these two components.
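As a complement, here is a short sketch of the share of variance carried by the leading components, computed directly from the princomp output above:
ve = pc$sdev^2 / sum(pc$sdev^2)   # proportion of variance per component
round(ve[1:5], 3)                 # individual shares of the first five
round(cumsum(ve)[1:5], 3)         # cumulative shares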
We can also check the scree plot and use the scree test of Cattell to see how many axes are enough.
library(FactoMineR)
library(factoextra)
res.pca <- PCA(dat, graph = FALSE)
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))
dat1 = scale(dat, center = TRUE, scale = FALSE)  # center the data
S = t(dat1) %*% dat1                             # (unnormalized) covariance matrix
out = eigen(S)                                   # eigendecomposition of S
Q = out$vectors                                  # eigenvectors (principal axes)
Delta = out$values                               # eigenvalues
# Scree test of Cattell: keep dimensions up to the last eigenvalue gap
# larger than tau times the maximum gap
cattell <- function(Delta, tau = 0.1){
  sc = abs(diff(Delta))
  p = length(sc)
  thd = tau * max(sc)
  d = p - 1
  for (j in 1:(p - 1)){
    if (prod(sc[(j + 1):p] < thd) == 1){
      d = j
      break
    }
  }
  plot(sc, type = 'p')
  abline(h = thd, col = 'red')
  return(d)
}
cattell(Delta, tau = 0.1)
## [1] 2
Once more, it can be seen that two components are enough.
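As a sketch, we can project the centered data onto the two leading eigenvectors computed above and plot the resulting scores; this is the two-dimensional view suggested by the Cattell test.
scores = dat1 %*% Q[, 1:2]   # scores of the stations on the first two axes
plot(scores, pch = 19, xlab = 'axis 1', ylab = 'axis 2')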
Now, we apply different clustering methods to these data. As a first, reference model we use k-means. To find the optimal number of clusters we use the NbClust package, which provides 30 indices for determining the relevant number of clusters and proposes the best clustering scheme from the different results obtained by varying the number of clusters, distance measures, and clustering methods.
library(NbClust)
library(factoextra)
res.nbclust <- NbClust(dat, distance = "euclidean",
                       min.nc = 2, max.nc = 9,
                       method = "complete", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 6 proposed 4 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
factoextra::fviz_nbclust(res.nbclust) + theme_minimal() + ggtitle("NbClust's optimal number of clusters")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 4 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 6 proposed 4 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 3 proposed NA's as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
This suggests that the optimal number of clusters is 3. We also confirm this with the elbow method.
fviz_nbclust(dat, kmeans, method = "wss", k.max = 24) + theme_minimal() + ggtitle("the Elbow Method")
With this method we also obtain 3 as the optimal number of clusters.
Now let us look at the clustering of the observations on the map.
out.kmeans = kmeans(dat,3)
palette = colorFactor("RdYlBu", domain = NULL)
leaflet(pos) %>% addTiles() %>%
addCircleMarkers(radius = 3,
color = palette(out.kmeans$cluster),
stroke = FALSE, fillOpacity = 0.9)
table(out.kmeans$cluster)
##
## 1 2 3
## 359 544 286
We can say that these three clusters probably correspond to working, residential and touristic areas, in the sense that some areas have fewer bikes available while others have more.
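To support this interpretation, a minimal sketch is to plot the mean loading profile of each cluster, using the centers returned by kmeans:
matplot(t(out.kmeans$centers), type = 'l', lty = 1, lwd = 2, col = 2:4,
        xaxt = 'n', ylim = c(0, 1), ylab = 'mean loading',
        main = 'Mean loading profile per k-means cluster')
axis(1, at = seq(5, 181, 6), labels = dates[seq(5, 181, 6)], las = 2)
legend('topright', legend = paste('cluster', 1:3), col = 2:4, lty = 1, lwd = 2)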
However, since our data are high-dimensional, k-means is not an ideal model for clustering.
A model-based approach is to use Gaussian mixture models (GMMs). The mclust package allows us to run an EM algorithm to fit a GMM to the data.
library(mclust)
out.mclust = Mclust(dat,G = 2:10)
plot(out.mclust,what = "BIC")
table(out.mclust$classification)
##
## 1 2
## 665 524
Here, BIC selects the model 'EEE' with 2 groups. Now, let us look at the clustering of the stations on the map.
palette = colorFactor("RdYlBu", domain = NULL)
leaflet(pos) %>% addTiles () %>%
addCircleMarkers(color = palette(out.mclust$classification),
radius = 3, fillOpacity = 1, stroke = FALSE)
It can be seen that these clusters might correspond only to working and residential areas.
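A similar sketch as for k-means, averaging the loading profiles within each mclust group:
cl = out.mclust$classification
grp_means = sapply(sort(unique(cl)), function(g) colMeans(dat[cl == g, ]))  # 181 x 2 matrix
matplot(grp_means, type = 'l', lty = 1, lwd = 2, col = c(2, 4),
        xaxt = 'n', ylim = c(0, 1), ylab = 'mean loading',
        main = 'Mean loading profile per mclust group')
axis(1, at = seq(5, 181, 6), labels = dates[seq(5, 181, 6)], las = 2)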
Mclust also outputs the posterior probabilities of belonging to each group. We run EM for a single iteration so that the posterior probabilities are less binary:
out.mclust1it = Mclust(dat,G = 2, modelNames = 'EEE',
control = emControl(itmax = 1))
round(out.mclust1it$z[30:40,],5)
## [,1] [,2]
## 8113 0.97119 0.02881
## 14103 0.99956 0.00044
## 16105 0.99878 0.00122
## 15020 0.00009 0.99991
## 15106 0.00002 0.99998
## 35010 0.96449 0.03551
## 15044 0.96232 0.03768
## 21210 0.98695 0.01305
## 15035 0.00042 0.99958
## 21502 0.96894 0.03106
## 31003 0.49792 0.50208
We can see that the posterior probability of belonging to a group is high for most of the stations, except for one which is around 50% (this inspection covers only the 11 stations displayed).
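To go beyond these few stations, a one-line sketch is to look at the distribution of the assignment uncertainty (1 minus the largest posterior probability) that Mclust stores for every station:
hist(out.mclust1it$uncertainty, breaks = 30,
     main = 'Assignment uncertainty', xlab = 'uncertainty')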
Finally, we consider a GMM designed for high-dimensional data: HDDC. To find the optimal number of clusters for this method, we look at the BIC criterion for different numbers of clusters and pick the one with the lowest value.
library(HDclassif)
for (k in 2:10){
  out.hddc = hddc(dat, K = k)   # fit HDDC with k clusters
  print(out.hddc$BIC)
}
## [1] 79850.2
## [1] 80743.31
## [1] 113087.9
## [1] 100303.2
## [1] 121744.5
## [1] 108527.2
## [1] 124972.9
## [1] 124471.3
## [1] 117240.3
This shows that two clusters is the optimal number.
out.hddc = hddc(dat,K = 2)
palette = colorFactor("RdYlBu", domain = NULL)
leaflet(pos) %>% addTiles() %>%
addCircleMarkers(radius = 3,
color = palette(out.hddc$class),
stroke = FALSE, fillOpacity = 0.9)
table(out.hddc$class)
##
## 1 2
## 354 835
With this method, we cannot draw conclusions about the type of area. However, comparing with the first map, which shows which stations are on a hill, we might say that this clustering is based on the position of the stations.
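A quick way to test this reading is to cross-tabulate the HDDC partition with the bonus (hill) indicator; this sketch reuses the objects computed above:
table(cluster = out.hddc$class, hill = bonus)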
In this section, we analyze the results obtained so far. The aim of the study is to analyze the Vélib system in Paris, which can be useful for positioning and sizing future stations according to the travel needs of users in Paris.
From the first part, we find that the loading profile of a station depends on its size and location, and that this profile changes with the day of the week and the time of day. For our study, we work on velib$data, which gives the loading profiles, and we use the station positions to map the locations. Hence, we did not consider the bonus data.
From PCA, we find that two components are enough, as they carry the highest variance and contribution. In the PCA score plot, each point corresponds to a station, and the distances between stations appear to be well preserved by the projection.
Finally, we applied several clustering methods: k-means, mclust and hddc.
With k-means, we obtain 3 clusters corresponding to three kinds of areas: working, residential and touristic.
For the two model-based methods (mclust and hddc), we obtain 2 as the optimal number of clusters. These methods seem to separate the stations with more available bikes from those with fewer. One interpretation is that in working areas people tend to commute by bike: they take bikes from elsewhere, ride to their workplace and park there, which is why more bikes are usually available in those areas, whereas the other cluster corresponds to where Parisians live. Another observation is that the location of a station (on a hill or not) can affect its loading profile. As a final check, we can quantify the agreement between the three partitions (see the sketch below).
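The adjusted Rand index, provided by the mclust package loaded earlier, measures this agreement (1 means identical partitions, values near 0 mean chance-level agreement):
adjustedRandIndex(out.kmeans$cluster, out.mclust$classification)
adjustedRandIndex(out.kmeans$cluster, out.hddc$class)
adjustedRandIndex(out.mclust$classification, out.hddc$class)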