library(DescTools)
library(ggplot2)
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
library(hopkins)
library(leaflet)
We load the data and check it for NAs and invalid values.
rides <- read.csv("uber-data.csv")
ggplot(rides, aes(x = Lon, y = Lat)) +
  geom_point(alpha = 0.3, size = 0.1) +
  theme_minimal()
DescTools::Desc(rides)
## ──────────────────────────────────────────────────────────────────────────────
## Describe rides (data.frame):
##
## data frame: 1028136 obs. of 4 variables
## 1028136 complete cases (100.0%)
##
## Nr Class ColName NAs Levels
## 1 chr Date.Time .
## 2 num Lat .
## 3 num Lon .
## 4 chr Base .
##
##
## ──────────────────────────────────────────────────────────────────────────────
## 1 - Date.Time (character)
##
## length n NAs unique levels dupes
## 1'028'136 1'028'136 0 42'907 42'907 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 9/13/2014 18:44:00 82 0.0% 82 0.0%
## 2 9/13/2014 18:41:00 80 0.0% 162 0.0%
## 3 9/13/2014 18:45:00 78 0.0% 240 0.0%
## 4 9/18/2014 19:01:00 78 0.0% 318 0.0%
## 5 9/13/2014 15:40:00 75 0.0% 393 0.0%
## 6 9/13/2014 18:05:00 75 0.0% 468 0.0%
## 7 9/27/2014 23:40:00 75 0.0% 543 0.1%
## 8 9/13/2014 19:17:00 74 0.0% 617 0.1%
## 9 9/18/2014 19:04:00 74 0.0% 691 0.1%
## 10 9/19/2014 18:14:00 74 0.0% 765 0.1%
## 11 9/20/2014 18:47:00 74 0.0% 839 0.1%
## 12 9/4/2014 18:49:00 74 0.0% 913 0.1%
## ... etc.
## [list output truncated]
## ──────────────────────────────────────────────────────────────────────────────
## 2 - Lat (numeric)
##
## length n NAs unique 0s mean meanCI'
## 1'028'136 1'028'136 0 5'135 0 40.7392 40.7391
## 100.0% 0.0% 0.0% 40.7393
##
## .05 .10 .25 median .75 .90 .95
## 40.6703 40.6912 40.7204 40.7418 40.7612 40.7760 40.7889
##
## range sd vcoef mad IQR skew kurt
## 1.3579 0.0408 0.0010 0.0302 0.0408 0.4615 10.1800
##
## lowest : 39.9897, 40.058, 40.0794, 40.0972, 40.1122
## highest: 41.3043, 41.3141, 41.318, 41.3183 (2), 41.3476
##
## ' 95%-CI (classic)
## ──────────────────────────────────────────────────────────────────────────────
## 3 - Lon (numeric)
##
## length n NAs unique 0s mean meanCI'
## 1'028'136 1'028'136 0 7'724 0 -73.9718 -73.9719
## 100.0% 0.0% 0.0% -73.9717
##
## .05 .10 .25 median .75 .90 .95
## -74.0104 -74.0066 -73.9962 -73.9831 -73.9628 -73.9296 -73.8653
##
## range sd vcoef mad IQR skew kurt
## 2.0573 0.0583 -0.0008 0.0233 0.0334 2.2393 27.9242
##
## lowest : -74.7736, -74.7155, -74.7121, -74.7120, -74.7035
## highest: -72.8013, -72.7558, -72.7480, -72.7339, -72.7163
##
## ' 95%-CI (classic)
## ──────────────────────────────────────────────────────────────────────────────
## 4 - Base (character)
##
## length n NAs unique levels dupes
## 1'028'136 1'028'136 0 5 5 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 B02617 377'695 36.7% 377'695 36.7%
## 2 B02598 240'600 23.4% 618'295 60.1%
## 3 B02682 197'138 19.2% 815'433 79.3%
## 4 B02764 178'333 17.3% 993'766 96.7%
## 5 B02512 34'370 3.3% 1'028'136 100.0%
Since the dataset is huge and clustering the whole set is not feasible, we reduce the data by taking a random subset of 10,000 points.
set.seed(123)
# take a reproducible random sample of 10,000 rides
subset_rides <- rides[sample(nrow(rides), 10000), ]
# parse timestamps and derive date, hour, and weekday features
subset_rides$Date.Time <- mdy_hms(subset_rides$Date.Time)
subset_rides$Date <- as.Date(subset_rides$Date.Time)
subset_rides$Hour <- hour(subset_rides$Date.Time)
subset_rides$Weekday <- wday(subset_rides$Date.Time, label = TRUE)
ggplot(subset_rides, aes(x = Lon, y = Lat)) +
  geom_point(alpha = 0.3, size = 0.1) +
  theme_minimal()
coords <- subset_rides[, c("Lat", "Lon")]
fviz_nbclust(coords, kmeans, method = "silhouette") +
  theme_minimal()
fviz_nbclust(coords, kmeans, method = "gap_stat", nboot = 50) +
  theme_minimal()
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 500000)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
set.seed(123)
hopkins_stat <- hopkins(coords, m = nrow(coords) - 1)
hopkins_stat
## [1] 0.997341
The Hopkins statistic is close to 1, which indicates a strong clustering tendency; the irregular shapes of the groupings, however, suggest the data is not ideal for the k-means algorithm. I still want to run k-means to see what we get. From the silhouette plot we can conclude that 2 clusters is a good choice, while the gap statistic suggests 5. I go with 5, because I expect more clusters and the silhouette score for 5 is not much worse than for 2 (see the sketch below). We also extracted Date and Hour values from the dataset to analyze later. Since the cluster shapes are clearly not regular and are more concave than convex, DBSCAN will be our main choice.
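As a quick check that k = 5 scores only slightly worse than k = 2, here is a minimal sketch (my addition, not part of the original analysis); it compares average silhouette widths on a 2,000-point subsample, assuming the subsample is representative, because a full 10,000-point distance matrix is large.
set.seed(123)
idx <- sample(nrow(coords), 2000)   # subsample to keep dist() manageable
d <- dist(coords[idx, ])
sil <- sapply(c(2, 5), function(k) {
  km <- kmeans(coords[idx, ], centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
setNames(sil, c("k = 2", "k = 5"))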
k <- 5
set.seed(123)  # reseed so the k-means initialization is reproducible
kmeans_result <- kmeans(coords, centers = k)
subset_rides$cluster <- kmeans_result$cluster
pal <- colorFactor(palette = "Set1", domain = subset_rides$cluster)
leaflet(subset_rides) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    color = ~pal(cluster),
    radius = 2,
    stroke = FALSE,
    fillOpacity = 0.7,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend("bottomright", pal = pal, values = ~cluster,
            title = "K-Means Clusters")
Again, the k-means clustering is unsatisfying, as it simply cuts the big cluster in half; there are no clear groupings of points this way. I will not try hierarchical or k-medoids (PAM) clustering either, because it is clearly visible from the data that DBSCAN or spectral clustering are the better choices. After running DBSCAN, I do not think we can find a better clustering than its result, so I will analyze the DBSCAN results.
The elbow of the k-NN distance plot is at around 0.005, which suggests a good eps value. minPts was chosen as 10: the dataset has 4 attributes, and the common heuristic minPts ≥ dimensions + 1 gives a lower bound of 5, so we need a value of at least 5. Also, checking minPts = 3 and minPts = 5 gave unsatisfactory results.
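To make this comparison concrete, here is a minimal sketch (my addition) that tabulates the number of clusters and noise points for each candidate minPts value, with eps fixed at 0.005; in the dbscan package, cluster label 0 denotes noise.
for (mp in c(3, 5, 10)) {
  res <- dbscan(coords, eps = 0.005, minPts = mp)
  cat("minPts =", mp,
      "| clusters:", max(res$cluster),
      "| noise points:", sum(res$cluster == 0), "\n")
}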
k <- 4
kNNdistplot(coords, k)     # k-NN distance plot; look for the elbow
abline(h = 0.01, lty = 2)  # reference line near the elbow region
db <- dbscan(coords, eps = 0.005, minPts = 10)
subset_rides$db_cluster <- db$cluster
Looking at the map, DBSCAN is the better choice for clustering this set. After checking different eps values, 0.005 gives a reasonably good set of clusters.
pal <- colorFactor(palette = "Set1", domain = subset_rides$db_cluster)
leaflet(subset_rides) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    color = ~pal(db_cluster),
    radius = 2,
    stroke = FALSE,
    fillOpacity = 0.7,
    popup = ~paste("Cluster:", db_cluster)
  ) %>%
  addLegend("bottomright", pal = pal, values = ~db_cluster,
            title = "DBSCAN Clusters")
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
We can identify three big clusters in the city center, one class of outliers (shown in red), and 14 small collections of points. I would say this is reasonable; however, since we cannot analyze the whole set of 1 million points, the structure of the sample differs somewhat from the full data.
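To back up this description, a quick sketch (my addition) tabulating the cluster sizes; label 0 collects the DBSCAN noise points, i.e. the outliers on the map above.
sort(table(subset_rides$db_cluster), decreasing = TRUE)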
cluster_time_summary <- subset_rides %>%
  group_by(db_cluster, Hour) %>%
  summarise(
    count = n(),
    .groups = "drop"
  ) %>%
  group_by(db_cluster) %>%
  mutate(
    total_cluster_pickups = sum(count),
    percent = 100 * count / total_cluster_pickups
  ) %>%
  arrange(db_cluster, Hour)
ggplot(cluster_time_summary, aes(x = Hour, y = percent, color = factor(db_cluster))) +
  geom_line(linewidth = 1) +
  facet_wrap(~ db_cluster, scales = "free_y") +
  labs(
    title = "Cluster activity by hour of day",
    x = "Hour of Day",
    y = "Percentage of Pickups",
    color = "Cluster"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
By hour of day, pickups in the biggest cluster (the city center) peak around the 16-18 slot, which matches reality. From 20 to 4 in the morning the count is at its lowest, reflecting night time. It spikes slightly around 7 in the morning, when people commute to work and school, then dips from 10 to 15 while people are busy. Other clusters have slightly different peak times: in cluster 5, for example, rides are requested even earlier, around 6-7 in the morning.
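A short sketch (my addition) extracts each cluster's single busiest hour from the summary built above, making the comparison across clusters concrete.
cluster_time_summary %>%
  group_by(db_cluster) %>%
  slice_max(percent, n = 1, with_ties = FALSE) %>%  # busiest hour per cluster
  select(db_cluster, peak_hour = Hour, percent)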
cluster_date_summary <- subset_rides %>%
  group_by(db_cluster, Date) %>%
  summarise(
    count = n(),
    .groups = "drop"
  ) %>%
  group_by(db_cluster) %>%
  mutate(
    total_cluster_pickups = sum(count),
    percent = 100 * count / total_cluster_pickups
  ) %>%
  arrange(db_cluster, Date)
ggplot(cluster_date_summary, aes(x = Date, y = percent, color = factor(db_cluster))) +
  geom_line(linewidth = 1) +
  facet_wrap(~ db_cluster, scales = "free_y") +
  labs(
    title = "Cluster activity by date",
    x = "Date",
    y = "Percentage of Pickups",
    color = "Cluster"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
From the date chart, we can tell that no cluster has a stable number of pickups every day: demand is higher on specific days and falls off soon after. In smaller clusters the pickups are concentrated on a small number of dates, while in bigger clusters the rides are spread evenly across the whole time span.
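To quantify this concentration, a minimal sketch (my addition) computes the share of each cluster's pickups that falls on its five busiest dates; values near 100% mean demand is confined to a few days.
cluster_date_summary %>%
  group_by(db_cluster) %>%
  slice_max(count, n = 5, with_ties = FALSE) %>%   # 5 busiest dates per cluster
  summarise(top5_share = sum(percent), .groups = "drop") %>%
  arrange(desc(top5_share))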
base_cluster_summary <- subset_rides %>%
  group_by(db_cluster, Base) %>%
  summarise(
    count = n(),
    mean_Lat = mean(Lat),
    mean_Lon = mean(Lon),
    .groups = "drop"
  ) %>%
  arrange(db_cluster, desc(count))
ggplot(base_cluster_summary, aes(x = factor(db_cluster), y = count, fill = Base)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Base distribution across clusters",
       x = "Cluster", y = "Proportion of rides", fill = "Base") +
  theme_minimal()
Looking at the distribution of pickup "Bases", the most popular one overall is B02617, while B02512 is the least popular. An interesting insight is that cluster 17 has only two Bases, which may indicate a special geographic location, perhaps an airport. Some clusters (9, 11, 14, and 18) also contain no pickups from the red Base. Most clusters are dominated by the green Base, but in some of them the purple or yellow Base dominates instead.
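A sketch (my addition) counting the distinct Bases serving each cluster verifies which clusters, like cluster 17, are covered by only a subset of Bases.
subset_rides %>%
  group_by(db_cluster) %>%
  summarise(n_bases = n_distinct(Base), .groups = "drop") %>%
  arrange(n_bases)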
cluster_hour_summary <- subset_rides %>%
  group_by(db_cluster, Hour) %>%
  summarise(
    mean_Lat = mean(Lat),
    mean_Lon = mean(Lon),
    count = n(),
    .groups = "drop"
  )
ggplot(cluster_hour_summary, aes(x = mean_Lon, y = mean_Lat, color = factor(db_cluster))) +
  geom_path(aes(group = db_cluster), arrow = arrow(length = unit(0.15, "cm"))) +
  geom_point(size = 2) +
  labs(
    title = "Movement of Cluster Centers by Hour of Day",
    x = "Longitude", y = "Latitude",
    color = "Cluster"
  ) +
  theme_minimal()
We can see that the cluster centers barely move with the hour of the day, which suggests there is no strong relation between hour and pickup location within a cluster. The noise points (cluster 0), however, behave differently: their center shifts almost every hour, meaning the hour strongly affects which outlying areas of the city are active.
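To put a number on this, a sketch (my addition) sums the hour-to-hour displacement of each cluster's center (in degrees); the noise class should stand out with the largest total drift.
cluster_hour_summary %>%
  group_by(db_cluster) %>%
  arrange(Hour, .by_group = TRUE) %>%
  summarise(
    drift = sum(sqrt(diff(mean_Lon)^2 + diff(mean_Lat)^2)),  # total path length
    .groups = "drop"
  ) %>%
  arrange(desc(drift))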
cluster_date_summary <- subset_rides %>%
  group_by(db_cluster, Date) %>%
  summarise(
    mean_Lat = mean(Lat),
    mean_Lon = mean(Lon),
    count = n(),
    .groups = "drop"
  )
ggplot(cluster_date_summary, aes(x = mean_Lon, y = mean_Lat, color = factor(db_cluster))) +
  geom_path(aes(group = db_cluster), arrow = arrow(length = unit(0.15, "cm"))) +
  geom_point(size = 2) +
  labs(
    title = "Movement of Cluster Centers by Date",
    x = "Longitude", y = "Latitude",
    color = "Cluster"
  ) +
  theme_minimal()
We observe the same pattern for the relation between date and pickup locations as we did for the hour of the day.
We first build a time series of ride counts for every cluster, then choose cluster 1 for forecasting.
rides_by_cluster_time <- subset_rides %>%
  group_by(db_cluster, Date.Time) %>%
  summarise(count = n(), .groups = "drop")
cluster1 <- rides_by_cluster_time %>%
  filter(db_cluster == 1) %>%
  arrange(Date.Time) %>%
  select(ds = Date.Time, y = count)
m <- prophet(cluster1)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
future <- make_future_dataframe(m, periods = 240, freq = "hour")
forecast <- predict(m, future)
prophet_plot_components(m, forecast)
The forecast by date suggests that the next 10 days will see reduced demand for rides, with the highest number of rides around September 18. Saturday and Sunday are predicted to have the most pickups. Finally, the busiest hours are 8 in the morning and around 18 in the evening, as expected and as seen in the cluster analysis earlier.
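As a cross-check (my addition), this sketch averages the predicted demand by hour over the 240-hour forecast horizon; the top rows should land near 8 and 18.
forecast %>%
  tail(240) %>%                      # the appended future rows
  mutate(hour = hour(ds)) %>%
  group_by(hour) %>%
  summarise(mean_yhat = mean(yhat), .groups = "drop") %>%
  arrange(desc(mean_yhat)) %>%
  head()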
plot(m, forecast)
The blue line is the fitted model and shows the approximation of the future; the light shaded area is the uncertainty interval. The forecast shows that the highest demand each day is expected at 8 and 18. Across the week, demand rises smoothly toward the weekend and falls when the new week begins.
We used DBSCAN to find clusters corresponding to the densest pickup areas in NYC. Grouping the rides by cluster lets us analyze which hour of the day is the most active in each area. The areas can also be read as districts of the city, helping us adapt to each district. We can further analyze how cluster centers move to see which parts of a cluster are busiest at a given hour or date. Additionally, we inferred which Bases are in demand in specific clusters and which are not.