In Unit 6, we saw how clustering can be used for market segmentation, the idea of dividing a broad target market of customers into smaller, more similar groups, and then designing a marketing strategy specifically for each group. In this problem, we’ll see how the same idea can be applied using data from Hubway, a bike-sharing program in the Boston, Massachusetts area.
Registered users of Hubway can check-out a bicycle from one of 140 stations located throughout the Metro-Boston area, and return the bike to any of the 140 stations. This enables users to take bikes on one-way trips throughout the city. Users pay a membership fee, which includes unlimited trips up to 30 minutes in duration at no additional cost. Trips longer than 30 minutes cost additional “overtime” fees.
In this problem, we’ll use the dataset HubwayTrips.csv, which contains data from trips taken by registered users of Hubway from June 2012 through September 2012. The dataset contains the following seven variables:
Duration = the time of the trip, in seconds
Morning = whether or not the trip started in the morning, between the hours of 6:00am and 12:00pm (1 if yes, 0 if no)
Afternoon = whether or not the trip started in the afternoon, between the hours of 12:00pm and 6:00pm (1 if yes, 0 if no)
Evening = whether or not the trip started in the evening, between the hours of 6:00pm and 12:00am (1 if yes, 0 if no)
Weekday = whether or not the trip started on Monday, Tuesday, Wednesday, Thursday, or Friday (1 if yes, 0 if no)
Male = whether or not the user was male (1 if yes, 0 if no)
Age = the age of the user, in years
setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Final")
hub <- read.csv("HubwayTrips.csv")
str(hub)
## 'data.frame': 185190 obs. of 7 variables:
## $ Duration : int 212 229 259 273 279 291 295 298 298 303 ...
## $ Morning : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Afternoon: int 1 1 0 1 1 1 1 1 1 1 ...
## $ Evening : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Weekday : int 1 1 0 1 1 1 1 1 1 1 ...
## $ Male : int 1 1 0 1 0 1 1 0 0 1 ...
## $ Age : int 17 17 17 17 17 17 17 17 17 17 ...
In this dataset, what proportion of trips are taken by male users?
mean(hub$Male)
## [1] 0.7371078
Determine which variable to normalize
summary(hub)
## Duration Morning Afternoon Evening
## Min. : 180.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 377.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 562.0 Median :0.0000 Median :0.0000 Median :0.0000
## Mean : 721.5 Mean :0.3261 Mean :0.3997 Mean :0.2498
## 3rd Qu.: 860.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :85040.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Weekday Male Age
## Min. :0.0000 Min. :0.0000 Min. :17.00
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:27.00
## Median :1.0000 Median :1.0000 Median :32.00
## Mean :0.8299 Mean :0.7371 Mean :35.37
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:42.00
## Max. :1.0000 Max. :1.0000 Max. :78.00
Normalize all of the variables in the Hubway dataset
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(hub)
hubNorm <- predict(preproc, hub)
summary(hubNorm)
## Duration Morning Afternoon Evening
## Min. :-0.4333 Min. :-0.6957 Min. :-0.816 Min. :-0.5771
## 1st Qu.:-0.2757 1st Qu.:-0.6957 1st Qu.:-0.816 1st Qu.:-0.5771
## Median :-0.1277 Median :-0.6957 Median :-0.816 Median :-0.5771
## Mean : 0.0000 Mean : 0.0000 Mean : 0.000 Mean : 0.0000
## 3rd Qu.: 0.1108 3rd Qu.: 1.4374 3rd Qu.: 1.225 3rd Qu.:-0.5771
## Max. :67.4650 Max. : 1.4374 Max. : 1.225 Max. : 1.7329
## Weekday Male Age
## Min. :-2.2088 Min. :-1.6745 Min. :-1.6711
## 1st Qu.: 0.4527 1st Qu.:-1.6745 1st Qu.:-0.7616
## Median : 0.4527 Median : 0.5972 Median :-0.3068
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4527 3rd Qu.: 0.5972 3rd Qu.: 0.6027
## Max. : 0.4527 Max. : 0.5972 Max. : 3.8770
We won’t be using hierarchical clustering on this dataset, because we might have too many observations in our dataset for Hierarchical clustering to handle.
set.seed(5000)
hubKMC <- kmeans(hubNorm, centers = 10)
How many observations are in the smallest cluster?
sort(hubKMC$size)
## [1] 9720 9899 10819 14981 15635 16067 20256 22931 28473 36409
or
table(hubKMC$cluster)
##
## 1 2 3 4 5 6 7 8 9 10
## 28473 36409 20256 10819 9720 22931 16067 14981 15635 9899
Assign data to centroids
cluster1 <- subset(hub, hubKMC$cluster == 1)
cluster2 <- subset(hub, hubKMC$cluster == 2)
cluster3 <- subset(hub, hubKMC$cluster == 3)
cluster4 <- subset(hub, hubKMC$cluster == 4)
cluster5 <- subset(hub, hubKMC$cluster == 5)
cluster6 <- subset(hub, hubKMC$cluster == 6)
cluster7 <- subset(hub, hubKMC$cluster == 7)
cluster8 <- subset(hub, hubKMC$cluster == 8)
cluster9 <- subset(hub, hubKMC$cluster == 9)
cluster10 <- subset(hub, hubKMC$cluster == 10)
set.seed(8000)
hubKMC2 <- kmeans(hubNorm, centers = 20)
sort(table(hubKMC2$cluster))
##
## 20 2 1 6 8 17 15 18 7 10 12 16
## 99 330 1897 2759 4432 4784 4949 6596 7295 7826 7922 9019
## 3 9 19 5 4 13 14 11
## 9033 10246 11064 14226 14991 21215 21282 25225
Assign data to centroids
cluster1 <- subset(hub, hubKMC2$cluster == 1)
cluster2 <- subset(hub, hubKMC2$cluster == 2)
cluster3 <- subset(hub, hubKMC2$cluster == 3)
cluster4 <- subset(hub, hubKMC2$cluster == 4)
cluster5 <- subset(hub, hubKMC2$cluster == 5)
cluster6 <- subset(hub, hubKMC2$cluster == 6)
cluster7 <- subset(hub, hubKMC2$cluster == 7)
cluster8 <- subset(hub, hubKMC2$cluster == 8)
cluster9 <- subset(hub, hubKMC2$cluster == 9)
cluster10 <- subset(hub, hubKMC2$cluster == 10)
cluster11 <- subset(hub, hubKMC2$cluster == 11)
cluster12 <- subset(hub, hubKMC2$cluster == 12)
cluster13 <- subset(hub, hubKMC2$cluster == 13)
cluster14 <- subset(hub, hubKMC2$cluster == 14)
cluster15 <- subset(hub, hubKMC2$cluster == 15)
cluster16 <- subset(hub, hubKMC2$cluster == 16)
cluster17 <- subset(hub, hubKMC2$cluster == 17)
cluster18 <- subset(hub, hubKMC2$cluster == 18)
cluster19 <- subset(hub, hubKMC2$cluster == 19)
cluster20 <- subset(hub, hubKMC2$cluster == 20)
library(ggplot2)
ggplot(data = hub, aes(x= Age, y = hubKMC2$cluster)) + geom_point()