Techniques: data normalization, k-means clustering, ggplot

In Unit 6, we saw how clustering can be used for market segmentation, the idea of dividing a broad target market of customers into smaller, more similar groups, and then designing a marketing strategy specifically for each group. In this problem, we’ll see how the same idea can be applied using data from Hubway, a bike-sharing program in the Boston, Massachusetts area.

Registered users of Hubway can check-out a bicycle from one of 140 stations located throughout the Metro-Boston area, and return the bike to any of the 140 stations. This enables users to take bikes on one-way trips throughout the city. Users pay a membership fee, which includes unlimited trips up to 30 minutes in duration at no additional cost. Trips longer than 30 minutes cost additional “overtime” fees.

In this problem, we’ll use the dataset HubwayTrips.csv, which contains data from trips taken by registered users of Hubway from June 2012 through September 2012. The dataset contains the following seven variables:

Duration = the time of the trip, in seconds

Morning = whether or not the trip started in the morning, between the hours of 6:00am and 12:00pm (1 if yes, 0 if no)

Afternoon = whether or not the trip started in the afternoon, between the hours of 12:00pm and 6:00pm (1 if yes, 0 if no)

Evening = whether or not the trip started in the evening, between the hours of 6:00pm and 12:00am (1 if yes, 0 if no)

Weekday = whether or not the trip started on Monday, Tuesday, Wednesday, Thursday, or Friday (1 if yes, 0 if no)

Male = whether or not the user was male (1 if yes, 0 if no)

Age = the age of the user, in years

Load the data

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Final")
hub <- read.csv("HubwayTrips.csv")
str(hub)
## 'data.frame':    185190 obs. of  7 variables:
##  $ Duration : int  212 229 259 273 279 291 295 298 298 303 ...
##  $ Morning  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Afternoon: int  1 1 0 1 1 1 1 1 1 1 ...
##  $ Evening  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Weekday  : int  1 1 0 1 1 1 1 1 1 1 ...
##  $ Male     : int  1 1 0 1 0 1 1 0 0 1 ...
##  $ Age      : int  17 17 17 17 17 17 17 17 17 17 ...

In this dataset, what proportion of trips are taken by male users?

mean(hub$Male)
## [1] 0.7371078

Normalize the data

Determine which variable to normalize

summary(hub)
##     Duration          Morning         Afternoon         Evening      
##  Min.   :  180.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:  377.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :  562.0   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :  721.5   Mean   :0.3261   Mean   :0.3997   Mean   :0.2498  
##  3rd Qu.:  860.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :85040.0   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     Weekday            Male             Age       
##  Min.   :0.0000   Min.   :0.0000   Min.   :17.00  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:27.00  
##  Median :1.0000   Median :1.0000   Median :32.00  
##  Mean   :0.8299   Mean   :0.7371   Mean   :35.37  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:42.00  
##  Max.   :1.0000   Max.   :1.0000   Max.   :78.00

Normalize all of the variables in the Hubway dataset

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(hub)
hubNorm <- predict(preproc, hub)
summary(hubNorm)
##     Duration          Morning          Afternoon         Evening       
##  Min.   :-0.4333   Min.   :-0.6957   Min.   :-0.816   Min.   :-0.5771  
##  1st Qu.:-0.2757   1st Qu.:-0.6957   1st Qu.:-0.816   1st Qu.:-0.5771  
##  Median :-0.1277   Median :-0.6957   Median :-0.816   Median :-0.5771  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000   Mean   : 0.0000  
##  3rd Qu.: 0.1108   3rd Qu.: 1.4374   3rd Qu.: 1.225   3rd Qu.:-0.5771  
##  Max.   :67.4650   Max.   : 1.4374   Max.   : 1.225   Max.   : 1.7329  
##     Weekday             Male              Age         
##  Min.   :-2.2088   Min.   :-1.6745   Min.   :-1.6711  
##  1st Qu.: 0.4527   1st Qu.:-1.6745   1st Qu.:-0.7616  
##  Median : 0.4527   Median : 0.5972   Median :-0.3068  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4527   3rd Qu.: 0.5972   3rd Qu.: 0.6027  
##  Max.   : 0.4527   Max.   : 0.5972   Max.   : 3.8770

We won’t be using hierarchical clustering on this dataset, because we might have too many observations in our dataset for Hierarchical clustering to handle.

k-means clustering

set.seed(5000)
hubKMC <- kmeans(hubNorm, centers = 10)

How many observations are in the smallest cluster?

sort(hubKMC$size)
##  [1]  9720  9899 10819 14981 15635 16067 20256 22931 28473 36409

or

table(hubKMC$cluster)
## 
##     1     2     3     4     5     6     7     8     9    10 
## 28473 36409 20256 10819  9720 22931 16067 14981 15635  9899

Assign data to centroids

cluster1 <- subset(hub, hubKMC$cluster == 1)
cluster2 <- subset(hub, hubKMC$cluster == 2)
cluster3 <- subset(hub, hubKMC$cluster == 3)
cluster4 <- subset(hub, hubKMC$cluster == 4)
cluster5 <- subset(hub, hubKMC$cluster == 5)
cluster6 <- subset(hub, hubKMC$cluster == 6)
cluster7 <- subset(hub, hubKMC$cluster == 7)
cluster8 <- subset(hub, hubKMC$cluster == 8)
cluster9 <- subset(hub, hubKMC$cluster == 9)
cluster10 <- subset(hub, hubKMC$cluster == 10)

Increasing number of clusters

set.seed(8000)
hubKMC2 <- kmeans(hubNorm, centers = 20)
sort(table(hubKMC2$cluster))
## 
##    20     2     1     6     8    17    15    18     7    10    12    16 
##    99   330  1897  2759  4432  4784  4949  6596  7295  7826  7922  9019 
##     3     9    19     5     4    13    14    11 
##  9033 10246 11064 14226 14991 21215 21282 25225

Assign data to centroids

cluster1 <- subset(hub, hubKMC2$cluster == 1)
cluster2 <- subset(hub, hubKMC2$cluster == 2)
cluster3 <- subset(hub, hubKMC2$cluster == 3)
cluster4 <- subset(hub, hubKMC2$cluster == 4)
cluster5 <- subset(hub, hubKMC2$cluster == 5)
cluster6 <- subset(hub, hubKMC2$cluster == 6)
cluster7 <- subset(hub, hubKMC2$cluster == 7)
cluster8 <- subset(hub, hubKMC2$cluster == 8)
cluster9 <- subset(hub, hubKMC2$cluster == 9)
cluster10 <- subset(hub, hubKMC2$cluster == 10)
cluster11 <- subset(hub, hubKMC2$cluster == 11)
cluster12 <- subset(hub, hubKMC2$cluster == 12)
cluster13 <- subset(hub, hubKMC2$cluster == 13)
cluster14 <- subset(hub, hubKMC2$cluster == 14)
cluster15 <- subset(hub, hubKMC2$cluster == 15)
cluster16 <- subset(hub, hubKMC2$cluster == 16)
cluster17 <- subset(hub, hubKMC2$cluster == 17)
cluster18 <- subset(hub, hubKMC2$cluster == 18)
cluster19 <- subset(hub, hubKMC2$cluster == 19)
cluster20 <- subset(hub, hubKMC2$cluster == 20)

Visualization

library(ggplot2)
ggplot(data = hub, aes(x= Age, y = hubKMC2$cluster)) + geom_point()