This report covers the pre-module assignment of the course Operations Analytics. The goal of this report is to perform an exploratory analysis of the dataset of 5000 players Station Casinos and assist management with making informed and robust data-driven decisions. This report was created on Sun Jul 21 23:49:39 2019.

This report can also be accessed through the following link: http://rpubs.com/borjaureta/514393

Initial data import

The main data preparation steps I followed are the following: + Loaded a number of libraries that I plan on using for the analysis e.g readxl, tidyverse, dplyr, ggplot2, here, etc. + Imported the dataset using the readxl library + converted the dataset to tibble + Renamed column (which was initially empty) to PlayerNum + Checked for missing values and confirmed that there weren’t any + Performed some initial high level data analysis/exploration i.e. structure, head, tail and summary statistics

Once, this tasks were performed, it is possible to start with the analysis of the dataset.

#Install libraries
library(readr)
library(readxl)
library(janitor)
library(tidyverse)
library(lubridate)
library(dplyr)
library(knitr)
library(ggplot2)
library(gridExtra)
library(here)
library(scales)
library(cowplot)
library(car)
library(formattable)
#tinytex::install_tinytex()
library(tinytex)
#Import dataset
setwd(here())
dset <- read_xlsx("Casino Final_Dec 2013 Raw Data_for test.xlsx", sheet = "Sheet1")
#Rename first column
dset <- as_tibble(dset)
dset <- rename(dset, PlayerNum = "...1")
dset <- rename(dset, Total_Spent = "Total Spend")
#Explore data
#str(dset)
#head(dset)
#tail(dset)
#summary(dset)
#Check for na's #No missing values
#dset[!complete.cases(dset),]

Assignment Questions

Question 1 - How much total dollars were played on each game

Notes: Present the answer to this question as a single bar chart with seven bars, one for each game (six games and ‘Other Games’ as the seventh), with the length of each bar representing the total amount bet on that game by all 5000 players on that specific day. Since these charts are difficult to read accurately, include the actual correct total dollar value at the top of each bar.

Response: The following graph shows a bar chart with the total USD for the specific day by each game, ordered in decresing order.

sum_per_games <- data.frame(sapply(dset[2:8], sum))
sum_per_games$key <- rownames(sum_per_games)
sum_per_games <- rename(sum_per_games, SumGame = "sapply.dset.2.8...sum.")


ggplot(data = sum_per_games, aes(x=reorder(key, -SumGame), y = SumGame, fill=key)) + geom_bar(stat = "identity", color="Navy") + 
  scale_y_continuous(labels = comma) +
  ggtitle("Total USD by Game") + theme(plot.title = element_text(hjust = 0.5, size=16, face ="bold")) + 
  geom_text(aes(label = number(SumGame, big.mark = ",")), size=4, position = position_stack(vjust = 1.1)) + xlab("Games") + ylab ("Total USD")

The revenue ranking for the day is in line with the trends described in the assignment, with slots, blackjack and craps topping the total spent by a large gap, other games, bac and poker following, and bingo being the lowest spend by a large distance. We can begin to see management’s concern with bing being a low revenu-generating game, that occupies a large space.

Question 2 - How many people played each of the seven games and how much did they wager?

Notes: Create a bar chart for each game (so you will have seven bar charts). Each bar chart should show how many people played that game (the vertical axis) and how many dollars were wagered (the horizontal axis). Plot all seven charts together on a single display (i.e. use the plotgrid function from the cowplot package).

Response: In order to respond to this question, I have created an histogram for each game, as this reflects the frequencies for different wager intervals. Given that the counts and wagers vary considerably by game, I have customized the binwidth for each of the games accordingly, and have defined limits for the x and y axis in some cases to improve readibility of the graphics.

Some relevant observations:

  • Slots: A large part of the players will concentrate on low wagers under USD100. Players with highers wagers than that tend to concentrate around USD500.
  • Blackjack: Significant amount of players with waggers under USD25. Beyond that, the waggers trend towards USD300.
  • Craps: Significant amount of players with waggers under USD25. Beyond that, the waggers trend towards USD200.
  • Other: Significant amount of players with waggers under USD50. Beyond that, the waggers trend towards USD500.
  • Bac: Most players have waggers of USD200 or less
  • Poker: High concentration around USD200 and a fair amount of extreme values. The distribution is different than the previous games, as Poker doesn’t show significant “small waggers”
  • Bingo: Same for Bingo but with concentration around USD100 and few extreme observations. Distribution ressembles normal distribution.
dsetSlots<- filter(dset, Slots>0)
h_slots <- ggplot(data = dsetSlots, aes(dsetSlots$Slots)) + geom_histogram(binwidth = 50, color="Navy", aes(fill=..count..), alpha=.5, show.legend = TRUE) + 
  labs(title="Histogram for Slots", x="Wager (USD)", y="Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,1500))

dsetBJ<- filter(dset, BJ>0)
h_bj <- ggplot(data = dsetBJ, aes(dsetBJ$BJ)) + geom_histogram(binwidth = 25, color="Green", aes(fill=..count..), alpha=.5, show.legend = TRUE) + 
  labs(title="Histogram for Blackjack", x="Wager (USD)", y="Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,500))

dsetCraps <- filter(dset, Craps>0)
h_craps <- ggplot(data = dsetCraps, aes(dsetCraps$Craps)) + geom_histogram(binwidth = 25, color="Grey", aes(fill=..count..), alpha=.5, show.legend = TRUE) + 
  labs(title="Histogram for Craps", x="Wager (USD)", y="Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,600))

dsetOther <- filter(dset, Other>0)
h_other <- ggplot(data = dsetOther, aes(dsetOther$Other)) + geom_histogram(binwidth = 50, color="Purple", aes(fill=..count..), alpha=.5, show.legend = TRUE) + 
  labs(title="Histogram for Other Games", x="Wager (USD)", y="Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,900))

dsetBac <- filter(dset, Bac>0)
h_bac <- ggplot(data = dsetBac, aes(dsetBac$Bac)) + geom_histogram(binwidth = 25, color="Orange", aes(fill=..count..), alpha=.5, show.legend = TRUE) + 
  labs(title="Histogram for Bac", x="Wager (USD)", y="Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,1250)) + ylim(c(0,320))

dsetPoker<- filter(dset, Poker>0)
h_poker <- ggplot(data = dsetPoker, aes(dsetPoker$Poker)) + geom_histogram(binwidth = 25, color="Black", aes(fill=..count..), alpha=.5, show.legend = TRUE) + 
  labs(title="Histogram for Poker", x="Wager (USD)", y="Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,750)) + ylim(c(0,180))

dsetBingo <- filter(dset, Bingo>0)
h_bingo <- ggplot(data = dsetBingo, aes(dsetBingo$Bingo)) + geom_histogram(binwidth = 10, color="Darkblue", aes(fill=..count..), alpha=.5, show.legend = TRUE) + 
  labs(title="Histogram for Bingo", x="Wager (USD)", y="Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold"))

plot_grid(h_slots, h_bj, h_craps, h_other, h_bac, h_poker, h_bingo)

Question 3 - K-Means Cluster Analysis

Notes: K-Means Cluster Analysis to determine type (and value) of comps to be offered to players in each grouping. Keep the number of groups in the 3-6 range (K=3 to 6). Show the K=6 results as a Scatterplot Matrix showing every combination of two games (i.e. use the pairs function from the cars package). Clearly the 5000 players have strong differences in terms of what games they play and how much they wager. Management wants to offer comps to the players based on their value to the casino. (i.e. a player who wagers USD5000-USD10000 per day should be offered much more than a 25-cent slot machine player who wagers $40 per day.)

In order to perform a K-means Cluster Analysis, I carried out the following tasks:

  • Explore the data for NAs (i.e. none identified)
  • Since all the variables are numerical, I didn’t use any dummy variables.
  • I considered including the total spend variable. This would result in multicolinearity. When carrying out a K-means cluster analysis multicolinearity is not so much of an issue as we are not trying to optimize certain parameters but to understand similarities and patterns within groups. Notwithstanding, I decided not to include this variable so that I could compare the different clusters against total spend.
  • While all variables are measured in USD, the USD per variable show different ranges (as shown in the histograms in the previous question) so I decided to z-score standardize the features. Doing this would also be benefical from an interpretation perspective, as it is easier to read the z-scores as they have a mean of 0 and standard deviation of 1.

Next, I ran the K-means Cluster Analysis with k being 3, 4, 5 and 6. In order to determine the optimum number of k, I used the elbow method by looking at the WCSS (Within-Cluster-Sum-of-Squares) for different values of k. Initially, the WCSS was not stable so I defined the maximum number of iterations at 1000 and the nstart at 10 (i.e. attempt 10 different initial configurations and use the best one).

The plot below shows the results in terms of the WCSS for different number of K.

#First I created a DF with the numerical game USD wagger data for the players only
games_dset <- dset[2:8]
#Next I standardized the data
games_dset_z <- as.data.frame(lapply(games_dset, scale))
#Run a first try of clustering for each k from 3 to 6
set.seed(1)
game_clusters3 <- kmeans(games_dset_z, 3, iter.max = 1000, nstart = 10)
game_clusters4 <- kmeans(games_dset_z, 4, iter.max = 1000, nstart = 10)
game_clusters5 <- kmeans(games_dset_z, 5, iter.max = 1000, nstart = 10)
game_clusters6 <- kmeans(games_dset_z, 6, iter.max = 1000, nstart = 10)
#Use the elbow method to find the optimal number of clusters
wcss <- vector()
for (i in 1:10) wcss[i] <- sum(kmeans(games_dset_z, i, iter.max = 1000, nstart = 10)$withinss)
plot(1:10, wcss, type = "b", main = paste("Clusters of games"), xlab = "Number of clusters", ylab = "WCSS")

Based on the plot, I would select k=4 as the optimal number of clusters. The k represents the point where an increase in K doesn’t cause a significant reduction in the WCSS.

Next, I analyzed the cluster sizes and centers. The table below shows the cluster centers and size for each of the clusters.

kable(data.frame("Cluster" = c(1:4),"Centers" = round(game_clusters4$centers,2), "Size" = game_clusters4$size))
Cluster Centers.Slots Centers.BJ Centers.Craps Centers.Bac Centers.Bingo Centers.Poker Centers.Other Size
1 -0.55 -0.27 -0.26 -0.28 -0.31 -0.42 -0.35 3220
2 1.01 -0.01 -0.09 0.03 -0.32 1.65 1.52 1040
3 0.31 -0.32 -0.29 -0.33 2.96 -0.52 -0.61 477
4 2.14 3.95 4.03 3.88 -0.32 -0.52 -0.61 263

Some high level observations:

  • Cluster 1 “Casual Players” is the largest with 477 observations, which basically represents 5 percent of the observations. This cluster may represent casual players, with below below average results for every game.
  • Cluster 2 “Poker/Slots/Other” is the second largest with 477 observations, which basically represents 5 percent of the observations. This cluster shows a strong preference for poker, other and slots and a close to average preference for all the other games and slightly below average preference for bingo.
  • Cluster 3 “Bingo Players” is the second smallest cluster with 1040 observations, which basically represents 21 percent of the observations. Players in this cluster distinguish themsleves by having a big preference for bingo and to a lesser exstent for slots, with below average results for every other game.
  • Cluster 4 “Craps/BJ/Bac/Slots Players” is the least numerous cluster with 3220 observations, which basically represents 64 percent of the observations. Players in this cluster are concentrated on craps, blackjack, bac and slots, with significantly above average z-scores. In contrast, this cluster has below average results for bingo, poker and other.

Next, I added the cluster number to the dataset, to assess any relevant insights that can link the different clusters to the total spent data. This would be very useful to make business decisions and make a decision on the targetting of the different groups.

Once the cluster number was added, I analyzed the relationship between cluster and total spend, which is presented in the table below.

#Add the assigned cluster for each player to the dataset
dset$Cluster <- game_clusters4$cluster
kable(aggregate(data = dset, Total_Spent ~ Cluster, FUN = function(format.mean){round(mean(format.mean),0)}), align = c('l','l', 'l'))
Cluster Total_Spent
1 259
2 1860
3 499
4 9905

Conclusions:

  • The highest revenue-generating cluster is Cluster 4 by far. This is the “Craps/BJ/Bac/Slots Players”. This makes sense since slots, bj and craps are the games that generate the most revenue according to the assignment and also profits. It would make sense to offer this players the maximum incentives as part of their boarding pass cards. In order to define the exact amount, it would be necessary to obtain profitability data and do further analysis but from a directional perspective, it is clear that this is the cluster where incentives should be the highest.
  • Cluster 3, the “Bingo Players” is the second cluster with the lowest revenue (nearly 20 times less than Cluster 4). This makes sense since the bingo game generates the lowest profits and revenue. Incentives should be low. Also, it may make sense to discontinue or reduce space for bingo, since these players, have a strong preference for bingo, don’t really play any other game (slots slightly above average) and revenue and profits are low.
  • Cluster 1, the “Casual Players” is the cluster generating the lowest revenue (nearly 40 times less than Cluster 4). Incentives should generally be low. However, given the large size of this cluster, it would be interesting to analyze what marketing actions could be taken to increase the expenditure of this cluster.
  • Last, Cluster 2, the “Poker/Slots/Other” is the second largest revenue-generating cluster (about 5 times less than Cluster 4 but still 3.5 times more than the bingo players). Incentives would be higher for this cluster as compared to clusters 2 and 3 but lower than 1. One thing to consider is that this cluster is about 4 times larger than Cluster 4 so even if less revenue/profitable, it could make sense to work on this cluster as the absolute monetary impact is increased comparatively to Cluster 4 as a result of size.

Last, I included the scatterplot matrix for K=6 for every pair og games using the cars package, as indicated in the assignment. The graphs are shown below. I have also included the matrix of correlations for better interpretation.

scatterplotMatrix(~ Slots + BJ + Craps + Bac + Bingo + Poker + Other, data = dset, lty.smooth=2, main="Scatterplot matrix for pairs of games")

round(cor(dset[c("Slots", "BJ", "Craps", "Bac", "Bingo", "Poker", "Other")]),2)
##       Slots    BJ Craps   Bac Bingo Poker Other
## Slots  1.00  0.54  0.53  0.53  0.10  0.31  0.39
## BJ     0.54  1.00  0.91  0.86 -0.10 -0.04 -0.04
## Craps  0.53  0.91  1.00  0.88 -0.09 -0.07 -0.08
## Bac    0.53  0.86  0.88  1.00 -0.10 -0.03 -0.03
## Bingo  0.10 -0.10 -0.09 -0.10  1.00 -0.16 -0.19
## Poker  0.31 -0.04 -0.07 -0.03 -0.16  1.00  0.53
## Other  0.39 -0.04 -0.08 -0.03 -0.19  0.53  1.00

The matrix of correlations shows some interesting insights, which are in line with the discussion of prior sections of the report. + For instance, we can observe that bingo has no correlation with any other game (or correlations are negative), with the exception of a slight positive correlation with slots. This is in line with the pattern identified in Cluster 3. Bottom line is that we can confirm that bingo players will typically not play any other games or will do it compartively less than other players. + We can also see a high correlation between Craps, BJ, Bac and Slots, which is in line with our findings from Cluster 4. + While slots has higher correlations with the games of Cluster 4, there is also a strong correlation with poker and other games and between these two, which is in line with our findings from Cluster 2.