This report covers the pre-module assignment of the course Operations Analytics. The goal of this report is to perform an exploratory analysis of the dataset of 5000 players Station Casinos and assist management with making informed and robust data-driven decisions. This report was created on Sun Jul 21 23:49:39 2019.
This report can also be accessed through the following link: http://rpubs.com/borjaureta/514393
The main data preparation steps I followed are the following: + Loaded a number of libraries that I plan on using for the analysis e.g readxl, tidyverse, dplyr, ggplot2, here, etc. + Imported the dataset using the readxl library + converted the dataset to tibble + Renamed column (which was initially empty) to PlayerNum + Checked for missing values and confirmed that there weren’t any + Performed some initial high level data analysis/exploration i.e. structure, head, tail and summary statistics
Once, this tasks were performed, it is possible to start with the analysis of the dataset.
#Install libraries
library(readr)
library(readxl)
library(janitor)
library(tidyverse)
library(lubridate)
library(dplyr)
library(knitr)
library(ggplot2)
library(gridExtra)
library(here)
library(scales)
library(cowplot)
library(car)
library(formattable)
#tinytex::install_tinytex()
library(tinytex)
#Import dataset
setwd(here())
dset <- read_xlsx("Casino Final_Dec 2013 Raw Data_for test.xlsx", sheet = "Sheet1")
#Rename first column
dset <- as_tibble(dset)
dset <- rename(dset, PlayerNum = "...1")
dset <- rename(dset, Total_Spent = "Total Spend")
#Explore data
#str(dset)
#head(dset)
#tail(dset)
#summary(dset)
#Check for na's #No missing values
#dset[!complete.cases(dset),]
Notes: Present the answer to this question as a single bar chart with seven bars, one for each game (six games and ‘Other Games’ as the seventh), with the length of each bar representing the total amount bet on that game by all 5000 players on that specific day. Since these charts are difficult to read accurately, include the actual correct total dollar value at the top of each bar.
Response: The following graph shows a bar chart with the total USD for the specific day by each game, ordered in decresing order.
sum_per_games <- data.frame(sapply(dset[2:8], sum))
sum_per_games$key <- rownames(sum_per_games)
sum_per_games <- rename(sum_per_games, SumGame = "sapply.dset.2.8...sum.")
ggplot(data = sum_per_games, aes(x=reorder(key, -SumGame), y = SumGame, fill=key)) + geom_bar(stat = "identity", color="Navy") +
scale_y_continuous(labels = comma) +
ggtitle("Total USD by Game") + theme(plot.title = element_text(hjust = 0.5, size=16, face ="bold")) +
geom_text(aes(label = number(SumGame, big.mark = ",")), size=4, position = position_stack(vjust = 1.1)) + xlab("Games") + ylab ("Total USD")
The revenue ranking for the day is in line with the trends described in the assignment, with slots, blackjack and craps topping the total spent by a large gap, other games, bac and poker following, and bingo being the lowest spend by a large distance. We can begin to see management’s concern with bing being a low revenu-generating game, that occupies a large space.
Notes: Create a bar chart for each game (so you will have seven bar charts). Each bar chart should show how many people played that game (the vertical axis) and how many dollars were wagered (the horizontal axis). Plot all seven charts together on a single display (i.e. use the plotgrid function from the cowplot package).
Response: In order to respond to this question, I have created an histogram for each game, as this reflects the frequencies for different wager intervals. Given that the counts and wagers vary considerably by game, I have customized the binwidth for each of the games accordingly, and have defined limits for the x and y axis in some cases to improve readibility of the graphics.
Some relevant observations:
dsetSlots<- filter(dset, Slots>0)
h_slots <- ggplot(data = dsetSlots, aes(dsetSlots$Slots)) + geom_histogram(binwidth = 50, color="Navy", aes(fill=..count..), alpha=.5, show.legend = TRUE) +
labs(title="Histogram for Slots", x="Wager (USD)", y="Frequency") +
theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,1500))
dsetBJ<- filter(dset, BJ>0)
h_bj <- ggplot(data = dsetBJ, aes(dsetBJ$BJ)) + geom_histogram(binwidth = 25, color="Green", aes(fill=..count..), alpha=.5, show.legend = TRUE) +
labs(title="Histogram for Blackjack", x="Wager (USD)", y="Frequency") +
theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,500))
dsetCraps <- filter(dset, Craps>0)
h_craps <- ggplot(data = dsetCraps, aes(dsetCraps$Craps)) + geom_histogram(binwidth = 25, color="Grey", aes(fill=..count..), alpha=.5, show.legend = TRUE) +
labs(title="Histogram for Craps", x="Wager (USD)", y="Frequency") +
theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,600))
dsetOther <- filter(dset, Other>0)
h_other <- ggplot(data = dsetOther, aes(dsetOther$Other)) + geom_histogram(binwidth = 50, color="Purple", aes(fill=..count..), alpha=.5, show.legend = TRUE) +
labs(title="Histogram for Other Games", x="Wager (USD)", y="Frequency") +
theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,900))
dsetBac <- filter(dset, Bac>0)
h_bac <- ggplot(data = dsetBac, aes(dsetBac$Bac)) + geom_histogram(binwidth = 25, color="Orange", aes(fill=..count..), alpha=.5, show.legend = TRUE) +
labs(title="Histogram for Bac", x="Wager (USD)", y="Frequency") +
theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,1250)) + ylim(c(0,320))
dsetPoker<- filter(dset, Poker>0)
h_poker <- ggplot(data = dsetPoker, aes(dsetPoker$Poker)) + geom_histogram(binwidth = 25, color="Black", aes(fill=..count..), alpha=.5, show.legend = TRUE) +
labs(title="Histogram for Poker", x="Wager (USD)", y="Frequency") +
theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold")) + xlim(c(0,750)) + ylim(c(0,180))
dsetBingo <- filter(dset, Bingo>0)
h_bingo <- ggplot(data = dsetBingo, aes(dsetBingo$Bingo)) + geom_histogram(binwidth = 10, color="Darkblue", aes(fill=..count..), alpha=.5, show.legend = TRUE) +
labs(title="Histogram for Bingo", x="Wager (USD)", y="Frequency") +
theme(plot.title = element_text(hjust = 0.5, size=12, face ="bold"))
plot_grid(h_slots, h_bj, h_craps, h_other, h_bac, h_poker, h_bingo)
Notes: K-Means Cluster Analysis to determine type (and value) of comps to be offered to players in each grouping. Keep the number of groups in the 3-6 range (K=3 to 6). Show the K=6 results as a Scatterplot Matrix showing every combination of two games (i.e. use the pairs function from the cars package). Clearly the 5000 players have strong differences in terms of what games they play and how much they wager. Management wants to offer comps to the players based on their value to the casino. (i.e. a player who wagers USD5000-USD10000 per day should be offered much more than a 25-cent slot machine player who wagers $40 per day.)
In order to perform a K-means Cluster Analysis, I carried out the following tasks:
Next, I ran the K-means Cluster Analysis with k being 3, 4, 5 and 6. In order to determine the optimum number of k, I used the elbow method by looking at the WCSS (Within-Cluster-Sum-of-Squares) for different values of k. Initially, the WCSS was not stable so I defined the maximum number of iterations at 1000 and the nstart at 10 (i.e. attempt 10 different initial configurations and use the best one).
The plot below shows the results in terms of the WCSS for different number of K.
#First I created a DF with the numerical game USD wagger data for the players only
games_dset <- dset[2:8]
#Next I standardized the data
games_dset_z <- as.data.frame(lapply(games_dset, scale))
#Run a first try of clustering for each k from 3 to 6
set.seed(1)
game_clusters3 <- kmeans(games_dset_z, 3, iter.max = 1000, nstart = 10)
game_clusters4 <- kmeans(games_dset_z, 4, iter.max = 1000, nstart = 10)
game_clusters5 <- kmeans(games_dset_z, 5, iter.max = 1000, nstart = 10)
game_clusters6 <- kmeans(games_dset_z, 6, iter.max = 1000, nstart = 10)
#Use the elbow method to find the optimal number of clusters
wcss <- vector()
for (i in 1:10) wcss[i] <- sum(kmeans(games_dset_z, i, iter.max = 1000, nstart = 10)$withinss)
plot(1:10, wcss, type = "b", main = paste("Clusters of games"), xlab = "Number of clusters", ylab = "WCSS")
Based on the plot, I would select k=4 as the optimal number of clusters. The k represents the point where an increase in K doesn’t cause a significant reduction in the WCSS.
Next, I analyzed the cluster sizes and centers. The table below shows the cluster centers and size for each of the clusters.
kable(data.frame("Cluster" = c(1:4),"Centers" = round(game_clusters4$centers,2), "Size" = game_clusters4$size))
Cluster | Centers.Slots | Centers.BJ | Centers.Craps | Centers.Bac | Centers.Bingo | Centers.Poker | Centers.Other | Size |
---|---|---|---|---|---|---|---|---|
1 | -0.55 | -0.27 | -0.26 | -0.28 | -0.31 | -0.42 | -0.35 | 3220 |
2 | 1.01 | -0.01 | -0.09 | 0.03 | -0.32 | 1.65 | 1.52 | 1040 |
3 | 0.31 | -0.32 | -0.29 | -0.33 | 2.96 | -0.52 | -0.61 | 477 |
4 | 2.14 | 3.95 | 4.03 | 3.88 | -0.32 | -0.52 | -0.61 | 263 |
Some high level observations:
Next, I added the cluster number to the dataset, to assess any relevant insights that can link the different clusters to the total spent data. This would be very useful to make business decisions and make a decision on the targetting of the different groups.
Once the cluster number was added, I analyzed the relationship between cluster and total spend, which is presented in the table below.
#Add the assigned cluster for each player to the dataset
dset$Cluster <- game_clusters4$cluster
kable(aggregate(data = dset, Total_Spent ~ Cluster, FUN = function(format.mean){round(mean(format.mean),0)}), align = c('l','l', 'l'))
Cluster | Total_Spent |
---|---|
1 | 259 |
2 | 1860 |
3 | 499 |
4 | 9905 |
Conclusions:
Last, I included the scatterplot matrix for K=6 for every pair og games using the cars package, as indicated in the assignment. The graphs are shown below. I have also included the matrix of correlations for better interpretation.
scatterplotMatrix(~ Slots + BJ + Craps + Bac + Bingo + Poker + Other, data = dset, lty.smooth=2, main="Scatterplot matrix for pairs of games")
round(cor(dset[c("Slots", "BJ", "Craps", "Bac", "Bingo", "Poker", "Other")]),2)
## Slots BJ Craps Bac Bingo Poker Other
## Slots 1.00 0.54 0.53 0.53 0.10 0.31 0.39
## BJ 0.54 1.00 0.91 0.86 -0.10 -0.04 -0.04
## Craps 0.53 0.91 1.00 0.88 -0.09 -0.07 -0.08
## Bac 0.53 0.86 0.88 1.00 -0.10 -0.03 -0.03
## Bingo 0.10 -0.10 -0.09 -0.10 1.00 -0.16 -0.19
## Poker 0.31 -0.04 -0.07 -0.03 -0.16 1.00 0.53
## Other 0.39 -0.04 -0.08 -0.03 -0.19 0.53 1.00
The matrix of correlations shows some interesting insights, which are in line with the discussion of prior sections of the report. + For instance, we can observe that bingo has no correlation with any other game (or correlations are negative), with the exception of a slight positive correlation with slots. This is in line with the pattern identified in Cluster 3. Bottom line is that we can confirm that bingo players will typically not play any other games or will do it compartively less than other players. + We can also see a high correlation between Craps, BJ, Bac and Slots, which is in line with our findings from Cluster 4. + While slots has higher correlations with the games of Cluster 4, there is also a strong correlation with poker and other games and between these two, which is in line with our findings from Cluster 2.