Analytics Case Study

Introduction

Question 1
1. The MGM Resorts International casino marketing team wants to offer hotel room discounts to customers based on their past gaming behavior. For example, suppose there are only 5 customers and their past one-year history is available in the table below. Tell us, with analysis, which customers should get a hotel discount and how much discount would be appropriate. Make appropriate assumptions, and list them when you come to a conclusion.

install.packages(c("plotly", "knitr", "dplyr", "corrplot"))
library(dplyr)  # provides filter(), used in the analysis below

File Import
Load data set

Customer <- c('Alex', 'Bobby', 'Cindy', 'David', 'Emma')
NumberOfTrips <- c(15, 2, 8, 5, 2)
TotalSpend <- c(1500, 5000, 400, 150, 200)
Hotel_Night_Stays <- c(5,0,1,0,1)

HotelData <- data.frame(Customer, NumberOfTrips, TotalSpend, Hotel_Night_Stays)
Spend_per_trip <- HotelData$TotalSpend/HotelData$NumberOfTrips
HotelData <- data.frame(Customer, NumberOfTrips, TotalSpend, Hotel_Night_Stays, Spend_per_trip)
HotelData

##   Customer NumberOfTrips TotalSpend Hotel_Night_Stays Spend_per_trip
## 1     Alex            15       1500                 5            100
## 2    Bobby             2       5000                 0           2500
## 3    Cindy             8        400                 1             50
## 4    David             5        150                 0             30
## 5     Emma             2        200                 1            100

Observation
The individual trip amount spent by these customers is not provided, only total spend of customers is available.

General Assumption

The Total Spend Amount column is the amount spent by customers on Casino/Gaming only (i.e., this amount does not include the cost of staying at the hotel for a night).

Before we begin the analysis, the objective needs to be well defined.
The task is to identify customers who should be given a discount, but the business goal behind the discount is not stated. We need a clear objective from MGM Resorts:
What does MGM want to accomplish by offering a hotel room discount?
Goal 1 - reward customers who already play frequently, to keep them satisfied
Goal 2 - encourage existing customers to spend more
Goal 3 - encourage occasional customers to visit more often

Let’s begin by understanding the data better.
Spend per trip (calculated above) brings the spend of all customers onto the same scale.

# Mean
mean_num_of_trips <- round(mean(HotelData$NumberOfTrips))
mean_num_of_trips

## [1] 6

mean_spend_per_trip <- round(mean(HotelData$Spend_per_trip))
mean_spend_per_trip

## [1] 556

mean_hotel_night_stays <- round(mean(HotelData$Hotel_Night_Stays))
mean_hotel_night_stays

## [1] 1

Plotting the distribution

hist(HotelData$NumberOfTrips)

hist(HotelData$Spend_per_trip)

hist(HotelData$Hotel_Night_Stays)

With only five observations and one extreme value (Bobby's spend per trip), the data is skewed rather than normally distributed, so we prefer the median to the mean.

# Median
median_num_of_trips <- round(median(HotelData$NumberOfTrips))
median_num_of_trips

## [1] 5

median_spend_per_trip <- round(median(HotelData$Spend_per_trip))
median_spend_per_trip

## [1] 100

median_hotel_night_stays <- round(median(HotelData$Hotel_Night_Stays))
median_hotel_night_stays

## [1] 1
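
To make the gap between the two statistics concrete, the short sketch below recomputes both for spend per trip, with and without Bobby. It is only an illustration using the five values from the table above; the variable names are introduced here and are not part of the original analysis.

```r
# Spend per trip for the five customers (values from the table above)
spend_per_trip <- c(Alex = 100, Bobby = 2500, Cindy = 50, David = 30, Emma = 100)

mean_all   <- mean(spend_per_trip)    # pulled upward by Bobby's 2500
median_all <- median(spend_per_trip)  # unaffected by how large the top value is

# Drop the single extreme value and recompute both statistics
without_bobby <- spend_per_trip[names(spend_per_trip) != "Bobby"]
mean_rest     <- mean(without_bobby)
median_rest   <- median(without_bobby)

print(c(mean_all = mean_all, median_all = median_all,
        mean_rest = mean_rest, median_rest = median_rest))
```

Removing one customer moves the mean from 556 to 70, while the median only moves from 100 to 75, which is why the median is the more stable reference point here.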

To find the most valued customers, who should be offered a discount, 3 factors are considered.

Discount factor 1 - above-median spend per trip on gaming
Assumption 1 - If a customer spends more than the median amount per trip at the gaming tables, MGM should definitely grant that customer an incentive (a hotel room discount).
The comparison is strict, so customers exactly at the median (Alex and Emma, at 100 per trip) are not selected by this factor.

filter(HotelData, Spend_per_trip > median_spend_per_trip)

##   Customer NumberOfTrips TotalSpend Hotel_Night_Stays Spend_per_trip
## 1    Bobby             2       5000                 0           2500

Discount factor 2 - above-median number of trips to the Casino
Assumption 2 - A customer who makes more trips to the Casino than the median is a probable candidate for a discount.

filter(HotelData, NumberOfTrips > median_num_of_trips)

##   Customer NumberOfTrips TotalSpend Hotel_Night_Stays Spend_per_trip
## 1     Alex            15       1500                 5            100
## 2    Cindy             8        400                 1             50

Discount factor 3 - above-median number of night stays at the hotel
Assumption 3 - Customers who choose to stay repeatedly at the hotel are likely to have enjoyed the service, and hence are likely to spread positive word of mouth within their circle.
Such customers are also likely to have a positive outlook on the hotel’s Casino.
They would help MGM attract new customers and should be considered for a discount.
Calculating customers above the median number of night stays:

filter(HotelData, Hotel_Night_Stays > median_hotel_night_stays)

##   Customer NumberOfTrips TotalSpend Hotel_Night_Stays Spend_per_trip
## 1     Alex            15       1500                 5            100

Combined list of customers who should be offered a discount

filter(HotelData, Hotel_Night_Stays > median_hotel_night_stays | Spend_per_trip > median_spend_per_trip | NumberOfTrips > median_num_of_trips)

##   Customer NumberOfTrips TotalSpend Hotel_Night_Stays Spend_per_trip
## 1     Alex            15       1500                 5            100
## 2    Bobby             2       5000                 0           2500
## 3    Cindy             8        400                 1             50

To determine how much discount should be offered to these customers, we can assign weights to each of the discount factors considered above.
These weights can be adjusted based on the business scenarios and profitability required by MGM at a given time
Assumption 4
1. Amount spent per trip has been assigned as the highest factor (50%)
2. Number of trips to the Casino is chosen as the second most important attribute (30%)
3. Night Stays at Hotel is picked as the last attribute to be factored for discount (20%)

Creating a new column with above criteria and formula

HotelData$Discount_weight <- 0
HotelData$Discount_weight <- ifelse (HotelData$Spend_per_trip > median_spend_per_trip
                                     & HotelData$NumberOfTrips > median_num_of_trips
                                     & HotelData$Hotel_Night_Stays > median_hotel_night_stays, 1,
                                      ifelse(HotelData$Spend_per_trip > median_spend_per_trip
                                             & HotelData$NumberOfTrips > median_num_of_trips, 0.8,
                                      ifelse(HotelData$Spend_per_trip > median_spend_per_trip
                                             & HotelData$Hotel_Night_Stays > median_hotel_night_stays, 0.7,
                                      ifelse(HotelData$NumberOfTrips > median_num_of_trips
                                             & HotelData$Hotel_Night_Stays > median_hotel_night_stays, 0.5,
                                      ifelse(HotelData$Spend_per_trip > median_spend_per_trip, 0.5,
                                      ifelse(HotelData$NumberOfTrips > median_num_of_trips, 0.3,
                                      ifelse(HotelData$Hotel_Night_Stays > median_hotel_night_stays, 0.2,0
                                             )))))))

HotelData

##   Customer NumberOfTrips TotalSpend Hotel_Night_Stays Spend_per_trip
## 1     Alex            15       1500                 5            100
## 2    Bobby             2       5000                 0           2500
## 3    Cindy             8        400                 1             50
## 4    David             5        150                 0             30
## 5     Emma             2        200                 1            100
##   Discount_weight
## 1             0.5
## 2             0.5
## 3             0.3
## 4             0.0
## 5             0.0
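
The nested ifelse() above amounts to adding up the weights of whichever factors a customer satisfies (0.5 + 0.3 + 0.2 = 1, 0.5 + 0.3 = 0.8, and so on). As a sketch under that reading, the same column can be computed additively, which makes it easier to change the weights later. The helper `above_median()` and the `weights` vector are names introduced here for illustration only.

```r
# Rebuild the case-study data so the example is self-contained
HotelData <- data.frame(
  Customer          = c("Alex", "Bobby", "Cindy", "David", "Emma"),
  NumberOfTrips     = c(15, 2, 8, 5, 2),
  TotalSpend        = c(1500, 5000, 400, 150, 200),
  Hotel_Night_Stays = c(5, 0, 1, 0, 1)
)
HotelData$Spend_per_trip <- HotelData$TotalSpend / HotelData$NumberOfTrips

# Assumption 4 weights: spend per trip 50%, trips 30%, night stays 20%
weights <- c(spend = 0.5, trips = 0.3, nights = 0.2)

# Hypothetical helper: TRUE where a value is strictly above the column median
above_median <- function(x) x > median(x)

# Each satisfied factor contributes its weight; unsatisfied factors add 0
HotelData$Discount_weight <-
  weights[["spend"]]  * above_median(HotelData$Spend_per_trip) +
  weights[["trips"]]  * above_median(HotelData$NumberOfTrips) +
  weights[["nights"]] * above_median(HotelData$Hotel_Night_Stays)

print(HotelData[c("Customer", "Discount_weight")])
```

On this data set the additive form yields the same weights as the table above (0.5, 0.5, 0.3, 0, 0). The medians here are not rounded, which makes no difference because all three medians are whole numbers.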

Based on MGM’s cost of operations in running the Hotel & Casino, along with the profit margin that needs to be maintained, MGM Resorts would have a maximum discount % that it could offer to its customers.
Assumption 5 - Maximum discount - 30%

Discount <- 30

Populating the discount column

HotelData_With_Discount <- HotelData
HotelData_With_Discount$Discount_Percentage <- HotelData_With_Discount$Discount_weight * Discount

# Filter first, while the criteria columns still exist, then keep only the output columns
HotelData_With_Discount <- filter(HotelData_With_Discount, Hotel_Night_Stays > median_hotel_night_stays | Spend_per_trip > median_spend_per_trip | NumberOfTrips > median_num_of_trips)
HotelData_With_Discount <- HotelData_With_Discount[c("Customer", "Discount_Percentage")]

Final shortlisted customers who are selected for Hotel Night Stay discount and the proposed discount %

##   Customer Discount_Percentage
## 1     Alex                  15
## 2    Bobby                  15
## 3    Cindy                   9

Question 2
2. Calculate Regression Output using the information below: We would like to predict whether a teacher will convert. A conversion in this sense means that he/she responded to one of our marketing campaigns. The exhibit below presents a summary of the parameter estimates for a generalized linear model (GLM) with a log link function. The model predicts converting teachers. # of Years Taught at School and # of Children are numeric; the rest of the predictor variables are coded as categorical: the base class for ‘Lives in Low Population Density Area’ is No, the base class for Gender is Male, the base class for Class Taught is Not Science, the base class for Region is California, and the base class for ‘Presence of After School Program’ is No. What is the model’s predicted probability of converting for a male science teacher who has taught for 2 years, has 4 children, lives in California, does not live in a low population density area, and is at a school with an after school program? Please show your work and interpret any results.

Solution:

To predict whether a teacher will convert, we first construct the complete model equation from given coefficients and attributes.

Log(y) = -4.927
  - 0.332 x (# of Years Taught at School)
  + 0.370 x (# of Children)
  - 0.683 x (Lives in Low Population Density Area: Yes)
  + 1.538 x (Gender: Female)
  + 0.969 x (Class Taught: Science)
  - 0.490 x (Region: EAST_NORTH_CENTRAL)
  - 0.820 x (Region: EAST_SOUTH_CENTRAL)
  - 0.779 x (Region: MID_ATLANTIC)
  - 0.396 x (Region: MOUNTAIN)
  - 0.337 x (Region: NEW_ENGLAND)
  - 0.093 x (Region: PACIFIC_NW)
  - 0.344 x (Region: SOUTH_ATLANTIC)
  - 0.812 x (Region: WEST_NORTH_CENTRAL)
  - 1.004 x (Region: WEST_SOUTH_CENTRAL)
  + 0.646 x (Presence of After School Program: Yes)

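Next, we substitute the attribute values for the test scenario into the model.
The teacher falls in the base class for Gender (Male), Region (California) and Lives in Low Population Density Area (No), so those indicator terms are 0 and drop out.
Considering only the attributes with significant p-values:
Log(y) = -4.927 - 0.332x2 + 0.370x4 + 0.969x1 + 0.646x1
Log(y) = -4.927 - 0.664 + 1.480 + 0.969 + 0.646
Log(y) = -2.496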

Since the model is a GLM with a log link function, it has the standard form: log(y) = B0 + B1x

Our purpose is to find the probability that y is true for the given values of x.
To remove the log transformation, we take the exponential of both sides: p = exp(B0 + B1x). This is the inverse of the log link, not a sigmoid; a sigmoid would apply only if the model used a logit link.

Re-writing the model in new form:
p = exp(-2.496) = 0.082

Here, p is the probability of y being true, i.e., the probability that the teacher responded to one of the marketing campaigns.
p = p(y=1|x) = 0.0824
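
The arithmetic above can be checked with a few lines of R. This is a minimal sketch that uses only the five coefficients that remain after the base-class terms drop out; the vector names (`coefs`, `x`, `eta`) are introduced here for illustration.

```r
# Coefficients quoted in the model equation above (non-zero terms only)
coefs <- c(intercept = -4.927, years_taught = -0.332, children = 0.370,
           science = 0.969, after_school = 0.646)

# Test scenario: 2 years taught, 4 children, science teacher, after-school program.
# Male, California and "not low density" are base classes, so they contribute 0.
x <- c(intercept = 1, years_taught = 2, children = 4,
       science = 1, after_school = 1)

eta <- sum(coefs * x)  # linear predictor, i.e. Log(y)
p   <- exp(eta)        # inverse of the log link

print(round(c(linear_predictor = eta, probability = p), 4))
```

Note that with a log link exp() is not bounded above by 1, so for other covariate combinations the result should be checked before it is read as a probability.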

Assumption: The model threshold for probability is 0.5 for the teacher to convert.

So, for the given test instance, the model's predicted probability of converting for a male science teacher who has taught for 2 years, has 4 children, lives in California, does not live in a low population density area, and is at a school with an after-school program is 0.0824, well below the assumed 0.5 threshold.

In other words, this teacher is NOT likely to respond to the marketing campaign.

Accounting for Standard Error of these coefficients

The standard error tells us how uncertain each coefficient estimate is, so we apply the 2-standard-error rule (coefficient +/- 2 x standard error).
The refined model for the attributes with significant p-values would thus be:
Log(y) = (-4.927 +/- 2x0.185) + (-0.332 +/- 2x0.061)x2 + (0.370 +/- 2x0.035)x4 + (0.969 +/- 2x0.472)x1 + (0.646 +/- 2x0.186)x1

For 95% confidence interval level

2.5% Band (lower bound)
Log(y) = (-4.927 - 2x0.185) + (-0.332 - 2x0.061)x2 + (0.370 - 2x0.035)x4 + (0.969 - 2x0.472)x1 + (0.646 - 2x0.186)x1

Log(y) = (-4.927 - 0.370) - (0.332 + 0.122)x2 + (0.370 - 0.070)x4 + (0.969 - 0.944)x1 + (0.646 - 0.372)x1
Log(y) = -5.297 - 0.454x2 + 0.300x4 + 0.025x1 + 0.274x1
Log(y) = -5.297 - 0.908 + 1.200 + 0.025 + 0.274
Log(y) = -4.706
p = exp(-4.706) = 0.009

97.5% Band (upper bound)
Log(y) = (-4.927 + 2x0.185) + (-0.332 + 2x0.061)x2 + (0.370 + 2x0.035)x4 + (0.969 + 2x0.472)x1 + (0.646 + 2x0.186)x1

Log(y) = (-4.927 + 0.370) - (0.332 - 0.122)x2 + (0.370 + 0.070)x4 + (0.969 + 0.944)x1 + (0.646 + 0.372)x1
Log(y) = -4.557 - 0.210x2 + 0.440x4 + 1.913x1 + 1.018x1
Log(y) = -4.557 - 0.420 + 1.760 + 1.913 + 1.018
Log(y) = -0.286
p = exp(-0.286) = 0.751

Accounting for the standard error of each parameter at the 95% confidence level, we obtain an approximate probability range of 0.009 to 0.751.
This range is wide and lies on both sides of the assumed 0.5 threshold.
In other words, we cannot conclusively predict whether the teacher would convert for this marketing campaign with 95% confidence.
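
The two bands can be reproduced with the sketch below, which shifts every coefficient by two standard errors in the same direction. The coefficients and standard errors are the ones quoted above; the variable names are illustrative. Moving all coefficients to their extremes at once ignores any covariance between the estimates, so this should be read as a rough outer band rather than a formal confidence interval for the prediction.

```r
# Coefficients and standard errors quoted in the equations above
coefs   <- c(intercept = -4.927, years_taught = -0.332, children = 0.370,
             science = 0.969, after_school = 0.646)
std_err <- c(intercept = 0.185, years_taught = 0.061, children = 0.035,
             science = 0.472, after_school = 0.186)

# Attribute values for the test scenario (1 for the intercept)
x <- c(intercept = 1, years_taught = 2, children = 4,
       science = 1, after_school = 1)

# Every x is non-negative, so lowering (raising) all coefficients
# gives the lowest (highest) linear predictor
eta_low  <- sum((coefs - 2 * std_err) * x)
eta_high <- sum((coefs + 2 * std_err) * x)

p_low  <- exp(eta_low)
p_high <- exp(eta_high)

print(round(c(eta_low = eta_low, p_low = p_low,
              eta_high = eta_high, p_high = p_high), 3))
```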

Question 3
3. MGM would like to predict customers future spend using the data from problem 1. How would you predict these 5 customers next trip spend? Provide at least 2-3 solutions.

Solution:

Note: The data made available does not provide the spend of these customers per trip. A linear model cannot be effectively created with provided attributes, as the prediction has to be made for the same customer’s next trip spend. Without the distribution of individual trip spends, the model is very likely to estimate the same average amount for each customer.
So, we proceed by not building ML models and instead applying logic on the available dataset

Checking if there is a correlation between the total amount spent by customer and number of night stays at hotel.

Correlation

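library(corrplot)
# Exclude the non-numeric Customer column (1) and Discount_weight (6)
corr_matrix <- cor(HotelData[ -c(1,6) ])
corrplot(corr_matrix, method = "number")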

We observe no strong correlation between the total spend amount or spend per trip and nights stayed at the hotel. With only five customers, any correlation estimate is unreliable in any case.

Approach 1

Using the average amount spent per trip
This helps us gauge the average amount spent by the customer on all of his trips so far
We can expect our customers to spend this amount respectively in their next trips:

##   Customer Average_Spend_Per_Visit
## 1     Alex                     100
## 2    Bobby                    2500
## 3    Cindy                      50
## 4    David                      30
## 5     Emma                     100
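
The table above can be produced with a sketch like the following. The original code for this step is not shown, so the data frame name `Approach_1` is an assumption made here; the column name matches the output above.

```r
# Rebuild the inputs so the example is self-contained
HotelData <- data.frame(
  Customer      = c("Alex", "Bobby", "Cindy", "David", "Emma"),
  NumberOfTrips = c(15, 2, 8, 5, 2),
  TotalSpend    = c(1500, 5000, 400, 150, 200)
)

# Approach 1: predicted next-trip spend = historical average spend per trip
Approach_1 <- data.frame(
  Customer                = HotelData$Customer,
  Average_Spend_Per_Visit = HotelData$TotalSpend / HotelData$NumberOfTrips
)

print(Approach_1)
```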

Approach 2

Using the average amount spent by the customer per month over the last year.
We assume each customer has an annual gaming budget that is spread evenly across the 12 months.
If they return within a month, they are expected to spend the amounts below on their next trip:

HotelData_2 <- HotelData
HotelData_2$Average_Monthly_Spend <- round((HotelData_2$TotalSpend/12),2)
HotelData_2 <- HotelData_2[ -c(2,3:6) ]

##   Customer Average_Monthly_Spend
## 1     Alex                125.00
## 2    Bobby                416.67
## 3    Cindy                 33.33
## 4    David                 12.50
## 5     Emma                 16.67

Approach 3

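Creating an estimate range by permitting a variation of +/- 20% around the average spend per visit. This is a heuristic band rather than a statistical confidence interval, since individual trip spends are not available.
We expect the customers to spend an amount within the range below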

##   Customer Average_Spend_Per_Visit Lower_Estimate Higher_Estimate
## 1     Alex                     100             80             120
## 2    Bobby                    2500           2000            3000
## 3    Cindy                      50             40              60
## 4    David                      30             24              36
## 5     Emma                     100             80             120
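
A minimal sketch of how this range can be computed is shown below. The original code for this step is not shown, so the names `variation` and `Approach_3` are assumptions introduced here; the 20% figure is the one stated above and can be adjusted.

```r
# Average spend per visit from Approach 1
Customer                <- c("Alex", "Bobby", "Cindy", "David", "Emma")
Average_Spend_Per_Visit <- c(1500, 5000, 400, 150, 200) / c(15, 2, 8, 5, 2)

variation <- 0.20  # permitted variation around the average (assumed 20%)

Approach_3 <- data.frame(
  Customer                = Customer,
  Average_Spend_Per_Visit = Average_Spend_Per_Visit,
  Lower_Estimate          = Average_Spend_Per_Visit * (1 - variation),
  Higher_Estimate         = Average_Spend_Per_Visit * (1 + variation)
)

print(Approach_3)
```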

Analytics Case Study

Chaithanya Rao

February 25, 2018
