Introduction

The goal of this case study is to analyze if bobblehead promotions increase attendance at Dodgers home games. Using the fitted predictive model we can predict the attendance for the game in the forthcomming season and we can predict the attendance with or without bobblehead promotion.

The motivation of this case study is to:

Pre-Processing

Load the required libraries

library(car)  # Package with Special functions for linear regression
library(lattice)  # Graphics Package
library(ggplot2) # Graphical Package

Load the data into a data frame from the csv file. (This is just a sample data and not the entire dataset)

# Create a dataframe with Dodgers Data
DodgersData <- read.csv("DodgersData.csv")

Data Cleanup

Evaluate the Structure and Re-Level the factor variables for “Day Of Week”" and “Month”" in the right order

# Check the structure for Dorder Data
str(DodgersData)
## 'data.frame':    81 obs. of  12 variables:
##  $ month      : Factor w/ 7 levels "APR","AUG","JUL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day        : int  10 11 12 13 14 15 23 24 25 27 ...
##  $ attend     : int  56000 29729 28328 31601 46549 38359 26376 44014 26345 44807 ...
##  $ day_of_week: Factor w/ 7 levels "Friday","Monday",..: 6 7 5 1 3 4 2 6 7 1 ...
##  $ opponent   : Factor w/ 17 levels "Angels","Astros",..: 13 13 13 11 11 11 3 3 3 10 ...
##  $ temp       : int  67 58 57 54 57 65 60 63 64 66 ...
##  $ skies      : Factor w/ 2 levels "Clear ","Cloudy": 1 2 2 2 2 1 2 2 2 1 ...
##  $ day_night  : Factor w/ 2 levels "Day","Night": 1 2 2 2 2 1 2 2 2 2 ...
##  $ cap        : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shirt      : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fireworks  : Factor w/ 2 levels "NO","YES": 1 1 1 2 1 1 1 1 1 2 ...
##  $ bobblehead : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
# Evaluate the factor levels for day_of_week
# levels(DodgersData$day_of_week)

# Reorder the factor levels for day_of_week
DodgersData$day_of_week <- factor(DodgersData$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

# Evaluate the factor levels for month
# levels(DodgersData$month)

# Reorder the factor levels for month
DodgersData$month <- factor(DodgersData$month, levels = c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT"))

Explore Data Analysis

The sample Major League Baseball data from 2012 season shown below

# First 10 rows of the data frame
head(DodgersData, 10)
##    month day attend day_of_week  opponent temp  skies day_night cap shirt
## 1    APR  10  56000     Tuesday   Pirates   67 Clear        Day  NO    NO
## 2    APR  11  29729   Wednesday   Pirates   58 Cloudy     Night  NO    NO
## 3    APR  12  28328    Thursday   Pirates   57 Cloudy     Night  NO    NO
## 4    APR  13  31601      Friday    Padres   54 Cloudy     Night  NO    NO
## 5    APR  14  46549    Saturday    Padres   57 Cloudy     Night  NO    NO
## 6    APR  15  38359      Sunday    Padres   65 Clear        Day  NO    NO
## 7    APR  23  26376      Monday    Braves   60 Cloudy     Night  NO    NO
## 8    APR  24  44014     Tuesday    Braves   63 Cloudy     Night  NO    NO
## 9    APR  25  26345   Wednesday    Braves   64 Cloudy     Night  NO    NO
## 10   APR  27  44807      Friday Nationals   66 Clear      Night  NO    NO
##    fireworks bobblehead
## 1         NO         NO
## 2         NO         NO
## 3         NO         NO
## 4        YES         NO
## 5         NO         NO
## 6         NO         NO
## 7         NO         NO
## 8         NO         NO
## 9         NO         NO
## 10       YES         NO

The data shows that in 2012 there were a few promotions

We have data from April to October for games played in the Day or Night under Clear or Cloudy Skys.

Dodger Stadium has a capacity of about 56,000. Looking at the entire (sample) data shows that the stadium filled up only twice in 2012. There were only two cap promotions, three shirt promotions - not enough data for any inferences. Fireworks and Bobblehead promotions have happened a few times.

Further more there were eleven bobble head promotions and most of then (six) being on Tuesday nights.

Attendance by Day Of Week

## Box plot to explore attendance by day of week
plot(DodgersData$day_of_week, DodgersData$attend / 1000, main = "Dodgers Attendence By Day Of Week", xlab = "Day of Week", ylab = "Attendance (thousands)", col = "violet", las = 1)

Tuesday seems to be the high attendance day as the median line seems to be the highest. Can this be because of the promotion that happened mostly Tuesdays? Let us see.

Attendance by Month

## Box plot to explore attendance by Month
plot(DodgersData$month, DodgersData$attend / 1000, main = "Dodgers Attendence By Month", xlab = "Month", 
ylab = "Attendance (thousands)", col = "light blue", las = 1)

June seems to be the month of high attendance as the median line is the highest.

Evaluate Attendance by Weather

#Evaluate attendance by weather
ggplot(DodgersData, aes(x=temp, y=attend/1000, color=fireworks)) + 
        geom_point() + 
        facet_wrap(day_night~skies) + 
        ggtitle("Dodgers Attendance By Temperature By Time of Game and Skies") +
        theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=10)) +
        xlab("Temperature (Degree Farenheit)") +
        ylab("Attendance (Thousands)")

On day games with clear skies we see that when the temperature the attendance seems to be low. Also only one day game was played under cloudy skies.

Strip Plot of Attendance by opponent or visiting team

#Strip Plot of Attendance by opponent or visiting team
ggplot(DodgersData, aes(x=attend/1000, y=opponent, color=day_night)) + 
        geom_point() + 
        ggtitle("Dodgers Attendance By Opponent") +
        theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=10)) +
        xlab("Attendance (Thousands)") +
        ylab("Opponent (Visiting Team)")

Games with opponents from large metropolitan areas show higher attendance consistently.

Design Predictive Model

To advise the management if promotions impact attendance we will need to identify if there is a positive effect, and if there is a positive effect how much of an effect it is.

To provide this advice I built a Linear Model for predicting attendance using Month, Day Of Week and the indicator variable Bobblehead promotion. I split the data into Training and Test to create the model

# Create a model with the bobblehead variable entered last
my.model <- {attend ~ month + day_of_week + bobblehead}

# Prepare a Training and Test dataset

# Reseed for repeatability
set.seed(1234)

training_test <- c(rep(1, trunc((2/3)*nrow(DodgersData))), rep(2, trunc((1/3)*nrow(DodgersData))))

# sample(training_test)

# Create a variable in DodgersData data frame to identify Test and Training row
DodgersData$Training_Test <- sample(training_test)

DodgersData$Training_Test <- factor(DodgersData$Training_Test, levels = c(1, 2), labels = c("TRAIN", "TEST"))


DodgersData.Train <- subset(DodgersData, Training_Test == "TRAIN")
DodgersData.Test <- subset(DodgersData, Training_Test == "TEST")

#head(DodgersData.Train)
#head(DodgersData.Test)

Fit the model on the Training Data and predict the model on the test set

# Fit model to training set
train.model.fit <- lm(my.model, data = DodgersData.Train)

# Predict from Training Set
DodgersData.Train$Predict_Attend <- predict(train.model.fit)

# Evaluate The Fitted Model on the Test Set
DodgersData.Test$Predict_Attend <- predict(train.model.fit, newdata = DodgersData.Test)

#round(cor(DodgersData.Test$attend, DodgersData.Test$Predict_Attend)^2, digits=3)

# compute the proportion of response variance accounted for when predicting Test Data
cat("\n","Proportion of Test Set Variance Accounted for: ", round(cor(DodgersData.Test$attend, DodgersData.Test$Predict_Attend)^2, digits=3), "\n", sep="")
## 
## Proportion of Test Set Variance Accounted for: 0.453
DodgersData.Training_Test <- rbind(DodgersData.Train, DodgersData.Test)

Scatter Plot to evaluate fit of the model on the test data

ggplot(DodgersData.Training_Test, aes(x=attend/1000, y=Predict_Attend/1000, color=bobblehead)) + 
        geom_point() + 
        geom_line(data = data.frame(x = c(25,60), y = c(25,60)), aes(x = x, y = y), colour = "red") +
        facet_wrap(~Training_Test) +
        #geom_smooth(method = "lm", se=FALSE) +
        ggtitle("Regression Model Performance : Bobblehead and Attendance") +
        theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=10)) +
        xlab("Actual Attendance (Thousands)") +
        ylab("Predicted Attendance (Thousands)")

The scatter plot displays that tmodel fit to the training set holds up when used with the test data.

Run the model on the entire Dodgers Data Set

# use the full data set to obtain an estimate of the increase in
# attendance due to bobbleheads, controlling for other factors
my.model.fit <- lm(my.model, data = DodgersData)  # use all available data
print(summary(my.model.fit))
## 
## Call:
## lm(formula = my.model, data = DodgersData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10786.5  -3628.1   -516.1   2230.2  14351.0 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          33909.16    2521.81  13.446  < 2e-16 ***
## monthMAY             -2385.62    2291.22  -1.041  0.30152    
## monthJUN              7163.23    2732.72   2.621  0.01083 *  
## monthJUL              2849.83    2578.60   1.105  0.27303    
## monthAUG              2377.92    2402.91   0.990  0.32593    
## monthSEP                29.03    2521.25   0.012  0.99085    
## monthOCT              -662.67    4046.45  -0.164  0.87041    
## day_of_weekTuesday    7911.49    2702.21   2.928  0.00466 ** 
## day_of_weekWednesday  2460.02    2514.03   0.979  0.33134    
## day_of_weekThursday    775.36    3486.15   0.222  0.82467    
## day_of_weekFriday     4883.82    2504.65   1.950  0.05537 .  
## day_of_weekSaturday   6372.06    2552.08   2.497  0.01500 *  
## day_of_weekSunday     6724.00    2506.72   2.682  0.00920 ** 
## bobbleheadYES        10714.90    2419.52   4.429 3.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.456 
## F-statistic: 6.158 on 13 and 67 DF,  p-value: 2.083e-07
cat("\n","Estimated Effect of Bobblehead Promotion on Attendance: ", round(my.model.fit$coefficients[length(my.model.fit$coefficients)], digits = 0),"\n",sep="")
## 
## Estimated Effect of Bobblehead Promotion on Attendance: 10715

From the stats above we see that there is a positive impact on bobblehead promotions on attendance with a potential increase in attendance by 10,715 fans per game.

Inference

Looking at the fitted model we can predict the attendance for the game in forthcomming season with and without the bobblehead promotions. If we combine the financial criteria along with the statistical findings we can compute the Dodgers revenue with and without the bobble head promotion.

Reference

Modeling Techniques in Predictive Analysis - Thomas W. Miller

“Thanks for this wonderful book.”