This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
It is tough to make good predictions. The numerous factors or variables, independent and dependent, involved in many sporting events contribute to the unpredictability. However, using carefully-selected variables, it is still possible to make marketing promotions more accountable.
The goal of this case study is to analyze if bobblehead promotions increase attendance at Dodgers home games. Using the fitted predictive model, we can predict the attendance for the game in the forthcoming season and we can predict the attendance with or without bobblehead promotion.
The motivation of this case study is to design a predictive model, and report any interesting findings to support critical business decision making.
Important Tips: If you use the desktop version of R, please make sure to reset your working directory before performing the analysis.
Load the required libraries and the data
#rm(list=ls())# clear memory
#setwd("C:/Users/zxu3/Documents/R/regression")
library(lattice) # Graphics Package
library(ggplot2) # Graphical Package
#Create a dataframe with the Dodgers Data - if you import the data from your own drive
#DodgersData <- read.csv("DodgersData.csv")
library(readr)
#adding a hashtag to the beginning of a line of syntax allows you to take notes or add descriptions.
#Now upload the following dataset to your work environment.
DodgersData <- read_csv("https://raw.githubusercontent.com/utjimmyx/regression/master/DodgersData.csv")
## `curl` package not installed, falling back to using `url()`
## Rows: 81 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): month, day_of_week, opponent, skies, day_night, cap, shirt, firewor...
## dbl (3): day, attend, temp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Alternatively, you can read the data from your computer
Evaluate the Structure and Re-Level the factor variables for “Day Of Week”” and “Month”” in the right order
# Check the structure for Dorder Data
str(DodgersData)
## spc_tbl_ [81 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ month : chr [1:81] "APR" "APR" "APR" "APR" ...
## $ day : num [1:81] 10 11 12 13 14 15 23 24 25 27 ...
## $ attend : num [1:81] 56000 29729 28328 31601 46549 ...
## $ day_of_week: chr [1:81] "Tuesday" "Wednesday" "Thursday" "Friday" ...
## $ opponent : chr [1:81] "Pirates" "Pirates" "Pirates" "Padres" ...
## $ temp : num [1:81] 67 58 57 54 57 65 60 63 64 66 ...
## $ skies : chr [1:81] "Clear" "Cloudy" "Cloudy" "Cloudy" ...
## $ day_night : chr [1:81] "Day" "Night" "Night" "Night" ...
## $ cap : chr [1:81] "NO" "NO" "NO" "NO" ...
## $ shirt : chr [1:81] "NO" "NO" "NO" "NO" ...
## $ fireworks : chr [1:81] "NO" "NO" "NO" "YES" ...
## $ bobblehead : chr [1:81] "NO" "NO" "NO" "NO" ...
## - attr(*, "spec")=
## .. cols(
## .. month = col_character(),
## .. day = col_double(),
## .. attend = col_double(),
## .. day_of_week = col_character(),
## .. opponent = col_character(),
## .. temp = col_double(),
## .. skies = col_character(),
## .. day_night = col_character(),
## .. cap = col_character(),
## .. shirt = col_character(),
## .. fireworks = col_character(),
## .. bobblehead = col_character()
## .. )
## - attr(*, "problems")=<pointer: 0x5d9f613eec00>
head(DodgersData)
## # A tibble: 6 × 12
## month day attend day_of_week opponent temp skies day_night cap shirt
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 APR 10 56000 Tuesday Pirates 67 Clear Day NO NO
## 2 APR 11 29729 Wednesday Pirates 58 Cloudy Night NO NO
## 3 APR 12 28328 Thursday Pirates 57 Cloudy Night NO NO
## 4 APR 13 31601 Friday Padres 54 Cloudy Night NO NO
## 5 APR 14 46549 Saturday Padres 57 Cloudy Night NO NO
## 6 APR 15 38359 Sunday Padres 65 Clear Day NO NO
## # ℹ 2 more variables: fireworks <chr>, bobblehead <chr>
# Evaluate the factor levels for day_of_week
# levels(DodgersData$day_of_week)
# Evaluate the factor levels for month
unique(DodgersData$month)
## [1] "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT"
# First 10 rows of the data frame
head(DodgersData, 10)
## # A tibble: 10 × 12
## month day attend day_of_week opponent temp skies day_night cap shirt
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 APR 10 56000 Tuesday Pirates 67 Clear Day NO NO
## 2 APR 11 29729 Wednesday Pirates 58 Cloudy Night NO NO
## 3 APR 12 28328 Thursday Pirates 57 Cloudy Night NO NO
## 4 APR 13 31601 Friday Padres 54 Cloudy Night NO NO
## 5 APR 14 46549 Saturday Padres 57 Cloudy Night NO NO
## 6 APR 15 38359 Sunday Padres 65 Clear Day NO NO
## 7 APR 23 26376 Monday Braves 60 Cloudy Night NO NO
## 8 APR 24 44014 Tuesday Braves 63 Cloudy Night NO NO
## 9 APR 25 26345 Wednesday Braves 64 Cloudy Night NO NO
## 10 APR 27 44807 Friday Nationals 66 Clear Night NO NO
## # ℹ 2 more variables: fireworks <chr>, bobblehead <chr>
DodgersData[20, c("temp", "attend", "opponent", "bobblehead")]
## # A tibble: 1 × 4
## temp attend opponent bobblehead
## <dbl> <dbl> <chr> <chr>
## 1 70 47077 Snakes YES
meanattend <- mean(DodgersData$attend)
meanattend
## [1] 41040.07
promotions <- sum(DodgersData$bobblehead=="YES")
promotions
## [1] 11
## in-class notes
The results show that in 2012 there were a few promotions (see the last four columns)
Cap Shirt Fireworks Bobblehead
We have data from April to October for games played in the Day or Night under Clear or Cloudy Skys.
Dodger Stadium has a capacity of about 56,000. Looking at the entire (sample) data shows that the stadium filled up only twice in 2012. There were only two cap promotions, three shirt promotions - not enough data for any inferences. Fireworks and Bobblehead promotions have happened a few times.
Further more there were eleven bobble head promotions and most of then (six) being on Tuesday nights.
#Evaluate attendance by weather
ggplot(DodgersData, aes(x=temp, y=attend/1000, color=fireworks)) +
geom_point() +
facet_wrap(day_night~skies) +
ggtitle("Dodgers Attendance By Temperature By Time of Game and Skies") +
theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=10)) +
xlab("Temperature (Degree Farenheit)") +
ylab("Attendance (Thousands)")
#Strip Plot of Attendance by opponent or visiting team
ggplot(DodgersData, aes(x=attend/1000, y=opponent, color=day_night)) +
geom_point() +
ggtitle("Dodgers Attendance By Opponent") +
theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=10)) +
xlab("Attendance (Thousands)") +
ylab("Opponent (Visiting Team)")
To advise the management if promotions impact attendance we will need to identify if there is a positive effect, and if there is a positive effect how much of an effect it is.
# Create a model with the bobblehead variable entered last
my.model <- {attend ~ month + day_of_week + bobblehead}
# use the full data set to obtain an estimate of the increase in
# attendance due to bobbleheads, controlling for other factors
my.model.fit <- lm(my.model, data = DodgersData) # use all available data
print(summary(my.model.fit))
##
## Call:
## lm(formula = my.model, data = DodgersData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10786.5 -3628.1 -516.1 2230.2 14351.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38792.98 2364.68 16.405 < 2e-16 ***
## monthAUG 2377.92 2402.91 0.990 0.3259
## monthJUL 2849.83 2578.60 1.105 0.2730
## monthJUN 7163.23 2732.72 2.621 0.0108 *
## monthMAY -2385.62 2291.22 -1.041 0.3015
## monthOCT -662.67 4046.45 -0.164 0.8704
## monthSEP 29.03 2521.25 0.012 0.9908
## day_of_weekMonday -4883.82 2504.65 -1.950 0.0554 .
## day_of_weekSaturday 1488.24 2442.68 0.609 0.5444
## day_of_weekSunday 1840.18 2426.79 0.758 0.4509
## day_of_weekThursday -4108.45 3381.22 -1.215 0.2286
## day_of_weekTuesday 3027.68 2686.43 1.127 0.2638
## day_of_weekWednesday -2423.80 2485.46 -0.975 0.3330
## bobbleheadYES 10714.90 2419.52 4.429 3.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.456
## F-statistic: 6.158 on 13 and 67 DF, p-value: 2.083e-07
# Re-level factors for interpretability
DodgersData$month <- factor(DodgersData$month,
levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))
DodgersData$day_of_week <- factor(DodgersData$day_of_week,
levels = c("Monday","Tuesday","Wednesday",
"Thursday","Friday","Saturday","Sunday"))
# Define model formula
my.model2 <- attend ~ month + day_of_week + bobblehead
my.model.fit2 <-lm(my.model2, data = DodgersData)
print(summary(my.model.fit2))
##
## Call:
## lm(formula = my.model2, data = DodgersData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10786.5 -3628.1 -516.1 2230.2 14351.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33909.16 2521.81 13.446 < 2e-16 ***
## monthMAY -2385.62 2291.22 -1.041 0.30152
## monthJUN 7163.23 2732.72 2.621 0.01083 *
## monthJUL 2849.83 2578.60 1.105 0.27303
## monthAUG 2377.92 2402.91 0.990 0.32593
## monthSEP 29.03 2521.25 0.012 0.99085
## monthOCT -662.67 4046.45 -0.164 0.87041
## day_of_weekTuesday 7911.49 2702.21 2.928 0.00466 **
## day_of_weekWednesday 2460.02 2514.03 0.979 0.33134
## day_of_weekThursday 775.36 3486.15 0.222 0.82467
## day_of_weekFriday 4883.82 2504.65 1.950 0.05537 .
## day_of_weekSaturday 6372.06 2552.08 2.497 0.01500 *
## day_of_weekSunday 6724.00 2506.72 2.682 0.00920 **
## bobbleheadYES 10714.90 2419.52 4.429 3.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.456
## F-statistic: 6.158 on 13 and 67 DF, p-value: 2.083e-07
This model adds fireworks as a second promotion variable. The goal is to test whether fireworks promotions help explain attendance after controlling for month, day of week, and bobblehead promotions.
# 1: Inspect the fireworks variable
table(DodgersData$fireworks)
##
## NO YES
## 67 14
str(DodgersData$fireworks)
## chr [1:81] "NO" "NO" "NO" "YES" "NO" "NO" "NO" "NO" "NO" "YES" "NO" "NO" ...
# 2: Make sure promotion variables are factors
DodgersData$bobblehead <- factor(DodgersData$bobblehead)
DodgersData$fireworks <- factor(DodgersData$fireworks)
# 3: Set "NO" as the reference level for promotion variables
DodgersData$bobblehead <- relevel(DodgersData$bobblehead, ref = "NO")
DodgersData$fireworks <- relevel(DodgersData$fireworks, ref = "NO")
# 4: Define Model 3 with fireworks added
my.model3 <- attend ~ month + day_of_week + bobblehead + fireworks
# 5: Fit Model 3 using lm()
my.model.fit3 <- lm(my.model3, data = DodgersData)
# 6: Print Model 3 results
print(summary(my.model.fit3))
##
## Call:
## lm(formula = my.model3, data = DodgersData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9504 -3683 -709 2569 15390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34321.68 2418.90 14.189 < 2e-16 ***
## monthMAY -2492.79 2193.60 -1.136 0.25990
## monthJUN 7062.62 2616.13 2.700 0.00881 **
## monthJUL 1315.38 2534.42 0.519 0.60549
## monthAUG 2377.88 2300.15 1.034 0.30501
## monthSEP -55.37 2413.63 -0.023 0.98177
## monthOCT -502.88 3873.86 -0.130 0.89711
## day_of_weekTuesday 7750.46 2587.35 2.996 0.00386 **
## day_of_weekWednesday 904.16 2476.14 0.365 0.71617
## day_of_weekThursday 309.24 3341.63 0.093 0.92655
## day_of_weekFriday -12386.23 6901.86 -1.795 0.07729 .
## day_of_weekSaturday 6094.17 2445.16 2.492 0.01521 *
## day_of_weekSunday 6577.96 2400.14 2.741 0.00788 **
## bobbleheadYES 10995.05 2318.43 4.742 1.17e-05 ***
## fireworksYES 17028.78 6381.63 2.668 0.00958 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5858 on 66 degrees of freedom
## Multiple R-squared: 0.5887, Adjusted R-squared: 0.5015
## F-statistic: 6.749 on 14 and 66 DF, p-value: 2.848e-08
# 7: Compare Model 2 (no fireworks) and Model 3 (fireworks added) adjusted R-squared
summary(my.model.fit2)$adj.r.squared
## [1] 0.4559652
summary(my.model.fit3)$adj.r.squared
## [1] 0.5015025
# Building block 8: Compare nested models to test whether fireworks improves fit
anova(my.model.fit2, my.model.fit3)
## Analysis of Variance Table
##
## Model 1: attend ~ month + day_of_week + bobblehead
## Model 2: attend ~ month + day_of_week + bobblehead + fireworks
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 67 2509574563
## 2 66 2265194779 1 244379784 7.1204 0.009581 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The bonus question asks us to predict attendance for a Saturday
afternoon game in July with no promotion. Model 3 includes
month, day_of_week, bobblehead,
and fireworks, the prediction will use July, Saturday, no
bobblehead, and no fireworks. Because Model 3 does not include a
day_night variable, the “afternoon” detail is treated as a
model limitation rather than a direct input to the prediction.
# Create a new game scenario for the bonus prediction
bonus_game <- data.frame(
month = factor("JUL", levels = levels(DodgersData$month)),
day_of_week = factor("Saturday", levels = levels(DodgersData$day_of_week)),
bobblehead = factor("NO", levels = levels(DodgersData$bobblehead)),
fireworks = factor("NO", levels = levels(DodgersData$fireworks))
)
# Check the prediction scenario
bonus_game
## month day_of_week bobblehead fireworks
## 1 JUL Saturday NO NO
# Predict attendance using Model 3
bonus_prediction <- predict(
my.model.fit3,
newdata = bonus_game,
interval = "prediction"
)
# View predicted attendance
bonus_prediction
## fit lwr upr
## 1 41731.23 28987.54 54474.92
Interpretation: Using Model 3, the predicted
attendance for a Saturday game in July with no bobblehead or fireworks
promotion is approximately 41,731 fans. The prediction
interval ranges from about 28,988 to 54,475 fans. This
means the model’s best estimate is around 41,731 attendees, but actual
attendance for one comparable game could reasonably fall within that
wider range. Because Model 3 does not include a day_night
or afternoon variable, the afternoon detail is not directly included in
this prediction.
# Pull the model prediction values from the bonus prediction object
bonus_fit <- as.numeric(bonus_prediction[1, "fit"])
bonus_lwr <- as.numeric(bonus_prediction[1, "lwr"])
bonus_upr <- as.numeric(bonus_prediction[1, "upr"])
# Filter historical games that match the bonus scenario as closely as Model 3 allows
bonus_history <- subset(
DodgersData,
month == "JUL" &
day_of_week == "Saturday" &
bobblehead == "NO" &
fireworks == "NO"
)
# Create a small data frame for the prediction interval
prediction_strip <- data.frame(
Scenario = "Bonus prediction range",
fit = bonus_fit,
lwr = bonus_lwr,
upr = bonus_upr
)
# Build the visualization
ggplot() +
geom_segment(
data = prediction_strip,
aes(
x = lwr,
xend = upr,
y = Scenario,
yend = Scenario
),
linewidth = 8,
alpha = 0.55,
color = "#005A9C"
) +
geom_point(
data = prediction_strip,
aes(
x = fit,
y = Scenario
),
size = 6,
shape = 23,
fill = "white",
color = "black",
stroke = 1.2
) +
geom_vline(
xintercept = bonus_fit,
linetype = "dotted",
color = "red",
linewidth = 1.2
) +
geom_point(
data = bonus_history,
aes(
x = attend,
y = "Historical comparable games"
),
position = position_jitter(height = 0.08, width = 0),
alpha = 0.65,
size = 2
) +
scale_x_continuous(labels = scales::comma) +
labs(
title = "Bonus Prediction Landing Strip",
subtitle = stringr::str_wrap(
"Predicted attendance range for a July Saturday game with no bobblehead or fireworks promotion",
width = 60
),
x = "Attendance",
y = NULL,
caption = "Dodger blue strip = prediction interval. White diamond = predicted attendance. Red dotted line = model estimate."
) +
theme_minimal()
## Business Recommendation
Based on the regression results, bobblehead promotions appear to be a stronger positive attendance driver than fireworks promotions after controlling for month and day of week.
The bonus prediction estimates attendance of about 41,731 fans for a July Saturday game with no bobblehead or fireworks promotion, with a wide prediction interval that shows individual game attendance can vary substantially. A comparable single game could reasonably fall between about 28,988 and 54,475 fans.
From a marketing perspective, Dodgers management should prioritize bobblehead promotions for games where higher attendance is strategically important, while treating fireworks as a secondary promotion whose value should be evaluated alongside each promotions respective cost, scheduling, and fan-experience goals(i.e. Ensuring Fireworks on the 4th of July).
Xu, Z. “DodgersData.csv.” GitHub regression dataset. https://raw.githubusercontent.com/utjimmyx/regression/master/DodgersData.csv
RStudio / Posit. “R Markdown.” https://rmarkdown.rstudio.com/
I used ChatGPT to support R debugging, code organization, and interpretation refinement. All analysis, interpretation, and conclusions are my own.