Lab 2 - EDA and Marketing Response ModelingDodgers

Roci Barnes

2026-06-27

knitr::include_graphics("Dodgers.png")

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Introduction

It is tough to make good predictions. The numerous factors or variables, independent and dependent, involved in many sporting events contribute to the unpredictability. However, using carefully-selected variables, it is still possible to make marketing promotions more accountable.

The goal of this case study is to analyze if bobblehead promotions increase attendance at Dodgers home games. Using the fitted predictive model, we can predict the attendance for the game in the forthcoming season and we can predict the attendance with or without bobblehead promotion.

The motivation of this case study is to design a predictive model, and report any interesting findings to support critical business decision making.

Pre-Processing

Important Tips: If you use the desktop version of R, please make sure to reset your working directory before performing the analysis.

Load the required libraries and the data

#rm(list=ls())# clear memory

#setwd("C:/Users/zxu3/Documents/R/regression") 

library(lattice) # Graphics Package
library(ggplot2) # Graphical Package

#Create a dataframe with the Dodgers Data  - if you import the data from your own drive
#DodgersData <- read.csv("DodgersData.csv")

 

 

library(readr)

#adding a hashtag to the beginning of a line of syntax allows you to take notes or add descriptions.

#Now upload the following dataset to your work environment. 

DodgersData <- read_csv("https://raw.githubusercontent.com/utjimmyx/regression/master/DodgersData.csv")
## Rows: 81 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): month, day_of_week, opponent, skies, day_night, cap, shirt, firewor...
## dbl (3): day, attend, temp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Alternatively, you can read the data from your computer

Data cleanup and exploratory analysis

Evaluate the Structure and Re-Level the factor variables for “Day Of Week”” and “Month”” in the right order

# Check the structure for Dorder Data
str(DodgersData)
## spc_tbl_ [81 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ month      : chr [1:81] "APR" "APR" "APR" "APR" ...
##  $ day        : num [1:81] 10 11 12 13 14 15 23 24 25 27 ...
##  $ attend     : num [1:81] 56000 29729 28328 31601 46549 ...
##  $ day_of_week: chr [1:81] "Tuesday" "Wednesday" "Thursday" "Friday" ...
##  $ opponent   : chr [1:81] "Pirates" "Pirates" "Pirates" "Padres" ...
##  $ temp       : num [1:81] 67 58 57 54 57 65 60 63 64 66 ...
##  $ skies      : chr [1:81] "Clear" "Cloudy" "Cloudy" "Cloudy" ...
##  $ day_night  : chr [1:81] "Day" "Night" "Night" "Night" ...
##  $ cap        : chr [1:81] "NO" "NO" "NO" "NO" ...
##  $ shirt      : chr [1:81] "NO" "NO" "NO" "NO" ...
##  $ fireworks  : chr [1:81] "NO" "NO" "NO" "YES" ...
##  $ bobblehead : chr [1:81] "NO" "NO" "NO" "NO" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   month = col_character(),
##   ..   day = col_double(),
##   ..   attend = col_double(),
##   ..   day_of_week = col_character(),
##   ..   opponent = col_character(),
##   ..   temp = col_double(),
##   ..   skies = col_character(),
##   ..   day_night = col_character(),
##   ..   cap = col_character(),
##   ..   shirt = col_character(),
##   ..   fireworks = col_character(),
##   ..   bobblehead = col_character()
##   .. )
##  - attr(*, "problems")=<pointer: 0x611b33c7be20>
head(DodgersData)
## # A tibble: 6 × 12
##   month   day attend day_of_week opponent  temp skies  day_night cap   shirt
##   <chr> <dbl>  <dbl> <chr>       <chr>    <dbl> <chr>  <chr>     <chr> <chr>
## 1 APR      10  56000 Tuesday     Pirates     67 Clear  Day       NO    NO   
## 2 APR      11  29729 Wednesday   Pirates     58 Cloudy Night     NO    NO   
## 3 APR      12  28328 Thursday    Pirates     57 Cloudy Night     NO    NO   
## 4 APR      13  31601 Friday      Padres      54 Cloudy Night     NO    NO   
## 5 APR      14  46549 Saturday    Padres      57 Cloudy Night     NO    NO   
## 6 APR      15  38359 Sunday      Padres      65 Clear  Day       NO    NO   
## # ℹ 2 more variables: fireworks <chr>, bobblehead <chr>
# Evaluate the factor levels for day_of_week
# levels(DodgersData$day_of_week)

 

# Evaluate the factor levels for month
levels(DodgersData$month)
## NULL
# First 10 rows of the data frame
head(DodgersData, 10)
## # A tibble: 10 × 12
##    month   day attend day_of_week opponent   temp skies  day_night cap   shirt
##    <chr> <dbl>  <dbl> <chr>       <chr>     <dbl> <chr>  <chr>     <chr> <chr>
##  1 APR      10  56000 Tuesday     Pirates      67 Clear  Day       NO    NO   
##  2 APR      11  29729 Wednesday   Pirates      58 Cloudy Night     NO    NO   
##  3 APR      12  28328 Thursday    Pirates      57 Cloudy Night     NO    NO   
##  4 APR      13  31601 Friday      Padres       54 Cloudy Night     NO    NO   
##  5 APR      14  46549 Saturday    Padres       57 Cloudy Night     NO    NO   
##  6 APR      15  38359 Sunday      Padres       65 Clear  Day       NO    NO   
##  7 APR      23  26376 Monday      Braves       60 Cloudy Night     NO    NO   
##  8 APR      24  44014 Tuesday     Braves       63 Cloudy Night     NO    NO   
##  9 APR      25  26345 Wednesday   Braves       64 Cloudy Night     NO    NO   
## 10 APR      27  44807 Friday      Nationals    66 Clear  Night     NO    NO   
## # ℹ 2 more variables: fireworks <chr>, bobblehead <chr>

Let R identify the temperature, the attendance, the opponent, and the promotion (i.e., bobblehead) for the 20th home game of the season.

DodgersData[20, c("temp", "attend", "opponent", "bobblehead")]
## # A tibble: 1 × 4
##    temp attend opponent bobblehead
##   <dbl>  <dbl> <chr>    <chr>     
## 1    70  47077 Snakes   YES

Let R identify the average value for attendance.

meanattend <- mean(DodgersData$attend)
meanattend
## [1] 41040.07

Let R identify the number of promotions

promotions <- sum(DodgersData$bobblehead=="YES")
promotions
## [1] 11
## in-class notes

Exploratory analysis

The results show that in 2012 there were a few promotions (see the last four columns)

Cap Shirt Fireworks Bobblehead

We have data from April to October for games played in the Day or Night under Clear or Cloudy Skys.

Dodger Stadium has a capacity of about 56,000. Looking at the entire (sample) data shows that the stadium filled up only twice in 2012. There were only two cap promotions, three shirt promotions - not enough data for any inferences. Fireworks and Bobblehead promotions have happened a few times.

Further more there were eleven bobble head promotions and most of then (six) being on Tuesday nights.

Evaluate Attendance by Weather

#Evaluate attendance by weather
ggplot(DodgersData, aes(x=temp, y=attend/1000, color=fireworks)) +
geom_point() +
facet_wrap(day_night~skies) +
ggtitle("Dodgers Attendance By Temperature By Time of Game and Skies") +
theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=10)) +
xlab("Temperature (Degree Farenheit)") +
ylab("Attendance (Thousands)")

Strip Plot of Attendance by opponent or visiting team

#Strip Plot of Attendance by opponent or visiting team
ggplot(DodgersData, aes(x=attend/1000, y=opponent, color=day_night)) +
geom_point() +
ggtitle("Dodgers Attendance By Opponent") +
theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=10)) +
xlab("Attendance (Thousands)") +
ylab("Opponent (Visiting Team)")

Design Predictive Model - our original model

To advise the management if promotions impact attendance we will need to identify if there is a positive effect, and if there is a positive effect how much of an effect it is.

# Create a model with the bobblehead variable entered last
my.model <- {attend ~ month + day_of_week + bobblehead}
# use the full data set to obtain an estimate of the increase in
# attendance due to bobbleheads, controlling for other factors
my.model.fit <- lm(my.model, data = DodgersData) # use all available data
print(summary(my.model.fit))
## 
## Call:
## lm(formula = my.model, data = DodgersData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10786.5  -3628.1   -516.1   2230.2  14351.0 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          38792.98    2364.68  16.405  < 2e-16 ***
## monthAUG              2377.92    2402.91   0.990   0.3259    
## monthJUL              2849.83    2578.60   1.105   0.2730    
## monthJUN              7163.23    2732.72   2.621   0.0108 *  
## monthMAY             -2385.62    2291.22  -1.041   0.3015    
## monthOCT              -662.67    4046.45  -0.164   0.8704    
## monthSEP                29.03    2521.25   0.012   0.9908    
## day_of_weekMonday    -4883.82    2504.65  -1.950   0.0554 .  
## day_of_weekSaturday   1488.24    2442.68   0.609   0.5444    
## day_of_weekSunday     1840.18    2426.79   0.758   0.4509    
## day_of_weekThursday  -4108.45    3381.22  -1.215   0.2286    
## day_of_weekTuesday    3027.68    2686.43   1.127   0.2638    
## day_of_weekWednesday -2423.80    2485.46  -0.975   0.3330    
## bobbleheadYES        10714.90    2419.52   4.429 3.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.456 
## F-statistic: 6.158 on 13 and 67 DF,  p-value: 2.083e-07

Re-level factors for interpretability

# Re-level factors for interpretability
DodgersData$month <- factor(DodgersData$month,
levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))
DodgersData$day_of_week <- factor(DodgersData$day_of_week,
levels = c("Monday","Tuesday","Wednesday",
"Thursday","Friday","Saturday","Sunday"))
# Define model formula
my.model2 <- attend ~ month + day_of_week + bobblehead
my.model.fit2 <-lm(my.model2, data = DodgersData)
print(summary(my.model.fit2))
## 
## Call:
## lm(formula = my.model2, data = DodgersData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10786.5  -3628.1   -516.1   2230.2  14351.0 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          33909.16    2521.81  13.446  < 2e-16 ***
## monthMAY             -2385.62    2291.22  -1.041  0.30152    
## monthJUN              7163.23    2732.72   2.621  0.01083 *  
## monthJUL              2849.83    2578.60   1.105  0.27303    
## monthAUG              2377.92    2402.91   0.990  0.32593    
## monthSEP                29.03    2521.25   0.012  0.99085    
## monthOCT              -662.67    4046.45  -0.164  0.87041    
## day_of_weekTuesday    7911.49    2702.21   2.928  0.00466 ** 
## day_of_weekWednesday  2460.02    2514.03   0.979  0.33134    
## day_of_weekThursday    775.36    3486.15   0.222  0.82467    
## day_of_weekFriday     4883.82    2504.65   1.950  0.05537 .  
## day_of_weekSaturday   6372.06    2552.08   2.497  0.01500 *  
## day_of_weekSunday     6724.00    2506.72   2.682  0.00920 ** 
## bobbleheadYES        10714.90    2419.52   4.429 3.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.456 
## F-statistic: 6.158 on 13 and 67 DF,  p-value: 2.083e-07

Model Diagnostics

That pattern suggests a possible violation of the OLS constant variance assumption, also called homoscedasticity.

OLS assumes that the residuals have roughly the same spread across all fitted values. If residuals spread out more as fitted values increase, that suggests heteroscedasticity: the error variance is not constant.

This matters because heteroscedasticity can make the usual OLS standard errors, p-values, and confidence intervals unreliable, even if the coefficient estimates themselves may still be unbiased.

par(mfrow = c(2,2))
plot(my.model.fit)

Predicting Future Attendance

# Tuesday night in June WITH Bobblehad
new_yes <- data.frame(month = "JUN",
                      day_of_week = "Tuesday",
                      bobblehead = "YES")
# Tuesday night in June WITHOUT bobblehead
new_no <- data.frame(month = "JUN",
                     day_of_week = "Tuesday",
                     bobblehead = "NO")

pred_yes <- predict(my.model.fit, newdata = new_yes, interval = "prediction")
pred_no <- predict(my.model.fit, newdata = new_no, interval = "prediction")

#cat("With bobblehead: ", round(pred_yes[1]), "\n")
cat("With bobblehead: ", format(round(pred_yes[1]), big.mark = ","), "\n", sep = "")
## With bobblehead: 59,699

Add fireworks as a second promotion variable — does it matter?

Holding month and day of the week constant, bobblehead games are associated with ~10,995 additional attendees, while fireworks games are associated with ~ 17,029 additional attendees. Both effects are statistically significant, suggesting that these promotions are associated with highter attendance.

# Re-level factors for interpretability
DodgersData$month <- factor(DodgersData$month,
levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))
DodgersData$day_of_week <- factor(DodgersData$day_of_week,
levels = c("Monday","Tuesday","Wednesday",
"Thursday","Friday","Saturday","Sunday"))
# Define model formula
my.model3 <- attend ~ month + day_of_week + bobblehead + fireworks
my.model.fit3 <-lm(my.model3, data = DodgersData)
print(summary(my.model.fit3))
## 
## Call:
## lm(formula = my.model3, data = DodgersData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9504  -3683   -709   2569  15390 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           34321.68    2418.90  14.189  < 2e-16 ***
## monthMAY              -2492.79    2193.60  -1.136  0.25990    
## monthJUN               7062.62    2616.13   2.700  0.00881 ** 
## monthJUL               1315.38    2534.42   0.519  0.60549    
## monthAUG               2377.88    2300.15   1.034  0.30501    
## monthSEP                -55.37    2413.63  -0.023  0.98177    
## monthOCT               -502.88    3873.86  -0.130  0.89711    
## day_of_weekTuesday     7750.46    2587.35   2.996  0.00386 ** 
## day_of_weekWednesday    904.16    2476.14   0.365  0.71617    
## day_of_weekThursday     309.24    3341.63   0.093  0.92655    
## day_of_weekFriday    -12386.23    6901.86  -1.795  0.07729 .  
## day_of_weekSaturday    6094.17    2445.16   2.492  0.01521 *  
## day_of_weekSunday      6577.96    2400.14   2.741  0.00788 ** 
## bobbleheadYES         10995.05    2318.43   4.742 1.17e-05 ***
## fireworksYES          17028.78    6381.63   2.668  0.00958 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5858 on 66 degrees of freedom
## Multiple R-squared:  0.5887, Adjusted R-squared:  0.5015 
## F-statistic: 6.749 on 14 and 66 DF,  p-value: 2.848e-08
# Tuesday night in June WITH Fireworks
new_yes <- data.frame(month = "JUN",
                      day_of_week = "Tuesday",
                      bobblehead = "NO",
                      fireworks = "YES")
# Tuesday night in June WITHOUT fireworkds
new_no <- data.frame(month = "JUN",
                     day_of_week = "Tuesday",
                     bobblehead = "NO",
                     fireworks = "NO")

pred_yes <- predict(my.model.fit3, newdata = new_yes, interval = "prediction")
pred_no <- predict(my.model.fit3, newdata = new_no, interval = "prediction")

#cat("With bobblehead: ", round(pred_yes[1]), "\n")
cat("With Fireworks: ", format(round(pred_yes[1]), big.mark = ","), "\n", sep = "")
## With Fireworks: 66,164
# Tuesday night in June WITH Bobblehad and Fireworks
new_yes <- data.frame(month = "JUN",
                      day_of_week = "Tuesday",
                      bobblehead = "YES",
                      fireworks = "YES")
# Tuesday night in June WITHOUT bobblehead and fireworkds
new_no <- data.frame(month = "JUN",
                     day_of_week = "Tuesday",
                     bobblehead = "NO",
                     fireworks = "NO")

pred_yes <- predict(my.model.fit3, newdata = new_yes, interval = "prediction")
pred_no <- predict(my.model.fit3, newdata = new_no, interval = "prediction")

#cat("With bobblehead: ", round(pred_yes[1]), "\n")
cat("With Bobblehead and Fireworks: ", format(round(pred_yes[1]), big.mark = ","), "\n", sep = "")
## With Bobblehead and Fireworks: 77,159
# Tuesday night in June WITH Bobblehad and Fireworks
new_yes <- data.frame(month = "JUN",
                      day_of_week = "Tuesday",
                      bobblehead = "YES",
                      fireworks = "YES")
# Tuesday night in June WITHOUT bobblehead and fireworkds
new_no <- data.frame(month = "JUN",
                     day_of_week = "Tuesday",
                     bobblehead = "NO",
                     fireworks = "NO")

pred_yes <- predict(my.model.fit3, newdata = new_yes, interval = "prediction")
pred_no <- predict(my.model.fit3, newdata = new_no, interval = "prediction")

#cat("With bobblehead: ", round(pred_yes[1]), "\n")
cat("With Bobblehead and Fireworks: ", format(round(pred_yes[1]), big.mark = ","), "\n", sep = "")
## With Bobblehead and Fireworks: 77,159

Predicting Future Attendance for a Saturday afternoon in July with no promotion

# Create a model with the bobblehead variable entered last
my.model4 <- {attend ~ month + day_of_week + bobblehead + fireworks + day_night}
# use the full data set to obtain an estimate of the increase in
# attendance due to bobbleheads, controlling for other factors
my.model.fit4 <- lm(my.model4, data = DodgersData) # use all available data
print(summary(my.model.fit4))
## 
## Call:
## lm(formula = my.model4, data = DodgersData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9429  -3751   -650   2717  15563 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           35435.46    3582.15   9.892 1.37e-14 ***
## monthMAY              -2351.31    2232.46  -1.053  0.29613    
## monthJUN               6983.49    2639.15   2.646  0.01020 *  
## monthJUL               1350.35    2551.66   0.529  0.59847    
## monthAUG               2362.53    2314.87   1.021  0.31123    
## monthSEP                 27.32    2436.60   0.011  0.99109    
## monthOCT               -305.74    3925.83  -0.078  0.93816    
## day_of_weekTuesday     7531.65    2654.28   2.838  0.00606 ** 
## day_of_weekWednesday    706.38    2535.01   0.279  0.78140    
## day_of_weekThursday     226.19    3368.30   0.067  0.94667    
## day_of_weekFriday    -12581.40    6960.41  -1.808  0.07530 .  
## day_of_weekSaturday    5978.03    2475.72   2.415  0.01857 *  
## day_of_weekSunday      5573.64    3383.68   1.647  0.10434    
## bobbleheadYES         11283.52    2430.24   4.643 1.72e-05 ***
## fireworksYES          17249.88    6442.83   2.677  0.00938 ** 
## day_nightNight        -1172.06    2765.65  -0.424  0.67312    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5895 on 65 degrees of freedom
## Multiple R-squared:  0.5899, Adjusted R-squared:  0.4952 
## F-statistic: 6.232 on 15 and 65 DF,  p-value: 7.09e-08
new_no <- data.frame(month = "JUL",
                     day_of_week = "Saturday",
                     bobblehead = "NO",
                     fireworks = "NO",
                     day_night = "Day")

pred_no <- predict(my.model.fit4, newdata = new_no, interval = "prediction")

cat("Saturday Afternoon in July with NO promotion: ", format(round(pred_no[1]), big.mark = ","), "\n", sep = "")
## Saturday Afternoon in July with NO promotion: 42,764

Recommendation

Promotions have a noticeable impact on Dodger game attendance. On Saturdays without promotions, the model predicts attendance of 42,764, which suggests the stadium would not be filled without promotional events. Although Tuesday nights typically have lower attendance, the model estimates attendance would rise to 59,699 with a bobblehead promotion and 66,164 with fireworks. Combining both promotions increases the estimate to 77,159.

However, Dodger Stadium’s capacity is about 56,000, so offering both promotions together may be unnecessary. Since either promotion alone is predicted to fill or exceed stadium capacity, the recommendation is to use one promotion at a time rather than combining bobbleheads and fireworks.