MLB Data Analysis
Introduction
The provided MLB dataset consists of 54,345 observations of 161 variables. Each row logs one baseball game and includes records such as the game date, home team, visiting team, player statistics, final scores, and much more. Here is a sample of the first few rows and columns of the data:
# A tibble: 6 × 164
Date NumberofGames DayofWeek VisitingTeam VisitingTeamLeague
<dbl> <dbl> <chr> <chr> <chr>
1 20000329 0 Wed CHN NL
2 20000330 0 Thu NYN NL
3 20000403 0 Mon COL NL
4 20000403 0 Mon MIL NL
5 20000403 0 Mon SFN NL
6 20000403 0 Mon LAN NL
# ℹ 159 more variables: VisitingTeamGameNumber <dbl>, HomeTeam <chr>,
# HomeTeamLeague <chr>, HomeTeamGameNumber <dbl>, VistingTeamScore <dbl>,
# HomeTeamScore <dbl>, NumberofOuts <dbl>, DayNight <chr>,
# Completition_Information <chr>, Forfeit_Information <lgl>,
# Protest_Information <chr>, BallParkID <chr>, Attendance <dbl>,
# LengthofGame <dbl>, VisitingTeam_LineScore <chr>, HomeTeam_LineScore <chr>,
# VisitingTeamOffense_AtBats <dbl>, VisitingTeamOffense_Hits <dbl>, …
The following table lists each variable together with its number of unique values and its class:
# A tibble: 164 × 3
Variable `Num. of Unique Vals` `Variable Class`
<chr> <dbl> <chr>
1 Date 3978 numeric
2 NumberofGames 3 numeric
3 DayofWeek 7 character
4 VisitingTeam 32 character
5 VisitingTeamLeague 2 character
6 VisitingTeamGameNumber 163 numeric
7 HomeTeam 32 character
8 HomeTeamLeague 2 character
9 HomeTeamGameNumber 163 numeric
10 VistingTeamScore 28 numeric
11 HomeTeamScore 25 numeric
12 NumberofOuts 69 numeric
13 DayNight 2 character
14 Completition_Information 40 character
15 Forfeit_Information 1 logical
16 Protest_Information 4 character
17 BallParkID 56 character
18 Attendance 29852 numeric
19 LengthofGame 287 numeric
20 VisitingTeam_LineScore 15618 character
21 HomeTeam_LineScore 17209 character
22 VisitingTeamOffense_AtBats 59 numeric
23 VisitingTeamOffense_Hits 29 numeric
24 VisitingTeamOffense_Doubles 12 numeric
25 VisitingTeamOffense_Triples 6 numeric
26 VisitingTeamOffense_Homeruns 9 numeric
27 VisitingTeamOffense_RBIs 27 numeric
28 VisitingTeamOffense_SacrificeHits 6 numeric
29 VisitingTeamOffense_SacrificeFlies 5 numeric
30 VisitingTeamOffense_HitbyPitch 6 numeric
31 VisitingTeamOffense_Walks 17 numeric
32 VisitingTeamOffense_IntentionalWalks 6 numeric
33 VisitingTeamOffense_Strickouts 27 numeric
34 VisitingTeamOffense_StolenBases 9 numeric
35 VisitingTeamOffense_CaughtStealing 5 numeric
36 VisitingTeamOffense_GroundedintoDoub… 7 numeric
37 VisitingTeamOffense_AwardedFirstonCa… 3 numeric
38 VisitingTeamOffense_LeftOnBase 24 numeric
39 VisitingTeamPitchers_PitchersUsed 13 numeric
40 VisitingTeamPitchers_IndividualEarne… 25 numeric
41 VisitingTeamPitchers_TeamEarnedRuns 25 numeric
42 VisitingTeamPitchers_WildPitches 7 numeric
43 VisitingTeamPitchers_Balks 4 numeric
44 VisitingTeamDefense_PutOuts 47 numeric
45 VisitingTeamDefense_Assists 27 numeric
46 VisitingTeamDefense_Errors 8 numeric
47 VisitingTeamDefense_PassedBalls 5 numeric
48 VisitingTeamDefense_DoublePlays 8 numeric
49 VisitingTeamDefense_TriplePlays 2 numeric
50 HomeTeamOffense_AtBats 57 numeric
51 HomeTeamOffense_Hits 29 numeric
52 HomeTeamOffense_Doubles 12 numeric
53 HomeTeamOffense_Triples 6 numeric
54 HomeTeamOffense_Homeruns 9 numeric
55 HomeTeamOffense_RBIs 25 numeric
56 HomeTeamOffense_SacrificeHits 6 numeric
57 HomeTeamOffense_SacrificeFlies 6 numeric
58 HomeTeamOffense_HitbyPitch 7 numeric
59 HomeTeamOffense_Walks 18 numeric
60 HomeTeamOffense_IntentionalWalks 7 numeric
61 HomeTeamOffense_Strickouts 26 numeric
62 HomeTeamOffense_StolenBases 10 numeric
63 HomeTeamOffense_CaughtStealing 5 numeric
64 HomeTeamOffense_GroundedintoDoublePl… 7 numeric
65 HomeTeamOffense_AwardedFirstonCatche… 3 numeric
66 HomeTeamOffense_LeftOnBase 25 numeric
67 HomeTeamPitchers_PitchersUsed 13 numeric
68 HomeTeamPitchers_IndividualEarnedRuns 27 numeric
69 HomeTeamPitchers_TeamEarnedRuns 27 numeric
70 HomeTeamPitchers_WildPitches 7 numeric
71 HomeTeamPitchers_Balks 4 numeric
72 HomeTeamDefense_PutOuts 25 numeric
73 HomeTeamDefense_Assists 29 numeric
74 HomeTeamDefense_Errors 8 numeric
75 HomeTeamDefense_PassedBalls 5 numeric
76 HomeTeamDefense_DoublePlays 7 numeric
77 HomeTeamDefense_TriplePlays 2 numeric
78 HomePlateUmp_ID 190 character
79 HomePlateUmp_Name 190 character
80 1BUmp_ID 192 character
81 1BUmp_Name 192 character
82 2BUmp_ID 193 character
83 2BUmp_Name 193 character
84 3BUmp_ID 193 character
85 3BUmp_Name 193 character
86 LFUmp_ID 1 logical
87 LFUmp_Name 6 character
88 RFUmp_ID 1 logical
89 RFUmp_Name 7 character
90 VisitingTeamManager_ID 168 character
91 VisitingTeamManager_Name 168 character
92 HomeTeamManager_ID 170 character
93 HomeTeamManager_Name 170 character
94 WinningPitcher_ID 2727 character
95 WinningPitcher_Name 2715 character
96 LosingPitcher_ID 2920 character
97 LosingPitcher_Name 2907 character
98 SavingPitcher_ID 1302 character
99 SavingPitcher_Name 1304 character
100 GameWinningRBIBatter_ID 2443 character
101 GameWinningRBIBatter_Name 2424 character
102 VisitingTeam_StartingPitcher_ID 1779 character
103 VisitingTeam_StartingPitcher_Name 1778 character
104 HomeTeam_StartingPitcher_ID 1763 character
105 HomeTeam_StartingPitcher_Name 1761 character
106 VisitingTeam_Player1_ID 1116 character
107 VisitingTeam_Player1_Name 1111 character
108 VisitingTeam_Player1_Position 10 numeric
109 VisitingTeam_Player2_ID 1510 character
110 VisitingTeam_Player2_Name 1499 character
111 VisitingTeam_Player2_Position 10 numeric
112 VisitingTeam_Player3_ID 1069 character
113 VisitingTeam_Player3_Name 1065 character
114 VisitingTeam_Player3_Position 9 numeric
115 VisitingTeam_Player4_ID 1052 character
116 VisitingTeam_Player4_Name 1049 character
117 VisitingTeam_Player4_Position 9 numeric
118 VisitingTeam_Player5_ID 1591 character
119 VisitingTeam_Player5_Name 1584 character
120 VisitingTeam_Player5_Position 9 numeric
121 VisitingTeam_Player6_ID 2013 character
122 VisitingTeam_Player6_Name 2003 character
123 VisitingTeam_Player6_Position 9 numeric
124 VisitingTeam_Player7_ID 2267 character
125 VisitingTeam_Player7_Name 2253 character
126 VisitingTeam_Player7_Position 10 numeric
127 VisitingTeam_Player8_ID 2494 character
128 VisitingTeam_Player8_Name 2477 character
129 VisitingTeam_Player8_Position 10 numeric
130 VisitingTeam_Player9_ID 3161 character
131 VisitingTeam_Player9_Name 3137 character
132 VisitingTeam_Player9_Position 10 numeric
133 HomeTeam_Player1_ID 1082 character
134 HomeTeam_Player1_Name 1077 character
135 HomeTeam_Player1_Position 10 numeric
136 HomeTeam_Player2_ID 1486 character
137 HomeTeam_Player2_Name 1474 character
138 HomeTeam_Player2_Position 10 numeric
139 HomeTeam_Player3_ID 1033 character
140 HomeTeam_Player3_Name 1029 character
141 HomeTeam_Player3_Position 10 numeric
142 HomeTeam_Player4_ID 1044 character
143 HomeTeam_Player4_Name 1041 character
144 HomeTeam_Player4_Position 9 numeric
145 HomeTeam_Player5_ID 1559 character
146 HomeTeam_Player5_Name 1554 character
147 HomeTeam_Player5_Position 9 numeric
148 HomeTeam_Player6_ID 1996 character
149 HomeTeam_Player6_Name 1984 character
150 HomeTeam_Player6_Position 9 numeric
151 HomeTeam_Player7_ID 2252 character
152 HomeTeam_Player7_Name 2237 character
153 HomeTeam_Player7_Position 10 numeric
154 HomeTeam_Player8_ID 2381 character
155 HomeTeam_Player8_Name 2368 character
156 HomeTeam_Player8_Position 10 numeric
157 HomeTeam_Player9_ID 2605 character
158 HomeTeam_Player9_Name 2590 character
159 HomeTeam_Player9_Position 10 numeric
160 Additional_Information 250 character
161 Acquisition_Information 1 character
# ℹ 3 more rows
After cleaning, 53,444 rows remain: no duplicate rows were found, and the 901 rows with negative Attendance values were removed. Additionally, columns denoting the game year, the game month, and whether or not the game was on a weekend were added for convenience in the analysis.
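The cleaning step described above can be sketched as follows. This is a minimal Python illustration, not the code that produced this report: the function name `clean_games`, the derived column names `Month` and `Weekend`, and the Attendance values in the sample are all hypothetical. It also assumes that the weekend covers Friday through Sunday and that Date is stored as a number of the form YYYYMMDD, as in the sample rows above.

```python
from datetime import date

def clean_games(rows):
    """Drop negative-attendance rows and exact duplicates, then add
    Year, Month and Weekend columns derived from the numeric Date."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["Attendance"] < 0:
            continue  # negative attendance is treated as invalid
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        d = int(row["Date"])  # e.g. 20000329
        year, month, day = d // 10000, (d // 100) % 100, d % 100
        weekday = date(year, month, day).weekday()  # Monday is 0, Sunday is 6
        # Assumption: Friday (4) through Sunday (6) count as the weekend.
        cleaned.append({**row, "Year": year, "Month": month, "Weekend": weekday >= 4})
    return cleaned

# Dates are plausible game dates; Attendance values are made up for illustration.
sample = [
    {"Date": 20000329, "Attendance": 55000},
    {"Date": 20000329, "Attendance": 55000},  # duplicate row
    {"Date": 20000401, "Attendance": -1},     # invalid attendance
    {"Date": 20000401, "Attendance": 40000},
]
for game in clean_games(sample):
    print(game)
```

Deriving the weekend flag from the date rather than from the DayofWeek column is a design choice of this sketch; either source should give the same result if the two columns agree.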
Attendance Analysis
We are primarily interested in game attendance, measured in seats sold, and its relationships with the other available variables. We start with basic summary statistics for Attendance:
Attendance Metrics
Mean 29436
Median 30024
Minimum 0
Maximum 61707
1st Quartile 20673
3rd Quartile 38438
Standard Deviation 11289
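For reference, the metrics in this table can be computed with Python's standard `statistics` module, as in the sketch below. The function name `attendance_metrics` and the five sample values are hypothetical, chosen only to make the arithmetic easy to follow. The sketch assumes the quartiles use linear interpolation between order statistics (the "inclusive" method) and that the standard deviation is the sample standard deviation; whether the table above was produced with exactly these conventions is not stated in the report.

```python
import statistics

def attendance_metrics(values):
    """Summary statistics in the same layout as the Attendance Metrics table."""
    # method="inclusive" interpolates between order statistics of the data itself.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "1st Quartile": q1,
        "3rd Quartile": q3,
        "Standard Deviation": statistics.stdev(values),  # sample (n - 1) version
    }

# Hypothetical attendance figures, for illustration only.
sample_attendance = [0, 10000, 20000, 30000, 40000]
for name, value in attendance_metrics(sample_attendance).items():
    print(f"{name}: {value:.0f}")
```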
It is noteworthy that the minimum recorded Attendance is 0; that is, at least one MLB game was recorded with no attendees. This is not a mistake, and we will come back to it shortly. The mean and median are close to each other (29,436 and 30,024), which suggests that the distribution is roughly symmetric. Indeed, a histogram reveals this:
Attendance by Year
We can also examine Attendance by Year to get a sense of how game attendance evolves over time:
# A tibble: 22 × 3
Year Mean Pct_Change
<dbl> <dbl> <dbl>
1 2000 29970 NA
2 2001 29848 -0.407
3 2002 28006 -6.17
4 2003 27839 -0.596
5 2004 30073 8.02
6 2005 30817 2.47
7 2006 31303 1.58
8 2007 32704 4.48
9 2008 32381 -0.988
10 2009 30214 -6.69
11 2010 30071 -0.473
12 2011 30228 0.522
13 2012 30806 1.91
14 2013 30451 -1.15
15 2014 30345 -0.348
16 2015 30378 0.109
17 2016 30131 -0.813
18 2017 29922 -0.694
19 2018 28659 -4.22
20 2019 28203 -1.59
21 2021 18659 -33.8
22 2022 26577 42.4
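Pct_Change is computed relative to the previous year in the table (a lag of one row), so it is NA for the first year. Because 2020 does not appear in the table, the 2021 value is relative to 2019.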
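The lagged percent change can be reproduced by hand from the yearly means. The sketch below is a Python illustration using the last four rows of the table; the function name `pct_change` is an illustrative choice, and `None` stands in for NA. Since the means in the table are rounded, the results agree with Pct_Change only up to rounding.

```python
def pct_change(values):
    """Percent change of each value relative to the one before it.
    The first entry has no predecessor, so it is None."""
    changes = [None]
    for prev, curr in zip(values, values[1:]):
        changes.append((curr - prev) / prev * 100)
    return changes

# Yearly means copied from the table above (2020 is absent from the data).
years = [2018, 2019, 2021, 2022]
means = [28659, 28203, 18659, 26577]

for year, change in zip(years, pct_change(means)):
    print(year, "NA" if change is None else f"{change:.2f}")
```

Note that the lag is over rows, not calendar years, which is why the 2021 entry is measured against 2019.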
Game attendance drops sharply in 2021 (down 33.8% from 2019), then rebounds in 2022 (up 42.4%) without fully returning to its earlier level. The decrease is likely a result of Covid-19 concerns. Because of this singular extenuating circumstance, it would be reasonable to exclude all games from 2021 and 2022 from this analysis. Otherwise, mean attendance is quite stable at roughly 30,000 attendees per game. For a visualization:
It would be fair to hypothesize that Covid-19-affected data also explains the anomaly observed earlier, the game(s) with 0 attendees. If we remove all 2021 and 2022 MLB games from the data, we obtain:
Attendance Metrics
Mean 30118
Median 30586
Minimum 0
Maximum 61707
1st Quartile 21497
3rd Quartile 38910
Standard Deviation 10987
Unfortunately, the issue persists. There are 280 games in the complete dataset with a recorded attendance of 0, and after excluding the Covid-19 years we are still left with 225 of them. These games are spread across every year present in the dataset. It is not clear why these 225 remaining games report 0 attendees, but their effect should be minor next to the roughly 48,000 other games; we therefore leave them in the working data and move on. The yearly means for the remaining seasons are:
# A tibble: 20 × 3
Year Mean Pct_Change
<dbl> <dbl> <dbl>
1 2000 29970 NA
2 2001 29848 -0.407
3 2002 28006 -6.17
4 2003 27839 -0.596
5 2004 30073 8.02
6 2005 30817 2.47
7 2006 31303 1.58
8 2007 32704 4.48
9 2008 32381 -0.988
10 2009 30214 -6.69
11 2010 30071 -0.473
12 2011 30228 0.522
13 2012 30806 1.91
14 2013 30451 -1.15
15 2014 30345 -0.348
16 2015 30378 0.109
17 2016 30131 -0.813
18 2017 29922 -0.694
19 2018 28659 -4.22
20 2019 28203 -1.59
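The check on zero-attendance games discussed above amounts to counting rows with Attendance equal to 0 per year, with and without the Covid-19 seasons. A minimal Python sketch follows; the function name `zero_attendance_by_year` is hypothetical and the sample rows are invented for illustration, so the counts it prints say nothing about the real data.

```python
from collections import Counter

def zero_attendance_by_year(rows, excluded_years=()):
    """Count games with Attendance == 0 per year, skipping excluded years."""
    counts = Counter(
        row["Year"]
        for row in rows
        if row["Attendance"] == 0 and row["Year"] not in excluded_years
    )
    return dict(sorted(counts.items()))

# Hypothetical rows, for illustration only (not taken from the dataset).
sample = [
    {"Year": 2005, "Attendance": 0},
    {"Year": 2005, "Attendance": 31000},
    {"Year": 2012, "Attendance": 0},
    {"Year": 2021, "Attendance": 0},
    {"Year": 2021, "Attendance": 0},
]
print(zero_attendance_by_year(sample))
print(zero_attendance_by_year(sample, excluded_years={2021, 2022}))
```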
Attendance by Day of Week
MLB game attendance also varies noticeably with the day of the week:
# A tibble: 7 × 2
DayofWeek Mean
<fct> <dbl>
1 Mon 27851
2 Tue 26972
3 Wed 27177
4 Thu 27642
5 Fri 32070
6 Sat 34848
7 Sun 32338
As expected, attendance is lower for weekday games than for weekend games. The average attendance is 27,352 for weekday games and 33,097 for weekend games; these figures are consistent with counting Friday through Sunday as the weekend. The gap of roughly 5,700 attendees per game is substantial and is likely useful for a future statistical model.
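The weekday and weekend averages can be cross-checked from the per-day means, provided each day is weighted by its number of games. The Python sketch below does this using the means from the table above and game counts obtained by adding the Day and Night totals from the DayofWeek by DayNight table later in this report. Treating Friday through Sunday as the weekend is an assumption of this sketch, and the names `by_day`, `WEEKEND` and `pooled_mean` are illustrative.

```python
# (mean attendance, number of games) per day, copied from the report's tables.
# Game counts are the Day + Night totals from the DayofWeek x DayNight table.
by_day = {
    "Mon": (27851, 869 + 4085),
    "Tue": (26972, 301 + 7004),
    "Wed": (27177, 1840 + 5592),
    "Thu": (27642, 2159 + 3341),
    "Fri": (32070, 425 + 7272),
    "Sat": (34848, 3153 + 4732),
    "Sun": (32338, 7172 + 642),
}

# Assumption: the weekend covers Friday through Sunday.
WEEKEND = {"Fri", "Sat", "Sun"}

def pooled_mean(days):
    """Game-weighted mean attendance over the given days."""
    total_attendance = sum(by_day[d][0] * by_day[d][1] for d in days)
    total_games = sum(by_day[d][1] for d in days)
    return total_attendance / total_games

weekday_mean = pooled_mean([d for d in by_day if d not in WEEKEND])
weekend_mean = pooled_mean([d for d in by_day if d in WEEKEND])
print(f"weekday: {weekday_mean:.1f}, weekend: {weekend_mean:.1f}")
```

Because the per-day means in the table are rounded to whole attendees, the pooled values can differ from the reported averages by about one attendee. A plain average of the seven daily means would not match, since the days have very different numbers of games.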
Attendance by Day/Night
Lastly, we can take a look at game attendance by whether the game started during the “day” or during the “night”:
# A tibble: 2 × 4
DayNight Mean Median `Total Games`
<chr> <dbl> <dbl> <int>
1 Day 31790 32953 15919
2 Night 29303 29564 32668
There does seem to be an interaction between DayNight and DayofWeek: from Monday through Saturday, mean attendance is higher for day games than for night games, often by a wide margin. For Sundays, however, the opposite is true!
# A tibble: 14 × 5
# Groups: DayofWeek [7]
DayofWeek DayNight Mean Median `Total Games`
<fct> <chr> <dbl> <dbl> <int>
1 Mon Day 32586. 35243 869
2 Mon Night 26844. 25763 4085
3 Tue Day 29575. 31762 301
4 Tue Night 26860. 25982. 7004
5 Wed Day 27908. 28154. 1840
6 Wed Night 26937. 26188. 5592
7 Thu Day 28200. 27968 2159
8 Thu Night 27282. 26255 3341
9 Fri Day 37316. 39301 425
10 Fri Night 31764. 32332. 7272
11 Sat Day 35407. 37416 3153
12 Sat Night 34477. 35628. 4732
13 Sun Day 31946. 32526. 7172
14 Sun Night 36727. 38048. 642
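To make the interaction easier to read off, the day-minus-night gap in mean attendance can be computed for each day of the week. The Python sketch below uses the means from the table above; the variable names are illustrative choices, not part of the original analysis.

```python
# (Day mean, Night mean) per day of week, copied from the table above.
day_night_mean = {
    "Mon": (32586, 26844),
    "Tue": (29575, 26860),
    "Wed": (27908, 26937),
    "Thu": (28200, 27282),
    "Fri": (37316, 31764),
    "Sat": (35407, 34477),
    "Sun": (31946, 36727),
}

# Positive gap: day games draw more; negative gap: night games draw more.
gap = {d: day - night for d, (day, night) in day_night_mean.items()}
night_higher = [d for d, g in gap.items() if g < 0]

for d, g in gap.items():
    print(f"{d}: {g:+d}")
print("Night games draw more on:", night_higher)
```

The game counts matter when interpreting these gaps: several cells, such as Tuesday day games (301) and Friday day games (425), rest on comparatively few games.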
Conclusions & Checkpoint
The answers to the checkpoint questions are as follows:
(a). How many observations in SubsetYears?
[1] "Number of observations: 48587"
(b). What’s the maximum game attendance from P1Q3?
[1] "Maximum game attendance: 61707"
(c). Home team associated with previous question?
[1] "Home team: SDN (not sure about the team name)"
(d). Standard deviation of game attendance from P1Q3?
[1] "Standard deviation: 10987.0506"
(e). Average game attendance in 2010 from P1Q4?
[1] "2010 Average: 30071.9885"
(f + g). Day of week with highest average attendance from P1Q5?
[1] "Day of Week with Attendance high: 6 (Sat), with mean attendance of 34848.8075"