MLB Data Analysis

Author

Griffin Lessinger

Introduction

The provided MLB dataset consists of 54,345 observations of 161 variables. Each row logs one baseball game and includes fields such as game date, home team, visiting team, player statistics, and final scores. Here is a sample of the first few rows and columns, followed by a summary of each variable:

Note: Sample Data
# A tibble: 6 × 164
      Date NumberofGames DayofWeek VisitingTeam VisitingTeamLeague
     <dbl>         <dbl> <chr>     <chr>        <chr>             
1 20000329             0 Wed       CHN          NL                
2 20000330             0 Thu       NYN          NL                
3 20000403             0 Mon       COL          NL                
4 20000403             0 Mon       MIL          NL                
5 20000403             0 Mon       SFN          NL                
6 20000403             0 Mon       LAN          NL                
# ℹ 159 more variables: VisitingTeamGameNumber <dbl>, HomeTeam <chr>,
#   HomeTeamLeague <chr>, HomeTeamGameNumber <dbl>, VistingTeamScore <dbl>,
#   HomeTeamScore <dbl>, NumberofOuts <dbl>, DayNight <chr>,
#   Completition_Information <chr>, Forfeit_Information <lgl>,
#   Protest_Information <chr>, BallParkID <chr>, Attendance <dbl>,
#   LengthofGame <dbl>, VisitingTeam_LineScore <chr>, HomeTeam_LineScore <chr>,
#   VisitingTeamOffense_AtBats <dbl>, VisitingTeamOffense_Hits <dbl>, …
# A tibble: 164 × 3
    Variable                              `Num. of Unique Vals` `Variable Class`
    <chr>                                                 <dbl> <chr>           
  1 Date                                                   3978 numeric         
  2 NumberofGames                                             3 numeric         
  3 DayofWeek                                                 7 character       
  4 VisitingTeam                                             32 character       
  5 VisitingTeamLeague                                        2 character       
  6 VisitingTeamGameNumber                                  163 numeric         
  7 HomeTeam                                                 32 character       
  8 HomeTeamLeague                                            2 character       
  9 HomeTeamGameNumber                                      163 numeric         
 10 VistingTeamScore                                         28 numeric         
 11 HomeTeamScore                                            25 numeric         
 12 NumberofOuts                                             69 numeric         
 13 DayNight                                                  2 character       
 14 Completition_Information                                 40 character       
 15 Forfeit_Information                                       1 logical         
 16 Protest_Information                                       4 character       
 17 BallParkID                                               56 character       
 18 Attendance                                            29852 numeric         
 19 LengthofGame                                            287 numeric         
 20 VisitingTeam_LineScore                                15618 character       
 21 HomeTeam_LineScore                                    17209 character       
 22 VisitingTeamOffense_AtBats                               59 numeric         
 23 VisitingTeamOffense_Hits                                 29 numeric         
 24 VisitingTeamOffense_Doubles                              12 numeric         
 25 VisitingTeamOffense_Triples                               6 numeric         
 26 VisitingTeamOffense_Homeruns                              9 numeric         
 27 VisitingTeamOffense_RBIs                                 27 numeric         
 28 VisitingTeamOffense_SacrificeHits                         6 numeric         
 29 VisitingTeamOffense_SacrificeFlies                        5 numeric         
 30 VisitingTeamOffense_HitbyPitch                            6 numeric         
 31 VisitingTeamOffense_Walks                                17 numeric         
 32 VisitingTeamOffense_IntentionalWalks                      6 numeric         
 33 VisitingTeamOffense_Strickouts                           27 numeric         
 34 VisitingTeamOffense_StolenBases                           9 numeric         
 35 VisitingTeamOffense_CaughtStealing                        5 numeric         
 36 VisitingTeamOffense_GroundedintoDoub…                     7 numeric         
 37 VisitingTeamOffense_AwardedFirstonCa…                     3 numeric         
 38 VisitingTeamOffense_LeftOnBase                           24 numeric         
 39 VisitingTeamPitchers_PitchersUsed                        13 numeric         
 40 VisitingTeamPitchers_IndividualEarne…                    25 numeric         
 41 VisitingTeamPitchers_TeamEarnedRuns                      25 numeric         
 42 VisitingTeamPitchers_WildPitches                          7 numeric         
 43 VisitingTeamPitchers_Balks                                4 numeric         
 44 VisitingTeamDefense_PutOuts                              47 numeric         
 45 VisitingTeamDefense_Assists                              27 numeric         
 46 VisitingTeamDefense_Errors                                8 numeric         
 47 VisitingTeamDefense_PassedBalls                           5 numeric         
 48 VisitingTeamDefense_DoublePlays                           8 numeric         
 49 VisitingTeamDefense_TriplePlays                           2 numeric         
 50 HomeTeamOffense_AtBats                                   57 numeric         
 51 HomeTeamOffense_Hits                                     29 numeric         
 52 HomeTeamOffense_Doubles                                  12 numeric         
 53 HomeTeamOffense_Triples                                   6 numeric         
 54 HomeTeamOffense_Homeruns                                  9 numeric         
 55 HomeTeamOffense_RBIs                                     25 numeric         
 56 HomeTeamOffense_SacrificeHits                             6 numeric         
 57 HomeTeamOffense_SacrificeFlies                            6 numeric         
 58 HomeTeamOffense_HitbyPitch                                7 numeric         
 59 HomeTeamOffense_Walks                                    18 numeric         
 60 HomeTeamOffense_IntentionalWalks                          7 numeric         
 61 HomeTeamOffense_Strickouts                               26 numeric         
 62 HomeTeamOffense_StolenBases                              10 numeric         
 63 HomeTeamOffense_CaughtStealing                            5 numeric         
 64 HomeTeamOffense_GroundedintoDoublePl…                     7 numeric         
 65 HomeTeamOffense_AwardedFirstonCatche…                     3 numeric         
 66 HomeTeamOffense_LeftOnBase                               25 numeric         
 67 HomeTeamPitchers_PitchersUsed                            13 numeric         
 68 HomeTeamPitchers_IndividualEarnedRuns                    27 numeric         
 69 HomeTeamPitchers_TeamEarnedRuns                          27 numeric         
 70 HomeTeamPitchers_WildPitches                              7 numeric         
 71 HomeTeamPitchers_Balks                                    4 numeric         
 72 HomeTeamDefense_PutOuts                                  25 numeric         
 73 HomeTeamDefense_Assists                                  29 numeric         
 74 HomeTeamDefense_Errors                                    8 numeric         
 75 HomeTeamDefense_PassedBalls                               5 numeric         
 76 HomeTeamDefense_DoublePlays                               7 numeric         
 77 HomeTeamDefense_TriplePlays                               2 numeric         
 78 HomePlateUmp_ID                                         190 character       
 79 HomePlateUmp_Name                                       190 character       
 80 1BUmp_ID                                                192 character       
 81 1BUmp_Name                                              192 character       
 82 2BUmp_ID                                                193 character       
 83 2BUmp_Name                                              193 character       
 84 3BUmp_ID                                                193 character       
 85 3BUmp_Name                                              193 character       
 86 LFUmp_ID                                                  1 logical         
 87 LFUmp_Name                                                6 character       
 88 RFUmp_ID                                                  1 logical         
 89 RFUmp_Name                                                7 character       
 90 VisitingTeamManager_ID                                  168 character       
 91 VisitingTeamManager_Name                                168 character       
 92 HomeTeamManager_ID                                      170 character       
 93 HomeTeamManager_Name                                    170 character       
 94 WinningPitcher_ID                                      2727 character       
 95 WinningPitcher_Name                                    2715 character       
 96 LosingPitcher_ID                                       2920 character       
 97 LosingPitcher_Name                                     2907 character       
 98 SavingPitcher_ID                                       1302 character       
 99 SavingPitcher_Name                                     1304 character       
100 GameWinningRBIBatter_ID                                2443 character       
101 GameWinningRBIBatter_Name                              2424 character       
102 VisitingTeam_StartingPitcher_ID                        1779 character       
103 VisitingTeam_StartingPitcher_Name                      1778 character       
104 HomeTeam_StartingPitcher_ID                            1763 character       
105 HomeTeam_StartingPitcher_Name                          1761 character       
106 VisitingTeam_Player1_ID                                1116 character       
107 VisitingTeam_Player1_Name                              1111 character       
108 VisitingTeam_Player1_Position                            10 numeric         
109 VisitingTeam_Player2_ID                                1510 character       
110 VisitingTeam_Player2_Name                              1499 character       
111 VisitingTeam_Player2_Position                            10 numeric         
112 VisitingTeam_Player3_ID                                1069 character       
113 VisitingTeam_Player3_Name                              1065 character       
114 VisitingTeam_Player3_Position                             9 numeric         
115 VisitingTeam_Player4_ID                                1052 character       
116 VisitingTeam_Player4_Name                              1049 character       
117 VisitingTeam_Player4_Position                             9 numeric         
118 VisitingTeam_Player5_ID                                1591 character       
119 VisitingTeam_Player5_Name                              1584 character       
120 VisitingTeam_Player5_Position                             9 numeric         
121 VisitingTeam_Player6_ID                                2013 character       
122 VisitingTeam_Player6_Name                              2003 character       
123 VisitingTeam_Player6_Position                             9 numeric         
124 VisitingTeam_Player7_ID                                2267 character       
125 VisitingTeam_Player7_Name                              2253 character       
126 VisitingTeam_Player7_Position                            10 numeric         
127 VisitingTeam_Player8_ID                                2494 character       
128 VisitingTeam_Player8_Name                              2477 character       
129 VisitingTeam_Player8_Position                            10 numeric         
130 VisitingTeam_Player9_ID                                3161 character       
131 VisitingTeam_Player9_Name                              3137 character       
132 VisitingTeam_Player9_Position                            10 numeric         
133 HomeTeam_Player1_ID                                    1082 character       
134 HomeTeam_Player1_Name                                  1077 character       
135 HomeTeam_Player1_Position                                10 numeric         
136 HomeTeam_Player2_ID                                    1486 character       
137 HomeTeam_Player2_Name                                  1474 character       
138 HomeTeam_Player2_Position                                10 numeric         
139 HomeTeam_Player3_ID                                    1033 character       
140 HomeTeam_Player3_Name                                  1029 character       
141 HomeTeam_Player3_Position                                10 numeric         
142 HomeTeam_Player4_ID                                    1044 character       
143 HomeTeam_Player4_Name                                  1041 character       
144 HomeTeam_Player4_Position                                 9 numeric         
145 HomeTeam_Player5_ID                                    1559 character       
146 HomeTeam_Player5_Name                                  1554 character       
147 HomeTeam_Player5_Position                                 9 numeric         
148 HomeTeam_Player6_ID                                    1996 character       
149 HomeTeam_Player6_Name                                  1984 character       
150 HomeTeam_Player6_Position                                 9 numeric         
151 HomeTeam_Player7_ID                                    2252 character       
152 HomeTeam_Player7_Name                                  2237 character       
153 HomeTeam_Player7_Position                                10 numeric         
154 HomeTeam_Player8_ID                                    2381 character       
155 HomeTeam_Player8_Name                                  2368 character       
156 HomeTeam_Player8_Position                                10 numeric         
157 HomeTeam_Player9_ID                                    2605 character       
158 HomeTeam_Player9_Name                                  2590 character       
159 HomeTeam_Player9_Position                                10 numeric         
160 Additional_Information                                  250 character       
161 Acquisition_Information                                   1 character       
# ℹ 3 more rows

After cleaning, 53,444 rows remain: there were no duplicate rows, and the 901 rows with negative Attendance values were removed. Three columns were also added for convenience in later analysis: game year, game month, and whether or not the game was played on a weekend. These bring the total to the 164 columns shown above.
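The cleaning steps described above can be sketched as follows. This is a minimal Python illustration, not the code used to produce this report (whose output comes from R tibbles). The `Month` and `Weekend` column names, the Friday-to-Sunday definition of a weekend, and the sample Attendance values are all assumptions made for the example.

```python
# Sketch of the cleaning steps: drop exact duplicate rows, drop rows with a
# negative Attendance, and derive Year, Month, and Weekend columns.
# Assumptions: the Month/Weekend names and the Fri-Sun weekend are illustrative.
WEEKEND_DAYS = {"Fri", "Sat", "Sun"}


def clean_games(rows):
    """Return cleaned copies of the game rows (each row is a dict)."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        if row["Attendance"] < 0:  # negative attendance is treated as invalid
            continue
        date = row["Date"]  # numeric YYYYMMDD, e.g. 20000329
        cleaned.append({
            **row,
            "Year": date // 10000,
            "Month": (date // 100) % 100,
            "Weekend": row["DayofWeek"] in WEEKEND_DAYS,
        })
    return cleaned


# Hypothetical rows: the dates and weekdays follow the sample data format,
# the Attendance values are made up.
sample_rows = [
    {"Date": 20000329, "DayofWeek": "Wed", "Attendance": 55000},
    {"Date": 20000329, "DayofWeek": "Wed", "Attendance": 55000},  # duplicate
    {"Date": 20000403, "DayofWeek": "Mon", "Attendance": -1},     # invalid
    {"Date": 20000408, "DayofWeek": "Sat", "Attendance": 30000},
]
cleaned_rows = clean_games(sample_rows)
for cleaned_row in cleaned_rows:
    print(cleaned_row)
```

Integer arithmetic is enough to split the date here because `Date` is stored as a number in YYYYMMDD form rather than as a date type.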

Attendance Analysis

We are primarily interested in analyzing game attendance, measured in seats sold, and its relationships with the other available variables. We can start with basic summary statistics for Attendance:

Note: Attendance Metrics
                   Attendance Metrics
Mean                            29436
Median                          30024
Minimum                             0
Maximum                         61707
1st Quartile                    20673
3rd Quartile                    38438
Standard Deviation              11289
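For reference, the seven metrics in the table above can be computed with Python's standard `statistics` module, as in the sketch below. The function name `attendance_metrics` and the toy input are hypothetical. The sample standard deviation and the "inclusive" quartile method are assumptions about how the reported figures were computed, not something the report states.

```python
import statistics


def attendance_metrics(values):
    """Summary statistics for a list of attendance figures."""
    # "inclusive" interpolates linearly between the observed min and max.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "1st Quartile": q1,
        "3rd Quartile": q3,
        "Standard Deviation": statistics.stdev(values),  # sample (n - 1) form
    }


# Toy values chosen so the results are easy to check by hand.
toy_attendance = [0, 10, 20, 30, 40]
metrics = attendance_metrics(toy_attendance)
for name, value in metrics.items():
    print(f"{name:<20}{value:>10.1f}")
```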

It is noteworthy that the minimum recorded Attendance value is 0; that is, at least one MLB game was logged with no attendees. This is not a mistake, and we will return to it shortly. The mean (29,436) and median (30,024) are close to each other, which suggests that the distribution is roughly symmetric. Indeed, a histogram reveals this:

Attendance by Year

We can also examine Attendance by Year to get a sense of how game attendance evolves over time:

Note: Attendance by Year
# A tibble: 22 × 3
    Year  Mean Pct_Change
   <dbl> <dbl>      <dbl>
 1  2000 29970     NA    
 2  2001 29848     -0.407
 3  2002 28006     -6.17 
 4  2003 27839     -0.596
 5  2004 30073      8.02 
 6  2005 30817      2.47 
 7  2006 31303      1.58 
 8  2007 32704      4.48 
 9  2008 32381     -0.988
10  2009 30214     -6.69 
11  2010 30071     -0.473
12  2011 30228      0.522
13  2012 30806      1.91 
14  2013 30451     -1.15 
15  2014 30345     -0.348
16  2015 30378      0.109
17  2016 30131     -0.813
18  2017 29922     -0.694
19  2018 28659     -4.22 
20  2019 28203     -1.59 
21  2021 18659    -33.8  
22  2022 26577     42.4  

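Pct_Change is the percent change in Mean relative to the previous row, so it is NA for the first year. Because 2020 is absent from the data, the 2021 value is measured against 2019.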
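As a worked check of the lagged percent change, the sketch below recomputes Pct_Change for the last four rows of the table from their displayed means. The helper name `pct_change` is hypothetical, and the code is a Python illustration rather than the report's own method.

```python
def pct_change(values):
    """Percent change of each value relative to the one before it."""
    changes = [None]  # the first value has no predecessor
    for previous, current in zip(values, values[1:]):
        changes.append((current - previous) / previous * 100)
    return changes


# Displayed yearly means taken from the table above (2020 is not in the data).
years = [2018, 2019, 2021, 2022]
means = [28659, 28203, 18659, 26577]

changes = pct_change(means)
for year, change in zip(years, changes):
    label = "NA" if change is None else f"{change:.1f}"
    print(year, label)
```

The 2021 entry is computed against 2019 because the lag follows row order, not calendar order.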

Mean attendance drops sharply in 2021 (down 33.8% from 2019, the previous season in the data), then recovers most of the way in 2022. The decrease is likely a result of Covid-19 concerns. It would be reasonable to exclude all games from 2021 and 2022 from this analysis because of this singular extenuating circumstance. In every other year, mean attendance is quite stable, staying between roughly 28,000 and 33,000 attendees per game. For a visualization:

It would be fair to hypothesize that Covid-19 is also behind the anomaly observed earlier, the game(s) with 0 attendees. If we remove all 2021 and 2022 MLB games from the data, we obtain:

Note: Attendance Metrics (excluding 2021, 2022)
                   Attendance Metrics
Mean                            30118
Median                          30586
Minimum                             0
Maximum                         61707
1st Quartile                    21497
3rd Quartile                    38910
Standard Deviation              10987

Unfortunately, the issue persists. There are 280 games in the complete dataset with a recorded 0 attendees, and excluding the Covid-19 years still leaves 225 such games. These games are spread across every year present in the dataset. It is not entirely clear why these 225 remaining games report 0 attendees, but their effect should be minor next to the roughly 48,000 other games; as such, we will leave them in the working data and move on.

Note: Attendance by Year (excluding 2021, 2022)
# A tibble: 20 × 3
    Year  Mean Pct_Change
   <dbl> <dbl>      <dbl>
 1  2000 29970     NA    
 2  2001 29848     -0.407
 3  2002 28006     -6.17 
 4  2003 27839     -0.596
 5  2004 30073      8.02 
 6  2005 30817      2.47 
 7  2006 31303      1.58 
 8  2007 32704      4.48 
 9  2008 32381     -0.988
10  2009 30214     -6.69 
11  2010 30071     -0.473
12  2011 30228      0.522
13  2012 30806      1.91 
14  2013 30451     -1.15 
15  2014 30345     -0.348
16  2015 30378      0.109
17  2016 30131     -0.813
18  2017 29922     -0.694
19  2018 28659     -4.22 
20  2019 28203     -1.59 

Attendance by Day of Week

MLB game attendance also varies considerably with the day of the week:

Note: Attendance by Weekday
# A tibble: 7 × 2
  DayofWeek  Mean
  <fct>     <dbl>
1 Mon       27851
2 Tue       26972
3 Wed       27177
4 Thu       27642
5 Fri       32070
6 Sat       34848
7 Sun       32338

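We see something expected: attendance is lower for weekday games than for weekend games. The average is 27,352 attendees for weekday games and 33,097 for weekend games, a gap of roughly 5,700 attendees per game. Weighting the per-day means above by the per-day game counts in the Weekday & Day/Night table reproduces these figures only when Friday is counted with the weekend, so “weekend” here appears to mean Friday through Sunday. A difference of this size is likely useful as a predictor in a future statistical model.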
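The weekday and weekend averages can be approximately recovered from the per-day means above by weighting each day by its number of games. The sketch below is a Python illustration under two assumptions: the game counts are the Day and Night totals from the Weekday & Day/Night table added together, and the weekend is taken to be Friday through Sunday, which is the grouping that matches the reported figures.

```python
# Per-day mean attendance as displayed in the table above.
weekday_means = {
    "Mon": 27851, "Tue": 26972, "Wed": 27177, "Thu": 27642,
    "Fri": 32070, "Sat": 34848, "Sun": 32338,
}

# Games per day: Day + Night "Total Games" from the Weekday & Day/Night table.
game_counts = {
    "Mon": 869 + 4085, "Tue": 301 + 7004, "Wed": 1840 + 5592,
    "Thu": 2159 + 3341, "Fri": 425 + 7272, "Sat": 3153 + 4732,
    "Sun": 7172 + 642,
}


def pooled_mean(days):
    """Mean attendance over all games played on the given days."""
    total_games = sum(game_counts[day] for day in days)
    total_attendance = sum(weekday_means[day] * game_counts[day] for day in days)
    return total_attendance / total_games


# Assumed grouping: Mon-Thu as weekdays, Fri-Sun as the weekend.
weekday_avg = pooled_mean(["Mon", "Tue", "Wed", "Thu"])
weekend_avg = pooled_mean(["Fri", "Sat", "Sun"])
print(round(weekday_avg), round(weekend_avg))
```

The results land within about two attendees of the reported 27,352 and 33,097. Small gaps are expected because the table shows the per-day means without their decimal part.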

Attendance by Day/Night

Lastly, we can take a look at game attendance by whether the game started during the “day” or during the “night”:

Note: Attendance by Day/Night
# A tibble: 2 × 4
  DayNight  Mean Median `Total Games`
  <chr>    <dbl>  <dbl>         <int>
1 Day      31790  32953         15919
2 Night    29303  29564         32668

There does seem to be an interaction between DayNight and DayofWeek. On every day of the week except Sunday, mean attendance is higher for daytime games than for nighttime games, by anywhere from about 900 to more than 5,000 attendees. On Sundays the opposite is true: night games draw about 4,800 more attendees on average than day games.

Note: Attendance by Weekday & Day/Night
# A tibble: 14 × 5
# Groups:   DayofWeek [7]
   DayofWeek DayNight   Mean Median `Total Games`
   <fct>     <chr>     <dbl>  <dbl>         <int>
 1 Mon       Day      32586. 35243            869
 2 Mon       Night    26844. 25763           4085
 3 Tue       Day      29575. 31762            301
 4 Tue       Night    26860. 25982.          7004
 5 Wed       Day      27908. 28154.          1840
 6 Wed       Night    26937. 26188.          5592
 7 Thu       Day      28200. 27968           2159
 8 Thu       Night    27282. 26255           3341
 9 Fri       Day      37316. 39301            425
10 Fri       Night    31764. 32332.          7272
11 Sat       Day      35407. 37416           3153
12 Sat       Night    34477. 35628.          4732
13 Sun       Day      31946. 32526.          7172
14 Sun       Night    36727. 38048.           642
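The interaction is easier to see as a single day-minus-night difference per weekday. The sketch below computes it from the means displayed in the table above; it is a Python illustration, and the variable names are hypothetical.

```python
# Mean attendance per (weekday, day/night) cell, as displayed in the table above.
cell_means = {
    ("Mon", "Day"): 32586, ("Mon", "Night"): 26844,
    ("Tue", "Day"): 29575, ("Tue", "Night"): 26860,
    ("Wed", "Day"): 27908, ("Wed", "Night"): 26937,
    ("Thu", "Day"): 28200, ("Thu", "Night"): 27282,
    ("Fri", "Day"): 37316, ("Fri", "Night"): 31764,
    ("Sat", "Day"): 35407, ("Sat", "Night"): 34477,
    ("Sun", "Day"): 31946, ("Sun", "Night"): 36727,
}
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Positive values mean day games outdraw night games on that weekday.
day_minus_night = {
    day: cell_means[(day, "Day")] - cell_means[(day, "Night")] for day in DAYS
}

# Weekdays on which night games have the higher mean attendance.
night_higher = [day for day in DAYS if day_minus_night[day] < 0]

for day in DAYS:
    print(day, day_minus_night[day])
print("Night games draw more on:", night_higher)
```

Note that some cells rest on few games (for example, 301 Tuesday day games and 642 Sunday night games), so the differences for those days are less reliable than the others.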

Conclusions & Checkpoint

The answers to the checkpoint questions are as follows:

(a). How many observations in SubsetYears?

[1] "Number of observations: 48587"

(b). What’s the maximum game attendance from P1Q3?

[1] "Maximum game attendance: 61707"

(c). Home team associated with previous question?

[1] "Home team: SDN (not sure about the team name)"

(d). Standard deviation of game attendance from P1Q3?

[1] "Standard deviation: 10987.0506"

(e). Average game attendance in 2010 from P1Q4?

[1] "2010 Average: 30071.9885"

(f + g). Day of week with highest average attendance from P1Q5?

[1] "Day of Week with Attendance high: 6 (Sat), with mean attendance of 34848.8075"