library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
setwd("C:/Users/njnav/OneDrive/Data 101/Projects")

mlb_teams <- read_csv("mlb_teams.csv")
## Rows: 2784 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): league_id, division_id, division_winner, wild_card_winner, league_...
## dbl (33): year, rank, games_played, home_games, wins, losses, runs_scored, a...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Which MLB teams have scored the most runs per season on average, and which have hit the most home runs per season on average? Is there a correlation between the number of home runs hit and the number of runs scored?

Introduction

Since the creation of Major League Baseball (MLB), many teams have joined the league and played against one another. Many seasons have been played since 1876, with many different winners of the World Series. I wondered which teams have scored the most runs (the points of baseball) and hit the most home runs on average over the league's history, and whether there is a correlation between the two. The data set I am using, mlb_teams, is a subset of Lahman’s Baseball Database, which includes various types of data about MLB teams. I got this data set from the website openintro.org.

The data set has 2784 observations of 41 variables, where each observation is one team's season. I focus on three of these variables: team names, runs scored, and home runs. They let me see which teams were the most consistent in scoring runs and hitting home runs, and whether there is a correlation between the number of home runs hit and the number of runs scored.

Unfortunately, the data set only covers the league's first season in 1876 through its 145th season in 2020. Data from the five seasons between 2021 and 2025 are therefore not included, which could change the results slightly. The results should be read with this limitation in mind.

Data Analysis

The variables I focus on are team_name, runs_scored, and homeruns.

  1. team_name: The name of the team

  2. runs_scored: The number of runs the team scored in a season

  3. homeruns: The number of home runs the team's batters hit in a season

These variables let me see which teams scored the most runs and hit the most home runs on average, and look for a correlation between the two. I first looked at the head of the data to see what it looks like and then checked its structure. After that, I counted the NAs in each column. Luckily, there were no NAs in the variables I was using, and the column names looked fine.

head(mlb_teams)
## # A tibble: 6 × 41
##    year league_id division_id  rank games_played home_games  wins losses
##   <dbl> <chr>     <chr>       <dbl>        <dbl>      <dbl> <dbl>  <dbl>
## 1  1876 NL        <NA>            4           70         NA    39     31
## 2  1876 NL        <NA>            1           66         NA    52     14
## 3  1876 NL        <NA>            8           65         NA     9     56
## 4  1876 NL        <NA>            2           69         NA    47     21
## 5  1876 NL        <NA>            5           69         NA    30     36
## 6  1876 NL        <NA>            6           57         NA    21     35
## # ℹ 33 more variables: division_winner <chr>, wild_card_winner <chr>,
## #   league_winner <chr>, world_series_winner <chr>, runs_scored <dbl>,
## #   at_bats <dbl>, hits <dbl>, doubles <dbl>, triples <dbl>, homeruns <dbl>,
## #   walks <dbl>, strikeouts_by_batters <dbl>, stolen_bases <dbl>,
## #   caught_stealing <dbl>, batters_hit_by_pitch <dbl>, sacrifice_flies <dbl>,
## #   opponents_runs_scored <dbl>, earned_runs_allowed <dbl>,
## #   earned_run_average <dbl>, complete_games <dbl>, shutouts <dbl>, …
str(mlb_teams)
## spc_tbl_ [2,784 × 41] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year                  : num [1:2784] 1876 1876 1876 1876 1876 ...
##  $ league_id             : chr [1:2784] "NL" "NL" "NL" "NL" ...
##  $ division_id           : chr [1:2784] NA NA NA NA ...
##  $ rank                  : num [1:2784] 4 1 8 2 5 6 7 3 1 5 ...
##  $ games_played          : num [1:2784] 70 66 65 69 69 57 60 64 61 60 ...
##  $ home_games            : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
##  $ wins                  : num [1:2784] 39 52 9 47 30 21 14 45 42 26 ...
##  $ losses                : num [1:2784] 31 14 56 21 36 35 45 19 18 33 ...
##  $ division_winner       : chr [1:2784] NA NA NA NA ...
##  $ wild_card_winner      : chr [1:2784] NA NA NA NA ...
##  $ league_winner         : chr [1:2784] "N" "Y" "N" "N" ...
##  $ world_series_winner   : chr [1:2784] NA NA NA NA ...
##  $ runs_scored           : num [1:2784] 471 624 238 429 280 260 378 386 419 366 ...
##  $ at_bats               : num [1:2784] 2722 2748 2372 2664 2570 ...
##  $ hits                  : num [1:2784] 723 926 555 711 641 494 646 642 700 633 ...
##  $ doubles               : num [1:2784] 96 131 51 96 68 39 79 73 91 79 ...
##  $ triples               : num [1:2784] 24 32 12 22 14 15 35 27 37 30 ...
##  $ homeruns              : num [1:2784] 9 8 4 2 6 2 7 2 4 0 ...
##  $ walks                 : num [1:2784] 58 70 41 39 24 18 27 59 65 57 ...
##  $ strikeouts_by_batters : num [1:2784] 98 45 136 78 98 35 36 63 121 111 ...
##  $ stolen_bases          : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
##  $ caught_stealing       : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
##  $ batters_hit_by_pitch  : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
##  $ sacrifice_flies       : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
##  $ opponents_runs_scored : num [1:2784] 450 257 579 261 344 412 534 229 263 375 ...
##  $ earned_runs_allowed   : num [1:2784] 176 116 238 116 121 173 197 78 131 200 ...
##  $ earned_run_average    : num [1:2784] 2.51 1.76 3.62 1.67 1.69 2.94 3.22 1.22 2.15 3.37 ...
##  $ complete_games        : num [1:2784] 49 58 57 69 67 56 53 63 61 45 ...
##  $ shutouts              : num [1:2784] 3 9 0 11 5 2 1 16 7 3 ...
##  $ saves                 : num [1:2784] 7 4 0 0 0 0 2 0 0 3 ...
##  $ outs_pitches          : num [1:2784] 1896 1777 1773 1872 1929 ...
##  $ hits_allowed          : num [1:2784] 732 608 850 570 605 718 783 472 557 630 ...
##  $ homeruns_allowed      : num [1:2784] 7 6 9 2 3 8 2 3 5 7 ...
##  $ walks_allowed         : num [1:2784] 104 29 34 27 38 24 41 39 38 58 ...
##  $ strikeouts_by_pitchers: num [1:2784] 77 51 60 114 125 37 22 103 177 92 ...
##  $ errors                : num [1:2784] 442 282 469 337 397 473 456 268 290 313 ...
##  $ double_plays          : num [1:2784] 42 33 45 27 44 18 32 33 36 43 ...
##  $ fielding_percentage   : num [1:2784] 0.86 0.899 0.841 0.888 0.875 0.825 0.839 0.902 0.889 0.883 ...
##  $ team_name             : chr [1:2784] "Boston Red Caps" "Chicago White Stockings" "Cincinnati Reds" "Hartford Dark Blues" ...
##  $ ball_park             : chr [1:2784] "South End Grounds I" "23rd Street Grounds" "Avenue Grounds" "Hartford Ball Club Grounds" ...
##  $ home_attendance       : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   league_id = col_character(),
##   ..   division_id = col_character(),
##   ..   rank = col_double(),
##   ..   games_played = col_double(),
##   ..   home_games = col_double(),
##   ..   wins = col_double(),
##   ..   losses = col_double(),
##   ..   division_winner = col_character(),
##   ..   wild_card_winner = col_character(),
##   ..   league_winner = col_character(),
##   ..   world_series_winner = col_character(),
##   ..   runs_scored = col_double(),
##   ..   at_bats = col_double(),
##   ..   hits = col_double(),
##   ..   doubles = col_double(),
##   ..   triples = col_double(),
##   ..   homeruns = col_double(),
##   ..   walks = col_double(),
##   ..   strikeouts_by_batters = col_double(),
##   ..   stolen_bases = col_double(),
##   ..   caught_stealing = col_double(),
##   ..   batters_hit_by_pitch = col_double(),
##   ..   sacrifice_flies = col_double(),
##   ..   opponents_runs_scored = col_double(),
##   ..   earned_runs_allowed = col_double(),
##   ..   earned_run_average = col_double(),
##   ..   complete_games = col_double(),
##   ..   shutouts = col_double(),
##   ..   saves = col_double(),
##   ..   outs_pitches = col_double(),
##   ..   hits_allowed = col_double(),
##   ..   homeruns_allowed = col_double(),
##   ..   walks_allowed = col_double(),
##   ..   strikeouts_by_pitchers = col_double(),
##   ..   errors = col_double(),
##   ..   double_plays = col_double(),
##   ..   fielding_percentage = col_double(),
##   ..   team_name = col_character(),
##   ..   ball_park = col_character(),
##   ..   home_attendance = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
colSums(is.na(mlb_teams))
##                   year              league_id            division_id 
##                      0                      0                   1346 
##                   rank           games_played             home_games 
##                      0                      0                    228 
##                   wins                 losses        division_winner 
##                      0                      0                   1374 
##       wild_card_winner          league_winner    world_series_winner 
##                   2010                     28                    248 
##            runs_scored                at_bats                   hits 
##                      0                      0                      0 
##                doubles                triples               homeruns 
##                      0                      0                      0 
##                  walks  strikeouts_by_batters           stolen_bases 
##                      0                     16                     76 
##        caught_stealing   batters_hit_by_pitch        sacrifice_flies 
##                    708                   1066                   1370 
##  opponents_runs_scored    earned_runs_allowed     earned_run_average 
##                      0                      0                      0 
##         complete_games               shutouts                  saves 
##                      0                      0                      0 
##           outs_pitches           hits_allowed       homeruns_allowed 
##                      0                      0                      0 
##          walks_allowed strikeouts_by_pitchers                 errors 
##                      0                      0                      0 
##           double_plays    fielding_percentage              team_name 
##                      0                      0                      0 
##              ball_park        home_attendance 
##                      0                    108
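
Because only three columns are used in this analysis, the missing-value check can also be limited to them. The following is a minimal sketch, assuming a small made-up data frame called toy_teams (hypothetical values, not taken from mlb_teams.csv) that has the same three column names:

```r
# Toy stand-in for mlb_teams; these rows are invented for illustration only.
toy_teams <- data.frame(
  team_name   = c("Team A", "Team A", "Team B", "Team B"),
  runs_scored = c(700, 650, NA, 720),
  homeruns    = c(150, 140, 160, 155)
)

# Columns the analysis relies on.
vars_used <- c("team_name", "runs_scored", "homeruns")

# Count missing values in just those columns.
na_counts <- colSums(is.na(toy_teams[, vars_used]))
na_counts

# TRUE only if none of the analysis columns contain an NA.
all(na_counts == 0)
```

With the real data, replacing toy_teams with mlb_teams should report zero for all three columns, in line with the full colSums() output above.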

This is where I computed each team's average runs scored and home runs per season over its entire time in the league. I first created a new data set called average by selecting the variables team_name, runs_scored, and homeruns from the mlb_teams data set. I then grouped the data by team name and summarized the mean runs scored and mean home runs for each team. Because the grouping is by team_name, a franchise that changed its name (for example the Boston Beaneaters, Doves, Bees, and Braves) appears as several separate teams. Finally, I created the data sets average_runs and average_homeruns by arranging average in descending order of mean runs and of mean home runs, so I could see the teams with the highest averages for each.

# Mean runs scored and mean home runs per season for each team name
average <- mlb_teams |>
  select(team_name, runs_scored, homeruns) |>
  group_by(team_name) |>
  summarize(mean_runs = mean(runs_scored), mean_homeruns = mean(homeruns))
average
## # A tibble: 86 × 3
##    team_name            mean_runs mean_homeruns
##    <chr>                    <dbl>         <dbl>
##  1 Anaheim Angels            788.         166. 
##  2 Arizona Diamondbacks      717.         167. 
##  3 Atlanta Braves            690.         152. 
##  4 Baltimore Orioles         718.         140. 
##  5 Boston Americans          607.          30.4
##  6 Boston Beaneaters         747.          39.7
##  7 Boston Bees               593.          59.8
##  8 Boston Braves             618.          54.9
##  9 Boston Doves              492.          21  
## 10 Boston Red Caps           426.          10.7
## # ℹ 76 more rows
average_runs <- average |>
  arrange(desc(mean_runs))
average_runs
## # A tibble: 86 × 3
##    team_name            mean_runs mean_homeruns
##    <chr>                    <dbl>         <dbl>
##  1 Brooklyn Grooms           876.          36.2
##  2 Chicago Colts             839           47.1
##  3 St. Louis Perfectos       819           47  
##  4 Cleveland Spiders         789.          24.2
##  5 Colorado Rockies          788.         179. 
##  6 Anaheim Angels            788.         166. 
##  7 New York Yankees          772.         149. 
##  8 Brooklyn Bridegrooms      754           28  
##  9 Boston Beaneaters         747.          39.7
## 10 Cincinnati Redlegs        745.         170  
## # ℹ 76 more rows
average_homeruns <- average |>
  arrange(desc(mean_homeruns))
average_homeruns
## # A tibble: 86 × 3
##    team_name                     mean_runs mean_homeruns
##    <chr>                             <dbl>         <dbl>
##  1 Colorado Rockies                   788.          179.
##  2 Milwaukee Braves                   722.          172.
##  3 Tampa Bay Rays                     683           171.
##  4 Cincinnati Redlegs                 745.          170 
##  5 Toronto Blue Jays                  725.          168.
##  6 Arizona Diamondbacks               717.          167.
##  7 Anaheim Angels                     788.          166.
##  8 Los Angeles Angels of Anaheim      718.          163.
##  9 Texas Rangers                      740.          163.
## 10 Oakland Athletics                  716.          158.
## # ℹ 76 more rows
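
The two top-10 lists above can be compared directly to see which teams appear in both. This is a small sketch that retypes the team names from the printed tables rather than reading the CSV; the object names top_runs, top_homeruns, both, and overlap are my own and are not part of the original analysis:

```r
# Team names copied from the two top-10 tables printed above.
top_runs <- c("Brooklyn Grooms", "Chicago Colts", "St. Louis Perfectos",
              "Cleveland Spiders", "Colorado Rockies", "Anaheim Angels",
              "New York Yankees", "Brooklyn Bridegrooms", "Boston Beaneaters",
              "Cincinnati Redlegs")
top_homeruns <- c("Colorado Rockies", "Milwaukee Braves", "Tampa Bay Rays",
                  "Cincinnati Redlegs", "Toronto Blue Jays",
                  "Arizona Diamondbacks", "Anaheim Angels",
                  "Los Angeles Angels of Anaheim", "Texas Rangers",
                  "Oakland Athletics")

# Teams that appear in both lists, kept in the order of the runs ranking.
both <- intersect(top_runs, top_homeruns)

# Position of each shared team in the two rankings.
overlap <- data.frame(
  team          = both,
  runs_rank     = match(both, top_runs),
  homeruns_rank = match(both, top_homeruns)
)
overlap
```

With the full objects in memory, intersect(head(average_runs$team_name, 10), head(average_homeruns$team_name, 10)) should give the same three teams without retyping any names.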

Here I created a new data set called average_none by selecting only mean_runs and mean_homeruns from the average data set. I then computed the correlation matrix of average_none to see whether the average number of home runs hit is correlated with the average number of runs scored, and plotted the result with corrplot.

average_none <- average |>
  select(mean_runs, mean_homeruns)
correlation <- cor(average_none)
correlation
##               mean_runs mean_homeruns
## mean_runs     1.0000000     0.6042898
## mean_homeruns 0.6042898     1.0000000
corrplot(correlation, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "white",
         title = "Correlation between the average runs scored and average homeruns")
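
To make clear what the number returned by cor() measures, here is a minimal sketch of the Pearson correlation computed by hand and compared with the built-in function. The two vectors are hypothetical values invented for illustration, not the real team averages, and the names dev_runs, dev_hr, r_manual, and r_builtin are my own:

```r
# Hypothetical team averages, invented for illustration (not from mlb_teams.csv).
mean_runs     <- c(600, 650, 700, 720, 780)
mean_homeruns <- c(40, 90, 120, 100, 170)

# Deviations of each value from its own mean.
dev_runs <- mean_runs - mean(mean_runs)
dev_hr   <- mean_homeruns - mean(mean_homeruns)

# Pearson's r: sum of cross-products divided by the square root of the
# product of the two sums of squared deviations.
r_manual <- sum(dev_runs * dev_hr) / sqrt(sum(dev_runs^2) * sum(dev_hr^2))

# Built-in equivalent; cor() uses method = "pearson" by default.
r_builtin <- cor(mean_runs, mean_homeruns)

# The two calculations should agree up to floating-point tolerance.
all.equal(r_manual, r_builtin)
```

A value near 1 means the two averages rise together almost in a straight line, a value near 0 means there is little linear relationship, and a value near -1 means one falls as the other rises. The cor.test() function in base R reports a confidence interval and p-value for the same statistic.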

Conclusion and Future Directions

After this analysis, I now know which teams scored the most runs on average: the Brooklyn Grooms had the highest average runs scored, followed by the Chicago Colts and the St. Louis Perfectos. I also know which teams hit the most home runs on average: the Colorado Rockies, followed by the Milwaukee Braves and the Tampa Bay Rays. Only three teams made it into the top 10 for both average runs scored and average home runs hit: the Colorado Rockies (1st in home runs and 5th in runs scored), the Anaheim Angels (7th in home runs and 6th in runs scored), and the Cincinnati Redlegs (4th in home runs and 10th in runs scored).

The correlation I found between mean home runs and mean runs was 0.6042898, showing that there was a moderate positive correlation between the two. This suggests that teams that hit more home runs on average also tend to score more runs on average, although a correlation alone does not show that one causes the other.

There is new data from each team in the league every year, and this data set has no data for the years 2021-2025. In a few years it would be worth coming back to see if anything has changed, such as the correlation between mean home runs and mean runs getting stronger or weaker, or a team moving up or down in average home runs or runs scored.

References

Data set found on openintro.org at https://www.openintro.org/data/index.php?data=mlb_teams

I used the correlation function from the “Descriptive Statistics” activity from week 6.

I also looked at the “Exploratory Data Analysis air quality” activity from week 5 for functions about EDA.