library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
setwd("C:/Users/njnav/OneDrive/Data 101/Projects")
mlb_teams <- read_csv("mlb_teams.csv")
## Rows: 2784 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): league_id, division_id, division_winner, wild_card_winner, league_...
## dbl (33): year, rank, games_played, home_games, wins, losses, runs_scored, a...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Which teams scored the most runs on average, and which teams hit the most home runs on average, out of every team that has played in the league? Is there a correlation between the number of home runs hit and the number of runs scored?
Since Major League Baseball (MLB) began, many teams have joined the league and played against one another. Many seasons have been played since 1876, with many different World Series winners. I wanted to know which teams in the league's history have scored the most runs (points) and hit the most home runs on average, and whether the two are correlated. The data set I am using, mlb_teams, is a subset of Lahman’s Baseball Database, which includes various kinds of data about MLB teams. I got this data set from the website openintro.org.
The data set has 2784 observations of 41 variables, including the three I will focus on: team names, runs scored, and home runs. These variables let me see which teams were the most consistent at scoring runs and hitting home runs. They also let me check whether there is a correlation between the number of home runs hit and the number of runs scored.
Unfortunately, the data set only covers the league's first season in 1876 through its 145th season in 2020. The five seasons from 2021 to 2025 are therefore not included, which could change the outcome slightly. The results should be read with this limitation in mind.
The variables I will be focusing on are team_name, runs_scored, and homeruns.
team_name: The name of the team
runs_scored: The number of runs the team scored during a season
homeruns: The number of home runs hit by the team's batters during a season
These variables let me see which teams scored the most runs and hit the most home runs on average, and look for a correlation between the two. I first looked at the head of the data to see what it looks like and then checked its structure. After that I counted the NAs in each column. Luckily there were no NAs in the three variables I was using, and the column names looked fine.
head(mlb_teams)
## # A tibble: 6 × 41
## year league_id division_id rank games_played home_games wins losses
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1876 NL <NA> 4 70 NA 39 31
## 2 1876 NL <NA> 1 66 NA 52 14
## 3 1876 NL <NA> 8 65 NA 9 56
## 4 1876 NL <NA> 2 69 NA 47 21
## 5 1876 NL <NA> 5 69 NA 30 36
## 6 1876 NL <NA> 6 57 NA 21 35
## # ℹ 33 more variables: division_winner <chr>, wild_card_winner <chr>,
## # league_winner <chr>, world_series_winner <chr>, runs_scored <dbl>,
## # at_bats <dbl>, hits <dbl>, doubles <dbl>, triples <dbl>, homeruns <dbl>,
## # walks <dbl>, strikeouts_by_batters <dbl>, stolen_bases <dbl>,
## # caught_stealing <dbl>, batters_hit_by_pitch <dbl>, sacrifice_flies <dbl>,
## # opponents_runs_scored <dbl>, earned_runs_allowed <dbl>,
## # earned_run_average <dbl>, complete_games <dbl>, shutouts <dbl>, …
str(mlb_teams)
## spc_tbl_ [2,784 × 41] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:2784] 1876 1876 1876 1876 1876 ...
## $ league_id : chr [1:2784] "NL" "NL" "NL" "NL" ...
## $ division_id : chr [1:2784] NA NA NA NA ...
## $ rank : num [1:2784] 4 1 8 2 5 6 7 3 1 5 ...
## $ games_played : num [1:2784] 70 66 65 69 69 57 60 64 61 60 ...
## $ home_games : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
## $ wins : num [1:2784] 39 52 9 47 30 21 14 45 42 26 ...
## $ losses : num [1:2784] 31 14 56 21 36 35 45 19 18 33 ...
## $ division_winner : chr [1:2784] NA NA NA NA ...
## $ wild_card_winner : chr [1:2784] NA NA NA NA ...
## $ league_winner : chr [1:2784] "N" "Y" "N" "N" ...
## $ world_series_winner : chr [1:2784] NA NA NA NA ...
## $ runs_scored : num [1:2784] 471 624 238 429 280 260 378 386 419 366 ...
## $ at_bats : num [1:2784] 2722 2748 2372 2664 2570 ...
## $ hits : num [1:2784] 723 926 555 711 641 494 646 642 700 633 ...
## $ doubles : num [1:2784] 96 131 51 96 68 39 79 73 91 79 ...
## $ triples : num [1:2784] 24 32 12 22 14 15 35 27 37 30 ...
## $ homeruns : num [1:2784] 9 8 4 2 6 2 7 2 4 0 ...
## $ walks : num [1:2784] 58 70 41 39 24 18 27 59 65 57 ...
## $ strikeouts_by_batters : num [1:2784] 98 45 136 78 98 35 36 63 121 111 ...
## $ stolen_bases : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
## $ caught_stealing : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
## $ batters_hit_by_pitch : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
## $ sacrifice_flies : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
## $ opponents_runs_scored : num [1:2784] 450 257 579 261 344 412 534 229 263 375 ...
## $ earned_runs_allowed : num [1:2784] 176 116 238 116 121 173 197 78 131 200 ...
## $ earned_run_average : num [1:2784] 2.51 1.76 3.62 1.67 1.69 2.94 3.22 1.22 2.15 3.37 ...
## $ complete_games : num [1:2784] 49 58 57 69 67 56 53 63 61 45 ...
## $ shutouts : num [1:2784] 3 9 0 11 5 2 1 16 7 3 ...
## $ saves : num [1:2784] 7 4 0 0 0 0 2 0 0 3 ...
## $ outs_pitches : num [1:2784] 1896 1777 1773 1872 1929 ...
## $ hits_allowed : num [1:2784] 732 608 850 570 605 718 783 472 557 630 ...
## $ homeruns_allowed : num [1:2784] 7 6 9 2 3 8 2 3 5 7 ...
## $ walks_allowed : num [1:2784] 104 29 34 27 38 24 41 39 38 58 ...
## $ strikeouts_by_pitchers: num [1:2784] 77 51 60 114 125 37 22 103 177 92 ...
## $ errors : num [1:2784] 442 282 469 337 397 473 456 268 290 313 ...
## $ double_plays : num [1:2784] 42 33 45 27 44 18 32 33 36 43 ...
## $ fielding_percentage : num [1:2784] 0.86 0.899 0.841 0.888 0.875 0.825 0.839 0.902 0.889 0.883 ...
## $ team_name : chr [1:2784] "Boston Red Caps" "Chicago White Stockings" "Cincinnati Reds" "Hartford Dark Blues" ...
## $ ball_park : chr [1:2784] "South End Grounds I" "23rd Street Grounds" "Avenue Grounds" "Hartford Ball Club Grounds" ...
## $ home_attendance : num [1:2784] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. year = col_double(),
## .. league_id = col_character(),
## .. division_id = col_character(),
## .. rank = col_double(),
## .. games_played = col_double(),
## .. home_games = col_double(),
## .. wins = col_double(),
## .. losses = col_double(),
## .. division_winner = col_character(),
## .. wild_card_winner = col_character(),
## .. league_winner = col_character(),
## .. world_series_winner = col_character(),
## .. runs_scored = col_double(),
## .. at_bats = col_double(),
## .. hits = col_double(),
## .. doubles = col_double(),
## .. triples = col_double(),
## .. homeruns = col_double(),
## .. walks = col_double(),
## .. strikeouts_by_batters = col_double(),
## .. stolen_bases = col_double(),
## .. caught_stealing = col_double(),
## .. batters_hit_by_pitch = col_double(),
## .. sacrifice_flies = col_double(),
## .. opponents_runs_scored = col_double(),
## .. earned_runs_allowed = col_double(),
## .. earned_run_average = col_double(),
## .. complete_games = col_double(),
## .. shutouts = col_double(),
## .. saves = col_double(),
## .. outs_pitches = col_double(),
## .. hits_allowed = col_double(),
## .. homeruns_allowed = col_double(),
## .. walks_allowed = col_double(),
## .. strikeouts_by_pitchers = col_double(),
## .. errors = col_double(),
## .. double_plays = col_double(),
## .. fielding_percentage = col_double(),
## .. team_name = col_character(),
## .. ball_park = col_character(),
## .. home_attendance = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colSums(is.na(mlb_teams))
## year league_id division_id
## 0 0 1346
## rank games_played home_games
## 0 0 228
## wins losses division_winner
## 0 0 1374
## wild_card_winner league_winner world_series_winner
## 2010 28 248
## runs_scored at_bats hits
## 0 0 0
## doubles triples homeruns
## 0 0 0
## walks strikeouts_by_batters stolen_bases
## 0 16 76
## caught_stealing batters_hit_by_pitch sacrifice_flies
## 708 1066 1370
## opponents_runs_scored earned_runs_allowed earned_run_average
## 0 0 0
## complete_games shutouts saves
## 0 0 0
## outs_pitches hits_allowed homeruns_allowed
## 0 0 0
## walks_allowed strikeouts_by_pitchers errors
## 0 0 0
## double_plays fielding_percentage team_name
## 0 0 0
## ball_park home_attendance
## 0 108
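Since only three columns are used in this analysis, the NA check above can also be narrowed to just those columns. The following is a minimal sketch of that idea; `toy` and `na_counts` are made-up names, and the values in `toy` are invented stand-ins rather than rows from mlb_teams.

```r
library(dplyr)

# Hypothetical toy data standing in for mlb_teams (all values are made up)
toy <- tibble(
  team_name   = c("Team A", "Team A", "Team B"),
  runs_scored = c(700, 650, NA),
  homeruns    = c(150, 120, 90),
  home_games  = c(NA, 81, 81)
)

# Count missing values only in the three columns used in the analysis
na_counts <- toy |>
  summarize(across(c(team_name, runs_scored, homeruns), ~ sum(is.na(.x))))
na_counts
```

On the real data, replacing `toy` with mlb_teams should give the same zeros that colSums() reports for these three columns.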
In this step I computed each team's average runs scored and average home runs per season over its entire time in the league. I created a new data frame called average by selecting team_name, runs_scored, and homeruns from mlb_teams, grouping by team name, and summarizing the mean runs scored and mean home runs for each team. The averages are computed per team name, so each distinct name in team_name is treated as its own team. I then created average_runs and average_homeruns by arranging average in descending order of mean_runs and mean_homeruns, respectively, so I could see the teams with the highest averages for each.
# Per-team averages of runs scored and home runs across all seasons
average <- mlb_teams |>
  select(team_name, runs_scored, homeruns) |>
  group_by(team_name) |>
  summarize(mean_runs = mean(runs_scored), mean_homeruns = mean(homeruns))
average
## # A tibble: 86 × 3
## team_name mean_runs mean_homeruns
## <chr> <dbl> <dbl>
## 1 Anaheim Angels 788. 166.
## 2 Arizona Diamondbacks 717. 167.
## 3 Atlanta Braves 690. 152.
## 4 Baltimore Orioles 718. 140.
## 5 Boston Americans 607. 30.4
## 6 Boston Beaneaters 747. 39.7
## 7 Boston Bees 593. 59.8
## 8 Boston Braves 618. 54.9
## 9 Boston Doves 492. 21
## 10 Boston Red Caps 426. 10.7
## # ℹ 76 more rows
average_runs <- average |>
  arrange(desc(mean_runs))
average_runs
## # A tibble: 86 × 3
## team_name mean_runs mean_homeruns
## <chr> <dbl> <dbl>
## 1 Brooklyn Grooms 876. 36.2
## 2 Chicago Colts 839 47.1
## 3 St. Louis Perfectos 819 47
## 4 Cleveland Spiders 789. 24.2
## 5 Colorado Rockies 788. 179.
## 6 Anaheim Angels 788. 166.
## 7 New York Yankees 772. 149.
## 8 Brooklyn Bridegrooms 754 28
## 9 Boston Beaneaters 747. 39.7
## 10 Cincinnati Redlegs 745. 170
## # ℹ 76 more rows
average_homeruns <- average |>
  arrange(desc(mean_homeruns))
average_homeruns
## # A tibble: 86 × 3
## team_name mean_runs mean_homeruns
## <chr> <dbl> <dbl>
## 1 Colorado Rockies 788. 179.
## 2 Milwaukee Braves 722. 172.
## 3 Tampa Bay Rays 683 171.
## 4 Cincinnati Redlegs 745. 170
## 5 Toronto Blue Jays 725. 168.
## 6 Arizona Diamondbacks 717. 167.
## 7 Anaheim Angels 788. 166.
## 8 Los Angeles Angels of Anaheim 718. 163.
## 9 Texas Rangers 740. 163.
## 10 Oakland Athletics 716. 158.
## # ℹ 76 more rows
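A per-team mean does not show how many seasons went into it, so a team name with only a few seasons is ranked alongside one with many. As a rough sketch, assuming each row is one team-season, the same summarize step could also record a season count. The names `toy`, `seasons`, and `average_with_n` are hypothetical, and the numbers are made up.

```r
library(dplyr)

# Hypothetical toy data standing in for mlb_teams (all values are made up)
toy <- tibble(
  team_name   = c("Team A", "Team A", "Team A", "Team B"),
  runs_scored = c(700, 650, 720, 900),
  homeruns    = c(150, 120, 160, 40)
)

# Same grouping as before, plus the number of rows (seasons) behind each mean
average_with_n <- toy |>
  group_by(team_name) |>
  summarize(
    seasons       = n(),
    mean_runs     = mean(runs_scored),
    mean_homeruns = mean(homeruns)
  ) |>
  arrange(desc(mean_runs))
average_with_n
```

In this toy example Team B ranks first in mean_runs on the strength of a single season, which is the kind of case the extra column makes visible.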
Here I created a new data frame called average_none containing only mean_runs and mean_homeruns from average. I then computed the correlation matrix of average_none to see whether the average number of home runs hit is correlated with the average number of runs scored.
average_none <- average |>
  select(mean_runs, mean_homeruns)
correlation <- cor(average_none)
correlation
## mean_runs mean_homeruns
## mean_runs 1.0000000 0.6042898
## mean_homeruns 0.6042898 1.0000000
corrplot(correlation, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45, addCoef.col = "white",
title = "Correlation between the average runs scored and average homeruns")
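With only two variables, the correlation matrix holds a single value of interest, and that value can also be computed directly from the two columns. Below is a small sketch on made-up numbers; `toy_average`, `r`, and `r_matrix` are hypothetical names, and the values are not from mlb_teams.

```r
# Hypothetical team averages (made-up values, not from mlb_teams)
toy_average <- data.frame(
  mean_runs     = c(600, 650, 700, 720, 780),
  mean_homeruns = c(30, 90, 80, 150, 170)
)

# Pearson correlation between the two columns, returned as a single number
r <- cor(toy_average$mean_runs, toy_average$mean_homeruns)

# The matrix form holds the same value in its off-diagonal cells
r_matrix <- cor(toy_average)

r
r_matrix["mean_runs", "mean_homeruns"]
```

Both forms use Pearson's method by default, so they agree; the matrix form becomes more useful once more than two variables are compared.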
From this analysis I now know which teams scored the most runs on average: the Brooklyn Grooms had the highest average runs scored, followed by the Chicago Colts and the St. Louis Perfectos. I also know which teams hit the most home runs on average: the Colorado Rockies, followed by the Milwaukee Braves and the Tampa Bay Rays. Only three teams made the top 10 for both average runs scored and average home runs: the Colorado Rockies (1st in home runs and 5th in runs scored), the Anaheim Angels (7th in home runs and 6th in runs scored), and the Cincinnati Redlegs (4th in home runs and 10th in runs scored).
The correlation between mean home runs and mean runs was 0.6042898, a moderate positive correlation. With this data I can say that teams that hit more home runs on average also tend to score more runs on average over their time in the league, although a correlation by itself does not prove that the home runs caused the extra runs.
New data is added for every team each year, and this data set did not include the 2021-2025 seasons. In a few years it would be worth coming back to see whether anything has changed, such as the correlation between mean home runs and mean runs getting stronger or weaker, or a team moving up or down in average home runs or runs scored.
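The overlap between the two top-10 lists was found by reading the tables, but it could also be computed. The sketch below shows one way on made-up data, using a top 2 instead of a top 10; `toy_average`, `top_n_size`, `top_runs`, `top_homeruns`, and `both` are hypothetical names.

```r
library(dplyr)

# Hypothetical per-team averages (made-up values, not from mlb_teams)
toy_average <- tibble(
  team_name     = c("Team A", "Team B", "Team C", "Team D"),
  mean_runs     = c(800, 750, 700, 650),
  mean_homeruns = c(40, 170, 160, 180)
)

top_n_size <- 2  # the report compares the top 10; 2 keeps the toy example small

# Names of the teams with the highest average runs scored
top_runs <- toy_average |>
  slice_max(mean_runs, n = top_n_size) |>
  pull(team_name)

# Names of the teams with the highest average home runs
top_homeruns <- toy_average |>
  slice_max(mean_homeruns, n = top_n_size) |>
  pull(team_name)

# Teams that appear in both lists
both <- intersect(top_runs, top_homeruns)
both
```

Note that slice_max() keeps ties by default, so a list can contain more than `top_n_size` teams when averages are equal.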
Data set from openintro.org: https://www.openintro.org/data/index.php?data=mlb_teams
I used the correlation function from the “Descriptive Statistics” activity from week 6.
I also looked at the “Exploratory Data Analysis air quality” activity from week 5 for EDA functions.