Task

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Introduction

I’m a Yankees fan, so it has been a hard few years. During the difficult summer of 2021, the manager of the Yankees (Aaron Boone) tried to explain the high number of double plays that the team hit into (at the time, tied with the Astros for the league lead). Boone said, “It’s part of the reality of usually the better teams. We need to get to be that offense that we can be, but typically the better teams are going to be higher up in hitting in double plays because they hit the ball harder and there’s more people on base.”\(^1\) Years later, Yankees fans continue\(^2\) to mock Boone for his suggestion that better teams hit into more double plays.

I was curious to know whether this was actually true. The purpose of this analysis is to use linear regression to assess the relationship between the number of double plays a team hits into and how good that team’s offense is (assessed by how many runs they score over the course of the season).

\(^1\) https://www.nj.com/yankees/2021/06/aaron-boone-gives-lame-excuse-after-yankees-burned-again-by-double-plays.html accessed 4/1/24

\(^2\) https://keefetothecity.com/yankees-thoughts-aaron-boone-somehow-believes-team-is-competing-asses-off/

Model Construction

I used data from https://www.baseball-reference.com to construct the file that is imported below. It includes the offensive statistics for each team in each season from 2010 through 2023 (I chose 2010 as the cutoff because the Yankees last won the World Series in 2009). Teams don’t always play the same number of games in a season: every team was affected by the pandemic-shortened 2020 season, and there are occasionally other instances as well. I therefore created new columns that scale double plays grounded into and runs scored to a standard 162-game season.

Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

baseball_data <- read.csv("https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/main/baseball2.csv")

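# Scale runs (R) and double plays grounded into (GDP) to a 162-game season,
# then keep only the columns used in the analysis
baseball_data <- baseball_data %>%
  mutate(
    runs_per_162 = round(R / G * 162),
    dp_per_162 = round(GDP / G * 162)
  ) %>%
  select(Yr, Tm, dp_per_162, runs_per_162)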
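
As a quick check of the scaling step, here is a minimal sketch applied to a made-up team in a 60-game season. The helper name scale_to_162 and the input numbers are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the data file.

```r
# Hypothetical helper: scale a season total to a 162-game pace,
# using the same formula as the mutate() step above.
scale_to_162 <- function(total, games) {
  round(total / games * 162)
}

# Made-up 60-game season: 40 double plays and 300 runs.
scale_to_162(40, 60)    # 108
scale_to_162(300, 60)   # 810

# A full 162-game season is left unchanged.
scale_to_162(120, 162)  # 120
```

Because the result is rounded to a whole number, scaled values from short seasons are slightly coarser than full-season counts.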

Model

dp_runs_model <- lm(runs_per_162 ~ dp_per_162, data = baseball_data)

dp_runs_model
## 
## Call:
## lm(formula = runs_per_162 ~ dp_per_162, data = baseball_data)
## 
## Coefficients:
## (Intercept)   dp_per_162  
##    696.0009       0.1755

If \(x\) represents the number of double plays that a team hits into and \(y\) represents the number of runs the team scores (both per 162 games), then based on the regression output above we can express runs scored as a function of double plays with the equation \(y = 696.0009 + 0.1755x\).

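We can interpret our regression equation to mean that we would expect a team that doesn’t hit into any double plays to score about 696 runs over the course of a season, and that each additional double play hit into over the course of the season is associated with an average increase of about 0.1755 runs scored. The intercept is an extrapolation, though: every team in the data hit into at least 75 double plays per 162 games, so a team with none is far outside the observed range.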
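
To make this interpretation concrete, the sketch below evaluates the fitted equation at two double-play totals. It uses the rounded coefficients printed above, and the helper name predict_runs is hypothetical, so the results are approximate.

```r
# Rounded coefficients copied from the printed model output.
intercept <- 696.0009
slope <- 0.1755

# Hypothetical helper: expected runs per 162 games for a given
# number of double plays per 162 games.
predict_runs <- function(dp) {
  intercept + slope * dp
}

predict_runs(100)                      # about 713.6
predict_runs(150)                      # about 722.3
predict_runs(150) - predict_runs(100)  # 50 * 0.1755 = 8.775
```

With the fitted model object available, predict(dp_runs_model, newdata = data.frame(dp_per_162 = c(100, 150))) should give essentially the same values, up to rounding. Under this model, a difference of 50 double plays corresponds to fewer than 9 expected runs over a full season.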

Visualization

The scatterplot for the data is shown below.

ggplot(baseball_data, aes(x = dp_per_162, y = runs_per_162)) + geom_point() +
  labs(title = "Double Plays vs. Runs Scored", x = "Double Plays", y = "Runs Scored")

The same scatterplot with the linear model overlaid is shown below.

ggplot(baseball_data, aes(x = dp_per_162, y = runs_per_162)) + geom_point() +
  labs(title = "Double Plays vs. Runs Scored", x = "Double Plays", y = "Runs Scored") +
  # Pass slope and intercept as fixed values (outside aes()) so a single line is drawn
  geom_abline(slope = coef(dp_runs_model)[2], intercept = coef(dp_runs_model)[1])

Quality Evaluation

summary(dp_runs_model)
## 
## Call:
## lm(formula = runs_per_162 ~ dp_per_162, data = baseball_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -205.991  -57.463   -2.007   48.769  228.536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 696.0009    27.6962  25.130   <2e-16 ***
## dp_per_162    0.1755     0.2307   0.761    0.447    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.81 on 418 degrees of freedom
## Multiple R-squared:  0.001382,   Adjusted R-squared:  -0.001007 
## F-statistic: 0.5785 on 1 and 418 DF,  p-value: 0.4473

We see that the median residual (-2.007) is close to 0 and that the first and third quartile residuals (-57.463 and 48.769) are of similar magnitude, both of which are desirable for a good linear model because they suggest the residuals are roughly symmetric around zero. The minimum and maximum residuals (-205.991 and 228.536) are not as close in magnitude as the quartiles, but they still differ by only about 23 runs.

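The t value for the slope is very low at 0.761: the estimated slope is less than one standard error away from 0, so the data cannot distinguish it from no relationship at all. The t value for the y-intercept is high at 25.130, indicating that the y-intercept is estimated precisely relative to its size.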
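
The t values can be reproduced by hand from the coefficient table, which may help show where they come from. This is a minimal sketch using the rounded estimates and standard errors copied from the summary output; the variable names are my own.

```r
# Estimates and standard errors copied from the summary() output.
estimate  <- c(intercept = 696.0009, slope = 0.1755)
std_error <- c(intercept = 27.6962,  slope = 0.2307)

# Each t value is the estimate divided by its standard error.
t_value <- estimate / std_error
round(t_value, 3)  # about 25.130 for the intercept and 0.761 for the slope
```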

The p-value for the slope is large at 0.447. This does not provide convincing evidence of a linear relationship between double plays hit into and runs scored. The p-value for the y-intercept is extremely small, which is strong evidence that the true y-intercept is not 0.
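
The slope p-value follows from the t value and the residual degrees of freedom. The sketch below recomputes it from the rounded values in the summary output, and also builds a 95% confidence interval for the slope. The interval is an extra illustration that does not appear in the output above, and the variable names are my own.

```r
t_slope  <- 0.761   # t value for the slope, from summary()
se_slope <- 0.2307  # standard error of the slope
b_slope  <- 0.1755  # estimated slope
df_resid <- 418     # residual degrees of freedom

# Two-sided p-value: probability of a t statistic at least this
# far from 0 if the true slope were 0.
p_slope <- 2 * pt(-abs(t_slope), df = df_resid)
round(p_slope, 3)   # about 0.447

# 95% confidence interval for the slope.
ci_slope <- b_slope + c(-1, 1) * qt(0.975, df = df_resid) * se_slope
round(ci_slope, 3)  # roughly -0.278 to 0.629
```

The interval contains 0, which matches the large p-value: the data are compatible with a small negative slope, no slope, or a small positive one. With the model object available, confint(dp_runs_model) should give a similar interval.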

The multiple R-squared value of 0.001382 means that only about 0.1% of the variability in runs scored is explained by the variation in double plays.
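
For a model with a single predictor, R-squared is the square of the correlation between the two variables, and the F-statistic can be recovered from it. A short sketch using the rounded values from the summary output (variable names are my own):

```r
r_squared <- 0.001382  # multiple R-squared from summary()
df_resid  <- 418       # residual degrees of freedom

# With a single predictor, R-squared is the squared correlation
# between the predictor and the response.
abs_correlation <- sqrt(r_squared)
round(abs_correlation, 3)  # about 0.037

# The overall F statistic can be recovered from R-squared.
f_stat <- r_squared / (1 - r_squared) * df_resid
round(f_stat, 2)           # about 0.58
```

A correlation of roughly 0.04 is very close to zero, and the recovered F-statistic agrees with the 0.5785 reported in the output.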

Residual Analysis

The residual plot for this linear model is shown below.

ggplot(data = dp_runs_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals")

We see from the plot that there is no noticeable pattern in the residuals. This is a good sign, as a high-quality linear model needs residuals with roughly constant variance and no systematic trend.
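
A rough numeric companion to the visual check is to compare the spread of the residuals in the lower and upper halves of the fitted values. The sketch below is only one simple way to do this. The helper name residual_spread_by_half is hypothetical, and the data are simulated stand-ins rather than the baseball data, so the numbers it prints say nothing about the actual model.

```r
# Hypothetical helper: standard deviation of the residuals in the
# lower and upper halves of the fitted values.
residual_spread_by_half <- function(model) {
  fitted_vals <- fitted(model)
  half <- ifelse(fitted_vals <= median(fitted_vals), "lower", "upper")
  tapply(resid(model), half, sd)
}

# Simulated stand-in data with constant error variance (not the
# baseball data), used only to show the helper running.
set.seed(1)
sim <- data.frame(x = runif(200, 75, 170))
sim$y <- 700 + 0.2 * sim$x + rnorm(200, sd = 80)
sim_model <- lm(y ~ x, data = sim)

spread <- residual_spread_by_half(sim_model)
spread
```

Calling residual_spread_by_half(dp_runs_model) would apply the same check to the real model; similar values in the two halves would be consistent with constant variance.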

We can further investigate the residuals with a Q-Q plot, shown below.

ggplot(data = dp_runs_model, aes(sample = .resid)) + stat_qq() + stat_qq_line() +
  labs(title = "Q-Q Residual Plot", x = "Theoretical Quantiles", y = "Sample Quantiles")

We see from our Q-Q plot that the points at both ends diverge slightly from the straight line. The residuals are therefore not perfectly normally distributed, but they are close.
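
A formal test can complement the Q-Q plot. One option in base R is the Shapiro-Wilk test, sketched below on simulated values that merely stand in for the residuals (they are not the baseball residuals, and the variable names are my own).

```r
# Simulated residual-like values (not the baseball residuals).
set.seed(2)
sim_resid <- rnorm(420, mean = 0, sd = 79)

# Shapiro-Wilk test: the null hypothesis is that the values come
# from a normal distribution, so a small p-value suggests non-normality.
sw <- shapiro.test(sim_resid)
sw$p.value
```

For the actual model the call would be shapiro.test(resid(dp_runs_model)). With several hundred observations the test can flag small departures from normality, so it is best read alongside the Q-Q plot rather than in place of it.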

All four default diagnostic plots are shown below.

par(mfrow = c(2,2))
plot(dp_runs_model)

In conclusion, the linear model we developed is a poor fit for the data, as indicated by the very low R-squared of 0.001382 and the high slope p-value of 0.447. The residuals themselves are reasonably well behaved (no noticeable pattern and close to normal), so the problem is not the linear form but the near absence of any relationship to model. Even a glance at the random-looking scatterplot suggests that there is barely any relationship between how often teams ground into double plays and how many runs they score. Aaron Boone’s claim is not supported by the data, although the analysis likewise provides no evidence that teams that ground into more double plays get worse results from their offense.

Full data:

library(knitr)

kable(baseball_data, col.names = c("Year", "Team", "Double Plays per 162 Games", "Runs per 162 Games"), align = "clcc")
Year Team Double Plays per 162 Games Runs per 162 Games
2023 Arizona Diamondbacks 121 746
2023 Atlanta Braves 128 947
2023 Baltimore Orioles 114 807
2023 Boston Red Sox 118 772
2023 Chicago Cubs 95 819
2023 Chicago White Sox 122 641
2023 Cincinnati Reds 101 783
2023 Cleveland Guardians 108 662
2023 Colorado Rockies 124 721
2023 Detroit Tigers 107 661
2023 Houston Astros 124 827
2023 Kansas City Royals 96 676
2023 Los Angeles Angels 117 739
2023 Los Angeles Dodgers 98 906
2023 Miami Marlins 159 666
2023 Milwaukee Brewers 144 728
2023 Minnesota Twins 119 778
2023 New York Mets 105 717
2023 New York Yankees 119 673
2023 Oakland Athletics 125 585
2023 Philadelphia Phillies 110 796
2023 Pittsburgh Pirates 92 692
2023 San Diego Padres 134 752
2023 Seattle Mariners 95 758
2023 San Francisco Giants 116 674
2023 St. Louis Cardinals 122 719
2023 Tampa Bay Rays 109 860
2023 Texas Rangers 103 881
2023 Toronto Blue Jays 129 746
2023 Washington Nationals 112 700
2022 Arizona Diamondbacks 97 702
2022 Atlanta Braves 103 789
2022 Baltimore Orioles 95 674
2022 Boston Red Sox 131 735
2022 Chicago Cubs 130 657
2022 Chicago White Sox 127 686
2022 Cincinnati Reds 127 648
2022 Cleveland Guardians 119 698
2022 Colorado Rockies 139 698
2022 Detroit Tigers 108 557
2022 Houston Astros 118 737
2022 Kansas City Royals 101 640
2022 Los Angeles Angels 95 623
2022 Los Angeles Dodgers 85 847
2022 Miami Marlins 120 586
2022 Milwaukee Brewers 117 725
2022 Minnesota Twins 133 696
2022 New York Mets 122 772
2022 New York Yankees 121 807
2022 Oakland Athletics 109 568
2022 Philadelphia Phillies 116 747
2022 Pittsburgh Pirates 96 591
2022 San Diego Padres 95 705
2022 Seattle Mariners 120 690
2022 San Francisco Giants 109 716
2022 St. Louis Cardinals 112 772
2022 Tampa Bay Rays 93 666
2022 Texas Rangers 82 707
2022 Toronto Blue Jays 136 775
2022 Washington Nationals 141 603
2021 Arizona Diamondbacks 99 679
2021 Atlanta Braves 82 795
2021 Baltimore Orioles 105 659
2021 Boston Red Sox 100 829
2021 Chicago Cubs 133 705
2021 Chicago White Sox 139 796
2021 Cincinnati Reds 129 786
2021 Cleveland Indians 105 717
2021 Colorado Rockies 99 744
2021 Detroit Tigers 113 697
2021 Houston Astros 136 863
2021 Kansas City Royals 100 686
2021 Los Angeles Angels 107 723
2021 Los Angeles Dodgers 96 830
2021 Miami Marlins 95 623
2021 Milwaukee Brewers 102 738
2021 Minnesota Twins 122 729
2021 New York Mets 123 636
2021 New York Yankees 154 711
2021 Oakland Athletics 99 743
2021 Philadelphia Phillies 103 734
2021 Pittsburgh Pirates 102 609
2021 San Diego Padres 121 729
2021 Seattle Mariners 92 697
2021 San Francisco Giants 117 804
2021 St. Louis Cardinals 99 706
2021 Tampa Bay Rays 75 857
2021 Texas Rangers 113 625
2021 Toronto Blue Jays 112 846
2021 Washington Nationals 158 724
2020 Arizona Diamondbacks 100 726
2020 Atlanta Braves 105 940
2020 Baltimore Orioles 86 740
2020 Boston Red Sox 138 788
2020 Chicago Cubs 113 716
2020 Chicago White Sox 119 826
2020 Cincinnati Reds 119 656
2020 Cleveland Indians 108 670
2020 Colorado Rockies 111 742
2020 Detroit Tigers 117 695
2020 Houston Astros 105 753
2020 Kansas City Royals 76 670
2020 Los Angeles Angels 132 794
2020 Los Angeles Dodgers 124 942
2020 Miami Marlins 100 710
2020 Milwaukee Brewers 143 667
2020 Minnesota Twins 103 726
2020 New York Mets 143 772
2020 New York Yankees 138 850
2020 Oakland Athletics 119 740
2020 Philadelphia Phillies 108 826
2020 Pittsburgh Pirates 92 591
2020 San Diego Padres 100 878
2020 Seattle Mariners 94 686
2020 San Francisco Giants 138 807
2020 St. Louis Cardinals 106 670
2020 Tampa Bay Rays 103 780
2020 Texas Rangers 89 605
2020 Toronto Blue Jays 105 815
2020 Washington Nationals 113 791
2019 Arizona Diamondbacks 120 813
2019 Atlanta Braves 104 855
2019 Baltimore Orioles 111 729
2019 Boston Red Sox 127 901
2019 Chicago Cubs 127 814
2019 Chicago White Sox 115 712
2019 Cincinnati Reds 111 701
2019 Cleveland Indians 110 769
2019 Colorado Rockies 111 835
2019 Detroit Tigers 109 586
2019 Houston Astros 146 920
2019 Kansas City Royals 113 691
2019 Los Angeles Angels 143 769
2019 Los Angeles Dodgers 100 886
2019 Miami Marlins 139 615
2019 Milwaukee Brewers 120 769
2019 Minnesota Twins 101 939
2019 New York Mets 129 791
2019 New York Yankees 113 943
2019 Oakland Athletics 140 845
2019 Philadelphia Phillies 97 774
2019 Pittsburgh Pirates 119 758
2019 San Diego Padres 120 682
2019 Seattle Mariners 83 758
2019 San Francisco Giants 111 678
2019 St. Louis Cardinals 110 764
2019 Tampa Bay Rays 114 769
2019 Texas Rangers 98 810
2019 Toronto Blue Jays 107 726
2019 Washington Nationals 117 873
2018 Arizona Diamondbacks 110 693
2018 Atlanta Braves 99 759
2018 Baltimore Orioles 132 622
2018 Boston Red Sox 130 876
2018 Chicago Cubs 106 756
2018 Chicago White Sox 99 656
2018 Cincinnati Reds 128 696
2018 Cleveland Indians 98 818
2018 Colorado Rockies 113 775
2018 Detroit Tigers 110 630
2018 Houston Astros 156 797
2018 Kansas City Royals 123 638
2018 Los Angeles Angels 111 721
2018 Los Angeles Dodgers 118 799
2018 Miami Marlins 120 593
2018 Milwaukee Brewers 127 749
2018 Minnesota Twins 89 738
2018 New York Mets 116 676
2018 New York Yankees 107 851
2018 Oakland Athletics 136 813
2018 Philadelphia Phillies 102 677
2018 Pittsburgh Pirates 122 696
2018 San Diego Padres 122 617
2018 Seattle Mariners 128 677
2018 San Francisco Giants 113 603
2018 St. Louis Cardinals 92 759
2018 Tampa Bay Rays 122 716
2018 Texas Rangers 104 737
2018 Toronto Blue Jays 118 709
2018 Washington Nationals 104 771
2017 Arizona Diamondbacks 106 812
2017 Atlanta Braves 137 732
2017 Baltimore Orioles 138 743
2017 Boston Red Sox 141 785
2017 Chicago Cubs 134 822
2017 Chicago White Sox 124 706
2017 Cincinnati Reds 116 753
2017 Cleveland Indians 125 818
2017 Colorado Rockies 143 824
2017 Detroit Tigers 128 735
2017 Houston Astros 139 896
2017 Kansas City Royals 160 702
2017 Los Angeles Angels 141 710
2017 Los Angeles Dodgers 119 770
2017 Miami Marlins 119 778
2017 Milwaukee Brewers 116 732
2017 Minnesota Twins 105 815
2017 New York Mets 118 735
2017 New York Yankees 119 858
2017 Oakland Athletics 129 739
2017 Philadelphia Phillies 128 690
2017 Pittsburgh Pirates 120 668
2017 San Diego Padres 99 604
2017 Seattle Mariners 131 750
2017 San Francisco Giants 136 639
2017 St. Louis Cardinals 139 761
2017 Tampa Bay Rays 115 694
2017 Texas Rangers 110 799
2017 Toronto Blue Jays 153 693
2017 Washington Nationals 116 819
2016 Arizona Diamondbacks 117 752
2016 Atlanta Braves 146 653
2016 Baltimore Orioles 119 744
2016 Boston Red Sox 137 878
2016 Chicago Cubs 107 808
2016 Chicago White Sox 122 686
2016 Cincinnati Reds 129 716
2016 Cleveland Indians 138 782
2016 Colorado Rockies 113 845
2016 Detroit Tigers 136 755
2016 Houston Astros 134 724
2016 Kansas City Royals 134 675
2016 Los Angeles Angels 147 717
2016 Los Angeles Dodgers 120 725
2016 Miami Marlins 141 659
2016 Milwaukee Brewers 131 671
2016 Minnesota Twins 96 722
2016 New York Mets 123 671
2016 New York Yankees 121 680
2016 Oakland Athletics 142 653
2016 Philadelphia Phillies 112 610
2016 Pittsburgh Pirates 133 729
2016 San Diego Padres 93 686
2016 Seattle Mariners 138 768
2016 San Francisco Giants 120 715
2016 St. Louis Cardinals 117 779
2016 Tampa Bay Rays 88 672
2016 Texas Rangers 114 765
2016 Toronto Blue Jays 153 759
2016 Washington Nationals 102 763
2015 Arizona Diamondbacks 134 720
2015 Atlanta Braves 148 573
2015 Baltimore Orioles 127 713
2015 Boston Red Sox 127 748
2015 Chicago Cubs 101 689
2015 Chicago White Sox 125 622
2015 Cincinnati Reds 112 640
2015 Cleveland Indians 135 673
2015 Colorado Rockies 114 737
2015 Detroit Tigers 153 693
2015 Houston Astros 102 729
2015 Kansas City Royals 133 724
2015 Los Angeles Angels of Anaheim 116 661
2015 Los Angeles Dodgers 135 667
2015 Miami Marlins 133 613
2015 Milwaukee Brewers 130 655
2015 Minnesota Twins 133 696
2015 New York Mets 130 683
2015 New York Yankees 105 764
2015 Oakland Athletics 124 694
2015 Philadelphia Phillies 119 626
2015 Pittsburgh Pirates 115 697
2015 San Diego Padres 108 650
2015 Seattle Mariners 123 656
2015 San Francisco Giants 142 696
2015 St. Louis Cardinals 128 647
2015 Tampa Bay Rays 121 644
2015 Texas Rangers 99 751
2015 Toronto Blue Jays 140 891
2015 Washington Nationals 129 703
2014 Arizona Diamondbacks 115 615
2014 Atlanta Braves 121 573
2014 Baltimore Orioles 112 705
2014 Boston Red Sox 138 634
2014 Chicago Cubs 94 614
2014 Chicago White Sox 127 660
2014 Cincinnati Reds 88 595
2014 Cleveland Indians 126 669
2014 Colorado Rockies 121 755
2014 Detroit Tigers 137 757
2014 Houston Astros 122 629
2014 Kansas City Royals 131 651
2014 Los Angeles Angels of Anaheim 112 773
2014 Los Angeles Dodgers 119 718
2014 Miami Marlins 143 645
2014 Milwaukee Brewers 135 650
2014 Minnesota Twins 97 715
2014 New York Mets 112 629
2014 New York Yankees 111 633
2014 Oakland Athletics 118 729
2014 Philadelphia Phillies 94 619
2014 Pittsburgh Pirates 127 682
2014 San Diego Padres 118 535
2014 Seattle Mariners 112 634
2014 San Francisco Giants 113 665
2014 St. Louis Cardinals 140 619
2014 Tampa Bay Rays 135 612
2014 Texas Rangers 148 637
2014 Toronto Blue Jays 128 723
2014 Washington Nationals 115 686
2013 Arizona Diamondbacks 160 685
2013 Atlanta Braves 119 688
2013 Baltimore Orioles 105 745
2013 Boston Red Sox 137 853
2013 Chicago Cubs 120 602
2013 Chicago White Sox 124 598
2013 Cincinnati Reds 129 698
2013 Cleveland Indians 106 745
2013 Colorado Rockies 111 706
2013 Detroit Tigers 146 796
2013 Houston Astros 110 610
2013 Kansas City Royals 131 648
2013 Los Angeles Angels of Anaheim 150 733
2013 Los Angeles Dodgers 130 649
2013 Miami Marlins 131 513
2013 Milwaukee Brewers 116 640
2013 Minnesota Twins 103 614
2013 New York Mets 106 619
2013 New York Yankees 121 650
2013 Oakland Athletics 108 767
2013 Philadelphia Phillies 131 610
2013 Pittsburgh Pirates 120 634
2013 San Diego Padres 99 618
2013 Seattle Mariners 122 624
2013 San Francisco Giants 131 629
2013 St. Louis Cardinals 154 783
2013 Tampa Bay Rays 139 696
2013 Texas Rangers 123 726
2013 Toronto Blue Jays 133 712
2013 Washington Nationals 115 656
2012 Arizona Diamondbacks 108 734
2012 Atlanta Braves 108 700
2012 Baltimore Orioles 152 712
2012 Boston Red Sox 105 734
2012 Chicago Cubs 125 613
2012 Chicago White Sox 112 748
2012 Cincinnati Reds 100 669
2012 Cleveland Indians 141 667
2012 Colorado Rockies 132 758
2012 Detroit Tigers 156 726
2012 Houston Astros 114 583
2012 Kansas City Royals 130 676
2012 Los Angeles Angels of Anaheim 138 767
2012 Los Angeles Dodgers 139 637
2012 Miami Marlins 114 609
2012 Milwaukee Brewers 111 776
2012 Minnesota Twins 149 701
2012 New York Mets 118 650
2012 New York Yankees 136 804
2012 Oakland Athletics 97 713
2012 Philadelphia Phillies 114 684
2012 Pittsburgh Pirates 98 651
2012 San Diego Padres 100 651
2012 Seattle Mariners 95 619
2012 San Francisco Giants 114 718
2012 St. Louis Cardinals 135 765
2012 Tampa Bay Rays 133 697
2012 Texas Rangers 121 808
2012 Toronto Blue Jays 109 716
2012 Washington Nationals 110 731
2011 Arizona Diamondbacks 82 731
2011 Atlanta Braves 113 641
2011 Baltimore Orioles 154 708
2011 Boston Red Sox 136 875
2011 Chicago Cubs 123 654
2011 Chicago White Sox 125 654
2011 Cincinnati Reds 98 735
2011 Cleveland Indians 111 704
2011 Colorado Rockies 112 735
2011 Detroit Tigers 142 787
2011 Florida Marlins 111 625
2011 Houston Astros 111 615
2011 Kansas City Royals 121 730
2011 Los Angeles Angels of Anaheim 126 667
2011 Los Angeles Dodgers 102 648
2011 Milwaukee Brewers 114 721
2011 Minnesota Twins 115 619
2011 New York Mets 112 718
2011 New York Yankees 146 867
2011 Oakland Athletics 119 645
2011 Philadelphia Phillies 108 713
2011 Pittsburgh Pirates 123 610
2011 San Diego Padres 105 593
2011 Seattle Mariners 82 556
2011 San Francisco Giants 117 570
2011 St. Louis Cardinals 169 762
2011 Tampa Bay Rays 101 707
2011 Texas Rangers 135 855
2011 Toronto Blue Jays 108 743
2011 Washington Nationals 104 628
2010 Arizona Diamondbacks 113 713
2010 Atlanta Braves 136 738
2010 Baltimore Orioles 154 613
2010 Boston Red Sox 130 818
2010 Chicago Cubs 124 685
2010 Chicago White Sox 148 752
2010 Cincinnati Reds 113 790
2010 Cleveland Indians 118 646
2010 Colorado Rockies 103 770
2010 Detroit Tigers 118 751
2010 Florida Marlins 107 719
2010 Houston Astros 130 611
2010 Kansas City Royals 152 676
2010 Los Angeles Angels of Anaheim 125 681
2010 Los Angeles Dodgers 123 667
2010 Milwaukee Brewers 115 750
2010 Minnesota Twins 159 781
2010 New York Mets 101 656
2010 New York Yankees 124 859
2010 Oakland Athletics 129 663
2010 Philadelphia Phillies 120 772
2010 Pittsburgh Pirates 119 587
2010 San Diego Padres 106 665
2010 Seattle Mariners 110 513
2010 San Francisco Giants 158 697
2010 St. Louis Cardinals 124 736
2010 Tampa Bay Rays 92 802
2010 Texas Rangers 129 787
2010 Toronto Blue Jays 114 755
2010 Washington Nationals 125 655