Homework #1: Moneyball

Overview

The use of historical statistics to predict future outcomes, particularly wins and losses, and to identify opportunities for improving team or individual performance has gained significant attention in professional sports. The aim of this analysis is to develop several models that predict a baseball team’s wins over a season from team statistics such as home runs, strikeouts, and base hits. We begin by examining the data for issues such as missing values and outliers and take the necessary measures to clean it. We then create and evaluate four linear models that forecast seasonal wins. The dataset includes both training and evaluation data: we train the models on the training data and then apply them to the evaluation data to assess their effectiveness. Finally, we choose the model that best balances accuracy and simplicity for predicting seasonal wins.

1. Data Exploration

The baseball training dataset contains 2,276 observations of 17 variables detailing various teams’ performances per year from 1871 to 2006. The description of the columns is shown below. Because the period is so long, we expect to see outliers and missing data: the league modified the official game rules over time, and teams and players undoubtedly changed their tactics in response. The number of single base hits is noticeably missing from the columns, but it can be derived by subtracting the other types of hits (doubles, triples, home runs) from total hits. Lastly, columns for game number (out of 162), inning number (1-9), and opponent would have been very useful for prediction.

1.1 Summary Statistics

The table below shows descriptive statistics for the training data. All variables are integers. Many of the variables, but not all, have a minimum of 0. For most variables the mean and median are relatively close, which suggests those variables are free from extreme outliers, since outliers tend to pull the mean away from the median. TEAM_PITCHING_H and TEAM_FIELDING_E are clear exceptions: their means (1779.2 and 246.5) sit well above their medians (1518 and 159).

One interesting piece of information is the range of the TARGET_WINS variable. The minimum is 0, meaning at least one team did not win a single game. The maximum is 146, so no team in the training dataset had a perfect season, given that a season consists of 162 games.

Also of note is the number of missing values in certain variables, most notably TEAM_BATTING_HBP (batters hit by pitch). With more than 91% of its values missing, we will remove this variable from our dataset because there is simply not enough information to impute sensible values. TEAM_BASERUN_CS (caught stealing) is missing 34% of its values, and we may consider removing it later. The missing data in these two columns may be due to changes in official rules or tactics before the modern era of baseball.

No	Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
1	INDEX [integer]	Mean (sd): 1268.5 (736.3); min ≤ med ≤ max: 1 ≤ 1270.5 ≤ 2535; Q1 - Q3: 630.5 - 1916	2276 distinct values	2276 (100.0%)	0 (0.0%)
2	TARGET_WINS [integer]	Mean (sd): 80.8 (15.8); min ≤ med ≤ max: 0 ≤ 82 ≤ 146; Q1 - Q3: 71 - 92	108 distinct values	2276 (100.0%)	0 (0.0%)
3	TEAM_BATTING_H [integer]	Mean (sd): 1469.3 (144.6); min ≤ med ≤ max: 891 ≤ 1454 ≤ 2554; Q1 - Q3: 1383 - 1537.5	569 distinct values	2276 (100.0%)	0 (0.0%)
4	TEAM_BATTING_2B [integer]	Mean (sd): 241.2 (46.8); min ≤ med ≤ max: 69 ≤ 238 ≤ 458; Q1 - Q3: 208 - 273	240 distinct values	2276 (100.0%)	0 (0.0%)
5	TEAM_BATTING_3B [integer]	Mean (sd): 55.2 (27.9); min ≤ med ≤ max: 0 ≤ 47 ≤ 223; Q1 - Q3: 34 - 72	144 distinct values	2276 (100.0%)	0 (0.0%)
6	TEAM_BATTING_HR [integer]	Mean (sd): 99.6 (60.5); min ≤ med ≤ max: 0 ≤ 102 ≤ 264; Q1 - Q3: 42 - 147	243 distinct values	2276 (100.0%)	0 (0.0%)
7	TEAM_BATTING_BB [integer]	Mean (sd): 501.6 (122.7); min ≤ med ≤ max: 0 ≤ 512 ≤ 878; Q1 - Q3: 451 - 580	533 distinct values	2276 (100.0%)	0 (0.0%)
8	TEAM_BATTING_SO [integer]	Mean (sd): 735.6 (248.5); min ≤ med ≤ max: 0 ≤ 750 ≤ 1399; Q1 - Q3: 548 - 930	822 distinct values	2174 (95.5%)	102 (4.5%)
9	TEAM_BASERUN_SB [integer]	Mean (sd): 124.8 (87.8); min ≤ med ≤ max: 0 ≤ 101 ≤ 697; Q1 - Q3: 66 - 156	348 distinct values	2145 (94.2%)	131 (5.8%)
10	TEAM_BASERUN_CS [integer]	Mean (sd): 52.8 (23); min ≤ med ≤ max: 0 ≤ 49 ≤ 201; Q1 - Q3: 38 - 62	128 distinct values	1504 (66.1%)	772 (33.9%)
11	TEAM_BATTING_HBP [integer]	Mean (sd): 59.4 (13); min ≤ med ≤ max: 29 ≤ 58 ≤ 95; Q1 - Q3: 50 - 67	55 distinct values	191 (8.4%)	2085 (91.6%)
12	TEAM_PITCHING_H [integer]	Mean (sd): 1779.2 (1406.8); min ≤ med ≤ max: 1137 ≤ 1518 ≤ 30132; Q1 - Q3: 1419 - 1683	843 distinct values	2276 (100.0%)	0 (0.0%)
13	TEAM_PITCHING_HR [integer]	Mean (sd): 105.7 (61.3); min ≤ med ≤ max: 0 ≤ 107 ≤ 343; Q1 - Q3: 50 - 150	256 distinct values	2276 (100.0%)	0 (0.0%)
14	TEAM_PITCHING_BB [integer]	Mean (sd): 553 (166.4); min ≤ med ≤ max: 0 ≤ 536.5 ≤ 3645; Q1 - Q3: 476 - 611	535 distinct values	2276 (100.0%)	0 (0.0%)
15	TEAM_PITCHING_SO [integer]	Mean (sd): 817.7 (553.1); min ≤ med ≤ max: 0 ≤ 813.5 ≤ 19278; Q1 - Q3: 615 - 968	823 distinct values	2174 (95.5%)	102 (4.5%)
16	TEAM_FIELDING_E [integer]	Mean (sd): 246.5 (227.8); min ≤ med ≤ max: 65 ≤ 159 ≤ 1898; Q1 - Q3: 127 - 249.5	549 distinct values	2276 (100.0%)	0 (0.0%)
17	TEAM_FIELDING_DP [integer]	Mean (sd): 146.4 (26.2); min ≤ med ≤ max: 52 ≤ 149 ≤ 228; Q1 - Q3: 131 - 164	144 distinct values	1990 (87.4%)	286 (12.6%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-02-24
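The decision about which columns to drop can be made explicit by turning the missing counts from the summary table above into shares. The sketch below is plain Python rather than the R code behind this report, and the 50% cut-off (DROP_THRESHOLD) is a hypothetical value chosen only for illustration.

```python
# Missing-value counts copied from the summary table above (2,276 rows).
N_ROWS = 2276
missing_counts = {
    "TEAM_BATTING_SO": 102,
    "TEAM_BASERUN_SB": 131,
    "TEAM_BASERUN_CS": 772,
    "TEAM_BATTING_HBP": 2085,
    "TEAM_PITCHING_SO": 102,
    "TEAM_FIELDING_DP": 286,
}

# Hypothetical cut-off: drop a column when more than half of it is missing.
DROP_THRESHOLD = 0.50


def missing_share(count, n_rows=N_ROWS):
    """Fraction of rows in which a variable is missing."""
    return count / n_rows


shares = {name: missing_share(c) for name, c in missing_counts.items()}
to_drop = [name for name, share in shares.items() if share > DROP_THRESHOLD]

# Print the variables from most to least missing, then the drop list.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {share:6.1%}")
print("drop:", to_drop)
```

Under this assumed threshold only TEAM_BATTING_HBP is dropped, while TEAM_BASERUN_CS (about 34% missing) remains a candidate for imputation.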

1.2 Distribution and Box Plots

Next, we visually check the distributions of the dependent and independent variables using density plots and box plots. The density plots show approximately normal distributions for most features, with the exception of extremely right-skewed features such as hits allowed (PITCHING_H) and errors (FIELDING_E). Home runs by batters (BATTING_HR) and strikeouts by batters (BATTING_SO) appear bimodal, which implies two distinct clusters within the baseball season data, with teams tending to score more in one of them.
The box plots further show that a high number of outliers lie outside the interquartile ranges, so their effects should be carefully considered, and we may need to deal with non-unimodal distributions.

Lastly, the function featurePlot() shows the relationship between each independent variable and the target variable TARGET_WINS. In general, while the graphs display some intriguing connections among the variables, they also expose noteworthy problems with the data. For example, the dataset contains a team that did not win any games, which appears improbable; checking data available on the web, we found that this actually happened twice, in 1872 and 1873. The pitching data also contains numerous 0’s: several teams have 0 strikeouts by their pitchers over a season, which is highly improbable. At the other extreme, one team is recorded with nearly 20,000 strikeouts. Outliers and 0’s are handled in later steps.

1.3 Correlation Matrix

Plotting the correlations between TARGET_WINS and the variables (excluding INDEX and TEAM_BATTING_HBP) we can see that very few variables are strongly correlated with the target variable. Columns with correlations close to zero are unlikely to offer significant insights into the factors that contribute to a team’s victories.

To avoid multicollinearity, we should not include features that are strongly correlated with each other. Comparing offensive stats (any column starting with BATTING or BASERUN) to defensive stats unexpectedly shows some correlation, pointing to potential problems. Qualitatively, the matrix implies that some teams or players are exceptional at both hitting (offensive) and fielding (defensive). Furthermore, a typical team’s number of batted home runs and allowed home runs has a correlation of nearly 1.0. This is an unexpected correlation, but it can be explained by noticing that most games are decided by a difference of one or two runs (whether the games are high scoring or not). Any final model should include only one of these two home run variables. By contrast, the correlation between a team’s hits (BATTING_H) and hits allowed (PITCHING_H) is around 0.3, which seems reasonable.

There are other strong correlations that are less obvious: errors (TEAM_FIELDING_E) are strongly negatively correlated with walks by batters (TEAM_BATTING_BB) and strikeouts by batters (TEAM_BATTING_SO). Taken together, teams that get a lot of hits do not generally make fielding errors.

Digging a little deeper, the Pearson correlation coefficient between errors and walks by batters is -0.6559708, which indicates a strong negative correlation between the two variables. Comparing errors with hits allowed by pitchers gives a correlation of 0.667759, which indicates a strong positive correlation.

Variable	Coefficient
TARGET_WINS	1.0000000
TEAM_BATTING_H	0.4699467
TEAM_BATTING_2B	0.3129840
TEAM_BATTING_3B	-0.1243459
TEAM_BATTING_HR	0.4224168
TEAM_BATTING_BB	0.4686879
TEAM_BATTING_SO	-0.2288927
TEAM_BASERUN_SB	0.0148364
TEAM_BASERUN_CS	-0.1787560
TEAM_BATTING_HBP	0.0735042
TEAM_PITCHING_H	0.4712343
TEAM_PITCHING_HR	0.4224668
TEAM_PITCHING_BB	0.4683988
TEAM_PITCHING_SO	-0.2293648
TEAM_FIELDING_E	-0.3866880
TEAM_FIELDING_DP	-0.1958660
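For readers who want to see how a Pearson coefficient such as those quoted above is obtained, here is a minimal sketch in plain Python (not the R code behind this report). The two short lists are made-up toy values, not rows from the training data, so the resulting coefficient is not one of the figures in the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy values (hypothetical): more fielding errors paired with fewer walks.
errors = [120, 150, 200, 260, 400]
walks = [600, 560, 520, 470, 380]

r = pearson(errors, walks)
print(f"r = {r:.3f}")  # negative: the two series move in opposite directions
```

A coefficient near -1 or 1 indicates a strong linear relationship, while values near 0, such as the one for TEAM_BASERUN_SB in the table above, indicate little linear association with the target.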

Lastly, let’s take a closer look at the missing data. We have already determined that the batters hit by pitch variable (TEAM_BATTING_HBP) is missing more than 91% of its values, so we will simply drop that column from further analysis. But what of the other variables?

Using the plot below we can visualize the missingness of the remaining variables. There are 5 variables that contain varying degrees of missing data. We will use this information to fill in the missing values in our data preparation step.

TEAM_BASERUN_CS has the second-highest number of missing values, but at 772 missing values out of 2,276 this is much less of a concern than the HBP variable we identified earlier. The remaining variables with missing data are missing less than 25% of their values, so they should be safe to impute.

2. Data Preparation

2.1 Missing Data

Notes / Questions

TEAM_BASERUN_CS is missing about 34% of its values. Is it an issue to fill that many values?

Added the na_flag to the data set.

Do we want to calculate singles?

Should we include hits as well as singles, doubles, triples, and home runs? This is the same data captured differently.

Slugging percentage is an interesting measure that we could add: https://en.wikipedia.org/wiki/Slugging_percentage (Wikipedia contributors, 2022)

Here are some advanced stats that might be interesting https://www.mlb.com/glossary/advanced-stats

It would be interesting to test Pythagorean Winning Percentage https://www.mlb.com/glossary/advanced-stats/pythagorean-winning-percentage

As discussed above, we drop the INDEX and TEAM_BATTING_HBP variables: TEAM_BATTING_HBP is missing more than 91% of its values, and INDEX is just an identifier.

We also derive a new column for single base hits (TEAM_BATTING_1B) by subtracting doubles, triples, and home runs from the total number of hits.

For further work with NA’s, we create a flag (na_flag) to indicate whether a value was missing.
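The singles derivation is simple arithmetic and can be sketched as follows. This is plain Python rather than the R code behind this report, and the season line in the example is illustrative (it reuses the medians of the hit columns from the summary table, not an actual team row).

```python
def singles(hits, doubles, triples, home_runs):
    """Single base hits = total hits minus all extra-base hits."""
    extra_base_hits = doubles + triples + home_runs
    if extra_base_hits > hits:
        # A row like this would indicate a data-quality problem.
        raise ValueError("extra-base hits exceed total hits")
    return hits - extra_base_hits


# Illustrative season line: 1454 hits, 238 doubles, 47 triples, 102 home runs.
print(singles(1454, 238, 47, 102))  # 1067
```

Because singles are an exact linear combination of the other hit columns, a model that includes total hits together with all four hit types would be perfectly collinear, so at least one of these columns has to be left out.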

2.2 Drop Outliers

No	Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
1	TARGET_WINS [integer]	Mean (sd): 80.8 (13.7); min ≤ med ≤ max: 36 ≤ 82 ≤ 124; Q1 - Q3: 72 - 91	80 distinct values	1951 (100.0%)	0 (0.0%)
2	TEAM_BATTING_H [integer]	Mean (sd): 1453 (111.1); min ≤ med ≤ max: 1137 ≤ 1446 ≤ 1876; Q1 - Q3: 1379 - 1523	480 distinct values	1951 (100.0%)	0 (0.0%)
3	TEAM_BATTING_2B [integer]	Mean (sd): 243 (44.9); min ≤ med ≤ max: 123 ≤ 241 ≤ 392; Q1 - Q3: 211 - 274	225 distinct values	1951 (100.0%)	0 (0.0%)
4	TEAM_BATTING_3B [integer]	Mean (sd): 50.3 (23.1); min ≤ med ≤ max: 11 ≤ 44 ≤ 138; Q1 - Q3: 33 - 64	117 distinct values	1951 (100.0%)	0 (0.0%)
5	TEAM_BATTING_HR [integer]	Mean (sd): 109.3 (57.6); min ≤ med ≤ max: 5 ≤ 113 ≤ 264; Q1 - Q3: 62 - 152	240 distinct values	1951 (100.0%)	0 (0.0%)
6	TEAM_BATTING_BB [integer]	Mean (sd): 523.5 (87.6); min ≤ med ≤ max: 273 ≤ 519 ≤ 824; Q1 - Q3: 465 - 583	405 distinct values	1951 (100.0%)	0 (0.0%)
7	TEAM_BATTING_SO [integer]	Mean (sd): 776.7 (222.2); min ≤ med ≤ max: 319 ≤ 805 ≤ 1399; Q1 - Q3: 585.5 - 952	743 distinct values	1852 (94.9%)	99 (5.1%)
8	TEAM_BASERUN_SB [integer]	Mean (sd): 109.3 (61.4); min ≤ med ≤ max: 18 ≤ 96 ≤ 367; Q1 - Q3: 64 - 141	271 distinct values	1951 (100.0%)	0 (0.0%)
9	TEAM_BASERUN_CS [integer]	Mean (sd): 52.9 (23); min ≤ med ≤ max: 11 ≤ 49 ≤ 201; Q1 - Q3: 38 - 62	126 distinct values	1455 (74.6%)	496 (25.4%)
10	TEAM_PITCHING_H [integer]	Mean (sd): 1516.9 (157.6); min ≤ med ≤ max: 1137 ≤ 1492 ≤ 2096; Q1 - Q3: 1405 - 1599	599 distinct values	1951 (100.0%)	0 (0.0%)
11	TEAM_PITCHING_HR [integer]	Mean (sd): 112.5 (58.2); min ≤ med ≤ max: 5 ≤ 115 ≤ 277; Q1 - Q3: 66 - 155	241 distinct values	1951 (100.0%)	0 (0.0%)
12	TEAM_PITCHING_BB [integer]	Mean (sd): 545.4 (93.6); min ≤ med ≤ max: 312 ≤ 536 ≤ 877; Q1 - Q3: 483 - 601	432 distinct values	1951 (100.0%)	0 (0.0%)
13	TEAM_PITCHING_SO [integer]	Mean (sd): 803 (218.3); min ≤ med ≤ max: 341 ≤ 819 ≤ 1659; Q1 - Q3: 629 - 961.5	743 distinct values	1852 (94.9%)	99 (5.1%)
14	TEAM_FIELDING_E [integer]	Mean (sd): 174.7 (77.7); min ≤ med ≤ max: 65 ≤ 149 ≤ 515; Q1 - Q3: 124 - 199	326 distinct values	1951 (100.0%)	0 (0.0%)
15	TEAM_FIELDING_DP [integer]	Mean (sd): 147.6 (25.1); min ≤ med ≤ max: 72 ≤ 149 ≤ 228; Q1 - Q3: 133 - 164	138 distinct values	1902 (97.5%)	49 (2.5%)
16	TEAM_BATTING_1B [integer]	Mean (sd): 1050.3 (92.2); min ≤ med ≤ max: 811 ≤ 1040 ≤ 1458; Q1 - Q3: 984 - 1106	406 distinct values	1951 (100.0%)	0 (0.0%)
17	na_flag [logical]	1. FALSE; 2. TRUE	1455 (74.6%); 496 (25.4%)	1951 (100.0%)	0 (0.0%)


To impute the missing values in the trainDf data, we use the mice library. MICE (Multivariate Imputation by Chained Equations) assumes that the values are missing at random, meaning that the missingness can be explained by variables that have complete information. As its name suggests, the algorithm performs several iterations over the data and generates values to fill in the missing entries. We check which imputation method is used for each column. Here it is pmm (predictive mean matching), which replaces each missing value with an observed value borrowed from a case whose model-predicted value is close to the prediction for the missing case, rather than with the column mean.

Let’s also compare the density plots before and after imputation to make sure the densities look similar. Unfortunately, for TEAM_BASERUN_SB, TEAM_BATTING_SO, and TEAM_FIELDING_DP they do not. In the case of TEAM_BATTING_SO the distribution becomes closer to normal, so the imputation may be beneficial. For TEAM_BASERUN_SB, TEAM_BASERUN_CS, and TEAM_FIELDING_DP we may need to consider alternative methods.
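To make the idea behind predictive mean matching concrete, here is a deliberately simplified sketch in plain Python. It is not the mice implementation: it uses a single predictor, one pass, and an ordinary least-squares line, whereas mice iterates over all incomplete columns and draws the regression parameters at random. The function name pmm_impute, the donor count k, and the toy stolen-base and caught-stealing values are all assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y ~ x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, k=3, seed=0):
    """Fill None entries of ys by borrowing observed values from the k
    complete cases whose predicted value is closest to the prediction
    for the incomplete case."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    # Each donor is stored as (predicted value, actually observed value).
    donors = [(intercept + slope * x, y) for x, y in observed]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        target = intercept + slope * x
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        filled.append(rng.choice(nearest)[1])  # borrow an observed value
    return filled


# Toy data (hypothetical): stolen bases are complete, caught stealing is not.
stolen_bases = [60, 80, 100, 120, 150, 200, 90, 170]
caught_stealing = [30, 40, None, 55, 70, 95, None, 80]
print(pmm_impute(stolen_bases, caught_stealing))
```

Because every imputed value is copied from a real observation, the filled-in values always stay within the observed range, which is one reason pmm tends to preserve the shape of a distribution better than mean imputation.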

2.3 Transform non-normal variables

We should also try to transform some variables so that they better fit a normal distribution, particularly TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR, but also TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_FIELDING_E. Taking the square root of these variables helps, as does reducing the skewness by trimming the low-density ranges. The plots below compare the original distributions of the non-normal variables with the transformed ones. This is just one way to handle data that does not follow a normal distribution. Another is the Box-Cox transformation, which we may try after fitting the model.

Our dataset now contains no missing values, and we have excluded the irrelevant INDEX variable and the TEAM_BATTING_HBP variable, which had more than 91% of its values missing. The table below confirms that no missing values remain and shows how the summary statistics have changed with the imputed data.
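The effect of a square-root transform on a right-skewed variable can be checked numerically with a moment-based skewness measure. The sketch below is plain Python, not the R code behind this report, and the values are a made-up right-skewed sample loosely inspired by the range of TEAM_FIELDING_E, not actual rows.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample with a long upper tail.
raw = [65, 110, 127, 140, 159, 180, 250, 400, 900, 1898]
transformed = [math.sqrt(v) for v in raw]

print(f"skewness before sqrt: {skewness(raw):.2f}")
print(f"skewness after sqrt:  {skewness(transformed):.2f}")
```

The square root compresses large values more than small ones, so the skewness drops but does not vanish. For heavily skewed variables a stronger transform such as the logarithm or Box-Cox may be needed.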

No	Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
1	TARGET_WINS [integer]	Mean (sd): 80.8 (13.7); min ≤ med ≤ max: 36 ≤ 82 ≤ 124; Q1 - Q3: 72 - 91	80 distinct values	1951 (100.0%)	0 (0.0%)
2	TEAM_BATTING_H [integer]	Mean (sd): 1453 (111.1); min ≤ med ≤ max: 1137 ≤ 1446 ≤ 1876; Q1 - Q3: 1379 - 1523	480 distinct values	1951 (100.0%)	0 (0.0%)
3	TEAM_BATTING_2B [integer]	Mean (sd): 243 (44.9); min ≤ med ≤ max: 123 ≤ 241 ≤ 392; Q1 - Q3: 211 - 274	225 distinct values	1951 (100.0%)	0 (0.0%)
4	TEAM_BATTING_3B [integer]	Mean (sd): 50.3 (23.1); min ≤ med ≤ max: 11 ≤ 44 ≤ 138; Q1 - Q3: 33 - 64	117 distinct values	1951 (100.0%)	0 (0.0%)
5	TEAM_BATTING_HR [integer]	Mean (sd): 109.3 (57.6); min ≤ med ≤ max: 5 ≤ 113 ≤ 264; Q1 - Q3: 62 - 152	240 distinct values	1951 (100.0%)	0 (0.0%)
6	TEAM_BATTING_BB [integer]	Mean (sd): 523.5 (87.6); min ≤ med ≤ max: 273 ≤ 519 ≤ 824; Q1 - Q3: 465 - 583	405 distinct values	1951 (100.0%)	0 (0.0%)
7	TEAM_BATTING_SO [integer]	Mean (sd): 770.2 (221.8); min ≤ med ≤ max: 319 ≤ 796 ≤ 1399; Q1 - Q3: 582 - 947	743 distinct values	1951 (100.0%)	0 (0.0%)
8	TEAM_BASERUN_SB [integer]	Mean (sd): 109.3 (61.4); min ≤ med ≤ max: 18 ≤ 96 ≤ 367; Q1 - Q3: 64 - 141	271 distinct values	1951 (100.0%)	0 (0.0%)
9	TEAM_BASERUN_CS [integer]	Mean (sd): 66.5 (39.9); min ≤ med ≤ max: 11 ≤ 54 ≤ 201; Q1 - Q3: 41 - 75	126 distinct values	1951 (100.0%)	0 (0.0%)
10	TEAM_PITCHING_H [integer]	Mean (sd): 1516.9 (157.6); min ≤ med ≤ max: 1137 ≤ 1492 ≤ 2096; Q1 - Q3: 1405 - 1599	599 distinct values	1951 (100.0%)	0 (0.0%)
11	TEAM_PITCHING_HR [integer]	Mean (sd): 112.5 (58.2); min ≤ med ≤ max: 5 ≤ 115 ≤ 277; Q1 - Q3: 66 - 155	241 distinct values	1951 (100.0%)	0 (0.0%)
12	TEAM_PITCHING_BB [integer]	Mean (sd): 545.4 (93.6); min ≤ med ≤ max: 312 ≤ 536 ≤ 877; Q1 - Q3: 483 - 601	432 distinct values	1951 (100.0%)	0 (0.0%)
13	TEAM_PITCHING_SO [integer]	Mean (sd): 797.1 (218.1); min ≤ med ≤ max: 341 ≤ 811 ≤ 1659; Q1 - Q3: 623 - 955	743 distinct values	1951 (100.0%)	0 (0.0%)
14	TEAM_FIELDING_E [integer]	Mean (sd): 174.7 (77.7); min ≤ med ≤ max: 65 ≤ 149 ≤ 515; Q1 - Q3: 124 - 199	326 distinct values	1951 (100.0%)	0 (0.0%)
15	TEAM_FIELDING_DP [integer]	Mean (sd): 146.9 (25.4); min ≤ med ≤ max: 72 ≤ 149 ≤ 228; Q1 - Q3: 131 - 164	138 distinct values	1951 (100.0%)	0 (0.0%)
16	TEAM_BATTING_1B [integer]	Mean (sd): 1050.3 (92.2); min ≤ med ≤ max: 811 ≤ 1040 ≤ 1458; Q1 - Q3: 984 - 1106	406 distinct values	1951 (100.0%)	0 (0.0%)
17	na_flag [logical]	1. FALSE; 2. TRUE	1455 (74.6%); 496 (25.4%)	1951 (100.0%)	0 (0.0%)
18	batting_hr_sqrt [numeric]	Mean (sd): 10 (3.1); min ≤ med ≤ max: 2.2 ≤ 10.6 ≤ 16.2; Q1 - Q3: 7.9 - 12.3	240 distinct values	1951 (100.0%)	0 (0.0%)
19	batting_so_sqrt [numeric]	Mean (sd): 27.4 (4.1); min ≤ med ≤ max: 17.9 ≤ 28.2 ≤ 37.4; Q1 - Q3: 24.1 - 30.8	743 distinct values	1951 (100.0%)	0 (0.0%)
20	baserun_cs_sqrt [numeric]	Mean (sd): 7.9 (2.2); min ≤ med ≤ max: 3.3 ≤ 7.3 ≤ 14.2; Q1 - Q3: 6.4 - 8.7	126 distinct values	1951 (100.0%)	0 (0.0%)
21	pitching_hr_sqrt [numeric]	Mean (sd): 10.2 (3); min ≤ med ≤ max: 2.2 ≤ 10.7 ≤ 16.6; Q1 - Q3: 8.1 - 12.4	241 distinct values	1951 (100.0%)	0 (0.0%)


3. Building models

Notes / Questions

We could try building a model that does not include any of the outlier values.

There is also a sabermetrics measure, Pythagorean Winning Percentage, that is supposed to predict wins: https://www.mlb.com/glossary/advanced-stats/pythagorean-winning-percentage

With a thorough understanding of our dataset and the data cleaning complete, we can begin constructing our multiple linear regression models. We will build four separate linear models and compare their performance.

3.1 Model 1: Baseline

For the first model, we will select all the variables from the original un-cleaned dataset. We may use this model as a base model to compare with.

3.2 Model 2: Removed N/A Values

The second model is based on the cleaned dataset, without missing values or non-normal distributions. We chose the variables TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_BATTING_2B, TEAM_BATTING_3B, and TEAM_FIELDING_E based on intuition and our understanding of the data.

3.3 Model 3: Removed Outliers

The third model uses the cleaned dataset with outliers omitted. To identify outliers, we used historical data from Lahman’s Baseball Database. In baseball’s early history, few games were played in a season (fewer than 20); thus, the calculation used to normalize the statistics to a 162-game season tends to create outliers. We used the minimum and maximum values from the modern era (post-1900) as a filter to alleviate the issues created by outlier values.

The initial version of this model includes all variables from the cleaned data set with outliers omitted. Stepwise variable selection based on the AIC score is then used to filter the model features. The resulting model has significant p-values for the model as a whole and for all predictor variables.
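The sketch below illustrates why scaling a short season to 162 games inflates statistics, and how a min/max filter removes such rows. It is plain Python rather than the R code behind this report. The 19-game season with 350 hits allowed is hypothetical, and the bounds are taken from the minimum and maximum of the filtered summary table in Section 2.2 purely for illustration; they are not necessarily the exact Lahman-derived limits used in the analysis.

```python
SEASON_GAMES = 162


def scale_to_full_season(stat, games_played):
    """Linearly extrapolate a season total to a 162-game schedule."""
    return stat * SEASON_GAMES / games_played


def within_bounds(row, bounds):
    """True when every bounded variable lies inside its [low, high] range."""
    return all(low <= row[name] <= high for name, (low, high) in bounds.items())


# Illustrative bounds (assumption), read off the filtered summary table.
bounds = {
    "TEAM_PITCHING_H": (1137, 2096),
    "TEAM_FIELDING_E": (65, 515),
}

# Hypothetical early-era team: 350 hits allowed in only 19 games.
inflated_hits = round(scale_to_full_season(350, 19))

rows = [
    {"TEAM_PITCHING_H": 1518, "TEAM_FIELDING_E": 159},           # typical row
    {"TEAM_PITCHING_H": inflated_hits, "TEAM_FIELDING_E": 1898},  # early-era row
]
kept = [row for row in rows if within_bounds(row, bounds)]
print(inflated_hits, len(kept))
```

Any small-sample noise in a 19-game season is multiplied by roughly 8.5 when extrapolated, which is how values far beyond anything seen in a real 162-game season can appear.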

	TARGET_WINS
Predictors	Estimates	CI	p	df
(Intercept)	56.66	43.45 – 69.87	<0.0001	1443.00
TEAM BATTING 2B	-0.03	-0.05 – -0.02	<0.0001	1443.00
TEAM BATTING 3B	0.21	0.16 – 0.25	<0.0001	1443.00
TEAM BATTING BB	0.09	0.07 – 0.11	<0.0001	1443.00
TEAM BATTING SO	-0.02	-0.03 – -0.02	<0.0001	1443.00
TEAM BASERUN SB	0.03	0.02 – 0.05	0.0001	1443.00
TEAM BASERUN CS	0.05	0.01 – 0.09	0.0057	1443.00
TEAM PITCHING HR	0.13	0.11 – 0.15	<0.0001	1443.00
TEAM PITCHING BB	-0.05	-0.06 – -0.03	<0.0001	1443.00
TEAM FIELDING E	-0.15	-0.17 – -0.13	<0.0001	1443.00
TEAM FIELDING DP	-0.12	-0.14 – -0.09	<0.0001	1443.00
TEAM BATTING 1B	0.04	0.03 – 0.04	<0.0001	1443.00
Observations	1455
R² / R² adjusted	0.439 / 0.435
AIC	10708.883

The linear modeling assumptions are evaluated using diagnostic plots, the Breusch–Pagan test for heteroscedasticity, and the Variance Inflation Factor (VIF) to assess collinearity.

An initial review of the diagnostic plots for this model shows some deviation from the linear modeling assumptions at the boundaries of the prediction range, but the points in the Q-Q plot fall almost on a straight line, and the residuals are flat and exhibit homoscedasticity in the 65 - 95 fitted value range.

The Breusch–Pagan test for heteroscedasticity assumes the following null and alternative hypotheses.

H0 - Residuals are distributed with equal variance (i.e., homoscedasticity)
H1 - Residuals are distributed with unequal variance (i.e., heteroscedasticity)

For this model iteration, with a p-value of 0.001485, we reject the null hypothesis and conclude that this model violates the homoscedasticity assumption.

studentized Breusch-Pagan test

data: lm3step
BP = 30.179, df = 11, p-value = 0.001485
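For intuition about what the studentized Breusch–Pagan statistic measures, here is a single-predictor sketch in plain Python (not the R code behind this report). It regresses the squared residuals on the predictor and uses n times the \(R^2\) of that auxiliary regression as the statistic, compared against a chi-square distribution with 1 degree of freedom. The two toy series are synthetic and deterministic; they are not the model 3 residuals.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y ~ x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys):
    """R^2 of the simple regression y ~ x."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def breusch_pagan_1d(xs, ys):
    """Studentized BP statistic and p-value for one predictor (df = 1)."""
    a, b = fit_line(xs, ys)
    squared_resid = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
    # Guard against a tiny negative R^2 caused by floating-point rounding.
    lm = max(0.0, len(xs) * r_squared(xs, squared_resid))
    # Chi-square survival function with 1 df: P(X > lm) = erfc(sqrt(lm / 2)).
    return lm, math.erfc(math.sqrt(lm / 2))


xs = list(range(1, 41))
sign = [(-1) ** x for x in xs]
constant_spread = [2 * x + 5 * s for x, s in zip(xs, sign)]       # equal variance
growing_spread = [2 * x + 0.5 * x * s for x, s in zip(xs, sign)]  # variance grows with x

print(breusch_pagan_1d(xs, constant_spread))
print(breusch_pagan_1d(xs, growing_spread))
```

With more than one predictor the auxiliary regression uses all of them and the degrees of freedom equal the number of predictors, which is why the output above reports df = 11.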

The Variance Inflation Factor (VIF) calculation detects collinearity between the TEAM_PITCHING_BB and TEAM_BATTING_BB variables. Both variables have a VIF score over 8, and the correlation between them is 0.93.

For the refined model we drop TEAM_PITCHING_BB and TEAM_BATTING_BB. The p-value for TEAM_BASERUN_CS is not significant, so it is dropped from the model as well. This model has a lower adjusted \(R^2\) of 0.35 but should align better with the linear regression assumptions.
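The link between the 0.93 correlation and the VIF scores can be made explicit. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch below, in plain Python rather than the R code behind this report, considers only the pairwise case, where R² is simply the squared correlation; this is a simplifying assumption, since the reported scores also reflect the other predictors in the model.

```python
def vif_from_r2(r2):
    """Variance inflation factor for a predictor whose regression on the
    other predictors has coefficient of determination r2."""
    if not 0 <= r2 < 1:
        raise ValueError("r2 must be in [0, 1)")
    return 1 / (1 - r2)


# Pairwise case: with a single other predictor, R^2 is the squared correlation.
pairwise_corr = 0.93
print(round(vif_from_r2(pairwise_corr ** 2), 2))  # 7.4
```

A pairwise correlation of 0.93 alone already implies a VIF of about 7.4, so scores above 8 are plausible once the remaining predictors explain a little more of each variable.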

	TARGET_WINS
Predictors	Estimates	CI	p	df
(Intercept)	76.17	64.38 – 87.96	<0.0001	1795.00
TEAM BATTING 3B	0.20	0.16 – 0.23	<0.0001	1795.00
TEAM BATTING SO	-0.02	-0.03 – -0.02	<0.0001	1795.00
TEAM BASERUN SB	0.08	0.06 – 0.09	<0.0001	1795.00
TEAM PITCHING HR	0.14	0.12 – 0.15	<0.0001	1795.00
TEAM FIELDING E	-0.12	-0.13 – -0.10	<0.0001	1795.00
TEAM FIELDING DP	-0.10	-0.13 – -0.08	<0.0001	1795.00
TEAM BATTING 1B	0.02	0.02 – 0.03	<0.0001	1795.00
Observations	1803
R² / R² adjusted	0.354 / 0.352
AIC	13642.626

The diagnostic plots for the new model appear more closely aligned with the linear regression assumptions. However, the Breusch–Pagan test again leads us to reject the null hypothesis and conclude that this model violates the homoscedasticity assumption. Finally, the Variance Inflation Factor (VIF) calculation does not detect collinearity among the remaining variables. Overall, we reject model 3 because the residuals are not homoscedastic.

studentized Breusch-Pagan test

data: lm3step.final
BP = 40.152, df = 7, p-value = 1.177e-06

	model 3			model 3 (StepAIC)			model 3 (Final)
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	56.77	43.51 – 70.02	<0.0001	56.66	43.45 – 69.87	<0.0001	76.17	64.38 – 87.96	<0.0001
TEAM BATTING 2B	-0.04	-0.12 – 0.03	0.2397	-0.03	-0.05 – -0.02	<0.0001
TEAM BATTING 3B	0.20	0.11 – 0.28	<0.0001	0.21	0.16 – 0.25	<0.0001	0.20	0.16 – 0.23	<0.0001
TEAM BATTING HR	-0.02	-0.31 – 0.27	0.8959
TEAM BATTING BB	0.09	-0.07 – 0.25	0.2878	0.09	0.07 – 0.11	<0.0001
TEAM BATTING SO	-0.01	-0.08 – 0.07	0.8913	-0.02	-0.03 – -0.02	<0.0001	-0.02	-0.03 – -0.02	<0.0001
TEAM BASERUN SB	0.04	0.02 – 0.05	0.0001	0.03	0.02 – 0.05	0.0001	0.08	0.06 – 0.09	<0.0001
TEAM BASERUN CS	0.05	0.01 – 0.09	0.0058	0.05	0.01 – 0.09	0.0057
TEAM PITCHING H	0.01	-0.06 – 0.08	0.8072
TEAM PITCHING HR	0.14	-0.12 – 0.40	0.2870	0.13	0.11 – 0.15	<0.0001	0.14	0.12 – 0.15	<0.0001
TEAM PITCHING BB	-0.05	-0.20 – 0.11	0.5598	-0.05	-0.06 – -0.03	<0.0001
TEAM PITCHING SO	-0.01	-0.09 – 0.06	0.6954
TEAM FIELDING E	-0.15	-0.17 – -0.13	<0.0001	-0.15	-0.17 – -0.13	<0.0001	-0.12	-0.13 – -0.10	<0.0001
TEAM FIELDING DP	-0.12	-0.14 – -0.09	<0.0001	-0.12	-0.14 – -0.09	<0.0001	-0.10	-0.13 – -0.08	<0.0001
TEAM BATTING 1B	0.03	-0.05 – 0.10	0.4665	0.04	0.03 – 0.04	<0.0001	0.02	0.02 – 0.03	<0.0001
Observations	1455			1455			1803
R² / R² adjusted	0.439 / 0.434			0.439 / 0.435			0.354 / 0.352
AIC	10714.711			10708.883			13642.626
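As a sanity check on the comparison table, the adjusted \(R^2\) values can be recomputed from the reported \(R^2\), the number of observations, and the number of predictors (excluding the intercept). The sketch is plain Python, not the R code behind this report. Because the inputs are the rounded \(R^2\) figures from the table, agreement is only expected to within rounding.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# (R^2, observations, predictors, reported adjusted R^2) from the table above.
models = {
    "model 3": (0.439, 1455, 14, 0.434),
    "model 3 (StepAIC)": (0.439, 1455, 11, 0.435),
    "model 3 (Final)": (0.354, 1803, 7, 0.352),
}

for name, (r2, n_obs, n_pred, reported) in models.items():
    recomputed = adjusted_r2(r2, n_obs, n_pred)
    print(f"{name:18s} recomputed {recomputed:.4f}  reported {reported:.3f}")
```

The penalty term grows with the number of predictors, which is why the stepwise model has a slightly higher adjusted \(R^2\) than the full model despite an identical rounded \(R^2\).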

3.4 Model 4: Variable Transformation

For the fourth model, we fit the imputed data together with the engineered features (i.e., the square-root transforms) and pick the best model.

Below are the \(R^2\), residual standard error, and adjusted \(R^2\) of each model. Surprisingly, the model fit to the uncleaned, non-imputed raw training data had the best fit statistics.

Money Ball Dataset
R²	Residual std. error	R² adjusted
0.55	8.49	0.51
0.22	12.10	0.22
0.44	9.55	0.43

4. Selecting Models

Next, we apply the models to the evaluation dataset to make predictions. To ensure that models 2-4 work properly, we fill in the missing values in the evaluation data set using the same imputation method, mice. For model 4 we also apply the necessary transformations.

The table below contains the predicted TARGET_WINS for each model. On initial examination, it is noticeable that the first model generates some negative predictions. This is not a realistic outcome, as it is impossible to have a negative number of wins, so this model is not particularly valuable. The second and third models, which use cleaned and imputed data, do not produce large negative predictions. In general, the AIC-generated model and the second model yield comparable results.

Money Ball Dataset
lm1	lm2	aic	aic4
49.79	72.82	62.31	64.67
60.34	73.94	68.84	71.42
65.66	72.98	73.96	76.51
79.89	78.48	81.81	84.14
-3970.00	47.55	2.09	136.88
-1633.38	50.35	12.09	79.35
25.95	61.91	51.31	70.22
-134.29	61.55	57.11	69.57
26.47	79.90	70.21	77.57
33.99	73.04	69.26	71.90
42.44	72.21	60.17	66.32
68.81	85.46	82.52	86.88
94.21	89.44	85.99	90.18
91.46	78.53	82.77	89.20
85.54	75.27	89.98	90.72
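The plausibility check described above can be automated by counting predictions that fall outside the feasible range of 0 to 162 wins. The sketch below is plain Python, not the R code behind this report, and uses only the 15 rows shown in the table above.

```python
# Predictions copied from the 15 rows of the table above.
predictions = {
    "lm1": [49.79, 60.34, 65.66, 79.89, -3970.00, -1633.38, 25.95, -134.29,
            26.47, 33.99, 42.44, 68.81, 94.21, 91.46, 85.54],
    "lm2": [72.82, 73.94, 72.98, 78.48, 47.55, 50.35, 61.91, 61.55,
            79.90, 73.04, 72.21, 85.46, 89.44, 78.53, 75.27],
    "aic": [62.31, 68.84, 73.96, 81.81, 2.09, 12.09, 51.31, 57.11,
            70.21, 69.26, 60.17, 82.52, 85.99, 82.77, 89.98],
    "aic4": [64.67, 71.42, 76.51, 84.14, 136.88, 79.35, 70.22, 69.57,
             77.57, 71.90, 66.32, 86.88, 90.18, 89.20, 90.72],
}

MIN_WINS, MAX_WINS = 0, 162  # a season consists of 162 games


def count_infeasible(values, low=MIN_WINS, high=MAX_WINS):
    """Number of predictions outside the feasible win range."""
    return sum(1 for v in values if not low <= v <= high)


infeasible = {model: count_infeasible(vals) for model, vals in predictions.items()}
print(infeasible)
```

In these rows only lm1 produces infeasible values. Predictions such as 2.09 (aic) or 136.88 (aic4) pass this range check but are still far from typical win totals, so a range check is a necessary rather than a sufficient test.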

Plotting the predictions also shows that there is not much obvious difference between the models, aside from the clearly outrageous outliers generated by the first model and a slightly tighter range of wins for the fourth model.

We can use the graphs below to check the validity of our models. All models suffer from a lack of linearity, which indicates that linear regression may not be the best technique for predicting values from this data with the given variables. The models that include the most variables (model 1, model 3, model 4) also suffer from collinearity issues.

Model 1

Model 2

Model 3

Model 4

5. Conclusion

Notes / Questions

This is actually a better source to do the research :) https://www.seanlahman.com/baseball-archive/statistics/

Overall, none of the models that we were able to generate instills much confidence in its ability to predict. The model with the best fit according to the \(R^2\) statistic was built on data full of missing values, which led to clearly incorrect negative predictions.

The second and third models both had significantly lower \(R^2\) scores, which indicates a poor fit overall. In addition, none of the models performed well when checked for linearity or homogeneity of variance. While the second model did not suffer from collinearity issues, the other two models did.

Appendix A: Lahman’s Baseball Database

Despite significant efforts to compensate for poor data quality, the resulting models are poor predictors of win totals. Moreover, the poor data quality is inconsistent with the overall state of baseball statistics. When it comes to the major sports, baseball has the most mature statistics available. Therefore finding better data is the best course of action for developing a better predictive model.

We were able to locate a cleaner version of the data set provided in class. Lahman’s Baseball Database includes the same variables as the sample database, with fewer errors and with additional reference data that would allow us to connect the database to other sources.

https://www.seanlahman.com/baseball-archive/statistics/

A significant advantage of Lahman’s data set over the data set provided in class is that it includes information about the year and the team. This data is valuable when considering how baseball has changed over the years. The modern era in baseball is often delineated by the turn of the century. However, when looking at the past 120 years of baseball history, it is easy to pinpoint rule changes, evolutions in playing strategy, and league structure that have fundamentally impacted the game.

When comparing statistics across time, it is common to use many of the breakdowns below to add context to the analysis:

The Dead Ball Era (1901 - 1920)
World War 2 (1941 - 1945)
Segregation Era (1901 - 1947ish)
Post-War Era/Yankees Era (1945 - late 50s/early 60s)
Westward Expansion (1953 - 1961)
Dead Ball 2 (The Sixties, roughly)
Designated Hitter Era (1973 - current, AL only)
Free Agency/Arbitration Era (1975 - current)
Steroid Era (unknown, but late 80s - 2005 seems likely)
Wild Card Era (1994 - current)

Surveying these periods would suggest that a more granular model has the potential to perform better.

Although we could have chosen any number of the time periods above, exploring the statistical outliers highlights that many of these values correspond to the pre-1969 period. This delineation has some historical support. As Jayson Stark of ESPN argues in this article (https://www.espn.com/mlb/columns/story?columnist=stark_jayson&id=2471349) In 1969 the MLB underwent several rule changes and changes to the league structure that impacted win totals and team statistics. 1969 was the first year of division play and the expanded postseason. The Pitcher’s Mound was lowered five inches. The Strike zone shrinks. Five-person rotations kicking in. The save was invented. And more expansion to the unbalanced schedules.

Thus using 1900 as the beginning of the modern era and 1969 as an additional breakpoint, the dataset can be divided into three segments. The density profiles for the predictor variables approach a normal distribution when grouped by the three segments we identified. To support data exploration, we added a era_cat field to the data set.
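The three-way split can be sketched as a small function of the season year. This is plain Python rather than the R code behind this report, and the category labels are hypothetical, since the actual values stored in era_cat are not shown here.

```python
def era_category(year):
    """Assign a season to one of three eras using the 1900 and 1969 breakpoints."""
    if year < 1900:
        return "pre-1900"      # before the modern era
    if year < 1969:
        return "1900-1968"     # modern era before division play
    return "1969-present"      # division play and later rule changes


for season in (1871, 1900, 1968, 1969, 2021):
    print(season, era_category(season))
```

Grouping by such a field lets the density plots and, potentially, separate models be produced per era instead of pooling seasons played under very different rules.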

In general, Lahman’s data set contains fewer data gaps, and the variables are more consistently distributed. Some values are still missing: the Caught Stealing (CS/TEAM_BASERUN_CS) variable is missing 27.9% of its values, the Batters Hit by Pitch (HBP/TEAM_BATTING_HBP) variable is missing 38.8%, and the Sacrifice Flies (SF) variable is missing 51.6%.

Most of the variables in the data set show some level of skewness. The following variables have a kurtosis measure greater than 3: TEAM_BATTING_H, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_H, and TEAM_FIELDING_E.
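For readers who want to reproduce the skewness and kurtosis checks, the following is a minimal base-R sketch. It assumes the moment-based definitions, with non-excess kurtosis so that a normal distribution scores 3; the report does not state which definition or package was used, and the function names are illustrative.

```r
# Moment-based skewness: 0 for a symmetric distribution.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: 3 for a normal distribution.
kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Simulated examples (hypothetical data, not the baseball set).
set.seed(1)
symmetric    <- rnorm(5000)
right_skewed <- rexp(5000)
round(c(skew = skewness(symmetric), kurt = kurtosis(symmetric)), 2)
round(c(skew = skewness(right_skewed), kurt = kurtosis(right_skewed)), 2)
```

Applied column by column (for example with `sapply()` over the numeric predictors), these functions give a quick screen for variables whose tails are heavier than a normal distribution's.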

The Lahman data set contains several variables with bimodal distributions, including TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_SO.

No	Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
1	yearID [numeric]	Mean (sd): 1958.9 (43); min ≤ med ≤ max: 1871 ≤ 1967 ≤ 2021; Q1 - Q3: 1922 - 1997	151 distinct values	2985 (100.0%)	0 (0.0%)
2	lgID [character]	1. AA; 2. AL; 3. FL; 4. NL; 5. PL; 6. UA	85 (2.9%); 1295 (44.1%); 16 (0.5%); 1519 (51.8%); 8 (0.3%); 12 (0.4%)	2935 (98.3%)	50 (1.7%)
3	teamID [character]	1. CHN; 2. PHI; 3. PIT; 4. CIN; 5. SLN; 6. BOS; 7. CHA; 8. CLE; 9. DET; 10. NYA; [ 139 others ]	146 (4.9%); 139 (4.7%); 135 (4.5%); 132 (4.4%); 130 (4.4%); 121 (4.1%); 121 (4.1%); 121 (4.1%); 121 (4.1%); 119 (4.0%); 1700 (57.0%)	2985 (100.0%)	0 (0.0%)
4	franchID [character]	1. ATL; 2. CHC; 3. CIN; 4. PIT; 5. STL; 6. PHI; 7. SFG; 8. LAD; 9. BAL; 10. BOS; [ 110 others ]	146 (4.9%); 146 (4.9%); 140 (4.7%); 140 (4.7%); 140 (4.7%); 139 (4.7%); 139 (4.7%); 138 (4.6%); 121 (4.1%); 121 (4.1%); 1615 (54.1%)	2985 (100.0%)	0 (0.0%)
5	divID [character]	1. (Empty string); 2. C; 3. E; 4. W	1517 (50.8%); 295 (9.9%); 598 (20.0%); 575 (19.3%)	2985 (100.0%)	0 (0.0%)
6	Rank [numeric]	Mean (sd): 4 (2.3); min ≤ med ≤ max: 1 ≤ 4 ≤ 13; Q1 - Q3: 2 - 6	13 distinct values	2985 (100.0%)	0 (0.0%)
7	G [numeric]	Mean (sd): 150 (24.4); min ≤ med ≤ max: 6 ≤ 159 ≤ 165; Q1 - Q3: 154 - 162	123 distinct values	2985 (100.0%)	0 (0.0%)
8	W [numeric]	Mean (sd): 74.6 (18); min ≤ med ≤ max: 0 ≤ 77 ≤ 116; Q1 - Q3: 66 - 87	113 distinct values	2985 (100.0%)	0 (0.0%)
9	L [numeric]	Mean (sd): 74.6 (17.8); min ≤ med ≤ max: 4 ≤ 76 ≤ 134; Q1 - Q3: 65 - 87	114 distinct values	2985 (100.0%)	0 (0.0%)
10	R [numeric]	Mean (sd): 681 (139.5); min ≤ med ≤ max: 24 ≤ 691 ≤ 1220; Q1 - Q3: 614 - 764	640 distinct values	2985 (100.0%)	0 (0.0%)
11	AB [numeric]	Mean (sd): 5129 (798.2); min ≤ med ≤ max: 211 ≤ 5402 ≤ 5781; Q1 - Q3: 5135 - 5519	1137 distinct values	2985 (100.0%)	0 (0.0%)
12	H [numeric]	Mean (sd): 1339.4 (230.9); min ≤ med ≤ max: 33 ≤ 1390 ≤ 1783; Q1 - Q3: 1299 - 1465	758 distinct values	2985 (100.0%)	0 (0.0%)
13	X2B [numeric]	Mean (sd): 228.7 (59.8); min ≤ med ≤ max: 1 ≤ 234 ≤ 376; Q1 - Q3: 194 - 272	317 distinct values	2985 (100.0%)	0 (0.0%)
14	X3B [numeric]	Mean (sd): 45.7 (22.5); min ≤ med ≤ max: 0 ≤ 40 ≤ 150; Q1 - Q3: 29 - 59	126 distinct values	2985 (100.0%)	0 (0.0%)
15	HR [numeric]	Mean (sd): 105.9 (64); min ≤ med ≤ max: 0 ≤ 110 ≤ 307; Q1 - Q3: 45 - 155	260 distinct values	2985 (100.0%)	0 (0.0%)
16	BB [numeric]	Mean (sd): 473.6 (132.3); min ≤ med ≤ max: 1 ≤ 494 ≤ 835; Q1 - Q3: 425.5 - 554.5	586 distinct values	2984 (100.0%)	1 (0.0%)
17	SO [numeric]	Mean (sd): 762.1 (319.3); min ≤ med ≤ max: 3 ≤ 761 ≤ 1596; Q1 - Q3: 516 - 990	1117 distinct values	2969 (99.5%)	16 (0.5%)
18	SB [numeric]	Mean (sd): 109.4 (69.7); min ≤ med ≤ max: 1 ≤ 93 ≤ 581; Q1 - Q3: 62 - 137	324 distinct values	2859 (95.8%)	126 (4.2%)
19	CS [numeric]	Mean (sd): 46.5 (21.9); min ≤ med ≤ max: 3 ≤ 44 ≤ 191; Q1 - Q3: 33 - 56	137 distinct values	2153 (72.1%)	832 (27.9%)
20	HBP [numeric]	Mean (sd): 45.8 (18.1); min ≤ med ≤ max: 7 ≤ 43 ≤ 160; Q1 - Q3: 32 - 57	101 distinct values	1827 (61.2%)	1158 (38.8%)
21	SF [numeric]	Mean (sd): 44.1 (10.2); min ≤ med ≤ max: 7 ≤ 44 ≤ 77; Q1 - Q3: 38 - 50	66 distinct values	1444 (48.4%)	1541 (51.6%)
22	RA [numeric]	Mean (sd): 681 (139.2); min ≤ med ≤ max: 34 ≤ 689 ≤ 1252; Q1 - Q3: 610 - 766	623 distinct values	2985 (100.0%)	0 (0.0%)
23	ER [numeric]	Mean (sd): 573.4 (149.9); min ≤ med ≤ max: 23 ≤ 594 ≤ 1023; Q1 - Q3: 503 - 671	656 distinct values	2985 (100.0%)	0 (0.0%)
24	ERA [numeric]	Mean (sd): 3.8 (0.8); min ≤ med ≤ max: 1 ≤ 4 ≤ 8; Q1 - Q3: 3 - 4	1: 1 (0.0%); 2: 146 (4.9%); 3: 777 (26.0%); 4: 1511 (50.6%); 5: 502 (16.8%); 6: 46 (1.5%); 7: 1 (0.0%); 8: 1 (0.0%)	2985 (100.0%)	0 (0.0%)
25	CG [numeric]	Mean (sd): 47.5 (39.3); min ≤ med ≤ max: 0 ≤ 41 ≤ 148; Q1 - Q3: 9 - 76	147 distinct values	2985 (100.0%)	0 (0.0%)
26	SHO [numeric]	Mean (sd): 9.6 (5.1); min ≤ med ≤ max: 0 ≤ 9 ≤ 32; Q1 - Q3: 6 - 12	32 distinct values	2985 (100.0%)	0 (0.0%)
27	SV [numeric]	Mean (sd): 24.4 (16.3); min ≤ med ≤ max: 0 ≤ 25 ≤ 68; Q1 - Q3: 10 - 39	66 distinct values	2985 (100.0%)	0 (0.0%)
28	IPouts [numeric]	Mean (sd): 4013.2 (663.3); min ≤ med ≤ max: 162 ≤ 4252 ≤ 4518; Q1 - Q3: 4080 - 4341	862 distinct values	2985 (100.0%)	0 (0.0%)
29	HA [numeric]	Mean (sd): 1339.2 (231); min ≤ med ≤ max: 49 ≤ 1389 ≤ 1993; Q1 - Q3: 1287 - 1468	770 distinct values	2985 (100.0%)	0 (0.0%)
30	HRA [numeric]	Mean (sd): 105.9 (60.9); min ≤ med ≤ max: 0 ≤ 113 ≤ 305; Q1 - Q3: 51 - 153	247 distinct values	2985 (100.0%)	0 (0.0%)
31	BBA [numeric]	Mean (sd): 473.7 (131.7); min ≤ med ≤ max: 1 ≤ 495 ≤ 827; Q1 - Q3: 429 - 554	577 distinct values	2985 (100.0%)	0 (0.0%)
32	SOA [numeric]	Mean (sd): 761.6 (320.5); min ≤ med ≤ max: 0 ≤ 762 ≤ 1687; Q1 - Q3: 511 - 997	1148 distinct values	2985 (100.0%)	0 (0.0%)
33	E [numeric]	Mean (sd): 180.8 (108.4); min ≤ med ≤ max: 20 ≤ 141 ≤ 639; Q1 - Q3: 111 - 207	474 distinct values	2985 (100.0%)	0 (0.0%)
34	DP [numeric]	Mean (sd): 132.6 (35.9); min ≤ med ≤ max: 0 ≤ 140 ≤ 217; Q1 - Q3: 116 - 157	199 distinct values	2985 (100.0%)	0 (0.0%)
35	FP [numeric]	1 distinct value	1: 2985 (100.0%)	2985 (100.0%)	0 (0.0%)
36	name [character]	1. Cincinnati Reds; 2. Pittsburgh Pirates; 3. Philadelphia Phillies; 4. St. Louis Cardinals; 5. Chicago White Sox; 6. Detroit Tigers; 7. Chicago Cubs; 8. Boston Red Sox; 9. New York Yankees; 10. Cleveland Indians; [ 129 others ]	131 (4.4%); 131 (4.4%); 130 (4.4%); 122 (4.1%); 121 (4.1%); 121 (4.1%); 119 (4.0%); 114 (3.8%); 109 (3.7%); 107 (3.6%); 1780 (59.6%)	2985 (100.0%)	0 (0.0%)
37	gFactor [numeric]	Mean (sd): 1.2 (1); min ≤ med ≤ max: 1 ≤ 1 ≤ 27; Q1 - Q3: 1 - 1	14 distinct values	2985 (100.0%)	0 (0.0%)
38	TARGET_WINS [numeric]	Mean (sd): 80.7 (15.4); min ≤ med ≤ max: 0 ≤ 82 ≤ 146; Q1 - Q3: 71 - 91	111 distinct values	2985 (100.0%)	0 (0.0%)
39	TEAM_BATTING_H [numeric]	Mean (sd): 1459.4 (140.4); min ≤ med ≤ max: 819 ≤ 1445 ≤ 2562; Q1 - Q3: 1374 - 1526	604 distinct values	2985 (100.0%)	0 (0.0%)
40	TEAM_BATTING_2B [numeric]	Mean (sd): 246.9 (47); min ≤ med ≤ max: 27 ≤ 247 ≤ 458; Q1 - Q3: 214 - 280	248 distinct values	2985 (100.0%)	0 (0.0%)
41	TEAM_BATTING_3B [numeric]	Mean (sd): 51.2 (27.7); min ≤ med ≤ max: 0 ≤ 42 ≤ 223; Q1 - Q3: 31 - 66	149 distinct values	2985 (100.0%)	0 (0.0%)
42	TEAM_BATTING_HR [numeric]	Mean (sd): 110.7 (63.7); min ≤ med ≤ max: 0 ≤ 115 ≤ 319; Q1 - Q3: 52 - 159	263 distinct values	2985 (100.0%)	0 (0.0%)
43	TEAM_BATTING_BB [numeric]	Mean (sd): 503.7 (115.4); min ≤ med ≤ max: 12 ≤ 512 ≤ 878; Q1 - Q3: 452 - 573	563 distinct values	2984 (100.0%)	1 (0.0%)
44	TEAM_BATTING_SO [numeric]	Mean (sd): 808.7 (296.7); min ≤ med ≤ max: 44 ≤ 812 ≤ 1642; Q1 - Q3: 578 - 1019	1077 distinct values	2969 (99.5%)	16 (0.5%)
45	TEAM_BASERUN_SB [numeric]	Mean (sd): 119.5 (83); min ≤ med ≤ max: 4 ≤ 97 ≤ 697; Q1 - Q3: 66 - 145	362 distinct values	2859 (95.8%)	126 (4.2%)
46	TEAM_BASERUN_CS [numeric]	Mean (sd): 49 (22.6); min ≤ med ≤ max: 8 ≤ 45 ≤ 201; Q1 - Q3: 34 - 58	136 distinct values	2153 (72.1%)	832 (27.9%)
47	TEAM_BATTING_HBP [numeric]	Mean (sd): 49 (19.9); min ≤ med ≤ max: 9 ≤ 47 ≤ 174; Q1 - Q3: 34 - 61	109 distinct values	1827 (61.2%)	1158 (38.8%)
48	TEAM_PITCHING_H [numeric]	Mean (sd): 1463 (156.7); min ≤ med ≤ max: 662 ≤ 1447 ≤ 3888; Q1 - Q3: 1367 - 1533	625 distinct values	2985 (100.0%)	0 (0.0%)
49	TEAM_PITCHING_HR [numeric]	Mean (sd): 110.8 (60.2); min ≤ med ≤ max: 0 ≤ 118 ≤ 305; Q1 - Q3: 56 - 158	250 distinct values	2985 (100.0%)	0 (0.0%)
50	TEAM_PITCHING_BB [numeric]	Mean (sd): 504.2 (114.8); min ≤ med ≤ max: 22 ≤ 515 ≤ 881; Q1 - Q3: 458 - 573	545 distinct values	2985 (100.0%)	0 (0.0%)
51	TEAM_PITCHING_SO [numeric]	Mean (sd): 807.7 (299.7); min ≤ med ≤ max: 0 ≤ 806 ≤ 1876; Q1 - Q3: 575 - 1016	1103 distinct values	2985 (100.0%)	0 (0.0%)
52	TEAM_FIELDING_E [numeric]	Mean (sd): 225.5 (222.1); min ≤ med ≤ max: 54 ≤ 145 ≤ 1998; Q1 - Q3: 115 - 226	596 distinct values	2985 (100.0%)	0 (0.0%)
53	TEAM_FIELDING_DP [numeric]	Mean (sd): 141.9 (27.6); min ≤ med ≤ max: 0 ≤ 145 ≤ 228; Q1 - Q3: 126 - 160	163 distinct values	2985 (100.0%)	0 (0.0%)
54	pythPercent [numeric]	Mean (sd): 0.5 (0.1); min ≤ med ≤ max: 0 ≤ 0.5 ≤ 0.9; Q1 - Q3: 0.4 - 0.6	2922 distinct values	2985 (100.0%)	0 (0.0%)
55	era_cat [character]	1. 1900-; 2. 1900-1969; 3. 1969+	375 (12.6%); 1142 (38.3%); 1468 (49.2%)	2985 (100.0%)	0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-02-24

When we group the statistics by the three categories presented earlier, we see a much cleaner density plot across all variables. There are few signs of bimodal distributions, and the skewness of individual variables is greatly reduced.

The training data set exhibits the following characteristics.

No	Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
1	yearID [numeric]	Mean (sd): 1958.9 (43.2); min ≤ med ≤ max: 1871 ≤ 1967 ≤ 2021; Q1 - Q3: 1922 - 1997	151 distinct values	2389 (100.0%)	0 (0.0%)
2	era_cat [character]	1. 1900-; 2. 1900-1969; 3. 1969+	302 (12.6%); 915 (38.3%); 1172 (49.1%)	2389 (100.0%)	0 (0.0%)
3	TARGET_WINS [numeric]	Mean (sd): 80.7 (15.5); min ≤ med ≤ max: 0 ≤ 82 ≤ 146; Q1 - Q3: 71 - 91	107 distinct values	2389 (100.0%)	0 (0.0%)
4	TEAM_BATTING_H [numeric]	Mean (sd): 1458.3 (140.7); min ≤ med ≤ max: 819 ≤ 1445 ≤ 2562; Q1 - Q3: 1374 - 1524	564 distinct values	2389 (100.0%)	0 (0.0%)
5	TEAM_BATTING_2B [numeric]	Mean (sd): 246.8 (47.2); min ≤ med ≤ max: 27 ≤ 248 ≤ 403; Q1 - Q3: 214 - 280	242 distinct values	2389 (100.0%)	0 (0.0%)
6	TEAM_BATTING_3B [numeric]	Mean (sd): 51.2 (28.1); min ≤ med ≤ max: 0 ≤ 42 ≤ 223; Q1 - Q3: 31 - 66	146 distinct values	2389 (100.0%)	0 (0.0%)
7	TEAM_BATTING_HR [numeric]	Mean (sd): 110.9 (63.9); min ≤ med ≤ max: 0 ≤ 116 ≤ 319; Q1 - Q3: 52 - 159	259 distinct values	2389 (100.0%)	0 (0.0%)
8	TEAM_BATTING_BB [numeric]	Mean (sd): 503 (116.9); min ≤ med ≤ max: 12 ≤ 513 ≤ 878; Q1 - Q3: 451 - 573	528 distinct values	2388 (100.0%)	1 (0.0%)
9	TEAM_BATTING_SO [numeric]	Mean (sd): 811 (298); min ≤ med ≤ max: 44 ≤ 814 ≤ 1642; Q1 - Q3: 579 - 1023	990 distinct values	2374 (99.4%)	15 (0.6%)
10	TEAM_BASERUN_SB [numeric]	Mean (sd): 119.4 (83); min ≤ med ≤ max: 4 ≤ 97 ≤ 654; Q1 - Q3: 66 - 144	340 distinct values	2287 (95.7%)	102 (4.3%)
11	TEAM_BASERUN_CS [numeric]	Mean (sd): 48.8 (22); min ≤ med ≤ max: 8 ≤ 45 ≤ 200; Q1 - Q3: 34 - 58	128 distinct values	1721 (72.0%)	668 (28.0%)
12	TEAM_BATTING_HBP [numeric]	Mean (sd): 49.1 (20); min ≤ med ≤ max: 9 ≤ 48 ≤ 174; Q1 - Q3: 34 - 61	105 distinct values	1453 (60.8%)	936 (39.2%)
13	TEAM_PITCHING_H [numeric]	Mean (sd): 1462.5 (160.9); min ≤ med ≤ max: 662 ≤ 1447 ≤ 3888; Q1 - Q3: 1366 - 1532	591 distinct values	2389 (100.0%)	0 (0.0%)
14	TEAM_PITCHING_HR [numeric]	Mean (sd): 110.9 (60.3); min ≤ med ≤ max: 0 ≤ 118 ≤ 305; Q1 - Q3: 56 - 159	248 distinct values	2389 (100.0%)	0 (0.0%)
15	TEAM_PITCHING_BB [numeric]	Mean (sd): 502.6 (114.8); min ≤ med ≤ max: 26 ≤ 512 ≤ 881; Q1 - Q3: 455 - 572	510 distinct values	2389 (100.0%)	0 (0.0%)
16	TEAM_PITCHING_SO [numeric]	Mean (sd): 809 (303); min ≤ med ≤ max: 0 ≤ 807 ≤ 1876; Q1 - Q3: 574 - 1017	1041 distinct values	2389 (100.0%)	0 (0.0%)
17	TEAM_FIELDING_E [numeric]	Mean (sd): 226.5 (225.1); min ≤ med ≤ max: 54 ≤ 144 ≤ 1998; Q1 - Q3: 114 - 228	543 distinct values	2389 (100.0%)	0 (0.0%)
18	TEAM_FIELDING_DP [numeric]	Mean (sd): 141.7 (27.7); min ≤ med ≤ max: 0 ≤ 145 ≤ 228; Q1 - Q3: 126 - 160	159 distinct values	2389 (100.0%)	0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-02-24
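The row counts in the two summaries (2,389 training rows out of 2,985) are consistent with roughly an 80/20 split between training and testing data. The report does not say how the split was made, so the following base-R sketch is only one plausible way to do it; the helper name `split_train_test`, the seed, and the toy data are all hypothetical.

```r
# Hypothetical random train/test split; the 0.8 share is an assumption
# inferred from the row counts, not a documented setting.
split_train_test <- function(df, train_share = 0.8, seed = 42) {
  set.seed(seed)
  all_idx   <- seq_len(nrow(df))
  train_idx <- sample(all_idx, size = round(nrow(df) * train_share))
  test_idx  <- setdiff(all_idx, train_idx)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

# Toy data standing in for the team-season table.
toy   <- data.frame(id = 1:100, wins = round(rnorm(100, mean = 81, sd = 12)))
parts <- split_train_test(toy)
c(train = nrow(parts$train), test = nrow(parts$test))
```

Fixing the seed keeps the split reproducible, which matters here because the reported model metrics depend on which seasons land in the testing set.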

DATA PREPARATION

Segmenting the data set with the era_cat variable creates 3 categories within which the variables approach a normal distribution, so we focus our data preparation step on missing values. The documentation for the MICE package recommends observing a 5% threshold for safely imputing missing values. Based on this rule of thumb, we drop the ‘TEAM_BASERUN_CS’ (28.0% missing) and ‘TEAM_BATTING_HBP’ (39.2% missing) variables.

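The remaining variables with missing data (‘TEAM_BATTING_BB’, ‘TEAM_BATTING_SO’ and ‘TEAM_BASERUN_SB’) are imputed using the MICE package. Several methods can be used, but for simplicity we kept the package defaults: predictive mean matching (pmm) with m = 5 imputations. The method assigned to each variable is listed below; an empty string means the variable was not imputed.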

          yearID          era_cat      TARGET_WINS   TEAM_BATTING_H
              ""               ""               ""               ""
 TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB
              ""               ""               ""            "pmm"
 TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_H TEAM_PITCHING_HR
           "pmm"            "pmm"               ""               ""
TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP
              ""               ""               ""               ""
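The drop-or-impute decision can be sketched as follows. This is an illustration under stated assumptions, not the report's code: the helper `missing_pct` and the toy data are hypothetical, and the 5% cut-off is the rule of thumb quoted in the text. The `mice()` call relies on the package defaults (pmm for numeric columns) and runs only if the mice package is installed.

```r
# Share of missing values per column, as a percentage.
missing_pct <- function(df) {
  100 * colMeans(is.na(df))
}

# Toy data (hypothetical values) with one lightly and one heavily missing column.
set.seed(7)
toy <- data.frame(
  TEAM_BATTING_H  = round(rnorm(200, 1450, 140)),
  TEAM_BASERUN_SB = round(rnorm(200, 120, 40)),
  TEAM_BASERUN_CS = round(rnorm(200, 50, 20))
)
toy$TEAM_BASERUN_SB[1:8]  <- NA   # 4% missing: below the threshold, so impute
toy$TEAM_BASERUN_CS[1:60] <- NA   # 30% missing: above the threshold, so drop

pct  <- missing_pct(toy)
kept <- toy[, pct <= 5, drop = FALSE]
round(pct, 1)

# Impute the remaining gaps with mice defaults, if the package is available.
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(kept, m = 5, seed = 123, printFlag = FALSE)
  completed <- mice::complete(imp, 1)   # first of the five imputed data sets
  print(colSums(is.na(completed)))
}
```

Using only the first completed data set is a simplification; pooling estimates across all five imputations is the more rigorous option.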

Build Model

The first model includes all the variables from the original data set with the exception of yearID. It could be argued that year might have an impact on the numbers; however, we account for year with the era_cat variable. The initial model has an \(AdjR^2 = 0.8127\). The model selected by stepAIC drops TEAM_BATTING_3B and has the same \(AdjR^2 = 0.8127\).

Call:
lm(formula = TARGET_WINS ~ . - yearID, data = trainingTeam_df, by = era_cat)

Residuals:
    Min      1Q  Median      3Q     Max
-41.931  -4.084   0.041   4.117  69.904

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      73.928822   3.451424  21.420  < 2e-16 ***
era_cat1900-1969  2.334654   0.889034   2.626  0.00869 **
era_cat1969+      1.709035   1.004416   1.702  0.08898 .
TEAM_BATTING_H    0.055323   0.001982  27.919  < 2e-16 ***
TEAM_BATTING_2B   0.010082   0.004848   2.080  0.03768 *
TEAM_BATTING_3B   0.010279   0.009500   1.082  0.27934
TEAM_BATTING_HR   0.104601   0.005724  18.272  < 2e-16 ***
TEAM_BATTING_BB   0.047301   0.001851  25.552  < 2e-16 ***
TEAM_BATTING_SO  -0.011012   0.001446  -7.615 3.78e-14 ***
TEAM_BASERUN_SB   0.023154   0.002559   9.049  < 2e-16 ***
TEAM_PITCHING_H  -0.056791   0.001451 -39.132  < 2e-16 ***
TEAM_PITCHING_HR -0.091167   0.006374 -14.302  < 2e-16 ***
TEAM_PITCHING_BB -0.057994   0.001858 -31.213  < 2e-16 ***
TEAM_PITCHING_SO  0.009377   0.001394   6.727 2.17e-11 ***
TEAM_FIELDING_E  -0.002432   0.001788  -1.360  0.17396
TEAM_FIELDING_DP  0.051356   0.007505   6.843 9.81e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.718 on 2373 degrees of freedom
Multiple R-squared: 0.8139, Adjusted R-squared: 0.8127
F-statistic: 691.7 on 15 and 2373 DF, p-value: < 2.2e-16

Call:
lm(formula = TARGET_WINS ~ era_cat + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = trainingTeam_df, by = era_cat)

Residuals:
    Min      1Q  Median      3Q     Max
-41.265  -4.045   0.045   4.136  69.568

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      73.492304   3.427890  21.440  < 2e-16 ***
era_cat1900-1969  2.251512   0.885740   2.542   0.0111 *
era_cat1969+      1.438109   0.972743   1.478   0.1394
TEAM_BATTING_H    0.056184   0.001815  30.959  < 2e-16 ***
TEAM_BATTING_2B   0.009789   0.004841   2.022   0.0433 *
TEAM_BATTING_HR   0.103579   0.005646  18.345  < 2e-16 ***
TEAM_BATTING_BB   0.047327   0.001851  25.567  < 2e-16 ***
TEAM_BATTING_SO  -0.010805   0.001433  -7.538 6.75e-14 ***
TEAM_BASERUN_SB   0.023901   0.002464   9.701  < 2e-16 ***
TEAM_PITCHING_H  -0.056764   0.001451 -39.118  < 2e-16 ***
TEAM_PITCHING_HR -0.092139   0.006311 -14.600  < 2e-16 ***
TEAM_PITCHING_BB -0.057846   0.001853 -31.216  < 2e-16 ***
TEAM_PITCHING_SO  0.009269   0.001391   6.666 3.26e-11 ***
TEAM_FIELDING_E  -0.002638   0.001778  -1.484   0.1380
TEAM_FIELDING_DP  0.050746   0.007484   6.781 1.50e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.718 on 2374 degrees of freedom
Multiple R-squared: 0.8138, Adjusted R-squared: 0.8127
F-statistic: 740.9 on 14 and 2374 DF, p-value: < 2.2e-16

The TEAM_BATTING_3B variable was dropped because of its high p-value (0.279), giving us the final model with an \(AdjR^2=0.8127\).

Call:
lm(formula = TARGET_WINS ~ era_cat + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = trainingTeam_df, by = era_cat)

Residuals:
    Min      1Q  Median      3Q     Max
-41.265  -4.045   0.045   4.136  69.568

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      73.492304   3.427890  21.440  < 2e-16 ***
era_cat1900-1969  2.251512   0.885740   2.542   0.0111 *
era_cat1969+      1.438109   0.972743   1.478   0.1394
TEAM_BATTING_H    0.056184   0.001815  30.959  < 2e-16 ***
TEAM_BATTING_2B   0.009789   0.004841   2.022   0.0433 *
TEAM_BATTING_HR   0.103579   0.005646  18.345  < 2e-16 ***
TEAM_BATTING_BB   0.047327   0.001851  25.567  < 2e-16 ***
TEAM_BATTING_SO  -0.010805   0.001433  -7.538 6.75e-14 ***
TEAM_BASERUN_SB   0.023901   0.002464   9.701  < 2e-16 ***
TEAM_PITCHING_H  -0.056764   0.001451 -39.118  < 2e-16 ***
TEAM_PITCHING_HR -0.092139   0.006311 -14.600  < 2e-16 ***
TEAM_PITCHING_BB -0.057846   0.001853 -31.216  < 2e-16 ***
TEAM_PITCHING_SO  0.009269   0.001391   6.666 3.26e-11 ***
TEAM_FIELDING_E  -0.002638   0.001778  -1.484   0.1380
TEAM_FIELDING_DP  0.050746   0.007484   6.781 1.50e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.718 on 2374 degrees of freedom
Multiple R-squared: 0.8138, Adjusted R-squared: 0.8127
F-statistic: 740.9 on 14 and 2374 DF, p-value: < 2.2e-16
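The fit-then-select workflow behind these summaries can be sketched on simulated data. Everything in the data frame below is hypothetical, including the `noise_var` column and the coefficients used to generate wins. The sketch also leaves out the `by = era_cat` argument that appears in the printed calls, on the assumption that it has no effect: `lm()` does not document such an argument, and era enters the model through the era_cat dummy coefficients.

```r
library(MASS)  # provides stepAIC(); MASS ships with standard R distributions

# Simulated stand-in for the training data (all values hypothetical).
set.seed(2023)
n <- 300
sim <- data.frame(
  era_cat         = factor(sample(c("1900-", "1900-1969", "1969+"), n, replace = TRUE)),
  TEAM_BATTING_H  = rnorm(n, 1450, 140),
  TEAM_BATTING_HR = rnorm(n, 110, 60),
  TEAM_PITCHING_H = rnorm(n, 1460, 150),
  noise_var       = rnorm(n)   # unrelated predictor that selection may drop
)
sim$TARGET_WINS <- 80 +
  0.05 * (sim$TEAM_BATTING_H - 1450) +
  0.10 * (sim$TEAM_BATTING_HR - 110) -
  0.05 * (sim$TEAM_PITCHING_H - 1460) +
  rnorm(n, sd = 6)

# Full model, then AIC-based stepwise selection starting from it.
full_fit <- lm(TARGET_WINS ~ ., data = sim)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

summary(full_fit)$adj.r.squared
summary(step_fit)$adj.r.squared
formula(step_fit)
```

Because selection starts from the full model and only accepts moves that lower AIC, the selected model is never worse than the full model on that criterion, which is why the two adjusted \(R^2\) values tend to be nearly identical.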

There are a number of issues with the model diagnostics. The linearity and QQ plots show deviation from the straight line at the boundaries. In addition, the reference line for the homogeneity of variance graph is not flat, which indicates that the residual variance is not constant across the fitted values. Further exploration could be conducted to see whether smaller date ranges and prediction ranges generate a model that better aligns with the linear regression assumptions.

PREDICT

The next step is to use the selected model to predict the testing data set. The Yardstick package was used to calculate model performance, including the \(R^2\). With an \(R^2=0.7940\), the performance on the testing data set is consistent with the model summary. In the Residual vs. TARGET_WINS graph, there appears to be a wins range between 60 and 110 that generates more accurate forecasts. However, it should be noted that the residuals appear to drift downwards: TARGET_WINS values less than the mean are overestimated, and TARGET_WINS values greater than the mean are underestimated.

          yearID          era_cat      TARGET_WINS   TEAM_BATTING_H
              ""               ""               ""               ""
 TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB
              ""               ""               ""               ""
 TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_H TEAM_PITCHING_HR
           "pmm"            "pmm"               ""               ""
TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP
              ""               ""               ""               ""

	.metric	.estimator	.estimate
1	mape	standard	7.0172
2	smape	standard	6.8136
3	mase	standard	0.3122
4	mpe	standard	-1.0803
5	rmse	standard	6.7128
6	rsq	standard	0.7940
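To make the reported metrics easier to interpret, here is a minimal base-R sketch of most of them using their standard definitions, which to our knowledge match what Yardstick computes (in particular, rsq as the squared correlation between actual and predicted values). The function name `regression_metrics` and the example vectors are hypothetical, and mase is left out because it needs a naive baseline forecast.

```r
# Base-R versions of the reported metrics
# (truth = actual wins, estimate = predicted wins).
regression_metrics <- function(truth, estimate) {
  err <- truth - estimate
  c(
    mape  = 100 * mean(abs(err / truth)),
    smape = 100 * mean(abs(err) / ((abs(truth) + abs(estimate)) / 2)),
    mpe   = 100 * mean(err / truth),
    rmse  = sqrt(mean(err^2)),
    rsq   = cor(truth, estimate)^2   # squared correlation
  )
}

# Hypothetical actual and predicted win totals.
truth    <- c(70, 80, 90, 100)
estimate <- c(75, 78, 88, 96)
round(regression_metrics(truth, estimate), 3)

# A single observation with zero actual wins makes the percentage errors infinite.
regression_metrics(c(0, 80, 90), c(5, 78, 88))[c("mape", "mpe")]
```

A negative mpe, as in the table above, means the predictions are on average higher than the actual win totals. The percentage-based metrics divide by the actual value, so rmse and smape are the safer summaries when very small win totals are present.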

Appendix B: Pythagorean Model

Bill James developed the Pythagorean winning percentage. The concept strives to calculate the number of games a team should win based on the total offense and the number of runs allowed. Since this model includes total runs scored vs. total runs allowed the expectation is that this model will be a good predictor of a team’s wins in a given season.

\(\text{Win Percentage} = \frac{(\text{Runs Scored})^2}{(\text{Runs Scored})^2 + (\text{Runs Allowed})^2}\)

As we expected, the Pythagorean model performs well, with \(Adj R^2\) = 0.912. However, this linear model differs slightly from the formal definition since it includes an intercept of 5.02 and a coefficient for the Pythagorean factor of 151.70. The formal definition of the model would have an intercept of 0 and a coefficient of 162, the number of games in a full season.

Call:
lm(formula = TARGET_WINS ~ pythPercent, data = trainingTeam_df)

Residuals:
    Min      1Q  Median      3Q     Max
-38.657  -3.006   0.090   3.103  18.395

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.016      0.490   10.24   <2e-16 ***
pythPercent  151.702      0.964  157.36   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.609 on 2387 degrees of freedom
Multiple R-squared: 0.9121, Adjusted R-squared: 0.912
F-statistic: 2.476e+04 on 1 and 2387 DF, p-value: < 2.2e-16
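A short worked example may help compare the formal definition with the fitted line. The sketch below is illustrative only: the function name `pyth_percent` and the team with 800 runs scored and 700 allowed are hypothetical, while the intercept and slope are taken from the coefficient table above.

```r
# Pythagorean winning percentage from runs scored and runs allowed.
pyth_percent <- function(runs_scored, runs_allowed, exponent = 2) {
  runs_scored^exponent / (runs_scored^exponent + runs_allowed^exponent)
}

# Hypothetical team: 800 runs scored, 700 runs allowed.
p <- pyth_percent(800, 700)

formal_wins <- 162 * p               # formal definition over a 162-game season
fitted_wins <- 5.016 + 151.702 * p   # coefficients from the summary above

round(c(pythPercent = p, formal = formal_wins, fitted = fitted_wins), 2)
```

For a team near .500 the two versions nearly agree (5.016 + 151.702 × 0.5 ≈ 80.9 versus 81), so the fitted intercept and flatter slope mainly pull extreme teams toward the middle.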

The diagnostic plots show residuals that are normally distributed with a linear relationship to the fitted values. The homogeneity of variance is curved, indicating that the variance of residuals is not consistent.

The next step is to use the Pythagorean Model to predict the testing data set. The Yardstick package was used to calculate model performance, including the \(R^2\). With an \(R^2=0.9078\), the performance on the testing data set is consistent with the model summary. In the Residual vs. TARGET_WINS graph, there appears to be a wins range between 50 and 130 that generates more accurate forecasts.

	.metric	.estimator	.estimate
1	mape	standard	Inf
2	smape	standard	4.7570
3	mase	standard	0.2086
4	mpe	standard	-Inf
5	rmse	standard	4.4718
6	rsq	standard	0.9078

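Unsurprisingly, the Pythagorean Model performs well when it comes to win projections. There is a clear relationship between the total yearly run differential and the number of games won. It is a simple but efficient model. The infinite mape and mpe values most likely reflect a team with zero wins in the data being scored, since both metrics divide by the actual win total.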


Homework #1: Moneyball

Critical Thinking Group 1: Ben Inbar, Cliff Lee, Daria Dubovskaia, David Simbandumwe, Jeff Parks, Nick Oliver
