Overview

Using historical statistics to predict future outcomes such as wins and losses, and to identify opportunities for improving performance, has gained significant attention in professional sports.

This analysis aims to develop several models to predict a baseball team’s wins over a season based on team stats such as home runs, strikeouts, base hits, and more.

Our approach:

  1. We will begin by examining the data for missing values or outliers and take the necessary measures to clean the data.

  2. We will train and evaluate four different multivariate linear models that forecast seasonal wins using variants of the dataset split into training and evaluation sets.

  3. Finally, we will choose the best model that balances accuracy and simplicity for predicting seasonal wins.


1. Data Exploration

The baseball training dataset contains 2,276 observations of 17 variables detailing various teams’ performances from 1871 to 2006. The descriptions of each column are below.

Some initial observations:

  • Due to the relatively long period, we expect to see outliers and missing data as the league modified official game rules; these rule changes undoubtedly caused teams and players to change their tactics in response.

  • The number of single base hits is noticeably missing from the dataset, but we can derive this value by subtracting other hit types (doubles, triples, home runs) from total hits.

  • Lastly, columns representing the game number (out of 162), the inning number (1-9), and matching opponent statistics would be helpful for predictions, but they are not present in the dataset.

1.1 Summary Statistics

The table below provides valuable descriptive statistics about the training data. These variables are all integers, and many have a minimum of zero.

One interesting piece of information is the min/max of the TARGET_WINS variable, where a value of zero suggests teams that did not win a single game in a given season. The maximum is 146, indicating no teams in the training dataset had a perfect 162-game season.

Also of note is the number of missing values from certain variables, which may be due to changes in official rules or tactics before the modern era. We’ll explore and handle these missing values in a later step.

Numerous extreme values seem implausible; several teams have zero strikeouts by their pitchers over a full season, and one is recorded with nearly 20,000 strikeouts. We’ll also address these values in the coming steps.

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 INDEX [integer]
Mean (sd) : 1268.5 (736.3)
min ≤ med ≤ max:
1 ≤ 1270.5 ≤ 2535
Q1 - Q3 : 630.5 - 1916
2276 distinct values 2276 (100.0%) 0 (0.0%)
2 TARGET_WINS [integer]
Mean (sd) : 80.8 (15.8)
min ≤ med ≤ max:
0 ≤ 82 ≤ 146
Q1 - Q3 : 71 - 92
108 distinct values 2276 (100.0%) 0 (0.0%)
3 TEAM_BATTING_H [integer]
Mean (sd) : 1469.3 (144.6)
min ≤ med ≤ max:
891 ≤ 1454 ≤ 2554
Q1 - Q3 : 1383 - 1537.5
569 distinct values 2276 (100.0%) 0 (0.0%)
4 TEAM_BATTING_2B [integer]
Mean (sd) : 241.2 (46.8)
min ≤ med ≤ max:
69 ≤ 238 ≤ 458
Q1 - Q3 : 208 - 273
240 distinct values 2276 (100.0%) 0 (0.0%)
5 TEAM_BATTING_3B [integer]
Mean (sd) : 55.2 (27.9)
min ≤ med ≤ max:
0 ≤ 47 ≤ 223
Q1 - Q3 : 34 - 72
144 distinct values 2276 (100.0%) 0 (0.0%)
6 TEAM_BATTING_HR [integer]
Mean (sd) : 99.6 (60.5)
min ≤ med ≤ max:
0 ≤ 102 ≤ 264
Q1 - Q3 : 42 - 147
243 distinct values 2276 (100.0%) 0 (0.0%)
7 TEAM_BATTING_BB [integer]
Mean (sd) : 501.6 (122.7)
min ≤ med ≤ max:
0 ≤ 512 ≤ 878
Q1 - Q3 : 451 - 580
533 distinct values 2276 (100.0%) 0 (0.0%)
8 TEAM_BATTING_SO [integer]
Mean (sd) : 735.6 (248.5)
min ≤ med ≤ max:
0 ≤ 750 ≤ 1399
Q1 - Q3 : 548 - 930
822 distinct values 2174 (95.5%) 102 (4.5%)
9 TEAM_BASERUN_SB [integer]
Mean (sd) : 124.8 (87.8)
min ≤ med ≤ max:
0 ≤ 101 ≤ 697
Q1 - Q3 : 66 - 156
348 distinct values 2145 (94.2%) 131 (5.8%)
10 TEAM_BASERUN_CS [integer]
Mean (sd) : 52.8 (23)
min ≤ med ≤ max:
0 ≤ 49 ≤ 201
Q1 - Q3 : 38 - 62
128 distinct values 1504 (66.1%) 772 (33.9%)
11 TEAM_BATTING_HBP [integer]
Mean (sd) : 59.4 (13)
min ≤ med ≤ max:
29 ≤ 58 ≤ 95
Q1 - Q3 : 50 - 67
55 distinct values 191 (8.4%) 2085 (91.6%)
12 TEAM_PITCHING_H [integer]
Mean (sd) : 1779.2 (1406.8)
min ≤ med ≤ max:
1137 ≤ 1518 ≤ 30132
Q1 - Q3 : 1419 - 1683
843 distinct values 2276 (100.0%) 0 (0.0%)
13 TEAM_PITCHING_HR [integer]
Mean (sd) : 105.7 (61.3)
min ≤ med ≤ max:
0 ≤ 107 ≤ 343
Q1 - Q3 : 50 - 150
256 distinct values 2276 (100.0%) 0 (0.0%)
14 TEAM_PITCHING_BB [integer]
Mean (sd) : 553 (166.4)
min ≤ med ≤ max:
0 ≤ 536.5 ≤ 3645
Q1 - Q3 : 476 - 611
535 distinct values 2276 (100.0%) 0 (0.0%)
15 TEAM_PITCHING_SO [integer]
Mean (sd) : 817.7 (553.1)
min ≤ med ≤ max:
0 ≤ 813.5 ≤ 19278
Q1 - Q3 : 615 - 968
823 distinct values 2174 (95.5%) 102 (4.5%)
16 TEAM_FIELDING_E [integer]
Mean (sd) : 246.5 (227.8)
min ≤ med ≤ max:
65 ≤ 159 ≤ 1898
Q1 - Q3 : 127 - 249.5
549 distinct values 2276 (100.0%) 0 (0.0%)
17 TEAM_FIELDING_DP [integer]
Mean (sd) : 146.4 (26.2)
min ≤ med ≤ max:
52 ≤ 149 ≤ 228
Q1 - Q3 : 131 - 164
144 distinct values 1990 (87.4%) 286 (12.6%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-02-26

1.2 Distribution

Next, we’ll visually check for normality and outliers in the dependent and independent variables.

The density plots display near-normal distributions for most variables with some exceptions. Right-skewness is evident for PITCHING_H (hits allowed) and FIELDING_E (fielding errors), and bimodal distributions are observed in the density plots for BATTING_HR (home runs) and BATTING_SO (strikeouts), implying two distinct clusters of high-scoring and low-scoring teams.

Additionally, the boxplots indicate a large number of outliers exist outside of the interquartile ranges for many of the variables - all these effects will need to be carefully considered.

The function featurePlot() demonstrates the relationships between each independent variable and the target TARGET_WINS.

1.3 Correlation Matrix

Plotting the correlations between the target TARGET_WINS and the other variables (excluding INDEX and TEAM_BATTING_HBP), we can see that only a handful of variables correlate meaningfully with the target. Variables with correlations close to zero are unlikely to offer significant insights into the factors contributing to a team’s victories.

| Variable | Correlation with TARGET_WINS |
|---|---|
| TARGET_WINS | 1.0000000 |
| TEAM_BATTING_H | 0.4699467 |
| TEAM_BATTING_2B | 0.3129840 |
| TEAM_BATTING_3B | -0.1243459 |
| TEAM_BATTING_HR | 0.4224168 |
| TEAM_BATTING_BB | 0.4686879 |
| TEAM_BATTING_SO | -0.2288927 |
| TEAM_BASERUN_SB | 0.0148364 |
| TEAM_BASERUN_CS | -0.1787560 |
| TEAM_BATTING_HBP | 0.0735042 |
| TEAM_PITCHING_H | 0.4712343 |
| TEAM_PITCHING_HR | 0.4224668 |
| TEAM_PITCHING_BB | 0.4683988 |
| TEAM_PITCHING_SO | -0.2293648 |
| TEAM_FIELDING_E | -0.3866880 |
| TEAM_FIELDING_DP | -0.1958660 |

To avoid multicollinearity, we should exclude from the model some variables that are highly correlated with each other. Comparing the offensive stats (variables starting with BATTING or BASERUN) against the defensive stats (PITCHING and FIELDING), we can see potential cases of multicollinearity; we might interpret this as some teams being exceptional at both hitting (offense) and pitching/fielding (defense).

Further, a team’s TEAM_BATTING_HR (home runs hit) and TEAM_PITCHING_HR (home runs allowed) have a correlation of nearly 1.0. This is unexpected at first, but it likely reflects league-wide conditions: in high home-run eras, teams both hit and allow more home runs. Any final model should probably exclude one of these two home run variables.

By contrast, the correlation between a team’s BATTING_H (hits) and PITCHING_H (hits allowed) is around 0.3, which seems more reasonable.

Some other strong correlations are less obvious, such as TEAM_FIELDING_E (errors) being strongly negatively correlated with TEAM_BATTING_BB (walks by batters) and TEAM_BATTING_HR (home runs).

Digging a little deeper, TEAM_FIELDING_E (errors) has a strong negative Pearson correlation with TEAM_BATTING_BB (walks by batters) and a strong positive correlation with TEAM_PITCHING_H (hits allowed). Altogether, this seems to support an overall association between strong offensive stats and strong defensive stats (and vice versa).
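These coefficients can be recomputed directly from the training data; a minimal sketch (the full EDA code is in Appendix C):

```r
# Pairwise Pearson correlations on the numeric columns, excluding the identifier;
# pairwise.complete.obs uses all available (non-missing) pairs of observations
corMat <- cor(dplyr::select(trainDf, -INDEX), use = "pairwise.complete.obs")

# The two correlations discussed above
round(corMat["TEAM_FIELDING_E", c("TEAM_BATTING_BB", "TEAM_PITCHING_H")], 3)

# Full correlation matrix plot
corrplot::corrplot(corMat, method = "circle", tl.cex = 0.7)
```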


2. Data Preparation

2.1 Missing Data

Let’s take a closer look at the missing data using the plot below, highlighting six variables with various percentages of missing values:

Most notably, TEAM_BATTING_HBP (batter hit by pitch) is missing values in 91% of the cases; we’ll remove this variable from the dataset because there is not enough information to estimate sensible values.

The variable TEAM_BASERUN_CS (caught stealing) is also missing data in 34% of the cases; instead of dropping this variable and losing the information present, we’ll use imputation methods to fill in values we might reasonably expect.

The remaining variables are missing data in fewer than 25% of the cases, so we’ll impute expected values here too.

Additionally, we’ll drop the INDEX variable as it is just an identification label for each case.

2.2 Impute Missing Values (cleanDf)

To handle the cases with missing values, we’ll create a new variable na_flag to record whether any variable was missing for each case, and use the MICE (Multiple Imputation by Chained Equations) library to impute values for our missing data.

One of the assumptions of MICE is that the missing values occur at random and can therefore be estimated from the observed values in the rest of the dataset. The MICE algorithm performs several iterations over the data, generating values to complete the missing entries.

One of the most common methods used by MICE is Predictive Mean Matching (PMM), which fills in each missing value with an observed value borrowed from a case whose predicted value is close to the predicted value for the incomplete case.
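A minimal sketch of the imputation step, assuming the working data frame droppedDf already has INDEX and TEAM_BATTING_HBP removed (the full code is in Appendix C):

```r
library(mice)

# Flag the cases that had at least one missing value before imputation
droppedDf$na_flag <- as.integer(!complete.cases(droppedDf))

# Chained-equation imputation with predictive mean matching (pmm)
imputeDf <- mice(droppedDf, m = 5, method = "pmm", maxit = 50,
                 seed = 123, printFlag = FALSE)

# Observed (blue) vs. imputed (red) densities for the imputed variables
densityplot(imputeDf)

# Use the first completed dataset for modeling
cleanDf <- complete(imputeDf, 1)
```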

Here are the new density plots of our five variables after imputation with MICE. The blue curves represent the observed values, and the red curves show the values imputed by each pass of the MICE algorithm.

After imputation, the distribution of TEAM_BATTING_SO becomes closer to normal, which may benefit the model. However, the remaining variables may need to be handled via alternative methods.

2.3 Remove Outliers (droppedDf)

After exploring outliers using methods based on standard deviation and IQR, we decided to try another approach to preserve as much information for our models as possible.

We used historical data from Lahman’s Baseball Database to identify outliers. Early baseball teams played fewer than 20 games per season, and the calculation used in our project dataset to normalize those statistics to a 162-game season tended to create outliers.

We used the minimum and maximum of each statistic in the modern era (post-1900) as a filter to alleviate the issues created by outlier values: for every statistic in the column list, we compared each team’s value to the modern-era minimum and maximum for that statistic and dropped the row if the value fell outside that range.
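A minimal sketch of the filtering logic, where modernDf is an assumed reference frame of post-1900 Lahman seasons with the same column names and columnNames lists the statistics to screen (the project’s loadLahmanShortData() and filterOutliers() helpers appear in Appendix C):

```r
# Keep only rows whose values fall inside the modern-era (post-1900) range
droppedDf <- cleanDf
for (col in columnNames) {
  lo <- min(modernDf[[col]], na.rm = TRUE)
  hi <- max(modernDf[[col]], na.rm = TRUE)
  droppedDf <- droppedDf[droppedDf[[col]] >= lo & droppedDf[[col]] <= hi, ]
}
```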

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 TARGET_WINS [integer]
Mean (sd) : 80.8 (13.7)
min ≤ med ≤ max:
36 ≤ 82 ≤ 124
Q1 - Q3 : 72 - 91
80 distinct values 1948 (100.0%) 0 (0.0%)
2 TEAM_BATTING_H [integer]
Mean (sd) : 1452.8 (111)
min ≤ med ≤ max:
1137 ≤ 1446 ≤ 1876
Q1 - Q3 : 1379 - 1523
480 distinct values 1948 (100.0%) 0 (0.0%)
3 TEAM_BATTING_2B [integer]
Mean (sd) : 243.1 (45)
min ≤ med ≤ max:
123 ≤ 240.5 ≤ 392
Q1 - Q3 : 211 - 274
225 distinct values 1948 (100.0%) 0 (0.0%)
4 TEAM_BATTING_3B [integer]
Mean (sd) : 50.3 (23.1)
min ≤ med ≤ max:
11 ≤ 44 ≤ 138
Q1 - Q3 : 33 - 64
116 distinct values 1948 (100.0%) 0 (0.0%)
5 TEAM_BATTING_HR [integer]
Mean (sd) : 109.4 (57.6)
min ≤ med ≤ max:
5 ≤ 113 ≤ 264
Q1 - Q3 : 62 - 152.5
240 distinct values 1948 (100.0%) 0 (0.0%)
6 TEAM_BATTING_BB [integer]
Mean (sd) : 523.7 (87.6)
min ≤ med ≤ max:
273 ≤ 520 ≤ 824
Q1 - Q3 : 465 - 583
405 distinct values 1948 (100.0%) 0 (0.0%)
7 TEAM_BATTING_SO [integer]
Mean (sd) : 765.2 (223.9)
min ≤ med ≤ max:
319 ≤ 788.5 ≤ 1399
Q1 - Q3 : 573 - 945
747 distinct values 1948 (100.0%) 0 (0.0%)
8 TEAM_BASERUN_SB [integer]
Mean (sd) : 109.2 (61.3)
min ≤ med ≤ max:
18 ≤ 96 ≤ 367
Q1 - Q3 : 64 - 141
271 distinct values 1948 (100.0%) 0 (0.0%)
9 TEAM_BASERUN_CS [integer]
Mean (sd) : 46.4 (23.4)
min ≤ med ≤ max:
11 ≤ 45 ≤ 201
Q1 - Q3 : 32 - 57
126 distinct values 1948 (100.0%) 0 (0.0%)
10 TEAM_PITCHING_H [integer]
Mean (sd) : 1516.7 (157.4)
min ≤ med ≤ max:
1137 ≤ 1492 ≤ 2096
Q1 - Q3 : 1405 - 1599
599 distinct values 1948 (100.0%) 0 (0.0%)
11 TEAM_PITCHING_HR [integer]
Mean (sd) : 112.6 (58.2)
min ≤ med ≤ max:
5 ≤ 115 ≤ 277
Q1 - Q3 : 66 - 155.5
241 distinct values 1948 (100.0%) 0 (0.0%)
12 TEAM_PITCHING_BB [integer]
Mean (sd) : 545.6 (93.5)
min ≤ med ≤ max:
312 ≤ 536 ≤ 877
Q1 - Q3 : 483 - 601
432 distinct values 1948 (100.0%) 0 (0.0%)
13 TEAM_PITCHING_SO [integer]
Mean (sd) : 796.9 (217.1)
min ≤ med ≤ max:
316 ≤ 809 ≤ 1659
Q1 - Q3 : 625 - 955
746 distinct values 1948 (100.0%) 0 (0.0%)
14 TEAM_FIELDING_E [integer]
Mean (sd) : 174.5 (77.6)
min ≤ med ≤ max:
65 ≤ 148.5 ≤ 515
Q1 - Q3 : 124 - 199
326 distinct values 1948 (100.0%) 0 (0.0%)
15 TEAM_FIELDING_DP [integer]
Mean (sd) : 146.8 (25.6)
min ≤ med ≤ max:
72 ≤ 149 ≤ 228
Q1 - Q3 : 131 - 164
138 distinct values 1948 (100.0%) 0 (0.0%)


2.4 New Variables

We’ll also derive a new column TEAM_BATTING_1B (single base hits) by subtracting doubles, triples, and home runs from the total number of hits per case.
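A one-line sketch of the derivation on the imputed data frame:

```r
library(dplyr)

# Singles are total hits minus doubles, triples, and home runs
cleanDf <- cleanDf %>%
  mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -
           TEAM_BATTING_3B - TEAM_BATTING_HR)
```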

2.5 Transform Non-Normal Variables (transformedDf)

We should also try to transform some variables to fit a more normal distribution, including TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_PITCHING_HR, and TEAM_BASERUN_CS.

We apply a square-root transformation to these variables, which reduces the right skew by compressing the sparse upper ranges. The plots below compare the original distributions of the non-normal variables with the transformed ones.
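A sketch of the transformation step; the new column names match those in the summary table below:

```r
library(dplyr)

# Square-root transforms of the skewed count variables
transformedDf <- cleanDf %>%
  mutate(batting_hr_sqrt  = sqrt(TEAM_BATTING_HR),
         batting_so_sqrt  = sqrt(TEAM_BATTING_SO),
         baserun_cs_sqrt  = sqrt(TEAM_BASERUN_CS),
         pitching_hr_sqrt = sqrt(TEAM_PITCHING_HR))
```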

Our transformed dataset no longer contains missing values or the irrelevant INDEX and TEAM_BATTING_HBP variables. As shown in the table below, no missing values remain, and we can examine how the summary statistics have shifted with the imputed data.

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 TARGET_WINS [integer]
Mean (sd) : 80.8 (15.8)
min ≤ med ≤ max:
0 ≤ 82 ≤ 146
Q1 - Q3 : 71 - 92
108 distinct values 2276 (100.0%) 0 (0.0%)
2 TEAM_BATTING_H [integer]
Mean (sd) : 1469.3 (144.6)
min ≤ med ≤ max:
891 ≤ 1454 ≤ 2554
Q1 - Q3 : 1383 - 1537.5
569 distinct values 2276 (100.0%) 0 (0.0%)
3 TEAM_BATTING_2B [integer]
Mean (sd) : 241.2 (46.8)
min ≤ med ≤ max:
69 ≤ 238 ≤ 458
Q1 - Q3 : 208 - 273
240 distinct values 2276 (100.0%) 0 (0.0%)
4 TEAM_BATTING_3B [integer]
Mean (sd) : 55.2 (27.9)
min ≤ med ≤ max:
0 ≤ 47 ≤ 223
Q1 - Q3 : 34 - 72
144 distinct values 2276 (100.0%) 0 (0.0%)
5 TEAM_BATTING_HR [integer]
Mean (sd) : 99.6 (60.5)
min ≤ med ≤ max:
0 ≤ 102 ≤ 264
Q1 - Q3 : 42 - 147
243 distinct values 2276 (100.0%) 0 (0.0%)
6 TEAM_BATTING_BB [integer]
Mean (sd) : 501.6 (122.7)
min ≤ med ≤ max:
0 ≤ 512 ≤ 878
Q1 - Q3 : 451 - 580
533 distinct values 2276 (100.0%) 0 (0.0%)
7 TEAM_BATTING_SO [integer]
Mean (sd) : 726.8 (247.5)
min ≤ med ≤ max:
0 ≤ 733 ≤ 1399
Q1 - Q3 : 541 - 925
822 distinct values 2276 (100.0%) 0 (0.0%)
8 TEAM_BASERUN_SB [integer]
Mean (sd) : 142.2 (118.4)
min ≤ med ≤ max:
0 ≤ 106 ≤ 697
Q1 - Q3 : 67 - 170
348 distinct values 2276 (100.0%) 0 (0.0%)
9 TEAM_BASERUN_CS [integer]
Mean (sd) : 50.5 (32.8)
min ≤ med ≤ max:
0 ≤ 45 ≤ 201
Q1 - Q3 : 33 - 59
128 distinct values 2276 (100.0%) 0 (0.0%)
10 TEAM_PITCHING_H [integer]
Mean (sd) : 1779.2 (1406.8)
min ≤ med ≤ max:
1137 ≤ 1518 ≤ 30132
Q1 - Q3 : 1419 - 1683
843 distinct values 2276 (100.0%) 0 (0.0%)
11 TEAM_PITCHING_HR [integer]
Mean (sd) : 105.7 (61.3)
min ≤ med ≤ max:
0 ≤ 107 ≤ 343
Q1 - Q3 : 50 - 150
256 distinct values 2276 (100.0%) 0 (0.0%)
12 TEAM_PITCHING_BB [integer]
Mean (sd) : 553 (166.4)
min ≤ med ≤ max:
0 ≤ 536.5 ≤ 3645
Q1 - Q3 : 476 - 611
535 distinct values 2276 (100.0%) 0 (0.0%)
13 TEAM_PITCHING_SO [integer]
Mean (sd) : 811.1 (542.5)
min ≤ med ≤ max:
0 ≤ 804 ≤ 19278
Q1 - Q3 : 611 - 958.5
823 distinct values 2276 (100.0%) 0 (0.0%)
14 TEAM_FIELDING_E [integer]
Mean (sd) : 246.5 (227.8)
min ≤ med ≤ max:
65 ≤ 159 ≤ 1898
Q1 - Q3 : 127 - 249.5
549 distinct values 2276 (100.0%) 0 (0.0%)
15 TEAM_FIELDING_DP [integer]
Mean (sd) : 142.1 (29.2)
min ≤ med ≤ max:
52 ≤ 145.5 ≤ 228
Q1 - Q3 : 125 - 162
144 distinct values 2276 (100.0%) 0 (0.0%)
16 batting_hr_sqrt [numeric]
Mean (sd) : 9.4 (3.4)
min ≤ med ≤ max:
0 ≤ 10.1 ≤ 16.2
Q1 - Q3 : 6.5 - 12.1
243 distinct values 2276 (100.0%) 0 (0.0%)
17 batting_so_sqrt [numeric]
Mean (sd) : 26.4 (5.3)
min ≤ med ≤ max:
0 ≤ 27.1 ≤ 37.4
Q1 - Q3 : 23.3 - 30.4
822 distinct values 2276 (100.0%) 0 (0.0%)
18 baserun_cs_sqrt [numeric]
Mean (sd) : 6.8 (2)
min ≤ med ≤ max:
0 ≤ 6.7 ≤ 14.2
Q1 - Q3 : 5.7 - 7.7
128 distinct values 2276 (100.0%) 0 (0.0%)
19 pitching_hr_sqrt [numeric]
Mean (sd) : 9.7 (3.3)
min ≤ med ≤ max:
0 ≤ 10.3 ≤ 18.5
Q1 - Q3 : 7.1 - 12.2
256 distinct values 2276 (100.0%) 0 (0.0%)



3. Building Models

Having thoroughly explored our dataset and completed the data cleaning process, we can initiate the construction of our multivariate linear regression models.

We’ll build our models using four variants of our training dataset:

  1. The original training dataset (trainDf),

  2. The training dataset with missing data variables imputed using MICE (cleanDf),

  3. The training dataset with outliers dropped, using stepwise selection (droppedDf), and

  4. The dataset with transforms of non-normal variables, using stepwise selection (transformedDf).


3.1 Model 1 - Baseline

Dataset - trainDf

For the first multivariate linear regression model, we select all the variables from the original, un-cleaned dataset except for INDEX (just a row identifier) and TEAM_BATTING_HBP (missing in 91% of cases). This model serves as a baseline for comparison.
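A sketch of the baseline fit; note that lm() drops incomplete rows by default, which accounts for the reduced observation count in the summary below:

```r
# Baseline OLS model on the raw training data, excluding the row identifier
# and the mostly-missing hit-by-pitch column
lm1 <- lm(TARGET_WINS ~ . - INDEX - TEAM_BATTING_HBP, data = trainDf)
summary(lm1)
```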

Observations: 1486 (790 missing obs. deleted)
Dependent variable: TARGET_WINS
Type: OLS linear regression
F(14, 1471) = 82.0963; R² = 0.4386; Adj. R² = 0.4333

| Predictor | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 57.9124 | 6.6428 | 8.7180 | 0.0000 |
| TEAM_BATTING_H | 0.0154 | 0.0196 | 0.7864 | 0.4318 |
| TEAM_BATTING_2B | -0.0705 | 0.0094 | -7.5218 | 0.0000 |
| TEAM_BATTING_3B | 0.1616 | 0.0222 | 7.2796 | 0.0000 |
| TEAM_BATTING_HR | 0.0740 | 0.0854 | 0.8660 | 0.3866 |
| TEAM_BATTING_BB | 0.0438 | 0.0465 | 0.9421 | 0.3463 |
| TEAM_BATTING_SO | 0.0183 | 0.0235 | 0.7778 | 0.4368 |
| TEAM_BASERUN_SB | 0.0359 | 0.0087 | 4.1304 | 0.0000 |
| TEAM_BASERUN_CS | 0.0521 | 0.0182 | 2.8597 | 0.0043 |
| TEAM_PITCHING_H | 0.0190 | 0.0184 | 1.0361 | 0.3003 |
| TEAM_PITCHING_HR | 0.0230 | 0.0821 | 0.2801 | 0.7794 |
| TEAM_PITCHING_BB | -0.0042 | 0.0447 | -0.0935 | 0.9255 |
| TEAM_PITCHING_SO | -0.0382 | 0.0224 | -1.7007 | 0.0892 |
| TEAM_FIELDING_E | -0.1559 | 0.0099 | -15.6719 | 0.0000 |
| TEAM_FIELDING_DP | -0.1129 | 0.0131 | -8.5930 | 0.0000 |

Standard errors: OLS


Model Results

The F-statistic is 82.1 and the adjusted R-squared is 0.433; of the 14 predictor variables, only 6 have statistically significant p-values.

The adjusted R2 indicates that only about 43% of the variance in the response variable can be explained by the predictor variables, and most of the individual predictors are not statistically significant. We should look at other models for better performance.

Checking Model Assumptions

We evaluate the linear modeling assumptions using the standard diagnostic plots, the Breusch–Pagan test for heteroscedasticity, and the Variance Inflation Factor (VIF) to assess collinearity.

The initial review of the diagnostic plots for this model shows some deviation from the linear modeling assumptions. The Q-Q plot deviates from the normal line at both tails, and the residuals vs. fitted and scale-location plots show curvature in the main cluster, indicating that the variance is not constant.
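A sketch of these checks; vif() is from the car package, and bptest() is assumed here from the lmtest package:

```r
library(car)     # vif()
library(lmtest)  # bptest()

par(mfrow = c(2, 2))
plot(lm1)    # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage

bptest(lm1)  # Breusch-Pagan test; a small p-value indicates heteroscedasticity
vif(lm1)     # variance inflation factors; values above ~5 suggest collinearity
```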


3.2 Model 2 - Missing Values Imputed

Dataset - cleanDf

Our second model will be based on the cleaned dataset which doesn’t have missing values. We chose the variables TEAM_PITCHING_BB, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_BATTING_HR, TEAM_BATTING_H based on our understanding of the data.
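A sketch of the corresponding fit on the imputed data:

```r
# Model 2: hand-selected predictors on the MICE-imputed dataset
lm2 <- lm(TARGET_WINS ~ TEAM_PITCHING_BB + TEAM_BATTING_2B + TEAM_BATTING_3B +
            TEAM_FIELDING_E + TEAM_PITCHING_H + TEAM_BATTING_HR + TEAM_BATTING_H,
          data = cleanDf)
summary(lm2)
```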

Observations: 2276
Dependent variable: TARGET_WINS
Type: OLS linear regression
F(7, 2268) = 119.8631; R² = 0.2700; Adj. R² = 0.2678

| Predictor | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 7.2713 | 3.2775 | 2.2185 | 0.0266 |
| TEAM_PITCHING_BB | 0.0103 | 0.0020 | 5.2868 | 0.0000 |
| TEAM_BATTING_2B | -0.0258 | 0.0090 | -2.8544 | 0.0044 |
| TEAM_BATTING_3B | 0.1011 | 0.0166 | 6.0742 | 0.0000 |
| TEAM_FIELDING_E | -0.0166 | 0.0023 | -7.3642 | 0.0000 |
| TEAM_PITCHING_H | -0.0013 | 0.0003 | -4.0257 | 0.0001 |
| TEAM_BATTING_HR | 0.0367 | 0.0076 | 4.8540 | 0.0000 |
| TEAM_BATTING_H | 0.0485 | 0.0032 | 15.2210 | 0.0000 |

Standard errors: OLS


Model Results

The F-statistic is 119.9, the adjusted R-squared is 0.268, and all 7 variables have statistically significant p-values.

The adjusted R2 indicates that only about 27% of the variance in the response variable can be explained by the predictor variables, which is less than in the original model.

The F-statistic is high, the p-value of the model is close to zero, and the model’s diagnostic checks are satisfactory. We therefore reject the null hypothesis that there is no relationship between the explanatory and response variables.

Checking Model Assumptions

The initial review of the diagnostic plots for this model shows some deviation from the linear modeling assumptions. The Q-Q plot follows the normal line reasonably well, but the residuals vs. fitted and standardized residuals vs. fitted plots do not show constant variance for fitted values below 60 and above 95.


3.3 Model 3 - Outliers Removed and Stepwise Selection

Dataset - droppedDf

The third model uses the cleaned dataset with outliers omitted, and we dropped TEAM_BATTING_H due to collinearity with TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B and TEAM_BATTING_HR.

We use stepwise variable selection based on AIC to identify an optimal set of features. The resulting model has significant p-values for the overall model and for all predictor variables.
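A sketch of the selection step, assuming droppedDf already contains the derived TEAM_BATTING_1B column:

```r
library(MASS)

# Full model without total hits (collinear with the 1B/2B/3B/HR breakdown)
lm3_full <- lm(TARGET_WINS ~ . - TEAM_BATTING_H, data = droppedDf)

# Stepwise search in both directions, minimizing AIC
lm3step <- stepAIC(lm3_full, direction = "both", trace = FALSE)
summary(lm3step)
```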

Observations: 1948
Dependent variable: TARGET_WINS
Type: OLS linear regression
F(12, 1935) = 118.0244; R² = 0.4226; Adj. R² = 0.4190

| Predictor | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 62.7205 | 5.5952 | 11.2097 | 0.0000 |
| TEAM_BATTING_2B | -0.0428 | 0.0086 | -4.9607 | 0.0000 |
| TEAM_BATTING_3B | 0.1767 | 0.0182 | 9.6845 | 0.0000 |
| TEAM_BATTING_HR | 0.0911 | 0.0091 | 10.0529 | 0.0000 |
| TEAM_BATTING_BB | 0.1512 | 0.0183 | 8.2517 | 0.0000 |
| TEAM_BATTING_SO | -0.0409 | 0.0072 | -5.6743 | 0.0000 |
| TEAM_BASERUN_SB | 0.0898 | 0.0056 | 15.9375 | 0.0000 |
| TEAM_BASERUN_CS | -0.0649 | 0.0111 | -5.8369 | 0.0000 |
| TEAM_PITCHING_H | 0.0272 | 0.0038 | 7.0936 | 0.0000 |
| TEAM_PITCHING_BB | -0.1117 | 0.0169 | -6.6086 | 0.0000 |
| TEAM_PITCHING_SO | 0.0170 | 0.0068 | 2.4922 | 0.0128 |
| TEAM_FIELDING_E | -0.1222 | 0.0060 | -20.4539 | 0.0000 |
| TEAM_FIELDING_DP | -0.1194 | 0.0122 | -9.8128 | 0.0000 |

Standard errors: OLS


Model Results

The F-statistic is 118, the adjusted R-squared is 0.42, and all 12 predictor variables have statistically significant p-values.

The adjusted R2 indicates that 42% of the variance in the response variable can be explained by the predictor variables. The F-statistic is high and the p-value of the model is close to zero, so we reject the null hypothesis that there is no relationship between the explanatory and response variables.

Check Model Assumptions

The initial review of the diagnostic plots for this model shows some deviation from the linear modeling assumptions at the boundaries of the prediction range, but the Q-Q plot is nearly a straight line, and the residuals are level and roughly homoscedastic in the 65-95 fitted-value range.

The Breusch–Pagan test for heteroscedasticity uses the following null and alternative hypotheses:

  • H0 - Residuals are distributed with equal variance (i.e., homoscedasticity)
  • H1 - Residuals are distributed with unequal variance (i.e., heteroscedasticity)

For this model iteration, the small p-value below leads us to reject the null hypothesis and conclude that the model violates the homoscedasticity assumption.

studentized Breusch-Pagan test

data: lm3step
BP = 68.868, df = 12, p-value = 5.213e-10

The Variance Inflation Factor (VIF) calculation detects collinearity between the TEAM_PITCHING_BB and TEAM_BATTING_BB variables: both have VIF scores over 5 and a pairwise correlation of 0.93. The variables TEAM_PITCHING_SO, TEAM_BATTING_SO, TEAM_PITCHING_H and TEAM_BATTING_2B also have VIF scores over 5.

Re-run Model with new Variable Selections

For the refined model we will drop TEAM_PITCHING_BB, TEAM_BATTING_BB, TEAM_PITCHING_SO, TEAM_BATTING_SO, TEAM_PITCHING_H, and TEAM_BATTING_2B.

Observations: 1948
Dependent variable: TARGET_WINS
Type: OLS linear regression
F(7, 1940) = 165.8482; R² = 0.3744; Adj. R² = 0.3721

| Predictor | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 107.6576 | 3.2419 | 33.2080 | 0.0000 |
| TEAM_BATTING_3B | 0.2124 | 0.0175 | 12.1657 | 0.0000 |
| TEAM_BATTING_HR | 0.1443 | 0.0074 | 19.4784 | 0.0000 |
| TEAM_BATTING_SO | -0.0338 | 0.0020 | -16.7390 | 0.0000 |
| TEAM_BASERUN_SB | 0.1031 | 0.0057 | 18.1485 | 0.0000 |
| TEAM_BASERUN_CS | -0.0659 | 0.0115 | -5.7423 | 0.0000 |
| TEAM_FIELDING_E | -0.1253 | 0.0059 | -21.3872 | 0.0000 |
| TEAM_FIELDING_DP | -0.0938 | 0.0123 | -7.6022 | 0.0000 |

Standard errors: OLS


Model Results

This model has a lower adjusted R2 of 0.37, but with a high F-statistic of 165.8 and statistically significant p-values, it should align better with the linear regression assumptions.

Check Model Assumptions

The diagnostic plots for the new model align more closely with the linear regression assumptions. However, based on the Breusch–Pagan test below, we again reject the null hypothesis and conclude that the residuals are still not homoscedastic.

The Variance Inflation Factor (VIF) calculation no longer detects collinearity among the remaining variables. The remaining heteroscedasticity, however, counts against Model 3 when we weigh the candidate models.

studentized Breusch-Pagan test

data: lm3step.final
BP = 45.198, df = 7, p-value = 1.252e-07

Comparison of the three Model 3 iterations (estimates with standard errors; t statistics with p-values):

| Predictor | Model 3: Est. (S.E.) | Model 3: t (p) | StepAIC: Est. (S.E.) | StepAIC: t (p) | Final: Est. (S.E.) | Final: t (p) |
|---|---|---|---|---|---|---|
| (Intercept) | 62.92 (5.81) | 10.83 (<0.0001) | 62.72 (5.60) | 11.21 (<0.0001) | 107.66 (3.24) | 33.21 (<0.0001) |
| TEAM BATTING 2B | -0.05 (0.02) | -2.53 (0.0116) | -0.04 (0.01) | -4.96 (<0.0001) | | |
| TEAM BATTING 3B | 0.17 (0.02) | 7.17 (<0.0001) | 0.18 (0.02) | 9.68 (<0.0001) | 0.21 (0.02) | 12.17 (<0.0001) |
| TEAM BATTING HR | 0.12 (0.09) | 1.38 (0.1668) | 0.09 (0.01) | 10.05 (<0.0001) | 0.14 (0.01) | 19.48 (<0.0001) |
| TEAM BATTING BB | 0.15 (0.05) | 3.05 (0.0023) | 0.15 (0.02) | 8.25 (<0.0001) | | |
| TEAM BATTING SO | -0.04 (0.01) | -5.50 (<0.0001) | -0.04 (0.01) | -5.67 (<0.0001) | -0.03 (0.00) | -16.74 (<0.0001) |
| TEAM BASERUN SB | 0.09 (0.01) | 15.51 (<0.0001) | 0.09 (0.01) | 15.94 (<0.0001) | 0.10 (0.01) | 18.15 (<0.0001) |
| TEAM BASERUN CS | -0.06 (0.01) | -5.65 (<0.0001) | -0.06 (0.01) | -5.84 (<0.0001) | -0.07 (0.01) | -5.74 (<0.0001) |
| TEAM PITCHING H | 0.03 (0.02) | 1.84 (0.0656) | 0.03 (0.00) | 7.09 (<0.0001) | | |
| TEAM PITCHING HR | -0.03 (0.08) | -0.38 (0.7067) | | | | |
| TEAM PITCHING BB | -0.11 (0.05) | -2.37 (0.0177) | -0.11 (0.02) | -6.61 (<0.0001) | | |
| TEAM PITCHING SO | 0.02 (0.01) | 2.47 (0.0137) | 0.02 (0.01) | 2.49 (0.0128) | | |
| TEAM FIELDING E | -0.12 (0.01) | -20.13 (<0.0001) | -0.12 (0.01) | -20.45 (<0.0001) | -0.13 (0.01) | -21.39 (<0.0001) |
| TEAM FIELDING DP | -0.12 (0.01) | -9.74 (<0.0001) | -0.12 (0.01) | -9.81 (<0.0001) | -0.09 (0.01) | -7.60 (<0.0001) |
| TEAM BATTING 1B | -0.00 (0.02) | -0.14 (0.8852) | | | | |
| Observations | 1948 | | 1948 | | 1948 | |
| R2 / R2 adjusted | 0.423 / 0.418 | | 0.423 / 0.419 | | 0.374 / 0.372 | |
| AIC | 14675.322 | | 14671.497 | | 14817.770 | |

3.4 Model 4 - Normal Transformation and Stepwise Selection

Dataset - transformedDf

The fourth model uses the dataset with missing values imputed, outliers removed, and variables square-root transformed for more normal distributions. Stepwise AIC selection is applied to select the optimal combination of variables. The resulting model has significant p-values for the model and all predictor variables.

Observations: 2276
Dependent variable: TARGET_WINS
Type: OLS linear regression
F(10, 2265) = 127.9868; R² = 0.3610; Adj. R² = 0.3582

| Predictor | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 37.0792 | 5.6195 | 6.5983 | 0.0000 |
| TEAM_BATTING_H | 0.0483 | 0.0032 | 14.9230 | 0.0000 |
| TEAM_BATTING_2B | -0.0233 | 0.0087 | -2.6693 | 0.0077 |
| TEAM_BATTING_BB | 0.0116 | 0.0030 | 3.8575 | 0.0001 |
| TEAM_BASERUN_SB | 0.0533 | 0.0042 | 12.7742 | 0.0000 |
| TEAM_PITCHING_SO | 0.0021 | 0.0006 | 3.4272 | 0.0006 |
| TEAM_FIELDING_E | -0.0407 | 0.0023 | -17.4640 | 0.0000 |
| TEAM_FIELDING_DP | -0.0944 | 0.0125 | -7.5552 | 0.0000 |
| batting_hr_sqrt | 1.2153 | 0.1573 | 7.7265 | 0.0000 |
| batting_so_sqrt | -0.8177 | 0.1119 | -7.3049 | 0.0000 |
| baserun_cs_sqrt | -0.4448 | 0.1645 | -2.7038 | 0.0069 |

Standard errors: OLS


Model Results

The F-statistic is 128, the adjusted R-squared is 0.358, and the p-value of the model is close to zero, so we reject the null hypothesis that there is no relationship between the explanatory and response variables.

Check Model Assumptions

The review of the diagnostic plots shows that the Q-Q plot is nearly a straight line, and the residuals are level and roughly homoscedastic in the 60-100 fitted-value range.


4. Selecting Models

The table presents an analysis of the performance metrics for each model on the training dataset. The results suggest that the model’s performance slightly improved after incorporating transformations and selecting significant parameters.

The \(R^2\), error, and adjusted \(R^2\) for each model are below. The non-cleaned, non-imputed raw training data (Model 1) had the best-fitting statistics, but Model 1 does not meet the assumption requirements, so Model 3 looks like the best fit.

The F-statistics for Models 2 and 3 were large, with significant p-values. During the exploratory data analysis, some of the predictor variables were correlated with each other, as expected in baseball. This high correlation between variables can produce multicollinearity, causing unstable regression fits. To mitigate this issue, we used the Variance Inflation Factor (VIF) in the previous section to identify better models by selecting variables with VIF values less than 5.

The residual plots from Part 3 showed that the residuals for Models 1-3 are approximately normally distributed with nearly constant variance.

Moneyball Dataset
| Model | r | mse | adjusted.r |
|---|---|---|---|
| Model 1 (trainDf) | 0.44 | 9.56 | 0.43 |
| Model 2 (cleanDf) | 0.27 | 13.48 | 0.27 |
| Model 3 (droppedDf, stepAIC) | 0.42 | 10.41 | 0.42 |
| Model 4 (transformedDf, stepAIC) | 0.36 | 12.62 | 0.36 |

Next, we apply the models to the evaluation dataset to make predictions. However, to ensure that Models 2-4 work properly, we fill in the missing values in the evaluation data set using the same MICE imputation method, drop the INDEX and TEAM_BATTING_HBP variables, remove outliers for Model 3, and make transformations for Model 4.
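A sketch of that preparation and prediction step, reusing the model objects fit above (lm4step here denotes the Model 4 stepwise fit, an assumed name):

```r
library(dplyr)
library(mice)

# Mirror the training-data preparation on the evaluation set
evalPrep <- evalDf %>%
  dplyr::select(-INDEX, -TEAM_BATTING_HBP) %>%
  mice(m = 5, method = "pmm", seed = 123, printFlag = FALSE) %>%
  complete() %>%
  mutate(TEAM_BATTING_1B  = TEAM_BATTING_H - TEAM_BATTING_2B -
           TEAM_BATTING_3B - TEAM_BATTING_HR,
         batting_hr_sqrt  = sqrt(TEAM_BATTING_HR),
         batting_so_sqrt  = sqrt(TEAM_BATTING_SO),
         baserun_cs_sqrt  = sqrt(TEAM_BASERUN_CS),
         pitching_hr_sqrt = sqrt(TEAM_PITCHING_HR))

# One column of predicted wins per model
predictions <- data.frame(
  lm1  = predict(lm1,     newdata = evalPrep),
  lm2  = predict(lm2,     newdata = evalPrep),
  aic3 = predict(lm3step, newdata = evalPrep),
  aic4 = predict(lm4step, newdata = evalPrep))
head(predictions, 10)
```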

This table presents the predicted values of TARGET_WINS for each model. The first thing to notice is that the first model predicts a negative number of wins, which is unrealistic. The third and fourth AIC-generated models (no outliers and transformed values) have similar outcomes, with both performing well. The second model also produced similar results but had an issue with model assumptions.

Moneyball Dataset
| lm1 | lm2 | aic3 | aic4 |
|---|---|---|---|
| 62.57 | 68.58 | 60.12 | 62.85 |
| 68.91 | 70.21 | 67.00 | 66.03 |
| 73.68 | 77.35 | 71.92 | 75.44 |
| 81.89 | 83.61 | 84.33 | 87.27 |
| 20.39 | 66.44 | 74.53 | 57.00 |
| 32.64 | 67.44 | 65.74 | 75.28 |
| 48.67 | 74.02 | 68.01 | 83.20 |
| 58.69 | 72.52 | 59.07 | 73.87 |
| 67.81 | 72.08 | 71.37 | 67.30 |
| 69.81 | 75.86 | 69.24 | 73.89 |

We can also see when plotting the predictions that there doesn’t seem to be much difference between the models, aside from the unrealistic outliers generated by Model 1, and the slightly tighter range in wins in Model 4.


5. Conclusion

We don’t generally place much faith in any of the models for their predictive power:

  • Model 1 produced the highest \(R^2\), but was affected by missing data leading to negative win predictions.

  • Model 3 produced lower \(R^2\) than Model 1 but didn’t produce negative results.

  • Models 2 and 4 demonstrated a weak overall fit, as evidenced by their lower \(R^2\) scores.

  • None of the models displayed satisfactory performance when assessed for linearity or homogeneity of variance. 

According to the analysis, Models 1 and 3 performed slightly better than Models 2 and 4, and would be the preferred linear models based solely on their adjusted \(R^2\). However, Model 3 is more statistically significant than the others and uses fewer unnecessary variables for making predictions while maintaining a similar level of adjusted \(R^2\).

Overall, Model 3 would be the recommended choice based on these factors and its better handling of multicollinearity (with fewer instances of sign-flipping in coefficients.)


6. References

  1. Faraway, J. J. (2014). Linear Models with R, Second Edition. CRC Press.
  2. Sheather, S. (2009). A Modern Approach to Regression with R. Springer Science & Business Media.
  3. Penn State STAT 462. (n.d.). Detecting Multicollinearity Using Variance Inflation Factors. https://online.stat.psu.edu/stat462/node/180/
  4. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning: With Applications in R. Springer.

Appendix A: Lahman’s Baseball Database

Despite significant efforts to compensate for poor data quality, the resulting models are poor predictors of win totals.

Moreover, the poor data quality is inconsistent with the overall state of baseball statistics. When it comes to the major sports, baseball has the most mature statistics available. Therefore finding better data is the best course of action for developing a better predictive model.

We were able to locate a cleaner version of the same data set provided in class. Lahman’s Baseball Database includes the same variables as the sample database, with fewer errors and additional reference data that would allow us to connect it to other sources.

https://www.seanlahman.com/baseball-archive/statistics/

A significant advantage of Lahman’s data set over the data set provided in class is that it includes information about the year and the team. This data is valuable when considering how baseball has changed over the years. The modern era in baseball is often delineated by the turn of the century. However, when looking at the past 120 years of baseball history, it is easy to pinpoint rule changes, evolutions in playing strategy, and league structure that have fundamentally impacted the game.

When comparing statistics across time, it is common to use many of the breakdowns below to add context to the analysis:

  • The Dead Ball Era (1901 - 1920)
  • World War 2 (1941 - 1945)
  • Segregation Era (1901 - 1947ish)
  • Post-War Era/Yankees Era (1945 - late 50s/early 60s)
  • Westward Expansion (1953 - 1961)
  • Dead Ball 2 (The Sixties, roughly)
  • Designated Hitter Era (1973 - current, AL only)
  • Free Agency/Arbitration Era (1975 - current)
  • Steroid Era (unknown, but late 80s - 2005 seems likely)
  • Wild Card Era (1994 - current)

Surveying these periods would suggest that a more granular model has the potential to perform better.

Although we could have chosen any number of the time periods above, exploring the statistical outliers highlights that many of these values correspond to the pre-1969 period. This delineation has some historical support.

As Jayson Stark of ESPN argues in this article (https://www.espn.com/mlb/columns/story?columnist=stark_jayson&id=2471349), in 1969 MLB underwent several rule changes and changes to league structure that impacted win totals and team statistics: 1969 was the first year of divisional play and an expanded postseason, the pitcher’s mound was lowered five inches, the strike zone shrank, five-man rotations took hold, the save was introduced, and further expansion brought unbalanced schedules.

Thus, using 1900 as the beginning of the modern era and 1969 as an additional breakpoint, the dataset can be divided into three segments. The density profiles for the predictor variables approach a normal distribution when grouped by these three segments. To support data exploration, we added an era_cat field to the data set.
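A sketch of the grouping, assuming the Lahman data frame is named lahmanDf; how the boundary years are binned is an assumption here:

```r
library(dplyr)

# Three historical segments: pre-1900, 1900-1969, and 1969 onward
lahmanDf <- lahmanDf %>%
  mutate(era_cat = case_when(
    yearID < 1900 ~ "1900-",
    yearID < 1969 ~ "1900-1969",
    TRUE          ~ "1969+"
  ))
```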

In general, the Lahman data set contains fewer data gaps, and the variables are more consistently distributed. Some missing values remain: the Caught Stealing (CS/TEAM_BASERUN_CS) variable is missing 27.9% of its values, the Batters Hit by Pitch (HBP/TEAM_BATTING_HBP) variable is missing 38.8%, and the Sacrifice Flies (SF) variable is missing 51.6%.

Most of the variables in the data set show some level of skewness, and the following variables have a kurtosis greater than 3: TEAM_BATTING_H, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_H, and TEAM_FIELDING_E.

The Lahman data set also contains several variables with bimodal distributions, including TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_SO.

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 yearID [numeric]
Mean (sd) : 1958.9 (43)
min ≤ med ≤ max:
1871 ≤ 1967 ≤ 2021
Q1 - Q3 : 1922 - 1997
151 distinct values 2985 (100.0%) 0 (0.0%)
2 lgID [character]
1. AA
2. AL
3. FL
4. NL
5. PL
6. UA
85(2.9%)
1295(44.1%)
16(0.5%)
1519(51.8%)
8(0.3%)
12(0.4%)
2935 (98.3%) 50 (1.7%)
3 teamID [character]
1. CHN
2. PHI
3. PIT
4. CIN
5. SLN
6. BOS
7. CHA
8. CLE
9. DET
10. NYA
[ 139 others ]
146(4.9%)
139(4.7%)
135(4.5%)
132(4.4%)
130(4.4%)
121(4.1%)
121(4.1%)
121(4.1%)
121(4.1%)
119(4.0%)
1700(57.0%)
2985 (100.0%) 0 (0.0%)
4 franchID [character]
1. ATL
2. CHC
3. CIN
4. PIT
5. STL
6. PHI
7. SFG
8. LAD
9. BAL
10. BOS
[ 110 others ]
146(4.9%)
146(4.9%)
140(4.7%)
140(4.7%)
140(4.7%)
139(4.7%)
139(4.7%)
138(4.6%)
121(4.1%)
121(4.1%)
1615(54.1%)
2985 (100.0%) 0 (0.0%)
5 divID [character]
1. (Empty string)
2. C
3. E
4. W
1517(50.8%)
295(9.9%)
598(20.0%)
575(19.3%)
2985 (100.0%) 0 (0.0%)
6 Rank [numeric]
Mean (sd) : 4 (2.3)
min ≤ med ≤ max:
1 ≤ 4 ≤ 13
Q1 - Q3 : 2 - 6
13 distinct values 2985 (100.0%) 0 (0.0%)
7 G [numeric]
Mean (sd) : 150 (24.4)
min ≤ med ≤ max:
6 ≤ 159 ≤ 165
Q1 - Q3 : 154 - 162
123 distinct values 2985 (100.0%) 0 (0.0%)
8 W [numeric]
Mean (sd) : 74.6 (18)
min ≤ med ≤ max:
0 ≤ 77 ≤ 116
Q1 - Q3 : 66 - 87
113 distinct values 2985 (100.0%) 0 (0.0%)
9 L [numeric]
Mean (sd) : 74.6 (17.8)
min ≤ med ≤ max:
4 ≤ 76 ≤ 134
Q1 - Q3 : 65 - 87
114 distinct values 2985 (100.0%) 0 (0.0%)
10 R [numeric]
Mean (sd) : 681 (139.5)
min ≤ med ≤ max:
24 ≤ 691 ≤ 1220
Q1 - Q3 : 614 - 764
640 distinct values 2985 (100.0%) 0 (0.0%)
11 AB [numeric]
Mean (sd) : 5129 (798.2)
min ≤ med ≤ max:
211 ≤ 5402 ≤ 5781
Q1 - Q3 : 5135 - 5519
1137 distinct values 2985 (100.0%) 0 (0.0%)
12 H [numeric]
Mean (sd) : 1339.4 (230.9)
min ≤ med ≤ max:
33 ≤ 1390 ≤ 1783
Q1 - Q3 : 1299 - 1465
758 distinct values 2985 (100.0%) 0 (0.0%)
13 X2B [numeric]
Mean (sd) : 228.7 (59.8)
min ≤ med ≤ max:
1 ≤ 234 ≤ 376
Q1 - Q3 : 194 - 272
317 distinct values 2985 (100.0%) 0 (0.0%)
14 X3B [numeric]
Mean (sd) : 45.7 (22.5)
min ≤ med ≤ max:
0 ≤ 40 ≤ 150
Q1 - Q3 : 29 - 59
126 distinct values 2985 (100.0%) 0 (0.0%)
15 HR [numeric]
Mean (sd) : 105.9 (64)
min ≤ med ≤ max:
0 ≤ 110 ≤ 307
Q1 - Q3 : 45 - 155
260 distinct values 2985 (100.0%) 0 (0.0%)
16 BB [numeric]
Mean (sd) : 473.6 (132.3)
min ≤ med ≤ max:
1 ≤ 494 ≤ 835
Q1 - Q3 : 425.5 - 554.5
586 distinct values 2984 (100.0%) 1 (0.0%)
17 SO [numeric]
Mean (sd) : 762.1 (319.3)
min ≤ med ≤ max:
3 ≤ 761 ≤ 1596
Q1 - Q3 : 516 - 990
1117 distinct values 2969 (99.5%) 16 (0.5%)
18 SB [numeric]
Mean (sd) : 109.4 (69.7)
min ≤ med ≤ max:
1 ≤ 93 ≤ 581
Q1 - Q3 : 62 - 137
324 distinct values 2859 (95.8%) 126 (4.2%)
19 CS [numeric]
Mean (sd) : 46.5 (21.9)
min ≤ med ≤ max:
3 ≤ 44 ≤ 191
Q1 - Q3 : 33 - 56
137 distinct values 2153 (72.1%) 832 (27.9%)
20 HBP [numeric]
Mean (sd) : 45.8 (18.1)
min ≤ med ≤ max:
7 ≤ 43 ≤ 160
Q1 - Q3 : 32 - 57
101 distinct values 1827 (61.2%) 1158 (38.8%)
21 SF [numeric]
Mean (sd) : 44.1 (10.2)
min ≤ med ≤ max:
7 ≤ 44 ≤ 77
Q1 - Q3 : 38 - 50
66 distinct values 1444 (48.4%) 1541 (51.6%)
22 RA [numeric]
Mean (sd) : 681 (139.2)
min ≤ med ≤ max:
34 ≤ 689 ≤ 1252
Q1 - Q3 : 610 - 766
623 distinct values 2985 (100.0%) 0 (0.0%)
23 ER [numeric]
Mean (sd) : 573.4 (149.9)
min ≤ med ≤ max:
23 ≤ 594 ≤ 1023
Q1 - Q3 : 503 - 671
656 distinct values 2985 (100.0%) 0 (0.0%)
24 ERA [numeric]
Mean (sd) : 3.8 (0.8)
min ≤ med ≤ max:
1 ≤ 4 ≤ 8
Q1 - Q3 : 3 - 4
1:1(0.0%)
2:146(4.9%)
3:777(26.0%)
4:1511(50.6%)
5:502(16.8%)
6:46(1.5%)
7:1(0.0%)
8:1(0.0%)
2985 (100.0%) 0 (0.0%)
25 CG [numeric]
Mean (sd) : 47.5 (39.3)
min ≤ med ≤ max:
0 ≤ 41 ≤ 148
Q1 - Q3 : 9 - 76
147 distinct values 2985 (100.0%) 0 (0.0%)
26 SHO [numeric]
Mean (sd) : 9.6 (5.1)
min ≤ med ≤ max:
0 ≤ 9 ≤ 32
Q1 - Q3 : 6 - 12
32 distinct values 2985 (100.0%) 0 (0.0%)
27 SV [numeric]
Mean (sd) : 24.4 (16.3)
min ≤ med ≤ max:
0 ≤ 25 ≤ 68
Q1 - Q3 : 10 - 39
66 distinct values 2985 (100.0%) 0 (0.0%)
28 IPouts [numeric]
Mean (sd) : 4013.2 (663.3)
min ≤ med ≤ max:
162 ≤ 4252 ≤ 4518
Q1 - Q3 : 4080 - 4341
862 distinct values 2985 (100.0%) 0 (0.0%)
29 HA [numeric]
Mean (sd) : 1339.2 (231)
min ≤ med ≤ max:
49 ≤ 1389 ≤ 1993
Q1 - Q3 : 1287 - 1468
770 distinct values 2985 (100.0%) 0 (0.0%)
30 HRA [numeric]
Mean (sd) : 105.9 (60.9)
min ≤ med ≤ max:
0 ≤ 113 ≤ 305
Q1 - Q3 : 51 - 153
247 distinct values 2985 (100.0%) 0 (0.0%)
31 BBA [numeric]
Mean (sd) : 473.7 (131.7)
min ≤ med ≤ max:
1 ≤ 495 ≤ 827
Q1 - Q3 : 429 - 554
577 distinct values 2985 (100.0%) 0 (0.0%)
32 SOA [numeric]
Mean (sd) : 761.6 (320.5)
min ≤ med ≤ max:
0 ≤ 762 ≤ 1687
Q1 - Q3 : 511 - 997
1148 distinct values 2985 (100.0%) 0 (0.0%)
33 E [numeric]
Mean (sd) : 180.8 (108.4)
min ≤ med ≤ max:
20 ≤ 141 ≤ 639
Q1 - Q3 : 111 - 207
474 distinct values 2985 (100.0%) 0 (0.0%)
34 DP [numeric]
Mean (sd) : 132.6 (35.9)
min ≤ med ≤ max:
0 ≤ 140 ≤ 217
Q1 - Q3 : 116 - 157
199 distinct values 2985 (100.0%) 0 (0.0%)
35 FP [numeric] 1 distinct value
1:2985(100.0%)
2985 (100.0%) 0 (0.0%)
36 name [character]
1. Cincinnati Reds
2. Pittsburgh Pirates
3. Philadelphia Phillies
4. St. Louis Cardinals
5. Chicago White Sox
6. Detroit Tigers
7. Chicago Cubs
8. Boston Red Sox
9. New York Yankees
10. Cleveland Indians
[ 129 others ]
131(4.4%)
131(4.4%)
130(4.4%)
122(4.1%)
121(4.1%)
121(4.1%)
119(4.0%)
114(3.8%)
109(3.7%)
107(3.6%)
1780(59.6%)
2985 (100.0%) 0 (0.0%)
37 gFactor [numeric]
Mean (sd) : 1.2 (1)
min ≤ med ≤ max:
1 ≤ 1 ≤ 27
Q1 - Q3 : 1 - 1
14 distinct values 2985 (100.0%) 0 (0.0%)
38 TARGET_WINS [numeric]
Mean (sd) : 80.7 (15.4)
min ≤ med ≤ max:
0 ≤ 82 ≤ 146
Q1 - Q3 : 71 - 91
111 distinct values 2985 (100.0%) 0 (0.0%)
39 TEAM_BATTING_H [numeric]
Mean (sd) : 1459.4 (140.4)
min ≤ med ≤ max:
819 ≤ 1445 ≤ 2562
Q1 - Q3 : 1374 - 1526
604 distinct values 2985 (100.0%) 0 (0.0%)
40 TEAM_BATTING_2B [numeric]
Mean (sd) : 246.9 (47)
min ≤ med ≤ max:
27 ≤ 247 ≤ 458
Q1 - Q3 : 214 - 280
248 distinct values 2985 (100.0%) 0 (0.0%)
41 TEAM_BATTING_3B [numeric]
Mean (sd) : 51.2 (27.7)
min ≤ med ≤ max:
0 ≤ 42 ≤ 223
Q1 - Q3 : 31 - 66
149 distinct values 2985 (100.0%) 0 (0.0%)
42 TEAM_BATTING_HR [numeric]
Mean (sd) : 110.7 (63.7)
min ≤ med ≤ max:
0 ≤ 115 ≤ 319
Q1 - Q3 : 52 - 159
263 distinct values 2985 (100.0%) 0 (0.0%)
43 TEAM_BATTING_BB [numeric]
Mean (sd) : 503.7 (115.4)
min ≤ med ≤ max:
12 ≤ 512 ≤ 878
Q1 - Q3 : 452 - 573
563 distinct values 2984 (100.0%) 1 (0.0%)
44 TEAM_BATTING_SO [numeric]
Mean (sd) : 808.7 (296.7)
min ≤ med ≤ max:
44 ≤ 812 ≤ 1642
Q1 - Q3 : 578 - 1019
1077 distinct values 2969 (99.5%) 16 (0.5%)
45 TEAM_BASERUN_SB [numeric]
Mean (sd) : 119.5 (83)
min ≤ med ≤ max:
4 ≤ 97 ≤ 697
Q1 - Q3 : 66 - 145
362 distinct values 2859 (95.8%) 126 (4.2%)
46 TEAM_BASERUN_CS [numeric]
Mean (sd) : 49 (22.6)
min ≤ med ≤ max:
8 ≤ 45 ≤ 201
Q1 - Q3 : 34 - 58
136 distinct values 2153 (72.1%) 832 (27.9%)
47 TEAM_BATTING_HBP [numeric]
Mean (sd) : 49 (19.9)
min ≤ med ≤ max:
9 ≤ 47 ≤ 174
Q1 - Q3 : 34 - 61
109 distinct values 1827 (61.2%) 1158 (38.8%)
48 TEAM_PITCHING_H [numeric]
Mean (sd) : 1463 (156.7)
min ≤ med ≤ max:
662 ≤ 1447 ≤ 3888
Q1 - Q3 : 1367 - 1533
625 distinct values 2985 (100.0%) 0 (0.0%)
49 TEAM_PITCHING_HR [numeric]
Mean (sd) : 110.8 (60.2)
min ≤ med ≤ max:
0 ≤ 118 ≤ 305
Q1 - Q3 : 56 - 158
250 distinct values 2985 (100.0%) 0 (0.0%)
50 TEAM_PITCHING_BB [numeric]
Mean (sd) : 504.2 (114.8)
min ≤ med ≤ max:
22 ≤ 515 ≤ 881
Q1 - Q3 : 458 - 573
545 distinct values 2985 (100.0%) 0 (0.0%)
51 TEAM_PITCHING_SO [numeric]
Mean (sd) : 807.7 (299.7)
min ≤ med ≤ max:
0 ≤ 806 ≤ 1876
Q1 - Q3 : 575 - 1016
1103 distinct values 2985 (100.0%) 0 (0.0%)
52 TEAM_FIELDING_E [numeric]
Mean (sd) : 225.5 (222.1)
min ≤ med ≤ max:
54 ≤ 145 ≤ 1998
Q1 - Q3 : 115 - 226
596 distinct values 2985 (100.0%) 0 (0.0%)
53 TEAM_FIELDING_DP [numeric]
Mean (sd) : 141.9 (27.6)
min ≤ med ≤ max:
0 ≤ 145 ≤ 228
Q1 - Q3 : 126 - 160
163 distinct values 2985 (100.0%) 0 (0.0%)
54 pythPercent [numeric]
Mean (sd) : 0.5 (0.1)
min ≤ med ≤ max:
0 ≤ 0.5 ≤ 0.9
Q1 - Q3 : 0.4 - 0.6
2922 distinct values 2985 (100.0%) 0 (0.0%)
55 era_cat [character]
1. 1900-
2. 1900-1969
3. 1969+
375(12.6%)
1142(38.3%)
1468(49.2%)
2985 (100.0%) 0 (0.0%)


When we group the statistics by the three categories presented earlier, we see a much cleaner density plot across all variables. There are few signs of bimodal distributions, and the skewness of individual variables is greatly reduced.

The training data set exhibits the following characteristics.

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 yearID [numeric]
Mean (sd) : 1958.6 (42.7)
min ≤ med ≤ max:
1871 ≤ 1967 ≤ 2021
Q1 - Q3 : 1922 - 1996
151 distinct values 2389 (100.0%) 0 (0.0%)
2 era_cat [character]
1. 1900-
2. 1900-1969
3. 1969+
295(12.3%)
925(38.7%)
1169(48.9%)
2389 (100.0%) 0 (0.0%)
3 TARGET_WINS [numeric]
Mean (sd) : 80.7 (15.5)
min ≤ med ≤ max:
0 ≤ 82 ≤ 146
Q1 - Q3 : 71 - 91
106 distinct values 2389 (100.0%) 0 (0.0%)
4 TEAM_BATTING_H [numeric]
Mean (sd) : 1458.5 (140.2)
min ≤ med ≤ max:
819 ≤ 1446 ≤ 2500
Q1 - Q3 : 1372 - 1525
572 distinct values 2389 (100.0%) 0 (0.0%)
5 TEAM_BATTING_2B [numeric]
Mean (sd) : 246.8 (47)
min ≤ med ≤ max:
27 ≤ 247 ≤ 458
Q1 - Q3 : 214 - 280
236 distinct values 2389 (100.0%) 0 (0.0%)
6 TEAM_BATTING_3B [numeric]
Mean (sd) : 51.3 (27.5)
min ≤ med ≤ max:
0 ≤ 42 ≤ 223
Q1 - Q3 : 31 - 67
141 distinct values 2389 (100.0%) 0 (0.0%)
7 TEAM_BATTING_HR [numeric]
Mean (sd) : 109.9 (62.7)
min ≤ med ≤ max:
0 ≤ 115 ≤ 319
Q1 - Q3 : 53 - 158
256 distinct values 2389 (100.0%) 0 (0.0%)
8 TEAM_BATTING_BB [numeric]
Mean (sd) : 505.8 (113.8)
min ≤ med ≤ max:
12 ≤ 513 ≤ 878
Q1 - Q3 : 454 - 576
517 distinct values 2388 (100.0%) 1 (0.0%)
9 TEAM_BATTING_SO [numeric]
Mean (sd) : 804.7 (293.8)
min ≤ med ≤ max:
44 ≤ 805 ≤ 1642
Q1 - Q3 : 575 - 1013
987 distinct values 2377 (99.5%) 12 (0.5%)
10 TEAM_BASERUN_SB [numeric]
Mean (sd) : 119.6 (83.9)
min ≤ med ≤ max:
4 ≤ 97 ≤ 697
Q1 - Q3 : 66 - 145
341 distinct values 2297 (96.1%) 92 (3.9%)
11 TEAM_BASERUN_CS [numeric]
Mean (sd) : 49 (22.7)
min ≤ med ≤ max:
8 ≤ 45 ≤ 201
Q1 - Q3 : 34 - 58
126 distinct values 1715 (71.8%) 674 (28.2%)
12 TEAM_BATTING_HBP [numeric]
Mean (sd) : 49 (20.3)
min ≤ med ≤ max:
9 ≤ 47 ≤ 174
Q1 - Q3 : 34 - 61
107 distinct values 1463 (61.2%) 926 (38.8%)
13 TEAM_PITCHING_H [numeric]
Mean (sd) : 1462.5 (157.4)
min ≤ med ≤ max:
662 ≤ 1448 ≤ 3888
Q1 - Q3 : 1368 - 1533
587 distinct values 2389 (100.0%) 0 (0.0%)
14 TEAM_PITCHING_HR [numeric]
Mean (sd) : 110.3 (59.6)
min ≤ med ≤ max:
0 ≤ 117 ≤ 305
Q1 - Q3 : 57 - 158
248 distinct values 2389 (100.0%) 0 (0.0%)
15 TEAM_PITCHING_BB [numeric]
Mean (sd) : 506.2 (113.5)
min ≤ med ≤ max:
26 ≤ 516 ≤ 881
Q1 - Q3 : 460 - 574
505 distinct values 2389 (100.0%) 0 (0.0%)
16 TEAM_PITCHING_SO [numeric]
Mean (sd) : 804.2 (295.7)
min ≤ med ≤ max:
0 ≤ 800 ≤ 1876
Q1 - Q3 : 576 - 1010
1024 distinct values 2389 (100.0%) 0 (0.0%)
17 TEAM_FIELDING_E [numeric]
Mean (sd) : 224.2 (220.5)
min ≤ med ≤ max:
54 ≤ 145 ≤ 1998
Q1 - Q3 : 116 - 227
535 distinct values 2389 (100.0%) 0 (0.0%)
18 TEAM_FIELDING_DP [numeric]
Mean (sd) : 142 (27.5)
min ≤ med ≤ max:
0 ≤ 145 ≤ 225
Q1 - Q3 : 126 - 160
159 distinct values 2389 (100.0%) 0 (0.0%)


DATA PREPARATION

Segmenting the data set with the era_cat variable creates three groups within which the variables approach a normal distribution, so we will focus our data preparation on missing values. The documentation for the MICE package recommends observing a 5% missingness threshold for safely imputing values; based on this rule of thumb, we will drop the TEAM_BASERUN_CS and TEAM_BATTING_HBP variables.

The remaining variables with missing data (TEAM_BATTING_BB, TEAM_BATTING_SO, and TEAM_BASERUN_SB) will be imputed using the MICE package. Several imputation methods are available, but for simplicity we used the package defaults with m = 5 imputations.

Build Model

The first model includes all the variables from the original data set with the exception of yearID. It could be argued that the year might have an impact on the numbers; however, we account for the year with the era_cat variable. The initial model has an adjusted \(R^2\) of 0.8078, and the model selected by stepAIC drops only TEAM_BATTING_3B while keeping essentially the same adjusted \(R^2\).

TARGET_WINS (full model)

| Predictors | Estimates | std. Error | Statistic | p |
|---|---|---|---|---|
| (Intercept) | 81.22 | 3.48 | 23.31 | <0.0001 |
| era_cat [1900-1969] | 1.15 | 0.88 | 1.31 | 0.1901 |
| era_cat [1969+] | 0.22 | 0.99 | 0.22 | 0.8221 |
| TEAM_BATTING_H | 0.05 | 0.00 | 26.72 | <0.0001 |
| TEAM_BATTING_2B | 0.01 | 0.00 | 2.99 | 0.0028 |
| TEAM_BATTING_3B | 0.01 | 0.01 | 1.39 | 0.1643 |
| TEAM_BATTING_HR | 0.11 | 0.01 | 19.56 | <0.0001 |
| TEAM_BATTING_BB | 0.04 | 0.00 | 24.09 | <0.0001 |
| TEAM_BATTING_SO | -0.01 | 0.00 | -8.80 | <0.0001 |
| TEAM_BASERUN_SB | 0.03 | 0.00 | 10.25 | <0.0001 |
| TEAM_PITCHING_H | -0.06 | 0.00 | -38.59 | <0.0001 |
| TEAM_PITCHING_HR | -0.09 | 0.01 | -14.38 | <0.0001 |
| TEAM_PITCHING_BB | -0.06 | 0.00 | -31.95 | <0.0001 |
| TEAM_PITCHING_SO | 0.01 | 0.00 | 6.70 | <0.0001 |
| TEAM_FIELDING_E | -0.01 | 0.00 | -3.40 | 0.0007 |
| TEAM_FIELDING_DP | 0.05 | 0.01 | 6.76 | <0.0001 |

Observations: 2389
R² / R² adjusted: 0.812 / 0.811
AIC: 15918.773


TARGET_WINS (stepAIC selection)

| Predictors | Estimates | std. Error | Statistic | p |
|---|---|---|---|---|
| (Intercept) | 80.91 | 3.48 | 23.27 | <0.0001 |
| era_cat [1900-1969] | 1.06 | 0.87 | 1.21 | 0.2247 |
| era_cat [1969+] | -0.10 | 0.96 | -0.10 | 0.9214 |
| TEAM_BATTING_H | 0.05 | 0.00 | 29.24 | <0.0001 |
| TEAM_BATTING_2B | 0.01 | 0.00 | 2.95 | 0.0032 |
| TEAM_BATTING_HR | 0.11 | 0.01 | 19.57 | <0.0001 |
| TEAM_BATTING_BB | 0.04 | 0.00 | 24.08 | <0.0001 |
| TEAM_BATTING_SO | -0.01 | 0.00 | -8.70 | <0.0001 |
| TEAM_BASERUN_SB | 0.03 | 0.00 | 11.11 | <0.0001 |
| TEAM_PITCHING_H | -0.06 | 0.00 | -38.63 | <0.0001 |
| TEAM_PITCHING_HR | -0.09 | 0.01 | -14.67 | <0.0001 |
| TEAM_PITCHING_BB | -0.06 | 0.00 | -31.93 | <0.0001 |
| TEAM_PITCHING_SO | 0.01 | 0.00 | 6.58 | <0.0001 |
| TEAM_FIELDING_E | -0.01 | 0.00 | -3.53 | 0.0004 |
| TEAM_FIELDING_DP | 0.05 | 0.01 | 6.68 | <0.0001 |

Observations: 2389
R² / R² adjusted: 0.812 / 0.811
AIC: 15918.721


The TEAM_BATTING_3B variable was dropped due to its high p-value, giving us a final model that is identical to the stepAIC selection shown above, with an adjusted \(R^2\) of 0.8078.



There are a number of issues with the model diagnostics. The linearity and Q-Q plots show deviation from the straight line at the boundaries. In addition, the reference line in the homogeneity-of-variance plot is not flat, indicating that the residual variance is not consistent across the fitted values. Further exploration could determine whether narrower date ranges or prediction ranges would yield a model that better aligns with the linear regression assumptions.

PREDICT

The next step is to use the selected model to make predictions for the testing data set. The yardstick package was used to calculate model performance, including \(R^2\). With an \(R^2=0.8318\), the performance on the testing data set is consistent with the model summary. In the residual vs. TARGET_WINS graph, there appears to be a win range between 60 and 110 that generates more accurate forecasts. However, the residuals drift downwards: TARGET_WINS below the mean tend to be overestimated, and TARGET_WINS above the mean tend to be underestimated.
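A sketch of the metric calculation, with testDf and the fitted model object (here lahman_lm) as assumed names:

```r
library(dplyr)
library(yardstick)

# Predicted wins for the held-out test split
results <- testDf %>%
  mutate(.pred = predict(lahman_lm, newdata = testDf))

# Same metric set reported below
win_metrics <- metric_set(mape, smape, mase, mpe, rmse, rsq)
win_metrics(results, truth = TARGET_WINS, estimate = .pred)
```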

| .metric | .estimator | .estimate |
|---|---|---|
| mape | standard | 7.0714 |
| smape | standard | 6.9789 |
| mase | standard | 0.3186 |
| mpe | standard | -0.4950 |
| rmse | standard | 6.6625 |
| rsq | standard | 0.7980 |


Appendix B: Pythagorean Model

Bill James developed the Pythagorean winning percentage, which estimates the number of games a team should win based on the runs it scores and the runs it allows. Since this model incorporates total runs scored versus total runs allowed, we expect it to be a good predictor of a team’s wins in a given season.

\(\text{Win Percentage} = \dfrac{(\text{Runs Scored})^2}{(\text{Runs Scored})^2 + (\text{Runs Allowed})^2}\)
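A sketch of the corresponding calculation and fit, assuming the Lahman training frame (lahmanDf) with runs scored (R) and runs allowed (RA):

```r
library(dplyr)

# Pythagorean expectation, then a simple regression of actual wins on it
lahmanDf <- lahmanDf %>%
  mutate(pythPercent = R^2 / (R^2 + RA^2))

pyth_lm <- lm(TARGET_WINS ~ pythPercent, data = lahmanDf)
summary(pyth_lm)
```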

TARGET_WINS

| Predictors | Estimates | std. Error | Statistic | p |
|---|---|---|---|---|
| (Intercept) | 4.80 | 0.50 | 9.62 | <0.0001 |
| pythPercent | 152.34 | 0.98 | 154.88 | <0.0001 |

Observations: 2389
R² / R² adjusted: 0.909 / 0.909
AIC: 14085.472


Model Results

As we expected, the Pythagorean model performs well (adjusted \(R^2\) = 0.909). However, this linear model differs slightly from the formal definition since it includes an intercept of 4.80 and a coefficient on the Pythagorean percentage of 152.34; the formal definition would have an intercept of 0 and a coefficient of 162.

Check Model Assumptions

The diagnostic plots show residuals that are normally distributed and linearly related to the fitted values. The homogeneity-of-variance plot is curved, however, indicating that the residual variance is not constant.

The next step is to use the Pythagorean Model to make predictions for the testing data set. The yardstick package was used to calculate model performance, including \(R^2\). With an \(R^2=0.9131\), the performance on the testing data set is consistent with the model summary. In the residual vs. TARGET_WINS graph, there appears to be a win range between 50 and 130 that generates more accurate forecasts.

| .metric | .estimator | .estimate |
|---|---|---|
| mape | standard | 4.9670 |
| smape | standard | 4.6989 |
| mase | standard | 0.2031 |
| mpe | standard | -1.4262 |
| rmse | standard | 4.4704 |
| rsq | standard | 0.9187 |

Unsurprisingly, the Pythagorean Model performs well when it comes to win projections. There is a clear relationship between a team’s total yearly run differential and the number of games it wins. It is a simple but effective model.

Appendix C: R code

## ----setup, include=FALSE-------------------------------------------------------------------------------------------------------
library(tidyverse)
library(reshape2)
library(kableExtra)
library(Matrix)
library(MASS)
library(mice)
library(Hmisc)
library(corrplot)
library(performance)
library(naniar)
library(psych)
library(GGally)
library(campfin)
library(caret)
library(yardstick)
library(summarytools)
library(sjPlot)
library(car)
library(rsample)
library(olsrr) 

knitr::opts_chunk$set(echo = F, 
                      warning = F, 
                      message = F, 
                      eval = T , 
                      results="asis", 
                      fig.height=6, 
                      fig.width=8)
set.seed(1234)


## ----functions------------------------------------------------------------------------------------------------------------------

st_css()

 st_options(
   plain.ascii = FALSE,
   style = 'grid',
   dfSummary.style ='grid',
   freq.silent  = TRUE,
   headings     = FALSE,
   tmp.img.dir  = "./tmp",
   dfSummary.custom.1 =
     expression(
       paste(
         "Q1 - Q3 :",
         round(
           quantile(column_data, probs = .25, type = 2,
                    names = FALSE, na.rm = TRUE), digits = 1
         ), " - ",
         round(
           quantile(column_data, probs = .75, type = 2,
                    names = FALSE, na.rm = TRUE), digits = 1
         )
       )
     )
 )

source('./hw1/Functions.R', local = knitr::knit_global())
 


## ----load_data------------------------------------------------------------------------------------------------------------------
trainDf <- read.csv("https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw1/moneyball-training-data.csv")
evalDf <- read.csv("https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw1/moneyball-evaluation-data.csv")


## ----df_summary-----------------------------------------------------------------------------------------------------------------
print(
  dfSummary(trainDf, 
            varnumbers   = TRUE,
            na.col       = TRUE,
            graph.magnif = .8,
            tmp.img.dir  = "/tmp"),
  method = "render"
)



## ----density_plot---------------------------------------------------------------------------------------------------------------
m_df <- trainDf %>% dplyr::select(-c(INDEX)) %>% melt() 

m_df %>% ggplot(aes(x= value)) + 
geom_density(color='#023020', fill='gray') + facet_wrap(~variable, scales = 'free',  ncol = 4) + theme_bw()


## ----boxplot--------------------------------------------------------------------------------------------------------------------

m_df %>% ggplot(aes(x = value)) +
  geom_boxplot(outlier.color = 'red', outlier.shape = 1) +
  facet_wrap(vars(variable),scales = "free", ncol = 4)



## ----feature_plot---------------------------------------------------------------------------------------------------------------
# Scatter plots of each predictor against TARGET_WINS (column 2); INDEX (column 1) is excluded
featurePlot(trainDf[, 3:ncol(trainDf)], trainDf[, 2], plot = "scatter", type = c("p", "smooth"), span = 1)


## ----correlation_matrix---------------------------------------------------------------------------------------------------------
# Correlation matrix of the numeric columns, excluding INDEX and TEAM_BATTING_HBP
rcore <- rcorr(as.matrix(trainDf %>% dplyr::select(where(is.numeric) & !c(INDEX, TEAM_BATTING_HBP))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .7 , method = 'circle')


## ----corr_numbers---------------------------------------------------------------------------------------------------------------
tst <- trainDf
tst <- tst[,-1 ]
kable(cor(drop_na(tst))[,1], "html", escape = F, col.names = c('Coefficient')) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)


## ----missing_data,fig.height=4--------------------------------------------------------------------------------------------------
gg_miss_var(trainDf, show_pct=TRUE)


## ----dropped_df-----------------------------------------------------------------------------------------------------------------
droppedDf <- trainDf %>% dplyr::select(-c(INDEX, TEAM_BATTING_HBP))


## ----missing_flags--------------------------------------------------------------------------------------------------------------
columnNames <- names(droppedDf)
droppedDf <- flag_na(droppedDf, columnNames)
columnNames <- names(evalDf)
evalDf <- flag_na(evalDf, columnNames)


## ----density_plots_post,fig.height=4--------------------------------------------------------------------------------------------
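# Impute missing values with MICE (m = 5, maxit = 50) and compare imputed vs. observed densities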
imputeDf <- mice(droppedDf, m = 5, maxit = 50, seed = 123, printFlag = F)
cleanDf <- complete(imputeDf)
m_imputed <- melt(cleanDf)
densityplot(imputeDf)


## ----drop_na_flag---------------------------------------------------------------------------------------------------------------
cleanDf <- cleanDf %>% dplyr::select(-na_flag)
droppedDf <- droppedDf %>% dplyr::select(-na_flag)


## ----drop_outliers--------------------------------------------------------------------------------------------------------------

# List of columns included in the outlier filtering
columnNames <- c('TEAM_BATTING_H','TEAM_BATTING_2B','TEAM_BATTING_3B','TEAM_BATTING_HR',
                 'TEAM_BATTING_BB','TEAM_BATTING_SO','TEAM_BASERUN_SB','TEAM_BASERUN_CS',
                 'TEAM_PITCHING_H','TEAM_PITCHING_HR','TEAM_PITCHING_BB','TEAM_PITCHING_SO',
                 'TEAM_FIELDING_E','TEAM_FIELDING_DP','TEAM_BATTING_1B')

# Filter outliers from the cleaned data with the custom filterOutliers() helper, using post-1900 Lahman data as the reference set
filterDf <- loadLahmanShortData() %>% filter(yearID >= 1900)
droppedDf <- filterOutliers(filterDf, cleanDf, columnNames)

# Print summary statistics for the outlier-filtered data
print(
  dfSummary(droppedDf, 
            varnumbers   = TRUE,
            na.col       = TRUE,
            graph.magnif = .8,
            tmp.img.dir  = "/tmp"),
  method = "render"
)

m_df <- droppedDf %>% melt() 

m_df %>% ggplot(aes(x= value)) + 
    geom_density(color='#023020', fill='gray') + facet_wrap(~variable, scales = 'free',  ncol = 4) + theme_bw()

m_df %>% ggplot(aes(x = value)) +
  geom_boxplot(outlier.color = 'red', outlier.shape = 1) +
  facet_wrap(vars(variable),scales = "free", ncol = 4)



## ----single_base_hits-----------------------------------------------------------------------------------------------------------
droppedDf <- droppedDf %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)


## ----transform_nonnormal--------------------------------------------------------------------------------------------------------
# Reminder of our distributions
m_imputed %>% ggplot(aes(x= value)) + 
    geom_density(fill='gray') + facet_wrap(~variable, scales = 'free', ncol=4) +
    theme(strip.text.x = element_text(size = 6, angle = 0)) +
    labs(title='Training w/Imputed Missing Values')

# Add square-root transforms of some non-normal variables (new columns; originals retained).
transformed_df <- cleanDf |> mutate(
                   batting_hr_sqrt = sqrt(TEAM_BATTING_HR),
                   batting_so_sqrt = sqrt(TEAM_BATTING_SO),
                   baserun_cs_sqrt = sqrt(TEAM_BASERUN_CS),
                   pitching_hr_sqrt = sqrt(TEAM_PITCHING_HR))

visualize_transformed <- transformed_df |> melt()

# Plot.
visualize_transformed %>% ggplot(aes(x= value)) + 
    geom_density(fill='gray') + facet_wrap(~variable, scales = 'free', ncol=4) +
    theme(strip.text.x = element_text(size = 6, angle = 0)) +
    labs(title='Training w/Imputed Missing Values + SQRT Transform')


## ----df_summary_normalized------------------------------------------------------------------------------------------------------
print(
  dfSummary(transformed_df, 
            varnumbers   = TRUE,
            na.col       = TRUE,
            graph.magnif = .8,
            tmp.img.dir  = "/tmp"),
  method = "render")



## ----lm1------------------------------------------------------------------------------------------------------------------------
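# Model 1: baseline regression of TARGET_WINS on all remaining predictors in the raw training data (INDEX and HBP dropped)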
trainDf_1 <- trainDf %>% dplyr::select(-c(INDEX, TEAM_BATTING_HBP))
lm1 <- lm(TARGET_WINS ~ . , data = trainDf_1)
lm1Sum <- summary(lm1)
tab_model(lm1, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----check_lm1------------------------------------------------------------------------------------------------------------------
check_model(lm1, check=c('ncv','qq','homogeneity','outliers'))


## ----lm2------------------------------------------------------------------------------------------------------------------------
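# Model 2: hand-picked subset of predictors, fit on the imputed data (cleanDf)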
lm2 <- lm(formula = TARGET_WINS ~ TEAM_PITCHING_BB  + TEAM_BATTING_2B + TEAM_BATTING_3B +
   TEAM_FIELDING_E + TEAM_PITCHING_H + TEAM_BATTING_HR + TEAM_BATTING_H, data = cleanDf)
lm2Sum <- summary(lm2)
tab_model(lm2, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----check_lm2------------------------------------------------------------------------------------------------------------------
check_model(lm2, check=c('ncv','qq','homogeneity','outliers'))


## ----lm3------------------------------------------------------------------------------------------------------------------------

# Model 3: all predictors except total hits (TEAM_BATTING_H is the sum of 1B/2B/3B/HR), fit on the outlier-filtered data
lm3 <- lm(TARGET_WINS ~ . -TEAM_BATTING_H, data=droppedDf)
lm3step <- stepAIC(lm3, trace = FALSE)
lm3sum <- summary(lm3step)
tab_model(lm3step, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)



## ----check_lm3------------------------------------------------------------------------------------------------------------------
check_model(lm3step, check=c('ncv','qq','homogeneity','outliers'))


## ----bptest_lm3-----------------------------------------------------------------------------------------------------------------
lmtest::bptest(lm3step)


## ----vif_lm3--------------------------------------------------------------------------------------------------------------------
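# Variance inflation factors for the stepwise model (dashed line marks a common VIF threshold of 5), followed by pairwise correlations among the same variables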

vif_values <- vif(lm3step)
vif_values <- rownames_to_column(as.data.frame(vif_values), var = "var")

vif_values %>%
  ggplot(aes(y=vif_values, x=var)) +
  coord_flip() + 
  geom_hline(yintercept=5, linetype="dashed", color = "red") +
  geom_bar(stat = 'identity', width=0.3 ,position=position_dodge()) 

df <- droppedDf[ , vif_values$var] %>% drop_na()
coeff <- cor(df)
corrplot(coeff, tl.cex = .7 , diag = FALSE ,type = 'upper' ,method = 'number')



## ----lm3_final------------------------------------------------------------------------------------------------------------------
lm3step.final <- update(lm3step, . ~ . -TEAM_PITCHING_BB -TEAM_BATTING_BB -TEAM_PITCHING_SO -TEAM_BATTING_SO -TEAM_PITCHING_H -TEAM_BATTING_2B)
lm3finalsum <- summary(lm3step.final)
tab_model(lm3step.final, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----check_lm3_final------------------------------------------------------------------------------------------------------------
check_model(lm3step.final, check=c('ncv','qq','homogeneity','outliers'))


## ----vif_lm3_final--------------------------------------------------------------------------------------------------------------

vif_values <- vif(lm3step.final)
vif_values <- rownames_to_column(as.data.frame(vif_values), var = "var")

vif_values %>%
  ggplot(aes(y=vif_values, x=var)) +
  coord_flip() + 
  geom_hline(yintercept=5, linetype="dashed", color = "red") +
  geom_bar(stat = 'identity', width=0.3 ,position=position_dodge()) 

df <- droppedDf[ , vif_values$var] %>% drop_na()
coeff <- cor(df)
corrplot(coeff, tl.cex = .7 , diag = FALSE ,type = 'upper' ,method = 'number')



## ----bptest_lm3_final-----------------------------------------------------------------------------------------------------------
lmtest::bptest(lm3step.final)


## ----tab_lm3_final--------------------------------------------------------------------------------------------------------------
tab_model(lm3, lm3step, lm3step.final, 
          dv.labels = c('model 3','model 3 (StepAIC)','model 3 (Final)'),
          show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----lm4------------------------------------------------------------------------------------------------------------------------
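# Model 4: fit on the sqrt-transformed data with several variables excluded up front, followed by stepwise AIC selection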
lm4 <- lm(TARGET_WINS~. -TEAM_PITCHING_HR -TEAM_BATTING_SO -TEAM_BASERUN_CS -TEAM_BATTING_HR -TEAM_PITCHING_BB -TEAM_PITCHING_H -pitching_hr_sqrt, data=transformed_df)
lm4step <- stepAIC(lm4, trace=FALSE)
lm4sum <- summary(lm4step)
tab_model(lm4step, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----check_lm4------------------------------------------------------------------------------------------------------------------
check_model(lm4step, check=c('ncv','qq','homogeneity','outliers'))


## ----compare_models-------------------------------------------------------------------------------------------------------------
plot(compare_performance(lm1,lm2,lm3step,lm4step, rank=T))


## ----compare_r2-----------------------------------------------------------------------------------------------------------------
r <- c(lm1Sum$r.squared, lm2Sum$r.squared, lm3sum$r.squared, lm4sum$r.squared)
# Note: summary()$sigma is the residual standard error (reported under "mse" in the comparison table)
mse <- c(lm1Sum$sigma, lm2Sum$sigma, lm3sum$sigma, lm4sum$sigma)
adjusted.r <- c(lm1Sum$adj.r.squared, lm2Sum$adj.r.squared, lm3sum$adj.r.squared, lm4sum$adj.r.squared)
modelDf <- data.frame(r, mse, adjusted.r)
kable(modelDf, caption = "Moneyball Dataset", digits = 2, format = "html", row.names = F) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = F,
    position = "center"
  )


## ----predict_setup--------------------------------------------------------------------------------------------------------------
#removed columns
evalDroppedDf <- evalDf %>% dplyr::select(-c(INDEX, TEAM_BATTING_HBP))
evalDroppedDf <- evalDroppedDf %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)

#no NA's
evalcleanDf <- mice(evalDroppedDf, m = 5, maxit = 50, seed = 123, printFlag = F)
evalcleanDf <- complete(evalcleanDf)

#with transformations
transformed_eval <- evalcleanDf |> mutate(
                   batting_hr_sqrt = sqrt(TEAM_BATTING_HR),
                   batting_so_sqrt = sqrt(TEAM_BATTING_SO),
                   baserun_cs_sqrt = sqrt(TEAM_BASERUN_CS),
                   pitching_hr_sqrt = sqrt(TEAM_PITCHING_HR))



## ----predict--------------------------------------------------------------------------------------------------------------------
lm1Pred <- lm1 %>% predict(evalcleanDf)
lm2Pred <- lm2 %>% predict(evalcleanDf)
aiclm3 <- lm3step %>% predict(evalcleanDf)
aiclm4 <- lm4step %>% predict(transformed_eval)


## ----predict_report-------------------------------------------------------------------------------------------------------------
predsDf <- evalDf %>% 
  mutate(lm1 = lm1Pred, lm2 = lm2Pred, aic3 = aiclm3, aic4 = aiclm4) %>%
  dplyr::select(c(lm1, lm2, aic3, aic4))

kable(head(predsDf, 10), caption = "Moneyball Dataset", digits = 2, format = "html", row.names = F) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = F,
    position = "center"
  )


## ----predict_plot---------------------------------------------------------------------------------------------------------------
xT <- evalcleanDf %>% 
  mutate(lm1 = lm1Pred, lm2 = lm2Pred, aic3 = aiclm3, aic4 = aiclm4)
par(mfrow=c(2,2))
plot(1:nrow(xT), xT$lm1, xlab="LM1", ylab="Wins", main="LM1")
plot(1:nrow(xT), xT$lm2, xlab="LM2", ylab="Wins", main="LM2")
plot(1:nrow(xT), xT$aic3, xlab="AIC LM3", ylab="Wins", main="AIC LM3")
plot(1:nrow(xT), xT$aic4, xlab="AIC LM4", ylab="Wins", main="AIC LM4")


## ----lahman_df------------------------------------------------------------------------------------------------------------------
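# Load the full Lahman-derived team data set (custom helper sourced from Functions.R)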
teamDF <- loadLahmanData()


## ----lahmanshort_df-------------------------------------------------------------------------------------------------------------
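# Load the adjusted Lahman data set and split it 80/20 into training and testing sets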
teamAdjDF <- loadLahmanShortData()

random_sample <- createDataPartition(teamAdjDF$TARGET_WINS,p = 0.8, list = FALSE)

trainingTeam_df <- teamAdjDF[random_sample, ]
testingTeam_df <- teamAdjDF[-random_sample, ]


## ----lahman_summary-------------------------------------------------------------------------------------------------------------
print(
  dfSummary(teamDF, 
            varnumbers   = TRUE,
            na.col       = TRUE,
            graph.magnif = .8,
            tmp.img.dir  = "/tmp"),
  method = "render")


## ----lahman_plot----------------------------------------------------------------------------------------------------------------
m_df <- teamAdjDF[random_sample, ] %>% melt() 

m_df %>% ggplot(aes(x= value)) + 
      geom_density(color='#023020', fill='gray') + 
      facet_wrap(~variable, scales = 'free',  ncol = 4) + 
      theme_bw() +
      labs(title = 'Variable Density Plots')


## ----lahman_density-------------------------------------------------------------------------------------------------------------
m_df <- teamAdjDF[random_sample, ] %>% filter(era_cat == '1969+') %>% melt() 

m_df %>% ggplot(aes(x= value)) + 
      geom_density(color='#023020', fill='gray') + 
      facet_wrap(~variable, scales = 'free',  ncol = 4) + 
      theme_bw() +
      labs(title = 'Variable Density Plots (1969+)')


## ----lahman_training_summary----------------------------------------------------------------------------------------------------
print(
  dfSummary(trainingTeam_df, 
            varnumbers   = TRUE,
            na.col       = TRUE,
            graph.magnif = .8,
            tmp.img.dir  = "/tmp"),
  method = "render")


## ----lahman_training_plot-------------------------------------------------------------------------------------------------------
m_df <- trainingTeam_df %>% melt() 

m_df %>% ggplot(aes(x= value)) + 
    geom_density(color='#023020', fill='gray') + facet_wrap(~variable, scales = 'free',  ncol = 4) + theme_bw()


m_df %>% ggplot(aes(x = value)) +
  geom_boxplot(outlier.color = 'red', outlier.shape = 1) +
  facet_wrap(vars(variable),scales = "free", ncol = 4)


## ----lahman_training_corr-------------------------------------------------------------------------------------------------------
rcore <- rcorr(as.matrix(trainingTeam_df %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .7 , method = 'pie')


## ----lahman_prep_data-----------------------------------------------------------------------------------------------------------
trainingTeam_df <- trainingTeam_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_BATTING_HBP))
testingTeam_df <- testingTeam_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_BATTING_HBP))


## ----lahman_prep_impute---------------------------------------------------------------------------------------------------------
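# Impute missing values in the Lahman training split with MICE and inspect the imputed densities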
imputeDf <- mice(trainingTeam_df, m = 5, maxit = 50, seed = 123, printFlag = F)
#imputeDf$meth
trainingTeam_df <- complete(imputeDf)
m_imputed <- melt(trainingTeam_df)
densityplot(imputeDf)


## ----lahman_lma1----------------------------------------------------------------------------------------------------------------
# Model A1: regress TARGET_WINS on all predictors except yearID in the imputed Lahman training split
lmA1 <- lm(TARGET_WINS ~ . -yearID, data = trainingTeam_df)
tab_model(lmA1, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----lahman_lma1_step-----------------------------------------------------------------------------------------------------------
lmA1.step <- stepAIC(lmA1, trace = FALSE)
tab_model(lmA1.step, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----lahman_summary_lm1---------------------------------------------------------------------------------------------------------
lmA1.final <- update(lmA1.step, . ~ . -TEAM_BATTING_3B)
tab_model(lmA1.final, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----lahman_check_lm1-----------------------------------------------------------------------------------------------------------
check_model(lmA1.final, check=c('ncv','qq','homogeneity','outliers'))


## ----lahman_predict-------------------------------------------------------------------------------------------------------------
imputeDf <- mice(testingTeam_df, m = 5, maxit = 50, seed = 123, printFlag = F)
#imputeDf$meth
testingTeam_df <- complete(imputeDf)
m_imputed <- melt(testingTeam_df)
densityplot(imputeDf)


## ----lahman_predict_final-------------------------------------------------------------------------------------------------------
lmA1Pred.final <- lmA1.final %>% predict(testingTeam_df)


## ----lahman_predict_report------------------------------------------------------------------------------------------------------
multi_metric <- metric_set(mape, smape, mase, mpe, rmse, rsq)

# Attach predictions as a column and score them against the observed wins
m <- testingTeam_df %>%
  mutate(.pred = lmA1Pred.final) %>%
  multi_metric(truth = TARGET_WINS, estimate = .pred)

kable(m, digits = 4, format = "html", row.names = T) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = F,
    position = "center"
  )


## ----lahman_predict_plot--------------------------------------------------------------------------------------------------------
n <- nrow(testingTeam_df)
x <- testingTeam_df$TARGET_WINS
e <- lmA1Pred.final - testingTeam_df$TARGET_WINS 


plot(x, e,  
     xlab = "wins", 
     ylab = "residuals",
     bg = "steelblue", 
     col = "darkgray", cex = 1.5, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n) 
  lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 1)


## ----py_df----------------------------------------------------------------------------------------------------------------------
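# Fresh 80/20 train/test split of the full Lahman data for the Pythagorean model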
random_sample <- createDataPartition(teamDF$TARGET_WINS, p = 0.8, list = FALSE)

trainingTeam_df <- teamDF[random_sample, ]
testingTeam_df <- teamDF[-random_sample, ]


## ----py_lm----------------------------------------------------------------------------------------------------------------------
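# Pythagorean model: simple linear regression of TARGET_WINS on the Pythagorean win percentage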
lmp <- lm(TARGET_WINS ~ pythPercent, data = trainingTeam_df)

tab_model(lmp, show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE, show.ci=FALSE, show.stat=TRUE, digits.p=4)


## ----py_check_lm----------------------------------------------------------------------------------------------------------------
check_model(lmp, check=c('ncv','qq','homogeneity','outliers'))


## ----py_predict-----------------------------------------------------------------------------------------------------------------
lmpPred <- lmp %>% predict(testingTeam_df)
hist(lmpPred)


## ----py_predict_report----------------------------------------------------------------------------------------------------------
m1 <- testingTeam_df %>%
  mutate(.pred = lmpPred) %>%
  multi_metric(truth = TARGET_WINS, estimate = .pred)

kable(m1, digits = 4, format = "html", row.names = T) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = F,
    position = "center"
  )


## ----py_predict_plot------------------------------------------------------------------------------------------------------------
n <- nrow(testingTeam_df)
x <- testingTeam_df$TARGET_WINS
e <- lmpPred - testingTeam_df$TARGET_WINS 

plot(x, e,  
     xlab = "wins", 
     ylab = "residuals", 
     bg = "steelblue", 
     col = "darkgray", cex = 1.5, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n) 
  lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 1)