Setup:

First, let’s read in the provided dataset.
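
The training data is pulled directly from the course GitHub repository (code reproduced from the appendix):

df <- read.csv("https://raw.githubusercontent.com/waheeb123/Data-621/main/Homeworks/Homework-1/moneyball-training-data.csv")
df <- data.frame(df)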

Data Exploration:

The original training dataset consists of 17 variables and a total of 2,276 observations. Of these, INDEX is merely an identification value used for sorting, while TARGET_WINS serves as the response variable for our regression models. The remaining 15 variables are potential predictors for our linear models. Below is a summary table for the dataset.

variables n mean sd med min max range skew kurtosis se NAs
TARGET_WINS 2276 81 16 82 0 146 146 -0.40 1.03 0.33 0
TEAM_BATTING_H 2276 1470 145 1454 891 2554 1663 1.57 7.28 3.03 0
TEAM_BATTING_2B 2276 241 47 238 69 458 389 0.22 0.01 0.98 0
TEAM_BATTING_3B 2276 55 28 47 0 223 223 1.11 1.50 0.59 0
TEAM_BATTING_HR 2276 100 61 102 0 264 264 0.19 -0.96 1.27 0
TEAM_BATTING_BB 2276 502 123 512 0 878 878 -1.03 2.18 2.57 0
TEAM_BATTING_SO 2174 736 240 750 0 1399 1399 -0.30 -0.32 5.33 102
TEAM_BASERUN_SB 2145 125 88 101 0 697 697 1.97 5.49 1.90 131
TEAM_BASERUN_CS 1504 53 23 49 0 201 201 1.98 7.62 0.59 772
TEAM_BATTING_HBP 191 59 13 58 29 95 66 0.32 -0.11 0.94 2085
TEAM_PITCHING_H 2276 1779 1407 1518 1137 30132 28995 10.33 141.84 29.49 0
TEAM_PITCHING_HR 2276 106 61 107 0 343 343 0.29 -0.60 1.28 0
TEAM_PITCHING_BB 2276 553 166 536 0 3645 3645 6.74 96.97 3.49 0
TEAM_PITCHING_SO 2174 818 553 813 0 19278 19278 22.17 671.19 11.86 102
TEAM_FIELDING_E 2276 246 228 159 65 1898 1833 2.99 10.97 4.77 0
TEAM_FIELDING_DP 1990 146 26 149 52 228 176 -0.39 0.18 0.59 286
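
A table like the one above can be produced with the describe() function from the psych package plus a count of missing values per variable. This is a sketch of one way to do it; the psych package and this exact call are assumptions, not part of the appendix code:

library(psych)
# Summary statistics (n, mean, sd, median, min, max, range, skew, kurtosis, se) per variable
desc <- describe(train)
# Append the number of missing values for each variable
desc$NAs <- colSums(is.na(train))
kable(desc[, c("n", "mean", "sd", "median", "min", "max", "range", "skew", "kurtosis", "se", "NAs")])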

The variables and their definitions can be seen below:

Variable Definition
INDEX Identification variable
TARGET_WINS Number of wins
TEAM_BATTING_H Base hits by batters (1B, 2B, 3B, HR)
TEAM_BATTING_2B Doubles by batters (2B)
TEAM_BATTING_3B Triples by batters (3B)
TEAM_BATTING_HR Homeruns by batters (4B)
TEAM_BATTING_BB Walks by batters
TEAM_BATTING_HBP Batters hit by pitch (get a free base)
TEAM_BATTING_SO Strikeouts by batters
TEAM_BASERUN_SB Stolen bases
TEAM_BASERUN_CS Caught stealing
TEAM_FIELDING_E Errors
TEAM_FIELDING_DP Double plays
TEAM_PITCHING_BB Walks allowed
TEAM_PITCHING_H Hits allowed
TEAM_PITCHING_HR Homeruns allowed
TEAM_PITCHING_SO Strikeouts by pitchers

Next, let’s print out some summary statistics. We’re primarily interested in the TARGET_WINS variable, so we’ll look at that first.
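
These statistics are computed directly from the training data (code reproduced from the appendix):

mean_wins <- mean(train$TARGET_WINS)
median_wins <- median(train$TARGET_WINS)
sd_wins <- sd(train$TARGET_WINS)

# Print summary stats
print(glue("The mean number of wins in a season is {round(mean_wins,2)}."))
print(glue("The median number of wins in a season is {median_wins}."))
print(glue("The standard deviation for number of wins in a season is {round(sd_wins,2)}."))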

## The mean number of wins in a season is 80.79.
## The median number of wins in a season is 82.
## The standard deviation for number of wins in a season is 15.75.

Let’s also make a histogram of the TARGET_WINS variable. This should give us a sense of the distribution of wins for teams/seasons in our population.
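
The histogram (and the boxplot discussed below) are built with ggplot2, as in the appendix:

ggplot(train, aes(x=TARGET_WINS)) + 
  geom_histogram(fill = "skyblue") +
  labs(title = "Distribution of Wins (Histogram)", x = "Number of Wins", y = "Count")
ggplot(train, aes(x=TARGET_WINS)) + 
  geom_boxplot(fill="darkgrey") +
  labs(title = "Distribution of Wins (Boxplot)", x = "Number of Wins", y = "Count")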

Overall, the number of wins in a season for a given baseball team looks fairly normally distributed. Looking at a boxplot helps to highlight the outliers.

We can describe an average team's season using the mean of each variable, shown below:

TARGET_WINS 80.8
TEAM_BATTING_H 1469.3
TEAM_BATTING_2B 241.2
TEAM_BATTING_3B 55.2
TEAM_BATTING_HR 99.6
TEAM_BATTING_BB 501.6
TEAM_BATTING_SO 735.6
TEAM_BASERUN_SB 124.8
TEAM_BASERUN_CS 52.8
TEAM_BATTING_HBP 59.4
TEAM_PITCHING_H 1779.2
TEAM_PITCHING_HR 105.7
TEAM_PITCHING_BB 553.0
TEAM_PITCHING_SO 817.7
TEAM_FIELDING_E 246.5
TEAM_FIELDING_DP 146.4

Data Preparation:

Let’s take a closer look at all the summary statistics for these variables and identify any data completeness issues:

Variable Min Q1 Median Mean Q3 Max NAs
TARGET_WINS 0.00 71.00 82.00 80.79 92.00 146.00 0
TEAM_BATTING_H 891 1383 1454 1469 1537 2554 0
TEAM_BATTING_2B 69.0 208.0 238.0 241.2 273.0 458.0 0
TEAM_BATTING_3B 0.00 34.00 47.00 55.25 72.00 223.00 0
TEAM_BATTING_HR 0.00 42.00 102.00 99.61 147.00 264.00 0
TEAM_BATTING_BB 0.0 451.0 512.0 501.6 580.0 878.0 0
TEAM_BATTING_SO 0.0 548.0 750.0 735.6 930.0 1399.0 102
TEAM_BASERUN_SB 0.0 66.0 101.0 124.8 156.0 697.0 131
TEAM_BASERUN_CS 0.0 38.0 49.0 52.8 62.0 201.0 772
TEAM_BATTING_HBP 29.00 50.50 58.00 59.36 67.00 95.00 2085
TEAM_PITCHING_H 1137 1419 1518 1779 1682 30132 0
TEAM_PITCHING_HR 0.0 50.0 107.0 105.7 150.0 343.0 0
TEAM_PITCHING_BB 0.0 476.0 536.5 553.0 611.0 3645.0 0
TEAM_PITCHING_SO 0.0 615.0 813.5 817.7 968.0 19278.0 102
TEAM_FIELDING_E 65.0 127.0 159.0 246.5 249.2 1898.0 0
TEAM_FIELDING_DP 52.0 131.0 149.0 146.4 164.0 228.0 286

# First, store the summary in an object
summary_data <- summary(train)

# Create a data frame from the summary
summary_df <- as.data.frame(summary_data)

# Print the data frame
kable(summary_df)
Var1 Var2 Freq
TARGET_WINS Min. : 0.00
TARGET_WINS 1st Qu.: 71.00
TARGET_WINS Median : 82.00
TARGET_WINS Mean : 80.79
TARGET_WINS 3rd Qu.: 92.00
TARGET_WINS Max. :146.00
TARGET_WINS NA
TEAM_BATTING_H Min. : 891
TEAM_BATTING_H 1st Qu.:1383
TEAM_BATTING_H Median :1454
TEAM_BATTING_H Mean :1469
TEAM_BATTING_H 3rd Qu.:1537
TEAM_BATTING_H Max. :2554
TEAM_BATTING_H NA
TEAM_BATTING_2B Min. : 69.0
TEAM_BATTING_2B 1st Qu.:208.0
TEAM_BATTING_2B Median :238.0
TEAM_BATTING_2B Mean :241.2
TEAM_BATTING_2B 3rd Qu.:273.0
TEAM_BATTING_2B Max. :458.0
TEAM_BATTING_2B NA
TEAM_BATTING_3B Min. : 0.00
TEAM_BATTING_3B 1st Qu.: 34.00
TEAM_BATTING_3B Median : 47.00
TEAM_BATTING_3B Mean : 55.25
TEAM_BATTING_3B 3rd Qu.: 72.00
TEAM_BATTING_3B Max. :223.00
TEAM_BATTING_3B NA
TEAM_BATTING_HR Min. : 0.00
TEAM_BATTING_HR 1st Qu.: 42.00
TEAM_BATTING_HR Median :102.00
TEAM_BATTING_HR Mean : 99.61
TEAM_BATTING_HR 3rd Qu.:147.00
TEAM_BATTING_HR Max. :264.00
TEAM_BATTING_HR NA
TEAM_BATTING_BB Min. : 0.0
TEAM_BATTING_BB 1st Qu.:451.0
TEAM_BATTING_BB Median :512.0
TEAM_BATTING_BB Mean :501.6
TEAM_BATTING_BB 3rd Qu.:580.0
TEAM_BATTING_BB Max. :878.0
TEAM_BATTING_BB NA
TEAM_BATTING_SO Min. : 0.0
TEAM_BATTING_SO 1st Qu.: 548.0
TEAM_BATTING_SO Median : 750.0
TEAM_BATTING_SO Mean : 735.6
TEAM_BATTING_SO 3rd Qu.: 930.0
TEAM_BATTING_SO Max. :1399.0
TEAM_BATTING_SO NA’s :102
TEAM_BASERUN_SB Min. : 0.0
TEAM_BASERUN_SB 1st Qu.: 66.0
TEAM_BASERUN_SB Median :101.0
TEAM_BASERUN_SB Mean :124.8
TEAM_BASERUN_SB 3rd Qu.:156.0
TEAM_BASERUN_SB Max. :697.0
TEAM_BASERUN_SB NA’s :131
TEAM_BASERUN_CS Min. : 0.0
TEAM_BASERUN_CS 1st Qu.: 38.0
TEAM_BASERUN_CS Median : 49.0
TEAM_BASERUN_CS Mean : 52.8
TEAM_BASERUN_CS 3rd Qu.: 62.0
TEAM_BASERUN_CS Max. :201.0
TEAM_BASERUN_CS NA’s :772
TEAM_BATTING_HBP Min. :29.00
TEAM_BATTING_HBP 1st Qu.:50.50
TEAM_BATTING_HBP Median :58.00
TEAM_BATTING_HBP Mean :59.36
TEAM_BATTING_HBP 3rd Qu.:67.00
TEAM_BATTING_HBP Max. :95.00
TEAM_BATTING_HBP NA’s :2085
TEAM_PITCHING_H Min. : 1137
TEAM_PITCHING_H 1st Qu.: 1419
TEAM_PITCHING_H Median : 1518
TEAM_PITCHING_H Mean : 1779
TEAM_PITCHING_H 3rd Qu.: 1682
TEAM_PITCHING_H Max. :30132
TEAM_PITCHING_H NA
TEAM_PITCHING_HR Min. : 0.0
TEAM_PITCHING_HR 1st Qu.: 50.0
TEAM_PITCHING_HR Median :107.0
TEAM_PITCHING_HR Mean :105.7
TEAM_PITCHING_HR 3rd Qu.:150.0
TEAM_PITCHING_HR Max. :343.0
TEAM_PITCHING_HR NA
TEAM_PITCHING_BB Min. : 0.0
TEAM_PITCHING_BB 1st Qu.: 476.0
TEAM_PITCHING_BB Median : 536.5
TEAM_PITCHING_BB Mean : 553.0
TEAM_PITCHING_BB 3rd Qu.: 611.0
TEAM_PITCHING_BB Max. :3645.0
TEAM_PITCHING_BB NA
TEAM_PITCHING_SO Min. : 0.0
TEAM_PITCHING_SO 1st Qu.: 615.0
TEAM_PITCHING_SO Median : 813.5
TEAM_PITCHING_SO Mean : 817.7
TEAM_PITCHING_SO 3rd Qu.: 968.0
TEAM_PITCHING_SO Max. :19278.0
TEAM_PITCHING_SO NA’s :102
TEAM_FIELDING_E Min. : 65.0
TEAM_FIELDING_E 1st Qu.: 127.0
TEAM_FIELDING_E Median : 159.0
TEAM_FIELDING_E Mean : 246.5
TEAM_FIELDING_E 3rd Qu.: 249.2
TEAM_FIELDING_E Max. :1898.0
TEAM_FIELDING_E NA
TEAM_FIELDING_DP Min. : 52.0
TEAM_FIELDING_DP 1st Qu.:131.0
TEAM_FIELDING_DP Median :149.0
TEAM_FIELDING_DP Mean :146.4
TEAM_FIELDING_DP 3rd Qu.:164.0
TEAM_FIELDING_DP Max. :228.0
TEAM_FIELDING_DP NA’s :286

We can see quite a few NA values for TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO, and TEAM_FIELDING_DP. Let’s take a look at the distributions of these variables to see how to impute these missing values.

TEAM_BASERUN_SB, TEAM_PITCHING_SO, and TEAM_BASERUN_CS are skewed to the right, so we should probably impute the missing values with the median for these variables. TEAM_BATTING_HBP and TEAM_FIELDING_DP look roughly normally distributed, so we can use the mean, although TEAM_BATTING_HBP has 2,085 NA values out of 2,276 observations, so it may make sense to leave this variable out of our model entirely. TEAM_BATTING_SO has a bimodal distribution, so we have decided to use kNN imputation for this variable, which does not rely on the shape of the distribution.

Identify missing data

# Identify missing data by Feature and display percent breakout
missing <- colSums(df %>% sapply(is.na))
missing_pct <- round(missing / nrow(df) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))
##    values              ind
## 1   91.61 TEAM_BATTING_HBP
## 2   33.92  TEAM_BASERUN_CS
## 3   12.57 TEAM_FIELDING_DP
## 4    5.76  TEAM_BASERUN_SB
## 5    4.48  TEAM_BATTING_SO
## 6    4.48 TEAM_PITCHING_SO
## 7    0.00            INDEX
## 8    0.00      TARGET_WINS
## 9    0.00   TEAM_BATTING_H
## 10   0.00  TEAM_BATTING_2B
## 11   0.00  TEAM_BATTING_3B
## 12   0.00  TEAM_BATTING_HR
## 13   0.00  TEAM_BATTING_BB
## 14   0.00  TEAM_PITCHING_H
## 15   0.00 TEAM_PITCHING_HR
## 16   0.00 TEAM_PITCHING_BB
## 17   0.00  TEAM_FIELDING_E
# Drop the BATTING_HBP field
df <- df %>% 
  select(-TEAM_BATTING_HBP)

Notice that ~91.6% of the values in TEAM_BATTING_HBP are missing - we will simply drop this column from consideration. The columns TEAM_BASERUN_CS (caught stealing) and TEAM_BASERUN_SB (stolen bases) both have missing values. Stolen bases weren't tracked officially until 1887, so some of the missing data could come from the 1871-1886 seasons; we will impute those values. A fairly high percentage of TEAM_BATTING_SO (batter strikeouts) and TEAM_PITCHING_SO (pitcher strikeouts) values are also missing, which seems unlikely to reflect reality, so we will impute those as well. As described above, we impute with the median for the skewed variables, the mean for TEAM_FIELDING_DP, and kNN for TEAM_BATTING_SO.

train_imputed <- train |>
    mutate(TEAM_BASERUN_SB = replace(TEAM_BASERUN_SB, is.na(TEAM_BASERUN_SB),
                                     median(TEAM_BASERUN_SB, na.rm=T)),
           TEAM_BASERUN_CS = replace(TEAM_BASERUN_CS, is.na(TEAM_BASERUN_CS),
                                     median(TEAM_BASERUN_CS, na.rm=T)),
           TEAM_PITCHING_SO = replace(TEAM_PITCHING_SO, is.na(TEAM_PITCHING_SO),
                                      median(TEAM_PITCHING_SO, na.rm=T)),
           TEAM_FIELDING_DP = replace(TEAM_FIELDING_DP, is.na(TEAM_FIELDING_DP),
                                      mean(TEAM_FIELDING_DP, na.rm=T))) |>
    select(-TEAM_BATTING_HBP)
    

train_imputed <- train_imputed |>
    VIM::kNN(variable = "TEAM_BATTING_SO", k = 15, numFun = weighted.mean,
             weightDist = TRUE) |>
    select(-TEAM_BATTING_SO_imp)

Let’s look at raw correlations between our other included variables and a team’s win total for a season:
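
The correlation vector below is computed on the imputed training data (appendix code):

cor(train_imputed, df$TARGET_WINS)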

##                         [,1]
## TARGET_WINS       1.00000000
## TEAM_BATTING_H    0.38876752
## TEAM_BATTING_2B   0.28910365
## TEAM_BATTING_3B   0.14260841
## TEAM_BATTING_HR   0.17615320
## TEAM_BATTING_BB   0.23255986
## TEAM_BATTING_SO  -0.03606403
## TEAM_BASERUN_SB   0.12361087
## TEAM_BASERUN_CS   0.01595982
## TEAM_PITCHING_H  -0.10993705
## TEAM_PITCHING_HR  0.18901373
## TEAM_PITCHING_BB  0.12417454
## TEAM_PITCHING_SO -0.07579967
## TEAM_FIELDING_E  -0.17648476
## TEAM_FIELDING_DP -0.02884126

None of the independent variables are strongly correlated with TARGET_WINS. TEAM_BATTING_H is the most highly correlated, at 0.39. TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_HR, and TEAM_PITCHING_BB are all positively correlated with TARGET_WINS, while TEAM_BATTING_SO, TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_FIELDING_E, and TEAM_FIELDING_DP are negatively correlated.

Some of these correlations are surprising: we would have expected TEAM_BASERUN_CS, TEAM_PITCHING_HR, and TEAM_PITCHING_BB to be negatively correlated with TARGET_WINS, and TEAM_PITCHING_SO and TEAM_FIELDING_DP to be positively correlated. We won't exclude these variables from our models based solely on the unexpected relationships.

Let’s review relationships between batting independent variables.

train_cleaned <- train_imputed |> rename_all(~stringr::str_replace(.,"^TEAM_",""))
subset_batting <- train_cleaned |> select(contains('batting'))
kdepairs(subset_batting)

Most of the batting variables appear approximately normal, although there are some cases of right skew. Overall, there aren't any very strong correlations between these statistics, at least from a preliminary visual inspection. From the distributions, we can see that some of these variables will need to be transformed toward normality before we use them in our linear model.

Let’s review relationships between other independent variables.

subset_pitching <- train_cleaned |> select(!contains('batting'), -TARGET_WINS)
kdepairs(subset_pitching)

As with the batting statistics, there are no very strong correlations among the other independent variables, although more of these inputs are skewed. Once again, we can see that we will need to transform some of these variables.

Model Development:

First, let’s create a basic model with all untransformed variables:
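
The full model is fit on the imputed training data (code reproduced from the appendix):

lm_all <- lm(TARGET_WINS ~ ., train_imputed)
summary(lm_all)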

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.260  -8.612   0.151   8.425  59.018 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.657e+01  5.234e+00   5.078 4.14e-07 ***
## TEAM_BATTING_H    4.708e-02  3.699e-03  12.729  < 2e-16 ***
## TEAM_BATTING_2B  -1.788e-02  9.206e-03  -1.942 0.052245 .  
## TEAM_BATTING_3B   6.137e-02  1.678e-02   3.657 0.000261 ***
## TEAM_BATTING_HR   5.752e-02  2.749e-02   2.093 0.036500 *  
## TEAM_BATTING_BB   1.085e-02  5.816e-03   1.865 0.062310 .  
## TEAM_BATTING_SO  -1.141e-02  2.579e-03  -4.427 1.00e-05 ***
## TEAM_BASERUN_SB   2.580e-02  4.317e-03   5.976 2.66e-09 ***
## TEAM_BASERUN_CS  -7.159e-03  1.577e-02  -0.454 0.649853    
## TEAM_PITCHING_H  -8.980e-04  3.673e-04  -2.445 0.014562 *  
## TEAM_PITCHING_HR  1.612e-02  2.431e-02   0.663 0.507243    
## TEAM_PITCHING_BB -2.408e-05  4.124e-03  -0.006 0.995341    
## TEAM_PITCHING_SO  3.201e-03  9.134e-04   3.505 0.000466 ***
## TEAM_FIELDING_E  -1.961e-02  2.448e-03  -8.011 1.80e-15 ***
## TEAM_FIELDING_DP -1.201e-01  1.293e-02  -9.290  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 2261 degrees of freedom
## Multiple R-squared:  0.3181, Adjusted R-squared:  0.3139 
## F-statistic: 75.34 on 14 and 2261 DF,  p-value: < 2.2e-16

Despite the simplicity of including all of the provided variables, several of them show signs of multicollinearity, which undermines the reliability of the coefficient estimates and their variances.

vif(lm_all)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         3.822624         2.480705         2.936429        37.007740 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         6.801581         5.336220         1.816473         1.167416 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##         3.567853        29.669252         6.288097         3.257794 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         4.155586         1.343048

TEAM_BATTING_HR and TEAM_PITCHING_HR are the most collinear with the other variables; interestingly, the correlation plots did not make this obvious in the initial spot check.

We can remove some variables that are not significant using backward step-wise elimination.
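
Backward elimination is done with step(), which uses the AIC criterion by default (code reproduced from the appendix):

lm_all_reduced <- step(lm_all, direction="backward", trace = 0)
summary(lm_all_reduced)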

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.201  -8.548   0.137   8.404  59.080 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      25.5925419  5.0907238   5.027 5.36e-07 ***
## TEAM_BATTING_H    0.0473024  0.0036728  12.879  < 2e-16 ***
## TEAM_BATTING_2B  -0.0182083  0.0091927  -1.981 0.047742 *  
## TEAM_BATTING_3B   0.0633643  0.0165996   3.817 0.000139 ***
## TEAM_BATTING_HR   0.0752404  0.0098361   7.649 2.97e-14 ***
## TEAM_BATTING_BB   0.0109356  0.0033639   3.251 0.001167 ** 
## TEAM_BATTING_SO  -0.0114146  0.0024962  -4.573 5.07e-06 ***
## TEAM_BASERUN_SB   0.0254110  0.0041873   6.069 1.51e-09 ***
## TEAM_PITCHING_H  -0.0008562  0.0003209  -2.669 0.007672 ** 
## TEAM_PITCHING_SO  0.0032329  0.0006703   4.823 1.51e-06 ***
## TEAM_FIELDING_E  -0.0192393  0.0023792  -8.086 9.91e-16 ***
## TEAM_FIELDING_DP -0.1201245  0.0129038  -9.309  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2264 degrees of freedom
## Multiple R-squared:  0.3179, Adjusted R-squared:  0.3145 
## F-statistic: 95.91 on 11 and 2264 DF,  p-value: < 2.2e-16

The \(R^2\) for this model is essentially unchanged. Let's review the coefficients and then check for multicollinearity between variables.


Removing predictors that were not statistically significant slightly increased the adjusted \(R^2\) for this model. Most of the batting coefficients are positively associated with target wins, which makes sense: more hits and stolen bases should correspond with more runs and ultimately more wins. TEAM_BATTING_2B has a negative coefficient, which is unexpected, since it is not obvious why doubles would relate to wins differently than other types of hits; its individual t-test also suggests it is only marginally significant. TEAM_BATTING_SO was expected to have a negative relationship with wins, and its coefficient aligns with that expectation. The pitching predictors do not have very strong coefficients, although they are significant to the model, and they align with the expectation that allowing hits hurts winning while striking out opposing batters helps. Lastly, the remaining fielding variables (TEAM_FIELDING_E and TEAM_FIELDING_DP) are consistent with the raw correlations we observed earlier. Double plays may have one of the strongest effects given the size of the coefficient, although there is more sampling variability here than for the other predictors.

Reviewing the variance inflation factors:

vif(lm_all_reduced)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         3.772304         2.475869         2.876917         4.744128 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_H 
##         2.277645         5.005090         1.710653         2.725441 
## TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         1.755773         3.928217         1.339341

The variance inflation factor for TEAM_BATTING_SO is greater than 5. We can remove this predictor.

lm_all_reduced <- update(lm_all_reduced, .~. - TEAM_BATTING_SO)
summary(lm_all_reduced)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.354  -8.637   0.046   8.422  56.185 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.4516005  3.8837691   2.691  0.00717 ** 
## TEAM_BATTING_H    0.0543248  0.0033510  16.212  < 2e-16 ***
## TEAM_BATTING_2B  -0.0275776  0.0090008  -3.064  0.00221 ** 
## TEAM_BATTING_3B   0.0742212  0.0165010   4.498 7.21e-06 ***
## TEAM_BATTING_HR   0.0472798  0.0077385   6.110 1.17e-09 ***
## TEAM_BATTING_BB   0.0134412  0.0033335   4.032 5.71e-05 ***
## TEAM_BASERUN_SB   0.0204726  0.0040634   5.038 5.07e-07 ***
## TEAM_PITCHING_H  -0.0005109  0.0003132  -1.631  0.10300    
## TEAM_PITCHING_SO  0.0020318  0.0006194   3.281  0.00105 ** 
## TEAM_FIELDING_E  -0.0186422  0.0023861  -7.813 8.48e-15 ***
## TEAM_FIELDING_DP -0.1165458  0.0129366  -9.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.1 on 2265 degrees of freedom
## Multiple R-squared:  0.3116, Adjusted R-squared:  0.3085 
## F-statistic: 102.5 on 10 and 2265 DF,  p-value: < 2.2e-16

The coefficients were not impacted substantially by this modification. The one caveat is that TEAM_BATTING_2B remains negatively associated with target wins, but its significance to the model improved markedly once the collinear variable was excluded. It is not intuitive why TEAM_BATTING_SO would affect TEAM_BATTING_2B, since their correlation was only 0.185. TEAM_PITCHING_SO also became slightly less significant to the model.

Let’s remove TEAM_PITCHING_H as it is no longer significant.

lm_all_reduced <- update(lm_all_reduced, .~. - TEAM_PITCHING_H)
summary(lm_all_reduced)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.785  -8.584   0.002   8.446  57.585 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      12.1798845  3.7378157   3.259  0.00114 ** 
## TEAM_BATTING_H    0.0528676  0.0032309  16.363  < 2e-16 ***
## TEAM_BATTING_2B  -0.0272628  0.0090020  -3.029  0.00249 ** 
## TEAM_BATTING_3B   0.0787841  0.0162681   4.843 1.37e-06 ***
## TEAM_BATTING_HR   0.0481046  0.0077248   6.227 5.64e-10 ***
## TEAM_BATTING_BB   0.0133595  0.0033344   4.007 6.36e-05 ***
## TEAM_BASERUN_SB   0.0216382  0.0040015   5.408 7.06e-08 ***
## TEAM_PITCHING_SO  0.0016132  0.0005639   2.861  0.00426 ** 
## TEAM_FIELDING_E  -0.0208625  0.0019604 -10.642  < 2e-16 ***
## TEAM_FIELDING_DP -0.1173637  0.0129316  -9.076  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.1 on 2266 degrees of freedom
## Multiple R-squared:  0.3108, Adjusted R-squared:  0.308 
## F-statistic: 113.5 on 9 and 2266 DF,  p-value: < 2.2e-16
vif(lm_all_reduced)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.891551         2.351793         2.737074         2.898411 
##  TEAM_BATTING_BB  TEAM_BASERUN_SB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##         2.216711         1.547476         1.230982         2.641838 
## TEAM_FIELDING_DP 
##         1.332410

By definition, TEAM_BATTING_H counts all base hits, including the doubles, triples, and home runs captured by TEAM_BATTING_2B, TEAM_BATTING_3B, and TEAM_BATTING_HR, so some multicollinearity among these variables is expected. Let's compare a model that uses just total hits against a model using each individual type of hit.

lm_all_reduced_hits <- update(lm_all_reduced, .~. - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)
summary(lm_all_reduced_hits)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.823  -8.638   0.156   8.473  52.443 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.7032191  3.5315442   2.748  0.00605 ** 
## TEAM_BATTING_H    0.0542548  0.0021078  25.740  < 2e-16 ***
## TEAM_BATTING_BB   0.0171889  0.0032588   5.275 1.46e-07 ***
## TEAM_BASERUN_SB   0.0237070  0.0037196   6.374 2.23e-10 ***
## TEAM_PITCHING_SO  0.0015116  0.0005316   2.844  0.00450 ** 
## TEAM_FIELDING_E  -0.0210881  0.0018286 -11.532  < 2e-16 ***
## TEAM_FIELDING_DP -0.1107454  0.0128421  -8.624  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.24 on 2269 degrees of freedom
## Multiple R-squared:  0.2955, Adjusted R-squared:  0.2936 
## F-statistic: 158.6 on 6 and 2269 DF,  p-value: < 2.2e-16
lm_all_reduced_others <- update(lm_all_reduced, .~. - TEAM_BATTING_H)
summary(lm_all_reduced_others)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.232  -8.904   0.069   8.910  65.787 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      55.5049520  2.7892547  19.900  < 2e-16 ***
## TEAM_BATTING_2B   0.0694280  0.0071795   9.670  < 2e-16 ***
## TEAM_BATTING_3B   0.1953241  0.0154629  12.632  < 2e-16 ***
## TEAM_BATTING_HR   0.0748393  0.0079819   9.376  < 2e-16 ***
## TEAM_BATTING_BB   0.0103989  0.0035199   2.954  0.00317 ** 
## TEAM_BASERUN_SB   0.0221969  0.0042302   5.247 1.69e-07 ***
## TEAM_PITCHING_SO -0.0012958  0.0005657  -2.291  0.02208 *  
## TEAM_FIELDING_E  -0.0110738  0.0019737  -5.611 2.26e-08 ***
## TEAM_FIELDING_DP -0.0947871  0.0135932  -6.973 4.05e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.85 on 2267 degrees of freedom
## Multiple R-squared:  0.2293, Adjusted R-squared:  0.2266 
## F-statistic: 84.31 on 8 and 2267 DF,  p-value: < 2.2e-16

Comparing Partial F-Tests/ANOVA of reduced models

anova(lm_all_reduced_others,lm_all_reduced)
## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   2267 435053                                  
## 2   2266 389079  1     45974 267.75 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The partial F-test indicates a statistically significant difference between the reduced model and the model without TEAM_BATTING_H, so it is not appropriate to drop total hits from the model.

anova(lm_all_reduced_hits,lm_all_reduced)
## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   2269 397696                                  
## 2   2266 389079  3    8617.2 16.729 9.486e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Similarly, this partial F-test indicates that the individual hit-type variables should not be dropped in favor of total hits alone, despite the potential reduction in collinearity.

Of the two alternatives, the model using total hits (TEAM_BATTING_H) has the higher \(R^2\), so it accounts for more variability. Let's keep this variable in our model.

We can test the assumptions of our basic model by calling the plot function on the fitted model object:

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_all_reduced_hits)

The Q-Q plot shows that the residuals of this model are fairly normally distributed. The residuals vs. fitted plot shows a cluster of residuals and one apparent outlier. There is no general pattern here, and the cluster of points suggests that homoscedasticity is satisfied for this model.

Let’s try transforming some of our variables to come up with a more accurate model.

TEAM_PITCHING_SO is a right-skewed variable with very large outliers. Let's compare how four common transformations (log, fourth root, cube root, and square root) would normalize the distribution of this variable, after adding a small constant since the variable includes legitimate values of 0.
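
The comparison uses simple histograms of the transformed variable; nthroot() comes from the pracma package (code reproduced from the appendix):

train_imputed_transformed <- train_imputed
# Add a small constant to TEAM_PITCHING_SO so there are no 0 values
train_imputed_transformed$TEAM_PITCHING_SO <- train_imputed_transformed$TEAM_PITCHING_SO + 0.001
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
# Compare how easy-to-understand transformations alter the distribution
hist(log(train_imputed_transformed$TEAM_PITCHING_SO),
     main = "Log Transformation", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_PITCHING_SO, 4),
     main = "Fourth Root Transformation", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_PITCHING_SO, 3),
     main = "Cube Root Transformation", col="lightblue")
hist(sqrt(train_imputed_transformed$TEAM_PITCHING_SO),
     main = "Square Root Transformation", col="lightblue")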

The square root transformation appears to normalize the data best. Let’s confirm the ideal lambda proposed by the boxcox function from the MASS library is similar to the square root transformation lambda (0.5) we assume will work best for this data.
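
The optimal Box-Cox lambda is estimated with boxcox() from MASS (code reproduced from the appendix):

bc <- boxcox(lm(train_imputed_transformed$TEAM_PITCHING_SO ~ 1),
             lambda = seq(-2, 2, length.out = 81),
             plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
lambda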

## [1] 0.45

The proposed lambda of 0.45 is in fact very close to 0.5, so we will go with the easier to understand square root transformation. We will follow a similar process to find reasonable transformations for several other variables in our model without showing the process repeatedly.

variables lambdas adj
TEAM_BASERUN_SB 0.2 log
TEAM_BASERUN_CS 0.3 fourth root
TEAM_PITCHING_SO 0.45 square root
TEAM_BATTING_3B 0.3 log
TEAM_BATTING_BB 1.75 square
TEAM_PITCHING_H -2 square inverse
TEAM_PITCHING_BB 0.35 cube root
TEAM_FIELDING_E -0.95 inverse

Adjusting the ideal lambdas proposed for several variables to commonly understood transformations, we see mixed results on normalizing the distributions. Let’s use the same variables from our final untransformed model above to see if we can improve the model using transformations.

lm_trans <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + I(TEAM_BATTING_BB**2) + log(TEAM_BASERUN_SB + .0001) + I(TEAM_PITCHING_SO**.5) + I(TEAM_FIELDING_E**-1) + TEAM_FIELDING_DP, train_imputed)
summary(lm_trans)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     I(TEAM_BATTING_BB^2) + log(TEAM_BASERUN_SB + 1e-04) + I(TEAM_PITCHING_SO^0.5) + 
##     I(TEAM_FIELDING_E^-1) + TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.340  -8.557   0.113   8.603  55.118 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -3.404e+00  4.958e+00  -0.686   0.4925    
## TEAM_BATTING_H                4.868e-02  2.173e-03  22.405  < 2e-16 ***
## TEAM_BATTING_BB               5.538e-03  1.068e-02   0.518   0.6042    
## I(TEAM_BATTING_BB^2)          2.482e-05  1.131e-05   2.195   0.0283 *  
## log(TEAM_BASERUN_SB + 1e-04)  3.037e+00  4.263e-01   7.125 1.39e-12 ***
## I(TEAM_PITCHING_SO^0.5)      -1.991e-02  5.392e-02  -0.369   0.7119    
## I(TEAM_FIELDING_E^-1)         1.411e+03  1.427e+02   9.885  < 2e-16 ***
## TEAM_FIELDING_DP             -1.270e-01  1.313e-02  -9.672  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.29 on 2268 degrees of freedom
## Multiple R-squared:  0.2905, Adjusted R-squared:  0.2883 
## F-statistic: 132.7 on 7 and 2268 DF,  p-value: < 2.2e-16

Note: There are instances of TEAM_BASERUN_SB where the value is zero, and the log of zero is undefined. To account for this we add a small constant (0.0001) so that the log transformation can be applied.

The transformed TEAM_PITCHING_SO is no longer significant, so let's remove it.

lm_trans_reduced <- update(lm_trans, .~. - I(TEAM_PITCHING_SO**.5), train_imputed)
summary(lm_trans_reduced)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     I(TEAM_BATTING_BB^2) + log(TEAM_BASERUN_SB + 1e-04) + I(TEAM_FIELDING_E^-1) + 
##     TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.611  -8.620   0.107   8.597  53.441 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -4.223e+00  4.433e+00  -0.953   0.3409    
## TEAM_BATTING_H                4.893e-02  2.059e-03  23.771  < 2e-16 ***
## TEAM_BATTING_BB               5.693e-03  1.067e-02   0.533   0.5938    
## I(TEAM_BATTING_BB^2)          2.473e-05  1.131e-05   2.188   0.0288 *  
## log(TEAM_BASERUN_SB + 1e-04)  3.019e+00  4.233e-01   7.132 1.33e-12 ***
## I(TEAM_FIELDING_E^-1)         1.394e+03  1.354e+02  10.297  < 2e-16 ***
## TEAM_FIELDING_DP             -1.269e-01  1.312e-02  -9.668  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.29 on 2269 degrees of freedom
## Multiple R-squared:  0.2905, Adjusted R-squared:  0.2886 
## F-statistic: 154.8 on 6 and 2269 DF,  p-value: < 2.2e-16

The adjusted \(R^2\) is slightly lower for this model than for the untransformed one. The coefficients and relationships for the predictor variables also become a little less intuitive due to the transformations. It makes sense that the inverse of fielding errors has a positive relationship with target wins, since inverting the variable flips the sign of its original negative relationship. The other transformed variables do not appear to behave much differently. Let's take a look at the diagnostic plots for this transformed model.

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_trans_reduced)

Once again, the Q-Q plot shows that the residuals are fairly normally distributed. From the plot of Cook’s distance, it seems there are fewer possible leverage points. The residuals vs. fitted plot also seems to indicate that homoscedasticity is satisfied.

Now we can build a model using inputs chosen from our knowledge of baseball:

  • Total hits (TEAM_BATTING_H)
  • Total walks gained (TEAM_BATTING_BB)
  • Total hits allowed (TEAM_PITCHING_H)
  • Total walks allowed (TEAM_PITCHING_BB)

We chose these variables based on our understanding that good teams generally tend to get on base more frequently (positive predictors TEAM_BATTING_H and TEAM_BATTING_BB) while allowing fewer runners on base (negative predictors TEAM_PITCHING_H and TEAM_PITCHING_BB).
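
This model is fit on the unimputed train data, since none of these four predictors have missing values (code reproduced from the appendix):

# Create model with select inputs (walks and hits allowed/gained)
lm_select <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_PITCHING_H + TEAM_PITCHING_BB, train)
summary(lm_select)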

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_BB, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.133  -8.860   0.379   9.373  52.416 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.3518000  3.2552864  -0.108 0.913949    
## TEAM_BATTING_H    0.0497667  0.0021032  23.663  < 2e-16 ***
## TEAM_BATTING_BB   0.0148499  0.0039923   3.720 0.000204 ***
## TEAM_PITCHING_H  -0.0025469  0.0003317  -7.679 2.36e-14 ***
## TEAM_PITCHING_BB  0.0092317  0.0027681   3.335 0.000867 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.73 on 2271 degrees of freedom
## Multiple R-squared:  0.2416, Adjusted R-squared:  0.2403 
## F-statistic: 180.9 on 4 and 2271 DF,  p-value: < 2.2e-16

Reviewing the model output, all of these factors are statistically significant, and some of the intuition around successful baseball performance does translate into the statistics. The coefficients are all positively associated with target wins except for TEAM_PITCHING_H, which makes sense given that giving up more hits typically means the other team is more likely to win. What is slightly surprising is that TEAM_PITCHING_BB has a positive coefficient and is highly significant, since in theory allowing more opposing base runners, even via walks, should help the other team win.

It's interesting to note that with these selected variables (walks and hits gained/allowed per team) our adjusted \(R^2\) actually went down, indicating that this more selective walks/hits model explains less of the variability in TARGET_WINS than the model including all variables.

Looking at our residual plot above, there is a clustering of residuals around \(x \approx 80\), which suggests some pattern in the residuals.

Let’s plot our response variable (Total Wins) versus each of our predictor variables to get a sense of linear relationships.
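
Each predictor is plotted against TARGET_WINS with a simple scatter plot (code reproduced from the appendix):

ggplot(train, aes(x=TEAM_BATTING_H, y=TARGET_WINS)) + geom_point() + labs(x="Hits", y="Wins")
ggplot(train, aes(x=TEAM_BATTING_BB, y=TARGET_WINS)) + geom_point() + labs(x="Walks", y="Wins")
ggplot(train, aes(x=TEAM_PITCHING_H, y=TARGET_WINS)) + geom_point() + labs(x="Hits Allowed", y="Wins")
ggplot(train, aes(x=TEAM_PITCHING_BB, y=TARGET_WINS)) + geom_point() + labs(x="Walks Allowed", y="Wins")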


Overall, we see some loosely linear relationships between our input variables and wins. For example, offensive hits (TEAM_BATTING_H) have a plausibly linear relationship with wins, whereas hits allowed (TEAM_PITCHING_H) do not show as clear a linear relationship.

One alternative theory, using some of the same logic as before, is to maximize run-producing offense (TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR), since extra-base hits imply that batters are taking more bases, which should make runners on base more likely to score. Another way to supercharge the offense is to advance runners once they reach base (TEAM_BASERUN_SB), which in theory should lead to more runs. On defense, the key areas we focus on are limiting mistakes (TEAM_FIELDING_E) and, when batters do get on, cleaning up efficiently (TEAM_FIELDING_DP). On paper, a lean toward more offense with crisp, error-free defense is a recipe for success.
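
The offensive/defensive model is fit on the imputed training data (code reproduced from the appendix):

lm_off <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR +
             TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_FIELDING_E, train_imputed)
summary(lm_off)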

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BASERUN_SB + TEAM_FIELDING_DP + 
##     TEAM_FIELDING_E, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.143  -8.756  -0.050   8.494  65.205 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      20.391917   3.301174   6.177 7.71e-10 ***
## TEAM_BATTING_H    0.049586   0.003078  16.108  < 2e-16 ***
## TEAM_BATTING_2B  -0.019814   0.008771  -2.259    0.024 *  
## TEAM_BATTING_3B   0.082551   0.016245   5.082 4.04e-07 ***
## TEAM_BATTING_HR   0.057850   0.007480   7.734 1.56e-14 ***
## TEAM_BASERUN_SB   0.027281   0.003804   7.171 1.00e-12 ***
## TEAM_FIELDING_DP -0.105804   0.012612  -8.389  < 2e-16 ***
## TEAM_FIELDING_E  -0.023849   0.001633 -14.602  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.16 on 2268 degrees of freedom
## Multiple R-squared:  0.3039, Adjusted R-squared:  0.3017 
## F-statistic: 141.4 on 7 and 2268 DF,  p-value: < 2.2e-16

Because of the number of missing values in the fielding predictors, we fit this regression on the imputed data set to limit the number of records excluded from the model. Overall this model is competitive and looks similar enough to the automated backward selection result despite having fewer independent variables. TEAM_BATTING_2B has a negative coefficient, consistent with the other models, which does not make much intuitive sense; the distribution of its values and its higher p-value might justify excluding it, but we will keep it in the model given that it should have predictive value on paper and meets the statistical baselines.

Let's confirm there aren't any collinearity issues with this model variant:
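
The variance inflation factors for the offensive model (appendix code):

vif(lm_off)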

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.601543         2.212394         2.704599         2.693294 
##  TEAM_BASERUN_SB TEAM_FIELDING_DP  TEAM_FIELDING_E 
##         1.386145         1.255963         1.817196

There is some collinearity, which is to be expected given the data available to us, but nothing problematic enough to require major changes.

Let’s confirm that the assumptions of OLS are not violated:
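
We generate the standard lm diagnostic plots (code reproduced from the appendix):

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_off)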

The diagnostic plots don't seem to indicate a violation of the assumptions: although the fitted values are mostly concentrated around 80, the residuals (and the standardized residuals) do not show much heteroskedasticity across them. There are some leverage points (observation 1342 in particular) that could skew the fit, although some teams simply have very strong seasons that are not invalid or anomalous enough to warrant exclusion.

Model Evaluation/Selection:

First we read in our evaluation data.
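
The evaluation data comes from the same GitHub repository (code reproduced from the appendix):

eval_data_url <- "https://raw.githubusercontent.com/waheeb123/Data-621/main/Homeworks/Homework-1/moneyball-evaluation-data.csv"
eval <- read.csv(eval_data_url)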

Now we can make some predictions on the test holdout data and compare results from our models before we select the best model to use on the evaluation data. First we compare the distributions of the test data to confirm we can use the same imputation methods we used to fill missing values for variables in the train data.

par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))
variables <- c("TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", 
               "TEAM_PITCHING_SO", "TEAM_FIELDING_DP")
for (i in 1:(length(variables))) {
    hist(test[[variables[i]]], main = variables[i], col = "lightblue")
}

The test data distributions are similar to the distributions observed in the train data for these variables, so the same imputation methods can be used for each of them.

test <- test |>
    select(-TEAM_BATTING_HBP) |>
    mutate(TEAM_BASERUN_SB = replace(TEAM_BASERUN_SB, is.na(TEAM_BASERUN_SB),
                                     median(TEAM_BASERUN_SB, na.rm=T)),
           TEAM_BASERUN_CS = replace(TEAM_BASERUN_CS, is.na(TEAM_BASERUN_CS),
                                     median(TEAM_BASERUN_CS, na.rm=T)),
           TEAM_PITCHING_SO = replace(TEAM_PITCHING_SO, is.na(TEAM_PITCHING_SO),
                                      median(TEAM_PITCHING_SO, na.rm=T)),
           TEAM_FIELDING_DP = replace(TEAM_FIELDING_DP, is.na(TEAM_FIELDING_DP),
                                      mean(TEAM_FIELDING_DP, na.rm=T)))

test <- test |>
    VIM::kNN(variable = "TEAM_BATTING_SO", k = 15, numFun = weighted.mean,
             weightDist = TRUE) |>
    select(-TEAM_BATTING_SO_imp)

# Predict using the model using all input variables
predict_all <- predict(lm_all, test)
predict_reduced <- predict(lm_all_reduced, test)

predict_transformed_reduced <- predict(lm_trans_reduced, test)

predict_select <- predict(lm_select, test)
predict_off <- predict(lm_off,test)
predict_trans_reduced <- predict(lm_trans_reduced, test)

We can use root mean squared error (RMSE) to compare the models above. This is one way to measure the performance of a model: in simple terms, a smaller RMSE indicates better performance when predicting on new data.
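Formally, for \(n\) held-out observations with predictions \(\hat{y}_i\) and actual values \(y_i\), \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}\); the helper function below computes exactly this.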

rmse <- function(c1, c2){
    sqrt(mean((c1 - c2)^2))
}
#Calculate RMSE and print to screen
rmse_all <- rmse(predict_all, test$TARGET_WINS)
rmse_reduced <- rmse(predict_reduced, test$TARGET_WINS)
rmse_transformed_reduced <- rmse(predict_transformed_reduced, test$TARGET_WINS)

print(rmse_all)
## [1] 12.29232
print(rmse_reduced)
## [1] 12.35444
print(rmse_transformed_reduced)
## [1] 12.67368

The model with the lowest RMSE of these three is our all-variable model without transformations. Before choosing a model for the evaluation data, let's also score the remaining candidates.

We can compute the same RMSE values directly from each fitted model with the rmse() function from the modelr package, which also lets us score the two models we have not yet evaluated.

# Calculate RMSE and print to screen
rmse_all <- modelr::rmse(lm_all, test)
rmse_reduced <- modelr::rmse(lm_all_reduced, test)
rmse_select <- modelr::rmse(lm_select, test)
rmse_off <- modelr::rmse(lm_off, test)
rmse_trans_reduced <- modelr::rmse(lm_trans_reduced, test)

print(glue('RMSE lm_all: {rmse_all}'))
## RMSE lm_all: 12.2923197085304
print(glue('RMSE lm_reduced: {rmse_reduced}'))
## RMSE lm_reduced: 12.3544360642616
print(glue('RMSE lm_select: {rmse_select}'))
## RMSE lm_select: 13.2136000264059
print(glue('RMSE lm_off: {rmse_off}'))
## RMSE lm_off: 12.5057843276575
print(glue('RMSE lm_transreduced {rmse_trans_reduced}'))
## RMSE lm_transreduced 12.6736755963075

Lastly, we can predict on our evaluation data. The model with the lowest RMSE is the all-variable model without transformations, but given the collinearity concerns it probably isn't fair to use that version for prediction, so we will use the reduced model and the offensive/defensive model for evaluation instead. Since we don't have TARGET_WINS in our evaluation data, we won't be able to evaluate model performance against actual win totals.

However, we can look at the distribution of predicted wins to make sure our model predicts reasonable values. Knowing what we know about baseball, average teams tend to win \(~80\) games in a season (out of 162 total regular season games).

# Predict and plot on evaluation data (no wins listed)
predict_all_eval <- as.data.frame(predict(lm_all, eval))

prediction_reduced <- predict(lm_all_reduced, eval)
predict_reduced_eval <- as.data.frame(prediction_reduced)

# Plot reduced model evaluation
ggplot(predict_reduced_eval,
       aes(x=prediction_reduced)) +
       geom_histogram(bins=15) +
       labs(x="Wins", 
            title="Predicted Wins (Reduced Model): Evaluation Data.")

These predicted win totals look roughly normal and are centered around 80 wins, which is expected.

predict_off_eval <- as.data.frame(predict(lm_off, eval))

ggplot(predict_off_eval, 
       aes(x =predict(lm_off,eval))) + geom_histogram(bins=15) + labs(x="Wins", title="Predicted Wins (Offensive Imputed Model): Evaluation Data.")

The offensive model's predicted win totals likewise look roughly normal and are centered around 80 wins, which is expected.

Conclusions:

Overall, we found relatively similar RMSE values across our models, suggesting comparable performance on this metric for the test holdout data. When seeking approval from our manager for decisions, we'd likely opt for the reduced model. This choice stems from its ability to evaluate possible variable combinations and its use of AIC (Akaike information criterion) for decision-making from a statistical standpoint. While this approach might not fully convince all managers, we can explain that certain predictor combinations were effective in minimizing our loss function, and that AIC is designed to strike a balance between goodness of fit and model complexity. However, if our manager prefers prioritizing judgment or intuition over statistical metrics to justify the predictors in the model, we would rely on the offensive model for any future forecasting needs.

The RMSE for each model is:

library(knitr)

# Create a dataframe with model names and RMSE values
model_rmse <- data.frame(
  Model = c("M_{reduced}", "M_{offensive}", "M_{transformed}", "M_{select}"),
  RMSE = c(13.16778, 13.17077, 13.40541, 13.7739)
)

# Print the table
kable(model_rmse, format = "markdown")
Model RMSE
M_{reduced} 13.16778
M_{offensive} 13.17077
M_{transformed} 13.40541
M_{select} 13.77390

Appendix: Report Code

Below is the code for this report to generate the models and charts above.

knitr::opts_chunk$set(eval = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(openintro)
library(glue)
library(tidyverse)
library(car)
library(ResourceSelection)
library(VIM)
library(pracma)
library(MASS)
select <- dplyr::select
library(dplyr)
library(knitr)
library(modelr)
df <- read.csv("https://raw.githubusercontent.com/waheeb123/Data-621/main/Homeworks/Homework-1/moneyball-training-data.csv")
df <- data.frame(df)
set.seed(30)

# Adding a 20% holdout of our input data for model evaluation later
train <- subset(df[sample(1:nrow(df)), ], select=-c(TEAM_BATTING_HBP))%>%
          sample_frac(0.7)

test  <- dplyr::anti_join(df, train, by = 'INDEX')
test <- subset(test, select=-c(INDEX))

# Note: this redefines train as the full dataset (minus INDEX), replacing the 70% sample above
train <- subset(df, select=-c(INDEX))
mean_wins <- mean(train$TARGET_WINS)
median_wins <- median(train$TARGET_WINS)
sd_wins <- sd(train$TARGET_WINS)

# Print summary stats
print(glue("The mean number of wins in a season is {round(mean_wins,2)}."))
print(glue("The median number of wins in a season is {median_wins}."))
print(glue("The standard deviation for number of wins in a season is {round(sd_wins,2)}."))
ggplot(train, aes(x=TARGET_WINS)) + 
  geom_histogram(fill = "skyblue") +
  labs(title = "Distribution of Wins (Histogram)", x = "Number of Wins", y = "Count")
ggplot(train, aes(x=TARGET_WINS)) + 
  geom_boxplot(fill="darkgrey") +
  labs(title = "Distribution of Wins (Boxplot)", x = "Number of Wins", y = "Count")
cMeans <- as.data.frame(round(colMeans(train, na.rm = TRUE), 1))
colnames(cMeans) <- NULL
kable(cMeans, format = "simple")

library(knitr)
kable(summary(train))
# First, store the summary in an object
summary_data <- summary(train)

# Create a data frame from the summary
summary_df <- as.data.frame(summary_data)

# Print the data frame
kable(summary_df)


par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))

variables <- c("TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP", "TEAM_PITCHING_SO", "TEAM_FIELDING_DP")

for (i in 1:(length(variables))) {
    hist(train[[variables[i]]], main = variables[i], col = "lightblue")
}
# Identify missing data by Feature and display percent breakout
missing <- colSums(df %>% sapply(is.na))
missing_pct <- round(missing / nrow(df) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))

# Drop the BATTING_HBP field
df <- df %>% 
  select(-TEAM_BATTING_HBP)
train_imputed <- train |>
    mutate(TEAM_BASERUN_SB = replace(TEAM_BASERUN_SB, is.na(TEAM_BASERUN_SB),
                                     median(TEAM_BASERUN_SB, na.rm=T)),
           TEAM_BASERUN_CS = replace(TEAM_BASERUN_CS, is.na(TEAM_BASERUN_CS),
                                     median(TEAM_BASERUN_CS, na.rm=T)),
           TEAM_PITCHING_SO = replace(TEAM_PITCHING_SO, is.na(TEAM_PITCHING_SO),
                                      median(TEAM_PITCHING_SO, na.rm=T)),
           TEAM_FIELDING_DP = replace(TEAM_FIELDING_DP, is.na(TEAM_FIELDING_DP),
                                      mean(TEAM_FIELDING_DP, na.rm=T))) |>
    select(-TEAM_BATTING_HBP)
    

train_imputed <- train_imputed |>
    VIM::kNN(variable = "TEAM_BATTING_SO", k = 15, numFun = weighted.mean,
             weightDist = TRUE) |>
    select(-TEAM_BATTING_SO_imp)
cor(train_imputed, df$TARGET_WINS)
train_cleaned <- train_imputed |> rename_all(~stringr::str_replace(.,"^TEAM_",""))
subset_batting <- train_cleaned |> select(contains('batting'))
kdepairs(subset_batting)
subset_pitching <- train_cleaned |> select(!contains('batting'), -TARGET_WINS)
kdepairs(subset_pitching)
lm_all <- lm(TARGET_WINS~., train_imputed)
summary(lm_all)
vif(lm_all)
lm_all_reduced <- step(lm_all, direction="backward", trace = 0)
summary(lm_all_reduced)
vif(lm_all_reduced)
lm_all_reduced <- update(lm_all_reduced, .~. - TEAM_BATTING_SO)
summary(lm_all_reduced)
lm_all_reduced <- update(lm_all_reduced, .~. - TEAM_PITCHING_H)
summary(lm_all_reduced)
vif(lm_all_reduced)
lm_all_reduced_hits <- update(lm_all_reduced, .~. - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)
summary(lm_all_reduced_hits)
lm_all_reduced_others <- update(lm_all_reduced, .~. - TEAM_BATTING_H)
summary(lm_all_reduced_others)
anova(lm_all_reduced_others,lm_all_reduced)
anova(lm_all_reduced_hits,lm_all_reduced)
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_all_reduced_hits)
train_imputed_transformed <- train_imputed
#Add a small constant to TEAM_PITCHING_SO so there are no 0 values.
train_imputed_transformed$TEAM_PITCHING_SO <- train_imputed_transformed$TEAM_PITCHING_SO + 0.001
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
#Compare how easy to understand transformations alter the distribution
hist(log(train_imputed_transformed$TEAM_PITCHING_SO),
     main = "Log Transformation", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_PITCHING_SO, 4),
     main = "Fourth Root Transformation", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_PITCHING_SO, 3),
     main = "Cube Root Transformation", col="lightblue")
hist(sqrt(train_imputed_transformed$TEAM_PITCHING_SO),
     main = "Square Root Transformation", col="lightblue")
bc <- boxcox(lm(train_imputed_transformed$TEAM_PITCHING_SO ~ 1),
             lambda = seq(-2, 2, length.out = 81),
             plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
lambda
variables <- c("TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_PITCHING_SO",
               "TEAM_BATTING_3B", "TEAM_BATTING_BB", "TEAM_PITCHING_H",
               "TEAM_PITCHING_BB", "TEAM_FIELDING_E")
for (i in 1:(length(variables))){
    #Add a small constant to columns with any 0 values
    if (sum(train_imputed_transformed[[variables[i]]] == 0) > 0){
        train_imputed_transformed[[variables[i]]] <-
            train_imputed_transformed[[variables[i]]] + 0.001
    }
}
for (i in 1:(length(variables))){
    if (i == 1){
        lambdas <- c()
    }
    bc <- boxcox(lm(train_imputed_transformed[[variables[i]]] ~ 1),
                 lambda = seq(-2, 2, length.out = 81),
                 plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]
    lambdas <- append(lambdas, lambda)
}
lambdas <- as.data.frame(cbind(variables, lambdas))
adj <- c("log", "fourth root", "square root", "log", "square", "square inverse", "cube root", "inverse")
lambdas <- cbind(lambdas, adj)
kable(lambdas, format = "simple")
par(mfrow=c(3, 3))
par(mai=c(.3,.3,.3,.3))
#Compare how easy to understand transformations alter the distribution
hist(log(train_imputed_transformed$TEAM_BASERUN_SB),
     main = "Log(TEAM_BASERUN_SB)", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_BASERUN_CS, 4),
     main = "Fourth Root(TEAM_BASERUN_CS)", col="lightblue")
hist(sqrt(train_imputed_transformed$TEAM_PITCHING_SO),
     main = "Square Root(TEAM_PITCHING_SO)", col="lightblue")
hist(log(train_imputed_transformed$TEAM_BATTING_3B),
     main = "Log(TEAM_BATTING_3B)", col="lightblue")
hist(train_imputed_transformed$TEAM_BATTING_BB^2,
     main = "TEAM_BATTING_BB SQUARED", col="lightblue")
hist(train_imputed_transformed$TEAM_PITCHING_H^-2,
     main = "TEAM_PITCHING_H INVERSE SQUARED", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_PITCHING_BB, 3),
     main = "Cube Root(TEAM_PITCHING_BB)", col="lightblue")
hist(train_imputed_transformed$TEAM_FIELDING_E^-1,
     main = "TEAM_FIELDING_E INVERSE", col="lightblue")
lm_trans <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + I(TEAM_BATTING_BB**2) + log(TEAM_BASERUN_SB + .0001) + I(TEAM_PITCHING_SO**.5) + I(TEAM_FIELDING_E**-1) + TEAM_FIELDING_DP, train_imputed)
summary(lm_trans)
lm_trans_reduced <- update(lm_trans, .~. - I(TEAM_PITCHING_SO**.5), train_imputed)
summary(lm_trans_reduced)
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_trans_reduced)
# Create model with select inputs (walks and hits allowed/gained)
lm_select <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_PITCHING_H + TEAM_PITCHING_BB, train)

summary(lm_select)
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_select)
# Plot selective model residuals
plot(lm_select$residuals)
# par(mfrow=c(2,2))
# par(mai=c(.3,.3,.3,.3))
# plot(TARGET_WINS ~ TEAM_BATTING_H +
#                    TEAM_BATTING_BB +
#                    TEAM_PITCHING_H +
#                    TEAM_PITCHING_BB,
#                    data=train)
ggplot(train, aes(x=TEAM_BATTING_H, y=TARGET_WINS)) + geom_point() + labs(x="Hits", y="Wins")
ggplot(train, aes(x=TEAM_BATTING_BB, y=TARGET_WINS)) + geom_point() + labs(x="Walks", y="Wins")
ggplot(train, aes(x=TEAM_PITCHING_H, y=TARGET_WINS)) + geom_point() + labs(x="Hits Allowed", y="Wins")
ggplot(train, aes(x=TEAM_PITCHING_BB, y=TARGET_WINS)) + geom_point()  + labs(x="Walks Allowed", y="Wins")


lm_off <- lm(TARGET_WINS ~ TEAM_BATTING_H+TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR +TEAM_BASERUN_SB +TEAM_FIELDING_DP+TEAM_FIELDING_E, train_imputed)

summary(lm_off)
vif(lm_off)
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_off)
eval_data_url <- "https://raw.githubusercontent.com/waheeb123/Data-621/main/Homeworks/Homework-1/moneyball-evaluation-data.csv"

eval <- read.csv(eval_data_url)
par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))
variables <- c("TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", 
               "TEAM_PITCHING_SO", "TEAM_FIELDING_DP")
for (i in 1:(length(variables))) {
    hist(test[[variables[i]]], main = variables[i], col = "lightblue")
}
test <- test |>
    select(-TEAM_BATTING_HBP) |>
    mutate(TEAM_BASERUN_SB = replace(TEAM_BASERUN_SB, is.na(TEAM_BASERUN_SB),
                                     median(TEAM_BASERUN_SB, na.rm=T)),
           TEAM_BASERUN_CS = replace(TEAM_BASERUN_CS, is.na(TEAM_BASERUN_CS),
                                     median(TEAM_BASERUN_CS, na.rm=T)),
           TEAM_PITCHING_SO = replace(TEAM_PITCHING_SO, is.na(TEAM_PITCHING_SO),
                                      median(TEAM_PITCHING_SO, na.rm=T)),
           TEAM_FIELDING_DP = replace(TEAM_FIELDING_DP, is.na(TEAM_FIELDING_DP),
                                      mean(TEAM_FIELDING_DP, na.rm=T)))

test <- test |>
    VIM::kNN(variable = "TEAM_BATTING_SO", k = 15, numFun = weighted.mean,
             weightDist = TRUE) |>
    select(-TEAM_BATTING_SO_imp)

# Predict using the model using all input variables
predict_all <- predict(lm_all, test)
predict_reduced <- predict(lm_all_reduced, test)

predict_transformed_reduced <- predict(lm_trans_reduced, test)

predict_select <- predict(lm_select, test)
predict_off <- predict(lm_off,test)
predict_trans_reduced <- predict(lm_trans_reduced, test)

rmse <- function(c1, c2){
    sqrt(mean((c1 - c2)^2))
}
#Calculate RMSE and print to screen
rmse_all <- rmse(predict_all, test$TARGET_WINS)
rmse_reduced <- rmse(predict_reduced, test$TARGET_WINS)
rmse_transformed_reduced <- rmse(predict_transformed_reduced, test$TARGET_WINS)

print(rmse_all)
print(rmse_reduced)
print(rmse_transformed_reduced)
# Calculate RMSE and print to screen
rmse_all <- modelr::rmse(lm_all, test)
rmse_reduced <- modelr::rmse(lm_all_reduced, test)
rmse_select <- modelr::rmse(lm_select, test)
rmse_off <- modelr::rmse(lm_off, test)
rmse_trans_reduced <- modelr::rmse(lm_trans_reduced, test)

print(glue('RMSE lm_all: {rmse_all}'))
print(glue('RMSE lm_reduced: {rmse_reduced}'))
print(glue('RMSE lm_select: {rmse_select}'))
print(glue('RMSE lm_off: {rmse_off}'))
print(glue('RMSE lm_transreduced {rmse_trans_reduced}'))
# Predict and plot on evaluation data (no wins listed)
predict_all_eval <- as.data.frame(predict(lm_all, eval))

prediction_reduced <- predict(lm_all_reduced, eval)
predict_reduced_eval <- as.data.frame(prediction_reduced)

# Plot reduced model evaluation
ggplot(predict_reduced_eval,
       aes(x=prediction_reduced)) +
       geom_histogram(bins=15) +
       labs(x="Wins", 
            title="Predicted Wins (Reduced Model): Evaluation Data.")
predict_off_eval <- as.data.frame(predict(lm_off, eval))

ggplot(predict_off_eval, 
       aes(x =predict(lm_off,eval))) + geom_histogram(bins=15) + labs(x="Wins", title="Predicted Wins (Offensive Imputed Model): Evaluation Data.")
library(knitr)

# Create a dataframe with model names and RMSE values
model_rmse <- data.frame(
  Model = c("M_{reduced}", "M_{offensive}", "M_{transformed}", "M_{select}"),
  RMSE = c(13.16778, 13.17077, 13.40541, 13.7739)
)

# Print the table
kable(model_rmse, format = "markdown")