library(tidyverse)
library(olsrr)
library(skimr)
library(DataExplorer)
library(corrplot)
library(fabletools)
library(ggfortify)
We will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and captures the team's performance for that year, with all statistics adjusted to match the performance of a 162-game season.
Here we load the data from the CSVs we have downloaded. The data is separated into a training set for building the model and an evaluation set for testing it. (Note that despite its name, test_1 holds the training data.)
test_1 <- read_csv("moneyball-training-data.csv", show_col_types = FALSE)
eval_1 <- read_csv("moneyball-evaluation-data.csv", show_col_types = FALSE)
glimpse(test_1)
## Rows: 2,276
## Columns: 17
## $ INDEX <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
## $ TARGET_WINS <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
## $ TEAM_BATTING_H <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
## $ TEAM_BATTING_2B <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
## $ TEAM_BATTING_3B <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
## $ TEAM_BATTING_HR <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
## $ TEAM_BATTING_BB <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
## $ TEAM_BATTING_SO <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
## $ TEAM_BASERUN_SB <dbl> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
## $ TEAM_BASERUN_CS <dbl> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
## $ TEAM_BATTING_HBP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
## $ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
## $ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
## $ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
## $ TEAM_FIELDING_E <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
## $ TEAM_FIELDING_DP <dbl> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
We can see that the expected number of records has been loaded.
We begin our data exploration by skimming the training data for immediate summary statistics, data types, and a visualization of distributions via mini histograms.
skim(test_1)
| Name | test_1 |
| Number of rows | 2276 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| numeric | 17 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 0 | 1.00 | 1268.46 | 736.35 | 1 | 630.75 | 1270.5 | 1915.50 | 2535 | ▇▇▇▇▇ |
| TARGET_WINS | 0 | 1.00 | 80.79 | 15.75 | 0 | 71.00 | 82.0 | 92.00 | 146 | ▁▁▇▅▁ |
| TEAM_BATTING_H | 0 | 1.00 | 1469.27 | 144.59 | 891 | 1383.00 | 1454.0 | 1537.25 | 2554 | ▁▇▂▁▁ |
| TEAM_BATTING_2B | 0 | 1.00 | 241.25 | 46.80 | 69 | 208.00 | 238.0 | 273.00 | 458 | ▁▆▇▂▁ |
| TEAM_BATTING_3B | 0 | 1.00 | 55.25 | 27.94 | 0 | 34.00 | 47.0 | 72.00 | 223 | ▇▇▂▁▁ |
| TEAM_BATTING_HR | 0 | 1.00 | 99.61 | 60.55 | 0 | 42.00 | 102.0 | 147.00 | 264 | ▇▆▇▅▁ |
| TEAM_BATTING_BB | 0 | 1.00 | 501.56 | 122.67 | 0 | 451.00 | 512.0 | 580.00 | 878 | ▁▁▇▇▁ |
| TEAM_BATTING_SO | 102 | 0.96 | 735.61 | 248.53 | 0 | 548.00 | 750.0 | 930.00 | 1399 | ▁▆▇▇▁ |
| TEAM_BASERUN_SB | 131 | 0.94 | 124.76 | 87.79 | 0 | 66.00 | 101.0 | 156.00 | 697 | ▇▃▁▁▁ |
| TEAM_BASERUN_CS | 772 | 0.66 | 52.80 | 22.96 | 0 | 38.00 | 49.0 | 62.00 | 201 | ▃▇▁▁▁ |
| TEAM_BATTING_HBP | 2085 | 0.08 | 59.36 | 12.97 | 29 | 50.50 | 58.0 | 67.00 | 95 | ▂▇▇▅▁ |
| TEAM_PITCHING_H | 0 | 1.00 | 1779.21 | 1406.84 | 1137 | 1419.00 | 1518.0 | 1682.50 | 30132 | ▇▁▁▁▁ |
| TEAM_PITCHING_HR | 0 | 1.00 | 105.70 | 61.30 | 0 | 50.00 | 107.0 | 150.00 | 343 | ▇▇▆▁▁ |
| TEAM_PITCHING_BB | 0 | 1.00 | 553.01 | 166.36 | 0 | 476.00 | 536.5 | 611.00 | 3645 | ▇▁▁▁▁ |
| TEAM_PITCHING_SO | 102 | 0.96 | 817.73 | 553.09 | 0 | 615.00 | 813.5 | 968.00 | 19278 | ▇▁▁▁▁ |
| TEAM_FIELDING_E | 0 | 1.00 | 246.48 | 227.77 | 65 | 127.00 | 159.0 | 249.25 | 1898 | ▇▁▁▁▁ |
| TEAM_FIELDING_DP | 286 | 0.87 | 146.39 | 26.23 | 52 | 131.00 | 149.0 | 164.00 | 228 | ▁▂▇▆▁ |
We have 2276 rows with 17 columns that are all numeric.
Note here that the average number of wins per team is 81 with a standard deviation of 16 wins, which is very close to the median of 82 wins. Combined with the boxplot below, this suggests a roughly normal distribution toward the middle, with extremely bad and extremely good teams deviating from the norm.
test_1 |>
  select(TARGET_WINS) |>
  ggplot(aes(x = TARGET_WINS)) +
  geom_boxplot() +
  labs(title = "Distribution of Season Wins Per Team", x = "Season Wins")
Our histograms suggest that most of our variables are not actually normally distributed, with a rightward skew being very common. Once we get to data preparation, we will want to adjust the distributions where outlier removal is not enough using Box-Cox transformations, since a variable being approximately normally distributed lends to its predictive power in a linear regression model.
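To put a number on that visual impression, here is a quick sketch computing the sample skewness of each column (the skew helper is our own, written inline to avoid an extra package dependency):

skew <- function(x) {
  # Simple sample skewness: third central moment scaled by sd cubed.
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

test_1 |>
  select(-INDEX) |>
  summarise(across(everything(), skew)) |>
  pivot_longer(everything(), names_to = "column", values_to = "skewness") |>
  arrange(desc(skewness))

Large positive values would confirm the rightward skew we see in the mini histograms.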
Most immediately noticeable in the skimmed summary is that four columns have histograms showing outliers up to twenty times the 75th percentile: TEAM_PITCHING_H (hits allowed by pitchers), TEAM_PITCHING_BB (walks allowed by pitchers), TEAM_PITCHING_SO (strikeouts by pitchers), and TEAM_FIELDING_E (errors by the team). Outliers of this magnitude are usually erroneously entered data, or perhaps a totals row, and warrant a closer look.
test_1 |>
slice_max(order_by = TEAM_PITCHING_H, n = 5)
## # A tibble: 5 × 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1769 36 1674 216 54
## 2 1347 0 891 135 0
## 3 2380 41 992 263 20
## 4 459 23 1458 220 35
## 5 1494 108 1188 338 0
## # ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
## # TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
## # TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
## # TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
## # TEAM_FIELDING_DP <dbl>
test_1 |>
slice_max(order_by = TEAM_PITCHING_BB, n = 5)
## # A tibble: 5 × 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1494 108 1188 338 0
## 2 2380 41 992 263 20
## 3 309 71 1325 162 76
## 4 1202 60 1518 162 68
## 5 1492 95 1494 261 68
## # ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
## # TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
## # TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
## # TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
## # TEAM_FIELDING_DP <dbl>
test_1 |>
slice_max(order_by = TEAM_PITCHING_SO, n = 5)
## # A tibble: 5 × 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2380 41 992 263 20
## 2 1494 108 1188 338 0
## 3 1 39 1445 194 39
## 4 2051 46 1254 154 127
## 5 309 71 1325 162 76
## # ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
## # TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
## # TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
## # TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
## # TEAM_FIELDING_DP <dbl>
test_1 |>
slice_max(order_by = TEAM_FIELDING_E, n = 5)
## # A tibble: 5 × 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 459 23 1458 220 35
## 2 1347 0 891 135 0
## 3 2050 14 1437 148 56
## 4 1769 36 1674 216 54
## 5 433 39 1620 201 44
## # ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
## # TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
## # TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
## # TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
## # TEAM_FIELDING_DP <dbl>
suppressWarnings(
  test_1 |>
    select(TARGET_WINS, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E) |>
    plot_scatterplot(by = "TARGET_WINS", nrow = 2L, ncol = 2L)
)
Looking at the top 5 of each of these columns, there does not seem to be an obvious totals row, or any single row completely filled with erroneous data: the top 5 rows differ across categories rather than being one row full of outliers. Still, if these extremely large outliers were left in, our model would be greatly affected; we can see how far these points sit from the general trend in the scatter plots against wins. So we will cap the extreme values relative to the IQR when we prepare the data.
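Before settling on a rule, it is worth quantifying how much data the traditional 1.5 x IQR cutoff would flag in these four columns; a quick sketch (the outlier_count helper is ours):

outlier_count <- function(x) {
  # Count values falling outside 1.5 * IQR beyond the quartiles.
  q <- quantile(x, c(.25, .75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

test_1 |>
  select(TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E) |>
  summarise(across(everything(), outlier_count))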
While the majority of our columns have no missing values, a few do. Most strikingly, TEAM_BATTING_HBP is missing data in 2,085 rows, or 92% of our data. This column will have to be dropped: with so little of it present, imputation would not be useful and the column cannot help our model.
TEAM_BASERUN_CS also has a large amount of missing data, with 772 rows holding NA values. This could be a candidate for imputation or removal, but since its coefficient of variation is below 1, we will keep it and impute.
TEAM_PITCHING_SO, TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_DP all have comparatively lower proportions of missing data that will easily be dealt with via imputation.
plot_missing(test_1, missing_only = TRUE, title = "Percent Rows Missing Values by Column")
Perhaps the most important thing to explore before building a linear model is whether any of our columns have a strong direct correlation with wins themselves, since we want to select variables with strong correlation when building the model.
We will evaluate correlation with a correlation plot.
M <- test_1 %>%
select(-INDEX,-TEAM_BATTING_HBP) %>%
cor()
corrplot(M, type = 'lower', tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey")
corrplot(M, method = 'number', type = 'lower', tl.srt = 45, tl.cex = 0.5, number.cex = 0.55, na.label = "square", na.label.col = "lightgrey")
Note that many correlations are missing because of the NA values in some of our columns, so we will revisit this once data preparation is finished. In the meantime, pairwise-complete correlations against TARGET_WINS give us a preview of which variables look promising, as sketched below.
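A minimal sketch, using pairwise-complete observations so each correlation is computed only on the rows where both columns are present:

# Correlations with TARGET_WINS, sidestepping the NA gaps in the matrix above.
cors <- cor(select(test_1, -INDEX, -TEAM_BATTING_HBP),
            use = "pairwise.complete.obs")
sort(cors[, "TARGET_WINS"], decreasing = TRUE)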
After exploring our data we move on to preparing it to be more useful for our goal of creating a linear model to predict team wins.
As previously discussed, it is important to deal with outliers lying beyond 2 * IQR, as these have an outsized effect on the individual coefficients, skewing them to be larger or smaller than they should be. Here we chose to handle the outliers in the predictor variables by capping their values at 2 * IQR beyond the quartiles, rather than removing rows, so as not to distort the data too much. Traditionally we would filter and cap values at 1.5 * IQR, but with the heavy skew in some of our columns that would alter too much of the data. Notice in the scatter plots that although we have introduced an artificial cap at the rightmost ends, the axes are now reduced by up to 10x.
test_2 <- test_1 |>
  select(-TARGET_WINS)

test_2 <- cbind(TARGET_WINS = test_1$TARGET_WINS, as_tibble(
  lapply(test_2, function(x) {
    # Cap each predictor at 2 * IQR beyond the quartiles. The lower-bound
    # replace() must be assigned back to x, otherwise its result is
    # silently discarded before the upper-bound replace runs.
    q1 <- quantile(x, .25, na.rm = TRUE)
    q3 <- quantile(x, .75, na.rm = TRUE)
    iqr <- q3 - q1
    x <- replace(x, x < (q1 - 2 * iqr), q1 - 2 * iqr)
    replace(x, x > (q3 + 2 * iqr), q3 + 2 * iqr)
  })
))
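As a quick numeric check (a sketch, not part of the pipeline), the maxima of the four heavy-tailed columns should now sit at their respective q3 + 2 * IQR caps:

# Verify the caps took effect on the heavy-tailed columns.
test_2 |>
  summarise(across(c(TEAM_PITCHING_H, TEAM_PITCHING_BB,
                     TEAM_PITCHING_SO, TEAM_FIELDING_E),
                   ~ max(.x, na.rm = TRUE)))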
suppressWarnings(
  test_2 |>
    select(TARGET_WINS, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E) |>
    plot_scatterplot(by = "TARGET_WINS", nrow = 2L, ncol = 2L)
)
When there are missing values in our data, fitting a linear model will silently drop every row that contains an NA. To avoid this loss of data we will handle the missing values ourselves. To begin, we drop the TEAM_BATTING_HBP column. Though unrelated to missing values, we also drop INDEX, as it is not useful for generating the model.
test_2 <- test_2 |>
drop_columns(c("TEAM_BATTING_HBP", "INDEX"))
Next we deal with the other columns that have missing values, using mean imputation (with means computed after the outlier capping above).
test_2 <- test_2 |>
  mutate(across(c(TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS,
                  TEAM_PITCHING_SO, TEAM_FIELDING_DP),
                # Fill NAs with the column mean of the non-missing values.
                ~ replace_na(.x, mean(.x, na.rm = TRUE))))
print(paste0("The number of remaining NAs in the data is: ", sum(is.na(test_2))))
## [1] "The number of remaining NAs in the data is: 0"
After we have capped outliers and imputed missing values, our distributions will look different, so we want to inspect them before deciding on any transformations. We can now identify the distributions we want to transform:
test_2 |>
plot_histogram(nrow = 3L, ncol = 3L)
We manually tune the lambda parameters used for the Box-Cox transformations in an attempt to normalize these distributions. Note that for BOX_TEAM_FIELDING_E and BOX_TEAM_PITCHING_H it is very difficult to obtain a normal distribution due to the cap, so we will likely be better served not using these in our model.
test_2 <- test_2 %>%
mutate(BOX_TEAM_BASERUN_SB = box_cox(TEAM_BASERUN_SB, 0.57),
BOX_TEAM_FIELDING_E = box_cox(TEAM_FIELDING_E, -0.42),
BOX_TEAM_PITCHING_H = box_cox(TEAM_PITCHING_H, -0.69),
BOX_TEAM_BATTING_3B = box_cox(TEAM_BATTING_3B, -.001),
BOX_TEAM_BATTING_HR = box_cox(TEAM_BATTING_HR, 0.50),
BOX_TEAM_PITCHING_HR = box_cox(TEAM_PITCHING_HR, 0.57)
)
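As a cross-check on the hand-tuned lambdas, lambda can also be estimated by profile likelihood; a sketch using MASS::boxcox (MASS ships with R) on an intercept-only model for one column:

# Profile-likelihood estimate of lambda for TEAM_FIELDING_E; MASS:: is used
# directly so that attaching MASS does not mask dplyr::select.
bc <- MASS::boxcox(lm(TEAM_FIELDING_E ~ 1, data = test_2), plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda maximizing the log-likelihood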
test_2 |>
  dplyr::select(BOX_TEAM_BASERUN_SB, BOX_TEAM_FIELDING_E, BOX_TEAM_PITCHING_H,
                BOX_TEAM_BATTING_3B, BOX_TEAM_BATTING_HR, BOX_TEAM_PITCHING_HR) |>
  plot_histogram(nrow = 2L, ncol = 3L)
Often it is not enough to use the existing data even after transforming it; interactions between variables can affect how well a linear regression fits. For example, for a general batting evaluation variable we can divide TEAM_BATTING_H by TEAM_BATTING_SO, giving a ratio of how much more likely a batter is to make a hit than to strike out, something that logically should feed directly into winning more games. We also generate a similar ratio variable for pitching.
test_2 <- test_2 %>%
mutate(BATTING = case_when(TEAM_BATTING_SO > 0 ~ TEAM_BATTING_H/TEAM_BATTING_SO, .default = TEAM_BATTING_H),
PITCHING = TEAM_PITCHING_SO / TEAM_PITCHING_H)
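As a quick sanity check (a sketch), we can confirm the engineered ratios carry some signal about wins before relying on them:

# Correlation of the new ratio features with the response; no NAs remain
# at this point, so plain cor() is fine.
test_2 |>
  summarise(r_batting  = cor(BATTING, TARGET_WINS),
            r_pitching = cor(PITCHING, TARGET_WINS))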
After all of our data preparation is done, we revisit the correlation plot to see whether any of our transformations have majorly changed the correlation values. Unfortunately, the correlations with TARGET_WINS are essentially unchanged. We also see that the Box-Cox transformation behind BOX_TEAM_BATTING_3B produced non-finite values (its correlations are missing), so we will want to drop that column rather than use it.
M <- test_2 %>%
cor()
corrplot(M, type = 'lower', tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey")
corrplot(M, method = 'number', type = 'lower', tl.srt = 45, tl.cex = 0.5, number.cex = 0.35, na.label = "square", na.label.col = "lightgrey")
The first model we build is a baseline linear regression on the untransformed data, which has not gone through any of our processing. This is to determine whether our transformations may have had the opposite effect and weakened the regression model. We use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.
model_1 <- lm(TARGET_WINS ~ ., data = test_1)
model_1 <- ols_step_both_p(model_1)
summary(model_1$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.6103 -6.6903 -0.0775 6.5444 27.9073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.169860 6.633370 8.769 < 2e-16 ***
## TEAM_BATTING_H 0.034994 0.004687 7.466 1.41e-13 ***
## TEAM_FIELDING_E -0.155842 0.009923 -15.705 < 2e-16 ***
## TEAM_BASERUN_SB 0.035711 0.008671 4.118 4.03e-05 ***
## TEAM_BATTING_BB 0.039641 0.003356 11.813 < 2e-16 ***
## TEAM_FIELDING_DP -0.113055 0.013133 -8.609 < 2e-16 ***
## TEAM_BATTING_3B 0.162367 0.022164 7.326 3.90e-13 ***
## TEAM_BATTING_2B -0.069446 0.009322 -7.449 1.59e-13 ***
## TEAM_BATTING_HR 0.096828 0.009421 10.278 < 2e-16 ***
## TEAM_BATTING_SO -0.014586 0.004015 -3.633 0.00029 ***
## TEAM_BASERUN_CS 0.051942 0.018212 2.852 0.00440 **
## TEAM_PITCHING_SO -0.006658 0.003245 -2.052 0.04034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.554 on 1474 degrees of freedom
## (790 observations deleted due to missingness)
## Multiple R-squared: 0.4378, Adjusted R-squared: 0.4336
## F-statistic: 104.3 on 11 and 1474 DF, p-value: < 2.2e-16
Model 1 gives us an adjusted R^2 of 0.4336 with a residual standard error of 9.554.
autoplot(model_1$model)
The residuals for model_1 are normally distributed, as seen in the Normal Q-Q plot, and are centered on 0. However, the variance of the residuals does not appear constant: in the Residuals vs Fitted plot the variance is largest around a fitted value of 80 and decreases on both sides, a mild form of heteroscedasticity. The residuals can otherwise be considered independent, as there is no clear pattern beyond this change in variance.
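That suspicion of non-constant variance can be checked formally; a sketch using olsrr's Breusch-Pagan test (olsrr is already loaded), where a small p-value indicates heteroscedastic residuals:

# Breusch-Pagan test of constant residual variance for model 1.
ols_test_breusch_pagan(model_1$model)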
The second model we generate is a linear regression on the transformed dataset, excluding our Box-Cox transformed columns. This is to determine whether the Box-Cox transformations may have had the opposite effect on the regression model and weakened it. We again use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.
model_2 <- lm(TARGET_WINS ~ .-BOX_TEAM_BASERUN_SB -BOX_TEAM_FIELDING_E -BOX_TEAM_PITCHING_H -BOX_TEAM_BATTING_3B -BOX_TEAM_BATTING_HR -BOX_TEAM_PITCHING_HR, data = test_2)
model_2 <- ols_step_both_p(model_2)
summary(model_2$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.637 -8.395 0.205 8.086 79.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.599729 6.940483 1.527 0.126843
## TEAM_BATTING_H 0.034565 0.003959 8.731 < 2e-16 ***
## TEAM_BATTING_BB 0.060941 0.008297 7.345 2.86e-13 ***
## TEAM_FIELDING_DP -0.115499 0.013312 -8.676 < 2e-16 ***
## TEAM_FIELDING_E -0.054642 0.005440 -10.044 < 2e-16 ***
## TEAM_BASERUN_SB 0.052924 0.005887 8.990 < 2e-16 ***
## TEAM_BATTING_3B 0.125732 0.017567 7.157 1.11e-12 ***
## TEAM_PITCHING_HR 0.048300 0.008968 5.386 7.94e-08 ***
## BATTING 0.006196 0.002121 2.921 0.003524 **
## TEAM_BASERUN_CS -0.049032 0.019678 -2.492 0.012783 *
## TEAM_PITCHING_H 0.016394 0.003099 5.290 1.34e-07 ***
## TEAM_PITCHING_BB -0.039449 0.007150 -5.517 3.84e-08 ***
## TEAM_BATTING_SO -0.028624 0.006730 -4.253 2.19e-05 ***
## PITCHING 34.787007 9.281586 3.748 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.13 on 2262 degrees of freedom
## Multiple R-squared: 0.3095, Adjusted R-squared: 0.3055
## F-statistic: 77.97 on 13 and 2262 DF, p-value: < 2.2e-16
Model 2 gives us an adjusted R^2 of 0.3055 with a residual standard error of 13.13. It seems our transformations have actually weakened the linear model; the outliers we capped were likely genuinely informative about whether a team was winning games.
autoplot(model_2$model)
The residuals for model_2 are normally distributed, as seen in the Normal Q-Q plot with acceptable deviations at the ends, and are centered on 0. The variance of the residuals is slightly less heteroscedastic because of the outlier capping we have done, as can be seen in the Residuals vs Fitted plot. The residuals can otherwise be considered independent, as there is no clear pattern beyond this change in variance.
The third model we generate is a linear regression on the transformed dataset including our Box-Cox transformed columns. This is to determine whether the Box-Cox transformations may have helped the regression model. We again use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.
model_3 <- lm(TARGET_WINS ~ ., data = test_2)
model_3 <- ols_step_both_p(model_3)
summary(model_3$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.222 -7.946 0.244 8.231 72.879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.868e+04 5.459e+03 7.086 1.84e-12 ***
## TEAM_BATTING_H 4.201e-02 4.249e-03 9.885 < 2e-16 ***
## TEAM_BATTING_BB 5.273e-02 6.741e-03 7.823 7.82e-15 ***
## TEAM_FIELDING_DP -1.233e-01 1.325e-02 -9.310 < 2e-16 ***
## BOX_TEAM_FIELDING_E -1.186e+02 1.147e+01 -10.340 < 2e-16 ***
## TEAM_BATTING_3B 1.438e-01 1.766e-02 8.143 6.29e-16 ***
## BOX_TEAM_BASERUN_SB 3.702e-01 4.377e-02 8.457 < 2e-16 ***
## TEAM_PITCHING_HR 4.095e-02 8.986e-03 4.558 5.44e-06 ***
## TEAM_PITCHING_BB -3.007e-02 5.938e-03 -5.065 4.42e-07 ***
## TEAM_PITCHING_H 7.607e-02 1.091e-02 6.975 3.99e-12 ***
## BOX_TEAM_PITCHING_H -2.676e+04 3.802e+03 -7.037 2.59e-12 ***
## TEAM_BASERUN_CS -5.705e-02 1.958e-02 -2.913 0.00361 **
## TEAM_PITCHING_SO 4.081e-02 8.590e-03 4.751 2.15e-06 ***
## PITCHING -8.186e+01 1.471e+01 -5.564 2.95e-08 ***
## BATTING 6.873e-03 2.396e-03 2.868 0.00417 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.98 on 2261 degrees of freedom
## Multiple R-squared: 0.3255, Adjusted R-squared: 0.3214
## F-statistic: 77.95 on 14 and 2261 DF, p-value: < 2.2e-16
Model 3 gives us an adjusted R^2 of 0.3214 with a residual standard error of 12.98. Although the Box-Cox transformed variables seem to have helped relative to model 2, overall this is still a worse model than the one built on the original data. Once again, the outliers we capped were likely genuinely informative about whether a team was winning games.
autoplot(model_3$model)
The residuals for model_3 are normally distributed, as seen in the Normal Q-Q plot with acceptable deviations at the ends, and are centered on 0. The variance of the residuals is slightly less heteroscedastic than in both previous models because of the outlier capping and transformations, as can be seen in the Residuals vs Fitted plot. The residuals can otherwise be considered independent, as there is no clear pattern beyond this change in variance.
Keeping model_1, built from the original data, seems like the best choice here: not only does it have the highest adjusted R^2 and the lowest residual standard error, its coefficients are also easier to interpret than those of the Box-Cox transformed model_3.
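A sketch tabulating the three fits side by side, pulling adjusted R^2 and residual standard error from each summary:

# Side-by-side comparison of the three candidate models.
fits <- list(model_1 = model_1$model,
             model_2 = model_2$model,
             model_3 = model_3$model)
tibble(model  = names(fits),
       adj_r2 = sapply(fits, \(m) summary(m)$adj.r.squared),
       sigma  = sapply(fits, \(m) summary(m)$sigma))

Note that model_1 was fit on fewer rows (790 observations were deleted due to missingness), so the comparison is indicative rather than strictly apples-to-apples.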
Do note that we have to deal with NA values in the evaluation dataset to get predictions from this model for every row. Here we convert them all to 0, since the pipeline behind model_1 did not itself alter any NAs (lm simply dropped those rows during training).
predictions <- predict(model_1$model,
                       eval_1 |> mutate(across(where(is.numeric), ~ replace_na(.x, 0))))
write_csv(as_tibble(predictions),"evaluated.csv")
Sometimes the data you are dealt, despite its apparent flaws, is the best data to work with for training your model.