621assignment1.knit

Assignment 1

Daniel DeBonis

Background

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).

Importing the Data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- read.csv("C:/Users/ddebo/Downloads/moneyball-training-data.csv")

Exploratory Data Analysis

str(data)

## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

summary(data)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

The first thing that stands out is the presence of NAs in six variables. The data set extends back to 1871, before some statistics, such as batters hit by a pitch or runners caught stealing, were being recorded; each began to be tracked later in that decade. Even stolen bases were not recorded until 1877, and the rules behind them were not solidified until the 1890s. That said, the hit-by-pitch column is missing 2,085 of its 2,276 values (about 92%), which renders it useless for this endeavor.
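The share of missing values per column can be checked directly rather than read off the summary. The following is a minimal sketch in base R on a made-up toy data frame; the column names mirror the real ones, but the values and the 50% cutoff are assumptions chosen only for illustration.

```r
# Toy stand-in for the training data; values are invented for illustration.
toy <- data.frame(
  TARGET_WINS      = c(39, 70, 86, 70, 82),
  TEAM_BASERUN_CS  = c(NA, 28, 27, NA, 39),
  TEAM_BATTING_HBP = c(NA, NA, NA, NA, 55)
)

# Percentage of missing values in each column
missing_pct <- round(100 * colMeans(is.na(toy)), 1)
missing_pct

# Columns where more than half of the values are missing (hypothetical cutoff)
mostly_missing <- names(missing_pct)[missing_pct > 50]
mostly_missing
```

Applied to the real data, the same `colMeans(is.na(data))` call would show the hit-by-pitch column at roughly 92% missing and caught stealing at roughly 34%.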

summary_table <- data |>
  select(where(is.numeric)) %>%
  # For each column, compute the desired statistics
  summarize(across(everything(), 
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd   = ~sd(.x, na.rm = TRUE),
                        min  = ~min(.x, na.rm = TRUE),
                        max  = ~max(.x, na.rm = TRUE),
                        missing = ~sum(is.na(.x))))) %>%
  # Convert from wide to long format
  pivot_longer(
    cols = everything(),
    names_to = c("Variable", "Statistic"),
    names_pattern = "^(.*)_(mean|sd|min|max|missing)$",
    values_to = "Value"
  ) %>%
  pivot_wider(names_from = Statistic, values_from = Value)

summary_table

## # A tibble: 17 × 6
##    Variable           mean     sd   min   max missing
##    <chr>             <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1 INDEX            1268.   736.      1  2535       0
##  2 TARGET_WINS        80.8   15.8     0   146       0
##  3 TEAM_BATTING_H   1469.   145.    891  2554       0
##  4 TEAM_BATTING_2B   241.    46.8    69   458       0
##  5 TEAM_BATTING_3B    55.2   27.9     0   223       0
##  6 TEAM_BATTING_HR    99.6   60.5     0   264       0
##  7 TEAM_BATTING_BB   502.   123.      0   878       0
##  8 TEAM_BATTING_SO   736.   249.      0  1399     102
##  9 TEAM_BASERUN_SB   125.    87.8     0   697     131
## 10 TEAM_BASERUN_CS    52.8   23.0     0   201     772
## 11 TEAM_BATTING_HBP   59.4   13.0    29    95    2085
## 12 TEAM_PITCHING_H  1779.  1407.   1137 30132       0
## 13 TEAM_PITCHING_HR  106.    61.3     0   343       0
## 14 TEAM_PITCHING_BB  553.   166.      0  3645       0
## 15 TEAM_PITCHING_SO  818.   553.      0 19278     102
## 16 TEAM_FIELDING_E   246.   228.     65  1898       0
## 17 TEAM_FIELDING_DP  146.    26.2    52   228     286

# Convert data to long format for plotting
long_df <- data |>
  select(where(is.numeric)) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value")

# Plot histogram for each variable
ggplot(long_df, aes(x = Value)) +
  geom_histogram(fill = "steelblue", bins = 30, color = "black") +
  facet_wrap(~Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Each Numeric Variable", x = "Value", y = "Count")

## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_bin()`).

Many of the variables appear approximately normally distributed, most notably our target variable, TARGET_WINS. Possible predictors that also appear roughly normal are doubles, walks, hit-by-pitches, and double plays. Home runs, batting strikeouts, and home runs allowed are bimodal. Both baserunning statistics and triples show a notable right skew. The rest of the variables show a very extreme right skew. To confirm this, I took a closer look at the mean, median, and maximum of these variables: in each case the maximum is far higher than the mean would suggest, which also produces a very large standard deviation. For example, hits allowed has a mean of 1,779 but a maximum of 30,132, and pitching strikeouts has a mean of 818 but a maximum of 19,278.
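Skew can also be quantified instead of judged by eye from the histograms. The sketch below defines a simple moment-based skewness helper in base R and applies it to two simulated variables; the helper name `skewness` and the simulated data are assumptions for illustration, not part of the original analysis.

```r
# Moment-based sample skewness: values near 0 suggest symmetry,
# large positive values suggest a long right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
symmetric_var <- rnorm(1000, mean = 80, sd = 15)  # simulated, roughly bell-shaped
skewed_var    <- rexp(1000, rate = 1 / 250)       # simulated, long right tail

sapply(list(symmetric = symmetric_var, skewed = skewed_var), skewness)
```

On the real data, something like `sapply(data[, -1], skewness)` would give one value per variable, which makes it easier to decide which predictors are candidates for a transformation.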

long_df <- data |>
  select(-INDEX) |>
  pivot_longer(
    cols = everything(),
    names_to = "Variable",
    values_to = "Value"
  )

# Boxplot of all numeric variables except INDEX
ggplot(long_df, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "steelblue") +
  coord_cartesian(ylim = c(0, 2000)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Boxplot of All Numeric Variables", x = "", y = "Value")

## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The very large range of hits allowed required me to crop the plot at 2,000. Team hits, walks allowed, and pitching strikeouts also have outliers beyond that cutoff, though not to the same degree.

cor_matrix <- cor(data[,-1], use="pairwise.complete.obs")
cor_matrix

##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS       1.00000000    0.388767521      0.28910365     0.142608411
## TEAM_BATTING_H    0.38876752    1.000000000      0.56284968     0.427696575
## TEAM_BATTING_2B   0.28910365    0.562849678      1.00000000    -0.107305824
## TEAM_BATTING_3B   0.14260841    0.427696575     -0.10730582     1.000000000
## TEAM_BATTING_HR   0.17615320   -0.006544685      0.43539729    -0.635566946
## TEAM_BATTING_BB   0.23255986   -0.072464013      0.25572610    -0.287235841
## TEAM_BATTING_SO  -0.03175071   -0.463853571      0.16268519    -0.669781188
## TEAM_BASERUN_SB   0.13513892    0.123567797     -0.19975724     0.533506448
## TEAM_BASERUN_CS   0.02240407    0.016705668     -0.09981406     0.348764919
## TEAM_BATTING_HBP  0.07350424   -0.029112176      0.04608475    -0.174247154
## TEAM_PITCHING_H  -0.10993705    0.302693709      0.02369219     0.194879411
## TEAM_PITCHING_HR  0.18901373    0.072853119      0.45455082    -0.567836679
## TEAM_PITCHING_BB  0.12417454    0.094193027      0.17805420    -0.002224148
## TEAM_PITCHING_SO -0.07843609   -0.252656790      0.06479231    -0.258818931
## TEAM_FIELDING_E  -0.17648476    0.264902478     -0.23515099     0.509778447
## TEAM_FIELDING_DP -0.03485058    0.155383321      0.29087998    -0.323074847
##                  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS          0.176153200      0.23255986     -0.03175071
## TEAM_BATTING_H      -0.006544685     -0.07246401     -0.46385357
## TEAM_BATTING_2B      0.435397293      0.25572610      0.16268519
## TEAM_BATTING_3B     -0.635566946     -0.28723584     -0.66978119
## TEAM_BATTING_HR      1.000000000      0.51373481      0.72706935
## TEAM_BATTING_BB      0.513734810      1.00000000      0.37975087
## TEAM_BATTING_SO      0.727069348      0.37975087      1.00000000
## TEAM_BASERUN_SB     -0.453578426     -0.10511564     -0.25448923
## TEAM_BASERUN_CS     -0.433793868     -0.13698837     -0.21788137
## TEAM_BATTING_HBP     0.106181160      0.04746007      0.22094219
## TEAM_PITCHING_H     -0.250145481     -0.44977762     -0.37568637
## TEAM_PITCHING_HR     0.969371396      0.45955207      0.66717889
## TEAM_PITCHING_BB     0.136927564      0.48936126      0.03700514
## TEAM_PITCHING_SO     0.184707564     -0.02075682      0.41623330
## TEAM_FIELDING_E     -0.587339098     -0.65597081     -0.58466444
## TEAM_FIELDING_DP     0.448985348      0.43087675      0.15488939
##                  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS           0.13513892      0.02240407       0.07350424
## TEAM_BATTING_H        0.12356780      0.01670567      -0.02911218
## TEAM_BATTING_2B      -0.19975724     -0.09981406       0.04608475
## TEAM_BATTING_3B       0.53350645      0.34876492      -0.17424715
## TEAM_BATTING_HR      -0.45357843     -0.43379387       0.10618116
## TEAM_BATTING_BB      -0.10511564     -0.13698837       0.04746007
## TEAM_BATTING_SO      -0.25448923     -0.21788137       0.22094219
## TEAM_BASERUN_SB       1.00000000      0.65524480      -0.06400498
## TEAM_BASERUN_CS       0.65524480      1.00000000      -0.07051390
## TEAM_BATTING_HBP     -0.06400498     -0.07051390       1.00000000
## TEAM_PITCHING_H       0.07328505     -0.05200781      -0.02769699
## TEAM_PITCHING_HR     -0.41651072     -0.42256605       0.10675878
## TEAM_PITCHING_BB      0.14641513     -0.10696124       0.04785137
## TEAM_PITCHING_SO     -0.13712861     -0.21022274       0.22157375
## TEAM_FIELDING_E       0.50963090      0.04832189       0.04178971
## TEAM_FIELDING_DP     -0.49707763     -0.21424801      -0.07120824
##                  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS          -0.10993705       0.18901373      0.124174536
## TEAM_BATTING_H        0.30269371       0.07285312      0.094193027
## TEAM_BATTING_2B       0.02369219       0.45455082      0.178054204
## TEAM_BATTING_3B       0.19487941      -0.56783668     -0.002224148
## TEAM_BATTING_HR      -0.25014548       0.96937140      0.136927564
## TEAM_BATTING_BB      -0.44977762       0.45955207      0.489361263
## TEAM_BATTING_SO      -0.37568637       0.66717889      0.037005141
## TEAM_BASERUN_SB       0.07328505      -0.41651072      0.146415134
## TEAM_BASERUN_CS      -0.05200781      -0.42256605     -0.106961236
## TEAM_BATTING_HBP     -0.02769699       0.10675878      0.047851371
## TEAM_PITCHING_H       1.00000000      -0.14161276      0.320676162
## TEAM_PITCHING_HR     -0.14161276       1.00000000      0.221937505
## TEAM_PITCHING_BB      0.32067616       0.22193750      1.000000000
## TEAM_PITCHING_SO      0.26724807       0.20588053      0.488498653
## TEAM_FIELDING_E       0.66775901      -0.49314447     -0.022837561
## TEAM_FIELDING_DP     -0.22865059       0.43917040      0.324457226
##                  TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS           -0.07843609     -0.17648476      -0.03485058
## TEAM_BATTING_H        -0.25265679      0.26490248       0.15538332
## TEAM_BATTING_2B        0.06479231     -0.23515099       0.29087998
## TEAM_BATTING_3B       -0.25881893      0.50977845      -0.32307485
## TEAM_BATTING_HR        0.18470756     -0.58733910       0.44898535
## TEAM_BATTING_BB       -0.02075682     -0.65597081       0.43087675
## TEAM_BATTING_SO        0.41623330     -0.58466444       0.15488939
## TEAM_BASERUN_SB       -0.13712861      0.50963090      -0.49707763
## TEAM_BASERUN_CS       -0.21022274      0.04832189      -0.21424801
## TEAM_BATTING_HBP       0.22157375      0.04178971      -0.07120824
## TEAM_PITCHING_H        0.26724807      0.66775901      -0.22865059
## TEAM_PITCHING_HR       0.20588053     -0.49314447       0.43917040
## TEAM_PITCHING_BB       0.48849865     -0.02283756       0.32445723
## TEAM_PITCHING_SO       1.00000000     -0.02329178       0.02615804
## TEAM_FIELDING_E       -0.02329178      1.00000000      -0.49768495
## TEAM_FIELDING_DP       0.02615804     -0.49768495       1.00000000

library(corrplot)

## corrplot 0.95 loaded

corrplot(cor_matrix, method = "circle", type = "upper",
         tl.col = "black", tl.srt = 45, # text label color & rotation
         number.cex = 0.7,              # size of coefficients
         diag = FALSE)

One correlation that stands out is the one between home runs hit and home runs allowed (r = .97). This suggests that the periods in which a team's batters hit more home runs are also the periods in which its pitchers give up more of them, so such a strong correlation is no surprise. Other strong correlations appear between batting strikeouts and home runs hit (r = .727), triples and batting strikeouts (r = -.670), batting strikeouts and home runs allowed (r = .667), stolen bases and caught stealing (r = .655), and fielding errors and walks (r = -.656).
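Rather than scanning the full matrix by eye, the strongest pairs can be listed programmatically. This is a minimal base R sketch on a 4 x 4 excerpt of the matrix, with values rounded from the output above; the object names `cm` and `pair_table` are assumptions for illustration.

```r
vars <- c("TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_BATTING_SO", "TEAM_PITCHING_HR")
cm <- matrix(c(
   1.000, -0.636, -0.670, -0.568,
  -0.636,  1.000,  0.727,  0.969,
  -0.670,  0.727,  1.000,  0.667,
  -0.568,  0.969,  0.667,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so each pair appears once
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

Substituting the full `cor_matrix` for `cm` would rank every pair of variables in the data set the same way.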

cor_with_target <- cor_matrix["TARGET_WINS", ]

cor_with_target

##      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##       1.00000000       0.38876752       0.28910365       0.14260841 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##       0.17615320       0.23255986      -0.03175071       0.13513892 
##  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR 
##       0.02240407       0.07350424      -0.10993705       0.18901373 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##       0.12417454      -0.07843609      -0.17648476      -0.03485058

Focusing on the target variable, we can see that the strongest correlation is a positive one with hits (r = .389), followed by doubles (r = .289) and walks (r = .233). The negative correlations are weaker: the strongest is with fielding errors committed (r = -.176), followed by hits given up (r = -.110).

Data Preparation

As seen earlier, there are a few variables with missing values. For batters hit by pitches, there are simply too many missing values for it to contribute meaningfully to the model. Another variable with many missing values, although not as many, is runners caught stealing. About 34% of the records (772 of 2,276) are missing a value for this variable, and among the records that have one, its correlation with wins is only .02, suggesting essentially no relationship with the number of games won. This is strong enough evidence to exclude this variable from further analysis as well.

After dropping those variables, we still have some variables with missing values. Whether those values were omitted in error or because the statistic was not yet being recorded, I have decided that the best course of action is to impute the median value.

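# Drop the row identifier, then replace each NA with its column median.
# Note: TEAM_BATTING_HBP and TEAM_BASERUN_CS are still present at this point,
# so their missing values are imputed as well.
numeric_vars <- data |>
  select(-INDEX) |>
  mutate(across(everything(), ~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))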
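The imputation step above fills every column, including the two variables the text sets aside. If the intent is to drop them before imputing, one way to make that explicit is sketched below in base R on a toy data frame. This is an alternative under stated assumptions, not what the report's code does; the helper `impute_median` and the toy values are hypothetical, and dropping TEAM_BATTING_HBP this early would also require removing it from any derived metric computed later.

```r
toy <- data.frame(
  TARGET_WINS      = c(39, 70, 86, 70),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43),
  TEAM_BASERUN_CS  = c(NA, 28, 27, 30),
  TEAM_BATTING_HBP = c(NA, NA, NA, 60)
)

# Drop the two variables judged unusable, then impute what remains
dropped <- c("TEAM_BASERUN_CS", "TEAM_BATTING_HBP")
kept <- toy[, setdiff(names(toy), dropped), drop = FALSE]

# Replace NAs in a numeric vector with that vector's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
kept[] <- lapply(kept, impute_median)
kept
```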

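With a little research into baseball statistics, I derived two additional variables from what we have. The first is a run differential (the difference between runs scored and runs given up). The data set does not record runs, so RUN_DIFFERENTIAL below is a proxy: hits, walks, and stolen bases gained minus hits, walks, and errors conceded. The second is meant to capture the combination of on-base and slugging percentages, which gives a fuller picture of a team's offensive output. Because at-bats are not available, ON_BASE_PLUS_SLUGGING below is likewise a proxy: times on base (hits, walks, and hit-by-pitches) per game over a 162-game season. Previously, HBP was identified as a variable that will be cut from the final model, but it is still useful in computing this metric.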

numeric_vars <- numeric_vars |>
  mutate(
    # Proxy for run differential: offensive events gained minus events conceded
    RUN_DIFFERENTIAL = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB -
      TEAM_PITCHING_H - TEAM_PITCHING_BB - TEAM_FIELDING_E,
    # HBP was already median-imputed above, so coalesce() is only a safeguard here
    ON_BASE_PLUS_SLUGGING = (TEAM_BATTING_H + TEAM_BATTING_BB + coalesce(TEAM_BATTING_HBP, 0)) / 162
  )
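A true slugging percentage needs at-bats, which this data set does not provide, but its numerator (total bases) can be derived from the hit columns. The sketch below is one possible additional derived variable, shown on the first two rows of the `str()` output above. It assumes TEAM_BATTING_H counts all hits (singles through home runs); the names `SINGLES` and `TOTAL_BASES` are hypothetical and are not used elsewhere in this report.

```r
toy <- data.frame(
  TEAM_BATTING_H  = c(1445, 1339),
  TEAM_BATTING_2B = c(194, 219),
  TEAM_BATTING_3B = c(39, 22),
  TEAM_BATTING_HR = c(13, 190)
)

# Singles are whatever hits remain after removing extra-base hits
toy$SINGLES <- with(toy,
  TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)

# Total bases weight each hit by the number of bases it is worth
toy$TOTAL_BASES <- with(toy,
  SINGLES + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4 * TEAM_BATTING_HR)

toy[, c("SINGLES", "TOTAL_BASES")]
```

The two teams have similar hit totals but very different total bases, which is the kind of difference a slugging-style measure is meant to capture.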

Train - Test Split

set.seed(24601) 
n <- nrow(numeric_vars)
train_indices <- sample(seq_len(n), size = 0.7 * n)
train_data <- numeric_vars[train_indices, ]
test_data  <- numeric_vars[-train_indices, ]

Building Models

Model 1 - Using Most Correlated Predictors Only

I built this model using the variables that had the strongest correlation with target wins, with one exception. Although doubles was the second most correlated variable with wins, it has a correlation of over .5 (r = .563) with hits, the most correlated variable, so I did not need both in a predictive model.

model1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB +
                           TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E,
             data = train_data)
summary(model1)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.225  -8.713   0.121   8.651  52.371 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.750619   3.833846   2.282   0.0226 *  
## TEAM_BATTING_H    0.047001   0.002482  18.940  < 2e-16 ***
## TEAM_BATTING_HR  -0.024597   0.027514  -0.894   0.3715    
## TEAM_BATTING_BB   0.003622   0.003801   0.953   0.3407    
## TEAM_BASERUN_SB   0.034572   0.004483   7.712 2.17e-14 ***
## TEAM_PITCHING_HR  0.041409   0.025446   1.627   0.1039    
## TEAM_FIELDING_E  -0.018755   0.002350  -7.982 2.75e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.2 on 1586 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2548 
## F-statistic: 91.75 on 6 and 1586 DF,  p-value: < 2.2e-16

One of the biggest surprises here is the negative coefficient for home runs hit. More home runs would be expected to be associated with more wins, not fewer. However, this coefficient is not statistically significant (p = .37), and home runs hit are almost perfectly correlated with home runs allowed (r = .97), so the effect of home runs is likely being absorbed by another variable. The intercept only just passes the .05 significance threshold; the predictors with the greatest impact are hits, stolen bases, and, in a negative direction, fielding errors.

library(performance)
library(see)
check_model(model1)

Here we can confirm some suspected issues with this model. For instance, the assumption of homogeneity of variance is not met. We also see high collinearity between at least two of the predictors. One likely reason this model performs decently but leaves room for improvement is that the variables have not been transformed. Going back to the histograms, some variables will need a transformation because of their extreme right skew.
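The collinearity check can be reproduced by hand with variance inflation factors: for each predictor, regress it on the other predictors and compute 1 / (1 - R-squared). The sketch below does this in base R on simulated data in which one predictor is nearly a copy of another; the helper `vif_manual` and the simulated variables are assumptions for illustration, not the `performance` package's implementation.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so highly collinear
x3 <- rnorm(n)                 # unrelated predictor
sim <- data.frame(x1, x2, x3)

# VIF for predictor v is 1 / (1 - R^2) from regressing v on the other predictors
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_values <- vif_manual(sim)
vif_values
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; here `x1` and `x2` should land far above that range while `x3` stays close to 1.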

Model 2 - Log Transformation Applied to Model 1

# log-transform selected predictors (safe version: log1p handles zeros)
train_data <- train_data %>%
  mutate(
    log_TEAM_BATTING_H = log1p(TEAM_BATTING_H),
    log_TEAM_BATTING_HR = log1p(TEAM_BATTING_HR),
    log_TEAM_BATTING_BB = log1p(TEAM_BATTING_BB),
    log_TEAM_BASERUN_SB = log1p(TEAM_BASERUN_SB),
    log_TEAM_PITCHING_HR = log1p(TEAM_PITCHING_HR),
    log_TEAM_FIELDING_E = log1p(TEAM_FIELDING_E)
  )

model2 <- lm(
  TARGET_WINS ~ log_TEAM_BATTING_H + log_TEAM_BATTING_HR + 
    log_TEAM_BATTING_BB + log_TEAM_BASERUN_SB + 
    log_TEAM_PITCHING_HR + log_TEAM_FIELDING_E,
  data = train_data
)

summary(model2)

## 
## Call:
## lm(formula = TARGET_WINS ~ log_TEAM_BATTING_H + log_TEAM_BATTING_HR + 
##     log_TEAM_BATTING_BB + log_TEAM_BASERUN_SB + log_TEAM_PITCHING_HR + 
##     log_TEAM_FIELDING_E, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.701  -8.660  -0.121   8.606  52.691 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -456.618     28.267 -16.154  < 2e-16 ***
## log_TEAM_BATTING_H     71.754      3.961  18.114  < 2e-16 ***
## log_TEAM_BATTING_HR    -8.610      2.726  -3.159  0.00161 ** 
## log_TEAM_BATTING_BB     5.457      1.362   4.006 6.46e-05 ***
## log_TEAM_BASERUN_SB     4.492      0.611   7.351 3.13e-13 ***
## log_TEAM_PITCHING_HR    8.067      2.500   3.227  0.00128 ** 
## log_TEAM_FIELDING_E    -7.224      1.148  -6.290 4.09e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.26 on 1586 degrees of freedom
## Multiple R-squared:  0.2508, Adjusted R-squared:  0.248 
## F-statistic:  88.5 on 6 and 1586 DF,  p-value: < 2.2e-16

All previous variables were kept and logarithmically transformed to build this model. The overall fit is essentially unchanged and in fact slightly lower (adjusted R-squared of .248 versus .255 for Model 1), but every coefficient is now statistically significant. We still see a surprising negative coefficient for home runs hit, and it remains one of the weaker terms in our model, along with home runs allowed.
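Two details of the log transformation are worth spelling out: why `log1p()` is used instead of `log()`, and how to read a coefficient on a logged predictor. The sketch below uses made-up counts for the first point and the Model 2 estimate for hits (71.754) for the second; the interpretation is approximate because the predictor is log(1 + x) rather than log(x).

```r
counts <- c(0, 13, 190)  # home run totals; the zero mirrors the minimum in the data

log(counts)    # the zero becomes -Inf, which lm() cannot use
log1p(counts)  # log(1 + x): the zero maps to 0 and stays usable

# Expected change in wins for a 1% increase in hits,
# using the Model 2 coefficient on log_TEAM_BATTING_H
beta_log_hits <- 71.754
wins_per_1pct_hits <- beta_log_hits * log(1.01)
wins_per_1pct_hits  # roughly 0.71 wins
```

Read this way, the large coefficient on logged hits is less dramatic than it looks: it corresponds to well under one additional win per 1% increase in hits.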

check_model(model2)

The collinearity graph suggests two variables are still strongly correlated with each other, and from the evidence above, it seems to be the two home run variables. For the next model, I will see how removing them impacts the rest of the model.

Model 3 - Dealing with High Collinearity

model3 <- lm(
  TARGET_WINS ~ log_TEAM_BATTING_H + 
    log_TEAM_BATTING_BB + log_TEAM_BASERUN_SB +log_TEAM_FIELDING_E,
  data = train_data
)

summary(model3)

## 
## Call:
## lm(formula = TARGET_WINS ~ log_TEAM_BATTING_H + log_TEAM_BATTING_BB + 
##     log_TEAM_BASERUN_SB + log_TEAM_FIELDING_E, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.027  -8.943  -0.147   8.710  51.594 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -473.9766    27.3774 -17.313  < 2e-16 ***
## log_TEAM_BATTING_H    74.0328     3.6823  20.105  < 2e-16 ***
## log_TEAM_BATTING_BB    3.8609     1.2267   3.148  0.00168 ** 
## log_TEAM_BASERUN_SB    4.4868     0.6032   7.438 1.67e-13 ***
## log_TEAM_FIELDING_E   -5.5078     0.8099  -6.800 1.47e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.3 on 1588 degrees of freedom
## Multiple R-squared:  0.2459, Adjusted R-squared:  0.244 
## F-statistic: 129.5 on 4 and 1588 DF,  p-value: < 2.2e-16

check_model(model3)

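Dropping the two home run variables costs little fit (adjusted R-squared of .244, versus .248 for Model 2 and .255 for Model 1), so this is the simplest model so far, though not the best fitting.

Model 4 - Using Generated Variables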

model4 <- lm(TARGET_WINS ~ ON_BASE_PLUS_SLUGGING + RUN_DIFFERENTIAL +
                           log1p(TEAM_PITCHING_SO),
             data = train_data)
summary(model4)

## 
## Call:
## lm(formula = TARGET_WINS ~ ON_BASE_PLUS_SLUGGING + RUN_DIFFERENTIAL + 
##     log1p(TEAM_PITCHING_SO), data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.652  -8.908   0.301   8.967  55.890 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.8789192  5.3217219   0.917  0.35939    
## ON_BASE_PLUS_SLUGGING   5.9259662  0.3053187  19.409  < 2e-16 ***
## RUN_DIFFERENTIAL        0.0006255  0.0002361   2.650  0.00814 ** 
## log1p(TEAM_PITCHING_SO) 0.3765640  0.5112345   0.737  0.46149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.67 on 1589 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.2009 
## F-statistic: 134.4 on 3 and 1589 DF,  p-value: < 2.2e-16

check_model(model4)

Model 5 - Using (Almost) All Variables

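# Note: the HR ratio is NaN (0/0) when a team has zero home runs both hit and allowed;
# lm() drops those rows, which is the likely source of the "observations deleted" note below.
model5 <- lm(TARGET_WINS ~ log1p(TEAM_BATTING_H) + TEAM_BATTING_HR + TEAM_BATTING_BB +
                           log1p(TEAM_PITCHING_SO) + log1p(TEAM_PITCHING_BB) +
                           log1p(TEAM_FIELDING_DP) + log1p(TEAM_FIELDING_E) +
                           I(TEAM_PITCHING_HR / TEAM_BATTING_HR),
             data = train_data)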
summary(model5)

## 
## Call:
## lm(formula = TARGET_WINS ~ log1p(TEAM_BATTING_H) + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + log1p(TEAM_PITCHING_SO) + log1p(TEAM_PITCHING_BB) + 
##     log1p(TEAM_FIELDING_DP) + log1p(TEAM_FIELDING_E) + I(TEAM_PITCHING_HR/TEAM_BATTING_HR), 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.542  -8.476   0.403   8.503  48.748 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -4.138e+02  3.255e+01 -12.710  < 2e-16 ***
## log1p(TEAM_BATTING_H)                7.879e+01  4.220e+00  18.669  < 2e-16 ***
## TEAM_BATTING_HR                      2.337e-02  9.190e-03   2.543  0.01108 *  
## TEAM_BATTING_BB                      1.890e-03  9.136e-03   0.207  0.83611    
## log1p(TEAM_PITCHING_SO)             -7.458e-01  7.633e-01  -0.977  0.32872    
## log1p(TEAM_PITCHING_BB)              8.450e+00  3.934e+00   2.148  0.03188 *  
## log1p(TEAM_FIELDING_DP)             -2.220e+01  2.039e+00 -10.889  < 2e-16 ***
## log1p(TEAM_FIELDING_E)              -3.070e+00  1.008e+00  -3.046  0.00235 ** 
## I(TEAM_PITCHING_HR/TEAM_BATTING_HR) -3.417e+00  1.403e+00  -2.436  0.01498 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.81 on 1574 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.2674, Adjusted R-squared:  0.2637 
## F-statistic: 71.81 on 8 and 1574 DF,  p-value: < 2.2e-16

check_model(model5)

Model 6 - Reduced Version of Model 5

model6 <- lm(TARGET_WINS ~ log1p(TEAM_BATTING_H) + TEAM_BATTING_BB + log1p(TEAM_FIELDING_DP) +
                           log1p(TEAM_FIELDING_E),
             data = train_data)
summary(model6)

## 
## Call:
## lm(formula = TARGET_WINS ~ log1p(TEAM_BATTING_H) + TEAM_BATTING_BB + 
##     log1p(TEAM_FIELDING_DP) + log1p(TEAM_FIELDING_E), data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.141  -8.591   0.307   8.447  55.993 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -3.852e+02  2.616e+01 -14.727  < 2e-16 ***
## log1p(TEAM_BATTING_H)    8.070e+01  3.714e+00  21.726  < 2e-16 ***
## TEAM_BATTING_BB          2.402e-02  3.382e-03   7.100 1.87e-12 ***
## log1p(TEAM_FIELDING_DP) -2.184e+01  1.977e+00 -11.048  < 2e-16 ***
## log1p(TEAM_FIELDING_E)  -4.750e+00  7.030e-01  -6.757 1.97e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 1588 degrees of freedom
## Multiple R-squared:  0.2737, Adjusted R-squared:  0.2718 
## F-statistic: 149.6 on 4 and 1588 DF,  p-value: < 2.2e-16

check_model(model6)

par(mfrow = c(2, 2))
plot(model6)
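The models above are compared on training fit only, while the hold-out set created earlier goes unused. A minimal sketch of a hold-out comparison follows, using simulated data in base R; the variable names, the simulated relationship, and the `rmse` helper are assumptions for illustration and do not reproduce the report's results.

```r
set.seed(24601)
n <- 300
sim <- data.frame(
  hits   = round(runif(n, 1200, 1700)),
  errors = round(runif(n, 80, 300))
)
sim$wins <- 20 + 0.05 * sim$hits - 0.03 * sim$errors + rnorm(n, sd = 10)

# 70/30 split, with the training size made an explicit integer
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Transformations written inside the formula are re-applied by predict(),
# so the hold-out rows need no manual preprocessing.
fit_linear <- lm(wins ~ hits + errors, data = train)
fit_logged <- lm(wins ~ log1p(hits) + log1p(errors), data = train)

# Root mean squared error of a model's predictions on new data
rmse <- function(model, newdata) {
  sqrt(mean((newdata$wins - predict(model, newdata = newdata))^2))
}

holdout_rmse <- sapply(list(linear = fit_linear, logged = fit_logged),
                       rmse, newdata = test)
holdout_rmse
```

In this report, Models 4 through 6 already apply `log1p()` inside the formula, so they could be scored on `test_data` directly; Models 2 and 3 rely on `log_` columns that were added only to `train_data`, so those columns would have to be created on `test_data` first.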