BrianSingh_Data621

0.1 Data Exploration

# load data
library(tidyverse)
library(readr)
library(curl)
library(ggplot2)
library(dplyr)
library(scales)
library(zoo)
baseball_training <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data621/main/moneyball-training-data.csv"))

head(baseball_training)

0.1.1 Summary statistics

summary(baseball_training)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

We can see that a number of the variables have a minimum of 0.0 and/or NA values. These will need to be addressed in order to build models.
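To see how widespread the problem is, the missing values and zeros can be counted per column. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative assumptions, not rows from the Moneyball data. The same two calls should apply to baseball_training.

```r
# Toy data frame standing in for baseball_training (values are made up)
toy <- data.frame(
  TEAM_BATTING_HR = c(0, 12, 45, NA),
  TEAM_BATTING_SO = c(NA, NA, 800, 0),
  TEAM_FIELDING_E = c(120, 95, 210, 160)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of exact zeros in each column, ignoring NAs
zero_counts <- colSums(toy == 0, na.rm = TRUE)

print(na_counts)
print(zero_counts)
```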

0.1.2 Plots

Many people believe that the more home runs a team hits, the more games it will win. This is plausible to an extent, but it ignores how many runs the same team’s pitching gives up. Nonetheless, let’s look at how home runs relate to target wins.

baseball_training %>%
  ggplot(aes(x=TEAM_BATTING_HR, y=TARGET_WINS, color=TEAM_BATTING_HR)) +
  geom_point() +
  labs(title = "Wins vs. Home runs hit", x="Home Runs",y="Target Wins",colour="Home Runs")

There seems to be a slight positive correlation between the number of home runs hit and the number of wins. The team with the most wins in the data set (146, per the summary above) actually hit relatively few home runs, while a few teams with 150-155 home runs have fewer than 50 wins. Let’s take a look at how all of the explanatory variables relate to the response variable (TARGET_WINS):

#remove Index, no need to conduct a relationship on that column
baseball_training %>%
  gather(-INDEX,-TARGET_WINS, key="var", value="value") %>%
  ggplot(aes(x=value,y=TARGET_WINS))+
  geom_point() +
  geom_smooth(method="lm",color="red") +
  facet_wrap(~ var, scales="free") +
  theme_bw()

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 3478 rows containing non-finite values (stat_smooth).

## Warning: Removed 3478 rows containing missing values (geom_point).

It’s pretty hard to tell how each X relates to Y based on just the plots. There are some relationships you expect to see that are just not as clear when looking only at the scatterplots. Let’s calculate correlations between the predictors and the target.

correlations <- cor(baseball_training[ , colnames(baseball_training) != "TARGET_WINS"], baseball_training$TARGET_WINS)
correlations

##                         [,1]
## INDEX            -0.02105643
## TEAM_BATTING_H    0.38876752
## TEAM_BATTING_2B   0.28910365
## TEAM_BATTING_3B   0.14260841
## TEAM_BATTING_HR   0.17615320
## TEAM_BATTING_BB   0.23255986
## TEAM_BATTING_SO           NA
## TEAM_BASERUN_SB           NA
## TEAM_BASERUN_CS           NA
## TEAM_BATTING_HBP          NA
## TEAM_PITCHING_H  -0.10993705
## TEAM_PITCHING_HR  0.18901373
## TEAM_PITCHING_BB  0.12417454
## TEAM_PITCHING_SO          NA
## TEAM_FIELDING_E  -0.17648476
## TEAM_FIELDING_DP          NA

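Among the predictors without missing values, none appears to have a strong correlation with TARGET_WINS on its own; the largest is TEAM_BATTING_H at about 0.39. The NA entries appear because cor() returns NA by default for any column that contains missing values.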
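Correlations for columns that contain missing values can still be computed by restricting each pair to the rows where both values are present. The sketch below shows this with the `use` argument of cor() on made-up data; `toy`, `wins`, `hits`, and `so` are illustrative names, not columns from the data set.

```r
# Toy data with a missing value in one predictor (values are made up)
toy <- data.frame(
  wins = c(60, 70, 80, 90, 100),
  hits = c(1300, 1380, 1450, 1500, 1600),
  so   = c(900, NA, 800, 700, 650)
)

predictors <- toy[, colnames(toy) != "wins"]

# Default behaviour: any NA in a column makes its correlation NA
cor_default <- cor(predictors, toy$wins)

# Use only the rows where both the predictor and the target are present
cor_pairwise <- cor(predictors, toy$wins, use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```

Pairwise deletion means each correlation may be based on a different number of rows, which is worth keeping in mind when comparing them.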

Before moving further along in creating models, let’s tidy the data.

0.2 Data Preparation

#replace NAs with mean of column
baseball_training2 <- baseball_training
baseball_training2 <- na.aggregate(baseball_training2)
summary(baseball_training2)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 556.8  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 735.6  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##  TEAM_BASERUN_SB TEAM_BASERUN_CS  TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.00   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 67.0   1st Qu.: 44.00   1st Qu.:59.36    1st Qu.: 1419  
##  Median :106.0   Median : 52.80   Median :59.36    Median : 1518  
##  Mean   :124.8   Mean   : 52.80   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:151.0   3rd Qu.: 54.25   3rd Qu.:59.36    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.00   Max.   :95.00    Max.   :30132  
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  626.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  817.7   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  957.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:134.0   
##  Median :146.4   
##  Mean   :146.4   
##  3rd Qu.:161.2   
##  Max.   :228.0

Now you will see no NA values exist within the data. However, some statistics still have a minimum of 0.0, which also does not seem likely. To make the model more reliable, I’ll also replace the zeros in TARGET_WINS, 3B, HR, BB, SO, SB, CS, Pitching_HR, Pitching_BB, and Pitching_SO with the column mean.

baseball_training2 <-  baseball_training2 |> mutate(across(.cols = c("TARGET_WINS", "TEAM_BATTING_3B", "TEAM_BATTING_HR","TEAM_BATTING_BB","TEAM_BATTING_SO","TEAM_BASERUN_SB","TEAM_BASERUN_CS","TEAM_PITCHING_HR","TEAM_PITCHING_BB","TEAM_PITCHING_SO"), 
                           .fns = ~ifelse(.x == 0, mean(.x), .x)))
summary(baseball_training2)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   : 12.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.83   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##  TEAM_BATTING_3B TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  8.0   Min.   :  3.00   Min.   : 12.0   Min.   :  66.0  
##  1st Qu.: 34.0   1st Qu.: 42.75   1st Qu.:451.0   1st Qu.: 562.0  
##  Median : 47.0   Median :102.00   Median :512.0   Median : 735.6  
##  Mean   : 55.3   Mean   :100.27   Mean   :501.8   Mean   : 742.1  
##  3rd Qu.: 72.0   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0  
##  Max.   :223.0   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##  TEAM_BASERUN_SB TEAM_BASERUN_CS  TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   : 14.0   Min.   :  7.00   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 67.0   1st Qu.: 44.00   1st Qu.:59.36    1st Qu.: 1419  
##  Median :106.0   Median : 52.80   Median :59.36    Median : 1518  
##  Mean   :124.9   Mean   : 52.83   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:151.0   3rd Qu.: 54.25   3rd Qu.:59.36    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.00   Max.   :95.00    Max.   :30132  
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  3.0    Min.   : 119.0   Min.   :  181.0   Min.   :  65.0  
##  1st Qu.: 52.0    1st Qu.: 476.0   1st Qu.:  633.0   1st Qu.: 127.0  
##  Median :107.0    Median : 537.0   Median :  817.7   Median : 159.0  
##  Mean   :106.4    Mean   : 553.3   Mean   :  824.9   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  957.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:134.0   
##  Median :146.4   
##  Mean   :146.4   
##  3rd Qu.:161.2   
##  Max.   :228.0

Now you’ll see that no zero values remain in those variables, which makes this more of a working data set. We can do some more data preparation to condense a few X variables. 1) A hit-by-pitch (TEAM_BATTING_HBP) puts the batter on first base just as a walk (TEAM_BATTING_BB) does, so I’ll combine the two into a new variable, TEAM_BATTING_WALKS. 2) TEAM_BASERUN_SB and TEAM_BASERUN_CS together are the total stolen base attempts, so I’ll compute the stolen base success rate as TEAM_BASERUN_SB/(TEAM_BASERUN_CS + TEAM_BASERUN_SB) and store it as TEAM_SB_SUCCESS.

#this caused errors in my testing later on. Removing.
#baseball_training2 <- baseball_training2 %>%
#  mutate(TEAM_BATTING_WALKS = TEAM_BATTING_BB+TEAM_BATTING_HBP) %>%
#  mutate(TEAM_SB_SUCCESS = TEAM_BASERUN_SB/(TEAM_BASERUN_SB+TEAM_BASERUN_CS)) %>%
#  select(-c(TEAM_BATTING_BB,TEAM_BATTING_HBP,TEAM_BASERUN_SB,TEAM_BASERUN_CS))
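For illustration only, here is how the two derived variables work out on a few made-up rows. The `toy` data frame and the guard that returns NA when a team has no stolen-base attempts are assumptions added for this sketch; they are not part of the original code, and this is not a diagnosis of the errors mentioned above.

```r
# Toy rows (values are made up); the third team has no stolen-base attempts
toy <- data.frame(
  TEAM_BATTING_BB  = c(500, 450, 600),
  TEAM_BATTING_HBP = c(50, 60, 40),
  TEAM_BASERUN_SB  = c(90, 30, 0),
  TEAM_BASERUN_CS  = c(30, 30, 0)
)

# Times on base via walk or hit-by-pitch
toy$TEAM_BATTING_WALKS <- toy$TEAM_BATTING_BB + toy$TEAM_BATTING_HBP

# Stolen-base success rate; NA when there were no attempts (avoids 0/0 = NaN)
attempts <- toy$TEAM_BASERUN_SB + toy$TEAM_BASERUN_CS
toy$TEAM_SB_SUCCESS <- ifelse(attempts > 0, toy$TEAM_BASERUN_SB / attempts, NA)

print(toy[, c("TEAM_BATTING_WALKS", "TEAM_SB_SUCCESS")])
```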

Because the feature-combination code above is commented out, the data set keeps all of its original explanatory variables. Let’s look at the correlations now that the NA and zero values have been replaced.

correlations_tidy <- cor(baseball_training2[ , colnames(baseball_training2) != "TARGET_WINS"], baseball_training2$TARGET_WINS)
correlations_tidy

##                          [,1]
## INDEX            -0.020937436
## TEAM_BATTING_H    0.381966897
## TEAM_BATTING_2B   0.285642615
## TEAM_BATTING_3B   0.140704764
## TEAM_BATTING_HR   0.151842473
## TEAM_BATTING_BB   0.225471525
## TEAM_BATTING_SO  -0.050333581
## TEAM_BASERUN_SB   0.118638506
## TEAM_BASERUN_CS   0.009251595
## TEAM_BATTING_HBP  0.016432705
## TEAM_PITCHING_H  -0.074670265
## TEAM_PITCHING_HR  0.163747351
## TEAM_PITCHING_BB  0.117643913
## TEAM_PITCHING_SO -0.085788792
## TEAM_FIELDING_E  -0.161152195
## TEAM_FIELDING_DP -0.029009541

Still, not one predictor on its own stands out.

0.3 Build Models

I will build 3 models as follows: 1) based on TEAM_BATTING variables, 2) based on TEAM_PITCHING variables, and 3) based on TEAM_BATTING_H, TEAM_PITCHING_H, and TEAM_FIELDING_E, as these seem to have the strongest linear relationships to TARGET_WINS based on the plots above. You may notice that TEAM_BATTING_2B also appears to have a strong correlation (per the scatterplots, not necessarily per the calculations), but a 2B is also counted in TEAM_BATTING_H, so I’ve decided to use the variable that encompasses all hits.

0.3.1 Model 1 - TEAM_BATTING variables

fit_batting <- lm(TARGET_WINS ~ TEAM_BATTING_H+TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR+TEAM_BATTING_SO+TEAM_BATTING_BB+TEAM_BATTING_HBP,data=baseball_training2)
coef(fit_batting)

##      (Intercept)   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##    -13.103877806      0.043546345     -0.012480071      0.099596565 
##  TEAM_BATTING_HR  TEAM_BATTING_SO  TEAM_BATTING_BB TEAM_BATTING_HBP 
##      0.023000975      0.008193301      0.030864722      0.060290292

summary(fit_batting)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BATTING_BB + 
##     TEAM_BATTING_HBP, data = baseball_training2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.010  -8.797   0.535   9.114  51.629 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -13.103878   6.713408  -1.952 0.051073 .  
## TEAM_BATTING_H     0.043546   0.003556  12.247  < 2e-16 ***
## TEAM_BATTING_2B   -0.012480   0.009242  -1.350 0.177013    
## TEAM_BATTING_3B    0.099597   0.016587   6.005 2.23e-09 ***
## TEAM_BATTING_HR    0.023001   0.009181   2.505 0.012305 *  
## TEAM_BATTING_SO    0.008193   0.002204   3.718 0.000206 ***
## TEAM_BATTING_BB    0.030865   0.002790  11.062  < 2e-16 ***
## TEAM_BATTING_HBP   0.060290   0.077058   0.782 0.434061    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.77 on 2268 degrees of freedom
## Multiple R-squared:  0.2297, Adjusted R-squared:  0.2273 
## F-statistic:  96.6 on 7 and 2268 DF,  p-value: < 2.2e-16

This model has a Multiple R-squared of 0.2297 with an overall p-value less than 0.05. The model is statistically significant, but it explains only about 23% of the variation in TARGET_WINS, so it does not fit the data well. TEAM_BATTING_2B and TEAM_BATTING_HBP also have high p-values (0.18 and 0.43), so they may need to be excluded from the model.
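For reference, Multiple R-squared is the share of the variance in the response that the fitted model accounts for, and Adjusted R-squared applies a penalty for the number of predictors. The sketch below recomputes both by hand and checks them against summary(). It uses R's built-in mtcars data purely for illustration; the model formula is an arbitrary assumption and has nothing to do with the baseball data.

```r
# Built-in mtcars data used purely for illustration
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared = 1 - (residual sum of squares / total sum of squares)
sse <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r_squared <- 1 - sse / sst

# Adjusted R-squared penalises the number of predictors p
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
```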

0.3.2 Model 2 - TEAM_PITCHING variables

fit_pitching <- lm(TARGET_WINS ~ TEAM_PITCHING_H+TEAM_PITCHING_HR+TEAM_PITCHING_BB+TEAM_PITCHING_SO,data=baseball_training2)
coef(fit_pitching)

##      (Intercept)  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##    72.5417444307    -0.0008080885     0.0399348172     0.0180298652 
## TEAM_PITCHING_SO 
##    -0.0054569652

summary(fit_pitching)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO, data = baseball_training2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.034  -9.834   0.644   9.793  77.022 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      72.5417444  1.1496770  63.098  < 2e-16 ***
## TEAM_PITCHING_H  -0.0008081  0.0002488  -3.248  0.00118 ** 
## TEAM_PITCHING_HR  0.0399348  0.0055313   7.220 7.07e-13 ***
## TEAM_PITCHING_BB  0.0180299  0.0022669   7.954 2.83e-15 ***
## TEAM_PITCHING_SO -0.0054570  0.0006910  -7.897 4.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.12 on 2271 degrees of freedom
## Multiple R-squared:  0.06923,    Adjusted R-squared:  0.06759 
## F-statistic: 42.23 on 4 and 2271 DF,  p-value: < 2.2e-16

This model’s Multiple R-squared value (0.069) is much lower than the previous model’s (0.2297), although the overall p-value is again below 0.05. All of the predictor variables’ p-values are also less than 0.05. Even so, a model based on pitching statistics alone explains only about 7% of the variation in TARGET_WINS, so it is a weaker predictor than the batting model.

0.3.3 Model 3 - TEAM_BATTING_H, TEAM_PITCHING_H, and TEAM_FIELDING_E

fit_hits_errors <- lm(TARGET_WINS ~ TEAM_BATTING_H+TEAM_PITCHING_H+ TEAM_FIELDING_E,data=baseball_training2)
coef(fit_hits_errors)

##     (Intercept)  TEAM_BATTING_H TEAM_PITCHING_H TEAM_FIELDING_E 
##   12.3612274067    0.0501349210   -0.0005080986   -0.0174154798

summary(fit_hits_errors)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_PITCHING_H + 
##     TEAM_FIELDING_E, data = baseball_training2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.795  -9.272   0.267   9.056  68.898 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.3612274  3.0165918   4.098 4.32e-05 ***
## TEAM_BATTING_H   0.0501349  0.0021125  23.732  < 2e-16 ***
## TEAM_PITCHING_H -0.0005081  0.0002813  -1.807    0.071 .  
## TEAM_FIELDING_E -0.0174155  0.0017171 -10.143  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 2272 degrees of freedom
## Multiple R-squared:  0.221,  Adjusted R-squared:   0.22 
## F-statistic: 214.9 on 3 and 2272 DF,  p-value: < 2.2e-16

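Using only these three variables gives a Multiple R-squared of 0.221, which is close to Model 1’s 0.2297 with far fewer predictors and well above Model 2’s 0.069. It still explains only about 22% of the variation in TARGET_WINS, so it may not be a good model for predicting our target variable. TEAM_PITCHING_H is also not significant at the 0.05 level (p = 0.071).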

With our three models, we can feed in the evaluation data to test each one. Let’s first load it.

baseball_evaluation <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data621/main/moneyball-evaluation-data.csv"))

#fill NAs for purposes of testing model
baseball_evaluation2 <- baseball_evaluation
baseball_evaluation2 <- na.aggregate(baseball_evaluation2)
summary(baseball_evaluation2)

##      INDEX      TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :   9   Min.   : 819   Min.   : 44.0   Min.   : 14.00  
##  1st Qu.: 708   1st Qu.:1387   1st Qu.:210.0   1st Qu.: 35.00  
##  Median :1249   Median :1455   Median :239.0   Median : 52.00  
##  Mean   :1264   Mean   :1469   Mean   :241.3   Mean   : 55.91  
##  3rd Qu.:1832   3rd Qu.:1548   3rd Qu.:278.5   3rd Qu.: 72.00  
##  Max.   :2525   Max.   :2170   Max.   :376.0   Max.   :155.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   : 15.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 44.50   1st Qu.:436.5   1st Qu.: 565.0   1st Qu.: 60.5  
##  Median :101.00   Median :509.0   Median : 709.3   Median : 96.0  
##  Mean   : 95.63   Mean   :499.0   Mean   : 709.3   Mean   :123.7  
##  3rd Qu.:135.50   3rd Qu.:565.5   3rd Qu.: 904.5   3rd Qu.:149.0  
##  Max.   :242.00   Max.   :792.0   Max.   :1268.0   Max.   :580.0  
##  TEAM_BASERUN_CS  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.00   Min.   :42.00    Min.   : 1155   Min.   :  0.0   
##  1st Qu.: 44.00   1st Qu.:62.37    1st Qu.: 1426   1st Qu.: 52.0   
##  Median : 52.32   Median :62.37    Median : 1515   Median :104.0   
##  Mean   : 52.32   Mean   :62.37    Mean   : 1813   Mean   :102.1   
##  3rd Qu.: 56.00   3rd Qu.:62.37    3rd Qu.: 1681   3rd Qu.:142.5   
##  Max.   :154.00   Max.   :96.00    Max.   :22768   Max.   :336.0   
##  TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   : 136.0   Min.   :   0.0   Min.   :  73.0   Min.   : 69.0   
##  1st Qu.: 471.0   1st Qu.: 622.5   1st Qu.: 131.0   1st Qu.:134.5   
##  Median : 526.0   Median : 782.0   Median : 163.0   Median :146.1   
##  Mean   : 552.4   Mean   : 799.7   Mean   : 249.7   Mean   :146.1   
##  3rd Qu.: 606.5   3rd Qu.: 927.5   3rd Qu.: 252.0   3rd Qu.:160.5   
##  Max.   :2008.0   Max.   :9963.0   Max.   :1568.0   Max.   :204.0

#standardize variables
#baseball_evaluation <- baseball_evaluation %>%
#  mutate(TEAM_BATTING_WALKS = TEAM_BATTING_BB+TEAM_BATTING_HBP) %>%
#  mutate(TEAM_SB_SUCCESS = TEAM_BASERUN_SB/(TEAM_BASERUN_SB+TEAM_BASERUN_CS)) %>%
#  select(-c(TEAM_BATTING_BB,TEAM_BATTING_HBP,TEAM_BASERUN_SB,TEAM_BASERUN_CS))

To test the accuracy, we can use a few visuals and root mean square errors to understand the differences between the models. Let’s first run predictions on the baseball_evaluation2 data set before comparing.

0.3.4 First prediction using model 1

pred1 <- predict(fit_batting,newdata = data.frame(baseball_evaluation2))
pred1

##         1         2         3         4         5         6         7         8 
##  69.02332  70.39209  76.54290  81.62419  64.27548  65.25927  78.37235  69.91582 
##         9        10        11        12        13        14        15        16 
##  71.59791  73.09766  75.41997  81.81791  81.58931  78.67432  75.51881  76.72835 
##        17        18        19        20        21        22        23        24 
##  73.18468  81.91163  67.40273  90.67725  84.20854  86.63365  85.31085  75.15657 
##        25        26        27        28        29        30        31        32 
##  79.29060  83.04885  59.06793  73.79108  85.00737  74.96752  91.97348  87.01782 
##        33        34        35        36        37        38        39        40 
##  89.08648  92.29560  83.28620  83.45037  77.10765  89.74024  85.81320  87.98720 
##        41        42        43        44        45        46        47        48 
##  81.56670  87.04156  42.50173 100.66709  87.20943  93.68922  95.92407  71.94483 
##        49        50        51        52        53        54        55        56 
##  70.45126  74.87788  76.38373  83.37616  80.49225  72.63923  76.92429  75.76593 
##        57        58        59        60        61        62        63        64 
##  87.85558  68.52421  63.13311  77.22863  82.55592  85.15860  84.57356  82.94137 
##        65        66        67        68        69        70        71        72 
##  80.17056  85.94818  77.03164  81.98240  76.78787  87.45434  89.08374  74.51536 
##        73        74        75        76        77        78        79        80 
##  82.46346  84.82696  81.05446  86.98952  83.26616  79.09429  68.24419  74.10746 
##        81        82        83        84        85        86        87        88 
##  85.20444  89.29199  96.79917  82.64148  87.54932  77.16695  76.03202  81.60815 
##        89        90        91        92        93        94        95        96 
##  82.24683  88.79385  78.70431  93.52340  72.80467  77.95506  76.47452  77.42784 
##        97        98        99       100       101       102       103       104 
##  87.82337 100.81797  91.23864  92.24367  84.18104  74.29019  83.83800  81.97561 
##       105       106       107       108       109       110       111       112 
##  83.00491  75.94619  65.86750  81.19457  84.67721  68.82092  81.95114  79.59690 
##       113       114       115       116       117       118       119       120 
##  88.45745  85.44496  79.08546  80.28643  89.34081  79.21481  77.64203  72.97227 
##       121       122       123       124       125       126       127       128 
##  86.39589  66.39059  67.30869  60.26995  70.71182  82.55702  89.54708  73.69231 
##       129       130       131       132       133       134       135       136 
##  87.79599  93.33843  87.96655  79.08752  74.60263  84.62829  84.10350  69.34915 
##       137       138       139       140       141       142       143       144 
##  76.46567  75.89800  79.10612  79.88932  65.40377  69.92514  93.25758  80.96440 
##       145       146       147       148       149       150       151       152 
##  76.30607  76.27901  80.91122  82.33784  83.76734  79.61306  82.27751  80.13054 
##       153       154       155       156       157       158       159       160 
##  61.14164  71.95630  77.43822  72.79959  84.91432  73.15190  91.17033  72.56672 
##       161       162       163       164       165       166       167       168 
## 104.86366 105.15197  91.06320 105.01078  98.28674  90.72715  86.25125  79.78595 
##       169       170       171       172       173       174       175       176 
##  72.49598  80.29813  88.63038  84.26605  81.01467  89.95105  83.11377  79.27908 
##       177       178       179       180       181       182       183       184 
##  80.50449  77.52805  76.91121  82.04500  76.88126  84.55123  82.97896  85.21549 
##       185       186       187       188       189       190       191       192 
##  98.44055  86.73570  92.12372  68.74468  65.60718 107.11029  71.79960  79.63424 
##       193       194       195       196       197       198       199       200 
##  72.58022  76.83369  80.15356  69.63103  76.02875  82.86953  80.95599  87.75156 
##       201       202       203       204       205       206       207       208 
##  81.52108  82.16757  77.78889  82.50815  76.79609  81.45618  81.68290  78.19480 
##       209       210       211       212       213       214       215       216 
##  81.43368  78.46398 101.96047  92.39319  83.97692  70.71751  75.10910  86.95871 
##       217       218       219       220       221       222       223       224 
##  84.80324  85.24344  75.13609  78.42613  80.99022  74.78310  85.27730  80.76922 
##       225       226       227       228       229       230       231       232 
##  93.67668  76.25543  78.53725  83.23883  81.55291  81.64655  73.19156  90.46028 
##       233       234       235       236       237       238       239       240 
##  84.18946  84.27135  80.03538  74.50281  82.56600  77.07609  93.57988  72.89689 
##       241       242       243       244       245       246       247       248 
##  88.59108  86.40800  82.98258  81.53952  65.96458  83.48599  76.81224  82.22952 
##       249       250       251       252       253       254       255       256 
##  72.81879  84.15689  83.28610  62.82698  93.83673  48.18839  69.81853  79.83797 
##       257       258       259 
##  75.82978  78.85357  77.57921

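#overlay evaluation-set predictions (blue) on training-set wins (red)
#note: the two series have different lengths (259 vs. 2276 rows) and are not matched row for row
plot(baseball_training$TARGET_WINS, type="l",lty=1.8,col="red")
lines(pred1,type="l",col="blue")

#plot(pred1,type="l",col="blue")
#note: as written this is the absolute value of the mean difference, not a root mean squared error,
#and the 259 predictions are recycled against the 2276 training targets
rmse <- sqrt(mean(pred1-baseball_training$TARGET_WINS)^2)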

## Warning in pred1 - baseball_training$TARGET_WINS: longer object length is not a
## multiple of shorter object length

rmse

## [1] 0.1722775
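A root mean squared error squares each difference before averaging, and it needs predictions and actual values taken from the same rows. The evaluation file summarized above has no TARGET_WINS column, so one option is to hold out part of the training data and score the model there. The sketch below illustrates the idea on R's built-in mtcars data; `rmse_fn`, the seed, the split size, and the model formula are all assumptions made for this example, not the author's method.

```r
# Root mean squared error: square the differences, average them, then take the root
rmse_fn <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Built-in mtcars data used purely for illustration
set.seed(621)
test_rows <- sample(nrow(mtcars), 8)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Predictions and actual values come from the same rows, so the lengths match
holdout_rmse <- rmse_fn(test$mpg, pred)
print(holdout_rmse)
```

The result is in the units of the response, so for a wins model it would read as a typical prediction error in games.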

0.3.5 First prediction using model 2

pred2<-predict(fit_pitching,newdata = data.frame(baseball_evaluation2))
pred2

##         1         2         3         4         5         6         7         8 
##  77.04518  79.30322  79.85272  81.42258  68.45317  74.63967  80.83874  76.42000 
##         9        10        11        12        13        14        15        16 
##  77.41109  79.11062  79.16637  83.09389  84.55075  84.32453  82.14794  81.24158 
##        17        18        19        20        21        22        23        24 
##  80.14140  82.62919  75.12521  83.75498  82.70278  84.12543  84.62454  82.55133 
##        25        26        27        28        29        30        31        32 
##  80.23546  81.58861  73.92224  77.80588  82.07987  77.71562  86.84706  84.83176 
##        33        34        35        36        37        38        39        40 
##  85.81664  87.73500  81.89847  85.52082  82.01151  84.48213  87.69453  82.89879 
##        41        42        43        44        45        46        47        48 
##  84.12553  83.48571  72.70279  86.90693  84.11633  84.21597  83.10782  77.78950 
##        49        50        51        52        53        54        55        56 
##  76.87728  78.83716  79.40117  81.53498  80.52444  79.48955  79.18987  78.61305 
##        57        58        59        60        61        62        63        64 
##  82.38836  75.98521  75.64863  76.19330  80.14578  83.60986  83.12728  84.38542 
##        65        66        67        68        69        70        71        72 
##  81.73209  81.36614  76.78993  75.78061  77.64728  80.25995  79.78227  79.19576 
##        73        74        75        76        77        78        79        80 
##  85.03699  86.96859  77.21840  80.97159  83.39845  81.25324  75.15099  74.90960 
##        81        82        83        84        85        86        87        88 
##  83.42739  81.10773  83.86050  80.99557  85.81101  82.30003  82.11341  81.29180 
##        89        90        91        92        93        94        95        96 
##  84.04329  84.78144  80.72769 109.41427  79.48985  74.00984  75.90040  77.39775 
##        97        98        99       100       101       102       103       104 
##  82.70851  83.03437  82.15563  82.74118  80.05892  81.99711  82.46801  85.53390 
##       105       106       107       108       109       110       111       112 
##  82.47149  76.78341  73.55357  80.21612  84.01747  75.20067  83.06988  80.73286 
##       113       114       115       116       117       118       119       120 
##  81.10482  80.35060  78.68424  79.65251  81.95838  81.41096  78.54989  77.92738 
##       121       122       123       124       125       126       127       128 
##  81.04498  76.13429  75.22439  74.93731  76.97530  78.47243  80.44013  77.49180 
##       129       130       131       132       133       134       135       136 
##  82.99307  82.77731  82.50669  82.50063  79.97061  81.94254  83.18179  76.69299 
##       137       138       139       140       141       142       143       144 
##  80.30968  79.45320  79.58514  81.30195  73.04708  75.02532  83.10015  80.27213 
##       145       146       147       148       149       150       151       152 
##  81.53951  82.40013  82.10462  81.78437  81.46879  81.79842  82.90198  79.37736 
##       153       154       155       156       157       158       159       160 
##  54.59167  76.82498  79.98688  78.18751  83.35955  76.34499  79.13518  76.20533 
##       161       162       163       164       165       166       167       168 
##  87.14551  89.71291  85.90671  89.68915  89.34562  86.90390  84.63987  83.59410 
##       169       170       171       172       173       174       175       176 
##  79.43832  82.60606  77.78716  80.17687  82.21869  82.60998  82.65314  82.70713 
##       177       178       179       180       181       182       183       184 
##  85.55218  81.49001  81.99295  83.24471  82.35500  84.26075  82.98463  85.03622 
##       185       186       187       188       189       190       191       192 
##  93.77507  78.05510  83.71529  72.61288  76.55040  84.79403  75.80936  75.95766 
##       193       194       195       196       197       198       199       200 
##  76.64950  78.76897  78.80736  79.81275  80.52056  86.16151  84.31874  82.90969 
##       201       202       203       204       205       206       207       208 
##  80.22751  79.95144  79.16455  86.91438  80.45776  83.03616  81.20316  79.77685 
##       209       210       211       212       213       214       215       216 
##  81.19356  79.99647  84.09687  78.27573  80.49052  80.86473  80.89989  81.70446 
##       217       218       219       220       221       222       223       224 
##  79.75560  80.35548  78.62211  80.83886  81.26718  76.98884  82.57107  78.82091 
##       225       226       227       228       229       230       231       232 
##  73.20540  80.73887  80.53003  82.10081  82.07548  76.45128  78.49053  81.56820 
##       233       234       235       236       237       238       239       240 
##  85.91794  83.92520  82.41435  79.31196  79.97506  80.26811  83.78061  78.29002 
##       241       242       243       244       245       246       247       248 
##  81.56839  83.58202  81.32763  79.19948  78.47194  81.75620  79.44828  81.91921 
##       249       250       251       252       253       254       255       256 
##  80.04707  81.62991  82.01191  69.91561  84.83660  50.81624  79.78578  84.73741 
##       257       258       259 
##  78.52096  81.35358  79.91782

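# overlay evaluation-set predictions (blue) on training-set wins (red);
# the two series differ in length, so this is a rough visual check only
plot(baseball_training$TARGET_WINS, type = "l", lty = 1, col = "red")
lines(pred2, type = "l", col = "blue")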

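# NOTE: as written, this is sqrt(mean(d)^2) = |mean(d)|, not sqrt(mean(d^2)),
# and the 259 evaluation predictions are recycled against the longer
# training vector (hence the warning), so the value is not a true RMSE
rmse2 <- sqrt(mean(pred2-baseball_training$TARGET_WINS)^2)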

## Warning in pred2 - baseball_training$TARGET_WINS: longer object length is not a
## multiple of shorter object length

rmse2

## [1] 0.02142157
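
For reference, a root-mean-square error averages the squared errors before taking the square root, and it needs predicted and actual values for the same rows. The sketch below uses a hypothetical helper `rmse()` and made-up numbers (not values from this data set) to show how that differs from the square root of the squared mean difference used above.

```r
# Hypothetical helper: RMSE for equal-length vectors of actual and predicted values
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))  # refuse silent recycling
  sqrt(mean((predicted - actual)^2))
}

# Made-up illustration values, not taken from the moneyball data
actual    <- c(80, 90, 70, 85)
predicted <- c(82, 86, 75, 84)

rmse_value <- rmse(actual, predicted)           # square each error, then average
abs_bias   <- sqrt(mean(predicted - actual)^2)  # |mean error|: positive and negative errors cancel

rmse_value  # about 3.39
abs_bias    # 0.5
```

The second quantity can be close to zero even when individual predictions are far off, because over- and under-predictions offset each other.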

0.3.6 First prediction using model 3

pred3 <- predict(fit_hits_errors,newdata = data.frame(baseball_evaluation2))
pred3

##         1         2         3         4         5         6         7         8 
##  69.92189  70.60449  78.87383  86.57739  72.09565  72.72353  74.73608  75.25965 
##         9        10        11        12        13        14        15        16 
##  70.86333  78.43871  79.65801  83.95100  80.40780  82.60084  80.38189  80.98343 
##        17        18        19        20        21        22        23        24 
##  74.03710  82.10969  70.40489  93.37952  84.85222  88.86251  81.29777  77.92393 
##        25        26        27        28        29        30        31        32 
##  85.76686  87.80418  63.57158  80.93589  85.84209  83.38473  88.42351  86.12408 
##        33        34        35        36        37        38        39        40 
##  85.01269  86.14169  81.89808  81.60331  76.80906  87.50736  87.88554  89.17448 
##        41        42        43        44        45        46        47        48 
##  84.38400  88.60607  39.06137  83.86190  80.15680  85.00856  90.08344  72.83338 
##        49        50        51        52        53        54        55        56 
##  71.60092  78.66625  83.48078  90.28015  78.57083  75.74656  79.35438  80.84595 
##        57        58        59        60        61        62        63        64 
##  82.07339  71.49616  62.12779  76.01162  85.87424  79.82496  84.94993  84.11700 
##        65        66        67        68        69        70        71        72 
##  82.68306  87.78081  76.29441  85.14064  73.52318  79.30423  89.40244  81.22710 
##        73        74        75        76        77        78        79        80 
##  81.27792  83.64773  87.91639  86.75402  79.03760  80.60475  71.34585  77.37806 
##        81        82        83        84        85        86        87        88 
##  90.14202  91.78527  98.51390  85.43525  82.53457  83.90018  78.21573  83.77106 
##        89        90        91        92        93        94        95        96 
##  82.57503  86.37456  74.97354  78.60790  74.76515  81.47932  79.24140  75.90281 
##        97        98        99       100       101       102       103       104 
##  79.21503  98.60902  89.45587  89.00847  86.28271  77.32394  87.13607  80.33120 
##       105       106       107       108       109       110       111       112 
##  80.13284  74.93526  58.35551  83.07222  82.95817  65.79888  80.67781  82.65071 
##       113       114       115       116       117       118       119       120 
##  90.94625  89.70126  83.90972  83.36449  89.26695  80.31917  81.89505  69.72583 
##       121       122       123       124       125       126       127       128 
##  84.30796  66.00216  68.94005  62.39899  69.33363  90.43742  93.42221  77.18817 
##       129       130       131       132       133       134       135       136 
##  89.42069  95.56621  89.65630  82.79842  77.38148  82.47851  83.04879  71.15935 
##       137       138       139       140       141       142       143       144 
##  76.40127  82.17776  84.11007  80.84765  68.74505  69.92721  93.06669  80.91154 
##       145       146       147       148       149       150       151       152 
##  76.88632  76.47451  78.64481  82.86925  86.65137  82.37914  82.06659  83.58462 
##       153       154       155       156       157       158       159       160 
##  52.64843  77.55826  78.15131  74.94113  83.45182  68.75187  85.16782  71.88811 
##       161       162       163       164       165       166       167       168 
##  96.33997  96.81719  88.34111  97.38312  90.35794  87.39468  83.77134  82.69735 
##       169       170       171       172       173       174       175       176 
##  75.77354  82.96511  83.71000  82.60755  81.89726  92.58346  84.48676  78.14771 
##       177       178       179       180       181       182       183       184 
##  77.90270  78.45673  78.02985  79.45286  75.68671  80.95384  83.68254  82.54374 
##       185       186       187       188       189       190       191       192 
##  87.91744  83.80529  83.26453  63.96956  65.69721 106.30433  71.92543  79.26786 
##       193       194       195       196       197       198       199       200 
##  80.78624  85.04144  88.41274  74.73990  79.83767  78.19445  78.22955  84.29407 
##       201       202       203       204       205       206       207       208 
##  78.53667  82.25292  75.25147  84.98479  80.57516  76.80954  82.74379  79.51389 
##       209       210       211       212       213       214       215       216 
##  77.18789  70.67993  94.62582  90.37015  78.31362  71.69743  75.18288  87.93221 
##       217       218       219       220       221       222       223       224 
##  85.50733  86.75756  80.20512  76.85669  81.41352  79.15248  84.80038  81.85919 
##       225       226       227       228       229       230       231       232 
##  96.45704  75.60749  81.57202  80.02834  80.73452  74.65767  72.29034  91.64071 
##       233       234       235       236       237       238       239       240 
##  80.38514  85.58598  78.34181  75.03108  83.80862  77.22843  83.46140  73.83059 
##       241       242       243       244       245       246       247       248 
##  89.14121  88.92305  88.04102  86.71147  66.30425  90.47187  82.44069  83.14790 
##       249       250       251       252       253       254       255       256 
##  77.86597  84.56626  82.80405  66.35048  85.08240  33.37564  71.99924  76.27058 
##       257       258       259 
##  78.73635  79.56599  80.69877

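# overlay evaluation-set predictions (blue) on training-set wins (red);
# the two series differ in length, so this is a rough visual check only
plot(baseball_training$TARGET_WINS, type = "l", lty = 1, col = "red")
lines(pred3, type = "l", col = "blue")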

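# NOTE: as written, this is sqrt(mean(d)^2) = |mean(d)|, not sqrt(mean(d^2)),
# and the 259 evaluation predictions are recycled against the longer
# training vector (hence the warning), so the value is not a true RMSE
rmse3 <- sqrt(mean(pred3-baseball_training$TARGET_WINS)^2)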

## Warning in pred3 - baseball_training$TARGET_WINS: longer object length is not a
## multiple of shorter object length

rmse3

## [1] 0.01683029
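
The recycling warning arises because the predictions come from the evaluation data while the actual wins come from the training data. One way to get predictions and actual values for the same rows is to hold out part of the training data for validation. The following is a minimal sketch on simulated data; the data frame `toy` and its columns `hits`, `errors`, and `wins` are hypothetical stand-ins, not the moneyball variables.

```r
set.seed(621)  # reproducible simulated data; not the moneyball file

# Simulated stand-in for the training data (hypothetical columns)
n <- 200
toy <- data.frame(hits = rnorm(n, 1450, 100), errors = rnorm(n, 150, 30))
toy$wins <- 20 + 0.05 * toy$hits - 0.08 * toy$errors + rnorm(n, 0, 8)

# Hold out 20% of the rows for validation
holdout_rows <- sample(seq_len(n), size = round(0.2 * n))
train_part   <- toy[-holdout_rows, ]
valid_part   <- toy[holdout_rows, ]

# Fit on the training part, predict on the held-out part
fit_toy    <- lm(wins ~ hits + errors, data = train_part)
pred_valid <- predict(fit_toy, newdata = valid_part)

# Predictions and actual values refer to the same rows, so nothing is recycled
rmse_valid <- sqrt(mean((pred_valid - valid_part$wins)^2))
rmse_valid
```

An RMSE computed this way is in the same units as the response (wins), which makes it easier to judge whether a value is low.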

0.4 Select Models

Based on the analyses above, I would choose the second model as the best predictor of the three. Its multiple R-squared is substantially higher than that of the other two (0.69 vs. ~0.22), the overall model p-value is below 0.05, and all of its predictor variables' p-values are also below 0.05. This model, based on pitching statistics, may be a better predictor of TARGET_WINS than batting alone, as it explains about 69% of the variance in wins. The values reported as RMSE (0.021 and 0.017 for the second and third models) do not change this choice: they compare evaluation-set predictions with training-set wins, which have different lengths, and the formula squares the mean difference instead of averaging the squared differences, so they reflect average bias rather than prediction error. A true RMSE, where lower values indicate a better fit and more precise predictions, would need predictions and actual wins for the same rows.

As for next steps, I would continue adjusting the second model, adding some TEAM_BATTING predictors to see whether they raise the adjusted R-squared and improve the model's performance. This would involve some trial and error.
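
The next step described above, adding batting predictors to the pitching model and checking whether they help, could be sketched as follows. The data is simulated, and the names `fit_pitching`, `fit_extended`, and the `toy` columns are hypothetical; only `lm()`, `update()`, `summary()`, and `anova()` are standard R.

```r
set.seed(621)  # simulated data for illustration only

n <- 300
toy <- data.frame(
  pitching_h = rnorm(n, 1500, 120),  # hypothetical stand-in for a TEAM_PITCHING column
  batting_h  = rnorm(n, 1450, 100)   # hypothetical stand-in for a TEAM_BATTING column
)
toy$wins <- 60 - 0.03 * toy$pitching_h + 0.045 * toy$batting_h + rnorm(n, 0, 8)

# Baseline model with the pitching predictor only
fit_pitching <- lm(wins ~ pitching_h, data = toy)

# Add one batting predictor without retyping the formula
fit_extended <- update(fit_pitching, . ~ . + batting_h)

# Adjusted R-squared penalizes extra predictors, unlike multiple R-squared
adj_r2 <- c(
  pitching = summary(fit_pitching)$adj.r.squared,
  extended = summary(fit_extended)$adj.r.squared
)
adj_r2

# Partial F-test for the nested models: does the added predictor improve the fit?
anova(fit_pitching, fit_extended)
```

Comparing adjusted R-squared rather than multiple R-squared matters here because multiple R-squared never decreases when a predictor is added, so it cannot show that a new variable is unhelpful.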

BrianSingh_Data621_HW1

Brian Singh

2023-09-28