Introduction

  • In this paper I attempt to build a model that predicts team wins in MLB. The model is fit and evaluated on training and evaluation datasets of real-life yearly baseball team statistics.
  • Through data exploration, I found several data entry mistakes in these data frames:
    • Several entries of 0, which are impossible
    • Historically inaccurate entries. There is also no way to verify whether a historically incorrect value means the entire observation (all stats) is corrupt, or only that single column entry.
      • To identify and handle these misentries, I provide links to historical MLB records and adjust the data accordingly.
      • Unfortunately, this may cause a problem on the prediction side. Because I have filtered wins to the historical range (min 20, max 116), the model may miss on any test observations that would need a prediction outside that range. As this is an entry-level attempt at a linear model, hopefully this does not hurt the model too badly.
  • The paper proceeds by using what was learned in exploration to perform some initial data transformations. I break the categories into 3 areas:
    • Batting
    • Baserunning
    • Pitching
  • I explore these categories, filter them as I see best, and attempt to visualize and display their correlation with our target variable (Wins).
  • This project highlights a major concern with observational studies. I came to realize that many of the pitching statistics must have been entered incorrectly into the data frame. There are an outright 780 entries that duplicate the batting statistics, and the correlations in some of the pitching fields do not make sense either. I will guide you through the data that brought me to this realization.

Data Exploration

NA counts
na_count
INDEX 0
TEAM_BATTING_H 0
TEAM_BATTING_2B 0
TEAM_BATTING_3B 0
TEAM_BATTING_HR 0
TEAM_BATTING_BB 0
TEAM_BATTING_SO 18
TEAM_BASERUN_SB 13
TEAM_BASERUN_CS 87
TEAM_BATTING_HBP 240
TEAM_PITCHING_H 0
TEAM_PITCHING_HR 0
TEAM_PITCHING_BB 0
TEAM_PITCHING_SO 18
TEAM_FIELDING_E 0
TEAM_FIELDING_DP 31

General data observations of Entire Dataset

vars n mean sd median trimmed mad min max range skew kurtosis se
INDEX 1 2276 1268.46353 736.34904 1270.5 1268.56970 952.5705 1 2535 2534 0.0042149 -1.2167564 15.4346788
TARGET_WINS 2 2276 80.79086 15.75215 82.0 81.31229 14.8260 0 146 146 -0.3987232 1.0274757 0.3301823
TEAM_BATTING_H 3 2276 1469.26977 144.59120 1454.0 1459.04116 114.1602 891 2554 1663 1.5713335 7.2785261 3.0307891
TEAM_BATTING_2B 4 2276 241.24692 46.80141 238.0 240.39627 47.4432 69 458 389 0.2151018 0.0061609 0.9810087
TEAM_BATTING_3B 5 2276 55.25000 27.93856 47.0 52.17563 23.7216 0 223 223 1.1094652 1.5032418 0.5856226
TEAM_BATTING_HR 6 2276 99.61204 60.54687 102.0 97.38529 78.5778 0 264 264 0.1860421 -0.9631189 1.2691285
TEAM_BATTING_BB 7 2276 501.55888 122.67086 512.0 512.18331 94.8864 0 878 878 -1.0257599 2.1828544 2.5713150
TEAM_BATTING_SO 8 2174 735.60534 248.52642 750.0 742.31322 284.6592 0 1399 1399 -0.2978001 -0.3207992 5.3301912
TEAM_BASERUN_SB 9 2145 124.76177 87.79117 101.0 110.81188 60.7866 0 697 697 1.9724140 5.4896754 1.8955584
TEAM_BASERUN_CS 10 1504 52.80386 22.95634 49.0 50.35963 17.7912 0 201 201 1.9762180 7.6203818 0.5919414
TEAM_BATTING_HBP 11 191 59.35602 12.96712 58.0 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681
TEAM_PITCHING_H 12 2276 1779.21046 1406.84293 1518.0 1555.89517 174.9468 1137 30132 28995 10.3295111 141.8396985 29.4889618
TEAM_PITCHING_HR 13 2276 105.69859 61.29875 107.0 103.15697 74.1300 0 343 343 0.2877877 -0.6046311 1.2848886
TEAM_PITCHING_BB 14 2276 553.00791 166.35736 536.5 542.62459 98.5929 0 3645 3645 6.7438995 96.9676398 3.4870317
TEAM_PITCHING_SO 15 2174 817.73045 553.08503 813.5 796.93391 257.2311 0 19278 19278 22.1745535 671.1891292 11.8621151
TEAM_FIELDING_E 16 2276 246.48067 227.77097 159.0 193.43798 62.2692 65 1898 1833 2.9904656 10.9702717 4.7743279
TEAM_FIELDING_DP 17 1990 146.38794 26.22639 149.0 147.57789 23.7216 52 228 176 -0.3889390 0.1817397 0.5879114
  • Our training dataset contains yearly results for 2276 MLB team seasons
  • 17 total variables
    • Numeric index representing each team
    • Variable tracking wins (our target variable)
    • 7 different batting statistics
    • 2 baserunning statistics
    • 4 pitching statistics
    • 2 fielding statistics
  • Our testing dataset has 259 observations and has all of the same variables except for our target variable (Wins)
    • This leaves us with an issue. Below I filter our test data for modern-era historical records. This allows us to correct many of the outliers that are present in the data.

Explore target variable Wins

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 2276 80.79086 15.75215 82 81.31229 14.826 0 146 146 -0.3987232 1.027476 0.3301823
  • Our mean and median are of little practical value. There are 162 games in a season and every game has one winner and one loser, so we should expect to see about 81 wins on average given our total sample size.
  • This is not exact, because seasons since the 1800s have not all been 162 games long, but my interval test shows a left-tail interval of 80.35-81, and our mean of 80.79 falls within it
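As a rough check on the point above, an interval for the mean can be recomputed from the reported summary statistics alone. This is only a sketch: it assumes a two-sided 95% t interval, which is not necessarily how the 80.35-81 interval quoted above was produced, and the variable names are made up for illustration.

```r
# Summary statistics copied from the describe() output for TARGET_WINS
n      <- 2276
mean_w <- 80.79086
sd_w   <- 15.75215

# Standard error of the mean and a two-sided 95% t interval (assumed method)
se_w   <- sd_w / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
ci     <- c(lower = mean_w - t_crit * se_w, upper = mean_w + t_crit * se_w)

# Does the 81-win benchmark (half of a 162-game season) fall inside the interval?
contains_81 <- ci[["lower"]] <= 81 && 81 <= ci[["upper"]]

print(round(ci, 2))
print(contains_81)
```

The standard error computed here matches the `se` column of the summary table, which is a quick sanity check that the inputs were copied correctly.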

  • While our average wins seem fine, our min and max seem incorrect
    • The most wins ever by an MLB team is 116
    • The fewest wins by an MLB team in the modern era is 20

Remove outliers discovered in Wins category

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 2255 80.58537 15.05826 82 81.18726 14.826 21 116 95 -0.4395385 0.3152447 0.3171039
  • The mean of our sample moved further from our expected population mean when we removed the incorrect outliers; however, it is still within the confidence interval. We lost 21 observations (2276 down to 2255) when removing these entries, which suffered from some sort of data entry mistake. This does raise concerns about the validity of all the entries; however, with no way to map an index to a specific team, let's continue

View target variable distribution before and after filters

  • The median and mean are around 81-82 and the histogram looks normal, although there is somewhat of a left tail. We can assume our target variable is approximately normally distributed

Descriptive Look at batting statistics

Create and view summary of batting DF

vars n mean sd median trimmed mad min max range skew kurtosis se
TEAM_BATTING_H 1 2255 1466.81685 135.92839 1454 1458.24377 114.1602 992 2496 1504 1.2098420 4.8627441 2.8624434
TEAM_BATTING_2B 2 2255 241.09490 46.28571 238 240.29695 47.4432 69 458 389 0.2012387 -0.0427041 0.9747061
TEAM_BATTING_3B 3 2255 54.93348 27.63678 47 51.89695 23.7216 0 223 223 1.1205339 1.5849716 0.5819881
TEAM_BATTING_HR 4 2255 100.22395 60.39496 103 98.07867 77.0952 0 264 264 0.1771368 -0.9584953 1.2718252
TEAM_BATTING_BB 5 2255 503.80089 119.41435 513 513.29640 93.4038 29 878 849 -0.9614819 2.1131941 2.5146829
TEAM_BATTING_SO 6 2155 739.51323 244.66509 753 744.96464 284.6592 0 1399 1399 -0.2526910 -0.4140827 5.2704582
TEAM_BATTING_HBP 7 191 59.35602 12.96712 58 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681
Singles 8 2255 1070.56452 121.79442 1049 1058.05319 97.8516 709 2112 1403 1.7574832 6.7989398 2.5648036

Fig 1: Histograms of the batting statistics (TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BATTING_HBP, Singles)

Fig 2: Boxplots of the same batting statistics. Outliers flagged per variable: TEAM_BATTING_H 61, TEAM_BATTING_2B 11, TEAM_BATTING_3B 26, TEAM_BATTING_HR 0, TEAM_BATTING_BB 120, TEAM_BATTING_SO 0, TEAM_BATTING_HBP 1, Singles 70

  • Triples, hits, walks, and singles have a large number of outliers.
    • Strikeouts and home runs are the only variables with no outliers
  • All of the means and medians are very close to each other
    • However, the sd seems rather high in certain categories relative to the mean. Without expert knowledge, I cannot say whether this is normal
    • Some min values in these categories are 0, which is impossible.
      • Exploring these categories, there are only a handful of 0's per category (roughly 15 or fewer). Given that the other categories in these observations seem in line, some sort of data entry mistake was likely made again. In this case, let's replace the 0’s with NA.
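The zero-to-NA replacement described above can be sketched as follows. The toy data frame and the `zero_cols` vector are hypothetical stand-ins, since the original code for this step is not shown.

```r
# Hypothetical toy rows standing in for the batting data frame
batting_toy <- data.frame(
  TEAM_BATTING_3B = c(47, 0, 55),
  TEAM_BATTING_HR = c(0, 102, 264),
  TEAM_BATTING_SO = c(750, 0, 1399)
)

# Columns where a season total of zero is treated as a recording error
zero_cols <- c("TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_BATTING_SO")

# Replace exact zeros with NA, leaving any existing NAs untouched
batting_toy[zero_cols] <- lapply(batting_toy[zero_cols], function(x) {
  x[!is.na(x) & x == 0] <- NA
  x
})

# NA count per column after the replacement
print(colSums(is.na(batting_toy)))
```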

Data manipulation of batting statistics

Replace 0 with NA’s

vars n mean sd median trimmed mad min max range skew kurtosis se
TEAM_BATTING_H 1 2255 1466.81685 135.92839 1454.0 1458.24377 114.1602 992 2496 1504 1.2098420 4.8627441 2.8624434
TEAM_BATTING_2B 2 2255 241.09490 46.28571 238.0 240.29695 47.4432 69 458 389 0.2012387 -0.0427041 0.9747061
TEAM_BATTING_3B 3 2254 54.95785 27.61866 47.0 51.91075 23.7216 8 223 215 1.1240833 1.5881365 0.5817357
TEAM_BATTING_HR 4 2244 100.71524 60.13266 104.0 98.55679 75.6126 3 264 261 0.1785408 -0.9565209 1.2694014
TEAM_BATTING_BB 5 2255 503.80089 119.41435 513.0 513.29640 93.4038 29 878 849 -0.9614819 2.1131941 2.5146829
TEAM_BATTING_SO 6 2140 744.69673 237.52652 756.5 747.73598 283.9179 67 1399 1332 -0.1320162 -0.7184334 5.1345834
TEAM_BATTING_HBP 7 191 59.35602 12.96712 58.0 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681
Singles 8 2255 1070.56452 121.79442 1049.0 1058.05319 97.8516 709 2112 1403 1.7574832 6.7989398 2.5648036

Fig 3

NA counts
na_count
INDEX 0
TARGET_WINS 0
TEAM_BATTING_H 0
TEAM_BATTING_2B 0
TEAM_BATTING_3B 1
TEAM_BATTING_HR 11
TEAM_BATTING_BB 0
TEAM_BATTING_SO 115
TEAM_BASERUN_SB 121
TEAM_BASERUN_CS 757
TEAM_BATTING_HBP 2064
TEAM_PITCHING_H 0
TEAM_PITCHING_HR 0
TEAM_PITCHING_BB 0
TEAM_PITCHING_SO 100
TEAM_FIELDING_E 0
TEAM_FIELDING_DP 271
Singles 0

Examine historical record to detect incorrect outliers

  • Triples
    • max: 153
    • min: 11
  • Home run records
    • This link provides some good evidence that our assumptions so far were correct. The fewest HR ever recorded were 3 and the most were 264, which matches our dataset
  • Doubles records
    • max: 376
    • min: 110
  • Hits records
    • max: 1,783
  • Base on balls
    • min: 282
    • max: 835
  • Singles
    • min: 811
    • max: 1,338

Filter for historical records

  • The applied filters are printed in the code blocks below
  • NA values in all batting predictor columns except strikeouts are discarded by these filters
    • Looking at fig 3 above, there are not many of these NA values, so I felt they should simply be discarded as a precaution
  • For strikeouts there were 115 NA values
    • I attempted to run a linear model for imputation, but considering none of the predictors were significant, I chose to replace the missing values with the median. The distribution of strikeouts seems normal, although the median is higher than the mean, so I think either mean or median would be an adequate choice for NA replacement.
    • I also preserve a copy of the data frame with the NA values, so I can eventually see what models that drop these NA values look like
## Change in batting dataset
library(dplyr)   # provides %>% and filter()
library(knitr)   # provides kable()

# Keep only seasons inside the historical team records listed above.
# filter() also drops rows where any of these comparisons evaluates to NA.
batting <- batting %>% 
    filter(TEAM_BATTING_3B < 154  &
           TEAM_BATTING_3B > 10   &
           TEAM_BATTING_2B < 377  &
           TEAM_BATTING_2B > 109  &   
           TEAM_BATTING_H  < 1784 &
           TEAM_BATTING_BB > 281  &
           TEAM_BATTING_BB < 836  &
           Singles         < 1339 &
           Singles         > 810   )

# Preserve a copy that still has the strikeout NAs, then fill them with the median
batting_w_na <- batting
batting$TEAM_BATTING_SO[is.na(batting$TEAM_BATTING_SO)] <- median(batting$TEAM_BATTING_SO, na.rm=TRUE)    

# Abandoned attempt at model-based imputation (no predictor was significant)
#lmodr <- lm(logit(TEAM_BATTING_SO/100)~TEAM_BATTING_3B+TEAM_BATTING_2B+TEAM_BATTING_H+TEAM_BATTING_BB,batting)
#ilogit(predict(lmodr,batting[is.na(batting$TEAM_BATTING_SO),]))*100
#summary(lmodr)

# Count the remaining NAs in each column
na_count <- sapply(batting, function(y) sum(is.na(y)))
na_count <- data.frame(na_count) 
kable(na_count,caption="NA counts")
NA counts
na_count
TEAM_BATTING_H 0
TEAM_BATTING_2B 0
TEAM_BATTING_3B 0
TEAM_BATTING_HR 0
TEAM_BATTING_BB 0
TEAM_BATTING_SO 0
TEAM_BATTING_HBP 1901
Singles 0

Figure 4: Reexamine boxplot post data manipulation

Boxplots of the batting statistics after filtering (n = 2092 per variable; TEAM_BATTING_HBP n = 191). Outliers flagged per variable: TEAM_BATTING_H 15, TEAM_BATTING_2B 6, TEAM_BATTING_3B 29, TEAM_BATTING_HR 0, TEAM_BATTING_BB 12, TEAM_BATTING_SO 0, TEAM_BATTING_HBP 1, Singles 15

  • Our data set has been trimmed from 2255 to 2092 observations, per the n reported in the summaries above
  • Comparing fig 4 with fig 2, we can see that many of the outliers in singles, hits, and walks were removed
    • Triples still has many outliers
    • Some small outliers remain in other categories as well

Batting statistics overall correlations

  • HBP has the weakest correlations. It can likely be safely discarded
  • Triples are heavily negatively correlated with home runs and strikeouts
  • Home runs and strikeouts share the largest correlation (positive) of all batting statistics

How do batting statistics correlate with our target(Wins)?

  • There aren’t really any strong predictors.
    • Relative to the available predictors, hits, doubles, and home runs have the strongest correlation with the target variable (Wins).
    • Strikeouts and HBP stand out as having little to no correlation with wins
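The ranking above comes from a correlation plot. As a minimal sketch of the underlying calculation, the snippet below correlates each predictor with wins on simulated data. The data, the effect size, and the name `wins_cor` are all assumptions for illustration, and rows with NA are simply skipped for each pair.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: wins depend on hits but not on strikeouts
toy <- data.frame(
  TEAM_BATTING_H  = rnorm(n, mean = 1450, sd = 110),
  TEAM_BATTING_SO = rnorm(n, mean = 750, sd = 240)
)
toy$TARGET_WINS <- -65 + 0.1 * toy$TEAM_BATTING_H + rnorm(n, sd = 12)
toy$TEAM_BATTING_SO[1:10] <- NA   # mimic missing strikeout totals

# Correlation of every predictor with the target, ignoring incomplete pairs
predictors <- setdiff(names(toy), "TARGET_WINS")
wins_cor <- sapply(toy[predictors], function(x) {
  cor(x, toy$TARGET_WINS, use = "complete.obs")
})

# Order by absolute strength so weak predictors sink to the bottom
wins_cor <- wins_cor[order(abs(wins_cor), decreasing = TRUE)]
print(round(wins_cor, 3))
```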

Stolen bases

Edit stolen bases via historical records and observe correlations

  • Stolen bases and caught stealing
  • fewest stolen bases = 13
  • most stolen bases = 415
  • fewest caught stealing = 8
  • most caught stealing = 191
  • Caught stealing has a substantial number of NAs. If we filter out the NAs, we will lose nearly 1/3 of our data. Therefore I chose to use the median for all fields where caught stealing is NA
NA counts
na_count
INDEX 0
TARGET_WINS 0
TEAM_BATTING_H 0
TEAM_BATTING_2B 0
TEAM_BATTING_3B 0
TEAM_BATTING_HR 0
TEAM_BATTING_BB 0
TEAM_BATTING_SO 0
TEAM_BASERUN_SB 0
TEAM_BASERUN_CS 0
TEAM_BATTING_HBP 1853
TEAM_PITCHING_H 0
TEAM_PITCHING_HR 0
TEAM_PITCHING_BB 0
TEAM_PITCHING_SO 100
TEAM_FIELDING_E 0
TEAM_FIELDING_DP 0
Singles 0

  • Baserunning doesn’t appear to have much correlation with our target variable

Pitching categories

Display pitching categories

vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET_WINS 1 2044 80.52789 13.89309 81.0 80.96638 14.8260 21 116 95 -0.3090862 -0.0162643 0.3072970
TEAM_PITCHING_H 2 2044 1539.55479 220.16288 1498.0 1513.97127 146.7774 1137 6723 5586 7.2180667 150.1985045 4.8697161
TEAM_PITCHING_HR 3 2044 111.71771 59.92485 114.0 110.10391 68.1996 3 343 340 0.1784958 -0.5918350 1.3254597
TEAM_PITCHING_BB 4 2044 553.66830 109.40697 541.0 546.10452 91.9212 312 2169 1857 2.2520497 23.8773902 2.4199397
TEAM_PITCHING_SO 5 1944 807.61060 228.05424 818.0 803.06427 250.5594 301 2309 2008 0.3567090 0.6392627 5.1723752
TEAM_FIELDING_E 6 2044 188.89922 109.36885 150.5 166.38570 49.6671 65 765 700 2.3312285 5.9463732 2.4190964
TEAM_FIELDING_DP 7 2044 147.84883 24.20592 149.0 148.80746 20.7564 68 228 160 -0.3664606 0.4738264 0.5354035

Histograms of TARGET_WINS and the pitching and fielding statistics (TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E, TEAM_FIELDING_DP)

Boxplots of the same variables (n = 2044 per variable; TEAM_PITCHING_SO n = 1944). Outliers flagged per variable: TARGET_WINS 7, TEAM_PITCHING_H 99, TEAM_PITCHING_HR 3, TEAM_PITCHING_BB 49, TEAM_PITCHING_SO 9, TEAM_FIELDING_E 178, TEAM_FIELDING_DP 56

  • The pitching and fielding stats have a lot of outliers

Historical pitching records

  • Pitching records were much harder to find
  • Home runs allowed
    • Most home runs allowed: 258
  • Errors
    • Most errors: 867

Filter for historical records and run correlation plot

  • Having some knowledge of baseball, this seems weird
    • More hits, walks, and home runs allowed are associated with an increase in total team wins?
  • Something else is wrong with the data

Look at batting to pitching correlations

  • Pitching HR has a perfect correlation with batting HR
  • Pitching BB has a perfect correlation with batting BB
  • Pitching SO has a perfect correlation with batting SO
  • Pitching hits has a strong correlation with batting hits

Problems with correlations

  • I noticed some duplicate entries between the batting and pitching stats
  • I am hoping that these duplicate values are what is causing the problems in my correlations
  • Let's check how many duplicate values there are between the batting and pitching categories
##   total_equal_categories
## 1                    785
  • According to our data frame, 785 teams allowed exactly as many hits as they made themselves
    • That means that over 30% of the teams on this list happened to strike out exactly as many times as their pitchers struck out opposing batters
    • This is nearly impossible
  • The correlation plot below uses only these shared values (a 785-observation data frame) and shows that all of these categories are identical entries
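One way to count such duplicated rows is sketched below on a hypothetical four-row data frame. Which column pairs were compared to arrive at 785 is not shown in the original, so the `pairs` list here is an assumption.

```r
# Hypothetical toy data: rows 1 and 3 repeat the batting totals as pitching totals
toy <- data.frame(
  TEAM_BATTING_H   = c(1400, 1500, 1600, 1450),
  TEAM_PITCHING_H  = c(1400, 1550, 1600, 1450),
  TEAM_BATTING_HR  = c(100, 120, 90, 80),
  TEAM_PITCHING_HR = c(100, 120, 90, 95),
  TEAM_BATTING_BB  = c(500, 520, 480, 510),
  TEAM_PITCHING_BB = c(500, 520, 480, 510),
  TEAM_BATTING_SO  = c(800, 900, 950, NA),
  TEAM_PITCHING_SO = c(800, 900, 950, 700)
)

# Batting/pitching column pairs to compare (assumed set of pairs)
pairs <- list(
  c("TEAM_BATTING_H",  "TEAM_PITCHING_H"),
  c("TEAM_BATTING_HR", "TEAM_PITCHING_HR"),
  c("TEAM_BATTING_BB", "TEAM_PITCHING_BB"),
  c("TEAM_BATTING_SO", "TEAM_PITCHING_SO")
)

# One logical column per pair; a comparison involving NA counts as "not equal"
same <- sapply(pairs, function(p) toy[[p[1]]] == toy[[p[2]]])
same[is.na(same)] <- FALSE

# A row is a full duplicate when every pair matches
all_equal <- rowSums(same) == ncol(same)
total_equal_categories <- sum(all_equal)
print(total_equal_categories)
```

Counting matches per pair with `colSums(same)` instead would show whether one category is duplicated more often than the others.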

Filter all of these misentries out of DF

  • There still appears to be a heavy correlation between the pitching and hitting categories
  • More worrisome, the pitching stats still have correlations that make no sense

  • Batting and pitching statistics shouldn’t be correlated with wins in the same direction.
  • These pitching stats suffer from far too many mistakes to be trusted
  • Dropping the 780 observations created a large loss of data and didn’t help to make the data more trustworthy

I have to drop all of the pitching statistics

  • Because of collinearity issues, this likely would have had to happen anyway
  • This shows how dangerous using observational data can be
  • I believe we can still keep all of the duplicate observations as batting stats, because their correlations are similar overall to those of the earlier batting stats and to my expectations for batting
    • This must all be addressed in any tentative conclusion or model

Create custom stats in place of poor pitching stats

  • Singles
  • a slugging scale ( hr=4, triple=3, double=2, single/bb=1)
  • an on base stat (hits + bb+HBP)
NA counts
na_count
INDEX 0
TARGET_WINS 0
TEAM_BATTING_H 0
TEAM_BATTING_2B 0
TEAM_BATTING_3B 0
TEAM_BATTING_HR 0
TEAM_BATTING_BB 0
TEAM_BATTING_SO 0
TEAM_BASERUN_SB 0
TEAM_BASERUN_CS 0
TEAM_BATTING_HBP 1844
TEAM_PITCHING_H 0
TEAM_PITCHING_HR 0
TEAM_PITCHING_BB 0
TEAM_PITCHING_SO 100
TEAM_FIELDING_E 0
TEAM_FIELDING_DP 0
Singles 0
slugging 0
OBP 0
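The three custom stats listed above can be sketched as below. Two points are assumptions rather than the author's confirmed method: singles are taken as hits minus extra-base hits, and a missing HBP is counted as 0 so that OBP has no NAs (consistent with the NA table above, but not shown in code). The weights follow the scale as written in the list.

```r
# Hypothetical toy rows; the first one has no HBP recorded
toy <- data.frame(
  TEAM_BATTING_H   = c(1400, 1500),
  TEAM_BATTING_2B  = c(240, 260),
  TEAM_BATTING_3B  = c(40, 50),
  TEAM_BATTING_HR  = c(100, 150),
  TEAM_BATTING_BB  = c(500, 550),
  TEAM_BATTING_HBP = c(NA, 60)
)

# Singles: hits that were not doubles, triples, or home runs (assumed definition)
toy$Singles <- with(toy, TEAM_BATTING_H - TEAM_BATTING_2B -
                         TEAM_BATTING_3B - TEAM_BATTING_HR)

# Slugging scale as stated in the text: hr=4, triple=3, double=2, single/bb=1
toy$slugging <- with(toy, 4 * TEAM_BATTING_HR + 3 * TEAM_BATTING_3B +
                          2 * TEAM_BATTING_2B + Singles + TEAM_BATTING_BB)

# On-base stat: hits + walks + HBP, treating a missing HBP as 0 (assumption)
hbp_filled <- ifelse(is.na(toy$TEAM_BATTING_HBP), 0, toy$TEAM_BATTING_HBP)
toy$OBP <- toy$TEAM_BATTING_H + toy$TEAM_BATTING_BB + hbp_filled

print(toy[c("Singles", "slugging", "OBP")])
```

Because both stats are sums of columns that are also candidate predictors, they are collinear with those columns by construction.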

View new stat correlations with Wins

corrplot(cor(money_ball_train_2[,c(2,20,19)])[,1, drop=FALSE], cl.pos='n',method="number", number.cex = .7)

  • Both stats have larger correlation with wins than any of the original predictor variables

Building models

  • For almost every model, I manually backwards selected my predictors
  • I have rejected all pitching stats due to poor quality of the data
  • Each model has a slightly different idea behind it
    • Model 1
      • Takes all of the batting, fielding and running stats to start except HBP(which is discarded from all analysis)
      • Advanced diagnostics run to check residuals.
        • In the other models I simply summarize these diagnostics for brevity
    • Model 2
      • Moneyball stats seemed like they would be important, but due to collinearity, likely caused by how I built the stats, they don’t make it into most models.
      • Model 2 attempts to use the moneyball stats with whatever predictors aren’t collinear.
    • Model 3
      • An experimental model to see how scaling and centering my data affects the results
      • model starts with the same inputs as model 1
    • Model 4
      • I run a box cox transformation over our target variable
      • input for this model is the same as model 1
    • Model 5
      • Generalized additive model
        • Produces a slightly different summary result
      • Input for this model is the same as model 1
    • Model 6
      • Throughout data exploration I updated NA values with median values.
      • This Model contains the original NA values
        • Therefore the LM model is run on 1468 observations instead of original 2000+
    • Model 7
      • Attempt at polynomial fitting
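The manual backward selection mentioned at the top of this list can be sketched as a loop. The version below is an assumed procedure, not necessarily the exact one used here: it refits the model, drops the predictor with the largest p-value, and stops once every remaining p-value is at or below a cutoff. The function `backward_select`, the cutoff of 0.05, and the simulated data are all hypothetical.

```r
set.seed(42)
n <- 300

# Simulated data: x1 and x2 drive y, "noise" is unrelated to it
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 80 + 5 * sim$x1 - 3 * sim$x2 + rnorm(n, sd = 5)

backward_select <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit   <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of the slopes only (the intercept is row 1)
    p_vals <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    # Stop when everything left is significant or one predictor remains
    if (max(p_vals) <= alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, names(which.max(p_vals)))
  }
  fit
}

final_fit <- backward_select(sim, "y")
kept <- names(coef(final_fit))[-1]
print(kept)
```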

Model 1 A

  • Model is continuous, linear, and residuals are normal
  • total predictors used - 8
  • Adjusted R squared of .368
  • f stat 148.8
  • RSE= 11.04
  • Dropped TEAM_BATTING_2B, OBP, slugging, and TEAM_BASERUN_CS during backwards selection
  • Our residuals seem linear and normal
  • our qq plot does tail off with some outliers
  • Run some further diagnostics on model 1

## 
## Call:
## lm(formula = y ~ TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles, data = fit_1_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.987  -7.593   0.217   7.144  37.095 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      46.431876   5.799453   8.006 1.97e-15 ***
## TEAM_BATTING_3B   0.189918   0.017127  11.089  < 2e-16 ***
## TEAM_BATTING_HR   0.116007   0.007590  15.285  < 2e-16 ***
## TEAM_BATTING_BB   0.031182   0.003161   9.863  < 2e-16 ***
## TEAM_BATTING_SO  -0.015229   0.002245  -6.784 1.53e-11 ***
## TEAM_BASERUN_SB   0.079259   0.005247  15.106  < 2e-16 ***
## TEAM_FIELDING_E  -0.077432   0.003876 -19.979  < 2e-16 ***
## TEAM_FIELDING_DP -0.112248   0.011893  -9.438  < 2e-16 ***
## Singles           0.027791   0.004155   6.689 2.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.04 on 2025 degrees of freedom
## Multiple R-squared:  0.3701, Adjusted R-squared:  0.3677 
## F-statistic: 148.8 on 8 and 2025 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: y
##                    Df Sum Sq Mean Sq F value    Pr(>F)    
## TEAM_BATTING_3B     1   2208    2208  18.115 2.175e-05 ***
## TEAM_BATTING_HR     1  49985   49985 410.082 < 2.2e-16 ***
## TEAM_BATTING_BB     1  14374   14374 117.927 < 2.2e-16 ***
## TEAM_BATTING_SO     1   1906    1906  15.634 7.952e-05 ***
## TEAM_BASERUN_SB     1   8190    8190  67.193 4.316e-16 ***
## TEAM_FIELDING_E     1  53143   53143 435.990 < 2.2e-16 ***
## TEAM_FIELDING_DP    1   9795    9795  80.362 < 2.2e-16 ***
## Singles             1   5453    5453  44.738 2.905e-11 ***
## Residuals        2025 246829     122                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 1 advanced diagnostics

Test for constant variance

##                 Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    3.9845162  0.2427866 16.4116 < 2.2e-16
## fitted(fit_1) -0.0155285  0.0030003 -5.1756 2.496e-07
## 
## n = 2034, p = 2, Residual SE = 1.14270, R-Squared = 0.01
  • Given the p-value (2.496e-07) on the fitted-values term, our model may have some issues with constant variance
    • This test shows there is a statistically significant change in residual spread as the fitted values change
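The output above has the shape of a regression of the square root of the absolute residuals on the fitted values. A minimal sketch of that kind of check on simulated data follows. The data and the names `var_check` and `slope_p` are assumptions for illustration, and base `summary()` is used instead of whatever function produced the printout above.

```r
set.seed(7)
n <- 500

# Simulated data whose noise grows with x, so variance is NOT constant
x <- runif(n, min = 1, max = 10)
y <- 3 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Regress sqrt(|residual|) on the fitted values; a slope that differs
# clearly from zero suggests the residual spread changes with the fit
var_check <- lm(sqrt(abs(residuals(fit))) ~ fitted(fit))
slope     <- coef(var_check)[[2]]
slope_p   <- summary(var_check)$coefficients[2, "Pr(>|t|)"]

print(c(slope = slope, p_value = slope_p))
```

In this simulation the slope is positive because the spread widens as fitted values grow. In the Model 1 output above the slope is negative, meaning the spread narrows at higher fitted values.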

Normality tests

## 
##  Anderson-Darling normality test
## 
## data:  fit_1$residuals
## A = 0.28768, p-value = 0.6194
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit_1)
## W = 0.99913, p-value = 0.4494
  • The model seems fine according to both normality tests and the earlier Q-Q plot; neither test rejects normality of the residuals
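To show how these tests read, here is a small sketch using the Shapiro-Wilk test from base R on two simulated residual vectors, one normal and one skewed. The vectors are made up for illustration, and the Anderson-Darling test is left out because it needs an extra package.

```r
set.seed(11)

# Two hypothetical residual vectors: one normal, one clearly right-skewed
res_normal <- rnorm(500)
res_skewed <- rexp(500) - 1

sw_normal <- shapiro.test(res_normal)
sw_skewed <- shapiro.test(res_skewed)

# A large p-value means no evidence against normality;
# a tiny p-value means normality is rejected
print(c(W = unname(sw_normal$statistic), p = sw_normal$p.value))
print(c(W = unname(sw_skewed$statistic), p = sw_skewed$p.value))
```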

Test for independence

## 
##  Durbin-Watson test
## 
## data:  y ~ TEAM_BASERUN_CS + TEAM_BASERUN_SB + TEAM_BATTING_2B + TEAM_BATTING_3B +     TEAM_BATTING_BB + TEAM_BATTING_HR + TEAM_FIELDING_DP + TEAM_FIELDING_E
## DW = 1.0338, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
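  • It appears there may be some issues with independence: the statistic of 1.03 is well below 2, which points to positive autocorrelation in the residuals
    • Considering this issue isn’t visible in the plots, there may be a need to add quadratic terms for some predictors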
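For intuition about the statistic reported above, it can be computed by hand as the sum of squared successive differences of the residuals divided by their sum of squares. The sketch below uses simulated residuals, and `durbin_watson` is a hypothetical helper name, not a function from the package that produced the output above.

```r
# Durbin-Watson statistic: near 2 for uncorrelated residuals,
# toward 0 for positive autocorrelation, toward 4 for negative
durbin_watson <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(3)
e_indep <- rnorm(1000)

# AR(1) residuals with coefficient 0.5: each value carries over half of the last
e_auto <- as.numeric(stats::filter(rnorm(1000), filter = 0.5, method = "recursive"))

dw_indep <- durbin_watson(e_indep)
dw_auto  <- durbin_watson(e_auto)
print(round(c(independent = dw_indep, autocorrelated = dw_auto), 3))
```

The statistic depends on the order of the rows, so for data that is not ordered in time a low value is harder to interpret than it would be for a time series.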

leverage points and outliers

## [1] -3.295338
##       761      1433      1691      1808      1810 
## -3.403779  3.320870  3.381547 -3.433216 -3.460660
  • After running this model several different ways, one negative takeaway for the overall analysis is that our outliers are further evidence of data entry failures. Our dataset has 4 seasons with exactly 110 wins, but according to historical records this has only happened twice
    • Those two historical seasons don’t map to the entries in any of the 4 observations with 110 wins in our dataset.
  • With that noted, the model diagnostics were not terrible, and I will rerun this model without the outliers in model 1B
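The output above lists a critical value followed by the observations whose studentized residuals exceed it. A minimal sketch of that kind of outlier screen is below, assuming a Bonferroni-adjusted t cutoff. The exact cutoff settings behind the -3.295338 shown above are not given, and the data here is simulated with one planted outlier.

```r
set.seed(5)
n <- 100

# Simulated regression data with one planted outlier in row 1
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1] <- y[1] + 15
fit <- lm(y ~ x)

# Externally studentized residuals, named by row
stud <- rstudent(fit)

# Bonferroni-adjusted two-sided cutoff at the 5% level (assumed choice)
p    <- length(coef(fit))
crit <- qt(0.05 / (2 * n), df = n - p - 1)

# Rows whose studentized residual is more extreme than the cutoff
flagged <- stud[abs(stud) > abs(crit)]
print(crit)
print(flagged)
```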

Model 1B

  • Remove outliers from model 1A and attempt analysis
  • total predictors used - 9
  • Adjusted r squared is .383
  • F stat 140.6
  • RSE= 10.82
  • Dropped slugging,OBP,TEAM_BATTING_2B
## 
## Call:
## lm(formula = y ~ TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles, data = fit_1b_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.779  -7.474   0.135   7.246  35.294 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      51.515840   5.771956   8.925  < 2e-16 ***
## TEAM_BATTING_3B   0.186005   0.016825  11.055  < 2e-16 ***
## TEAM_BATTING_HR   0.110316   0.007542  14.627  < 2e-16 ***
## TEAM_BATTING_BB   0.029713   0.003116   9.535  < 2e-16 ***
## TEAM_BATTING_SO  -0.016821   0.002210  -7.613 4.10e-14 ***
## TEAM_BASERUN_SB   0.090115   0.005512  16.348  < 2e-16 ***
## TEAM_BASERUN_CS  -0.081294   0.014844  -5.477 4.88e-08 ***
## TEAM_FIELDING_E  -0.084660   0.004052 -20.892  < 2e-16 ***
## TEAM_FIELDING_DP -0.109285   0.011692  -9.347  < 2e-16 ***
## Singles           0.029337   0.004080   7.190 9.07e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.82 on 2019 degrees of freedom
## Multiple R-squared:  0.3852, Adjusted R-squared:  0.3825 
## F-statistic: 140.6 on 9 and 2019 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: y
##                    Df Sum Sq Mean Sq  F value    Pr(>F)    
## TEAM_BATTING_3B     1   2361    2361  20.1681 7.493e-06 ***
## TEAM_BATTING_HR     1  48976   48976 418.4169 < 2.2e-16 ***
## TEAM_BATTING_BB     1  14081   14081 120.2952 < 2.2e-16 ***
## TEAM_BATTING_SO     1   2401    2401  20.5123 6.271e-06 ***
## TEAM_BASERUN_SB     1   8617    8617  73.6138 < 2.2e-16 ***
## TEAM_BASERUN_CS     1    261     261   2.2316    0.1354    
## TEAM_FIELDING_E     1  56180   56180 479.9588 < 2.2e-16 ***
## TEAM_FIELDING_DP    1   9170    9170  78.3399 < 2.2e-16 ***
## Singles             1   6052    6052  51.7002 9.065e-13 ***
## Residuals        2019 236326     117                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 1C

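  • Try model 1 with centered and scaled predictors
  • Identical to model 1A in terms of predictors used (8), F stat, adjusted R squared, and RSE
  • Coefficients take on much larger values, presumably because each is now expressed per standard deviation of its predictor, and the intercept (80.48) now sits at the mean of wins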
## 
## Call:
## lm(formula = y ~ TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles, data = fit_1_db.c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.987  -7.593   0.217   7.144  37.095 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       80.4784     0.2448 328.753  < 2e-16 ***
## TEAM_BATTING_3B    4.6345     0.4179  11.089  < 2e-16 ***
## TEAM_BATTING_HR    6.7045     0.4386  15.285  < 2e-16 ***
## TEAM_BATTING_BB    2.7601     0.2798   9.863  < 2e-16 ***
## TEAM_BATTING_SO   -3.3121     0.4882  -6.784 1.53e-11 ***
## TEAM_BASERUN_SB    5.6539     0.3743  15.106  < 2e-16 ***
## TEAM_FIELDING_E   -8.4613     0.4235 -19.979  < 2e-16 ***
## TEAM_FIELDING_DP  -2.7208     0.2883  -9.438  < 2e-16 ***
## Singles            2.5687     0.3840   6.689 2.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.04 on 2025 degrees of freedom
## Multiple R-squared:  0.3701, Adjusted R-squared:  0.3677 
## F-statistic: 148.8 on 8 and 2025 DF,  p-value: < 2.2e-16
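
The effect of centering can be checked on a small simulated example. The sketch below is not the report's code: the data frame `toy` and its columns are made up, and it assumes the predictors were standardized with `scale()`, which would explain the larger coefficients. It shows that standardizing the predictors leaves R^2 and the residual standard error unchanged while moving the intercept to the mean of the response.

```r
# Hypothetical toy data; column names are illustrative only.
set.seed(42)
n   <- 200
toy <- data.frame(hr = rnorm(n, 100, 60), err = rnorm(n, 190, 100))
toy$wins <- 60 + 0.1 * toy$hr - 0.05 * toy$err + rnorm(n, sd = 10)

# Raw fit versus a fit on centered-and-scaled predictors.
fit_raw <- lm(wins ~ hr + err, data = toy)
toy_c   <- toy
toy_c[c("hr", "err")] <- as.data.frame(scale(toy[c("hr", "err")]))
fit_c   <- lm(wins ~ hr + err, data = toy_c)

# Fit quality is unchanged; only the parameterization differs.
r2_raw  <- summary(fit_raw)$r.squared
r2_c    <- summary(fit_c)$r.squared
rse_raw <- summary(fit_raw)$sigma
rse_c   <- summary(fit_c)$sigma

# With centered predictors the intercept equals the mean response.
intercept_c <- unname(coef(fit_c)["(Intercept)"])
```

Each standardized coefficient is then the change in wins per one standard deviation of its predictor, which makes the magnitudes comparable across predictors.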

Summary from model 1

  • Model 1B, which adjusts for outliers, performs slightly better, but 1A is a simpler model with one fewer predictor
  • Moneyball stats aren’t present in any of these models
  • Diagnostics suggest a need to add polynomial terms

Model 2

  • Moneyball stats, baserunning, and fielding
  • Overall a much simpler model than model 1
  • 6 predictors
  • Adjusted R^2 = .308
  • F statistic = 151.5
  • RSE = 11.55
##                  vars    n    mean     sd median trimmed    mad  min  max
## TEAM_BASERUN_SB     1 2034  116.60  71.33    100  106.99  59.30   18  414
## TEAM_BASERUN_CS     2 2034   51.94  18.39     50   50.12  10.38   11  186
## TEAM_FIELDING_E     3 2034  189.06 109.27    151  166.62  50.41   65  765
## TEAM_FIELDING_DP    4 2034  147.84  24.24    149  148.80  21.50   68  228
## slugging            5 2034 2118.54 238.31   2128 2120.70 240.92 1453 2832
## OBP                 6 2034 1976.50 152.78   1975 1975.51 145.29 1438 2507
##                  range  skew kurtosis   se
## TEAM_BASERUN_SB    396  1.35     2.00 1.58
## TEAM_BASERUN_CS    175  2.03     8.49 0.41
## TEAM_FIELDING_E    700  2.33     5.96 2.42
## TEAM_FIELDING_DP   160 -0.37     0.47 0.54
## slugging          1379 -0.08    -0.24 5.28
## OBP               1069  0.05     0.25 3.39

## 
## Call:
## lm(formula = y ~ ., data = fit_2_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.560  -8.230   0.086   8.084  34.943 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      19.283586   3.617509   5.331 1.09e-07 ***
## TEAM_BASERUN_SB   0.071440   0.005369  13.307  < 2e-16 ***
## TEAM_BASERUN_CS  -0.054363   0.015069  -3.608 0.000317 ***
## TEAM_FIELDING_E  -0.054537   0.003476 -15.691  < 2e-16 ***
## TEAM_FIELDING_DP -0.106807   0.012085  -8.838  < 2e-16 ***
## slugging          0.003939   0.001917   2.055 0.040047 *  
## OBP               0.037159   0.002788  13.329  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.55 on 2027 degrees of freedom
## Multiple R-squared:  0.3096, Adjusted R-squared:  0.3075 
## F-statistic: 151.5 on 6 and 2027 DF,  p-value: < 2.2e-16

Model 2 tests

  • Test for constant variance: could be a problem
    • p = 7.264e-06
    • Given this p-value, our model may have some issues with constant variance
  • Normality tests: passes
    • Anderson-Darling: p = .09
    • Shapiro-Wilk: p = .15
    • Close to significant non-normality, but the residuals still appear normal
  • Independence of residuals
    • Durbin-Watson test: p < 2.2e-16
    • The residuals are autocorrelated, so a polynomial fit is most likely needed
  • Outliers
    • No outliers detected
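
The report does not show the calls behind these tests. As a rough sketch, and assuming nothing about which packages the author used, the same three checks can be reproduced in base R on simulated data: a studentized Breusch-Pagan statistic for constant variance, `shapiro.test` for normality, and the Durbin-Watson statistic for serial correlation in the residuals. The data frame `toy` and all object names are hypothetical.

```r
# Hypothetical toy data standing in for the team statistics.
set.seed(1)
n   <- 300
toy <- data.frame(x1 = runif(n, 50, 250), x2 = runif(n, 60, 400))
toy$y <- 40 + 0.1 * toy$x1 - 0.05 * toy$x2 + rnorm(n, sd = 10)
fit <- lm(y ~ x1 + x2, data = toy)
e   <- residuals(fit)

# Constant variance: studentized (Koenker) Breusch-Pagan test.
# Regress squared residuals on the predictors; n * R^2 ~ chi-square(k).
toy$e2  <- e^2
aux     <- lm(e2 ~ x1 + x2, data = toy)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Normality of residuals: Shapiro-Wilk (base stats).
sw_p <- shapiro.test(e)$p.value

# Serial correlation: Durbin-Watson statistic, near 2 when uncorrelated.
dw_stat <- sum(diff(e)^2) / sum(e^2)
```

The Durbin-Watson statistic depends on the order of the rows, so a value far from 2 can reflect how the observations are sorted as well as a missed nonlinear relationship.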

Model 2 Takeaways

  • Slightly worse RSE (11.55) and lower adjusted R^2 (.308) than model 1
  • Much simpler model, with only 6 predictors and a higher F statistic (151.5) than model 1

Model 3: automatic backward selection with AIC

  • This is the only model built with automatic backward selection
  • The selection kept TEAM_BATTING_2B even though its p-value (0.143) is not significant
  • Adjusted R^2 = .3766
  • Predictors used: 10
  • F statistic = 123.8
  • RSE = 10.96
  • The takeaway from model 3 is that performance isn’t much better than the simpler model 1
## Start:  AIC=9751.67
## y ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + slugging + OBP
## 
## 
## Step:  AIC=9751.67
## y ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + slugging
## 
## 
## Step:  AIC=9751.67
## y ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          243111  9751.7
## - TEAM_BATTING_2B   1       258 243370  9751.8
## - TEAM_BASERUN_CS   1      3374 246485  9777.7
## - Singles           1      5897 249008  9798.4
## - TEAM_BATTING_SO   1      6229 249340  9801.1
## - TEAM_FIELDING_DP  1      9905 253016  9830.9
## - TEAM_BATTING_BB   1     11029 254141  9839.9
## - TEAM_BATTING_3B   1     15474 258585  9875.2
## - TEAM_BATTING_HR   1     23336 266447  9936.1
## - TEAM_BASERUN_SB   1     31319 274430  9996.1
## - TEAM_FIELDING_E   1     51195 294307 10138.4
## 
## Call:
## lm(formula = y ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles, data = fit_1_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.859  -7.587   0.197   7.363  37.316 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      51.478451   5.841835   8.812  < 2e-16 ***
## TEAM_BATTING_2B  -0.010004   0.006823  -1.466    0.143    
## TEAM_BATTING_3B   0.195439   0.017223  11.347  < 2e-16 ***
## TEAM_BATTING_HR   0.113537   0.008148  13.935  < 2e-16 ***
## TEAM_BATTING_BB   0.030298   0.003163   9.580  < 2e-16 ***
## TEAM_BATTING_SO  -0.016108   0.002237  -7.200 8.47e-13 ***
## TEAM_BASERUN_SB   0.090137   0.005583  16.144  < 2e-16 ***
## TEAM_BASERUN_CS  -0.079757   0.015053  -5.299 1.29e-07 ***
## TEAM_FIELDING_E  -0.086042   0.004169 -20.640  < 2e-16 ***
## TEAM_FIELDING_DP -0.107498   0.011841  -9.079  < 2e-16 ***
## Singles           0.029968   0.004278   7.005 3.35e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 2023 degrees of freedom
## Multiple R-squared:  0.3796, Adjusted R-squared:  0.3766 
## F-statistic: 123.8 on 10 and 2023 DF,  p-value: < 2.2e-16
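
The AIC values in the trace above follow the formula that `step`-style selection uses for linear models, n * log(RSS / n) + 2p, where p counts the estimated coefficients. The sketch below first reproduces the starting AIC from the printed RSS (assuming p = 11: ten predictors plus the intercept), then runs a backward selection on simulated data with base R's `step`. The toy data and the use of `step` rather than `MASS::stepAIC` are illustrative assumptions.

```r
# Reproduce the reported AIC from the trace: RSS = 243111, n = 2034.
n_obs   <- 2034
rss     <- 243111
p_coef  <- 11                      # 10 predictors + intercept (assumed)
aic_rep <- n_obs * log(rss / n_obs) + 2 * p_coef

# Backward selection on hypothetical toy data with one pure-noise column.
set.seed(7)
n   <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 80 + 5 * toy$x1 - 3 * toy$x2 + rnorm(n, sd = 10)

full <- lm(y ~ ., data = toy)
sel  <- step(full, direction = "backward", trace = 0)

# Predictors that survived the selection.
kept <- attr(terms(sel), "term.labels")
```

A predictor is dropped only when removing it lowers the AIC. That is why TEAM_BATTING_2B stays despite p = 0.143: removing it would raise the AIC from 9751.7 to 9751.8.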

Model 4: Box-Cox transformation of the target variable

  • Use Box-Cox to transform our target variable
  • 9 predictors used
  • Interpretability of the model is difficult with the transformation
    • lambda = 1.34
    • RSE = 18.1
    • Adjusted R^2 = .37
    • F statistic = 135
  • Results of advanced diagnostics are similar to other models
    • Several outliers present
    • Normality test: passes
    • Failure of the independence test means the predictors likely need to be modeled with polynomial terms

Model 4 Takeaway

  • Adjusted R^2 (.372) is worse than model 1B (.383), and RSE seems worse
  • I left this block of code in because I am not sure I calculated the RSE correctly
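# Profile the Box-Cox log-likelihood over candidate powers
lambda <- MASS::boxcox(lm(y~.,fit_1_db),lambda=seq(0.5,2.5,by=0.1))

my_x <- lambda$x   # candidate powers
my_y <- lambda$y   # log-likelihood at each power
boxpower <- cbind(my_y,my_x)
#boxpower[order(-my_y),]
my_power <- 1.34   # power used in the fit below

# Fit on the transformed response, then drop three predictors
fit_4 <- lm(y^(1.34/1)~.,fit_1_db)
fit_4 <- update(fit_4, . ~ . -slugging)
fit_4 <- update(fit_4, . ~ . -OBP)
fit_4 <- update(fit_4, . ~ . -TEAM_BATTING_2B)

# Show the four diagnostic plots in a 2 x 2 grid
layout(matrix(c(1,2,3,4),2,2))

plot(fit_4)

summary(fit_4)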
## 
## Call:
## lm(formula = y^(1.34/1) ~ TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles, data = fit_1_db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -217.917  -44.612   -0.329   42.735  220.299 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      191.90759   34.49281   5.564 2.99e-08 ***
## TEAM_BATTING_3B    1.12718    0.10045  11.222  < 2e-16 ***
## TEAM_BATTING_HR    0.64807    0.04510  14.369  < 2e-16 ***
## TEAM_BATTING_BB    0.17726    0.01860   9.533  < 2e-16 ***
## TEAM_BATTING_SO   -0.09707    0.01321  -7.350 2.86e-13 ***
## TEAM_BASERUN_SB    0.52484    0.03297  15.918  < 2e-16 ***
## TEAM_BASERUN_CS   -0.47928    0.08882  -5.396 7.61e-08 ***
## TEAM_FIELDING_E   -0.48928    0.02415 -20.258  < 2e-16 ***
## TEAM_FIELDING_DP  -0.64199    0.06990  -9.184  < 2e-16 ***
## Singles            0.16534    0.02437   6.784 1.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.74 on 2024 degrees of freedom
## Multiple R-squared:  0.3752, Adjusted R-squared:  0.3724 
## F-statistic:   135 on 9 and 2024 DF,  p-value: < 2.2e-16
anova(fit_4)
## Analysis of Variance Table
## 
## Response: y^(1.34/1)
##                    Df  Sum Sq Mean Sq  F value    Pr(>F)    
## TEAM_BATTING_3B     1   92055   92055  21.9623 2.966e-06 ***
## TEAM_BATTING_HR     1 1720434 1720434 410.4554 < 2.2e-16 ***
## TEAM_BATTING_BB     1  505824  505824 120.6777 < 2.2e-16 ***
## TEAM_BATTING_SO     1   71413   71413  17.0375 3.814e-05 ***
## TEAM_BASERUN_SB     1  289337  289337  69.0292 < 2.2e-16 ***
## TEAM_BASERUN_CS     1    8182    8182   1.9521    0.1625    
## TEAM_FIELDING_E     1 1895413 1895413 452.2014 < 2.2e-16 ***
## TEAM_FIELDING_DP    1  318290  318290  75.9366 < 2.2e-16 ***
## Singles             1  192910  192910  46.0238 1.529e-11 ***
## Residuals        2024 8483646    4192                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
resids <- (abs(fit_4$residuals)**(5/7))
my_rse <- sqrt(sum(resids^2)/2024)
my_rse
## [1] 18.12186
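
On the question of whether the RSE was calculated correctly: the residuals of `fit_4` live on the transformed scale (wins^1.34), so raising them to a power does not return them to units of wins. One alternative, offered here only as a sketch on simulated data with hypothetical names, is to back-transform the fitted values with the inverse power 1/1.34 and take residuals against the original response.

```r
# Hypothetical toy data; `wins` plays the role of the response y.
set.seed(3)
n    <- 500
x    <- runif(n, 0, 10)
wins <- 50 + 3 * x + rnorm(n, sd = 5)
lam  <- 1.34                       # power taken from the report

# Fit on the transformed response.
fit_t <- lm(I(wins^lam) ~ x)

# Back-transform fitted values to the original scale, then form residuals.
pred_wins <- fitted(fit_t)^(1 / lam)
res_wins  <- wins - pred_wins

# Residual standard error in wins, using the model's residual df.
rse_wins <- sqrt(sum(res_wins^2) / df.residual(fit_t))

# For contrast: the RSE on the transformed scale is not in units of wins.
rse_transformed <- summary(fit_t)$sigma
```

The test RMSE of 10.73 reported later for model 4 suggests its error in wins is closer to the other models than the 18.1 above implies.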

Model 5 Generalized additive model

  • Use the gam function to build a generalized additive model
  • Adjusted R^2 is .411
    • Best R^2 performance yet
    • 10 predictor variables
    • RSE is 10.565
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-22. For overview type 'help("mgcv-package")'.
## 
## Attaching package: 'mgcv'
## The following object is masked from 'package:pracma':
## 
##     magic
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(TEAM_BASERUN_CS) + s(TEAM_BASERUN_SB) + s(TEAM_BATTING_2B) + 
##     s(TEAM_BATTING_3B) + s(TEAM_BATTING_BB) + s(TEAM_BATTING_HR) + 
##     s(TEAM_FIELDING_DP) + s(TEAM_FIELDING_E) + s(slugging) + 
##     s(OBP)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.4784     0.2363   340.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df      F  p-value    
## s(TEAM_BASERUN_CS)  3.5470 4.4214  3.472 0.006372 ** 
## s(TEAM_BASERUN_SB)  3.7776 4.7323 36.190  < 2e-16 ***
## s(TEAM_BATTING_2B)  5.5072 6.6699 13.945  < 2e-16 ***
## s(TEAM_BATTING_3B)  5.7467 6.8703  5.092 1.73e-05 ***
## s(TEAM_BATTING_BB)  3.0675 3.9436  3.334 0.010085 *  
## s(TEAM_BATTING_HR)  0.7544 0.7544 11.195 0.003699 ** 
## s(TEAM_FIELDING_DP) 2.5068 3.2147 27.101  < 2e-16 ***
## s(TEAM_FIELDING_E)  6.3478 7.4980 58.282  < 2e-16 ***
## s(slugging)         6.2073 7.3527  3.737 0.000418 ***
## s(OBP)              5.6877 6.8884  2.911 0.005187 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 90/91
## R-sq.(adj) =  0.411   Deviance explained = 42.3%
## GCV = 116.07  Scale est. = 113.55    n = 2034
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(TEAM_BASERUN_CS) + s(TEAM_BASERUN_SB) + s(TEAM_BATTING_2B) + 
##     s(TEAM_BATTING_3B) + s(TEAM_BATTING_BB) + s(TEAM_BATTING_HR) + 
##     s(TEAM_FIELDING_DP) + s(TEAM_FIELDING_E) + s(slugging) + 
##     s(OBP)
## 
## Approximate significance of smooth terms:
##                        edf Ref.df      F  p-value
## s(TEAM_BASERUN_CS)  3.5470 4.4214  3.472 0.006372
## s(TEAM_BASERUN_SB)  3.7776 4.7323 36.190  < 2e-16
## s(TEAM_BATTING_2B)  5.5072 6.6699 13.945  < 2e-16
## s(TEAM_BATTING_3B)  5.7467 6.8703  5.092 1.73e-05
## s(TEAM_BATTING_BB)  3.0675 3.9436  3.334 0.010085
## s(TEAM_BATTING_HR)  0.7544 0.7544 11.195 0.003699
## s(TEAM_FIELDING_DP) 2.5068 3.2147 27.101  < 2e-16
## s(TEAM_FIELDING_E)  6.3478 7.4980 58.282  < 2e-16
## s(slugging)         6.2073 7.3527  3.737 0.000418
## s(OBP)              5.6877 6.8884  2.911 0.005187
RSE: 10.5656
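
A generalized additive model replaces each linear term with a smooth function, which is how it can pick up curvature that the linear diagnostics hinted at. The sketch below uses `mgcv::gam` (the package loaded above) on simulated data; the variables and the nonlinear signal are made up for illustration and say nothing about the real team statistics.

```r
library(mgcv)

# Hypothetical toy data with one nonlinear and one linear effect.
set.seed(11)
n   <- 500
toy <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
toy$y <- 80 + 8 * sin(toy$x1) + 1.5 * toy$x2 + rnorm(n, sd = 5)

# Linear baseline versus a GAM with a smooth for each predictor.
fit_lm  <- lm(y ~ x1 + x2, data = toy)
fit_gam <- gam(y ~ s(x1) + s(x2), data = toy)

adj_r2_lm  <- summary(fit_lm)$adj.r.squared
adj_r2_gam <- summary(fit_gam)$r.sq      # adjusted R^2 reported by mgcv

# Residual standard error from the estimated scale (residual variance).
rse_gam <- sqrt(fit_gam$sig2)
```

In the summary table, an effective degrees of freedom (edf) near 1 marks a term that is close to linear, while larger values indicate curvature.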

Model 6

  • Drop every observation containing an NA value, instead of imputing with the median
  • 9 predictors
  • Adjusted R^2 = .43
  • F statistic = 123.1

Model 6 Diagnostic tests

  • Test for constant variance: could be a problem
    • p = 0.008997
    • Given this p-value, our model may have some issues with constant variance
    • It still looks much better than the original fit 1
  • Normality tests: passes
    • Anderson-Darling: p = 0.701
    • Shapiro-Wilk: p = 0.849
  • Independence of residuals
    • Durbin-Watson test: recorded as 2.2 e^10, presumably the p < 2.2e-16 seen for model 2
    • A polynomial fit is most likely needed
  • Outliers
    • No outliers detected above 3.297

Model 6 summary

  • Model performs the best out of all linear models run so far
  • It passes the normality tests confidently, whereas models 1-4 didn’t pass confidently
  • Adjusted R^2 is the highest value yet
  • No significant outliers present in the analysis

## 
## Call:
## lm(formula = y_na ~ ., data = fit_6_db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.9356  -6.5508  -0.0419   6.6782  30.0121 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.320410   6.649090   9.222  < 2e-16 ***
## TEAM_BATTING_2B  -0.036499   0.006967  -5.239 1.85e-07 ***
## TEAM_BATTING_3B   0.205953   0.021152   9.737  < 2e-16 ***
## TEAM_BATTING_HR   0.129672   0.008349  15.531  < 2e-16 ***
## TEAM_BATTING_BB   0.037615   0.003402  11.055  < 2e-16 ***
## TEAM_BATTING_SO  -0.021573   0.002431  -8.873  < 2e-16 ***
## TEAM_BASERUN_SB   0.050535   0.006336   7.976 3.04e-15 ***
## TEAM_FIELDING_E  -0.152176   0.009680 -15.721  < 2e-16 ***
## TEAM_FIELDING_DP -0.116077   0.013213  -8.785  < 2e-16 ***
## Singles           0.034687   0.004714   7.358 3.09e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.549 on 1458 degrees of freedom
## Multiple R-squared:  0.4317, Adjusted R-squared:  0.4282 
## F-statistic: 123.1 on 9 and 1458 DF,  p-value: < 2.2e-16
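
The difference between the two NA strategies can be sketched in a few lines of base R. The data frame below is a made-up stand-in (column names borrowed from the report, values invented). It contrasts median imputation, which keeps every row, with dropping incomplete rows, which is why model 6 is fit on fewer observations (1468 rather than 2034).

```r
# Hypothetical miniature of the team data with missing values.
toy <- data.frame(
  TEAM_BASERUN_SB = c(100, NA, 150, 80, NA, 120),
  TEAM_BASERUN_CS = c(50, 45, NA, 60, 55, 40),
  wins            = c(81, 75, 90, 70, 85, 88)
)

# Strategy A: impute each column's NAs with that column's median.
impute_median <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}
toy_imputed <- as.data.frame(lapply(toy, impute_median))

# Strategy B: keep only rows with no missing values.
toy_dropped <- toy[complete.cases(toy), ]

n_imputed <- nrow(toy_imputed)   # all 6 rows kept
n_dropped <- nrow(toy_dropped)   # only rows 1, 4 and 6 are complete
```

Because the two strategies leave different sets of rows, fit statistics from model 6 and from the imputed models are not computed on the same data.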

Comparing several models together

  • The GAM model(model5) as well as the box cox transformed model(model 4) couldn’t be fitted to the below table
  • The clear winner seems to be model 6
      Predictor        | Model 1A                   | Model 1B                   | Model 2 (Moneyball stats)  | Model 3 (stepAIC)          | Model 6 (dropped NA observations)
                       | B [CI] p                   | B [CI] p                   | B [CI] p                   | B [CI] p                   | B [CI] p
      (Intercept)      | 46.43 [35.06, 57.81] <.001 | 51.52 [40.20, 62.84] <.001 | 19.28 [12.19, 26.38] <.001 | 51.48 [40.02, 62.94] <.001 | 61.32 [48.28, 74.36] <.001
      TEAM_BATTING_3B  | 0.19 [0.16, 0.22] <.001    | 0.19 [0.15, 0.22] <.001    |                            | 0.20 [0.16, 0.23] <.001    | 0.21 [0.16, 0.25] <.001
      TEAM_BATTING_HR  | 0.12 [0.10, 0.13] <.001    | 0.11 [0.10, 0.13] <.001    |                            | 0.11 [0.10, 0.13] <.001    | 0.13 [0.11, 0.15] <.001
      TEAM_BATTING_BB  | 0.03 [0.02, 0.04] <.001    | 0.03 [0.02, 0.04] <.001    |                            | 0.03 [0.02, 0.04] <.001    | 0.04 [0.03, 0.04] <.001
      TEAM_BATTING_SO  | -0.02 [-0.02, -0.01] <.001 | -0.02 [-0.02, -0.01] <.001 |                            | -0.02 [-0.02, -0.01] <.001 | -0.02 [-0.03, -0.02] <.001
      TEAM_BASERUN_SB  | 0.08 [0.07, 0.09] <.001    | 0.09 [0.08, 0.10] <.001    | 0.07 [0.06, 0.08] <.001    | 0.09 [0.08, 0.10] <.001    | 0.05 [0.04, 0.06] <.001
      TEAM_FIELDING_E  | -0.08 [-0.09, -0.07] <.001 | -0.08 [-0.09, -0.08] <.001 | -0.05 [-0.06, -0.05] <.001 | -0.09 [-0.09, -0.08] <.001 | -0.15 [-0.17, -0.13] <.001
      TEAM_FIELDING_DP | -0.11 [-0.14, -0.09] <.001 | -0.11 [-0.13, -0.09] <.001 | -0.11 [-0.13, -0.08] <.001 | -0.11 [-0.13, -0.08] <.001 | -0.12 [-0.14, -0.09] <.001
      Singles          | 0.03 [0.02, 0.04] <.001    | 0.03 [0.02, 0.04] <.001    |                            | 0.03 [0.02, 0.04] <.001    | 0.03 [0.03, 0.04] <.001
      TEAM_BASERUN_CS  |                            | -0.08 [-0.11, -0.05] <.001 | -0.05 [-0.08, -0.02] <.001 | -0.08 [-0.11, -0.05] <.001 |
      slugging         |                            |                            | 0.00 [0.00, 0.01] .040     |                            |
      OBP              |                            |                            | 0.04 [0.03, 0.04] <.001    |                            |
      TEAM_BATTING_2B  |                            |                            |                            | -0.01 [-0.02, 0.00] .143   | -0.04 [-0.05, -0.02] <.001
      Observations     | 2034                       | 2029                       | 2034                       | 2034                       | 1468
      R2 / adj. R2     | .370 / .368                | .385 / .383                | .310 / .308                | .380 / .377                | .432 / .428

Create a train/test split and view RMSE on test predictions

  • Model 3 is excluded from the analysis here
  • Model 6 performs the best (RMSE 9.30), followed by model 1A (10.57); model 5 (10.85) trails models 1A and 4 (10.73)
## Warning in predict.lm(fit_1, test_df): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(fit_4, test_df): prediction from a rank-deficient fit
## may be misleading
##  [1] "TEAM_BATTING_2B"  "TEAM_BATTING_3B"  "TEAM_BATTING_HR" 
##  [4] "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
##  [7] "TEAM_BASERUN_CS"  "TEAM_FIELDING_E"  "TEAM_FIELDING_DP"
## [10] "Singles"          "slugging"         "OBP"
## Warning in predict.lm(fit_6, test_df): prediction from a rank-deficient fit
## may be misleading
Model      Test RMSE
Model 1A   10.5681339038901
Model 2    11.2939527952715
Model 4    10.7346911574377
Model 5    10.8498621720078
Model 6    9.30217646362863
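
The split-and-score procedure can be sketched as follows. This is an illustration on simulated data, not the report's code: the 80/20 split, the seed, and all object names are assumptions. It fits on the training rows, predicts the held-out rows, and reports the root mean squared error.

```r
# Hypothetical toy data.
set.seed(2024)
n   <- 1000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 80 + 6 * toy$x1 - 4 * toy$x2 + rnorm(n, sd = 10)

# 80/20 train/test split (the proportion is an assumption).
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_df  <- toy[train_idx, ]
test_df   <- toy[-train_idx, ]

# Fit on training rows only, then score on the held-out rows.
fit  <- lm(y ~ x1 + x2, data = train_df)
pred <- predict(fit, newdata = test_df)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
test_rmse <- rmse(test_df$y, pred)
```

The "rank-deficient fit" warnings above typically arise when some predictors are exact linear combinations of others. That may be the case for slugging and OBP here, since the model 3 trace drops both without changing the AIC.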

Model 7- Attempt Quadratic Fitting

  • I attempted to add some quadratic predictors to our model 6 dataset (the NA-dropped dataset)
  • After stepAIC and fitting up to the 5th degree, I did not see a large increase in adjusted R^2 overall, and almost no increase beyond the 3rd degree
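
Polynomial terms are written in R formulas with `I()`, as in the `I(TEAM_BATTING_2B^2)` terms of the trace that follows. The sketch below, on simulated data with made-up names, shows the pattern of comparing adjusted R^2 across degrees; it illustrates the approach rather than reproducing the report's code.

```r
# Hypothetical toy data with a genuinely quadratic effect.
set.seed(5)
n   <- 600
toy <- data.frame(x = runif(n, -3, 3))
toy$y <- 80 + 2 * toy$x - 1.5 * toy$x^2 + rnorm(n, sd = 4)

# Degree 1, 2 and 3 fits; I() protects ^ from formula interpretation.
fit_d1 <- lm(y ~ x, data = toy)
fit_d2 <- lm(y ~ x + I(x^2), data = toy)
fit_d3 <- lm(y ~ x + I(x^2) + I(x^3), data = toy)

adj_r2 <- c(
  summary(fit_d1)$adj.r.squared,
  summary(fit_d2)$adj.r.squared,
  summary(fit_d3)$adj.r.squared
)

# Gain in adjusted R^2 from each extra degree (degree 2, then degree 3).
gain <- diff(adj_r2)
```

When the gain flattens out, as the report describes beyond the 3rd degree, the extra terms add complexity without improving the fit. The stepwise trace below is the report's own output for model 7, not the output of this sketch.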
## [1] "TEAM_BATTING_2B"  "TEAM_BATTING_3B"  "TEAM_BATTING_HR" 
## [4] "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
## [7] "TEAM_FIELDING_E"  "TEAM_FIELDING_DP" "Singles"
## Start:  AIC=6618.11
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_FIELDING_E^2) + 
##     I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + 
##     I(TEAM_BASERUN_SB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_FIELDING_DP^3) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + 
##     I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_FIELDING_DP^3)  1       0.1 129130 6616.1
## - I(TEAM_FIELDING_DP^2)  1       5.5 129135 6616.2
## - I(TEAM_FIELDING_E^2)   1      12.4 129142 6616.3
## - I(TEAM_BASERUN_SB^2)   1      18.8 129148 6616.3
## - TEAM_FIELDING_DP       1      30.8 129160 6616.5
## - TEAM_BATTING_3B        1      42.6 129172 6616.6
## - I(TEAM_BATTING_BB^3)   1      75.7 129205 6617.0
## - TEAM_BATTING_BB        1      77.9 129207 6617.0
## - I(TEAM_BATTING_3B^3)   1      78.5 129208 6617.0
## - I(TEAM_BATTING_BB^2)   1      93.6 129223 6617.2
## - I(TEAM_BATTING_SO^2)   1      98.3 129228 6617.2
## - TEAM_BATTING_SO        1     130.2 129260 6617.6
## - I(TEAM_BATTING_HR^2)   1     133.4 129263 6617.6
## - I(TEAM_BATTING_SO^3)   1     152.6 129282 6617.8
## <none>                               129129 6618.1
## - I(TEAM_BATTING_3B^2)   1     178.5 129308 6618.1
## - TEAM_BASERUN_SB        1     270.9 129400 6619.2
## - TEAM_BATTING_2B        1     527.9 129657 6622.1
## - I(TEAM_BATTING_2B^2)   1     654.3 129784 6623.5
## - I(TEAM_BATTING_2B^3)   1     746.9 129876 6624.6
## - TEAM_FIELDING_E        1    1770.2 130900 6636.1
## - TEAM_BATTING_HR        1    2546.7 131676 6644.8
## - Singles                1    4427.8 133557 6665.6
## 
## Step:  AIC=6616.11
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_FIELDING_E^2) + 
##     I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + 
##     I(TEAM_BASERUN_SB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + I(TEAM_BATTING_SO^3) + 
##     I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_FIELDING_E^2)   1      12.6 129142 6614.3
## - I(TEAM_BASERUN_SB^2)   1      18.8 129148 6614.3
## - TEAM_BATTING_3B        1      42.9 129172 6614.6
## - I(TEAM_BATTING_BB^3)   1      76.0 129206 6615.0
## - TEAM_BATTING_BB        1      78.0 129208 6615.0
## - I(TEAM_BATTING_3B^3)   1      78.7 129208 6615.0
## - I(TEAM_BATTING_BB^2)   1      93.9 129223 6615.2
## - I(TEAM_BATTING_SO^2)   1      98.4 129228 6615.2
## - TEAM_BATTING_SO        1     130.6 129260 6615.6
## - I(TEAM_BATTING_HR^2)   1     134.0 129264 6615.6
## - I(TEAM_BATTING_SO^3)   1     152.7 129282 6615.8
## <none>                               129130 6616.1
## - I(TEAM_BATTING_3B^2)   1     179.1 129309 6616.1
## - TEAM_BASERUN_SB        1     271.1 129401 6617.2
## - TEAM_BATTING_2B        1     527.8 129657 6620.1
## - I(TEAM_BATTING_2B^2)   1     654.3 129784 6621.5
## - I(TEAM_BATTING_2B^3)   1     746.9 129876 6622.6
## - I(TEAM_FIELDING_DP^2)  1     747.1 129877 6622.6
## - TEAM_FIELDING_DP       1    1225.2 130355 6628.0
## - TEAM_FIELDING_E        1    1774.6 130904 6634.1
## - TEAM_BATTING_HR        1    2570.7 131700 6643.0
## - Singles                1    4430.0 133560 6663.6
## 
## Step:  AIC=6614.25
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BASERUN_SB^2) + 
##     I(TEAM_BATTING_SO^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + 
##     I(TEAM_BATTING_3B^3) + I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BASERUN_SB^2)   1      18.9 129161 6612.5
## - TEAM_BATTING_3B        1      48.0 129190 6612.8
## - I(TEAM_BATTING_BB^3)   1      80.2 129222 6613.2
## - TEAM_BATTING_BB        1      82.7 129225 6613.2
## - I(TEAM_BATTING_3B^3)   1      84.3 129226 6613.2
## - I(TEAM_BATTING_SO^2)   1      95.8 129238 6613.3
## - I(TEAM_BATTING_BB^2)   1      98.8 129241 6613.4
## - I(TEAM_BATTING_HR^2)   1     122.1 129264 6613.6
## - TEAM_BATTING_SO        1     126.2 129268 6613.7
## - I(TEAM_BATTING_SO^3)   1     150.7 129293 6614.0
## <none>                               129142 6614.3
## - I(TEAM_BATTING_3B^2)   1     189.4 129331 6614.4
## - TEAM_BASERUN_SB        1     272.3 129414 6615.3
## - TEAM_BATTING_2B        1     517.7 129660 6618.1
## - I(TEAM_BATTING_2B^2)   1     643.3 129785 6619.5
## - I(TEAM_BATTING_2B^3)   1     735.6 129878 6620.6
## - I(TEAM_FIELDING_DP^2)  1     757.1 129899 6620.8
## - TEAM_FIELDING_DP       1    1240.9 130383 6626.3
## - TEAM_BATTING_HR        1    2623.0 131765 6641.8
## - Singles                1    4435.9 133578 6661.8
## - TEAM_FIELDING_E        1   20166.9 149309 6825.3
## 
## Step:  AIC=6612.47
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_SO^2) + 
##     I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + 
##     I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - TEAM_BATTING_3B        1      45.3 129206 6611.0
## - I(TEAM_BATTING_BB^3)   1      78.4 129239 6611.4
## - TEAM_BATTING_BB        1      80.8 129242 6611.4
## - I(TEAM_BATTING_3B^3)   1      83.4 129244 6611.4
## - I(TEAM_BATTING_SO^2)   1      88.7 129250 6611.5
## - I(TEAM_BATTING_BB^2)   1      96.8 129258 6611.6
## - TEAM_BATTING_SO        1     118.2 129279 6611.8
## - I(TEAM_BATTING_HR^2)   1     120.2 129281 6611.8
## - I(TEAM_BATTING_SO^3)   1     142.4 129303 6612.1
## <none>                               129161 6612.5
## - I(TEAM_BATTING_3B^2)   1     186.6 129348 6612.6
## - TEAM_BATTING_2B        1     521.5 129683 6616.4
## - I(TEAM_BATTING_2B^2)   1     650.1 129811 6617.8
## - I(TEAM_BATTING_2B^3)   1     745.2 129906 6618.9
## - I(TEAM_FIELDING_DP^2)  1     764.3 129925 6619.1
## - TEAM_FIELDING_DP       1    1248.6 130410 6624.6
## - TEAM_BATTING_HR        1    2616.3 131777 6639.9
## - Singles                1    4420.7 133582 6659.9
## - TEAM_BASERUN_SB        1    5191.0 134352 6668.3
## - TEAM_FIELDING_E        1   20163.8 149325 6823.4
## 
## Step:  AIC=6610.98
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_SO^2) + 
##     I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + 
##     I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BATTING_3B^3)   1      68.7 129275 6609.8
## - I(TEAM_BATTING_BB^3)   1      80.3 129287 6609.9
## - TEAM_BATTING_BB        1      82.0 129288 6609.9
## - I(TEAM_BATTING_SO^2)   1      86.2 129293 6610.0
## - I(TEAM_BATTING_BB^2)   1      98.6 129305 6610.1
## - I(TEAM_BATTING_HR^2)   1     107.0 129313 6610.2
## - TEAM_BATTING_SO        1     117.3 129324 6610.3
## - I(TEAM_BATTING_SO^3)   1     137.6 129344 6610.5
## <none>                               129206 6611.0
## - TEAM_BATTING_2B        1     495.9 129702 6614.6
## - I(TEAM_BATTING_2B^2)   1     622.0 129828 6616.0
## - I(TEAM_BATTING_2B^3)   1     715.9 129922 6617.1
## - I(TEAM_FIELDING_DP^2)  1     772.1 129978 6617.7
## - TEAM_FIELDING_DP       1    1260.1 130466 6623.2
## - I(TEAM_BATTING_3B^2)   1    1274.2 130481 6623.4
## - TEAM_BATTING_HR        1    2576.1 131783 6638.0
## - Singles                1    4429.0 133635 6658.5
## - TEAM_BASERUN_SB        1    5169.8 134376 6666.6
## - TEAM_FIELDING_E        1   20266.4 149473 6822.9
## 
## Step:  AIC=6609.77
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_SO^2) + 
##     I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3) + 
##     I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BATTING_SO^2)   1      60.4 129335 6608.5
## - I(TEAM_BATTING_BB^3)   1      71.9 129347 6608.6
## - TEAM_BATTING_BB        1      73.1 129348 6608.6
## - TEAM_BATTING_SO        1      86.0 129361 6608.7
## - I(TEAM_BATTING_BB^2)   1      89.2 129364 6608.8
## - I(TEAM_BATTING_SO^3)   1     107.8 129383 6609.0
## - I(TEAM_BATTING_HR^2)   1     109.1 129384 6609.0
## <none>                               129275 6609.8
## - TEAM_BATTING_2B        1     487.0 129762 6613.3
## - I(TEAM_BATTING_2B^2)   1     609.8 129885 6614.7
## - I(TEAM_BATTING_2B^3)   1     701.4 129976 6615.7
## - I(TEAM_FIELDING_DP^2)  1     803.7 130079 6616.9
## - TEAM_FIELDING_DP       1    1298.0 130573 6622.4
## - TEAM_BATTING_HR        1    2551.5 131827 6636.5
## - Singles                1    4466.1 133741 6657.6
## - TEAM_BASERUN_SB        1    5109.7 134385 6664.7
## - I(TEAM_BATTING_3B^2)   1    9093.5 138369 6707.6
## - TEAM_FIELDING_E        1   20311.1 149586 6822.0
## 
## Step:  AIC=6608.45
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BATTING_BB^3)   1      72.9 129408 6607.3
## - TEAM_BATTING_BB        1      74.5 129410 6607.3
## - I(TEAM_BATTING_BB^2)   1      90.4 129426 6607.5
## - I(TEAM_BATTING_HR^2)   1      92.8 129428 6607.5
## - TEAM_BATTING_SO        1      98.3 129434 6607.6
## <none>                               129335 6608.5
## - I(TEAM_BATTING_SO^3)   1     436.6 129772 6611.4
## - TEAM_BATTING_2B        1     476.9 129812 6611.9
## - I(TEAM_BATTING_2B^2)   1     600.3 129936 6613.2
## - I(TEAM_BATTING_2B^3)   1     693.6 130029 6614.3
## - I(TEAM_FIELDING_DP^2)  1     780.4 130116 6615.3
## - TEAM_FIELDING_DP       1    1272.2 130608 6620.8
## - TEAM_BATTING_HR        1    2494.5 131830 6634.5
## - Singles                1    4640.8 133976 6658.2
## - TEAM_BASERUN_SB        1    5153.1 134489 6663.8
## - I(TEAM_BATTING_3B^2)   1    9556.1 138892 6711.1
## - TEAM_FIELDING_E        1   20728.2 150064 6824.7
## 
## Step:  AIC=6607.28
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - TEAM_BATTING_BB        1       1.6 129410 6605.3
## - I(TEAM_BATTING_HR^2)   1      78.1 129486 6606.2
## - TEAM_BATTING_SO        1      91.3 129500 6606.3
## - I(TEAM_BATTING_BB^2)   1     158.3 129567 6607.1
## <none>                               129408 6607.3
## - I(TEAM_BATTING_SO^3)   1     445.8 129854 6610.3
## - TEAM_BATTING_2B        1     482.2 129891 6610.7
## - I(TEAM_BATTING_2B^2)   1     607.5 130016 6612.2
## - I(TEAM_BATTING_2B^3)   1     702.5 130111 6613.2
## - I(TEAM_FIELDING_DP^2)  1     752.7 130161 6613.8
## - TEAM_FIELDING_DP       1    1239.5 130648 6619.3
## - TEAM_BATTING_HR        1    2441.5 131850 6632.7
## - Singles                1    4717.7 134126 6657.8
## - TEAM_BASERUN_SB        1    5153.9 134562 6662.6
## - I(TEAM_BATTING_3B^2)   1    9645.0 139053 6710.8
## - TEAM_FIELDING_E        1   20658.3 150067 6822.7
## 
## Step:  AIC=6605.3
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles + 
##     I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + 
##     I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + 
##     I(TEAM_BATTING_SO^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BATTING_HR^2)   1      77.5 129488 6604.2
## - TEAM_BATTING_SO        1      91.7 129502 6604.3
## <none>                               129410 6605.3
## - I(TEAM_BATTING_SO^3)   1     444.8 129855 6608.3
## - TEAM_BATTING_2B        1     480.6 129891 6608.7
## - I(TEAM_BATTING_2B^2)   1     605.9 130016 6610.2
## - I(TEAM_BATTING_2B^3)   1     701.0 130111 6611.2
## - I(TEAM_FIELDING_DP^2)  1     763.2 130173 6611.9
## - TEAM_FIELDING_DP       1    1254.1 130664 6617.5
## - TEAM_BATTING_HR        1    2439.8 131850 6630.7
## - Singles                1    4735.5 134145 6656.1
## - TEAM_BASERUN_SB        1    5175.7 134586 6660.9
## - I(TEAM_BATTING_3B^2)   1    9702.9 139113 6709.4
## - I(TEAM_BATTING_BB^2)   1   11033.1 140443 6723.4
## - TEAM_FIELDING_E        1   20667.1 150077 6820.8
## 
## Step:  AIC=6604.18
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles + 
##     I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + 
##     I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - TEAM_BATTING_SO        1      48.8 129536 6602.7
## <none>                               129488 6604.2
## - TEAM_BATTING_2B        1     499.7 129987 6607.8
## - I(TEAM_BATTING_2B^2)   1     621.6 130109 6609.2
## - I(TEAM_BATTING_SO^3)   1     647.1 130135 6609.5
## - I(TEAM_BATTING_2B^3)   1     711.6 130199 6610.2
## - I(TEAM_FIELDING_DP^2)  1     740.3 130228 6610.5
## - TEAM_FIELDING_DP       1    1223.1 130711 6616.0
## - Singles                1    4844.5 134332 6656.1
## - TEAM_BASERUN_SB        1    5115.0 134603 6659.0
## - I(TEAM_BATTING_3B^2)   1    9697.3 139185 6708.2
## - I(TEAM_BATTING_BB^2)   1   11102.5 140590 6722.9
## - TEAM_BATTING_HR        1   19521.0 149009 6808.3
## - TEAM_FIELDING_E        1   21631.4 151119 6829.0
## 
## Step:  AIC=6602.73
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + 
##     I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3)
## 
##                         Df Sum of Sq    RSS    AIC
## <none>                               129536 6602.7
## - TEAM_BATTING_2B        1     494.8 130031 6606.3
## - I(TEAM_BATTING_2B^2)   1     613.5 130150 6607.7
## - I(TEAM_BATTING_2B^3)   1     701.9 130238 6608.7
## - I(TEAM_FIELDING_DP^2)  1     740.6 130277 6609.1
## - TEAM_FIELDING_DP       1    1221.8 130758 6614.5
## - TEAM_BASERUN_SB        1    5230.3 134767 6658.8
## - Singles                1    5385.6 134922 6660.5
## - I(TEAM_BATTING_SO^3)   1    7680.9 137217 6685.3
## - I(TEAM_BATTING_BB^2)   1   11652.2 141189 6727.2
## - I(TEAM_BATTING_3B^2)   1   12051.5 141588 6731.3
## - TEAM_BATTING_HR        1   20707.0 150243 6818.4
## - TEAM_FIELDING_E        1   22289.2 151826 6833.8
## 
## Call:
## lm(formula = y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + 
##     I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3), data = fit_6_db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.5933  -6.6402   0.0684   6.3734  27.2096 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.226e+01  3.658e+01   0.335  0.73768    
## TEAM_BATTING_2B        9.898e-01  4.198e-01   2.358  0.01853 *  
## TEAM_BATTING_HR        1.221e-01  8.003e-03  15.251  < 2e-16 ***
## TEAM_BASERUN_SB        4.770e-02  6.224e-03   7.665 3.26e-14 ***
## TEAM_FIELDING_E       -1.510e-01  9.542e-03 -15.823  < 2e-16 ***
## TEAM_FIELDING_DP      -4.888e-01  1.319e-01  -3.705  0.00022 ***
## Singles                3.541e-02  4.552e-03   7.778 1.39e-14 ***
## I(TEAM_FIELDING_DP^2)  1.217e-03  4.221e-04   2.884  0.00398 ** 
## I(TEAM_BATTING_2B^2)  -4.315e-03  1.644e-03  -2.625  0.00875 ** 
## I(TEAM_BATTING_3B^2)   2.067e-03  1.776e-04  11.635  < 2e-16 ***
## I(TEAM_BATTING_BB^2)   3.424e-05  2.993e-06  11.440  < 2e-16 ***
## I(TEAM_BATTING_2B^3)   5.928e-06  2.111e-06   2.808  0.00506 ** 
## I(TEAM_BATTING_SO^3)  -8.823e-09  9.499e-10  -9.288  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.435 on 1455 degrees of freedom
## Multiple R-squared:  0.4463, Adjusted R-squared:  0.4417 
## F-statistic: 97.74 on 12 and 1455 DF,  p-value: < 2.2e-16
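
The AIC values in the trace above can be reproduced by hand. This is a minimal sketch, assuming the trace comes from R's `step()` on an `lm` fit, where the reported AIC is `n * log(RSS / n) + 2 * edf`. The helper name `step_aic` is my own, and `n` is inferred from the summary (1455 residual degrees of freedom plus 13 coefficients), not read from the data.

```r
# AIC on the scale used by step()/extractAIC() for lm fits (helper name is hypothetical)
step_aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

# Values read off the final step and the model summary above
n_obs     <- 1455 + 13          # residual df + number of estimated coefficients
aic_final <- step_aic(rss = 129536, n = n_obs, edf = 13)
round(aic_final, 2)             # compare with "Step:  AIC=6602.73"

# Sanity check of the formula against extractAIC() on a built-in dataset
fit_check <- lm(mpg ~ wt + hp, data = mtcars)
aic_check <- step_aic(sum(resid(fit_check)^2), nrow(mtcars), length(coef(fit_check)))
all.equal(aic_check, extractAIC(fit_check)[2])
```

Because the RSS in the trace is rounded to a whole number, the hand-computed value only agrees with the printed AIC to about two decimal places.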

Model of choice

  • I have chosen model 6 to make the predictions on the evaluation data.
    • It outperformed all other linear models on every test statistic I compared.
    • It outperformed the GAM model on train/test split RMSE, and it appears to be the simplest model.
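
The train/test RMSE comparison mentioned above can be sketched as follows. This is an illustration only, assuming an 80/20 random split and simulated stand-in data. The column names `hr`, `err`, and `wins` and the two candidate formulas are hypothetical and are not the models fitted in this paper.

```r
set.seed(1)

# Simulated stand-in for team statistics (hypothetical columns)
n_rows <- 500
dat <- data.frame(hr = rnorm(n_rows, 100, 30), err = rnorm(n_rows, 150, 40))
dat$wins <- 80 + 0.12 * (dat$hr - 100) - 0.15 * (dat$err - 150) + rnorm(n_rows, sd = 9)

# 80/20 train/test split
train_idx <- sample(seq_len(n_rows), size = floor(0.8 * n_rows))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Root mean squared error, ignoring rows where no prediction could be made
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2, na.rm = TRUE))

# Two candidate models fitted on the training rows only
fit_simple <- lm(wins ~ hr + err, data = train)
fit_poly   <- lm(wins ~ hr + err + I(hr^2) + I(err^2), data = train)

# Compare out-of-sample error on the held-out rows
rmse_simple <- rmse(test$wins, predict(fit_simple, test))
rmse_poly   <- rmse(test$wins, predict(fit_poly, test))
c(simple = rmse_simple, poly = rmse_poly)
```

The model with the lower held-out RMSE is preferred. When the two values are close, the simpler model is usually the safer choice.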
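# Response: team wins from the training data (rows with NAs retained)
y_na <- money_ball_train_2_withna$TARGET_WINS
# Predictors for model 6, selected by column position:
# doubles, triples, home runs, walks, strikeouts, stolen bases, errors, double plays, singles
fit_6_db <- money_ball_train_2_withna[, c(4, 5, 6, 7, 8, 9, 16, 17, 18)]
fit_6 <- lm(y_na ~ ., data = fit_6_db)
# Show the four standard diagnostic plots in a 2 x 2 grid
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(fit_6)

summary(fit_6)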
## 
## Call:
## lm(formula = y_na ~ ., data = fit_6_db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.9356  -6.5508  -0.0419   6.6782  30.0121 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.320410   6.649090   9.222  < 2e-16 ***
## TEAM_BATTING_2B  -0.036499   0.006967  -5.239 1.85e-07 ***
## TEAM_BATTING_3B   0.205953   0.021152   9.737  < 2e-16 ***
## TEAM_BATTING_HR   0.129672   0.008349  15.531  < 2e-16 ***
## TEAM_BATTING_BB   0.037615   0.003402  11.055  < 2e-16 ***
## TEAM_BATTING_SO  -0.021573   0.002431  -8.873  < 2e-16 ***
## TEAM_BASERUN_SB   0.050535   0.006336   7.976 3.04e-15 ***
## TEAM_FIELDING_E  -0.152176   0.009680 -15.721  < 2e-16 ***
## TEAM_FIELDING_DP -0.116077   0.013213  -8.785  < 2e-16 ***
## Singles           0.034687   0.004714   7.358 3.09e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.549 on 1458 degrees of freedom
## Multiple R-squared:  0.4317, Adjusted R-squared:  0.4282 
## F-statistic: 123.1 on 9 and 1458 DF,  p-value: < 2.2e-16
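# Use the same predictor columns for the training and evaluation sets
my_names <- colnames(fit_6_db)
train_df <- fit_6_db[, my_names]
test_df  <- money_eval_na[, my_names]

# Refit model 6 on the training predictors and predict wins for the evaluation set.
# Evaluation rows with a missing predictor get a prediction of NA.
My_prediction <- lm(y_na ~ ., data = train_df)
fit_6_predict <- predict(My_prediction, test_df)
kable(fit_6_predict)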
x
61.92662
68.76625
73.50705
82.07337
NA
56.05863
68.46613
69.58884
61.49032
83.05783
86.48455
83.95449
89.68919
77.04353
70.81106
76.53099
NA
81.54091
83.35541
82.27576
82.48616
70.57153
78.97134
85.16802
62.85810
77.37633
60.94855
89.83601
87.97425
85.99453
84.43743
80.46297
85.23486
76.45860
91.37220
82.66758
89.15680
80.11327
94.71830
NA
NA
NA
63.54624
57.46252
72.97738
71.23652
81.35044
73.02683
76.18881
67.52168
79.88642
75.41477
68.30099
NA
NA
86.36666
84.77333
87.73882
84.60453
84.63909
NA
45.69831
54.88904
NA
83.36602
88.77719
73.07122
81.97568
95.37573
70.48764
78.28982
89.16533
81.63855
NA
NA
77.04870
80.69682
69.98533
88.53044
80.19605
87.02073
87.22192
96.65479
88.90896
NA
52.29533
NA
NA
NA
88.55088
85.40561
86.06973
75.51747
70.47183
83.36415
89.45490
73.82390
NA
71.37230
88.87607
NA
87.76417
85.14594
94.35076
89.81292
78.36933
74.65148
84.12522
79.97949
67.60061
NA
NA
NA
NA
NA
63.91358
80.43352
83.83552
70.44251
88.06252
86.38444
79.09888
80.51968
71.21323
82.19965
93.76708
76.91460
75.89617
91.83577
80.48406
42.12529
NA
89.52964
64.65567
76.42108
74.66181
72.95899
80.68064
79.38578
83.17903
82.88350
79.46469
61.97677
80.18481
68.20260
89.11799
NA
63.29854
NA
103.43310
112.91436
98.60398
106.41409
101.46655
100.19181
88.90789
86.64024
71.29591
79.02157
NA
87.52662
79.49987
90.60211
73.87771
79.03048
84.91793
68.05868
75.84968
81.83688
93.67837
87.65334
88.02007
87.27270
NA
NA
NA
NA
NA
61.64871
68.20794
63.71984
57.09436
68.74322
95.96053
84.86653
86.29287
73.09883
81.95712
77.52093
91.91348
81.09405
87.96287
75.82657
76.27937
NA
NA
NA
82.10810
62.01272
68.55481
83.92376
79.48529
91.75034
75.61021
81.23918
71.99662
66.53253
75.63967
70.15693
80.57726
78.58580
76.14602
84.43725
NA
NA
93.54387
86.51571
89.00744
78.07078
75.66783
75.92058
81.84754
NA
63.15399
84.66570
91.44331
83.23063
78.66853
58.38997
83.12404
77.47833
82.07125
74.67415
83.06915
78.26561
NA
71.58667
80.62028
80.80553
81.20826
67.56517
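
The NA entries in the predictions above arise because `predict()` on an `lm` fit returns NA for any row with a missing predictor. A minimal sketch of one possible way to count and fill them, assuming median imputation from the training data is acceptable. The toy data frames and column names below are hypothetical, and this step is not part of the model reported above.

```r
# Toy training and evaluation frames (hypothetical names and values)
train_toy <- data.frame(x1 = c(1, 2, 3, 4, 5, 6), x2 = c(2, 1, 4, 3, 6, 5))
train_toy$y <- c(3.1, 2.9, 7.2, 6.8, 11.1, 10.9)
eval_toy <- data.frame(x1 = c(2, NA, 5), x2 = c(3, 4, NA))

fit_toy <- lm(y ~ x1 + x2, data = train_toy)

# Rows with a missing predictor come back as NA
pred_raw <- predict(fit_toy, eval_toy)
n_missing <- sum(is.na(pred_raw))

# Fill each missing predictor with the training median of that column
eval_filled <- eval_toy
for (col in names(eval_filled)) {
  miss <- is.na(eval_filled[[col]])
  eval_filled[[col]][miss] <- median(train_toy[[col]], na.rm = TRUE)
}

# Every evaluation row now receives a numeric prediction
pred_filled <- predict(fit_toy, eval_filled)
```

Median imputation keeps every evaluation row but pulls the filled rows toward an average team, so those predictions should be read with more caution than the others.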
