DATA LOAD

Read in the Moneyball training data

df <- read.csv('/Users/dianaplunkett/CUNY/621 Business Analytics and Data Mining/Homework 1/moneyball-training-data.csv')

DATA EXPLORATION (& a little prep)

Confirming exact number of rows and columns

dim(df)
## [1] 2276   17

Getting the full summary, will refer back to this as I go

summary(df)
##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

Small thing to note here: the max of index is 2535, which is > than the # of rows, so the index is not purely sequential (some values are skipped).

Looking for NAs.

Since some exploration and prep are some what iterative, we will both explore and address the NAs here, and then do some more exploration.

Though they are in the Summary above, I find it useful to look at how many values are not NA, and then only show those columns where that amount is < 2276 (the number of rows).

df_na <- colSums(! is.na(df)) 
(da_na <- df_na[df_na<2276])
##  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP 
##             2174             2145             1504              191 
## TEAM_PITCHING_SO TEAM_FIELDING_DP 
##             2174             1990

Diving into TEAM_BATTING_HBP (Batters hit by pitch (get a free base)) since that has the most NAs.

(191/2276)*100
## [1] 8.391916

8.4% of the rows have a value in this field.

Researching how common it is for a batter to get hit, 32,001 rows of pitchers by season of between 1876 and 2023 had at least 1 HB (Hit Batter) out of a total of 47,206 (~68%). Of the remaining rows, some where 0 and some where null. The range of years here is not exactly the same, but overlaps significantly. (Source: https://www.mlb.com/stats/pitching/hit-batsmen/all-time-by-season)

This rule dates from 1887. (Source: https://en.wikipedia.org/wiki/Hit_by_pitch) We might expect that there were no stats kept prior to that, each row of this data is meant to be the stats for a given year from 1871-2006.

Given the 2 facts above, we would expect to have significantly more than 8.4% of the rows populated. There is no way I can think of to input this data. Therefore, we will exclude it from our analysis. See section below with columns to exclude.

Next, we will look at TEAM_BASERUN_CS, (Caught Stealing).

(1504/2276)*100 
## [1] 66.08084

It has a non NA value in about 2/3 of the rows (eg missing 1/3 rows)

Not sure if the moneyball data is real, max all time by season per MLB site is 42, and that is lower than the max in this data of 201, but also lower that the median 49 and mean 52. For simplicity, I am going to exclude this field too.

Columns to Exclude

exclude_col = c("TEAM_BATTING_HBP", "TEAM_BASERUN_CS")
df<-df[, !(names(df) %in% exclude_col)]
dim(df)
## [1] 2276   15

TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_PITCHING_SO, TEAM_FIELDING_DP These fields are mostly populated (95.5%, 94.2%. 95.5%, 87.4%, respectively). To address the NAs, we will replace the NAs with the Mean.

Fill NAs with Mean

#for fields TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_PITCHING_SO, TEAM_FIELDING_DP
df$TEAM_BATTING_SO[is.na(df$TEAM_BATTING_SO)]<-mean(df$TEAM_BATTING_SO,na.rm=TRUE)
df$TEAM_BASERUN_SB[is.na(df$TEAM_BASERUN_SB)]<-mean(df$TEAM_BASERUN_SB,na.rm=TRUE)
df$TEAM_PITCHING_SO[is.na(df$TEAM_PITCHING_SO)]<-mean(df$TEAM_PITCHING_SO,na.rm=TRUE)
df$TEAM_FIELDING_DP[is.na(df$TEAM_FIELDING_DP)]<-mean(df$TEAM_FIELDING_DP,na.rm=TRUE)

What columns are important?

plot_scatterplot(df[,-1], by ="TARGET_WINS")

Nothing super strong here, but TEAM_BASERUN_SB, TEAM_BATTING_2B, TEAM_BATTING_H look like they have positive correlations. And TEAM_FIELDING_E looks like it has a negative correlation. Oddly, TEAM_PITCHING_HR, which I would expect to have negative correlation, seems to have a positive on. Also, want to take a closer look at TEAM_BATTING_HR.

Let’s plot these 6 with a linear trend to look further

p1<- ggplot(df, aes(x=TEAM_BASERUN_SB, y=TARGET_WINS))+
    geom_point()+
    geom_smooth(method=lm , color="red", se=FALSE)

p2<- ggplot(df, aes(x=TEAM_BATTING_2B, y=TARGET_WINS))+
    geom_point()+
    geom_smooth(method=lm , color="red", se=FALSE)

p3<- ggplot(df, aes(x=TEAM_BATTING_H, y=TARGET_WINS))+
    geom_point()+
    geom_smooth(method=lm , color="red", se=FALSE)

p4<- ggplot(df, aes(x=TEAM_FIELDING_E, y=TARGET_WINS))+
    geom_point()+
    geom_smooth(method=lm , color="red", se=FALSE)

p5<- ggplot(df, aes(x=TEAM_PITCHING_HR, y=TARGET_WINS))+
    geom_point()+
    geom_smooth(method=lm , color="red", se=FALSE)

p6<- ggplot(df, aes(x=TEAM_BATTING_HR, y=TARGET_WINS))+
    geom_point()+
    geom_smooth(method=lm , color="red", se=FALSE)
#from library(gridExtra)
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Observations: TEAM_BATTING_H seems to have the strongest corr, followed by TEAM_FIELDING_E (a neg corr) and then TEAM_BATTING_2B.

Distributions

#df[,-1] so we do not include the index
gather_df <- df[,-1] %>% 
  gather(key = 'variable', value = 'value')

ggplot(gather_df) +
    geom_histogram(aes(x=value, y = ..density..), bins=30) +
    geom_density(aes(x=value), color='red') +
    facet_wrap(. ~variable, scales='free', ncol=5) + 
    theme(strip.text.x = element_text(size=5))
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Observations:
1. Many of the fields are skewed.
2. There are also some bimodal fields (with 2 peaks).

Typically in either of these situations (skewed or bimodal), we would look for a way to transform. See Side Note below for experiment with taking a log of the one of the skewed fields. It did not have an impact on the correlation, so we did not pursue this. For the bimodal, we do not have any data that might help us understand what might cause an observation to be in the low or high peak. Therefore we will keep this in mind as we progress, but not take any action on the data.

Side Note: Will log of skewed data help?

Took one of the skewed fields to see if taking the log would a. help the skewness b. incease the correlation

Histo of log(TEAM_FIELDING_E)

ggplot(df) +
    geom_histogram(aes(x=log(TEAM_FIELDING_E)), bins=30)

Testing to see the impact on skewness

#type 2 uses moment based formula
skewness(df$TEAM_FIELDING_E, type=2)
## [1] 2.994411
#type 2 uses moment based formula
skewness(log(df$TEAM_FIELDING_E), type=2)
## [1] 1.25156

The skewness is halfed, but still skewed.

Testing to see if using the less skewed log has greater correlation.

df$log_FIELDING_E <- log(df$TEAM_FIELDING_E)
ggpairs(df[,c('TARGET_WINS','log_FIELDING_E', 'TEAM_FIELDING_E' )] )

corrplot(cor(df[,c('TARGET_WINS','log_FIELDING_E', 'TEAM_FIELDING_E' )]), type = 'lower', tl.col = 'black',
         cl.ratio = 0.2, tl.srt = 45,tl.cex = 0.5)

It does not, so taking it out. Moving on.

df <- df[,-18]

Trying out Correlation Funnel

In looking at this, it seems that it depends on having a Target that is binary. I could find no explanation for what to do in the case where the Target is numeric. Therefore, will create a field called More_W, that is True when the row has a higher number of wins than the mean. And then see how the Correlation Funnel works on that.

#create the MORE_Ws field 
df$MORE_Ws <- df$TARGET_WINS > mean(df$TARGET_WINS)

#build the binary table for all the data except INDEX (the ID) and TARGET_WINS, now replaced as the target by MORE_Ws 
df_bi_tbl <- df %>% select (-c("INDEX","TARGET_WINS") ) %>% binarize(n_bins = 4, thresh_infreq = 0.01) 

#build corr table from the binary table
df_corr_tbl <- df_bi_tbl %>% correlate(target = MORE_Ws__1) 
#plot the result
df_corr_tbl %>%  plot_correlation_funnel(interactive = FALSE, limits = c(-0.2, 0.2))
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 4 rows containing missing values (`geom_text_repel()`).

Observations:
- TEAM_BATTING_BB (walks by batters) has the strongest positive corr when it is 580 or higher. It also has somewhat of a negative corr when it is 451 or lower.
- TEAM_PITCHING_H (Hits allowed) has the strongest negative corr when it is 1419 or lower. And a more positive corr when it is 1682.5 or higher. This is counter intuitive, as a pitcher allowing more hits would seem like it should have a neg corr with wins.
- TEAM_BATTING_2B (Doubles by batters) has a positive corr when it is 273 or higher. It also has somewhat of a negative corr when it is 208 or lower.
- TEAM_PITCHING_BB (walks allowed) has a negative corr when it is 476 or lower. And a more positive corr when it is 611 or higher. As with the Hits Allowed, this is counter intuitive, as a pitcher allowing more walks would seem like it should have a neg corr with wins.
- TEAM_FIELDING_E (Errors) has a positive corr when it low, 127 or lower.

DATA PREPARATION

Before we make further changes to the data, let’s discuss how we intend to model.

Model 1: Use the most detailed level of fields. For example, TEAM_BATTING_H is the sum of TEAM_BATTING_2B/3B/HR + 1B (for which there is no field, but we could calc) - so here we will not use TEAM_BATTING_H, but will use TEAM_BATTING_2B/3B/HR + the calculated 1B.

Model 2: Use more summary fields. In addition to TEAM_BATTING_H, discussed above, also create the following: - the ratio of double plays to errors would be a summary of fielding stats. - the ratio of basehits to strikeouts for batters as a summary of batting - similarly, the ratio of strikeouts to hits could be a summary of pitching stats since, per my baseball SME, HR are included in H (Hits allowed).

Model 3: Since the three fields with the strongest corr (TEAM_BATTING_H, TEAM_FIELDING_E, TEAM_BATTING_2B.) also showed up in Correlation Funnel, let’s create binary fields around the highs and lows for those fields, plus the other 2 in the Corr Funnel Observations.

Create New Fields.

## For Model 1: 
## Calc Singles by batters by subtracting doubles, triples and hr from the total base hits.
df$TEAM_BATTING_1B <- df$TEAM_BATTING_H - df$TEAM_BATTING_2B - df$TEAM_BATTING_3B - df$TEAM_BATTING_HR

## For Model 2:
## Create a ratio of double plays to errors
df$FIELDING <- df$TEAM_FIELDING_DP/df$TEAM_FIELDING_E
## Create a ratio of strikes to hits
df$BATTING <- df$TEAM_BATTING_SO/df$TEAM_BATTING_H
## Create a ratio of strikeouts to hits
df$PITCHING <- df$TEAM_PITCHING_SO/df$TEAM_PITCHING_H

## For Model 3:
df$BAT_BB_G <- with(df, ifelse(TEAM_BATTING_BB>=580,"G", "NG")) #Good / Not
df$BAT_BB_B <- with(df, ifelse(TEAM_BATTING_BB<=451, "B", "NB")) #Bad / Not
df$PITCH_H_G <- with(df, ifelse(TEAM_PITCHING_H>=1682.5, "G", "NG")) 
df$PITCH_H_B <- with(df, ifelse(TEAM_PITCHING_H<=1419, "B", "NB")) 
df$BAT_2B_G <- with(df, ifelse(TEAM_BATTING_2B>=273, "G", "NG"))
df$BAT_2B_B <- with(df, ifelse(TEAM_BATTING_2B<=208, "B", "NB"))
df$PITCH_BB_G <- with(df, ifelse(TEAM_PITCHING_BB>=611, "G", "NG"))
df$PITCH_BB_B <- with(df, ifelse(TEAM_PITCHING_BB<=476, "B", "NB"))
df$FIELD_E_G <- with(df, ifelse(TEAM_FIELDING_E<=127, "G", "NG"))
#note, only the G, positive corr with fielding errors.

Create Dataframes for each Model

# All will have TARGET_WINS, and will NOT have INDEX (a dumb number, not related to wins).

# df1 will have all the "detail" columns, so remove all the summary ones and the binaries.  Also have to take out the MORE_Ws field we used in the Corr Funnel, and the log_FIELDING_E (from the side note)
df1<-df %>% select(-c(INDEX,TEAM_BATTING_H, FIELDING, BATTING, PITCHING, BAT_BB_G , BAT_BB_B , PITCH_H_G , PITCH_H_B, BAT_2B_G, BAT_2B_B, PITCH_BB_G, PITCH_BB_B, FIELD_E_G, MORE_Ws, log_FIELDING_E ))

# df2 will have only "summary" columns, that were removed above
df2<-df %>% select(c(TARGET_WINS,TEAM_BATTING_H, FIELDING, BATTING, PITCHING))

#df3 will have only the binaries
df3<-df %>% select(c(TARGET_WINS, BAT_BB_G , BAT_BB_B , PITCH_H_G , PITCH_H_B, BAT_2B_G, BAT_2B_B, PITCH_BB_G, PITCH_BB_B, FIELD_E_G))
lm1<-lm(TARGET_WINS ~ .,data=df1)

summary(lm1)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.906  -8.582   0.121   8.411  58.597 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.2467875  5.2929578   4.581 4.88e-06 ***
## TEAM_BATTING_2B   0.0278792  0.0073013   3.818 0.000138 ***
## TEAM_BATTING_3B   0.1090625  0.0158847   6.866 8.51e-12 ***
## TEAM_BATTING_HR   0.1032024  0.0273964   3.767 0.000169 ***
## TEAM_BATTING_BB   0.0105498  0.0058145   1.814 0.069749 .  
## TEAM_BATTING_SO  -0.0093113  0.0025500  -3.652 0.000267 ***
## TEAM_BASERUN_SB   0.0287308  0.0043400   6.620 4.47e-11 ***
## TEAM_PITCHING_H  -0.0007465  0.0003672  -2.033 0.042189 *  
## TEAM_PITCHING_HR  0.0141572  0.0243055   0.582 0.560309    
## TEAM_PITCHING_BB  0.0001870  0.0041429   0.045 0.964010    
## TEAM_PITCHING_SO  0.0028358  0.0009186   3.087 0.002046 ** 
## TEAM_FIELDING_E  -0.0207258  0.0024210  -8.561  < 2e-16 ***
## TEAM_FIELDING_DP -0.1211711  0.0130165  -9.309  < 2e-16 ***
## TEAM_BATTING_1B   0.0482145  0.0036862  13.080  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2262 degrees of freedom
## Multiple R-squared:  0.3188, Adjusted R-squared:  0.3149 
## F-statistic: 81.42 on 13 and 2262 DF,  p-value: < 2.2e-16
lm1v2<-lm(TARGET_WINS ~ . - TEAM_PITCHING_HR -TEAM_PITCHING_BB -   TEAM_FIELDING_E,  data=df1)

summary(lm1v2)
## 
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_PITCHING_HR - TEAM_PITCHING_BB - 
##     TEAM_FIELDING_E, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.985  -8.323   0.234   8.706  50.713 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      15.0123752  5.2034565   2.885  0.00395 ** 
## TEAM_BATTING_2B   0.0369817  0.0073343   5.042 4.96e-07 ***
## TEAM_BATTING_3B   0.0931079  0.0157565   5.909 3.96e-09 ***
## TEAM_BATTING_HR   0.1200267  0.0089252  13.448  < 2e-16 ***
## TEAM_BATTING_BB   0.0235079  0.0030441   7.722 1.70e-14 ***
## TEAM_BATTING_SO  -0.0072639  0.0024840  -2.924  0.00349 ** 
## TEAM_BASERUN_SB   0.0174572  0.0041447   4.212 2.63e-05 ***
## TEAM_PITCHING_H  -0.0021647  0.0002751  -7.868 5.54e-15 ***
## TEAM_PITCHING_SO  0.0032834  0.0006796   4.831 1.45e-06 ***
## TEAM_FIELDING_DP -0.1258597  0.0132005  -9.534  < 2e-16 ***
## TEAM_BATTING_1B   0.0472641  0.0037171  12.715  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.24 on 2265 degrees of freedom
## Multiple R-squared:  0.2967, Adjusted R-squared:  0.2936 
## F-statistic: 95.55 on 10 and 2265 DF,  p-value: < 2.2e-16
lm2<-lm(TARGET_WINS ~ .,data=df2)

summary(lm2)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.428  -9.122   0.424   9.635  47.402 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.535e+00  4.646e+00  -0.761    0.447    
## TEAM_BATTING_H  5.156e-02  2.695e-03  19.134  < 2e-16 ***
## FIELDING        5.513e+00  8.658e-01   6.368 2.32e-10 ***
## BATTING        -3.232e+02  5.937e+01  -5.444 5.76e-08 ***
## PITCHING        3.297e+02  5.907e+01   5.582 2.66e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.15 on 2271 degrees of freedom
## Multiple R-squared:  0.1943, Adjusted R-squared:  0.1929 
## F-statistic: 136.9 on 4 and 2271 DF,  p-value: < 2.2e-16
lm3<-lm(TARGET_WINS ~ .,data=df3)

summary(lm3)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.615  -8.946   0.320   9.721  63.346 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   81.1965     1.7276  46.998  < 2e-16 ***
## BAT_BB_GNG    -6.1933     1.1216  -5.522 3.74e-08 ***
## BAT_BB_BNB     2.6608     1.2605   2.111   0.0349 *  
## PITCH_H_GNG   -4.2791     0.8656  -4.943 8.24e-07 ***
## PITCH_H_BNB    3.8016     0.8088   4.700 2.75e-06 ***
## BAT_2B_GNG    -1.3012     0.8132  -1.600   0.1097    
## BAT_2B_BNB     4.7377     0.8028   5.902 4.14e-09 ***
## PITCH_BB_GNG   2.2520     1.1483   1.961   0.0500 *  
## PITCH_BB_BNB   0.9454     1.2031   0.786   0.4321    
## FIELD_E_GNG   -3.1409     0.7961  -3.946 8.20e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.82 on 2266 degrees of freedom
## Multiple R-squared:  0.118,  Adjusted R-squared:  0.1145 
## F-statistic: 33.68 on 9 and 2266 DF,  p-value: < 2.2e-16

Could try to further refine

Trying one more thing with just the 3 fields that had good linear trend

df4<-df %>% select(c(TARGET_WINS,TEAM_BATTING_H,TEAM_FIELDING_E, TEAM_BATTING_2B))
                   
lm4<-lm(TARGET_WINS ~ .,data=df4)

summary(lm4)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.855  -9.069   0.265   9.163  57.061 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.271243   3.022538   3.067 0.002185 ** 
## TEAM_BATTING_H   0.057341   0.002775  20.662  < 2e-16 ***
## TEAM_FIELDING_E -0.023250   0.001498 -15.519  < 2e-16 ***
## TEAM_BATTING_2B -0.029013   0.008506  -3.411 0.000659 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.75 on 2272 degrees of freedom
## Multiple R-squared:  0.239,  Adjusted R-squared:  0.238 
## F-statistic: 237.9 on 3 and 2272 DF,  p-value: < 2.2e-16

CONCLUSION: All the effort to try and model well actually made it worse!?!

(Including removing the least signif of the fields in model 1)