Untitled

Introduction

In this paper I attempt to create a model to predict team wins in MLB. The model makes these predictions from training and evaluation datasets of real-life yearly baseball team statistics.
Through data exploration, I realized that there are several data entry mistakes in these dataframes.
- Several entries of 0, which are impossible
- Historically inaccurate entries. Further, there is no way to verify whether a historically incorrect value means the entire observation (all stats) is corrupt, or just the single column entry.
  - To identify and handle these misentries, I provide links to historical MLB records and adjust the data accordingly.
  - Unfortunately, this may cause a problem on the prediction side. Because I have filtered wins (min 20, max 116), the model may miss on any test observations whose true win totals fall outside that range. As this is an entry-level attempt to create a linear model, hopefully this does not hurt the model too badly.
The paper proceeds by using the exploratory information gained to perform some initial data transformations. I break the categories into 3 areas
- Batting
- Baserunning
- Pitching
I explore these categories, filter them as I see best, and attempt to visualize and display their correlation to our target variable (Wins).
This project highlights a major concern with observational studies. I came to the realization that many of the pitching statistics must have been incorrectly entered into the dataframe. There are 785 entries that are outright duplicates of the batting statistics, and the correlations in some of the pitching fields do not make sense either. I will guide you through the data that brought me to this realization.

Data Exploration

NA counts
	na_count
INDEX	0
TEAM_BATTING_H	0
TEAM_BATTING_2B	0
TEAM_BATTING_3B	0
TEAM_BATTING_HR	0
TEAM_BATTING_BB	0
TEAM_BATTING_SO	18
TEAM_BASERUN_SB	13
TEAM_BASERUN_CS	87
TEAM_BATTING_HBP	240
TEAM_PITCHING_H	0
TEAM_PITCHING_HR	0
TEAM_PITCHING_BB	0
TEAM_PITCHING_SO	18
TEAM_FIELDING_E	0
TEAM_FIELDING_DP	31

General data observations of Entire Dataset

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
INDEX	1	2276	1268.46353	736.34904	1270.5	1268.56970	952.5705	1	2535	2534	0.0042149	-1.2167564	15.4346788
TARGET_WINS	2	2276	80.79086	15.75215	82.0	81.31229	14.8260	0	146	146	-0.3987232	1.0274757	0.3301823
TEAM_BATTING_H	3	2276	1469.26977	144.59120	1454.0	1459.04116	114.1602	891	2554	1663	1.5713335	7.2785261	3.0307891
TEAM_BATTING_2B	4	2276	241.24692	46.80141	238.0	240.39627	47.4432	69	458	389	0.2151018	0.0061609	0.9810087
TEAM_BATTING_3B	5	2276	55.25000	27.93856	47.0	52.17563	23.7216	0	223	223	1.1094652	1.5032418	0.5856226
TEAM_BATTING_HR	6	2276	99.61204	60.54687	102.0	97.38529	78.5778	0	264	264	0.1860421	-0.9631189	1.2691285
TEAM_BATTING_BB	7	2276	501.55888	122.67086	512.0	512.18331	94.8864	0	878	878	-1.0257599	2.1828544	2.5713150
TEAM_BATTING_SO	8	2174	735.60534	248.52642	750.0	742.31322	284.6592	0	1399	1399	-0.2978001	-0.3207992	5.3301912
TEAM_BASERUN_SB	9	2145	124.76177	87.79117	101.0	110.81188	60.7866	0	697	697	1.9724140	5.4896754	1.8955584
TEAM_BASERUN_CS	10	1504	52.80386	22.95634	49.0	50.35963	17.7912	0	201	201	1.9762180	7.6203818	0.5919414
TEAM_BATTING_HBP	11	191	59.35602	12.96712	58.0	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681
TEAM_PITCHING_H	12	2276	1779.21046	1406.84293	1518.0	1555.89517	174.9468	1137	30132	28995	10.3295111	141.8396985	29.4889618
TEAM_PITCHING_HR	13	2276	105.69859	61.29875	107.0	103.15697	74.1300	0	343	343	0.2877877	-0.6046311	1.2848886
TEAM_PITCHING_BB	14	2276	553.00791	166.35736	536.5	542.62459	98.5929	0	3645	3645	6.7438995	96.9676398	3.4870317
TEAM_PITCHING_SO	15	2174	817.73045	553.08503	813.5	796.93391	257.2311	0	19278	19278	22.1745535	671.1891292	11.8621151
TEAM_FIELDING_E	16	2276	246.48067	227.77097	159.0	193.43798	62.2692	65	1898	1833	2.9904656	10.9702717	4.7743279
TEAM_FIELDING_DP	17	1990	146.38794	26.22639	149.0	147.57789	23.7216	52	228	176	-0.3889390	0.1817397	0.5879114

Our training dataset contains yearly results for 2276 MLB team seasons
17 total categories
- Numeric index representing each team
- Category tracking wins (our target category)
- 7 different batting statistics
- 2 baserunning statistics
- 4 pitching statistics
- 2 fielding statistics
Our testing dataset has 259 observations and has all of the same variables except for our target variable (wins)
- This leaves us with an issue. Below I filter our data against modern-era historical records. This allows us to correct many of the outliers that are present within the data.

Explore target variable Wins

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
X1	1	2276	80.79086	15.75215	82	81.31229	14.826	0	146	146	-0.3987232	1.027476	0.3301823

Our mean and median are of little practical value on their own. With 162 games in a season, and every game producing one win and one loss, we should expect to see about 81 wins on average.
This is not exact, as seasons since the 1800s haven’t all been 162 games, but my interval test shows a left-tail interval of 80.35-81, which comfortably contains our mean of 80.79.
While our average wins seem fine, our min and max seem incorrect
- The most wins by an MLB team ever was 116
- The fewest wins by an MLB team in the modern era was 20
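The interval check mentioned above is not shown in the output, so the following is only a sketch of one common way to do it: a one-sample t-test of the mean against 81 wins. The `wins` vector is simulated stand-in data, not the real TARGET_WINS column, and the two-sided 95% interval is an assumption rather than the exact interval used in this paper.

```r
# Simulated win totals standing in for TARGET_WINS (hypothetical data).
set.seed(1)
wins <- round(rnorm(2276, mean = 80.8, sd = 15.75))

# Two-sided 95% confidence interval for the mean, plus a test of
# H0: mean wins = 81 (the value implied by a 162-game season).
ci_test <- t.test(wins, mu = 81, conf.level = 0.95)
ci <- ci_test$conf.int

ci               # lower and upper bound of the interval
ci_test$p.value  # p-value for the test against 81 wins
```

If 81 falls inside the interval, the sample mean is consistent with the expected average.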

Remove outliers discovered in Wins category

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
X1	1	2255	80.58537	15.05826	82	81.18726	14.826	21	116	95	-0.4395385	0.3152447	0.3171039

The mean of our sample moved further from our expected population mean when we removed the incorrect outliers; however, it is still within the confidence interval. We lost 21 observations (2276 down to 2255) when removing these entries, which suffered from some sort of data entry mistake. This does raise concerns about the validity of all the entries. However, with no way to map an index to a specific team, let’s continue.
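The win filter itself is not printed, so here is a minimal sketch of the idea on a toy dataframe. The inclusive bounds (20 and 116) are taken from the records quoted above; whether the original filter was inclusive or strict is an assumption, as are the toy values.

```r
# Toy stand-in for the training data (hypothetical values).
train <- data.frame(
  INDEX       = 1:6,
  TARGET_WINS = c(0, 21, 82, 116, 146, 95)
)

# Keep only seasons whose win totals fall inside the historical range.
in_range <- train$TARGET_WINS >= 20 & train$TARGET_WINS <= 116
train_filtered <- train[in_range, ]

# Number of observations removed by the filter.
removed <- nrow(train) - nrow(train_filtered)
removed
```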

View target variable distribution before and after filters

The median (82) and mean (about 80.6) are close, and the histogram looks roughly normal, although there is somewhat of a left tail. We can assume our target variable is approximately normally distributed

Descriptive Look at batting statistics

Create and view summary of batting DF

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TEAM_BATTING_H	1	2255	1466.81685	135.92839	1454	1458.24377	114.1602	992	2496	1504	1.2098420	4.8627441	2.8624434
TEAM_BATTING_2B	2	2255	241.09490	46.28571	238	240.29695	47.4432	69	458	389	0.2012387	-0.0427041	0.9747061
TEAM_BATTING_3B	3	2255	54.93348	27.63678	47	51.89695	23.7216	0	223	223	1.1205339	1.5849716	0.5819881
TEAM_BATTING_HR	4	2255	100.22395	60.39496	103	98.07867	77.0952	0	264	264	0.1771368	-0.9584953	1.2718252
TEAM_BATTING_BB	5	2255	503.80089	119.41435	513	513.29640	93.4038	29	878	849	-0.9614819	2.1131941	2.5146829
TEAM_BATTING_SO	6	2155	739.51323	244.66509	753	744.96464	284.6592	0	1399	1399	-0.2526910	-0.4140827	5.2704582
TEAM_BATTING_HBP	7	191	59.35602	12.96712	58	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681
Singles	8	2255	1070.56452	121.79442	1049	1058.05319	97.8516	709	2112	1403	1.7574832	6.7989398	2.5648036

Fig 1: Histograms of the batting statistics (TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BATTING_HBP, Singles)

Fig 2: Boxplots of the batting statistics

Boxplot outlier counts
	n	outliers
TEAM_BATTING_H	2255	61
TEAM_BATTING_2B	2255	11
TEAM_BATTING_3B	2255	26
TEAM_BATTING_HR	2255	0
TEAM_BATTING_BB	2255	120
TEAM_BATTING_SO	2155	0
TEAM_BATTING_HBP	191	1
Singles	2255	70

Triples, hits, walks, and singles seem to have a large number of outliers.
- Strikeouts and home runs are the only variables with no outliers
All of the means and medians seem very close to each other
- However, the sd seems rather high in certain categories relative to the mean. Without expert knowledge, I cannot rule out that this is normal
- Some min values in these categories are 0, which is impossible.
  - Exploring these categories, there are only a handful of 0’s per category (15 or fewer). Given that the other categories in these observations seem in line, some sort of data entry mistake was likely made again. In this case, let’s replace the 0’s with NA.
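The replacement of impossible zeros with NA can be sketched as follows. The toy dataframe and the `zero_cols` vector are illustrative assumptions; the three columns are the ones whose minimum moves off 0 in the summary table that follows.

```r
# Toy stand-in for the batting dataframe (hypothetical values).
batting_demo <- data.frame(
  TEAM_BATTING_3B = c(0, 47, 55),
  TEAM_BATTING_HR = c(102, 0, 3),
  TEAM_BATTING_SO = c(750, 0, 812)
)

# Columns where a season total of 0 is treated as a data entry error.
zero_cols <- c("TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_BATTING_SO")

# Swap each 0 for NA; %in% keeps existing NA values untouched.
batting_demo[zero_cols] <- lapply(
  batting_demo[zero_cols],
  function(x) replace(x, x %in% 0, NA)
)

# NA count per column after the replacement.
colSums(is.na(batting_demo))
```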

Data manipulation of batting statistics

Replace 0 with NA’s

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TEAM_BATTING_H	1	2255	1466.81685	135.92839	1454.0	1458.24377	114.1602	992	2496	1504	1.2098420	4.8627441	2.8624434
TEAM_BATTING_2B	2	2255	241.09490	46.28571	238.0	240.29695	47.4432	69	458	389	0.2012387	-0.0427041	0.9747061
TEAM_BATTING_3B	3	2254	54.95785	27.61866	47.0	51.91075	23.7216	8	223	215	1.1240833	1.5881365	0.5817357
TEAM_BATTING_HR	4	2244	100.71524	60.13266	104.0	98.55679	75.6126	3	264	261	0.1785408	-0.9565209	1.2694014
TEAM_BATTING_BB	5	2255	503.80089	119.41435	513.0	513.29640	93.4038	29	878	849	-0.9614819	2.1131941	2.5146829
TEAM_BATTING_SO	6	2140	744.69673	237.52652	756.5	747.73598	283.9179	67	1399	1332	-0.1320162	-0.7184334	5.1345834
TEAM_BATTING_HBP	7	191	59.35602	12.96712	58.0	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681
Singles	8	2255	1070.56452	121.79442	1049.0	1058.05319	97.8516	709	2112	1403	1.7574832	6.7989398	2.5648036

Fig 3

NA counts
	na_count
INDEX	0
TARGET_WINS	0
TEAM_BATTING_H	0
TEAM_BATTING_2B	0
TEAM_BATTING_3B	1
TEAM_BATTING_HR	11
TEAM_BATTING_BB	0
TEAM_BATTING_SO	115
TEAM_BASERUN_SB	121
TEAM_BASERUN_CS	757
TEAM_BATTING_HBP	2064
TEAM_PITCHING_H	0
TEAM_PITCHING_HR	0
TEAM_PITCHING_BB	0
TEAM_PITCHING_SO	100
TEAM_FIELDING_E	0
TEAM_FIELDING_DP	271
Singles	0

Examine historical record to detect incorrect outliers

Triples
- max: 153
- min: 11
Home run records
- This link provides some good evidence that our assumptions so far were correct. The fewest HR ever recorded was 3, and the most was 264, which matches our dataset
Doubles records
- Team record for total doubles is 376
- min is 110
Hits records
- Hits record is 1,783
Base on balls
- min: 282
- max: 835
Singles
- min: 811
- max: 1,338

Filter for historical records

The applied filters are printed in the code block below
NA values in all batting predictor columns except strikeouts are discarded by these filters
- Looking at fig 3 above, there are not many of these NA values, so I felt they should just be discarded as a precaution
For strikeouts there were 115 NA values
- I attempted to run a linear model for imputation, but considering none of the predictors were significant, I chose to replace missing values with the median. The distribution for strikeouts seems normal, although the median is higher than the mean, so I think either mean or median would be an adequate choice for NA replacement.
- I preserve a copy of the dataframe with NA values as well, so I can eventually see what models that drop these NA values look like

## Change in batting dataset
library(dplyr)  # filter(), %>%
library(knitr)  # kable()
# Keep only rows inside the historical record ranges; filter() also drops rows where a condition is NA
batting <- batting %>% 
    filter(TEAM_BATTING_3B < 154  &
           TEAM_BATTING_3B > 10   &
           TEAM_BATTING_2B < 377  &
           TEAM_BATTING_2B > 109  &   
           TEAM_BATTING_H  < 1784 &
           TEAM_BATTING_BB > 281  &
           TEAM_BATTING_BB < 836  &
           Singles         < 1339 &
           Singles         > 810   )
# Preserve a copy that still has the strikeout NA's, then fill them with the median
batting_w_na <- batting
batting$TEAM_BATTING_SO[is.na(batting$TEAM_BATTING_SO)] <- median(batting$TEAM_BATTING_SO, na.rm=TRUE)

# Abandoned attempt at model-based imputation (logit/ilogit come from the faraway package)
#lmodr <- lm(logit(TEAM_BATTING_SO/100)~TEAM_BATTING_3B+TEAM_BATTING_2B+TEAM_BATTING_H+TEAM_BATTING_BB,batting)
#ilogit(predict(lmodr,batting[is.na(batting$TEAM_BATTING_SO),]))*100
#summary(lmodr)
# Count the remaining NA's per column
na_count <- sapply(batting, function(y) sum(is.na(y)))
na_count <- data.frame(na_count)
kable(na_count, caption="NA counts")

NA counts
	na_count
TEAM_BATTING_H	0
TEAM_BATTING_2B	0
TEAM_BATTING_3B	0
TEAM_BATTING_HR	0
TEAM_BATTING_BB	0
TEAM_BATTING_SO	0
TEAM_BATTING_HBP	1901
Singles	0

Fig 4: Reexamine boxplot post data manipulation

Boxplot outlier counts
	n	outliers
TEAM_BATTING_H	2092	15
TEAM_BATTING_2B	2092	6
TEAM_BATTING_3B	2092	29
TEAM_BATTING_HR	2092	0
TEAM_BATTING_BB	2092	12
TEAM_BATTING_SO	2092	0
TEAM_BATTING_HBP	191	1
Singles	2092	15

Our dataset has been trimmed from 2255 to 2092 observations
Looking at fig 4 as compared to fig 2, we can see that many of the outliers in singles, hits, and BB were removed
- Triples still seems to have many outliers
- There are still some small outliers present in other categories as well

Batting statistics overall correlations

HBP has the weakest correlations. It can likely be safely discarded
Triples (3B) are heavily negatively correlated with HR and strikeouts
Home runs and strikeouts share the largest correlation (positive) of all batting statistics

How do batting statistics correlate with our target(Wins)?

There aren’t really any strong predictors.
- Relative to the available predictors, hits, doubles, and home runs have the strongest correlation with the target variable (wins).
- Strikeouts and HBP stand out as having little to no correlation with wins

Stolen bases

Edit stolen bases via historical records and observe correlations

Stolen bases and caught stealing
- fewest stolen bases: 13
- most stolen bases: 415
- fewest caught stealing: 8
- most caught stealing: 191
Caught stealing has a substantial number of NA’s. If we filter out the NA’s, we will lose nearly 1/3 of our data. Therefore I chose to use the median for all fields where caught stealing is NA
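A minimal sketch of the baserunning cleanup described above, on toy data. The order of the two steps (range filter first, then median fill) and the inclusive bounds are assumptions; the real code is not shown in this section.

```r
# Toy stand-in for the baserunning columns (hypothetical values).
running <- data.frame(
  TEAM_BASERUN_SB = c(101, 697, 45, 13),
  TEAM_BASERUN_CS = c(49, NA, 30, NA)
)

# Keep rows whose stolen-base totals fall inside the historical range.
# which() drops rows where the comparison is NA instead of returning NA rows.
keep <- which(running$TEAM_BASERUN_SB >= 13 & running$TEAM_BASERUN_SB <= 415)
running <- running[keep, ]

# Fill missing caught-stealing values with the median of the observed ones.
cs_median <- median(running$TEAM_BASERUN_CS, na.rm = TRUE)
running$TEAM_BASERUN_CS[is.na(running$TEAM_BASERUN_CS)] <- cs_median

running
```

Median imputation keeps the sample size but shrinks the variance of the imputed column, which is one reason to also fit a model on the rows without imputation.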

NA counts
	na_count
INDEX	0
TARGET_WINS	0
TEAM_BATTING_H	0
TEAM_BATTING_2B	0
TEAM_BATTING_3B	0
TEAM_BATTING_HR	0
TEAM_BATTING_BB	0
TEAM_BATTING_SO	0
TEAM_BASERUN_SB	0
TEAM_BASERUN_CS	0
TEAM_BATTING_HBP	1853
TEAM_PITCHING_H	0
TEAM_PITCHING_HR	0
TEAM_PITCHING_BB	0
TEAM_PITCHING_SO	100
TEAM_FIELDING_E	0
TEAM_FIELDING_DP	0
Singles	0

Baserunning doesn’t appear to have much correlation with our target variable

Pitching categories

Display pitching categories

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET_WINS	1	2044	80.52789	13.89309	81.0	80.96638	14.8260	21	116	95	-0.3090862	-0.0162643	0.3072970
TEAM_PITCHING_H	2	2044	1539.55479	220.16288	1498.0	1513.97127	146.7774	1137	6723	5586	7.2180667	150.1985045	4.8697161
TEAM_PITCHING_HR	3	2044	111.71771	59.92485	114.0	110.10391	68.1996	3	343	340	0.1784958	-0.5918350	1.3254597
TEAM_PITCHING_BB	4	2044	553.66830	109.40697	541.0	546.10452	91.9212	312	2169	1857	2.2520497	23.8773902	2.4199397
TEAM_PITCHING_SO	5	1944	807.61060	228.05424	818.0	803.06427	250.5594	301	2309	2008	0.3567090	0.6392627	5.1723752
TEAM_FIELDING_E	6	2044	188.89922	109.36885	150.5	166.38570	49.6671	65	765	700	2.3312285	5.9463732	2.4190964
TEAM_FIELDING_DP	7	2044	147.84883	24.20592	149.0	148.80746	20.7564	68	228	160	-0.3664606	0.4738264	0.5354035

Histograms of the pitching and fielding categories (TARGET_WINS, TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E, TEAM_FIELDING_DP)

Boxplot outlier counts
	n	outliers
TARGET_WINS	2044	7
TEAM_PITCHING_H	2044	99
TEAM_PITCHING_HR	2044	3
TEAM_PITCHING_BB	2044	49
TEAM_PITCHING_SO	1944	9
TEAM_FIELDING_E	2044	178
TEAM_FIELDING_DP	2044	56

Pitching stats look like they have a lot of outliers

Historical pitching records

Pitching records were much harder to find
Home runs allowed
- most home runs allowed: 258
Errors
- most errors: 867

Filter for historical records and run correlation plot

Having some knowledge of baseball, this seems weird
- More hits, walks, and home runs allowed are associated with an increase in total team wins?
Something else is wrong with the data

Look at batting to pitching correlations

Pitching HR has a perfect correlation with batting HR
Pitching BB has a perfect correlation with batting BB
Pitching SO has a perfect correlation with batting SO
Pitching hits has a strong correlation with batting hits

Problems with correlations

I noticed some duplicate entries between batting and pitching stats
I am hoping that these duplicate values are what is causing problems in my correlations
Let’s check how many duplicate values there are between the batting and pitching categories

##   total_equal_categories
## 1                    785

785 teams, according to our DF, have the same number of hits allowed (pitching) as hits made (batting)
- That means that over 30% of teams on this list happened to strike out exactly as many times as their pitchers struck out opposing batters
- This is nearly impossible
The correlation plot below covers these shared values and shows that all of these categories are identical entries (785-observation DF)
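The duplicate count can be sketched as a row-wise equality check between matching batting and pitching columns. Which column pairs were compared in the original analysis is not shown, so the four pairs below are an assumption, and the toy dataframe is hypothetical.

```r
# Toy stand-in for the training data (hypothetical values).
teams <- data.frame(
  TEAM_BATTING_H   = c(1454, 1500, 1600),
  TEAM_PITCHING_H  = c(1454, 1520, 1600),
  TEAM_BATTING_HR  = c(100, 50, 120),
  TEAM_PITCHING_HR = c(100, 60, 120),
  TEAM_BATTING_BB  = c(500, 480, 510),
  TEAM_PITCHING_BB = c(500, 490, 515),
  TEAM_BATTING_SO  = c(700, 650, 800),
  TEAM_PITCHING_SO = c(700, 640, 800)
)

# TRUE where every batting value equals its pitching counterpart.
same_all <- with(
  teams,
  TEAM_BATTING_H  == TEAM_PITCHING_H  &
  TEAM_BATTING_HR == TEAM_PITCHING_HR &
  TEAM_BATTING_BB == TEAM_PITCHING_BB &
  TEAM_BATTING_SO == TEAM_PITCHING_SO
)

# Count of fully duplicated rows; NA comparisons are ignored.
total_equal_categories <- sum(same_all, na.rm = TRUE)
total_equal_categories
```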

Filter all of these misentries out of DF

There still appears to be a heavy correlation between the pitching and hitting categories
More worrisome, the pitching stats still have correlations that make no sense

Batting and pitching statistics shouldn’t be correlated in the same direction.
These pitching stats suffer from far too many mistakes to be trusted
Dropping the 785 observations created a large loss of data and didn’t help to make the data more trustworthy

I have to drop all of the pitching statistics

Because of collinearity issues, this likely would have had to happen anyway
This shows how dangerous using observational data can be
I believe we can still keep all of the duplicate observations as batting stats, because they have similar overall correlations to the earlier batting stats and match my expectations for batting
- This must all be addressed in any tentative conclusion or model

Create custom stats in place of poor pitching stats

- Singles
- A slugging scale (HR=4, triple=3, double=2, single/BB=1)
- An on-base stat (hits + BB + HBP)
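The three derived stats can be sketched as below. The exact formulas are not printed in this paper, so these follow the descriptions above and are assumptions: Singles as hits minus extra-base hits (which is consistent with the means in the batting summary table), slugging as a weighted sum, and missing HBP treated as 0 so that the on-base stat has no NA values. The toy dataframe is hypothetical.

```r
# Toy stand-in for the training data (hypothetical values).
teams <- data.frame(
  TEAM_BATTING_H   = c(1454, 1500),
  TEAM_BATTING_2B  = c(238, 250),
  TEAM_BATTING_3B  = c(47, 30),
  TEAM_BATTING_HR  = c(102, 150),
  TEAM_BATTING_BB  = c(512, 480),
  TEAM_BATTING_HBP = c(NA, 58)
)

# Singles: hits that were not doubles, triples, or home runs.
teams$Singles <- with(
  teams,
  TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR
)

# Slugging scale: HR = 4, triple = 3, double = 2, single or walk = 1.
teams$slugging <- with(
  teams,
  4 * TEAM_BATTING_HR + 3 * TEAM_BATTING_3B + 2 * TEAM_BATTING_2B +
    Singles + TEAM_BATTING_BB
)

# On-base stat: hits + walks + HBP, with missing HBP counted as 0 (assumption).
hbp <- ifelse(is.na(teams$TEAM_BATTING_HBP), 0, teams$TEAM_BATTING_HBP)
teams$OBP <- teams$TEAM_BATTING_H + teams$TEAM_BATTING_BB + hbp

teams[, c("Singles", "slugging", "OBP")]
```

Because both stats are sums of columns already in the data, they are linearly dependent on those columns, which is worth keeping in mind when they are combined with the raw predictors in a model.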

NA counts
	na_count
INDEX	0
TARGET_WINS	0
TEAM_BATTING_H	0
TEAM_BATTING_2B	0
TEAM_BATTING_3B	0
TEAM_BATTING_HR	0
TEAM_BATTING_BB	0
TEAM_BATTING_SO	0
TEAM_BASERUN_SB	0
TEAM_BASERUN_CS	0
TEAM_BATTING_HBP	1844
TEAM_PITCHING_H	0
TEAM_PITCHING_HR	0
TEAM_PITCHING_BB	0
TEAM_PITCHING_SO	100
TEAM_FIELDING_E	0
TEAM_FIELDING_DP	0
Singles	0
slugging	0
OBP	0

View new stat correlations with Wins

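# Correlation of TARGET_WINS (column 2) with OBP (column 20) and slugging (column 19); needs library(corrplot)
corrplot(cor(money_ball_train_2[, c(2, 20, 19)])[, 1, drop = FALSE], cl.pos = "n", method = "number", number.cex = 0.7)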

Both stats have larger correlation with wins than any of the original predictor variables

Building models

For almost every model, I selected my predictors by manual backward selection
I have rejected all pitching stats due to the poor quality of the data
Each model has a slightly different idea behind it
- Model 1
  - Takes all of the batting, fielding and running stats to start except HBP(which is discarded from all analysis)
  - Advanced diagnostics run to check residuals.
    - In the other models I simply summarize these diagnostics for brevity
- Model 2
  - The moneyball stats seemed like they would be important, but due to collinearity, likely caused by how I built the stats, they don’t make it into most models.
  - Model 2 attempts to use the moneyball stats with whatever predictors aren’t collinear.
- Model 3
  - An experimental model to see how scaling and centering my data affects the results
  - model starts with the same inputs as model 1
- Model 4
  - I run a box cox transformation over our target variable
  - input for this model is the same as model 1
- Model 5
  - Generalized additive model
    - Produces a slightly different summary result
  - Input for this model is the same as model 1
- Model 6
  - Throughout data exploration I updated NA values with median values.
  - This Model contains the original NA values
    - Therefore the LM model is run on 1468 observations instead of original 2000+
- Model 7
  - Attempt at polynomial fitting
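The manual backward selection mentioned above is not shown step by step. One common way to do it is to repeatedly refit and drop the predictor with the largest p-value until every remaining predictor is significant. The sketch below does that on simulated data; the 0.05 cutoff, the variable names, and the stopping rule are all assumptions rather than the procedure actually used here.

```r
# Simulated data: y depends on a and b, while noise is unrelated (hypothetical).
set.seed(9)
n <- 300
demo <- data.frame(a = rnorm(n), b = rnorm(n), noise = rnorm(n))
demo$y <- 5 + 2 * demo$a - 1.5 * demo$b + rnorm(n)

predictors <- c("a", "b", "noise")
alpha <- 0.05  # hypothetical significance cutoff

repeat {
  fit <- lm(reformulate(predictors, response = "y"), data = demo)
  # Coefficient table without the intercept row, kept as a matrix.
  coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
  p_values <- setNames(coefs[, "Pr(>|t|)"], rownames(coefs))
  # Stop when everything left is significant or only one predictor remains.
  if (max(p_values) <= alpha || length(predictors) == 1) break
  # Otherwise drop the least significant predictor and refit.
  predictors <- setdiff(predictors, names(which.max(p_values)))
}

predictors  # predictors retained by the procedure
```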

Model 1 A

Model is continuous and linear, and residuals are normal
Total predictors used: 8
Adjusted R-squared: .368
F-statistic: 148.8
RSE: 11.04
Dropped TEAM_BATTING_2B, OBP, slugging, and TEAM_BASERUN_CS during backward selection
Our residuals seem linear and normal
Our QQ plot does tail off with some outliers
Further diagnostics for model 1 follow

## 
## Call:
## lm(formula = y ~ TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles, data = fit_1_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.987  -7.593   0.217   7.144  37.095 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      46.431876   5.799453   8.006 1.97e-15 ***
## TEAM_BATTING_3B   0.189918   0.017127  11.089  < 2e-16 ***
## TEAM_BATTING_HR   0.116007   0.007590  15.285  < 2e-16 ***
## TEAM_BATTING_BB   0.031182   0.003161   9.863  < 2e-16 ***
## TEAM_BATTING_SO  -0.015229   0.002245  -6.784 1.53e-11 ***
## TEAM_BASERUN_SB   0.079259   0.005247  15.106  < 2e-16 ***
## TEAM_FIELDING_E  -0.077432   0.003876 -19.979  < 2e-16 ***
## TEAM_FIELDING_DP -0.112248   0.011893  -9.438  < 2e-16 ***
## Singles           0.027791   0.004155   6.689 2.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.04 on 2025 degrees of freedom
## Multiple R-squared:  0.3701, Adjusted R-squared:  0.3677 
## F-statistic: 148.8 on 8 and 2025 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: y
##                    Df Sum Sq Mean Sq F value    Pr(>F)    
## TEAM_BATTING_3B     1   2208    2208  18.115 2.175e-05 ***
## TEAM_BATTING_HR     1  49985   49985 410.082 < 2.2e-16 ***
## TEAM_BATTING_BB     1  14374   14374 117.927 < 2.2e-16 ***
## TEAM_BATTING_SO     1   1906    1906  15.634 7.952e-05 ***
## TEAM_BASERUN_SB     1   8190    8190  67.193 4.316e-16 ***
## TEAM_FIELDING_E     1  53143   53143 435.990 < 2.2e-16 ***
## TEAM_FIELDING_DP    1   9795    9795  80.362 < 2.2e-16 ***
## Singles             1   5453    5453  44.738 2.905e-11 ***
## Residuals        2025 246829     122                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 1 advanced diagnostics

Test for constant variance

##                 Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    3.9845162  0.2427866 16.4116 < 2.2e-16
## fitted(fit_1) -0.0155285  0.0030003 -5.1756 2.496e-07
## 
## n = 2034, p = 2, Residual SE = 1.14270, R-Squared = 0.01

Given the p-value (2.496e-07), our model may have some issues with constant variance
- This test shows there is a statistically significant relationship between the size of the residuals and the fitted values, so the variance does not appear to be constant
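The output above has the shape of a regression of residual size on fitted values. A minimal sketch of that kind of check, assuming the square root of the absolute residuals is regressed on the fitted values, is shown below on simulated data with deliberately non-constant variance. The data and object names are hypothetical.

```r
# Simulated data whose error spread grows with x (hypothetical).
set.seed(42)
x <- runif(200, min = 1, max = 10)
y <- 3 + 2 * x + rnorm(200, sd = 0.5 * x)

fit <- lm(y ~ x)

# Regress sqrt(|residual|) on the fitted values; a slope that differs
# from zero suggests the residual spread changes with the fitted value.
var_check <- lm(sqrt(abs(residuals(fit))) ~ fitted(fit))
var_coefs <- summary(var_check)$coefficients

slope   <- var_coefs[2, "Estimate"]
slope_p <- var_coefs[2, "Pr(>|t|)"]
slope
slope_p
```

A small p-value on the slope is evidence against constant variance, which is how the table above is read.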

Normality tests

## 
##  Anderson-Darling normality test
## 
## data:  fit_1$residuals
## A = 0.28768, p-value = 0.6194

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit_1)
## W = 0.99913, p-value = 0.4494

The model seems fine according to both normality tests and the previous QQ plots
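A residual normality check of this kind can be sketched with `shapiro.test` from base R, here on simulated data with normal errors. The Anderson-Darling test is available as `ad.test` in the nortest package and is left out of the sketch to keep it dependency-free.

```r
# Simulated regression with normally distributed errors (hypothetical data).
set.seed(7)
x <- rnorm(300)
y <- 1 + 2 * x + rnorm(300)

fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals; H0 is that they are normal.
sw <- shapiro.test(residuals(fit))
sw$statistic  # W, close to 1 for normal-looking residuals
sw$p.value    # a large p-value gives no evidence against normality
```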

Test for independence

## 
##  Durbin-Watson test
## 
## data:  y ~ TEAM_BASERUN_CS + TEAM_BASERUN_SB + TEAM_BATTING_2B + TEAM_BATTING_3B +     TEAM_BATTING_BB + TEAM_BATTING_HR + TEAM_FIELDING_DP + TEAM_FIELDING_E
## DW = 1.0338, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

It appears there may be some issues with independence
- Considering this issue isn’t visible in the plots, there may be a need to add quadratic terms for some of the predictors
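To make the Durbin-Watson statistic concrete, it can be computed directly from the residuals as the sum of squared successive differences divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation and values well below 2 suggest positive autocorrelation. The sketch below uses simulated data with autocorrelated errors; the data are hypothetical, and the printed test above looks like the output of `dwtest` from the lmtest package, which also supplies the p-value.

```r
# Simulated regression with AR(1) errors (hypothetical data).
set.seed(3)
n <- 200
x <- rnorm(n)
e <- as.numeric(stats::filter(rnorm(n), filter = 0.6, method = "recursive"))
y <- 1 + 2 * x + e

fit <- lm(y ~ x)
res <- residuals(fit)

# Durbin-Watson statistic, always between 0 and 4.
dw <- sum(diff(res)^2) / sum(res^2)
dw
```

The statistic depends on the order of the rows, so for data without a natural time ordering a low value may reflect how the rows happen to be sorted rather than a modelling problem.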

leverage points and outliers

## [1] -3.295338

##       761      1433      1691      1808      1810 
## -3.403779  3.320870  3.381547 -3.433216 -3.460660

After running this model several different ways, one negative takeaway for the overall analysis is that our outliers are more proof of failures in data entry. Our dataset has 4 seasons with exactly 110 wins; according to historical records, this has only happened twice
- Those two historical seasons don’t map to any of the 4 observations with 110 wins in our dataset.
With that noted, the model diagnostics were not terrible, and I will rerun this model without outliers in model 1B
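One way to flag outliers like the ones listed above is to compare studentized residuals against a t-distribution cutoff. The sketch below uses a Bonferroni-corrected cutoff on simulated data with one planted outlier. The cutoff rule is an assumption and may differ from the one that produced the critical value printed above.

```r
# Simulated regression with one planted outlier (hypothetical data).
set.seed(11)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
y[5] <- y[5] + 15  # push observation 5 far from the fitted line

fit <- lm(y ~ x)

# Externally studentized residuals, named by observation.
stud <- rstudent(fit)

# Bonferroni-corrected two-sided cutoff at the 5% level.
cutoff <- qt(0.05 / (2 * length(stud)), df = fit$df.residual - 1)

# Observations whose studentized residual exceeds the cutoff in size.
flagged <- stud[abs(stud) > abs(cutoff)]
flagged
```

Refitting without the flagged rows, as model 1B does, shows how sensitive the coefficients are to those observations.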

Model 1B

Remove outliers from model 1A and attempt analysis
total predictors used - 9
Adjusted r squared is .383
F stat 140.6
RSE= 10.82
Dropped slugging,OBP,TEAM_BATTING_2B

## 
## Call:
## lm(formula = y ~ TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles, data = fit_1b_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.779  -7.474   0.135   7.246  35.294 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      51.515840   5.771956   8.925  < 2e-16 ***
## TEAM_BATTING_3B   0.186005   0.016825  11.055  < 2e-16 ***
## TEAM_BATTING_HR   0.110316   0.007542  14.627  < 2e-16 ***
## TEAM_BATTING_BB   0.029713   0.003116   9.535  < 2e-16 ***
## TEAM_BATTING_SO  -0.016821   0.002210  -7.613 4.10e-14 ***
## TEAM_BASERUN_SB   0.090115   0.005512  16.348  < 2e-16 ***
## TEAM_BASERUN_CS  -0.081294   0.014844  -5.477 4.88e-08 ***
## TEAM_FIELDING_E  -0.084660   0.004052 -20.892  < 2e-16 ***
## TEAM_FIELDING_DP -0.109285   0.011692  -9.347  < 2e-16 ***
## Singles           0.029337   0.004080   7.190 9.07e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.82 on 2019 degrees of freedom
## Multiple R-squared:  0.3852, Adjusted R-squared:  0.3825 
## F-statistic: 140.6 on 9 and 2019 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: y
##                    Df Sum Sq Mean Sq  F value    Pr(>F)    
## TEAM_BATTING_3B     1   2361    2361  20.1681 7.493e-06 ***
## TEAM_BATTING_HR     1  48976   48976 418.4169 < 2.2e-16 ***
## TEAM_BATTING_BB     1  14081   14081 120.2952 < 2.2e-16 ***
## TEAM_BATTING_SO     1   2401    2401  20.5123 6.271e-06 ***
## TEAM_BASERUN_SB     1   8617    8617  73.6138 < 2.2e-16 ***
## TEAM_BASERUN_CS     1    261     261   2.2316    0.1354    
## TEAM_FIELDING_E     1  56180   56180 479.9588 < 2.2e-16 ***
## TEAM_FIELDING_DP    1   9170    9170  78.3399 < 2.2e-16 ***
## Singles             1   6052    6052  51.7002 9.065e-13 ***
## Residuals        2019 236326     117                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 1C

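Try model 1 with centering and scaling
Identical to model 1A in terms of predictors used (8), F-statistic, adjusted R-squared, and RSE
Coefficients take on much larger values because each one now measures the effect of a one-standard-deviation change, and the intercept (80.48) is now the mean of the target because the predictors are centered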
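The behaviour described above can be reproduced on a small example: standardizing the predictors changes the coefficient scale and moves the intercept to the mean of the response, while the fit itself is unchanged. The data and names below are hypothetical, and using `scale` for the centering and scaling step is an assumption about how it was done.

```r
# Simulated data with predictors on very different scales (hypothetical).
set.seed(5)
d <- data.frame(x1 = rnorm(50, mean = 100, sd = 20),
                x2 = rnorm(50, mean = 500, sd = 90))
d$y <- 40 + 0.1 * d$x1 + 0.05 * d$x2 + rnorm(50, sd = 5)

fit_raw <- lm(y ~ x1 + x2, data = d)

# Center and scale the predictors only; the response is left as is.
d_sc <- d
d_sc[c("x1", "x2")] <- as.data.frame(scale(d[c("x1", "x2")]))
fit_sc <- lm(y ~ x1 + x2, data = d_sc)

coef(fit_raw)
coef(fit_sc)                  # intercept is now mean(d$y)
summary(fit_raw)$r.squared
summary(fit_sc)$r.squared     # same fit quality as the raw model
```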

## 
## Call:
## lm(formula = y ~ TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles, data = fit_1_db.c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.987  -7.593   0.217   7.144  37.095 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       80.4784     0.2448 328.753  < 2e-16 ***
## TEAM_BATTING_3B    4.6345     0.4179  11.089  < 2e-16 ***
## TEAM_BATTING_HR    6.7045     0.4386  15.285  < 2e-16 ***
## TEAM_BATTING_BB    2.7601     0.2798   9.863  < 2e-16 ***
## TEAM_BATTING_SO   -3.3121     0.4882  -6.784 1.53e-11 ***
## TEAM_BASERUN_SB    5.6539     0.3743  15.106  < 2e-16 ***
## TEAM_FIELDING_E   -8.4613     0.4235 -19.979  < 2e-16 ***
## TEAM_FIELDING_DP  -2.7208     0.2883  -9.438  < 2e-16 ***
## Singles            2.5687     0.3840   6.689 2.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.04 on 2025 degrees of freedom
## Multiple R-squared:  0.3701, Adjusted R-squared:  0.3677 
## F-statistic: 148.8 on 8 and 2025 DF,  p-value: < 2.2e-16

Summary from model 1

Model 1B, which adjusts for outliers, performs slightly better, but 1A is a simpler model with one less predictor
The moneyball stats aren’t present in any of these models
Diagnostics suggest a need to add polynomial terms

Model 2

Moneyball stats, baserunning, fielding
Overall a much simpler model than Model 1
6 predictors
Adjusted R^2 = .308
F statistic = 151.5
RSE = 11.55

##                  vars    n    mean     sd median trimmed    mad  min  max
## TEAM_BASERUN_SB     1 2034  116.60  71.33    100  106.99  59.30   18  414
## TEAM_BASERUN_CS     2 2034   51.94  18.39     50   50.12  10.38   11  186
## TEAM_FIELDING_E     3 2034  189.06 109.27    151  166.62  50.41   65  765
## TEAM_FIELDING_DP    4 2034  147.84  24.24    149  148.80  21.50   68  228
## slugging            5 2034 2118.54 238.31   2128 2120.70 240.92 1453 2832
## OBP                 6 2034 1976.50 152.78   1975 1975.51 145.29 1438 2507
##                  range  skew kurtosis   se
## TEAM_BASERUN_SB    396  1.35     2.00 1.58
## TEAM_BASERUN_CS    175  2.03     8.49 0.41
## TEAM_FIELDING_E    700  2.33     5.96 2.42
## TEAM_FIELDING_DP   160 -0.37     0.47 0.54
## slugging          1379 -0.08    -0.24 5.28
## OBP               1069  0.05     0.25 3.39

## 
## Call:
## lm(formula = y ~ ., data = fit_2_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.560  -8.230   0.086   8.084  34.943 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      19.283586   3.617509   5.331 1.09e-07 ***
## TEAM_BASERUN_SB   0.071440   0.005369  13.307  < 2e-16 ***
## TEAM_BASERUN_CS  -0.054363   0.015069  -3.608 0.000317 ***
## TEAM_FIELDING_E  -0.054537   0.003476 -15.691  < 2e-16 ***
## TEAM_FIELDING_DP -0.106807   0.012085  -8.838  < 2e-16 ***
## slugging          0.003939   0.001917   2.055 0.040047 *  
## OBP               0.037159   0.002788  13.329  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.55 on 2027 degrees of freedom
## Multiple R-squared:  0.3096, Adjusted R-squared:  0.3075 
## F-statistic: 151.5 on 6 and 2027 DF,  p-value: < 2.2e-16

Model 2 tests

Test for constant variance: could be a problem
- p = 7.264e-06
- Given this p-value, our model may have some issues with constant variance
Normality tests: passes
- Anderson-Darling: p = .09
- Shapiro-Wilk: p = .15
- Close to significant non-normality, but the residuals still appear normal
Independence of errors
- Durbin-Watson test: p < 2.2e-16
- Most likely need a polynomial fit
Outliers
- No outliers detected
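
The checks listed above can be sketched in base R. This is an illustration under stated assumptions, not the code behind the reported numbers: mtcars stands in for fit_2_db, the constant-variance check is a hand-computed studentized Breusch-Pagan statistic, and the Durbin-Watson statistic is computed directly from the residuals. The report does not say which functions it used; add-on packages such as lmtest, car, and nortest provide packaged versions of these tests.

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)   # stand-in model
e <- residuals(fit)
n <- length(e)

# Constant variance: studentized Breusch-Pagan statistic, computed as
# n * R^2 from regressing the squared residuals on the predictors.
e2 <- e^2
aux <- lm(e2 ~ hp + wt, data = mtcars)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)   # df = number of predictors

# Normality of residuals: Shapiro-Wilk test from the stats package.
sw_p <- shapiro.test(e)$p.value

# Serial correlation of residuals: Durbin-Watson statistic.
# Values near 2 indicate little first-order autocorrelation.
dw_stat <- sum(diff(e)^2) / sum(e^2)

print(c(bp_p = bp_p, shapiro_p = sw_p, dw = dw_stat))
```

Note that the Durbin-Watson statistic depends on row order, so it is most informative when the observations have a natural ordering such as season.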

Model 2 Takeaways

Slightly worse RSE and lower adjusted R^2 than Model 1
Much simpler model, with only 6 predictors and a slightly better F statistic than Model 1

Model 3 Automatic AIC backward selection

This is the only model that uses automatic backward selection
The procedure kept TEAM_BATTING_2B even though its p-value (0.143) is not significant
Adjusted R^2 = .3766
Predictors used: 10
F statistic = 123.8
RSE = 10.96
The takeaway from Model 3 is that its performance isn't much better than the simpler Model 1

## Start:  AIC=9751.67
## y ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + slugging + OBP
## 
## 
## Step:  AIC=9751.67
## y ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + slugging
## 
## 
## Step:  AIC=9751.67
## y ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          243111  9751.7
## - TEAM_BATTING_2B   1       258 243370  9751.8
## - TEAM_BASERUN_CS   1      3374 246485  9777.7
## - Singles           1      5897 249008  9798.4
## - TEAM_BATTING_SO   1      6229 249340  9801.1
## - TEAM_FIELDING_DP  1      9905 253016  9830.9
## - TEAM_BATTING_BB   1     11029 254141  9839.9
## - TEAM_BATTING_3B   1     15474 258585  9875.2
## - TEAM_BATTING_HR   1     23336 266447  9936.1
## - TEAM_BASERUN_SB   1     31319 274430  9996.1
## - TEAM_FIELDING_E   1     51195 294307 10138.4

## 
## Call:
## lm(formula = y ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles, data = fit_1_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.859  -7.587   0.197   7.363  37.316 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      51.478451   5.841835   8.812  < 2e-16 ***
## TEAM_BATTING_2B  -0.010004   0.006823  -1.466    0.143    
## TEAM_BATTING_3B   0.195439   0.017223  11.347  < 2e-16 ***
## TEAM_BATTING_HR   0.113537   0.008148  13.935  < 2e-16 ***
## TEAM_BATTING_BB   0.030298   0.003163   9.580  < 2e-16 ***
## TEAM_BATTING_SO  -0.016108   0.002237  -7.200 8.47e-13 ***
## TEAM_BASERUN_SB   0.090137   0.005583  16.144  < 2e-16 ***
## TEAM_BASERUN_CS  -0.079757   0.015053  -5.299 1.29e-07 ***
## TEAM_FIELDING_E  -0.086042   0.004169 -20.640  < 2e-16 ***
## TEAM_FIELDING_DP -0.107498   0.011841  -9.079  < 2e-16 ***
## Singles           0.029968   0.004278   7.005 3.35e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 2023 degrees of freedom
## Multiple R-squared:  0.3796, Adjusted R-squared:  0.3766 
## F-statistic: 123.8 on 10 and 2023 DF,  p-value: < 2.2e-16
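
Backward selection by AIC, as used for Model 3, can be sketched with base R's step() function. The example below is illustrative only and uses mtcars as a stand-in data set; the trace shown above has the same layout as the output of step() and MASS::stepAIC, but the exact call used in the report is not shown.

```r
# Start from the full model and let AIC decide which terms to drop.
full <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)

# Terms that survived the selection.
kept <- attr(terms(reduced), "term.labels")
print(kept)

# AIC of the starting and final models; selection only accepts steps
# that lower the criterion.
print(c(full = AIC(full), reduced = AIC(reduced)))

# Coefficient table of the selected model; a retained term can still
# have a p-value above 0.05.
print(summary(reduced)$coefficients)
```

A term can survive with a p-value above 0.05 because dropping one coefficient lowers AIC only when the fit gets worse by less than the penalty of 2, which corresponds to a p-value threshold of roughly 0.157 rather than 0.05. That is consistent with TEAM_BATTING_2B staying in the model at p = 0.143.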

Model 4 Box-Cox transformation of target variable

Use Box-Cox to transform our target variable
9 predictors used
Interpreting the model is difficult with the transformation
- lambda = 1.34
- RSE = 18.1
- Adjusted R^2 = .37
- F statistic = 135
Results of the advanced diagnostics are similar to other models
- Several outliers present
- Normality test: passes
- Failure of independence means the predictors likely need to be analyzed with polynomial relations

Model 4 Takeaway

Adjusted R^2 (.372) is lower than Model 1B (.383) and only marginally above Model 1A (.368), and the RSE (18.1) seems worse
I left this block of code in because I am not sure I calculated the RSE correctly

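# Profile the Box-Cox log-likelihood for lambda between 0.5 and 2.5
lambda <- MASS::boxcox(lm(y~.,fit_1_db),lambda=seq(0.5,2.5,by=0.1))

# Collect the lambda grid (x) and its log-likelihood (y) side by side
my_x <- lambda$x
my_y <- lambda$y
boxpower <- cbind(my_y,my_x)
#boxpower[order(-my_y),]
# Power chosen for the transformation
my_power <- 1.34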

fit_4 <- lm(y^(1.34/1)~.,fit_1_db)
fit_4 <- update(fit_4, . ~ . -slugging)
fit_4 <- update(fit_4, . ~ . -OBP)
fit_4 <- update(fit_4, . ~ . -TEAM_BATTING_2B)

layout(matrix(c(1,2,3,4),2,2)) 

plot(fit_4)

summary(fit_4)

## 
## Call:
## lm(formula = y^(1.34/1) ~ TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles, data = fit_1_db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -217.917  -44.612   -0.329   42.735  220.299 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      191.90759   34.49281   5.564 2.99e-08 ***
## TEAM_BATTING_3B    1.12718    0.10045  11.222  < 2e-16 ***
## TEAM_BATTING_HR    0.64807    0.04510  14.369  < 2e-16 ***
## TEAM_BATTING_BB    0.17726    0.01860   9.533  < 2e-16 ***
## TEAM_BATTING_SO   -0.09707    0.01321  -7.350 2.86e-13 ***
## TEAM_BASERUN_SB    0.52484    0.03297  15.918  < 2e-16 ***
## TEAM_BASERUN_CS   -0.47928    0.08882  -5.396 7.61e-08 ***
## TEAM_FIELDING_E   -0.48928    0.02415 -20.258  < 2e-16 ***
## TEAM_FIELDING_DP  -0.64199    0.06990  -9.184  < 2e-16 ***
## Singles            0.16534    0.02437   6.784 1.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.74 on 2024 degrees of freedom
## Multiple R-squared:  0.3752, Adjusted R-squared:  0.3724 
## F-statistic:   135 on 9 and 2024 DF,  p-value: < 2.2e-16

anova(fit_4)

## Analysis of Variance Table
## 
## Response: y^(1.34/1)
##                    Df  Sum Sq Mean Sq  F value    Pr(>F)    
## TEAM_BATTING_3B     1   92055   92055  21.9623 2.966e-06 ***
## TEAM_BATTING_HR     1 1720434 1720434 410.4554 < 2.2e-16 ***
## TEAM_BATTING_BB     1  505824  505824 120.6777 < 2.2e-16 ***
## TEAM_BATTING_SO     1   71413   71413  17.0375 3.814e-05 ***
## TEAM_BASERUN_SB     1  289337  289337  69.0292 < 2.2e-16 ***
## TEAM_BASERUN_CS     1    8182    8182   1.9521    0.1625    
## TEAM_FIELDING_E     1 1895413 1895413 452.2014 < 2.2e-16 ***
## TEAM_FIELDING_DP    1  318290  318290  75.9366 < 2.2e-16 ***
## Singles             1  192910  192910  46.0238 1.529e-11 ***
## Residuals        2024 8483646    4192                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

resids <- (abs(fit_4$residuals)**(5/7))
my_rse <- sqrt(sum(resids^2)/2024)
my_rse

## [1] 18.12186
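
On the question of whether the RSE was calculated correctly: one common way to compare a transformed-response model with untransformed ones is to back-transform the fitted values and then measure the residuals on the original scale, instead of raising the transformed residuals to a power. The sketch below assumes the same power, lambda = 1.34, and uses mtcars as a stand-in for fit_1_db. It is one reasonable option rather than the definitive calculation, and it ignores retransformation bias.

```r
lambda <- 1.34                           # power used in the report
dat <- mtcars[, c("mpg", "hp", "wt")]    # stand-in data

# Fit on the transformed response.
fit_pow <- lm(I(mpg^lambda) ~ hp + wt, data = dat)

# Back-transform the fitted values to the original scale of the response.
fitted_orig <- fitted(fit_pow)^(1 / lambda)

# Residuals and RSE on the original scale, using the model's residual df.
resid_orig <- dat$mpg - fitted_orig
rse_orig <- sqrt(sum(resid_orig^2) / df.residual(fit_pow))

# Untransformed fit for comparison.
fit_lin <- lm(mpg ~ hp + wt, data = dat)
print(c(rse_power_model = rse_orig, rse_linear_model = sigma(fit_lin)))
```

Computed this way, the RSE is in the same units as the response and can be set directly against the RSE of the other models.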

Model 5 Generalized additive model

Use the gam function to build a generalized additive model
Adjusted R^2 is .411
- Best adjusted R^2 yet
- 10 predictor variables
- RSE is 10.565

## This is mgcv 1.8-22. For overview type 'help("mgcv-package")'.

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(TEAM_BASERUN_CS) + s(TEAM_BASERUN_SB) + s(TEAM_BATTING_2B) + 
##     s(TEAM_BATTING_3B) + s(TEAM_BATTING_BB) + s(TEAM_BATTING_HR) + 
##     s(TEAM_FIELDING_DP) + s(TEAM_FIELDING_E) + s(slugging) + 
##     s(OBP)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.4784     0.2363   340.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df      F  p-value    
## s(TEAM_BASERUN_CS)  3.5470 4.4214  3.472 0.006372 ** 
## s(TEAM_BASERUN_SB)  3.7776 4.7323 36.190  < 2e-16 ***
## s(TEAM_BATTING_2B)  5.5072 6.6699 13.945  < 2e-16 ***
## s(TEAM_BATTING_3B)  5.7467 6.8703  5.092 1.73e-05 ***
## s(TEAM_BATTING_BB)  3.0675 3.9436  3.334 0.010085 *  
## s(TEAM_BATTING_HR)  0.7544 0.7544 11.195 0.003699 ** 
## s(TEAM_FIELDING_DP) 2.5068 3.2147 27.101  < 2e-16 ***
## s(TEAM_FIELDING_E)  6.3478 7.4980 58.282  < 2e-16 ***
## s(slugging)         6.2073 7.3527  3.737 0.000418 ***
## s(OBP)              5.6877 6.8884  2.911 0.005187 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 90/91
## R-sq.(adj) =  0.411   Deviance explained = 42.3%
## GCV = 116.07  Scale est. = 113.55    n = 2034

RSE: 10.5656
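
A generalized additive model of this kind can be sketched with mgcv, the package loaded above. The example is illustrative: mtcars stands in for the baseball data, and taking the square root of the scale estimate as the RSE is an assumption, since the report does not show how its RSE figure was computed.

```r
library(mgcv)

# One smooth term per predictor; mgcv chooses the amount of smoothing.
fit_gam <- gam(mpg ~ s(hp) + s(wt), data = mtcars)

sm <- summary(fit_gam)

# Adjusted R^2 as reported in the "R-sq.(adj)" line of the summary.
adj_r2 <- sm$r.sq

# Residual standard error taken as the square root of the scale estimate.
rse <- sqrt(fit_gam$sig2)

print(c(adj_r2 = adj_r2, rse = rse))
```

The edf column of the summary shows how far each smooth departs from a straight line; values close to 1 mean the term is effectively linear.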

Model 6

Drop all observations containing NA values instead of imputing with the median
9 predictors
Adjusted R^2 = .43
F statistic = 123.1

Model 6 Diagnostic tests

Test for constant variance: could be a problem
- p = 0.008997
- Given this p-value, our model may have some issues with constant variance
- It still looks much better than the original fit 1
Normality tests: passes
- Anderson-Darling: p = 0.701
- Shapiro-Wilk: p = 0.849
Independence of errors
- Durbin-Watson test: 2.2 e^10
- Most likely need a polynomial fit
Outliers
- No outliers detected above the 3.297 threshold

Model 6 summary

This model performs the best of all the linear models run so far
It passes the normality tests confidently, whereas Models 1-4 did not pass confidently
Adjusted R^2 is the highest value yet
No significant outliers are present in the analysis

## 
## Call:
## lm(formula = y_na ~ ., data = fit_6_db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.9356  -6.5508  -0.0419   6.6782  30.0121 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.320410   6.649090   9.222  < 2e-16 ***
## TEAM_BATTING_2B  -0.036499   0.006967  -5.239 1.85e-07 ***
## TEAM_BATTING_3B   0.205953   0.021152   9.737  < 2e-16 ***
## TEAM_BATTING_HR   0.129672   0.008349  15.531  < 2e-16 ***
## TEAM_BATTING_BB   0.037615   0.003402  11.055  < 2e-16 ***
## TEAM_BATTING_SO  -0.021573   0.002431  -8.873  < 2e-16 ***
## TEAM_BASERUN_SB   0.050535   0.006336   7.976 3.04e-15 ***
## TEAM_FIELDING_E  -0.152176   0.009680 -15.721  < 2e-16 ***
## TEAM_FIELDING_DP -0.116077   0.013213  -8.785  < 2e-16 ***
## Singles           0.034687   0.004714   7.358 3.09e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.549 on 1458 degrees of freedom
## Multiple R-squared:  0.4317, Adjusted R-squared:  0.4282 
## F-statistic: 123.1 on 9 and 1458 DF,  p-value: < 2.2e-16
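
The difference between dropping incomplete rows (Model 6) and median imputation (the earlier models) can be sketched as follows. The built-in airquality data is a stand-in chosen only because it contains NA values; the column names are not from the report.

```r
# Temp is the response (no NAs); Ozone and Solar.R contain NAs.
dat <- airquality[, c("Temp", "Ozone", "Solar.R", "Wind")]

# Option A: drop every row that has at least one NA.
dropped <- na.omit(dat)

# Option B: replace each NA in a predictor with that column's median.
imputed <- dat
for (col in c("Ozone", "Solar.R", "Wind")) {
  med <- median(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- med
}

fit_drop <- lm(Temp ~ ., data = dropped)
fit_imp  <- lm(Temp ~ ., data = imputed)

print(c(n_dropped = nrow(dropped), n_imputed = nrow(imputed)))
print(c(adj_r2_dropped = summary(fit_drop)$adj.r.squared,
        adj_r2_imputed = summary(fit_imp)$adj.r.squared))
```

The two fits are estimated on different sets of rows, so their R^2 and RSE describe different samples; a held-out test set gives a more direct comparison.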

Comparing several models together

The GAM (Model 5) and the Box-Cox transformed model (Model 4) could not be included in the table below

The clear winner seems to be model 6

Each cell shows the estimate, the 95% confidence interval in parentheses, and the p-value; n/a marks a term that is not in that model.

	Model 1A	Model 1B	Model 2 (moneyball stats)	Model 3 (stepAIC)	Model 6 (dropped NA observations)
(Intercept)	46.43 (35.06 – 57.81), p <.001	51.52 (40.20 – 62.84), p <.001	19.28 (12.19 – 26.38), p <.001	51.48 (40.02 – 62.94), p <.001	61.32 (48.28 – 74.36), p <.001
TEAM_BATTING_3B	0.19 (0.16 – 0.22), p <.001	0.19 (0.15 – 0.22), p <.001	n/a	0.20 (0.16 – 0.23), p <.001	0.21 (0.16 – 0.25), p <.001
TEAM_BATTING_HR	0.12 (0.10 – 0.13), p <.001	0.11 (0.10 – 0.13), p <.001	n/a	0.11 (0.10 – 0.13), p <.001	0.13 (0.11 – 0.15), p <.001
TEAM_BATTING_BB	0.03 (0.02 – 0.04), p <.001	0.03 (0.02 – 0.04), p <.001	n/a	0.03 (0.02 – 0.04), p <.001	0.04 (0.03 – 0.04), p <.001
TEAM_BATTING_SO	-0.02 (-0.02 – -0.01), p <.001	-0.02 (-0.02 – -0.01), p <.001	n/a	-0.02 (-0.02 – -0.01), p <.001	-0.02 (-0.03 – -0.02), p <.001
TEAM_BASERUN_SB	0.08 (0.07 – 0.09), p <.001	0.09 (0.08 – 0.10), p <.001	0.07 (0.06 – 0.08), p <.001	0.09 (0.08 – 0.10), p <.001	0.05 (0.04 – 0.06), p <.001
TEAM_FIELDING_E	-0.08 (-0.09 – -0.07), p <.001	-0.08 (-0.09 – -0.08), p <.001	-0.05 (-0.06 – -0.05), p <.001	-0.09 (-0.09 – -0.08), p <.001	-0.15 (-0.17 – -0.13), p <.001
TEAM_FIELDING_DP	-0.11 (-0.14 – -0.09), p <.001	-0.11 (-0.13 – -0.09), p <.001	-0.11 (-0.13 – -0.08), p <.001	-0.11 (-0.13 – -0.08), p <.001	-0.12 (-0.14 – -0.09), p <.001
Singles	0.03 (0.02 – 0.04), p <.001	0.03 (0.02 – 0.04), p <.001	n/a	0.03 (0.02 – 0.04), p <.001	0.03 (0.03 – 0.04), p <.001
TEAM_BASERUN_CS	n/a	-0.08 (-0.11 – -0.05), p <.001	-0.05 (-0.08 – -0.02), p <.001	-0.08 (-0.11 – -0.05), p <.001	n/a
slugging	n/a	n/a	0.00 (0.00 – 0.01), p .040	n/a	n/a
OBP	n/a	n/a	0.04 (0.03 – 0.04), p <.001	n/a	n/a
TEAM_BATTING_2B	n/a	n/a	n/a	-0.01 (-0.02 – 0.00), p .143	-0.04 (-0.05 – -0.02), p <.001
Observations	2034	2029	2034	2034	1468
R² / adj. R²	.370 / .368	.385 / .383	.310 / .308	.380 / .377	.432 / .428

Create train/test split and view RMSE on test predictions

Model 3 is excluded from the analysis here
Model 6 performs the best (RMSE 9.30), followed by Model 1A (10.57); Models 4 and 5 come next and Model 2 is last

## Warning in predict.lm(fit_1, test_df): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(fit_4, test_df): prediction from a rank-deficient fit
## may be misleading

##  [1] "TEAM_BATTING_2B"  "TEAM_BATTING_3B"  "TEAM_BATTING_HR" 
##  [4] "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
##  [7] "TEAM_BASERUN_CS"  "TEAM_FIELDING_E"  "TEAM_FIELDING_DP"
## [10] "Singles"          "slugging"         "OBP"

## Warning in predict.lm(fit_6, test_df): prediction from a rank-deficient fit
## may be misleading

Model	Test RMSE
Model 1A	10.5681339038901
Model 2	11.2939527952715
Model 4	10.7346911574377
Model 5	10.8498621720078
Model 6	9.30217646362863
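
The comparison above follows the usual hold-out pattern, sketched below. The data set (mtcars), the 30% test share, the seed, and the two formulas are all assumptions made for illustration; the report does not state its split ratio or show the code.

```r
set.seed(1)   # arbitrary seed so the split is reproducible
dat <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# Hold out roughly 30% of the rows for testing.
test_idx <- sample(nrow(dat), size = round(0.3 * nrow(dat)))
train_df <- dat[-test_idx, ]
test_df  <- dat[test_idx, ]

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit candidate models on the training rows only.
fit_small <- lm(mpg ~ wt, data = train_df)
fit_big   <- lm(mpg ~ hp + wt + qsec, data = train_df)

# Score each model on the rows it has not seen.
test_rmse <- c(
  small = rmse(test_df$mpg, predict(fit_small, newdata = test_df)),
  big   = rmse(test_df$mpg, predict(fit_big, newdata = test_df))
)
print(test_rmse)
```

For a model with a transformed response, such as Model 4, the predictions would need to be back-transformed before computing RMSE so that all models are scored in wins.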

Model 7- Attempt Quadratic Fitting

I attempted to add some quadratic predictors to our Model 6 dataset (the NA-dropped dataset)
After running stepAIC and fitting up to the 5th degree, I did not see a large increase in adjusted R^2 overall, and almost no increase beyond the 3rd degree
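
The idea of adding powers of the predictors and watching adjusted R^2 can be sketched as follows. This is an illustration on stand-in data (mtcars), not the code that produced the output below; the variable names and the choice of two predictors are assumptions.

```r
# Stand-in data; hp is rescaled so its powers stay on a moderate scale.
dat <- data.frame(mpg = mtcars$mpg, hp = mtcars$hp / 100, wt = mtcars$wt)

# Nested fits of increasing polynomial degree, using I() for each power.
fits <- list(
  degree1 = lm(mpg ~ hp + wt, data = dat),
  degree2 = lm(mpg ~ hp + wt + I(hp^2) + I(wt^2), data = dat),
  degree3 = lm(mpg ~ hp + wt + I(hp^2) + I(wt^2) + I(hp^3) + I(wt^3), data = dat)
)

# R^2 can only rise as terms are added; adjusted R^2 shows whether the
# extra terms earn their keep.
r2     <- sapply(fits, function(f) summary(f)$r.squared)
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(round(rbind(r2, adj_r2), 3))

# Backward AIC selection over the cubic model decides which powers stay.
selected <- step(fits$degree3, direction = "backward", trace = 0)
print(attr(terms(selected), "term.labels"))
```

Because each I() term is treated as a separate predictor, selection can drop a linear term while keeping its square or cube.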

## [1] "TEAM_BATTING_2B"  "TEAM_BATTING_3B"  "TEAM_BATTING_HR" 
## [4] "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
## [7] "TEAM_FIELDING_E"  "TEAM_FIELDING_DP" "Singles"

## Start:  AIC=6618.11
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_FIELDING_E^2) + 
##     I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + 
##     I(TEAM_BASERUN_SB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_FIELDING_DP^3) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + 
##     I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_FIELDING_DP^3)  1       0.1 129130 6616.1
## - I(TEAM_FIELDING_DP^2)  1       5.5 129135 6616.2
## - I(TEAM_FIELDING_E^2)   1      12.4 129142 6616.3
## - I(TEAM_BASERUN_SB^2)   1      18.8 129148 6616.3
## - TEAM_FIELDING_DP       1      30.8 129160 6616.5
## - TEAM_BATTING_3B        1      42.6 129172 6616.6
## - I(TEAM_BATTING_BB^3)   1      75.7 129205 6617.0
## - TEAM_BATTING_BB        1      77.9 129207 6617.0
## - I(TEAM_BATTING_3B^3)   1      78.5 129208 6617.0
## - I(TEAM_BATTING_BB^2)   1      93.6 129223 6617.2
## - I(TEAM_BATTING_SO^2)   1      98.3 129228 6617.2
## - TEAM_BATTING_SO        1     130.2 129260 6617.6
## - I(TEAM_BATTING_HR^2)   1     133.4 129263 6617.6
## - I(TEAM_BATTING_SO^3)   1     152.6 129282 6617.8
## <none>                               129129 6618.1
## - I(TEAM_BATTING_3B^2)   1     178.5 129308 6618.1
## - TEAM_BASERUN_SB        1     270.9 129400 6619.2
## - TEAM_BATTING_2B        1     527.9 129657 6622.1
## - I(TEAM_BATTING_2B^2)   1     654.3 129784 6623.5
## - I(TEAM_BATTING_2B^3)   1     746.9 129876 6624.6
## - TEAM_FIELDING_E        1    1770.2 130900 6636.1
## - TEAM_BATTING_HR        1    2546.7 131676 6644.8
## - Singles                1    4427.8 133557 6665.6
## 
## Step:  AIC=6616.11
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_FIELDING_E^2) + 
##     I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + 
##     I(TEAM_BASERUN_SB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + I(TEAM_BATTING_SO^3) + 
##     I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_FIELDING_E^2)   1      12.6 129142 6614.3
## - I(TEAM_BASERUN_SB^2)   1      18.8 129148 6614.3
## - TEAM_BATTING_3B        1      42.9 129172 6614.6
## - I(TEAM_BATTING_BB^3)   1      76.0 129206 6615.0
## - TEAM_BATTING_BB        1      78.0 129208 6615.0
## - I(TEAM_BATTING_3B^3)   1      78.7 129208 6615.0
## - I(TEAM_BATTING_BB^2)   1      93.9 129223 6615.2
## - I(TEAM_BATTING_SO^2)   1      98.4 129228 6615.2
## - TEAM_BATTING_SO        1     130.6 129260 6615.6
## - I(TEAM_BATTING_HR^2)   1     134.0 129264 6615.6
## - I(TEAM_BATTING_SO^3)   1     152.7 129282 6615.8
## <none>                               129130 6616.1
## - I(TEAM_BATTING_3B^2)   1     179.1 129309 6616.1
## - TEAM_BASERUN_SB        1     271.1 129401 6617.2
## - TEAM_BATTING_2B        1     527.8 129657 6620.1
## - I(TEAM_BATTING_2B^2)   1     654.3 129784 6621.5
## - I(TEAM_BATTING_2B^3)   1     746.9 129876 6622.6
## - I(TEAM_FIELDING_DP^2)  1     747.1 129877 6622.6
## - TEAM_FIELDING_DP       1    1225.2 130355 6628.0
## - TEAM_FIELDING_E        1    1774.6 130904 6634.1
## - TEAM_BATTING_HR        1    2570.7 131700 6643.0
## - Singles                1    4430.0 133560 6663.6
## 
## Step:  AIC=6614.25
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BASERUN_SB^2) + 
##     I(TEAM_BATTING_SO^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + 
##     I(TEAM_BATTING_3B^3) + I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BASERUN_SB^2)   1      18.9 129161 6612.5
## - TEAM_BATTING_3B        1      48.0 129190 6612.8
## - I(TEAM_BATTING_BB^3)   1      80.2 129222 6613.2
## - TEAM_BATTING_BB        1      82.7 129225 6613.2
## - I(TEAM_BATTING_3B^3)   1      84.3 129226 6613.2
## - I(TEAM_BATTING_SO^2)   1      95.8 129238 6613.3
## - I(TEAM_BATTING_BB^2)   1      98.8 129241 6613.4
## - I(TEAM_BATTING_HR^2)   1     122.1 129264 6613.6
## - TEAM_BATTING_SO        1     126.2 129268 6613.7
## - I(TEAM_BATTING_SO^3)   1     150.7 129293 6614.0
## <none>                               129142 6614.3
## - I(TEAM_BATTING_3B^2)   1     189.4 129331 6614.4
## - TEAM_BASERUN_SB        1     272.3 129414 6615.3
## - TEAM_BATTING_2B        1     517.7 129660 6618.1
## - I(TEAM_BATTING_2B^2)   1     643.3 129785 6619.5
## - I(TEAM_BATTING_2B^3)   1     735.6 129878 6620.6
## - I(TEAM_FIELDING_DP^2)  1     757.1 129899 6620.8
## - TEAM_FIELDING_DP       1    1240.9 130383 6626.3
## - TEAM_BATTING_HR        1    2623.0 131765 6641.8
## - Singles                1    4435.9 133578 6661.8
## - TEAM_FIELDING_E        1   20166.9 149309 6825.3
## 
## Step:  AIC=6612.47
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_SO^2) + 
##     I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + 
##     I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - TEAM_BATTING_3B        1      45.3 129206 6611.0
## - I(TEAM_BATTING_BB^3)   1      78.4 129239 6611.4
## - TEAM_BATTING_BB        1      80.8 129242 6611.4
## - I(TEAM_BATTING_3B^3)   1      83.4 129244 6611.4
## - I(TEAM_BATTING_SO^2)   1      88.7 129250 6611.5
## - I(TEAM_BATTING_BB^2)   1      96.8 129258 6611.6
## - TEAM_BATTING_SO        1     118.2 129279 6611.8
## - I(TEAM_BATTING_HR^2)   1     120.2 129281 6611.8
## - I(TEAM_BATTING_SO^3)   1     142.4 129303 6612.1
## <none>                               129161 6612.5
## - I(TEAM_BATTING_3B^2)   1     186.6 129348 6612.6
## - TEAM_BATTING_2B        1     521.5 129683 6616.4
## - I(TEAM_BATTING_2B^2)   1     650.1 129811 6617.8
## - I(TEAM_BATTING_2B^3)   1     745.2 129906 6618.9
## - I(TEAM_FIELDING_DP^2)  1     764.3 129925 6619.1
## - TEAM_FIELDING_DP       1    1248.6 130410 6624.6
## - TEAM_BATTING_HR        1    2616.3 131777 6639.9
## - Singles                1    4420.7 133582 6659.9
## - TEAM_BASERUN_SB        1    5191.0 134352 6668.3
## - TEAM_FIELDING_E        1   20163.8 149325 6823.4
## 
## Step:  AIC=6610.98
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_SO^2) + 
##     I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + 
##     I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BATTING_3B^3)   1      68.7 129275 6609.8
## - I(TEAM_BATTING_BB^3)   1      80.3 129287 6609.9
## - TEAM_BATTING_BB        1      82.0 129288 6609.9
## - I(TEAM_BATTING_SO^2)   1      86.2 129293 6610.0
## - I(TEAM_BATTING_BB^2)   1      98.6 129305 6610.1
## - I(TEAM_BATTING_HR^2)   1     107.0 129313 6610.2
## - TEAM_BATTING_SO        1     117.3 129324 6610.3
## - I(TEAM_BATTING_SO^3)   1     137.6 129344 6610.5
## <none>                               129206 6611.0
## - TEAM_BATTING_2B        1     495.9 129702 6614.6
## - I(TEAM_BATTING_2B^2)   1     622.0 129828 6616.0
## - I(TEAM_BATTING_2B^3)   1     715.9 129922 6617.1
## - I(TEAM_FIELDING_DP^2)  1     772.1 129978 6617.7
## - TEAM_FIELDING_DP       1    1260.1 130466 6623.2
## - I(TEAM_BATTING_3B^2)   1    1274.2 130481 6623.4
## - TEAM_BATTING_HR        1    2576.1 131783 6638.0
## - Singles                1    4429.0 133635 6658.5
## - TEAM_BASERUN_SB        1    5169.8 134376 6666.6
## - TEAM_FIELDING_E        1   20266.4 149473 6822.9
## 
## Step:  AIC=6609.77
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_SO^2) + 
##     I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3) + 
##     I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BATTING_SO^2)   1      60.4 129335 6608.5
## - I(TEAM_BATTING_BB^3)   1      71.9 129347 6608.6
## - TEAM_BATTING_BB        1      73.1 129348 6608.6
## - TEAM_BATTING_SO        1      86.0 129361 6608.7
## - I(TEAM_BATTING_BB^2)   1      89.2 129364 6608.8
## - I(TEAM_BATTING_SO^3)   1     107.8 129383 6609.0
## - I(TEAM_BATTING_HR^2)   1     109.1 129384 6609.0
## <none>                               129275 6609.8
## - TEAM_BATTING_2B        1     487.0 129762 6613.3
## - I(TEAM_BATTING_2B^2)   1     609.8 129885 6614.7
## - I(TEAM_BATTING_2B^3)   1     701.4 129976 6615.7
## - I(TEAM_FIELDING_DP^2)  1     803.7 130079 6616.9
## - TEAM_FIELDING_DP       1    1298.0 130573 6622.4
## - TEAM_BATTING_HR        1    2551.5 131827 6636.5
## - Singles                1    4466.1 133741 6657.6
## - TEAM_BASERUN_SB        1    5109.7 134385 6664.7
## - I(TEAM_BATTING_3B^2)   1    9093.5 138369 6707.6
## - TEAM_FIELDING_E        1   20311.1 149586 6822.0
## 
## Step:  AIC=6608.45
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3) + I(TEAM_BATTING_BB^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BATTING_BB^3)   1      72.9 129408 6607.3
## - TEAM_BATTING_BB        1      74.5 129410 6607.3
## - I(TEAM_BATTING_BB^2)   1      90.4 129426 6607.5
## - I(TEAM_BATTING_HR^2)   1      92.8 129428 6607.5
## - TEAM_BATTING_SO        1      98.3 129434 6607.6
## <none>                               129335 6608.5
## - I(TEAM_BATTING_SO^3)   1     436.6 129772 6611.4
## - TEAM_BATTING_2B        1     476.9 129812 6611.9
## - I(TEAM_BATTING_2B^2)   1     600.3 129936 6613.2
## - I(TEAM_BATTING_2B^3)   1     693.6 130029 6614.3
## - I(TEAM_FIELDING_DP^2)  1     780.4 130116 6615.3
## - TEAM_FIELDING_DP       1    1272.2 130608 6620.8
## - TEAM_BATTING_HR        1    2494.5 131830 6634.5
## - Singles                1    4640.8 133976 6658.2
## - TEAM_BASERUN_SB        1    5153.1 134489 6663.8
## - I(TEAM_BATTING_3B^2)   1    9556.1 138892 6711.1
## - TEAM_FIELDING_E        1   20728.2 150064 6824.7
## 
## Step:  AIC=6607.28
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     Singles + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - TEAM_BATTING_BB        1       1.6 129410 6605.3
## - I(TEAM_BATTING_HR^2)   1      78.1 129486 6606.2
## - TEAM_BATTING_SO        1      91.3 129500 6606.3
## - I(TEAM_BATTING_BB^2)   1     158.3 129567 6607.1
## <none>                               129408 6607.3
## - I(TEAM_BATTING_SO^3)   1     445.8 129854 6610.3
## - TEAM_BATTING_2B        1     482.2 129891 6610.7
## - I(TEAM_BATTING_2B^2)   1     607.5 130016 6612.2
## - I(TEAM_BATTING_2B^3)   1     702.5 130111 6613.2
## - I(TEAM_FIELDING_DP^2)  1     752.7 130161 6613.8
## - TEAM_FIELDING_DP       1    1239.5 130648 6619.3
## - TEAM_BATTING_HR        1    2441.5 131850 6632.7
## - Singles                1    4717.7 134126 6657.8
## - TEAM_BASERUN_SB        1    5153.9 134562 6662.6
## - I(TEAM_BATTING_3B^2)   1    9645.0 139053 6710.8
## - TEAM_FIELDING_E        1   20658.3 150067 6822.7
## 
## Step:  AIC=6605.3
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles + 
##     I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + 
##     I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + 
##     I(TEAM_BATTING_SO^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - I(TEAM_BATTING_HR^2)   1      77.5 129488 6604.2
## - TEAM_BATTING_SO        1      91.7 129502 6604.3
## <none>                               129410 6605.3
## - I(TEAM_BATTING_SO^3)   1     444.8 129855 6608.3
## - TEAM_BATTING_2B        1     480.6 129891 6608.7
## - I(TEAM_BATTING_2B^2)   1     605.9 130016 6610.2
## - I(TEAM_BATTING_2B^3)   1     701.0 130111 6611.2
## - I(TEAM_FIELDING_DP^2)  1     763.2 130173 6611.9
## - TEAM_FIELDING_DP       1    1254.1 130664 6617.5
## - TEAM_BATTING_HR        1    2439.8 131850 6630.7
## - Singles                1    4735.5 134145 6656.1
## - TEAM_BASERUN_SB        1    5175.7 134586 6660.9
## - I(TEAM_BATTING_3B^2)   1    9702.9 139113 6709.4
## - I(TEAM_BATTING_BB^2)   1   11033.1 140443 6723.4
## - TEAM_FIELDING_E        1   20667.1 150077 6820.8
## 
## Step:  AIC=6604.18
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles + 
##     I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + 
##     I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3)
## 
##                         Df Sum of Sq    RSS    AIC
## - TEAM_BATTING_SO        1      48.8 129536 6602.7
## <none>                               129488 6604.2
## - TEAM_BATTING_2B        1     499.7 129987 6607.8
## - I(TEAM_BATTING_2B^2)   1     621.6 130109 6609.2
## - I(TEAM_BATTING_SO^3)   1     647.1 130135 6609.5
## - I(TEAM_BATTING_2B^3)   1     711.6 130199 6610.2
## - I(TEAM_FIELDING_DP^2)  1     740.3 130228 6610.5
## - TEAM_FIELDING_DP       1    1223.1 130711 6616.0
## - Singles                1    4844.5 134332 6656.1
## - TEAM_BASERUN_SB        1    5115.0 134603 6659.0
## - I(TEAM_BATTING_3B^2)   1    9697.3 139185 6708.2
## - I(TEAM_BATTING_BB^2)   1   11102.5 140590 6722.9
## - TEAM_BATTING_HR        1   19521.0 149009 6808.3
## - TEAM_FIELDING_E        1   21631.4 151119 6829.0
## 
## Step:  AIC=6602.73
## y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + 
##     I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3)
## 
##                         Df Sum of Sq    RSS    AIC
## <none>                               129536 6602.7
## - TEAM_BATTING_2B        1     494.8 130031 6606.3
## - I(TEAM_BATTING_2B^2)   1     613.5 130150 6607.7
## - I(TEAM_BATTING_2B^3)   1     701.9 130238 6608.7
## - I(TEAM_FIELDING_DP^2)  1     740.6 130277 6609.1
## - TEAM_FIELDING_DP       1    1221.8 130758 6614.5
## - TEAM_BASERUN_SB        1    5230.3 134767 6658.8
## - Singles                1    5385.6 134922 6660.5
## - I(TEAM_BATTING_SO^3)   1    7680.9 137217 6685.3
## - I(TEAM_BATTING_BB^2)   1   11652.2 141189 6727.2
## - I(TEAM_BATTING_3B^2)   1   12051.5 141588 6731.3
## - TEAM_BATTING_HR        1   20707.0 150243 6818.4
## - TEAM_FIELDING_E        1   22289.2 151826 6833.8
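
The AIC column in the step trace can be reproduced from the reported RSS. For `lm` fits, `step()` relies on `extractAIC()`, which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below is only a check using the rounded RSS values printed in the trace. The helper name `step_aic` is hypothetical, and n = 1468 is an assumption inferred from the 1455 residual degrees of freedom plus 13 coefficients in the model summary.

```r
# AIC as computed for lm fits during stepwise selection:
# n * log(RSS / n) + 2 * edf
step_aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

# Assumed sample size: residual df + 13 coefficients (12 terms + intercept)
n <- 1455 + 13

# Final model (12 terms) and the step before it (13 terms, TEAM_BATTING_SO kept)
aic_final <- step_aic(rss = 129536, n = n, edf = 13)
aic_prev  <- step_aic(rss = 129488, n = n, edf = 14)

round(c(previous = aic_prev, final = aic_final), 1)
```

Both values should agree with the trace (6604.18 and 6602.73) up to the rounding of the displayed RSS. Dropping TEAM_BATTING_SO raises the RSS slightly, but the saved penalty of 2 outweighs it, which is why the step is accepted.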

## 
## Call:
## lm(formula = y_na ~ TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + Singles + I(TEAM_FIELDING_DP^2) + 
##     I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_SO^3), data = fit_6_db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.5933  -6.6402   0.0684   6.3734  27.2096 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.226e+01  3.658e+01   0.335  0.73768    
## TEAM_BATTING_2B        9.898e-01  4.198e-01   2.358  0.01853 *  
## TEAM_BATTING_HR        1.221e-01  8.003e-03  15.251  < 2e-16 ***
## TEAM_BASERUN_SB        4.770e-02  6.224e-03   7.665 3.26e-14 ***
## TEAM_FIELDING_E       -1.510e-01  9.542e-03 -15.823  < 2e-16 ***
## TEAM_FIELDING_DP      -4.888e-01  1.319e-01  -3.705  0.00022 ***
## Singles                3.541e-02  4.552e-03   7.778 1.39e-14 ***
## I(TEAM_FIELDING_DP^2)  1.217e-03  4.221e-04   2.884  0.00398 ** 
## I(TEAM_BATTING_2B^2)  -4.315e-03  1.644e-03  -2.625  0.00875 ** 
## I(TEAM_BATTING_3B^2)   2.067e-03  1.776e-04  11.635  < 2e-16 ***
## I(TEAM_BATTING_BB^2)   3.424e-05  2.993e-06  11.440  < 2e-16 ***
## I(TEAM_BATTING_2B^3)   5.928e-06  2.111e-06   2.808  0.00506 ** 
## I(TEAM_BATTING_SO^3)  -8.823e-09  9.499e-10  -9.288  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.435 on 1455 degrees of freedom
## Multiple R-squared:  0.4463, Adjusted R-squared:  0.4417 
## F-statistic: 97.74 on 12 and 1455 DF,  p-value: < 2.2e-16

Model of choice

I have chosen model 6 to generate the predictions.
- It outperformed all of the other linear models on every test statistic.
- It outperformed the GAM model on test/train split RMSE, and it appears to be the simplest model.
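
The test/train RMSE comparison mentioned above can be sketched as follows. This is a minimal illustration on simulated data, not the MLB data set. The data frame `toy`, its columns, the 80/20 split, and the helper `rmse` are all assumptions made for the example.

```r
# Hypothetical helper: root mean squared error, ignoring missing predictions
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

set.seed(1)

# Simulated stand-in for the training data (not the real team statistics)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$wins <- 80 + 5 * toy$x1 - 3 * toy$x2 + rnorm(n, sd = 9)

# Assumed 80/20 train/test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only, then score on the held-out rows
fit <- lm(wins ~ x1 + x2, data = train)
test_rmse <- rmse(test$wins, predict(fit, newdata = test))
test_rmse
```

Comparing models on held-out RMSE rather than on in-sample R-squared penalizes models that only fit the training rows well, which is the reason to prefer it when choosing between a linear model and a more flexible GAM.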

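# Target variable: team wins
y_na <- money_ball_train_2_withna$TARGET_WINS
# Predictors for model 6, selected by column position
fit_6_db <- money_ball_train_2_withna[, c(4, 5, 6, 7, 8, 9, 16, 17, 18)]
fit_6 <- lm(y_na ~ ., data = fit_6_db)
# Show the four diagnostic plots in a 2x2 grid
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(fit_6)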

summary(fit_6)

## 
## Call:
## lm(formula = y_na ~ ., data = fit_6_db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.9356  -6.5508  -0.0419   6.6782  30.0121 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.320410   6.649090   9.222  < 2e-16 ***
## TEAM_BATTING_2B  -0.036499   0.006967  -5.239 1.85e-07 ***
## TEAM_BATTING_3B   0.205953   0.021152   9.737  < 2e-16 ***
## TEAM_BATTING_HR   0.129672   0.008349  15.531  < 2e-16 ***
## TEAM_BATTING_BB   0.037615   0.003402  11.055  < 2e-16 ***
## TEAM_BATTING_SO  -0.021573   0.002431  -8.873  < 2e-16 ***
## TEAM_BASERUN_SB   0.050535   0.006336   7.976 3.04e-15 ***
## TEAM_FIELDING_E  -0.152176   0.009680 -15.721  < 2e-16 ***
## TEAM_FIELDING_DP -0.116077   0.013213  -8.785  < 2e-16 ***
## Singles           0.034687   0.004714   7.358 3.09e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.549 on 1458 degrees of freedom
## Multiple R-squared:  0.4317, Adjusted R-squared:  0.4282 
## F-statistic: 123.1 on 9 and 1458 DF,  p-value: < 2.2e-16

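# Use the same predictor columns for the training and evaluation data
my_names <- colnames(fit_6_db)
lm_target <- y_na
lm_inputs <- fit_6_db[, my_names]
train_df <- fit_6_db[, my_names]
test_df <- money_eval_na[, my_names]

# Refit model 6 on the training predictors and predict wins for the evaluation set
My_prediction <- lm(y_na ~ ., data = train_df)
fit_6_predict <- predict(My_prediction, test_df)
kable(fit_6_predict)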

x
61.92662
68.76625
73.50705
82.07337
NA
56.05863
68.46613
69.58884
61.49032
83.05783
86.48455
83.95449
89.68919
77.04353
70.81106
76.53099
NA
81.54091
83.35541
82.27576
82.48616
70.57153
78.97134
85.16802
62.85810
77.37633
60.94855
89.83601
87.97425
85.99453
84.43743
80.46297
85.23486
76.45860
91.37220
82.66758
89.15680
80.11327
94.71830
NA
NA
NA
63.54624
57.46252
72.97738
71.23652
81.35044
73.02683
76.18881
67.52168
79.88642
75.41477
68.30099
NA
NA
86.36666
84.77333
87.73882
84.60453
84.63909
NA
45.69831
54.88904
NA
83.36602
88.77719
73.07122
81.97568
95.37573
70.48764
78.28982
89.16533
81.63855
NA
NA
77.04870
80.69682
69.98533
88.53044
80.19605
87.02073
87.22192
96.65479
88.90896
NA
52.29533
NA
NA
NA
88.55088
85.40561
86.06973
75.51747
70.47183
83.36415
89.45490
73.82390
NA
71.37230
88.87607
NA
87.76417
85.14594
94.35076
89.81292
78.36933
74.65148
84.12522
79.97949
67.60061
NA
NA
NA
NA
NA
63.91358
80.43352
83.83552
70.44251
88.06252
86.38444
79.09888
80.51968
71.21323
82.19965
93.76708
76.91460
75.89617
91.83577
80.48406
42.12529
NA
89.52964
64.65567
76.42108
74.66181
72.95899
80.68064
79.38578
83.17903
82.88350
79.46469
61.97677
80.18481
68.20260
89.11799
NA
63.29854
NA
103.43310
112.91436
98.60398
106.41409
101.46655
100.19181
88.90789
86.64024
71.29591
79.02157
NA
87.52662
79.49987
90.60211
73.87771
79.03048
84.91793
68.05868
75.84968
81.83688
93.67837
87.65334
88.02007
87.27270
NA
NA
NA
NA
NA
61.64871
68.20794
63.71984
57.09436
68.74322
95.96053
84.86653
86.29287
73.09883
81.95712
77.52093
91.91348
81.09405
87.96287
75.82657
76.27937
NA
NA
NA
82.10810
62.01272
68.55481
83.92376
79.48529
91.75034
75.61021
81.23918
71.99662
66.53253
75.63967
70.15693
80.57726
78.58580
76.14602
84.43725
NA
NA
93.54387
86.51571
89.00744
78.07078
75.66783
75.92058
81.84754
NA
63.15399
84.66570
91.44331
83.23063
78.66853
58.38997
83.12404
77.47833
82.07125
74.67415
83.06915
78.26561
NA
71.58667
80.62028
80.80553
81.20826
67.56517
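
The NA entries in the predictions above most likely come from evaluation rows that are missing at least one predictor: `predict()` on an `lm` fit returns NA for any row of the new data with an NA in a model variable. The sketch below shows this behaviour on a small simulated example. The data frames `train` and `eval_df` and their columns are hypothetical, not the report's data.

```r
set.seed(2)

# Simulated training data (stand-in only)
train <- data.frame(hr = rnorm(50, 150, 30), so = rnorm(50, 900, 100))
train$wins <- 80 + 0.13 * train$hr - 0.02 * train$so + rnorm(50, sd = 5)
fit <- lm(wins ~ hr + so, data = train)

# Evaluation rows: the second and third each lack one predictor
eval_df <- data.frame(hr = c(120, NA, 180), so = c(950, 800, NA))
pred <- predict(fit, newdata = eval_df)

# Rows with any missing predictor receive an NA prediction
missing_pred <- unname(is.na(pred))
incomplete   <- !complete.cases(eval_df)
identical(missing_pred, incomplete)

# Count of missing values per predictor, to see which columns cause the NAs
colSums(is.na(eval_df))
```

Applying the same `colSums(is.na(...))` check to the evaluation predictors would show which columns are responsible, and whether imputing them is preferable to leaving those teams without a prediction.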

```

Justin Herman

June 6, 2018
