DATA 621 Assignment 1

Overview

DATA 621 – Business Analytics and Data Mining

Homework #1 Assignment Requirements

Author: James Mundy

Date: March 1, 2020

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Project Game Plan

My Game Plan

The goal of the Moneyball assignment is to develop three or more regression models that predict team wins.

As an avid baseball fan of many years and having recently read Analyzing Baseball Data with R, I begin this assignment with a the following premise:

Baseball teams win games (Target_Wins) by scoring more runs than their opponent. Thus the number of wins earned over a season must be related to both the number of runs scored and allowed.

As a result I will seek to utilize variables that capture runs scored and allowed. Below I have listed four metrics utilized in baseball to capture run scoring and one metric that addresses runs allowed.

Batting Average (AVG) - One of the oldest and most universal tools to measure a hitter’s success at the plate, batting average is determined by dividing a player’s hits by his total at-bats.
On base Percentage (OBP) - OBP refers to how frequently a batter reaches base per plate appearance. Times on base include hits, walks and hit-by-pitches, but do not include errors, times reached on a fielder’s choice or a dropped third strike.
Slugging Percentage (SLG) - Slugging percentage represents the total number of bases a player records per at-bat. Unlike on-base percentage, slugging percentage deals only with hits and does not include walks and hit-by-pitches in its equation.
On-base Plus Slugging (OPS) - OPS adds on-base percentage and slugging percentage to get one number that unites the two. It’s meant to combine how well a hitter can reach base, with how well he can hit for average and for power.
Run Differential (RD) - In baseball, run differential is a cumulative team statistic that combines offensive and defensive scoring. Run differential is calculated by subtracting runs allowed from runs scored.

I will use the Moneyball data set to calculate these five statistics (or similar alternatives) and use these statistics as the basis of my regression models.

Data Exploration

EDA Analysis

I will utilize the SKIM package and ggplot charts to perform my exploratory data analysis. The objective of this is analysis is to develop a better understanding of the data to include its shape, central tendencies, completeness (missing data) and its correlation to our response variable Target_Wins. Here are some key takeaways from my data exploration:

The data set includes 2276 rows, 16 columns and all variables are numeric
The data is generally complete, however, six variables have missing data. The lowest complete rate is for the variable Hit By Pitch, with a rate of only 8%.
The response variable appears to be normally or near-normally distributed.
The majority of the explanatory variables appear to be normal or near-normal. There are however, several variable that have bi-modal distributions (Batting_HR, SO, Pitching_HR, Batting_SO) and others that are right-skewed (Fielding, Pitching_BB, Pitching_H)
Target_Wins scatter plots reveal linear and non linear relationships.
Box plots reveal some outliers on the response variable. For example, zero wins are displayed in the Win box plot.
I see strong correlations with batting, fielding and pitching variables.

Below you will find the Skim function results, density charts, scatter plots and correlation tables and plots that support the take always above and provide additional detailed information.

mbData <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA621/master/hw1/moneyball-training-data.csv")


mbData <- mbData %>% 
  select(-INDEX)

mbData2 <- mbData

mbTest <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA621/master/hw1/moneyball-evaluation-data.csv")

mbTest <- mbTest %>% 
  select(-INDEX)

Skim Function

The skim function is an alternative to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable.

skim(mbData)

Data summary

Name	mbData
Number of rows	2276
Number of columns	16
_______________________
Column type frequency:
numeric	16
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
TARGET_WINS	0	1.00	80.79	15.75	0	71.0	82.0	92.00	146	▁▁▇▅▁
TEAM_BATTING_H	0	1.00	1469.27	144.59	891	1383.0	1454.0	1537.25	2554	▁▇▂▁▁
TEAM_BATTING_2B	0	1.00	241.25	46.80	69	208.0	238.0	273.00	458	▁▆▇▂▁
TEAM_BATTING_3B	0	1.00	55.25	27.94	0	34.0	47.0	72.00	223	▇▇▂▁▁
TEAM_BATTING_HR	0	1.00	99.61	60.55	0	42.0	102.0	147.00	264	▇▆▇▅▁
TEAM_BATTING_BB	0	1.00	501.56	122.67	0	451.0	512.0	580.00	878	▁▁▇▇▁
TEAM_BATTING_SO	102	0.96	735.61	248.53	0	548.0	750.0	930.00	1399	▁▆▇▇▁
TEAM_BASERUN_SB	131	0.94	124.76	87.79	0	66.0	101.0	156.00	697	▇▃▁▁▁
TEAM_BASERUN_CS	772	0.66	52.80	22.96	0	38.0	49.0	62.00	201	▃▇▁▁▁
TEAM_BATTING_HBP	2085	0.08	59.36	12.97	29	50.5	58.0	67.00	95	▂▇▇▅▁
TEAM_PITCHING_H	0	1.00	1779.21	1406.84	1137	1419.0	1518.0	1682.50	30132	▇▁▁▁▁
TEAM_PITCHING_HR	0	1.00	105.70	61.30	0	50.0	107.0	150.00	343	▇▇▆▁▁
TEAM_PITCHING_BB	0	1.00	553.01	166.36	0	476.0	536.5	611.00	3645	▇▁▁▁▁
TEAM_PITCHING_SO	102	0.96	817.73	553.09	0	615.0	813.5	968.00	19278	▇▁▁▁▁
TEAM_FIELDING_E	0	1.00	246.48	227.77	65	127.0	159.0	249.25	1898	▇▁▁▁▁
TEAM_FIELDING_DP	286	0.87	146.39	26.23	52	131.0	149.0	164.00	228	▁▂▇▆▁

Density Plots

mbData %>%
  gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "Blue", color="Blue") + 
  facet_wrap(~variable, scales ="free", ncol = 4)

Box Plots

mbData %>%
  gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
  ggplot(., aes(x= variable, y=value)) + 
  geom_boxplot() +
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

Scatter Plots

mbData %>% 
  gather(variable, value, -TARGET_WINS) %>%
  ggplot(., aes(value, TARGET_WINS)) + 
  geom_point(fill = "blue", color="blue") + 
  geom_smooth(method = "lm", se = FALSE, color = "black") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = "Wins")

Correlation Table

temp <- mbData %>% 
  cor(., use = "complete.obs") #%>%
  
temp[lower.tri(temp, diag=TRUE)] <- ""
temp <- temp %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(Variable, Correlation, -rowname) %>%
  filter(Variable != rowname) %>%
  filter(Correlation != "") %>%
  mutate(Correlation = as.numeric(Correlation)) %>%
  rename(` Variable` = rowname) %>%
  arrange(desc(abs(Correlation)))

## Warning: attributes are not identical across measure variables;
## they will be dropped

temp %>%
  filter(` Variable` == "TARGET_WINS") %>%
  kable() %>%
  kable_styling()

Variable	Variable	Correlation
TARGET_WINS	TEAM_PITCHING_H	0.4712343
TARGET_WINS	TEAM_BATTING_H	0.4699467
TARGET_WINS	TEAM_BATTING_BB	0.4686879
TARGET_WINS	TEAM_PITCHING_BB	0.4683988
TARGET_WINS	TEAM_PITCHING_HR	0.4224668
TARGET_WINS	TEAM_BATTING_HR	0.4224168
TARGET_WINS	TEAM_FIELDING_E	-0.3866880
TARGET_WINS	TEAM_BATTING_2B	0.3129840
TARGET_WINS	TEAM_PITCHING_SO	-0.2293648
TARGET_WINS	TEAM_BATTING_SO	-0.2288927
TARGET_WINS	TEAM_FIELDING_DP	-0.1958660
TARGET_WINS	TEAM_BASERUN_CS	-0.1787560
TARGET_WINS	TEAM_BATTING_3B	-0.1243459
TARGET_WINS	TEAM_BATTING_HBP	0.0735042
TARGET_WINS	TEAM_BASERUN_SB	0.0148364

Correlation Panels - Positive Variables

pairs.panels(mbData[1:8])

Correlation Panels - Negative Variables

pairs.panels(mbData[9:16])

Data Prep

Data Preparation

Now with our enhanced understanding of the data set, I will begin to prepare the data for model building. My data preparation will include addressing missing data, outliers and feature engineering or creating new variables.

Missing Data - My EDA identified variables with missing data. Given my strategy I must address the missing data for two variables: Hit By Pitch and Caught Stealing. I will utilize historical major league baseball averages of these two variable as my replacement data.
Outliers - The box plot for Wins should some very low values (even zero). According to major league baseball, the lowest number of wins recorded by a team was 20 by the Cleveland Spiders in 1899. Therefore, I will remove all rows from the data set with wins less than 20. The highest number of wins was 116, earned by the Seattle Mariners in 2001. I will also adjust the data set accordingly.
Feature Engineering (New Variables) - I will use the Moneyball data set to create the following five variables for potential use in my regression models:

On base Percentage (OBP)
Batting Average (AVG)
Slugging Percentage (SLG)
On base Plus Slugging (OPS)
Run Differential

Missing Data

Six variables in the Moneyball data set are missing data. The variables are set forth below along with each variable’s complete_rate as provided by the Skim function.

TEAM_BATTING_SO (0.96)
TEAM_BASERUN_SB (0.94)
TEAM_BASERUN_CS (0.66)
TEAM_BATTING_HBP (0.08)
TEAM_PITCHING_SO (0.96)
TEAM_FIELDING_DP (0.87)

Two of these variables (TEAM_BATTING_HBP and TEAM_BASERUN_CS) are inputs to our new variables. Therefore, in each case, NA values are replaced with the major league baseball 2019 and 2018 average for Hit By Pitch (HBP) and Caught Stealing (CS). The values were obtained from Team rankings.com and are 65 and 30, respectively.

#Clean-up missing data

mbData2 <- mbData2 %>% 
  mutate(TEAM_BATTING_HBP = replace_na(TEAM_BATTING_HBP,65)) %>% 
  mutate(TEAM_BASERUN_CS  = replace_na(TEAM_BASERUN_CS,30))

Outliers (Target_Wins)

The Target Wins box plot below introduces the possibility of outliers in our response variable. I will utilize MLB historical win/loss information to eliminate these outliers.

Before

mbData2 %>%
  gather(variable, value, TARGET_WINS) %>%
  ggplot(., aes(x= variable, y=value)) + 
  geom_boxplot() +
  labs(x = element_blank(), y = element_blank())

mbData2 <- mbData2 %>% 
  filter(TARGET_WINS >19) %>% 
  filter(TARGET_WINS <121)

After

Use dplyr to filter out the outlier rows.

  mbData2 %>%
  gather(variable, value, TARGET_WINS) %>%
  ggplot(., aes(x= variable, y=value)) + 
  geom_boxplot() +
  labs(x = element_blank(), y = element_blank())

Now, with the missing data and outlier challenges addressed, I will direct my attention to the creation of our five new variables and the selection of variables for the regression model(s).

Feature Engineering (New Variables)

We will supplement the the Moneyball data set with five new variables: On base Percentage (OBP), Slugging Percentage (SLG), On Base Plus Slugging (OPS) and Batting Average (AVG) and Run Differential (RD). Plate Appearances and At Bats are two additional variables that are inputs in a new variable calculations. The definitions for each of these variables follow:

On Base Percentage = (H + BB + HBP) / (AB + BB + HBP + SF)
Slugging Percentage = (1B + 22B + 33B + 4*HR) / AB
Batting Average = H / AB
Plate Appearance(PA) = H + BB + K + HBP + SH + SF + DI + E+ DFO
Total Bases (TB) = (1B + 22B + 33B + 4*HR)

Where: H = Hit, BB = Walk, K = Strikeout, HBP = Hit by Pitch, SH = Sacrifice Hit, SF = Sacrifice Fly, DI = Defensive Interference, E = Fielding error,DFO = Defensive Fielding Out.

The provided data set does not include all the variable necessary to calculate these metrics. Therefore, I will create and utilize the following alternative definitions:

On Base Percentage = (H + BB + HBP) / PA
Slugging Percentage = (1B + 22B + 33B + 4*HR) / (PA-BB-HBP)
Batting Average = H / (PA-BB-HBP)
Plate Appearance = H + BB + HBP + (162 * 9 * 3) - cs - DP
Run Differential = (Team_Batting_H + Team_Batting_HR) - (Team_Pitching_H + Team_Pitching_HR)

mbData2 <- mbData2 %>% 
  mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -TEAM_BATTING_3B -TEAM_BATTING_HR) %>% 
  mutate(PA = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP + 4374 - TEAM_BASERUN_CS) %>% 
  mutate(AB = TEAM_BATTING_H + 4374 - TEAM_BASERUN_CS)

mbData2 <- mbData2 %>% 
  mutate(OBP = (TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP)/ PA) %>% 
  mutate(SLG = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))/AB) %>% 
  mutate(TB = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))) %>% 
  mutate(AVG = TEAM_BATTING_H/AB) %>% 
  mutate(OPS = OBP + SLG) %>% 
  mutate(RD = TEAM_BATTING_H + TEAM_BATTING_HR - TEAM_PITCHING_H - TEAM_PITCHING_HR) %>% 
  mutate(NI = TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E)

Data Preparation Wrap-up

I’ve replace missing data, removed outliers, created new variables. I’ve also selected eight explanatory variables for use in my regression models. Below I have recreated some of the exploratory data visuals for the revised data set.

Note - After some preliminary model building I decided to expand my variable set to increase my chances of producing a good model.

Skim Function

The Skim function reveals a cleaner and smaller revised data set

skim(mbData2)

Data summary

Name	mbData2
Number of rows	2262
Number of columns	26
_______________________
Column type frequency:
numeric	26
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
TARGET_WINS	0	1.00	80.70	15.18	21.00	71.00	82.00	91.00	120.00	▁▂▇▇▂
TEAM_BATTING_H	0	1.00	1468.03	139.35	992.00	1383.00	1454.00	1536.00	2554.00	▁▇▁▁▁
TEAM_BATTING_2B	0	1.00	241.17	46.34	69.00	208.00	238.00	272.75	458.00	▁▆▇▂▁
TEAM_BATTING_3B	0	1.00	55.10	27.80	0.00	34.00	47.00	72.00	223.00	▇▇▂▁▁
TEAM_BATTING_HR	0	1.00	100.08	60.41	0.00	42.00	102.50	147.00	264.00	▇▆▇▅▁
TEAM_BATTING_BB	0	1.00	503.37	120.17	29.00	452.00	513.00	580.00	878.00	▁▁▇▆▁
TEAM_BATTING_SO	101	0.96	738.37	245.84	0.00	551.00	752.00	932.00	1399.00	▁▆▇▇▁
TEAM_BASERUN_SB	123	0.95	124.72	87.63	0.00	66.00	101.00	156.00	697.00	▇▃▁▁▁
TEAM_BASERUN_CS	0	1.00	45.18	21.56	11.00	30.00	38.00	55.00	201.00	▇▃▁▁▁
TEAM_BATTING_HBP	0	1.00	64.52	4.07	29.00	65.00	65.00	65.00	95.00	▁▁▇▁▁
TEAM_PITCHING_H	0	1.00	1745.98	1258.05	1137.00	1418.00	1517.00	1679.00	30132.00	▇▁▁▁▁
TEAM_PITCHING_HR	0	1.00	106.01	61.07	0.00	50.00	108.00	150.00	343.00	▇▇▆▁▁
TEAM_PITCHING_BB	0	1.00	553.57	164.44	119.00	477.00	537.00	611.00	3645.00	▇▁▁▁▁
TEAM_PITCHING_SO	101	0.96	819.04	553.58	0.00	615.00	814.00	968.00	19278.00	▇▁▁▁▁
TEAM_FIELDING_E	0	1.00	242.21	218.81	65.00	127.00	158.00	248.00	1898.00	▇▁▁▁▁
TEAM_FIELDING_DP	275	0.88	146.47	26.15	52.00	131.00	149.00	164.00	228.00	▁▂▇▆▁
TEAM_BATTING_1B	0	1.00	1071.68	124.85	709.00	990.00	1050.00	1128.00	2112.00	▃▇▁▁▁
PA	0	1.00	6364.74	179.05	5543.00	6254.00	6362.00	6472.00	7229.00	▁▂▇▂▁
AB	0	1.00	5796.85	142.00	5336.00	5708.00	5782.50	5867.00	6854.00	▁▇▁▁▁
OBP	0	1.00	0.32	0.02	0.22	0.31	0.32	0.33	0.40	▁▁▇▅▁
SLG	0	1.00	0.37	0.04	0.24	0.34	0.37	0.39	0.48	▁▃▇▅▁
TB	0	1.00	2119.65	255.25	1295.00	1947.00	2126.50	2284.00	3290.00	▁▆▇▁▁
AVG	0	1.00	0.25	0.02	0.19	0.24	0.25	0.26	0.37	▁▇▃▁▁
OPS	0	1.00	0.68	0.05	0.46	0.65	0.69	0.72	0.87	▁▂▇▅▁
RD	0	1.00	-283.88	1223.06	-28458.00	-108.00	-78.00	0.00	29.00	▁▁▁▁▇
NI	0	1.00	2647.77	1465.38	1814.00	2206.00	2360.00	2579.00	33156.00	▇▁▁▁▁

Density Plots

The density plot of RD reveals significant skew. A transformation may need to be employed. Home Runs remains bi-modal.

mbData2 %>%
  gather(variable, value, TARGET_WINS:RD) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "Blue", color="Blue") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

Scatter Plots

The Scatter plots generally reveal a linear relationship between the response variable and New variables.

mbData2 %>% 
  gather(variable, value, -TARGET_WINS) %>%
  ggplot(., aes(value, TARGET_WINS)) + 
  geom_point(fill = "blue", color="blue") + 
  geom_smooth(method = "lm", se = FALSE, color = "black") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = "Wins")

## Warning: Removed 600 rows containing non-finite values (stat_smooth).

## Warning: Removed 600 rows containing missing values (geom_point).

Correlation Table

Here is the correlations table for the revised data set.

temp2 <- mbData2 %>% 
  cor(., use = "complete.obs") #%>%
  
temp2[lower.tri(temp2, diag=TRUE)] <- ""
temp2 <- temp2 %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(Variable, Correlation, -rowname) %>%
  filter(Variable != rowname) %>%
  filter(Correlation != "") %>%
  mutate(Correlation = as.numeric(Correlation)) %>%
  rename(` Variable` = rowname) %>%
  arrange(desc(abs(Correlation))) 


temp2 %>%
  filter(` Variable` == "TARGET_WINS") %>%
  kable() %>%
  kable_styling()

Variable	Variable	Correlation
TARGET_WINS	OBP	0.4425900
TARGET_WINS	PA	0.4358739
TARGET_WINS	OPS	0.4239949
TARGET_WINS	TB	0.3891767
TARGET_WINS	SLG	0.3743504
TARGET_WINS	AVG	0.3526064
TARGET_WINS	TEAM_BATTING_H	0.3522027
TARGET_WINS	AB	0.3447749
TARGET_WINS	TEAM_BATTING_BB	0.3015885
TARGET_WINS	TEAM_PITCHING_BB	0.2717246
TARGET_WINS	NI	0.2699488
TARGET_WINS	TEAM_PITCHING_H	0.2212359
TARGET_WINS	TEAM_BATTING_HR	0.2201378
TARGET_WINS	TEAM_PITCHING_HR	0.2197311
TARGET_WINS	TEAM_BATTING_2B	0.2133341
TARGET_WINS	TEAM_FIELDING_E	-0.1848233
TARGET_WINS	TEAM_BATTING_1B	0.1599655
TARGET_WINS	TEAM_BATTING_3B	0.1227178
TARGET_WINS	TEAM_BASERUN_SB	0.1213532
TARGET_WINS	TEAM_PITCHING_SO	-0.0655594
TARGET_WINS	TEAM_BATTING_SO	-0.0594744
TARGET_WINS	TEAM_FIELDING_DP	-0.0366452
TARGET_WINS	TEAM_BATTING_HBP	0.0207607
TARGET_WINS	TEAM_BASERUN_CS	-0.0084982
TARGET_WINS	RD	-0.0065063

Build Models

Strategy

I will use my selected variable to build several models to predict wins. The variables selected reflect my strategy of using variables that are related to runs scored and/or runs allowed. I will utilize the training data set to create the various models. Next, in subsequent sections I will select the best model and apply the test data set to that model. I will provide summary, broom:tidy and brooom:glance data for each model.

Note - As I tested various models, I came across an interesting finding. If I only included data rows that were populated with Hit By Pitch values the R-Squared of all models increase by 10 to 15 percentage points. I determined however, despite the favorable R-Squared results, that resulting data set would be too small to develop viable models.

Model 4 - TARGET_WINS ~ OBP + OPS + AVG + RD

m4 <-   lm(TARGET_WINS ~ OBP + OPS + AVG + RD, data = mbData2)
summary(m4)

## 
## Call:
## lm(formula = TARGET_WINS ~ OBP + OPS + AVG + RD, data = mbData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.174  -8.878   0.249   9.216  67.940 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.817e+01  4.975e+00  -5.662 1.69e-08 ***
## OBP          1.630e+02  3.417e+01   4.771 1.95e-06 ***
## OPS          3.931e+01  1.128e+01   3.485 0.000502 ***
## AVG          1.199e+02  2.795e+01   4.290 1.86e-05 ***
## RD           1.406e-03  2.643e-04   5.319 1.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.45 on 2257 degrees of freedom
## Multiple R-squared:  0.2163, Adjusted R-squared:  0.2149 
## F-statistic: 155.7 on 4 and 2257 DF,  p-value: < 2.2e-16

Model 7 - TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB

m7 <-   lm(TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB, data = mbData2)
summary(m7)

## 
## Call:
## lm(formula = TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E + 
##     TEAM_PITCHING_BB, data = mbData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.795  -9.064   0.202   9.087  42.246 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.168667   3.961676   0.295    0.768    
## OPS              22.722267  11.059706   2.055    0.040 *  
## RD                0.042823   0.004197  10.204  < 2e-16 ***
## TEAM_PITCHING_H   0.040885   0.004145   9.864  < 2e-16 ***
## TEAM_FIELDING_E  -0.009496   0.002205  -4.306 1.73e-05 ***
## TEAM_PITCHING_BB  0.012742   0.002178   5.850 5.62e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.35 on 2256 degrees of freedom
## Multiple R-squared:  0.2278, Adjusted R-squared:  0.2261 
## F-statistic: 133.1 on 5 and 2256 DF,  p-value: < 2.2e-16

Model 8 - TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB

m8 <-   lm(TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB, data = mbData2)
summary(m8)

## 
## Call:
## lm(formula = TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E + 
##     TEAM_PITCHING_BB, data = mbData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.786  -8.903   0.169   9.111  41.434 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.701446   3.292833   1.731   0.0835 .  
## RD                0.050192   0.002180  23.026  < 2e-16 ***
## TEAM_PITCHING_H   0.048185   0.002135  22.565  < 2e-16 ***
## TEAM_FIELDING_E  -0.012065   0.001818  -6.636 4.03e-11 ***
## TEAM_PITCHING_BB  0.014524   0.001999   7.265 5.10e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.36 on 2257 degrees of freedom
## Multiple R-squared:  0.2263, Adjusted R-squared:  0.225 
## F-statistic: 165.1 on 4 and 2257 DF,  p-value: < 2.2e-16

Select Models

Selected Model

In the Build phase of this project, I built 9 regression models. I have narrowed those 9 models down to 3 models that use to determine the best model. To select the best model I will test my final 3 models against the Moneyball evaluation data set. To accomplish this I will first create the same variable that were created in the training data set. Then I will use the models to predict Wins in the evaluation data set (Note -Target Wins is not included in the evaluation set).

In the final analysis, Models 4, 7 and 8 all produce very similar results. However, Owing to the counter intuitive coefficients in models 7 and 8, I selected model 4 as the best model. Model 4 benefits from brand name baseball metrics, intuitive coefficients, and an R-Squared of 22%. Model 8 is a very close second owing to the elegance of it only using Run Differential, Hits Allowed, Walks Allowed and Fielding Errors. Model 8 also had a higher F-Statistic. All Three model produce what I believe are good predictions. See below for additional detail.

#Clean-up missing data

mbTest <- mbTest %>% 
  mutate(TEAM_BATTING_HBP = replace_na(TEAM_BATTING_HBP,65)) %>% 
  mutate(TEAM_BASERUN_CS  = replace_na(TEAM_BASERUN_CS,30))

mbTest <- mbTest %>% 
  mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -TEAM_BATTING_3B -TEAM_BATTING_HR) %>% 
  mutate(PA = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP + 4374 - TEAM_BASERUN_CS) %>% 
  mutate(AB = TEAM_BATTING_H + 4374 - TEAM_BASERUN_CS)

mbTest <- mbTest %>% 
  mutate(OBP = (TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP)/ PA) %>% 
  mutate(SLG = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))/AB) %>% 
  mutate(TB = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))) %>% 
  mutate(AVG = TEAM_BATTING_H/AB) %>% 
  mutate(OPS = OBP + SLG) %>% 
  mutate(RD = TEAM_BATTING_H + TEAM_BATTING_HR - TEAM_PITCHING_H - TEAM_PITCHING_HR) %>% 
  mutate(NI = TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E)

Model Testing & Selection

Model 4 - TARGET_WINS ~ OBP + OPS + AVG + RD

Model 4 uses sabermetric statistics On base Percentage, On base Plus Slugging, Batting Average and Run Differential to produce a model with five statistically significant variables, high t stats and zero p-values. The coefficients are all intuitive (all positive) and the predicted results are reasonable. R-Squared of 21.6%

Model Summary and Prediction Summary

summary(m4)

## 
## Call:
## lm(formula = TARGET_WINS ~ OBP + OPS + AVG + RD, data = mbData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.174  -8.878   0.249   9.216  67.940 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.817e+01  4.975e+00  -5.662 1.69e-08 ***
## OBP          1.630e+02  3.417e+01   4.771 1.95e-06 ***
## OPS          3.931e+01  1.128e+01   3.485 0.000502 ***
## AVG          1.199e+02  2.795e+01   4.290 1.86e-05 ***
## RD           1.406e-03  2.643e-04   5.319 1.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.45 on 2257 degrees of freedom
## Multiple R-squared:  0.2163, Adjusted R-squared:  0.2149 
## F-statistic: 155.7 on 4 and 2257 DF,  p-value: < 2.2e-16

summary(m4Test)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.07   76.92   81.43   80.43   85.29  104.32

Model 7 - TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB

Model 7 is a derivative of model 4. It eliminates On base Percentage and Batting Average and replaces them with some fielding and pitching metrics (Hits Allowed, Walks Allowed and Errors). The R-Squared of this model increased slightly to (22.7%), however, two of the coefficients are counter intuitive. One would expect Wins to go down with Hit allowed and Walks allowed increase.

Model Summary and Prediction Summary

summary(m7)

## 
## Call:
## lm(formula = TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E + 
##     TEAM_PITCHING_BB, data = mbData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.795  -9.064   0.202   9.087  42.246 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.168667   3.961676   0.295    0.768    
## OPS              22.722267  11.059706   2.055    0.040 *  
## RD                0.042823   0.004197  10.204  < 2e-16 ***
## TEAM_PITCHING_H   0.040885   0.004145   9.864  < 2e-16 ***
## TEAM_FIELDING_E  -0.009496   0.002205  -4.306 1.73e-05 ***
## TEAM_PITCHING_BB  0.012742   0.002178   5.850 5.62e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.35 on 2256 degrees of freedom
## Multiple R-squared:  0.2278, Adjusted R-squared:  0.2261 
## F-statistic: 133.1 on 5 and 2256 DF,  p-value: < 2.2e-16

summary(m7Test)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.56   76.83   81.49   80.46   85.40  106.93

Model 8 - TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB

Model 8 is also a derivative of model 4. It eliminates On base Plus Slugging, On base Percentage and Batting Average and includes the same fielding a pitching metrics as Model 7 (Hits Allowed, Walks Allowed and Errors). The R-Squared of this model was almost identical to Model 7 at 22.6%. Model 8 suffered from the same issue as Model 7. Hits allowed and Walks allowed both have positive coefficients which I believe is counter intuitive.

Model Summary and Prediction Summary

summary(m8)

## 
## Call:
## lm(formula = TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E + 
##     TEAM_PITCHING_BB, data = mbData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.786  -8.903   0.169   9.111  41.434 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.701446   3.292833   1.731   0.0835 .  
## RD                0.050192   0.002180  23.026  < 2e-16 ***
## TEAM_PITCHING_H   0.048185   0.002135  22.565  < 2e-16 ***
## TEAM_FIELDING_E  -0.012065   0.001818  -6.636 4.03e-11 ***
## TEAM_PITCHING_BB  0.014524   0.001999   7.265 5.10e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.36 on 2257 degrees of freedom
## Multiple R-squared:  0.2263, Adjusted R-squared:  0.225 
## F-statistic: 165.1 on 4 and 2257 DF,  p-value: < 2.2e-16

summary(m8Test)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.05   77.01   81.45   80.50   85.39  107.78

DATA 621 Assignment 1 - Money Ball

Overview

DATA 621 – Business Analytics and Data Mining

Overview

Project Game Plan

My Game Plan

Data Exploration

EDA Analysis

Skim Function

Density Plots

Box Plots

Scatter Plots

Correlation Table

Correlation Panels - Positive Variables

Correlation Panels - Negative Variables

Data Prep

Data Preparation

Missing Data

Outliers (Target_Wins)

Before

After

Feature Engineering (New Variables)

Data Preparation Wrap-up

Skim Function

Density Plots

Scatter Plots

Correlation Table

Build Models

Strategy

Model 4 - TARGET_WINS ~ OBP + OPS + AVG + RD

Model 7 - TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB

Model 8 - TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB

Select Models

Selected Model

Model Testing & Selection

Model 4 - TARGET_WINS ~ OBP + OPS + AVG + RD

Model Summary and Prediction Summary

Model 7 - TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB

Model Summary and Prediction Summary

Model 8 - TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB

Model Summary and Prediction Summary