Overview


DATA 621 – Business Analytics and Data Mining

Homework #1 Assignment Requirements


Author: James Mundy

Date: March 1, 2020


Overview


In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.


Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:


Project Game Plan


My Game Plan



The goal of the Moneyball assignment is to develop three or more regression models that predict team wins.


As an avid baseball fan of many years and having recently read Analyzing Baseball Data with R, I begin this assignment with a the following premise:

Baseball teams win games (Target_Wins) by scoring more runs than their opponent. Thus the number of wins earned over a season must be related to both the number of runs scored and allowed.

As a result I will seek to utilize variables that capture runs scored and allowed. Below I have listed four metrics utilized in baseball to capture run scoring and one metric that addresses runs allowed.

  • Batting Average (AVG) - One of the oldest and most universal tools to measure a hitter’s success at the plate, batting average is determined by dividing a player’s hits by his total at-bats.

  • On base Percentage (OBP) - OBP refers to how frequently a batter reaches base per plate appearance. Times on base include hits, walks and hit-by-pitches, but do not include errors, times reached on a fielder’s choice or a dropped third strike.

  • Slugging Percentage (SLG) - Slugging percentage represents the total number of bases a player records per at-bat. Unlike on-base percentage, slugging percentage deals only with hits and does not include walks and hit-by-pitches in its equation.

  • On-base Plus Slugging (OPS) - OPS adds on-base percentage and slugging percentage to get one number that unites the two. It’s meant to combine how well a hitter can reach base, with how well he can hit for average and for power.

  • Run Differential (RD) - In baseball, run differential is a cumulative team statistic that combines offensive and defensive scoring. Run differential is calculated by subtracting runs allowed from runs scored.

I will use the Moneyball data set to calculate these five statistics (or similar alternatives) and use these statistics as the basis of my regression models.

Back to top

Data Exploration

EDA Analysis


I will utilize the SKIM package and ggplot charts to perform my exploratory data analysis. The objective of this is analysis is to develop a better understanding of the data to include its shape, central tendencies, completeness (missing data) and its correlation to our response variable Target_Wins. Here are some key takeaways from my data exploration:

  • The data set includes 2276 rows, 16 columns and all variables are numeric
  • The data is generally complete, however, six variables have missing data. The lowest complete rate is for the variable Hit By Pitch, with a rate of only 8%.
  • The response variable appears to be normally or near-normally distributed.
  • The majority of the explanatory variables appear to be normal or near-normal. There are however, several variable that have bi-modal distributions (Batting_HR, SO, Pitching_HR, Batting_SO) and others that are right-skewed (Fielding, Pitching_BB, Pitching_H)
  • Target_Wins scatter plots reveal linear and non linear relationships.
  • Box plots reveal some outliers on the response variable. For example, zero wins are displayed in the Win box plot.
  • I see strong correlations with batting, fielding and pitching variables.

Below you will find the Skim function results, density charts, scatter plots and correlation tables and plots that support the take always above and provide additional detailed information.





Skim Function


The skim function is an alternative to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable.

Data summary
Name mbData
Number of rows 2276
Number of columns 16
_______________________
Column type frequency:
numeric 16
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TARGET_WINS 0 1.00 80.79 15.75 0 71.0 82.0 92.00 146 ▁▁▇▅▁
TEAM_BATTING_H 0 1.00 1469.27 144.59 891 1383.0 1454.0 1537.25 2554 ▁▇▂▁▁
TEAM_BATTING_2B 0 1.00 241.25 46.80 69 208.0 238.0 273.00 458 ▁▆▇▂▁
TEAM_BATTING_3B 0 1.00 55.25 27.94 0 34.0 47.0 72.00 223 ▇▇▂▁▁
TEAM_BATTING_HR 0 1.00 99.61 60.55 0 42.0 102.0 147.00 264 ▇▆▇▅▁
TEAM_BATTING_BB 0 1.00 501.56 122.67 0 451.0 512.0 580.00 878 ▁▁▇▇▁
TEAM_BATTING_SO 102 0.96 735.61 248.53 0 548.0 750.0 930.00 1399 ▁▆▇▇▁
TEAM_BASERUN_SB 131 0.94 124.76 87.79 0 66.0 101.0 156.00 697 ▇▃▁▁▁
TEAM_BASERUN_CS 772 0.66 52.80 22.96 0 38.0 49.0 62.00 201 ▃▇▁▁▁
TEAM_BATTING_HBP 2085 0.08 59.36 12.97 29 50.5 58.0 67.00 95 ▂▇▇▅▁
TEAM_PITCHING_H 0 1.00 1779.21 1406.84 1137 1419.0 1518.0 1682.50 30132 ▇▁▁▁▁
TEAM_PITCHING_HR 0 1.00 105.70 61.30 0 50.0 107.0 150.00 343 ▇▇▆▁▁
TEAM_PITCHING_BB 0 1.00 553.01 166.36 0 476.0 536.5 611.00 3645 ▇▁▁▁▁
TEAM_PITCHING_SO 102 0.96 817.73 553.09 0 615.0 813.5 968.00 19278 ▇▁▁▁▁
TEAM_FIELDING_E 0 1.00 246.48 227.77 65 127.0 159.0 249.25 1898 ▇▁▁▁▁
TEAM_FIELDING_DP 286 0.87 146.39 26.23 52 131.0 149.0 164.00 228 ▁▂▇▆▁



Correlation Table

## Warning: attributes are not identical across measure variables;
## they will be dropped
Variable Variable Correlation
TARGET_WINS TEAM_PITCHING_H 0.4712343
TARGET_WINS TEAM_BATTING_H 0.4699467
TARGET_WINS TEAM_BATTING_BB 0.4686879
TARGET_WINS TEAM_PITCHING_BB 0.4683988
TARGET_WINS TEAM_PITCHING_HR 0.4224668
TARGET_WINS TEAM_BATTING_HR 0.4224168
TARGET_WINS TEAM_FIELDING_E -0.3866880
TARGET_WINS TEAM_BATTING_2B 0.3129840
TARGET_WINS TEAM_PITCHING_SO -0.2293648
TARGET_WINS TEAM_BATTING_SO -0.2288927
TARGET_WINS TEAM_FIELDING_DP -0.1958660
TARGET_WINS TEAM_BASERUN_CS -0.1787560
TARGET_WINS TEAM_BATTING_3B -0.1243459
TARGET_WINS TEAM_BATTING_HBP 0.0735042
TARGET_WINS TEAM_BASERUN_SB 0.0148364

Correlation Panels - Positive Variables

Correlation Panels - Negative Variables

Data Prep

Data Preparation


Now with our enhanced understanding of the data set, I will begin to prepare the data for model building. My data preparation will include addressing missing data, outliers and feature engineering or creating new variables.

  • Missing Data - My EDA identified variables with missing data. Given my strategy I must address the missing data for two variables: Hit By Pitch and Caught Stealing. I will utilize historical major league baseball averages of these two variable as my replacement data.
  • Outliers - The box plot for Wins should some very low values (even zero). According to major league baseball, the lowest number of wins recorded by a team was 20 by the Cleveland Spiders in 1899. Therefore, I will remove all rows from the data set with wins less than 20. The highest number of wins was 116, earned by the Seattle Mariners in 2001. I will also adjust the data set accordingly.
  • Feature Engineering (New Variables) - I will use the Moneyball data set to create the following five variables for potential use in my regression models:
    1. On base Percentage (OBP)
    2. Batting Average (AVG)
    3. Slugging Percentage (SLG)
    4. On base Plus Slugging (OPS)
    5. Run Differential


Missing Data

Six variables in the Moneyball data set are missing data. The variables are set forth below along with each variable’s complete_rate as provided by the Skim function.

  1. TEAM_BATTING_SO (0.96)
  2. TEAM_BASERUN_SB (0.94)
  3. TEAM_BASERUN_CS (0.66)
  4. TEAM_BATTING_HBP (0.08)
  5. TEAM_PITCHING_SO (0.96)
  6. TEAM_FIELDING_DP (0.87)

Two of these variables (TEAM_BATTING_HBP and TEAM_BASERUN_CS) are inputs to our new variables. Therefore, in each case, NA values are replaced with the major league baseball 2019 and 2018 average for Hit By Pitch (HBP) and Caught Stealing (CS). The values were obtained from Team rankings.com and are 65 and 30, respectively.

Outliers (Target_Wins)

The Target Wins box plot below introduces the possibility of outliers in our response variable. I will utilize MLB historical win/loss information to eliminate these outliers.

After

Use dplyr to filter out the outlier rows.

Now, with the missing data and outlier challenges addressed, I will direct my attention to the creation of our five new variables and the selection of variables for the regression model(s).

Feature Engineering (New Variables)

We will supplement the the Moneyball data set with five new variables: On base Percentage (OBP), Slugging Percentage (SLG), On Base Plus Slugging (OPS) and Batting Average (AVG) and Run Differential (RD). Plate Appearances and At Bats are two additional variables that are inputs in a new variable calculations. The definitions for each of these variables follow:

  • On Base Percentage = (H + BB + HBP) / (AB + BB + HBP + SF)

  • Slugging Percentage = (1B + 22B + 33B + 4*HR) / AB

  • Batting Average = H / AB

  • Plate Appearance(PA) = H + BB + K + HBP + SH + SF + DI + E+ DFO

  • Total Bases (TB) = (1B + 22B + 33B + 4*HR)

Where: H = Hit, BB = Walk, K = Strikeout, HBP = Hit by Pitch, SH = Sacrifice Hit, SF = Sacrifice Fly, DI = Defensive Interference, E = Fielding error,DFO = Defensive Fielding Out.

The provided data set does not include all the variable necessary to calculate these metrics. Therefore, I will create and utilize the following alternative definitions:

  • On Base Percentage = (H + BB + HBP) / PA

  • Slugging Percentage = (1B + 22B + 33B + 4*HR) / (PA-BB-HBP)

  • Batting Average = H / (PA-BB-HBP)

  • Plate Appearance = H + BB + HBP + (162 * 9 * 3) - cs - DP

  • Run Differential = (Team_Batting_H + Team_Batting_HR) - (Team_Pitching_H + Team_Pitching_HR)

Data Preparation Wrap-up


I’ve replace missing data, removed outliers, created new variables. I’ve also selected eight explanatory variables for use in my regression models. Below I have recreated some of the exploratory data visuals for the revised data set.

Note - After some preliminary model building I decided to expand my variable set to increase my chances of producing a good model.

Skim Function


The Skim function reveals a cleaner and smaller revised data set

Data summary
Name mbData2
Number of rows 2262
Number of columns 26
_______________________
Column type frequency:
numeric 26
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TARGET_WINS 0 1.00 80.70 15.18 21.00 71.00 82.00 91.00 120.00 ▁▂▇▇▂
TEAM_BATTING_H 0 1.00 1468.03 139.35 992.00 1383.00 1454.00 1536.00 2554.00 ▁▇▁▁▁
TEAM_BATTING_2B 0 1.00 241.17 46.34 69.00 208.00 238.00 272.75 458.00 ▁▆▇▂▁
TEAM_BATTING_3B 0 1.00 55.10 27.80 0.00 34.00 47.00 72.00 223.00 ▇▇▂▁▁
TEAM_BATTING_HR 0 1.00 100.08 60.41 0.00 42.00 102.50 147.00 264.00 ▇▆▇▅▁
TEAM_BATTING_BB 0 1.00 503.37 120.17 29.00 452.00 513.00 580.00 878.00 ▁▁▇▆▁
TEAM_BATTING_SO 101 0.96 738.37 245.84 0.00 551.00 752.00 932.00 1399.00 ▁▆▇▇▁
TEAM_BASERUN_SB 123 0.95 124.72 87.63