UDEMY’s Data Science and Machine Learning Bootcamp with R Capstone Project is based on the historical problem of The Oakland Athletic’s 2002 season. The Athletics’ 2002 campaign ranks among the most famous in franchise history. Following the 2001 season, Oakland saw the departure of three key players (the lost boys). Billy Beane, the team’s general manager, responded with a series of under-the-radar free agent signings. The new-look Athletics, despite a comparative lack of star power, surprised the baseball world by besting the 2001 team’s regular season record.
In this project we will perform an analysis that aims to inform decision makers on the selection of replacements for the departed players. We will be using real data provided by Sean Lahaman’s Website (http://www.seanlahman.com/baseball-archive/statistics/).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
We’ll start loading and exploring the data.
batting <- read.csv('Batting.csv')
head(batting, 12)
## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B HR RBI
## 1 aardsda01 2004 1 SFN NL 11 11 0 0 0 0 0 0 0
## 2 aardsda01 2006 1 CHN NL 45 43 2 0 0 0 0 0 0
## 3 aardsda01 2007 1 CHA AL 25 2 0 0 0 0 0 0 0
## 4 aardsda01 2008 1 BOS AL 47 5 1 0 0 0 0 0 0
## 5 aardsda01 2009 1 SEA AL 73 3 0 0 0 0 0 0 0
## 6 aardsda01 2010 1 SEA AL 53 4 0 0 0 0 0 0 0
## 7 aardsda01 2012 1 NYA AL 1 NA NA NA NA NA NA NA NA
## 8 aaronha01 1954 1 ML1 NL 122 122 468 58 131 27 6 13 69
## 9 aaronha01 1955 1 ML1 NL 153 153 602 105 189 37 9 27 106
## 10 aaronha01 1956 1 ML1 NL 153 153 609 106 200 34 14 26 92
## 11 aaronha01 1957 1 ML1 NL 151 151 615 118 198 27 6 44 132
## 12 aaronha01 1958 1 ML1 NL 153 153 601 109 196 34 4 30 95
## SB CS BB SO IBB HBP SH SF GIDP G_old
## 1 0 0 0 0 0 0 0 0 0 11
## 2 0 0 0 0 0 0 1 0 0 45
## 3 0 0 0 0 0 0 0 0 0 2
## 4 0 0 0 1 0 0 0 0 0 5
## 5 0 0 0 0 0 0 0 0 0 NA
## 6 0 0 0 0 0 0 0 0 0 NA
## 7 NA NA NA NA NA NA NA NA NA NA
## 8 2 2 28 39 NA 3 6 4 13 122
## 9 3 1 49 61 5 3 7 4 20 153
## 10 2 4 37 54 6 2 5 7 21 153
## 11 1 1 57 58 15 0 0 3 13 151
## 12 4 1 59 49 16 1 0 3 21 153
str(batting)
## 'data.frame': 97889 obs. of 24 variables:
## $ playerID : chr "aardsda01" "aardsda01" "aardsda01" "aardsda01" ...
## $ yearID : int 2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : chr "SFN" "CHN" "CHA" "BOS" ...
## $ lgID : chr "NL" "NL" "AL" "AL" ...
## $ G : int 11 45 25 47 73 53 1 122 153 153 ...
## $ G_batting: int 11 43 2 5 3 4 NA 122 153 153 ...
## $ AB : int 0 2 0 1 0 0 NA 468 602 609 ...
## $ R : int 0 0 0 0 0 0 NA 58 105 106 ...
## $ H : int 0 0 0 0 0 0 NA 131 189 200 ...
## $ X2B : int 0 0 0 0 0 0 NA 27 37 34 ...
## $ X3B : int 0 0 0 0 0 0 NA 6 9 14 ...
## $ HR : int 0 0 0 0 0 0 NA 13 27 26 ...
## $ RBI : int 0 0 0 0 0 0 NA 69 106 92 ...
## $ SB : int 0 0 0 0 0 0 NA 2 3 2 ...
## $ CS : int 0 0 0 0 0 0 NA 2 1 4 ...
## $ BB : int 0 0 0 0 0 0 NA 28 49 37 ...
## $ SO : int 0 0 0 1 0 0 NA 39 61 54 ...
## $ IBB : int 0 0 0 0 0 0 NA NA 5 6 ...
## $ HBP : int 0 0 0 0 0 0 NA 3 3 2 ...
## $ SH : int 0 1 0 0 0 0 NA 6 7 5 ...
## $ SF : int 0 0 0 0 0 0 NA 4 4 7 ...
## $ GIDP : int 0 0 0 0 0 0 NA 13 20 21 ...
## $ G_old : int 11 45 2 5 NA NA NA 122 153 153 ...
head(batting$AB, 5)
## [1] 0 2 0 1 0
head(batting$X2B)
## [1] 0 0 0 0 0 0
colnames(batting)
## [1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
## [7] "G_batting" "AB" "R" "H" "X2B" "X3B"
## [13] "HR" "RBI" "SB" "CS" "BB" "SO"
## [19] "IBB" "HBP" "SH" "SF" "GIDP" "G_old"
Some exploratory plots:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6413 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6413 rows containing non-finite values (stat_bin).
Let’s move on adding more useful statistics to the intial data frame: Batting Average, On Base Percentage, Slugging Percentage.
batting <- batting %>% mutate(AVG=H/AB, OBP=(H+BB+HBP)/(AB+BB+HBP+SF),
OBP=((H+BB+HBP)/(AB+BB+HBP+SF)),
SLG=((H-X2B-X3B-HR)+2*X2B+3*X3B+4*HR)/AB)
str(batting)
## 'data.frame': 97889 obs. of 27 variables:
## $ playerID : chr "aardsda01" "aardsda01" "aardsda01" "aardsda01" ...
## $ yearID : int 2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : chr "SFN" "CHN" "CHA" "BOS" ...
## $ lgID : chr "NL" "NL" "AL" "AL" ...
## $ G : int 11 45 25 47 73 53 1 122 153 153 ...
## $ G_batting: int 11 43 2 5 3 4 NA 122 153 153 ...
## $ AB : int 0 2 0 1 0 0 NA 468 602 609 ...
## $ R : int 0 0 0 0 0 0 NA 58 105 106 ...
## $ H : int 0 0 0 0 0 0 NA 131 189 200 ...
## $ X2B : int 0 0 0 0 0 0 NA 27 37 34 ...
## $ X3B : int 0 0 0 0 0 0 NA 6 9 14 ...
## $ HR : int 0 0 0 0 0 0 NA 13 27 26 ...
## $ RBI : int 0 0 0 0 0 0 NA 69 106 92 ...
## $ SB : int 0 0 0 0 0 0 NA 2 3 2 ...
## $ CS : int 0 0 0 0 0 0 NA 2 1 4 ...
## $ BB : int 0 0 0 0 0 0 NA 28 49 37 ...
## $ SO : int 0 0 0 1 0 0 NA 39 61 54 ...
## $ IBB : int 0 0 0 0 0 0 NA NA 5 6 ...
## $ HBP : int 0 0 0 0 0 0 NA 3 3 2 ...
## $ SH : int 0 1 0 0 0 0 NA 6 7 5 ...
## $ SF : int 0 0 0 0 0 0 NA 4 4 7 ...
## $ GIDP : int 0 0 0 0 0 0 NA 13 20 21 ...
## $ G_old : int 11 45 2 5 NA NA NA 122 153 153 ...
## $ AVG : num NaN 0 NaN 0 NaN ...
## $ OBP : num NaN 0 NaN 0 NaN ...
## $ SLG : num NaN 0 NaN 0 NaN ...
We know we don’t just want the best players, we want the most undervalued players, meaning we will also need to know current salary information.
salary <- read.csv("Salaries.csv")
str(salary)
## 'data.frame': 23956 obs. of 5 variables:
## $ yearID : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
## $ teamID : chr "BAL" "BAL" "BAL" "BAL" ...
## $ lgID : chr "AL" "AL" "AL" "AL" ...
## $ playerID: chr "murraed02" "lynnfr01" "ripkeca01" "lacyle01" ...
## $ salary : int 1472819 1090000 800000 725000 641667 625000 581250 560000 558333 547143 ...
summary(salary)
## yearID teamID lgID playerID
## Min. :1985 Length:23956 Length:23956 Length:23956
## 1st Qu.:1993 Class :character Class :character Class :character
## Median :1999 Mode :character Mode :character Mode :character
## Mean :1999
## 3rd Qu.:2006
## Max. :2013
## salary
## Min. : 0
## 1st Qu.: 250000
## Median : 507950
## Mean : 1864357
## 3rd Qu.: 2100000
## Max. :33000000
Salary data starts in 1985, while Batting data starts in 1871. To facilitate things we can remove the batting data that occurred before 1985.
batting <- filter(batting, yearID >= 1985)
summary(batting)
## playerID yearID stint teamID
## Length:35652 Min. :1985 Min. :1.00 Length:35652
## Class :character 1st Qu.:1993 1st Qu.:1.00 Class :character
## Mode :character Median :2000 Median :1.00 Mode :character
## Mean :2000 Mean :1.08
## 3rd Qu.:2007 3rd Qu.:1.00
## Max. :2013 Max. :4.00
##
## lgID G G_batting AB
## Length:35652 Min. : 1.0 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 14.0 1st Qu.: 4.00 1st Qu.: 3.0
## Mode :character Median : 34.0 Median : 27.00 Median : 47.0
## Mean : 51.7 Mean : 46.28 Mean :144.7
## 3rd Qu.: 77.0 3rd Qu.: 77.00 3rd Qu.:241.0
## Max. :163.0 Max. :163.00 Max. :716.0
## NA's :1406 NA's :4377
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 4.00 Median : 8.00 Median : 1.000 Median : 0.000
## Mean : 19.44 Mean : 37.95 Mean : 7.293 Mean : 0.824
## 3rd Qu.: 30.00 3rd Qu.: 61.00 3rd Qu.:11.000 3rd Qu.: 1.000
## Max. :152.00 Max. :262.00 Max. :59.000 Max. :23.000
## NA's :4377 NA's :4377 NA's :4377 NA's :4377
## HR RBI SB CS
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 3.00 Median : 0.000 Median : 0.000
## Mean : 4.169 Mean : 18.41 Mean : 2.811 Mean : 1.219
## 3rd Qu.: 5.000 3rd Qu.: 27.00 3rd Qu.: 2.000 3rd Qu.: 1.000
## Max. :73.000 Max. :165.00 Max. :110.000 Max. :29.000
## NA's :4377 NA's :4377 NA's :4377 NA's :4377
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 3.00 Median : 12.00 Median : 0.000 Median : 0.000
## Mean : 14.06 Mean : 27.03 Mean : 1.171 Mean : 1.273
## 3rd Qu.: 21.00 3rd Qu.: 42.00 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :232.00 Max. :223.00 Max. :120.000 Max. :35.000
## NA's :4377 NA's :4377 NA's :4378 NA's :4387
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 11.0
## Median : 0.000 Median : 0.000 Median : 1.00 Median : 32.0
## Mean : 1.465 Mean : 1.212 Mean : 3.25 Mean : 49.7
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 5.00 3rd Qu.: 77.0
## Max. :39.000 Max. :17.000 Max. :35.00 Max. :163.0
## NA's :4377 NA's :4378 NA's :4377 NA's :5189
## AVG OBP SLG
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.136 1st Qu.:0.188 1st Qu.:0.167
## Median :0.233 Median :0.296 Median :0.333
## Mean :0.205 Mean :0.262 Mean :0.304
## 3rd Qu.:0.274 3rd Qu.:0.342 3rd Qu.:0.423
## Max. :1.000 Max. :1.000 Max. :4.000
## NA's :8905 NA's :8821 NA's :8905
Now we can merge salary data into batting data frame.
combo <- merge(batting, salary, by=c("playerID", "yearID"))
summary(combo)
## playerID yearID stint teamID.x
## Length:25397 Min. :1985 Min. :1.000 Length:25397
## Class :character 1st Qu.:1993 1st Qu.:1.000 Class :character
## Mode :character Median :1999 Median :1.000 Mode :character
## Mean :1999 Mean :1.098
## 3rd Qu.:2006 3rd Qu.:1.000
## Max. :2013 Max. :4.000
##
## lgID.x G G_batting AB
## Length:25397 Min. : 1.00 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 26.00 1st Qu.: 8.00 1st Qu.: 5.0
## Mode :character Median : 50.00 Median : 42.00 Median : 85.0
## Mean : 64.06 Mean : 57.58 Mean :182.4
## 3rd Qu.:101.00 3rd Qu.:101.00 3rd Qu.:336.0
## Max. :163.00 Max. :163.00 Max. :716.0
## NA's :906 NA's :2661
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 9.00 Median : 19.00 Median : 3.000 Median : 0.000
## Mean : 24.71 Mean : 48.18 Mean : 9.276 Mean : 1.033
## 3rd Qu.: 43.00 3rd Qu.: 87.25 3rd Qu.:16.000 3rd Qu.: 1.000
## Max. :152.00 Max. :262.00 Max. :59.000 Max. :23.000
## NA's :2661 NA's :2661 NA's :2661 NA's :2661
## HR RBI SB CS
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 1.000 Median : 8.00 Median : 0.000 Median : 0.00
## Mean : 5.369 Mean : 23.56 Mean : 3.568 Mean : 1.54
## 3rd Qu.: 7.000 3rd Qu.: 39.00 3rd Qu.: 3.000 3rd Qu.: 2.00
## Max. :73.000 Max. :165.00 Max. :110.000 Max. :29.00
## NA's :2661 NA's :2661 NA's :2661 NA's :2661
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 6.00 Median : 20.00 Median : 0.000 Median : 0.000
## Mean : 17.98 Mean : 33.52 Mean : 1.533 Mean : 1.614
## 3rd Qu.: 29.00 3rd Qu.: 55.00 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :232.00 Max. :223.00 Max. :120.000 Max. :35.000
## NA's :2661 NA's :2661 NA's :2662 NA's :2670
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 20.00
## Median : 0.000 Median : 0.000 Median : 2.000 Median : 47.00
## Mean : 1.786 Mean : 1.554 Mean : 4.127 Mean : 61.43
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 7.000 3rd Qu.:101.00
## Max. :39.000 Max. :17.000 Max. :35.000 Max. :163.00
## NA's :2661 NA's :2662 NA's :2661 NA's :3414
## AVG OBP SLG teamID.y
## Min. :0.000 Min. :0.000 Min. :0.000 Length:25397
## 1st Qu.:0.160 1st Qu.:0.208 1st Qu.:0.200 Class :character
## Median :0.242 Median :0.305 Median :0.351 Mode :character
## Mean :0.212 Mean :0.270 Mean :0.317
## 3rd Qu.:0.276 3rd Qu.:0.346 3rd Qu.:0.432
## Max. :1.000 Max. :1.000 Max. :4.000
## NA's :5618 NA's :5562 NA's :5618
## lgID.y salary
## Length:25397 Min. : 0
## Class :character 1st Qu.: 255000
## Mode :character Median : 550000
## Mean : 1879256
## 3rd Qu.: 2150000
## Max. :33000000
##
Analyzing lost players and their relevant statistics for year 2001.
lost_pay <- combo %>%
filter(playerID == "giambja01" | playerID == "damonjo01" | playerID == "saenzol01") %>%
filter(yearID == 2001) %>%
select(playerID,H,X2B,X3B,HR,OBP,SLG,AVG,AB)
Finding replacements whom satisfy the conditions:
The total combined salary of the three players can not exceed 15 million dollars. Their combined number of At Bats (AB) needs to be equal to or greater than the lost players. *Their mean OBP had to equal to or greater than the mean OBP of the lost players
Let’s first take a look at the merged data frame.
ggplot(combo, aes(salary/1000))+
geom_histogram()+
labs(title="Salaries histogram", x="Earinings in thousands of dollars", y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(combo, aes(AB, OBP, colour = salary/1000))+
geom_point(alpha=.5)+
labs(title="Selection Variables", x="At Bat", y="OBP")
## Warning: Removed 5562 rows containing missing values (geom_point).
In order to balance the combination, we will set the following caps to filter the merged data frame. Salary limit is 15 million. It makes sense to cut players who individually earn 10 million. At bat figure will be set to half the mean of the 3 lost players. OBP minimum value will be set as a third of the mean from lost players.
SAL_cap <- 10000000
AB_lostpl <- sum(lost_pay$AB)
OBP_mean_lostpl <- mean(lost_pay$OBP)
AB_cap <- AB_lostpl/6
OBP_cap <- OBP_mean_lostpl/3
combo_k <- combo %>%
filter(yearID==2001) %>%
filter(playerID != "giambja01" & playerID != "damonjo01" & playerID != "saenzol01") %>%
select(playerID, yearID, AB, OBP, SLG, AVG, salary) %>%
filter(salary <= SAL_cap)
combo_k <- combo_k %>%
filter(AB >= AB_cap) %>%
filter(OBP >= OBP_cap)
ggplot(combo_k, aes(salary/1000))+
geom_histogram()+
labs(title="Salaries histogram", x="Earinings in thousands of dollars", y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(combo_k, aes(AB, OBP, colour = salary/1000))+
geom_point(alpha=.5, size = 3)+
labs(title="Selection Variables", x="At Bat", y="OBP")
summary(combo_k)
## playerID yearID AB OBP
## Length:247 Min. :2001 Min. :245.0 Min. :0.2438
## Class :character 1st Qu.:2001 1st Qu.:375.0 1st Qu.:0.3169
## Mode :character Median :2001 Median :456.0 Median :0.3346
## Mean :2001 Mean :452.7 Mean :0.3381
## 3rd Qu.:2001 3rd Qu.:542.0 3rd Qu.:0.3633
## Max. :2001 Max. :692.0 Max. :0.4317
## SLG AVG salary
## Min. :0.2853 Min. :0.1870 Min. : 200000
## 1st Qu.:0.3757 1st Qu.:0.2531 1st Qu.: 345000
## Median :0.4311 Median :0.2699 Median : 2350000
## Mean :0.4364 Mean :0.2716 Mean : 2933150
## 3rd Qu.:0.4768 3rd Qu.:0.2908 3rd Qu.: 4991500
## Max. :0.6880 Max. :0.3497 Max. :10000000
The plots shown above allow us to think that there are a significant number of good candidates to substitute the lost players. We need players with an OBP mean of above 0.3639 and an at bat combined score of above 1469. Let’s try a simple approach. We can cut the At Bat threshold at 500, and OBP at .35, this way we ca be certain that remaining players will meet performance cap.
combo_k1 <- combo_k%>%
filter(OBP >= .35 & AB >=500)
Now we have ended up with a “short list” of 45 players. Let’s take a look at their salary histogram.
ggplot(combo_k1, aes(salary/1000))+
geom_histogram()+
labs(title="Histogram salaries-SHORT LIST",
x="Earinings in thousands of dollars", y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can confidently say that we can pull out at leas one set of three players that could satisfy the conditions. We can arrange the short list based on the OBP score.
combo_k1 <- arrange(combo_k1, desc(OBP))
head(combo_k1, 10)
## playerID yearID AB OBP SLG AVG salary
## 1 heltoto01 2001 587 0.4316547 0.6848382 0.3356048 4950000
## 2 berkmla01 2001 577 0.4302326 0.6204506 0.3310225 305000
## 3 gonzalu01 2001 609 0.4285714 0.6880131 0.3251232 4833333
## 4 sheffga01 2001 515 0.4174757 0.5825243 0.3106796 9916667
## 5 thomeji01 2001 526 0.4161491 0.6235741 0.2908745 7875000
## 6 alomaro01 2001 575 0.4146707 0.5408696 0.3356522 7750000
## 7 edmonji01 2001 500 0.4102142 0.5640000 0.3040000 6333333
## 8 gilesbr02 2001 576 0.4035608 0.5902778 0.3090278 7333333
## 9 pujolal01 2001 590 0.4029630 0.6101695 0.3288136 200000
## 10 olerujo01 2001 572 0.4011799 0.4720280 0.3024476 6700000
A look at the first three players in the above table already give us a feasible answer. Every player in it meets the performance threshold. Lets make a simple summation to see if they can be afforded with the 15 million cap.
play1 <- combo_k1[1, 1]
play2 <- combo_k1[2, 1]
play3 <- combo_k1[3, 1]
z_sal <- sum(combo_k1[1:3, 7])
paste("Combined salayes of", play1,",", play2, ",", "and", play3, "is equal to", z_sal)
## [1] "Combined salayes of heltoto01 , berkmla01 , and gonzalu01 is equal to 10088333"
We can now conclude that players heltoto01, berkmla01, and gonzalu01 have a combined At Bat score greater than 1469, an OBP with a mean grater that .3639 and altogether cost less than 15 million dollars. We can confidently recommend their hiring, as they are also the ones with the highest OBP among the short list.