In this exercise I will walk through the chapter 1 of Analyzing Baseball Data with R.
First step will be to download the data. The data can be found at www.seanlahman.com/baseball-archive/statistics. I will be using the most recent version which includes data from 2016 version. We first want to download the file. You could either manually go to the page and download the file, or you can use the download.file argument in R to have it automatically downloaded, each time the script is run within R. Using the comma delimited file (csv) is highly recommended.
#2016 Data
data_url<-"http://seanlahman.com/files/database/baseballdatabank-2017.1.zip"
download.file(data_url, "baseball.zip")
#unzip the file
zipF<- "~\\baseball.zip"
#outDir<-"~\\Anaylizing Baseball with R\\chapter1"
unzip(zipF)
#show files created
#list.files(outDir)
We can see that we created a new file in our folder called “baseballdatabank-2017.1”. Inside the new folder, there is another subfolder called “core” that has the data we are looking for. To see the files in “core” we can use the list.files command.
list.files("~\\Anaylizing Baseball with R\\chapter1\\baseballdatabank-2017.1\\core")
## [1] "AllstarFull.csv" "Appearances.csv"
## [3] "AwardsManagers.csv" "AwardsPlayers.csv"
## [5] "AwardsShareManagers.csv" "AwardsSharePlayers.csv"
## [7] "baseball.zip" "baseballdatabank-2017.1"
## [9] "Batting.csv" "BattingPost.csv"
## [11] "chapter1.html" "chapter1.Rmd"
## [13] "CollegePlaying.csv" "Fielding.csv"
## [15] "FieldingOF.csv" "FieldingOFsplit.csv"
## [17] "FieldingPost.csv" "HallOfFame.csv"
## [19] "HomeGames.csv" "Managers.csv"
## [21] "ManagersHalf.csv" "Master.csv"
## [23] "Parks.csv" "Pitching.csv"
## [25] "PitchingPost.csv" "readme2014.txt"
## [27] "Salaries.csv" "Schools.csv"
## [29] "SeriesPost.csv" "Teams.csv"
## [31] "TeamsFranchises.csv" "TeamsHalf.csv"
Lets take a look at the Maser.csv file. This file contains a bunch of bibliography data of players and managers.
master<-read.csv("Master.csv")
head(master,10)
## playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01 1981 12 27 USA CO
## 2 aaronha01 1934 2 5 USA AL
## 3 aaronto01 1939 8 5 USA AL
## 4 aasedo01 1954 9 8 USA CA
## 5 abadan01 1972 8 25 USA FL
## 6 abadfe01 1985 12 17 D.R. La Romana
## 7 abadijo01 1850 11 4 USA PA
## 8 abbated01 1877 4 15 USA PA
## 9 abbeybe01 1869 11 11 USA VT
## 10 abbeych01 1866 10 14 USA NE
## birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 Denver NA NA NA
## 2 Mobile NA NA NA
## 3 Mobile 1984 8 16 USA GA
## 4 Orange NA NA NA
## 5 Palm Beach NA NA NA
## 6 La Romana NA NA NA
## 7 Philadelphia 1905 5 17 USA NJ
## 8 Latrobe 1957 1 6 USA FL
## 9 Essex 1962 6 11 USA VT
## 10 Falls City 1926 4 27 USA CA
## deathCity nameFirst nameLast nameGiven weight height
## 1 David Aardsma David Allan 215 75
## 2 Hank Aaron Henry Louis 180 72
## 3 Atlanta Tommie Aaron Tommie Lee 190 75
## 4 Don Aase Donald William 190 75
## 5 Andy Abad Fausto Andres 184 73
## 6 Fernando Abad Fernando Antonio 220 73
## 7 Pemberton John Abadie John W. 192 72
## 8 Fort Lauderdale Ed Abbaticchio Edward James 170 71
## 9 Colchester Bert Abbey Bert Wood 175 71
## 10 San Francisco Charlie Abbey Charles S. 169 68
## bats throws debut finalGame retroID bbrefID
## 1 R R 2004-04-06 2015-08-23 aardd001 aardsda01
## 2 R R 1954-04-13 1976-10-03 aaroh101 aaronha01
## 3 R R 1962-04-10 1971-09-26 aarot101 aaronto01
## 4 R R 1977-07-26 1990-10-03 aased001 aasedo01
## 5 L L 2001-09-10 2006-04-13 abada001 abadan01
## 6 L L 2010-07-28 2016-09-25 abadf001 abadfe01
## 7 R R 1875-04-26 1875-06-10 abadj101 abadijo01
## 8 R R 1897-09-04 1910-09-15 abbae101 abbated01
## 9 R R 1892-06-14 1896-09-23 abbeb101 abbeybe01
## 10 L L 1893-08-16 1897-08-19 abbec101 abbeych01
We can see that Hank Aaron was born February 5, 1934, in Mobile, Alabama, his full name is Henry Louis Aaron, and some other facts about the Hall of Fame player.
By using the “head” function we can see the first 10 rows of the data. To get a list of the columns we can use colnames(master)
colnames(master)
## [1] "playerID" "birthYear" "birthMonth" "birthDay"
## [5] "birthCountry" "birthState" "birthCity" "deathYear"
## [9] "deathMonth" "deathDay" "deathCountry" "deathState"
## [13] "deathCity" "nameFirst" "nameLast" "nameGiven"
## [17] "weight" "height" "bats" "throws"
## [21] "debut" "finalGame" "retroID" "bbrefID"
Now lest take a look at the batting table.
batting<-read.csv("Batting.csv")
head(batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO <NA> 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 <NA> 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 <NA> 29 137 28 40 4 5 0 19 3 1 2
## 4 allisdo01 1871 1 WS3 <NA> 27 133 28 44 10 2 2 27 1 1 0
## 5 ansonca01 1871 1 RC1 <NA> 25 120 29 39 11 3 0 16 6 2 2
## 6 armstbo01 1871 1 FW1 <NA> 12 49 9 11 2 1 0 5 0 1 0
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA NA
## 2 0 NA NA NA NA NA
## 3 5 NA NA NA NA NA
## 4 2 NA NA NA NA NA
## 5 1 NA NA NA NA NA
## 6 1 NA NA NA NA NA
We can look up information on a player in a number of different ways. We can select observations or use the subset function. First lets look up Lou Brock and Hank Aaron in the Master table to get their playerID’s so we can look up their batting stats for 1964. (We can create various types of databases in R, that would connect each table, but lets hold off on that for a moment.)
subset(master, master$nameFirst == "Hank"& master$nameLast =="Aaron" )
## playerID birthYear birthMonth birthDay birthCountry birthState
## 2 aaronha01 1934 2 5 USA AL
## birthCity deathYear deathMonth deathDay deathCountry deathState
## 2 Mobile NA NA NA
## deathCity nameFirst nameLast nameGiven weight height bats throws
## 2 Hank Aaron Henry Louis 180 72 R R
## debut finalGame retroID bbrefID
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01
subset(master, master$nameFirst == "Lou"& master$nameLast =="Brock" )
## playerID birthYear birthMonth birthDay birthCountry birthState
## 1936 brocklo01 1939 6 18 USA AR
## birthCity deathYear deathMonth deathDay deathCountry deathState
## 1936 El Dorado NA NA NA
## deathCity nameFirst nameLast nameGiven weight height bats throws
## 1936 Lou Brock Louis Clark 170 71 L L
## debut finalGame retroID bbrefID
## 1936 1961-09-10 1979-09-30 brocl102 brocklo01
For Hank Aaron the playerId is aaronha01, and for Lou Brock his playerID is bocklo01. Lets Look up their stats for 1964.
subset(batting, batting$playerID == "aaronha01"& batting$yearID=="1964" )
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB
## 43458 aaronha01 1964 1 ML1 NL 145 570 103 187 30 2 24 95 22
## CS BB SO IBB HBP SH SF GIDP
## 43458 4 62 46 9 0 0 2 22
subset(batting, batting$playerID == "brocklo01"& batting$yearID=="1964" )
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB
## 43538 brocklo01 1964 1 CHN NL 52 215 30 54 9 2 2 14 10
## 43539 brocklo01 1964 2 SLN NL 103 419 81 146 21 9 12 44 33
## CS BB SO IBB HBP SH SF GIDP
## 43538 3 13 40 0 2 1 0 3
## 43539 15 27 87 0 2 12 4 2
We can see that Lou Brock played for 2 different teams in 1964.
Now lets move onto the Pitching Table. We will look up Babe Ruth’s pitching and batting stats. Yes he was a great home run slugger, but he became such an effective batter, he was moved to the outfield.
pitching<-read.csv("Pitching.csv")
subset(batting, batting$playerID == "ruthba01" )
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB
## 15050 ruthba01 1914 1 BOS AL 5 10 1 2 1 0 0 2 0
## 15835 ruthba01 1915 1 BOS AL 42 92 16 29 10 1 4 21 0
## 16456 ruthba01 1916 1 BOS AL 67 136 18 37 5 3 3 15 0
## 16977 ruthba01 1917 1 BOS AL 52 123 14 40 6 3 2 12 0
## 17491 ruthba01 1918 1 BOS AL 95 317 50 95 26 11 11 66 6
## 18003 ruthba01 1919 1 BOS AL 130 432 103 139 34 12 29 114 7
## 18520 ruthba01 1920 1 NYA AL 142 457 158 172 36 9 54 137 14
## 19040 ruthba01 1921 1 NYA AL 152 540 177 204 44 16 59 171 17
## 19574 ruthba01 1922 1 NYA AL 110 406 94 128 24 8 35 99 2
## 20097 ruthba01 1923 1 NYA AL 152 522 151 205 45 13 41 131 17
## 20637 ruthba01 1924 1 NYA AL 153 529 143 200 39 7 46 121 9
## 21189 ruthba01 1925 1 NYA AL 98 359 61 104 12 2 25 66 2
## 21719 ruthba01 1926 1 NYA AL 152 495 139 184 30 5 47 150 11
## 22257 ruthba01 1927 1 NYA AL 151 540 158 192 29 8 60 164 7
## 22787 ruthba01 1928 1 NYA AL 154 536 163 173 29 8 54 142 4
## 23331 ruthba01 1929 1 NYA AL 135 499 121 172 26 6 46 154 5
## 23863 ruthba01 1930 1 NYA AL 145 518 150 186 28 9 49 153 10
## 24368 ruthba01 1931 1 NYA AL 145 534 149 199 31 3 46 163 5
## 24897 ruthba01 1932 1 NYA AL 133 457 120 156 13 5 41 137 2
## 25394 ruthba01 1933 1 NYA AL 137 459 97 138 21 3 34 103 4
## 25908 ruthba01 1934 1 NYA AL 125 365 78 105 17 4 22 84 1
## 26422 ruthba01 1935 1 BSN NL 28 72 13 13 0 0 6 12 0
## CS BB SO IBB HBP SH SF GIDP
## 15050 NA 0 4 NA 0 0 NA NA
## 15835 NA 9 23 NA 0 2 NA NA
## 16456 NA 10 23 NA 0 4 NA NA
## 16977 NA 12 18 NA 0 7 NA NA
## 17491 NA 58 58 NA 2 3 NA NA
## 18003 NA 101 58 NA 6 3 NA NA
## 18520 14 150 80 NA 3 5 NA NA
## 19040 13 145 81 NA 4 4 NA NA
## 19574 5 84 80 NA 1 4 NA NA
## 20097 21 170 93 NA 4 3 NA NA
## 20637 13 142 81 NA 4 6 NA NA
## 21189 4 59 68 NA 2 6 NA NA
## 21719 9 144 76 NA 3 10 NA NA
## 22257 6 137 89 NA 0 14 NA NA
## 22787 5 137 87 NA 3 8 NA NA
## 23331 3 72 60 NA 3 13 NA NA
## 23863 10 136 61 NA 1 21 NA NA
## 24368 4 128 51 NA 1 0 NA NA
## 24897 2 130 62 NA 2 0 NA NA
## 25394 5 114 90 NA 2 0 NA NA
## 25908 3 104 63 NA 2 0 NA NA
## 26422 NA 20 24 NA 0 0 NA 2
subset(pitching, pitching$playerID == "ruthba01" )
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER
## 5363 ruthba01 1914 1 BOS AL 2 1 4 3 1 0 0 69 21 10
## 5671 ruthba01 1915 1 BOS AL 18 8 32 28 16 1 0 653 166 59
## 5894 ruthba01 1916 1 BOS AL 23 12 44 41 23 9 1 971 230 63
## 6073 ruthba01 1917 1 BOS AL 24 13 41 38 35 6 2 979 244 73
## 6282 ruthba01 1918 1 BOS AL 13 7 20 19 18 1 0 499 125 41
## 6499 ruthba01 1919 1 BOS AL 9 5 17 15 12 0 1 400 148 44
## 6713 ruthba01 1920 1 NYA AL 1 0 1 1 0 0 0 12 3 2
## 6925 ruthba01 1921 1 NYA AL 2 0 2 1 0 0 0 27 14 9
## 8881 ruthba01 1930 1 NYA AL 1 0 1 1 1 0 0 27 11 3
## 9485 ruthba01 1933 1 NYA AL 1 0 1 1 1 0 0 27 12 5
## HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP
## 5363 1 7 3 0.23 3.91 NA 0 0 0 100 0 12 NA NA NA
## 5671 3 85 112 0.21 2.44 NA 9 6 1 895 3 80 NA NA NA
## 5894 0 118 170 0.20 1.75 NA 3 8 1 1300 3 83 NA NA NA
## 6073 2 108 128 0.21 2.01 NA 5 11 0 1313 2 91 NA NA NA
## 6282 1 49 40 0.21 2.22 NA 3 2 1 660 0 51 NA NA NA
## 6499 2 58 30 0.29 2.97 NA 5 2 1 591 2 59 NA NA NA
## 6713 0 2 0 0.20 4.50 NA 0 0 0 17 0 4 NA NA NA
## 6925 1 9 2 0.35 9.00 NA 0 0 0 49 1 10 NA NA NA
## 8881 0 2 3 0.30 3.00 NA 0 0 0 39 0 3 NA NA NA
## 9485 0 3 0 0.30 5.00 NA 0 0 0 42 0 5 NA NA NA
We will move onto Fielding. Using the books reference of using Bill James’ proposed fielding range, we will sum his putouts, and assists, then divide the sums by games played.
fielding<-read.csv("Fielding.csv")
ruth_fielding<-subset(fielding, fielding$playerID == "ruthba01" )
ruth_fielding$yearID<-as.factor(ruth_fielding$yearID)
#Sum put outs, assists, games played by year
Ruth_range_data<-ddply(ruth_fielding,~yearID,summarise,Put_Outs=sum(PO),Assists=sum(A), Games=sum(G))
#add PO + A, then divide by G
Ruth_range<-round((Ruth_range_data$Put_Outs+ Ruth_range_data$Assists)/Ruth_range_data$Games, 2)
Ruth_range[5:22]
## [1] 3.72 2.43 2.02 2.41 2.16 2.89 2.36 2.27 2.18 2.26 2.03 1.84 1.93 1.73
## [15] 1.72 1.73 1.80 1.54
*Note there seems to be a difference in Lahman’s stats and the numbers used in the book. I’ve double checked the calculations with Lahman’s stats, and they are correct.
teams<-read.csv("Teams.csv")
Yankees27<-subset(teams, teams$yearID==1927 & teams$franchID=="NYY")
Yankees27
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
## 825 1927 AL NYA NYY 1 155 77 110 44
## LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA
## 825 Y Y 975 5347 1644 291 103 158 635 605 90 64 NA NA 599 494 3.2
## CG SHO SV IPouts HA HRA BBA SOA E DP FP name
## 825 82 11 20 4167 1403 42 409 431 196 123 0.969 New York Yankees
## park attendance BPF PPF teamIDBR teamIDlahman45
## 825 Yankee Stadium I 1164015 98 94 NYY NYA
## teamIDretro
## 825 NYA
Q. What is the average number of hume runs per game recorded in each decade? Does the rate of strikeouts show any correlation with the rate of home runs?
#Home runs per decade
Homers<-ddply(teams,~yearID,summarise,Home_Runs=sum(HR), StrikeOuts=sum(SO), HRA=sum(HRA), Games=sum(G/2))
x<- seq(1880, 2016, by = 10)
Homers$Decade[Homers$yearID < 1880] <- "1870"
for (year in x){
Homers$Decade[Homers$yearID >= year & Homers$yearID < (year+10)] <- year
}
#Homers
Decades<-ddply(Homers,~Decade,summarise,HR=sum(Home_Runs),K=sum(StrikeOuts),HRA=sum(HRA), Games=sum(Games))
#Decades
Decades$HR_G<-(Decades$HR/Decades$Games)
Decades$K_G<-(Decades$K/Decades$Games)
Decades
## Decade HR K HRA Games HR_G K_G
## 1 1870 362 4906 362 2031 0.1782373 2.415559
## 2 1880 3774 64502 3774 8759 0.4308711 7.364083
## 3 1890 4755 50021 4755 9457 0.5028022 5.289310
## 4 1900 3104 NA 3104 11340 0.2737213 NA
## 5 1910 4533 NA 4533 13324 0.3402131 NA
## 6 1920 9894 69373 9894 12323 0.8028889 5.629554
## 7 1930 13442 81675 13442 12311 1.0918691 6.634311
## 8 1940 12958 87873 12958 12376 1.0470265 7.100275
## 9 1950 20860 108863 20860 12374 1.6857928 8.797721
## 10 1960 26169 182438 26169 15961 1.6395589 11.430236
## 11 1970 29543 203741 29543 19806 1.4916187 10.286832
## 12 1980 32942 218147 32942 20337 1.6198063 10.726607
## 13 1990 41349 265539 41349 21594 1.9148375 12.296888
## 14 2000 52148 318746 52148 24291 2.1468033 13.121979
## 15 2010 33465 255799 33465 17007 1.9677192 15.040807
plot(Decades$Decade, Decades$HR_G)
plot(Decades$Decade, Decades$K_G)