Working through Chapter 1

In this exercise I will walk through the chapter 1 of Analyzing Baseball Data with R.

First step will be to download the data. The data can be found at www.seanlahman.com/baseball-archive/statistics. I will be using the most recent version which includes data from 2016 version. We first want to download the file. You could either manually go to the page and download the file, or you can use the download.file argument in R to have it automatically downloaded, each time the script is run within R. Using the comma delimited file (csv) is highly recommended.

#2016 Data
data_url<-"http://seanlahman.com/files/database/baseballdatabank-2017.1.zip"
download.file(data_url, "baseball.zip")

#unzip the file
zipF<- "~\\baseball.zip"
#outDir<-"~\\Anaylizing Baseball with R\\chapter1"
unzip(zipF)

#show files created
#list.files(outDir)

We can see that we created a new file in our folder called “baseballdatabank-2017.1”. Inside the new folder, there is another subfolder called “core” that has the data we are looking for. To see the files in “core” we can use the list.files command.

list.files("~\\Anaylizing Baseball with R\\chapter1\\baseballdatabank-2017.1\\core")
##  [1] "AllstarFull.csv"         "Appearances.csv"        
##  [3] "AwardsManagers.csv"      "AwardsPlayers.csv"      
##  [5] "AwardsShareManagers.csv" "AwardsSharePlayers.csv" 
##  [7] "baseball.zip"            "baseballdatabank-2017.1"
##  [9] "Batting.csv"             "BattingPost.csv"        
## [11] "chapter1.html"           "chapter1.Rmd"           
## [13] "CollegePlaying.csv"      "Fielding.csv"           
## [15] "FieldingOF.csv"          "FieldingOFsplit.csv"    
## [17] "FieldingPost.csv"        "HallOfFame.csv"         
## [19] "HomeGames.csv"           "Managers.csv"           
## [21] "ManagersHalf.csv"        "Master.csv"             
## [23] "Parks.csv"               "Pitching.csv"           
## [25] "PitchingPost.csv"        "readme2014.txt"         
## [27] "Salaries.csv"            "Schools.csv"            
## [29] "SeriesPost.csv"          "Teams.csv"              
## [31] "TeamsFranchises.csv"     "TeamsHalf.csv"

Lets take a look at the Maser.csv file. This file contains a bunch of bibliography data of players and managers.

master<-read.csv("Master.csv")
head(master,10)
##     playerID birthYear birthMonth birthDay birthCountry birthState
## 1  aardsda01      1981         12       27          USA         CO
## 2  aaronha01      1934          2        5          USA         AL
## 3  aaronto01      1939          8        5          USA         AL
## 4   aasedo01      1954          9        8          USA         CA
## 5   abadan01      1972          8       25          USA         FL
## 6   abadfe01      1985         12       17         D.R.  La Romana
## 7  abadijo01      1850         11        4          USA         PA
## 8  abbated01      1877          4       15          USA         PA
## 9  abbeybe01      1869         11       11          USA         VT
## 10 abbeych01      1866         10       14          USA         NE
##       birthCity deathYear deathMonth deathDay deathCountry deathState
## 1        Denver        NA         NA       NA                        
## 2        Mobile        NA         NA       NA                        
## 3        Mobile      1984          8       16          USA         GA
## 4        Orange        NA         NA       NA                        
## 5    Palm Beach        NA         NA       NA                        
## 6     La Romana        NA         NA       NA                        
## 7  Philadelphia      1905          5       17          USA         NJ
## 8       Latrobe      1957          1        6          USA         FL
## 9         Essex      1962          6       11          USA         VT
## 10   Falls City      1926          4       27          USA         CA
##          deathCity nameFirst    nameLast        nameGiven weight height
## 1                      David     Aardsma      David Allan    215     75
## 2                       Hank       Aaron      Henry Louis    180     72
## 3          Atlanta    Tommie       Aaron       Tommie Lee    190     75
## 4                        Don        Aase   Donald William    190     75
## 5                       Andy        Abad    Fausto Andres    184     73
## 6                   Fernando        Abad Fernando Antonio    220     73
## 7        Pemberton      John      Abadie          John W.    192     72
## 8  Fort Lauderdale        Ed Abbaticchio     Edward James    170     71
## 9       Colchester      Bert       Abbey        Bert Wood    175     71
## 10   San Francisco   Charlie       Abbey       Charles S.    169     68
##    bats throws      debut  finalGame  retroID   bbrefID
## 1     R      R 2004-04-06 2015-08-23 aardd001 aardsda01
## 2     R      R 1954-04-13 1976-10-03 aaroh101 aaronha01
## 3     R      R 1962-04-10 1971-09-26 aarot101 aaronto01
## 4     R      R 1977-07-26 1990-10-03 aased001  aasedo01
## 5     L      L 2001-09-10 2006-04-13 abada001  abadan01
## 6     L      L 2010-07-28 2016-09-25 abadf001  abadfe01
## 7     R      R 1875-04-26 1875-06-10 abadj101 abadijo01
## 8     R      R 1897-09-04 1910-09-15 abbae101 abbated01
## 9     R      R 1892-06-14 1896-09-23 abbeb101 abbeybe01
## 10    L      L 1893-08-16 1897-08-19 abbec101 abbeych01

We can see that Hank Aaron was born February 5, 1934, in Mobile, Alabama, his full name is Henry Louis Aaron, and some other facts about the Hall of Fame player.

By using the “head” function we can see the first 10 rows of the data. To get a list of the columns we can use colnames(master)

colnames(master)
##  [1] "playerID"     "birthYear"    "birthMonth"   "birthDay"    
##  [5] "birthCountry" "birthState"   "birthCity"    "deathYear"   
##  [9] "deathMonth"   "deathDay"     "deathCountry" "deathState"  
## [13] "deathCity"    "nameFirst"    "nameLast"     "nameGiven"   
## [17] "weight"       "height"       "bats"         "throws"      
## [21] "debut"        "finalGame"    "retroID"      "bbrefID"

Now lest take a look at the batting table.

batting<-read.csv("Batting.csv")
head(batting)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO <NA>  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1 <NA> 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1 <NA> 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3 <NA> 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1 <NA> 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1 <NA> 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA

We can look up information on a player in a number of different ways. We can select observations or use the subset function. First lets look up Lou Brock and Hank Aaron in the Master table to get their playerID’s so we can look up their batting stats for 1964. (We can create various types of databases in R, that would connect each table, but lets hold off on that for a moment.)

subset(master, master$nameFirst == "Hank"&  master$nameLast =="Aaron" )
##    playerID birthYear birthMonth birthDay birthCountry birthState
## 2 aaronha01      1934          2        5          USA         AL
##   birthCity deathYear deathMonth deathDay deathCountry deathState
## 2    Mobile        NA         NA       NA                        
##   deathCity nameFirst nameLast   nameGiven weight height bats throws
## 2                Hank    Aaron Henry Louis    180     72    R      R
##        debut  finalGame  retroID   bbrefID
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01
subset(master, master$nameFirst == "Lou"&  master$nameLast =="Brock" )
##       playerID birthYear birthMonth birthDay birthCountry birthState
## 1936 brocklo01      1939          6       18          USA         AR
##      birthCity deathYear deathMonth deathDay deathCountry deathState
## 1936 El Dorado        NA         NA       NA                        
##      deathCity nameFirst nameLast   nameGiven weight height bats throws
## 1936                 Lou    Brock Louis Clark    170     71    L      L
##           debut  finalGame  retroID   bbrefID
## 1936 1961-09-10 1979-09-30 brocl102 brocklo01

For Hank Aaron the playerId is aaronha01, and for Lou Brock his playerID is bocklo01. Lets Look up their stats for 1964.

subset(batting, batting$playerID == "aaronha01"&  batting$yearID=="1964" )
##        playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB
## 43458 aaronha01   1964     1    ML1   NL 145 570 103 187  30   2 24  95 22
##       CS BB SO IBB HBP SH SF GIDP
## 43458  4 62 46   9   0  0  2   22
subset(batting, batting$playerID == "brocklo01"&  batting$yearID=="1964" )
##        playerID yearID stint teamID lgID   G  AB  R   H X2B X3B HR RBI SB
## 43538 brocklo01   1964     1    CHN   NL  52 215 30  54   9   2  2  14 10
## 43539 brocklo01   1964     2    SLN   NL 103 419 81 146  21   9 12  44 33
##       CS BB SO IBB HBP SH SF GIDP
## 43538  3 13 40   0   2  1  0    3
## 43539 15 27 87   0   2 12  4    2

We can see that Lou Brock played for 2 different teams in 1964.

Now lets move onto the Pitching Table. We will look up Babe Ruth’s pitching and batting stats. Yes he was a great home run slugger, but he became such an effective batter, he was moved to the outfield.

pitching<-read.csv("Pitching.csv")
subset(batting, batting$playerID == "ruthba01" )
##       playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB
## 15050 ruthba01   1914     1    BOS   AL   5  10   1   2   1   0  0   2  0
## 15835 ruthba01   1915     1    BOS   AL  42  92  16  29  10   1  4  21  0
## 16456 ruthba01   1916     1    BOS   AL  67 136  18  37   5   3  3  15  0
## 16977 ruthba01   1917     1    BOS   AL  52 123  14  40   6   3  2  12  0
## 17491 ruthba01   1918     1    BOS   AL  95 317  50  95  26  11 11  66  6
## 18003 ruthba01   1919     1    BOS   AL 130 432 103 139  34  12 29 114  7
## 18520 ruthba01   1920     1    NYA   AL 142 457 158 172  36   9 54 137 14
## 19040 ruthba01   1921     1    NYA   AL 152 540 177 204  44  16 59 171 17
## 19574 ruthba01   1922     1    NYA   AL 110 406  94 128  24   8 35  99  2
## 20097 ruthba01   1923     1    NYA   AL 152 522 151 205  45  13 41 131 17
## 20637 ruthba01   1924     1    NYA   AL 153 529 143 200  39   7 46 121  9
## 21189 ruthba01   1925     1    NYA   AL  98 359  61 104  12   2 25  66  2
## 21719 ruthba01   1926     1    NYA   AL 152 495 139 184  30   5 47 150 11
## 22257 ruthba01   1927     1    NYA   AL 151 540 158 192  29   8 60 164  7
## 22787 ruthba01   1928     1    NYA   AL 154 536 163 173  29   8 54 142  4
## 23331 ruthba01   1929     1    NYA   AL 135 499 121 172  26   6 46 154  5
## 23863 ruthba01   1930     1    NYA   AL 145 518 150 186  28   9 49 153 10
## 24368 ruthba01   1931     1    NYA   AL 145 534 149 199  31   3 46 163  5
## 24897 ruthba01   1932     1    NYA   AL 133 457 120 156  13   5 41 137  2
## 25394 ruthba01   1933     1    NYA   AL 137 459  97 138  21   3 34 103  4
## 25908 ruthba01   1934     1    NYA   AL 125 365  78 105  17   4 22  84  1
## 26422 ruthba01   1935     1    BSN   NL  28  72  13  13   0   0  6  12  0
##       CS  BB SO IBB HBP SH SF GIDP
## 15050 NA   0  4  NA   0  0 NA   NA
## 15835 NA   9 23  NA   0  2 NA   NA
## 16456 NA  10 23  NA   0  4 NA   NA
## 16977 NA  12 18  NA   0  7 NA   NA
## 17491 NA  58 58  NA   2  3 NA   NA
## 18003 NA 101 58  NA   6  3 NA   NA
## 18520 14 150 80  NA   3  5 NA   NA
## 19040 13 145 81  NA   4  4 NA   NA
## 19574  5  84 80  NA   1  4 NA   NA
## 20097 21 170 93  NA   4  3 NA   NA
## 20637 13 142 81  NA   4  6 NA   NA
## 21189  4  59 68  NA   2  6 NA   NA
## 21719  9 144 76  NA   3 10 NA   NA
## 22257  6 137 89  NA   0 14 NA   NA
## 22787  5 137 87  NA   3  8 NA   NA
## 23331  3  72 60  NA   3 13 NA   NA
## 23863 10 136 61  NA   1 21 NA   NA
## 24368  4 128 51  NA   1  0 NA   NA
## 24897  2 130 62  NA   2  0 NA   NA
## 25394  5 114 90  NA   2  0 NA   NA
## 25908  3 104 63  NA   2  0 NA   NA
## 26422 NA  20 24  NA   0  0 NA    2
subset(pitching, pitching$playerID == "ruthba01" )
##      playerID yearID stint teamID lgID  W  L  G GS CG SHO SV IPouts   H ER
## 5363 ruthba01   1914     1    BOS   AL  2  1  4  3  1   0  0     69  21 10
## 5671 ruthba01   1915     1    BOS   AL 18  8 32 28 16   1  0    653 166 59
## 5894 ruthba01   1916     1    BOS   AL 23 12 44 41 23   9  1    971 230 63
## 6073 ruthba01   1917     1    BOS   AL 24 13 41 38 35   6  2    979 244 73
## 6282 ruthba01   1918     1    BOS   AL 13  7 20 19 18   1  0    499 125 41
## 6499 ruthba01   1919     1    BOS   AL  9  5 17 15 12   0  1    400 148 44
## 6713 ruthba01   1920     1    NYA   AL  1  0  1  1  0   0  0     12   3  2
## 6925 ruthba01   1921     1    NYA   AL  2  0  2  1  0   0  0     27  14  9
## 8881 ruthba01   1930     1    NYA   AL  1  0  1  1  1   0  0     27  11  3
## 9485 ruthba01   1933     1    NYA   AL  1  0  1  1  1   0  0     27  12  5
##      HR  BB  SO BAOpp  ERA IBB WP HBP BK  BFP GF  R SH SF GIDP
## 5363  1   7   3  0.23 3.91  NA  0   0  0  100  0 12 NA NA   NA
## 5671  3  85 112  0.21 2.44  NA  9   6  1  895  3 80 NA NA   NA
## 5894  0 118 170  0.20 1.75  NA  3   8  1 1300  3 83 NA NA   NA
## 6073  2 108 128  0.21 2.01  NA  5  11  0 1313  2 91 NA NA   NA
## 6282  1  49  40  0.21 2.22  NA  3   2  1  660  0 51 NA NA   NA
## 6499  2  58  30  0.29 2.97  NA  5   2  1  591  2 59 NA NA   NA
## 6713  0   2   0  0.20 4.50  NA  0   0  0   17  0  4 NA NA   NA
## 6925  1   9   2  0.35 9.00  NA  0   0  0   49  1 10 NA NA   NA
## 8881  0   2   3  0.30 3.00  NA  0   0  0   39  0  3 NA NA   NA
## 9485  0   3   0  0.30 5.00  NA  0   0  0   42  0  5 NA NA   NA

We will move onto Fielding. Using the books reference of using Bill James’ proposed fielding range, we will sum his putouts, and assists, then divide the sums by games played.

fielding<-read.csv("Fielding.csv")
ruth_fielding<-subset(fielding, fielding$playerID == "ruthba01" )



ruth_fielding$yearID<-as.factor(ruth_fielding$yearID)
#Sum put outs, assists, games played by year
Ruth_range_data<-ddply(ruth_fielding,~yearID,summarise,Put_Outs=sum(PO),Assists=sum(A), Games=sum(G))
#add PO + A, then divide by G

Ruth_range<-round((Ruth_range_data$Put_Outs+   Ruth_range_data$Assists)/Ruth_range_data$Games, 2)

Ruth_range[5:22]
##  [1] 3.72 2.43 2.02 2.41 2.16 2.89 2.36 2.27 2.18 2.26 2.03 1.84 1.93 1.73
## [15] 1.72 1.73 1.80 1.54

*Note there seems to be a difference in Lahman’s stats and the numbers used in the book. I’ve double checked the calculations with Lahman’s stats, and they are correct.

Teams

teams<-read.csv("Teams.csv")
Yankees27<-subset(teams, teams$yearID==1927 & teams$franchID=="NYY")
Yankees27
##     yearID lgID teamID franchID divID Rank   G Ghome   W  L DivWin WCWin
## 825   1927   AL    NYA      NYY          1 155    77 110 44             
##     LgWin WSWin   R   AB    H X2B X3B  HR  BB  SO SB CS HBP SF  RA  ER ERA
## 825     Y     Y 975 5347 1644 291 103 158 635 605 90 64  NA NA 599 494 3.2
##     CG SHO SV IPouts   HA HRA BBA SOA   E  DP    FP             name
## 825 82  11 20   4167 1403  42 409 431 196 123 0.969 New York Yankees
##                 park attendance BPF PPF teamIDBR teamIDlahman45
## 825 Yankee Stadium I    1164015  98  94      NYY            NYA
##     teamIDretro
## 825         NYA

1.2.8 Baseball Questions

Q. What is the average number of hume runs per game recorded in each decade? Does the rate of strikeouts show any correlation with the rate of home runs?

#Home runs per decade
Homers<-ddply(teams,~yearID,summarise,Home_Runs=sum(HR), StrikeOuts=sum(SO), HRA=sum(HRA), Games=sum(G/2))
x<- seq(1880, 2016, by = 10)

Homers$Decade[Homers$yearID < 1880] <- "1870"

for (year in x){

Homers$Decade[Homers$yearID  >= year  & Homers$yearID < (year+10)] <- year
}

#Homers

Decades<-ddply(Homers,~Decade,summarise,HR=sum(Home_Runs),K=sum(StrikeOuts),HRA=sum(HRA), Games=sum(Games))
#Decades

Decades$HR_G<-(Decades$HR/Decades$Games)
Decades$K_G<-(Decades$K/Decades$Games)
Decades
##    Decade    HR      K   HRA Games      HR_G       K_G
## 1    1870   362   4906   362  2031 0.1782373  2.415559
## 2    1880  3774  64502  3774  8759 0.4308711  7.364083
## 3    1890  4755  50021  4755  9457 0.5028022  5.289310
## 4    1900  3104     NA  3104 11340 0.2737213        NA
## 5    1910  4533     NA  4533 13324 0.3402131        NA
## 6    1920  9894  69373  9894 12323 0.8028889  5.629554
## 7    1930 13442  81675 13442 12311 1.0918691  6.634311
## 8    1940 12958  87873 12958 12376 1.0470265  7.100275
## 9    1950 20860 108863 20860 12374 1.6857928  8.797721
## 10   1960 26169 182438 26169 15961 1.6395589 11.430236
## 11   1970 29543 203741 29543 19806 1.4916187 10.286832
## 12   1980 32942 218147 32942 20337 1.6198063 10.726607
## 13   1990 41349 265539 41349 21594 1.9148375 12.296888
## 14   2000 52148 318746 52148 24291 2.1468033 13.121979
## 15   2010 33465 255799 33465 17007 1.9677192 15.040807
plot(Decades$Decade, Decades$HR_G)

plot(Decades$Decade, Decades$K_G)