This document illustrates data manipulation in R. It accompanies Sample Annotated Paper in Econometrics published by Journal of Economic Education in 2007. The sample annotated paper illustrated the structure of an econometrics paper: the sections, presentation of descriptive statistics, results etc. In contrast, this document illustrates how data for an empirical paper can be put together. It shows the process through which empirical researchers retrieve, manipulate and analyze data. It illustrates common tasks such as downloading data, selecting observations and variables, aggregating data by groups, reshaping data, plotting data, running and displaying regression results. It has been nearly 10 years since the original sample paper was published, so the data in this document are are more recent. However, the empirical specification (and, as it turns out, results) are the same.
We need data on pay inequality and team performance. The USA Today has data on salaries of individual players in the MLB. Let’s download that data for the last four years. Each year of data is at a separate url so we need to put each year in a separate data frame. We will use function readHTMLTable from the XML package. Later, we will also need dplyr and tidyr packages.
library(XML)
library(dplyr)
library(tidyr)
sal2016 <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/")[[1]]
sal2015 <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/2015/player/all/")[[1]]
sal2014 <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/2014/player/all/")[[1]]
sal2013 <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/2013/player/all/")[[1]]
Before combining these four data frames into one we first need to make sure that we can identify which year the data come from. We can do this by creating a variable year in each data frame and assign it the year the data frame represents.
sal2016$year <- 2016
sal2015$year <- 2015
sal2014$year <- 2014
sal2013$year <- 2013
Now we can append the four data frames. We should drop the rank variable which was not read from the website.
salaries <- bind_rows(sal2016,sal2015,sal2014,sal2013)
salaries <- select(salaries, -rank)
Let’s save this data as they were on May 26, 2016. The data is save to dropbox and can be retrieved using the address below.
write.csv(salaries, "C:/Users/dvorakt/Dropbox/reproducibility/salaries.csv", row.names = FALSE)
salaries <- read.csv("https://www.dropbox.com/s/a1tk758yyl76j1t/salaries.csv?raw=1")
Let’s now keep the variables that we will use in this study, and look at the structure of the data frame.
salaries <- select(salaries, year, Salary, Name, Team)
str(salaries)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3302 obs. of 4 variables:
## $ year : num 2016 2016 2016 2016 2016 ...
## $ Salary: chr "$ 33,000,000" "$ 31,799,030" "$ 30,000,000" "$ 28,000,000" ...
## $ Name : chr "Clayton Kershaw" "Zack Greinke" "David Price" "Miguel Cabrera" ...
## $ Team : Factor w/ 30 levels "ARI","ATL","BAL",..: 14 1 4 10 10 18 24 13 5 21 ...
We see that we have 3,302 observations. We have 30 teams. We also see that Salary is a character with dollar signs and commas. Let’s strip all the punctuation from Salary, and turn it into a numerical variable.
salaries$Salary <- as.numeric(gsub("[[:punct:]]","",salaries$Salary))
Let’s look at the descriptive statistics of salary.
summary(salaries$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 517300 1543000 4181000 5644000 33000000
Hmm, the minimum is 0 and maximum is 33 million. Let’s take a look at the largest and lowest values. We will first sort the data by salary then use the head() and tail() functions.
salaries <- arrange(salaries, Salary)
head(salaries, n=5)
## Source: local data frame [5 x 4]
##
## year Salary Name Team
## (dbl) (dbl) (chr) (fctr)
## 1 2014 0 Melvin Upton Jr. SD
## 2 2013 0 Melvin Upton Jr. SD
## 3 2013 0 Yasiel Puig LAD
## 4 2013 0 Hyun-Jin Ryu LAD
## 5 2013 490000 A.J. Ramos MIA
tail(salaries, n=5)
## Source: local data frame [5 x 4]
##
## year Salary Name Team
## (dbl) (dbl) (chr) (fctr)
## 1 2014 28000000 Zack Greinke LAD
## 2 2013 29000000 Alex Rodriguez NYY
## 3 2016 30000000 David Price BOS
## 4 2016 31799030 Zack Greinke ARI
## 5 2016 33000000 Clayton Kershaw LAD
The zero salaries are probably errors or missing data. We will drop those observations. As for the highest values, the names associated with these salaries are probably well-known to baseball fans.
salaries <- filter(salaries, Salary != 0)
There are different ways of measuring pay inequality within a team. We will use the share earned by the top 20%. We can use function ntile() which creates a new variable indicating which n-tile the observation falls into. For example, ntile(x,10) would tell us which decile each observation in x belongs to. Of course, we want to do this only for each team and each year, thus we will group_by(year, Team). Also, while we do this, we should calculate the total payroll by summing all salaries within each team and year.
salaries <- salaries %>% group_by(year, Team ) %>% mutate(payroll=sum(Salary), pctile=ntile(Salary,5))
head(salaries, n=3)
## Source: local data frame [3 x 6]
## Groups: year, Team [2]
##
## year Salary Name Team payroll pctile
## (dbl) (dbl) (chr) (fctr) (dbl) (int)
## 1 2013 490000 A.J. Ramos MIA 35857000 1
## 2 2013 490000 Alex Sanabia MIA 35857000 1
## 3 2013 490000 Alfredo Marte ARI 81106000 1
We now have total payroll and an identifier of players in the top 20% for each team and year. For example, we see that with his 490K salary, A.J. Ramos from Miami was in the bottom (1st) 20% of players within his team. We don’t need information on players that are not in the top 20% so we filter them out.
salaries <- filter(salaries, pctile==5)
Now we need to add up how much those top 20% players made and divide that amount by payroll. Since now we only have players in the top 20% we can aggregate/summarize the data by year and Team. We don’t want to lose the payroll variable so we include it among the group_by variables. As we aggregate/summarize the data we sum(Salary) and put the result in variable named top20.
salaries <- salaries %>% group_by(year, Team, payroll) %>% summarize(top20=sum(Salary))
salaries$top20share <- salaries$top20/salaries$payroll*100
Let’s put payroll in millions rather than dollars. It will make the magnitude of our regression coefficients easier to read. Let’s also check some descriptive statistics on top20share.
salaries$payroll <- salaries$payroll/1000000
summary(salaries)
## year Team payroll top20
## Min. :2013 ARI : 4 Min. : 18.83 Min. : 8900000
## 1st Qu.:2014 ATL : 4 1st Qu.: 81.96 1st Qu.: 47298750
## Median :2014 BAL : 4 Median :107.37 Median : 60812500
## Mean :2014 BOS : 4 Mean :115.04 Mean : 65897100
## 3rd Qu.:2015 CHC : 4 3rd Qu.:139.44 3rd Qu.: 81719635
## Max. :2016 CIN : 4 Max. :245.13 Max. :136035384
## (Other):96
## top20share
## Min. :35.94
## 1st Qu.:52.11
## Median :57.23
## Mean :57.76
## 3rd Qu.:62.99
## Max. :79.89
##
salaries <- arrange(salaries, top20share)
head(salaries, n=3)
## Source: local data frame [3 x 5]
## Groups: year, Team [3]
##
## year Team payroll top20 top20share
## (dbl) (fctr) (dbl) (dbl) (dbl)
## 1 2013 ARI 81.10600 40000000 49.31818
## 2 2013 ATL 73.60069 43896942 59.64202
## 3 2013 BAL 89.05413 42929464 48.20603
tail(salaries, n=3)
## Source: local data frame [3 x 5]
## Groups: year, Team [3]
##
## year Team payroll top20 top20share
## (dbl) (fctr) (dbl) (dbl) (dbl)
## 1 2016 TEX 186.0387 123833333 66.56320
## 2 2016 TOR 138.7017 72650000 52.37859
## 3 2016 WSH 141.6526 80913046 57.12074
Before we move onto data on performance, let’s turn the factor variable Team into a character. This is because the performance data may have different set of teams which means that factor levels may not match.
salaries$Team <- as.character(salaries$Team)
The second piece of data we need is how well a team is doing. We will use data from baseball-reference.com. The website has a table that has the number of wins for each year and each team in MLB.
teamwins <- readHTMLTable("http://www.baseball-reference.com/leagues/MLB/#teams_team_wins3000::none", stringsAsFactors = FALSE)[[1]]
As with the salary data we should save this table in a private drive so that we can preserve the data as it was downloaded on May 26, 2016.
write.csv(teamwins, "C:/Users/dvorakt/Dropbox/reproducibility/teamwins.csv", row.names = FALSE)
teamwins <- read.csv("https://www.dropbox.com/s/2l3fatf2h27bvcl/teamwins.csv?raw=1",stringsAsFactors = FALSE)
Let’s take a look at the first 10 rows and 27 observations.
head(teamwins[,1:10], n=27)
## Year G ARI ATL BLA BAL BOS CHC CHW CIN
## 1 2016 49 21 12 26 29 31 27 15
## 2 2015 162 79 67 81 78 97 76 64
## 3 2014 162 64 79 96 71 73 73 76
## 4 2013 163 81 96 85 97 66 63 90
## 5 2012 162 81 94 93 69 61 85 97
## 6 2011 162 94 89 69 90 71 79 79
## 7 2010 162 65 91 66 89 75 88 91
## 8 2009 163 70 86 64 95 83 79 78
## 9 2008 163 82 72 68 95 97 89 74
## 10 2007 163 90 84 69 96 85 72 72
## 11 2006 162 76 79 70 86 66 90 80
## 12 2005 162 77 90 74 95 79 99 73
## 13 2004 162 51 96 78 98 89 83 76
## 14 2003 162 84 101 71 95 88 86 69
## 15 2002 162 98 101 67 93 67 81 78
## 16 2001 162 92 88 63 82 88 83 66
## 17 2000 162 85 95 74 85 65 95 85
## 18 1999 163 100 103 78 94 67 75 96
## 19 1998 163 65 106 79 92 90 80 77
## 20 1997 162 101 98 78 68 80 76
## 21 1996 162 96 88 85 76 85 81
## 22 1995 145 90 71 86 73 68 85
## 23 1994 117 68 63 54 49 67 66
## 24 1993 162 104 85 80 84 94 73
## 25 1992 162 98 89 73 78 86 90
## 26 Year G ARI ATL BLA BAL BOS CHC CHW CIN
## 27 1991 162 94 67 84 77 87 74
Looking at the structure of this data, we see that each row corresponds to a year and each column represents a team with the exception of the first two columns which represent year and the number of games played.
Also, we see that the table’s column headings repeat every 25 rows - this makes it easy to read the table online, but is undesirable here. We need to delete the rows that contain these extra column headings. One way we could do it is keeping only rows that have a valid year. Right now, Year is stored as a character, if we convert it into a numeric the observations in the ‘heading’ rows will appear as NAs. Then we can filter out observations where Year is an NA.
teamwins$Year <- as.numeric(teamwins$Year)
teamwins <- filter(teamwins, !is.na(teamwins$Year))
Since our salary data goes back only to 2013, we can filter out years prior to 2013.
teamwins <- filter(teamwins, Year>2012)
A more difficult problem is that teams are in columns rather than rows. We need to reshape the data. Specifically, we want each row to be defined by year and a team. The variables we need will be year, team, games played, games won. Thus we reshape the original wide data set into a long data set.
First, year and games should stay together, they will not be reshaped. Thus, we will unite them into one variable.
teamwins <- unite(teamwins, year_games, Year, G)
Next, the numbers that will be reshaped is the number of wins, the variable that will identify (i.e. key) whose wins these are is team. The columns that will be reshaped are all the variables corresponding to teams.
teamwins2 <- gather(data=teamwins, value=wins, key=team, ARI,ATL,BLA,BAL,BOS,CHC,CHW,CIN,CLE,COL,DET,HOU,KCR,ANA,
LAD,FLA,MIL,MIN,NYM,NYY,OAK,PHI,PIT,SDP,SFG,SEA,STL,TBD,TEX,TOR,WSN)
#now we can restore year and games as separate variables
teamwins2 <- separate(teamwins2,year_games, c("year", "games"))
head(teamwins2)
## year games team wins
## 1 2016 49 ARI 21
## 2 2015 162 ARI 79
## 3 2014 162 ARI 64
## 4 2013 163 ARI 81
## 5 2016 49 ATL 12
## 6 2015 162 ATL 67
Terrific. Now let’s put games wins in a numeric format.
teamwins2$games <- as.numeric(teamwins2$games)
teamwins2$wins <- as.numeric(teamwins2$wins)
teamwins2$year <- as.numeric(teamwins2$year)
Let’s calculate winning percentage, look at the descriptive stats, top and bottom performers.
teamwins2$pctwin <- teamwins2$wins/teamwins2$games*100
teamwins2 <- arrange(teamwins2,pctwin)
summary(teamwins2)
## year games team wins
## Min. :2013 Min. : 49.0 ARI : 4 Min. : 12.00
## 1st Qu.:2014 1st Qu.:133.8 ATL : 4 1st Qu.: 46.00
## Median :2014 Median :162.0 BLA : 4 Median : 76.00
## Mean :2014 Mean :134.0 BAL : 4 Mean : 66.61
## 3rd Qu.:2015 3rd Qu.:162.2 BOS : 4 3rd Qu.: 87.00
## Max. :2016 Max. :163.0 CHC : 4 Max. :100.00
## (Other):100 NA's :4
## pctwin
## Min. :24.49
## 1st Qu.:44.87
## Median :49.85
## Mean :49.38
## 3rd Qu.:55.10
## Max. :63.27
## NA's :4
head(teamwins2)
## year games team wins pctwin
## 1 2016 49 ATL 12 24.48980
## 2 2016 49 MIN 12 24.48980
## 3 2016 49 CIN 15 30.61224
## 4 2013 163 HOU 51 31.28834
## 5 2013 163 FLA 62 38.03681
## 6 2013 163 CHW 63 38.65031
tail(teamwins2)
## year games team wins pctwin
## 119 2015 162 STL 100 61.72840
## 120 2016 49 CHC 31 63.26531
## 121 2016 49 BLA NA NA
## 122 2015 162 BLA NA NA
## 123 2014 162 BLA NA NA
## 124 2013 163 BLA NA NA
It looks like in 2016 only 46 games have been played. Chicago Cubs (CHC) are doing well, while Atlanta Braves are not. The 2015 first finisher in the National League, St.Louis Cardinals, reassuringly appear in the table with 100 out of 162 games won. We have a number of missing values for team coded “BLA” which is a code baseball-reference.com used for Baltimore Orioles in 1901 and 1902. The new code for the Orioles is “BAL”. We will drop observations that have NA for winning percentage (i.e. we drop team “BLA”).
teamwins2 <- filter(teamwins2, !is.na(pctwin))
We are now ready to merge our performance data with the salary data. The two data sets will be matched by year and team. This means that the two variables need to be in the same format in both data sets. Year is numeric so that is OK, but team is a factor, so we will turn it into a character because we also turn team into character in the salary data. Fortunately, both data bases uses the same codes to identify teams. Finally, in the salary data team variable is Team rather than team, so to make them match we create Team and drop team.
teamwins2$Team <- as.character(teamwins2$team)
teamwins2 <- select(teamwins2, -team)
We will use the inner_join() function because we plan to use only observations that match in both data sets.
merged <- inner_join(teamwins2, salaries,by=c("year", "Team"))
summary(merged)
## year games wins pctwin
## Min. :2013 Min. : 49.0 Min. : 12.00 Min. :24.49
## 1st Qu.:2014 1st Qu.:133.8 1st Qu.: 46.00 1st Qu.:43.76
## Median :2014 Median :162.0 Median : 74.00 Median :50.00
## Mean :2014 Mean :134.0 Mean : 66.53 Mean :49.19
## 3rd Qu.:2015 3rd Qu.:162.2 3rd Qu.: 88.00 3rd Qu.:55.30
## Max. :2016 Max. :163.0 Max. :100.00 Max. :63.27
## Team payroll top20 top20share
## Length:88 Min. : 18.83 Min. : 8900000 Min. :40.63
## Class :character 1st Qu.: 83.43 1st Qu.: 50542238 1st Qu.:52.11
## Mode :character Median :105.90 Median : 61980954 Median :60.35
## Mean :117.91 Mean : 68372295 Mean :58.64
## 3rd Qu.:143.37 3rd Qu.: 83089215 3rd Qu.:63.09
## Max. :245.13 Max. :136035384 Max. :79.89
library(ggplot2)
ggplot(data=merged, aes(x=payroll,y=pctwin, label=Team, color=as.factor(year)))+ geom_text(size=3) +
ggtitle("Payroll and performance of MLB teams, 2013-2016") +
xlab("Payroll in millions") + ylab("Percent of games won") +
scale_color_discrete(name="year") + theme_minimal()
ggplot(data=merged, aes(x=top20share,y=pctwin, label=Team, color=as.factor(year))) + geom_text(size=3) +
ggtitle("Team inequality and performance of MLB teams, 2013-2016") +
xlab("Share of payroll earned by top 20%") + ylab("Percent of games won") +
scale_color_discrete(name="year") + theme_minimal()
library(stargazer)
m1 <- lm(pctwin ~ top20share, data=merged)
m2 <- lm(pctwin ~ top20share + payroll, data=merged)
m3 <- lm(pctwin ~ top20share + log(payroll) , data=merged)
stargazer(m1,m2,m3, type="text")
##
## ====================================================================================
## Dependent variable:
## ----------------------------------------------------------------
## pctwin
## (1) (2) (3)
## ------------------------------------------------------------------------------------
## top20share -0.213** -0.162 -0.172*
## (0.104) (0.103) (0.101)
##
## payroll 0.042**
## (0.017)
##
## log(payroll) 5.506***
## (1.918)
##
## Constant 61.666*** 53.686*** 33.487***
## (6.142) (6.805) (11.451)
##
## ------------------------------------------------------------------------------------
## Observations 88 88 88
## R2 0.047 0.109 0.131
## Adjusted R2 0.035 0.088 0.110
## Residual Std. Error 7.767 (df = 86) 7.552 (df = 85) 7.460 (df = 85)
## F Statistic 4.201** (df = 1; 86) 5.210*** (df = 2; 85) 6.398*** (df = 2; 85)
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01