Does pay inequality affect team performance?

This document illustrates data manipulation in R. It accompanies Sample Annotated Paper in Econometrics published by Journal of Economic Education in 2007. The sample annotated paper illustrated the structure of an econometrics paper: the sections, presentation of descriptive statistics, results etc. In contrast, this document illustrates how data for an empirical paper can be put together. It shows the process through which empirical researchers retrieve, manipulate and analyze data. It illustrates common tasks such as downloading data, selecting observations and variables, aggregating data by groups, reshaping data, plotting data, running and displaying regression results. It has been nearly 10 years since the original sample paper was published, so the data in this document are are more recent. However, the empirical specification (and, as it turns out, results) are the same.

1. Retrieving data

We need data on pay inequality and team performance. The USA Today has data on salaries of individual players in the MLB. Let’s download that data for the last four years. Each year of data is at a separate url so we need to put each year in a separate data frame. We will use function readHTMLTable from the XML package. Later, we will also need dplyr and tidyr packages.

library(XML)
library(dplyr)
library(tidyr)
sal2016 <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/")[[1]]
sal2015 <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/2015/player/all/")[[1]]
sal2014 <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/2014/player/all/")[[1]]
sal2013 <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/2013/player/all/")[[1]]

Before combining these four data frames into one we first need to make sure that we can identify which year the data come from. We can do this by creating a variable year in each data frame and assign it the year the data frame represents.

sal2016$year <- 2016
sal2015$year <- 2015
sal2014$year <- 2014
sal2013$year <- 2013

Now we can append the four data frames. We should drop the rank variable which was not read from the website.

salaries <- bind_rows(sal2016,sal2015,sal2014,sal2013)
salaries <- select(salaries, -rank)

Let’s save this data as they were on May 26, 2016. The data is save to dropbox and can be retrieved using the address below.

write.csv(salaries, "C:/Users/dvorakt/Dropbox/reproducibility/salaries.csv", row.names = FALSE)
salaries <- read.csv("https://www.dropbox.com/s/a1tk758yyl76j1t/salaries.csv?raw=1")

2. Selecting observations and variables

Let’s now keep the variables that we will use in this study, and look at the structure of the data frame.

salaries <- select(salaries, year, Salary, Name, Team)
str(salaries)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3302 obs. of  4 variables:
##  $ year  : num  2016 2016 2016 2016 2016 ...
##  $ Salary: chr  "$ 33,000,000" "$ 31,799,030" "$ 30,000,000" "$ 28,000,000" ...
##  $ Name  : chr  "Clayton Kershaw" "Zack Greinke" "David Price" "Miguel Cabrera" ...
##  $ Team  : Factor w/ 30 levels "ARI","ATL","BAL",..: 14 1 4 10 10 18 24 13 5 21 ...

We see that we have 3,302 observations. We have 30 teams. We also see that Salary is a character with dollar signs and commas. Let’s strip all the punctuation from Salary, and turn it into a numerical variable.

salaries$Salary <- as.numeric(gsub("[[:punct:]]","",salaries$Salary))

Let’s look at the descriptive statistics of salary.

summary(salaries$Salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0   517300  1543000  4181000  5644000 33000000

Hmm, the minimum is 0 and maximum is 33 million. Let’s take a look at the largest and lowest values. We will first sort the data by salary then use the head() and tail() functions.

salaries <- arrange(salaries, Salary)
head(salaries, n=5)
## Source: local data frame [5 x 4]
## 
##    year Salary             Name   Team
##   (dbl)  (dbl)            (chr) (fctr)
## 1  2014      0 Melvin Upton Jr.     SD
## 2  2013      0 Melvin Upton Jr.     SD
## 3  2013      0      Yasiel Puig    LAD
## 4  2013      0     Hyun-Jin Ryu    LAD
## 5  2013 490000       A.J. Ramos    MIA
tail(salaries, n=5)
## Source: local data frame [5 x 4]
## 
##    year   Salary            Name   Team
##   (dbl)    (dbl)           (chr) (fctr)
## 1  2014 28000000    Zack Greinke    LAD
## 2  2013 29000000  Alex Rodriguez    NYY
## 3  2016 30000000     David Price    BOS
## 4  2016 31799030    Zack Greinke    ARI
## 5  2016 33000000 Clayton Kershaw    LAD

The zero salaries are probably errors or missing data. We will drop those observations. As for the highest values, the names associated with these salaries are probably well-known to baseball fans.

salaries <- filter(salaries, Salary != 0)

3. Aggregating by groups

There are different ways of measuring pay inequality within a team. We will use the share earned by the top 20%. We can use function ntile() which creates a new variable indicating which n-tile the observation falls into. For example, ntile(x,10) would tell us which decile each observation in x belongs to. Of course, we want to do this only for each team and each year, thus we will group_by(year, Team). Also, while we do this, we should calculate the total payroll by summing all salaries within each team and year.

salaries <- salaries %>% group_by(year, Team ) %>% mutate(payroll=sum(Salary), pctile=ntile(Salary,5))
head(salaries, n=3)
## Source: local data frame [3 x 6]
## Groups: year, Team [2]
## 
##    year Salary          Name   Team  payroll pctile
##   (dbl)  (dbl)         (chr) (fctr)    (dbl)  (int)
## 1  2013 490000    A.J. Ramos    MIA 35857000      1
## 2  2013 490000  Alex Sanabia    MIA 35857000      1
## 3  2013 490000 Alfredo Marte    ARI 81106000      1

We now have total payroll and an identifier of players in the top 20% for each team and year. For example, we see that with his 490K salary, A.J. Ramos from Miami was in the bottom (1st) 20% of players within his team. We don’t need information on players that are not in the top 20% so we filter them out.

salaries <- filter(salaries, pctile==5)

Now we need to add up how much those top 20% players made and divide that amount by payroll. Since now we only have players in the top 20% we can aggregate/summarize the data by year and Team. We don’t want to lose the payroll variable so we include it among the group_by variables. As we aggregate/summarize the data we sum(Salary) and put the result in variable named top20.

salaries <- salaries %>% group_by(year, Team, payroll) %>% summarize(top20=sum(Salary))
salaries$top20share <- salaries$top20/salaries$payroll*100

Let’s put payroll in millions rather than dollars. It will make the magnitude of our regression coefficients easier to read. Let’s also check some descriptive statistics on top20share.

salaries$payroll <- salaries$payroll/1000000
summary(salaries)
##       year           Team       payroll           top20          
##  Min.   :2013   ARI    : 4   Min.   : 18.83   Min.   :  8900000  
##  1st Qu.:2014   ATL    : 4   1st Qu.: 81.96   1st Qu.: 47298750  
##  Median :2014   BAL    : 4   Median :107.37   Median : 60812500  
##  Mean   :2014   BOS    : 4   Mean   :115.04   Mean   : 65897100  
##  3rd Qu.:2015   CHC    : 4   3rd Qu.:139.44   3rd Qu.: 81719635  
##  Max.   :2016   CIN    : 4   Max.   :245.13   Max.   :136035384  
##                 (Other):96                                       
##    top20share   
##  Min.   :35.94  
##  1st Qu.:52.11  
##  Median :57.23  
##  Mean   :57.76  
##  3rd Qu.:62.99  
##  Max.   :79.89  
## 
salaries <- arrange(salaries, top20share)
head(salaries, n=3)
## Source: local data frame [3 x 5]
## Groups: year, Team [3]
## 
##    year   Team  payroll    top20 top20share
##   (dbl) (fctr)    (dbl)    (dbl)      (dbl)
## 1  2013    ARI 81.10600 40000000   49.31818
## 2  2013    ATL 73.60069 43896942   59.64202
## 3  2013    BAL 89.05413 42929464   48.20603
tail(salaries, n=3)
## Source: local data frame [3 x 5]
## Groups: year, Team [3]
## 
##    year   Team  payroll     top20 top20share
##   (dbl) (fctr)    (dbl)     (dbl)      (dbl)
## 1  2016    TEX 186.0387 123833333   66.56320
## 2  2016    TOR 138.7017  72650000   52.37859
## 3  2016    WSH 141.6526  80913046   57.12074

Before we move onto data on performance, let’s turn the factor variable Team into a character. This is because the performance data may have different set of teams which means that factor levels may not match.

salaries$Team <- as.character(salaries$Team)

4. Get the data on team performance, data retrieval

The second piece of data we need is how well a team is doing. We will use data from baseball-reference.com. The website has a table that has the number of wins for each year and each team in MLB.

teamwins <- readHTMLTable("http://www.baseball-reference.com/leagues/MLB/#teams_team_wins3000::none", stringsAsFactors = FALSE)[[1]]

As with the salary data we should save this table in a private drive so that we can preserve the data as it was downloaded on May 26, 2016.

write.csv(teamwins, "C:/Users/dvorakt/Dropbox/reproducibility/teamwins.csv", row.names = FALSE)
teamwins <- read.csv("https://www.dropbox.com/s/2l3fatf2h27bvcl/teamwins.csv?raw=1",stringsAsFactors = FALSE)

Let’s take a look at the first 10 rows and 27 observations.

head(teamwins[,1:10], n=27)
##    Year   G ARI ATL BLA BAL BOS CHC CHW CIN
## 1  2016  49  21  12      26  29  31  27  15
## 2  2015 162  79  67      81  78  97  76  64
## 3  2014 162  64  79      96  71  73  73  76
## 4  2013 163  81  96      85  97  66  63  90
## 5  2012 162  81  94      93  69  61  85  97
## 6  2011 162  94  89      69  90  71  79  79
## 7  2010 162  65  91      66  89  75  88  91
## 8  2009 163  70  86      64  95  83  79  78
## 9  2008 163  82  72      68  95  97  89  74
## 10 2007 163  90  84      69  96  85  72  72
## 11 2006 162  76  79      70  86  66  90  80
## 12 2005 162  77  90      74  95  79  99  73
## 13 2004 162  51  96      78  98  89  83  76
## 14 2003 162  84 101      71  95  88  86  69
## 15 2002 162  98 101      67  93  67  81  78
## 16 2001 162  92  88      63  82  88  83  66
## 17 2000 162  85  95      74  85  65  95  85
## 18 1999 163 100 103      78  94  67  75  96
## 19 1998 163  65 106      79  92  90  80  77
## 20 1997 162     101      98  78  68  80  76
## 21 1996 162      96      88  85  76  85  81
## 22 1995 145      90      71  86  73  68  85
## 23 1994 117      68      63  54  49  67  66
## 24 1993 162     104      85  80  84  94  73
## 25 1992 162      98      89  73  78  86  90
## 26 Year   G ARI ATL BLA BAL BOS CHC CHW CIN
## 27 1991 162      94      67  84  77  87  74

Looking at the structure of this data, we see that each row corresponds to a year and each column represents a team with the exception of the first two columns which represent year and the number of games played.

Also, we see that the table’s column headings repeat every 25 rows - this makes it easy to read the table online, but is undesirable here. We need to delete the rows that contain these extra column headings. One way we could do it is keeping only rows that have a valid year. Right now, Year is stored as a character, if we convert it into a numeric the observations in the ‘heading’ rows will appear as NAs. Then we can filter out observations where Year is an NA.

teamwins$Year <- as.numeric(teamwins$Year)
teamwins <- filter(teamwins, !is.na(teamwins$Year))

Since our salary data goes back only to 2013, we can filter out years prior to 2013.

teamwins <- filter(teamwins, Year>2012)

5. Reshaping data

A more difficult problem is that teams are in columns rather than rows. We need to reshape the data. Specifically, we want each row to be defined by year and a team. The variables we need will be year, team, games played, games won. Thus we reshape the original wide data set into a long data set.

First, year and games should stay together, they will not be reshaped. Thus, we will unite them into one variable.

teamwins <- unite(teamwins, year_games, Year, G)

Next, the numbers that will be reshaped is the number of wins, the variable that will identify (i.e. key) whose wins these are is team. The columns that will be reshaped are all the variables corresponding to teams.

teamwins2 <- gather(data=teamwins, value=wins, key=team, ARI,ATL,BLA,BAL,BOS,CHC,CHW,CIN,CLE,COL,DET,HOU,KCR,ANA,
                    LAD,FLA,MIL,MIN,NYM,NYY,OAK,PHI,PIT,SDP,SFG,SEA,STL,TBD,TEX,TOR,WSN)
#now we can restore year and games as separate variables
teamwins2 <- separate(teamwins2,year_games, c("year", "games"))
head(teamwins2)
##   year games team wins
## 1 2016    49  ARI   21
## 2 2015   162  ARI   79
## 3 2014   162  ARI   64
## 4 2013   163  ARI   81
## 5 2016    49  ATL   12
## 6 2015   162  ATL   67

Terrific. Now let’s put games wins in a numeric format.

teamwins2$games <- as.numeric(teamwins2$games)
teamwins2$wins <- as.numeric(teamwins2$wins)
teamwins2$year <- as.numeric(teamwins2$year)

Let’s calculate winning percentage, look at the descriptive stats, top and bottom performers.

teamwins2$pctwin <- teamwins2$wins/teamwins2$games*100
teamwins2 <- arrange(teamwins2,pctwin)
summary(teamwins2)
##       year          games            team          wins       
##  Min.   :2013   Min.   : 49.0   ARI    :  4   Min.   : 12.00  
##  1st Qu.:2014   1st Qu.:133.8   ATL    :  4   1st Qu.: 46.00  
##  Median :2014   Median :162.0   BLA    :  4   Median : 76.00  
##  Mean   :2014   Mean   :134.0   BAL    :  4   Mean   : 66.61  
##  3rd Qu.:2015   3rd Qu.:162.2   BOS    :  4   3rd Qu.: 87.00  
##  Max.   :2016   Max.   :163.0   CHC    :  4   Max.   :100.00  
##                                 (Other):100   NA's   :4       
##      pctwin     
##  Min.   :24.49  
##  1st Qu.:44.87  
##  Median :49.85  
##  Mean   :49.38  
##  3rd Qu.:55.10  
##  Max.   :63.27  
##  NA's   :4
head(teamwins2)
##   year games team wins   pctwin
## 1 2016    49  ATL   12 24.48980
## 2 2016    49  MIN   12 24.48980
## 3 2016    49  CIN   15 30.61224
## 4 2013   163  HOU   51 31.28834
## 5 2013   163  FLA   62 38.03681
## 6 2013   163  CHW   63 38.65031
tail(teamwins2)
##     year games team wins   pctwin
## 119 2015   162  STL  100 61.72840
## 120 2016    49  CHC   31 63.26531
## 121 2016    49  BLA   NA       NA
## 122 2015   162  BLA   NA       NA
## 123 2014   162  BLA   NA       NA
## 124 2013   163  BLA   NA       NA

It looks like in 2016 only 46 games have been played. Chicago Cubs (CHC) are doing well, while Atlanta Braves are not. The 2015 first finisher in the National League, St.Louis Cardinals, reassuringly appear in the table with 100 out of 162 games won. We have a number of missing values for team coded “BLA” which is a code baseball-reference.com used for Baltimore Orioles in 1901 and 1902. The new code for the Orioles is “BAL”. We will drop observations that have NA for winning percentage (i.e. we drop team “BLA”).

teamwins2 <- filter(teamwins2, !is.na(pctwin))

6. Merging data sets

We are now ready to merge our performance data with the salary data. The two data sets will be matched by year and team. This means that the two variables need to be in the same format in both data sets. Year is numeric so that is OK, but team is a factor, so we will turn it into a character because we also turn team into character in the salary data. Fortunately, both data bases uses the same codes to identify teams. Finally, in the salary data team variable is Team rather than team, so to make them match we create Team and drop team.

teamwins2$Team <- as.character(teamwins2$team)
teamwins2 <- select(teamwins2, -team)

We will use the inner_join() function because we plan to use only observations that match in both data sets.

merged <- inner_join(teamwins2, salaries,by=c("year", "Team"))
summary(merged)
##       year          games            wins            pctwin     
##  Min.   :2013   Min.   : 49.0   Min.   : 12.00   Min.   :24.49  
##  1st Qu.:2014   1st Qu.:133.8   1st Qu.: 46.00   1st Qu.:43.76  
##  Median :2014   Median :162.0   Median : 74.00   Median :50.00  
##  Mean   :2014   Mean   :134.0   Mean   : 66.53   Mean   :49.19  
##  3rd Qu.:2015   3rd Qu.:162.2   3rd Qu.: 88.00   3rd Qu.:55.30  
##  Max.   :2016   Max.   :163.0   Max.   :100.00   Max.   :63.27  
##      Team              payroll           top20             top20share   
##  Length:88          Min.   : 18.83   Min.   :  8900000   Min.   :40.63  
##  Class :character   1st Qu.: 83.43   1st Qu.: 50542238   1st Qu.:52.11  
##  Mode  :character   Median :105.90   Median : 61980954   Median :60.35  
##                     Mean   :117.91   Mean   : 68372295   Mean   :58.64  
##                     3rd Qu.:143.37   3rd Qu.: 83089215   3rd Qu.:63.09  
##                     Max.   :245.13   Max.   :136035384   Max.   :79.89

7. Graphing data

library(ggplot2)
ggplot(data=merged, aes(x=payroll,y=pctwin, label=Team, color=as.factor(year)))+ geom_text(size=3) +
  ggtitle("Payroll and performance of MLB teams, 2013-2016") + 
  xlab("Payroll in millions") + ylab("Percent of games won") +
  scale_color_discrete(name="year") + theme_minimal()

ggplot(data=merged, aes(x=top20share,y=pctwin, label=Team, color=as.factor(year))) + geom_text(size=3) +
  ggtitle("Team inequality and performance of MLB teams, 2013-2016") + 
  xlab("Share of payroll earned by top 20%") + ylab("Percent of games won") +
  scale_color_discrete(name="year") + theme_minimal()

8. Regression Analysis

library(stargazer)
m1 <- lm(pctwin ~ top20share, data=merged)
m2 <- lm(pctwin ~ top20share + payroll, data=merged)
m3 <- lm(pctwin ~ top20share + log(payroll) , data=merged)

stargazer(m1,m2,m3, type="text")
## 
## ====================================================================================
##                                           Dependent variable:                       
##                     ----------------------------------------------------------------
##                                                  pctwin                             
##                             (1)                   (2)                   (3)         
## ------------------------------------------------------------------------------------
## top20share                -0.213**              -0.162                -0.172*       
##                           (0.104)               (0.103)               (0.101)       
##                                                                                     
## payroll                                         0.042**                             
##                                                 (0.017)                             
##                                                                                     
## log(payroll)                                                         5.506***       
##                                                                       (1.918)       
##                                                                                     
## Constant                 61.666***             53.686***             33.487***      
##                           (6.142)               (6.805)              (11.451)       
##                                                                                     
## ------------------------------------------------------------------------------------
## Observations                 88                   88                    88          
## R2                         0.047                 0.109                 0.131        
## Adjusted R2                0.035                 0.088                 0.110        
## Residual Std. Error   7.767 (df = 86)       7.552 (df = 85)       7.460 (df = 85)   
## F Statistic         4.201** (df = 1; 86) 5.210*** (df = 2; 85) 6.398*** (df = 2; 85)
## ====================================================================================
## Note:                                                    *p<0.1; **p<0.05; ***p<0.01