A common task in data analysis requires taking some variable measured by individuals and transforming it to be measured by some group. For instance, baseball batting averages are measured by individual players, but we might want to compare the averages of different baseball clubs as a whole. R makes this easy, but through a process that takes some time to get used to. (Like coffee or opera, R can be an acquired taste.) This example works through how to accomplish this task using the aggregate() command and shows how to display the results graphically.

Begin by loading the Handout007.dta dataset:

library(foreign)
gdp.data <- read.dta("Handout007.dta")

In this dataset, again taken from the Quality of Government dataset, we have two indicators from the World Development Indicators: wdi_gdpcu, which is national GDP at current nominal prices, and wdi_gdpc, which is per-capita GDP at purchasing power parity. In a real analysis, we’d be a little more careful to have either PPP for both, nominal for both, or a complete set of PPP and nominal, but here we’re mostly interested in the fact that we often care about the sum of some types of individual-level data and the mean of others. The dataset also includes ht_region, a categorical variable that tells us which region of the world a country is supposed to be in.

Our first task is to create totals of world GDP by region. That means summing up each region’s countries’ GDPs to arrive at regional totals. This can be a little tedious in other programs (just try to do it in Excel as fast as we’ll do it here, much less by hand) but it does require some guidance. The key lies in the aggregate() command, which splits data into subsets, computes summary statistics for each subset, and returns the result to us in a convenient form (that is, in in a new object). The syntax is straightforward:

region.gdp <- aggregate(gdp.data$wdi_gdpcu,by=list(gdp.data$ht_region),FUN=sum)

Let’s look at the code first. aggregate(gdp.data$wdi_gdpcu,by=list(gdp.data$ht_region),FUN=sum) means that we’re telling R that we’re operating on the column gdp.data$wdi_gdpcu. We want it to do something (we’ll tell it what) to that variable “by” the variable gdp.data$ht_region. “By” means that the “by” variable is some sort of a category variable. That’s a declaration that we make, by the way, not a natural law; if we told R to use a variable that plainly wasn’t categorical, R would happily obey:

region.gdp.bad <- aggregate(gdp.data$wdi_gdpcu,by=list(gdp.data$wdi_gdpc),FUN=sum)
head(region.gdp.bad)
##    Group.1           x
## 1 303.3885 11204007936
## 2 458.1052  1155146240
## 3 493.7736  1856715392
## 4 517.5714  1815182208
## 5 623.6989  5254406144
## 6 698.1852  1981732736

So what are we doing to gdp.data$wdi_gdpcu “by” gdp.data$ht_region? That’s what the last argument, FUN=sum, tells R. sum here means, literally, the “sum” total. That is, it’s just a way of saying “add these things up”. So, decoded, the function means: “Take gdp.data$wdi_gdpcu, divide it up into subgroups by its gdp.data$ht_region, total it up in each of those groups, and report the results in a new object named region.gdp.” And we get:

head(region.gdp)
##                                   Group.1            x
## 1 1. Eastern Europe and post Soviet Union 2.963789e+12
## 2                        2. Latin America 4.034090e+12
## 3       3. North Africa & the Middle East 2.806348e+12
## 4                   4. Sub-Saharan Africa           NA
## 5     5. Western Europe and North America 3.230282e+13
## 6                            6. East Asia           NA

This mostly looks okay – but last I checked, adding two numbers didn’t yield “NA”. For R, NA means “missing data”. It’s what R returns both when the dataset itself has missing data and also when you try to do some operation to a dataset with missing data. Fortunately, 99 times out of 100, the solution to unexpected NAs is simple:

region.gdp <- aggregate(gdp.data$wdi_gdpcu,by=list(gdp.data$ht_region),FUN=sum,na.rm=TRUE)
head(region.gdp)
##                                   Group.1            x
## 1 1. Eastern Europe and post Soviet Union 2.963789e+12
## 2                        2. Latin America 4.034090e+12
## 3       3. North Africa & the Middle East 2.806348e+12
## 4                   4. Sub-Saharan Africa 9.361415e+11
## 5     5. Western Europe and North America 3.230282e+13
## 6                            6. East Asia 1.086504e+13

na.rm=TRUE just tells R to remove (rm) the missing data for this operation.

Displaying Results

We might also want to display the results graphically, using barplot(). First, use names(region.gdp) to find what the names of the columns are; if you like, you can then use colnames() to change the names to something nicer.

colnames(region.gdp) <- c("Region","GDP")
barplot(region.gdp$GDP,names=region.gdp$Region)

Obviously, this isn’t quite what we want. Really look at this plot. It’s awful! What would Few say about this? We would never sit down to make a plot this useless–it’s what happens when we make machines choose instead of thinking ourselves.

Let’s fix the problems. To make the names fit, we can fiddle a little bit with the graphical parameters see more on this page:

par(las=2) # make label text perpendicular to axis
barplot(region.gdp$GDP,names=region.gdp$Region,horiz=TRUE,cex.names=.5)

horiz changes the plot into a horizontal plot; not necessary, but sometimes it looks nice. cex.names is a funny name for what in a normal environment we’d call font size. A more lasting fix would be to change the names of the longest regions:

unique(region.gdp$Region)  ## unique gives us the "unique" values of a given object
##  [1] 1. Eastern Europe and post Soviet Union
##  [2] 2. Latin America                       
##  [3] 3. North Africa & the Middle East      
##  [4] 4. Sub-Saharan Africa                  
##  [5] 5. Western Europe and North America    
##  [6] 6. East Asia                           
##  [7] 7. South-East Asia                     
##  [8] 8. South Asia                          
##  [9] 9. The Pacific                         
## [10] 10. The Caribbean                      
## 10 Levels: 1. Eastern Europe and post Soviet Union ... 10. The Caribbean

We learn a few things from this: there are 10 unique values in the dataset, apparently this dataset doesn’t believe that “continents” are the regions most important to analysis, and also that this variable is recorded as a factor, a special and often useful (but sometimes irritating) data type. (The tipoff is the bit at the end: “10 Levels: …”) Recoding a factor is also easy, we just need to know what we’re doing.

It looks like we could fix a few of these:

levels(region.gdp$Region) <- c("E. Europe + Fmr USSR",
                                     "Latin America",
                                     "N. Africa and Mideast", 
                                     "Sub-Saharan Africa",
                                     "W. Europe and N. America",
                                     "East Asia", 
                                     "South-East Asia",
                                     "South Asia",
                                     "The Pacific",
                                     "The Caribbean")

Finally, we can use a bit of Google to figure out how to get bigger margins on the plot (http://www.r-bloggers.com/setting-graph-margins-in-r-using-the-par-function-and-lots-of-cow-milk/):

par(las=2,mai=c(1.5,2.5,.25,.5)) # make label text perpendicular to axis, 
# change the margin size in inches to give us 1.5 inches on the bottom, 2.5 inches on the left, 
#.25 on the top, and .5 on the right
barplot(region.gdp$GDP,names=region.gdp$Region,horiz=TRUE,cex.names=.75)

This is a little tricky, and there’s still more to be done (what’s with the scientific notation on the x-axis, for instance?), but the point of this lesson isn’t about making the prettiest possible graphs in R.

Having done all of this, though, it’s a lot faster to create in some ways a more interesting chart: average per-capita income by country by region. Look closely at the code to see how this is both similar and different to the code we just worked through.

region.percapita <- aggregate(gdp.data$wdi_gdpc,
                              by=list(gdp.data$ht_region), 
                              FUN=mean,na.rm=TRUE) ## don't forget the na.rm!
colnames(region.percapita) <- c("Region","GDPPC")
region.percapita <- region.percapita[order(region.percapita$GDPPC),]

OK. Stop here and think about why we would want to do this in some order. (Go back to Few.)

Now we’re going to use the levels() command to rename the regions—this is one of the times when I need you to trust what I’m doing in lieu of a longer explanation, which will be forthcoming:

levels(region.percapita$Region) <- c("E. Europe + Fmr USSR",
                                     "Latin America",
                                     "N. Africa and Mideast", 
                                     "Sub-Saharan Africa",
                                     "W. Europe and N. America",
                                     "East Asia", 
                                     "South-East Asia",
                                     "South Asia",
                                     "The Pacific",
                                     "The Caribbean")
par(las=2,mai=c(1.5,2.5,.25,.5)) 
barplot(region.percapita$GDPPC,
        names=region.percapita$Region,
        horiz=TRUE,
        cex.names=.75,
        main="Average National Per Capita GDP By Region",
        xlab="In USD at PPP")

We’re almost there. To fix the problem that the values of the x-axis are too big, let’s use a (very common) fix and report them in thousands of dollars instead:

par(las=2,mai=c(1.5,2.5,.25,.5)) 
barplot(region.percapita$GDPPC/1000, ## It's really this easy!
        names=region.percapita$Region,
        horiz=TRUE,
        cex.names=.75,
        main="Average National Per Capita GDP By Region",
        xlab="In 1,000s USD at PPP") ### don't forget to change the label

And we can make this even fancier by using options to highlight a particular region using colors.

region.percapita$color <- "gray" ## This creates a new column called "color" and fills it with the word "gray"
region.percapita[region.percapita$Region=="East Asia",]$color <- "green" ## This changes the value of the field "color" to "Green" for any row where Region equals East Asia
par(las=2,mai=c(1.5,2.5,.25,.5)) 
barplot(region.percapita$GDPPC/1000, 
        names=region.percapita$Region,
        horiz=TRUE,
        cex.names=.75,
        main="Average National Per Capita GDP By Region",
        xlab="In 1,000s USD at PPP",
        col=region.percapita$color) ## This line tells R where to find that value of the color parameter for each bar