Manipulating Data and Graphing with R

R is a beautiful tool for all things data: cleaning, manipulating, visualizing, and running statistical analyses. That said, R takes some getting used to. My purpose here is to familiarize you with some of the key syntax, introduce a few key packages (ggplot2 and the rest of the tidyverse), and leave you with an appreciation of the power of R as an analytical tool!

To illustrate the power of R in a time-efficient manner, with real-world applications, I wanted to use Gapminder data. I have assembled a custom data set with data gathered from this link. Also, a huge shoutout to the Gapminder team and the late Hans Rosling for all their work.

Before we jump in, I want to talk about a key point that comes up A LOT within the R community: “tidy” data. Data is considered “tidy” if each column represents a single variable and each row a single observation. That sounds intuitive; however, plenty of real data sets aren’t laid out this way, and it ends up making a big difference.

All the data sets from the website above looked something like this:

head(untidy_df[, 1:8])
## # A tibble: 6 x 8
##   country             `1800` `1801` `1802` `1803` `1804` `1805` `1806`
##   <chr>                <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Afghanistan            603    603    603    603    603    603    603
## 2 Albania                667    667    667    667    667    668    668
## 3 Algeria                715    716    717    718    719    720    721
## 4 Andorra               1200   1200   1200   1200   1210   1210   1210
## 5 Angola                 618    620    623    626    628    631    634
## 6 Antigua and Barbuda    757    757    757    757    757    757    757


Notice that what would ordinarily be the “Year” variable is spread across the column headers, with one column per year for each country. Although this is easy to read by eye, it becomes a problem when you try to work with the data. There are three variables here (Year, Country, and Population); however, only one of them is labeled.

“Tidyverse” to the rescue! The tidyverse is a collection of R packages (including tidyr, dplyr, and ggplot2) and is crucial for manipulating data. These packages haven’t always been available, and they have absolutely sped up data science for R users. So, let’s tidy up this data so that each column is a variable. It’ll look something like:

# loading the tidyverse, which provides %>% and gather()
library(tidyverse)

# creating a vector of column names to index in my "gather" function
n <- names(untidy_df)

tidy_data <- untidy_df %>% 
  # "gathering" up the year columns: the old column names go into a new
  # "Year" column (the key) and their cell values into "Population"
  gather(key = "Year",
         value = "Population",
         n[2:length(n)])

head(tidy_data)
## # A tibble: 6 x 3
##   country             Year  Population
##   <chr>               <chr>      <dbl>
## 1 Afghanistan         1800         603
## 2 Albania             1800         667
## 3 Algeria             1800         715
## 4 Andorra             1800        1200
## 5 Angola              1800         618
## 6 Antigua and Barbuda 1800         757


Much better!
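
A side note for newer versions of tidyr: gather() still works, but it has been superseded by pivot_longer(). Here is a minimal sketch of the same reshape on a made-up three-country table. The names untidy_toy and tidy_toy are invented for illustration, and the values are copied from the printout above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for untidy_df: one row per country, one column per year
# (illustrative names; values copied from the first rows printed above)
untidy_toy <- tibble(
  country = c("Afghanistan", "Albania", "Algeria"),
  `1800`  = c(603, 667, 715),
  `1801`  = c(603, 667, 716)
)

# Every column except `country` is stacked into a Year/Population pair
tidy_toy <- untidy_toy %>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "Population")

tidy_toy
```

The rows come out ordered by country rather than by year, but the content is the same as with gather(). Note that Year is still stored as text at this point.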

So, I did this for all the data sets (population, child mortality, etc.) and joined them to create my own for this demonstration. The data set we come up with looks like this (notice each column is a variable):

head(gapminder_df)
##   X     country Year GDP_per_capita life_expectancy population continent
## 1 1 Afghanistan 1800            603            28.2    3280000      Asia
## 2 2     Albania 1800            667            35.4     410000    Europe
## 3 3     Algeria 1800            715            28.8    2500000    Africa
## 4 4      Angola 1800            618            27.0    1570000    Africa
## 5 5   Argentina 1800           1510            33.2     534000  Americas
## 6 6   Australia 1800            814            34.0     351000   Oceania
##   child_per_woman child_mortality
## 1            7.00             469
## 2            4.60             375
## 3            6.99             460
## 4            6.93             486
## 5            6.80             402
## 6            6.50             391
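
The joining step isn't shown here, but it could look roughly like the sketch below. Treat it as an assumption rather than the exact recipe behind gapminder_df: gdp_toy, life_toy and joined_toy are hypothetical names, and inner_join() (which keeps only the country-year pairs present in both tables) is just one reasonable choice of join.

```r
library(dplyr)

# Two hypothetical tidy tables, one per indicator
# (values taken from the first rows of the printout above)
gdp_toy <- tibble(
  country        = c("Afghanistan", "Albania"),
  Year           = c(1800L, 1800L),
  GDP_per_capita = c(603, 667)
)
life_toy <- tibble(
  country         = c("Afghanistan", "Albania"),
  Year            = c(1800L, 1800L),
  life_expectancy = c(28.2, 35.4)
)

# Match rows on country and Year so each indicator becomes its own column
joined_toy <- inner_join(gdp_toy, life_toy, by = c("country", "Year"))

joined_toy
```

Repeating the join once per indicator builds up a wide table with one row per country and year, which is the shape gapminder_df has.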




head() works well here, but as we know, not all data sets have so few columns; some have 50 or more. There are two other good ways to summarise your data. One is summary() and the other is str(), which stands for “structure”.

summary(gapminder_df)
##        X                country           Year      GDP_per_capita  
##  Min.   :    1   Afghanistan:  219   Min.   :1800   Min.   :   247  
##  1st Qu.: 7337   Albania    :  219   1st Qu.:1854   1st Qu.:   888  
##  Median :14674   Algeria    :  219   Median :1909   Median :  1460  
##  Mean   :14674   Angola     :  219   Mean   :1909   Mean   :  4516  
##  3rd Qu.:22010   Argentina  :  219   3rd Qu.:1964   3rd Qu.:  3700  
##  Max.   :29346   Australia  :  219   Max.   :2018   Max.   :114000  
##                  (Other)    :28032                                  
##  life_expectancy   population           continent     child_per_woman
##  Min.   : 1.50   Min.   :2.270e+04   Africa  :11169   Min.   :1.120  
##  1st Qu.:31.60   1st Qu.:8.770e+05   Americas: 5256   1st Qu.:4.600  
##  Median :35.50   Median :2.840e+06   Asia    : 5913   Median :5.940  
##  Mean   :43.23   Mean   :1.742e+07   Europe  : 6570   Mean   :5.395  
##  3rd Qu.:54.50   3rd Qu.:8.780e+06   Oceania :  438   3rd Qu.:6.630  
##  Max.   :84.20   Max.   :1.420e+09                    Max.   :8.460  
##                                                                      
##  child_mortality 
##  Min.   :  1.95  
##  1st Qu.:149.00  
##  Median :369.00  
##  Mean   :296.96  
##  3rd Qu.:421.00  
##  Max.   :756.00  
##  NA's   :2




… and str():

str(gapminder_df)
## 'data.frame':    29346 obs. of  9 variables:
##  $ X              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country        : Factor w/ 134 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Year           : int  1800 1800 1800 1800 1800 1800 1800 1800 1800 1800 ...
##  $ GDP_per_capita : int  603 667 715 618 1510 814 1850 1240 876 2410 ...
##  $ life_expectancy: num  28.2 35.4 28.8 27 33.2 34 34.4 30.3 25.5 40 ...
##  $ population     : num  3280000 410000 2500000 1570000 534000 351000 3210000 64500 19200000 3140000 ...
##  $ continent      : Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ child_per_woman: num  7 4.6 6.99 6.93 6.8 6.5 5.1 7.03 6.7 4.85 ...
##  $ child_mortality: num  469 375 460 486 402 391 387 440 508 322 ...




This last function, str(), is particularly useful because it lists the type of each variable in your data set, which is a good starting point for the feature engineering or cleaning part of your job. Here, all our data is relatively clean (because I have already cleaned it), so we can dive right into graphing with a package called “ggplot2”. We will use it in combination with the rest of the tidyverse to see our world “then” and “now”!
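
As a small example of the kind of cleanup str() points you toward: right after gather(), Year was stored as text (<chr> in the earlier printout), while gapminder_df holds it as an integer. The conversion itself isn't shown, so take the following as one possible way to do it on a made-up two-row table (tidy_toy and clean_toy are illustrative names).

```r
library(dplyr)

# Toy version of the gathered data: Year is still text here
tidy_toy <- tibble(
  country    = c("Afghanistan", "Albania"),
  Year       = c("1800", "1800"),
  Population = c(603, 667)
)

# Store Year as an integer and country as a factor
clean_toy <- tidy_toy %>%
  mutate(Year    = as.integer(Year),
         country = as.factor(country))

# str() now reports int and Factor for these two columns
str(clean_toy)
```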


First, how are we doing regarding life expectancy and GDP per capita?

# theme_hc() comes from the ggthemes package
library(ggthemes)

# here we pass our (filtered and summarised) data frame to ggplot()
ggplot(data = gapminder_df %>% 
         # we only want years 1800 and 2018...
         filter(Year %in% c(1800, 2018)) %>% 
         # we need to summarise for those years within specific continents
         group_by(Year, continent) %>% 
         summarise(Avg_life_expectancy = mean(life_expectancy)),
       # now we tell ggplot what variables to use
       aes(x = continent, y = Avg_life_expectancy, fill = as.factor(Year))) +
  # then what shape our data should take... 
  # (all ggplot shapes start with "geom_*")
  geom_col(position = "dodge") +
  labs(x = "", y = "Average Life Expectancy", fill = "Year") +
  scale_fill_manual(values = c("grey", "lightblue")) +
  theme_hc()
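
To see what ggplot() actually receives in a plot like this, it can help to run the filter/group/summarise steps on their own. Below is a minimal sketch on invented numbers (the toy table and its life expectancy values are made up purely for illustration):

```r
library(dplyr)

# Made-up data: a few country-level rows across two continents
toy <- tibble(
  Year            = c(1800L, 1800L, 1800L, 1900L, 2018L, 2018L),
  continent       = c("Asia", "Asia", "Europe", "Asia", "Asia", "Europe"),
  life_expectancy = c(28, 30, 35, 32, 70, 80)
)

avg_toy <- toy %>%
  # keep only the two years being compared (drops the 1900 row)
  filter(Year %in% c(1800, 2018)) %>%
  # one group per Year/continent combination
  group_by(Year, continent) %>%
  # collapse each group to a single average, then drop the grouping
  summarise(Avg_life_expectancy = mean(life_expectancy), .groups = "drop")

avg_toy
```

The result has one row per Year/continent pair, which is exactly one bar in the dodged bar chart.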




On a country level it looks something like this…

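ggplot(gapminder_df %>% 
         filter(Year %in% c(1800, 2018)),
       # Year is treated as a category so the two years get distinct colors
       aes(x = GDP_per_capita, y = life_expectancy, color = as.factor(Year), size = population)) +
  geom_point(alpha = 3/4) +
  labs(y = "Life Expectancy", x = "GDP Per Capita") +
  scale_x_log10() +
  theme_hc() +
  # hide the color legend ("none" replaces the deprecated FALSE)
  guides(color = "none")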



Look how far everyone has come in both categories!
And what of child mortality for all of the continents over time?

ggplot(data = gapminder_df %>% 
         group_by(Year, continent) %>% 
         summarise(average_cm = mean(child_mortality))
         , aes(x = Year, y = average_cm, color = continent)) +
  geom_line(stat = "identity") +
  labs(x = "Year", y = "Average Child Mortality") +
  theme_hc()
## Warning: Removed 2 rows containing missing values (geom_path).
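
A likely explanation for that warning: summary() reported two NA's in child_mortality, and mean() returns NA for any group that contains a missing value, so those group averages cannot be drawn. The sketch below shows the effect on made-up rows and one way around it, na.rm = TRUE. Whether dropping the missing values is appropriate for your analysis is a judgment call, and the toy table is purely illustrative.

```r
library(dplyr)

# Made-up rows: one of the 1801 values is missing
toy <- tibble(
  Year            = c(1800L, 1800L, 1801L, 1801L),
  continent       = "Asia",
  child_mortality = c(469, 440, NA, 508)
)

cm_toy <- toy %>%
  group_by(Year, continent) %>%
  summarise(
    # NA for any group containing a missing value
    average_cm      = mean(child_mortality),
    # ignores missing values within each group
    average_cm_narm = mean(child_mortality, na.rm = TRUE),
    .groups = "drop"
  )

cm_toy
```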


And GDP per capita?

ggplot(data = gapminder_df %>% 
         group_by(Year, continent) %>% 
         summarise(average_GDP = mean(GDP_per_capita))
         , aes(x = Year, y = average_GDP, color = continent)) +
  geom_line(stat = "identity") +
  labs(x = "Year", y = "Average GDP per Capita") +
  theme_hc() +
  # dollar() lives in the scales package, so we call it with scales::
  scale_y_continuous(labels = scales::dollar)




Which countries’ populations have grown the most since 1800?

top_10 <- gapminder_df %>% 
  # here we are "selecting" variables of interest...
  select(Year, country, population) %>% 
  # filtering dates like we have done above...
  filter(Year %in% c(1800, 2018)) %>%
  # "spreading" the two year population values into their own columns
  # so that we can subtract them
  spread(Year, population) %>% 
  # "mutating" a new column "pop_difference"
  mutate(pop_difference = `2018` - `1800`) %>% 
  # then keeping the 10 countries with the largest increase
  top_n(10, pop_difference)


top_10
##          country     1800     2018 pop_difference
## 1     Bangladesh 1.92e+07 1.66e+08      146800000
## 2         Brazil 3.64e+06 2.11e+08      207360000
## 3          China 3.22e+08 1.42e+09     1098000000
## 4          India 1.69e+08 1.35e+09     1181000000
## 5      Indonesia 1.61e+07 2.67e+08      250900000
## 6         Mexico 6.18e+06 1.31e+08      124820000
## 7        Nigeria 1.21e+07 1.96e+08      183900000
## 8       Pakistan 1.31e+07 2.01e+08      187900000
## 9    Philippines 1.89e+06 1.07e+08      105110000
## 10 United States 6.80e+06 3.27e+08      320200000
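
In current tidyr and dplyr releases, spread() and top_n() have been superseded by pivot_wider() and slice_max(). The sketch below runs the same steps on three countries; pop_toy and top_2 are illustrative names, and the population figures are copied from the table above.

```r
library(dplyr)
library(tidyr)

# Toy long-format table: two years for each of three countries
pop_toy <- tibble(
  country    = rep(c("India", "China", "Mexico"), each = 2),
  Year       = rep(c(1800L, 2018L), times = 3),
  population = c(1.69e8, 1.35e9, 3.22e8, 1.42e9, 6.18e6, 1.31e8)
)

top_2 <- pop_toy %>%
  # one column per year, like spread(Year, population)
  pivot_wider(names_from = Year, values_from = population) %>%
  mutate(pop_difference = `2018` - `1800`) %>%
  # keep the 2 largest increases, sorted from largest to smallest
  slice_max(pop_difference, n = 2)

top_2
```

Unlike top_n(), slice_max() also returns the rows sorted, so the biggest increase comes first.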





Then we can plot them over time…

top_countries <- top_10$country
ggplot(gapminder_df %>% 
         filter(country %in% top_countries),
       aes(x = Year, y = population, col = country)) +
  geom_line() +
  theme_hc()




So by now, you are an R expert and ready to take on your own data sets! There are two main takeaways. First, R takes time to learn, but it is such a powerful tool! Second, look at how much progress our world has made since 1800!


If you have enjoyed seeing this data, I would HIGHLY recommend Hans Rosling’s book Factfulness.

“I’ve got to admit it’s getting better” - The Beatles