R is a beautiful tool for all things data: cleaning, manipulating, visualizing, and running statistical analyses. That said, R takes a bit of time to get used to. My purpose here is to familiarize you with some of the key syntax, explain a few key packages (ggplot2 and tidyverse), and leave you with an appreciation of the power of R as an analytical tool!
To illustrate the power of R in a time-efficient manner, with real-world applications, I use the Gapminder dataset. I have assembled a custom version with data gathered from this link. Also, huge shoutout to the Gapminder team and the late Hans Rosling for all their work.
Before we jump in, I want to talk about a key point that comes up A LOT within the R community: “tidy” data. Data is considered “tidy” if each column represents a single variable and each row a single observation. That sounds intuitive; however, there are plenty of cases where it doesn’t hold, and it ends up making a big difference.
So… back to tidy data… All the data sets from the website above looked something like this:
head(untidy_df[,1:8] )
## # A tibble: 6 x 8
## country `1800` `1801` `1802` `1803` `1804` `1805` `1806`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 603 603 603 603 603 603 603
## 2 Albania 667 667 667 667 667 668 668
## 3 Algeria 715 716 717 718 719 720 721
## 4 Andorra 1200 1200 1200 1200 1210 1210 1210
## 5 Angola 618 620 623 626 628 631 634
## 6 Antigua and Barbuda 757 757 757 757 757 757 757
Notice that what would ordinarily be the “Year” variable is spread across the column headers, so each country's row holds one value per year. Although this is easy to read, it becomes a problem once you want to work with the data. There are three variables here (Year, Country, and Population); however, only one of them is labeled.
“Tidyverse” to the rescue! The tidyverse is a collection of R packages (tidyr, dplyr, and ggplot2 among them) that is crucial for manipulating data. It hasn’t always been available, and it has absolutely sped up data science for R users. So, let’s tidy up this data so that each column is a variable. It’ll look something like:
library(tidyverse)

# every column name except the first ("country") is a year we want to gather
n <- names(untidy_df)
tidy_data <- untidy_df %>%
  # "gathering" up the data, calling the key (the columns we want to gather) "Year"
  # and the values that we're gathering "Population"
  gather(key = "Year",
         value = "Population",
         all_of(n[2:length(n)]))
head(tidy_data)
## # A tibble: 6 x 3
## country Year Population
## <chr> <chr> <dbl>
## 1 Afghanistan 1800 603
## 2 Albania 1800 667
## 3 Algeria 1800 715
## 4 Andorra 1800 1200
## 5 Angola 1800 618
## 6 Antigua and Barbuda 1800 757
Much better!
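As a side note, newer versions of tidyr recommend pivot_longer() over gather(). Here is a minimal sketch of the same reshaping, assuming tidyr 1.1.0 or later. The `untidy_toy` data frame is a made-up stand-in that copies a few values from the printout above; it is not the real `untidy_df`.

```r
library(tidyr)

# Hypothetical two-country, two-year stand-in for untidy_df
untidy_toy <- tibble::tibble(
  country = c("Afghanistan", "Albania"),
  `1800`  = c(603, 667),
  `1801`  = c(603, 667)
)

tidy_toy <- pivot_longer(
  untidy_toy,
  cols = -country,                          # every column except country
  names_to = "Year",                        # old column names become the Year variable
  values_to = "Population",                 # cell values go here
  names_transform = list(Year = as.integer) # store Year as a number, not text
)

tidy_toy
```

Two differences from the gather() version: rows come out ordered by country rather than by year, and Year is an integer instead of a character column, which saves a conversion later.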
So, I did this for all the data sets (population, child mortality, etc.) and joined them to create my own for this demonstration. The data set we come up with looks like this (notice each column is a variable):
head(gapminder_df)
## X country Year GDP_per_capita life_expectancy population continent
## 1 1 Afghanistan 1800 603 28.2 3280000 Asia
## 2 2 Albania 1800 667 35.4 410000 Europe
## 3 3 Algeria 1800 715 28.8 2500000 Africa
## 4 4 Angola 1800 618 27.0 1570000 Africa
## 5 5 Argentina 1800 1510 33.2 534000 Americas
## 6 6 Australia 1800 814 34.0 351000 Oceania
## child_per_woman child_mortality
## 1 7.00 469
## 2 4.60 375
## 3 6.99 460
## 4 6.93 486
## 5 6.80 402
## 6 6.50 391
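The joining step is not shown in this post, so here is only a rough sketch of how two tidied data sets could be combined with dplyr. The names `gdp_toy`, `life_toy`, and `joined_toy` are hypothetical, and the choice of inner_join() is my assumption for illustration; the values are copied from the first two rows of the printout above.

```r
library(dplyr)

# Two hypothetical tidy data sets, one measure each
gdp_toy <- tibble::tibble(
  country        = c("Afghanistan", "Albania"),
  Year           = c(1800L, 1800L),
  GDP_per_capita = c(603, 667)
)

life_toy <- tibble::tibble(
  country         = c("Afghanistan", "Albania"),
  Year            = c(1800L, 1800L),
  life_expectancy = c(28.2, 35.4)
)

# Match rows on country AND Year, keeping only pairs found in both tables
joined_toy <- inner_join(gdp_toy, life_toy, by = c("country", "Year"))

joined_toy
```

An inner join drops any country-year pair missing from either table; left_join() or full_join() would keep those rows and fill the gaps with NA instead.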
The problem with head(), though, is that not all data sets have so few columns; some have 50 or more, and the printout quickly becomes hard to read. There are two other good ways to summarise your data. One is summary() and the other is str(), which stands for “structure”.
summary(gapminder_df)
## X country Year GDP_per_capita
## Min. : 1 Afghanistan: 219 Min. :1800 Min. : 247
## 1st Qu.: 7337 Albania : 219 1st Qu.:1854 1st Qu.: 888
## Median :14674 Algeria : 219 Median :1909 Median : 1460
## Mean :14674 Angola : 219 Mean :1909 Mean : 4516
## 3rd Qu.:22010 Argentina : 219 3rd Qu.:1964 3rd Qu.: 3700
## Max. :29346 Australia : 219 Max. :2018 Max. :114000
## (Other) :28032
## life_expectancy population continent child_per_woman
## Min. : 1.50 Min. :2.270e+04 Africa :11169 Min. :1.120
## 1st Qu.:31.60 1st Qu.:8.770e+05 Americas: 5256 1st Qu.:4.600
## Median :35.50 Median :2.840e+06 Asia : 5913 Median :5.940
## Mean :43.23 Mean :1.742e+07 Europe : 6570 Mean :5.395
## 3rd Qu.:54.50 3rd Qu.:8.780e+06 Oceania : 438 3rd Qu.:6.630
## Max. :84.20 Max. :1.420e+09 Max. :8.460
##
## child_mortality
## Min. : 1.95
## 1st Qu.:149.00
## Median :369.00
## Mean :296.96
## 3rd Qu.:421.00
## Max. :756.00
## NA's :2
… and str():
str(gapminder_df)
## 'data.frame': 29346 obs. of 9 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : Factor w/ 134 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1800 1800 1800 1800 1800 1800 1800 1800 1800 1800 ...
## $ GDP_per_capita : int 603 667 715 618 1510 814 1850 1240 876 2410 ...
## $ life_expectancy: num 28.2 35.4 28.8 27 33.2 34 34.4 30.3 25.5 40 ...
## $ population : num 3280000 410000 2500000 1570000 534000 351000 3210000 64500 19200000 3140000 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
## $ child_per_woman: num 7 4.6 6.99 6.93 6.8 6.5 5.1 7.03 6.7 4.85 ...
## $ child_mortality: num 469 375 460 486 402 391 387 440 508 322 ...
This last function, str(), is particularly useful because it lists the type of each variable in the dataset, which is a good starting point for the feature engineering or cleaning part of your job. Here, our data is already relatively clean (I have done that work beforehand), so we can dive right into graphing with a package called “ggplot2”. We will use this in combination with the “tidyverse” package to see our world “then” and “now”!
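If you want the same type and missing-value information in a form you can work with programmatically, base R can do it in a couple of lines. This is just a sketch on a made-up data frame (`toy_df` and its values are invented for illustration, not taken from the Gapminder data).

```r
# Hypothetical data frame with one missing value
toy_df <- data.frame(
  country         = c("A", "B", "C"),
  child_mortality = c(469, NA, 460)
)

# Class of every column, as a named character vector
col_types <- sapply(toy_df, class)

# Number of missing values in every column
na_counts <- colSums(is.na(toy_df))

col_types
na_counts
```

On the real data, a check like this would point at the same two missing child_mortality values that summary() reports above.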
First, how are we doing regarding life expectancy and GDP per capita?
library(ggthemes)  # provides theme_hc()

# here we pass our data frame to ggplot()
ggplot(data = gapminder_df %>%
         # we only want years 1800 and 2018...
         filter(Year %in% c(1800, 2018)) %>%
         # we need to summarise for those years within specific continents
         group_by(Year, continent) %>%
         summarise(Avg_life_expectancy = mean(life_expectancy)),
       # now we tell ggplot what variables to use
       aes(x = continent, y = Avg_life_expectancy, fill = as.factor(Year))) +
  # then what shape our data should take...
  # (all ggplot shapes start with "geom_*")
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "", y = "Average Life Expectancy", fill = "Year") +
  scale_fill_manual(values = c("grey", "lightblue")) +
  theme_hc()
On a country level it looks something like this…
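ggplot(gapminder_df %>%
         filter(Year %in% c(1800, 2018)),
       # Year is stored as an integer, so convert it to a factor
       # to get one discrete colour per year
       aes(x = GDP_per_capita, y = life_expectancy,
           color = as.factor(Year), size = population)) +
  geom_point(alpha = 3/4) +
  labs(y = "Life Expectancy", x = "GDP Per Capita") +
  scale_x_log10() +
  theme_hc() +
  # hide the colour legend ("none" replaces the deprecated FALSE)
  guides(color = "none")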
Look how far everyone has come in both categories!
And what of child mortality for all of the continents over time?
ggplot(data = gapminder_df %>%
         group_by(Year, continent) %>%
         summarise(average_cm = mean(child_mortality)),
       aes(x = Year, y = average_cm, color = continent)) +
  geom_line(stat = "identity") +
  labs(x = "Year", y = "Average Child Mortality") +
  theme_hc()
## Warning: Removed 2 rows containing missing values (geom_path).
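That warning most likely traces back to the two missing child_mortality values reported by summary() earlier: mean() returns NA for any group that contains an NA, and ggplot2 then drops those points. A minimal sketch of one way around it, using made-up numbers (`toy_cm` is hypothetical) and assuming dplyr 1.0 or later for the `.groups` argument:

```r
library(dplyr)

# Hypothetical data: one missing value in the 1801 group
toy_cm <- tibble::tibble(
  Year            = c(1800L, 1800L, 1801L, 1801L),
  continent       = c("Asia", "Asia", "Asia", "Asia"),
  child_mortality = c(400, 500, NA, 480)
)

avg_cm <- toy_cm %>%
  group_by(Year, continent) %>%
  # na.rm = TRUE averages only the values that are present
  summarise(average_cm = mean(child_mortality, na.rm = TRUE),
            .groups = "drop")

avg_cm
```

Whether dropping the missing values is appropriate depends on why they are missing; with only two gaps in over 29,000 rows it is probably harmless here, but that is a judgment call.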
And GDP per capita?
ggplot(data = gapminder_df %>%
         group_by(Year, continent) %>%
         summarise(average_GDP = mean(GDP_per_capita)),
       aes(x = Year, y = average_GDP, color = continent)) +
  geom_line(stat = "identity") +
  labs(x = "Year", y = "Average GDP per Capita") +
  theme_hc() +
  # dollar() comes from the scales package
  scale_y_continuous(labels = scales::dollar)
Which countries have grown the most in population between 1800 and 2018?
top_10 <- gapminder_df %>%
  # here we are "selecting" variables of interest...
  select(Year, country, population) %>%
  # filtering dates like we have done above...
  filter(Year %in% c(1800, 2018)) %>%
  # "spreading" the two year population values so that
  # we can subtract them
  spread(Year, population) %>%
  # "mutating" a new column "pop_difference"
  mutate(pop_difference = `2018` - `1800`) %>%
  # then selecting the top 10
  top_n(10, pop_difference)
top_10
## country 1800 2018 pop_difference
## 1 Bangladesh 1.92e+07 1.66e+08 146800000
## 2 Brazil 3.64e+06 2.11e+08 207360000
## 3 China 3.22e+08 1.42e+09 1098000000
## 4 India 1.69e+08 1.35e+09 1181000000
## 5 Indonesia 1.61e+07 2.67e+08 250900000
## 6 Mexico 6.18e+06 1.31e+08 124820000
## 7 Nigeria 1.21e+07 1.96e+08 183900000
## 8 Pakistan 1.31e+07 2.01e+08 187900000
## 9 Philippines 1.89e+06 1.07e+08 105110000
## 10 United States 6.80e+06 3.27e+08 320200000
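For what it's worth, spread() and top_n() still work, but newer tidyverse releases steer you toward pivot_wider() and slice_max(). Below is a minimal sketch of the same pipeline with those functions, assuming dplyr 1.0+ and tidyr 1.0+. The `pop_toy` data frame is a hypothetical three-country stand-in using population figures from the table above.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format stand-in for gapminder_df
pop_toy <- tibble::tibble(
  country    = rep(c("India", "China", "Mexico"), each = 2),
  Year       = rep(c(1800L, 2018L), times = 3),
  population = c(1.69e8, 1.35e9,   # India
                 3.22e8, 1.42e9,   # China
                 6.18e6, 1.31e8)   # Mexico
)

top_2 <- pop_toy %>%
  # one column per year, like spread(Year, population)
  pivot_wider(names_from = Year, values_from = population) %>%
  mutate(pop_difference = `2018` - `1800`) %>%
  # keep the n largest differences, sorted from largest to smallest
  slice_max(pop_difference, n = 2)

top_2
```

One practical difference: slice_max() returns the rows sorted by the ranking variable, whereas top_n() keeps them in their original order, as in the alphabetical table above.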
Then we can plot them over time…
top_countries <- top_10$country
ggplot(gapminder_df %>%
         filter(country %in% top_countries),
       aes(x = Year, y = population, col = country)) +
  geom_line() +
  theme_hc()
So by now, you are an R expert and ready to take on your own data sets! There are two main takeaways. First, R takes time but is such a powerful tool! Second, look at how much progress our world has made since the 1800s!
If you have enjoyed seeing this data, I would HIGHLY recommend Hans Rosling’s book Factfulness.
“I’ve got to admit it’s getting better” -The Beatles