Video
RAFAEL IRIZARRY: In this section, we will demonstrate how relatively simple ggplot and dplyr code can create insightful and aesthetically pleasing plots that help us better understand trends in world health and economics.
We will use many of the techniques we have learned about data visualization, exploratory data analysis, and summarization.
We later augment the code somewhat to perfect the plots, and describe some general principles to guide data visualization. We’re going to be using data from Gapminder. Hans Rosling was the co-founder of the Gapminder Foundation, an organization dedicated to educating the public by using data to dispel common myths about the so-called developing world.
The organization uses data to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events. As stated on the Gapminder Foundation’s website, “Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in people’s minds into an overdramatic worldview and strong negative stress feelings. The world is getting worse. It’s we versus them. Other people are strange. The population just keeps growing. And nobody cares.” Hans Rosling conveyed actual data-based trends, in a dramatic way of his own, using effective data visualization. This section is based on two talks that exemplify this approach to education: “New Insights on Poverty” and “The Best Stats You’ve Ever Seen.” Specifically, in this section, we set out to answer the following two questions. First, is it a fair characterization of today’s world to say that it is divided into Western rich nations and the developing world in Africa, Asia, and Latin America? Second, has income inequality across countries worsened during the last 40 years? We’re going to use data and code to answer these questions.
This video corresponds to the textbook section introducing the case study on new insights on poverty.
The original Gapminder TED talks are available and we encourage you to watch them.
You can also find more information and raw data (in addition to what we analyze in class) at https://www.gapminder.org/.
RAFAEL IRIZARRY: To learn about world health and economics, we will be using the Gapminder data set provided in the dslabs library. This dataset was put together for you, and it was created using a number of spreadsheets available from the Gapminder Foundation.
You can access the table using this code. We load the dslabs package with library(dslabs), then we type data(gapminder), and we can see that the data includes country, year, and several health and economic outcomes.
To get us started, we’re going to take a quiz created by Hans Rosling in his video New Insights on Poverty, which tests our knowledge regarding differences in child mortality across different countries.
So here’s a quiz. For each of the pairs of countries here, which country do you think had the highest child mortality in 2015? And also, which pairs do you think are most similar? When answering these questions without data, the non-European countries are typically picked as having higher mortality rates, Sri Lanka over Turkey, South Korea over Poland, and Malaysia over Russia. It is also common to assume that countries considered to be part of the developing world, Pakistan, Vietnam, Thailand, and South Africa, have similarly high mortality rates.
Now let’s answer these questions with data. For example, for this first comparison, we can write this simple dplyr code to see that Turkey has a higher mortality rate than Sri Lanka. We can use the same code to answer each of the five questions, and we see that Sri Lanka has a lower mortality rate than Turkey, South Korea has a lower mortality rate than Poland, Malaysia has a lower mortality rate than Russia, and Pakistan is very different from Vietnam, and South Africa is very different from Thailand.
From here, we see that in these comparisons, the European countries have higher rates. We also see that the countries from the developing world can have very different rates. It turns out that most people do worse than if they were just guessing, which implies that we’re more than ignorant, we’re misinformed.
This video corresponds to the textbook section introducing the case study on new insights on poverty.
Codes:
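# compare infant mortality in Sri Lanka and Turkey
library(dplyr)    # filter(), select(), and the %>% pipe
library(dslabs)   # provides the gapminder data set
data(gapminder)
gapminder_filter <- gapminder %>%
  filter(year == 2015 & country %in% c("Sri Lanka", "Turkey")) %>%
  select(country, infant_mortality)
gapminder_filter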
## country infant_mortality
## 1 Sri Lanka 8.4
## 2 Turkey 11.6
RAFAEL IRIZARRY: Our misconceptions stem from the preconceived notion that the world is divided into two groups, the Western World, composed of Western Europe and North America, which is characterized by long lifespans and small families versus the developing world, Africa, Asia, and Latin America, characterized by short lifespans and large families.
But does the data support this dichotomous view of the world? The necessary data to answer this question is also available in our gapminder table. Using our newly-learned data visualization skills, we will be able to answer this question.
The first plot we make to see what data have to say about this worldview is a scatterplot of life expectancy versus fertility rates. Fertility rates are defined as the average number of children per woman. We will start by looking at data from about 50 years ago when, perhaps, this worldview was cemented in our minds.
We just type the simple code and we see this plot. Note that most points do, in fact, fall into two distinct categories: one with life expectancies around 70 years and three or fewer children per family, and the other with life expectancies lower than 65 years and more than five children per family.
Now, to confirm that these countries are indeed from the regions we expect, we can use color to represent continent. So we change the code slightly by adding the color argument and assigning continent to it. Because continent is a categorical variable, ggplot will automatically assign a color to each continent.
Here’s the plot. So indeed, in 1962, the West versus developing worldview was grounded in some reality, but is this still the case 50 years later? To answer this question, we’re going to learn about faceting.
This video corresponds to the textbook section on Gapminder scatterplots.
# basic scatterplot of life expectancy versus fertility
library(ggplot2)   # ggplot(), aes(), geom_point()
ds_theme_set()     # set plot theme (from dslabs)
filter(gapminder, year == 1962) %>%
  ggplot(aes(fertility, life_expectancy)) +
  geom_point()
# add color as continent
filter(gapminder, year == 1962) %>%
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point()
RAFAEL IRIZARRY: We could easily plot the 2012 data in the same way we did for 1962. But for comparison, side-by-side plots are preferable. In ggplot, we can achieve this by faceting variables. We stratify the data by some variable and make the same plot for each stratum.
Here we are faceting by the year. To achieve this, we use the function facet_grid. This is added as a layer which automatically separates the plots. The function lets you facet by up to two variables, using columns to represent one variable and rows to represent the other. The function expects the row and column variables separated by a tilde.
Here’s an example. We’re going to facet by continent and year. So continent will be in the rows, and year will be in the columns. Here is the plot. We can see how the data has been stratified. We have 1962 on the left, 2012 on the right, and each of the five continents in its own row.
# facet by continent and year
filter(gapminder, year %in% c(1962, 2012)) %>%
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_grid(continent ~ year)
However, this is just an example and more than what we want, which is simply to compare 1962 and 2012. In this case, there’s just one variable. So we use the dot to let the facet function know that we’re not using two variables but just one. The code looks like this.
# facet by year only
filter(gapminder, year %in% c(1962, 2012)) %>%
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_grid(. ~ year)
We simply type facet_grid dot– meaning we’re not using a variable for the rows– tilde year which now tells it make two columns– 1962 and 2012. And here is the plot.
After we split the plot like this, it clearly shows that the majority of countries have moved from the developing world cluster to the Western world one. They went from having large families and short lifespans to having smaller families and longer lifespans. In 2012, the Western versus developing world view no longer makes sense. This is particularly clear when we compare Europe to Asia. Asia includes several countries that have made great improvements in the last 40 to 50 years.
To explore how this transformation happened through the years, we can make the plot for several years. For example, we can add 1970, 1980, 1990, and 2000 to the plot. Now, if we do this, we will not want all the plots on the same row. This is the default behavior of facet_grid. If we do this, the plots will become too thin, and we won’t be able to see the data.
Instead, we might want to have the plots across different rows and columns. For this, we can use the facet_wrap function. It automatically wraps the series of plots so that each display has viewable dimensions. So the code looks like this.
# facet by year, plots wrapped onto multiple rows
years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder %>%
filter(year %in% years & continent %in% continents) %>%
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_wrap(~year)
It’s very similar. We’re adding some years. And then at the end, we facet_wrap instead of facet_grid. And now, the plot looks like this.
Now, we’re only showing Asia and Europe, but the function clearly shows us how the Asian countries have made great improvements throughout the years.
Now, note that the default choice for the range of the axes is an important one. When not using facet, this range is determined by the data shown in the plot. When using facet, the range is determined by the data shown in all plots. And therefore, it’s kept fixed across the plots. This makes comparisons across plots much easier. For example, in the plot we just saw, the life expectancy has increased, and the fertility has decreased across most countries.
We see this because the cloud of points moves up and to the left. This is not the case if we adjust the scales to each year separately. The plot looks like this. In this case, we have to pay special attention to the range to notice that the plot on the right has larger life expectancy. Therefore, by keeping the scales the same, we were able to quickly see how many of the countries outside of the Western world have improved during the last 40 to 50 years.
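The version with scales adjusted to each year separately can be obtained by passing scales = "free" to facet_wrap. The following is a minimal sketch, assuming the ggplot2 package is installed; the data frame toy and all of its values are made up for illustration and are not Gapminder data.

```r
library(ggplot2)

# Hypothetical toy values (not Gapminder data): three countries in two years
toy <- data.frame(
  year = rep(c(1962, 2012), each = 3),
  fertility = c(6.5, 5.8, 2.7, 3.1, 2.2, 1.6),
  life_expectancy = c(48, 55, 70, 68, 74, 81)
)

# Default: axis ranges are shared across panels, so the shift is easy to see
p_fixed <- ggplot(toy, aes(fertility, life_expectancy)) +
  geom_point() +
  facet_wrap(~year)

# scales = "free" lets each panel pick its own axis ranges, so the shift
# is only visible if we read the tick labels carefully
p_free <- ggplot(toy, aes(fertility, life_expectancy)) +
  geom_point() +
  facet_wrap(~year, scales = "free")

p_free
```

Printing p_fixed and p_free one after the other shows why the fixed default is usually the better choice when the goal is to compare panels.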
This video corresponds to the textbook section on faceting.
Key points
- Faceting makes multiple side-by-side plots stratified by some variable. This is a way to ease comparisons.
- The facet_grid() function allows faceting by up to two variables, with rows faceted by one variable and columns faceted by the other variable. To facet by only one variable, use the dot operator as the other variable.
- The facet_wrap() function facets by one variable and automatically wraps the series of plots so they have readable dimensions.
- Faceting keeps the axes fixed across all plots, easing comparisons between plots.
- The data suggest that the developing versus Western world view no longer makes sense in 2012.
The visualizations we have just seen effectively illustrate that data no longer supports the Western versus developing worldview. But once we see these plots, new questions emerge. For example, which countries are improving more? Which ones are improving less? Was the improvement constant during the last 50 years, or was there more of an acceleration during certain periods? For a closer look that may help answer these questions, we introduce time series plots.
Time series plots have time on the x-axis and an outcome, or measurement of interest, on the y-axis. For example, here’s a trend plot for the United States fertility rate. We can get this plot by simply using the geom_point layer.
# scatterplot of US fertility by year
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(year, fertility)) +
geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
When we look at this plot, we immediately see that the trend is not linear at all. Instead, we see a sharp drop during the 60s and 70s to below 2. Then, the trend comes back up to 2, and stabilizes there in the 1990s. When the points are regularly spaced and densely packed as they are here, we can create curves by joining points with lines. This conveys that these data are from a single country. To do this, we use the geom_line function instead of geom_point. We write the code like this, and now the curve looks like this.
# line plot of US fertility by year
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(year, fertility)) +
geom_line()
## Warning: Removed 1 row(s) containing missing values (geom_path).
This is particularly helpful when we look at two or more countries. Let’s look at an example. Let’s subset the data to include two countries. Let’s look at one from Europe and one from Asia. So we copy the code above–
# line plot fertility time series for two countries- only one line (incorrect)
countries <- c("South Korea", "Germany")
gapminder %>% filter(country %in% countries) %>%
ggplot(aes(year, fertility)) +
geom_line()
## Warning: Removed 2 row(s) containing missing values (geom_path).
here it is– and we get this plot. But note that this is not what we want. Rather than a line for each country, this code has produced a line that goes through the points for both countries– they’re both joined. This is actually expected, since we have not told ggplot anything about wanting two separate lines. To let ggplot know that there are two curves that need to be made separately, we assign each point to a group, one for each country.
# line plot fertility time series for two countries - one line per country
gapminder %>% filter(country %in% countries) %>%
ggplot(aes(year, fertility, group = country)) +
geom_line()
## Warning: Removed 2 row(s) containing missing values (geom_path).
We do this through the mapping. We assign country to the group argument. The plot now looks like this. We can see the two lines, one for each country. However, we don’t know which line goes with which country. To see this, we can use color for example. We can use color to distinguish the two countries. A useful side effect of using color to assign different colors to each country is that ggplot automatically groups the data by the color value. So the code is very simple, it looks like this.
# fertility time series for two countries - lines colored by country
gapminder %>% filter(country %in% countries) %>%
ggplot(aes(year, fertility, col = country)) +
geom_line()
## Warning: Removed 2 row(s) containing missing values (geom_path).
And once we type this, then we get two lines, each with a color. And a legend has been added by default. Note that this plot clearly shows how South Korea’s fertility rate dropped drastically during the 60s and 70s. And by 1990, it had a similar fertility rate to Germany.
For time series plots, we actually recommend labeling the curves rather than using legends as we did in the previous plot. This suggestion actually applies to most plots. Labeling is usually preferred over legends. However, legends are easier to make and appear by default in many of ggplot’s functions.
We are going to show an example of how to add labels to a time series plot. We demonstrate how we can do this using the life expectancy data. We define a data table with the label locations. And then we use a second mapping just for the labels. The code looks like this.
Notice that we define a data frame with the locations of where we want the labels. We pick these by eye. And then you can see in the geom_text, we are using the labels data frame as the data, so that those labels are put in those positions. Then we have to tell the plot not to add a legend through the theme function. And now the plot looks like this.
# life expectancy time series - lines colored by country and labeled, no legend
labels <- data.frame(country = countries, x = c(1970, 1965), y = c(55, 72))
gapminder %>% filter(country %in% countries) %>%
ggplot(aes(year, life_expectancy, col = country)) +
geom_line() +
geom_text(data = labels, aes(x, y, label = country), size = 5) +
theme(legend.position = "none")
This is the life expectancy plot. We can see how an improvement in life expectancy followed the drops in fertility rates. While in 1960, Germans lived more than 15 years longer on average than South Koreans, by 2010 the gap is completely closed.
Another commonly held notion is that wealth distribution across the world has become worse during the last decades. When general audiences are asked if poor countries have become poorer and rich countries have become richer, the majority answer yes. By using histograms, smooth densities, and box plots, we will be able to understand if this is in fact the case.
In this video, we cover transformations. Transformations can be very useful to better understand distributions. As an example, in this video, we look at income.
gapminder_gdp <- gapminder %>%
filter(!is.na(gdp)) %>%
select(!c(region, infant_mortality, life_expectancy))
knitr::kable(tail(gapminder_gdp), caption = "GDP table")  # requires the knitr package
| | country | year | fertility | population | gdp | continent |
|---|---|---|---|---|---|---|
| 7568 | Vanuatu | 2011 | 3.46 | 241876 | 386483180 | Oceania |
| 7569 | Venezuela | 2011 | 2.44 | 29427631 | 166062245436 | Americas |
| 7570 | Vietnam | 2011 | 1.79 | 89321903 | 66530108958 | Asia |
| 7571 | Yemen | 2011 | 4.35 | 24234940 | 13104223693 | Asia |
| 7572 | Zambia | 2011 | 5.77 | 14343526 | 5917195991 | Africa |
| 7573 | Zimbabwe | 2011 | 3.64 | 14255592 | 4407438807 | Africa |
The Gapminder data table includes a column with the country’s gross domestic product, the GDP. GDP measures the market value of goods and services produced by a country in a given year. The GDP per person is often used as a rough summary of how rich a country is. Here we divide this quantity by 365 to obtain the more interpretable measure dollars per day.
Using current US dollars as a unit, a person surviving on an income of less than $2 a day, for example, is defined to be living in absolute poverty. So we’re going to add this variable to our data table: the dollars per day variable, computed as GDP divided by population divided by 365. Before we continue, note that the GDP values in our data table are adjusted for inflation and represent current US dollars. So these values are meant to be comparable across the years. Also note that these are country averages and that within each country, there’s much variability.
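As a quick check of the arithmetic, here is a small base R sketch that applies the same formula to a single row. The two input numbers are copied from the Zimbabwe 2011 row of the GDP table shown earlier; the variable names are only illustrative.

```r
# Values copied from the Zimbabwe 2011 row of the GDP table shown earlier
gdp <- 4407438807        # total GDP in US dollars
population <- 14255592   # number of people

gdp_per_person <- gdp / population        # dollars per person per year
dollars_per_day <- gdp_per_person / 365   # dollars per person per day

round(gdp_per_person, 2)
round(dollars_per_day, 2)
```

The result is roughly 0.85 dollars per day, which falls below the $2 a day threshold mentioned above.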
OK, so let’s move on to examining distributions. Here’s a histogram of per day incomes from 1970.
# add dollars per day variable
gapminder <- gapminder %>%
mutate(dollars_per_day = gdp/population/365)
# histogram of dollars per day
past_year <- 1970
gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "gold")
You can obtain it with this simple code which you’ve already learned. We see that for the majority of countries, averages are below $10 a day. However, the majority of the x-axis is dedicated to the 35 countries with averages above 10.
It might be more informative to quickly be able to see how many countries make on average about
- $1 a day, extremely poor
- $2 a day, very poor
- $4 a day, poor
- $8 a day, which is about middle
- $16 a day, which is a well-off country
- $32 a day, which is rich
- $64 a day, which is very rich
These changes are multiplicative. And here we introduce log transformations. Log transformations change multiplicative changes into additive ones. Using base 2, for example, means that every time a value doubles, the log transformation increases by one. So to get the distribution of the log base 2 transformed values, we simply transform the data and use the same code. And now we obtain this histogram.
# repeat histogram with log2 scaled data
gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(log2(dollars_per_day))) +
geom_histogram(binwidth = 1, color = "black")
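To make the doubling idea concrete, here is a small base R sketch using the income levels listed earlier. It only illustrates how log2 behaves and does not touch the Gapminder data; the variable names are illustrative.

```r
# Income levels from the list above, each double the previous one
income <- c(1, 2, 4, 8, 16, 32, 64)

# On the log2 scale, each doubling becomes a step of exactly 1
log_income <- log2(income)
log_income
## [1] 0 1 2 3 4 5 6

# Multiplicative changes have become additive ones
diff(log_income)
## [1] 1 1 1 1 1 1
```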
In this plot, we see something new. We see two clear bumps. Before we continue interpreting the data, let’s introduce some commonly used statistical language. In statistics, these bumps are sometimes referred to as modes. The mode of a distribution is the value with the highest frequency. The mode of a normal distribution is the average. But if the mode is the value with the highest frequency, how can we have more than one? When a distribution, like the one we just saw, doesn’t monotonically decrease from the mode, we call the locations where it goes up and down again local modes. And we say that the distribution has multiple modes.
The histogram we just saw suggests that in 1970, the country income distribution had two modes: one at about $2 per day, 1 in the log2 scale, and another at about $32 per day, 5 in the log2 scale. This bimodality is consistent with a dichotomous world made up of countries with average incomes less than $8 per day, 3 in the log2 scale, and countries above that.
Now, before we continue interpreting the data, we need to pause again to explain how we choose the base. In the histogram we just saw, we chose base 2. Other common choices are the natural log and base 10. In general, we do not recommend using the natural log for data exploration and visualization. Why is this? It’s because we can quickly compute 2 to the 2, 2 to the 3, and 2 to the 4 in our minds, and 10 to the 1, 10 to the 2, and 10 to the 3 are also very easy to compute, but it’s not easy to compute e to the 2, e to the 3, et cetera.
In the dollars per day example, we use base 2 instead of base 10 because the resulting range is easier to interpret. The range of the values being plotted starts at about 0.3 and ends around 50. In base 10, this turns into a range that includes very few integers, just 0 and 1. With base 2, our range includes negative 2, negative 1, 0, 1, 2, 3, 4, and 5. Note that it is easier to compute 2 to the x and 10 to the x when x is an integer. So we prefer to have more integers in the transformed scale. Another consequence of a limited range is that choosing the bin width is more challenging. With log base 2, we know that a bin width of 1 will translate to bins with range x to 2x. As an example in which base 10 makes more sense than base 2, consider population size.
gapminder %>%
filter(year == past_year) %>%
ggplot(aes(log10(population))) +
geom_histogram(binwidth = 0.5, color = "black")
Using log base 10 makes more sense here since the range for these data goes from 45,000 to about 800 million. Here’s a histogram if we transform the values with the log base 10.
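The effect of the base on the plotted range can be checked directly. This is a small base R sketch that uses only the approximate endpoints quoted in the text, so the results are approximate too; the variable names are illustrative.

```r
# Approximate ranges quoted in the text
income_range <- c(0.3, 50)          # dollars per day
population_range <- c(45000, 8e8)   # people

# Income: base 2 spreads the range over several integers,
# base 10 squeezes it into very few
log2(income_range)
log10(income_range)

# Population: base 10 gives values that are easy to read as powers of ten
log10(population_range)
```

With these endpoints, income spans roughly -1.7 to 5.6 in base 2 but only about -0.5 to 1.7 in base 10, while population spans roughly 4.7 to 8.9 in base 10.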
Looking at the scale, and knowing that we’re in base 10, we can quickly determine that country populations range from about 40,000 to about a billion.
Now let’s talk about log transformations and how we use them in plots. There are two ways we can use log transformations in plots. We can log the values before plotting them, or we can use log scales in the axes. Both approaches are useful and have different strengths. If we log the data, we can more easily interpret intermediate values in the scale. For example, if the data have been log transformed and a point x falls halfway between the ticks at 1 and 2, we know that x is about 1.5. If instead the scale is logged and x falls halfway between the ticks at 10 and 100, then we don’t immediately know what x is, because it’s 10 to the 1.5, not an easy thing to compute in our heads. However, the advantage of using log scales is that we see the original values on the axis, which makes it very easy to quickly see what numbers we’re actually dealing with: for example, we see $32 a day instead of 5 log base 2 dollars a day.
Now let’s review how we make plots where the scales have been log transformed. We already learned the scale_x_continuous function. So we want to remake the histograms that we already made, but now using scales that have been transformed. We simply add a layer using the scale_x_continuous function, and we no longer transform the data before plotting it. So the code would look like this, and the plot would look like this.
# repeat histogram with log2 scaled x-axis
gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2")
Notice that the histogram looks exactly the same. The difference is that on the x-axis, instead of seeing the log values, we see the original values on a log scale. So we see 1, 8, and 64. And we can very quickly interpret what that means in terms of dollars per day.
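The trade-off between logging the data and logging the scale comes down to which conversion the reader has to do in their head. A short base R sketch of both directions, with illustrative variable names:

```r
# A value read off a log10-transformed axis, halfway between ticks 1 and 2
x_log10 <- 1.5
10^x_log10   # back on the original scale: roughly 31.6

# A value read off a log2 scale that shows original units
log2(32)     # 32 dollars a day corresponds to 5 on the log2 scale
```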
RAFAEL IRIZARRY: The histogram showed us that the income distribution values show a dichotomy. However, the histogram does not show us if the two groups of countries are the West versus the developing world. To see distributions by geographical region, we first stratify the data into regions, and then examine the distribution for each. Now, the number of regions is large: in this case, it’s 22, as we can see by typing this command in R.
# add dollars per day variable
gapminder <- gapminder %>%
mutate(dollars_per_day = gdp/population/365)
# number of regions
length(levels(gapminder$region))
## [1] 22
With this many regions, looking at histograms or smooth densities for each will not be useful. Instead, we can stack box plots next to each other. We’ve learned how to use geom_boxplot before, so we write this code. When we do this, we get this plot.
# boxplot of GDP by region in 1970
past_year <- 1970
p <- gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(region, dollars_per_day))
p + geom_boxplot()
Now, note that we can’t read the region names because the default ggplot behavior is to write the labels horizontally, and here we run out of room. We can easily fix this by rotating the labels. Consulting the documentation, we find that we can rotate the names by changing the theme through element_text. So we add the following layer to our graph: theme with axis.text.x set to element_text, where angle = 90 rotates the labels and hjust = 1 justifies the text so that it’s next to the axis. When we do this, we get this plot.
# rotate names on x-axis
p + geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Now, we can read the names. We can already see that there is indeed a West versus the rest dichotomy. If we look closely at the box plots that are high, we see that they’re North America, Northern Europe, Australia and New Zealand, and Western Europe. There are a few more adjustments we can make to this plot to help relay this message. First, it helps to order the regions in some order that is not alphabetical. Ordering alphabetically is completely arbitrary. We can order by something meaningful instead. The function that’s going to help us achieve this is the reorder function. This function lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. Before we continue with our example, let’s understand how the reorder function works using a simpler one. Let’s define a factor based on a vector with five entries: Asia, Asia, West, West, West. If we turn this vector into a factor, the levels of this factor are ordered alphabetically.
# by default, factor order is alphabetical
fac <- factor(c("Asia", "Asia", "West", "West", "West"))
levels(fac)
## [1] "Asia" "West"
This is the default in R. So Asia is the first level and West is the second level. But suppose that each element of the original vector is associated with a value. Here we’re just defining some arbitrarily: 10, 11, 12, 6, 4. Let’s suppose that we want to order the levels based on the mean value of these numbers. In this case, West has a lower mean, the mean of 12, 6, and 4, compared to the mean of Asia, which is the mean of 10 and 11. So we use the reorder function like this, passing fac, our factor, as the first argument.
# reorder factor by the category means
value <- c(10, 11, 12, 6, 4)
fac <- reorder(fac, value, FUN = mean)
levels(fac)
## [1] "West" "Asia"
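As a quick check on the example above, the group means that reorder relies on can be computed directly with tapply. This is a minimal sketch in base R; negating value to get a descending order is an illustration and is not part of the original code.

```r
# Same toy factor and values as in the example above
fac <- factor(c("Asia", "Asia", "West", "West", "West"))
value <- c(10, 11, 12, 6, 4)

# Mean of value within each level of the factor
group_means <- tapply(value, fac, mean)
print(group_means)

# reorder() sorts the levels by increasing group mean
ascending <- levels(reorder(fac, value, FUN = mean))
print(ascending)

# Negating the values reverses the order (largest mean first)
descending <- levels(reorder(fac, -value, FUN = mean))
print(descending)
```

The same idea carries over to the box plots, where the summary is the median of dollars_per_day within each region.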
The second argument, value, holds the five values, and the function mean is used to summarize them. We can see that the new factor that’s created has levels ordered differently. Now West is the first one. Why? Because it has a smaller mean of the value vector. All right. Let’s get back to our example. In our example, we have regions. These are the different parts of a continent. We also have continents. And then we have divided the world into West versus the rest. So we have three different ways of dividing the data. The first thing we’re going to do to improve our plot is to simply reorder the regions by their median income level. To achieve this, we write the same code as before, but we add a mutate that changes region to a new factor where the levels are reordered. If we do this, we get the following plot.
# reorder by median income and color by continent
p <- gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
mutate(region = reorder(region, dollars_per_day, FUN = median)) %>% # reorder
ggplot(aes(region, dollars_per_day, fill = continent)) + # color by continent
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("")
p
Now we can see that the box plots are ordered by their median value. And we very quickly see that there are four box plots that stand out at the end, the four highest ones. These are Western Europe, Australia and New Zealand, Northern Europe, and North America. This is what we define as the West. Now there’s another change we made to the plot to help convey this message, and that’s that we use color to show another variable. We use color to show continent. Remember, regions are parts of continents. To add color to define the different continents, we use the fill argument in the aesthetic mappings of ggplot. We simply say fill = continent. And now each continent gets its own color. This helps us see that, for example, the blue box plots are towards the right because these are the European countries. We also see that the red box plots are to the left. These are the countries in the African continent. The last change we can make to this plot to help us see the data a little bit better is to change the scale to the log scale. We want to change it to the log2 scale in this case, so we add the layer scale_y_continuous, and we use the log2 transformation.
# log2 scale y-axis
p + scale_y_continuous(trans = "log2")
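To see why the log2 scale used above spreads out the lower incomes, note that on this scale every doubling of income covers the same distance. The following is a small base R sketch; the income values are illustrative and are not taken from the gapminder data.

```r
# Hypothetical incomes in dollars per day, each double the previous one
income <- c(1, 2, 4, 8, 16, 32, 64)

# On the original scale the gaps between neighbors grow quickly
print(diff(income))

# On the log2 scale each doubling is exactly one unit
log_income <- log2(income)
print(diff(log_income))
```

So a move from 1 to 2 dollars per day is drawn as large as a move from 32 to 64, which is what makes differences among the poorer regions visible.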
And now what this does, is it helps us see the differences between the countries with the lower income. For example, we see a difference now between the African continent, which is in red, and Asia, which is in green. All right. The last change we can make to this plot to make it tell the story a little better to give us even more information, is to show the data. In many cases, we don’t show the data, the actual individual points, because it adds too much clutter to the plot and it obfuscates the message. But in this particular example, we don’t have that many points. So we can add a layer of points by simply adding the geom point layer. It’s very simple.
# add data points
p + scale_y_continuous(trans = "log2") + geom_point(show.legend = FALSE)
We just add that layer and now we get this plot. And we can see the individual points. You can decide if you show this or not. But now we can see exactly where every single country lies.
This video corresponds to the textbook section on comparing multiple distributions with boxplots (https://rafalab.github.io/dsbook/gapminder.html#comparing-multiple-distributions-with-boxplots-and-ridge-plots). Note that many boxplots from the video are instead dot plots in the textbook and that a different boxplot is constructed in the textbook. Also read that section to see an example of grouping factors with the case_when function.
This video corresponds to the textbook section on 1970 versus 2010 income distributions. Note that the boxplots are slightly different: the group variable in those plots was defined in section 10.7.1.
The exploratory data analysis we have conducted has revealed two characteristics about average income distributions in 1970. Using a histogram, we found a bimodal distribution with the modes relating to poor and rich countries. Then, by stratifying by region and examining box plots, we found that the rich countries were mostly in Europe and Northern America, along with Australia and New Zealand, and that the poor countries were mostly in the rest of the world. So we are going to define a vector that contains the regions in the West. We simply define a vector like this.
# define Western countries
west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
Now we want to focus on comparing the differences in distribution across time. We start by confirming that the bi-modality observed in 1970 is explained by a west versus developing world economy. We do this by creating a histogram for the groups previously defined.
# facet by West vs developing
gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(. ~ group)
Note that we create the two groups with an ifelse inside a mutate, and that we then use facet_grid to make a histogram for each group. Using this code, we see this histogram.
And we immediately see that the countries in the West have higher incomes. Their histogram is shifted to the right, while countries in the developing world are shifted towards the left. Now we’re ready to see if the separation is worse today than it was 40 years ago. We do this by now faceting by both group and year. So it’s the same code, but now we’re looking at two years, 1970 and 2010. And we end the code with a facet_grid by year and group.
# add dollars per day variable and define past year
gapminder <- gapminder %>%
mutate(dollars_per_day = gdp/population/365)
past_year <- 1970
# facet by West/developing and year
present_year <- 2010
gapminder %>%
filter(year %in% c(past_year, present_year) & !is.na(gdp)) %>%
mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
Now we can see the histogram again for the four different groups. When we look at this figure, we can see that the developing world has shifted to the right more than the West, meaning that the income distribution of the developing countries has gotten closer to that of the West.
Before we interpret the findings of this plot further, we note that there are more countries represented in the 2010 histograms than in the 1970 ones. The total counts are larger. One reason for this is that several countries were founded after 1970. For example, the Soviet Union turned into several countries, including Russia and Ukraine, during the ’90s. Another reason is that data is available for more countries in 2010 compared to 1970. So we’re going to remake the plots, but using only countries with data available for both years. We’re going to use this very simple code. We’re going to define a vector with the list of countries for 1970 and another with the list for 2010. Here, notice we use the dot that we explained earlier to get this character vector out of this dplyr command. And then we’re going to take the intersection using the intersect function. There’s actually a better way of doing this using the tidyverse tools, but we haven’t learned those yet. So we use this simple piece of code.
# define countries that have data available in both years
country_list_1 <- gapminder %>%
filter(year == past_year & !is.na(dollars_per_day)) %>% .$country
country_list_2 <- gapminder %>%
filter(year == present_year & !is.na(dollars_per_day)) %>% .$country
country_list <- intersect(country_list_1, country_list_2)
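The intersect step can be seen on a small example. This sketch uses base R and made-up vectors, so the two lists are hypothetical and only illustrate how countries missing from either year are dropped.

```r
# Hypothetical lists of countries with data in each year
countries_1970 <- c("Chile", "Japan", "Kenya", "Soviet Union")
countries_2010 <- c("Chile", "Japan", "Kenya", "Russia", "Ukraine")

# Keep only the countries that appear in both vectors
both_years <- intersect(countries_1970, countries_2010)
print(both_years)

# Countries that appear in only one of the two years
print(setdiff(countries_1970, countries_2010))
print(setdiff(countries_2010, countries_1970))
```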
So now there are 108 countries in this list. It accounts for 86% of the total population, so this subset should be representative of the entire world. Let’s make the plot again, but this time using only the subset of countries for which data is present in both 1970 and 2010. To do this, we add the condition country %in% country_list to the filter function.
# make histogram including only countries with data available in both years
gapminder %>%
filter(year %in% c(past_year, present_year) & country %in% country_list) %>% # keep only selected countries
mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
Now we get this plot. We now see that while the rich countries have become a bit richer percentage wise, the poorer countries appear to have improved more. The histogram has shifted more to the right than for the rich countries. In particular, we see that the proportion of developing countries earning more than $16 a day increases substantially.
To see which specific regions improve the most, we can remake the box plots that we made earlier, but now adding 2010. Here it is. We use the same code. We use facet grid to divide into 2010 and 1970. And we can see which countries have gone up more.
p <- gapminder %>%
filter(year %in% c(past_year, present_year) & country %in% country_list) %>%
mutate(region = reorder(region, dollars_per_day, FUN = median)) %>%
ggplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("") + scale_y_continuous(trans = "log2")
p + geom_boxplot(aes(region, dollars_per_day, fill = continent)) +
facet_grid(year ~ .)
Now, these box plots are a little bit hard to compare, because we’re trying to compare box plots that are on top of each other. It’s helpful to put them next to each other. To make the comparisons easier, we’re going to pause to introduce another powerful ggplot feature. Because we want to compare each region before and after, it would be convenient to have the 1970 box plot next to the 2010 box plot. In general, comparisons are easier when data are plotted next to each other. So instead of faceting, we keep the data from each year together, but ask ggplot to color or fill the box plot depending on the year. ggplot automatically separates them and puts the two box plots next to each other. This is very convenient. Because year is a number, we turn it into a factor so that each year is a category. This is because ggplot automatically assigns a color to each level of a factor if we assign that factor to the color or fill argument. So if we type this command now, adding fill = factor(year), we get this plot.
p + geom_boxplot(aes(region, dollars_per_day, fill = factor(year)))
And we can see which countries have improved the most. Look at Eastern Asia, for example, how it went from way down around 8 all the way up almost to 64. And finally we point out that if what we are most interested in is in comparing before and after values, it might make more sense to plot the ratios, or differences in the log scale. We’re still not ready to learn the code that achieves this, but here’s what the plot would look like. This is actually showing a box plot of the log ratios from 2010 compared to 1970 for each country. And we can see again, eastern Asia has the biggest improvement.
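One possible way to compute the log ratios mentioned above is sketched here in base R. The data frame toy and its numbers are made up for illustration, and this is not necessarily how the plot in the video was produced.

```r
# Hypothetical incomes (dollars per day) for three countries in two years
toy <- data.frame(
  country = rep(c("A", "B", "C"), times = 2),
  year = rep(c(1970, 2010), each = 3),
  dollars_per_day = c(2, 8, 32, 16, 16, 64)
)

# Put the two years side by side, one row per country
past <- toy[toy$year == 1970, c("country", "dollars_per_day")]
present <- toy[toy$year == 2010, c("country", "dollars_per_day")]
wide <- merge(past, present, by = "country", suffixes = c("_1970", "_2010"))

# log2 ratio: 1 means income doubled, 0 means no change
wide$log2_ratio <- log2(wide$dollars_per_day_2010 / wide$dollars_per_day_1970)
print(wide)
```

With a region column added to such a table, a box plot of log2_ratio by region would compare the change for each region directly.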
This video corresponds to the following sections:
RAFAEL IRIZARRY: We have used data exploration to discover that the income gap between rich and poor countries has closed considerably during the last forty years.
We used a series of histograms and box plots to see this. Here, we suggest a succinct way to convey this message with just one plot. We will use smooth density plots. Let’s start by noting that the density plots for the income distribution in 1970 and 2010 deliver the message that the gap is closing.
In the 1970 plot, we see two clear modes, poor and rich. In 2010, it appears that some of the poorer countries have shifted towards the right, closing the gap. The next message we need to convey is that the reason for this change in distribution is that poor countries became richer rather than some rich countries becoming poorer.
To do this, all we need to do is assign a color to the groups we identified during the data exploration. However, before we can do this, we need to learn how to make these smooth densities in a way that preserves information of how many countries are in each group.
To understand why we need to do this, note the discrepancy in the size of each group. If we divide the world into developing and West, we have 87 developing countries and 21 Western countries.
# count the countries in each group (developing vs West)
gapminder %>%
filter(year == past_year & country %in% country_list) %>%
mutate(group = ifelse(region %in% west, "West", "Developing")) %>% group_by(group) %>%
summarize(n = n()) %>% knitr::kable()
| group | n |
|---|---|
| Developing | 87 |
| West | 21 |
If we overlay the two densities, the default is to have the area represented by each distribution add up to 1 regardless of the size of each group. This makes it seem like there’s the same number of countries in each group, which is incorrect. To change this, we’ll need to learn to access computed variables with the geom_density function. To have the areas of the densities be proportional to the size of the groups, we can simply multiply the y-axis values by the size of the group. From the geom_density help file, we see that the function computes a variable called count that does exactly this. We want this variable to be on the y-axis rather than the density value. In ggplot, we can access these variables by surrounding their names with two dots on each side. So we will use the following mapping: aes(x = dollars_per_day, y = ..count..). This will put count on the y-axis. We can now create the desired plot by simply changing the mapping in the previous code chunk.
p <- gapminder %>%
filter(year %in% c(past_year, present_year) & country %in% country_list) %>% # keep both years so that facet_grid(year ~ .) shows the change
mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day, y = ..count.., fill = group)) +
scale_x_continuous(trans = "log2")
p + geom_density(alpha = 0.2, bw = 0.75) + facet_grid(year ~ .)
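In ggplot2 3.4.0 and later, the dot-dot notation is deprecated in favor of after_stat(), so y = ..count.. can also be written as y = after_stat(count). The sketch below uses simulated data with the same group sizes as above (87 and 21) rather than the gapminder data; toy and its values are assumptions made for illustration. It checks that the computed count is the density multiplied by the number of observations in each group.

```r
library(ggplot2)

# Simulated incomes: 87 "Developing" and 21 "West" values (made up)
set.seed(1)
toy <- data.frame(
  dollars_per_day = c(2^rnorm(87, mean = 1, sd = 1.5),
                      2^rnorm(21, mean = 5, sd = 0.7)),
  group = rep(c("Developing", "West"), times = c(87, 21))
)

# after_stat(count) is the current spelling of ..count..
p_toy <- ggplot(toy, aes(dollars_per_day, y = after_stat(count), fill = group)) +
  scale_x_continuous(trans = "log2") +
  geom_density(alpha = 0.2, bw = 0.75)

# Inspect the variables computed by the density stat;
# groups are numbered alphabetically, so 1 is Developing and 2 is West
computed <- layer_data(p_toy)
developing <- computed[computed$group == 1, ]
west_rows <- computed[computed$group == 2, ]

# count should equal density times the group size
print(all.equal(developing$count, developing$density * 87))
print(all.equal(west_rows$count, west_rows$density * 21))
```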
And it produces a plot like this.
Notice that now we can clearly see that the developing world has more countries. If you want the densities to be smoother, because the density for the Western countries is quite bumpy, you can change the bw argument, as we learned earlier. We tried a few values and decided on 0.75. You can try a few yourself. Here’s what it looks like with 0.75. This plot now shows what is happening very clearly. The developing world distribution is changing. A third mode appears, consisting of the countries that most closed the gap. We can actually make this figure somewhat more informative. From the exploratory data analysis, we noticed that many of the countries that most improved were from Asia. We can easily alter the plot to show key regions separately. To do this, we introduce a new function called case_when. It’s useful for defining groups. It currently does not have a data argument. This might change. But because it doesn’t, we need to access the components of our data using the dot placeholder. So the code looks like this.
# add group as a factor, grouping regions
gapminder <- gapminder %>%
mutate(group = case_when(
.$region %in% west ~ "West",
.$region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
.$region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
.$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
TRUE ~ "Others"))
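As an aside on the code above: in current versions of dplyr, case_when can refer to columns directly when it is called inside mutate, so the .$ prefix is not required there. The following is a minimal sketch on a made-up data frame; toy is hypothetical, and west is the same vector defined earlier.

```r
library(dplyr)

# Same definition of the West as earlier in this section
west <- c("Western Europe", "Northern Europe", "Southern Europe",
          "Northern America", "Australia and New Zealand")

# Hypothetical rows, one per case of interest
toy <- data.frame(
  region = c("Western Europe", "Eastern Asia", "Caribbean",
             "Eastern Africa", "Northern Africa", "Western Asia"),
  continent = c("Europe", "Asia", "Americas", "Africa", "Africa", "Asia")
)

# Conditions are checked in order; the first TRUE one determines the group
toy <- toy %>%
  mutate(group = case_when(
    region %in% west ~ "West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
print(toy)
```

Because the conditions are evaluated in order, Northern Africa and Western Asia fall through to the final TRUE case and are labeled Others.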
Look at what we’re doing. We’re assigning groups depending on the region. If the region is in the West, we call it West. If the region is Eastern Asia or South-Eastern Asia, we call it East Asia. If the region is the Caribbean, Central America, or South America, we call it Latin America. If the continent is Africa and the region is not Northern Africa, we call it Sub-Saharan Africa. And the rest we just call Others. Now we turn this group variable into a factor to control the order of the levels. We do it like this. We picked this particular order for a reason that becomes clearer later when we make the plots.
# reorder factor levels
gapminder <- gapminder %>%
mutate(group = factor(group, levels = c("Others", "Latin America", "East Asia", "Sub-Saharan Africa", "West")))
Now we can easily plot the density for each one. We use color and size to clearly see the top. Here’s what the two look like in 1970 and 2010. The plot is a little bit cluttered and hard to read, so we’re going to use a stacking approach to make the picture clearer. Here’s how we do it. We use the argument position = "stack". And now the density plots are stacked on top of each other.
# note you must redefine p with the new gapminder object first
p <- gapminder %>%
filter(year %in% c(past_year, present_year) & country %in% country_list) %>%
ggplot(aes(dollars_per_day, fill = group)) +
scale_x_continuous(trans = "log2")
# stacked density plot
p + geom_density(alpha = 0.2, bw = 0.75, position = "stack") +
facet_grid(year ~ .)
Here we can see clearly that the distributions for East Asia, Latin America, and Others shift markedly to the right, while Sub-Saharan Africa remains stagnant. Note that we ordered the levels of the groups so that the West density was plotted first, and then Sub-Saharan Africa. This helps us see this pattern. As a final point, we note that these distributions weigh every country the same. So if most of the population is improving but living in a very large country such as China, we might not appreciate this. We can actually weigh the smooth densities using the weight mapping argument.
# weighted stacked density plot
gapminder %>%
filter(year %in% c(past_year, present_year) & country %in% country_list) %>%
group_by(year) %>%
mutate(weight = population/sum(population*2)) %>%
ungroup() %>%
ggplot(aes(dollars_per_day, fill = group, weight = weight)) +
scale_x_continuous(trans = "log2") +
geom_density(alpha = 0.2, bw = 0.75, position = "stack") + facet_grid(year ~ .)
And if we do that, the plot now looks like this. This particular figure shows very clearly how the income distribution gap is closing with most of the poor countries remaining in Sub-Saharan Africa.
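The effect of weighting can also be seen with the density function in base R, which accepts a weights argument. This sketch uses three made-up countries with hypothetical incomes and populations; it illustrates the idea only and is not the ggplot code from this video.

```r
# Three hypothetical countries: income in dollars per day and population
income <- c(2, 8, 32)
population <- c(1000, 50, 50)

# Work on the log2 scale, as in the plots
x <- log2(income)

# Weights proportional to population, summing to 1
w <- population / sum(population)

# Same bandwidth as in the plots, with and without weights
unweighted <- density(x, bw = 0.75)
weighted <- density(x, weights = w, bw = 0.75)

# Location of the highest point of each curve, back on the dollar scale
peak_unweighted <- 2^unweighted$x[which.max(unweighted$y)]
peak_weighted <- 2^weighted$x[which.max(weighted$y)]
print(c(unweighted = peak_unweighted, weighted = peak_weighted))
```

When every country counts the same, the curve peaks near the middle income; once the weights reflect population, the peak moves to the income of the very large country.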
This video corresponds to the textbook section on the ecological fallacy.
RAFAEL IRIZARRY: Throughout this section, we have been comparing regions of the world. We have seen that on average some regions do better than others in health outcomes and economic outcomes. Here, we focus on the importance of describing the variability within the groups. While we do this, we’ll also show you some other ggplot functions as well as a transformation called the logit transformation, which is useful for the data that we’ll be looking at. As an example for this, we will focus on the relationship between country child survival rates and average income. We start by comparing these quantities across regions. Before we start, we’re going to define a few more regions using the case_when function. We’re going to define the West, Northern Africa, East Asia, Southern Asia, Latin America, Sub-Saharan Africa, and the Pacific Islands.
# define gapminder
library(tidyverse)
library(dslabs)
data(gapminder)
# add additional cases
gapminder <- gapminder %>%
mutate(group = case_when(
.$region %in% west ~ "The West",
.$region %in% "Northern Africa" ~ "Northern Africa",
.$region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
.$region == "Southern Asia" ~ "Southern Asia",
.$region %in% c("Central America", "South America", "Caribbean") ~ "Latin America",
.$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
.$region %in% c("Melanesia", "Micronesia", "Polynesia") ~ "Pacific Islands"))
Once we do this, we can compute the quantities that we’re interested in for each region. We’ll compute the average. This shows a dramatic difference.
# define a data frame with group average income and average infant survival rate
surv_income <- gapminder %>%
filter(year %in% present_year & !is.na(gdp) & !is.na(infant_mortality) & !is.na(group)) %>%
group_by(group) %>%
summarize(income = sum(gdp)/sum(population)/365,
infant_survival_rate = 1 - sum(infant_mortality/1000*population)/sum(population))
surv_income %>% arrange(income)
## # A tibble: 7 x 3
##   group              income infant_survival_rate
##   <chr>               <dbl>                <dbl>
## 1 Sub-Saharan Africa   1.76                0.936
## 2 Southern Asia        2.07                0.952
## 3 Pacific Islands      2.70                0.956
## 4 Northern Africa      4.94                0.970
## 5 Latin America       13.2                 0.983
## 6 East Asia           13.4                 0.985
## 7 The West            77.1                 0.995
While in the West less than 0.5% of children die, in Sub-Saharan Africa the rate is higher than 6%. In fact, the relationship between these two variables is almost perfectly linear. In this plot, we introduce the use of the limit argument, which lets us change the range of the axis. We do it with the following code.
# plot infant survival versus income, with transformed axes
surv_income %>% ggplot(aes(income, infant_survival_rate, label = group, color = group)) +
scale_x_continuous(trans = "log2", limit = c(0.25, 150)) +
scale_y_continuous(trans = "logit", limit = c(0.875, .9981),
breaks = c(.85, .90, .95, .99, .995, .998)) +
geom_label(size = 3, show.legend = FALSE)
We are making the range larger than the data needs because we will later compare this plot to one with more variability, and we want the ranges to be the same. We also introduce the breaks argument, which lets us set the location of the axis labels.
Finally, we introduce a new transformation, the logistic transformation. The logistic or logit transformation for a proportion or rate p is defined as follows: f(p) = log(p / (1 - p)). When p is a proportion or probability, the quantity that is being logged, p / (1 - p), is called the odds. In this case, p is the proportion of children that survive. The odds tell us how many more children are expected to survive than to die. The log transformation makes this quantity symmetric. If the survival and death rates are the same, then the log odds is 0. Fold increases or decreases turn into positive and negative increments, respectively. This scale is useful when we want to highlight differences that are near 0 or near 1. For survival rates, this is important because a survival rate of 90% is unacceptable, while a survival rate of 99% is relatively good. We would much prefer a survival rate closer to 99.9%. We want our scale to highlight these differences, and the logit does this. Note that 99.9 divided by 0.1 is about 10 times larger than 99 divided by 1, which is about 10 times larger than 90 divided by 10. By using the log, these fold changes turn into constant increases.
OK, now back to our plot. Based on the plot we showed earlier, do we conclude that a country with a low income is destined to have a low survival rate? Do we conclude that survival rates in Sub-Saharan Africa are all lower than in Southern Asia, which in turn are lower than in the Pacific Islands, and so on? Jumping to this conclusion based on a plot that shows only the averages is referred to as the ecological fallacy.
The almost perfect relationship between survival rates and income is only observed for the averages at the regional level. Once we show the data, we see a somewhat more complicated story. So here is the plot for the averages. And look at what happens once we show you every individual country. Specifically, we see that there is a large amount of variability. We see that the countries from the same regions can be quite different. And that countries within the same income can have different survival rates. For example, while on average Sub-Saharan Africa had the worst health and economic outcomes, there is wide variability within that group. For example, note that Mauritius and Botswana are doing much better than Angola and Sierra Leone with Mauritius comparable to Western countries.
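Returning to the logit transformation described in this video, the arithmetic can be checked numerically. This base R sketch uses qlogis, which computes log(p / (1 - p)); the three survival rates are the ones mentioned in the discussion.

```r
# Survival rates discussed in the video
p <- c(0.90, 0.99, 0.999)

# Odds of surviving: p / (1 - p)
odds <- p / (1 - p)
print(odds)

# The logit is the log of the odds; qlogis() computes the same quantity
logit <- log(odds)
print(all.equal(logit, qlogis(p)))

# Each step multiplies the odds by roughly 10, so the logit
# increases by roughly log(10) each time
print(diff(logit))
print(log(10))
```

On the original scale the step from 99% to 99.9% looks tiny next to the step from 90% to 99%, while on the logit scale the two steps are about the same size.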
The Gapminder Foundation (www.gapminder.org) is a non-profit organization based in Sweden that promotes global development through the use of statistics that can help reduce misconceptions about global development.
Instruction
library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)
## fill out the missing parts in filter and aes
gapminder %>% filter( continent %in% "Africa" & year %in% "2012" ) %>%
ggplot(aes(fertility, life_expectancy )) +
geom_point()
Note that there is quite a bit of variability in life expectancy and fertility with some African countries having very high life expectancies. There also appear to be three clusters in the plot.
Instruction
Remake the plot from the previous exercises but this time use color to distinguish the different regions of Africa to see if this explains the clusters.
Remember that you can explore the gapminder data to see how the regions of Africa are labeled in the data frame!
Use color rather than col inside your ggplot call - while these two forms are equivalent in R, the grader specifically looks for color.
## fill out the missing parts in filter and aes
gapminder %>% filter( continent %in% "Africa" & year %in% "2012" ) %>%
ggplot(aes(fertility, life_expectancy, color = region )) +
geom_point()
While many of the countries in the high life expectancy/low fertility cluster are from Northern Africa, three countries are not.
Instruction
#Create table
df <- gapminder %>%
filter(continent %in% "Africa" & year %in% "2012" & fertility <= 3 & life_expectancy >= 70) %>%
select(country, region)
df
## country region
## 1 Algeria Northern Africa
## 2 Cape Verde Western Africa
## 3 Egypt Northern Africa
## 4 Libya Northern Africa
## 5 Mauritius Eastern Africa
## 6 Morocco Northern Africa
## 7 Seychelles Eastern Africa
## 8 Tunisia Northern Africa
The Vietnam War lasted from 1955 to 1975. Do the data support war having a negative effect on life expectancy? We will create a time series plot that covers the period from 1960 to 2010 of life expectancy for Vietnam and the United States, using color to distinguish the two countries. In this exercise we start the analysis by generating a table.
Instruction
tab <- gapminder %>%
filter(year %in% c(1960:2010) & country %in% c("Vietnam","United States"))
tab
## country year infant_mortality life_expectancy fertility population
## 1 United States 1960 25.9 69.91 3.67 186176524
## 2 Vietnam 1960 75.6 58.52 6.35 32670623
## 3 United States 1961 25.4 70.32 3.63 189077076
## 4 Vietnam 1961 72.6 59.17 6.39 33666768
## 5 United States 1962 24.9 70.21 3.48 191860710
## 6 Vietnam 1962 69.9 59.82 6.43 34684164
## 7 United States 1963 24.4 70.04 3.35 194513911
## 8 Vietnam 1963 67.3 60.42 6.45 35722092
## 9 United States 1964 23.8 70.33 3.22 197028908
## 10 Vietnam 1964 61.7 60.95 6.46 36780984
## 11 United States 1965 23.3 70.41 2.93 199403532
## 12 Vietnam 1965 60.7 61.32 6.48 37860014
## 13 United States 1966 22.7 70.43 2.71 201629471
## 14 Vietnam 1966 59.9 61.36 6.49 38959335
## 15 United States 1967 22.0 70.76 2.56 203713082
## 16 Vietnam 1967 59.0 61.06 6.49 40074695
## 17 United States 1968 21.3 70.42 2.47 205687611
## 18 Vietnam 1968 58.2 60.45 6.49 41195833
## 19 United States 1969 20.6 70.66 2.46 207599308
## 20 Vietnam 1969 57.3 59.63 6.49 42309662
## 21 United States 1970 19.9 70.92 2.46 209485807
## 22 Vietnam 1970 56.4 58.78 6.47 43407291
## 23 United States 1971 19.1 71.24 2.27 211357912
## 24 Vietnam 1971 55.5 58.17 6.42 44485910
## 25 United States 1972 18.3 71.34 2.01 213219515
## 26 Vietnam 1972 54.7 58.00 6.35 45549487
## 27 United States 1973 17.5 71.54 1.87 215092900
## 28 Vietnam 1973 53.8 58.35 6.25 46604726
## 29 United States 1974 16.7 72.08 1.83 217001865
## 30 Vietnam 1974 52.8 59.23 6.13 47661770
## 31 United States 1975 16.0 72.68 1.77 218963561
## 32 Vietnam 1975 51.8 60.54 5.97 48729397
## 33 United States 1976 15.2 72.99 1.74 220993166
## 34 Vietnam 1976 50.9 62.07 5.80 49808071
## 35 United States 1977 14.5 73.38 1.78 223090871
## 36 Vietnam 1977 49.8 63.58 5.61 50899504
## 37 United States 1978 13.8 73.58 1.75 225239456
## 38 Vietnam 1978 48.8 64.86 5.42 52015279
## 39 United States 1979 13.2 74.03 1.80 227411604
## 40 Vietnam 1979 47.8 65.84 5.23 53169674
## 41 United States 1980 12.6 73.93 1.82 229588208
## 42 Vietnam 1980 46.8 66.49 5.05 54372518
## 43 United States 1981 12.1 74.36 1.81 231765783
## 44 Vietnam 1981 45.8 66.86 4.87 55627743
## 45 United States 1982 11.7 74.65 1.81 233953874
## 46 Vietnam 1982 44.8 67.10 4.69 56931822
## 47 United States 1983 11.2 74.71 1.78 236161961
## 48 Vietnam 1983 43.9 67.30 4.52 58277391
## 49 United States 1984 10.9 74.81 1.79 238404223
## 50 Vietnam 1984 43.0 67.51 4.36 59653092
## 51 United States 1985 10.6 74.79 1.84 240691557
## 52 Vietnam 1985 42.0 67.77 4.21 61049370
## 53 United States 1986 10.4 74.87 1.84 243032017
## 54 Vietnam 1986 41.0 68.07 4.06 62459557
## 55 United States 1987 10.2 75.01 1.87 245425409
## 56 Vietnam 1987 40.0 68.38 3.93 63881296
## 57 United States 1988 10.0 75.02 1.92 247865202
## 58 Vietnam 1988 38.9 68.68 3.81 65313709
## 59 United States 1989 9.7 75.10 2.00 250340795
## 60 Vietnam 1989 37.7 69.00 3.68 66757401
## 61 United States 1990 9.4 75.40 2.07 252847810
## 62 Vietnam 1990 36.6 69.30 3.56 68209604
## 63 United States 1991 9.1 75.50 2.06 255367160
## 64 Vietnam 1991 35.4 69.60 3.42 69670620
## 65 United States 1992 8.8 75.80 2.04 257908206
## 66 Vietnam 1992 34.3 69.80 3.26 71129537
## 67 United States 1993 8.5 75.70 2.02 260527420
## 68 Vietnam 1993 33.1 70.10 3.07 72558986
## 69 United States 1994 8.2 75.80 2.00 263301323
## 70 Vietnam 1994 32.0 70.30 2.88 73923849
## 71 United States 1995 8.0 75.90 1.98 266275528
## 72 Vietnam 1995 30.9 70.60 2.68 75198975
## 73 United States 1996 7.7 76.30 1.98 269483224
## 74 Vietnam 1996 29.9 70.90 2.48 76375677
## 75 United States 1997 7.5 76.60 1.97 272882865
## 76 Vietnam 1997 28.9 71.10 2.31 77460429
## 77 United States 1998 7.3 76.80 2.00 276354096
## 78 Vietnam 1998 27.9 71.50 2.17 78462888
## 79 United States 1999 7.2 76.90 2.01 279730801
## 80 Vietnam 1999 27.0 71.70 2.06 79399708
## 81 United States 2000 7.1 76.90 2.05 282895741
## 82 Vietnam 2000 26.1 72.00 1.98 80285563
## 83 United States 2001 7.0 76.90 2.03 285796198
## 84 Vietnam 2001 25.3 72.20 1.94 81123685
## 85 United States 2002 6.9 77.10 2.02 288470847
## 86 Vietnam 2002 24.6 72.50 1.92 81917488
## 87 United States 2003 6.8 77.30 2.05 291005482
## 88 Vietnam 2003 23.9 72.80 1.91 82683039
## 89 United States 2004 6.9 77.60 2.06 293530886
## 90 Vietnam 2004 23.2 73.00 1.90 83439812
## 91 United States 2005 6.8 77.60 2.06 296139635
## 92 Vietnam 2005 22.6 73.30 1.90 84203817
## 93 United States 2006 6.7 77.80 2.11 298860519
## 94 Vietnam 2006 22.0 73.50 1.89 84979667
## 95 United States 2007 6.6 78.10 2.12 301655953
## 96 Vietnam 2007 21.4 73.80 1.88 85770717
## 97 United States 2008 6.5 78.30 2.07 304473143
## 98 Vietnam 2008 20.8 74.10 1.86 86589342
## 99 United States 2009 6.4 78.50 2.00 307231961
## 100 Vietnam 2009 20.3 74.30 1.84 87449021
## 101 United States 2010 6.3 78.80 1.93 309876170
## 102 Vietnam 2010 19.8 74.50 1.82 88357775
## gdp continent region
## 1 2.479391e+12 Americas Northern America
## 2 NA Asia South-Eastern Asia
## 3 2.536417e+12 Americas Northern America
## 4 NA Asia South-Eastern Asia
## 5 2.691139e+12 Americas Northern America
## 6 NA Asia South-Eastern Asia
## 7 2.809549e+12 Americas Northern America
## 8 NA Asia South-Eastern Asia
## 9 2.972502e+12 Americas Northern America
## 10 NA Asia South-Eastern Asia
## 11 3.162743e+12 Americas Northern America
## 12 NA Asia South-Eastern Asia
## 13 3.368321e+12 Americas Northern America
## 14 NA Asia South-Eastern Asia
## 15 3.452529e+12 Americas Northern America
## 16 NA Asia South-Eastern Asia
## 17 3.618250e+12 Americas Northern America
## 18 NA Asia South-Eastern Asia
## 19 3.730416e+12 Americas Northern America
## 20 NA Asia South-Eastern Asia
## 21 3.737877e+12 Americas Northern America
## 22 NA Asia South-Eastern Asia
## 23 3.867133e+12 Americas Northern America
## 24 NA Asia South-Eastern Asia
## 25 4.080668e+12 Americas Northern America
## 26 NA Asia South-Eastern Asia
## 27 4.321881e+12 Americas Northern America
## 28 NA Asia South-Eastern Asia
## 29 4.299437e+12 Americas Northern America
## 30 NA Asia South-Eastern Asia
## 31 4.291009e+12 Americas Northern America
## 32 NA Asia South-Eastern Asia
## 33 4.523528e+12 Americas Northern America
## 34 NA Asia South-Eastern Asia
## 35 4.733337e+12 Americas Northern America
## 36 NA Asia South-Eastern Asia
## 37 4.999656e+12 Americas Northern America
## 38 NA Asia South-Eastern Asia
## 39 5.157035e+12 Americas Northern America
## 40 NA Asia South-Eastern Asia
## 41 5.142220e+12 Americas Northern America
## 42 NA Asia South-Eastern Asia
## 43 5.272896e+12 Americas Northern America
## 44 NA Asia South-Eastern Asia
## 45 5.168479e+12 Americas Northern America
## 46 NA Asia South-Eastern Asia
## 47 5.401886e+12 Americas Northern America
## 48 NA Asia South-Eastern Asia
## 49 5.790542e+12 Americas Northern America
## 50 1.145347e+10 Asia South-Eastern Asia
## 51 6.028651e+12 Americas Northern America
## 52 1.188938e+10 Asia South-Eastern Asia
## 53 6.235265e+12 Americas Northern America
## 54 1.222101e+10 Asia South-Eastern Asia
## 55 6.432743e+12 Americas Northern America
## 56 1.265894e+10 Asia South-Eastern Asia
## 57 6.696490e+12 Americas Northern America
## 58 1.330898e+10 Asia South-Eastern Asia
## 59 6.935219e+12 Americas Northern America
## 60 1.428912e+10 Asia South-Eastern Asia
## 61 7.063943e+12 Americas Northern America
## 62 1.501800e+10 Asia South-Eastern Asia
## 63 7.045491e+12 Americas Northern America
## 64 1.591320e+10 Asia South-Eastern Asia
## 65 7.285373e+12 Americas Northern America
## 66 1.728906e+10 Asia South-Eastern Asia
## 67 7.494650e+12 Americas Northern America
## 68 1.868476e+10 Asia South-Eastern Asia
## 69 7.803020e+12 Americas Northern America
## 70 2.033630e+10 Asia South-Eastern Asia
## 71 8.001917e+12 Americas Northern America
## 72 2.227648e+10 Asia South-Eastern Asia
## 73 8.304875e+12 Americas Northern America
## 74 2.435711e+10 Asia South-Eastern Asia
## 75 8.679071e+12 Americas Northern America
## 76 2.634272e+10 Asia South-Eastern Asia
## 77 9.061073e+12 Americas Northern America
## 78 2.786124e+10 Asia South-Eastern Asia
## 79 9.502248e+12 Americas Northern America
## 80 2.919122e+10 Asia South-Eastern Asia
## 81 9.898800e+12 Americas Northern America
## 82 3.117252e+10 Asia South-Eastern Asia
## 83 1.000703e+13 Americas Northern America
## 84 3.332183e+10 Asia South-Eastern Asia
## 85 1.018996e+13 Americas Northern America
## 86 3.568108e+10 Asia South-Eastern Asia
## 87 1.045007e+13 Americas Northern America
## 88 3.830049e+10 Asia South-Eastern Asia
## 89 1.081371e+13 Americas Northern America
## 90 4.128394e+10 Asia South-Eastern Asia
## 91 1.114630e+13 Americas Northern America
## 92 4.476905e+10 Asia South-Eastern Asia
## 93 1.144269e+13 Americas Northern America
## 94 4.845303e+10 Asia South-Eastern Asia
## 95 1.166093e+13 Americas Northern America
## 96 5.255039e+10 Asia South-Eastern Asia
## 97 1.161905e+13 Americas Northern America
## 98 5.586668e+10 Asia South-Eastern Asia
## 99 1.120919e+13 Americas Northern America
## 100 5.884079e+10 Asia South-Eastern Asia
## 101 1.154791e+13 Americas Northern America
## 102 6.283222e+10 Asia South-Eastern Asia
Now that you have created the data table in Exercise 4, it is time to plot the data for the two countries.
Instruction
- Use geom_line to plot life expectancy vs year for Vietnam and the United States and save the plot as p. The data table is stored in tab.
p <- tab %>% ggplot(aes(year, life_expectancy, color = country)) +
geom_line()
p
Cambodia was also involved in this conflict and, after the war, Pol Pot and his communist Khmer Rouge took control and ruled Cambodia from 1975 to 1979. He is considered one of the most brutal dictators in history. Do the data support this claim?
Instruction
Use a single line of code to create a time series plot from 1960 to 2010 of life expectancy vs year for Cambodia.
gapminder %>% filter(country %in% "Cambodia" & year %in% c(1960:2010)) %>%
ggplot(aes(year, life_expectancy)) +
geom_line()
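The filters in these exercises use `%in%` even when matching a single value. A minimal base R sketch of how `%in%` differs from `==` when a value is missing; the vector `years` is made up for illustration and is not part of gapminder:

```r
# Hypothetical vector with a missing value, for illustration only
years <- c(2010, NA, 1970)

# %in% always returns TRUE or FALSE, never NA
in_result <- years %in% 2010

# == propagates the missing value as NA
eq_result <- years == 2010

print(in_result)
print(eq_result)
```

Inside dplyr's `filter`, rows where the condition evaluates to NA are dropped, so both forms keep the same rows there. The difference matters mainly when the logical vector is used elsewhere.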
Now we are going to calculate and plot dollars per day for African countries in 2010 using GDP data.
In the first part of this analysis, we will create the dollars per day variable.
Instructions
daydollars <- gapminder %>% mutate(dollars_per_day = gdp / population / 365) %>%
filter(continent %in% "Africa" & year %in% 2010 & !is.na(gdp))
head(daydollars)
## country year infant_mortality life_expectancy fertility population
## 1 Algeria 2010 23.5 76.0 2.82 36036159
## 2 Angola 2010 109.6 57.6 6.22 21219954
## 3 Benin 2010 71.0 60.8 5.10 9509798
## 4 Botswana 2010 39.8 55.6 2.76 2047831
## 5 Burkina Faso 2010 69.7 59.0 5.87 15632066
## 6 Burundi 2010 63.8 60.4 6.30 9461117
## gdp continent region dollars_per_day
## 1 79164339611 Africa Northern Africa 6.0186382
## 2 26125663270 Africa Middle Africa 3.3731063
## 3 3336801340 Africa Western Africa 0.9613161
## 4 8408166868 Africa Southern Africa 11.2490111
## 5 4655655008 Africa Western Africa 0.8159650
## 6 1158914103 Africa Eastern Africa 0.3355954
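As a quick check of the dollars per day calculation, the arithmetic can be reproduced by hand for one row. This sketch uses base R only, with the GDP and population values copied from the Algeria row printed above:

```r
# Values copied from the first row of the output above (Algeria, 2010)
gdp <- 79164339611
population <- 36036159

# GDP per person per year, then per day (every year treated as 365 days)
gdp_per_person <- gdp / population
dollars_per_day <- gdp_per_person / 365

print(round(gdp_per_person))
print(round(dollars_per_day, 2))
```

The result should agree with the `dollars_per_day` column shown for Algeria, which suggests the `mutate` call is doing what we expect.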
Now we are going to calculate and plot dollars per day for African countries in 2010 using GDP data.
In the second part of this analysis, we will plot the smooth density plot using a log (base 2) x axis.
Instructions
daydollars %>% ggplot(aes(dollars_per_day)) +
scale_x_continuous(trans = "log2") +
geom_density()
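A log (base 2) axis places values so that each equal step corresponds to a doubling. A small base R sketch of that idea, using made-up dollar amounts rather than gapminder values:

```r
# Hypothetical incomes in dollars per day, each double the previous one
incomes <- c(0.5, 1, 2, 4, 8, 16)

# On the log2 scale, the doubling steps become evenly spaced
log_incomes <- log2(incomes)
print(log_incomes)

# The gap between consecutive values is constant
print(diff(log_incomes))
```

This is why, on the transformed axis, countries at 1 and 2 dollars per day appear as far apart as countries at 8 and 16 dollars per day.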
Now we are going to combine the plotting tools we have used in the past two exercises to create density plots for multiple years.
Instructions
daydollars <- gapminder %>% mutate(dollars_per_day = gdp / population / 365) %>%
filter(continent %in% "Africa" & year %in% c(1970,2010) & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) +
scale_x_continuous(trans = "log2") +
geom_density() +
facet_grid(year ~ .)
daydollars
Now we are going to edit the code from Exercise 9 to show a stacked density plot of each region in Africa.
Instructions
daydollars <- gapminder %>% mutate(dollars_per_day = gdp / population / 365) %>%
filter(continent %in% "Africa" & year %in% c(1970,2010) & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day, fill = region)) +
scale_x_continuous(trans = "log2") +
geom_density(bw = 0.5, position = "stack") +
facet_grid(year ~ .)
daydollars
We are going to continue looking at patterns in the gapminder dataset by plotting infant mortality rates versus dollars per day for African countries.
Instructions
gapminder_Africa_2010 <- gapminder %>% mutate(dollars_per_day = gdp / population / 365) %>%
filter(continent %in% "Africa" & year %in% c(2010) & !is.na(gdp))
# now make the scatter plot
gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region)) +
geom_point()
Now we are going to transform the x axis of the plot from the previous exercise.
Instructions
gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region)) +
geom_point() +
scale_x_continuous(trans='log2')
Note that there is a large variation in infant mortality and dollars per day among African countries.
As an example, one country has an infant mortality rate of less than 20 per 1000 and dollars per day of 16, while another country has an infant mortality rate of over 100 per 1000 (10%) and dollars per day of about 1.
In this exercise, we will remake the plot from Exercise 12 with country names instead of points so we can identify which countries are which.
Instructions
gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region, label = country)) +
geom_point() +
scale_x_continuous(trans='log2') +
geom_text()
Now we are going to look at changes in the infant mortality and dollars per day patterns of African countries between 1970 and 2010.
Instructions
gapminder %>%
  filter(continent %in% "Africa" & year %in% c(1970, 2010) & !is.na(gdp) & !is.na(year) & !is.na(infant_mortality)) %>%
  mutate(dollars_per_day = gdp / population / 365) %>%
  ggplot(aes(dollars_per_day, infant_mortality, color = region, label = country)) +
  geom_point() +
  scale_x_continuous(trans = "log2") +
  geom_text() +
  facet_grid(year ~ .)
Section 5 covers some general principles that can serve as guides for effective data visualization.
After completing Section 5, you will:
There are 3 assignments that use the DataCamp platform for you to practice your coding skills. There is also 1 assignment on the edX platform to allow you to practice exploratory data analysis.
We encourage you to use R to interactively test out your answers and further your learning.
RAFAEL IRIZARRY: We have already provided some rules to follow as we created plots for our examples. Here we aim to provide some general principles we can use as guidelines for effective data visualization.
Much of this part of the course is based on a talk by Karl Broman entitled “Creating Effective Figures and Tables” and on class notes by Peter Aldhous titled “Introduction to Data Visualization.” In many of our examples, we follow Karl’s approach. We show some examples of plot styles we should avoid, explain how to improve them, and then use these as motivation for a list of principles. We compare and contrast plots that follow these principles to those that don’t.
The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brain processes visual information.
When deciding on a visualization approach it is also important to keep our goal in mind. We may be comparing a viewable number of quantities, describing distributions for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables.
As a final note, for a data scientist it is important to adapt and optimize graphs to the audience. For example, an exploratory plot made for ourselves will be different from a chart intended to communicate a finding to a general audience.
This video corresponds to the textbook chapter introduction on data visualization principles.
Key points
This video corresponds to the textbook case study on vaccines. Information on color palettes can be found in the textbook section on encoding a third variable.
Key points
# import data and inspect
library(tidyverse)
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame': 16065 obs. of 6 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 1966 1967 1968 1969 1970 ...
## $ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
## $ count : num 321 291 314 380 413 378 342 467 244 286 ...
## $ population : num 3345787 3364130 3386068 3412450 3444165 ...
# assign dat to the per 10,000 rate of measles, removing Alaska and Hawaii and adjusting for weeks reporting
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
mutate(rate = count / population * 10000 * 52/weeks_reporting) %>%
mutate(state = reorder(state, rate))
# plot disease rates per year in California
dat %>% filter(state == "California" & !is.na(rate)) %>%
ggplot(aes(year, rate)) +
geom_line() +
ylab("Cases per 10,000") +
geom_vline(xintercept=1963, col = "blue")
# tile plot of disease rate by state and year
dat %>% ggplot(aes(year, state, fill=rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand = c(0,0)) +
scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept = 1963, col = "blue") +
theme_minimal() + theme(panel.grid = element_blank()) +
ggtitle(the_disease) +
ylab("") +
xlab("")
# compute US average measles rate by year
avg <- us_contagious_diseases %>%
filter(disease == the_disease) %>% group_by(year) %>%
summarize(us_rate = sum(count, na.rm = TRUE)/sum(population, na.rm = TRUE)*10000)
# make line plot of measles rate by year by state
dat %>%
filter(!is.na(rate)) %>%
ggplot() +
geom_line(aes(year, rate, group = state), color = "magenta",
show.legend = FALSE, alpha = 0.2, size = 1) +
geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "red") +
scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
ggtitle("Cases per 10,000 by state") +
xlab("") +
ylab("") +
geom_text(data = data.frame(x = 1955, y = 50),
mapping = aes(x, y, label = "US average"), color = "black") +
geom_vline(xintercept = 1963, col = "blue")
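The `rate` computed above rescales each count to a full 52-week year before converting it to cases per 10,000 people. A minimal sketch of that arithmetic in base R, using the first record shown in the `str()` output (a Hepatitis A record for Alabama in 1966, used here only to illustrate the formula, not as a measles value):

```r
# First record in the str() output: Alabama, Hepatitis A, 1966
count <- 321
population <- 3345787
weeks_reporting <- 50

# Unadjusted rate per 10,000 people
raw_rate <- count / population * 10000

# Scale up as if all 52 weeks had been reported
rate <- raw_rate * 52 / weeks_reporting

print(raw_rate)
print(rate)
```

Because only 50 of 52 weeks were reported, the adjusted rate comes out slightly higher than the unadjusted one. The adjustment assumes the unreported weeks had cases at the same average pace as the reported weeks.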