section4

Case Study: Trends in World Health and Economics

RAFAEL IRIZARRY: In this section, we will demonstrate how relatively simple ggplot and dplyr code can create insightful and aesthetically pleasing plots that help us better understand trends in world health and economics.
We will use many of the techniques we have learned about data visualization, exploratory data analysis, and summarization.
We later augment the code somewhat to perfect the plots, and describe some general principles to guide data visualization. We’re going to be using data from Gapminder. Hans Rosling was the co-founder of the Gapminder Foundation, an organization dedicated to educating the public by using data to dispel common myths about the so-called developing world.
The organization uses data to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events. As stated on the Gapminder Foundation’s website, “Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in people’s minds into an overdramatic worldview and strong negative stress feelings. The world is getting worse. It’s we versus them. Other people are strange. The population just keeps growing. And nobody cares.”
Hans Rosling conveyed actual data-based trends, in a dramatic way of his own, using effective data visualization. This section is based on two talks that exemplify this approach to education: “New Insights on Poverty” and “The Best Stats You’ve Ever Seen.”
Specifically, in this section, we set out to answer the following two questions. First, is it a fair characterization of today’s world to say that it is divided into Western rich nations and the developing world in Africa, Asia, and Latin America? Second, has income inequality across countries worsened during the last 40 years? We’re going to use data and our code to answer these questions.
This video corresponds to the textbook section introducing the case study on new insights on poverty.

More about Gapminder

The original Gapminder TED talks are available and we encourage you to watch them.

You can also find more information and raw data (in addition to what we analyze in class) at https://www.gapminder.org/.
Key points

Data visualization can be used to dispel common myths, educate the public, and contradict sensationalist or outdated claims and stories.
We will use real data to answer the following questions about world health and economics:
Is it still fair to consider the world as divided into the West and the developing world?
Has income inequality across countries worsened over the last 40 years?

Gapminder Dataset

RAFAEL IRIZARRY: To learn about world health and economics, we will be using the Gapminder data set provided in the dslabs library. This dataset was put together for you, and it was created using a number of spreadsheets available from the Gapminder Foundation.

You can access the table using this code. We load the dslabs package, then we type data(gapminder), and we can see that the data includes country, year, and several health and economic outcomes.
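The loading step described above is not shown in these notes, so here is a minimal sketch of it. It assumes the dslabs package is already installed; the calls to head() and str() are just one way to inspect the table and are not taken from the video.

```r
# Load the dslabs package, which ships the gapminder table
library(dslabs)
data(gapminder)

# Look at the first rows and at the column names and types
head(gapminder)
str(gapminder)
```

The columns used in the rest of this section (country, year, infant_mortality, life_expectancy, fertility, population, gdp, continent, and region) should all appear in this output.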

As done in the New Insights on Poverty video, we start by testing our knowledge regarding differences in child mortality across different countries, using a quiz created by Hans Rosling.

So here’s a quiz. The pairs of countries are Sri Lanka or Turkey, Poland or South Korea, Malaysia or Russia, Pakistan or Vietnam, and Thailand or South Africa. For each pair, which country do you think had the higher child mortality in 2015? And also, which pairs do you think are most similar? When answering these questions without data, the non-European countries are typically picked as having higher mortality rates: Sri Lanka over Turkey, South Korea over Poland, and Malaysia over Russia. It is also common to assume that countries considered to be part of the developing world, Pakistan, Vietnam, Thailand, and South Africa, have similarly high mortality rates.

Now let’s answer these questions with data. For example, for this first comparison, we can write this simple dplyr code to see that Turkey has a higher mortality rate than Sri Lanka. We can use the same code to answer each of the five questions, and we see that Sri Lanka has a lower mortality rate than Turkey, South Korea has a lower mortality rate than Poland, Malaysia has a lower mortality rate than Russia, and Pakistan is very different from Vietnam, and South Africa is very different from Thailand.

From here, we see that in these comparisons, the European countries have higher rates. We also see that the countries from the developing world can have very different rates. It turns out that most people do worse than if they were just guessing, which implies that we’re more than ignorant, we’re misinformed.

This video corresponds to the textbook section introducing the case study on new insights on poverty.

Key points

A selection of world health and economics statistics from the Gapminder project can be found in the dslabs package as data(gapminder).
Most people have misconceptions about world health and economics, which can be addressed by considering real data.

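Code:

# load the packages and data used throughout this section
library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

# compare infant mortality in Sri Lanka and Turkey
gapminder_filter <- gapminder %>%
    filter(year == 2015 & country %in% c("Sri Lanka", "Turkey")) %>%
    select(country, infant_mortality)
gapminder_filter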

##     country infant_mortality
## 1 Sri Lanka              8.4
## 2    Turkey             11.6

Comparison table
country	infant_mortality
Sri Lanka	8.4
Turkey	11.6
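The transcript says the same code answers all five quiz questions. The sketch below is one possible way to run it for every pair at once; the names quiz_pairs, quiz_countries, and quiz_2015 are made up for this illustration, and the country spellings are assumed to match those in the gapminder table.

```r
library(dplyr)
library(dslabs)
data(gapminder)

# The five quiz pairs named in the transcript
quiz_pairs <- list(
    c("Sri Lanka", "Turkey"),
    c("Poland", "South Korea"),
    c("Malaysia", "Russia"),
    c("Pakistan", "Vietnam"),
    c("Thailand", "South Africa")
)

# One filter covers every country that appears in any pair
quiz_countries <- unlist(quiz_pairs)
quiz_2015 <- gapminder %>%
    filter(year == 2015 & country %in% quiz_countries) %>%
    select(country, infant_mortality)

# Print the two rows for each pair so they can be compared directly
for (pair in quiz_pairs) {
    print(filter(quiz_2015, country %in% pair))
}
```

Keeping the pairs in a list makes it easy to print them pair by pair instead of reading them out of one ten-row table.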

Life Expectancy and Fertility Rates

RAFAEL IRIZARRY: Our misconceptions stem from the preconceived notion that the world is divided into two groups, the Western World, composed of Western Europe and North America, which is characterized by long lifespans and small families versus the developing world, Africa, Asia, and Latin America, characterized by short lifespans and large families.

But does the data support this dichotomous view of the world? The necessary data to answer this question is also available in our gapminder table. Using our newly-learned data visualization skills, we will be able to answer this question.

The first plot we make to see what data have to say about this worldview is a scatterplot of life expectancy versus fertility rates. Fertility rates are defined as the average number of children per woman. We will start by looking at data from about 50 years ago when, perhaps, this worldview was cemented in our minds.

We just type the simple code and we see this plot. Note that most points do, in fact, fall into two distinct categories, one with life expectancies around 70 years and three or fewer children per family and the other with life expectancies lower than 65 years and with more than five children per family.

Now, to confirm that indeed these countries are from the regions we expect, we can use color to represent continent. So we change the code slightly by adding the color argument, assigning continent to it. Because continent is a categorical variable (a factor in this table), ggplot will automatically assign a color to each continent.

Here’s the plot. So indeed, in 1962, the West versus developing worldview was grounded in some reality, but is this still the case 50 years later? To answer this question, we’re going to learn about faceting.

This video corresponds to the textbook section on Gapminder scatterplots.
Key points

A prevalent worldview is that the world is divided into two groups of countries:
Western world: high life expectancy, low fertility rate
Developing world: lower life expectancy, higher fertility rate
Gapminder data can be used to evaluate the validity of this view.
A scatterplot of life expectancy versus fertility rate in 1962 suggests that this viewpoint was grounded in reality 50 years ago. Is it still the case today?
Code

# basic scatterplot of life expectancy versus fertility
ds_theme_set()    # set plot theme
filter(gapminder, year == 1962) %>%
    ggplot(aes(fertility, life_expectancy)) +
    geom_point()

# add color as continent
filter(gapminder, year == 1962) %>%
    ggplot(aes(fertility, life_expectancy, color = continent)) +
    geom_point()

Faceting

RAFAEL IRIZARRY: We could easily plot the 2012 data in the same way we did for 1962. But for comparison, side by side plots are preferable. In ggplot, we can achieve this by faceting variables. We stratify the data by some variable and make the same plot for each stratum.

Here we are faceting by the year. To achieve this, we use the function facet_grid. This is added as a layer which automatically separates the plots. The function lets you facet by up to two variables using columns to represent one variable and rows to represent the other. The function expects the row and column variables separated by a tilde.

Here’s an example. We’re going to facet by continent and year. So continent will be in the rows, and year will be in the columns. Here is the plot. We can see how the data has been stratified. We have 1962 on the left, 2012 on the right, and the 5 continents in each row.

# facet by continent (rows) and year (columns)
filter(gapminder, year %in% c(1962, 2012)) %>%
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_grid(continent ~ year)

However, this is just an example and more than what we want, which is simply to compare 1962 and 2012. In this case, there’s just one variable. So what we do is we use the dot to let the facet function know that we’re not using two variables but just one. The code looks like this.

# facet by year only
filter(gapminder, year %in% c(1962, 2012)) %>%
    ggplot(aes(fertility, life_expectancy, col = continent)) +
    geom_point() +
    facet_grid(. ~ year)

We simply type facet_grid(. ~ year). The dot means we’re not using a variable for the rows, and the tilde followed by year tells it to make two columns, 1962 and 2012. And here is the plot.

After we split the plot like this, it clearly shows that the majority of countries have moved from the developing world cluster to the Western world one. They went from having large families and short lifespans to having smaller families and longer lifespans. In 2012, the Western versus developing world view no longer makes sense. This is particularly clear when we compare Europe to Asia. Asia includes several countries that have made great improvements in the last 40 to 50 years.

To explore how this transformation happened through the years, we can make the plot for several years. For example, we can add 1980, 1990, and 2000 to the plot. If we do this, we will not want all the plots on the same row, which is the default behavior of facet_grid: the plots would become too thin, and we wouldn’t be able to see the data.

Instead, we might want to have the plots across different rows and columns. For this, we can use the facet_wrap function. It automatically wraps the series of plots so that each display has viewable dimensions. So the code looks like this.

# facet by year, plots wrapped onto multiple rows
years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder %>%
    filter(year %in% years & continent %in% continents) %>%
    ggplot(aes(fertility, life_expectancy, col = continent)) +
    geom_point() +
    facet_wrap(~year)

It’s very similar. We’re adding some years. And then at the end, we facet_wrap instead of facet_grid. And now, the plot looks like this.

Now, we’re only showing Asia and Europe, but the plot clearly shows us how the Asian countries have made great improvements throughout the years.

Now, note that the default choice for the range of the axes is an important one. When not using facet, this range is determined by the data shown in the plot. When using facet, the range is determined by the data shown in all plots. And therefore, it’s kept fixed across the plots. This makes comparisons across plots much easier. For example, in the plot we just saw, the life expectancy has increased, and the fertility has decreased across most countries.

We see this because the cloud of points moves up and to the left. This is not the case if we adjust the scales to each year separately. The plot looks like this. In this case, we have to pay special attention to the range to notice that the plot on the right has larger life expectancy. Therefore, by keeping the scales the same, we were able to quickly see how many of the countries outside of the Western world have improved during the last 40 to 50 years.
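The code for the version with separately adjusted scales is not included in these notes. A minimal sketch, assuming the same two-year comparison as before, uses the scales argument of facet_wrap; the object name p_free is only for illustration.

```r
library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

# Same scatterplot as before, but each panel picks its own x and y ranges
p_free <- filter(gapminder, year %in% c(1962, 2012)) %>%
    ggplot(aes(fertility, life_expectancy, col = continent)) +
    geom_point() +
    facet_wrap(~year, scales = "free")
p_free
```

With scales = "free", the two panels no longer share axes, so the shift of the point cloud up and to the left is only visible by reading the tick labels.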

This video corresponds to the textbook section on faceting.
Key points

Faceting makes multiple side-by-side plots stratified by some variable. This is a way to ease comparisons.
The facet_grid() function allows faceting by up to two variables, with rows faceted by one variable and columns faceted by the other variable. To facet by only one variable, use the dot operator as the other variable.
The facet_wrap() function facets by one variable and automatically wraps the series of plots so they have readable dimensions.
Faceting keeps the axes fixed across all plots, easing comparisons between plots.
The data suggest that the developing versus Western world view no longer makes sense in 2012.

Time Series Plots

The visualizations we have just seen effectively illustrate that the data no longer support the Western versus developing worldview. But once we see these plots, new questions emerge. For example, which countries are improving more? Which ones are improving less? Was the improvement constant during the last 50 years, or was there more of an acceleration during a specific period? For a closer look that may help answer these questions, we introduce time series plots.

Time series plots have time on the x-axis, and an outcome, or measurement of interest, on the y-axis. For example, here’s a trend plot for the United States fertility rate. We can get this plot by simply using the geom_point layer.

# scatterplot of US fertility by year
gapminder %>%
    filter(country == "United States") %>%
    ggplot(aes(year, fertility)) +
    geom_point()

## Warning: Removed 1 rows containing missing values (geom_point).

When we look at this plot, we immediately see that the trend is not linear at all. Instead, we see a sharp drop during the 60s and 70s to below 2. Then, the trend comes back up to 2, and stabilizes there in the 1990s. When the points are regularly spaced and densely packed as they are here, we can create curves by joining points with lines. This conveys that these data are from a single country. To do this, we use the geom_line function instead of geom_point. We write the code like this, and now the curve looks like this.

# line plot of US fertility by year
gapminder %>%
    filter(country == "United States") %>%
    ggplot(aes(year, fertility)) +
    geom_line()

## Warning: Removed 1 row(s) containing missing values (geom_path).

This is particularly helpful when we look at two or more countries. Let’s look at an example. Let’s subset the data to include two countries, one from Europe and one from Asia. So we copy the code above and change the filter:

# line plot fertility time series for two countries- only one line (incorrect)
countries <- c("South Korea", "Germany")
gapminder %>% filter(country %in% countries) %>%
    ggplot(aes(year, fertility)) +
    geom_line()

## Warning: Removed 2 row(s) containing missing values (geom_path).

We get this plot, but note that this is not what we want. Rather than a line for each country, this code has produced a single line that goes through the points for both countries, so they’re joined together. This is actually expected, since we have not told ggplot anything about wanting two separate lines. To let ggplot know that there are two curves that need to be made separately, we assign each point to a group, one for each country.

# line plot fertility time series for two countries - one line per country
gapminder %>% filter(country %in% countries) %>%
    ggplot(aes(year, fertility, group = country)) +
    geom_line()

## Warning: Removed 2 row(s) containing missing values (geom_path).

We do this through the mapping. We assign country to the group argument. The plot now looks like this. We can see the two lines, one for each country. However, we don’t know which line goes with which country. To see this, we can use color to distinguish the two countries. A useful side effect of mapping color to country is that ggplot automatically groups the data by the color value. So the code is very simple, it looks like this.

# fertility time series for two countries - lines colored by country
gapminder %>% filter(country %in% countries) %>%
    ggplot(aes(year, fertility, col = country)) +
    geom_line()

## Warning: Removed 2 row(s) containing missing values (geom_path).

And once we type this, then we get two lines, each with a color. And a legend has been added by default. Note that this plot clearly shows how South Korea’s fertility rate dropped drastically during the 60s and 70s. And by 1990, it had a similar fertility rate to Germany.

For time series plots, we actually recommend labeling the curves rather than using legends as we did in the previous plot. This suggestion actually applies to most plots. Labeling is usually preferred over legends. However, legends are easier to make and appear by default in many of ggplot’s functions.

We are going to show an example of how to add labels to a time series plot. We demonstrate how we can do this using the life expectancy data. We define a data table with the label locations. And then we use a second mapping just for the labels. The code looks like this.

Notice that we define a data frame with the locations of where we want the labels. We pick these by eye. And then you can see in the geom_text, we are using the labels data frame as the data, so that those labels are put in those positions. Then we have to tell the plot not to add a legend through the theme function. And now the plot looks like this.

# life expectancy time series - lines colored by country and labeled, no legend
labels <- data.frame(country = countries, x = c(1970, 1965), y = c(55, 72))
gapminder %>% filter(country %in% countries) %>%
    ggplot(aes(year, life_expectancy, col = country)) +
    geom_line() +
    geom_text(data = labels, aes(x, y, label = country), size = 5) +
    theme(legend.position = "none")

This is the life expectancy plot. It shows how an improvement in life expectancy followed the drops in fertility rates. While in 1960 Germans lived over 15 years longer on average than South Koreans, by 2010 the gap had completely closed.

Another commonly held notion is that wealth distribution across the world has become worse during the last decades. When general audiences are asked if poor countries have become poorer and rich countries have become richer, the majority answer yes. By using histograms, smooth densities, and box plots, we will be able to understand if this is in fact the case.
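The numbers behind the 1960 and 2010 comparison can be read straight from the table rather than off the plot. This is a small sketch under the assumption that both countries have life expectancy recorded for those two years; gap_check is a name invented here.

```r
library(dplyr)
library(dslabs)
data(gapminder)

countries <- c("South Korea", "Germany")

# Life expectancy for both countries at the start and end of the period
gap_check <- gapminder %>%
    filter(country %in% countries & year %in% c(1960, 2010)) %>%
    select(country, year, life_expectancy) %>%
    arrange(year, country)
gap_check
```

Subtracting the two values within each year gives the gap described in the transcript.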

Transformations

In this video, we cover transformations. Transformations can be very useful to better understand distributions. As an example, in this video, we look at income.

library(knitr)    # provides kable()
gapminder_gdp <- gapminder %>%
    filter(!is.na(gdp)) %>%
    select(!c(region, infant_mortality, life_expectancy))
kable(tail(gapminder_gdp), caption = "GDP table")

GDP table
	country	year	fertility	population	gdp	continent
7568	Vanuatu	2011	3.46	241876	386483180	Oceania
7569	Venezuela	2011	2.44	29427631	166062245436	Americas
7570	Vietnam	2011	1.79	89321903	66530108958	Asia
7571	Yemen	2011	4.35	24234940	13104223693	Asia
7572	Zambia	2011	5.77	14343526	5917195991	Africa
7573	Zimbabwe	2011	3.64	14255592	4407438807	Africa

The Gapminder data table includes a column with the country’s gross domestic product, the GDP. GDP measures the market value of goods and services produced by a country in a given year. The GDP per person is often used as a rough summary of how rich a country is. Here we divide this quantity by 365 to obtain the more interpretable measure dollars per day.

Using current US dollars as a unit, a person surviving on an income of less than $2 a day, for example, is defined to be living in absolute poverty. So we’re going to add this variable to our data table. It’s the dollars per day variable: GDP divided by population divided by 365. Before we continue, note that the GDP values in our data table are adjusted for inflation and represent current US dollars. So these values are meant to be comparable across the years. Also note that these are country averages and that within each country, there’s much variability.

OK, so let’s move on to examining distributions. Here’s a histogram of per day incomes from 1970.

# add dollars per day variable
gapminder <- gapminder %>%
    mutate(dollars_per_day = gdp/population/365)

# histogram of dollars per day
past_year <- 1970
gapminder %>%
    filter(year == past_year & !is.na(gdp)) %>%
    ggplot(aes(dollars_per_day)) +
    geom_histogram(binwidth = 1, color = "gold")

You can obtain it with this simple code which you’ve already learned. We see that for the majority of countries, averages are below $10 a day. However, the majority of the x-axis is dedicated to the 35 countries with averages above 10.
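The split around $10 a day can be checked by counting countries on each side of that mark. This is a sketch only: it builds a separate table, income_1970, rather than modifying gapminder, and both that name and income_counts are made up for the example.

```r
library(dplyr)
library(dslabs)
data(gapminder)

past_year <- 1970

# Average income per person per day, dropping countries with no value
income_1970 <- gapminder %>%
    mutate(dollars_per_day = gdp / population / 365) %>%
    filter(year == past_year & !is.na(dollars_per_day))

# Count countries on each side of the $10-a-day mark
income_counts <- income_1970 %>%
    summarize(below_10 = sum(dollars_per_day < 10),
              at_or_above_10 = sum(dollars_per_day >= 10))
income_counts
```

The two counts show how few countries occupy the long right-hand stretch of the x-axis.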

It might be more informative to quickly be able to see how many countries make on average about
- $1 a day, extremely poor
- $2 a day, very poor
- $4 a day, poor
- $8 a day, which is about middle
- $16 a day, which is a well-off country
- $32 a day, which is rich
- $64 a day, which is very rich

These changes are multiplicative. And here we introduce log transformations. Log transformations change multiplicative changes into additive ones. Using base 2, for example, means that every time a value doubles, the log transformation increases by one. So to get the distribution of the log base 2 transformed values, we simply transform the data and use the same code. And now we obtain this histogram.
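As a quick aside before the histogram code, the doubling property can be checked in base R using the income levels listed above. The variable names are illustrative only.

```r
# Income levels that double at each step, in dollars per day
income <- c(1, 2, 4, 8, 16, 32, 64)

# On the log2 scale each doubling becomes a step of exactly one
log_income <- log2(income)
log_income         # 0 1 2 3 4 5 6
diff(log_income)   # 1 1 1 1 1 1
```

This is why a bin width of 1 on the log2 scale lines up with the "extremely poor" to "very rich" categories.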

# repeat histogram with log2 scaled data
gapminder %>%
    filter(year == past_year & !is.na(gdp)) %>%
    ggplot(aes(log2(dollars_per_day))) +
    geom_histogram(binwidth = 1, color = "black")

In this plot, we see something new. We see two clear bumps. Before we continue interpreting the data, let’s introduce some commonly used statistical language. In statistics, these bumps are sometimes referred to as modes. The mode of a distribution is the value with the highest frequency. The mode of a normal distribution is the average. But if the mode is the value with the highest frequency, how can we have more than one? When a distribution like the one we just saw doesn’t monotonically decrease from the mode, we call the locations where it goes up and down again local modes. And we say that the distribution has multiple modes.

The histogram we just saw suggests that in 1970, the country income distribution had two modes: one at about $2 per day (1 on the log2 scale) and another at about $32 per day (5 on the log2 scale). This bimodality is consistent with a dichotomous world made up of countries with average incomes less than $8 per day (3 on the log2 scale) and countries above that.

Now before we continue interpreting the data, we need to pause again to explain how we choose the base. In the histogram we just saw, we chose base 2. Other common choices are the natural log and base 10. In general, we do not recommend using the natural log for data exploration and visualization. Why is this? It’s because we can quickly compute 2 to the 2, 2 to the 3, and 2 to the 4 in our minds, and 10 to the 1, 10 to the 2, and 10 to the 3 are also very easy to compute, but it’s not easy to compute e to the 2, e to the 3, et cetera.

In the dollars per day example, we use base 2 instead of base 10 because the resulting range is easier to interpret. The range of the values being plotted started from about 0.3 and ended around 50. In base 10, this turns into a range that includes very few integers, just 0 and 1. With base 2, our range includes negative 2, negative 1, 0, 1, 2, 3, 4, and 5. Note that it is easier to compute 2 to the x and 10 to the x when x is an integer. So we prefer to have more integers in the transformed scale. Another consequence of a limited range is that choosing the bin width is more challenging. With log base 2, we know that a bin width of 1 will translate to bins with range x to 2x.

As an example in which base 10 makes more sense than base 2, consider population size.

gapminder %>% 
  filter(year == past_year) %>%
  ggplot(aes(log10(population))) +
  geom_histogram(binwidth = 0.5, color = "black")

Using log base 10 makes more sense here since the range for these data goes from 45,000 to about 800 million. Here’s a histogram if we transform the values with the log base 10.

Looking at the scale knowing that we’re in base 10, we can quickly determine that country population ranges from about 40,000 to about a billion.

Now let’s talk about log transformations and how we use them in the plots. There are two ways we can use log transformations in plots. We can log the values before plotting them, or we can use log scales on the axes. Both approaches are useful and have different strengths. If we log the data, we can more easily interpret intermediate values on the scale. For example, if a point x sits halfway between 1 and 2 on an axis showing log-transformed values, we know that x is about 1.5. If instead the axis is log scaled and x sits halfway between 1 and 10, then we don’t know immediately what x is, because it’s 10 to the 1.5, not an easy thing to compute in our heads. However, the advantage of using log scales is that we see the original values displayed on the axis, which makes it very easy to quickly see what numbers we’re actually dealing with. For example, we see $32 a day instead of 5 log base 2 dollars a day.

Now let’s review how we make plots where the scales have been log transformed. We already learned the scale_x_continuous function. So we want to remake the histograms that we already made, but now using scales that have been transformed. We simply add a layer using the scale_x_continuous function. And we no longer transform the data before plotting it. So the code would look like this, and the plot would look like this.

# repeat histogram with log2 scaled x-axis
gapminder %>%
    filter(year == past_year & !is.na(gdp)) %>%
    ggplot(aes(dollars_per_day)) +
    geom_histogram(binwidth = 1, color = "black") +
    scale_x_continuous(trans = "log2")

Notice that the histogram looks exactly the same. The difference is that on the x-axis, instead of seeing the log values, we see the original values on a log scale. So we see 1, 8, and 64. And we can very quickly interpret what that means in terms of dollars per day.
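Two of the arithmetic points from this video can be verified in base R: that a bin width of 1 on the log2 scale gives bins running from x to 2x, and that reading 1.5 off a base 10 axis means computing 10 to the 1.5. The edge values chosen below are an assumption for illustration, not the exact bin edges ggplot uses.

```r
# Integer bin edges on the log2 scale (bin width of 1)
log2_edges <- -2:6

# The same edges expressed in dollars per day
dollar_edges <- 2^log2_edges
dollar_edges   # 0.25 0.5 1 2 4 8 16 32 64

# Ratio of each bin's upper edge to its lower edge: every bin spans x to 2x
edge_ratio <- dollar_edges[-1] / dollar_edges[-length(dollar_edges)]
edge_ratio     # all equal to 2

# A value halfway between 1 and 2 on a log10 axis, in original units
10^1.5         # about 31.6
```

The last line is the kind of calculation that a log scaled axis saves the reader from doing mentally.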

Stratify and Boxplot

RAFAEL IRIZARRY: The histogram showed us that the income distribution values show a dichotomy. However, the histogram does not show us if the two groups of countries are the West versus the developing world. To see distributions by geographical region, we first stratify the data into regions, and then examine the distribution for each. The number of regions is large: in this case it’s 22, as we can see by typing this command in R.

# add dollars per day variable
gapminder <- gapminder %>%
    mutate(dollars_per_day = gdp/population/365)
    
# number of regions
length(levels(gapminder$region))

## [1] 22

Because the number of regions is large, looking at histograms or smooth densities for each will not be useful. Instead, we can stack box plots next to each other. We’ve learned how to use geom_boxplot before, so we write this code. When we do this, we get this plot.

# boxplot of GDP by region in 1970
past_year <- 1970
p <- gapminder %>%
    filter(year == past_year & !is.na(gdp)) %>%
    ggplot(aes(region, dollars_per_day))
p + geom_boxplot()

Now, note that we can’t read the region names because the default ggplot behavior is to write the labels horizontally, and here we run out of room. We can easily fix this by rotating the labels. Consulting the documentation, we find that we can rotate the names by changing the theme through element_text. The hjust = 1 argument justifies the text so that it’s next to the axis. So we add the following layer to our graph: theme(axis.text.x = element_text(angle = 90, hjust = 1)). The angle = 90 part rotates the labels. When we do this, we get this plot.

# rotate names on x-axis
p + geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now, we can read the names. We can already see that there is indeed a West versus the rest dichotomy. If we look closely at the box plots that are high, we see that they’re North America, Northern Europe, Australia and New Zealand, and Western Europe. There are a few more adjustments we can make to this plot to help uncover this reality and relay this message.

First, it helps to order the regions in some order that is not alphabetical. Ordering alphabetically is completely arbitrary. We can order by something meaningful. The function that’s going to help us achieve this is the reorder function. This function lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector.

Before we continue with our example, let’s understand how the reorder function works using a simpler one. Let’s define a factor based on a vector with five entries: Asia, Asia, West, West, West. If we turn this vector into a factor, the levels of this factor are ordered alphabetically.

# by default, factor order is alphabetical
fac <- factor(c("Asia", "Asia", "West", "West", "West"))
levels(fac)

## [1] "Asia" "West"

This is the default in R. So Asia is the first level and West is the second level. But suppose that each element of the original vector is associated with a value. Here we’re just defining one arbitrarily: 10, 11, 12, 6, 4. Let’s suppose that we want to order the levels based on the mean value of these numbers. In this case, the West has a lower mean, the mean of 12, 6, and 4, compared to the mean of Asia, which is the mean of 10 and 11. So we use the reorder function like this, passing fac, which is our factor, together with the values and the summary function.

# reorder factor by the category means
value <- c(10, 11, 12, 6, 4)
fac <- reorder(fac, value, FUN = mean)
levels(fac)

## [1] "West" "Asia"

Here value holds the five values, and mean is the function used to summarize them. We can see that the new factor has its levels ordered differently. Now West is the first one. Why? Because it has a smaller mean of the value vector. All right. Let’s get back to our example. In our example, we have regions. These are the different parts of a continent. We also have continents. And then we have divided the world into West versus the rest. So we have three different ways of dividing the data. The first thing we’re going to do to improve our plot is to simply reorder the regions by their median income level. To achieve this, we write the same code as before, but we add a mutate that changes region to a new factor where the levels are reordered. If we do this, we get the following plot.

# reorder by median income and color by continent
p <- gapminder %>%
    filter(year == past_year & !is.na(gdp)) %>%
    mutate(region = reorder(region, dollars_per_day, FUN = median)) %>%    # reorder
    ggplot(aes(region, dollars_per_day, fill = continent)) +    # color by continent
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("")
p

Now we can see that the box plots are ordered by their median value. And we very quickly see that there are four box plots that stand out at the end, the four highest ones. These are Western Europe, Australia and New Zealand, Northern Europe, and North America. This is what we define as the West. Now there’s another change we made to the plot to help convey this message, and that’s that we use color to show another variable. We use color to show continent. Remember, regions are parts of continents. To add color to define the different continents, we use the fill argument in the aesthetic mappings of ggplot. We simply say fill = continent. And now each continent gets its own color. This helps us see that, for example, the blue box plots are towards the right because these are the European countries. We also see that the red box plots are to the left. These are the countries in the African continent. The last change we can make to this plot to help us see the data a little bit better is to change the scale to the log scale. We want to change it to the log2 scale in this case, so we add the layer scale_y_continuous, and we use the log2 transformation.

# log2 scale y-axis
p + scale_y_continuous(trans = "log2")

And now what this does, is it helps us see the differences between the countries with the lower income. For example, we see a difference now between the African continent, which is in red, and Asia, which is in green. All right. The last change we can make to this plot to make it tell the story a little better to give us even more information, is to show the data. In many cases, we don’t show the data, the actual individual points, because it adds too much clutter to the plot and it obfuscates the message. But in this particular example, we don’t have that many points. So we can add a layer of points by simply adding the geom point layer. It’s very simple.

# add data points
p + scale_y_continuous(trans = "log2") + geom_point(show.legend = FALSE)

We just add that layer and now we get this plot. And we can see the individual points. You can decide if you show this or not. But now we can see exactly where every single country lies.

This video corresponds to the textbook section on comparing multiple distributions with boxplots https://rafalab.github.io/dsbook/gapminder.html#comparing-multiple-distributions-with-boxplots-and-ridge-plots. Note that many boxplots from the video are instead dot plots in the textbook and that a different boxplot is constructed in the textbook. Also read that section to see an example of grouping factors with the case_when function.

Key points

Make boxplots stratified by a categorical variable using the geom_boxplot() geometry.
Rotate axis labels by changing the theme through element_text(). You can change the angle and justification of the text labels.
Consider ordering your factors by a meaningful value with the reorder() function, which changes the order of factor levels based on a related numeric vector. This is a way to ease comparisons.
Show the data by adding data points to the boxplot with a geom_point() layer. This adds information beyond the five-number summary to your plot, but too many data points can obfuscate your message.
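
The reorder() behavior described in these key points can be checked on the toy factor from the video. This is a minimal sketch in base R; the names group_means and fac_by_mean are illustrative and are not part of the course code.

```r
# toy factor and values from the video
fac <- factor(c("Asia", "Asia", "West", "West", "West"))
value <- c(10, 11, 12, 6, 4)

# mean of value within each level (levels are alphabetical by default)
group_means <- tapply(value, fac, mean)
group_means

# reorder() sorts the levels by increasing group summary, here the mean
fac_by_mean <- reorder(fac, value, FUN = mean)
levels(fac_by_mean)
```

Printing group_means next to the new levels makes it easy to confirm that the level with the smaller mean comes first.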

Comparing Distributions

This video corresponds to the textbook section on 1970 versus 2010 income distributions. Note that the boxplots are slightly different: the group variable in those plots was defined in section 10.7.1.

Key points

Use intersect() to find the overlap between two vectors.
To make boxplots where grouped variables are adjacent, color the boxplot by a factor instead of faceting by that factor. This is a way to ease comparisons.
The data suggest that the income gap between rich and poor countries has narrowed, not expanded.
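
As a small illustration of intersect(), here is a sketch with two made-up country vectors. The vectors countries_1970 and countries_2010 are hypothetical and much shorter than the real lists built from gapminder.

```r
# hypothetical country vectors for two years (not the real gapminder lists)
countries_1970 <- c("Brazil", "China", "France", "Kenya")
countries_2010 <- c("China", "France", "Kenya", "Ukraine")

# intersect() keeps the elements present in both vectors
both_years <- intersect(countries_1970, countries_2010)
both_years

# setdiff() shows which countries are dropped from each year
only_1970 <- setdiff(countries_1970, countries_2010)
only_2010 <- setdiff(countries_2010, countries_1970)
only_1970
only_2010
```

Checking the setdiff() results is a quick way to see which countries are lost when the comparison is restricted to the overlap.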

The exploratory data analysis we have conducted has revealed two characteristics about average income distributions in 1970. Using a histogram, we found a bimodal distribution with the modes relating to poor and rich countries. Then, by stratifying by region and examining box plots, we found that the rich countries were mostly in Europe and Northern America, along with Australia and New Zealand, and that the poor countries were mostly in the rest of the world. So we are going to define a vector that lists the regions in the West. We simply define a vector like this.

# define Western countries
west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

Now we want to focus on comparing the differences in distribution across time. We start by confirming that the bi-modality observed in 1970 is explained by a west versus developing world economy. We do this by creating a histogram for the groups previously defined.

# facet by West vs developing
gapminder %>%
    filter(year == past_year & !is.na(gdp)) %>%
    mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
    ggplot(aes(dollars_per_day)) +
    geom_histogram(binwidth = 1, color = "black") +
    scale_x_continuous(trans = "log2") +
    facet_grid(. ~ group)

Note that we create the two groups with an if else inside a mutate. And that if we then use facet grid to make histograms for each group using this code, we see this histogram.

And we immediately see that the countries in the West have higher incomes. The histogram is shifted to the right. Countries in the developing world are shifted towards the left. Now we’re ready to see if the separation is worse today than it was 40 years ago. We do this by now faceting by both group and year. So it’s the same code, but now we’re looking at two years, 1970 and 2010. And we end the code with a facet grid by year and group.

# add dollars per day variable and define past and present years
gapminder <- gapminder %>%
    mutate(dollars_per_day = gdp/population/365)
past_year <- 1970
present_year <- 2010

# facet by West/developing and year
gapminder %>%
    filter(year %in% c(past_year, present_year) & !is.na(gdp)) %>%
    mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
    ggplot(aes(dollars_per_day)) +
    geom_histogram(binwidth = 1, color = "black") +
    scale_x_continuous(trans = "log2") +
    facet_grid(year ~ group)

Now we can see the histogram again for the four different groups. When we look at this figure, we can see that the developing world has shifted to the right more than the West. Meaning that it has gotten closer. The income distribution of the developing countries has gotten closer to those from the west.

Before we interpret the findings of this plot further, we note that there are more countries represented in the 2010 histograms than in the 1970 ones. The total counts are larger. One reason for this is that several countries were founded after 1970. For example, the Soviet Union turned into several countries, including Russia and Ukraine, during the ’90s. Another reason is that data is available for more countries in 2010 compared to 1970. So we’re going to remake the plots, but using only countries with data available for both years. We’re going to use this very simple code. We define a vector with the list of countries for 1970 and a vector with the list for 2010. Here notice we use the dot that we explained earlier to get the character vector out of the dplyr command. And then we take the intersection using the intersect function. There’s actually a better way of doing this using the tidyverse tools, but we haven’t learned those yet. So we use this simple piece of code.

# define countries that have data available in both years
country_list_1 <- gapminder %>%
    filter(year == past_year & !is.na(dollars_per_day)) %>% .$country
country_list_2 <- gapminder %>%
    filter(year == present_year & !is.na(dollars_per_day)) %>% .$country
country_list <- intersect(country_list_1, country_list_2)

So now there are 108 countries in this list. It accounts for 86% of the total population. So this subset should be representative of the entire world. Let’s make the plot again, but this time using only the subset of countries for which data is present in both 1970 and 2010. We’re going to use the condition country %in% country_list inside the filter function to do this.

# make histogram including only countries with data available in both years
gapminder %>%
    filter(year %in% c(past_year, present_year) & country %in% country_list) %>%    # keep only selected countries
    mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
    ggplot(aes(dollars_per_day)) +
    geom_histogram(binwidth = 1, color = "black") +
    scale_x_continuous(trans = "log2") +
    facet_grid(year ~ group)

Now we get this plot. We now see that while the rich countries have become a bit richer percentage wise, the poorer countries appear to have improved more. The histogram has shifted more to the right than for the rich countries. In particular, we see that the proportion of developing countries earning more than $16 a day increases substantially.

To see which specific regions improve the most, we can remake the box plots that we made earlier, but now adding 2010. Here it is. We use the same code. We use facet grid to divide into 2010 and 1970. And we can see which countries have gone up more.

# boxplots of income by region, faceted by year
p <- gapminder %>%
    filter(year %in% c(past_year, present_year) & country %in% country_list) %>%
    mutate(region = reorder(region, dollars_per_day, FUN = median)) %>%
    ggplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("") + scale_y_continuous(trans = "log2")

p + geom_boxplot(aes(region, dollars_per_day, fill = continent)) +
    facet_grid(year ~ .)

Now these box plots are a little bit hard to compare, because we’re trying to compare box plots that are on top of each other. It’s helpful to put them next to each other. So we’re going to learn how to ease the comparisons. To do this, we pause to introduce another powerful ggplot feature. Because we want to compare each region before and after, it would be convenient to have the 1970 box plot next to the 2010 box plot. In general, comparisons are easier when data are plotted next to each other. So instead of faceting, we keep the data from each year together, but ask ggplot to color or fill the box plot depending on the year. ggplot automatically separates them and puts the two box plots next to each other. This is very convenient. Because year is a number, we turn it into a factor so that each year is a category. This is because ggplot automatically assigns a color to each level of a factor if we assign that factor to the color or fill argument. So if we type this command now, adding fill = factor(year), we get this plot.

p + geom_boxplot(aes(region, dollars_per_day, fill = factor(year)))

And we can see which countries have improved the most. Look at Eastern Asia, for example, how it went from way down around 8 all the way up almost to 64. And finally we point out that if what we are most interested in is comparing before and after values, it might make more sense to plot the ratios, or differences in the log scale. We’re still not ready to learn the code that achieves this, but here’s what the plot would look like. This is actually showing a box plot of the log ratios from 2010 compared to 1970 for each country. And we can see again, Eastern Asia has the biggest improvement.

Density Plots

This video corresponds to the following sections:

Key points

Change the y-axis of density plots to variable counts using ..count.. as the y argument.
The case_when() function defines a factor whose levels are defined by a variety of logical operations to group data.
Plot stacked density plots using position = "stack".
Define a weight aesthetic mapping to change the relative weights of density plots - for example, this allows weighting of plots by population rather than number of countries.
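
To see how case_when() assigns groups, here is a minimal sketch on a made-up five-row data frame. The data frame toy is hypothetical; the region names and the west vector follow the ones used in this section. The sketch assumes a reasonably recent dplyr, where case_when() inside mutate() can refer to columns directly, so the .$ prefix is omitted.

```r
library(dplyr)

# hypothetical mini data frame using region names that appear in gapminder
toy <- data.frame(
  region = c("Western Europe", "Eastern Asia", "Caribbean",
             "Western Africa", "Western Asia"),
  continent = c("Europe", "Asia", "Americas", "Africa", "Asia"),
  stringsAsFactors = FALSE
)
west <- c("Western Europe", "Northern Europe", "Southern Europe",
          "Northern America", "Australia and New Zealand")

# conditions are checked in order; the first one that is TRUE sets the value
toy <- toy %>%
  mutate(group = case_when(
    region %in% west ~ "West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
toy$group
```

The final TRUE condition acts as a catch-all, which is why Western Asia ends up in Others.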

RAFAEL IRIZARRY: We have used data exploration to discover that the income gap between rich and poor countries has closed considerably during the last forty years.

We used a series of histograms and box plots to see this. Here, we suggest a succinct way to convey this message with just one plot. We will use smooth density plots. Let’s start by noting that the density plots for income distribution in 1970 and 2010 deliver the message that the gap is closing.

In the 1970 plot, we see two clear modes, poor and rich. In 2010, it appears that some of the poorer countries have shifted towards the right, closing the gap. The next message we need to convey is that the reason for this change in distribution is that poor countries became richer rather than some rich countries becoming poorer.

To do this, all we need to do is assign a color to the groups we identified during the data exploration. However, before we can do this, we need to learn how to make these smooth densities in a way that preserves information of how many countries are in each group.

To understand why we need to do this, note the discrepancy in the size of each group. If we divide the world into developing and West, we have 87 developing countries and 21 Western countries.

# count the number of countries in each group
gapminder %>%
    filter(year == past_year & country %in% country_list) %>%
    mutate(group = ifelse(region %in% west, "West", "Developing")) %>% group_by(group) %>%
    summarize(n = n()) %>% knitr::kable()

group	n
Developing	87
West	21

If we overlay the two densities, the default is to have the area represented by each distribution add up to 1 regardless of the size of each group. This makes it seem like there’s the same number of countries in each group, which is incorrect. To change this, we’ll need to learn to access computed variables with the geom_density function. To have the areas of the densities be proportional to the size of the groups, we can simply multiply the y-axis values by the size of the group. From the geom_density help file, we see that the function computes a variable called count that does exactly this. We want this variable to be on the y-axis rather than the density value. In ggplot, we can access these variables by surrounding their names with two dots on each side. So we will use the following mapping: aes(x = dollars_per_day, y = ..count..). This will put count on the y-axis. We can now create the desired plot by simply changing the mapping in the previous code chunk.

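# put the computed count variable on the y-axis so areas reflect group sizes
# (newer ggplot2 versions also accept y = after_stat(count))
p <- gapminder %>%
    filter(year %in% c(past_year, present_year) & country %in% country_list) %>%    # both years, so the facets show the change
    mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
    ggplot(aes(dollars_per_day, y = ..count.., fill = group)) +
    scale_x_continuous(trans = "log2")
p + geom_density(alpha = 0.2, bw = 0.75) + facet_grid(year ~ .)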

This produces a plot like this.

Notice that now we can clearly see that the developing world has more countries. If you want the densities to be smoother, because the curve for the Western countries is quite jagged, you can change the bw argument, as we learned earlier. We tried a few values and decided on 0.75. You can try a few yourself. Here’s what it looks like with 0.75. This plot now shows what is happening very clearly. The developing world distribution is changing. A third mode appears consisting of the countries that most closed the gap. We can actually make this figure somewhat more informative. From the exploratory data analysis, we noticed that many of the countries that most improved were from Asia. We can easily alter the plot to show key regions separately. To do this, we introduce a new function called case_when. It’s useful for defining groups. It currently does not have a data argument. This might change. But because it doesn’t, we need to access the components of our data using the dot placeholder. So the code looks like this.

# add group as a factor, grouping regions
gapminder <- gapminder %>%
    mutate(group = case_when(
            .$region %in% west ~ "West",
            .$region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
            .$region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
            .$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
            TRUE ~ "Others"))

Look at what we’re doing. We’re assigning groups depending on the region. If the region is in the West, we call it West. If the region is Eastern Asia or South-Eastern Asia, we call it East Asia. If the region is the Caribbean, Central America, or South America, we call it Latin America. If the continent is Africa and the region is not Northern Africa, we call it Sub-Saharan Africa. And the rest we just call Others. Now we turn this group variable into a factor to control the order of the levels. We do it like this. We picked this particular order for a reason that becomes clearer later when we make the plots.

# reorder factor levels
gapminder <- gapminder %>%
    mutate(group = factor(group, levels = c("Others", "Latin America", "East Asia", "Sub-Saharan Africa", "West")))

Now we can easily plot the density for each one. We use color and size to clearly see the top. Here’s what the two look like in 1970 and 2010. The plot is a little bit cluttered and hard to read, so we’re going to use a stacking approach to make the picture clearer. Here’s how we do it. We use the argument position = "stack". And now the density plots are stacked on top of each other.

# note you must redefine p with the new gapminder object first
p <- gapminder %>%
    filter(year %in% c(past_year, present_year) & country %in% country_list) %>%
    ggplot(aes(dollars_per_day, fill = group)) +
    scale_x_continuous(trans = "log2")

# stacked density plot
p + geom_density(alpha = 0.2, bw = 0.75, position = "stack") +
    facet_grid(year ~ .)

Here we can see clearly that the distributions for East Asia, Latin America, and Others shift markedly to the right while Sub-Saharan Africa remains stagnant. Note that we order the levels of the groups so that the West density is plotted first, and then Sub-Saharan Africa. This helps us see this pattern. As a final point, we note that these distributions weigh every country the same. So if most of the population is improving but living in a very large country such as China, we might not appreciate this. We can actually weigh the smooth densities using the weight mapping argument.

# weighted stacked density plot
gapminder %>%
    filter(year %in% c(past_year, present_year) & country %in% country_list) %>%
    group_by(year) %>%
    mutate(weight = population/sum(population*2)) %>%
    ungroup() %>%
    ggplot(aes(dollars_per_day, fill = group, weight = weight)) +
    scale_x_continuous(trans = "log2") +
    geom_density(alpha = 0.2, bw = 0.75, position = "stack") + facet_grid(year ~ .)

And if we do that, the plot now looks like this. This particular figure shows very clearly how the income distribution gap is closing with most of the poor countries remaining in Sub-Saharan Africa.
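
The idea of scaling densities by group size can be illustrated outside ggplot with base R. This is only a sketch: the incomes are simulated, the helper area() is hypothetical, and only the group sizes (87 and 21) and the bandwidth (0.75) come from the video. It relies on the count variable being the density multiplied by the number of points, as described in the geom_density help file.

```r
set.seed(1)
# simulated log2 incomes; only the group sizes (87 and 21) come from the text
developing_income <- rnorm(87, mean = 1, sd = 1)
west_income <- rnorm(21, mean = 4, sd = 1)

# trapezoid-rule area under a curve given by x and y
area <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

d_dev <- density(developing_income, bw = 0.75)
d_west <- density(west_income, bw = 0.75)

# each default density integrates to about 1, hiding the size difference
area_density <- c(area(d_dev$x, d_dev$y), area(d_west$x, d_west$y))
area_density

# multiplying by group size gives curves whose areas match the group sizes
area_count <- c(area(d_dev$x, d_dev$y * 87), area(d_west$x, d_west$y * 21))
area_count
```

The second pair of areas is close to 87 and 21, which is why the count-scaled plot makes the developing group look appropriately larger.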

Ecological Fallacy

This video corresponds to the textbook section on the ecological fallacy.

Key points

The breaks argument allows us to set the location of the axis labels and tick marks.
The logistic or logit transformation is defined as $f(p) = \log\frac{p}{1-p}$, or the log of odds. This scale is useful for highlighting differences near 0 or near 1 and converts fold changes into constant increases.
The ecological fallacy is assuming that conclusions made from the average of a group apply to all members of that group.
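
The logit definition above can be checked numerically in base R. This is a small sketch using the three survival rates discussed in the video (90%, 99%, and 99.9%); the variable names are illustrative.

```r
# survival rates discussed in the video
p <- c(0.90, 0.99, 0.999)

# odds: how many children survive for each one that dies
odds <- p / (1 - p)
odds

# the logit is the log of the odds; qlogis() is base R's logit function
logit <- log(odds)
all.equal(logit, qlogis(p))

# roughly tenfold steps in the odds become roughly equal steps on the logit scale
steps <- diff(logit)
steps
```

Both steps come out close to log(10), which is the sense in which fold changes in the odds turn into constant increases.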

RAFAEL IRIZARRY: Throughout this section, we have been comparing regions of the world. We have seen that on average some regions do better than others in health outcomes and economic outcomes. Here, we focus on the importance of describing the variability within the groups. While we do this, we’ll also show you some other ggplot functions as well as a transformation called the logit transformation, which is useful for the data that we’ll be looking at. As an example, we will focus on the relationship between country child survival rates and average income. We start by comparing these quantities across regions. Before we start, we’re going to define a few more regions using the case_when function. We’re going to define the West, Northern Africa, East Asia, Southern Asia, Latin America, Sub-Saharan Africa, and the Pacific Islands.

# define gapminder
library(tidyverse)
library(dslabs)
data(gapminder)

# add additional cases
gapminder <- gapminder %>%
    mutate(group = case_when(
        .$region %in% west ~ "The West",
        .$region %in% "Northern Africa" ~ "Northern Africa",
        .$region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
        .$region == "Southern Asia" ~ "Southern Asia",
        .$region %in% c("Central America", "South America", "Caribbean") ~ "Latin America",
        .$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
        .$region %in% c("Melanesia", "Micronesia", "Polynesia") ~ "Pacific Islands"))

Once we do this, we can compute the quantities that we’re interested in for each region. We’ll compute the average. This shows a dramatic difference.

# define a data frame with group average income and average infant survival rate
surv_income <- gapminder %>%
    filter(year %in% present_year & !is.na(gdp) & !is.na(infant_mortality) & !is.na(group)) %>%
    group_by(group) %>%
    summarize(income = sum(gdp)/sum(population)/365,
              infant_survival_rate = 1 - sum(infant_mortality/1000*population)/sum(population))
surv_income %>% arrange(income)

## # A tibble: 7 x 3
##   group              income infant_survival_rate
##   <chr>               <dbl>                <dbl>
## 1 Sub-Saharan Africa   1.76                0.936
## 2 Southern Asia        2.07                0.952
## 3 Pacific Islands      2.70                0.956
## 4 Northern Africa      4.94                0.970
## 5 Latin America       13.2                 0.983
## 6 East Asia           13.4                 0.985
## 7 The West            77.1                 0.995

While in the West less than 0.5% of children die, in Sub-Saharan Africa the rate is higher than 6%. In fact, the relationship between these two variables is almost perfectly linear. In this plot, we introduce the use of the limit argument, which lets us change the range of the axis. We do it with the following code.

# plot infant survival versus income, with transformed axes
surv_income %>% ggplot(aes(income, infant_survival_rate, label = group, color = group)) +
    scale_x_continuous(trans = "log2", limit = c(0.25, 150)) +
    scale_y_continuous(trans = "logit", limit = c(0.875, .9981),
                                       breaks = c(.85, .90, .95, .99, .995, .998)) +
    geom_label(size = 3, show.legend = FALSE)

We are making the range larger than the data needs because we will later compare the plot we just saw to one with more variability, and we want the ranges to be the same. We also introduce the breaks argument, which lets us set the location of the axis labels.

Finally, we introduce a new transformation, the logistic transformation. The logistic or logit transformation for a proportion or rate p is defined as follows: f of p equals the log of p divided by 1 minus p. When p is a proportion or probability, the quantity that is being logged, p divided by 1 minus p, is called the odds. In our case, p is the proportion of children that survive. The odds tell us how many more children are expected to survive than to die. The log transformation makes this quantity symmetric. If the rates are the same, then the log odds is 0. Fold increases or decreases turn into positive and negative increments respectively.

This scale is useful when we want to highlight differences that are near 0 or near 1. For survival rates, this is important because a survival rate of 90% is unacceptable while a survival rate of 99% is relatively good. We would much prefer a survival rate closer to 99.9%. We want our scale to highlight these differences, and the logit does this. Note that 99.9 divided by 0.1 is about 10 times larger than 99 divided by 1, which is about 10 times larger than 90 divided by 10. By using the log, these fold changes turn into constant increases.

OK, now back to our plot. Based on the plot we showed earlier, do we conclude that a country with a low income is destined to have a low survival rate? Do we conclude that survival rates in Sub-Saharan Africa are all lower than in Southern Asia, which in turn are lower than in the Pacific Islands, and so on? Jumping to this conclusion based on a plot that shows only the averages is referred to as the ecological fallacy.

The almost perfect relationship between survival rates and income is only observed for the averages at the regional level. Once we show the data, we see a somewhat more complicated story. So here is the plot for the averages. And look at what happens once we show you every individual country. Specifically, we see that there is a large amount of variability. We see that countries from the same region can be quite different, and that countries with the same income can have different survival rates. For example, while on average Sub-Saharan Africa had the worst health and economic outcomes, there is wide variability within that group. Note that Mauritius and Botswana are doing much better than Angola and Sierra Leone, with Mauritius comparable to Western countries.
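
The ecological fallacy can be made concrete with a toy example. In this sketch the groups A and B and all survival rates are made up for illustration; they are not gapminder values.

```r
# hypothetical survival rates for two groups of three countries each
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  survival = c(0.90, 0.93, 0.99, 0.95, 0.97, 0.98)
)

# group averages: A is lower than B
avg <- tapply(toy$survival, toy$group, mean)
avg

# yet the best country in A beats every country in B
best_a_beats_b <- max(toy$survival[toy$group == "A"]) >
  max(toy$survival[toy$group == "B"])
best_a_beats_b
```

Concluding from the averages alone that every country in A does worse than every country in B would be wrong here, which is exactly the mistake the ecological fallacy describes.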

Assessment: Exploring the Gapminder Dataset

Exercise 1. Life expectancy vs fertility - part 1

The Gapminder Foundation (www.gapminder.org) is a non-profit organization based in Sweden that promotes global development through the use of statistics that can help reduce misconceptions about global development.
Instruction

Using ggplot and the points layer, create a scatter plot of life expectancy versus fertility for the African continent in 2012.
Remember that you can use the R console to explore the gapminder dataset to figure out the names of the columns in the dataframe.
In this exercise we provide parts of code to get you going. You need to fill out what is missing. But note that going forward, in the next exercises, you will be required to write most of the code.

library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)
## fill out the missing parts in filter and aes
gapminder %>% filter(continent == "Africa" & year == 2012) %>%
  ggplot(aes(fertility, life_expectancy )) +
  geom_point()

Exercise 2. Life expectancy vs fertility - part 2 - coloring your plot

Note that there is quite a bit of variability in life expectancy and fertility with some African countries having very high life expectancies. There also appear to be three clusters in the plot.

Instruction

Remake the plot from the previous exercises but this time use color to distinguish the different regions of Africa to see if this explains the clusters.
Remember that you can explore the gapminder data to see how the regions of Africa are labeled in the data frame!
Use color rather than col inside your ggplot call - while these two forms are equivalent in R, the grader specifically looks for color.

## fill out the missing parts in filter and aes
gapminder %>% filter(continent == "Africa" & year == 2012) %>%
  ggplot(aes(fertility, life_expectancy, color = region )) +
  geom_point()

Exercise 3. Life expectancy vs fertility - part 3 - selecting country and region

While many of the countries in the high life expectancy/low fertility cluster are from Northern Africa, three countries are not.

Instruction

Create a table showing the country and region for the African countries (use select) that in 2012 had fertility rates of 3 or less and life expectancies of at least 70.
Assign your result to a data frame called df.

#Create table
df <- gapminder %>% 
      filter(continent == "Africa" & year == 2012 & fertility <= 3 & life_expectancy >= 70) %>%
      select(country, region)
df

##      country          region
## 1    Algeria Northern Africa
## 2 Cape Verde  Western Africa
## 3      Egypt Northern Africa
## 4      Libya Northern Africa
## 5  Mauritius  Eastern Africa
## 6    Morocco Northern Africa
## 7 Seychelles  Eastern Africa
## 8    Tunisia Northern Africa

Exercise 4. Life expectancy and the Vietnam War - part 1

The Vietnam War lasted from 1955 to 1975. Do the data support war having a negative effect on life expectancy? We will create a time series plot that covers the period from 1960 to 2010 of life expectancy for Vietnam and the United States, using color to distinguish the two countries. In this part, we start the analysis by generating a table.

Instruction

Use filter to create a table with data for the years from 1960 to 2010 in Vietnam and the United States.
Save the table in an object called tab.

tab <- gapminder %>% 
          filter(year %in% c(1960:2010) & country %in% c("Vietnam","United States"))
tab

##           country year infant_mortality life_expectancy fertility population
## 1   United States 1960             25.9           69.91      3.67  186176524
## 2         Vietnam 1960             75.6           58.52      6.35   32670623
## 3   United States 1961             25.4           70.32      3.63  189077076
## 4         Vietnam 1961             72.6           59.17      6.39   33666768
## 5   United States 1962             24.9           70.21      3.48  191860710
## 6         Vietnam 1962             69.9           59.82      6.43   34684164
## 7   United States 1963             24.4           70.04      3.35  194513911
## 8         Vietnam 1963             67.3           60.42      6.45   35722092
## 9   United States 1964             23.8           70.33      3.22  197028908
## 10        Vietnam 1964             61.7           60.95      6.46   36780984
## 11  United States 1965             23.3           70.41      2.93  199403532
## 12        Vietnam 1965             60.7           61.32      6.48   37860014
## 13  United States 1966             22.7           70.43      2.71  201629471
## 14        Vietnam 1966             59.9           61.36      6.49   38959335
## 15  United States 1967             22.0           70.76      2.56  203713082
## 16        Vietnam 1967             59.0           61.06      6.49   40074695
## 17  United States 1968             21.3           70.42      2.47  205687611
## 18        Vietnam 1968             58.2           60.45      6.49   41195833
## 19  United States 1969             20.6           70.66      2.46  207599308
## 20        Vietnam 1969             57.3           59.63      6.49   42309662
## 21  United States 1970             19.9           70.92      2.46  209485807
## 22        Vietnam 1970             56.4           58.78      6.47   43407291
## 23  United States 1971             19.1           71.24      2.27  211357912
## 24        Vietnam 1971             55.5           58.17      6.42   44485910
## 25  United States 1972             18.3           71.34      2.01  213219515
## 26        Vietnam 1972             54.7           58.00      6.35   45549487
## 27  United States 1973             17.5           71.54      1.87  215092900
## 28        Vietnam 1973             53.8           58.35      6.25   46604726
## 29  United States 1974             16.7           72.08      1.83  217001865
## 30        Vietnam 1974             52.8           59.23      6.13   47661770
## 31  United States 1975             16.0           72.68      1.77  218963561
## 32        Vietnam 1975             51.8           60.54      5.97   48729397
## 33  United States 1976             15.2           72.99      1.74  220993166
## 34        Vietnam 1976             50.9           62.07      5.80   49808071
## 35  United States 1977             14.5           73.38      1.78  223090871
## 36        Vietnam 1977             49.8           63.58      5.61   50899504
## 37  United States 1978             13.8           73.58      1.75  225239456
## 38        Vietnam 1978             48.8           64.86      5.42   52015279
## 39  United States 1979             13.2           74.03      1.80  227411604
## 40        Vietnam 1979             47.8           65.84      5.23   53169674
## 41  United States 1980             12.6           73.93      1.82  229588208
## 42        Vietnam 1980             46.8           66.49      5.05   54372518
## 43  United States 1981             12.1           74.36      1.81  231765783
## 44        Vietnam 1981             45.8           66.86      4.87   55627743
## 45  United States 1982             11.7           74.65      1.81  233953874
## 46        Vietnam 1982             44.8           67.10      4.69   56931822
## 47  United States 1983             11.2           74.71      1.78  236161961
## 48        Vietnam 1983             43.9           67.30      4.52   58277391
## 49  United States 1984             10.9           74.81      1.79  238404223
## 50        Vietnam 1984             43.0           67.51      4.36   59653092
## 51  United States 1985             10.6           74.79      1.84  240691557
## 52        Vietnam 1985             42.0           67.77      4.21   61049370
## 53  United States 1986             10.4           74.87      1.84  243032017
## 54        Vietnam 1986             41.0           68.07      4.06   62459557
## 55  United States 1987             10.2           75.01      1.87  245425409
## 56        Vietnam 1987             40.0           68.38      3.93   63881296
## 57  United States 1988             10.0           75.02      1.92  247865202
## 58        Vietnam 1988             38.9           68.68      3.81   65313709
## 59  United States 1989              9.7           75.10      2.00  250340795
## 60        Vietnam 1989             37.7           69.00      3.68   66757401
## 61  United States 1990              9.4           75.40      2.07  252847810
## 62        Vietnam 1990             36.6           69.30      3.56   68209604
## 63  United States 1991              9.1           75.50      2.06  255367160
## 64        Vietnam 1991             35.4           69.60      3.42   69670620
## 65  United States 1992              8.8           75.80      2.04  257908206
## 66        Vietnam 1992             34.3           69.80      3.26   71129537
## 67  United States 1993              8.5           75.70      2.02  260527420
## 68        Vietnam 1993             33.1           70.10      3.07   72558986
## 69  United States 1994              8.2           75.80      2.00  263301323
## 70        Vietnam 1994             32.0           70.30      2.88   73923849
## 71  United States 1995              8.0           75.90      1.98  266275528
## 72        Vietnam 1995             30.9           70.60      2.68   75198975
## 73  United States 1996              7.7           76.30      1.98  269483224
## 74        Vietnam 1996             29.9           70.90      2.48   76375677
## 75  United States 1997              7.5           76.60      1.97  272882865
## 76        Vietnam 1997             28.9           71.10      2.31   77460429
## 77  United States 1998              7.3           76.80      2.00  276354096
## 78        Vietnam 1998             27.9           71.50      2.17   78462888
## 79  United States 1999              7.2           76.90      2.01  279730801
## 80        Vietnam 1999             27.0           71.70      2.06   79399708
## 81  United States 2000              7.1           76.90      2.05  282895741
## 82        Vietnam 2000             26.1           72.00      1.98   80285563
## 83  United States 2001              7.0           76.90      2.03  285796198
## 84        Vietnam 2001             25.3           72.20      1.94   81123685
## 85  United States 2002              6.9           77.10      2.02  288470847
## 86        Vietnam 2002             24.6           72.50      1.92   81917488
## 87  United States 2003              6.8           77.30      2.05  291005482
## 88        Vietnam 2003             23.9           72.80      1.91   82683039
## 89  United States 2004              6.9           77.60      2.06  293530886
## 90        Vietnam 2004             23.2           73.00      1.90   83439812
## 91  United States 2005              6.8           77.60      2.06  296139635
## 92        Vietnam 2005             22.6           73.30      1.90   84203817
## 93  United States 2006              6.7           77.80      2.11  298860519
## 94        Vietnam 2006             22.0           73.50      1.89   84979667
## 95  United States 2007              6.6           78.10      2.12  301655953
## 96        Vietnam 2007             21.4           73.80      1.88   85770717
## 97  United States 2008              6.5           78.30      2.07  304473143
## 98        Vietnam 2008             20.8           74.10      1.86   86589342
## 99  United States 2009              6.4           78.50      2.00  307231961
## 100       Vietnam 2009             20.3           74.30      1.84   87449021
## 101 United States 2010              6.3           78.80      1.93  309876170
## 102       Vietnam 2010             19.8           74.50      1.82   88357775
##              gdp continent             region
## 1   2.479391e+12  Americas   Northern America
## 2             NA      Asia South-Eastern Asia
## 3   2.536417e+12  Americas   Northern America
## 4             NA      Asia South-Eastern Asia
## 5   2.691139e+12  Americas   Northern America
## 6             NA      Asia South-Eastern Asia
## 7   2.809549e+12  Americas   Northern America
## 8             NA      Asia South-Eastern Asia
## 9   2.972502e+12  Americas   Northern America
## 10            NA      Asia South-Eastern Asia
## 11  3.162743e+12  Americas   Northern America
## 12            NA      Asia South-Eastern Asia
## 13  3.368321e+12  Americas   Northern America
## 14            NA      Asia South-Eastern Asia
## 15  3.452529e+12  Americas   Northern America
## 16            NA      Asia South-Eastern Asia
## 17  3.618250e+12  Americas   Northern America
## 18            NA      Asia South-Eastern Asia
## 19  3.730416e+12  Americas   Northern America
## 20            NA      Asia South-Eastern Asia
## 21  3.737877e+12  Americas   Northern America
## 22            NA      Asia South-Eastern Asia
## 23  3.867133e+12  Americas   Northern America
## 24            NA      Asia South-Eastern Asia
## 25  4.080668e+12  Americas   Northern America
## 26            NA      Asia South-Eastern Asia
## 27  4.321881e+12  Americas   Northern America
## 28            NA      Asia South-Eastern Asia
## 29  4.299437e+12  Americas   Northern America
## 30            NA      Asia South-Eastern Asia
## 31  4.291009e+12  Americas   Northern America
## 32            NA      Asia South-Eastern Asia
## 33  4.523528e+12  Americas   Northern America
## 34            NA      Asia South-Eastern Asia
## 35  4.733337e+12  Americas   Northern America
## 36            NA      Asia South-Eastern Asia
## 37  4.999656e+12  Americas   Northern America
## 38            NA      Asia South-Eastern Asia
## 39  5.157035e+12  Americas   Northern America
## 40            NA      Asia South-Eastern Asia
## 41  5.142220e+12  Americas   Northern America
## 42            NA      Asia South-Eastern Asia
## 43  5.272896e+12  Americas   Northern America
## 44            NA      Asia South-Eastern Asia
## 45  5.168479e+12  Americas   Northern America
## 46            NA      Asia South-Eastern Asia
## 47  5.401886e+12  Americas   Northern America
## 48            NA      Asia South-Eastern Asia
## 49  5.790542e+12  Americas   Northern America
## 50  1.145347e+10      Asia South-Eastern Asia
## 51  6.028651e+12  Americas   Northern America
## 52  1.188938e+10      Asia South-Eastern Asia
## 53  6.235265e+12  Americas   Northern America
## 54  1.222101e+10      Asia South-Eastern Asia
## 55  6.432743e+12  Americas   Northern America
## 56  1.265894e+10      Asia South-Eastern Asia
## 57  6.696490e+12  Americas   Northern America
## 58  1.330898e+10      Asia South-Eastern Asia
## 59  6.935219e+12  Americas   Northern America
## 60  1.428912e+10      Asia South-Eastern Asia
## 61  7.063943e+12  Americas   Northern America
## 62  1.501800e+10      Asia South-Eastern Asia
## 63  7.045491e+12  Americas   Northern America
## 64  1.591320e+10      Asia South-Eastern Asia
## 65  7.285373e+12  Americas   Northern America
## 66  1.728906e+10      Asia South-Eastern Asia
## 67  7.494650e+12  Americas   Northern America
## 68  1.868476e+10      Asia South-Eastern Asia
## 69  7.803020e+12  Americas   Northern America
## 70  2.033630e+10      Asia South-Eastern Asia
## 71  8.001917e+12  Americas   Northern America
## 72  2.227648e+10      Asia South-Eastern Asia
## 73  8.304875e+12  Americas   Northern America
## 74  2.435711e+10      Asia South-Eastern Asia
## 75  8.679071e+12  Americas   Northern America
## 76  2.634272e+10      Asia South-Eastern Asia
## 77  9.061073e+12  Americas   Northern America
## 78  2.786124e+10      Asia South-Eastern Asia
## 79  9.502248e+12  Americas   Northern America
## 80  2.919122e+10      Asia South-Eastern Asia
## 81  9.898800e+12  Americas   Northern America
## 82  3.117252e+10      Asia South-Eastern Asia
## 83  1.000703e+13  Americas   Northern America
## 84  3.332183e+10      Asia South-Eastern Asia
## 85  1.018996e+13  Americas   Northern America
## 86  3.568108e+10      Asia South-Eastern Asia
## 87  1.045007e+13  Americas   Northern America
## 88  3.830049e+10      Asia South-Eastern Asia
## 89  1.081371e+13  Americas   Northern America
## 90  4.128394e+10      Asia South-Eastern Asia
## 91  1.114630e+13  Americas   Northern America
## 92  4.476905e+10      Asia South-Eastern Asia
## 93  1.144269e+13  Americas   Northern America
## 94  4.845303e+10      Asia South-Eastern Asia
## 95  1.166093e+13  Americas   Northern America
## 96  5.255039e+10      Asia South-Eastern Asia
## 97  1.161905e+13  Americas   Northern America
## 98  5.586668e+10      Asia South-Eastern Asia
## 99  1.120919e+13  Americas   Northern America
## 100 5.884079e+10      Asia South-Eastern Asia
## 101 1.154791e+13  Americas   Northern America
## 102 6.283222e+10      Asia South-Eastern Asia

Exercise 5. Life expectancy and the Vietnam War - part 2

Now that you have created the data table in Exercise 4, it is time to plot the data for the two countries.

Instruction

Use geom_line to plot life expectancy vs year for Vietnam and the United States and save the plot as p. The data table is stored in tab.
Use color to distinguish the two countries.
Print the object p.

p <- tab %>% ggplot(aes(year, life_expectancy, color = country)) +
              geom_line()
p
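
To relate the plot to the war years, one option is to mark 1975 with a vertical line and to check where Vietnam's life expectancy bottoms out. The following is a minimal sketch on a subset of rows copied from the Exercise 4 table; the names small_tab, vietnam_low and p_war are hypothetical, and the dashed line style is just one choice.

```r
library(dplyr)
library(ggplot2)

# Subset of years from the Exercise 4 table (values copied from that output)
years <- c(1960, 1965, 1970, 1972, 1975, 1980, 1990, 2000, 2010)
small_tab <- data.frame(
  country = rep(c("United States", "Vietnam"), each = length(years)),
  year = rep(years, times = 2),
  life_expectancy = c(
    69.91, 70.41, 70.92, 71.34, 72.68, 73.93, 75.40, 76.90, 78.80,  # United States
    58.52, 61.32, 58.78, 58.00, 60.54, 66.49, 69.30, 72.00, 74.50   # Vietnam
  )
)

# Row with the lowest life expectancy for Vietnam in this subset
vietnam_low <- small_tab %>%
  filter(country == "Vietnam") %>%
  filter(life_expectancy == min(life_expectancy))
vietnam_low

# Same line plot as in the exercise, with the end of the war marked
p_war <- small_tab %>%
  ggplot(aes(year, life_expectancy, color = country)) +
  geom_line() +
  geom_vline(xintercept = 1975, linetype = "dashed")
p_war
```

In this subset, Vietnam's lowest value is 58.00 in 1972, which is also the lowest Vietnamese value in the full table, while the selected United States values rise from one year to the next.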

Exercise 6. Life expectancy in Cambodia

Cambodia was also involved in this conflict and, after the war, Pol Pot and his communist Khmer Rouge took control and ruled Cambodia from 1975 to 1979. He is considered one of the most brutal dictators in history. Do the data support this claim?

Instruction
Use a single line of code to create a time series plot from 1960 to 2010 of life expectancy vs year for Cambodia.

gapminder %>% filter(country %in% "Cambodia" & year %in% c(1960:2010)) %>%
                       ggplot(aes(year, life_expectancy)) +
                       geom_line()

Exercise 7. Dollars per day - part 1

Now we are going to calculate and plot dollars per day for African countries in 2010 using GDP data.

In the first part of this analysis, we will create the dollars per day variable.

Instructions

Use mutate to create a dollars_per_day variable, which is defined as gdp/population/365.
Create the dollars_per_day variable for African countries for the year 2010.
Remove any NA values.
Save the mutated dataset as daydollars.

daydollars <- gapminder %>% mutate(dollars_per_day = gdp / population / 365) %>%
                            filter(continent %in% "Africa" & year %in% 2010 & !is.na(gdp))
head(daydollars)

##        country year infant_mortality life_expectancy fertility population
## 1      Algeria 2010             23.5            76.0      2.82   36036159
## 2       Angola 2010            109.6            57.6      6.22   21219954
## 3        Benin 2010             71.0            60.8      5.10    9509798
## 4     Botswana 2010             39.8            55.6      2.76    2047831
## 5 Burkina Faso 2010             69.7            59.0      5.87   15632066
## 6      Burundi 2010             63.8            60.4      6.30    9461117
##           gdp continent          region dollars_per_day
## 1 79164339611    Africa Northern Africa       6.0186382
## 2 26125663270    Africa   Middle Africa       3.3731063
## 3  3336801340    Africa  Western Africa       0.9613161
## 4  8408166868    Africa Southern Africa      11.2490111
## 5  4655655008    Africa  Western Africa       0.8159650
## 6  1158914103    Africa  Eastern Africa       0.3355954
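
The definition gdp/population/365 can be checked by hand against the first row of the output above. A short check, using Algeria's 2010 values as printed:

```r
# Algeria, 2010: values copied from the head(daydollars) output above
gdp <- 79164339611
population <- 36036159

# GDP per person per year, then per day
gdp_per_person <- gdp / population
dollars_per_day <- gdp_per_person / 365

round(gdp_per_person, 2)   # about 2196.80 dollars per person per year
round(dollars_per_day, 4)  # 6.0186, in line with the dollars_per_day column
```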

Exercise 8. Dollars per day - part 2

Now we are going to calculate and plot dollars per day for African countries in 2010 using GDP data.

In the second part of this analysis, we will plot the smooth density plot using a log (base 2) x axis.

Instructions

The dataset including the dollars_per_day variable is preloaded as daydollars.
Create a smooth density plot of dollars per day from daydollars.
Use scale_x_continuous to change the x-axis to a log (base 2) scale.

daydollars %>% ggplot(aes(dollars_per_day)) +
               scale_x_continuous(trans = "log2") + 
               geom_density()
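
On a log (base 2) axis, equal distances correspond to doublings and not to equal dollar differences. A small base R illustration, with arbitrary example values:

```r
# Each step along a log2 axis is a doubling of the original value
dollars <- c(0.5, 1, 2, 4, 8, 16)
log2(dollars)  # -1 0 1 2 3 4

# The gap between 1 and 2 dollars is the same as the gap between 8 and 16
same_gap <- isTRUE(all.equal(log2(2) - log2(1), log2(16) - log2(8)))
same_gap  # TRUE
```

This is why a log2 axis spreads out the many countries with very low incomes that would otherwise be squeezed together near zero.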

Exercise 9. Dollars per day - part 3 - multiple density plots

Now we are going to combine the plotting tools we have used in the past two exercises to create density plots for multiple years.

Instructions

Create the dollars_per_day variable as in Exercise 7, but for African countries in the years 1970 and 2010 this time.
Make sure you remove any NA values.
Create a smooth density plot of dollars per day for 1970 and 2010 using a log (base 2) scale for the x axis.
Use facet_grid to show a different density plot for 1970 and 2010.

daydollars <- gapminder %>% mutate(dollars_per_day = gdp / population / 365) %>%
                            filter(continent %in% "Africa" & year %in% c(1970,2010) & !is.na(gdp)) %>%
                            ggplot(aes(dollars_per_day)) + 
                            scale_x_continuous(trans = "log2") + 
                            geom_density() + 
                            facet_grid(year ~ .)
daydollars

Exercise 10. Dollars per day - part 4 - stacked density plot

Now we are going to edit the code from Exercise 9 to show a stacked density plot of each region in Africa.

Instructions

Much of the code will be the same as in Exercise 9:
Create the dollars_per_day variable as in Exercise 7, but for African countries in the years 1970 and 2010 this time.
Make sure you remove any NA values.
Create a smooth density plot of dollars per day for 1970 and 2010 using a log (base 2) scale for the x axis.
Use facet_grid to show a different density plot for 1970 and 2010.
Make sure the densities are smooth by using bw = 0.5.
Use the fill and position arguments where appropriate to create the stacked density plot of each region.

daydollars <- gapminder %>% mutate(dollars_per_day = gdp / population / 365) %>%
                            filter(continent %in% "Africa" & year %in% c(1970,2010) & !is.na(gdp)) %>%
                            ggplot(aes(dollars_per_day, fill = region)) + 
                            scale_x_continuous(trans = "log2") + 
                            geom_density(bw = 0.5, position = "stack") + 
                            facet_grid(year ~ .)
daydollars
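
ggplot2 applies scale transformations before computing statistics, so bw = 0.5 should act as a kernel bandwidth measured in log2 units. A rough base R analogue is sketched below. It is not what ggplot2 runs internally and it ignores the stacking by region; it uses only the six dollars_per_day values printed in Exercise 7.

```r
# dollars_per_day for six African countries in 2010 (from the Exercise 7 output)
dollars_per_day <- c(6.0186382, 3.3731063, 0.9613161, 11.2490111, 0.8159650, 0.3355954)

# Gaussian kernel density estimate on the log2 scale with a fixed bandwidth
dens <- density(log2(dollars_per_day), bw = 0.5)
dens$bw  # 0.5

# The estimated curve integrates to roughly 1
area <- sum(dens$y) * diff(dens$x[1:2])
area

# A larger bandwidth gives a smoother, flatter curve with a lower peak
dens_wide <- density(log2(dollars_per_day), bw = 1)
max(dens_wide$y) < max(dens$y)
```

Fixing bw also keeps the smoothing identical across the 1970 and 2010 facets, which makes the two panels easier to compare.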

Exercise 11. Infant mortality scatter plot - part 1

We are going to continue looking at patterns in the gapminder dataset by plotting infant mortality rates versus dollars per day for African countries.

Instructions

Generate dollars_per_day using mutate and filter for the year 2010 for African countries.
Remember to remove NA values.
Store the mutated dataset in gapminder_Africa_2010.
Make a scatter plot of infant_mortality versus dollars_per_day for countries in the African continent.
Use color to denote the different regions of Africa.

gapminder_Africa_2010 <- gapminder %>% mutate(dollars_per_day = gdp / population / 365) %>%
                            filter(continent %in% "Africa" & year %in% c(2010) & !is.na(gdp))

# now make the scatter plot
gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region)) +
geom_point()

Exercise 12. Infant mortality scatter plot - part 2 - logarithmic axis

Now we are going to transform the x axis of the plot from the previous exercise.

Instructions

The mutated dataset is preloaded as gapminder_Africa_2010.
As in the previous exercise, make a scatter plot of infant_mortality versus dollars_per_day for countries in the African continent.
As in the previous exercise, use color to denote the different regions of Africa.
Transform the x axis to be in the log (base 2) scale.

gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region)) +
geom_point() + 
scale_x_continuous(trans='log2')

Exercise 13. Infant mortality scatter plot - part 3 - adding labels

Note that there is a large variation in infant mortality and dollars per day among African countries.

As an example, one country has infant mortality rates of less than 20 per 1000 and dollars per day of 16, while another country has infant mortality rates over 10% and dollars per day of about 1.

In this exercise, we will remake the plot from Exercise 12 with country names instead of points so we can identify which countries are which.

Instructions

The mutated dataset is preloaded as gapminder_Africa_2010.
As in the previous exercise, make a scatter plot of infant_mortality versus dollars_per_day for countries in the African continent.
As in the previous exercise, use color to denote the different regions of Africa.
As in the previous exercise, transform the x axis to be in the log (base 2) scale.
Add a geom_text layer to display country names in addition to points.

gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region, label = country)) +
geom_point() + 
scale_x_continuous(trans='log2') +
geom_text()
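
When points and labels are drawn at the same coordinates, the text sits on top of the points. One possible adjustment is to shift the labels slightly with nudge_y. The sketch below uses the six 2010 rows printed in Exercise 7; the data frame name six_countries and the nudge amount are illustrative choices.

```r
library(ggplot2)

# Six African countries in 2010; values copied from the Exercise 7 output
six_countries <- data.frame(
  country = c("Algeria", "Angola", "Benin", "Botswana", "Burkina Faso", "Burundi"),
  region = c("Northern Africa", "Middle Africa", "Western Africa",
             "Southern Africa", "Western Africa", "Eastern Africa"),
  infant_mortality = c(23.5, 109.6, 71.0, 39.8, 69.7, 63.8),
  dollars_per_day = c(6.0186382, 3.3731063, 0.9613161, 11.2490111, 0.8159650, 0.3355954)
)

# Points plus labels; nudge_y moves each label up by 4 units of infant mortality
# so that it no longer covers its point
p_labels <- ggplot(six_countries,
                   aes(dollars_per_day, infant_mortality,
                       color = region, label = country)) +
  geom_point() +
  scale_x_continuous(trans = "log2") +
  geom_text(nudge_y = 4, show.legend = FALSE)
p_labels
```

With all African countries on the plot, labels will still overlap in crowded areas, so the nudge is a readability aid and not a complete fix.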

Exercise 14. Infant mortality scatter plot - part 4 - comparison of scatter plots

Now we are going to look at changes in the infant mortality and dollars per day patterns for African countries between 1970 and 2010.

Instructions

Generate dollars_per_day using mutate and filter for the years 1970 and 2010 for African countries.
Remember to remove NA values.
As in the previous exercise, make a scatter plot of infant_mortality versus dollars_per_day for countries in the African continent.
As in the previous exercise, use color to denote the different regions of Africa.
As in the previous exercise, transform the x axis to be in the log (base 2) scale.
As in the previous exercise, add a layer to display country names instead of points.
Use facet_grid to show different plots for 1970 and 2010. Align the plots vertically.

# keep African countries in 1970 and 2010 with non-missing values, then compute dollars per day
gapminder %>%
  filter(continent %in% "Africa" & year %in% c(1970, 2010) & !is.na(gdp) & !is.na(year) & !is.na(infant_mortality)) %>%
  mutate(dollars_per_day = gdp / population / 365) %>%
  ggplot(aes(dollars_per_day, infant_mortality, color = region, label = country)) +
  geom_point() + 
  scale_x_continuous(trans = "log2") +
  geom_text() +
  facet_grid(year ~ .)

End of Assessment: Exploring the gapminder dataset

Section 5 Overview

Section 5 covers some general principles that can serve as guides for effective data visualization.

After completing Section 5, you will:

understand basic principles of effective data visualization.
understand the importance of keeping your goal in mind when deciding on a visualization approach.
understand principles for encoding data, including position, aligned lengths, angles, area, brightness, and color hue.
know when to include the number zero in visualizations.
be able to use techniques to ease comparisons, such as using common axes, putting visual cues to be compared adjacent to one another, and using color effectively.

There are 3 assignments that use the DataCamp platform for you to practice your coding skills. There is also 1 assignment on the edX platform to allow you to practice exploratory data analysis.

We encourage you to use R to interactively test out your answers and further your learning.

Introduction to Data Visualization Principles

RAFAEL IRIZARRY: We have already provided some rules to follow as we created plots for our examples. Here we aim to provide some general principles we can use as guidelines for effective data visualization.

Much of this part of the course is based on a talk by Karl Broman entitled “Creating Effective Figures and Tables” and on class notes from Peter Aldhous titled “Introduction to Data Visualization.” In many of our examples, we follow Karl’s approach. We show some examples of plot styles we should avoid, explain how to improve them, and then use these as motivation for a list of principles. We compare and contrast plots that follow these principles to those that don’t.

The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brain processes visual information.

When deciding on a visualization approach it is also important to keep our goal in mind. We may be comparing a viewable number of quantities, describing distributions for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables.

As a final note, for a data scientist it is important to adapt and optimize graphs to the audience. For example, an exploratory plot made for ourselves will be different from a chart intended to communicate a finding to a general audience.

This video corresponds to the textbook chapter introduction on data visualization principles.

Key points

We aim to provide some general guidelines for effective data visualization.
We show examples of plot styles to avoid, discuss how to improve them, and use these examples to explain research-based principles for effective visualization.
When choosing a visualization approach, keep your goal and audience in mind.

Case Study: Vaccines

This video corresponds to the textbook case study on vaccines. Information on color palettes can be found in the textbook section on encoding a third variable.

Key points

Vaccines save millions of lives, but misinformation has led some to question the safety of vaccines. The data support vaccines as safe and effective. We visualize data about measles incidence in order to demonstrate the impact of vaccination programs on disease rate.
The RColorBrewer package offers several color palettes. Sequential color palettes are best suited for data that span from high to low. Diverging color palettes are best suited for data that are centered and diverge towards high or low values.
The geom_tile() geometry creates a grid of colored tiles.
Position and length are stronger cues than color for numeric values, but color can be appropriate sometimes.
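
To see the difference between the two palette families, one can pull a palette of each kind from RColorBrewer. The choice of "Reds" (sequential, also used in the tile plot in this section) and "RdBu" (diverging) is just an example, and the object names seq_pal and div_pal are made up here.

```r
library(RColorBrewer)

# Sequential palette: light to dark in a single hue, suited to low-to-high data
seq_pal <- brewer.pal(9, "Reds")
seq_pal

# Diverging palette: two hues meeting at a neutral middle, suited to data
# centered on a reference value
div_pal <- brewer.pal(11, "RdBu")
div_pal

# Display both families for comparison
display.brewer.all(type = "seq")
display.brewer.all(type = "div")
```

A rate such as measles cases per 10,000 runs from zero upward with no natural midpoint, which is the situation a sequential palette is meant for.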

# import data and inspect
library(tidyverse)
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)

## 'data.frame':    16065 obs. of  6 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year           : num  1966 1967 1968 1969 1970 ...
##  $ weeks_reporting: num  50 49 52 49 51 51 45 45 45 46 ...
##  $ count          : num  321 291 314 380 413 378 342 467 244 286 ...
##  $ population     : num  3345787 3364130 3386068 3412450 3444165 ...

# assign dat to the per 10,000 rate of measles, removing Alaska and Hawaii and adjusting for weeks reporting
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
    filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
    mutate(rate = count / population * 10000 * 52/weeks_reporting) %>%
    mutate(state = reorder(state, rate))

# plot disease rates per year in California
dat %>% filter(state == "California" & !is.na(rate)) %>%
    ggplot(aes(year, rate)) +
    geom_line() +
    ylab("Cases per 10,000") +
    geom_vline(xintercept=1963, col = "blue")

# tile plot of disease rate by state and year
dat %>% ggplot(aes(year, state, fill=rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") +
    geom_vline(xintercept = 1963, col = "blue") +
    theme_minimal() + theme(panel.grid = element_blank()) +
    ggtitle(the_disease) +
    ylab("") +
    xlab("")

# compute US average measles rate by year
avg <- us_contagious_diseases %>%
    filter(disease == the_disease) %>% group_by(year) %>%
    summarize(us_rate = sum(count, na.rm = TRUE)/sum(population, na.rm = TRUE)*10000)

# make line plot of measles rate by year by state
dat %>%
    filter(!is.na(rate)) %>%
    ggplot() +
    geom_line(aes(year, rate, group = state), color = "magenta", 
        show.legend = FALSE, alpha = 0.2, size = 1) +
    geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "red") +
    scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
    ggtitle("Cases per 10,000 by state") +
    xlab("") +
    ylab("") +
    geom_text(data = data.frame(x = 1955, y = 50),
        mapping = aes(x, y, label = "US average"), color = "black") +
    geom_vline(xintercept = 1963, col = "blue")
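
The rate formula in the code above scales counts up to a full 52-week year when a state reported for fewer weeks. As a worked check, here it is applied by hand to the first row shown in the str() output (Alabama, Hepatitis A, 1966, so not a measles row):

```r
# First row of us_contagious_diseases as shown in the str() output above
count <- 321
population <- 3345787
weeks_reporting <- 50

# Cases per 10,000 people, before and after adjusting for weeks reporting
raw_rate <- count / population * 10000
rate <- raw_rate * 52 / weeks_reporting

round(raw_rate, 4)  # 0.9594
round(rate, 4)      # 0.9978
```

Because only 50 of 52 weeks were reported, the adjusted rate is 4% higher than the raw one.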


section4

ariswa

3/28/2020
