PH125.2x: Data Science: Visualization

School: edX, HarvardX
Course Instructor: Rafael Irizarry

Abstract

In this second course of nine in the HarvardX Data Science Professional Certificate, we learn the basics of data visualization and exploratory data analysis.

The growing availability of informative datasets and software tools has led to increased reliance on data visualizations across many industries, academia, and government. Data visualization provides a powerful way to communicate data-driven findings, motivate analyses, or detect flaws.

In this course, you will learn the basics of data visualization and exploratory data analysis. We will use three motivating examples and ggplot2, a data visualization package for the statistical programming language R. To learn the very basics, we will start with a somewhat artificial example: heights reported by students. Then we will explore two case studies, one on world health and economics and one on infectious disease trends in the United States.

It is also important to note that mistakes, biases, systematic errors, and other unexpected problems often lead to data that should be handled with care. The fact that it can be difficult or impossible to notice an error just from the reported results makes data visualization particularly important. This course will explore how failure to discover these problems often leads to flawed analyses and false discoveries.

Learning Objectives:

data visualization principles to better communicate data-driven findings
how to use ggplot2 to create custom plots
the weaknesses of several widely used plots and why you should avoid them

Course Outline:

Section 1: Introduction to Data Visualization and Distributions. You will get started with data visualization and distributions in R.

Section 2: Introduction to ggplot2. You will learn how to use ggplot2 to create plots.

Section 3: Summarizing with dplyr. You will learn how to summarize data using dplyr.

Section 4: Gapminder. You will see examples of ggplot2 and dplyr in action with the Gapminder dataset.

Section 5: Data Visualization Principles. You will learn general principles to guide you in developing effective data visualizations.

Section 1: Introduction to Data Visualization and Distributions

Section 1 introduces you to Data Visualization and Distributions.

After completing Section 1, you will:

understand the importance of data visualization for communicating data-driven findings.
be able to use distributions to summarize data.
be able to use the average and the standard deviation to understand the normal distribution.
be able to assess how well a normal distribution fits the data using a quantile-quantile plot.
be able to interpret data from a boxplot.

1.1 Introduction to Data Visualization

A picture is worth a thousand words.

Visualization is the strongest tool of exploratory data analysis (EDA).

"The greatest value of a picture is when it forces us to notice what we never expected to see." (John Tukey)

library(dslabs)

head(murders)

##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

1.1.1 Introduction to Distributions

Sometimes the average and the standard deviation are all we need to know about the results.

At other times, however, we need more.

1.1.2 Data Types

categorical
- ordinal
  - spiciness: mild, medium, hot
- non-ordinal
  - sex: male, female
numeric
- discrete
  - population (has to be a whole number)
- continuous
  - heights
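
A minimal sketch of how these four data types might be represented in R. The vectors are made-up illustration values, not course data, and the variable names are assumptions.

```r
# Hypothetical example values (not taken from the course datasets)
spiciness <- factor(c("mild", "hot", "medium", "mild"),
                    levels = c("mild", "medium", "hot"),
                    ordered = TRUE)            # categorical, ordinal
sex <- factor(c("Male", "Female", "Female"))   # categorical, non-ordinal
population <- c(710231L, 4779736L)             # numeric, discrete (integer)
height <- c(68.5, 70.25, 61.0)                 # numeric, continuous (double)

is.ordered(spiciness)         # the levels have a meaningful order
is.ordered(sex)               # no order among the levels
spiciness[1] < spiciness[2]   # comparisons are allowed for ordered factors
class(population)
class(height)
```

Declaring the level order explicitly matters: without `levels`, R would sort the labels alphabetically.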

EX 1: Exercise 1. Variable names

The type of data we are working with will often influence the data visualization technique we use. We will be working with two types of variables: categorical and numeric. Each can be divided into two other groups: categorical can be ordinal or not, whereas numerical variables can be discrete or continuous.

We will review data types using some of the examples provided in the dslabs package. For example, the heights dataset.

library(dslabs)
data(heights)

names(heights)

## [1] "sex"    "height"

EX 2: Exercise 2. Variable type

Answer: categorical

EX 3: Exercise 3. Numerical values

Keep in mind that discrete numeric data can be considered ordinal. Although this is technically true, we usually reserve the term ordinal data for variables belonging to a small number of different groups, with each group having many members.

The height variable could be ordinal if, for example, we report a small number of values such as short, medium, and tall. Let’s explore how many unique values are used by the heights variable. For this we can use the unique function:

x <- c(3, 3,

library(dslabs)
data(heights)
x <- heights$height

length(unique(x))

## [1] 139

EX 4: Exercise 4. Tables

One of the useful outputs of data visualization is that we can learn about the distribution of variables. For categorical data we can construct this distribution by simply computing the frequency of each unique value. This can be done with the function table. Here is an example:

x <- c(3, 3, 3, 3, 4, 4, 2)
table(x)

library(dslabs)
data(heights)
x <- heights$height
tab <- table(x)
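
Counts from table can be turned into proportions with prop.table. A small sketch using a made-up vector rather than the heights data:

```r
x <- c(3, 3, 3, 3, 4, 4, 2)   # made-up values
tab <- table(x)               # counts per unique value
tab
prop <- prop.table(tab)       # the same table as proportions, summing to 1
prop
names(tab)[which.max(tab)]    # the most frequent value, returned as a string
```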

EX 5: Exercise 5. Indicator variables

To see why treating the reported heights as an ordinal value is not useful in practice we note how many values are reported only once.

library(dslabs)
data(heights)
tab <- table(heights$height)
sum(tab ==1)

## [1] 63

EX 6: Exercise 6. Data types - heights

Since there are a finite number of reported heights, height could technically be considered ordinal. Which of the following is true?

Answer: It is more effective to consider heights to be numerical, given the number of unique values we observe and the fact that if we keep collecting data even more will be observed.

1.2 Introduction to Distributions

1.2.1 Describe Heights to ET

library(dslabs)
data(heights)
head(heights)

##      sex height
## 1   Male     75
## 2   Male     70
## 3   Male     68
## 4   Male     74
## 5   Male     61
## 6 Female     65

prop.table(table(heights$sex))

## 
##    Female      Male 
## 0.2266667 0.7733333

1.2.2 Smooth Density Plots

A smooth density plot shows frequencies rather than counts on the y-axis.
The area under the density curve (AUC) adds up to 1.
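
A rough check of the "area equals 1" property, using simulated values as a stand-in for real heights (the data and the variable names are assumptions for illustration):

```r
set.seed(1)                           # simulated data, for illustration only
x <- rnorm(1000, mean = 69, sd = 3)   # stand-in for a vector of heights
d <- density(x)                       # kernel density estimate on a regular grid
step <- diff(d$x)[1]                  # spacing between grid points
area <- sum(d$y) * step               # Riemann-sum approximation of the area
area                                  # should be close to 1
```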

EX 1: Exercise 1. Distributions - 1

You may have noticed that numerical data is often summarized with the average value. For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation. So, for example, you might read a report stating that scores were 680 plus or minus 50 (the standard deviation). The report has summarized an entire vector of scores with just two numbers. Is this appropriate? Is there any important piece of information that we are missing by only looking at this summary rather than the entire list? We are going to learn when these two numbers are enough and when we need more elaborate summaries and plots to describe the data.

Our first data visualization building block is learning to summarize lists of factors or numeric vectors. The most basic statistical summary of a list of objects or numbers is its distribution. Once a vector has been summarized as a distribution, there are several data visualization techniques to effectively relay this information. In later assessments we will practice writing code for data visualization. Here we start with some multiple choice questions to test your understanding of distributions and related basic plots.

In the murders dataset, the region is a categorical variable and on the right you can see its distribution. To the closest 5%, what proportion of the states are in the North Central region?

Answer: 20%

EX 2: Exercise 2. Distributions - 2

In the murders dataset, the region is a categorical variable and to the right is its distribution.

Which of the following is true:

Answer: The graph shows only four numbers with a bar plot.

EX 3: Exercise 3. Empirical Cumulative Distribution Function (eCDF)

The plot shows the eCDF for male heights:

Based on the plot, what percentage of males are shorter than 75 inches?

Answer: 95%
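
For reference, an eCDF can be computed in base R with ecdf. This is only a sketch on made-up heights, not the data behind the plot in the exercise:

```r
heights_in <- c(64, 66, 67, 68, 69, 69, 70, 71, 72, 76)  # made-up heights in inches
Fn <- ecdf(heights_in)    # Fn(a) = proportion of values <= a
Fn(70)                    # 7 of the 10 values are <= 70
1 - Fn(70)                # proportion taller than 70 inches
median(heights_in)        # height with half of the values on either side
```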

EX 4: Exercise 4. eCDF Male Heights

The plot shows the eCDF for male heights:

To the closest inch, what height m has the property that 1/2 of the male students are taller than m and 1/2 are shorter?

Answer: 69 inches

EX 5: Exercise 5. eCDF of Murder Rates

Here is an eCDF of the murder rates across states.

Knowing that there are 51 states (counting DC) and based on this plot, how many states have murder rates larger than 10 per 100,000 people?

Answer: 1

EX 6: Exercise 6. eCDF of Murder Rates - 2

Here is an eCDF of the murder rates across states:

Based on the eCDF above, which of the following statements are true:

Answer: With the exception of 4 states, the murder rates are below 5 per 100,000.

EX 7: Exercise 7. Histograms

Here is a histogram of male heights in our heights dataset:

Based on this plot, how many males are between 62.5 and 65.5?

Answer: 44

EX 8: Exercise 8. Histograms - 2

Here is a histogram of male heights in our heights dataset:

About what percentage are shorter than 60 inches?

Answer: 1%
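
The counts behind a histogram can be read off without drawing it. A sketch on made-up heights, with bin edges chosen only for illustration:

```r
x <- c(61, 63, 64, 65, 66, 66, 67, 68, 70, 73)   # made-up heights in inches
h <- hist(x, breaks = seq(60, 75, by = 5), plot = FALSE)
h$breaks                   # bin edges: [60,65], (65,70], (70,75]
h$counts                   # number of values in each bin
h$counts / sum(h$counts)   # the same bins as proportions
```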

EX 9: Exercise 9. Density plots

Based on this density plot, about what proportion of US states have populations larger than 10 million?

Answer: 0.15

EX 10: Exercise 10. Density plots - 2

Here are three density plots. Is it possible that they are from the same dataset? Which of the following statements is true:

Answer: They are the same dataset, but the first is not in the log scale, the second undersmooths and the third oversmooths.
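
Under- and oversmoothing can be reproduced with the adjust argument of density, which rescales the default bandwidth. The data below are simulated and the adjust values are arbitrary choices for illustration:

```r
set.seed(1)                                    # simulated data, illustration only
x <- c(rnorm(200, 64, 2), rnorm(200, 72, 2))   # two made-up groups
under <- density(x, adjust = 0.2)              # small bandwidth: wiggly curve
over  <- density(x, adjust = 3)                # large bandwidth: detail smoothed away
c(under = under$bw, default = density(x)$bw, over = over$bw)
plot(under, main = "Same data, different smoothing")
lines(over, lty = 2)
```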

1.3 Quantiles, Percentiles, and Boxplots

1.3.1 Normal Distribution

Normal distribution:
- also known as the bell curve
- also known as the Gaussian distribution

It occurs, approximately, in many situations:
- gambling winnings
- heights
- weights
- blood pressure

# Definition of the average
average <- sum(x) / length(x)

# Definition of the standard deviation (this version divides by n)
SD <- sqrt(sum((x - average)^2) / length(x))

index <- heights$sex=="Male"
x <- heights$height[index]

average <- mean(x)
SD <- sd(x)

c(average=average,SD=SD)

##   average        SD 
## 69.314755  3.611024
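
One detail worth checking: the definition of SD above divides by length(x), while R's sd() divides by length(x) - 1. A sketch on made-up numbers shows the difference, which becomes negligible for large vectors:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)                  # made-up values
n <- length(x)
sd_n  <- sqrt(sum((x - mean(x))^2) / n)         # divide by n
sd_n1 <- sqrt(sum((x - mean(x))^2) / (n - 1))   # divide by n - 1
c(sd_n = sd_n, sd_n1 = sd_n1, sd = sd(x))       # sd() matches the n - 1 version
```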

z <- (x - average) / SD   # convert heights to standard units
mean(abs(z) < 2)

## [1] 0.9495074
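
For comparison, the proportion that an exactly normal distribution places within k standard deviations of its mean can be computed with pnorm. The helper name within_k is made up for this sketch:

```r
# Proportion of a standard normal distribution within k SDs of the mean
within_k <- function(k) pnorm(k) - pnorm(-k)
round(sapply(1:3, within_k), 4)   # roughly 0.68, 0.95 and 0.997
```

The observed proportion within 2 SDs for the male heights is close to the theoretical value, which is one sign that the normal approximation is reasonable here.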

1.3.2 Quantile-Quantile Plots

# 51% of the data is below 69.5 inches (176.53 cm)
mean(x <= 69.5)

## [1] 0.5147783

p <- seq(0.05, 0.95, 0.05)

observed_quantiles <- quantile(x, p)
observed_quantiles

##       5%      10%      15%      20%      25%      30%      35%      40% 
## 63.90079 65.00000 66.00000 67.00000 67.00000 68.00000 68.00000 68.62236 
##      45%      50%      55%      60%      65%      70%      75%      80% 
## 69.00000 69.00000 70.00000 70.00000 70.86614 71.00000 72.00000 72.00000 
##      85%      90%      95% 
## 72.44000 73.22751 75.00000

theoretical_quantiles <- qnorm( p, mean = mean(x), sd = sd(x))

plot(theoretical_quantiles, observed_quantiles)
abline(0,1)

# This code becomes slightly simpler if we use standard units.
observed_quantiles <- quantile(z,p)
theoretical_quantiles <- qnorm(p)
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
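
Base R also has a shortcut, qqnorm, which plots the sample against theoretical normal quantiles in one call. A sketch on simulated data (a stand-in, not the heights data):

```r
set.seed(1)                          # simulated data, illustration only
x <- rnorm(500, mean = 69, sd = 3)   # stand-in for a vector of heights
z <- (x - mean(x)) / sd(x)           # standard units
qqnorm(z)                            # sample quantiles vs. theoretical normal quantiles
abline(0, 1)                         # identity line: points near it suggest a good fit

# A numeric companion to the plot: correlation between the two sets of quantiles
r <- cor(qnorm(ppoints(length(z))), sort(z))
r
```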

1.3.3 Percentiles

EX 1: Proportions

Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.

The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. A reason for this is that approximately normal distributions occur in many situations. Examples include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. Often data visualization is needed to confirm that our data follows a normal distribution.

Here we focus on how the normal distribution helps us summarize data and can be useful in practice.

One way the normal distribution is useful is that it can be used to approximate the distribution of a list of numbers without having access to the entire list. We will demonstrate this with the heights dataset.

Load the height data set and create a vector x with just the male heights:

library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
mean(x > 69 & x <= 72)

## [1] 0.3337438

EX 2: Exercise 2. Averages and Standard Deviations

Suppose all you know about the height data from the previous exercise is the average and the standard deviation and that its distribution is approximated by the normal distribution. We can compute the average and standard deviation like this:

library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
avg <- mean(x)
stdev <- sd(x)

Suppose you only have avg and stdev, but no access to x. Can you approximate the proportion of the data that is between 69 and 72 inches?

library(dslabs)
data(heights)
x <- heights$height[heights$sex=="Male"]
avg <- mean(x)
stdev <- sd(x)
pnorm(72, avg, stdev) - pnorm(69, avg, stdev)

## [1] 0.3061779
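
The same approximation can be written in standard units, which makes clear that only the two summary numbers are needed. This sketch plugs in the average and SD of male heights printed earlier in these notes; the names approx_std and approx_raw are made up:

```r
avg   <- 69.314755   # average of male heights, as printed earlier in these notes
stdev <- 3.611024    # standard deviation, as printed earlier in these notes
z_lo <- (69 - avg) / stdev                                  # 69 inches in standard units
z_hi <- (72 - avg) / stdev                                  # 72 inches in standard units
approx_std <- pnorm(z_hi) - pnorm(z_lo)                     # standard-units version
approx_raw <- pnorm(72, avg, stdev) - pnorm(69, avg, stdev) # original-units version
c(approx_std, approx_raw)                                   # the two agree
```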

Exercise 3. Approximations

Notice that the approximation calculated in the second question is very close to the exact calculation in the first question. The normal distribution was a useful approximation for this case.

However, the approximation is not always useful. An example is for the more extreme values, often called the “tails” of the distribution. Let’s look at an example. We can compute the proportion of heights between 79 and 81.

library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
mean(x > 79 & x <= 81)

library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
avg <- mean(x)
stdev <- sd(x)
exact <- mean(x>79 & x<=81)
approx <- pnorm(81, avg, stdev) - pnorm(79, avg, stdev)
exact/approx

## [1] 1.614261

Exercise 4. Seven footers and the NBA

Someone asks you what percent of seven footers are in the National Basketball Association (NBA). Can you provide an estimate? Let’s try using the normal approximation to answer this question.

First, we will estimate the proportion of adult men that are 7 feet tall or taller.

Assume that the heights of adult men in the world are normally distributed with an average of 69 inches and a standard deviation of 3 inches.

1 - pnorm(7*12, 69, 3)

## [1] 2.866516e-07

Exercise 5. Estimating the number of seven footers

Now we have an approximation for the proportion, call it p, of men that are 7 feet tall or taller.

We know that there are about 1 billion men between the ages of 18 and 40 in the world, the age range for the NBA.

Can we use the normal distribution to estimate how many of these 1 billion men are at least seven feet tall?

p <- 1 - pnorm(7*12, 69, 3)
round(p * 10^9)

## [1] 287

Exercise 6. How many seven footers are in the NBA?

There are about 10 National Basketball Association (NBA) players who are 7 feet (213.36 cm) tall or taller. Dividing 10 by the estimated number of seven footers gives the proportion of them who are in the NBA.

p <- 1 - pnorm(7*12, 69, 3)
N <- round(p * 10^9)
10/N

## [1] 0.03484321

Exercise 7. Lebron James’ height

In the previous exercise we estimated the proportion of seven footers in the NBA using this simple code:

p <- 1 - pnorm(7*12, 69, 3)
N <- round(p * 10^9)
10/N

Repeat the calculations performed in the previous question for Lebron James’ height: 6 feet 8 inches. There are about 150 players, instead of 10, that are at least that tall in the NBA.

p <- 1 - pnorm(6*12 + 8, 69, 3)
N <- round(p * 10^9)
150/N

## [1] 0.001220842

Exercise 8. Interpretation

In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player.

What would be a fair critique of our calculations?

Answer: As seen in exercise 3, the normal approximation tends to underestimate the extreme values. It’s possible that there are more seven footers than we predicted.
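
To see how fragile a tail estimate is, one can nudge the assumed standard deviation. The value 3.3 below is a hypothetical alternative chosen only for illustration, not a figure from the course:

```r
cutoff <- 7 * 12                                       # seven feet, in inches
p_sd3  <- pnorm(cutoff, 69, 3,   lower.tail = FALSE)   # assumption used in the exercises
p_sd33 <- pnorm(cutoff, 69, 3.3, lower.tail = FALSE)   # hypothetical, slightly larger SD
c(p_sd3 = p_sd3, p_sd33 = p_sd33)
p_sd33 / p_sd3                                         # ratio of the two tail estimates
```

A 10% change in the assumed SD changes the estimated number of seven footers several-fold, so the proportions computed above should be read with caution.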

1.3.4 Boxplots

If the data are NOT approximately normally distributed, the mean and standard deviation are not enough to describe them.
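
A boxplot is built from percentiles instead. A minimal sketch on made-up numbers with one extreme value, showing the summaries a boxplot relies on:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # made-up values with one extreme point
mean(x)                 # pulled up by the extreme value
median(x)               # unaffected by how extreme the largest value is
fivenum(x)              # minimum, lower hinge, median, upper hinge, maximum
IQR(x)                  # interquartile range
boxplot.stats(x)$out    # points a boxplot would draw as outliers
boxplot(x, horizontal = TRUE)
```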

1.3.5 Distribution of Female Heights

Exercise 1. Vector lengths

When analyzing data it’s often important to know the number of measurements you have for each category.

library(dslabs)
data(heights)
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]

length(male)

## [1] 812

length(female)

## [1] 238

Exercise 2. Percentiles

Suppose we can’t make a plot and want to compare the distributions side by side. If the number of data points is large, listing all the numbers is impractical. A more practical approach is to look at the percentiles. We can obtain percentiles using the quantile function like this:

library(dslabs)
data(heights)
quantile(heights$height, seq(.01, 0.99, 0.01))

library(dslabs)
data(heights)
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]

female_percentiles <- quantile(female, seq(0.1, 0.9, 0.2))
male_percentiles <- quantile(male, seq(0.1, 0.9, 0.2))

df <- data.frame(female = female_percentiles, male = male_percentiles)
df

##       female     male
## 10% 61.00000 65.00000
## 30% 63.00000 68.00000
## 50% 64.98031 69.00000
## 70% 66.46417 71.00000
## 90% 69.00000 73.22751

Exercise 3. Interpreting Boxplots - 1

Study the boxplots summarizing the distributions of population sizes by country.

Which continent has the country with the largest population size?

Answer: Asia

Exercise 4. Interpreting Boxplots - 2

Study the boxplots summarizing the distributions of population sizes by country.

Which continent has the median country with the largest population?

Answer: Africa

Exercise 5. Interpreting Boxplots - 3

Again, look at the boxplots summarizing the distributions of population sizes by country. To the nearest million, what is the median population size for Africa?

Answer: 10 million

Exercise 6. Low quantiles

Examine the following boxplots and report approximately what proportion of countries in Europe have populations below 14 million:

Answer: 0.75

Exercise 7. Interquartile Range (IQR)

Based on the boxplot, if we use a log transformation, which continent shown below has the largest interquartile range?

Answer: Americas

Section 2: Introduction to ggplot2

In Section 2, you will learn how to create data visualizations in R using ggplot2.

After completing Section 2, you will:

be able to use ggplot2 to create data visualizations in R.
be able to explain what the data component of a graph is.
be able to identify the geometry component of a graph and know when to use which type of geometry.
be able to explain what the aesthetic mapping component of a graph is.
be able to understand the scale component of a graph and select an appropriate scale component to use.

2.1 Basics of ggplot2

2.1.1 ggplot

ggplot = grammar of graphics plot

library(tidyverse)

## -- Attaching packages ------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0     v readr   1.3.1
## v tibble  2.0.1     v purrr   0.3.1
## v tidyr   0.8.3     v stringr 1.4.0
## v ggplot2 3.1.0     v forcats 0.4.0

## -- Conflicts ---------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

2.1.2 Graph Component

library(dslabs)
data(murders)

2.1.3 Creating a New Plot

murders %>% ggplot()   # equivalent to ggplot(data = murders)
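
A sketch of what this first step produces, using a small made-up data frame (toy is a hypothetical stand-in for the murders table) so that it runs with only ggplot2 installed:

```r
library(ggplot2)

# Hypothetical toy data frame, standing in for the murders table
toy <- data.frame(population = c(4.8, 0.7, 6.4), total = c(135, 19, 232))

p <- ggplot(data = toy)   # data only: no geometry yet, so the plot is blank
class(p)                  # a ggplot object that layers can be added to
p                         # printing the object renders it
```

Because no geometry has been added, the rendered plot is an empty panel. Note that piping the data in and also passing data = would supply the data frame twice.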

EX 1: Exercise 1. Exploring the Galton Dataset - Average and Median

For this chapter, we will use height data collected by Francis Galton for his genetics studies. Here we use just the heights of the children in the dataset:

library(HistData)
data(Galton)
x <- Galton$child

#install.packages("HistData")

library(HistData)
data(Galton)
x <- Galton$child

mean(x)

## [1] 68.08847

median(x)

## [1] 68.2

EX 2: Exercise 2. Exploring the Galton Dataset - SD and MAD

Now for the same data compute the standard deviation and the median absolute deviation (MAD).

library(HistData)
data(Galton)
x <- Galton$child
sd(x)

## [1] 2.517941

mad(x)

## [1] 2.9652
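
For reference, R's mad() is the median of the absolute deviations from the median, multiplied by a constant (1.4826 by default) so that it is comparable to the SD for normal data. A sketch on made-up numbers:

```r
x <- c(1, 2, 3, 4, 100)                  # made-up values
raw_mad <- median(abs(x - median(x)))    # median absolute deviation, unscaled
raw_mad * 1.4826                         # scaled version
mad(x)                                   # R's default applies the same constant
```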

Exercise 3. Error impact on average

In the previous exercises we saw that the mean and median are very similar and so are the standard deviation and MAD. This is expected since the data is approximated by a normal distribution, which has this property.

Now suppose that Galton made a mistake when entering the first value, forgetting to use the decimal point. You can imitate this error by typing:

library(HistData)
data(Galton)
x <- Galton$child
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10

The data now has an outlier that the normal approximation does not account for. Let’s see how this affects the average.

library(HistData)
data(Galton)
x <- Galton$child
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
mean(x_with_error)- mean(x)

## [1] 0.5983836

Exercise 4. Error impact on SD

In the previous exercise we saw how a simple mistake can result in the average of our data increasing by more than half an inch, which is a large difference in practical terms. Now let’s explore the effect this outlier has on the standard deviation.

x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
sd(x_with_error)- sd(x)

## [1] 15.6746

Exercise 5. Error impact on median

In the previous exercises we saw how one mistake can have a substantial effect on the average and the standard deviation.

Now we are going to see how the median and MAD are much more resistant to outliers. For this reason we say that they are robust summaries.

x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
median(x_with_error)- median(x)

## [1] 0

Exercise 6. Error impact on MAD

We saw that the median barely changes. Now let’s see how the MAD is affected.

x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
mad(x_with_error)- mad(x)

## [1] 0

Exercise 7. Usefulness of EDA

How could you use exploratory data analysis to detect that an error was made?

Answer: A boxplot, histogram, or qq-plot would reveal a clear outlier.
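
Besides plots, a robust rule can flag such an entry numerically. The sketch below uses simulated heights, and the cutoff of 5 MADs is an arbitrary choice for illustration:

```r
set.seed(1)                               # simulated data, illustration only
x <- round(rnorm(100, mean = 68, sd = 2.5), 1)
x[1] <- x[1] * 10                         # imitate a missing decimal point
range(x)                                  # the maximum is implausible for a height
robust_z <- abs(x - median(x)) / mad(x)   # distance from the median in MAD units
flagged <- which(robust_z > 5)            # entries far from the bulk of the data
flagged
```

Using the median and MAD rather than the mean and SD matters here, because the outlier itself inflates the mean and SD.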

Exercise 8. Using EDA to explore changes

We have seen how the average can be affected by outliers. But how large can this effect get? This of course depends on the size of the outlier and the size of the dataset.

To see how outliers can affect the average of a dataset, let’s write a simple function that takes the size of the outlier as input and returns the average.

x <- Galton$child

error_avg <- function(k){
  x[1] <- k
  mean(x)
}

error_avg(10000)

## [1] 78.79784

error_avg(-10000)

## [1] 57.24612
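
The size of the effect follows from the definition of the average: replacing the first entry by k shifts the mean by (k - x[1]) / n, so it grows without bound as k grows. A check on made-up numbers:

```r
x <- c(60, 65, 70, 75)               # made-up values
k <- 1000                            # hypothetical outlier replacing the first entry
x_err <- replace(x, 1, k)
observed  <- mean(x_err) - mean(x)   # observed change in the average
predicted <- (k - x[1]) / length(x)  # predicted change: (k - x[1]) / n
c(observed = observed, predicted = predicted)
```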

2.2 Customizing Plots

2.2.1 Layers

Layers can define geometries, compute summary statistics, define what scales to use, and even change styles.

Step 1: Geometry
- scatter plot: geom_point()

Step 2: Aesthetics
- x
- y
- alpha
- colour
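
A sketch of how these pieces combine, again on a made-up data frame (toy and its values are hypothetical stand-ins for the murders table):

```r
library(ggplot2)

# Hypothetical toy data, standing in for the murders table
toy <- data.frame(population = c(4.8, 0.7, 6.4, 37.3),
                  total      = c(135, 19, 232, 1257),
                  abb        = c("AL", "AK", "AZ", "CA"))

p <- ggplot(data = toy) +                                  # data component
  geom_point(aes(x = population, y = total), size = 3) +   # geometry + aesthetic mapping
  geom_text(aes(x = population, y = total, label = abb),
            nudge_x = 1)                                   # second layer: text labels
p
```

Arguments inside aes() refer to columns of the data, while arguments outside it, such as size and nudge_x, apply the same fixed value to every point.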

# pipe the murders dataset into ggplot
murders %>% ggplot() +
  geom_point(aes(x = population/10^6, y = total))

Adding labels and text to the plot:

p <- ggplot(data = murders)
p + geom_point(aes(x = population/10^6, y = total)) +
  geom_text(aes(population/10^6, total, label = abb))

2.2.2 Tinkering

# adding size
# and moving the label to the right
p <- ggplot(data = murders)
p + geom_point(aes(x = population/10^6, y = total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb), nudge_x = 1)

Redefine p with a global aesthetic mapping inside the ggplot function:

p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size = 3) +
  geom_text(nudge_x = 1)   # same offset as in the previous block

p + geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "Hello there"))

2.2.3 Scales, Labels, and Colors

p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")

p + geom_point(size = 3) +
  geom_text(nudge_x = 0.075) +
  scale_x_log10() +
  scale_y_log10()

p + geom_point(size = 3) +
  geom_text(nudge_x = 0.075) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")

p <- murders %>% ggplot(aes(population/10^6, total, label = abb)) +
  geom_text(nudge_x = 0.075) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")
p + geom_point(size = 3, color = "blue")

p + geom_point(aes(col=region), size = 3)
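
The last two calls differ in an important way: color = "blue" sets a constant, while aes(col = region) maps a variable to color and adds a legend. A sketch on a made-up data frame (toy is hypothetical):

```r
library(ggplot2)

# Hypothetical toy data, standing in for the murders table
toy <- data.frame(population = c(4.8, 0.7, 6.4, 37.3),
                  total      = c(135, 19, 232, 1257),
                  region     = c("South", "West", "West", "West"))

base <- ggplot(toy, aes(population, total)) +   # global mapping, inherited by all layers
  scale_x_log10() +
  scale_y_log10()

fixed  <- base + geom_point(color = "blue", size = 3)       # constant: every point is blue
mapped <- base + geom_point(aes(color = region), size = 3)  # mapping: one color per region
fixed
mapped
```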

Section 3: Summarizing with dplyr

Section 3 introduces you to summarizing with dplyr.

After completing Section 3, you will:

understand the importance of summarizing data in exploratory data analysis.
be able to use the “summarize” verb in dplyr to facilitate summarizing data.
be able to use the “group_by” verb in dplyr to facilitate summarizing data.
be able to access values using the dot placeholder.
be able to use “arrange” to examine data after sorting.

3.1 Summarizing with dplyr

Section 4: Gapminder

In Section 4, you will look at a case study involving data from the Gapminder Foundation about trends in world health and economics.

After completing Section 4, you will:

understand how Hans Rosling and the Gapminder Foundation use effective data visualization to convey data-based trends.
be able to apply the ggplot2 techniques from the previous section to answer questions using data.
understand how fixed scales across plots can ease comparisons.
be able to modify graphs to improve data visualization.

4.1 Introduction to Gapminder

4.2 Using the Gapminder Dataset

Section 5: Data Visualization Principles

Section 5 covers some general principles that can serve as guides for effective data visualization.

After completing Section 5, you will:

understand basic principles of effective data visualization.
understand the importance of keeping your goal in mind when deciding on a visualization approach.
understand principles for encoding data, including position, aligned lengths, angles, area, brightness, and color hue.
know when to include the number zero in visualizations.
be able to use techniques to ease comparisons, such as using common axes, putting visual cues to be compared adjacent to one another, and using color effectively.

Data Science: Visualization

Henrik Gjerning

2019-03-10