PH125.2x: Data Science: Visualization
- School: edX, HarvardX
- Course Instructor: Rafael Irizarry
Abstract
In this second course of nine in the HarvardX Data Science Professional Certificate, we learn the basics of data visualization and exploratory data analysis.
The growing availability of informative datasets and software tools has led to increased reliance on data visualizations across many industries, academia, and government. Data visualization provides a powerful way to communicate data-driven findings, motivate analyses, or detect flaws.
In this course, you will learn the basics of data visualization and exploratory data analysis. We will use three motivating examples and ggplot2, a data visualization package for the statistical programming language R, to code. To learn the very basics, we will start with a somewhat artificial example: heights reported by students. Then we will use two case studies related to world health and economics and another in infectious disease trends in the United States.
It is also important to note that mistakes, biases, systematic errors, and other unexpected problems often lead to data that should be handled with care. The fact that it can be difficult or impossible to notice an error just from the reported results makes data visualization particularly important. This course will explore how failure to discover these problems often leads to flawed analyses and false discoveries.
Learning Objectives:
- data visualization principles to better communicate data-driven findings
- how to use ggplot2 to create custom plots
- the weaknesses of several widely used plots and why you should avoid them
Course Outline:
Section 1: Introduction to Data Visualization and Distributions. You will get started with data visualization and distributions in R.
Section 2: Introduction to ggplot2. You will learn how to use ggplot2 to create plots.
Section 3: Summarizing with dplyr. You will learn how to summarize data using dplyr.
Section 4: Gapminder. You will see examples of ggplot2 and dplyr in action with the Gapminder dataset.
Section 5: Data Visualization Principles. You will learn general principles to guide you in developing effective data visualizations.
Section 1: Introduction to Data Visualization and Distributions
Section 1 introduces you to Data Visualization and Distributions.
After completing Section 1, you will:
- understand the importance of data visualization for communicating data-driven findings.
- be able to use distributions to summarize data.
- be able to use the average and the standard deviation to understand the normal distribution.
- be able to assess how well a normal distribution fits the data using a quantile-quantile plot.
- be able to interpret data from a boxplot.
1.1 Introduction to Data Visualization
A picture is worth a thousand words.
Visualization is the strongest tool of exploratory data analysis.
"The greatest value of a picture is when it forces us to notice what we never expected to see." (John Tukey)
library(dslabs)
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
1.1.1 Introduction to Distributions
Sometimes the average and the standard deviation are all we need to know about the data.
However, at other times we need more.
1.1.2 Data Types
- categorical
  - ordinal
    - spiciness: mild, medium, hot
  - non-ordinal
    - sex: male, female
- numeric
  - discrete
    - population (has to be a whole number)
  - continuous
    - heights
EX 1: Exercise 1. Variable names
The type of data we are working with will often influence the data visualization technique we use. We will be working with two types of variables: categorical and numeric. Each can be divided into two other groups: categorical can be ordinal or not, whereas numerical variables can be discrete or continuous.
We will review data types using some of the examples provided in the dslabs package. For example, the heights dataset.
library(dslabs)
data(heights)
names(heights)
## [1] "sex" "height"
EX 2: Exercise 2. Variable type
Answer: categorical
EX 3: Exercise 3. Numerical values
Keep in mind that discrete numeric data can be considered ordinal. Although this is technically true, we usually reserve the term ordinal data for variables belonging to a small number of different groups, with each group having many members.
The height variable could be ordinal if, for example, we report a small number of values such as short, medium, and tall. Let’s explore how many unique values are used by the heights variable. For this we can use the unique function:
library(dslabs)
data(heights)
x <- heights$height
length(unique(x))
## [1] 139
EX 4: Exercise 4. Tables
One of the useful outputs of data visualization is that we can learn about the distribution of variables. For categorical data we can construct this distribution by simply computing the frequency of each unique value. This can be done with the function table. Here is an example:
x <- c(3, 3, 3, 3, 4, 4, 2)
table(x)
library(dslabs)
data(heights)
x <- heights$height
tab <- table(x)
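As a small illustration of the idea above, the following sketch tabulates a categorical vector and turns the counts into proportions with prop.table. The `regions` values are invented for this example and are not taken from the course data.

```r
# Hypothetical categorical data, invented for illustration
regions <- c("South", "West", "West", "South", "West", "Northeast")

# Counts of each unique value
counts <- table(regions)
counts

# Proportions: the distribution of a categorical variable
props <- prop.table(counts)
props
```

The proportions always add up to 1, which is what makes them comparable across datasets of different sizes.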
EX 5: Exercise 5. Indicator variables
To see why treating the reported heights as an ordinal value is not useful in practice we note how many values are reported only once.
library(dslabs)
data(heights)
tab <- table(heights$height)
sum(tab ==1)
## [1] 63
EX 6: Exercise 6. Data types - heights
Since there are a finite number of reported heights and technically the height can be considered ordinal, which of the following is true:
Answer: It is more effective to consider heights to be numerical, given the number of unique values we observe and the fact that if we keep collecting data, even more will be observed.
1.2 Introduction to Distributions
1.2.1 Describe Heights to ET
library(dslabs)
data(heights)
head(heights)
## sex height
## 1 Male 75
## 2 Male 70
## 3 Male 68
## 4 Male 74
## 5 Male 61
## 6 Female 65
prop.table(table(heights$sex))
##
## Female Male
## 0.2266667 0.7733333
1.2.2 Smooth Density Plots
- Compute frequencies rather than counts
- The area under the density curve (AUC) is 1
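The area property can be checked numerically. The sketch below uses simulated heights (an assumption, not the course data) and approximates the area under the estimated density with a simple Riemann sum. The result should be close to 1, not exactly 1, because both the kernel estimate and the sum are approximations.

```r
set.seed(1)
# Simulated heights (an assumption for illustration; not the course data)
sim_heights <- rnorm(500, mean = 69, sd = 3)

# Kernel density estimate evaluated on an evenly spaced grid
d <- density(sim_heights)

# Approximate the area under the curve with a Riemann sum
step <- diff(d$x)[1]
auc <- sum(d$y) * step
auc
```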
EX 1: Exercise 1. Distributions - 1
You may have noticed that numerical data is often summarized with the average value. For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation. So, for example, you might read a report stating that scores were 680 plus or minus 50 (the standard deviation). The report has summarized an entire vector of scores with just two numbers. Is this appropriate? Is there any important piece of information that we are missing by only looking at this summary rather than the entire list? We are going to learn when these two numbers are enough and when we need more elaborate summaries and plots to describe the data.
Our first data visualization building block is learning to summarize lists of factors or numeric vectors. The most basic statistical summary of a list of objects or numbers is its distribution. Once a vector has been summarized as a distribution, there are several data visualization techniques to effectively relay this information. In later assessments we will practice writing code for data visualization. Here we start with some multiple-choice questions to test your understanding of distributions and related basic plots.
In the murders dataset, the region is a categorical variable and on the right you can see its distribution. To the closest 5%, what proportion of the states are in the North Central region?
Answer: 20%
EX 2: Exercise 2. Distributions - 2
In the murders dataset, the region is a categorical variable and to the right is its distribution.
Which of the following is true:
Answer: The graph shows only four numbers with a bar plot.
EX 3: Exercise 3. Empirical Cumulative Distribution Function (eCDF)
The plot shows the eCDF for male heights:
Based on the plot, what percentage of males are shorter than 75 inches?
Answer: 95%
EX 4: Exercise 4. eCDF Male Heights
The plot shows the eCDF for male heights:
To the closest inch, what height m has the property that 1/2 of the male students are taller than m and 1/2 are shorter?
Answer: 69 inches
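The two eCDF questions above read values off a plot. A minimal sketch of the same reasoning in code, using simulated heights (an assumption, so the numbers will not match the course plots):

```r
set.seed(2)
# Simulated male heights in inches (an assumption; the exercises use the course data)
sim_heights <- rnorm(800, mean = 69.3, sd = 3.6)

# ecdf() returns a function; height_ecdf(a) is the proportion of values <= a
height_ecdf <- ecdf(sim_heights)

# Proportion of simulated heights at or below 75 inches
height_ecdf(75)

# The same quantity computed directly
mean(sim_heights <= 75)

# The height with half of the values below it is the median
quantile(sim_heights, 0.5)
```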
EX 5: Exercise 5. eCDF of Murder Rates
Here is an eCDF of the murder rates across states.
Knowing that there are 51 states (counting DC) and based on this plot, how many states have murder rates larger than 10 per 100,000 people?
Answer: 1
EX 6: Exercise 6. eCDF of Murder Rates - 2
Here is an eCDF of the murder rates across states:
Based on the eCDF above, which of the following statements are true:
Answer: With the exception of 4 states, the murder rates are below 5 per 100,000.
EX 7: Exercise 7. Histograms
Here is a histogram of male heights in our heights dataset:
Based on this plot, how many males are between 62.5 and 65.5?
Answer: 44
EX 8: Exercise 8. Histograms - 2
Here is a histogram of male heights in our heights dataset:
About what percentage are shorter than 60 inches?
Answer: 1%
EX 9: Exercise 9. Density plots
Based on this density plot, about what proportion of US states have populations larger than 10 million?
Answer: 0.15
EX 10: Exercise 10. Density plots - 2
Here are three density plots. Is it possible that they are from the same dataset? Which of the following statements is true:
Answer: They are the same dataset, but the first is not in the log scale, the second undersmooths and the third oversmooths.
1.3 Quantiles, Percentiles, and Boxplots
1.3.1 Normal Distribution
- Normal distribution:
  - also called the bell curve
  - also called the Gaussian distribution
- It occurs in many situations:
  - gambling winnings
  - heights
  - weights
  - blood pressure
# definition of average
average <- sum(x) / length(x)
# standard deviation (this form divides by length(x); sd() divides by length(x) - 1)
SD <- sqrt(sum((x - average)^2) / length(x))
index <- heights$sex=="Male"
x <- heights$height[index]
average <- mean(x)
SD <- sd(x)
c(average=average,SD=SD)
## average SD
## 69.314755 3.611024
# convert the heights to standard units
z <- (x - average) / SD
mean(abs(z) < 2)
## [1] 0.9495074
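The observed proportion can be compared with what an exact normal distribution predicts. The sketch below computes the proportion of a normal distribution within 1, 2, and 3 standard deviations of its mean, which come out at roughly 68%, 95%, and 99.7%.

```r
# Proportion of a normal distribution within 1, 2, and 3 SDs of its mean
within_1 <- pnorm(1) - pnorm(-1)
within_2 <- pnorm(2) - pnorm(-2)
within_3 <- pnorm(3) - pnorm(-3)

round(c(within_1, within_2, within_3), 3)
```

The observed value for male heights is close to the theoretical value for 2 standard deviations, which is one piece of evidence that the normal approximation is reasonable here.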
1.3.2 Quantile-Quantile Plots
# 51% of the data is below 69.5 inches (176.53 cm)
mean(x <= 69.5)
## [1] 0.5147783
p <- seq(0.05, 0.95, 0.05)
observed_quantiles <- quantile(x, p)
observed_quantiles
## 5% 10% 15% 20% 25% 30% 35% 40%
## 63.90079 65.00000 66.00000 67.00000 67.00000 68.00000 68.00000 68.62236
## 45% 50% 55% 60% 65% 70% 75% 80%
## 69.00000 69.00000 70.00000 70.00000 70.86614 71.00000 72.00000 72.00000
## 85% 90% 95%
## 72.44000 73.22751 75.00000
theoretical_quantiles <- qnorm( p, mean = mean(x), sd = sd(x))
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
# This code becomes slightly simpler if we use standard units.
observed_quantiles <- quantile(z,p)
theoretical_quantiles <- qnorm(p)
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
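To see what a QQ-plot looks like when the data really are normal, here is a minimal sketch with simulated values in standard units (an assumption for illustration). It also uses base R's qqnorm and qqline, which draw a comparable plot using every data point rather than a chosen set of quantiles.

```r
set.seed(3)
# Simulated data in standard units (an assumption for illustration)
z_sim <- rnorm(1000)

p <- seq(0.05, 0.95, 0.05)
observed <- quantile(z_sim, p)
theoretical <- qnorm(p)

# For normal data the two sets of quantiles should nearly coincide
max(abs(observed - theoretical))

# Base R shortcut for a normal QQ-plot with a reference line
qqnorm(z_sim)
qqline(z_sim)
```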
1.3.3 Percentiles
EX 1: Proportions
Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.
The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. A reason for this is that approximately normal distributions occur in many situations. Examples include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. Often data visualization is needed to confirm that our data follows a normal distribution.
Here we focus on how the normal distribution helps us summarize data and can be useful in practice.
One way the normal distribution is useful is that it can be used to approximate the distribution of a list of numbers without having access to the entire list. We will demonstrate this with the heights dataset.
Load the height data set and create a vector x with just the male heights:
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
mean(x > 69 & x <= 72)
## [1] 0.3337438
EX 2: Exercise 2. Averages and Standard Deviations
Suppose all you know about the height data from the previous exercise is the average and the standard deviation and that its distribution is approximated by the normal distribution. We can compute the average and standard deviation like this:
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
avg <- mean(x)
stdev <- sd(x)
Suppose you only have avg and stdev, but no access to x. Can you approximate the proportion of the data that is between 69 and 72 inches?
library(dslabs)
data(heights)
x <- heights$height[heights$sex=="Male"]
avg <- mean(x)
stdev <- sd(x)
pnorm(72, avg, stdev) - pnorm(69, avg, stdev)
## [1] 0.3061779
Exercise 3. Approximations
Notice that the approximation calculated in the second question is very close to the exact calculation in the first question. The normal distribution was a useful approximation for this case.
However, the approximation is not always useful. An example is for the more extreme values, often called the “tails” of the distribution. Let’s look at an example. We can compute the proportion of heights between 79 and 81.
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
mean(x > 79 & x <= 81)
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
avg <- mean(x)
stdev <- sd(x)
exact <- mean(x>79 & x<=81)
approx <- pnorm(81, avg, stdev) - pnorm(79, avg, stdev)
exact/approx
## [1] 1.614261
Exercise 7. LeBron James’ height
In the previous exercise we estimated the proportion of seven footers in the NBA using this simple code:
p <- 1 - pnorm(7*12, 69, 3)
N <- round(p * 10^9)
10/N
Repeat the calculations performed in the previous question for LeBron James’ height: 6 feet 8 inches. There are about 150 players, instead of 10, that are at least that tall in the NBA.
p <- 1 - pnorm(6*12 + 8, 69, 3)
N <- round(p * 10^9)
150/N
## [1] 0.001220842
Exercise 8. Interpretation
In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player.
What would be a fair critique of our calculations?
Answer: As seen in exercise 3, the normal approximation tends to underestimate the extreme values. It’s possible that there are more seven footers than we predicted.
1.3.4 Boxplots
- If the data are NOT normally distributed, the mean and standard deviation alone are not enough to describe them
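A boxplot is built from a five-number summary instead. The sketch below computes one for simulated right-skewed data (an assumption for illustration). Note that boxplot() itself uses hinges, which can differ slightly from the quantile() values shown here.

```r
set.seed(4)
# Right-skewed simulated data (an assumption), where mean and SD can mislead
skewed <- rexp(200, rate = 1)

# Five-number summary: minimum, first quartile, median, third quartile, maximum
five_num <- quantile(skewed, c(0, 0.25, 0.5, 0.75, 1))
five_num

# Interquartile range: the distance between the first and third quartiles
iqr <- five_num[["75%"]] - five_num[["25%"]]
iqr

# Draw the boxplot for the same data
boxplot(skewed)
```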
1.3.5 Distribution of Female Heights
Exercise 1. Vector lengths
When analyzing data it’s often important to know the number of measurements you have for each category.
library(dslabs)
data(heights)
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]
length(male)
## [1] 812
length(female)
## [1] 238
Exercise 2. Percentiles
Suppose we can’t make a plot and want to compare the distributions side by side. If the number of data points is large, listing all the numbers is impractical. A more practical approach is to look at the percentiles. We can obtain percentiles using the quantile function like this:
library(dslabs)
data(heights)
quantile(heights$height, seq(.01, 0.99, 0.01))
library(dslabs)
data(heights)
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]
female_percentiles <- quantile(female, seq(0.1, 0.9, 0.2))
male_percentiles <- quantile(male, seq(0.1, 0.9, 0.2))
df <- data.frame(female = female_percentiles, male = male_percentiles)
df
## female male
## 10% 61.00000 65.00000
## 30% 63.00000 68.00000
## 50% 64.98031 69.00000
## 70% 66.46417 71.00000
## 90% 69.00000 73.22751
Exercise 3. Interpreting Boxplots - 1
Study the boxplots summarizing the distributions of population sizes by country.
Which continent has the country with the largest population size?
Answer: Asia
Exercise 4. Interpreting Boxplots - 2
Study the boxplots summarizing the distributions of population sizes by country.
Which continent has the median country with the largest population?
Answer: Africa
Exercise 5. Interpreting Boxplots - 3
Again, look at the boxplots summarizing the distributions of population sizes by country. To the nearest million, what is the median population size for Africa?
Answer: 10 million
Exercise 6. Low quantiles
Examine the following boxplots and report approximately what proportion of countries in Europe have populations below 14 million:
Answer: 0.75
Exercise 7. Interquartile Range (IQR)
Based on the boxplot, if we use a log transformation, which continent shown below has the largest interquartile range?
Answer: Americas
Section 2: Introduction to ggplot2
In Section 2, you will learn how to create data visualizations in R using ggplot2.
After completing Section 2, you will:
- be able to use ggplot2 to create data visualizations in R.
- be able to explain what the data component of a graph is.
- be able to identify the geometry component of a graph and know when to use which type of geometry.
- be able to explain what the aesthetic mapping component of a graph is.
- be able to understand the scale component of a graph and select an appropriate scale component to use.
2.1 Basics of ggplot2
2.1.1 ggplot
ggplot = grammar of graphics plot
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v readr 1.3.1
## v tibble 2.0.1 v purrr 0.3.1
## v tidyr 0.8.3 v stringr 1.4.0
## v ggplot2 3.1.0 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
2.1.2 Graph Component
library(dslabs)
data(murders)
2.1.3 Creating a New Plot
ggplot(data = murders)
murders %>% ggplot()  # equivalent: the pipe passes murders as the data argument
EX 1: Exercise 1. Exploring the Galton Dataset - Average and Median
For this chapter, we will use height data collected by Francis Galton for his genetics studies. Here we just use the heights of the children in the dataset:
library(HistData)
data(Galton)
x <- Galton$child
#install.packages("HistData")
library(HistData)
data(Galton)
x <- Galton$child
mean(x)
## [1] 68.08847
median(x)
## [1] 68.2
EX 2: Exercise 2. Exploring the Galton Dataset - SD and MAD
Now for the same data compute the standard deviation and the median absolute deviation (MAD).
library(HistData)
data(Galton)
x <- Galton$child
sd(x)
## [1] 2.517941
mad(x)
## [1] 2.9652
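To make clear what mad() computes, here is a minimal sketch that reproduces it by hand on a small made-up vector (the values are invented for illustration). By default, base R scales the median absolute deviation by 1.4826 so that it is comparable to the standard deviation for normally distributed data.

```r
# Made-up heights in inches, invented for illustration
h <- c(61.5, 64.0, 66.5, 68.0, 68.5, 70.0, 72.5)

# MAD by hand: median of the absolute deviations from the median,
# scaled by 1.4826 (the default constant used by mad())
mad_by_hand <- 1.4826 * median(abs(h - median(h)))

mad_by_hand
mad(h)
```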
Exercise 3. Error impact on average
In the previous exercises we saw that the mean and median are very similar and so are the standard deviation and MAD. This is expected since the data is approximated by a normal distribution, which has this property.
Now suppose that Galton made a mistake when entering the first value, forgetting to use the decimal point. You can imitate this error by typing:
library(HistData)
data(Galton)
x <- Galton$child
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
The data now has an outlier that the normal approximation does not account for. Let’s see how this affects the average.
library(HistData)
data(Galton)
x <- Galton$child
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
mean(x_with_error)- mean(x)
## [1] 0.5983836
Exercise 4. Error impact on SD
In the previous exercise we saw how a simple mistake can result in the average of our data increasing by more than half an inch, which is a large difference in practical terms. Now let’s explore the effect this outlier has on the standard deviation.
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
sd(x_with_error)- sd(x)
## [1] 15.6746
Exercise 5. Error impact on median
In the previous exercises we saw how one mistake can have a substantial effect on the average and the standard deviation.
Now we are going to see how the median and MAD are much more resistant to outliers. For this reason we say that they are robust summaries.
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
median(x_with_error)- median(x)
## [1] 0
Exercise 6. Error impact on MAD
We saw that the median barely changes. Now let’s see how the MAD is affected.
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
mad(x_with_error)- mad(x)
## [1] 0
Exercise 7. Usefulness of EDA
How could you use exploratory data analysis to detect that an error was made?
Answer: A boxplot, histogram, or qq-plot would reveal a clear outlier.
Exercise 8. Using EDA to explore changes
We have seen how the average can be affected by outliers. But how large can this effect get? This of course depends on the size of the outlier and the size of the dataset.
To see how outliers can affect the average of a dataset, let’s write a simple function that takes the size of the outlier as input and returns the average.
x <- Galton$child
error_avg <- function(k){
x[1] <- k
mean(x)
}
error_avg(10000)
## [1] 78.79784
error_avg(-10000)
## [1] 57.24612
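The same experiment can be repeated for the median to contrast the two summaries. This is a sketch on simulated data (an assumption; the exercise itself uses Galton$child), and `error_avg_sim` and `error_median_sim` are hypothetical helper names introduced only for this example.

```r
set.seed(5)
# Simulated child heights (an assumption; the exercise uses Galton$child)
x_sim <- rnorm(1000, mean = 68, sd = 2.5)

# Hypothetical helpers: replace the first value with k, then summarize.
# The assignment only changes a local copy of x_sim inside each function.
error_avg_sim <- function(k) {
  x_sim[1] <- k
  mean(x_sim)
}
error_median_sim <- function(k) {
  x_sim[1] <- k
  median(x_sim)
}

# The average shifts by roughly k / n, while the median hardly moves
avg_shift <- error_avg_sim(10000) - mean(x_sim)
median_shift <- error_median_sim(10000) - median(x_sim)
c(avg_shift = avg_shift, median_shift = median_shift)
```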
2.2 Customizing Plots
2.2.1 Layers
- Layers can define geometries, compute summary statistics, define what scales to use, and even change styles.
- Step 1: Geometry:
  - scatter plot: geom_point()
- Step 2: Aesthetics:
  - x
  - y
  - alpha
  - colour
# pipe the murders dataset into ggplot
murders %>% ggplot() +
geom_point(aes(x = population/10^6, y = total))
# adding labels and text to the plot
p <- ggplot(data = murders)
p + geom_point(aes(x = population/10^6, y = total)) +
geom_text(aes(population/10^6, total, label = abb))
2.2.2 Tinkering
# adding size
# and moving the label to the right
p <- ggplot(data = murders)
p + geom_point(aes(x = population/10^6, y = total), size = 3) +
geom_text(aes(population/10^6, total, label = abb), nudge_x = 1)
# redefine p with a global aesthetic mapping inside the ggplot function
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size = 3) +
geom_text(nudge_x = 1)  # assumed value: same offset as in the previous plot
p + geom_point(size = 3) +
geom_text(aes(x = 10, y = 800, label = "Hello there"))
2.2.3 Scales, Labels, and Colors
p + geom_point(size = 3) +
geom_text(nudge_x = 0.05) +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10")
p + geom_point(size = 3) +
geom_text(nudge_x = 0.075) +
scale_x_log10() +
scale_y_log10()
p + geom_point(size = 3) +
geom_text(nudge_x = 0.075) +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010")
p <- murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_text(nudge_x = 0.075) +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010")
p + geom_point(size=3, color = "blue")
p + geom_point(aes(col=region), size = 3)
Section 3: Summarizing with dplyr
Section 3 introduces you to summarizing with dplyr.
After completing Section 3, you will:
- understand the importance of summarizing data in exploratory data analysis.
- be able to use the “summarize” verb in dplyr to facilitate summarizing data.
- be able to use the “group_by” verb in dplyr to facilitate summarizing data.
- be able to access values using the dot placeholder.
- be able to use “arrange” to examine data after sorting.
3.1 Summarizing with dplyr
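The notes for this subsection are empty, so the following is only a minimal sketch of the verbs named in the objectives above, not the course's own code. It assumes the dplyr package is installed, and the `toy` data frame is invented for illustration.

```r
library(dplyr)

# Small made-up data frame, invented for illustration
toy <- data.frame(
  sex = c("Female", "Female", "Male", "Male", "Male"),
  height = c(64, 66, 68, 70, 72)
)

# group_by() splits the rows by sex, summarize() collapses each group
# to one row, and arrange() sorts the result by descending average
toy_summary <- toy %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height)) %>%
  arrange(desc(average))

toy_summary
```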
Section 4: Gapminder
In Section 4, you will look at a case study involving data from the Gapminder Foundation about trends in world health and economics.
After completing Section 4, you will:
- understand how Hans Rosling and the Gapminder Foundation use effective data visualization to convey data-based trends.
- be able to apply the ggplot2 techniques from the previous section to answer questions using data.
- understand how fixed scales across plots can ease comparisons.
- be able to modify graphs to improve data visualization.
4.1 Introduction to Gapminder
4.2 Using the Gapminder Dataset
Section 5: Data Visualization Principles
Section 5 covers some general principles that can serve as guides for effective data visualization.
After completing Section 5, you will:
- understand basic principles of effective data visualization.
- understand the importance of keeping your goal in mind when deciding on a visualization approach.
- understand principles for encoding data, including position, aligned lengths, angles, area, brightness, and color hue.
- know when to include the number zero in visualizations.
- be able to use techniques to ease comparisons, such as using common axes, putting visual cues to be compared adjacent to one another, and using color effectively.