Section 1: Data Visualization


1.1 Introduction to Data Visualization

Data Visualization

Plots of data easily communicate information that is difficult to extract from tables of raw values. Data visualization can help discover biases, systematic errors, mistakes and other unexpected problems in data before those data are incorporated into potentially flawed analysis.

John Tukey, considered the father of exploratory data analysis (EDA), once said, “The greatest value of a picture is when it forces us to notice what we never expected to see.”

Data visualization is also now pervasive in education and philanthropy. One example comes from Gapminder (the organization behind the Worldview Upgrader) and the talks New Insights on Poverty and The Best Stats You’ve Ever Seen, in which Hans Rosling forces us to notice the unexpected with a series of plots related to world health and economics.

Distributions

The most basic statistical summary of a list of objects is its distribution. A distribution is a function or description that shows the possible values of a variable and how often those values occur. For categorical variables, the distribution describes the proportions of each category.

Data Types

Categorical data or variables, as defined by Wikipedia, are “variables that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. In computer science and some branches of mathematics, categorical variables are referred to as enumerations or enumerated types. Commonly (though not in this article), each of the possible values of a categorical variable is referred to as a level.”

  • Ordinal categorical data have an inherent order to the categories (mild/medium/hot, for example).
  • Non-ordinal categorical data have no order to the categories (male/female, for example).

Numeric data or variables hold numeric values; numeric is the default data type for numbers in R. Examples of numeric values are 1, 34.5, 3.145, -24, and -45.003. Numerics fall into one of two categories, illustrated in the sketch after this list:

  • Continuous variables can take any value in a range (1, 3.145, -0.005)
  • Discrete variables are limited to sets of specific values, such as whole integers
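
In R, categorical data are typically stored as factors, and ordinal categories as ordered factors. A brief illustration (the spice vector is a made-up example):

x <- c(1, 34.5, 3.145, -24) # numeric (continuous) values
class(x) # "numeric"
spice <- factor(c("mild", "hot", "medium"),
                levels = c("mild", "medium", "hot"), ordered = TRUE) # ordinal categorical data
class(spice) # "ordered" "factor"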

1.2 Introduction to Distributions

A distribution is a function or description that shows the possible values of a variable and how often those values occur.

For categorical variables, the distribution describes the proportions of each category, or as Wikipedia describes it, “In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified.”

A frequency table is the simplest way to show a categorical distribution. Use prop.table() to convert a table of counts to a table of proportions. Barplots display the distribution of categorical variables and are a natural way to visualize the information in a frequency table once the data are grouped into categories.

# load the dataset
library(dslabs)
data(heights)

# make a table of category proportions
prop.table(table(heights$sex))
## 
##    Female      Male 
## 0.2266667 0.7733333
# consider a bar chart here to display male/female proportions
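
Following that suggestion, one minimal way to draw such a bar chart with base R (reusing the frequency table above):

barplot(prop.table(table(heights$sex)), ylab = "Proportion") # bars show the female/male proportions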

For continuous numerical data, reporting the frequency of each unique entry is not an effective summary, as many or most values are unique. Instead, a distribution function is required. A histogram divides the data into non-overlapping bins of the same size and plots the count of values that fall in each interval.

# dplyr (for %>% and filter()) and ggplot2 are loaded with the tidyverse
library(tidyverse)
p <- heights %>%
  filter(sex == "Male") %>%
  ggplot(aes(x = height))
# basic histogram with 1-inch bins
# (p + geom_histogram() alone would fall back to a default of 30 bins)
p + geom_histogram(binwidth = 1)

Cumulative Distribution Function

Every continuous distribution, normal or otherwise, has a cumulative distribution function (CDF). The CDF gives the proportion of the data at or below a given value a, for all values of a:

F(a) = Pr(x ≤ a)

We can leverage the CDF to determine the probability that a value will fall between, say, 68.5 and 69.5 inches: that probability is F(69.5) − F(68.5). The CDF is essential for calculating probabilities related to continuous data. For the male heights data from the previous section, we can compute and plot the empirical CDF directly:

library(ggplot2)
library(dslabs)
data(heights)

# create a vector of data to plot the CDF for:
b <- heights$height
# obtain a sequence of 100 values from the data starting from the minimum value of the dataset through the maximum value
a <- seq(min(b), max(b), length = 100)  
# define a function that returns the proportion of heights less than or equal to a value:
cdf_function <- function(x) { 
  mean(heights$height <= x)
}
cdf_values <- sapply(a, cdf_function) # apply the function to every value in the vector a
plot(a, cdf_values) # plot the empirical CDF

Smooth Density Plots

A smooth density plot is, in essence, a histogram computed with very small bins and then smoothed with a kernel. Note how the y-axis is now labeled ‘density’ rather than count, how local peaks have been removed, and how sharp edges between intervals have been smoothed out.

The degree of smoothing can be configured, but we should select a degree of smoothness that we can defend as representative of the underlying data. The degree of smoothness is controlled by an argument in the plotting function, sketched below (and covered in more detail later).
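
For instance, with ggplot2 the geom_density() geometry draws a smooth density, and its adjust argument scales the default bandwidth; a quick sketch reusing the heights data from above:

library(dplyr)
library(ggplot2)
heights %>%
    filter(sex == "Male") %>%
    ggplot(aes(x = height)) +
    geom_density(fill = "#0072b2", adjust = 1) # adjust > 1 smooths more, adjust < 1 smooths less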

Normal Distribution

When using R, there are a variety of distributions we may deal with and, correspondingly, a variety of functions built into R for working with them.

A normal (or Gaussian or Laplace–Gauss) distribution has a number of defining properties:
* It is centered around one value, the mean (which can be calculated using mean())
* It is symmetric around the mean
* It can be defined entirely and succinctly by its mean (μ) and standard deviation (σ)
* It always has the same proportion of observations within a given distance of the mean (for example, about 95% within 2σ)
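
For reference, the normal density with mean μ and standard deviation σ is

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

which makes clear that the entire distribution is determined by μ and σ alone.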

data(heights)
index <- heights$sex=="Male"
x <- heights$height[index]
average <- mean(x) # built in mean function

The standard deviation is, loosely speaking, the average distance between a value and the mean; more precisely, it is the square root of the average squared distance from the mean. It can be calculated using the sd() function.

SD <- sd(x) #built in standard deviation function
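
As a sanity check, note that sd() computes the sample standard deviation, dividing the summed squared deviations by n - 1 rather than n; a by-hand equivalent using the x and average defined above:

SD_manual <- sqrt(sum((x - average)^2) / (length(x) - 1)) # sum of squared deviations over n - 1, square-rooted
all.equal(SD_manual, sd(x)) # TRUE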

Standard units describe how many standard deviations a value is away from the mean. Compute standard units with the scale() function.

z <- scale(x) # calculate standard units using scale()

Important: to calculate the proportion of values that meet a certain condition, use the mean() function on a logical vector. Because TRUE is converted to 1 and FALSE is converted to 0, taking the mean of this vector yields the proportion of TRUE values.

# calculate proportion of values within 2 SD of mean
mean(abs(z) < 2)
## [1] 0.9495074

Standard Units and Z-scores

Standard units are a way of putting different kinds of observations on the same scale. The idea is to replace a datum by the number of standard deviations it is above the mean of the data. If a datum is above the mean, its value in standard units is positive; if it is below the mean, its value in standard units is negative. A datum that is above the mean by 2.5 times the SD is 2.5 in standard units.

Standard units are denoted by the variable z and are also known as z-scores.

Z-scores are useful to quickly evaluate whether an observation is average or extreme:
* z-scores near 0 are average,
* z-scores above 2 or below -2 are well above or below the mean, and
* z-scores above 3 or below -3 are extremely rare.

The scale() function converts a vector of values into z-scores by subtracting the mean and dividing by the standard deviation.
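
A quick check, using the male heights vector x from above, that scale() matches the by-hand computation:

z <- scale(x)
z_manual <- (x - mean(x)) / sd(x) # the same computation done by hand
all.equal(as.numeric(z), z_manual) # TRUE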


The Normal CDF and pnorm

The cumulative distribution function (CDF) for the normal distribution is defined by a mathematical formula, which in R can be evaluated with the function pnorm().

pnorm(a, avg, s) gives the value of the cumulative distribution function F(a) for the normal distribution with mean avg and standard deviation s. Once we have this, for any population well approximated by the normal distribution, we can take any value and compute the probability that a randomly drawn observation falls at or below it.

As an example, take male heights. Assume you want to know the probability that a male is taller than 70.5 inches. Using pnorm(), you feed in 70.5 along with the mean and the standard deviation of the population. What is returned is the proportion of the population that, according to the cumulative distribution, is 70.5 inches tall or SHORTER. If you subtract this value from 1, you arrive at the probability of a height GREATER than 70.5:

x <- heights %>% filter(sex=="Male") %>% pull(height)
1 - pnorm(70.5, mean(x), sd(x)) # probability that a randomly selected male is TALLER than 70.5 inches
## [1] 0.371369

So, there is about a 37% chance that a randomly selected male is taller than 70.5 inches.
But what would be the benefit of calculating this for every single reported height? Instead, we determine the probability of intervals: the probability that an individual measures between one value and another.

Reported heights tend to be rounded to the nearest integer, a phenomenon called discretization. As a result, the normal approximation from pnorm() matches the actual proportions closely for length-1 intervals that contain an integer, but not as well for intervals that do not.

# plot distribution of exact heights in data
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")

# probabilities in actual data over length-1 ranges containing an integer
mean(x <= 68.5) - mean(x <= 67.5)
## [1] 0.114532
mean(x <= 69.5) - mean(x <= 68.5)
## [1] 0.1194581
mean(x <= 70.5) - mean(x <= 69.5)
## [1] 0.1219212
# probabilities from the normal approximation match these well
pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x))
## [1] 0.1031077
pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))
## [1] 0.1097121
pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))
## [1] 0.1081743
# probabilities in actual data over ranges that do not contain an integer...
mean(x <= 70.9) - mean(x <= 70.1)
## [1] 0.02216749
pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x)) # ...do NOT match the normal approximation well (discretization)
## [1] 0.08359562

1.3 Quantiles, Percentiles, and Boxplots

Definition of quantiles

Quantiles are cutoff points that divide a dataset into intervals with set probabilities. The qth quantile is the value at which q% of the observations are equal to or less than that value.

The simplest quantiles are the 0 percent and 100 percent quantiles. You could just as easily call them the minimum and maximum, because that’s what they are. You can get both together using the range() function, which conveniently gives you the range of the data. So, to find the range of the heights, you simply do:

range(heights$height)
## [1] 50.00000 82.67717

The range still gives you only limited information. Often statisticians report the first and third quartiles together with the range and the median. These can be returned using the quantile() function.

quantile(heights$height)
##       0%      25%      50%      75%     100% 
## 50.00000 66.00000 68.50000 71.00000 82.67717

The quartiles are not always the same as the lower and upper hinges calculated in the five-number summary. The hinges are, respectively, the median of the lower and upper half of your data, and they can differ slightly from the first and third quartiles. If you want the hinges, use the fivenum() function.

fivenum(heights$height)
## [1] 50.00000 66.00000 68.50000 71.00000 82.67717

Lastly, you can select your own quantiles by passing the desired probabilities as an argument:

quantile(heights$height, 0.30) # returns the 30th percentile
##   30% 
## 66.92
quantile(heights$height, probs = c(0.05, 0.95)) # returns multiple quantiles for a concatenated vector of probabilities
##       5%      95% 
## 62.00000 74.80315

If you want percentiles:

p <- seq(0.01, 0.99, 0.01)
quantile(heights$height, p) # returns the 1st through 99th percentiles
##       1%       2%       3%       4%       5%       6%       7%       8% 
## 59.00000 60.00000 60.00000 61.00000 62.00000 62.00000 62.28869 63.00000 
##       9%      10%      11%      12%      13%      14%      15%      16% 
## 63.00000 63.00000 63.77953 64.00000 64.00000 64.00000 64.17321 64.96000 
##      17%      18%      19%      20%      21%      22%      23%      24% 
## 65.00000 65.00000 65.00000 65.00000 65.00000 65.94457 66.00000 66.00000 
##      25%      26%      27%      28%      29%      30%      31%      32% 
## 66.00000 66.00000 66.00000 66.00000 66.50000 66.92000 66.92913 67.00000 
##      33%      34%      35%      36%      37%      38%      39%      40% 
## 67.00000 67.00000 67.00000 67.00000 67.00000 67.00000 67.32200 67.71650 
##      41%      42%      43%      44%      45%      46%      47%      48% 
## 67.72540 68.00000 68.00000 68.00000 68.00000 68.00000 68.00000 68.00000 
##      49%      50%      51%      52%      53%      54%      55%      56% 
## 68.11024 68.50000 68.89000 68.89764 69.00000 69.00000 69.00000 69.00000 
##      57%      58%      59%      60%      61%      62%      63%      64% 
## 69.00000 69.00000 69.00000 69.00000 69.60000 70.00000 70.00000 70.00000 
##      65%      66%      67%      68%      69%      70%      71%      72% 
## 70.00000 70.00000 70.00000 70.00000 70.00000 70.07874 70.73700 70.86614 
##      73%      74%      75%      76%      77%      78%      79%      80% 
## 71.00000 71.00000 71.00000 71.00000 71.00000 71.11000 72.00000 72.00000 
##      81%      82%      83%      84%      85%      86%      87%      88% 
## 72.00000 72.00000 72.00000 72.00000 72.00000 72.00000 72.00000 72.44011 
##      89%      90%      91%      92%      93%      94%      95%      96% 
## 72.83465 73.00000 73.00000 74.00000 74.00000 74.00000 74.80315 75.00000 
##      97%      98%      99% 
## 75.00000 76.00000 78.00000

Or, if you want a quick overview of the most common summary statistics:

summary(heights$height) # returns the min, 1st quartile, median, mean, 3rd quartile, and max
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.00   66.00   68.50   68.32   71.00   82.68

Finding quantiles with qnorm

The qnorm() function gives the THEORETICAL value of a quantile with probability p of observing a value equal to or less than that quantile value given a normal distribution with mean mu and standard deviation sigma. By default, mu=0 and sigma=1. Therefore, calling qnorm() with no arguments gives quantiles for the standard normal distribution.

Recall that quantiles are defined such that p is the probability of a random observation less than or equal to the quantile.
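
A couple of quick illustrations (the second reuses the male heights vector x from earlier):

qnorm(0.975) # 1.959964: 97.5% of standard normal values fall at or below about 1.96
pnorm(qnorm(0.975)) # 0.975: qnorm() is the inverse of pnorm()
qnorm(0.99, mean = mean(x), sd = sd(x)) # theoretical 99th percentile of male heights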

Quantile-Quantile Plots

A quantile-quantile plot, or QQ-plot, is a tool to help determine how well a distribution is approximated by a normal distribution.

In a QQ-plot, the sample (actual) quantiles in the observed data are compared to the theoretical quantiles expected from the normal distribution.

  1. Calculate sample quantiles (observed quantiles) using the quantile() function
  2. Calculate theoretical quantiles with the qnorm() function. qnorm() will calculate quantiles for the standard normal distribution (μ = 0, σ = 1) by default, but it can calculate quantiles for any normal distribution given mean and sd arguments.

If the data are well-approximated by the normal distribution, then the points on the QQ-plot will fall near the identity line (sample = theoretical).

Let’s take an example: heights for men. If we take the actual quantiles of the observed data we collected and compare them to theoretical quantiles generated using qnorm(), we can see how well the normal distribution approximates male heights.

# subset the data to male heights:
data(heights)
index <- heights$sex=="Male"
x <- heights$height[index]

Now we have a vector, x. Let’s start by computing quantiles at 5% increments and saving them in a variable, observed_quantiles:

p <- seq(0.05, 0.95, 0.05)
observed_quantiles <- quantile(x, p)

Now, using qnorm(), let’s carve up the theoretical quantiles based on the average and sd values of x:

theoretical_quantiles <- qnorm(p, mean = mean(x), sd = sd(x))

Now, let’s plot them against each other; the points fall close to a line:

plot(theoretical_quantiles, observed_quantiles, col=c("#0072b2","#009e73"))

So thus far, we’ve compared the two on their original scales, but let’s now compare them following scaling.

z <- scale(x)
observed_quantiles <- quantile(z, p) # quantiles of the scaled data
theoretical_quantiles <- qnorm(p) # qnorm() defaults to the standard normal (mean 0, sd 1)
plot(theoretical_quantiles, observed_quantiles, col=c("#0072b2","#009e73")) # now plot them
abline(0,1,col="red")
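
ggplot2 also provides this plot as a geometry; a compact sketch using geom_qq(), which takes a sample aesthetic and defaults to the standard normal distribution:

library(dplyr)
library(ggplot2)
heights %>%
    filter(sex == "Male") %>%
    ggplot(aes(sample = scale(height))) +
    geom_qq() + # sample vs. standard normal quantiles
    geom_abline() # identity line: sample = theoretical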

Boxplots

When data do not follow a normal distribution and cannot be succinctly summarized by only the mean and standard deviation, an alternative is to report a five-number summary: range (ignoring outliers) and the quartiles (25th, 50th, 75th percentile).

In a boxplot, the box is defined by the 25th and 75th percentiles and the median is a horizontal line through the box. The whiskers show the range excluding outliers, and outliers are plotted separately as individual points. The interquartile range is the distance between the 25th and 75th percentiles. Boxplots are particularly useful when comparing multiple distributions.
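
A minimal sketch comparing the male and female height distributions side by side:

library(ggplot2)
library(dslabs)
data(heights)
ggplot(heights, aes(x = sex, y = height)) +
    geom_boxplot() # box = 25th to 75th percentile, line = median, points = outliers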


Section 2: Introduction to ggplot2


Section 3: Summarizing with dplyr


Section 4: Gapminder


Section 5: Data Visualization Principles

Slope Charts

When examining the relationship between two variables, it is often useful to use scatterplots. One exception might be when you are comparing values of the same variable at different points in time.

There is no dedicated geometry for slope charts in ggplot2, but we can construct one using geom_line(). An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines.

Leveraging the gapminder dataset, we create a utility vector we’ll call ‘west’:

west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

Now a data frame we’ll call ‘dat’, filtering from gapminder those rows that:
* include either 2010 or 2015, AND
* include regions in our vector west, AND
* have a life_expectancy that is NOT NA, AND
* have a population greater than 10^7, or 10 million

library(tidyverse)
library(dslabs); data(gapminder)
dat <- gapminder %>%
    filter(year %in% c(2010, 2015) & region %in% west & !is.na(life_expectancy) & population > 10^7)

With this data frame now in hand, we can start to plot the life expectancy data we have collected from 2010 to 2015 and quickly ascertain that it has risen across all considered regions:

dat %>%
    mutate(location = ifelse(year == 2010, 1, 2),
           # nudge the 2015 labels for the UK and Portugal to avoid overlap
           location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"),
                             location + 0.22, location),
           # right-justify the 2010 labels, left-justify the 2015 labels
           hjust = ifelse(year == 2010, 1, 0)) %>%
    mutate(year = as.factor(year)) %>%
    ggplot(aes(year, life_expectancy, group = country)) +
    geom_line(aes(color = country), show.legend = FALSE) +
    geom_text(aes(x = location, label = country, hjust = hjust), show.legend = FALSE) +
    xlab("") +
    ylab("Life Expectancy") 

By applying a slope chart, we take advantage of a common axis, which makes comparisons across a few data points easy. However, what happens when you want to evaluate more than just a few?

The Bland-Altman Plot

Since what we are interested in is differences, it makes sense to dedicate one of our axes to differences. The Bland-Altman plot, also known as the Tukey mean-difference plot or the MA plot, shows the difference versus the average.

library(ggrepel)
dat %>%
    mutate(year = paste0("life_expectancy_", year)) %>%
    select(country, year, life_expectancy) %>% spread(year, life_expectancy) %>%
    mutate(average = (life_expectancy_2015 + life_expectancy_2010)/2,
                difference = life_expectancy_2015 - life_expectancy_2010) %>%
    ggplot(aes(average, difference, label = country)) +
    geom_point() +
    geom_text_repel() +
    geom_abline(lty = 2) +
    xlab("Average of 2010 and 2015") +
    ylab("Difference between 2015 and 2010")

The countries that have improved the most are easy to identify, since the improvement is what is represented on the y-axis. We also get an idea of the overall values from the x-axis.

The Bland-Altman plot (Tukey mean difference plot, MA plot) graphs the difference between conditions on the y-axis and the mean between conditions on the x-axis. It is more appropriate for large numbers of observations than slope charts.

Encode a Third Variable

  • Encode a categorical third variable on a scatterplot using color hue or shape. Use the shape argument to control shape.
  • Encode a continuous third variable on a scatterplot using color intensity or size (a sketch of both follows).
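
A sketch of these encodings using the dslabs gapminder data (fertility and life_expectancy are columns in that dataset):

library(tidyverse)
library(dslabs)
data(gapminder)
gapminder %>%
    filter(year == 2010 & !is.na(fertility)) %>%
    ggplot(aes(fertility, life_expectancy,
               color = continent, # categorical third variable encoded as hue
               size = population)) + # continuous variable encoded as size
    geom_point(alpha = 0.5)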

Case Study: Vaccines

Case studies on vaccines can be fueled by datasets made available through the Tycho Project. The Tycho Project helps institutions and researchers make valuable global health data available to others, contributing to open and reproducible science through open data. These data include weekly reported counts for seven diseases from 1928 to 2011 in all 50 states.

The dslabs package also makes available the yearly totals:

library(tidyverse)
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame':    16065 obs. of  6 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year           : num  1966 1967 1968 1969 1970 ...
##  $ weeks_reporting: num  50 49 52 49 51 51 45 45 45 46 ...
##  $ count          : num  321 291 314 380 413 378 342 467 244 286 ...
##  $ population     : num  3345787 3364130 3386068 3412450 3444165 ...
as_tibble(us_contagious_diseases)
## # A tibble: 16,065 x 6
##    disease     state    year weeks_reporting count population
##    <fct>       <fct>   <dbl>           <dbl> <dbl>      <dbl>
##  1 Hepatitis A Alabama  1966              50   321    3345787
##  2 Hepatitis A Alabama  1967              49   291    3364130
##  3 Hepatitis A Alabama  1968              52   314    3386068
##  4 Hepatitis A Alabama  1969              49   380    3412450
##  5 Hepatitis A Alabama  1970              51   413    3444165
##  6 Hepatitis A Alabama  1971              51   378    3481798
##  7 Hepatitis A Alabama  1972              45   342    3524543
##  8 Hepatitis A Alabama  1973              45   467    3571209
##  9 Hepatitis A Alabama  1974              45   244    3620548
## 10 Hepatitis A Alabama  1975              46   286    3671246
## # ... with 16,055 more rows

This exercise will focus only on measles. We create a utility string, the_disease, and store the filtered data frame in a variable we’ll call dat:
* Pipe us_contagious_diseases into filter,
* Exclude Hawaii and Alaska (which only became states in the late 1950s) AND keep only rows where the disease is measles, THEN
* Pipe the filtered results to mutate, where we add a new column called rate: the yearly measles cases per 10,000 people, adjusted for the number of weeks reporting, THEN
* Reorder the state factor by rate

the_disease <- "Measles"
dat <- us_contagious_diseases %>%
    filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
    mutate(rate = count / population * 10000 * 52/weeks_reporting) %>%
    mutate(state = reorder(state, rate))

We can now easily plot the rate of measles per year:
* Take the dat data set and filter to rows where state is ‘California’ and rate is NOT NA (recall this as the data component step), THEN
* Pipe this to ggplot and map year and rate to x and y (recall this as the aesthetic mapping component), THEN
* Select geom_line() as the geometry (recall this as the geometry component), AND
* Add some visual tweaks to assist with interpreting the data, including a vertical line at 1963, the year the measles vaccine was introduced (recall this as the style component):

dat %>% filter(state == "California" & !is.na(rate)) %>%
    ggplot(aes(year, rate)) +
    geom_line() +
    ylab("Cases per 10,000") +
    geom_vline(xintercept=1963, col = "blue")

But this is just California, and just measles. How can we show the data for all states, in one plot?

dat %>% ggplot(aes(year, state, fill=rate)) +
    geom_tile(color = "grey50") + # one tile per state-year, shaded by rate
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") + # sqrt transform tempers extreme rates
    geom_vline(xintercept = 1963, col = "blue") + # year the vaccine was introduced
    theme_minimal() + theme(panel.grid = element_blank()) +
    ggtitle(the_disease) +
    ylab("") +
    xlab("")

# compute US average measles rate by year
avg <- us_contagious_diseases %>%
    filter(disease == the_disease) %>% group_by(year) %>%
    summarize(us_rate = sum(count, na.rm = TRUE)/sum(population, na.rm = TRUE)*10000)

# make line plot of measles rate by year by state
dat %>%
    filter(!is.na(rate)) %>%
    ggplot() +
    geom_line(aes(year, rate, group = state), color = "grey50", 
        show.legend = FALSE, alpha = 0.2, size = 1) +
    geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "black") +
    scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
    ggtitle("Cases per 10,000 by state") +
    xlab("") +
    ylab("") +
    geom_text(data = data.frame(x = 1955, y = 50),
        mapping = aes(x, y, label = "US average"), color = "black") +
    geom_vline(xintercept = 1963, col = "blue")