Plots of data easily communicate information that is difficult to extract from tables of raw values. Data visualization can help discover biases, systematic errors, mistakes and other unexpected problems in data before those data are incorporated into potentially flawed analysis.
John Tukey, considered the father of exploratory data analysis (EDA) once said, “The greatest value of a picture is when it forces us to notice what we never expected to see.”
Data visualization is also now pervasive in education and philanthropy. One example comes from the Gapminder Foundation (creator of tools such as the Worldview Upgrader) and the talks New Insights on Poverty and The Best Stats You’ve Ever Seen, in which Hans Rosling forces us to notice the unexpected with a series of plots on world health and economics.
The most basic statistical summary of a list of objects is its distribution. A distribution is a function or description that shows the possible values of a variable and how often those values occur. For categorical variables, the distribution describes the proportions of each category.
Categorical data or variables, as defined by Wikipedia, are "variables that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. In computer science and some branches of mathematics, categorical variables are referred to as enumerations or enumerated types. Commonly (though not in this article), each of the possible values of a categorical variable is referred to as a level."
Numeric data or variables hold numeric values; numeric is the default data type for numbers in R. Examples of numeric values are 1, 34.5, 3.145, -24 and -45.003. Numerics fall into one of two categories: discrete (countable values, such as counts) and continuous (values that can take any value within an interval, such as heights).
For categorical variables, the distribution describes the proportions of each category, or as Wikipedia describes it, “In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified.”
A frequency table is the simplest way to show a categorical distribution. Use prop.table() to convert a table of counts into a table of proportions. Barplots display the distribution of categorical variables and are a graphical way to visualize the information in a frequency table.
# load the dataset
library(dslabs)
data(heights)
# make a table of category proportions
prop.table(table(heights$sex))
##
## Female Male
## 0.2266667 0.7733333
# consider a bar chart here to display male/female proportions
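As the comment suggests, a bar chart is a natural way to display these proportions. A minimal sketch using base R’s barplot() on the proportion table computed above:
# bar chart of the male/female proportions
barplot(prop.table(table(heights$sex)), ylab = "Proportion")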
For continuous numerical data, reporting the frequency of each unique entry is not an effective summary, as many or most values are unique. Instead, a distribution function is required. A histogram divides the data into non-overlapping bins of the same size and plots the count of values that fall in each bin.
library(tidyverse)  # provides ggplot2 and the %>% pipe
p <- heights %>%
  filter(sex == "Male") %>%
  ggplot(aes(x = height))
# basic histogram with default bins
# p + geom_histogram()
p + geom_histogram(binwidth = 1)
Every continuous distribution, normal or otherwise, has a cumulative distribution function (CDF). The CDF, F(a), gives the proportion of the data less than or equal to a, for all values of a: F(a) = Pr(x <= a). For example, the male heights data we used in the previous section has the CDF computed and plotted below.
We can leverage the CDF to determine the probability that a value will fall between, say, 68.5 and 69.5 inches: it is simply F(69.5) - F(68.5). The CDF is essential for calculating probabilities related to continuous data.
library(ggplot2)
library(dslabs)
data(heights)
# create a vector of data to plot the CDF for:
b <- heights$height
# obtain a sequence of 100 values from the data starting from the minimum value of the dataset through the maximum value
a <- seq(min(b), max(b), length = 100)
# create a custom function that receives a value and returns the average value for all heights less than or equal to said value:
cdf_function <- function(x) {
mean(heights$height <= x)
}
cdf_values <- sapply(a, cdf_function) # sapply applies the function call to every value in the a vector
plot(a, cdf_values) #and plots the values
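Using the empirical CDF defined above, the probability that a height falls between 68.5 and 69.5 inches is just the difference of two CDF values:
# Pr(68.5 < height <= 69.5) from the empirical CDF
cdf_function(69.5) - cdf_function(68.5)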
Smooth density plots can be thought of as histograms with very small bins whose counts have been smoothed with a kernel. Note how the y-axis has been replaced by ‘density’, how local peaks have been removed and how sharp edges and intervals have been smoothed out.
The degree of smoothing can be configured, but we should select a degree of smoothness that we can defend as being representative of the underlying data. The degree of smoothness is controlled by an argument in the plotting function, as sketched below.
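As a sketch of how this looks in ggplot2 (reusing the p object built for the histogram above), geom_density() draws the smooth density and its adjust argument scales the default bandwidth; the value 2 and the fill color here are just illustrative choices:
# smooth density of male heights; larger adjust = smoother curve
p + geom_density(fill = "#00BFC4", adjust = 2)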
When using R, there are a variety of distributions we may deal with and, accordingly, a variety of functions built into R for working with them.
A normal (also known as Gaussian or Laplace–Gauss) distribution has a number of defining properties:
* It is centered around one value, the mean (which can be calculated using mean())
* It is symmetric around the mean
* It can be defined entirely and succinctly by its mean (μ) and standard deviation (σ)
* It always has the same proportion of observations within a given distance of the mean (for example, about 95% within 2 σ; see the quick check below)
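A quick check of the last property using pnorm(), the normal CDF function covered in more detail below: the proportion within 1, 2 and 3 standard deviations of the mean is the same for every normal distribution.
pnorm(1) - pnorm(-1)   # ~0.68 within 1 SD
pnorm(2) - pnorm(-2)   # ~0.95 within 2 SD
pnorm(3) - pnorm(-3)   # ~0.997 within 3 SD
Returning to the male heights example: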
data(heights)
index <- heights$sex=="Male"
x <- heights$height[index]
average <- mean(x) # built in mean function
The standard deviation is, roughly speaking, the typical distance between a value and the mean (formally, the square root of the average squared deviation from the mean). The standard deviation can be calculated using the sd() function.
SD <- sd(x) #built in standard deviation function
Standard units describe how many standard deviations a value is away from the mean. Compute standard units with the scale() function.
z <- scale(x) # calculate standard units using scale()
Important: to calculate the proportion of values that meet a certain condition, use the mean() function on a logical vector. Because TRUE is converted to 1 and FALSE is converted to 0, taking the mean of this vector yields the proportion of TRUE.
# calculate proportion of values within 2 SD of mean
mean(abs(z) < 2)
## [1] 0.9495074
Standard units are a way of putting different kinds of observations on the same scale. The idea is to replace a datum by the number of standard deviations it is above the mean of the data. If a datum is above the mean, its value in standard units is positive; if it is below the mean, its value in standard units is negative. A datum that is above the mean by 2.5 times the SD is 2.5 in standard units.
Standard units are denoted by the variable z and are also known as z-scores.
Z-scores are useful to quickly evaluate whether an observation is average or extreme.
* Z-scores near 0 are average.
* Z-scores above 2 or below -2 are significantly above or below the mean, and
* z-scores above 3 or below -3 are extremely rare.
The scale() function converts a vector of approximately normally distributed values into z-scores.
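Under the hood, scale() with its default arguments is equivalent to subtracting the mean and dividing by the standard deviation, so a minimal sketch of the same computation by hand is:
# z-scores computed manually; matches scale(x) up to its matrix wrapper
z_manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), z_manual)   # TRUE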
The cumulative distribution function (CDF) for the normal distribution is defined by a mathematical formula, which in R can be evaluated with the function pnorm().
pnorm(a, avg, s) gives the value of the cumulative distribution function F(a) for the normal distribution with mean avg and standard deviation s. Once we have this, for any population that is well approximated by the normal distribution, we can take any cutoff value and estimate the probability that a randomly drawn value falls at or below (or above) it.
As an example, take male heights. Suppose you want to know the probability that a randomly selected male is 70.5 inches tall or greater. Using pnorm(), you feed in 70.5 along with the mean and the standard deviation of the population. What is returned is the proportion of the population that, according to the cumulative distribution, is 70.5 inches or SHORTER. Subtracting this value from 1 gives the probability of a height GREATER than 70.5 inches:
x <- heights %>% filter(sex=="Male") %>% pull(height)
1 - pnorm(70.5, mean(x), sd(x)) #Based on male heights, determine the probability of a sample being drawn where the male is TALLER than 70.5 inches (see intuition above)
## [1] 0.371369
So there is roughly a 37% chance that a randomly selected male is 70.5 inches or taller.
But what would be the benefit of calculating this for every single reported height? Instead, we work with the probability of intervals: the probability that an individual will measure between one value and another.
Because reported heights tend to be rounded to the nearest integer, intervals of length one that contain an integer (for example, 67.5 to 68.5) are approximated well by pnorm(). For intervals that do not contain an integer (for example, 70.1 to 70.9), the empirical proportions and the normal approximation disagree noticeably. This rounding of the underlying data is called discretization.
# plot distribution of exact heights in data
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")
# probabilities in actual data over length 1 ranges containing an integer
mean(x <= 68.5) - mean(x <= 67.5) # empirical proportion for an interval containing an integer
## [1] 0.114532
mean(x <= 69.5) - mean(x <= 68.5)
## [1] 0.1194581
mean(x <= 70.5) - mean(x <= 69.5)
## [1] 0.1219212
# probabilities in normal approximation match well
pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x)) #pnorm() matches well
## [1] 0.1031077
pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))
## [1] 0.1097121
pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))
## [1] 0.1081743
# probabilities in actual data over other ranges don't match normal approx as well
mean(x <= 70.9) - mean(x <= 70.1) # empirical proportion for an interval that does not contain an integer
## [1] 0.02216749
pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x)) # pnorm() does NOT match well here (discretization)
## [1] 0.08359562
Quantiles are cutoff points that divide a dataset into intervals with set probabilities. The qth quantile is the value at which q % of the observations are equal to or less than that value.
The most-used quantiles are actually the 0 percent and 100 percent quantiles. You could just as easily call them the minimum and maximum, because that’s what they are. You can get both together using the range() function, which conveniently gives you the range of the data. So, to get the range of the heights, you simply do:
range(heights$height)
## [1] 50.00000 82.67717
The range still gives you only limited information. Often statisticians report the first and the third quartile together with the range and the median. These can be returned using the quantile() function.
quantile(heights$height)
## 0% 25% 50% 75% 100%
## 50.00000 66.00000 68.50000 71.00000 82.67717
The quartiles returned by quantile() are not the same as the lower and upper hinges of the five-number summary. The hinges are, respectively, the median of the lower and of the upper half of your data, and they can differ slightly from the first and third quartiles. If you want the five-number summary (minimum, lower hinge, median, upper hinge, maximum), use the fivenum() function.
fivenum(heights$height)
## [1] 50.00000 66.00000 68.50000 71.00000 82.67717
Lastly, you can request your own quantiles by passing in the desired probability as an argument:
quantile(heights$height, 0.30) # returns the 0.30 quantile (30th percentile)
## 30%
## 66.92
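We can check this against the definition of a quantile above: roughly 30% of the heights should be at or below the value returned (ties in the data can push the proportion slightly higher).
q30 <- quantile(heights$height, 0.30)
mean(heights$height <= q30)   # approximately 0.30
Multiple quantiles can also be requested at once by passing a vector of probabilities: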
quantile(heights$height, probs = c(0.05, 0.95)) # returns the 5% and 95% quantiles
## 5% 95%
## 62.00000 74.80315
If you want percentiles:
p <- seq(0.01, 0.99, 0.01)
quantile(heights$height, p) # returns the 1st through 99th percentiles
## 1% 2% 3% 4% 5% 6% 7% 8%
## 59.00000 60.00000 60.00000 61.00000 62.00000 62.00000 62.28869 63.00000
## 9% 10% 11% 12% 13% 14% 15% 16%
## 63.00000 63.00000 63.77953 64.00000 64.00000 64.00000 64.17321 64.96000
## 17% 18% 19% 20% 21% 22% 23% 24%
## 65.00000 65.00000 65.00000 65.00000 65.00000 65.94457 66.00000 66.00000
## 25% 26% 27% 28% 29% 30% 31% 32%
## 66.00000 66.00000 66.00000 66.00000 66.50000 66.92000 66.92913 67.00000
## 33% 34% 35% 36% 37% 38% 39% 40%
## 67.00000 67.00000 67.00000 67.00000 67.00000 67.00000 67.32200 67.71650
## 41% 42% 43% 44% 45% 46% 47% 48%
## 67.72540 68.00000 68.00000 68.00000 68.00000 68.00000 68.00000 68.00000
## 49% 50% 51% 52% 53% 54% 55% 56%
## 68.11024 68.50000 68.89000 68.89764 69.00000 69.00000 69.00000 69.00000
## 57% 58% 59% 60% 61% 62% 63% 64%
## 69.00000 69.00000 69.00000 69.00000 69.60000 70.00000 70.00000 70.00000
## 65% 66% 67% 68% 69% 70% 71% 72%
## 70.00000 70.00000 70.00000 70.00000 70.00000 70.07874 70.73700 70.86614
## 73% 74% 75% 76% 77% 78% 79% 80%
## 71.00000 71.00000 71.00000 71.00000 71.00000 71.11000 72.00000 72.00000
## 81% 82% 83% 84% 85% 86% 87% 88%
## 72.00000 72.00000 72.00000 72.00000 72.00000 72.00000 72.00000 72.44011
## 89% 90% 91% 92% 93% 94% 95% 96%
## 72.83465 73.00000 73.00000 74.00000 74.00000 74.00000 74.80315 75.00000
## 97% 98% 99%
## 75.00000 76.00000 78.00000
Or, for a quick overall summary:
summary(heights$height) # returns the min, 1st quartile, median, mean, 3rd quartile and max
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 66.00 68.50 68.32 71.00 82.68
The qnorm() function gives the THEORETICAL quantile: the value for which there is probability p of observing a value equal to or less than it, for a normal distribution with mean mu and standard deviation sigma. By default, mu = 0 and sigma = 1, so calling qnorm() with only a probability gives quantiles for the standard normal distribution.
Recall that quantiles are defined such that p is the probability of a random observation less than or equal to the quantile.
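A couple of quick illustrations, reusing the male-height vector x defined earlier: with no mean or sd arguments qnorm() works on the standard normal, and it is the inverse of pnorm().
qnorm(0.975)                 # ~1.96: 97.5% of standard normal values fall at or below this
qnorm(0.5, mean(x), sd(x))   # theoretical median of male heights
pnorm(qnorm(0.975))          # 0.975: qnorm() inverts pnorm()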
A quantile-quantile plot, or QQ-plot, is an exercise to help determine how well a distribution is approximated by a normal distribution.
In a QQ-plot, the sample (actual) quantiles in the observed data are compared to the theoretical quantiles expected from the normal distribution.
If the data are well-approximated by the normal distribution, then the points on the QQ-plot will fall near the identity line (sample = theoretical).
Let’s take an example, heights for men. If we take the actual quantiles of the observed data we collected and compare them to theoretical quantiles generated using qnorm(), we can see how well the normal distribution approximates male heights.
# create an index for the male rows and extract the male heights:
data(heights)
index <- heights$sex=="Male"
x <- heights$height[index]
Now we have a data set, x. Let’s start by carving up 5% quantiles to compare and save them in a variable, observed_quantiles:
p <- seq(0.05, 0.95, 0.05)
observed_quantiles <- quantile(x, p)
Now, using qnorm(), let’s carve up the theoretical quantiles based on the average and sd values of x:
theoretical_quantiles <- qnorm(p, mean = mean(x), sd = sd(x))
Now, let’s plot them together; if the normal approximation is good, the points will fall close to a straight line:
plot(theoretical_quantiles, observed_quantiles, col=c("#0072b2","#009e73"))
So thus far, we’ve compared the two on their original scales, but let’s now compare them following scaling.
z <- scale(x)
observed_quantiles <- quantile(z, p) #scale the actual quantiles
theoretical_quantiles <- qnorm(p) #qnorm() defaults to scaling
plot(theoretical_quantiles, observed_quantiles, col=c("#0072b2","#009e73")) #now plot them
abline(0,1,col="red")
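For reference, ggplot2 can also produce a QQ-plot directly with geom_qq(); a minimal sketch comparing scaled male heights to the standard normal:
heights %>%
  filter(sex == "Male") %>%
  ggplot(aes(sample = scale(height))) +
  geom_qq() +
  geom_abline()   # identity line; points near it indicate a good normal approximation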
When data do not follow a normal distribution and cannot be succinctly summarized by only the mean and standard deviation, an alternative is to report a five-number summary: range (ignoring outliers) and the quartiles (25th, 50th, 75th percentile).
In a boxplot, the box is defined by the 25th and 75th percentiles and the median is a horizontal line through the box. The whiskers show the range excluding outliers, and outliers are plotted separately as individual points. The interquartile range is the distance between the 25th and 75th percentiles. Boxplots are particularly useful when comparing multiple distributions.
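As a quick sketch, a boxplot comparing the male and female height distributions side by side:
heights %>%
  ggplot(aes(x = sex, y = height)) +
  geom_boxplot()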
When examining the relationship between two variables, it is often useful to use scatterplots. One exception might be when you are examining the value of the same variable at different points in time.
There is no dedicated geometry for slope charts in ggplot2, but we can construct one using geom_line(). An advantage of the slope chart is that it lets us quickly get an idea of changes based on the slope of the lines.
Leveraging the gapminder dataset, we create a utility vector we’ll call ‘west’:
west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
Now a data frame we’ll call ‘dat’, filtering from gapminder those rows that:
* include either 2010 or 2015, AND…
* include regions in our vector west, AND…
* include only rows where life_expectancy is NOT NA, AND…
* the population is greater than 10^7, or 10 million
data(gapminder)
dat <- gapminder %>%
  filter(year %in% c(2010, 2015) & region %in% west & !is.na(life_expectancy) & population > 10^7)
With this data frame now in hand, we can start to plot the life expectancy data we have collected from 2010 to 2015 and quickly ascertain that it has risen across all considered regions:
dat %>%
  mutate(location = ifelse(year == 2010, 1, 2),
         location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"),
                           location + 0.22, location),
         hjust = ifelse(year == 2010, 1, 0)) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(year, life_expectancy, group = country)) +
  geom_line(aes(color = country), show.legend = FALSE) +
  geom_text(aes(x = location, label = country, hjust = hjust), show.legend = FALSE) +
  xlab("") +
  ylab("Life Expectancy")
By applying a slope chart, we take advantage of a common axis, which makes comparisons across a few data points easy. However, what happens when you want to evaluate more than just a few?
Since what we are interested in is differences, it makes sense to dedicate one of our axes to differences. The Bland-Altman plot, also known as the Tukey mean-difference plot or the MA plot, shows the difference versus the average.
library(ggrepel)
dat %>%
  mutate(year = paste0("life_expectancy_", year)) %>%
  select(country, year, life_expectancy) %>%
  spread(year, life_expectancy) %>%
  mutate(average = (life_expectancy_2015 + life_expectancy_2010)/2,
         difference = life_expectancy_2015 - life_expectancy_2010) %>%
  ggplot(aes(average, difference, label = country)) +
  geom_point() +
  geom_text_repel() +
  geom_abline(lty = 2) +
  xlab("Average of 2010 and 2015") +
  ylab("Difference between 2015 and 2010")
The countries that have improved the most are easy to identify, because the improvement is shown on the y-axis. We also get an idea of the overall values from the x-axis.
The Bland-Altman plot (Tukey mean difference plot, MA plot) graphs the difference between conditions on the y-axis and the mean between conditions on the x-axis. It is more appropriate for large numbers of observations than slope charts.
Case studies on vaccines can be fueled by datasets made available through Project Tycho. Project Tycho helps institutes and researchers make valuable global health data available to others, contributing to open and reproducible science through the offering of open data. The data include weekly reported counts for seven diseases from 1928 to 2011 from all 50 states.
The dslabs package also makes available the yearly totals:
library(tidyverse)
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame': 16065 obs. of 6 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 1966 1967 1968 1969 1970 ...
## $ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
## $ count : num 321 291 314 380 413 378 342 467 244 286 ...
## $ population : num 3345787 3364130 3386068 3412450 3444165 ...
as_tibble(us_contagious_diseases)
## # A tibble: 16,065 x 6
## disease state year weeks_reporting count population
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Hepatitis A Alabama 1966 50 321 3345787
## 2 Hepatitis A Alabama 1967 49 291 3364130
## 3 Hepatitis A Alabama 1968 52 314 3386068
## 4 Hepatitis A Alabama 1969 49 380 3412450
## 5 Hepatitis A Alabama 1970 51 413 3444165
## 6 Hepatitis A Alabama 1971 51 378 3481798
## 7 Hepatitis A Alabama 1972 45 342 3524543
## 8 Hepatitis A Alabama 1973 45 467 3571209
## 9 Hepatitis A Alabama 1974 45 244 3620548
## 10 Hepatitis A Alabama 1975 46 286 3671246
## # ... with 16,055 more rows
This exercise will focus only on measles, so we create a utility string naming the disease and a data frame, dat, in which we temporarily store the filtered results.
* Pipe us_contagious_diseases into filter()
* Exclude the states Hawaii and Alaska (they only became states in the late 1950s and lack earlier data) AND
* Keep only those rows where the disease is measles, THEN
* Pipe the filtered results to mutate(), where we add a new column called rate: the yearly cases per 10,000 people, adjusted for the number of weeks reporting, THEN
* Reorder the state factor levels by rate
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
  filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
  mutate(state = reorder(state, rate))
We can now easily plot the rate of measles per year.
* We take the dat data set and keep only the rows where state is ‘California’ and rate is NOT NA (recall this as the data component step), THEN
* Pipe this to ggplot and map year and rate to x and y (recall this as the aesthetic mapping component), THEN
* Select geom_line() as the geometry (recall this as the geometry component) - 1963 is the year the measles vaccine was introduced - AND
* Add some visual tweaks to assist with interpreting the data (recall this as the style component):
dat %>%
  filter(state == "California" & !is.na(rate)) %>%
  ggplot(aes(year, rate)) +
  geom_line() +
  ylab("Cases per 10,000") +
  geom_vline(xintercept = 1963, col = "blue")
But this is just California, and just measles. How can we show the data for all states, in one plot?
dat %>%
  ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept = 1963, col = "blue") +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  ggtitle(the_disease) +
  ylab("") +
  xlab("")
# compute US average measles rate by year
avg <- us_contagious_diseases %>%
  filter(disease == the_disease) %>%
  group_by(year) %>%
  summarize(us_rate = sum(count, na.rm = TRUE) / sum(population, na.rm = TRUE) * 10000)
# make line plot of measles rate by year by state
dat %>%
  filter(!is.na(rate)) %>%
  ggplot() +
  geom_line(aes(year, rate, group = state), color = "grey50",
            show.legend = FALSE, alpha = 0.2, size = 1) +
  geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "black") +
  scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
  ggtitle("Cases per 10,000 by state") +
  xlab("") +
  ylab("") +
  geom_text(data = data.frame(x = 1955, y = 50),
            mapping = aes(x, y, label = "US average"), color = "black") +
  geom_vline(xintercept = 1963, col = "blue")