Describing and Visualizing Data

Joe Ripberger

Descriptive Statistics

Descriptive statistics aim to summarize a sample; there are two types of descriptive statistics:
1. Measures of central tendency provide information about the most typical or average value of a variable
2. Measures of variability (dispersion) provide information about how widely the values of a variable are spread

Central Tendency (Mean)

Arithmetic Mean: the sum of a series of observations divided by the number of observations
- Symbol: \(\bar{x}\) (sample) or \(\mu\) (population)
- Formula: \(\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right ) = \frac{x_1+x_2+\cdots +x_n}{n}\)
Use when describing continuous variables
Sensitive to extreme values (e.g., income)
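
A minimal sketch of the formula above in R. The income values are hypothetical (illustrative numbers, not course data); the last step shows how a single extreme value pulls the mean upward.

```r
# Hypothetical annual incomes in thousands of dollars (illustrative values)
income <- c(32, 41, 45, 50, 58)

# Arithmetic mean from the formula: sum of observations divided by n
mean_by_hand <- sum(income) / length(income)

# Built-in equivalent
mean_builtin <- mean(income)

# Add one extreme value and recompute
income_extreme <- c(income, 1000)
mean_extreme <- mean(income_extreme)

print(c(by_hand = mean_by_hand, builtin = mean_builtin, with_extreme = mean_extreme))
```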

Central Tendency (Median)

Median: the middle value in a series of observations
- Arrange a series of observations from smallest to largest, then take the middle value
  - Note: if there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values
In a distribution, the median is the \(2^{nd}\) quartile and the \(50^{th}\) percentile
Use when describing continuous variables
Robust to extreme values (e.g., income)
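
A short sketch of the odd and even cases described above, using made-up values (assumptions for illustration only). The last lines check that the median matches the 50th percentile and that an extreme value does not move it.

```r
# Odd number of observations: the median is the single middle value
odd_values <- c(9, 1, 4, 7, 2)           # sorted: 1 2 4 7 9
median_odd <- median(odd_values)

# Even number of observations: the median is the mean of the two middle values
even_values <- c(9, 1, 4, 7, 2, 12)      # sorted: 1 2 4 7 9 12
median_even <- median(even_values)

# The median is also the 50th percentile
quantile_50 <- unname(quantile(even_values, 0.5))

# Swapping the largest value for an extreme one leaves the median unchanged
with_extreme <- c(9, 1, 4, 7, 2, 1000)   # sorted: 1 2 4 7 9 1000
median_extreme <- median(with_extreme)

print(c(odd = median_odd, even = median_even, extreme = median_extreme))
```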

Central Tendency (Normal Distribution)

Central Tendency (Skewed Distribution)
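
A minimal sketch of how the mean and median compare in symmetric versus skewed data. The values are simulated; the distributions, their parameters, and the seed are illustrative assumptions rather than anything from the original slides.

```r
set.seed(42)  # arbitrary seed so the simulated values are reproducible

# Symmetric data: draws from a normal distribution (mean 68, sd 3 are assumed values)
symmetric <- rnorm(10000, mean = 68, sd = 3)

# Right-skewed data: draws from an exponential distribution (rate is an assumed value)
skewed <- rexp(10000, rate = 1 / 50)

# In symmetric data the mean and median nearly coincide
print(c(mean = mean(symmetric), median = median(symmetric)))

# In right-skewed data the long upper tail pulls the mean above the median
print(c(mean = mean(skewed), median = median(skewed)))
```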

Central Tendency (Mode)

Mode: the most common value in a series of observations
- Some variables have more than one mode (multimodal)
Use when describing discrete variables
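
Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead). One possible sketch is below; `find_modes` is a hypothetical helper name and the data are made up for illustration.

```r
# Hypothetical helper: return every value that occurs most often
find_modes <- function(values) {
  counts <- table(values)                      # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(values)) as.numeric(top) else top
}

# Hypothetical discrete variable: number of children per household
children <- c(0, 1, 1, 2, 2, 3, 0, 2, 1, 4)
print(find_modes(children))  # both 1 and 2 occur three times, so there are two modes
```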

Central Tendency

Arithmetic mean: sum of the values divided by the number of values
Median: middle value
Mode: most frequent value
What are the mean, median, and mode of \(x\)?
- \(x\) = {1, 2, 2, 3, 4, 7, 9, 10, 11, 12, 12}
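
A worked check of the three definitions (the values are already sorted and \(n = 11\)):
- Mean: \(\bar{x} = \frac{1+2+2+3+4+7+9+10+11+12+12}{11} = \frac{73}{11} \approx 6.64\)
- Median: the middle (6th) of the 11 sorted values, which is 7
- Mode: 2 and 12 each occur twice, so \(x\) is bimodal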

Dispersion (Variance)

Variance: the extent to which a variable’s values deviate from or vary around its mean; the average squared distance of each observation from the mean
- Symbol: \(s^{2}\) (sample) or \(\sigma^{2}\) (population)
- Formula (sample): \(s^{2} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\)
- Formula (population): \(\sigma^{2} = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2\)
Measures how far observations are from their average value (spread of the data)
- Low variance \(\rightarrow\) observations are relatively close to the mean value
- High variance \(\rightarrow\) observations are relatively far from the mean value
Expressed in squared units (e.g., \(\text{inches}^2\))
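
A minimal sketch of the two variance formulas computed step by step. The heights are hypothetical values chosen for easy arithmetic; note that R's `var()` uses the sample (\(n-1\)) formula.

```r
# Hypothetical heights in inches (illustrative values)
heights <- c(62, 65, 68, 70, 75)
n <- length(heights)

# Sum of squared deviations from the mean
squared_deviations <- (heights - mean(heights))^2
sum_sq <- sum(squared_deviations)

# Sample variance divides by n - 1; this is what var() computes
sample_variance <- sum_sq / (n - 1)

# Dividing by n instead treats the five values as the entire population
population_variance <- sum_sq / n

print(c(sample = sample_variance, builtin = var(heights), population = population_variance))
```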

Dispersion (Standard Deviation)

Standard Deviation: the square root of the variance, which puts the measure of spread back on the variable’s original scale
- Symbol: \(s\) (sample) or \(\sigma\) (population)
- Formula (sample): \(s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\)
- Formula (population): \(\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}\)
Expressed in the same units as the mean (e.g., inches)
- Roughly the typical distance of an observation from the mean
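
A brief sketch of the point about units, using hypothetical heights (illustrative values). Converting inches to centimeters rescales the standard deviation by the conversion factor, while the variance is rescaled by the factor squared.

```r
# Hypothetical heights in inches (illustrative values)
heights_in <- c(62, 65, 68, 70, 75)
heights_cm <- heights_in * 2.54   # the same heights expressed in centimeters

# Standard deviation is the square root of the variance
sd_in <- sqrt(var(heights_in))    # same result as sd(heights_in)

# Ratio of the spread measures after the unit change
sd_ratio  <- sd(heights_cm) / sd(heights_in)     # conversion factor
var_ratio <- var(heights_cm) / var(heights_in)   # conversion factor squared

print(c(sd_in = sd_in, sd_ratio = sd_ratio, var_ratio = var_ratio))
```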

Dispersion

Given that: \(s^{2} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \text{ and } s = \sqrt{s^2},\) what are the variance and standard deviation of \(x\)?
- \(x\) = {1, 2, 2, 3, 4, 7, 9, 10, 11, 12, 12}
- \(\bar{x} = 73/11 \approx 6.6\)
- \((x - \bar{x}) \approx\) {-5.6, -4.6, -4.6, -3.6, -2.6, 0.4, 2.4, 3.4, 4.4, 5.4, 5.4}
- \((x - \bar{x})^2 \approx\) {31.8, 21.5, 21.5, 13.2, 7.0, 0.1, 5.6, 11.3, 19.0, 28.8, 28.8} (squared using the unrounded mean)
- \(\sum_{i=1}^n (x_i - \bar{x})^2 \approx 188.5\)
- \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \approx \frac{188.5}{11-1} \approx 18.9\)
- \(s = \sqrt{s^2} \approx 4.3\)

Dispersion

x <- c(1, 2, 2, 3, 4, 7, 9, 10, 11, 12, 12)
var(x)

[1] 18.85455

sd(x)

[1] 4.342182

Visualizing Data

In The Visual Display of Quantitative Information (1983), Edward Tufte says that graphical displays should:

Show the data
Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else
Avoid distorting what the data have to say
Present many numbers in a small space
Make large data sets coherent
Encourage the eye to compare different pieces of data
Reveal the data at several levels of detail, from a broad overview to the fine structure
Serve a reasonably clear purpose: description, exploration, tabulation or decoration
Be closely integrated with the statistical and verbal descriptions of a data set

Visualizing Data

Always include a title
Always include axis labels
Include a legend when necessary
Avoid chartjunk (elements that are not necessary to comprehend the information in the plot)
In bivariate plots, the dependent variable (\(y\)) goes on the y-axis and the independent variable (\(x\)) goes on the x-axis
Don’t use pie charts
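
A minimal sketch of a plot that follows the guidelines above. It assumes the ggplot2 package is installed and uses the `mtcars` dataset that ships with R as stand-in data; the variable choices and label text are illustrative.

```r
library(ggplot2)

# mtcars ships with R; it stands in for real data here
# Dependent variable (mpg) on the y-axis, independent variable (wt) on the x-axis
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    title = "Fuel Efficiency by Vehicle Weight",  # always include a title
    x = "Weight (1,000 lbs)",                     # always include axis labels
    y = "Miles per Gallon",
    color = "Cylinders"                           # legend title, needed because color encodes a variable
  ) +
  theme_classic(base_size = 16)                   # minimal theme, no extra decoration
p
```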

Visualizing Single Continuous Variables

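library(ggplot2)  # provides ggplot(), the geom_*() layers, labs(), and theme_classic()

# `data` is assumed to be a data frame with a numeric Height column (in inches)
# Density plot: smoothed estimate of the distribution of Height
p1 <- ggplot(data, aes(x = Height)) +
  geom_density() +
  labs(y = "Density", x = "Height (Inches)", title = "Density Plot") +
  theme_classic(base_size = 18)
# Histogram: number of observations in each 1-inch bin
p2 <- ggplot(data, aes(x = Height)) +
  geom_histogram(binwidth = 1) +
  labs(y = "Count", x = "Height (Inches)", title = "Histogram") +
  theme_classic(base_size = 18)
# Show the two plots side by side (requires the gridExtra package)
gridExtra::grid.arrange(p1, p2, nrow = 1)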

Visualizing Single Discrete Variables

# Bar plot: one bar per distinct value of Height, with bar height showing the count
ggplot(data, aes(x = Height)) +
  geom_bar() +
  labs(y = "Count", x = "Height (Inches)", title = "Bar Plot") +
  theme_classic(base_size = 20)

Histograms Are Not the Same as Bar Plots!

If you have a continuous variable with more than about 10 distinct values, use a histogram!
If you have a discrete variable, or a continuous variable with fewer than about 10 distinct values, use a bar plot!
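
The rule of thumb above could be sketched as a small function. `plot_single_variable` and its `max_bar_values` argument are hypothetical names introduced here for illustration, the cutoff of 10 comes from the slide rather than from ggplot2, and `mtcars` (which ships with R) stands in for real data.

```r
library(ggplot2)

# Hypothetical helper that applies the rule of thumb:
# many distinct numeric values -> histogram, otherwise -> bar plot
plot_single_variable <- function(df, column, max_bar_values = 10) {
  n_distinct <- length(unique(df[[column]]))
  base <- ggplot(df, aes(x = .data[[column]])) +
    labs(x = column, y = "Count") +
    theme_classic(base_size = 18)
  if (is.numeric(df[[column]]) && n_distinct > max_bar_values) {
    base + geom_histogram(bins = 30)   # bin the values before counting
  } else {
    base + geom_bar()                  # one bar per distinct value
  }
}

p_hist <- plot_single_variable(mtcars, "mpg")  # many distinct values: histogram
p_bar  <- plot_single_variable(mtcars, "cyl")  # three distinct values: bar plot
```

Printing `p_hist` or `p_bar` draws the corresponding plot.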

Visualizing Continuous \(x\) and Continuous \(y\) Variables

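# Assumes ggplot2 is loaded and `d` is a data frame with numeric Height and Weight columns
# Note: ylim() drops observations outside 145-155 before plotting and smoothing
# Scatter plot: one semi-transparent point per observation
p1 <- ggplot(d, aes(x = Height, y = Weight)) +
  geom_point(size = 2, alpha = 0.2) +
  labs(y = "Weight (Pounds)", x = "Height (Inches)", title = "Scatter Plot\n") +
  theme_classic(base_size = 16) +
  ylim(145, 155)
# Smooth plot: loess curve only (linewidth replaces size for lines in ggplot2 3.4.0 and later)
p2 <- ggplot(d, aes(x = Height, y = Weight)) +
  geom_smooth(linewidth = 2, method = "loess") +
  labs(y = "Weight (Pounds)", x = "Height (Inches)", title = "Smooth Plot\n") +
  theme_classic(base_size = 16) +
  ylim(145, 155)
# Combined: points with the loess curve drawn on top
p3 <- ggplot(d, aes(x = Height, y = Weight)) +
  geom_point(size = 2, alpha = 0.3) +
  geom_smooth(linewidth = 2, method = "loess") +
  labs(y = "Weight (Pounds)", x = "Height (Inches)", title = "Scatter Plot with\nSmooth Line") +
  theme_classic(base_size = 16) +
  ylim(145, 155)
# Arrange the three plots in a single row (requires the gridExtra package)
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)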

Visualizing Discrete \(x\) and Continuous \(y\) Variables

p1 <- ggplot(d, aes(x = name, y = value)) +
  geom_boxplot() +
  labs(y = "Height (Inches)", x = "", title = "Box Plot") +
  theme_classic(base_size = 18)
p2 <- ggplot(d, aes(x = name, y = value)) +
  geom_violin() +
  labs(y = "Height (Inches)", x = "", title = "Violin Plot") +
  theme_classic(base_size = 18)
p3 <- ggplot(d, aes(x = name, y = value)) +
  geom_violin() +
  geom_boxplot(width = 0.5) +
  labs(y = "Height (Inches)", x = "", title = "Box and Violin Plot") +
  theme_classic(base_size = 18)
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

Visualizing Discrete \(x\) and Discrete \(y\) Variables

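# `p` is assumed to be a data frame of proportions with columns Height, Weight, p, and text
# The legend title is set with fill = because the bars are mapped to the fill aesthetic
ggplot(p, aes(x = Height, y = p, fill = Weight, label = text)) +
  geom_col(position = position_dodge()) +
  geom_text(position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(y = "Proportion", x = "Height (Inches)", fill = "Weight (Pounds)", title = "Proportion Plot") +
  theme_classic(base_size = 20)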

ggplot2 Cheat Sheet