Describing and Visualizing Data

Joe Ripberger

Descriptive Statistics

  • Descriptive statistics summarize a sample; there are two main types:
    1. Measures of central tendency describe the most typical or average value of a variable
    2. Measures of variability (dispersion) describe how spread out the values of a variable are

Central Tendency (Mean)

  • Arithmetic Mean: the sum of a series of observations divided by the number of observations
    • Symbol: \(\bar{x}\) (sample) or \(\mu\) (population)
    • Formula: \(\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right ) = \frac{x_1+x_2+\cdots +x_n}{n}\)
  • Use when describing continuous variables
  • Sensitive to extreme values (e.g., income)
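
As a minimal sketch of the formula above, the following R snippet computes the mean by hand and with the built-in `mean()`. The `income` values are made up for illustration and are not from the slides. Adding one extreme value moves the mean from 40 to 200.

```r
# Hypothetical incomes in thousands of dollars (illustrative values only)
income <- c(30, 35, 40, 45, 50)

# Arithmetic mean from the formula: sum of observations divided by n
mean_by_hand <- sum(income) / length(income)
mean_by_hand
mean(income)  # built-in equivalent

# Add a single extreme value and recompute
income_extreme <- c(income, 1000)
mean(income_extreme)
```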

Central Tendency (Median)

  • Median: the middle value in a series of observations
    • Arrange a series of observations from smallest to largest, then take the middle value
      • Note: if there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values
  • In a distribution, the median is the \(2^{nd}\) quartile and the \(50^{th}\) percentile
  • Use when describing continuous variables
  • Robust to extreme values (e.g., income)
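
The rules above can be sketched in R with the built-in `median()` and `quantile()` functions. The vectors below are made-up examples, not data from the slides; the last one reuses a hypothetical income series with one extreme value to show that the median barely reacts to it.

```r
# Odd number of observations: the middle value after sorting
odd <- c(7, 1, 3)
median(odd)   # sorted: 1, 3, 7, so the median is 3

# Even number of observations: mean of the two middle values
even <- c(7, 1, 3, 5)
median(even)  # sorted: 1, 3, 5, 7, so the median is (3 + 5) / 2 = 4

# The median is also the 50th percentile
quantile(even, probs = 0.5)

# Hypothetical incomes (in thousands) with one extreme value
income_extreme <- c(30, 35, 40, 45, 50, 1000)
median(income_extreme)  # (40 + 45) / 2 = 42.5
```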

Central Tendency (Normal Distribution)

Central Tendency (Skewed Distribution)

Central Tendency (Mode)

  • Mode: the most common value in a series of observations
    • Some variables include multiple modes (multimodal)
  • Use when describing discrete variables

Central Tendency

  • Arithmetic mean: sum values divided by number of values
  • Median: middle value
  • Mode: most frequent value
  • What are the mean, median, and mode of \(x\)?
    • \(x\) = {1, 2, 2, 3, 4, 7, 9, 10, 11, 12, 12}
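
One way to check the answer in R is sketched below. Base R provides `mean()` and `median()`, but its `mode()` function reports an object's storage type rather than the most frequent value, so the sketch defines a small helper. The name `stat_modes` is hypothetical and not part of any package.

```r
x <- c(1, 2, 2, 3, 4, 7, 9, 10, 11, 12, 12)

mean(x)    # 73 / 11, about 6.64
median(x)  # the 6th of 11 sorted values: 7

# Hypothetical helper: return every value tied for the highest frequency
stat_modes <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_modes(x)  # 2 and 12 each appear twice, so x is multimodal
```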

Dispersion (Variance)

  • Variance: the extent to which a variable’s values deviate from or vary around its mean; the average squared distance of each observation from the mean
    • Symbol: \(s^{2}\) (sample) or \(\sigma^{2}\) (population)
    • Formula: \(s^{2} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\)
    • Formula: \(\sigma^{2} = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2\)
  • Measures how far observations are from their average value (spread of the data)
    • Low variance \(\rightarrow\) observations are relatively close to the mean value
    • High variance \(\rightarrow\) observations are relatively far from the mean value
  • Expressed in squared units (e.g., \(\text{inches}^2\))
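
The two variance formulas can be sketched in R as follows, using made-up heights rather than data from the slides. R's built-in `var()` uses the \(n-1\) (sample) denominator, so the population-style version is computed by hand here.

```r
# Illustrative heights in inches (made-up values)
heights <- c(60, 62, 64, 66, 68)
n <- length(heights)

# Squared deviations from the mean
sq_dev <- (heights - mean(heights))^2

# Sample variance: divide the sum of squared deviations by n - 1
sample_var <- sum(sq_dev) / (n - 1)
sample_var
var(heights)  # built-in; also uses the n - 1 denominator

# Population-style variance: divide by n instead
pop_var <- sum(sq_dev) / n
pop_var
```

With these values the sum of squared deviations is 40, so the sample variance is 10 and the population-style variance is 8, both in squared inches.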

Dispersion (Standard Deviation)

  • Standard Deviation: the square root of the variance, which returns the measure of spread to the variable’s original units
    • Symbol: \(s\) (sample) or \(\sigma\) (population)
    • Formula: \(s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\)
    • Formula: \(\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}\)
  • Expressed in the same units as the mean (e.g., inches)
    • Roughly the typical distance of an observation from the mean
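
A short R sketch of the units point, again with made-up heights: converting inches to centimetres multiplies the standard deviation by 2.54 but multiplies the variance by \(2.54^2\), because the variance is in squared units.

```r
heights <- c(60, 62, 64, 66, 68)  # made-up heights in inches

sd(heights)         # sample standard deviation, in inches
sqrt(var(heights))  # same value: square root of the sample variance

# Rescale to centimetres and compare the two measures of spread
heights_cm <- heights * 2.54
sd(heights_cm) / sd(heights)    # ratio of standard deviations
var(heights_cm) / var(heights)  # ratio of variances
```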

Dispersion

  • Given that: \(s^{2} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \text{ and } s = \sqrt{s^2},\) what are the variance and standard deviation of \(x\)?
    • \(x\) = {1, 2, 2, 3, 4, 7, 9, 10, 11, 12, 12}
    • \(\bar{x} = 73/11 \approx 6.6\)
    • \((x - \bar{x}) \approx\) {-5.6, -4.6, -4.6, -3.6, -2.6, 0.4, 2.4, 3.4, 4.4, 5.4, 5.4}
    • \((x - \bar{x})^2 \approx\) {31.4, 21.2, 21.2, 13.0, 6.8, 0.2, 5.8, 11.6, 19.4, 29.2, 29.2}
    • \(\sum_{i=1}^n (x_i - \bar{x})^2 \approx 189.0\) (summing the rounded values above; about 188.5 without intermediate rounding)
    • \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \approx \frac{189.0}{11-1} = 18.9\)
    • \(s = \sqrt{s^2} \approx 4.3\)

Dispersion

x <- c(1, 2, 2, 3, 4, 7, 9, 10, 11, 12, 12)
var(x)  # sample variance (n - 1 denominator)
#> [1] 18.85455
sd(x)   # sample standard deviation, sqrt(var(x))
#> [1] 4.342182

Visualizing Data

In The Visual Display of Quantitative Information (1983), Edward Tufte says that graphical displays should:

  1. Show the data
  2. Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else
  3. Avoid distorting what the data have to say
  4. Present many numbers in a small space
  5. Make large data sets coherent
  6. Encourage the eye to compare different pieces of data
  7. Reveal the data at several levels of detail, from a broad overview to the fine structure
  8. Serve a reasonably clear purpose: description, exploration, tabulation or decoration
  9. Be closely integrated with the statistical and verbal descriptions of a data set

Visualizing Data

  1. Always include a title
  2. Always include axis labels
  3. Include a legend when necessary
  4. Avoid chartjunk (elements that are not necessary to comprehend the information in the plot)
  5. In bivariate plots, the dependent variable (\(y\)) goes on the y-axis and the independent variable (\(x\)) goes on the x-axis
  6. Don’t use pie charts

Visualizing Single Continuous Variables

Code
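library(ggplot2)
# Assumes `data` is a data frame with a numeric `Height` column (inches)
# Density plot: smoothed estimate of the distribution
p1 <- ggplot(data, aes(x = Height)) +
  geom_density() +
  labs(y = "Density", x = "Height (Inches)", title = "Density Plot") +
  theme_classic(base_size = 18)
# Histogram: counts of observations in 1-inch bins
p2 <- ggplot(data, aes(x = Height)) +
  geom_histogram(binwidth = 1) +
  labs(y = "Count", x = "Height (Inches)", title = "Histogram") +
  theme_classic(base_size = 18)
# Show both plots side by side
gridExtra::grid.arrange(p1, p2, nrow = 1)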

Visualizing Single Discrete Variables

Code
ggplot(data, aes(x = Height)) +
  geom_bar() +
  labs(y = "Count", x = "Height (Inches)", title = "Bar Plot") +
  theme_classic(base_size = 20)

Histograms Are Not the Same as Bar Plots!

  • If you have a continuous variable with more than about 10 distinct values, use a histogram!
  • If you have a discrete variable, or a continuous variable with fewer than about 10 distinct values, use a bar plot!

Visualizing Continuous \(x\) and Continuous \(y\) Variables

Code
p1 <- ggplot(d, aes(x = Height, y = Weight)) +
  geom_point(size = 2, alpha = 0.2) +
  labs(y = "Weight (Pounds)", x = "Height (Inches)", title = "Scatter Plot\n") +
  theme_classic(base_size = 16) +
  ylim(145, 155)
p2 <- ggplot(d, aes(x = Height, y = Weight)) +
  geom_smooth(linewidth = 2, method = "loess") +
  labs(y = "Weight (Pounds)", x = "Height (Inches)", title = "Smooth Plot\n") +
  theme_classic(base_size = 16) +
  ylim(145, 155)
p3 <- ggplot(d, aes(x = Height, y = Weight)) +
  geom_point(size = 2, alpha = 0.3) +
  geom_smooth(linewidth = 2, method = "loess") +
  labs(y = "Weight (Pounds)", x = "Height (Inches)", title = "Scatter Plot with\nSmooth Line") +
  theme_classic(base_size = 16) +
  ylim(145, 155)
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

Visualizing Discrete \(x\) and Continuous \(y\) Variables

Code
p1 <- ggplot(d, aes(x = name, y = value)) +
  geom_boxplot() +
  labs(y = "Height (Inches)", x = "", title = "Box Plot") +
  theme_classic(base_size = 18)
p2 <- ggplot(d, aes(x = name, y = value)) +
  geom_violin() +
  labs(y = "Height (Inches)", x = "", title = "Violin Plot") +
  theme_classic(base_size = 18)
p3 <- ggplot(d, aes(x = name, y = value)) +
  geom_violin() +
  geom_boxplot(width = 0.5) +
  labs(y = "Height (Inches)", x = "", title = "Box and Violin Plot") +
  theme_classic(base_size = 18)
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

Visualizing Discrete \(x\) and Discrete \(y\) Variables

Code
ggplot(p, aes(x = Height, y = p, fill = Weight, label = text)) +
  geom_col(position = position_dodge()) +
  geom_text(position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(y = "Proportion", x = "Height (Inches)", fill = "Weight (Pounds)", title = "Proportion Plot") +
  theme_classic(base_size = 20)

ggplot2 Cheat Sheet