Chapter 1: Descriptive Statistics

Statistics for Data Science

Author

Pai

Published

May 5, 2026


1 Chapter Overview

Descriptive statistics is the branch of statistics concerned with summarizing and communicating the key features of a dataset without making inferences beyond the data at hand. Before fitting models, testing hypotheses, or making predictions, a data scientist must first develop an intimate understanding of their data — its center, its spread, its shape, and its relationships.

This chapter covers:

  • Measures of Central Tendency — where the data is centered
  • Measures of Dispersion — how spread out the data is
  • Measures of Shape — skewness and kurtosis
  • Data Visualization — graphical summaries
  • Multivariate Descriptive Statistics — describing relationships between variables
NoteLearning Objectives

By the end of this chapter, you will be able to:

  1. Compute and interpret all major descriptive statistics by hand and in R.
  2. Choose the appropriate measure of center and spread for different data types.
  3. Diagnose distributional shape using skewness and kurtosis.
  4. Produce and interpret publication-quality descriptive plots in R.
  5. Describe bivariate and multivariate relationships using correlation and summary tables.

2 Measures of Central Tendency

2.1 Introduction

The first question we ask about any variable is: Where is the data centered? Measures of central tendency answer this by identifying a single representative value around which the observations cluster. The three most widely used measures are the mean, the median, and the mode — and the choice among them is never arbitrary. It depends on the scale of measurement, the shape of the distribution, and the presence of outliers.

2.2 Theory

2.2.1 The Arithmetic Mean

The arithmetic mean (commonly called “the average”) is the sum of all observed values divided by the number of observations. For a sample of \(n\) values \(x_1, x_2, \ldots, x_n\):

\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\]

For a population of size \(N\), we use the symbol \(\mu\):

\[\mu = \frac{1}{N} \sum_{i=1}^{N} x_i\]

The mean minimizes the sum of squared deviations: \(\sum(x_i - \bar{x})^2\) is smaller for \(\bar{x}\) than for any other value. This makes it the least squares center of the data. However, the mean is sensitive to extreme values (outliers), because every observation contributes equally to the sum.

2.2.2 The Median

The median is the middle value when data are sorted in ascending order. For \(n\) observations:

\[\text{Median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}\]

The median is robust to outliers — it is insensitive to the exact values of extreme observations. This makes it the preferred measure of center for skewed distributions such as income, house prices, and reaction times.

2.2.3 The Mode

The mode is the value (or values) that appear most frequently in the data. It is the only measure of central tendency applicable to nominal (categorical) data. A distribution may be unimodal (one mode), bimodal (two modes), or multimodal.

2.2.4 When to Use Which Measure?

Choosing the appropriate measure of central tendency
Situation Recommended Measure
Symmetric distribution, no outliers Mean
Skewed distribution or outliers present Median
Categorical or nominal data Mode
Reporting income, house prices, reaction times Median
Reporting height, temperature, test scores (approx. symmetric) Mean

2.2.5 Weighted Mean

When observations carry different importance (weights \(w_i\)), the weighted mean accounts for this:

\[\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}\]

This is commonly used in survey research (sampling weights), grade calculation, and portfolio analysis.

2.3 Example: Measures of Central Tendency

Example 1.1. A researcher collects the following annual salaries (in thousands of USD) from 9 employees in a small firm:

\[32, \; 35, \; 38, \; 40, \; 42, \; 45, \; 48, \; 52, \; 180\]

Step 1 — Mean: \[\bar{x} = \frac{32+35+38+40+42+45+48+52+180}{9} = \frac{512}{9} \approx 56.9 \text{ thousand}\]

Step 2 — Median (n = 9, odd → middle value is the 5th): \[\text{Median} = 42 \text{ thousand}\]

Step 3 — Mode: No value repeats, so this dataset has no mode.

Interpretation: The mean ($56.9K) is substantially higher than the median ($42K) because of the outlier salary of $180K. The median better represents the “typical” employee salary in this firm. This illustrates why the median is preferred when data is right-skewed or contains outliers.

2.4 R Example: Measures of Central Tendency

# --- Define the salary data ---
salary <- c(32, 35, 38, 40, 42, 45, 48, 52, 180)

# --- Compute the three measures ---
mean_val   <- mean(salary)
median_val <- median(salary)

# R has no built-in mode() for continuous data; we write a helper:
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
mode_val <- get_mode(salary)

# --- Display results ---
cat("Mean:  ", round(mean_val, 2), "thousand USD\n")
Mean:   56.89 thousand USD
cat("Median:", median_val, "thousand USD\n")
Median: 42 thousand USD
cat("Mode:  ", mode_val, "(appears once — effectively no mode)\n")
Mode:   32 (appears once — effectively no mode)
# --- Visualize: dotplot showing mean vs median ---
salary_df <- data.frame(salary = salary)

ggplot(salary_df, aes(x = salary)) +
  geom_dotplot(fill = "steelblue", color = "white", binwidth = 5) +
  geom_vline(aes(xintercept = mean_val,   color = "Mean"),   linewidth = 1.2, linetype = "dashed") +
  geom_vline(aes(xintercept = median_val, color = "Median"), linewidth = 1.2, linetype = "solid") +
  scale_color_manual(values = c("Mean" = "tomato", "Median" = "darkgreen")) +
  labs(
    title    = "Employee Salaries: Mean vs. Median",
    subtitle = "The outlier ($180K) pulls the mean far above the median",
    x        = "Annual Salary (thousands USD)",
    y        = "",
    color    = "Measure"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Code explanation:

  • mean() and median() are base R functions.
  • We define get_mode() manually because R’s built-in mode() returns the storage type, not the statistical mode.
  • geom_vline() adds vertical reference lines, making the gap between mean and median visually apparent.
  • The scale_color_manual() call controls line colors via a named vector.

Try modifying: Replace 180 with 55 (removing the outlier) and observe how the mean and median converge.

2.5 Exercises

TipExercise 1.1

The following values represent the number of publications per faculty member in a department:

\[3, 5, 7, 2, 8, 5, 12, 5, 6, 4\]

  1. Compute the mean, median, and mode by hand.
  2. Which measure best represents the “typical” faculty member? Justify your answer.
  3. Verify your answers in R.
TipExercise 1.2

A student’s course grades are: Midterm (weight = 30%) = 78, Final (weight = 50%) = 85, Project (weight = 20%) = 91.
Compute the weighted mean grade. How does it compare to the simple arithmetic mean?


3 Measures of Dispersion

3.1 Introduction

Knowing where data is centered is necessary but not sufficient. Two datasets can have identical means yet look completely different — one tightly clustered, the other wildly spread. Measures of dispersion quantify the variability or spread of data around the center. This is critical in data science: high variability in model predictions indicates instability; high variability in features may require normalization.

3.2 Theory

3.2.1 Range

The simplest measure of spread: the difference between the maximum and minimum values.

\[\text{Range} = x_{\max} - x_{\min}\]

The range uses only two data points and is highly sensitive to outliers. It is most useful as a quick preliminary check.

3.2.2 Interquartile Range (IQR)

The IQR is the range of the middle 50% of the data — the distance between the 75th percentile (Q3) and the 25th percentile (Q1):

\[\text{IQR} = Q_3 - Q_1\]

The IQR is robust to outliers because it ignores the tails. It is the basis for the boxplot and for the standard outlier detection rule: an observation is a potential outlier if it falls below \(Q_1 - 1.5 \times \text{IQR}\) or above \(Q_3 + 1.5 \times \text{IQR}\).

3.2.3 Variance

The variance measures the average squared deviation from the mean. For a sample:

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\]

We divide by \(n-1\) (not \(n\)) to obtain an unbiased estimator of the population variance \(\sigma^2\). This correction is known as Bessel’s correction. Squaring the deviations ensures all contributions are positive and penalizes large deviations more heavily.

3.2.4 Standard Deviation

The standard deviation is the square root of the variance, returning units to the original scale of the data:

\[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\]

It is the most widely used measure of dispersion and is interpretable as the “typical” distance of observations from the mean.

3.2.5 Coefficient of Variation (CV)

The CV expresses the standard deviation as a percentage of the mean, enabling comparisons of variability across variables with different units or scales:

\[\text{CV} = \frac{s}{\bar{x}} \times 100\%\]

For example, comparing the variability of blood pressure (measured in mmHg) to heart rate (beats per minute) using CV makes the comparison unit-free.

3.3 Example: Measures of Dispersion

Example 1.2. Two machine production lines produce bolts with the following lengths (mm):

  • Line A: 50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 50.1, 49.7
  • Line B: 48.5, 51.2, 49.0, 52.1, 48.8, 51.5, 49.3, 51.6

Both lines have \(\bar{x}_A \approx \bar{x}_B \approx 50\) mm. But the quality implications differ dramatically.

Line A: \[s_A^2 = \frac{(0.1^2 + 0.2^2 + 0.3^2 + 0^2 + 0.1^2 + 0.2^2 + 0.1^2 + 0.3^2)}{7} \approx 0.039, \quad s_A \approx 0.197 \text{ mm}\]

Line B: \[s_B \approx 1.35 \text{ mm}\]

Despite identical means, Line B’s standard deviation is nearly 7 times larger. Line A produces consistent, high-quality bolts; Line B has unacceptable variability that would cause assembly failures.

3.4 R Example: Measures of Dispersion

# --- Production line data ---
line_A <- c(50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 50.1, 49.7)
line_B <- c(48.5, 51.2, 49.0, 52.1, 48.8, 51.5, 49.3, 51.6)

# --- Compute dispersion measures ---
dispersion_table <- data.frame(
  Measure = c("Mean", "Range", "IQR", "Variance", "Std. Deviation", "CV (%)"),
  Line_A  = c(
    mean(line_A),
    diff(range(line_A)),
    IQR(line_A),
    var(line_A),
    sd(line_A),
    sd(line_A) / mean(line_A) * 100
  ),
  Line_B  = c(
    mean(line_B),
    diff(range(line_B)),
    IQR(line_B),
    var(line_B),
    sd(line_B),
    sd(line_B) / mean(line_B) * 100
  )
)

dispersion_table[, 2:3] <- round(dispersion_table[, 2:3], 4)

kable(dispersion_table, 
      caption = "Dispersion Measures: Line A vs. Line B",
      col.names = c("Measure", "Line A", "Line B")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Dispersion Measures: Line A vs. Line B
Measure Line A Line B
Mean 50.0125 50.2500
Range 0.6000 3.6000
IQR 0.2500 2.5750
Variance 0.0412 2.1914
Std. Deviation 0.2031 1.4803
CV (%) 0.4061 2.9460
# --- Side-by-side boxplots ---
bolt_data <- data.frame(
  length = c(line_A, line_B),
  line   = rep(c("Line A", "Line B"), each = 8)
)

ggplot(bolt_data, aes(x = line, y = length, fill = line)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 21, outlier.size = 3) +
  geom_jitter(width = 0.1, size = 2, alpha = 0.8, color = "gray30") +
  scale_fill_manual(values = c("Line A" = "steelblue", "Line B" = "tomato")) +
  labs(
    title    = "Bolt Lengths by Production Line",
    subtitle = "Both lines centered at ~50 mm, but Line B has far greater spread",
    x        = "Production Line",
    y        = "Bolt Length (mm)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Code explanation:

  • var(), sd(), IQR(), and range() are all base R functions.
  • diff(range(x)) computes the range efficiently: range() returns min and max, diff() subtracts them.
  • kable() + kable_styling() produce a clean, styled comparison table suitable for reports.
  • geom_jitter() overlays raw data points on the boxplot, revealing the actual observations behind the summary.

3.5 Exercises

TipExercise 1.3

Using the salary data from Exercise 1.1 (\(3, 5, 7, 2, 8, 5, 12, 5, 6, 4\) publications):

  1. Compute the variance and standard deviation by hand (show all steps).
  2. Identify any outliers using the IQR rule (\(Q_1 - 1.5 \times \text{IQR}\), \(Q_3 + 1.5 \times \text{IQR}\)).
  3. How does removing any outlier change the standard deviation?
TipExercise 1.4

Two university courses each have a class mean of 75% on the midterm. Course A has \(s = 5\), Course B has \(s = 18\).

  1. Compute the CV for each course.
  2. What does this tell the instructor about each class? Which class is more homogeneous?
  3. Simulate this scenario in R and produce a density plot comparing both distributions.

4 Measures of Shape: Skewness and Kurtosis

4.1 Introduction

The mean and standard deviation describe where data is centered and how spread out it is, but they say nothing about the shape of the distribution. Two datasets can share the same mean and variance yet differ in symmetry and tail heaviness. Measures of shape — skewness and kurtosis — capture these features and are essential for deciding which statistical methods are appropriate (many assume normality) and for understanding the nature of extreme observations.

4.2 Theory

4.2.1 Skewness

Skewness measures the degree and direction of asymmetry in a distribution. The sample skewness is:

\[\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3\]

Interpreting skewness
Skewness Value Shape Relationship
\(\approx 0\) Symmetric Mean \(\approx\) Median \(\approx\) Mode
\(> 0\) Right-skewed (positive) Mean \(>\) Median \(>\) Mode
\(< 0\) Left-skewed (negative) Mean \(<\) Median \(<\) Mode

Right skew (positive skew) is common in income, wealth, reaction times, and biological measurements where values cannot go below zero but can take large positive values. Left skew occurs in data like scores on an easy exam (most cluster near the maximum).

As a rule of thumb: |skewness| < 0.5 is approximately symmetric; 0.5–1.0 is moderately skewed; > 1.0 is highly skewed.

4.2.2 Kurtosis

Kurtosis measures the “tailedness” of a distribution — the concentration of probability in the tails relative to a normal distribution. The excess kurtosis (relative to the normal distribution, which has kurtosis = 3) is:

\[\text{Excess Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}\]

Interpreting excess kurtosis
Excess Kurtosis Distribution Type Description
\(= 0\) Mesokurtic Normal-like tails
\(> 0\) Leptokurtic Heavy tails, peaked center (e.g., Student’s \(t\))
\(< 0\) Platykurtic Light tails, flatter shape (e.g., Uniform)

High (positive) kurtosis warns of tail risk — a higher probability of extreme values than the normal distribution predicts. This is critical in financial risk management and anomaly detection.

4.3 Example: Measures of Shape

Example 1.3. A researcher measures response times (milliseconds) for a cognitive task across 200 participants. Summary statistics:

  • \(\bar{x} = 485\) ms, Median = \(440\) ms, \(s = 210\) ms
  • Skewness = \(1.82\), Excess Kurtosis = \(4.15\)

Interpretation:

  • The mean (485 ms) exceeds the median (440 ms), consistent with positive skewness.
  • Skewness = 1.82 indicates strong right skew — a small number of participants with very long response times pulls the mean upward.
  • Excess kurtosis = 4.15 (leptokurtic) suggests heavier tails than a normal distribution — extreme response times are more common than normality would predict.
  • Implication: Standard parametric tests assuming normality may be inappropriate. A log transformation or non-parametric alternatives should be considered.

4.4 R Example: Measures of Shape

# --- Simulate three distributions with same mean and variance ---
set.seed(42)
n <- 500
sym   <- rnorm(n, mean = 50, sd = 10)                     # symmetric
right <- c(rnorm(400, 40, 5), rnorm(100, 85, 10))         # right-skewed
left  <- c(rnorm(400, 60, 5), rnorm(100, 15, 10))         # left-skewed

# --- Compute skewness and kurtosis using moments package ---
shape_summary <- data.frame(
  Distribution   = c("Symmetric", "Right-Skewed", "Left-Skewed"),
  Mean           = round(c(mean(sym), mean(right), mean(left)), 2),
  SD             = round(c(sd(sym), sd(right), sd(left)), 2),
  Skewness       = round(c(skewness(sym), skewness(right), skewness(left)), 3),
  Excess_Kurtosis = round(c(kurtosis(sym)-3, kurtosis(right)-3, kurtosis(left)-3), 3)
)

kable(shape_summary,
      caption = "Shape Measures Across Three Distributions",
      col.names = c("Distribution", "Mean", "SD", "Skewness", "Excess Kurtosis")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Shape Measures Across Three Distributions
Distribution Mean SD Skewness Excess Kurtosis
Symmetric 49.70 9.72 0.021 0.016
Right-Skewed 48.98 19.69 1.551 1.063
Left-Skewed 50.74 19.13 -1.487 0.803
# --- Density plots for all three ---
shape_df <- data.frame(
  value = c(sym, right, left),
  dist  = rep(c("Symmetric", "Right-Skewed", "Left-Skewed"), each = n)
)
shape_df$dist <- factor(shape_df$dist,
                        levels = c("Left-Skewed", "Symmetric", "Right-Skewed"))

ggplot(shape_df, aes(x = value, fill = dist)) +
  geom_density(alpha = 0.5) +
  geom_vline(data = shape_df |>
               group_by(dist) |>
               summarise(m = mean(value), med = median(value)),
             aes(xintercept = m,   color = dist), linetype = "dashed", linewidth = 1) +
  geom_vline(data = shape_df |>
               group_by(dist) |>
               summarise(m = mean(value), med = median(value)),
             aes(xintercept = med, color = dist), linetype = "solid",  linewidth = 1) +
  scale_fill_manual(values  = c("Symmetric" = "steelblue",
                                "Right-Skewed" = "tomato",
                                "Left-Skewed" = "seagreen")) +
  scale_color_manual(values = c("Symmetric" = "steelblue",
                                "Right-Skewed" = "tomato",
                                "Left-Skewed" = "seagreen")) +
  labs(
    title    = "Distributional Shape: Skewness Comparison",
    subtitle = "Dashed = Mean, Solid = Median",
    x        = "Value", y        = "Density", fill = "Distribution"
  ) +
  facet_wrap(~dist, ncol = 1) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Code explanation:

  • skewness() and kurtosis() come from the moments package. Note that kurtosis() returns the raw kurtosis (normal = 3), so we subtract 3 for excess kurtosis.
  • facet_wrap(~dist, ncol = 1) creates stacked panels — useful for comparing shapes visually.
  • The factor() call controls the panel order for logical display (left → symmetric → right).

4.5 Exercises

TipExercise 1.5

Given the following frequency distribution of exam scores, determine:

Score Range Frequency
40–50 2
50–60 8
60–70 25
70–80 35
80–90 20
90–100 10
  1. Based on the frequency table alone, predict the direction of skewness. Explain your reasoning.
  2. Generate similar data in R and compute skewness and kurtosis to confirm.
  3. Plot a histogram with a superimposed density curve.

5 Data Visualization for Descriptive Statistics

5.1 Introduction

Numerical summaries are powerful, but they inevitably compress information. A single summary statistic can mask patterns that are immediately obvious in a plot. The famous Anscombe’s Quartet (1973) demonstrates that four datasets with nearly identical means, variances, correlations, and regression lines can look completely different when plotted. Visualization is not optional — it is an indispensable component of responsible data analysis.

5.2 Theory

5.2.1 Histograms and Density Plots

A histogram divides the range of values into bins and plots the count (or proportion) of observations in each bin. The choice of bin width significantly affects interpretation: too few bins hides structure; too many bins shows noise as pattern. Kernel density estimates (KDE) provide a smooth continuous estimate of the distribution without imposing arbitrary bins.

5.2.2 Boxplots

A boxplot (box-and-whisker plot) summarizes a distribution using five numbers: minimum, Q1, median, Q3, and maximum (with outliers plotted separately). Boxplots excel at comparing distributions across groups and at identifying outliers. Their weakness is that they can hide bimodality or other complex shapes.

5.2.3 Bar Charts and Frequency Tables

For categorical data, bar charts display the frequency or proportion of each category. Unlike histograms, the bars do not touch (no ordering implied for nominal data). Frequency tables provide the raw counts and relative frequencies underlying any bar chart.

5.2.4 QQ Plots (Quantile-Quantile Plots)

A QQ plot compares the quantiles of observed data against the quantiles of a theoretical distribution (usually normal). If the data follow that distribution, points fall approximately on a straight diagonal line. Deviations from the line reveal departures from normality — S-curves indicate skewness, heavy tails appear as points curving away at the ends.

5.3 R Example: Data Visualization

# --- Use the built-in 'iris' dataset ---
data(iris)

# 1. Histogram with density overlay for Sepal.Length
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 20, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "tomato", linewidth = 1.2) +
  labs(title    = "Distribution of Sepal Length",
       subtitle = "Histogram with kernel density overlay — iris dataset",
       x        = "Sepal Length (cm)", y = "Density") +
  theme_minimal(base_size = 13)

# 2. Grouped boxplot by species
ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 21, outlier.size = 3) +
  geom_jitter(width = 0.15, alpha = 0.4, size = 1.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(title    = "Petal Length by Species",
       subtitle = "Boxplot with individual observations — iris dataset",
       x        = "Species", y = "Petal Length (cm)") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

# 3. QQ plot to assess normality
ggplot(iris, aes(sample = Sepal.Length)) +
  stat_qq(color = "steelblue", size = 2) +
  stat_qq_line(color = "tomato", linewidth = 1) +
  labs(title    = "Normal QQ Plot — Sepal Length",
       subtitle = "Points near the line suggest approximate normality",
       x        = "Theoretical Quantiles",
       y        = "Sample Quantiles") +
  theme_minimal(base_size = 13)

Code explanation:

  • after_stat(density) in the histogram aesthetic maps bar heights to density (not count), so the density curve overlays correctly.
  • stat_qq() + stat_qq_line() produce a QQ plot entirely within ggplot2 — no additional packages needed.
  • scale_fill_brewer(palette = "Set2") uses a colorblind-friendly palette from ColorBrewer.

5.4 Exercises

TipExercise 1.6

Using the mtcars dataset in R:

  1. Create a histogram of mpg (miles per gallon). Experiment with 5, 15, and 30 bins. Which bin count best reveals the distribution’s shape?
  2. Create side-by-side boxplots of mpg grouped by number of cylinders (cyl). What pattern do you observe?
  3. Produce a QQ plot of mpg and assess whether it follows a normal distribution.

6 Multivariate Descriptive Statistics

6.1 Introduction

Real-world datasets rarely consist of a single variable. Data scientists routinely work with dozens or hundreds of features simultaneously. Multivariate descriptive statistics extend single-variable summaries to describe relationships between variables — whether two variables tend to move together (covariance, correlation), and how to summarize many variables at once.

6.2 Theory

6.2.1 Covariance

Covariance measures the direction of the linear relationship between two variables \(X\) and \(Y\):

\[s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\]

  • \(s_{XY} > 0\): \(X\) and \(Y\) tend to increase together (positive relationship).
  • \(s_{XY} < 0\): When \(X\) increases, \(Y\) tends to decrease (negative relationship).
  • \(s_{XY} = 0\): No linear relationship.

Covariance has a critical limitation: its magnitude depends on the scales of \(X\) and \(Y\), making direct comparisons across variable pairs impossible.

6.2.2 Pearson Correlation Coefficient

The Pearson correlation coefficient \(r\) standardizes covariance to a dimensionless scale from \(-1\) to \(+1\):

\[r_{XY} = \frac{s_{XY}}{s_X \cdot s_Y} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2 \cdot \sum(y_i-\bar{y})^2}}\]

Interpreting Pearson correlation
Magnitude of \(r\) Strength of Linear Relationship
0.00 – 0.19 Very weak or none
0.20 – 0.39 Weak
0.40 – 0.59 Moderate
0.60 – 0.79 Strong
0.80 – 1.00 Very strong

Critical caveat: Pearson’s \(r\) measures only linear relationships. Two variables can be strongly related (e.g., quadratically) yet have \(r \approx 0\). Always visualize with a scatterplot alongside computing \(r\).

6.2.3 Spearman Rank Correlation

When data violates the assumptions of Pearson’s \(r\) (non-normality, outliers, or ordinal scale), Spearman’s \(\rho\) provides a robust alternative based on the ranks of the data rather than raw values:

\[\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\]

where \(d_i\) is the difference in ranks for observation \(i\).

6.2.4 The Covariance and Correlation Matrix

For \(p\) variables, the covariance matrix \(\mathbf{\Sigma}\) is a \(p \times p\) matrix where the diagonal contains variances and off-diagonal entries contain pairwise covariances. The correlation matrix \(\mathbf{R}\) standardizes this to show all correlations. These matrices are fundamental to multivariate methods like PCA (Chapter 7) and linear regression (Chapter 10).

6.3 Example: Multivariate Descriptive Statistics

Example 1.4. For four students, compute the correlation between study hours (\(X\)) and exam score (\(Y\)):

Student Hours (\(X\)) Score (\(Y\))
1 2 55
2 4 70
3 6 80
4 8 90

\(\bar{x} = 5\), \(\bar{y} = 73.75\)

\[\sum(x_i-\bar{x})(y_i-\bar{y}) = (-3)(-18.75)+(-1)(-3.75)+(1)(6.25)+(3)(16.25) = 56.25+3.75+6.25+48.75 = 115\]

\[s_X = \sqrt{\frac{20}{3}} \approx 2.582, \quad s_Y = \sqrt{\frac{306.75}{3}} \approx 10.11\]

\[r = \frac{115/3}{2.582 \times 10.11} \approx \frac{38.33}{26.10} \approx 0.997\]

This near-perfect positive correlation confirms a strong linear relationship between study time and exam performance.

6.4 R Example: Multivariate Descriptive Statistics

# --- Correlation matrix for iris numeric variables ---
iris_num <- iris[, 1:4]

# Pearson correlation matrix
cor_matrix <- cor(iris_num, method = "pearson")
round(cor_matrix, 3)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        1.000      -0.118        0.872       0.818
Sepal.Width        -0.118       1.000       -0.428      -0.366
Petal.Length        0.872      -0.428        1.000       0.963
Petal.Width         0.818      -0.366        0.963       1.000
# --- Visualize the correlation matrix ---
corrplot(cor_matrix,
         method  = "color",
         type    = "upper",
         addCoef.col = "black",
         tl.col  = "black",
         tl.srt  = 45,
         col     = colorRampPalette(c("tomato", "white", "steelblue"))(200),
         title   = "Pearson Correlation Matrix — iris Dataset",
         mar     = c(0, 0, 2, 0))

# --- Scatterplot matrix (pairs plot) ---
ggpairs(iris,
        columns   = 1:4,
        aes(color = Species, alpha = 0.6),
        upper = list(continuous = wrap("cor", size = 3.5)),
        lower = list(continuous = wrap("points", size = 0.8)),
        diag  = list(continuous = wrap("densityDiag"))) +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Scatterplot Matrix — iris Dataset") +
  theme_minimal(base_size = 11)

Code explanation:

  • cor(iris_num, method = "pearson") computes the full correlation matrix. Change to "spearman" for rank-based correlation.
  • corrplot() visualizes the matrix with colors encoding strength and direction; addCoef.col overlays the numeric values.
  • ggpairs() from GGally creates a comprehensive pairs plot: scatterplots (lower), correlations by species (upper), and density distributions (diagonal) — all in one call.

6.5 Exercises

TipExercise 1.7

Using the mtcars dataset:

  1. Compute the Pearson and Spearman correlation between hp (horsepower) and mpg (fuel efficiency). Do they differ substantially? Why?
  2. Produce a correlation matrix for mpg, hp, wt, and qsec. Which pair has the strongest relationship?
  3. Create a ggpairs plot for these four variables, colored by number of cylinders (cyl). Describe any multivariate patterns you observe.
TipExercise 1.8 (Challenge)

The Anscombe Quartet is built into R as anscombe. It contains four pairs of variables (x1/y1, x2/y2, x3/y3, x4/y4).

  1. Compute the mean, variance, and correlation for each pair. What do you notice?
  2. Create a 2×2 scatterplot grid for all four pairs.
  3. What lesson does this exercise teach about relying on summary statistics alone?

7 Chapter Lab Activity: Exploring the mtcars Dataset

7.1 Objectives

In this lab, you will apply all descriptive statistics concepts from this chapter to a real-world dataset — the mtcars dataset, containing specifications and performance data for 32 automobiles from the 1974 Motor Trend magazine. The goal is to build a complete descriptive profile of the data, practicing the full workflow from initial exploration to visual communication of findings.

7.2 The Dataset

# Load and preview the dataset
data(mtcars)

# Display first 6 rows
kable(head(mtcars),
      caption = "First 6 Rows of the mtcars Dataset") |>
  kable_styling(bootstrap_options = c("striped", "hover"), font_size = 11)
First 6 Rows of the mtcars Dataset
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Dataset dimensions
cat("Dimensions:", nrow(mtcars), "rows x", ncol(mtcars), "columns\n")
Dimensions: 32 rows x 11 columns

Variable descriptions:

Variable Description Type
mpg Miles per gallon (fuel efficiency) Continuous
cyl Number of cylinders Categorical (4, 6, 8)
disp Displacement (cu.in.) Continuous
hp Gross horsepower Continuous
wt Weight (1000 lbs) Continuous
am Transmission (0 = automatic, 1 = manual) Binary
gear Number of forward gears Categorical

7.3 Lab Task 1: Full Descriptive Summary

# Comprehensive descriptive statistics using psych::describe()
mtcars_desc <- describe(mtcars[, c("mpg", "hp", "wt", "disp")])

kable(round(mtcars_desc[, c("n","mean","sd","median","min","max","skew","kurtosis")], 3),
      caption = "Descriptive Statistics for Key mtcars Variables") |>
  kable_styling(bootstrap_options = c("striped", "hover")) |>
  column_spec(7, bold = TRUE, color = "tomato")  # highlight skewness column
Descriptive Statistics for Key mtcars Variables
n mean sd median min max skew kurtosis
mpg 32 20.091 6.027 19.200 10.400 33.900 0.611 -0.373
hp 32 146.688 68.563 123.000 52.000 335.000 0.726 -0.136
wt 32 3.217 0.978 3.325 1.513 5.424 0.423 -0.023
disp 32 230.722 123.939 196.300 71.100 472.000 0.382 -1.207

Interpretation guide: Examine the skewness column carefully. Which variables are approximately symmetric? Which are right-skewed? How does this inform your choice of summary statistics (mean vs. median) for each variable?

7.4 Lab Task 2: Central Tendency and Spread by Group

# Compare mpg across cylinder groups (4, 6, 8 cylinders)
mtcars |>
  mutate(cyl = factor(cyl, labels = c("4-cyl", "6-cyl", "8-cyl"))) |>
  group_by(cyl) |>
  summarise(
    n      = n(),
    Mean   = round(mean(mpg), 2),
    Median = round(median(mpg), 2),
    SD     = round(sd(mpg), 2),
    IQR    = round(IQR(mpg), 2),
    CV     = round(sd(mpg)/mean(mpg)*100, 1)
  ) |>
  kable(caption = "Fuel Efficiency (mpg) by Number of Cylinders") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Fuel Efficiency (mpg) by Number of Cylinders
cyl n Mean Median SD IQR CV
4-cyl 11 26.66 26.0 4.51 7.60 16.9
6-cyl 7 19.74 19.7 1.45 2.35 7.4
8-cyl 14 15.10 15.2 2.56 1.85 17.0

7.5 Lab Task 3: Visualizing the Full Distribution

# Multi-panel visualization: histogram, boxplot, QQ plot for mpg
p1 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 12,
                 fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "tomato", linewidth = 1.2) +
  labs(title = "A. Distribution of mpg", x = "Miles per Gallon", y = "Density") +
  theme_minimal(base_size = 12)

p2 <- mtcars |>
  mutate(cyl = factor(cyl)) |>
  ggplot(aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, alpha = 0.6, size = 2) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "B. mpg by Cylinders", x = "Cylinders", y = "Miles per Gallon") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

p3 <- ggplot(mtcars, aes(sample = mpg)) +
  stat_qq(color = "steelblue", size = 2.5) +
  stat_qq_line(color = "tomato", linewidth = 1) +
  labs(title = "C. Normal QQ Plot of mpg",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal(base_size = 12)

p4 <- mtcars |>
  mutate(am = factor(am, labels = c("Automatic", "Manual"))) |>
  ggplot(aes(x = am, y = mpg, fill = am)) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white", alpha = 0.8) +
  scale_fill_manual(values = c("Automatic" = "steelblue", "Manual" = "tomato")) +
  labs(title = "D. mpg by Transmission Type",
       x = "Transmission", y = "Miles per Gallon") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

# Arrange in 2x2 grid using patchwork
library(patchwork)
(p1 + p2) / (p3 + p4)

7.6 Lab Task 4: Correlation Analysis

# Compute and visualize correlations among continuous variables
cont_vars <- mtcars[, c("mpg", "hp", "wt", "disp", "qsec")]
cor_mat    <- cor(cont_vars)

corrplot(cor_mat,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         number.cex  = 0.85,
         tl.col      = "black",
         tl.srt      = 30,
         col         = colorRampPalette(c("tomato", "white", "steelblue"))(200),
         title       = "Correlation Matrix: mtcars Continuous Variables",
         mar         = c(0, 0, 2, 0))

7.7 Lab Discussion Questions

Answer the following in writing (100–150 words each):

  1. Central Tendency: For the variable mpg, which measure — mean or median — is more appropriate as a summary statistic? Justify your answer using skewness and the presence/absence of outliers.

  2. Group Comparison: Based on Task 2, describe the relationship between engine size (cylinders) and fuel efficiency. Is the variability (CV) similar across groups or does it differ? What might explain this?

  3. Normality: Based on the QQ plot in Task 3, does mpg appear normally distributed? What features of the plot support your conclusion?

  4. Correlation vs. Causation: The correlation between hp and mpg is strongly negative. Does this mean that reducing horsepower causes better fuel efficiency? What other variables might explain this relationship?

  5. Visualization Choice: In Panel D, a violin plot is used alongside a boxplot. What additional information does the violin plot reveal that the boxplot alone does not?


8 Chapter Summary

This chapter established the descriptive foundation for all subsequent statistical analysis. The key takeaways are:

  • Central tendency (mean, median, mode) describes where data is centered. The mean is efficient but sensitive to outliers; the median is robust.
  • Dispersion (range, IQR, variance, SD, CV) describes how spread out the data is. The IQR and SD are complementary — one robust, one efficient.
  • Shape (skewness, kurtosis) reveals asymmetry and tail behavior, which guide the choice of statistical methods and transformations.
  • Visualization is never optional. QQ plots, boxplots, histograms, and scatterplot matrices each reveal different features of the data that numbers alone cannot capture.
  • Correlation quantifies linear relationships between pairs of variables but does not imply causation and cannot detect non-linear relationships.
ImportantKey Formulas to Know

\[\bar{x} = \frac{1}{n}\sum x_i \qquad s^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2 \qquad \text{CV} = \frac{s}{\bar{x}} \times 100\%\] \[r = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2 \cdot \sum(y_i-\bar{y})^2}} \qquad \text{IQR} = Q_3 - Q_1\]


End of Chapter 1. Proceed to Chapter 2: Data Distribution and Probability.