1. Introduction

In statistics, a measure of central tendency is a single value that summarizes a whole dataset by describing the middle or center of its distribution.

It is one kind of “summary statistic.” The three most common measures are:

  1. Mean
  2. Median
  3. Mode


2. The Arithmetic Mean

The mean is the most common measure of central tendency. It is the “average” of all values: their sum divided by the number of values.

Mathematical Formula

For a sample of size \(n\), the sample mean (\(\bar{x}\)) is calculated as:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Where:

  • \(\sum\) = Summation symbol
  • \(x_i\) = Each individual value in the dataset
  • \(n\) = Total number of observations

R Example

# Dataset: Exam scores of 10 students
scores <- c(85, 90, 78, 92, 88, 76, 95, 89, 84, 82)

# Calculate Mean
mean_score <- mean(scores)
print(paste("The mean score is:", mean_score))
## [1] "The mean score is: 85.9"
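
As a cross-check, the formula above can also be applied directly with `sum()` and `length()`. This is only a minimal sketch; the names `n` and `manual_mean` are introduced here for illustration.

```r
# Same exam scores as in the example above
scores <- c(85, 90, 78, 92, 88, 76, 95, 89, 84, 82)

# Apply the formula directly: sum of the values divided by n
n <- length(scores)
manual_mean <- sum(scores) / n

# Compare with the built-in function
# (all.equal tolerates floating-point rounding differences)
all.equal(manual_mean, mean(scores))
```

If the two approaches agree, `all.equal()` returns `TRUE`.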

Real-Life Example

Corporate Salaries: If a company has 5 employees earning $40k, $45k, $50k, $55k, and $200k (the CEO), the mean salary is $78k. Notice how the CEO’s high salary pulls the mean upward, making it less representative of the “average” worker.
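
The salary example can be reproduced in a few lines of R. The sketch below uses the five salaries from the paragraph above (in thousands of dollars) and also computes the median, which Section 3 defines, to show how little it reacts to the CEO's salary. The variable name `salaries` is an illustrative choice.

```r
# Salaries from the example above, in thousands of dollars
salaries <- c(40, 45, 50, 55, 200)

mean(salaries)    # 78: pulled upward by the CEO's salary
median(salaries)  # 50: the middle salary, unaffected by the CEO

# Mean of the four non-CEO salaries, for comparison
mean(salaries[salaries < 200])  # 47.5
```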


3. The Median

The median is the middle value in a dataset when the values are arranged in ascending or descending order. It splits the data into two equal halves: half of the observations lie at or below it, and half lie at or above it.

Mathematical Definition

  1. Arrange data from smallest to largest.
  2. If \(n\) is odd, the median is the value at position: \(\frac{n+1}{2}\)
  3. If \(n\) is even, the median is the average of the two middle values at positions: \(\frac{n}{2}\) and \(\frac{n}{2} + 1\)
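
The three steps above can be sketched as a small R function. `manual_median` is a hypothetical helper written for illustration (R's built-in `median()` already does this); it assumes a non-empty numeric vector with no missing values.

```r
# Hypothetical helper that follows the three steps above
manual_median <- function(x) {
  x_sorted <- sort(x)                     # Step 1: smallest to largest
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]                 # Step 2: odd n, single middle value
  } else {
    mean(x_sorted[c(n / 2, n / 2 + 1)])   # Step 3: even n, average the two middle values
  }
}

odd_example  <- c(2, 5, 1, 8, 10, 3, 4)      # n = 7
even_example <- c(2, 5, 1, 8, 10, 3, 4, 6)   # n = 8

manual_median(odd_example)   # 4
manual_median(even_example)  # 4.5 (average of 4 and 5)
```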

R Example

# Dataset: Number of hours spent studying
study_hours <- c(2, 5, 1, 8, 10, 3, 4)

# Calculate Median
median_hours <- median(study_hours)
print(paste("The median study hours is:", median_hours))
## [1] "The median study hours is: 4"

Real-Life Example

Real Estate: When reporting home prices in a city, economists usually use the Median. Because a few multi-million dollar mansions would skew the “Mean” price higher, the Median provides a more accurate picture of what a typical homebuyer can expect to pay.


4. The Mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal).

R Example

Base R has no built-in function for the statistical mode (the built-in mode() function reports an object's storage type instead), so we create a custom function:

get_mode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Dataset: Shoe sizes sold in a day
shoe_sizes <- c(7, 8, 8, 9, 10, 10, 10, 11, 12)

# Calculate Mode
mode_size <- get_mode(shoe_sizes)
print(paste("The mode of shoe sizes is:", mode_size))
## [1] "The mode of shoe sizes is: 10"
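
Note that `get_mode` uses `which.max()`, which returns only the first maximum, so for bimodal or multimodal data it reports just one of the modes. A possible variant that returns every tied value is sketched below; `get_all_modes` and `bimodal_sizes` are hypothetical names and made-up data used only for illustration.

```r
# Hypothetical helper: return every value tied for the highest frequency
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))   # frequency of each unique value
  uniqv[counts == max(counts)]
}

# Made-up bimodal example: sizes 8 and 10 each appear three times
bimodal_sizes <- c(7, 8, 8, 8, 9, 10, 10, 10, 11)
get_all_modes(bimodal_sizes)   # returns 8 and 10
```

If every value occurs exactly once, this variant returns all of them, so the result should be read together with the frequencies.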

Real-Life Example

Inventory Management: A shoe store manager needs the Mode to know which shoe size to stock the most. Knowing that the “average” size is about 9.4 is not helpful, but knowing that size 10 is the most frequently purchased is vital for business.


5. Comparing Mean, Median, and Mode

The relationship between these measures depends on the shape of the distribution.

| Feature | Mean | Median | Mode |
|---|---|---|---|
| Definition | Average | Middle Value | Most Frequent |
| Sensitive to Outliers? | Yes (High) | No (Robust) | No |
| Data Type | Quantitative | Quantitative/Ordinal | Quantitative/Categorical |

The Effect of Skewness

  • Symmetrical and unimodal (e.g., Normal Distribution): Mean = Median = Mode.
  • Right Skewed (Positive): typically Mean > Median > Mode (the long right tail of high values pulls the mean upward).
  • Left Skewed (Negative): typically Mean < Median < Mode (the long left tail of low values pulls the mean downward).
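
These orderings can be checked on a small example. The sketch below uses made-up data (`right_skewed` and its mirror image `left_skewed`) and a hypothetical helper `summarise_center`; it repeats the `get_mode` function from Section 4 so that it runs on its own.

```r
# Mode helper (first most frequent value), as in Section 4
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Made-up data with a long right tail
right_skewed <- c(1, 2, 2, 2, 3, 3, 4, 5, 9, 15)
# Mirror image of the same data, giving a long left tail
left_skewed <- 16 - right_skewed

# Hypothetical helper: collect the three measures in one named vector
summarise_center <- function(x) {
  c(mean = mean(x), median = median(x), mode = get_mode(x))
}

summarise_center(right_skewed)  # mean 4.6,  median 3,  mode 2
summarise_center(left_skewed)   # mean 11.4, median 13, mode 14
```

Real data will not always follow these orderings exactly, so they are best treated as a rule of thumb rather than a test for skewness.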

6. Practical Lab: Visualizing Central Tendency

Let’s see how an outlier affects the Mean vs. the Median using R’s ggplot2.

library(ggplot2)

# Create a dataset with an outlier
data <- data.frame(val = c(10, 12, 11, 13, 12, 14, 15, 100)) # 100 is the outlier

# Calculate statistics
m_mean <- mean(data$val)
m_median <- median(data$val)

# Plot
ggplot(data, aes(x = val)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
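  # linewidth replaces the size argument for lines, which is deprecated since ggplot2 3.4.0
  geom_vline(aes(xintercept = m_mean, color = "Mean"), linewidth = 1.5) +
  geom_vline(aes(xintercept = m_median, color = "Median"), linewidth = 1.5) +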
  labs(title = "Impact of an Outlier on Mean vs Median",
       subtitle = paste("Mean is", round(m_mean, 2), "while Median is", m_median),
       x = "Values", y = "Frequency") +
  scale_color_manual(name = "Statistics", values = c(Mean = "red", Median = "blue"))
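
To put numbers on what the plot shows, the two statistics can be compared with and without the outlier. This is a minimal sketch using the same values as above; `vals` and `vals_no_outlier` are names introduced here for illustration.

```r
# Same values as in the plot above
vals <- c(10, 12, 11, 13, 12, 14, 15, 100)
vals_no_outlier <- vals[vals != 100]

# How far each statistic moves when the outlier is dropped
mean(vals) - mean(vals_no_outlier)      # about 10.95 (23.375 vs. about 12.43)
median(vals) - median(vals_no_outlier)  # 0.5 (12.5 vs. 12)
```

A single extreme value shifts the mean by roughly 11 units, while the median moves by only half a unit.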

Summary Task

  1. Using the mtcars dataset in R, calculate the mean and median of the mpg (miles per gallon) variable.
  2. Determine if the distribution of mpg is likely skewed based on these results.