Introduction

In statistics, a measure of central tendency is a single value that describes the center of a data distribution. It identifies the location where data points cluster. The three most common measures are:

  1. Mean (The Average)
  2. Median (The Middle Value)
  3. Mode (The Most Frequent Value)

This module explores the mathematical formulas behind these measures, real-life applications, and how to calculate them using R.


1. The Arithmetic Mean

The mean is the sum of all observations divided by the number of observations. It is the most common measure of central tendency but is sensitive to outliers.

Mathematical Formula

For a sample of size \(n\) with values \(x_1, x_2, \dots, x_n\):

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Real-Life Example: Student Grades

Imagine a teacher wants to calculate the “average” performance of a student over 5 exams. Scores: 85, 90, 78, 92, 88.

\[ \text{Mean} = \frac{85 + 90 + 78 + 92 + 88}{5} = \frac{433}{5} = 86.6 \]

R Implementation

# Create a vector of exam scores
exam_scores <- c(85, 90, 78, 92, 88)

# Calculate the mean using the built-in mean() function
average_score <- mean(exam_scores)

print(paste("The mean exam score is:", average_score))
## [1] "The mean exam score is: 86.6"
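To connect the formula to the code, the mean can also be computed by hand with sum() and length(). This is only a minimal sketch; the variable names n and manual_mean are introduced here for illustration.

```r
# Exam scores from the example above
exam_scores <- c(85, 90, 78, 92, 88)

# Apply the formula directly: sum of the observations divided by n
n <- length(exam_scores)
manual_mean <- sum(exam_scores) / n

# Compare with the built-in function, allowing for floating-point rounding
isTRUE(all.equal(manual_mean, mean(exam_scores)))
```

In practice mean() is the better choice, since it also offers options such as na.rm for missing values.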

The Effect of Outliers

If the student scores a 0 on one additional exam, the mean drops significantly, even though the other scores are high.

# Adding an outlier
scores_with_outlier <- c(85, 90, 78, 92, 88, 0)
mean(scores_with_outlier)
## [1] 72.16667

2. The Median

The median is the middle value when the data set is ordered from least to greatest. It is robust against outliers (it is not pulled by extreme values).

Mathematical Formula

First, sort the data \(\{x_1, x_2, ..., x_n\}\) in ascending order.

  • If \(n\) is odd: The median is the value at position \(\frac{n+1}{2}\).
  • If \(n\) is even: The median is the average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\).
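The odd/even rule above can be sketched as a small function. The name manual_median is hypothetical, and the sketch assumes a numeric vector with no missing values (sort() silently drops NA, whereas median() returns NA by default).

```r
# Hypothetical helper that follows the odd/even rule step by step
manual_median <- function(x) {
  sorted <- sort(x)          # step 1: order the data ascending
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]      # odd n: the single middle value
  } else {
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2   # even n: average of the two middle values
  }
}

manual_median(c(3, 1, 2))      # odd n: middle value of 1, 2, 3
manual_median(c(4, 1, 3, 2))   # even n: average of 2 and 3
```

For real work, the built-in median() does the same job and handles edge cases such as empty input.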

Real-Life Example: Real Estate Prices

House prices in a neighborhood are often skewed. Prices (in $1000s): 350, 400, 380, 420, 2500 (Mansion).

If we use the mean, the mansion makes it look like the “average” house costs $810k. The median gives a better representation of a typical house.

Ordered: 350, 380, 400, 420, 2500. Median = 400.

R Implementation

# Create a vector of house prices
house_prices <- c(350, 400, 380, 420, 2500)

# Calculate the median using the built-in median() function
median_price <- median(house_prices)
mean_price <- mean(house_prices)

print(paste("The median price is:", median_price))
## [1] "The median price is: 400"
print(paste("The mean price is:", mean_price, "(Skewed by the mansion)"))
## [1] "The mean price is: 810 (Skewed by the mansion)"
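One way to see what "robust" means is to make the outlier even more extreme and watch which measure moves. The extreme_prices vector below is a hypothetical variation of the example, not data from the note.

```r
# Original prices (in $1000s) from the example above
house_prices <- c(350, 400, 380, 420, 2500)

# Hypothetical variation: the mansion is ten times more expensive
extreme_prices <- c(350, 400, 380, 420, 25000)

median(house_prices)    # 400
median(extreme_prices)  # still 400: the middle value has not changed
mean(house_prices)      # 810
mean(extreme_prices)    # 5310: the mean follows the outlier
```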

3. The Mode

The mode is the value that appears most frequently in a data set. A data set can be unimodal (one mode), bimodal (two modes), or multimodal (more than two modes).

Mathematical Concept

\[ \text{Mode} = x_k \ \text{ such that } \ \text{freq}(x_k) \ge \text{freq}(x_i) \ \text{ for all } i \]

Real-Life Example: Inventory Management

A shoe shop manager needs to restock. Knowing the “average” shoe size (9.125 for the sales below) is useless, because no such size exists. They need to know which size sells the most (here, size 9).

Sales: 7, 8, 9, 9, 9, 10, 10, 11. Mode: 9.

R Implementation

Base R has no built-in function for the statistical mode. The existing mode() function returns an object's storage mode (for example, "numeric") instead. We must create a custom function or use a frequency table.

# Create a vector of shoe sizes sold
shoe_sizes <- c(7, 8, 9, 9, 9, 10, 10, 11)

# Method 1: Using a frequency table to visualize
freq_table <- table(shoe_sizes)
print(freq_table)
## shoe_sizes
##  7  8  9 10 11 
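##  1  1  3  2  1

# Method 2: Custom function to find the mode
# Note: if several values tie for the highest count, only the first one is returned
get_mode <- function(v) {
  uniqv <- unique(v)                            # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]   # value with the highest count
}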

calculated_mode <- get_mode(shoe_sizes)
print(paste("The most popular shoe size (Mode) is:", calculated_mode))
## [1] "The most popular shoe size (Mode) is: 9"
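Because which.max() stops at the first maximum, get_mode() reports a single value even for bimodal data. A possible variant that returns every tied value is sketched below; the name get_all_modes and the second sales vector are hypothetical.

```r
# Hypothetical helper: return every value tied for the highest frequency
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))   # frequency of each distinct value
  uniqv[counts == max(counts)]          # keep all values that reach the maximum
}

get_all_modes(c(7, 8, 9, 9, 9, 10, 10, 11))  # one mode: 9
get_all_modes(c(7, 8, 8, 9, 10, 10))         # two modes: 8 and 10
```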

Summary and Comparison

| Measure | Best Used For | Sensitivity to Outliers |
|---------|---------------|-------------------------|
| Mean    | Symmetric data, numeric data without extremes | High |
| Median  | Skewed data (e.g., income, home prices) | Low (Robust) |
| Mode    | Categorical data (nominal) or discrete data (e.g., sizes) | Low |
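To illustrate the last row, here is a minimal sketch of finding the mode of nominal data with a frequency table. The shoe_colors vector is made-up illustration data. For values like these, mean() is not meaningful (it returns NA with a warning), so the mode is the only one of the three measures that applies.

```r
# Hypothetical nominal data: colors of shoes sold
shoe_colors <- c("black", "brown", "black", "white", "black", "brown")

# Count each category, then pick the label with the highest count
color_counts <- table(shoe_colors)
most_common_color <- names(color_counts)[which.max(color_counts)]

print(most_common_color)
```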

Visualizing Central Tendency in R

Let’s visualize a skewed distribution to see where the Mean and Median land.

# Generate skewed data (Chi-square distribution)
# (named skewed_data to avoid masking R's built-in data() function)
set.seed(123)
skewed_data <- rchisq(1000, df = 4)

# Plot histogram
hist(skewed_data, col = "lightblue", main = "Right Skewed Distribution",
     xlab = "Value", breaks = 30)

# Add vertical lines for Mean and Median
abline(v = mean(skewed_data), col = "red", lwd = 2, lty = 1)    # Mean
abline(v = median(skewed_data), col = "blue", lwd = 2, lty = 2) # Median

legend("topright", legend = c("Mean", "Median"),
       col = c("red", "blue"), lty = c(1, 2), lwd = 2)

Observation: In a right-skewed distribution, the Mean (Red) is pulled to the right by the tail, while the Median (Blue) stays closer to the peak.
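The same observation can be checked numerically rather than read off the plot. The sketch below regenerates the sample under its own name (skewed_sample, chosen here for illustration) and compares it with the theoretical values for a chi-square distribution with 4 degrees of freedom.

```r
# Regenerate the same sample used for the histogram
set.seed(123)
skewed_sample <- rchisq(1000, df = 4)

# Sample values: the mean should sit to the right of the median
c(mean = mean(skewed_sample), median = median(skewed_sample))

# Theoretical values for a chi-square distribution with 4 degrees of freedom
theoretical_mean <- 4                        # the mean equals the degrees of freedom
theoretical_median <- qchisq(0.5, df = 4)    # about 3.36
c(mean = theoretical_mean, median = theoretical_median)
```

The sample estimates will differ slightly from the theoretical values, but the ordering (mean greater than median) is what marks the right skew.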


How to use this Note:

  1. Install R and RStudio if you haven’t already.
  2. Open RStudio.
  3. Go to File -> New File -> R Markdown.
  4. Delete the default template content.
  5. Paste the text and code from this note into the file, placing the R code inside code chunks so that it runs.
  6. Click the Knit button (the yarn ball icon) to generate the formatted lecture note.