1. Introduction

In statistics, a Measure of Central Tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.

It is often referred to as a “summary statistic.” The three most common measures are:

  1. The Mean
  2. The Median
  3. The Mode


2. The Arithmetic Mean

The mean (or average) is the sum of all values divided by the total number of values.

Mathematical Formula

For a sample of size \(n\), the mean \(\bar{x}\) is calculated as:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]

Where:

  * \(\sum\): The summation symbol.
  * \(x_i\): Each individual value in the data set.
  * \(n\): The total number of observations.

Real-Life Example

Academic Performance: If a student scores 80, 85, 90, 75, and 100 in five exams, the mean is \(\frac{80 + 85 + 90 + 75 + 100}{5} = \frac{430}{5} = 86\), which summarizes their overall performance in a single number.

R Example

# Exam scores
scores <- c(80, 85, 90, 75, 100)

# Calculate Mean
mean_score <- mean(scores)
print(paste("The mean score is:", mean_score))
## [1] "The mean score is: 86"
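
As a quick check on the formula above, the sketch below applies \(\frac{\sum x_i}{n}\) directly and compares the result with R's built-in mean(). The variable name manual_mean is illustrative and not part of the original note.

```r
# Exam scores from the example above
scores <- c(80, 85, 90, 75, 100)

# Apply the formula directly: sum of values divided by number of values
n <- length(scores)
manual_mean <- sum(scores) / n

# Both approaches should agree
manual_mean
mean(scores)
isTRUE(all.equal(manual_mean, mean(scores)))
```

Both values come out to 86 here, since the scores sum to 430 and there are 5 of them.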

3. The Median

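The median is the middle value in a data set when the values are arranged in ascending or descending order. It splits the ordered data into two halves: at least half of the values are less than or equal to it, and at least half are greater than or equal to it.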

Mathematical Formula

  1. Arrange data from smallest to largest.
  2. If \(n\) is odd: The median is the value at position \(\frac{n+1}{2}\).
  3. If \(n\) is even: The median is the average of the two middle values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\).
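
The three steps above can be sketched as a small function. This is a minimal illustration only: median_by_position is a hypothetical helper name, and in practice the built-in median() does the same job.

```r
# Median via the positional rule (hypothetical helper, for illustration)
median_by_position <- function(x) {
  s <- sort(x)        # Step 1: arrange from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # Step 2: odd n, single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # Step 3: even n, average of the two middle values
  }
}

median_by_position(c(80, 85, 90, 75, 100))       # odd n: position 3 of the sorted data
median_by_position(c(45, 50, 52, 55, 60, 150))   # even n: average of positions 3 and 4
```

For the five exam scores the result is 85, and for the six values in the second call it is (52 + 55) / 2 = 53.5.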

Real-Life Example

Real Estate: Median house prices are often reported instead of the mean because a few multi-million-dollar mansions (outliers) would inflate the average, making the “typical” house seem more expensive than it actually is.

R Example

# Salaries in a small startup (in thousands)
salaries <- c(45, 50, 52, 55, 60, 150) # 150 is an outlier

# Mean vs Median
cat("Mean Salary:", mean(salaries), "\n")
## Mean Salary: 68.66667
cat("Median Salary:", median(salaries), "\n")
## Median Salary: 53.5

Note: The outlier (150) pulls the mean up to about 68.7, which is higher than five of the six salaries, while the median (53.5) stays close to what most of the staff earn.
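
To make the effect of the outlier concrete, the sketch below recomputes both measures with the 150 removed. The variable name without_outlier is illustrative and not part of the original note.

```r
# Salaries in a small startup (in thousands), as in the example above
salaries <- c(45, 50, 52, 55, 60, 150)

# Same data with the outlier dropped
without_outlier <- salaries[salaries != 150]

# How far does each measure move when the outlier is included?
mean(salaries) - mean(without_outlier)       # shift in the mean
median(salaries) - median(without_outlier)   # shift in the median
```

Dropping the outlier moves the mean from about 68.7 to 52.4, but the median only from 53.5 to 52.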


4. The Mode

The mode is the value that appears most frequently in a data set. A data set can have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal).

Mathematical Formula

There is no specific algebraic formula for the mode; it is identified by counting frequencies:

\[\text{Mode} = \text{value with the highest frequency}\]

Real-Life Example

Inventory Management: A shoe store manager looks at the mode of shoe sizes sold to determine which size to restock most heavily. If size 9 is sold more than any other, the mode is 9.

R Example

R does not have a standard built-in function for the statistical mode (the built-in mode() returns an object's storage type instead), so we create a simple one:

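# Function to calculate the mode.
# If several values tie for the highest count, only the first one
# encountered in v is returned.
get_mode <- function(v) {
   uniqv <- unique(v)                           # distinct values, in order of first appearance
   uniqv[which.max(tabulate(match(v, uniqv)))]  # value with the largest count
}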

# Data: Shoe sizes sold
shoe_sizes <- c(7, 8, 8, 9, 9, 9, 10, 11)

# Calculate Mode
shoe_mode <- get_mode(shoe_sizes)
print(paste("The mode of shoe sizes is:", shoe_mode))
## [1] "The mode of shoe sizes is: 9"
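
Because get_mode returns a single value, it reports only one mode for bimodal or multimodal data. A minimal sketch of a variant that returns every tied value is shown below; get_all_modes is a hypothetical name introduced here for illustration.

```r
# Return all values that share the highest frequency (hypothetical helper)
get_all_modes <- function(v) {
  uniqv <- unique(v)                    # distinct values
  counts <- tabulate(match(v, uniqv))   # frequency of each distinct value
  uniqv[counts == max(counts)]          # keep every value with the top count
}

# Bimodal numeric data: sizes 8 and 9 each appear twice
get_all_modes(c(7, 8, 8, 9, 9, 10))

# Works for categorical data too
get_all_modes(c("red", "blue", "red", "green", "blue"))
```

The first call returns 8 and 9, and the second returns "red" and "blue". For unimodal data such as the shoe sizes above, it returns the same single value as get_mode.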

5. Comparison: Which one to use?

| Measure | Best Used For | Sensitivity to Outliers |
|---|---|---|
| Mean | Symmetric data, continuous variables | High (very sensitive) |
| Median | Skewed data, ordinal data | Low (robust) |
| Mode | Categorical data (e.g., favorite color) | Low |

Visualization in R

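The histogram below marks the mean and the median on a right-skewed sample. In right-skewed data, the long right tail typically pulls the mean above the median.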

set.seed(123)
data <- rgamma(100, shape = 2, scale = 2) # Right-skewed data

hist(data, col="lightblue", main="Mean vs Median on Skewed Data", xlab="Value")
abline(v = mean(data), col = "red", lwd = 2, lty = 1)
abline(v = median(data), col = "blue", lwd = 2, lty = 2)

legend("topright", legend=c("Mean", "Median"), col=c("red", "blue"), lty=1:2, lwd=2)
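
The sample in the plot is drawn from a gamma distribution, so its population mean and median can also be computed directly for comparison. The sketch below assumes the same shape and scale as the rgamma() call above; the variable names are illustrative.

```r
# Parameters matching the rgamma() call in the plot
gamma_shape <- 2
gamma_scale <- 2

# Population mean of a gamma distribution: shape * scale
pop_mean <- gamma_shape * gamma_scale

# Population median: the 50% quantile of the same distribution
pop_median <- qgamma(0.5, shape = gamma_shape, scale = gamma_scale)

c(mean = pop_mean, median = pop_median)
```

The population mean is 4 and the population median is roughly 3.36, so the mean lies to the right of the median. The sample values marked in the histogram will differ somewhat from these because they are based on only 100 draws.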

6. Summary

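The mean suits symmetric data without extreme values, the median suits skewed data or data with outliers, and the mode suits categorical data. Understanding when to use each measure is crucial for accurate data storytelling and decision-making.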

