1. Introduction

Measures of central tendency are summary statistics that represent the center point or “typical” value of a dataset. They condense a large set of data into a single representative value to make the data easier to understand.

The three primary measures are:

1. Mean (The Average)
2. Median (The Middle Value)
3. Mode (The Most Frequent Value)


2. The Arithmetic Mean (\(\bar{x}\))

The mean is the most common measure of central tendency. It is calculated by summing all observations and dividing by the total number of observations.

Mathematical Formula

For a sample of size \(n\): \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Where:

- \(\bar{x}\) = Sample Mean
- \(\sum_{i=1}^{n} x_i\) = Sum of all values
- \(n\) = Number of values

R Example

# Exam scores of 10 students
scores <- c(85, 90, 78, 92, 88, 76, 89, 95, 82, 85)

# Calculate mean using R's built-in function
avg_score <- mean(scores)
print(paste("The average score is:", avg_score))
## [1] "The average score is: 86"
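To connect the formula to the built-in function, the following minimal sketch applies \(\bar{x} = \frac{\sum x_i}{n}\) by hand to the same exam scores and checks the result against mean(). The variable names n and manual_mean are illustrative only.

```r
# Same exam scores as in the example above
scores <- c(85, 90, 78, 92, 88, 76, 89, 95, 82, 85)

# Apply the formula directly: sum of the values divided by the number of values
n <- length(scores)
manual_mean <- sum(scores) / n

print(manual_mean)

# Confirm that the hand calculation agrees with R's built-in mean()
print(isTRUE(all.equal(manual_mean, mean(scores))))
```

Here the scores sum to 860 and \(n = 10\), so the hand calculation gives 86, matching the built-in result.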

Real-Life Application

Education: Teachers use the mean to determine the overall performance of a class. If the mean score is low, it might indicate that the teaching method was ineffective or that the exam was too difficult.


3. The Median

The median is the middle value of a dataset when it is ordered from smallest to largest. It is highly “robust,” meaning a few extreme outliers have little or no effect on it, because it depends on the order of the values rather than on their magnitudes.

Mathematical Definition

To find the median, first sort the data:

1. If \(n\) is odd: The median is the value at position \(\frac{n+1}{2}\).
2. If \(n\) is even: The median is the average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\).
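The two cases above can be sketched as a small function. This is an illustration of the definition, not a replacement for R's built-in median(). The name manual_median and the sample vectors are hypothetical, and the sketch assumes a non-empty numeric vector.

```r
# Median by definition: sort first, then pick the middle position(s)
manual_median <- function(x) {
  sorted <- sort(x)          # sort() also drops NA values by default
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd n: the value at position (n + 1) / 2
    sorted[(n + 1) / 2]
  } else {
    # Even n: the average of the values at positions n / 2 and n / 2 + 1
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

odd_sample <- c(60, 45, 1000, 50, 55)   # n = 5
even_sample <- c(60, 45, 50, 55)        # n = 4

print(manual_median(odd_sample))
print(manual_median(even_sample))

# Both results should agree with the built-in median()
print(manual_median(odd_sample) == median(odd_sample))
print(manual_median(even_sample) == median(even_sample))
```

For the odd sample the sorted values are 45, 50, 55, 60, 1000, so the median is the third value, 55. For the even sample the two middle values are 50 and 55, giving 52.5.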

R Example

# Salaries of 5 employees (in thousands)
salaries <- c(45, 50, 55, 60, 1000) # 1000 is an outlier (CEO)

# Calculate mean vs median
print(paste("Mean Salary:", mean(salaries)))
## [1] "Mean Salary: 242"
print(paste("Median Salary:", median(salaries)))
## [1] "Median Salary: 55"

Real-Life Application

Real Estate: Median house prices are commonly reported instead of the mean because a few multi-million dollar mansions (outliers) would inflate the “average” price, making it look like most houses are more expensive than they actually are.


4. The Mode

The mode is the value that appears most frequently in a dataset. A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (more than two modes).

Note: Base R does not have a built-in function for the statistical mode. The existing mode() function instead returns an object's storage mode, such as "numeric" or "character". We often create a custom function or use a package.

Custom R Function for Mode

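# Return the most frequent value in v (the first one encountered if there is a tie)
get_mode <- function(v) {
  uniqv <- unique(v)                            # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]   # value with the highest count
}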

# Survey of favorite colors
colors <- c("Blue", "Red", "Blue", "Green", "Blue", "Red")
print(paste("The most popular color is:", get_mode(colors)))
## [1] "The most popular color is: Blue"
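Because which.max() returns only the first maximum, the function above reports a single value even when the data is bimodal or multimodal. A possible variant that returns every tied value is sketched below. The name get_all_modes and the shoe-size data are hypothetical.

```r
# Return every value that shares the highest frequency
get_all_modes <- function(v) {
  uniqv <- unique(v)                     # distinct values, in order of first appearance
  counts <- tabulate(match(v, uniqv))    # how often each distinct value occurs
  uniqv[counts == max(counts)]           # keep all values tied for the top count
}

# Hypothetical shoe sizes sold: sizes 8 and 9 each appear twice
sizes <- c(8, 9, 8, 10, 9, 7)
print(get_all_modes(sizes))

# With a single most frequent value, the result has length one
colors <- c("Blue", "Red", "Blue", "Green", "Blue", "Red")
print(get_all_modes(colors))
```

For the shoe sizes this returns both 8 and 9, while for the color survey it returns only "Blue". Note that if every value occurs equally often, this sketch returns all of them, which is usually read as the data having no meaningful mode.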

Real-Life Application

Inventory Management: A shoe store manager needs the mode of shoe sizes sold to know which size to stock the most. The “average” shoe size (e.g., 8.34) is useless for ordering physical inventory.


5. Comparing the Measures

| Measure | Best Used For | Sensitive to Outliers? |
|---|---|---|
| Mean | Continuous data with no outliers (Normal distribution) | Yes (Very sensitive) |
| Median | Skewed data or data with extreme outliers | No |
| Mode | Categorical data or discrete data | No |

Visualizing Skewness

When data is skewed, these measures typically separate (a rule of thumb for unimodal distributions, not a guarantee):

- Right Skewed (Positive): Mode < Median < Mean
- Left Skewed (Negative): Mean < Median < Mode
- Symmetric: Mean \(\approx\) Median \(\approx\) Mode
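The ordering for skewed data can be checked on a small made-up dataset. The sketch below uses hypothetical values chosen to have a long right tail, then mirrors them to produce a left-skewed version. The helper names first_mode and summarise_center are illustrative only.

```r
# Helper: most frequent value (the first one encountered if there is a tie)
first_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Collect the three measures into one named vector
summarise_center <- function(x) {
  c(mode = first_mode(x), median = median(x), mean = mean(x))
}

# Hypothetical right-skewed data: mostly small values with a long tail to the right
right_skewed <- c(1, 1, 1, 2, 2, 3, 4, 5, 9)

# Mirror image of the same data, which moves the long tail to the left
left_skewed <- 10 - right_skewed

right_summary <- summarise_center(right_skewed)
left_summary <- summarise_center(left_skewed)

print(round(right_summary, 2))
print(round(left_summary, 2))
```

For the right-skewed values the mode is 1, the median is 2, and the mean is about 3.11. For the mirrored values the mode is 9, the median is 8, and the mean is about 6.89, so the order reverses as described.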


6. Conclusion

Selecting the correct measure of central tendency depends entirely on the type of data and the distribution. While the mean is mathematically powerful, the median and mode often provide a more “honest” look at real-world data containing outliers or categories.