1. Introduction

What does a “typical” data point look like? In statistics, a measure of central tendency summarizes a whole data set with a single value that represents the middle or center of its distribution.

The three most common measures are:

  1. Mean
  2. Median
  3. Mode


2. The Mean (Arithmetic Average)

The mean is the sum of all values divided by the total number of values. It is the most common measure of central tendency.

Mathematical Formula

For a sample of \(n\) observations \(x_1, x_2, \dots, x_n\), the sample mean (\(\bar{x}\)) is: \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

R Implementation

# Sample data: Exam scores of 10 students
scores <- c(85, 90, 88, 76, 92, 85, 80, 89, 94, 85)

# Calculate Mean
mean_score <- mean(scores)
print(paste("The mean score is:", mean_score))
## [1] "The mean score is: 86.4"
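As a quick check, the formula can also be applied by hand. The sketch below sums the scores and divides by their count, which should agree with mean(). The vector is repeated so the snippet runs on its own; the name manual_mean is introduced here only for illustration.

```r
# Exam scores from the example above (repeated so this snippet is self-contained)
scores <- c(85, 90, 88, 76, 92, 85, 80, 89, 94, 85)

# Apply the formula directly: sum of the values divided by n
n <- length(scores)
manual_mean <- sum(scores) / n

# The sum is 864 and n is 10, so the mean is 864 / 10 = 86.4
cat("Sum:", sum(scores), "\n")
cat("n:", n, "\n")
cat("Manual mean:", manual_mean, "\n")
```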

Real-Life Example: Classroom Performance

A teacher uses the mean to determine the overall performance of a class. If the mean score is low, the teacher might decide to review the material again.

Pros: Uses every value in the dataset.

Cons: Highly sensitive to outliers (extreme values).


3. The Median (The Middle Value)

The median is the middle value when a data set is ordered from least to greatest. If there is an even number of observations, the median is the average of the two middle numbers.

R Implementation

# Using the same scores data
median_score <- median(scores)
print(paste("The median score is:", median_score))
## [1] "The median score is: 86.5"
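To see where this value comes from, the definition can be followed step by step. This is a minimal sketch rather than how median() is implemented internally; sorted_scores and manual_median are names chosen for illustration.

```r
# Exam scores from the example above (repeated so this snippet is self-contained)
scores <- c(85, 90, 88, 76, 92, 85, 80, 89, 94, 85)

# Step 1: order the values from least to greatest
sorted_scores <- sort(scores)
n <- length(sorted_scores)

# Step 2: take the middle value, or average the two middle values when n is even
if (n %% 2 == 1) {
  manual_median <- sorted_scores[(n + 1) / 2]
} else {
  manual_median <- (sorted_scores[n / 2] + sorted_scores[n / 2 + 1]) / 2
}

print(sorted_scores)
cat("Manual median:", manual_median, "\n")
```

With 10 scores, the two middle values are the 5th and 6th in sorted order (85 and 88), and their average is 86.5.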

Real-Life Example: Real Estate & Income

If you are looking at housing prices in a neighborhood where 9 houses cost $300k each and 1 mansion costs $10 million, the mean would be $1.27 million (misleading, since no house is priced anywhere near that), while the median would remain $300k. This is why household income reports typically quote the median.
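The figures in this example can be reproduced in a few lines. The sketch below builds the ten prices described above; the variable name prices is an assumption made for illustration.

```r
# Prices from the example: nine houses at $300k and one mansion at $10 million
prices <- c(rep(300000, 9), 10000000)

# scientific = FALSE keeps R from printing large round numbers in exponent notation
fmt <- function(x) format(x, big.mark = ",", scientific = FALSE)

cat("Mean price:  ", fmt(mean(prices)), "\n")
cat("Median price:", fmt(median(prices)), "\n")
```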


4. The Mode (The Most Frequent Value)

The mode is the value that appears most often in a data set. A data set can have one mode, more than one mode (bimodal/multimodal), or no mode at all.

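Note: R does not have a built-in function for the statistical mode (the base function mode() reports an object's storage type instead), so we create a simple custom function. When several values tie for the highest count, the function below returns only the one that appears first in the data.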

R Implementation

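# Custom function to find the mode
get_mode <- function(x) {
  uniqx <- unique(x)                   # distinct values, in order of first appearance
  counts <- tabulate(match(x, uniqx))  # how many times each distinct value occurs
  uniqx[which.max(counts)]             # value with the highest count
}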

# Calculate Mode
mode_score <- get_mode(scores)
print(paste("The mode score is:", mode_score))
## [1] "The mode score is: 85"
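Because a data set can have several modes or none, a variant that reports every tied value may be useful. The sketch below is one possible approach, not a standard R function: get_modes is a hypothetical helper, and treating "no value repeats" as "no mode" is a convention assumed here.

```r
# Hypothetical helper (not part of base R): return every value tied for the highest count
get_modes <- function(x) {
  if (length(x) == 0) {
    return(x)                          # nothing to count
  }
  uniqx <- unique(x)
  counts <- tabulate(match(x, uniqx))
  # Assumed convention: if no value repeats, report "no mode" as an empty vector
  if (max(counts) == 1) {
    return(uniqx[0])
  }
  uniqx[counts == max(counts)]
}

print(get_modes(c(85, 90, 88, 76, 92, 85, 80, 89, 94, 85)))  # one mode
print(get_modes(c(1, 2, 2, 3, 3, 4)))                        # two modes (bimodal)
print(get_modes(c(5, 6, 7, 8)))                              # no mode: empty vector
```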

Real-Life Example: Inventory Management

A shoe store manager needs to know which shoe size is the Mode. Knowing the “average” shoe size (e.g., 8.34) is useless for ordering stock, but knowing that Size 9 sells the most (the mode) is vital.
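The contrast can be illustrated with a small made-up sales record. The data below is hypothetical and chosen only so that the mean is a size nobody stocks; table() counts the pairs sold per size, and the size with the largest count is the mode.

```r
# Hypothetical sales record: the size of each pair sold (made-up data for illustration)
sizes_sold <- c(6, 7, 7, 8, 9, 9, 9, 9, 10, 10)

# Count how many pairs of each size were sold
size_counts <- table(sizes_sold)
print(size_counts)

# The best-selling size is the mode; the mean says nothing about which size to stock
best_seller <- as.numeric(names(size_counts)[which.max(size_counts)])
cat("Mean size:", mean(sizes_sold), "\n")
cat("Best-selling size (mode):", best_seller, "\n")
```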


5. Comparing Mean, Median, and Mode

When to use what?

Measure | Best for…                                | Sensitive to Outliers?
--------|------------------------------------------|-----------------------
Mean    | Symmetric data (Normal distribution)     | Yes (High)
Median  | Skewed data (Income, House prices)       | No
Mode    | Categorical data (Colors, Brands, Sizes) | No

Visualizing the Impact of Outliers

Observe how one “Outlier” (the value 1000) changes the mean drastically but barely touches the median.

# Data with an outlier
data_with_outlier <- c(10, 12, 11, 14, 13, 1000)

cat("Mean:", mean(data_with_outlier), "\n")
## Mean: 176.6667
cat("Median:", median(data_with_outlier), "\n")
## Median: 12.5
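To quantify the effect, the same two measures can be computed with the outlier removed. This is a minimal sketch; data_without_outlier, mean_shift, and median_shift are names introduced here for illustration.

```r
# Same data as above (repeated so this snippet is self-contained)
data_with_outlier <- c(10, 12, 11, 14, 13, 1000)

# Drop the extreme value (1000) to see what it contributed
data_without_outlier <- data_with_outlier[data_with_outlier != 1000]

cat("Mean without outlier:  ", mean(data_without_outlier), "\n")
cat("Median without outlier:", median(data_without_outlier), "\n")

# How far each measure moves once the outlier is included
mean_shift <- mean(data_with_outlier) - mean(data_without_outlier)
median_shift <- median(data_with_outlier) - median(data_without_outlier)
cat("Shift in mean:  ", mean_shift, "\n")
cat("Shift in median:", median_shift, "\n")
```

Without the outlier, both the mean and the median are 12. Adding the single value 1000 moves the mean by about 164.7 but the median by only 0.5.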

6. Practical Exercise

Imagine you are a Data Analyst for a streaming service. You have the following data representing the number of minutes 10 users spent watching a show: minutes <- c(22, 25, 22, 28, 21, 23, 150, 22, 24, 26)

  1. Calculate the Mean and Median.
  2. Which measure better represents the “typical” viewer?
  3. Why is one value (150) so different? (It is an outlier: perhaps someone fell asleep with the TV on!)

minutes <- c(22, 25, 22, 28, 21, 23, 150, 22, 24, 26)

# Your calculations
avg_min <- mean(minutes)
med_min <- median(minutes)

# Plotting to visualize
hist(minutes, col="skyblue", main="Distribution of Watch Time", xlab="Minutes")
abline(v = avg_min, col="red", lwd=2, lty=2) # Mean in red
abline(v = med_min, col="blue", lwd=2)      # Median in blue
legend("topright", legend=c("Mean", "Median"), col=c("red", "blue"), lty=c(2,1), lwd=2)


7. Summary

  • Mean is the “balance point.”
  • Median is the “positional middle” of the ordered data.
  • Mode is the “popularity winner.”
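The first two descriptions can be checked numerically on the exam scores used earlier. The sketch below shows that deviations from the mean cancel out (up to floating-point rounding) and that the median splits the ordered scores evenly; the names deviations and med are chosen for illustration.

```r
# Exam scores from the earlier sections (repeated so this snippet is self-contained)
scores <- c(85, 90, 88, 76, 92, 85, 80, 89, 94, 85)

# Balance point: deviations above and below the mean cancel out
deviations <- scores - mean(scores)
cat("Sum of deviations from the mean:", round(sum(deviations), 10), "\n")

# Middle: as many scores fall below the median as above it
med <- median(scores)
cat("Scores below the median:", sum(scores < med), "\n")
cat("Scores above the median:", sum(scores > med), "\n")
```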

In the next module, we will discuss Measures of Dispersion (Range, Variance, and Standard Deviation) to see how spread out our data is around these centers.
