1. Introduction

In statistics, a measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.

It is often referred to as a “summary statistic.” The three most common measures are:

  1. The Mean
  2. The Median
  3. The Mode


2. The Arithmetic Mean

The mean (or average) is the most popular and well-known measure of central tendency. It is the sum of all values divided by the number of values.

Mathematical Formula

For a sample of size \(n\), the sample mean \(\bar{x}\) is calculated as:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Where:

  * \(\sum\): The summation symbol
  * \(x_i\): Each individual value in the data set
  * \(n\): The total number of observations
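The formula can be checked directly in R. The sketch below uses a small made-up vector `x` (an assumption for illustration, not data from these notes) and compares a manual sum-divided-by-n computation with the built-in `mean()`.

```r
# Hypothetical data, chosen only to illustrate the formula
x <- c(4, 8, 15, 16, 23, 42)

# n: the total number of observations
n <- length(x)

# Sample mean: the sum of all x_i divided by n
manual_mean <- sum(x) / n

# The built-in mean() should agree with the manual computation
builtin_mean <- mean(x)

print(manual_mean)                                   # 18
print(isTRUE(all.equal(manual_mean, builtin_mean)))  # TRUE
```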

Real-Life Example

Scenario: A teacher wants to find the average score of 5 students in a mini-quiz. The scores are: 85, 90, 70, 75, and 95.

R Calculation

# Define the data
scores <- c(85, 90, 70, 75, 95)

# Calculate mean
avg_score <- mean(scores)
print(paste("The average score is:", avg_score))
## [1] "The average score is: 83"

Note: The mean is highly sensitive to outliers (extreme values).
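A quick way to see this sensitivity is to add one extreme value to the quiz scores and recompute the mean. The sixth score of 10 below is hypothetical and is included only to act as an outlier.

```r
# Quiz scores from the scenario above
scores <- c(85, 90, 70, 75, 95)

# Hypothetical sixth score, added only to illustrate an outlier
scores_with_outlier <- c(scores, 10)

mean_before <- mean(scores)               # 83
mean_after  <- mean(scores_with_outlier)  # about 70.83

# A single extreme value pulls the mean down by roughly 12 points
print(round(mean_before - mean_after, 2))
```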


3. The Median

The median is the middle value in a data set when the values are arranged in ascending or descending order.

Mathematical Definition

To find the median:

  1. Arrange the data in order.
  2. If \(n\) is odd, the median is the value at position \(\frac{n+1}{2}\).
  3. If \(n\) is even, the median is the average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\).
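The positional definition can be translated into R almost line by line. The helper `manual_median()` below is a hypothetical name introduced for illustration, and both test vectors are made up. In practice the built-in `median()` does the same job.

```r
# Hypothetical helper that follows the positional definition step by step
manual_median <- function(x) {
  sorted <- sort(x)   # Step 1: arrange the data in ascending order
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd n: the single middle value at position (n + 1) / 2
    sorted[(n + 1) / 2]
  } else {
    # Even n: average of the values at positions n / 2 and n / 2 + 1
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

odd_data  <- c(7, 3, 9, 1, 5)       # made-up values, n = 5
even_data <- c(7, 3, 9, 1, 5, 11)   # made-up values, n = 6

print(manual_median(odd_data))    # 5
print(manual_median(even_data))   # 6
```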

Real-Life Example

Scenario: Monthly household incomes in a small neighborhood: $2000, $2500, $3000, $3500, and $1,000,000 (an outlier).

The mean is pulled far above every ordinary income by the single $1,000,000 value, while the median gives a better sense of what a “typical” household earns.

R Calculation

incomes <- c(2000, 2500, 3000, 3500, 1000000)

# Calculate Median
med_income <- median(incomes)
avg_income <- mean(incomes)

print(paste("Median Income:", med_income))
## [1] "Median Income: 3000"
print(paste("Mean Income:", round(avg_income, 2)))
## [1] "Mean Income: 202200"

4. The Mode

The mode is the value that appears most frequently in a data set. A data set can have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal).

Real-Life Example

Scenario: A shoe store tracks the sizes of sneakers sold in one hour: 8, 9, 9, 10, 11, 9, 8. The mode is 9 because it occurs most frequently.

R Calculation

Note: R's built-in mode() function returns an object's storage mode (for example, "numeric"), not the statistical mode, so we create a simple custom function or use a frequency table.

# Sample data
shoe_sizes <- c(8, 9, 9, 10, 11, 9, 8)

# Calculate mode using a frequency table
# (names() returns a character string; if several values tie, only one is kept)
mode_val <- names(sort(table(shoe_sizes), decreasing = TRUE))[1]
print(paste("The most popular shoe size is:", mode_val))
## [1] "The most popular shoe size is: 9"
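The frequency-table one-liner keeps a single value even when several values tie for the highest count. A minimal sketch of a helper that returns every mode is shown below. The name `all_modes()` is hypothetical, and the second test vector is made up to show a bimodal case.

```r
# Hypothetical helper: returns every value tied for the highest frequency
all_modes <- function(x) {
  counts <- table(x)
  top <- names(counts)[counts == max(counts)]
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(top) else top
}

shoe_sizes <- c(8, 9, 9, 10, 11, 9, 8)
print(all_modes(shoe_sizes))          # 9

# Made-up bimodal example: 8 and 9 each occur twice
print(all_modes(c(8, 8, 9, 9, 10)))   # 8 9
```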

5. Comparing the Measures

When should you use which measure?

| Measure | Best Used For… | Sensitivity to Outliers |
|---|---|---|
| Mean | Continuous data with a symmetrical distribution | High (very sensitive) |
| Median | Skewed data or data with outliers (e.g., salaries) | Low (robust) |
| Mode | Categorical/nominal data (e.g., favorite color) | Low |
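For categorical data the mode is the only one of the three measures that applies, because the values cannot be summed or ordered. The sketch below uses a made-up vector of favorite colors (an assumption for illustration) and finds the most frequent category with a frequency table.

```r
# Made-up survey responses, used only for illustration
colors <- c("blue", "red", "blue", "green", "blue", "red")

# Count how often each category occurs
color_counts <- table(colors)

# Keep the category (or categories) with the highest count
modal_color <- names(color_counts)[color_counts == max(color_counts)]

print(modal_color)   # "blue"
```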

Visualization: The Impact of Outliers

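The code below generates a right-skewed distribution (a Beta(2, 8) sample scaled to the range 0 to 100) and shows where the Mean and Median fall. In right-skewed data the long right tail pulls the mean above the median.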

library(ggplot2)

# Create a right-skewed dataset (seed fixed so the plot is reproducible)
set.seed(42)
data <- data.frame(val = rbeta(1000, 2, 8) * 100)

# Calculate stats
mu <- mean(data$val)
med <- median(data$val)

# Plot
ggplot(data, aes(x=val)) +
  geom_histogram(fill="skyblue", color="white", bins=30) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_vline(aes(xintercept=mu, color="Mean"), linewidth=1) +
  geom_vline(aes(xintercept=med, color="Median"), linewidth=1) +
  labs(title="Mean vs Median in Right-Skewed Data",
       x="Value", y="Frequency") +
  scale_color_manual(name = "Measures", values = c(Mean = "red", Median = "blue")) +
  theme_minimal()
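The ordering of the two lines in the plot can also be checked without sampling. The sketch below assumes the same Beta(2, 8) shape parameters and the same scaling by 100 as the simulation above. It uses the closed-form Beta mean, a / (a + b), and the base R quantile function `qbeta()` for the median.

```r
# Shape parameters used in the simulation above
a <- 2
b <- 8

# Theoretical mean of Beta(a, b), rescaled by 100 as in the plot
theoretical_mean <- 100 * a / (a + b)

# Theoretical median from the Beta quantile function
theoretical_median <- 100 * qbeta(0.5, a, b)

print(theoretical_mean)                        # 20
print(round(theoretical_median, 1))            # about 18
print(theoretical_mean > theoretical_median)   # TRUE
```

The sample mean and median from the simulated data should land close to these values, with the mean to the right of the median.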


6. Exercises

  1. Dataset: 12, 15, 12, 18, 20, 100. Calculate the mean and median. Which one represents the “center” better?
  2. R Task: Create a vector of 20 random numbers using rnorm(20) and calculate all three measures of central tendency.

End of Module I
