1. Introduction

In statistics, a measure of central tendency is a single value that describes a data set by identifying its central position. For this reason, measures of central tendency are sometimes called measures of central location.

The three main measures are:

  1. Mean (The Average)
  2. Median (The Middle)
  3. Mode (The Most Frequent)


2. The Arithmetic Mean

The mean is the most widely known measure of central tendency. It can be used with both discrete and continuous data, although it is most often applied to continuous data.

Mathematical Formula

For a sample of size \(n\) with observed values \(x_1, x_2, \ldots, x_n\), the sample mean (denoted \(\bar{x}\)) is calculated as:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Where:

  • \(\sum\) indicates summation.
  • \(x_i\) represents each individual value.
  • \(n\) represents the total number of observations.

Real-Life Example: Student Test Scores

Imagine a teacher wants to find the “average” performance of a class on a difficult math quiz.

Data: 78, 85, 92, 60, 88

Calculation: \[ \frac{78 + 85 + 92 + 60 + 88}{5} = \frac{403}{5} = 80.6 \]

R Implementation

# Define the dataset
quiz_scores <- c(78, 85, 92, 60, 88)

# Calculate the mean
mean_score <- mean(quiz_scores)

print(paste("The average test score is:", mean_score))
## [1] "The average test score is: 80.6"
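
As a cross-check, the formula can also be written out directly rather than calling `mean()`. This is a minimal sketch; `manual_mean` is a name chosen here for illustration and is not part of the original notes.

```r
# Same quiz scores as in the example above
quiz_scores <- c(78, 85, 92, 60, 88)

# Sample mean written out from the formula: sum of x_i divided by n
manual_mean <- sum(quiz_scores) / length(quiz_scores)

# Compare with the built-in function
# (all.equal tolerates floating-point rounding)
all.equal(manual_mean, mean(quiz_scores))
```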

Warning: The mean is heavily influenced by outliers (extreme values). If a sixth student scored 0, the average would drop from 80.6 to about 67.2, below four of the five original scores, and would misrepresent the class performance.
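
To illustrate the warning, the sketch below appends a hypothetical sixth score of 0 to the quiz data and recomputes the mean. The extra score and the name `scores_with_zero` are assumptions made for illustration only.

```r
quiz_scores <- c(78, 85, 92, 60, 88)

# Hypothetical: a sixth student who scored 0
scores_with_zero <- c(quiz_scores, 0)

mean(quiz_scores)        # 403 / 5 = 80.6
mean(scores_with_zero)   # 403 / 6, about 67.2
```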


3. The Median

The median is the middle value of a data set that has been arranged in order of magnitude. It is less affected by outliers and skewed data than the mean.

Mathematical Logic

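  1. Sort the data from smallest to largest.
  2. Find the position using the formula \(\frac{n+1}{2}\).
    • If \(n\) is odd, this position is a whole number, and the median is the value at that position.
    • If \(n\) is even, the position falls halfway between two values, and the median is the average of the two middle numbers (positions \(\frac{n}{2}\) and \(\frac{n}{2}+1\)).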
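
The steps above can be sketched as a small function. `manual_median` is a hypothetical helper written for illustration; in practice base R's `median()` does the same job. The sketch assumes a non-empty numeric vector, and the two test vectors are made-up values.

```r
# Hypothetical helper that follows the sort-then-pick-the-middle steps
manual_median <- function(x) {
  sorted <- sort(x)            # Step 1: sort from smallest to largest
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd n: (n + 1) / 2 is a whole number, so take that value directly
    sorted[(n + 1) / 2]
  } else {
    # Even n: average the values at positions n/2 and n/2 + 1
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

manual_median(c(3, 1, 2))       # odd n: middle of 1, 2, 3
manual_median(c(4, 1, 3, 2))    # even n: average of 2 and 3
```

Note that `sort()` silently drops `NA` values by default, so this sketch would ignore missing data, whereas `median()` returns `NA` unless `na.rm = TRUE` is set.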

Real-Life Example: Real Estate Prices

In this neighborhood, most houses cost about $200k, but there is one mansion worth $5 million.

Data: $200k, $205k, $210k, $220k, $5,000k (sorted)

  • Mean Calculation: $1,167k, since the five prices sum to $5,835k (this suggests the “average” house costs over a million dollars, which is misleading).
  • Median Calculation: Once the prices are sorted, the middle value is $210k. This gives a much better idea of what a typical house costs.

R Implementation

# Define house prices (in thousands)
house_prices <- c(200, 210, 205, 220, 5000)

# Calculate Mean vs Median
avg_price <- mean(house_prices)
med_price <- median(house_prices)

print(paste("Mean Price: $", avg_price, "k"))
## [1] "Mean Price: $ 1167 k"
print(paste("Median Price: $", med_price, "k"))
## [1] "Median Price: $ 210 k"

4. The Mode

The mode is the most frequent value in a data set. On a bar chart or histogram it corresponds to the highest bar. The mode is the only measure of central tendency that can be used with categorical (nominal) data.

Mathematical Logic

There is no complex algebraic formula. It is determined by frequency counts:

\[ \text{Mode} = x \text{ such that } \text{Frequency}(x) \text{ is maximized} \]

  • Unimodal: One mode.
  • Bimodal: Two modes.
  • Multimodal: More than two modes.

Real-Life Example: Retail Inventory

A shoe store manager needs to restock inventory. Knowing the “average” shoe size (e.g., size 9.3) is useless because you cannot buy a size 9.3 shoe. The manager needs to know the mode: the size that sells the most.

Data: 7, 8, 8, 9, 9, 9, 10, 10, 11

Mode: 9 (It appears 3 times).

R Implementation

R does not have a built-in function for the statistical mode (the base function mode() reports an object's storage mode instead), so we usually create a frequency table.

# Shoe sizes sold today
shoe_sizes <- c(7, 8, 8, 9, 9, 9, 10, 10, 11)

# Create a frequency table
freq_table <- table(shoe_sizes)

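# Find the size with the max frequency
# (which.max returns only the first maximum, so ties are not reported;
#  names() returns the size as a character string)
mode_size <- names(freq_table)[which.max(freq_table)]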

print(freq_table)
## shoe_sizes
##  7  8  9 10 11 
##  1  2  3  2  1
print(paste("The modal shoe size is:", mode_size))
## [1] "The modal shoe size is: 9"
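
Because `which.max()` stops at the first maximum, the approach above reports only one value when the data are bimodal or multimodal. A minimal sketch that returns every tied value follows; `all_modes` is a hypothetical helper, and the second and third test vectors are made-up data used only to show a tie and a categorical case.

```r
# Hypothetical helper: return every value that attains the maximum frequency
all_modes <- function(x) {
  freq <- table(x)
  # Compare plain counts against the highest count and keep matching labels
  names(freq)[as.vector(freq) == max(freq)]
}

all_modes(c(7, 8, 8, 9, 9, 9, 10, 10, 11))       # one mode (unimodal)
all_modes(c(7, 8, 8, 9, 9, 10))                  # two modes (bimodal)
all_modes(c("blue", "brown", "brown", "green"))  # categorical data
```

The result is a character vector because it comes from the table's names; wrap it in `as.numeric()` when the input was numeric. If every value occurs exactly once, this sketch returns all of them, which is one way of signalling that there is no meaningful mode.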

5. Summary: When to use which?

| Measure | Best Used For | Pros | Cons |
|---------|---------------|------|------|
| Mean | Continuous data, Symmetrical distributions | Uses every value in the data | Sensitive to outliers |
| Median | Skewed data (Income, Home prices) | Robust against outliers | Ignores the precise value of outliers |
| Mode | Categorical data (Eye color, Brand preference) | Works with non-numeric data | There may be no mode or multiple modes |

Visualizing the difference

Let’s look at a skewed distribution (Salary Data) to see how the measures diverge.

  • Blue Line (Median): Closer to the peak (the typical value).
  • Red Line (Mean): Pulled to the right by high values.
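
A plot of this kind can be sketched in base R as follows. The salary data here is simulated from a log-normal distribution, which is an assumption made for illustration and not the data behind the original figure; the seed, sample size, and parameters are arbitrary.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# Simulated right-skewed salaries (assumption: log-normal, not real data)
salaries <- rlnorm(1000, meanlog = 11, sdlog = 0.5)

salary_mean <- mean(salaries)
salary_median <- median(salaries)

# Histogram with the median in blue and the mean in red
hist(salaries, breaks = 40,
     main = "Simulated salary distribution", xlab = "Salary")
abline(v = salary_median, col = "blue", lwd = 2)
abline(v = salary_mean, col = "red", lwd = 2)
legend("topright", legend = c("Median", "Mean"),
       col = c("blue", "red"), lwd = 2)
```

With a right-skewed sample like this one, the red line should fall to the right of the blue line, matching the description above.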

