1. Introduction

In statistics, a measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.

Such a measure is often referred to as a “summary statistic.” The three most common measures are the mean, the median, and the mode.


2. The Arithmetic Mean

The mean (or average) is the most popular and well-known measure of central tendency. It is the sum of all values divided by the total number of values.

Mathematical Formula

For a sample of size \(n\), the sample mean \(\bar{x}\) is calculated as:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}\]

Where:

  • \(\sum\): Sigma notation (summation)
  • \(x_i\): Each individual value in the data set
  • \(n\): The total number of observations

Real-Life Example

Scenario: A small tech startup has 5 employees with the following monthly salaries (in USD): $3,000, $3,500, $3,200, $4,000, and $10,000 (CEO).

salaries <- c(3000, 3500, 3200, 4000, 10000)
mean_salary <- mean(salaries)
print(paste("The mean salary is:", mean_salary))
## [1] "The mean salary is: 4740"
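As a check on the R output, the formula can be applied by hand to the five salaries from the scenario:

\[\bar{x} = \frac{3000 + 3500 + 3200 + 4000 + 10000}{5} = \frac{23700}{5} = 4740\]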

Pros & Cons:

  • Pro: Uses every value in the dataset.
  • Con: Highly sensitive to outliers (extreme values). In the example above, the $10,000 salary pulls the mean higher than what most employees actually earn.
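To make the outlier effect concrete, the following minimal sketch recomputes the mean after dropping the largest salary. The variable names (`salaries_no_ceo`, `mean_all`, and so on) are introduced here for illustration only and are not part of the original notes.

```r
# Salaries from the startup scenario (monthly, USD)
salaries <- c(3000, 3500, 3200, 4000, 10000)

# Drop the largest value (the CEO's salary) and recompute the mean
salaries_no_ceo <- salaries[salaries != max(salaries)]

mean_all    <- mean(salaries)          # 4740
mean_no_ceo <- mean(salaries_no_ceo)   # 3425
median_all  <- median(salaries)        # 3500

cat("Mean with CEO:   ", mean_all, "\n")
cat("Mean without CEO:", mean_no_ceo, "\n")
cat("Median with CEO: ", median_all, "\n")
```

Removing a single value shifts the mean by more than $1,300, while the median of the full data set already sits close to what the four non-CEO employees earn.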


3. The Median

The median is the middle value in a data set that has been arranged in ascending or descending order.

Mathematical Formula

To find the median:

  1. Arrange the data from smallest to largest.
  2. Find the position of the median using: \[\text{Position} = \frac{n + 1}{2}\]

  • If \(n\) is odd, the position is a whole number and the median is the value at that exact position.
  • If \(n\) is even, the position ends in .5 and the median is the average of the two middle values, at positions \(n/2\) and \(n/2 + 1\).
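The procedure above can be sketched in a few lines of R. `manual_median` is a hypothetical helper written for these notes, not a base R function, and it assumes a non-empty numeric vector with no missing values. In practice the built-in `median()` does the same job.

```r
# Hypothetical helper: median via the (n + 1) / 2 position rule
manual_median <- function(x) {
  sorted <- sort(x)        # step 1: arrange from smallest to largest
  n <- length(sorted)
  pos <- (n + 1) / 2       # step 2: position of the median
  if (n %% 2 == 1) {
    sorted[pos]            # odd n: value at that exact position
  } else {
    # even n: pos ends in .5, so average the values on either side of it
    mean(sorted[c(floor(pos), ceiling(pos))])
  }
}

manual_median(c(7, 1, 5))       # odd n:  5
manual_median(c(7, 1, 5, 3))    # even n: 4
```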

Real-Life Example

Scenario: A real estate agent lists the prices of 6 houses sold in a neighborhood: $250k, $270k, $260k, $280k, $240k, and $1M (a luxury mansion).

house_prices <- c(240, 250, 260, 270, 280, 1000) # in thousands
med_price <- median(house_prices)
print(paste("The median house price is:", med_price))
## [1] "The median house price is: 265"

Why use Median? In the house price example, the mean would be about $383.3k, which is higher than five of the six sale prices. The median ($265k) is a better representation of “typical” housing prices because it is resistant to outliers.
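Worked by hand for the six sorted prices (in thousands of USD), the position rule and the mean come out as follows:

\[\text{Position} = \frac{6 + 1}{2} = 3.5 \quad\Rightarrow\quad \text{Median} = \frac{260 + 270}{2} = 265\]

\[\bar{x} = \frac{240 + 250 + 260 + 270 + 280 + 1000}{6} = \frac{2300}{6} \approx 383.33\]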


4. The Mode

The mode is the value that appears most frequently in a data set. A distribution can be unimodal (one mode), bimodal (two modes), or multimodal.

Real-Life Example

Scenario: A shoe store records the sizes of sneakers sold in one hour: 8, 9, 9, 10, 11, 9, 8, 12.

R does not have a built-in function for the statistical mode (the base function mode() reports an object's type, such as "numeric", instead), so we create one:

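# Return the most frequent value in v (the first one found if there is a tie)
get_mode <- function(v) {
  uniqv <- unique(v)                            # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]   # count each value, keep the most frequent
}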

shoe_sizes <- c(8, 9, 9, 10, 11, 9, 8, 12)
result_mode <- get_mode(shoe_sizes)
print(paste("The mode of shoe sizes is:", result_mode))
## [1] "The mode of shoe sizes is: 9"
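Because `which.max()` returns only the first maximum, `get_mode` reports a single value even when the data are bimodal or multimodal. A minimal sketch of a variant that returns every tied value is shown below. `get_all_modes` is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: return every value tied for the highest frequency
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))   # frequency of each unique value
  uniqv[counts == max(counts)]          # keep all values with the top count
}

get_all_modes(c(8, 9, 9, 10, 11, 9, 8, 12))   # one mode:  9
get_all_modes(c(1, 1, 2, 2, 3))               # two modes: 1 2
```

Note that when every value occurs exactly once, this sketch returns all of them, which is one common way of signalling that the data have no meaningful mode.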

When to use Mode?

  • Used primarily for categorical data (e.g., favorite color, most popular car brand).
  • Useful for inventory management (knowing which item to restock most).
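For categorical data, a frequency table is often the quickest route to the mode. The sketch below uses made-up favorite-color responses. The data and the names `fav_colors`, `freq`, and `most_common` are assumptions for illustration only.

```r
# Hypothetical data: favorite colors reported by seven respondents
fav_colors <- c("blue", "red", "blue", "green", "red", "blue", "yellow")

freq <- table(fav_colors)                     # count of each category
most_common <- names(freq)[which.max(freq)]   # category with the highest count

print(freq)
print(paste("The most common color is:", most_common))
```

A mean or median of these responses would be meaningless, since the categories have no numeric value or natural order.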


5. Summary: When to use which?

Measure | Best for…                            | Sensitive to outliers?
--------|--------------------------------------|-----------------------
Mean    | Symmetric data, normal distribution  | Yes
Median  | Skewed data (income, house prices)   | No
Mode    | Nominal / categorical data           | No

Visualizing the Relationship

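In a perfectly symmetrical distribution with a single peak, Mean = Median = Mode. In a right-skewed distribution, such as the salary and house price examples above, the extreme high values pull the mean above the median.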
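A quick numerical check of this relationship, using a small made-up data set that is symmetric around 3 and has a single peak. The vector `x` and the variable names are assumptions chosen for illustration.

```r
# Hypothetical symmetric data set with a single peak at 3
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

x_mean   <- mean(x)                                 # 3
x_median <- median(x)                               # 3
x_mode   <- as.numeric(names(which.max(table(x))))  # 3

print(c(mean = x_mean, median = x_median, mode = x_mode))
```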


6. Exercises

  1. Calculate the mean and median for the following test scores: 85, 90, 78, 92, 100, 20. Which measure better represents the class performance?
  2. Find the mode for the following survey results: iPhone, Samsung, iPhone, Google, Samsung, iPhone.

End of Module I Notes