1. Introduction

A measure of central tendency is a single value that describes a data set by identifying its central position. Such measures are one kind of “summary statistic.”

The three most common measures are:

  1. The Mean
  2. The Median
  3. The Mode


2. The Arithmetic Mean

The mean (or average) is the sum of all values divided by the total number of values. It is the most common measure of central tendency.

Mathematical Formula

For a sample of size \(n\), the sample mean \(\bar{x}\) is calculated as:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}\]

Where:

  • \(\sum\): Sigma notation (summation)
  • \(x_i\): The value of each individual observation
  • \(n\): The total number of observations

Real-Life Example

Scenario: A tech startup tracks the daily hours worked by a small team of 5 developers: 8, 9, 7, 12, and 8 hours.

Calculation: \[\bar{x} = \frac{8 + 9 + 7 + 12 + 8}{5} = \frac{44}{5} = 8.8 \text{ hours}\]

R Implementation

hours_worked <- c(8, 9, 7, 12, 8)
mean_val <- mean(hours_worked)
print(paste("The mean hours worked is:", mean_val))
## [1] "The mean hours worked is: 8.8"
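
As a cross-check, the formula can be spelled out with sum() and length(), and the result should agree with mean(). This is a minimal sketch; the name manual_mean is illustrative and not part of the original module.

```r
# Daily hours from the scenario above
hours_worked <- c(8, 9, 7, 12, 8)

# Sample mean by the formula: sum of the observations divided by n
n <- length(hours_worked)
manual_mean <- sum(hours_worked) / n

print(manual_mean)
## [1] 8.8

# Compare with the built-in, allowing for floating-point rounding
print(isTRUE(all.equal(manual_mean, mean(hours_worked))))
## [1] TRUE
```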

3. The Median

The median is the middle value of a data set once the values are arranged in ascending or descending order. It is a “robust” measure because a few extreme outliers have little or no effect on it.

Mathematical Formula

  1. Arrange data in order: \(x_{(1)} \leq x_{(2)} \leq ... \leq x_{(n)}\)
  2. If \(n\) is odd: \[\text{Median} = x_{(\frac{n+1}{2})}\]
  3. If \(n\) is even: \[\text{Median} = \frac{x_{(\frac{n}{2})} + x_{(\frac{n}{2} + 1)}}{2}\]
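
The three steps above can be sketched as a small R function. The helper manual_median and both sample vectors are hypothetical, chosen only to exercise the odd and even cases. The sketch assumes a non-empty numeric vector with no missing values.

```r
# Hypothetical helper that follows the three steps literally
manual_median <- function(x) {
  sorted <- sort(x)   # step 1: arrange the data in ascending order
  n <- length(sorted)
  if (n %% 2 == 1) {
    # step 2: odd n, take the single middle value
    sorted[(n + 1) / 2]
  } else {
    # step 3: even n, average the two middle values
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

odd_sample  <- c(7, 3, 9, 1, 5)   # illustrative values, n = 5
even_sample <- c(8, 3, 10, 1)     # illustrative values, n = 4

print(manual_median(odd_sample))
## [1] 5
print(manual_median(even_sample))
## [1] 5.5
```

Both results should match what the built-in median() returns for the same vectors.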

Real-Life Example

Scenario: Real estate prices in a neighborhood. Suppose five houses sold for: $250k, $270k, $310k, $320k, and $1.2M (a mansion).

  • Ordered Data (in thousands of dollars): 250, 270, 310, 320, 1200
  • Median: $310,000.
  • Note: The Mean would be $470,000, which is misleading because of the $1.2M outlier. The median provides a better “typical” price.

R Implementation

house_prices <- c(250000, 270000, 310000, 320000, 1200000)
median_val <- median(house_prices)
print(paste("The median house price is:", median_val))
## [1] "The median house price is: 310000"
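
To see how much the mansion pulls on each measure, one option is to compute both with and without it. In this sketch, the name without_mansion and the 1,000,000 cut-off are assumptions made purely for illustration.

```r
house_prices <- c(250000, 270000, 310000, 320000, 1200000)

# Drop the mansion; the 1,000,000 threshold is an illustrative choice
without_mansion <- house_prices[house_prices < 1000000]

# The mean moves a lot when the outlier is removed
print(mean(house_prices))       # 470000
print(mean(without_mansion))    # 287500

# The median moves far less
print(median(house_prices))     # 310000
print(median(without_mansion))  # 290000
```

Removing one sale shifts the mean by $182,500 but the median by only $20,000, which is why the median is the safer “typical” price here.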

4. The Mode

The mode is the value that appears most frequently in a data set. A data set can have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal).

Mathematical Formula

\[\text{Mode} = \text{Value with the highest frequency } (f_i)\]

Real-Life Example

Scenario: A shoe store owner wants to know which size to restock most. The sizes of the last 10 pairs sold were: 7, 8, 8, 9, 9, 9, 10, 11, 9, 8.

  • Frequency of 7: 1
  • Frequency of 8: 3
  • Frequency of 9: 4
  • Frequency of 10: 1
  • Frequency of 11: 1
  • Mode: Size 9 (it sold the most).
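
The frequency tally above can be reproduced with base R's table() function. This is a brief sketch; the names size_counts and most_sold are illustrative.

```r
shoe_sizes <- c(7, 8, 8, 9, 9, 9, 10, 11, 9, 8)

# Count how often each size occurs
size_counts <- table(shoe_sizes)
print(size_counts)

# The label of the largest count is the mode (returned as a character string)
most_sold <- names(size_counts)[which.max(size_counts)]
print(most_sold)
## [1] "9"
```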

R Implementation

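R has no built-in function for the statistical mode (the base function mode() reports an object's storage mode instead), but we can write one:

get_mode <- function(v) {
   # Distinct values, in order of first appearance
   uniqv <- unique(v)
   # Count each distinct value, then return the most frequent one
   # (on a tie, which.max() picks the value that appears first in v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}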

shoe_sizes <- c(7, 8, 8, 9, 9, 9, 10, 11, 9, 8)
mode_val <- get_mode(shoe_sizes)
print(paste("The mode shoe size is:", mode_val))
## [1] "The mode shoe size is: 9"
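
Because get_mode() relies on which.max(), it reports only one value even when the data are bimodal or multimodal. A possible variant that returns every tied value is sketched below. The helper get_all_modes and the sample data are hypothetical, not part of the original module.

```r
# Hypothetical helper: returns every value tied for the highest frequency
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  uniqv[counts == max(counts)]
}

# Illustrative data: sizes 8 and 9 each appear three times
bimodal_sizes <- c(7, 8, 8, 8, 9, 9, 9, 10)
print(get_all_modes(bimodal_sizes))
## [1] 8 9

# The same approach works for categorical (character) data
print(get_all_modes(c("red", "blue", "red", "green")))
## [1] "red"
```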

5. Summary: When to Use Which?

| Measure | Best Used For | Sensitivity to Outliers |
|---------|---------------|-------------------------|
| Mean | Symmetric data, Normal distributions | Highly Sensitive |
| Median | Skewed data (e.g., Income, Prices) | Robust (Not affected) |
| Mode | Categorical data (e.g., Color, Size) | Robust |

Visualizing Skewness

  • Symmetric (Normal): Mean \(\approx\) Median \(\approx\) Mode
  • Right Skewed (Positive): typically Mode < Median < Mean
  • Left Skewed (Negative): typically Mean < Median < Mode
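
These orderings are rules of thumb rather than guarantees, but they can be checked on small samples. The sketch below uses two hand-made vectors (illustrative, not from the module) and two hypothetical helpers, sample_mode and center_summary.

```r
# Small hand-made samples (illustrative only)
right_skewed <- c(2, 2, 2, 3, 4, 5, 6, 9, 12, 15)
left_skewed  <- 20 - right_skewed   # mirror image of the sample above

# Mode via a frequency table; each sample here has a single most frequent value
sample_mode <- function(v) as.numeric(names(which.max(table(v))))

# Collect the three measures in one named vector
center_summary <- function(v) {
  c(mean = mean(v), median = median(v), mode = sample_mode(v))
}

print(center_summary(right_skewed))  # mean 6, median 4.5, mode 2
print(center_summary(left_skewed))   # mean 14, median 15.5, mode 18
```

For the right-skewed sample the order is Mode < Median < Mean, and it reverses for the mirrored, left-skewed sample.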


End of Module I