1. Introduction

In statistics, a Measure of Central Tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.

It helps us understand the “typical” value in a dataset. The three most common measures are:

  1. Mean
  2. Median
  3. Mode


2. The Arithmetic Mean

The mean (often called the average) is the sum of all values divided by the total number of values.

Mathematical Formula

For a sample of size \(n\), the sample mean \(\bar{x}\) is calculated as: \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Where:

  • \(\sum\) is the summation symbol.
  • \(x_i\) represents each individual value.
  • \(n\) is the total number of observations.

Real-Life Example

Scenario: A teacher wants to find the average score of 5 students in a math quiz. Scores: 85, 90, 88, 76, 92.
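
Substituting the five scores into the formula for \(\bar{x}\) gives the following worked calculation: \[\bar{x} = \frac{85 + 90 + 88 + 76 + 92}{5} = \frac{431}{5} = 86.2\]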

R Implementation

# Define the dataset
scores <- c(85, 90, 88, 76, 92)

# Calculate mean
mean_score <- mean(scores)
print(paste("The average score is:", mean_score))
## [1] "The average score is: 86.2"

Pros: Uses every value in the dataset and is easy to work with algebraically.

Cons: Highly sensitive to outliers (extreme values); a single extreme value can pull the mean far away from the bulk of the data.
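
To illustrate the sensitivity to outliers, the sketch below reuses the quiz scores and appends one extra score of 0. That extra score is a hypothetical value invented for illustration; it is not part of the teacher's data.

```r
# Quiz scores from the example above
scores <- c(85, 90, 88, 76, 92)

# Hypothetical extra score (an assumption for illustration only)
scores_with_outlier <- c(scores, 0)

mean_original <- mean(scores)               # 431 / 5
mean_outlier  <- mean(scores_with_outlier)  # 431 / 6

print(paste("Mean without the outlier:", mean_original))
print(paste("Mean with the outlier:", round(mean_outlier, 2)))
```

A single extreme value lowers the mean from 86.2 to roughly 71.83, even though five of the six scores are unchanged.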


3. The Median

The median is the middle value in a dataset when the values are arranged in ascending or descending order.

Mathematical Formula

  1. Sort the data from smallest to largest.
  2. If \(n\) is odd, the median is the middle value: \[\text{Median} = \left( \frac{n+1}{2} \right)^{th} \text{term}\]
  3. If \(n\) is even, the median is the average of the two middle values: \[\text{Median} = \frac{(\frac{n}{2})^{th} \text{term} + (\frac{n}{2} + 1)^{th} \text{term}}{2}\]
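
The three steps above can be sketched as a small R function. The name manual_median and the two sample vectors are assumptions made for illustration; in practice the built-in median() does the same job.

```r
# Minimal sketch of the sort-then-pick-the-middle rule (hypothetical helper)
manual_median <- function(x) {
  sorted <- sort(x)        # Step 1: sort from smallest to largest
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Step 2: odd n, take the ((n + 1) / 2)-th term
    sorted[(n + 1) / 2]
  } else {
    # Step 3: even n, average the (n / 2)-th and (n / 2 + 1)-th terms
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

odd_sample  <- c(40, 30, 1000, 35, 45)  # n = 5
even_sample <- c(40, 30, 35, 45)        # n = 4

print(manual_median(odd_sample))
## [1] 40
print(manual_median(even_sample))
## [1] 37.5
```

Both results agree with what median() returns for the same vectors.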

Real-Life Example: Income Analysis

Imagine 5 people earn \(\$30k, \$35k, \$40k, \$45k,\) and \(\$1,000k\) (a millionaire).

  • Mean: \(\$230k\) (does not represent the “typical” person).
  • Median: \(\$40k\) (a much better representation of the group).

R Implementation

# Incomes in thousands of dollars
incomes <- c(30, 35, 40, 45, 1000)

# Compare Mean vs Median
print(paste("Mean Income:", mean(incomes)))
## [1] "Mean Income: 230"
print(paste("Median Income:", median(incomes)))
## [1] "Median Income: 40"

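Pros: Robust to outliers; usually a better summary than the mean for skewed distributions (like house prices or salaries).

Cons: Depends only on the middle of the sorted data, so it ignores how large or small the remaining values are.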


4. The Mode

The mode is the value that appears most frequently in a dataset. A dataset can be:

  • Unimodal: One mode.
  • Bimodal: Two modes.
  • Multimodal: Three or more modes.

Real-Life Example

A shoe store wants to know which shoe size to stock most. If they sold sizes [7, 8, 8, 8, 9, 10, 11], the mode is 8.

R Implementation

Note: R does not have a built-in function for the statistical mode; the built-in mode() function returns the storage mode (internal type) of an object, such as "numeric". A common workaround is a small custom function.

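# Return the most frequent value in v (the first one encountered if there is a tie)
get_mode <- function(v) {
   uniqv <- unique(v)                            # distinct values, in order of first appearance
   uniqv[which.max(tabulate(match(v, uniqv)))]   # value with the highest count
}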

shoe_sizes <- c(7, 8, 8, 8, 9, 10, 11)
result <- get_mode(shoe_sizes)
print(paste("The mode of shoe sizes is:", result))
## [1] "The mode of shoe sizes is: 8"
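
Because which.max() returns only the first maximum, a function built on it reports a single value even for bimodal or multimodal data. The following sketch returns every value tied for the highest count. The name get_all_modes and the bimodal sales vector are hypothetical, introduced only for illustration.

```r
# Minimal sketch: return all values that share the highest frequency
get_all_modes <- function(v) {
  uniqv  <- unique(v)                    # distinct values
  counts <- tabulate(match(v, uniqv))    # frequency of each distinct value
  uniqv[counts == max(counts)]           # keep every value tied for the top count
}

# Hypothetical bimodal sales data: sizes 8 and 9 each sold twice
bimodal_sizes <- c(7, 8, 8, 9, 9, 10)
print(get_all_modes(bimodal_sizes))
## [1] 8 9
```

For unimodal data such as the shoe sizes above, this returns the single mode, 8.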

5. Comparing Mean, Median, and Mode

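The relationship between these measures depends on the skewness of the data. For unimodal distributions, the following patterns usually hold (the orderings for skewed data are rules of thumb rather than strict laws):

  1. Symmetrical (e.g. Normal) Distribution:
    • Mean = Median = Mode.
  2. Right-Skewed (Positive Skew):
    • The tail is on the right.
    • Typically Mean > Median > Mode.
  3. Left-Skewed (Negative Skew):
    • The tail is on the left.
    • Typically Mean < Median < Mode.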

Visualizing with R

# Generating a right-skewed dataset
set.seed(123)
data <- rchisq(1000, df = 5)

hist(data, col="skyblue", main="Right Skewed Distribution", xlab="Values")
abline(v = mean(data), col = "red", lwd = 2, lty = 1)
abline(v = median(data), col = "blue", lwd = 2, lty = 2)
legend("topright", legend=c("Mean", "Median"), col=c("red", "blue"), lty=1:2, lwd=2)
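
The ordering of mean and median can also be checked numerically. The sketch below draws the same kind of chi-squared sample and then mirrors it to obtain a left-skewed sample. The variable names are assumptions for illustration, and the exact numbers depend on the random seed.

```r
# Right-skewed sample: chi-squared with 5 degrees of freedom
set.seed(123)
right_skewed <- rchisq(1000, df = 5)

# Mirroring the values around zero flips the tail to the left
left_skewed <- -right_skewed

# Compare mean and median for each sample
print(c(mean = mean(right_skewed), median = median(right_skewed)))
print(c(mean = mean(left_skewed),  median = median(left_skewed)))

# Expected pattern: mean above median on the right, below it on the left
right_ok <- mean(right_skewed) > median(right_skewed)
left_ok  <- mean(left_skewed)  < median(left_skewed)
print(c(right_ok, left_ok))
```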


6. Summary Table

| Measure | Best Used For… | Sensitive to Outliers? |
|---------|----------------|------------------------|
| Mean | Continuous data, symmetrical distributions | Yes |
| Median | Skewed data (Income, House Prices) | No |
| Mode | Categorical data (Colors, Brands) | No |
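
For categorical data, neither the mean nor the median is defined, but the mode still is. A minimal sketch follows, using table() to count categories. The survey answers are hypothetical values invented for illustration.

```r
# Hypothetical survey answers (invented for illustration)
favourite_colors <- c("red", "blue", "red", "green", "blue", "red")

# Count how often each category occurs
color_counts <- table(favourite_colors)

# The mode is the category with the highest count
modal_color <- names(color_counts)[which.max(color_counts)]
print(modal_color)
## [1] "red"
```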

7. Practice Exercise

  1. Create a vector in R called weights with the following values: 60, 65, 70, 72, 70, 200.
  2. Calculate the mean and median.
  3. Which measure is more appropriate for this data? Why?

# Your solution here
weights <- c(60, 65, 70, 72, 70, 200)
# mean(weights)
# median(weights)
