1. Introduction

In statistics, raw data can be overwhelming. To make sense of a dataset, we look for a single value that describes the center or the most typical behavior of the data. This single value is called a Measure of Central Tendency.

Definition: A Measure of Central Tendency is a summary statistic that represents the center point or typical value of a dataset.

The three most common measures are:

  1. Mean (The Average)
  2. Median (The Middle Value)
  3. Mode (The Most Frequent Value)

2. The Arithmetic Mean (The Average)

The mean is the most widely used measure of central tendency. It is the sum of all observed values divided by the number of observations.

A. Mathematical Formula

If we have a dataset \(x\) containing \(n\) values (\(x_1, x_2, ..., x_n\)):

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Where:

  • \(\bar{x}\): The Sample Mean
  • \(\sum\): Sigma (Sum of)
  • \(x_i\): Each individual value
  • \(n\): Total number of values

B. Calculation Example (with R)

Problem: A student scores the following marks in 5 quizzes: 70, 85, 80, 95, 60. Calculate the mean.

# Step 1: Define the data vector
quiz_scores <- c(70, 85, 80, 95, 60)

# Step 2: Calculate the mean manually to demonstrate the formula
sum_scores <- sum(quiz_scores)
n <- length(quiz_scores)
manual_mean <- sum_scores / n

# Step 3: Use the built-in R function
r_mean <- mean(quiz_scores)

# Output results
print(paste("Sum of scores:", sum_scores))
## [1] "Sum of scores: 390"
print(paste("Calculated Mean:", r_mean))
## [1] "Calculated Mean: 78"

C. Real-Life Example: GPA

In its simplest form, a Grade Point Average (GPA) is a mean: the university converts every grade you earned to a numerical value, sums those values, and divides by the total number of classes. (Many institutions also weight each class by its credit hours.)

  • Weakness: The mean is highly sensitive to outliers (extreme values).
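To illustrate this sensitivity, the short sketch below reuses the quiz scores from the calculation example and appends one extreme value. The sixth score of 0 (a missed quiz) is a hypothetical addition, not part of the original problem.

```r
# Quiz scores from the calculation example
quiz_scores <- c(70, 85, 80, 95, 60)

# Hypothetical sixth quiz: the student misses it and scores 0
scores_with_outlier <- c(quiz_scores, 0)

mean(quiz_scores)          # 78
mean(scores_with_outlier)  # 65
```

A single extreme value moves the mean by 13 points, even though the other five scores are unchanged.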

3. The Median (The Middle)

The median is the middle value in an ordered dataset. It splits the data into two equal halves.

A. Mathematical Concept

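First, sort the data. The position of the median in the sorted list is then: \[ \text{Position} = \frac{n + 1}{2} \]

  • If \(n\) is Odd: The position is a whole number, and the median is the value at that position.
  • If \(n\) is Even: The position ends in .5, so the median is the average of the two middle values, at positions \(n/2\) and \(n/2 + 1\).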
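The position rule can be sketched as a small function. The name `manual_median` and the sample vectors are hypothetical, chosen only for illustration, and the sketch assumes a non-empty numeric vector with no missing values. In practice, R's built-in `median()` does the same job.

```r
# Hypothetical helper: compute the median from the position formula
manual_median <- function(x) {
  sorted <- sort(x)              # the data must be ordered first
  n <- length(sorted)
  pos <- (n + 1) / 2             # ends in .5 when n is even
  lower <- sorted[floor(pos)]    # value just below (or at) the position
  upper <- sorted[ceiling(pos)]  # value just above (or at) the position
  (lower + upper) / 2            # for odd n, lower and upper are the same value
}

manual_median(c(7, 2, 9))        # odd n: position 2, result 7
manual_median(c(4, 8, 6, 2))     # even n: position 2.5, result 5
```

Averaging the values at `floor(pos)` and `ceiling(pos)` lets one expression cover both the odd and the even case.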

B. Calculation Example (with R)

Example 1: Odd Number of Data Points

Data: 5, 9, 3, 1, 11

# Define data
data_odd <- c(5, 9, 3, 1, 11)

# Sort the data (Crucial step!)
sorted_data <- sort(data_odd)
print("Sorted Data:")
## [1] "Sorted Data:"
print(sorted_data)
## [1]  1  3  5  9 11
# Calculate Median
median_val <- median(data_odd)
print(paste("The Median is:", median_val))
## [1] "The Median is: 5"

Example 2: Even Number of Data Points

Data: 10, 20, 40, 30

# Define data
data_even <- c(10, 20, 40, 30)

# Sort the data
print("Sorted Data:")
## [1] "Sorted Data:"
print(sort(data_even))
## [1] 10 20 30 40
# Calculate Median
# (20 + 30) / 2 = 25
median_val_even <- median(data_even)
print(paste("The Median is:", median_val_even))
## [1] "The Median is: 25"

C. Real-Life Example: Real Estate

News reports usually quote the Median Home Price, not the Mean. If one mansion sells for $100 million, the mean price skyrockets, but the median remains stable, representing the “typical” house cost.
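A rough numerical sketch of this effect follows. The four ordinary sale prices are hypothetical values invented for illustration; only the $100 million mansion comes from the text above.

```r
# Hypothetical sale prices (in dollars) for four ordinary homes
home_prices <- c(250000, 300000, 320000, 350000)

# The same market after one $100 million mansion sale
with_mansion <- c(home_prices, 100000000)

mean(home_prices)      # 305000
mean(with_mansion)     # 20244000
median(home_prices)    # 310000
median(with_mansion)   # 320000
```

Under these assumed prices, the mean rises from about $305,000 to more than $20 million, while the median moves by only $10,000.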


4. The Mode (The Most Frequent)

The mode is the value that appears most frequently in the dataset.

A. Mathematical Concept

\[ \text{Mode} = \text{Value with Max}(\text{Frequency}) \]

B. Calculation Example (with R)

Note: Base R does not have a built-in function for the statistical mode. The function mode() in R instead reports how an object is stored (for example, "numeric" or "character"). We use a frequency table to find the mode.

Data: 4, 1, 2, 4, 3, 4, 2, 5

# Define data
shoe_sizes <- c(4, 1, 2, 4, 3, 4, 2, 5)

# Create a frequency table
freq_table <- table(shoe_sizes)
print(freq_table)
## shoe_sizes
## 1 2 3 4 5 
## 1 2 1 3 1
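# Identify the value with the highest frequency
# (which.max() returns only the first maximum if several values tie)
mode_val <- names(freq_table)[which.max(freq_table)]
print(paste("The Mode is:", mode_val))
## [1] "The Mode is: 4"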
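
The frequency-table approach can also be wrapped in a reusable helper that returns every value tied for the highest count. This is a minimal sketch: the name `stat_modes` is hypothetical, and the results come back as character strings because they are taken from the table's names.

```r
# Hypothetical helper: return every value tied for the highest frequency
stat_modes <- function(x) {
  freq <- table(x)                               # count each distinct value
  names(freq)[as.vector(freq == max(freq))]      # keep the most frequent ones
}

stat_modes(c(4, 1, 2, 4, 3, 4, 2, 5))            # "4"
stat_modes(c("red", "blue", "red", "blue"))      # "blue" "red" (a tie)
```

Because `table()` also counts text values, the same helper works for categorical data, where the mean and median are not defined.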

C. Real-Life Example: Retail

A shoe store owner analyzes sales. They don’t care about the “average” shoe size (e.g., size 9.35). They care about the Mode (e.g., Size 9) to know which stock to reorder.


5. Summary Comparison

| Measure | Best Used For | Affected by Outliers? | R Function |
|---------|---------------|-----------------------|------------|
| Mean | Symmetric, numerical data | Yes (highly) | mean(x) |
| Median | Skewed data (income, housing) | No | median(x) |
| Mode | Categorical data, inventory | No | table(x) |

6. Practice Challenge

Scenario: A startup has 5 employees with the following salaries: $30k, $30k, $30k, $40k, and $2,000,000 (CEO).

  1. Calculate the Mean.
  2. Calculate the Median.
  3. Which represents the company better?

salaries <- c(30000, 30000, 30000, 40000, 2000000)

# Calculate Mean and Median
mean_sal <- mean(salaries)
median_sal <- median(salaries)

print(paste("Mean Salary: $", format(mean_sal, big.mark=",")))
## [1] "Mean Salary: $ 426,000"
print(paste("Median Salary: $", format(median_sal, big.mark=",")))
## [1] "Median Salary: $ 30,000"

Conclusion: The Median ($30,000) is the better measure. Four of the five employees earn $40,000 or less, while the Mean ($426,000) is pulled far upward by the CEO’s salary, an extreme outlier.