In statistics, raw data can be overwhelming. To make sense of a dataset, we look for a single value that describes the center or the most typical behavior of the data. This single value is called a Measure of Central Tendency.
Definition: A Measure of Central Tendency is a summary statistic that represents the center point or typical value of a dataset.
The three most common measures are the mean, the median, and the mode.
The mean is the most widely used measure of central tendency. It is the sum of all observed values divided by the number of observations.
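If we have a dataset \(x\) containing \(n\) values (\(x_1, x_2, \ldots, x_n\)), the mean \(\bar{x}\) is:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Where \(\sum_{i=1}^{n} x_i\) is the sum of all observed values and \(n\) is the number of observations.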
Problem: A student scores the following marks in 5 quizzes: 70, 85, 80, 95, 60. Calculate the mean.
# Step 1: Define the data vector
quiz_scores <- c(70, 85, 80, 95, 60)
# Step 2: Calculate the mean manually to demonstrate the formula
sum_scores <- sum(quiz_scores)
n <- length(quiz_scores)
manual_mean <- sum_scores / n
# Step 3: Use the built-in R function
r_mean <- mean(quiz_scores)
# Output results
print(paste("Sum of scores:", sum_scores))
## [1] "Sum of scores: 390"
print(paste("Calculated Mean:", r_mean))
## [1] "Calculated Mean: 78"
Universities calculate your Grade Point Average (GPA) by taking the numerical value of every grade you earned, summing them up, and dividing by the total number of classes.
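As a minimal sketch of this calculation, assume a hypothetical student with five classes graded on a 4.0 scale. The grade points below are made up for illustration, and every class is assumed to count equally.

```r
# Hypothetical grade points for five classes on a 4.0 scale (illustrative values)
grade_points <- c(4.0, 3.0, 3.7, 2.0, 3.3)

# GPA as a plain arithmetic mean: sum of grade points / number of classes
gpa <- sum(grade_points) / length(grade_points)

# The built-in mean() should agree with the manual calculation
stopifnot(isTRUE(all.equal(gpa, mean(grade_points))))

print(round(gpa, 2))
```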
The median is the middle value in an ordered dataset. It splits the data into two equal halves.
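First, sort the data. For an odd number of observations, the position of the median is: \[ \text{Position} = \frac{n + 1}{2} \] For an even number of observations this position falls between two values, and the median is the average of those two middle values.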
Data: 5, 9, 3, 1, 11
# Define data
data_odd <- c(5, 9, 3, 1, 11)
# Sort the data (Crucial step!)
sorted_data <- sort(data_odd)
print("Sorted Data:")
## [1] "Sorted Data:"
print(sorted_data)
## [1]  1  3  5  9 11
# Use the built-in R function
print(paste("The Median is:", median(data_odd)))
## [1] "The Median is: 5"
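The position rule points directly at the middle element only when \(n\) is odd. A minimal sketch of a position-based median that also covers even \(n\), where the two middle values are averaged, might look like the following. `median_by_position` is a hypothetical helper name, not a base R function, and the even-length vector is illustrative data.

```r
# Hypothetical helper: median via the (n + 1) / 2 position rule
median_by_position <- function(x) {
  sorted_x <- sort(x)          # sorting comes first
  n <- length(sorted_x)
  pos <- (n + 1) / 2           # whole number if n is odd, ends in .5 if n is even
  # floor(pos) and ceiling(pos) coincide for odd n; for even n they are
  # the two middle positions, whose values are averaged
  (sorted_x[floor(pos)] + sorted_x[ceiling(pos)]) / 2
}

data_odd  <- c(5, 9, 3, 1, 11)       # the data from the example
data_even <- c(5, 9, 3, 1, 11, 7)    # illustrative even-length data

median_by_position(data_odd)   # should match median(data_odd)
median_by_position(data_even)  # should match median(data_even)
```

In practice the built-in `median()` handles both cases, so the helper is only meant to make the rule explicit.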
News reports usually quote the Median Home Price, not the Mean. If one mansion sells for $100 million, the mean price skyrockets, but the median remains stable, representing the “typical” house cost.
The mode is the value that appears most frequently in the dataset.
\[ \text{Mode} = \text{Value with Max}(\text{Frequency}) \]
Note: Base R does not have a built-in function for the statistical mode (the function mode() in R reports an object's storage type instead). We use a frequency table to find it.
Data: 4, 1, 2, 4, 3, 4, 2, 5
# Define data
shoe_sizes <- c(4, 1, 2, 4, 3, 4, 2, 5)
# Create a frequency table
freq_table <- table(shoe_sizes)
print(freq_table)
## shoe_sizes
## 1 2 3 4 5 
## 1 2 1 3 1 
# The value with the highest count is the mode: 4 appears 3 times
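The frequency table lists the counts but does not pick out the mode by itself. One way to extract it, sketched here with a hypothetical helper named `stat_mode` (not part of base R), is to keep the table names whose count equals the maximum. Returning every tied value is an assumption made for this sketch; the text does not say how ties should be handled.

```r
# Hypothetical helper: return the most frequent value(s) of a vector
stat_mode <- function(x) {
  freq_table <- table(x)
  # Names of all entries whose count equals the highest count (keeps ties)
  modes <- names(freq_table)[freq_table == max(freq_table)]
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

shoe_sizes <- c(4, 1, 2, 4, 3, 4, 2, 5)
stat_mode(shoe_sizes)   # 4 appears three times

# The same helper applies to categorical data (illustrative values)
stat_mode(c("red", "blue", "red", "green"))
```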
A shoe store owner analyzes sales. They don’t care about the “average” shoe size (e.g., size 9.35). They care about the Mode (e.g., Size 9) to know which stock to reorder.
| Measure | Best Used For… | Affected by Outliers? | R Function |
|---|---|---|---|
| Mean | Symmetric, numerical data. | Yes (Highly) | mean(x) |
| Median | Skewed data (Income, Housing). | No | median(x) |
| Mode | Categorical data, Inventory. | No | table(x) |
Scenario: A startup has 5 employees with the following salaries: $30k, $30k, $30k, $40k, and $2,000,000 (CEO).
salaries <- c(30000, 30000, 30000, 40000, 2000000)
# Calculate Mean and Median
mean_sal <- mean(salaries)
median_sal <- median(salaries)
print(paste("Mean Salary: $", format(mean_sal, big.mark=",")))
## [1] "Mean Salary: $ 426,000"
print(paste("Median Salary: $", format(median_sal, big.mark=",")))
## [1] "Median Salary: $ 30,000"
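To see how strongly the single outlier drives each statistic, one could recompute both without the CEO's salary. This is only an illustrative check using the scenario's numbers; `staff_only` and `comparison` are names introduced here.

```r
salaries <- c(30000, 30000, 30000, 40000, 2000000)

# Drop the largest value (the CEO's salary) for comparison
staff_only <- salaries[salaries < max(salaries)]

# Mean and median with and without the outlier
comparison <- data.frame(
  group  = c("All employees", "Without CEO"),
  mean   = c(mean(salaries), mean(staff_only)),
  median = c(median(salaries), median(staff_only))
)
print(comparison)
```

Removing one value moves the mean from 426,000 to 32,500, while the median stays at 30,000.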
Conclusion: The Median ($30,000) is the better measure because the CEO’s salary is an extreme outlier that distorts the mean. ```