1. Introduction

In statistics, a measure of central tendency is a single value that describes a set of data by identifying the central position within it. These measures are sometimes called “measures of central location.”

The three most common measures are:

  1. The Mean
  2. The Median
  3. The Mode


2. The Arithmetic Mean (\(\bar{x}\))

The mean is the “average” value. It is calculated by summing all observations and dividing by the total number of observations.

Mathematical Formula

For a sample of size \(n\): \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Where:

  • \(\bar{x}\) = Sample mean
  • \(\sum\) = Summation symbol
  • \(x_i\) = Each individual value
  • \(n\) = Number of values in the sample

Real-Life Example

Scenario: A tech company tracks the hours worked by 5 employees in a single day: 8, 9, 7, 10, and 6 hours.

Calculation: \[\bar{x} = \frac{8 + 9 + 7 + 10 + 6}{5} = \frac{40}{5} = 8 \text{ hours}\]

R Code Example

hours_worked <- c(8, 9, 7, 10, 6)
mean_val <- mean(hours_worked)
print(paste("The mean hours worked is:", mean_val))
## [1] "The mean hours worked is: 8"
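
To connect the formula to the code, the mean can also be computed term by term. This is only a sketch that mirrors \(\bar{x} = \frac{\sum x_i}{n}\); the variable names `n`, `total`, and `manual_mean` are chosen here for illustration and are not part of the original note.

```r
# Same hours as in the example above
hours_worked <- c(8, 9, 7, 10, 6)

n <- length(hours_worked)     # n: number of observations
total <- sum(hours_worked)    # sum of all x_i
manual_mean <- total / n      # x-bar = sum / n

print(manual_mean)
## [1] 8

# The manual result should agree with the built-in mean()
print(isTRUE(all.equal(manual_mean, mean(hours_worked))))
## [1] TRUE
```

In practice mean() is the better choice; the manual version is shown only to make each symbol in the formula visible.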

3. The Median (\(\tilde{x}\))

The median is the middle value in a dataset when the values are arranged in ascending or descending order. It is the “50th percentile.”

Mathematical Calculation

  1. Order the data from least to greatest.
  2. If \(n\) is odd: The median is the middle value at position \(\frac{n+1}{2}\).
  3. If \(n\) is even: The median is the average of the two middle values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\).
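
The three steps above can be sketched directly in R. The helper `manual_median()` is a hypothetical name used for illustration (base R already provides median()), and it assumes a non-empty numeric vector with no missing values.

```r
# Hypothetical helper that follows the three steps literally.
# Assumes x is a non-empty numeric vector without NA values.
manual_median <- function(x) {
  x_sorted <- sort(x)                 # Step 1: order from least to greatest
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]             # Step 2: odd n, take the middle value
  } else {
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2   # Step 3: even n, average the two middle values
  }
}

print(manual_median(c(7, 3, 9)))      # sorted: 3 7 9, position 2
## [1] 7
print(manual_median(c(4, 1, 3, 2)))   # sorted: 1 2 3 4, average of 2 and 3
## [1] 2.5
```

For these inputs the helper gives the same results as the built-in median().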

Real-Life Example (The Outlier Problem)

Scenario: The weekly salaries of 5 people at a company are $500, $550, $600, and $650 for four workers, and $10,000 for the CEO.

  • Ordered Data: 500, 550, 600, 650, 10,000.
  • Median: $600.
  • Mean (for comparison): $2,460.

In this case, the median is a better representation of central tendency because the mean is pulled upward by the CEO’s high salary (an outlier): $2,460 is far more than any of the four workers earns.

R Code Example

salaries <- c(500, 550, 600, 650, 10000)
median_val <- median(salaries)
print(paste("The median salary is:", median_val))
## [1] "The median salary is: 600"
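
To see how much a single outlier moves each measure, the salary example can be recomputed with and without the CEO. This is a small sketch; the name `workers_only` is introduced here for illustration.

```r
salaries <- c(500, 550, 600, 650, 10000)
workers_only <- salaries[salaries < 10000]   # drop the CEO's salary

print(mean(salaries))          # mean with the outlier
## [1] 2460
print(median(salaries))        # median with the outlier
## [1] 600
print(mean(workers_only))      # mean without the outlier
## [1] 575
print(median(workers_only))    # median without the outlier
## [1] 575
```

Removing one value changes the mean from 2460 to 575, while the median only moves from 600 to 575.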

4. The Mode

The mode is the value that appears most frequently in a dataset.

Mathematical Formula

There is no algebraic formula for the mode; it is found by counting. If \(f(x)\) is the frequency (the number of times the value \(x\) appears), then: \[\text{Mode} = \text{Value of } x \text{ with maximum } f(x)\]

Real-Life Example

Scenario: A shoe store sells sizes 7, 8, 8, 9, 10, 8, 11.

  • Size 8 appears 3 times.
  • Mode: 8.

R Code Example

Note: Base R’s mode() function returns an object’s storage mode (for example, “numeric”), not the statistical mode, so we build a simple frequency table instead.

shoe_sizes <- c(7, 8, 8, 9, 10, 8, 11)
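freq_table <- table(shoe_sizes)                        # count how often each size occurs
mode_val <- names(freq_table)[which.max(freq_table)]   # label of the most frequent size (character; first one wins on ties)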
print(paste("The mode shoe size is:", mode_val))
## [1] "The mode shoe size is: 8"
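
Because which.max() returns only the first maximum, a dataset with two equally frequent values would report just one of them. The following sketch returns every tied value instead; `all_modes()` is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: returns every value tied for the highest frequency.
# Values come back as character strings because they are table labels.
all_modes <- function(x) {
  freq_table <- table(x)
  names(freq_table)[freq_table == max(freq_table)]
}

print(all_modes(c(7, 8, 8, 9, 10, 8, 11)))   # one mode
## [1] "8"
print(all_modes(c(7, 7, 8, 8, 9)))           # two modes (bimodal)
## [1] "7" "8"
```

For numeric data, as.numeric() can be applied to the result if numbers are needed rather than labels.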

5. Summary and Comparison

Measure | Best for…                                | Sensitive to Outliers?
Mean    | Symmetric data / Normal distribution     | Yes
Median  | Skewed data (e.g., Income, House prices) | No
Mode    | Categorical data (e.g., Favorite color)  | No
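
The mode is the only one of the three measures that works on categorical data, since labels cannot be summed or ordered numerically. A minimal sketch with made-up survey answers (the vector `favorite_colors` is invented for illustration):

```r
# Made-up categorical data for illustration
favorite_colors <- c("blue", "red", "blue", "green", "blue", "red")

color_counts <- table(favorite_colors)                   # frequency of each color
mode_color <- names(color_counts)[which.max(color_counts)]

print(paste("The most common favorite color is:", mode_color))
## [1] "The most common favorite color is: blue"
```

Calling mean() on the same character vector would return NA with a warning, which is why the mode is the measure listed for categorical data.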

Visualizing Central Tendency in R

# Generating a random skewed dataset
set.seed(123)
data <- rgamma(100, shape = 2, scale = 2)

# Plotting
hist(data, col = "lightblue", main = "Distribution of Data", xlab = "Values")
abline(v = mean(data), col = "red", lwd = 2, lty = 1)   # Mean in Red
abline(v = median(data), col = "blue", lwd = 2, lty = 2) # Median in Blue

legend("topright", legend = c("Mean", "Median"), col = c("red", "blue"), lty = 1:2, lwd = 2)  # match the line width used above


6. Practice Exercise

  1. Create a vector in R with the following ages: c(21, 23, 21, 25, 22, 29, 21, 50).
  2. Calculate the Mean, Median, and Mode.
  3. Which measure best represents this group? Why?
