1. Introduction

A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the “location” of the data.

The three most common measures are:

  1. The Mean
  2. The Median
  3. The Mode


2. The Arithmetic Mean

The mean (or average) is the sum of all values divided by the total number of values.

Mathematical Formula

For a sample of size \(n\), the sample mean \(\bar{x}\) is calculated as:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]

Where:

  - \(\sum\): Symbol for summation.
  - \(x_i\): Each individual value in the dataset.
  - \(n\): The number of observations.

Real-Life Example

Imagine a small tech startup with five employees earning the following annual salaries (in thousands of dollars): 50, 60, 65, 70, and 75. Their mean salary is (50 + 60 + 65 + 70 + 75) / 5 = 320 / 5 = 64, that is, $64,000.

R Implementation

salaries <- c(50, 60, 65, 70, 75)
mean_salary <- mean(salaries)
print(paste("The mean salary is:", mean_salary))
## [1] "The mean salary is: 64"
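
To connect the formula to the code, the same result can be reproduced by applying the definition directly. This is a minimal sketch; the names `n` and `manual_mean` are illustrative and not part of the original example.

```r
# Same salaries as above (in thousands of dollars)
salaries <- c(50, 60, 65, 70, 75)

# Apply the formula directly: sum of the values divided by the number of values
n <- length(salaries)
manual_mean <- sum(salaries) / n

manual_mean

# Check that the manual result agrees with the built-in mean()
all.equal(manual_mean, mean(salaries))
```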

3. The Median

The median is the middle value in a dataset when the values are arranged in ascending or descending order.

Mathematical Formula

  1. Arrange data from smallest to largest.
  2. If \(n\) is odd, the median is the middle value: \[\text{Median} = x_{(\frac{n+1}{2})}\]
  3. If \(n\) is even, the median is the average of the two middle values: \[\text{Median} = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}\]
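
The three steps above can be sketched as a small function. This is an illustration of the rule rather than a replacement for R's built-in `median()`; `manual_median` is a hypothetical helper name, and the four-value vector is made-up data for the even case.

```r
# manual_median is a hypothetical helper used only to illustrate the rule
manual_median <- function(x) {
  x_sorted <- sort(x)   # step 1: arrange from smallest to largest
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    # step 2 (odd n): the single middle value, at position (n + 1) / 2
    x_sorted[(n + 1) / 2]
  } else {
    # step 3 (even n): average of the values at positions n / 2 and n / 2 + 1
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
  }
}

manual_median(c(50, 60, 65, 70, 75))   # odd n: returns 65
manual_median(c(10, 40, 20, 30))       # even n: returns 25
```

Note that `sort()` drops missing values by default, so this sketch silently ignores any `NA` entries.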

Real-Life Example

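Consider the same salaries: 50, 60, 65, 70, 75. With five values, the median is the third one, 65. If we add a CEO earning 300, the set becomes: 50, 60, 65, 70, 75, 300. With six values, the median is the average of the third and fourth values: (65 + 70) / 2 = 67.5.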

R Implementation

salaries_with_ceo <- c(50, 60, 65, 70, 75, 300)

# Mean vs Median
mean_val <- mean(salaries_with_ceo)
median_val <- median(salaries_with_ceo)

cat("Mean with CEO:", mean_val, "\n")
## Mean with CEO: 103.3333
cat("Median with CEO:", median_val)
## Median with CEO: 67.5

Note: The median is “robust” to outliers: adding the CEO moves it only from 65 to 67.5, whereas the mean jumps from 64 to about 103.3 because it is pulled toward the CEO’s high salary.


4. The Mode

The mode is the value that appears most frequently in a dataset. A distribution can be unimodal (one mode), bimodal (two modes), or multimodal.

Real-Life Example

A shoe store tracks the sizes sold in one hour: 7, 8, 8, 9, 10, 10, 10, 11. The mode is 10 because it appears three times.

R Implementation

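Base R has no built-in function for the statistical mode (the built-in mode() returns an object's storage type instead), so we create a custom function. If several values tie for the highest count, this version returns only the one that appears first in the data:

get_mode <- function(v) {
  # Distinct values, in order of first appearance
  uniqv <- unique(v)
  # Count how often each distinct value occurs and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}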

shoe_sizes <- c(7, 8, 8, 9, 10, 10, 10, 11)
print(paste("The mode of shoe sizes is:", get_mode(shoe_sizes)))
## [1] "The mode of shoe sizes is: 10"
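
Because a distribution can be bimodal or multimodal, it can be useful to return every value tied for the highest count. The following is a minimal sketch of one way to do that; `get_modes` is a hypothetical helper name, and the second and third vectors are made-up data for illustration.

```r
# get_modes is a hypothetical helper: it returns all values tied for the top count
get_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))   # frequency of each distinct value
  uniqv[counts == max(counts)]
}

get_modes(c(7, 8, 8, 9, 10, 10, 10, 11))      # one mode: 10
get_modes(c(7, 8, 8, 9, 10, 10, 11))          # two modes: 8 and 10
get_modes(c("red", "blue", "red", "green"))   # also works for categorical values
```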

5. Comparison: When to use what?

| Measure | Best for… | Sensitivity to Outliers |
|---------|-----------|-------------------------|
| Mean | Symmetric data, Normal distributions | High (Very Sensitive) |
| Median | Skewed data (e.g., Income, Home prices) | Low (Robust) |
| Mode | Categorical data (e.g., Favorite color) | Low |

Visualization of Skewness
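
A histogram along the following lines shows how the mean and median separate in a right-skewed distribution. This is a minimal sketch using simulated data: the seed, sample size, and log-normal parameters are illustrative assumptions, not values from the examples above.

```r
set.seed(42)  # illustrative seed so the sketch is reproducible

# Assumption: simulated right-skewed (log-normal) data, not data from this tutorial
skewed <- rlnorm(1000, meanlog = 4, sdlog = 0.6)

hist(skewed, breaks = 40, col = "grey85", border = "white",
     main = "Right-skewed distribution", xlab = "Value")

# Mark both measures: the long right tail pulls the mean above the median
abline(v = mean(skewed), col = "red", lwd = 2)
abline(v = median(skewed), col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("Mean", "Median"),
       col = c("red", "blue"), lwd = 2, lty = c(1, 2))
```

In a right-skewed sample like this one, the mean line sits to the right of the median line, which is the same effect the CEO's salary has on the salary data.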


6. Summary Exercise

The following vector represents the daily number of customers at a local cafe over 10 days: 34, 45, 40, 38, 50, 45, 55, 120, 42, 41.

  1. Calculate the Mean.
  2. Calculate the Median.
  3. Identify the Outlier.
  4. Which measure provides a better “typical” day for the cafe owner?
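
One way to check your answers to the first three questions is sketched below. The outlier rule used here (the value farthest from the median) is a simplifying assumption for this small dataset, not a general-purpose outlier test, and the variable names are illustrative.

```r
customers <- c(34, 45, 40, 38, 50, 45, 55, 120, 42, 41)

mean_customers <- mean(customers)
median_customers <- median(customers)

# Assumption: treat the value farthest from the median as the outlier candidate
outlier_candidate <- customers[which.max(abs(customers - median_customers))]

cat("Mean:", mean_customers, "\n")
cat("Median:", median_customers, "\n")
cat("Outlier candidate:", outlier_candidate, "\n")
```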
