1. Introduction

Descriptive statistics describe the basic features of the data in a study. They provide simple summaries of the sample and its measures and, together with simple graphical analysis, form the basis of virtually every quantitative analysis of data.

2. Measures of Central Tendency

These measures indicate the center of a distribution.

2.1 The Mean

The arithmetic average of a set of values.

Formula: \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

  • Real-life Example: A teacher calculating the average test score of a class to determine the overall performance level.
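
The formula above can be checked against R's built-in function. This is a minimal sketch; the scores are hypothetical values chosen only for illustration.

```r
# Hypothetical test scores for a small class (illustrative values only)
scores <- c(72, 85, 90, 66, 78)

# Mean computed directly from the formula: sum of the values divided by n
manual_mean <- sum(scores) / length(scores)

# Built-in equivalent
builtin_mean <- mean(scores)

cat("Manual:", manual_mean, "\nBuilt-in:", builtin_mean, "\n")
```

Both approaches should agree; the manual version is shown only to make the formula concrete.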

2.2 The Median

The middle value in a list of numbers sorted from lowest to highest. When the list has an even number of values, the median is the average of the two middle values.

  • Real-life Example: Household Income. The median is used because the mean can be “pulled” upward by a few billionaires, giving a false impression of what a “typical” family earns.
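
The effect of an extreme value on the two measures can be sketched as follows. The incomes are hypothetical, and the last one is deliberately extreme.

```r
# Hypothetical household incomes; the last value is an extreme outlier
incomes <- c(42000, 48000, 51000, 55000, 60000, 1e9)

mean_income   <- mean(incomes)    # pulled far upward by the single outlier
median_income <- median(incomes)  # average of the two middle values (n is even)

cat("Mean:", mean_income, "\nMedian:", median_income, "\n")
```

The median stays among the ordinary incomes, while the mean lands far above every household except one.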

2.3 The Mode

The value that occurs most often.

  • Real-life Example: A restaurant owner tracking which menu item is ordered most frequently to optimize inventory.
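
Base R has no function for the statistical mode (the built-in mode() reports how an object is stored, not its most frequent value). One possible approach, sketched here with hypothetical order data, is to count the values with table() and keep those with the highest count.

```r
# Hypothetical orders recorded over one lunch service
orders <- c("burger", "salad", "burger", "pasta", "burger", "salad")

# Count how often each item appears
order_counts <- table(orders)

# Keep every item that reaches the highest count (ties return several items)
most_ordered <- names(order_counts)[order_counts == max(order_counts)]

print(order_counts)
cat("Most ordered:", most_ordered, "\n")
```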

3. Measures of Dispersion

Dispersion tells us how “spread out” the data is.

3.1 Variance (\(s^2\))

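The sum of the squared differences from the mean, divided by \(n - 1\). This is the sample variance: dividing by \(n - 1\) rather than \(n\) corrects the bias introduced by estimating the mean from the same data. \[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}\]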

3.2 Standard Deviation (\(s\))

The square root of the variance. It is the most common measure of spread and is expressed in the same units as the data. \[s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}\]

  • Real-life Example: Quality Control. If a machine fills cereal boxes, a high standard deviation means the weights are inconsistent, leading to some boxes being half-empty and others overflowing.
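
As a small check of the two formulas, the sketch below computes the sample variance and standard deviation by hand and compares them with var() and sd(). The box weights are hypothetical.

```r
# Hypothetical cereal box weights in grams
weights <- c(498, 502, 500, 497, 503)

n     <- length(weights)
x_bar <- mean(weights)

# Sample variance: squared deviations summed and divided by n - 1
manual_var <- sum((weights - x_bar)^2) / (n - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

cat("Manual variance:", manual_var, " var():", var(weights),
    "\nManual SD:", manual_sd, " sd():", sd(weights), "\n")
```

Both var() and sd() use the \(n - 1\) denominator, so they should match the manual results.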

4. Case Study: Employee Salaries

Let’s analyze a dataset of 50 employees in a startup.

# Create a dataset: most earn roughly 40k-60k, but the two top earners make 250k and 260k
set.seed(123)
salaries <- c(rnorm(48, mean=50000, sd=5000), 250000, 260000)

# Calculating Central Tendency
mean_sal   <- mean(salaries)
median_sal <- median(salaries)

# Calculating Dispersion
sd_sal     <- sd(salaries)
range_sal  <- range(salaries)
iqr_sal    <- IQR(salaries)

# Display results
cat("Mean Salary:", mean_sal, 
    "\nMedian Salary:", median_sal, 
    "\nStandard Deviation:", sd_sal)
## Mean Salary: 58302.36 
## Median Salary: 50021.49 
## Standard Deviation: 40830.6
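
The range and interquartile range are computed above but not displayed. The sketch below rebuilds the same dataset with the same seed so that it runs on its own, then prints both alongside the ratio of the standard deviation to the IQR.

```r
# Rebuild the case-study data (same seed and parameters as above)
set.seed(123)
salaries <- c(rnorm(48, mean = 50000, sd = 5000), 250000, 260000)

range_sal <- range(salaries)  # minimum and maximum
iqr_sal   <- IQR(salaries)    # spread of the middle 50%
sd_sal    <- sd(salaries)

cat("Range:", range_sal[1], "to", range_sal[2],
    "\nIQR:", iqr_sal,
    "\nSD / IQR ratio:", sd_sal / iqr_sal, "\n")
```

Because the IQR ignores the tails, it is much less affected by the two top salaries than the standard deviation is.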

4.1 Visualizing with Base R

Histogram

A histogram shows the frequency distribution. We will add vertical lines for the mean (dashed red) and the median (solid blue).

hist(salaries, 
     breaks = 15, 
     col = "lightgray", 
     border = "white",
     main = "Distribution of Salaries",
     xlab = "Salary ($)")

abline(v = mean_sal, col = "red", lwd = 2, lty = 2)   # Dashed Red = Mean
abline(v = median_sal, col = "blue", lwd = 2)        # Solid Blue = Median

legend("topright", legend=c("Mean", "Median"), col=c("red", "blue"), lty=c(2,1), lwd=2)

Boxplot

Boxplots make outliers easy to spot: here, the two top salaries of 250k and 260k appear as isolated points beyond the whiskers.

boxplot(salaries, 
        horizontal = TRUE, 
        col = "lightblue", 
        main = "Boxplot of Salaries",
        xlab = "Salary ($)")
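
To list the points that the boxplot draws as outliers, one option is boxplot.stats(), which by default applies the same 1.5 × IQR whisker rule as boxplot(). The sketch rebuilds the case-study data so that it is self-contained.

```r
# Rebuild the case-study data (same seed and parameters as above)
set.seed(123)
salaries <- c(rnorm(48, mean = 50000, sd = 5000), 250000, 260000)

# Values beyond the whiskers (default coef = 1.5)
box_info <- boxplot.stats(salaries)
outliers <- box_info$out

print(outliers)
```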


5. Skewness and Interpretation
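
In the case study the mean lies well above the median, which is the usual sign of a right-skewed distribution. One way to quantify this is the moment-based skewness coefficient. Base R has no built-in skewness function, so the sketch below defines a hypothetical helper, skewness_moment(), under the assumption that this particular definition (third central moment divided by the second central moment raised to the power 1.5) is the one wanted.

```r
# Hypothetical helper: moment-based skewness (an assumed definition,
# not a base R function)
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Rebuild the case-study data (same seed and parameters as above)
set.seed(123)
salaries <- c(rnorm(48, mean = 50000, sd = 5000), 250000, 260000)

skew_sal <- skewness_moment(salaries)

cat("Skewness:", skew_sal,
    "\nMean - Median:", mean(salaries) - median(salaries), "\n")
```

A positive coefficient suggests a long right tail, a negative one a long left tail, and a value near zero a roughly symmetric distribution.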


6. Summary Table

Statistic             R Function    Description
--------------------  ------------  -------------------------------
Mean                  mean(x)       Average value
Median                median(x)     Middle value
Standard Deviation    sd(x)         Typical distance from the mean
Variance              var(x)        Squared dispersion
Interquartile Range   IQR(x)        Spread of the middle 50%
Summary               summary(x)    Min, Q1, Median, Mean, Q3, Max
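
As a quick illustration of the last row, summary() returns all six values at once. The vector below is hypothetical, with one large value to show how the mean and median diverge.

```r
# Hypothetical data with one large value
x <- c(1, 2, 3, 4, 100)

# Min, Q1, Median, Mean, Q3, Max in a single call
s <- summary(x)
print(s)

# Individual entries can be extracted by name
cat("Median:", s[["Median"]], " Mean:", s[["Mean"]], "\n")
```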

7. Practice Task

  1. Use the built-in islands dataset in R (type data(islands)).
  2. Calculate the mean and median area of the islands.
  3. Create a boxplot and identify if there are any outliers.
  4. Question: Why is the mean so much larger than the median in this dataset? (Hint: Look at the size of continents vs. small islands).