This is a structured R Markdown template for your lecture. It uses the built-in `mtcars` dataset to provide real-world context.
You can copy this code into an `.Rmd` file in RStudio and knit it into a PDF, HTML, or Word document.
---
title: "Lecture Notes: Descriptive Statistics"
subtitle: "Understanding 'What Happens' in Data"
author: "Data Analytics Department"
date: "2026-04-23"
output:
  html_document:
    toc: true
    toc_depth: 3
    theme: united
    highlight: tango
---
# 1. Introduction to Descriptive Statistics
Descriptive statistics is one of the main branches of data analytics. Its primary purpose is to answer the fundamental question: **"What happens in the data?"** It summarizes and organizes the characteristics of a dataset.
## The Five Pillars of Descriptive Statistics
1. **Measure of Central Tendency:** Finding the "center" or typical value.
2. **Measure of Variation (Dispersion):** Understanding the spread or "noise."
3. **Measure of Shape:** Determining symmetry and peakedness (Skewness/Kurtosis).
4. **Measure of Position:** Identifying where a value stands relative to others (Percentiles/Quartiles).
5. **Measure of Frequency:** Counting occurrences of specific values or ranges.
---
# 2. The Data Processing Workflow
Before calculating statistics, we follow a rigorous order of operations:
### i. Data Cleaning
Handling missing values (NAs) and removing duplicates or structural errors.
``` r
# Example: Using the built-in mtcars dataset
data <- mtcars
# Check for missing values
sum(is.na(data))
## [1] 0
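# A sketch of the other cleaning checks named above (duplicates and per-column
# missing values), run on mtcars directly so it stands alone.
# n_dup, na_by_col and cleaned are illustrative names, not part of the template.
n_dup     <- sum(duplicated(mtcars))   # count of fully duplicated rows
na_by_col <- colSums(is.na(mtcars))    # missing-value count for each column
cleaned   <- unique(na.omit(mtcars))   # drop incomplete rows, then duplicated rows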
# ii. Data Transformation
# Converting data into formats suitable for analysis (e.g., log transformation or scaling).
# Log transformation of 'mpg' to reduce right skew and stabilize variance, if needed
data$log_mpg <- log(data$mpg)
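# The step above also mentions scaling. A minimal sketch of z-score
# standardization, assuming that is the kind of scaling intended.
# mpg_z is an illustrative name, not part of the original template.
mpg_z <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)  # center at 0, unit standard deviation
# Base R's scale() performs the same centering and scaling:
all.equal(mpg_z, as.numeric(scale(mtcars$mpg)))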
# iii. Data Classification
# Defining the variable types: nominal, ordinal, interval, or ratio.
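# A sketch of how these measurement levels might be encoded in R, using mtcars
# columns as illustrations. Which level suits each column is a judgment call,
# and 'typed' is an illustrative name, not part of the original template.
typed <- mtcars
typed$am  <- factor(typed$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))          # nominal: unordered categories
typed$cyl <- factor(typed$cyl, levels = c(4, 6, 8),
                    ordered = TRUE)                             # ordinal: ordered categories
str(typed[, c("am", "cyl", "mpg")])                             # mpg stays numeric (ratio scale)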
# iv. Data Validation
# Ensuring the data fall within logical bounds (e.g., checking that 'mpg' is positive).
# Validation check: All MPG values should be > 0
all(data$mpg > 0)
## [1] TRUE
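# A stricter variant (a sketch, not part of the original template): stopifnot()
# halts with an error if any check fails, instead of only printing TRUE or FALSE.
stopifnot(
  is.numeric(mtcars$mpg),  # values are numeric
  !anyNA(mtcars$mpg),      # no missing values
  all(mtcars$mpg > 0)      # every value is positive
)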
# v. Data Visualization
# Using plots to see the distribution before calculating numbers.
library(ggplot2)  # required for ggplot() and the geoms used below
ggplot(data, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Miles Per Gallon (MPG)", x = "MPG", y = "Frequency")
# ---- 3. Measure of Central Tendency ----
# The measure of central tendency acts as the backbone of descriptive statistics.
# It provides a single value that represents the center of the data distribution.
# We will focus on 7 main elements using the MPG (Miles Per Gallon) variable from our dataset:
# 1. Mean: the arithmetic average.
avg_val <- mean(data$mpg)
print(paste("Mean MPG:", round(avg_val, 2)))
## [1] "Mean MPG: 20.09"
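# The same value from the definition, the sum of the values divided by their
# count (a sketch; manual_mean is an illustrative name).
manual_mean <- sum(mtcars$mpg) / length(mtcars$mpg)
all.equal(manual_mean, mean(mtcars$mpg))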
# 2. Median: the middle value when the data are ordered. It is robust to outliers.
med_val <- median(data$mpg)
print(paste("Median MPG:", med_val))
## [1] "Median MPG: 19.2"
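# A small sketch of what "robust to outliers" means: append one extreme,
# made-up value (150 is hypothetical, not from mtcars) and compare how far
# the mean and the median move. All names here are illustrative.
mpg_with_outlier <- c(mtcars$mpg, 150)
mean_shift   <- mean(mpg_with_outlier) - mean(mtcars$mpg)      # mean is pulled toward the outlier
median_shift <- median(mpg_with_outlier) - median(mtcars$mpg)  # median is unchanged here
print(c(mean_shift = mean_shift, median_shift = median_shift))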
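# 3. Mode: the most frequently occurring value.
# Base R has no built-in function for the statistical mode (mode() returns an
# object's storage type), so we define one. If several values tie for the
# highest count, this helper returns the first one encountered.
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}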
mode_val <- get_mode(data$mpg)
print(paste("Mode MPG:", mode_val))
## [1] "Mode MPG: 21"
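# In mtcars$mpg several values tie for the highest count, and get_mode()
# reports only the first of them. A sketch that returns every tied value
# (get_all_modes is an illustrative helper, not part of the original template):
get_all_modes <- function(v) {
  counts <- table(v)                                 # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])   # keep all values with the top count
}
all_modes <- get_all_modes(mtcars$mpg)
print(all_modes)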
# 4. Maximum: the highest value in the dataset.
max_val <- max(data$mpg)
print(paste("Maximum MPG:", max_val))
## [1] "Maximum MPG: 33.9"
# 5. Minimum: the lowest value in the dataset.
min_val <- min(data$mpg)
print(paste("Minimum MPG:", min_val))
## [1] "Minimum MPG: 10.4"
# 6. Mid-range: the average of the maximum and minimum values.
mid_range <- (max_val + min_val) / 2
print(paste("Mid-range MPG:", mid_range))
## [1] "Mid-range MPG: 22.15"
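# An equivalent one-liner (a sketch): range() returns c(min, max), so the mean
# of that pair is the mid-range. mid_range_alt is an illustrative name.
mid_range_alt <- mean(range(mtcars$mpg))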
# 7. Mean Absolute Deviation: the average of the absolute differences between
# each data point and the mean. It measures the average "distance" from the
# center, so strictly speaking it describes spread around the center.
# Calculating Mean Absolute Deviation manually
mad_val <- mean(abs(data$mpg - mean(data$mpg)))
print(paste("Mean Absolute Deviation (MAD):", round(mad_val, 2)))
## [1] "Mean Absolute Deviation (MAD): 4.71"
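# Base R's mad() is a different statistic: the median absolute deviation from
# the median, multiplied by a consistency constant (1.4826 by default).
# A sketch contrasting the two; the variable names are illustrative.
mean_abs_dev     <- mean(abs(mtcars$mpg - mean(mtcars$mpg)))      # as computed above
median_abs_dev   <- mad(mtcars$mpg, constant = 1)                 # unscaled median absolute deviation
median_abs_dev_2 <- median(abs(mtcars$mpg - median(mtcars$mpg)))  # the same quantity by hand
print(c(mean_abs_dev = mean_abs_dev, median_abs_dev = median_abs_dev))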
# To visualize the central tendency, we use a density plot and mark the mean and median.
# Note: 'linewidth' replaces the deprecated 'size' for lines (ggplot2 >= 3.4.0).
ggplot(data, aes(x = mpg)) +
  geom_density(fill = "gray", alpha = 0.3) +
  geom_vline(aes(xintercept = mean(mpg)), color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median(mpg)), color = "blue", linetype = "dotted", linewidth = 1) +
  annotate("text", x = 25, y = 0.06, label = "Red = Mean", color = "red") +
  annotate("text", x = 25, y = 0.05, label = "Blue = Median", color = "blue") +
  labs(title = "Central Tendency on MPG Data", subtitle = "Comparing Mean and Median") +
  theme_classic()
```

Figure 1: MPG Distribution with Central Tendency Markers

If the mean is greater than the median, the data are usually positively (right) skewed. In our example:

* Mean: 20.09
* Median: 19.2

Since the mean is slightly higher than the median, the MPG data are slightly right-skewed.
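
The mean-versus-median comparison is only a rule of thumb, so it can help to check it against a direct skewness estimate. The sketch below uses the moment-based definition (third central moment divided by the second central moment raised to the power 1.5); other packages apply slightly different bias corrections, and the names `dev`, `skew`, and `mean_minus_median` are illustrative rather than part of the template.

```r
# Moment-based sample skewness of MPG, computed with base R only.
mpg <- mtcars$mpg
dev <- mpg - mean(mpg)                      # deviations from the mean
skew <- mean(dev^3) / mean(dev^2)^1.5       # positive values indicate right skew
mean_minus_median <- mean(mpg) - median(mpg)

print(round(c(skewness = skew, mean_minus_median = mean_minus_median), 2))
```

Both quantities come out positive for MPG, which agrees with the right-skew reading above.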