This is a structured R Markdown template for your lecture. It uses the built-in mtcars dataset to provide real-world context.

You can copy this code into an .Rmd file in RStudio and knit it. The YAML header below targets HTML; change `output` to `pdf_document` or `word_document` to knit a PDF or Word document instead.

---
title: "Lecture Notes: Descriptive Statistics"
subtitle: "Understanding 'What Happens' in Data"
author: "Data Analytics Department"
date: "2026-04-23"
output: 
  html_document:
    toc: true
    toc_depth: 3
    theme: united
    highlight: tango
---



# 1. Introduction to Descriptive Statistics

Descriptive statistics is one of the main branches of data analytics. Its primary purpose is to answer the fundamental question: **"What happens in the data?"** It does so by summarizing and organizing the characteristics of a data set, without generalizing beyond it.

## The Five Pillars of Descriptive Statistics
1.  **Measure of Central Tendency:** Finding the "center" or typical value.
2.  **Measure of Variation (Dispersion):** Understanding the spread or "noise."
3.  **Measure of Shape:** Determining symmetry (Skewness) and tail weight, traditionally described as peakedness (Kurtosis).
4.  **Measure of Position:** Identifying where a value stands relative to others (Percentiles/Quartiles).
5.  **Measure of Frequency:** Counting occurrences of specific values or ranges.
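
As a rough illustration of the five pillars, the sketch below computes one statistic per pillar for `mtcars$mpg`. The choice of statistic for each pillar, the bin edges passed to `cut()`, and the list name `pillars` are assumptions made for this example, not a fixed convention.

```r
# Illustrative sketch: one statistic per pillar for mtcars$mpg.
x <- mtcars$mpg

pillars <- list(
  # 1. Central tendency: the arithmetic mean
  central_tendency = mean(x),
  # 2. Variation: the sample standard deviation
  variation = sd(x),
  # 3. Shape: Pearson's median skewness, 3 * (mean - median) / sd
  shape = 3 * (mean(x) - median(x)) / sd(x),
  # 4. Position: the three quartiles
  position = quantile(x, probs = c(0.25, 0.50, 0.75)),
  # 5. Frequency: counts per bin (bin edges are an assumed choice)
  frequency = table(cut(x, breaks = c(10, 20, 30, 40)))
)

str(pillars)
```

Each element of `pillars` answers a different question about the same variable, which is why a single number rarely describes a data set adequately.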

---

# 2. The Data Processing Workflow

Before calculating statistics, we follow a rigorous order of operations:

### i. Data Cleaning
Handling missing values (NAs) and removing duplicates or structural errors.
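
Because `mtcars` contains no missing values, a small made-up data frame shows the cleaning steps more clearly. The sketch below is one possible approach using base R only; the data frame `raw` and its contents are hypothetical, and dropping incomplete rows is just one of several ways to handle NAs.

```r
# Hypothetical toy data: one missing value and one exact duplicate row
raw <- data.frame(
  car = c("A", "B", "B", "C"),
  mpg = c(21.0, 18.5, 18.5, NA)
)

# Count missing values per column
na_per_column <- colSums(is.na(raw))

# Drop exact duplicate rows (keeps the first occurrence)
deduped <- raw[!duplicated(raw), ]

# Drop rows that still contain any NA
cleaned <- deduped[complete.cases(deduped), ]

print(na_per_column)
print(cleaned)
```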

``` r
# Example: using the built-in mtcars dataset
data <- mtcars

# Check for missing values: counts NA cells across the whole data frame
sum(is.na(data))
```

    ## [1] 0

### ii. Data Transformation

Converting data into formats suitable for analysis (e.g., log transformation or scaling).

``` r
# Log-transform 'mpg' to stabilize variance if needed
data$log_mpg <- log(data$mpg)
```

### iii. Data Measurement

Defining the variable types: Nominal, Ordinal, Interval, or Ratio. In `mtcars`, for example, `am` (transmission type) is nominal and `mpg` is ratio.

### iv. Data Validation

Ensuring the data falls within logical bounds (e.g., checking that `mpg` is positive).

``` r
# Validation check: all MPG values should be > 0
all(data$mpg > 0)
```

    ## [1] TRUE

### v. Data Visualization

Using plots to see the distribution before calculating numbers.

``` r
library(ggplot2)  # provides ggplot(); load it once per document

ggplot(data, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Miles Per Gallon (MPG)", x = "MPG", y = "Frequency")
```

# 3. Measure of Central Tendency

Measures of central tendency act as the backbone of descriptive statistics. Each provides a single value that represents the center of the data distribution.

We will compute seven summary values for the MPG (Miles Per Gallon) variable from our dataset. The mean, median, mode, and mid-range locate the center directly; the maximum and minimum mark the ends of the range, and the mean absolute deviation describes how far the values lie from that center.

## 3.1 Mean

The arithmetic average.

``` r
avg_val <- mean(data$mpg)
print(paste("Mean MPG:", round(avg_val, 2)))
```

    ## [1] "Mean MPG: 20.09"

## 3.2 Median

The middle value when the data is ordered. It is robust to outliers.

``` r
med_val <- median(data$mpg)
print(paste("Median MPG:", med_val))
```

    ## [1] "Median MPG: 19.2"

## 3.3 Mode

The most frequently occurring value. R's built-in `mode()` reports an object's storage mode rather than the statistical mode, so we create our own function. When several values tie for the highest count, this helper returns the first one it encounters; several `mpg` values occur twice, so 21 is only one of the modes.

``` r
get_mode <- function(v) {
  uniqv <- unique(v)
  # Count each unique value and return the first one with the highest count
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
mode_val <- get_mode(data$mpg)
print(paste("Mode MPG:", mode_val))
```

    ## [1] "Mode MPG: 21"

## 3.4 Maximum

The highest value in the dataset.

``` r
max_val <- max(data$mpg)
print(paste("Maximum MPG:", max_val))
```

    ## [1] "Maximum MPG: 33.9"

## 3.5 Minimum

The lowest value in the dataset.

``` r
min_val <- min(data$mpg)
print(paste("Minimum MPG:", min_val))
```

    ## [1] "Minimum MPG: 10.4"

## 3.6 Mid-range

The average of the maximum and minimum values.

``` r
mid_range <- (max_val + min_val) / 2
print(paste("Mid-range MPG:", mid_range))
```

    ## [1] "Mid-range MPG: 22.15"

## 3.7 MAD (Mean Absolute Deviation)

The average of the absolute differences between each data point and the mean. It measures the average "distance" from the center. We calculate it manually because R's built-in `mad()` computes the median absolute deviation, which is a different statistic.

``` r
# Calculating Mean Absolute Deviation manually
mad_val <- mean(abs(data$mpg - mean(data$mpg)))
print(paste("Mean Absolute Deviation (MAD):", round(mad_val, 2)))
```

    ## [1] "Mean Absolute Deviation (MAD): 4.71"

# 4. Summary Visualization

To visualize the central tendency, we use a density plot and mark the Mean and Median.

``` r
# 'linewidth' replaces the deprecated 'size' argument for lines (ggplot2 >= 3.4.0)
ggplot(data, aes(x = mpg)) +
  geom_density(fill = "gray", alpha = 0.3) +
  geom_vline(aes(xintercept = mean(mpg)), color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median(mpg)), color = "blue", linetype = "dotted", linewidth = 1) +
  annotate("text", x = 25, y = 0.06, label = "Red = Mean", color = "red") +
  annotate("text", x = 25, y = 0.05, label = "Blue = Median", color = "blue") +
  labs(title = "Central Tendency on MPG Data", subtitle = "Comparing Mean and Median") +
  theme_classic()
```

Figure 1: MPG Distribution with Central Tendency Markers

## Key Takeaway

If the Mean > Median, the data is usually positively skewed. In our example:

* Mean: 20.09
* Median: 19.2

Since the Mean is slightly higher than the Median, the MPG data is slightly right-skewed.
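
The mean-versus-median rule of thumb can be cross-checked with a moment-based skewness coefficient. The sketch below uses the population-moment form (third central moment divided by the 1.5 power of the second); other packages apply small-sample corrections, so their values may differ slightly. The variable names are illustrative.

```r
# Cross-check the direction of skew for mtcars$mpg
x <- mtcars$mpg

# Second and third central moments (population form, dividing by n)
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)

# Moment-based skewness: positive means a longer right tail
skewness_g1 <- m3 / m2^1.5

# The simpler rule of thumb: sign of (mean - median)
mean_minus_median <- mean(x) - median(x)

print(c(skewness = skewness_g1, mean_minus_median = mean_minus_median))
```

Both quantities are positive for `mpg`, so the two approaches agree here. They can disagree for other data, which is why the rule holds only "usually".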

How to use this:

* **Mean/Median/Mode:** These show where the "typical" car's fuel efficiency lies.
* **Max/Min/Mid-range:** The maximum and minimum define the boundaries of the dataset, and the mid-range is the point halfway between them.
* **MAD:** This shows how far, on average, the values lie from the mean.
* **Figures:** The code includes a histogram and a density plot to visually demonstrate the concepts of central tendency.
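
To reuse these statistics on another numeric variable, the seven calculations can be gathered into one helper. This is a minimal sketch: the function name `summarize_center` and the element names are hypothetical, and the mode follows the same first-encountered tie rule as `get_mode`.

```r
# Hypothetical helper: collect the seven summary values in one named vector
summarize_center <- function(v) {
  uniq <- unique(v)
  # First value with the highest count (ties resolved by order of appearance)
  first_mode <- uniq[which.max(tabulate(match(v, uniq)))]

  c(
    mean         = mean(v),
    median       = median(v),
    mode         = first_mode,
    maximum      = max(v),
    minimum      = min(v),
    mid_range    = (max(v) + min(v)) / 2,
    mean_abs_dev = mean(abs(v - mean(v)))
  )
}

mpg_summary <- summarize_center(mtcars$mpg)
print(round(mpg_summary, 2))
```

Calling `summarize_center(mtcars$hp)` would apply the same summary to horsepower without copying any code.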