Descriptive Analysis in R: A Complete Beginner’s Guide

From Concepts to 20+ Real-World Use Cases

Author

Your Name

Published

June 16, 2026

1 Introduction to Descriptive Analysis

Welcome! If you are new to R and want to learn how to describe your data clearly, you are in the right place. This document will walk you through descriptive analysis step by step. No prior statistics background is needed. We will start with the basics and slowly build up to 20+ real-world examples.

1.1 What is Descriptive Analysis?

Imagine you have a big spreadsheet with thousands of rows. Looking at the raw numbers tells you very little. You need a way to summarize the data so that patterns, averages, spreads, and groups become visible. That is exactly what descriptive analysis does.

Descriptive analysis answers questions like:

  • What is the average value in this column?
  • How spread out are the numbers?
  • How many people fall into each category?
  • Does the average salary differ between men and women?
  • Are two variables related?
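As a rough preview, each question above maps to a single base R call. The sketch below uses R's built-in mtcars data rather than the dataset built in this guide, and the variable names are chosen only for illustration.

```r
# Preview: one base R call per question, using the built-in mtcars data
avg_mpg    <- mean(mtcars$mpg)                     # average value of a column
spread_mpg <- sd(mtcars$mpg)                       # how spread out the numbers are
cyl_counts <- table(mtcars$cyl)                    # how many cars fall into each category
mpg_by_am  <- tapply(mtcars$mpg, mtcars$am, mean)  # average per group (am: 0 = automatic, 1 = manual)
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)           # are two variables related?

avg_mpg
spread_mpg
cyl_counts
mpg_by_am
wt_mpg_cor
```

Each of these calls is covered in more detail in the sections that follow.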

1.2 Creating Our Sample Dataset

We will create one large simulated dataset to use throughout this guide.

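# Load the packages used throughout this guide
library(dplyr)    # %>%, mutate(), group_by(), summarise(), count()
library(tidyr)    # pivot_wider()
library(ggplot2)  # all plots

# Create a large simulated dataset
set.seed(123)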

df <- data.frame(
  id = 1:500,
  age = round(rnorm(500, mean = 35, sd = 10)),
  income = round(rnorm(500, mean = 50000, sd = 15000)),
  height = round(rnorm(500, mean = 170, sd = 10), 1),
  weight = round(rnorm(500, mean = 70, sd = 15), 1),
  exam_score = round(rnorm(500, mean = 75, sd = 12)),
  study_hours = round(rnorm(500, mean = 20, sd = 8)),
  gender = sample(c("Male", "Female"), 500, replace = TRUE),
  region = sample(c("North", "South", "East", "West"), 500, replace = TRUE),
  education = factor(
    sample(c("High School", "Bachelor", "Master", "PhD"), 500, replace = TRUE),
    levels = c("High School", "Bachelor", "Master", "PhD"),
    ordered = TRUE
  ),
  purchased = sample(c("Yes", "No"), 500, replace = TRUE, prob = c(0.4, 0.6)),
  blood_type = sample(c("A", "B", "AB", "O"), 500, replace = TRUE),
  satisfaction = factor(
    sample(c("Low", "Medium", "High"), 500, replace = TRUE),
    levels = c("Low", "Medium", "High"),
    ordered = TRUE
  )
)

# Add age groups and experience (used in later examples)
df <- df %>%
  mutate(
    age_group = case_when(
      age < 30 ~ "Young",
      age < 50 ~ "Middle",
      TRUE ~ "Older"
    ),
    # Experience is approximated as years since age 22; pmax() floors it at 0
    experience = pmax(0, age - 22)
  )

head(df)
  id age income height weight exam_score study_hours gender region education
1  1  29  40972  160.0   57.7         69          15   Male   East    Master
2  2  33  35095  159.6   65.4         78          25 Female  North    Master
3  3  51  65402  169.8   56.5         69          14 Female   West    Master
4  4  36  61266  168.7   79.4         90          16 Female   East       PhD
5  5  36  27363  144.5   86.8         77          26 Female   East       PhD
6  6  52  48573  180.4  101.9         68          16 Female   West  Bachelor
  purchased blood_type satisfaction age_group experience
1        No          B          Low     Young          7
2       Yes          A         High    Middle         11
3       Yes          A         High     Older         29
4       Yes          B          Low    Middle         14
5       Yes          O         High    Middle         14
6       Yes         AB          Low     Older         30

2 Key Concepts in Descriptive Analysis

2.1 Measures of Central Tendency

2.1.1 Mean (Average)

mean(c(10, 20, 30, 40, 50))
[1] 30

2.1.2 Median (Middle Value)

median(c(10, 20, 30, 40, 50))
[1] 30

2.1.3 Mode (Most Frequent Value)

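x <- c(1, 2, 2, 3, 3, 3, 4)
# R has no built-in statistical mode (mode() returns the storage type),
# so count each value, sort by count, and take the name of the top entry
mode_val <- as.numeric(names(sort(table(x), decreasing = TRUE)[1]))
mode_val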
[1] 3
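The one-liner above returns only the first of any tied values. As a minimal sketch, a small helper (get_modes is a hypothetical name, not part of base R) can return every value that shares the highest count:

```r
# Hypothetical helper: return all values tied for the highest frequency.
# table() stores values as names, so the result is a character vector.
get_modes <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

single_mode <- get_modes(c(1, 2, 2, 3, 3, 3, 4))
tied_modes  <- get_modes(c("a", "b", "a", "b", "c"))

single_mode  # "3"
tied_modes   # "a" "b"
```

Wrap the result in as.numeric() when the input is numeric and a number is needed.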

2.2 Measures of Spread

2.2.1 Range

range(c(10, 20, 30, 40, 50))
[1] 10 50

2.2.2 Variance

var(c(10, 20, 30, 40, 50))
[1] 250

2.2.3 Standard Deviation

sd(c(10, 20, 30, 40, 50))
[1] 15.81139
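To see where the variance and standard deviation above come from, the sketch below recomputes them by hand. R's var() and sd() use the sample formulas, which divide by n - 1 rather than n; the variable names are illustrative.

```r
x <- c(10, 20, 30, 40, 50)

# Step 1: distance of each value from the mean (-20, -10, 0, 10, 20)
deviations <- x - mean(x)

# Step 2: sum of squared deviations divided by n - 1 gives the sample variance
manual_var <- sum(deviations^2) / (length(x) - 1)

# Step 3: the standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

manual_var  # 250
manual_sd   # 15.81139
```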

2.2.4 Interquartile Range (IQR)

IQR(c(10, 20, 30, 40, 50))
[1] 20
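The IQR is the distance between the 25th and 75th percentiles. The sketch below rebuilds it from quantile() (using R's default quantile method) and then applies the common 1.5 × IQR rule that boxplots use to flag possible outliers. The fence variable names are illustrative.

```r
x <- c(10, 20, 30, 40, 50)

# First and third quartiles
q <- quantile(x, probs = c(0.25, 0.75))
q1 <- unname(q[1])  # 20
q3 <- unname(q[2])  # 40

# IQR is simply Q3 - Q1
iqr_manual <- q3 - q1  # 20

# Values beyond 1.5 * IQR from the quartiles are flagged as possible outliers
lower_fence <- q1 - 1.5 * iqr_manual  # -10
upper_fence <- q3 + 1.5 * iqr_manual  # 70
outliers <- x[x < lower_fence | x > upper_fence]

iqr_manual
outliers  # numeric(0): no value falls outside the fences
```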

2.3 Frequency and Counts

table(c("A", "B", "A", "C", "B", "A"))

A B C 
3 2 1 
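Counts are often easier to read as percentages. A minimal base R sketch, reusing the same six letters, converts the table with prop.table():

```r
x <- c("A", "B", "A", "C", "B", "A")

counts <- table(x)

# prop.table() divides each count by the total; multiply by 100 for percent
pct <- round(prop.table(counts) * 100, 1)
pct  # A = 50.0, B = 33.3, C = 16.7
```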

3 Types of Variables

3.1 Quick Test

Ask yourself:

  1. Can I do meaningful math with it, such as averaging? → Continuous
  2. Is it a label with no order? → Categorical
  3. Is it a label with a clear order? → Ordinal
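The three questions above can also be checked in code. The sketch below builds three tiny illustrative vectors (not the guide's dataset) and asks R how it stores each one:

```r
age       <- c(29, 33, 51)                       # continuous
region    <- factor(c("East", "North", "West"))  # categorical, no order
education <- factor(
  c("Master", "PhD", "Bachelor"),
  levels = c("Bachelor", "Master", "PhD"),
  ordered = TRUE                                 # ordinal, levels have an order
)

is_continuous  <- is.numeric(age)
is_categorical <- is.factor(region) && !is.ordered(region)
is_ordinal     <- is.ordered(education)

# Ordered factors support comparisons that follow the level order
below_phd <- education < "PhD"

is_continuous   # TRUE
is_categorical  # TRUE
is_ordinal      # TRUE
below_phd       # TRUE FALSE TRUE
```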

4 Core Descriptive Techniques in R

4.1 The summary() Function

summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

4.2 The mean(), median(), sd() Functions

mean(mtcars$mpg)
[1] 20.09062
median(mtcars$mpg)
[1] 19.2
sd(mtcars$mpg)
[1] 6.026948

4.3 Grouped Summaries with dplyr

mtcars %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    avg_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg)
  )
# A tibble: 3 × 6
    cyl     n avg_mpg sd_mpg min_mpg max_mpg
  <dbl> <int>   <dbl>  <dbl>   <dbl>   <dbl>
1     4    11    26.7   4.51    21.4    33.9
2     6     7    19.7   1.45    17.8    21.4
3     8    14    15.1   2.56    10.4    19.2
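For readers who want to see the same idea without dplyr, a grouped mean can be sketched in base R with aggregate(). This is an alternative shown for comparison, not the approach this guide relies on.

```r
# Base R sketch of a grouped mean: average mpg within each cyl group
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Round for display so the values line up with the dplyr output
mpg_by_cyl$mpg <- round(mpg_by_cyl$mpg, 1)
mpg_by_cyl
```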

4.4 Frequency Tables with dplyr

mtcars %>%
  count(cyl)
  cyl  n
1   4 11
2   6  7
3   8 14

5 Visualizations for Descriptive Analysis

5.1 Histograms (for continuous data)

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Miles per Gallon",
       x = "MPG", y = "Count")

5.2 Boxplots (continuous data grouped by category)

ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  labs(title = "MPG by Number of Cylinders",
       x = "Cylinders", y = "MPG") +
  theme(legend.position = "none")

5.3 Bar Charts (for categorical data)

mtcars %>%
  count(cyl) %>%
  ggplot(aes(x = factor(cyl), y = n, fill = factor(cyl))) +
  geom_col() +
  labs(title = "Number of Cars by Cylinder Count",
       x = "Cylinders", y = "Count") +
  theme(legend.position = "none")

5.4 Scatter Plots (for two continuous variables)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "darkred", size = 2) +
  labs(title = "Weight vs MPG",
       x = "Weight (1000 lbs)", y = "Miles per Gallon")

6 Summary Tables

6.1 A Basic Summary Table

mtcars %>%
  group_by(cyl) %>%
  summarise(
    N = n(),
    Mean_MPG = round(mean(mpg), 2),
    SD_MPG = round(sd(mpg), 2),
    Min_MPG = min(mpg),
    Max_MPG = max(mpg)
  )
# A tibble: 3 × 6
    cyl     N Mean_MPG SD_MPG Min_MPG Max_MPG
  <dbl> <int>    <dbl>  <dbl>   <dbl>   <dbl>
1     4    11     26.7   4.51    21.4    33.9
2     6     7     19.7   1.45    17.8    21.4
3     8    14     15.1   2.56    10.4    19.2

6.2 A Wider Table

mtcars %>%
  group_by(cyl) %>%
  summarise(Mean_MPG = mean(mpg)) %>%
  pivot_wider(names_from = cyl, values_from = Mean_MPG)
# A tibble: 1 × 3
    `4`   `6`   `8`
  <dbl> <dbl> <dbl>
1  26.7  19.7  15.1

7 Real-World Use Cases (20+ Examples)

The use cases below all draw on the simulated df from Section 1.2. Its columns stand in for the scenario named in each heading; for example, income plays the role of sales and exam_score the role of reaction time.

7.1 Use Cases: Continuous Variable Only

7.1.1 Use Case 1: Exam Score Distribution in a Class

df %>%
  summarise(
    N = n(),
    Mean = mean(exam_score),
    Median = median(exam_score),
    SD = sd(exam_score),
    Min = min(exam_score),
    Max = max(exam_score),
    IQR = IQR(exam_score)
  )
    N   Mean Median       SD Min Max IQR
1 500 74.232     74 11.26428  41 111  15
ggplot(df, aes(x = exam_score)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", color = "white") +
  labs(title = "Distribution of Exam Scores",
       x = "Score", y = "Number of Students")

7.1.2 Use Case 2: Employee Salary Analysis

df %>%
  summarise(
    Count = n(),
    Mean_Salary = mean(income),
    Median_Salary = median(income),
    SD_Salary = sd(income)
  )
  Count Mean_Salary Median_Salary SD_Salary
1   500    49964.99         49982  15163.56

7.1.3 Use Case 4: Salary Quantiles (House Prices Scenario)

quantile(df$income, probs = c(0.25, 0.5, 0.75, 0.9, 0.95))
    25%     50%     75%     90%     95% 
39674.5 49982.0 59648.5 67908.5 74836.1 

7.1.4 Use Case 5: Product Weights in a Factory

ggplot(df, aes(y = weight)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Boxplot of Product Weights", y = "Weight (kg)")

7.2 Use Cases: Categorical Variable Only

7.2.1 Use Case 6: Customer Blood Type Distribution

df %>%
  count(blood_type) %>%
  mutate(percentage = round(n / sum(n) * 100, 1)) %>%
  arrange(desc(n))
  blood_type   n percentage
1         AB 141       28.2
2          A 130       26.0
3          B 128       25.6
4          O 101       20.2

7.2.2 Use Case 9: Department Employee Counts

df %>%
  count(region) %>%
  ggplot(aes(x = reorder(region, -n), y = n, fill = region)) +
  geom_col() +
  labs(title = "Employees per Region",
       x = "Region", y = "Count") +
  theme(legend.position = "none")

7.3 Use Cases: Continuous + Continuous

7.3.1 Use Case 10: Height vs. Weight

cor(df$height, df$weight)
[1] -0.1578324
ggplot(df, aes(x = height, y = weight)) +
  geom_point(alpha = 0.6, color = "purple") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Height vs. Weight",
       x = "Height (cm)", y = "Weight (kg)")

7.3.2 Use Case 11: Study Hours vs. Exam Scores

df %>%
  group_by(study_hours) %>%
  summarise(avg_score = mean(exam_score)) %>%
  ggplot(aes(x = study_hours, y = avg_score)) +
  geom_line(color = "blue") +
  labs(title = "Average Exam Score by Study Hours",
       x = "Study Hours", y = "Average Score")

7.4 Use Cases: Continuous + Categorical

7.4.1 Use Case 13: Salary by Education Level

df %>%
  group_by(education) %>%
  summarise(
    N = n(),
    Avg_Salary = mean(income),
    Median_Salary = median(income),
    SD_Salary = sd(income)
  )
# A tibble: 4 × 5
  education       N Avg_Salary Median_Salary SD_Salary
  <ord>       <int>      <dbl>         <dbl>     <dbl>
1 High School   118     52437.        52616     13706.
2 Bachelor      126     49377.        48692     15633.
3 Master        138     50618.        51324.    15219.
4 PhD           118     47357.        48304.    15704.
ggplot(df, aes(x = education, y = income, fill = education)) +
  geom_boxplot() +
  labs(title = "Income by Education Level",
       x = "Education", y = "Income") +
  theme(legend.position = "none")

7.4.2 Use Case 14: Test Scores by Gender

df %>%
  group_by(gender) %>%
  summarise(
    N = n(),
    Mean_Score = mean(exam_score),
    Median_Score = median(exam_score),
    SD_Score = sd(exam_score)
  )
# A tibble: 2 × 5
  gender     N Mean_Score Median_Score SD_Score
  <chr>  <int>      <dbl>        <dbl>    <dbl>
1 Female   260       73.9           74     11.5
2 Male     240       74.5           74     11.0

7.4.3 Use Case 15: Income by Region

df %>%
  group_by(region) %>%
  summarise(
    N = n(),
    Avg_Income = mean(income),
    Median_Income = median(income)
  ) %>%
  arrange(desc(Avg_Income))
# A tibble: 4 × 4
  region     N Avg_Income Median_Income
  <chr>  <int>      <dbl>         <dbl>
1 South    109     50717.        51193 
2 North    130     50087.        50042.
3 East     127     49815.        49555 
4 West     134     49377.        48600.
df %>%
  group_by(region) %>%
  summarise(avg_income = mean(income)) %>%
  ggplot(aes(x = reorder(region, -avg_income), y = avg_income, fill = region)) +
  geom_col() +
  labs(title = "Average Income by Region",
       x = "Region", y = "Average Income") +
  theme(legend.position = "none")

7.4.4 Use Case 16: Reaction Time by Age Group

df %>%
  group_by(age_group) %>%
  summarise(
    N = n(),
    Avg_Score = mean(exam_score)
  )
# A tibble: 3 × 3
  age_group     N Avg_Score
  <chr>     <int>     <dbl>
1 Middle      326      74.3
2 Older        40      72.2
3 Young       134      74.8

7.5 Use Cases: Categorical + Categorical

7.5.1 Use Case 17: Gender vs. Purchase Decision

df %>%
  count(gender, purchased) %>%
  group_by(gender) %>%
  mutate(percent = round(100 * n / sum(n), 1))
# A tibble: 4 × 4
# Groups:   gender [2]
  gender purchased     n percent
  <chr>  <chr>     <int>   <dbl>
1 Female No          157    60.4
2 Female Yes         103    39.6
3 Male   No          146    60.8
4 Male   Yes          94    39.2
table(df$gender, df$purchased)
        
          No Yes
  Female 157 103
  Male   146  94
ggplot(df, aes(x = gender, fill = purchased)) +
  geom_bar(position = "fill") +
  labs(title = "Purchase Rate by Gender",
       x = "Gender", y = "Proportion") +
  scale_y_continuous(labels = scales::percent_format())

7.5.2 Use Case 19: Education Level vs. Purchase Decision

df %>%
  count(education, purchased) %>%
  group_by(education) %>%
  mutate(percent = round(100 * n / sum(n), 1))
# A tibble: 8 × 4
# Groups:   education [4]
  education   purchased     n percent
  <ord>       <chr>     <int>   <dbl>
1 High School No           77    65.3
2 High School Yes          41    34.7
3 Bachelor    No           77    61.1
4 Bachelor    Yes          49    38.9
5 Master      No           84    60.9
6 Master      Yes          54    39.1
7 PhD         No           65    55.1
8 PhD         Yes          53    44.9

7.6 Use Cases: Multiple Variables (3+ Interactions)

7.6.1 Use Case 20: Salary by Education AND Experience

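df %>%
  # cut() with breaks = 3 splits experience into three equal-width bins,
  # so the groups can have very different sizes
  mutate(experience_group = cut(experience, breaks = 3, labels = c("Low", "Mid", "High"))) %>%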
  group_by(education, experience_group) %>%
  summarise(
    N = n(),
    Avg_Salary = mean(income),
    .groups = "drop"
  )
# A tibble: 12 × 4
   education   experience_group     N Avg_Salary
   <ord>       <fct>            <int>      <dbl>
 1 High School Low                 69     53798.
 2 High School Mid                 43     50348.
 3 High School High                 6     51748.
 4 Bachelor    Low                 78     49930.
 5 Bachelor    Mid                 43     48534.
 6 Bachelor    High                 5     48015.
 7 Master      Low                 82     50515.
 8 Master      Mid                 53     51001.
 9 Master      High                 3     46669.
10 PhD         Low                 74     50374.
11 PhD         Mid                 36     40449.
12 PhD         High                 8     50540.

7.6.2 Use Case 21: Sales by Region AND Blood Type

df %>%
  group_by(region, blood_type) %>%
  summarise(avg_income = mean(income), .groups = "drop") %>%
  pivot_wider(names_from = region, values_from = avg_income)
# A tibble: 4 × 5
  blood_type   East  North  South   West
  <chr>       <dbl>  <dbl>  <dbl>  <dbl>
1 A          50218. 48281. 47399. 49947.
2 AB         48359. 51781. 50473. 46441.
3 B          52307. 52916. 52832. 55790.
4 O          47803. 47904. 52299. 46066.

7.6.3 Use Case 22: Customer Behavior by Age, Gender, and Income

df %>%
  group_by(age_group, gender) %>%
  summarise(
    N = n(),
    Avg_Income = mean(income),
    Avg_Score = mean(exam_score),
    # mean() of a logical vector is the share of TRUE values
    Purchase_Rate = mean(purchased == "Yes"),
    .groups = "drop"
  )
# A tibble: 6 × 6
  age_group gender     N Avg_Income Avg_Score Purchase_Rate
  <chr>     <chr>  <int>      <dbl>     <dbl>         <dbl>
1 Middle    Female   167     48394.      74.0         0.407
2 Middle    Male     159     50475.      74.5         0.403
3 Older     Female    20     50189       74.6         0.4  
4 Older     Male      20     48418.      69.8         0.25 
5 Young     Female    73     52547.      73.5         0.370
6 Young     Male      61     50278.      76.3         0.410

8 Summary

Descriptive analysis is the foundation of every data project. In this document, we covered:

  • Basic statistics like mean, median, mode, variance, standard deviation, and quantiles.
  • Variable types (continuous, categorical, ordinal) and how they affect your choice of method.
  • Core R functions including summary(), mean(), median(), and dplyr’s group_by() and summarise().
  • Visualizations with ggplot2 — histograms, boxplots, bar charts, and scatter plots.
  • 22 real-world use cases spanning all combinations of variable types.

9 Common Mistakes in Descriptive Analysis

  1. Using the mean on heavily skewed data. Use the median for skewed data like income.
  2. Confusing correlation with causation. Descriptive analysis only describes; it does not prove cause.
  3. Forgetting to handle missing values. Use na.rm = TRUE when needed.
  4. Treating ordinal data as numeric. Use frequency tables for ordinal variables.
  5. Ignoring sample size. Always report n (the count).
  6. Choosing the wrong plot. Bar charts for categories, histograms for continuous data.
  7. Not checking for outliers. Always visualize with a boxplot first.
  8. Over-rounding. Round to 1 or 2 decimal places.
  9. Forgetting to label axes and titles. Use labs() in ggplot2.
  10. Stopping after one look. Descriptive analysis is iterative.
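
Mistakes 1 and 3 are easy to reproduce. The sketch below uses two small made-up vectors (illustrative values only) to show how one extreme value pulls the mean away from the median, and how a single NA affects mean() unless na.rm = TRUE is set.

```r
# Mistake 1: one very large income drags the mean far above the typical value
incomes <- c(30000, 32000, 35000, 38000, 40000, 1000000)
skewed_mean   <- mean(incomes)    # 195833.3
skewed_median <- median(incomes)  # 36500

# Mistake 3: a missing value makes the result NA unless it is removed
scores <- c(70, 85, NA, 90)
mean_with_na    <- mean(scores)                # NA
mean_without_na <- mean(scores, na.rm = TRUE)  # 81.66667

skewed_mean
skewed_median
mean_with_na
mean_without_na
```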

10 Beginner’s Checklist

11 Final Words

You have now seen how to perform descriptive analysis in R from the ground up. The best way to learn is by doing. Open RStudio, load this file, and click Render. The more you experiment, the more comfortable you will become.

Good luck, and happy analyzing!