This document demonstrates key principles of Exploratory Data Analysis (EDA) using R, guided by systematic detective frameworks. We’ll explore various techniques while following best practices for reproducibility and efficient data handling.
EDA: A Creative Inquiry with Structure
Exploratory Data Analysis is not just a step in data science but a creative process that benefits from systematic frameworks. Like a detective solving a case, we need both creativity and methodology.
We’ll use two complementary frameworks throughout this analysis:
READY Framework: Strategic planning before investigation
SCAN Framework: Tactical execution during investigation
The Detective Frameworks
READY Framework - Your Strategic Case File
Apply BEFORE you start investigating data
Representative Data: Do we have all the data we need?
Executive Driven Questions: What is the main question we’re trying to solve?
Analytical Frameworks: What is our plan of action?
Data Best Practices: What are the rules for keeping our data clean?
Your Insight, Your Impact: What did we learn, and how will it help?
Group Activity: Work with your neighbor to brainstorm questions for each READY component. We’ll share a few from each group.
R - Representative Data: What don’t we know about our dataset?
Your group’s questions:
# Add your group's R questions as comments:
# Where is the data from?
# Are we looking at numerical or qualitative data?
E - Executive Questions: What questions do stakeholders have that we can’t answer yet?
Your group’s questions:
# Add your group's E questions as comments:
# How many diamonds are there in total?
# What diamonds are the most popular?
A - Analytical Framework: What’s our exploration strategy?
Your group’s questions:
# Add your group's A questions as comments:
# Looking at the specific variables to understand them better, like cut, clarity, etc.
# What's the plan, where do we start?
D - Data Best Practices: What data quality unknowns should we check?
Your group’s questions:
# Add your group's D questions as comments:
# What are the variable types?
# Are there NA (missing) values?
Y - Your Insights: What story might emerge from our investigation?
Your group’s questions:
# Add your group's Y questions as comments:
# Maybe find out about the trends of cuts of diamonds.
# Correlations between sizes and prices.
Class Discussion: Each group shares one interesting question from any READY category.
SCAN Framework - Your Field Investigation Guide
Apply WHILE you’re exploring data
Stakeholder Goals: Who wants the answer to this question?
Columns and Coverage: What data do we have, and is any of it missing?
Aggregates and Anomalies: What are the overall trends, and are there any strange outliers?
Notable Segments: Can we break the data into smaller groups to find hidden clues?
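As a rough illustration of the Columns and Coverage step, the sketch below lists each column's type and counts missing values per column. It assumes the diamonds dataset that ships with ggplot2; the names column_types and missing_per_column are made up for this example.

```r
library(ggplot2)  # provides the diamonds dataset

# Type of every column (ordered factors report two classes, so keep the first)
column_types <- vapply(diamonds, function(col) class(col)[1], character(1))
print(column_types)

# Number of missing (NA) values in every column
missing_per_column <- colSums(is.na(diamonds))
print(missing_per_column)
```

A count of zero NAs only says that no value is formally missing; value ranges are still worth inspecting under Aggregates and Anomalies.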
Loading Required Libraries
# Function to check and install required packages
required_packages <- c("tidyverse", "corrplot")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load all packages at once
invisible(lapply(required_packages, library, character.only = TRUE))
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
corrplot 0.95 loaded
# Loading packages individually
# library(tidyverse) # Includes dplyr, ggplot2, tidyr
# library(corrplot)  # For correlation visualization
🌟 Meet Your Investigation Dataset: diamonds
We’ll use the built-in diamonds dataset to demonstrate EDA techniques. While this isn’t “big data”, the principles we’ll demonstrate scale to larger datasets.
🤔 Detective Rule #1: If you know nothing about diamonds, what’s the FIRST thing you should do?
data("diamonds")
diamonds_df <- as_tibble(diamonds)

# First look at the structure of our evidence
diamonds_df |>
  head() |>
  glimpse()
Rows: 6
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good
$ color <ord> E, E, E, I, J, J
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8
$ table <dbl> 55, 61, 65, 58, 58, 57
$ price <int> 326, 326, 327, 334, 335, 336
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48
🔍 Investigation Exercise:
Without using any external resources, what can you discover about this dataset?
💡Think about….
How many observations and variables are there? 10 variables and 53940 observations.
What types of data are present? Doubles (numbers with decimals), ordered factors (categories with a ranking), and integers (whole numbers).
Before we touch the data, let’s plan our investigation systematically:
R - Representative Data: Do we have what we need?
# Check the size and scope of our dataset
nrow(diamonds)  # How many diamonds do we have?
ncol(diamonds)  # How many characteristics per diamond?
What time period does this represent? What geography? What market segment?
Detective Assessment:
E - Executive Driven Questions: What do stakeholders want to know?
Who might want diamond insights and what questions would they ask?
Potential Stakeholders & Their Questions:
Jewelry buyers: “How do I get the best value for my budget?”
Jewelry retailers: “What inventory should I stock?”
Diamond investors: “Which characteristics drive the highest prices?”
EDA students: “How do I systematically investigate a new dataset?”
Our Primary Investigation Question: “What patterns exist in diamond characteristics and pricing that could inform purchasing decisions?”
A - Analytical Framework: What’s our investigation plan?
Our Systematic Investigation Strategy:
Understand the domain (diamond characteristics and market)
Explore data structure and quality (missing values, ranges, distributions)
Investigate individual variables (what does each characteristic tell us?)
Examine relationships (how do characteristics relate to price?)
Segment analysis (do patterns differ across groups?)
# A tibble: 1,790 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.52 Ideal F IF 62.2 55 2783 5.14 5.18 3.21
2 0.55 Ideal G IF 60.9 57 2789 5.28 5.3 3.22
3 0.64 Ideal G IF 61.3 56 2790 5.54 5.58 3.41
4 0.72 Premium I IF 63 57 2795 5.72 5.7 3.6
5 0.6 Very Good G IF 61.6 56 2800 5.43 5.46 3.35
6 0.61 Ideal G IF 62.3 56 2800 5.43 5.45 3.39
7 0.53 Ideal F IF 61.9 54 2802 5.22 5.25 3.24
8 0.56 Very Good E IF 61 59 2833 5.28 5.34 3.24
9 0.58 Very Good E IF 60.6 59 2852 5.37 5.43 3.27
10 0.62 Fair F IF 60.1 61 2861 5.53 5.56 3.33
# ℹ 1,780 more rows
So no clarity level is named “Flawless” in this dataset. The top grade is IF (internally flawless), and 1,790 diamonds have it.
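The observation above can be checked directly. The sketch below lists the levels of clarity and counts the rows in the IF grade; it assumes the preceding table came from a filter on clarity == "IF", and the names clarity_levels and if_diamonds are invented for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# clarity is an ordered factor; list its levels from lowest to highest
clarity_levels <- levels(diamonds$clarity)
print(clarity_levels)

# Keep only the diamonds in the top grade, "IF"
if_diamonds <- diamonds |>
  filter(clarity == "IF")

nrow(if_diamonds)
```

Checking the levels first avoids filtering on a label that does not exist, which returns zero rows without any warning.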
Filter for diamonds under $500
#ADD CODE BELOW
diamonds |>
  filter(price < 500)
# A tibble: 1,729 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 1,719 more rows
1,729 diamonds are under $500.
👉 Your Turn: Advanced Detective Work OPTIONAL⤵
Investigation Questions:
How many diamonds are both over 1 carat AND under $5000? (Potential good deals?)
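One possible way to approach this question is to pass both conditions to a single filter() call, where comma-separated conditions are combined with AND. This is a sketch, not the only solution; good_deals is a name chosen for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Both conditions must hold: larger than 1 carat and cheaper than $5000
good_deals <- diamonds |>
  filter(carat > 1, price < 5000)

# Number of diamonds meeting both conditions
nrow(good_deals)
```

Writing filter(carat > 1 & price < 5000) should give the same rows; the comma form is simply easier to extend with further conditions.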
# A tibble: 53,940 × 5
carat cut color clarity price
<dbl> <ord> <ord> <ord> <int>
1 0.23 Ideal E SI2 326
2 0.21 Premium E SI1 326
3 0.23 Good E VS1 327
4 0.29 Premium I VS2 334
5 0.31 Good J SI2 335
6 0.24 Very Good J VVS2 336
7 0.24 Very Good I VVS1 336
8 0.26 Very Good H SI1 337
9 0.22 Fair E VS2 337
10 0.23 Very Good H VS1 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# ℹ 53,930 more rows
# A tibble: 5 × 5
price carat cut color clarity
<int> <dbl> <ord> <ord> <ord>
1 326 0.23 Ideal E SI2
2 326 0.21 Premium E SI1
3 327 0.23 Good E VS1
4 334 0.29 Premium I VS2
5 335 0.31 Good J SI2
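The column subsets shown above are the kind of result select() produces. The sketch below shows three common ways to pick columns that would yield tables of those shapes; the exact calls used originally are not shown, so these are assumptions, and by_name, by_range, and price_first are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Pick columns by name
by_name <- diamonds |>
  select(carat, cut, color, clarity, price)

# Pick a run of adjacent columns with the range operator
by_range <- diamonds |>
  select(carat:clarity)

# Reorder columns and keep only the first five rows
price_first <- diamonds |>
  select(price, carat, cut, color, clarity) |>
  head(5)

price_first
```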
📏 arrange(): Organize Your Evidence
Detective Skill: Sort clues to reveal patterns.
# EXAMPLE: Find diamonds with unusual price patterns
diamonds |>
  arrange(desc(carat), price) |>  # Large diamonds, but sort by price within size
  head()
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
2 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
3 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43
4 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17
5 4.01 Premium J I1 62.5 62 15223 10.0 9.94 6.24
6 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31
# EXAMPLE: Look for potential data anomalies by sorting multiple ways
diamonds |>
  arrange(price, desc(carat)) |>  # Cheapest first, but largest within price
  head()
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
👉 Your Turn: Pattern Recognition ⤵
Investigation Tasks:
Arrange by most expensive diamonds, inspect first 5 rows
Detective insight - cheapest diamonds (potential data issues?)
#ADD CODE BELOW
diamonds |>
  arrange(price)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# A potential issue is that there is too much to look at.
# Just looking at the first few rows, you can't draw a
# conclusion about what makes a diamond cheaper.
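The first task, sorting by the most expensive diamonds and inspecting five rows, is not covered by the code above. A minimal sketch, assuming desc() inside arrange() and head(5) are acceptable for the exercise (most_expensive is a name chosen here):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Sort from highest to lowest price, then keep the top five rows
most_expensive <- diamonds |>
  arrange(desc(price)) |>
  head(5)

most_expensive
```

Limiting the output with head() also addresses the "too much to look at" concern for the cheapest diamonds.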
➕ mutate(): Create New Investigation Angles
Detective Skill: Derive new insights from existing evidence.
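The table that follows appears to come from a mutate() call that maps cut to a coarser label. Here is a minimal sketch that reproduces those six rows, assuming this particular mapping. The label for Fair is a guess, since no Fair diamond appears in that output, and rated is a name invented for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

rated <- diamonds |>
  mutate(quality_rating = case_when(
    cut == "Ideal" ~ "Premium",
    cut %in% c("Premium", "Very Good") ~ "Good",
    TRUE ~ "Standard"  # assumption: Good and Fair both fall here
  )) |>
  select(cut, color, quality_rating)

head(rated)
```

case_when() checks the conditions in order and uses the first one that matches, so the catch-all TRUE branch belongs last.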
# A tibble: 6 × 3
cut color quality_rating
<ord> <ord> <chr>
1 Ideal E Premium
2 Premium E Good
3 Good E Standard
4 Premium I Good
5 Good J Standard
6 Very Good J Good
# A tibble: 53,940 × 2
Sizes carat
<chr> <dbl>
1 Small 0.23
2 Small 0.21
3 Small 0.23
4 Small 0.29
5 Small 0.31
6 Small 0.24
7 Small 0.24
8 Small 0.26
9 Small 0.22
10 Small 0.23
# ℹ 53,930 more rows
# 3. Determine if a diamond is premium.
diamond_median <- median(diamonds$price)

diamonds |>
  mutate(premium_status = case_when(
    price <= diamond_median ~ "Not Premium",
    price > diamond_median ~ "Premium"
  )) |>
  select(price, premium_status)
# A tibble: 53,940 × 2
price premium_status
<int> <chr>
1 326 Not Premium
2 326 Not Premium
3 327 Not Premium
4 334 Not Premium
5 335 Not Premium
6 336 Not Premium
7 336 Not Premium
8 337 Not Premium
9 337 Not Premium
10 338 Not Premium
# ℹ 53,930 more rows
# A tibble: 3 × 4
size_group diamond_count avg_price price_range
<chr> <int> <dbl> <int>
1 Large 19060 8142 17561
2 Medium 17206 2496 9052
3 Small 17674 792 2351
# EXAMPLE: Look for unexpected patterns across multiple characteristics
diamonds |>
  group_by(cut, color) |>
  summarise(
    available = n(),
    typical_price = median(price),
    .groups = "drop"
  ) |>
  filter(available >= 50) |>  # Only combinations with decent sample size
  arrange(desc(typical_price))
# A tibble: 35 × 4
cut color available typical_price
<ord> <ord> <int> <dbl>
1 Premium J 808 5063
2 Premium I 1428 4640
3 Premium H 2360 4511
4 Very Good J 678 4113
5 Ideal J 896 4096
6 Very Good I 1204 3888
7 Fair H 303 3816
8 Very Good H 1824 3734
9 Good J 307 3733
10 Fair D 163 3730
# ℹ 25 more rows
👉 Your Turn: Pattern Analysis ⤵
Investigation Questions:
What’s the average price for each cut quality?
How many diamonds are in each color grade?
What’s the price range (min/max) for each clarity level?
# 1. Average price for each cut quality
diamonds |>
  group_by(cut) |>
  summarise(average_price = mean(price))
# A tibble: 5 × 2
cut average_price
<ord> <dbl>
1 Fair 4359.
2 Good 3929.
3 Very Good 3982.
4 Premium 4584.
5 Ideal 3458.
# 2. How many diamonds in each color category?
diamonds |>
  group_by(color) |>
  summarise(counts = n())
# A tibble: 7 × 2
color counts
<ord> <int>
1 D 6775
2 E 9797
3 F 9542
4 G 11292
5 H 8304
6 I 5422
7 J 2808