Introduction
This document demonstrates key principles of Exploratory Data Analysis (EDA) using R, guided by systematic detective frameworks. We’ll explore various techniques while following best practices for reproducibility and efficient data handling.
EDA: A Creative Inquiry with Structure
Exploratory Data Analysis is not just a step in data science but a creative process that benefits from systematic frameworks. Like a detective solving a case, we need both creativity and methodology.
We’ll use two complementary frameworks throughout this analysis:
READY Framework: Strategic planning before investigation
SCAN Framework: Tactical execution during investigation
The Detective Frameworks
READY Framework - Your Strategic Case File
Apply BEFORE you start investigating data
Representative Data: Do we have all the data we need?
Executive Driven Questions: What is the main question we’re trying to solve?
Analytical Frameworks: What is our plan of action?
Data Best Practices: What are the rules for keeping our data clean?
Your Insight, Your Impact: What did we learn, and how will it help?
Group Activity: Work with your neighbor to brainstorm questions for each READY component. We’ll share a few from each group.
R - Representative Data: What don’t we know about our dataset?
Your group’s questions:
# Add your group's R questions as comments:
# Where is the data from?
# Are we looking at numerical or categorical data?
E - Executive Questions: What questions do stakeholders have that we can’t answer yet?
Your group’s questions:
# Add your group's E questions as comments:
# How many diamonds are there in total?
# Which diamonds are most popular?
A - Analytical Framework: What’s our exploration strategy?
Your group’s questions:
# Add your group's A questions as comments:
# Look at specific variables to understand them better (cut, clarity, etc.)
# What numeric variables are we working with? (cost, size, hardness)
# What are the categorical variables? (cut, shape, type, where they're from)
# Where do we start?
D - Data Best Practices: What data quality unknowns should we check?
Your group’s questions:
# Add your group's D questions as comments:
# What is the type of each variable?
# Are there null or NA values?
Y - Your Insights: What story might emerge from our investigation?
Your group’s questions:
# Add your group's Y questions as comments:
# We might find trends across the cuts of diamonds
# Possible correlations between variables, specifically size and price
Class Discussion: Each group shares one interesting question from any READY category.
SCAN Framework - Your Field Investigation Guide
Apply WHILE you’re exploring data
Stakeholder Goals: Who wants the answer to this question?
Columns and Coverage: What data do we have, and is any of it missing?
Aggregates and Anomalies: What are the overall trends, and are there any strange outliers?
Notable Segments: Can we break the data into smaller groups to find hidden clues?
Loading Required Libraries
# Function to check and install required packages
required_packages <- c("tidyverse", "corrplot")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load all packages at once
invisible(lapply(required_packages, library, character.only = TRUE))
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tibble' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'corrplot' was built under R version 4.3.3
corrplot 0.95 loaded
# Loading packages individually
# library(tidyverse) # Includes dplyr, ggplot2, tidyr
# library(corrplot)  # For correlation visualization
🌟 Meet Your Investigation Dataset: diamonds
We’ll use the built-in diamonds dataset to demonstrate EDA techniques. While this isn’t “big data”, the principles we’ll demonstrate scale to larger datasets.
🤔 Detective Rule #1: If you know nothing about diamonds, what’s the FIRST thing you should do?
data("diamonds")
diamonds_df <- as_tibble(diamonds)

# First look at the structure of our evidence
diamonds_df |>
  head() |>
  glimpse()
Rows: 6
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good
$ color <ord> E, E, E, I, J, J
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8
$ table <dbl> 55, 61, 65, 58, 58, 57
$ price <int> 326, 326, 327, 334, 335, 336
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48
🔍 Investigation Exercise:
Without using any external resources, what can you discover about this dataset?
💡Think about….
How many observations and variables are there? 53,940 observations and 10 variables.
What types of data are present? Ordered factors (ord: categories with a ranking), doubles (dbl: numbers with decimals), and integers (int: whole numbers).
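The answers above can be checked directly in R. This is a minimal sketch, assuming the `diamonds` dataset comes from the ggplot2 package; the helper name `col_classes` is only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Dimensions: number of rows (observations) and columns (variables)
dim(diamonds)

# First class of each column; ordered factors report "ordered"
col_classes <- sapply(diamonds, function(col) class(col)[1])
col_classes

# How many columns of each type are there?
table(col_classes)
```

Note that R reports double columns as class "numeric", while `glimpse()` labels them `<dbl>`.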
Before we touch the data, let’s plan our investigation systematically:
R - Representative Data: Do we have what we need?
# Check the size and scope of our dataset
nrow(diamonds) # How many diamonds do we have?
ncol(diamonds) # How many characteristics per diamond?
What time period does this represent? What geography? What market segment?
Detective Assessment:
E - Executive Driven Questions: What do stakeholders want to know?
Who might want diamond insights and what questions would they ask?
Potential Stakeholders & Their Questions:
Jewelry buyers: “How do I get the best value for my budget?”
Jewelry retailers: “What inventory should I stock?”
Diamond investors: “Which characteristics drive the highest prices?”
EDA students: “How do I systematically investigate a new dataset?”
Our Primary Investigation Question: “What patterns exist in diamond characteristics and pricing that could inform purchasing decisions?”
A - Analytical Framework: What’s our investigation plan?
Our Systematic Investigation Strategy:
Understand the domain (diamond characteristics and market)
Explore data structure and quality (missing values, ranges, distributions)
Investigate individual variables (what does each characteristic tell us?)
Examine relationships (how do characteristics relate to price?)
Segment analysis (do patterns differ across groups?)
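Step 2 of the strategy (structure and quality) could be sketched as follows. This assumes dplyr and ggplot2 are installed; `missing_counts` is an illustrative name, not something defined elsewhere in this document.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count missing values in every column (one row, one count per column)
missing_counts <- diamonds |>
  summarise(across(everything(), ~ sum(is.na(.x))))
missing_counts

# Ranges and quartiles for the numeric columns only
diamonds |>
  select(where(is.numeric)) |>
  summary()
```

A minimum of 0 in `x`, `y`, or `z`, or a maximum far above the third quartile, is the kind of clue worth following up during the SCAN phase.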
Detective Insight: Now that we have our strategic plan AND understand our evidence, we can start systematic investigation using SCAN!
🧠 The dplyr Mindset: Think in Verbs
💡 Traditional R vs. dplyr Approach
Traditional R thinking:
# Multiple objects, hard to follow
subset_data <- subset(starwars, species == "Human")
selected_data <- subset_data[, c("name", "height", "mass")]
ordered_data <- selected_data[order(selected_data$height, decreasing = TRUE), ]
subset_data
# A tibble: 35 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Darth V… 202 136 none white yellow 41.9 male mascu…
3 Leia Or… 150 49 brown light brown 19 fema… femin…
4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
5 Beru Wh… 165 75 brown light blue 47 fema… femin…
6 Biggs D… 183 84 black light brown 24 male mascu…
7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anakin … 188 84 blond fair blue 41.9 male mascu…
9 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
10 Han Solo 180 80 brown fair brown 29 male mascu…
# ℹ 25 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
dplyr thinking (the tidyverse way):
# One pipeline, easy to read
starwars |>
  filter(species == "Human") |>
  select(name, height, mass) |>
  arrange(desc(height))
# Filter for flawless diamonds (clarity == "IF")
diamonds |>
  filter(clarity == "IF")
# A tibble: 1,790 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.52 Ideal F IF 62.2 55 2783 5.14 5.18 3.21
2 0.55 Ideal G IF 60.9 57 2789 5.28 5.3 3.22
3 0.64 Ideal G IF 61.3 56 2790 5.54 5.58 3.41
4 0.72 Premium I IF 63 57 2795 5.72 5.7 3.6
5 0.6 Very Good G IF 61.6 56 2800 5.43 5.46 3.35
6 0.61 Ideal G IF 62.3 56 2800 5.43 5.45 3.39
7 0.53 Ideal F IF 61.9 54 2802 5.22 5.25 3.24
8 0.56 Very Good E IF 61 59 2833 5.28 5.34 3.24
9 0.58 Very Good E IF 60.6 59 2852 5.37 5.43 3.27
10 0.62 Fair F IF 60.1 61 2861 5.53 5.56 3.33
# ℹ 1,780 more rows
Filter by diamonds under $500
# Filter by diamonds under $500
diamonds |>
  filter(price < 500)
# A tibble: 1,729 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 1,719 more rows
👉 Your Turn: Advanced Detective Work OPTIONAL⤵
Investigation Questions:
How many diamonds are both over 1 carat AND under $5000? (Potential good deals?)
diamonds |>
  filter(carat > 1) |>
  filter(price < 5000)
# A tibble: 3,756 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.17 Very Good J I1 60.2 61 2774 6.83 6.9 4.13
2 1.01 Premium F I1 61.8 60 2781 6.39 6.36 3.94
3 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
4 1.01 Premium H SI2 62.7 59 2788 6.31 6.22 3.93
5 1.05 Very Good J SI2 63.2 56 2789 6.49 6.45 4.09
6 1.05 Fair J SI2 65.8 59 2789 6.41 6.27 4.18
7 1.01 Fair E SI2 67.4 60 2797 6.19 6.05 4.13
8 1.04 Premium G I1 62.2 58 2801 6.46 6.41 4
9 1.2 Fair F I1 64.6 56 2809 6.73 6.66 4.33
10 1.02 Premium G I1 60.3 58 2815 6.55 6.5 3.94
# ℹ 3,746 more rows
Find diamonds that are either “Ideal” cut OR “Premium” cut
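# Find diamonds that are either "Ideal" cut OR "Premium" cut
# Level names are case-sensitive: cut == "ideal" matches no rows, which leaves
# only the 13,791 Premium diamonds (the result printed below). Use "Ideal" to include both cuts.
diamonds |>
  filter(cut == "Ideal" | cut == "Premium")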
# A tibble: 13,791 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
3 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
4 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
5 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
6 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
7 0.29 Premium F SI1 62.4 58 403 4.24 4.26 2.65
8 0.22 Premium E VS2 61.6 58 404 3.93 3.89 2.41
9 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
10 0.3 Premium J SI2 59.3 61 405 4.43 4.38 2.61
# ℹ 13,781 more rows
Investigate diamonds with missing depth measurements
#ADD CODE BELOW
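One possible way to fill in this exercise, offered as a sketch rather than the expected answer: missing values have to be tested with `is.na()`, because a comparison such as `depth == NA` evaluates to `NA` and never keeps a row. The name `missing_depth` is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Keep only the rows where the depth measurement is missing
missing_depth <- diamonds |>
  filter(is.na(depth))

# How many diamonds lack a depth measurement?
nrow(missing_depth)
```

If the count is zero, that is still a finding: it tells us depth needs no imputation or row removal before analysis.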
Scan Anomalies
#ADD CODE BELOW
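A hedged sketch of one anomaly scan, assuming that "anomaly" here means physically implausible measurements. The choice of checks (zero dimensions, unusually large `y` values) is an assumption for illustration, and `zero_dims` is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# A real diamond cannot have a length, width, or depth of zero
zero_dims <- diamonds |>
  filter(x == 0 | y == 0 | z == 0)
nrow(zero_dims)

# Sort by the largest y measurement to surface suspicious extremes
diamonds |>
  arrange(desc(y)) |>
  select(carat, price, x, y, z) |>
  head(5)
```

Rows flagged this way are candidates for closer inspection, not automatic deletion; the decision depends on the question being asked.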
🔧 select(): Focus Your Investigation
Detective Skill: Choose which evidence to examine closely.
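The chunks that produced the tables in this section are not shown. As an assumption about what they may have looked like, the calls below give tables with the same columns: two five-column selections and one four-column selection. The object names are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Select columns by name
by_name <- diamonds |>
  select(carat, cut, color, clarity, price)

# Select a range of adjacent columns plus one extra column
by_range <- diamonds |>
  select(carat:clarity, price)

# Select by exclusion: drop the columns you do not need
by_exclusion <- diamonds |>
  select(-c(depth, table, price, x, y, z))

by_name
by_range
by_exclusion
```

The first two calls return the same columns in the same order, which shows that `select()` offers several routes to one result.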
# A tibble: 53,940 × 5
carat cut color clarity price
<dbl> <ord> <ord> <ord> <int>
1 0.23 Ideal E SI2 326
2 0.21 Premium E SI1 326
3 0.23 Good E VS1 327
4 0.29 Premium I VS2 334
5 0.31 Good J SI2 335
6 0.24 Very Good J VVS2 336
7 0.24 Very Good I VVS1 336
8 0.26 Very Good H SI1 337
9 0.22 Fair E VS2 337
10 0.23 Very Good H VS1 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 5
carat cut color clarity price
<dbl> <ord> <ord> <ord> <int>
1 0.23 Ideal E SI2 326
2 0.21 Premium E SI1 326
3 0.23 Good E VS1 327
4 0.29 Premium I VS2 334
5 0.31 Good J SI2 335
6 0.24 Very Good J VVS2 336
7 0.24 Very Good I VVS1 336
8 0.26 Very Good H SI1 337
9 0.22 Fair E VS2 337
10 0.23 Very Good H VS1 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# ℹ 53,930 more rows
Select size and price
# Select size and price
diamonds |>
  select(carat, price)
# Select all colors (hint: use the contains() function)
diamonds |>
  select(color)
# A tibble: 53,940 × 1
color
<ord>
1 E
2 E
3 E
4 I
5 J
6 J
7 I
8 H
9 E
10 H
# ℹ 53,930 more rows
# contains("color") selects every column whose name includes "color"; here only the color column matches
diamonds |>
  select(contains("color"))
# A tibble: 53,940 × 1
color
<ord>
1 E
2 E
3 E
4 I
5 J
6 J
7 I
8 H
9 E
10 H
# ℹ 53,930 more rows
Select what buyers care about and inspect first 5 rows
# Select what buyers care about and inspect first 5 rows
# head() returns 6 rows by default, so ask for 5 explicitly
diamonds |>
  select(cut, price, carat) |>
  head(5)
# A tibble: 5 × 3
cut price carat
<ord> <int> <dbl>
1 Ideal 326 0.23
2 Premium 326 0.21
3 Good 327 0.23
4 Premium 334 0.29
5 Good 335 0.31
📏 arrange(): Organize Your Evidence
Detective Skill: Sort clues to reveal patterns.
# EXAMPLE: Find diamonds with unusual price patterns
diamonds |>
  arrange(desc(carat), price) |> # Large diamonds, but sort by price within size
  head()
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
2 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
3 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43
4 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17
5 4.01 Premium J I1 62.5 62 15223 10.0 9.94 6.24
6 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31
# EXAMPLE: Look for potential data anomalies by sorting multiple ways
diamonds |>
  arrange(price, desc(carat)) |> # Cheapest first, but largest within price
  head()
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
👉 Your Turn: Pattern Recognition ⤵
Investigation Tasks:
Arrange by most expensive diamonds, inspect first 5 rows
# Arrange by most expensive diamonds, inspect first 5 rows
# head() returns 6 rows by default, so ask for 5 explicitly
diamonds |>
  arrange(desc(price)) |>
  head(5)
# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
Arrange by the largest diamonds by carat, inspect first 10 rows
# Arrange by the largest diamonds by carat, inspect first 10 rows
diamonds |>
  arrange(desc(carat)) |>
  head(10)
# A tibble: 6 × 3
cut color quality_rating
<ord> <ord> <chr>
1 Ideal E Premium
2 Premium E Good
3 Good E Standard
4 Premium I Good
5 Good J Standard
6 Very Good J Good
# A tibble: 3 × 4
size_group diamond_count avg_price price_range
<chr> <int> <dbl> <int>
1 Large 19060 8142 17561
2 Medium 17206 2496 9052
3 Small 17674 792 2351
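The code behind the size-group summary above is not shown. The sketch below is one pipeline that yields a table with the same four columns. The carat cut-offs (0.5 and 1) and the rounding of `avg_price` are assumptions, so the numbers are not guaranteed to match the table above.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

size_summary <- diamonds |>
  # Assumed cut-offs: under 0.5 carat is Small, under 1 carat is Medium
  mutate(size_group = case_when(
    carat < 0.5 ~ "Small",
    carat < 1 ~ "Medium",
    TRUE ~ "Large"
  )) |>
  group_by(size_group) |>
  summarise(
    diamond_count = n(),                    # diamonds in the group
    avg_price = round(mean(price)),         # mean price, rounded to whole dollars
    price_range = max(price) - min(price),  # spread between cheapest and priciest
    .groups = "drop"
  )

size_summary
```

The pattern is `mutate()` to create a grouping variable, `group_by()` to split on it, and `summarise()` to collapse each group to one row.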
# EXAMPLE: Look for unexpected patterns across multiple characteristics
diamonds |>
  group_by(cut, color) |>
  summarise(
    available = n(),
    typical_price = median(price),
    .groups = "drop"
  ) |>
  filter(available >= 50) |> # Only combinations with decent sample size
  arrange(desc(typical_price))
# A tibble: 35 × 4
cut color available typical_price
<ord> <ord> <int> <dbl>
1 Premium J 808 5063
2 Premium I 1428 4640
3 Premium H 2360 4511
4 Very Good J 678 4113
5 Ideal J 896 4096
6 Very Good I 1204 3888
7 Fair H 303 3816
8 Very Good H 1824 3734
9 Good J 307 3733
10 Fair D 163 3730
# ℹ 25 more rows
👉 Your Turn: Pattern Analysis ⤵
Investigation Questions:
What’s the average price for each cut quality?
How many diamonds are in each color grade?
What’s the price range (min/max) for each clarity level?
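One way to approach the three questions, offered as a sketch rather than a model answer. It assumes dplyr and ggplot2 are installed; the result names and the output column names (`avg_price`, `diamond_count`, `min_price`, `max_price`) are illustrative choices.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# 1. Average price for each cut quality
avg_price_by_cut <- diamonds |>
  group_by(cut) |>
  summarise(avg_price = mean(price), .groups = "drop")

# 2. Number of diamonds in each color grade
count_by_color <- diamonds |>
  count(color, name = "diamond_count")

# 3. Price range (min and max) for each clarity level
price_range_by_clarity <- diamonds |>
  group_by(clarity) |>
  summarise(
    min_price = min(price),
    max_price = max(price),
    .groups = "drop"
  )

avg_price_by_cut
count_by_color
price_range_by_clarity
```

`count()` is shorthand for `group_by()` followed by `summarise(n = n())`, so question 2 needs only a single verb.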