This document demonstrates key principles of Exploratory Data Analysis (EDA) using R, guided by systematic detective frameworks. We’ll explore various techniques while following best practices for reproducibility and efficient data handling.
EDA: A Creative Inquiry with Structure
Exploratory Data Analysis is not just a step in data science but a creative process that benefits from systematic frameworks. Like a detective solving a case, we need both creativity and methodology.
We’ll use two complementary frameworks throughout this analysis:
READY Framework: Strategic planning before investigation
SCAN Framework: Tactical execution during investigation
The Detective Frameworks
READY Framework - Your Strategic Case File
Apply BEFORE you start investigating data
Representative Data: Do we have all the data we need?
Executive Driven Questions: What is the main question we’re trying to solve?
Analytical Frameworks: What is our plan of action?
Data Best Practices: What are the rules for keeping our data clean?
Your Insight, Your Impact: What did we learn, and how will it help?
Group Activity: Work with your neighbor to brainstorm questions for each READY component. We’ll share a few from each group.
R - Representative Data: What don’t we know about our dataset?
Your group’s questions:
# Add your group's R questions as comments:
# 1.Where does the data come from,(company or research firm)?
# 2.Are we looking for numerical or categorical data?
E - Executive Questions: What questions do stakeholders have that we can’t answer yet?
Your group’s questions:
# Add your group's E questions as comments:
# 1.How many total diamonds are there in all?
# 2.Which diamonds are most popular? (to know what to sell)
A - Analytical Framework: What’s our exploration strategy?
Your group’s questions:
# Add your group's A questions as comments:
# 1.How many types of columns or variables are present in the data to account for?
# 2.Looking at specifc variables to understand them better (cut, clarity,etc.)
# 3.Whats the plan, where to start to analyze it?
D - Data Best Practices: What data quality unknowns should we check?
Your group’s questions:
# Add your group's D questions as comments:
# 1.What are the types of each variable present?
# 2.Are there any null, invalid, or N/A values?
Y - Your Insights: What story might emerge from our investigation?
Your group’s questions:
# Add your group's Y questions as comments (Bracketing):
# 1.Are there any trends related to certain characteristics of diamonds?
# 2.Are there possible correlations between values, such as size and price?
Class Discussion: Each group shares one interesting question from any READY category.
SCAN Framework - Your Field Investigation Guide
Apply WHILE you’re exploring data
Stakeholder Goals: Who wants the answer to this question?
Columns and Coverage: What data do we have, and is any of it missing?
Aggregates and Anomalies: What are the overall trends, and are there any strange outliers?
Notable Segments: Can we break the data into smaller groups to find hidden clues?
Loading Required Libraries
# Function to check and install required packagesrequired_packages <-c("tidyverse", "corrplot")# Install missing packagesfor (pkg in required_packages) {if (!requireNamespace(pkg, quietly =TRUE)) {install.packages(pkg) }}# Load all packages at onceinvisible(lapply(required_packages, library, character.only =TRUE))
Warning: package 'tidyverse' was built under R version 4.5.1
Warning: package 'ggplot2' was built under R version 4.5.1
Warning: package 'tibble' was built under R version 4.5.1
Warning: package 'tidyr' was built under R version 4.5.1
Warning: package 'readr' was built under R version 4.5.1
Warning: package 'purrr' was built under R version 4.5.1
Warning: package 'dplyr' was built under R version 4.5.1
Warning: package 'stringr' was built under R version 4.5.1
Warning: package 'forcats' was built under R version 4.5.1
Warning: package 'lubridate' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'corrplot' was built under R version 4.5.1
corrplot 0.95 loaded
# Loading packages individually#library(tidyverse) # Includes dplyr, ggplot2, tidyr#library(corrplot) # For correlation visualization
🌟 Meet Your Investigation Dataset: diamonds
We’ll use the built-in diamonds dataset to demonstrate EDA techniques. While this isn’t “big data”, the principles we’ll demonstrate scale to larger datasets.
🤔 Detective Rule #1: If you know nothing about diamonds, what’s the FIRST thing you should do?
data("diamonds") # creating new objectdiamonds_df <-as_tibble(diamonds)# First look at the structure of our evidencediamonds_df |># |> pipe operatorhead() |>glimpse()
Rows: 6
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good
$ color <ord> E, E, E, I, J, J
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8
$ table <dbl> 55, 61, 65, 58, 58, 57
$ price <int> 326, 326, 327, 334, 335, 336
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48
‘’diamonds’ Dataset Dictionary
price price in US dollars (\$326–\$18,823)
carat weight of the diamond (0.2–5.01)
cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color diamond colour, from J (worst) to D (best)
clarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x length in mm (0–10.74)
y width in mm (0–58.9)
z depth in mm (0–31.8)
depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table width of top of diamond relative to widest point (43–95)
Before we touch the data, let’s plan our investigation systematically:
R - Representative Data: Do we have what we need?
1.Check the size and scope of our dataset nrow(diamonds)
2.How many diamonds do we have? ncol(diamonds)?
3.How many characteristics are there per diamond?
4.What time period does this dataset represent? What geography? What market segmen?
Detective Assessment:
E - Executive Driven Questions: What do stakeholders want to know?
Who might want diamond insights and what questions would they ask?
Potential Stakeholders & Their Questions:
Jewelry buyers: “How do I get the best value for my budget?”
Jewelry retailers: “What inventory should I stock?”
Diamond investors: “Which characteristics drive the highest prices?”
EDA students: “How do I systematically investigate a new dataset?”
Our Primary Investigation Question:“What patterns exist in diamond characteristics and pricing that could inform purchasing decisions?”
A - Analytical Framework: What’s our investigation plan?
Our Systematic Investigation Strategy:
Understand the domain (diamond characteristics and market)
Explore data structure and quality (missing values, ranges, distributions)
Investigate individual variables (what does each characteristic tell us?)
Examine relationships (how do characteristics relate to price?)
Segment analysis (do patterns differ across groups?)
Detective Insight: Now that we have our strategic plan AND understand our evidence, we can start systematic investigation using SCAN!
🧠 The dplyr Mindset: Think in Verbs
💡 Traditional R vs. dplyr Approach
Traditional R thinking:
# Multiple objects, hard to follow subset_data <-subset(starwars, species =="Human") selected_data <- subset_data[, c("name", "height", "mass")] ordered_data <- selected_data[order(selected_data$height,decreasing =TRUE),]#displays the adjusted dataselected_data
# A tibble: 35 × 3
name height mass
<chr> <int> <dbl>
1 Luke Skywalker 172 77
2 Darth Vader 202 136
3 Leia Organa 150 49
4 Owen Lars 178 120
5 Beru Whitesun Lars 165 75
6 Biggs Darklighter 183 84
7 Obi-Wan Kenobi 182 77
8 Anakin Skywalker 188 84
9 Wilhuff Tarkin 180 NA
10 Han Solo 180 80
# ℹ 25 more rows
dplyr thinking:
# One pipeline, easy to read starwars |>filter(species =="Human") |>#filter species (Human)select(name, height, mass) |>#select name, height, mass variables arrange(desc(height)) #arrange by descending order by height
#ADD CODE BELOW#using the other pipe/chain method#adjusted codediamondsFlawless <- diamonds %>%#filter selectionfilter(clarity =="IF")diamondsFlawless
# A tibble: 1,790 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.52 Ideal F IF 62.2 55 2783 5.14 5.18 3.21
2 0.55 Ideal G IF 60.9 57 2789 5.28 5.3 3.22
3 0.64 Ideal G IF 61.3 56 2790 5.54 5.58 3.41
4 0.72 Premium I IF 63 57 2795 5.72 5.7 3.6
5 0.6 Very Good G IF 61.6 56 2800 5.43 5.46 3.35
6 0.61 Ideal G IF 62.3 56 2800 5.43 5.45 3.39
7 0.53 Ideal F IF 61.9 54 2802 5.22 5.25 3.24
8 0.56 Very Good E IF 61 59 2833 5.28 5.34 3.24
9 0.58 Very Good E IF 60.6 59 2852 5.37 5.43 3.27
10 0.62 Fair F IF 60.1 61 2861 5.53 5.56 3.33
# ℹ 1,780 more rows
# A tibble: 1,729 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 1,719 more rows
👉 Your Turn: Advanced Detective Work OPTIONAL⤵
Investigation Questions:
How many diamonds are both over 1 carat AND under $5000? (Potential good deals?)
#ADD CODE BELOW#carat level over 1 and under $5000 in pricediamondsDeal <- diamonds %>%filter(price <5000, carat >1 )diamondsDeal
# A tibble: 3,756 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.17 Very Good J I1 60.2 61 2774 6.83 6.9 4.13
2 1.01 Premium F I1 61.8 60 2781 6.39 6.36 3.94
3 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
4 1.01 Premium H SI2 62.7 59 2788 6.31 6.22 3.93
5 1.05 Very Good J SI2 63.2 56 2789 6.49 6.45 4.09
6 1.05 Fair J SI2 65.8 59 2789 6.41 6.27 4.18
7 1.01 Fair E SI2 67.4 60 2797 6.19 6.05 4.13
8 1.04 Premium G I1 62.2 58 2801 6.46 6.41 4
9 1.2 Fair F I1 64.6 56 2809 6.73 6.66 4.33
10 1.02 Premium G I1 60.3 58 2815 6.55 6.5 3.94
# ℹ 3,746 more rows
Find diamonds that are either “Ideal” cut OR “Premium” cut
#ADD CODE BELOW#cut seletion is either 'Ideal' or 'Premium'diamondCut <- diamonds %>%# | is the or operatorfilter( cut =="Ideal"| cut =="Premium")diamondCut
# A tibble: 53,940 × 5
carat cut color clarity price
<dbl> <ord> <ord> <ord> <int>
1 0.23 Ideal E SI2 326
2 0.21 Premium E SI1 326
3 0.23 Good E VS1 327
4 0.29 Premium I VS2 334
5 0.31 Good J SI2 335
6 0.24 Very Good J VVS2 336
7 0.24 Very Good I VVS1 336
8 0.26 Very Good H SI1 337
9 0.22 Fair E VS2 337
10 0.23 Very Good H VS1 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 5
carat cut color clarity price
<dbl> <ord> <ord> <ord> <int>
1 0.23 Ideal E SI2 326
2 0.21 Premium E SI1 326
3 0.23 Good E VS1 327
4 0.29 Premium I VS2 334
5 0.31 Good J SI2 335
6 0.24 Very Good J VVS2 336
7 0.24 Very Good I VVS1 336
8 0.26 Very Good H SI1 337
9 0.22 Fair E VS2 337
10 0.23 Very Good H VS1 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 4
cut clarity color carat
<ord> <ord> <ord> <dbl>
1 Ideal SI2 E 0.23
2 Premium SI1 E 0.21
3 Good VS1 E 0.23
4 Premium VS2 I 0.29
5 Good SI2 J 0.31
6 Very Good VVS2 J 0.24
7 Very Good VVS1 I 0.24
8 Very Good SI1 H 0.26
9 Fair VS2 E 0.22
10 Very Good VS1 H 0.23
# ℹ 53,930 more rows
select size and price
#ADD CODE BELOW#size dimensions and pricediamondsTwo <- diamonds %>%select (x, y, z, carat, price)diamondsTwo
#ADD CODE BELOW#utilzing color as keydiamondsColor <- diamonds %>%#if there is a "color" name as part of column naming schemeselect(contains("color")) diamondsColor
# A tibble: 53,940 × 1
color
<ord>
1 E
2 E
3 E
4 I
5 J
6 J
7 I
8 H
9 E
10 H
# ℹ 53,930 more rows
Select what buyers care about and inspect first 5 rows
#ADD CODE BELOW#carat, cut, and price are what most matters to buyers #that contributes in bringing valuediamondQuality <- diamonds %>%select(carat,cut,price) %>%head(5)diamondQuality
# EXAMPLE: Find diamonds with unusual price patternsexampleOne <- diamonds |>arrange(desc(carat), price) |># Large diamonds, but sort by price within sizehead()glimpse(exampleOne) #displays in a horizontal manner
# EXAMPLE: Look for potential data anomalies by sorting multiple waysdiamonds |>arrange(price, desc(carat)) |># Cheapest first, but largest within pricehead()
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
👉 Your Turn: Pattern Recognition ⤵
Investigation Tasks:
Arrange by most expensive diamonds, inspect first 5 rows
#ADD CODE BELOW#arranges diamond values by descending order in pricediamondsExpensive <- diamonds %>%arrange(desc(price))#prints/inspects the first 5 rowshead(diamondsExpensive,5)
# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
Arrange by the largest diamonds by carat, inspect first 10 rows
#ADD CODE BELOW#order by carat size of diamonddiamondsLarge <- diamonds %>%arrange(desc(carat)) %>%head(10)diamondsLarge
# A tibble: 6 × 3
cut color quality_rating
<ord> <ord> <chr>
1 Ideal E Premium
2 Premium E Good
3 Good E Standard
4 Premium I Good
5 Good J Standard
6 Very Good J Good
# Investigation 4: READY Data Quality - perfect dimensions indicatordiamondDimensions<- diamonds %>%mutate(#checks for na values in dimensions and validatesdimensionsIndicator =!is.na(x) &!is.na(y) &!is.na(z) & x >0& y >0& z >0 ) %>%count(dimensionsIndicator)diamondDimensions
# A tibble: 2 × 2
dimensionsIndicator n
<lgl> <int>
1 FALSE 20
2 TRUE 53920
# A tibble: 3 × 4
size_group diamond_count avg_price price_range
<chr> <int> <dbl> <int>
1 Large 19060 8142 17561
2 Medium 17206 2496 9052
3 Small 17674 792 2351
# EXAMPLE: Look for unexpected patterns across characteristicsdiamonds |>group_by(cut, color) |>summarise(available =n(),typical_price =median(price),.groups ="drop" ) |>filter(available >=50) |># Only combinations with decent sample sizearrange(desc(typical_price))
# A tibble: 35 × 4
cut color available typical_price
<ord> <ord> <int> <dbl>
1 Premium J 808 5063
2 Premium I 1428 4640
3 Premium H 2360 4511
4 Very Good J 678 4113
5 Ideal J 896 4096
6 Very Good I 1204 3888
7 Fair H 303 3816
8 Very Good H 1824 3734
9 Good J 307 3733
10 Fair D 163 3730
# ℹ 25 more rows
👉 Your Turn: Pattern Analysis ⤵
Investigation Questions:
What’s the average price for each cut quality?
How many diamonds are in each color grade?
What’s the price range (min/max) for each clarity level?
Systematic Analysis:
# Investigation 1: SCAN Notable Segments - cut quality impact#cut quality impact on average price#used na.rm = TRUE to account for NA ValuescutQualityImpact <- diamonds %>%group_by(cut) %>%summarise(avgPrice =mean(price, na.rm =TRUE))cutQualityImpact
# A tibble: 5 × 2
cut avgPrice
<ord> <dbl>
1 Fair 4359.
2 Good 3929.
3 Very Good 3982.
4 Premium 4584.
5 Ideal 3458.
# Investigation 2: SCAN Columns Coverage - color distribution #count based on colorcolorDistribution <- diamonds %>%count(color)colorDistribution
# A tibble: 7 × 2
color n
<ord> <int>
1 D 6775
2 E 9797
3 F 9542
4 G 11292
5 H 8304
6 I 5422
7 J 2808
# Investigation 3: SCAN Aggregates - clarity price ranges#creates price range from claritypriceRangeClarity <- diamonds %>%group_by(clarity) %>%#provides minimum to maximum price range summarise(minPrice =min(price, na.rm =TRUE),maxPrice =max(price, na.rm =TRUE) )priceRangeClarity
# Investigation 4: READY Executive Questions - stakeholder insight#approach: How much can changes in carat level affect pricing?#check the range on carat levels for dataset#min(diamonds$carat)#max(diamonds$carat)#sum(diamonds$carat == 1)#determining mean carat level of dataset#mean(diamonds$carat)#note: for this analysis the target of 1 would be used #organization #organize carat level by less than 1, exactly 1,and over 1thresholdsCarat <- diamonds %>%mutate(#running multiple conditions using case_when () #to classify under thresholdcarat_group =case_when( carat <1.00~"Less than 1.00", carat ==1.00~"Exactly 1.00", carat >1.00~"Over 1.00" ) ) #this is where the table gets put togetherpricesData <- thresholdsCarat%>%group_by(carat_group) %>%summarise(avgPrice =mean(price, na.rm =TRUE) )pricesData
# A tibble: 3 × 2
carat_group avgPrice
<chr> <dbl>
1 Exactly 1.00 5242.
2 Less than 1.00 1633.
3 Over 1.00 8400.
🔗 Building Pipelines: The dplyr Way Here’s the general pattern we’ll use throughout the workshop:
data |> # 1. Start with data
filter(condition) |> # 2. Filter early (reduce rows)
select(relevant_columns) |> # 3. Select early (reduce columns)
mutate(new_variables) |> # 4. Transform as needed
group_by(grouping_vars) |> # 5. Group for summaries
summarise(summary_stats) |> # 6. Calculate aggregates
arrange(sorting_column) # 7. Sort results