Week 3/4 - EDA Frameworks-KEY

Slides companion

Author

Dr. McClure

Introduction

This document demonstrates key principles of Exploratory Data Analysis (EDA) using R, guided by systematic detective frameworks. We’ll explore various techniques while following best practices for reproducibility and efficient data handling.

EDA: A Creative Inquiry with Structure

Exploratory Data Analysis is not just a step in data science but a creative process that benefits from systematic frameworks. Like a detective solving a case, we need both creativity and methodology.

We’ll use two complementary frameworks throughout this analysis:

READY Framework: Strategic planning before investigation
SCAN Framework: Tactical execution during investigation

The Detective Frameworks

READY Framework - Your Strategic Case File

Apply BEFORE you start investigating data

Representative Data: Do we have all the data we need?
Executive Driven Questions: What is the main question we’re trying to solve?
Analytical Frameworks: What is our plan of action?
Data Best Practices: What are the rules for keeping our data clean?
Your Insight, Your Impact: What did we learn, and how will it help?

Group Activity: Work with your neighbor to brainstorm questions for each READY component. We’ll share a few from each group.

R - Representative Data: What don’t we know about our dataset?

Your group’s questions:

# Add your group's R questions as comments:
# Where is the data from?
# Are we looking at numerical or qualitative data?

E - Executive Questions: What questions do stakeholders have that we can’t answer yet?

Your group’s questions:

# Add your group's E questions as comments:
# How many total diamonds in there in all?
# What diamonds are the most popular?

A - Analytical Framework: What’s our exploration strategy?

Your group’s questions:

# Add your group's A questions as comments:
# Looking at the specific variables to understand them better, like cut, clarity, etc. 
# What's the plan, where do we start?

D - Data Best Practices: What data quality unknowns should we check?

Your group’s questions:

# Add your group's D questions as comments:
# What are the variable types?
# NA values? Missing.

Y - Your Insights: What story might emerge from our investigation?

Your group’s questions:

# Add your group's Y questions as comments:
# Maybe find out about the trends of cuts of diamonds. 
# Coorelations between sizes and prices.

Class Discussion: Each group shares one interesting question from any READY category.

SCAN Framework - Your Field Investigation Guide

Apply WHILE you’re exploring data

Stakeholder Goals: Who wants the answer to this question?
Columns and Coverage: What data do we have, and is any of it missing?
Aggregates and Anomalies: What are the overall trends, and are there any strange outliers?
Notable Segments: Can we break the data into smaller groups to find hidden clues?

Loading Required Libraries

# Function to check and install required packages
required_packages <- c("tidyverse", "corrplot")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load all packages at once
invisible(lapply(required_packages, library, character.only = TRUE))

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
corrplot 0.95 loaded

# Loading packages individually
#library(tidyverse)  # Includes dplyr, ggplot2, tidyr
#library(corrplot)   # For correlation visualization

🌟 Meet Your Investigation Dataset: diamonds

We’ll use the built-in diamonds dataset to demonstrate EDA techniques. While this isn’t “big data”, the principles we’ll demonstrate scale to larger datasets.

🤔 Detective Rule #1: If you know nothing about diamonds, what’s the FIRST thing you should do?

data("diamonds")
diamonds_df <- as_tibble(diamonds)

# First look at the structure of our evidence
diamonds_df |> 
  head() |>
  glimpse()

Rows: 6
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good
$ color   <ord> E, E, E, I, J, J
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8
$ table   <dbl> 55, 61, 65, 58, 58, 57
$ price   <int> 326, 326, 327, 334, 335, 336
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48

🔍 Investigation Exercise:

Without using any external resources, what can you discover about this dataset?

💡Think about….

How many observations and variables are there? 10 variables and 53940 observations.
What types of data are present? Doubles: Numbers with a decimal Ordinal: They have a ranking system Integer: Numbers without decimal.

📋 READY Framework - Strategic Investigation Planning

Before we touch the data, let’s plan our investigation systematically:

R - Representative Data: Do we have what we need?

Check the size and scope of our dataset nrow(diamonds)  

How many diamonds do we have? ncol(diamonds)  

How many characteristics per diamond?  

What time period does this represent? What geography? What market segment?

Detective Assessment:

E - Executive Driven Questions: What do stakeholders want to know?

Who might want diamond insights and what questions would they ask?

Potential Stakeholders & Their Questions:

Jewelry buyers: “How do I get the best value for my budget?”
Jewelry retailers: “What inventory should I stock?”
Diamond investors: “Which characteristics drive the highest prices?”
EDA students: “How do I systematically investigate a new dataset?”

Our Primary Investigation Question: “What patterns exist in diamond characteristics and pricing that could inform purchasing decisions?”

A - Analytical Framework: What’s our investigation plan?

Our Systematic Investigation Strategy:

Understand the domain (diamond characteristics and market)
Explore data structure and quality (missing values, ranges, distributions)
Investigate individual variables (what does each characteristic tell us?)
Examine relationships (how do characteristics relate to price?)
Segment analysis (do patterns differ across groups?)
Synthesize findings (what actionable insights emerged?)

D - Data Best Practices: How do we ensure investigation quality?

Check for missing values across all variables
Verify data types are appropriate for each variable
Validate ranges make sense (no negative prices, reasonable measurements)
Look for duplicates or data entry errors
Document any transformations we make during analysis
Handle outliers systematically (investigate, don’t just remove)

Y - Your Insights: What do we expect to learn?

Expected Investigation Outcomes:

Understanding of how diamond characteristics (4 Cs) relate to pricing
Identification of value opportunities across different market segments
Recognition of data patterns and anomalies
Development of systematic EDA skills using dplyr

Success Criteria: Can we provide actionable recommendations for different stakeholder groups based on data evidence?

📚 Quick Diamond Knowledge (What Detectives Need to Know):

Carat: Weight/size (bigger = more expensive, usually)
Cut: Quality of diamond cutting (Ideal > Premium > Very Good > Good > Fair)
Color: D=colorless (best) → J=light yellow (worst in this dataset)
Clarity: FL=flawless → I1=inclusions visible (FL > IF > VVS1 > VVS2 > VS1 > VS2 > SI1 > SI2 > I1)
Price: Market value in US dollars
Dimensions (x,y,z): Physical measurements in mm

Detective Insight: Now that we have our strategic plan AND understand our evidence, we can start systematic investigation using SCAN!

🧠 The dplyr Mindset: Think in Verbs

💡 Traditional R vs. dplyr Approach

Traditional R thinking:

# Multiple objects, hard to follow 

subset_data <- subset(starwars, species == "Human") selected_data <- subset_data[, c("name", "height", "mass")] 

ordered_data <- selected_data[order(selected_data$height, decreasing = TRUE), ]

dplyr thinking:

# One pipeline, easy to read 

starwars |>   
  filter(species == "Human") |> 
  select(name, height, mass) |> 
  arrange(desc(height))

📚 Complete dplyr Verb Coverage

filter(): Finding specific clues and evidence
select(): Focusing investigation on key variables
arrange(): Organizing evidence to reveal patterns
mutate(): Creating new investigation angles
group_by() + summarise(): Systematic pattern analysis

🔍 filter(): Find Specific Clues

Detective Skill: Narrow down your evidence to specific cases of interest.

👉 Your Turn: Basic Detective Filtering ⤵

# EXAMPLE: Investigate expensive diamonds (our "high-value suspects")
diamonds |> 
  filter(price > 15000)

# A tibble: 1,655 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.54 Premium   E     VS2      62.3    58 15002  7.31  7.39  4.58
 2  1.19 Ideal     F     VVS1     61.5    55 15005  6.82  6.84  4.2 
 3  2.1  Premium   I     SI1      61.5    57 15007  8.25  8.21  5.06
 4  1.69 Ideal     D     SI1      60.8    57 15011  7.69  7.71  4.68
 5  1.5  Very Good G     VVS2     62.9    56 15013  7.22  7.32  4.57
 6  1.73 Very Good G     VS1      62.8    57 15014  7.57  7.72  4.8 
 7  2.02 Premium   G     SI2      63      59 15014  8.05  7.95  5.03
 8  2.05 Very Good F     SI2      61.9    56 15017  8.13  8.18  5.05
 9  1.5  Very Good F     VS1      61.6    58 15022  7.35  7.43  4.55
10  1.82 Very Good G     SI1      62.7    58 15025  7.68  7.75  4.84
# ℹ 1,645 more rows

# EXAMPLE: Look for specific diamond characteristics
diamonds |> 
  filter(cut == "Premium", color == "D")

# A tibble: 1,603 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
 2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
 3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
 4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
 5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
 6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
 7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
 8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
 9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
# ℹ 1,593 more rows

Filter by diamonds over 2 carats

diamonds |>
  filter(carat > 2)

# A tibble: 1,889 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.06 Premium J     I1       61.2    58  5203  8.1   8.07  4.95
 2  2.14 Fair    J     I1       69.4    57  5405  7.74  7.7   5.36
 3  2.15 Fair    J     I1       65.5    57  5430  8.01  7.95  5.23
 4  2.22 Fair    J     I1       66.7    56  5607  8.04  8.02  5.36
 5  2.01 Fair    I     I1       67.4    58  5696  7.71  7.64  5.17
 6  2.01 Fair    I     I1       55.9    64  5696  8.48  8.39  4.71
 7  2.27 Fair    J     I1       67.6    55  5733  8.05  8     5.43
 8  2.03 Fair    H     I1       64.4    59  6002  7.91  7.85  5.07
 9  2.03 Fair    H     I1       66.6    57  6002  7.81  7.75  5.19
10  2.06 Good    H     I1       64.3    58  6091  8.03  7.99  5.15
# ℹ 1,879 more rows

Filter by flawless diamonds (clarity == “IF”)

diamonds |>
  filter(clarity == "IF")

# A tibble: 1,790 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.52 Ideal     F     IF       62.2    55  2783  5.14  5.18  3.21
 2  0.55 Ideal     G     IF       60.9    57  2789  5.28  5.3   3.22
 3  0.64 Ideal     G     IF       61.3    56  2790  5.54  5.58  3.41
 4  0.72 Premium   I     IF       63      57  2795  5.72  5.7   3.6 
 5  0.6  Very Good G     IF       61.6    56  2800  5.43  5.46  3.35
 6  0.61 Ideal     G     IF       62.3    56  2800  5.43  5.45  3.39
 7  0.53 Ideal     F     IF       61.9    54  2802  5.22  5.25  3.24
 8  0.56 Very Good E     IF       61      59  2833  5.28  5.34  3.24
 9  0.58 Very Good E     IF       60.6    59  2852  5.37  5.43  3.27
10  0.62 Fair      F     IF       60.1    61  2861  5.53  5.56  3.33
# ℹ 1,780 more rows

So there are no Flawless diamonds because that name doesn’t exist. But, there are 1,790 pretty good ones.

Filter by diamonds under $500

#ADD CODE BELOW
diamonds|>
  filter(price < 500)

# A tibble: 1,729 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 1,719 more rows

1,729 diamonds are under $500.

👉 Your Turn: Advanced Detective Work OPTIONAL⤵

Investigation Questions:

How many diamonds are both over 1 carat AND under $5000? (Potential good deals?)

#ADD CODE BELOW
diamonds|>
  filter(carat>1, price<5000)

# A tibble: 3,756 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.17 Very Good J     I1       60.2    61  2774  6.83  6.9   4.13
 2  1.01 Premium   F     I1       61.8    60  2781  6.39  6.36  3.94
 3  1.01 Fair      E     I1       64.5    58  2788  6.29  6.21  4.03
 4  1.01 Premium   H     SI2      62.7    59  2788  6.31  6.22  3.93
 5  1.05 Very Good J     SI2      63.2    56  2789  6.49  6.45  4.09
 6  1.05 Fair      J     SI2      65.8    59  2789  6.41  6.27  4.18
 7  1.01 Fair      E     SI2      67.4    60  2797  6.19  6.05  4.13
 8  1.04 Premium   G     I1       62.2    58  2801  6.46  6.41  4   
 9  1.2  Fair      F     I1       64.6    56  2809  6.73  6.66  4.33
10  1.02 Premium   G     I1       60.3    58  2815  6.55  6.5   3.94
# ℹ 3,746 more rows

There are 3,756 diamonds that are over 1 carat and less than $5,000

Find diamonds that are either “Ideal” cut OR “Premium” cut

#ADD CODE BELOW
diamonds|>
  filter(cut== "Premium" | cut == "Ideal")

# A tibble: 35,342 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
 4  0.23 Ideal   J     VS1      62.8    56   340  3.93  3.9   2.46
 5  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
 6  0.31 Ideal   J     SI2      62.2    54   344  4.35  4.37  2.71
 7  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
 8  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 9  0.3  Ideal   I     SI2      62      54   348  4.31  4.34  2.68
10  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
# ℹ 35,332 more rows

35,342 diamonds are either ideal or premium

Investigate diamonds with missing depth measurements

#ADD CODE BELOW

Scan Anamolies

#ADD CODE BELOW

🔧 select(): Focus Your Investigation

Detective Skill: Choose which evidence to examine closely.

# EXAMPLE: Focus on pricing factors
diamonds |> 
  select(carat, cut, color, clarity, price)

# A tibble: 53,940 × 5
   carat cut       color clarity price
   <dbl> <ord>     <ord> <ord>   <int>
 1  0.23 Ideal     E     SI2       326
 2  0.21 Premium   E     SI1       326
 3  0.23 Good      E     VS1       327
 4  0.29 Premium   I     VS2       334
 5  0.31 Good      J     SI2       335
 6  0.24 Very Good J     VVS2      336
 7  0.24 Very Good I     VVS1      336
 8  0.26 Very Good H     SI1       337
 9  0.22 Fair      E     VS2       337
10  0.23 Very Good H     VS1       338
# ℹ 53,930 more rows

# EXAMPLE: Exclude complex measurements for now
diamonds |> 
  select(-x, -y, -z, -table, -depth)

# A tibble: 53,940 × 5
   carat cut       color clarity price
   <dbl> <ord>     <ord> <ord>   <int>
 1  0.23 Ideal     E     SI2       326
 2  0.21 Premium   E     SI1       326
 3  0.23 Good      E     VS1       327
 4  0.29 Premium   I     VS2       334
 5  0.31 Good      J     SI2       335
 6  0.24 Very Good J     VVS2      336
 7  0.24 Very Good I     VVS1      336
 8  0.26 Very Good H     SI1       337
 9  0.22 Fair      E     VS2       337
10  0.23 Very Good H     VS1       338
# ℹ 53,930 more rows

👉 Your Turn: Evidence Selection ⤵

Select 4 C’s

#ADD CODE BELOW
diamonds |>
  select(carat, cut, color, clarity)

# A tibble: 53,940 × 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# ℹ 53,930 more rows

select size and price

#ADD CODE BELOW
diamonds|>
  select(price,x,y,z, carat)

# A tibble: 53,940 × 5
   price     x     y     z carat
   <int> <dbl> <dbl> <dbl> <dbl>
 1   326  3.95  3.98  2.43  0.23
 2   326  3.89  3.84  2.31  0.21
 3   327  4.05  4.07  2.31  0.23
 4   334  4.2   4.23  2.63  0.29
 5   335  4.34  4.35  2.75  0.31
 6   336  3.94  3.96  2.48  0.24
 7   336  3.95  3.98  2.47  0.24
 8   337  4.07  4.11  2.53  0.26
 9   337  3.87  3.78  2.49  0.22
10   338  4     4.05  2.39  0.23
# ℹ 53,930 more rows

Select all colors *hint use contains() function

#ADD CODE BELOW
diamonds|>
  select(contains("color"))

# A tibble: 53,940 × 1
   color
   <ord>
 1 E    
 2 E    
 3 E    
 4 I    
 5 J    
 6 J    
 7 I    
 8 H    
 9 E    
10 H    
# ℹ 53,930 more rows

diamonds|>
  select(contains("color"))

# A tibble: 53,940 × 1
   color
   <ord>
 1 E    
 2 E    
 3 E    
 4 I    
 5 J    
 6 J    
 7 I    
 8 H    
 9 E    
10 H    
# ℹ 53,930 more rows

Select what buyers care aout and inspect first 5 rows

#ADD CODE BELOW
diamonds|>
  select(price, carat, cut, color, clarity)|>
  head(5)

# A tibble: 5 × 5
  price carat cut     color clarity
  <int> <dbl> <ord>   <ord> <ord>  
1   326  0.23 Ideal   E     SI2    
2   326  0.21 Premium E     SI1    
3   327  0.23 Good    E     VS1    
4   334  0.29 Premium I     VS2    
5   335  0.31 Good    J     SI2

📏 arrange(): Organize Your Evidence

Detective Skill: Sort clues to reveal patterns.

# EXAMPLE: Find diamonds with unusual price patterns
diamonds |> 
  arrange(desc(carat), price) |>  # Large diamonds, but sort by price within size
  head()

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  5.01 Fair      J     I1       65.5    59 18018  10.7 10.5   6.98
2  4.5  Fair      J     I1       65.8    58 18531  10.2 10.2   6.72
3  4.13 Fair      H     I1       64.8    61 17329  10    9.85  6.43
4  4.01 Premium   I     I1       61      61 15223  10.1 10.1   6.17
5  4.01 Premium   J     I1       62.5    62 15223  10.0  9.94  6.24
6  4    Very Good I     I1       63.3    58 15984  10.0  9.94  6.31

# EXAMPLE: Look for potential data anomalies by sorting multiple ways
diamonds |> 
  arrange(price, desc(carat)) |>  # Cheapest first, but largest within price
  head()

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

👉 Your Turn: Pattern Recognition ⤵

Investigation Tasks:

Arrange by most expensive diamonds, inspect first 5 rows

#ADD CODE BELOW
diamonds|>
  arrange(desc(price))|>
  head(5)

# A tibble: 5 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01

Arrange by the largest diamonds by carat, inspect first 10 rows

#ADD CODE BELOW
diamonds|>
  arrange(desc(carat))|>
  head(10)

# A tibble: 10 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  5.01 Fair      J     I1       65.5    59 18018 10.7  10.5   6.98
 2  4.5  Fair      J     I1       65.8    58 18531 10.2  10.2   6.72
 3  4.13 Fair      H     I1       64.8    61 17329 10     9.85  6.43
 4  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17
 5  4.01 Premium   J     I1       62.5    62 15223 10.0   9.94  6.24
 6  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31
 7  3.67 Premium   I     I1       62.4    56 16193  9.86  9.81  6.13
 8  3.65 Fair      H     I1       67.1    53 11668  9.53  9.48  6.38
 9  3.51 Premium   J     VS2      62.5    59 18701  9.66  9.63  6.03
10  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03

Arrange by cut quality, then by price (within each cut)

#ADD CODE BELOW
diamonds|>
  arrange(desc(cut), price)

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 5  0.25 Ideal H     SI1      62.8    54   357  4.05  4.07  2.55
 6  0.26 Ideal I     VS1      61.9    56   358  4.08  4.16  2.55
 7  0.3  Ideal I     SI2      62      55   360  4.32  4.33  2.68
 8  0.27 Ideal G     SI2      62.3    55   361  4.14  4.18  2.59
 9  0.26 Ideal H     SI2      62.5    53   362  4.09  4.13  2.57
10  0.25 Ideal G     SI1      62.3    53   363  4.06  4.09  2.54
# ℹ 53,930 more rows

Detective insight - cheapest diamonds (potential data issues?)

#ADD CODE BELOW
diamonds|>
  arrange(price)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

# a potential issue is that there is too much to look at. 
#Just looking at the first few rows, you can't make a 
#conclusion about what makes a diamond cheaper.

➕ mutate(): Create New Investigation Angles

Detective Skill: Derive new insights from existing evidence.

# EXAMPLE: Create diamond quality rating system
diamonds |> 
  mutate(
    quality_rating = case_when(
      cut == "Ideal" & color %in% c("D", "E") ~ "Premium",
      cut %in% c("Premium", "Very Good") ~ "Good", 
      TRUE ~ "Standard"
    )
  ) |>
  select(cut, color, quality_rating) |>
  head()

# A tibble: 6 × 3
  cut       color quality_rating
  <ord>     <ord> <chr>         
1 Ideal     E     Premium       
2 Premium   E     Good          
3 Good      E     Standard      
4 Premium   I     Good          
5 Good      J     Standard      
6 Very Good J     Good

# EXAMPLE: Flag potentially unusual diamonds for investigation
diamonds |> 
  mutate(
    investigation_flag = case_when(
      price < 500 ~ "Suspiciously_Cheap",
      carat > 3 ~ "Unusually_Large",
      price/carat < 1000 ~ "Great_Value_Check",
      TRUE ~ "Normal"
    )
  ) |>
  select(carat, price, investigation_flag) |>
  filter(investigation_flag != "Normal") |>
  head()

# A tibble: 6 × 3
  carat price investigation_flag
  <dbl> <int> <chr>             
1  0.23   326 Suspiciously_Cheap
2  0.21   326 Suspiciously_Cheap
3  0.23   327 Suspiciously_Cheap
4  0.29   334 Suspiciously_Cheap
5  0.31   335 Suspiciously_Cheap
6  0.24   336 Suspiciously_Cheap

👉 Your Turn: New Clues Creation ⤵

Investigation Tasks:

Calculate price per carat (value efficiency)
Create size categories: Small (<0.5), Medium (0.5-1), Large (1-2), Huge (>2)
Determine if a diamond is “premium priced” (above median price)

# 1. Calculate price per carat
diamonds|>
  mutate(price_per_carat = price/carat)|>
  select(price, carat, price_per_carat)

# A tibble: 53,940 × 3
   price carat price_per_carat
   <int> <dbl>           <dbl>
 1   326  0.23           1417.
 2   326  0.21           1552.
 3   327  0.23           1422.
 4   334  0.29           1152.
 5   335  0.31           1081.
 6   336  0.24           1400 
 7   336  0.24           1400 
 8   337  0.26           1296.
 9   337  0.22           1532.
10   338  0.23           1470.
# ℹ 53,930 more rows

# 2. Create size categories.
diamonds|>
  mutate(Sizes = case_when(
    carat < .5 ~ "Small",
    carat < 1 ~ "Medium",
    carat <= 2 ~ "Large",
    carat > 2 ~ "Huge"
  ))|>
  select(Sizes, carat)

# A tibble: 53,940 × 2
   Sizes carat
   <chr> <dbl>
 1 Small  0.23
 2 Small  0.21
 3 Small  0.23
 4 Small  0.29
 5 Small  0.31
 6 Small  0.24
 7 Small  0.24
 8 Small  0.26
 9 Small  0.22
10 Small  0.23
# ℹ 53,930 more rows

# 3. Determine if a diamond is premium.
diamond_median <- median(diamonds$price)

diamonds|>
  mutate(premium_status = case_when(
    price <= diamond_median ~ "Not Premium",
    price > diamond_median ~ "Premium"
  )) |>
  select(price, premium_status)

# A tibble: 53,940 × 2
   price premium_status
   <int> <chr>         
 1   326 Not Premium   
 2   326 Not Premium   
 3   327 Not Premium   
 4   334 Not Premium   
 5   335 Not Premium   
 6   336 Not Premium   
 7   336 Not Premium   
 8   337 Not Premium   
 9   337 Not Premium   
10   338 Not Premium   
# ℹ 53,930 more rows

🔑 Answer Key - New Investigation Angles:

# Investigation 1: SCAN Aggregates - price efficiency metric
diamonds |> 
  mutate(price_per_carat = price / carat) |> 
  select(carat, price, price_per_carat) |> 
  arrange(price_per_carat) |> 
  head()

# A tibble: 6 × 3
  carat price price_per_carat
  <dbl> <int>           <dbl>
1  0.43   452           1051.
2  0.32   345           1078.
3  0.31   335           1081.
4  0.33   366           1109.
5  0.31   344           1110.
6  0.32   361           1128.

# Investigation 2: SCAN Notable Segments - size categories
diamonds |> 
  mutate(
    size_category = case_when(
      carat < 0.5 ~ "Small",
      carat < 1.0 ~ "Medium", 
      carat < 2.0 ~ "Large",
      carat >= 2.0 ~ "Huge"
    )
  ) |> 
  count(size_category)

# A tibble: 4 × 2
  size_category     n
  <chr>         <int>
1 Huge           2154
2 Large         16906
3 Medium        17206
4 Small         17674

# Investigation 3: Market positioning analysis
median_price <- median(diamonds$price)
diamonds |> 
  mutate(
    price_tier = ifelse(price > median_price, "Premium", "Budget"),
    value_ratio = price / carat
  ) |> 
  select(price, carat, price_tier, value_ratio) |> 
  head()

# A tibble: 6 × 4
  price carat price_tier value_ratio
  <int> <dbl> <chr>            <dbl>
1   326  0.23 Budget           1417.
2   326  0.21 Budget           1552.
3   327  0.23 Budget           1422.
4   334  0.29 Budget           1152.
5   335  0.31 Budget           1081.
6   336  0.24 Budget           1400

# Investigation 4: READY Data Quality - perfect dimensions indicator
diamonds |> 
  mutate(
    has_dimensions = !is.na(x) & !is.na(y) & !is.na(z) & x > 0 & y > 0 & z > 0
  ) |> 
  count(has_dimensions)

# A tibble: 2 × 2
  has_dimensions     n
  <lgl>          <int>
1 FALSE             20
2 TRUE           53920

📊 group_by() + summarise(): Systematic Investigation

Detective Skill: Analyze patterns across different categories of evidence.

# EXAMPLE: Investigate pricing patterns by diamond size categories
diamonds |> 
  mutate(size_group = case_when(
    carat < 0.5 ~ "Small",
    carat < 1.0 ~ "Medium",
    TRUE ~ "Large"
  )) |>
  group_by(size_group) |>
  summarise(
    diamond_count = n(),
    avg_price = round(mean(price)),
    price_range = max(price) - min(price),
    .groups = "drop"
  )

# A tibble: 3 × 4
  size_group diamond_count avg_price price_range
  <chr>              <int>     <dbl>       <int>
1 Large              19060      8142       17561
2 Medium             17206      2496        9052
3 Small              17674       792        2351

# EXAMPLE: Look for unexpected patterns across multiple characteristics
diamonds |> 
  group_by(cut, color) |>
  summarise(
    available = n(),
    typical_price = median(price),
    .groups = "drop"
  ) |>
  filter(available >= 50) |>  # Only combinations with decent sample size
  arrange(desc(typical_price))

# A tibble: 35 × 4
   cut       color available typical_price
   <ord>     <ord>     <int>         <dbl>
 1 Premium   J           808          5063
 2 Premium   I          1428          4640
 3 Premium   H          2360          4511
 4 Very Good J           678          4113
 5 Ideal     J           896          4096
 6 Very Good I          1204          3888
 7 Fair      H           303          3816
 8 Very Good H          1824          3734
 9 Good      J           307          3733
10 Fair      D           163          3730
# ℹ 25 more rows

👉 Your Turn: Pattern Analysis ⤵

Investigation Questions:

What’s the average price for each cut quality?
How many diamonds are in each color grade?
What’s the price range (min/max) for each clarity level?

# 1. Average Price for each cut quality
diamonds|>
  group_by(cut)|>
  summarise(
    average_price = mean(price)
  )

# A tibble: 5 × 2
  cut       average_price
  <ord>             <dbl>
1 Fair              4359.
2 Good              3929.
3 Very Good         3982.
4 Premium           4584.
5 Ideal             3458.

# 2. How many diamonds in each color category?
diamonds|>
  group_by(color)|>
  summarise(
    counts = n()
  )

# A tibble: 7 × 2
  color counts
  <ord>  <int>
1 D       6775
2 E       9797
3 F       9542
4 G      11292
5 H       8304
6 I       5422
7 J       2808

🔑 Systematic Analysis:

# Investigation 1: SCAN Notable Segments - cut quality impact




# Investigation 2: SCAN Columns Coverage - color distribution  





# Investigation 3: SCAN Aggregates - clarity price ranges





# Investigation 4: READY Executive Questions - stakeholder insight

🔗 Building Pipelines: The dplyr Way Here’s the general pattern we’ll use throughout the workshop:

data |>                          # 1. Start with data
  filter(condition) |>           # 2. Filter early (reduce rows)
  select(relevant_columns) |>    # 3. Select early (reduce columns)  
  mutate(new_variables) |>       # 4. Transform as needed
  group_by(grouping_vars) |>     # 5. Group for summaries
  summarise(summary_stats) |>    # 6. Calculate aggregates
  arrange(sorting_column)        # 7. Sort results

⚠️ Detective Best Practices

🔍 Investigation Mistakes to Avoid:

Broken Evidence Chain

# ❌ Wrong - breaks the investigation flow 

diamonds |>      
  filter(price > 5000)  
  select(carat, price)  # Lost connection!    

# ✅ Correct investigation flow 
diamonds |>      
  filter(price > 5000) |>      
  select(carat, price)

Ignoring Missing Evidence

# ❌ Missing data will give NA results 
diamonds |>      
  summarise(avg_depth = mean(depth))    

# ✅ Handle missing evidence properly 
diamonds |>      
  summarise(avg_depth = mean(depth, na.rm = TRUE))

Comparison vs Assignment Confusion

# ❌ Wrong - this assigns, doesn't compare 

diamonds |>    
  filter(cut = "Ideal")   
  
# ✅ Correct - use == for comparison   
diamonds |>    
  filter(cut == "Ideal")