DSA406_001_FL25_dplyr Quarto_aschaap

Slides companion

Author

Amy Schaap

#Submit a link (look on moodle) Set up an Rpub site —

Introduction

This document demonstrates key principles of Exploratory Data Analysis (EDA) using R, guided by systematic detective frameworks. We’ll explore various techniques while following best practices for reproducibility and efficient data handling.

EDA: A Creative Inquiry with Structure

Exploratory Data Analysis is not just a step in data science but a creative process that benefits from systematic frameworks. Like a detective solving a case, we need both creativity and methodology.

We’ll use two complementary frameworks throughout this analysis:

  • READY Framework: Strategic planning before investigation

  • SCAN Framework: Tactical execution during investigation

The Detective Frameworks

READY Framework - Your Strategic Case File

Apply BEFORE you start investigating data

  • Representative Data: Do we have all the data we need?

  • Executive Driven Questions: What is the main question we’re trying to solve?

  • Analytical Frameworks: What is our plan of action?

  • Data Best Practices: What are the rules for keeping our data clean?

  • Your Insight, Your Impact: What did we learn, and how will it help?

Group Activity: Work with your neighbor to brainstorm questions for each READY component. We’ll share a few from each group.

R - Representative Data: What don’t we know about our dataset?

Your group’s questions:

# Add your group's R questions as comments:
# Where is the data from
# Are we looking at numerical or categorical

E - Executive Questions: What questions do stakeholders have that we can’t answer yet?

Your group’s questions:

# Add your group's E questions as comments:
# How many total diamonds are there in all
# Which dimonds are most popular

A - Analytical Framework: What’s our exploration strategy?

Your group’s questions:

# Add your group's A questions as comments:
# Looking at specific variables to understand them better (cut, clarity, etc.)
# What numeric variables were working with (Their cost, size, hardness)
# What are their categorical variables (cut, shape, type, where they're from
# Where do we start

D - Data Best Practices: What data quality unknowns should we check?

Your group’s questions:

# Add your group's D questions as comments:
# What are the types of each variable
# Null values/N/A

Y - Your Insights: What story might emerge from our investigation?

Your group’s questions:

# Add your group's Y questions as comments:
# Maybe find out about trends or cuts of diamonds
# Possible correlations between values specific size - price

Class Discussion: Each group shares one interesting question from any READY category.

SCAN Framework - Your Field Investigation Guide

Apply WHILE you’re exploring data

  • Stakeholder Goals: Who wants the answer to this question?

  • Columns and Coverage: What data do we have, and is any of it missing?

  • Aggregates and Anomalies: What are the overall trends, and are there any strange outliers?

  • Notable Segments: Can we break the data into smaller groups to find hidden clues?

Loading Required Libraries

# Function to check and install required packages
required_packages <- c("tidyverse", "corrplot")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load all packages at once
invisible(lapply(required_packages, library, character.only = TRUE))
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tibble' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'corrplot' was built under R version 4.3.3
corrplot 0.95 loaded
# Loading packages individually
#library(tidyverse)  # Includes dplyr, ggplot2, tidyr
#library(corrplot)   # For correlation visualization

🌟 Meet Your Investigation Dataset: diamonds

We’ll use the built-in diamonds dataset to demonstrate EDA techniques. While this isn’t “big data”, the principles we’ll demonstrate scale to larger datasets.

🤔 Detective Rule #1: If you know nothing about diamonds, what’s the FIRST thing you should do?

data("diamonds")
diamonds_df <- as_tibble(diamonds)

# First look at the structure of our evidence
diamonds_df |> 
  head() |> 
  glimpse()
Rows: 6
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good
$ color   <ord> E, E, E, I, J, J
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8
$ table   <dbl> 55, 61, 65, 58, 58, 57
$ price   <int> 326, 326, 327, 334, 335, 336
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48

🔍 Investigation Exercise:

Without using any external resources, what can you discover about this dataset?

💡Think about….

  • How many observations and variables are there? Observations: 53940 and variables: 10
  • What types of data are present? ordinal (some type of order), double (number with decimal), integer (whole number)

📋 READY Framework - Strategic Investigation Planning

Before we touch the data, let’s plan our investigation systematically:

R - Representative Data: Do we have what we need?

Check the size and scope of our dataset nrow(diamonds)  

How many diamonds do we have? ncol(diamonds)  

How many characteristics per diamond?  

What time period does this represent? What geography? What market segment?

Detective Assessment:

E - Executive Driven Questions: What do stakeholders want to know?

Who might want diamond insights and what questions would they ask?

Potential Stakeholders & Their Questions:

  • Jewelry buyers: “How do I get the best value for my budget?”

  • Jewelry retailers: “What inventory should I stock?”

  • Diamond investors: “Which characteristics drive the highest prices?”

  • EDA students: “How do I systematically investigate a new dataset?”

Our Primary Investigation Question: “What patterns exist in diamond characteristics and pricing that could inform purchasing decisions?”

A - Analytical Framework: What’s our investigation plan?

Our Systematic Investigation Strategy:

  1. Understand the domain (diamond characteristics and market)

  2. Explore data structure and quality (missing values, ranges, distributions)

  3. Investigate individual variables (what does each characteristic tell us?)

  4. Examine relationships (how do characteristics relate to price?)

  5. Segment analysis (do patterns differ across groups?)

  6. Synthesize findings (what actionable insights emerged?)

D - Data Best Practices: How do we ensure investigation quality?

  1. Check for missing values across all variables

  2. Verify data types are appropriate for each variable

  3. Validate ranges make sense (no negative prices, reasonable measurements)

  4. Look for duplicates or data entry errors

  5. Document any transformations we make during analysis

  6. Handle outliers systematically (investigate, don’t just remove)

Y - Your Insights: What do we expect to learn?

Expected Investigation Outcomes:

  1. Understanding of how diamond characteristics (4 Cs) relate to pricing

  2. Identification of value opportunities across different market segments

  3. Recognition of data patterns and anomalies

  4. Development of systematic EDA skills using dplyr

Success Criteria: Can we provide actionable recommendations for different stakeholder groups based on data evidence?

📚 Quick Diamond Knowledge (What Detectives Need to Know):

  • Carat: Weight/size (bigger = more expensive, usually)

  • Cut: Quality of diamond cutting (Ideal > Premium > Very Good > Good > Fair)

  • Color: D=colorless (best) → J=light yellow (worst in this dataset)

  • Clarity: FL=flawless → I1=inclusions visible (FL > IF > VVS1 > VVS2 > VS1 > VS2 > SI1 > SI2 > I1)

  • Price: Market value in US dollars

  • Dimensions (x,y,z): Physical measurements in mm

Detective Insight: Now that we have our strategic plan AND understand our evidence, we can start systematic investigation using SCAN!

🧠 The dplyr Mindset: Think in Verbs

💡 Traditional R vs. dplyr Approach

Traditional R thinking:

# Multiple objects, hard to follow 

subset_data <- subset(starwars, species == "Human") 

selected_data <- subset_data[, c("name", "height", "mass")] 

ordered_data <- selected_data[order(selected_data$height, decreasing = TRUE), ]

subset_data
# A tibble: 35 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
# ℹ 25 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

dplyr thinking tidyverse way:

# One pipeline, easy to read 

starwars |>   
  filter(species == "Human") |> 
  select(name, height, mass) |> 
  arrange(desc(height))
# A tibble: 35 × 3
   name                height  mass
   <chr>                <int> <dbl>
 1 Darth Vader            202 136  
 2 Qui-Gon Jinn           193  89  
 3 Dooku                  193  80  
 4 Bail Prestor Organa    191  NA  
 5 Anakin Skywalker       188  84  
 6 Mace Windu             188  84  
 7 Raymus Antilles        188  79  
 8 Padmé Amidala          185  45  
 9 Biggs Darklighter      183  84  
10 Boba Fett              183  78.2
# ℹ 25 more rows

📚 Complete dplyr Verb Coverage

  • filter(): Finding specific clues and evidence

  • select(): Focusing investigation on key variables

  • arrange(): Organizing evidence to reveal patterns

  • mutate(): Creating new investigation angles

  • group_by() + summarise(): Systematic pattern analysis

🔍 filter(): Find Specific Clues

Control > shift > M

Detective Skill: Narrow down your evidence to specific cases of interest.

👉 Your Turn: Basic Detective Filtering ⤵

# EXAMPLE: Investigate expensive diamonds (our "high-value suspects")
diamonds |> 
  filter(price > 15000)
# A tibble: 1,655 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.54 Premium   E     VS2      62.3    58 15002  7.31  7.39  4.58
 2  1.19 Ideal     F     VVS1     61.5    55 15005  6.82  6.84  4.2 
 3  2.1  Premium   I     SI1      61.5    57 15007  8.25  8.21  5.06
 4  1.69 Ideal     D     SI1      60.8    57 15011  7.69  7.71  4.68
 5  1.5  Very Good G     VVS2     62.9    56 15013  7.22  7.32  4.57
 6  1.73 Very Good G     VS1      62.8    57 15014  7.57  7.72  4.8 
 7  2.02 Premium   G     SI2      63      59 15014  8.05  7.95  5.03
 8  2.05 Very Good F     SI2      61.9    56 15017  8.13  8.18  5.05
 9  1.5  Very Good F     VS1      61.6    58 15022  7.35  7.43  4.55
10  1.82 Very Good G     SI1      62.7    58 15025  7.68  7.75  4.84
# ℹ 1,645 more rows
# EXAMPLE: Look for specific diamond characteristics
diamonds |> 
  filter(cut == "Premium", color == "D")
# A tibble: 1,603 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
 2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
 3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
 4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
 5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
 6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
 7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
 8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
 9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
# ℹ 1,593 more rows
  • Filter by diamonds over 2 carats
#Filtering dimonds over 2 carats
diamonds |> 
  filter(carat > 2)
# A tibble: 1,889 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.06 Premium J     I1       61.2    58  5203  8.1   8.07  4.95
 2  2.14 Fair    J     I1       69.4    57  5405  7.74  7.7   5.36
 3  2.15 Fair    J     I1       65.5    57  5430  8.01  7.95  5.23
 4  2.22 Fair    J     I1       66.7    56  5607  8.04  8.02  5.36
 5  2.01 Fair    I     I1       67.4    58  5696  7.71  7.64  5.17
 6  2.01 Fair    I     I1       55.9    64  5696  8.48  8.39  4.71
 7  2.27 Fair    J     I1       67.6    55  5733  8.05  8     5.43
 8  2.03 Fair    H     I1       64.4    59  6002  7.91  7.85  5.07
 9  2.03 Fair    H     I1       66.6    57  6002  7.81  7.75  5.19
10  2.06 Good    H     I1       64.3    58  6091  8.03  7.99  5.15
# ℹ 1,879 more rows
  • Filter by flawless diamonds (clarity == “FL”)
#Filtering by flawless diamonds (clarity == "IF")
diamonds |> 
  filter(clarity == "IF")
# A tibble: 1,790 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.52 Ideal     F     IF       62.2    55  2783  5.14  5.18  3.21
 2  0.55 Ideal     G     IF       60.9    57  2789  5.28  5.3   3.22
 3  0.64 Ideal     G     IF       61.3    56  2790  5.54  5.58  3.41
 4  0.72 Premium   I     IF       63      57  2795  5.72  5.7   3.6 
 5  0.6  Very Good G     IF       61.6    56  2800  5.43  5.46  3.35
 6  0.61 Ideal     G     IF       62.3    56  2800  5.43  5.45  3.39
 7  0.53 Ideal     F     IF       61.9    54  2802  5.22  5.25  3.24
 8  0.56 Very Good E     IF       61      59  2833  5.28  5.34  3.24
 9  0.58 Very Good E     IF       60.6    59  2852  5.37  5.43  3.27
10  0.62 Fair      F     IF       60.1    61  2861  5.53  5.56  3.33
# ℹ 1,780 more rows
  • Filter by diamonds under $500
#Filter by diamonds under $500
diamonds |>
  filter(price < 500)
# A tibble: 1,729 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 1,719 more rows

👉 Your Turn: Advanced Detective Work OPTIONAL

Investigation Questions:

  • How many diamonds are both over 1 carat AND under $5000? (Potential good deals?)
diamonds |>
  filter(carat > 1)|>
  filter (price < 5000)
# A tibble: 3,756 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.17 Very Good J     I1       60.2    61  2774  6.83  6.9   4.13
 2  1.01 Premium   F     I1       61.8    60  2781  6.39  6.36  3.94
 3  1.01 Fair      E     I1       64.5    58  2788  6.29  6.21  4.03
 4  1.01 Premium   H     SI2      62.7    59  2788  6.31  6.22  3.93
 5  1.05 Very Good J     SI2      63.2    56  2789  6.49  6.45  4.09
 6  1.05 Fair      J     SI2      65.8    59  2789  6.41  6.27  4.18
 7  1.01 Fair      E     SI2      67.4    60  2797  6.19  6.05  4.13
 8  1.04 Premium   G     I1       62.2    58  2801  6.46  6.41  4   
 9  1.2  Fair      F     I1       64.6    56  2809  6.73  6.66  4.33
10  1.02 Premium   G     I1       60.3    58  2815  6.55  6.5   3.94
# ℹ 3,746 more rows
  • Find diamonds that are either “Ideal” cut OR “Premium” cut
#Find diamonds that are either "Ideal" cut OR "Premium" cut

diamonds |>
  filter(cut == "ideal" | cut =="Premium")
# A tibble: 13,791 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
 2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
 3  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
 4  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
 5  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 6  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
 7  0.29 Premium F     SI1      62.4    58   403  4.24  4.26  2.65
 8  0.22 Premium E     VS2      61.6    58   404  3.93  3.89  2.41
 9  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
10  0.3  Premium J     SI2      59.3    61   405  4.43  4.38  2.61
# ℹ 13,781 more rows
  • Investigate diamonds with missing depth measurements
#ADD CODE BELOW
  • Scan Anamolies
#ADD CODE BELOW

🔧 select(): Focus Your Investigation

Detective Skill: Choose which evidence to examine closely.

# EXAMPLE: Focus on pricing factors
diamonds |> 
  select(carat, cut, color, clarity, price)
# A tibble: 53,940 × 5
   carat cut       color clarity price
   <dbl> <ord>     <ord> <ord>   <int>
 1  0.23 Ideal     E     SI2       326
 2  0.21 Premium   E     SI1       326
 3  0.23 Good      E     VS1       327
 4  0.29 Premium   I     VS2       334
 5  0.31 Good      J     SI2       335
 6  0.24 Very Good J     VVS2      336
 7  0.24 Very Good I     VVS1      336
 8  0.26 Very Good H     SI1       337
 9  0.22 Fair      E     VS2       337
10  0.23 Very Good H     VS1       338
# ℹ 53,930 more rows
# EXAMPLE: Exclude complex measurements for now
diamonds |> 
  select(-x, -y, -z, -table, -depth)
# A tibble: 53,940 × 5
   carat cut       color clarity price
   <dbl> <ord>     <ord> <ord>   <int>
 1  0.23 Ideal     E     SI2       326
 2  0.21 Premium   E     SI1       326
 3  0.23 Good      E     VS1       327
 4  0.29 Premium   I     VS2       334
 5  0.31 Good      J     SI2       335
 6  0.24 Very Good J     VVS2      336
 7  0.24 Very Good I     VVS1      336
 8  0.26 Very Good H     SI1       337
 9  0.22 Fair      E     VS2       337
10  0.23 Very Good H     VS1       338
# ℹ 53,930 more rows

👉 Your Turn: Evidence Selection ⤵

  • Select 4 C’s
#Select 4 C's
diamonds |> 
  select(carat, cut, color, clarity)
# A tibble: 53,940 × 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# ℹ 53,930 more rows
  • select size and price
#Select size and price
diamonds |> 
  select(carat, price)
# A tibble: 53,940 × 2
   carat price
   <dbl> <int>
 1  0.23   326
 2  0.21   326
 3  0.23   327
 4  0.29   334
 5  0.31   335
 6  0.24   336
 7  0.24   336
 8  0.26   337
 9  0.22   337
10  0.23   338
# ℹ 53,930 more rows
diamonds |> 
  select(price, x, y, z)
# A tibble: 53,940 × 4
   price     x     y     z
   <int> <dbl> <dbl> <dbl>
 1   326  3.95  3.98  2.43
 2   326  3.89  3.84  2.31
 3   327  4.05  4.07  2.31
 4   334  4.2   4.23  2.63
 5   335  4.34  4.35  2.75
 6   336  3.94  3.96  2.48
 7   336  3.95  3.98  2.47
 8   337  4.07  4.11  2.53
 9   337  3.87  3.78  2.49
10   338  4     4.05  2.39
# ℹ 53,930 more rows
  • Select all colors *hint use contains() function
#Select all colors *Hint use contains () function
diamonds |> 
  select(color)
# A tibble: 53,940 × 1
   color
   <ord>
 1 E    
 2 E    
 3 E    
 4 I    
 5 J    
 6 J    
 7 I    
 8 H    
 9 E    
10 H    
# ℹ 53,930 more rows
#Anything in the variable name had color than this would pull all of the colors.
diamonds |> 
  select(contains("color"))
# A tibble: 53,940 × 1
   color
   <ord>
 1 E    
 2 E    
 3 E    
 4 I    
 5 J    
 6 J    
 7 I    
 8 H    
 9 E    
10 H    
# ℹ 53,930 more rows
  • Select what buyers care about and inspect first 5 rows
#Select what buyers care aout and inspect first 5 rows
diamonds |>
  select (cut, price, carat) |>
  head()
# A tibble: 6 × 3
  cut       price carat
  <ord>     <int> <dbl>
1 Ideal       326  0.23
2 Premium     326  0.21
3 Good        327  0.23
4 Premium     334  0.29
5 Good        335  0.31
6 Very Good   336  0.24

📏 arrange(): Organize Your Evidence

Detective Skill: Sort clues to reveal patterns.

# EXAMPLE: Find diamonds with unusual price patterns

diamonds |> 
  arrange(desc(carat), price) |>  # Large diamonds, but sort by price within size
  head()
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  5.01 Fair      J     I1       65.5    59 18018  10.7 10.5   6.98
2  4.5  Fair      J     I1       65.8    58 18531  10.2 10.2   6.72
3  4.13 Fair      H     I1       64.8    61 17329  10    9.85  6.43
4  4.01 Premium   I     I1       61      61 15223  10.1 10.1   6.17
5  4.01 Premium   J     I1       62.5    62 15223  10.0  9.94  6.24
6  4    Very Good I     I1       63.3    58 15984  10.0  9.94  6.31
# EXAMPLE: Look for potential data anomalies by sorting multiple ways
diamonds |> 
  arrange(price, desc(carat)) |>  # Cheapest first, but largest within price
  head()
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

👉 Your Turn: Pattern Recognition ⤵

Investigation Tasks:

  1. Arrange by most expensive diamonds, inspect first 5 rows
#Arrange by most expensive diamonds, inspect first 5 rows
diamonds |> 
  arrange(desc(price)) |>
  head()
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
  1. Arrange by the largest diamonds by carat, inspect first 10 rows
#Arrange by the largest diamonds by carat, inspect first 10 rows
diamonds |> 
  arrange(desc(carat)) |>
  head(10)
# A tibble: 10 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  5.01 Fair      J     I1       65.5    59 18018 10.7  10.5   6.98
 2  4.5  Fair      J     I1       65.8    58 18531 10.2  10.2   6.72
 3  4.13 Fair      H     I1       64.8    61 17329 10     9.85  6.43
 4  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17
 5  4.01 Premium   J     I1       62.5    62 15223 10.0   9.94  6.24
 6  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31
 7  3.67 Premium   I     I1       62.4    56 16193  9.86  9.81  6.13
 8  3.65 Fair      H     I1       67.1    53 11668  9.53  9.48  6.38
 9  3.51 Premium   J     VS2      62.5    59 18701  9.66  9.63  6.03
10  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03
  1. Arrange by cut quality, then by price (within each cut)
# Arrange by cut quality, then by price (within each cut)
diamonds |> 
  arrange(cut,price) |>
  head(10)
# A tibble: 10 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
  • Detective insight - cheapest diamonds (potential data issues?)
#Detective insight - cheapest diamonds (potential data issues?)
diamonds |>
  arrange(price) |>
  head (10)
# A tibble: 10 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39

➕ mutate(): Create New Investigation Angles

Detective Skill: Derive new insights from existing evidence.

# EXAMPLE: Create diamond quality rating system
diamonds |> 
  mutate(
    quality_rating = case_when(
      cut == "Ideal" & color %in% c("D", "E") ~ "Premium",
      cut %in% c("Premium", "Very Good") ~ "Good", 
      TRUE ~ "Standard"
    )
  ) |>
  select(cut, color, quality_rating) |>
  head()
# A tibble: 6 × 3
  cut       color quality_rating
  <ord>     <ord> <chr>         
1 Ideal     E     Premium       
2 Premium   E     Good          
3 Good      E     Standard      
4 Premium   I     Good          
5 Good      J     Standard      
6 Very Good J     Good          
# EXAMPLE: Flag potentially unusual diamonds for investigation
diamonds |> 
  mutate(
    investigation_flag = case_when(
      price < 500 ~ "Suspiciously_Cheap",
      carat > 3 ~ "Unusually_Large",
      price/carat < 1000 ~ "Great_Value_Check",
      TRUE ~ "Normal"
    )
  ) |>
  select(carat, price, investigation_flag) |>
  filter(investigation_flag != "Normal") |>
  head()
# A tibble: 6 × 3
  carat price investigation_flag
  <dbl> <int> <chr>             
1  0.23   326 Suspiciously_Cheap
2  0.21   326 Suspiciously_Cheap
3  0.23   327 Suspiciously_Cheap
4  0.29   334 Suspiciously_Cheap
5  0.31   335 Suspiciously_Cheap
6  0.24   336 Suspiciously_Cheap

👉 Your Turn: New Clues Creation ⤵

Investigation Tasks:

  1. Calculate price per carat (value efficiency)

  2. Create size categories: Small (<0.5), Medium (0.5-1), Large (1-2), Huge (>2)

  3. Determine if a diamond is “premium priced” (above median price)

🔑 Answer Key - New Investigation Angles:

# Investigation 1: SCAN Aggregates - price efficiency metric
diamonds |> 
  mutate(price_per_carat = price / carat) |> 
  select(carat, price, price_per_carat) |> 
  arrange(price_per_carat) |> 
  head()
# A tibble: 6 × 3
  carat price price_per_carat
  <dbl> <int>           <dbl>
1  0.43   452           1051.
2  0.32   345           1078.
3  0.31   335           1081.
4  0.33   366           1109.
5  0.31   344           1110.
6  0.32   361           1128.
# Investigation 2: SCAN Notable Segments - size categories
diamonds |> 
  mutate(
    size_category = case_when(
      carat < 0.5 ~ "Small",
      carat < 1.0 ~ "Medium", 
      carat < 2.0 ~ "Large",
      carat >= 2.0 ~ "Huge"
    )
  ) |> 
  count(size_category)
# A tibble: 4 × 2
  size_category     n
  <chr>         <int>
1 Huge           2154
2 Large         16906
3 Medium        17206
4 Small         17674
# Investigation 3: Market positioning analysis
median_price <- median(diamonds$price)
diamonds |> 
  mutate(
    price_tier = ifelse(price > median_price, "Premium", "Budget"),
    value_ratio = price / carat
  ) |> 
  select(price, carat, price_tier, value_ratio) |> 
  head()
# A tibble: 6 × 4
  price carat price_tier value_ratio
  <int> <dbl> <chr>            <dbl>
1   326  0.23 Budget           1417.
2   326  0.21 Budget           1552.
3   327  0.23 Budget           1422.
4   334  0.29 Budget           1152.
5   335  0.31 Budget           1081.
6   336  0.24 Budget           1400 
# Investigation 4: READY Data Quality - perfect dimensions indicator
diamonds |> 
  mutate(
    has_dimensions = !is.na(x) & !is.na(y) & !is.na(z) & x > 0 & y > 0 & z > 0
  ) |> 
  count(has_dimensions)
# A tibble: 2 × 2
  has_dimensions     n
  <lgl>          <int>
1 FALSE             20
2 TRUE           53920

📊 group_by() + summarise(): Systematic Investigation

Detective Skill: Analyze patterns across different categories of evidence.

# EXAMPLE: Investigate pricing patterns by diamond size categories
diamonds |> 
  mutate(size_group = case_when(
    carat < 0.5 ~ "Small",
    carat < 1.0 ~ "Medium",
    TRUE ~ "Large"
  )) |>
  group_by(size_group) |>
  summarise(
    diamond_count = n(),
    avg_price = round(mean(price)),
    price_range = max(price) - min(price),
    .groups = "drop"
  )
# A tibble: 3 × 4
  size_group diamond_count avg_price price_range
  <chr>              <int>     <dbl>       <int>
1 Large              19060      8142       17561
2 Medium             17206      2496        9052
3 Small              17674       792        2351
# EXAMPLE: Look for unexpected patterns across multiple characteristics
diamonds |> 
  group_by(cut, color) |>
  summarise(
    available = n(),
    typical_price = median(price),
    .groups = "drop"
  ) |>
  filter(available >= 50) |>  # Only combinations with decent sample size
  arrange(desc(typical_price))
# A tibble: 35 × 4
   cut       color available typical_price
   <ord>     <ord>     <int>         <dbl>
 1 Premium   J           808          5063
 2 Premium   I          1428          4640
 3 Premium   H          2360          4511
 4 Very Good J           678          4113
 5 Ideal     J           896          4096
 6 Very Good I          1204          3888
 7 Fair      H           303          3816
 8 Very Good H          1824          3734
 9 Good      J           307          3733
10 Fair      D           163          3730
# ℹ 25 more rows

👉 Your Turn: Pattern Analysis ⤵

Investigation Questions:

  1. What’s the average price for each cut quality?

  2. How many diamonds are in each color grade?

  3. What’s the price range (min/max) for each clarity level?

🔑 Systematic Analysis:

# Investigation 1: SCAN Notable Segments - cut quality impact

diamonds |> 
  group_by(cut) |> 
  summarise(avg_price = mean(price, na.rm = TRUE))
# A tibble: 5 × 2
  cut       avg_price
  <ord>         <dbl>
1 Fair          4359.
2 Good          3929.
3 Very Good     3982.
4 Premium       4584.
5 Ideal         3458.
# Investigation 2: SCAN Columns Coverage - color distribution  

diamonds |> 
  count(color, name = "n_diamonds")
# A tibble: 7 × 2
  color n_diamonds
  <ord>      <int>
1 D           6775
2 E           9797
3 F           9542
4 G          11292
5 H           8304
6 I           5422
7 J           2808
# Investigation 3: SCAN Aggregates - clarity price ranges

diamonds |> 
  group_by(clarity) |> 
  summarise(
    min_price = min(price, na.rm = TRUE),
    max_price = max(price, na.rm = TRUE)
  )
# A tibble: 8 × 3
  clarity min_price max_price
  <ord>       <int>     <int>
1 I1            345     18531
2 SI2           326     18804
3 SI1           326     18818
4 VS2           334     18823
5 VS1           327     18795
6 VVS2          336     18768
7 VVS1          336     18777
8 IF            369     18806
# Investigation 4: READY Executive Questions - stakeholder insight

🔗 Building Pipelines: The dplyr Way Here’s the general pattern we’ll use throughout the workshop:

data |>                          # 1. Start with data
  filter(condition) |>           # 2. Filter early (reduce rows)
  select(relevant_columns) |>    # 3. Select early (reduce columns)  
  mutate(new_variables) |>       # 4. Transform as needed
  group_by(grouping_vars) |>     # 5. Group for summaries
  summarise(summary_stats) |>    # 6. Calculate aggregates
  arrange(sorting_column)        # 7. Sort results

⚠️ Detective Best Practices

🔍 Investigation Mistakes to Avoid:

  1. Broken Evidence Chain
# ❌ Wrong - breaks the investigation flow 

diamonds |>      
  filter(price > 5000)  
  select(carat, price)  # Lost connection!    

# ✅ Correct investigation flow 
diamonds |>      
  filter(price > 5000) |>      
  select(carat, price)
  1. Ignoring Missing Evidence
# ❌ Missing data will give NA results 
diamonds |>      
  summarise(avg_depth = mean(depth))    

# ✅ Handle missing evidence properly 
diamonds |>      
  summarise(avg_depth = mean(depth, na.rm = TRUE))
  1. Comparison vs Assignment Confusion
# ❌ Wrong - this assigns, doesn't compare 

diamonds |>    
  filter(cut = "Ideal")   
  
# ✅ Correct - use == for comparison   
diamonds |>    
  filter(cut == "Ideal")