Week 3/4 - EDA Frameworks

Slides companion

Author

Dr. McClure

Student: Luis Tapia

Class : DSA 406 - 001

Introduction

This document demonstrates key principles of Exploratory Data Analysis (EDA) using R, guided by systematic detective frameworks. We’ll explore various techniques while following best practices for reproducibility and efficient data handling.

EDA: A Creative Inquiry with Structure

Exploratory Data Analysis is not just a step in data science but a creative process that benefits from systematic frameworks. Like a detective solving a case, we need both creativity and methodology.

We’ll use two complementary frameworks throughout this analysis:

  • READY Framework: Strategic planning before investigation

  • SCAN Framework: Tactical execution during investigation

The Detective Frameworks

READY Framework - Your Strategic Case File

Apply BEFORE you start investigating data

  • Representative Data: Do we have all the data we need?

  • Executive Driven Questions: What is the main question we’re trying to solve?

  • Analytical Frameworks: What is our plan of action?

  • Data Best Practices: What are the rules for keeping our data clean?

  • Your Insight, Your Impact: What did we learn, and how will it help?

Group Activity: Work with your neighbor to brainstorm questions for each READY component. We’ll share a few from each group.

R - Representative Data: What don’t we know about our dataset?

Your group’s questions:

# Add your group's R questions as comments:
# 1.Where does the data come from,(company or research firm)?
# 2.Are we looking for numerical or categorical data?

E - Executive Questions: What questions do stakeholders have that we can’t answer yet?

Your group’s questions:

# Add your group's E questions as comments:
# 1.How many total diamonds are there in all?
# 2.Which diamonds are most popular? (to know what to sell)

A - Analytical Framework: What’s our exploration strategy?

Your group’s questions:

# Add your group's A questions as comments:
# 1.How many types of columns or variables are present in the data to account for?
# 2.Looking at specifc variables to understand them better (cut, clarity,etc.)
# 3.Whats the plan, where to start to analyze it?

D - Data Best Practices: What data quality unknowns should we check?

Your group’s questions:

# Add your group's D questions as comments:
# 1.What are the types of each variable present?
# 2.Are there any null, invalid, or N/A values?

Y - Your Insights: What story might emerge from our investigation?

Your group’s questions:

# Add your group's Y questions as comments (Bracketing):
# 1.Are there any trends related to certain characteristics of diamonds?
# 2.Are there possible correlations between values, such as size and price?

Class Discussion: Each group shares one interesting question from any READY category.

SCAN Framework - Your Field Investigation Guide

Apply WHILE you’re exploring data

  • Stakeholder Goals: Who wants the answer to this question?

  • Columns and Coverage: What data do we have, and is any of it missing?

  • Aggregates and Anomalies: What are the overall trends, and are there any strange outliers?

  • Notable Segments: Can we break the data into smaller groups to find hidden clues?

Loading Required Libraries

# Function to check and install required packages
required_packages <- c("tidyverse", "corrplot")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load all packages at once
invisible(lapply(required_packages, library, character.only = TRUE))
Warning: package 'tidyverse' was built under R version 4.5.1
Warning: package 'ggplot2' was built under R version 4.5.1
Warning: package 'tibble' was built under R version 4.5.1
Warning: package 'tidyr' was built under R version 4.5.1
Warning: package 'readr' was built under R version 4.5.1
Warning: package 'purrr' was built under R version 4.5.1
Warning: package 'dplyr' was built under R version 4.5.1
Warning: package 'stringr' was built under R version 4.5.1
Warning: package 'forcats' was built under R version 4.5.1
Warning: package 'lubridate' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'corrplot' was built under R version 4.5.1
corrplot 0.95 loaded
# Loading packages individually
#library(tidyverse)  # Includes dplyr, ggplot2, tidyr
#library(corrplot)   # For correlation visualization

🌟 Meet Your Investigation Dataset: diamonds

We’ll use the built-in diamonds dataset to demonstrate EDA techniques. While this isn’t “big data”, the principles we’ll demonstrate scale to larger datasets.

🤔 Detective Rule #1: If you know nothing about diamonds, what’s the FIRST thing you should do?

data("diamonds") # creating new object
diamonds_df <- as_tibble(diamonds)

# First look at the structure of our evidence
diamonds_df |> # |> pipe operator
  head() |> 
  glimpse() 
Rows: 6
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good
$ color   <ord> E, E, E, I, J, J
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8
$ table   <dbl> 55, 61, 65, 58, 58, 57
$ price   <int> 326, 326, 327, 334, 335, 336
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48

‘’diamonds’ Dataset Dictionary

price price in US dollars (\$326–\$18,823)

carat weight of the diamond (0.2–5.01)

cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color diamond colour, from J (worst) to D (best)

clarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x length in mm (0–10.74)

y width in mm (0–58.9)

z depth in mm (0–31.8)

depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

table width of top of diamond relative to widest point (43–95)

Source: https://www.kaggle.com/datasets/shivam2503/diamonds

🔍 Investigation Exercise:

Without using any external resources, what can you discover about this dataset?

💡Think about….

  • How many observations and variables are there?

    There are 53,940 observations and 10 variables present.

  • What types of data are present?

    There are doubles (decimal), ordinal, and integer (whole) values present in the data.

📋 READY Framework - Strategic Investigation Planning

Before we touch the data, let’s plan our investigation systematically:

R - Representative Data: Do we have what we need?

1.Check the size and scope of our dataset nrow(diamonds)  

2.How many diamonds do we have? ncol(diamonds)?  

3.How many characteristics are there per diamond?  

4.What time period does this dataset represent? What geography? What market segmen?

Detective Assessment:

E - Executive Driven Questions: What do stakeholders want to know?

Who might want diamond insights and what questions would they ask?

Potential Stakeholders & Their Questions:

  • Jewelry buyers: “How do I get the best value for my budget?”

  • Jewelry retailers: “What inventory should I stock?”

  • Diamond investors: “Which characteristics drive the highest prices?”

  • EDA students: “How do I systematically investigate a new dataset?”

Our Primary Investigation Question: “What patterns exist in diamond characteristics and pricing that could inform purchasing decisions?”

A - Analytical Framework: What’s our investigation plan?

Our Systematic Investigation Strategy:

  1. Understand the domain (diamond characteristics and market)

  2. Explore data structure and quality (missing values, ranges, distributions)

  3. Investigate individual variables (what does each characteristic tell us?)

  4. Examine relationships (how do characteristics relate to price?)

  5. Segment analysis (do patterns differ across groups?)

  6. Synthesize findings (what actionable insights emerged?)

D - Data Best Practices: How do we ensure investigation quality?

  1. Check for missing values across all variables

  2. Verify data types are appropriate for each variable

  3. Validate ranges make sense (no negative prices, reasonable measurements)

  4. Look for duplicates or data entry errors

  5. Document any transformations we make during analysis

  6. Handle outliers systematically (investigate, don’t just remove)

Y - Your Insights: What do we expect to learn?

Expected Investigation Outcomes:

  1. Understanding of how diamond characteristics (4 Cs) relate to pricing

  2. Identification of value opportunities across different market segments

  3. Recognition of data patterns and anomalies

  4. Development of systematic EDA skills using dplyr

Success Criteria: Can we provide actionable recommendations for different stakeholder groups based on data evidence?

📚 Quick Diamond Knowledge (What Detectives Need to Know):

  • Carat: Weight/size (bigger = more expensive, usually)

  • Cut: Quality of diamond cutting (Ideal > Premium > Very Good > Good > Fair)

  • Color: D=colorless (best) → J=light yellow (worst in this dataset)

  • Clarity: FL=flawless → I1=inclusions visible (FL > IF > VVS1 > VVS2 > VS1 > VS2 > SI1 > SI2 > I1)

  • Price: Market value in US dollars

  • Dimensions (x,y,z): Physical measurements in mm

Detective Insight: Now that we have our strategic plan AND understand our evidence, we can start systematic investigation using SCAN!

🧠 The dplyr Mindset: Think in Verbs

💡 Traditional R vs. dplyr Approach

Traditional R thinking:

# Multiple objects, hard to follow 

subset_data <- subset(starwars, species == "Human") 
selected_data <- subset_data[, c("name", "height", "mass")] 

ordered_data <- selected_data[order(selected_data$height,decreasing = TRUE),]

#displays the adjusted data
selected_data
# A tibble: 35 × 3
   name               height  mass
   <chr>               <int> <dbl>
 1 Luke Skywalker        172    77
 2 Darth Vader           202   136
 3 Leia Organa           150    49
 4 Owen Lars             178   120
 5 Beru Whitesun Lars    165    75
 6 Biggs Darklighter     183    84
 7 Obi-Wan Kenobi        182    77
 8 Anakin Skywalker      188    84
 9 Wilhuff Tarkin        180    NA
10 Han Solo              180    80
# ℹ 25 more rows

dplyr thinking:

# One pipeline, easy to read 

starwars |>   
  filter(species == "Human") |> #filter species (Human)
  select(name, height, mass) |> #select name, height, mass variables 
  arrange(desc(height)) #arrange by descending order by height
# A tibble: 35 × 3
   name                height  mass
   <chr>                <int> <dbl>
 1 Darth Vader            202 136  
 2 Qui-Gon Jinn           193  89  
 3 Dooku                  193  80  
 4 Bail Prestor Organa    191  NA  
 5 Anakin Skywalker       188  84  
 6 Mace Windu             188  84  
 7 Raymus Antilles        188  79  
 8 Padmé Amidala          185  45  
 9 Biggs Darklighter      183  84  
10 Boba Fett              183  78.2
# ℹ 25 more rows

📚 Complete dplyr Verb Coverage

  • filter(): Finding specific clues and evidence

  • select(): Focusing investigation on key variables

  • arrange(): Organizing evidence to reveal patterns

  • mutate(): Creating new investigation angles

  • group_by() + summarise(): Systematic pattern analysis

🔍 filter(): Find Specific Clues

Detective Skill: Narrow down your evidence to specific cases of interest.

👉 Your Turn: Basic Detective Filtering ⤵

# EXAMPLE: Investigate expensive diamonds (our "high-value suspects")
diamonds |> 
  filter(price > 15000)
# A tibble: 1,655 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.54 Premium   E     VS2      62.3    58 15002  7.31  7.39  4.58
 2  1.19 Ideal     F     VVS1     61.5    55 15005  6.82  6.84  4.2 
 3  2.1  Premium   I     SI1      61.5    57 15007  8.25  8.21  5.06
 4  1.69 Ideal     D     SI1      60.8    57 15011  7.69  7.71  4.68
 5  1.5  Very Good G     VVS2     62.9    56 15013  7.22  7.32  4.57
 6  1.73 Very Good G     VS1      62.8    57 15014  7.57  7.72  4.8 
 7  2.02 Premium   G     SI2      63      59 15014  8.05  7.95  5.03
 8  2.05 Very Good F     SI2      61.9    56 15017  8.13  8.18  5.05
 9  1.5  Very Good F     VS1      61.6    58 15022  7.35  7.43  4.55
10  1.82 Very Good G     SI1      62.7    58 15025  7.68  7.75  4.84
# ℹ 1,645 more rows
# EXAMPLE: Look for specific diamond characteristics
diamonds |> 
  filter(cut == "Premium", color == "D")
# A tibble: 1,603 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
 2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
 3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
 4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
 5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
 6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
 7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
 8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
 9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
# ℹ 1,593 more rows
  • Filter by diamonds over 2 carats
#ADD CODE BELOW

#created new instance of diamond dataset that only filters by carat > 2
diamondCaratTwo <- diamonds |>
  filter(carat > 2)

diamondCaratTwo
# A tibble: 1,889 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.06 Premium J     I1       61.2    58  5203  8.1   8.07  4.95
 2  2.14 Fair    J     I1       69.4    57  5405  7.74  7.7   5.36
 3  2.15 Fair    J     I1       65.5    57  5430  8.01  7.95  5.23
 4  2.22 Fair    J     I1       66.7    56  5607  8.04  8.02  5.36
 5  2.01 Fair    I     I1       67.4    58  5696  7.71  7.64  5.17
 6  2.01 Fair    I     I1       55.9    64  5696  8.48  8.39  4.71
 7  2.27 Fair    J     I1       67.6    55  5733  8.05  8     5.43
 8  2.03 Fair    H     I1       64.4    59  6002  7.91  7.85  5.07
 9  2.03 Fair    H     I1       66.6    57  6002  7.81  7.75  5.19
10  2.06 Good    H     I1       64.3    58  6091  8.03  7.99  5.15
# ℹ 1,879 more rows
  • Filter by flawless diamonds (clarity == “FL”)
#ADD CODE BELOW

#using the other pipe/chain method
#adjusted code
diamondsFlawless <- diamonds %>% 
  #filter selection
  filter(clarity == "IF")

diamondsFlawless
# A tibble: 1,790 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.52 Ideal     F     IF       62.2    55  2783  5.14  5.18  3.21
 2  0.55 Ideal     G     IF       60.9    57  2789  5.28  5.3   3.22
 3  0.64 Ideal     G     IF       61.3    56  2790  5.54  5.58  3.41
 4  0.72 Premium   I     IF       63      57  2795  5.72  5.7   3.6 
 5  0.6  Very Good G     IF       61.6    56  2800  5.43  5.46  3.35
 6  0.61 Ideal     G     IF       62.3    56  2800  5.43  5.45  3.39
 7  0.53 Ideal     F     IF       61.9    54  2802  5.22  5.25  3.24
 8  0.56 Very Good E     IF       61      59  2833  5.28  5.34  3.24
 9  0.58 Very Good E     IF       60.6    59  2852  5.37  5.43  3.27
10  0.62 Fair      F     IF       60.1    61  2861  5.53  5.56  3.33
# ℹ 1,780 more rows
  • Filter by diamonds under $500
#ADD CODE BELOW

#adjusted code
diamondsBudget <- diamonds %>% 
  filter(price < 500) #under $500 in price

diamondsBudget
# A tibble: 1,729 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 1,719 more rows

👉 Your Turn: Advanced Detective Work OPTIONAL

Investigation Questions:

  • How many diamonds are both over 1 carat AND under $5000? (Potential good deals?)
#ADD CODE BELOW

#carat level over 1 and under $5000 in price
diamondsDeal <- diamonds %>% 
  filter(price < 5000, carat > 1 )

diamondsDeal
# A tibble: 3,756 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.17 Very Good J     I1       60.2    61  2774  6.83  6.9   4.13
 2  1.01 Premium   F     I1       61.8    60  2781  6.39  6.36  3.94
 3  1.01 Fair      E     I1       64.5    58  2788  6.29  6.21  4.03
 4  1.01 Premium   H     SI2      62.7    59  2788  6.31  6.22  3.93
 5  1.05 Very Good J     SI2      63.2    56  2789  6.49  6.45  4.09
 6  1.05 Fair      J     SI2      65.8    59  2789  6.41  6.27  4.18
 7  1.01 Fair      E     SI2      67.4    60  2797  6.19  6.05  4.13
 8  1.04 Premium   G     I1       62.2    58  2801  6.46  6.41  4   
 9  1.2  Fair      F     I1       64.6    56  2809  6.73  6.66  4.33
10  1.02 Premium   G     I1       60.3    58  2815  6.55  6.5   3.94
# ℹ 3,746 more rows
  • Find diamonds that are either “Ideal” cut OR “Premium” cut
#ADD CODE BELOW

#cut seletion is either 'Ideal' or 'Premium'
diamondCut <- diamonds %>% 
  # | is the or operator
  filter( cut == "Ideal" | cut == "Premium")

diamondCut
# A tibble: 35,342 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
 4  0.23 Ideal   J     VS1      62.8    56   340  3.93  3.9   2.46
 5  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
 6  0.31 Ideal   J     SI2      62.2    54   344  4.35  4.37  2.71
 7  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
 8  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 9  0.3  Ideal   I     SI2      62      54   348  4.31  4.34  2.68
10  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
# ℹ 35,332 more rows
  • Investigate diamonds with missing depth measurements
#ADD CODE BELOW

#note: the measure of depth applies to both the z and 'depth' 
missingDepthData <- diamonds %>% 
  filter(z == 0 | depth == 0)

missingDepthData
# A tibble: 20 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
 2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
 3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
 4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
 5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
 6  1.07 Ideal     F     SI2      61.6    56  4954  0     6.62     0
 7  1    Very Good H     VS2      63.3    53  5139  0     0        0
 8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
 9  1.14 Fair      G     VS1      57.5    67  6381  0     0        0
10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
11  1.56 Ideal     G     VS2      62.2    54 12800  0     0        0
12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
13  1.2  Premium   D     VVS1     62.1    59 15686  0     0        0
14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
15  2.25 Premium   H     SI2      62.8    59 18034  0     0        0
16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
18  0.71 Good      F     SI2      64.1    60  2130  0     0        0
19  0.71 Good      F     SI2      64.1    60  2130  0     0        0
20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0
  • Scan Anomalies
#ADD CODE BELOW

#start off with missing measurements or 0 values 
#that indicates inconsistencies
missingMeasurements <- diamonds %>% 
  filter(x == 0 | y== 0 | y== 0)

missingMeasurements
# A tibble: 8 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
2  1    Very Good H     VS2      63.3    53  5139     0  0        0
3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
8  0.71 Good      F     SI2      64.1    60  2130     0  0        0
#check for any duplicated values using duplicated() and sum()
sum(duplicated(diamonds))
[1] 146

🔧 select(): Focus Your Investigation

Detective Skill: Choose which evidence to examine closely.

# EXAMPLE: Focus on pricing factors
diamonds |> 
  select(carat, cut, color, clarity, price)
# A tibble: 53,940 × 5
   carat cut       color clarity price
   <dbl> <ord>     <ord> <ord>   <int>
 1  0.23 Ideal     E     SI2       326
 2  0.21 Premium   E     SI1       326
 3  0.23 Good      E     VS1       327
 4  0.29 Premium   I     VS2       334
 5  0.31 Good      J     SI2       335
 6  0.24 Very Good J     VVS2      336
 7  0.24 Very Good I     VVS1      336
 8  0.26 Very Good H     SI1       337
 9  0.22 Fair      E     VS2       337
10  0.23 Very Good H     VS1       338
# ℹ 53,930 more rows
# EXAMPLE: Exclude complex measurements for now
diamonds |> 
  select(-x, -y, -z, -table, -depth)
# A tibble: 53,940 × 5
   carat cut       color clarity price
   <dbl> <ord>     <ord> <ord>   <int>
 1  0.23 Ideal     E     SI2       326
 2  0.21 Premium   E     SI1       326
 3  0.23 Good      E     VS1       327
 4  0.29 Premium   I     VS2       334
 5  0.31 Good      J     SI2       335
 6  0.24 Very Good J     VVS2      336
 7  0.24 Very Good I     VVS1      336
 8  0.26 Very Good H     SI1       337
 9  0.22 Fair      E     VS2       337
10  0.23 Very Good H     VS1       338
# ℹ 53,930 more rows

👉 Your Turn: Evidence Selection ⤵

  • Select 4 C’s
#ADD CODE BELOW

#4C's are cut, clarity, color, carat
diamondsOne <- diamonds %>% 
  select (cut,clarity,color,carat)

diamondsOne
# A tibble: 53,940 × 4
   cut       clarity color carat
   <ord>     <ord>   <ord> <dbl>
 1 Ideal     SI2     E      0.23
 2 Premium   SI1     E      0.21
 3 Good      VS1     E      0.23
 4 Premium   VS2     I      0.29
 5 Good      SI2     J      0.31
 6 Very Good VVS2    J      0.24
 7 Very Good VVS1    I      0.24
 8 Very Good SI1     H      0.26
 9 Fair      VS2     E      0.22
10 Very Good VS1     H      0.23
# ℹ 53,930 more rows
  • select size and price
#ADD CODE BELOW

#size dimensions and price
diamondsTwo <- diamonds %>% 
  select (x, y, z, carat, price)

diamondsTwo
# A tibble: 53,940 × 5
       x     y     z carat price
   <dbl> <dbl> <dbl> <dbl> <int>
 1  3.95  3.98  2.43  0.23   326
 2  3.89  3.84  2.31  0.21   326
 3  4.05  4.07  2.31  0.23   327
 4  4.2   4.23  2.63  0.29   334
 5  4.34  4.35  2.75  0.31   335
 6  3.94  3.96  2.48  0.24   336
 7  3.95  3.98  2.47  0.24   336
 8  4.07  4.11  2.53  0.26   337
 9  3.87  3.78  2.49  0.22   337
10  4     4.05  2.39  0.23   338
# ℹ 53,930 more rows
  • Select all colors *hint use contains() function
#ADD CODE BELOW

#utilzing color as key
diamondsColor <- diamonds %>%
  #if there is a "color" name as part of column naming scheme
  select(contains("color")) 

diamondsColor
# A tibble: 53,940 × 1
   color
   <ord>
 1 E    
 2 E    
 3 E    
 4 I    
 5 J    
 6 J    
 7 I    
 8 H    
 9 E    
10 H    
# ℹ 53,930 more rows
  • Select what buyers care about and inspect first 5 rows
#ADD CODE BELOW

#carat, cut, and price are what most matters to buyers 
#that contributes in bringing value
diamondQuality <- diamonds %>% 
  select(carat,cut,price) %>% 
  head(5)

diamondQuality
# A tibble: 5 × 3
  carat cut     price
  <dbl> <ord>   <int>
1  0.23 Ideal     326
2  0.21 Premium   326
3  0.23 Good      327
4  0.29 Premium   334
5  0.31 Good      335

📏 arrange(): Organize Your Evidence

Detective Skill: Sort clues to reveal patterns.

# EXAMPLE: Find diamonds with unusual price patterns
exampleOne <- diamonds |> 
  arrange(desc(carat), price) |>  # Large diamonds, but sort by price within size
  head()
  glimpse(exampleOne) #displays in a horizontal manner
Rows: 6
Columns: 10
$ carat   <dbl> 5.01, 4.50, 4.13, 4.01, 4.01, 4.00
$ cut     <ord> Fair, Fair, Fair, Premium, Premium, Very Good
$ color   <ord> J, J, H, I, J, I
$ clarity <ord> I1, I1, I1, I1, I1, I1
$ depth   <dbl> 65.5, 65.8, 64.8, 61.0, 62.5, 63.3
$ table   <dbl> 59, 58, 61, 61, 62, 58
$ price   <int> 18018, 18531, 17329, 15223, 15223, 15984
$ x       <dbl> 10.74, 10.23, 10.00, 10.14, 10.02, 10.01
$ y       <dbl> 10.54, 10.16, 9.85, 10.10, 9.94, 9.94
$ z       <dbl> 6.98, 6.72, 6.43, 6.17, 6.24, 6.31
# EXAMPLE: Look for potential data anomalies by sorting multiple ways
diamonds |> 
  arrange(price, desc(carat)) |>  # Cheapest first, but largest within price
  head()
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

👉 Your Turn: Pattern Recognition ⤵

Investigation Tasks:

  1. Arrange by most expensive diamonds, inspect first 5 rows
#ADD CODE BELOW

#arranges diamond values by descending order in price
diamondsExpensive <- diamonds %>% 
  arrange(desc(price))

#prints/inspects the first 5 rows
head(diamondsExpensive,5)
# A tibble: 5 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
  1. Arrange by the largest diamonds by carat, inspect first 10 rows
#ADD CODE BELOW

#order by carat size of diamond
diamondsLarge <- diamonds %>% 
  arrange(desc(carat)) %>%
  head(10)

diamondsLarge
# A tibble: 10 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  5.01 Fair      J     I1       65.5    59 18018 10.7  10.5   6.98
 2  4.5  Fair      J     I1       65.8    58 18531 10.2  10.2   6.72
 3  4.13 Fair      H     I1       64.8    61 17329 10     9.85  6.43
 4  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17
 5  4.01 Premium   J     I1       62.5    62 15223 10.0   9.94  6.24
 6  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31
 7  3.67 Premium   I     I1       62.4    56 16193  9.86  9.81  6.13
 8  3.65 Fair      H     I1       67.1    53 11668  9.53  9.48  6.38
 9  3.51 Premium   J     VS2      62.5    59 18701  9.66  9.63  6.03
10  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03
  1. Arrange by cut quality, then by price (within each cut)
#ADD CODE BELOW

#arrange by cut quality first, then price
diamondsCutPrice <-diamonds %>% 
  arrange(cut,price) 

diamondsCutPrice
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows
  • Detective insight - cheapest diamonds (potential data issues?)
#ADD CODE BELOW

#starts off with the lowest prices
diamondsCheap <- diamonds %>% 
  arrange(price) 

diamondsCheap
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

➕ mutate(): Create New Investigation Angles

Detective Skill: Derive new insights from existing evidence.

# EXAMPLE: Create diamond quality rating system
diamonds |> 
  mutate(
    quality_rating = case_when(
      cut == "Ideal" & color %in% c("D", "E") ~ "Premium",
      cut %in% c("Premium", "Very Good") ~ "Good", 
      TRUE ~ "Standard"
    )
  ) |>
  select(cut, color, quality_rating) |>
  head()
# A tibble: 6 × 3
  cut       color quality_rating
  <ord>     <ord> <chr>         
1 Ideal     E     Premium       
2 Premium   E     Good          
3 Good      E     Standard      
4 Premium   I     Good          
5 Good      J     Standard      
6 Very Good J     Good          
# EXAMPLE: Flag potentially unusual diamonds for investigation
diamonds |> 
  mutate(
    investigation_flag = case_when(
      price < 500 ~ "Suspiciously_Cheap",
      carat > 3 ~ "Unusually_Large",
      price/carat < 1000 ~ "Great_Value_Check",
      TRUE ~ "Normal"
    )
  ) |>
  select(carat, price, investigation_flag) |>
  filter(investigation_flag != "Normal") |>
  head()
# A tibble: 6 × 3
  carat price investigation_flag
  <dbl> <int> <chr>             
1  0.23   326 Suspiciously_Cheap
2  0.21   326 Suspiciously_Cheap
3  0.23   327 Suspiciously_Cheap
4  0.29   334 Suspiciously_Cheap
5  0.31   335 Suspiciously_Cheap
6  0.24   336 Suspiciously_Cheap

👉 Your Turn: New Clues Creation ⤵

Investigation Tasks:

  1. Calculate price per carat (value efficiency)

  2. Create size categories: Small (<0.5), Medium (0.5-1), Large (1-2), Huge (>2)

  3. Determine if a diamond is “premium priced” (above median price)

New Investigation Angles:

# Investigation 1: SCAN Aggregates - price efficiency metric
priceEfficiency <- diamonds %>% 
  mutate(pricePerCarat = price / carat) %>% 
  select(carat, price, pricePerCarat) %>%
  arrange(pricePerCarat) %>%
  head()

priceEfficiency
# A tibble: 6 × 3
  carat price pricePerCarat
  <dbl> <int>         <dbl>
1  0.43   452         1051.
2  0.32   345         1078.
3  0.31   335         1081.
4  0.33   366         1109.
5  0.31   344         1110.
6  0.32   361         1128.
# Investigation 2: SCAN Notable Segments - size categories
sizeClass <- diamonds %>%
  mutate(
    sizeClass = case_when(
      carat < 0.5 ~ "Small",
      carat < 1.0 ~ "Medium", 
      carat < 2.0 ~ "Large",
      carat >= 2.0 ~ "Huge"
    )
  ) %>% 
  count(sizeClass)
  
  sizeClass
# A tibble: 4 × 2
  sizeClass     n
  <chr>     <int>
1 Huge       2154
2 Large     16906
3 Medium    17206
4 Small     17674
# Investigation 3: Market positioning analysis
medianPrice <- median(diamonds$price)

diamondPosition <- diamonds %>% 
  mutate(
    priceTier = ifelse(price > medianPrice, "Premium", "Budget"),
    caratPriceRatio = price / carat
  ) %>%
  select(price, carat, priceTier, caratPriceRatio) %>% 
  head()

diamondPosition
# A tibble: 6 × 4
  price carat priceTier caratPriceRatio
  <int> <dbl> <chr>               <dbl>
1   326  0.23 Budget              1417.
2   326  0.21 Budget              1552.
3   327  0.23 Budget              1422.
4   334  0.29 Budget              1152.
5   335  0.31 Budget              1081.
6   336  0.24 Budget              1400 
# Investigation 4: READY Data Quality - perfect dimensions indicator
diamondDimensions<- diamonds %>% 
  mutate(
    #checks for na values in dimensions and validates
    dimensionsIndicator = !is.na(x) & !is.na(y) & !is.na(z) & x > 0 & y > 0 & z > 0
  ) %>%
  count(dimensionsIndicator)

diamondDimensions
# A tibble: 2 × 2
  dimensionsIndicator     n
  <lgl>               <int>
1 FALSE                  20
2 TRUE                53920

📊 group_by() + summarise(): Systematic Investigation

Detective Skill: Analyze patterns across different categories of evidence.

# EXAMPLE: Investigate pricing patterns by diamond size 
diamonds |> 
  mutate(size_group = case_when(
    carat < 0.5 ~ "Small",
    carat < 1.0 ~ "Medium",
    TRUE ~ "Large"
  )) |>
  group_by(size_group) |>
  summarise(
    diamond_count = n(),
    avg_price = round(mean(price)),
    price_range = max(price) - min(price),
    .groups = "drop"
  )
# A tibble: 3 × 4
  size_group diamond_count avg_price price_range
  <chr>              <int>     <dbl>       <int>
1 Large              19060      8142       17561
2 Medium             17206      2496        9052
3 Small              17674       792        2351
# EXAMPLE: Look for unexpected patterns across characteristics
diamonds |> 
  group_by(cut, color) |>
  summarise(
    available = n(),
    typical_price = median(price),
    .groups = "drop"
  ) |>
  filter(available >= 50) |>  
  # Only combinations with decent sample size
  arrange(desc(typical_price))
# A tibble: 35 × 4
   cut       color available typical_price
   <ord>     <ord>     <int>         <dbl>
 1 Premium   J           808          5063
 2 Premium   I          1428          4640
 3 Premium   H          2360          4511
 4 Very Good J           678          4113
 5 Ideal     J           896          4096
 6 Very Good I          1204          3888
 7 Fair      H           303          3816
 8 Very Good H          1824          3734
 9 Good      J           307          3733
10 Fair      D           163          3730
# ℹ 25 more rows

👉 Your Turn: Pattern Analysis ⤵

Investigation Questions:

  1. What’s the average price for each cut quality?

  2. How many diamonds are in each color grade?

  3. What’s the price range (min/max) for each clarity level?

Systematic Analysis:

# Investigation 1: SCAN Notable Segments - cut quality impact

#cut quality impact on average price
#used na.rm = TRUE to account for NA Values
cutQualityImpact <- diamonds %>% 
  group_by(cut) %>%  
  summarise(avgPrice = mean(price, na.rm = TRUE))

cutQualityImpact
# A tibble: 5 × 2
  cut       avgPrice
  <ord>        <dbl>
1 Fair         4359.
2 Good         3929.
3 Very Good    3982.
4 Premium      4584.
5 Ideal        3458.
# Investigation 2: SCAN Columns Coverage - color distribution 

#count based on color
colorDistribution <- diamonds %>% 
  count(color)

colorDistribution
# A tibble: 7 × 2
  color     n
  <ord> <int>
1 D      6775
2 E      9797
3 F      9542
4 G     11292
5 H      8304
6 I      5422
7 J      2808
# Investigation 3: SCAN Aggregates - clarity price ranges

#creates price range from clarity
priceRangeClarity <- diamonds %>%  
  group_by(clarity) %>%  
  
#provides minimum to maximum price range 
  summarise(
    minPrice = min(price, na.rm = TRUE),
    maxPrice = max(price, na.rm = TRUE)
  )

priceRangeClarity
# A tibble: 8 × 3
  clarity minPrice maxPrice
  <ord>      <int>    <int>
1 I1           345    18531
2 SI2          326    18804
3 SI1          326    18818
4 VS2          334    18823
5 VS1          327    18795
6 VVS2         336    18768
7 VVS1         336    18777
8 IF           369    18806
# Investigation 4: READY Executive Questions - stakeholder insight

#approach: How much can changes in carat level affect pricing?

#check the range on carat levels for dataset
#min(diamonds$carat)
#max(diamonds$carat)
#sum(diamonds$carat == 1)

#determining mean carat level of dataset
#mean(diamonds$carat)

#note: for this analysis the target of 1 would be used 

#organization 
#organize carat level by less than 1, exactly 1,and over 1
thresholdsCarat <- diamonds %>%
  mutate(
    #running multiple conditions using case_when () 
    #to classify under threshold
    carat_group = case_when(
      carat < 1.00 ~ "Less than 1.00",
      carat == 1.00 ~ "Exactly 1.00",
      carat > 1.00 ~ "Over 1.00"
    )
  ) 

#this is where the table gets put together
pricesData <- thresholdsCarat%>%
  group_by(carat_group) %>%
  summarise(
    avgPrice = mean(price, na.rm = TRUE)
  )

pricesData
# A tibble: 3 × 2
  carat_group    avgPrice
  <chr>             <dbl>
1 Exactly 1.00      5242.
2 Less than 1.00    1633.
3 Over 1.00         8400.

🔗 Building Pipelines: The dplyr Way Here’s the general pattern we’ll use throughout the workshop:

data |>                          # 1. Start with data
  filter(condition) |>           # 2. Filter early (reduce rows)
  select(relevant_columns) |>    # 3. Select early (reduce columns)  
  mutate(new_variables) |>       # 4. Transform as needed
  group_by(grouping_vars) |>     # 5. Group for summaries
  summarise(summary_stats) |>    # 6. Calculate aggregates
  arrange(sorting_column)        # 7. Sort results

⚠️ Detective Best Practices

🔍 Investigation Mistakes to Avoid:

  1. Broken Evidence Chain
# ❌ Wrong - breaks the investigation flow 

diamonds |>      
  filter(price > 5000)  
  select(carat, price)  # Lost connection!    

# ✅ Correct investigation flow 
diamonds |>      
  filter(price > 5000) |>      
  select(carat, price)
  1. Ignoring Missing Evidence
# ❌ Missing data will give NA results 
diamonds |>      
  summarise(avg_depth = mean(depth))    

# ✅ Handle missing evidence properly 
diamonds |>      
  summarise(avg_depth = mean(depth, na.rm = TRUE))
  1. Comparison vs Assignment Confusion
# ❌ Wrong - this assigns, doesn't compare 

diamonds |>    
  filter(cut = "Ideal")   
  
# ✅ Correct - use == for comparison   
diamonds |>    
  filter(cut == "Ideal")