10 Exploratory data analysis

set up

download the package and data

if(!require(tidyverse))
        {install.packages("tidyverse")}

## Loading required package: tidyverse

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyverse)

if(!require(nycflights13)){install.packages("nycflights13")}

## Loading required package: nycflights13

## Warning: package 'nycflights13' was built under R version 4.5.2

library(nycflights13)  

if(!require(hexbin)){install.packages("hexbin")}

## Loading required package: hexbin

## Warning: package 'hexbin' was built under R version 4.5.2

library(hexbin)

Introduction

What type of variation occurs within my variables?

What type of covariation occurs between my variables?

10.3 Variation

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

smaller <- diamonds |> 
        filter(carat < 3)
ggplot(smaller, aes(x = carat)) +
        geom_histogram(binwidth = 0.01)

The Cartesian coordinate system is the most familiar, and common, type of coordinate system. Setting limits on the coordinate system will zoom the plot (like you’re looking at it with a magnifying glass), and will not change the underlying data like setting limits on a scale will.

ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5)

#To make it easy to see the unusual values, we need to zoom to small values of the y-axis with
ggplot(diamonds, aes(x = y)) + 
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

unusual <- diamonds |> 
  filter(y < 3 | y > 20) |> 
  select(price, x, y, z) |>
  arrange(y)
unusual

## # A tibble: 9 × 4
##   price     x     y     z
##   <int> <dbl> <dbl> <dbl>
## 1  5139  0      0    0   
## 2  6381  0      0    0   
## 3 12800  0      0    0   
## 4 15686  0      0    0   
## 5 18034  0      0    0   
## 6  2130  0      0    0   
## 7  2130  0      0    0   
## 8  2075  5.15  31.8  5.12
## 9 12210  8.09  58.9  8.06

ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.

Repeat your analysis with and without the outliers. - If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. - If they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g., a data entry error) and disclose that you removed them in your write-up.

10.3.3 Exercises

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

long_diamonds <-diamonds |> 
        select(carat, price, x, y, z) |> 
        pivot_longer(cols = x:z,
                     names_to  = "dimension",
                     values_to = "value"
        )
long_diamonds

## # A tibble: 161,820 × 4
##    carat price dimension value
##    <dbl> <int> <chr>     <dbl>
##  1  0.23   326 x          3.95
##  2  0.23   326 y          3.98
##  3  0.23   326 z          2.43
##  4  0.21   326 x          3.89
##  5  0.21   326 y          3.84
##  6  0.21   326 z          2.31
##  7  0.23   327 x          4.05
##  8  0.23   327 y          4.07
##  9  0.23   327 z          2.31
## 10  0.29   334 x          4.2 
## # ℹ 161,810 more rows

p <-ggplot(long_diamonds, 
       aes(x = value)) +
        coord_cartesian(xlim = c(0, 10))
p + geom_histogram(aes(fill = dimension),
                       binwidth = 0.5,
                       position = "dodge")

p + geom_density(aes(colour = dimension, linetype = dimension),
                 alpha = 0.3,
                 linewidth = 1)

ggplot(long_diamonds,
       aes(x = value, y = price, colour = dimension))+
        geom_point(alpha = 0.3, size = 0.5)+
        coord_cartesian(xlim = c(0, 12.5), ylim = c(0, 20000))+ 
        geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

z = ldepth, x/y = length or width

ggplot(diamonds,
       aes(x = x, y = y)
       )+
       geom_point(shape = 1, alpha = 0.2)+
        coord_cartesian(xlim = c(0, 12.5), ylim = c(0, 12.5))

ggplot(diamonds,
       aes(x = x, y = z)
       )+
       geom_point(alpha = 0.2)+
        coord_cartesian(xlim = c(0, 12.5), ylim = c(0, 12.5))

ggplot(diamonds,
       aes(x = y, y = z)
       )+
       geom_point(alpha = 0.2)+
        coord_cartesian(xlim = c(0, 12.5), ylim = c(0, 12.5))

Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

p <- ggplot(diamonds, aes(x = price,))
p + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

p + geom_histogram(binwidth = 500)

p + geom_histogram(binwidth = 100, aes(fill = cut_number(carat, 5)))

p + geom_histogram(binwidth = 50) + coord_cartesian(xlim = c(0, 2500))

p + geom_histogram(binwidth = 50, aes(fill = cut_number(carat, 5))) + 
        coord_cartesian(xlim = c(0, 2500))

3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

p <- ggplot(diamonds, aes(carat))
p + geom_histogram(binwidth = 0.01)

diamonds |> 
        filter(carat == 0.99 | carat == 1) |>
        ggplot(aes(x = as.factor(carat))) + # change the numerical variable to categorical variable
        geom_bar(fill = "blue") +
        geom_text(stat = "count", 
                  aes(label = after_stat(count)), 
                  vjust = -0.5) + # Positions text slightly above the bar
        labs(x = "Carat", y= "Count")

4. Compare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

p <- ggplot(diamonds, aes(x = z))
# coord_cartesian
p + geom_histogram(binwidth = 0.5) + coord_cartesian(xlim = c(2, 6.5)) #1

p + geom_histogram(binwidth = 0.5) + coord_cartesian(xlim = c(2.5, 5)) #2

p + geom_histogram() + coord_cartesian(xlim = c(2, 6.5))#3

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

p + geom_histogram( ) + coord_cartesian(xlim = c(2.5, 5))#4

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# xlim()
p + geom_histogram(binwidth = 0.5) + xlim(2, 6.5)#5

## Warning: Removed 27 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

p + geom_histogram(binwidth = 0.5) + xlim(2.5, 5)#6

## Warning: Removed 2114 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

p + geom_histogram() + xlim(2, 6.5)#7

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 27 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

p + geom_histogram( ) + xlim(2.5, 5)#8

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 2114 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

coord_cartesian(): zoom-in the graph with all data; doesn’t remove any data xlim/ylim: remove all the data outside the range, plot the data within the limit

10.4 Unusual values

diamonds2 <- diamonds |> 
  mutate(y = if_else(y < 3 | y > 20, NA, y))
ggplot(diamonds2, aes(x = x, y = y)) + 
  geom_point(na.rm = TRUE) #optional, as ggplot will ignore na. Help to suppress the warning message.

nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) |> 
  ggplot(aes(x = sched_dep_time)) + 
  geom_freqpoly(aes(color = cancelled), binwidth = 1/4)

10.4.1 Exercises

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?

table((is.na(diamonds2$y)))

## 
## FALSE  TRUE 
## 53931     9

ggplot(diamonds2, aes(x = y)) +
        geom_bar()

## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_count()`).

ggplot(diamonds2, aes(x = cut_interval(y, 5))) +
        geom_bar()

ggplot(diamonds2, aes(x = y)) +
        geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_bin()`).

What does na.rm = TRUE do in mean() and sum()?

mean(diamonds2$y)

## [1] NA

mean(diamonds2$y, na.rm = TRUE)

## [1] 5.733801

sum(diamonds2$y)

## [1] NA

sum(diamonds2$y, na.rm = TRUE)

## [1] 309229.6

Recreate the frequency plot of scheduled_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights.

flights2 <- nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) 

p <- ggplot(flights2, aes(x = sched_dep_time))
p + geom_freqpoly(binwidth = 1/4) +
        facet_wrap(~cancelled, ncol = 1, scales ="free_y")

10.5 Covariation

10.5.1 A categorical and a numerical variable

ggplot(diamonds, aes(x = price)) +
        geom_freqpoly(aes(colour = cut), binwidth = 500, linewidth = 0.75)

ggplot(diamonds, aes(x = price, y = after_stat(density))) +
        geom_freqpoly(aes(colour = cut), binwidth = 500, linewidth = 0.75)

ggplot(diamonds, aes(x = cut, y = price))+
        geom_boxplot()

ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

ggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +
  geom_boxplot()

ggplot(mpg, aes(x = hwy, y = fct_reorder(class, hwy, median))) +
  geom_boxplot()

ggplot(diamonds, aes(x = price, y = after_stat(density))) +
        geom_freqpoly(aes(colour = cut), binwidth = 500, linewidth = 0.75)+
        facet_wrap(~cut_number(carat, 4), scales ="free")

### 10.5.1.1 Exercises Q1. Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.

p <- ggplot(flights2, aes(x = sched_dep_time, y = after_stat(density)))
p + geom_freqpoly(binwidth = 1/4, aes(colour = cancelled))

Q2. Based on EDA, what variable in the diamonds dataset appears to be most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

glimpse(diamonds)

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

p <- ggplot(diamonds, aes(x = price, y = after_stat(density))) 

p + geom_freqpoly(aes(colour = cut), binwidth = 500, linewidth = 0.75)

p + geom_freqpoly(aes(colour = color), binwidth = 500, linewidth = 0.75)

p + geom_freqpoly(aes(colour = clarity), binwidth = 500, linewidth = 0.75)

p + geom_freqpoly(aes(colour = cut_number(carat, 5)), binwidth = 500, linewidth = 0.75)

p + geom_freqpoly(aes(colour = cut_number(x, 5)), binwidth = 500, linewidth = 0.75)

p + geom_freqpoly(aes(colour = cut_number(y, 5)), binwidth = 500, linewidth = 0.75)

p + geom_freqpoly(aes(colour = cut_number(z, 5)), binwidth = 500, linewidth = 0.75)

p2 <- ggplot(diamonds, aes(y = price))
p2 + geom_boxplot(aes(x = cut))

p2 + geom_boxplot(aes(x = color))

p2 + geom_boxplot(aes(x = clarity))

p2 + geom_point(aes(x = carat), alpha = 0.2, shape = 1) + geom_smooth(aes(x = carat), method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(diamonds, aes(y = carat, x = cut)) + geom_boxplot()

ggplot(diamonds, aes(x = carat, y= after_stat(density), colour = cut)) + geom_freqpoly(linewidth = 0.75)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(diamonds, aes(x = carat, y= after_stat(density), colour = cut)) +
        geom_freqpoly(linewidth = 0.75) +
        coord_cartesian(xlim = c(0, 3.5))

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Q3. Instead of exchanging the x and y variables, add coord_flip() as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?

ggplot(diamonds, aes(y = carat, x = cut)) + geom_boxplot()

ggplot(diamonds, aes(y = carat, x = cut)) + geom_boxplot() + coord_flip()

ggplot(diamonds, aes(x = carat, y = cut)) + geom_boxplot()

no difference in this case

Q4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?

if(!require(lvplot)){install.packages("lvplot")}

## Loading required package: lvplot

## Warning: package 'lvplot' was built under R version 4.5.2

library(lvplot)
ggplot(diamonds, aes(y = carat, x = cut)) + geom_lv()

Q5. Create a visualization of diamond prices vs. a categorical variable from the diamonds dataset using geom_violin(), then a faceted geom_histogram(), then a colored geom_freqpoly(), and then a colored geom_density(). Compare and contrast the four plots. What are the pros and cons of each method of visualizing the distribution of a numerical variable based on the levels of a categorical variable?

p <- ggplot(diamonds, 
            aes(y = price)
)
p + geom_violin(aes(x = color))

ggplot(diamonds, aes(x = price, y = after_stat(density)))+ 
        geom_histogram() + 
        facet_wrap(~color, ncol = 1)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(diamonds, aes(x = price, y = after_stat(density)))+ 
        geom_freqpoly(aes(colour = color), linewidth = 0.75)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(diamonds, aes(x = price))+ 
        geom_density(aes(colour = color), linewidth = 0.75)

Q6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

if(!require(ggbeeswarm)){install.packages("ggbeeswarm")}

## Loading required package: ggbeeswarm

## Warning: package 'ggbeeswarm' was built under R version 4.5.2

library(ggbeeswarm)

 ggplot(mpg, aes(class, hwy)) + geom_point(alpha = 0.3)

 ggplot(mpg, aes(class, hwy)) + geom_jitter(alpha = 0.3)

 ggplot(mpg, aes(class, hwy)) + geom_quasirandom(alpha = 0.3)

10.5.2 Two categorical variables

ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

diamonds |> 
        count(cut, color) |> 
        ggplot(aes(x = color, y = cut)) +
        geom_tile(aes(fill = n))

10.5.2.1 Exercises

Q1. How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?

diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = color, y = cut, fill = prop)) +
  geom_tile()

diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = cut, y = color, fill = prop)) +
  geom_tile(color = "white") + # Adds a white border to tiles
  scale_fill_viridis_c(option = "magma") + # Changes the color palette
  theme_minimal()

Q2. What different data insights do you get with a segmented bar chart if color is mapped to the x aesthetic and cut is mapped to the fill aesthetic? Calculate the counts that fall into each of the segments.

ggplot(diamonds, aes(x = color, fill = cut)) +
        geom_bar()

diamonds |> 
        count(color, cut) %>%
        group_by(color)

## # A tibble: 35 × 3
## # Groups:   color [7]
##    color cut           n
##    <ord> <ord>     <int>
##  1 D     Fair        163
##  2 D     Good        662
##  3 D     Very Good  1513
##  4 D     Premium    1603
##  5 D     Ideal      2834
##  6 E     Fair        224
##  7 E     Good        933
##  8 E     Very Good  2400
##  9 E     Premium    2337
## 10 E     Ideal      3903
## # ℹ 25 more rows

Q3. Use geom_tile() together with dplyr to explore how average flight departure delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

glimpse(flights)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

delay <- flights |> 
        group_by(destination = dest, month = as.factor(month))|>
        summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
                  .groups = "drop") |> 
         mutate(destination = reorder(destination, 
                                      mean_delay,
                                      mean,
                                      decreasing = TRUE)) 
  # Reorder dest based on the overall mean_delay
delay |>
  ggplot(aes(y = destination, x = month, fill = mean_delay)) +
  geom_tile()+
  theme(
    axis.text.y = element_text(
      size = 5,      # Change label size
      angle = 0,     # Rotate labels ? degrees
      vjust = 1,      # Adjust vertical positioning
      hjust = 1       # Right-align labels for better fit
    )
  )+
scale_y_discrete(guide = guide_axis(n.dodge = 2))

10.5.3 Two numerical variables

ggplot(smaller, aes(x = carat, y = price)) +
  geom_point()

ggplot(smaller, aes(x = carat, y = price)) + 
  geom_point(alpha = 1 / 100) # transparency: very challenging for large dataset

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin.
geom_bin2d() creates rectangular bins.
geom_hex() creates hexagonal bins.

ggplot(smaller, aes(x = carat, y = price)) +
  geom_bin2d()

## `stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.

#> `stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.

# install.packages("hexbin")
ggplot(smaller, aes(x = carat, y = price)) +
  geom_hex()

ggplot(smaller, aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1)))

#> Warning: Orientation is not uniquely specified when both the x and y aesthetics are
#> continuous. Picking default orientation 'x'.
#> 
ggplot(smaller, aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

ggplot(smaller, aes(x = cut_width(carat, 0.1), y = price)) + 
        geom_boxplot(varwidth = TRUE) +
        theme(
        axis.text.x = element_text(
        size = 10,      # Change label size
        angle = 45,     # Rotate labels ? degrees
        vjust = 1,      # Adjust vertical positioning
         hjust = 1       # Right-align labels for better fit
        )
        )+
        labs( x = "Carat")

10.5.3.1 Exercises

Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs. cut_number()? How does that impact a visualization of the 2d distribution of carat and price?

ggplot(smaller, aes(x = price))+
        geom_freqpoly(aes(colour = cut_width(carat, 0.1)), linewidth = 0.75)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(smaller, aes(x = price))+
        geom_freqpoly(aes(colour = cut_width(carat, 0.3)), linewidth = 0.75)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(smaller, aes(x = price, y = after_stat(density)))+
        geom_freqpoly(aes(colour = cut_width(carat, 0.3)), linewidth = 0.75)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(smaller, aes(x = price))+
        geom_freqpoly(aes(colour = cut_number(carat, 9)), linewidth = 0.75)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Visualize the distribution of carat, partitioned by price.

smaller |>  
        group_by(carat_bin = cut_width(carat, 0.5)) |> 
        summarise(ave_price = mean(price))|>
        ggplot(aes(x = carat_bin, y = ave_price))+
        geom_col()

ggplot(smaller, aes(x = carat))+
        geom_freqpoly(aes(colour = cut_width(price, 2500)), linewidth = 0.75)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(smaller, aes(x = carat, y = after_stat(density)))+
        geom_freqpoly(aes(colour = cut_width(price, 2500)), linewidth = 0.75)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(smaller, aes(x = carat, y = price))+
        geom_boxplot(aes(group = cut_width(price, 2500)), varwidth = TRUE)

ggplot(diamonds, aes(x = carat, y = price))+
        geom_boxplot(aes(group = cut_width(price, 2500)), varwidth = TRUE)

ggplot(smaller, aes(x = carat, y = price))+
        geom_boxplot(aes(group = cut_width(price, 1000)), varwidth = TRUE)

ggplot(diamonds, aes(x = carat, y = price))+
        geom_boxplot(aes(group = cut_width(price, 1000)), varwidth = TRUE)

ggplot(diamonds, aes(x = carat, y = price))+
        geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

ggplot(smaller, aes(x = price, y = carat))+
        geom_boxplot(aes(group = cut_width(price, 1000)), varwidth = TRUE)

#it is important to select suitable bin width

How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you

less obvious positive correlationship with the size for large diamonds. Factors other than size, such as clarity and cut, may play a role in determining the prize.

ggplot(smaller, aes(x = carat, y = price)) +
        geom_hex() +
        facet_grid(clarity ~ color)+
        theme(
        axis.text.x = element_text(
        size = 5,      # Change label size
        angle = 45,     # Rotate labels ? degrees
        vjust = 1,      # Adjust vertical positioning
         hjust = 1       # Right-align labels for better fit
        )
        )

smaller %>%
  group_by(clarity, carat_bin = cut_width(carat, 0.5)) %>%
  summarize(avg_price = mean(price, na.rm = TRUE)) %>%
  ggplot(aes(x = clarity, y = avg_price, fill = carat_bin)) +
  geom_col(position = "dodge")

## `summarise()` has grouped output by 'clarity'. You can override using the
## `.groups` argument.

smaller %>%
  group_by(clarity, carat_bin = cut_width(carat, 0.5)) %>%
  summarize(avg_price = mean(price, na.rm = TRUE)) %>% 
        #the default stat is "identity", i.e. the sum of the value.
        ## thus we need to calculate the average before plotting with geom_col
  ggplot(aes(x = carat_bin, y = avg_price, fill = clarity)) +
  geom_col(position = "dodge")

## `summarise()` has grouped output by 'clarity'. You can override using the
## `.groups` argument.

smaller %>%
  group_by(cut, carat_bin = cut_width(carat, 0.5)) %>%
  summarize(avg_price = mean(price, na.rm = TRUE)) %>% 
        #the default stat is "identity", i.e. the sum of the value.
        ## thus we need to calculate the average before plotting with geom_col
  ggplot(aes(x = carat_bin, y = avg_price, fill = cut)) +
  geom_col(position = "dodge")

## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.

4. Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.

smaller %>%
  group_by(cut, carat_bin = cut_width(carat, 0.5)) %>%
  summarize(avg_price = mean(price, na.rm = TRUE)) %>% 
        #the default stat is "identity", i.e. the sum of the value.
        ## thus we need to calculate the average before plotting with geom_col
  ggplot(aes(x = carat_bin, y = avg_price, fill = cut)) +
  geom_col(position = "dodge")

## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.

smaller %>%
  group_by(cut, carat_bin = cut_width(carat, 0.5)) %>%
  summarize(avg_price = mean(price, na.rm = TRUE)) %>% 
        #the default stat is "identity", i.e. the sum of the value.
        ## thus we need to calculate the average before plotting with geom_col
  ggplot(aes(x = cut, y = avg_price, fill = carat_bin)) +
  geom_col(position = "dodge")

## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.

ggplot(smaller, aes(x = carat, y = price))+
        geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)+
        facet_wrap(~cut, ncol = 2)

ggplot(smaller, aes(x = cut, y = price))+
        geom_boxplot(varwidth = TRUE)+
        facet_wrap(~cut_interval(carat, 9))+
        theme(
        axis.text.x = element_text(
        size = 10,      # Change label size
        angle = 45,     # Rotate labels ? degrees
        vjust = 1,      # Adjust vertical positioning
         hjust = 1       # Right-align labels for better fit
        )
        )

Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the following plot have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately. Why is a scatterplot a better display than a binned plot for this case?

diamonds |> filter(x >= 4) |> ggplot(aes(x = x, y = y)) + geom_point() + coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

diamonds |> 
  filter(x >= 4) |> 
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = 0.1) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

diamonds |> 
  filter(x >= 4) |> 
  ggplot(aes(x = x, y = y)) +
  geom_boxplot(aes(group = cut_width(x, 0.5)), varwidth = TRUE) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

diamonds |> 
  filter(y >= 4) |> 
  ggplot(aes(x = y, y = x)) +
  geom_boxplot(aes(group = cut_width(y, 0.5)), varwidth = TRUE) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

From AI
Why the Scatterplot is Better
- Multivariate Outlier Detection: Scatterplots reveal “unusual combinations” of variables. A point might have a perfectly normal x value and a normal y value when viewed individually (univariate), but the relationship between them is what makes it an outlier. Binned plots often hide these specific points by aggregating them into a single bin or summary statistic.
- Precise Data Points: Scatterplots show every individual observation, allowing you to see exactly where “off-diagonal” points fall. In the diamonds dataset, x and y represent dimensions; they are highly correlated, so any point far from the main diagonal is immediately suspicious.
- Pattern Recognition: Human eyes are naturally skilled at spotting gaps and clusters that binned plots might smooth over.

Limitations of the Binned Plot (Boxplot) in this Case
- Loss of Granularity: By grouping x into bins (like cut_width(x, 0.5)), you lose the exact x coordinate of outliers.
- Over-Aggregation: A boxplot summarizes the distribution of y for a range of x. While it can show outliers in y for that specific bin, it doesn’t show how they relate to the exact x value, making it harder to see if they follow a specific non-linear trend or are truly isolated.
- Clarity: In high-density areas, boxplots can become cluttered or misleading if the varwidth doesn’t clearly communicate the underlying sample size, whereas a scatterplot with alpha = 0.1 uses transparency to show density naturally.

Instead of creating boxes of equal width with cut_width(), we could create boxes that contain roughly equal number of points with cut_number(). What are the advantages and disadvantages of this approach?

ungroup(smaller)

## # A tibble: 53,900 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ℹ 53,890 more rows

ggplot(smaller, aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_number(carat, 20)), varwidth = FALSE)

ggplot(smaller, aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_number(carat, 20)), varwidth = TRUE)

ggplot(smaller, aes(x = cut_number(carat, 20), y = price))+ 
        geom_boxplot(varwidth = FALSE)+
        theme(
        axis.text.x = element_text(
        size = 10,      # Change label size
        angle = 45,     # Rotate labels ? degrees
        vjust = 1,      # Adjust vertical positioning
         hjust = 1       # Right-align labels for better fit
        )
        )

ggplot(smaller, aes(x = cut_number(carat, 20), y = price))+ 
        geom_boxplot(varwidth = TRUE)+
        theme(
        axis.text.x = element_text(
        size = 10,      # Change label size
        angle = 45,     # Rotate labels ? degrees
        vjust = 1,      # Adjust vertical positioning
         hjust = 1       # Right-align labels for better fit
        )
        )

ggplot(smaller, aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

ggplot(smaller, aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_interval(carat, 20)), varwidth = TRUE)

From GEMINI AI

The reason you see varying widths even with cut_number(carat, 20) is due to tied values in your data.

How cut_number() works: It attempts to put an exactly equal number of observations into each of the 20 bins. However, the diamonds dataset has many identical carat values (e.g., many diamonds are exactly 1.00 carat).

The Conflict: If there are more diamonds with a single carat weight than can fit in one bin, cut_number() cannot split those identical values into different bins. It must keep them together, resulting in some bins having slightly more or fewer observations than others to maintain valid “breaks” in the data.

Effect of varwidth = TRUE: When this is active, geom_boxplot calculates the horizontal width of each box proportional to the square root of the number of observations in that bin. Because cut_number() produced bins with slightly different counts (due to those tied carat values), the box widths reflect those differences.

FROM COPILOT AI
Since each cut_number() group has the same number of observations, you would expect equal widths… …but ggplot2 does not use the number of observations in the group you created. Instead, it uses the number of observations in the x-position category.

So ggplot2 sees:

One x-position per observation
Each observation is its own “group” for width calculation
The computed widths vary because the underlying continuous x-scale spacing varies
This is why the widths look inconsistent.

Convert the groups into a factor and map that to x:

r ggplot(smaller, aes(x = cut_number(carat, 20), y = price)) + geom_boxplot(varwidth = TRUE) Now:

x is a factor with 20 levels
Each level has equal sample size
varwidth = TRUE will produce equal-width boxes

EDA_Ch10

CMCHOW

2026-01-08

R for Data Science(2e)