–metadata title=“R for Data Science”
Setup
Install (if needed) and load the packages and data
if(!require(tidyverse))
{install.packages("tidyverse")}
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
if(!require(nycflights13)){install.packages("nycflights13")}
## Loading required package: nycflights13
## Warning: package 'nycflights13' was built under R version 4.5.2
library(nycflights13)
if(!require(hexbin)){install.packages("hexbin")}
## Loading required package: hexbin
## Warning: package 'hexbin' was built under R version 4.5.2
library(hexbin)
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.5)
smaller <- diamonds |>
filter(carat < 3)
ggplot(smaller, aes(x = carat)) +
geom_histogram(binwidth = 0.01)
The Cartesian coordinate system is the most familiar, and common, type of coordinate system. Setting limits on the coordinate system will zoom the plot (like you’re looking at it with a magnifying glass), and will not change the underlying data like setting limits on a scale will.
ggplot(diamonds, aes(x = y)) +
geom_histogram(binwidth = 0.5)
# To make the unusual values easy to see, zoom in on small values of the y-axis with coord_cartesian()
ggplot(diamonds, aes(x = y)) +
geom_histogram(binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
unusual <- diamonds |>
filter(y < 3 | y > 20) |>
select(price, x, y, z) |>
arrange(y)
unusual
## # A tibble: 9 × 4
## price x y z
## <int> <dbl> <dbl> <dbl>
## 1 5139 0 0 0
## 2 6381 0 0 0
## 3 12800 0 0 0
## 4 15686 0 0 0
## 5 18034 0 0 0
## 6 2130 0 0 0
## 7 2130 0 0 0
## 8 2075 5.15 31.8 5.12
## 9 12210 8.09 58.9 8.06
ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.
Repeat your analysis with and without the outliers.
- If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them and move on.
- If they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g., a data entry error) and disclose that you removed them in your write-up.
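As a minimal sketch of that comparison, the summary below is computed once on all rows and once with the unusual y values filtered out. The cut-offs 3 and 20 are the ones used for `unusual` above; the names `all_rows`, `trimmed`, and `comparison` are only illustrative.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Summary of y using every row, outliers included
all_rows <- diamonds |>
  summarise(n = n(), mean_y = mean(y), sd_y = sd(y))

# Same summary after dropping the unusual values (y < 3 or y > 20)
trimmed <- diamonds |>
  filter(y >= 3, y <= 20) |>
  summarise(n = n(), mean_y = mean(y), sd_y = sd(y))

# Stack the two summaries so they can be compared row by row
comparison <- bind_rows(all = all_rows, trimmed = trimmed, .id = "data")
comparison
```

If the two rows barely differ, the outliers matter little for this summary; a large gap is a signal to investigate before dropping anything.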
long_diamonds <-diamonds |>
select(carat, price, x, y, z) |>
pivot_longer(cols = x:z,
names_to = "dimension",
values_to = "value"
)
long_diamonds
## # A tibble: 161,820 × 4
## carat price dimension value
## <dbl> <int> <chr> <dbl>
## 1 0.23 326 x 3.95
## 2 0.23 326 y 3.98
## 3 0.23 326 z 2.43
## 4 0.21 326 x 3.89
## 5 0.21 326 y 3.84
## 6 0.21 326 z 2.31
## 7 0.23 327 x 4.05
## 8 0.23 327 y 4.07
## 9 0.23 327 z 2.31
## 10 0.29 334 x 4.2
## # ℹ 161,810 more rows
p <-ggplot(long_diamonds,
aes(x = value)) +
coord_cartesian(xlim = c(0, 10))
p + geom_histogram(aes(fill = dimension),
binwidth = 0.5,
position = "dodge")
p + geom_density(aes(colour = dimension, linetype = dimension),
alpha = 0.3,
linewidth = 1)
ggplot(long_diamonds,
aes(x = value, y = price, colour = dimension))+
geom_point(alpha = 0.3, size = 0.5)+
coord_cartesian(xlim = c(0, 12.5), ylim = c(0, 20000))+
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
z = depth; x = length, y = width
ggplot(diamonds,
aes(x = x, y = y)
)+
geom_point(shape = 1, alpha = 0.2)+
coord_cartesian(xlim = c(0, 12.5), ylim = c(0, 12.5))
ggplot(diamonds,
aes(x = x, y = z)
)+
geom_point(alpha = 0.2)+
coord_cartesian(xlim = c(0, 12.5), ylim = c(0, 12.5))
ggplot(diamonds,
aes(x = y, y = z)
)+
geom_point(alpha = 0.2)+
coord_cartesian(xlim = c(0, 12.5), ylim = c(0, 12.5))
p <- ggplot(diamonds, aes(x = price))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
p + geom_histogram(binwidth = 500)
p + geom_histogram(binwidth = 100, aes(fill = cut_number(carat, 5)))
p + geom_histogram(binwidth = 50) + coord_cartesian(xlim = c(0, 2500))
p + geom_histogram(binwidth = 50, aes(fill = cut_number(carat, 5))) +
coord_cartesian(xlim = c(0, 2500))
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you
think is the cause of the difference?
p <- ggplot(diamonds, aes(carat))
p + geom_histogram(binwidth = 0.01)
diamonds |>
filter(carat == 0.99 | carat == 1) |>
ggplot(aes(x = as.factor(carat))) + # change the numerical variable to categorical variable
geom_bar(fill = "blue") +
geom_text(stat = "count",
aes(label = after_stat(count)),
vjust = -0.5) + # Positions text slightly above the bar
labs(x = "Carat", y= "Count")
4. Compare and contrast coord_cartesian() vs. xlim() or ylim() when
zooming in on a histogram. What happens if you leave binwidth unset?
What happens if you try and zoom so only half a bar shows?
p <- ggplot(diamonds, aes(x = z))
# coord_cartesian
p + geom_histogram(binwidth = 0.5) + coord_cartesian(xlim = c(2, 6.5)) #1
p + geom_histogram(binwidth = 0.5) + coord_cartesian(xlim = c(2.5, 5)) #2
p + geom_histogram() + coord_cartesian(xlim = c(2, 6.5))#3
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
p + geom_histogram( ) + coord_cartesian(xlim = c(2.5, 5))#4
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
# xlim()
p + geom_histogram(binwidth = 0.5) + xlim(2, 6.5)#5
## Warning: Removed 27 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
p + geom_histogram(binwidth = 0.5) + xlim(2.5, 5)#6
## Warning: Removed 2114 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
p + geom_histogram() + xlim(2, 6.5)#7
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 27 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
p + geom_histogram( ) + xlim(2.5, 5)#8
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2114 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
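- coord_cartesian(): zooms in on the plot after the bins have been computed from all of the data; no data is removed.
- xlim()/ylim(): removes the data outside the limits first, then bins and plots only what remains. Bars that extend past the limits are dropped as well (the `geom_bar()` warnings above).
- With binwidth unset, the 30 default bins span the full data range under coord_cartesian() but only the restricted range under xlim(), so the bars differ between the two.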
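One way to check this difference numerically is to pull the computed bins out of each plot with `layer_data()` and add up the counts. This is a sketch; the names `zoomed` and `clipped` are illustrative.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = z))

# Zoom only: every row is binned, then the view is restricted
zoomed <- layer_data(
  p + geom_histogram(binwidth = 0.5) + coord_cartesian(xlim = c(2, 6.5))
)

# Scale limits: rows outside [2, 6.5] are dropped before binning
# (suppressWarnings() hides the "Removed ... rows" warning)
clipped <- suppressWarnings(
  layer_data(p + geom_histogram(binwidth = 0.5) + xlim(2, 6.5))
)

sum(zoomed$count)   # rows binned with coord_cartesian(): all of them
sum(clipped$count)  # fewer, because out-of-range rows were removed
```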
diamonds2 <- diamonds |>
mutate(y = if_else(y < 3 | y > 20, NA, y))
ggplot(diamonds2, aes(x = x, y = y)) +
geom_point(na.rm = TRUE) # na.rm = TRUE is optional: ggplot2 drops missing values either way, but this suppresses the warning
nycflights13::flights |>
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + (sched_min / 60)
) |>
ggplot(aes(x = sched_dep_time)) +
geom_freqpoly(aes(color = cancelled), binwidth = 1/4)
table((is.na(diamonds2$y)))
##
## FALSE TRUE
## 53931 9
ggplot(diamonds2, aes(x = y)) +
geom_bar()
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_count()`).
ggplot(diamonds2, aes(x = cut_interval(y, 5))) +
geom_bar()
ggplot(diamonds2, aes(x = y)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_bin()`).
mean(diamonds2$y)
## [1] NA
mean(diamonds2$y, na.rm = TRUE)
## [1] 5.733801
sum(diamonds2$y)
## [1] NA
sum(diamonds2$y, na.rm = TRUE)
## [1] 309229.6
flights2 <- nycflights13::flights |>
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + (sched_min / 60)
)
p <- ggplot(flights2, aes(x = sched_dep_time))
p + geom_freqpoly(binwidth = 1/4) +
facet_wrap(~cancelled, ncol = 1, scales ="free_y")
ggplot(diamonds, aes(x = price)) +
geom_freqpoly(aes(colour = cut), binwidth = 500, linewidth = 0.75)
ggplot(diamonds, aes(x = price, y = after_stat(density))) +
geom_freqpoly(aes(colour = cut), binwidth = 500, linewidth = 0.75)
ggplot(diamonds, aes(x = cut, y = price))+
geom_boxplot()
ggplot(mpg, aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +
geom_boxplot()
ggplot(mpg, aes(x = hwy, y = fct_reorder(class, hwy, median))) +
geom_boxplot()
ggplot(diamonds, aes(x = price, y = after_stat(density))) +
geom_freqpoly(aes(colour = cut), binwidth = 500, linewidth = 0.75)+
facet_wrap(~cut_number(carat, 4), scales ="free")
### 10.5.1.1 Exercises
Q1. Use what you’ve learned to improve the visualization
of the departure times of cancelled vs. non-cancelled flights.
p <- ggplot(flights2, aes(x = sched_dep_time, y = after_stat(density)))
p + geom_freqpoly(binwidth = 1/4, aes(colour = cancelled))
Q2. Based on EDA, what variable in the diamonds dataset appears to be most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
p <- ggplot(diamonds, aes(x = price, y = after_stat(density)))
p + geom_freqpoly(aes(colour = cut), binwidth = 500, linewidth = 0.75)
p + geom_freqpoly(aes(colour = color), binwidth = 500, linewidth = 0.75)
p + geom_freqpoly(aes(colour = clarity), binwidth = 500, linewidth = 0.75)
p + geom_freqpoly(aes(colour = cut_number(carat, 5)), binwidth = 500, linewidth = 0.75)
p + geom_freqpoly(aes(colour = cut_number(x, 5)), binwidth = 500, linewidth = 0.75)
p + geom_freqpoly(aes(colour = cut_number(y, 5)), binwidth = 500, linewidth = 0.75)
p + geom_freqpoly(aes(colour = cut_number(z, 5)), binwidth = 500, linewidth = 0.75)
p2 <- ggplot(diamonds, aes(y = price))
p2 + geom_boxplot(aes(x = cut))
p2 + geom_boxplot(aes(x = color))
p2 + geom_boxplot(aes(x = clarity))
p2 + geom_point(aes(x = carat), alpha = 0.2, shape = 1) + geom_smooth(aes(x = carat), method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(diamonds, aes(y = carat, x = cut)) + geom_boxplot()
ggplot(diamonds, aes(x = carat, y= after_stat(density), colour = cut)) + geom_freqpoly(linewidth = 0.75)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(diamonds, aes(x = carat, y= after_stat(density), colour = cut)) +
geom_freqpoly(linewidth = 0.75) +
coord_cartesian(xlim = c(0, 3.5))
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
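The plots for Q2 can be backed up with a quick numeric check. The sketch below computes the linear correlation between carat and price, and the median carat and price within each cut; `carat_price_cor` and `by_cut` are illustrative names.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Linear correlation between carat and price
carat_price_cor <- cor(diamonds$carat, diamonds$price)

# Typical size and price within each cut
by_cut <- diamonds |>
  group_by(cut) |>
  summarise(
    median_carat = median(carat),
    median_price = median(price)
  )

carat_price_cor
by_cut
```

If the lower-quality cuts show a larger median carat, that would explain why they can also show a higher median price.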
Q3. Instead of exchanging the x and y variables, add coord_flip() as a
new layer to the vertical boxplot to create a horizontal one. How does
this compare to exchanging the variables?
ggplot(diamonds, aes(y = carat, x = cut)) + geom_boxplot()
ggplot(diamonds, aes(y = carat, x = cut)) + geom_boxplot() + coord_flip()
ggplot(diamonds, aes(x = carat, y = cut)) + geom_boxplot()
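The plots look the same in this case. The difference is in the mapping: with coord_flip(), cut is still the x aesthetic, so later layers and labs(x = ...) still refer to cut.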
Q4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?
if(!require(lvplot)){install.packages("lvplot")}
## Loading required package: lvplot
## Warning: package 'lvplot' was built under R version 4.5.2
library(lvplot)
ggplot(diamonds, aes(y = price, x = cut)) + geom_lv() # the question asks for price vs. cut
Q5. Create a visualization of diamond prices vs. a categorical variable
from the diamonds dataset using geom_violin(), then a faceted
geom_histogram(), then a colored geom_freqpoly(), and then a colored
geom_density(). Compare and contrast the four plots. What are the pros
and cons of each method of visualizing the distribution of a numerical
variable based on the levels of a categorical variable?
p <- ggplot(diamonds,
aes(y = price)
)
p + geom_violin(aes(x = color))
ggplot(diamonds, aes(x = price, y = after_stat(density)))+
geom_histogram() +
facet_wrap(~color, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(diamonds, aes(x = price, y = after_stat(density)))+
geom_freqpoly(aes(colour = color), linewidth = 0.75)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(diamonds, aes(x = price))+
geom_density(aes(colour = color), linewidth = 0.75)
Q6. If you have a small dataset, it’s sometimes useful to use
geom_jitter() to avoid overplotting to more easily see the relationship
between a continuous and categorical variable. The ggbeeswarm package
provides a number of methods similar to geom_jitter(). List them and
briefly describe what each one does.
if(!require(ggbeeswarm)){install.packages("ggbeeswarm")}
## Loading required package: ggbeeswarm
## Warning: package 'ggbeeswarm' was built under R version 4.5.2
library(ggbeeswarm)
ggplot(mpg, aes(class, hwy)) + geom_point(alpha = 0.3)
ggplot(mpg, aes(class, hwy)) + geom_jitter(alpha = 0.3)
ggplot(mpg, aes(class, hwy)) + geom_quasirandom(alpha = 0.3)
ggplot(diamonds, aes(x = cut, y = color)) +
geom_count()
diamonds |>
count(cut, color) |>
ggplot(aes(x = color, y = cut)) +
geom_tile(aes(fill = n))
Q1. How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?
diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = color, y = cut, fill = prop)) +
geom_tile()
diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = cut, y = color, fill = prop)) +
geom_tile(color = "white") + # Adds a white border to tiles
scale_fill_viridis_c(option = "magma") + # Changes the color palette
theme_minimal()
Q2. What different data insights do you get with a segmented bar chart if color is mapped to the x aesthetic and cut is mapped to the fill aesthetic? Calculate the counts that fall into each of the segments.
ggplot(diamonds, aes(x = color, fill = cut)) +
geom_bar()
diamonds |>
count(color, cut) %>%
group_by(color)
## # A tibble: 35 × 3
## # Groups: color [7]
## color cut n
## <ord> <ord> <int>
## 1 D Fair 163
## 2 D Good 662
## 3 D Very Good 1513
## 4 D Premium 1603
## 5 D Ideal 2834
## 6 E Fair 224
## 7 E Good 933
## 8 E Very Good 2400
## 9 E Premium 2337
## 10 E Ideal 3903
## # ℹ 25 more rows
Q3. Use geom_tile() together with dplyr to explore how average flight departure delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
delay <- flights |>
group_by(destination = dest, month = as.factor(month))|>
summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
.groups = "drop") |>
mutate(destination = reorder(destination,
mean_delay,
mean,
decreasing = TRUE))
# Reorder dest based on the overall mean_delay
delay |>
ggplot(aes(y = destination, x = month, fill = mean_delay)) +
geom_tile()+
theme(
axis.text.y = element_text(
size = 5, # Change label size
angle = 0, # Label rotation in degrees (0 = no rotation)
vjust = 1, # Adjust vertical positioning
hjust = 1 # Right-align labels for better fit
)
)+
scale_y_discrete(guide = guide_axis(n.dodge = 2))
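One possible improvement, sketched under the assumption that gaps from destinations without flights in every month are part of what makes the tile plot hard to read: keep only destinations that have a value for all 12 months. The name `delay_complete` is illustrative.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

delay_complete <- flights |>
  group_by(dest, month) |>
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop") |>
  group_by(dest) |>
  filter(n() == 12) |>   # keep destinations with a value for every month
  ungroup()

ggplot(delay_complete, aes(x = factor(month), y = dest, fill = mean_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Mean delay (min)")
```

This trades completeness for readability: rarely served destinations disappear from the plot, so it is worth noting the filter wherever the figure is used.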
ggplot(smaller, aes(x = carat, y = price)) +
geom_point()
ggplot(smaller, aes(x = carat, y = price)) +
geom_point(alpha = 1 / 100) # transparency: very challenging for large dataset
geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and
then use a fill color to display how many points fall into each
bin.
geom_bin2d() creates rectangular bins.
geom_hex() creates hexagonal bins.
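To make the idea of 2d binning concrete, the sketch below does the rectangular binning by hand with `cut_width()` and `count()`, then draws the counts with `geom_tile()`. The bin widths (0.1 carat, 1000 price units) are arbitrary choices for illustration, and this is not how `geom_bin2d()` is implemented internally.

```r
library(dplyr)
library(ggplot2)

smaller <- diamonds |> filter(carat < 3)

# Count how many diamonds fall into each (carat bin, price bin) cell
bin_counts <- smaller |>
  count(
    carat_bin = cut_width(carat, 0.1),   # bins along x
    price_bin = cut_width(price, 1000)   # bins along y
  )

# Fill colour shows the number of points in each cell
ggplot(bin_counts, aes(x = carat_bin, y = price_bin, fill = n)) +
  geom_tile()
```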
ggplot(smaller, aes(x = carat, y = price)) +
geom_bin2d()
## `stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.
# install.packages("hexbin")
ggplot(smaller, aes(x = carat, y = price)) +
geom_hex()
ggplot(smaller, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)))
#> Warning: Orientation is not uniquely specified when both the x and y aesthetics are
#> continuous. Picking default orientation 'x'.
ggplot(smaller, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
ggplot(smaller, aes(x = cut_width(carat, 0.1), y = price)) +
geom_boxplot(varwidth = TRUE) +
theme(
axis.text.x = element_text(
size = 10, # Change label size
angle = 45, # Rotate labels 45 degrees
vjust = 1, # Adjust vertical positioning
hjust = 1 # Right-align labels for better fit
)
)+
labs( x = "Carat")
ggplot(smaller, aes(x = price))+
geom_freqpoly(aes(colour = cut_width(carat, 0.1)), linewidth = 0.75)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(smaller, aes(x = price))+
geom_freqpoly(aes(colour = cut_width(carat, 0.3)), linewidth = 0.75)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(smaller, aes(x = price, y = after_stat(density)))+
geom_freqpoly(aes(colour = cut_width(carat, 0.3)), linewidth = 0.75)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(smaller, aes(x = price))+
geom_freqpoly(aes(colour = cut_number(carat, 9)), linewidth = 0.75)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
smaller |>
group_by(carat_bin = cut_width(carat, 0.5)) |>
summarise(ave_price = mean(price))|>
ggplot(aes(x = carat_bin, y = ave_price))+
geom_col()
ggplot(smaller, aes(x = carat))+
geom_freqpoly(aes(colour = cut_width(price, 2500)), linewidth = 0.75)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(smaller, aes(x = carat, y = after_stat(density)))+
geom_freqpoly(aes(colour = cut_width(price, 2500)), linewidth = 0.75)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(smaller, aes(x = carat, y = price))+
geom_boxplot(aes(group = cut_width(price, 2500)), varwidth = TRUE)
ggplot(diamonds, aes(x = carat, y = price))+
geom_boxplot(aes(group = cut_width(price, 2500)), varwidth = TRUE)
ggplot(smaller, aes(x = carat, y = price))+
geom_boxplot(aes(group = cut_width(price, 1000)), varwidth = TRUE)
ggplot(diamonds, aes(x = carat, y = price))+
geom_boxplot(aes(group = cut_width(price, 1000)), varwidth = TRUE)
ggplot(diamonds, aes(x = carat, y = price))+
geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
ggplot(smaller, aes(x = price, y = carat))+
geom_boxplot(aes(group = cut_width(price, 1000)), varwidth = TRUE)
# It is important to choose a suitable bin width.
For large diamonds, the positive relationship between price and size is less obvious. Factors other than size, such as clarity and cut, may play a role in determining the price.
ggplot(smaller, aes(x = carat, y = price)) +
geom_hex() +
facet_grid(clarity ~ color)+
theme(
axis.text.x = element_text(
size = 5, # Change label size
angle = 45, # Rotate labels 45 degrees
vjust = 1, # Adjust vertical positioning
hjust = 1 # Right-align labels for better fit
)
)
smaller %>%
group_by(clarity, carat_bin = cut_width(carat, 0.5)) %>%
summarize(avg_price = mean(price, na.rm = TRUE)) %>%
ggplot(aes(x = clarity, y = avg_price, fill = carat_bin)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'clarity'. You can override using the
## `.groups` argument.
smaller %>%
group_by(clarity, carat_bin = cut_width(carat, 0.5)) %>%
summarize(avg_price = mean(price, na.rm = TRUE)) %>%
# geom_col() uses stat = "identity", so it plots the values as given;
# the average price therefore has to be computed before plotting.
ggplot(aes(x = carat_bin, y = avg_price, fill = clarity)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'clarity'. You can override using the
## `.groups` argument.
smaller %>%
group_by(cut, carat_bin = cut_width(carat, 0.5)) %>%
summarize(avg_price = mean(price, na.rm = TRUE)) %>%
# geom_col() uses stat = "identity", so it plots the values as given;
# the average price therefore has to be computed before plotting.
ggplot(aes(x = carat_bin, y = avg_price, fill = cut)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
4. Combine two of the techniques you’ve learned to visualize the
combined distribution of cut, carat, and price.
smaller %>%
group_by(cut, carat_bin = cut_width(carat, 0.5)) %>%
summarize(avg_price = mean(price, na.rm = TRUE)) %>%
# geom_col() uses stat = "identity", so it plots the values as given;
# the average price therefore has to be computed before plotting.
ggplot(aes(x = carat_bin, y = avg_price, fill = cut)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
smaller %>%
group_by(cut, carat_bin = cut_width(carat, 0.5)) %>%
summarize(avg_price = mean(price, na.rm = TRUE)) %>%
# geom_col() uses stat = "identity", so it plots the values as given;
# the average price therefore has to be computed before plotting.
ggplot(aes(x = cut, y = avg_price, fill = carat_bin)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
ggplot(smaller, aes(x = carat, y = price))+
geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)+
facet_wrap(~cut, ncol = 2)
ggplot(smaller, aes(x = cut, y = price))+
geom_boxplot(varwidth = TRUE)+
facet_wrap(~cut_interval(carat, 9))+
theme(
axis.text.x = element_text(
size = 10, # Change label size
angle = 45, # Rotate labels 45 degrees
vjust = 1, # Adjust vertical positioning
hjust = 1 # Right-align labels for better fit
)
)
diamonds |> filter(x >= 4) |> ggplot(aes(x = x, y = y)) + geom_point() + coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
diamonds |>
filter(x >= 4) |>
ggplot(aes(x = x, y = y)) +
geom_point(alpha = 0.1) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
diamonds |>
filter(x >= 4) |>
ggplot(aes(x = x, y = y)) +
geom_boxplot(aes(group = cut_width(x, 0.5)), varwidth = TRUE) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
diamonds |>
filter(y >= 4) |>
ggplot(aes(x = y, y = x)) +
geom_boxplot(aes(group = cut_width(y, 0.5)), varwidth = TRUE) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
From AI
Why the Scatterplot is Better
- Multivariate Outlier Detection: Scatterplots reveal “unusual
combinations” of variables. A point might have a perfectly normal x
value and a normal y value when viewed individually (univariate), but
the relationship between them is what makes it an outlier. Binned plots
often hide these specific points by aggregating them into a single bin
or summary statistic.
- Precise Data Points: Scatterplots show every individual observation,
allowing you to see exactly where “off-diagonal” points fall. In the
diamonds dataset, x and y represent dimensions; they are highly
correlated, so any point far from the main diagonal is immediately
suspicious.
- Pattern Recognition: Human eyes are naturally skilled at spotting gaps
and clusters that binned plots might smooth over.
Limitations of the Binned Plot (Boxplot) in this Case
- Loss of Granularity: By grouping x into bins (like cut_width(x, 0.5)),
you lose the exact x coordinate of outliers.
- Over-Aggregation: A boxplot summarizes the distribution of y for a
range of x. While it can show outliers in y for that specific bin, it
doesn’t show how they relate to the exact x value, making it harder to
see if they follow a specific non-linear trend or are truly
isolated.
- Clarity: In high-density areas, boxplots can become cluttered or
misleading if the varwidth doesn’t clearly communicate the underlying
sample size, whereas a scatterplot with alpha = 0.1 uses transparency to
show density naturally.
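The "off-diagonal" points that the scatterplot reveals can also be listed directly. The sketch below flags diamonds whose x and y differ by more than 1 mm; that threshold is an assumption chosen for illustration, and `off_diagonal` is a hypothetical name.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

off_diagonal <- diamonds |>
  filter(x >= 4, abs(x - y) > 1) |>   # 1 mm gap is an assumed threshold
  select(carat, price, x, y, z) |>
  arrange(desc(abs(x - y)))           # largest gaps first

off_diagonal
```

A binned boxplot would fold these rows into a bin summary, whereas this table keeps their exact coordinates.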
ungroup(smaller)
## # A tibble: 53,900 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,890 more rows
ggplot(smaller, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_number(carat, 20)), varwidth = FALSE)
ggplot(smaller, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_number(carat, 20)), varwidth = TRUE)
ggplot(smaller, aes(x = cut_number(carat, 20), y = price))+
geom_boxplot(varwidth = FALSE)+
theme(
axis.text.x = element_text(
size = 10, # Change label size
angle = 45, # Rotate labels 45 degrees
vjust = 1, # Adjust vertical positioning
hjust = 1 # Right-align labels for better fit
)
)
ggplot(smaller, aes(x = cut_number(carat, 20), y = price))+
geom_boxplot(varwidth = TRUE)+
theme(
axis.text.x = element_text(
size = 10, # Change label size
angle = 45, # Rotate labels 45 degrees
vjust = 1, # Adjust vertical positioning
hjust = 1 # Right-align labels for better fit
)
)
ggplot(smaller, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
ggplot(smaller, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_interval(carat, 20)), varwidth = TRUE)
From Gemini AI
The reason you see varying widths even with cut_number(carat, 20) is due to tied values in your data.
How cut_number() works: It attempts to put an exactly equal number of observations into each of the 20 bins. However, the diamonds dataset has many identical carat values (e.g., many diamonds are exactly 1.00 carat).
The Conflict: If there are more diamonds with a single carat weight than can fit in one bin, cut_number() cannot split those identical values into different bins. It must keep them together, resulting in some bins having slightly more or fewer observations than others to maintain valid “breaks” in the data.
Effect of varwidth = TRUE: When this is active, geom_boxplot calculates the horizontal width of each box proportional to the square root of the number of observations in that bin. Because cut_number() produced bins with slightly different counts (due to those tied carat values), the box widths reflect those differences.
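The claim about unequal bin sizes can be checked by counting the observations in each `cut_number()` bin. In this sketch, `rel_sqrt_n` is a hypothetical helper column (the square root of each count relative to the largest), not something ggplot2 returns.

```r
library(dplyr)
library(ggplot2)

smaller <- diamonds |> filter(carat < 3)

bin_counts <- smaller |>
  count(carat_bin = cut_number(carat, 20)) |>   # observations per bin
  mutate(rel_sqrt_n = sqrt(n) / max(sqrt(n)))   # relative sqrt(n)

bin_counts
```

If `rel_sqrt_n` turns out to be close to 1 for every bin, the counts alone cannot account for large differences in box width, and the explanation would need to be revisited.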
From Copilot AI
Since each cut_number() group has the same number of observations, you
would expect equal widths, but ggplot2 does not use the number of
observations in the group you created. Instead, it uses the number of
observations in the x-position category.
So ggplot2 sees:
Convert the groups into a factor and map that to x:
ggplot(smaller, aes(x = cut_number(carat, 20), y = price)) +
geom_boxplot(varwidth = TRUE)
Now: