Code
library(ggplot2)
library(tidyverse)
data(diamonds)
map(diamonds, ~sum(is.na(.)))$carat
[1] 0
$cut
[1] 0
$color
[1] 0
$clarity
[1] 0
$depth
[1] 0
$table
[1] 0
$price
[1] 0
$x
[1] 0
$y
[1] 0
$z
[1] 0
Diamonds Dataset
My chosen categorical variables for this lab are Cut and color.
library(ggplot2)
library(tidyverse)
data(diamonds)
map(diamonds, ~sum(is.na(.)))$carat
[1] 0
$cut
[1] 0
$color
[1] 0
$clarity
[1] 0
$depth
[1] 0
$table
[1] 0
$price
[1] 0
$x
[1] 0
$y
[1] 0
$z
[1] 0
pip1 <- diamonds %>%
group_by(cut, color) %>%
summarize(N = n()) %>%
mutate(freq = N/sum(N),
pct = round((freq*100),0))
pip1# A tibble: 35 × 5
# Groups: cut [5]
cut color N freq pct
<ord> <ord> <int> <dbl> <dbl>
1 Fair D 163 0.101 10
2 Fair E 224 0.139 14
3 Fair F 312 0.194 19
4 Fair G 314 0.195 20
5 Fair H 303 0.188 19
6 Fair I 175 0.109 11
7 Fair J 119 0.0739 7
8 Good D 662 0.135 13
9 Good E 933 0.190 19
10 Good F 909 0.185 19
# ℹ 25 more rows
p <- ggplot(data = subset(pip1, !is.na(cut) & !is.na(color)),
aes(x = cut, y = pct, fill = color))Yes, the data seems to make sense. J colored diamonds appear less frequent than all others across all cut qualities while the majority of diamonds are in the middle color categories. Our data also has zero NAs which is good.
p + geom_col(position = "stack") +
labs(x = "Cut", y = "pct", fill = "Color",
title = "Cut Vs. Color", caption = "Data: diamonds{ggplot2}",
subtitle = "Stacked Bar Chart for Cut and Color") p + geom_col(position = "dodge") +
labs(x = "Cut", y = "pct", fill = "Color",
title = "Cut Vs. Color", caption = "Data: diamonds{ggplot2}",
subtitle = "As expected, diamonds are clumped around the middle cut categories") p <- ggplot(pip1, aes(x = color, y = pct, fill = color))
p + geom_col(position = "dodge2") +
labs(x = "Color", y = "Percent", fill = "color") +
guides(fill = FALSE) +
coord_flip() +
facet_grid(~ cut) +
labs(title = "Color Percent by Cut", caption = "Data: diamonds{ggplot2}",
subtitle = "Distribution of Color remains very similar across cut types")The first Graph looks vary consistent across all cuts and colors. Graph 2 shows that pct rises with the middle color qualities and is much lower at the extremes of high and low quality.The final graph also shows that different cut types are shockingly similar. Other variables will need to be explored.
Price and Table were my selected numerical variables for this lab.
pip2 <- diamonds %>%
group_by(cut) %>%
summarize(mean_price = mean(price),
mean_table = mean(table))
pip2# A tibble: 5 × 3
cut mean_price mean_table
<ord> <dbl> <dbl>
1 Fair 4359. 59.1
2 Good 3929. 58.7
3 Very Good 3982. 58.0
4 Premium 4584. 58.7
5 Ideal 3458. 56.0
This result is different than I expected because the lowest quality cut has the second highest mean price and the highest quality cut has the lowest mean price. However, this can probably be explained by the fact that the lower quality cut diamonds tend to be larger as indicated by the large table value.
p <- ggplot(data = pip2,
aes(x = mean_table, y = mean_price, color = cut))
p + geom_point() +
annotate(geom = "text", x = 56, y = 3500,
label = "A surprisingly low mean \n price for Ideal.",
hjust = 0) +
labs(x = "Mean Table",
y = "Mean Price",
color = "Cut Quality",
title = "Mean Price and mean Table value for each cut",
caption = "Data: diamonds{ggplot2}",
subtitle = "High quality cuts tend to have small tables while low quality cuts have high price and large tables") +
theme(legend.position = "bottom")p <- ggplot(data = pip2,
aes(x = mean_table, y = mean_price, color = cut))
p + geom_point() +
annotate(geom = "text", x = 56, y = 3500,
label = "A surprisingly low mean \n price for Ideal.",
hjust = 0) +
labs(x = "Mean Table",
y = "Mean Price",
color = "Cut Quality",
title = "Mean Price and mean Table value for each cut",
caption = "Data: diamonds{ggplot2}",
subtitle = "High quality cuts tend to have small tables while low quality cuts have high price and large tables") +
geom_text(mapping = aes(label = cut), hjust = 1) +
guides(color = FALSE)Here we see that the Ideal cut has both the lowest table width and lowest mean price.This surprised me given that it is the highest quality cut. Good and Very Good are in the middle while Premium and Fair are at the top end. Large Premium diamonds have the highest mean price while large fair diamonds are the second most valuable combination.