DV Lab 5

Diamonds Dataset

Author

Aaron Devine

My chosen categorical variables for this lab are Cut and color.

Code

library(ggplot2)
library(tidyverse)
data(diamonds)
map(diamonds, ~sum(is.na(.)))

$carat
[1] 0

$cut
[1] 0

$color
[1] 0

$clarity
[1] 0

$depth
[1] 0

$table
[1] 0

$price
[1] 0

$x
[1] 0

$y
[1] 0

$z
[1] 0

Part 1

Code

pip1 <- diamonds %>%
  group_by(cut, color) %>%
  summarize(N = n()) %>%
  mutate(freq = N/sum(N),
         pct = round((freq*100),0))
pip1

# A tibble: 35 × 5
# Groups:   cut [5]
   cut   color     N   freq   pct
   <ord> <ord> <int>  <dbl> <dbl>
 1 Fair  D       163 0.101     10
 2 Fair  E       224 0.139     14
 3 Fair  F       312 0.194     19
 4 Fair  G       314 0.195     20
 5 Fair  H       303 0.188     19
 6 Fair  I       175 0.109     11
 7 Fair  J       119 0.0739     7
 8 Good  D       662 0.135     13
 9 Good  E       933 0.190     19
10 Good  F       909 0.185     19
# ℹ 25 more rows

Code

p <- ggplot(data = subset(pip1, !is.na(cut) & !is.na(color)),
            aes(x = cut, y = pct, fill = color))

Yes, the data seems to make sense. J colored diamonds appear less frequent than all others across all cut qualities while the majority of diamonds are in the middle color categories. Our data also has zero NAs which is good.

Part 2

Code

p + geom_col(position = "stack") +
  labs(x = "Cut", y = "pct", fill = "Color",
       title = "Cut Vs. Color", caption = "Data: diamonds{ggplot2}",
       subtitle = "Stacked Bar Chart for Cut and Color")

Code

p + geom_col(position = "dodge") +
  labs(x = "Cut", y = "pct", fill = "Color",
       title = "Cut Vs. Color", caption = "Data: diamonds{ggplot2}",
       subtitle = "As expected, diamonds are clumped around the middle cut categories")

Code

p <- ggplot(pip1, aes(x = color, y = pct, fill = color))
p + geom_col(position = "dodge2") +
    labs(x = "Color", y = "Percent", fill = "color") +
    guides(fill = FALSE) + 
    coord_flip() + 
    facet_grid(~ cut) +
  labs(title = "Color Percent by Cut", caption = "Data: diamonds{ggplot2}",
       subtitle = "Distribution of Color remains very similar across cut types")

The first Graph looks vary consistent across all cuts and colors. Graph 2 shows that pct rises with the middle color qualities and is much lower at the extremes of high and low quality.The final graph also shows that different cut types are shockingly similar. Other variables will need to be explored.

Part 3

Price and Table were my selected numerical variables for this lab.

Code

pip2 <- diamonds %>%
  group_by(cut) %>%
  summarize(mean_price = mean(price),
            mean_table = mean(table))
pip2

# A tibble: 5 × 3
  cut       mean_price mean_table
  <ord>          <dbl>      <dbl>
1 Fair           4359.       59.1
2 Good           3929.       58.7
3 Very Good      3982.       58.0
4 Premium        4584.       58.7
5 Ideal          3458.       56.0

This result is different than I expected because the lowest quality cut has the second highest mean price and the highest quality cut has the lowest mean price. However, this can probably be explained by the fact that the lower quality cut diamonds tend to be larger as indicated by the large table value.

Part 4 and 5

Code

p <- ggplot(data = pip2,
            aes(x = mean_table, y = mean_price, color = cut))
p + geom_point() +
    annotate(geom = "text", x = 56, y = 3500,
                            label = "A surprisingly low mean \n price for Ideal.",
                            hjust = 0) + 
  labs(x = "Mean Table",
         y = "Mean Price",
        color = "Cut Quality",
        title = "Mean Price and mean Table value for each cut", 
        caption = "Data: diamonds{ggplot2}",
        subtitle = "High quality cuts tend to have small tables while low quality cuts have high price and large tables") +
  theme(legend.position = "bottom")

Part 6

Code

p <- ggplot(data = pip2,
            aes(x = mean_table, y = mean_price, color = cut))
p + geom_point() +
    annotate(geom = "text", x = 56, y = 3500,
                            label = "A surprisingly low mean \n price for Ideal.",
                            hjust = 0) + 
  labs(x = "Mean Table",
         y = "Mean Price",
        color = "Cut Quality",
        title = "Mean Price and mean Table value for each cut", 
        caption = "Data: diamonds{ggplot2}",
        subtitle = "High quality cuts tend to have small tables while low quality cuts have high price and large tables") +
  geom_text(mapping = aes(label = cut), hjust = 1) +
    guides(color = FALSE)

Part 7

Here we see that the Ideal cut has both the lowest table width and lowest mean price.This surprised me given that it is the highest quality cut. Good and Very Good are in the middle while Premium and Fair are at the top end. Large Premium diamonds have the highest mean price while large fair diamonds are the second most valuable combination.