library(ggplot2)
library(trelliscopejs)
library(tidyverse)
library(purrr)
# look at carat variable
ggplot(diamonds, aes(carat)) +
  geom_histogram()

summary(diamonds$carat) # most carats are less than 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.2000  0.4000  0.7000  0.7979  1.0400  5.0100
sum(diamonds$carat > 2) # not leaving out many observations
## [1] 1889
# functions for cognostic measures
price_range_fn <- function(x) {
  max(x$price) - min(x$price)
}
price_per_carat_fn <- function(x) {
  mean(x$price / x$carat)
}
count_fn <- function(x) {
  nrow(x)
}
# create new subset
diamonds_sub <- diamonds %>%
  # filter out carat outliers
  filter(carat <= 2) %>%
  # group by cut
  group_by(cut) %>%
  nest() %>%
  # find price range, number of diamonds, and avg price per carat for each cut
  mutate(
    price_range = map_dbl(data, price_range_fn),
    price_range = cog(price_range, desc = "Price Range", default_label = TRUE),
    diamond_count = map_dbl(data, count_fn),
    diamond_count = cog(diamond_count, desc = "Diamond Count", default_label = TRUE),
    price_per_carat = map_dbl(data, price_per_carat_fn),
    price_per_carat = cog(price_per_carat, desc = "Avg price per carat", default_label = TRUE),
    # make plots
    plots = map_plot(data, function(d) {
      ggplot(d, aes(x = carat, y = price, color = color)) +
        geom_point(alpha = 0.5)
    })
  ) %>%
  ungroup()
# trelliscope plots faceted on diamond cut
trelliscope(diamonds_sub,
            name = "Price vs. Carat of Diamonds by Color",
            desc = "How does price of diamond vary based on carat?",
            nrow = 1, ncol = 2,
            path = "output/assignment4")

The dataset I chose to investigate with this graph is the diamonds
dataset from DataCamp, loaded here from ggplot2. The variables cut,
color, carat, and clarity describe a diamond’s weight and appearance.
The variables x, y, z, depth, and table are more detailed measurements
of the diamond’s size and dimensions. I removed the 1,889 diamonds with
a carat above 2, which are abnormally large for this data, to make the
results more representative. For this plot I graphed price against
carat to look more closely at the relationship between these two
continuous variables. I faceted on the cut variable, which places each
diamond into a quality category. The levels are “Ideal”, “Premium”,
“Very Good”, “Good”, and “Fair”. More specifically, these levels relate
to how well the diamond reflects light: “Ideal” cuts are highly
reflective, while “Fair” cuts are less reflective, so their shape makes
them look less bright or shiny to the eye. I chose to facet on this
variable to see how different cuts are priced as carat changes. I
expect price to increase with carat for every cut, but the rate of
increase or the overall price level may differ by cut. I also colored
the points by the diamond’s color grade to see how color affects price
after considering carat’s influence.
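
The expectation that price rises with carat in every cut, possibly at
different rates, can also be checked numerically. The following is a
minimal sketch and not part of the original analysis: it assumes a
straight-line fit is an acceptable rough summary, and the names
diamonds_small and slope_by_cut are made up for illustration. The slope
is computed as cov(carat, price) / var(carat), which is the
least-squares slope of price on carat.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Same outlier filter as in the analysis above
diamonds_small <- diamonds %>%
  filter(carat <= 2)

# One least-squares slope of price on carat per cut.
# Assumption of this sketch: a linear trend is a usable rough summary.
slope_by_cut <- diamonds_small %>%
  group_by(cut) %>%
  summarise(
    n = n(),
    slope = cov(carat, price) / var(carat),
    .groups = "drop"
  )

slope_by_cut
```

A larger slope means price climbs faster per additional carat within
that cut, which is one way to put a number on the "steepness" of each
panel.
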

Based on the trelliscope plots, it is clear that as a diamond’s carat
increases, its price tends to increase. There is an overall positive
relationship between the two in this data. When faceting by cut, it is
evident that, once carat is accounted for, better-cut diamonds tend to
be priced higher. The plots are all similar in shape, with slight
differences in steepness. Prices appear to become less consistent as
cut worsens. This suggests that even as carat increases, a worse cut
can lower the price of the diamond. However, some diamonds with less
ideal cuts are still priced high because of their carat and possibly
other variables. Based on these plots, color tends to be better for
smaller-carat diamonds, and prices tend to increase as the color rating
improves.
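
One way to probe the claim that better cuts are priced higher once
carat is accounted for is to compare cuts within narrow carat ranges.
This sketch is an addition and not the method used above; the 0.5-carat
bin width and the names carat_bin and price_by_bin are assumptions
chosen for illustration.

```r
library(ggplot2)  # diamonds data and cut_width()
library(dplyr)
library(tidyr)

# Median price for each cut within 0.5-carat bins (bin width is an assumption)
price_by_bin <- diamonds %>%
  filter(carat <= 2) %>%
  mutate(carat_bin = cut_width(carat, width = 0.5, boundary = 0)) %>%
  group_by(carat_bin, cut) %>%
  summarise(median_price = median(price), .groups = "drop") %>%
  # one row per carat bin, one column per cut
  pivot_wider(names_from = cut, values_from = median_price)

price_by_bin
```

Reading across a row compares cuts at roughly the same carat, which
separates the effect of cut from the effect of size more directly than
the scatterplots do.
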

I created a few cognostic measures that the reader might want to
investigate. The first measure is the price range, which is the
difference between the maximum and minimum price in each cut category.
The price ranges are similar across the different levels, so the spread
of prices seems comparable regardless of cut. I also thought it was
important to consider the number of diamonds in the dataset at each cut
level in order to see how unbalanced the categories are. The largest
share of diamonds in this dataset fall into the “Ideal” cut, and very
few of them are in the “Fair” and “Good” categories, which may make
them difficult to compare. This is important to consider during
analysis. The average price per carat for each cut is also an
interesting measure to consider. It takes the ratio of each diamond’s
price to its carat and then averages these ratios within each cut
level, which better represents the price difference between cuts. The
diamonds with “Very Good” and “Premium” cuts have higher average price
per carat values than “Fair” and “Good” cut diamonds, with the “Ideal”
level falling directly in the middle. Since there are so many “Ideal”
cut diamonds, there may be some outliers that are skewing this
measurement. Also, there are other aspects of the diamonds that could
be affecting price more than the cut does.
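
The three cognostics can also be viewed as a plain summary table, which
makes the comparison across cuts easier to read than flipping between
panels. This is a sketch under assumptions: cog_table is a hypothetical
name, and the median price per carat column is not one of the original
measures. It is added only because a median is less sensitive to the
outliers mentioned above.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

cog_table <- diamonds %>%
  filter(carat <= 2) %>%
  group_by(cut) %>%
  summarise(
    price_range = max(price) - min(price),          # same as price_range_fn
    diamond_count = n(),                            # same as count_fn
    price_per_carat = mean(price / carat),          # same as price_per_carat_fn
    median_price_per_carat = median(price / carat), # extra, not in the original
    .groups = "drop"
  )

cog_table
```

If the mean and median price per carat rank the cuts differently, that
would support the suspicion that a few extreme diamonds are pulling the
averages.
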