# Libraries
library(ggplot2)
library(trelliscopejs)
library(tidyverse)

# Data
diamonds <- read.csv("diamonds.csv")
diamonds <- as_tibble(diamonds)

# ggplot2 scatter plot of price against carat
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# Keep columns 3 through 12, then attach a per-cut cognostic
diamonds2 <- diamonds %>%
  select(3:12) %>%
  group_by(cut) %>%
  mutate(Avg_PricePerCarat = cog(round(mean(price / carat), 2),
                                 desc = "Average price per carat for each cut",
                                 default_label = TRUE))

# Trelliscope display: one panel per cut, points colored by diamond color
ggplot(diamonds2, aes(x = carat, y = price, col = color)) +
  geom_point() +
  facet_trelliscope(~cut,
                    name = "Diamond Price vs Carat by Cut",
                    desc = "Colored by Diamond Color",
                    nrow = 2,
                    ncol = 3,
                    path = ".",
                    self_contained = TRUE)

The original data set was called ‘diamonds’ and was downloaded from DataCamp. It initially had 12 columns and 1000 observations. It gives information on the carat, cut, color, clarity, depth, table, price, x, y, and z of diamonds. Carat is the numerical weight of the diamond. Cut is the categorical cut quality, ordered from worst to best: Fair, Good, Very Good, Premium, Ideal. Color is the categorical color of the diamond, ordered from worst (J) to best (D). Clarity is a categorical measure of how clear the diamond is, ordered from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF. Depth is a numerical variable giving the total depth percentage of the diamond. Table is a numerical variable giving the width of the top of the diamond relative to its widest point. Price is an integer variable giving the price of the diamond. Finally, x, y, and z are numerical variables giving the length, width, and depth.
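Because read.csv() loads cut, color, and clarity as plain text, R does not know the worst-to-best orderings described above. A minimal sketch of encoding them as ordered factors follows, assuming the level names match the ones listed in the description; the `toy` data frame and its values are made up for illustration and stand in for the real file.

```r
# Toy rows standing in for the diamonds data (values are illustrative only)
toy <- data.frame(
  cut     = c("Ideal", "Fair", "Premium", "Good", "Very Good"),
  color   = c("E", "J", "D", "H", "G"),
  clarity = c("SI2", "I1", "IF", "VS1", "VVS2"),
  stringsAsFactors = FALSE
)

# Worst-to-best orderings, as listed in the data description
cut_levels     <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
color_levels   <- c("J", "I", "H", "G", "F", "E", "D")
clarity_levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")

# Convert each text column to an ordered factor
toy$cut     <- factor(toy$cut,     levels = cut_levels,     ordered = TRUE)
toy$color   <- factor(toy$color,   levels = color_levels,   ordered = TRUE)
toy$clarity <- factor(toy$clarity, levels = clarity_levels, ordered = TRUE)

# Comparisons now follow quality rather than alphabetical order
toy$cut[1] > toy$cut[2]   # is Ideal better than Fair?
sort(toy$cut)             # sorted from worst to best
```

With ordered factors, panels and legends would typically appear in quality order rather than alphabetically.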
For this data set, I decided to create a scatter plot of price vs carat because this type of graph makes it easy to see the relationship between two continuous variables. I am investigating this relationship because it shows whether the carat of a diamond is related to its price. This is information that buyers of diamonds or jewelry need before purchasing in order to make an informed choice. I also used the diamond’s color to color the points, to see whether any colors cluster at certain carat or price ranges. The plot was then faceted by cut, which shows a reader whether the relationship between price and carat differs by the cut quality of the diamond. Finally, I created a cognostic that reports the average price per carat for each cut’s scatter plot. This gives insight into which cuts are the most expensive and which offer the most value for the money spent.
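The average price per carat behind the cognostic can be illustrated on a few rows. This is a minimal sketch using dplyr; the `toy` tibble and its prices are invented for illustration, and `avg_ppc` is a hypothetical name.

```r
library(dplyr)

# Toy rows standing in for the diamonds data (values are illustrative only)
toy <- tibble(
  cut   = c("Fair", "Fair", "Ideal", "Ideal"),
  carat = c(1.0, 2.0, 0.5, 1.0),
  price = c(4000, 9000, 1500, 3500)
)

# Per-diamond price per carat, averaged within each cut and rounded to 2 decimals
avg_ppc <- toy %>%
  group_by(cut) %>%
  summarise(Avg_PricePerCarat = round(mean(price / carat), 2),
            .groups = "drop")

avg_ppc
```

Note that mean(price / carat) averages the per-diamond ratios, so every diamond counts equally. Dividing total price by total carat within a cut would weight larger stones more heavily and generally gives a different number.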
Overall, price and carat appear to have a positive, roughly exponential relationship for every cut. Color does not show a consistent pattern or clustering. Fair cuts have the highest average price per carat even though they are the lowest quality, which shows that cut alone does not determine the price of a diamond. I did not encounter many challenges while making the graphs. I initially plotted the data without trelliscope to see whether there was any major skew that would make the graph hard to read, and there was not, so I did not filter any values. I did have to add ‘message=FALSE, warning=FALSE’ to the code chunk options to keep warnings and messages from printing. Also, the cognostic initially showed more decimal places than make sense for an average price per carat, so I used round() to round it to 2 decimals.
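One way to check a roughly exponential reading of a scatter plot is to see whether log(price) is close to linear in carat. The sketch below uses synthetic data generated from an exact exponential curve, not the diamonds file; the constants 500 and 1.2 are arbitrary assumptions chosen only to show the idea.

```r
# Synthetic data following price = 500 * exp(1.2 * carat) exactly
# (the constants are arbitrary; this is not the diamonds data)
carat <- seq(0.3, 2.0, by = 0.1)
price <- 500 * exp(1.2 * carat)

# If the relationship is exponential, log(price) is linear in carat
fit <- lm(log(price) ~ carat)

coef(fit)[["carat"]]              # slope on the log scale
exp(coef(fit)[["(Intercept)"]])   # implied price at carat = 0
```

On real data the points would scatter around the fitted line, and a clearly curved pattern on the log scale would suggest a different form, such as a power law. In ggplot2, adding scale_y_log10() to the scatter plot gives a visual version of the same check.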