In this assignment, I’ll look at the diamonds data set to estimate what diamonds are worth and what drives their price when reselling. I’ll do this with a few visuals and some quantitative tests. (This is not meant to be a complicated analysis.)
First, I need to load the data I’m working with and get familiar with it.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
diamonds
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
I decided to start with a scatter plot matrix to get an idea of which numeric variables tend to be high together or low together. This does not work for categorical data, so this step leaves out the cut, color, and clarity columns.
library(corrplot)
## corrplot 0.92 loaded
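# Keep only the numeric columns: carat (column 1) and depth through z (columns 5 to 10)
quant_cols <- as.matrix(diamonds[, c(1, 5:10)])
# Scatter plot matrix of every pair of numeric variables
pairs(quant_cols)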
Based on the scatter plot matrix, the most closely related pairs were carat with x, y, and z; price with y and z; and x with y and z. However, although y and z pair with many of the same variables, they do not appear to be closely related to each other in the plot. It is important to note that the variable we want to maximize is price.
To look closer at this, we will use the correlation coefficient to judge how well price can be estimated from the y and z variables.
cor.test(diamonds$price, diamonds$y)
##
## Pearson's product-moment correlation
##
## data: diamonds$price and diamonds$y
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8632867 0.8675241
## sample estimates:
## cor
## 0.8654209
cor.test(diamonds$price, diamonds$z)
##
## Pearson's product-moment correlation
##
## data: diamonds$price and diamonds$z
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8590541 0.8634131
## sample estimates:
## cor
## 0.8612494
Both show strong positive correlations with price (about 0.87 for y and 0.86 for z), which supports what we saw in the scatter plot matrix.
Next, we are going to look at how cut, color, and clarity affect the price, using a box-and-whisker plot for each of the three.
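The two tests above can be extended to every numeric column in one call. The following is a minimal sketch, assuming the `diamonds` data set from ggplot2; the names `num_cols`, `cor_mat`, and `price_cor` are my own and not part of the original analysis.

```r
library(ggplot2)  # only needed here for the diamonds data set

# Numeric columns only; cut, color, and clarity are ordered factors
num_cols <- c("carat", "depth", "table", "price", "x", "y", "z")

# Pearson correlation between every pair of numeric columns
cor_mat <- cor(diamonds[, num_cols])

# Correlation of each numeric variable with price, strongest first
price_cor <- sort(cor_mat[, "price"], decreasing = TRUE)
print(round(price_cor, 3))
```

The entries for y and z should match the `cor.test()` estimates above, and the other columns show whether any other measurement tracks price as closely.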
library(ggplot2)
ggplot(data = diamonds) +
  aes(x = cut, y = price) +
  geom_boxplot()
ggplot(data = diamonds) +
  aes(x = color, y = price) +
  geom_boxplot()
ggplot(data = diamonds) +
  aes(x = clarity, y = price) +
  geom_boxplot()
This tells us that every cut includes some high-priced diamonds, and the typical price (the median line in each box) is about the same for each cut. I take this to mean that cut doesn’t change the price much. Surprisingly, color has a greater effect on price: the typical prices of colors ‘H’, ‘I’, and ‘J’ were slightly higher. Lastly, the clarity with the highest typical price is SI2.
Another thing to look at is how often each category occurs relative to the total. To look at this, I will use pie charts.
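Reading typical prices off a box plot is approximate, so it can help to compute them directly. This is a rough sketch, assuming dplyr and the ggplot2 `diamonds` data; `price_by()` is a hypothetical helper I made up for illustration, not a function from any package.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Hypothetical helper: count, median, and mean price for each level
# of one categorical column (passed as a string)
price_by <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarise(
      n = n(),
      median_price = median(price),
      mean_price = mean(price),
      .groups = "drop"
    )
}

cut_summary <- price_by(diamonds, "cut")
color_summary <- price_by(diamonds, "color")
clarity_summary <- price_by(diamonds, "clarity")

print(clarity_summary)
```

The median corresponds to the line inside each box, while the mean is pulled upward by the expensive diamonds in the upper whisker, so the two can rank the categories slightly differently.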
cut_count = table(diamonds$cut)
pie(cut_count)
col_count = table(diamonds$color)
pie(col_count)
clar_count = table(diamonds$clarity)
pie(clar_count)
Here we see that the most common cut is Ideal. This further suggests that cut has little effect on price, since the “best” cut occurs most often and should not be hard to obtain. The colors are more evenly distributed, but the least common colors are clearly ‘H’, ‘I’, and ‘J’. This is slightly concerning because it could mean that aiming for a majority of ‘H’, ‘I’, and ‘J’ diamonds is inefficient. Finally, the most common clarity was SI1, which did not have the highest typical price; that was SI2. Unlike the higher-priced colors, SI2 does not appear to be among the rarest categories, which is more encouraging and could affect the results of the box-and-whisker plot. It is important to note that these charts show only the proportion of each category in the set, not the total number of occurrences.
Overall, the target type of diamond seems to be one with high ‘y’ and ‘z’ values, meaning it is wide and tall. It also appears more beneficial to have the ‘H’, ‘I’, or ‘J’ color ratings, with a majority being ‘H’ so as not to overdo the cost of obtaining the diamonds. Lastly, a clarity of SI2 seems to show the most payoff.
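Because the pie charts show only proportions, and similar slices are hard to compare by eye, the counts and shares can also be printed as numbers. A minimal sketch using base R `table()` and `prop.table()`; the variable names are my own.

```r
library(ggplot2)  # provides the diamonds data set

# Raw counts for each color grade
col_count <- table(diamonds$color)
print(col_count)

# Share of the whole data set for each color grade, most common first
col_share <- sort(prop.table(col_count), decreasing = TRUE)
print(round(col_share, 3))
```

Swapping `diamonds$color` for `diamonds$cut` or `diamonds$clarity` gives the same breakdown for the other two pie charts, which makes it easier to check which categories really are the rarest.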