diamonds4 <- read.csv("/Users/kylerhalat-shafer/Desktop/UVA/MSDS/STAT 6021/Project 1/diamonds4.csv")
library('tidyverse')
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Assumptions: 1.) Cut is the most important of the 4 C’s because it has the greatest influence on a diamond’s sparkle. Does it also influence price the most? Astor is the highest quality, followed by Ideal (top 3%), Very Good (top 15%), and Good (top 25%). If the assumption holds, Cut should have the biggest influence on price.

2.) Color is the second most important of the 4 C’s; the grade measures how colorless a diamond is. The scale runs D > E > F > G > H > I > J > K > L-Z, so D is the best grade and should have the strongest impact on price in terms of color.

3.) Clarity: FL > IF > VVS > VS > SI > I, with flawless (FL) stones making up less than 1% of diamonds.

4.) Carat: the relationship between carat weight and price depends on the rarity or availability of the rough crystal. The carat-to-price relationship is more a byproduct of society than a reflection of the actual quality of the diamond.

carat_price <- lm(carat~price, data=diamonds4)
summary(carat_price)
## 
## Call:
## lm(formula = carat ~ price, data = diamonds4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0007 -0.2589 -0.1444  0.2000  2.5999 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.229e-01  1.324e-02   47.04   <2e-16 ***
## price       2.701e-05  5.271e-07   51.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4428 on 1212 degrees of freedom
## Multiple R-squared:  0.6842, Adjusted R-squared:  0.6839 
## F-statistic:  2625 on 1 and 1212 DF,  p-value: < 2.2e-16

We see a slightly non-linear relationship when looking strictly at carat v. price: the lower carat weights are over-predicted, whereas the larger carat weights are under-predicted.
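The fit above regresses carat on price, while the scatterplot puts price on the y-axis. As a minimal sketch, assuming price is the response of interest, the model can be fit in that direction and checked with a residual plot. The `toy` data frame below is a made-up stand-in for `diamonds4` (its values are not from the dataset), and `price_carat` is a hypothetical name.

```r
# Toy stand-in for diamonds4 (made-up values, NOT from the dataset):
# price grows faster than linearly in carat.
toy <- data.frame(carat = seq(0.3, 3.0, by = 0.3))
toy$price <- 3000 * toy$carat^2

# Price as the response, matching the axes of the scatterplot.
price_carat <- lm(price ~ carat, data = toy)

# Residual = observed - fitted. A curved pattern in this plot,
# rather than a flat band around 0, signals non-linearity.
toy$fitted <- fitted(price_carat)
toy$resid  <- resid(price_carat)
plot(toy$fitted, toy$resid, xlab = "Fitted price", ylab = "Residual")
abline(h = 0, lty = 2)
```

Swapping `toy` for `diamonds4` would give the same diagnostic on the real data; a positive residual marks an under-predicted diamond and a negative one an over-predicted diamond.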

diamonds4%>%
ggplot(aes(x=carat, y=price))+ 
  geom_point(alpha  = 0.5)+
  geom_smooth(method = 'lm')+
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x="Carat", y="Price", title="Carat v. Price")
## `geom_smooth()` using formula 'y ~ x'

Adding more detail, we can see visually that Ideal and Very Good cuts reach higher carat weights, which also produce higher prices. An area to focus on is the 0 to 2 carat range, to see whether there is a substantial difference in price by cut.
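One way to focus on the 0 to 2 carat range is to subset the data before plotting or summarizing. The sketch below uses a made-up `toy` data frame as a stand-in for `diamonds4` (values are not from the dataset); `small` and `by_cut` are hypothetical names.

```r
# Toy stand-in for diamonds4 (made-up values, NOT from the dataset).
toy <- data.frame(
  carat = c(0.5, 1.0, 1.5, 2.5, 0.7, 1.2, 1.8, 3.0),
  cut   = c("Ideal", "Ideal", "Ideal", "Ideal",
            "Good", "Good", "Good", "Good"),
  price = c(1500, 6000, 12000, 30000, 1800, 5500, 11000, 40000)
)

# Keep only diamonds of at most 2 carats.
small <- subset(toy, carat <= 2)

# Median price per cut within that range.
by_cut <- aggregate(price ~ cut, data = small, FUN = median)
print(by_cut)
```

With the real data, the subset of `diamonds4` could be piped into the same `ggplot()` call used for the cut-colored scatterplot to redraw it on the narrower range.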

diamonds4%>%
ggplot(aes(x=carat, y=price, color=cut))+ 
  geom_point(alpha  = 0.5)+
  geom_smooth(method = 'lm')+
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x="Carat", y="Price", title="Carat v. Price")
## `geom_smooth()` using formula 'y ~ x'

Note: Re-order the levels of cut to match the order above for easier analysis. Astor is the highest quality, then Ideal (top 3%), Very Good (top 15%), and Good (top 25%).
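The re-ordering in the note can be sketched with base R's `factor()`. The level labels below, including "Astor Ideal", are assumptions about how the cut column is spelled; checking `unique(diamonds4$cut)` first would confirm them. `cut_raw` is a made-up vector standing in for the real column.

```r
# Toy stand-in for diamonds4$cut (labels are ASSUMED, not verified).
cut_raw <- c("Good", "Ideal", "Very Good", "Astor Ideal", "Ideal")

# Levels listed from lowest to highest quality. ggplot2 draws
# discrete axes and legends in factor-level order.
cut_levels <- c("Good", "Very Good", "Ideal", "Astor Ideal")
cut_fct <- factor(cut_raw, levels = cut_levels)

levels(cut_fct)
table(cut_fct)

# Any label missing from cut_levels becomes NA, so check for that.
sum(is.na(cut_fct))
```

Applied to the real data, the same call would look like `diamonds4$cut <- factor(diamonds4$cut, levels = cut_levels)`, after which the cut plots follow the quality order instead of alphabetical order.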

Looking at cut alone as a categorical variable against price, Ideal and Very Good again command the higher prices, with carat weights in the 4 to 5 range, whereas a similarly sized Good diamond costs ~$1000 less.

diamonds4%>%
ggplot(aes(x=cut, y=price, size=carat))+ 
  geom_point(alpha  = 0.5)+
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x="Cut", y="Price", title="Cut v. Price")

Note: Re-order the levels of clarity to match the order above for easier analysis. Clarity: FL > IF > VVS > VS > SI > I, with flawless (FL) stones making up less than 1% of diamonds.
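Because clarity grades have a natural ranking, an ordered factor is one option for the re-ordering in the note. This is a sketch under assumptions: the sub-grade labels (VVS1, VVS2, VS1, VS2, SI1, SI2) are guesses at how the clarity column is spelled, and `clarity_raw` is a made-up vector rather than the real column.

```r
# Toy stand-in for diamonds4$clarity (labels are ASSUMED, not verified).
clarity_raw <- c("VS2", "FL", "SI1", "VVS1", "IF")

# Levels from lowest to highest clarity; adjust to the labels
# actually present in the data.
clarity_levels <- c("SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF", "FL")
clarity_ord <- factor(clarity_raw, levels = clarity_levels, ordered = TRUE)

# Ordered factors support rank comparisons against a level name.
high_clarity <- clarity_ord >= "VVS2"
print(high_clarity)

# The best grade present in the vector.
best <- max(clarity_ord)
print(best)
```

An ordered factor keeps the plot axis in grade order and also allows filters such as "VVS2 or better" without listing each grade by hand.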

Here we look at clarity, with flawless (FL) the rarest, then IF, and after that VVS and VS. The most expensive diamond is a flawless diamond, and the second most expensive is a larger-carat diamond in the VS2 category. There appears to be some overlap between the VS1/VS2 and VVS1/VVS2 grades.

diamonds4%>%
ggplot(aes(x=clarity, y=price, size = carat))+ 
  geom_point(alpha  = 0.5)+
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x="Clarity", y="Price", title="Clarity v. Price")

Note: Re-order the levels of color to match the order above for easier analysis. Color: D > E > F > G > H > I > J > K > L-Z

This follows the pattern described online, with better color commanding a higher price for similarly sized diamonds; among the smaller diamonds the difference is hard to make out.

diamonds4%>%
ggplot(aes(x=color, y=price, size = carat))+ 
  geom_point(alpha  = 0.5)+
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x="Color", y="Price", title="Color v. Price")

Again we see that price increases as carat size gets larger, and this holds when breaking the data out by cut and overlaying clarity. The most expensive diamond is flawless, even though it has a Very Good cut rather than Ideal (the higher cut grade) and a smaller carat weight. Ideal has the most diamonds at the highest prices.

diamonds4%>%
ggplot(aes(x=carat, y=price, color=clarity, size = carat))+ 
  geom_point(alpha  = 0.5)+
  facet_wrap(~cut)+
  theme(plot.title = element_text(hjust = 0.5))

diamonds4%>%
ggplot(aes(x=clarity, y=price, color=carat, size=carat))+ 
  geom_point(alpha  = 0.5)+
  facet_wrap(~cut)+
  theme(plot.title = element_text(hjust = 0.5))

diamonds4%>%
ggplot(aes(x=carat, y=price, color=cut, size = carat))+ 
  geom_point(alpha  = 0.5)+
  facet_wrap(~color)+
  theme(plot.title = element_text(hjust = 0.5))