# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Diamonds_Analysis_Revised_AndrewMarchant
Exploratory Data Analysis
Clean Data
Read In Data
Remove unusual values
diamonds2 <- diamonds %>%
mutate(x = ifelse(x == 0, NA, x))
diamonds2 <- diamonds2 %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
diamonds2 <- diamonds2 %>%
mutate(z = ifelse(z == 0 | z > 30, NA, z))Question to consider?
- Given a low carat and high carat diamond, what is the most important characteristic in determining price?
Show Carat has the greatest impact on price of the diamond
Carat Vs. Price
[1] 0.9215913
What characteristic of the diamond has the most impact on it’s price?
lm <- lm(price ~ carat + cut + color + clarity, diamonds2)
summary(lm)
Call:
lm(formula = price ~ carat + cut + color + clarity, data = diamonds2)
Residuals:
Min 1Q Median 3Q Max
-16813.5 -680.4 -197.6 466.4 10394.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3710.603 13.980 -265.414 < 2e-16 ***
carat 8886.129 12.034 738.437 < 2e-16 ***
cut.L 698.907 20.335 34.369 < 2e-16 ***
cut.Q -327.686 17.911 -18.295 < 2e-16 ***
cut.C 180.565 15.557 11.607 < 2e-16 ***
cut^4 -1.207 12.458 -0.097 0.923
color.L -1910.288 17.712 -107.853 < 2e-16 ***
color.Q -627.954 16.121 -38.952 < 2e-16 ***
color.C -171.960 15.070 -11.410 < 2e-16 ***
color^4 21.678 13.840 1.566 0.117
color^5 -85.943 13.076 -6.572 5.00e-11 ***
color^6 -49.986 11.889 -4.205 2.62e-05 ***
clarity.L 4217.535 30.831 136.794 < 2e-16 ***
clarity.Q -1832.406 28.827 -63.565 < 2e-16 ***
clarity.C 923.273 24.679 37.411 < 2e-16 ***
clarity^4 -361.995 19.739 -18.339 < 2e-16 ***
clarity^5 216.616 16.109 13.447 < 2e-16 ***
clarity^6 2.105 14.037 0.150 0.881
clarity^7 110.340 12.383 8.910 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1157 on 53921 degrees of freedom
Multiple R-squared: 0.9159, Adjusted R-squared: 0.9159
F-statistic: 3.264e+04 on 18 and 53921 DF, p-value: < 2.2e-16
lm2 <- lm(price ~ cut + color + clarity, diamonds2)
summary(lm2)
Call:
lm(formula = price ~ cut + color + clarity, data = diamonds2)
Residuals:
Min 1Q Median 3Q Max
-6373 -2517 -1254 1159 17013
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3864.910 31.661 122.070 < 2e-16 ***
cut.L -56.512 67.702 -0.835 0.403883
cut.Q -249.924 59.706 -4.186 2.84e-05 ***
cut.C -525.553 51.760 -10.154 < 2e-16 ***
cut^4 -169.343 41.523 -4.078 4.54e-05 ***
color.L 2093.398 56.209 37.243 < 2e-16 ***
color.Q 22.783 53.660 0.425 0.671144
color.C -230.440 50.237 -4.587 4.51e-06 ***
color^4 68.794 46.136 1.491 0.135940
color^5 -272.289 43.582 -6.248 4.20e-10 ***
color^6 -3.699 39.631 -0.093 0.925641
clarity.L -1502.243 99.481 -15.101 < 2e-16 ***
clarity.Q -568.685 95.928 -5.928 3.08e-09 ***
clarity.C 684.040 82.263 8.315 < 2e-16 ***
clarity^4 -240.521 65.798 -3.655 0.000257 ***
clarity^5 808.666 53.635 15.077 < 2e-16 ***
clarity^6 -212.184 46.783 -4.536 5.76e-06 ***
clarity^7 197.454 41.279 4.783 1.73e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3856 on 53922 degrees of freedom
Multiple R-squared: 0.06587, Adjusted R-squared: 0.06557
F-statistic: 223.7 on 17 and 53922 DF, p-value: < 2.2e-16
Remove strong correlation of carat size vs. price
mod <- lm(log(price) ~ log(carat), data = diamonds2)
diamonds3 <- diamonds2 %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds3) +
geom_point(mapping = aes(x = carat, y = resid))ggplot(diamonds3, aes(x = resid)) +
geom_histogram(binwidth = .05) +
labs(title = "Histogram of resididuals", x = "Residual", y = "Count") +
theme_minimal()Residuals are normally distributed which will allow us to run an anova test
After removing strong relationship between carat and price, view characteristics of diamonds relation to price: relative to their size
Cut vs. Price
Color vs. Price
Clarity vs. Price
Interpretation: The lower end of each diamond characteristic has more clustered density, which means that member has a higher probability of taking on a lower price. The higher end characteristics have a more scattered density indicating more variability in price.
What is the most important characteristic of determining high / low price diamonds relative to size?
Subset high and low weight diamonds
hc_diamonds <- subset(diamonds3, carat >= 1.5)
lc_diamonds <- subset(diamonds3, carat <= 1)Switched from price to weight
Characteristics of high carat diamonds
Anova Test
model_hc <- aov(resid ~ cut + color + clarity, data = hc_diamonds)
summary(model_hc) Df Sum Sq Mean Sq F value Pr(>F)
cut 4 34.68 8.671 474.1 <2e-16 ***
color 6 126.11 21.018 1149.1 <2e-16 ***
clarity 7 208.89 29.842 1631.5 <2e-16 ***
Residuals 6217 113.71 0.018
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Characteristics of low carat diamonds
Anova Test
model_lc <- aov(resid ~ cut + color + clarity, data = lc_diamonds)
summary(model_lc) Df Sum Sq Mean Sq F value Pr(>F)
cut 4 114.9 28.73 1142 <2e-16 ***
color 6 329.9 54.98 2186 <2e-16 ***
clarity 7 938.4 134.06 5330 <2e-16 ***
Residuals 36420 916.0 0.03
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusions
Carat of the diamond is most important is determining the price of the diamond
As a diamond re-seller looking for good deals on diamonds
given a low weight diamond clarity is going to drive up the price the most
given a higher weight diamond clarity is the most important characteristic to look at, but color is going to have a decent significance on the price