This R Markdown document is an analysis output based on dataset diamonds from ggplot
A data frame with 53940 rows and 10 variables:
Column | Desciption |
---|---|
price | price in US dollars ($326–$18,823) |
carat | weight of the diamond (0.2–5.01) |
cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color | diamond colour, from J (worst) to D (best) |
clarity | a measurement of how clear the diamond is (I1(worst),SI2,SI1,VS2,VS1,VVS2,VVS1,IF(best)) |
x | length in mm (0–10.74) |
y | width in mm (0–58.9) |
z | depth in mm (0–31.8) |
depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
table | width of top of diamond relative to widest point (43–95) |
library(QuantPsyc)
library(MASS)
library(ggplot2)
library(dplyr)
library(data.table)
library(tidyr)
library(GGally)
data(diamonds)
diamonds= data.table(diamonds)
diamonds
## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## ---
## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
First, we can use summary(diamonds)
to view some descriptive facts of the dataset:
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
We can then utilise ggpairs from GGally to visualise the correlation of each variables
ggpairs(diamonds)
From the pairwise correlation plot, we could observe that:
1- There is a relationship (trendable and diverging) in between log Price and carat.
2- Price exhibits very limited or no relationship with table and depths.
3- carat is highly correlated to x,y,z. Further understanding from the dictionary, we then understand that the weight (carat) of a diamond would definitely be affected by its length, width and width (x,y,z respectively).
Question 1: Take the diamonds data set from the base R on R-Studio and explore the relationship between cut and price.
Answer: Through a box plot, we observe that the highest price diamonds are premium cut, lowest price diamonds are ideal cut.
diamonds %>%
group_by(cut) %>%
summarise(n=n(),
mean= mean(price),
median=median(price),
Q1= quantile(price,0.25),
Q3= quantile(price,0.75))
## # A tibble: 5 x 6
## cut n mean median Q1 Q3
## <ord> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Fair 1610 4358.758 3282.0 2050.25 5205.50
## 2 Good 4906 3928.864 3050.5 1145.00 5028.00
## 3 Very Good 12082 3981.760 2648.0 912.00 5372.75
## 4 Premium 13791 4584.258 3185.0 1046.00 6296.00
## 5 Ideal 21551 3457.542 1810.0 878.00 4678.50
diamonds %>%
ggplot(aes(x=cut,y=price, color=cut)) +
geom_boxplot()
Question 2: Show the distribution of each cuts
diamonds %>%
ggplot(aes(x=(price))) +
geom_histogram(stat="bin",binwidth= 500) +
facet_wrap(~cut, scales = "free")
Question 3: Which is the better representation for comparative continuous data(price ~ cut) : histogram or density plot? Explain with reasons.
Answer: Histogram, reason is that there is one cut group (Fair) exposing noise due to small sample size. By adjusting bins adequately noise could be reduced and hence derive to insights more easily without too much noise.
Question 4: What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
To predict diamond price, we would first try to fit the data with a linear regression model. Few assumptions are required before carrying out the analysis:
1) Normality
2) Independecy
3) Homogeneity
QQ plot and histogram both show that variable price shows a better normality after log transformation
par(mfrow=c(1,2))
qqnorm((diamonds$price),main="Normal Q-Q Plot of Price");qqline((diamonds$price))
qqnorm(log(diamonds$price),main="Normal Q-Q Plot of log Price");qqline(log(diamonds$price))
par(mfrow=c(1,2))
hist(diamonds$price,main="Price")
hist(log(diamonds$price),main="log Price")
Recall the ‘Data Exploration’ section which we observed that carat, x, y, z all are highly correlated (dependent) to each other, there is no need to carry variables with similar information, we could hence shortlist carat as one of the features and exclude x,y,z from our model.
cor((diamonds)[,.(carat,x,y,z)])
## carat x y z
## carat 1.0000000 0.9750942 0.9517222 0.9533874
## x 0.9750942 1.0000000 0.9747015 0.9707718
## y 0.9517222 0.9747015 1.0000000 0.9520057
## z 0.9533874 0.9707718 0.9520057 1.0000000
This check can be performed through inspecting the relationship of Price with other continuous variable.
cor(diamonds[,.(price, carat)])
## price carat
## price 1.0000000 0.9215913
## carat 0.9215913 1.0000000
cor(diamonds[,.(log(price), log(carat))])
## V1 V2
## V1 1.0000000 0.9659137
## V2 0.9659137 1.0000000
By plotting the categorical variables as color, we can see that these three variables all show a relationship with Price. From here we could assume that these three variables would be important in predicting price, hence including them as features in our following prediction model.
par(mfrow=c(1,3))
diamonds %>%
ggplot(aes(log(price),log(carat), col= clarity))+
geom_point()
diamonds %>%
ggplot(aes(log(price),log(carat), col= cut))+
geom_point()
diamonds %>%
ggplot(aes(log(price),log(carat), col= color))+
geom_point()
We use the lm function to build the regression model, with the features that we have selected in earlier sessions. To recall, we have excluded x,y,z due to their dependency to carat, additionally depth and table are excluded due to their very low correlation with price.
lm1= lm(log(price)~ log(carat) + cut+color+clarity, data= diamonds)
summary(lm1)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut + color + clarity,
## data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01107 -0.08636 -0.00023 0.08341 1.94778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.457030 0.001168 7242.225 < 2e-16 ***
## log(carat) 1.883718 0.001129 1668.750 < 2e-16 ***
## cut.L 0.120714 0.002354 51.284 < 2e-16 ***
## cut.Q -0.035115 0.002072 -16.950 < 2e-16 ***
## cut.C 0.013479 0.001799 7.494 6.77e-14 ***
## cut^4 -0.001562 0.001441 -1.084 0.278
## color.L -0.439576 0.002027 -216.828 < 2e-16 ***
## color.Q -0.095623 0.001863 -51.335 < 2e-16 ***
## color.C -0.014783 0.001743 -8.481 < 2e-16 ***
## color^4 0.011852 0.001601 7.403 1.35e-13 ***
## color^5 -0.002201 0.001513 -1.455 0.146
## color^6 0.002391 0.001375 1.739 0.082 .
## clarity.L 0.916832 0.003578 256.274 < 2e-16 ***
## clarity.Q -0.243038 0.003330 -72.982 < 2e-16 ***
## clarity.C 0.132400 0.002854 46.387 < 2e-16 ***
## clarity^4 -0.066104 0.002283 -28.955 < 2e-16 ***
## clarity^5 0.027418 0.001864 14.711 < 2e-16 ***
## clarity^6 -0.001819 0.001623 -1.120 0.263
## clarity^7 0.033531 0.001432 23.412 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1338 on 53921 degrees of freedom
## Multiple R-squared: 0.9826, Adjusted R-squared: 0.9826
## F-statistic: 1.693e+05 on 18 and 53921 DF, p-value: < 2.2e-16
We can also compare with a model before log transformation.
lm2= lm(price~carat+cut+color+clarity, data= diamonds)
summary(lm2)
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16813.5 -680.4 -197.6 466.4 10394.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3710.603 13.980 -265.414 < 2e-16 ***
## carat 8886.129 12.034 738.437 < 2e-16 ***
## cut.L 698.907 20.335 34.369 < 2e-16 ***
## cut.Q -327.686 17.911 -18.295 < 2e-16 ***
## cut.C 180.565 15.557 11.607 < 2e-16 ***
## cut^4 -1.207 12.458 -0.097 0.923
## color.L -1910.288 17.712 -107.853 < 2e-16 ***
## color.Q -627.954 16.121 -38.952 < 2e-16 ***
## color.C -171.960 15.070 -11.410 < 2e-16 ***
## color^4 21.678 13.840 1.566 0.117
## color^5 -85.943 13.076 -6.572 5.00e-11 ***
## color^6 -49.986 11.889 -4.205 2.62e-05 ***
## clarity.L 4217.535 30.831 136.794 < 2e-16 ***
## clarity.Q -1832.406 28.827 -63.565 < 2e-16 ***
## clarity.C 923.273 24.679 37.411 < 2e-16 ***
## clarity^4 -361.995 19.739 -18.339 < 2e-16 ***
## clarity^5 216.616 16.109 13.447 < 2e-16 ***
## clarity^6 2.105 14.037 0.150 0.881
## clarity^7 110.340 12.383 8.910 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1157 on 53921 degrees of freedom
## Multiple R-squared: 0.9159, Adjusted R-squared: 0.9159
## F-statistic: 3.264e+04 on 18 and 53921 DF, p-value: < 2.2e-16
Results show that lm1 is better than lm2, though lm2 is good enough, there is a risk to yield inaccurate estimates from lm2 due to its violation against regression’s assumptions.
To understand which feature has the most impact in price prediction, standardized beta coefficients can be referred as the coefficients are standardized to a same scale:
lm.beta(lm1)
## log(carat) cut.L cut.Q cut.C cut^4
## 1.0857443928 0.1328432935 -0.0588713794 0.0218813985 -0.0009000604
## color.L color.Q color.C color^4 color^5
## -0.4837439068 -0.1603169316 -0.0239982978 0.0068311064 -0.0024223173
## color^6 clarity.L clarity.Q clarity.C clarity^4
## 0.0040089295 1.4883442619 -0.1400833855 0.1457032931 -0.1108268197
## clarity^5 clarity^6 clarity^7
## 0.0445095197 -0.0010482888 0.0369001511
Question 4-1: What variable in the diamonds dataset is most important for predicting the price of a diamond?
Answer: The result shows that carat carries is the most important feature.
Question 4-2: How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
Answer: In general, carat and cut hold a negative correlation (see box-plot below): fair cuts are usually higher carat, ideal cuts are usually lower carat. This could be due to larger diamonds are rare also the technical challenge of cutting a large stone into an ideal cutting. This is also the reason why lower quality diamonds, if it is high carats, would be more expensive.
diamonds %>%
ggplot(aes(x=cut,y=carat, color=cut)) +
geom_boxplot()
Question: Explain the relationship between cut and colour (hint: both are categorical)
Since these two categories are categorical, we can cross-tabulate the count of these two variables, and further express in a relative frequency manner.
new=
diamonds %>%
group_by(color, cut) %>%
summarise(n=n()) %>%
group_by(cut) %>%
mutate(sum.n= sum(n)) %>%
ungroup() %>%
mutate(n2= n/sum.n) %>%
select(color, cut, n2)
new %>% spread(cut,n2)
## # A tibble: 7 x 6
## color Fair Good `Very Good` Premium Ideal
## * <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 D 0.10124224 0.13493681 0.12522761 0.11623523 0.13150202
## 2 E 0.13913043 0.19017530 0.19864261 0.16945834 0.18110529
## 3 F 0.19378882 0.18528333 0.17910942 0.16902328 0.17753237
## 4 G 0.19503106 0.17753771 0.19028307 0.21202233 0.22662521
## 5 H 0.18819876 0.14309009 0.15096838 0.17112610 0.14454086
## 6 I 0.10869565 0.10640033 0.09965238 0.10354579 0.09711846
## 7 J 0.07391304 0.06257644 0.05611654 0.05858893 0.04157580
We can also use a heatmap to visualise this information. Darker blue indicates higher count and white indicates low count.
new %>%
ggplot(aes(color, cut)) +
geom_tile(aes(fill=n2*100), colour = "white") +
scale_fill_gradient(low="white",high="blue") +
labs(fill = "Density")
Answer:
From the heatmap, we can conclude that:
Most ideal and premium cuts are from colour G.
Most very good and good cut diamonds are from colour E.
Fair cut diamonds are usually from colour F, G, H.
Overall, all cut group diamonds are rare in colour J.