This analysis focuses on understanding how properties of a diamond affect the price at which it is sold. The data set classifies 308 diamonds from 2000 for sale Singapore by color, clarity, and certification. It also includes data on the weight of the diamond in carats and the price at which it is sold in Singapore $. The variables in the data set are as follows:
The raw data are organized into rows, with each one representing a diamond. There are five columns- one for each of the variables indicated above. For the purpose of this experiment, only the effects of color and clarity will be studied, and the response in price will be analyzed.
First, we assign the data set to a variable using the Ecdat package.
library(effsize)
## Warning: package 'effsize' was built under R version 3.2.5
library(Ecdat)
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.2.5
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
diamond <- Diamond
The first 10 data points are shown below.
head(diamond, n=10)
## carat colour clarity certification price
## 1 0.30 D VS2 GIA 1302
## 2 0.30 E VS1 GIA 1510
## 3 0.30 G VVS1 GIA 1510
## 4 0.30 G VS1 GIA 1260
## 5 0.31 D VS1 GIA 1641
## 6 0.31 E VS1 GIA 1555
## 7 0.31 F VS1 GIA 1427
## 8 0.31 G VVS2 GIA 1427
## 9 0.31 H VS2 GIA 1126
## 10 0.31 I VS1 GIA 1126
All of the variables are recognized as the proper data type by R, and the response variable of price is a numeric. The following code shows the structure as well as a summary of the data.
str(diamond)
## 'data.frame': 308 obs. of 5 variables:
## $ carat : num 0.3 0.3 0.3 0.3 0.31 0.31 0.31 0.31 0.31 0.31 ...
## $ colour : Factor w/ 6 levels "D","E","F","G",..: 1 2 4 4 1 2 3 4 5 6 ...
## $ clarity : Factor w/ 5 levels "IF","VS1","VS2",..: 3 2 4 2 2 2 2 5 3 2 ...
## $ certification: Factor w/ 3 levels "GIA","HRD","IGI": 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 1302 1510 1510 1260 1641 1555 1427 1427 1126 1126 ...
summary(diamond)
## carat colour clarity certification price
## Min. :0.1800 D:16 IF :44 GIA:151 Min. : 638
## 1st Qu.:0.3500 E:44 VS1 :81 HRD: 79 1st Qu.: 1625
## Median :0.6200 F:82 VS2 :53 IGI: 78 Median : 4215
## Mean :0.6309 G:65 VVS1:52 Mean : 5019
## 3rd Qu.:0.8500 H:61 VVS2:78 3rd Qu.: 7446
## Max. :1.1000 I:40 Max. :16008
This experimental design will use Analysis of Variance (ANOVA) on a randomized block design to analyze the effects of the clarity of a diamond on the price. The data will also be blocked by color in determining these effects. This experimental design was chosen because color of the diamond could be a nuisance factor that has an effect on the response variable of price, but is not of interest to the experiment. Using a randomized block design allows the existing data set to be used without choosing the treatments more specifically.
The null hypothesis for this experiment is that the clarity of a diamond has no effect on the price.
There are some repeated measures in this experiment because some of the diamonds have the same color and clarity. While no information was provided indicating how the data were collected, the data set does not seem to show any specific sampling, and the sample sizes for each of the treatments are not too different. For this reason, we will assume the data were randomly collected for the purpose of this experiment.
First, we look at boxplots for the two factors of interest- clarity and color. We can also look at the histogram of the diamond prices, which we see are right skewed.
The boxplot for clarity does not show any strong differences in means for all of the different levels, but the plot for IF looks like it may have a sligtly lower mean. The sample for that clarity level also seems to have more outliers than the others.
boxplot(price ~ clarity, data = diamond)
boxplot(price ~ colour, data = diamond)
hist(diamond$price)
Next, we subset the data by clarity.
IF <- subset(diamond, clarity == "IF")
VS1 <- subset(diamond, clarity == "VS1")
VS2 <- subset(diamond, clarity == "VS2")
VVS1 <- subset(diamond, clarity == "VVS1")
VVS2 <- subset(diamond, clarity == "VVS2")
In order to determine the effect size that is used to determine the necessary sample size, the mean and standard deviation for each subset can be used.
summary(IF)
## carat colour clarity certification price
## Min. :0.1800 D: 1 IF :44 GIA: 6 Min. : 725
## 1st Qu.:0.2175 E: 5 VS1 : 0 HRD: 4 1st Qu.: 1076
## Median :0.2650 F: 8 VS2 : 0 IGI:34 Median : 1266
## Mean :0.3718 G:11 VVS1: 0 Mean : 2695
## 3rd Qu.:0.5050 H:11 VVS2: 0 3rd Qu.: 3222
## Max. :1.0400 I: 8 Max. :13913
summary(VS1)
## carat colour clarity certification price
## Min. :0.2000 D: 7 IF : 0 GIA:61 Min. : 705
## 1st Qu.:0.4000 E:18 VS1 :81 HRD:13 1st Qu.: 1738
## Median :0.6000 F:25 VS2 : 0 IGI: 7 Median : 4221
## Mean :0.6469 G:12 VVS1: 0 Mean : 5057
## 3rd Qu.:0.9000 H:11 VVS2: 0 3rd Qu.: 7680
## Max. :1.0400 I: 8 Max. :11419
summary(VS2)
## carat colour clarity certification price
## Min. :0.2000 D: 2 IF : 0 GIA:36 Min. : 638
## 1st Qu.:0.5700 E: 7 VS1 : 0 HRD:15 1st Qu.:3421
## Median :0.8200 F:14 VS2 :53 IGI: 2 Median :5738
## Mean :0.7583 G: 9 VVS1: 0 Mean :5858
## 3rd Qu.:1.0000 H:14 VVS2: 0 3rd Qu.:8873
## Max. :1.1000 I: 7 Max. :9853
summary(VVS1)
## carat colour clarity certification price
## Min. :0.1800 D: 2 IF : 0 GIA:15 Min. : 800
## 1st Qu.:0.5075 E: 5 VS1 : 0 HRD:23 1st Qu.: 3428
## Median :0.6300 F:18 VS2 : 0 IGI:14 Median : 4514
## Mean :0.6398 G:12 VVS1:52 Mean : 5568
## 3rd Qu.:0.7800 H:10 VVS2: 0 3rd Qu.: 7117
## Max. :1.0100 I: 5 Max. :16008
summary(VVS2)
## carat colour clarity certification price
## Min. :0.1800 D: 4 IF : 0 GIA:33 Min. : 705
## 1st Qu.:0.5100 E: 9 VS1 : 0 HRD:24 1st Qu.: 3423
## Median :0.6750 F:17 VS2 : 0 IGI:21 Median : 4534
## Mean :0.6679 G:21 VVS1: 0 Mean : 5357
## 3rd Qu.:0.8900 H:15 VVS2:78 3rd Qu.: 7294
## Max. :1.0900 I:12 Max. :13909
sd(IF$price)
## [1] 3016.061
sd(VS1$price)
## [1] 3254.568
sd(VS2$price)
## [1] 3049.083
sd(VVS1$price)
## [1] 3673.286
sd(VVS2$price)
## [1] 3313.966
Because the standard deviation for each of the levels is around 3000, this was used as the input for G*Power, which only requests a single standard deviation for each group. The means and the standard deviation were used to calculate an effect size of 0.318, which is generally considered a moderate effect. With an alpha level of 0.05 and a power of 0.8, the required sample size was determined to be 125.
Using this output from G*Power, a sample from the original data set with 125 random values was created.
samplesize <- 125
set.seed(4)
sample1 <- diamond[sample(nrow(diamond), samplesize),]
This sample has the following summary statistics.
summary(sample1)
## carat colour clarity certification price
## Min. :0.1800 D: 8 IF :10 GIA:58 Min. : 638
## 1st Qu.:0.4000 E:17 VS1 :32 HRD:34 1st Qu.: 1616
## Median :0.6000 F:36 VS2 :21 IGI:33 Median : 4138
## Mean :0.6329 G:28 VVS1:23 Mean : 5048
## 3rd Qu.:0.8400 H:21 VVS2:39 3rd Qu.: 7368
## Max. :1.1000 I:15 Max. :16008
Using Analysis of Variance, the following model is created.
model1 <- aov(price ~ clarity + colour, data = sample1)
anova(model1)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## clarity 4 98854071 24713518 2.2048 0.07278 .
## colour 5 83083151 16616630 1.4825 0.20091
## Residuals 115 1289016434 11208839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this output, we ignore the factor of colour since this is the variable that we blocked on. Further, the clarity of the diamond seems to be significant at the 0.1 significance level. This means that with a confidence level of 0.9, we can say that the result of different means is not due to randomization.
The following qqnorm plot suggests that the model might not be a good fit for the data because it is not normal at the median, and also deviates at the upper and lower tails.
qqnorm(residuals(model1))
qqline(residuals(model1))
Multiple models were used to determine how robust this model was for determining price from clarity and colour. The following models use different forms. The first includes interaction effects, and the latter takes the log of the response variable.
model2 <- aov(price ~ clarity * colour, data = sample1)
anova(model2)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## clarity 4 98854071 24713518 2.1116 0.08526 .
## colour 5 83083151 16616630 1.4197 0.22406
## clarity:colour 19 165435558 8707135 0.7439 0.76467
## Residuals 96 1123580876 11703967
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(model2))
qqline(residuals(model2))
In the second model including interaction effects, the output shows little difference in significance to the original model, where clarity was significant at the 0.1 significance level. Color and the interaction between clarity and color are also insignificant.
Next, we look at a model that takes the log of the price as the response.
sample1$logprice <- log(sample1$price)
model3 <- aov(logprice ~ clarity + colour, data = sample1)
anova(model3)
## Analysis of Variance Table
##
## Response: logprice
## Df Sum Sq Mean Sq F value Pr(>F)
## clarity 4 9.502 2.37548 3.9927 0.004532 **
## colour 5 4.482 0.89646 1.5068 0.193110
## Residuals 115 68.420 0.59496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(model3))
qqline(residuals(model3))
This output suggests a higher significance for clarity, which is significant at the 0.001 level. The qqplot of residuals also follows the normal distribution more closely than the original model.
Next, we use confidence intervals as an alternate method. We can use the same sample created earlier to determine the confidence interval. In order to do this, we calculate the mean and standard deviation from the sample. These are used to create an upper and lower bound on the confidence interval. Then, it is tested to see if the true mean lies within the confidence interval. A 95% level is used for this confidence interval.
##Confidence Intervals
mu = mean(sample1$price)
stdev = sd(sample1$price)
se = stdev/sqrt(samplesize)*qnorm(.975)
lower <- mu - se
upper <- mu + se
lower
## [1] 4444.472
upper
## [1] 5652.04
mean(diamond$price)
## [1] 5019.484
The mean of 5019.484 lies within the 95% confidence interval of [4444.472, 5652.04]
Chu, Singfat (2001) “Pricing the C’s of Diamond Stones”, Journal of Statistics Education, 9(2).
Complete R code
#Download packages
library(effsize)
library(Ecdat)
#Assign data set to variable
diamond <- Diamond
#Display a summary and the structure of diamond
str(diamond)
summary(diamond)
#Show boxplots and histograms
boxplot(price ~ clarity, data = diamond)
boxplot(price ~ colour, data = diamond)
hist(diamond$price)
#Subset the data by treatment variable
IF <- subset(diamond, clarity == "IF")
VS1 <- subset(diamond, clarity == "VS1")
VS2 <- subset(diamond, clarity == "VS2")
VVS1 <- subset(diamond, clarity == "VVS1")
VVS2 <- subset(diamond, clarity == "VVS2")
#Determine means
summary(IF)
summary(VS1)
summary(VS2)
summary(VVS1)
summary(VVS2)
#Determine standard deviations
sd(IF$price)
sd(VS1$price)
sd(VS2$price)
sd(VVS1$price)
sd(VVS2$price)
#Using sample size output from G*Power create sample
## alpha = 0.05 and power = 0.8
samplesize <- 125
set.seed(4)
sample1 <- diamond[sample(nrow(diamond), samplesize),]
#Show a summary of the sample
summary(sample1)
#Create ANOVA from sample
model1 <- aov(price ~ clarity + colour, data = sample1)
anova(model1)
qqnorm(residuals(model1))
qqline(residuals(model1))
#Multiple Models
model2 <- aov(price ~ clarity * colour, data = sample1)
anova(model2)
qqnorm(residuals(model2))
qqline(residuals(model2))
sample1$logprice <- log(sample1$price)
model3 <- aov(logprice ~ clarity + colour, data = sample1)
anova(model3)
qqnorm(residuals(model3))
qqline(residuals(model3))