1. Setting

This analysis focuses on understanding how properties of a diamond affect the price at which it is sold. The data set classifies 308 diamonds from 2000 for sale Singapore by color, clarity, and certification. It also includes data on the weight of the diamond in carats and the price at which it is sold in Singapore $. The variables in the data set are as follows:

The raw data are organized into rows, with each one representing a diamond. There are five columns- one for each of the variables indicated above. For the purpose of this experiment, only the effects of color and clarity will be studied, and the response in price will be analyzed.

First, we assign the data set to a variable using the Ecdat package.

library(effsize)
## Warning: package 'effsize' was built under R version 3.2.5
library(Ecdat)
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.2.5
## 
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
## 
##     sign
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
## 
##     Orange
diamond <- Diamond

The first 10 data points are shown below.

head(diamond, n=10)
##    carat colour clarity certification price
## 1   0.30      D     VS2           GIA  1302
## 2   0.30      E     VS1           GIA  1510
## 3   0.30      G    VVS1           GIA  1510
## 4   0.30      G     VS1           GIA  1260
## 5   0.31      D     VS1           GIA  1641
## 6   0.31      E     VS1           GIA  1555
## 7   0.31      F     VS1           GIA  1427
## 8   0.31      G    VVS2           GIA  1427
## 9   0.31      H     VS2           GIA  1126
## 10  0.31      I     VS1           GIA  1126

All of the variables are recognized as the proper data type by R, and the response variable of price is a numeric. The following code shows the structure as well as a summary of the data.

str(diamond)
## 'data.frame':    308 obs. of  5 variables:
##  $ carat        : num  0.3 0.3 0.3 0.3 0.31 0.31 0.31 0.31 0.31 0.31 ...
##  $ colour       : Factor w/ 6 levels "D","E","F","G",..: 1 2 4 4 1 2 3 4 5 6 ...
##  $ clarity      : Factor w/ 5 levels "IF","VS1","VS2",..: 3 2 4 2 2 2 2 5 3 2 ...
##  $ certification: Factor w/ 3 levels "GIA","HRD","IGI": 1 1 1 1 1 1 1 1 1 1 ...
##  $ price        : int  1302 1510 1510 1260 1641 1555 1427 1427 1126 1126 ...
summary(diamond)
##      carat        colour clarity   certification     price      
##  Min.   :0.1800   D:16   IF  :44   GIA:151       Min.   :  638  
##  1st Qu.:0.3500   E:44   VS1 :81   HRD: 79       1st Qu.: 1625  
##  Median :0.6200   F:82   VS2 :53   IGI: 78       Median : 4215  
##  Mean   :0.6309   G:65   VVS1:52                 Mean   : 5019  
##  3rd Qu.:0.8500   H:61   VVS2:78                 3rd Qu.: 7446  
##  Max.   :1.1000   I:40                           Max.   :16008

2. Experimental Design

This experimental design will use Analysis of Variance (ANOVA) on a randomized block design to analyze the effects of the clarity of a diamond on the price. The data will also be blocked by color in determining these effects. This experimental design was chosen because color of the diamond could be a nuisance factor that has an effect on the response variable of price, but is not of interest to the experiment. Using a randomized block design allows the existing data set to be used without choosing the treatments more specifically.

The null hypothesis for this experiment is that the clarity of a diamond has no effect on the price.

There are some repeated measures in this experiment because some of the diamonds have the same color and clarity. While no information was provided indicating how the data were collected, the data set does not seem to show any specific sampling, and the sample sizes for each of the treatments are not too different. For this reason, we will assume the data were randomly collected for the purpose of this experiment.

3. Statistical Analysis

First, we look at boxplots for the two factors of interest- clarity and color. We can also look at the histogram of the diamond prices, which we see are right skewed.

The boxplot for clarity does not show any strong differences in means for all of the different levels, but the plot for IF looks like it may have a sligtly lower mean. The sample for that clarity level also seems to have more outliers than the others.

boxplot(price ~ clarity, data = diamond)

boxplot(price ~ colour, data = diamond)

hist(diamond$price)

Next, we subset the data by clarity.

IF <- subset(diamond, clarity == "IF")
VS1 <- subset(diamond, clarity == "VS1")
VS2 <- subset(diamond, clarity == "VS2")
VVS1 <- subset(diamond, clarity == "VVS1")
VVS2 <- subset(diamond, clarity == "VVS2")

In order to determine the effect size that is used to determine the necessary sample size, the mean and standard deviation for each subset can be used.

summary(IF)
##      carat        colour clarity   certification     price      
##  Min.   :0.1800   D: 1   IF  :44   GIA: 6        Min.   :  725  
##  1st Qu.:0.2175   E: 5   VS1 : 0   HRD: 4        1st Qu.: 1076  
##  Median :0.2650   F: 8   VS2 : 0   IGI:34        Median : 1266  
##  Mean   :0.3718   G:11   VVS1: 0                 Mean   : 2695  
##  3rd Qu.:0.5050   H:11   VVS2: 0                 3rd Qu.: 3222  
##  Max.   :1.0400   I: 8                           Max.   :13913
summary(VS1)
##      carat        colour clarity   certification     price      
##  Min.   :0.2000   D: 7   IF  : 0   GIA:61        Min.   :  705  
##  1st Qu.:0.4000   E:18   VS1 :81   HRD:13        1st Qu.: 1738  
##  Median :0.6000   F:25   VS2 : 0   IGI: 7        Median : 4221  
##  Mean   :0.6469   G:12   VVS1: 0                 Mean   : 5057  
##  3rd Qu.:0.9000   H:11   VVS2: 0                 3rd Qu.: 7680  
##  Max.   :1.0400   I: 8                           Max.   :11419
summary(VS2)
##      carat        colour clarity   certification     price     
##  Min.   :0.2000   D: 2   IF  : 0   GIA:36        Min.   : 638  
##  1st Qu.:0.5700   E: 7   VS1 : 0   HRD:15        1st Qu.:3421  
##  Median :0.8200   F:14   VS2 :53   IGI: 2        Median :5738  
##  Mean   :0.7583   G: 9   VVS1: 0                 Mean   :5858  
##  3rd Qu.:1.0000   H:14   VVS2: 0                 3rd Qu.:8873  
##  Max.   :1.1000   I: 7                           Max.   :9853
summary(VVS1)
##      carat        colour clarity   certification     price      
##  Min.   :0.1800   D: 2   IF  : 0   GIA:15        Min.   :  800  
##  1st Qu.:0.5075   E: 5   VS1 : 0   HRD:23        1st Qu.: 3428  
##  Median :0.6300   F:18   VS2 : 0   IGI:14        Median : 4514  
##  Mean   :0.6398   G:12   VVS1:52                 Mean   : 5568  
##  3rd Qu.:0.7800   H:10   VVS2: 0                 3rd Qu.: 7117  
##  Max.   :1.0100   I: 5                           Max.   :16008
summary(VVS2)
##      carat        colour clarity   certification     price      
##  Min.   :0.1800   D: 4   IF  : 0   GIA:33        Min.   :  705  
##  1st Qu.:0.5100   E: 9   VS1 : 0   HRD:24        1st Qu.: 3423  
##  Median :0.6750   F:17   VS2 : 0   IGI:21        Median : 4534  
##  Mean   :0.6679   G:21   VVS1: 0                 Mean   : 5357  
##  3rd Qu.:0.8900   H:15   VVS2:78                 3rd Qu.: 7294  
##  Max.   :1.0900   I:12                           Max.   :13909
sd(IF$price)
## [1] 3016.061
sd(VS1$price)
## [1] 3254.568
sd(VS2$price)
## [1] 3049.083
sd(VVS1$price)
## [1] 3673.286
sd(VVS2$price)
## [1] 3313.966

Because the standard deviation for each of the levels is around 3000, this was used as the input for G*Power, which only requests a single standard deviation for each group. The means and the standard deviation were used to calculate an effect size of 0.318, which is generally considered a moderate effect. With an alpha level of 0.05 and a power of 0.8, the required sample size was determined to be 125.

Using this output from G*Power, a sample from the original data set with 125 random values was created.

samplesize <- 125
set.seed(4)
sample1 <- diamond[sample(nrow(diamond), samplesize),]

This sample has the following summary statistics.

summary(sample1)
##      carat        colour clarity   certification     price      
##  Min.   :0.1800   D: 8   IF  :10   GIA:58        Min.   :  638  
##  1st Qu.:0.4000   E:17   VS1 :32   HRD:34        1st Qu.: 1616  
##  Median :0.6000   F:36   VS2 :21   IGI:33        Median : 4138  
##  Mean   :0.6329   G:28   VVS1:23                 Mean   : 5048  
##  3rd Qu.:0.8400   H:21   VVS2:39                 3rd Qu.: 7368  
##  Max.   :1.1000   I:15                           Max.   :16008

Using Analysis of Variance, the following model is created.

model1 <- aov(price ~ clarity + colour, data = sample1)
anova(model1)
## Analysis of Variance Table
## 
## Response: price
##            Df     Sum Sq  Mean Sq F value  Pr(>F)  
## clarity     4   98854071 24713518  2.2048 0.07278 .
## colour      5   83083151 16616630  1.4825 0.20091  
## Residuals 115 1289016434 11208839                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this output, we ignore the factor of colour since this is the variable that we blocked on. Further, the clarity of the diamond seems to be significant at the 0.1 significance level. This means that with a confidence level of 0.9, we can say that the result of different means is not due to randomization.

The following qqnorm plot suggests that the model might not be a good fit for the data because it is not normal at the median, and also deviates at the upper and lower tails.

qqnorm(residuals(model1))
qqline(residuals(model1))

Multiple models were used to determine how robust this model was for determining price from clarity and colour. The following models use different forms. The first includes interaction effects, and the latter takes the log of the response variable.

model2 <- aov(price ~ clarity * colour, data = sample1)
anova(model2)
## Analysis of Variance Table
## 
## Response: price
##                Df     Sum Sq  Mean Sq F value  Pr(>F)  
## clarity         4   98854071 24713518  2.1116 0.08526 .
## colour          5   83083151 16616630  1.4197 0.22406  
## clarity:colour 19  165435558  8707135  0.7439 0.76467  
## Residuals      96 1123580876 11703967                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(model2))
qqline(residuals(model2))

In the second model including interaction effects, the output shows little difference in significance to the original model, where clarity was significant at the 0.1 significance level. Color and the interaction between clarity and color are also insignificant.

Next, we look at a model that takes the log of the price as the response.

sample1$logprice <- log(sample1$price)

model3 <- aov(logprice ~ clarity + colour, data = sample1)
anova(model3)
## Analysis of Variance Table
## 
## Response: logprice
##            Df Sum Sq Mean Sq F value   Pr(>F)   
## clarity     4  9.502 2.37548  3.9927 0.004532 **
## colour      5  4.482 0.89646  1.5068 0.193110   
## Residuals 115 68.420 0.59496                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(model3))
qqline(residuals(model3))

This output suggests a higher significance for clarity, which is significant at the 0.001 level. The qqplot of residuals also follows the normal distribution more closely than the original model.

Next, we use confidence intervals as an alternate method. We can use the same sample created earlier to determine the confidence interval. In order to do this, we calculate the mean and standard deviation from the sample. These are used to create an upper and lower bound on the confidence interval. Then, it is tested to see if the true mean lies within the confidence interval. A 95% level is used for this confidence interval.

##Confidence Intervals
mu = mean(sample1$price)
stdev = sd(sample1$price)

se = stdev/sqrt(samplesize)*qnorm(.975)
lower <- mu - se
upper <- mu + se

lower
## [1] 4444.472
upper
## [1] 5652.04
mean(diamond$price)
## [1] 5019.484

The mean of 5019.484 lies within the 95% confidence interval of [4444.472, 5652.04]

4. References

Chu, Singfat (2001) “Pricing the C’s of Diamond Stones”, Journal of Statistics Education, 9(2).

5. Appendices

Complete R code

#Download packages
library(effsize)
library(Ecdat)

#Assign data set to variable
diamond <- Diamond

#Display a summary and the structure of diamond
str(diamond)
summary(diamond)

#Show boxplots and histograms
boxplot(price ~ clarity, data = diamond)
boxplot(price ~ colour, data = diamond)
hist(diamond$price)

#Subset the data by treatment variable
IF <- subset(diamond, clarity == "IF")
VS1 <- subset(diamond, clarity == "VS1")
VS2 <- subset(diamond, clarity == "VS2")
VVS1 <- subset(diamond, clarity == "VVS1")
VVS2 <- subset(diamond, clarity == "VVS2")

#Determine means
summary(IF)
summary(VS1)
summary(VS2)
summary(VVS1)
summary(VVS2)

#Determine standard deviations
sd(IF$price)
sd(VS1$price)
sd(VS2$price)
sd(VVS1$price)
sd(VVS2$price)

#Using sample size output from G*Power create sample
## alpha = 0.05 and power = 0.8
samplesize <- 125
set.seed(4)
sample1 <- diamond[sample(nrow(diamond), samplesize),]

#Show a summary of the sample
summary(sample1)

#Create ANOVA from sample
model1 <- aov(price ~ clarity + colour, data = sample1)
anova(model1)

qqnorm(residuals(model1))
qqline(residuals(model1))

#Multiple Models
model2 <- aov(price ~ clarity * colour, data = sample1)
anova(model2)

qqnorm(residuals(model2))
qqline(residuals(model2))

sample1$logprice <- log(sample1$price)

model3 <- aov(logprice ~ clarity + colour, data = sample1)
anova(model3)

qqnorm(residuals(model3))
qqline(residuals(model3))