Project 2

1. Setting

Pricing the C’s of Diamond Stones

Dataset containing prices of diamonds in Singapore Dollars. The “C’s” in the title refers to “colour”, “certification”, “clarity” and “carat”.

library(Ecfun) # Package needed for Ecdat
library(Ecdat)
data("Diamond") # Load dataset

Below are the first 6 observations

head(Diamond)

##   carat colour clarity certification price
## 1  0.30      D     VS2           GIA  1302
## 2  0.30      E     VS1           GIA  1510
## 3  0.30      G    VVS1           GIA  1510
## 4  0.30      G     VS1           GIA  1260
## 5  0.31      D     VS1           GIA  1641
## 6  0.31      E     VS1           GIA  1555

Factors and levels

The dataset contains the following 3 factors.

levels(Diamond$colour)

## [1] "D" "E" "F" "G" "H" "I"

levels(Diamond$certification)

## [1] "GIA" "HRD" "IGI"

levels(Diamond$clarity)

## [1] "IF"   "VS1"  "VS2"  "VVS1" "VVS2"

Continuous variables

“carat” is a continuous variable. A measure of the weight of the diamond in carat units.

Response variables

The response variable is price in Singapore Dollars.

2. Experimental Design

There is a reference in the Ecdat package manual supposedly with information about the dataset. Unfortunately, the link is for an outdated website with no information about how the experiment was conducted or how the data was collected.

However, in the testing and analysis, a subset of \(N\) randomly chosen observations is used. Blocking is used as well.

3. Statistical Analysis

Exploratoty data analysis

In this section, the levels of each factor are shown in a boxplot to determine thier influence on the response, price. Starting with clarity.

boxplot(price~clarity,data=Diamond,ylab="price",xlab="clarity")

The medians of the clarities are very different. The median price for “IF” is lower than the other levels.

boxplot(price~colour,data=Diamond,ylab="price",xlab="colour")

Here, the medians are somewhat close.

boxplot(price~certification,data=Diamond,ylab="price",xlab="certification")

The medians for certifications are different with “IGI” being quite lower than the other levels.

Testing

We select the type 1 and type 2 error as \(\alpha=0.05\) and \(\beta=0.05\), respectively. This yields power \(=1-\beta=0.95\). Using G*Power to compute the effect size from means and assuming same within-groups variance, yields \(0.522\). As a result, the sample size is \(N=60\). \(60\) observations are then randomly chosen to be used as a subset for testing and analysis.

In the testing, the response variable is price and the factors used are “colour” and “certification” where “colour” is used as a blocking variable. For the blocking variable we assume that there is no interaction between “colour” and “certification”.

set.seed(42)
testset = Diamond[sample(c(1:dim(Diamond)[1]),60),]
m1 = lm(price~certification+colour,data=testset)
anova(m1)

## Analysis of Variance Table
## 
## Response: price
##               Df    Sum Sq  Mean Sq F value    Pr(>F)    
## certification  2 163325520 81662760 11.6524 6.587e-05 ***
## colour         5  22860753  4572151  0.6524     0.661    
## Residuals     52 364428922  7008248                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Anova table shows that certification has a significant effect on the price while the blocking variable, colour, does not. The F statisitic and p value for “certification” is shown below.

# Extract F statistic for certification
anova(m1)[[4]][2]

## [1] 0.6523956

# and the assosiacted p value
anova(m1)[[5]][2]

## [1] 0.6609593

Multiple models

Now, we will consider other fixed effect models. Starting with clarity while still blocking on colour.

m2 = lm(price~clarity+colour,data=testset)
anova(m2)

## Analysis of Variance Table
## 
## Response: price
##           Df    Sum Sq  Mean Sq F value Pr(>F)
## clarity    4  76589670 19147417   2.033 0.1039
## colour     5   3105945   621189   0.066 0.9969
## Residuals 50 470919580  9418392

Interestingly, neither the blocking variable or “clarity” has a significant effect on the price.

Next, we consider the interactions (colour,certification) and (colour,clarity)

m3 = lm(price~certification*colour,testset)
anova(m3)

## Analysis of Variance Table
## 
## Response: price
##                      Df    Sum Sq  Mean Sq F value    Pr(>F)    
## certification         2 163325520 81662760 11.7624 8.087e-05 ***
## colour                5  22860753  4572151  0.6586    0.6567    
## certification:colour  8  58950755  7368844  1.0614    0.4070    
## Residuals            44 305478167  6942686                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m4 = lm(price~clarity*colour,data=testset)
anova(m4)

## Analysis of Variance Table
## 
## Response: price
##                Df    Sum Sq  Mean Sq F value Pr(>F)
## clarity         4  76589670 19147417  2.0823 0.1022
## colour          5   3105945   621189  0.0676 0.9966
## clarity:colour 12 121503868 10125322  1.1012 0.3871
## Residuals      38 349415712  9195150

Certification still has a significant effect on the price and clarity does not. The interaction effect, however, is not significant which satisfies the assumption that there is no interaction between the blocking variable and the factors.

Finally, the continuous variable, “carat”, is indluded in a model. As the blocking variable is a factor, analysis of covariance(ANCOVA) is used for hytpothesis testing.

m5 = lm(price~carat*colour,data=testset)
anova(m5)

## Analysis of Variance Table
## 
## Response: price
##              Df    Sum Sq   Mean Sq   F value    Pr(>F)    
## carat         1 507763186 507763186 1248.8492 < 2.2e-16 ***
## colour        5  19767222   3953444    9.7235 1.846e-06 ***
## carat:colour  5   3568715    713743    1.7555    0.1401    
## Residuals    48  19516073    406585                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both carat and colour appear to be significant. This is interesting, and somewhat unexpected, as colour has not been significant before. The interaction between the 2 variables is not significant futher satisfying the assumption the there is no interaction between the blocking and explanatory variables.

Resampling methods

In this section, Anova with Bootstrapping is used as an alternative to Null-hypothesis testing.

# Resampling with boostrapping
meanstar = with(Diamond, tapply(price,certification,mean))
grpA = Diamond$price[Diamond$certification=="GIA"] - meanstar[1]
grpB = Diamond$price[Diamond$certification=="HRD"] - meanstar[2]
grpC = Diamond$price[Diamond$certification=="IGI"] - meanstar[3]

simblock = Diamond$certification
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=151, replace=T)
  groupB = sample(grpB, size=79, replace=T)
  groupC = sample(grpC, size=78, replace=T)
  simprice = c(groupA,groupB,groupC)
  simdata = data.frame(simprice,simblock)
  Fstar[i] = oneway.test(simprice~simblock, var.equal=T, data=simdata)$statistic
}

The empirical distribution is shown in the following figure along with the known F distribution.

# Plot empirical distribution
hist(Fstar, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,6,301),type="b",col="red")  # The F distribution

Next we can compute the F test statistic and estimate the alpha level of the distribution

print(realFstar<-oneway.test(price~certification, var.equal=T, data=testset)$statistic)

##        F 
## 12.01885

mean(Fstar>=realFstar)

## [1] 0

Finally, we test the Null-hypothesis on the simulated data. The only IV is “certification”, the blocking variable.

simmodel=lm(simprice ~ simblock, data=simdata)
anova(simmodel)

## Analysis of Variance Table
## 
## Response: simprice
##            Df     Sum Sq Mean Sq F value Pr(>F)
## simblock    2    2567386 1283693  0.1629 0.8497
## Residuals 305 2403104115 7879030

The anova table suggests that the blocking variable does not have a significant effect on the price.

Estimation

The parameters for the first model are shown in the following table.

coef(m1)

##      (Intercept) certificationHRD certificationIGI          colourE 
##        7185.9599        1072.3105       -3353.9197       -2658.0767 
##          colourF          colourG          colourH          colourI 
##        -791.5346       -1471.1709       -1847.5319       -1626.1839

Diagnostics / Model Adequacy Checking

In this section, the assmumptions of using a linear fixed effect molde are checked for the first model. Starting with a Normal QQ-plot. The linear model in lm assumes the response is normally distributed.

plot(m1,2)

The points should not differ much from the straight dotted line. Thus the assumption, that the response is normal, may not be satisfied.

Next,the residuals are are plotted vs the fitted values. The residuals should not increase with the fitted values.

plot(m1,1)

The plot shows that the residuals do not increase substantially with the fitted values.

4. References

R Package Ecdat: https://cran.r-project.org/web/packages/Ecdat/index.html