Dataset containing prices of diamonds in Singapore Dollars. The “C’s” in the title refers to “colour”, “certification”, “clarity” and “carat”.
library(Ecfun) # Package needed for Ecdat
library(Ecdat)
data("Diamond") # Load dataset
Below are the first 6 observations
head(Diamond)
## carat colour clarity certification price
## 1 0.30 D VS2 GIA 1302
## 2 0.30 E VS1 GIA 1510
## 3 0.30 G VVS1 GIA 1510
## 4 0.30 G VS1 GIA 1260
## 5 0.31 D VS1 GIA 1641
## 6 0.31 E VS1 GIA 1555
The dataset contains the following 3 factors.
levels(Diamond$colour)
## [1] "D" "E" "F" "G" "H" "I"
levels(Diamond$certification)
## [1] "GIA" "HRD" "IGI"
levels(Diamond$clarity)
## [1] "IF" "VS1" "VS2" "VVS1" "VVS2"
“carat” is a continuous variable. A measure of the weight of the diamond in carat units.
The response variable is price in Singapore Dollars.
There is a reference in the Ecdat package manual supposedly with information about the dataset. Unfortunately, the link is for an outdated website with no information about how the experiment was conducted or how the data was collected.
However, in the testing and analysis, a subset of \(N\) randomly chosen observations is used. Blocking is used as well.
In this section, the levels of each factor are shown in a boxplot to determine thier influence on the response, price. Starting with clarity.
boxplot(price~clarity,data=Diamond,ylab="price",xlab="clarity")
The medians of the clarities are very different. The median price for “IF” is lower than the other levels.
boxplot(price~colour,data=Diamond,ylab="price",xlab="colour")
Here, the medians are somewhat close.
boxplot(price~certification,data=Diamond,ylab="price",xlab="certification")
The medians for certifications are different with “IGI” being quite lower than the other levels.
We select the type 1 and type 2 error as \(\alpha=0.05\) and \(\beta=0.05\), respectively. This yields power \(=1-\beta=0.95\). Using G*Power to compute the effect size from means and assuming same within-groups variance, yields \(0.522\). As a result, the sample size is \(N=60\). \(60\) observations are then randomly chosen to be used as a subset for testing and analysis.
In the testing, the response variable is price and the factors used are “colour” and “certification” where “colour” is used as a blocking variable. For the blocking variable we assume that there is no interaction between “colour” and “certification”.
set.seed(42)
testset = Diamond[sample(c(1:dim(Diamond)[1]),60),]
m1 = lm(price~certification+colour,data=testset)
anova(m1)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## certification 2 163325520 81662760 11.6524 6.587e-05 ***
## colour 5 22860753 4572151 0.6524 0.661
## Residuals 52 364428922 7008248
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Anova table shows that certification has a significant effect on the price while the blocking variable, colour, does not. The F statisitic and p value for “certification” is shown below.
# Extract F statistic for certification
anova(m1)[[4]][2]
## [1] 0.6523956
# and the assosiacted p value
anova(m1)[[5]][2]
## [1] 0.6609593
Now, we will consider other fixed effect models. Starting with clarity while still blocking on colour.
m2 = lm(price~clarity+colour,data=testset)
anova(m2)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## clarity 4 76589670 19147417 2.033 0.1039
## colour 5 3105945 621189 0.066 0.9969
## Residuals 50 470919580 9418392
Interestingly, neither the blocking variable or “clarity” has a significant effect on the price.
Next, we consider the interactions (colour,certification) and (colour,clarity)
m3 = lm(price~certification*colour,testset)
anova(m3)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## certification 2 163325520 81662760 11.7624 8.087e-05 ***
## colour 5 22860753 4572151 0.6586 0.6567
## certification:colour 8 58950755 7368844 1.0614 0.4070
## Residuals 44 305478167 6942686
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m4 = lm(price~clarity*colour,data=testset)
anova(m4)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## clarity 4 76589670 19147417 2.0823 0.1022
## colour 5 3105945 621189 0.0676 0.9966
## clarity:colour 12 121503868 10125322 1.1012 0.3871
## Residuals 38 349415712 9195150
Certification still has a significant effect on the price and clarity does not. The interaction effect, however, is not significant which satisfies the assumption that there is no interaction between the blocking variable and the factors.
Finally, the continuous variable, “carat”, is indluded in a model. As the blocking variable is a factor, analysis of covariance(ANCOVA) is used for hytpothesis testing.
m5 = lm(price~carat*colour,data=testset)
anova(m5)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## carat 1 507763186 507763186 1248.8492 < 2.2e-16 ***
## colour 5 19767222 3953444 9.7235 1.846e-06 ***
## carat:colour 5 3568715 713743 1.7555 0.1401
## Residuals 48 19516073 406585
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both carat and colour appear to be significant. This is interesting, and somewhat unexpected, as colour has not been significant before. The interaction between the 2 variables is not significant futher satisfying the assumption the there is no interaction between the blocking and explanatory variables.
In this section, Anova with Bootstrapping is used as an alternative to Null-hypothesis testing.
# Resampling with boostrapping
meanstar = with(Diamond, tapply(price,certification,mean))
grpA = Diamond$price[Diamond$certification=="GIA"] - meanstar[1]
grpB = Diamond$price[Diamond$certification=="HRD"] - meanstar[2]
grpC = Diamond$price[Diamond$certification=="IGI"] - meanstar[3]
simblock = Diamond$certification
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=151, replace=T)
groupB = sample(grpB, size=79, replace=T)
groupC = sample(grpC, size=78, replace=T)
simprice = c(groupA,groupB,groupC)
simdata = data.frame(simprice,simblock)
Fstar[i] = oneway.test(simprice~simblock, var.equal=T, data=simdata)$statistic
}
The empirical distribution is shown in the following figure along with the known F distribution.
# Plot empirical distribution
hist(Fstar, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,6,301),type="b",col="red") # The F distribution
Next we can compute the F test statistic and estimate the alpha level of the distribution
print(realFstar<-oneway.test(price~certification, var.equal=T, data=testset)$statistic)
## F
## 12.01885
mean(Fstar>=realFstar)
## [1] 0
Finally, we test the Null-hypothesis on the simulated data. The only IV is “certification”, the blocking variable.
simmodel=lm(simprice ~ simblock, data=simdata)
anova(simmodel)
## Analysis of Variance Table
##
## Response: simprice
## Df Sum Sq Mean Sq F value Pr(>F)
## simblock 2 2567386 1283693 0.1629 0.8497
## Residuals 305 2403104115 7879030
The anova table suggests that the blocking variable does not have a significant effect on the price.
The parameters for the first model are shown in the following table.
coef(m1)
## (Intercept) certificationHRD certificationIGI colourE
## 7185.9599 1072.3105 -3353.9197 -2658.0767
## colourF colourG colourH colourI
## -791.5346 -1471.1709 -1847.5319 -1626.1839
In this section, the assmumptions of using a linear fixed effect molde are checked for the first model. Starting with a Normal QQ-plot. The linear model in lm assumes the response is normally distributed.
plot(m1,2)
The points should not differ much from the straight dotted line. Thus the assumption, that the response is normal, may not be satisfied.
Next,the residuals are are plotted vs the fitted values. The residuals should not increase with the fitted values.
plot(m1,1)
The plot shows that the residuals do not increase substantially with the fitted values.
R Package Ecdat: https://cran.r-project.org/web/packages/Ecdat/index.html