Cheryl Tran

RPI

12/1/2014 Version 1

1. Setting

System under test

This recipe is examining the pricing the C’s of diamond stones dataset from the Ecdat package. We are looking at how 4 different factors, each with 2 or more levels effect the price in Singapore.

#installing package
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\RtmpkN7Hqz\downloaded_packages
library("Ecdat", lib.loc="C:/Program Files/R/R-3.1.1/library")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#Pulling the Diamond data from the library
d<-Diamond
head(d)
##   carat colour clarity certification price
## 1  0.30      D     VS2           GIA  1302
## 2  0.30      E     VS1           GIA  1510
## 3  0.30      G    VVS1           GIA  1510
## 4  0.30      G     VS1           GIA  1260
## 5  0.31      D     VS1           GIA  1641
## 6  0.31      E     VS1           GIA  1555
tail(d)
##     carat colour clarity certification price
## 303  1.01      I    VVS1           HRD  8873
## 304  1.01      I     VS1           HRD  8175
## 305  1.02      F    VVS2           HRD 10796
## 306  1.06      H    VVS2           HRD  9890
## 307  1.02      H     VS2           HRD  8959
## 308  1.09      I    VVS2           HRD  9107
d$carat<-as.numeric(d$carat)
d$carat<-cut(d$carat, c(.18,.35,.62,.85,1.1)) #form carat into intervals
d$carat<-as.factor(d$carat)

Factors and Levels

The dataset has 308 observationsof diamonds from Singapore in 2000. There are 4 factors: Carat, Colour, Clarity, and certification. Carat is broken down into four levels: .18-.35, .35-.62, .62-.85, and .85-1.1. Colour is broken down into 6 levels: D,E,F,G,H,I. Clarity has 5 levels: IF,VVS1,VVS2,VS1,VS2.Lastly, certification has 3 levels:GIA, IGI, HRD.

Continuous variables (if any)

The continuous variable in the data set is the price of the diamonds in Singapore

Response variables

In this experiment, we are looking at the price of the diamonds in Singapore ($). From the summary of the prices, they range from $638 to $16010 with a mean of $5019.

summary(d$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     638    1625    4215    5019    7446   16010

The Data: How is it organized and what does it look like?

The dataset has 308 observations of Diamonds in 2000 from Singapore. They are organized into five columns: carat, colour, clarity, certificatoin, and price.

Randomization

The dataset consists of observations so it unknow if there was randomization.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment being conducted is a response surface method to see if any of the four factors being analyzed have an effect on the price of the diamonds in Singapore.

What is the rationale for this design?

This method is used to model and analyze problems in which a repsonse of interest is influenced by several variables and the aim is to optimize this response. From the response surface method we can find if there are any maximums, minimums, saddle points, or ridges.

Randomize: What is the Randomization Scheme?

The dataset consists of observations so it was not randomized

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures

Block: Did you use blocking in the design?

There was no blocking used in the design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

summary(d)
##          carat    colour clarity   certification     price      
##  (0.18,0.35]:72   D:16   IF  :44   GIA:151       Min.   :  638  
##  (0.35,0.62]:77   E:44   VS1 :81   HRD: 79       1st Qu.: 1625  
##  (0.62,0.85]:77   F:82   VS2 :53   IGI: 78       Median : 4215  
##  (0.85,1.1] :76   G:65   VVS1:52                 Mean   : 5019  
##  NA's       : 6   H:61   VVS2:78                 3rd Qu.: 7446  
##                   I:40                           Max.   :16008
boxplot(price~carat, data=d, xlab="Carat", ylab="Price ($)")
title("Price of diamond by Carat")

boxplot(price~colour, data=d, xlab="Colour", ylab="price")
title("Price of diamond by Colour")

boxplot(price~clarity, data=d, xlab="clarity", ylab="price")
title("Price of diamond by clarity")

boxplot(price~certification, data=d, xlab="Certification?", ylab="price")
title("Price of diamond by certification")

When looking at the boxplots of the price of diamonds by carat, it makes sense that the larger the number of carats, the higher the price would be. There are some outliers however the IQRs of each group are clearly higher as the number of carats gets bigger.When looking at the price of diamonds by colour, D has the largest range of prices. All of the colours seem to have the same minimum but D has the largest maximum.The medians of the colours seem pretty close.When looking at clarity, IF has the smallest range of prices. It seems that diamonds with IF are on the lower side of price. The other four types of clarity seem to have very similar medians and the same minimums. For certification, it seems that diamonds with IGI certificaton are worth less in Singapore.

Testing

#convert the factors and levels into integers
d$carat<-as.character(d$carat)
d$carat[d$carat=="(0.18,0.35]"]<-1
d$carat[d$carat=="(0.35,0.62]"]<-2
d$carat[d$carat=="(0.62,0.85]"]<-3
d$carat[d$carat=="(0.85,1.1]"]<-4
d$carat<-as.integer(d$carat)

d$colour<-as.character(d$colour)
d$colour[d$colour=="D"]<-1
d$colour[d$colour=="E"]<-2
d$colour[d$colour=="F"]<-3
d$colour[d$colour=="G"]<-4
d$colour[d$colour=="H"]<-5
d$colour[d$colour=="I"]<-6
d$colour<-as.integer(d$colour)

d$clarity<-as.character(d$clarity)
d$clarity[d$clarity=="IF"]<-1
d$clarity[d$clarity=="VVS1"]<-2
d$clarity[d$clarity=="VVS2"]<-3
d$clarity[d$clarity=="VS1"]<-4
d$clarity[d$clarity=="VS2"]<-5
d$clarity<-as.integer(d$clarity)

d$certification<-as.character(d$certification)
d$certification[d$certification=="GIA"]<-1
d$certification[d$certification=="IGI"]<-2
d$certification[d$certification=="HRD"]<-3
d$certification<-as.integer(d$certification)

d<-na.omit(d)

install.packages("rsm",repos='http://cran.us.r-project.org')
## package 'rsm' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\RtmpkN7Hqz\downloaded_packages
library(rsm)
d.rsm<-rsm(price~SO(carat, colour, clarity, certification), data=d)
summary(d.rsm)
## 
## Call:
## rsm(formula = price ~ SO(carat, colour, clarity, certification), 
##     data = d)
## 
##                        Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)            2766.243    975.967   2.8344  0.004918 ** 
## carat                  1603.008    276.887   5.7894 1.851e-08 ***
## colour                 -359.814    190.497  -1.8888  0.059924 .  
## clarity                -582.450    210.670  -2.7648  0.006065 ** 
## certification         -1570.809    582.467  -2.6968  0.007414 ** 
## carat:colour           -283.787     26.329 -10.7783 < 2.2e-16 ***
## carat:clarity          -208.174     35.944  -5.7917 1.828e-08 ***
## carat:certification     -74.688     60.392  -1.2367  0.217206    
## colour:clarity          101.265     24.050   4.2106 3.412e-05 ***
## colour:certification     19.792     36.675   0.5397  0.589847    
## clarity:certification    37.792     42.870   0.8815  0.378767    
## carat^2                 627.252     45.076  13.9156 < 2.2e-16 ***
## colour^2                 33.173     19.820   1.6737  0.095273 .  
## clarity^2                51.666     28.813   1.7931  0.074003 .  
## certification^2         440.697    145.616   3.0264  0.002699 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9582 
## F-statistic: 493.6 on 14 and 287 DF,  p-value: < 2.2e-16
## 
## Analysis of Variance Table
## 
## Response: price
##                                             Df     Sum Sq   Mean Sq
## FO(carat, colour, clarity, certification)    4 3141716808 785429202
## TWI(carat, colour, clarity, certification)   6   59420302   9903384
## PQ(carat, colour, clarity, certification)    4  106434812  26608703
## Residuals                                  287  137377537    478667
## Lack of fit                                138   84837950    614768
## Pure error                                 149   52539587    352615
##                                              F value  Pr(>F)
## FO(carat, colour, clarity, certification)  1640.8664 < 2e-16
## TWI(carat, colour, clarity, certification)   20.6895 < 2e-16
## PQ(carat, colour, clarity, certification)    55.5891 < 2e-16
## Residuals                                                   
## Lack of fit                                   1.7435 0.00046
## Pure error                                                  
## 
## Stationary point of response surface:
##         carat        colour       clarity certification 
##   0.009398377   5.513685822  -0.360222998   1.674615559 
## 
## Eigenanalysis:
## $values
## [1] 686.62086 434.08641  46.01772 -13.93635
## 
## $vectors
##                     [,1]          [,2]        [,3]        [,4]
## carat          0.9444526 -0.1650517525 -0.26127958  0.11180444
## colour        -0.2213443  0.0341793757 -0.45227322  0.86330030
## clarity       -0.1774113  0.0007449124 -0.85230854 -0.49203131
## certification -0.1659556 -0.9856922108  0.02742371  0.01084215

The null hypothesis for this ANOVA is that the variation in the price of the diamonds can not be attributed to anything other than randomization. The alternate hypothesis is that the variation in the price of the diamonds can be attributed to one of the 4 factors.The p-values from the test show that carat, clarity, and certification are significant with an alpha level of .05. We would reject the null hypothesis for those three factors and something other than randomization explains the variation in the prices such as carat, clarity, and certification. There was also significance in the interaction between carat:color, carat:clarity, and colour:clarity.

The stationary points are carat=.0094, colour=5.514, clarity=-.3602, and certification = 1.675. The eigen values are carat=686.62, colour=434.09, clarity=46.02, and certification=-13.94. From these eigenvalues we can determine if there will be a maximum, minimum, saddle point, or ridge carat:colour (maximum) carat:clarity(maximum) carat:certification(saddle point) colour:clarity(maximum) colour:certification(saddle point) clarit:certification(saddle point)

par(mfrow=c(2,3))
contour(d.rsm, ~carat + colour + clarity + certification, image=TRUE, at=summary(d.rsm$canoncial$xs))

When looking at these contour plots of the response surfaces, it is difficult to really determine which interactions are significant and which have extrema.

Diagnostics/Model Adequacy Checking

shapiro.test(d$price)
## 
##  Shapiro-Wilk normality test
## 
## data:  d$price
## W = 0.9286, p-value = 7.504e-11
qqnorm(residuals(d.rsm))
qqline(residuals(d.rsm))

plot(fitted(d.rsm),residuals(d.rsm))

The shapiro test is used to check if the response variable is normal. With an alpha level of .05, and a pvalue of 7.504e-11, we assume that the data is normally distributed.

A Q-Q plot can be used to compare the shape of the distribution of the dataset. The Q-Q plot and Q-Q line of the residuals for both ANOVA models appear to be normal. However at each of the ends, the residuals seem to stray from the line. When looking at the plot of the fitted model and the residuals, a good fit would be if the plot was symmetric at zero and had a high variation. The points seem to be symmetric at zero at the beginning however it loses symmetry towards the right of the plot.

4. Contingencies

Since the dataset consist of observations of prices of diamonds in Singapore, it is possible that all of the values were from chance and that there might have been something different about the year the observations were made.There may be other nuisance factors that were not recorded that year and the pricing of diamonds could be based off of different people’s judgements and not a set standard. The Kruskal Wallis test does not assume a normal distrubtion of the residuals and could be used to test if the variation can be attributed to anything other than randomization without assuming a normal distribution.

kruskal.test(price~carat, data=d)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by carat
## Kruskal-Wallis chi-squared = 274.3102, df = 3, p-value < 2.2e-16
kruskal.test(price~colour, data=d)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by colour
## Kruskal-Wallis chi-squared = 3.523, df = 5, p-value = 0.6199
kruskal.test(price~clarity, data=d)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by clarity
## Kruskal-Wallis chi-squared = 34.0992, df = 4, p-value = 7.111e-07
kruskal.test(price~certification, data=d)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by certification
## Kruskal-Wallis chi-squared = 91.7821, df = 2, p-value < 2.2e-16

When looking at the results of the Kruskal-Wallis test, we would reject the null hypothesis and the variation in ther replacement rate can be attributed to Carat, clarity and certification with an alpha level of .05.

5. References to the literature

http://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code