This recipe is examining the pricing the C’s of diamond stones dataset from the Ecdat package. We are looking at how 4 different factors, each with 2 or more levels effect the price in Singapore.
#installing package
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\RtmpkN7Hqz\downloaded_packages
library("Ecdat", lib.loc="C:/Program Files/R/R-3.1.1/library")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
#Pulling the Diamond data from the library
d<-Diamond
head(d)
## carat colour clarity certification price
## 1 0.30 D VS2 GIA 1302
## 2 0.30 E VS1 GIA 1510
## 3 0.30 G VVS1 GIA 1510
## 4 0.30 G VS1 GIA 1260
## 5 0.31 D VS1 GIA 1641
## 6 0.31 E VS1 GIA 1555
tail(d)
## carat colour clarity certification price
## 303 1.01 I VVS1 HRD 8873
## 304 1.01 I VS1 HRD 8175
## 305 1.02 F VVS2 HRD 10796
## 306 1.06 H VVS2 HRD 9890
## 307 1.02 H VS2 HRD 8959
## 308 1.09 I VVS2 HRD 9107
d$carat<-as.numeric(d$carat)
d$carat<-cut(d$carat, c(.18,.35,.62,.85,1.1)) #form carat into intervals
d$carat<-as.factor(d$carat)
The dataset has 308 observationsof diamonds from Singapore in 2000. There are 4 factors: Carat, Colour, Clarity, and certification. Carat is broken down into four levels: .18-.35, .35-.62, .62-.85, and .85-1.1. Colour is broken down into 6 levels: D,E,F,G,H,I. Clarity has 5 levels: IF,VVS1,VVS2,VS1,VS2.Lastly, certification has 3 levels:GIA, IGI, HRD.
The continuous variable in the data set is the price of the diamonds in Singapore
In this experiment, we are looking at the price of the diamonds in Singapore ($). From the summary of the prices, they range from $638 to $16010 with a mean of $5019.
summary(d$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 638 1625 4215 5019 7446 16010
The dataset has 308 observations of Diamonds in 2000 from Singapore. They are organized into five columns: carat, colour, clarity, certificatoin, and price.
The dataset consists of observations so it unknow if there was randomization.
This experiment being conducted is a response surface method to see if any of the four factors being analyzed have an effect on the price of the diamonds in Singapore.
This method is used to model and analyze problems in which a repsonse of interest is influenced by several variables and the aim is to optimize this response. From the response surface method we can find if there are any maximums, minimums, saddle points, or ridges.
The dataset consists of observations so it was not randomized
There are no replicates or repeated measures
There was no blocking used in the design.
summary(d)
## carat colour clarity certification price
## (0.18,0.35]:72 D:16 IF :44 GIA:151 Min. : 638
## (0.35,0.62]:77 E:44 VS1 :81 HRD: 79 1st Qu.: 1625
## (0.62,0.85]:77 F:82 VS2 :53 IGI: 78 Median : 4215
## (0.85,1.1] :76 G:65 VVS1:52 Mean : 5019
## NA's : 6 H:61 VVS2:78 3rd Qu.: 7446
## I:40 Max. :16008
boxplot(price~carat, data=d, xlab="Carat", ylab="Price ($)")
title("Price of diamond by Carat")
boxplot(price~colour, data=d, xlab="Colour", ylab="price")
title("Price of diamond by Colour")
boxplot(price~clarity, data=d, xlab="clarity", ylab="price")
title("Price of diamond by clarity")
boxplot(price~certification, data=d, xlab="Certification?", ylab="price")
title("Price of diamond by certification")
When looking at the boxplots of the price of diamonds by carat, it makes sense that the larger the number of carats, the higher the price would be. There are some outliers however the IQRs of each group are clearly higher as the number of carats gets bigger.When looking at the price of diamonds by colour, D has the largest range of prices. All of the colours seem to have the same minimum but D has the largest maximum.The medians of the colours seem pretty close.When looking at clarity, IF has the smallest range of prices. It seems that diamonds with IF are on the lower side of price. The other four types of clarity seem to have very similar medians and the same minimums. For certification, it seems that diamonds with IGI certificaton are worth less in Singapore.
#convert the factors and levels into integers
d$carat<-as.character(d$carat)
d$carat[d$carat=="(0.18,0.35]"]<-1
d$carat[d$carat=="(0.35,0.62]"]<-2
d$carat[d$carat=="(0.62,0.85]"]<-3
d$carat[d$carat=="(0.85,1.1]"]<-4
d$carat<-as.integer(d$carat)
d$colour<-as.character(d$colour)
d$colour[d$colour=="D"]<-1
d$colour[d$colour=="E"]<-2
d$colour[d$colour=="F"]<-3
d$colour[d$colour=="G"]<-4
d$colour[d$colour=="H"]<-5
d$colour[d$colour=="I"]<-6
d$colour<-as.integer(d$colour)
d$clarity<-as.character(d$clarity)
d$clarity[d$clarity=="IF"]<-1
d$clarity[d$clarity=="VVS1"]<-2
d$clarity[d$clarity=="VVS2"]<-3
d$clarity[d$clarity=="VS1"]<-4
d$clarity[d$clarity=="VS2"]<-5
d$clarity<-as.integer(d$clarity)
d$certification<-as.character(d$certification)
d$certification[d$certification=="GIA"]<-1
d$certification[d$certification=="IGI"]<-2
d$certification[d$certification=="HRD"]<-3
d$certification<-as.integer(d$certification)
d<-na.omit(d)
install.packages("rsm",repos='http://cran.us.r-project.org')
## package 'rsm' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\RtmpkN7Hqz\downloaded_packages
library(rsm)
d.rsm<-rsm(price~SO(carat, colour, clarity, certification), data=d)
summary(d.rsm)
##
## Call:
## rsm(formula = price ~ SO(carat, colour, clarity, certification),
## data = d)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2766.243 975.967 2.8344 0.004918 **
## carat 1603.008 276.887 5.7894 1.851e-08 ***
## colour -359.814 190.497 -1.8888 0.059924 .
## clarity -582.450 210.670 -2.7648 0.006065 **
## certification -1570.809 582.467 -2.6968 0.007414 **
## carat:colour -283.787 26.329 -10.7783 < 2.2e-16 ***
## carat:clarity -208.174 35.944 -5.7917 1.828e-08 ***
## carat:certification -74.688 60.392 -1.2367 0.217206
## colour:clarity 101.265 24.050 4.2106 3.412e-05 ***
## colour:certification 19.792 36.675 0.5397 0.589847
## clarity:certification 37.792 42.870 0.8815 0.378767
## carat^2 627.252 45.076 13.9156 < 2.2e-16 ***
## colour^2 33.173 19.820 1.6737 0.095273 .
## clarity^2 51.666 28.813 1.7931 0.074003 .
## certification^2 440.697 145.616 3.0264 0.002699 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.9601, Adjusted R-squared: 0.9582
## F-statistic: 493.6 on 14 and 287 DF, p-value: < 2.2e-16
##
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq
## FO(carat, colour, clarity, certification) 4 3141716808 785429202
## TWI(carat, colour, clarity, certification) 6 59420302 9903384
## PQ(carat, colour, clarity, certification) 4 106434812 26608703
## Residuals 287 137377537 478667
## Lack of fit 138 84837950 614768
## Pure error 149 52539587 352615
## F value Pr(>F)
## FO(carat, colour, clarity, certification) 1640.8664 < 2e-16
## TWI(carat, colour, clarity, certification) 20.6895 < 2e-16
## PQ(carat, colour, clarity, certification) 55.5891 < 2e-16
## Residuals
## Lack of fit 1.7435 0.00046
## Pure error
##
## Stationary point of response surface:
## carat colour clarity certification
## 0.009398377 5.513685822 -0.360222998 1.674615559
##
## Eigenanalysis:
## $values
## [1] 686.62086 434.08641 46.01772 -13.93635
##
## $vectors
## [,1] [,2] [,3] [,4]
## carat 0.9444526 -0.1650517525 -0.26127958 0.11180444
## colour -0.2213443 0.0341793757 -0.45227322 0.86330030
## clarity -0.1774113 0.0007449124 -0.85230854 -0.49203131
## certification -0.1659556 -0.9856922108 0.02742371 0.01084215
The null hypothesis for this ANOVA is that the variation in the price of the diamonds can not be attributed to anything other than randomization. The alternate hypothesis is that the variation in the price of the diamonds can be attributed to one of the 4 factors.The p-values from the test show that carat, clarity, and certification are significant with an alpha level of .05. We would reject the null hypothesis for those three factors and something other than randomization explains the variation in the prices such as carat, clarity, and certification. There was also significance in the interaction between carat:color, carat:clarity, and colour:clarity.
The stationary points are carat=.0094, colour=5.514, clarity=-.3602, and certification = 1.675. The eigen values are carat=686.62, colour=434.09, clarity=46.02, and certification=-13.94. From these eigenvalues we can determine if there will be a maximum, minimum, saddle point, or ridge carat:colour (maximum) carat:clarity(maximum) carat:certification(saddle point) colour:clarity(maximum) colour:certification(saddle point) clarit:certification(saddle point)
par(mfrow=c(2,3))
contour(d.rsm, ~carat + colour + clarity + certification, image=TRUE, at=summary(d.rsm$canoncial$xs))
When looking at these contour plots of the response surfaces, it is difficult to really determine which interactions are significant and which have extrema.
shapiro.test(d$price)
##
## Shapiro-Wilk normality test
##
## data: d$price
## W = 0.9286, p-value = 7.504e-11
qqnorm(residuals(d.rsm))
qqline(residuals(d.rsm))
plot(fitted(d.rsm),residuals(d.rsm))
The shapiro test is used to check if the response variable is normal. With an alpha level of .05, and a pvalue of 7.504e-11, we assume that the data is normally distributed.
A Q-Q plot can be used to compare the shape of the distribution of the dataset. The Q-Q plot and Q-Q line of the residuals for both ANOVA models appear to be normal. However at each of the ends, the residuals seem to stray from the line. When looking at the plot of the fitted model and the residuals, a good fit would be if the plot was symmetric at zero and had a high variation. The points seem to be symmetric at zero at the beginning however it loses symmetry towards the right of the plot.
Since the dataset consist of observations of prices of diamonds in Singapore, it is possible that all of the values were from chance and that there might have been something different about the year the observations were made.There may be other nuisance factors that were not recorded that year and the pricing of diamonds could be based off of different people’s judgements and not a set standard. The Kruskal Wallis test does not assume a normal distrubtion of the residuals and could be used to test if the variation can be attributed to anything other than randomization without assuming a normal distribution.
kruskal.test(price~carat, data=d)
##
## Kruskal-Wallis rank sum test
##
## data: price by carat
## Kruskal-Wallis chi-squared = 274.3102, df = 3, p-value < 2.2e-16
kruskal.test(price~colour, data=d)
##
## Kruskal-Wallis rank sum test
##
## data: price by colour
## Kruskal-Wallis chi-squared = 3.523, df = 5, p-value = 0.6199
kruskal.test(price~clarity, data=d)
##
## Kruskal-Wallis rank sum test
##
## data: price by clarity
## Kruskal-Wallis chi-squared = 34.0992, df = 4, p-value = 7.111e-07
kruskal.test(price~certification, data=d)
##
## Kruskal-Wallis rank sum test
##
## data: price by certification
## Kruskal-Wallis chi-squared = 91.7821, df = 2, p-value < 2.2e-16
When looking at the results of the Kruskal-Wallis test, we would reject the null hypothesis and the variation in ther replacement rate can be attributed to Carat, clarity and certification with an alpha level of .05.