Among a dataset from the Ecdat R Package, we select “Diamond” which would be useful for expecting and calculating diamond prices.
In Diamond data, ‘colour’ and ‘clarity’ are selected as factors, which represent the characteristic of Diamond and have 6 levels and 5 levels respectively. We choose ‘price’ as a response variable
The orginal dataset was collected for examining characteristics of diamond on the price of diamond and consist of 5 variables and 308 observations. Original dataframe containing :
carat weight of diamond stones in carat unit
colour a factor with levels (D,E,F,G,H,I)
clarity a factor with levels (IF,VVS1,VVS2,VS1,VS2)
certification certification body, a factor with levels (GIA,IGI,HRD)
price price in Singapore $
We can inspect the head and tail of the dataframe in order to see what the data look like
load("C:/Users/bokjh3/Desktop/Ecdat_0.2-9/Ecdat/data/Diamond.rda")
head(Diamond, n=10)
## carat colour clarity certification price
## 1 0.30 D VS2 GIA 1302
## 2 0.30 E VS1 GIA 1510
## 3 0.30 G VVS1 GIA 1510
## 4 0.30 G VS1 GIA 1260
## 5 0.31 D VS1 GIA 1641
## 6 0.31 E VS1 GIA 1555
## 7 0.31 F VS1 GIA 1427
## 8 0.31 G VVS2 GIA 1427
## 9 0.31 H VS2 GIA 1126
## 10 0.31 I VS1 GIA 1126
tail(Diamond, n=10)
## carat colour clarity certification price
## 299 1.01 G VVS2 HRD 9993
## 300 1.01 G VS2 HRD 9293
## 301 1.01 H VVS2 HRD 9433
## 302 1.01 H VS1 HRD 9153
## 303 1.01 I VVS1 HRD 8873
## 304 1.01 I VS1 HRD 8175
## 305 1.02 F VVS2 HRD 10796
## 306 1.06 H VVS2 HRD 9890
## 307 1.02 H VS2 HRD 8959
## 308 1.09 I VVS2 HRD 9107
The couple of categorical independent variables selected are colour and clarity.
colour a factor with levels (D,E,F,G,H,I)
clarity a factor with levels (IF,VVS1,VVS2,VS1,VS2)
str(Diamond)
## 'data.frame': 308 obs. of 5 variables:
## $ carat : num 0.3 0.3 0.3 0.3 0.31 0.31 0.31 0.31 0.31 0.31 ...
## $ colour : Factor w/ 6 levels "D","E","F","G",..: 1 2 4 4 1 2 3 4 5 6 ...
## $ clarity : Factor w/ 5 levels "IF","VS1","VS2",..: 3 2 4 2 2 2 2 5 3 2 ...
## $ certification: Factor w/ 3 levels "GIA","HRD","IGI": 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 1302 1510 1510 1260 1641 1555 1427 1427 1126 1126 ...
Carat and price are continous variables. In order to conduct ANOVA, a continuous dependent variable must be choosed as the response variable. We think that we have to choose the price as a dependent variable and do not use carat data because characteristics of diamond have an influence on the price of dimond and carat is just one of the characteristics.
The response variable is price. The price is defined as the price of diamond in Singapore $. The price is a continuous variable.
The “Diamond” data is a cross-section data from 2000 in Singapore. The number of observations is 308.The dataset is organized by the following variables: carat, colour, clarity, certification, price.
We select colour and clarity as factors, which have 6 and 5 levles respectively, and price as response variable, which is continous variable.
After selecting a couple of categorical Independent variables and a continuous response variable, we conduct the experiment for null hypothesis statistical testing on the non-blocked independent variable, colour.
A randomized incomplete block design will be utilized to investigate the effect of colour on price. ANOVA with two factors will be examined to analyze the effect of colour while blocking clarity. Therefore, we only consider main effect without an interaction effect.
The ANOVA is an appropriate analysis method for a blocked experiment where we may conduct a 2-way ANOVA. We do not consider interaction effects related with the blocked independent variable.
Randomization is a technique used to balance the effect of extraneous or uncontrollable conditions that can impact the results of an experiment. In this experiment, we do not consider randomization becasue we do not have control over data collection, but we select a random sample from the data.
Replicates are multiple experimental runs with the same factor settings (levels). Replicates are subject to the same sources of variability, independently of each other. We can replicate combinations of factor levels, groups of factor level combinations, or entire designs. There is no replication in this research.
In experimental design, blocking is a technique used to deal with nuisance factors that may affect the results of the experiment. There is blocking in this research. We block on the clarity factor since we do not consider how clarity has an influence on price, but we guess that it will have an impact actually. Through blocking, we can better examine colour factor that we want to investigate.
In order to decide appropriate sample size in this research, at first, we should set type I error (alpha) and type II error (beta) and estimate the effect size from the main effect of the non-blocking independent variable.
In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a “false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false negative”).
We use Cohen’s D formula to estimate effect size.
set.seed(45)
x <- rnorm(Diamond$price)
y <- rnorm(Diamond$colour)
cohens_d <- function(x, y) {
lx <- length(x)- 1
ly <- length(y)- 1
md <- abs(mean(x) - mean(y))
csd <- lx * var(x) + ly * var(y)
csd <- csd/(lx + ly)
csd <- sqrt(csd)
cd <- md/csd
}
res <- cohens_d(x, y)
res
## [1] 0.03467091
We make use of G*Power to determine sample size given that Alpha = 0.05, Beta = 0.05, Power = 1 - Beta = 0.95 and effect size = 0.03467091. The following output was generated, which indicates a critical F-statistic is 3.84 and a sample size is 10,814. Because the effect size is trivial effect (when d < 0.1), many observations are required. However, we select a sample size of 200 randomly from the Diamond dataframe for resampling and the next procedures although we have only 308 observations and lack the number of observations.
G*Power
sample_size <- 200
set.seed(2)
Diamondsample <- Diamond[sample(nrow(Diamond), sample_size),]
Before investigating Alternatives to Null Hypothesis Statistical Testing (NHST), we will only explore NHST in this section. At first, we will look at the histogram of price to see if it satisfies the assumption of normality. As you can see, the distribution looks to be positively skewed.
hist(Diamondsample$price, main = "The price of Diamond")
And then, the levels of each factor are shown in a boxplot to analyze the main effects of factors over the response variable, price.
boxplot(Diamondsample$price~Diamondsample$colour, xlab="Colour", ylab="The price of Diamond")
title("Impact of colour on The price of Diamond")
In this section, we will conduct a two-way ANOVA test. We only consider main effect without an interaction effect. Since we chose to block clarity, we will ignore the F-statistic associated with that factor. (We only care about colour.) From the ANOVA results, we get an F statistic of 1.788 and a p-value of 0.1172. The factors of colour is not statistically significant at 5% significance level.
model <- (aov(Diamondsample$price ~ Diamondsample$colour + Diamondsample$clarity))
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Diamondsample$colour 5 9.027e+07 18054238 1.788 0.117155
## Diamondsample$clarity 4 2.232e+08 55805036 5.526 0.000314 ***
## Residuals 190 1.919e+09 10098455
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. The relatively linear relationship for all data sets justifies the use of ANOVA to test for the significant difference. From the QQ plot, the residuals nearly form a linear line, and we can find out that the assumptions of normality are met.
qqnorm(residuals(model))
qqline(residuals(model))
Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. From the Residuals vs. Fits plot, the distribution of points seems random although we can see a few outliers and some linearity. Overall, it is not worried.
plot(fitted(model),residuals(model))
In addition to the NHST performed in the previous section, we will also conduct alternative evaluation, resampling statistics and confidence intervals.
Resampling Statistics is a computational approach that draw a sample from your target population(s) and use techniques to randomly resample in order to generate an empirically derived estimate of the sampling distribution of your statistic.
Resampling techniques depend upon repeated rerandomization or simulation of a sample. In the end, the accuracy of the initial F-statistic may be detemined. Resampling was conducted using a loop. The number of iterations was done to be 10,000. Inside the loop, a random sample is generated, ANOVA is conducted. A histogram of the array represents the shape of the Probability Density Function (PDF).
Fvals <- 0
num_iter <- 10000
for (i in 1:num_iter) {
Diamondsample <- Diamond[sample(nrow(Diamond), sample_size),]
Fvals[i] <- summary(aov(Diamondsample$price ~ Diamondsample$colour+Diamondsample$clarity))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")
While the PDF has the same general shape as the F-distribution (df = 1), we can figure out the actual probability of F-values that fall above previous determined critical value (F-statistic = 3.84).
Fcrit <- 3.84
length(which(Fvals>Fcrit))/length(Fvals)
## [1] 0.0391
We use a 95% confidence level and try to calculate the confidence interval. we will assume that we are working with a sample (N=200) standard deviation rather than an exact standard deviation. The 95% confidence interval of price is [4,557.66, 5,532.59]
Diamondsample_mean <- mean(Diamondsample$price)
Diamondsample_sd <- sd(Diamondsample$price)
Diamondsample_n <- length(Diamondsample$price)
error <- qt(0.975,df=Diamondsample_n-1)*Diamondsample_sd/sqrt(Diamondsample_n)
Diamondsample_lower <- Diamondsample_mean-error
Diamondsample_upper <- Diamondsample_mean+error
Diamondsample_lower
## [1] 4557.66
Diamondsample_upper
## [1] 5532.59
For comparison, we calculate 95% confidence level of original data (N=308). The 95% confidence interval of price from original data is [4,637.922, 5,401.046]. From the results, we can find out that confidence interval is reduced when the number of sample is larger.
Diamond_mean <- mean(Diamond$price)
Diamond_sd <- sd(Diamond$price)
Diamond_n <- length(Diamond$price)
error <- qt(0.975,df=Diamond_n-1)*Diamond_sd/sqrt(Diamond_n)
Diamond_lower <- Diamond_mean-error
Diamond_upper <- Diamond_mean+error
Diamond_lower
## [1] 4637.922
Diamond_upper
## [1] 5401.046
Montgomery, Douglas C. 2012. Design and Analysis of Experiments, 8th Edition.
The Diamond dataset can be found by installing and loading the Ecdat R package.
#load("C:/Users/bokjh3/Desktop/Ecdat_0.2-9/Ecdat/data/Diamond.rda")
#head(Diamond, n=10)
#tail(Diamond, n=10)
#str(Diamond)
#set.seed(45)
#x <- rnorm(Diamond$price)
#y <- rnorm(Diamond$colour)
#cohens_d <- function(x, y) {
# lx <- length(x)- 1
# ly <- length(y)- 1
# md <- abs(mean(x) - mean(y))
# csd <- lx * var(x) + ly * var(y)
# csd <- csd/(lx + ly)
# csd <- sqrt(csd)
# cd <- md/csd
# }
#res <- cohens_d(x, y)
#res
#sample_size <- 200
#set.seed(2)
#Diamondsample <- Diamond[sample(nrow(Diamond), sample_size),]
#hist(Diamondsample$price, main = "The price of Diamond")
#boxplot(Diamondsample$price~Diamondsample$colour, xlab="Colour", ylab="The price of Diamond")
#title("Impact of colour on The price of Diamond")
#model <- (aov(Diamondsample$price ~ Diamondsample$colour + Diamondsample$clarity))
#summary(model)
#qqnorm(residuals(model))
#qqline(residuals(model))
#plot(fitted(model),residuals(model))
#Fvals <- 0
#num_iter <- 10000
#for (i in 1:num_iter) {
# Diamondsample <- Diamond[sample(nrow(Diamond), sample_size),]
# Fvals[i] <- summary(aov(Diamondsample$price ~ Diamondsample$colour+Diamondsample$clarity))[[1]][["F value"]][1]
# }
#hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")
#Fcrit <- 3.84
#length(which(Fvals>Fcrit))/length(Fvals)
#Diamondsample_mean <- mean(Diamondsample$price)
#Diamondsample_sd <- sd(Diamondsample$price)
#Diamondsample_n <- length(Diamondsample$price)
#error <- qt(0.975,df=Diamondsample_n-1)*Diamondsample_sd/sqrt(Diamondsample_n)
#Diamondsample_lower <- Diamondsample_mean-error
#Diamondsample_upper <- Diamondsample_mean+error
#Diamondsample_lower
#Diamondsample_upper
#Diamond_mean <- mean(Diamond$price)
#Diamond_sd <- sd(Diamond$price)
#Diamond_n <- length(Diamond$price)
#error <- qt(0.975,df=Diamond_n-1)*Diamond_sd/sqrt(Diamond_n)
#Diamond_lower <- Diamond_mean-error
#Diamond_upper <- Diamond_mean+error
#Diamond_lower
#Diamond_upper