Project 2: Null Hypothesis Statistical Testing

Bok, Joonhyuk

Rensselaer Polytechnic Institute

Nov 08 2016 version 3

1. Setting

1.1 System under test

Among a dataset from the Ecdat R Package, we select “Diamond” which would be useful for expecting and calculating diamond prices.

In Diamond data, ‘colour’ and ‘clarity’ are selected as factors, which represent the characteristic of Diamond and have 6 levels and 5 levels respectively. We choose ‘price’ as a response variable

The orginal dataset was collected for examining characteristics of diamond on the price of diamond and consist of 5 variables and 308 observations. Original dataframe containing :

carat weight of diamond stones in carat unit
colour a factor with levels (D,E,F,G,H,I)
clarity a factor with levels (IF,VVS1,VVS2,VS1,VS2)
certification certification body, a factor with levels (GIA,IGI,HRD)
price price in Singapore $

We can inspect the head and tail of the dataframe in order to see what the data look like

load("C:/Users/bokjh3/Desktop/Ecdat_0.2-9/Ecdat/data/Diamond.rda")
head(Diamond, n=10)

##    carat colour clarity certification price
## 1   0.30      D     VS2           GIA  1302
## 2   0.30      E     VS1           GIA  1510
## 3   0.30      G    VVS1           GIA  1510
## 4   0.30      G     VS1           GIA  1260
## 5   0.31      D     VS1           GIA  1641
## 6   0.31      E     VS1           GIA  1555
## 7   0.31      F     VS1           GIA  1427
## 8   0.31      G    VVS2           GIA  1427
## 9   0.31      H     VS2           GIA  1126
## 10  0.31      I     VS1           GIA  1126

tail(Diamond, n=10)

##     carat colour clarity certification price
## 299  1.01      G    VVS2           HRD  9993
## 300  1.01      G     VS2           HRD  9293
## 301  1.01      H    VVS2           HRD  9433
## 302  1.01      H     VS1           HRD  9153
## 303  1.01      I    VVS1           HRD  8873
## 304  1.01      I     VS1           HRD  8175
## 305  1.02      F    VVS2           HRD 10796
## 306  1.06      H    VVS2           HRD  9890
## 307  1.02      H     VS2           HRD  8959
## 308  1.09      I    VVS2           HRD  9107

1.2 Factor and Levels

The couple of categorical independent variables selected are colour and clarity.

colour a factor with levels (D,E,F,G,H,I)
clarity a factor with levels (IF,VVS1,VVS2,VS1,VS2)

str(Diamond)

## 'data.frame':    308 obs. of  5 variables:
##  $ carat        : num  0.3 0.3 0.3 0.3 0.31 0.31 0.31 0.31 0.31 0.31 ...
##  $ colour       : Factor w/ 6 levels "D","E","F","G",..: 1 2 4 4 1 2 3 4 5 6 ...
##  $ clarity      : Factor w/ 5 levels "IF","VS1","VS2",..: 3 2 4 2 2 2 2 5 3 2 ...
##  $ certification: Factor w/ 3 levels "GIA","HRD","IGI": 1 1 1 1 1 1 1 1 1 1 ...
##  $ price        : int  1302 1510 1510 1260 1641 1555 1427 1427 1126 1126 ...

1.3 Continous variables

Carat and price are continous variables. In order to conduct ANOVA, a continuous dependent variable must be choosed as the response variable. We think that we have to choose the price as a dependent variable and do not use carat data because characteristics of diamond have an influence on the price of dimond and carat is just one of the characteristics.

1.4 Response variables

The response variable is price. The price is defined as the price of diamond in Singapore $. The price is a continuous variable.

1.5 The Data: How is it organized and what does it look like?

The “Diamond” data is a cross-section data from 2000 in Singapore. The number of observations is 308.The dataset is organized by the following variables: carat, colour, clarity, certification, price.

We select colour and clarity as factors, which have 6 and 5 levles respectively, and price as response variable, which is continous variable.

2. Experimental Design

2.1 How will the experiment be organized and conducted to test the hypothesis?

After selecting a couple of categorical Independent variables and a continuous response variable, we conduct the experiment for null hypothesis statistical testing on the non-blocked independent variable, colour.

A randomized incomplete block design will be utilized to investigate the effect of colour on price. ANOVA with two factors will be examined to analyze the effect of colour while blocking clarity. Therefore, we only consider main effect without an interaction effect.

2.2 What is the rationale for this design?

The ANOVA is an appropriate analysis method for a blocked experiment where we may conduct a 2-way ANOVA. We do not consider interaction effects related with the blocked independent variable.

2.3 Randomize: What is the Randomization Scheme?

Randomization is a technique used to balance the effect of extraneous or uncontrollable conditions that can impact the results of an experiment. In this experiment, we do not consider randomization becasue we do not have control over data collection, but we select a random sample from the data.

2.4 Replicate: Are there replicates and/or repeated measures?

Replicates are multiple experimental runs with the same factor settings (levels). Replicates are subject to the same sources of variability, independently of each other. We can replicate combinations of factor levels, groups of factor level combinations, or entire designs. There is no replication in this research.

2.5 Block: Did you use blocking in the design?

In experimental design, blocking is a technique used to deal with nuisance factors that may affect the results of the experiment. There is blocking in this research. We block on the clarity factor since we do not consider how clarity has an influence on price, but we guess that it will have an impact actually. Through blocking, we can better examine colour factor that we want to investigate.

2.6 Selection of Sample Size

In order to decide appropriate sample size in this research, at first, we should set type I error (alpha) and type II error (beta) and estimate the effect size from the main effect of the non-blocking independent variable.

In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a “false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false negative”).

We use Cohen’s D formula to estimate effect size.

set.seed(45)                        
x <- rnorm(Diamond$price)                
y <- rnorm(Diamond$colour)
cohens_d <- function(x, y) {
  lx <- length(x)- 1
  ly <- length(y)- 1
  md  <- abs(mean(x) - mean(y))        
  csd <- lx * var(x) + ly * var(y)
  csd <- csd/(lx + ly)
  csd <- sqrt(csd)                    
  cd  <- md/csd                        
}
res <- cohens_d(x, y)
res

## [1] 0.03467091

We make use of G*Power to determine sample size given that Alpha = 0.05, Beta = 0.05, Power = 1 - Beta = 0.95 and effect size = 0.03467091. The following output was generated, which indicates a critical F-statistic is 3.84 and a sample size is 10,814. Because the effect size is trivial effect (when d < 0.1), many observations are required. However, we select a sample size of 200 randomly from the Diamond dataframe for resampling and the next procedures although we have only 308 observations and lack the number of observations.

G*Power

sample_size <- 200
set.seed(2)
Diamondsample <- Diamond[sample(nrow(Diamond), sample_size),]

3. Statistical Analysis

3.1 Graphics and descriptive summary

Before investigating Alternatives to Null Hypothesis Statistical Testing (NHST), we will only explore NHST in this section. At first, we will look at the histogram of price to see if it satisfies the assumption of normality. As you can see, the distribution looks to be positively skewed.

hist(Diamondsample$price, main = "The price of Diamond")

And then, the levels of each factor are shown in a boxplot to analyze the main effects of factors over the response variable, price.

boxplot(Diamondsample$price~Diamondsample$colour, xlab="Colour", ylab="The price of Diamond")
title("Impact of colour on The price of Diamond")

3.2 Testing

In this section, we will conduct a two-way ANOVA test. We only consider main effect without an interaction effect. Since we chose to block clarity, we will ignore the F-statistic associated with that factor. (We only care about colour.) From the ANOVA results, we get an F statistic of 1.788 and a p-value of 0.1172. The factors of colour is not statistically significant at 5% significance level.

model <- (aov(Diamondsample$price ~ Diamondsample$colour + Diamondsample$clarity))
summary(model)

##                        Df    Sum Sq  Mean Sq F value   Pr(>F)    
## Diamondsample$colour    5 9.027e+07 18054238   1.788 0.117155    
## Diamondsample$clarity   4 2.232e+08 55805036   5.526 0.000314 ***
## Residuals             190 1.919e+09 10098455                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.3 Diagnostics / Model Adequacy Checking

Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. The relatively linear relationship for all data sets justifies the use of ANOVA to test for the significant difference. From the QQ plot, the residuals nearly form a linear line, and we can find out that the assumptions of normality are met.

qqnorm(residuals(model))
qqline(residuals(model))

Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. From the Residuals vs. Fits plot, the distribution of points seems random although we can see a few outliers and some linearity. Overall, it is not worried.

plot(fitted(model),residuals(model))

4. Alternatives to Null Hypothesis Statistical Testing

In addition to the NHST performed in the previous section, we will also conduct alternative evaluation, resampling statistics and confidence intervals.

4.1 Resampling Statistics

Resampling Statistics is a computational approach that draw a sample from your target population(s) and use techniques to randomly resample in order to generate an empirically derived estimate of the sampling distribution of your statistic.

Resampling techniques depend upon repeated rerandomization or simulation of a sample. In the end, the accuracy of the initial F-statistic may be detemined. Resampling was conducted using a loop. The number of iterations was done to be 10,000. Inside the loop, a random sample is generated, ANOVA is conducted. A histogram of the array represents the shape of the Probability Density Function (PDF).

Fvals <- 0 
num_iter <- 10000 
for (i in 1:num_iter) {
  Diamondsample <- Diamond[sample(nrow(Diamond), sample_size),]
  Fvals[i] <- summary(aov(Diamondsample$price ~ Diamondsample$colour+Diamondsample$clarity))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")

While the PDF has the same general shape as the F-distribution (df = 1), we can figure out the actual probability of F-values that fall above previous determined critical value (F-statistic = 3.84).

Fcrit <- 3.84
length(which(Fvals>Fcrit))/length(Fvals)

## [1] 0.0391

4.2 Confidence Interval

We use a 95% confidence level and try to calculate the confidence interval. we will assume that we are working with a sample (N=200) standard deviation rather than an exact standard deviation. The 95% confidence interval of price is [4,557.66, 5,532.59]

Diamondsample_mean <- mean(Diamondsample$price)
Diamondsample_sd <- sd(Diamondsample$price)
Diamondsample_n <- length(Diamondsample$price)
error <- qt(0.975,df=Diamondsample_n-1)*Diamondsample_sd/sqrt(Diamondsample_n)
Diamondsample_lower <- Diamondsample_mean-error
Diamondsample_upper <- Diamondsample_mean+error
Diamondsample_lower

## [1] 4557.66

Diamondsample_upper

## [1] 5532.59

For comparison, we calculate 95% confidence level of original data (N=308). The 95% confidence interval of price from original data is [4,637.922, 5,401.046]. From the results, we can find out that confidence interval is reduced when the number of sample is larger.

Diamond_mean <- mean(Diamond$price)
Diamond_sd <- sd(Diamond$price)
Diamond_n <- length(Diamond$price)
error <- qt(0.975,df=Diamond_n-1)*Diamond_sd/sqrt(Diamond_n)
Diamond_lower <- Diamond_mean-error
Diamond_upper <- Diamond_mean+error
Diamond_lower

## [1] 4637.922

Diamond_upper

## [1] 5401.046

5. References to the literature

Montgomery, Douglas C. 2012. Design and Analysis of Experiments, 8th Edition.

6. Appendices

Raw data

The Diamond dataset can be found by installing and loading the Ecdat R package.

Complete and documented R code

#load("C:/Users/bokjh3/Desktop/Ecdat_0.2-9/Ecdat/data/Diamond.rda")  

#head(Diamond, n=10)  

#tail(Diamond, n=10)  

#str(Diamond)  

#set.seed(45)                          
#x <- rnorm(Diamond$price)                  
#y <- rnorm(Diamond$colour)  
#cohens_d <- function(x, y) {  
#  lx <- length(x)- 1  
#  ly <- length(y)- 1  
#  md  <- abs(mean(x) - mean(y))          
#  csd <- lx * var(x) + ly * var(y)  
#  csd <- csd/(lx + ly)  
#  csd <- sqrt(csd)                      
#  cd  <- md/csd 
# }  
#res <- cohens_d(x, y)  
#res  

#sample_size <- 200  
#set.seed(2)  
#Diamondsample <- Diamond[sample(nrow(Diamond), sample_size),]  

#hist(Diamondsample$price, main = "The price of Diamond")  

#boxplot(Diamondsample$price~Diamondsample$colour, xlab="Colour", ylab="The price of Diamond")  
#title("Impact of colour on The price of Diamond")  

#model <- (aov(Diamondsample$price ~ Diamondsample$colour + Diamondsample$clarity))  
#summary(model)  

#qqnorm(residuals(model))  
#qqline(residuals(model))  

#plot(fitted(model),residuals(model))  

#Fvals <- 0   
#num_iter <- 10000   
#for (i in 1:num_iter) {  
#  Diamondsample <- Diamond[sample(nrow(Diamond), sample_size),]  
#  Fvals[i] <- summary(aov(Diamondsample$price ~ Diamondsample$colour+Diamondsample$clarity))[[1]][["F value"]][1]  
# }    
#hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")      

#Fcrit <- 3.84    
#length(which(Fvals>Fcrit))/length(Fvals)  

#Diamondsample_mean <- mean(Diamondsample$price)    
#Diamondsample_sd <- sd(Diamondsample$price)  
#Diamondsample_n <- length(Diamondsample$price)  
#error <- qt(0.975,df=Diamondsample_n-1)*Diamondsample_sd/sqrt(Diamondsample_n)  
#Diamondsample_lower <- Diamondsample_mean-error 
#Diamondsample_upper <- Diamondsample_mean+error  
#Diamondsample_lower
#Diamondsample_upper  

#Diamond_mean <- mean(Diamond$price)  
#Diamond_sd <- sd(Diamond$price)  
#Diamond_n <- length(Diamond$price)  
#error <- qt(0.975,df=Diamond_n-1)*Diamond_sd/sqrt(Diamond_n)  
#Diamond_lower <- Diamond_mean-error
#Diamond_upper <- Diamond_mean+error  
#Diamond_lower  
#Diamond_upper