1.Setting

The data contains 6259 observations from 1993 to 1995 across the United States. The information included in in this data set contains price, hardware conditions, manufacturer, and etc.In this report, we believe that the CD-Room installation will significantly affect the computer price,so the effect of cd is analyzed.

System Under Test

We choose data set Computers from Ecdat in R package to examine the wage differences. The variables are listed as follows:

A dataframe containing,

price: price in US dollars of 486 PCs

speed: clock speed in MHz

hd: size of hard drive in MB

ram: size of Ram in in MB

screen: size of screen in inches

cd: is a CD-ROM present ?

multi: is a multimedia kit (speakers, sound card) included ?

premium: is the manufacturer was a “premium” firm (IBM, COMPAQ) ?

ads: number of 486 price listings for each month

trend: time trend indicating month starting from January of 1993 to November of 1995.

# instaling package
 install.packages("Ecdat", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/zhao/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'Ecdat' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\zhao\AppData\Local\Temp\RtmpiIkgnD\downloaded_packages
# instaling package
install.packages("Ecfun", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/zhao/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'Ecfun' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\zhao\AppData\Local\Temp\RtmpiIkgnD\downloaded_packages
library("Ecdat")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
## 
##     sign
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
## 
##     Orange
# Pulling the Computers data from the library. This data set meet the requirement that DV(price) is continuous and IVs are categorical.
Computers1<- Computers
Computers1$screen<-as.factor(Computers1$screen)

In order to intuitively understand the data, we summarize the data and show part of data as below.

summary(Computers1)
##      price          speed              hd              ram        
##  Min.   : 949   Min.   : 25.00   Min.   :  80.0   Min.   : 2.000  
##  1st Qu.:1794   1st Qu.: 33.00   1st Qu.: 214.0   1st Qu.: 4.000  
##  Median :2144   Median : 50.00   Median : 340.0   Median : 8.000  
##  Mean   :2220   Mean   : 52.01   Mean   : 416.6   Mean   : 8.287  
##  3rd Qu.:2595   3rd Qu.: 66.00   3rd Qu.: 528.0   3rd Qu.: 8.000  
##  Max.   :5399   Max.   :100.00   Max.   :2100.0   Max.   :32.000  
##  screen      cd       multi      premium         ads       
##  14:3661   no :3351   no :5386   no : 612   Min.   : 39.0  
##  15:1992   yes:2908   yes: 873   yes:5647   1st Qu.:162.5  
##  17: 606                                    Median :246.0  
##                                             Mean   :221.3  
##                                             3rd Qu.:275.0  
##                                             Max.   :339.0  
##      trend      
##  Min.   : 1.00  
##  1st Qu.:10.00  
##  Median :16.00  
##  Mean   :15.93  
##  3rd Qu.:21.50  
##  Max.   :35.00
head(Computers1)
##   price speed  hd ram screen cd multi premium ads trend
## 1  1499    25  80   4     14 no    no     yes  94     1
## 2  1795    33  85   2     14 no    no     yes  94     1
## 3  1595    25 170   4     15 no    no     yes  94     1
## 4  1849    25 170   8     14 no    no      no  94     1
## 5  3295    33 340  16     14 no    no     yes  94     1
## 6  3695    66 340  16     14 no    no     yes  94     1
tail(Computers1)
##      price speed   hd ram screen  cd multi premium ads trend
## 6254  2154    66  850  16     15 yes    no     yes  39    35
## 6255  1690   100  528   8     15  no    no     yes  39    35
## 6256  2223    66  850  16     15 yes   yes     yes  39    35
## 6257  2654   100 1200  24     15 yes    no     yes  39    35
## 6258  2195   100  850  16     15 yes    no     yes  39    35
## 6259  2490   100  850  16     17 yes    no     yes  39    35

Factors and levels

The structure of data frame is listed below:

str(Computers1)
## 'data.frame':    6259 obs. of  10 variables:
##  $ price  : num  1499 1795 1595 1849 3295 ...
##  $ speed  : num  25 33 25 25 33 66 25 50 50 50 ...
##  $ hd     : num  80 85 170 170 340 340 170 85 210 210 ...
##  $ ram    : num  4 2 4 8 16 16 4 2 8 4 ...
##  $ screen : Factor w/ 3 levels "14","15","17": 1 1 2 1 1 1 1 1 1 2 ...
##  $ cd     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ multi  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ premium: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ ads    : num  94 94 94 94 94 94 94 94 94 94 ...
##  $ trend  : num  1 1 1 1 1 1 1 1 1 1 ...

The categorical independent variables we choose to study are cd and screen.

cd is a two-level factor. It means whether the computuer has a CD-ROM.

Screen is a factor with 3 levels (14,15, and 17), and it will be used as a varible for blocking.

Continuous Variables

The response variables in our study is the price of the computer. It is a continuous variable which represents the price of different computers. So in this ANOVA analysis, the continuous varible-price is selected as dependent variable (DV).

2. Experimental Design

Rational for Design

The rationale for this study is that we use ANOVA to analyze the price in a blocked experiment.we are not interested in interaction effects and want to evaluate only the main effect of one IV - cd.

Randomization

Randomization should be used in the experiment in three ways: random selection, random assignment, and random execution. In this study, we apply the randomization in sample assignment and sample execution. We use G*Power to determine the sample size. The result is shown in the later part.

Replication

The data set we use is cross-sectional data from 1993-1995, so in this study, we do not use replication here, because we do not have multiple data (experiment ) to do so.

Blocking

We use blocking in our model. The screen factor was used to block becasue we believe screen is not a determinent variable for pricing. By blocking, noise caused by screen size difference can be reduced.

3. Statistical Analysis

In this part, we will use Null Hypothesis Statistical Testing (NHST) to analyze the difference of computer price.

In order to check the assumption of normality, we draw a histogram of price below.

hist(Computers1$price, main = "Computer Price Difference")

We calculate the main effect of cd next. It is a two level difference so that we can calculate the mean difference between levels.

# Calculating main effect
#   Take the difference between the means of premium and non-premium
mp <- mean(subset(Computers1$price, Computers1$cd=="yes"))
mn <- mean(subset(Computers1$price, Computers1$cd=="no"))
md <- mp - mn

# Also calculate the group standard deviations, to be used later
sdp <- sd(subset(Computers1$price, Computers1$cd=="yes"))
sdn <- sd(subset(Computers1$price, Computers1$cd=="no"))

# Print the result
md
## [1] 229.7936

We further draw the boxplot to intuitively check the price difference between 2 levels.

# Generate boxplot of logarithm of wage versus sex
boxplot(price~cd, data=Computers1, vertical=TRUE, las=1, ylab="Price", main="The Quality of Firm")

Dtermining Sample Size Using GPower

In this study, we use GPower3.1.9.2 to determine the sample size. We set type I error (\(\alpha\))and type II error (\(\beta\)). Then the estimated size of the main effect cd can be calculated. The input is shown as following:

Effect size f = 0.44

\(\alpha\) = 0.05

Power (1- \(\beta\)) = 0.95

Number of groups = 2,

Calculate Cohen’s D by cohen.d function from effsize package.

require(effsize)
## Loading required package: effsize
cohen.d(Computers1$price, Computers1$cd)
## 
## Cohen's d
## 
## d estimate: -0.4035519 (small)
## 95 percent confidence interval:
##        inf        sup 
## -0.4537428 -0.3533609

G*Power plot shows the sample size based on the input above.

From G*Power, we get the sample size N equals to 84.

# Extract a sample of appropriate size from the dataframe
sample_size <- 84
# Set seed for reproducibility of data for report
set.seed(976)
csample <- Computers1[sample(nrow(Computers1), sample_size),]

ANOVA

We use a 2-way ANOVA and choose to block by screen.

# Perform 2-way ANOVA on the sample and show the summary
anova<-aov(csample$price ~ csample$cd+csample$screen)
summary(anova)
##                Df   Sum Sq Mean Sq F value   Pr(>F)    
## csample$cd      1  3863905 3863905  14.298 0.000299 ***
## csample$screen  2  1190621  595310   2.203 0.117141    
## Residuals      80 21619179  270240                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the result above, the F statistic is 14.298 and p-value is 0.0003, which indicates that the CD-ROM installation has a significant effect on computer price. However, there is a potential problem that we use set seed function to determine the random number here. The results may be different based on the random number.

Model Adequency Checking

In this part, we will check the robustness of our result. Firstly, we need to make sure that our dataset meets all of the assumptions.in order to see whether the dataset exhibit normally. In order to verify the normality of our dataset, we create the Normal Quantile-Quantile (QQ) Plot and the plot of the fitted values versus the residual value.

#Create a Normal Q-Q plot and add a reference line to Q-Q plot
qqnorm(residuals(anova))
qqline(residuals(anova))

The Q-Q plot of the residuals for this model forms a linear line and we can conclude that the assumptions of normality are met. When the residuals are nomal, the data points in Q-Q plot tend to be linear.

# Create a residuals plot for the model
plot(fitted(anova), residuals(anova))

The plot shows that the distribution is randomly ditributed.

4. Alternatives to Null Hypothesis Statistical Testing (NHST)

In this section, we introduce alternative methods to null hypothesis testing, Resampling Statiscics and confidence interval.

Resampling Statistics

We use for loop to run the resampling. The samples are generated based on the loop, repeating for 10,000 times. For each sample, a new corresponding F-value is calculated. In order to test the assumption of NHST, large number of iterations need to be performed. Then we can know the accuracy in our experiment sample. In the next step, we show the probability density function (PDF) of F-value. From the histogram, we can see that it tend to be a chi-square distribution.

# Repeatedly take new sample, perform aov, store F-value in an array
# initialize an array called Fvals that stores the F-value in each repeated test
Fvals <- 0 
# number of iterations in the for loop
num_iter <- 10000 
for (i in 1:num_iter) {
  # Take a new random sample (of the same sample size used in the NHST)
  csample <- Computers1[sample(nrow(Computers1), sample_size),]
  # Extract the F-value of interest from the aov summary output
  Fvals[i] <- summary(aov(csample$price ~ csample$cd+csample$screen))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")

In the end, we calculate the probability of F-values using the determined F value in GPower(F=3.96), and check whether it is above the 5% level.

Fcritical <- 3.96
length(which(Fvals>Fcritical))/length(Fvals)
## [1] 0.4793

From the result, we can see that the probability of F-value (F=3.96) is 0.4793, which is much higher than 0.05.

Confidence Intervals

In this part, we use Turkey’s method to create confidence intervals with a family error rate of 0.05.

tukey1<-TukeyHSD(anova)
plot(tukey1)

TukeyHSD(anova)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = csample$price ~ csample$cd + csample$screen)
## 
## $`csample$cd`
##            diff      lwr      upr     p adj
## yes-no 429.4341 203.4256 655.4426 0.0002994
## 
## $`csample$screen`
##            diff        lwr      upr     p adj
## 15-14  152.2185  -128.9593 433.3963 0.4033642
## 17-14 -327.8273  -913.7152 258.0605 0.3794800
## 17-15 -480.0458 -1073.5705 113.4788 0.1365055

The confidence interval for the price difference of CD-Room intallation is from 203.43 to 655.44. and the mean difference is 429.43, with the p-value of 0.0003. So the result shows that it is statistically significant.

5. References

Montgomery, Douglas C.. Design and Analysis of Experiments, 8th Edition. Wiley. Kindle Edition.

Appendix

Raw Data

The data was drawn from the R ecdat library, and the dataset used is Computers.

R Code

# instaling package
 install.packages("Ecdat", repos='http://cran.us.r-project.org')
# instaling package
install.packages("Ecfun", repos='http://cran.us.r-project.org')

library("Ecdat")

# Pulling the Computers data from the library. This data set meet the requirement that DV(price) is continuous and IVs are categorical.
Computers1<- Computers
Computers1$screen<-as.factor(Computers1$screen)

# Summarize the dataframe
summary(Computers1)
head(Computers1)
tail(Computers1)

#Explain the structure of dataframe
str(Computers1)

# Draw the histogram of price
hist(Computers1$price, main = "Computer Price Difference")# Calculating main effect
#   Take the difference between the means of premium and non-premium
mp <- mean(subset(Computers1$price, Computers1$cd=="yes"))
mn <- mean(subset(Computers1$price, Computers1$cd=="no"))
md <- mp - mn

# Also calculate the group standard deviations, to be used later
sdp <- sd(subset(Computers1$price, Computers1$cd=="yes"))
sdn <- sd(subset(Computers1$price, Computers1$cd=="no"))

# Print the result
md

# Generate boxplot of logarithm of wage versus sex
boxplot(price~cd, data=Computers1, vertical=TRUE, las=1, ylab="Price", main="The Quality of Firm")

#Calculate Cohen's D by cohen.d function from effsize package
require(effsize)
cohen.d(Computers1$price, Computers1$cd)

# Extract a sample of appropriate size from the dataframe
sample_size <- 84
# Set seed for reproducibility of data for report
set.seed(976)
csample <- Computers1[sample(nrow(Computers1), sample_size),]

# Perform 2-way ANOVA on the sample and show the summary
anova<-aov(csample$price ~ csample$cd+csample$screen)
summary(anova)

# Create a Normal Q-Q plot and add a reference line to Q-Q plot
qqnorm(residuals(anova))
qqline(residuals(anova))

# Create a residuals plot for the model
plot(fitted(anova), residuals(anova))

# Repeatedly take new sample, perform aov, store F-value in an array
# initialize an array called Fvals that stores the F-value in each repeated test
Fvals <- 0 
# number of iterations in the for loop
num_iter <- 10000 
for (i in 1:num_iter) {
# Take a new random sample (of the same sample size used in the NHST)
  csample <- Computers1[sample(nrow(Computers1), sample_size),]
# Extract the F-value of interest from the aov summary output
  Fvals[i] <- summary(aov(csample$price ~ csample$cd+csample$screen))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")
Fcritical <- 3.96
length(which(Fvals>Fcritical))/length(Fvals)

# Use Turkey's method to create confidence intervals
tukey1<-TukeyHSD(anova)
plot(tukey1)
TukeyHSD(anova)