The data contains 6259 observations from 1993 to 1995 across the United States. The information included in in this data set contains price, hardware conditions, manufacturer, and etc.In this report, we believe that the CD-Room installation will significantly affect the computer price,so the effect of cd is analyzed.
We choose data set Computers from Ecdat in R package to examine the wage differences. The variables are listed as follows:
A dataframe containing,
price: price in US dollars of 486 PCs
speed: clock speed in MHz
hd: size of hard drive in MB
ram: size of Ram in in MB
screen: size of screen in inches
cd: is a CD-ROM present ?
multi: is a multimedia kit (speakers, sound card) included ?
premium: is the manufacturer was a “premium” firm (IBM, COMPAQ) ?
ads: number of 486 price listings for each month
trend: time trend indicating month starting from January of 1993 to November of 1995.
# instaling package
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/zhao/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'Ecdat' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\zhao\AppData\Local\Temp\RtmpiIkgnD\downloaded_packages
# instaling package
install.packages("Ecfun", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/zhao/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'Ecfun' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\zhao\AppData\Local\Temp\RtmpiIkgnD\downloaded_packages
library("Ecdat")
## Loading required package: Ecfun
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
# Pulling the Computers data from the library. This data set meet the requirement that DV(price) is continuous and IVs are categorical.
Computers1<- Computers
Computers1$screen<-as.factor(Computers1$screen)
In order to intuitively understand the data, we summarize the data and show part of data as below.
summary(Computers1)
## price speed hd ram
## Min. : 949 Min. : 25.00 Min. : 80.0 Min. : 2.000
## 1st Qu.:1794 1st Qu.: 33.00 1st Qu.: 214.0 1st Qu.: 4.000
## Median :2144 Median : 50.00 Median : 340.0 Median : 8.000
## Mean :2220 Mean : 52.01 Mean : 416.6 Mean : 8.287
## 3rd Qu.:2595 3rd Qu.: 66.00 3rd Qu.: 528.0 3rd Qu.: 8.000
## Max. :5399 Max. :100.00 Max. :2100.0 Max. :32.000
## screen cd multi premium ads
## 14:3661 no :3351 no :5386 no : 612 Min. : 39.0
## 15:1992 yes:2908 yes: 873 yes:5647 1st Qu.:162.5
## 17: 606 Median :246.0
## Mean :221.3
## 3rd Qu.:275.0
## Max. :339.0
## trend
## Min. : 1.00
## 1st Qu.:10.00
## Median :16.00
## Mean :15.93
## 3rd Qu.:21.50
## Max. :35.00
head(Computers1)
## price speed hd ram screen cd multi premium ads trend
## 1 1499 25 80 4 14 no no yes 94 1
## 2 1795 33 85 2 14 no no yes 94 1
## 3 1595 25 170 4 15 no no yes 94 1
## 4 1849 25 170 8 14 no no no 94 1
## 5 3295 33 340 16 14 no no yes 94 1
## 6 3695 66 340 16 14 no no yes 94 1
tail(Computers1)
## price speed hd ram screen cd multi premium ads trend
## 6254 2154 66 850 16 15 yes no yes 39 35
## 6255 1690 100 528 8 15 no no yes 39 35
## 6256 2223 66 850 16 15 yes yes yes 39 35
## 6257 2654 100 1200 24 15 yes no yes 39 35
## 6258 2195 100 850 16 15 yes no yes 39 35
## 6259 2490 100 850 16 17 yes no yes 39 35
The structure of data frame is listed below:
str(Computers1)
## 'data.frame': 6259 obs. of 10 variables:
## $ price : num 1499 1795 1595 1849 3295 ...
## $ speed : num 25 33 25 25 33 66 25 50 50 50 ...
## $ hd : num 80 85 170 170 340 340 170 85 210 210 ...
## $ ram : num 4 2 4 8 16 16 4 2 8 4 ...
## $ screen : Factor w/ 3 levels "14","15","17": 1 1 2 1 1 1 1 1 1 2 ...
## $ cd : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ multi : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ premium: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
## $ ads : num 94 94 94 94 94 94 94 94 94 94 ...
## $ trend : num 1 1 1 1 1 1 1 1 1 1 ...
The categorical independent variables we choose to study are cd and screen.
cd is a two-level factor. It means whether the computuer has a CD-ROM.
Screen is a factor with 3 levels (14,15, and 17), and it will be used as a varible for blocking.
The response variables in our study is the price of the computer. It is a continuous variable which represents the price of different computers. So in this ANOVA analysis, the continuous varible-price is selected as dependent variable (DV).
The rationale for this study is that we use ANOVA to analyze the price in a blocked experiment.we are not interested in interaction effects and want to evaluate only the main effect of one IV - cd.
Randomization should be used in the experiment in three ways: random selection, random assignment, and random execution. In this study, we apply the randomization in sample assignment and sample execution. We use G*Power to determine the sample size. The result is shown in the later part.
The data set we use is cross-sectional data from 1993-1995, so in this study, we do not use replication here, because we do not have multiple data (experiment ) to do so.
We use blocking in our model. The screen factor was used to block becasue we believe screen is not a determinent variable for pricing. By blocking, noise caused by screen size difference can be reduced.
In this part, we will use Null Hypothesis Statistical Testing (NHST) to analyze the difference of computer price.
In order to check the assumption of normality, we draw a histogram of price below.
hist(Computers1$price, main = "Computer Price Difference")
We calculate the main effect of cd next. It is a two level difference so that we can calculate the mean difference between levels.
# Calculating main effect
# Take the difference between the means of premium and non-premium
mp <- mean(subset(Computers1$price, Computers1$cd=="yes"))
mn <- mean(subset(Computers1$price, Computers1$cd=="no"))
md <- mp - mn
# Also calculate the group standard deviations, to be used later
sdp <- sd(subset(Computers1$price, Computers1$cd=="yes"))
sdn <- sd(subset(Computers1$price, Computers1$cd=="no"))
# Print the result
md
## [1] 229.7936
We further draw the boxplot to intuitively check the price difference between 2 levels.
# Generate boxplot of logarithm of wage versus sex
boxplot(price~cd, data=Computers1, vertical=TRUE, las=1, ylab="Price", main="The Quality of Firm")
In this study, we use GPower3.1.9.2 to determine the sample size. We set type I error (\(\alpha\))and type II error (\(\beta\)). Then the estimated size of the main effect cd can be calculated. The input is shown as following:
Effect size f = 0.44
\(\alpha\) = 0.05
Power (1- \(\beta\)) = 0.95
Number of groups = 2,
Calculate Cohen’s D by cohen.d function from effsize package.
require(effsize)
## Loading required package: effsize
cohen.d(Computers1$price, Computers1$cd)
##
## Cohen's d
##
## d estimate: -0.4035519 (small)
## 95 percent confidence interval:
## inf sup
## -0.4537428 -0.3533609
G*Power plot shows the sample size based on the input above.
From G*Power, we get the sample size N equals to 84.
# Extract a sample of appropriate size from the dataframe
sample_size <- 84
# Set seed for reproducibility of data for report
set.seed(976)
csample <- Computers1[sample(nrow(Computers1), sample_size),]
We use a 2-way ANOVA and choose to block by screen.
# Perform 2-way ANOVA on the sample and show the summary
anova<-aov(csample$price ~ csample$cd+csample$screen)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## csample$cd 1 3863905 3863905 14.298 0.000299 ***
## csample$screen 2 1190621 595310 2.203 0.117141
## Residuals 80 21619179 270240
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the result above, the F statistic is 14.298 and p-value is 0.0003, which indicates that the CD-ROM installation has a significant effect on computer price. However, there is a potential problem that we use set seed function to determine the random number here. The results may be different based on the random number.
In this part, we will check the robustness of our result. Firstly, we need to make sure that our dataset meets all of the assumptions.in order to see whether the dataset exhibit normally. In order to verify the normality of our dataset, we create the Normal Quantile-Quantile (QQ) Plot and the plot of the fitted values versus the residual value.
#Create a Normal Q-Q plot and add a reference line to Q-Q plot
qqnorm(residuals(anova))
qqline(residuals(anova))
The Q-Q plot of the residuals for this model forms a linear line and we can conclude that the assumptions of normality are met. When the residuals are nomal, the data points in Q-Q plot tend to be linear.
# Create a residuals plot for the model
plot(fitted(anova), residuals(anova))
The plot shows that the distribution is randomly ditributed.
In this section, we introduce alternative methods to null hypothesis testing, Resampling Statiscics and confidence interval.
We use for loop to run the resampling. The samples are generated based on the loop, repeating for 10,000 times. For each sample, a new corresponding F-value is calculated. In order to test the assumption of NHST, large number of iterations need to be performed. Then we can know the accuracy in our experiment sample. In the next step, we show the probability density function (PDF) of F-value. From the histogram, we can see that it tend to be a chi-square distribution.
# Repeatedly take new sample, perform aov, store F-value in an array
# initialize an array called Fvals that stores the F-value in each repeated test
Fvals <- 0
# number of iterations in the for loop
num_iter <- 10000
for (i in 1:num_iter) {
# Take a new random sample (of the same sample size used in the NHST)
csample <- Computers1[sample(nrow(Computers1), sample_size),]
# Extract the F-value of interest from the aov summary output
Fvals[i] <- summary(aov(csample$price ~ csample$cd+csample$screen))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")
In the end, we calculate the probability of F-values using the determined F value in GPower(F=3.96), and check whether it is above the 5% level.
Fcritical <- 3.96
length(which(Fvals>Fcritical))/length(Fvals)
## [1] 0.4793
From the result, we can see that the probability of F-value (F=3.96) is 0.4793, which is much higher than 0.05.
In this part, we use Turkey’s method to create confidence intervals with a family error rate of 0.05.
tukey1<-TukeyHSD(anova)
plot(tukey1)
TukeyHSD(anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = csample$price ~ csample$cd + csample$screen)
##
## $`csample$cd`
## diff lwr upr p adj
## yes-no 429.4341 203.4256 655.4426 0.0002994
##
## $`csample$screen`
## diff lwr upr p adj
## 15-14 152.2185 -128.9593 433.3963 0.4033642
## 17-14 -327.8273 -913.7152 258.0605 0.3794800
## 17-15 -480.0458 -1073.5705 113.4788 0.1365055
The confidence interval for the price difference of CD-Room intallation is from 203.43 to 655.44. and the mean difference is 429.43, with the p-value of 0.0003. So the result shows that it is statistically significant.
Montgomery, Douglas C.. Design and Analysis of Experiments, 8th Edition. Wiley. Kindle Edition.
The data was drawn from the R ecdat library, and the dataset used is Computers.
# instaling package
install.packages("Ecdat", repos='http://cran.us.r-project.org')
# instaling package
install.packages("Ecfun", repos='http://cran.us.r-project.org')
library("Ecdat")
# Pulling the Computers data from the library. This data set meet the requirement that DV(price) is continuous and IVs are categorical.
Computers1<- Computers
Computers1$screen<-as.factor(Computers1$screen)
# Summarize the dataframe
summary(Computers1)
head(Computers1)
tail(Computers1)
#Explain the structure of dataframe
str(Computers1)
# Draw the histogram of price
hist(Computers1$price, main = "Computer Price Difference")# Calculating main effect
# Take the difference between the means of premium and non-premium
mp <- mean(subset(Computers1$price, Computers1$cd=="yes"))
mn <- mean(subset(Computers1$price, Computers1$cd=="no"))
md <- mp - mn
# Also calculate the group standard deviations, to be used later
sdp <- sd(subset(Computers1$price, Computers1$cd=="yes"))
sdn <- sd(subset(Computers1$price, Computers1$cd=="no"))
# Print the result
md
# Generate boxplot of logarithm of wage versus sex
boxplot(price~cd, data=Computers1, vertical=TRUE, las=1, ylab="Price", main="The Quality of Firm")
#Calculate Cohen's D by cohen.d function from effsize package
require(effsize)
cohen.d(Computers1$price, Computers1$cd)
# Extract a sample of appropriate size from the dataframe
sample_size <- 84
# Set seed for reproducibility of data for report
set.seed(976)
csample <- Computers1[sample(nrow(Computers1), sample_size),]
# Perform 2-way ANOVA on the sample and show the summary
anova<-aov(csample$price ~ csample$cd+csample$screen)
summary(anova)
# Create a Normal Q-Q plot and add a reference line to Q-Q plot
qqnorm(residuals(anova))
qqline(residuals(anova))
# Create a residuals plot for the model
plot(fitted(anova), residuals(anova))
# Repeatedly take new sample, perform aov, store F-value in an array
# initialize an array called Fvals that stores the F-value in each repeated test
Fvals <- 0
# number of iterations in the for loop
num_iter <- 10000
for (i in 1:num_iter) {
# Take a new random sample (of the same sample size used in the NHST)
csample <- Computers1[sample(nrow(Computers1), sample_size),]
# Extract the F-value of interest from the aov summary output
Fvals[i] <- summary(aov(csample$price ~ csample$cd+csample$screen))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")
Fcritical <- 3.96
length(which(Fvals>Fcritical))/length(Fvals)
# Use Turkey's method to create confidence intervals
tukey1<-TukeyHSD(anova)
plot(tukey1)
TukeyHSD(anova)