This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
Dataset in this analysis is from ‘Computers’ in package ‘Ecdat’. In this analysis, we test the effect of computer screen size on the price of computers.
A quick view of the dataset is as below.
library("Ecdat")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
data<-Computers
#quick view of first and last several rows.
head(data)
## price speed hd ram screen cd multi premium ads trend
## 1 1499 25 80 4 14 no no yes 94 1
## 2 1795 33 85 2 14 no no yes 94 1
## 3 1595 25 170 4 15 no no yes 94 1
## 4 1849 25 170 8 14 no no no 94 1
## 5 3295 33 340 16 14 no no yes 94 1
## 6 3695 66 340 16 14 no no yes 94 1
tail(data)
## price speed hd ram screen cd multi premium ads trend
## 6254 2154 66 850 16 15 yes no yes 39 35
## 6255 1690 100 528 8 15 no no yes 39 35
## 6256 2223 66 850 16 15 yes yes yes 39 35
## 6257 2654 100 1200 24 15 yes no yes 39 35
## 6258 2195 100 850 16 15 yes no yes 39 35
## 6259 2490 100 850 16 17 yes no yes 39 35
summary(data)
## price speed hd ram
## Min. : 949 Min. : 25 Min. : 80 Min. : 2.00
## 1st Qu.:1794 1st Qu.: 33 1st Qu.: 214 1st Qu.: 4.00
## Median :2144 Median : 50 Median : 340 Median : 8.00
## Mean :2220 Mean : 52 Mean : 417 Mean : 8.29
## 3rd Qu.:2595 3rd Qu.: 66 3rd Qu.: 528 3rd Qu.: 8.00
## Max. :5399 Max. :100 Max. :2100 Max. :32.00
## screen cd multi premium ads
## Min. :14.0 no :3351 no :5386 no : 612 Min. : 39
## 1st Qu.:14.0 yes:2908 yes: 873 yes:5647 1st Qu.:162
## Median :14.0 Median :246
## Mean :14.6 Mean :221
## 3rd Qu.:15.0 3rd Qu.:275
## Max. :17.0 Max. :339
## trend
## Min. : 1.0
## 1st Qu.:10.0
## Median :16.0
## Mean :15.9
## 3rd Qu.:21.5
## Max. :35.0
One factor:size of computer screen Three levels: ‘14’ ‘15’ ‘17’
# factors and levels setting
screen<-factor(data$screen)
levels(screen)
## [1] "14" "15" "17"
summary(screen)
## 14 15 17
## 3661 1992 606
In data we analyze, response variable ‘price’ is continuous variable.
Data was collected in the United States about computer price and related properties, including speed, size of hard drive, size of Ram, size of screen and so on. In this test we only investigate effect of ‘screen’ on ‘price’.
#structure of data
str(sample)
## 'data.frame': 38 obs. of 6 variables:
## $ wfood : num 0.454 0.635 0.356 0.398 0.671 ...
## $ totexp: num 39072 47992 23400 24176 46480 ...
## $ age : num 85 73 77 66 88 78 75 72 53 63 ...
## $ size : num 1 1 1 2 1 1 1 1 1 1 ...
## $ town : Factor w/ 5 levels "1","2","3","4",..: 2 3 1 3 4 1 3 4 3 1 ...
## $ sex : Factor w/ 2 levels "man","woman": 2 1 2 1 2 2 2 2 2 2 ...
Data can be assumed as randomly collected.
There is no replication or repeated measures in raw data or this analysis.
There is no blocks in this study
This analysis is going to investigate how the variation of computer screen size influence computer price.eams. Use ANOVA to analyze variance of response variable and determine if screen size has major effect on price. After that we use resampling methods determine their effect of the outcome of the ANOVA test. Therefore, null hypothesis is: H0: the difference in computer price can only be explained by randomization. Alternative hypothesis: Ha: the difference in computer price can be explained by something other than randomization.
This analysis is about resampling techniques. ANOVA is based on normal distribution, however in fact data might not be as normally distributed as we hope, therefore usage of resampling technique would improve accuracy of ANOVA.
#Histogram of response variable
hist(data$price,xlab='price',ylab='number')
#Boxplot of price and screen size
boxplot(price~screen,data=data)
Plots above reveal that clearly computer price will increase with the increase of screen size.
model=aov(price ~ screen, data=data)
anova(model)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## screen 1 1.85e+08 1.85e+08 601 <2e-16 ***
## Residuals 6257 1.93e+09 3.08e+05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA result has a P-value of 0, indicating that variance of computer price can be explained by variance in screen size. The Mean Sq=307818. Based on this result we can reject null hypothesis and accept alternative hypothesis that variance of computer price can be explained by screen size.
#Shapiro.test
#shapiro.test(data$price)
#Cannot do shapiro test since sample amount is over 5000
#Model normality checking
qqnorm(residuals(model))
qqline(residuals(model))
#Residuals fitted plot
plot(fitted(model),residuals(model))
Q-Q norm plot and Q-Q line of residuals exhibit linear pattern of residuals which means original model is valid. However residuals are not evenly distributed on each side of zero in residual and fitted plot, meaning the model is probably not perfectly fitted.
with(data, tapply(price,screen,mean))
## 14 15 17
## 2084 2350 2612
with(data, tapply(price,screen,var))
## 14 15 17
## 295031 292515 419346
with(data, tapply(price,screen,length))
## 14 15 17
## 3661 1992 606
summary(aov(price ~ screen,data=data))
## Df Sum Sq Mean Sq F value Pr(>F)
## screen 1 1.85e+08 1.85e+08 601 <2e-16 ***
## Residuals 6257 1.93e+09 3.08e+05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#mean of price
meanstar=mean(data$price)
#square root of residual error obtained above
sdstar=sqrt(307818)
simscreen=data$screen
#simulation run time
R=1000
Fstar = numeric(R)
for (i in 1:R) {
groupA = rnorm(3661, mean=meanstar, sd=sdstar)
groupB = rnorm(1992, mean=meanstar, sd=sdstar)
groupC = rnorm(606, mean=meanstar, sd=sdstar)
simprice = c(groupA,groupB,groupC)
simdata = data.frame(simprice,simscreen)
Fstar[i] = oneway.test(simprice~simscreen, var.equal=T, data=simdata)$statistic
}
hist(Fstar, prob=T, ylim=c(0,1),xlim=c(0,8), main='Theoretical F-distribution')
x=seq(.05,6,.25)
points(x,y=df(x,2,6256),type="b",col="red")
library("Ecdat")
data<-Computers
data(data)
## Warning: data set 'data' not found
meanstar = with(data, tapply(price,screen,mean))
grpA = data$price[data$screen=="14"] - meanstar[1]
grpB = data$price[data$screen=="15"] - meanstar[2]
grpC = data$price[data$screen=="17"] - meanstar[3]
newsimscreen = data$screen
R = 1000
newFstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=3661, replace=T)
groupB = sample(grpB, size=1992, replace=T)
groupC = sample(grpC, size=606, replace=T)
newsimprice = c(groupA,groupB,groupC)
newsimdata = data.frame(simprice,simscreen)
newFstar[i] = oneway.test(newsimprice~newsimscreen, var.equal=T, data=newsimdata)$statistic
}
hist(newFstar, prob=T, ylim=c(0,1),xlim=c(0,8), main='Bootstrap F-distribution')
x=seq(.05,6,.25)
points(x,y=df(x,2,6256),type="b",col="red")
Bootstrap F distribution histogram is similar to theoretical F distribution histogram. Then we are going to use ANOVA based on resampling result.
mean(Fstar>=newFstar)
## [1] 0.49
qf(.95,5,90)
## [1] 2.316
quantile(Fstar,.95)
## 95%
## 2.784
newmodel=aov(newsimprice ~ newsimscreen, data=newsimdata)
anova(newmodel)
## Analysis of Variance Table
##
## Response: newsimprice
## Df Sum Sq Mean Sq F value Pr(>F)
## newsimscreen 1 1.32e+06 1324471 4.28 0.039 *
## Residuals 6257 1.94e+09 309564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA result of new model has a p-value of 0.18, which indicates it is highly possible that variance of computer price is only due to randomizaiton, and screen size is not a major factor to explain computer price. This is contrast to original ANOVA result, which means data collected might not be normally distributed. Based on this ANOVA result we cannot reject the null hypothesis.
#Model normality checking
qqnorm(residuals(newmodel))
qqline(residuals(newmodel))
Q-Q plot and Q-Q line of residuals exhibit linear pattern of residuals, which means new model is valid.
plot(fitted(newmodel),residuals(newmodel))
Plot of fitted model and residuals model shows residuals are not well distributed on each side of 0, indicating the new model we use is not perfectly fitted.