The dataset in this analysis is ‘Computers’ from the package ‘Ecdat’. We test the effect of computer screen size on the price of computers.
A quick view of the dataset is given below.
library("Ecdat")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
data<-Computers
#quick view of first and last several rows.
head(data)
## price speed hd ram screen cd multi premium ads trend
## 1 1499 25 80 4 14 no no yes 94 1
## 2 1795 33 85 2 14 no no yes 94 1
## 3 1595 25 170 4 15 no no yes 94 1
## 4 1849 25 170 8 14 no no no 94 1
## 5 3295 33 340 16 14 no no yes 94 1
## 6 3695 66 340 16 14 no no yes 94 1
tail(data)
## price speed hd ram screen cd multi premium ads trend
## 6254 2154 66 850 16 15 yes no yes 39 35
## 6255 1690 100 528 8 15 no no yes 39 35
## 6256 2223 66 850 16 15 yes yes yes 39 35
## 6257 2654 100 1200 24 15 yes no yes 39 35
## 6258 2195 100 850 16 15 yes no yes 39 35
## 6259 2490 100 850 16 17 yes no yes 39 35
summary(data)
## price speed hd ram
## Min. : 949 Min. : 25 Min. : 80 Min. : 2.00
## 1st Qu.:1794 1st Qu.: 33 1st Qu.: 214 1st Qu.: 4.00
## Median :2144 Median : 50 Median : 340 Median : 8.00
## Mean :2220 Mean : 52 Mean : 417 Mean : 8.29
## 3rd Qu.:2595 3rd Qu.: 66 3rd Qu.: 528 3rd Qu.: 8.00
## Max. :5399 Max. :100 Max. :2100 Max. :32.00
## screen cd multi premium ads
## Min. :14.0 no :3351 no :5386 no : 612 Min. : 39
## 1st Qu.:14.0 yes:2908 yes: 873 yes:5647 1st Qu.:162
## Median :14.0 Median :246
## Mean :14.6 Mean :221
## 3rd Qu.:15.0 3rd Qu.:275
## Max. :17.0 Max. :339
## trend
## Min. : 1.0
## 1st Qu.:10.0
## Median :16.0
## Mean :15.9
## 3rd Qu.:21.5
## Max. :35.0
One factor: size of computer screen. Three levels: ‘14’, ‘15’, ‘17’.
# factors and levels setting
screen<-factor(data$screen)
levels(screen)
## [1] "14" "15" "17"
summary(screen)
## 14 15 17
## 3661 1992 606
In the data we analyze, the response variable ‘price’ is continuous.
The data were collected in the United States on computer prices and related properties, including speed, hard drive size, RAM size, screen size and so on. In this test we investigate only the effect of ‘screen’ on ‘price’.
#structure of data
str(data)
The data can be assumed to be randomly collected.
There is no replication and there are no repeated measures in the raw data or in this analysis.
There are no blocks in this study.
This analysis investigates how computer screen size influences computer price. We use ANOVA to analyze the variance of the response variable and determine whether screen size has a significant effect on price. After that we use resampling methods to check how they affect the outcome of the ANOVA. The null hypothesis is H0: the differences in computer price across screen sizes can be explained by randomization (chance variation) alone. The alternative hypothesis is Ha: the differences in computer price are explained by something other than randomization.
This analysis focuses on resampling techniques. ANOVA assumes normally distributed errors. In practice the data may not be as close to normal as we hope, so a resampling technique gives a check on the ANOVA result that does not rely on that assumption.
#Histogram of response variable
hist(data$price,xlab='price',ylab='number', main='Computers price/$')
#Boxplot of price and screen size
boxplot(price~screen,data=data, xlab='screen size', ylab='price/$', main='Boxplot of computer price and screen size')
The plots above suggest that computer price increases with screen size.
model=aov(price ~ screen, data=data)
anova(model)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## screen 1 1.85e+08 1.85e+08 601 <2e-16 ***
## Residuals 6257 1.93e+09 3.08e+05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
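The ANOVA result has a p-value below 2e-16 (reported as a bound, not exactly 0), indicating that the variation in computer price is associated with screen size. Based on this result we reject the null hypothesis in favor of the alternative hypothesis that differences in computer price can be explained by screen size. Note that ‘screen’ is stored as a number in ‘data’, so aov() treats it here as a single numeric predictor (Df = 1) rather than as a three-level factor; the separate ‘screen’ factor created earlier is not used, because variables are looked up in ‘data’ first.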
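The difference between a numeric and a factor predictor in aov() can be illustrated with a minimal sketch. The data frame `toy` and its prices are made up for illustration and are not the Computers data; only the degrees of freedom are of interest here.

```r
# Hypothetical toy data (NOT the Computers data): three screen sizes, made-up prices
set.seed(1)
toy <- data.frame(
  screen = rep(c(14, 15, 17), times = c(30, 20, 10)),
  price  = c(rnorm(30, 2000, 300), rnorm(20, 2300, 300), rnorm(10, 2600, 300))
)

# Numeric predictor: aov() fits one slope, so screen gets 1 degree of freedom
fit_numeric <- aov(price ~ screen, data = toy)

# Factor predictor: one mean per level, so screen gets 3 - 1 = 2 degrees of freedom
fit_factor <- aov(price ~ factor(screen), data = toy)

df_numeric <- anova(fit_numeric)$Df
df_factor  <- anova(fit_factor)$Df
print(df_numeric)  # 1 58
print(df_factor)   # 2 57
```

With three levels, the factor version corresponds to the one-way layout described in this report, while the numeric version is a straight-line regression of price on screen size.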
#Shapiro.test
#shapiro.test(data$price)
#shapiro.test() cannot be used here: it accepts at most 5000 observations and this dataset has 6259
#Model normality checking
qqnorm(residuals(model))
qqline(residuals(model))
#Residuals fitted plot
plot(fitted(model),residuals(model))
The normal Q-Q plot of the residuals is roughly linear in the middle, so the normality assumption is reasonable in that range. The tails, however, depart from the line. In the residuals versus fitted plot the residuals are also not evenly spread on either side of zero, so the model probably does not fit perfectly.
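One possible way around the 5000-observation limit of shapiro.test() is to test a random subsample of the residuals. The sketch below is an assumption about how this could be done, not part of the original analysis; `res` is a hypothetical right-skewed stand-in for residuals(model).

```r
set.seed(42)
# Hypothetical stand-in for residuals(model): 6259 right-skewed values
res <- rexp(6259) - 1

# shapiro.test() accepts between 3 and 5000 values, so draw a random subsample
sub <- sample(res, size = 5000)
sw  <- shapiro.test(sub)
print(sw$p.value)
```

With several thousand observations the test tends to flag even small departures from normality, so the Q-Q plot remains the more informative diagnostic.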
Of the three types of resampling techniques, permutation reshuffles the data without replacement, and Monte Carlo simulation draws from a known distribution of the data. We set both aside and use the third technique, bootstrapping, which resamples the data with replacement.
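The distinction between the two sampling schemes can be seen with base R's sample(). This is a minimal illustration on a made-up vector `x`, not on the Computers data.

```r
set.seed(7)
x <- c(10, 20, 30, 40, 50)   # hypothetical values

perm <- sample(x)                   # permutation: same values in a new order
boot <- sample(x, replace = TRUE)   # bootstrap: values may repeat or be left out

print(sort(perm))   # always the original five values
print(boot)
```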
library("Ecdat")
data<-Computers
with(data, tapply(price,screen,mean))
## 14 15 17
## 2084 2350 2612
with(data, tapply(price,screen,var))
## 14 15 17
## 295031 292515 419346
with(data, tapply(price,screen,length))
## 14 15 17
## 3661 1992 606
summary(aov(price ~ screen,data=data))
## Df Sum Sq Mean Sq F value Pr(>F)
## screen 1 1.85e+08 1.85e+08 601 <2e-16 ***
## Residuals 6257 1.93e+09 3.08e+05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Group means, used to centre each group so that the null hypothesis holds
meanstar = with(data, tapply(price,screen,mean))
grpA = data$price[data$screen=="14"] - meanstar[1]
grpB = data$price[data$screen=="15"] - meanstar[2]
grpC = data$price[data$screen=="17"] - meanstar[3]
# Group labels in the same order as c(groupA,groupB,groupC) below
simscreen = rep(c(14,15,17), times=c(length(grpA),length(grpB),length(grpC)))
R = 1000
Fstar = numeric(R)
for (i in 1:R) {
  # Resample each centred group with replacement, keeping its original size
  groupA = sample(grpA, replace=TRUE)
  groupB = sample(grpB, replace=TRUE)
  groupC = sample(grpC, replace=TRUE)
  simprice = c(groupA,groupB,groupC)
  simdata = data.frame(simprice,simscreen)
  Fstar[i] = oneway.test(simprice~simscreen, var.equal=TRUE, data=simdata)$statistic
}
hist(Fstar, prob=TRUE, ylim=c(0,1),xlim=c(0,8), main='Bootstrapped F-distribution')
x=seq(.05,6,.25)
points(x,y=df(x,2,6256),type="b",col="red")
# '2' and '6256' are the degrees of freedom of the one-way test: 3 groups - 1, and 6259 observations - 3 groups.
In the bootstrapped F-distribution graph, the red dotted line is the analytic F density and the histogram is the bootstrapped distribution of the F statistic under the null hypothesis. The two do not match closely, which suggests that the normal-theory F distribution is only an approximation for these data.
print(realFstar<-oneway.test(price ~ screen, var.equal=T, data=data)$statistic)
## F
## 318.6
mean(Fstar>=realFstar)
## [1] 0
# quantiles of the analytic F distribution
qf(.95,1,6257)
## [1] 3.843
# quantiles of the bootstrapped F distribution
quantile(Fstar,.95)
## 95%
## 2.735
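The proportion of bootstrapped F statistics at or above the observed value, mean(Fstar>=realFstar), is zero, which agrees with the histogram above: none of the 1000 resamples drawn under the null hypothesis produced an F statistic as large as the observed 318.6, so the bootstrap p-value is below 1/1000. We also compare the alpha = 0.05 critical value of the test statistic under the analytic F distribution and under the bootstrapped F distribution. The values differ (3.84 and 2.74). Part of the gap comes from the degrees of freedom: 3.84 uses 1 and 6257, while the one-way test on three groups has 2 and 6256, for which the analytic quantile is about 3.00. The remaining difference may reflect residuals that are not exactly normally distributed.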
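A bootstrap p-value of exactly zero is an artifact of the finite number of resamples. A common convention is to add one to both the count and the number of resamples. The sketch below shows this on hypothetical toy data; the names `g`, `y`, `F_obs`, `F_star` and `p_boot` are illustrative and do not come from the analysis above.

```r
set.seed(123)
# Hypothetical toy data: three groups with different means
g <- factor(rep(c("14", "15", "17"), times = c(40, 30, 20)))
y <- c(rnorm(40, 2000, 300), rnorm(30, 2300, 300), rnorm(20, 2600, 300))

# Observed F statistic from the one-way test
F_obs <- oneway.test(y ~ g, var.equal = TRUE)$statistic

# Centre each group on its own mean so that the null hypothesis (equal means) holds
centred <- y - ave(y, g)

R <- 500
F_star <- numeric(R)
for (i in seq_len(R)) {
  # Resample with replacement within each group, keeping group sizes fixed
  y_star <- ave(centred, g, FUN = function(v) sample(v, length(v), replace = TRUE))
  F_star[i] <- oneway.test(y_star ~ g, var.equal = TRUE)$statistic
}

# Bootstrap p-value with the +1 correction, so it can never be exactly zero
p_boot <- (sum(F_star >= F_obs) + 1) / (R + 1)
print(p_boot)
```

With this convention the smallest reportable p-value is 1/(R + 1), which makes the resolution of the bootstrap explicit.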
Next we run ANOVA on the resampled data (‘simdata’ from the last iteration of the loop) to test our assumption.
simmodel=aov(simprice ~ simscreen, data=simdata)
anova(simmodel)
## Analysis of Variance Table
##
## Response: simprice
## Df Sum Sq Mean Sq F value Pr(>F)
## simscreen 1 2.09e+05 209415 0.69 0.41
## Residuals 6257 1.91e+09 304532
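The ANOVA on the resampled data has a p-value of 0.41, so for this bootstrap sample we cannot reject the null hypothesis: its variation in price is consistent with randomization alone. This contrasts with the original ANOVA result, but the contrast is expected. ‘simprice’ was built from prices with the group means subtracted, so it represents the null hypothesis by construction and carries no screen-size effect. The evidence about the real data comes from comparing the observed F statistic (318.6) with the bootstrapped F distribution, where it lies far beyond the 95% quantile.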