HW7

Analysis of Variance with Resampling Techniques

Hongyu Chen

RPI, RIN:661405156

Version:2.0

1. Setting

System under test

The dataset used in this analysis is ‘Computers’ from the package ‘Ecdat’. We test the effect of computer screen size on computer price.

A quick view of the dataset is shown below.

library("Ecdat")

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

data<-Computers
#quick view of first and last several rows.
head(data)

##   price speed  hd ram screen cd multi premium ads trend
## 1  1499    25  80   4     14 no    no     yes  94     1
## 2  1795    33  85   2     14 no    no     yes  94     1
## 3  1595    25 170   4     15 no    no     yes  94     1
## 4  1849    25 170   8     14 no    no      no  94     1
## 5  3295    33 340  16     14 no    no     yes  94     1
## 6  3695    66 340  16     14 no    no     yes  94     1

tail(data)

##      price speed   hd ram screen  cd multi premium ads trend
## 6254  2154    66  850  16     15 yes    no     yes  39    35
## 6255  1690   100  528   8     15  no    no     yes  39    35
## 6256  2223    66  850  16     15 yes   yes     yes  39    35
## 6257  2654   100 1200  24     15 yes    no     yes  39    35
## 6258  2195   100  850  16     15 yes    no     yes  39    35
## 6259  2490   100  850  16     17 yes    no     yes  39    35

summary(data)

##      price          speed           hd            ram       
##  Min.   : 949   Min.   : 25   Min.   :  80   Min.   : 2.00  
##  1st Qu.:1794   1st Qu.: 33   1st Qu.: 214   1st Qu.: 4.00  
##  Median :2144   Median : 50   Median : 340   Median : 8.00  
##  Mean   :2220   Mean   : 52   Mean   : 417   Mean   : 8.29  
##  3rd Qu.:2595   3rd Qu.: 66   3rd Qu.: 528   3rd Qu.: 8.00  
##  Max.   :5399   Max.   :100   Max.   :2100   Max.   :32.00  
##      screen       cd       multi      premium         ads     
##  Min.   :14.0   no :3351   no :5386   no : 612   Min.   : 39  
##  1st Qu.:14.0   yes:2908   yes: 873   yes:5647   1st Qu.:162  
##  Median :14.0                                    Median :246  
##  Mean   :14.6                                    Mean   :221  
##  3rd Qu.:15.0                                    3rd Qu.:275  
##  Max.   :17.0                                    Max.   :339  
##      trend     
##  Min.   : 1.0  
##  1st Qu.:10.0  
##  Median :16.0  
##  Mean   :15.9  
##  3rd Qu.:21.5  
##  Max.   :35.0

Factors and Levels

One factor: size of computer screen

Three levels: ‘14’, ‘15’, ‘17’

# factors and levels setting
screen<-factor(data$screen)
levels(screen)

## [1] "14" "15" "17"

summary(screen)

##   14   15   17 
## 3661 1992  606
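Whether screen size is passed as a factor or as a number changes the model that ‘aov’ fits. The following is a minimal sketch on synthetic data (the data frame ‘toy’ and all of its values are made up for illustration, not taken from ‘Computers’): a numeric predictor gives a straight-line fit with 1 degree of freedom, while a factor with three levels gives one mean per level and 2 degrees of freedom.

```r
# Synthetic stand-in for the Computers data (values are made up for illustration).
set.seed(1)
toy <- data.frame(
  screen = rep(c(14, 15, 17), times = c(60, 30, 10)),
  price  = c(rnorm(60, 2100, 500), rnorm(30, 2350, 500), rnorm(10, 2600, 500))
)

# Numeric predictor: aov() fits a straight line in screen size (1 Df).
fit_numeric <- aov(price ~ screen, data = toy)

# Factor predictor: aov() fits one mean per level (k - 1 = 2 Df).
fit_factor <- aov(price ~ factor(screen), data = toy)

df_numeric <- summary(fit_numeric)[[1]]$Df[1]
df_factor  <- summary(fit_factor)[[1]]$Df[1]
print(c(numeric = df_numeric, factor = df_factor))
```

Checking the Df column of an ANOVA table is a quick way to confirm which of the two models was actually fitted.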

Continuous variables (if any)

In the data we analyze, the response variable ‘price’ is continuous.

The Data: How is it organized and what does it look like?

The data were collected in the United States and record computer prices along with related properties, including speed, hard drive size, RAM size, screen size and so on. In this test we only investigate the effect of ‘screen’ on ‘price’.

#structure of data (6259 observations of 10 variables)
str(data)

Randomization

The data can be assumed to have been collected randomly.

Replication/repeated measures

There are no replicates or repeated measures in the raw data or in this analysis.

Block

There are no blocks in this study.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This analysis investigates how variation in computer screen size influences computer price. We use ANOVA to analyze the variance of the response variable and determine whether screen size has a significant effect on price. We then apply a resampling method and examine its effect on the outcome of the ANOVA.

Null hypothesis, H0: the differences in computer price can be explained by randomization alone.

Alternative hypothesis, Ha: the differences in computer price can be explained by something other than randomization.

What is the rationale for this design?

This analysis is about resampling techniques. ANOVA assumes normally distributed errors, but in practice data might not be as normally distributed as we hope. Resampling gives a reference distribution for the F statistic that does not depend on that assumption.

3. (Statistical) Analysis

Exploratory Data Analysis

#Histogram of response variable
hist(data$price,xlab='price',ylab='number', main='Computers price/$')

[Figure: histogram of computer price]

#Boxplot of price and screen size
boxplot(price~screen,data=data, xlab='screen size', ylab='price/$', main='Boxplot of computer price and screen size')

[Figure: boxplot of computer price by screen size]

The boxplot above suggests that computer price tends to increase with screen size.

Original Model Testing

Original model ANOVA

model=aov(price ~ screen, data=data)
anova(model)

## Analysis of Variance Table
## 
## Response: price
##             Df   Sum Sq  Mean Sq F value Pr(>F)    
## screen       1 1.85e+08 1.85e+08     601 <2e-16 ***
## Residuals 6257 1.93e+09 3.08e+05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

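The ANOVA reports a p-value below 2e-16, indicating that variation in computer price is associated with screen size. Note that ‘screen’ is stored as a numeric column in ‘data’, so this model has 1 degree of freedom for screen and tests a linear trend in price across screen sizes rather than a difference among three group means. Based on this result we reject the null hypothesis in favor of the alternative hypothesis that variation in computer price can be explained by screen size.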

Original model adequacy checking

#Shapiro.test
#shapiro.test(data$price)
#Cannot run the Shapiro-Wilk test since the sample size is over 5000

#Model normality checking
qqnorm(residuals(model))
qqline(residuals(model))

[Figure: normal Q-Q plot of residuals with Q-Q line]

#Residuals fitted plot
plot(fitted(model),residuals(model))

[Figure: residuals versus fitted values]

The normal Q-Q plot of the residuals is roughly linear in the middle, which means the normality assumption is reasonable in that region. However, the tails depart from the Q-Q line. In addition, the residuals are not evenly distributed on each side of zero in the residuals versus fitted plot, meaning the model is probably not a perfect fit.
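Because ‘shapiro.test’ accepts at most 5000 values, one possible workaround is to test a random subsample of the residuals. The sketch below assumes a vector of residuals of the same length as the dataset; here the vector ‘res’ is synthetic and deliberately skewed, so it is only a stand-in for ‘residuals(model)’. With samples this large the test flags even small departures from normality, so it is best read alongside the Q-Q plot rather than instead of it.

```r
set.seed(2)
# Hypothetical residuals: 6259 draws from a right-skewed distribution.
res <- rexp(6259) - 1

# shapiro.test() accepts between 3 and 5000 values, so test a random subsample.
sub <- sample(res, size = 5000, replace = FALSE)
sw  <- shapiro.test(sub)
print(sw$p.value)
```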

Resampling techniques

There are three types of resampling techniques: permutation, Monte Carlo simulation, and bootstrapping. Permutation shuffles the data without replacement, and Monte Carlo simulation requires a known distribution for the data, so we set both aside and use only bootstrapping, which resamples with replacement.
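The difference between shuffling without replacement and resampling with replacement can be seen on a tiny vector. This is an illustrative sketch only; the vector ‘x’ is made up.

```r
set.seed(3)
x <- c(10, 20, 30, 40, 50)

# Permutation: reorder the same values, each used exactly once.
perm <- sample(x, size = length(x), replace = FALSE)

# Bootstrap: draw with replacement, so values can repeat or be left out.
boot <- sample(x, size = length(x), replace = TRUE)

print(perm)
print(boot)
```

A permuted sample always contains exactly the original values, while a bootstrap sample of the same size usually repeats some values and omits others.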

Bootstrap simulation for F-distribution

library("Ecdat")
data<-Computers

with(data, tapply(price,screen,mean))

##   14   15   17 
## 2084 2350 2612

with(data, tapply(price,screen,var))

##     14     15     17 
## 295031 292515 419346

with(data, tapply(price,screen,length))

##   14   15   17 
## 3661 1992  606

summary(aov(price ~ screen,data=data))

##               Df   Sum Sq  Mean Sq F value Pr(>F)    
## screen         1 1.85e+08 1.85e+08     601 <2e-16 ***
## Residuals   6257 1.93e+09 3.08e+05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
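The group sizes, means and variances printed above are enough to compute a one-way F statistic by hand, treating the three screen sizes as groups. The sketch below uses the rounded values as printed, so the result is approximate; it lands close to the F of 318.6 that ‘oneway.test’ reports later in this section, with 2 and 6256 degrees of freedom. The variable names are chosen for illustration.

```r
# Group summaries copied from the tapply() output above (rounded as printed).
n <- c(3661, 1992, 606)          # group sizes
m <- c(2084, 2350, 2612)         # group means
v <- c(295031, 292515, 419346)   # group variances

k     <- length(n)
N     <- sum(n)
grand <- sum(n * m) / N          # grand mean

ss_between <- sum(n * (m - grand)^2)
ss_within  <- sum((n - 1) * v)

df1 <- k - 1                     # between-group degrees of freedom
df2 <- N - k                     # within-group degrees of freedom
F_by_hand <- (ss_between / df1) / (ss_within / df2)
print(c(df1 = df1, df2 = df2, F = F_by_hand))
```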

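# group means, then center each group's prices so that the null hypothesis holds
meanstar = with(data, tapply(price,screen,mean))
grpA = data$price[data$screen=="14"] - meanstar[1]
grpB = data$price[data$screen=="15"] - meanstar[2]
grpC = data$price[data$screen=="17"] - meanstar[3]

# group labels in the same order as c(groupA,groupB,groupC) below
simscreen = rep(c(14,15,17), times=c(length(grpA),length(grpB),length(grpC)))
R = 1000
Fstar = numeric(R)
for (i in 1:R) {
  # resample each group's centered prices with replacement
  groupA = sample(grpA, size=length(grpA), replace=TRUE)
  groupB = sample(grpB, size=length(grpB), replace=TRUE)
  groupC = sample(grpC, size=length(grpC), replace=TRUE)
  simprice = c(groupA,groupB,groupC)
  simdata = data.frame(simprice,simscreen)
  # F statistic of the resampled data under the null hypothesis
  Fstar[i] = oneway.test(simprice~simscreen, var.equal=TRUE, data=simdata)$statistic
}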

hist(Fstar, prob=T, ylim=c(0,1),xlim=c(0,8), main='Bootstrapped F-distribution')

x=seq(.05,6,.25)
points(x,y=df(x,1,6257),type="b",col="red")

[Figure: histogram of the bootstrapped F statistics with the analytic F density shown as red points]

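# '1' and '6257' are the degrees of freedom obtained from the summary above.

In the bootstrapped F-distribution graph, the red dotted line represents the analytic F distribution and the histogram represents the bootstrapped distribution. The two are not very similar. Part of the mismatch comes from the degrees of freedom: the red curve uses 1 and 6257, taken from the ‘aov’ fit that treats screen as numeric, whereas ‘oneway.test’ treats screen as a grouping variable with three levels, so its F statistic has 2 and 6256 degrees of freedom. Any remaining difference reflects how far the data depart from the assumptions behind the analytic distribution.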

Estimation of Alpha level

print(realFstar<-oneway.test(price ~ screen, var.equal=T, data=data)$statistic)

##     F 
## 318.6

mean(Fstar>=realFstar)

## [1] 0

# quantiles of the analytic F distribution
qf(.95,1,6257)

## [1] 3.843

# quantiles of the bootstrapped F distribution
quantile(Fstar,.95)

##   95% 
## 2.735

The proportion of bootstrapped F statistics at or above the observed F (318.6) is zero: none of the 1000 values simulated under the null hypothesis is as large as the value from the data. This gives a bootstrap p-value below 1/1000, which agrees with the original ANOVA.

We also estimate the alpha = 0.05 critical value of the test statistic under the analytic F distribution and under the bootstrapped F distribution. The two values differ (3.84 versus 2.74). The analytic quantile here uses 1 and 6257 degrees of freedom; with the 2 and 6256 degrees of freedom of the one-way test, the analytic 95% quantile is about 3.00, which is much closer to the bootstrapped value.
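A proportion of exactly zero is an artifact of the finite number of resamples. One common convention, shown here as a sketch and not as part of the original analysis, adds one to both the count and the number of resamples so that the estimated p-value is never zero. The helper ‘boot_p_value’ is hypothetical, and ‘f_star_demo’ is a simulated stand-in for ‘Fstar’.

```r
# Hypothetical helper (not from the original analysis): Monte Carlo p-value
# with the add-one correction, so the estimate is never exactly zero.
boot_p_value <- function(f_star, f_obs) {
  (sum(f_star >= f_obs) + 1) / (length(f_star) + 1)
}

set.seed(4)
# Stand-in for the bootstrapped F values: 1000 draws from an F distribution.
f_star_demo <- rf(1000, df1 = 2, df2 = 6256)
p_demo <- boot_p_value(f_star_demo, f_obs = 318.6)
print(p_demo)
```

With 1000 resamples and no simulated value reaching the observed F, this convention reports 1/1001, or roughly 0.001.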

Next we run ANOVA on the resampled data to test our assumption.

ANOVA with resampling model

simmodel=aov(simprice ~ simscreen, data=simdata)
anova(simmodel)

## Analysis of Variance Table
## 
## Response: simprice
##             Df   Sum Sq Mean Sq F value Pr(>F)
## simscreen    1 2.09e+05  209415    0.69   0.41
## Residuals 6257 1.91e+09  304532

The ANOVA on the resampled data has a p-value of 0.41. However, ‘simdata’ is the last bootstrap replicate from the loop above, and it was built from prices with each group's mean subtracted. The differences between screen sizes were removed before resampling, so a non-significant result is expected here and does not show that screen size is unrelated to price. The relevant comparison is the observed F statistic against the bootstrapped distribution: none of the 1000 bootstrapped values reached it, so we reject the null hypothesis, in agreement with the original ANOVA.
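The behavior of a single replicate built from group-centered values can be checked in isolation. The sketch below uses synthetic data with a genuine group effect (the vectors ‘g’ and ‘y’ and all numbers are made up) and compares the F statistic of the data as observed with the F statistic of one centered bootstrap replicate.

```r
set.seed(5)
# Synthetic data with a genuine group effect (all numbers are made up).
g <- factor(rep(c("14", "15", "17"), times = c(300, 200, 100)))
y <- rnorm(600, mean = c(2100, 2350, 2600)[as.integer(g)], sd = 500)

# F statistic on the data as observed.
f_obs <- oneway.test(y ~ g, var.equal = TRUE)$statistic

# One bootstrap replicate built from group-centered values: the group
# means are removed first, so the replicate follows the null by construction.
centered <- y - ave(y, g)
y_star   <- ave(centered, g, FUN = function(r) sample(r, replace = TRUE))
f_null   <- oneway.test(y_star ~ g, var.equal = TRUE)$statistic

print(c(observed = unname(f_obs), null_replicate = unname(f_null)))
```

The replicate's F statistic describes what chance alone produces, which is why it is used as a reference for the observed F and not as a test of the original data.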

A summary of, or pointer to, the raw data

http://cran.r-project.org/web/packages/Ecdat/index.html

Monday, November 03, 2014
