HW7

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Analysis of Variance with Resampling Techniques

Hongyu Chen

RPI, RIN:661405156

1. Setting

System under test

Dataset in this analysis is from ‘Computers’ in package ‘Ecdat’. In this analysis, we test the effect of computer screen size on the price of computers.

A quick view of the dataset is as below.

library("Ecdat")

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

data<-Computers
#quick view of first and last several rows.
head(data)

##   price speed  hd ram screen cd multi premium ads trend
## 1  1499    25  80   4     14 no    no     yes  94     1
## 2  1795    33  85   2     14 no    no     yes  94     1
## 3  1595    25 170   4     15 no    no     yes  94     1
## 4  1849    25 170   8     14 no    no      no  94     1
## 5  3295    33 340  16     14 no    no     yes  94     1
## 6  3695    66 340  16     14 no    no     yes  94     1

tail(data)

##      price speed   hd ram screen  cd multi premium ads trend
## 6254  2154    66  850  16     15 yes    no     yes  39    35
## 6255  1690   100  528   8     15  no    no     yes  39    35
## 6256  2223    66  850  16     15 yes   yes     yes  39    35
## 6257  2654   100 1200  24     15 yes    no     yes  39    35
## 6258  2195   100  850  16     15 yes    no     yes  39    35
## 6259  2490   100  850  16     17 yes    no     yes  39    35

summary(data)

##      price          speed           hd            ram       
##  Min.   : 949   Min.   : 25   Min.   :  80   Min.   : 2.00  
##  1st Qu.:1794   1st Qu.: 33   1st Qu.: 214   1st Qu.: 4.00  
##  Median :2144   Median : 50   Median : 340   Median : 8.00  
##  Mean   :2220   Mean   : 52   Mean   : 417   Mean   : 8.29  
##  3rd Qu.:2595   3rd Qu.: 66   3rd Qu.: 528   3rd Qu.: 8.00  
##  Max.   :5399   Max.   :100   Max.   :2100   Max.   :32.00  
##      screen       cd       multi      premium         ads     
##  Min.   :14.0   no :3351   no :5386   no : 612   Min.   : 39  
##  1st Qu.:14.0   yes:2908   yes: 873   yes:5647   1st Qu.:162  
##  Median :14.0                                    Median :246  
##  Mean   :14.6                                    Mean   :221  
##  3rd Qu.:15.0                                    3rd Qu.:275  
##  Max.   :17.0                                    Max.   :339  
##      trend     
##  Min.   : 1.0  
##  1st Qu.:10.0  
##  Median :16.0  
##  Mean   :15.9  
##  3rd Qu.:21.5  
##  Max.   :35.0

Factors and Levels

One factor:size of computer screen Three levels: ‘14’ ‘15’ ‘17’

# factors and levels setting
screen<-factor(data$screen)
levels(screen)

## [1] "14" "15" "17"

summary(screen)

##   14   15   17 
## 3661 1992  606

Continuous variables (if any)

In data we analyze, response variable ‘price’ is continuous variable.

The Data: How is it organized and what does it look like?

Data was collected in the United States about computer price and related properties, including speed, size of hard drive, size of Ram, size of screen and so on. In this test we only investigate effect of ‘screen’ on ‘price’.

#structure of data
str(sample)

## 'data.frame':    38 obs. of  6 variables:
##  $ wfood : num  0.454 0.635 0.356 0.398 0.671 ...
##  $ totexp: num  39072 47992 23400 24176 46480 ...
##  $ age   : num  85 73 77 66 88 78 75 72 53 63 ...
##  $ size  : num  1 1 1 2 1 1 1 1 1 1 ...
##  $ town  : Factor w/ 5 levels "1","2","3","4",..: 2 3 1 3 4 1 3 4 3 1 ...
##  $ sex   : Factor w/ 2 levels "man","woman": 2 1 2 1 2 2 2 2 2 2 ...

Randomization

Data can be assumed as randomly collected.

Replication/repeated measures

There is no replication or repeated measures in raw data or this analysis.

Block

There is no blocks in this study

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This analysis is going to investigate how the variation of computer screen size influence computer price.eams. Use ANOVA to analyze variance of response variable and determine if screen size has major effect on price. After that we use resampling methods determine their effect of the outcome of the ANOVA test. Therefore, null hypothesis is: H0: the difference in computer price can only be explained by randomization. Alternative hypothesis: Ha: the difference in computer price can be explained by something other than randomization.

What is the rationale for this design?

This analysis is about resampling techniques. ANOVA is based on normal distribution, however in fact data might not be as normally distributed as we hope, therefore usage of resampling technique would improve accuracy of ANOVA.

3. (Statistical) Analysis

Exploratory Data Analysis

#Histogram of response variable
hist(data$price,xlab='price',ylab='number')

plot of chunk unnamed-chunk-4

#Boxplot of price and screen size
boxplot(price~screen,data=data)

plot of chunk unnamed-chunk-4

Plots above reveal that clearly computer price will increase with the increase of screen size.

Original Model Testing

Original model ANOVA

model=aov(price ~ screen, data=data)
anova(model)

## Analysis of Variance Table
## 
## Response: price
##             Df   Sum Sq  Mean Sq F value Pr(>F)    
## screen       1 1.85e+08 1.85e+08     601 <2e-16 ***
## Residuals 6257 1.93e+09 3.08e+05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA result has a P-value of 0, indicating that variance of computer price can be explained by variance in screen size. The Mean Sq=307818. Based on this result we can reject null hypothesis and accept alternative hypothesis that variance of computer price can be explained by screen size.

Original model adequacy checking

#Shapiro.test
#shapiro.test(data$price)
#Cannot do shapiro test since sample amount is over 5000

#Model normality checking
qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-6

#Residuals fitted plot
plot(fitted(model),residuals(model))

plot of chunk unnamed-chunk-6

Q-Q norm plot and Q-Q line of residuals exhibit linear pattern of residuals which means original model is valid. However residuals are not evenly distributed on each side of zero in residual and fitted plot, meaning the model is probably not perfectly fitted.

Resampling techniques

Monte Carlo simulation for theoretical F-distribution

with(data, tapply(price,screen,mean))

##   14   15   17 
## 2084 2350 2612

with(data, tapply(price,screen,var))

##     14     15     17 
## 295031 292515 419346

with(data, tapply(price,screen,length))

##   14   15   17 
## 3661 1992  606

summary(aov(price ~ screen,data=data))

##               Df   Sum Sq  Mean Sq F value Pr(>F)    
## screen         1 1.85e+08 1.85e+08     601 <2e-16 ***
## Residuals   6257 1.93e+09 3.08e+05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#mean of price
meanstar=mean(data$price)
#square root of residual error obtained above
sdstar=sqrt(307818)
simscreen=data$screen
#simulation run time
R=1000

Fstar = numeric(R)
for (i in 1:R) {
  groupA = rnorm(3661, mean=meanstar, sd=sdstar)
  groupB = rnorm(1992, mean=meanstar, sd=sdstar)
  groupC = rnorm(606, mean=meanstar, sd=sdstar)
  simprice = c(groupA,groupB,groupC)
  simdata = data.frame(simprice,simscreen)
  Fstar[i] = oneway.test(simprice~simscreen, var.equal=T, data=simdata)$statistic
}

hist(Fstar, prob=T, ylim=c(0,1),xlim=c(0,8), main='Theoretical F-distribution')

x=seq(.05,6,.25)
points(x,y=df(x,2,6256),type="b",col="red")

plot of chunk unnamed-chunk-7

Bootstrap simulation for F-distribution

library("Ecdat")
data<-Computers
data(data)

## Warning: data set 'data' not found

meanstar = with(data, tapply(price,screen,mean))
grpA = data$price[data$screen=="14"] - meanstar[1]
grpB = data$price[data$screen=="15"] - meanstar[2]
grpC = data$price[data$screen=="17"] - meanstar[3]

newsimscreen = data$screen
R = 1000
newFstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=3661, replace=T)
  groupB = sample(grpB, size=1992, replace=T)
  groupC = sample(grpC, size=606, replace=T)
  newsimprice = c(groupA,groupB,groupC)
  newsimdata = data.frame(simprice,simscreen)
  newFstar[i] = oneway.test(newsimprice~newsimscreen, var.equal=T, data=newsimdata)$statistic
}

hist(newFstar, prob=T, ylim=c(0,1),xlim=c(0,8), main='Bootstrap F-distribution')

x=seq(.05,6,.25)
points(x,y=df(x,2,6256),type="b",col="red")

plot of chunk unnamed-chunk-8

Bootstrap F distribution histogram is similar to theoretical F distribution histogram. Then we are going to use ANOVA based on resampling result.

Estimation of Alpha level

mean(Fstar>=newFstar)

## [1] 0.49

qf(.95,5,90)

## [1] 2.316

quantile(Fstar,.95)

##   95% 
## 2.784

ANOVA with new model

newmodel=aov(newsimprice ~ newsimscreen, data=newsimdata)
anova(newmodel)

## Analysis of Variance Table
## 
## Response: newsimprice
##                Df   Sum Sq Mean Sq F value Pr(>F)  
## newsimscreen    1 1.32e+06 1324471    4.28  0.039 *
## Residuals    6257 1.94e+09  309564                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA result of new model has a p-value of 0.18, which indicates it is highly possible that variance of computer price is only due to randomizaiton, and screen size is not a major factor to explain computer price. This is contrast to original ANOVA result, which means data collected might not be normally distributed. Based on this ANOVA result we cannot reject the null hypothesis.

New model adequacy checking

#Model normality checking
qqnorm(residuals(newmodel))
qqline(residuals(newmodel))

plot of chunk unnamed-chunk-11

Q-Q plot and Q-Q line of residuals exhibit linear pattern of residuals, which means new model is valid.

plot(fitted(newmodel),residuals(newmodel))

plot of chunk unnamed-chunk-12

Plot of fitted model and residuals model shows residuals are not well distributed on each side of 0, indicating the new model we use is not perfectly fitted.

A summary of, or pointer to, the raw data

http://cran.r-project.org/web/packages/Ecdat/index.html

HW7

Hongyu Chen

Monday, November 03, 2014

Analysis of Variance with Resampling Techniques

Hongyu Chen

RPI, RIN:661405156

1. Setting

System under test

Factors and Levels

Continuous variables (if any)

The Data: How is it organized and what does it look like?

Randomization

Replication/repeated measures

Block

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

What is the rationale for this design?

3. (Statistical) Analysis

Exploratory Data Analysis

Original Model Testing

Original model ANOVA

Original model adequacy checking

Resampling techniques

Monte Carlo simulation for theoretical F-distribution

Bootstrap simulation for F-distribution

Estimation of Alpha level

ANOVA with new model

New model adequacy checking

A summary of, or pointer to, the raw data