Recipe for Blocked Designs with multiple explanatory and nuisance factors

Ali Svoboda

RPI

10/30/14 V.3

1. Setting

System under test

For this recipe, we will examine the Computers dataset from the Ecdat package. This dataset contains the price and specs of computers for sale between 1993 and 1995. We will create a subset to examine only the computers manufactured by “premium” firms.

Read in and subset data:

install.packages("Ecdat")

## Installing package into 'C:/Users/svoboa/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)

## Error: trying to use CRAN without setting a mirror
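The mirror error above typically appears when install.packages() is called in a non-interactive session (such as knitting) with no repository configured. A minimal sketch of one way around it, assuming the CRAN cloud mirror URL is acceptable for your setup; `ensure_package` is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: install a package only when it is missing, and name
# the CRAN mirror explicitly so non-interactive runs do not stop with an error.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # TRUE if the package can be loaded after the (possible) install
  requireNamespace(pkg, quietly = TRUE)
}

# Usage (commented out so sourcing this sketch has no side effects):
# ensure_package("Ecdat")
# library(Ecdat)
```

Checking with requireNamespace() first also avoids reinstalling the package every time the document is knitted.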

library("Ecdat")

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

x<-Computers

data<-subset(x,x$premium=="yes")

For more on the dataset:

?Computers

## starting httpd help server ... done

Factors and Levels

The dataset contains 4 factors (levels in parentheses): ram (6), screen size (3), and whether or not a CD-ROM (2) or multimedia kit (2) is present.

For this experiment, we will only be examining screen size (14", 15", or 17").

Set up screen size as factor:

data$screen=as.factor(data$screen)

Continuous Variables

The continuous variables in the dataset are price, speed, hard drive size, ads, and trend (month of sale).

Set up variables as numeric:

data$speed=as.numeric(data$speed)
data$hd=as.numeric(data$hd)
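One caution when moving between factor and numeric types, which matters here because speed is later converted to a factor for the Tukey test: calling as.numeric() directly on a factor returns the internal level codes, not the original values. A small sketch with a toy vector (the values are illustrative, not read from the dataset):

```r
# Toy vector standing in for a column such as speed.
speed <- c(25, 33, 66, 33, 100)

f <- as.factor(speed)                   # levels: "25" "33" "66" "100"
codes  <- as.numeric(f)                 # level codes, NOT the original values
values <- as.numeric(as.character(f))   # recovers the original numbers

print(codes)
print(values)
```

Going through as.character() first is the usual way to get the original numbers back from a factor.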

Response Variables

Price will be the response variable for this experiment.

The Data: How is it organized and what does it look like?

With only the “premium” computers under study, there are 5647 observations of 10 variables.

Structure, summary, and first/last observations of dataset:

str(data)

## 'data.frame':    5647 obs. of  10 variables:
##  $ price  : num  1499 1795 1595 3295 3695 ...
##  $ speed  : num  25 33 25 33 66 25 50 50 50 33 ...
##  $ hd     : num  80 85 170 340 340 170 85 210 210 170 ...
##  $ ram    : num  4 2 4 16 16 4 2 8 4 8 ...
##  $ screen : Factor w/ 3 levels "14","15","17": 1 1 2 1 1 1 1 1 2 2 ...
##  $ cd     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ multi  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ premium: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ads    : num  94 94 94 94 94 94 94 94 94 94 ...
##  $ trend  : num  1 1 1 1 1 1 1 1 1 1 ...

summary(data)

##      price          speed             hd            ram        screen   
##  Min.   : 949   Min.   : 25.0   Min.   :  80   Min.   : 2.00   14:3238  
##  1st Qu.:1790   1st Qu.: 33.0   1st Qu.: 214   1st Qu.: 4.00   15:1879  
##  Median :2144   Median : 50.0   Median : 420   Median : 8.00   17: 530  
##  Mean   :2204   Mean   : 52.8   Mean   : 433   Mean   : 8.65            
##  3rd Qu.:2590   3rd Qu.: 66.0   3rd Qu.: 528   3rd Qu.: 8.00            
##  Max.   :5399   Max.   :100.0   Max.   :2100   Max.   :32.00            
##    cd       multi      premium         ads          trend   
##  no :2823   no :4779   no :   0   Min.   : 39   Min.   : 1  
##  yes:2824   yes: 868   yes:5647   1st Qu.:162   1st Qu.: 9  
##                                   Median :246   Median :16  
##                                   Mean   :218   Mean   :16  
##                                   3rd Qu.:275   3rd Qu.:22  
##                                   Max.   :339   Max.   :35

head(data)

##   price speed  hd ram screen  cd multi premium ads trend
## 1  1499    25  80   4     14  no    no     yes  94     1
## 2  1795    33  85   2     14  no    no     yes  94     1
## 3  1595    25 170   4     15  no    no     yes  94     1
## 5  3295    33 340  16     14  no    no     yes  94     1
## 6  3695    66 340  16     14  no    no     yes  94     1
## 7  1720    25 170   4     14 yes    no     yes  94     1

tail(data)

##      price speed   hd ram screen  cd multi premium ads trend
## 6254  2154    66  850  16     15 yes    no     yes  39    35
## 6255  1690   100  528   8     15  no    no     yes  39    35
## 6256  2223    66  850  16     15 yes   yes     yes  39    35
## 6257  2654   100 1200  24     15 yes    no     yes  39    35
## 6258  2195   100  850  16     15 yes    no     yes  39    35
## 6259  2490   100  850  16     17 yes    no     yes  39    35

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

An analysis of covariance will be performed. Computer speed, hard drive size, and screen size will be tested to see if they are explanatory variables for computer price.

What is the Rationale for this design?

This design was chosen to see the effects of speed, hard drive size, and screen size on the price of computers.

Randomize: What is the Randomization Scheme?

The Computers dataset is a collection of survey data on computers on the market between January 1993 and November 1995. Because these are observational survey records, no randomization scheme was applied.

Replicate: Are there replicates and/or repeated measures?

Each computer is only observed once.

Block: Did you use blocking in the design?

Blocking is not used in this design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Mean of computer price by speed, hd size and screen size:

tapply(data$price, data$speed, mean)

##   25   33   50   66   75  100 
## 1842 2054 2216 2365 2313 2415

tapply(data$price, data$hd, mean)

##   80   85  107  120  125  128  130  170  210  212  213  214  230  240  245 
## 1233 1654 1732 1741 1052 1195 1848 1839 1747 1816 2299 1886 2374 1978 2245 
##  250  256  260  270  320  330  340  345  365  405  420  424  425  426  428 
## 2319 2995 1399 1629 2716 3520 2078 3505 1391 2629 2134 2407 1986 2305 1739 
##  450  452  470  500  520  525  527  528  530  540  545  720  730  810  850 
## 3048 3128 2699 3756 2895 4999 2759 2363 2528 2229 1992 2524 2189 2495 2142 
## 1000 1060 1080 1100 1200 1260 1370 1600 2100 
## 2770 2885 2775 4494 2674 2545 3254 3054 3468

tapply(data$price, data$screen, mean)

##   14   15   17 
## 2078 2323 2555

As we might predict, the highest prices seem to occur with the faster speeds. Similarly, higher prices appear to accompany larger hard drive sizes. Computers with larger screens also have a higher mean price.

Histogram of prices:

hist(data$price, xlim=c(0,5000), ylim=c(0,2000))

[Figure: histogram of computer prices]

The most common computer price seems to be between $1500 and $2000.

Boxplots:

boxplot(data$price~data$speed, xlab="Computer Speed (MHz)", ylab="Price (USD)")

[Figure: boxplots of price by computer speed]

boxplot(data$price~data$hd, xlab="Hard Drive Size (MB)", ylab="Price (USD)")

[Figure: boxplots of price by hard drive size]

boxplot(data$price~data$screen, xlab="Screen Size (in)", ylab="Price (USD)")

[Figure: boxplots of price by screen size]

The boxplots show results similar to the means. However, price does not appear to increase as linearly with hard drive size. Further examination will be required.

plot(data[,c(1,2,3,5)])

[Figure: scatterplot matrix of price, speed, hard drive size, and screen size]

Testing

ANCOVA Models

Null Hypothesis: The variation in the response cannot be explained by anything other than randomization.

Alternative Hypothesis: The variation in the response can be explained by something other than randomization.

ANCOVA: Linear model of price by speed, hard drive and screen size.

model1=lm(data$price~data$speed+data$hd+data$screen)
summary(model1)

## 
## Call:
## lm(formula = data$price ~ data$speed + data$hd + data$screen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1218.7  -362.6   -33.3   307.4  2411.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.63e+03   1.86e+01   87.40  < 2e-16 ***
## data$speed    3.22e+00   3.36e-01    9.59  < 2e-16 ***
## data$hd       8.13e-01   2.78e-02   29.22  < 2e-16 ***
## data$screen15 6.96e+01   1.53e+01    4.56  5.2e-06 ***
## data$screen17 3.19e+02   2.38e+01   13.44  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 497 on 5642 degrees of freedom
## Multiple R-squared:  0.241,  Adjusted R-squared:  0.241 
## F-statistic:  449 on 4 and 5642 DF,  p-value: <2e-16

Since speed, hard drive size, and screen size each returned a significant p-value (less than .05), we reject the null hypothesis that the variation in price cannot be explained by anything other than randomization. It is likely that computer speed, hard drive size, and screen size can explain some of the variation in price, although the model accounts for only about 24% of it (R-squared = 0.241).
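The same kind of model can also be written with the data argument of lm(), which keeps coefficient names short, and anova() then reports one F test per term rather than one t test per coefficient. A minimal sketch on simulated data; every number below is made up for illustration and is not taken from the Computers dataset:

```r
set.seed(1)  # simulated data, for illustration only
n <- 300
d <- data.frame(
  speed  = sample(c(25, 33, 50, 66), n, replace = TRUE),
  hd     = round(runif(n, 80, 1000)),
  screen = factor(sample(c("14", "15", "17"), n, replace = TRUE))
)
# Assumed data-generating model: price depends on all three predictors plus noise.
d$price <- 1500 + 3 * d$speed + 0.8 * d$hd +
  ifelse(d$screen == "17", 300, ifelse(d$screen == "15", 70, 0)) +
  rnorm(n, sd = 200)

fit <- lm(price ~ speed + hd + screen, data = d)
tab <- anova(fit)   # sequential F tests, one row per term plus Residuals
print(tab)

# p-values for the three model terms (the last row is Residuals)
p_terms <- tab[["Pr(>F)"]][1:3]
names(p_terms) <- rownames(tab)[1:3]
print(p_terms < 0.05)
```

Note that anova() on an lm fit gives sequential (Type I) tests, so with unbalanced data the result for each term can depend on the order of terms in the formula.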

Tukey Tests

When the line for a factor pair crosses zero or the adjusted p-value is greater than .05, we fail to reject the null hypothesis that the two levels in that pair have the same mean. When the plotted line does not cross zero or the adjusted p-value is small, there is likely a difference in means between the two factor levels.

Pairwise differences for hard drive size are not compared because it is a continuous variable with too many distinct values to form useful pairs.
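The interval rule described above can also be checked programmatically instead of by reading the plot. A short sketch on simulated data (group labels and prices are invented for illustration), which flags the pairs whose Tukey interval excludes zero:

```r
set.seed(2)  # simulated prices, for illustration only
g <- factor(rep(c("14", "15", "17"), each = 40))
price <- rnorm(120, mean = c(2000, 2000, 2600)[as.integer(g)], sd = 150)

tk  <- TukeyHSD(aov(price ~ g))
res <- as.data.frame(tk$g)      # columns: diff, lwr, upr, p adj

# A pair differs at the 5% family-wise level when its interval excludes zero,
# which is equivalent to an adjusted p-value below 0.05.
res$excludes_zero <- res$lwr > 0 | res$upr < 0
print(res)
```

Because the interval and the adjusted p-value come from the same studentized range distribution, the two criteria agree, so either can be used to read the Tukey output.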

Tukey Test for differences in computer speed:

data$speed=as.factor(data$speed)
tukey1<-TukeyHSD(aov(data$price~data$speed))
tukey1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data$price ~ data$speed)
## 
## $`data$speed`
##          diff     lwr    upr  p adj
## 33-25  212.68  129.59 295.78 0.0000
## 50-25  373.89  283.60 464.18 0.0000
## 66-25  523.69  440.82 606.56 0.0000
## 75-25  471.39  312.41 630.38 0.0000
## 100-25 573.05  470.72 675.38 0.0000
## 50-33  161.20   98.72 223.69 0.0000
## 66-33  311.00  259.82 362.19 0.0000
## 75-33  258.71  113.70 403.72 0.0000
## 100-33 360.37  281.48 439.26 0.0000
## 66-50  149.80   87.61 211.99 0.0000
## 75-50   97.51  -51.74 246.76 0.4257
## 100-50 199.16  112.73 285.60 0.0000
## 75-66  -52.29 -197.18  92.59 0.9083
## 100-66  49.36  -29.29 128.02 0.4728
## 100-75 101.66  -55.17 258.48 0.4349

plot(tukey1)

[Figure: Tukey HSD confidence intervals for pairwise differences in speed]

Most pairs of computer speeds have low p-values, so for those pairs we reject the null that there is no difference in mean price. Looking at the p-values or at the plot, only pairs at the high end of the speed range show no significant difference (100-75, 100-66, 75-66, and 75-50).

Tukey Test for differences in screen size:

tukey2<-TukeyHSD(aov(data$price~data$screen))
tukey2

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data$price ~ data$screen)
## 
## $`data$screen`
##        diff   lwr   upr p adj
## 15-14 244.6 207.3 281.8     0
## 17-14 476.7 416.6 536.9     0
## 17-15 232.2 169.0 295.3     0

plot(tukey2)

[Figure: Tukey HSD confidence intervals for pairwise differences in screen size]

We reject the null of the Tukey test for all pairs of screen sizes: the mean prices differ between every pair of screen sizes.

Diagnostics/Model Adequacy Checking

Visually inspect normality of data:

qqnorm(residuals(model1))
qqline(residuals(model1))

[Figure: normal Q-Q plot of the model1 residuals]

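The residuals appear to be roughly normal, although a formal test is needed.

Test normality with the Shapiro-Wilk test. NOTE: shapiro.test() accepts at most 5000 observations. Because of this, the model is refitted on a smaller set of data, obtained by taking a random sample of 5000 rows from the data originally used. Since speed was converted to a factor for the Tukey test above, this refitted model treats speed as a categorical rather than a continuous predictor, so it is not identical to model1: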

small <- data[sample(1:nrow(data), 5000, replace=FALSE),]
modelsmall=lm(small$price~small$speed+small$hd+small$screen)
summary(modelsmall)

## 
## Call:
## lm(formula = small$price ~ small$speed + small$hd + small$screen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1286.9  -354.1   -29.4   301.8  2348.0 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.60e+03   2.63e+01   60.96  < 2e-16 ***
## small$speed33  1.13e+02   2.81e+01    4.01  6.2e-05 ***
## small$speed50  2.35e+02   3.07e+01    7.66  2.3e-14 ***
## small$speed66  3.07e+02   2.88e+01   10.67  < 2e-16 ***
## small$speed75  1.08e+02   5.57e+01    1.93   0.0536 .  
## small$speed100 2.22e+02   3.61e+01    6.17  7.3e-10 ***
## small$hd       8.21e-01   2.95e-02   27.84  < 2e-16 ***
## small$screen15 5.18e+01   1.61e+01    3.23   0.0013 ** 
## small$screen17 3.09e+02   2.54e+01   12.19  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 491 on 4991 degrees of freedom
## Multiple R-squared:  0.255,  Adjusted R-squared:  0.254 
## F-statistic:  213 on 8 and 4991 DF,  p-value: <2e-16

shapiro.test(residuals(modelsmall))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(modelsmall)
## W = 0.9836, p-value < 2.2e-16

Null hypothesis: The residuals came from a normally distributed population. Since the p-value is far below .05, we reject the null: we cannot assume the residuals are normal. This will be addressed in the contingencies section below.
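An alternative to refitting the model on a subsample is to keep the full-data fit and test a random subsample of its residuals, with a fixed seed so the result is reproducible. A minimal sketch, using simulated heavy-tailed values as a stand-in for real residuals (the vector `res_full` and its distribution are assumptions made for illustration):

```r
set.seed(42)  # fixed seed so the subsample is reproducible
# Simulated stand-in for the residuals of a model fitted to more than 5000 rows.
res_full <- rt(6000, df = 5) * 400   # heavy-tailed, so not normal by construction

# shapiro.test() accepts between 3 and 5000 values, so draw at most 5000.
idx <- sample(length(res_full), size = min(5000, length(res_full)))
sw  <- shapiro.test(res_full[idx])
print(sw)
```

With samples this large the Shapiro-Wilk test detects even small departures from normality, so it is worth reading the result alongside the Q-Q plot rather than on its own.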

Fitted vs Residuals Plot

plot(fitted(model1),residuals(model1))

[Figure: residuals versus fitted values for model1]

The model appears to be a decent fit since the residuals are fairly symmetrical about zero. However, the residuals are clustered toward the left of the range of fitted values. This could be due to a lack of observations at the right end, or it could call the quality of the fit into question.

4. Contingencies

Since the data did not fulfill the normality assumption of the ANCOVA model (which could be why the fit of the model was questionable), a Kruskal-Wallis rank sum test (a nonparametric one-way analysis of variance) should be performed for each explanatory variable:

kruskal.test(data$price~data$speed)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$price by data$speed
## Kruskal-Wallis chi-squared = 536.7, df = 5, p-value < 2.2e-16

kruskal.test(data$price~data$hd)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$price by data$hd
## Kruskal-Wallis chi-squared = 2265, df = 53, p-value < 2.2e-16

kruskal.test(data$price~data$screen)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$price by data$screen
## Kruskal-Wallis chi-squared = 464.8, df = 2, p-value < 2.2e-16

The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means). Since each test results in a low p-value, we reject this null hypothesis. It is likely that the distribution of computer prices differs across levels of speed, hard drive size, and screen size, so these variables can explain some of the variation in price.
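A Kruskal-Wallis test only says that at least one group differs; it does not say which. One possible follow-up, sketched here on simulated data (group sizes, shifts, and the skewed noise are all invented for illustration), is a set of pairwise Wilcoxon rank-sum tests with a multiplicity adjustment:

```r
set.seed(3)  # simulated data, for illustration only
d <- data.frame(screen = factor(rep(c("14", "15", "17"), each = 50)))
# Right-skewed prices whose location shifts with screen size (assumed effect).
d$price <- 1500 + rexp(150, rate = 1 / 500) +
  c(0, 250, 500)[as.integer(d$screen)]

kw <- kruskal.test(price ~ screen, data = d)
print(kw)

# Follow-up: pairwise Wilcoxon rank-sum tests with Holm adjustment.
pw <- pairwise.wilcox.test(d$price, d$screen, p.adjust.method = "holm")
print(pw$p.value)
```

This plays the same role for the rank-based analysis that the Tukey test plays for the ANOVA, and the Holm adjustment is one reasonable choice among several.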

5. References to the Literature

None used.

6. Appendices

Link to raw data

Data is from the Ecdat package (Computers dataset).

Complete R Code

All included above.