Recipie for Response Surface Methods

========================================================

Recipie for Response Surface Methods

Ali Svoobda

RPI

12/1/14

1. Setting

System under test

For this recipie, we will examine the Computers dataset from the Ecdat package. This dataset contains the price and specs on computers for sale between 1993 and 1995. We will create a subset to only examine the compters manufactured by “premium” firms"

Read in and subset data:

library("Ecdat", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")

## Warning: package 'Ecdat' was built under R version 3.1.2

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

x<-Computers

data<-subset(x,x$premium=="yes")

For more on the dataset:

?Computers

## starting httpd help server ... done

Factors and Levels

We will examine 4 factors in this experiment, computer speed, hard drive size(hd), size of Ram, and screen size. Each of the levels will be re-coded to be a value of 1-6 (this is for the response surface method model that is created later in this recipie)

speed (6 levels)- 25(1), 33(2), 50(3), 66(4), 75(5) or 100(6)
hd (3 levels)- 0-250 MB(1), 250-500 MB(2), 500-2100 MB(3)
ram (6 levels)- 2(1), 4(2), 8(3), 16(4), 24(5), 32(6)
screen (3 levels)- 14(1), 15(2) or 17(3) inch

Set up factors and levels for RSM code:

data$hd<-as.numeric(data$hd)
data$hd<-cut(data$hd, c(0,250,500,2100))
data$hd<-as.character(data$hd)

data$hd[data$hd == "(0,250]"]<-1
data$hd[data$hd == "(250,500]"]<-2
data$hd[data$hd == "(500,2.1e+03]"]<-3



data$speed<-as.character(data$speed)

data$speed[data$speed == "25"]<-1
data$speed[data$speed == "33"]<-2
data$speed[data$speed == "50"]<-3
data$speed[data$speed == "66"]<-4
data$speed[data$speed == "75"]<-5
data$speed[data$speed == "100"]<-6


data$ram<-as.character(data$ram)

data$ram[data$ram == "2"]<-1
data$ram[data$ram == "4"]<-2
data$ram[data$ram == "8"]<-3
data$ram[data$ram == "16"]<-4
data$ram[data$ram == "24"]<-5
data$ram[data$ram == "32"]<-6


data$screen<-as.character(data$screen)

data$screen[data$screen == "14"]<-1
data$screen[data$screen == "15"]<-2
data$screen[data$screen == "17"]<-3


attach(data)

Now set all factors as integers:

data$speed<-as.integer(data$speed)
data$hd<-as.integer(data$hd)
data$ram<-as.integer(data$ram)
data$screen<-as.integer(data$screen)

Continuous Variables

The only continous variable under study in this experiment is the price of the computers

Response Variables

The computer will also serve as the response variable.

The Data: How is it organized and what does it look like?

The computers dataset, with only premium computers, has 5647 observations of 10 variables, although only 5 variables are of interest.

Structure and first/last observations of dataset:

str(data)

## 'data.frame':    5647 obs. of  10 variables:
##  $ price  : num  1499 1795 1595 3295 3695 ...
##  $ speed  : int  1 2 1 2 4 1 3 3 3 2 ...
##  $ hd     : int  1 1 1 2 2 1 1 1 1 1 ...
##  $ ram    : int  2 1 2 4 4 2 1 3 2 3 ...
##  $ screen : int  1 1 2 1 1 1 1 1 2 2 ...
##  $ cd     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ multi  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ premium: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ads    : num  94 94 94 94 94 94 94 94 94 94 ...
##  $ trend  : num  1 1 1 1 1 1 1 1 1 1 ...

head(data)

##   price speed hd ram screen  cd multi premium ads trend
## 1  1499     1  1   2      1  no    no     yes  94     1
## 2  1795     2  1   1      1  no    no     yes  94     1
## 3  1595     1  1   2      2  no    no     yes  94     1
## 5  3295     2  2   4      1  no    no     yes  94     1
## 6  3695     4  2   4      1  no    no     yes  94     1
## 7  1720     1  1   2      1 yes    no     yes  94     1

tail(data)

##      price speed hd ram screen  cd multi premium ads trend
## 6254  2154     4  3   4      2 yes    no     yes  39    35
## 6255  1690     6  3   3      2  no    no     yes  39    35
## 6256  2223     4  3   4      2 yes   yes     yes  39    35
## 6257  2654     6  3   5      2 yes    no     yes  39    35
## 6258  2195     6  3   4      2 yes    no     yes  39    35
## 6259  2490     6  3   4      3 yes    no     yes  39    35

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In this recipie, we will use response surface methods to see which factors (speed, hd, ram, screen) if any, are responsable for the variation in the response, computer price.

What is the Rationale for this design?

The goal is to determine if any of these factors cause a change in the cost of the computer

Randomize: What is the Randomization Scheme?

The dataset is a collection of survey data on computers from 1993 to 1995 in the US.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures.

Block: Did you use blocking in the design?

Blocking is not used in this design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Mean number of affairts by the 4 factors:

tapply(data$price, data$speed, mean)

##    1    2    3    4    5    6 
## 1842 2054 2216 2365 2313 2415

tapply(data$price, data$hd, mean)

##    1    2    3 
## 1904 2274 2437

tapply(data$price, data$ram, mean)

##    1    2    3    4    5    6 
## 1703 1792 2254 2783 2936 3612

tapply(data$price, data$screen, mean)

##    1    2    3 
## 2078 2323 2555

As we may predict, the highest prices seem to occur with the faster speeds. Similarly, higher prices appear to accompany larger hard drive siZes and larger Ram sizes. Larger sized laptops also have a higher mean cost. The means will be further analyzed in the boxplots below.

Boxplots:

boxplot(data$price~data$speed, xlab="Computer Speed", ylab="Computer Price")

plot of chunk unnamed-chunk-7

boxplot(data$price~data$hd, xlab="Hard Drive Size", ylab="Computer Price")

plot of chunk unnamed-chunk-7

boxplot(data$price~data$ram, xlab="ram size", ylab="Computer Price")

plot of chunk unnamed-chunk-7

boxplot(data$price~data$screen, xlab="screen size", ylab="Computer Price")

plot of chunk unnamed-chunk-7

From the boxplots, the trends observed above are reinforced.

Histogram of Visits:

hist(data$price, breaks=20)

plot of chunk unnamed-chunk-8

The most frequent computer price is around 2000.

Testing

Response Surface Methods

The first step was to set up the factors into integer levels so the response surface methods function can be used. This was done in the factors and levels section above.

The “rsm” Package must be installed and read in:

library("rsm", lib.loc="~/R/win-library/3.1")

## Warning: package 'rsm' was built under R version 3.1.2

Create the second order response surface method model:

model1<- rsm(price~SO(speed, hd, ram, screen),data=data)
summary(model1)

## 
## Call:
## rsm(formula = price ~ SO(speed, hd, ram, screen), data = data)
## 
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    840.57      66.19   12.70  < 2e-16 ***
## speed          291.80      19.16   15.23  < 2e-16 ***
## hd             178.78      52.50    3.41  0.00067 ***
## ram            296.08      30.10    9.84  < 2e-16 ***
## screen        -409.10      52.06   -7.86  4.6e-15 ***
## speed:hd       -66.11       7.94   -8.32  < 2e-16 ***
## speed:ram        6.20       6.01    1.03  0.30200    
## speed:screen     9.99       6.70    1.49  0.13579    
## hd:ram          94.08      16.05    5.86  4.9e-09 ***
## hd:screen      -64.31      13.06   -4.92  8.7e-07 ***
## ram:screen     -37.09      11.19   -3.31  0.00093 ***
## speed^2        -15.30       2.68   -5.71  1.2e-08 ***
## hd^2           -80.79      14.42   -5.60  2.2e-08 ***
## ram^2            2.65       7.71    0.34  0.73115    
## screen^2       189.96      12.64   15.03  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared:  0.579,  Adjusted R-squared:  0.578 
## F-statistic:  554 on 14 and 5632 DF,  p-value: <2e-16
## 
## Analysis of Variance Table
## 
## Response: price
##                               Df   Sum Sq  Mean Sq F value Pr(>F)
## FO(speed, hd, ram, screen)     4 9.74e+08 2.43e+08  1772.8 <2e-16
## TWI(speed, hd, ram, screen)    6 5.17e+07 8.62e+06    62.8 <2e-16
## PQ(speed, hd, ram, screen)     4 3.88e+07 9.70e+06    70.6 <2e-16
## Residuals                   5632 7.73e+08 1.37e+05               
## Lack of fit                  133 1.34e+08 1.01e+06     8.7 <2e-16
## Pure error                  5499 6.39e+08 1.16e+05               
## 
## Stationary point of response surface:
##   speed      hd     ram  screen 
## 20.7074 -4.7009  4.7130  0.1967 
## 
## Eigenanalysis:
## $values
## [1]  197.165   21.261   -8.361 -113.543
## 
## $vectors
##           [,1]    [,2]     [,3]     [,4]
## speed   0.0430 -0.2863  0.90748 -0.30446
## hd     -0.1401  0.4229 -0.15574 -0.88162
## ram    -0.1267  0.8404  0.39000  0.35439
## screen  0.9811  0.1815 -0.01164 -0.06674

ANOVA Null Hypothesis: The variation in price cannot be explained by anything other than randomization. We must reject this null hypothesis for any main effect returning a significant alpha value (less than .05). It happens that all factors indivudual return significant values, suggesting that speed, hd, ram and screen size may all explain some of the variation in price.

The stationary points (found in the output) are also listed below: speed hd ram screen 20.7074332 -4.7009320 4.7129891 0.1967031

We can create contour plots for these points.

Second order response surfaces of the data:

par(mfrow=c(2,3))
contour(model1, ~speed + hd + ram + screen, image=TRUE, at=summary(model1$canoncial$xs))

plot of chunk unnamed-chunk-11

Diagnostics/Model Adequacy Checking

Original Model 1

Visually inspect normality of original data:

par(mfrow=c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))

plot of chunk unnamed-chunk-12

The data appears it may be normal. we would perform a shapiro-wilks normality test to confirm this assumption, however, the model is to large to be teseted. In case the data is infact not normal, alternate tests will be performed in the contingencies section below.

Fitted vs Residuals Plot:

plot(fitted(model1),residuals(model1))

plot of chunk unnamed-chunk-13

The data should be symetric over the zero and spread out over the dynamic range. We can assume the fit is good since both of these are true.

4. Contingencies

Since the data may not have fulfilled the normality assumption of the model, a Kruskal-Wallis non-parametric analysis of variance by Rank Sum Test should be performed:

kruskal.test(data$price~data$speed, data=data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$price by data$speed
## Kruskal-Wallis chi-squared = 536.7, df = 5, p-value < 2.2e-16

kruskal.test(data$price~data$hd, data=data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$price by data$hd
## Kruskal-Wallis chi-squared = 965.7, df = 2, p-value < 2.2e-16

kruskal.test(data$price~data$ram, data=data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$price by data$ram
## Kruskal-Wallis chi-squared = 3061, df = 5, p-value < 2.2e-16

kruskal.test(data$price~data$screen, data=data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$price by data$screen
## Kruskal-Wallis chi-squared = 464.8, df = 2, p-value < 2.2e-16

The null hypothesis of the kruskal test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means).

Since the test results have a low p-value for each of the factors, it is likely that each factor can explain some of the variation in price.

These results match and back up the results of the response surface method anova ran above.

5. References to the Literature

None used.

6. Appendicies

Link to raw data

Data is from the Ecdat Package

Complete R Code

All included above.

Recipie for Response Surface Methods

Ali Svoboda

December 1, 2014