Response Surface Methods Practice

Hongyu Chen

RPI, RIN:661405156

Nov.28 V1.0

1. Setting

System under test

This analysis examines prices of personal computers in the US from 1993 to 1995. The data come from the ‘Computers’ dataset in the package ‘Ecdat’, which contains 6259 observations. The dataset includes several factors that may explain differences in computer price.

Below is a quick view of the data:

library(Ecdat)
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#First and last several lines of data set
head(Computers)
##   price speed  hd ram screen cd multi premium ads trend
## 1  1499    25  80   4     14 no    no     yes  94     1
## 2  1795    33  85   2     14 no    no     yes  94     1
## 3  1595    25 170   4     15 no    no     yes  94     1
## 4  1849    25 170   8     14 no    no      no  94     1
## 5  3295    33 340  16     14 no    no     yes  94     1
## 6  3695    66 340  16     14 no    no     yes  94     1
tail(Computers)
##      price speed   hd ram screen  cd multi premium ads trend
## 6254  2154    66  850  16     15 yes    no     yes  39    35
## 6255  1690   100  528   8     15  no    no     yes  39    35
## 6256  2223    66  850  16     15 yes   yes     yes  39    35
## 6257  2654   100 1200  24     15 yes    no     yes  39    35
## 6258  2195   100  850  16     15 yes    no     yes  39    35
## 6259  2490   100  850  16     17 yes    no     yes  39    35
#Summary and structure of data
summary(Computers)
##      price          speed           hd            ram       
##  Min.   : 949   Min.   : 25   Min.   :  80   Min.   : 2.00  
##  1st Qu.:1794   1st Qu.: 33   1st Qu.: 214   1st Qu.: 4.00  
##  Median :2144   Median : 50   Median : 340   Median : 8.00  
##  Mean   :2220   Mean   : 52   Mean   : 417   Mean   : 8.29  
##  3rd Qu.:2595   3rd Qu.: 66   3rd Qu.: 528   3rd Qu.: 8.00  
##  Max.   :5399   Max.   :100   Max.   :2100   Max.   :32.00  
##      screen       cd       multi      premium         ads     
##  Min.   :14.0   no :3351   no :5386   no : 612   Min.   : 39  
##  1st Qu.:14.0   yes:2908   yes: 873   yes:5647   1st Qu.:162  
##  Median :14.0                                    Median :246  
##  Mean   :14.6                                    Mean   :221  
##  3rd Qu.:15.0                                    3rd Qu.:275  
##  Max.   :17.0                                    Max.   :339  
##      trend     
##  Min.   : 1.0  
##  1st Qu.:10.0  
##  Median :16.0  
##  Mean   :15.9  
##  3rd Qu.:21.5  
##  Max.   :35.0
str(Computers)
## 'data.frame':    6259 obs. of  10 variables:
##  $ price  : num  1499 1795 1595 1849 3295 ...
##  $ speed  : num  25 33 25 25 33 66 25 50 50 50 ...
##  $ hd     : num  80 85 170 170 340 340 170 85 210 210 ...
##  $ ram    : num  4 2 4 8 16 16 4 2 8 4 ...
##  $ screen : num  14 14 15 14 14 14 14 14 14 15 ...
##  $ cd     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ multi  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ premium: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ ads    : num  94 94 94 94 94 94 94 94 94 94 ...
##  $ trend  : num  1 1 1 1 1 1 1 1 1 1 ...

Factors and Levels

In this study we focus on four factors that may influence computer price: clock speed in MHz (speed), hard drive size in MB (hd), RAM size in MB (ram) and screen size in inches (screen).

# Levels of factors
speed<-factor(Computers$speed)
hd<-factor(Computers$hd)
ram<-factor(Computers$ram)
screen<-factor(Computers$screen)
levels(speed)
## [1] "25"  "33"  "50"  "66"  "75"  "100"
levels(hd)
##  [1] "80"   "85"   "100"  "107"  "120"  "125"  "128"  "130"  "170"  "200" 
## [11] "210"  "212"  "213"  "214"  "230"  "240"  "245"  "250"  "256"  "260" 
## [21] "270"  "320"  "330"  "340"  "345"  "364"  "365"  "405"  "420"  "424" 
## [31] "425"  "426"  "428"  "450"  "452"  "470"  "500"  "520"  "525"  "527" 
## [41] "528"  "530"  "540"  "545"  "630"  "720"  "728"  "730"  "810"  "850" 
## [51] "1000" "1060" "1080" "1100" "1200" "1260" "1370" "1600" "2100"
levels(ram)
## [1] "2"  "4"  "8"  "16" "24" "32"
levels(screen)
## [1] "14" "15" "17"

Response variables

The response variable in this analysis is the price of a personal computer in dollars, stored in the column ‘price’ of the dataset.

The Data: How is it organized and what does it look like?

The data frame contains further columns that might influence computer price, for example whether a CD-ROM is present and whether the manufacturer is a premium firm. However, for this response surface analysis only the four factors above are of interest.

Randomization

It can be assumed that all data were randomly collected.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Null hypothesis H0: None of the four factors above explains the variance in computer price.

Alternative hypothesis H1: The variance in computer price can be explained by something other than randomization.

The analysis proceeds in four steps: select the columns under investigation; run exploratory data analysis for each factor; fit a linear model that includes second-order effects; and use ‘rsm’ to estimate parameters and draw response surface plots.

What is the rationale for this design?

In statistics, response surface methodology (RSM) explores the relationships between several explanatory variables and one or more response variables. Through estimation and optimization, it can identify the combination of factor levels that has the greatest influence on the response variable.
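
For reference, the second-order model that this kind of analysis typically fits can be written as below. The notation (x_i for the k factors, beta for the coefficients, epsilon for the error term) is the usual textbook convention and is assumed here rather than taken from this report. With k = 4 the model has 1 + 4 + 4 + 6 = 15 coefficients, which agrees with the 15 rows of the coefficient table printed by summary(Cmpt.rsm).

```latex
y = \beta_0
  + \sum_{i=1}^{k} \beta_i x_i
  + \sum_{i=1}^{k} \beta_{ii} x_i^2
  + \sum_{i<j} \beta_{ij} x_i x_j
  + \varepsilon
```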

Randomize: What is the Randomization Scheme?

Randomization depends on how the data were collected, which is assumed here to be completely random.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures in this test.

Block

No blocking is used in this analysis.

3. Analysis and testing

Exploratory data analysis

#Boxplot
boxplot(Computers$price~Computers$speed,xlab="Clock speed in MHz",ylab="Computer price")

plot of chunk unnamed-chunk-3

boxplot(Computers$price~Computers$hd,xlab="Size of hard drive in MB",ylab="Computer price")

plot of chunk unnamed-chunk-3

boxplot(Computers$price~Computers$ram,xlab="Size of Ram in MB",ylab="Computer price")

plot of chunk unnamed-chunk-3

boxplot(Computers$price~Computers$screen,xlab="Size of screen in inches",ylab="Computer price")

plot of chunk unnamed-chunk-3

From the plots above, all four factors appear to influence the price of a computer. Price increases with RAM size and with screen size. For clock speed, the median price differs among computers from 25 MHz to 66 MHz, while there is no obvious difference between 66 MHz and 100 MHz. There is no clear relationship between computer price and hard drive size.
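
A visual impression from boxplots can be backed with numbers by computing the median price per factor level. The sketch below is only an illustration: it uses the 12 rows printed by head() and tail() earlier, re-entered by hand so that it runs without the Ecdat package, and the names sample_rows and median_by_ram are made up for this example.

```r
# The 12 rows printed by head(Computers) and tail(Computers) above,
# re-entered by hand (assumption: illustration only, not the full data).
sample_rows <- data.frame(
  price = c(1499, 1795, 1595, 1849, 3295, 3695,
            2154, 1690, 2223, 2654, 2195, 2490),
  ram   = c(4, 2, 4, 8, 16, 16,
            16, 8, 16, 24, 16, 16)
)

# Median price for each RAM size; the median is the centre line
# of each box drawn by boxplot(price ~ ram).
median_by_ram <- tapply(sample_rows$price, sample_rows$ram, median)
print(median_by_ram)
```

On the full dataset the same idea would be tapply(Computers$price, Computers$ram, median), and likewise for speed, hd and screen.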

Response surface method using rsm:

#Construct linear model and run rsm
Cmpt.lm=lm(price~speed+hd+ram+screen,data=Computers)
anova(Cmpt.lm)
## Analysis of Variance Table
## 
## Response: price
##             Df   Sum Sq  Mean Sq F value Pr(>F)    
## speed        1 1.91e+08 1.91e+08    1046 <2e-16 ***
## hd           1 2.48e+08 2.48e+08    1358 <2e-16 ***
## ram          1 4.76e+08 4.76e+08    2603 <2e-16 ***
## screen       1 5.31e+07 5.31e+07     291 <2e-16 ***
## Residuals 6254 1.14e+09 1.83e+05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(rsm)
## Warning: package 'rsm' was built under R version 3.1.2
Cmpt.rsm=rsm(price~SO(speed,hd,ram,screen),data=Computers)
summary(Cmpt.rsm)
## 
## Call:
## rsm(formula = price ~ SO(speed, hd, ram, screen), data = Computers)
## 
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.34e+04   1.40e+03    9.53  < 2e-16 ***
## speed         1.04e+01   4.56e+00    2.28  0.02243 *  
## hd            3.71e+00   5.51e-01    6.72  1.9e-11 ***
## ram           1.07e+02   2.28e+01    4.70  2.7e-06 ***
## screen       -1.79e+03   1.84e+02   -9.72  < 2e-16 ***
## speed:hd     -1.41e-02   1.72e-03   -8.20  2.9e-16 ***
## speed:ram     2.57e-01   7.15e-02    3.59  0.00033 ***
## speed:screen  6.24e-01   3.13e-01    1.99  0.04641 *  
## hd:ram       -5.26e-02   7.97e-03   -6.60  4.4e-11 ***
## hd:screen    -2.89e-01   3.81e-02   -7.59  3.7e-14 ***
## ram:screen    5.50e-01   1.56e+00    0.35  0.72370    
## speed^2      -8.41e-02   1.14e-02   -7.41  1.5e-13 ***
## hd^2          1.11e-03   8.23e-05   13.52  < 2e-16 ***
## ram^2        -6.86e-01   3.15e-01   -2.18  0.02932 *  
## screen^2      6.44e+01   6.04e+00   10.66  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared:  0.522,  Adjusted R-squared:  0.521 
## F-statistic:  487 on 14 and 6244 DF,  p-value: <2e-16
## 
## Analysis of Variance Table
## 
## Response: price
##                               Df   Sum Sq  Mean Sq F value Pr(>F)
## FO(speed, hd, ram, screen)     4 9.68e+08 2.42e+08  1497.7 <2e-16
## TWI(speed, hd, ram, screen)    6 6.96e+07 1.16e+07    71.8 <2e-16
## PQ(speed, hd, ram, screen)     4 6.41e+07 1.60e+07    99.1 <2e-16
## Residuals                   6244 1.01e+09 1.62e+05               
## Lack of fit                  547 5.61e+08 1.03e+06    13.0 <2e-16
## Pure error                  5697 4.48e+08 7.87e+04               
## 
## Stationary point of response surface:
##   speed      hd     ram  screen 
##   41.85 1583.28   32.01   17.13 
## 
## Eigenanalysis:
## $values
## [1] 64.369987  0.003605 -0.061810 -0.713299
## 
## $vectors
##             [,1]      [,2]      [,3]      [,4]
## speed   0.004849 -0.165577  0.966058  0.198225
## hd     -0.002250  0.983905  0.175515 -0.033471
## ram     0.004236 -0.067114  0.189475 -0.979580
## screen  0.999977  0.003301 -0.005093  0.003113

From the anova and summary of Cmpt.rsm, all four first-order terms, all two-factor interactions except ram:screen, and all four pure quadratic terms are statistically significant at the 0.05 level and help explain the variance in computer price.

Therefore we can reject the null hypothesis and accept the alternative hypothesis that the variance in computer price can be explained by something other than randomization.

For the second-order model, the FO, TWI and PQ terms all return p-values below 2e-16, indicating statistical significance. The lack-of-fit test is also significant (F = 13.0, p < 2e-16), so the second-order model does not capture all systematic variation in price.
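
The lack-of-fit F statistic and the multiple R-squared in the rsm output can be reproduced by hand from the ANOVA table. The sketch below copies the sums of squares and degrees of freedom from that table; because the printed values are rounded to three significant figures, the results only approximate the reported 13.0 and 0.522. The variable names are made up for this example.

```r
# Sums of squares and degrees of freedom copied from the rsm ANOVA table above
ss_fo  <- 9.68e8                   # first-order terms
ss_twi <- 6.96e7                   # two-way interactions
ss_pq  <- 6.41e7                   # pure quadratic terms
ss_res <- 1.01e9                   # residuals
ss_lof <- 5.61e8; df_lof <- 547    # lack of fit
ss_pe  <- 4.48e8; df_pe  <- 5697   # pure error

# Lack-of-fit F: mean square for lack of fit over mean square for pure error
f_lof <- (ss_lof / df_lof) / (ss_pe / df_pe)
p_lof <- pf(f_lof, df_lof, df_pe, lower.tail = FALSE)

# Multiple R-squared: share of the total sum of squares explained by the model
r_squared <- 1 - ss_res / (ss_fo + ss_twi + ss_pq + ss_res)

print(round(c(F = f_lof, R2 = r_squared), 3))
print(p_lof)
```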

Stationary point of response surface: speed(41.85), hd(1583.28), ram(32.01) and screen(17.13).
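
As a reminder of where these coordinates come from, the fitted second-order surface can be written in matrix form, and the stationary point is where its gradient vanishes. The notation below is the usual textbook convention and is assumed here: b is the vector of first-order coefficients, and B is the symmetric matrix with the pure quadratic coefficients on the diagonal and half of each interaction coefficient off the diagonal.

```latex
\hat{y} = b_0 + x^{\top} b + x^{\top} B x,
\qquad
\nabla \hat{y} = b + 2 B x = 0
\;\Longrightarrow\;
x_s = -\tfrac{1}{2} B^{-1} b
```

The eigenvalues of B then describe the shape of the surface at x_s: all positive gives a minimum, all negative gives a maximum, and mixed signs give a saddle point.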

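Response surface characterization through eigenanalysis:

The eigenvalues are 64.37, 0.0036, -0.062 and -0.713. Judging by the eigenvectors, they correspond mainly to the screen, hd, speed and ram directions, respectively. Because the eigenvalues have mixed signs, the stationary point is a saddle point rather than a maximum or minimum.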
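
The eigenanalysis printed by summary(Cmpt.rsm) can be read mechanically: the signs of the eigenvalues classify the stationary point, and the largest absolute entry in each eigenvector column shows which factor dominates that direction. The sketch below copies the eigenvalues and eigenvectors from the output above; classify_stationary_point is a hypothetical helper written for this example, not a function from the rsm package, and its tolerance is an arbitrary choice.

```r
# Eigenvalues and eigenvectors copied from the summary(Cmpt.rsm) output above
ev <- c(64.369987, 0.003605, -0.061810, -0.713299)
vec <- matrix(c(
   0.004849, -0.165577,  0.966058,  0.198225,
  -0.002250,  0.983905,  0.175515, -0.033471,
   0.004236, -0.067114,  0.189475, -0.979580,
   0.999977,  0.003301, -0.005093,  0.003113),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("speed", "hd", "ram", "screen"), NULL))

# Hypothetical helper: classify a stationary point from the eigenvalue signs
classify_stationary_point <- function(values, tol = 1e-8) {
  if (all(values > tol)) {
    "minimum"
  } else if (all(values < -tol)) {
    "maximum"
  } else if (any(values > tol) && any(values < -tol)) {
    "saddle point"
  } else {
    "ridge (at least one eigenvalue near zero)"
  }
}

kind <- classify_stationary_point(ev)

# Factor with the largest absolute loading in each eigenvector (column)
dominant <- rownames(vec)[apply(abs(vec), 2, which.max)]

print(kind)
print(data.frame(eigenvalue = ev, dominant_factor = dominant))
```

With a fitted model in the session, the same inputs would be available as summary(Cmpt.rsm)$canonical$eigen.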

Contour plots of response surface

par(mfrow=c(2,3))
contour(Cmpt.rsm, ~speed+hd+ram+screen, image=TRUE, at=summary(Cmpt.rsm)$canonical$xs)

plot of chunk unnamed-chunk-5

The plots above show how each pair of factors affects computer price. In the ram:screen plot, only ram appears to play a role in determining price, which is consistent with the analysis of variance above, where the ram:screen interaction is not significant. In the speed:ram plot, however, speed appears to play a more significant role than ram.

3D plots are as below:

library(rgl)
## Warning: package 'rgl' was built under R version 3.1.2
par(mfrow=c(1,1))
persp(Cmpt.rsm, ~ speed+hd, image = TRUE,  
    at = c(summary(Cmpt.rsm)$canonical$xs),zlab="Computer price",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-6

persp(Cmpt.rsm, ~ speed+ram, image = TRUE,  
    at = c(summary(Cmpt.rsm)$canonical$xs),zlab="Computer price",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-6

persp(Cmpt.rsm, ~ speed+screen, image = TRUE,  
    at = c(summary(Cmpt.rsm)$canonical$xs),zlab="Computer price",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-6

persp(Cmpt.rsm, ~ hd+ram, image = TRUE,  
    at = c(summary(Cmpt.rsm)$canonical$xs),zlab="Computer price",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-6

persp(Cmpt.rsm, ~ hd+screen, image = TRUE,  
    at = c(summary(Cmpt.rsm)$canonical$xs),zlab="Computer price",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-6

persp(Cmpt.rsm, ~ ram+screen, image = TRUE,  
    at = c(summary(Cmpt.rsm)$canonical$xs),zlab="Computer price",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-6

For the hd:speed plot, the stationary point (1583.28, 41.85) appears to be a ridge point. For the ram:speed plot, the stationary point (32.01, 41.85) appears to be a ridge point. For the screen:speed plot, the stationary point (17.13, 41.85) appears to be a saddle point. For the ram:hd plot, the stationary point (32.01, 1583.28) appears to be a saddle point. For the screen:hd plot, the stationary point (17.13, 1583.28) appears to be a minimum. For the screen:ram plot, the stationary point (17.13, 32.01) might be a maximum or a ridge point.

Parameter estimation

Shapiro test for normality

#shapiro.test(Computers$price)

The Shapiro-Wilk test cannot be run on the full dataset because R's shapiro.test accepts at most 5000 observations, and the dataset has 6259.
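
One possible workaround, not used in this analysis, is to run the test on a random subsample of at most 5000 values. The sketch below is a minimal illustration under stated assumptions: shapiro_subsample is a hypothetical helper, and the input is simulated normal data (mean 2220 taken from the summary above, standard deviation 580 chosen arbitrarily) so that the example runs without the Ecdat package.

```r
# Hypothetical helper (not part of the original analysis): run the
# Shapiro-Wilk test on a random subsample when x is too long for shapiro.test().
shapiro_subsample <- function(x, max_n = 5000, seed = 1) {
  x <- x[!is.na(x)]
  if (length(x) > max_n) {
    set.seed(seed)              # make the subsample reproducible
    x <- sample(x, max_n)
  }
  shapiro.test(x)
}

# Simulated stand-in for the 6259 prices (assumption: illustration only)
set.seed(42)
fake_price <- rnorm(6259, mean = 2220, sd = 580)

result <- shapiro_subsample(fake_price)
print(result)
```

With the fitted model in the session, the call would be shapiro_subsample(residuals(Cmpt.rsm)). With several thousand points the test is sensitive to small departures from normality, so the Q-Q plot remains a useful companion.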

Diagnostics/Model adequacy checking for complete design

#Q-Q norm plot
qqnorm(residuals(Cmpt.rsm))
qqline(residuals(Cmpt.rsm))

plot of chunk unnamed-chunk-8

#Plot of fitted and residuals 
plot(fitted(Cmpt.rsm),residuals(Cmpt.rsm))

plot of chunk unnamed-chunk-8

The normal Q-Q plot is linear in the middle but departs from the line in both tails, which indicates that the residuals are not normally distributed in the tails.

In addition, the points are not evenly distributed on each side of zero in the residuals-versus-fitted plot, indicating that the model used above is not well fitted.

4. References to the literature

http://en.wikipedia.org/wiki/Response_surface_methodology

5. Appendices

A summary of, or pointer to, the raw data