========================================================
For this recipie, we will examine the Computers dataset from the Ecdat package. This dataset contains the price and specs on computers for sale between 1993 and 1995. We will create a subset to only examine the compters manufactured by “premium” firms"
Read in and subset data:
library("Ecdat", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
x<-Computers
data<-subset(x,x$premium=="yes")
For more on the dataset:
?Computers
## starting httpd help server ... done
We will examine 4 factors in this experiment, computer speed, hard drive size(hd), size of Ram, and screen size. Each of the levels will be re-coded to be a value of 1-6 (this is for the response surface method model that is created later in this recipie)
speed (6 levels)- 25(1), 33(2), 50(3), 66(4), 75(5) or 100(6)
hd (3 levels)- 0-250 MB(1), 250-500 MB(2), 500-2100 MB(3)
ram (6 levels)- 2(1), 4(2), 8(3), 16(4), 24(5), 32(6)
screen (3 levels)- 14(1), 15(2) or 17(3) inch
Set up factors and levels for RSM code:
data$hd<-as.numeric(data$hd)
data$hd<-cut(data$hd, c(0,250,500,2100))
data$hd<-as.character(data$hd)
data$hd[data$hd == "(0,250]"]<-1
data$hd[data$hd == "(250,500]"]<-2
data$hd[data$hd == "(500,2.1e+03]"]<-3
data$speed<-as.character(data$speed)
data$speed[data$speed == "25"]<-1
data$speed[data$speed == "33"]<-2
data$speed[data$speed == "50"]<-3
data$speed[data$speed == "66"]<-4
data$speed[data$speed == "75"]<-5
data$speed[data$speed == "100"]<-6
data$ram<-as.character(data$ram)
data$ram[data$ram == "2"]<-1
data$ram[data$ram == "4"]<-2
data$ram[data$ram == "8"]<-3
data$ram[data$ram == "16"]<-4
data$ram[data$ram == "24"]<-5
data$ram[data$ram == "32"]<-6
data$screen<-as.character(data$screen)
data$screen[data$screen == "14"]<-1
data$screen[data$screen == "15"]<-2
data$screen[data$screen == "17"]<-3
attach(data)
Now set all factors as integers:
data$speed<-as.integer(data$speed)
data$hd<-as.integer(data$hd)
data$ram<-as.integer(data$ram)
data$screen<-as.integer(data$screen)
The only continous variable under study in this experiment is the price of the computers
The computer will also serve as the response variable.
The computers dataset, with only premium computers, has 5647 observations of 10 variables, although only 5 variables are of interest.
Structure and first/last observations of dataset:
str(data)
## 'data.frame': 5647 obs. of 10 variables:
## $ price : num 1499 1795 1595 3295 3695 ...
## $ speed : int 1 2 1 2 4 1 3 3 3 2 ...
## $ hd : int 1 1 1 2 2 1 1 1 1 1 ...
## $ ram : int 2 1 2 4 4 2 1 3 2 3 ...
## $ screen : int 1 1 2 1 1 1 1 1 2 2 ...
## $ cd : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ multi : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ premium: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ ads : num 94 94 94 94 94 94 94 94 94 94 ...
## $ trend : num 1 1 1 1 1 1 1 1 1 1 ...
head(data)
## price speed hd ram screen cd multi premium ads trend
## 1 1499 1 1 2 1 no no yes 94 1
## 2 1795 2 1 1 1 no no yes 94 1
## 3 1595 1 1 2 2 no no yes 94 1
## 5 3295 2 2 4 1 no no yes 94 1
## 6 3695 4 2 4 1 no no yes 94 1
## 7 1720 1 1 2 1 yes no yes 94 1
tail(data)
## price speed hd ram screen cd multi premium ads trend
## 6254 2154 4 3 4 2 yes no yes 39 35
## 6255 1690 6 3 3 2 no no yes 39 35
## 6256 2223 4 3 4 2 yes yes yes 39 35
## 6257 2654 6 3 5 2 yes no yes 39 35
## 6258 2195 6 3 4 2 yes no yes 39 35
## 6259 2490 6 3 4 3 yes no yes 39 35
In this recipie, we will use response surface methods to see which factors (speed, hd, ram, screen) if any, are responsable for the variation in the response, computer price.
The goal is to determine if any of these factors cause a change in the cost of the computer
The dataset is a collection of survey data on computers from 1993 to 1995 in the US.
There are no replicates or repeated measures.
Blocking is not used in this design.
Mean number of affairts by the 4 factors:
tapply(data$price, data$speed, mean)
## 1 2 3 4 5 6
## 1842 2054 2216 2365 2313 2415
tapply(data$price, data$hd, mean)
## 1 2 3
## 1904 2274 2437
tapply(data$price, data$ram, mean)
## 1 2 3 4 5 6
## 1703 1792 2254 2783 2936 3612
tapply(data$price, data$screen, mean)
## 1 2 3
## 2078 2323 2555
As we may predict, the highest prices seem to occur with the faster speeds. Similarly, higher prices appear to accompany larger hard drive siZes and larger Ram sizes. Larger sized laptops also have a higher mean cost. The means will be further analyzed in the boxplots below.
Boxplots:
boxplot(data$price~data$speed, xlab="Computer Speed", ylab="Computer Price")
boxplot(data$price~data$hd, xlab="Hard Drive Size", ylab="Computer Price")
boxplot(data$price~data$ram, xlab="ram size", ylab="Computer Price")
boxplot(data$price~data$screen, xlab="screen size", ylab="Computer Price")
From the boxplots, the trends observed above are reinforced.
Histogram of Visits:
hist(data$price, breaks=20)
The most frequent computer price is around 2000.
The first step was to set up the factors into integer levels so the response surface methods function can be used. This was done in the factors and levels section above.
The “rsm” Package must be installed and read in:
library("rsm", lib.loc="~/R/win-library/3.1")
## Warning: package 'rsm' was built under R version 3.1.2
Create the second order response surface method model:
model1<- rsm(price~SO(speed, hd, ram, screen),data=data)
summary(model1)
##
## Call:
## rsm(formula = price ~ SO(speed, hd, ram, screen), data = data)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 840.57 66.19 12.70 < 2e-16 ***
## speed 291.80 19.16 15.23 < 2e-16 ***
## hd 178.78 52.50 3.41 0.00067 ***
## ram 296.08 30.10 9.84 < 2e-16 ***
## screen -409.10 52.06 -7.86 4.6e-15 ***
## speed:hd -66.11 7.94 -8.32 < 2e-16 ***
## speed:ram 6.20 6.01 1.03 0.30200
## speed:screen 9.99 6.70 1.49 0.13579
## hd:ram 94.08 16.05 5.86 4.9e-09 ***
## hd:screen -64.31 13.06 -4.92 8.7e-07 ***
## ram:screen -37.09 11.19 -3.31 0.00093 ***
## speed^2 -15.30 2.68 -5.71 1.2e-08 ***
## hd^2 -80.79 14.42 -5.60 2.2e-08 ***
## ram^2 2.65 7.71 0.34 0.73115
## screen^2 189.96 12.64 15.03 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.579, Adjusted R-squared: 0.578
## F-statistic: 554 on 14 and 5632 DF, p-value: <2e-16
##
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## FO(speed, hd, ram, screen) 4 9.74e+08 2.43e+08 1772.8 <2e-16
## TWI(speed, hd, ram, screen) 6 5.17e+07 8.62e+06 62.8 <2e-16
## PQ(speed, hd, ram, screen) 4 3.88e+07 9.70e+06 70.6 <2e-16
## Residuals 5632 7.73e+08 1.37e+05
## Lack of fit 133 1.34e+08 1.01e+06 8.7 <2e-16
## Pure error 5499 6.39e+08 1.16e+05
##
## Stationary point of response surface:
## speed hd ram screen
## 20.7074 -4.7009 4.7130 0.1967
##
## Eigenanalysis:
## $values
## [1] 197.165 21.261 -8.361 -113.543
##
## $vectors
## [,1] [,2] [,3] [,4]
## speed 0.0430 -0.2863 0.90748 -0.30446
## hd -0.1401 0.4229 -0.15574 -0.88162
## ram -0.1267 0.8404 0.39000 0.35439
## screen 0.9811 0.1815 -0.01164 -0.06674
ANOVA Null Hypothesis: The variation in price cannot be explained by anything other than randomization. We must reject this null hypothesis for any main effect returning a significant alpha value (less than .05). It happens that all factors indivudual return significant values, suggesting that speed, hd, ram and screen size may all explain some of the variation in price.
The stationary points (found in the output) are also listed below: speed hd ram screen 20.7074332 -4.7009320 4.7129891 0.1967031
We can create contour plots for these points.
Second order response surfaces of the data:
par(mfrow=c(2,3))
contour(model1, ~speed + hd + ram + screen, image=TRUE, at=summary(model1$canoncial$xs))
Visually inspect normality of original data:
par(mfrow=c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))
The data appears it may be normal. we would perform a shapiro-wilks normality test to confirm this assumption, however, the model is to large to be teseted. In case the data is infact not normal, alternate tests will be performed in the contingencies section below.
Fitted vs Residuals Plot:
plot(fitted(model1),residuals(model1))
The data should be symetric over the zero and spread out over the dynamic range. We can assume the fit is good since both of these are true.
Since the data may not have fulfilled the normality assumption of the model, a Kruskal-Wallis non-parametric analysis of variance by Rank Sum Test should be performed:
kruskal.test(data$price~data$speed, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: data$price by data$speed
## Kruskal-Wallis chi-squared = 536.7, df = 5, p-value < 2.2e-16
kruskal.test(data$price~data$hd, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: data$price by data$hd
## Kruskal-Wallis chi-squared = 965.7, df = 2, p-value < 2.2e-16
kruskal.test(data$price~data$ram, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: data$price by data$ram
## Kruskal-Wallis chi-squared = 3061, df = 5, p-value < 2.2e-16
kruskal.test(data$price~data$screen, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: data$price by data$screen
## Kruskal-Wallis chi-squared = 464.8, df = 2, p-value < 2.2e-16
The null hypothesis of the kruskal test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means).
Since the test results have a low p-value for each of the factors, it is likely that each factor can explain some of the variation in price.
These results match and back up the results of the response surface method anova ran above.
None used.
Data is from the Ecdat Package
All included above.