For this recipe, we will examine the Computers dataset from the Ecdat package. This dataset contains the price and specs of computers for sale between 1993 and 1995. We will create a subset to examine only the computers manufactured by “premium” firms.
Read in and subset data:
# A CRAN mirror must be set explicitly when the script is run non-interactively
install.packages("Ecdat", repos="https://cloud.r-project.org")
library("Ecdat")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
x<-Computers
data<-subset(x,x$premium=="yes")
For more on the dataset:
?Computers
## starting httpd help server ... done
The dataset contains four variables that can be treated as factors: ram, screen size (3 levels), and whether or not a CD-ROM (2 levels) or multimedia kit (2 levels) is present.
For this experiment, we will only be examining screen size (14", 15", or 17").
Set up screen size as factor:
data$screen=as.factor(data$screen)
The continuous variables in the dataset are price, speed, hard drive size, ads, and trend (month of sale).
Set up variables as numeric:
data$speed=as.numeric(data$speed)
data$hd=as.numeric(data$hd)
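One caution when moving between factors and numbers in R: calling as.numeric() directly on a factor returns the internal level codes, not the labels. That does not affect the two conversions above as long as speed and hd are still numeric at that point, but it matters if a column such as screen is ever converted back to a number. A minimal sketch on a made-up vector (the values are illustrative and not taken from the dataset):

```r
# Toy vector of screen sizes (hypothetical values for illustration)
screen <- c(14, 15, 17, 14, 17)
screen_f <- as.factor(screen)

# as.numeric() on a factor returns the integer level codes
codes <- as.numeric(screen_f)

# Converting via character recovers the original values
values <- as.numeric(as.character(screen_f))

print(levels(screen_f))
print(codes)
print(values)
```

Going through as.character() first is the usual way to recover the original values from a factor.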
Price will be the response variable for this experiment.
With only the “premium” computers under study, there are 5647 observations of 10 variables.
Structure, summary, and first/last observations of dataset:
str(data)
## 'data.frame': 5647 obs. of 10 variables:
## $ price : num 1499 1795 1595 3295 3695 ...
## $ speed : num 25 33 25 33 66 25 50 50 50 33 ...
## $ hd : num 80 85 170 340 340 170 85 210 210 170 ...
## $ ram : num 4 2 4 16 16 4 2 8 4 8 ...
## $ screen : Factor w/ 3 levels "14","15","17": 1 1 2 1 1 1 1 1 2 2 ...
## $ cd : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ multi : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ premium: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ ads : num 94 94 94 94 94 94 94 94 94 94 ...
## $ trend : num 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
## price speed hd ram screen
## Min. : 949 Min. : 25.0 Min. : 80 Min. : 2.00 14:3238
## 1st Qu.:1790 1st Qu.: 33.0 1st Qu.: 214 1st Qu.: 4.00 15:1879
## Median :2144 Median : 50.0 Median : 420 Median : 8.00 17: 530
## Mean :2204 Mean : 52.8 Mean : 433 Mean : 8.65
## 3rd Qu.:2590 3rd Qu.: 66.0 3rd Qu.: 528 3rd Qu.: 8.00
## Max. :5399 Max. :100.0 Max. :2100 Max. :32.00
## cd multi premium ads trend
## no :2823 no :4779 no : 0 Min. : 39 Min. : 1
## yes:2824 yes: 868 yes:5647 1st Qu.:162 1st Qu.: 9
## Median :246 Median :16
## Mean :218 Mean :16
## 3rd Qu.:275 3rd Qu.:22
## Max. :339 Max. :35
head(data)
## price speed hd ram screen cd multi premium ads trend
## 1 1499 25 80 4 14 no no yes 94 1
## 2 1795 33 85 2 14 no no yes 94 1
## 3 1595 25 170 4 15 no no yes 94 1
## 5 3295 33 340 16 14 no no yes 94 1
## 6 3695 66 340 16 14 no no yes 94 1
## 7 1720 25 170 4 14 yes no yes 94 1
tail(data)
## price speed hd ram screen cd multi premium ads trend
## 6254 2154 66 850 16 15 yes no yes 39 35
## 6255 1690 100 528 8 15 no no yes 39 35
## 6256 2223 66 850 16 15 yes yes yes 39 35
## 6257 2654 100 1200 24 15 yes no yes 39 35
## 6258 2195 100 850 16 15 yes no yes 39 35
## 6259 2490 100 850 16 17 yes no yes 39 35
An analysis of covariance will be performed. Computer speed, hard drive size, and screen size will be tested to see if they are explanatory variables for computer price.
This design was chosen to see the effects of speed, hard drive size, and screen size on the price of computers.
The Computers dataset is a collection of survey data on computers on the market between January 1993 and November 1995.
Each computer is only observed once.
Blocking is not used in this design.
Mean of computer price by speed, hd size and screen size:
tapply(data$price, data$speed, mean)
## 25 33 50 66 75 100
## 1842 2054 2216 2365 2313 2415
tapply(data$price, data$hd, mean)
## 80 85 107 120 125 128 130 170 210 212 213 214 230 240 245
## 1233 1654 1732 1741 1052 1195 1848 1839 1747 1816 2299 1886 2374 1978 2245
## 250 256 260 270 320 330 340 345 365 405 420 424 425 426 428
## 2319 2995 1399 1629 2716 3520 2078 3505 1391 2629 2134 2407 1986 2305 1739
## 450 452 470 500 520 525 527 528 530 540 545 720 730 810 850
## 3048 3128 2699 3756 2895 4999 2759 2363 2528 2229 1992 2524 2189 2495 2142
## 1000 1060 1080 1100 1200 1260 1370 1600 2100
## 2770 2885 2775 4494 2674 2545 3254 3054 3468
tapply(data$price, data$screen, mean)
## 14 15 17
## 2078 2323 2555
As we might predict, the highest mean prices tend to occur at the faster speeds. Similarly, higher prices appear to accompany larger hard drive sizes, although that trend is less regular. Computers with larger screens also have a higher mean price.
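The same group means can also be returned as a data frame, which is easier to sort or merge than the named vector that tapply() produces. A minimal sketch, assuming a small made-up data frame (the name toy and its prices are hypothetical, not values from the Computers data):

```r
# Toy data frame with made-up prices (illustrative only)
toy <- data.frame(
  price  = c(1500, 1700, 2100, 2300, 2500, 2700),
  screen = factor(c("14", "14", "15", "15", "17", "17"))
)

# Group means with tapply(), as in the text
means_tapply <- tapply(toy$price, toy$screen, mean)

# The same summary as a data frame, using the formula interface
means_df <- aggregate(price ~ screen, data = toy, FUN = mean)

print(means_tapply)
print(means_df)
```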
Histogram of prices:
hist(data$price, xlim=c(0,5000), ylim=c(0,2000))
The most common computer price seems to be between $1500 and $2000.
Boxplots:
boxplot(data$price~data$speed, xlab="Computer Speed (MHz)", ylab="Price (USD)")
boxplot(data$price~data$hd, xlab="Hard Drive Size (MB)", ylab="Price (USD)")
boxplot(data$price~data$screen, xlab="Screen Size (in)", ylab="Price (USD)")
The boxplots show similar results to examining the means. However, the prices associated with hard drive size don't appear to increase as linearly. Further examination will be required.
plot(data[,c(1,2,3,5)])
Null Hypothesis: The variation in the response cannot be explained by anything other than randomization. Alternative Hypothesis: The variation in response can be explained by something other than randomization.
model1=lm(data$price~data$speed+data$hd+data$screen)
summary(model1)
##
## Call:
## lm(formula = data$price ~ data$speed + data$hd + data$screen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1218.7 -362.6 -33.3 307.4 2411.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.63e+03 1.86e+01 87.40 < 2e-16 ***
## data$speed 3.22e+00 3.36e-01 9.59 < 2e-16 ***
## data$hd 8.13e-01 2.78e-02 29.22 < 2e-16 ***
## data$screen15 6.96e+01 1.53e+01 4.56 5.2e-06 ***
## data$screen17 3.19e+02 2.38e+01 13.44 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 497 on 5642 degrees of freedom
## Multiple R-squared: 0.241, Adjusted R-squared: 0.241
## F-statistic: 449 on 4 and 5642 DF, p-value: <2e-16
Since speed, hard drive size, and screen size each returned a significant p-value (less than .05), we reject the null hypothesis that the variation in price cannot be explained by anything other than randomization. It is likely that computer speed, hard drive size, and screen size can explain some of the variation in price.
Tukey's HSD test is then used to compare the factor levels pairwise. When the plotted confidence interval for a pair of levels crosses zero, or the adjusted p-value is greater than .05, we fail to reject the null hypothesis that the means of that pair are equal. When the interval does not cross zero, or the adjusted p-value is small, there is likely a difference in means between the two factor levels.
Pairwise differences for hard drive size are not compared because there are too many distinct values (it is a continuous variable).
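lm() also accepts a data argument, which gives shorter coefficient names than the data$ form and allows a single F-test per term through anova(). The sketch below illustrates this on simulated data; the data frame sim, the effect sizes, and the noise level are all made up for illustration and are not estimates from the Computers data. Note that anova() on an lm fit reports sequential sums of squares, so the order of terms matters.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for the computer data (all values are made up)
n <- 300
sim <- data.frame(
  speed  = sample(c(25, 33, 50, 66), n, replace = TRUE),
  hd     = round(runif(n, 80, 1000)),
  screen = factor(sample(c("14", "15", "17"), n, replace = TRUE))
)
screen_effect <- c("14" = 0, "15" = 100, "17" = 300)
sim$price <- 1500 + 3 * sim$speed + 0.8 * sim$hd +
  unname(screen_effect[as.character(sim$screen)]) + rnorm(n, sd = 200)

# ANCOVA: two continuous covariates plus one factor
fit <- lm(price ~ speed + hd + screen, data = sim)

# One F-test per term; screen is tested as a whole (2 df)
tab <- anova(fit)
print(tab)
```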
data$speed=as.factor(data$speed)
tukey1<-TukeyHSD(aov(data$price~data$speed))
tukey1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data$price ~ data$speed)
##
## $`data$speed`
## diff lwr upr p adj
## 33-25 212.68 129.59 295.78 0.0000
## 50-25 373.89 283.60 464.18 0.0000
## 66-25 523.69 440.82 606.56 0.0000
## 75-25 471.39 312.41 630.38 0.0000
## 100-25 573.05 470.72 675.38 0.0000
## 50-33 161.20 98.72 223.69 0.0000
## 66-33 311.00 259.82 362.19 0.0000
## 75-33 258.71 113.70 403.72 0.0000
## 100-33 360.37 281.48 439.26 0.0000
## 66-50 149.80 87.61 211.99 0.0000
## 75-50 97.51 -51.74 246.76 0.4257
## 100-50 199.16 112.73 285.60 0.0000
## 75-66 -52.29 -197.18 92.59 0.9083
## 100-66 49.36 -29.29 128.02 0.4728
## 100-75 101.66 -55.17 258.48 0.4349
plot(tukey1)
Most pairs of computer speeds have low adjusted p-values, so for those pairs we reject the null hypothesis that there is no difference in mean price. Judging by the p-values or the plot, only four pairs at the high end of the speed range show no significant difference (100-75, 100-66, 75-66, and 75-50).
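The pairs whose intervals cross zero can also be picked out programmatically rather than by reading the table. A minimal sketch on simulated data, assuming three hypothetical groups a, b, and c (made-up means, not the speed data):

```r
set.seed(2)  # reproducible simulated data

# Simulated prices for three hypothetical groups; "a" and "b" share a mean
grp   <- factor(rep(c("a", "b", "c"), each = 50))
means <- c(a = 2000, b = 2000, c = 2600)
price <- rnorm(150, mean = means[as.character(grp)], sd = 100)

tk <- TukeyHSD(aov(price ~ grp))

# Each element of the result is a matrix with columns diff, lwr, upr, "p adj"
res <- tk$grp

# An interval that contains zero corresponds to an adjusted p-value above .05
crosses_zero <- res[, "lwr"] < 0 & res[, "upr"] > 0
p_adj <- res[, "p adj"]

print(res)
print(names(crosses_zero)[crosses_zero])
```

At the default 95% family-wise level, the two criteria (interval crossing zero, adjusted p-value above .05) identify the same pairs.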
tukey2<-TukeyHSD(aov(data$price~data$screen))
tukey2
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data$price ~ data$screen)
##
## $`data$screen`
## diff lwr upr p adj
## 15-14 244.6 207.3 281.8 0
## 17-14 476.7 416.6 536.9 0
## 17-15 232.2 169.0 295.3 0
plot(tukey2)
We reject the null hypothesis of the Tukey test for all pairs of screen sizes: the mean prices of the different screen sizes are not the same.
Visually inspect normality of data:
qqnorm(residuals(model1))
qqline(residuals(model1))
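The residuals appear as though they may be normal.
Test normality with the Shapiro-Wilk test. NOTE: in R, this test can only be run on a sample of at most 5000 observations. Because of this, the model is refit on a smaller set of data, a random sample of 5000 rows from the data originally used. Since speed was converted to a factor above, this model treats speed as categorical, so its coefficients are not directly comparable to those of model1: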
small <- data[sample(1:nrow(data), 5000, replace=FALSE),]
modelsmall=lm(small$price~small$speed+small$hd+small$screen)
summary(modelsmall)
##
## Call:
## lm(formula = small$price ~ small$speed + small$hd + small$screen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1286.9 -354.1 -29.4 301.8 2348.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60e+03 2.63e+01 60.96 < 2e-16 ***
## small$speed33 1.13e+02 2.81e+01 4.01 6.2e-05 ***
## small$speed50 2.35e+02 3.07e+01 7.66 2.3e-14 ***
## small$speed66 3.07e+02 2.88e+01 10.67 < 2e-16 ***
## small$speed75 1.08e+02 5.57e+01 1.93 0.0536 .
## small$speed100 2.22e+02 3.61e+01 6.17 7.3e-10 ***
## small$hd 8.21e-01 2.95e-02 27.84 < 2e-16 ***
## small$screen15 5.18e+01 1.61e+01 3.23 0.0013 **
## small$screen17 3.09e+02 2.54e+01 12.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 491 on 4991 degrees of freedom
## Multiple R-squared: 0.255, Adjusted R-squared: 0.254
## F-statistic: 213 on 8 and 4991 DF, p-value: <2e-16
shapiro.test(residuals(modelsmall))
##
## Shapiro-Wilk normality test
##
## data: residuals(modelsmall)
## W = 0.9836, p-value < 2.2e-16
Null hypothesis: the residuals came from a normally distributed population. Since the p-value is far below .05, we reject the null and cannot assume the residuals are normal. This will be addressed in the contingencies section below.
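An alternative to refitting the model on a subsample is to keep the full model and test a random subset of its residuals, with the seed fixed so the draw can be repeated. The sketch below shows the idea on simulated data; the variables x and y and the sample size of 6000 are made up for illustration. With samples this large the Shapiro-Wilk test can flag even small departures from normality, so it is worth reading alongside the Q-Q plot.

```r
set.seed(3)  # fix the random draw so the result can be reproduced

# Simulated response with normal errors (made-up data, 6000 rows)
n <- 6000
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

# shapiro.test() accepts between 3 and 5000 values,
# so test a random subset of the full model's residuals
res_sub <- sample(residuals(fit), 5000)
sw <- shapiro.test(res_sub)
print(sw)
```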
plot(fitted(model1),residuals(model1))
The model appears to be a decent fit since the residuals are fairly symmetrical about zero. However, the residuals are clustered toward the left end of the dynamic range. This could be due to a lack of observations on the right end, or it could call the quality of the fit into question.
Since the residuals did not fulfill the normality assumption of the ANCOVA model (which could be why the fit of the model was questionable), a Kruskal-Wallis rank sum test (a nonparametric one-way analysis of variance by ranks) should be performed:
kruskal.test(data$price~data$speed)
##
## Kruskal-Wallis rank sum test
##
## data: data$price by data$speed
## Kruskal-Wallis chi-squared = 536.7, df = 5, p-value < 2.2e-16
kruskal.test(data$price~data$hd)
##
## Kruskal-Wallis rank sum test
##
## data: data$price by data$hd
## Kruskal-Wallis chi-squared = 2265, df = 53, p-value < 2.2e-16
kruskal.test(data$price~data$screen)
##
## Kruskal-Wallis rank sum test
##
## data: data$price by data$screen
## Kruskal-Wallis chi-squared = 464.8, df = 2, p-value < 2.2e-16
The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means). Since each test results in a low p-value, we reject this null hypothesis. It is likely that speed, hard drive size, and screen size can each explain some of the variation in computer prices.
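Note that kruskal.test() treats the grouping variable as categorical, which is why hard drive size, with its many distinct values, is tested on 53 degrees of freedom above. A significant overall result also does not say which groups differ. One possible follow-up, sketched here on simulated data (the data frame sim and its prices are made up, and the Holm adjustment is an assumption rather than a choice made in the original analysis), is a set of pairwise Wilcoxon rank sum tests:

```r
set.seed(4)  # reproducible simulated data

# Simulated prices for three hypothetical screen sizes (made-up values)
sim <- data.frame(
  screen = factor(rep(c("14", "15", "17"), each = 40)),
  price  = c(rnorm(40, 2000, 300), rnorm(40, 2300, 300), rnorm(40, 2600, 300))
)

# Overall test: do the price distributions differ across groups?
kw <- kruskal.test(price ~ screen, data = sim)
print(kw)

# Follow-up: pairwise Wilcoxon rank sum tests with Holm adjustment
pw <- pairwise.wilcox.test(sim$price, sim$screen, p.adjust.method = "holm")
print(pw$p.value)
```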
None used.
Data is from the Ecdat package.
All included above.