Yage Ding

Rensselaer Polytechnic Institute

October 5, 2016

1. Setting

System under test

The car.test.frame data set ships with the rpart package for the R programming language. It contains a collection of car parameters measured for 60 car types from around the world. The data set comes from the list of [100+ Interesting Data Sets for Statistics](http://rs.io/100-interesting-data-sets-for-statistics/?is_b_version=true&utm_expid=50231141-3.8QxdstXzRuupDFRQRzuMHA.1).

library(rpart)
car_data <- rpart::car.test.frame

Factors and Levels

There are 8 parameters in this data set:

colnames(car_data)
## [1] "Price"       "Country"     "Reliability" "Mileage"     "Type"       
## [6] "Weight"      "Disp."       "HP"

For this experiment, only 4 of the parameters were chosen as factors: Country, Mileage, Type, and Weight. Price is kept as the response.

car_data_new <- car_data[c("Country", "Mileage", "Type", "Weight", "Price")]

Viewing the structure of this data frame,

str(car_data_new)
## 'data.frame':    60 obs. of  5 variables:
##  $ Country: Factor w/ 8 levels "France","Germany",..: 8 8 5 4 3 6 4 5 3 3 ...
##  $ Mileage: int  33 33 37 32 32 26 33 28 25 34 ...
##  $ Type   : Factor w/ 6 levels "Compact","Large",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Weight : int  2560 2345 1845 2260 2440 2285 2275 2350 2295 1900 ...
##  $ Price  : int  8895 7402 6319 6635 6599 8672 7399 7254 9599 5866 ...

we see that the factors Country and Type have non-ordinal categorical levels:

levels(car_data_new$Country)
## [1] "France"    "Germany"   "Japan"     "Japan/USA" "Korea"     "Mexico"   
## [7] "Sweden"    "USA"
levels(car_data_new$Type)
## [1] "Compact" "Large"   "Medium"  "Small"   "Sporty"  "Van"

while the other 2 factors, Mileage and Weight, are ordered numeric variables stored as integers.

To reduce the number of levels, we bin these numeric observations by rounding Weight to the nearest 500 and Mileage to the nearest 5, and store the results in the data frame as factors:

library(plyr)
car_data_new$Weight <- as.factor(round_any(car_data_new$Weight, 500))
car_data_new$Mileage <- as.factor(round_any(car_data_new$Mileage, 5))

Now we have 4 categorical factors, and the levels for Mileage and Weight are:

levels(car_data_new$Mileage)
## [1] "20" "25" "30" "35"
levels(car_data_new$Weight)
## [1] "2000" "2500" "3000" "3500" "4000"
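As a rough illustration of what the rounding step does, the base-R sketch below mimics `round_any(x, accuracy)` by rounding to the nearest multiple of `accuracy`. The helper name `round_to` is hypothetical, and the input is just the first five Weight values from the `str()` output above.

```r
# Hypothetical helper: round x to the nearest multiple of `accuracy`
# (intended to mirror plyr::round_any with its default rounding function).
round_to <- function(x, accuracy) round(x / accuracy) * accuracy

# First five Weight values shown in the str() output above
weights <- c(2560, 2345, 1845, 2260, 2440)
binned <- round_to(weights, 500)
print(binned)

# Storing the bins as a factor gives one level per distinct bin
weight_factor <- factor(binned)
print(levels(weight_factor))
```

Only the bins that actually occur become levels, which is why Weight ends up with 5 levels and Mileage with 4 in the full data.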

Response Variable

The response in this experiment is the Price of the car.

The Data

str(car_data_new)
## 'data.frame':    60 obs. of  5 variables:
##  $ Country: Factor w/ 8 levels "France","Germany",..: 8 8 5 4 3 6 4 5 3 3 ...
##  $ Mileage: Factor w/ 4 levels "20","25","30",..: 4 4 4 3 3 2 4 3 2 4 ...
##  $ Type   : Factor w/ 6 levels "Compact","Large",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Weight : Factor w/ 5 levels "2000","2500",..: 2 2 1 2 2 2 2 2 2 1 ...
##  $ Price  : int  8895 7402 6319 6635 6599 8672 7399 7254 9599 5866 ...

2. Design

This experiment comprises 4 factors (Country, Mileage, Type, and Weight) and 1 response (Price). It is designed to reveal how changes in the factors influence the response; in other words, the objective is to assess the sensitivity of car Price to variations in country, mileage, type of vehicle, and car weight.

The null hypothesis is that Country, Mileage, Type, and Weight do not affect the Price of the car. The experiment is designed to test, and if possible reject, the null hypothesis.

Among the 3 experimental strategies (best guess, one-factor-at-a-time, and factorial design), we decided to use factorial design. First, the best-guess approach only gives observations on experimental units under a single condition, and a one-factor-at-a-time (OFAT) design may miss the optimal setting of the factors and cannot estimate interactions, whereas a factorial design lets us draw more accurate and comprehensive conclusions from the observations. Second, since our factors in this experiment are all categorical variables, a factorial design is feasible.

A full factorial design involves all possible combinations of factor levels. Since we have 8 levels for Country, 6 levels for Type, 4 levels for Mileage, and 5 levels for Weight, there are 8 × 6 × 4 × 5 = 960 possible combinations.

## Enumerate all possible combinations of factor levels (one row per combination)
comb <- expand.grid(Country = levels(car_data_new$Country), Type = levels(car_data_new$Type), Weight = levels(car_data_new$Weight), Mileage = levels(car_data_new$Mileage))
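The count of 960 can be checked independently of the data. The sketch below is a minimal cross-check that uses only the level counts quoted in the text; the names `n_levels`, `n_cells`, and `grid` are illustrative.

```r
# Level counts taken from the text: Country, Type, Mileage, Weight
n_levels <- c(Country = 8, Type = 6, Mileage = 4, Weight = 5)

# A full factorial has one cell per combination of levels
n_cells <- prod(n_levels)

# Cross-check by building the grid from integer level codes
grid <- expand.grid(lapply(n_levels, seq_len))
stopifnot(nrow(grid) == n_cells)
print(n_cells)
```

With only 60 observed car types, most of these cells are necessarily empty, which matters for the interaction analysis later on.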

There are 3 basic principles in experimental design that ensure the validity of the data and the robustness of the conclusions we draw from it: randomization, replication, and blocking. Randomization involves random selection, random assignment, and random run order. In this experiment, our experimental units are different car types. We therefore need to select car types at random from the entire population of car types, and avoid intentionally selecting, for instance, car types that weigh more or come from a particular continent. The conclusions we draw can then be applied to a more general population, which means that the model we are using here is a random effect model. Since cars come with their own features (factor levels), we cannot assign treatments to them, so random assignment cannot be achieved. However, we can achieve a random run order by generating a random sequence of all possible combinations and running our experiment according to that sequence.

## Print the combinations in a random run order
for (i in sample(nrow(comb))) {
  print(comb[i, ])
}
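A random run order is easier to audit when it is reproducible. The sketch below is one possible way to do that on a small stand-in grid; the seed value and the names `small_grid`, `run_order`, and `randomized` are arbitrary choices for illustration, not part of the original analysis.

```r
# Small stand-in grid (2 x 3 levels) so the output stays readable;
# the level names here are illustrative only
small_grid <- expand.grid(Type = c("Small", "Van"),
                          Mileage = c("20", "25", "30"))

set.seed(2016)  # arbitrary seed, chosen only to make the order repeatable
run_order <- sample(nrow(small_grid))

# Reorder the rows of the grid into the randomized run sequence
randomized <- small_grid[run_order, ]
print(randomized)
```

Indexing the data frame once with the permuted row numbers avoids the explicit loop and keeps the run order available for later reference.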

In this data set there are no replications: each car type is measured only once. There are also nuisance factors that are not of interest to us, such as the aesthetic score of the car. In this data set these nuisance factors are not blocked; in other words, they are not kept consistent among experimental units, although in an ideal experiment they should be.

3. Statistical Analysis

Main Effect

The main effects of all four factors are shown as boxplots. There is no clear trend for Country and Type, whose levels have no natural order. However, for quantitative factors such as Mileage and Weight, the boxplots suggest that the higher the Mileage, the lower the Price, and the larger the Weight, the higher the Price.

According to the boxplots, there is no linear effect of categorical factors such as Country and Type on the response Price, because these variables are not ordinal. Therefore, the main effects of these factors are not very meaningful. Here we show the mean Price at each level of these factors, and the main effects of the ordinal factors (Mileage and Weight) as differences in mean Price between pairs of levels.

Country

France: 15930

Germany: 14447.5

Japan: 13938.053

Japan/USA: 10067.571

Korea: 7857.3333333

Mexico: 8672

Sweden: 18450

USA: 12543.269

Mileage (difference in mean Price, first level minus second)

20 vs. 25: 2611.5326087

20 vs. 30: 6229.9492754

20 vs. 35: 7795.7826087

25 vs. 30: 3618.4166667

25 vs. 35: 5184.25

30 vs. 35: 1565.8333333

Type

Compact: 12853.133

Large: 15975.667

Medium: 16201

Small: 7682.3846154

Sporty: 11717.111

Van: 14325.429

Weight (difference in mean Price, first level minus second)

2000 vs. 2500: -3315.1973684

2000 vs. 3000: -6053.0681818

2000 vs. 3500: -9369.0961538

2000 vs. 4000: -8870.25

2500 vs. 3000: -2737.8708134

2500 vs. 3500: -6053.8987854

2500 vs. 4000: -5555.0526316

3000 vs. 3500: -3316.027972

3000 vs. 4000: -2817.1818182

3500 vs. 4000: 498.8461538
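The level means and pairwise differences listed above can be computed along the following lines. This is a sketch rather than the original code: it assumes the rpart package is installed, uses a base-R stand-in for `round_any`, and the names `car`, `country_means`, `mileage_means`, and `mileage_diffs` are illustrative.

```r
# Assumes the rpart package (which provides car.test.frame) is installed
car <- rpart::car.test.frame

# Base-R stand-in for plyr::round_any(Mileage, 5)
car$Mileage <- factor(round(car$Mileage / 5) * 5)

# Mean Price at each level of a factor
country_means <- tapply(car$Price, car$Country, mean)
mileage_means <- tapply(car$Price, car$Mileage, mean)

# Pairwise differences: entry [i, j] is mean(level i) - mean(level j)
mileage_diffs <- outer(mileage_means, mileage_means, "-")

print(round(country_means, 1))
print(round(mileage_diffs, 1))
```

The matrix form shows every pair at once; the values reported above correspond to the entries above the diagonal.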

Interaction Effect

The interaction effect between 2 factors shows whether the effect of one factor on the response variable depends on the level of the other factor. Therefore, when we compute the interaction effects, we hold the first factor at a certain level, vary the level of the other factor, and repeat this procedure for the other levels of the first factor. The interaction effect is the difference in the mean of observations at different levels of the first factor. With 4 independent factors, we get 6 two-factor combinations. The interaction effects of all two-factor combinations are shown in the following graphs. The plots suggest an interaction between the 2 factors in all of these combinations, because the lines in each plot are not parallel. Moreover, the further the lines are from parallel, the stronger the interaction.
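To make the "non-parallel lines" idea concrete, the sketch below works through a 2 x 2 table of cell means. The numbers are made up for illustration and are not taken from the car data; the factor names A and B and the variable names are hypothetical.

```r
# Toy cell means (made-up numbers, not from the car data):
# rows = levels of factor A, columns = levels of factor B
cell_means <- matrix(c(10, 14,
                       12, 22),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(A = c("a1", "a2"), B = c("b1", "b2")))

# Effect of B at each level of A
effect_of_B <- cell_means[, "b2"] - cell_means[, "b1"]
print(effect_of_B)

# Interaction contrast: how much the effect of B changes between levels of A.
# Zero would correspond to parallel lines in an interaction plot.
interaction_contrast <- effect_of_B[["a2"]] - effect_of_B[["a1"]]
print(interaction_contrast)
```

Here the effect of B is 4 at level a1 and 10 at level a2, so the contrast is 6 and the two lines in an interaction plot would diverge.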

Analysis of Variance

After computing the main effects and interaction effects, we perform an analysis of variance (ANOVA) to determine whether these effects are significant. If interaction effects are significant, then we cannot draw conclusions about the relationship between a factor and the response without considering them. Since there are both fixed and random effects in this experiment, the corresponding ANOVA tests are performed:

##Fixed Effect ANOVA
aov(Y ~ A, data=d)
##Random Effect ANOVA
A <- aov(Y ~ Error(A), data=d)
summary(A)

ANOVA on Main Effect:

Country Fixed effect

##             Df    Sum Sq  Mean Sq F value Pr(>F)  
## Country      7 214024382 30574912   2.066  0.064 .
## Residuals   52 769527115 14798598                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Mileage Random effect

##             Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Mileage      3 423484060 141161353   14.11 5.77e-07 ***
## Residuals   56 560067437  10001204                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type Fixed effect

##             Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Type         5 545938805 109187761   13.47 1.56e-08 ***
## Residuals   54 437612692   8103939                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Weight Random effect

##             Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Weight       4 435208731 108802183   10.91 1.39e-06 ***
## Residuals   55 548342767   9969868                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA on 2-Way Interaction Effect:

Since the 2 factors involved in the 2-way interaction effect can be of different effect models, some 2-way interactions are mixed effect models, and the corresponding ANOVA test is performed.

library(lme4)
library(lmerTest)
AB <- lmer(Y ~ B + (1 | A), data=d)
anova(AB)

Country: Mileage This is a mixed effect model

## fixed-effect model matrix is rank deficient so dropping 14 columns / coefficients
## fixed-effect model matrix is rank deficient so dropping 14 columns / coefficients
## fixed-effect model matrix is rank deficient so dropping 14 columns / coefficients
## Error in calculation of the Satterthwaite's approximation. The output of lme4 package is returned
## anova from lme4 is returned
## some computational error has occurred in lmerTest
## Analysis of Variance Table
##                 Df    Sum Sq  Mean Sq F value
## Country          7 138423397 19774771  2.1151
## Mileage          3  29684772  9894924  1.0584
## Country:Mileage  7  30783830  4397690  0.4704

Country: Type This is a fixed effect model

##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Country       7 214024382  30574912  11.407 7.50e-08 ***
## Type          5 528026061 105605212  39.399 1.99e-14 ***
## Country:Type  7 134286165  19183738   7.157 1.49e-05 ***
## Residuals    40 107214889   2680372                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Country: Weight mixed effect model

## fixed-effect model matrix is rank deficient so dropping 23 columns / coefficients
## fixed-effect model matrix is rank deficient so dropping 23 columns / coefficients
## fixed-effect model matrix is rank deficient so dropping 23 columns / coefficients
## Error in calculation of the Satterthwaite's approximation. The output of lme4 package is returned
## anova from lme4 is returned
## some computational error has occurred in lmerTest
## Analysis of Variance Table
##                Df    Sum Sq  Mean Sq F value
## Country         7 203387636 29055377  4.1474
## Weight          4  56146585 14036646  2.0036
## Country:Weight  5  32716838  6543368  0.9340

Mileage: Type mixed effect model

## fixed-effect model matrix is rank deficient so dropping 10 columns / coefficients
## fixed-effect model matrix is rank deficient so dropping 10 columns / coefficients
## fixed-effect model matrix is rank deficient so dropping 10 columns / coefficients
## Error in calculation of the Satterthwaite's approximation. The output of lme4 package is returned
## anova from lme4 is returned
## some computational error has occurred in lmerTest
## Analysis of Variance Table
##              Df    Sum Sq  Mean Sq F value
## Type          5 178943159 35788632  4.0247
## Mileage       3   4043200  1347733  0.1516
## Type:Mileage  5  11876859  2375372  0.2671

Mileage: Weight random effect model

##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Mileage         3 423484060 141161353  15.745 2.25e-07 ***
## Weight          4  77698788  19424697   2.167    0.086 .  
## Mileage:Weight  1  25132453  25132453   2.803    0.100    
## Residuals      51 457236196   8965416                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type: Weight mixed effect model

## fixed-effect model matrix is rank deficient so dropping 16 columns / coefficients
## fixed-effect model matrix is rank deficient so dropping 16 columns / coefficients
## fixed-effect model matrix is rank deficient so dropping 16 columns / coefficients
## Error in calculation of the Satterthwaite's approximation. The output of lme4 package is returned
## anova from lme4 is returned
## some computational error has occurred in lmerTest
## Analysis of Variance Table
##             Df    Sum Sq  Mean Sq F value
## Type         5 245119800 49023960  7.7530
## Weight       4  13198935  3299734  0.5218
## Type:Weight  4  60118414 15029603  2.3769

As the ANOVA results show, the main effects of all four factors except Country are significant (p-values < 0.05), indicating that these factors have significant individual effects on the Price of a car. In the ANOVA tests for 2-way interactions, the code did not provide p-values for the interaction effects in the mixed models. However, the F value for the Type:Weight interaction is larger than 2, which suggests that it may be significant. According to its p-value, the Country:Type interaction effect is also significant.
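The "rank deficient" messages in the mixed-model output are consistent with many factor-level combinations having no observations. The sketch below is one way to check this, assuming the rpart package is installed and using a base-R stand-in for `round_any`; the names `car`, `pairs`, and `cell_summary` are illustrative.

```r
# Assumes the rpart package (which provides car.test.frame) is installed
car <- rpart::car.test.frame
car$Mileage <- factor(round(car$Mileage / 5) * 5)
car$Weight  <- factor(round(car$Weight / 500) * 500)

factors <- c("Country", "Mileage", "Type", "Weight")
pairs <- combn(factors, 2, simplify = FALSE)

# For each pair: total cells in the two-way layout vs. cells with >= 1 car
cell_summary <- do.call(rbind, lapply(pairs, function(p) {
  counts <- table(car[[p[1]]], car[[p[2]]])
  data.frame(pair = paste(p, collapse = ":"),
             cells = length(counts),
             observed = sum(counts > 0))
}))
print(cell_summary)
```

When `observed` is smaller than `cells`, some interaction coefficients cannot be estimated, so the interaction tests above should be read with caution.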

4. References to Literature

  1. Seaton, R. (2014). 100 Interesting Data Sets for Statistics - rs.io. Retrieved October 05, 2016, from http://rs.io/100-interesting-data-sets-for-statistics/?is_b_version=true
  2. Interpret the key results for Interaction Plot. Retrieved October 05, 2016, from http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/anova/how-to/interaction-plot/interpret-the-results/
  3. Montgomery, D. C. (2001). Design and analysis of experiments. New York: John Wiley.

5. Appendices

Complete and Documented R Code

##Data in R
##car.test.frame ships with the rpart package
library(rpart)

##A brief view of the data
head(rpart::car.test.frame)
str(rpart::car.test.frame)

car_data <- rpart::car.test.frame

##Package 'plyr' provides round_any(), which rounds a number
##to the nearest multiple of a given value (e.g. 5 or 500)
install.packages('plyr')
library(plyr)

##Change continuous variable to categorical variable by rounding numbers
##Store the final values as factors in the data frame
car_data$Mileage <- as.factor(round_any(car_data$Mileage, 5))
str(car_data)
car_data$Weight <- as.factor(round_any(car_data$Weight, 500))
str(car_data)

##Select 4 factors and the response
car_data_new <- car_data[c("Country", "Mileage", "Type", "Weight", "Price")]

##A brief view of the reorganized data
head(car_data_new)

##Levels of each factor
levels(car_data_new$Country)
levels(car_data_new$Mileage)
levels(car_data_new$Type)
levels(car_data_new$Weight)

##Check for missing values (number of NA entries per column)
colSums(is.na(car_data_new))

##Main effects of 4 factors: Country, Mileage, Type, and Weight
boxplot(Price ~ Country, data = car_data, main = "Main effect of Country", xlab = "Country", ylab = "Price")
boxplot(Price ~ Mileage, data = car_data, main = "Main effect of Mileage", xlab = "Mileage", ylab = "Price")
boxplot(Price ~ Type, data = car_data, main = "Main effect of Type", xlab = "Type", ylab = "Price")
boxplot(Price ~ Weight, data = car_data, main = "Main effect of Weight", xlab = "Weight", ylab = "Price")

##Interaction effect of 6 2-factor combinations
par(mfrow=c(3,2))
interaction.plot(car_data_new$Country, car_data_new$Mileage, car_data_new$Price, xlab = "Country", ylab = "Price")
interaction.plot(car_data_new$Country, car_data_new$Type, car_data_new$Price, xlab = "Country", ylab = "Price")
interaction.plot(car_data_new$Country, car_data_new$Weight, car_data_new$Price, xlab = "Country", ylab = "Price")
interaction.plot(car_data_new$Mileage, car_data_new$Type, car_data_new$Price, xlab = "Mileage", ylab = "Price")
interaction.plot(car_data_new$Mileage, car_data_new$Weight, car_data_new$Price, xlab = "Mileage", ylab = "Price")
interaction.plot(car_data_new$Type, car_data_new$Weight, car_data_new$Price, xlab = "Type", ylab = "Price")

##Analysis of Variance
Country <- aov(Price ~ Country, car_data_new)
summary(Country)
##Mileage
Mileage <- aov(Price ~ Mileage, car_data_new)
summary(Mileage)
##Type
Type <- aov(Price ~ Type, car_data_new)
summary(Type)
##Weight
Weight <- aov(Price ~ Weight, car_data_new)
summary(Weight)
##ANOVA on 2-Way Interaction Effect
library(lme4)
library(lmerTest)
##Country: Mileage
CM <- lmer(Price ~ Country+(1|Mileage), car_data_new)
anova(CM)
##Country: Type
CT <- aov(Price ~ Country*Type, car_data_new)
summary(CT)
##Country: Weight
CW <- lmer(Price ~ Country+(1|Weight), car_data_new)
anova(CW)
##Mileage: Type
MT <- lmer(Price ~ Type+(1|Mileage), car_data_new)
anova(MT)
##Mileage: Weight
MW <- aov(Price ~ Mileage*Weight, car_data_new)
summary(MW)
##Type: Weight
TW <- lmer(Price ~ Type+(1|Weight), car_data_new)
anova(TW)

Relevant Theory

In this section, we will go over some of the relevant theories utilized in this experiment.

First of all, the experimental strategy we used here is factorial design. There are 2 reasons for choosing this method. First, the best-guess approach only gives observations on experimental units under a single condition, and a one-factor-at-a-time (OFAT) design may miss the optimal setting of the factors and cannot estimate interactions, whereas a factorial design lets us draw more accurate and comprehensive conclusions from the observations. Second, we reorganized the data so that the factors in this experiment are categorical variables with a limited number of levels, which is suitable for factorial design.

Since the experiment here uses factorial design, randomization is an important concept that we need to incorporate in each step of the design process. Randomization involves random selection, random assignment, and random run order. In this experiment, our experimental units are different car types. According to random selection, we therefore need to select car types from the entire population of car types, and avoid intentionally selecting, for instance, car types that weigh more or come from a particular continent. With samples randomly selected from a general population, we can draw conclusions that apply to the entire population, and the experimental model characterized by these 2 features is the random effect model. Since cars come with their own features (factor levels) and we cannot assign treatments to them, random assignment cannot be achieved. However, we can achieve a random run order by generating a random sequence of all possible combinations and running our experiment according to that sequence.

In this particular data set, there are no replicates for each treatment condition. While replication is generally necessary in an ideal experiment, we believe that in this particular experiment it is not. The purpose of replicates is to get a sense of experimental error. In this experiment, however, the features of a car type are fixed for that car type, which means that there would be no variation among replicates. Thus, replication is not necessary in this experiment.

Another important concept in experimental design is blocking, which keeps nuisance factors at the same level across all experimental units. However, it is hard to achieve blocking in this experiment. We cannot build cars that vary only in the factors we are interested in and are otherwise the same. Due to the lack of blocking, there will be effects generated by nuisance factors, and the conclusions we draw from this experiment are likely to be imprecise.

In the statistical analysis section, we dealt with main effects, 2-way interaction effects, and 2-way analysis of variance (ANOVA). A main effect is the effect of an individual factor on the response. The main effect of one factor is computed as the difference in the mean response between levels of this factor. If there are differences among the means of the levels, we say that there is a main effect of this factor. A 2-way interaction effect shows whether or not the effect of one factor on the response depends on another factor of interest. We examined this using plots: in a 2-way interaction plot, whenever the lines are not parallel to each other, there is an interaction effect. There are usually interaction effects between factors, but we need to examine whether they are significant enough to take into consideration when we draw any conclusion. This is where ANOVA comes into play. In ANOVA tests, we compute the variance among treatments and the variance within treatments. The ratio of these variances determines whether or not the effect of a treatment is significant. If the variance among treatments is sufficiently larger than that within treatments, then we can conclude that the change in response is due to our treatments rather than to random error.
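As a worked example of this comparison of variances, the sketch below recomputes the F value for the Country main effect from the sums of squares and degrees of freedom reported in the Country ANOVA table in Section 3. The variable names are illustrative.

```r
# Sums of squares and degrees of freedom from the Country ANOVA table
ss_between <- 214024382; df_between <- 7
ss_within  <- 769527115; df_within  <- 52

# Mean squares: variance among treatments and variance within treatments
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# The F value is the ratio of the two mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 3))
print(round(p_value, 3))
```

The result should agree with the F value of 2.066 and the p-value of 0.064 shown in that table, which is why Country is not significant at the 0.05 level.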