Yage Ding

Rensselaer Polytechnic Institute

October 5, 2016

1. Setting

System under test

The rpart::car.test.frame data set ships with the rpart package of the R programming language. It contains a collection of car parameters measured for 60 car types from around the world.

library(rpart)
car_data <- rpart::car.test.frame

Factors and Levels

There are 8 parameters in this data set:

colnames(car_data)
## [1] "Price"       "Country"     "Reliability" "Mileage"     "Type"       
## [6] "Weight"      "Disp."       "HP"

For the purposes of this experiment, we choose the factors Country and Mileage as independent variables and the continuous variable Price as the dependent variable.

car_data_new <- car_data[c("Country", "Mileage", "Price")]

Viewing the structure of this data frame, we see that Country is a categorical variable with 8 levels, while Mileage is a numeric (integer) variable.

str(car_data_new)
## 'data.frame':    60 obs. of  3 variables:
##  $ Country: Factor w/ 8 levels "France","Germany",..: 8 8 5 4 3 6 4 5 3 3 ...
##  $ Mileage: int  33 33 37 32 32 26 33 28 25 34 ...
##  $ Price  : int  8895 7402 6319 6635 6599 8672 7399 7254 9599 5866 ...

To transform Mileage into a categorical factor, we round the values of Mileage to the nearest multiple of 5.

library(plyr)
car_data_new$Mileage <- as.factor(round_any(car_data_new$Mileage, 5))

The levels of Country are:

levels(car_data_new$Country)
## [1] "France"    "Germany"   "Japan"     "Japan/USA" "Korea"     "Mexico"   
## [7] "Sweden"    "USA"

The levels of Mileage after the transformation are:

levels(car_data_new$Mileage)
## [1] "20" "25" "30" "35"

The Data

head(car_data_new)
##                    Country Mileage Price
## Eagle Summit 4         USA      35  8895
## Ford Escort   4        USA      35  7402
## Ford Festiva 4       Korea      35  6319
## Honda Civic 4    Japan/USA      30  6635
## Mazda Protege 4      Japan      30  6599
## Mercury Tracer 4    Mexico      25  8672

2. Design

In this experiment, the null hypothesis is that the factor Mileage does not affect the response variable Price. In other words, we choose Mileage to be the factor of interest and Country to be a nuisance factor.

Randomization: Since we did not participate in the design process of this experiment, we cannot fulfill random selection and random assignment. However, we are able to achieve random execution by randomly selecting observations from the data set. In order to do so, we first have to decide how many observations to select: the required sample size.

We then determine the required sample size using the effect size, the type I error rate \(\alpha\)=0.05, and the power 1-\(\beta\)=0.8. First, we estimate the effect size according to Equation 1:

\[f = \frac{\sqrt{\sum_{i=1}^{k} p_i (\mu_i - \mu)^2}}{\sigma}\]

where \(f\) is the effect size, \(p_i\) is the proportion of the entire sample that belongs to the \(i\)th level, \(\mu_i\) is the mean of the response variable for the \(i\)th level, \(\mu\) is the mean for the entire sample, and \(\sigma\) is the standard deviation of the response variable for the entire sample. The estimated effect size of this experiment is:

# group means of Price within each Mileage level, and the overall mean
Mean_25 <- mean(car_data_new$Price[car_data_new$Mileage == 25])
Mean_20 <- mean(car_data_new$Price[car_data_new$Mileage == 20])
Mean_35 <- mean(car_data_new$Price[car_data_new$Mileage == 35])
Mean_30 <- mean(car_data_new$Price[car_data_new$Mileage == 30])
Mean_total <- mean(car_data_new$Price)
# group sizes and the total sample size
n_25 <- length(car_data_new$Price[car_data_new$Mileage == 25])
n_20 <- length(car_data_new$Price[car_data_new$Mileage == 20])
n_30 <- length(car_data_new$Price[car_data_new$Mileage == 30])
n_35 <- length(car_data_new$Price[car_data_new$Mileage == 35])
N <- length(car_data_new$Price)
# size-weighted squared deviations of each group mean from the overall mean
a <- (n_20/N) * (Mean_20 - Mean_total)^2
b <- (n_25/N) * (Mean_25 - Mean_total)^2
c <- (n_30/N) * (Mean_30 - Mean_total)^2
d <- (n_35/N) * (Mean_35 - Mean_total)^2
# variance of Price over the entire sample, and the effect size f
variation <- var(car_data_new$Price)
f_square <- (a+b+c+d)/variation
f <- sqrt(f_square)
f
## [1] 0.6506844
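
For reference, the same quantity can be computed more compactly with tapply(). This is only a re-expression of the block above (an added sketch, not part of the original code) and should return the same value.

# group means and group sizes of Price for each Mileage level
group_means <- tapply(car_data_new$Price, car_data_new$Mileage, mean)
group_sizes <- tapply(car_data_new$Price, car_data_new$Mileage, length)
p_i <- group_sizes / sum(group_sizes)
# effect size f as in Equation 1, using the total-sample variance as above
sqrt(sum(p_i * (group_means - mean(car_data_new$Price))^2) / var(car_data_new$Price))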

Utilizing G*Power, we derive the total sample size N to be 48.
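
As a rough cross-check inside R (an addition; the original calculation was done in G*Power), the pwr package can solve for the per-group sample size of a balanced one-way ANOVA. Because the blocked design entered into G*Power may be specified differently, this need not reproduce N = 48 exactly.

library(pwr)
# k = 4 Mileage levels, f estimated above, alpha = 0.05, power = 0.8;
# pwr.anova.test() reports the required sample size per group
pwr.anova.test(k = 4, f = f, sig.level = 0.05, power = 0.8)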

We then randomly select 48 out of the 60 observations provided in rpart::car.test.frame.

car_data_48 <- car_data_new[sample(nrow(car_data_new), 48), ]
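
Because sample() draws a different subset on every run, the random seed could be fixed before sampling to make the selection reproducible (a suggestion only; the results shown below came from an unseeded draw).

set.seed(2016)  # hypothetical seed, chosen only for illustration
car_data_48 <- car_data_new[sample(nrow(car_data_new), 48), ]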

Replication: The observations in this data set do not include replications or repeated measures: each car type corresponds to one observation.

Block: If we were to design an experiment taking into account a known and controllable nuisance factor, we could use a randomized complete block design or an incomplete block design to systematically eliminate its effect on the statistical comparisons among treatments. In a randomized complete block design, we fix the level of the nuisance factor and run a completely randomized experiment under each fixed level of the nuisance factor. If we did so, the data would look like the following, except that the experimental runs under each level of Country are not complete.

We organize the data to block on Country:

blocked <- car_data_48[order(car_data_48$Country),]
head(blocked, 10)
##                        Country Mileage Price
## Peugeot 405 4           France      25 15930
## Volkswagen Jetta 4     Germany      25  9995
## Audi 80 4              Germany      25 18900
## Mazda MPV V6             Japan      20 14944
## Mazda 929 V6             Japan      20 23300
## Toyota Tercel 4          Japan      35  6488
## Honda Prelude Si 4WS 4   Japan      25 13945
## Subaru Justy 3           Japan      35  5866
## Mitsubishi Galant 4      Japan      25 10989
## Mitsubishi Wagon 4       Japan      20 14929

However, we did not design the experiment. Therefore, although the nuisance factor Country in this experiment is known and controllable, which would have made it easy to block during the experimental design, we have to treat it as a known and uncontrollable nuisance factor.

The final data looks like:

head(car_data_48)
##                        Country Mileage Price
## Ford Tempo 4               USA      25  9483
## Buick Century 4            USA      20 13150
## Chrysler New Yorker V6     USA      20 16342
## Mazda MPV V6             Japan      20 14944
## Mazda 929 V6             Japan      20 23300
## Toyota Tercel 4          Japan      35  6488
str(car_data_48)
## 'data.frame':    48 obs. of  3 variables:
##  $ Country: Factor w/ 8 levels "France","Germany",..: 8 8 8 3 3 3 8 8 4 3 ...
##  $ Mileage: Factor w/ 4 levels "20","25","30",..: 2 1 1 1 1 4 2 1 2 2 ...
##  $ Price  : int  9483 13150 16342 14944 23300 6488 10945 17257 11499 13945 ...

3. Analysis

Exploratory Data Analysis

To further explore the data, we show boxplots of the two factors with respect to the response variable.

boxplot(Price ~ Country, data=car_data_48, main="Main effect of Country", xlab="Country", ylab="Price")

boxplot(Price ~ Mileage, data=car_data_48, main="Main effect of Mileage", xlab="Mileage", ylab="Price")

From the boxplots, we can tell that there are differences in Price among the levels of both Country and Mileage. Since we consider Country a nuisance factor, we will not compute the interaction effect between Country and Mileage. As the levels of Mileage appear to have some effect on the Price of cars, we ask whether this effect is significant or only due to error. With our null hypothesis being that Mileage has no effect on Price, we perform tests to see whether we can reject it.

ANOVA

As previously mentioned in Section 2 (Design), our nuisance factor is known and uncontrollable. Thus, we block its effect during the analysis by including Country in the model (function lm).

lm.out <- lm(Price ~ Mileage + Country,data=car_data_48)
anova(lm.out)
## Analysis of Variance Table
## 
## Response: Price
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## Mileage    3 358244960 119414987 13.7636 3.154e-06 ***
## Country    6 123658863  20609810  2.3755   0.04783 *  
## Residuals 38 329693053   8676133                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the analysis, we see that the effects of Country and Mileage on Price are both significant (both p-values < 0.05), with the effect of Mileage being the more significant of the two. Therefore, we may reject the null hypothesis. However, since Country is the nuisance factor in this experiment, we are not interested in its effect. One thing to note is that, for this analysis, we assume there is no interaction effect between Country and Mileage, so the residual term only estimates the effect of random error.

Effect Size

We choose to compute the effect size, using G*Power, as one alternative to null hypothesis significance testing. Effect size measures the strength of the association between changes in a factor and changes in the response; a larger effect size indicates a stronger influence of that factor on the response variable. With \(\alpha = 0.05\), \(1-\beta = 0.8\), and a sample size of 48, the resulting effect size is 0.60, which is larger than 0.5 and therefore considered large. In conjunction with the very small p-value (1.209e-05), we conclude that the effect of Mileage on our response variable Price is significant, and we can reject our null hypothesis.
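
For comparison, an observed (partial) effect size for Mileage can also be read off the ANOVA table as \(f = \sqrt{SS_{Mileage} / SS_{Residuals}}\). This is an added illustration rather than the G*Power calculation described above, so it measures a different quantity and need not equal 0.60.

# partial Cohen's f for Mileage, computed from the fitted model
aov_tab <- anova(lm.out)
sqrt(aov_tab["Mileage", "Sum Sq"] / aov_tab["Residuals", "Sum Sq"])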

Resampling Statistics

Another alternative we choose is resampling statistics, in which we use a computational approach to estimate the probability distribution of the data. There are three common types of resampling statistics: permutation, bootstrapping, and Monte Carlo simulation. Here, we take a permutation-style approach: we repeatedly resample 48 of the 60 observations without replacement, compute the F value for Mileage in each resample, and estimate the probability of getting F values larger than the F value obtained above under the assumption that the error in the data is normally distributed.

##critical
lm.out <- lm(Price ~ Mileage + Country,data=car_data_48)
anova_48 <- anova(lm.out)$"F value"[1]

## resampling
R<-10000 
f_values<-numeric(R)

# permutation
for (i in 1:R) {
  index <- sample(1:60, size=48, replace=FALSE)
  resample <- car_data_new[index,]
  lm.out <- lm(Price ~ Mileage + Country,data=resample)
  f_values[i] <- anova(lm.out)$"F value"[1]
}

hist(f_values, breaks = 20) 
points(anova_48,0,pch=16) 

# Chance of getting F values larger than critical F value
sum(f_values>=anova_48)/10000
## [1] 0.405

If our null hypothesis significance test is representative, then the critical F value should be larger than most of the F values computed from the resampling. As we can see from the histogram, our critical F value does not lie far out in the right tail of the distribution: the probability of getting F values larger than it is 0.405, which is much larger than 0.05. This indicates a substantial chance that our assumption does not hold, so the rejection of the null hypothesis should be reconsidered and the data analyzed with a different model.

Model Adequacy Checking

In the previous analyses, we assumed that the errors (residuals) were normally distributed. Now we perform model adequacy checking to examine that assumption.

If the residuals are normally distributed, the points in the Q-Q plot should fall along a straight line, and the fitted-vs-residual plot should show a patternless scatter of points with roughly constant spread.

lm.out <- lm(Price ~ Mileage + Country,data=car_data_48)  # refit, since lm.out was overwritten inside the resampling loop
qqnorm(residuals(lm.out))

plot(fitted(lm.out),residuals(lm.out))
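
As optional additions (not part of the original analysis), a reference line makes the Q-Q plot easier to judge, and a Shapiro-Wilk test provides a formal check of residual normality.

qqnorm(residuals(lm.out))
qqline(residuals(lm.out))          # reference line for the Q-Q plot
shapiro.test(residuals(lm.out))    # formal normality test of the residuals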

The points in our Q-Q plot do not fall along a straight line, and the spread of residuals on the left side of the fitted-vs-residual plot is generally larger than on the right. Therefore, our data do not fit the model well, and our assumption was not justified. For further and more accurate analysis, other models could be used for a better fit.
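
One possible direction (a sketch only, not part of the original analysis) is to model the logarithm of Price, a transformation that often stabilizes the variance of price data; its adequacy would need to be checked in the same way.

# hypothetical alternative model on the log scale
lm.log <- lm(log(Price) ~ Mileage + Country, data = car_data_48)
anova(lm.log)
plot(fitted(lm.log), residuals(lm.log))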

4. References to Literature

  1. Quick-R. (n.d.). Retrieved November 07, 2016, from http://statmethods.net/stats/power.html
  2. Quick-R. (n.d.). Retrieved November 07, 2016, from http://statmethods.net/stats/anova.html
  3. Montgomery, D. C. (2001). Design and analysis of experiments. New York: John Wiley.

5. Appendices

Raw data used in this experiment is the car.test.frame data set from the rpart package of the R programming language. It contains observations on 60 types of cars regarding their weight, country of origin, mileage, price, etc.

Complete and Documented R Code

## Acquire data
library(rpart)
car_data <- rpart::car.test.frame

##select variables 
car_data_new <- car_data[c("Country", "Mileage", "Price")]

## transform mileage to a categorical variable
library(plyr)
car_data_new$Mileage <- as.factor(round_any(car_data_new$Mileage, 5))

## block design
blocked <- car_data_new[order(car_data_new$Country),]

## effect size
Mean_25 <- mean(car_data_new$Price[car_data_new$Mileage == 25])
Mean_20 <- mean(car_data_new$Price[car_data_new$Mileage == 20])
Mean_35 <- mean(car_data_new$Price[car_data_new$Mileage == 35])
Mean_30 <- mean(car_data_new$Price[car_data_new$Mileage == 30])
Mean_total <- mean(car_data_new$Price)
n_25 <- length(car_data_new$Price[car_data_new$Mileage == 25])
n_20 <- length(car_data_new$Price[car_data_new$Mileage == 20])
n_30 <- length(car_data_new$Price[car_data_new$Mileage == 30])
n_35 <- length(car_data_new$Price[car_data_new$Mileage == 35])
N <- length(car_data_new$Price)
a <- (n_20/N) * (Mean_20 - Mean_total)^2
b <- (n_25/N) * (Mean_25 - Mean_total)^2
c <- (n_30/N) * (Mean_30 - Mean_total)^2
d <- (n_35/N) * (Mean_35 - Mean_total)^2
variation <- var(car_data_new$Price)
f_square <- (a+b+c+d)/variation
f <- sqrt(f_square)
f

## calculate sample size in G*Power, randomly select 48 observations
## from the data (randomization)
car_data_48 <- car_data_new[sample(nrow(car_data_new), 48), ]
head(car_data_48)
str(car_data_48)

## Boxplots
boxplot(Price ~ Country, data=car_data_48, main="Main effect of Country", xlab="Country", ylab="Price")
boxplot(Price ~ Mileage, data=car_data_48, main="Main effect of Mileage", xlab="Mileage", ylab="Price")

## anova (Mileage entered first so that $"F value"[1] below is the F value for Mileage)
lm.out <- lm(Price ~ Mileage + Country,data=car_data_48)
anova(lm.out)
anova_48 <- anova(lm.out)$"F value"[1]

## resampling
R<-10000 
f_values<-numeric(R)

# permutation
for (i in 1:R) {
  index <- sample(1:60, size=48, replace=FALSE)
  resample <- car_data_new[index,]
  lm.out <- lm(Price ~ Mileage + Country,data=resample)
  f_values[i] <- anova(lm.out)$"F value"[1]
}

hist(f_values, breaks = 20) 
points(anova_48,0,pch=16) 

# Chance of getting F values larger than critical F value
sum(f_values>=anova_48)/10000

## model adequacy checking (refit on the 48 selected observations, since
## lm.out was overwritten inside the resampling loop)
lm.out <- lm(Price ~ Mileage + Country,data=car_data_48)
qqnorm(residuals(lm.out))
plot(fitted(lm.out),residuals(lm.out))