The car.test.frame data set resides in the rpart package of the R programming language. It contains a collection of car parameters measured for 60 car types from around the world.
library(rpart)
car_data <- rpart::car.test.frame
There are 8 parameters in this data set:
colnames(car_data)
## [1] "Price" "Country" "Reliability" "Mileage" "Type"
## [6] "Weight" "Disp." "HP"
For the purposes of this experiment, we choose the factors Country and Mileage as independent variables, and the continuous variable Price as the dependent variable.
car_data_new <- car_data[c("Country", "Mileage", "Price")]
Viewing the structure of this data frame, we see that Country is a categorical variable with 8 levels, while Mileage is a continuous variable.
str(car_data_new)
## 'data.frame': 60 obs. of 3 variables:
## $ Country: Factor w/ 8 levels "France","Germany",..: 8 8 5 4 3 6 4 5 3 3 ...
## $ Mileage: int 33 33 37 32 32 26 33 28 25 34 ...
## $ Price : int 8895 7402 6319 6635 6599 8672 7399 7254 9599 5866 ...
To transform Mileage into a categorical factor, we round each Mileage value to the nearest multiple of 5.
library(plyr)
car_data_new$Mileage <- as.factor(round_any(car_data_new$Mileage, 5))
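For reference, the same transformation can be done in base R without plyr; a minimal sketch, intended as an alternative to the round_any line above (it assumes Mileage still holds the original integer values):
# Base-R equivalent of plyr::round_any(x, 5): round to the nearest multiple of 5
car_data_new$Mileage <- as.factor(5 * round(car_data_new$Mileage / 5))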
The levels of Country are:
levels(car_data_new$Country)
## [1] "France" "Germany" "Japan" "Japan/USA" "Korea" "Mexico"
## [7] "Sweden" "USA"
and the levels of Mileage after the transformation are:
levels(car_data_new$Mileage)
## [1] "20" "25" "30" "35"
head(car_data_new)
## Country Mileage Price
## Eagle Summit 4 USA 35 8895
## Ford Escort 4 USA 35 7402
## Ford Festiva 4 Korea 35 6319
## Honda Civic 4 Japan/USA 30 6635
## Mazda Protege 4 Japan 30 6599
## Mercury Tracer 4 Mexico 25 8672
In this experiment, the null hypothesis is that the factor Mileage does not affect the response variable Price. In other words, we choose Mileage as the factor of interest and Country as a nuisance factor.
Randomization: Since we did not participate in the design process of this experiment, we cannot fulfill random selection and random assignment. However, we can achieve random execution by randomly selecting observations from the data set. To do so, we first have to decide how many observations to select: the required sample size.
We determine the required sample size using the effect size, the type I error rate \(\alpha = 0.05\), and the power \(1-\beta = 0.8\); but first we have to estimate the effect size according to Equation 1:
\[f = \frac{\sqrt{\sum_{i=1}^{k} p_i (\mu_i - \mu)^2}}{\sigma}\]
where \(f\) is the effect size, \(p_i\) is the proportion of observations in the ith level, \(\mu_i\) is the mean of the response variable for the ith level, \(\mu\) is the mean for the entire sample, and \(\sigma\) is the standard deviation of the response variable for the entire sample. The estimated effect size of this experiment is:
Mean_25 <- mean(car_data_new$Price[car_data_new$Mileage == 25])
Mean_20 <- mean(car_data_new$Price[car_data_new$Mileage == 20])
Mean_35 <- mean(car_data_new$Price[car_data_new$Mileage == 35])
Mean_30 <- mean(car_data_new$Price[car_data_new$Mileage == 30])
Mean_total <- mean(car_data_new$Price)
n_25 <- length(car_data_new$Price[car_data_new$Mileage == 25])
n_20 <- length(car_data_new$Price[car_data_new$Mileage == 20])
n_30 <- length(car_data_new$Price[car_data_new$Mileage == 30])
n_35 <- length(car_data_new$Price[car_data_new$Mileage == 35])
N <- length(car_data_new$Price)
a <- (n_20/N) * (Mean_20 - Mean_total)^2
b <- (n_25/N) * (Mean_25 - Mean_total)^2
c <- (n_30/N) * (Mean_30 - Mean_total)^2
d <- (n_35/N) * (Mean_35 - Mean_total)^2
variation <- var(car_data_new$Price)
f_square <- (a+b+c+d)/variation
f <- sqrt(f_square)
f
## [1] 0.6506844
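For reference, the same estimate can be written more compactly with tapply and table; a sketch that should reproduce the value above:
# Per-level means and group sizes in one pass
mu_i <- tapply(car_data_new$Price, car_data_new$Mileage, mean)
n_i <- table(car_data_new$Mileage)
mu <- mean(car_data_new$Price)
sqrt(sum((n_i / sum(n_i)) * (mu_i - mu)^2) / var(car_data_new$Price))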
Utilizing G*Power, we derive the total sample size N to be 48.
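For readers without G*Power, a similar a priori calculation can be sketched in R with the pwr package (an assumption on our part: we used G*Power, and the two tools may differ slightly depending on rounding and the chosen test family). pwr.anova.test returns the per-group size n for a balanced one-way design, so the total is 4 * n:
library(pwr) # assumed available; not part of our actual workflow
# A priori sample size for a one-way ANOVA with k = 4 Mileage levels,
# using the effect size f estimated above
pwr.anova.test(k = 4, f = f, sig.level = 0.05, power = 0.8)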
We then randomly select 48 out of the 60 observations provided in rpart::car.test.frame.
# A seed can be fixed here (e.g., set.seed(1)) to make the selection reproducible
car_data_48 <- car_data_new[sample(nrow(car_data_new), 48), ]
Replication: The observations in this data set do not include replications or repeated measures: each car type corresponds to one observation.
Block: If we were to design an experiment taking into account a known and controllable nuisance factor, we could use a randomized complete block design or an incomplete block design to systematically eliminate its effect on the statistical comparisons among treatments. In a randomized complete block design, we fix the level of the nuisance factor and run a completely randomized experiment under each fixed level. If we were to do so, the data would look like the following, except that the experimental runs under each level of Country are not complete.
We organize the data to block Country:
blocked <- car_data_48[order(car_data_48$Country),]
head(blocked, 10)
## Country Mileage Price
## Peugeot 405 4 France 25 15930
## Volkswagen Jetta 4 Germany 25 9995
## Audi 80 4 Germany 25 18900
## Mazda MPV V6 Japan 20 14944
## Mazda 929 V6 Japan 20 23300
## Toyota Tercel 4 Japan 35 6488
## Honda Prelude Si 4WS 4 Japan 25 13945
## Subaru Justy 3 Japan 35 5866
## Mitsubishi Galant 4 Japan 25 10989
## Mitsubishi Wagon 4 Japan 20 14929
However, we did not design the experiment. Therefore, although the nuisance factor Country is known and controllable, which would have made it easy to block during the design stage, we have to treat it as a known and uncontrollable nuisance factor.
The final data looks like:
head(car_data_48)
## Country Mileage Price
## Ford Tempo 4 USA 25 9483
## Buick Century 4 USA 20 13150
## Chrysler New Yorker V6 USA 20 16342
## Mazda MPV V6 Japan 20 14944
## Mazda 929 V6 Japan 20 23300
## Toyota Tercel 4 Japan 35 6488
str(car_data_48)
## 'data.frame': 48 obs. of 3 variables:
## $ Country: Factor w/ 8 levels "France","Germany",..: 8 8 8 3 3 3 8 8 4 3 ...
## $ Mileage: Factor w/ 4 levels "20","25","30",..: 2 1 1 1 1 4 2 1 2 2 ...
## $ Price : int 9483 13150 16342 14944 23300 6488 10945 17257 11499 13945 ...
To further explore the data, we show boxplots of the two factors against the response variable.
boxplot(Price ~ Country, data = car_data_48, main = "Main effect of Country", xlab = "Country", ylab = "Price")
boxplot(Price ~ Mileage, data = car_data_48, main = "Main effect of Mileage", xlab = "Mileage", ylab = "Price")
From the boxplots, we can tell that there are differences in Price among the levels of both Country and Mileage. Since we consider Country a nuisance factor, we will not compute the interaction effect between Country and Mileage. As we observe that the levels of Mileage have some effect on the Price of cars, we wonder whether this effect is significant or only due to error. With our null hypothesis being that Mileage has no effect on Price, we perform tests to see whether we can reject it.
### ANOVA
As previously mentioned in Section 2 (Design), our nuisance factor is known and uncontrollable. Thus, we block its effect during the analysis by including it in the model (function lm).
lm.out <- lm(Price ~ Mileage + Country, data = car_data_48)
anova(lm.out)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Mileage 3 358244960 119414987 13.7636 3.154e-06 ***
## Country 6 123658863 20609810 2.3755 0.04783 *
## Residuals 38 329693053 8676133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the analysis, we see that the effects of Country and Mileage on Price are both significant (both p-values < 0.05), with the effect of Mileage being much more significant. Therefore, we may reject the null hypothesis. However, since Country is the nuisance factor in this experiment, we are not interested in its effect. Note that this analysis assumes there is no interaction effect between Country and Mileage, so that the residual term only estimates the effect of random errors.
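If one prefers to check the no-interaction assumption rather than assume it, the interaction model can be fit and compared against the additive model; a sketch (with 48 observations spread over up to 8 x 4 Country-by-Mileage cells, many cells are empty, so some interaction coefficients come back NA and the test has little power):
# Compare the additive model against the model with interaction terms
lm.int <- lm(Price ~ Mileage * Country, data = car_data_48)
anova(lm.out, lm.int) # F test for the added interaction terms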
### Effect Size
We compute the effect size as one alternative to null hypothesis statistical testing, again using G*Power. Effect size measures the strength of the association between changes in a factor and changes in the response; a larger effect size indicates a stronger influence of the factor on the response variable. With \(\alpha = 0.05\), \(1-\beta = 0.8\), and a sample size of 48, the resulting effect size is 0.60, which exceeds the 0.5 threshold conventionally considered large. Together with the very small p-value for Mileage in the ANOVA table above, this lets us conclude that the effect of Mileage on our response variable Price is very significant, and we can reject our null hypothesis.
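G*Power's sensitivity analysis can likewise be sketched with the pwr package (again an assumption on our part; n = 12 treats the 48 observations as balanced across the 4 Mileage levels, so the result is only approximate for unbalanced groups):
# Minimal detectable effect size f at alpha = 0.05 and power = 0.8
pwr.anova.test(k = 4, n = 12, sig.level = 0.05, power = 0.8)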
### Resampling Statistics
Another alternative we choose is resampling statistics, in which we use a computational approach to estimate the distribution of the test statistic. There are three common types of resampling statistics: permutation, bootstrapping, and Monte Carlo simulation. Here, we repeatedly draw random 48-observation subsets of the full data set, compute the F value for Mileage in each, and estimate the probability of getting F values larger than the F value computed under the assumption that the error in the data is normally distributed.
## Observed F value from our 48-observation sample
lm.out <- lm(Price ~ Mileage + Country, data = car_data_48)
anova_48 <- anova(lm.out)$"F value"[1]
## resampling
R <- 10000
f_values <- numeric(R)
# Permutation-style subsampling: draw 48 of the 60 observations each time
for (i in 1:R) {
  index <- sample(nrow(car_data_new), size = 48, replace = FALSE)
  resample <- car_data_new[index, ]
  lm.out <- lm(Price ~ Mileage + Country, data = resample)
  f_values[i] <- anova(lm.out)$"F value"[1]
}
hist(f_values, breaks = 20)
points(anova_48,0,pch=16)
# Chance of getting F values larger than the observed F value
sum(f_values >= anova_48) / R
## [1] 0.405
If our null hypothesis statistical testing is representative, then the observed F value should be larger than most of the F values computed from resampling. As we can see from the histogram, our observed F value lies toward the right of the distribution, and the probability of getting F values larger than it is 0.405, which is larger than 0.05. This indicates a large chance that our assumption is not true; the rejection of the null hypothesis should be reconsidered and analyzed with a different model.
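For comparison, a conventional permutation test would break the Mileage-Price association directly by shuffling the Mileage labels within our sample; a minimal sketch:
# Permute Mileage, refit, and compare the observed F value
# against the resulting null distribution
f_null <- numeric(R)
for (i in 1:R) {
  shuffled <- car_data_48
  shuffled$Mileage <- sample(shuffled$Mileage)
  f_null[i] <- anova(lm(Price ~ Mileage + Country, data = shuffled))$"F value"[1]
}
mean(f_null >= anova_48) # permutation p-value for Mileage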
For the previous analyses, we assumed that the errors (residuals) were normally distributed. Now, we perform model adequacy checking to verify this assumption.
If the residuals are normally distributed, the points in the Q-Q plot should fall along a straight line, and the fitted-versus-residuals plot should show an even scatter of points with no pattern.
# Refit on our sample: lm.out currently holds the last fit from the resampling loop
lm.out <- lm(Price ~ Mileage + Country, data = car_data_48)
qqnorm(residuals(lm.out))
plot(fitted(lm.out), residuals(lm.out))
The points in our Q-Q plot do not fall along a straight line, and the variation in the left part of our fitted-versus-residuals plot is generally larger than that in the right part. Therefore, our data does not fit the model well, and our normality assumption was not true. For further and more accurate analysis, we could use other models for a better fit.
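One common remedy, given the right-skewed prices and unequal spread, would be to model the log of the response; a sketch of a follow-up we did not carry out:
# Refit with a log-transformed response and re-check the residuals
lm.log <- lm(log(Price) ~ Mileage + Country, data = car_data_48)
qqnorm(residuals(lm.log))
plot(fitted(lm.log), residuals(lm.log))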
The raw data used in this experiment is one of the data sets in the rpart package of the R programming language. It contains observations on 60 types of cars regarding their weight, country of brand, mileage, price, etc.
### Complete and Documented R Code
## Acquire data
library(rpart)
car_data <- rpart::car.test.frame
## Select variables
car_data_new <- car_data[c("Country", "Mileage", "Price")]
## Transform Mileage into a categorical variable
library(plyr)
car_data_new$Mileage <- as.factor(round_any(car_data_new$Mileage, 5))
## block design
blocked <- car_data_new[order(car_data_new$Country),]
## effect size
Mean_25 <- mean(car_data_new$Price[car_data_new$Mileage == 25])
Mean_20 <- mean(car_data_new$Price[car_data_new$Mileage == 20])
Mean_35 <- mean(car_data_new$Price[car_data_new$Mileage == 35])
Mean_30 <- mean(car_data_new$Price[car_data_new$Mileage == 30])
Mean_total <- mean(car_data_new$Price)
n_25 <- length(car_data_new$Price[car_data_new$Mileage == 25])
n_20 <- length(car_data_new$Price[car_data_new$Mileage == 20])
n_30 <- length(car_data_new$Price[car_data_new$Mileage == 30])
n_35 <- length(car_data_new$Price[car_data_new$Mileage == 35])
N <- length(car_data_new$Price)
a <- (n_20/N) * (Mean_20 - Mean_total)^2
b <- (n_25/N) * (Mean_25 - Mean_total)^2
c <- (n_30/N) * (Mean_30 - Mean_total)^2
d <- (n_35/N) * (Mean_35 - Mean_total)^2
variation <- var(car_data_new$Price)
f_square <- (a+b+c+d)/variation
f <- sqrt(f_square)
f
## Calculate sample size in G*Power, then randomly select 48 observations
## from the data (randomization); a seed can be fixed here for reproducibility
car_data_48 <- car_data_new[sample(nrow(car_data_new), 48), ]
head(car_data_48)
str(car_data_48)
## Boxplots
boxplot(Price ~ Country, data = car_data_48, main = "Main effect of Country", xlab = "Country", ylab = "Price")
boxplot(Price ~ Mileage, data = car_data_48, main = "Main effect of Mileage", xlab = "Mileage", ylab = "Price")
## ANOVA (Mileage first, so that $"F value"[1] is the F value for Mileage)
lm.out <- lm(Price ~ Mileage + Country, data = car_data_48)
anova_48 <- anova(lm.out)$"F value"[1]
## Resampling: permutation-style subsampling of 48 of the 60 observations
R <- 10000
f_values <- numeric(R)
for (i in 1:R) {
  index <- sample(nrow(car_data_new), size = 48, replace = FALSE)
  resample <- car_data_new[index, ]
  lm.out <- lm(Price ~ Mileage + Country, data = resample)
  f_values[i] <- anova(lm.out)$"F value"[1]
}
hist(f_values, breaks = 20)
points(anova_48, 0, pch = 16)
# Chance of getting F values larger than the observed F value
sum(f_values >= anova_48) / R
## Model adequacy checking: refit on our sample first, since lm.out
## currently holds the last fit from the resampling loop
lm.out <- lm(Price ~ Mileage + Country, data = car_data_48)
qqnorm(residuals(lm.out))
plot(fitted(lm.out), residuals(lm.out))