For this recipe, the vehicles dataset from the fueleconomy package is examined. Specifically, we examine the effect of two multi-level factors (make and fuel) on city fuel economy in mpg.
To access the package and save the table into the workspace:
# Name a CRAN mirror explicitly; the original call failed with
# "Error: trying to use CRAN without setting a mirror"
install.packages("fueleconomy", repos="https://cloud.r-project.org")
# Load from the default library path rather than a machine-specific lib.loc
library("fueleconomy")
data<-vehicles
fix(data)
View(data)
*Note: on my interface, data read into the workspace must be confirmed in a separate window (fix(x)) before it is accessible to view in RStudio. Most users can skip this step.
Subset to a smaller set of manufacturers so the model doesn't have too many interactions to compute:
x<-subset(data, make %in% c("Acura", "Audi", "Chevrolet", "Dodge"))
View(x)
For this experiment, we will look at two factors, make and fuel. Make, the manufacturer, has 128 levels in the original dataset but only 4 in the subset created above (Acura, Audi, Chevrolet, Dodge). Fuel, the type of fuel the car requires, has 13 levels in the original vehicles dataset and 10 in the subset.
As loaded, the vehicles dataset stores make and fuel as character vectors rather than factors:
str(x)
## 'data.frame': 6852 obs. of 12 variables:
## $ id : num 13309 13310 13311 14038 14039 ...
## $ make : chr "Acura" "Acura" "Acura" "Acura" ...
## $ model: chr "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
## $ year : num 1997 1997 1997 1998 1998 ...
## $ class: chr "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
## $ trans: chr "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
## $ drive: chr "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
## $ cyl : num 4 4 6 4 4 6 4 4 6 5 ...
## $ displ: num 2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
## $ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
## $ hwy : num 26 28 26 27 29 26 27 29 26 23 ...
## $ cty : num 20 22 18 19 21 17 20 21 17 18 ...
Save make and fuel as factors:
x$make=as.factor(x$make)
x$fuel=as.factor(x$fuel)
str(x)
## 'data.frame': 6852 obs. of 12 variables:
## $ id : num 13309 13310 13311 14038 14039 ...
## $ make : Factor w/ 4 levels "Acura","Audi",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model: chr "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
## $ year : num 1997 1997 1997 1998 1998 ...
## $ class: chr "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
## $ trans: chr "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
## $ drive: chr "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
## $ cyl : num 4 4 6 4 4 6 4 4 6 5 ...
## $ displ: num 2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
## $ fuel : Factor w/ 10 levels "CNG","Diesel",..: 10 10 10 10 10 10 10 10 10 7 ...
## $ hwy : num 26 28 26 27 29 26 27 29 26 23 ...
## $ cty : num 20 22 18 19 21 17 20 21 17 18 ...
Now they are stored as factors and R will treat them as such.
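The order of subsetting and factor conversion matters for the number of levels R reports. The sketch below uses a small hypothetical data frame (toy, not the vehicles table) to show that converting after subsetting keeps only the values still present, whereas converting first leaves unused levels behind until droplevels() is called.

```r
# Hypothetical toy data, used only to illustrate factor levels
toy <- data.frame(
  make = c("Acura", "Audi", "Dodge", "Audi", "Acura"),
  fuel = c("Regular", "Premium", "Regular", "Diesel", "Premium"),
  stringsAsFactors = FALSE
)

# Case 1: subset the character column, then convert.
# Only the values that survive the subset become levels.
sub <- subset(toy, make %in% c("Acura", "Audi"))
sub$make <- as.factor(sub$make)
levels(sub$make)

# Case 2: convert first, then subset.
# The unused level "Dodge" is kept until droplevels() removes it.
toy$make <- as.factor(toy$make)
sub2 <- subset(toy, make != "Dodge")
nlevels(sub2$make)
nlevels(droplevels(sub2)$make)
```

This is why the recipe's approach (subset first, then as.factor) reports 4 and 10 levels without any extra clean-up step.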
The continuous variables in the dataset are displ (engine displacement in litres), hwy (highway fuel economy) and cty (city fuel economy).
For this experiment, cty will be used as the response.
The data has the following variables: id (EPA identifier), make, model, year, class (EPA vehicle size class), trans, drive, cyl, displ, fuel, hwy, cty.
For more on the vehicles dataset and to view the first and last 6 observations:
?vehicles
## starting httpd help server ... done
head(x)
## id make model year class trans
## 8 13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 9 13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars Manual 5-spd
## 10 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 11 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
## 12 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars Manual 5-spd
## 13 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
## drive cyl displ fuel hwy cty
## 8 Front-Wheel Drive 4 2.2 Regular 26 20
## 9 Front-Wheel Drive 4 2.2 Regular 28 22
## 10 Front-Wheel Drive 6 3.0 Regular 26 18
## 11 Front-Wheel Drive 4 2.3 Regular 27 19
## 12 Front-Wheel Drive 4 2.3 Regular 29 21
## 13 Front-Wheel Drive 6 3.0 Regular 26 17
tail(x)
## id make model year class
## 10245 9362 Dodge W250 Pickup 4WD 1992 Standard Pickup Trucks
## 10246 9363 Dodge W250 Pickup 4WD 1992 Standard Pickup Trucks
## 10247 10358 Dodge W250 Pickup 4WD 1993 Standard Pickup Trucks
## 10248 10359 Dodge W250 Pickup 4WD 1993 Standard Pickup Trucks
## 10249 10360 Dodge W250 Pickup 4WD 1993 Standard Pickup Trucks
## 10250 10361 Dodge W250 Pickup 4WD 1993 Standard Pickup Trucks
## trans drive cyl displ fuel hwy cty
## 10245 Automatic 4-spd 4-Wheel or All-Wheel Drive 8 5.9 Regular 12 10
## 10246 Manual 5-spd 4-Wheel or All-Wheel Drive 8 5.9 Regular 12 8
## 10247 Automatic 4-spd 4-Wheel or All-Wheel Drive 8 5.2 Regular 15 11
## 10248 Manual 5-spd 4-Wheel or All-Wheel Drive 8 5.2 Regular 15 12
## 10249 Automatic 4-spd 4-Wheel or All-Wheel Drive 8 5.9 Regular 14 10
## 10250 Manual 5-spd 4-Wheel or All-Wheel Drive 8 5.9 Regular 13 10
The experiment will look at two factors (make and fuel type) and see how they affect city fuel economy.
The null hypothesis is that the variation in city fuel economy cannot be explained by anything other than randomization. This will be tested by creating a model using the factors and then running an analysis of variance on the model. A model will be created for each factor separately, with interaction between the factors, and with blocking between the factors.
This design was chosen to examine whether the manufacturer and/or the type of fuel affects the fuel economy of the car.
The vehicles dataset lists the vehicles for which the EPA has fuel economy data from 1985-2015. It is unknown how the city and highway fuel economies were collected.
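There are no repeated measures, as each row is a separate vehicle record with its own id. The make and fuel combinations themselves are not unique, however: the summary below shows 4942 of the 6852 vehicles using Regular fuel across only four makes, so most combinations contain many observations and the design is unbalanced.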
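Whether a factor combination occurs once or many times can be checked by cross-tabulating the two factors. The following is a minimal sketch on a hypothetical toy data frame (the names toy and cells are illustrative only); on the real subset the equivalent call would be table(x$make, x$fuel).

```r
# Hypothetical toy data standing in for the make/fuel columns
toy <- data.frame(
  make = c("Acura", "Acura", "Audi", "Audi", "Audi", "Dodge"),
  fuel = c("Regular", "Regular", "Premium", "Premium", "Diesel", "Regular")
)

# Count the observations in every make x fuel cell
cells <- table(toy$make, toy$fuel)
print(cells)

# TRUE if at least one combination has more than one observation
any(cells > 1)

# Number of combinations with no observations at all
sum(cells == 0)
```

Empty cells matter for the models that follow: they are the likely reason the fuel:make term in Model 3 has 8 degrees of freedom rather than the 9 x 3 = 27 a fully crossed design would give.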
Blocking was not required in creating the vehicles dataset but will be used between the two factors in the 4th model.
Summary statistics for the subset of the vehicles dataset:
summary(x)
## id make model year
## Min. : 3 Acura : 269 Length:6852 Min. :1984
## 1st Qu.: 6363 Audi : 772 Class :character 1st Qu.:1989
## Median :14193 Chevrolet:3461 Mode :character Median :1996
## Mean :15300 Dodge :2350 Mean :1997
## 3rd Qu.:23612 3rd Qu.:2006
## Max. :34931 Max. :2015
##
## class trans drive cyl
## Length:6852 Length:6852 Length:6852 Min. : 3.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 6.21
## 3rd Qu.: 8.00
## Max. :12.00
## NA's :4
## displ fuel hwy cty
## Min. :1.00 Regular :4942 Min. : 10.0 Min. : 8
## 1st Qu.:2.50 Premium :1227 1st Qu.: 17.0 1st Qu.: 13
## Median :3.90 Gasoline or E85: 370 Median : 22.0 Median : 15
## Mean :3.87 Diesel : 252 Mean : 21.8 Mean : 16
## 3rd Qu.:5.20 Midgrade : 22 3rd Qu.: 25.0 3rd Qu.: 18
## Max. :8.40 CNG : 13 Max. :109.0 Max. :128
## NA's :4 (Other) : 26
Average City MPG for each fuel type and make:
tapply(x$cty,x$fuel,mean)
## CNG Diesel
## 11.38 17.00
## Electricity Gasoline or E85
## 61.25 15.17
## Gasoline or natural gas Midgrade
## 16.75 14.77
## Premium Premium Gas or Electricity
## 16.90 35.00
## Premium or E85 Regular
## 20.00 15.80
tapply(x$cty,x$make,mean)
## Acura Audi Chevrolet Dodge
## 18.61 17.17 16.16 15.19
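From the means by fuel type, fuel type appears likely to explain some of the variation in city mpg: Electricity averages 61.25 mpg and Premium Gas or Electricity 35.00 mpg, while the other fuel types fall between about 11 and 20 mpg.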
Boxplots:
# The formula cty ~ make puts the factor on the x-axis and city MPG on the y-axis
boxplot(x$cty~x$make, xlab="Make", ylab="City MPG")
boxplot(x$cty~x$fuel, xlab="Fuel Type", ylab="City MPG")
Based on the medians of the four makes, make may not explain the variation in city MPG. However, the outliers are worth pointing out. They may represent cars that run on alternative fuels such as electricity, which appear as outliers because the majority of each make's cars run on conventional fuel types.
Although not every fuel type differs from the others, some (for example, Electricity) appear able to explain part of the variation in city MPG.
Model 1: First, we will create a model to examine the effect of fuel type on city mpg:
model1=aov(x$cty~x$fuel)
anova(model1)
## Analysis of Variance Table
##
## Response: x$cty
## Df Sum Sq Mean Sq F value Pr(>F)
## x$fuel 9 11807 1312 79.1 <2e-16 ***
## Residuals 6842 113426 17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value (< 2e-16) is very small: if fuel type had no effect on city mpg, an F value as large as 79.1 would be extremely unlikely to arise from randomization alone. Therefore, we reject the null hypothesis. When fuel type is the only predicting factor, it likely explains some of the variation in city mpg.
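The F value and p-value in the Model 1 table can be reproduced by hand from the printed sums of squares and degrees of freedom. This is a sketch using only the numbers shown in the output above; the variable names are illustrative.

```r
# Values copied from the Model 1 ANOVA table
ss_fuel  <- 11807    # Sum Sq for x$fuel
df_fuel  <- 9        # Df for x$fuel
ss_resid <- 113426   # Sum Sq for Residuals
df_resid <- 6842     # Df for Residuals

# Mean square = sum of squares / degrees of freedom
ms_fuel  <- ss_fuel / df_fuel
ms_resid <- ss_resid / df_resid

# F statistic = between-group mean square / residual mean square
f_value <- ms_fuel / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_fuel, df_resid, lower.tail = FALSE)

round(c(ms_fuel = ms_fuel, ms_resid = ms_resid, F = f_value), 1)
p_value < 2e-16
```

The table prints the mean squares rounded to 1312 and 17; the F value of 79.1 comes from the unrounded values, which is why dividing the rounded numbers gives a slightly different answer.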
Model 2: Now we create a similar model, but to test manufacturer:
model2=aov(x$cty~x$make)
anova(model2)
## Analysis of Variance Table
##
## Response: x$cty
## Df Sum Sq Mean Sq F value Pr(>F)
## x$make 3 4496 1499 85 <2e-16 ***
## Residuals 6848 120737 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, the p-value is very small, so we reject the null hypothesis that the variation in city mpg is due to randomization alone. It is likely that the manufacturer explains some of the variation in city mpg.
Model 3: Model to test interaction between manufacturer and fuel type:
model3=aov(x$cty~x$fuel*x$make)
anova(model3)
## Analysis of Variance Table
##
## Response: x$cty
## Df Sum Sq Mean Sq F value Pr(>F)
## x$fuel 9 11807 1312 83.0 <2e-16 ***
## x$make 3 3036 1012 64.0 <2e-16 ***
## x$fuel:x$make 8 2424 303 19.2 <2e-16 ***
## Residuals 6831 107965 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 4: Additive model (main effects only, no interaction term) to test whether fuel type or make affects city mpg:
model4=aov(x$cty~x$fuel+x$make)
anova(model4)
## Analysis of Variance Table
##
## Response: x$cty
## Df Sum Sq Mean Sq F value Pr(>F)
## x$fuel 9 11807 1312 81.3 <2e-16 ***
## x$make 3 3036 1012 62.7 <2e-16 ***
## Residuals 6839 110389 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All four models lead to the same conclusion: fuel and make are significant in every model, as is their interaction in Model 3. It is likely that both fuel and make explain some of the variation in city mpg.
Visually inspect normality of the residuals:
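With 6852 observations, even small effects produce very small p-values, so it can help to look at how much of the total variation each term accounts for. The sketch below computes eta squared (a term's sum of squares divided by the total) from the Model 3 table. Eta squared is not part of the original analysis, and because anova() reports sequential sums of squares, the shares depend on the order in which the terms were entered.

```r
# Sum Sq column copied from the Model 3 ANOVA table
ss <- c(fuel = 11807, make = 3036, "fuel:make" = 2424, Residuals = 107965)

# Eta squared: share of the total sum of squares attributed to each term
eta_sq <- ss / sum(ss)

round(eta_sq, 3)
```

Read this way, the table suggests that fuel, make and their interaction together account for a modest share of the variation in city mpg, with most of it left in the residuals.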
qqnorm(residuals(model3))
qqline(residuals(model3))
From this plot, the residuals appear to be roughly normal.
Fitted vs Residuals Plot:
plot(fitted(model3),residuals(model3))
The plot is more clustered than we would like to see, suggesting the model may not be a good fit to the data.
Interaction Plot:
# Keep fuel and make as factors: interaction.plot() takes the two factors first and the response last
interaction.plot(x.factor=x$fuel, trace.factor=x$make, response=x$cty)
Changes in slope and intersecting lines in this plot indicate interaction between fuel and make, which is consistent with the significant fuel:make term in Model 3.
None used.
Since it was not clear whether the residuals were normally distributed, a non-parametric test such as the Kruskal-Wallis rank sum test could be used.
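A minimal sketch of such a test is shown below. It uses simulated data (the toy data frame and its group means are assumptions made for illustration, not values from the vehicles dataset) and base R's kruskal.test(), which handles one grouping factor at a time.

```r
# Simulated data: three hypothetical fuel groups of 20 vehicles each
set.seed(1)
toy <- data.frame(
  cty = c(rnorm(20, mean = 16, sd = 3),
          rnorm(20, mean = 17, sd = 3),
          rnorm(20, mean = 30, sd = 8)),
  fuel = factor(rep(c("Regular", "Premium", "Electricity"), each = 20))
)

# Kruskal-Wallis rank sum test: compares the groups using ranks,
# so it does not assume normally distributed residuals
kw <- kruskal.test(cty ~ fuel, data = toy)

kw$statistic   # Kruskal-Wallis chi-squared
kw$parameter   # degrees of freedom (number of groups - 1)
kw$p.value
```

On the subset used in this recipe the corresponding calls would be kruskal.test(cty ~ fuel, data = x) and kruskal.test(cty ~ make, data = x), run separately for each factor.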
All included above.