*As of August 28, 2014, superseding the version of August 24. Always use the most recent version.*

This study examines the effect of the number of cylinders on vehicle fuel economy, based on 2014 EPA fuel economy data.

```
# Load the EPA fuel economy data; assumes the fueleconomy package is installed
library(fueleconomy)
x <- vehicles
head(x)
```

```
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
```

This dataset has 9 factors (independent variables), and each factor has several levels. For example, the factor *cyl* gives the number of cylinders of the vehicle and has 9 levels, ranging from 2 to 16.
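
One way to check the levels of a factor such as *cyl* is to tabulate it. The sketch below uses only the twelve *cyl* values visible in the `head(x)` and `tail(x)` printouts of this document (the name `cyl_shown` is introduced here for illustration), so it finds two levels rather than the nine in the full dataset. The same calls should apply to `x$cyl`.

```r
# cyl values copied from the head(x) and tail(x) printouts (illustrative subset only)
cyl_shown <- c(4, 4, 6, 6, 4, 6, NA, NA, NA, NA, NA, NA)

# Frequency of each level, counting missing values separately
cyl_counts <- table(cyl_shown, useNA = "ifany")
print(cyl_counts)

# Distinct non-missing levels and how many there are
cyl_levels <- sort(unique(na.omit(cyl_shown)))
n_levels <- length(cyl_levels)
print(n_levels)
```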

`tail(x)`

```
## id make model year class
## 33437 31064 smart fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart fortwo electric drive coupe 2014 Two Seaters
## trans drive cyl displ fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33438 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33439 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33440 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33441 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33442 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
```

`summary(x)`

```
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
```

There is one strictly continuous variable in the dataset: *displ*, the engine displacement of the vehicle. Highway and city fuel economy (*hwy* and *cty*) could intuitively also be treated as continuous, but they are stored as integers in this dataset.
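
The storage types can be inspected directly. A minimal sketch, using three rows copied from the `head(x)` output (the data frame `demo` is a hypothetical stand-in, and the `L` suffix is an assumption that mirrors the integer storage described above); on the full data, `sapply(x, typeof)` would be the analogous call.

```r
# First three rows of selected columns, copied from the head(x) printout
demo <- data.frame(
  displ = c(2.5, 2.5, 4.2),   # continuous (double)
  hwy   = c(17L, 17L, 13L),   # stored as integer
  cty   = c(18L, 18L, 13L)    # stored as integer
)

# Storage type of each column: "double" vs "integer"
col_types <- sapply(demo, typeof)
print(col_types)
```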

The response variable in this study is the vehicle fuel economy (highway/city).

This dataset summarizes the Environmental Protection Agency (EPA) fuel economy data from 1984 to 2015. It contains highway and city fuel economy figures along with vehicle characteristics such as the number of cylinders, make, and model.

The EPA cooperates with the National Highway Traffic Safety Administration to test the fuel economy of all new cars and light trucks, and updates the fuel economy data for older vehicles every year. The data are therefore treated as representative and randomized across makes, models, and years.

To test whether the vehicle power level is correlated with highway fuel economy, *hwy* is selected as the response variable and *cyl* as the experimental factor with 9 levels. We conduct a t-test to see whether the difference in highway fuel economy can be explained by sampling variation alone, and estimate a linear regression model to quantify the marginal effect.

Intuitively, more cylinders mean more power and thus lower fuel economy (i.e., fewer miles traveled per gallon of gasoline). However, this might not hold for newer vehicle models: with more advanced technology, engine design may increase fuel economy while maintaining the same level of power. The effect of the number of cylinders on fuel economy is therefore not obvious, and this project aims to provide more insight into it.

Since all new vehicle makes and models are required to undergo fuel economy testing at the EPA, the data are treated as random and representative of the population.

The fuel economy test is carried out for all new vehicles before they reach the market, so there are no replicates or repeated measures for an individual vehicle. However, repeated measures may be taken for the same make/model in different years.

Blocking is not considered in this design.

We use histograms and boxplots to compare the highway fuel economy of vehicles with 4 and 6 cylinders (year 2014). The histograms show the following.

For vehicles with 4 cylinders, highway fuel economy ranges from about 20 mpg to 50 mpg, with a mean of about 33 mpg. Most of the vehicles (around 80%) fall between 25 mpg and 40 mpg. A few reach very high fuel economy (close to 50 mpg); these are likely hybrids, since fully electric vehicles have a missing *cyl* value and are excluded from this subset.

For vehicles with 6 cylinders, highway fuel economy ranges from about 18 mpg to 38 mpg, with a mean of about 26 mpg. Most of the vehicles (around 70%) fall between 20 mpg and 30 mpg. Only a few reach the mid-30s, near the mean for 4-cylinder vehicles, so the overall fuel economy level is lower than for vehicles with 4 cylinders.

To further compare the two samples, the boxplot shows that, overall, vehicles with 4 cylinders have higher fuel economy than vehicles with 6 cylinders, and show larger variation. Both groups have a number of outliers with relatively high fuel economy (likely hybrid vehicles). To see whether this difference is statistically significant, we carry out a t-test and estimate a regression model in the following sections.

```
# Data frame of model-year 2014 vehicles with 4 cylinders
fourcyl <- subset(x, cyl == 4 & year == 2014)
```

```
# Confirm that the 4-cylinder subset is a data frame
is.data.frame(fourcyl)
```

`## [1] TRUE`

`summary(fourcyl)`

```
## id make model year
## Min. :33399 Length:470 Length:470 Min. :2014
## 1st Qu.:33854 Class :character Class :character 1st Qu.:2014
## Median :34082 Mode :character Mode :character Median :2014
## Mean :34118 Mean :2014
## 3rd Qu.:34456 3rd Qu.:2014
## Max. :34858 Max. :2014
## class trans drive cyl
## Length:470 Length:470 Length:470 Min. :4
## Class :character Class :character Class :character 1st Qu.:4
## Mode :character Mode :character Mode :character Median :4
## Mean :4
## 3rd Qu.:4
## Max. :4
## displ fuel hwy cty
## Min. :1.20 Length:470 Min. :21.0 Min. :17.0
## 1st Qu.:1.80 Class :character 1st Qu.:30.0 1st Qu.:22.0
## Median :2.00 Mode :character Median :33.0 Median :24.0
## Mean :1.99 Mean :33.2 Mean :25.1
## 3rd Qu.:2.40 3rd Qu.:36.0 3rd Qu.:27.0
## Max. :2.70 Max. :49.0 Max. :53.0
```

```
# histogram of hwy mileage for all 4 cylinder vehicle records
hist(fourcyl$hwy,xlim=c(10,60),ylim=c(0,100))
```

```
# Data frame of model-year 2014 vehicles with 6 cylinders
sixcyl <- subset(x, cyl == 6 & year == 2014)
```

```
# Confirm that the 6-cylinder subset is a data frame
is.data.frame(sixcyl)
```

`## [1] TRUE`

`summary(sixcyl)`

```
## id make model year
## Min. :33406 Length:416 Length:416 Min. :2014
## 1st Qu.:33732 Class :character Class :character 1st Qu.:2014
## Median :34060 Mode :character Mode :character Median :2014
## Mean :34044 Mean :2014
## 3rd Qu.:34353 3rd Qu.:2014
## Max. :34794 Max. :2014
## class trans drive cyl
## Length:416 Length:416 Length:416 Min. :6
## Class :character Class :character Class :character 1st Qu.:6
## Mode :character Mode :character Mode :character Median :6
## Mean :6
## 3rd Qu.:6
## Max. :6
## displ fuel hwy cty
## Min. :2.50 Length:416 Min. :18.0 Min. :14.0
## 1st Qu.:3.15 Class :character 1st Qu.:24.0 1st Qu.:17.0
## Median :3.50 Mode :character Median :26.0 Median :18.0
## Mean :3.44 Mean :26.3 Mean :18.6
## 3rd Qu.:3.60 3rd Qu.:28.0 3rd Qu.:20.0
## Max. :4.30 Max. :38.0 Max. :32.0
```

```
# Histogram of hwy mileage for all 6-cylinder vehicle records
hist(sixcyl$hwy,xlim=c(15,40),ylim=c(0,120))
```

```
# Boxplot to compare the two samples
boxplot(fourcyl$hwy,sixcyl$hwy,names=c("4 cyl","6 cyl"))
```
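
The percentage shares quoted in the histogram discussion can be computed rather than read off the plot. The sketch below uses a made-up vector `hwy_demo` (hypothetical values, not taken from the dataset) and a small helper `share_between` introduced here for illustration; with the real data one would pass `fourcyl$hwy` or `sixcyl$hwy` instead.

```r
# Share of observations falling in the closed interval [lo, hi]
share_between <- function(v, lo, hi) {
  mean(v >= lo & v <= hi, na.rm = TRUE)
}

# Hypothetical highway mileages, for illustration only
hwy_demo <- c(21, 28, 30, 33, 33, 36, 38, 41, 45, 49)

# Fraction of the demo values between 25 and 40 mpg
share_demo <- share_between(hwy_demo, 25, 40)
print(share_demo)
```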

### Testing

According to the Welch two-sample t-test, we reject the null hypothesis that the true difference in mean highway fuel economy between vehicles with 4 cylinders and vehicles with 6 cylinders is 0 (t = 24.71, p < 2.2e-16). The difference between the two samples is therefore unlikely to be due to sampling variation alone. To quantify the effect of the number of cylinders on fuel economy, we estimate a linear regression model in the following section.

`t.test(fourcyl$hwy, sixcyl$hwy)`

```
##
## Welch Two Sample t-test
##
## data: fourcyl$hwy and sixcyl$hwy
## t = 24.71, df = 821.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.356 7.453
## sample estimates:
## mean of x mean of y
## 33.18 26.27
```
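
For reference, the Welch statistic and its degrees of freedom can be reproduced by hand. The sketch below uses simulated samples (the vectors `a` and `b`, their sizes, means, and standard deviations are arbitrary assumptions, not the EPA data) and compares the manual result with `t.test()`.

```r
set.seed(1)
# Simulated samples with unequal variances (illustrative values only)
a <- rnorm(50, mean = 33, sd = 5)
b <- rnorm(40, mean = 26, sd = 3)

# Variance of each sample mean
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (va + vb)^2 /
  (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

# Built-in test for comparison (var.equal = FALSE is the default)
res <- t.test(a, b)
print(c(manual_t = t_manual, builtin_t = unname(res$statistic)))
print(c(manual_df = df_manual, builtin_df = unname(res$parameter)))
```

Because the degrees of freedom depend on both sample variances, they are generally not an integer, which is why the output above reports df = 821.4.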

The linear regression shows that the effect of the number of cylinders on highway fuel economy is statistically significant (p < 2e-16). Fuel economy decreases with the number of cylinders, so vehicles with fewer cylinders tend to have higher fuel economy, consistent with the fact that smaller, less powerful vehicles travel farther on the same amount of fuel. Because *cyl* is the only predictor in the model, this estimate does not hold other vehicle characteristics fixed. The estimated coefficient implies a decrease of about 2.65 mpg per additional cylinder. Comparing a vehicle with 4 cylinders and one with 6 cylinders, the model predicts about 32.4 mpg versus 27.1 mpg, so on the fuel the 6-cylinder vehicle needs to travel 100 miles, the 4-cylinder vehicle can travel about 120 miles, roughly 20 miles more.

```
# Restrict to model-year 2014, then regress highway mpg on cylinder count
subx <- subset(x, year == 2014)
fit <- lm(hwy ~ cyl, data = subx)
summary(fit)
```

```
##
## Call:
## lm(formula = hwy ~ cyl, data = subx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.418 -3.111 -0.111 2.582 16.582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.0321 0.4045 106 <2e-16 ***
## cyl -2.6535 0.0664 -40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.28 on 1200 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.571, Adjusted R-squared: 0.571
## F-statistic: 1.6e+03 on 1 and 1200 DF, p-value: <2e-16
```
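
To make the size of the effect concrete, the fitted line can be evaluated at 4 and 6 cylinders. This is a worked calculation using only the coefficients printed above; the variable names are introduced here for illustration, and the results are model predictions, not observed group means.

```r
# Coefficients copied from the summary(fit) output
b0 <- 43.0321   # intercept
b1 <- -2.6535   # change in hwy mpg per additional cylinder

# Predicted highway fuel economy (mpg)
mpg4 <- b0 + b1 * 4
mpg6 <- b0 + b1 * 6

# Fuel a 6-cylinder vehicle needs for 100 miles (gallons)
gal6 <- 100 / mpg6

# Distance a 4-cylinder vehicle covers on that same fuel (miles)
miles4 <- gal6 * mpg4

print(round(c(mpg4 = mpg4, mpg6 = mpg6, miles4 = miles4), 1))
```

The predictions come out at about 32.4 and 27.1 mpg, so the 4-cylinder vehicle covers roughly 120 miles on the fuel that takes the 6-cylinder vehicle 100 miles.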

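The Q-Q plots and the Shapiro-Wilk tests are used to check the normality assumption behind the estimates. For both samples the Shapiro-Wilk test rejects the null hypothesis of normality (p = 0.0005 for 4 cylinders, p = 2.8e-06 for 6 cylinders), although the W statistics (0.988 and 0.976) are close to 1, so the departures from normality are modest. With 470 and 416 observations, the t-test is reasonably robust to such departures, but the results should be read with this caveat.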

`qqnorm(fourcyl$hwy,ylab="Highway Mileage",ylim=c(0,50))`

`qqnorm(sixcyl$hwy,ylab="Highway Mileage",ylim=c(0,50))`

```
# Shapiro-Wilk test of normality (H0: data are normal; a small p-value is evidence against normality)
shapiro.test(fourcyl$hwy)
```

```
##
## Shapiro-Wilk normality test
##
## data: fourcyl$hwy
## W = 0.9877, p-value = 0.0005239
```

`shapiro.test(sixcyl$hwy)`

```
##
## Shapiro-Wilk normality test
##
## data: sixcyl$hwy
## W = 0.9764, p-value = 2.809e-06
```
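
When reading these results, recall that the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, so a small p-value counts against normality. A minimal sketch on simulated data (the vector `skewed` and the 0.05 threshold are assumptions for illustration, not part of the analysis above):

```r
set.seed(42)
# Strongly right-skewed sample; expected to be flagged as non-normal
skewed <- rexp(200, rate = 1)

sw <- shapiro.test(skewed)
print(sw$p.value)

# TRUE means normality is rejected at the 5% level
reject_normality <- sw$p.value < 0.05
print(reject_normality)
```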