Recipe 1: Example of Descriptive Statistics

Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Recipes for the Design of Experiments

Wei Zou

RPI

Sep 11, 14; Version 1

1. Setting

System under test

To examine the effect of the number of cylinders on vehicle fuel economy (based on 2014 EPA fuel economy data).

# Load the fueleconomy package, which provides the EPA `vehicles` data frame
library(fueleconomy)
x <- vehicles
# Inspect the first six records
head(x)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

Factors and Levels

This dataset has 9 factors (independent variables), each with several levels. For example, the factor cyl gives the number of cylinders of the vehicle; it has 9 levels, ranging from 2 to 16.

tail(x)
##          id  make                             model year       class
## 33437 31064 smart   fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart       fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart       fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart       fortwo electric drive coupe 2014 Two Seaters
##                trans            drive cyl displ        fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33438 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33439 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33440 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33441 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33442 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
summary(x)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:33442       Length:33442       Length:33442       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :58     
##      displ          fuel                hwy             cty       
##  Min.   :0.00   Length:33442       Min.   :  9.0   Min.   :  6.0  
##  1st Qu.:2.30   Class :character   1st Qu.: 19.0   1st Qu.: 15.0  
##  Median :3.00   Mode  :character   Median : 23.0   Median : 17.0  
##  Mean   :3.35                      Mean   : 23.6   Mean   : 17.5  
##  3rd Qu.:4.30                      3rd Qu.: 27.0   3rd Qu.: 20.0  
##  Max.   :8.40                      Max.   :109.0   Max.   :138.0  
##  NA's   :57
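
The levels of a factor such as cyl can be listed and counted directly instead of being read off the summary. The following is a minimal sketch only: `toy` is a small made-up data frame standing in for `vehicles` (its values are illustrative, not EPA data), and the names `cyl_levels`, `n_levels` and `cyl_counts` are introduced here for illustration. With the fueleconomy package loaded, the same calls should apply to `x$cyl`.

```r
# Toy stand-in for the vehicles data (hypothetical values, not EPA data)
toy <- data.frame(
  cyl = c(4, 4, 6, 6, 8, NA, 4, 12),
  hwy = c(33, 35, 26, 27, 22, 93, 31, 18)
)

# Distinct non-missing levels of the factor, in ascending order
# (sort() drops NA by default)
cyl_levels <- sort(unique(toy$cyl))
n_levels <- length(cyl_levels)

# Number of records per level, with missing values counted separately
cyl_counts <- table(toy$cyl, useNA = "ifany")

print(cyl_levels)
print(n_levels)
print(cyl_counts)
```

Counting NA separately matters here because the summary above reports 58 records with a missing cyl value.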

Continuous variables (if any)

There is one strictly continuous variable in the dataset: displ, the engine displacement of the vehicle. Highway and city fuel economy (hwy and cty) may intuitively also be considered continuous, but they are recorded as integers in this dataset.

Response variables

The response variable in this study is the vehicle fuel economy (highway/city).

The Data: How is it organized and what does it look like?

This dataset summarizes the Environmental Protection Agency (EPA) fuel economy data from 1984 to 2015. It contains highway and city fuel economy together with vehicle characteristics such as number of cylinders, make and model.

Randomization

The EPA cooperates with the National Highway Traffic Safety Administration to test the fuel economy of all new cars and light trucks, and updates the fuel economy data for older vehicles every year. The data are therefore representative and are treated as randomized across make, model and year.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

To test whether the vehicle power level is correlated with highway fuel economy, hwy is selected as the response variable and cyl is the experimental factor with 9 levels. We conduct a t-test to see whether the difference in highway fuel economy can be attributed to sampling variation alone, and we estimate a linear regression model to quantify the marginal effect.

What is the rationale for this design?

In general, more cylinders mean more power and therefore lower fuel economy (i.e., fewer miles traveled per gallon of gasoline). However, this might not hold for newer vehicle models: with more advanced technology, engine design may be able to increase fuel economy while maintaining the same level of power. The effect of the number of cylinders on fuel economy is therefore not clear, and this project aims to provide more insight into the subject.

Randomize: What is the Randomization Scheme?

Since all the new vehicle make/models are required to take the fuel economy examination at EPA, the data are random and represent the population well.

Replicate: Are there replicates and/or repeated measures?

The fuel economy exam is carried out for all new vehicles before they are sent to market. Therefore, for each vehicle there are no replicates or repeated measures. However, repeated measures may be taken for the same make/model in different years.
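
Whether a make/model appears in more than one year can be checked by counting distinct years per combination. This is a sketch under stated assumptions: `toy` is a made-up data frame using the same column names as `vehicles` (make, model, year), and `years_per_model` and `repeated` are illustrative names, not part of the original analysis.

```r
# Toy stand-in with the same column names as vehicles (hypothetical values)
toy <- data.frame(
  make  = c("A", "A", "A", "B", "B", "C"),
  model = c("m1", "m1", "m2", "m3", "m3", "m4"),
  year  = c(2012, 2013, 2013, 2013, 2014, 2014),
  stringsAsFactors = FALSE
)

# Number of distinct model years recorded for each make/model combination
years_per_model <- aggregate(year ~ make + model, data = toy,
                             FUN = function(y) length(unique(y)))
names(years_per_model)[3] <- "n_years"

# Make/model combinations measured in more than one year
repeated <- years_per_model[years_per_model$n_years > 1, ]
print(repeated)
```

Combinations listed in `repeated` are the candidates for repeated measures across years.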

Block: Did you use blocking in the design?

No, blocking is not considered in this design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

We use histograms and boxplots to compare the highway fuel economy of vehicles with 4 and 6 cylinders (year 2014). The histograms show the following.

For vehicles with 4 cylinders, highway fuel economy ranges from 21 mpg to 49 mpg, with a mean of about 33 mpg. Most of these vehicles (around 80%) fall between 25 mpg and 40 mpg. A few reach very high fuel economy (close to 50 mpg); these are electric/hybrid vehicles.

For vehicles with 6 cylinders, highway fuel economy ranges from 18 mpg to 38 mpg, with a mean of about 26 mpg. Most of these vehicles (around 70%) fall between 20 mpg and 30 mpg. Only a few reach 35 mpg, which is above the mean for vehicles with 4 cylinders, so the overall fuel economy level is lower than for vehicles with 4 cylinders.

To further compare the two samples, the boxplot shows that, overall, vehicles with 4 cylinders have a higher fuel economy than vehicles with 6 cylinders, and a larger variation. Both groups have a number of outliers with relatively high fuel economy (from electric/hybrid vehicles). To see whether this difference is statistically significant, we carry out a t-test and estimate a regression model in the following sections.
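
The descriptive comparison above can also be computed per group from one data frame. This is a minimal sketch, assuming a small made-up data frame `toy` with the columns cyl and hwy (the values are illustrative, not the 2014 EPA records); `group_mean`, `group_sd` and `group_iqr` are names introduced here.

```r
# Toy stand-in for the 2014 subset (hypothetical values)
toy <- data.frame(
  cyl = c(4, 4, 4, 4, 6, 6, 6, 6),
  hwy = c(30, 33, 36, 49, 24, 26, 28, 38)
)

# Location and spread of highway mileage within each cylinder group
group_mean <- tapply(toy$hwy, toy$cyl, mean)
group_sd   <- tapply(toy$hwy, toy$cyl, sd)
group_iqr  <- tapply(toy$hwy, toy$cyl, IQR)

print(group_mean)
print(group_sd)
print(group_iqr)
```

Comparing the standard deviation or interquartile range between groups gives a numeric counterpart to the "larger variation" read off the boxplot.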

# Data frame of all 2014 vehicles with 4 cylinders
fourcyl <- subset(x, cyl == 4 & year == 2014)
# Confirm that the subset is a data frame
is.data.frame(fourcyl)
## [1] TRUE
summary(fourcyl)
##        id            make              model                year     
##  Min.   :33399   Length:470         Length:470         Min.   :2014  
##  1st Qu.:33854   Class :character   Class :character   1st Qu.:2014  
##  Median :34082   Mode  :character   Mode  :character   Median :2014  
##  Mean   :34118                                         Mean   :2014  
##  3rd Qu.:34456                                         3rd Qu.:2014  
##  Max.   :34858                                         Max.   :2014  
##     class              trans              drive                cyl   
##  Length:470         Length:470         Length:470         Min.   :4  
##  Class :character   Class :character   Class :character   1st Qu.:4  
##  Mode  :character   Mode  :character   Mode  :character   Median :4  
##                                                           Mean   :4  
##                                                           3rd Qu.:4  
##                                                           Max.   :4  
##      displ          fuel                hwy            cty      
##  Min.   :1.20   Length:470         Min.   :21.0   Min.   :17.0  
##  1st Qu.:1.80   Class :character   1st Qu.:30.0   1st Qu.:22.0  
##  Median :2.00   Mode  :character   Median :33.0   Median :24.0  
##  Mean   :1.99                      Mean   :33.2   Mean   :25.1  
##  3rd Qu.:2.40                      3rd Qu.:36.0   3rd Qu.:27.0  
##  Max.   :2.70                      Max.   :49.0   Max.   :53.0
# histogram of hwy mileage for all 4 cylinder vehicle records
hist(fourcyl$hwy,xlim=c(10,60),ylim=c(0,100))

Figure: histogram of highway mileage for 2014 vehicles with 4 cylinders.

# Data frame of all 2014 vehicles with 6 cylinders
sixcyl <- subset(x, cyl == 6 & year == 2014)
# Confirm that the subset is a data frame
is.data.frame(sixcyl)
## [1] TRUE
summary(sixcyl)
##        id            make              model                year     
##  Min.   :33406   Length:416         Length:416         Min.   :2014  
##  1st Qu.:33732   Class :character   Class :character   1st Qu.:2014  
##  Median :34060   Mode  :character   Mode  :character   Median :2014  
##  Mean   :34044                                         Mean   :2014  
##  3rd Qu.:34353                                         3rd Qu.:2014  
##  Max.   :34794                                         Max.   :2014  
##     class              trans              drive                cyl   
##  Length:416         Length:416         Length:416         Min.   :6  
##  Class :character   Class :character   Class :character   1st Qu.:6  
##  Mode  :character   Mode  :character   Mode  :character   Median :6  
##                                                           Mean   :6  
##                                                           3rd Qu.:6  
##                                                           Max.   :6  
##      displ          fuel                hwy            cty      
##  Min.   :2.50   Length:416         Min.   :18.0   Min.   :14.0  
##  1st Qu.:3.15   Class :character   1st Qu.:24.0   1st Qu.:17.0  
##  Median :3.50   Mode  :character   Median :26.0   Median :18.0  
##  Mean   :3.44                      Mean   :26.3   Mean   :18.6  
##  3rd Qu.:3.60                      3rd Qu.:28.0   3rd Qu.:20.0  
##  Max.   :4.30                      Max.   :38.0   Max.   :32.0
# histogram of hwy mileage for all 6 cylinder vehicle records
hist(sixcyl$hwy,xlim=c(15,40),ylim=c(0,120))

Figure: histogram of highway mileage for 2014 vehicles with 6 cylinders.

# boxplot to compare the two samples
boxplot(fourcyl$hwy,sixcyl$hwy,names=c("4 cyl","6 cyl"))

Figure: boxplots of highway mileage for 2014 vehicles with 4 and 6 cylinders.

Testing

According to the result of the Welch two-sample t-test, we reject the null hypothesis that the true difference in mean fuel economy between vehicles with 4 cylinders and vehicles with 6 cylinders is equal to 0. This result indicates that the difference between the two samples is due to some factor other than sampling variation. To quantify the effect of the number of cylinders on fuel economy, we further estimate a linear regression model in the following section.

t.test(fourcyl$hwy, sixcyl$hwy)
## 
##  Welch Two Sample t-test
## 
## data:  fourcyl$hwy and sixcyl$hwy
## t = 24.71, df = 821.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6.356 7.453
## sample estimates:
## mean of x mean of y 
##     33.18     26.27
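
For reference, the Welch statistic and its degrees of freedom can be reproduced by hand. The sketch below uses two short made-up vectors, `x4` and `x6` (illustrative values, not the EPA samples), and a helper `welch()` that is introduced here only for illustration; it is compared against the built-in `t.test()`.

```r
# Two small made-up samples (hypothetical values, not the EPA data)
x4 <- c(30, 33, 36, 31, 35, 34)
x6 <- c(24, 26, 28, 25, 27, 23)

# Welch two-sample t-test computed from first principles
welch <- function(a, b) {
  va <- var(a) / length(a)   # squared standard error of mean(a)
  vb <- var(b) / length(b)   # squared standard error of mean(b)
  t_stat <- (mean(a) - mean(b)) / sqrt(va + vb)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
  # Two-sided p-value
  p <- 2 * pt(-abs(t_stat), df)
  c(t = t_stat, df = df, p.value = p)
}

manual  <- welch(x4, x6)
builtin <- t.test(x4, x6)

print(manual)
print(builtin)
```

Because the two group variances are not pooled, the degrees of freedom are generally not an integer, which is why the output above reports df = 821.4.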

Estimation (of Parameters)

The linear regression result shows that the effect of the number of cylinders on fuel economy is statistically significant. Highway fuel economy decreases with the number of cylinders, indicating that vehicles with fewer cylinders tend to have higher fuel economy, which is consistent with the fact that smaller vehicles with a lower power level can travel farther on the same amount of fuel. The estimated coefficient shows that fuel economy decreases by about 2.65 mpg for each additional cylinder. Comparing a vehicle with 4 cylinders to one with 6 cylinders, the model predicts about 32.4 mpg versus 27.1 mpg, a difference of about 5.3 mpg: on the fuel the 6-cylinder vehicle needs to travel 100 miles, the 4-cylinder vehicle can travel about 120 miles.

# We first subset the vehicles from year 2014
subx <- subset(x, year == 2014)
fit <- lm(hwy ~ cyl, data=subx)
summary(fit)
## 
## Call:
## lm(formula = hwy ~ cyl, data = subx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.418  -3.111  -0.111   2.582  16.582 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  43.0321     0.4045     106   <2e-16 ***
## cyl          -2.6535     0.0664     -40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.28 on 1200 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.571,  Adjusted R-squared:  0.571 
## F-statistic: 1.6e+03 on 1 and 1200 DF,  p-value: <2e-16
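
The arithmetic behind the interpretation of the slope can be sketched from the printed estimates. The coefficients below are copied from the summary(fit) output above; the helper `predict_hwy()` and the other variable names are introduced here for illustration only.

```r
# Coefficients as printed in the summary(fit) output
b0 <- 43.0321   # intercept
b1 <- -2.6535   # change in highway mpg per additional cylinder

# Predicted highway mpg for a given number of cylinders
predict_hwy <- function(cyl) b0 + b1 * cyl

hwy4 <- predict_hwy(4)
hwy6 <- predict_hwy(6)
diff_mpg <- hwy4 - hwy6   # equals -2 * b1

# Fuel a 6-cylinder vehicle needs for 100 miles, and how far a
# 4-cylinder vehicle is predicted to travel on that same fuel
gallons <- 100 / hwy6
miles4  <- gallons * hwy4

print(c(hwy4 = hwy4, hwy6 = hwy6, diff_mpg = diff_mpg, miles4 = miles4))
```

With the fitted model object available, `predict(fit, newdata = data.frame(cyl = c(4, 6)))` should give the same two predictions up to rounding of the printed coefficients.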

Diagnostics/Model Adequacy Checking

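The Q-Q plots and the Shapiro-Wilk test are used to check the normality assumption. The Shapiro-Wilk test rejects the null hypothesis of normality for both samples (p = 0.0005239 for 4 cylinders and p = 2.809e-06 for 6 cylinders), so the highway mileage data do not strictly follow a normal distribution. The W statistics are nevertheless close to 1 (0.9877 and 0.9764), and with 470 and 416 observations the t-test is reasonably robust to moderate departures from normality; the results should be read with this caveat in mind.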

qqnorm(fourcyl$hwy,ylab="Highway Mileage",ylim=c(0,50))

Figure: normal Q-Q plot of highway mileage for 2014 vehicles with 4 cylinders.

qqnorm(sixcyl$hwy,ylab="Highway Mileage",ylim=c(0,50))

Figure: normal Q-Q plot of highway mileage for 2014 vehicles with 6 cylinders.

# Shapiro-Wilk test of normality. Normality is not rejected if p > 0.1
shapiro.test(fourcyl$hwy)
## 
##  Shapiro-Wilk normality test
## 
## data:  fourcyl$hwy
## W = 0.9877, p-value = 0.0005239
shapiro.test(sixcyl$hwy)
## 
##  Shapiro-Wilk normality test
## 
## data:  sixcyl$hwy
## W = 0.9764, p-value = 2.809e-06
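
Model adequacy for a regression is usually checked on the residuals rather than on the raw response. The following is a sketch under stated assumptions: `toy` is simulated data with deliberately skewed noise (not the EPA data), `fit_toy` is a model fitted to it, and the 0.05 threshold `alpha` is chosen here for illustration.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): three cylinder groups of 40 vehicles
toy <- data.frame(cyl = rep(c(4, 6, 8), each = 40))
# Skewed (exponential) noise, so the residuals are clearly non-normal
toy$hwy <- 43 - 2.65 * toy$cyl + rexp(nrow(toy), rate = 0.5)

fit_toy <- lm(hwy ~ cyl, data = toy)
res <- residuals(fit_toy)

# H0: the residuals come from a normal distribution.
# A small p-value is evidence AGAINST normality.
sw <- shapiro.test(res)
alpha <- 0.05
normality_rejected <- sw$p.value < alpha

print(sw)
print(normality_rejected)
```

Calling `qqnorm(res)` followed by `qqline(res)` draws the matching Q-Q plot with a reference line, which makes the direction of any departure from normality easier to see than the test alone.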

4. References to the literature

http://www.epa.gov/fueleconomy/basicinformation.htm

5. Appendices

A summary of, or pointer to, the raw data

complete and documented R code