Recipe for Descriptive Statistics

Ali Svoobda

RPI

9/30/14 V.1

1. Setting

System under test

For this recipe, the vehicles dataset within the fueleconomy package will be examined. Specifically, we will examine the effect of two factors with multiple levels (make and fuel) on the city fuel economy in mpg.

To access the package and save the table into the workspace:

# A CRAN mirror must be named when installing non-interactively (e.g. while knitting)
install.packages("fueleconomy", repos = "https://cloud.r-project.org")
library(fueleconomy)
data<-vehicles
fix(data)
View(data)

*Note: on my interface, data read into the workspace must be confirmed in a separate window (fix(data)) before it is accessible to view in RStudio. Most users can skip this step.

Subset a smaller set of manufacturers so the model doesn't have too many interactions to compute:

x<-subset(data,data$make=="Acura" | data$make=="Audi" | data$make=="Chevrolet" | data$make=="Dodge")
View(x)
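As a side note, the same subset can be written with set membership instead of chained comparisons. The sketch below uses a small made-up data frame (toy, with hypothetical rows that are not EPA values) to check that the two forms select the same rows; the names toy and keep are introduced here for illustration only.

```r
# Toy stand-in for the vehicles table (hypothetical rows, not EPA data)
toy <- data.frame(
  make = c("Acura", "Audi", "BMW", "Chevrolet", "Dodge", "Ford"),
  cty  = c(20, 18, 17, 15, 14, 16),
  stringsAsFactors = FALSE
)
keep <- c("Acura", "Audi", "Chevrolet", "Dodge")

# Chained comparisons, in the style used in this recipe
a <- subset(toy, toy$make == "Acura" | toy$make == "Audi" |
                 toy$make == "Chevrolet" | toy$make == "Dodge")

# Set membership: same rows, easier to extend with more makes
b <- subset(toy, make %in% keep)

identical(a, b)
```

With the real table, the equivalent call would be subset(data, make %in% keep).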

Factors and Levels

For this experiment, we will look at two factors, make and fuel. Make, which is the manufacturer, has 128 levels in the original dataset. But in the subset created above, it includes 4 (Acura, Audi, Chevrolet, Dodge). Fuel, which is the type of fuel the car requires, has 13 levels in the original dataset vehicles. With the subset, it now has 10.
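The level counts quoted above can be checked by counting distinct values. The following is a minimal sketch on a made-up data frame (toy is hypothetical, not EPA data); running the same calls on data and x should reproduce the counts in the text. It also shows why unused levels can linger if a column is converted to a factor before subsetting.

```r
# Toy stand-in with the same column names (hypothetical rows)
toy <- data.frame(
  make = c("Acura", "Acura", "Audi", "Dodge", "Dodge"),
  fuel = c("Regular", "Premium", "Premium", "Regular", "Diesel"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each candidate factor
n_make <- length(unique(toy$make))
n_fuel <- length(unique(toy$fuel))
c(make = n_make, fuel = n_fuel)

# A factor remembers levels that no longer occur after subsetting
f    <- factor(toy$make)
kept <- f[f != "Audi"]
nlevels(kept)              # unused level "Audi" is still counted
nlevels(droplevels(kept))  # droplevels() removes it
```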

In the vehicles dataset as loaded, R reads make and fuel as character vectors:

str(x)
## 'data.frame':    6852 obs. of  12 variables:
##  $ id   : num  13309 13310 13311 14038 14039 ...
##  $ make : chr  "Acura" "Acura" "Acura" "Acura" ...
##  $ model: chr  "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
##  $ year : num  1997 1997 1997 1998 1998 ...
##  $ class: chr  "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
##  $ trans: chr  "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
##  $ drive: chr  "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
##  $ cyl  : num  4 4 6 4 4 6 4 4 6 5 ...
##  $ displ: num  2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : num  26 28 26 27 29 26 27 29 26 23 ...
##  $ cty  : num  20 22 18 19 21 17 20 21 17 18 ...

Save make and fuel as factors:

x$make=as.factor(x$make)
x$fuel=as.factor(x$fuel)
str(x)
## 'data.frame':    6852 obs. of  12 variables:
##  $ id   : num  13309 13310 13311 14038 14039 ...
##  $ make : Factor w/ 4 levels "Acura","Audi",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model: chr  "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
##  $ year : num  1997 1997 1997 1998 1998 ...
##  $ class: chr  "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
##  $ trans: chr  "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
##  $ drive: chr  "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
##  $ cyl  : num  4 4 6 4 4 6 4 4 6 5 ...
##  $ displ: num  2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
##  $ fuel : Factor w/ 10 levels "CNG","Diesel",..: 10 10 10 10 10 10 10 10 10 7 ...
##  $ hwy  : num  26 28 26 27 29 26 27 29 26 23 ...
##  $ cty  : num  20 22 18 19 21 17 20 21 17 18 ...

Now they are stored as factors and R will treat them as such.

Continuous Variables

The continuous variables in the dataset are displ (engine displacement in litres), hwy (highway fuel economy) and cty (city fuel economy).

Response Variables

For this experiment, cty will be used as the response variable.

The Data: How is it organized and what does it look like?

The data has the following variables: id (EPA identifier), make, model, year, class (EPA vehicle size class), trans, drive, cyl, displ, fuel, hwy, cty.

For more on the vehicles dataset and to view the first and last 6 observations:

?vehicles
## starting httpd help server ... done
head(x)
##       id  make       model year           class           trans
## 8  13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 9  13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars    Manual 5-spd
## 10 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 11 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
## 12 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars    Manual 5-spd
## 13 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
##                drive cyl displ    fuel hwy cty
## 8  Front-Wheel Drive   4   2.2 Regular  26  20
## 9  Front-Wheel Drive   4   2.2 Regular  28  22
## 10 Front-Wheel Drive   6   3.0 Regular  26  18
## 11 Front-Wheel Drive   4   2.3 Regular  27  19
## 12 Front-Wheel Drive   4   2.3 Regular  29  21
## 13 Front-Wheel Drive   6   3.0 Regular  26  17
tail(x)
##          id  make           model year                  class
## 10245  9362 Dodge W250 Pickup 4WD 1992 Standard Pickup Trucks
## 10246  9363 Dodge W250 Pickup 4WD 1992 Standard Pickup Trucks
## 10247 10358 Dodge W250 Pickup 4WD 1993 Standard Pickup Trucks
## 10248 10359 Dodge W250 Pickup 4WD 1993 Standard Pickup Trucks
## 10249 10360 Dodge W250 Pickup 4WD 1993 Standard Pickup Trucks
## 10250 10361 Dodge W250 Pickup 4WD 1993 Standard Pickup Trucks
##                 trans                      drive cyl displ    fuel hwy cty
## 10245 Automatic 4-spd 4-Wheel or All-Wheel Drive   8   5.9 Regular  12  10
## 10246    Manual 5-spd 4-Wheel or All-Wheel Drive   8   5.9 Regular  12   8
## 10247 Automatic 4-spd 4-Wheel or All-Wheel Drive   8   5.2 Regular  15  11
## 10248    Manual 5-spd 4-Wheel or All-Wheel Drive   8   5.2 Regular  15  12
## 10249 Automatic 4-spd 4-Wheel or All-Wheel Drive   8   5.9 Regular  14  10
## 10250    Manual 5-spd 4-Wheel or All-Wheel Drive   8   5.9 Regular  13  10

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The experiment will look at two factors (make and fuel type) and see how they affect the city fuel economy.

The null hypothesis is that the variation in city fuel economy cannot be explained by anything other than randomization. This will be tested by creating a model using the factors and then running an analysis of variance on the model. Four models will be created: one for each factor separately, one with interaction between the factors, and one with blocking between the factors.
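The four model structures described here map onto four R formulas. The sketch below fits them to simulated data (toy and its values are invented for illustration, so the fits mean nothing in themselves) and uses the data argument of aov rather than x$ prefixes, which is an alternative style and not the one used later in this recipe.

```r
set.seed(1)

# Simulated balanced layout: 2 makes x 3 fuels x 4 cars each (hypothetical)
toy <- expand.grid(
  make         = c("A", "B"),
  fuel         = c("Regular", "Premium", "Diesel"),
  replicate_id = 1:4
)
toy$cty <- rnorm(nrow(toy), mean = 16, sd = 2)

m1 <- aov(cty ~ fuel,        data = toy)  # fuel only
m2 <- aov(cty ~ make,        data = toy)  # make only
m3 <- aov(cty ~ fuel * make, data = toy)  # main effects plus interaction
m4 <- aov(cty ~ fuel + make, data = toy)  # additive, no interaction

# Residual degrees of freedom shrink as terms are added
resid_df <- sapply(list(m1, m2, m3, m4), df.residual)
resid_df
```

The formula fuel * make is shorthand for fuel + make + fuel:make, which is why the interaction model spends more degrees of freedom than the additive one.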

What is the Rationale for this design?

This design was chosen to examine whether the manufacturer and/or the type of fuel affects the fuel economy of the car.

Randomize: What is the Randomization Scheme?

The data in the vehicles set is a list of vehicles for which the EPA has fuel economy data from 1985-2015. It is unknown how the city and highway fuel economies were collected.

Replicate: Are there replicates and/or repeated measures?

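There are no repeated measures, as each vehicle (id) appears only once in the table. The make and fuel combinations themselves are not unique, however: the first six rows of the subset are all Acura/Regular, so many combinations contain multiple vehicles, in unequal numbers.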
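One way to check how many observations fall into each make and fuel combination is a two-way contingency table. This is a minimal sketch on made-up rows (toy is hypothetical); on the real subset the call would be table(x$make, x$fuel).

```r
# Toy stand-in (hypothetical rows, not EPA data)
toy <- data.frame(
  make = c("Acura", "Acura", "Acura", "Audi", "Audi", "Dodge"),
  fuel = c("Regular", "Regular", "Premium", "Premium", "Premium", "Regular")
)

# Count of rows in every make x fuel cell
cells <- table(toy$make, toy$fuel)
cells

has_replicates <- any(cells > 1)   # is any combination observed more than once?
n_empty_cells  <- sum(cells == 0)  # combinations that never occur
c(has_replicates = has_replicates, n_empty_cells = n_empty_cells)
```

Empty cells matter for the interaction model, since an interaction cannot be estimated for a combination that never occurs.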

Block: Did you use blocking in the design?

Blocking was not required in creating the vehicles dataset but will be used between the two factors in the 4th model.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Summary statistics for the subset of the vehicles dataset:

summary(x)
##        id               make         model                year     
##  Min.   :    3   Acura    : 269   Length:6852        Min.   :1984  
##  1st Qu.: 6363   Audi     : 772   Class :character   1st Qu.:1989  
##  Median :14193   Chevrolet:3461   Mode  :character   Median :1996  
##  Mean   :15300   Dodge    :2350                      Mean   :1997  
##  3rd Qu.:23612                                       3rd Qu.:2006  
##  Max.   :34931                                       Max.   :2015  
##                                                                    
##     class              trans              drive                cyl       
##  Length:6852        Length:6852        Length:6852        Min.   : 3.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 6.21  
##                                                           3rd Qu.: 8.00  
##                                                           Max.   :12.00  
##                                                           NA's   :4      
##      displ                   fuel           hwy             cty     
##  Min.   :1.00   Regular        :4942   Min.   : 10.0   Min.   :  8  
##  1st Qu.:2.50   Premium        :1227   1st Qu.: 17.0   1st Qu.: 13  
##  Median :3.90   Gasoline or E85: 370   Median : 22.0   Median : 15  
##  Mean   :3.87   Diesel         : 252   Mean   : 21.8   Mean   : 16  
##  3rd Qu.:5.20   Midgrade       :  22   3rd Qu.: 25.0   3rd Qu.: 18  
##  Max.   :8.40   CNG            :  13   Max.   :109.0   Max.   :128  
##  NA's   :4      (Other)        :  26

Average City MPG for each fuel type and make:

tapply(x$cty,x$fuel,mean)
##                        CNG                     Diesel 
##                      11.38                      17.00 
##                Electricity            Gasoline or E85 
##                      61.25                      15.17 
##    Gasoline or natural gas                   Midgrade 
##                      16.75                      14.77 
##                    Premium Premium Gas or Electricity 
##                      16.90                      35.00 
##             Premium or E85                    Regular 
##                      20.00                      15.80
tapply(x$cty,x$make,mean)
##     Acura      Audi Chevrolet     Dodge 
##     18.61     17.17     16.16     15.19

The means differ considerably by fuel type (for example, 61.25 for Electricity versus 11.38 for CNG), so it appears likely that fuel type can explain some of the variation in city mpg.
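The two tapply calls above give one-way means. Passing a list of both factors gives the mean for each fuel and make combination, and using length instead of mean shows how many cars each mean rests on. A minimal sketch, assuming a made-up data frame toy (hypothetical values):

```r
# Toy stand-in (hypothetical rows, not EPA data)
toy <- data.frame(
  make = c("Acura", "Acura", "Audi", "Audi", "Audi"),
  fuel = c("Regular", "Premium", "Regular", "Regular", "Premium"),
  cty  = c(20, 18, 17, 19, 16)
)

# Mean city mpg in each fuel x make cell (rows = fuel, columns = make)
cell_means <- tapply(toy$cty, list(toy$fuel, toy$make), mean)
cell_means

# Group sizes: a mean based on very few cars deserves less weight
group_sizes <- tapply(toy$cty, toy$fuel, length)
group_sizes
```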

Boxplots:

boxplot(x$cty~x$make, xlab="Make", ylab="City MPG")

[Plot: boxplot of city MPG by make]

boxplot(x$cty~x$fuel, xlab="Fuel Type", ylab="City MPG")

[Plot: boxplot of city MPG by fuel type]

Based on the medians of the four makes, make may not explain the variation in city MPG. However, the outliers are worth pointing out. They may represent cars that run on alternative fuels such as electricity, which would appear as outliers because the majority of cars from each make run on conventional fuel types.

Although not all fuel types may explain the variation in MPG, some (for example, Electricity) appear to be able to explain the variation in city MPG.

Testing

Model 1

First, we will create a model to examine the effect of fuel type on city mpg:

model1=aov(x$cty~x$fuel)
anova(model1)
## Analysis of Variance Table
## 
## Response: x$cty
##             Df Sum Sq Mean Sq F value Pr(>F)    
## x$fuel       9  11807    1312    79.1 <2e-16 ***
## Residuals 6842 113426      17                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is very small (< 2e-16): variation in city mpg this large among fuel types would be very unlikely if it were due to randomization alone. Therefore, we reject the null hypothesis. When fuel type is the only predicting factor, it is likely that it explains some of the variation in city mpg.
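To make the Model 1 table easier to read, the F value can be rebuilt by hand from its sums of squares and degrees of freedom. The sketch below uses the rounded figures printed in the table, so the result agrees only to the precision shown; the variable names are introduced here for illustration.

```r
# Figures copied from the Model 1 ANOVA table (rounded as printed)
ss_fuel  <- 11807;  df_fuel  <- 9
ss_resid <- 113426; df_resid <- 6842

# Mean square = sum of squares / degrees of freedom
ms_fuel  <- ss_fuel / df_fuel
ms_resid <- ss_resid / df_resid

# F = between-group mean square / residual mean square
f_value <- ms_fuel / ms_resid
round(c(ms_fuel = ms_fuel, ms_resid = ms_resid, f_value = f_value), 1)

# Upper-tail probability of the F distribution with (9, 6842) df
p_value <- pf(f_value, df_fuel, df_resid, lower.tail = FALSE)
p_value < 2e-16
```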

Model 2

Now we create a similar model, but to test manufacturer:

model2=aov(x$cty~x$make)
anova(model2)
## Analysis of Variance Table
## 
## Response: x$cty
##             Df Sum Sq Mean Sq F value Pr(>F)    
## x$make       3   4496    1499      85 <2e-16 ***
## Residuals 6848 120737      18                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, we reject the null hypothesis that the variation in city mpg is due to randomization. It is likely that the manufacturer explains the variation in city mpg.

Model 3

Model to test interaction between manufacturer and fuel type:

model3=aov(x$cty~x$fuel*x$make)
anova(model3)
## Analysis of Variance Table
## 
## Response: x$cty
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## x$fuel           9  11807    1312    83.0 <2e-16 ***
## x$make           3   3036    1012    64.0 <2e-16 ***
## x$fuel:x$make    8   2424     303    19.2 <2e-16 ***
## Residuals     6831 107965      16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 4

Additive model (no interaction term) to test whether fuel type or make affects city mpg:

model4=aov(x$cty~x$fuel+x$make)
anova(model4)
## Analysis of Variance Table
## 
## Response: x$cty
##             Df Sum Sq Mean Sq F value Pr(>F)    
## x$fuel       9  11807    1312    81.3 <2e-16 ***
## x$make       3   3036    1012    62.7 <2e-16 ***
## Residuals 6839 110389      16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All four models produce similar results: every term is significant (p < 2e-16), including the fuel:make interaction in Model 3. It is likely that both fuel and make can explain some of the variation in city mpg.
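With 6852 observations, even small effects yield tiny p-values, so it can help to look at the share of the total sum of squares attributed to each term. The sketch below computes these shares from the rounded figures in the tables above; this is a supplementary calculation, not part of the original analysis. Note that anova() on an aov fit reports sequential sums of squares, which is why make accounts for 4496 on its own in Model 2 but 3036 after fuel in Models 3 and 4.

```r
# Total sum of squares, from Model 1 (fuel + residuals, rounded as printed)
ss_total <- 11807 + 113426

# Share of variation explained when each factor is fitted alone
share_fuel_alone <- 11807 / ss_total  # Model 1
share_make_alone <- 4496 / ss_total   # Model 2
round(c(fuel = share_fuel_alone, make = share_make_alone), 3)

# Sequential decomposition from Model 3
ss_model3 <- c(fuel = 11807, make = 3036, interaction = 2424, residual = 107965)
share_model3 <- ss_model3 / sum(ss_model3)
round(share_model3, 3)
```

On these figures the residual share stays above 0.85 in every model, so most of the variation in city mpg is left unexplained by make and fuel.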

Diagnostics/Model Adequacy Checking

Visually inspect normality of the residuals:

qqnorm(residuals(model3))
qqline(residuals(model3))

[Plot: normal Q-Q plot of the Model 3 residuals]

From this plot, the residuals appear as though they may be normally distributed.

Fitted vs Residuals Plot:

plot(fitted(model3),residuals(model3))

[Plot: Model 3 residuals versus fitted values]

The plot is more clustered than we would like to see, suggesting the model may not be a quality fit of the data.

Interaction Plot:

# interaction.plot() expects (x.factor, trace.factor, response), so make and fuel stay as factors
interaction.plot(x$fuel, x$make, x$cty)

[Plot: interaction plot of mean city MPG by fuel type, one line per make]

From the changes in slope and the intersecting lines, it is evident that there is interaction between the factors.

4. References to the Literature

None used.

5. Contingencies

Since it was not clear whether the data were normally distributed, a non-parametric test such as the Kruskal-Wallis rank sum test could be used.
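A minimal sketch of that contingency, assuming simulated data (toy and its values are invented for illustration): kruskal.test from the stats package compares one grouping factor at a time, so with the real subset it would be run once per factor, for example kruskal.test(cty ~ fuel, data = x).

```r
set.seed(42)

# Simulated city mpg for three fuel types (hypothetical values)
toy <- data.frame(
  fuel = factor(rep(c("Regular", "Premium", "Diesel"), each = 10)),
  cty  = c(rnorm(10, mean = 16, sd = 2),
           rnorm(10, mean = 17, sd = 2),
           rnorm(10, mean = 20, sd = 2))
)

# Rank-based test of whether the groups share a common location
kw <- kruskal.test(cty ~ fuel, data = toy)
kw$statistic  # Kruskal-Wallis chi-squared
kw$parameter  # degrees of freedom = number of groups - 1
kw$p.value
```

Because the test works on ranks, extreme values such as the very high city mpg of electric cars have less influence than they do in the analysis of variance.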

6. Appendices

Link to raw data

Complete R Code

All included above.