Recipes for the Design of Experiments: Recipe Outline

The Analysis of Covariance

Wei Zou

RPI

Oct 24, 2014

1. Setting

System under test

In this study, we design an experiment to investigate the factors that influenced cigarette consumption in the US from 1985 to 1995. To do so, we use the dataset “Cigarette” from the “Ecdat” package in R to examine whether variations in average tax, state personal income, or state have an effect on the variation in cigarette consumption per capita.

# Load Ecdat (provides the Cigarette dataset) from the default library path
library("Ecdat")

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

data1<-Cigarette
head(data1)

##   state year   cpi      pop packpc    income  tax avgprs  taxs
## 1    AL 1985 1.076  3973000  116.5  46014968 32.5 102.18 33.35
## 2    AR 1985 1.076  2327000  128.5  26210736 37.0 101.47 37.00
## 3    AZ 1985 1.076  3184000  104.5  43956936 31.0 108.58 36.17
## 4    CA 1985 1.076 26444000  100.4 447102816 26.0 107.84 32.10
## 5    CO 1985 1.076  3209000  113.0  49466672 31.0  94.27 31.00
## 6    CT 1985 1.076  3201000  109.3  60063368 42.0 128.02 51.48

attach(data1)

Factors and Levels

There are two factors in the dataset: “year” with 11 levels (1985 to 1995) and “state” with 48 levels (48 US states). In this study we examine the effect of “state”.

Year<-as.factor(year)
nlevels(Year)

## [1] 11

state<-as.factor(state)
nlevels(state)

## [1] 48

Continuous variables (if any)

There are six continuous variables in the dataset: “pop” is the state population; “packpc” is the number of packs per capita; “income” is the state's total nominal personal income; “tax” is the average state, federal, and average local excise taxes for the fiscal year; “avgprs” is the average price during the fiscal year, including sales taxes; and “taxs” is the average excise taxes for the fiscal year, including sales taxes. In this study, we analyze the effects of “tax” and “income”.

Response variables

The response variable in this study is “packpc”, the number of packs per capita. We test whether the variation in “packpc” is due to sample randomization or to other independent variables.

The Data: How is it organized and what does it look like?

The dataset “Cigarette” is panel data with 528 observations covering 48 US states from 1985 to 1995.

Randomization

The data were not collected in a designed randomized experiment, so there is no random assignment and no randomized execution order. However, the observations still span 48 states over 11 years.
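The panel structure described above (528 observations, 48 states, 11 years) can be checked directly. The following is only a minimal sketch: the helper name `is_balanced_panel` and the toy data frame are illustrative assumptions, not part of the original analysis. The final lines run the same check on `Cigarette` only if the Ecdat package happens to be installed.

```r
# Hypothetical helper: TRUE when every unit appears exactly once in every period.
is_balanced_panel <- function(df, unit, time) {
  counts <- table(df[[unit]], df[[time]])
  all(counts == 1)
}

# Toy example (made-up structure): 2 states x 3 years = 6 rows.
toy <- expand.grid(state = c("AL", "AR"), year = 1985:1987)
is_balanced_panel(toy, "state", "year")         # complete panel
is_balanced_panel(toy[-1, ], "state", "year")   # one row dropped

# Same check on the real data, if Ecdat is available.
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data("Cigarette", package = "Ecdat")
  print(dim(Cigarette))
  print(is_balanced_panel(Cigarette, "state", "year"))
}
```

A balanced panel matters here because each state then contributes the same number of years to the estimate of its “state” effect.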

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The purpose of this project is to analyze the factors that influenced cigarette consumption in the US from 1985 to 1995. The three selected factors are tax (continuous), total nominal state personal income (continuous), and state (categorical with 48 levels). Therefore:

H0: The variation in cigarette consumption is simply due to sample randomization.

HA: The variation in cigarette consumption is due to something else (in this study we test the effects of tax, income, and state).

What is the rationale for this design?

1. The effect of tax: we would expect that the higher the tax, the lower the cigarette consumption, so the effect of tax should be statistically significant with a negative sign.
2. The effect of income: the effect of income on cigarette consumption can go either way. On one hand, we may expect cigarettes to become more affordable as average income increases; on the other hand, people with higher incomes tend to be better educated about the drawbacks of smoking and have other ways to deal with stress and relax, so cigarette consumption may drop. We hope to find the overall trend in this project.
3. The effect of state: location may also affect cigarette consumption. For example, it is natural to expect that a state that is the “home of cigarettes” will have a higher consumption rate, while a state that promotes a “healthy lifestyle” will have a lower one.

Randomize: What is the Randomization Scheme?

As mentioned in the previous section, the data were not collected in a designed randomized experiment, so there is no random assignment and no randomized execution order. However, the observations still span 48 states over 11 years.

Replicate: Are there replicates and/or repeated measures?

There are no replicated/repeated measures in this experiment.

Block: Did you use blocking in the design?

No blocking is used in this design.
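One way to write down the analysis-of-covariance model implied by these hypotheses is sketched below. The symbols $\mu$, $\beta_1$, $\beta_2$, $\alpha_i$ and $\varepsilon_{it}$ are notation introduced here for illustration, and the normal-error assumption is the usual one for this kind of model rather than something stated in the source.

```latex
\[
  \text{packpc}_{it} = \mu + \beta_1\,\text{income}_{it} + \beta_2\,\text{tax}_{it}
                       + \alpha_i + \varepsilon_{it},
  \qquad \varepsilon_{it} \sim N(0, \sigma^2),
\]
\[
  i = 1, \dots, 48 \ \text{(states)}, \qquad t = 1985, \dots, 1995 .
\]
% One state serves as the reference level, so its \alpha_i is fixed at 0
% and the other 47 \alpha_i are differences from that state.
% The rationale above corresponds to expecting \beta_2 < 0,
% with the sign of \beta_1 left open.
```

Under this notation, H0 says that $\beta_1 = \beta_2 = 0$ and that all $\alpha_i$ are equal, so that the variation in packpc is noise around a common mean.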

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

In the exploratory data analysis, we plot histograms for the response variable and the two continuous variables, and draw boxplots to analyze the categorical independent variables.

1. Histograms: the distribution of cigarette consumption is roughly normal; the number of packs consumed per capita ranges from 0 to 200 across the 48 states, with most values around 100~120 packs per year per capita. The distribution of total nominal state personal income is right-skewed: most states have an income of around 50,000,000, while a few reach as high as 800,000,000 (values as recorded in “income”). The distribution of average tax is also roughly normal, with most values around 35~40 (as recorded in “tax”).
2. Boxplots: comparing the boxplot by “year” with the boxplot by “state”, the variation in median cigarette consumption among years is not as pronounced as the variation among states. A number of states have much higher consumption than others, indicating that the effect of “state” on the response variable is highly likely to be statistically significant.

plot(data1[,c(5,6,7)])

plot of chunk unnamed-chunk-3

par(mfrow=c(1,3));for (i in c(5,6,7)) hist(data1[,i],main = names(data1)[i])

plot of chunk unnamed-chunk-3

par(mfrow=c(1,1));
boxplot(packpc~year,data=data1, xlab="Year", ylab="Number of packs of cigarettes per capita")
title("Boxplot of Number of packs of cigarettes per capita in different years")

plot of chunk unnamed-chunk-3

boxplot(packpc~state,data=data1, xlab="State", ylab="Number of packs of cigarettes per capita")
title("Boxplot of Number of packs of cigarettes per capita in different states")

plot of chunk unnamed-chunk-3

Testing and Model Estimation

To test the hypothesis discussed previously, we estimate a linear regression model and conduct an analysis of covariance.

The linear regression shows that both continuous covariates are statistically significant and that “income” and “tax” both have negative coefficients: holding the other variables fixed, the higher the income or the tax, the lower the cigarette consumption per capita. For the categorical variable “state”, state “AL” is the baseline, and the remaining 47 states are compared with it to show the “geolocation” effect on cigarette consumption. For example, CA has a statistically significant positive coefficient, indicating that, at the same income and tax, cigarette consumption per capita in California is higher than in Alabama; similarly, we may conclude that consumption per capita in Washington state is statistically lower than in Alabama, while there is no statistically significant difference between Virginia and Alabama.

The analysis of covariance shows that the p-values for income, tax, and state are all below 2e-16; that is, effects this large would be very unlikely if the variation in the response were due to sample randomization alone. We therefore reject H0 and conclude that the variation in cigarette consumption per capita may be due to the effects of income, tax, and state. Note that “income” and “tax” each have 1 degree of freedom since they are continuous variables, while “state” has 47 degrees of freedom, as there are 48 levels in the factor (one level serves as the baseline).

model1 <-lm(packpc~income+tax+state, data=data1)
summary(model1)

## 
## Call:
## lm(formula = packpc ~ income + tax + state, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.640  -3.586  -0.499   3.345  25.600 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.45e+02   2.11e+00   68.54  < 2e-16 ***
## income      -9.72e-08   1.39e-08   -7.01  7.9e-12 ***
## tax         -8.10e-01   4.19e-02  -19.32  < 2e-16 ***
## stateAR      1.21e+01   2.65e+00    4.57  6.2e-06 ***
## stateAZ     -2.06e+01   2.58e+00   -8.00  9.7e-15 ***
## stateCA      2.80e+01   7.93e+00    3.54  0.00045 ***
## stateCO     -1.45e+01   2.58e+00   -5.63  3.1e-08 ***
## stateCT      1.33e+00   2.66e+00    0.50  0.61709    
## stateDE      1.51e+01   2.68e+00    5.64  2.9e-08 ***
## stateFL      2.35e+01   3.43e+00    6.85  2.3e-11 ***
## stateGA      5.36e+00   2.71e+00    1.98  0.04878 *  
## stateIA     -7.01e-01   2.68e+00   -0.26  0.79413    
## stateID     -2.69e+01   2.66e+00  -10.11  < 2e-16 ***
## stateIL      1.68e+01   3.33e+00    5.04  6.5e-07 ***
## stateIN      2.31e+01   2.63e+00    8.78  < 2e-16 ***
## stateKS     -8.69e+00   2.62e+00   -3.31  0.00099 ***
## stateKY      5.24e+01   2.62e+00   19.98  < 2e-16 ***
## stateLA      3.25e+00   2.57e+00    1.26  0.20757    
## stateMA      7.25e+00   2.67e+00    2.71  0.00690 ** 
## stateMD     -2.81e+00   2.62e+00   -1.07  0.28525    
## stateME      1.34e+01   2.80e+00    4.78  2.3e-06 ***
## stateMI      2.42e+01   2.87e+00    8.44  3.9e-16 ***
## stateMN      8.96e-01   2.66e+00    0.34  0.73633    
## stateMO      1.46e+01   2.62e+00    5.57  4.3e-08 ***
## stateMS     -3.54e+00   2.61e+00   -1.36  0.17517    
## stateMT     -2.51e+01   2.68e+00   -9.39  < 2e-16 ***
## stateNC      1.91e+01   2.83e+00    6.75  4.3e-11 ***
## stateND     -1.70e+01   2.82e+00   -6.02  3.6e-09 ***
## stateNE     -1.06e+01   2.71e+00   -3.92  0.00010 ***
## stateNH      5.61e+01   2.67e+00   20.99  < 2e-16 ***
## stateNJ      1.18e+01   2.89e+00    4.08  5.3e-05 ***
## stateNM     -4.02e+01   2.63e+00  -15.30  < 2e-16 ***
## stateNV      1.27e+01   2.72e+00    4.66  4.1e-06 ***
## stateNY      3.09e+01   4.99e+00    6.19  1.3e-09 ***
## stateOH      2.18e+01   3.18e+00    6.87  2.0e-11 ***
## stateOK     -3.49e+00   2.60e+00   -1.34  0.18063    
## stateOR      5.28e-01   2.65e+00    0.20  0.84208    
## statePA      1.55e+01   3.34e+00    4.63  4.7e-06 ***
## stateRI      1.19e+01   2.86e+00    4.17  3.6e-05 ***
## stateSC      5.81e-01   2.59e+00    0.22  0.82301    
## stateSD     -1.56e+01   2.73e+00   -5.71  2.0e-08 ***
## stateTN      1.26e+01   2.60e+00    4.86  1.6e-06 ***
## stateTX      1.48e+01   3.88e+00    3.82  0.00015 ***
## stateUT     -5.27e+01   2.66e+00  -19.81  < 2e-16 ***
## stateVA      1.70e+00   2.90e+00    0.59  0.55739    
## stateVT      1.15e+01   2.69e+00    4.26  2.4e-05 ***
## stateWA     -1.06e+01   2.64e+00   -4.00  7.2e-05 ***
## stateWI      2.39e+00   2.61e+00    0.92  0.36031    
## stateWV     -2.66e+00   2.63e+00   -1.01  0.31283    
## stateWY     -5.92e+00   2.65e+00   -2.24  0.02578 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.04 on 478 degrees of freedom
## Multiple R-squared:  0.938,  Adjusted R-squared:  0.932 
## F-statistic:  148 on 49 and 478 DF,  p-value: <2e-16
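To make the coefficient table concrete, the sketch below plugs the rounded estimates printed above into the fitted equation for two of the rows shown in head(data1). The variable names are hypothetical. Because the printed coefficients are rounded to three significant figures (the intercept in particular), the results are only approximate fitted values, not what predict(model1) would return exactly.

```r
# Rounded coefficients copied from the summary(model1) output above.
b0    <- 1.45e+02   # intercept (baseline state AL)
b_inc <- -9.72e-08  # income
b_tax <- -8.10e-01  # tax
b_AR  <- 1.21e+01   # shift for state AR relative to AL

# 1985 covariate values copied from head(data1).
pred_AL <- b0 + b_inc * 46014968 + b_tax * 32.5
pred_AR <- b0 + b_inc * 26210736 + b_tax * 37.0 + b_AR

round(c(AL = pred_AL, AR = pred_AR), 1)
# →    AL    AR
# → 114.2 124.6
# Observed packpc in head(data1): AL 116.5, AR 128.5
```

The tax coefficient can be read the same way: a 10-unit increase in “tax” is associated with roughly 8.1 fewer packs per capita, holding income and state fixed.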

anova(model1)

## Analysis of Variance Table
## 
## Response: packpc
##            Df Sum Sq Mean Sq F value Pr(>F)    
## income      1  18379   18379   504.4 <2e-16 ***
## tax         1  75687   75687  2077.0 <2e-16 ***
## state      47 170318    3624    99.4 <2e-16 ***
## Residuals 478  17419      36                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
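The table above can be cross-checked by hand. For a single lm fit, anova() reports sequential sums of squares, so the “state” row measures what state adds after income and tax are already in the model. The sketch below recomputes the F values, R-squared, and residual standard error from the printed sums of squares; the variable names are illustrative, and small differences from the printed values are expected because the inputs are rounded.

```r
# Sums of squares and degrees of freedom copied from the anova(model1) table.
ss <- c(income = 18379, tax = 75687, state = 170318, Residuals = 17419)
df <- c(income = 1,     tax = 1,     state = 47,     Residuals = 478)

ms       <- ss / df                                   # mean squares
f_values <- ms[c("income", "tax", "state")] / ms[["Residuals"]]

# Upper-tail p-value for the state F test (47 and 478 degrees of freedom).
p_state <- pf(f_values[["state"]], df[["state"]], df[["Residuals"]],
              lower.tail = FALSE)

r_squared <- 1 - ss[["Residuals"]] / sum(ss)          # multiple R-squared
rse       <- sqrt(ms[["Residuals"]])                  # residual standard error

round(f_values, 1)
round(c(R2 = r_squared, RSE = rse), 3)
```

The recomputed values agree with the summary(model1) output (R-squared 0.938, residual standard error 6.04) and with the printed F values to within rounding.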

Diagnostics/Model Adequacy Checking

To ensure that the results obtained in the previous section are valid, we check model adequacy as follows:

1. QQ plot: The QQ plot shows that the residuals generally follow the normal distribution assumption, and thus the application of ANOVA and linear regression is appropriate.
2. Residual vs. Fit plot: The Residual vs. Fit plot shows that the residuals are generally randomly distributed, and thus the model estimation results are reliable.

# qqplot
qqnorm(residuals(model1))
qqline(residuals(model1))

plot of chunk unnamed-chunk-5

plot(fitted(model1),residuals(model1))

plot of chunk unnamed-chunk-5
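The two plots can be complemented by simple numerical checks. The sketch below is an illustration on simulated stand-in data, not on the Cigarette data: the data frame `sim`, its variables, and the fitted object `fit` are all assumptions made so that the example runs on its own. With the real analysis, `model1` would take the place of `fit`.

```r
set.seed(1)

# Simulated stand-in data (NOT the Cigarette data): 6 groups, 2 covariates.
n_per <- 20
sim <- data.frame(
  g  = factor(rep(letters[1:6], each = n_per)),
  x1 = rnorm(6 * n_per),
  x2 = rnorm(6 * n_per)
)
sim$y <- 10 + 2 * sim$x1 - 1 * sim$x2 + as.integer(sim$g) + rnorm(nrow(sim))

# Same model form as in the text: two covariates plus one factor.
fit <- lm(y ~ x1 + x2 + g, data = sim)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals.
sw <- shapiro.test(residuals(fit))

# Rough constant-variance check: correlation between |residuals| and fitted values.
spread_cor <- cor(abs(residuals(fit)), fitted(fit))

c(shapiro_p = sw$p.value, spread_cor = spread_cor)
```

A Shapiro-Wilk p-value that is not small and a spread correlation near zero would be consistent with what the QQ plot and the Residual vs. Fit plot are meant to show.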

4. References to the literature

Stock, James H. and Mark W. Watson (2003) Introduction to Econometrics, Addison-Wesley Educational Publishers, http://wps.aw.com/aw_stockwatsn_economtrcs_1, chapter 10.

http://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf