Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

Design of Experiments: Recipe 6 - One Factor with Two Explanatory Variables Analysis of Covariance

Kevin Toth

RPI

10/29/2014 V2.0

1. Setting

Cigarette Consumption Data

This analysis uses cigarette consumption data to test the effect of one factor, with 2 explanatory variables, on the response variable of cigarette sales per pack per capita.

Below is the installation and initial examination of the dataset:

#Installing data package
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#Attaching Cigarrette Consumption dataset from library
data<-Cigar
attach(data)

Factors and Levels

For this single factor analysis we will be examining the effect of the factor disposable income per capita (ndi) on the response variable cigarette sales in packs per capita (sales).

We will also use two explanatory variables to increase the precision of our analysis of covariance. These two variables will be the minimum price in adjoining states per pack of cigarettes (pimin) and the price per pack of cigarettes (price)

Our factor of disposable income per capita will be broken into 3 levels which are $0 to $8,000, $8,000 to $16,00 0, and >$16,000.

Our explanatory variables are both continuous. Minimum price in adjoining states range from 23.40 to 178.50 and the price of a pack of cigarrettes ranges from 23.40 to 201.90

#Setting 3 levels of Disposable Income
data$ndi[data$ndi>= 0 & data$ndi <= 8000 ] ="< 8,000"
data$ndi[as.numeric(data$ndi)> 8000 & as.numeric(data$ndi) <= 16000 ] ="8,000-16,000"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$ndi[data$ndi> 16000] = "> 16,000"

#Levels of Disposable Income
unique(data$ndi)
## [1] "< 8,000"  "> 16,000"
#Summary Statistics of Explanatory Variable minimum price in adjoining states per pack
summary(data$pimin)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    23.4    32.0    46.4    62.9    90.5   178.0
#Summary Statistics of explanatory variable price per pack of cigarettes
summary(data$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    23.4    34.8    52.3    68.7    98.1   202.0
#Overall Look at Data
head(data)
##   state year price  pop pop16  cpi     ndi sales pimin
## 1     1   63  28.6 3383  2236 30.6 < 8,000  93.9  26.1
## 2     1   64  29.8 3431  2277 31.0 < 8,000  95.4  27.5
## 3     1   65  29.8 3486  2328 31.5 < 8,000  98.5  28.9
## 4     1   66  31.5 3524  2370 32.4 < 8,000  96.4  29.5
## 5     1   67  31.6 3533  2394 33.4 < 8,000  95.5  29.6
## 6     1   68  35.6 3522  2405 34.8 < 8,000  88.4  32.0
tail(data)
##      state year price   pop pop16   cpi      ndi sales pimin
## 1375    51   87 102.7 490.0 357.0 113.6 > 16,000 110.4 106.2
## 1376    51   88 112.9 479.0 353.0 118.3 > 16,000 114.3 115.3
## 1377    51   89 118.6 475.0 352.0 124.0 > 16,000 111.4 123.0
## 1378    51   90 129.5 470.9 348.9 130.7 > 16,000  96.9 138.9
## 1379    51   91 127.0 477.1 355.2 136.2 > 16,000 109.1 143.6
## 1380    51   92 155.1 483.3 360.5 140.3 > 16,000 110.8 160.0
summary(data)
##      state           year          price            pop       
##  Min.   : 1.0   Min.   :63.0   Min.   : 23.4   Min.   :  319  
##  1st Qu.:15.0   1st Qu.:70.0   1st Qu.: 34.8   1st Qu.: 1053  
##  Median :26.5   Median :77.5   Median : 52.3   Median : 3174  
##  Mean   :26.8   Mean   :77.5   Mean   : 68.7   Mean   : 4537  
##  3rd Qu.:40.0   3rd Qu.:85.0   3rd Qu.: 98.1   3rd Qu.: 5280  
##  Max.   :51.0   Max.   :92.0   Max.   :201.9   Max.   :30703  
##      pop16            cpi            ndi                sales      
##  Min.   :  215   Min.   : 30.6   Length:1380        Min.   : 53.4  
##  1st Qu.:  781   1st Qu.: 38.8   Class :character   1st Qu.:107.9  
##  Median : 2315   Median : 62.9   Mode  :character   Median :121.2  
##  Mean   : 3367   Mean   : 73.6                      Mean   :124.0  
##  3rd Qu.: 3914   3rd Qu.:107.6                      3rd Qu.:133.2  
##  Max.   :22920   Max.   :140.3                      Max.   :297.9  
##      pimin      
##  Min.   : 23.4  
##  1st Qu.: 32.0  
##  Median : 46.4  
##  Mean   : 62.9  
##  3rd Qu.: 90.5  
##  Max.   :178.5

Continuous variables (if any)

All factors and variables in this data are continuous, however the per capita disposable income has been set into 3 levels for analysis.

Response variables

For this experiment we will be focusing on the cigarette sales in packs per capita for the response variable being effected by the chosen factor and already being explained by the explanatory variables.

The Data: How is it organized and what does it look like?

The data set is organized by the following variables: state, year, price, pop, pop16, cpi, ndi, sales, pimin

State and year are both identification variables used to identify which state the data is from as well as what year it was recorded in allowing for multiple records for each state over time.

Price represents the price per pack of cigarettes in that state.pimin is similar to price in that it represents the minimum price per pack of cigarettes in all adjoining states.

Pop and Pop16 are both population statistics. Pop represents the total population while pop16 is the population of people above the age of 16.

CPI is the consumer price index at the time which represents an overall price of goods.

NDI is a statistic relating to the per capita disposable income of residents.

sales is the sale of cigarettes in packs per capita

Randomization

It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

To perform the experiment the data will first be altered so that our factor has 3 levels. Then we will create linear models using the factor and the explanatory variables and an analysisof covariance will be performed on our model. From these analysis we will be able to test the null hypothesis, that sales of cigarettes in packs per capita is independant of disposable income.

What is the rationale for this design?

The rationale for using an analysis of covariance test is used when we want to futher the precision of a factor on a response variable using explanatory variables. It checks whether the means of several groups are equal. The alternative would be to use a two-sample t-tests however there is more likely chance of the test resulting in a false positive of the hypothesis. We already know the explanatory variables have an effect on the repsonse variable and by including them we further the precision of our model.

Randomize: What is the Randomization Scheme?

The data was collected in an unknown way so we do not know if there was any randomization to it.

Replicate: Are there replicates and/or repeated measures?

There are no replicated or repeated measures in the data. States had their data recorded between 1963 to 1992.

Block: Did you use blocking in the design?

There was no blocking in the design of this experiment

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Descriptive summary

To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors. We will also use this to make an initial determination as to whether or not our explanatory variables are acceptable

#Defining price per pack of cigarettes as a factor
data$price = as.factor(data$price)

#Defining minimum price in adjoining states per pack of cigarettes as a factor
data$pimin = as.factor(data$pimin)

#Defining per capita disposable income as a factor
data$ndi = as.factor(data$ndi)

#Boxplots of the means of the explanatory variables against the response variable of sales
boxplot(sales~price, data=data, xlab="Price per Pack", ylab="Sales in Packs per Capita")

plot of chunk unnamed-chunk-3

boxplot(sales~pimin, data=data, xlab="Minimum Price Nearby", ylab="Sales in Packs per Capita")

plot of chunk unnamed-chunk-3

#Boxplot of the factor of per capita disposable income against sales in packs per capita
boxplot(sales~ndi, data=data, xlab="Disposable Income", ylab="Sales in Packs per Capita")

plot of chunk unnamed-chunk-3

We can’t discern any noticeable effects of the factor from the boxplots except to note that there is a slight negative trend as well as a very large number of outliers in our first level of less than $8,000 in disposable income. Examining our explanatory variables we see discernable negative effects that as the price and minimum price nearby increase the sales decrease.

Testing

First we generate a linear model with our factor and explanatory variables. Then to test the hypotheses we perform an ANOVA test on the factors including each explanatory variable with an “or” operator (using + instead of * in the analysis of variance)

The null hypothesis of the tests is that the single factor does not have an effect on the response variable of sales in packs per capita.

#Generation of Linear Model
model=lm(sales~ndi+price+pimin, data=data)
anova(model)
## Analysis of Variance Table
## 
## Response: sales
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## ndi         1  43621   43621   47.82 3.1e-11 ***
## price     766 672582     878    0.96    0.66    
## pimin     328 349195    1065    1.17    0.09 .  
## Residuals 284 259056     912                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For each of our analysis of covariance test we found a resulting p-value of 7.224e-11. The p-value represents the probability that we can get the F value using the null hypothesis. Since our probability is extremely close to 0 we can assume that each factor demonstrates an effect on the response variable. We are lead to reject the null hypothesis.

We can ignore the p-values of the explanatory values since they are only included to increase the precision of the estimation of the effect of the factor of disposable income per capita.

Diagnostics/Model Adequacy Checking

Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid.

#QQ plots for residuals of the model
qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-5

Second we plot a fitted model against the residuals. We see a very large degree of variation among the plot.

#Plot of Fitted vs Residuals of model
plot(fitted(model), residuals(model))

plot of chunk unnamed-chunk-6

Overrall the results of our model lead us to believe our model is not adequate and does not explain the effect of the disposable income per capita on the response variable of sale of cigarettes by pack per capita. This is largely due to our qq plot making the data appear non-parametric and a large amount of s ymmetry in our fitted vs. residuals model.

4. References to the literature

No literature was used

5. Appendices

The R package in which this data was found can be located in the R library called Ecdat

6. Contingencies

It is possible that the results of our study are the result of chance. There are a large number of unidentified factors that could effect our response variable of sales. These factors include things such as taxes applied to the sale of a pack of cigarettes, local laws concerning the purchasing age of tobaccoo in that state and drug and tobaccoo education within that state. These factors could sway the publics view on cigarrettes and therefore atler their consumption. We also have to consider errors in the nature of the explanatory variables. While the minimum price of a pack of cigarettes in adjacent states appears to have an effect via the boxplots we have to consider geographic locaiton. It is hard to imply rationally that the price of cigarettes in a state that borders the western edge of the state to have an effect of the sales of cigarettes for anyone that lives outside of a feasible travel distance from the boarder which for many states could be the majority of the population.

The factor of disposable income per capita itself is also questionable. Cost of living in different areas could effect the disposable income and breakdown of how it is spent. In areas where there is a much higher cost of living less money would be available for the purchase of cigarettes. This problem could be mediated by possibly creating a factor that is some sort of combination of the consumer price index and disposable income per capita therefore taking into account the overall price of goods in the area.

Kruskal-Wallis can also do a non-parametric test of two different groups of means. As we can see from our following tests our p-values are close to 0 and therefore significant.

#Kruskal-Wallis test on explanatory variable price
kruskal.test(sales~price, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  sales by price
## Kruskal-Wallis chi-squared = 841.3, df = 766, p-value = 0.03004
#Kruskal-Wallis test on explanatory variable minimum price in adjacent state
kruskal.test(sales~pimin, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  sales by pimin
## Kruskal-Wallis chi-squared = 726.9, df = 465, p-value = 7.936e-14
#Kruskal-Wallis test on the factor disposable income per capita
kruskal.test(sales~ndi, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  sales by ndi
## Kruskal-Wallis chi-squared = 47.55, df = 1, p-value = 5.355e-12