as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
This analysis uses cigarette consumption data to test the effect of one factor, with 2 explanatory variables, on the response variable of cigarette sales per pack per capita.
Below is the installation and initial examination of the dataset:
#Installing data package
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
#Attaching Cigarrette Consumption dataset from library
data<-Cigar
attach(data)
For this single factor analysis we will be examining the effect of the factor disposable income per capita (ndi) on the response variable cigarette sales in packs per capita (sales).
We will also use two explanatory variables to increase the precision of our analysis of covariance. These two variables will be the minimum price in adjoining states per pack of cigarettes (pimin) and the price per pack of cigarettes (price)
Our factor of disposable income per capita will be broken into 3 levels which are $0 to $8,000, $8,000 to $16,00 0, and >$16,000.
Our explanatory variables are both continuous. Minimum price in adjoining states range from 23.40 to 178.50 and the price of a pack of cigarrettes ranges from 23.40 to 201.90
#Setting 3 levels of Disposable Income
data$ndi[data$ndi>= 0 & data$ndi <= 8000 ] ="< 8,000"
data$ndi[as.numeric(data$ndi)> 8000 & as.numeric(data$ndi) <= 16000 ] ="8,000-16,000"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$ndi[data$ndi> 16000] = "> 16,000"
#Levels of Disposable Income
unique(data$ndi)
## [1] "< 8,000" "> 16,000"
#Summary Statistics of Explanatory Variable minimum price in adjoining states per pack
summary(data$pimin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.4 32.0 46.4 62.9 90.5 178.0
#Summary Statistics of explanatory variable price per pack of cigarettes
summary(data$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.4 34.8 52.3 68.7 98.1 202.0
#Overall Look at Data
head(data)
## state year price pop pop16 cpi ndi sales pimin
## 1 1 63 28.6 3383 2236 30.6 < 8,000 93.9 26.1
## 2 1 64 29.8 3431 2277 31.0 < 8,000 95.4 27.5
## 3 1 65 29.8 3486 2328 31.5 < 8,000 98.5 28.9
## 4 1 66 31.5 3524 2370 32.4 < 8,000 96.4 29.5
## 5 1 67 31.6 3533 2394 33.4 < 8,000 95.5 29.6
## 6 1 68 35.6 3522 2405 34.8 < 8,000 88.4 32.0
tail(data)
## state year price pop pop16 cpi ndi sales pimin
## 1375 51 87 102.7 490.0 357.0 113.6 > 16,000 110.4 106.2
## 1376 51 88 112.9 479.0 353.0 118.3 > 16,000 114.3 115.3
## 1377 51 89 118.6 475.0 352.0 124.0 > 16,000 111.4 123.0
## 1378 51 90 129.5 470.9 348.9 130.7 > 16,000 96.9 138.9
## 1379 51 91 127.0 477.1 355.2 136.2 > 16,000 109.1 143.6
## 1380 51 92 155.1 483.3 360.5 140.3 > 16,000 110.8 160.0
summary(data)
## state year price pop
## Min. : 1.0 Min. :63.0 Min. : 23.4 Min. : 319
## 1st Qu.:15.0 1st Qu.:70.0 1st Qu.: 34.8 1st Qu.: 1053
## Median :26.5 Median :77.5 Median : 52.3 Median : 3174
## Mean :26.8 Mean :77.5 Mean : 68.7 Mean : 4537
## 3rd Qu.:40.0 3rd Qu.:85.0 3rd Qu.: 98.1 3rd Qu.: 5280
## Max. :51.0 Max. :92.0 Max. :201.9 Max. :30703
## pop16 cpi ndi sales
## Min. : 215 Min. : 30.6 Length:1380 Min. : 53.4
## 1st Qu.: 781 1st Qu.: 38.8 Class :character 1st Qu.:107.9
## Median : 2315 Median : 62.9 Mode :character Median :121.2
## Mean : 3367 Mean : 73.6 Mean :124.0
## 3rd Qu.: 3914 3rd Qu.:107.6 3rd Qu.:133.2
## Max. :22920 Max. :140.3 Max. :297.9
## pimin
## Min. : 23.4
## 1st Qu.: 32.0
## Median : 46.4
## Mean : 62.9
## 3rd Qu.: 90.5
## Max. :178.5
All factors and variables in this data are continuous, however the per capita disposable income has been set into 3 levels for analysis.
For this experiment we will be focusing on the cigarette sales in packs per capita for the response variable being effected by the chosen factor and already being explained by the explanatory variables.
The data set is organized by the following variables: state, year, price, pop, pop16, cpi, ndi, sales, pimin
State and year are both identification variables used to identify which state the data is from as well as what year it was recorded in allowing for multiple records for each state over time.
Price represents the price per pack of cigarettes in that state.pimin is similar to price in that it represents the minimum price per pack of cigarettes in all adjoining states.
Pop and Pop16 are both population statistics. Pop represents the total population while pop16 is the population of people above the age of 16.
CPI is the consumer price index at the time which represents an overall price of goods.
NDI is a statistic relating to the per capita disposable income of residents.
sales is the sale of cigarettes in packs per capita
It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.
To perform the experiment the data will first be altered so that our factor has 3 levels. Then we will create linear models using the factor and the explanatory variables and an analysisof covariance will be performed on our model. From these analysis we will be able to test the null hypothesis, that sales of cigarettes in packs per capita is independant of disposable income.
The rationale for using an analysis of covariance test is used when we want to futher the precision of a factor on a response variable using explanatory variables. It checks whether the means of several groups are equal. The alternative would be to use a two-sample t-tests however there is more likely chance of the test resulting in a false positive of the hypothesis. We already know the explanatory variables have an effect on the repsonse variable and by including them we further the precision of our model.
The data was collected in an unknown way so we do not know if there was any randomization to it.
There are no replicated or repeated measures in the data. States had their data recorded between 1963 to 1992.
There was no blocking in the design of this experiment
To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors. We will also use this to make an initial determination as to whether or not our explanatory variables are acceptable
#Defining price per pack of cigarettes as a factor
data$price = as.factor(data$price)
#Defining minimum price in adjoining states per pack of cigarettes as a factor
data$pimin = as.factor(data$pimin)
#Defining per capita disposable income as a factor
data$ndi = as.factor(data$ndi)
#Boxplots of the means of the explanatory variables against the response variable of sales
boxplot(sales~price, data=data, xlab="Price per Pack", ylab="Sales in Packs per Capita")
boxplot(sales~pimin, data=data, xlab="Minimum Price Nearby", ylab="Sales in Packs per Capita")
#Boxplot of the factor of per capita disposable income against sales in packs per capita
boxplot(sales~ndi, data=data, xlab="Disposable Income", ylab="Sales in Packs per Capita")
We can’t discern any noticeable effects of the factor from the boxplots except to note that there is a slight negative trend as well as a very large number of outliers in our first level of less than $8,000 in disposable income. Examining our explanatory variables we see discernable negative effects that as the price and minimum price nearby increase the sales decrease.
First we generate a linear model with our factor and explanatory variables. Then to test the hypotheses we perform an ANOVA test on the factors including each explanatory variable with an “or” operator (using + instead of * in the analysis of variance)
The null hypothesis of the tests is that the single factor does not have an effect on the response variable of sales in packs per capita.
#Generation of Linear Model
model=lm(sales~ndi+price+pimin, data=data)
anova(model)
## Analysis of Variance Table
##
## Response: sales
## Df Sum Sq Mean Sq F value Pr(>F)
## ndi 1 43621 43621 47.82 3.1e-11 ***
## price 766 672582 878 0.96 0.66
## pimin 328 349195 1065 1.17 0.09 .
## Residuals 284 259056 912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For each of our analysis of covariance test we found a resulting p-value of 7.224e-11. The p-value represents the probability that we can get the F value using the null hypothesis. Since our probability is extremely close to 0 we can assume that each factor demonstrates an effect on the response variable. We are lead to reject the null hypothesis.
We can ignore the p-values of the explanatory values since they are only included to increase the precision of the estimation of the effect of the factor of disposable income per capita.
Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid.
#QQ plots for residuals of the model
qqnorm(residuals(model))
qqline(residuals(model))
Second we plot a fitted model against the residuals. We see a very large degree of variation among the plot.
#Plot of Fitted vs Residuals of model
plot(fitted(model), residuals(model))
Overrall the results of our model lead us to believe our model is not adequate and does not explain the effect of the disposable income per capita on the response variable of sale of cigarettes by pack per capita. This is largely due to our qq plot making the data appear non-parametric and a large amount of s ymmetry in our fitted vs. residuals model.
No literature was used
The R package in which this data was found can be located in the R library called Ecdat
It is possible that the results of our study are the result of chance. There are a large number of unidentified factors that could effect our response variable of sales. These factors include things such as taxes applied to the sale of a pack of cigarettes, local laws concerning the purchasing age of tobaccoo in that state and drug and tobaccoo education within that state. These factors could sway the publics view on cigarrettes and therefore atler their consumption. We also have to consider errors in the nature of the explanatory variables. While the minimum price of a pack of cigarettes in adjacent states appears to have an effect via the boxplots we have to consider geographic locaiton. It is hard to imply rationally that the price of cigarettes in a state that borders the western edge of the state to have an effect of the sales of cigarettes for anyone that lives outside of a feasible travel distance from the boarder which for many states could be the majority of the population.
The factor of disposable income per capita itself is also questionable. Cost of living in different areas could effect the disposable income and breakdown of how it is spent. In areas where there is a much higher cost of living less money would be available for the purchase of cigarettes. This problem could be mediated by possibly creating a factor that is some sort of combination of the consumer price index and disposable income per capita therefore taking into account the overall price of goods in the area.
Kruskal-Wallis can also do a non-parametric test of two different groups of means. As we can see from our following tests our p-values are close to 0 and therefore significant.
#Kruskal-Wallis test on explanatory variable price
kruskal.test(sales~price, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: sales by price
## Kruskal-Wallis chi-squared = 841.3, df = 766, p-value = 0.03004
#Kruskal-Wallis test on explanatory variable minimum price in adjacent state
kruskal.test(sales~pimin, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: sales by pimin
## Kruskal-Wallis chi-squared = 726.9, df = 465, p-value = 7.936e-14
#Kruskal-Wallis test on the factor disposable income per capita
kruskal.test(sales~ndi, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: sales by ndi
## Kruskal-Wallis chi-squared = 47.55, df = 1, p-value = 5.355e-12