Version as of August 28, 2014, superseding the version of August 24. Always use the most recent version.
In this study, we investigate the factors that influenced cigarette consumption in the US from 1985 to 1995. We use the dataset “Cigarette” from the “Ecdat” package in R to examine whether variation in the average tax, state personal income, or state has an effect on the variation of cigarette consumption per capita.
# Install Ecdat only if it is missing; a CRAN mirror must be given explicitly when knitting
if (!requireNamespace("Ecdat", quietly = TRUE)) install.packages("Ecdat", repos = "https://cloud.r-project.org")
library("Ecdat")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
data1<-Cigarette
head(data1)
## state year cpi pop packpc income tax avgprs taxs
## 1 AL 1985 1.076 3973000 116.5 46014968 32.5 102.18 33.35
## 2 AR 1985 1.076 2327000 128.5 26210736 37.0 101.47 37.00
## 3 AZ 1985 1.076 3184000 104.5 43956936 31.0 108.58 36.17
## 4 CA 1985 1.076 26444000 100.4 447102816 26.0 107.84 32.10
## 5 CO 1985 1.076 3209000 113.0 49466672 31.0 94.27 31.00
## 6 CT 1985 1.076 3201000 109.3 60063368 42.0 128.02 51.48
attach(data1)
There are two factors in the dataset: “year” with 11 levels (1985 to 1995) and “state” with 48 levels (48 US states). In this study we examine the effect of “state”.
Year<-as.factor(year)
nlevels(Year)
## [1] 11
state<-as.factor(state)
nlevels(state)
## [1] 48
Besides the price index “cpi”, there are six continuous variables in the dataset: “pop” is the state population, “packpc” is the number of packs per capita, “income” is the state's total nominal personal income, “tax” is the average state, federal and average local excise tax for the fiscal year, “avgprs” is the average price during the fiscal year, including sales taxes, and “taxs” is the average excise tax for the fiscal year, including sales taxes. In this study, we analyze the effects of “tax” and “income”.
The response variable in this study is “packpc”, the number of packs per capita. We test whether the variation in “packpc” is due to sample randomization or to the independent variables.
The dataset “Cigarette” is a panel dataset with 528 observations: 48 US states observed in each of the 11 years from 1985 to 1995.
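The panel dimensions can be verified directly: in a balanced panel every state-year pair occurs exactly once, so 48 states times 11 years gives 528 rows. The sketch below runs such a check on a small synthetic panel; the helper name `is_balanced_panel` and the toy data are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: TRUE if every (id, time) combination occurs exactly once.
is_balanced_panel <- function(df, id, time) {
  counts <- table(df[[id]], df[[time]])
  all(counts == 1)
}

# Toy panel: 3 units observed over 4 years (a stand-in for 48 states x 11 years).
toy <- expand.grid(state = c("AL", "AR", "AZ"), year = 1985:1988)

is_balanced_panel(toy, "state", "year")        # complete grid
is_balanced_panel(toy[-1, ], "state", "year")  # one state-year row removed
```

With the real data, `is_balanced_panel(data1, "state", "year")` together with `nrow(data1)` would be one way to confirm the 48 x 11 structure.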
The data were not collected in a designed randomized experiment, so they are not strictly random. However, since they cover 48 states over 11 years, we assume they can be treated as random for this analysis.
The purpose of this project is to analyze the factors influencing cigarette consumption in the US from 1985 to 1995. The three selected factors are tax (continuous), total nominal personal income (continuous) and state (categorical with 48 levels). Therefore:
H0: The variation in cigarette consumption is simply due to sample randomization.
HA: The variation in cigarette consumption is due to something else (in this study we test the effects of tax, income and state).
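One way to make these hypotheses concrete is to write them in terms of the regression model fitted later in the report. The notation below is an assumption introduced for illustration: $s$ indexes states, $t$ indexes years, and $\alpha_s$ is the state effect relative to a baseline state.

$$
\text{packpc}_{st} = \beta_0 + \beta_1\,\text{income}_{st} + \beta_2\,\text{tax}_{st} + \alpha_s + \varepsilon_{st}
$$

$$
H_0:\ \beta_1 = \beta_2 = 0 \ \text{and}\ \alpha_s = 0 \ \text{for all } s,
\qquad
H_A:\ \text{at least one of these coefficients is nonzero.}
$$

Under this reading, rejecting $H_0$ says that at least one of income, tax and state explains part of the variation in packs per capita.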
In the exploratory data analysis, we plot histograms for the response variable and the two continuous variables, and draw boxplots for the categorical independent variables.
1. Histograms: the distribution of cigarette consumption is approximately normal. The number of packs consumed per capita ranges from 0 to 200 across the 48 states, and most observations fall around 100 to 120 packs per year per capita. The distribution of total nominal personal income is right-skewed: most states have a value of around 50,000,000 in the dataset's units, while a few reach as high as 800,000,000. The distribution of the average tax is also approximately normal, with most states at around 35 to 40 in the dataset's units (apparently cents per pack).
2. Boxplots: comparing the boxplot by “year” with the boxplot by “state”, the median cigarette consumption varies visibly across years as well as across states. A number of states have much higher consumption than others, which suggests that the effect of “state” on the response variable is likely to be statistically significant.
# Pairwise scatterplots and histograms of packpc, income and tax (columns 5 to 7)
plot(data1[, c(5, 6, 7)])
par(mfrow = c(1, 3))
for (i in c(5, 6, 7)) hist(data1[, i], main = names(data1)[i], xlab = names(data1)[i])
par(mfrow = c(1, 1))
# Boxplots of the response by year and by state
boxplot(packpc ~ year, data = data1, xlab = "Year", ylab = "Number of packs of cigarettes per capita")
title("Boxplot of number of packs of cigarettes per capita in different years")
boxplot(packpc ~ state, data = data1, xlab = "State", ylab = "Number of packs of cigarettes per capita")
title("Boxplot of number of packs of cigarettes per capita in different states")
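The direction of skew read off a histogram can be cross-checked with a number: a positive sample skewness means a long right tail, a value near zero means rough symmetry. The sketch below is a minimal illustration on synthetic data; `sample_skewness` is a hypothetical helper written for this example, and the simulated variables only stand in for columns such as income or packpc.

```r
# Hypothetical helper: moment-based sample skewness (positive = long right tail).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
right_tailed <- rlnorm(500)  # synthetic stand-in for a variable with a few very large values
symmetric    <- rnorm(500)   # synthetic stand-in for a roughly normal variable

sample_skewness(right_tailed)  # clearly positive
sample_skewness(symmetric)     # close to zero
```

Applied to the real data, `sapply(data1[, c("packpc", "income", "tax")], sample_skewness)` would give one value per plotted variable.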
To test the hypotheses discussed previously, we estimate a linear regression model and conduct an analysis of covariance.
The linear regression shows that income, tax and state are all statistically significant. Both “income” and “tax” have negative coefficients, indicating that, holding the other variables fixed, higher income or higher tax is associated with lower cigarette consumption per capita. For the categorical variable “state”, “AL” is the baseline, and the remaining 47 states are compared to it to show the geographic effect on cigarette consumption. For example, CA has a statistically significant positive coefficient, indicating that, after adjusting for income and tax, cigarette consumption per capita in California is higher than in Alabama. Similarly, we may conclude that consumption per capita in Washington state is lower than in Alabama, while there is no statistically significant difference between Virginia and Alabama.
The analysis of covariance gives a p-value below 2e-16 for each term, so we reject H0 and conclude that the variation in cigarette consumption per capita is unlikely to be due to sample randomization alone and may be due to the effects of income, tax and state. Note that “income” and “tax” each have 1 degree of freedom because they are continuous variables, while “state” has 47 degrees of freedom because the factor has 48 levels and one level (the baseline) is absorbed into the intercept.
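# Note: this sets sum-to-zero contrasts only on the global copy `state`; lm() below uses data1$state, so the output shows the default treatment contrasts with AL as baseline
contrasts(state)=contr.sum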
model1 <-lm(packpc~income+tax+state, data=data1)
summary(model1)
##
## Call:
## lm(formula = packpc ~ income + tax + state, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.640 -3.586 -0.499 3.345 25.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.45e+02 2.11e+00 68.54 < 2e-16 ***
## income -9.72e-08 1.39e-08 -7.01 7.9e-12 ***
## tax -8.10e-01 4.19e-02 -19.32 < 2e-16 ***
## stateAR 1.21e+01 2.65e+00 4.57 6.2e-06 ***
## stateAZ -2.06e+01 2.58e+00 -8.00 9.7e-15 ***
## stateCA 2.80e+01 7.93e+00 3.54 0.00045 ***
## stateCO -1.45e+01 2.58e+00 -5.63 3.1e-08 ***
## stateCT 1.33e+00 2.66e+00 0.50 0.61709
## stateDE 1.51e+01 2.68e+00 5.64 2.9e-08 ***
## stateFL 2.35e+01 3.43e+00 6.85 2.3e-11 ***
## stateGA 5.36e+00 2.71e+00 1.98 0.04878 *
## stateIA -7.01e-01 2.68e+00 -0.26 0.79413
## stateID -2.69e+01 2.66e+00 -10.11 < 2e-16 ***
## stateIL 1.68e+01 3.33e+00 5.04 6.5e-07 ***
## stateIN 2.31e+01 2.63e+00 8.78 < 2e-16 ***
## stateKS -8.69e+00 2.62e+00 -3.31 0.00099 ***
## stateKY 5.24e+01 2.62e+00 19.98 < 2e-16 ***
## stateLA 3.25e+00 2.57e+00 1.26 0.20757
## stateMA 7.25e+00 2.67e+00 2.71 0.00690 **
## stateMD -2.81e+00 2.62e+00 -1.07 0.28525
## stateME 1.34e+01 2.80e+00 4.78 2.3e-06 ***
## stateMI 2.42e+01 2.87e+00 8.44 3.9e-16 ***
## stateMN 8.96e-01 2.66e+00 0.34 0.73633
## stateMO 1.46e+01 2.62e+00 5.57 4.3e-08 ***
## stateMS -3.54e+00 2.61e+00 -1.36 0.17517
## stateMT -2.51e+01 2.68e+00 -9.39 < 2e-16 ***
## stateNC 1.91e+01 2.83e+00 6.75 4.3e-11 ***
## stateND -1.70e+01 2.82e+00 -6.02 3.6e-09 ***
## stateNE -1.06e+01 2.71e+00 -3.92 0.00010 ***
## stateNH 5.61e+01 2.67e+00 20.99 < 2e-16 ***
## stateNJ 1.18e+01 2.89e+00 4.08 5.3e-05 ***
## stateNM -4.02e+01 2.63e+00 -15.30 < 2e-16 ***
## stateNV 1.27e+01 2.72e+00 4.66 4.1e-06 ***
## stateNY 3.09e+01 4.99e+00 6.19 1.3e-09 ***
## stateOH 2.18e+01 3.18e+00 6.87 2.0e-11 ***
## stateOK -3.49e+00 2.60e+00 -1.34 0.18063
## stateOR 5.28e-01 2.65e+00 0.20 0.84208
## statePA 1.55e+01 3.34e+00 4.63 4.7e-06 ***
## stateRI 1.19e+01 2.86e+00 4.17 3.6e-05 ***
## stateSC 5.81e-01 2.59e+00 0.22 0.82301
## stateSD -1.56e+01 2.73e+00 -5.71 2.0e-08 ***
## stateTN 1.26e+01 2.60e+00 4.86 1.6e-06 ***
## stateTX 1.48e+01 3.88e+00 3.82 0.00015 ***
## stateUT -5.27e+01 2.66e+00 -19.81 < 2e-16 ***
## stateVA 1.70e+00 2.90e+00 0.59 0.55739
## stateVT 1.15e+01 2.69e+00 4.26 2.4e-05 ***
## stateWA -1.06e+01 2.64e+00 -4.00 7.2e-05 ***
## stateWI 2.39e+00 2.61e+00 0.92 0.36031
## stateWV -2.66e+00 2.63e+00 -1.01 0.31283
## stateWY -5.92e+00 2.65e+00 -2.24 0.02578 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.04 on 478 degrees of freedom
## Multiple R-squared: 0.938, Adjusted R-squared: 0.932
## F-statistic: 148 on 49 and 478 DF, p-value: <2e-16
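The coefficient names in the summary (`stateAR`, `stateAZ`, ...) are what R's default treatment contrasts produce, with the first level as baseline. A `contrasts()` assignment on a free-standing vector does not change this, because `lm(..., data = ...)` looks up the variable inside the data frame first. The sketch below illustrates the difference on toy data; the data frame `toy` and its columns are assumptions made for the example.

```r
set.seed(1)
toy <- data.frame(g = factor(rep(c("a", "b", "c"), each = 10)))
toy$y <- c(1, 3, 5)[as.integer(toy$g)] + rnorm(30)

# Setting contrasts on a separate copy leaves toy$g untouched.
g <- toy$g
contrasts(g) <- contr.sum
fit_default <- lm(y ~ g, data = toy)  # uses toy$g: treatment contrasts, "a" is baseline
names(coef(fit_default))

# Setting contrasts on the column inside the data frame does take effect.
contrasts(toy$g) <- contr.sum
fit_sum <- lm(y ~ g, data = toy)      # coefficients are deviations from the grand mean
names(coef(fit_sum))

# Both codings describe the same fitted model; only the parameterization differs.
all.equal(fitted(fit_default), fitted(fit_sum))
```

With treatment contrasts each state coefficient is a difference from the baseline state; with sum contrasts it would be a difference from the average over states, so the choice changes the interpretation of individual coefficients but not the overall fit.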
anova(model1)
## Analysis of Variance Table
##
## Response: packpc
## Df Sum Sq Mean Sq F value Pr(>F)
## income 1 18379 18379 504.4 <2e-16 ***
## tax 1 75687 75687 2077.0 <2e-16 ***
## state 47 170318 3624 99.4 <2e-16 ***
## Residuals 478 17419 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
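When reading this table it helps to remember that `anova()` on a single `lm` fit reports sequential sums of squares: each term is tested after the terms listed before it, so the rows can change if the formula order changes. The sketch below shows this on synthetic, deliberately correlated predictors; all variable names and values are assumptions for illustration only.

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
y  <- 1 + 2 * x1 + 1 * x2 + rnorm(n)

fit_a <- lm(y ~ x1 + x2)  # x1 entered first
fit_b <- lm(y ~ x2 + x1)  # x1 entered last

ss_a <- anova(fit_a)["x1", "Sum Sq"]
ss_b <- anova(fit_b)["x1", "Sum Sq"]
c(x1_first = ss_a, x1_last = ss_b)  # the sum of squares for x1 depends on its position

# Marginal F tests: each term is tested given all the others, independent of order.
drop1(fit_a, test = "F")
```

For the cigarette model, `drop1(model1, test = "F")` would be one way to obtain order-independent tests for income, tax and state.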
To ensure that the results obtained in the previous section are valid, we check model adequacy as follows:
1. QQ-plot: the QQ-plot shows that the residuals generally follow the normal distribution assumption, so the application of ANOVA and linear regression is appropriate.
2. Residuals vs. fitted plot: the residuals vs. fitted plot shows that the residuals are scattered without an obvious pattern, so the model estimation results can be considered reliable.
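# Normal QQ-plot of the residuals
qqnorm(residuals(model1))
qqline(residuals(model1))
# Residuals vs. fitted values
plot(fitted(model1), residuals(model1), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)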
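The visual checks can be complemented with simple numeric diagnostics. The sketch below is a minimal example on a synthetic regression, not the cigarette model: it runs a Shapiro-Wilk test on the residuals and computes a rough indicator of non-constant variance. The toy data and object names are assumptions for illustration.

```r
set.seed(123)
toy <- data.frame(x = runif(100))
toy$y <- 2 + 3 * toy$x + rnorm(100)
toy_fit <- lm(y ~ x, data = toy)

# Shapiro-Wilk test on the residuals: a small p-value casts doubt on normality.
sw <- shapiro.test(residuals(toy_fit))
sw$p.value

# Rough constant-variance check: correlation between |residuals| and fitted values.
# Values far from zero suggest the spread of the residuals changes with the fit.
cor(abs(residuals(toy_fit)), fitted(toy_fit))
```

The same two calls could be applied to `model1`; with 528 observations the Shapiro-Wilk test can flag small departures from normality that matter little in practice, so it is best read alongside the QQ-plot rather than instead of it.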
Stock, James H. and Mark W. Watson (2003), Introduction to Econometrics, Addison-Wesley Educational Publishers, http://wps.aw.com/aw_stockwatsn_economtrcs_1, chapter 10.
Ecdat package documentation: http://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf