I chose the panel data set “CigarettesSW” (from the AER package), which records cigarette consumption in 48 U.S. states for the years 1985 and 1995. The variables, listed in the glimpse() output below, are state, year, cpi, population, packs, income, tax, price and taxs.
The panel appears balanced: every state has one entry for each of the two years, which gives 48 × 2 = 96 rows. The time component is year (1985, 1995) and the entity component is the U.S. state.
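The balance claim can also be checked directly rather than read off the plots. The following is a minimal sketch in base R; the helper name `is_balanced` and the toy data frame are my own illustration (made-up values), and the last commented line assumes CigarettesSW has already been loaded.

```r
# Hypothetical helper: a panel is balanced when every entity appears
# exactly once in every time period.
is_balanced <- function(df, id, time) {
  counts <- table(df[[id]], df[[time]])  # entity-by-period cell counts
  all(counts == 1)
}

# Toy panel (made-up values): 3 states x 2 years
toy <- data.frame(
  state = rep(c("AL", "AR", "AZ"), times = 2),
  year  = rep(c(1985, 1995), each = 3),
  packs = c(116, 128, 104, 101, 111, 84)
)

is_balanced(toy, "state", "year")        # TRUE
is_balanced(toy[-1, ], "state", "year")  # FALSE: AL is missing its 1985 row
# is_balanced(CigarettesSW, "state", "year")  # assumes the data are loaded
```

Counting cells in the state-by-year table catches both missing and duplicated rows, which a simple row count would not.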
data("CigarettesSW", package = "AER")  # the data set ships with the AER package
dplyr::glimpse(CigarettesSW)
## Rows: 96
## Columns: 9
## $ state <fct> AL, AR, AZ, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KS, KY,…
## $ year <fct> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985,…
## $ cpi <dbl> 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.0…
## $ population <dbl> 3973000, 2327000, 3184000, 26444000, 3209000, 3201000, 6180…
## $ packs <dbl> 116.4863, 128.5346, 104.5226, 100.3630, 112.9635, 109.2784,…
## $ income <dbl> 46014968, 26210736, 43956936, 447102816, 49466672, 60063368…
## $ tax <dbl> 32.50000, 37.00000, 31.00000, 26.00000, 31.00000, 42.00000,…
## $ price <dbl> 102.18167, 101.47500, 108.57875, 107.83734, 94.26666, 128.0…
## $ taxs <dbl> 33.34834, 37.00000, 36.17042, 32.10400, 31.00000, 51.48333,…
# Order factor levels by descending frequency so the bars are sorted
reorder_size <- function(x) {
  factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}
# Observations per year (a balanced panel shows 48 in each year)
ggplot(data = CigarettesSW, aes(x = reorder_size(year))) +
  geom_bar() +
  xlab("Year") + ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45))
# Observations per state (a balanced panel shows 2 for each state)
ggplot(data = CigarettesSW, aes(x = reorder_size(state))) +
  geom_bar() +
  xlab("State") + ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90))
I will test whether the number of packs per capita is influenced by taxes (tax and taxs), price and the CPI. Proceeding in the usual fashion, I start from the full model, drop one regressor at a time, and compare the resulting models by adjusted \(R^2\) (\(R_{adj}^2\)):
smoking_rate1 <- lm(data = CigarettesSW, packs ~ cpi + tax + price + taxs)
smoking_rate2 <- lm(data = CigarettesSW, packs ~ tax + price + taxs)
smoking_rate3 <- lm(data = CigarettesSW, packs ~ cpi + price + taxs)
smoking_rate4 <- lm(data = CigarettesSW, packs ~ cpi + tax + price)
stargazer(smoking_rate1, smoking_rate2, smoking_rate3, smoking_rate4 , type = "text", title="Results", align=TRUE, style = "aer")
##
## Results
## ====================================================================================================================
##                       packs
##                       (1)                     (2)                     (3)                     (4)
## --------------------------------------------------------------------------------------------------------------------
## cpi                   104.302***                                      103.926***              87.014***
##                       (38.881)                                        (38.363)                (31.476)
##
## tax                   -0.055                  0.186                                           0.389
##                       (0.715)                 (0.733)                                         (0.413)
##
## price                 -1.106***               -0.179                  -1.102***               -0.920***
##                       (0.364)                 (0.118)                 (0.357)                 (0.269)
##
## taxs                  0.620                   -0.658                  0.569
##                       (0.816)                 (0.684)                 (0.469)
##
## Constant              104.584***              158.774***              104.580***              111.471***
##                       (21.570)                (7.814)                 (21.453)                (19.534)
##
## Observations          96                      96                      96                      96
## R2                    0.488                   0.447                   0.488                   0.485
## Adjusted R2           0.465                   0.429                   0.471                   0.468
## Residual Std. Error   18.918 (df = 91)        19.545 (df = 92)        18.816 (df = 92)        18.875 (df = 92)
## F Statistic           21.666*** (df = 4; 91)  24.818*** (df = 3; 92)  29.201*** (df = 3; 92)  28.827*** (df = 3; 92)
## --------------------------------------------------------------------------------------------------------------------
## Notes: ***Significant at the 1 percent level.
##        **Significant at the 5 percent level.
##        *Significant at the 10 percent level.
smoking_rate3 <- lm(data = CigarettesSW, packs ~ cpi + price + taxs)
smoking_rate5 <- lm(data = CigarettesSW, packs ~ price + taxs)
smoking_rate6 <- lm(data = CigarettesSW, packs ~ cpi + taxs)
smoking_rate7 <- lm(data = CigarettesSW, packs ~ cpi + price)
stargazer(smoking_rate3, smoking_rate5, smoking_rate6, smoking_rate7, type = "text", title="Results", align=TRUE, style = "aer")
##
## Results
## ====================================================================================================================
##                       packs
##                       (1)                     (2)                     (3)                     (4)
## --------------------------------------------------------------------------------------------------------------------
## cpi                   103.926***                                      -8.332                  64.879***
##                       (38.363)                                        (12.613)                (20.914)
##
## price                 -1.102***               -0.183                                          -0.688***
##                       (0.357)                 (0.116)                                         (0.107)
##
## taxs                  0.569                   -0.498*                 -0.811***
##                       (0.469)                 (0.264)                 (0.147)
##
## Constant              104.580***              159.463***              159.228***              123.546***
##                       (21.453)                (7.293)                 (12.624)                (14.724)
##
## Observations          96                      96                      96                      96
## R2                    0.488                   0.447                   0.435                   0.480
## Adjusted R2           0.471                   0.435                   0.423                   0.468
## Residual Std. Error   18.816 (df = 92)        19.446 (df = 93)        19.657 (df = 93)        18.863 (df = 93)
## F Statistic           29.201*** (df = 3; 92)  37.572*** (df = 2; 93)  35.779*** (df = 2; 93)  42.850*** (df = 2; 93)
## --------------------------------------------------------------------------------------------------------------------
## Notes: ***Significant at the 1 percent level.
##        **Significant at the 5 percent level.
##        *Significant at the 10 percent level.
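Selecting model (3) from the first table (smoking_rate3), which has the highest adjusted \(R^2\) (0.471) and is not improved by dropping a further regressor in the second table, the best estimating equation is: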
\[ Packs = \beta_{0} + \beta_1CPI+\;\beta_{2}Price \; +\;\beta_{3}Taxs \]
\[ Packs = 104.58\; +\; 103.93\times CPI\;-1.1\times Price\;+0.57\times Taxs \]
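To make the selection criterion concrete, here is a small sketch that recomputes the adjusted \(R^2\) from the reported \(R^2\). The function name `adj_r2` is my own; the inputs are the rounded values printed by stargazer, so the results only agree with the table up to rounding.

```r
# Adjusted R^2 from R^2, sample size n and number of slope coefficients k
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# Both models report R^2 = 0.488 with n = 96 observations
adj_r2(0.488, n = 96, k = 4)  # cpi + tax + price + taxs: about 0.465
adj_r2(0.488, n = 96, k = 3)  # cpi + price + taxs:       about 0.471
```

The two models fit equally well by plain \(R^2\), so the three-regressor model wins only because it pays a smaller penalty for the number of coefficients.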
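In terms of the coefficients, we would expect all of the regressors to be negatively related to the number of packs per capita. In the model, however, the coefficient on price is negative while the coefficient on taxs (excise taxes including sales tax) is positive, although it is not statistically significant. This looks like an instance of omitted variable bias, since tax was dropped from the model.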
cor(CigarettesSW$tax, CigarettesSW$packs)
## [1] -0.6421176
cor(CigarettesSW$tax, CigarettesSW$price)
## [1] 0.8993727
cor(CigarettesSW$tax, CigarettesSW$taxs)
## [1] 0.985333
cor(CigarettesSW$tax, CigarettesSW$cpi)
## [1] 0.6857145
Since tax is negatively correlated with packs and positively correlated with the other regressors, omitting it should bias the coefficients on those regressors downward (negatively). (Perhaps someone could confirm this? I am still somewhat confused about OVB…)
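For reference, this is the textbook two-regressor version of the omitted variable bias formula as I understand it. The symbols are my own notation: \(X\) is an included regressor, \(Z\) is the omitted one (here tax), and \(u\) is the error term.

\[ Packs = \beta_{0} + \beta_{1}X + \beta_{2}Z + u \]
\[ \operatorname{plim}\,\hat{\beta}_{1} = \beta_{1} + \beta_{2}\,\frac{Cov(X, Z)}{Var(X)} \]

If \(\beta_{2} < 0\) (higher taxes reduce packs) and \(Cov(X, Z) > 0\), the bias term is negative, which is the direction argued above. This simple version assumes a single included regressor and uses the partial effect \(\beta_{2}\), not the raw correlation between tax and packs. With several correlated regressors the sign of the bias on each coefficient need not follow the pairwise correlations.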
Please let me know if you have any suggestions.
Running the two-way fixed effects model to control for state and year, we have:
smoking_rate3 <- lm(data = CigarettesSW, packs ~ cpi + price + taxs)
smoking_rate_FE<- feols(packs ~ cpi + price + taxs | state + year,
data = CigarettesSW)
## The variable 'cpi' has been removed because of collinearity (see $collin.var).
summary(smoking_rate_FE)
## OLS estimation, Dep. Var.: packs
## Observations: 96
## Fixed-effects: state: 48, year: 2
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## price -0.579361 0.199082 -2.910161 0.0055981 **
## taxs 0.152363 0.256588 0.593803 0.5556174
## ... 1 variable was removed because of collinearity (cpi)
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 4.05627 Adj. R2: 0.947558
## Within R2: 0.53662
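The collinearity message makes sense once you notice that cpi takes a single value per year (1.076 for every 1985 row in the glimpse output), so it carries no information beyond the year fixed effect. A minimal sketch of the same behavior in base R, using `lm()` with dummy variables instead of `feols()` and a toy panel with made-up values:

```r
# Toy two-period panel (made-up values); cpi is constant within each year
toy <- data.frame(
  state = factor(rep(c("A", "B", "C"), times = 2)),
  year  = factor(rep(c(1985, 1995), each = 3)),
  cpi   = rep(c(1.0, 1.5), each = 3),
  price = c(100, 105, 95, 160, 170, 150),
  packs = c(120, 110, 125, 95, 90, 100)
)

# Dummy-variable version of a two-way fixed-effects regression.
# cpi is listed last so that lm() drops it, rather than the year dummy.
fit <- lm(packs ~ price + state + year + cpi, data = toy)
coef(fit)["cpi"]  # NA: cpi is an exact linear function of the year dummy
```

Any variable that varies only over time (or only across states) will be dropped in the same way once both sets of fixed effects are included.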
stargazer(smoking_rate3,type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## packs
## -----------------------------------------------
## cpi 103.926***
## (38.363)
##
## price -1.102***
## (0.357)
##
## taxs 0.569
## (0.469)
##
## Constant 104.580***
## (21.453)
##
## -----------------------------------------------
## Observations 96
## R2 0.488
## Adjusted R2 0.471
## Residual Std. Error 18.816 (df = 92)
## F Statistic 29.201*** (df = 3; 92)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Since CPI was dropped for collinearity (it takes one value per year, so it is perfectly collinear with the year fixed effects and therefore redundant), out of curiosity I’ll see what the model looks like with the variables taxs, tax and price instead.
smoking_rate8 <- lm(data = CigarettesSW, packs ~ price + taxs + tax)
stargazer(smoking_rate8,type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## packs
## -----------------------------------------------
## price -0.179
## (0.118)
##
## taxs -0.658
## (0.684)
##
## tax 0.186
## (0.733)
##
## Constant 158.774***
## (7.814)
##
## -----------------------------------------------
## Observations 96
## R2 0.447
## Adjusted R2 0.429
## Residual Std. Error 19.545 (df = 92)
## F Statistic 24.818*** (df = 3; 92)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
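None of the slope coefficients in this pooled model is significant, and the standard errors on the two tax variables are large. One likely contributor is the near-collinearity of tax and taxs. As a rough sketch, the two-regressor variance inflation factor can be computed from their correlation. The helper name `vif_two` is my own, and the two-regressor formula is only a lower bound here because the model also contains price.

```r
# Variance inflation factor for one of two regressors whose pairwise
# correlation is r: VIF = 1 / (1 - r^2)
vif_two <- function(r) 1 / (1 - r^2)

r_tax_taxs <- 0.985333       # cor(tax, taxs) reported earlier
vif_two(r_tax_taxs)          # roughly 34
sqrt(vif_two(r_tax_taxs))    # standard errors inflated by a factor of about 6
```

With inflation of this size, sign flips between tax and taxs across specifications are not surprising and should not be over-interpreted.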
summary(smoking_rate_FE)
## OLS estimation, Dep. Var.: packs
## Observations: 96
## Fixed-effects: state: 48, year: 2
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## price -0.579361 0.199082 -2.910161 0.0055981 **
## taxs 0.152363 0.256588 0.593803 0.5556174
## ... 1 variable was removed because of collinearity (cpi)
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 4.05627 Adj. R2: 0.947558
## Within R2: 0.53662
smoking_rate_FE2<- feols(packs ~ price + taxs + tax | state + year,
data = CigarettesSW)
summary(smoking_rate_FE2)
## OLS estimation, Dep. Var.: packs
## Observations: 96
## Fixed-effects: state: 48, year: 2
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## price -0.581433 0.199297 -2.917419 0.0055395 **
## taxs 0.572783 0.510465 1.122081 0.2679141
## tax -0.514392 0.539740 -0.953036 0.3457781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 4.01504 Adj. R2: 0.947451
## Within R2: 0.545992
So, holding state and year fixed, price is still negatively associated with the number of cigarette packs per capita, but the two tax coefficients (taxs and tax) have opposite signs and neither is statistically significant. The coefficient on price is very similar in both FE models (-0.579, -0.581). My thought is that the tax results could be due to variation in state tax codes and in per capita personal income (which I did not include in my analysis). As we know, some states have no income tax, and others have lower or higher taxes on certain goods. It would therefore make sense that the average price of cigarettes is the strongest predictor of the number of packs per capita.
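With only two periods, holding state and year fixed has a simple interpretation: the two-way fixed-effects slope equals the slope from regressing the 1985-to-1995 change in packs on the change in price, with an intercept standing in for the year effect. The sketch below illustrates this on simulated data (not the real data set), again using `lm()` with dummies rather than `feols()`; all variable names and parameter values are assumptions of mine.

```r
set.seed(1)
n_states <- 48

# Simulated two-period panel: rows 1-48 are 1985, rows 49-96 are 1995
sim <- data.frame(
  state = factor(rep(seq_len(n_states), times = 2)),
  year  = factor(rep(c(1985, 1995), each = n_states))
)
state_effect <- rnorm(n_states, sd = 10)
sim$price <- 100 + 60 * (sim$year == "1995") + rnorm(2 * n_states, sd = 10)
sim$packs <- 150 - 0.6 * sim$price + state_effect[as.integer(sim$state)] +
  rnorm(2 * n_states, sd = 4)

# Two-way fixed effects via state and year dummies
fe <- lm(packs ~ price + state + year, data = sim)

# First differences (1995 minus 1985); the intercept absorbs the year effect
d_price <- sim$price[sim$year == "1995"] - sim$price[sim$year == "1985"]
d_packs <- sim$packs[sim$year == "1995"] - sim$packs[sim$year == "1985"]
fd <- lm(d_packs ~ d_price)

all.equal(unname(coef(fe)["price"]), unname(coef(fd)["d_price"]))  # TRUE
```

Read this way, the FE price coefficient says that states whose prices rose more between 1985 and 1995 saw larger drops in packs per capita.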