Believe it or not, workers used to be able to smoke inside office buildings. Smoking bans were introduced in several areas during the 1990s. Supporters of these bans argued that in addition to eliminating the externality of secondhand smoke, they would encourage smokers to quit by reducing their opportunities to smoke. In this assignment, you will estimate the effect of workplace smoking bans on smoking, using data on a sample of 10,000 U.S. indoor workers from 1991 to 1993, available on the text website, http://www.pearsonglobaleditions .com, in the file Smoking. The data set contains information on whether individuals were or were not subject to a workplace smoking ban, whether the individuals smoked, and other individual characteristics.7 A detailed description is given in Smoking_Description, available on the website.
rm(list=ls())
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.3.3
library(lmtest)
library(sandwich)
library(car)
id <- "1eVyfH-AhYRm4oMtZUuQos8zfS9YLvN8t"
smoking <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(smoking)
## 'data.frame': 10000 obs. of 10 variables:
## $ smoker : num 1 1 0 1 0 0 1 1 0 0 ...
## $ smkban : num 1 1 0 0 1 0 1 0 1 0 ...
## $ age : num 41 44 19 29 28 40 47 36 49 44 ...
## $ hsdrop : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hsgrad : num 1 0 0 1 0 0 0 0 0 0 ...
## $ colsome : num 0 1 1 0 1 1 1 1 1 1 ...
## $ colgrad : num 0 0 0 0 0 0 0 0 0 0 ...
## $ black : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hispanic: num 0 0 0 0 0 0 0 0 0 0 ...
## $ female : num 1 1 1 1 1 0 1 0 1 0 ...
- Estimate the probability of smoking for (i) all workers, (ii) workers affected by workplace smoking bans, and (iii) workers not affected by workplace smoking bans.
n <- dim(smoking)[1]
pro_smoke_all <- sum(smoking$smoker)/n
print(pro_smoke_all)
## [1] 0.2423
smoking_ban <- subset(smoking, smkban==1)
n_ban <- dim(smoking_ban)[1]
pro_smoke_ban <- sum(smoking_ban$smoker)/n_ban
print(pro_smoke_ban)
## [1] 0.2120367
smoking_noban <- subset(smoking, smkban==0)
n_noban <- dim(smoking_noban)[1]
pro_smoke_noban <- sum(smoking_noban$smoker)/n_noban
print(pro_smoke_noban)
## [1] 0.2895951
[Ans] The estimated probabilities of smoking for (i) all workers, (ii) workers affected by workplace smoking bans, and (iii) workers not affected by workplace smoking bans are \(0.2423\), \(0.212\), and \(0.290\), respectively.
- What is the difference in the probability of smoking between workers affected by a workplace smoking ban and workers not affected by a workplace smoking ban? Use a linear probability model to determine whether this difference is statistically significant.
model.b <- smoker ~ smkban
fit.b <- lm(model.b, data=smoking)
coeftest(fit.b, vcov=vcovHC, type="HC1") #use heteroskedasticity robust SE
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2895951 0.0072619 39.8789 < 2.2e-16 ***
## smkban -0.0775583 0.0089520 -8.6638 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Ans] Regress \(smoking\) on \(smkban\). The difference is \(-0.0776\) with a robust standard error of \(0.009\). The resulting t-statistic is \(-8.66\), so the coefficient is statistically significant.
- Estimate a linear probability model with smoker as the dependent variable and the following regressors: smkban, female, age, age2, hsdrop, hsgrad, colsome, colgrad, black, and hispanic. Compare the estimated effect of a smoking ban from this regression with your answer from (b). Suggest an explanation, based on the substance of this regression, for the change in the estimated effect of a smoking ban between (b) and (c).
model.c <- smoker ~ smkban + female + age + I(age^2) + hsdrop + hsgrad + colsome + colgrad + black + hispanic
fit.c <- lm(model.c, data=smoking)
coeftest(fit.c, vcov=vcovHC, type="HC1") #use heteroskedasticity robust SE
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4110e-02 4.1423e-02 -0.3406 0.7333890
## smkban -4.7240e-02 8.9661e-03 -5.2687 1.402e-07 ***
## female -3.3257e-02 8.5683e-03 -3.8814 0.0001045 ***
## age 9.6744e-03 1.8954e-03 5.1040 3.386e-07 ***
## I(age^2) -1.3180e-04 2.1905e-05 -6.0169 1.841e-09 ***
## hsdrop 3.2271e-01 1.9489e-02 16.5592 < 2.2e-16 ***
## hsgrad 2.3270e-01 1.2590e-02 18.4826 < 2.2e-16 ***
## colsome 1.6430e-01 1.2625e-02 13.0138 < 2.2e-16 ***
## colgrad 4.4798e-02 1.2044e-02 3.7196 0.0002006 ***
## black -2.7566e-02 1.6078e-02 -1.7144 0.0864772 .
## hispanic -1.0482e-01 1.3975e-02 -7.5004 6.905e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Ans] From model (c), the estimated difference is \(-0.047\), smaller than the effect in model (b). Evidently, (b) suffers from omitted variable bias. That is, \(smkban\) may be correlated with the education/race/gender indicators or with age. For example, workers with a college degree are more likely to work in an office with a smoking ban than high-school dropouts, and college graduates are less likely to smoke than high-school dropouts.
- Test the hypothesis that the coefficient on smkban is 0 in the population version of the regression in (c) against the alternative that it is nonzero, at the \(5\%\) significance level.
[Ans] The t-statistic is \(-5.27\), so the coefficient is statistically significant at the \(5\%\) level.
- Test the hypothesis that the probability of smoking does not depend on the level of education in the regression in (c). Does the probability of smoking increase or decrease with the level of education?
# Test joint significance of hsdrop, hsgrad, colsome, colgrad
linearHypothesis(fit.c, c("hsdrop=0", "hsgrad=0", "colsome=0", "colgrad=0"), white.adjust=c("hc1"))
[Ans] The F-statistic is \(140.09\) with a p-value of \(0.00\), so the coefficients are jointly significant. The omitted education status is “Masters degree or higher”. Thus the coefficients show the increase in probability relative to someone with a postgraduate degree. For example, the coefficient on \(colgrad\) is \(0.045\), so the probability of smoking for a college graduate is \(4.5\%\) higher than for someone with a postgraduate degree. Similarly, the coefficient on \(hsdrop\) is \(0.323\), so the probability of smoking for a college graduate is \(32.3\%\) higher than for someone with a postgraduate degree. Because the coefficients are all positive and get smaller as educational attainment increases, the probability of smoking falls as educational attainment increases.
- Repeat (c)–(e) using a probit model.
fit.f <- glm(model.c, data=smoking, family = binomial(link = "probit"))
coeftest(fit.f, vcov=vcovHC, type="HC1") #use heteroskedasticity robust SE
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7349e+00 1.5168e-01 -11.4384 < 2.2e-16 ***
## smkban -1.5863e-01 2.9176e-02 -5.4370 5.419e-08 ***
## female -1.1173e-01 2.8866e-02 -3.8707 0.0001085 ***
## age 3.4511e-02 6.8549e-03 5.0346 4.789e-07 ***
## I(age^2) -4.6754e-04 8.2487e-05 -5.6681 1.444e-08 ***
## hsdrop 1.1416e+00 7.3449e-02 15.5428 < 2.2e-16 ***
## hsgrad 8.8267e-01 6.0664e-02 14.5502 < 2.2e-16 ***
## colsome 6.7712e-01 6.1681e-02 10.9777 < 2.2e-16 ***
## colgrad 2.3468e-01 6.5578e-02 3.5787 0.0003453 ***
## black -8.4279e-02 5.3828e-02 -1.5657 0.1174193
## hispanic -3.3827e-01 5.0155e-02 -6.7446 1.534e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Test joint significance of hsdrop, hsgrad, colsome, colgrad
linearHypothesis(fit.f, c("hsdrop=0", "hsgrad=0", "colsome=0", "colgrad=0"), white.adjust=c("hc1"))
[Ans] The estimated effect of the smoking ban in the probit model depends on the values of the other variable included in the regression. The estimated effects for various values of these regressors are given in question (h). The t-statistic for the coefficient on \(smkban\) is \(-5.44\), similar to the value for the linear probability model. The F-statistic is significant at the \(5\%\) level, as in the linear probability model.
- Repeat (c)–(e) using a logit model.
fit.g <- glm(model.c, data=smoking, family = binomial(link = "logit"))
coeftest(fit.g, vcov=vcovHC, type="HC1") #use heteroskedasticity robust SE
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.99918220 0.26537659 -11.3016 < 2.2e-16 ***
## smkban -0.26202867 0.04950196 -5.2933 1.201e-07 ***
## female -0.19077250 0.04920099 -3.8774 0.0001056 ***
## age 0.05993658 0.01182862 5.0671 4.040e-07 ***
## I(age^2) -0.00081819 0.00014321 -5.7131 1.109e-08 ***
## hsdrop 2.01685337 0.13409500 15.0405 < 2.2e-16 ***
## hsgrad 1.57850359 0.11582668 13.6282 < 2.2e-16 ***
## colsome 1.22997749 0.11770914 10.4493 < 2.2e-16 ***
## colgrad 0.44658309 0.12643554 3.5321 0.0004123 ***
## black -0.15603412 0.09129560 -1.7091 0.0874308 .
## hispanic -0.59717314 0.08623010 -6.9253 4.349e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Test joint significance of hsdrop, hsgrad, colsome, colgrad
linearHypothesis(fit.g, c("hsdrop=0", "hsgrad=0", "colsome=0", "colgrad=0"), white.adjust=c("hc1"))
[Ans] The estimated effect of the smoking ban in the probit model depends on the values of the other variable included in the regression. The estimated effects for various values of these regressors are given question (h). The t-statistic for the coefficient on \(smkban\) is \(-5.29\), similar to the value for the linear probability and probit models. The F-statistic is significant at the \(5\%\) level, as in the linear probability model.
h.i. Mr. A is white, non-Hispanic, 20 years old, and a high school dropout. Using the probit regression and assuming that Mr. A is not subject to a workplace smoking ban, calculate the probability that Mr. A smokes. Carry out the calculation again, assuming that he is subject to a workplace smoking ban. What is the effect of the smoking ban on the probability of smoking?
b.probit <- fit.f$coefficients # all coefficients from a probit model
caseA.1 <- c(1, 0, 0, 20, 20^2, 1, 0, 0, 0, 0, 0)
Mr_A.noban <- pnorm(sum(b.probit*caseA.1))
print(Mr_A.noban)
## [1] 0.464102
caseA.2 <- c(1, 1, 0, 20, 20^2, 1, 0, 0, 0, 0, 0)
Mr_A.ban <- pnorm(sum(b.probit*caseA.2))
print(Mr_A.ban)
## [1] 0.4017831
[Ans] If Mr. A is not subject to a workplace smoking ban, the probability that he smokes is \(46.41\%\); if he is subject to a workplace smoking ban, the probability is \(40.18\%\). The difference is \(6.23\%\).
h.ii. Repeat (i) for Ms. B, a female, black, 40-year-old college graduate.
caseB.1 <- c(1, 0, 1, 40, 40^2, 0, 0, 0, 1, 1, 0)
Ms_B.noban <- pnorm(sum(b.probit*caseB.1))
print(Ms_B.noban)
## [1] 0.1436957
caseB.2 <- c(1, 1, 1, 40, 40^2, 0, 0, 0, 1, 1, 0)
Ms_B.ban <- pnorm(sum(b.probit*caseB.2))
print(Ms_B.ban)
## [1] 0.1107609
[Ans] If Ms. B is not subject to a workplace smoking ban, the probability that she smokes is \(14.37\%\); if she is subject to a workplace smoking ban, the probability is \(11.08\%\). The difference is \(3.29\%\).
h.iii. Repeat (i)–(ii) using the linear probability model.
b.lpm <- fit.c$coefficients # all coefficients from a probit model
Mr_A.noban.lpm <- sum(b.lpm*caseA.1)
print(Mr_A.noban.lpm)
## [1] 0.4493721
Mr_A.ban.lpm <- sum(b.lpm*caseA.2)
print(Mr_A.ban.lpm)
## [1] 0.4021323
Ms_B.noban.lpm <- sum(b.lpm*caseB.1)
print(Ms_B.noban.lpm)
## [1] 0.145961
Ms_B.ban.lpm <- sum(b.lpm*caseB.2)
print(Ms_B.ban.lpm)
## [1] 0.09872116
[Ans] Under the linear probability model, if Mr. A is not subject to a workplace smoking ban, the probability that he smokes is \(44.94\%\); if he is subject to a workplace smoking ban, the probability is \(40.21\%\). The difference is \(4.72\%\). If Ms. B is not subject to a workplace smoking ban, the probability that she smokes is \(14.60\%\); if she is subject to a workplace smoking ban, the probability is \(9.87\%\). The difference is \(4.72\%\).
h.iv. Repeat (i)–(ii) using the logit model.
b.logit <- fit.g$coefficients # all coefficients from a probit model
Mr_A.noban.logit <- plogis(sum(b.logit*caseA.1))
print(Mr_A.noban.logit)
## [1] 0.4723103
Mr_A.ban.logit <- plogis(sum(b.logit*caseA.2))
print(Mr_A.ban.logit)
## [1] 0.4078402
Ms_B.noban.logit <- plogis(sum(b.logit*caseB.1))
print(Ms_B.noban.logit)
## [1] 0.1405121
Ms_B.ban.logit <- plogis(sum(b.logit*caseB.2))
print(Ms_B.ban.logit)
## [1] 0.1117418
[Ans] Under the logit model, if Mr. A is not subject to a workplace smoking ban, the probability that he smokes is \(47.23\%\); if he is subject to a workplace smoking ban, the probability is \(40.78\%\). The difference is \(6.45\%\). If Ms. B is not subject to a workplace smoking ban, the probability that she smokes is \(14.05\%\); if she is subject to a workplace smoking ban, the probability is \(11.17\%\). The difference is \(2.88\%\).
h.v. Based on your answers to (i)–(iv), do the logit, probit, and linear probability models differ? If they do, which results make most sense? Are the estimated effects large in a real-world sense?
[Ans] The linear probability model assumes that the marginal impact of workplace smoking bans on the probability of an individual smoking is not dependent on the other characteristics of the individual. On the other hand, the probit and logit models’ predicted marginal impact of workplace smoking bans on the probability of smoking depends on individual characteristics. Therefore, in the linear probability model, the marginal impact of workplace smoking bans is the same for Mr. A and Mr. B, although their profiles would suggest that Mr. A has a higher probability of smoking based on his characteristics. Looking at the probit model’s results, the marginal impact of workplace smoking bans on the odds of smoking are different for Mr. A and Ms. B, because their different characteristics are incorporated into the impact of the laws on the probability of smoking. The same is true of the logit model. In this sense the probit and logit model are likely more appropriate, and they give very similar answers.
Are the impacts of workplace smoking bans “large” in a real-world sense? Most people might believe the impacts are large. For example, for people with characteristics like Mr.A the reduction on the probability is great than \(6\%\) (from the probit and logit models). Applied to a large number of people, this translates into a \(6\%\) reduction in the number of people smoking.