For this project, we will use the smoke
data which can be found in the wooldridge
package. After loading the wooldridge
package, use ?smoke
to learn about this dataset. Be aware that you will need at least 3.4.0 version of R to install wooldridge
package.
wooldridge
data sets are from the 5th edition (Wooldridge 2013, ISBN-13:978-1-111-53104-1), which is compatible with all other editions.
We found this data interesting because the members of this group are smokers who are trying to quit smoking.
We decided to analysis smoke
data to figure out if there is a relationship between the number of cigarettes that an individual smokes per day and the other variables such as his/her education level, age, or the price of cigarettes. If there’s a relationship between the number of cigarettes and other variables, we want to find out the expected number of cigarettes smoked per day by using other variables, and hopefully, we find something useful to help to stop smoking or to prevent smoking if possible.
A data.frame with 807 rows and 10 variables. Some of interesting variables are:
cigs
number of cigarettes smoked per dayage
age in yearscigpric
state cigarette price, cents/packincome
annual income, $educ
years of schoolingwhite
=1 if white# Load libraries needed for the project.
library(wooldridge)
library(faraway)
library(lmtest)
# Since we want to predict the number of cigarettes smoked per day,
# we are removing rows that has cigs < 1, which is non-smokers.
cleaned_data = smoke[(smoke$cigs) > 0,]
# Fit a simple linear model in R. Using continuous variables in the dataset.
smoke_model = lm(cigs ~ . - white - restaurn , data = cleaned_data)
summary(smoke_model)
##
## Call:
## lm(formula = cigs ~ . - white - restaurn, data = cleaned_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.876 -7.197 -2.015 6.876 50.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.834e+02 3.764e+02 1.815 0.070453 .
## educ 5.624e-01 2.943e-01 1.911 0.056957 .
## cigpric 4.179e+00 2.177e+00 1.920 0.055806 .
## age 1.056e+00 2.774e-01 3.808 0.000169 ***
## income -1.394e-04 2.248e-04 -0.620 0.535585
## lincome 3.650e+00 3.023e+00 1.207 0.228326
## agesq -1.099e-02 3.189e-03 -3.447 0.000648 ***
## lcigpric -2.379e+02 1.241e+02 -1.917 0.056207 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.65 on 302 degrees of freedom
## Multiple R-squared: 0.1074, Adjusted R-squared: 0.08667
## F-statistic: 5.189 on 7 and 302 DF, p-value: 1.341e-05
# Checking p-values for each variables to find out statiscally significant variables.
summary(smoke_model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.833743e+02 3.764306e+02 1.8154058 0.070453293
## educ 5.623637e-01 2.942837e-01 1.9109576 0.056956637
## cigpric 4.179320e+00 2.176783e+00 1.9199524 0.055805708
## age 1.056283e+00 2.773516e-01 3.8084612 0.000169382
## income -1.394407e-04 2.248269e-04 -0.6202135 0.535584934
## lincome 3.649662e+00 3.023416e+00 1.2071317 0.228326026
## agesq -1.099107e-02 3.188960e-03 -3.4466012 0.000648209
## lcigpric -2.378572e+02 1.240908e+02 -1.9167996 0.056206888
age
and agesq
are only variables that are significant at \(\alpha = 0.01\). Since we want to predict the number of cigarettes smoked per day, the good model should not have predictor that are not statistically significant.# Fitting a model with only significant variables.
smoke_model_2 = lm(cigs ~ age + agesq, data = cleaned_data)
summary(smoke_model_2)
##
## Call:
## lm(formula = cigs ~ age + agesq, data = cleaned_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.444 -6.433 -2.466 6.812 53.754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.169911 5.345927 -0.593 0.554
## age 1.264655 0.268166 4.716 3.66e-06 ***
## agesq -0.013501 0.003062 -4.409 1.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.79 on 307 degrees of freedom
## Multiple R-squared: 0.07206, Adjusted R-squared: 0.06602
## F-statistic: 11.92 on 2 and 307 DF, p-value: 1.033e-05
age
or agesq
).# Using ANOVA table to perfrom F-test to decide which model we are going to use.
anova(smoke_model_2, smoke_model)
## Analysis of Variance Table
##
## Model 1: cigs ~ age + agesq
## Model 2: cigs ~ (educ + cigpric + white + age + income + restaurn + lincome +
## agesq + lcigpric) - white - restaurn
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 307 50229
## 2 302 48318 5 1910.8 2.3886 0.03807 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
age
and agesq
.# Performing Breusch-Pagan test and Shapiro-Wilk normality test.
# Use cooks.distance to make data better.
bptest(smoke_model_2)
##
## studentized Breusch-Pagan test
##
## data: smoke_model_2
## BP = 6.0304, df = 2, p-value = 0.04903
shapiro.test(resid(smoke_model_2))
##
## Shapiro-Wilk normality test
##
## data: resid(smoke_model_2)
## W = 0.95286, p-value = 1.989e-08
length(which(cooks.distance(smoke_model_2) > 4 / length(cooks.distance(smoke_model_2))))
## [1] 16
final_data =
cleaned_data[which(cooks.distance(smoke_model_2) <
4 / length(cooks.distance(smoke_model_2))),]
# Visualizations
final_model = lm(cigs ~ age + agesq, data = final_data)
par(mfrow = c(2, 2))
plot(final_model)
# More test to see if our visualizations match and our final model is better.
bptest(final_model)
##
## studentized Breusch-Pagan test
##
## data: final_model
## BP = 12.63, df = 2, p-value = 0.001809
shapiro.test(resid(final_model))
##
## Shapiro-Wilk normality test
##
## data: resid(final_model)
## W = 0.97731, p-value = 0.0001294
# Summary of our final model and final data
summary(final_model)
##
## Call:
## lm(formula = cigs ~ age + agesq, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.830 -5.175 -1.374 6.673 30.177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.54610 4.99561 -0.710 0.478
## age 1.23111 0.25882 4.757 3.10e-06 ***
## agesq -0.01335 0.00305 -4.376 1.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 291 degrees of freedom
## Multiple R-squared: 0.08242, Adjusted R-squared: 0.07611
## F-statistic: 13.07 on 2 and 291 DF, p-value: 3.671e-06
summary(final_data)
## educ cigpric white age
## Min. : 6.00 Min. :44.38 Min. :0.000 Min. :17.00
## 1st Qu.:10.00 1st Qu.:58.09 1st Qu.:1.000 1st Qu.:27.00
## Median :12.00 Median :60.88 Median :1.000 Median :35.50
## Mean :12.01 Mean :60.17 Mean :0.881 Mean :38.46
## 3rd Qu.:13.50 3rd Qu.:63.17 3rd Qu.:1.000 3rd Qu.:50.00
## Max. :18.00 Max. :69.90 Max. :1.000 Max. :78.00
## income cigs restaurn lincome
## Min. : 500 Min. : 1.00 Min. :0.0000 Min. : 6.215
## 1st Qu.:12500 1st Qu.:15.00 1st Qu.:0.0000 1st Qu.: 9.433
## Median :20000 Median :20.00 Median :0.0000 Median : 9.903
## Mean :19199 Mean :21.29 Mean :0.2007 Mean : 9.689
## 3rd Qu.:30000 3rd Qu.:30.00 3rd Qu.:0.0000 3rd Qu.:10.309
## Max. :30000 Max. :55.00 Max. :1.0000 Max. :10.309
## agesq lcigpric
## Min. : 289 Min. :3.793
## 1st Qu.: 729 1st Qu.:4.062
## Median :1260 Median :4.109
## Mean :1687 Mean :4.094
## 3rd Qu.:2500 3rd Qu.:4.146
## Max. :6084 Max. :4.247
# We want to see the relationship between the number of cigarettes smoke per day
# and an individuals age.
test_model = lm(cigs ~ age, data = final_data)
plot(cigs ~ age, data = final_data,
xlab = "Age of an Individual in Years",
ylab = "Number of Cigarettes Smoked per Day",
main = "Smoke Data: Age vs Number of Cigarattes",
col = "dodgerblue")
abline(test_model, lwd=2, col="red")
# Getting mean age of the dataset.
mean(final_data$age)
## [1] 38.46259
# Predict the number of cigarettes based on individual's age of 38.
predict(test_model, newdata = data.frame(age=38))
## 1
## 21.23274
The primary object of this project is to use smoke data in wooldridge package and to determine any existing relationship between individual’s education level, age, or the price of cigarettes and number of cigarettes that a person smokes per day. Noticing a trivial relationship through visual representation of scatter plot, we performed data cleaning by removing “rows” that has cigs
values less than 1(non-smokers). Since the p-values for our first model (‘smoke_model’) has some predictors that are not statistically significant predictor, we made the second model (‘smoke_model_2’) containing only significant variables(age
or agesq
). The corresponding model’s Breusch-Pagan test and Shapiro-Wilk test allowed further simplification of data and better model (final_model
). To further analyze the final model (final_model
), BP test and Shapiro test was performed once again and the results matched with our findings with a fitted versus residual plot and agreed with our Q-Q Plot. The scatter plot of age versus number of cigarettes smoked (per day) was drawn with abline(test_model, lwd=2, col="red")
. The data showed slightly increasing in the number of cigarettes smoked per day as the age increased. Finally, the average age, 38.46259, was found from the final_data. Knowing that the age can only be an integer, the value was rounded down to 38. Using this average value, we predicted number of cigarettes smoked per day which was to be 21.23274 or 21. Although our dataset showed some relationship between age and number of cigarettes smoked per day, we couldn’t find decent evidence that agrees to scientific facts related to smoking. It is likely that there are uncontrolled variables such as duration of smoking, nicotine levels and type of cigarettes.