Smoke Data Analysis

Background Information on the Data Sets

For this project, we will use the smoke data which can be found in the wooldridge package. After loading the wooldridge package, use ?smoke to learn about this dataset. Be aware that you will need at least 3.4.0 version of R to install wooldridge package.

wooldridge data sets are from the 5th edition (Wooldridge 2013, ISBN-13:978-1-111-53104-1), which is compatible with all other editions.

We found this data interesting because the members of this group are smokers who are trying to quit smoking.

Objectives

We decided to analysis smoke data to figure out if there is a relationship between the number of cigarettes that an individual smokes per day and the other variables such as his/her education level, age, or the price of cigarettes. If there’s a relationship between the number of cigarettes and other variables, we want to find out the expected number of cigarettes smoked per day by using other variables, and hopefully, we find something useful to help to stop smoking or to prevent smoking if possible.

Procedure and Methods

Description of smoke Data

A data.frame with 807 rows and 10 variables. Some of interesting variables are:

cigs number of cigarettes smoked per day
age age in years
cigpric state cigarette price, cents/pack
income annual income, $
educ years of schooling
white =1 if white

Getting Prepared: Installing Package, Loading the Dataset and Data Cleaning

# Load libraries needed for the project.
library(wooldridge)
library(faraway)
library(lmtest)

# Since we want to predict the number of cigarettes smoked per day, 
# we are removing rows that has cigs < 1, which is non-smokers.
cleaned_data = smoke[(smoke$cigs) > 0,]

Tests to Decide Our Final Model

# Fit a simple linear model in R. Using continuous variables in the dataset.
smoke_model = lm(cigs ~ . - white - restaurn , data = cleaned_data)
summary(smoke_model)

## 
## Call:
## lm(formula = cigs ~ . - white - restaurn, data = cleaned_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.876  -7.197  -2.015   6.876  50.862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.834e+02  3.764e+02   1.815 0.070453 .  
## educ         5.624e-01  2.943e-01   1.911 0.056957 .  
## cigpric      4.179e+00  2.177e+00   1.920 0.055806 .  
## age          1.056e+00  2.774e-01   3.808 0.000169 ***
## income      -1.394e-04  2.248e-04  -0.620 0.535585    
## lincome      3.650e+00  3.023e+00   1.207 0.228326    
## agesq       -1.099e-02  3.189e-03  -3.447 0.000648 ***
## lcigpric    -2.379e+02  1.241e+02  -1.917 0.056207 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.65 on 302 degrees of freedom
## Multiple R-squared:  0.1074, Adjusted R-squared:  0.08667 
## F-statistic: 5.189 on 7 and 302 DF,  p-value: 1.341e-05

$H_0: \beta_i = 0, i = 1,2,...,7$.
$H_1:$ at least one of $\beta_j \neq 0, j = 1,2,...,7$.
F = 5.189.
P-value = 1.341e-05.
Decision : Reject the null hypothesis at $\alpha = 0.01$.
Conclusion: There is a linear relationship between the number of cigarettes and at least some of the predictor variables.

# Checking p-values for each variables to find out statiscally significant variables.
summary(smoke_model)$coef

##                  Estimate   Std. Error    t value    Pr(>|t|)
## (Intercept)  6.833743e+02 3.764306e+02  1.8154058 0.070453293
## educ         5.623637e-01 2.942837e-01  1.9109576 0.056956637
## cigpric      4.179320e+00 2.176783e+00  1.9199524 0.055805708
## age          1.056283e+00 2.773516e-01  3.8084612 0.000169382
## income      -1.394407e-04 2.248269e-04 -0.6202135 0.535584934
## lincome      3.649662e+00 3.023416e+00  1.2071317 0.228326026
## agesq       -1.099107e-02 3.188960e-03 -3.4466012 0.000648209
## lcigpric    -2.378572e+02 1.240908e+02 -1.9167996 0.056206888

The p-values for age and agesq are only variables that are significant at $\alpha = 0.01$. Since we want to predict the number of cigarettes smoked per day, the good model should not have predictor that are not statistically significant.

# Fitting a model with only significant variables.
smoke_model_2 = lm(cigs ~ age + agesq, data = cleaned_data)
summary(smoke_model_2)

## 
## Call:
## lm(formula = cigs ~ age + agesq, data = cleaned_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.444  -6.433  -2.466   6.812  53.754 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.169911   5.345927  -0.593    0.554    
## age          1.264655   0.268166   4.716 3.66e-06 ***
## agesq       -0.013501   0.003062  -4.409 1.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.79 on 307 degrees of freedom
## Multiple R-squared:  0.07206,    Adjusted R-squared:  0.06602 
## F-statistic: 11.92 on 2 and 307 DF,  p-value: 1.033e-05

$H_0: \beta_i = 0, i = 1,2$.
$H_1:$ at least one of $\beta_j \neq 0, j = 1,2$.
F = 11.92.
P-value = 1.033e-05.
Decision : Reject the null hypothesis at $\alpha = 0.01$.
Conclusion: There is a linear relationship between the number of cigarettes and at least some of the predictor variables (age or agesq).

# Using ANOVA table to perfrom F-test to decide which model we are going to use.
anova(smoke_model_2, smoke_model)

## Analysis of Variance Table
## 
## Model 1: cigs ~ age + agesq
## Model 2: cigs ~ (educ + cigpric + white + age + income + restaurn + lincome + 
##     agesq + lcigpric) - white - restaurn
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1    307 50229                              
## 2    302 48318  5    1910.8 2.3886 0.03807 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

$H_0: \beta_i = 0, i = educ, cigpric, income, lincome, lcigpric$.
$H_1:$ at least one of $\beta_j \neq 0, j = educ, cigpric, income, lincome, lcigpric$.
F = 2.3886.
P-value = 0.03807.
Decision : Do not reject the null hypothesis at $\alpha = 0.01$.
Conclusion : The linear relationship between cigarettes and the predictors is better explained with the model that only has age and agesq.

# Performing Breusch-Pagan test and Shapiro-Wilk normality test.
# Use cooks.distance to make data better.
bptest(smoke_model_2)

## 
##  studentized Breusch-Pagan test
## 
## data:  smoke_model_2
## BP = 6.0304, df = 2, p-value = 0.04903

shapiro.test(resid(smoke_model_2))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(smoke_model_2)
## W = 0.95286, p-value = 1.989e-08

length(which(cooks.distance(smoke_model_2) > 4 / length(cooks.distance(smoke_model_2))))

## [1] 16

final_data = 
cleaned_data[which(cooks.distance(smoke_model_2) < 
                       4 / length(cooks.distance(smoke_model_2))),]

We tested the Breusch-Pagan test and Shapiro-Wilk test and removed some points to get a better model.

# Visualizations
final_model = lm(cigs ~ age + agesq, data = final_data)
par(mfrow = c(2, 2))
plot(final_model)

According to the plots of our final model, we can conclude that there is violation of constant variance, but it is normally distributed, the linear functional form holds.

# More test to see if our visualizations match and our final model is better.
bptest(final_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  final_model
## BP = 12.63, df = 2, p-value = 0.001809

shapiro.test(resid(final_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(final_model)
## W = 0.97731, p-value = 0.0001294

For Breusch-Pagan test, here we see a p-value smaller than $\alpha = 0.01$, so we reject the null of homoscedasticity. The constant variance assumption is violated. This matches our findings with a fitted versus residuals plot.
The Shapiro-Wilk test also agrees with our Q-Q plot.

Conclusion

Final Model Selected

# Summary of our final model and final data
summary(final_model)

## 
## Call:
## lm(formula = cigs ~ age + agesq, data = final_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.830  -5.175  -1.374   6.673  30.177 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.54610    4.99561  -0.710    0.478    
## age          1.23111    0.25882   4.757 3.10e-06 ***
## agesq       -0.01335    0.00305  -4.376 1.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.7 on 291 degrees of freedom
## Multiple R-squared:  0.08242,    Adjusted R-squared:  0.07611 
## F-statistic: 13.07 on 2 and 291 DF,  p-value: 3.671e-06

summary(final_data)

##       educ          cigpric          white            age       
##  Min.   : 6.00   Min.   :44.38   Min.   :0.000   Min.   :17.00  
##  1st Qu.:10.00   1st Qu.:58.09   1st Qu.:1.000   1st Qu.:27.00  
##  Median :12.00   Median :60.88   Median :1.000   Median :35.50  
##  Mean   :12.01   Mean   :60.17   Mean   :0.881   Mean   :38.46  
##  3rd Qu.:13.50   3rd Qu.:63.17   3rd Qu.:1.000   3rd Qu.:50.00  
##  Max.   :18.00   Max.   :69.90   Max.   :1.000   Max.   :78.00  
##      income           cigs          restaurn         lincome      
##  Min.   :  500   Min.   : 1.00   Min.   :0.0000   Min.   : 6.215  
##  1st Qu.:12500   1st Qu.:15.00   1st Qu.:0.0000   1st Qu.: 9.433  
##  Median :20000   Median :20.00   Median :0.0000   Median : 9.903  
##  Mean   :19199   Mean   :21.29   Mean   :0.2007   Mean   : 9.689  
##  3rd Qu.:30000   3rd Qu.:30.00   3rd Qu.:0.0000   3rd Qu.:10.309  
##  Max.   :30000   Max.   :55.00   Max.   :1.0000   Max.   :10.309  
##      agesq         lcigpric    
##  Min.   : 289   Min.   :3.793  
##  1st Qu.: 729   1st Qu.:4.062  
##  Median :1260   Median :4.109  
##  Mean   :1687   Mean   :4.094  
##  3rd Qu.:2500   3rd Qu.:4.146  
##  Max.   :6084   Max.   :4.247

Visualization and Prediction

# We want to see the relationship between the number of cigarettes smoke per day
# and an individuals age.
test_model = lm(cigs ~ age, data = final_data)

plot(cigs ~ age, data = final_data,
  xlab = "Age of an Individual in Years",
  ylab = "Number of Cigarettes Smoked per Day",
  main = "Smoke Data: Age vs Number of Cigarattes",
  col  = "dodgerblue")
abline(test_model, lwd=2, col="red")

# Getting mean age of the dataset.
mean(final_data$age)

## [1] 38.46259

# Predict the number of cigarettes based on individual's age of 38.
predict(test_model, newdata = data.frame(age=38))

##        1 
## 21.23274

Result

The primary object of this project is to use smoke data in wooldridge package and to determine any existing relationship between individual’s education level, age, or the price of cigarettes and number of cigarettes that a person smokes per day. Noticing a trivial relationship through visual representation of scatter plot, we performed data cleaning by removing “rows” that has cigs values less than 1(non-smokers). Since the p-values for our first model (‘smoke_model’) has some predictors that are not statistically significant predictor, we made the second model (‘smoke_model_2’) containing only significant variables(age or agesq). The corresponding model’s Breusch-Pagan test and Shapiro-Wilk test allowed further simplification of data and better model (final_model). To further analyze the final model (final_model), BP test and Shapiro test was performed once again and the results matched with our findings with a fitted versus residual plot and agreed with our Q-Q Plot. The scatter plot of age versus number of cigarettes smoked (per day) was drawn with abline(test_model, lwd=2, col="red"). The data showed slightly increasing in the number of cigarettes smoked per day as the age increased. Finally, the average age, 38.46259, was found from the final_data. Knowing that the age can only be an integer, the value was rounded down to 38. Using this average value, we predicted number of cigarettes smoked per day which was to be 21.23274 or 21. Although our dataset showed some relationship between age and number of cigarettes smoked per day, we couldn’t find decent evidence that agrees to scientific facts related to smoking. It is likely that there are uncontrolled variables such as duration of smoking, nicotine levels and type of cigarettes.

STAT 420: Data Analysis Project

Project Group 20 / Jang Hoon Hong (jhong47), Steve Lee (slee279)

Due: Saturday, August 5 by 11:50 PM CDT