Project Part 3

Please refer to the bottom of this file for my artificial intelligence usage disclaimer. Any time you see a star *, I used AI for help.

pacman::p_load("sampleSelection")
data(Smoke)

data <- Smoke

I have been attempting to find out which variables are associated with higher levels of smoking. The data I have been using come from a survey and include the variables cigarettes smoked per day, education level, age, income, and cigarette price. First, I used the data to answer the following questions:

Do people smoke less as cigarette prices increase?

I concluded that there was no correlation between cigarette price and cigarettes smoked. This is likely because there is not a wide range of prices in the data. Over this narrow range, the quantity of cigarettes demanded appears inelastic with respect to price: a small change in price does not reveal any change in the number of cigarettes demanded.

Do people with lower incomes smoke more?

I divided the data into low, middle, and high income groups and found the mean number of cigarettes smoked in each group. People with lower incomes had a lower mean than those with middle and high incomes, but I was not convinced by this result because the means could easily be skewed by outliers.
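One way to check whether outliers are driving the group means is to put the median next to the mean for each group. The sketch below assumes income cutoffs of 10,000 and 20,000; these are illustrative only and are not necessarily the cutoffs used in my earlier analysis. It loads `Smoke` directly, which is the same data frame stored as `data` in this report.

```r
# Load the survey data used in this report (requires the sampleSelection package).
data(Smoke, package = "sampleSelection")

# ASSUMPTION: illustrative income cutoffs; the report does not state the ones it used.
income_group <- cut(Smoke$income,
                    breaks = c(-Inf, 10000, 20000, Inf),
                    labels = c("low", "middle", "high"))

# Mean and median cigarettes per day in each income group.
# A median far below the mean suggests a few heavy smokers are pulling the mean up.
group_summary <- data.frame(
  group  = levels(income_group),
  n      = as.integer(table(income_group)),
  mean   = as.numeric(tapply(Smoke$cigs, income_group, mean)),
  median = as.numeric(tapply(Smoke$cigs, income_group, median))
)
print(group_summary)
```

If the medians tell a different story than the means, that would support the worry about outliers.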

I continued my research with this data set and asked the empirical question: Does one’s age increase one’s likelihood of smoking cigarettes?

I found that the correlation between age and cigarettes smoked appeared positive (as age increases, so does cigarette consumption). I was unable to feel confident in saying that age really does increase smoking because the correlation coefficient was very close to zero.
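A correlation close to zero can be checked more formally with `cor.test()`, which reports a p-value and a 95% confidence interval for the correlation coefficient. This is only a sketch of how that check could look; it uses `Smoke`, the same data frame stored as `data` in this report.

```r
# Load the survey data used in this report (requires the sampleSelection package).
data(Smoke, package = "sampleSelection")

# Pearson correlation between age and cigarettes smoked per day,
# with a test of the null hypothesis that the true correlation is zero.
age_test <- cor.test(Smoke$age, Smoke$cigs)

print(age_test$estimate)  # sample correlation coefficient
print(age_test$conf.int)  # 95% confidence interval for the correlation
print(age_test$p.value)   # p-value for H0: correlation = 0
```

If the confidence interval contains zero, the data do not rule out that there is no linear relationship at all.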

I will continue trying to understand which variables are associated with an increase in cigarette consumption. This time I will use multiple regression so that I can include several explanatory variables at once, get a better grasp of the data, and look for relationships between each variable and cigarettes smoked that hold when the others are accounted for.

I want to run a multiple regression with cigarettes smoked per day as the dependent variable and income, age, and education as the independent variables. This controls for income, age, and education at the same time, so I can estimate the effect of each variable while the other two are held constant. This way, a relationship among these three variables is less likely to distort the estimate for any one of them.
Before running the multiple regression, I need to check for multicollinearity. Multicollinearity exists when two or more independent variables are highly correlated with each other, which leads to unreliable coefficient estimates. If any pair of independent variables has a high correlation (above 0.8 or below -0.8), the two should not be used together in a multiple regression.

independent_vars <- data[, c("income", "educ", "age")] 

cor_matrix <- cor(independent_vars)


print(cor_matrix)

##            income       educ        age
## income  1.0000000  0.3343666 -0.0640115
## educ    0.3343666  1.0000000 -0.1805779
## age    -0.0640115 -0.1805779  1.0000000

None of the pairwise correlations comes close to what I am considering a high correlation (above 0.8 or below -0.8); the largest in absolute value is 0.33, between income and education. Multicollinearity is not an issue between any of these variables.
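Pairwise correlations can miss multicollinearity that involves more than two variables at once. A common complement is the variance inflation factor (VIF), computed as 1 / (1 - R squared) from a regression of each independent variable on the others. The sketch below computes it by hand in base R; the names `predictors`, `aux`, and `vif` are my own for illustration. A frequently cited rule of thumb treats values above roughly 5 or 10 as a warning sign.

```r
# Load the survey data used in this report (requires the sampleSelection package).
data(Smoke, package = "sampleSelection")

predictors <- c("income", "educ", "age")

# For each predictor, regress it on the other predictors and
# convert the R-squared of that auxiliary regression into a VIF.
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = Smoke)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif, 3))
```

Values close to 1 would agree with the correlation matrix above.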

Running a linear regression

cigs = b0 + b1(education) + b2(age) + b3(income)

library(stargazer)
# This model has no fixed effects, so no 'Fixed effects' row is added to the table.
model1 <- lm(cigs ~ educ + age + income, data=data)
stargazer(model1, type='text')

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                cigs            
## -----------------------------------------------
## educ                         -0.378**          
##                               (0.170)          
##                                                
## age                           -0.042           
##                               (0.029)          
##                                                
## income                       0.0001**          
##                              (0.0001)          
##                                                
## Constant                     12.854***         
##                               (2.576)          
##                                                
## -----------------------------------------------
## Observations                    807            
## R2                             0.010           
## Adjusted R2                    0.007           
## Residual Std. Error      13.675 (df = 803)     
## F Statistic            2.813** (df = 3; 803)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Results:

Education - while holding age and income constant, each additional year of education is associated with 0.378 fewer cigarettes smoked per day. This result is statistically significant at the 5% level (p < 0.05), so it is unlikely that the true coefficient is zero. Statistical significance does not by itself mean the relationship is strong, but the direction is clear: as education increases, cigarettes smoked per day decreases.

Age - while holding education and income constant, I found no statistically significant relationship between age and cigarettes smoked. The estimate says that with each additional year of age, cigarette consumption decreases by 0.042, but the p-value is not significant, so I won’t draw any conclusion from this.

Income - while holding age and education constant, each additional unit of income is associated with an increase of 0.0001 cigarettes smoked per day. This result is statistically significant at the 5% level (p < 0.05), but the coefficient is very small: it takes roughly 10,000 additional units of income to add about one cigarette per day. So a higher income is associated with more smoking, but only in minuscule amounts per unit of income.

R squared is extremely small: only about 1 percent of the variation in cigarette consumption can be explained by the variables education, age, and income.
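The R squared and adjusted R squared in the table can be reproduced by hand from the residual sums of squares that the linearHypothesis output reports later in this report (151754 for the intercept-only model and 150176 for the full model). This is a small arithmetic sketch using those rounded values, so the last digits may differ slightly from what `summary(model1)` gives.

```r
# Residual sums of squares as printed (rounded) in the linearHypothesis output.
rss_restricted <- 151754  # intercept-only model, equal to the total sum of squares
rss_full       <- 150176  # model with educ, age, and income

n <- 807  # observations
k <- 3    # independent variables

# Share of the variation in cigs explained by the model.
r_squared <- 1 - rss_full / rss_restricted

# Adjusted R-squared penalizes the number of independent variables.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 3))
```

Rounded to three decimals, these agree with the 0.010 and 0.007 shown in the regression table.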

Computing 95% confidence intervals to check whether the true coefficients could be zero:

conf_intervals <- confint(model1, level = 0.95) 

print(conf_intervals)

##                     2.5 %       97.5 %
## (Intercept)  7.797280e+00 17.910608917
## educ        -7.105728e-01 -0.044618022
## age         -9.815240e-02  0.014765925
## income       7.379078e-06  0.000226846

Education - CI (-0.71, -0.04). I can conclude with 95% confidence that the true coefficient on education lies between -0.71 and -0.04. This interval does not contain zero, so I will conclude that the relationship is negative: as education increases, cigarette consumption decreases.

Age - CI (-0.098, 0.015). This range does include zero, so I cannot conclude with confidence that the true relationship between age and cigarette consumption is negative rather than zero.

Income - CI (0.0000074, 0.00023). This range does not include zero, so I can conclude with 95% confidence that the relationship between income and cigarette consumption is positive. Once again, these numbers are so small that they only convince me there is an extremely small relationship per unit of income, to the point where it’s trivial.
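To show where these intervals come from, the sketch below rebuilds two of them by hand as estimate plus or minus the critical t value times the standard error. It uses the rounded coefficients and standard errors from the regression table, so the results only match the `confint()` output to about two decimals. Income is left out because its printed values (0.0001 and 0.0001) are rounded too coarsely to reproduce the interval.

```r
# Coefficients and standard errors as printed (rounded) in the regression table.
est <- c(educ = -0.378, age = -0.042)
se  <- c(educ =  0.170, age =  0.029)
df_resid <- 803  # residual degrees of freedom from the table

# Critical value for a two-sided 95% interval.
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate +/- critical value * standard error.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))
```

The education interval stays entirely below zero while the age interval straddles zero, which is the same pattern as in the `confint()` output.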

The F-test tests the null hypothesis that none of the coefficients has an effect on cigarette consumption: H0: βeduc = 0, βage = 0, βincome = 0

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

linearHypothesis(model1, c("educ=0", "age=0", "income=0"))

## 
## Linear hypothesis test:
## educ = 0
## age = 0
## income = 0
## 
## Model 1: restricted model
## Model 2: cigs ~ educ + age + income
## 
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    806 151754                              
## 2    803 150176  3      1578 2.8126 0.03845 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model1)$fstatistic

##      value      numdf      dendf 
##   2.812588   3.000000 803.000000

The p-value is 0.03845, so I reject the null at the 5% significance level (but not at the 1% level, since 0.03845 > 0.01). I am confident that at least one of these coefficients is not truly 0.
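The F statistic and its p-value can also be reproduced from the two residual sums of squares in the linearHypothesis output. The sketch below is plain arithmetic on those printed (rounded) numbers, so it can differ from the reported 2.8126 in the last digits.

```r
# Values as printed (rounded) in the linearHypothesis output.
rss_restricted <- 151754  # model with only an intercept
rss_full       <- 150176  # model with educ, age, and income
q        <- 3             # number of restrictions being tested
df_resid <- 803           # residual degrees of freedom of the full model

# F = (drop in RSS per restriction) / (RSS of the full model per degree of freedom).
f_stat <- ((rss_restricted - rss_full) / q) / (rss_full / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df1 = q, df2 = df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The p-value can then be compared directly with the usual significance levels of 0.05 and 0.01.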

Running a residual plot: *3

# model1 was already fit above, so it does not need to be refit here
head(residuals(model1))

##          1          2          3          4          5          6 
##  -7.236780  -8.658065  -6.417968  -8.847860 -10.711456  -7.763985

head(fitted(model1))

##         1         2         3         4         5         6 
##  7.236780  8.658065  9.417968  8.847860 10.711456  7.763985

plot(fitted(model1), residuals(model1), 
     xlab = "Fitted values", ylab = "Residuals", 
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "pink")

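This residual plot raises some concerns. I am concerned about the points that are tightly clustered into three negatively sloped lines. These lines probably appear because many respondents report the same number of cigarettes (for example, zero), and for a fixed reported value the residual equals that value minus the fitted value, which traces a line with slope -1. I am also concerned because the residuals appear to fan out slightly as fitted values decrease, which could mean heteroskedasticity.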
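The sloped lines in the plot can be investigated by looking at how often the same value of cigs is reported. The sketch below lists the three most common values and then checks, for respondents who report zero cigarettes, that the residual is exactly the negative of the fitted value. It refits the same regression on `Smoke` (the data frame stored as `data` in this report) so that it runs on its own; `model`, `zero`, and `on_line` are names chosen for illustration.

```r
# Load the survey data used in this report (requires the sampleSelection package).
data(Smoke, package = "sampleSelection")
model <- lm(cigs ~ educ + age + income, data = Smoke)

# The three most frequently reported values of cigarettes per day.
print(head(sort(table(Smoke$cigs), decreasing = TRUE), 3))

# For respondents reporting zero cigarettes, residual = 0 - fitted,
# so these points fall on a straight line with slope -1 in the plot.
zero <- Smoke$cigs == 0
on_line <- all.equal(unname(residuals(model)[zero]),
                     -unname(fitted(model)[zero]))
print(on_line)
```

Each other frequently reported value would produce its own parallel line, shifted up by that value.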

The most important inference I was able to make is that more educated people tend to smoke less. I am also glad to have found that the difference in cigarettes smoked associated with higher income is extremely small; I had originally calculated group means and found a larger difference. I would still like to do more analysis regarding income and cigarette consumption. For age, I came to a similar conclusion as before: I am not confident there is any relationship. The estimated coefficient is now negative rather than positive, but the confidence interval includes zero. This makes a lot of sense because one’s age isn’t affected by outside variables.
Moving forward - I would like to test this data for heteroskedasticity because the residual plot did not look good. I would also like to dive deeper into finding true relationships. To do this, I would look at individual variables, remove outliers, and rerun the earlier tests to see if I can uncover any differences.
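As a possible starting point for the heteroskedasticity check, the sketch below computes the n times R squared (Koenker) form of the Breusch-Pagan test by hand in base R: the squared residuals are regressed on the same independent variables, and n times the R squared of that auxiliary regression is compared with a chi-squared distribution. This is one common version of the test, not the only one; the `bptest()` function in the lmtest package offers a packaged alternative. The names `smoke`, `aux`, and `lm_stat` are my own for illustration.

```r
# Load the survey data used in this report (requires the sampleSelection package).
data(Smoke, package = "sampleSelection")
smoke <- Smoke  # work on a copy so the original data frame is unchanged

# Same regression as in the report.
model <- lm(cigs ~ educ + age + income, data = smoke)

# Auxiliary regression: squared residuals on the same independent variables.
smoke$resid_sq <- residuals(model)^2
aux <- lm(resid_sq ~ educ + age + income, data = smoke)

# Test statistic: n * R-squared of the auxiliary regression,
# chi-squared with 3 degrees of freedom under homoskedasticity.
lm_stat <- nrow(smoke) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 3, lower.tail = FALSE)

print(c(LM = lm_stat, p = p_value))
```

A small p-value would suggest that the residual variance changes with education, age, or income, in which case robust standard errors would be worth considering.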

AI disclaimer:

1: After attempting to use the following code from the book chapter, I looked to ChatGPT for help finding different code.

cor(educ, age)
cor(price, educ)

The AI showed me code where I could create a correlation matrix. It also gave me the numbers (>0.8, <-0.8) to use as cutoffs for too high correlations.

2: A code chunk was giving me an error. I copied and pasted the code into ChatGPT, which told me I needed to load the car package.

I was not able to knit my file because I was getting an error on line 62. I used ChatGPT to help identify the problem and was able to knit afterwards.

Johanna Heine

2024-12-09