Descriptive Analysis
library(ggplot2)
This R chunk looks at the distribution of personal income.
ggplot(drugData, aes(x = personalincome)) + geom_histogram(binwidth = 1)

The distribution of income is positively skewed: most respondents fall in the lower income categories, and only a small number fall in the higher ones.
The personalincome variable is coded in this data set as follows:
1 = Less than $10,000 (including loss)
2 = $10,000 - $19,999
3 = $20,000 - $29,999
4 = $30,000 - $39,999
5 = $40,000 - $49,999
6 = $50,000 - $74,999
7 = $75,000 or more
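Because the income codes are ordered categories rather than dollar amounts, it can help to attach the labels to the codes before tabulating or plotting. The following is a minimal sketch, assuming the variable holds the integer codes 1 to 7 listed above. The vector `income_codes` and the other object names are made up for illustration and are not part of the original analysis.

```r
# Hypothetical example codes; in the report these would come from drugData$personalincome.
income_codes <- c(1, 3, 7, 2, 2, 6)

# Labels copied from the coding scheme listed above.
income_labels <- c(
  "Less than $10,000 (including loss)",
  "$10,000 - $19,999",
  "$20,000 - $29,999",
  "$30,000 - $39,999",
  "$40,000 - $49,999",
  "$50,000 - $74,999",
  "$75,000 or more"
)

# An ordered factor keeps the categories in their natural order in tables and plots.
income_factor <- factor(income_codes, levels = 1:7, labels = income_labels, ordered = TRUE)

# Count how many example respondents fall in each category.
table(income_factor)
```

Keeping a labeled copy alongside the numeric codes leaves the regression models below unchanged while making the descriptive output easier to read.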
This R chunk looks at the distribution of sex.
ggplot(drugData, aes(x = sex)) + geom_histogram(binwidth = 1)

This chart shows that there are more women than men in the sample. Sex is coded 1 = men and 2 = women.
This R chunk looks at the distribution of education.
ggplot(drugData, aes(x = education)) + geom_histogram(binwidth = 1)

The education variable is coded 1 = Less than high school, 2 = High school graduate, 3 = Some college/associate degree, 4 = College graduate. The chart shows that the most common education level is some college or an associate degree.
This R chunk looks at the number of people who used any drug within the last year.
ggplot(drugData, aes(x = anydrugyear)) + geom_bar()

The anydrugyear variable is coded true if the person used any drug within the past year and false if the person did not.
Drug use and its effect on income
m1 <- lm(personalincome ~ factor(anydrugyear), data = drugData)
summary(m1)
Call:
lm(formula = personalincome ~ factor(anydrugyear), data = drugData)
Residuals:
Min 1Q Median 3Q Max
-2.6215 -1.6215 -0.6215 1.3785 4.1536
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.62151 0.01319 274.65 <2e-16 ***
factor(anydrugyear)true -0.77509 0.02605 -29.76 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.035 on 32038 degrees of freedom
Multiple R-squared: 0.0269, Adjusted R-squared: 0.02686
F-statistic: 885.5 on 1 and 32038 DF, p-value: < 2.2e-16
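The predictor is binary, so it takes only two values: true and false for any drug use within the past year. Because personalincome is an ordered category code from 1 to 7 that the model treats as numeric, the coefficients are in category units rather than dollars. People who used any drug within the year score on average 0.77509 income categories lower than those who did not (p < 0.001). The intercept, 3.62151, is the mean income category for the reference group, which here is people with no drug use within the year. The R-squared of 0.0269 shows that drug use alone explains under 3% of the variation in income category.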
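With a single binary predictor, the fitted values are just the two group means, which can be recovered from the coefficients by hand. A small sketch using the numbers printed in the summary(m1) output above; the object names are illustrative only.

```r
# Coefficients copied from the summary(m1) output above.
intercept <- 3.62151   # mean income category, no drug use in the past year
drug_coef <- -0.77509  # difference for any drug use in the past year

# Predicted mean income category for each group.
mean_no_drug <- intercept
mean_drug    <- intercept + drug_coef

c(no_drug_use = mean_no_drug, drug_use = mean_drug)
```

On the coding scheme listed earlier, both means sit near category 3 ($20,000 - $29,999), with the drug-use group somewhat below it and the reference group somewhat above it.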
Drug use and income, controlling for race
m2 <- lm(personalincome ~ anydrugyear + race_str, data = drugData)
summary(m2)
Call:
lm(formula = personalincome ~ anydrugyear + race_str, data = drugData)
Residuals:
Min 1Q Median 3Q Max
-3.0739 -1.9440 -0.1978 1.8101 4.9110
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.07388 0.05183 78.594 < 2e-16 ***
anydrugyeartrue -0.75405 0.02543 -29.656 < 2e-16 ***
race_strBlack/African American -1.11843 0.06048 -18.491 < 2e-16 ***
race_strHawaiian/Pacific Islander -1.23078 0.16354 -7.526 5.37e-14 ***
race_strHispanic -1.12989 0.05825 -19.396 < 2e-16 ***
race_strMixed -0.82808 0.07942 -10.427 < 2e-16 ***
race_strNative American/Alaskan Native -1.00959 0.10663 -9.468 < 2e-16 ***
race_strWhite -0.12206 0.05374 -2.271 0.0231 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.98 on 32032 degrees of freedom
Multiple R-squared: 0.07909, Adjusted R-squared: 0.07889
F-statistic: 393 on 7 and 32032 DF, p-value: < 2.2e-16
Controlling for race, a person who used drugs within the year scores on average 0.75405 income categories lower than someone who did not. Without race in the model (m1), the gap was 0.77509. Adding race therefore shrinks the estimated association between drug use and income only slightly, by about 0.02 categories. The race coefficients are differences from the omitted reference race category, which does not appear in the output.
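The size of the change between the two models can be checked directly from the printed coefficients. A minimal sketch, with illustrative object names:

```r
# Drug-use coefficients copied from the model summaries above.
b_m1 <- -0.77509  # model m1, no race control
b_m2 <- -0.75405  # model m2, race added

# How far the coefficient moved toward zero, in category units and in percent.
abs_change <- b_m2 - b_m1
pct_change <- abs_change / abs(b_m1) * 100

round(c(abs_change = abs_change, pct_change = pct_change), 3)
```

The shift is under 3% of the original coefficient, so race accounts for very little of the drug-use gap in these models.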
Interactions
Interaction between Drug Use and Race
m3 <- lm(personalincome ~ anydrugyear*race_str, data = drugData)
summary(m3)
Call:
lm(formula = personalincome ~ anydrugyear * race_str, data = drugData)
Residuals:
Min 1Q Median 3Q Max
-3.0812 -1.8771 -0.1328 1.8672 4.7955
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.08123 0.05559 73.414 < 2e-16 ***
anydrugyeartrue -0.80850 0.15127 -5.345 9.11e-08 ***
race_strBlack/African American -1.12696 0.06654 -16.936 < 2e-16 ***
race_strHawaiian/Pacific Islander -1.33956 0.18907 -7.085 1.42e-12 ***
race_strHispanic -1.20417 0.06346 -18.974 < 2e-16 ***
race_strMixed -0.89327 0.09405 -9.498 < 2e-16 ***
race_strNative American/Alaskan Native -1.01117 0.12478 -8.104 5.53e-16 ***
race_strWhite -0.10683 0.05800 -1.842 0.0655 .
anydrugyeartrue:race_strBlack/African American 0.05874 0.16658 0.353 0.7244
anydrugyeartrue:race_strHawaiian/Pacific Islander 0.43893 0.38297 1.146 0.2518
anydrugyeartrue:race_strHispanic 0.33129 0.16358 2.025 0.0429 *
anydrugyeartrue:race_strMixed 0.20835 0.19543 1.066 0.2864
anydrugyeartrue:race_strNative American/Alaskan Native 0.03554 0.25250 0.141 0.8881
anydrugyeartrue:race_strWhite -0.03314 0.15473 -0.214 0.8304
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.98 on 32026 degrees of freedom
Multiple R-squared: 0.07995, Adjusted R-squared: 0.07958
F-statistic: 214.1 on 13 and 32026 DF, p-value: < 2.2e-16
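The only statistically significant interaction in this model is for Hispanic respondents (0.33129, p = 0.0429). An interaction coefficient is not an income gain; it is the amount by which the drug-use gap for that group differs from the gap in the reference race category. The drug-use gap in the reference category is -0.80850, so for Hispanic respondents it is -0.80850 + 0.33129 = -0.47721. Hispanic respondents who used drugs within the year still score lower on income than Hispanic respondents who did not, but the gap is smaller than in the reference category. With six interaction terms tested and a p-value just under 0.05, this result should be read with caution.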
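In a model with an interaction, the drug-use gap for a given race group is the main effect plus that group's interaction coefficient. A short sketch of the arithmetic, using the values printed in the summary(m3) output above; the object names are illustrative.

```r
# Coefficients copied from the summary(m3) output above.
drug_main    <- -0.80850  # anydrugyeartrue: drug-use gap in the reference race category
int_hispanic <-  0.33129  # anydrugyeartrue:race_strHispanic

# Drug-use gap within each group (users minus non-users, in income categories).
drug_gap_reference <- drug_main
drug_gap_hispanic  <- drug_main + int_hispanic

c(reference = drug_gap_reference, hispanic = drug_gap_hispanic)
```

Both gaps are negative; the interaction only makes the gap for Hispanic respondents smaller in magnitude.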
Interaction Between Drug Use and Sex
m4 <- lm(personalincome ~ anydrugyear*sex, data = drugData)
summary(m4)
Call:
lm(formula = personalincome ~ anydrugyear * sex, data = drugData)
Residuals:
Min 1Q Median 3Q Max
-3.0387 -1.5916 -0.2499 1.7501 4.4084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.82743 0.04181 115.449 < 2e-16 ***
anydrugyeartrue -1.28181 0.08036 -15.950 < 2e-16 ***
sex -0.78878 0.02600 -30.338 < 2e-16 ***
anydrugyeartrue:sex 0.31178 0.05138 6.068 1.31e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.003 on 32036 degrees of freedom
Multiple R-squared: 0.05738, Adjusted R-squared: 0.0573
F-statistic: 650.1 on 3 and 32036 DF, p-value: < 2.2e-16
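The interaction between drug use and sex is positive and statistically significant (0.31178, p < 0.001), but it does not mean that drug use raises income. Sex enters the model as a numeric code (1 = men, 2 = women), so the drug-use gap is -1.28181 + 0.31178 × sex. That gives -0.97003 for men and -0.65825 for women: drug use is associated with lower income for both sexes, and the gap is 0.31178 categories smaller for women than for men.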
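Because sex is entered as the numeric code 1 or 2 rather than as a factor, the drug-use gap for each sex is the main effect plus the interaction coefficient multiplied by the code. A minimal sketch using the coefficients from the summary(m4) output above; the object names are illustrative.

```r
# Coefficients copied from the summary(m4) output above.
drug_main <- -1.28181  # anydrugyeartrue (gap at sex = 0, outside the coded range)
int_sex   <-  0.31178  # anydrugyeartrue:sex

# Sex codes as described earlier: 1 = men, 2 = women.
sex_codes <- c(men = 1, women = 2)

# Drug-use gap (users minus non-users) within each sex, in income categories.
drug_gap_by_sex <- drug_main + int_sex * sex_codes
drug_gap_by_sex
```

Recoding sex as a factor would give the same fit with coefficients that can be read directly as the gap for the reference sex and the difference for the other.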
Interaction Between Drug Use and Education
m5 <- lm(personalincome ~ anydrugyear*education, data = drugData)
summary(m5)
Call:
lm(formula = personalincome ~ anydrugyear * education, data = drugData)
Residuals:
Min 1Q Median 3Q Max
-3.6748 -1.6748 -0.0705 1.3252 5.4709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.06811 0.03649 29.271 < 2e-16 ***
anydrugyeartrue -0.30974 0.07220 -4.290 1.79e-05 ***
education 0.90168 0.01217 74.085 < 2e-16 ***
anydrugyeartrue:education -0.13097 0.02490 -5.259 1.45e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.85 on 32036 degrees of freedom
Multiple R-squared: 0.1962, Adjusted R-squared: 0.1961
F-statistic: 2606 on 3 and 32036 DF, p-value: < 2.2e-16
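The interaction between drug use and education is negative and statistically significant (-0.13097, p < 0.001). Education enters the model as a numeric code from 1 to 4, so the drug-use gap is -0.30974 - 0.13097 × education. The anydrugyeartrue coefficient of -0.30974 is the gap at education = 0, a value outside the coded range, so it should not be compared directly with the drug-use coefficients in the earlier models. Within the coded range the gap is -0.44071 for those with less than high school and -0.83362 for college graduates. The income gap associated with drug use therefore widens, rather than narrows, as education increases.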
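The drug-use gap at each education level follows from the main effect and the interaction coefficient. A small sketch of that calculation, using the values printed in the summary(m5) output above and the education codes 1 to 4 described earlier; the object names are illustrative.

```r
# Coefficients copied from the summary(m5) output above.
drug_main <- -0.30974  # anydrugyeartrue (gap at education = 0, outside the coded range)
int_edu   <- -0.13097  # anydrugyeartrue:education

# Education codes: 1 = less than high school ... 4 = college graduate.
education_codes <- 1:4

# Drug-use gap (users minus non-users) at each education level, in income categories.
drug_gap_by_education <- drug_main + int_edu * education_codes
names(drug_gap_by_education) <- paste0("education_", education_codes)
drug_gap_by_education
```

The gap is negative at every level and grows in magnitude with each step up in education.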
library(texreg)
screenreg(list(m1, m2, m3, m4, m5))
====================================================================================================================
                                                        Model 1     Model 2     Model 3     Model 4     Model 5
--------------------------------------------------------------------------------------------------------------------
(Intercept)                                              3.62 ***    4.07 ***    4.08 ***    4.83 ***    1.07 ***
                                                        (0.01)      (0.05)      (0.06)      (0.04)      (0.04)
factor(anydrugyear)true                                 -0.78 ***
                                                        (0.03)
anydrugyeartrue                                                     -0.75 ***   -0.81 ***   -1.28 ***   -0.31 ***
                                                                    (0.03)      (0.15)      (0.08)      (0.07)
race_strBlack/African American                                      -1.12 ***   -1.13 ***
                                                                    (0.06)      (0.07)
race_strHawaiian/Pacific Islander                                   -1.23 ***   -1.34 ***
                                                                    (0.16)      (0.19)
race_strHispanic                                                    -1.13 ***   -1.20 ***
                                                                    (0.06)      (0.06)
race_strMixed                                                       -0.83 ***   -0.89 ***
                                                                    (0.08)      (0.09)
race_strNative American/Alaskan Native                              -1.01 ***   -1.01 ***
                                                                    (0.11)      (0.12)
race_strWhite                                                       -0.12 *     -0.11
                                                                    (0.05)      (0.06)
anydrugyeartrue:race_strBlack/African American                                   0.06
                                                                                (0.17)
anydrugyeartrue:race_strHawaiian/Pacific Islander                                0.44
                                                                                (0.38)
anydrugyeartrue:race_strHispanic                                                 0.33 *
                                                                                (0.16)
anydrugyeartrue:race_strMixed                                                    0.21
                                                                                (0.20)
anydrugyeartrue:race_strNative American/Alaskan Native                           0.04
                                                                                (0.25)
anydrugyeartrue:race_strWhite                                                   -0.03
                                                                                (0.15)
sex                                                                                         -0.79 ***
                                                                                            (0.03)
anydrugyeartrue:sex                                                                          0.31 ***
                                                                                            (0.05)
education                                                                                                0.90 ***
                                                                                                        (0.01)
anydrugyeartrue:education                                                                               -0.13 ***
                                                                                                        (0.02)
--------------------------------------------------------------------------------------------------------------------
R^2                                                      0.03        0.08        0.08        0.06        0.20
Adj. R^2                                                 0.03        0.08        0.08        0.06        0.20
Num. obs.                                               32040       32040       32040       32040       32040
RMSE                                                     2.04        1.98        1.98        2.00        1.85
====================================================================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
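Across the five models, drug use within the past year is consistently associated with lower income. Model 5 has the highest R-squared (0.20, against 0.03 to 0.08 for the others), so it fits the data better than the other models. Most of that improvement comes from education, which is itself a strong predictor of income (0.90 categories per education level). The smaller anydrugyeartrue coefficient in Model 5 is the gap at education = 0 and does not by itself show that education explains the effect of drug use; the negative interaction indicates that the gap grows with education. A higher R-squared indicates a better fit to these data, not that drug use causes lower income.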
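When models with different numbers of predictors are compared, the adjusted R-squared is the fairer yardstick because it penalizes extra terms. The sketch below recomputes it from the R-squared values, sample size, and predictor counts reported in the model summaries above, using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The function name `adjusted_r2` is made up for this illustration.

```r
# Standard adjusted R-squared formula; the function name is hypothetical.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values copied from the model summaries above.
n  <- 32040                                   # observations in every model
r2 <- c(m1 = 0.0269, m2 = 0.07909, m3 = 0.07995, m4 = 0.05738, m5 = 0.1962)
p  <- c(m1 = 1, m2 = 7, m3 = 13, m4 = 3, m5 = 3)  # predictors, excluding the intercept

adj <- adjusted_r2(r2, n, p)
round(adj, 5)
```

The results should land close to the adjusted values printed by summary(), up to rounding of the inputs. With a sample this large the penalty is tiny, so the ranking of the models is the same on either measure.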