For this HW assignment I will utilize the 2017 BRFSS data set. My dependent variable will be whether or not someone is hispanic or not (1 for Hispanic and 0 for not Hispanic). In order to determine if a person is Hispanic or not I will utilize the folloinfg independetn variables:
Income- (categorical) 3 levels: <$25k/yr, $25k-$75k/yr, >$75k/yr
Education (caetgorical) 3 levels: Less than High School, High school and College +
Age (categorical) 3 levels: 18-44, 44-59, 60+
Research Question: Can we determine whether or not someone is Hispanic with knowledge of their education, income and age?
First we will determine if the independent vaariables are significant with repsect to the dependent variable:
svychisq(~hisp + edu, design = des)
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~hisp + edu, design = des)
## F = 192.61, ndf = 1.9379, ddf = 19825.0000, p-value < 2.2e-16
svychisq(~hisp + inc, design = des)
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~hisp + inc, design = des)
## F = 124.42, ndf = 1.9894, ddf = 20352.0000, p-value < 2.2e-16
svychisq(~hisp + agec, design = des)
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~hisp + agec, design = des)
## F = 40.36, ndf = 1.9949, ddf = 20408.0000, p-value < 2.2e-16
All three p-values are significant. We will continue:
Next we will look at the percentage of adults who are Hispanic by education:
qplot(x = tabe$edu, y = tabe$hisp, data = tabe, xlab = "Education", ylab = "% Hispanic") +
geom_errorbar(aes(x=edu, ymin=hisp - 1.96*se, ymax= hisp+1.96*se), width=.25)+
ggtitle(label = "% of adults who are Hispanic by Education")
Next by Income:
qplot(x = tabi$inc, y = tabi$hisp, data = tabi, xlab = "Income", ylab = "% Hispanic") +
geom_errorbar(aes(x=inc, ymin=hisp - 1.96*se, ymax= hisp+1.96*se), width=.25)+
ggtitle(label = "% of adults who are Hispanic by Income")
And finally by Age
qplot(x = taba$agec, y = taba$hisp, data = taba, xlab = "Age", ylab = "% Hispanic") +
geom_errorbar(aes(x=agec, ymin=hisp - 1.96*se, ymax= hisp+1.96*se), width=.25)+
ggtitle(label = "% of adults who are Hispanic by Age")
From these graphs we can see that the greatest proportion of the Hispanic populatin in our survey tend to be have less educating, are younger and in the lower third of the income categories.
Here are the results of the model:
fit.logit <- svyglm(hisp~edu+inc+agec,
design=des,
family=binomial,
data = hw2)
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
summary(fit.logit)
Call: svyglm(formula = hisp ~ edu + inc + agec, design = des, family = binomial, data = hw2)
Survey design: svydesign(ids = ~1, weights = ~wt, data = hw2)
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3001 0.1143 2.625 0.00867 ** educollege+ -0.6490 0.1120 -5.797 6.97e-09 edult hs 1.7032 0.1873 9.091 < 2e-16 inc>$75k -1.1021 0.1394 -7.908 2.89e-15 inc$25k to < $75k -0.3893 0.1223 -3.183 0.00146 agec45-59 -0.6125 0.1233 -4.968 6.87e-07 agec60+ -1.3088 0.1324 -9.883 < 2e-16 * — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1
(Dispersion parameter for binomial family taken to be 1.001847)
Number of Fisher Scoring iterations: 5
stargazer(fit.logit,
type = "html",
tyle = "demography",
covariate.labels = c( "High School", "Less Than High School", "Greater than $75k", "$25k to $75k", "age 45-59", "age 60+", "constant"),
ci = T,
column.sep.width = "10pt")
| Dependent variable: | |
| hisp | |
| High School | -0.649*** |
| (-0.868, -0.430) | |
| Less Than High School | 1.703*** |
| (1.336, 2.070) | |
| Greater than 75k | -1.102*** |
| (-1.375, -0.829) | |
| 25k to 75k | -0.389*** |
| (-0.629, -0.150) | |
| age 45-59 | -0.612*** |
| (-0.854, -0.371) | |
| age 60+ | -1.309*** |
| (-1.568, -1.049) | |
| constant | 0.300*** |
| (0.076, 0.524) | |
| Observations | 10,231 |
| Log Likelihood | -5,078.742 |
| Akaike Inf. Crit. | 10,171.480 |
| Note: | p<0.1; p<0.05; p<0.01 |
| demography |
#print(oddsratio)
And here we have the odds ratio and respective 95% CI’s
oddsratio <- exp(cbind("Odds ratio" = coef(fit.logit),
confint.default(fit.logit, level = 0.95)))
stargazer(oddsratio,
type = "html",
style = "demography",
ci = T,
column.sep.width = "10pt")
| Odds ratio | 2.5 % | 97.5 % | |
| (Intercept) | 1.350 | 1.079 | 1.689 |
| educollege+ | 0.523 | 0.420 | 0.651 |
| edult hs | 5.491 | 3.804 | 7.927 |
| inc> | 0.332 | 0.253 | 0.436 |
| incto < | 0.678 | 0.533 | 0.861 |
| agec45-59 | 0.542 | 0.426 | 0.690 |
| agec60+ | 0.270 | 0.208 | 0.350 |
Here are all of the respective groupings of our three different variables:
dat <- expand.grid(inc=levels(hw2$inc),edu=levels(hw2$edu), agec=levels(hw2$agec))
#str(dat)
fit <- predict(fit.logit, newdat=dat, type="response")
dat$fitted.prob.lrm <- round(fit, 3)
knitr::kable(head(dat, n=27))
| inc | edu | agec | fitted.prob.lrm |
|---|---|---|---|
| <$25k | hs | 18-44 | 0.574 |
| >$75k | hs | 18-44 | 0.310 |
| $25k to < $75k | hs | 18-44 | 0.478 |
| <$25k | college+ | 18-44 | 0.414 |
| >$75k | college+ | 18-44 | 0.190 |
| $25k to < $75k | college+ | 18-44 | 0.323 |
| <$25k | lt hs | 18-44 | 0.881 |
| >$75k | lt hs | 18-44 | 0.711 |
| $25k to < $75k | lt hs | 18-44 | 0.834 |
| <$25k | hs | 45-59 | 0.423 |
| >$75k | hs | 45-59 | 0.196 |
| $25k to < $75k | hs | 45-59 | 0.331 |
| <$25k | college+ | 45-59 | 0.277 |
| >$75k | college+ | 45-59 | 0.113 |
| $25k to < $75k | college+ | 45-59 | 0.206 |
| <$25k | lt hs | 45-59 | 0.801 |
| >$75k | lt hs | 45-59 | 0.572 |
| $25k to < $75k | lt hs | 45-59 | 0.731 |
| <$25k | hs | 60+ | 0.267 |
| >$75k | hs | 60+ | 0.108 |
| $25k to < $75k | hs | 60+ | 0.198 |
| <$25k | college+ | 60+ | 0.160 |
| >$75k | college+ | 60+ | 0.060 |
| $25k to < $75k | college+ | 60+ | 0.114 |
| <$25k | lt hs | 60+ | 0.667 |
| >$75k | lt hs | 60+ | 0.399 |
| $25k to < $75k | lt hs | 60+ | 0.576 |
Now that we have the preliminaries taken care of we’ll take a look at a couple of interesting cases.
First, let’s look at a person in the 18-44 age category, who makes less than $25k a year without a high school education:
dat[which(dat$inc=="<$25k"&dat$edu=="lt hs"&dat$agec=="18-44"),]
## inc edu agec fitted.prob.lrm
## 7 <$25k lt hs 18-44 0.881
And compare that with someone that is of the same categories, but has some colle education
dat[which(dat$inc=="<$25k"&dat$edu=="college+"&dat$agec=="18-44"),]
## inc edu agec fitted.prob.lrm
## 4 <$25k college+ 18-44 0.414
And we can see that the percent chance that that person is Hispanic decreases by almost 50% from 88.1% to 41.4%.
From the research question, it appears that we are able to take an “educated guess” on whether someone were hispanic or not based on the knowledge of their age, income and education level.