For this HW assignment I will utilize the 2017 BRFSS data set. My dependent variable will be whether or not someone is hispanic or not (1 for Hispanic and 0 for not Hispanic). In order to determine if a person is Hispanic or not I will utilize the folloinfg independetn variables:

Income- (categorical) 3 levels: <$25k/yr, $25k-$75k/yr, >$75k/yr

Education (caetgorical) 3 levels: Less than High School, High school and College +

Age (categorical) 3 levels: 18-44, 44-59, 60+

Research Question: Can we determine whether or not someone is Hispanic with knowledge of their education, income and age?

First we will determine if the independent vaariables are significant with repsect to the dependent variable:

For education:
svychisq(~hisp + edu, design = des)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~hisp + edu, design = des)
## F = 192.61, ndf = 1.9379, ddf = 19825.0000, p-value < 2.2e-16
For Income:
svychisq(~hisp + inc, design = des)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~hisp + inc, design = des)
## F = 124.42, ndf = 1.9894, ddf = 20352.0000, p-value < 2.2e-16
For Age:
svychisq(~hisp + agec, design = des)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~hisp + agec, design = des)
## F = 40.36, ndf = 1.9949, ddf = 20408.0000, p-value < 2.2e-16

All three p-values are significant. We will continue:

Next we will look at the percentage of adults who are Hispanic by education:

qplot(x = tabe$edu, y = tabe$hisp, data = tabe, xlab = "Education", ylab = "% Hispanic") +
  geom_errorbar(aes(x=edu, ymin=hisp - 1.96*se, ymax= hisp+1.96*se), width=.25)+
  ggtitle(label = "% of adults who are Hispanic by Education")

Next by Income:  
qplot(x = tabi$inc, y = tabi$hisp, data = tabi, xlab = "Income", ylab = "% Hispanic") +
  geom_errorbar(aes(x=inc, ymin=hisp - 1.96*se, ymax= hisp+1.96*se), width=.25)+
  ggtitle(label = "% of adults who are Hispanic by Income")

And finally by Age
qplot(x = taba$agec, y = taba$hisp, data = taba, xlab = "Age", ylab = "% Hispanic") +
  geom_errorbar(aes(x=agec, ymin=hisp - 1.96*se, ymax= hisp+1.96*se), width=.25)+
  ggtitle(label = "% of adults who are Hispanic by Age")

From these graphs we can see that the greatest proportion of the Hispanic populatin in our survey tend to be have less educating, are younger and in the lower third of the income categories.

Here are the results of the model:

fit.logit <- svyglm(hisp~edu+inc+agec,
                    design=des,
                    family=binomial,
                    data = hw2)
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
summary(fit.logit)

Call: svyglm(formula = hisp ~ edu + inc + agec, design = des, family = binomial, data = hw2)

Survey design: svydesign(ids = ~1, weights = ~wt, data = hw2)

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3001 0.1143 2.625 0.00867 ** educollege+ -0.6490 0.1120 -5.797 6.97e-09 edult hs 1.7032 0.1873 9.091 < 2e-16 inc>$75k -1.1021 0.1394 -7.908 2.89e-15 inc$25k to < $75k -0.3893 0.1223 -3.183 0.00146 agec45-59 -0.6125 0.1233 -4.968 6.87e-07 agec60+ -1.3088 0.1324 -9.883 < 2e-16 * — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1

(Dispersion parameter for binomial family taken to be 1.001847)

Number of Fisher Scoring iterations: 5

stargazer(fit.logit,
          type = "html",
          tyle = "demography",
          covariate.labels = c( "High School", "Less Than High School", "Greater than $75k", "$25k to $75k", "age 45-59", "age 60+", "constant"),
          ci = T,
          column.sep.width = "10pt")
Dependent variable:
hisp
High School -0.649***
(-0.868, -0.430)
Less Than High School 1.703***
(1.336, 2.070)
Greater than 75k -1.102***
(-1.375, -0.829)
25k to 75k -0.389***
(-0.629, -0.150)
age 45-59 -0.612***
(-0.854, -0.371)
age 60+ -1.309***
(-1.568, -1.049)
constant 0.300***
(0.076, 0.524)
Observations 10,231
Log Likelihood -5,078.742
Akaike Inf. Crit. 10,171.480
Note: p<0.1; p<0.05; p<0.01
demography
#print(oddsratio)

And here we have the odds ratio and respective 95% CI’s

oddsratio <- exp(cbind("Odds ratio" = coef(fit.logit),
                        confint.default(fit.logit, level = 0.95)))
stargazer(oddsratio,
          type = "html",
          style = "demography",
          ci = T,
          column.sep.width = "10pt")
Odds ratio 2.5 % 97.5 %
(Intercept) 1.350 1.079 1.689
educollege+ 0.523 0.420 0.651
edult hs 5.491 3.804 7.927
inc> 0.332 0.253 0.436
incto < 0.678 0.533 0.861
agec45-59 0.542 0.426 0.690
agec60+ 0.270 0.208 0.350

Here are all of the respective groupings of our three different variables:

dat <- expand.grid(inc=levels(hw2$inc),edu=levels(hw2$edu), agec=levels(hw2$agec))
#str(dat)

fit <- predict(fit.logit, newdat=dat, type="response")
dat$fitted.prob.lrm <- round(fit, 3)
knitr::kable(head(dat, n=27))
inc edu agec fitted.prob.lrm
<$25k hs 18-44 0.574
>$75k hs 18-44 0.310
$25k to < $75k hs 18-44 0.478
<$25k college+ 18-44 0.414
>$75k college+ 18-44 0.190
$25k to < $75k college+ 18-44 0.323
<$25k lt hs 18-44 0.881
>$75k lt hs 18-44 0.711
$25k to < $75k lt hs 18-44 0.834
<$25k hs 45-59 0.423
>$75k hs 45-59 0.196
$25k to < $75k hs 45-59 0.331
<$25k college+ 45-59 0.277
>$75k college+ 45-59 0.113
$25k to < $75k college+ 45-59 0.206
<$25k lt hs 45-59 0.801
>$75k lt hs 45-59 0.572
$25k to < $75k lt hs 45-59 0.731
<$25k hs 60+ 0.267
>$75k hs 60+ 0.108
$25k to < $75k hs 60+ 0.198
<$25k college+ 60+ 0.160
>$75k college+ 60+ 0.060
$25k to < $75k college+ 60+ 0.114
<$25k lt hs 60+ 0.667
>$75k lt hs 60+ 0.399
$25k to < $75k lt hs 60+ 0.576

Now that we have the preliminaries taken care of we’ll take a look at a couple of interesting cases.

First, let’s look at a person in the 18-44 age category, who makes less than $25k a year without a high school education:

dat[which(dat$inc=="<$25k"&dat$edu=="lt hs"&dat$agec=="18-44"),]
##     inc   edu  agec fitted.prob.lrm
## 7 <$25k lt hs 18-44           0.881
And compare that with someone that is of the same categories, but has some colle education
dat[which(dat$inc=="<$25k"&dat$edu=="college+"&dat$agec=="18-44"),]
##     inc      edu  agec fitted.prob.lrm
## 4 <$25k college+ 18-44           0.414

And we can see that the percent chance that that person is Hispanic decreases by almost 50% from 88.1% to 41.4%.

From the research question, it appears that we are able to take an “educated guess” on whether someone were hispanic or not based on the knowledge of their age, income and education level.