For this assignment, I am still working with the NHIS (National Health Interview Survey). This week, I decided to focus on respondents' geographic location and racial background. I wanted to see whether a relationship exists between the two and, if so, how likely a respondent of a given racial background is to live in a given region of the United States. My question is as follows:
Do a respondent's race and sex affect their geographic location (USA only)?
The outcome I will be focusing on is US Region. I chose to work with the file titled Persons for the 2017 calendar year. For your reference, you can download the data set and codebook here
VARIABLES USED
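The dependent variable is categorical and named US_Region. The independent variables are also categorical and named Sex and Race.
The variables are coded as follows:
Variable US_Region was coded as 1= Northeast, 2= Midwest, 3= South & 4= West
Variable Sex was coded as 1= Male & 2= Female
Variable Race was coded as 1= White, 2= Black & 3= Asian (the model output also reports a fourth level, Race4, which I do not interpret here)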
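As a small illustration of the coding scheme above, the sketch below attaches readable labels to numeric codes in base R. The data frame `toy` and its values are hypothetical stand-ins for a few rows of the real file, and only the labels listed in this section are used. Any code not listed in `levels` (for example a fourth race code) would become `NA`, so this is a sketch under those assumptions rather than the recoding used in the analysis.

```r
# Hypothetical toy rows; the real analysis reads these columns from the NHIS file.
toy <- data.frame(
  US_Region = c(3, 2, 4, 1),
  Sex       = c(2, 1, 1, 2),
  Race      = c(1, 2, 3, 1)
)

# Attach the labels described in the text to the numeric codes.
toy$US_Region <- factor(toy$US_Region, levels = 1:4,
                        labels = c("Northeast", "Midwest", "South", "West"))
toy$Sex  <- factor(toy$Sex,  levels = 1:2, labels = c("Male", "Female"))
toy$Race <- factor(toy$Race, levels = 1:3, labels = c("White", "Black", "Asian"))

print(toy)
```

With labelled factors, model output and simulated probabilities would show region and race names instead of bare codes.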
HYPOTHESIS
My hypothesis is that both race and sex are related to the US region a respondent lives in. I will be using the multinomial logit model because my dependent variable has 4 unordered categories.
NOTE TO READER: THROUGHOUT MY INTERPRETATIONS I USED THE TERMS GEOGRAPHIC LOCATION AND U.S. REGION INTERCHANGEABLY. IN THIS ANALYSIS, THEY REFER TO THE SAME THING WHICH IS THE SPECIFIC REGION OF THE UNITED STATES THAT OUR 50 STATES BELONG TO. ALSO, COMPLEX INSIGNIFICANT MODELS WERE SET AS COMMENTS FOR EASY VIEWING AND EFFICIENCY.
#install.packages("ZeligChoice")
library(dplyr)
library(tidyr)
library(Zelig)
library(readr)
library(ZeligChoice)
InputStat<-read_csv("/Users/safiesaf/Downloads/personsx.csv")
LivStat <- InputStat %>%
  rename("US_Region" = REGION,
         "Sex" = SEX,
         "Race" = RACRECI3,
         "Education" = EDUC1,
         "MaritalStat" = R_MARITL) %>%
  select(US_Region,
         Sex,
         Race,
         Education,
         MaritalStat) %>%
  mutate(US_Region = factor(US_Region),
         Race = factor(Race),
         Sex = factor(Sex),
         Education = factor(Education),
         MaritalStat = factor(MaritalStat))
head(LivStat)
## # A tibble: 6 x 5
##   US_Region Sex   Race  Education MaritalStat
##   <fct>     <fct> <fct> <fct>     <fct>
## 1 3         2     1     15        4
## 2 3         2     2     13        7
## 3 3         1     2     3         0
## 4 2         1     1     15        7
## 5 2         2     1     14        1
## 6 2         1     1     16        1
GENERATING MY MODEL WITH MLOGIT
MODEL 1:
This shows the effect of Gender (Sex) and Race on the US Region a person lives in
Z.Area <- zelig(US_Region~ Sex + Race, model = "mlogit", data = LivStat, cite = F)
summary(Z.Area)
## Model:
##
## Call:
## z5$zelig(formula = US_Region ~ Sex + Race, data = LivStat)
##
## Pearson residuals:
##                       Min      1Q  Median      3Q   Max
## log(mu[,1]/mu[,4]) -1.599 -0.3228 -0.2341 -0.2055 4.670
## log(mu[,2]/mu[,4]) -1.592 -0.3882 -0.3083 -0.1782 2.906
## log(mu[,3]/mu[,4]) -2.249 -0.7192 -0.4193  1.3277 1.688
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.352853   0.017360 -20.325  < 2e-16
## (Intercept):2  0.054737   0.015697   3.487 0.000489
## (Intercept):3  0.340610   0.014531  23.440  < 2e-16
## Sex2:1         0.034797   0.023097   1.507 0.131922
## Sex2:2        -0.003384   0.021189  -0.160 0.873111
## Sex2:3         0.035101   0.019264   1.822 0.068434
## Race2:1        0.816048   0.045231  18.042  < 2e-16
## Race2:2        0.429805   0.044886   9.575  < 2e-16
## Race2:3        1.553525   0.038095  40.780  < 2e-16
## Race3:1       -0.757856   0.042533 -17.818  < 2e-16
## Race3:2       -1.572683   0.049354 -31.865  < 2e-16
## Race3:3       -0.972730   0.035953 -27.056  < 2e-16
## Race4:1       -2.158730   0.128705 -16.773  < 2e-16
## Race4:2       -1.247438   0.074267 -16.797  < 2e-16
## Race4:3       -1.053044   0.062233 -16.921  < 2e-16
##
## Names of linear predictors: log(mu[,1]/mu[,4]), log(mu[,2]/mu[,4]),
## log(mu[,3]/mu[,4])
##
## Residual deviance: 204817.8 on 234381 degrees of freedom
##
## Log-likelihood: -102408.9 on 234381 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Reference group is level 4 of the response
## Next step: Use 'setx' method
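The coefficients in this output are log-odds contrasts against the reference region (level 4). One way to read them, sketched below in base R, is to exponentiate them into odds ratios relative to that reference. The numbers are copied from the Race2 rows above; the region labels (1 = Northeast, 2 = Midwest, 3 = South, with West as the reference) follow the interpretation used later in this report, and the object names are introduced here purely for illustration.

```r
# Race2 (Black vs. White) coefficients copied from the Model 1 output.
# Each one compares a region against the reference region (level 4, West).
race2_coef <- c(Northeast = 0.816048, Midwest = 0.429805, South = 1.553525)

# Exponentiating a multinomial logit coefficient gives the factor by which
# the odds of "region k rather than West" change for Black respondents
# relative to White respondents, holding Sex fixed.
race2_ratio <- exp(race2_coef)

print(round(race2_ratio, 2))
```

Read this way, a ratio above 1 means the odds of living in that region rather than the West are higher for Black respondents than for White respondents.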
MODEL 2:
This shows the effect of Gender (Sex), Race and Marital Status on the US Region someone chooses
#Z.Area2 <- zelig(US_Region~ Sex + Race+ MaritalStat, model = "mlogit", data = LivStat, cite = F)
#summary(Z.Area2)
MODEL 3:
This shows the effect of Gender (Sex), Race and Education on the US Region someone chooses
#Z.Area3 <- zelig(US_Region~ Sex + Race + Education, model = "mlogit", data = LivStat, cite = F)
#summary(Z.Area3)
I will be using Model 1 based on its significance. The output from the model shows that race is statistically significant where US region is concerned. However, when it comes to gender, the results are not significant at the 0.05 level. Because Race and Sex are the two variables in my research question, I will focus on them in my analysis.
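The significance claim for gender can be checked by hand. The sketch below recomputes the z statistics and two-sided p-values for the three Sex2 coefficients from the estimates and standard errors printed in the Model 1 output; the vector names are introduced here only for illustration.

```r
# Sex2 estimates and standard errors copied from the Model 1 output
# (contrasts of regions 1, 2 and 3 against the reference region 4).
sex_est <- c(0.034797, -0.003384, 0.035101)
sex_se  <- c(0.023097, 0.021189, 0.019264)

# Wald z statistic and two-sided p-value under a standard normal.
z <- sex_est / sex_se
p <- 2 * pnorm(-abs(z))

print(data.frame(contrast = c("1 vs 4", "2 vs 4", "3 vs 4"),
                 z = round(z, 3), p = round(p, 4)))
```

All three p-values come out above 0.05, which agrees with the printed Pr(>|z|) column.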
SETTING THE INDEPENDENT VARIABLE: RACE DIFFERENCE
After setting my counterfactual variable race, I want to see the probability of each race living in a particular region of the United States.
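x <- setx(Z.Area, Race = 1) # 1= White
x1 <- setx(Z.Area, Race = 2) # 2= Black
x2 <- setx(Z.Area, Race = 3) # 3= Asian
# sim() compares a baseline profile (x) with one alternative (x1),
# so the Asian profile is simulated in its own call (s.race2 is a name added here)
s.race <- sim(Z.Area, x = x, x1 = x1) # White vs. Black
s.race2 <- sim(Z.Area, x = x, x1 = x2) # White vs. Asian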
#summary(s.race)
The above results show that, on average, white respondents have a predicted probability of about 0.34 of living in the South, the highest of the four regions for this group. Their predicted probability of living in the Northeast is about 0.17.
On average, black respondents have a predicted probability of about 0.62 of living in the South. Their predicted probability of living in the West is only about 0.09.
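These predicted probabilities can be approximately reproduced from the Model 1 coefficients without simulation. The sketch below is a base R illustration, not the Zelig procedure itself. It assumes Sex is held at level 2 (female) for both profiles, an assumption chosen because it reproduces the figures quoted above closely; `mlogit_probs` is a helper name introduced here.

```r
# Coefficients copied from the Model 1 output, in the order
# Northeast, Midwest, South (West is the reference category).
intercept <- c(-0.352853, 0.054737, 0.340610)
sex2      <- c(0.034797, -0.003384, 0.035101)
race2     <- c(0.816048, 0.429805, 1.553525)

# Multinomial logit probabilities: the reference category has a linear
# predictor of 0, so it contributes exp(0) = 1 to the denominator.
mlogit_probs <- function(eta) {
  expo <- c(exp(eta), 1)
  probs <- expo / sum(expo)
  names(probs) <- c("Northeast", "Midwest", "South", "West")
  probs
}

# Assumption: Sex held at level 2 (female) for both race profiles.
p_white <- mlogit_probs(intercept + sex2)
p_black <- mlogit_probs(intercept + sex2 + race2)

print(round(rbind(White = p_white, Black = p_black), 3))
```

Because the simulated values are averages over random draws, small differences in the third decimal place between this calculation and the `sim()` output are expected.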
THE FIRST DIFFERENCE IN THE RACES
SETTING THE INDEPENDENT VARIABLE: SEX DIFFERENCE
After setting my counterfactual variable sex, I want to see the probability of male and female living in a particular region of the United States.
x <- setx(Z.Area, Sex = 1) # 1= Male
x1 <- setx(Z.Area, Sex = 2) # 2= Female
s.sex <- sim(Z.Area, x = x, x1 = x1)
summary(s.sex)
##
##  sim x :
##  -----
## ev
##              mean          sd       50%      2.5%     97.5%
## Pr(Y=1) 0.1687941 0.002062650 0.1688036 0.1648090 0.1730311
## Pr(Y=2) 0.2535378 0.002454800 0.2536393 0.2486399 0.2582611
## Pr(Y=3) 0.3375321 0.002526213 0.3375503 0.3322863 0.3424295
## Pr(Y=4) 0.2401361 0.002399518 0.2402193 0.2355625 0.2447858
## pv
##          1    2     3    4
## [1,] 0.166 0.28 0.314 0.24
##
##  sim x1 :
##  -----
## ev
##              mean          sd       50%      2.5%     97.5%
## Pr(Y=1) 0.1717464 0.002003337 0.1717577 0.1679154 0.1756188
## Pr(Y=2) 0.2485669 0.002292777 0.2485517 0.2442226 0.2532192
## Pr(Y=3) 0.3437570 0.002530787 0.3436202 0.3389997 0.3489257
## Pr(Y=4) 0.2359296 0.002253123 0.2359142 0.2316168 0.2404869
## pv
##         1     2     3     4
## [1,] 0.18 0.266 0.336 0.218
## fd
##                 mean          sd          50%          2.5%       97.5%
## Pr(Y=1)  0.002952343 0.002612516  0.003071791 -0.0022848577 0.007851819
## Pr(Y=2) -0.004970826 0.003238436 -0.005008311 -0.0112294299 0.001332322
## Pr(Y=3)  0.006224961 0.003411704  0.006207049 -0.0005093169 0.012756455
## Pr(Y=4) -0.004206478 0.003211184 -0.004271622 -0.0104578282 0.002145316
On average, male respondents have a predicted probability of about 0.34 of living in the South and about 0.17 of living in the Northeast.
On average, female respondents also have a predicted probability of about 0.34 of living in the South and about 0.17 of living in the Northeast.
The first differences (female minus male) show that women's predicted probability of living in the Northeast is about 0.003 higher than men's. It is about 0.005 lower in the Midwest and about 0.006 higher in the South. Lastly, women's predicted probability of living in the West is about 0.004 lower than men's. The 95% interval for each of these differences includes zero.
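Whether these first differences can be told apart from zero is easy to check from the printed `fd` table. The sketch below copies the mean and the 2.5% and 97.5% bounds into a small data frame and flags every interval that contains zero; the data frame and column names are introduced here for illustration.

```r
# First differences (female minus male) copied from the fd block of the
# sim output; lower and upper are the 2.5% and 97.5% simulation quantiles.
fd <- data.frame(
  region = c("Northeast", "Midwest", "South", "West"),
  mean   = c(0.002952343, -0.004970826, 0.006224961, -0.004206478),
  lower  = c(-0.0022848577, -0.0112294299, -0.0005093169, -0.0104578282),
  upper  = c(0.007851819, 0.001332322, 0.012756455, 0.002145316)
)

# An interval that spans zero means the difference between women and men
# cannot be distinguished from no difference at the 95% level.
fd$includes_zero <- fd$lower < 0 & fd$upper > 0

print(fd)
```

Every row is flagged `TRUE`, which matches the non-significant Sex coefficients in Model 1.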
CONCLUSION
The results show that gender makes little difference to geographic location: men and women have the same or nearly the same predicted probabilities of living in each region of the U.S. The differences are very small and not statistically significant, so they cannot be distinguished from random chance. Race, on the other hand, IS statistically significant where geographic location is concerned. Black respondents have a high predicted probability of living in the South (about 0.62), almost double that of white respondents (about 0.34). In the other regions, black respondents have much lower predicted probabilities, falling to about 0.09 in the West. My hypothesis was partly right in that I suspected that race and US Region would have a strong relationship. However, gender did not seem to have much of a relationship with U.S. Region.