DEM 7263 - Calculating Segregation indices for general geographic areas

We saw last time how to use R to calculate various indices of segregation using data from the 2010 Census Summary File 1, and the 2010 ACS 5 year summary file. We will use the libraries acs and seg extensively, as well as the RQGIS library to give us access to geoprocessing scripts from Qgis. I will also rely heavily on the tigris package to download Census geographies directly into R, without having to download them seperately myself.

In the previous example, I used the nested GEOID variable that the Census provides in all of its summary data products. For that example, we used tracts nested within counties. This is easy because the tract GEOID variable has the county code nested within it. For example, the first tract in Bexar county, Texas is code 48029110100, and the first five digits of this GEOID are the state and county fips code 48029.

For other geographies, this is not as simple, as the GEOID variables are not nested in such a fashion. For example, Public Use Microdata areas (PUMAs) are nested within states, but cross county borders. So they do not have a 5 digit state/county FIPS code to nest tracts within.

In order to get around this problem, we can use the QGIS application to perform a geographic intersection between two spatial polygon layers that we want to use. This will allow us to merge the population data at the census tract level to a higher level, so we can calculate a segregation index.

Below, I show how to use R and Qgis to calculate segregation indices for PUMAs, school districts and core based statistical areas (CBSAs)

ACS data extract

For this example, I will use the B03002 table from the 2015 5-year ACS summary file. This table contains counts of population by race and Hispanic ethnicity. I will extract census tracts within the state of Texas. These will be my smaller units that I will aggregate across, within the larger geographies to calculate the segregation measures.

library(rgdal)
library(tigris)
library(RColorBrewer)
library(RQGIS)
library(acs)
library(seg)

#Get 2010 ACS median household incomes for tracts in Texas
txtracts<-geo.make(state="TX",county="*", tract="*")

#ACS table B03002 is Hispanic ethnicity by race 
race.tract<-acs.fetch(key=mykey, endyear=2014, span=5, geography=txtracts, table.number="B03002")

colnames(race.tract@estimate)

##  [1] "B03002_001" "B03002_002" "B03002_003" "B03002_004" "B03002_005"
##  [6] "B03002_006" "B03002_007" "B03002_008" "B03002_009" "B03002_010"
## [11] "B03002_011" "B03002_012" "B03002_013" "B03002_014" "B03002_015"
## [16] "B03002_016" "B03002_017" "B03002_018" "B03002_019" "B03002_020"
## [21] "B03002_021"

tx.trrace<-data.frame( nhwhite= race.tract@estimate[, "B03002_003"],
                       nhblack=race.tract@estimate[, "B03002_004"], 
                       nhother=apply(race.tract@estimate[, c(5:11)], MARGIN = 1, sum ),
                       hispanic =race.tract@estimate[, "B03002_012"])

tx.trrace$ids<-paste(race.tract@geography$state,
                     sprintf("%03d", race.tract@geography$county),
                     sprintf("%05s", race.tract@geography$tract), sep="")

tx.trrace$cofips<-substr(tx.trrace$ids, 1,5)

head(tx.trrace)

##                                              nhwhite nhblack nhother
## Census Tract 9501, Anderson County, Texas       3910     659     103
## Census Tract 9504.01, Anderson County, Texas    1959    2257      52
## Census Tract 9504.02, Anderson County, Texas    1753    2470      37
## Census Tract 9505, Anderson County, Texas       2267     932     105
## Census Tract 9506, Anderson County, Texas       3185    2138     214
## Census Tract 9507, Anderson County, Texas        556    1063     181
##                                              hispanic         ids cofips
## Census Tract 9501, Anderson County, Texas         486 48001950100  48001
## Census Tract 9504.01, Anderson County, Texas     1581 48001950401  48001
## Census Tract 9504.02, Anderson County, Texas     1337 48001950402  48001
## Census Tract 9505, Anderson County, Texas        1411 48001950500  48001
## Census Tract 9506, Anderson County, Texas        1038 48001950600  48001
## Census Tract 9507, Anderson County, Texas         880 48001950700  48001

Now, I will perform the calculations of segregation for the various geographies. All of these examples will follow the same pattern: * load higher level geographies from tigris * Use QGIS to form the geographic intersection between the tracts and the higher level geographies. * Join the intersected data to the tracts using the tract GEOID * Calculate the segregation measures using a modification to the spseg() function, so we can calculate segregation for each higher level geography separately * Merge the calculated segregation measures back to the higher level geographies and map them using spplot()

Public Use Microdata Areas

PUMAs are a very useful geography to work with, because we can get individual level data from the IPUMS on many topics, and we could include segregation as a higher level predictor in an analysis.

puma<-pumas(state = "TX")
tract<-tigris::tracts(state = "48", cb = FALSE, refresh=T)

myenv<-set_env()

## Warning: running command 'find /usr/local/Cellar/ -name 'QGIS.app'' had
## status 1

myenv

## $root
## [1] "/Applications/QGIS.app"
## 
## $qgis_prefix_path
## [1] "/Applications/QGIS.app/Contents"
## 
## $python_plugins
## [1] "/Applications/QGIS.app/Contents/Resources/python/plugins"

#union buffer and tracts
find_algorithms(search_term = "intersect", qgis_env = myenv)

## [1] "ERROR: Opening of authentication db FAILED"                                        
## [2] "WARNING: Auth db query exec() FAILED"                                              
## [3] "Intersection----------------------------------------->qgis:intersection"           
## [4] "Line intersections----------------------------------->qgis:lineintersections"      
## [5] "Fuzzy intersection (and)----------------------------->saga:fuzzyintersectionand"   
## [6] "Intersect-------------------------------------------->saga:intersect"              
## [7] "Line-polygon intersection---------------------------->saga:linepolygonintersection"
## [8] "Polygon self-intersection---------------------------->saga:polygonselfintersection"
## [9] "Polygon-line intersection---------------------------->saga:polygonlineintersection"

get_usage(alg="qgis:intersection", qgis_env = myenv, intern = T)

##  [1] "ERROR: Opening of authentication db FAILED"
##  [2] "WARNING: Auth db query exec() FAILED"      
##  [3] "ALGORITHM: Intersection"                   
##  [4] "\tINPUT <ParameterVector>"                 
##  [5] "\tINPUT2 <ParameterVector>"                
##  [6] "\tIGNORE_NULL <ParameterBoolean>"          
##  [7] "\tOUTPUT <OutputVector>"                   
##  [8] ""                                          
##  [9] ""                                          
## [10] ""

params <- get_args_man(alg = "qgis:intersection", 
                       qgis_env = myenv)
params

## $INPUT
## [1] "None"
## 
## $INPUT2
## [1] "None"
## 
## $IGNORE_NULL
## [1] "False"
## 
## $OUTPUT
## [1] "None"

wd<-"/Users/ozd504//Google Drive/classes/dem7263/class17/data"

params$INPUT <- puma
params$INPUT2<-tract
params$OUTPUT<-file.path(wd, "puma_tract.shp")
# path to the output shapefile
pumatract <- run_qgis(alg = "qgis:intersection",
                          params = params,
                          load_output = params$OUTPUT,
                          qgis_env = myenv)

## ERROR: Opening of authentication db FAILEDWARNING: Auth db query exec() FAILED

cols<-brewer.pal(n=12 , name="Set3")
cols<-sample(cols, size = length(unique(pumatract$PUMACE10)), replace = T)
plot(pumatract, col=cols[as.numeric(pumatract$PUMACE10)], border=grey(.9))

race.sp<-geo_join(pumatract,tx.trrace,  "GEOID","ids")

head(race.sp@data)

##   STATEFP10 PUMACE10 GEOID10
## 0        48    03601 4803601
## 1        48    03601 4803601
## 2        48    03601 4803601
## 3        48    03601 4803601
## 4        48    03601 4803601
## 5        48    03601 4803601
##                                                      NAMELSAD10 MTFCC10
## 0 Brazos Valley COG (Outside Brazos County) & Milam County PUMA   G6120
## 1 Brazos Valley COG (Outside Brazos County) & Milam County PUMA   G6120
## 2 Brazos Valley COG (Outside Brazos County) & Milam County PUMA   G6120
## 3 Brazos Valley COG (Outside Brazos County) & Milam County PUMA   G6120
## 4 Brazos Valley COG (Outside Brazos County) & Milam County PUMA   G6120
## 5 Brazos Valley COG (Outside Brazos County) & Milam County PUMA   G6120
##   FUNCSTAT10     ALAND10  AWATER10  INTPTLAT10   INTPTLON10 STATEFP
## 0          S 14147111500 202332263 +30.8493793 -096.4762632      48
## 1          S 14147111500 202332263 +30.8493793 -096.4762632      48
## 2          S 14147111500 202332263 +30.8493793 -096.4762632      48
## 3          S 14147111500 202332263 +30.8493793 -096.4762632      48
## 4          S 14147111500 202332263 +30.8493793 -096.4762632      48
## 5          S 14147111500 202332263 +30.8493793 -096.4762632      48
##   COUNTYFP TRACTCE       GEOID NAME          NAMELSAD MTFCC FUNCSTAT
## 0      051  970400 48051970400 9704 Census Tract 9704 G5020        S
## 1      051  970500 48051970500 9705 Census Tract 9705 G5020        S
## 2      051  970100 48051970100 9701 Census Tract 9701 G5020        S
## 3      051  970300 48051970300 9703 Census Tract 9703 G5020        S
## 4      051  970200 48051970200 9702 Census Tract 9702 G5020        S
## 5      289  950100 48289950100 9501 Census Tract 9501 G5020        S
##       ALAND   AWATER    INTPTLAT     INTPTLON nhwhite nhblack nhother
## 0 538431560  8650811 +30.4627961 -096.4513889    1904     574       0
## 1 240096871 25958216 +30.3576068 -096.6084742    3023     524     200
## 2 254928575  3613845 +30.6293963 -096.6280153     942     100     107
## 3  26674773   265988 +30.5330133 -096.7009181    2554     713      78
## 4 646752996  7561628 +30.5083502 -096.7671764    3080     125     272
## 5 834389995  5216547 +31.4641407 -095.9191957    5175     622     112
##   hispanic         ids cofips
## 0      724 48051970400  48051
## 1      737 48051970500  48051
## 2       78 48051970100  48051
## 3      959 48051970300  48051
## 4      839 48051970200  48051
## 5     1640 48289950100  48289

dat<-race.sp #spatialpolygondataframe with the race data
cos<-unique(dat$PUMACE10) #unique county fips codes


spsegout<-function(dat, cos){
  theil<-numeric(length(cos))
  diss<-numeric(length(cos))
  div<-numeric(length(cos))
  isol<-numeric(length(cos))
  inter<-numeric(length(cos))
  pumace<-NULL
for (i in 1:length(cos)){
  #Modify the two groups in the next line, this usees NH White and Hispanic
  out<-spseg(dat[dat$PUMACE10==cos[i],],data =dat@data[dat$PUMACE10==cos[i],c("nhwhite", "hispanic")],  method="all", useC=T, negative.rm = T)
  
  div[i]<-out@r #Diversity
  diss[i]<-out@d #Dissimilarity
 theil[i]<-out@h #Theil Entrophy
 isol[i]<-out@p[2,2] #Isolation for Hispanics
 inter[i]<-out@p[1,2] #Interaction for Whites and Hispanics
  pumace[i]<-as.character(cos[i])
}
list(theil=theil, dissim=diss, diversity=div,isolation=isol, interaction=inter,puma=pumace)
}


spsegs<-data.frame(spsegout(dat,cos))

head(spsegs)

##           theil      dissim     diversity isolation interaction  puma
## 1  0.0022004524 0.032573021  0.0027150884 0.2326321   0.2268293 03601
## 2 -0.0104645369 0.032606115 -0.0144918578 0.3538277   0.3471275 04501
## 3 -0.0005585957 0.011015954 -0.0007835573 0.2567259   0.2557366 04504
## 4  0.0004439682 0.025233774  0.0006152952 0.5097524   0.4987791 04625
## 5  0.0090911710 0.015208613  0.0118778626 0.8959877   0.8922094 04602
## 6  0.0001228756 0.002442282  0.0001572151 0.9128087   0.9126782 04618

puma2<-geo_join(puma, spsegs, "PUMACE10", "puma")

#Map the index
spplot(puma2, "interaction", at=seq(0,1, .2), col.regions=brewer.pal(n=5, "Blues"), main="Hispanic/Non Hispanic Interaction index by PUMAs within Texas")

spplot(puma2, "isolation", at=seq(0,1,.2), col.regions=brewer.pal(n=5, "Blues"), main="Hispanic Isolation index by PUMAs within Texas")

School Districts

district<-school_districts(state = "TX", type="unified")

params <- get_args_man(alg = "qgis:intersection", 
                       qgis_env = myenv)
params

## $INPUT
## [1] "None"
## 
## $INPUT2
## [1] "None"
## 
## $IGNORE_NULL
## [1] "False"
## 
## $OUTPUT
## [1] "None"

params$INPUT <- district
params$INPUT2<-tract
params$OUTPUT<-file.path(wd, "schooldistrict_tract.shp")
# path to the output shapefile
sdtract <- run_qgis(alg = "qgis:intersection",
                          params = params,
                          load_output = params$OUTPUT,
                          qgis_env = myenv)

## ERROR: Opening of authentication db FAILEDWARNING: Auth db query exec() FAILED

cols<-brewer.pal(n=12 , name="Set3")
cols<-sample(cols, size = length(unique(sdtract$UNSDLEA)), replace = T)
plot(sdtract, col=cols[as.numeric(sdtract$UNSDLEA)], border=grey(.9))

race.sd<-geo_join(sdtract,tx.trrace,"geoid_2","ids" )
head(race.sd@data)

##   STATEFP UNSDLEA   GEOID                                  NAME LSAD
## 0      48   32280 4832280 Nederland Independent School District   00
## 1      48   32280 4832280 Nederland Independent School District   00
## 2      48   32280 4832280 Nederland Independent School District   00
## 3      48   32280 4832280 Nederland Independent School District   00
## 4      48   32280 4832280 Nederland Independent School District   00
## 5      48   32280 4832280 Nederland Independent School District   00
##   LOGRADE HIGRADE MTFCC SDTYP FUNCSTAT    ALAND  AWATER    INTPTLAT
## 0      PK      12 G5420  <NA>        E 45411569 3622561 +29.9772516
## 1      PK      12 G5420  <NA>        E 45411569 3622561 +29.9772516
## 2      PK      12 G5420  <NA>        E 45411569 3622561 +29.9772516
## 3      PK      12 G5420  <NA>        E 45411569 3622561 +29.9772516
## 4      PK      12 G5420  <NA>        E 45411569 3622561 +29.9772516
## 5      PK      12 G5420  <NA>        E 45411569 3622561 +29.9772516
##       INTPTLON statefp_2 COUNTYFP TRACTCE     geoid_2 name_2
## 0 -094.0054218        48      245  011001 48245011001 110.01
## 1 -094.0054218        48      245  980000 48245980000   9800
## 2 -094.0054218        48      245  007100 48245007100     71
## 3 -094.0054218        48      245  011002 48245011002 110.02
## 4 -094.0054218        48      245  011101 48245011101 111.01
## 5 -094.0054218        48      245  011203 48245011203 112.03
##              NAMELSAD mtfcc_2 funcstat_2  aland_2 awater_2  intptlat_2
## 0 Census Tract 110.01   G5020          S  3683278     5424 +29.9587159
## 1   Census Tract 9800   G5020          S  4546184    20989 +29.9515811
## 2     Census Tract 71   G5020          S 14592830   642583 +29.9115605
## 3 Census Tract 110.02   G5020          S  2298990        0 +29.9740297
## 4 Census Tract 111.01   G5020          S  3586121      875 +29.9668309
## 5 Census Tract 112.03   G5020          S  3500203   130615 +29.9859820
##     intptlon_2 nhwhite nhblack nhother hispanic         ids cofips
## 0 -093.9940534    4042     159     578      709 48245011001  48245
## 1 -094.0206917      19       0     119       17 48245980000  48245
## 2 -094.0281336    2124     266     170     1325 48245007100  48245
## 3 -093.9818113    3123      60     114      349 48245011002  48245
## 4 -094.0059588    3951      43     166      551 48245011101  48245
## 5 -094.0500790     465     764      43      445 48245011203  48245

dat<-race.sd #spatialpolygondataframe with the race data
cos<-unique(dat$UNSDLEA) #unique school district codes

spsegout<-function(dat, cos){
  theil<-numeric(length(cos))
  diss<-numeric(length(cos))
  div<-numeric(length(cos))
  isol<-numeric(length(cos))
  inter<-numeric(length(cos))
  school<-NULL
for (i in 1:length(cos)){
  #Modify the two groups in the next line, this usees NH White and Hispanic
  out<-spseg(dat[dat$UNSDLEA==cos[i],],data =dat@data[dat$UNSDLEA==cos[i],c("nhwhite","hispanic")],  method="all", useC=T, negative.rm=T)
  
  div[i]<-out@r #Diversity
  diss[i]<-out@d #Dissimilarity
 theil[i]<-out@h #Theil Entrophy
 isol[i]<-out@p[2,2] #Isolation for Hispanics
 inter[i]<-out@p[1,2] #Interaction for Whites and Hispanics
  school[i]<-as.character(cos[i])
}
list(theil=theil, dissim=diss, diversity=div,isolation=isol, interaction=inter,school=school)
}


spsegs<-data.frame(spsegout(dat,cos))

head(spsegs)

##          theil       dissim    diversity isolation interaction school
## 1 1.984326e-03 3.915320e-03 2.659673e-03 0.1652343   0.1646716  32280
## 2 1.110223e-16 3.065399e-16 4.440892e-16 0.1301094   0.1301094  38490
## 3 1.136642e-03 3.504607e-02 1.559223e-03 0.4538503   0.4391095  28230
## 4 2.404674e-04 6.804121e-03 3.325919e-04 0.5628839   0.5626281  36390
## 5 1.813791e-03 3.071926e-02 2.466440e-03 0.3762909   0.3723087  21030
## 6 0.000000e+00 0.000000e+00 0.000000e+00 0.1403803   0.1403803  11040

sdseg<-geo_join(district, spsegs, "UNSDLEA", "school")

#Map the index
spplot(sdseg, "interaction", at=seq(0,1, .2), col.regions=brewer.pal(n=5, "Blues"), main="Hispanic/Non Hispanic Interaction index by School Districts within Texas")

spplot(sdseg, "isolation",  at=seq(0,1, .2), col.regions=brewer.pal(n=5, "Blues"), main="Hispanic Isolation index by School Districts within Texas")

Metropolitan Areas

msa<-core_based_statistical_areas(cb=T)

names(msa)

## [1] "CSAFP"    "CBSAFP"   "AFFGEOID" "GEOID"    "NAME"     "LSAD"    
## [7] "ALAND"    "AWATER"

params <- get_args_man(alg = "qgis:intersection", 
                       qgis_env = myenv)

params$INPUT <- msa
params$INPUT2<-tract
params$OUTPUT<-file.path(wd, "msa_tract.shp")
# path to the output shapefile
msatract <- run_qgis(alg = "qgis:intersection",
                          params = params,
                          load_output = params$OUTPUT,
                          qgis_env = myenv)

## ERROR: Opening of authentication db FAILEDWARNING: Auth db query exec() FAILED

msatract<-msatract[msatract$STATEFP=="48",]

cols<-brewer.pal(n=12 , name="Set3")
cols<-sample(cols, size = length(unique(msatract$NAME)), replace = T)
plot(msatract, col=cols[as.numeric(msatract$NAME)], border=grey(.9))

#spplot(msatract, "NAME", col.regions=brewer.pal(n=12, name = "Set3"), auto.key=FALSE)


race.msa<-geo_join(msatract,tx.trrace,"geoid_2","ids" )
head(race.msa@data)

##   CSAFP CBSAFP       AFFGEOID GEOID           NAME LSAD      ALAND  AWATER
## 0  <NA>  45020 310M200US45020 45020 Sweetwater, TX   M2 2362061740 5052158
## 1  <NA>  45020 310M200US45020 45020 Sweetwater, TX   M2 2362061740 5052158
## 2  <NA>  45020 310M200US45020 45020 Sweetwater, TX   M2 2362061740 5052158
## 3  <NA>  45020 310M200US45020 45020 Sweetwater, TX   M2 2362061740 5052158
## 4  <NA>  45020 310M200US45020 45020 Sweetwater, TX   M2 2362061740 5052158
## 5  <NA>  45020 310M200US45020 45020 Sweetwater, TX   M2 2362061740 5052158
##   STATEFP COUNTYFP TRACTCE     geoid_2 name_2          NAMELSAD MTFCC
## 0      48      441  013600 48441013600    136  Census Tract 136 G5020
## 1      48      151  950300 48151950300   9503 Census Tract 9503 G5020
## 2      48      335  950400 48335950400   9504 Census Tract 9504 G5020
## 3      48      353  950100 48353950100   9501 Census Tract 9501 G5020
## 4      48      353  950500 48353950500   9505 Census Tract 9505 G5020
## 5      48      353  950200 48353950200   9502 Census Tract 9502 G5020
##   FUNCSTAT    aland_2 awater_2    INTPTLAT     INTPTLON nhwhite nhblack
## 0        S  701675542   422010 +32.3997269 -100.0143087    4172       6
## 1        S 1308923420  4090575 +32.6420058 -100.3982035    1332      37
## 2        S 1982961660  7252257 +32.2664096 -100.9049589    1174      49
## 3        S  505340985  3533009 +32.4442471 -100.3446966    1698      35
## 4        S 1834440353  1516728 +32.2671635 -100.4183080    1772       0
## 5        S    3300907     2421 +32.4828094 -100.3972167    2921     179
##   nhother hispanic         ids cofips
## 0     253      880 48441013600  48441
## 1       8      319 48151950300  48151
## 2     156      473 48335950400  48335
## 3      30      358 48353950100  48353
## 4      97      811 48353950500  48353
## 5      98     1100 48353950200  48353

dat<-race.msa #spatialpolygondataframe with the race data
cos<-unique(dat$CBSAFP) #unique cbsa codes

spsegout<-function(dat, cos){
  theil<-numeric(length(cos))
  diss<-numeric(length(cos))
  div<-numeric(length(cos))
  isol<-numeric(length(cos))
  inter<-numeric(length(cos))
  cbsa<-NULL
for (i in 1:length(cos)){
  #Modify the two groups in the next line, this usees NH White and Hispanic
  out<-spseg(dat[dat$CBSAFP==cos[i],],data =dat@data[dat$CBSAFP==cos[i],c("nhwhite","hispanic")],  method="all", useC=T, negative.rm=T)
  
  div[i]<-out@r #Diversity
  diss[i]<-out@d #Dissimilarity
 theil[i]<-out@h #Theil Entrophy
 isol[i]<-out@p[2,2] #Isolation for Hispanics
 inter[i]<-out@p[1,2] #Interaction for Whites and Hispanics
  cbsa[i]<-as.character(cos[i])
}
list(theil=theil, dissim=diss, diversity=div,isolation=isol, interaction=inter,cbsa=cbsa)
}


spsegs<-data.frame(spsegout(dat,cos))

head(spsegs)

##          theil     dissim    diversity isolation interaction  cbsa
## 1 -0.004304018 0.06677058 -0.006774552 0.2571533  0.23951264 45020
## 2 -0.009780796 0.02421253 -0.013079487 0.0904850  0.08868336 45500
## 3  0.013846127 0.04965099  0.018996401 0.6436788  0.62951323 18580
## 4 -0.023894987 0.04840618 -0.033088730 0.3163348  0.30051690 34420
## 5  0.000210398 0.01215022  0.000213342 0.1409942  0.13900592 11980
## 6  0.002167641 0.06438760  0.002958120 0.4640171  0.44042268 29500

msaseg<-geo_join(msa, spsegs, "CBSAFP", "cbsa", how="inner")

#Map the index
spplot(msaseg, "interaction", at=seq(0,1, .2), col.regions=brewer.pal(n=5, "Blues"), main="Hispanic/Non Hispanic Interaction index by CBSAs within Texas")

spplot(msaseg, "isolation",  at=seq(0,1, .2), col.regions=brewer.pal(n=5, "Blues"), main="Hispanic Isolation index by CBSAs within Texas")

Use the segregation measures in a model

So, often times we calculate these indices so we can use them in another application. So, below, I load data from the 2014 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART City data. Link, which includes a geographic identifier for each respondent, which indicates which metropolitan statistical area (MSA) they live in. We can join this to the segregation measures we calculated for the core based statistical areas (basically MSAs) and use it in a model.

Below, I test whether Hispanics living in areas with high hispanic isolation are more likely to

library(car)
load("/Users/ozd504/Google Drive//classes/dem7283/class17/data/brfss_14.Rdata")
#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss14)
head(nams, n=10)

##  [1] "dispcode" "genhlth"  "physhlth" "menthlth" "poorhlth" "hlthpln1"
##  [7] "persdoc2" "medcost"  "checkup1" "exerany2"

#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
set.seed(1234)
newnames<-gsub(pattern = "x_",replacement =  "",x =  nams)
names(brfss14)<-tolower(newnames)

#samps<-sample(1:nrow(brfss14), size = 10000, replace = F)
#brfss14<-brfss14[samps,]
#Poor or fair self rated health
#brfss14$badhealth<-ifelse(brfss14$genhlth %in% c(4,5),1,0)
brfss14$badhealth<-recode(brfss14$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss14$black<-recode(brfss14$racegr3, recodes="2=1; 9=NA; else=0")
brfss14$white<-recode(brfss14$racegr3, recodes="1=1; 9=NA; else=0")
brfss14$other<-recode(brfss14$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss14$hispanic<-recode(brfss14$racegr3, recodes="5=1; 9=NA; else=0")
brfss14$race_eth<-recode(brfss14$racegr3, recodes="1='nhwhite'; 2='nh black'; 3:4='nh other'; 5='hispanic'; else=NA", as.factor.result = T)
brfss14$race_eth<-relevel(brfss14$race_eth, ref = "nhwhite")
#insurance
brfss14$ins<-ifelse(brfss14$hlthpln1==1,1,0)

#income grouping
brfss14$inc<-ifelse(brfss14$incomg==9, NA, brfss14$incomg)

#education level
brfss14$educ<-recode(brfss14$educa, recodes="1:3='0LTHS';  4='2hsgrad';
                     5='3somecol'; 6='4colgrad';9=NA", as.factor.result=T)
brfss14$educ<-relevel(brfss14$educ, ref='0LTHS')

#employment
brfss14$employ<-recode(brfss14$employ, recodes="1:2='Employed'; 2:6='nilf';
                       7='retired'; 8='unable'; else=NA", as.factor.result=T)
brfss14$employ<-relevel(brfss14$employ, ref='Employed')

#marital status
brfss14$marst<-recode(brfss14$marital, recodes="1='married'; 2='divorced'; 3='widowed';
                      4='separated'; 5='nm';6='cohab'; else=NA", as.factor.result=T)
brfss14$marst<-relevel(brfss14$marst, ref='married')

#Age cut into intervals
brfss14$agec<-cut(brfss14$age80, breaks=c(0,24,39,59,79), include.lowest = T)

#Health behaviors
#insurance
brfss14$ins<-recode(brfss14$hlthpln1, recodes = "1=1; 2=0; else=NA" )

#smoking currently
brfss14$smoke<-recode(brfss14$smoker3, recodes="1:2='Current'; 3='Former';4='NeverSmoked'; else=NA", as.factor.result=T)
brfss14$smoke<-relevel(brfss14$smoke, ref = "NeverSmoked")

#bmi
brfss14$bmi<-ifelse(is.na(brfss14$bmi5)==T, NA, brfss14$bmi5/100)



txbrfss<-brfss14[grep("TX", brfss14$mmsaname),]

txbrfss<-merge(txbrfss, spsegs, by.x="mmsa", by.y="cbsa")
table(as.character(txbrfss$mmsaname))

## 
##                Austin-Round Rock, TX, Metropolitan Statistical Area 
##                                                                2261 
##            College Station-Bryan, TX, Metropolitan Statistical Area 
##                                                                 577 
##                   Corpus Christi, TX, Metropolitan Statistical Area 
##                                                                 623 
##                          El Paso, TX, Metropolitan Statistical Area 
##                                                                 723 
## Houston-The Woodlands-Sugar Land, TX, Metropolitan Statistical Area 
##                                                                2164 
##        San Antonio-New Braunfels, TX, Metropolitan Statistical Area 
##                                                                2283 
##                    Wichita Falls, TX, Metropolitan Statistical Area 
##                                                                 547

library(lme4)
bhtx<-glmer(badhealth~agec+race_eth+educ+(1|mmsa),
            txbrfss, family=binomial, weights = mmsawt/mean(txbrfss$mmsawt, na.rm=T),
            control=glmerControl(optCtrl=list(maxfun=2e4), optimizer="bobyqa"))

arm::display(bhtx, detail=T)

## glmer(formula = badhealth ~ agec + race_eth + educ + (1 | mmsa), 
##     data = txbrfss, family = binomial, control = glmerControl(optCtrl = list(maxfun = 20000), 
##         optimizer = "bobyqa"), weights = mmsawt/mean(txbrfss$mmsawt, 
##         na.rm = T))
##                  coef.est coef.se z value Pr(>|z|)
## (Intercept)       -1.74     0.15  -11.29    0.00  
## agec(24,39]        0.49     0.12    4.07    0.00  
## agec(39,59]        1.13     0.11    9.91    0.00  
## agec(59,79]        1.48     0.12   11.92    0.00  
## race_ethhispanic   0.19     0.09    2.24    0.03  
## race_ethnh black   0.42     0.10    4.17    0.00  
## race_ethnh other  -0.55     0.19   -2.90    0.00  
## educ2hsgrad       -0.71     0.09   -8.08    0.00  
## educ3somecol      -0.88     0.09   -9.99    0.00  
## educ4colgrad      -1.99     0.13  -15.56    0.00  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  mmsa     (Intercept) 0.14    
##  Residual             1.00    
## ---
## number of obs: 7805, groups: mmsa, 7
## AIC = 6418.2, DIC = 7901.8
## deviance = 7149.0

bhtx2<-glmer(badhealth~agec+race_eth+educ+race_eth*scale(isolation)+(1|mmsa),
             txbrfss, family=binomial, weights = mmsawt/mean(txbrfss$mmsawt, na.rm=T),
             control=glmerControl(optCtrl=list(maxfun=2e4), optimizer="bobyqa"))

arm::display(bhtx2, detail=T)

## glmer(formula = badhealth ~ agec + race_eth + educ + race_eth * 
##     scale(isolation) + (1 | mmsa), data = txbrfss, family = binomial, 
##     control = glmerControl(optCtrl = list(maxfun = 20000), optimizer = "bobyqa"), 
##     weights = mmsawt/mean(txbrfss$mmsawt, na.rm = T))
##                                   coef.est coef.se z value Pr(>|z|)
## (Intercept)                        -1.76     0.15  -12.06    0.00  
## agec(24,39]                         0.50     0.12    4.11    0.00  
## agec(39,59]                         1.14     0.11    9.92    0.00  
## agec(59,79]                         1.48     0.13   11.81    0.00  
## race_ethhispanic                    0.13     0.09    1.54    0.12  
## race_ethnh black                    0.43     0.10    4.17    0.00  
## race_ethnh other                   -0.64     0.20   -3.20    0.00  
## educ2hsgrad                        -0.73     0.09   -8.23    0.00  
## educ3somecol                       -0.91     0.09  -10.21    0.00  
## educ4colgrad                       -2.01     0.13  -15.67    0.00  
## scale(isolation)                    0.01     0.10    0.13    0.90  
## race_ethhispanic:scale(isolation)   0.21     0.12    1.79    0.07  
## race_ethnh black:scale(isolation)  -0.05     0.20   -0.27    0.79  
## race_ethnh other:scale(isolation)   0.87     0.32    2.76    0.01  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  mmsa     (Intercept) 0.05    
##  Residual             1.00    
## ---
## number of obs: 7805, groups: mmsa, 7
## AIC = 6405.1, DIC = 7911.4
## deviance = 7143.2

anova(bhtx, bhtx2)

## Data: txbrfss
## Models:
## bhtx: badhealth ~ agec + race_eth + educ + (1 | mmsa)
## bhtx2: badhealth ~ agec + race_eth + educ + race_eth * scale(isolation) + 
## bhtx2:     (1 | mmsa)
##       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## bhtx  11 6418.2 6494.7 -3198.1   6396.2                             
## bhtx2 15 6405.1 6509.5 -3187.5   6375.1 21.089      4  0.0003041 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1