#I spent many hours gathering the code needed here for my own research, and I thought it would be helpful to provide a template for others who want to do quantitative spatial research without the stress of figuring out how to write the code to analyze their data.
#This template was created to help researchers do quantitative analysis for the social sciences using U.S. Census data, but it can be adapted to any database, spatial or non-spatial. I will be using San Diego census data as the example for all of the analysis code. You can easily swap in the databases and variables from your own research if you decide to borrow this template. Good luck and happy researching!

Loading Packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidycensus)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tigris)
## To enable caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
options(tigris_class = "sf")
library(tmap)
library(rmapshaper)
## Registered S3 method overwritten by 'geojsonlint':
##   method         from 
##   print.location dplyr
library(sp)
library(spdep)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## Loading required package: rpart
## Registered S3 method overwritten by 'spatstat':
##   method     from
##   print.boxx cli
## 
## spatstat 1.64-1       (nickname: 'Help you I can, yes!') 
## For an introduction to spatstat, type 'beginner'
## 
## Note: R version 3.6.2 (2019-12-12) is more than a year old; we strongly recommend upgrading to the latest version
library(readxl)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(Hmisc)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:spatstat':
## 
##     panel.histogram
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:spatstat':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
library(VIM)
## Loading required package: colorspace
## 
## Attaching package: 'colorspace'
## The following object is masked from 'package:spatstat':
## 
##     coords
## Loading required package: grid
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:spatstat':
## 
##     shift
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(readr)
library(openxlsx)
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:datasets':
## 
##     rivers
library(countreg)
library(margins)
library(prediction)
library(webuse)
library(sjPlot)
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
census_api_key("1617af527eaf75d77339a0e5595f8067a78dbb26")
## To install your API key for use in future sessions, run this function with `install = TRUE`.
#To obtain census data you need your own Census API key. To get one for free, go to https://api.census.gov/data/key_signup.html and paste it into the call above.
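#A one-time setup sketch (with a placeholder key): passing `install = TRUE` saves the key to your .Renviron so you don't have to rerun census_api_key() in every session.
#census_api_key("YOUR_API_KEY_HERE", install = TRUE)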

Getting Census Data

#Here you can use this code to get census data from the American Community Survey for census tracts. The variables included here are just some of the many variables available in the ACS. For more variables, go to https://www.census.gov/programs-surveys/acs/technical-documentation/code-lists.html
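#If you'd rather search variable codes inside R, tidycensus's load_variables() returns the full ACS variable table (a small optional sketch, not part of the original template):
v18 <- load_variables(2018, "acs5", cache = TRUE)
head(v18) #or use View(v18) in RStudio to browse and search the table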
#This code also downloads the geographic (geometry) data, so you can eventually map the results as well.
#As an example, let's get census data for San Diego County:
SanDiego <- get_acs(geography = "tract", 
              year = 2018,
              variables = c(tpop = "B01003_001", tpopr = "B03002_001", 
                            nhwhite = "B03002_003", nhblk = "B03002_004",
                            nhasn = "B03002_006", hisp = "B03002_012", medincome= "B19013_001"), 
              state = "CA",
              county = "San Diego",
              survey = "acs5",
              geometry = TRUE) %>%
              dplyr::select(-moe) %>%
              spread(key = variable, value = estimate) %>%
              mutate(pnhwhite = nhwhite/tpopr, pnhasn = nhasn/tpopr,
                     pnhblk = nhblk/tpopr, phisp = hisp/tpopr,
                     oth = tpopr - (nhwhite + nhblk + nhasn + hisp),
                     poth = oth/tpopr, nonwhite = tpopr - nhwhite,
                     pnonwhite = nonwhite/tpopr) %>%
              dplyr::select(c(GEOID, tpop, tpopr, medincome, pnhwhite, pnhasn, pnhblk, phisp,
                              nhwhite, nhasn, nhblk, hisp, nonwhite, pnonwhite, oth, poth)) %>%
              filter(tpop != 0)
## Getting data from the 2014-2018 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

Getting Census Data for a Specific City

#If you want census data for a particular city, you can clip the county tracts to the city boundary. As an example, let's get census data for San Diego City:
pl <- places(state = "CA", cb = TRUE, year=2018)
cb <- core_based_statistical_areas(cb = TRUE, year=2018)
SD <- filter(cb, grepl("San Diego", NAME))
SD.city <- filter(pl, NAME == "San Diego")
SD.city.tracts <- ms_clip(target = SanDiego, clip = SD.city, remove_slivers = TRUE)
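#A quick optional check (an addition to the template): overlay the clipped tracts on the city boundary to confirm the clip worked as expected.
tm_shape(SD.city.tracts) + tm_polygons() + tm_shape(SD.city) + tm_borders(col = "red")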

Preliminary Analysis: Dealing with Missing Data

#If your dataset contains missing data, you can use the following code to find and then drop the rows with NA values:

summary(aggr(SD.city.tracts))

## 
##  Missings per variable: 
##   Variable Count
##      GEOID     0
##       tpop     0
##      tpopr     0
##  medincome     5
##   pnhwhite     0
##     pnhasn     0
##     pnhblk     0
##      phisp     0
##    nhwhite     0
##      nhasn     0
##      nhblk     0
##       hisp     0
##   nonwhite     0
##  pnonwhite     0
##        oth     0
##       poth     0
##   geometry     0
## 
##  Missings in combinations of variables: 
##                       Combinations Count   Percent
##  0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0   291 98.310811
##  0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0     5  1.689189
SD.city.tracts <- drop_na(SD.city.tracts)

#The line of code above drops every row with missing values from your dataset so that we can run calculations without receiving an error message.

summary(aggr(SD.city.tracts))

## 
##  Missings per variable: 
##   Variable Count
##      GEOID     0
##       tpop     0
##      tpopr     0
##  medincome     0
##   pnhwhite     0
##     pnhasn     0
##     pnhblk     0
##      phisp     0
##    nhwhite     0
##      nhasn     0
##      nhblk     0
##       hisp     0
##   nonwhite     0
##  pnonwhite     0
##        oth     0
##       poth     0
##   geometry     0
## 
##  Missings in combinations of variables: 
##                       Combinations Count Percent
##  0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0   291     100
#Running the summary code again shows that you no longer have missing data and can proceed with analyzing the dataset.

Preliminary Analysis: Creating a Histogram

#A good way to examine the distribution of a variable is with a histogram. For example, let's plot the number of Asian residents in San Diego City:
ggplot(SD.city.tracts) + 
  geom_histogram(mapping = aes(x= nhasn), bins = 50) +
  xlab("Number of Asians in San Diego City")

#You can also look at the descriptive stats of any variable:
SD.city.tracts%>%
summarise(mean= mean(nhasn), 
          sd= sd(nhasn), 
          max= max(nhasn))
## although coordinates are longitude/latitude, st_union assumes that they are planar
## Simple feature collection with 1 feature and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -117.2822 ymin: 32.53471 xmax: -116.9057 ymax: 33.11418
## CRS:            4269
##       mean       sd  max                       geometry
## 1 854.6907 1047.549 7603 MULTIPOLYGON (((-116.9266 3...
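#A small optional tweak (not shown in the original output): dropping the geometry column first skips the st_union step and the planar-coordinates warning, returning a plain summary table instead of a spatial object.
SD.city.tracts %>%
  sf::st_drop_geometry() %>%
  summarise(mean = mean(nhasn), sd = sd(nhasn), max = max(nhasn))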

Preliminary Analysis: Dissimilarity Index

#If you're dealing with racial data, a useful first analysis is the Dissimilarity Index. It gives the percentage of a group's members that would have to move to different neighborhoods for that group and Whites to be evenly distributed across the city. Again, we'll use San Diego City as an example.
SD.city.tracts <- SD.city.tracts %>%
  mutate(nhwhitec = sum(nhwhite), nonwhitec = sum(nonwhite),
         nhasnc = sum(nhasn), nhblkc = sum(nhblk), othc = sum(oth),
         hispc = sum(hisp), tpoprc = sum(tpopr)) %>%
  ungroup()

SD.city.tracts %>%
      mutate(d.wb = abs(nhblk/nhblkc-nhwhite/nhwhitec),
              d.wa = abs(nhasn/nhasnc-nhwhite/nhwhitec), 
              d.wh = abs(hisp/hispc-nhwhite/nhwhitec),
              d.wnw = abs(nonwhite/nonwhitec-nhwhite/nhwhitec)) %>%
      summarise(BWD = 0.5*sum(d.wb, na.rm=TRUE), AWD = 0.5*sum(d.wa, na.rm=TRUE),
                HWD = 0.5*sum(d.wh, na.rm=TRUE), NWWD = 0.5*sum(d.wnw, na.rm=TRUE))
## although coordinates are longitude/latitude, st_union assumes that they are planar
## Simple feature collection with 1 feature and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -117.2822 ymin: 32.53471 xmax: -116.9057 ymax: 33.11418
## CRS:            4269
##         BWD       AWD       HWD      NWWD                       geometry
## 1 0.5719762 0.4678355 0.5453133 0.4466351 MULTIPOLYGON (((-116.9266 3...

Preliminary Analysis: Interaction Index

#Another racial analysis is the Interaction Index. For each group X, it estimates the probability that a randomly drawn neighbor of a typical group-X resident is White. Using San Diego as an example again, we get:
SD.city.tracts %>%
      mutate(i.wb= (nhblk/nhblkc)*(nhwhite/tpopr),
         i.wa= (nhasn/nhasnc)*(nhwhite/tpopr),
         i.wh= (hisp/hispc)*(nhwhite/tpopr))%>%
  summarise(BWI=sum(i.wb, na.rm = TRUE), AWI= sum(i.wa, na.rm = TRUE),
            HWI= sum(i.wh, na.rm = TRUE))
## although coordinates are longitude/latitude, st_union assumes that they are planar
## Simple feature collection with 1 feature and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -117.2822 ymin: 32.53471 xmax: -116.9057 ymax: 33.11418
## CRS:            4269
##         BWI       AWI       HWI                       geometry
## 1 0.2936851 0.3771808 0.2729814 MULTIPOLYGON (((-116.9266 3...

Preliminary Analysis: Multigroup Entropy Index

#This code adds the variable "ent" to the dataset, which is each tract's multigroup entropy score. The score ranges from 0, when a tract contains only one group (no diversity), up to log(5) ≈ 1.609 when all five groups have equal shares (maximum diversity).
#Let's see what we get by using San Diego as an example:
SD.city.tracts <- SD.city.tracts %>%
     mutate(e1 =  ((pnhwhite)*log(1/(pnhwhite))), e2 = ((pnhasn)*log(1/(pnhasn))),
             e3 = ((pnhblk)*log(1/(pnhblk))), e4 = ((phisp)*log(1/(phisp))),
             e5 = ((poth)*log(1/(poth))), 
            e1 = replace(e1, is.nan(e1), 0), e2 = replace(e2, is.nan(e2), 0),
            e3 = replace(e3, is.nan(e3), 0), e4 = replace(e4, is.nan(e4), 0),
            e5 = replace(e5, is.nan(e5), 0),
            ent = (e1 + e2 + e3 + e4 +e5)) %>%
       dplyr::select(-c(e1:e5)) #If your race variables are in percentages just divide EACH percent variable by 100 in the formula

log(5)
## [1] 1.609438
#The code above gives the maximum entropy score for five racial/ethnic groups: log(5) ≈ 1.609.
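#If you also want a single citywide score that does run from 0 (complete integration) to 1 (complete segregation), Theil's H weights each tract's entropy gap by its population share. A hedged sketch (an addition to the template; it assumes the citywide counts created in the Dissimilarity Index step above):
SD.city.tracts %>%
  sf::st_drop_geometry() %>%
  mutate(pwc = nhwhitec/tpoprc, pac = nhasnc/tpoprc, pbc = nhblkc/tpoprc,
         phc = hispc/tpoprc, poc = othc/tpoprc,
         Ec = pwc*log(1/pwc) + pac*log(1/pac) + pbc*log(1/pbc) +
              phc*log(1/phc) + poc*log(1/poc)) %>% #citywide entropy, constant across tracts
  summarise(H = sum(tpopr*(Ec - ent))/(Ec[1]*tpoprc[1]))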

Preliminary Analysis: Location Quotient

#The Location Quotient for Racial Residential Segregation (LQRSS) is a measure of concentration: it divides a group's share of a tract by that group's share of the whole city, so values above 1 mean the group is over-represented in that tract. See what you get when you use San Diego as an example:
SD.city.tracts <- SD.city.tracts %>%
  mutate(blklq = pnhblk/(nhblkc/tpoprc),
         asnlq = pnhasn/(nhasnc/tpoprc),
         hisplq = phisp/(hispc/tpoprc),
         othlq = poth/(othc/tpoprc)) #If your race/ethnic shares are percentages rather than proportions, divide each percent variable by 100 first so the numerator and denominator are on the same scale

#The following code plots histograms of the Location Quotients so you can see how concentrated each racial/ethnic group is across the city's tracts
SD.city.tracts %>% 
  ggplot() + 
    geom_histogram(mapping = aes(x=blklq), bins= 50, na.rm=TRUE) +
    xlab("Black Location Quotient")

SD.city.tracts %>% 
  ggplot() + 
    geom_histogram(mapping = aes(asnlq), bins= 50, na.rm=TRUE) +
    xlab("Asian Location Quotient")

SD.city.tracts %>% 
  ggplot() + 
    geom_histogram(mapping = aes(hisplq), bins= 50, na.rm=TRUE) +
    xlab("Hispanic Location Quotient")

SD.city.tracts %>% 
  ggplot() + 
    geom_histogram(mapping = aes(othlq), bins= 50, na.rm=TRUE) +
    xlab("Other Location Quotient")

Preliminary Analysis: Scatter plots of Dependent Variable and Independent Variables with Lines of Best Fit

#Here you can create scatterplots of your dependent variable against each independent variable; the line of best fit helps visualize the correlation. We'll use median income as the dependent variable (y-axis) and the Asian population as the independent variable (x-axis):
ggplot(SD.city.tracts, aes(x = nhasn, y = medincome)) +
  stat_summary(fun.data = mean_cl_normal) +
  geom_smooth(method = 'lm', formula = y ~ x) +
  xlab("Asian") +
  ylab("Median income")
## Warning: Removed 287 rows containing missing values (geom_segment).

Preliminary Analysis: Making Maps of San Diego City for all Independent and Dependent Variables

#Here you can create maps of all your variables to visualize their spatial patterns. Again, let's use San Diego as an example and map the Asian population:
tmap_mode("view")
## tmap mode set to interactive viewing
SD_Asian <- tm_shape(SD.city.tracts, unit = "mi") +
    tm_fill("pnhasn", style="quantile", title="Percent Asian in San Diego", palette = "Reds") +
  tm_legend(main.title = FALSE,
            legend.outside = TRUE,
            frame = FALSE)
SD_Asian #NOTE: this code can only be used if your dataset contains geographic (geometry) data
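#To save the map, tmap_save() writes an interactive .html file in view mode (or a static image such as a .png in plot mode); the file name here is just an example.
tmap_save(SD_Asian, "SD_Asian.html")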

Preliminary Analysis: Collinearity Analysis of Regressor Variables

#Here you can run a collinearity analysis of your regressor variables to see which ones you should leave out of your regression model. Look for regressor variables with a variance inflation factor (VIF) above 2.5 (Paul Allison 2012, "When Can You Safely Ignore Multicollinearity?"). Let's make a model to check the effects of racial variables on median income:
model1 <- lm(medincome ~ nhasn + nhblk + hisp, data = SD.city.tracts)
ols_coll_diag(model1)
## Tolerance and Variance Inflation Factor
## ---------------------------------------
##   Variables Tolerance      VIF
## 1     nhasn 0.9803753 1.020018
## 2     nhblk 0.8395748 1.191079
## 3      hisp 0.8549808 1.169617
## 
## 
## Eigenvalue and Condition Index
## ------------------------------
##   Eigenvalue Condition Index   intercept      nhasn      nhblk       hisp
## 1  2.8134472        1.000000 0.036005651 0.04298010 0.04381130 0.03811122
## 2  0.5768988        2.208358 0.002399894 0.68488270 0.14272823 0.11336251
## 3  0.3704351        2.755899 0.158391578 0.05053387 0.79353380 0.24109153
## 4  0.2392189        3.429428 0.803202876 0.22160334 0.01992667 0.60743474

Final Analysis: Creating a Bivariate Correlation Matrix of all Independent Variables

#This plot visualizes the bivariate relationships among your variables: the scatter plots are arranged so that each variable appears in turn on the x-axis and the y-axis. Let's use median income and Asian as our example pair:
ggpairs(SD.city.tracts, columns = c("medincome", "nhasn"))
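#To scan all of the regressors used in the models below at once, just list more columns (an optional extension of the same call):
ggpairs(SD.city.tracts, columns = c("medincome", "nhasn", "nhblk", "hisp"))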

Final Analysis: Ordinary Least Squares Model

#For the following regression templates we will use the San Diego City dataset and some of its variables as an example, measuring the effects of race/ethnicity on median income.
#The most common regression model is the Ordinary Least Squares (OLS) model.
OLSm <- lm(medincome ~ nhasn + nhblk + hisp, data = SD.city.tracts) #To add more regressor variables, separate each additional variable with a +
summary(OLSm)
## 
## Call:
## lm(formula = medincome ~ nhasn + nhblk + hisp, data = SD.city.tracts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72175 -17221  -6840  13882  97879 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 94693.403   2817.016  33.615  < 2e-16 ***
## nhasn          11.352      1.618   7.018 1.62e-11 ***
## nhblk         -21.157      4.866  -4.348 1.91e-05 ***
## hisp          -10.013      1.233  -8.120 1.39e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28570 on 287 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.3623 
## F-statistic: 55.92 on 3 and 287 DF,  p-value: < 2.2e-16
(est <- cbind(Estimate = coef(OLSm), confint(OLSm))) #We can also get the confidence intervals for the coefficients
##                Estimate        2.5 %        97.5 %
## (Intercept) 94693.40338 89148.772455 100238.034315
## nhasn          11.35240     8.168616     14.536192
## nhblk         -21.15739   -30.734319    -11.580468
## hisp          -10.01328   -12.440518     -7.586041
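#A quick diagnostic sketch (an addition to the template): olsrr's residual Q-Q plot and Breusch-Pagan test help check the normality and constant-variance assumptions behind OLS.
ols_plot_resid_qq(OLSm)
ols_test_breusch_pagan(OLSm)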

Final Analysis: Poisson Model

#The Poisson regression is a generalized linear model designed for count data. Strictly speaking, median income is not a count, but we fit the same San Diego model here to illustrate the template:
pGLM <- glm(medincome ~ nhasn + nhblk + hisp, data = SD.city.tracts, family= poisson)
summary(pGLM)
## 
## Call:
## glm(formula = medincome ~ nhasn + nhblk + hisp, family = poisson, 
##     data = SD.city.tracts)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -239.10   -64.07   -20.04    43.15   390.06  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.150e+01  3.488e-04 32982.0   <2e-16 ***
## nhasn        1.128e-04  1.698e-07   664.4   <2e-16 ***
## nhblk       -2.794e-04  7.173e-07  -389.4   <2e-16 ***
## hisp        -1.519e-04  1.914e-07  -793.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4342187  on 290  degrees of freedom
## Residual deviance: 2581720  on 287  degrees of freedom
## AIC: 2585530
## 
## Number of Fisher Scoring iterations: 4
(est <- cbind(Estimate = coef(pGLM), confint(pGLM))) #We can also get the confidence intervals for the coefficients
## Waiting for profiling to be done...
##                  Estimate         2.5 %        97.5 %
## (Intercept) 11.5036124549 11.5029288231 11.5042960374
## nhasn        0.0001128240  0.0001124911  0.0001131567
## nhblk       -0.0002793523 -0.0002807584 -0.0002779466
## hisp        -0.0001518656 -0.0001522409 -0.0001514905
#Here you can create a rootogram to visualize whether the Poisson model is a good fit for your data
rootogram(pGLM) #Look for over- and under-predictions in the graph; if there aren't many, the model fits your data well.

Final Analysis: Negative Binomial Regression

#For count variables that have a concentration of 0s and are over-dispersed, a negative binomial regression may be the best fit.
NegBM <- glm.nb(medincome ~ nhasn + nhblk + hisp, data = SD.city.tracts)
summary(NegBM)
## 
## Call:
## glm.nb(formula = medincome ~ nhasn + nhblk + hisp, data = SD.city.tracts, 
##     init.theta = 8.696249791, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4496  -0.7484  -0.1837   0.4594   4.1315  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.142e+01  3.344e-02 341.406  < 2e-16 ***
## nhasn        1.501e-04  1.920e-05   7.817 5.41e-15 ***
## nhblk       -2.218e-04  5.775e-05  -3.841 0.000123 ***
## hisp        -1.239e-04  1.464e-05  -8.462  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(8.6962) family taken to be 1)
## 
##     Null deviance: 476.34  on 290  degrees of freedom
## Residual deviance: 296.57  on 287  degrees of freedom
## AIC: 6751.3
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  8.696 
##           Std. Err.:  0.708 
## 
##  2 x log-likelihood:  -6741.339
(est <- cbind(Estimate = coef(NegBM), confint(NegBM))) #We can also get the confidence intervals for the coefficients
## Waiting for profiling to be done...
##                  Estimate         2.5 %        97.5 %
## (Intercept) 11.4153874321 11.3572802838  1.147371e+01
## nhasn        0.0001500801  0.0001115143  1.898486e-04
## nhblk       -0.0002218317 -0.0003299021 -1.109342e-04
## hisp        -0.0001238570 -0.0001498985 -9.714173e-05
rootogram(NegBM)
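#A quick model comparison (an addition, using the AIC values printed above): lower AIC means better fit, and the negative binomial's AIC (6751.3) is far below the Poisson's (2585530), so it handles the over-dispersion in these data much better.
AIC(pGLM, NegBM)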

Final Analysis: Predictive Margins

#Here you can get a table of the average marginal effects (predictive margins) for each regressor.
#First specify the regression model you want; here it is a negative binomial regression of median income on race/ethnicity:
NegBM <- glm.nb(medincome ~ nhasn + nhblk + hisp, data = SD.city.tracts)
(marg1 <- margins(NegBM))
## Average marginal effects
## glm.nb(formula = medincome ~ nhasn + nhblk + hisp, data = SD.city.tracts,     init.theta = 8.696249791, link = log)
##  nhasn  nhblk   hisp
##  12.37 -18.29 -10.21
summary(marg1)
##  factor      AME     SE       z      p    lower   upper
##    hisp -10.2126 1.2532 -8.1494 0.0000 -12.6688 -7.7564
##   nhasn  12.3748 1.6506  7.4974 0.0000   9.1398 15.6098
##   nhblk -18.2911 4.8047 -3.8069 0.0001 -27.7081 -8.8741
theme_set(theme_sjplot())
#Here you can plot the predicted values of the outcome across any regressor that was significant in your regression test.
plot_model(NegBM, type = "pred", terms = "nhasn", title = "Predicted Median Income by Asian Population")

#Note that these predictions hold all other covariates at their average values while the plotted variable varies across its range. For example, in a model of police reports, the predicted effect of percent foreign born would be computed with median income, education, and the other covariates fixed at their means.