Report Description

In this report, we will investigate the incidence of missing children in the United States.

We will answer questions such as:

Are girls going missing at a different rate than boys?
Are white children going missing at a different rate than non-white children?
Do these gender and racial differences vary between states?
Does a state’s population density have any impact on the rate at which people are going missing?

3 datasets will be used for this investigation:

Census Data (from Social Explorer )
Missing Child Data from National Center of Missing & Exploited Children(via GitHub)
State Map Data (from R: map_data(“state”))

Preparing Data for Analysis

#-----------------------------#
#Load Packages & Prepare Data-#
#-----------------------------#
library(kableExtra)
library(knitr)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(magrittr)
library(ggmap)
library(maps)
library(mapdata)
library(nlme)
library(plotly)
library(ggthemes)
library(stargazer)
library(visreg)
library(texreg)


#----------------------------#
#----------------------------#
#-------Map Data-------------#
#----------------------------#
#----------------------------#

#Load Base Map Data
statemap <- map_data("state")

#Creating a list of State Names & State Abbreviations
#Some datasets only have full state names, some only have abbreviations.
#This will allow me to merge those dataframes together

          #Social Explorer has the full state name (New York, New Jersey, Connecticut, etc...)
          #Missing Person Data has the State Abbreviations (NY, NJ, CT)
          #Map data has the full state name (New York, New Jersey, Connecticut, etc...)

       stateabb = c("AK", "AL", "AR", 
                    "AZ", "CA", "CO", 
                    "CT", "DC", "DE", 
                    "FL", "GA", "HI",
                    "IA", "ID", "IL", "IN", "KS", 
                    "KY", "LA", "MA", 
                    "MD", "ME", "MI", "MN",
                    "MO", "MS",  "MT", 
                    "NC", "ND","NE", 
                    "NH", "NJ", "NM", 
                    "NV", "NY", "OH","OK", "OR", 
                    "PA", "PR", "RI", 
                    "SC", "SD","TN",
                    "TX", "UT", "VA", "VT","WA", 
                    "WI", "WV", "WY")
    
        statefull = tolower(c("Alaska","Alabama","Arkansas", 
                              "Arizona","California","Colorado",
                              "Connecticut","District of Columbia","Delaware",
                              "Florida" , "Georgia" ,"Hawaii",
                              "Iowa","Idaho","Illinois","Indiana","Kansas",
                              "Kentucky","Louisiana","Massachusetts",
                              "Maryland","Maine" ,"Michigan","Minnesota",
                              "Missouri","Mississippi","Montana",
                              "North Carolina","North Dakota","Nebraska",
                              "New Hampshire" , "New Jersey" ,  "New Mexico",
                              "Nevada","New York","Ohio","Oklahoma" ,"Oregon",
                              "Pennsylvania" , "Puerto Rico","Rhode Island",
                              "South Carolina", "South Dakota" ,"Tennessee" , 
                              "Texas" , "Utah", "Virginia","Vermont" ,"Washington",
                              "Wisconsin", "West Virginia" , "Wyoming"))

        statelist <- data.frame(statefull, stateabb)

# Add the abbreviations to the statemap data using "merge"
       
         states <-merge(statemap,statelist, 
                       all.x=TRUE, all.y=FALSE, 
                       by.x="region", by.y="statefull")

#----------------------------#         
#----------------------------#    
#----Missing children data---#
#----------------------------#
#----------------------------#
         
missingpersons <- read_csv("/Users/Brett/Library/Mobile Documents/com~apple~CloudDocs/All Files/Education/Data Analysis Masters/SPR17.Advanced Analytics/Homework/Missing Children/MediaReadyActiveCases_03082017.csv")%>%
          mutate(missingdate = as.Date(missingreporteddate, format = "%m/%d/%y %H:%M"),
                 birthdate = as.Date(birthdate, format = "%m/%d/%y %H:%M"),
                 daysmissing = Sys.Date()-missingdate,
                 white = factor(ifelse(race == "White","White","Not White")),
                 male = ifelse(sex=="Male",1,0),
                 missingfromstate = as.factor(missingfromstate),
                 race = factor(race),
                 age = Sys.Date()-birthdate)
                 

missingbysex<- missingpersons%>%
  select(missingfromstate, sex)%>%
  group_by(missingfromstate, sex)%>%
  summarize(n=n())%>%
  spread(sex, n)%>%
          rename(state = missingfromstate,
          missingfemales = Female,
          missingmales = Male)

#----------------------------#
#---Social Explorer Data-----#
#----State Populations-------#
#----------------------------#

statepop <- read_csv("/Users/Brett/Library/Mobile Documents/com~apple~CloudDocs/All Files/Education/Data Analysis Masters/SPR17.Advanced Analytics/Homework/Missing Children/R11374249_SL040.csv")%>%
    rename(                                  #Label Variables with descriptive names
           statename = Geo_NAME, 
           totalpop = SE_T001_001,
           popdensity = SE_T001_002,
           sextotalpop = SE_T002_001,        #Label Male & Female Variables
           malepop = SE_T002_002,
           femalepop = SE_T002_003,
        racetotalpop = SE_T019_001,           #Label Race Variables
        maleracetotalpop= SE_T019_002,
           whitemale = SE_T019_003,
           blackmale = SE_T019_004,
           nativemale = SE_T019_005,
           asianmale = SE_T019_006,
           pacifislamale = SE_T019_007,
           mixedmale = SE_T019_008,
        femracetotalpop = SE_T019_009,
           whitefem = SE_T019_010,
           blackfem = SE_T019_011,
           nativefem = SE_T019_012,
           asianfem = SE_T019_013,
           pacificislafem = SE_T019_014,
           mixedfem = SE_T019_015)%>%
    select(statename, totalpop, popdensity,malepop, femalepop)%>%
    mutate(statename=tolower(statename))  #Lowercase state name to match state map data

Merging Dataframes

#----------Merge-------------# 
#DF 1 = State Abbreviations--#
#DF 2 = Social Explorer Data-#
#----------------------------#
    
            #(This will allow us to merge Soc Explorer with Missing Person data)    

            sexpopulation2 <- merge(statepop, statelist,
                          all.x = TRUE,
                          all.y = TRUE,
                          by.x = "statename",
                          by.y = "statefull")%>%
  select(statename, stateabb,totalpop, malepop, femalepop,popdensity)


#----------Merge----------------# 
#DF 1 = Missing Person----------#
#DF 2 = Soc Explorer Data-------#
#-------------------------------#

# Step 1: Social Explorer + Missing Person Data
#                   Merge dataframes, match on state abbreviations

              SocAndMissing <- merge(sexpopulation2, missingbysex, 
                    all.x = TRUE, 
                    all.y = FALSE, 
                    by.x = "stateabb", 
                    by.y="state")%>% 
                mutate(
                    missingrate = (missingmales+missingfemales)/totalpop*100000, #Calculate missing person rate for each sex.
                    malemissingrate = ((missingmales)/(malepop))*100000,
                    femalemissingrate = ((missingfemales)/(femalepop))*100000,
                    missingratio = malemissingrate/femalemissingrate,
                    dominantmissingsex = ifelse(malemissingrate>femalemissingrate,
                                          "male",
                                          "female"))
                  #Multiplied % by 1000 for simple interpretation.
                  #Divide by 1000 for actual % rates at the end.  

              
              
#----------Merge---------------------------------# 
#DF 1 = State Map Data---------------------------#
#DF 2 = Soc Explorer & Missing Person Data-------#
#------------------------------------------------#
             
             missingbysexfinal <- merge(states,SocAndMissing,
                    all.x = TRUE, 
                    all.y = FALSE, 
                    by.x = "region", 
                    by.y="statename")

SocAndMissing%>%
  head()%>%
   kable(format="html", table.attr = "style='width:40%;'", align="l")%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

stateabb	statename	totalpop	malepop	femalepop	popdensity	missingfemales	missingmales	missingrate	malemissingrate	femalemissingrate	missingratio	dominantmissingsex
AK	alaska	738432	389025	349407	1.294046	4	5	1.218799	1.2852644	1.1447968	1.1227010	male
AL	alabama	4858979	2353184	2505795	95.939421	16	6	0.452770	0.2549737	0.6385199	0.3993199	female
AR	arkansas	2978204	1462856	1515348	57.234949	14	7	0.705123	0.4785160	0.9238802	0.5179416	female
AZ	arizona	6828065	3391490	3436575	60.109918	43	28	1.039826	0.8255958	1.2512458	0.6598191	female
CA	california	39144818	19443068	19701750	251.269840	293	171	1.185342	0.8794908	1.4871775	0.5913825	female
CO	colorado	5456574	2743763	2712811	52.648896	43	27	1.282856	0.9840500	1.5850717	0.6208236	female

Preliminary Investigation of Missing Person Rates

What is the missing person rate in each state?

Missing person rate will be calculated as: (# of Missing People in a State) / (State Population)

#Heatmap of missing person rate
map1 <- ggplot() + 
    geom_polygon(data = missingbysexfinal, 
                 aes(x = long, y = lat, fill =missingrate/1000, 
                     group = group, text = region),color = "grey")+
    coord_fixed(1.3)+
    xlab("Missing Person Rate by State") + ylab("")+ 
    theme_map()


ggplotly(map1, width = 800, height = 400, tooltip = c("region"))

#Sort States by Missing Rate
SocAndMissing<- transform(SocAndMissing, 
                    statename = reorder(statename, missingrate))

ggplot(data = SocAndMissing, aes(x=statename, y=missingrate/1000, label = sprintf("%2.4f %%", missingrate/1000)))+
    geom_bar(aes(x=statename, y=missingrate/1000, fill =missingrate/1000),stat = "identity")+
       coord_flip()+
       xlab("Missing Rate by State")+
    geom_text(position = position_dodge(1), size=5, angle=0, hjust = -.15,color = "black")+
    theme_fivethirtyeight()+  theme(axis.text=element_text(size=16))

Are more boys or girls missing?

missingpersons%>%
  select(missingfromstate, sex)%>%
  group_by(sex)%>%
  summarize(missing=n())%>%
  ggplot()+
  geom_col(aes(x=sex, y=missing, fill=sex))+
  labs(title="Number of Missing Children by Sex", x="",y="# Missing")+
  theme(legend.position="none")

Missing Ratio by State

To more precisely measure the disparity between missing boys & girls, we can plot the missing ratio for each state on a map.

Missing Ratio = (# missing boys) / (# missing girls)

A higher number = more missing boys (blue)
A smaller number = more missing girls (pink)
1 = equal boys & girls missing (purple)

The map shows a heavy bias towards girls.

map3<- ggplot() + 
    geom_polygon(data = missingbysexfinal, 
                 aes(x = long, y = lat, fill =missingratio, group = group, text = region),color = "grey")+
      scale_fill_gradient(low = "#FA8072", high = "darkblue", 
                          space = "Lab", na.value = "grey50", guide = "colourbar")+
      coord_fixed(1.3)+
      xlab("Missing Ratio: Blue = boys | Coral = girls") + ylab("")+
      theme_map()


ggplotly(map3,width = 800, height = 400, tooltip = c("region"))

#Sort States by Missing Ratio
SocAndMissing<- transform(SocAndMissing, 
                    statename = reorder(statename, missingratio))%>%
              filter(!is.na(dominantmissingsex))

ggplot()+
  geom_bar(data = SocAndMissing, aes(x=statename, y=missingratio, fill =missingratio),stat="identity")+
  geom_text(data = SocAndMissing, aes(x=statename, y=missingratio, label=round(missingratio,2)),
            position = position_dodge(1), size=5, angle=0, hjust = -.15,color = "black")+
  scale_fill_gradient(low = "#FA8072", high = "darkblue", space = "Lab", 
                      na.value = "grey50", guide = "colourbar")+
  coord_flip()+
  xlab("State")+
  theme_fivethirtyeight()+
  theme(axis.text=element_text(size=15 ))

Ecological Analysis of Missing Person Rates

I will run a linear regression to evaluate the difference in missing person rates between males and females. This is an ecological analysis, as we are evaluating the aggregate number of missing people by sex for each state. The units of analysis are aggregates, not individual responses.

#Prepare State-level data for Ecological Analysis
SocAndMissing2 <- missingbysex%>%
  rename(male=missingmales,
         female=missingfemales)%>%
  gather(sex,missing, male:female)
  


SocAndMissing2 <- merge(SocAndMissing2,sexpopulation2,
                    all.x = TRUE, 
                    all.y = FALSE, 
                    by.x = "state", 
                    by.y="stateabb")%>%
                  mutate(
                    missingrate = ifelse(sex == "male",
                                         (missing/malepop),
                                         (missing / femalepop)))

SocAndMissing2%>%
  head()%>%
   kable(format="html", table.attr = "style='width:40%;'", align="l")%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

state	sex	missing	statename	totalpop	malepop	femalepop	popdensity	missingrate
AK	male	5	alaska	738432	389025	349407	1.294046	1.29e-05
AK	female	4	alaska	738432	389025	349407	1.294046	1.14e-05
AL	female	16	alabama	4858979	2353184	2505795	95.939421	6.40e-06
AL	male	6	alabama	4858979	2353184	2505795	95.939421	2.50e-06
AR	female	14	arkansas	2978204	1462856	1515348	57.234949	9.20e-06
AR	male	7	arkansas	2978204	1462856	1515348	57.234949	4.80e-06

Linear Regression Analysis

Model 1: Effect of Sex on Missing Rate

Regression analysis shows boys experiencing a missing rate that is .0004(.04%) lower than that of girls.
Relative to the population, a .04% decrease in the missing person rate for males may not seem like much. However when we consider that this rate represents a 40% ((.10-.04)/.10) disparity between the missing rates experienced by males and females, it is clear that there is a striking difference.

Model 2: Effect of Sex and Population Density Interaction on Missing Rate

This linear regression model does not reveal any significant impact of population density.

missingreg <- lm(missingrate*100 ~ sex, data = SocAndMissing2)
missingreg2 <- lm(missingrate*100 ~ sex*popdensity, data = SocAndMissing2)
                  
htmlreg(list(missingreg, missingreg2), type="html",digits=4)

Statistical models
	Model 1	Model 2
(Intercept)	0.0010^***	0.0010^***
	(0.0001)	(0.0001)
sexmale	-0.0004^***	-0.0004^***
	(0.0001)	(0.0001)
popdensity		0.0000
		(0.0000)
sexmale:popdensity		-0.0000
		(0.0000)
R²	0.1908	0.1939
Adj. R²	0.1826	0.1690
Num. obs.	101	101
RMSE	0.0004	0.0004
p < 0.001, p < 0.01, p < 0.05

visreg(missingreg)

(Breheny and Burchett 2013)

The variable of interest in the analysis above - missingrate - can not be evaluated in a complete pooling, no-pooling, or partial pooling by random intercept method. This is due to the fact that the missing rate is a state-level variable, and not an individual level variable. We can only conduct an ecological analysis with this data.

Multi-Level Analysis

Using a generalized least squares fit by maximum likelihood regression analysis we will assess the effect of sex & race on # of days a child is missing. This metric (number of days missing) is an individual-level data point.

Complete Pooling: We will pool all respondents together, and evaluate missing rate irrespective of the state they are in.
No Pooling: We will run separate regressions for each state, to see the variation between states in how long people have been missing.
Partial Pooling: We will conduct partial-pooling analysis to account for the variation between states.

Data Used for Multi-Level Analysis

We will look at the number of days a child has been missing (DV), and the effects of sex & race. Each observation in our data represents a missing child, including how many days the child has been missing, their sex, their race, what state they are from, and other identifying characteristics.

childfirstname	childlastname	missingfromstate	daysmissing	white	sex
Christopher	Abeyta	CO	11567 days	White	Male
Aaron	Anderson	MN	10525 days	White	Male
Taranika	Raymond	LA	8052 days	Not White	Female
Ruben	Felix	ID	7673 days	Not White	Male
Tiffany	Dixon	NY	9640 days	Not White	Female
Jamal	Abdul’Faruq	VA	10193 days	Not White	Male

Complete Pooling

Pooling all respondents together, and evaluating # of days a child has been missing, irrespective of state.

Model 1: Effect of sex on # of days a person is missing

When all respondents are pooled together, we observe an intercept value of 2395 days missing for females., and observe that males tend to experience a higher number of days missing by 273 days.

Model 2: Interaction effect of sex & race on # of days a person is missing

Non-White females are estimated to be missing for 1,678 days.
Non-White Males are estimated to be missing for 2,129 days.
White females are estimated to be missing for 3,987 days.
White Males are estimated to be missing for 3,438 days.

missingreg3 <- gls(daysmissing ~ male, data = missingpersons, method = "ML") #Model 1
missingreg4 <- gls(daysmissing ~ male*white, data = missingpersons, method = "ML") #Model 2
   

htmlreg(list(missingreg3,missingreg4), type = "html", digits=0)

Statistical models
	Model 1	Model 2
(Intercept)	2749^***	1810^***
	(86)	(106)
male	273	451^*
	(142)	(179)
whiteWhite		2309^***
		(167)
male:whiteWhite		-549^*
		(276)
AIC	54519	54278
BIC	54537	54307
Log Likelihood	-27256	-27134
Num. obs.	2834	2834
p < 0.001, p < 0.01, p < 0.05

No Pooling

Running an regression for each state, and plotting the intercept and slope values so that we can evaluate differences betwen states for the number of days that a child is missing (intercept), and the impact of sex (slope).

No Pooling: Plotting the Intercepts

Most states show an intercept (missingrate for females) between 1,500 and 3,500 days. This is a wide range, especially if you are a missing person hoping to get back to your family!

dcoef <- missingpersons %>% 
    group_by(missingfromstate)%>% 
    do(mod = lm(daysmissing ~ male+age, data=.))

intc <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(intc, aes(x = intc)) + geom_histogram(fill="darkblue")+xlab("Intercept: # of Days Females are missing")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

No Pooling: Plotting the Slopes

While a regression with complete pooling would lead us to believe that males are going missing for more days than females, it is clear that the relationship between sex & number of days missing varies by state. It is actually nearly split down the middle, with the effect for males being positive in some states, and negative in others.

#Plotting Effect of Sex on days missing
coef <- dcoef %>% do(data.frame(sexc = coef(.$mod)[2]))
ggplot(coef, aes(x = sexc)) + geom_histogram(fill="darkred")

Partial Pooling

Lastly, we will conduct partial pooling by running a Linear mixed-effects model fit by maximum likelihood, with random effects. This will allow us to generate the best possible estimation of missing person rates while accounting for the variation between states.

Random Intercept Model:

To allow for variation in “Days Missing” between states

We find that when accounting for the variation between states using random intercept method, sex just barely misses 95% confidence, with a p-value of .0593. The relationship is weak, but shows a positive relationship between being male, and the # of days spent missing. This is consistent with the findings above.

m1_lme <- lme(daysmissing ~ male, 
              data = missingpersons, 
              random = ~1|missingfromstate, 
              method = "ML",
              na.action=na.exclude)

              #summary(m1_lme)
            htmlreg(m1_lme, type = "html", digits=0)

Statistical models
	Model 1
(Intercept)	3163^***
	(175)
male	265
	(140)
AIC	54419
BIC	54442
Log Likelihood	-27205
Num. obs.	2831
Num. groups	54
p < 0.001, p < 0.01, p < 0.05

(Henningsen and Toomet 2011)

Random Slope Model:

To Allow for difference in gender effect between states

We find that when accounting for the variation between genders using random slope method, sex is significant with 95% (almost 99%) confidence, with a p-value of .0183. Again, a positive relationship between being male and the # of days spent missing. This is consistent with the findings of the complete pooling & no pooling models.

m2_lme <- lme(daysmissing ~ male, 
              data = missingpersons, 
              random = ~sex|missingfromstate, 
              method = "ML",
              control = lmeControl(opt='optim'),
              na.action=na.exclude)

              #summary(m2_lme)
              htmlreg(m2_lme, type = "html",digits=0)

Statistical models
	Model 1
(Intercept)	3094^***
	(156)
male	376^*
	(159)
AIC	54420
BIC	54456
Log Likelihood	-27204
Num. obs.	2831
Num. groups	54
p < 0.001, p < 0.01, p < 0.05

Datasets used in this report:

Missing Children Data

https://data.world/jamesgray/missing-children-in-the-us

Publicly available information from The National Center for Missing and Exploited Children (NCMEC), released as part of the Cloudera Child Finder Hackathon to develop new methods for finding missing children.

Population Data

<www.socialexplorer.com>

New York City, NY: Social Explorer 2017

http://www.socialexplorer.com/pub/reportdata/HtmlResults.aspx?reportid=R11374249

Population Estimates in addition with Social Explorer tables provide the estimates of population, demographic components of change, and housing units for the 2015 vintage year. As each vintage of estimates includes all years since the most recent decennial census, the latest vintage of data available supersedes all previously-produced estimates for those dates.

Citations

Breheny, Patrick, and Woodrow Burchett. 2013. “Visualization of Regression Models Using Visreg.” R Package, 1–15.

Henningsen, Arne, and Ott Toomet. 2011. “MaxLik: A Package for Maximum Likelihood Estimation in R.” Computational Statistics 26 (3). Springer: 443–58.