Data from the 2014 National Health Interview Survey is used to find information on chronic food insecurity as it relates to emergency food assistance. The goal is to help agencies and public policy makers better understand the need for emergency food assistance. The following is an excerpt of the groups analysis that encompasses my contribution as well as my own personal analysis.

Variables to be used in the model are selected out of the 127 included in the data set.

nhisx <- read.csv("C:/DataMining/Data/family14.csv")
library(car)
set.seed(1)

Variables to be used in the model are selected out of the 127 included in the data set.

nhis=nhisx[,c(-1:-12,-17:-19,-21:-41,-43:-66,-68:-114,-116:-117,-119:-127)]
head(nhis)

##   FLNGINTV FM_SIZE FM_KIDS FM_ELDR FM_EDUC1 FSRUNOUT FHICOVCT INCGRP5
## 1        1       1       0       0        2        3        0       1
## 2        1       3       0       1        5        3        3       2
## 3        1       3       1       0        5        3        3       2
## 4        1       1       0       1        9        3        1       4
## 5        1       1       0       1        4        3        1       1
## 6        1       3       0       0        8        3        3       4
##   HOUSEOWN
## 1        2
## 2        2
## 3        2
## 4        1
## 5        1
## 6        1

FSRUNOUT (Worried food would run out before got money to buy more) is chosen as the dependent variable to focus on in the models. The variable is recoded so that “Often true” and “Sometimes true” are coded as a 1 and “Never true” is coded as a zero. We use a family’s or individual’s fear of running out of food to classify them as having food insecurity.

nhis$FSRUNOUT = recode(nhis$FSRUNOUT, "'1'=1;'2'=1;else=0")
response=nhis$FSRUNOUT

The independent variables that are categorical are recoded into binomial indicator variables to be used in a logistic regression and to remove nonresponses.

v1=rep(1,dim(nhis)[1])
v2=rep(0,dim(nhis)[1])
nhis$INCGRP5 = recode(nhis$INCGRP5, "'1'=1;'2'=2;'3'=3;'4'=4;'96'=NA;'99'=NA")
nhis$NIlessthan35k=ifelse(nhis$INCGRP5==1,v1,v2)
nhis$NI35kto74.9k=ifelse(nhis$INCGRP5==2,v1,v2)
nhis$NI75kto99k=ifelse(nhis$INCGRP5==3,v1,v2)
nhis$NIabove100k=ifelse(nhis$INCGRP5==4,v1,v2)
nhis$FLNGINTV = recode(nhis$FLNGINTV, "'1'=1;'2'=2;'3'=3;'4'=4;'8'=NA")
nhis$Englishspeaking=ifelse(nhis$FLNGINTV==1,v1,v2)
nhis$FM_EDUC1 = recode(nhis$FM_EDUC1, "'1'=1;'2'=1;'3'=2;'4'=2;'5'=2;'6'=3;'7'=3;'8'=4;'9'=4;'98'=NA;'99'=NA")
nhis$Nohighschool=ifelse(nhis$FM_EDUC1==1,v1,v2)
nhis$Highschooldegree=ifelse(nhis$FM_EDUC1==2,v1,v2)
nhis$Twoyrdegree=ifelse(nhis$FM_EDUC1==3,v1,v2)
nhis$Fourormoreyrcollegedegree=ifelse(nhis$FM_EDUC1==4,v1,v2)
nhis$HOUSEOWN = recode(nhis$HOUSEOWN, "'1'=1;'2'=2;'3'=3;'7'=NA;'8'=NA;'9'=NA")
nhis$Ownhome=ifelse(nhis$HOUSEOWN==1,v1,v2)
nhis$Renthome=ifelse(nhis$HOUSEOWN==2,v1,v2)
nhis$Otherlivingarrangement=ifelse(nhis$HOUSEOWN==3,v1,v2)
nhis=nhis[,c(-1,-5,-8,-9,-13,-21)]
xx=cbind(response,FamilySize=nhis$FM_SIZE,NumberofKids=nhis$FM_KIDS,NuberofSeniors=nhis$FM_ELDR,NumHealthCoverage=nhis$FHICOVCT,
         EnglishSpeaking=nhis$Englishspeaking,OwnHome=nhis$Ownhome,RentHome=nhis$Renthome,NIlessthan35k=nhis$NIlessthan35k,
         NI35kto74.9k=nhis$NI35kto74.9k,NI75kto99k=nhis$NI75kto99k,NoHighSchool=nhis$Nohighschool,TwoYrDegree=nhis$Twoyrdegree,
         AtleastFourYrcollegeDegree=nhis$Fourormoreyrcollegedegree)

The data is separated into a training set and a test set.

n=dim(nhis)[1]
n1=floor(n*(0.6))
n1

## [1] 27358

n2=n-n1
n2

## [1] 18239

train=sample(1:n,n1)
xx=xx[,-1]
xtrain <- xx[train,]
xnew <- xx[-train,]
ytrain <- response[train]
ynew <- response[-train]

A logistic regression is run using FSRUNOUT as the dependent variable and fitted on the training data set.

m2=glm(response~.,family=binomial,data=data.frame(response=ytrain,xtrain))
summary(m2)

## 
## Call:
## glm(formula = response ~ ., family = binomial, data = data.frame(response = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9917  -0.6457  -0.3662  -0.1703   3.1482  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -4.26505    0.18043 -23.639  < 2e-16 ***
## FamilySize                  0.36657    0.02854  12.844  < 2e-16 ***
## NumberofKids               -0.16997    0.03403  -4.994 5.92e-07 ***
## NuberofSeniors             -0.56229    0.04200 -13.388  < 2e-16 ***
## NumHealthCoverage          -0.03696    0.02286  -1.617    0.106    
## EnglishSpeaking             0.31552    0.06714   4.700 2.61e-06 ***
## OwnHome                    -0.41552    0.10252  -4.053 5.05e-05 ***
## RentHome                    0.09587    0.09968   0.962    0.336    
## NIlessthan35k               2.77430    0.11920  23.275  < 2e-16 ***
## NI35kto74.9k                1.71758    0.11758  14.608  < 2e-16 ***
## NI75kto99k                  0.88558    0.14170   6.250 4.11e-10 ***
## NoHighSchool                0.30778    0.05587   5.509 3.62e-08 ***
## TwoYrDegree                -0.01223    0.05700  -0.215    0.830    
## AtleastFourYrcollegeDegree -0.68026    0.05592 -12.164  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21759  on 24114  degrees of freedom
## Residual deviance: 18053  on 24101  degrees of freedom
##   (3243 observations deleted due to missingness)
## AIC: 18081
## 
## Number of Fisher Scoring iterations: 6

Almost all of the variables prove statistically significant in predicting FSRUNOUT except for renting home, the number of family members with health insurance, and having a two-year degree. The effect of NIlessthan35k is quite noticeable, having a total annual household income less than $35,000 increases a family’s or individual’s log odds of worrying about running out of food by 2.77. Similarly, a total annual household income between $35,000 and $74,900 will increase the log odds of FSRUNOUT by 1.72.

To demonstrate the ability of the model to accurately classify cases a confusion matrix is created.

p<-predict(m2,newdata=data.frame(xnew),type="response")
fitted.results<-predict(m2,newdata=data.frame(xnew),type='response')
fitted.results<-ifelse(fitted.results > 0.5,1,0)

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

confusionMatrix(data=fitted.results, reference=ynew)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13046  2555
##          1   225   191
##                                           
##                Accuracy : 0.8264          
##                  95% CI : (0.8205, 0.8323)
##     No Information Rate : 0.8286          
##     P-Value [Acc > NIR] : 0.7656          
##                                           
##                   Kappa : 0.0793          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.98305         
##             Specificity : 0.06956         
##          Pos Pred Value : 0.83623         
##          Neg Pred Value : 0.45913         
##              Prevalence : 0.82856         
##          Detection Rate : 0.81451         
##    Detection Prevalence : 0.97403         
##       Balanced Accuracy : 0.52630         
##                                           
##        'Positive' Class : 0               
##

This output demonstrates that the model is quite good at predicting that a family or individual does not worry about food running out, with a false negative rate of 2%. However, the model is not very good at predicting those that do worry, with a false positive rate of 93%.

The following ROCR curve and its AUC show that the model is still relatively good at prediction.

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

pr<- prediction (p, ynew)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc<-performance(pr, measure = "auc")
auc<-auc@y.values[[1]]
auc

## [1] 0.7894864

Chopped A

Hepp

October 23, 2017

Variables to be used in the model are selected out of the 127 included in the data set.

Variables to be used in the model are selected out of the 127 included in the data set.

The independent variables that are categorical are recoded into binomial indicator variables to be used in a logistic regression and to remove nonresponses.

The data is separated into a training set and a test set.

A logistic regression is run using FSRUNOUT as the dependent variable and fitted on the training data set.

To demonstrate the ability of the model to accurately classify cases a confusion matrix is created.

This output demonstrates that the model is quite good at predicting that a family or individual does not worry about food running out, with a false negative rate of 2%. However, the model is not very good at predicting those that do worry, with a false positive rate of 93%.

The following ROCR curve and its AUC show that the model is still relatively good at prediction.