Data from the 2014 National Health Interview Survey is used to find information on chronic food insecurity as it relates to emergency food assistance. The goal is to help agencies and public policy makers better understand the need for emergency food assistance. The following is an excerpt of the groups analysis that encompasses my contribution as well as my own personal analysis.
Variables to be used in the model are selected out of the 127 included in the data set.
nhisx <- read.csv("C:/DataMining/Data/family14.csv")
library(car)
set.seed(1)
Variables to be used in the model are selected out of the 127 included in the data set.
nhis=nhisx[,c(-1:-12,-17:-19,-21:-41,-43:-66,-68:-114,-116:-117,-119:-127)]
head(nhis)
## FLNGINTV FM_SIZE FM_KIDS FM_ELDR FM_EDUC1 FSRUNOUT FHICOVCT INCGRP5
## 1 1 1 0 0 2 3 0 1
## 2 1 3 0 1 5 3 3 2
## 3 1 3 1 0 5 3 3 2
## 4 1 1 0 1 9 3 1 4
## 5 1 1 0 1 4 3 1 1
## 6 1 3 0 0 8 3 3 4
## HOUSEOWN
## 1 2
## 2 2
## 3 2
## 4 1
## 5 1
## 6 1
FSRUNOUT (Worried food would run out before got money to buy more) is chosen as the dependent variable to focus on in the models. The variable is recoded so that “Often true” and “Sometimes true” are coded as a 1 and “Never true” is coded as a zero. We use a family’s or individual’s fear of running out of food to classify them as having food insecurity.
nhis$FSRUNOUT = recode(nhis$FSRUNOUT, "'1'=1;'2'=1;else=0")
response=nhis$FSRUNOUT
The independent variables that are categorical are recoded into binomial indicator variables to be used in a logistic regression and to remove nonresponses.
v1=rep(1,dim(nhis)[1])
v2=rep(0,dim(nhis)[1])
nhis$INCGRP5 = recode(nhis$INCGRP5, "'1'=1;'2'=2;'3'=3;'4'=4;'96'=NA;'99'=NA")
nhis$NIlessthan35k=ifelse(nhis$INCGRP5==1,v1,v2)
nhis$NI35kto74.9k=ifelse(nhis$INCGRP5==2,v1,v2)
nhis$NI75kto99k=ifelse(nhis$INCGRP5==3,v1,v2)
nhis$NIabove100k=ifelse(nhis$INCGRP5==4,v1,v2)
nhis$FLNGINTV = recode(nhis$FLNGINTV, "'1'=1;'2'=2;'3'=3;'4'=4;'8'=NA")
nhis$Englishspeaking=ifelse(nhis$FLNGINTV==1,v1,v2)
nhis$FM_EDUC1 = recode(nhis$FM_EDUC1, "'1'=1;'2'=1;'3'=2;'4'=2;'5'=2;'6'=3;'7'=3;'8'=4;'9'=4;'98'=NA;'99'=NA")
nhis$Nohighschool=ifelse(nhis$FM_EDUC1==1,v1,v2)
nhis$Highschooldegree=ifelse(nhis$FM_EDUC1==2,v1,v2)
nhis$Twoyrdegree=ifelse(nhis$FM_EDUC1==3,v1,v2)
nhis$Fourormoreyrcollegedegree=ifelse(nhis$FM_EDUC1==4,v1,v2)
nhis$HOUSEOWN = recode(nhis$HOUSEOWN, "'1'=1;'2'=2;'3'=3;'7'=NA;'8'=NA;'9'=NA")
nhis$Ownhome=ifelse(nhis$HOUSEOWN==1,v1,v2)
nhis$Renthome=ifelse(nhis$HOUSEOWN==2,v1,v2)
nhis$Otherlivingarrangement=ifelse(nhis$HOUSEOWN==3,v1,v2)
nhis=nhis[,c(-1,-5,-8,-9,-13,-21)]
xx=cbind(response,FamilySize=nhis$FM_SIZE,NumberofKids=nhis$FM_KIDS,NuberofSeniors=nhis$FM_ELDR,NumHealthCoverage=nhis$FHICOVCT,
EnglishSpeaking=nhis$Englishspeaking,OwnHome=nhis$Ownhome,RentHome=nhis$Renthome,NIlessthan35k=nhis$NIlessthan35k,
NI35kto74.9k=nhis$NI35kto74.9k,NI75kto99k=nhis$NI75kto99k,NoHighSchool=nhis$Nohighschool,TwoYrDegree=nhis$Twoyrdegree,
AtleastFourYrcollegeDegree=nhis$Fourormoreyrcollegedegree)
The data is separated into a training set and a test set.
n=dim(nhis)[1]
n1=floor(n*(0.6))
n1
## [1] 27358
n2=n-n1
n2
## [1] 18239
train=sample(1:n,n1)
xx=xx[,-1]
xtrain <- xx[train,]
xnew <- xx[-train,]
ytrain <- response[train]
ynew <- response[-train]
A logistic regression is run using FSRUNOUT as the dependent variable and fitted on the training data set.
m2=glm(response~.,family=binomial,data=data.frame(response=ytrain,xtrain))
summary(m2)
##
## Call:
## glm(formula = response ~ ., family = binomial, data = data.frame(response = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9917 -0.6457 -0.3662 -0.1703 3.1482
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.26505 0.18043 -23.639 < 2e-16 ***
## FamilySize 0.36657 0.02854 12.844 < 2e-16 ***
## NumberofKids -0.16997 0.03403 -4.994 5.92e-07 ***
## NuberofSeniors -0.56229 0.04200 -13.388 < 2e-16 ***
## NumHealthCoverage -0.03696 0.02286 -1.617 0.106
## EnglishSpeaking 0.31552 0.06714 4.700 2.61e-06 ***
## OwnHome -0.41552 0.10252 -4.053 5.05e-05 ***
## RentHome 0.09587 0.09968 0.962 0.336
## NIlessthan35k 2.77430 0.11920 23.275 < 2e-16 ***
## NI35kto74.9k 1.71758 0.11758 14.608 < 2e-16 ***
## NI75kto99k 0.88558 0.14170 6.250 4.11e-10 ***
## NoHighSchool 0.30778 0.05587 5.509 3.62e-08 ***
## TwoYrDegree -0.01223 0.05700 -0.215 0.830
## AtleastFourYrcollegeDegree -0.68026 0.05592 -12.164 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21759 on 24114 degrees of freedom
## Residual deviance: 18053 on 24101 degrees of freedom
## (3243 observations deleted due to missingness)
## AIC: 18081
##
## Number of Fisher Scoring iterations: 6
Almost all of the variables prove statistically significant in predicting FSRUNOUT except for renting home, the number of family members with health insurance, and having a two-year degree. The effect of NIlessthan35k is quite noticeable, having a total annual household income less than $35,000 increases a family’s or individual’s log odds of worrying about running out of food by 2.77. Similarly, a total annual household income between $35,000 and $74,900 will increase the log odds of FSRUNOUT by 1.72.
To demonstrate the ability of the model to accurately classify cases a confusion matrix is created.
p<-predict(m2,newdata=data.frame(xnew),type="response")
fitted.results<-predict(m2,newdata=data.frame(xnew),type='response')
fitted.results<-ifelse(fitted.results > 0.5,1,0)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(data=fitted.results, reference=ynew)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13046 2555
## 1 225 191
##
## Accuracy : 0.8264
## 95% CI : (0.8205, 0.8323)
## No Information Rate : 0.8286
## P-Value [Acc > NIR] : 0.7656
##
## Kappa : 0.0793
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.98305
## Specificity : 0.06956
## Pos Pred Value : 0.83623
## Neg Pred Value : 0.45913
## Prevalence : 0.82856
## Detection Rate : 0.81451
## Detection Prevalence : 0.97403
## Balanced Accuracy : 0.52630
##
## 'Positive' Class : 0
##
This output demonstrates that the model is quite good at predicting that a family or individual does not worry about food running out, with a false negative rate of 2%. However, the model is not very good at predicting those that do worry, with a false positive rate of 93%.
The following ROCR curve and its AUC show that the model is still relatively good at prediction.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pr<- prediction (p, ynew)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc<-performance(pr, measure = "auc")
auc<-auc@y.values[[1]]
auc
## [1] 0.7894864