In my group’s Chopped presentation, our goal was to build a model that predicts someone’s level of food insecurity from a variety of factors, with a special focus on involvement in government programs. In that analysis we used a linear regression because our dependent variable was not binary. We also had a large amount of missing data, which hurt the model’s fit and left us with an adjusted R-squared of only 0.151. Finally, we built the model on the entire data set instead of splitting the data into a training set and evaluating it on held-out samples.
In this additional analysis, I address those problems. First, I will make the dependent variable binary so that I can run a logistic regression: I will recode all “no, I have not experienced food insecurity” responses to the food insecurity question to 0 and all “yes, I have experienced food insecurity” responses to 1. Second, I will fill in the missing data so that the model has a complete data set to work from. Lastly, I will split the data into training and test sets to check how well the model performs on data it has not seen. The main goal of this further analysis is to build a model that predicts whether someone is food insecure, and I hope these changes will help me do so.
fam<-read.csv("Final Family (master analysis).csv")
head(fam)
## FM_SIZE FM_KIDS FM_ELDR FM_EDUC1 FSBALANC FSWEIGHT FMEDBNOP INCGRP5
## 1 1 0 1 5 3 2 2 1
## 2 2 0 1 5 3 2 2 4
## 3 2 0 2 4 3 2 2 2
## 4 2 0 2 8 3 2 2 3
## 5 1 0 1 4 3 2 2 1
## 6 2 0 0 8 3 2 2 4
## FSSRRYN FTANFYN FGAH FSNAP FNMEDYN F10DVYN FHICADCT FSALYN FSRUNOUT
## 1 1 2 2 2 2 2 0 2 3
## 2 1 2 2 2 2 2 0 1 3
## 3 1 2 2 2 2 2 0 2 3
## 4 1 2 2 2 2 2 0 1 3
## 5 1 2 2 2 2 2 1 2 3
## 6 2 2 2 2 2 1 0 1 3
## TELCELN
## 1 2
## 2 1
## 3 1
## 4 2
## 5 2
## 6 2
library(car)
library(norm)
library(lattice)
set.seed(1)
These variable names come from the original survey data set and do not describe what the columns contain. Next, we rename the variables and recode some of the survey answers to fill in the missing data. For example, every survey column includes responses such as “did not answer” and “other”. To build a predictive model, we recode those responses either to the “no” option for that column or to the average answer, depending on the variable type.
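The renaming and recoding code is omitted from the output above; the sketch below only illustrates the kind of recoding described. The column-to-name mapping (for example, treating FSRUNOUT as the food insecurity question and FSNAP as SNAP enrollment) and the “did not answer” code values are assumptions for illustration, not the actual survey codebook.
# Illustrative sketch only: the column mapping and the numeric survey codes
# (e.g., 97/98/99 for "refused"/"don't know"/"not ascertained") are assumptions.
FoodInsecurity <- ifelse(fam$FSRUNOUT %in% c(1, 2), 1, 0)  # yes-type answers -> 1, everything else -> 0
SNAP           <- ifelse(fam$FSNAP == 1, 1, 2)             # 1 = enrolled; all other codes collapsed to 2 = not enrolled
FamSz          <- fam$FM_SIZE                              # numeric counts carried over as-is
EduLevl        <- fam$FM_EDUC1
bad            <- EduLevl %in% c(97, 98, 99)               # assumed "did not answer" codes
EduLevl[bad]   <- round(mean(EduLevl[!bad]))               # replace them with the average answer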
Next, we visualize these new variables with the recoded values. We use contingency tables rather than graphs because plots of these categorical variables do not show their distributions against food insecurity clearly.
table(FoodInsecurity,Unbalanced_Meals)
## Unbalanced_Meals
## FoodInsecurity 1 2 3
## 0 129 835 36663
## 1 1462 3072 2677
table(FoodInsecurity,FamSz)
## FamSz
## FoodInsecurity 1 2 3 4 5 6 7 8 9 10
## 0 11612 12505 5229 4757 2248 829 284 93 44 16
## 1 2285 1780 1129 956 595 288 102 38 18 8
## FamSz
## FoodInsecurity 11 12 13 14 16
## 0 6 3 0 1 0
## 1 7 3 1 0 1
table(FoodInsecurity,Num_of_Elders)
## Num_of_Elders
## FoodInsecurity 0 1 2 3 4
## 0 27150 6986 3442 48 1
## 1 6072 933 200 5 1
table(FoodInsecurity,Num_of_kids)
## Num_of_kids
## FoodInsecurity 0 1 2 3 4 5 6 7 8 9
## 0 26233 4759 4177 1784 488 123 43 16 2 1
## 1 4182 1190 982 527 234 60 27 5 4 0
## Num_of_kids
## FoodInsecurity 10
## 0 1
## 1 0
table(FoodInsecurity,EduLevl)
## EduLevl
## FoodInsecurity 1 2 3 4 5 6 7 8 9
## 0 1033 1924 703 7072 6927 3190 1681 8719 6378
## 1 427 989 334 1853 1726 646 278 692 266
table(FoodInsecurity,WeightLost)
## WeightLost
## FoodInsecurity 1 2
## 0 57 37570
## 1 1180 6031
table(FoodInsecurity,UnpaidMedBills)
## UnpaidMedBills
## FoodInsecurity 1 2
## 0 1820 35807
## 1 1887 5324
table(FoodInsecurity,SSIMem)
## SSIMem
## FoodInsecurity 1 2
## 0 11053 26574
## 1 1955 5256
table(FoodInsecurity,WelfareMem)
## WelfareMem
## FoodInsecurity 1 2
## 0 488 37139
## 1 439 6772
table(FoodInsecurity,Reduced_Rent)
## Reduced_Rent
## FoodInsecurity 1 2
## 0 1317 36310
## 1 969 6242
table(FoodInsecurity,SNAP)
## SNAP
## FoodInsecurity 1 2
## 0 3712 33915
## 1 3096 4115
table(FoodInsecurity,MemIncome)
## MemIncome
## FoodInsecurity 1 2
## 0 27889 9738
## 1 4920 2291
table(FoodInsecurity,ten_Med_Visits)
## ten_Med_Visits
## FoodInsecurity 1 2
## 0 6742 30885
## 1 2039 5172
table(FoodInsecurity,No_Care)
## No_Care
## FoodInsecurity 1 2
## 0 2547 35080
## 1 2271 4940
table(FoodInsecurity,Inc.Grp)
## Inc.Grp
## FoodInsecurity 1 2 3 4
## 0 11445 14801 4056 7325
## 1 4951 1885 232 143
For the independent variables whose options are 1 or 2, 1 always means “yes, I am enrolled in that program” and 2 means “no, I am not enrolled in that program”. As the tables show, the FoodInsecurity variable is heavily imbalanced toward 0: only about 7,200 of the 44,838 households (roughly 16%) report food insecurity.
Next, we combine the renamed and recoded variables into one object, xx, which we convert to a data frame when fitting the regressions that follow.
xx=cbind(FoodInsecurity,Unbalanced_Meals,FamSz,Num_of_Elders,Num_of_kids,
EduLevl,WeightLost,UnpaidMedBills,SSIMem,WelfareMem,Reduced_Rent,SNAP,
MemIncome,ten_Med_Visits,No_Care,Cell,Inc.Grp)
m1=glm(FoodInsecurity~.,family=binomial,data=data.frame(xx))
summary(m1)
##
## Call:
## glm(formula = FoodInsecurity ~ ., family = binomial, data = data.frame(xx))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0973 -0.3731 -0.2667 -0.1702 3.1865
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.778840 0.441967 40.227 < 2e-16 ***
## Unbalanced_Meals -2.856831 0.042150 -67.777 < 2e-16 ***
## FamSz 0.151010 0.023358 6.465 1.01e-10 ***
## Num_of_Elders -0.409399 0.048861 -8.379 < 2e-16 ***
## Num_of_kids 0.015030 0.030988 0.485 0.627652
## EduLevl -0.098979 0.009381 -10.551 < 2e-16 ***
## WeightLost -2.825652 0.159302 -17.738 < 2e-16 ***
## UnpaidMedBills -0.602035 0.055824 -10.785 < 2e-16 ***
## SSIMem -0.074627 0.058242 -1.281 0.200077
## WelfareMem -0.151197 0.096347 -1.569 0.116578
## Reduced_Rent -0.260718 0.068677 -3.796 0.000147 ***
## SNAP -0.682530 0.047804 -14.278 < 2e-16 ***
## MemIncome -0.108838 0.051138 -2.128 0.033309 *
## ten_Med_Visits -0.110663 0.045997 -2.406 0.016134 *
## No_Care -0.664810 0.050879 -13.066 < 2e-16 ***
## Cell 0.024727 0.056689 0.436 0.662696
## Inc.Grp -0.567140 0.028495 -19.903 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 39550 on 44837 degrees of freedom
## Residual deviance: 21220 on 44821 degrees of freedom
## AIC: 21254
##
## Number of Fisher Scoring iterations: 6
This logistic regression looks like a strong model: most of the variables are statistically significant, and the residual deviance (21,220) is far below the null deviance (39,550). According to the coefficients, most variables have a negative (inverse) relationship with food insecurity. The variables associated with an increase in food insecurity are family size, number of kids, and cell phone ownership, and of those three only family size is significant.
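To put a single number on that improvement in fit, comparable in spirit to the adjusted R-squared of 0.151 from the earlier linear model, one option is McFadden’s pseudo-R-squared computed from the two deviances in the summary (a sketch; the value is roughly 1 - 21220/39550, about 0.46):
# McFadden's pseudo R-squared: 1 minus residual deviance over null deviance
pseudo_r2 <- 1 - m1$deviance / m1$null.deviance
pseudo_r2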
exp(m1$coef[2])
## Unbalanced_Meals
## 0.05745054
exp(m1$coef[3])
## FamSz
## 1.163008
exp(m1$coef[4])
## Num_of_Elders
## 0.6640489
exp(m1$coef[5])
## Num_of_kids
## 1.015144
exp(m1$coef[6])
## EduLevl
## 0.9057619
exp(m1$coef[7])
## WeightLost
## 0.05927002
exp(m1$coef[8])
## UnpaidMedBills
## 0.5476957
exp(m1$coef[9])
## SSIMem
## 0.9280895
exp(m1$coef[10])
## WelfareMem
## 0.8596782
exp(m1$coef[11])
## Reduced_Rent
## 0.7704981
exp(m1$coef[12])
## SNAP
## 0.505337
exp(m1$coef[13])
## MemIncome
## 0.8968753
exp(m1$coef[14])
## ten_Med_Visits
## 0.8952403
exp(m1$coef[15])
## No_Care
## 0.5143711
exp(m1$coef[16])
## Cell
## 1.025036
exp(m1$coef[17])
## Inc.Grp
## 0.5671451
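As an aside, the same odds ratios (with approximate Wald confidence intervals) can be obtained in a single call rather than coefficient by coefficient; a sketch, with output omitted:
# Odds ratios for all predictors at once, with Wald confidence intervals
exp(cbind(OR = coef(m1), confint.default(m1)))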
# linear predictor built from the fitted coefficients of m1
lp=m1$coef[1]+m1$coef[2]*Unbalanced_Meals+m1$coef[3]*FamSz+m1$coef[4]*Num_of_Elders+
  m1$coef[5]*Num_of_kids+m1$coef[6]*EduLevl+m1$coef[7]*WeightLost+m1$coef[8]*UnpaidMedBills+
  m1$coef[9]*SSIMem+m1$coef[10]*WelfareMem+m1$coef[11]*Reduced_Rent+m1$coef[12]*SNAP+
  m1$coef[13]*MemIncome+m1$coef[14]*ten_Med_Visits+m1$coef[15]*No_Care+m1$coef[16]*Cell+
  m1$coef[17]*Inc.Grp
# logistic transform turns the linear predictor into a predicted probability
fam$prob=exp(lp)/(1+exp(lp))
head(fam$prob)
## [1] 0.04973007 0.01193309 0.02409374 0.01065615 0.05462151 0.01415810
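An equivalent and less error-prone way to get the same fitted probabilities is to let R compute them from the fitted model object; a sketch (the prob2 column name is only illustrative):
# predict() with type = "response" returns the same fitted probabilities
fam$prob2 <- predict(m1, type = "response")
head(fam$prob2)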
Exponentiating each coefficient gives an odds ratio: values above 1 are associated with higher odds of food insecurity, and values below 1 with lower odds, for each one-unit increase in the variable. Based on the predicted probabilities above, the model assigns a low probability of food insecurity to most cases, predicting that most households are food secure. Family size and number of kids have odds ratios above 1, so larger families face higher odds of food insecurity; and because the yes/no predictors are coded 1 = yes and 2 = no, the odds ratio below 1 for SSIMem means that households with an SSI recipient also have higher odds of food insecurity than those without (though that coefficient is not statistically significant).
n=dim(fam)[1]
n
## [1] 44838
n1=floor(n*(0.6))
n1
## [1] 26902
n2=n-n1
n2
## [1] 17936
train=sample(1:n,n1)              # randomly select the rows that form the 60% training set
xx=xx[,-1]                        # drop FoodInsecurity so xx holds only the predictors
xtrain <- xx[train,]              # predictors for the training cases
xnew <- xx[-train,]               # predictors for the held-out test cases
ytrain <- FoodInsecurity[train]   # outcome for the training cases
ynew <- FoodInsecurity[-train]    # outcome for the test cases
m2=glm(FoodInsecurity~.,family=binomial,data=data.frame(FoodInsecurity=ytrain,xtrain))
summary(m2)
##
## Call:
## glm(formula = FoodInsecurity ~ ., family = binomial, data = data.frame(FoodInsecurity = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1198 -0.3752 -0.2682 -0.1673 3.1603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.36689 0.58381 31.460 < 2e-16 ***
## Unbalanced_Meals -2.81411 0.05354 -52.560 < 2e-16 ***
## FamSz 0.15200 0.03020 5.034 4.81e-07 ***
## Num_of_Elders -0.40855 0.06307 -6.478 9.32e-11 ***
## Num_of_kids 0.02879 0.04006 0.719 0.47236
## EduLevl -0.10252 0.01210 -8.473 < 2e-16 ***
## WeightLost -3.08059 0.21287 -14.471 < 2e-16 ***
## UnpaidMedBills -0.55237 0.07222 -7.648 2.03e-14 ***
## SSIMem -0.07229 0.07545 -0.958 0.33803
## WelfareMem -0.27459 0.12359 -2.222 0.02631 *
## Reduced_Rent -0.24370 0.08911 -2.735 0.00624 **
## SNAP -0.63438 0.06151 -10.314 < 2e-16 ***
## MemIncome -0.13075 0.06615 -1.977 0.04808 *
## ten_Med_Visits -0.16773 0.05893 -2.846 0.00442 **
## No_Care -0.65095 0.06538 -9.957 < 2e-16 ***
## Cell -0.01407 0.07313 -0.192 0.84746
## Inc.Grp -0.58192 0.03682 -15.806 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23665 on 26901 degrees of freedom
## Residual deviance: 12787 on 26885 degrees of freedom
## AIC: 12821
##
## Number of Fisher Scoring iterations: 6
In this second regression, fit only on the 60% training sample, the coefficient estimates are very close to those from the full-data model, and slightly more of the variables reach significance (WelfareMem is now significant). This suggests the model is stable across subsamples; the real test, though, is how it performs on the held-out data below.
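One quick way to check that stability is to line up the two sets of coefficients side by side (a sketch; output omitted):
# Compare coefficients from the full-data model (m1) and the training-data model (m2)
round(cbind(full = coef(m1), train = coef(m2)), 3)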
# predicted probabilities of food insecurity for the held-out test cases
ptest=predict(m2,newdata=data.frame(xnew),type="response")
hist(ptest)
# baseline rate: share of food-insecure households in the test set
xbar=mean(ynew)
This histogram shows the distribution of predicted probabilities for the test data. Most of the predicted probabilities are near zero, so the model classifies the large majority of test households as food secure (0) and relatively few as food insecure (1).
gg1=floor(ptest+0.5)   # classify as 1 (food insecure) when the predicted probability is at least 0.5
ttt=table(ynew,gg1)    # confusion table: actual outcome (rows) vs. predicted class (columns)
ttt
## gg1
## ynew 0 1
## 0 14732 300
## 1 1132 1772
error=(ttt[1,2]+ttt[2,1])/n2   # misclassification rate on the test set
error
## [1] 0.07983943
This table compares the actual outcomes with the predicted classes. The model is very good at identifying food-secure households: it classifies 14,732 of the 15,032 correctly (about 98%), which is the main reason the overall error rate is so low, at roughly 8%. However, it is much weaker at identifying food insecurity: 1,132 of the 2,904 truly food-insecure households (about 39%) are misclassified as food secure.
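To make that asymmetry explicit, sensitivity and specificity can be computed directly from the confusion table ttt (a sketch; output omitted):
# Sensitivity: share of truly food-insecure households the model catches
sens <- ttt[2, 2] / sum(ttt[2, ])
# Specificity: share of truly food-secure households classified correctly
spec <- ttt[1, 1] / sum(ttt[1, ])
c(sensitivity = sens, specificity = spec)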
bb=cbind(ptest,ynew)                    # pair each test case's predicted probability with its actual outcome
bb1=bb[order(ptest,decreasing=TRUE),]   # sort cases from highest to lowest predicted probability
axis=dim(n2)                            # case index (1, 2, ..., n2)
ax=dim(n2)                              # baseline: expected cumulative successes under random ordering
ay=dim(n2)                              # actual cumulative successes when cases are ranked by the model
axis[1]=1
ax[1]=xbar
ay[1]=bb1[1,2]
for (i in 2:n2) {
  axis[i]=i
  ax[i]=xbar*i
  ay[i]=ay[i-1]+bb1[i,2]
}
aaa=cbind(bb1[,1],bb1[,2],ay,ax)
aaa[1:20,]
aaa[1:20,]
## ay ax
## 17930 0.9998885 1 1 0.1619090
## 17905 0.9998708 1 2 0.3238180
## 17936 0.9998692 1 3 0.4857270
## 17880 0.9998599 1 4 0.6476360
## 17890 0.9998449 1 5 0.8095450
## 17933 0.9998433 1 6 0.9714541
## 17903 0.9998414 1 7 1.1333631
## 17906 0.9998408 1 8 1.2952721
## 17926 0.9998305 1 9 1.4571811
## 17759 0.9998065 1 10 1.6190901
## 17881 0.9998054 1 11 1.7809991
## 17928 0.9997999 1 12 1.9429081
## 17917 0.9997949 1 13 2.1048171
## 17794 0.9997856 1 14 2.2667261
## 17740 0.9997770 1 15 2.4286351
## 17788 0.9997716 1 16 2.5905442
## 17932 0.9997714 1 17 2.7524532
## 17924 0.9997685 1 18 2.9143622
## 17898 0.9997604 1 19 3.0762712
## 17601 0.9997499 1 20 3.2381802
plot(axis,ay,xlab="number of cases",ylab="number of successes",main="Lift: Cum successes sorted by pred val/success prob")
points(axis,ax,type="l")
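The same lift curve can be built without the explicit loop; a vectorized sketch using the sorted outcomes in bb1:
# cumulative count of actual food-insecure cases, best-ranked first,
# against the expected count if cases were selected at random
ay2 <- cumsum(bb1[, 2])
ax2 <- xbar * (1:n2)
plot(1:n2, ay2, type = "l", xlab = "number of cases", ylab = "number of successes",
     main = "Lift: Cum successes sorted by pred val/success prob")
lines(1:n2, ax2)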
The cumulative-success curve rises well above the straight baseline line, so ranking households by predicted probability concentrates the food-insecure cases near the top; by this measure the model is a good predictor. However, as the confusion table showed, most of its raw classification accuracy comes from correctly predicting food security rather than food insecurity.