Changes to Previous Analysis

In my group’s Chopped presentation, our goal was to build a model that predicts someone’s level of food insecurity from a variety of factors, with a special focus on involvement in government programs. In that analysis we used a linear regression because our dependent variable was not binary. We also had a large amount of missing data, which hurt the model’s fit and left us with an adjusted R-squared of 0.151. Lastly, we built the model on the entire data set instead of splitting it into training and test samples.

In this additional analysis, I hope to address those problems. First, I will make the dependent variable binary so that I can fit a logistic regression: every “no, I have not experienced food insecurity” response will be recoded to 0 and every “yes, I have experienced food insecurity” response to 1. Second, I will fill in the missing data so the model has a complete data set to work from. Lastly, I will split the data into training and test sets so I can check how the model performs on cases it was not fit on. The main goal of this further analysis is to build a model that predicts whether someone is food insecure, and I hope these changes will help me do so.

fam<-read.csv("Final Family (master analysis).csv")
head(fam)
##   FM_SIZE FM_KIDS FM_ELDR FM_EDUC1 FSBALANC FSWEIGHT FMEDBNOP INCGRP5
## 1       1       0       1        5        3        2        2       1
## 2       2       0       1        5        3        2        2       4
## 3       2       0       2        4        3        2        2       2
## 4       2       0       2        8        3        2        2       3
## 5       1       0       1        4        3        2        2       1
## 6       2       0       0        8        3        2        2       4
##   FSSRRYN FTANFYN FGAH FSNAP FNMEDYN F10DVYN FHICADCT FSALYN FSRUNOUT
## 1       1       2    2     2       2       2        0      2        3
## 2       1       2    2     2       2       2        0      1        3
## 3       1       2    2     2       2       2        0      2        3
## 4       1       2    2     2       2       2        0      1        3
## 5       1       2    2     2       2       2        1      2        3
## 6       2       2    2     2       2       1        0      1        3
##   TELCELN
## 1       2
## 2       1
## 3       1
## 4       2
## 5       2
## 6       2
library(car)
library(norm)
library(lattice)
set.seed(1)

These variable names come from the original survey data set and do not describe what the variables contain. Next, we will rename the variables and recode some of the survey answers to fill in the missing data. For example, every survey column includes answers such as “did not answer” and “other”. To build a predictive model, we will recode those answers either to the “no” option for that column or to the column’s average answer, depending on the variable type; a sketch of this recoding is shown under the next heading.

Recoding Variables
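The renaming and recoding code is not reproduced here, so below is a minimal sketch of the approach described above. The specific survey columns used (FSRUNOUT, FSNAP, FM_EDUC1), the assumption that 1 = yes and 2 = no, the treatment of any code outside that range as “did not answer”/“other”, and the use of attach() to make the renamed variables available by bare name are all assumptions for illustration; the actual analysis may differ in its details.

# Sketch only: column names, survey codes, and imputation choices are assumptions.
# Dependent variable: 1 = has experienced food insecurity, 0 = has not
# ("did not answer"/"other" responses are folded into the "no" group).
fam$FoodInsecurity <- ifelse(fam$FSRUNOUT == 1, 1, 0)

# Yes/no program variables keep the 1 = yes / 2 = no coding, with any other
# response code recoded to "no" (2).
fam$SNAP <- ifelse(fam$FSNAP == 1, 1, 2)

# Numeric variables: replace out-of-range codes with the column's average answer
# (assumed valid range 1 to 9 for education level).
valid_educ <- fam$FM_EDUC1[fam$FM_EDUC1 %in% 1:9]
fam$EduLevl <- ifelse(fam$FM_EDUC1 %in% 1:9, fam$FM_EDUC1, round(mean(valid_educ)))

# Make the renamed variables available by bare name, as the code below expects.
attach(fam)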

Data Visualization

Next, we will look at these new variables with their recoded values. We use tables because graphs of these data do not show the distributions clearly.

table(FoodInsecurity,Unbalanced_Meals)
##               Unbalanced_Meals
## FoodInsecurity     1     2     3
##              0   129   835 36663
##              1  1462  3072  2677
table(FoodInsecurity,FamSz)
##               FamSz
## FoodInsecurity     1     2     3     4     5     6     7     8     9    10
##              0 11612 12505  5229  4757  2248   829   284    93    44    16
##              1  2285  1780  1129   956   595   288   102    38    18     8
##               FamSz
## FoodInsecurity    11    12    13    14    16
##              0     6     3     0     1     0
##              1     7     3     1     0     1
table(FoodInsecurity,Num_of_Elders)
##               Num_of_Elders
## FoodInsecurity     0     1     2     3     4
##              0 27150  6986  3442    48     1
##              1  6072   933   200     5     1
table(FoodInsecurity,Num_of_kids)
##               Num_of_kids
## FoodInsecurity     0     1     2     3     4     5     6     7     8     9
##              0 26233  4759  4177  1784   488   123    43    16     2     1
##              1  4182  1190   982   527   234    60    27     5     4     0
##               Num_of_kids
## FoodInsecurity    10
##              0     1
##              1     0
table(FoodInsecurity,EduLevl)
##               EduLevl
## FoodInsecurity    1    2    3    4    5    6    7    8    9
##              0 1033 1924  703 7072 6927 3190 1681 8719 6378
##              1  427  989  334 1853 1726  646  278  692  266
table(FoodInsecurity,WeightLost)
##               WeightLost
## FoodInsecurity     1     2
##              0    57 37570
##              1  1180  6031
table(FoodInsecurity,UnpaidMedBills)
##               UnpaidMedBills
## FoodInsecurity     1     2
##              0  1820 35807
##              1  1887  5324
table(FoodInsecurity,SSIMem)
##               SSIMem
## FoodInsecurity     1     2
##              0 11053 26574
##              1  1955  5256
table(FoodInsecurity,WelfareMem)
##               WelfareMem
## FoodInsecurity     1     2
##              0   488 37139
##              1   439  6772
table(FoodInsecurity,Reduced_Rent)
##               Reduced_Rent
## FoodInsecurity     1     2
##              0  1317 36310
##              1   969  6242
table(FoodInsecurity,SNAP)
##               SNAP
## FoodInsecurity     1     2
##              0  3712 33915
##              1  3096  4115
table(FoodInsecurity,MemIncome)
##               MemIncome
## FoodInsecurity     1     2
##              0 27889  9738
##              1  4920  2291
table(FoodInsecurity,ten_Med_Visits)
##               ten_Med_Visits
## FoodInsecurity     1     2
##              0  6742 30885
##              1  2039  5172
table(FoodInsecurity,No_Care)
##               No_Care
## FoodInsecurity     1     2
##              0  2547 35080
##              1  2271  4940
table(FoodInsecurity,Inc.Grp)
##               Inc.Grp
## FoodInsecurity     1     2     3     4
##              0 11445 14801  4056  7325
##              1  4951  1885   232   143

For the independent variables whose options are 1 or 2, 1 always means “yes, I am enrolled in that program” and 2 means “no, I am not enrolled in that program”. As these tables show, the Food Insecurity variable contains a large number of “no” responses.
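Because the group sizes differ so much, raw counts can be hard to compare. As an optional check (not part of the original output), the same information can be shown as column proportions, for example for the SNAP table:

# Share of food-insecure households within each SNAP group (1 = enrolled, 2 = not enrolled)
round(prop.table(table(FoodInsecurity, SNAP), margin = 2), 3)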

First Regression

First, we will create a new data frame with the renamed and recoded variables so that we can use it in our regressions moving forward.

# Combine the recoded dependent variable and all predictors into one matrix
xx=cbind(FoodInsecurity,Unbalanced_Meals,FamSz,Num_of_Elders,Num_of_kids,
         EduLevl,WeightLost,UnpaidMedBills,SSIMem,WelfareMem,Reduced_Rent,SNAP,
         MemIncome,ten_Med_Visits,No_Care,Cell,Inc.Grp)
# Logistic regression of food insecurity on all of the predictors
m1=glm(FoodInsecurity~.,family=binomial,data=data.frame(xx))
summary(m1)
## 
## Call:
## glm(formula = FoodInsecurity ~ ., family = binomial, data = data.frame(xx))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0973  -0.3731  -0.2667  -0.1702   3.1865  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      17.778840   0.441967  40.227  < 2e-16 ***
## Unbalanced_Meals -2.856831   0.042150 -67.777  < 2e-16 ***
## FamSz             0.151010   0.023358   6.465 1.01e-10 ***
## Num_of_Elders    -0.409399   0.048861  -8.379  < 2e-16 ***
## Num_of_kids       0.015030   0.030988   0.485 0.627652    
## EduLevl          -0.098979   0.009381 -10.551  < 2e-16 ***
## WeightLost       -2.825652   0.159302 -17.738  < 2e-16 ***
## UnpaidMedBills   -0.602035   0.055824 -10.785  < 2e-16 ***
## SSIMem           -0.074627   0.058242  -1.281 0.200077    
## WelfareMem       -0.151197   0.096347  -1.569 0.116578    
## Reduced_Rent     -0.260718   0.068677  -3.796 0.000147 ***
## SNAP             -0.682530   0.047804 -14.278  < 2e-16 ***
## MemIncome        -0.108838   0.051138  -2.128 0.033309 *  
## ten_Med_Visits   -0.110663   0.045997  -2.406 0.016134 *  
## No_Care          -0.664810   0.050879 -13.066  < 2e-16 ***
## Cell              0.024727   0.056689   0.436 0.662696    
## Inc.Grp          -0.567140   0.028495 -19.903  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 39550  on 44837  degrees of freedom
## Residual deviance: 21220  on 44821  degrees of freedom
## AIC: 21254
## 
## Number of Fisher Scoring iterations: 6

This logistic regression looks like a strong model: most of the variables are statistically significant, and the residual deviance (21,220) is far below the null deviance (39,550). Most coefficients are negative, indicating an inverse relationship with food insecurity. The variables with positive coefficients, which are associated with higher food insecurity, are family size, number of kids, and cell phone ownership; of those three, only family size is significant.

Logit Calculation

exp(m1$coef[2])
## Unbalanced_Meals 
##       0.05745054
exp(m1$coef[3])
##    FamSz 
## 1.163008
exp(m1$coef[4])
## Num_of_Elders 
##     0.6640489
exp(m1$coef[5])
## Num_of_kids 
##    1.015144
exp(m1$coef[6])
##   EduLevl 
## 0.9057619
exp(m1$coef[7])
## WeightLost 
## 0.05927002
exp(m1$coef[8])
## UnpaidMedBills 
##      0.5476957
exp(m1$coef[9])
##    SSIMem 
## 0.9280895
exp(m1$coef[10])
## WelfareMem 
##  0.8596782
exp(m1$coef[11])
## Reduced_Rent 
##    0.7704981
exp(m1$coef[12])
##     SNAP 
## 0.505337
exp(m1$coef[13])
## MemIncome 
## 0.8968753
exp(m1$coef[14])
## ten_Med_Visits 
##      0.8952403
exp(m1$coef[15])
##   No_Care 
## 0.5143711
exp(m1$coef[16])
##     Cell 
## 1.025036
exp(m1$coef[17])
##   Inc.Grp 
## 0.5671451
# Compute each family's predicted probability by hand: p = exp(Xb) / (1 + exp(Xb))
lp=m1$coef[1]+m1$coef[2]*Unbalanced_Meals+m1$coef[3]*FamSz+m1$coef[4]*Num_of_Elders+
  m1$coef[5]*Num_of_kids+m1$coef[6]*EduLevl+m1$coef[7]*WeightLost+
  m1$coef[8]*UnpaidMedBills+m1$coef[9]*SSIMem+m1$coef[10]*WelfareMem+
  m1$coef[11]*Reduced_Rent+m1$coef[12]*SNAP+m1$coef[13]*MemIncome+
  m1$coef[14]*ten_Med_Visits+m1$coef[15]*No_Care+m1$coef[16]*Cell+
  m1$coef[17]*Inc.Grp
fam$prob=exp(lp)/(1+exp(lp))
head(fam$prob)
## [1] 0.04973007 0.01193309 0.02409374 0.01065615 0.05462151 0.01415810

Exponentiating each coefficient turns it into an odds ratio, which shows whether that variable is associated with higher or lower odds of food insecurity. Based on the predicted probabilities above, the model classifies most cases as food secure. According to the coefficients, variables such as family size, number of kids, and SSI enrollment are associated with a family being less food secure.
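For reference, base R can produce the same quantities much more compactly: exp(coef(m1)) returns every odds ratio at once, and predict() computes the fitted probabilities without writing out the linear predictor by hand. This is only an equivalent shortcut, not a change to the analysis.

exp(coef(m1))                               # odds ratios for all predictors at once
fam$prob <- predict(m1, type = "response")  # same fitted probabilities as above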

Splitting the Data Set into Training and Test (Evaluation) Sets

n=dim(fam)[1]
n
## [1] 44838
n1=floor(n*(0.6))
n1
## [1] 26902
n2=n-n1
n2
## [1] 17936
train=sample(1:n,n1)            # indices of the 60% training sample
xx=xx[,-1]                      # drop FoodInsecurity so xx holds only the predictors
xtrain <- xx[train,]            # predictors for the training set
xnew <- xx[-train,]             # predictors for the test set
ytrain <- FoodInsecurity[train] # outcomes for the training set
ynew <- FoodInsecurity[-train]  # outcomes for the test set

Second Regression Using Training Data

m2=glm(FoodInsecurity~.,family=binomial,data=data.frame(FoodInsecurity=ytrain,xtrain))
summary(m2)
## 
## Call:
## glm(formula = FoodInsecurity ~ ., family = binomial, data = data.frame(FoodInsecurity = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1198  -0.3752  -0.2682  -0.1673   3.1603  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      18.36689    0.58381  31.460  < 2e-16 ***
## Unbalanced_Meals -2.81411    0.05354 -52.560  < 2e-16 ***
## FamSz             0.15200    0.03020   5.034 4.81e-07 ***
## Num_of_Elders    -0.40855    0.06307  -6.478 9.32e-11 ***
## Num_of_kids       0.02879    0.04006   0.719  0.47236    
## EduLevl          -0.10252    0.01210  -8.473  < 2e-16 ***
## WeightLost       -3.08059    0.21287 -14.471  < 2e-16 ***
## UnpaidMedBills   -0.55237    0.07222  -7.648 2.03e-14 ***
## SSIMem           -0.07229    0.07545  -0.958  0.33803    
## WelfareMem       -0.27459    0.12359  -2.222  0.02631 *  
## Reduced_Rent     -0.24370    0.08911  -2.735  0.00624 ** 
## SNAP             -0.63438    0.06151 -10.314  < 2e-16 ***
## MemIncome        -0.13075    0.06615  -1.977  0.04808 *  
## ten_Med_Visits   -0.16773    0.05893  -2.846  0.00442 ** 
## No_Care          -0.65095    0.06538  -9.957  < 2e-16 ***
## Cell             -0.01407    0.07313  -0.192  0.84746    
## Inc.Grp          -0.58192    0.03682 -15.806  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23665  on 26901  degrees of freedom
## Residual deviance: 12787  on 26885  degrees of freedom
## AIC: 12821
## 
## Number of Fisher Scoring iterations: 6

In this second regression, fit only on the training sample, even more of the variables are statistically significant. This suggests that the model still performs well when it is built on only part of the data.
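A quick way to check this (not part of the original output) is to put the two sets of coefficient estimates side by side; if the training-sample estimates stay close to the full-data estimates, the model is stable.

# Compare coefficients from the full-data model (m1) and the training-data model (m2)
round(cbind(full_data = coef(m1), training = coef(m2)), 3)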

Predicting Probabilities

ptest=predict(m2,newdata=data.frame(xnew),type="response")  # predicted probabilities for the test set
hist(ptest)

xbar=mean(ynew)  # observed food-insecurity rate in the test set

This histogram shows the distribution of the predicted probabilities for the test data. Most predictions are close to 0, meaning the model labels the large majority of cases as food secure (0) and relatively few as food insecure (1).
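Since xbar is the observed food-insecurity rate in the test set, one small sanity check (added here, not in the original output) is to compare it with the average predicted probability; the two should be close if the model is well calibrated overall.

c(mean_predicted = mean(ptest), observed_rate = xbar)  # should be roughly similar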

Classification Matrix

gg1=floor(ptest+0.5)  # classify as food insecure (1) when the predicted probability is at least 0.5
ttt=table(ynew,gg1)   # actual outcomes (rows) versus predicted classes (columns)
ttt
##     gg1
## ynew     0     1
##    0 14732   300
##    1  1132  1772
error=(ttt[1,2]+ttt[2,1])/n2
error
## [1] 0.07983943

This table compares the actual outcomes in the test set with the model’s predictions. The model is very good at identifying food-secure households, which is the main reason the overall error rate is so low, at about 8%. However, it is much weaker at identifying food insecurity: 1,132 of the 2,904 food-insecure cases (about 39%) were misclassified.
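The class-specific accuracy can be computed directly from the table ttt; the terms sensitivity and specificity below are standard names introduced here for illustration, not objects from the original code.

sensitivity <- ttt[2, 2] / sum(ttt[2, ])  # share of food-insecure cases predicted correctly
specificity <- ttt[1, 1] / sum(ttt[1, ])  # share of food-secure cases predicted correctly
c(sensitivity = sensitivity, specificity = specificity)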

Calculating the Lift Curve

bb=cbind(ptest,ynew)                   # predicted probability and actual outcome
bb1=bb[order(ptest,decreasing=TRUE),]  # sort cases from highest to lowest predicted probability
axis=dim(n2)                           # initialize the case index,
ax=dim(n2)                             # the baseline (random-ordering) successes,
ay=dim(n2)                             # and the cumulative observed successes
axis[1]=1
ax[1]=xbar
ay[1]=bb1[1,2]
for (i in 2:n2) {
  axis[i]=i
  ax[i]=xbar*i             # expected successes among i cases chosen at random
  ay[i]=ay[i-1]+bb1[i,2]   # cumulative successes among the i highest-ranked cases
}

aaa=cbind(bb1[,1],bb1[,2],ay,ax)  # predicted probability, actual outcome, cumulative successes, baseline
aaa[1:20,]
##                   ay        ax
## 17930 0.9998885 1  1 0.1619090
## 17905 0.9998708 1  2 0.3238180
## 17936 0.9998692 1  3 0.4857270
## 17880 0.9998599 1  4 0.6476360
## 17890 0.9998449 1  5 0.8095450
## 17933 0.9998433 1  6 0.9714541
## 17903 0.9998414 1  7 1.1333631
## 17906 0.9998408 1  8 1.2952721
## 17926 0.9998305 1  9 1.4571811
## 17759 0.9998065 1 10 1.6190901
## 17881 0.9998054 1 11 1.7809991
## 17928 0.9997999 1 12 1.9429081
## 17917 0.9997949 1 13 2.1048171
## 17794 0.9997856 1 14 2.2667261
## 17740 0.9997770 1 15 2.4286351
## 17788 0.9997716 1 16 2.5905442
## 17932 0.9997714 1 17 2.7524532
## 17924 0.9997685 1 18 2.9143622
## 17898 0.9997604 1 19 3.0762712
## 17601 0.9997499 1 20 3.2381802
plot(axis,ay,xlab="number of cases",ylab="number of successes",main="Lift: Cum successes sorted by pred val/success prob")
points(axis,ax,type="l")

Because the lift curve rises well above the straight baseline line, the model ranks food-insecure households far better than random selection, so it is a useful predictive model. However, as the classification matrix showed, most of that predictive accuracy comes from identifying food security rather than food insecurity.
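As an aside, the same two curves can be built without the explicit loop by using cumsum(); this sketch reuses the objects ptest, ynew, xbar, and n2 defined above.

ord <- order(ptest, decreasing = TRUE)  # rank test cases by predicted probability
cum_successes <- cumsum(ynew[ord])      # cumulative food-insecure cases captured
baseline <- xbar * seq_len(n2)          # expected captures under random ordering
plot(seq_len(n2), cum_successes, type = "l", xlab = "number of cases",
     ylab = "number of successes", main = "Lift: cumulative successes by predicted probability")
lines(seq_len(n2), baseline, lty = 2)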