Food insecurity is a major issue across the globe. This analysis uses data from the 2014 National Health Interview Survey. The goal is to identify who is at risk of food insecurity emergencies.
Our data set was rather large, with roughly 45,000 cases and about 125 variables to start with. After some research, we decided to focus on economic, social, and functional limitation factors, which reduced the data set to 36 variables.
famE <- read.csv("~/Business Analytics/FamilyEdited.csv")
head(famE)
## CURWRKN TELCELN WRKCELN FLNGINTV FM_SIZE FM_KIDS FM_ELDR FM_TYPE
## 1 2 1 1 1 1 0 0 1
## 2 1 1 2 1 3 0 1 2
## 3 2 1 2 1 3 1 0 4
## 4 1 1 1 1 1 0 1 1
## 5 1 1 1 1 1 0 1 1
## 6 1 1 5 1 3 0 0 2
## FM_STRCP FM_STRP FM_EDUC1 FLAADLYN FLAADLCT FLIADLYN FLIADLCT FWKLIMYN
## 1 11 11 2 2 0 2 0 2
## 2 23 23 5 2 0 2 0 1
## 3 41 42 5 2 0 2 0 2
## 4 11 11 9 2 0 2 0 2
## 5 11 11 4 2 0 2 0 2
## 6 23 23 8 2 0 2 0 2
## FWKLIMCT FANYLYN FANYLCT FHSTATPR FSRUNOUT FSLAST FSBALANC FHICOVYN
## 1 0 2 0 0 3 3 3 2
## 2 1 1 2 0 3 3 3 1
## 3 0 2 0 0 3 3 3 1
## 4 0 2 0 0 3 3 3 1
## 5 0 2 0 0 3 3 3 1
## 6 0 2 0 0 3 3 3 1
## FMEDBILL FDGLWCT1 FWRKLWCT FSALYN FSALCT FSSRRYN FSSRRCT FTANFYN FCHSPYN
## 1 1 1 1 1 1 2 0 2 1
## 2 1 0 NA 2 0 1 2 2 2
## 3 2 2 1 1 2 2 0 2 2
## 4 2 0 NA 2 0 2 0 2 2
## 5 2 0 NA 2 0 1 1 2 2
## 6 2 3 2 1 2 2 0 2 2
## INCGRP4 RAT_CAT4 FSNAP
## 1 1 3 1
## 2 2 8 2
## 3 3 11 2
## 4 5 14 2
## 5 1 6 2
## 6 5 14 2
Our first step was to choose a dependent variable. We decided to use FSLAST instead of FSRUNOUT because we interpreted FSLAST as the more tangible variable, with FSRUNOUT being more of a mental one. FSLAST was coded as 1, 2, 3, 7, 8, or 9, so we removed the few cases coded 7, 8, or 9 because they are irrelevant to our analysis. We also recoded the variable so that 3 is chronic food insecurity, 2 is moderate, and 1 is no food insecurity, so that higher values indicate greater food insecurity and the results are easier to interpret.
library(car)
## Warning: package 'car' was built under R version 3.4.2
famE$FSLAST=recode(famE$FSLAST,"'1'=3; '2'=2; '3'=1")
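The removal of the 7, 8, and 9 codes is described above but not shown in the code. A minimal base-R sketch of both steps, run on a made-up data frame (`toy` and its values are hypothetical, not the survey data), might look like this:

```r
# Hypothetical toy data standing in for famE; the values are made up
toy <- data.frame(FSLAST = c(1, 2, 3, 7, 3, 9, 2, 8, 3))

# Keep only the substantive responses (1, 2, 3); drop the 7/8/9 codes
toy <- toy[toy$FSLAST %in% 1:3, , drop = FALSE]

# Reverse the scale so that 3 is the most food insecure: 1 -> 3, 2 -> 2, 3 -> 1
toy$FSLAST <- 4 - toy$FSLAST

table(toy$FSLAST)
```

For values 1 to 3, subtracting from 4 gives the same mapping as the `recode()` call above.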
We then created a table and bar chart to look at the distribution of FSLAST responses.
library(lattice)
F=table(famE$FSLAST)
F
##
## 1 2 3
## 39290 4530 1723
barchart(F, ylab="The food did not last and we didn't have money for more", col="black")
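To see how uneven the three categories are, the counts from the table above can be turned into percentages. This is a small sketch; the vector `counts` is typed in by hand from that table rather than computed from the data.

```r
# Counts copied from the table above (FSLAST after recoding)
counts <- c("1" = 39290, "2" = 4530, "3" = 1723)

# Share of families in each category
props <- counts / sum(counts)

# Percentages rounded to one decimal place
round(100 * props, 1)
```

Roughly 86% of families fall in category 1 (no food insecurity), so the outcome is heavily concentrated in one level.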
Our next step was to run a linear regression.
m1=lm(FSLAST~.,data=famE)
summary(m1)
##
## Call:
## lm(formula = FSLAST ~ ., data = famE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67849 -0.01342 -0.00202 0.00521 2.65047
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.532e+00 6.229e-02 56.705 < 2e-16 ***
## CURWRKN 5.506e-03 2.553e-03 2.157 0.031031 *
## TELCELN NA NA NA NA
## WRKCELN -3.623e-04 2.326e-04 -1.557 0.119438
## FLNGINTV -2.858e-04 2.864e-03 -0.100 0.920507
## FM_SIZE 4.728e-03 2.662e-03 1.776 0.075728 .
## FM_KIDS -1.930e-03 3.069e-03 -0.629 0.529547
## FM_ELDR -1.237e-02 4.347e-03 -2.845 0.004450 **
## FM_TYPE -3.021e-03 9.155e-03 -0.330 0.741404
## FM_STRCP -8.278e-03 5.733e-03 -1.444 0.148755
## FM_STRP 8.485e-03 5.737e-03 1.479 0.139181
## FM_EDUC1 -2.796e-04 2.189e-04 -1.277 0.201627
## FLAADLYN -1.760e-02 2.545e-02 -0.692 0.489217
## FLAADLCT 3.675e-03 2.528e-02 0.145 0.884436
## FLIADLYN 3.886e-02 2.160e-02 1.799 0.072011 .
## FLIADLCT 7.477e-03 2.069e-02 0.361 0.717836
## FWKLIMYN -9.267e-03 1.013e-02 -0.915 0.360174
## FWKLIMCT -2.802e-03 9.573e-03 -0.293 0.769744
## FANYLYN -2.409e-02 8.091e-03 -2.977 0.002912 **
## FANYLCT -2.063e-03 6.368e-03 -0.324 0.745976
## FHSTATPR 2.700e-02 7.279e-03 3.710 0.000208 ***
## FSRUNOUT -5.511e-01 3.696e-03 -149.092 < 2e-16 ***
## FSBALANC -2.789e-01 4.147e-03 -67.265 < 2e-16 ***
## FHICOVYN 2.121e-03 2.918e-03 0.727 0.467299
## FMEDBILL -9.638e-03 2.665e-03 -3.616 0.000299 ***
## FDGLWCT1 -7.231e-03 3.382e-03 -2.138 0.032547 *
## FWRKLWCT -3.253e-03 2.372e-03 -1.371 0.170317
## FSALYN 7.613e-03 2.821e-03 2.699 0.006965 **
## FSALCT 8.539e-04 2.787e-03 0.306 0.759328
## FSSRRYN 8.459e-03 3.894e-03 2.172 0.029846 *
## FSSRRCT 1.406e-02 5.062e-03 2.778 0.005480 **
## FTANFYN -4.893e-03 4.199e-03 -1.165 0.243974
## FCHSPYN -1.631e-03 3.561e-03 -0.458 0.646900
## INCGRP4 4.968e-05 7.284e-05 0.682 0.495242
## RAT_CAT4 5.087e-06 9.275e-05 0.055 0.956258
## FSNAP -6.664e-03 2.768e-03 -2.407 0.016080 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2239 on 30104 degrees of freedom
## (15404 observations deleted due to missingness)
## Multiple R-squared: 0.7092, Adjusted R-squared: 0.7089
## F-statistic: 2159 on 34 and 30104 DF, p-value: < 2.2e-16
Based on our output, we have an R^2 of about 0.71, which is high. The one thing we did not like was how similar our FSLAST and FSRUNOUT variables were. We thought they basically measured the same thing, so we decided to exclude FSRUNOUT and ran another regression.
m3=lm(FSLAST~.-FSRUNOUT,data=famE)
summary(m3)
##
## Call:
## lm(formula = FSLAST ~ . - FSRUNOUT, data = famE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6063 -0.0741 -0.0381 -0.0101 4.8427
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.247e+00 8.209e-02 39.547 < 2e-16 ***
## CURWRKN 2.288e-02 3.363e-03 6.804 1.03e-11 ***
## TELCELN NA NA NA NA
## WRKCELN -3.647e-04 3.067e-04 -1.189 0.234481
## FLNGINTV 4.210e-03 3.776e-03 1.115 0.264820
## FM_SIZE 7.574e-03 3.510e-03 2.158 0.030930 *
## FM_KIDS 5.344e-03 4.046e-03 1.321 0.186570
## FM_ELDR -2.641e-02 5.730e-03 -4.609 4.06e-06 ***
## FM_TYPE -3.374e-02 1.207e-02 -2.796 0.005177 **
## FM_STRCP -9.491e-03 7.558e-03 -1.256 0.209220
## FM_STRP 1.292e-02 7.564e-03 1.709 0.087543 .
## FM_EDUC1 -1.501e-03 2.884e-04 -5.204 1.97e-07 ***
## FLAADLYN -6.166e-02 3.355e-02 -1.838 0.066125 .
## FLAADLCT -3.147e-02 3.333e-02 -0.944 0.345140
## FLIADLYN 7.264e-03 2.848e-02 0.255 0.798676
## FLIADLCT -1.254e-02 2.728e-02 -0.460 0.645641
## FWKLIMYN -1.669e-02 1.335e-02 -1.250 0.211238
## FWKLIMCT 1.147e-02 1.262e-02 0.909 0.363479
## FANYLYN -2.705e-02 1.067e-02 -2.536 0.011225 *
## FANYLCT 9.391e-03 8.395e-03 1.119 0.263288
## FHSTATPR 5.023e-02 9.595e-03 5.235 1.66e-07 ***
## FSBALANC -6.523e-01 4.357e-03 -149.695 < 2e-16 ***
## FHICOVYN 1.416e-02 3.845e-03 3.684 0.000230 ***
## FMEDBILL -3.901e-02 3.504e-03 -11.131 < 2e-16 ***
## FDGLWCT1 -1.566e-02 4.459e-03 -3.513 0.000444 ***
## FWRKLWCT -2.015e-02 3.124e-03 -6.449 1.14e-10 ***
## FSALYN 1.338e-02 3.719e-03 3.599 0.000320 ***
## FSALCT 5.935e-03 3.675e-03 1.615 0.106265
## FSSRRYN 1.892e-02 5.133e-03 3.686 0.000228 ***
## FSSRRCT 1.614e-02 6.674e-03 2.419 0.015568 *
## FTANFYN 2.689e-03 5.536e-03 0.486 0.627160
## FCHSPYN 6.931e-03 4.694e-03 1.476 0.139822
## INCGRP4 4.764e-05 9.604e-05 0.496 0.619865
## RAT_CAT4 -1.639e-04 1.223e-04 -1.341 0.180036
## FSNAP -4.113e-02 3.637e-03 -11.307 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2952 on 30105 degrees of freedom
## (15404 observations deleted due to missingness)
## Multiple R-squared: 0.4945, Adjusted R-squared: 0.4939
## F-statistic: 892.3 on 33 and 30105 DF, p-value: < 2.2e-16
Based on this output, our R^2 dropped drastically to just under 0.5, which is not as high as we would like to see. The size of the drop supports our hypothesis that those two variables were highly correlated. Removing FSRUNOUT gives us a more meaningful model even though the R^2 decreased, because the fit no longer leans on a variable that is nearly a duplicate of the outcome.
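The overlap between two survey items like these can also be checked directly with a cross-tabulation and a rank correlation. The sketch below uses simulated data (`item_a` and `item_b` are hypothetical stand-ins, not FSLAST and FSRUNOUT); on the real data the same calls would take the two columns of `famE`, with `use = "complete.obs"` added to `cor()` to skip missing values.

```r
set.seed(42)

# Hypothetical toy items: item_b mostly copies item_a, with a little noise
item_a <- sample(1:3, 200, replace = TRUE, prob = c(0.8, 0.15, 0.05))
flip   <- runif(200) < 0.05
item_b <- ifelse(flip, sample(1:3, 200, replace = TRUE), item_a)

# Cross-tabulation: counts concentrated on the diagonal mean the items agree
table(item_a, item_b)

# Spearman rank correlation, suited to ordered categories
rho <- cor(item_a, item_b, method = "spearman")
rho
```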
After our presentations and a closer look at the data, I wanted to run a logistic regression, but I decided to stick with the linear regression and fine-tune it. I looked at the variables again and realized I was using duplicates. For some variables, one item would ask, “Are there any family members with X?” and another would ask, “How many family members have X?” Using both seemed redundant, so I kept the numerical variable and dropped the binary one (which I removed in Excel). After doing this my data set went down to 27 variables, and my regression output improved.
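The same pruning could be done in R instead of Excel, which keeps the step visible in the script. This is only a sketch on a made-up data frame: the column names are taken from the data set, but `toy`, its values, and the short `drop_cols` list are illustrative assumptions.

```r
# Hypothetical toy data frame with two yes/no and count pairs
toy <- data.frame(
  FLAADLYN = c(2, 1, 2), FLAADLCT = c(0, 1, 0),
  FSALYN   = c(1, 2, 1), FSALCT   = c(1, 0, 2),
  FSLAST   = c(1, 3, 1)
)

# Binary yes/no columns to drop; their count versions are kept
drop_cols <- c("FLAADLYN", "FSALYN")

# Keep every column whose name is not in drop_cols
toy_reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_reduced)
```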
famE3 <- read.csv("~/Business Analytics/FamilyEdited3.csv")
famE3$FSLAST=recode(famE3$FSLAST,"'1'=3; '2'=2; '3'=1")
m4=lm(FSLAST~.-FSRUNOUT,data=famE3)
summary(m4)
##
## Call:
## lm(formula = FSLAST ~ . - FSRUNOUT, data = famE3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8180 -0.0823 -0.0445 -0.0024 4.9759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.118e+00 1.520e-02 205.201 < 2e-16 ***
## CURWRKN 2.615e-02 3.167e-03 8.255 < 2e-16 ***
## WRKCELN -4.731e-04 2.944e-04 -1.607 0.107983
## FLNGINTV 1.743e-03 3.636e-03 0.479 0.631782
## FM_SIZE 8.555e-03 3.275e-03 2.612 0.009002 **
## FM_KIDS 7.156e-03 3.733e-03 1.917 0.055226 .
## FM_ELDR -3.956e-02 4.501e-03 -8.790 < 2e-16 ***
## FM_TYPE -4.203e-02 1.235e-02 -3.403 0.000667 ***
## FM_STRCP 4.237e-03 1.212e-03 3.495 0.000474 ***
## FM_EDUC1 -6.268e-04 2.355e-04 -2.662 0.007776 **
## FLAADLCT 6.110e-03 9.630e-03 0.635 0.525742
## FLIADLCT 8.594e-03 7.903e-03 1.087 0.276846
## FWKLIMCT 2.149e-02 6.101e-03 3.522 0.000429 ***
## FANYLCT 2.821e-02 4.659e-03 6.056 1.41e-09 ***
## FHSTATPR 5.235e-02 7.125e-03 7.348 2.05e-13 ***
## FSBALANC -6.744e-01 3.736e-03 -180.509 < 2e-16 ***
## FHICOVYN 1.956e-02 3.425e-03 5.711 1.13e-08 ***
## FMEDBILL -4.094e-02 3.170e-03 -12.912 < 2e-16 ***
## FDGLWCT1 -3.091e-02 3.310e-03 -9.339 < 2e-16 ***
## FSALCT -2.581e-03 3.111e-03 -0.830 0.406751
## FSSRRCT 7.503e-03 4.372e-03 1.716 0.086134 .
## FTANFYN 8.777e-03 4.531e-03 1.937 0.052720 .
## FCHSPYN 2.713e-02 4.212e-03 6.441 1.20e-10 ***
## INCGRP4 4.366e-05 8.775e-05 0.498 0.618813
## RAT_CAT4 -2.432e-04 1.094e-04 -2.224 0.026178 *
## FSNAP -4.119e-02 3.176e-03 -12.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3145 on 39558 degrees of freedom
## (5959 observations deleted due to missingness)
## Multiple R-squared: 0.5335, Adjusted R-squared: 0.5332
## F-statistic: 1810 on 25 and 39558 DF, p-value: < 2.2e-16
My R^2 still isn’t fantastic, but it is over 0.5, which means the model explains more of the variability in FSLAST than it leaves unexplained.
I then decided to fit the regression on a training set and see how well it predicted a test set.
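For reference, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares, which is why a value above 0.5 means more variance is explained than unexplained. The sketch below reproduces that calculation by hand on R's built-in `mtcars` data, used here purely as a stand-in for the survey data.

```r
# Built-in mtcars data used only as a stand-in for the survey data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual sum of squares: variation the model leaves unexplained
rss <- sum(residuals(fit)^2)

# Total sum of squares: variation of the outcome around its mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R^2 = 1 - RSS / TSS; should match what summary() reports
r2 <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)
```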
set.seed(1)
n=length(famE3$FSLAST)
n1=25000
n2=n-n1
train=sample(1:n,n1)
m10=lm(FSLAST~.-FSRUNOUT,data=famE3[train,])
summary(m10)
##
## Call:
## lm(formula = FSLAST ~ . - FSRUNOUT, data = famE3[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8443 -0.0789 -0.0409 -0.0010 5.0310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.167e+00 2.012e-02 157.430 < 2e-16 ***
## CURWRKN 2.694e-02 4.188e-03 6.432 1.28e-10 ***
## WRKCELN -1.981e-04 3.815e-04 -0.519 0.603556
## FLNGINTV -6.107e-04 4.882e-03 -0.125 0.900435
## FM_SIZE 1.198e-02 4.370e-03 2.742 0.006115 **
## FM_KIDS 5.805e-03 4.959e-03 1.171 0.241774
## FM_ELDR -3.959e-02 5.941e-03 -6.665 2.71e-11 ***
## FM_TYPE -3.997e-02 1.807e-02 -2.212 0.026961 *
## FM_STRCP 3.774e-03 1.780e-03 2.120 0.033987 *
## FM_EDUC1 -1.019e-03 3.314e-04 -3.076 0.002103 **
## FLAADLCT -3.185e-02 1.286e-02 -2.477 0.013270 *
## FLIADLCT 1.418e-02 1.048e-02 1.353 0.176156
## FWKLIMCT 3.086e-02 8.076e-03 3.821 0.000133 ***
## FANYLCT 2.072e-02 6.145e-03 3.372 0.000747 ***
## FHSTATPR 5.567e-02 9.496e-03 5.862 4.63e-09 ***
## FSBALANC -6.919e-01 4.973e-03 -139.134 < 2e-16 ***
## FHICOVYN 1.548e-02 4.357e-03 3.553 0.000382 ***
## FMEDBILL -4.093e-02 4.142e-03 -9.882 < 2e-16 ***
## FDGLWCT1 -3.512e-02 4.332e-03 -8.107 5.46e-16 ***
## FSALCT 1.197e-03 4.064e-03 0.295 0.768251
## FSSRRCT 8.967e-03 5.773e-03 1.553 0.120412
## FTANFYN 1.522e-02 6.053e-03 2.514 0.011951 *
## FCHSPYN 2.448e-02 5.662e-03 4.324 1.54e-05 ***
## INCGRP4 -3.208e-05 1.164e-04 -0.276 0.782848
## RAT_CAT4 -1.615e-04 1.465e-04 -1.103 0.270197
## FSNAP -4.110e-02 4.300e-03 -9.557 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3085 on 21783 degrees of freedom
## (3191 observations deleted due to missingness)
## Multiple R-squared: 0.5515, Adjusted R-squared: 0.551
## F-statistic: 1072 on 25 and 21783 DF, p-value: < 2.2e-16
pred=predict(m10,newdata=famE3[-train,])
obs=famE3$FSLAST[-train]
diff=obs-pred
percdiff=abs(diff)/obs
me=mean(diff)
rmse=sqrt(sum(diff*2)/n2)
mape=100*(mean(percdiff))
me
## [1] NA
rmse
## [1] NA
mape
## [1] NA
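The model fit on the training set ran fine, but the mean error, RMSE, and MAPE all came back as NA. The cause is missing values: test rows with missing predictors get NA predictions, and mean() and sum() return NA unless na.rm=TRUE is set. The RMSE line also multiplies the differences by 2 (diff*2) where it should square them (diff^2). I expect these errors to be fairly high since our R^2 was low, but it would have been nice to see how well the model fit on the training set predicted the test set.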
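The NA results above come from missing values propagating through mean() and sum(). A minimal sketch of NA-aware versions of the three metrics follows; it runs on simulated data (`toy`, `x`, and `y` are hypothetical), squares the errors for RMSE, and averages only over the rows that have a prediction.

```r
set.seed(1)

# Hypothetical toy data with some missing predictor values
toy <- data.frame(x = rnorm(100))
toy$y <- 2 + 0.5 * toy$x + rnorm(100, sd = 0.3)
toy$x[sample(100, 10)] <- NA

# Train/test split in the same style as above
train <- sample(1:100, 70)
fit <- lm(y ~ x, data = toy[train, ])

# Test rows with a missing predictor get an NA prediction
pred <- predict(fit, newdata = toy[-train, ])
obs  <- toy$y[-train]
err  <- obs - pred

# na.rm = TRUE skips the NA rows; mean() then divides by the number of rows used
me   <- mean(err, na.rm = TRUE)
rmse <- sqrt(mean(err^2, na.rm = TRUE))
mape <- 100 * mean(abs(err) / abs(obs), na.rm = TRUE)

c(me = me, rmse = rmse, mape = mape)
```

Using mean() with na.rm = TRUE instead of dividing by n2 matters here, because n2 would still count the rows that were dropped.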