Chopped<-read.csv("c:/users/abbey/Desktop/Data Mining/choppeds.csv")
library(car)
set.seed(1)
We want to take out the variables that we believe is not relavent to food insecurity.
wout=Chopped[,c(-1:-37,-46,-47,-49:-109,-111:-119,-121:-123)]
head(wout)
## FSRUNOUT FSLAST FSBALANC FSSKIP FSSKDAYS FSLESS FSHUNGRY FSWEIGHT
## 1 3 3 3 NA NA NA NA NA
## 2 3 3 3 NA NA NA NA NA
## 3 3 3 3 NA NA NA NA NA
## 4 3 3 3 NA NA NA NA NA
## 5 3 3 3 NA NA NA NA NA
## 6 3 3 3 NA NA NA NA NA
## FDMEDYN INCGRP4 FSNAP
## 1 1 1 1
## 2 2 2 2
## 3 2 3 2
## 4 2 5 2
## 5 1 1 2
## 6 2 5 2
table(wout$FSRUNOUT)
##
## 1 2 3 7 9
## 2257 5068 38227 29 16
We need to recode the values in this variable because 1 and 2 mean yes, 3 and 4 mean no, and 7 and 9 mean did not answer or refusal to answer.
wout$FSRUNOUT = recode(wout$FSRUNOUT, "'1'=1;'2'=1;'3'=0;'4'=0;'7'=NA;'8'=NA;'9'=NA")
FSRUNOUT=wout$FSRUNOUT
table(wout$FSRUNOUT)
##
## 0 1
## 38227 7325
7325/38227
## [1] 0.1916185
Out of the entire data there is 7325 families who are worried about running out of food. That is 19.2% of the total data.
fit<-lm(FSRUNOUT~.,data=wout)
summary(fit)
##
## Call:
## lm(formula = FSRUNOUT ~ ., data = wout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.05720 -0.00884 0.06522 0.09955 0.33545
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.152e+00 2.186e-02 52.685 < 2e-16 ***
## FSLAST -1.146e-01 7.083e-03 -16.174 < 2e-16 ***
## FSBALANC 3.165e-02 6.434e-03 4.920 9.11e-07 ***
## FSSKIP NA NA NA NA
## FSSKDAYS -1.938e-04 2.583e-04 -0.750 0.45312
## FSLESS -1.871e-02 9.454e-03 -1.979 0.04789 *
## FSHUNGRY -2.072e-02 7.211e-03 -2.873 0.00409 **
## FSWEIGHT 2.026e-03 3.323e-03 0.610 0.54210
## FDMEDYN -5.786e-03 7.720e-03 -0.750 0.45360
## INCGRP4 1.217e-05 2.428e-04 0.050 0.96003
## FSNAP -1.011e-02 5.176e-03 -1.953 0.05096 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.238 on 3203 degrees of freedom
## (42384 observations deleted due to missingness)
## Multiple R-squared: 0.0957, Adjusted R-squared: 0.09316
## F-statistic: 37.66 on 9 and 3203 DF, p-value: < 2.2e-16
When running a logistic regression on the dependent variable with the entire data set the adjusted R- squared is extremely low. We are missing some variables in the data set that can be the reason. I am going to rerun a regression using independent variables that does not have missing data.
fit2<-lm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FDMEDYN+ FSWEIGHT+ FSLESS+
FSHUNGRY + INCGRP4+ FSNAP, data = wout)
summary(fit2)
##
## Call:
## lm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FDMEDYN + FSWEIGHT +
## FSLESS + FSHUNGRY + INCGRP4 + FSNAP, data = wout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39782 0.02605 0.10580 0.18983 0.93825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.138e+00 2.040e-02 55.786 < 2e-16 ***
## FSBALANC 8.220e-02 5.457e-03 15.064 < 2e-16 ***
## FSLAST -1.386e-01 6.015e-03 -23.048 < 2e-16 ***
## FDMEDYN -1.830e-03 7.071e-03 -0.259 0.79577
## FSWEIGHT 5.142e-03 4.441e-03 1.158 0.24691
## FSLESS -7.608e-02 8.339e-03 -9.123 < 2e-16 ***
## FSHUNGRY -2.230e-02 8.479e-03 -2.630 0.00856 **
## INCGRP4 -3.936e-05 1.721e-04 -0.229 0.81908
## FSNAP -1.226e-02 4.255e-03 -2.882 0.00396 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3374 on 8590 degrees of freedom
## (36998 observations deleted due to missingness)
## Multiple R-squared: 0.09896, Adjusted R-squared: 0.09812
## F-statistic: 117.9 on 8 and 8590 DF, p-value: < 2.2e-16
This regression’s adjusted R-squared is better than the last but it only explains about 9.8% of the variance. This model is not very good. I am going to try a new regression model using the significant variables.
fit3<-lm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS+
FSHUNGRY + FSNAP, data = wout)
summary(fit3)
##
## Call:
## lm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS + FSHUNGRY +
## FSNAP, data = wout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39793 0.02346 0.10744 0.18955 0.95671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.142629 0.016775 68.114 <2e-16 ***
## FSBALANC 0.082117 0.005435 15.108 <2e-16 ***
## FSLAST -0.138554 0.006014 -23.040 <2e-16 ***
## FSLESS -0.075680 0.008302 -9.116 <2e-16 ***
## FSHUNGRY -0.021314 0.008430 -2.528 0.0115 *
## FSNAP -0.012660 0.003906 -3.241 0.0012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3374 on 8593 degrees of freedom
## (36998 observations deleted due to missingness)
## Multiple R-squared: 0.09881, Adjusted R-squared: 0.09828
## F-statistic: 188.4 on 5 and 8593 DF, p-value: < 2.2e-16
In this model it is a little better than the last but we are still only explaining 9.83% of the variance. In this model FBALANC has a positive coeffecient this indicates that if the FSBALANC increases the FSRUNOUT increases by 0.082117. The FSLAST, FSLESS, FSHUNGRY, and FSNAP have a negative coeffecient that means that when these independent variables increase FSRUNOUT will decrease by the estimate numbers.
anova(fit3)
## Analysis of Variance Table
##
## Response: FSRUNOUT
## Df Sum Sq Mean Sq F value Pr(>F)
## FSBALANC 1 0.18 0.176 1.5483 0.213416
## FSLAST 1 90.57 90.566 795.7202 < 2.2e-16 ***
## FSLESS 1 14.52 14.525 127.6141 < 2.2e-16 ***
## FSHUNGRY 1 0.77 0.766 6.7328 0.009482 **
## FSNAP 1 1.20 1.195 10.5034 0.001196 **
## Residuals 8593 978.02 0.114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Running a Anova table tells me that all of these independent variables are significant factors in FSRUNOUT.
now let’s take a further look at the variables
fit4<-glm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS+
FSHUNGRY + FSNAP, data = wout)
summary(fit4)
##
## Call:
## glm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS + FSHUNGRY +
## FSNAP, data = wout)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.39793 0.02346 0.10744 0.18955 0.95671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.142629 0.016775 68.114 <2e-16 ***
## FSBALANC 0.082117 0.005435 15.108 <2e-16 ***
## FSLAST -0.138554 0.006014 -23.040 <2e-16 ***
## FSLESS -0.075680 0.008302 -9.116 <2e-16 ***
## FSHUNGRY -0.021314 0.008430 -2.528 0.0115 *
## FSNAP -0.012660 0.003906 -3.241 0.0012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1138159)
##
## Null deviance: 1085.25 on 8598 degrees of freedom
## Residual deviance: 978.02 on 8593 degrees of freedom
## (36998 observations deleted due to missingness)
## AIC: 5723.8
##
## Number of Fisher Scoring iterations: 2
Let’s create a data sample and run a regression on it.
n=length(wout$FSRUNOUT)
n1=floor(n*(0.6))
n2=n-n1
train=sample(1:n,n1)
xx = wout[-4]
xtrain <-xx[train,]
xnew <- xx[-train,]
ytrain <- FSRUNOUT[train]
ynew <- FSRUNOUT[-train]
fit5=glm(FSRUNOUT ~ FSBALANC + FSLAST + FSLESS+
FSHUNGRY + FSNAP, data=data.frame(FSRUNOUT=ytrain,xtrain))
summary(fit5)
##
## Call:
## glm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS + FSHUNGRY +
## FSNAP, data = data.frame(FSRUNOUT = ytrain, xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.12014 0.02185 0.10467 0.19155 0.96698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.147850 0.021366 53.723 < 2e-16 ***
## FSBALANC 0.086881 0.006982 12.443 < 2e-16 ***
## FSLAST -0.141498 0.007631 -18.544 < 2e-16 ***
## FSLESS -0.083308 0.010259 -8.120 5.75e-16 ***
## FSHUNGRY -0.017159 0.010184 -1.685 0.09208 .
## FSNAP -0.014616 0.004990 -2.929 0.00341 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1124219)
##
## Null deviance: 654.00 on 5205 degrees of freedom
## Residual deviance: 584.59 on 5200 degrees of freedom
## (22152 observations deleted due to missingness)
## AIC: 3404.3
##
## Number of Fisher Scoring iterations: 2
ptest <- predict(fit4,newdata=data.frame(xnew),type="response")
hist(ptest, main = "Prediction of worried about running out food")
xbar=mean(ynew)
xbar
## [1] NA
we now will calculate the error with probability .3
g=floor(ptest+0.3)
t=table(ynew,g)
t
## g
## ynew 0 1
## 0 238 269
## 1 147 2739
of the 2886 worried people it correctly predicts 2739 of them.
error=(t[1,2]+t[2,1])/n2
error
## [1] 0.02280827
gg2=floor(ptest+0.7)
ttt=table(ynew,gg2)
ttt
## gg2
## ynew 0 1 2
## 0 1 504 2
## 1 3 2881 2
error=(ttt[1,2]+ttt[2,1])/n2
error
## [1] 0.02779758
bb=cbind(ptest,ynew)
bb[1:10,]
## ptest ynew
## 2 NA 0
## 3 NA 0
## 4 NA 0
## 10 NA 0
## 12 NA 0
## 14 NA 0
## 15 NA 0
## 16 NA 0
## 22 NA 0
## 23 NA 0
bb1=bb[order(ptest,decreasing=TRUE),]
bb1[1:10,]
## ptest ynew
## 42362 1.397926 0
## 7877 1.385266 1
## 14256 1.385266 NA
## 39391 1.385266 1
## 43396 1.385266 0
## 2134 1.140772 1
## 3526 1.140772 1
## 8142 1.140772 1
## 10669 1.140772 1
## 13816 1.140772 1
rank ordering according to probabilities and ordering cases in test according to their success probabilities
xbar=mean(ynew)
xbar
## [1] NA
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(fit4, newdata=data.frame(xnew),type="response")
fitted.results <- predict(fit4, newdata=data.frame(xnew),type="response")
head(fitted.results)
## 2 3 4 10 12 14
## NA NA NA NA NA NA
fitted.results <- ifelse(fitted.results > 0.3,1,0)
head(fitted.results)
## 2 3 4 10 12 14
## NA NA NA NA NA NA
#fitted.results= same as p, just naming it differently
misClasificError <-mean(fitted.results != ynew)
print(paste('Accuracy', 1-misClasificError))
## [1] "Accuracy NA"
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(lattice)
library(ggplot2)
library(e1071)
confusionMatrix(data=fitted.results, reference=ynew)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 3
## 1 506 2883
##
## Accuracy : 0.85
## 95% CI : (0.8375, 0.8618)
## No Information Rate : 0.8506
## P-Value [Acc > NIR] : 0.5501
##
## Kappa : 0.0016
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0019724
## Specificity : 0.9989605
## Pos Pred Value : 0.2500000
## Neg Pred Value : 0.8506934
## Prevalence : 0.1494253
## Detection Rate : 0.0002947
## Detection Prevalence : 0.0011789
## Balanced Accuracy : 0.5004664
##
## 'Positive' Class : 0
##
I recommend that eligibility for food assistance should not only be restricted by income, but should also include judgement on an individual’s past in the last 12 months. In accordance with if that individual was unable to afford a balanced meal, if they have previously ran out of food before payment, or if they are currently on government assistance, but still worried about running out of food.