Master Analyst

Chopped<-read.csv("c:/users/abbey/Desktop/Data Mining/choppeds.csv")
library(car)
set.seed(1)

We want to take out the variables that we believe is not relavent to food insecurity.

wout=Chopped[,c(-1:-37,-46,-47,-49:-109,-111:-119,-121:-123)]
head(wout)

##   FSRUNOUT FSLAST FSBALANC FSSKIP FSSKDAYS FSLESS FSHUNGRY FSWEIGHT
## 1        3      3        3     NA       NA     NA       NA       NA
## 2        3      3        3     NA       NA     NA       NA       NA
## 3        3      3        3     NA       NA     NA       NA       NA
## 4        3      3        3     NA       NA     NA       NA       NA
## 5        3      3        3     NA       NA     NA       NA       NA
## 6        3      3        3     NA       NA     NA       NA       NA
##   FDMEDYN INCGRP4 FSNAP
## 1       1       1     1
## 2       2       2     2
## 3       2       3     2
## 4       2       5     2
## 5       1       1     2
## 6       2       5     2

table(wout$FSRUNOUT)

## 
##     1     2     3     7     9 
##  2257  5068 38227    29    16

We need to recode the values in this variable because 1 and 2 mean yes, 3 and 4 mean no, and 7 and 9 mean did not answer or refusal to answer.

wout$FSRUNOUT = recode(wout$FSRUNOUT, "'1'=1;'2'=1;'3'=0;'4'=0;'7'=NA;'8'=NA;'9'=NA")
FSRUNOUT=wout$FSRUNOUT
table(wout$FSRUNOUT)

## 
##     0     1 
## 38227  7325

7325/38227

## [1] 0.1916185

Out of the entire data there is 7325 families who are worried about running out of food. That is 19.2% of the total data.

fit<-lm(FSRUNOUT~.,data=wout)
summary(fit)

## 
## Call:
## lm(formula = FSRUNOUT ~ ., data = wout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05720 -0.00884  0.06522  0.09955  0.33545 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.152e+00  2.186e-02  52.685  < 2e-16 ***
## FSLAST      -1.146e-01  7.083e-03 -16.174  < 2e-16 ***
## FSBALANC     3.165e-02  6.434e-03   4.920 9.11e-07 ***
## FSSKIP              NA         NA      NA       NA    
## FSSKDAYS    -1.938e-04  2.583e-04  -0.750  0.45312    
## FSLESS      -1.871e-02  9.454e-03  -1.979  0.04789 *  
## FSHUNGRY    -2.072e-02  7.211e-03  -2.873  0.00409 ** 
## FSWEIGHT     2.026e-03  3.323e-03   0.610  0.54210    
## FDMEDYN     -5.786e-03  7.720e-03  -0.750  0.45360    
## INCGRP4      1.217e-05  2.428e-04   0.050  0.96003    
## FSNAP       -1.011e-02  5.176e-03  -1.953  0.05096 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.238 on 3203 degrees of freedom
##   (42384 observations deleted due to missingness)
## Multiple R-squared:  0.0957, Adjusted R-squared:  0.09316 
## F-statistic: 37.66 on 9 and 3203 DF,  p-value: < 2.2e-16

When running a logistic regression on the dependent variable with the entire data set the adjusted R- squared is extremely low. We are missing some variables in the data set that can be the reason. I am going to rerun a regression using independent variables that does not have missing data.

fit2<-lm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FDMEDYN+ FSWEIGHT+ FSLESS+
    FSHUNGRY + INCGRP4+ FSNAP, data = wout)
summary(fit2)

## 
## Call:
## lm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FDMEDYN + FSWEIGHT + 
##     FSLESS + FSHUNGRY + INCGRP4 + FSNAP, data = wout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39782  0.02605  0.10580  0.18983  0.93825 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.138e+00  2.040e-02  55.786  < 2e-16 ***
## FSBALANC     8.220e-02  5.457e-03  15.064  < 2e-16 ***
## FSLAST      -1.386e-01  6.015e-03 -23.048  < 2e-16 ***
## FDMEDYN     -1.830e-03  7.071e-03  -0.259  0.79577    
## FSWEIGHT     5.142e-03  4.441e-03   1.158  0.24691    
## FSLESS      -7.608e-02  8.339e-03  -9.123  < 2e-16 ***
## FSHUNGRY    -2.230e-02  8.479e-03  -2.630  0.00856 ** 
## INCGRP4     -3.936e-05  1.721e-04  -0.229  0.81908    
## FSNAP       -1.226e-02  4.255e-03  -2.882  0.00396 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3374 on 8590 degrees of freedom
##   (36998 observations deleted due to missingness)
## Multiple R-squared:  0.09896,    Adjusted R-squared:  0.09812 
## F-statistic: 117.9 on 8 and 8590 DF,  p-value: < 2.2e-16

This regression’s adjusted R-squared is better than the last but it only explains about 9.8% of the variance. This model is not very good. I am going to try a new regression model using the significant variables.

fit3<-lm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS+
    FSHUNGRY +  FSNAP, data = wout)
summary(fit3)

## 
## Call:
## lm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS + FSHUNGRY + 
##     FSNAP, data = wout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39793  0.02346  0.10744  0.18955  0.95671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.142629   0.016775  68.114   <2e-16 ***
## FSBALANC     0.082117   0.005435  15.108   <2e-16 ***
## FSLAST      -0.138554   0.006014 -23.040   <2e-16 ***
## FSLESS      -0.075680   0.008302  -9.116   <2e-16 ***
## FSHUNGRY    -0.021314   0.008430  -2.528   0.0115 *  
## FSNAP       -0.012660   0.003906  -3.241   0.0012 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3374 on 8593 degrees of freedom
##   (36998 observations deleted due to missingness)
## Multiple R-squared:  0.09881,    Adjusted R-squared:  0.09828 
## F-statistic: 188.4 on 5 and 8593 DF,  p-value: < 2.2e-16

In this model it is a little better than the last but we are still only explaining 9.83% of the variance. In this model FBALANC has a positive coeffecient this indicates that if the FSBALANC increases the FSRUNOUT increases by 0.082117. The FSLAST, FSLESS, FSHUNGRY, and FSNAP have a negative coeffecient that means that when these independent variables increase FSRUNOUT will decrease by the estimate numbers.

anova(fit3)

## Analysis of Variance Table
## 
## Response: FSRUNOUT
##             Df Sum Sq Mean Sq  F value    Pr(>F)    
## FSBALANC     1   0.18   0.176   1.5483  0.213416    
## FSLAST       1  90.57  90.566 795.7202 < 2.2e-16 ***
## FSLESS       1  14.52  14.525 127.6141 < 2.2e-16 ***
## FSHUNGRY     1   0.77   0.766   6.7328  0.009482 ** 
## FSNAP        1   1.20   1.195  10.5034  0.001196 ** 
## Residuals 8593 978.02   0.114                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Running a Anova table tells me that all of these independent variables are significant factors in FSRUNOUT.

now let’s take a further look at the variables

fit4<-glm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS+
    FSHUNGRY +  FSNAP, data = wout)
summary(fit4)

## 
## Call:
## glm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS + FSHUNGRY + 
##     FSNAP, data = wout)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.39793   0.02346   0.10744   0.18955   0.95671  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.142629   0.016775  68.114   <2e-16 ***
## FSBALANC     0.082117   0.005435  15.108   <2e-16 ***
## FSLAST      -0.138554   0.006014 -23.040   <2e-16 ***
## FSLESS      -0.075680   0.008302  -9.116   <2e-16 ***
## FSHUNGRY    -0.021314   0.008430  -2.528   0.0115 *  
## FSNAP       -0.012660   0.003906  -3.241   0.0012 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1138159)
## 
##     Null deviance: 1085.25  on 8598  degrees of freedom
## Residual deviance:  978.02  on 8593  degrees of freedom
##   (36998 observations deleted due to missingness)
## AIC: 5723.8
## 
## Number of Fisher Scoring iterations: 2

Let’s create a data sample and run a regression on it.

n=length(wout$FSRUNOUT)
n1=floor(n*(0.6))
n2=n-n1
train=sample(1:n,n1)
xx = wout[-4]
xtrain <-xx[train,]
xnew <- xx[-train,]
ytrain <- FSRUNOUT[train]
ynew <- FSRUNOUT[-train]


fit5=glm(FSRUNOUT ~ FSBALANC + FSLAST + FSLESS+
    FSHUNGRY +  FSNAP, data=data.frame(FSRUNOUT=ytrain,xtrain))
summary(fit5)

## 
## Call:
## glm(formula = FSRUNOUT ~ FSBALANC + FSLAST + FSLESS + FSHUNGRY + 
##     FSNAP, data = data.frame(FSRUNOUT = ytrain, xtrain))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.12014   0.02185   0.10467   0.19155   0.96698  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.147850   0.021366  53.723  < 2e-16 ***
## FSBALANC     0.086881   0.006982  12.443  < 2e-16 ***
## FSLAST      -0.141498   0.007631 -18.544  < 2e-16 ***
## FSLESS      -0.083308   0.010259  -8.120 5.75e-16 ***
## FSHUNGRY    -0.017159   0.010184  -1.685  0.09208 .  
## FSNAP       -0.014616   0.004990  -2.929  0.00341 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1124219)
## 
##     Null deviance: 654.00  on 5205  degrees of freedom
## Residual deviance: 584.59  on 5200  degrees of freedom
##   (22152 observations deleted due to missingness)
## AIC: 3404.3
## 
## Number of Fisher Scoring iterations: 2

ptest <- predict(fit4,newdata=data.frame(xnew),type="response")

hist(ptest, main = "Prediction of worried about running out food")

xbar=mean(ynew)
xbar

## [1] NA

we now will calculate the error with probability .3

g=floor(ptest+0.3) 
t=table(ynew,g)
t

##     g
## ynew    0    1
##    0  238  269
##    1  147 2739

of the 2886 worried people it correctly predicts 2739 of them.

error=(t[1,2]+t[2,1])/n2
error

## [1] 0.02280827

gg2=floor(ptest+0.7)
ttt=table(ynew,gg2)
ttt

##     gg2
## ynew    0    1    2
##    0    1  504    2
##    1    3 2881    2

error=(ttt[1,2]+ttt[2,1])/n2
error

## [1] 0.02779758

bb=cbind(ptest,ynew)
bb[1:10,]

##    ptest ynew
## 2     NA    0
## 3     NA    0
## 4     NA    0
## 10    NA    0
## 12    NA    0
## 14    NA    0
## 15    NA    0
## 16    NA    0
## 22    NA    0
## 23    NA    0

bb1=bb[order(ptest,decreasing=TRUE),]
bb1[1:10,]

##          ptest ynew
## 42362 1.397926    0
## 7877  1.385266    1
## 14256 1.385266   NA
## 39391 1.385266    1
## 43396 1.385266    0
## 2134  1.140772    1
## 3526  1.140772    1
## 8142  1.140772    1
## 10669 1.140772    1
## 13816 1.140772    1

rank ordering according to probabilities and ordering cases in test according to their success probabilities

xbar=mean(ynew)
xbar

## [1] NA

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

p <- predict(fit4, newdata=data.frame(xnew),type="response")
fitted.results <- predict(fit4, newdata=data.frame(xnew),type="response")
head(fitted.results)

##  2  3  4 10 12 14 
## NA NA NA NA NA NA

fitted.results <- ifelse(fitted.results > 0.3,1,0)
head(fitted.results)

##  2  3  4 10 12 14 
## NA NA NA NA NA NA

#fitted.results= same as p, just naming it differently 

misClasificError <-mean(fitted.results != ynew)
print(paste('Accuracy', 1-misClasificError))

## [1] "Accuracy NA"

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(lattice)
library(ggplot2)
library(e1071)
confusionMatrix(data=fitted.results, reference=ynew)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0    1    3
##          1  506 2883
##                                           
##                Accuracy : 0.85            
##                  95% CI : (0.8375, 0.8618)
##     No Information Rate : 0.8506          
##     P-Value [Acc > NIR] : 0.5501          
##                                           
##                   Kappa : 0.0016          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0019724       
##             Specificity : 0.9989605       
##          Pos Pred Value : 0.2500000       
##          Neg Pred Value : 0.8506934       
##              Prevalence : 0.1494253       
##          Detection Rate : 0.0002947       
##    Detection Prevalence : 0.0011789       
##       Balanced Accuracy : 0.5004664       
##                                           
##        'Positive' Class : 0               
##

I recommend that eligibility for food assistance should not only be restricted by income, but should also include judgement on an individual’s past in the last 12 months. In accordance with if that individual was unable to afford a balanced meal, if they have previously ran out of food before payment, or if they are currently on government assistance, but still worried about running out of food.

Master Analyst

Ober

October 25, 2017