Chopped Analysis

In the data given to us there were three levels of food insecurity. Our group recoded the responses as 1, 2, or 3 depending on the answer given.
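The frequency table for FSLAST further down still shows a few responses coded 7 and 9, so the recode step could also set anything outside 1 to 3 to missing. A minimal sketch, assuming those codes are non-substantive answers (such as refused or don't know) and using a small made-up vector in place of the real column:

```r
# Hypothetical stand-in for fam$FSLAST; the real column comes from FamilyEdited.csv
fslast_raw <- c(1, 2, 3, 3, 7, 9, 2, 3)

# Keep the three substantive levels and turn every other code into NA
# (assumption: codes outside 1-3 carry no food-security information)
fslast_clean <- ifelse(fslast_raw %in% 1:3, fslast_raw, NA)

# useNA = "ifany" also reports how many responses were set to missing
table(fslast_clean, useNA = "ifany")
```

Recoding before the regression keeps the stray codes from being treated as if they were larger values on the same scale.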

After cleaning the data, we kept 36 variables for our regression. We chose FSLAST as our dependent variable over FSRUNOUT because it seemed to be the more tangible measure.

fam<-read.csv("~/DataMining/Data/FamilyEdited.csv")
table(fam$FSLAST)
## 
##     1     2     3     7     9 
##  1723  4530 39290     7     2
m1=lm(fam$FSLAST~.,data=fam)
summary(m1)
## 
## Call:
## lm(formula = fam$FSLAST ~ ., data = fam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6384 -0.0054  0.0020  0.0130  6.2830 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.757e-01  6.430e-02   7.399 1.41e-13 ***
## CURWRKN     -5.041e-03  2.635e-03  -1.913 0.055764 .  
## TELCELN             NA         NA      NA       NA    
## WRKCELN      3.581e-04  2.401e-04   1.491 0.135926    
## FLNGINTV     2.069e-03  2.954e-03   0.700 0.483672    
## FM_SIZE     -4.479e-03  2.748e-03  -1.630 0.103058    
## FM_KIDS      1.860e-03  3.168e-03   0.587 0.557149    
## FM_ELDR      1.622e-02  4.484e-03   3.618 0.000298 ***
## FM_TYPE      3.008e-03  9.449e-03   0.318 0.750250    
## FM_STRCP     8.718e-03  5.917e-03   1.473 0.140678    
## FM_STRP     -8.890e-03  5.922e-03  -1.501 0.133313    
## FM_EDUC1     2.829e-04  2.260e-04   1.252 0.210688    
## FLAADLYN     1.812e-02  2.627e-02   0.690 0.490268    
## FLAADLCT    -8.777e-03  2.610e-02  -0.336 0.736622    
## FLIADLYN    -4.671e-02  2.229e-02  -2.095 0.036167 *  
## FLIADLCT    -9.060e-03  2.136e-02  -0.424 0.671378    
## FWKLIMYN     1.243e-02  1.045e-02   1.189 0.234451    
## FWKLIMCT     9.639e-03  9.876e-03   0.976 0.329065    
## FANYLYN      2.405e-02  8.351e-03   2.880 0.003985 ** 
## FANYLCT      1.479e-03  6.573e-03   0.225 0.821901    
## FHSTATPR    -3.017e-02  7.513e-03  -4.016 5.92e-05 ***
## FSRUNOUT     5.527e-01  3.815e-03 144.881  < 2e-16 ***
## FSBALANC     2.764e-01  4.279e-03  64.592  < 2e-16 ***
## FHICOVYN    -2.357e-03  3.012e-03  -0.783 0.433865    
## FMEDBILL     9.522e-03  2.751e-03   3.462 0.000538 ***
## FDGLWCT1     6.092e-03  3.491e-03   1.745 0.081002 .  
## FWRKLWCT     3.314e-03  2.449e-03   1.353 0.175960    
## FSALYN      -7.607e-03  2.912e-03  -2.612 0.008997 ** 
## FSALCT      -2.194e-04  2.877e-03  -0.076 0.939207    
## FSSRRYN     -8.668e-03  4.019e-03  -2.157 0.031042 *  
## FSSRRCT     -1.811e-02  5.222e-03  -3.468 0.000524 ***
## FTANFYN      4.602e-03  4.335e-03   1.062 0.288409    
## FCHSPYN      1.603e-03  3.676e-03   0.436 0.662790    
## INCGRP4     -5.039e-05  7.519e-05  -0.670 0.502760    
## RAT_CAT4    -1.914e-05  9.573e-05  -0.200 0.841569    
## FSNAP        7.447e-03  2.857e-03   2.606 0.009156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2311 on 30107 degrees of freedom
##   (15410 observations deleted due to missingness)
## Multiple R-squared:  0.6956, Adjusted R-squared:  0.6952 
## F-statistic:  2023 on 34 and 30107 DF,  p-value: < 2.2e-16

This regression gave us an R^2 of about 70%. We also included a bar chart to show how many people gave each answer to FSLAST.

library(lattice)
fslast=table(fam$FSLAST)
fslast
## 
##     1     2     3     7     9 
##  1723  4530 39290     7     2
barchart(fslast,ylab="The food did not last and there wasn't enough money to buy more.")

While doing our analysis, we wanted to see how closely related FSLAST and FSRUNOUT are, because to us they seemed to measure pretty much the same thing. Using this as our guideline, we ran another regression that excluded FSRUNOUT.

m3=lm(fam$FSLAST~.-fam$FSRUNOUT,data=fam)
summary(m3)
## 
## Call:
## lm(formula = fam$FSLAST ~ . - fam$FSRUNOUT, data = fam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6384 -0.0054  0.0020  0.0130  6.2830 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.757e-01  6.430e-02   7.399 1.41e-13 ***
## CURWRKN     -5.041e-03  2.635e-03  -1.913 0.055764 .  
## TELCELN             NA         NA      NA       NA    
## WRKCELN      3.581e-04  2.401e-04   1.491 0.135926    
## FLNGINTV     2.069e-03  2.954e-03   0.700 0.483672    
## FM_SIZE     -4.479e-03  2.748e-03  -1.630 0.103058    
## FM_KIDS      1.860e-03  3.168e-03   0.587 0.557149    
## FM_ELDR      1.622e-02  4.484e-03   3.618 0.000298 ***
## FM_TYPE      3.008e-03  9.449e-03   0.318 0.750250    
## FM_STRCP     8.718e-03  5.917e-03   1.473 0.140678    
## FM_STRP     -8.890e-03  5.922e-03  -1.501 0.133313    
## FM_EDUC1     2.829e-04  2.260e-04   1.252 0.210688    
## FLAADLYN     1.812e-02  2.627e-02   0.690 0.490268    
## FLAADLCT    -8.777e-03  2.610e-02  -0.336 0.736622    
## FLIADLYN    -4.671e-02  2.229e-02  -2.095 0.036167 *  
## FLIADLCT    -9.060e-03  2.136e-02  -0.424 0.671378    
## FWKLIMYN     1.243e-02  1.045e-02   1.189 0.234451    
## FWKLIMCT     9.639e-03  9.876e-03   0.976 0.329065    
## FANYLYN      2.405e-02  8.351e-03   2.880 0.003985 ** 
## FANYLCT      1.479e-03  6.573e-03   0.225 0.821901    
## FHSTATPR    -3.017e-02  7.513e-03  -4.016 5.92e-05 ***
## FSRUNOUT     5.527e-01  3.815e-03 144.881  < 2e-16 ***
## FSBALANC     2.764e-01  4.279e-03  64.592  < 2e-16 ***
## FHICOVYN    -2.357e-03  3.012e-03  -0.783 0.433865    
## FMEDBILL     9.522e-03  2.751e-03   3.462 0.000538 ***
## FDGLWCT1     6.092e-03  3.491e-03   1.745 0.081002 .  
## FWRKLWCT     3.314e-03  2.449e-03   1.353 0.175960    
## FSALYN      -7.607e-03  2.912e-03  -2.612 0.008997 ** 
## FSALCT      -2.194e-04  2.877e-03  -0.076 0.939207    
## FSSRRYN     -8.668e-03  4.019e-03  -2.157 0.031042 *  
## FSSRRCT     -1.811e-02  5.222e-03  -3.468 0.000524 ***
## FTANFYN      4.602e-03  4.335e-03   1.062 0.288409    
## FCHSPYN      1.603e-03  3.676e-03   0.436 0.662790    
## INCGRP4     -5.039e-05  7.519e-05  -0.670 0.502760    
## RAT_CAT4    -1.914e-05  9.573e-05  -0.200 0.841569    
## FSNAP        7.447e-03  2.857e-03   2.606 0.009156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2311 on 30107 degrees of freedom
##   (15410 observations deleted due to missingness)
## Multiple R-squared:  0.6956, Adjusted R-squared:  0.6952 
## F-statistic:  2023 on 34 and 30107 DF,  p-value: < 2.2e-16

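Dropping FSRUNOUT lowered our R^2 to less than 0.5, a drastic change from the roughly 0.70 of the first model. The summary printed above still lists FSRUNOUT and matches the m1 output, because subtracting fam$FSRUNOUT does not match the term named FSRUNOUT; the formula that drops it is FSLAST ~ . - FSRUNOUT.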
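Removing a predictor from a dot formula works by bare column name. A minimal sketch on simulated data (the data frame toy and its columns are made up for illustration, not taken from our data set) compares the $-prefixed form with the bare-name form:

```r
set.seed(1)
# Simulated stand-in data: y depends strongly on x1 and weakly on x2
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 2 * toy$x1 + 0.5 * toy$x2 + rnorm(200)

full    <- lm(y ~ ., data = toy)            # all predictors
wrong   <- lm(y ~ . - toy$x1, data = toy)   # "toy$x1" is not a term name, so x1 stays
reduced <- lm(y ~ . - x1, data = toy)       # bare name: x1 is dropped

names(coef(wrong))     # x1 is still among the coefficients
names(coef(reduced))   # only the intercept and x2 remain

# Compare the fit with and without the strong predictor
summary(full)$r.squared
summary(reduced)$r.squared
```

Checking names(coef(...)) after fitting is a quick way to confirm that the intended variable really left the model.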

Midway through the chopped analysis I suggested to the team that we run a logistic regression, given that what we were trying to predict was a 1 or 0. At first my group members weren't sure whether to proceed with that, and then it became too late to act on it. If I were to do this again, I would start with a logistic regression right from the beginning.
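A minimal sketch of what that logistic regression could look like, using simulated data. The outcome insecure is assumed to be a 0/1 indicator (for example, FSLAST collapsed so that 1 means the food did not last at least sometimes), and the predictors snap and income are hypothetical stand-ins, not columns from our file:

```r
set.seed(42)
n <- 500
# Simulated stand-in data; none of these values come from the survey file
toy <- data.frame(snap = rbinom(n, 1, 0.3), income = rnorm(n))

# Assumed data-generating process, for illustration only
p <- plogis(-1 + 1.2 * toy$snap - 0.8 * toy$income)
toy$insecure <- rbinom(n, 1, p)

# Logistic regression: binomial family with the default logit link
logit_fit <- glm(insecure ~ snap + income, data = toy, family = binomial)
summary(logit_fit)

# Predicted probabilities always fall between 0 and 1,
# unlike the fitted values from lm() on a 0/1 outcome
probs <- predict(logit_fit, type = "response")
range(probs)
```

The coefficients are on the log-odds scale, so exp(coef(logit_fit)) gives odds ratios, which are usually easier to interpret.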