Synopsis

As part of the Week 6 homework, this document also serves as the proposal for the final project in this subject. The datasets I have selected are part of the “Eating and Health Module” dataset found on Kaggle:

https://www.kaggle.com/bls/eating-health-module-dataset

The data is imported, cleaned, and analysed both numerically and graphically to answer a few questions and draw relevant conclusions.

Packages

The following packages are required:

library(tidyverse)  # filtering, selecting, piping, and ggplot commands used throughout this file

SourceCode

The ATUS Eating & Health (EH) Module was fielded from 2006 to 2008 and again from 2014 to 2016. The EH Module data files contain additional information related to eating, meal preparation and health.

The module here consists of three datasets from 2014: ehact_2014.csv, ehresp_2014.csv, and ehwgts_2014.csv. I am working with only the first two.

1. The EH Respondent file (ehresp_2014.csv) contains information about EH respondents, including general health and body mass index. There are 37 variables, including:

TUCASEID: Identifies each household
TULINENO: Identifies each individual within the household
EEINCOME1: Was last month's total household income before taxes more than, less than, or equal to (amount) per month? Note: (amount) approximates 185 percent of the poverty threshold.
ERHHCH: Change in household composition between CPS and ATUS
ERSPEMCH: Change in spouse or unmarried partner's labor force status or full-time or part-time employment status between CPS and ATUS
EUINCOME2: Was last month's total household income before taxes more than, less than, or equal to (amount) per month? Note: (amount) approximates 130 percent of the poverty threshold.
EUINCLVL: Identifies which income values were asked in EEINCOME1 and EUINCOME2
ERINCOME: Income recode
EUGENHTH: In general, is physical health excellent, very good, good, fair, or poor?
EUHGT: Height (in inches)
ETHGT: Topcode flag for height (EUHGT)
EUWGT: Weight (in pounds)
ETWGT: Topcode flag for weight (EUWGT)
ERBMI: Body mass index
EUGROSHP: The person who usually does the grocery shopping in the household
EUSTORES: Source of the groceries
EUSTREASON: Reason for shopping at that place
EUFASTFD: Over the last seven days, was any prepared food from a deli, carryout, delivery food, or fast food purchased?
EUFASTFDFRQ: How many times in the last seven days fast food was purchased
EUFFYDAY: Was any prepared food purchased yesterday?
EUPRPMEL: The person who usually prepares the meals in the household
EUMEAT: In the last seven days, were any meals prepared with meat, poultry, or seafood?
EUTHERM: Was a food or meat thermometer used when preparing those meals?
EUMILK: In the last seven days, was unpasteurized or raw milk drunk or served?

2. The EH Activity file (ehact_2014.csv) contains information such as the activity number, whether secondary eating occurred during the activity, and the duration of secondary eating. There are 5 variables, including:

TUACTIVITY_N: Activity number identifying each activity
EUEATSUM: Whether secondary eating occurred during the activity
EUEDUR24: How much time was spent in secondary eating

The complete explanation can be found in the codebook below: http://www.bls.gov/tus/ehmintcodebk1416.pdf

Importing Data

# Read both CSV files from GitHub and convert them to tibbles
url1 <- "https://raw.githubusercontent.com/RKK101/ProjectR/master/ehresp_2014.csv"
Resp_data <- as_tibble(read.csv(url1))
url2 <- "https://raw.githubusercontent.com/RKK101/ProjectR/master/ehact_2014.csv"
Act_data <- as_tibble(read.csv(url2))

head(Resp_data)
## # A tibble: 6 × 37
##      tucaseid tulineno eeincome1 erbmi erhhch erincome erspemch ertpreat
##         <dbl>    <int>     <int> <dbl>  <int>    <int>    <int>    <int>
## 1 2.01401e+13        1        -2  33.2      1       -1       -1       30
## 2 2.01401e+13        1         1  22.7      3        1       -1       45
## 3 2.01401e+13        1         2  49.4      3        5       -1       60
## 4 2.01401e+13        1        -2  -1.0      3       -1       -1        0
## 5 2.01401e+13        1         2  31.0      3        5       -1       65
## 6 2.01401e+13        1         1  30.7      3        1        1       20
## # ... with 29 more variables: ertseat <int>, ethgt <int>, etwgt <int>,
## #   eudietsoda <int>, eudrink <int>, eueat <int>, euexercise <int>,
## #   euexfreq <int>, eufastfd <int>, eufastfdfrq <int>, euffyday <int>,
## #   eufdsit <int>, eufinlwgt <dbl>, eusnap <int>, eugenhth <int>,
## #   eugroshp <int>, euhgt <int>, euinclvl <int>, euincome2 <int>,
## #   eumeat <int>, eumilk <int>, euprpmel <int>, eusoda <int>,
## #   eustores <int>, eustreason <int>, eutherm <int>, euwgt <int>,
## #   euwic <int>, exincome1 <int>
head(Act_data)
## # A tibble: 6 × 5
##      tucaseid tuactivity_n eueatsum euedur euedur24
##         <dbl>        <int>    <int>  <int>    <int>
## 1 2.01401e+13            1       -1     -1       -1
## 2 2.01401e+13            2       -1     -1       -1
## 3 2.01401e+13            3       -1     -1       -1
## 4 2.01401e+13            4       -1     -1       -1
## 5 2.01401e+13            5       -1     -1       -1
## 6 2.01401e+13            6       -1     -1       -1
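
In the previews above every case ID prints as 2.01401e+13, because read.csv() stores tucaseid as a double and the tibble printout abbreviates it. One possible workaround, sketched here on a small inline CSV with invented ID values rather than the real files, is to read that column as character through the colClasses argument of read.csv().

```r
# Inline CSV standing in for ehresp_2014.csv; the IDs are invented for illustration.
csv_text <- "tucaseid,tulineno,erbmi
20140101140007,1,33.2
20140101140011,1,22.7"

# Default import: tucaseid is too large for an integer, so it becomes a double.
as_number <- read.csv(text = csv_text)

# Reading the ID as character keeps every digit visible when printed.
as_text <- read.csv(text = csv_text, colClasses = c(tucaseid = "character"))

class(as_number$tucaseid)  # "numeric"
class(as_text$tucaseid)    # "character"
as_text$tucaseid
```

Only the named column is forced to character; the remaining columns are still type-converted as usual.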

Data Description

#Checking the dimensions of the datasets

dim(Resp_data)
## [1] 11212    37
dim(Act_data)
## [1] 120719      5
#Checking the names of the variables

names(Resp_data)
##  [1] "tucaseid"    "tulineno"    "eeincome1"   "erbmi"       "erhhch"     
##  [6] "erincome"    "erspemch"    "ertpreat"    "ertseat"     "ethgt"      
## [11] "etwgt"       "eudietsoda"  "eudrink"     "eueat"       "euexercise" 
## [16] "euexfreq"    "eufastfd"    "eufastfdfrq" "euffyday"    "eufdsit"    
## [21] "eufinlwgt"   "eusnap"      "eugenhth"    "eugroshp"    "euhgt"      
## [26] "euinclvl"    "euincome2"   "eumeat"      "eumilk"      "euprpmel"   
## [31] "eusoda"      "eustores"    "eustreason"  "eutherm"     "euwgt"      
## [36] "euwic"       "exincome1"
names(Act_data)
## [1] "tucaseid"     "tuactivity_n" "eueatsum"     "euedur"      
## [5] "euedur24"
#Checking the structure of the data

str(Resp_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    11212 obs. of  37 variables:
##  $ tucaseid   : num  2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
##  $ tulineno   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ eeincome1  : int  -2 1 2 -2 2 1 1 1 1 1 ...
##  $ erbmi      : num  33.2 22.7 49.4 -1 31 ...
##  $ erhhch     : int  1 3 3 3 3 3 1 3 3 3 ...
##  $ erincome   : int  -1 1 5 -1 5 1 1 1 1 1 ...
##  $ erspemch   : int  -1 -1 -1 -1 -1 1 5 -1 -1 5 ...
##  $ ertpreat   : int  30 45 60 0 65 20 30 30 117 80 ...
##  $ ertseat    : int  2 14 0 0 0 10 5 5 10 0 ...
##  $ ethgt      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ etwgt      : int  0 0 0 -1 0 0 0 0 0 0 ...
##  $ eudietsoda : int  -1 -1 -1 2 -1 1 -1 -1 -1 2 ...
##  $ eudrink    : int  2 2 1 1 1 1 1 2 2 1 ...
##  $ eueat      : int  1 1 2 2 2 1 1 1 1 2 ...
##  $ euexercise : int  2 2 2 2 1 1 2 1 1 2 ...
##  $ euexfreq   : int  -1 -1 -1 -1 5 2 -1 3 6 -1 ...
##  $ eufastfd   : int  2 1 2 2 2 1 1 1 2 1 ...
##  $ eufastfdfrq: int  -1 1 -1 -1 -1 3 3 1 -1 2 ...
##  $ euffyday   : int  -1 2 -1 -1 -1 1 2 2 -1 1 ...
##  $ eufdsit    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ eufinlwgt  : num  5202086 29400000 26000000 2728880 17500000 ...
##  $ eusnap     : int  1 2 2 2 1 2 2 2 2 2 ...
##  $ eugenhth   : int  1 2 5 2 4 3 2 2 3 1 ...
##  $ eugroshp   : int  1 3 2 1 1 2 3 1 1 1 ...
##  $ euhgt      : int  60 63 62 64 69 71 65 63 70 65 ...
##  $ euinclvl   : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ euincome2  : int  -2 -1 2 -2 2 -1 -1 -1 -1 -1 ...
##  $ eumeat     : int  1 1 -1 2 1 -1 1 1 1 1 ...
##  $ eumilk     : int  2 2 -1 2 2 -1 2 2 2 2 ...
##  $ euprpmel   : int  1 1 2 1 1 2 3 1 1 1 ...
##  $ eusoda     : int  -1 -1 2 1 2 1 2 -1 -1 1 ...
##  $ eustores   : int  2 1 -1 2 1 -1 2 1 1 3 ...
##  $ eustreason : int  1 2 -1 6 1 -1 5 3 4 1 ...
##  $ eutherm    : int  2 2 -1 -1 2 -1 2 2 2 2 ...
##  $ euwgt      : int  170 128 270 -2 210 220 200 155 180 170 ...
##  $ euwic      : int  1 2 2 2 1 2 2 -1 -1 -1 ...
##  $ exincome1  : int  2 0 12 2 0 0 0 0 0 0 ...
str(Act_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    120719 obs. of  5 variables:
##  $ tucaseid    : num  2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
##  $ tuactivity_n: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ eueatsum    : int  -1 -1 -1 -1 -1 -1 1 -1 -1 -1 ...
##  $ euedur      : int  -1 -1 -1 -1 -1 -1 2 -1 -1 -1 ...
##  $ euedur24    : int  -1 -1 -1 -1 -1 -1 2 -1 -1 -1 ...
# Missing values are encoded in several ways, as specified in the codebook. Values of -1, -2, or -3 indicate a blank, unknown, or refused answer, and all of them can be treated as missing values.
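
One way to act on this is to convert the negative codes to NA before computing any statistics. The sketch below uses a toy data frame with invented values (not the real file) and assumes that -1, -2 and -3 never occur as legitimate measurements in the recoded columns. The names toy, missing_codes and value_cols are illustrative only.

```r
# Toy rows imitating a few respondent columns; all values are invented.
toy <- data.frame(
  tucaseid = c(1, 2, 3, 4),
  erbmi    = c(33.2, 22.7, -1, 31.0),
  euwgt    = c(170L, 128L, -2L, 210L),
  euexfreq = c(-1L, -1L, 5L, -3L)
)

missing_codes <- c(-1, -2, -3)

# Recode every column except the ID, so an identifier is never altered.
value_cols <- setdiff(names(toy), "tucaseid")
toy[value_cols] <- lapply(
  toy[value_cols],
  function(x) replace(x, x %in% missing_codes, NA)
)

colSums(is.na(toy))             # number of missing values per column
mean(toy$erbmi, na.rm = TRUE)   # summaries can now skip the missing entries
```

Keeping the recode in one place makes it easy to compare results with and without the missing rows later on.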

Data Cleaning

The datasets contain many redundant columns, and a few of them can be dropped.

# The TULINENO variable has the same value throughout the dataset, because only one household member supplied the information, so it can be removed. TUCASEID gives the case ID per family.

# ERINCOME combines the information in EEINCOME1 and EXINCOME1, so those two columns can be removed too.

# Some variables are not directly relevant to our analysis and can also be removed: EUINCLVL, ERHHCH, ETHGT, ETWGT, EUGROSHP

resp_1 <- select(Resp_data, -tulineno, -eeincome1, -exincome1, -erhhch, -euinclvl, -ethgt, -etwgt, -eugroshp)

# We split the dataset into smaller groups of related columns so that they are easier to handle

case <- select(resp_1, tucaseid, euhgt, euwgt, erbmi, eugenhth, erincome, eusnap, euwic)

head(case)
## # A tibble: 6 × 8
##      tucaseid euhgt euwgt erbmi eugenhth erincome eusnap euwic
##         <dbl> <int> <int> <dbl>    <int>    <int>  <int> <int>
## 1 2.01401e+13    60   170  33.2        1       -1      1     1
## 2 2.01401e+13    63   128  22.7        2        1      2     2
## 3 2.01401e+13    62   270  49.4        5        5      2     2
## 4 2.01401e+13    64    -2  -1.0        2       -1      2     2
## 5 2.01401e+13    69   210  31.0        4        5      1     1
## 6 2.01401e+13    71   220  30.7        3        1      2     2
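
The erbmi column appears to be derived from euhgt and euwgt. As a quick consistency check, the sketch below recomputes BMI for the six rows printed above, assuming the standard imperial formula BMI = 703 * weight (lb) / height (in)^2. That formula is an assumption here rather than something confirmed in this write-up, and negative codes are treated as missing.

```r
# The six rows shown by head(case), typed in by hand.
case_head <- data.frame(
  euhgt = c(60, 63, 62, 64, 69, 71),
  euwgt = c(170, 128, 270, -2, 210, 220),
  erbmi = c(33.2, 22.7, 49.4, -1.0, 31.0, 30.7)
)

# Treat negative codes as missing before doing any arithmetic.
case_head[case_head < 0] <- NA

# Assumed imperial BMI formula, rounded to one decimal like erbmi.
case_head$bmi_check <- round(703 * case_head$euwgt / case_head$euhgt^2, 1)

case_head
all.equal(case_head$bmi_check, case_head$erbmi)  # TRUE for these six rows
```

If the check holds on the full data as well, erbmi carries no information beyond height and weight, which matters when choosing predictors.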
food <- select(resp_1, tucaseid, ertpreat, ertseat, eudrink, eudietsoda, eufastfd, eufastfdfrq, eufdsit, euffyday, eustores, eustreason)

head(food)
## # A tibble: 6 × 11
##      tucaseid ertpreat ertseat eudrink eudietsoda eufastfd eufastfdfrq
##         <dbl>    <int>   <int>   <int>      <int>    <int>       <int>
## 1 2.01401e+13       30       2       2         -1        2          -1
## 2 2.01401e+13       45      14       2         -1        1           1
## 3 2.01401e+13       60       0       1         -1        2          -1
## 4 2.01401e+13        0       0       1          2        2          -1
## 5 2.01401e+13       65       0       1         -1        2          -1
## 6 2.01401e+13       20      10       1          1        1           3
## # ... with 4 more variables: eufdsit <int>, euffyday <int>,
## #   eustores <int>, eustreason <int>
exercise <- select(resp_1, tucaseid, euexercise, euexfreq)

head(exercise)
## # A tibble: 6 × 3
##      tucaseid euexercise euexfreq
##         <dbl>      <int>    <int>
## 1 2.01401e+13          2       -1
## 2 2.01401e+13          2       -1
## 3 2.01401e+13          2       -1
## 4 2.01401e+13          2       -1
## 5 2.01401e+13          1        5
## 6 2.01401e+13          1        2
head(Act_data)
## # A tibble: 6 × 5
##      tucaseid tuactivity_n eueatsum euedur euedur24
##         <dbl>        <int>    <int>  <int>    <int>
## 1 2.01401e+13            1       -1     -1       -1
## 2 2.01401e+13            2       -1     -1       -1
## 3 2.01401e+13            3       -1     -1       -1
## 4 2.01401e+13            4       -1     -1       -1
## 5 2.01401e+13            5       -1     -1       -1
## 6 2.01401e+13            6       -1     -1       -1
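
The activity file holds one row per activity, so it has to be collapsed to one row per case before it can be compared with the respondent file. The following is a minimal sketch on invented rows, assuming that euedur24 is a duration in minutes and that a negative value means no secondary eating was recorded for that activity. The names act_toy, eat_minutes and eat_per_case are illustrative.

```r
# Invented activity rows for two cases; not taken from the real file.
act_toy <- data.frame(
  tucaseid     = c(101, 101, 101, 102, 102),
  tuactivity_n = c(1L, 2L, 3L, 1L, 2L),
  euedur24     = c(-1L, 2L, 15L, -1L, -1L)
)

# Negative codes carry no duration, so count them as zero minutes (assumption).
act_toy$eat_minutes <- ifelse(act_toy$euedur24 < 0, 0L, act_toy$euedur24)

# One row per case: total secondary-eating time across all of its activities.
eat_per_case <- aggregate(eat_minutes ~ tucaseid, data = act_toy, FUN = sum)
eat_per_case
```

The per-case totals can then be joined to the respondent-level tables on tucaseid.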
## Future cleaning:

## Decide how best to handle the missing values (there are too many of them for the affected rows simply to be removed)

Planned Analysis

The plan for analysis is as follows:

1. What is the relationship between weight or BMI and meal preparation patterns, consumption of fresh/fast food, or snacking patterns?

2. Do grocery shopping patterns differ by income?

3. What factors seem to influence the general health of a person?

4. What is the relationship between exercise and food consumption per case ID?

Further analysis will be performed as the project proceeds.
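
As a rough illustration of how question 2 might be approached, the sketch below cross-tabulates the income recode against the grocery source on invented rows. The numeric codes are left unlabelled because their meanings should be taken from the codebook, and rows with negative (missing) codes are dropped first. The names shop_toy, answered and counts are illustrative.

```r
# Invented respondent rows; erincome and eustores hold category codes.
shop_toy <- data.frame(
  erincome = c(1L, 1L, 5L, 5L, -1L, 1L),
  eustores = c(1L, 2L, 1L, 1L, 2L, -1L)
)

# Keep only rows where both answers are present (positive codes).
answered <- subset(shop_toy, erincome > 0 & eustores > 0)

# Counts of grocery source within each income category.
counts <- table(income = answered$erincome, store = answered$eustores)
counts

# Row proportions make income groups of different sizes comparable.
prop.table(counts, margin = 1)
```

On the real data a chi-squared test or a stacked bar chart could follow from the same table.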