Context and Purpose of Data:
I will explore the Eating and Health (EH) Module datasets and analyze how parameters such as health, exercise, income, and weight interact. The data are collected by USDA’s Economic Research Service along with other cosponsors. They cover American Time Use Survey (ATUS) respondents' primary and secondary eating habits (secondary eating is eating while doing another activity); soft drink consumption; grocery shopping preferences and fast food purchases; meal preparation and food safety practices; food assistance participation; general health, height and weight, and exercise; and income. The data were collected in several years; I will focus on the data captured for 2014.
More information about the data is available at https://www.ers.usda.gov/data-products/eating-and-health-module-atus/
Actual data source: http://www.bls.gov/tus/special.requests/ehresp_2014.zip
Data Source for Analysis: https://raw.githubusercontent.com/taus01/EatingHabit/master/ehresp_2014.dat
Content: The EH Respondent file contains information about EH respondents, including general health and body mass index. There are 11,212 observations (respondents) and 37 variables.
There are 34 integer variables and 3 numeric variables in the data set.
The complete data dictionary can be found at: http://www.bls.gov/tus/ehmintcodebk1416.pdf
Missing Values: Certain variables contain invalid entries, which I will treat as missing values. For example, the EUSTREASON variable has negative values, which are not valid.
library(tibble)
library(Hmisc)
url <- "https://raw.githubusercontent.com/taus01/EatingHabit/master/ehresp_2014.dat"
eh_respdt <- read.delim(url, header = TRUE, sep = ",")  # comma-delimited file with a header row
eh_respdtt <- as_tibble(eh_respdt)  # convert the data frame to a tibble
head(eh_respdtt)
## # A tibble: 6 × 37
## TUCASEID TULINENO EEINCOME1 ERBMI ERHHCH ERINCOME ERSPEMCH ERTPREAT
## <dbl> <int> <int> <dbl> <int> <int> <int> <int>
## 1 2.01401e+13 1 -2 33.2 1 -1 -1 30
## 2 2.01401e+13 1 1 22.7 3 1 -1 45
## 3 2.01401e+13 1 2 49.4 3 5 -1 60
## 4 2.01401e+13 1 -2 -1.0 3 -1 -1 0
## 5 2.01401e+13 1 2 31.0 3 5 -1 65
## 6 2.01401e+13 1 1 30.7 3 1 1 20
## # ... with 29 more variables: ERTSEAT <int>, ETHGT <int>, ETWGT <int>,
## # EUDIETSODA <int>, EUDRINK <int>, EUEAT <int>, EUEXERCISE <int>,
## # EUEXFREQ <int>, EUFASTFD <int>, EUFASTFDFRQ <int>, EUFFYDAY <int>,
## # EUFDSIT <int>, EUFINLWGT <dbl>, EUSNAP <int>, EUGENHTH <int>,
## # EUGROSHP <int>, EUHGT <int>, EUINCLVL <int>, EUINCOME2 <int>,
## # EUMEAT <int>, EUMILK <int>, EUPRPMEL <int>, EUSODA <int>,
## # EUSTORES <int>, EUSTREASON <int>, EUTHERM <int>, EUWGT <int>,
## # EUWIC <int>, EXINCOME1 <int>
dim(eh_respdtt)  # dimensions of the data (rows, columns)
## [1] 11212 37
str(eh_respdtt)
## Classes 'tbl_df', 'tbl' and 'data.frame': 11212 obs. of 37 variables:
## $ TUCASEID : num 2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
## $ TULINENO : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EEINCOME1 : int -2 1 2 -2 2 1 1 1 1 1 ...
## $ ERBMI : num 33.2 22.7 49.4 -1 31 30.7 33.3 27.5 25.8 28.3 ...
## $ ERHHCH : int 1 3 3 3 3 3 1 3 3 3 ...
## $ ERINCOME : int -1 1 5 -1 5 1 1 1 1 1 ...
## $ ERSPEMCH : int -1 -1 -1 -1 -1 1 5 -1 -1 5 ...
## $ ERTPREAT : int 30 45 60 0 65 20 30 30 117 80 ...
## $ ERTSEAT : int 2 14 0 0 0 10 5 5 10 0 ...
## $ ETHGT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ETWGT : int 0 0 0 -1 0 0 0 0 0 0 ...
## $ EUDIETSODA : int -1 -1 -1 2 -1 1 -1 -1 -1 2 ...
## $ EUDRINK : int 2 2 1 1 1 1 1 2 2 1 ...
## $ EUEAT : int 1 1 2 2 2 1 1 1 1 2 ...
## $ EUEXERCISE : int 2 2 2 2 1 1 2 1 1 2 ...
## $ EUEXFREQ : int -1 -1 -1 -1 5 2 -1 3 6 -1 ...
## $ EUFASTFD : int 2 1 2 2 2 1 1 1 2 1 ...
## $ EUFASTFDFRQ: int -1 1 -1 -1 -1 3 3 1 -1 2 ...
## $ EUFFYDAY : int -1 2 -1 -1 -1 1 2 2 -1 1 ...
## $ EUFDSIT : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EUFINLWGT : num 5202085 29396791 26009936 2728880 17527153 ...
## $ EUSNAP : int 1 2 2 2 1 2 2 2 2 2 ...
## $ EUGENHTH : int 1 2 5 2 4 3 2 2 3 1 ...
## $ EUGROSHP : int 1 3 2 1 1 2 3 1 1 1 ...
## $ EUHGT : int 60 63 62 64 69 71 65 63 70 65 ...
## $ EUINCLVL : int 5 5 5 5 5 5 5 5 5 5 ...
## $ EUINCOME2 : int -2 -1 2 -2 2 -1 -1 -1 -1 -1 ...
## $ EUMEAT : int 1 1 -1 2 1 -1 1 1 1 1 ...
## $ EUMILK : int 2 2 -1 2 2 -1 2 2 2 2 ...
## $ EUPRPMEL : int 1 1 2 1 1 2 3 1 1 1 ...
## $ EUSODA : int -1 -1 2 1 2 1 2 -1 -1 1 ...
## $ EUSTORES : int 2 1 -1 2 1 -1 2 1 1 3 ...
## $ EUSTREASON : int 1 2 -1 6 1 -1 5 3 4 1 ...
## $ EUTHERM : int 2 2 -1 -1 2 -1 2 2 2 2 ...
## $ EUWGT : int 170 128 270 -2 210 220 200 155 180 170 ...
## $ EUWIC : int 1 2 2 2 1 2 2 -1 -1 -1 ...
## $ EXINCOME1 : int 2 0 12 2 0 0 0 0 0 0 ...
describe(eh_respdtt)
## eh_respdtt
##
## 37 Variables 11212 Observations
## ---------------------------------------------------------------------------
## TUCASEID
## n missing distinct Info Mean Gmd .05
## 11212 0 11212 1 2.014e+13 397798037 2.014e+13
## .10 .25 .50 .75 .90 .95
## 2.014e+13 2.014e+13 2.014e+13 2.014e+13 2.014e+13 2.014e+13
##
## lowest : 2.014010e+13 2.014010e+13 2.014010e+13 2.014010e+13 2.014010e+13
## highest: 2.014121e+13 2.014121e+13 2.014121e+13 2.014121e+13 2.014121e+13
##
## Value 2.014010e+13 2.014011e+13 2.014020e+13 2.014021e+13
## Frequency 107 766 786 136
## Proportion 0.010 0.068 0.070 0.012
##
## Value 2.014030e+13 2.014031e+13 2.014040e+13 2.014050e+13
## Frequency 1051 4 836 865
## Proportion 0.094 0.000 0.075 0.077
##
## Value 2.014051e+13 2.014060e+13 2.014061e+13 2.014070e+13
## Frequency 80 163 872 11
## Proportion 0.007 0.015 0.078 0.001
##
## Value 2.014071e+13 2.014081e+13 2.014091e+13 2.014101e+13
## Frequency 830 916 1056 846
## Proportion 0.074 0.082 0.094 0.075
##
## Value 2.014111e+13 2.014121e+13
## Frequency 948 939
## Proportion 0.085 0.084
## ---------------------------------------------------------------------------
## TULINENO
## n missing distinct Info Mean Gmd
## 11212 0 1 0 1 0
##
## 1 (11212, 1)
## ---------------------------------------------------------------------------
## EEINCOME1
## n missing distinct Info Mean Gmd
## 11212 0 6 0.728 1.294 0.7169
##
## lowest : -3 -2 -1 1 2, highest: -2 -1 1 2 3
##
## -3 (140, 0.012), -2 (155, 0.014), -1 (21, 0.002), 1 (6990, 0.623), 2
## (3454, 0.308), 3 (452, 0.040)
## ---------------------------------------------------------------------------
## ERBMI
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 375 1 26.29 8.727 -1.0 19.9
## .25 .50 .75 .90 .95
## 23.0 26.5 30.4 35.4 39.2
##
## lowest : -1.0 13.0 13.7 13.9 14.5, highest: 60.2 61.4 66.4 68.7 73.6
## ---------------------------------------------------------------------------
## ERHHCH
## n missing distinct Info Mean Gmd
## 11212 0 3 0.188 2.885 0.216
##
## 1 (534, 0.048), 2 (219, 0.020), 3 (10459, 0.933)
## ---------------------------------------------------------------------------
## ERINCOME
## n missing distinct Info Mean Gmd
## 11212 0 6 0.747 2.036 1.653
##
## lowest : -1 1 2 3 4, highest: 1 2 3 4 5
##
## -1 (280, 0.025), 1 (6990, 0.623), 2 (533, 0.048), 3 (976, 0.087), 4 (36,
## 0.003), 5 (2397, 0.214)
## ---------------------------------------------------------------------------
## ERSPEMCH
## n missing distinct Info Mean Gmd
## 11212 0 6 0.794 1.873 2.988
##
## lowest : -1 1 2 3 4, highest: 1 2 3 4 5
##
## -1 (5535, 0.494), 1 (232, 0.021), 2 (93, 0.008), 3 (238, 0.021), 4 (172,
## 0.015), 5 (4942, 0.441)
## ---------------------------------------------------------------------------
## ERTPREAT
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 205 0.997 65.68 50.84 5 15
## .25 .50 .75 .90 .95
## 30 60 90 125 150
##
## lowest : 0 1 2 3 4, highest: 365 390 466 490 508
## ---------------------------------------------------------------------------
## ERTSEAT
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 201 0.904 16.76 26.74 0 0
## .25 .50 .75 .90 .95
## 0 3 15 30 60
##
## lowest : -3 -2 0 1 2, highest: 735 765 810 844 990
## ---------------------------------------------------------------------------
## ETHGT
## n missing distinct Info Mean Gmd
## 11212 0 4 0.064 -0.003122 0.05065
##
## -1 (161, 0.014), 0 (10968, 0.978), 1 (40, 0.004), 2 (43, 0.004)
## ---------------------------------------------------------------------------
## ETWGT
## n missing distinct Info Mean Gmd
## 11212 0 4 0.152 -0.03113 0.112
##
## -1 (500, 0.045), 0 (10610, 0.946), 1 (53, 0.005), 2 (49, 0.004)
## ---------------------------------------------------------------------------
## EUDIETSODA
## n missing distinct Info Mean Gmd
## 11212 0 6 0.608 -0.2867 1.081
##
## lowest : -3 -2 -1 1 2, highest: -2 -1 1 2 3
##
## -3 (2, 0.000), -2 (4, 0.000), -1 (8169, 0.729), 1 (1181, 0.105), 2 (1780,
## 0.159), 3 (76, 0.007)
## ---------------------------------------------------------------------------
## EUDRINK
## n missing distinct Info Mean Gmd
## 11212 0 4 0.663 1.326 0.4469
##
## -3 (1, 0.000), -2 (9, 0.001), 1 (7517, 0.670), 2 (3685, 0.329)
## ---------------------------------------------------------------------------
## EUEAT
## n missing distinct Info Mean Gmd
## 11212 0 4 0.747 1.432 0.5288
##
## -3 (2, 0.000), -2 (61, 0.005), 1 (6112, 0.545), 2 (5037, 0.449)
## ---------------------------------------------------------------------------
## EUEXERCISE
## n missing distinct Info Mean Gmd
## 11212 0 5 0.705 1.353 0.4982
##
## lowest : -3 -2 -1 1 2, highest: -3 -2 -1 1 2
##
## -3 (30, 0.003), -2 (8, 0.001), -1 (19, 0.002), 1 (7014, 0.626), 2 (4141,
## 0.369)
## ---------------------------------------------------------------------------
## EUEXFREQ
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 29 0.941 2.237 3.434 -1 -1
## .25 .50 .75 .90 .95
## -1 2 4 7 7
##
## lowest : -3 -2 -1 1 2, highest: 25 28 30 35 38
## ---------------------------------------------------------------------------
## EUFASTFD
## n missing distinct Info Mean Gmd
## 11212 0 5 0.734 1.407 0.5108
##
## lowest : -3 -2 -1 1 2, highest: -3 -2 -1 1 2
##
## -3 (11, 0.001), -2 (26, 0.002), -1 (6, 0.001), 1 (6470, 0.577), 2 (4699,
## 0.419)
## ---------------------------------------------------------------------------
## EUFASTFDFRQ
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 20 0.913 1.133 2.499 -1 -1
## .25 .50 .75 .90 .95
## -1 1 2 4 6
##
## lowest : -2 -1 1 2 3, highest: 14 15 17 20 21
##
## Value -2 -1 1 2 3 4 5 6 7 8
## Frequency 30 4742 2119 1779 1065 537 376 130 251 34
## Proportion 0.003 0.423 0.189 0.159 0.095 0.048 0.034 0.012 0.022 0.003
##
## Value 9 10 11 12 13 14 15 17 20 21
## Frequency 10 66 5 18 2 25 11 4 4 4
## Proportion 0.001 0.006 0.000 0.002 0.000 0.002 0.001 0.000 0.000 0.000
## ---------------------------------------------------------------------------
## EUFFYDAY
## n missing distinct Info Mean Gmd
## 11212 0 5 0.866 0.5181 1.442
##
## lowest : -3 -2 -1 1 2, highest: -3 -2 -1 1 2
##
## -3 (2, 0.000), -2 (2, 0.000), -1 (4745, 0.423), 1 (2362, 0.211), 2 (4101,
## 0.366)
## ---------------------------------------------------------------------------
## EUFDSIT
## n missing distinct Info Mean Gmd
## 11212 0 6 0.184 1.059 0.1663
##
## lowest : -3 -2 -1 1 2, highest: -2 -1 1 2 3
##
## -3 (21, 0.002), -2 (12, 0.001), -1 (18, 0.002), 1 (10477, 0.934), 2 (548,
## 0.049), 3 (136, 0.012)
## ---------------------------------------------------------------------------
## EUFINLWGT
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 11191 1 8206540 6819055 1887407 2324833
## .25 .50 .75 .90 .95
## 3497206 6005618 10273325 16871894 21768757
##
## lowest : 756843.8 809689.9 824944.5 847037.5 849728.2
## highest: 77669591.6 77792880.8 81002063.2 86042323.1 103211628.8
## ---------------------------------------------------------------------------
## EUSNAP
## n missing distinct Info Mean Gmd
## 11212 0 5 0.296 1.868 0.2384
##
## lowest : -3 -2 -1 1 2, highest: -3 -2 -1 1 2
##
## -3 (21, 0.002), -2 (38, 0.003), -1 (18, 0.002), 1 (1164, 0.104), 2 (9971,
## 0.889)
## ---------------------------------------------------------------------------
## EUGENHTH
## n missing distinct Info Mean Gmd
## 11212 0 8 0.924 2.477 1.212
##
## lowest : -3 -2 -1 1 2, highest: 1 2 3 4 5
##
## -3 (29, 0.003), -2 (36, 0.003), -1 (19, 0.002), 1 (2017, 0.180), 2 (3757,
## 0.335), 3 (3491, 0.311), 4 (1367, 0.122), 5 (496, 0.044)
## ---------------------------------------------------------------------------
## EUGROSHP
## n missing distinct Info Mean Gmd
## 11212 0 5 0.746 1.503 0.687
##
## lowest : -3 -2 1 2 3, highest: -3 -2 1 2 3
##
## -3 (1, 0.000), -2 (2, 0.000), 1 (6914, 0.617), 2 (2940, 0.262), 3 (1355,
## 0.121)
## ---------------------------------------------------------------------------
## EUHGT
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 25 0.995 65.63 6.492 60.00 61.00
## .25 .50 .75 .90 .95
## 63.00 66.00 70.00 72.00 73.45
##
## lowest : -3 -2 -1 56 57, highest: 73 74 75 76 77
## ---------------------------------------------------------------------------
## EUINCLVL
## n missing distinct Info Mean Gmd
## 11212 0 2 0.436 5.177 0.2908
##
## 5 (9232, 0.823), 6 (1980, 0.177)
## ---------------------------------------------------------------------------
## EUINCOME2
## n missing distinct Info Mean Gmd
## 11212 0 6 0.768 -0.2313 1.453
##
## lowest : -3 -2 -1 1 2, highest: -2 -1 1 2 3
##
## -3 (282, 0.025), -2 (599, 0.053), -1 (6818, 0.608), 1 (1116, 0.100), 2
## (2038, 0.182), 3 (359, 0.032)
## ---------------------------------------------------------------------------
## EUMEAT
## n missing distinct Info Mean Gmd
## 11212 0 4 0.716 0.5293 0.9542
##
## -2 (10, 0.001), -1 (3089, 0.276), 1 (7182, 0.641), 2 (931, 0.083)
## ---------------------------------------------------------------------------
## EUMILK
## n missing distinct Info Mean Gmd
## 11212 0 5 0.621 1.158 1.212
##
## lowest : -3 -2 -1 1 2, highest: -3 -2 -1 1 2
##
## -3 (2, 0.000), -2 (1, 0.000), -1 (3090, 0.276), 1 (158, 0.014), 2 (7961,
## 0.710)
## ---------------------------------------------------------------------------
## EUPRPMEL
## n missing distinct Info Mean Gmd
## 11212 0 6 0.734 1.465 0.6607
##
## lowest : -3 -2 -1 1 2, highest: -2 -1 1 2 3
##
## -3 (12, 0.001), -2 (4, 0.000), -1 (10, 0.001), 1 (7011, 0.625), 2 (3061,
## 0.273), 3 (1114, 0.099)
## ---------------------------------------------------------------------------
## EUSODA
## n missing distinct Info Mean Gmd
## 11212 0 4 0.881 0.7385 1.365
##
## -2 (4, 0.000), -1 (3695, 0.330), 1 (3043, 0.271), 2 (4470, 0.399)
## ---------------------------------------------------------------------------
## EUSTORES
## n missing distinct Info Mean Gmd
## 11212 0 8 0.855 0.7889 1.339
##
## lowest : -3 -2 -1 1 2, highest: 1 2 3 4 5
##
## -3 (5, 0.000), -2 (58, 0.005), -1 (2941, 0.262), 1 (5549, 0.495), 2 (2058,
## 0.184), 3 (358, 0.032), 4 (37, 0.003), 5 (206, 0.018)
## ---------------------------------------------------------------------------
## EUSTREASON
## n missing distinct Info Mean Gmd
## 11212 0 9 0.946 1.367 2.047
##
## lowest : -3 -2 -1 1 2, highest: 2 3 4 5 6
##
## -3 (8, 0.001), -2 (65, 0.006), -1 (3008, 0.268), 1 (2648, 0.236), 2 (3047,
## 0.272), 3 (1094, 0.098), 4 (710, 0.063), 5 (172, 0.015), 6 (460, 0.041)
## ---------------------------------------------------------------------------
## EUTHERM
## n missing distinct Info Mean Gmd
## 11212 0 5 0.773 0.844 1.415
##
## lowest : -3 -2 -1 1 2, highest: -3 -2 -1 1 2
##
## -3 (1, 0.000), -2 (5, 0.000), -1 (4030, 0.359), 1 (846, 0.075), 2 (6330,
## 0.565)
## ---------------------------------------------------------------------------
## EUWGT
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 227 0.999 168.2 59.58 100 120
## .25 .50 .75 .90 .95
## 140 168 200 232 260
##
## lowest : -5 -3 -2 -1 98, highest: 333 334 335 337 340
## ---------------------------------------------------------------------------
## EUWIC
## n missing distinct Info Mean Gmd
## 11212 0 5 0.779 0.5121 1.507
##
## lowest : -3 -2 -1 1 2, highest: -3 -2 -1 1 2
##
## -3 (12, 0.001), -2 (25, 0.002), -1 (5370, 0.479), 1 (412, 0.037), 2 (5393,
## 0.481)
## ---------------------------------------------------------------------------
## EXINCOME1
## n missing distinct Info Mean Gmd .05 .10
## 11212 0 20 0.248 4.475 8.437 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.00 0.00 0.00 72.45
##
## lowest : -1 0 2 3 12, highest: 83 84 85 86 87
##
## Value -1 0 2 3 12 13 71 72 73 74
## Frequency 21 10197 155 140 50 3 47 38 237 30
## Proportion 0.002 0.909 0.014 0.012 0.004 0.000 0.004 0.003 0.021 0.003
##
## Value 75 76 77 81 82 83 84 85 86 87
## Frequency 126 6 4 75 24 11 31 10 1 6
## Proportion 0.011 0.001 0.000 0.007 0.002 0.001 0.003 0.001 0.000 0.001
## ---------------------------------------------------------------------------
sum(is.na(eh_respdtt))  # total number of NA values in the data
## [1] 0
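The count of zero is consistent with the Missing Values note: invalid entries are stored as negative codes, not as NA. A minimal sketch of recoding them follows. It assumes every negative value in the chosen columns is invalid, which should be confirmed per variable in the data dictionary. The toy data frame and the helper `negatives_to_na` are hypothetical and not part of the original analysis.

```r
# Toy stand-in for a few columns of eh_respdtt (values are illustrative only)
toy <- data.frame(
  ERBMI      = c(33.2, 22.7, -1.0, 31.0),
  EUSTREASON = c(1L, -1L, 6L, -2L),
  EUWGT      = c(170L, 128L, -2L, 210L)
)

# Hypothetical helper: replace negative codes with NA in the chosen columns.
# Assumption: every negative value in these columns marks an invalid response.
negatives_to_na <- function(df, cols = names(df)) {
  df[cols] <- lapply(df[cols], function(x) replace(x, !is.na(x) & x < 0, NA))
  df
}

toy_clean <- negatives_to_na(toy)
colSums(is.na(toy_clean))  # missing values per column after recoding
```

On the real data, the `cols` argument could restrict the recoding to variables where the data dictionary confirms that negative values are invalid.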
sapply(eh_respdtt, class)  # class of each variable
## TUCASEID TULINENO EEINCOME1 ERBMI ERHHCH ERINCOME
## "numeric" "integer" "integer" "numeric" "integer" "integer"
## ERSPEMCH ERTPREAT ERTSEAT ETHGT ETWGT EUDIETSODA
## "integer" "integer" "integer" "integer" "integer" "integer"
## EUDRINK EUEAT EUEXERCISE EUEXFREQ EUFASTFD EUFASTFDFRQ
## "integer" "integer" "integer" "integer" "integer" "integer"
## EUFFYDAY EUFDSIT EUFINLWGT EUSNAP EUGENHTH EUGROSHP
## "integer" "integer" "numeric" "integer" "integer" "integer"
## EUHGT EUINCLVL EUINCOME2 EUMEAT EUMILK EUPRPMEL
## "integer" "integer" "integer" "integer" "integer" "integer"
## EUSODA EUSTORES EUSTREASON EUTHERM EUWGT EUWIC
## "integer" "integer" "integer" "integer" "integer" "integer"
## EXINCOME1
## "integer"
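The listing above shows one class per variable. To get the number of variables of each type directly, the classes can be tabulated. The sketch below uses a small toy data frame (illustrative values only) instead of the downloaded file. Applied to `eh_respdtt`, the same call should reproduce the 34 integer and 3 numeric variables stated earlier.

```r
# Toy data frame standing in for eh_respdtt (illustrative values only)
toy <- data.frame(
  TUCASEID = c(2.01401e13, 2.01401e13),
  TULINENO = c(1L, 1L),
  ERBMI    = c(33.2, 22.7),
  EUGENHTH = c(1L, 2L)
)

# Class of each column, then the number of columns per class
class_counts <- table(sapply(toy, class))
class_counts
```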
I have not yet cleaned the data. From a first summary of the data, I have observed the following points that need attention during data cleaning:
The data dictionary linked above gives the valid range of each variable. I need to check for invalid values and treat them appropriately.
Some variables are stored as integers although they represent categories.
There are other Eating and Health Module datasets that I need to combine with this one to get interesting facts about respondents.
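The last two cleaning points could be approached roughly as sketched below, using toy data. The category labels for EUGENHTH and EUEXERCISE are assumptions to be verified against the data dictionary. The second table and its `n_activities` column are hypothetical, and TUCASEID is assumed to be the shared key.

```r
# Toy respondent table (illustrative values only)
resp <- data.frame(
  TUCASEID   = c(101, 102, 103),
  EUGENHTH   = c(1L, 5L, -2L),
  EUEXERCISE = c(2L, 1L, 1L)
)

# Integer codes to labelled factors; codes without a level (e.g. -2) become NA.
# Labels are assumed here and should be checked against the data dictionary.
resp$EUGENHTH <- factor(resp$EUGENHTH, levels = 1:5,
                        labels = c("Excellent", "Very good", "Good", "Fair", "Poor"))
resp$EUEXERCISE <- factor(resp$EUEXERCISE, levels = 1:2, labels = c("Yes", "No"))

# Hypothetical second table sharing the TUCASEID key
other <- data.frame(TUCASEID = c(101, 103, 104), n_activities = c(4L, 2L, 7L))

# Left join: keep every respondent and attach matching rows from `other`
combined <- merge(resp, other, by = "TUCASEID", all.x = TRUE)
combined
```

Using `all.x = TRUE` keeps respondents that have no match in the second table, so the respondent count stays unchanged after combining.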
I am planning the following analysis on the data:
Note: I might change some of the analyses and add others as I proceed.