Final Project Data Wrangling in R

Exploring Eating & Health Module datasets

1. Synopsis

Purpose of analysis and context of dataset: I explore the Eating and Health Module datasets and analyze the interaction of parameters such as health, exercise, income, and weight. USDA’s Economic Research Service collects this data along with other cosponsors. More about the data can be found on the Data Preparation tab.

Motivation: It would be interesting to see the relationships between respondents' health and household income, between secondary eating and weight, between exercise and household income, and so on.

Summary: The analysis describes the relationship between respondents' weight and health, and between secondary eating and weight. It was not clear from this analysis whether a relationship exists between height and household income, between household income and health of respondent, or between weight and exercise. More on the results can be found on the Summary tab.

2. Packages Required

library(tibble) ## to convert data into a tibble
library(Hmisc) ## to provide data summaries
library(tidyverse) ## set of packages including dplyr and ggplot2
library(dplyr) ## set of functions for manipulating data
library(DT) ## functions to display data sets
library(GoodmanKruskal) ## functions to measure relationships between categorical variables

3. Data Preparation

3.1 Data Source

The Eating and Health Module data covers American Time Use Survey (ATUS) respondents' primary eating and secondary eating (eating while doing another activity); soft drink consumption; grocery shopping preferences and fast food purchases; meal preparation and food safety practices; food assistance participation; general health, height and weight, and exercise; and income. The data is collected in different years; I focus on the data captured for 2014.

More can be found regarding the data on

https://www.ers.usda.gov/data-products/eating-and-health-module-atus/

Actual data source: http://www.bls.gov/tus/special.requests/ehresp_2014.zip

Data Source for Analysis:

https://raw.githubusercontent.com/taus01/EatingHabit/master/ehresp_2014.dat

3.2 Data cleaning and content

Description:

EH Respondent dataset: Contains information about EH respondents, including general health and body mass index. There are 11212 observations (respondents) and 37 variables: 34 integer variables and 3 numeric variables.

The complete data dictionary can be found at: http://www.bls.gov/tus/ehmintcodebk1416.pdf

Data cleaning Steps:

My data has the following types of problems for this analysis:

1. The data read from the .dat file has TUCASEID in numeric format in the respondent dataset. Since these are 14-digit respondent IDs, they should be in character format.

url1<-"https://raw.githubusercontent.com/taus01/EatingHabit/master/ehresp_2014.dat"

eh_resp<-read.delim(url1,header=TRUE,sep=",") ### the file is comma separated with a header row

eh_respdt<-as_tibble(eh_resp) ### converting the data frame to a tibble

head(eh_respdt)

## # A tibble: 6 × 37
##      TUCASEID TULINENO EEINCOME1 ERBMI ERHHCH ERINCOME ERSPEMCH ERTPREAT
##         <dbl>    <int>     <int> <dbl>  <int>    <int>    <int>    <int>
## 1 2.01401e+13        1        -2  33.2      1       -1       -1       30
## 2 2.01401e+13        1         1  22.7      3        1       -1       45
## 3 2.01401e+13        1         2  49.4      3        5       -1       60
## 4 2.01401e+13        1        -2  -1.0      3       -1       -1        0
## 5 2.01401e+13        1         2  31.0      3        5       -1       65
## 6 2.01401e+13        1         1  30.7      3        1        1       20
## # ... with 29 more variables: ERTSEAT <int>, ETHGT <int>, ETWGT <int>,
## #   EUDIETSODA <int>, EUDRINK <int>, EUEAT <int>, EUEXERCISE <int>,
## #   EUEXFREQ <int>, EUFASTFD <int>, EUFASTFDFRQ <int>, EUFFYDAY <int>,
## #   EUFDSIT <int>, EUFINLWGT <dbl>, EUSNAP <int>, EUGENHTH <int>,
## #   EUGROSHP <int>, EUHGT <int>, EUINCLVL <int>, EUINCOME2 <int>,
## #   EUMEAT <int>, EUMILK <int>, EUPRPMEL <int>, EUSODA <int>,
## #   EUSTORES <int>, EUSTREASON <int>, EUTHERM <int>, EUWGT <int>,
## #   EUWIC <int>, EXINCOME1 <int>

eh_respdt$TUCASEID<-as.character(eh_respdt$TUCASEID)
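An alternative is to read the ID as text from the start, so that it never passes through a numeric type. The sketch below is a minimal illustration on a made-up two-row sample; the name sample_txt and its values are hypothetical and not taken from the actual file. It assumes a comma-separated file with a header row.

```r
# Made-up sample standing in for the real .dat file (hypothetical values)
sample_txt <- "TUCASEID,TULINENO,ERBMI
20140101140007,1,33.2
20140101140011,1,22.7"

# colClasses forces TUCASEID to be read as character;
# the types of the remaining columns are still guessed by read.csv
resp_sample <- read.csv(text = sample_txt,
                        colClasses = c(TUCASEID = "character"))

str(resp_sample$TUCASEID)
```

The same colClasses argument could be passed to read.delim when reading from the URL, which would make the later as.character() step unnecessary.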

2. Some variables take values that are outside the range given in the data dictionary. I will treat these invalid values as a separate, undefined category while doing the analysis.

For example:

EUHGT: How tall are you without shoes? (in inches) Its value ranges from 56 to 77 inches.

EUHGT is bottom coded to 56 inches and top coded to 77 inches. All those with EUHGT < 56 inches have EUHGT = 56 inches. All those with EUHGT > 77 inches have EUHGT = 77 inches.

The distribution shows invalid values (the minimum is -3), so we need to cap the variable according to the data dictionary. Note that capping also moves the negative codes up to 56, which is why the mean rises from 65.63 to 66.46 after this step.

summary(eh_respdt$EUHGT) ## there are invalid values

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.00   63.00   66.00   65.63   70.00   77.00

eh_respdt_f<-eh_respdt%>%
  mutate(EUHGT=ifelse(EUHGT<56,56,ifelse(EUHGT>77,77,EUHGT)))

summary(eh_respdt_f$EUHGT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.00   63.00   66.00   66.46   70.00   77.00
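The capping step treats negative codes the same way as genuinely short heights. A possible alternative, sketched below in base R, first turns negative codes into NA and only then applies the lower and upper limits. It assumes that negative values are missing-data codes rather than measurements; the helper name clamp_valid and the toy vector are hypothetical.

```r
# Hypothetical helper: negative codes become NA, remaining values are
# clamped to the [lower, upper] range from the data dictionary
clamp_valid <- function(x, lower, upper) {
  x[x < 0] <- NA                  # assumption: negative = missing-data code
  pmin(pmax(x, lower), upper)     # NA values stay NA
}

# Toy heights in inches (made up for illustration)
toy_hgt <- c(-3, -1, 50, 63, 70, 80)
capped  <- clamp_valid(toy_hgt, lower = 56, upper = 77)

capped
summary(capped)   # summary() reports the NA count separately
```

With this approach the missing responses no longer pull the mean towards the lower limit, at the cost of having to handle NA in later steps.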

EUWGT: How much do you weigh without shoes? (in pounds) Its value ranges from 98 to 340 pounds.

EUWGT is bottom coded to 98 lbs and top coded to 340 lbs. All those with EUWGT < 98 lbs have EUWGT = 98 lbs. All those with EUWGT > 340 lbs have EUWGT = 340 lbs. A value of -5 is also accepted.

The distribution shows invalid values, so we need to cap the variable according to the data dictionary while keeping the accepted value of -5.

summary(eh_respdt_f$EUWGT) ## there are invalid values

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -5.0   140.0   168.0   168.2   200.0   340.0

eh_respdt_f<-eh_respdt_f%>%
  mutate(EUWGT=ifelse(EUWGT !=-5 & EUWGT<98,98,ifelse(EUWGT>340,340,EUWGT)))

summary(eh_respdt_f$EUWGT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -5.0   140.0   168.0   171.9   200.0   340.0

ERTSEAT: Total amount of time spent in secondary eating and drinking (in minutes). This variable has valid values between 0 and 1440 minutes. The summary shows a minimum of -3, so the variable does contain invalid negative values, which are set to 0 below.

summary(eh_respdt_f$ERTSEAT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.00    0.00    3.00   16.76   15.00  990.00

eh_respdt_f<-eh_respdt_f%>%
  mutate(ERTSEAT=ifelse(ERTSEAT<0,0,ifelse(ERTSEAT>1440,1440,ERTSEAT)))

summary(eh_respdt_f$ERTSEAT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    3.00   16.77   15.00  990.00

Final dataset on which we will be working:

DT::datatable(head(eh_respdt_f,1000))

4. Exploratory data Analysis

4.1 Distribution of variables of interest

The distribution of physical health of respondents

The physical health variable EUGENHTH has valid values 1, 2, 3, 4 and 5, ranging from 1 for “Excellent” to 5 for “Poor”.

Changing the levels of EUGENHTH to a more meaningful form and labeling invalid values as “NA”.

eh_respdt_f$EUGENHTH <- as.factor(eh_respdt_f$EUGENHTH) ## converting the data into factor format

levels(eh_respdt_f$EUGENHTH) <- c("NA","NA","NA","Excellent","Very Good","Good","Fair","Poor")
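The relabeling above keeps invalid codes as a level named “NA”, which is a text label rather than a missing value. If real missing values are preferred, one option is to list only the valid codes when building the factor. This is a sketch on a made-up vector (health_codes is hypothetical); it assumes codes 1 to 5 are the only valid ones.

```r
# Made-up health codes, including some negative (invalid) codes
health_codes <- c(1, 3, -2, 5, 2, -1, 4)

# Only codes listed in `levels` are kept; everything else becomes a real NA
health <- factor(health_codes,
                 levels = 1:5,
                 labels = c("Excellent", "Very Good", "Good", "Fair", "Poor"))

table(health, useNA = "ifany")   # counts per label plus the NA count
```

A factor built this way has five levels instead of six, so functions that count categories would no longer treat the invalid codes as a category of their own.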

Distribution of health of respondents

eh_respdt_f %>%
  filter(EUGENHTH %in% c("Excellent","Very Good","Good","Fair","Poor"))%>%
  group_by(EUGENHTH)%>%
  summarise(count=n())%>%
  ggplot(aes(x = EUGENHTH,y = count,fill=EUGENHTH))+
  geom_bar(stat = "identity")+
  xlab("Respondent's health")+
  ylab("Count of Respondent")+
  labs(fill="Respondent's health")

The distribution of total amount of time spent in primary eating and drinking (in minutes)

ERTPREAT: This variable has valid values between 0 and 1440 minutes. Its distribution shows no invalid values.

summary(eh_respdt_f$ERTPREAT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   30.00   60.00   65.68   90.00  508.00

eh_respdt_f%>%
  ggplot(aes(ERTPREAT))+
  geom_density(color="blue",fill="blue",lwd=1.2)+
  xlab("Respondent's time spent on primary eating ")

Distribution of people who exercised during the 7 days before the survey was conducted. As per the data dictionary, EUEXERCISE has two valid entries: 1 = “Yes” for people who exercised and 2 = “No” for people who did not.

unique(eh_respdt_f$EUEXERCISE) ## checking the unique values

## [1]  2  1 -3 -2 -1

eh_respdt_f$EUEXERCISE <- as.factor(eh_respdt_f$EUEXERCISE) ## converting the data into factor format

levels(eh_respdt_f$EUEXERCISE) <- c("NA","NA","NA","Yes","No")

Distribution of people who do exercise

eh_respdt_f %>%
  filter(EUEXERCISE %in% c("Yes","No"))%>%
  group_by(EUEXERCISE)%>%
  summarise(count=n())%>%
  ggplot(aes(x = EUEXERCISE,y = count,fill=EUEXERCISE))+
  geom_bar(stat = "identity")+
  xlab("Respondents who did exercise")+
  ylab("Count of Respondent")+
  labs(fill="Respondents who did exercise")

We can see that more respondents exercised in the past 7 days than did not.

Distribution of respondents' income based on the variable EEINCOME1.

EEINCOME1: Edited: Last month, was your total household income before taxes more or less than 185 percent of the poverty threshold (amount per month)?

unique(eh_respdt_f$EEINCOME1) ## checking the unique values

## [1] -2  1  2  3 -3 -1

eh_respdt_f$EEINCOME1 <- as.factor(eh_respdt_f$EEINCOME1) ## converting the data into factor format

levels(eh_respdt_f$EEINCOME1) <- c("NA","NA","NA","Income > 185 percent of poverty threshold ","Income < 185 percent of poverty threshold","Income = 185 percent of poverty threshold")

Distribution of household income of people

eh_respdt_f %>%
  filter(EEINCOME1 %in% c("Income > 185 percent of poverty threshold ","Income < 185 percent of poverty threshold","Income = 185 percent of poverty threshold"))%>%
  group_by(EEINCOME1)%>%
  summarise(count=n())%>%
  ggplot(aes(x= EEINCOME1,y = count,fill=EEINCOME1))+
  geom_bar(stat = "identity") +
  coord_flip()+
  xlab("Respondents household income")+
  ylab("Count of Respondent")+
  labs(fill="Respondents household income")

Distribution of height and weight

eh_respdt_f%>%
  ggplot(aes(EUHGT))+
  geom_density(color="blue",fill="blue",lwd=1.2)+
  xlab("Respondents height")

Distribution of weight

eh_respdt_f%>%
  ggplot(aes(EUWGT))+
  geom_density(color="blue",fill="blue",lwd=1.2)+
  xlab("Respondents weight")

4.2 Relationship between variates

Does exercising in the past 7 days have an effect on weight?

eh_respdt_f%>%
  filter(EUWGT != -5 & EUEXERCISE %in% c("Yes","No"))%>%
  ggplot(aes(x = EUEXERCISE,y = EUWGT,fill=EUEXERCISE))+
  geom_boxplot()+
  xlab("Exercise?")+
  ylab("Weight of Respondent")+
  labs(fill="Exercise?")

From the plot, we see that people who exercised in the past 7 days have marginally lower weight than people who did not. The effect of exercise is not clearly visible in this data, as the question asked of respondents is whether they exercised in the past 7 days. If the question had asked whether they exercise regularly, there might have been a more pronounced difference between the Yes and No categories.
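A box plot alone does not say whether a difference between two groups is statistically significant. One way to check is a two-sample test such as Welch's t-test. The sketch below runs on synthetic data, not on the survey; the data frame toy_ex, its group means, and its spread are made-up assumptions used only to show the call.

```r
set.seed(42)

# Synthetic weights in lbs for two groups (hypothetical means and spread)
toy_ex <- data.frame(
  exercise = rep(c("Yes", "No"), each = 100),
  weight   = c(rnorm(100, mean = 172, sd = 35),
               rnorm(100, mean = 178, sd = 35))
)

# Welch two-sample t-test (the default of t.test: unequal variances allowed)
res_ex <- t.test(weight ~ exercise, data = toy_ex)

res_ex$estimate   # group means
res_ex$p.value    # p-value for the difference in means
```

On the real data, the same call would need the rows with EUWGT equal to -5 and the “NA” exercise level removed first, as in the plot code above.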

Do people who have higher weight spend more time in secondary eating?

eh_respdt_f%>%
  filter(EUWGT != -5)%>%
  ggplot(aes(x = EUWGT,y = ERTSEAT))+
  geom_smooth()+
  xlab("Weight of Respondent")+
  ylab("Time in minutes spent on secondary eating")

The graph suggests a positive relationship: the more time respondents spend eating while doing other activities, the higher their weight tends to be. This might be because people prefer eating junk food while doing activities like office work or playing games.
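A smoothed curve shows the shape of a relationship but not its strength. A rank correlation such as Spearman's rho is one way to put a number on it. The sketch below uses synthetic data in which a positive link is built in by construction, so its result says nothing about the survey; the names weight_toy and sec_eat_toy and the 0.3 slope are hypothetical.

```r
set.seed(1)
n <- 200

# Synthetic weights (lbs) and secondary eating times (minutes)
weight_toy  <- round(runif(n, min = 98, max = 340))
sec_eat_toy <- pmax(0, round(0.3 * weight_toy + rnorm(n, sd = 20)))

# Spearman rank correlation; exact = FALSE because the rounded data has ties
rho_test <- cor.test(weight_toy, sec_eat_toy,
                     method = "spearman", exact = FALSE)

rho_test$estimate   # rho between -1 and 1
rho_test$p.value
```

Spearman's rho is used here rather than Pearson's correlation because secondary eating time is heavily skewed, with many zeros and a few very large values.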

Do people who have high weight report poor health?

eh_respdt_f%>%
  filter(EUWGT != -5 & EUGENHTH %in% c("Excellent","Very Good","Good","Fair","Poor"))%>%
  ggplot(aes(x = EUGENHTH,y = EUWGT,fill=EUGENHTH))+
  geom_boxplot()+
  xlab("Health of Respondent")+
  ylab("Weight of Respondent")+
  labs(fill="Health of Respondent")

The box plot shows that respondents who report poor health tend to have higher weight, while those who report excellent health tend to have lower weight.

Do people who are healthy have higher income?

GKtau(eh_respdt_f$EUGENHTH, eh_respdt_f$EEINCOME1)

##                  xName                 yName Nx Ny tauxy tauyx
## 1 eh_respdt_f$EUGENHTH eh_respdt_f$EEINCOME1  6  4 0.072 0.018

I have used Goodman and Kruskal’s tau to measure the two-way association between physical health and income. One might expect healthier people to have higher household income. However, the tauxy and tauyx statistics from the GoodmanKruskal package are both small: physical health explains about 7% of the variability in household income (tauxy = 0.072), and household income explains about 2% of the variability in physical health (tauyx = 0.018). Note that the “NA” level is counted as a category in both variables (Nx = 6, Ny = 4). Hence, there is no meaningful association between household income and health of respondent.
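To make the meaning of tauxy and tauyx concrete, the sketch below computes Goodman and Kruskal's tau from its textbook definition in base R: the share of variability in y (measured as 1 minus the sum of squared category proportions) that is removed once x is known. The function name gk_tau and the toy vectors are hypothetical, and this is an illustration of the definition, not the GoodmanKruskal package's own code.

```r
# Goodman-Kruskal tau for predicting y from x (hypothetical helper)
gk_tau <- function(x, y) {
  p  <- prop.table(table(x, y))   # joint proportions p_ij
  px <- rowSums(p)                # marginal proportions of x
  py <- colSums(p)                # marginal proportions of y
  v_y <- 1 - sum(py^2)            # variability of y on its own
  # p^2 / px divides each row i by px[i] (column-major recycling)
  ev_y_given_x <- 1 - sum(p^2 / px)
  (v_y - ev_y_given_x) / v_y
}

# Toy data: x determines y exactly, but y only narrows x down to two values
x_toy <- c(1, 2, 3, 4)
y_toy <- c("u", "u", "v", "v")

tau_xy <- gk_tau(x_toy, y_toy)   # how well x predicts y
tau_yx <- gk_tau(y_toy, x_toy)   # how well y predicts x
c(tau_xy = tau_xy, tau_yx = tau_yx)
```

The two directions differ in the toy data, which is why the measure is reported as a pair. Values near 0 in both directions, as in the output above, mean that neither variable helps much in predicting the other.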

Do people who do exercise have higher household income?

GKtau(eh_respdt_f$EUEXERCISE, eh_respdt_f$EEINCOME1)

##                    xName                 yName Nx Ny tauxy tauyx
## 1 eh_respdt_f$EUEXERCISE eh_respdt_f$EEINCOME1  3  4 0.025 0.027

As in the analysis above, one might expect people who exercise to have higher household income. However, the Goodman-Kruskal tau statistics obtained (0.025 and 0.027) show no meaningful relationship between exercising and household income level. This may be because EUEXERCISE only records whether the respondent exercised in the last 7 days. If it recorded whether the respondent exercises regularly, we might have seen a clearer difference.

Do respondents who are tall have higher household income?

eh_respdt_f%>%
  filter(EEINCOME1 %in% c("Income > 185 percent of poverty threshold ","Income < 185 percent of poverty threshold","Income = 185 percent of poverty threshold"))%>%
  ggplot(aes(x = EEINCOME1,y = EUHGT,fill=EEINCOME1))+
  geom_boxplot()+
  coord_flip()+
  xlab("Household income of Respondent")+
  ylab("Height of Respondent")+
  labs(fill="Household income of Respondent")

From the graph, it can be seen that tall people have slightly higher income than short people. However, the difference does not seem to be significant, so we can’t conclude that a relationship exists between income and height.
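With three income groups, one way to check whether the heights differ is to compare group medians and run a Kruskal-Wallis rank test. The sketch below uses synthetic data in which height is generated independently of income, so it only demonstrates the calls; the data frame toy_inc, the shortened group labels, and the group sizes are hypothetical.

```r
set.seed(7)

# Synthetic heights (inches) for three income groups of unequal size
toy_inc <- data.frame(
  income = rep(c("> 185%", "< 185%", "= 185%"), times = c(120, 60, 20)),
  height = round(rnorm(200, mean = 66, sd = 4))
)

# Median height per income group
medians <- aggregate(height ~ income, data = toy_inc, FUN = median)
medians

# Kruskal-Wallis test: do the groups share the same height distribution?
kw <- kruskal.test(height ~ income, data = toy_inc)
kw$p.value
```

A rank-based test is chosen here because EUHGT is bottom and top coded, which makes its distribution less suitable for tests that assume normality.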

5. Summary

Problem addressed: Exploring the Eating and Health Module datasets and analyzing the interaction of parameters such as health, exercise, income, and weight.

Approach for problem solving: I used the respondent data from the Eating and Health Module datasets to examine the relationships between the pre-stated variables. I first cleaned the data and replaced invalid values in the variables with the proper limits. I also looked at the other datasets in the Eating and Health Module for more information, but did not find anything that would aid my analysis. Next, I looked at the distribution of the variables of interest to get an idea of how they behave. Finally, with the help of graphs and a statistical measure of association, I assessed the degree of relationship between different variables.

Conclusion of Analysis:

The effect of exercise on weight is not visible in this data, as the question asked of respondents only concerns whether they exercised in the past 7 days. If the question had asked whether they exercise regularly, we might have seen some relationship.
The more time respondents spend eating while doing other activities, the higher their weight tends to be. This is supported by the relationship between secondary eating and weight.
Respondents who reported poor health have higher weight, and respondents who reported excellent health have lower weight on average.
There is no meaningful association between household income and health of respondent.
There is no meaningful relationship between exercising in the last 7 days and household income level of respondent. This may be because EUEXERCISE only represents whether the respondent exercised in the last 7 days. If the question had asked whether they exercise regularly, we might have seen some relationship.
The difference in height between income groups does not seem to be significant.