Purpose of analysis and context of dataset: I explore the Eating and Health Module datasets and analyze how parameters such as health, exercise, income, and weight interact. USDA’s Economic Research Service collects this data along with other cosponsors. More about the data can be found on the data preparation tab.
Motivation: It would be interesting to see the relationships between respondents’ health and household income, between secondary eating and respondents’ weight, between exercise and household income, and so on.
Summary: The analysis describes the nature of the relationship between respondents’ weight and health, and between secondary eating and weight. It did not make clear whether a relationship exists between height and household income, between household income and respondents’ health, or between weight and exercise. More on the summary of the analysis can be found on the summary tab.
library(tibble) ## to convert data into a tibble
library(Hmisc) ## to provide data summaries
library(tidyverse) ## set of packages including dplyr and ggplot2
library(dplyr) ## set of functions for manipulating data
library(DT) ## functions to display data sets
library(GoodmanKruskal) ## functions to measure association between categorical variables
The Eating and Health Module data covers American Time Use Survey (ATUS) respondents’ primary eating and secondary eating (eating while doing another activity); soft drink consumption; grocery shopping preferences and fast food purchases; meal preparation and food safety practices; food assistance participation; general health, height and weight, and exercise; and income. The data is collected in different years; I focus on the data captured for 2014.
More about the data can be found at
https://www.ers.usda.gov/data-products/eating-and-health-module-atus/
Actual data source: http://www.bls.gov/tus/special.requests/ehresp_2014.zip
Data Source for Analysis:
https://raw.githubusercontent.com/taus01/EatingHabit/master/ehresp_2014.dat
Description:
EH Respondent dataset: Contains information about EH respondents, including general health and body mass index. There are 11212 observations (respondents) and 37 variables: 34 integer variables and 3 numeric variables.
The complete data dictionary can be found at: http://www.bls.gov/tus/ehmintcodebk1416.pdf
Data cleaning steps:
The data has the following problems for my analysis:
1. The data read from the .dat file has TUCASEID in numeric format in the respondent dataset. Because these are 14-digit respondent IDs, they should be in character format.
url1<-"https://raw.githubusercontent.com/taus01/EatingHabit/master/ehresp_2014.dat"
eh_resp<-read.delim(url1,header=TRUE,sep=",")
eh_respdt<-as_tibble(eh_resp) ### converting dataframe as tibble
head(eh_respdt)
## # A tibble: 6 × 37
## TUCASEID TULINENO EEINCOME1 ERBMI ERHHCH ERINCOME ERSPEMCH ERTPREAT
## <dbl> <int> <int> <dbl> <int> <int> <int> <int>
## 1 2.01401e+13 1 -2 33.2 1 -1 -1 30
## 2 2.01401e+13 1 1 22.7 3 1 -1 45
## 3 2.01401e+13 1 2 49.4 3 5 -1 60
## 4 2.01401e+13 1 -2 -1.0 3 -1 -1 0
## 5 2.01401e+13 1 2 31.0 3 5 -1 65
## 6 2.01401e+13 1 1 30.7 3 1 1 20
## # ... with 29 more variables: ERTSEAT <int>, ETHGT <int>, ETWGT <int>,
## # EUDIETSODA <int>, EUDRINK <int>, EUEAT <int>, EUEXERCISE <int>,
## # EUEXFREQ <int>, EUFASTFD <int>, EUFASTFDFRQ <int>, EUFFYDAY <int>,
## # EUFDSIT <int>, EUFINLWGT <dbl>, EUSNAP <int>, EUGENHTH <int>,
## # EUGROSHP <int>, EUHGT <int>, EUINCLVL <int>, EUINCOME2 <int>,
## # EUMEAT <int>, EUMILK <int>, EUPRPMEL <int>, EUSODA <int>,
## # EUSTORES <int>, EUSTREASON <int>, EUTHERM <int>, EUWGT <int>,
## # EUWIC <int>, EXINCOME1 <int>
eh_respdt$TUCASEID<-as.character(eh_respdt$TUCASEID)
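As a possible alternative to converting TUCASEID after the fact, the column type can be declared at read time, so the ID never passes through a numeric representation. The sketch below is only an illustration: it uses a hypothetical two-row sample in place of ehresp_2014.dat, and the ID values in it are made up.

```r
# Hypothetical two-row sample standing in for ehresp_2014.dat (IDs are invented)
sample_csv <- "TUCASEID,TULINENO,ERBMI
20140101140007,1,33.2
20140101140011,1,22.7"

# colClasses with a named vector sets the type of TUCASEID only;
# the remaining columns are still guessed by read.csv
resp <- read.csv(text = sample_csv, colClasses = c(TUCASEID = "character"))
str(resp)
```

With the real file, the same `colClasses` argument could be passed to the `read.delim` call above.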
2. Some variables contain values outside the ranges given in the data dictionary. For example:
EUHGT: How tall are you without shoes? (in inches) Its value ranges from 56 to 77 inches.
EUHGT is bottom coded to 56 inches and top coded to 77 inches. All those with EUHGT < 56 inches have EUHGT = 56 inches. All those with EUHGT > 77 inches have EUHGT = 77 inches.
The distribution shows invalid (negative) values, so we cap the variable according to the data dictionary. Note that the code below also sets the negative values to 56.
summary(eh_respdt$EUHGT) ## there are invalid values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.00 63.00 66.00 65.63 70.00 77.00
eh_respdt_f<-eh_respdt%>%
mutate(EUHGT=ifelse(EUHGT<56,56,ifelse(EUHGT>77,77,EUHGT)))
summary(eh_respdt_f$EUHGT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.00 63.00 66.00 66.46 70.00 77.00
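The capping above moves negative values up to 56, which raises the mean from 65.63 to 66.46. If the negative values are non-response codes rather than real heights (an assumption here, not something verified against the data dictionary), one option is to set them to NA before clamping. A minimal sketch, where `clamp_coded` is a hypothetical helper and the input vector is made up:

```r
# Hypothetical helper: treat negative codes as missing, then clamp to [lower, upper]
clamp_coded <- function(x, lower, upper) {
  x <- ifelse(x < 0, NA_real_, x)  # assumption: negative values mean "no valid answer"
  pmin(pmax(x, lower), upper)      # NA values stay NA
}

heights <- c(-3, 50, 63, 66, 80)   # made-up values for illustration
clamped <- clamp_coded(heights, 56, 77)
clamped
```

Summaries such as `mean(clamped, na.rm = TRUE)` would then ignore the missing answers instead of counting them as 56 inches.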
EUWGT: How much do you weigh without shoes? (in pounds) Its value ranges from 98 to 340 pounds.
EUWGT is bottom coded to 98 lbs and top coded to 340 lbs. All those with EUWGT < 98 lbs have EUWGT = 98 lbs. All those with EUWGT > 340 lbs have EUWGT = 340 lbs. A value of -5 is also accepted.
The distribution shows invalid values, so we cap the variable according to the data dictionary.
summary(eh_respdt_f$EUWGT) ## there are invalid values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.0 140.0 168.0 168.2 200.0 340.0
eh_respdt_f<-eh_respdt_f%>%
mutate(EUWGT=ifelse(EUWGT !=-5 & EUWGT<98,98,ifelse(EUWGT>340,340,EUWGT)))
summary(eh_respdt_f$EUWGT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.0 140.0 168.0 171.9 200.0 340.0
ERTSEAT: Total amount of time spent in secondary eating and drinking (in minutes). This variable has valid values between 0 and 1440 minutes. The distribution shows some negative values, which we set to 0.
summary(eh_respdt_f$ERTSEAT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.00 0.00 3.00 16.76 15.00 990.00
eh_respdt_f<-eh_respdt_f%>%
mutate(ERTSEAT=ifelse(ERTSEAT<0,0,ifelse(ERTSEAT>1440,1440,ERTSEAT)))
summary(eh_respdt_f$ERTSEAT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 3.00 16.77 15.00 990.00
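Rather than inspecting each summary separately, the negative values can be counted for several variables at once. The sketch below uses a small made-up data frame with the same column names; it is only meant to show the idea.

```r
# Made-up rows; column names follow the respondent dataset
toy <- data.frame(
  EUHGT   = c(-3, 63, 70),
  EUWGT   = c(-5, 140, -2),
  ERTSEAT = c(0, 15, -3)
)

# Number of negative values in each column
neg_counts <- sapply(toy, function(col) sum(col < 0, na.rm = TRUE))
neg_counts
```

Applied to the full tibble, a count like this would show which variables need cleaning before any capping is done.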
Final dataset on which we will be working:
DT::datatable(head(eh_respdt_f,1000))
The distribution of physical health of respondents
Physical health (EUGENHTH) has valid values 1, 2, 3, 4 and 5, ranging from 1 for “Excellent” to 5 for “Poor”.
We change the levels of EUGENHTH to a more meaningful form and assign the label "NA" to invalid values.
eh_respdt_f$EUGENHTH <- as.factor(eh_respdt_f$EUGENHTH) ## converting the data into factor format
levels(eh_respdt_f$EUGENHTH) <- c("NA","NA","NA","Excellent","Very Good","Good","Fair","Poor")
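The relabelling above relies on the sorted order of the existing levels, and it creates a level literally named "NA". A possible alternative, sketched here on a made-up vector of codes, is to state the valid codes and their labels explicitly so that every other code becomes a true missing value:

```r
# Made-up codes for illustration; -2 and -1 stand for invalid answers
health_codes <- c(1, 3, -2, 5, 2, -1, 4)

# Codes not listed in `levels` are turned into NA automatically
health <- factor(
  health_codes,
  levels = 1:5,
  labels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
)
table(health, useNA = "ifany")
```

With real NA values the later `filter(... %in% ...)` steps would still work, and functions with an `na.rm` or `useNA` argument could handle the missing answers directly.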
Distribution of health of respondents
eh_respdt_f %>%
filter(EUGENHTH %in% c("Excellent","Very Good","Good","Fair","Poor"))%>%
group_by(EUGENHTH)%>%
summarise(count=n())%>%
ggplot(aes(x = EUGENHTH,y = count,fill=EUGENHTH))+
geom_bar(stat = "identity")+
xlab("Respondent's health")+
ylab("Count of Respondent")+
labs(fill="Respondent's health")
The distribution of total amount of time spent in primary eating and drinking (in minutes)
ERTPREAT: This variable has valid values between 0 and 1440 minutes. The distribution shows no invalid values.
summary(eh_respdt_f$ERTPREAT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 30.00 60.00 65.68 90.00 508.00
eh_respdt_f%>%
ggplot(aes(ERTPREAT))+
geom_density(color="blue",fill="blue",lwd=1.2)+
xlab("Respondent's time spent on primary eating ")
Distribution of respondents who exercised during the 7 days before the survey
According to the data dictionary, EUEXERCISE has 2 valid entries: 1 = “Yes” for respondents who exercised and 2 = “No” for respondents who did not.
unique(eh_respdt_f$EUEXERCISE) ## checking the unique values
## [1] 2 1 -3 -2 -1
eh_respdt_f$EUEXERCISE <- as.factor(eh_respdt_f$EUEXERCISE) ## converting the data into factor format
levels(eh_respdt_f$EUEXERCISE) <- c("NA","NA","NA","Yes","No")
Distribution of people who do exercise
eh_respdt_f %>%
filter(EUEXERCISE %in% c("Yes","No"))%>%
group_by(EUEXERCISE)%>%
summarise(count=n())%>%
ggplot(aes(x = EUEXERCISE,y = count,fill=EUEXERCISE))+
geom_bar(stat = "identity")+
xlab("Respondents who did exercise")+
ylab("Count of Respondent")+
labs(fill="Respondents who did exercise")
We can see that more respondents exercised in the past 7 days than did not.
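The comparison read off the bar chart can also be expressed as shares. The sketch below uses a short made-up vector of answers, not the survey data, just to show the calculation:

```r
# Made-up answers for illustration
exercise <- c("Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes")

# Share of each answer among the valid responses
shares <- prop.table(table(exercise))
round(shares, 3)
```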
Distribution of respondents’ income based on the variable EEINCOME1.
EEINCOME1: Edited: Last month, was your total household income before taxes more or less than 185 percent of the poverty threshold (amount per month)?
unique(eh_respdt$EEINCOME1) ## checking the unique values
## [1] -2 1 2 3 -3 -1
eh_respdt_f$EEINCOME1 <- as.factor(eh_respdt_f$EEINCOME1) ## converting the data into factor format
levels(eh_respdt_f$EEINCOME1) <- c("NA","NA","NA","Income > 185 percent of poverty threshold ","Income < 185 percent of poverty threshold","Income = 185 percent of poverty threshold")
Distribution of household income of people
eh_respdt_f %>%
filter(EEINCOME1 %in% c("Income > 185 percent of poverty threshold ","Income < 185 percent of poverty threshold","Income = 185 percent of poverty threshold"))%>%
group_by(EEINCOME1)%>%
summarise(count=n())%>%
ggplot(aes(x= EEINCOME1,y = count,fill=EEINCOME1))+
geom_bar(stat = "identity") +
coord_flip()+
xlab("Respondents household income")+
ylab("Count of Respondent")+
labs(fill="Respondents household income")
Distribution of height and weight
eh_respdt_f%>%
ggplot(aes(EUHGT))+
geom_density(color="blue",fill="blue",lwd=1.2)+
xlab("Respondents height")
Distribution of weight
eh_respdt_f%>%
ggplot(aes(EUWGT))+
geom_density(color="blue",fill="blue",lwd=1.2)+
xlab("Respondents weight")
Does exercising in the past 7 days have an effect on weight?
eh_respdt_f%>%
filter(EUWGT !=-5&EUEXERCISE %in% c("Yes","No"))%>%
ggplot(aes(x = EUEXERCISE,y = EUWGT,fill=EUEXERCISE))+ geom_boxplot()+
xlab("Exercise?")+
ylab("Weight of Respondent")+
labs(fill="Exercise?")
The box plot shows that respondents who exercised in the past 7 days have a marginally lower weight than those who did not. The effect of exercise is not clearly visible in this data, possibly because the question only asks whether respondents exercised in the past 7 days. If the question had asked whether they exercise regularly, we might have seen a larger difference between the yes and no categories.
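A box plot alone does not say whether a difference between two groups is larger than chance would produce. One way to check is a two-sample test. The sketch below runs a Welch t-test on synthetic data; the numbers are generated for illustration and are not taken from the survey, so the result says nothing about the real respondents.

```r
set.seed(1)

# Synthetic stand-in data; these are NOT values from ehresp_2014.dat
toy <- data.frame(
  EUEXERCISE = factor(rep(c("Yes", "No"), each = 50)),
  EUWGT = c(rnorm(50, mean = 170, sd = 35), rnorm(50, mean = 175, sd = 35))
)

# t.test() with a formula runs a Welch two-sample t-test by default
res <- t.test(EUWGT ~ EUEXERCISE, data = toy)
res$estimate  # group means
res$p.value   # a small p-value would point to a real difference in means
```

On the cleaned data the same call would need the -5 weights and the "NA" exercise level filtered out first, as in the plot above.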
Do people who have higher weight spend more time in secondary eating?
eh_respdt_f%>%
filter(EUWGT !=-5)%>%
ggplot(aes(x = EUWGT,y = ERTSEAT))+geom_smooth()+
xlab("Weight of Respondent")+
ylab("Time in minutes spent on secondary eating")
The relationship is clear from the graph: the more time respondents spend eating while doing other activities, the higher their weight tends to be. This might be because people prefer eating junk food while doing activities like office work or playing games.
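The smoothed curve can be complemented with a rank correlation, which measures whether two numeric variables tend to rise together without assuming a straight-line relationship. The sketch below uses synthetic, independently generated values, so it only demonstrates the call and not the survey result.

```r
set.seed(2)

# Synthetic stand-in values; NOT taken from the survey
weight     <- runif(200, min = 98, max = 340)  # pounds
sec_eating <- rexp(200, rate = 1 / 17)         # minutes

# Spearman's rank correlation with its test
ct <- cor.test(weight, sec_eating, method = "spearman")
ct$estimate  # rho, between -1 and 1
ct$p.value
```

With the real ERTSEAT values, many respondents share the same number of minutes, so R would warn that an exact p-value cannot be computed with ties; the estimate of rho is still reported.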
Do people who have high weight report poor health?
eh_respdt_f%>%
filter(EUWGT !=-5&EUGENHTH %in% c("Excellent","Very Good","Good","Fair","Poor"))%>%
ggplot(aes(x = EUGENHTH,y = EUWGT,fill=EUGENHTH))+ geom_boxplot()+
xlab("Health of Respondent")+
ylab("Weight of Respondent")+
labs(fill="Health of Respondent")
The box plot shows that respondents with higher weight tend to report poorer health.
Do people who are healthy have higher income?
GKtau(eh_respdt_f$EUGENHTH, eh_respdt_f$EEINCOME1)
## xName yName Nx Ny tauxy tauyx
## 1 eh_respdt_f$EUGENHTH eh_respdt_f$EEINCOME1 6 4 0.072 0.018
I used Goodman and Kruskal’s tau to measure the two-way association between physical health and income. One might expect healthy people to have a higher household income. However, the tauxy and tauyx statistics from the GoodmanKruskal package are both close to zero, which suggests that physical health explains very little of the variability in respondents’ household income, and vice versa. Hence, there does not appear to be any association between household income and the health of the respondent.
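To make the statistic less of a black box, here is a minimal base R sketch of the textbook definition of Goodman and Kruskal’s tau: the proportional reduction in the variability of y once x is known. The function name `gk_tau` and the toy vectors are hypothetical, and this is not the GoodmanKruskal package’s own code.

```r
# Hypothetical helper: tau for predicting y from x, from the joint proportions
gk_tau <- function(x, y) {
  p     <- prop.table(table(x, y))  # joint proportions, rows = x, columns = y
  row_p <- rowSums(p)               # marginal distribution of x
  col_p <- colSums(p)               # marginal distribution of y
  v_y         <- 1 - sum(col_p^2)      # variability of y on its own
  v_y_given_x <- 1 - sum(p^2 / row_p)  # expected variability of y once x is known
  (v_y - v_y_given_x) / v_y
}

# Toy data: x determines y exactly, but y does not determine x
x <- c("a", "b", "c", "c")
y <- c("u", "u", "v", "v")
gk_tau(x, y)  # 1: knowing x removes all uncertainty about y
gk_tau(y, x)  # 0.6: knowing y removes only part of the uncertainty about x
```

The two directions differ, which is why the output above reports both tauxy and tauyx. The sketch assumes every category of x occurs at least once; an unused factor level would produce a division by zero.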
Do people who do exercise have higher household income?
GKtau(eh_respdt_f$EUEXERCISE, eh_respdt_f$EEINCOME1)
## xName yName Nx Ny tauxy tauyx
## 1 eh_respdt_f$EUEXERCISE eh_respdt_f$EEINCOME1 3 4 0.025 0.027
As in the analysis above, one might expect people who exercise to have a higher household income. However, the Goodman and Kruskal tau statistics show no relationship between exercising and household income level. This may be because EUEXERCISE only records whether the respondent exercised in the last 7 days. If it recorded whether the respondent exercises regularly, we might have seen a difference.
Do respondents who are tall have higher household income?
eh_respdt_f%>%
filter(EEINCOME1 %in% c("Income > 185 percent of poverty threshold ","Income < 185 percent of poverty threshold","Income = 185 percent of poverty threshold"))%>%
ggplot(aes(x = EEINCOME1,y = EUHGT,fill=EEINCOME1))+ geom_boxplot() +
coord_flip()+
xlab("Household income of Respondent")+
ylab("Height of Respondent")+
labs(fill="Household income of Respondent")
From the graph, it can be seen that tall people have slightly higher income than short people. However, the difference does not seem to be significant, so we can’t conclude that a relationship exists between income and height.
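Whether a difference across three income groups is significant can be checked with a rank-based test such as Kruskal-Wallis, which does not assume normally distributed heights. The sketch below uses synthetic heights and hypothetical group labels, so its result carries no information about the survey.

```r
set.seed(3)

# Synthetic stand-in data with hypothetical group labels; NOT survey values
toy <- data.frame(
  EEINCOME1 = factor(rep(c("above", "below", "equal"), times = c(60, 30, 10))),
  EUHGT     = round(rnorm(100, mean = 66, sd = 4))
)

# Kruskal-Wallis rank sum test of height across the income groups
kw <- kruskal.test(EUHGT ~ EEINCOME1, data = toy)
kw$statistic
kw$p.value
```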
Problem addressed: Exploring the Eating and Health Module datasets and analyzing how parameters such as health, exercise, income, and weight interact.
Approach for problem solving: I used the respondent data from the Eating and Health Module datasets to examine the relationships between the previously stated variables. I first cleaned the data and replaced the invalid values in the variables with the proper limits. I also looked at the other datasets in the Eating and Health Module for more information, but did not find anything that would aid my analysis. Next, I looked at the distributions of the variables of interest to get an idea of how they behave. Finally, with the help of graphs and a statistical measure of association, I assessed the degree of relationship between the variables.
Conclusion of Analysis:
The effect of exercise on weight is not visible in this data, as the question only asks respondents whether they exercised in the past 7 days. If the question had asked whether they exercise regularly, we might have seen some relationship.
The more time respondents spend eating while doing other activities, the higher their weight tends to be. This is supported by the relationship between secondary eating and weight.
Respondents who reported poor health have high weight and respondents who have excellent health have lower weight on average.
There does not exist any association between household income and health of respondent.
There is no relationship between exercising in the last 7 days and the household income level of the respondent. This may be because EUEXERCISE only records whether the respondent exercised in the last 7 days. If the question had asked whether respondents exercise regularly, we might have seen some relationship.
The difference in height across income groups does not seem to be significant.