Synopsis

For the Final Project I looked into the “Eating and Health Module” dataset. The data comes from the ATUS Eating & Health (EH) Module, which was fielded from 2006 to 2008 and again from 2014 to 2016. The EH Module data files contain information related to eating, meal preparation, exercise and health for more than 10,000 households. I wanted to look into a few questions that could be answered using this dataset:

1.What are the factors that influence BMI in a person?

2.How are income and grocery shopping trends connected?

3.What factors seem to influence the health of this population sample?

4.Is there any noticeable relationship between exercise and food consumption?

To answer these questions I start by plotting the correlation matrix between the variables and reporting the correlation values. Then I explore the most significant relationships one at a time and derive insights from them.

Finally, some conclusions are reached: BMI is quite strongly correlated with weight; weight in turn depends on how frequently fast food is eaten and, to some extent, on exercise frequency; irrespective of people's income, the same two store types are the most popular places to shop; and general health is found to be better in people with lower BMI and higher income.

Packages

The following packages are required:

library(tidyverse) # For the filtering, selecting, piping and ggplot commands used in the markdown file
library(corrplot) # To plot the correlation matrix
library(data.table) # To print summary tables with data.table()
library(DT) # To render interactive tables with datatable()

Data Preparation

The dataset in question was obtained from: https://www.kaggle.com/bls/eating-health-module-dataset

The module here consists of 3 datasets from 2014: ehact_2014.csv, ehresp_2014.csv and ehwgts_2014.csv. I am working with only the first and second datasets.

The EH Respondent file (ehresp_2014.csv) contains information about EH respondents, including general health and body mass index. This file contains 37 variables, but here I explain only those that will be used in this project:

TUCASEID: Identifies each household
EUGENHTH: In general, was the subject’s physical health excellent, very good, good, fair, or poor? Coded as 1, 2, 3, 4, 5 respectively
EUHGT: Height (in inches)
EUWGT: Weight (in pounds)
ERBMI: Body mass index
EUSTORES: Source of the groceries: Grocery store, Supercenter, Warehouse club, Drugstore or convenience store, Some other place. Coded as 1, 2, 3, 4, 5 respectively
EUSTREASON: Reason for shopping at the place: Price, Location, Quality of products, Variety of products, Customer service, Other. Coded as 1, 2, 3, 4, 5, 6 respectively
EUFASTFD: Over the last seven days, was any prepared food from a deli, carryout, delivery food, or fast food purchased? Coded: 1-Yes, 2-No
EUFASTFDFRQ: Number of times in the last seven days that fast food was purchased
EUMEAT: In the last 7 days, were any meals prepared with meat, poultry, or seafood?
EUMILK: In the last 7 days, was unpasteurized or raw milk drunk or served?
EUEXERCISE: During the past 7 days, was any physical exercise for fitness and health, such as running, bicycling, working out in a gym or walking for exercise, performed? Coded: 1-Yes, 2-No
EUEXFREQ: How many times over the past 7 days was any exercise activity performed?
ERINCOME: Relationship between income and poverty threshold (as a percentage of the threshold). Coded: 1 = “Income > 185%”, 2 = “Income = 185%”, 3 = “130% < Income < 185%”, 4 = “Income = 130%”, 5 = “Income < 130%”
ERTPREAT: Total amount of time spent in primary eating and drinking (in minutes)
ERTSEAT: Total amount of time spent in secondary eating (in minutes)

The complete explanation can be found in the codebook below: http://www.bls.gov/tus/ehmintcodebk1416.pdf
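The integer codings described above can be turned into readable labels before plotting or tabulating. The following is only a minimal sketch in base R, using a small made-up vector (`codes`, hypothetical, not taken from the dataset) for EUGENHTH; negative codes fall outside the declared levels and therefore become NA.

```r
# Hypothetical EUGENHTH values, including one negative missing-value code
codes <- c(1, 2, 5, -1, 4, 3)

# Map codes 1..5 to their labels; anything else (e.g. -1) becomes NA
health <- factor(
  codes,
  levels = 1:5,
  labels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
)

print(table(health, useNA = "ifany"))
```

The same pattern would apply to the other coded variables (EUSTORES, EUSTREASON, ERINCOME) with their own label vectors.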

Importing the data:

url1<-"https://raw.githubusercontent.com/RKK101/ProjectR/master/ehresp_2014.csv"
Resp_data<-as_tibble(read.csv(url1))
#Copy of the raw data, used below to count missing values (avoids downloading the file twice)
Natable<-Resp_data

datatable(head(Resp_data))

A look at the raw data obtained:

#Checking the dimensions of the datasets

dim(Resp_data)

## [1] 11212    37

#Checking the names of the variables

names(Resp_data)

##  [1] "tucaseid"    "tulineno"    "eeincome1"   "erbmi"       "erhhch"     
##  [6] "erincome"    "erspemch"    "ertpreat"    "ertseat"     "ethgt"      
## [11] "etwgt"       "eudietsoda"  "eudrink"     "eueat"       "euexercise" 
## [16] "euexfreq"    "eufastfd"    "eufastfdfrq" "euffyday"    "eufdsit"    
## [21] "eufinlwgt"   "eusnap"      "eugenhth"    "eugroshp"    "euhgt"      
## [26] "euinclvl"    "euincome2"   "eumeat"      "eumilk"      "euprpmel"   
## [31] "eusoda"      "eustores"    "eustreason"  "eutherm"     "euwgt"      
## [36] "euwic"       "exincome1"

#Checking the structure of the data

str(Resp_data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    11212 obs. of  37 variables:
##  $ tucaseid   : num  2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
##  $ tulineno   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ eeincome1  : int  -2 1 2 -2 2 1 1 1 1 1 ...
##  $ erbmi      : num  33.2 22.7 49.4 -1 31 ...
##  $ erhhch     : int  1 3 3 3 3 3 1 3 3 3 ...
##  $ erincome   : int  -1 1 5 -1 5 1 1 1 1 1 ...
##  $ erspemch   : int  -1 -1 -1 -1 -1 1 5 -1 -1 5 ...
##  $ ertpreat   : int  30 45 60 0 65 20 30 30 117 80 ...
##  $ ertseat    : int  2 14 0 0 0 10 5 5 10 0 ...
##  $ ethgt      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ etwgt      : int  0 0 0 -1 0 0 0 0 0 0 ...
##  $ eudietsoda : int  -1 -1 -1 2 -1 1 -1 -1 -1 2 ...
##  $ eudrink    : int  2 2 1 1 1 1 1 2 2 1 ...
##  $ eueat      : int  1 1 2 2 2 1 1 1 1 2 ...
##  $ euexercise : int  2 2 2 2 1 1 2 1 1 2 ...
##  $ euexfreq   : int  -1 -1 -1 -1 5 2 -1 3 6 -1 ...
##  $ eufastfd   : int  2 1 2 2 2 1 1 1 2 1 ...
##  $ eufastfdfrq: int  -1 1 -1 -1 -1 3 3 1 -1 2 ...
##  $ euffyday   : int  -1 2 -1 -1 -1 1 2 2 -1 1 ...
##  $ eufdsit    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ eufinlwgt  : num  5202086 29400000 26000000 2728880 17500000 ...
##  $ eusnap     : int  1 2 2 2 1 2 2 2 2 2 ...
##  $ eugenhth   : int  1 2 5 2 4 3 2 2 3 1 ...
##  $ eugroshp   : int  1 3 2 1 1 2 3 1 1 1 ...
##  $ euhgt      : int  60 63 62 64 69 71 65 63 70 65 ...
##  $ euinclvl   : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ euincome2  : int  -2 -1 2 -2 2 -1 -1 -1 -1 -1 ...
##  $ eumeat     : int  1 1 -1 2 1 -1 1 1 1 1 ...
##  $ eumilk     : int  2 2 -1 2 2 -1 2 2 2 2 ...
##  $ euprpmel   : int  1 1 2 1 1 2 3 1 1 1 ...
##  $ eusoda     : int  -1 -1 2 1 2 1 2 -1 -1 1 ...
##  $ eustores   : int  2 1 -1 2 1 -1 2 1 1 3 ...
##  $ eustreason : int  1 2 -1 6 1 -1 5 3 4 1 ...
##  $ eutherm    : int  2 2 -1 -1 2 -1 2 2 2 2 ...
##  $ euwgt      : int  170 128 270 -2 210 220 200 155 180 170 ...
##  $ euwic      : int  1 2 2 2 1 2 2 -1 -1 -1 ...
##  $ exincome1  : int  2 0 12 2 0 0 0 0 0 0 ...

#Missing values are encoded as negative codes (-1, -2, -3) to indicate blank, unknown or refusal to supply information; all of these can be treated as missing. As different variables have missing values in different observations, we do not drop these rows; instead we filter them out of the particular columns when required.

Natable[Natable< 0] <- NA

sum(is.na(Natable))

## [1] 63559

#As can be seen, there are far too many missing values across the columns to drop every row that contains one, since those rows may hold valid values in the columns we need. So for each comparison the missing codes are filtered out only for the columns involved.

sum(is.na(Natable$erbmi))

## [1] 575

sum(is.na(Natable$etwgt))

## [1] 500
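The per-column checks above can be done for all columns at once. This is a minimal base R sketch on a made-up data frame (`toy`, hypothetical) that mimics the negative missing-value codes; it is not run on the real file.

```r
# Hypothetical toy data with negative codes standing in for missing values
toy <- data.frame(
  erbmi    = c(33.2, 22.7, -1, 31.0),
  euwgt    = c(170, 128, -2, 210),
  euexfreq = c(-1, -1, 5, 2)
)

# Recode every negative value to NA, as done for Natable above
toy_na <- toy
toy_na[toy_na < 0] <- NA

# Count the missing values in each column
na_per_column <- colSums(is.na(toy_na))
print(na_per_column)
```

Seeing the counts per column makes it easier to decide which columns need filtering for a given comparison.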

#There are a lot of redundant data columns in the dataset and we can get rid of a few.

#The TULINENO variable has only one value throughout the dataset, as only one household member supplied the information, so we can remove the variable. The TUCASEID gives the case ID per family.

#ERINCOME combines the information in EEINCOME1 and EXINCOME1, so the latter columns can be removed too.

#Some variables are not directly relevant to our analysis and thus can be removed: EUINCLVL, ERHHCH, ETHGT, ETWGT

#We divide the dataset into smaller chunks of connected data so as to handle them more easily

case <- Resp_data %>%  filter( euhgt>0,euwgt>0,erbmi>0) %>% 
                          select (tucaseid,euhgt,euwgt,erbmi,eugenhth,erincome)

food <-Resp_data %>%  select (tucaseid,ertpreat,ertseat,eudrink,eudietsoda,eufastfd,eufastfdfrq,eustores,eustreason)

exercise<-Resp_data %>% select(tucaseid,euexercise,euexfreq)

#The cleaned data looks as follows:


datatable(head(food))

datatable(head(case))

datatable(head(exercise))
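Because every chunk keeps the tucaseid column, the chunks can be recombined later without losing track of which row belongs to which case. The sketch below illustrates the idea in base R with `subset()` and `merge()` on a made-up data frame (`toy`, `case_toy` and `exercise_toy` are hypothetical names); the report itself uses dplyr for this.

```r
# Hypothetical toy data: one row per case, with a negative code for a missing BMI
toy <- data.frame(
  tucaseid = c(101, 102, 103),
  erbmi    = c(33.2, 22.7, -1),
  euexfreq = c(-1, 3, 5)
)

# Chunk 1: keep only rows with a valid BMI
case_toy <- subset(toy, erbmi > 0, select = c(tucaseid, erbmi))

# Chunk 2: exercise columns, unfiltered
exercise_toy <- toy[, c("tucaseid", "euexfreq")]

# Left join on the shared key: every row of case_toy is kept
joined <- merge(case_toy, exercise_toy, by = "tucaseid", all.x = TRUE)
print(joined)
```

Note that filtering one chunk (here the BMI filter) determines which cases survive the left join, while missing codes in the other chunk are carried along unchanged.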

We cannot show the summary of all the variables here, but we look into some of the variables we are interested in, such as BMI, weight and height.

#BMI

case  %>%
  ggplot()+
    geom_density(mapping=aes(x=erbmi),color="darkblue",fill="lightblue",lwd=0.8) +
    geom_vline(aes(xintercept = median(erbmi)), linetype = "dashed",color="red") +
    geom_vline(aes(xintercept = mean(erbmi)), linetype = "dashed")

summary(case$erbmi)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   23.60   26.60   27.77   30.70   73.60

#The mean BMI of the sample is 27.77, with a minimum of 13 and a maximum of 73.6. The distribution is clearly right-skewed: the mean lies above the median (26.6).
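Skew can also be checked numerically rather than only by eye. As a rough sketch in base R, on made-up BMI values (the `bmi` vector is hypothetical), one can compare the mean with the median and compute a simple moment-based skewness; a positive value points to a longer right tail.

```r
# Hypothetical BMI values, not taken from the dataset
bmi <- c(19.5, 22.7, 23.6, 26.6, 27.1, 30.7, 33.2, 49.4)

# A mean above the median hints at a right-skewed distribution
mean_minus_median <- mean(bmi) - median(bmi)

# Simple moment-based sample skewness (positive = right-skewed)
skewness <- mean((bmi - mean(bmi))^3) / sd(bmi)^3

print(c(mean_minus_median = mean_minus_median, skewness = skewness))
```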

#Weight


case %>% ggplot(aes(x=euwgt)) +
        geom_histogram(binwidth =30, color = "lightgreen", fill = "lightblue",na.rm=T,lwd=0.8)+
            scale_x_continuous(name = "Weight",
                       limits = c(50, 350),
                       breaks = seq(50, 350, by = 30))+
         geom_vline(aes(xintercept = mean(euwgt)), linetype = "dashed")+
          geom_vline(aes(xintercept = median(euwgt)), linetype = "dashed",color="red")

summary(case$euwgt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    98.0   145.0   170.0   176.3   200.0   340.0

#The mean weight of the sample is 176.3 pounds, with a minimum of 98 and a maximum of 340. The distribution is right-skewed here too (the mean lies above the median of 170).
                    
#Height

case %>%  ggplot(aes(x=euhgt)) +
        geom_bar( color = "lightgreen", fill = "lightblue",lwd=0.8)+
            scale_x_continuous(name = "height",
                       limits = c(50,80),
                       breaks = seq(50, 80, by = 5))+
      geom_vline(aes(xintercept = mean(euhgt)), linetype = "dashed")+
      geom_vline(aes(xintercept = median(euhgt)), linetype = "dashed",color="red")

summary(case$euhgt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.00   64.00   66.00   66.69   70.00   77.00

#The mean height of the sample is 66.69 inches, with a minimum of 56 and a maximum of 77 inches.

#The rest of the variables can be understood as follows:

table(Resp_data$erincome)

## 
##   -1    1    2    3    4    5 
##  280 6990  533  976   36 2397

#The majority of respondents are in the highest income category (code 1).

table(Resp_data$eugenhth)

## 
##   -3   -2   -1    1    2    3    4    5 
##   29   36   19 2017 3757 3491 1367  496

#Most respondents report very good or good health, followed by excellent health.
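The two frequency tables above are easier to compare as shares of the sample. The sketch below simply re-enters the counts printed above as named vectors (the vector names are my own) and divides by the total; it does not re-read the data.

```r
# Counts copied from the table(Resp_data$erincome) output above
income_counts <- c(`-1` = 280, `1` = 6990, `2` = 533, `3` = 976, `4` = 36, `5` = 2397)

# Counts copied from the table(Resp_data$eugenhth) output above
health_counts <- c(`-3` = 29, `-2` = 36, `-1` = 19,
                   `1` = 2017, `2` = 3757, `3` = 3491, `4` = 1367, `5` = 496)

# Convert counts to shares of all respondents
income_share <- income_counts / sum(income_counts)
health_share <- health_counts / sum(health_counts)

# Share in the highest income category, and share reporting very good or good health
top_income <- round(income_share[["1"]], 3)
very_good_or_good <- round(health_share[["2"]] + health_share[["3"]], 3)

print(round(income_share, 3))
print(round(health_share, 3))
```

Both totals should match the 11212 rows reported by dim(Resp_data), which is a quick sanity check that no category was dropped.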

Exploratory Data Analysis

#Let us start our analysis by looking at the correlation between the different variables of interest. This will give us a direction as to where we need to concentrate further.

#First, all three tables are joined into a new dataset in order to obtain the correlation between all the different variables.

cordata<-case %>% 
  left_join(.,food,by=c("tucaseid")) %>% left_join(.,exercise,by=c("tucaseid")) 
                                              

corrplot(corr = cor(cordata[,2:16]),method="color",tl.pos = "lt",tl.cex= 0.7,tl.col="Black",title="Correlation Matrix")

#We can clearly see the strong, mild and non-existent correlations between the different variables, which we shall explore over the next few sections.
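One caveat when reading such a matrix: columns like euexfreq still contain negative missing-value codes, and these take part in the correlation as if they were real values. A possible alternative (an assumption on my part, not what the report does) is to recode them to NA and let `cor()` use pairwise complete observations. A minimal base R sketch on made-up data (`toy` is hypothetical):

```r
# Hypothetical toy data: -1 marks "no exercise frequency recorded"
toy <- data.frame(
  erbmi    = c(33.2, 22.7, 49.4, 31.0, 27.5, 24.1),
  euexfreq = c(-1, 3, -1, 5, 2, 6)
)

# Correlation with the missing codes treated as ordinary numbers
raw_cor <- cor(toy$erbmi, toy$euexfreq)

# Correlation after recoding negatives to NA and using only complete pairs
toy_na <- toy
toy_na[toy_na < 0] <- NA
clean_cor <- cor(toy_na, use = "pairwise.complete.obs")["erbmi", "euexfreq"]

print(c(raw = raw_cor, clean = clean_cor))
```

On this toy data the two values differ noticeably, which is why the sections below filter out the negative codes before comparing specific pairs of variables.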

Let us try to answer a few questions that arise:

1.What are the factors that influence BMI in a person?

#We look at the specific values where the correlation looks visually strong, or at least slightly strong.

#In the case of BMI these are: euwgt, eugenhth, euexfreq, euexercise (from the factors considered here)

cortable<-cordata %>% select(eugenhth,erbmi,euwgt,euexfreq,euexercise) 
cor(cortable)

##              eugenhth      erbmi       euwgt   euexfreq  euexercise
## eugenhth    1.0000000  0.2964638  0.22399399 -0.2290678  0.24778020
## erbmi       0.2964638  1.0000000  0.85811577 -0.1333800  0.12840962
## euwgt       0.2239940  0.8581158  1.00000000 -0.0953406  0.08526137
## euexfreq   -0.2290678 -0.1333800 -0.09534060  1.0000000 -0.74475061
## euexercise  0.2477802  0.1284096  0.08526137 -0.7447506  1.00000000

#We find weight has a very strong correlation with BMI (0.86), while the other three are considerably weaker.

#Now let us investigate the factors influencing BMI in a little more detail.


cortable %>% ggplot(aes(x = erbmi,y=euwgt))+geom_point(color="darkgreen")+xlab("BMI")+ylab("Weight") + ggtitle("BMI vs Weight")

#The relationship is visibly strong, i.e. weight increases roughly in proportion to BMI.
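The strength of this relationship is not surprising, since BMI is itself computed from weight and height. As a sketch, assuming the standard imperial-unit formula BMI = 703 × weight (lb) / height (in)² (the codebook's exact formula is not quoted in this report), the first complete rows from the str() output above can be reproduced:

```r
# Rows 1, 2, 3 and 5 from the str(Resp_data) output (row 4 has missing codes)
height_in    <- c(60, 63, 62, 69)          # euhgt
weight_lb    <- c(170, 128, 270, 210)      # euwgt
reported_bmi <- c(33.2, 22.7, 49.4, 31.0)  # erbmi

# Standard imperial BMI formula (assumed, see lead-in)
computed_bmi <- 703 * weight_lb / height_in^2

print(data.frame(height_in, weight_lb, reported_bmi,
                 computed_bmi = round(computed_bmi, 1)))
```

For a fixed height BMI is exactly proportional to weight, so the scatter in the plot comes from differences in height.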


 cortable %>% filter(euexercise %in% c(1,2)) %>% 
 ggplot(mapping=aes(euexercise, erbmi,fill=euexercise,group=euexercise))+ 
  geom_boxplot(outlier.colour = "hotpink")+scale_x_discrete(limits=c("YES","NO"))

 cortable %>% select(euexercise,erbmi)  %>%filter(euexercise>0) %>% group_by(euexercise) %>% mutate(avgbmi=mean(erbmi))%>%  distinct(avgbmi) %>% arrange(desc(-avgbmi))

## Source: local data frame [2 x 2]
## Groups: euexercise [2]
## 
##     avgbmi euexercise
##      <dbl>      <int>
## 1 27.15812          1
## 2 28.81182          2

#As was seen earlier, the relationship is mild, but people who exercise tend to have a slightly lower BMI than those who don't. There are a lot of outliers in both cases, though.
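The group averages above follow a filter, group and average pattern. A compact base R equivalent, shown here only as a sketch on made-up data (`toy` is hypothetical), uses `aggregate()` with a formula:

```r
# Hypothetical toy data: euexercise 1 = Yes, 2 = No, -1 = missing
toy <- data.frame(
  euexercise = c(1, 1, 2, 2, -1),
  erbmi      = c(24, 26, 28, 30, 27)
)

# Drop the missing code, then average BMI within each exercise group
valid <- toy[toy$euexercise > 0, ]
avg_bmi <- aggregate(erbmi ~ euexercise, data = valid, FUN = mean)

print(avg_bmi)
```

This returns one row per group, which avoids the extra distinct() step needed when the average is added with mutate().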
 
 
 cortable %>% filter(eugenhth %in% c(1,2,3,4,5)) %>% 
     ggplot(mapping=aes(eugenhth, erbmi,fill=eugenhth,group=eugenhth))+geom_boxplot()+
          scale_x_discrete(limits=c("Excellent","Very Good","Good","Fair","Poor"))

 #People with excellent health seem to have a lower BMI than those with good health, and people with poor health seem to have the highest BMI. So better health goes with lower BMI (the correlation coefficient is positive only because higher eugenhth codes mean poorer health).
 
 
 
 cortable %>% filter(euexfreq>0) %>% 
 ggplot(mapping=aes(euexfreq,erbmi))+  geom_point()

 #Again, strengthening our previous belief, people who exercise more frequently seem to have a slightly lower BMI than those who exercise less often.
 
 #Consolidating the above observations:
 
 bmitable<-cortable %>% select(euexercise,eugenhth,erbmi)  %>%filter(eugenhth>0,euexercise>0) %>% group_by(euexercise,eugenhth) %>% mutate(avgbmi=mean(erbmi))%>%  distinct(avgbmi) %>% arrange(desc(-avgbmi))
 
 datatable(bmitable)

 #The following table looks at the BMI in terms of exercise frequency, for those who exercise:
   
Bmiexercise<-cortable %>% select(euexfreq,eugenhth,erbmi)  %>%filter(eugenhth>0,euexfreq>0) %>% group_by(euexfreq,eugenhth) %>% mutate(avgbmi=mean(erbmi))%>%  distinct(avgbmi) %>% arrange(eugenhth,euexfreq,desc(-avgbmi))

datatable(Bmiexercise)

# Thus, people with lower BMI seem to enjoy better health, have lower weight and also seem to exercise more frequently.
 
 #This leads us to our next step.

2.What are the factors that influence the overall health?

From the previous discussion it appears that the exercise frequency and the health of a person are related. We explore this possibility by checking the correlation between the variables eugenhth, erbmi, euwgt, erincome and euexfreq.

corhealth<-cordata %>% select(erbmi,euwgt,euexfreq,eugenhth,erincome) %>% filter(euexfreq > 0,eugenhth>0) 
cor(corhealth)

##                erbmi       euwgt     euexfreq   eugenhth    erincome
## erbmi     1.00000000  0.85107635 -0.074387450  0.3256820 0.098285753
## euwgt     0.85107635  1.00000000 -0.065043897  0.2498976 0.017824454
## euexfreq -0.07438745 -0.06504390  1.000000000 -0.0769834 0.006961492
## eugenhth  0.32568196  0.24989765 -0.076983404  1.0000000 0.232253574
## erincome  0.09828575  0.01782445  0.006961492  0.2322536 1.000000000

#We see there are mild relationships between the suspected variables and general health. As weight and BMI are so strongly positively related, comparing general health to one of them is good enough. We have already seen that better general health goes with lower BMI, so the same can be deduced about its relationship with weight. Thus we need only check the relationship of health with income and exercise frequency (as seen in our initial correlation matrix).



healthgroups <- c(
                    `1` = "Excellent",
                    `2` = "Very Good",
                    `3` = "Good",
                    `4` = "Fair",
                    `5` ="Poor"
                    )

corhealth %>% filter(euexfreq > 0,eugenhth>0) %>%  ggplot(aes(x=euexfreq)) +
        geom_bar( color ="darkblue" , fill = "lightblue",lwd=0.8)+facet_wrap(~eugenhth,nrow=5,labeller = as_labeller(healthgroups))+
      geom_vline(aes(xintercept = mean(euexfreq)), linetype = "dashed")+
      geom_vline(aes(xintercept = median(euexfreq)), linetype = "dashed",color="red")+xlab("Exercise Frequency")+
  ggtitle("Health facets by frequency")

datatable(corhealth %>% filter(euexfreq > 0,eugenhth>0) %>%  aggregate(euexfreq ~ eugenhth, data = ., median))

summary(corhealth$euexfreq)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   4.196   5.000  38.000

#We can see that there is no prominent relationship between exercise frequency and health, though people with better health do seem to do some physical activity at least one more time than those with poorer health. At most a quarter of these respondents exercise more than 5 times a week (the 3rd quartile), whatever their state of physical health.


#Now let's check the relationship of health with income, with which it seems to have a stronger relationship.

incomegroups <- c(
                    `1` = "Income > 185%",
                    `2` = "Income= 185%",
                    `3` = "130% < Income<185%",
                    `4` = "Income = 130% ",
                    `5` ="Income < 130% "
                    )


 corhealth %>%
   filter(eugenhth %in% c(1,2,3,4,5),erincome %in% c(1,2,3,4,5)) %>% 
            ggplot(mapping=aes(eugenhth))+ geom_density()+
              scale_x_discrete(limits=c("Excellent","Very Good","Good","Fair","Poor"))+
                  facet_wrap(~ erincome,nrow=5,labeller = as_labeller(incomegroups))+xlab("Health")+
                      ggtitle("Health facets by income")

healthtab<-corhealth %>% filter(erincome > 0,eugenhth>0) %>% group_by(eugenhth) %>% summarise(medianpergroup = median(erincome, na.rm = TRUE))

data.table(healthtab)

##    eugenhth medianpergroup
## 1:        1              1
## 2:        2              1
## 3:        3              1
## 4:        4              2
## 5:        5              5

#It can be seen from the visualizations as well as from the table that most people in the highest income category have excellent, very good or at least good health; only a few of them belong to a lower health group. In the next 2 income groups the density lies in the very good to good health categories. But in the lowest income categories the distribution flattens out a bit and part of the population spreads into the fair and poor health categories.

 
#Thus, along with a lower BMI and weight, a good income is a major indicator of good health.
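The pattern read off the density plots can also be expressed as the share of each health level within an income group. A minimal base R sketch, on made-up data (`toy` is hypothetical) with only the highest (1) and lowest (5) income codes:

```r
# Hypothetical toy data: erincome 1 = highest, 5 = lowest; eugenhth 1 = Excellent ... 5 = Poor
toy <- data.frame(
  erincome = c(1, 1, 1, 1, 5, 5, 5, 5),
  eugenhth = c(1, 2, 2, 3, 2, 4, 5, 5)
)

# Cross-tabulate income group against health level
counts <- table(erincome = toy$erincome, eugenhth = toy$eugenhth)

# Row-wise shares: each income group sums to 1
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```

Row shares make groups of very different sizes comparable, which matters here because the highest income category is far larger than the others.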

#Now let us look at a third question this brings up regarding the income.

3.Is there any special grocery shopping pattern in each income group that might have caused this difference?

As seen in the correlation matrix at the beginning, income does not have a strong correlation (either positive or negative) with any factor in particular.

 cordata %>% filter(eustores %in% c(1,2,3,4,5),erincome %in% c(1,2,3,4,5)) %>% 
      ggplot(aes(x=eustores))+ geom_bar( color = "magenta", fill = "lightblue",lwd=0.8)+ 
            scale_x_discrete(limits=c("Grocery store","Supercenter","Warehouse club","Drugstore or convenience store","Some other place"))+ 
                  facet_grid(.~erincome,labeller = as_labeller(incomegroups))+coord_flip()+
                      xlab("Stores")+ggtitle("Income vs grocery store")

#People in the highest income category mostly shop at the grocery store, but even people in the lowest income category mostly shop at the grocery store, followed by the supercenter.
 
cordata %>% 
  filter(eustreason %in% c(1,2,3,4,5,6),erincome %in% c(1,2,3,4,5)) %>% 
        ggplot(aes(x=eustreason)) + 
            geom_bar( color = "lightgreen", fill = "lightblue",lwd=0.8)+
                scale_x_discrete(limits=c("Price","Location","Quality of products ","Variety of products","Customer service","Other"))+
                facet_grid(.~erincome,labeller = as_labeller(incomegroups))+coord_flip()+                   xlab("shopping Reasons")+
                      ggtitle("Income vs Reasons to Shop")

#The two most popular reasons to shop at a store seem to be its location followed by the price for the highest income group, and the reverse for the lowest income group. The middle income levels seem to weigh both reasons equally. We need to keep in mind that the sample sizes without missing values were pretty small, so the results cannot be confirmed.

A common belief is that if you eat more fast food, your weight goes up. Is this true according to this data?
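The "most popular reason per income group" reading of the bar charts can be extracted directly from a cross-tabulation. This is a small base R sketch on made-up data (`toy` is hypothetical); the labels follow the EUSTREASON coding described in the Data Preparation section.

```r
# Hypothetical toy data: erincome 1 = highest, 5 = lowest; eustreason coded 1..6
toy <- data.frame(
  erincome   = c(1, 1, 1, 5, 5, 5, 5),
  eustreason = c(2, 2, 1, 1, 1, 2, 3)
)

reason_labels <- c("Price", "Location", "Quality of products",
                   "Variety of products", "Customer service", "Other")
toy$reason <- factor(toy$eustreason, levels = 1:6, labels = reason_labels)

# Rows = income group, columns = reason
counts <- table(toy$erincome, toy$reason)

# For each income group, pick the reason with the highest count
top_reason <- colnames(counts)[apply(counts, 1, which.max)]
names(top_reason) <- rownames(counts)

print(top_reason)
```

Note that which.max() returns the first maximum, so ties between reasons are resolved in favour of the lower code.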

cordata %>% filter(eufastfdfrq>0,euwgt>0) %>% select(eufastfdfrq,euwgt) %>% 
  ggplot(aes(y = as.numeric(eufastfdfrq),x=euwgt))+
  geom_smooth(color="hotpink")+ylab("Fast Food Frequency")+xlab("Weight") + ggtitle("Fast food  vs Weight")

## `geom_smooth()` using method = 'gam'

cordata %>% select(euwgt,eufastfdfrq) %>% filter((eufastfdfrq>0& eufastfdfrq < 13),euwgt>0) %>%  group_by(eufastfdfrq) %>%  mutate(avgweight=mean(euwgt))%>%  distinct(avgweight) %>% arrange(eufastfdfrq)

## Source: local data frame [12 x 2]
## Groups: eufastfdfrq [12]
## 
##    avgweight eufastfdfrq
##        <dbl>       <int>
## 1   174.2689           1
## 2   178.9262           2
## 3   180.0078           3
## 4   184.8583           4
## 5   185.4050           5
## 6   196.3468           6
## 7   188.5144           7
## 8   174.2500           8
## 9   181.4444           9
## 10  193.7344          10
## 11  193.2000          11
## 12  205.8824          12

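#The overall relationship is positive: the average weight rises from about 174 pounds at one purchase to about 206 at twelve, although not monotonically (it dips back to 174.25 at eight purchases).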
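The direction of the trend in the table above can be checked with a rank correlation and a simple linear fit. The sketch below re-enters the twelve group averages printed above; because it ignores how many respondents are in each group, it is only a rough check and not a test on the underlying data.

```r
# Group averages copied from the output above (fast food purchases 1..12)
fastfood_freq <- 1:12
avg_weight <- c(174.2689, 178.9262, 180.0078, 184.8583, 185.4050, 196.3468,
                188.5144, 174.2500, 181.4444, 193.7344, 193.2000, 205.8824)

# Spearman rank correlation: positive if weight tends to rise with frequency
rho <- cor(fastfood_freq, avg_weight, method = "spearman")

# Slope of a straight-line fit, in pounds per additional purchase
slope <- unname(coef(lm(avg_weight ~ fastfood_freq))[2])

print(c(spearman_rho = rho, slope = slope))
```

Both measures come out positive, in line with the upward trend, while the rank correlation staying below 1 reflects the dips at 7 to 9 purchases.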



cordata %>% filter(euexfreq>0,eufastfdfrq>0) %>% select(euexfreq,eufastfdfrq) %>% 
  ggplot(aes(x = as.numeric(euexfreq),y=as.numeric(eufastfdfrq)))+
  geom_jitter(color="hotpink")+xlab("Exercise Frequency")+ylab("Fast food frequency") + ggtitle("Exercise vs Fast food")

# Most people seem to exercise fewer than 10 times a week, and a majority of them do not order fast food many times. There are quite a few outliers though.

#Now let's see the combined effect on a person, considering both exercise and fast food


cordata%>% filter(eufastfdfrq>0,euwgt>0,euexfreq>0) %>% gather(eufastfdfrq,euexfreq,key="Var",value="Value") %>% select(Var,euwgt,Value) %>% ggplot(aes(x=euwgt,y=Value,group=Var,color=Var))+
  geom_smooth()+xlab("Weight") + ggtitle("Fast food and exercise frequency vs Weight")

## `geom_smooth()` using method = 'gam'

weighttab<-cordata %>% select(euwgt,eufastfdfrq,euexfreq)  %>%filter(euwgt>0,eufastfdfrq>0,euexfreq>0) %>% group_by(euexfreq,eufastfdfrq) %>% mutate(avwgt=mean(euwgt))%>%  distinct(avwgt) %>% arrange(eufastfdfrq,euexfreq,desc(-avwgt))

datatable(weighttab)

#It is interesting to see that fast food frequency rises roughly in proportion to weight. In the case of exercise, a slightly higher exercise frequency seems to signal lower weight, but once the exercise frequency falls to a certain level it varies only slightly as weight increases.

#Thus, fast food does affect weight, and exercise frequency can signal lower weight if it is over 4 times a week, as per this data.

Summary

In this project I looked into the Eating and Health Module dataset. By inspecting it from various angles, using both graphical and numerical methods, I found answers to the initial questions regarding the sample population's BMI, income, grocery shopping style, general health and weight.

First, all the relevant datasets were joined and a general correlation matrix and its values were obtained. Then I looked into the variables which had a high correlation with our variables of interest and carried out further exploration on each of them. The graphs were produced using ggplot and the summary tables were obtained by manipulating the data using mutate, filter, select, group_by, arrange, mean, median etc. The following conclusions have been reached:

1.Weight has a very strong correlation with BMI, i.e. the two increase roughly in proportion.

The relationship between exercise and BMI is mild, but nonetheless people who exercise tend to have a slightly lower BMI than those who don't. There are a lot of outliers in both cases, though. Also, more frequent exercise is associated with a lower BMI.

People with excellent health have a lower BMI than those with good health, and people with poor health seem to have the highest BMI. So better health goes with lower BMI.

Thus, people with lower BMI seem to enjoy better health, have lower weight and also seem to exercise more frequently.

2.We have already seen that better general health goes with lower BMI, so the same can be deduced about its relationship with weight.

There is only a mild relationship between exercise frequency and health, though people with better health do seem to do some physical activity at least one more time than those with poorer health.

Most people in the highest income category have excellent, very good or at least good health; only a few of them belong to a lower health group.

Thus, along with a lower BMI and weight, a good income is a major indicator of good health.

3.There is no prominent relation between income and where people shop: people in the highest income category mostly shop at the grocery store, but even people in the lowest income category mostly shop at the grocery store, followed by the supercenter.

The two most popular reasons to shop at a store seem to be its location followed by the price for the highest income group, and the reverse for the lowest income group.

4.Fast food does affect weight, and exercise frequency can signal lower weight if it is over 4 times a week, as per this data.


Final Project

Ramya Krishna Kollipara

December 9, 2016
