Exploring the relationship between people’s lifestyle and their self-esteem health.

Part 3: Exploratory data analysis

Main aspect

First, we are going to create the selected_brfss2013 data set by selecting from the original data set brfss2013 the columns that contain interesting information about:

record identification,
general health,
sleep time,
tobacco and alcohol use,
physical activity.

Before moving on, let’s stop for a moment and think about some common aspects that can have a significant impact on our habits and health. From our point of view, it is reasonable to suppose that people who have some general health problems may have different lifestyle behavior. For that reason we are also going to add information about:

general health problems.

selected_brfss2013 <- brfss2013 %>%
  select(sex, X_ageg5yr,                      # Record Identification
         qlactlm2, useequip, blind, decide,   # General Health Problems
         diffwalk, diffdres, diffalon,        # General Health Problems
         genhlth,                             # General Health
         sleptim1,                            # How Much Time Do You Sleep
         smokday2,                            # Frequency Of Days Now Smoking
         X_drnkmo4,                           # Computed Total Number Drinks A Month
         exerany2,                            # Physical Exercise In Past 30 Days
         exract11                             # Type Of Physical Activity
         )

We are going to use this data set throughout our research. Let’s take a quick look at what we got.

head(selected_brfss2013)

##      sex    X_ageg5yr qlactlm2 useequip blind decide diffwalk diffdres diffalon
## 1 Female Age 60 to 64      Yes      Yes    No     No      Yes       No      Yes
## 2 Female Age 50 to 54       No       No    No     No       No       No       No
## 3 Female Age 55 to 59      Yes       No    No     No      Yes       No       No
## 4 Female Age 60 to 64       No       No    No     No       No       No       No
## 5   Male Age 65 to 69       No       No    No     No       No       No       No
## 6 Female Age 45 to 49       No       No    No     No       No       No       No
##     genhlth sleptim1   smokday2 X_drnkmo4 exerany2                   exract11
## 1      Fair       NA Not at all         2       No                       <NA>
## 2      Good        6       <NA>         0      Yes                    Walking
## 3      Good        9  Some days        80       No                       <NA>
## 4 Very good        8       <NA>        16      Yes                    Walking
## 5      Good        6 Not at all        20       No                       <NA>
## 6 Very good        8       <NA>         0      Yes Bicycling machine exercise

Let’s see the structure of our data set.

str(selected_brfss2013)

## 'data.frame':    491775 obs. of  15 variables:
##  $ sex      : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ X_ageg5yr: Factor w/ 13 levels "Age 18 to 24",..: 9 7 8 9 10 6 4 9 7 10 ...
##  $ qlactlm2 : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 1 1 2 2 ...
##  $ useequip : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
##  $ blind    : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ decide   : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diffwalk : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 1 2 2 ...
##  $ diffdres : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diffalon : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
##  $ genhlth  : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ sleptim1 : int  NA 6 9 8 6 8 7 6 8 8 ...
##  $ smokday2 : Factor w/ 3 levels "Every day","Some days",..: 3 NA 2 NA 3 NA 3 1 NA NA ...
##  $ X_drnkmo4: int  2 0 80 16 20 0 1 2 4 0 ...
##  $ exerany2 : Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
##  $ exract11 : Factor w/ 75 levels "Active Gaming Devices (Wii Fit, Dance, Dance revolution)",..: NA 64 NA 64 NA 6 64 64 7 64 ...

For our research purposes, we are going to touch only lightly on the “General Health Problems” aspects, to make our main research clearer. Let’s look more closely at the structure of that information.

str(selected_brfss2013 %>%
      select(qlactlm2, # Activity Limitation Due To Health Problems
             useequip, # Health Problems Requiring Special Equipment
             blind,    # Blind Or Difficulty Seeing
             decide,   # Difficulty Concentrating Or Remembering
             diffwalk, # Difficulty Walking Or Climbing Stairs
             diffdres, # Difficulty Dressing Or Bathing 
             diffalon) # Difficulty Doing Errands Alone
    )

## 'data.frame':    491775 obs. of  7 variables:
##  $ qlactlm2: Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 1 1 2 2 ...
##  $ useequip: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
##  $ blind   : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ decide  : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diffwalk: Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 1 2 2 ...
##  $ diffdres: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diffalon: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...

As we can see, all those columns have a uniform structure and contain 2 values: “Yes” and “No”.

We are going to unite them into one by creating a new column GHP (General Health Problems) that summarizes the information into 2 groups:

Have General Health Problems (Yes)
Don’t have General Health Problems (No)

selected_brfss2013 <- selected_brfss2013 %>%
  mutate(GHP = ifelse(qlactlm2 == "Yes" |
                      useequip == "Yes" |
                      blind    == "Yes" |
                      decide   == "Yes" |
                      diffwalk == "Yes" |
                      diffdres == "Yes" |
                      diffalon == "Yes",
                      "Yes","No"))

Remove the primary “General Health Problems” columns from our data set selected_brfss2013.

selected_brfss2013 <- selected_brfss2013 %>%
  select(-(c(qlactlm2, useequip, blind, decide, diffwalk, diffdres, diffalon)))
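
One detail of the GHP definition above is worth a quick check: with the | operator, a missing answer combined with “No” stays missing, so GHP is “No” only when every compared column is “No”. The following is a minimal base-R sketch on a made-up toy table with just two of the seven columns (the toy data frame is an assumption for illustration, not survey data):

```r
# Toy data (hypothetical, not from brfss2013): two of the seven Yes/No columns
toy <- data.frame(
  qlactlm2 = c("Yes", "No", NA,    NA),
  useequip = c("No",  "No", "Yes", "No")
)

# Same rule as in the GHP definition, reduced to two columns
toy$GHP <- ifelse(toy$qlactlm2 == "Yes" | toy$useequip == "Yes", "Yes", "No")

# NA | TRUE is TRUE, but NA | FALSE is NA, so the last row stays missing
print(toy)
```

This is why the later steps filter with !is.na(GHP) before plotting or summarising.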

Let’s look at the distribution of the population by “General Health Problems” GHP.

# prepare data
selected_brfss2013 %>%
  filter(!is.na(GHP))  %>%
  filter(!is.na(genhlth)) %>%
  # plot data
  ggplot(aes(x = GHP)) +
  geom_bar() +
  ggtitle('Distribution population by "General Health Problems"')

As we can see, the population who have “General Health Problems” is significant and represents roughly 1/3 of the survey.

Let us look at the unique values of the self-esteem of health (genhlth), that is, how respondents rate their own general health.

unique(selected_brfss2013 %>% 
         select(genhlth))

##       genhlth
## 1        Fair
## 2        Good
## 4   Very good
## 9   Excellent
## 28       Poor
## 241      <NA>

Also, we are going to convert the qualitative column genhlth (General Health) to a quantitative one, n_genhlth (from 1 = “Poor” to 5 = “Excellent”); it will probably be useful for our next steps.

selected_brfss2013 <- selected_brfss2013 %>%
  mutate(n_genhlth = ifelse(genhlth == "Poor",      1,
                     ifelse(genhlth == "Fair",      2,
                     ifelse(genhlth == "Good",      3,
                     ifelse(genhlth == "Very good", 4,
                     ifelse(genhlth == "Excellent", 5, NA))))))
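
The same recoding can be written more compactly. The sketch below uses base R match() against the ratings ordered from worst to best; the small genhlth vector is made up for illustration and is not taken from the survey.

```r
# Hypothetical sample of ratings, including a missing value
genhlth <- factor(c("Fair", "Good", "Very good", "Excellent", "Poor", NA))

# Ratings ordered from worst (1) to best (5)
rating_order <- c("Poor", "Fair", "Good", "Very good", "Excellent")

# match() returns the position of each rating in rating_order; NA stays NA
n_genhlth <- match(as.character(genhlth), rating_order)
print(n_genhlth)
```

Because the order is spelled out explicitly, this variant does not depend on how the factor levels of genhlth happen to be ordered.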

Let’s get a summary of the distribution of self-esteem of health before plotting it.

length_without_na <- selected_brfss2013 %>%
  filter(!is.na(GHP)) %>%
  filter(!is.na(n_genhlth)) %>%
  summarise(n())

selected_brfss2013 %>%
  filter(!is.na(GHP)) %>%
  filter(!is.na(n_genhlth)) %>%
  group_by(GHP) %>%
  summarise(Min_h = min(n_genhlth),
            Q1_h =  quantile(n_genhlth,0.25,type = 1),
            Median_h = median(n_genhlth),
            Q3_h = quantile(n_genhlth,0.75,type = 1),
            Max_h = max(n_genhlth),
            Mean_h = mean(n_genhlth),
            Sd_h = sd(n_genhlth),
            N_h = n(),
            freq_h = n()/length(selected_brfss2013$n_genhlth))

## # A tibble: 2 x 10
##   GHP   Min_h  Q1_h Median_h  Q3_h Max_h Mean_h  Sd_h    N_h freq_h
##   <chr> <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl>  <int>  <dbl>
## 1 No        1     3        4     4     5   3.81 0.890 310693  0.632
## 2 Yes       1     2        3     3     5   2.69 1.09  164532  0.335
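
Note that freq_h divides by the total number of rows, including rows where GHP or n_genhlth is missing, so the two shares add up to less than 1. A small sketch with the counts from the table above shows both variants; the total of 491775 rows comes from the str() output, and the variable names are chosen here only for illustration.

```r
# Group sizes taken from the N_h column of the summary table
N_h <- c(No = 310693, Yes = 164532)

# Share of all 491775 survey rows (what freq_h reports)
share_all_rows <- N_h / 491775

# Share among rows where both GHP and n_genhlth are known
share_known <- N_h / sum(N_h)

print(round(share_all_rows, 3))
print(round(share_known, 3))
```

Among respondents with known values, the split is roughly 65% without and 35% with general health problems.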

Plot our data set to see the distribution of health self-esteem n_genhlth grouped by “General Health Problems” GHP.

# prepare data
selected_brfss2013 %>%
  filter(!is.na(GHP))  %>%
  filter(!is.na(n_genhlth)) %>%
  # plot data
  ggplot(aes(x = n_genhlth,
             fill = GHP)) +
  geom_bar(position = "dodge") +
  ggtitle('Distribution of health self-esteem grouping by GHP')

It looks like our suggestion about a relationship between health self-esteem (genhlth) and general health problems (GHP) was right.

(No GHP) The distribution of health self-esteem for the population without general health problems is skewed left (the mean of 3.81 lies below the median of 4, with the tail toward lower scores). It is centered at about 4 points, with most scores between 3 and 5 points.
(Yes GHP) The distribution of health self-esteem for the population with general health problems is slightly skewed left. It is centered at about 3 points, with most scores between 1 and 4 points.

It is reasonable that people without general health problems rate their health better than people with GHP.

Research question 1: Relationship between ‘sleeping time’ and ‘self-esteem health’

To start, we are going to plot the sleeping time distribution sleptim1 grouped by “General Health Problems” GHP.

# prepare data
selected_brfss2013 %>%
  filter(!is.na(sleptim1)) %>%
  filter(!is.na(genhlth)) %>%
  filter(!is.na(GHP)) %>%
  # plot data
  ggplot(aes(x = sleptim1)) +
  facet_wrap('GHP') +
  geom_histogram(binwidth = 1, position = "dodge") +
  ggtitle('Distribution of sleeping time by GHP') +
  labs(y = "count of observations",
       x = "sleeping time (hours)")

(No GHP) The distribution of sleeping time for the population without general health problems is roughly bell-shaped. It is centered at about 7 hours, with most data between 5 and 9 hours, a range of roughly 17 hours, and outliers below 4 and above 10 hours.
(Yes GHP) The distribution of sleeping time for the population with general health problems is roughly bell-shaped. It is centered at about 7 hours, with most data between 4 and 10 hours, a range of roughly 17 hours, and outliers below 3 and above 10 hours.

For now, the distributions of these two groups look very similar, except that the distribution with “GHP” is a little more spread out; we will get a chance to check this. Let’s look at it in more detail.

selected_brfss2013 %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(sleptim1))  %>%
  filter(!is.na(GHP))  %>%
  group_by(GHP) %>%
  summarise(Min_n = min(sleptim1),
            Q1_n =  quantile(sleptim1, 0.25, type = 1),
            Median_n = median(sleptim1),
            Q3_n = quantile(sleptim1, 0.75, type = 1),
            Max_n = max(sleptim1),
            Mean_n = mean(sleptim1),
            Sd_n = sd(sleptim1),
            N_n = n())

## # A tibble: 2 x 9
##   GHP   Min_n  Q1_n Median_n  Q3_n Max_n Mean_n  Sd_n    N_n
##   <chr> <int> <int>    <dbl> <int> <int>  <dbl> <dbl>  <int>
## 1 No        1     6        7     8    24   7.10  1.23 308190
## 2 Yes       0     6        7     8    24   6.95  1.83 160367
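
The summaries above call quantile() with type = 1, which always returns a value that occurs in the data, whereas R’s default (type = 7) interpolates between neighboring values. A small sketch on a made-up vector (not survey data) shows the difference:

```r
# Hypothetical sleep times in hours
x <- c(1, 2, 3, 4)

q_type1   <- unname(quantile(x, 0.5, type = 1))  # inverse empirical CDF: picks an observed value
q_default <- unname(quantile(x, 0.5))            # type = 7: linear interpolation

print(c(type1 = q_type1, default = q_default, median = median(x)))
```

This is consistent with Q1_n and Q3_n being printed as integers in the table, while Median_n is a double.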

Now we are going to plot the relationship between ‘sleeping time’ and ‘self-esteem health’ using genhlth (General Health), sleptim1 (How Much Time Do You Sleep) and GHP (General Health Problems). To get a clearer picture in that plot, we focus on the range that holds most of the sleeping time data (4 to 10 hours).

# prepare data
selected_brfss2013 %>%
  filter(!is.na(GHP)) %>%
  filter(!is.na(genhlth)) %>%
  filter(!is.na(sleptim1))  %>%
  filter(sleptim1 <= 10, sleptim1 >= 4) %>%
  # plot data
  ggplot(aes(x = genhlth,
             fill = as.factor(sleptim1))) +
  facet_wrap('GHP') +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Distribution sleeping time frequency by self-esteem health and GHP") +
  labs(y = "sleeping time (frequency)",
       x = "self-esteem health")

Those two plots show us a clear relationship between ‘sleeping time’ and ‘self-esteem health’, and, interestingly, that relationship is stronger for people who have some general health problem. Generally, worse ‘self-esteem health’ corresponds to less sleeping time. Let’s look at it more closely.

We are going to plot the relationship of ‘self-esteem health’ to ‘sleeping time’ using genhlth (General Health), sleptim1 (How Much Time Do You Sleep) and GHP (General Health Problems). To do that, we prepare our data to get a clearer picture and plot the frequency distribution of ‘self-esteem health’ for each usual ‘sleeping time’. To keep the plot from getting too complicated, we are going to look separately at people who have and don’t have GHP.

Some slightly more complicated preparation steps are explained as comments in the code.
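
As a cross-check of the frequency step used in the code below (count each combination, then divide by the total of its ‘sleeping time’ group), here is a minimal base-R sketch on made-up data. In this sketch table() and prop.table() stand in for the group_by(), summarise() and mutate() calls; the toy values are assumptions for illustration only.

```r
# Hypothetical toy data, not taken from the survey
toy <- data.frame(
  sleptim1 = c(5, 5, 5, 5, 7, 7, 7, 7),
  genhlth  = c("Good", "Good", "Fair", "Poor",
               "Good", "Very good", "Very good", "Excellent")
)

# Counts for each combination of 'sleeping time' and 'self-esteem health'
counts <- table(toy$sleptim1, toy$genhlth)

# margin = 1 divides each row by its row total, i.e. per 'sleeping time' group
freq <- prop.table(counts, margin = 1)

print(round(freq, 2))
```

Each row sums to 1, which is the property the freq column has within every facet of the plots.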

Don’t have General Health Problems

Plot the relationship of ‘self-esteem health’ to ‘sleeping time’ using genhlth, sleptim1 and GHP, where GHP = “No”. In that plot, we are going to get the ‘self-esteem health’ frequency for each common ‘sleeping time’.

# prepare data
selected_brfss2013 %>%
  select(GHP, genhlth, sleptim1) %>%
  filter(!is.na(GHP))  %>%
  filter(GHP == "No")  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(sleptim1))  %>%
  filter(sleptim1 <= 9, sleptim1 >= 5)  %>%
  group_by(sleptim1, genhlth) %>%              # grouping by 'sleeping time' and 'self-esteem health'
  summarise(n = n()) %>%                       # count the quantity for each combination
                                               ##     of 'sleeping time' and 'self-esteem health groups'
  mutate(freq = n / sum(n)) %>%                # count frequency for each 'sleeping time' groups
  
  # plot data
  ggplot(aes(
    x = genhlth,
    y = freq,
    fill = genhlth,
    label = scales::percent(round(freq, digits = 2))
  )) +
  facet_wrap('sleptim1') +
  geom_col(position = 'dodge')  +
  
  # description
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(position = position_dodge(width = .9),
            vjust = -0.5,
            size = 2) +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Distribution frequency 'self-esteem health' by 'sleeping time' for No GHP") +
  labs(y = "self-esteem health (frequency)",
       x = "self-esteem health")

These plots show us that the frequency of “Excellent” and “Very good” self-esteem health grows with sleeping time, mostly due to a decreasing share of “Good” ratings. Also, we can notice that beyond 7 hours of sleeping time the distribution does not change a lot, while ‘self-esteem health’ changes significantly in the range between 5 and 7 hours. Since this data set is observational, we cannot say that sleeping time affects ‘self-esteem health’, but a relationship between these 2 variables definitely exists.

Also, it can be interesting to see the opposite view, the distribution of ‘sleeping time’ for each ‘self-esteem health’ value, using genhlth, sleptim1 and GHP. In that plot, we are going to get the ‘sleeping time’ frequency for each ‘self-esteem health’. Let’s do it.

# prepare data
selected_brfss2013 %>%
  select(genhlth, sleptim1, GHP) %>%
  filter(!is.na(GHP))  %>%
  filter(GHP == "No")  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(sleptim1))  %>%
  filter(sleptim1 <= 9, sleptim1 >= 5)  %>%
  group_by(genhlth, sleptim1) %>%                 # grouping by 'self-esteem health' and 'sleeping time'
  summarise(n = n()) %>%                          # count the quantity for each combination
                                                  ##     of 'self-esteem health' and 'sleeping time'
  mutate(freq = n / sum(n)) %>%                   # count frequency for each 'self-esteem health' groups
  
  # plot data
  ggplot(aes(
    x = sleptim1,
    y = freq,
    fill = as.factor(sleptim1),
    label = scales::percent(round(freq, digits = 2))
  )) +
  facet_wrap('genhlth') +
  geom_col(position = 'dodge')  +
  
  # description
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(position = position_dodge(width = .9),
            vjust = -0.5,
            size = 2) +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Distribution frequency 'sleeping time' by 'self-esteem health' for No GHP") +
  labs(y = "sleeping time (frequency)",
       x = "sleeping time (hours)")

These plots also show us mostly the same relationship as the previous ones. We can see significant differences in self-esteem health in the range of 5-7 hours.

Have General Health Problems

Let’s do the same work for people who have general health problems, just slightly expanding the ‘sleeping time’ range. Plot the relationship of ‘self-esteem health’ to ‘sleeping time’ using genhlth, sleptim1 and GHP, where GHP = “Yes”. In that plot, we are going to get the ‘self-esteem health’ frequency for each common ‘sleeping time’.

# prepare data
selected_brfss2013 %>%
  select(GHP, genhlth, sleptim1) %>%
  filter(!is.na(GHP))  %>%
  filter(GHP == "Yes")  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(sleptim1))  %>%
  filter(sleptim1 <= 9, sleptim1 >= 4)  %>%
  group_by(sleptim1, genhlth) %>%              # grouping by 'sleeping time' and 'self-esteem health'
  summarise(n = n()) %>%                       # count the quantity for each combination
                                               ##     of 'sleeping time' and 'self-esteem health groups'
  mutate(freq = n / sum(n)) %>%                # count frequency for each 'sleeping time' groups
  
  # plot data
  ggplot(aes(
    x = genhlth,
    y = freq,
    fill = genhlth,
    label = scales::percent(round(freq, digits = 2))
  )) +
  facet_wrap('sleptim1') +
  geom_col(position = 'dodge')  +
  
  # description
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(position = position_dodge(width = .9),
            vjust = -0.5,
            size = 2) +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Distribution frequency 'self-esteem health' by 'sleeping time' for Yes GHP") +
  labs(y = "self-esteem health (frequency)",
       x = "self-esteem health")

In this case, the plots show us that the frequency of “Very good” and “Good” self-esteem health grows with sleeping time, mostly due to decreasing shares of “Fair” and “Poor” ratings. We should also note how significantly the share of “Poor” ratings decreases, from 30% to 9%, as sleeping time grows from 4 to 7 hours.

Let’s plot the distribution of ‘sleeping time’ for each ‘self-esteem health’ value.

# prepare data
selected_brfss2013 %>%
  select(genhlth, sleptim1, GHP) %>%
  filter(!is.na(GHP))  %>%
  filter(GHP == "Yes")  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(sleptim1))  %>%
  filter(sleptim1 <= 9, sleptim1 >= 4)  %>%
  group_by(genhlth, sleptim1) %>%                 # grouping by 'self-esteem health' and 'sleeping time'
  summarise(n = n()) %>%                          # count the quantity for each combination
                                                  ##     of 'self-esteem health' and 'sleeping time'
  mutate(freq = n / sum(n)) %>%                   # count frequency for each 'self-esteem health' groups
  
  # plot data
  ggplot(aes(
    x = sleptim1,
    y = freq,
    fill = as.factor(sleptim1),
    label = scales::percent(round(freq, digits = 2))
  )) +
  facet_wrap('genhlth') +
  geom_col(position = 'dodge')  +
  
  # description
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(position = position_dodge(width = .9),
            vjust = -0.5,
            size = 2) +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Distribution frequency 'sleeping time' by 'self-esteem health' for Yes GHP") +
  labs(y = "sleeping time (frequency)",
       x = "sleeping time (hours)")

These plots also show us mostly the same relationship as the previous ones, but from a different angle.

Finally, let’s see how the relationship between ‘sleeping time’ and ‘self-esteem health’ changes with age, using genhlth, sleptim1 and X_ageg5yr. Note: X_ageg5yr - Reported Age In Five-Year Age Categories Calculated Variable.

# prepare data
selected_brfss2013 %>%
  select(genhlth, sleptim1, X_ageg5yr) %>%
  filter(!is.na(X_ageg5yr))  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(sleptim1))  %>%
  filter(sleptim1 >= 5, sleptim1 <= 9)  %>%
  group_by(X_ageg5yr, genhlth) %>%
  summarise(mean_sleep = mean(sleptim1)) %>%   # average 'sleeping time' per age group and 'self-esteem health'
  
  # plot data
  ggplot(aes(
    x = X_ageg5yr,
    y = mean_sleep,
    fill = as.factor(genhlth)
  )) +
  geom_col(position = 'dodge')  +
  
  # description
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Average 'sleeping time' by age group and 'self-esteem health'") +
  labs(y = "sleeping time (average hours)",
       x = "age group",
       fill = "self-esteem health")

The distribution shows that the relationship between ‘sleeping time’ and ‘self-esteem health’ persists with age. Also, we can see that the average ‘sleeping time’ stays roughly constant up to 50-55 years and starts to grow slightly after that. We can also make an interesting remark: the average ‘sleeping time’ that corresponds to ‘Excellent’ and ‘Very good’ health self-esteem before 55 years is more characteristic of the ‘Fair’ and ‘Poor’ ratings after 70 years. Possibly, after 55 years we need to sleep more to feel better, although observational data cannot confirm that.

Research question 2: Relationship between ‘tobacco and alcohol use’ and ‘self-esteem health’

Tobacco Use

Now we are going to explore the relationship between ‘tobacco and alcohol use’ and ‘self-esteem health’. To begin, we plot the distribution of the frequency of days now smoking grouped by gender, using smokday2 and sex.

# prepare data
selected_brfss2013 %>%
  select(smokday2, sex) %>%
  filter(!is.na(sex))  %>%
  filter(!is.na(smokday2))  %>%
  
  # plot data
  ggplot(aes(x = sex,
             fill = smokday2)) +
  geom_bar(position = 'fill') +
  
  # description
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Distribution frequency 'Now Smoking' by gender") +
  labs(y = "now smoking (frequency)",
       x = "gender")

The plot shows us approximately the same distribution of ‘tobacco use’ for males and females. Among respondents with a recorded smokday2 answer, roughly 65% do not smoke at all, 10% smoke some days and 25% smoke every day.

We are going to plot the relationship of ‘self-esteem health’ to ‘tobacco use’ using genhlth (General Health), smokday2 (Frequency Of Days Now Smoking) and GHP (General Health Problems). As in the first question, we prepare our data to get a clearer picture and plot the frequency distribution of ‘self-esteem health’ for each ‘Frequency Of Days Now Smoking’ value. Also, we are going to separate people who have and don’t have GHP.

Don’t have General Health Problems

Plot the relationship of ‘self-esteem health’ to ‘Frequency Of Days Now Smoking’ using genhlth, smokday2 and GHP, where GHP = “No”. In that plot, we are going to get the ‘self-esteem health’ frequency for each ‘Frequency Of Days Now Smoking’ value.

# prepare data
selected_brfss2013 %>%
  select(GHP, genhlth, smokday2) %>%
  filter(!is.na(GHP))  %>%
  filter(GHP == "No")  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(smokday2))  %>%
  group_by(smokday2, genhlth) %>%           # grouping by 'Frequency Of Days Now Smoking' and 'self-esteem health'
  summarise(n = n()) %>%                    # count the quantity for each combination 
                                            ##     of 'Frequency Of Days Now Smoking' and 'self-esteem health groups' 
  mutate(freq = n / sum(n)) %>%             # count frequency for each 'Frequency Of Days Now Smoking' groups
  
# plot data
  ggplot(aes(x = genhlth, 
             y = freq,
             fill = genhlth,
             label = scales::percent(freq))) + 
  facet_wrap('smokday2') +
  geom_col(position = 'dodge')  +
  
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(position = position_dodge(width = .9),    
            vjust = -0.5,                           
            size = 2) + 
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Distribution frequency 'self-esteem health' by 'Smoking' for No GHP") +
  labs(y = "self-esteem health (frequency)",
       x = "self-esteem health")

These plots show us that the frequency of “Excellent” self-esteem health grows significantly, from roughly 14.5% to 21.8%, between the “Every day” and “Not at all” smoking values. Also, we can see the share of “Fair” decreasing by roughly a third, from 9.8% to 6.4%.
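
The size of these changes can be expressed either in percentage points or relative to the starting value. A quick sketch of the arithmetic, using the percentages quoted above (the helper rel_change is a name introduced here only for illustration):

```r
# Percentages read off the plot for "Every day" and "Not at all" smokers
excellent <- c(every_day = 14.5, not_at_all = 21.8)
fair      <- c(every_day = 9.8,  not_at_all = 6.4)

# Relative change with respect to the "Every day" value
rel_change <- function(v) unname((v["not_at_all"] - v["every_day"]) / v["every_day"])

print(round(rel_change(excellent), 2))  # relative increase of "Excellent"
print(round(rel_change(fair), 2))       # relative change of "Fair" (negative means a decrease)
```

In relative terms the “Excellent” share grows by about half, while the “Fair” share shrinks by about a third.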

Have General Health Problems

Let’s do the same work for people who have general health problems.

# prepare data
selected_brfss2013 %>%
  select(GHP, genhlth, smokday2) %>%
  filter(!is.na(GHP))  %>%
  filter(GHP == "Yes")  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(smokday2))  %>%
  group_by(smokday2, genhlth) %>%           # grouping by 'Frequency Of Days Now Smoking' and 'self-esteem health'
  summarise(n = n()) %>%                    # count the quantity for each combination 
                                            ##     of 'Frequency Of Days Now Smoking' and 'self-esteem health groups' 
  mutate(freq = n / sum(n)) %>%             # count frequency for each 'Frequency Of Days Now Smoking' groups
  
# plot data
  ggplot(aes(x = genhlth, 
             y = freq,
             fill = genhlth,
             label = scales::percent(freq))) + 
  facet_wrap('smokday2') +
  geom_col(position = 'dodge')  +
  
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(position = position_dodge(width = .9),    
            vjust = -0.5,                           
            size = 2) + 
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Distribution frequency 'self-esteem health' by 'Smoking' for GHP") +
  labs(y = "self-esteem health (frequency)",
       x = "self-esteem health")

These plots don’t show us as clear a picture as the plots for No GHP. Nonetheless, we can see a general improvement in self-esteem health from the “Every day” to the “Not at all” smoking value.

It can also be interesting to plot the average self-esteem health for ‘tobacco use’. Let’s do it by using n_genhlth against smokday2, split by sex and faceted by GHP.

# prepare data
selected_brfss2013 %>%
  select(GHP, n_genhlth, smokday2, sex) %>%
  filter(!is.na(sex))  %>%
  filter(!is.na(GHP))  %>%
  filter(!is.na(n_genhlth))  %>%
  filter(!is.na(smokday2))  %>%
  group_by(smokday2, sex, GHP) %>%           # grouping by 'Smoking', 'Gender' and 'General Health Problems'
  summarise(mean_genhlth = mean(n_genhlth)) %>%
  
  # plot data
  ggplot(aes(
    x = smokday2,
    y = mean_genhlth,
    fill = sex,
    label = round(mean_genhlth, digits = 2)
  )) +
  facet_wrap('GHP') +
  geom_col(position = 'dodge')   +
  
  # description
  geom_text(position = position_dodge(width = .9),
            vjust = -0.5,
            size = 3) +
  scale_y_continuous() +
  ggtitle("Average 'self-esteem health' by 'Smoking', Gender and GHP") +
  labs(y = "numeric self-esteem health (average)",
       x = "smoking group")

On the graph, we can see that tobacco use corresponds to a slightly lower mean of ‘self-esteem health’.

Alcohol Consumption

As our next step, it is interesting to explore the relationship between ‘alcohol use’ and ‘self-esteem health’.

To start, let’s plot the distribution of Total Number Drinks A Month, using X_drnkmo4.

# prepare data
selected_brfss2013 %>%
  select(X_drnkmo4) %>%
  filter(!is.na(X_drnkmo4))  %>%

# plot data
  ggplot(aes(x = X_drnkmo4)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We got a distribution skewed to the right. It is centered at about 0 drinks, with most values roughly between 0 and 100 drinks. We have a range of roughly 470, and some outliers are present above 100 drinks. Let’s get a summary.

selected_brfss2013 %>%
  filter(!is.na(X_drnkmo4)) %>%
  summarise(Min_n = min(X_drnkmo4),
            Q1_n =  quantile(X_drnkmo4,0.25,type = 1),
            Median_n = median(X_drnkmo4),
            Q3_n = quantile(X_drnkmo4,0.75,type = 1),
            Max_n = max(X_drnkmo4),
            Mean_n = mean(X_drnkmo4),
            Sd_n = sd(X_drnkmo4),
            N_n = n())

##   Min_n Q1_n Median_n Q3_n Max_n   Mean_n     Sd_n    N_n
## 1     0    0        0    9  2280 11.19984 33.50244 467893

That distribution is quite spread out. For the plot we will use the data within 2 sd of the mean (at most 78 drinks a month).

Let’s plot the relationship of average ‘self-esteem health’ to ‘Total Number Drinks A Month’, using n_genhlth and X_drnkmo4.
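
The cutoff of 78 used in the filter below follows directly from the summary: mean plus 2 standard deviations. The lower bound, mean minus 2 sd, is negative and therefore excludes nothing. A quick check of the arithmetic (variable names are chosen here for illustration):

```r
# Values copied from the summary output above
mean_drinks <- 11.19984
sd_drinks   <- 33.50244

upper <- mean_drinks + 2 * sd_drinks   # upper bound of the 2-sd band
lower <- mean_drinks - 2 * sd_drinks   # negative, so it never excludes anything

print(c(lower = lower, upper = upper, cutoff = floor(upper)))
```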

# prepare data
selected_brfss2013 %>%
  select(n_genhlth, X_drnkmo4) %>%
  filter(!is.na(n_genhlth))  %>%
  filter(!is.na(X_drnkmo4))  %>%
  filter(X_drnkmo4 <= 78)  %>%
  group_by(X_drnkmo4) %>%           # grouping by 'Total Number Drinks A Month'
  summarise(mean_genhlth = mean(n_genhlth)) %>%

# plot data
  ggplot(aes(x = X_drnkmo4, 
             y = mean_genhlth, 
             label = round(mean_genhlth, digits = 2) )) + 
  geom_col() +
  ggtitle("Average ‘self-esteem health’ from ‘Drinks per month’") +
  labs(y = "self-esteem health (average)",
       x = "drinks per month")

Let’s try to get something from this plot. First, it looks like 0 drinks a month corresponds to a slightly lower average self-esteem health. Also, we can see that beyond some number of drinks per month (roughly 50) self-esteem health starts to decrease, but the plot does not show that relationship clearly. As our next step, we can try to combine the numbers of drinks into groups of average drinks per day, taking a month as 30 days. Let’s do it.

selected_brfss2013 <- selected_brfss2013 %>%
mutate(X_drnkmo4_gr = ifelse(X_drnkmo4 == 0,      "0",
                      ifelse(X_drnkmo4 <= 30,     "Between 0 and 1",
                      ifelse(X_drnkmo4 <= 60,     "Between 1 and 2",
                      ifelse(X_drnkmo4 <= 90,     "Between 2 and 3",
                      ifelse(X_drnkmo4 <= 120,    "Between 3 and 4",
                      ifelse(X_drnkmo4 >  120,    "More than 4", NA)))))))
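
The nested ifelse() calls can also be written with base R cut(), which returns a factor whose levels follow the order of the breaks, so the groups keep their order in plots. This is only a sketch on a made-up vector; it assumes the drink counts are never negative, as in the summary above.

```r
# Hypothetical monthly drink counts, including the group boundaries
drinks <- c(0, 1, 30, 31, 60, 90, 120, 121, NA)

# Intervals are right-closed: (-Inf, 0], (0, 30], (30, 60], ...
drinks_gr <- cut(
  drinks,
  breaks = c(-Inf, 0, 30, 60, 90, 120, Inf),
  labels = c("0", "Between 0 and 1", "Between 1 and 2",
             "Between 2 and 3", "Between 3 and 4", "More than 4")
)

print(data.frame(drinks, drinks_gr))
```

Missing counts stay missing, and the boundary values 30, 60, 90 and 120 fall into the lower group, as in the ifelse() version.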

Plot it, using X_drnkmo4_gr, GHP, sex and n_genhlth.

# prepare data
selected_brfss2013 %>%
  select(GHP, n_genhlth, X_drnkmo4_gr, sex) %>%
  filter(!is.na(sex))  %>%
  filter(!is.na(GHP))  %>%
  filter(!is.na(n_genhlth))  %>%
  filter(!is.na(X_drnkmo4_gr))  %>%
  group_by(X_drnkmo4_gr,GHP,sex) %>%           
  summarise(mean_genhlth = mean(n_genhlth)) %>%

  
# plot data
  ggplot(aes(x = X_drnkmo4_gr, 
             y = mean_genhlth,
             fill = sex,
             label = round(mean_genhlth, digits = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  facet_wrap('GHP') +
  geom_col(position = 'dodge') +
  ggtitle("'Self-esteem health' and 'Alcohol Consumption'") +
  labs(y = "Average self-esteem health", 
       x = "Drinks per day")

That plot gives a clearer understanding of the relationship between average ‘self-esteem health’ and ‘Drinks per day’. For both sexes, average ‘self-esteem health’ increases slightly up to 2 drinks per day and after that starts to decrease slightly.

Research question 3: Relationship between ‘physical activity’ and ‘self-esteem health’

In our third research question, we are going to explore the relationship between ‘physical activity’ and ‘self-esteem health’. To start, we will plot the physical activity distribution, using exerany2, sex and GHP. Note: exerany2 - Physical Exercise In Past 30 Days.

# prepare data
selected_brfss2013 %>%
  select(exerany2, sex, GHP) %>%
  filter(!is.na(GHP))  %>%
  filter(!is.na(sex))  %>%
  filter(!is.na(exerany2))  %>%

# plot data
  ggplot(aes(x = sex, fill = exerany2)) +
  geom_bar(position = 'fill') +
  facet_wrap('GHP') + 
  ggtitle("Physical activity distribution by gender and General Health Problems") +
  labs(y = "physical exercise (frequency)", 
       x = "gender",
       fill = "physical exercise in past 30 days")

We can see that roughly 80% of respondents with ‘No GHP’ had some physical exercise in the past 30 days, compared with roughly 60% of those with ‘Yes GHP’.

We are going to plot the relationship of ‘self-esteem health’ to ‘physical exercise in past 30 days’ using genhlth (General Health), exerany2 (Physical Exercise In Past 30 Days) and GHP (General Health Problems). Also, we are going to separate people who have and don’t have GHP.

People without General Health Problems

Plot the relationship between ‘self-esteem health’ and ‘physical exercise in past 30 days’ using genhlth, exerany2 and GHP, where GHP = “No”. This plot shows the frequency of each ‘self-esteem health’ level within each ‘Physical Exercise In Past 30 Days’ group.

# prepare data
selected_brfss2013 %>%
  select(GHP, genhlth, exerany2) %>%
  filter(!is.na(GHP))  %>%
  filter(GHP == "No")  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(exerany2))  %>%
  group_by(exerany2, genhlth) %>%           # group by 'Physical Exercise In Past 30 Days' and 'self-esteem health'
  summarise(n = n()) %>%                    # count respondents in each combination of the two groups;
                                            #     the result stays grouped by exerany2 only
  mutate(freq = n / sum(n)) %>%             # frequency within each 'Physical Exercise In Past 30 Days' group
  
# plot data
  ggplot(aes(x = genhlth, 
             y = freq,
             fill = genhlth,
             label = scales::percent(freq))) + 
  facet_wrap('exerany2') +
  geom_col(position = 'dodge')  +
  
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(position = position_dodge(width = .9),    
            vjust = -0.5,                           
            size = 2) + 
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Frequency of Self-esteem health by physical activity for 'No GHP'") +
  labs(y = "frequency of self-esteem health", 
       x = "self-esteem health")

These plots show that, with physical activity, the frequency of “Excellent” and “Very good” self-esteem health grows from roughly 15.7% to 26.1% and from 34.8% to 42.1% respectively, mostly at the expense of the “Good” level of self-esteem health.
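The frequencies in these plots are computed within each exerany2 group, not over all respondents. The following minimal sketch illustrates why, using a made-up data frame (toy_health is hypothetical) and assuming dplyr 1.0.0 or later, where the .groups argument of summarise() makes the default "drop the last grouping level" behaviour explicit.

```r
library(dplyr)

# Hypothetical toy data (values are made up, not the survey data)
toy_health <- data.frame(
  exerany2 = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
  genhlth  = c("Excellent", "Good", "Good", "Good", "Good", "Poor", "Poor", "Poor"),
  stringsAsFactors = FALSE
)

freq_tbl <- toy_health %>%
  group_by(exerany2, genhlth) %>%
  summarise(n = n(), .groups = "drop_last") %>%  # result is still grouped by exerany2
  mutate(freq = n / sum(n)) %>%                  # so sum(n) is the total per exerany2 group
  ungroup()

# Frequencies should add up to 1 inside each exerany2 group
freq_check <- freq_tbl %>%
  group_by(exerany2) %>%
  summarise(total = sum(freq))

print(freq_tbl)
print(freq_check)
```

Because each facet sums to 100%, the two facets can be compared directly even when the groups differ in size.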

People with General Health Problems

Let’s do the same for people with GHP.

# prepare data
selected_brfss2013 %>%
  select(GHP, genhlth, exerany2) %>%
  filter(!is.na(GHP))  %>%
  filter(GHP == "Yes")  %>%
  filter(!is.na(genhlth))  %>%
  filter(!is.na(exerany2))  %>%
  group_by(exerany2, genhlth) %>%           # group by 'Physical Exercise In Past 30 Days' and 'self-esteem health'
  summarise(n = n()) %>%                    # count respondents in each combination of the two groups;
                                            #     the result stays grouped by exerany2 only
  mutate(freq = n / sum(n)) %>%             # frequency within each 'Physical Exercise In Past 30 Days' group
  
# plot data
  ggplot(aes(x = genhlth, 
             y = freq,
             fill = genhlth,
             label = scales::percent(freq))) + 
  facet_wrap('exerany2') +
  geom_col(position = 'dodge')  +
  
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(position = position_dodge(width = .9),    
            vjust = -0.5,                           
            size = 2) + 
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Frequency of Self-esteem health by physical activity for 'Yes GHP'") +
  labs(y = "frequency of self-esteem health", 
       x = "self-esteem health")

In this case, we can see how sharply the frequency of “Poor” self-esteem health decreases with physical activity, from 22.1% to 10.5%. Because this data set is observational, we can’t say that physical activity affects ‘self-esteem health’, but the two variables are clearly associated.
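One possible way to check that such an association is more than noise is a chi-squared test of independence. This test is not part of the original analysis; the sketch below uses base R’s chisq.test() on hypothetical counts (toy_counts is made up), and even a very small p-value would say nothing about causation.

```r
# Hypothetical 2x2 counts (made up):
# rows = exercised in past 30 days, columns = collapsed health rating
toy_counts <- matrix(c(30, 10,
                       10, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(exerany2 = c("Yes", "No"),
                                     genhlth  = c("Good or better", "Fair or poor")))

# Chi-squared test of independence between the two categorical variables
test_result <- chisq.test(toy_counts)
print(test_result$p.value)
```

With the survey data, the contingency table could be built with table() from the exerany2 and genhlth columns after filtering on GHP.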

As the next step, it may be interesting to look at the type of physical activity.

Frequency of “Type Of Physical Activity”, top 10.

# prepare data
top10_physical_activity <- selected_brfss2013 %>%
  select(exract11) %>%
  filter(!is.na(exract11))  %>%
  group_by(exract11) %>%           
  summarise(n = n()) %>%
  mutate(freq = round( n / sum(n)*100, digits = 1)) %>%
  arrange(-n) %>%
  slice(1:10)

top10_physical_activity

## # A tibble: 10 x 3
##    exract11                                            n  freq
##    <fct>                                           <int> <dbl>
##  1 Walking                                        180051  54.4
##  2 Running                                         23152   7  
##  3 Gardening (spading, weeding, digging, filling)  20026   6.1
##  4 Other                                           14119   4.3
##  5 Weight lifting                                  10226   3.1
##  6 Bicycling                                        8565   2.6
##  7 Aerobics video or class                          8154   2.5
##  8 Bicycling machine exercise                       7223   2.2
##  9 Elliptical/EFX machine exercise                  5846   1.8
## 10 Calisthenics                                     4990   1.5

Let’s look more closely at walking, running and bicycling. All three are outdoor physical activities.

# prepare data
selected_brfss2013 %>%
  select(exract11, GHP, genhlth) %>%
  filter(!is.na(GHP))  %>%
  filter(!is.na(exract11))  %>%
  filter(!is.na(genhlth))  %>%
  filter(as.character(exract11) %in% c("Walking","Running","Bicycling"))  %>%
  group_by(GHP, exract11,genhlth) %>%           
  summarise(n = n()) %>%
  mutate(freq = round(n / sum(n), digits = 4)) %>%
  ggplot(aes(x = genhlth, 
             y = freq,
             fill = exract11
             )) + 
  geom_col(position = 'dodge')  +
  facet_wrap('GHP') +
  
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('Frequency of self-esteem health for selected "Type Of Physical Activity" by GHP') +
  labs(y = "self-esteem health (frequency)", 
       x = "self-esteem health")

It looks like people who run have the best self-esteem health.
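The visual impression can be backed by a single number per activity, for example the share of respondents who rate their health “Excellent” or “Very good”. The following is a minimal sketch on made-up data (toy_activity and the share_top column are hypothetical names, and the values are not the survey data).

```r
library(dplyr)

# Hypothetical toy data (values are made up, not the survey data)
toy_activity <- data.frame(
  exract11 = c(rep("Walking", 4), rep("Running", 4), rep("Bicycling", 4)),
  genhlth  = c("Excellent", "Good", "Good", "Fair",
               "Excellent", "Excellent", "Very good", "Good",
               "Excellent", "Very good", "Good", "Good"),
  stringsAsFactors = FALSE
)

# Share of "Excellent" or "Very good" answers per activity, highest first
top_share <- toy_activity %>%
  group_by(exract11) %>%
  summarise(share_top = mean(genhlth %in% c("Excellent", "Very good"))) %>%
  arrange(desc(share_top))

print(top_share)
```

With the real data, GHP could be added to group_by() so that the comparison is made separately for people with and without general health problems, as in the plot.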

Appendix: List of fields

sex - Respondents Sex. Indicate sex of respondent.
X_ageg5yr - Reported Age In Five-Year Age Categories Calculated Variable.
genhlth - General Health. Would you say that in general your health is.
sleptim1 - How Much Time Do You Sleep.
smokday2 - Frequency Of Days Now Smoking. Do you now smoke cigarettes every day, some days, or not at all?
X_drnkmo4 - Computed Total Number Drinks A Month.
exerany2 - Exercise In Past 30 Days. During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?
exract11 - Type Of Physical Activity. What type of physical activity or exercise did you spend the most time doing during the past month?
n_genhlth - the qualitative column genhlth (General Health) converted to a quantitative one.
X_drnkmo4_gr - created by grouping X_drnkmo4.
GHP - General Health Problems; this variable was created by combining the fields below:

General Health Problems:

qlactlm2 - Activity Limitation Due To Health Problems.
useequip - Health Problems Requiring Special Equipment.
blind - Blind Or Difficulty Seeing.
decide - Difficulty Concentrating Or Remembering.
diffwalk - Difficulty Walking Or Climbing Stairs.
diffdres - Difficulty Dressing Or Bathing.
diffalon - Difficulty Doing Errands Alone.