Setup

Part 1: Data

This data set was prepared by the Behavioral Risk Factor Surveillance System (BRFSS) in 2013-2014 by collecting data from randomly selected adults (aged 18 years or older) in households.

BRFSS is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.

The data set contains observations about the lifestyle and self-rated health of U.S. residents and does not represent the world population. However, it may still give a general understanding of relationships in health-related data for populations in other countries.

More information about the variables can be found in the appendix at the end of the document.


Part 2: Research questions

In this project, we are going to explore the relationship between people’s lifestyle and their self-rated health.

Research question 1:

As a first question, we are interested in exploring the relationship between sleeping time and self-rated health.

Research question 2:

The second question concerns the relationship between tobacco and alcohol use and self-rated health.

Research question 3:

As a third question, we are going to explore the relationship between physical activity and self-rated health.


Part 3: Exploratory data analysis

Main aspect

First, we are going to create the selected_brfss2013 data set by selecting from the original data set brfss2013 the columns that contain information about sex and age (sex, X_ageg5yr), general health (genhlth), sleeping time (sleptim1), smoking (smokday2), alcohol consumption (X_drnkmo4) and physical activity (exerany2, exract11).

Before moving on, let us stop for a moment and think about some common aspects that can have a significant impact on our habits and health. From our point of view, it is reasonable to suppose that people who have some general health problems may have different lifestyle behavior. For that reason, we are also going to add the columns qlactlm2, useequip, blind, decide, diffwalk, diffdres and diffalon.
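The selection step could look roughly like the sketch below. It is only an illustration: the toy brfss2013 data frame stands in for the real data set, it contains just a few of the columns named above, and other_col is a hypothetical column that is not selected.

```r
# Toy stand-in for brfss2013; values are made up for illustration.
brfss2013 <- data.frame(
  sex       = factor(c("Female", "Female", "Male"), levels = c("Male", "Female")),
  genhlth   = factor(c("Fair", "Good", "Good"),
                     levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
  sleptim1  = c(NA, 6L, 6L),
  qlactlm2  = factor(c("Yes", "No", "No"), levels = c("Yes", "No")),
  diffwalk  = factor(c("Yes", "No", "No"), levels = c("Yes", "No")),
  other_col = c(1L, 2L, 3L)  # hypothetical column we do not need
)

# Lifestyle columns and "General Health Problems" columns (subset for the toy data).
lifestyle_cols <- c("sex", "genhlth", "sleptim1")
ghp_cols       <- c("qlactlm2", "diffwalk")

# Keep only the columns of interest, in the listed order.
selected_brfss2013 <- brfss2013[, c(lifestyle_cols, ghp_cols)]
print(head(selected_brfss2013))
```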

We are going to use this data set throughout our research. Let’s take a quick look at what we got.

##      sex    X_ageg5yr qlactlm2 useequip blind decide diffwalk diffdres diffalon
## 1 Female Age 60 to 64      Yes      Yes    No     No      Yes       No      Yes
## 2 Female Age 50 to 54       No       No    No     No       No       No       No
## 3 Female Age 55 to 59      Yes       No    No     No      Yes       No       No
## 4 Female Age 60 to 64       No       No    No     No       No       No       No
## 5   Male Age 65 to 69       No       No    No     No       No       No       No
## 6 Female Age 45 to 49       No       No    No     No       No       No       No
##     genhlth sleptim1   smokday2 X_drnkmo4 exerany2                   exract11
## 1      Fair       NA Not at all         2       No                       <NA>
## 2      Good        6       <NA>         0      Yes                    Walking
## 3      Good        9  Some days        80       No                       <NA>
## 4 Very good        8       <NA>        16      Yes                    Walking
## 5      Good        6 Not at all        20       No                       <NA>
## 6 Very good        8       <NA>         0      Yes Bicycling machine exercise

Let’s see the structure of our data set.

## 'data.frame':    491775 obs. of  15 variables:
##  $ sex      : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ X_ageg5yr: Factor w/ 13 levels "Age 18 to 24",..: 9 7 8 9 10 6 4 9 7 10 ...
##  $ qlactlm2 : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 1 1 2 2 ...
##  $ useequip : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
##  $ blind    : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ decide   : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diffwalk : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 1 2 2 ...
##  $ diffdres : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diffalon : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
##  $ genhlth  : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ sleptim1 : int  NA 6 9 8 6 8 7 6 8 8 ...
##  $ smokday2 : Factor w/ 3 levels "Every day","Some days",..: 3 NA 2 NA 3 NA 3 1 NA NA ...
##  $ X_drnkmo4: int  2 0 80 16 20 0 1 2 4 0 ...
##  $ exerany2 : Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
##  $ exract11 : Factor w/ 75 levels "Active Gaming Devices (Wii Fit, Dance, Dance revolution)",..: NA 64 NA 64 NA 6 64 64 7 64 ...

For our research purposes, we are only going to touch lightly on “General Health Problems” to make our main research clearer. Let’s look more closely at the structure of that information.

## 'data.frame':    491775 obs. of  7 variables:
##  $ qlactlm2: Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 1 1 2 2 ...
##  $ useequip: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
##  $ blind   : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ decide  : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diffwalk: Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 1 2 2 ...
##  $ diffdres: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diffalon: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...

As we can see, all those columns have a uniform structure and contain 2 values: “Yes” and “No”.

We are going to unite them into one by creating a new column GHP (General Health Problems) that summarizes the information into 2 groups, “Yes” and “No”.

Then we remove the original “General Health Problems” columns from our data set selected_brfss2013.
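A minimal sketch of these two steps is shown below. The exact rule used to build GHP is not stated in the text, so the rule here is an assumption: “Yes” if any indicator is “Yes”, “No” if all indicators are “No”, and NA otherwise. The toy data uses only three of the seven indicator columns.

```r
# Toy data with three of the indicator columns plus one unrelated column.
selected_brfss2013 <- data.frame(
  qlactlm2 = factor(c("Yes", "No", "No", NA),   levels = c("Yes", "No")),
  useequip = factor(c("Yes", "No", "No", "No"), levels = c("Yes", "No")),
  diffwalk = factor(c("No", "No", "Yes", "No"), levels = c("Yes", "No")),
  sleptim1 = c(NA, 6L, 9L, 8L)
)
ghp_cols <- c("qlactlm2", "useequip", "diffwalk")

# Logical matrix: TRUE where an indicator is "Yes", NA where it is missing.
is_yes <- sapply(selected_brfss2013[ghp_cols], function(x) x == "Yes")

# Assumed rule: any "Yes" gives "Yes"; all "No" gives "No"; otherwise NA.
any_yes <- rowSums(is_yes, na.rm = TRUE) > 0
all_no  <- rowSums(!is_yes, na.rm = TRUE) == length(ghp_cols)
selected_brfss2013$GHP <- ifelse(any_yes, "Yes", ifelse(all_no, "No", NA))

# Drop the original indicator columns.
keep_cols <- !(names(selected_brfss2013) %in% ghp_cols)
selected_brfss2013 <- selected_brfss2013[, keep_cols, drop = FALSE]
print(selected_brfss2013)
```

Treating a row with only missing and “No” answers as NA is one possible choice; counting it as “No” would give a larger “No” group.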

Let’s look at the distribution of the population by “General Health Problems” GHP.

As we can see, the share of the population with “General Health Problems” is significant and represents roughly 1/3 of the survey.

Let us look at the unique values of self-rated health (genhlth).

##       genhlth
## 1        Fair
## 2        Good
## 4   Very good
## 9   Excellent
## 28       Poor
## 241      <NA>

We are also going to convert the qualitative column genhlth (General Health) to a quantitative column n_genhlth, which may be useful in the next steps.
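One way to do this conversion is sketched below. The direction of the scale is an assumption inferred from the summaries later in the document, where a higher n_genhlth goes with better health: “Poor” = 1 up to “Excellent” = 5.

```r
# Factor levels in the order shown by str(): best rating first.
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")
genhlth <- factor(c("Fair", "Good", "Very good", "Excellent", "Poor", NA),
                  levels = health_levels)

# as.integer() gives Excellent = 1 ... Poor = 5, so reverse the scale
# (assumed coding: Poor = 1 ... Excellent = 5). NA stays NA.
n_genhlth <- length(health_levels) + 1L - as.integer(genhlth)
print(data.frame(genhlth, n_genhlth))
```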

Let’s get a summary of the distribution of self-rated health before plotting it.

## # A tibble: 2 x 10
##   GHP   Min_h  Q1_h Median_h  Q3_h Max_h Mean_h  Sd_h    N_h freq_h
##   <chr> <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl>  <int>  <dbl>
## 1 No        1     3        4     4     5   3.81 0.890 310693  0.632
## 2 Yes       1     2        3     3     5   2.69 1.09  164532  0.335
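A summary of this shape could be computed along the following lines. This is a base R sketch on toy data, not the original code; it assumes that freq_h is the group size divided by the total number of rows, including rows with a missing GHP, which is consistent with the two shares above not adding up to 1.

```r
# Toy data: GHP group and numeric self-rated health.
d <- data.frame(
  GHP       = c("No", "No", "No", "No", "Yes", "Yes", "Yes", NA),
  n_genhlth = c(5, 4, 4, 3, 3, 2, 1, 4)
)
total_n  <- nrow(d)  # denominator for freq_h (assumption, see lead-in)
complete <- d[!is.na(d$GHP) & !is.na(d$n_genhlth), ]

# Five-number summary plus mean, standard deviation and count for one group.
summarise_group <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  c(Min_h = min(x), Q1_h = q[1], Median_h = median(x), Q3_h = q[2],
    Max_h = max(x), Mean_h = mean(x), Sd_h = sd(x), N_h = length(x))
}

# One row per GHP group.
health_summary <- t(sapply(split(complete$n_genhlth, complete$GHP), summarise_group))
health_summary <- cbind(health_summary, freq_h = health_summary[, "N_h"] / total_n)
print(round(health_summary, 3))
```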

We plot our data set to see the distribution of self-rated health n_genhlth grouped by “General Health Problems” GHP.

It looks like our suggestion about the relationship between self-rated health (genhlth) and general health problems (GHP) was right.

It is reasonable that people without general health problems rate their health better than people with GHP.

Research question 1: Relationship between ‘sleeping time’ and ‘self-rated health’

To start, we are going to plot the distribution of sleeping time sleptim1 grouped by “General Health Problems” GHP.

For now, the distributions of these two groups look very similar, except that the distribution for people with GHP is a little more spread out, which we will get a chance to check. Let’s look at it in more detail.

## # A tibble: 2 x 9
##   GHP   Min_n  Q1_n Median_n  Q3_n Max_n Mean_n  Sd_n    N_n
##   <chr> <int> <int>    <dbl> <int> <int>  <dbl> <dbl>  <int>
## 1 No        1     6        7     8    24   7.10  1.23 308190
## 2 Yes       0     6        7     8    24   6.95  1.83 160367

Now we are going to plot the relationship between ‘sleeping time’ and ‘self-rated health’ using genhlth (General Health), sleptim1 (How Much Time Do You Sleep) and GHP (General Health Problems). To get a clearer picture, the plot focuses on the range of sleeping time that contains most of the data.

These two plots show a clear relationship between ‘sleeping time’ and ‘self-rated health’ and, interestingly, the relationship is stronger for people who have some general health problem. Generally, worse ‘self-rated health’ corresponds to less sleeping time. Let’s look at it more closely.

We are going to plot ‘self-rated health’ against ‘sleeping time’ using genhlth (General Health), sleptim1 (How Much Time Do You Sleep) and GHP (General Health Problems). To do that, we prepare the data to get a clearer picture and plot the frequency distribution of ‘self-rated health’ for each common ‘sleeping time’. To keep the plots from becoming too complicated, we examine people who have and do not have GHP separately.

Some of the more complicated preparation steps are described in comments in the code.

Don’t have General Health Problems

We plot ‘self-rated health’ against ‘sleeping time’ using genhlth, sleptim1 and GHP, where GHP = “No”. In that plot, we get the ‘self-rated health’ frequency for each common ‘sleeping time’.
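The frequencies behind such a plot can be prepared as in the sketch below, which computes the share of each health rating within each sleeping time. The data is a toy example, and the 5 to 9 hour window for “common” sleeping times is an assumption, since the original cut-offs are not stated.

```r
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")
d <- data.frame(
  sleptim1 = c(5, 5, 5, 5, 7, 7, 7, 7, 12, 6, NA),
  genhlth  = factor(c("Good", "Good", "Fair", "Very good",
                      "Excellent", "Very good", "Very good", "Good",
                      "Good", "Fair", "Good"), levels = health_levels),
  GHP      = c(rep("No", 9), "Yes", "No")
)

# Keep respondents without GHP and with a common sleeping time
# (assumed window: 5 to 9 hours). which() drops rows where the test is NA.
common <- d[which(d$GHP == "No" & d$sleptim1 >= 5 & d$sleptim1 <= 9), ]

# Counts of each rating per sleeping time, then row-wise proportions,
# so that the ratings within each sleeping time sum to 1.
counts <- table(common$sleptim1, common$genhlth)
freq   <- prop.table(counts, margin = 1)
print(round(freq, 2))
```

Using margin = 2 instead would give the opposite view, the distribution of sleeping time within each health rating.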

These plots show that the frequency of “Excellent” and “Very good” self-rated health grows with sleeping time, mostly due to a decrease in the “Good” rating. We can also notice that after 7 hours of sleep the distribution does not change much, while ‘self-rated health’ changes significantly in the range between 5 and 7 hours. Because this data set is observational, we cannot say that sleeping time affects ‘self-rated health’, but a relationship between these 2 variables clearly exists.

It can also be interesting to look at the opposite relationship, the distribution of ‘sleeping time’ for each level of ‘self-rated health’, using genhlth, sleptim1 and GHP. In that plot, we get the ‘sleeping time’ frequency for each ‘self-rated health’ level. Let’s do it.

These plots show much the same relationship as the previous ones. We can see significant differences in self-rated health in the range of 5-7 hours.

Have General Health Problems

Let’s do the same work for people who have general health problems, slightly expanding the ‘sleeping time’ range. We plot ‘self-rated health’ against ‘sleeping time’ using genhlth, sleptim1 and GHP, where GHP = “Yes”. In that plot, we get the ‘self-rated health’ frequency for each common ‘sleeping time’.

In this case, the plots show that the frequency of “Very good” and “Good” self-rated health grows with sleeping time, mostly due to a decrease in the “Fair” and “Poor” ratings. We should also notice how strongly the “Poor” rating decreases, from 30% to 9%, as sleeping time grows from 4 to 7 hours.

Let’s plot the distribution of ‘sleeping time’ for each level of ‘self-rated health’.

These plots show much the same relationship as the previous ones, but from a different angle.

Finally, let’s see how the relationship between ‘sleeping time’ and ‘self-rated health’ changes with age, using genhlth, sleptim1 and X_ageg5yr. Note: X_ageg5yr is the calculated variable Reported Age In Five-Year Age Categories.

The distribution shows that the relationship between ‘sleeping time’ and ‘self-rated health’ persists with age. We can also see that the average ‘sleeping time’ stays roughly constant until 50-55 years and starts to grow slightly after that. This leads to an interesting remark: the average ‘sleeping time’ that corresponds to ‘Excellent’ and ‘Very good’ self-rated health before 55 years is more characteristic of ‘Fair’ and ‘Poor’ ratings after 70 years. Possibly, after 55 years we need to sleep more to feel better.

Research question 2: Relationship between ‘tobacco and alcohol use’ and ‘self-rated health’

Tobacco Use

Now we are going to explore the relationship between ‘tobacco and alcohol use’ and ‘self-rated health’. To begin, we plot the distribution of the frequency of days now smoking grouped by gender, using smokday2 and sex.

The plot shows approximately the same distribution of ‘tobacco use’ for males and females. Roughly 65% do not smoke at all, 10% smoke some days and 25% smoke every day.

We are going to plot ‘self-rated health’ against ‘tobacco use’ using genhlth (General Health), smokday2 (Frequency Of Days Now Smoking) and GHP (General Health Problems). As in the first question, we prepare the data to get a clearer picture and plot the frequency distribution of ‘self-rated health’ for each ‘Frequency Of Days Now Smoking’ value. We also separate people who have and do not have GHP.

Don’t have General Health Problems

We plot ‘self-rated health’ against ‘Frequency Of Days Now Smoking’ using genhlth, smokday2 and GHP, where GHP = “No”. In that plot, we get the ‘self-rated health’ frequency for each ‘Frequency Of Days Now Smoking’ value.

These plots show that the frequency of “Excellent” self-rated health grows significantly, from roughly 14.5% to 21.8%, between the “Every day” and “Not at all” smoking values. We can also see the “Fair” rating decrease by roughly a third, from 9.8% to 6.4%.

Have General Health Problems

Let’s do the same work for people who have general health problems.

These plots do not show as clear a picture as the plots for No GHP. Nonetheless, we can see a general improvement in self-rated health from the “Every day” to the “Not at all” smoking value.

It can also be interesting to plot the average self-rated health for ‘tobacco use’. Let’s do it by plotting n_genhlth against smokday2 for each sex, with one panel per GHP group.
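The group means for such a plot could be computed as in this sketch, which uses base R's aggregate() on toy data. The numeric scale (a higher n_genhlth means better health) is the same assumption as before.

```r
# Toy data; one respondent has a missing smoking status.
d <- data.frame(
  smokday2  = c("Every day", "Every day", "Not at all", "Not at all", "Not at all", NA),
  sex       = c("Male", "Male", "Male", "Male", "Female", "Female"),
  GHP       = "No",
  n_genhlth = c(3, 4, 4, 5, 5, 4)
)

# Mean self-rated health for each combination of smoking status, sex and GHP.
# The formula interface drops rows with missing values by default.
means <- aggregate(n_genhlth ~ smokday2 + sex + GHP, data = d, FUN = mean)
print(means)
```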

On the graph, we can see that tobacco use slightly reduces the mean of ‘self-rated health’.

Alcohol Consumption

As our next step, we are interested in exploring the relationship between ‘alcohol use’ and ‘self-rated health’.

To start, let’s plot the distribution of Total Number Drinks A Month, using X_drnkmo4.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is skewed to the right. It is centered at about 0 drinks, with most values roughly between 0 and 100 drinks. The range is roughly 470, and some outliers are present above 100 drinks. Let’s get a summary.

##   Min_n Q1_n Median_n Q3_n Max_n   Mean_n     Sd_n    N_n
## 1     0    0        0    9  2280 11.19984 33.50244 467893

That distribution is quite spread out. For the plot, we will use the data within 2 standard deviations of the mean.
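With the summary above, the upper cut-off is about 11.2 + 2 × 33.5 ≈ 78 drinks per month, and the lower cut-off is negative, so it removes nothing. A minimal sketch of the filter on toy values:

```r
# Toy values for total drinks per month, with one extreme outlier and one NA.
X_drnkmo4 <- c(0, 0, 0, 2, 4, 9, 16, 20, 80, 2280, NA)

m <- mean(X_drnkmo4, na.rm = TRUE)
s <- sd(X_drnkmo4, na.rm = TRUE)

# Keep non-missing values within 2 standard deviations of the mean.
keep    <- !is.na(X_drnkmo4) & abs(X_drnkmo4 - m) <= 2 * s
trimmed <- X_drnkmo4[keep]
print(trimmed)
```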

Let’s plot the relationship between average ‘self-rated health’ and ‘Total Number Drinks A Month’, using n_genhlth and X_drnkmo4.

Let’s try to get something from this plot. First, it looks like 0 drinks a month corresponds to a slightly lower average self-rated health. We can also see that after a certain number of drinks per month (roughly 50) self-rated health starts to decrease, but the plot does not show that relationship clearly. As our next step, we can try to combine the number of drinks into groups. Let’s do it.
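A grouping of this kind can be built with cut(), as sketched below. The group boundaries, the labels, and the conversion from drinks per month to drinks per day (dividing by 30) are all assumptions made for illustration; the original definition of X_drnkmo4_gr is not shown.

```r
# Toy data: drinks per month and numeric self-rated health.
X_drnkmo4 <- c(0, 15, 45, 80, 200)
n_genhlth <- c(3, 4, 4, 3, 2)

# Assumed conversion to average drinks per day.
drinks_per_day <- X_drnkmo4 / 30

# Hypothetical groups: none, up to 1, up to 2, up to 3, more than 3 drinks per day.
# Intervals are closed on the right, e.g. (0, 1].
X_drnkmo4_gr <- cut(drinks_per_day,
                    breaks = c(-Inf, 0, 1, 2, 3, Inf),
                    labels = c("0", "0-1", "1-2", "2-3", "3+"))

# Average self-rated health per group.
group_means <- tapply(n_genhlth, X_drnkmo4_gr, mean)
print(group_means)
```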

We plot it, using X_drnkmo4_gr, GHP, sex and n_genhlth.

That plot gives a clearer understanding of the relationship between ‘Average self-rated health’ and ‘Drinks per day’. For both sexes, average self-rated health increases slightly up to 2 drinks per day and then starts to decrease slightly.

Research question 3: Relationship between ‘physical activity’ and ‘self-rated health’

In our third research question, we are going to explore the relationship between ‘physical activity’ and ‘self-rated health’. To start, we plot the distribution of physical activity, using exerany2, sex and GHP. Note: exerany2 is Physical Exercise In Past 30 Days.

We can see that roughly 80% of respondents with ‘No GHP’ had some physical exercise in the past 30 days, compared with roughly 60% of those with ‘Yes GHP’.

We are going to plot ‘self-rated health’ against ‘physical exercise in past 30 days’ using genhlth (General Health), exerany2 (Physical Exercise In Past 30 Days) and GHP (General Health Problems). We again separate people who have and do not have GHP.

Don’t have General Health Problems

We plot ‘self-rated health’ against ‘physical exercise in past 30 days’ using genhlth, exerany2 and GHP, where GHP = “No”. In that plot, we get the ‘self-rated health’ frequency for each value of ‘Physical Exercise In Past 30 Days’.

These plots show that the frequencies of “Excellent” and “Very good” self-rated health grow from roughly 15.7% to 26.1% and from 34.8% to 42.1%, respectively, with physical activity, mostly due to a decrease in the “Good” rating.

Have General Health Problems

Let’s do the same for people with GHP.

In this case, we can see how strongly “Poor” self-rated health decreases, from 22.1% to 10.5%. Because this data set is observational, we cannot say that physical activity affects ‘self-rated health’, but a relationship between these 2 variables clearly exists.

As the next step, it can be interesting to look briefly at the type of physical activity.

Frequency of “Type Of Physical Activity” for the top 10.

## # A tibble: 10 x 3
##    exract11                                            n  freq
##    <fct>                                           <int> <dbl>
##  1 Walking                                        180051  54.4
##  2 Running                                         23152   7  
##  3 Gardening (spading, weeding, digging, filling)  20026   6.1
##  4 Other                                           14119   4.3
##  5 Weight lifting                                  10226   3.1
##  6 Bicycling                                        8565   2.6
##  7 Aerobics video or class                          8154   2.5
##  8 Bicycling machine exercise                       7223   2.2
##  9 Elliptical/EFX machine exercise                  5846   1.8
## 10 Calisthenics                                     4990   1.5
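A table like this can be produced by counting the activity types and keeping the most frequent ones, as in the sketch below on toy data. It assumes that freq is the percentage of respondents with a non-missing exract11; the denominator used in the original is not stated.

```r
# Toy activity data with one missing value.
exract11 <- factor(c("Walking", "Walking", "Walking", "Running",
                     "Bicycling", NA, "Walking", "Running"))

# table() drops NA by default; sort so the most frequent activity comes first.
counts <- sort(table(exract11), decreasing = TRUE)
top    <- head(counts, 10)

# n is the count, freq the percentage of non-missing answers (assumption).
top_activities <- data.frame(
  exract11 = names(top),
  n        = as.integer(top),
  freq     = round(100 * as.integer(top) / sum(counts), 1)
)
print(top_activities)
```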

Let’s look more closely at walking, running and bicycling. All of these are outdoor physical activities.

It looks like people who run have the best self-rated health.

Appendix: List of fields

General Health Problems: qlactlm2, useequip, blind, decide, diffwalk, diffdres, diffalon