Exploratory Data Analysis

Part 3: Exploratory data analysis

Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.

Research question 1: Among respondents who sleep less than 7 hours, do more college graduates than high school graduates still report good general health?

We’ll be using the following variables for this question:

sleptim1: The hours of sleep a respondent gets in a 24-hour period.
genhlth: The respondent’s self-reported general health.
X_educag: The respondent’s highest level of education attained.

The types of these variables are:

str(select(brfss2013,sleptim1,genhlth, X_educag))

## 'data.frame':    491775 obs. of  3 variables:
##  $ sleptim1: int  NA 6 9 8 6 8 7 6 8 8 ...
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ X_educag: Factor w/ 4 levels "Did not graduate high school",..: 4 3 4 2 4 4 2 3 4 2 ...

brfss2013 %>%
  filter(sleptim1>24) %>%
  select(sleptim1)

##   sleptim1
## 1      103
## 2      450

Two respondents report more than 24 hours of sleep in a 24-hour period (103 and 450), which is impossible. We filter out these values to get a clean data frame for the visualizations that follow.

rq2_brfss2013 <- brfss2013 %>%
  filter(sleptim1 <= 24)
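
One side effect of this filter is worth illustrating: filter() keeps only rows where the condition is TRUE, so rows with a missing sleptim1 are dropped along with the impossible values. The sketch below uses a small made-up data frame (toy, with hypothetical values) rather than brfss2013.

```r
library(dplyr)

# Toy data standing in for brfss2013 (hypothetical values)
toy <- data.frame(sleptim1 = c(NA, 6, 9, 103, 7, 450))

# NA <= 24 evaluates to NA, and filter() drops rows that are not TRUE
cleaned <- toy %>%
  filter(sleptim1 <= 24)

# To remove impossible values but keep missing ones, test for NA explicitly
kept_na <- toy %>%
  filter(is.na(sleptim1) | sleptim1 <= 24)

nrow(cleaned)  # 3
nrow(kept_na)  # 4
```

For this analysis dropping the missing values is convenient, because the mean computed later needs no extra NA handling.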

Next we plot the distribution of hours slept per 24-hour period.

ggplot(rq2_brfss2013,aes(x=as.factor(sleptim1))) + geom_bar() + ggtitle('Amount of Sleep of Respondents') + xlab('Hours Slept') + theme_bw()

As the plot shows, most respondents in our sample sleep 6 to 8 hours. Next we compute the average hours of sleep.

rq2_brfss2013 %>%
  summarise(avg_sleep = mean(sleptim1))

##   avg_sleep
## 1  7.050986
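
The call to mean() above returns a number only because the filtered data frame no longer contains missing values. A minimal base R sketch with made-up values (sleep_hours is a hypothetical vector) shows what happens otherwise and how na.rm handles it.

```r
# Hypothetical sleep values, including one missing response
sleep_hours <- c(NA, 6, 9, 8, 6, 8, 7)

# Any NA in the input makes the mean NA
mean_with_na <- mean(sleep_hours)

# na.rm = TRUE drops missing values before averaging
mean_without_na <- mean(sleep_hours, na.rm = TRUE)

mean_with_na
mean_without_na
```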

The average amount of sleep in our sample is about 7 hours, so we use 7 hours as the cutoff and focus on respondents who sleep less than that. We start with the distribution of education level.

total_obs <- nrow(brfss2013)
brfss2013 %>%
  filter(!is.na(X_educag)) %>%   # drop respondents with a missing education level
  group_by(X_educag) %>%
  summarise(count = n(), Percentage = n()*100/total_obs)

## # A tibble: 4 × 3
##   X_educag                                    count Percentage
##   <fct>                                       <int>      <dbl>
## 1 Did not graduate high school                42213       8.58
## 2 Graduated high school                      142968      29.1 
## 3 Attended college or technical school       134196      27.3 
## 4 Graduated from college or technical school 170118      34.6
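
The percentages above divide by all respondents, including those with a missing education level, so they do not sum to 100. As a sketch of the alternative, the toy example below (made-up data, not brfss2013) computes percentages among respondents with a known education level using count().

```r
library(dplyr)

# Toy data standing in for brfss2013 (hypothetical values)
toy <- data.frame(
  X_educag = factor(c("Graduated high school", NA,
                      "Graduated high school",
                      "Did not graduate high school"))
)

edu_dist <- toy %>%
  filter(!is.na(X_educag)) %>%               # keep known education levels only
  count(X_educag) %>%                        # one row per level, count in column n
  mutate(Percentage = n * 100 / sum(n))      # denominator is non-missing rows

edu_dist
```

Which denominator is appropriate depends on the question; the report's tables use the full sample size throughout.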

brfss2013 %>%
  filter(sleptim1 < 7) %>%
  group_by(sleptim1, X_educag) %>%
  summarise(count = n(), Percentage = n()*100/total_obs)

## `summarise()` has grouped output by 'sleptim1'. You can override using the
## `.groups` argument.

## # A tibble: 31 × 4
## # Groups:   sleptim1 [7]
##    sleptim1 X_educag                                   count Percentage
##       <int> <fct>                                      <int>      <dbl>
##  1        0 <NA>                                           1   0.000203
##  2        1 Did not graduate high school                  65   0.0132  
##  3        1 Graduated high school                         76   0.0155  
##  4        1 Attended college or technical school          51   0.0104  
##  5        1 Graduated from college or technical school    33   0.00671 
##  6        1 <NA>                                           3   0.000610
##  7        2 Did not graduate high school                 252   0.0512  
##  8        2 Graduated high school                        381   0.0775  
##  9        2 Attended college or technical school         278   0.0565  
## 10        2 Graduated from college or technical school   156   0.0317  
## # … with 21 more rows
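
The message printed above appears because summarise() removes only the last grouping variable and leaves the result grouped by the rest. A minimal sketch with made-up data (toy and its values are hypothetical) shows how the .groups argument makes the choice explicit and silences the message.

```r
library(dplyr)

# Toy data standing in for brfss2013 (hypothetical values)
toy <- data.frame(
  sleptim1 = c(5, 5, 6, 6, 6),
  X_educag = c("A", "B", "A", "A", "B")
)

by_sleep_edu <- toy %>%
  group_by(sleptim1, X_educag) %>%
  summarise(count = n(), .groups = "drop")   # return an ungrouped result

is_grouped_df(by_sleep_edu)  # FALSE
```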

brfss2013 %>%
  filter(sleptim1 < 7) %>%
  group_by(sleptim1, X_educag, genhlth) %>%
  summarise(count = n(), Percentage = n()*100/total_obs)

## `summarise()` has grouped output by 'sleptim1', 'X_educag'. You can override
## using the `.groups` argument.

## # A tibble: 172 × 5
## # Groups:   sleptim1, X_educag [31]
##    sleptim1 X_educag                     genhlth   count Percentage
##       <int> <fct>                        <fct>     <int>      <dbl>
##  1        0 <NA>                         Excellent     1   0.000203
##  2        1 Did not graduate high school Excellent     8   0.00163 
##  3        1 Did not graduate high school Very good     3   0.000610
##  4        1 Did not graduate high school Good         12   0.00244 
##  5        1 Did not graduate high school Fair         17   0.00346 
##  6        1 Did not graduate high school Poor         25   0.00508 
##  7        1 Graduated high school        Excellent     8   0.00163 
##  8        1 Graduated high school        Very good    10   0.00203 
##  9        1 Graduated high school        Good         23   0.00468 
## 10        1 Graduated high school        Fair         13   0.00264 
## # … with 162 more rows

brfss2013 %>%
  filter(sleptim1 < 7, genhlth == "Good") %>%
  group_by(sleptim1, X_educag, genhlth) %>%
  summarise(count = n(), Percentage = n()*100/total_obs)

## `summarise()` has grouped output by 'sleptim1', 'X_educag'. You can override
## using the `.groups` argument.

## # A tibble: 30 × 5
## # Groups:   sleptim1, X_educag [30]
##    sleptim1 X_educag                                   genhlth count Percentage
##       <int> <fct>                                      <fct>   <int>      <dbl>
##  1        1 Did not graduate high school               Good       12   0.00244 
##  2        1 Graduated high school                      Good       23   0.00468 
##  3        1 Attended college or technical school       Good       12   0.00244 
##  4        1 Graduated from college or technical school Good       10   0.00203 
##  5        1 <NA>                                       Good        1   0.000203
##  6        2 Did not graduate high school               Good       45   0.00915 
##  7        2 Graduated high school                      Good       91   0.0185  
##  8        2 Attended college or technical school       Good       53   0.0108  
##  9        2 Graduated from college or technical school Good       42   0.00854 
## 10        2 <NA>                                       Good        3   0.000610
## # … with 20 more rows

brfss2013 %>%
  # keep only college graduates with "Good" health who sleep less than 7 hours
  filter(sleptim1 < 7, genhlth == "Good", X_educag == "Graduated from college or technical school") %>%
  group_by(sleptim1, X_educag, genhlth) %>%
  summarise(count = n(), Percentage = n()*100/total_obs)

## `summarise()` has grouped output by 'sleptim1', 'X_educag'. You can override
## using the `.groups` argument.

## # A tibble: 6 × 5
## # Groups:   sleptim1, X_educag [6]
##   sleptim1 X_educag                                   genhlth count Percentage
##      <int> <fct>                                      <fct>   <int>      <dbl>
## 1        1 Graduated from college or technical school Good       10    0.00203
## 2        2 Graduated from college or technical school Good       42    0.00854
## 3        3 Graduated from college or technical school Good      136    0.0277 
## 4        4 Graduated from college or technical school Good      804    0.163  
## 5        5 Graduated from college or technical school Good     2560    0.521  
## 6        6 Graduated from college or technical school Good    10002    2.03

sum(0.00203, 0.00854, 0.02765, 0.16349, 0.520256, 2.03386)/6

## [1] 0.4593043

brfss2013 %>%
  # keep only high school graduates with "Good" health who sleep less than 7 hours
  filter(sleptim1 < 7, genhlth == "Good", X_educag == "Graduated high school") %>%
  group_by(sleptim1, X_educag, genhlth) %>%
  summarise(count = n(), Percentage = n()*100/total_obs)

## `summarise()` has grouped output by 'sleptim1', 'X_educag'. You can override
## using the `.groups` argument.

## # A tibble: 6 × 5
## # Groups:   sleptim1, X_educag [6]
##   sleptim1 X_educag              genhlth count Percentage
##      <int> <fct>                 <fct>   <int>      <dbl>
## 1        1 Graduated high school Good       23    0.00468
## 2        2 Graduated high school Good       91    0.0185 
## 3        3 Graduated high school Good      323    0.0657 
## 4        4 Graduated high school Good     1554    0.316  
## 5        5 Graduated high school Good     3744    0.761  
## 6        6 Graduated high school Good    11574    2.35

After filtering to college graduates and high school graduates, we have the respondents who sleep less than 7 hours (our average) and still report "Good" general health. Which group is larger: college graduates or high school graduates?

# Counts of respondents with "Good" health who sleep less than 7 hours,
# summed over 1 to 6 hours from the two tables above
abc <- c(sum(23,91,323,1554,3744,11574), sum(10,42,136,804,2560,10002))
labels <- c("Graduated High School", "Graduated College")

# Each group as a percentage of all 491775 respondents
piepercent <- round(100*abc/491775,1)

pie(abc, labels = piepercent, main = "College vs High School Graduates who sleep less than 7 hours", col = rainbow(length(abc)))
legend("topright", labels, cex = 0.8,
   fill = rainbow(length(abc)))

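Therefore, among respondents who sleep less than 7 hours, more high school graduates (17,309) than college graduates (13,554) report good general health. This is a comparison of raw counts, so it does not account for the different sizes of the two groups.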
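
As a rough check on group size, the counts can be expressed as a share of each education group. The sketch below uses only numbers from the tables above; the names good_short_sleep, group_totals, and share are introduced here for illustration. The group totals include respondents with missing sleep or health values, so these shares are approximate.

```r
# Respondents with "Good" health who sleep less than 7 hours (summed over 1 to 6 hours)
good_short_sleep <- c(
  high_school = sum(23, 91, 323, 1554, 3744, 11574),
  college     = sum(10, 42, 136, 804, 2560, 10002)
)

# Total respondents in each education group, from the X_educag table
group_totals <- c(high_school = 142968, college = 170118)

# Share of each education group, in percent
share <- round(100 * good_short_sleep / group_totals, 1)
share
```

Under these assumptions the share is about 12.1% for high school graduates and about 8.0% for college graduates, which points in the same direction as the raw counts.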

Research question 2: Who is more likely to consume fruits and vegetables one or more times per day: respondents with good or better health, or those with fair or poor health?

We’ll be using the following variables for this question:

X_rfhlth: The respondent’s health status (good or better, or fair or poor).
X_frtlt1: Whether the respondent consumes fruit one or more times per day.
X_veglt1: Whether the respondent consumes vegetables one or more times per day.

str(select(brfss2013,X_rfhlth,X_frtlt1,X_veglt1))

## 'data.frame':    491775 obs. of  3 variables:
##  $ X_rfhlth: Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
##  $ X_frtlt1: Factor w/ 2 levels "Consumed fruit one or more times per day",..: 1 2 2 2 2 1 1 2 1 2 ...
##  $ X_veglt1: Factor w/ 2 levels "Consumed vegetables one or more times per day",..: 2 1 1 1 1 1 1 1 1 1 ...

brfss2013 %>%
  group_by(X_rfhlth) %>%
  summarise(count = n(), percentage = n() * 100/total_obs)

## # A tibble: 3 × 3
##   X_rfhlth               count percentage
##   <fct>                  <int>      <dbl>
## 1 Good or Better Health 395109     80.3  
## 2 Fair or Poor Health    94677     19.3  
## 3 <NA>                    1989      0.404

brfss2013 %>%
  filter(!is.na(X_rfhlth)) %>%   # drop respondents with a missing health status
  group_by(X_rfhlth) %>%
  summarise(count = n(), Percentage = n() *100/total_obs)

## # A tibble: 2 × 3
##   X_rfhlth               count Percentage
##   <fct>                  <int>      <dbl>
## 1 Good or Better Health 395109       80.3
## 2 Fair or Poor Health    94677       19.3

ggplot(rq2_brfss2013,aes(x=X_rfhlth)) + geom_bar() + ggtitle('Respondents') + xlab('Health') + theme_bw()

Around 80% of respondents have good or better health, while about 19% have fair or poor health.

brfss2013 %>%
  group_by(X_frtlt1) %>%
  summarise(count = n(), percentage = n() * 100/total_obs)

## # A tibble: 3 × 3
##   X_frtlt1                                   count percentage
##   <fct>                                      <int>      <dbl>
## 1 Consumed fruit one or more times per day  291729      59.3 
## 2 Consumed fruit less than one time per day 171343      34.8 
## 3 <NA>                                       28703       5.84

ggplot(rq2_brfss2013,aes(x=X_frtlt1)) + geom_bar() + ggtitle('Fruit Consumption by Respondents') + xlab('Consumption of Fruits') + theme_bw()

Around 59% of respondents consume fruit one or more times per day, while 34.8% consume it less than once per day. The remaining 5.8% have no recorded answer.

brfss2013 %>%
  group_by(X_veglt1) %>%
  summarise(count = n(), percentage = n() * 100/total_obs)

## # A tibble: 3 × 3
##   X_veglt1                                        count percentage
##   <fct>                                           <int>      <dbl>
## 1 Consumed vegetables one or more times per day  359834      73.2 
## 2 Consumed vegetables less than one time per day 101777      20.7 
## 3 <NA>                                            30164       6.13

ggplot(rq2_brfss2013,aes(x=X_veglt1)) + geom_bar() + ggtitle('Vegetable Consumption by Respondents') + xlab('Consumption of Vegetables') + theme_bw()

Around 73% of respondents consume vegetables one or more times per day, while about 21% consume them less than once per day.

ct_f <- table(brfss2013$X_rfhlth,brfss2013$X_frtlt1)
prop.table(ct_f,1)

##                        
##                         Consumed fruit one or more times per day
##   Good or Better Health                                0.6467407
##   Fair or Poor Health                                  0.5602353
##                        
##                         Consumed fruit less than one time per day
##   Good or Better Health                                 0.3532593
##   Fair or Poor Health                                   0.4397647

mosaicplot(prop.table(ct_f,1), main='Health vs Fruit Consumption', xlab='', ylab='Consumption of Fruits')

The proportion table and mosaic plot show that 64.7% of respondents with good or better health consume fruit one or more times per day, compared with 56.0% of those with fair or poor health.
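
The second argument of prop.table() decides which totals the proportions are taken over, which affects how the table should be read. A small base R sketch with made-up counts (the matrix ct and its labels are hypothetical) shows the difference between row and column proportions.

```r
# Hypothetical 2 x 2 table of counts: rows are health status, columns are fruit intake
ct <- matrix(
  c(30, 10, 20, 40), nrow = 2,
  dimnames = list(health = c("Good", "Poor"), fruit = c("Daily", "Less"))
)

# margin = 1: each row sums to 1 (share of fruit intake within a health group)
row_prop <- prop.table(ct, 1)

# margin = 2: each column sums to 1 (share of health status within an intake group)
col_prop <- prop.table(ct, 2)

row_prop
col_prop
```

The report uses margin 1, so each value answers "of the respondents with this health status, what fraction consume fruit at this frequency?"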

ct_v <- table(brfss2013$X_rfhlth,brfss2013$X_veglt1)
prop.table(ct_v,1)

##                        
##                         Consumed vegetables one or more times per day
##   Good or Better Health                                     0.7994668
##   Fair or Poor Health                                       0.6973174
##                        
##                         Consumed vegetables less than one time per day
##   Good or Better Health                                      0.2005332
##   Fair or Poor Health                                        0.3026826

mosaicplot(prop.table(ct_v,1), main='Health vs Vegetable Consumption', xlab='', ylab='Consumption of Vegetables')

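The pattern is the same as for fruit, with higher consumption overall: 79.9% of respondents with good or better health consume vegetables one or more times per day, compared with 69.7% of those with fair or poor health. We can conclude that respondents with good or better health are more likely to consume fruits and vegetables each day. Because this is observational data, the association alone does not show that diet is the reason for their health status.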

Research question 3: Between the two most common bad habits, who is more likely to have good or better health: respondents who drink alcohol or respondents who smoke?

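We’ll be using the following variables for this question:

X_rfhlth: The respondent’s health status.
X_rfsmok3: Whether the respondent is currently a smoker.
X_rfchol: Used here as the indicator of whether the respondent currently drinks alcohol. (The BRFSS codebook appears to describe this calculated variable as a high-cholesterol indicator, so this reading should be checked against the codebook.)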

str(select(brfss2013,X_rfhlth,X_rfsmok3,X_rfchol))

## 'data.frame':    491775 obs. of  3 variables:
##  $ X_rfhlth : Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
##  $ X_rfsmok3: Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 2 1 1 ...
##  $ X_rfchol : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 2 1 2 2 1 ...

To see whether smoking or drinking is associated with poorer health status, let’s first check our selected variables one by one.

total_obs <- nrow(brfss2013)
brfss2013 %>%
  group_by(X_rfhlth) %>%
  summarise(count = n(), percentage = n()*100/total_obs)

## # A tibble: 3 × 3
##   X_rfhlth               count percentage
##   <fct>                  <int>      <dbl>
## 1 Good or Better Health 395109     80.3  
## 2 Fair or Poor Health    94677     19.3  
## 3 <NA>                    1989      0.404

ggplot(brfss2013, aes(x=X_rfhlth)) + geom_bar() + ggtitle('Health Status of Respondents') + xlab('Health Status') + theme_bw()

According to our data, about 80% of respondents have good or better health and about 19% have fair or poor health.

brfss2013 %>%
  group_by(X_rfsmok3) %>%
  summarise(count = n(), percentage = n()*100/total_obs)

## # A tibble: 3 × 3
##   X_rfsmok3  count percentage
##   <fct>      <int>      <dbl>
## 1 No        399786      81.3 
## 2 Yes        76654      15.6 
## 3 <NA>       15335       3.12

ggplot(brfss2013, aes(x=X_rfsmok3)) + geom_bar() + ggtitle('Smoking Status of Respondents') + xlab('Currently a Smoker?') + theme_bw()

About 81% of respondents do not currently smoke, although some of them may have smoked before, and about 16% are current smokers.

brfss2013 %>%
  group_by(X_rfchol) %>%
  summarise(count = n(), percentage = n()*100/total_obs)

## # A tibble: 3 × 3
##   X_rfchol  count percentage
##   <fct>     <int>      <dbl>
## 1 No       236614       48.1
## 2 Yes      183497       37.3
## 3 <NA>      71664       14.6

ggplot(brfss2013, aes(x=X_rfchol)) + geom_bar() + ggtitle('Alcohol Status of Respondents') + xlab('Currently an Alcoholic?') + theme_bw()

About 37% of respondents are still drinking alcohol. Next we create a new categorical variable that classifies each respondent as ‘Smoker’, ‘Alcoholic’, ‘Both’, or ‘None’.

brfss2013 <- brfss2013 %>%
  mutate(smoke_alc = ifelse(X_rfchol == 'Yes',
                            ifelse(X_rfsmok3 == 'Yes','Both','Alcoholic'),
                            ifelse(X_rfsmok3 == 'Yes','Smoker','None')))
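
The nested ifelse() calls can be hard to read. As a sketch of an alternative, the same four categories can be written with dplyr's case_when(), one condition per line. The toy data frame and the column names nested and flat are made up for this illustration; rows with a missing answer end up as NA in both versions.

```r
library(dplyr)

# Toy data standing in for brfss2013 (hypothetical values)
toy <- data.frame(
  X_rfchol  = c("Yes", "Yes", "No",  "No", NA,    "Yes"),
  X_rfsmok3 = c("Yes", "No",  "Yes", "No", "Yes", NA)
)

toy <- toy %>%
  mutate(
    # Nested version, as in the report
    nested = ifelse(X_rfchol == 'Yes',
                    ifelse(X_rfsmok3 == 'Yes', 'Both', 'Alcoholic'),
                    ifelse(X_rfsmok3 == 'Yes', 'Smoker', 'None')),
    # Flat version: rows matching no condition become NA
    flat = case_when(
      X_rfchol == "Yes" & X_rfsmok3 == "Yes" ~ "Both",
      X_rfchol == "Yes" & X_rfsmok3 == "No"  ~ "Alcoholic",
      X_rfchol == "No"  & X_rfsmok3 == "Yes" ~ "Smoker",
      X_rfchol == "No"  & X_rfsmok3 == "No"  ~ "None"
    )
  )

# Cross-tabulate the two versions, including missing values
table(toy$nested, toy$flat, useNA = "ifany")
```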

For the distribution of our new variable:

brfss2013 %>%
  group_by(smoke_alc) %>%
  summarise(count = n(), percentage = n()*100/total_obs)

## # A tibble: 5 × 3
##   smoke_alc  count percentage
##   <chr>      <int>      <dbl>
## 1 Alcoholic 152508      31.0 
## 2 Both       26126       5.31
## 3 None      195855      39.8 
## 4 Smoker     33154       6.74
## 5 <NA>       84132      17.1

ggplot(brfss2013,aes(x=smoke_alc)) + geom_bar() + ggtitle('Being an Alcoholic and Smoker Habits of Respondents') + xlab('Alcoholic or Smoker?') +theme_bw()

About 40% of respondents neither smoke nor drink alcohol. About 7% only smoke, about 31% only drink alcohol, and about 5% do both. The remaining 17% are missing at least one of the two answers.

rq1_table <- table(brfss2013$smoke_alc,brfss2013$X_rfhlth)

rq1_table

##            
##             Good or Better Health Fair or Poor Health
##   Alcoholic                114718               37198
##   Both                      15615               10382
##   None                     169910               25264
##   Smoker                    25385                7650

The table shows counts. In both the Alcoholic and the Smoker groups, most respondents still have good or better health. Now let’s calculate the proportions within each group.

prop.table(rq1_table, 1)

##            
##             Good or Better Health Fair or Poor Health
##   Alcoholic             0.7551410           0.2448590
##   Both                  0.6006462           0.3993538
##   None                  0.8705565           0.1294435
##   Smoker                0.7684274           0.2315726

mosaicplot(prop.table(rq1_table, 1),main='Alcoholic and/or Smoker vs Health', xlab='Alcoholic and/or Smoking status', ylab='Health status')

Comparing the Alcoholic and the Smoker groups, we see that they are very close to each other, but the smokers have a slightly higher share with good or better health (76.8%) than the alcoholic group (75.5%).
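
To put a number on how close the two groups are, the counts from the table above can be re-entered by hand and compared directly. This is a base R sketch; the names counts, props, and gap are introduced here for illustration.

```r
# Counts copied from the contingency table above
counts <- matrix(
  c(114718, 37198,
     25385,  7650),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Alcoholic", "Smoker"),
                  c("Good or Better Health", "Fair or Poor Health"))
)

# Row proportions: share of each health status within each habit group
props <- prop.table(counts, 1)

# Difference in the share with good or better health (Smoker minus Alcoholic)
gap <- props["Smoker", "Good or Better Health"] -
  props["Alcoholic", "Good or Better Health"]

round(100 * gap, 1)  # difference in percentage points
```

The gap is roughly 1.3 percentage points, which supports reading the two groups as very similar.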

Tatin

2/21/2022

Setup

Load packages

Refer to the provided data in our google classroom.

Part 1: Research questions