Setup

Load packages

library(ggplot2)
library(dplyr)
library(ggmosaic)

Load data

load("brfss2013.RData")

Part 1: Data

Behavioural Risk Factor Surveillance System

The BRFSS is a state-based survey started in 1984 by the Centers for Disease Control and Prevention(CDC). The objective is to collect information about adult U.S. residents “regarding their risk behaviors and preventive health practices that can affect their health status”.

  • Observation are collected monthly through phone calls interviews , both to landlines and to cellular telephones.

  • The interviews are collected independently state by state and are reviewed by state statisticians . Different sampling methods are possible and are chosen contextually.

  • Sampling is then divided into landline sampling (80%), in which +18 inhabitants of the same household(which was previously investigated) are randomly selected. Cellular respondents(20%) are weighted as single adult households.

  • “Data weighting” is used to remove all possible bias from the collected samples.

  • This is an observational study, as the data is randomly collected through telephone calls through contextually chosen sample methodologies (the goal is at least 4000 interviews per state). This methodology ensures the generalizability of the results to the U.S. population. On the contrary, while we can establish some correlations between variables, causation cannot be inferred.

Since we do not have yet specific question about the dataset, let’s first take a look at it. This is a huge dataset of 491775 obs. and 330 variables, we do not want to call the str function as the output would be cumbersome.

dim.data.frame(brfss2013)
## [1] 491775    330
View(brfss2013)

Refer to the BRFSS codebook for more information about the variables.


Part 2: Research questions

Research question 1:

Looking at the BRFSS codebook, let’s take for example the numadult variable.

str(brfss2013$numadult)
##  Factor w/ 19 levels "1","2","3","4",..: 2 2 3 2 2 1 2 1 5 2 ...

It’s a categorical variable distributed on 19 levels which denotes the number of adults living within a single household. Let’s filter the null answers from the data and visualize the data in a contingency table :

df <- brfss2013 %>% 
  filter(!is.na(numadult))

df %>%
  count(numadult) %>%
  mutate( p = round(n / sum(n), 3))
##    numadult      n     p
## 1         1 130467 0.362
## 2         2 184626 0.513
## 3         3  32180 0.089
## 4         4   9985 0.028
## 5         5   2161 0.006
## 6         6    529 0.001
## 7         7     98 0.000
## 8         8     38 0.000
## 9         9     19 0.000
## 10       10      7 0.000
## 11       11      1 0.000
## 12       12      5 0.000
## 13       14      1 0.000
## 14       16      2 0.000
## 15       17      2 0.000
## 16       18      1 0.000
## 17       32      1 0.000
## 18       45      1 0.000

Another variable of interest is employ1 which gives information about the professional activity of the subject.

str(df$employ1)
##  Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...

How is distributed the professional activity within a single household? Is there a correlation between the number of adults living within a single household and the most probable employment status ?

Research question 2:

General observations on how general health, happiness, sleep time and employment status are correlated. What is the relationship between employment status and hours of sleep? Or between employment status and general happiness? What we are going to do with extreme outliers?

Research question 3:

Since we have concentrated our focus on the main variables of the data frame, we would like to explore the relationship between declared health and another variable which is related to the habit of smoking, smokday2. Today the effects of smoking on human health are acknowledged. How can we represent some of these effects from this data?


Part 3: Exploratory data analysis

Research question 1:

Distribution of adults per household

The majority of households contain 2 people with a mean of 1.809 people per household. We can visualize the distribution of adults within households with a simple barplot.

ggplot(df, aes(numadult)) +
  geom_bar()

A more interesting representation, to establish a relation between the two variables numadult and employ1 is obtained with barplot filled by employment category.

ggplot(df, aes(x = numadult)) +
   geom_bar(aes( fill = employ1)) +
   labs(x = "Number of adults", y = "Proportion", title = "Adults in a single household")

Each bar contains the relative proportion of the respective employment activity group. For example, we can see clearly that the majority of household with 1 adults are occupied by retired people.

After that, we visualize the same data with a standardized stacked bar plot, in which we can compare the relative size of the column proportions : it’s easier for understanding the fraction of each professional activity category for number of adults in a household.

ggplot(df, aes(x = numadult, fill = employ1)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) +
  labs(x = "Number of adults", y = "Proportion", title = "Professional activity by number of adults in a single household")    

The first graph has a strongly positive skew distribution and the outliers are not showing at all (for example 1 entry with 45 people, one with 32). In fact households with more than 6 people are less than 0.01%. This kind of distribution could be adapted to be transformed with a logarithmic transformation in order to run parametric tests (e.g. for comparing two means). But this goes beyond the scope of this work. Or we could individuate and remove outliers, like we are going to do in the next section.

As we can easily notice from the second graph, the proportions of the different kind of professional activities vary between the groups, and sometimes they exhibit a clear pattern. For example, of the people living alone, the retired are the majority group constituting the 47%; proportion which generally decreases until we get to households composed of 9 people. Curiously, at ten the proportion peaks again (and maybe because this group represent peculiar structures). Since the proportion of the category of employ1 is not the same between the groups of numadult, we can conclude that the two variables are associated (even if to determine it we should run a chi squared test, which goes beyond the scope of this work.)

Research question 2:

Removing outliers

Let’s try to select three groups of variables,which represent three different columns in our data frame:

 df2 <- brfss2013 %>%
 select(children, sleptim1, employ1, genhlth) 

df2 %>%
  group_by(employ1) %>%
  summarise(meansleep = mean(sleptim1, na.rm = TRUE), mediansleep = median(sleptim1, na.rm = TRUE), minsl= min(sleptim1, na.rm = TRUE), maxsl = max(sleptim1, na.rm = TRUE), sd = sd(sleptim1, na.rm= TRUE), count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 7
##   employ1                         meansleep mediansleep minsl maxsl    sd  count
##   <fct>                               <dbl>       <dbl> <int> <int> <dbl>  <int>
## 1 Employed for wages                   6.89           7     1    24  1.21 202200
## 2 Self-employed                        7.08           7     1    22  1.26  39832
## 3 Out of work for 1 year or more       6.91           7     1    24  1.85  14074
## 4 Out of work for less than 1 ye~      6.98           7     0    24  1.67  12242
## 5 A homemaker                          7.19           7     1    24  1.46  31647
## 6 A student                            7.07           7     1    24  1.38  12682
## 7 Retired                              7.35           7     1    24  1.47 138259
## 8 Unable to work                       6.75           6     1    24  2.31  37453
## 9 <NA>                                 7.28           7     1   450  8.26   3386

If we analyze the summary statistics of sleptim1 we can find some extreme outliers. What does it happen if we try to plot an histogram ?

ggplot(df2, aes(sleptim1)) +
geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7387 rows containing non-finite values (stat_bin).

Because of the outliers, The x coordinates extend to 500. If we want a meaningful representation, we have to zoom the graphic on the area of interest(0,24) with the coord_cartesian() function.

ggplot(df2, aes(sleptim1, fill = employ1)) +
geom_histogram(binwidth = 1) + coord_cartesian(c(3,12))
## Warning: Removed 7387 rows containing non-finite values (stat_bin).

The previous summary statistics pose some doubts about the validity of the methodology of the interviews (maybe they were automated?). We have as high as 450 hours(or as low as 1 hour) per day of sleep and 47 children declared. This kind of data can be easily classified into the category of data entry errors (even if it is possible to have 45 children, it is impossible to sleep 450 hours per day and humanly impossible to sleep only 1.)
One way to identify outliers is to determine which points have a z-score that’s higher than 3 and lower than -3 standard deviations. We can use the scores() function from Lukasz Komsta’s outliers package to quickly calculate the z-score for every value in a specific column of our data frame.Let’s try this cleaning operation on the sleptim1 variable :

library(outliers)


d <- df2 %>%
  filter(!is.na(sleptim1)) %>%
  mutate(sleep_outlier_scores = scores(df2$sleptim1[!is.na(df2$sleptim1)])) %>%
  filter(sleep_outlier_scores < 3 & sleep_outlier_scores > -3) 

After this cleaning operation we obtain a more reasonable summary, with hours of sleep ranging from 3 to 11:

summary(d$sleptim1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   7.000   6.997   8.000  11.000
ggplot(d, aes(sleptim1)) + 
  geom_histogram(binwidth = 0.5)

We can now visualize the distribution of the hours of sleep between the employment1 and the genhlth group of variables with different types of visualizations:

ggplot(d, aes(sleptim1, color = employ1)) +
  geom_freqpoly(linetype = 1, size = 1, binwidth = 1) 

ggplot(d, aes(sleptim1,fill = employ1)) +
  geom_histogram(bins = 10, boundary = 3) + facet_wrap(~employ1, scales = "free")

ggplot(d, aes(sleptim1,fill = employ1)) +
  geom_histogram(binwidth = 1) + labs(title = "2")

ggplot(d, aes(genhlth, fill = employ1)) +
  geom_bar() + coord_flip()

From the graphs we can obtain several useful information, which for categorical variables would be harder to extract from just summary statistics:

  • Graph 1, frequency polygon : The majority of the sample is constituted by “employed for wages”, followed by retired people, self employed and so on. The most common sleep time for the groups is given by the highest vertexes.

  • The most common health condition is “very good”, in which the majority group is represented by the “employed for wages”

  • In the “poor” category, the majority group is represented, and it comes not as a surprise, by the “unable to work”. This group also has the lowest mean of hours of sleep (as we have already seen from the table), the highest being found among retired people.

  • Over the 50% quartile of the sleptim1 the majority group are the “retired”.

 d %>%
 filter(sleptim1 > 7) %>%
 group_by(employ1) %>%
 count()
## # A tibble: 9 x 2
## # Groups:   employ1 [9]
##   employ1                              n
##   <fct>                            <int>
## 1 Employed for wages               59361
## 2 Self-employed                    14731
## 3 Out of work for 1 year or more    4999
## 4 Out of work for less than 1 year  4446
## 5 A homemaker                      13241
## 6 A student                         4781
## 7 Retired                          63284
## 8 Unable to work                   11723
## 9 <NA>                              1271

Research question 3:

Let’s first see the percentage of smokers from the sample population, from the computed variable X_smoker3 :

brfss2013 %>%
  count(X_smoker3) %>%
  mutate(p = round((n / sum(n)), 2))
##                               X_smoker3      n    p
## 1 Current smoker - now smokes every day  55162 0.11
## 2 Current smoker - now smokes some days  21495 0.04
## 3                         Former smoker 138134 0.28
## 4                          Never smoked 261651 0.53
## 5                                  <NA>  15333 0.03

We see that roughly half of the sample population \(\sim 53\%\) has never smoked. Let’s compare the declared health condition with the variable, representing the result with a mosaic graphic :

brfss2013 %>%
  filter(!is.na(X_smoker3), !is.na(genhlth)) %>%
  ggplot() +
  geom_mosaic(aes(x = product(X_smoker3), fill=genhlth)) + coord_flip()

We can observe a clear negative tendency between smoking habit and the “excellent” health condition (the relative proportion of the people in excellent health between each group is decreasing from top to bottom), and an inverse tendency for the “poor” condition. Even if these tendencies are less evident for other variables, the variables are clearly associated.


Source for data cleaning :

Data Cleaning Challenge: Outliers on Kaggle