Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

Discussing Generalizability: There are two methods by which data was collected, namely, landline telephonic interviews and cellular telephonic interviews.

1) Landline telephonic interviews: Quoting the BRFSS Overview document, “In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household.”

Thus, one adult is randomly selected from each sampled household, and each adult in that household is equally likely to be selected. The resulting sample is therefore likely to be representative of the population that can be reached by landline, so data collected from landline telephonic interviews are generalisable.

Multistage sampling was employed for the collection of data: households were sampled first, and one adult was then randomly selected within each sampled household.
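A minimal sketch of the household-then-adult selection described above. The `roster` data frame, its columns and the `pick_one` helper are made up for illustration and are not BRFSS data or BRFSS procedure; the sketch only shows the two stages in base R.

```r
set.seed(1)

# Hypothetical roster: 6 households with 1-3 adults each (not BRFSS data)
roster <- data.frame(
  household = c(1, 1, 2, 3, 3, 3, 4, 4, 5, 6, 6),
  adult_id  = 1:11
)

# Stage 1: randomly sample 3 of the households
sampled_households <- sample(unique(roster$household), size = 3)

# Stage 2: randomly pick one adult within each sampled household.
# sample.int() is used so that a one-adult household is handled correctly.
pick_one  <- function(ids) ids[sample.int(length(ids), 1)]
in_sample <- roster[roster$household %in% sampled_households, ]
selected  <- tapply(in_sample$adult_id, in_sample$household, pick_one)

selected
```

Each element of `selected` is named after its household, so it is easy to confirm that exactly one adult was drawn per sampled household.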

2) Cellular telephonic interviews: Quoting the BRFSS Overview document, “In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone”.

Subjects were randomly selected here too. The only difference is that simple random sampling was used in this case.

Discussing causality: The BRFSS team did not perform experiments on randomly sampled subjects to derive data. Hence, while there was random sampling, there was no random assignment. The selected individuals were not divided into control and treatment groups to be treated differently (which is the crux of random assignment).

Hence, we can derive an association between two variables, but not causality. This is supported by the text in the BRFSS Overview document: “The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population.”

The keyword in the quoted text is “linked”.
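For contrast, here is a rough sketch of what random assignment would look like if an experiment had been run. BRFSS did nothing of the sort; the `subjects` data frame and the group labels are hypothetical.

```r
set.seed(42)

# Hypothetical pool of 10 already-sampled subjects (not BRFSS data)
subjects <- data.frame(id = 1:10)

# Random assignment: shuffle the labels so that chance alone decides
# who ends up in the treatment group and who in the control group
subjects$group <- sample(rep(c("treatment", "control"), each = 5))

table(subjects$group)
```

Because the labels are shuffled rather than chosen, any systematic difference between the two groups afterwards could be attributed to the treatment. Without this step, as in BRFSS, only associations can be reported.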

Part 2: Research questions

Research question 1:

It would be interesting to understand how the number of alcoholic drinks consumed varies with employment status. For this purpose, I will make use of 2 variables: ‘employ1’ (employment status) and ‘avedrnk2’ (the average number of alcoholic drinks per day, on the days the respondent drank, over the past 30 days).

Research question 2:

For the US adult population represented by the BRFSS sample, I’m interested in investigating whether there is an association between the number of regular soda/pop drinks consumed per month and the self-reported number of days with poor mental health per month. For this purpose, I’ll focus on 2 variables: ‘menthlth’ and ‘ssbsugar’.

Research question 3:

For the final research question, I want to find out whether there is any association between having healthcare coverage and race. I’m going to utilise 2 variables: ‘rrclass2’ and ‘hlthpln1’.

Part 3: Exploratory data analysis

Research question 1:

#filtering out the na values
emp_vs_alc <- filter(brfss2013, !is.na(avedrnk2), !is.na(employ1))

#selecting the required variables
emp_vs_alc<-select(emp_vs_alc, avedrnk2, employ1)

#calculating summary statistics of drinks per drinking day for each employment status group
simple_emp_vs_alc <- emp_vs_alc %>%
  group_by(employ1) %>%
  summarise(mean_drinks = mean(avedrnk2), median_drinks = median(avedrnk2),
            sd_drinks = sd(avedrnk2), min_drinks = min(avedrnk2), max_drinks = max(avedrnk2))

#plotting the data for analysis
ggplot(simple_emp_vs_alc, aes(employ1, mean_drinks)) + geom_point() + labs(x="employment status", y = "average number of drinks")

#providing a summary of the data (summary statistics, already one row per employment status)
simple_emp_vs_alc

## # A tibble: 8 x 6
##   employ1        mean_drinks median_drinks sd_drinks min_drinks max_drinks
##   <fct>                <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1 Employed for …        2.34             2      2.34          1         76
## 2 Self-employed         2.23             2      2.31          1         76
## 3 Out of work f…        2.76             2      2.94          1         70
## 4 Out of work f…        2.92             2      3.03          1         60
## 5 A homemaker           1.78             1      1.62          1         36
## 6 A student             2.94             2      2.62          1         50
## 7 Retired               1.72             1      1.83          1         76
## 8 Unable to work        2.62             2      3.18          1         76

There appears to be an association between the two variables, and it is logically plausible. For example, students report the highest mean (2.94 drinks), which might be expected given that they tend to be young adults with relatively more free time. On the other hand, we would expect retired people to be relatively old and thus to drink less, and they do report the lowest mean (1.72). People who have been out of work for less than a year suddenly have a lot of free time, which allows for this activity, and they could also be consuming more alcohol to cope with unemployment.

It is unclear why homemakers report such low levels of alcohol consumption, and further analysis may be necessary. The relationships described here are associations only, not causal ones.
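One way to judge whether differences between group means are more than noise is to report the group size and the standard error next to each mean. The sketch below does this in base R on a toy data frame; the values in `toy` are invented and only borrow the column names `employ1` and `avedrnk2`.

```r
# Hypothetical toy data standing in for employ1 / avedrnk2 (not BRFSS values)
toy <- data.frame(
  employ1  = c("A student", "A student", "A student",
               "Retired", "Retired", "Retired", "Retired"),
  avedrnk2 = c(2, 4, 3, 1, 2, 1, 2)
)

# Per-group size, mean, and standard error of the mean
n_by_group    <- tapply(toy$avedrnk2, toy$employ1, length)
mean_by_group <- tapply(toy$avedrnk2, toy$employ1, mean)
se_by_group   <- tapply(toy$avedrnk2, toy$employ1, sd) / sqrt(n_by_group)

data.frame(n = n_by_group, mean = mean_by_group, se = se_by_group)
```

With the real data, a small standard error relative to the gap between two means would suggest the gap is unlikely to be a sampling artefact, though it still would not make the relationship causal.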

Research question 2:

#selecting relevant variables from the data and filtering out the 'na' values
sugarvmh<-select(brfss2013, ssbsugar, menthlth) %>% filter(!is.na(ssbsugar), !is.na(menthlth))

#converting ssbsugar codes to a standard no. of soda drinks/month
#(1xx = per day, 2xx = per week, 3xx = per month; anything else is treated as 0)
sugarvmh <- sugarvmh %>%
  mutate(ssbsugar = as.numeric(ssbsugar),
         ssbsugar = case_when(ssbsugar > 100 & ssbsugar < 200 ~ (ssbsugar - 100) * 30,
                              ssbsugar > 200 & ssbsugar < 300 ~ (ssbsugar - 200) * 4,
                              ssbsugar > 300 & ssbsugar < 400 ~ ssbsugar - 300,
                              TRUE ~ 0))
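The conversion above can be checked on a few sample codes. `to_monthly` is a hypothetical helper written only for this check. It uses the same rules and the same approximations as the code above (30 days per month, 4 weeks per month).

```r
# Hypothetical helper mirroring the conversion above
to_monthly <- function(code) {
  code <- as.numeric(code)
  ifelse(code > 100 & code < 200, (code - 100) * 30,   # 1xx: times per day
  ifelse(code > 200 & code < 300, (code - 200) * 4,    # 2xx: times per week
  ifelse(code > 300 & code < 400, code - 300,          # 3xx: times per month
         0)))                                          # anything else
}

# 3 per day, 2 per week, 15 per month, and none
to_monthly(c(103, 202, 315, 0))
# → 90 8 15 0
```

Note that any code outside the three ranges is mapped to 0, so an unexpected code would silently count as "no soda" rather than raise an error.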

#calculating mean no. of soda drinks for each value of self-reported number of days with poor mental health per month
simplesugarvmh <- sugarvmh %>%
  group_by(menthlth) %>%
  summarise(mean_sugar = mean(ssbsugar), median_sugar = median(ssbsugar),
            sd_sugar = sd(ssbsugar), min_sugar = min(ssbsugar), max_sugar = max(ssbsugar))

#plotting a scatter plot 
ggplot(simplesugarvmh, aes(menthlth, mean_sugar)) + geom_point()

#providing summary statistics (already one row per value of menthlth)
simplesugarvmh

## # A tibble: 31 x 6
##    menthlth mean_sugar median_sugar sd_sugar min_sugar max_sugar
##       <int>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl>
##  1        0       9.48          1       23.8         0      1200
##  2        1       9.15          2       22.2         0       330
##  3        2      10.7           2       26.5         0       900
##  4        3      10.7           2       22.6         0       300
##  5        4      11.9           2       25.7         0       240
##  6        5      11.7           2       26.8         0       720
##  7        6      14.3           2       34.2         0       300
##  8        7      13.6           3       31.7         0       390
##  9        8      14.8           3       34.3         0       300
## 10        9      23.2           6.5     35.5         0       150
## # ... with 21 more rows

There appears to be a strong initial positive association between the mean number of soda drinks/month and the self-reported number of days/month with poor mental health: the mean rises from roughly 9 drinks/month at 0 days to about 23 at 9 days. The association weakens in between and more or less disappears as we move right on the x-axis.

Since this is purely an observational study, no causal relationships can be drawn from this analysis.
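The strength of the initial trend can be summarised with a rank correlation. The sketch below uses only the ten rows printed in the table above, so it describes the group means for 0 to 9 days, not individual respondents and not the remaining 21 rows.

```r
# First ten rows of the summary table printed above
menthlth   <- 0:9
mean_sugar <- c(9.48, 9.15, 10.7, 10.7, 11.9, 11.7, 14.3, 13.6, 14.8, 23.2)

# Spearman rank correlation: a value close to +1 means mean_sugar rises
# almost monotonically with menthlth over this range
rho <- cor(menthlth, mean_sugar, method = "spearman")
rho
```

A correlation between group means tends to look much stronger than the correlation between the underlying individual responses, so this figure should be read as a description of the plot rather than as an effect size.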

Research question 3:

#filtering and selecting the required variables
race_vs_healthcare<-filter(brfss2013, !is.na(hlthpln1), !is.na(rrclass2)) 
race_vs_healthcare<-select(race_vs_healthcare, hlthpln1, rrclass2)

#calculating the percent of people who have healthcare coverage and grouping them by race
simple_race_vs_healthcare <- race_vs_healthcare %>%
  group_by(rrclass2) %>%
  summarise(pcyes1 = sum(hlthpln1 == "Yes") * 100 / n())

#plotting the data
ggplot(simple_race_vs_healthcare, aes(rrclass2, pcyes1)) + geom_point()

#presenting the table (summary statistics: calculating mean, median, sd, min and max doesn't apply here)
simple_race_vs_healthcare

## # A tibble: 7 x 2
##   rrclass2                                pcyes1
##   <fct>                                    <dbl>
## 1 White                                     90.4
## 2 Black or African American                 83.4
## 3 Hispanic or Latino                        68.4
## 4 Asian                                     88.9
## 5 Native Hawaiian or Other Pacific Island  100  
## 6 American Indian or Alaska Native          81.9
## 7 Some Other Group                          88.6

Note: For this question, we only have approximately 3000 entries, because most were filtered out as NA. So these data may not be representative of the entire population.

An association between the 2 variables of interest can be seen. The Hispanic or Latino group has the lowest percentage of people with healthcare coverage (68.4%), while the Native Hawaiian or Other Pacific Islander group has the highest (100%). However, it is possible that there are very few Native Hawaiian or Other Pacific Islander respondents in our dataset, which would explain the 100% figure.
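Whether a 100% figure rests on only a handful of respondents can be checked by reporting the group size next to each percentage. The sketch below shows the idea in base R on a toy data frame; the rows in `toy` are invented and only borrow the column names `rrclass2` and `hlthpln1`.

```r
# Hypothetical toy data standing in for rrclass2 / hlthpln1 (not BRFSS values)
toy <- data.frame(
  rrclass2 = c("White", "White", "White", "White",
               "Asian", "Asian", "Pacific Islander"),
  hlthpln1 = c("Yes", "Yes", "Yes", "No", "Yes", "No", "Yes")
)

# Group size and percentage covered, side by side
n_by_group   <- tapply(toy$hlthpln1, toy$rrclass2, length)
pct_by_group <- tapply(toy$hlthpln1 == "Yes", toy$rrclass2, mean) * 100

data.frame(n = n_by_group, pct_yes = pct_by_group)
```

In the toy data the 100% group contains a single respondent, which is exactly the situation that would make such a percentage uninformative.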

The group with the second-highest healthcare coverage is the White population, at approximately 90%.

We may attribute the varying levels of healthcare coverage to financial status, but we will need further data and analyses to corroborate this assumption.