Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

The data were collected as part of the Behavioral Risk Factor Surveillance System (BRFSS), a collaborative project between all of the states in the United States (US), participating US territories, and the Centers for Disease Control and Prevention (CDC).

The data is primarily collected through landline and cellular telephone interviews. For the landline interviews, one adult is randomly selected from each household, so the data is collected through random sampling. A possible drawback of this form of data collection is that the data is self-reported by the respondents, so its accuracy needs to be scrutinized.

Because there was no random assignment and the study is purely observational, the data can only be used to establish correlation, not causation.

Part 2: Research questions

Research question 1: Is family size, the number of adults in a household, or the number of children related to the mental health of an individual in any way?

This question is relevant because of an assumption that exists in Indian society about mental health: a commonly held belief is that smaller family sizes and not having children affect mental health negatively. The data available for the US population as a whole lets us examine the question.

Variables used: (* Added by the Project Developer)

menthlth numadult numadultFiltDec* adultcount* children numchildFiltDec* childcount* familysize* familycount*

Research question 2: How does the number of hours slept correlate with the mental health of the respondents, and how do the averages compare between internet users and non-users?

This question again tackles an assumption: that internet usage alone causes people to lose sleep. The question has a flaw, because age is likely to be the main factor deciding internet usage. A graph is also plotted to examine the correlation between mental health and the number of hours slept.

Variables used: (* Added by the Project Developer)

sleptim1 numSleptFiltDec* menthlth SleepCount* menthlthInt* internet menthavg*

Research question 3: Does the mental health of an individual vary significantly with the income levels of the individuals?

It’s a common saying that money doesn’t equal happiness, but does it mean less unhappiness? That is what this question examines. The income level is a categorical variable, so the summary statistic will be the mean number of days mental health was not good in the last 30 days, computed for each income level. This may help in developing a heuristic if the alternative hypothesis is correct.

Variables used: (* Added by the Project Developer)

income2 menthlth menthlthInt* menthavg*

Part 3: Exploratory data analysis

Research question 1:

We will first have to convert the variable numadult to a numerical variable.

brfss2013 <- brfss2013 %>%
  mutate(numadultFiltDec = as.numeric(as.character(numadult)))

A similar conversion will have to be performed for the variable children.

brfss2013 <- brfss2013 %>%
  mutate(numchildFiltDec = as.numeric(as.character(children)))
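The conversion goes through as.character() because numadult is stored as a factor. The sketch below uses a small made-up factor (not the BRFSS data) to show what would happen without that step; the names f, codes and values are only for illustration.

```r
# Toy factor standing in for a column such as numadult (values are made up).
f <- factor(c("10", "2", "45", NA, "2"))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Converting to character first recovers the numbers shown in the labels.
values <- as.numeric(as.character(f))

print(levels(f))
print(codes)
print(values)
```

The level codes depend on the sort order of the labels, so they generally differ from the actual counts; only the second form is safe for arithmetic.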

Finally, to get the family size, the numeric versions of the variables children and numadult are added.

brfss2013 <- brfss2013 %>%
  mutate(familysize = numchildFiltDec + numadultFiltDec)
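One consequence of this sum is worth keeping in mind: a respondent with a missing value in either component gets a missing familysize. A minimal sketch with invented counts (the vectors adults and children here are hypothetical, not columns of brfss2013):

```r
# Hypothetical household counts; NA marks a missing response.
adults   <- c(2, 1, NA, 3)
children <- c(0, 2, 1, NA)

# Addition propagates NA: the total is missing if either part is missing.
familysize <- adults + children
print(familysize)

# Number of rows that would survive a filter on !is.na(familysize).
print(sum(!is.na(familysize)))
```

So any later filter on !is.na(familysize) keeps only respondents who answered both questions.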

Summarizing Statistics

Only a brief selection of the summary statistics is shown to keep the document short.

The output for the numerical conversion of the variable numadult:

brfss2013 %>%
  group_by(numadult, numadultFiltDec) %>%
  summarise(count = n())

## `summarise()` regrouping output by 'numadult' (override with `.groups` argument)

## # A tibble: 19 x 3
## # Groups:   numadult [19]
##    numadult numadultFiltDec  count
##    <fct>              <dbl>  <int>
##  1 1                      1 130467
##  2 2                      2 184626
##  3 3                      3  32180
##  4 4                      4   9985
##  5 5                      5   2161
##  6 6                      6    529
##  7 7                      7     98
##  8 8                      8     38
##  9 9                      9     19
## 10 10                    10      7
## 11 11                    11      1
## 12 12                    12      5
## 13 14                    14      1
## 14 16                    16      2
## 15 17                    17      2
## 16 18                    18      1
## 17 32                    32      1
## 18 45                    45      1
## 19 <NA>                  NA 131651

The variable data2 stores the mean number of children for each value of menthlth. (It is summarized here so that it can be plotted later.)

data2 <- brfss2013 %>%
  filter(!is.na(children)) %>%
  group_by(menthlth) %>%
  summarise(childcount = mean(numchildFiltDec))

## `summarise()` ungrouping output (override with `.groups` argument)

data2

## # A tibble: 32 x 2
##    menthlth childcount
##       <int>      <dbl>
##  1        0      0.479
##  2        1      0.658
##  3        2      0.652
##  4        3      0.637
##  5        4      0.627
##  6        5      0.625
##  7        6      0.557
##  8        7      0.654
##  9        8      0.619
## 10        9      0.602
## # ... with 22 more rows

Plots

Note: The plots have been smoothed.

Plot 1: Number of days Mental Health not good vs Number of Adults in a Household

data1 <- brfss2013 %>%
   filter(!is.na(numadult)) %>%
   group_by(menthlth) %>%
   summarise(adultcount = mean(numadultFiltDec))

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data1, aes(x=menthlth,y=adultcount)) + geom_line() + geom_smooth(method = "gam", formula = y ~ poly(x, 2))

Plot 2: Number of days Mental Health not good vs Number of Children in a Household

ggplot(data2, aes(x=menthlth,y=childcount)) + geom_line() + geom_smooth(method = "gam", formula = y ~ poly(x, 2))

Plot 3: Number of days Mental Health not good vs Number of Total People in a Household

data3 <- brfss2013 %>%
  filter(!is.na(familysize)) %>%
  group_by(menthlth) %>%
  summarise(familycount = mean(familysize))

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data3, aes(x=menthlth,y=familycount)) + geom_line() + geom_smooth(method = "gam", formula = y ~ poly(x, 2))
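The smoothing in these three plots uses the formula y ~ poly(x, 2), that is, a degree-2 polynomial in the number of days. Since the formula contains no smooth terms, the fit should be essentially a quadratic least-squares fit. The sketch below reproduces that idea with lm() on invented numbers; x, y, fit and smoothed are illustrative names, not objects from this analysis.

```r
# Toy stand-in for a summarised table: one mean per value of menthlth (0-30).
# The numbers are invented for illustration only.
x <- 0:30
y <- 2.2 - 0.01 * x + 0.0002 * x^2 + 0.02 * sin(x)

# Degree-2 polynomial least-squares fit, using the same formula as geom_smooth().
fit <- lm(y ~ poly(x, 2))

# Fitted curve evaluated on the observed x values.
smoothed <- predict(fit, newdata = data.frame(x = x))

# The correlation between x and y summarises the direction of the trend.
r <- cor(x, y)
print(r)
```

A fit like this has only three coefficients, so it can show the overall direction of a trend but not local features such as a jump between 0 and 1 days.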

Narrative

Plot 1

One sharp feature occurs in all three plots: a spike in the plotted household mean as the number of days goes from 0 to 1.

The unsmoothed line on its own doesn’t make the relation between the two variables easy to infer. The smoothed plot suggests a very weak negative correlation between Number of days Mental Health not good and Number of Adults in a Household.

Plot 2

Again, the unsmoothed line doesn’t make the relation between the two variables easy to infer. The smoothed plot suggests no correlation between Number of days Mental Health not good and Number of Children in a Household, though some other technique might provide a better way to analyze the relationship between the variables.

Plot 3

Here again, the unsmoothed line doesn’t make the relation between the two variables easy to infer.

The smoothed plot, however, suggests a very weak negative correlation between Number of days Mental Health not good and Number of Total People in a Household. This correlation can probably be attributed to the number of adults in the household, as the earlier plot suggested.

Research question 2:

We will first have to convert the variable sleptim1 to a numerical variable.

brfss2013 <- brfss2013 %>%
   mutate(numSleptFiltDec = as.numeric(as.character(sleptim1)))

A similar conversion will be performed on the variable menthlth, while simultaneously filtering out the NA data points.

brfss2013 <- brfss2013 %>%
      filter(!is.na(menthlth)) %>%
      mutate(menthlthInt = as.numeric(as.character(menthlth)))

Summarizing Statistics

Only a brief selection of the summary statistics is shown to keep the document short.

The output for the plot of Number of days Mental Health not good vs Number of Hours Slept. (A filter is applied to remove the unwanted values of the variable menthlth, keeping only 0 to 30.)

data4 <- brfss2013 %>%
   filter(!is.na(numSleptFiltDec)) %>%
   group_by(menthlth) %>%
   summarise(SleepCount = mean(numSleptFiltDec))

## `summarise()` ungrouping output (override with `.groups` argument)

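# Keep only valid day counts (0 to 30); %in% tests membership row by row,
# whereas == against 0:30 would compare by position and rely on recycling.
data4 <- data4 %>%
   filter(menthlth %in% 0:30)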
data4

## # A tibble: 31 x 2
##    menthlth SleepCount
##       <int>      <dbl>
##  1        0       7.16
##  2        1       7.04
##  3        2       7.00
##  4        3       6.94
##  5        4       6.91
##  6        5       6.92
##  7        6       6.80
##  8        7       6.84
##  9        8       6.90
## 10        9       6.78
## # ... with 21 more rows
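Keeping only the values 0 to 30 is a set-membership test. The sketch below, on an invented vector of group keys (the out-of-range codes are made up), shows why %in% is the safer operator for this and how a position-by-position comparison with == depends on row order.

```r
# Hypothetical group keys: the valid range 0-30 plus two invented invalid codes.
keys <- c(0:30, 247, 5000)

# Set membership: TRUE exactly for the values that lie in 0:30.
keep_in <- keys %in% 0:30

# Reversed keys show why == is fragile: it compares position by position
# and recycles 0:30, so the result depends on the order of the rows.
shuffled <- rev(keys)
keep_eq <- suppressWarnings(shuffled == 0:30)

print(sum(keep_in))
print(sum(keep_eq))
```

With sorted keys starting at 0 the two approaches can happen to agree, which is why the problem is easy to miss.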

The variable internet is categorical, so a suitable summary statistic is an average computed separately for its two levels, Yes and No. Note that the average computed here is the mean of menthlthInt (number of days mental health not good), not the number of hours slept.

# %in% is vectorised and drops NA; the scalar operator || must not be used in filter()
data5 <- brfss2013 %>%
       filter(internet %in% c("Yes", "No")) %>%
       group_by(internet) %>%
       summarise(menthavg = mean(menthlthInt))

## `summarise()` ungrouping output (override with `.groups` argument)

data5 <- data5 %>%
  filter(!is.na(internet))
data5

## # A tibble: 2 x 2
##   internet menthavg
##   <fct>       <dbl>
## 1 Yes          3.19
## 2 No           3.96
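When selecting the Yes and No rows, the test has to be evaluated for every row. The operator || is scalar (recent versions of R raise an error when it is given vectors), so a vectorised test is needed. The following base R sketch uses invented responses to compare the vectorised | with %in% in the presence of a missing answer; internet, days, eq_test, in_test and avg are illustrative names.

```r
# Hypothetical answers to the internet question, including a missing response.
internet <- factor(c("Yes", "No", NA, "Yes", "No"))
# Hypothetical days of poor mental health for the same respondents.
days <- c(2, 5, 10, 4, 3)

# == propagates NA, while %in% returns FALSE for missing values.
eq_test <- internet == "Yes" | internet == "No"
in_test <- internet %in% c("Yes", "No")

# Mean days per answer, using only rows that pass the membership test.
avg <- tapply(days[in_test], droplevels(internet[in_test]), mean)
print(avg)
```

Inside dplyr's filter() both vectorised forms drop the missing row, since filter() discards rows where the condition is NA.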

Plots

Note: The line plot has been smoothed, and a column chart is used for the categorical variable.

Plot 1: Number of days Mental Health not good vs Number of hours slept

ggplot(data4, aes(x=menthlth,y=SleepCount)) + geom_line() + geom_smooth(method = "gam", formula = y ~ poly(x, 2))

Plot 2: Average days Mental Health Not Good for Internet Usage(Yes or No)

ggplot(data5) + geom_col(aes(x = internet, y = menthavg))

Narrative

Plot 1

The unsmoothed line hints at a negative correlation between the two variables, which runs against the intuition that people with poor mental health are prone to sleeping longer. One possible explanation is that people with more stressful jobs have both poorer mental health and shorter sleep hours. But no causation can be implied.

The smoothed plot suggests a strong negative correlation between Number of hours slept and Number of days mental health not good.

Plot 2

This is a visual representation of a summary statistic for the categorical variable internet. The column chart shows a noticeable difference between the two groups in the average number of days of not good mental health: 3.19 for those who answered Yes to internet usage in the last 30 days and 3.96 for those who answered No. This suggests that the assumption behind our question is not supported by this data set. One possible explanation is that economically weaker sections of society may lack access both to the internet and to medication and therapy for mental health.

To put the difference in perspective, take the difference in menthavg: 0.77 days per month for a single person. This is equivalent to 0.77 * 119277, or about 91843 days per month, across the 119277 people who said No to internet usage in the past 30 days.
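The arithmetic can be checked directly. This short sketch only reuses the two group means from the summary table and the respondent count quoted in the text; the variable names are illustrative.

```r
# Group means reported in the summary table for the internet variable.
avg_no  <- 3.96
avg_yes <- 3.19

# Count of "No" respondents as quoted in the text.
n_no <- 119277

# Difference per person, then scaled to the whole "No" group.
diff_days  <- avg_no - avg_yes
total_days <- diff_days * n_no

print(round(diff_days, 2))
print(round(total_days))
```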

Research question 3:

The variable menthlth has already been converted to the numerical variable menthlthInt. So we just need to filter out the missing values of the variable income2 and group by that same variable.

data6 <- brfss2013 %>%
  filter(!is.na(income2)) %>%
  group_by(income2) %>%
  summarise(menthavg = mean(menthlthInt))

## `summarise()` ungrouping output (override with `.groups` argument)

Summarizing Statistics

The mean number of days mental health not good for each income level is listed below.

data6

## # A tibble: 8 x 2
##   income2           menthavg
##   <fct>                <dbl>
## 1 Less than $10,000     7.82
## 2 Less than $15,000     6.35
## 3 Less than $20,000     4.91
## 4 Less than $25,000     4.23
## 5 Less than $35,000     3.35
## 6 Less than $50,000     2.86
## 7 Less than $75,000     2.48
## 8 $75,000 or more       1.93
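A group mean on its own says nothing about how many respondents are behind it or how much the values vary. As a hedged sketch on invented data (the data frame toy and its columns are hypothetical, and this step is not part of the original analysis), the same group_by() and summarise() pattern can report the group size and a standard error alongside the mean:

```r
library(dplyr)

# Invented toy data: an income bracket and days of poor mental health.
toy <- data.frame(
  income = factor(c("low", "low", "low", "high", "high", "high"),
                  levels = c("low", "high")),
  days   = c(8, 6, 10, 1, 2, 3)
)

# Per-group mean together with group size and standard error of the mean.
toy_summary <- toy %>%
  group_by(income) %>%
  summarise(
    respondents = n(),
    menthavg    = mean(days),
    se          = sd(days) / sqrt(n()),
    .groups     = "drop"
  )

print(toy_summary)
```

Applied to the real data, a table like this would make it easier to judge whether differences between neighbouring income levels are larger than the sampling noise.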

Plots

Note: A column chart is used for the categorical variable.

Plot 1: Average days Mental Health Not Good for various Income Levels (1-8)

# Column chart with the income label printed above each bar; x-axis text is hidden
ggplot(data6, aes(x = income2, y = menthavg)) +
  geom_col() +
  geom_text(aes(label = income2), size = 2.5, vjust = -0.5, colour = "black") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

Narrative

Plot 1

The column chart shows a stark pattern: the mean number of days mental health not good falls steadily as household income rises. The strong negative correlation can help us build the heuristic, and it provides motivation to study the possible causal factors linking the two variables.

This also hints that the alternative hypothesis may hold some relevance: more money does mean fewer days of not good mental health. There may be many reasons for this, though.