Problem	Point Value	Problem Grade
1	3	____________
2	3	____________
3	4	____________
4	4	____________
5	5	____________
6	6	____________
7	6	____________
8	6	____________
9	6	____________
10	12	____________
11	20	____________
Total	75

The Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. The BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the noninstitutionalized adult population (aged 18 years of age and older) residing in the United States. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews. Over time, the number of states participating in the survey increased, and by 2001, 50 states, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Today, all 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually; American Samoa, the Federated States of Micronesia, and Palau collect survey data over a limited point-in-time (usually 1 to 3 months). In this document, the term state is used to refer to all areas participating in the BRFSS, including the District of Columbia, Guam, and the Commonwealth of Puerto Rico.

Factors assessed by the BRFSS in 2014 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days (health-related quality of life), health care access, inadequate sleep, chronic health conditions, alcohol consumption, oral health, falls, drinking and driving, cancer screenings (including breast, cervical, prostate, and colorectal cancers), and seatbelt use. Since 2011, the BRFSS has conducted both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

Variable	Description
genhlth	Excellent
	Very Good
	Good
	Fair
	Poor
genhlth_bin	Excellent/Very Good/Good
	Fair/Poor
Unhealthy.days	0-30
menthlth	0-30
poorhlth	0-30
imprace	White, Non-Hispanic
	Black, Non-Hispanic
	Asian, Non-Hispanic
	AIAN, Non-Hispanic
	Hispanic
	Other Race, Non-Hispanic
insurance	Yes
	No
trnsgndr	Yes, mtf
	Yes, ftm
	Yes, non-conforming
	No
trnsgndr_bin	Yes
	No
sxorient	Heterosexual
	Homosexual
	Bisexual
	Other
sxorient_bin	Heterosexual
	Other
lsatisfy	Very Satisfied
	Satisfied
	Disatisfied
	Very Disatisfied
lsatisfy_bin	Satisfied
	Disatisfied
emtsuprt	Always
	Usually
	Sometimes
	Rarely
	Never
emtsuprt_bin	Always/Usually
	Sometimes/Rarely/Never
medcost	Yes
	No

Conceptual Questions

(3 points) What does the expectation tell you?

The expectation is the expected value, or long-run average, of a numerically valued random variable. Hence, in a simulation of independent draws of a random variable, the average of all outcomes should converge to the expected value as the number of trials approaches infinity.
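
This convergence can be illustrated with a minimal Python sketch. It assumes a fair six-sided die as the random variable (a hypothetical example, unrelated to the BRFSS data); the function name and seed are illustrative only.

```python
import random

def simulated_mean(n_trials, seed=0):
    """Average of n_trials simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)
    rolls = [rng.randint(1, 6) for _ in range(n_trials)]
    return sum(rolls) / n_trials

# Expected value of a fair die: each face 1..6 has probability 1/6.
expected_value = sum(range(1, 7)) / 6

# The simulated average should drift toward the expected value as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(simulated_mean(n), 3), "expected:", expected_value)
```

With only 10 rolls the average can sit well away from 3.5; with 100,000 rolls it should be very close.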

(3 points) What does the variance tell you?

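Variance tells you how spread out the values of a numerically valued random variable are around the mean, or expectation. It is the expected squared deviation from the mean, so a larger variance means that values tend to fall farther from the mean.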
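
As a small illustration, the Python sketch below computes the variance as the average squared deviation from the mean for two hypothetical data sets that share the same mean but differ in spread. The data and the helper name are assumptions made for the example.

```python
from statistics import mean, pvariance

def variance_by_definition(values):
    """Average squared deviation from the mean (population variance)."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

# Two hypothetical data sets with the same mean (15) but different spread.
tight = [14, 15, 15, 16]
wide = [0, 10, 20, 30]

for name, data in (("tight", tight), ("wide", wide)):
    print(name, "mean:", mean(data), "variance:", variance_by_definition(data))
```

Both data sets have the same expectation, so the mean alone cannot distinguish them; the variance can.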

(4 points) What is the story behind Bernoulli Data?

Bernoulli Data are the outcomes of Bernoulli Trials, and they follow the Bernoulli Distribution. A Bernoulli Trial is a random experiment with only two possible outcomes (the outcome of each trial is binary, such as success or failure), where success occurs with some fixed probability p. A sequence of independent Bernoulli Trials with the same p creates Bernoulli Data, which can be statistically analyzed.
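
A minimal Python sketch of this story follows. It assumes a hypothetical success probability of 0.3 and checks the simulated data against the standard Bernoulli results (mean p, variance p(1 - p)); the function name and seed are illustrative.

```python
import random

def bernoulli_trials(p, n, seed=0):
    """Return n independent 0/1 outcomes, each equal to 1 with probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

p = 0.3  # hypothetical success probability
data = bernoulli_trials(p, 50_000)

sample_mean = sum(data) / len(data)
sample_var = sum((x - sample_mean) ** 2 for x in data) / len(data)

print("sample mean:", round(sample_mean, 3), "theory:", p)
print("sample variance:", round(sample_var, 3), "theory:", p * (1 - p))
```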

(4 points) What are some traits of the normal distribution?

The following are traits of a normal distribution:

i. The distribution is fully described by two parameters: the mean, a measure of central tendency, and the variance (or standard deviation), a measure of variability.

ii. The curve shows the relative frequency of occurrence of a given value.

iii. For normally distributed data, about 68% of data points should be within 1 standard deviation of the mean (expectation), about 95% should be within 2 standard deviations of the mean, and about 99.7% should be within 3 standard deviations of the mean.

iv. Bell-shaped curve

v. Symmetrical about the mean (a skewed distribution is not normal)

vi. Area under curve is equal to 1
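
The shares within 1, 2, and 3 standard deviations can be checked with a short Python sketch that uses the standard library's `statistics.NormalDist` (the helper name `share_within` is made up for this example).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

Because the shares are expressed in standard deviations, the same values hold for any mean and variance.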

(5 points) Does the story below fit the Binomial Distribution? If not, why not?

Bass dwell in a particular lake. There are N bass, of which a simple random sample of size n are caught and tagged (“simple random sample” means that all sets of n bass are equally likely). The caught bass are returned to the population, and then a new sample is drawn, this time with size m. This is an important method that is widely used in ecology, known as capture-recapture.

Is the probability that exactly k of the m bass in the new sample were previously tagged binomial? (Assume that a bass that was caught before doesn’t become more or less likely to be caught again.)

No, this story does not fit the Binomial Distribution. The new sample of m bass is drawn without replacement from a finite population of N bass, so the draws are not independent: the probability that the next bass is tagged changes with each draw as the remaining population shrinks. The Binomial Distribution requires independent trials with a constant probability of success. When sampling without replacement, a different probability mass function (PMF) and cumulative distribution function (CDF) are needed; for this story they are those of the Hypergeometric Distribution.
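
To make the contrast concrete, the Python sketch below compares the hypergeometric PMF (sampling without replacement) with the binomial PMF that would apply if each draw were independent with tagging probability n/N. The lake size, number tagged, and sample size are hypothetical values chosen only for illustration.

```python
from math import comb

def hypergeom_pmf(k, N, n, m):
    """P(exactly k tagged bass in a second sample of m, drawn without
    replacement from N bass of which n are tagged)."""
    if k < 0 or k > m or k > n or m - k > N - n:
        return 0.0
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)

def binomial_pmf(k, m, p):
    """P(exactly k successes in m independent trials with success probability p)."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)

# Hypothetical lake: 50 bass, 10 tagged, second sample of 20.
N, n, m = 50, 10, 20
for k in range(0, 6):
    print(k, round(hypergeom_pmf(k, N, n, m), 4), round(binomial_pmf(k, m, n / N), 4))
```

The two columns differ noticeably here because the sample is a large share of the lake. When N is much larger than m, the binomial becomes a close approximation.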

Data Questions

Our goal here will be to consider the days on which stress, depression, and problems with emotions led to poor mental health. We will address this specifically in the context of the transgender experience, because we do see different rates of depression, suicide, and other outcomes associated with the transgender experience. It is important to begin to understand what the cause might be.

Download the file brfss.rda

Click the link or go to: https://drive.google.com/file/d/188YvZMXQxegZY5oXDoeTa8Lkh8hGbzx4/view?usp=sharing

(6 points) Plot and Describe the Distribution of poor mental health days.

The distribution of poor mental health days is not a normal distribution. Most participants reported between 0 and 5 poor mental health days in the past 30 days, and the largest single group (around 2,500 of the 6,706 observations) reported 0 days, so the distribution is strongly right-skewed. There are also several scattered peaks: at 5, 10, 14, 20, and 30. These are either nice round numbers (5, 10, 20), exactly two weeks (14), or every day in the past 30 days (30). Hence, by looking at a histogram, some patterns of response can be observed.

(6 points) Plot and Describe the Distribution of Race.

## # A tibble: 6 x 2
##   imprace            n
##   <fct>          <int>
## 1 white-non-hisp  6101
## 2 black-non-hisp   164
## 3 asian-non-hisp    81
## 4 aian-non-hisp    104
## 5 hisp             162
## 6 other-non-hisp    94

The chart above shows that the distribution of race in this sample is dominated by the White, non-Hispanic category. The table accompanying the chart shows that 6,101 of the 6,706 observations (91%) identify as White, non-Hispanic. The other categories are so sparsely populated that it can be difficult to make meaningful comparisons between the racial groups.
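
The share quoted for each group is simple arithmetic on the counts in the table. A minimal Python sketch of that calculation, using the counts copied from the table above (the variable names are illustrative, and Python is used here only for the sketch):

```python
# Counts copied from the imprace table above.
race_counts = {
    "white-non-hisp": 6101,
    "black-non-hisp": 164,
    "asian-non-hisp": 81,
    "aian-non-hisp": 104,
    "hisp": 162,
    "other-non-hisp": 94,
}

total = sum(race_counts.values())
shares = {group: count / total for group, count in race_counts.items()}

for group, share in shares.items():
    print(f"{group:<15} {share:6.1%}")
print("total observations:", total)
```

The same calculation applies to the other count tables in this section.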

(6 points) Plot and Describe the Distribution of Transgender Binary.

## # A tibble: 2 x 2
##   trnsgndr_bin     n
##   <fct>        <int>
## 1 yes             43
## 2 no            6663

The above chart and table again show an unevenly distributed sample, this time with regard to self-reported transgender identification. Only 43 of the 6,706 participants (0.64%) identified as transgender. The binary variable was created by collapsing the response options of the trnsgndr variable (“Yes_mtf,” “Yes_ftm,” “Yes_nonconforming,” “No”) into “yes” and “no.” The uneven distribution of observations requires that the trnsgndr categories be collapsed. This, again, makes it difficult to determine meaningful relationships between transgender identification and certain outcomes (e.g., poor mental health days per month, poor general health days per month, etc.).

(6 points) Plot and Describe the Distribution of General Health (not binary).

## # A tibble: 5 x 2
##   genhlth       n
##   <fct>     <int>
## 1 Excellent   827
## 2 Very Good  2347
## 3 Good       2156
## 4 Fair        992
## 5 Poor        384

This chart shows the distribution of general health across all participants in the data set. About two thirds (67%) of participants reported their health to be either “Very Good” or “Good.” The distribution therefore loosely resembles a normal distribution: a somewhat skewed, though clearly discernible, bell-shaped curve with a peak that represents about two thirds of the sample. Because general health is an ordered categorical variable, this resemblance is only qualitative. The extreme responses (“Excellent” and “Poor”) have the fewest data points, representing about 18% of all responses.

(12 points) Display graphs of variables that have relationships with poor mental health days. (Note: Look at the notes with general health to see how to combine multiple plots in a larger grid image. (lect 13 slide 58) )

Note on this question: of all the plots shown, I argue that the only relationship that can be qualitatively discerned (i.e. without formal statistical tests) is between life satisfaction and poor mental health days. This is largely due to uneven distribution of data points in many of the categorical variables.

## # A tibble: 4 x 2
##   lsatisfy             n
##   <fct>            <int>
## 1 Very Satisfied    2638
## 2 Satisfied         3579
## 3 Disatisfied        400
## 4 Very Disatisfied    89

In these plots, specifically the plot on the left, which shows a binary recoding of the life satisfaction variable, we see the only clear, discernible relationship with poor mental health days. Again, the major source of variation in the “Dissatisfied” group is the fact that only about 7% of the sample falls into this group. This leads to high variance and few (in this case zero) plotted outliers. That being said, despite the large variance within the “Dissatisfied” group, this group still appears to experience, on average, more poor mental health days than those participants who responded “Satisfied” or “Very Satisfied” when asked about their overall life satisfaction. This is shown by the fact that the median for the “Dissatisfied” group is higher than the upper bound of the IQR of the “Satisfied” group.

The above graphs show the relationship between poor mental health days and general health. The plot on the left shows general health as a binary variable (either “Excellent/Very Good/Good” or “Fair/Poor”) and the plot on the right shows the relationship between each individual category and poor mental health days. Starting with the plot on the left, there is no clear relationship between self-reported general health and poor mental health days. The “Excellent/Very Good/Good” group showed much less variation, with a non-skewed interquartile range (IQR) between 0 and 5 days. The “Fair/Poor” group, on the other hand, showed a much larger IQR between 0 and 15 days. The IQR in this group is also skewed, likely due to the lower bound of the response options (0 days is the lowest option). That being said, the “Excellent/Very Good/Good” group had many more outliers than the “Fair/Poor” group. As was evident from the analysis in question 9, there are many more people in the “Excellent/Very Good/Good” group (79%) compared to the “Fair/Poor” group (21%). This likely explains the tighter IQR with more outliers for the former group compared to the larger IQR with fewer outliers for the latter group. Despite these differences, the median for both groups is almost identical, at about 3 days. Hence, it would be difficult to show a clear relationship between poor mental health days and self-reported general health.

The categorical plot gives more insight into the relationship between self-reported general health and poor mental health days. Again, we see the “Fair” and “Poor” groups with the largest IQRs. This, again, is largely due to the small size of those groups (6% of the sample for the “Poor” group and 15% for the “Fair” group). The small group size means that almost all of the observed data points lie within the IQR. This is especially true for the “Poor” group, where every respondent falls within the IQR. Still, despite these huge differences in variance, the medians for the groups are all around 3 days, with the “Poor” group being the highest at 5 days.

So here we can say that the categorical plot hints at a slight relationship between self-reported general health and poor mental health days, but only at the extremes (“Excellent” and “Poor”). This makes sense, too. Participants in the “Excellent” group likely reported very few days of poor mental health, while participants in the “Poor” group could have reported up to 30 days of poor mental health.

Overall, there is no clear relationship between self-reported general health and poor mental health days.
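
For readers who want to see how the boxplot quantities discussed above are obtained, here is a minimal Python sketch that computes the median, quartiles, IQR, and the points beyond the usual 1.5 × IQR fences. The responses are hypothetical (not taken from the BRFSS file), and the quartile convention (`method="inclusive"`) is an assumption; plotting software may use a slightly different rule.

```python
from statistics import median, quantiles

def boxplot_summary(values):
    """Median, quartiles, IQR, and points beyond the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"median": median(values), "q1": q1, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical responses (days out of 30), not taken from the BRFSS file.
days = [0, 0, 0, 0, 1, 2, 2, 3, 5, 5, 30]
summary = boxplot_summary(days)
print(summary)
```

A group with a wide IQR has wide fences, so fewer of its points are drawn as outliers. This is one reason a widely spread group can show no outliers at all.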

Based on the above plots, there is no clear relationship between either Unhealthy days and poor mental health days or poor health days and poor mental health days.

Imputed race is the participant's self-reported racial/ethnic identification or, where a participant refused to answer, the most common race/ethnicity response for that region of the state. The plot shows that White, Non-Hispanic participants had the least variation and the most outliers. This is because the White, Non-Hispanic category represented the vast majority of participants in this data set (91%). Hence, the IQR is between 0 and 5 days, with 15 participants above 5 days (75th percentile). The other categories show comparatively more variation, which again makes sense because of the small sample size for those groups.
Still, there appears to be no clear relationship between imputed race and poor mental health days. The medians for each group are around 3 days, with some groups higher and some lower. The main issue in determining a relationship from this plot is that all groups other than White, Non-Hispanic combine to a total of 9% of the sample population. This makes these groups difficult to meaningfully compare.
Overall, there is no clear relationship between imputed race and poor mental health days.

## # A tibble: 2 x 2
##   insurance     n
##   <fct>     <int>
## 1 Yes        6356
## 2 No          350
## # A tibble: 2 x 2
##   medcost     n
##   <fct>   <int>
## 1 Yes       741
## 2 No       5965

The plot on the above left shows the relationship between health insurance status and poor mental health days. Respondents answered the question: “Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?” As the plot shows, there is much more variation in the “No” group. This is due to the fact that only about 5% of the sample do not have health coverage. As a result, almost all of the participants in this group fall within the IQR, with only a few outliers. Again, there does not seem to be a clear relationship between insurance status and poor mental health days because the medians are not different.
On the right is a plot that compares medical cost barriers to poor mental health days. Respondents answered the question: “Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?” The distribution of the observations is skewed towards the “No” group. The table shows that around 11% of participants answered “Yes” to the question, which explains the much larger IQR for this group. Again, we cannot accurately determine a relationship between poor mental health days and medical cost barriers.

Based on the above plots, it seems that there is no clear relationship between sexual orientation and poor mental health days. Again, the main issue is that a very small percentage of the sample does not identify as heterosexual. This makes the variation, noted here by the large IQR, quite large. Such large variance makes it difficult to determine a clear relationship between the two variables. In the case of the categorical plot (plot on the right), we see some difference in median values, but the variation within the groups is still so large that it is not possible to note any relationships between the variables.

## # A tibble: 2 x 2
##   emtsuprt_bin        n
##   <fct>           <int>
## 1 Always/Usually   5555
## 2 Sometimes-Never  1151

The above plots show the relationship between self-reported levels of emotional support and poor mental health days. The left plot shows a binary recoding of the categorical responses. Though there is a difference in median, there still seems to be no discernible relationship between poor mental health days and emotional support level. This is due to the large variance of the “Sometimes/Rarely/Never” group. The variance can, in part, be attributed to the relatively small share of respondents in that group (only 17% responded “Sometimes,” “Rarely,” or “Never”). In the categorical plot on the right, there is also no clear relationship between the two variables. Though the medians for each group differ, the large variance within the groups makes it difficult to determine a meaningful relationship between the two variables.

Lastly, we again see no clear relationship between transgender self-identification and poor mental health days. In the binary plot on the left, the large variance of the “yes” group is due to the small proportion of the sample in that group. Hence, there are no clear outliers and the IQR is quite large. This makes it difficult to discern a clear relationship between the two variables. In the categorical plot on the right, we do see different median levels, but we again see huge variation in the “yes-ftm” and “yes-non-conform” groups. This is largely due to the small proportion of the sample that is in those groups.

(20 points) We have been exploring the outcome of days off due to mental health. Papers have been published suggesting that there is a relationship between the transgender experience and mental health outcomes.

Health related to the transgender experience in general has not been well researched. This is an attempt to get everyone thinking about the number of days in which stress, depression, and other emotions may have led to poor mental health.

Do there appear to be relationships between the transgender experience and the number of days in which stress, depression and emotions led to poor mental health?

Remember: we have 3 things to consider with a hypothesis.

Note: This problem is much more complex than we have data to consider, this is one starting place with data which is available to us.

Note: You will need to use transgender binary to make this work like the other examples.

3 possible ways to interpret the data:

I. The first possibility is that the mean number of poor mental health days is truly higher amongst people who identify as transgender. In this case, we could say that there is a real relationship between the transgender experience and the number of days in which stress, depression, and emotions led to poor mental health.

We can perform a simple comparison of the mean number of poor mental health days reported by transgender-identifying participants and by non-transgender-identifying participants:

## # A tibble: 2 x 2
##   trnsgndr_bin `mean(menthlth, na.rm = T)`
##   <fct>                              <dbl>
## 1 yes                                 7.09
## 2 no                                  5.50
## [1] 1.59

On average, people in this sample who identify as transgender report 1.59 more poor mental health days per month than those who do not identify as transgender (7.09 versus 5.50 days).
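
The group comparison above amounts to a mean per group, ignoring missing answers, followed by a subtraction. A minimal Python sketch of that calculation is given below; the records are hypothetical stand-ins (not the BRFSS responses), and the helper name is illustrative.

```python
from statistics import mean

# Hypothetical (trnsgndr_bin, menthlth) records; None marks a missing answer.
records = [
    ("yes", 10), ("yes", 4), ("yes", None), ("yes", 7),
    ("no", 0), ("no", 5), ("no", 2), ("no", None), ("no", 9),
]

def group_means(rows):
    """Mean days per group, skipping missing values (like na.rm = TRUE in R)."""
    groups = {}
    for group, days in rows:
        if days is not None:
            groups.setdefault(group, []).append(days)
    return {group: mean(values) for group, values in groups.items()}

means = group_means(records)
difference = means["yes"] - means["no"]
print(means, "difference:", difference)
```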

II. The second possibility is that the difference is due to the fact that the groups differ at baseline. This is more complicated to prove, but is a likely reason for the observed difference. Transgender identification status is not the only way in which the groups differ, and other aspects of participants’ identity are not controlled for. Looking back to the interpretations in question 10, we can see that the transgender participant group is not identical to the cisgender participant group.

Since we have already established that life satisfaction is related to poor mental health days, if we can show that the transgender and cisgender groups differ in their life satisfaction responses, we can identify this variable as a potential confounding variable. Hence, it obscures the relationship between gender identity and poor mental health days. The difference in poor mental health days could be, instead, due to the difference in life satisfaction:

Life Satisfaction

The bar plot shows that participants who self-identify as transgender have noticeably different responses to the question regarding life satisfaction. Since life satisfaction has a relationship to poor mental health days (shown both in the boxplot above and in question 10), this variable is likely a confounding variable. Hence, the transgender and cisgender groups appear to differ at baseline, which makes the groups difficult to compare.
This can also be shown more quantitatively on a table (note that the proportions do not match for each category):

## # A tibble: 8 x 3
## # Groups:   lsatisfy [4]
##   lsatisfy         trnsgndr_bin     n
##   <fct>            <fct>        <int>
## 1 Very Satisfied   yes             10
## 2 Very Satisfied   no            2628
## 3 Satisfied        yes             29
## 4 Satisfied        no            3550
## 5 Disatisfied      yes              2
## 6 Disatisfied      no             398
## 7 Very Disatisfied yes              2
## 8 Very Disatisfied no              87

As shown in question 10, life satisfaction is the only variable that shows a clear relationship with poor mental health days. That being said, it is useful to show the other variables in the data set to illustrate the ways in which the transgender group is fundamentally different than the cisgender group. This makes it difficult to compare the groups on specific exposures (e.g., health insurance status, race, etc.) and requires more advanced statistical methods to control for these confounders.
We could only make causal claims if we randomized participants into two groups at baseline and then applied the exposure, which in this case is the experience of being transgender, to one of the two groups. This, of course, is not possible and not ethical for reasons beyond the scope of this exam.
This same logic can be applied to the other variables in the data. Below I have shown this same analysis for the remaining variables.

## # A tibble: 10 x 3
## # Groups:   imprace [6]
##    imprace        trnsgndr_bin     n
##    <fct>          <fct>        <int>
##  1 white-non-hisp yes             37
##  2 white-non-hisp no            6064
##  3 black-non-hisp yes              1
##  4 black-non-hisp no             163
##  5 asian-non-hisp no              81
##  6 aian-non-hisp  yes              3
##  7 aian-non-hisp  no             101
##  8 hisp           yes              2
##  9 hisp           no             160
## 10 other-non-hisp no              94

Emotional Support

General Health

Insurance Status

Prohibitive Medical Costs

Sexual Orientation
III. The third possibility is that the difference is due to random chance.

## # A tibble: 2 x 3
##   trnsgndr_bin     n    freq
##   <fct>        <int>   <dbl>
## 1 yes             43 0.00641
## 2 no            6663 0.994

Amongst participants in the sample, 0.6% of them identify as transgender, while 99.4% identify as cisgender.

## [1] -1.5

In a single simulated draw (output above), the difference in mean poor mental health days per month between the two groups was about -1.5 days. Because the simulation is random, this value changes from run to run.

The histogram shows that most simulated differences fall between 2 days less and 2 days greater, with…

## [1] 0.092

In the simulation, about 9% (0.092) of the randomly generated samples produced a difference of 1.59 days or greater between the two groups. In other words, if there were no real relationship, a difference at least this large would still arise by chance about 9% of the time. Hence, it can be concluded that random chance is a plausible explanation for the observed difference of 1.59 poor mental health days in a 30-day window, and it cannot be ruled out at the conventional 5% level.
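
One common way to carry out this kind of simulation is a permutation test: shuffle the group labels many times and record how often the shuffled difference in means is at least as large as the observed one. The Python sketch below illustrates the idea under stated assumptions. It is not the code that produced the output above, the data are synthetic stand-ins rather than the BRFSS responses, and the function name, group sizes, and number of shuffles are illustrative.

```python
import random

def average(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_shuffles=2_000, seed=0):
    """Share of label shuffles whose mean difference (a - b) is at least
    as large as the observed one (one-sided test)."""
    rng = random.Random(seed)
    observed = average(group_a) - average(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = average(pooled[:n_a]) - average(pooled[n_a:])
        if diff >= observed:
            hits += 1
    return observed, hits / n_shuffles

# Synthetic stand-in data (NOT the BRFSS responses): days out of 30.
data_rng = random.Random(1)
small_group = [data_rng.choice([0, 0, 2, 5, 10, 30]) for _ in range(43)]
large_group = [data_rng.choice([0, 0, 0, 2, 5, 30]) for _ in range(500)]

observed, p_value = permutation_p_value(small_group, large_group)
print("observed difference:", round(observed, 2), "p-value:", p_value)
```

With a group as small as 43 participants, shuffled differences vary widely, so even a difference of a day or two can occur fairly often by chance.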
