Summary

This document is the report for the final project of the Introduction to Probability and Data course. The project consisted of exploring a dataset: the CDC’s 2013 Behavioral Risk Factor Surveillance System (https://www.cdc.gov/brfss/).

The task is divided into three stages. In the first, the student writes an analytical introduction to the dataset. The second consists of formulating three research questions about the data, using the codebook. The third and final stage consists of exploratory data analysis aimed at answering the research questions developed in the second stage.

The author considered the data suitable for generalization, but not for causal inference.

The research questions chosen - and their respective results - were:

1- Is voter turnout associated with hours of sleep?

There is some apparent association in the results. It should be noted, however, that the many missing cases limit the certainty of this association.

2- Does marital status affect life satisfaction? In addition to that original question, is this reported differently between the sexes?

The number of missing values was high: after omitting NA, approximately 11,000 cases remained. Within these, it was possible to detect a visible weight of marital status in satisfaction with life. Among the cases analyzed, being married seems to be the status most associated with ‘very satisfied’, and being separated the status most associated with ‘very dissatisfied’. Women are generally slightly less satisfied than men, and this dissatisfaction is most evident among women who are “a member of an unmarried couple” or “separated”.

3- Is general health associated with having children or not? Is this different between fathers and mothers? Does the number of children make any difference?

More than 400,000 cases remained after excluding NA. This strengthens the ability to generalize from the data, but does not establish causality (see Part 1 below). Not having children at home was associated with a higher share of people in “Fair” or “Poor” health. Here, the gender variable was not relevant. Having more than one child has some relation to the state of health, although it does not seem to be strong.

Additional note: it is not possible to infer causality in any of the cases, as mentioned in Part 1.

Setup

Part 1: Data

The survey covers all 50 states and other territories such as Guam and American Samoa, among others. Given this breadth and the random selection of respondents described below, the results appear to be generalizable to the adult population of the US and its territories. From what I learned in the course, however, causal conclusions are not possible, because there is no random assignment: this is an observational study.

According to the website https://www.cdc.gov/brfss/ (accessed on 06/10/2020):

“BRFSS is a cross-sectional telephone survey that state health departments conduct monthly over landline telephones and cellular telephones with a standardized questionnaire and technical and methodological assistance from CDC. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.” Given this methodology, I still believe that only generalization is possible, but not causality.

See below the number of observations and variables

## [1] 491775    330

As can be seen above, the dataset consists of 491,775 observations and 330 variables. Not all observations have values for all variables, so data quality is handled separately for each question below.


Part 2: Research questions

Research question 1: In 2013 the US had a Democratic administration, and research in my area of interest, Political Science, shows that partisanship influences whether one feels satisfied with the government of the day. My first idea was to start from the assumption that satisfaction with the government is lower among Republicans than among Democrats, so Republicans might sleep less. That would be an extravagant analysis, I know. Let’s see what the data tells us. Unfortunately, the data does not record which candidate the respondent voted for, only whether or not he or she voted in the last presidential election. I didn’t want to give up and decided to continue with the theme of voting.

So my idea is to relate having voted / not voted to hours of sleep. As this is an exercise, I did not worry that the question might seem strange, since it meets the criteria of the activity.

Question: Is there a relationship between having voted in the last election and the hours of sleep?

Variables used:

scntvot1: Did You Vote In The Last Presidential Election?

sleptim1: How Much Time Do You Sleep


Research question 2:

Question:

How does Marital Status affect Satisfaction With Life? How is this reported differently between the sexes?

Variables used:

lsatisfy: Satisfaction With Life

marital: Marital Status

sex: Respondents Sex


Research question 3:

Question: Is parenting associated with general health (taken here as a proxy for happiness)? Does this differ between fathers and mothers? Does the number of children at home make a difference?

Important note: I didn’t find a variable about having children or not, so I used one about the number of children living at home.

Variables used:

genhlth: General Health

X_chldcnt: Computed Number Of Children In Household (I put the X in front, as I saw in the example from week 4)

sex: Respondents Sex


Part 3: Exploratory data analysis

Research question 1: Is there a relationship between having voted in the last election and the hours of sleep?

## [1] 64297     2

I created my own data subset for question 1, called q1. The objective is to focus only on the variables of interest for research question 1. As the output above shows, q1 has 64,297 observations of 2 variables. The counts of voters / non-voters in q1 are shown below.

## 
##   Yes    No 
## 52564 11733

52,564 voted and 11,733 did not vote.
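The subsetting step described above can be sketched in base R. This is only an illustration under stated assumptions: the tiny data frame `toy` is a hypothetical stand-in for brfss2013, the object name `q1_toy` is made up, and the original report may well have used dplyr rather than base R.

```r
# Hypothetical stand-in for brfss2013 (the real data has 491,775 rows).
toy <- data.frame(
  scntvot1 = factor(c("Yes", "No", NA, "Yes", "Yes", NA), levels = c("Yes", "No")),
  sleptim1 = c(7, 5, 8, NA, 6, 7)
)

# Keep only the two variables of interest ...
q1_toy <- toy[, c("scntvot1", "sleptim1")]
# ... and drop every row that has an NA in either of them.
q1_toy <- q1_toy[complete.cases(q1_toy), ]

dim(q1_toy)             # rows and columns that survive the filter
table(q1_toy$scntvot1)  # voters vs non-voters among the complete cases
```

Applied to the real data, the same two calls would be expected to reproduce the dimensions and the Yes / No counts printed above, provided the NA handling matches what the author did.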

Looking at the full dataset (brfss2013), the remaining answers to the voting question were distributed as follows:

104 answered Don’t know / Not Sure.

494 were classified as: Not applicable (I did not register, I am not a U.S. citizen, or I am not eligible to vote).

352 were classified as Refused.

425,904 were recorded as NA [Missing].

This can create a problem with the data, given the high number of missing cases. We will proceed with the analysis, with the reservation that this limits the conclusions.

##      
##               1         2         3         4         5         6         7
##   Yes 0.5789474 0.6747967 0.6751152 0.6780119 0.7466945 0.8026931 0.8589254
##   No  0.4210526 0.3252033 0.3248848 0.3219881 0.2533055 0.1973069 0.1410746
##      
##               8         9        10        11        12        13        14
##   Yes 0.8278177 0.8297386 0.7762500 0.6470588 0.7443182 0.6388889 0.7857143
##   No  0.1721823 0.1702614 0.2237500 0.3529412 0.2556818 0.3611111 0.2142857
##      
##              15        16        17        18        19        20        23
##   Yes 0.6938776 0.8030303 1.0000000 0.7857143 1.0000000 0.6363636 0.0000000
##   No  0.3061224 0.1969697 0.0000000 0.2142857 0.0000000 0.3636364 1.0000000
##      
##              24
##   Yes 1.0000000
##   No  0.0000000

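The table above gives a first impression of the relationship in the data. Each column corresponds to the reported average hours of sleep per day, and each cell is the share of respondents in that column who voted or did not vote. For example, among those who sleep an average of one hour a day, about 58% voted and 42% did not. The columns with shares of exactly 0 or 1 (17, 19, 23 and 24 hours) suggest very few respondents in those groups, so they should not be over-interpreted.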
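A table of this kind can be obtained by normalising a two-way table within each column. The sketch below uses made-up vectors (`vote`, `sleep`) rather than the real q1 data, and it is only one plausible way to produce such output with base R.

```r
# Hypothetical toy answers: eight respondents, two sleep durations.
vote  <- factor(c("Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No"),
                levels = c("Yes", "No"))
sleep <- c(5, 5, 5, 5, 7, 7, 7, 7)

# Cross-tabulate, then divide each cell by its column total (margin = 2),
# so that every sleep-duration column sums to 1.
counts    <- table(vote, sleep)
col_share <- prop.table(counts, margin = 2)

col_share
colSums(col_share)  # each column should add up to 1
```

Because the shares are computed within columns, they answer "of the people who sleep x hours, what fraction voted?", which is how the table above is read.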

According to a study reported in the Brazilian press, sleeping between 6 and 8 hours is recommended. Source: https://revistagalileu.globo.com/Ciencia/Saude/noticia/2018/08/ter-entre-seis-oito-horas-de-sono-e-ideal-para-saude-afirma-estudo.html

## # A tibble: 2 x 2
##   scntvot1 count
##   <fct>    <int>
## 1 Yes      43259
## 2 No        8734

As mentioned above, 64,297 respondents had valid answers for both variables.

Of this total, 81.75% (52,564) voted and 18.25% (11,733) did not vote.

Within that total, 51,993 reported sleeping between 6 and 8 hours, the most common recommendation.

Of these, 83.20% (43,259) voted and the other 8,734 (16.80%) did not vote.

The difference of less than 2 percentage points made me suspicious. To check, I looked at the percentages for those who sleep 4 and 5 hours to see whether they change much. For that I took this article as a basis (https://academic.oup.com/sleep/article/20/4/267/2732104). This study reports cumulative sleepiness, mood disturbance and decrements in psychomotor vigilance performance among people whose sleep was restricted to 4-5 hours a night for a week.

## # A tibble: 2 x 2
##   scntvot1 count
##   <fct>    <int>
## 1 Yes       4474
## 2 No        1688

6,162 people reported sleeping between 4 and 5 hours and also answered the question about voting. Of these, 72.61% (4,474) voted and 27.39% (1,688) did not vote.

Comparing the share of voters among those who sleep between 6 and 8 hours with the share among those who sleep between 4 and 5 hours, we notice a difference of about 11 percentage points. The available data and the nature of the activity do not allow us to go further into the matter. Even with the caveat of so many missing cases, the data can motivate future research that associates lifestyle and electoral behavior, includes other variables and fits regressions. The conclusion is therefore that there appears to be an association between sleeping the recommended time and voting, although it cannot be stated with much confidence.
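The percentages in this section can be rechecked directly from the counts printed in the outputs above. The helper name `share` is made up for this sketch; the counts are the ones reported in the tables.

```r
# Percentage of "Yes" answers, rounded to two decimals.
share <- function(yes, no) round(100 * yes / (yes + no), 2)

all_q1   <- share(52564, 11733)  # everyone in q1
sleep_68 <- share(43259, 8734)   # 6 to 8 hours of sleep
sleep_45 <- share(4474, 1688)    # 4 to 5 hours of sleep

c(all = all_q1, six_to_eight = sleep_68, four_to_five = sleep_45)

# Gaps in percentage points
sleep_68 - all_q1    # recommended sleep vs everyone
sleep_68 - sleep_45  # recommended sleep vs short sleep
```

The two gaps come out at roughly 1.5 and 10.6 percentage points, in line with the "less than 2" and "about 11" mentioned in the text.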

Finally, to summarize the findings, I decided to generate a ggplot with sleep time (sleptim1) on the x axis and political participation (scntvot1) on the y axis.

The graph shows that respondents who report the recommended hours of sleep (6, 7 and 8 hours) are more likely to have voted than those reporting other averages (inadequate according to the guidelines mentioned).
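One way such a plot could be drawn is a stacked bar chart scaled to proportions. The sketch below is an assumption about the chart type, not the author's actual code, and `toy` is hypothetical data standing in for q1.

```r
library(ggplot2)

# Hypothetical stand-in for q1.
toy <- data.frame(
  sleptim1 = c(4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9),
  scntvot1 = factor(
    c("Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"),
    levels = c("Yes", "No")
  )
)

# One bar per sleep duration; position = "fill" rescales each bar to 1,
# so the coloured segments show the share of voters and non-voters.
p <- ggplot(toy, aes(x = factor(sleptim1), fill = scntvot1)) +
  geom_bar(position = "fill") +
  labs(x = "Hours of sleep (sleptim1)",
       y = "Share of respondents",
       fill = "Voted (scntvot1)")

print(p)
```

Scaling every bar to 1 keeps sparsely populated sleep durations readable, at the cost of hiding how few respondents some bars represent.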


Research question 2: How does Marital Status affect Satisfaction With Life? How is this reported differently between the sexes?

I chose this question because it is one of my greatest interests. Satisfaction with life seems to me to be the key, most central variable of this whole questionnaire. Given my humanistic background, I looked at marital status, which common sense holds to be a cause of human happiness / unhappiness. Basically, my interest is in explaining life satisfaction through marital status and sex.

First, I created the subset q2 with the variables to be analyzed: lsatisfy, sex, marital

## [1] 11633     3

11,633 people and 3 variables will be analyzed. Given the high number of missing values, the conclusions will have limited validity.

Next, I made a table of the proportions of life satisfaction within each marital status.

##                    
##                         Married    Divorced     Widowed   Separated
##   Very satisfied    0.554726368 0.356775300 0.399409739 0.281914894
##   Satisfied         0.406716418 0.527158376 0.539596655 0.542553191
##   Dissatisfied      0.033582090 0.088050314 0.050172159 0.122340426
##   Very dissatisfied 0.004975124 0.028016009 0.010821446 0.053191489
##                    
##                     Never married A member of an unmarried couple
##   Very satisfied      0.379410064                     0.432203390
##   Satisfied           0.540775014                     0.474576271
##   Dissatisfied        0.057836900                     0.059322034
##   Very dissatisfied   0.021978022                     0.033898305

At first sight, marital status seems to carry some weight in satisfaction with life.

Married people and members of an unmarried couple have the highest percentages of “Very satisfied” (about 55% and 43%, respectively), while separated people have the lowest (about 28%).

Other analyses can be carried out, including the sex of respondents.

##                    
##                           Male     Female
##   Very satisfied    0.47987236 0.45191163
##   Satisfied         0.46146294 0.47929620
##   Dissatisfied      0.04688267 0.05384310
##   Very dissatisfied 0.01178203 0.01494907

When analyzing satisfaction with life by sex alone, there is no apparent impact of the sex variable. As the focus is not on sex but on marital status, the analysis now concentrates on the relationship between marital status and life satisfaction. Later on we will return to the question of the respondents’ sex.

The conclusion obtained from the first table is confirmed: there seems to be an association between marital status and satisfaction with life. The idea now is to try to understand whether the ‘sex’ variable has any impact on this relationship.

When the sex of the respondents is introduced, there does not seem to be much impact. More information follows below, in an attempt to combine the three variables of research question 2 (q2).

Men seem slightly more satisfied than women, but the difference is very small. The idea now is to break these data down by marital status, in order to understand whether there is a more visible difference within each marital group than in the table above, which pools all marital statuses.

An interesting question that arises from the data above is “What makes women a little less satisfied than men in general?” The answer seems clearer when we look at the breakdown by marital status. In these data, when the marital status is ‘married’, ‘never married’ or ‘widowed’ there seems to be no difference between men and women. Being ‘divorced’ seems, in the sample, to be associated with a slightly higher tendency for women to report ‘very dissatisfied’. The marital statuses ‘member of an unmarried couple’ and ‘separated’ are the ones that seem to best answer this “interesting question”. In other words, these are the statuses in which the women in the sample are most associated with strong dissatisfaction with life.
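The three-variable breakdown discussed above can be sketched with a three-way table in which satisfaction is normalised within every combination of marital status and sex. The data frame `toy` below is entirely hypothetical and uses only two marital statuses to stay short.

```r
sat_levels <- c("Very satisfied", "Satisfied", "Dissatisfied", "Very dissatisfied")

# Hypothetical stand-in for q2: two respondents per marital status and sex.
toy <- data.frame(
  lsatisfy = factor(c("Very satisfied", "Satisfied", "Very satisfied", "Dissatisfied",
                      "Satisfied", "Very satisfied", "Satisfied", "Satisfied"),
                    levels = sat_levels),
  marital  = factor(rep(c("Married", "Separated"), each = 4)),
  sex      = factor(rep(c("Male", "Male", "Female", "Female"), times = 2))
)

counts <- table(toy$lsatisfy, toy$marital, toy$sex,
                dnn = c("lsatisfy", "marital", "sex"))

# margin = c(2, 3): shares of each satisfaction level add up to 1
# within every marital status by sex combination.
share <- prop.table(counts, margin = c(2, 3))

# Flatten to a readable layout: one row per marital status and sex.
ftable(round(share, 2), row.vars = c("marital", "sex"))
```

Reading across one row gives the satisfaction profile of, say, separated women, which can then be compared with the row for separated men.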


Research question 3: My choice of topic is associated with my recent reading of this article on happiness and parenthood in Poland:

https://www.jstor.org/stable/41342815?seq=1

In that study, Poles’ happiness is associated with parenthood. But what about health in general? Common sense says, apparently quite rightly, that whoever is happy is healthy. Based on this assumption, we will look at the association between health condition and having children.

In the same study, women are reported to feel slightly happier with their children than men do. Let’s see in our data whether there is any association with health in general.

There is no direct question about whether or not the respondent has children, so we chose the variable ‘_chldcnt’ (X_chldcnt in R). This is a computed variable, described in the codebook as “Computed Number Of Children In Household”.

Choosing this variable was good because we can also answer the question:

Does the number of children at home make a difference?

Of course, there will be no causality, as mentioned in part 1 of this work.

First, I created q3 with the chosen variables.

## [1] 487538      3

I was very happy that there are 487,538 cases, many more than in the previous questions.

Then I made a table of the proportions of general health within each number of children in the household.

##            
##             No children in household One child in household
##   Excellent               0.15809665             0.20170808
##   Very good               0.31582168             0.33867501
##   Good                    0.31188198             0.30441914
##   Fair                    0.14830327             0.11475812
##   Poor                    0.06589642             0.04043965
##            
##             Two children in household Three children in household
##   Excellent                0.23059243                  0.23018963
##   Very good                0.36780943                  0.35039191
##   Good                     0.28214547                  0.29699115
##   Fair                     0.09227993                  0.09648546
##   Poor                     0.02717275                  0.02594185
##            
##             Four children in household Five or more children in household
##   Excellent                 0.23439988                         0.22704309
##   Very good                 0.32619082                         0.31144131
##   Good                      0.30490806                         0.30044577
##   Fair                      0.10366295                         0.12303120
##   Poor                      0.03083828                         0.03803863

At first glance, respondents with no children in the household are less likely to report excellent health (about 16%) than those with children (about 20% to 23%), a gap of roughly 4 to 8 percentage points. Going from one child to two or more also seems to matter, although the difference is smaller (about 3 percentage points). When looking at the ‘Fair’ and ‘Poor’ conditions, not having children at home is also associated with a higher share of these poorer health conditions.
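The comparisons in this paragraph can be reproduced from the proportions printed in the table above. The vector names below are made up for this sketch; the numbers are copied from the table.

```r
children <- c("none", "one", "two", "three", "four", "five_plus")

# Column shares copied from the proportion table above.
excellent <- setNames(
  c(0.15809665, 0.20170808, 0.23059243, 0.23018963, 0.23439988, 0.22704309),
  children)
fair <- setNames(
  c(0.14830327, 0.11475812, 0.09227993, 0.09648546, 0.10366295, 0.12303120),
  children)
poor <- setNames(
  c(0.06589642, 0.04043965, 0.02717275, 0.02594185, 0.03083828, 0.03803863),
  children)

# Gap in "Excellent" relative to households without children, in percentage points.
gap_excellent <- round(100 * (excellent - excellent[["none"]]), 1)

# Combined share of "Fair" or "Poor" health, in percent.
fair_or_poor <- round(100 * (fair + poor), 1)

gap_excellent
fair_or_poor
```

The combined "Fair" or "Poor" share is highest for households without children (about 21%) and lowest for two children (about 12%), which matches the reading given in the text.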

The data in the graph above confirm the observations made after the proportion table. Next, let’s look at the impact of gender on this relationship.

Apparently, gender did not show much impact. Therefore, we conclude that not having children at home seems to be associated with poorer health. A possible research agenda would be to investigate habits (such as cigarette, drug or alcohol consumption). Having more than one child appeared to have some additional association, although not a strong one. Gender does not seem to matter for health in general, unlike for happiness in the study on Poland cited above.

Conclusion

Initial observation: It is not possible to infer causality in any of the questions, as mentioned in part 1 of this work.

Regarding Research Question 1, it is possible to state that there is some relationship between the variables analyzed. Despite the high number of missing values, the more than 60 thousand cases allow me to state this.

About Research Question 2:

Few cases remain (just over 11 thousand) when those classified as NA are omitted. With this sample, it was possible to detect a visible weight of marital status on satisfaction with life. Among the cases analyzed, being married seems to be the status most associated with ‘very satisfied’. Being separated, in turn, seems to be the status most associated with “very dissatisfied”. The other marital statuses are in intermediate positions in relation to the variable to be explained.

The variable “sex” mattered mainly among women who are “a member of an unmarried couple” or “separated”, who showed a greater chance of being “very dissatisfied” with life.

About Research Question 3:

Important note: I didn’t find a variable about having children or not, so I used one about the number of children living at home.

Not having children at home was associated with a higher share of people in “Fair” or “Poor” health, and having children with a higher share of “Excellent” health. Gender was not relevant. Having more than one child raised the share of excellent health by about 3 percentage points compared with having one child, so this association does not seem to be as strong.

In relation to the original question, and given the high number of cases (more than 400 thousand), the association seems to be more reliable than in the other research questions in this report / exercise. I believe that an exciting research agenda may be here.

The exercise was difficult and took me many hours, and I’m not sure everything is correct. On the other hand, I believe it was an intense learning experience. I look forward to peer correction.

In conclusion, considering that this is a beginner’s exercise, I think the result is acceptable.