Setup

setwd("E:/QUIZ #1 STAT SOFTWARE/")
getwd()
## [1] "E:/QUIZ #1 STAT SOFTWARE"

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.Rdata")

Refer to the provided data in our google classroom.

Part 1: Research questions

Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcomed to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.

Research quesion 1:

Are people who have completed higher levels of education, more likely employed and earned higher incomes?

(People with higher levels of education are more likely to find employment, remain employed and earn higher incomes relative to those with lower levels of education.)

Research quesion 2:

Do people who had smoked at least 100 cigarettes in his/her lifetime, more likely to have an asthma and get a shot of pneumonia vaccines?

(Cigarette smoke is known to trigger asthma attacks and all people with asthma should get a pneumonia vaccination.)

Research quesion 3:

Are married people more likely to have a children of maximum 9 compared to those never married?

(While women are married, they tend to have very high birth rates.)

Part 3: Exploratory data analysis

Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.

Research quesion 1:

Are people who have completed higher levels of education, more likely employed and earned higher incomes?

We’ll be using the following variables:

• educa: Respondents level of education completed.

• employ1: Respondents employment status.

• income2: Respondents computed salary income.

Type of variables we’re dealing with:

str(select(brfss2013,educa,employ1,income2))
## 'data.frame':    491775 obs. of  3 variables:
##  $ educa  : Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
##  $ employ1: Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
##  $ income2: Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...

To begin, let’s check those people who attained higher levels of education.

total_obs<-nrow(brfss2013)

brfss2013%>%
  filter(educa!="NA")%>%
  group_by(educa)%>%
  summarize(count=n(),percentage=n()*100/total_obs)
## # A tibble: 6 × 3
##   educa                                                         count percentage
##   <fct>                                                         <int>      <dbl>
## 1 Never attended school or only kindergarten                      677      0.138
## 2 Grades 1 through 8 (Elementary)                               13395      2.72 
## 3 Grades 9 though 11 (Some high school)                         28141      5.72 
## 4 Grade 12 or GED (High school graduate)                       142971     29.1  
## 5 College 1 year to 3 years (Some college or technical school) 134197     27.3  
## 6 College 4 years or more (College graduate)                   170120     34.6

As we can see, around 35% of the respondents in our data set are college graduate and followed by high school graduate with the percentage of around 29%.

Then, we will group the data with their corresponding level of education and set their employment status to whose more likely employed for wages.

brfss2013%>%
  group_by(educa,employ1)%>%
  filter(educa!="NA",employ1 == "Employed for wages")%>%
  summarize(count=n(),percentage=n()*100/total_obs)
## `summarise()` has grouped output by 'educa'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   educa [6]
##   educa                                                    employ1 count perce…¹
##   <fct>                                                    <fct>   <int>   <dbl>
## 1 Never attended school or only kindergarten               Employ…   202  0.0411
## 2 Grades 1 through 8 (Elementary)                          Employ…  2536  0.516 
## 3 Grades 9 though 11 (Some high school)                    Employ…  6441  1.31  
## 4 Grade 12 or GED (High school graduate)                   Employ… 49803 10.1   
## 5 College 1 year to 3 years (Some college or technical sc… Employ… 55715 11.3   
## 6 College 4 years or more (College graduate)               Employ… 87090 17.7   
## # … with abbreviated variable name ¹​percentage

Now to answer our question, we can make things a bit easier for ourselves by creating a new categorical variable to categorize a person as: ‘College graduate’, ‘Employed for wages’, ‘Earned $75,000 or more’ to see the difference between the employment status and level of education attained by the respondents, and to identify which more likely to earned higher incomes.

brfss2013%>%
  group_by(educa,employ1,income2)%>%
  filter(educa!="NA",employ1=="Employed for wages", income2=="$75,000 or more")%>%
  summarize(count=n(),percentage=n()*100/total_obs)
## `summarise()` has grouped output by 'educa', 'employ1'. You can override using
## the `.groups` argument.
## # A tibble: 6 × 5
## # Groups:   educa, employ1 [6]
##   educa                                            employ1 income2 count perce…¹
##   <fct>                                            <fct>   <fct>   <int>   <dbl>
## 1 Never attended school or only kindergarten       Employ… $75,00…    18 0.00366
## 2 Grades 1 through 8 (Elementary)                  Employ… $75,00…   102 0.0207 
## 3 Grades 9 though 11 (Some high school)            Employ… $75,00…   416 0.0846 
## 4 Grade 12 or GED (High school graduate)           Employ… $75,00…  8708 1.77   
## 5 College 1 year to 3 years (Some college or tech… Employ… $75,00… 15494 3.15   
## 6 College 4 years or more (College graduate)       Employ… $75,00… 47163 9.59   
## # … with abbreviated variable name ¹​percentage

Graph

Kin<-brfss2013%>%
  filter(educa=="Never attended school or only kindergarten",employ1=="Employed for wages", income2=="$75,000 or more")
K<-nrow(Kin)
Elem<-brfss2013%>%
  filter(educa=="Grades 1 through 8 (Elementary)",employ1=="Employed for wages", income2=="$75,000 or more")
E<-nrow(Elem)
Som<-brfss2013%>%
  filter(educa=="Grades 9 though 11 (Some high school)",employ1=="Employed for wages", income2=="$75,000 or more")
S<-nrow(Som)
High<-brfss2013%>%
  filter(educa=="Grade 12 or GED (High school graduate)",employ1=="Employed for wages", income2=="$75,000 or more")
H<-nrow(High)
Sco<-brfss2013%>%
  filter(educa=="College 1 year to 3 years (Some college or technical school)",employ1=="Employed for wages", income2=="$75,000 or more")
Sc<-nrow(Sco)
Col<-brfss2013%>%
  filter(educa=="College 4 years or more (College graduate)",employ1=="Employed for wages")
C<-nrow(Col)
x<-  c(K,E,S,H,Sc,C)
labels <-  c("Never attended school or only kindergarten","Elementary","Some high school","High school graduate","Some college or technical school","College graduate")
pie(x, main = "Level of Education Attained by People whose Employed for wages that Earned $75,000 or more",col = rainbow(length(x)))
legend("bottomright", c("Never attended school or only kindergarten","Elementary","Some high school","High school graduate","Some college or technical school","College graduate"), cex = 0.6,
   fill = rainbow(length(x)))

Interpretation: Base to the data and graph visualization above, around 9% people whose college graduate are more likely employed for wages and earned more higher incomes with $75,000 or more. Hence, we can say that most people who attained higher levels of education are more likely employed and earned higher incomes than to those with lower levels of education.

Research quesion 2:

Do people who had smoked at least 100 cigarettes in his/her lifetime, more likely to have an asthma and get a shot of pneumonia vaccines?

For this, we have to look at the relationship between the variables:

• smoke100: Is the respondent had smoked at least 100 cigarettes in his/her lifetime?

• asthma3: Is the respondent have an asthma?

• pneuvac3: Is the respondent receive a pneumonia vaccine?

Type of variables we’re dealing with:

str(select(brfss2013,smoke100,asthma3,pneuvac3))
## 'data.frame':    491775 obs. of  3 variables:
##  $ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 2 1 1 2 2 ...
##  $ asthma3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 2 ...
##  $ pneuvac3: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 1 2 2 2 2 ...

Let’s begin to check the data of those respondents who had smoked of at least 100 cigarettes in their lifetime.

total_obs<-nrow(brfss2013)

brfss2013%>%
  group_by(smoke100)%>%
  filter(smoke100!="NA")%>%
  summarize(count=n(),percentage=n()*100/total_obs)
## # A tibble: 2 × 3
##   smoke100  count percentage
##   <fct>     <int>      <dbl>
## 1 Yes      215201       43.8
## 2 No       261654       53.2

We can see that around 53% of the respondents haven’t smoked of at least 100 cigarettes in their lifetime, while the around 43% had smoked.

Now, we will group the data to see if those people who smoked of at least 100 cigarettes in their lifetime more likely to be an asthmatic and usually get a shot of pneumonia vaccines.

total_obs<-nrow(brfss2013)

brfss2013%>%
  filter(smoke100!="NA",asthma3!="NA", pneuvac3!="NA")%>%
  group_by(smoke100,asthma3,pneuvac3)%>%
  summarize(count=n(),percentage=n()*100/total_obs)
## `summarise()` has grouped output by 'smoke100', 'asthma3'. You can override
## using the `.groups` argument.
## # A tibble: 8 × 5
## # Groups:   smoke100, asthma3 [4]
##   smoke100 asthma3 pneuvac3  count percentage
##   <fct>    <fct>   <fct>     <int>      <dbl>
## 1 Yes      Yes     Yes       15659       3.18
## 2 Yes      Yes     No        12183       2.48
## 3 Yes      No      Yes       64689      13.2 
## 4 Yes      No      No        91727      18.7 
## 5 No       Yes     Yes       13389       2.72
## 6 No       Yes     No        14576       2.96
## 7 No       No      Yes       66660      13.6 
## 8 No       No      No       125631      25.5

Graph

Csmoke<- brfss2013%>%
  filter(smoke100 == "Yes",asthma3 == "Yes", pneuvac3=="Yes")
CS<- nrow(Csmoke)
Nsmoke<-brfss2013%>%
  filter( smoke100 == "Yes",asthma3=="No", pneuvac3== "No")
NS<- nrow(Nsmoke)
x<-  c(CS, NS)
labels <-  c("Asthmatic and have a shot of Pneumonia Vaccine","Not Asthmatic and doesn't have a shot of Pneumonia Vaccine")

pie(x, main = "Smoker People",col = rainbow(length(x)))
legend("bottomright", c("Asthmatic and have a shot of Pneumonia Vaccine","Not Asthmatic and doesn't have a shot of Pneumonia Vaccine"), cex = 0.6,
   fill = rainbow(length(x)))

Interpretation: Looking at the summary statistics, we have seen that only 3% of a cigarette smokers having an asthma and receive a shot of pneumonia vaccines, while around 18% is a cigarette smokers but doesn’t have an asthma and shot of pneumonia vaccines. Hence, we can say that not all people who smoked of at least 100 cigarettes in their life time more likely to be an asthmatic and usually get a shot of pneumonia vaccines.

Research question 3:

Are married people more likely to have a child of maximum 9 compared to those who never married? (While women are married, they tend to have very high birth rates.)

For this we’ll be using the following variables:

• marital: Respondents marital status.

• children: Computed number of children of the respondents.

Checking out the type of variables that we’re dealing with:

str(select(brfss2013,marital,children))
## 'data.frame':    491775 obs. of  2 variables:
##  $ marital : Factor w/ 6 levels "Married","Divorced",..: 2 1 1 1 1 2 1 3 1 1 ...
##  $ children: int  0 2 0 0 0 0 1 0 1 0 ...

Let’s begin to check of those people who have married status.

total_obs<-nrow(brfss2013)

brfss2013%>%
  group_by(marital)%>%
  filter(marital!="NA")%>%
  summarize(count=n(),percentage=n()*100/total_obs)
## # A tibble: 6 × 3
##   marital                          count percentage
##   <fct>                            <int>      <dbl>
## 1 Married                         253329      51.5 
## 2 Divorced                         70376      14.3 
## 3 Widowed                          65745      13.4 
## 4 Separated                        10662       2.17
## 5 Never married                    75070      15.3 
## 6 A member of an unmarried couple  13173       2.68

We have noticed that around 51% of the respondents are married and 15% never married.

Now, we will take a look if those married people are more likely to have a child of maximum 9 than to those never married people. The data will group according to their marital status having a child of 9.

total_obs<-nrow(brfss2013)

brfss2013%>%
  group_by(marital,children)%>%
  filter(marital!="NA", children=="9")%>%
  summarize(count=n(),percentage=n()*100/total_obs)
## `summarise()` has grouped output by 'marital'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   marital [6]
##   marital                         children count percentage
##   <fct>                              <int> <int>      <dbl>
## 1 Married                                9    29   0.00590 
## 2 Divorced                               9     2   0.000407
## 3 Widowed                                9     2   0.000407
## 4 Separated                              9     1   0.000203
## 5 Never married                          9     8   0.00163 
## 6 A member of an unmarried couple        9     1   0.000203

Graph

Mar<- brfss2013%>%
  filter(marital == "Married",children=="9")
M<- nrow(Mar)
Div<-brfss2013%>%
  filter(marital == "Divorced",children=="9")
D<- nrow(Div)
Wid<-brfss2013%>%
  filter(marital == "Widowed",children=="9")
W<-nrow(Wid)
Sep<-brfss2013%>%
  filter(marital == "Separated",children=="9")
S<-nrow(Sep)
Nev<-brfss2013%>%
  filter(marital == "Never Married",children=="9")
N<-nrow(Nev)
Unm<-brfss2013%>%
  filter(marital == "A member of an unmarried couple",children=="9")
U<-nrow(Unm)
x<-  c(M,D,W,S,N,U)
labels <-  c("Married","Divorced","Widowed","Separated","Nver Married","A member of an unmarried couple")
pie(x, main = "People who have a child of maximum 9",col= rainbow (length(x)))
legend("bottomright", c("Married","Divorced","Widowed","Separated","Nver Married","A member of an unmarried couple"), cex = 0.5,
   fill = rainbow(length(x)))

Interpretation: In conclusion, people who never married that having a child of 9 only got a percentage of 0.16%. Thus, we can say that married people are more likely to have a child of maximum 9 with a percentage of 0.58% compared to those who never married.