setwd("E:/QUIZ #1 STAT SOFTWARE/")getwd()## [1] "E:/QUIZ #1 STAT SOFTWARE"
library(ggplot2)
library(dplyr)load("brfss2013.Rdata")Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcomed to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.
Research quesion 1:
Are people who have completed higher levels of education, more likely employed and earned higher incomes?
(People with higher levels of education are more likely to find employment, remain employed and earn higher incomes relative to those with lower levels of education.)
Research quesion 2:
Do people who had smoked at least 100 cigarettes in his/her lifetime, more likely to have an asthma and get a shot of pneumonia vaccines?
(Cigarette smoke is known to trigger asthma attacks and all people with asthma should get a pneumonia vaccination.)
Research quesion 3:
Are married people more likely to have a children of maximum 9 compared to those never married?
(While women are married, they tend to have very high birth rates.)
Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.
Research quesion 1:
Are people who have completed higher levels of education, more likely employed and earned higher incomes?
We’ll be using the following variables:
• educa: Respondents level of education completed.
• employ1: Respondents employment status.
• income2: Respondents computed salary income.
Type of variables we’re dealing with:
str(select(brfss2013,educa,employ1,income2))## 'data.frame': 491775 obs. of 3 variables:
## $ educa : Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
## $ employ1: Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
## $ income2: Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
To begin, let’s check those people who attained higher levels of education.
total_obs<-nrow(brfss2013)
brfss2013%>%
filter(educa!="NA")%>%
group_by(educa)%>%
summarize(count=n(),percentage=n()*100/total_obs)## # A tibble: 6 × 3
## educa count percentage
## <fct> <int> <dbl>
## 1 Never attended school or only kindergarten 677 0.138
## 2 Grades 1 through 8 (Elementary) 13395 2.72
## 3 Grades 9 though 11 (Some high school) 28141 5.72
## 4 Grade 12 or GED (High school graduate) 142971 29.1
## 5 College 1 year to 3 years (Some college or technical school) 134197 27.3
## 6 College 4 years or more (College graduate) 170120 34.6
As we can see, around 35% of the respondents in our data set are college graduate and followed by high school graduate with the percentage of around 29%.
Then, we will group the data with their corresponding level of education and set their employment status to whose more likely employed for wages.
brfss2013%>%
group_by(educa,employ1)%>%
filter(educa!="NA",employ1 == "Employed for wages")%>%
summarize(count=n(),percentage=n()*100/total_obs)## `summarise()` has grouped output by 'educa'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups: educa [6]
## educa employ1 count perce…¹
## <fct> <fct> <int> <dbl>
## 1 Never attended school or only kindergarten Employ… 202 0.0411
## 2 Grades 1 through 8 (Elementary) Employ… 2536 0.516
## 3 Grades 9 though 11 (Some high school) Employ… 6441 1.31
## 4 Grade 12 or GED (High school graduate) Employ… 49803 10.1
## 5 College 1 year to 3 years (Some college or technical sc… Employ… 55715 11.3
## 6 College 4 years or more (College graduate) Employ… 87090 17.7
## # … with abbreviated variable name ¹percentage
Now to answer our question, we can make things a bit easier for ourselves by creating a new categorical variable to categorize a person as: ‘College graduate’, ‘Employed for wages’, ‘Earned $75,000 or more’ to see the difference between the employment status and level of education attained by the respondents, and to identify which more likely to earned higher incomes.
brfss2013%>%
group_by(educa,employ1,income2)%>%
filter(educa!="NA",employ1=="Employed for wages", income2=="$75,000 or more")%>%
summarize(count=n(),percentage=n()*100/total_obs)## `summarise()` has grouped output by 'educa', 'employ1'. You can override using
## the `.groups` argument.
## # A tibble: 6 × 5
## # Groups: educa, employ1 [6]
## educa employ1 income2 count perce…¹
## <fct> <fct> <fct> <int> <dbl>
## 1 Never attended school or only kindergarten Employ… $75,00… 18 0.00366
## 2 Grades 1 through 8 (Elementary) Employ… $75,00… 102 0.0207
## 3 Grades 9 though 11 (Some high school) Employ… $75,00… 416 0.0846
## 4 Grade 12 or GED (High school graduate) Employ… $75,00… 8708 1.77
## 5 College 1 year to 3 years (Some college or tech… Employ… $75,00… 15494 3.15
## 6 College 4 years or more (College graduate) Employ… $75,00… 47163 9.59
## # … with abbreviated variable name ¹percentage
Graph
Kin<-brfss2013%>%
filter(educa=="Never attended school or only kindergarten",employ1=="Employed for wages", income2=="$75,000 or more")
K<-nrow(Kin)
Elem<-brfss2013%>%
filter(educa=="Grades 1 through 8 (Elementary)",employ1=="Employed for wages", income2=="$75,000 or more")
E<-nrow(Elem)
Som<-brfss2013%>%
filter(educa=="Grades 9 though 11 (Some high school)",employ1=="Employed for wages", income2=="$75,000 or more")
S<-nrow(Som)
High<-brfss2013%>%
filter(educa=="Grade 12 or GED (High school graduate)",employ1=="Employed for wages", income2=="$75,000 or more")
H<-nrow(High)
Sco<-brfss2013%>%
filter(educa=="College 1 year to 3 years (Some college or technical school)",employ1=="Employed for wages", income2=="$75,000 or more")
Sc<-nrow(Sco)
Col<-brfss2013%>%
filter(educa=="College 4 years or more (College graduate)",employ1=="Employed for wages")
C<-nrow(Col)x<- c(K,E,S,H,Sc,C)
labels <- c("Never attended school or only kindergarten","Elementary","Some high school","High school graduate","Some college or technical school","College graduate")pie(x, main = "Level of Education Attained by People whose Employed for wages that Earned $75,000 or more",col = rainbow(length(x)))
legend("bottomright", c("Never attended school or only kindergarten","Elementary","Some high school","High school graduate","Some college or technical school","College graduate"), cex = 0.6,
fill = rainbow(length(x)))Interpretation: Base to the data and graph visualization above, around 9% people whose college graduate are more likely employed for wages and earned more higher incomes with $75,000 or more. Hence, we can say that most people who attained higher levels of education are more likely employed and earned higher incomes than to those with lower levels of education.
Research quesion 2:
Do people who had smoked at least 100 cigarettes in his/her lifetime, more likely to have an asthma and get a shot of pneumonia vaccines?
For this, we have to look at the relationship between the variables:
• smoke100: Is the respondent had smoked at least 100 cigarettes in his/her lifetime?
• asthma3: Is the respondent have an asthma?
• pneuvac3: Is the respondent receive a pneumonia vaccine?
Type of variables we’re dealing with:
str(select(brfss2013,smoke100,asthma3,pneuvac3))## 'data.frame': 491775 obs. of 3 variables:
## $ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 2 1 1 2 2 ...
## $ asthma3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 2 ...
## $ pneuvac3: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 1 2 2 2 2 ...
Let’s begin to check the data of those respondents who had smoked of at least 100 cigarettes in their lifetime.
total_obs<-nrow(brfss2013)
brfss2013%>%
group_by(smoke100)%>%
filter(smoke100!="NA")%>%
summarize(count=n(),percentage=n()*100/total_obs)## # A tibble: 2 × 3
## smoke100 count percentage
## <fct> <int> <dbl>
## 1 Yes 215201 43.8
## 2 No 261654 53.2
We can see that around 53% of the respondents haven’t smoked of at least 100 cigarettes in their lifetime, while the around 43% had smoked.
Now, we will group the data to see if those people who smoked of at least 100 cigarettes in their lifetime more likely to be an asthmatic and usually get a shot of pneumonia vaccines.
total_obs<-nrow(brfss2013)
brfss2013%>%
filter(smoke100!="NA",asthma3!="NA", pneuvac3!="NA")%>%
group_by(smoke100,asthma3,pneuvac3)%>%
summarize(count=n(),percentage=n()*100/total_obs)## `summarise()` has grouped output by 'smoke100', 'asthma3'. You can override
## using the `.groups` argument.
## # A tibble: 8 × 5
## # Groups: smoke100, asthma3 [4]
## smoke100 asthma3 pneuvac3 count percentage
## <fct> <fct> <fct> <int> <dbl>
## 1 Yes Yes Yes 15659 3.18
## 2 Yes Yes No 12183 2.48
## 3 Yes No Yes 64689 13.2
## 4 Yes No No 91727 18.7
## 5 No Yes Yes 13389 2.72
## 6 No Yes No 14576 2.96
## 7 No No Yes 66660 13.6
## 8 No No No 125631 25.5
Graph
Csmoke<- brfss2013%>%
filter(smoke100 == "Yes",asthma3 == "Yes", pneuvac3=="Yes")
CS<- nrow(Csmoke)
Nsmoke<-brfss2013%>%
filter( smoke100 == "Yes",asthma3=="No", pneuvac3== "No")
NS<- nrow(Nsmoke)x<- c(CS, NS)
labels <- c("Asthmatic and have a shot of Pneumonia Vaccine","Not Asthmatic and doesn't have a shot of Pneumonia Vaccine")
pie(x, main = "Smoker People",col = rainbow(length(x)))
legend("bottomright", c("Asthmatic and have a shot of Pneumonia Vaccine","Not Asthmatic and doesn't have a shot of Pneumonia Vaccine"), cex = 0.6,
fill = rainbow(length(x)))Interpretation: Looking at the summary statistics, we have seen that only 3% of a cigarette smokers having an asthma and receive a shot of pneumonia vaccines, while around 18% is a cigarette smokers but doesn’t have an asthma and shot of pneumonia vaccines. Hence, we can say that not all people who smoked of at least 100 cigarettes in their life time more likely to be an asthmatic and usually get a shot of pneumonia vaccines.
Research question 3:
Are married people more likely to have a child of maximum 9 compared to those who never married? (While women are married, they tend to have very high birth rates.)
For this we’ll be using the following variables:
• marital: Respondents marital status.
• children: Computed number of children of the respondents.
Checking out the type of variables that we’re dealing with:
str(select(brfss2013,marital,children))## 'data.frame': 491775 obs. of 2 variables:
## $ marital : Factor w/ 6 levels "Married","Divorced",..: 2 1 1 1 1 2 1 3 1 1 ...
## $ children: int 0 2 0 0 0 0 1 0 1 0 ...
Let’s begin to check of those people who have married status.
total_obs<-nrow(brfss2013)
brfss2013%>%
group_by(marital)%>%
filter(marital!="NA")%>%
summarize(count=n(),percentage=n()*100/total_obs)## # A tibble: 6 × 3
## marital count percentage
## <fct> <int> <dbl>
## 1 Married 253329 51.5
## 2 Divorced 70376 14.3
## 3 Widowed 65745 13.4
## 4 Separated 10662 2.17
## 5 Never married 75070 15.3
## 6 A member of an unmarried couple 13173 2.68
We have noticed that around 51% of the respondents are married and 15% never married.
Now, we will take a look if those married people are more likely to have a child of maximum 9 than to those never married people. The data will group according to their marital status having a child of 9.
total_obs<-nrow(brfss2013)
brfss2013%>%
group_by(marital,children)%>%
filter(marital!="NA", children=="9")%>%
summarize(count=n(),percentage=n()*100/total_obs)## `summarise()` has grouped output by 'marital'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups: marital [6]
## marital children count percentage
## <fct> <int> <int> <dbl>
## 1 Married 9 29 0.00590
## 2 Divorced 9 2 0.000407
## 3 Widowed 9 2 0.000407
## 4 Separated 9 1 0.000203
## 5 Never married 9 8 0.00163
## 6 A member of an unmarried couple 9 1 0.000203
Graph
Mar<- brfss2013%>%
filter(marital == "Married",children=="9")
M<- nrow(Mar)
Div<-brfss2013%>%
filter(marital == "Divorced",children=="9")
D<- nrow(Div)
Wid<-brfss2013%>%
filter(marital == "Widowed",children=="9")
W<-nrow(Wid)
Sep<-brfss2013%>%
filter(marital == "Separated",children=="9")
S<-nrow(Sep)
Nev<-brfss2013%>%
filter(marital == "Never Married",children=="9")
N<-nrow(Nev)
Unm<-brfss2013%>%
filter(marital == "A member of an unmarried couple",children=="9")
U<-nrow(Unm)x<- c(M,D,W,S,N,U)
labels <- c("Married","Divorced","Widowed","Separated","Nver Married","A member of an unmarried couple")pie(x, main = "People who have a child of maximum 9",col= rainbow (length(x)))
legend("bottomright", c("Married","Divorced","Widowed","Separated","Nver Married","A member of an unmarried couple"), cex = 0.5,
fill = rainbow(length(x)))Interpretation: In conclusion, people who never married that having a child of 9 only got a percentage of 0.16%. Thus, we can say that married people are more likely to have a child of maximum 9 with a percentage of 0.58% compared to those who never married.