Setup

Load packages

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.4

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called brfss2013. Delete this note when before you submit your work.

load("brfss2013.RData")

Part 1: Data

As we could see in the “BRFSS Overview”, the data was collected in two ways: 1) landline telephone-based surveys; 2) cellular telephone-based surveys.

In landline telephone-based surveys part, it used stratified sampling. Every single landline telephone number represents a group, and then an adult in the group will be randomly selected.

In cellular telephone-based surveys part, it just collected data directly from a person who has a cellular phone, this process doesn’t involve any kind of sampling.

Thus, both methods didn’t use random sampling.

However, random assignment is used in landline tele surveys rather than cellular tele surveys. The reason is that there is not a detailed data shows that every American has a cellular phone, and this method will exclude the person who doesn’t have a cellular phone, or have a phone which is not a smart phone.

Thus, there is no randomn assignment was used, and causality can not be inferred.


Part 2: Research questions

Research quesion 1:

Is there a relationship between, state average income level, and state percentage of people who do not have any health coverage?

Research quesion 2:

I want to explore education level of the people whose income level is more than $50,000,are the people gain more money,are more educated?

Research quesion 3:

so I assume i am a baby from Florida, I want to be born in a family which parents are both have a income level over $50,000, what is the probability of that?


Part 3: Exploratory data analysis

NOTE: Insert code chunks as needed by clicking on the “Insert a new code chunk” button (green button with orange arrow) above. Make sure that your code is visible in the project you submit. Delete this note when before you submit your work.

Research quesion 1:

#Is there a relationship between, state average income level, and state percentage of people who do not have any health coverage?

library(ggplot2)
library(dplyr)


#First, let's find out whether there are OUTLIERS in X_state
brfss2013 %>% 
  group_by(X_state) %>% 
  summarise(count=n()) %>% 
  arrange(count)
## # A tibble: 55 x 2
##    X_state              count
##    <fct>                <int>
##  1 0                        1
##  2 80                       1
##  3 Guam                  1897
##  4 Arizona               4253
##  5 Alaska                4578
##  6 District of Columbia  4931
##  7 Nevada                5101
##  8 Delaware              5206
##  9 Louisiana             5215
## 10 Arkansas              5268
## # ... with 45 more rows
# we could find that, there are 2 outliers, "0" and "80"

#Next step is to find out how many people DO NOT have any Health Care Coverage
#also we need to find out whether there is any outlier

brfss2013 %>% 
  group_by(hlthpln1) %>% 
  summarise(count=n())
## # A tibble: 3 x 2
##   hlthpln1  count
##   <fct>     <int>
## 1 Yes      434571
## 2 No        55300
## 3 <NA>       1904
#so we could see that, there are 1904 "NA" value
#to get the percentage of people do not have any Health Coverage in each state,
#we need to eliminate these outliers
#there are new concept, 
#"Stateper" shows each state the percentage of people who do not have any Health
#Coverage
#"Statetotal" means how many people of the state are collected in the data
#"StateNo" means how many people each state do not have any Health Coverage
#"NoPer" means the percentage of proportion of people do not have any Health
#Coverage in each state


StateNoPer <- brfss2013 %>% 
  filter(X_state!="0", X_state!="80",!is.na(hlthpln1)) %>% 
  group_by(X_state) %>% 
  summarise(Statetotal=n(),StateNo=sum(hlthpln1=="No")) %>% 
  mutate(NoPer=StateNo/Statetotal)

StateNoPer %>% 
  group_by(X_state) %>% 
  summarise(NoPer)
## # A tibble: 53 x 2
##    X_state               NoPer
##    <fct>                 <dbl>
##  1 Alabama              0.108 
##  2 Alaska               0.153 
##  3 Arizona              0.131 
##  4 Arkansas             0.144 
##  5 California           0.136 
##  6 Colorado             0.119 
##  7 Connecticut          0.0785
##  8 Delaware             0.0859
##  9 District of Columbia 0.0447
## 10 Florida              0.143 
## # ... with 43 more rows
#then, let's get the average income level of each state
#we need to find out whether there also exists outlier
#new concepts:
#"NumIncome", I transfer the "Less than $10000" to 1,"Less than $20000" to 2 ..... and so on
#"MeanIncomeLv", the mean of every state income level



brfss2013 %>% 
  group_by(income2) %>% 
  summarise(n())
## # A tibble: 9 x 2
##   income2            `n()`
##   <fct>              <int>
## 1 Less than $10,000  25441
## 2 Less than $15,000  26794
## 3 Less than $20,000  34873
## 4 Less than $25,000  41732
## 5 Less than $35,000  48867
## 6 Less than $50,000  61509
## 7 Less than $75,000  65231
## 8 $75,000 or more   115902
## 9 <NA>               71426
incomeLevel <- brfss2013 %>% 
  filter(X_state!="0", X_state!="80",!is.na(income2)) %>% 
  group_by(X_state) %>% 
  summarise(NumIncome=as.numeric(income2)) %>% 
  summarise(MeanIncomeLv=mean(NumIncome))
## `summarise()` has grouped output by 'X_state'. You can override using the `.groups` argument.
incomeLevel %>% 
  group_by(X_state) %>% 
  summarise(MeanIncomeLv)
## # A tibble: 53 x 2
##    X_state              MeanIncomeLv
##    <fct>                       <dbl>
##  1 Alabama                      5.10
##  2 Alaska                       6.10
##  3 Arizona                      5.35
##  4 Arkansas                     5.02
##  5 California                   5.46
##  6 Colorado                     6.01
##  7 Connecticut                  6.12
##  8 Delaware                     5.81
##  9 District of Columbia         6.19
## 10 Florida                      5.24
## # ... with 43 more rows
#Fine, we get what we get, let's join the two table

StateNoPer <- left_join(StateNoPer,incomeLevel,by="X_state",
                        copy=FALSE)

#then we draw a point plot to check
  
ggplot(StateNoPer, aes(x=MeanIncomeLv, y=NoPer))+
  geom_point()

cov(x=StateNoPer$MeanIncomeLv, y=StateNoPer$NoPer)
## [1] -0.004220329
ggplot(StateNoPer, aes(x=MeanIncomeLv, y=NoPer))+
  geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#Conclusion: we find that there is an extreme outlier "Puerto Rico",
#if we exclude this value (MeanIncomeLv=3.10, NoPer=0.0047),
#we could find that, MeanIncomeLv and NoPer are negetive correlated.
#so that we could say that, if a state has higher income level, it has the lower percentage of people 
#who do not have Health Care Coverage

#things could improve: I want to exclude the extreme outlier and draw a more suitable line, 
#but I don't know the function

Research quesion 2:

#So, as the conclusion mention above, people who are at higher income level
#are more likely to have a Health Care coverage. 
#Thus, I want to explore education level of the people whose income level is 
#more than $50,000,are the people gain more money,are more educated?
#I want a pie chart of the education level of whose income level is more than
#$50,000




brfss2013 %>% 
  group_by(educa) %>% 
  summarise(n())
## # A tibble: 7 x 2
##   educa                                                         `n()`
##   <fct>                                                         <int>
## 1 Never attended school or only kindergarten                      677
## 2 Grades 1 through 8 (Elementary)                               13395
## 3 Grades 9 though 11 (Some high school)                         28141
## 4 Grade 12 or GED (High school graduate)                       142971
## 5 College 1 year to 3 years (Some college or technical school) 134197
## 6 College 4 years or more (College graduate)                   170120
## 7 <NA>                                                           2274
brfss2013 %>% 
  group_by(income2) %>% 
  summarise(n())
## # A tibble: 9 x 2
##   income2            `n()`
##   <fct>              <int>
## 1 Less than $10,000  25441
## 2 Less than $15,000  26794
## 3 Less than $20,000  34873
## 4 Less than $25,000  41732
## 5 Less than $35,000  48867
## 6 Less than $50,000  61509
## 7 Less than $75,000  65231
## 8 $75,000 or more   115902
## 9 <NA>               71426
#Though there both "income2" and "educa" have outliers,
#so we need to filter these two variables,
#and make a new table names "incomeEdu"

incomeEdu <- brfss2013 %>% 
  filter(income2=="Less than $75,000"|income2== "$75,000 or more",
         !is.na(educa)) %>% 
  select(income2, educa)

#here comes the number of education level of people whose income level is more than $50,000
incomeEdu %>% 
  group_by(educa) %>% 
  summarise(n())
## # A tibble: 6 x 2
##   educa                                                         `n()`
##   <fct>                                                         <int>
## 1 Never attended school or only kindergarten                       51
## 2 Grades 1 through 8 (Elementary)                                 531
## 3 Grades 9 though 11 (Some high school)                          2065
## 4 Grade 12 or GED (High school graduate)                        29888
## 5 College 1 year to 3 years (Some college or technical school)  45516
## 6 College 4 years or more (College graduate)                   102945
#in order to make a piechart, we need create a new table names "EduPlot"
#"countEdu" means people number in this education level
#"propEdu" means the proportion of education level of people whose income level is over $50,000

EduPlot <- incomeEdu %>% 
  group_by(educa) %>% 
  summarise(countEdu=n()) 

EduPlot <- EduPlot %>% 
  mutate(propEdu=countEdu/sum(EduPlot$countEdu)*100)

EduPlot %>% 
  group_by(educa) %>% 
  summarise(propEdu)
## # A tibble: 6 x 2
##   educa                                                        propEdu
##   <fct>                                                          <dbl>
## 1 Never attended school or only kindergarten                    0.0282
## 2 Grades 1 through 8 (Elementary)                               0.293 
## 3 Grades 9 though 11 (Some high school)                         1.14  
## 4 Grade 12 or GED (High school graduate)                       16.5   
## 5 College 1 year to 3 years (Some college or technical school) 25.1   
## 6 College 4 years or more (College graduate)                   56.9
# so here is the ggplot function, I just find an example online and rewrite it
ggplot(data = EduPlot, aes(x="", y=countEdu, fill=educa))+
  geom_bar(stat = "identity",width = 1, color="black")+
  labs(title = "Education proportion of people's income level over $50,000")+
  coord_polar("y", start = 0)+
  theme_void()

#Conclusion: We could see on the pie chart is that, 
#more than 80% of the people whose income level is over $50,000
#they have a education level over College 1 year

Research quesion 3:

# so I assume i am a baby from Florida, I want to be born in a family
#which parents are both have a income level over $50,000,
#what is the probability of that?



brfss2013 %>% 
  group_by(X_state) %>% 
  summarise(count=n()) %>% 
  arrange(desc(count))
## # A tibble: 55 x 2
##    X_state       count
##    <fct>         <int>
##  1 Florida       33668
##  2 Kansas        23282
##  3 Nebraska      17139
##  4 Massachusetts 15071
##  5 Minnesota     14340
##  6 New Jersey    13776
##  7 Colorado      13649
##  8 Maryland      13011
##  9 Utah          12769
## 10 Michigan      12761
## # ... with 45 more rows
#check outliers in "sex"
brfss2013 %>% 
  group_by(sex) %>% 
  summarise(n())
## # A tibble: 3 x 2
##   sex     `n()`
##   <fct>   <int>
## 1 Male   201313
## 2 Female 290455
## 3 <NA>        7
#also check outliers in "income2"
brfss2013 %>% 
  group_by(income2) %>% 
  summarise(n())
## # A tibble: 9 x 2
##   income2            `n()`
##   <fct>              <int>
## 1 Less than $10,000  25441
## 2 Less than $15,000  26794
## 3 Less than $20,000  34873
## 4 Less than $25,000  41732
## 5 Less than $35,000  48867
## 6 Less than $50,000  61509
## 7 Less than $75,000  65231
## 8 $75,000 or more   115902
## 9 <NA>               71426
#the number of male in Florida with income level over $50,000
brfss2013 %>% 
  filter(X_state=="Florida",
         income2=="Less than $75,000"|
           income2=="$75,000 or more",
         sex=="Male") %>% 
  group_by(X_state) %>% 
  summarise(n())
## # A tibble: 1 x 2
##   X_state `n()`
##   <fct>   <int>
## 1 Florida  4718
#the number of male in Florida, 
#remember to eliminate the "NA" value in "income2"
#otherwise it will make the final outcome inaccurate
brfss2013 %>% 
  filter(X_state=="Florida",
         !is.na(income2),
         sex=="Male") %>% 
  group_by(X_state) %>% 
  summarise(n())
## # A tibble: 1 x 2
##   X_state `n()`
##   <fct>   <int>
## 1 Florida 11618
#the number of female in Florida with income level over $50,000
brfss2013 %>% 
  filter(X_state=="Florida",
         income2=="Less than $75,000"|
           income2=="$75,000 or more",
         sex=="Female") %>% 
  group_by(X_state) %>% 
  summarise(n())
## # A tibble: 1 x 2
##   X_state `n()`
##   <fct>   <int>
## 1 Florida  5423
#the number of female in Florida
brfss2013 %>% 
  filter(X_state=="Florida",
         !is.na(income2),
         sex=="Female") %>% 
  group_by(X_state) %>% 
  summarise(n())
## # A tibble: 1 x 2
##   X_state `n()`
##   <fct>   <int>
## 1 Florida 17111
#So, it will be 12.98% that i will be born in a family which my parents' income level both over $50,000
(4718/11618)*(5473/17111)
## [1] 0.1298903