Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(ggthemes)
options(digits = 4)

Load data

load("brfss2013.RData")

Part 1: Data

The motive of this data analysis project is to investigate three research questions using the BRFSS 2013 dataset.

The Behavioral Risk Factor Surveillance System (BRFSS) is an ongoing surveillance system designed to measure uniform, state-specific ‘preventive health practices’ and ‘behavioral risk factors’ for the non-institutionalized adult population (aged 18 years and older) residing in the US. The BRFSS dataset is the result of an observational study, so no causal inference can be drawn from this survey.

Thus, the most that can be inferred from it is an association between two or more variables. Such an association can then be used to form a hypothesis, to be verified in separate randomized experiments built on four principles: controlling, randomization, replication and blocking.

BRFSS conducts both land line telephone- and cellular telephone-based surveys. In conducting the BRFSS land line telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing. The BRFSS questionnaire is comprised of an annual standard core, a biannual rotating core, optional modules, and state-added questions. In order to maintain consistency across states, the BRFSS has set standard protocols for data collection. These standardised protocols allow state-to-state comparisons that apply uniformly across the entire population of the United States of America.

Several statistical steps help reduce bias in the sample: standardised protocols, Disproportionate Stratified Sampling (DSS), sampling of land line numbers by sub-state geographic location, and a two-step weighting process consisting of ‘design’ weighting and ‘iterative proportional fitting’ weighting.
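The second weighting step, iterative proportional fitting (often called raking), can be illustrated with a small sketch. The sample counts and population margins below are invented for illustration and are not BRFSS figures, and the function name rake_2d is hypothetical; the actual BRFSS weighting uses many more margins than this toy 2 x 2 table.

```r
# Toy raking (iterative proportional fitting) on a 2 x 2 table.
# All numbers are made up for illustration; they are not BRFSS values.
rake_2d <- function(tab, row_target, col_target, tol = 1e-8, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    # Scale each row so that the row totals match the row targets
    tab <- tab * (row_target / rowSums(tab))
    # Scale each column so that the column totals match the column targets
    tab <- sweep(tab, 2, col_target / colSums(tab), `*`)
    # Stop once the row totals also agree with their targets
    if (max(abs(rowSums(tab) - row_target)) < tol) break
  }
  tab
}

sample_counts <- matrix(c(30, 20, 25, 25), nrow = 2,
                        dimnames = list(sex = c("Male", "Female"),
                                        age = c("18-44", "45+")))

weighted <- rake_2d(sample_counts,
                    row_target = c(49, 51),   # assumed population split by sex
                    col_target = c(45, 55))   # assumed population split by age

round(weighted, 2)
rowSums(weighted)
colSums(weighted)
```

After raking, the row and column totals of the adjusted table agree with the assumed population margins, while the cell values stay as close as possible to the pattern seen in the sample.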

Part 2: Research questions

The COVID-19 pandemic has hit the world hard. It has taken a toll on the economy and on human life alike. The biggest sufferers are those who are elderly, those who suffer from co-morbid diseases, or both. The most widely prevalent co-morbid diseases are diabetes and hypertension. As a keen observer and practitioner of public policy, I attempt in this research paper to use BRFSS 2013 data and three separate sets of mandated questions to draw an association between these two co-morbid conditions (diabetes and hypertension) and the following three variables:

1. Income Level: the annual household income from all sources, recorded in the data set as the categorical variable income2 with 8 factor levels.
2. Education Level: the highest grade or year of school completed, put into 6 categories:
- College 4 years or more (College graduate),
- College 1 year to 3 years (Some college or technical school),
- Grade 12 or GED (High school graduate),
- Grades 9 through 11 (Some high school),
- Grades 1 through 8 (Elementary),
- Never attended school or only kindergarten.
It is recorded in the data set as the categorical variable educa.
3. Status of Employment: recorded in the variable employ1. The question posed in the BRFSS 2013 survey was: Are you currently…?
- Employed for wages,
- Self-Employed,
- Out of work for one year or more,
- Out of work for less than one year,
- A homemaker,
- A student,
- Retired and
- Unable to Work.

Further, within each set of questions, I have attempted to explore whether the results differ between males and females.

As far as the identification of those suffering from diabetes and hypertension (high BP) is concerned, this data analysis project takes into account two variables: one from Main Section 5 on Hypertension Awareness and the other from Main Section 7 on Chronic Health Conditions. These are bpmeds, recording those ‘Currently taking Blood Pressure Medication’, and diabete3, recording those ‘(Ever Told) You Have Diabetes’.

Research question 1: Does the proportion of respondents who have both diabetes and hypertension vary across income groups? Does it increase or decrease with rising income? Within each of these groups, how does it vary between males and females? Knowing full well that the answers will not establish any causal relationship, can we identify an association between the explanatory variable (income2) and the response variables (diabete3 and bpmeds), with sex as a grouping variable?

Research question 2: Does the proportion of respondents who have both diabetes and hypertension vary with level of education? The education levels are College 4 years or more (College graduate), College 1 year to 3 years (Some college or technical school), Grade 12 or GED (High school graduate), Grades 9 through 11 (Some high school), Grades 1 through 8 (Elementary), and Never attended school or only kindergarten. Does it increase or decrease with higher levels of education? Within each of these levels, how does it vary between males and females? Knowing full well that the answers will not establish any causal relationship, can we identify an association between the explanatory variable (educa) and the response variables (diabete3 and bpmeds), with sex as a grouping variable?

Research question 3: Does the proportion of respondents who have both diabetes and hypertension vary with status of employment? Is being out of work for a long period, retired, or unable to work associated with a higher proportion of hypertension and diabetes patients? Within each of these groups, how does it vary between males and females? Knowing full well that the answers will not establish any causal relationship, can we identify an association between the explanatory variable (employ1) and the response variables (diabete3 and bpmeds), with sex as a grouping variable?

Part 3: Exploratory data analysis

This data analysis project attempts to answer the three questions stated above. The variables required to answer them are

bpmeds diabete3 income2 sex employ1 educa

Since I want to estimate the share of respondents who have diabetes and hypertension together, I have added a variable that flags only those respondents who answered “Yes” to both ‘Currently taking Blood Pressure Medication’ and ‘(Ever Told) You Have Diabetes’.

Thus this does not capture pre-diabetes or borderline diabetes, nor women who were told “You Have Diabetes” only during pregnancy. Respondents answering “Yes” number 62,345 out of the 491,773 surveyed.
For blood pressure, this research paper captures only those who answered “Yes” when asked whether they are currently taking medicine for high blood pressure. They number 166,155 out of the 491,773 surveyed.

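I have handled the NA data either by filtering out rows with a missing grouping variable or through the indicator variable mentioned above, which is counted so that only respondents coded 1 contribute and missing answers are treated as not having both conditions.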
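How missing answers flow through the indicator can be seen in a minimal sketch. The two short vectors below are toy stand-ins for bpmeds and diabete3, not actual survey rows.

```r
# Toy vectors standing in for bpmeds and diabete3 (values are illustrative).
bpmeds   <- c("Yes", "No",  NA,    "Yes", NA)
diabete3 <- c("Yes", "Yes", "Yes", "No",  "No")

# Same construction as in the analysis: 1 only when both answers are "Yes".
# NA & TRUE stays NA, while NA & FALSE is FALSE, so the flag can still be NA.
flag <- ifelse(bpmeds == "Yes" & diabete3 == "Yes", 1, 0)
flag

# Counting with %in% 1 ignores NA instead of propagating it.
sum(flag %in% 1)

# The denominator n() in the analysis corresponds to length(), which keeps every row.
length(flag)
```

The consequence is that respondents with a missing answer stay in the denominator and are counted as not having both conditions, which is the convention used in every table of this report.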

Before moving to specifics, a few general proportions should be estimated:

1. Among the total surveyed, proportion of those who are currently taking medicine for high blood pressure.

GQ.a <- brfss2013 %>%
  select(bpmeds) %>%
  mutate(High_BP = ifelse(bpmeds == "Yes", 1, 0)) %>%
  summarise(Total = n(), High_BP = sum(High_BP %in% 1), High_BP_Prop = (High_BP / Total), .groups = 'drop')
GQ.a$High_BP_Prop # proportion of all respondents who are currently taking medicine for high blood pressure

## [1] 0.3379

Thus, 33.79% of those surveyed are currently on blood pressure medicine.

2. Among the total surveyed, the proportion of those who have been told “You Have Diabetes”, excluding everyone who did not answer “Yes”.

GQ.b <- brfss2013 %>%
  select(diabete3) %>%
  mutate(Diabetic = ifelse(diabete3 == "Yes", 1, 0)) %>%
  summarise(Total = n(), Diabetic = sum(Diabetic %in% 1), 
            Diabetic_Prop = (Diabetic / Total), .groups = 'drop')
GQ.b$Diabetic_Prop # proportion of all respondents who have been told "You Have Diabetes"; all other answers are excluded

## [1] 0.1268

Thus 12.68% of those surveyed have been told “You Have Diabetes.”

3. Among the total surveyed, the proportion of those who answered “Yes” to both of the preceding questions, meaning that they suffer from diabetes and hypertension together.

GQ.c <- brfss2013 %>%
  select(diabete3, bpmeds) %>%
  mutate(HighBPDiab = ifelse(bpmeds == "Yes" & 
                               diabete3 == "Yes", 1, 0)) %>%
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total), .groups = 'drop')
GQ.c$HighBPDiab_Prop # proportion of all respondents who have been told "You Have Diabetes" and are also currently taking medicine for high blood pressure

## [1] 0.08895

Thus 8.90% of those surveyed have been told “You Have Diabetes” and are also currently taking medicine for high blood pressure.

4. How do the above proportions vary with sex?

GQ.d <- brfss2013 %>%
  filter(!is.na(sex)) %>%
  select(sex, diabete3, bpmeds) %>%
  mutate(HighBPDiab = ifelse(bpmeds == "Yes" & 
                               diabete3 == "Yes", 1, 0)) %>%
  group_by(sex) %>%
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total), .groups = 'drop')
GQ.d

## # A tibble: 2 x 4
##   sex     Total HighBPDiab HighBPDiab_Prop
##   <fct>   <int>      <int>           <dbl>
## 1 Male   201313      18244          0.0906
## 2 Female 290455      25501          0.0878

Overall, 8.90% of those surveyed have diabetes and are also currently taking medicine for high blood pressure. Among males the proportion is marginally higher at 9.06%, and among females it is marginally lower at 8.78%, even though the absolute number of cases is higher for females.
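Whether such a small gap between males and females is more than sampling noise can be checked with a two-sample test of proportions. The sketch below uses the counts from the GQ.d table above and treats the respondents as a simple random sample, which is an assumption: it ignores the BRFSS survey weights and design, so the result is indicative only.

```r
# Counts taken from the GQ.d table above.
cases  <- c(Male = 18244, Female = 25501)
totals <- c(Male = 201313, Female = 290455)

# Proportion with both conditions within each sex
rates <- cases / totals
round(rates, 4)

# Two-sample test of equal proportions (simple-random-sample assumption)
result <- prop.test(cases, totals)
result$p.value
result$conf.int   # interval for the male minus female difference
```

With samples this large, even a difference of a fraction of a percentage point can be statistically detectable, so the practical size of the gap matters more than the p-value.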

General Basic Observations

1) 33.79% of those surveyed are currently on blood pressure medicine.
2) 12.68% of those surveyed have been told “You Have Diabetes.”
3) 8.90% of those surveyed have been told “You Have Diabetes” and are also currently taking medicine for high blood pressure.
4) Among males this proportion is marginally higher at 9.06%, and among females it is marginally lower at 8.78%, even though the absolute number of cases is higher for females.
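The percentages listed above are unweighted shares of respondents. A population estimate would normally apply the survey weights that BRFSS supplies, which this report does not use. The sketch below shows the difference on invented data; the data frame toy and its column wt are hypothetical and do not come from the BRFSS file.

```r
# Toy data: on_meds is 1 for respondents taking BP medication, and wt is a
# hypothetical survey weight (number of people each respondent represents).
toy <- data.frame(on_meds = c(1, 0, 0, 1, 0),
                  wt      = c(100, 400, 300, 50, 150))

# Share of respondents (what the tables in this report compute)
unweighted <- mean(toy$on_meds)

# Estimated share of the population, using the weights
weighted <- weighted.mean(toy$on_meds, toy$wt)

c(unweighted = unweighted, weighted = weighted)
```

When the respondents who answer “Yes” carry smaller weights than the rest, as in this toy example, the weighted estimate falls below the unweighted share.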

RESEARCH QUESTIONS

Research Question 1: Does the proportion of respondents who have both diabetes and hypertension vary across income groups? Does it increase or decrease with rising income? Within each of these groups, how does it vary between males and females? Knowing full well that the answers will not establish any causal relationship, can we identify an association between the explanatory variable (income2) and the response variables (diabete3 and bpmeds), with sex as a grouping variable?

RQ1.1 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(income2)) %>%
  select(sex, bpmeds, diabete3, income2) %>%
  mutate(HighBPDiab = ifelse(bpmeds == "Yes" & 
                               diabete3 == "Yes", 1, 0)) %>%
  group_by(income2) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1), 
            HighBPDiab_Prop = (HighBPDiab / Total), .groups = 'drop')
  colnames(RQ1.1) <- c("Income_Groups", "Total", "HighBPDiab", "HighBPDiab_Prop")
  
RQ1.1

## # A tibble: 8 x 4
##   Income_Groups      Total HighBPDiab HighBPDiab_Prop
##   <fct>              <int>      <int>           <dbl>
## 1 Less than $10,000  25441       3627          0.143 
## 2 Less than $15,000  26793       4146          0.155 
## 3 Less than $20,000  34873       4605          0.132 
## 4 Less than $25,000  41732       4895          0.117 
## 5 Less than $35,000  48867       5105          0.104 
## 6 Less than $50,000  61509       5236          0.0851
## 7 Less than $75,000  65231       4314          0.0661
## 8 $75,000 or more   115902       4955          0.0428

Except for the group earning below $10,000, the proportion of respondents with both high BP and diabetes comes down as the income level goes up. This can be observed very clearly in the plotted graph as well.

But before plotting a graph, let us calculate the total proportion of those who suffer from both high BP and diabetes in our data set RQ1.1.

sum(RQ1.1$`HighBPDiab`) / sum(RQ1.1$Total)

## [1] 0.08774

As can be seen, the proportion of persons in our RQ1.1 data set suffering from both hypertension and diabetes is 0.08774. We will draw it as a vertical dashed blue line on the graph to show the relative positions of the various groups.
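The overall figure can be reproduced from the printed RQ1.1 table alone. The sketch below copies the counts from that table and also shows that the overall proportion is the average of the group proportions weighted by group size, not their plain average.

```r
# Counts copied from the RQ1.1 table above (lowest to highest income group).
total <- c(25441, 26793, 34873, 41732, 48867, 61509, 65231, 115902)
cases <- c(3627, 4146, 4605, 4895, 5105, 5236, 4314, 4955)

group_prop <- cases / total
overall    <- sum(cases) / sum(total)
round(overall, 5)

# The overall figure equals the group proportions averaged with group sizes as weights.
all.equal(overall, weighted.mean(group_prop, total))

# The unweighted mean of the eight proportions is a different, larger quantity here,
# because the largest groups have the lowest proportions.
mean(group_prop)
```

This value is slightly lower than the proportion for all respondents computed earlier, because rows with a missing income2 value are filtered out before the proportion is taken.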

RQ1.1 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(income2)) %>%
  select(sex, bpmeds, diabete3, income2) %>%
  mutate(HighBPDiab = ifelse(bpmeds == "Yes" & 
                               diabete3 == "Yes", 1, 0)) %>%
  group_by(income2) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1), 
            HighBPDiab_Prop = (HighBPDiab / Total), .groups = 'drop')
  
  
RQ1.1 %>% ggplot(aes(x = HighBPDiab_Prop,
                    y = income2)) +
  geom_bar(stat = "identity",fill = "#FF6666") +
  geom_vline(xintercept = .08774, size = 2, linetype = "dashed",
             col = "blue") +  scale_x_continuous(limits = c(0, .20)) +
  scale_y_discrete(name = "Income",
                       labels = c("Less than $10,000" = "Less Than $10,000", 
                                  "Less than $15,000" = "$10,000 - $15,000",
                                  "Less than $20,000" = "$15,000 - $20,000",
                                  "Less than $25,000" = "$20,000 - $25,000", 
                                  "Less than $35,000" = "$25,000 - $35,000",
                                  "Less than $50,000" = "$35,000 - $50,000", 
                                  "Less than $75,000" = "$50,000 - $75,000",
                                  "$75,000 or more" = "More Than $75,000")) +
  geom_text(aes(label = HighBPDiab),size = 4,
            position = position_fill(vjust = .01)) + labs(title = "Graph (1.1) BP & Diabetic in 
            Income Groups", 
                                                          y = "Income Groups", x = "Proportion of BP and Diabetic Patients") +
 theme_economist()

As can be seen in Graph (1.1), the dashed blue line marks the overall proportion in the RQ1.1 data set, .08774 (slightly below the .08895 for the total surveyed population, because respondents with missing income are excluded). Once income rises above $35,000, the proportion with both high BP and diabetes falls below this overall value and keeps dropping sharply. It is as low as 4.28% for the income group above $75,000, compared to 15.47% for the income group between $10,000 and $15,000. The numbers inside the graph denote the number of patients suffering from both high BP and diabetes within each group.

Let us now compare it among Males and Females within these income groups.

RQ1.2 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(income2)) %>%
  select(sex, bpmeds, diabete3, income2) %>%
  mutate(HighBPDiab = ifelse(bpmeds == "Yes" & 
                               diabete3 == "Yes", 1, 0)) %>%
  group_by(income2, sex) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1), 
            HighBPDiab_Prop = (HighBPDiab / Total), .groups = 'drop')

RQ1.2

## # A tibble: 16 x 5
##    income2           sex    Total HighBPDiab HighBPDiab_Prop
##    <fct>             <fct>  <int>      <int>           <dbl>
##  1 Less than $10,000 Male    8296        994          0.120 
##  2 Less than $10,000 Female 17145       2633          0.154 
##  3 Less than $15,000 Male    9207       1364          0.148 
##  4 Less than $15,000 Female 17586       2782          0.158 
##  5 Less than $20,000 Male   12562       1653          0.132 
##  6 Less than $20,000 Female 22311       2952          0.132 
##  7 Less than $25,000 Male   15734       1947          0.124 
##  8 Less than $25,000 Female 25998       2948          0.113 
##  9 Less than $35,000 Male   19628       2149          0.109 
## 10 Less than $35,000 Female 29239       2956          0.101 
## 11 Less than $50,000 Male   26817       2565          0.0956
## 12 Less than $50,000 Female 34692       2671          0.0770
## 13 Less than $75,000 Male   29405       2325          0.0791
## 14 Less than $75,000 Female 35826       1989          0.0555
## 15 $75,000 or more   Male   56537       3111          0.0550
## 16 $75,000 or more   Female 59365       1844          0.0311

RQ1.2 %>% ggplot(aes(fill = sex, y = income2,
                           x = HighBPDiab_Prop)) +
  geom_bar(position = "fill", stat = "identity") +
 geom_vline(xintercept = .5, size = 2, linetype = "dashed",
             col = "blue") + geom_text(aes(label = HighBPDiab),size = 4,
            position = position_fill(vjust = .01)) +
  scale_y_discrete(name = "Income",
                       labels = c("Less than $10,000" = "< $10,000", 
                                  "Less than $15,000" = "$10,000 - $15,000",
                                  "Less than $20,000" = "$15,000 - $20,000",
                                  "Less than $25,000" = "$20,000 - $25,000", 
                                  "Less than $35,000" = "$25,000 - $35,000",
                                  "Less than $50,000" = "$35,000 - $50,000", 
                                  "Less than $75,000" = "$50,000 - $75,000",
                                  "$75,000 or more" = ">= $75,000")) +
        labs(title = "Graph (1.2) BP & Diabetic in 
         Income Groups", y = "Income Groups", x = "Proportion of BP and Diabetic Patients") +
  theme_economist()

As can be seen in Graph (1.2), the female share of each bar generally increases as we move down the income groups. Because the bars are drawn with position = "fill", each bar compares the male and female proportions within one income group, and the dashed blue line marks an equal split. The numbers inside each bar denote the absolute numbers of males and females suffering from both diabetes and hypertension.
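What the filled bars in Graph (1.2) display can be worked out directly from the RQ1.2 table. The sketch below copies the printed rates and computes the length of the female segment in each bar; it is an illustration of the rescaling, not part of the original analysis code.

```r
# Rates copied from the RQ1.2 table above (lowest to highest income group).
male_rate   <- c(0.120, 0.148, 0.132, 0.124, 0.109, 0.0956, 0.0791, 0.0550)
female_rate <- c(0.154, 0.158, 0.132, 0.113, 0.101, 0.0770, 0.0555, 0.0311)

# position = "fill" rescales each bar to sum to 1, so the female segment
# has length female_rate / (female_rate + male_rate).
female_share <- female_rate / (female_rate + male_rate)
round(female_share, 3)
```

A segment longer than 0.5 means the female rate exceeds the male rate in that income group. The bar lengths therefore compare two rates; they are not the share of patients who are female, which would be computed from the case counts instead.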

Conclusion for Research Question 1

As can be seen in Graph (1.1), the dashed blue line marks the overall proportion in the RQ1.1 data set, .08774. Once income rises above $35,000, the proportion with both high BP and diabetes falls below this overall value and keeps dropping sharply. It is as low as 4.28% for the income group above $75,000, compared to 15.47% for the income group between $10,000 and $15,000.
Relative to males, the proportion of female BP and diabetic patients generally increases as we move down the income groups.

Research Question 2:

Does the proportion of respondents who have both diabetes and hypertension vary with level of education? The education levels are College 4 years or more (College graduate), College 1 year to 3 years (Some college or technical school), Grade 12 or GED (High school graduate), Grades 9 through 11 (Some high school), Grades 1 through 8 (Elementary), and Never attended school or only kindergarten. Does it increase or decrease with higher levels of education? Within each of these levels, how does it vary between males and females? Knowing full well that the answers will not establish any causal relationship, can we identify an association between the explanatory variable (educa) and the response variables (diabete3 and bpmeds), with sex as a grouping variable?

RQ2.1 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(educa)) %>%
  select(sex, bpmeds,diabete3, educa) %>%
  mutate(HighBPDiab = 
           ifelse(bpmeds == "Yes" & 
                    diabete3 == "Yes", 1, 0)) %>%
  group_by(educa) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total),
            .groups = "drop") %>%
  arrange(desc(HighBPDiab_Prop))
RQ2.1

## # A tibble: 6 x 4
##   educa                                         Total HighBPDiab HighBPDiab_Prop
##   <fct>                                         <int>      <int>           <dbl>
## 1 Grades 1 through 8 (Elementary)               13395       2357          0.176 
## 2 Never attended school or only kindergarten      674        109          0.162 
## 3 Grades 9 though 11 (Some high school)         28141       4127          0.147 
## 4 Grade 12 or GED (High school graduate)       142971      15437          0.108 
## 5 College 1 year to 3 years (Some college or ~ 134196      11795          0.0879
## 6 College 4 years or more (College graduate)   170120       9730          0.0572

As can be seen from the table, there is a strong negative association between the level of education and the proportion of high BP and diabetes patients. For graduates with 4 or more years of college, the proportion is just .0572, whereas it is as high as .176 for those educated only to grades 1 through 8.
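One way to put a number on this pattern is a chi-squared test for trend in proportions across the ordered education levels. The sketch below uses the counts from the RQ2.1 table above; scoring the levels 1 to 6 assumes they are equally spaced, and the test ignores the survey weights, so both are simplifying assumptions rather than part of the original analysis.

```r
# Counts copied from the RQ2.1 table above, reordered from lowest to highest education.
cases <- c(109, 2357, 4127, 15437, 11795, 9730)
total <- c(674, 13395, 28141, 142971, 134196, 170120)

# Proportion with both conditions at each education level
round(cases / total, 4)

# Chi-squared test for a linear trend in proportions across ordered groups.
# Scores 1..6 treat the education levels as equally spaced (an assumption).
trend <- prop.trend.test(cases, total, score = seq_along(cases))
trend$statistic
trend$p.value
```

A small p-value supports a trend in the proportions across levels, but like the rest of this report it describes an association and says nothing about cause.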

But before plotting a graph, let us calculate the total proportion of those who suffer from both the High BP and Diabetes in our data-set RQ2.1.

sum(RQ2.1$HighBPDiab) / sum(RQ2.1$Total)

## [1] 0.08898

As can be seen, the proportion of persons in our RQ2.1 data set suffering from both hypertension and diabetes is 0.08898. We will draw it as a vertical dashed blue line on the graph to show the relative positions of the various groups.

Let us try to plot it on graph (2.1).

RQ2.1 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(educa)) %>%
  select(sex, bpmeds,diabete3, educa) %>%
  mutate(HighBPDiab = 
           ifelse(bpmeds == "Yes" & 
                    diabete3 == "Yes", 1, 0)) %>%
  group_by(educa) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total),
            .groups = "drop") %>%
  arrange(desc(HighBPDiab_Prop))
  
  
RQ2.1 %>% ggplot(aes(x = HighBPDiab_Prop,
                    y = educa)) +
  geom_bar(stat = "identity",fill = "#FF6666") +
  geom_vline(xintercept = .08898, size = 2, linetype = "dashed",
             col = "blue") +  scale_x_continuous(limits = c(0, .20)) +
  scale_y_discrete(labels = c("College 4 years or more (College graduate)" = "4yr or more College Graduate", 
                                  "College 1 year to 3 years (Some college or technical school)" = "1-3yr College or Technical Schl",
                                  "Grade 12 or GED (High school graduate)" = "High School Graduate",
                                  "Grades 9 though 11 (Some high school)" = "High School",
                                  "Grades 1 through 8 (Elementary)" = "Elementary", 
                                  "Never attended school or only kindergarten" = "No School or Kindergarden")) +
  geom_text(aes(label = HighBPDiab),size = 4,
            position = position_fill(vjust = .01)) + 
            labs(title =
"Graph(2.1) BP&Diabetic 
For Education Levels", y = "Education", x = "Proportion of BP and Diabetic Patients") +
 theme_economist()

As can be seen in Graph (2.1), the dashed blue line marks the overall proportion in the RQ2.1 data set, .08898. As the education level decreases, the proportion with both high BP and diabetes increases sharply. For graduates with 4 or more years of college, the proportion is just .0572, whereas it is as high as .176 for those educated only to grades 1 through 8. The numbers inside the graph denote the number of patients suffering from both high BP and diabetes within each group.

Let us now compare it among Males and Females within these education levels.

RQ2.2 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(educa)) %>%
  select(sex, bpmeds,diabete3, educa) %>%
  mutate(HighBPDiab = 
           ifelse(bpmeds == "Yes" & 
                    diabete3 == "Yes", 1, 0)) %>%
  group_by(educa, sex) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total),
            .groups = "drop") %>%
  arrange(desc(HighBPDiab_Prop))
  
  
RQ2.2 %>% ggplot(aes(fill = sex, y = educa,
                           x = HighBPDiab_Prop)) +
  geom_bar(position = "fill", stat = "identity") +
 geom_vline(xintercept = .5, size = 2, linetype = "dashed",
             col = "blue") + geom_text(aes(label = HighBPDiab),size = 4,
            position = position_fill(vjust = .01)) + 
  scale_y_discrete(labels = c("College 4 years or more (College graduate)" = "4yr+ College Graduate", 
                                  "College 1 year to 3 years (Some college or technical school)" = "1-3yr College/TSchl",
                                  "Grade 12 or GED (High school graduate)" = "HS Graduate",
                                  "Grades 9 though 11 (Some high school)" = "High School",
                                  "Grades 1 through 8 (Elementary)" = "Elementary", 
                                  "Never attended school or only kindergarten" = "KG/No School")) +
            labs(title =
"Graph(2.2) BP&Diabetic 
For Education Levels", y = "Education", x = "Proportion of BP and Diabetic Patients") +
 theme_economist()

As can be seen in Graph (2.2), as we move down the education levels among the respondents, the proportion of female patients steadily goes up (except for the kindergarten level, which has a very small total). The proportion is as high as 19.4% at elementary level and just 4.82% among the most highly educated. The numbers inside each bar denote the absolute numbers of males and females suffering from both diabetes and hypertension.

Conclusion for Research Question 2

A strong negative association is observed between the level of education and the proportion of high BP and diabetes patients. For graduates with 4 or more years of college, the proportion is just .0572, whereas for those with elementary education it is about three times this value, as high as .176.
As we move down the education levels among the respondents, the proportion of female patients steadily goes up (except for the kindergarten level, which has a very small total). The proportion is as high as 19.4% at elementary level and just 4.82% among the most highly educated.

Research Question 3:

Does the proportion of respondents who have both diabetes and hypertension vary with status of employment? Is being out of work for a long period, retired, or unable to work associated with a higher proportion of hypertension and diabetes patients? Within each of these groups, how does it vary between males and females? Knowing full well that the answers will not establish any causal relationship, can we identify an association between the explanatory variable (employ1) and the response variables (diabete3 and bpmeds), with sex as a grouping variable?

RQ3.1 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(employ1)) %>%
  select(sex, bpmeds,diabete3, employ1) %>%
  mutate(HighBPDiab = 
           ifelse(bpmeds == "Yes" & 
                    diabete3 == "Yes", 1, 0)) %>%
  group_by(employ1) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total),
            .groups = "drop") %>%
  arrange(desc(HighBPDiab_Prop))

RQ3.1

## # A tibble: 8 x 4
##   employ1                           Total HighBPDiab HighBPDiab_Prop
##   <fct>                             <int>      <int>           <dbl>
## 1 Unable to work                    37453       8256         0.220  
## 2 Retired                          138259      20936         0.151  
## 3 Out of work for 1 year or more    14073       1142         0.0811 
## 4 A homemaker                       31646       2377         0.0751 
## 5 Out of work for less than 1 year  12241        587         0.0480 
## 6 Self-employed                     39832       1739         0.0437 
## 7 Employed for wages               202200       8348         0.0413 
## 8 A student                         12682        103         0.00812

As the above table clearly shows, ‘unable to work’ with 22.04% and ‘retired’ with 15.14% are the two groups that pull up the overall proportion of BP and diabetic patients.

But before plotting a graph, let us calculate the total proportion of those who suffer from both the High BP and Diabetes in our data-set RQ3.1.

sum(RQ3.1$HighBPDiab) / sum(RQ3.1$Total)

## [1] 0.08904

As can be seen, the proportion of persons in our RQ3.1 data set suffering from both hypertension and diabetes is 0.08904. We will draw it as a vertical dashed blue line on the graph to show the relative positions of the various groups.

The ‘unable to work’ group with 22.04% and the ‘retired’ group with 15.14% are the two groups that lie far above the overall 8.90%. Let us examine this on the graph below.

RQ3.1 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(employ1)) %>%
  select(sex, bpmeds,diabete3, employ1) %>%
  mutate(HighBPDiab = 
           ifelse(bpmeds == "Yes" & 
                    diabete3 == "Yes", 1, 0)) %>%
  group_by(employ1) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total),
            .groups = "drop") %>%
  arrange(desc(HighBPDiab_Prop))
  
  
RQ3.1 %>% ggplot(aes(x = HighBPDiab_Prop,
                    y = employ1)) +
  geom_bar(stat = "identity",fill = "#FF6666") +
  geom_vline(xintercept = .08904, size = 2, linetype = "dashed",
             col = "blue") +  scale_x_continuous(limits = c(0, .23)) +
  scale_y_discrete(labels = c("Unable to work" = "Unable to Work",
                                  "Retired" = "Retired",
                                  "Out of work for 1 year or more" = "Out of Work 1yr+",
                                  "A homemaker" = "Homemaker", 
                                  "Out of work for less than 1 year" = "Out of Work less 1yr",
                              "Self-employed" = "Self-employed",
                              "Employed for wages" = "Employed for Wages",
                              "A student" = "student")) +
  geom_text(aes(label = HighBPDiab),size = 4,
            position = position_fill(vjust = .01)) + 
            labs(title ="Graph(3.1) BP&Diabetic 
For Employment Status", y = "Employment Status", x = "Proportion of BP and Diabetic Patients") +
 theme_economist()

Graph (3.1) clearly depicts how the ‘unable to work’ and ‘retired’ groups lie far above the overall proportion of diabetic and high BP patients. The dashed blue line depicts the overall proportion across all respondents in the RQ3.1 data set, which is .08904. The numbers inside the graph denote the number of patients suffering from both high BP and diabetes within each group.

Let us now compare it among Males and Females within these Employment Status levels.

RQ3.2 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(employ1)) %>%
  select(sex, bpmeds,diabete3, employ1) %>%
  mutate(HighBPDiab = 
           ifelse(bpmeds == "Yes" & 
                    diabete3 == "Yes", 1, 0)) %>%
  group_by(employ1, sex) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total),
            .groups = "drop")

RQ3.2

## # A tibble: 16 x 5
##    employ1                          sex     Total HighBPDiab HighBPDiab_Prop
##    <fct>                            <fct>   <int>      <int>           <dbl>
##  1 Employed for wages               Male    91055       3941         0.0433 
##  2 Employed for wages               Female 111145       4407         0.0397 
##  3 Self-employed                    Male    23081       1193         0.0517 
##  4 Self-employed                    Female  16751        546         0.0326 
##  5 Out of work for 1 year or more   Male     5830        506         0.0868 
##  6 Out of work for 1 year or more   Female   8243        636         0.0772 
##  7 Out of work for less than 1 year Male     5709        264         0.0462 
##  8 Out of work for less than 1 year Female   6532        323         0.0494 
##  9 A homemaker                      Male      610         31         0.0508 
## 10 A homemaker                      Female  31036       2346         0.0756 
## 11 A student                        Male     5382         41         0.00762
## 12 A student                        Female   7300         62         0.00849
## 13 Retired                          Male    54893       9242         0.168  
## 14 Retired                          Female  83366      11694         0.140  
## 15 Unable to work                   Male    13367       2945         0.220  
## 16 Unable to work                   Female  24086       5311         0.221

RQ3.2 <- brfss2013 %>%
  filter(!is.na(sex), !is.na(employ1)) %>%
  select(sex, bpmeds,diabete3, employ1) %>%
  mutate(HighBPDiab = 
           ifelse(bpmeds == "Yes" & 
                    diabete3 == "Yes", 1, 0)) %>%
  group_by(employ1, sex) %>% 
  summarise(Total = n(), HighBPDiab = sum(HighBPDiab %in% 1),
            HighBPDiab_Prop = (HighBPDiab / Total),
            .groups = "drop") %>%
  arrange(desc(HighBPDiab_Prop))
  
  
RQ3.2 %>% ggplot(aes(fill = sex, y = employ1,
                           x = HighBPDiab_Prop)) +
  geom_bar(position = "fill", stat = "identity") +
 geom_vline(xintercept = .5, size = 2, linetype = "dashed",
             col = "blue") + geom_text(aes(label = HighBPDiab),size = 4,
            position = position_fill(vjust = .01)) + 
  scale_y_discrete(labels = c("Unable to work" = "Unable to Work",
                                  "Retired" = "Retired",
                                  "Out of work for 1 year or more" = "Out of Work 1yr+",
                                  "A homemaker" = "Homemaker", 
                                  "Out of work for less than 1 year" = "Out of Work less 1yr",
                              "Self-employed" = "Self-employed",
                              "Employed for wages" = "Employed for Wages",
                              "A student" = "student")) + 
            labs(title =
"Graph(3.2) BP&Diabetic 
For Employment Status", y = "Employment Status", x = "Proportion of BP and Diabetic Patients") +
 theme_economist()

Among homemakers, the proportion of females suffering from high BP and diabetes is 0.0756 and among males it is 0.0508. The numbers inside each bar in the graph above denote the absolute numbers of males and females suffering from both diabetes and hypertension.
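For the homemaker group it helps to separate two different quantities: the rate within each sex, and how the cases split between the sexes. The sketch below uses the counts from the “A homemaker” rows of the RQ3.2 table above.

```r
# Counts copied from the "A homemaker" rows of the RQ3.2 table above.
total <- c(Male = 610, Female = 31036)
cases <- c(Male = 31,  Female = 2346)

rate_within_sex <- cases / total        # proportion of each sex with both conditions
share_of_cases  <- cases / sum(cases)   # how the cases split between the sexes
share_of_group  <- total / sum(total)   # how the respondents split between the sexes

round(rbind(rate_within_sex, share_of_cases, share_of_group), 4)
```

Nearly all homemaker cases are female mainly because nearly all homemaker respondents are female. The within-sex rates are the fairer comparison, and the male rate rests on only 31 cases, so it is estimated much less precisely than the female rate.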

Conclusion for Research Question 3

The proportion of persons in our RQ3.1 data set suffering from both hypertension and diabetes is 0.08904. The ‘unable to work’ group with 22.04% and the ‘retired’ group with 15.14% are the two groups that lie far above this average.
Among homemakers, the proportion of females suffering from high BP and diabetes is 0.0756 and among males it is 0.0508. This group has a disproportionately high female proportion of high BP and diabetic patients.