suppressMessages(library(tidyverse))
suppressMessages(library(ggtext))Loading required dataset into R (brfss 2013)
load("brfss2013.RData")The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The System measures behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. All 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point- in-time (usually one to three months).
The BRFSS system collects uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.
Since 2011, BRFSS collects its data via telephone (landline and cellular) surveys.For landline surveys, data is collected from a randomly selected adult in a household whereas collecting via cellular means involves randomly sampling an adult who resides in a private residence or college housing.
Research question 1: Between men and women, who suffer the most heart attacks?
Research question 2: What is the relationship between walking as an exercise and the number of days full of energy?
Research question 2: What is the most diagnosed chronic health condition in the US?
Research question 1: Between men and women, who have suffered the most heart attacks?
From an article Based on a study in Norway, the Harvard Medical School in 2016 published men tend to suffer heart attacks than women even at an early age. There somehow appear to be a global unanimity on this finding and we will therefore like to see if the US can also corroborate this finding.
For this inquiry, we will require the variables sex and cvdinfr4. The method by which the data was collected will only suggests an association between the variables of interest since the data generated wasn’t effected in an experiment but by just observation as such only a generalization could be made but no causality. Nonetheless, the method of data collection reeks potential bias since it only reaches people with tele and cellular phones and as a result, the supposed random sampling employed may not be representative of the entire population, hence, sex representation may be biased.
The data of interest is subsetted using the code below
brfss2013[, c("sex", "cvdinfr4")] -> `data for q1`Below are the first and last six observations of the dataset data for q1
`data for q1`[1:6, ] %>%
print## sex cvdinfr4
## 1 Female No
## 2 Female No
## 3 Female No
## 4 Female No
## 5 Male No
## 6 Female No
`data for q1`[(nrow(`data for q1`)-5):nrow(`data for q1`), ] %>%
print## sex cvdinfr4
## 491770 Male No
## 491771 Female Yes
## 491772 Male No
## 491773 Female No
## 491774 Female No
## 491775 Male No
To look at the structure of the data or the variables for our analysis of Research question 1 1, we run the code below
str(`data for q1`)## 'data.frame': 491775 obs. of 2 variables:
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ cvdinfr4: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
The code chunk above prints to screen the the class of each vector or variable for this analysis. They are factors (categorical variables*) of two levels each.
Now we check to see if there are missing values in each of the variables and if so, how many there are
any(is.na(`data for q1`))## [1] TRUE
Indeed there are. Now, how many are there?
colSums(is.na(`data for q1`))## sex cvdinfr4
## 7 2587
Apparently, there are 7 and 2587 in the variables sex and cvdinfr4, respectively.
Now we compute how many males and females were involved in the study
`data for q1` %>%
group_by(sex) %>%
summarise(length = length(sex)) %>%
data.frame() %>%
print## sex length
## 1 Male 201313
## 2 Female 290455
## 3 <NA> 7
Of all who participated in the study, 201313 responded as males, 290455 responded as females and 7 responded to be neither of the two.
Of these statistics, how many males responded yes to having been diagnosed with heart attack?
`data for q1` %>%
filter(!(is.na(cvdinfr4))) %>%
filter(sex == "Male") %>%
group_by(cvdinfr4) %>%
summarise(length = length(cvdinfr4)) %>%
rename(`Diagnosed with heart attack` = cvdinfr4) %>%
data.frame() %>%
print## Diagnosed.with.heart.attack length
## 1 Yes 16059
## 2 No 184087
How many females responded yes to having been diagnosed with heart attack?
`data for q1` %>%
filter(!(is.na(cvdinfr4))) %>%
filter(sex == "Female") %>%
group_by(cvdinfr4) %>%
summarise(length = length(cvdinfr4)) %>%
rename(`Diagnosed with heart attack` = cvdinfr4) %>%
data.frame() %>%
print## Diagnosed.with.heart.attack length
## 1 Yes 13223
## 2 No 275816
To illustrate the above distribution of having been diagnosed with a heart attack conditioned by sex, here is a barplot.
drop_na(`data for q1`) %>%
filter(cvdinfr4 == "Yes") %>%
ggplot(aes(x = sex)) +
geom_bar(fill = c("darkblue","brown")) +
labs(title = "Number of Males and Females Diagnosed with Heart Attack") +
theme(axis.title = element_text(size = 21, face = "bold"),
axis.text = element_text(size = 19, face = "bold"),
plot.title = element_text(size = 26, face = "bold", hjust = 0.5)) The graph above shows and re-affirms how males have been known to have suffered the most of heart attacks and continuous to do so. This therefore suggests the most of heart attacks are suffered by males in the US..
Research question 2: What is the relationship between walking as an exercise and the number of days full of energy?
Walking amongst other exercises have been proven to prevent heart disease, lower blood pressure, strengthen muscles, and maintains a healthy weight according to the online group Healthy and Natural World. Healthy living is been known to be a precursor for happy a life, so we would like to explore the association between engaging in walking as exercise or just walking most often and the length of liveliness of happy days one gets to enjoy. The variables of interest in this analysis are; qlhlth2,exract11,exeroft1. Just like the preceding research question, this analysis can only and will only explore the relationship between the variable concerned since the method of data collection was just observation and didn’t involve assigning participants to treatments. Although, the method of data collection makes this certain, the generalizability may come from an unrepresentative sample since not everybody has access to cellular or telephones. Hence, the to-be established relationship between walking as exercise and days full of energy are not representative of the population of the US.
The code below prints the first 6 rows of the data of interest to screen
head(brfss2013[ ,c("exract11","exeroft1","qlhlth2")]) %>% print## exract11 exeroft1 qlhlth2
## 1 <NA> NA 0
## 2 Walking 105 25
## 3 <NA> NA 2
## 4 Walking 205 20
## 5 <NA> NA NA
## 6 Bicycling machine exercise 102 NA
Below is the class of each variable in the aboe dataframe
str(brfss2013[ ,c("exract11","exeroft1","qlhlth2")]) %>% print## 'data.frame': 491775 obs. of 3 variables:
## $ exract11: Factor w/ 75 levels "Active Gaming Devices (Wii Fit, Dance, Dance revolution)",..: NA 64 NA 64 NA 6 64 64 7 64 ...
## $ exeroft1: int NA 105 NA 205 NA 102 220 102 102 220 ...
## $ qlhlth2 : int 0 25 2 20 NA NA NA NA NA NA ...
## NULL
The information above reveals the first variable exract11 which represents type of physical activityis categorical with 75 levels, exeroft1 which represents number of times exercised per week and month is an integer and qlhlth2 also an integer represents number of days full of energy.
The summary of statistics inherent in the data is below
summary(brfss2013[ ,c("exract11","exeroft1","qlhlth2")]) %>% print## exract11 exeroft1
## Walking :180051 Min. : 0.0
## Running : 23152 1st Qu.:103.0
## Gardening (spading, weeding, digging, filling): 20026 Median :105.0
## Other : 14119 Mean :135.8
## Weight lifting : 10226 3rd Qu.:203.0
## (Other) : 83253 Max. :299.0
## NA's :160948 NA's :164165
## qlhlth2
## Min. : 0.0
## 1st Qu.: 2.0
## Median : 15.0
## Mean : 15.9
## 3rd Qu.: 28.0
## Max. :243.0
## NA's :491310
The first column of the output above represents the times each observation or each physical activity is frequented in the distribution, the second and third columns display the basic statistics(mean,quartiles,min,max and NAs) of the number of times exercised per week/month and number of days full of energy.
From the codebook, if an observation of the exeroft1 variable falls within 101-199, it represents the number of times the subject or respondent exercised per week and if falls with 201-299, it represents the number of times respondent exercised per month. since our interest is in ascertaining the relationship between healthy days and number of walking exercises per week, we subset or further narrow down our data to only responses on number of times walked per week.
brfss2013[
brfss2013$exract11 == "Walking" & (brfss2013$exeroft1 >= 101 & brfss2013$exeroft1 <= 199), c("exract11","exeroft1","qlhlth2")
] -> `walk per week`
head(`walk per week`)## exract11 exeroft1 qlhlth2
## NA <NA> NA NA
## 2 Walking 105 25
## NA.1 <NA> NA NA
## NA.2 <NA> NA NA
## 8 Walking 102 NA
## NA.3 <NA> NA NA
Since we are only interested in responses == Walking and not non-responses, we delete all NAs within the exract11 column or vector with their corresponfing responses in the remainiing
`walk per week`[!is.na(`walk per week`), ] -> `walk per week`
print(head(`walk per week`))## exract11 exeroft1 qlhlth2
## 2 Walking 105 25
## 8 Walking 102 NA
## 20 Walking 107 NA
## 26 Walking 105 NA
## 27 Walking 107 NA
## 31 Walking 107 NA
In order to establish this relationship, we compare the number of days of respondents whose values fall below the 25th percentile, those that are >= the 25th but < the 75th and those >= 75th percentile to their equivalent values in qlhlth2column. thus, length of days full of energy .
quantile(`walk per week`[,2], probs = c(.25, .75), na.rm = T)## 25% 75%
## 103 107
The following above are the 25th and 75th percentile or quartile marks of the distribution or variable exeroft1 or number of exercise per week. so now we check the average exeroft1 (number of days full of energy) for the variable qlhlth2 (number of times of walking per week) that - fall below the 25th percentile. - are above the least 25% but below the highest 25%. - are in the top 25% of number of times walking per week .
data.frame(
`least 25 perc` = mean(`walk per week`[`walk per week`[,2] < 103, ][,3], na.rm = T),
`Middle 50 perc` = mean(`walk per week`[`walk per week`[,2] >= 103 & `walk per week`[,2] < 107, ][,3], na.rm = T),
`Highest 25 perc` = mean(`walk per week`[`walk per week`[,2] >= 107, ][,3], na.rm = T)
) -> per_avg
print(per_avg)## least.25.perc Middle.50.perc Highest.25.perc
## 1 13.58824 19.65152 18.72222
From the table above, people whose walk time per week lie within the least 25% spend about 13.6 days full of energy which is the least, whereas those within the middle 50% have an average length of about 19.7 days. However, people who walk for the most times per week enjoy an average of 18.7 days full of energy just a little short of those in the middle 50%. The analysis seems to corroborate the old saying “too much of everything is bad”, such that, people who spent a moderate amount of time walking as an exercise, have the most days full energy although it doesn’t deviate significantly from that of the top 25%.
A graphical representation of the data above lies below
data.frame(
names = colnames(per_avg) %>%
fct_reorder((data.frame(t(per_avg)))[,1]),
av = (data.frame(t(per_avg)))[,1]
) %>%
ggplot(aes(names, av, fill = colorRampPalette(c("brown","grey"))(3))) +
geom_col() +
scale_fill_identity() +
labs(title = "Days full of Energy for Each Group", y = "Length of Days", x = "Group") +
theme(plot.title = element_text(size = 26, face = "bold",hjust = 0.5),
axis.title = element_text(size = 21, face = "bold"),
axis.text = element_text(size = 19)) The barplot doesn’t seem to show any form of direct relationship between the walking and number of Days full of energy. However, it revealed that not exercise enough robs you of a healthy and energetic life style and somehow, it subtly highlights excessive walking/exercise is also not all that good.
Research question 3:What is the most diagnosed chronic health condition in the US
The Center for Disease COntrol and Prevention estimates that, Chronic diseases are the leading causes of death and disability in the United States. They are also leading drivers of the nation’s $3.5 trillion in annual health care cost. It also states that, diabetes, heart diseases and cancer are the most prevalent of these health conditions, hence we would like asses the most diagnosed of the chronic conditions available to see if it corroborates with existing assertions and findings. Again the most diagnosed chronic health condition to be identified via this analysis is potentially biased since only persons with cellular and telephone devices can conveniently participate in the study and hence are certainly not representative of the US population.
The variables of interest for the analysis are sex,cvdinfr4,cvdcrhd4,cvdstrk3,asthma3,chcscncr,chcocncr,chccopd1,havarth3,addepev2,chckidny,diabete3 which represent an inquiry of having been diagnosed with the following chronic health conditions,respectively. heart attack,Angina Or Coronary Hear Disease,Stroke,Asthma,Skin Cancer,Other Types of Cancer,Chronic Ostructive Pulmonary Disease (Copd),Arthritis,Depressive Disorder,Kdney Disease,Diabetes.
This inquiry only presents a total of diagnoses or number of persons to have suffered a disease but not establish an association and will not be scientifically able to establish causation. Now we subset our variables of interest from the parent data brfss2013.
#codenames
chron <- c('cvdinfr4','cvdcrhd4','cvdstrk3','asthma3','chcscncr','chcocncr',
'chccopd1','havarth3','addepev2','chckidny','diabete3')
#codename in full
name <- c('heart attack','Angina Or Coronary Heart Disease','Stroke','Asthma','Skin Cancer',
'Other Types of Cancer', 'Chronic Ostructive Pulmonary Disease (Copd)','Arthritis',
'Depressive Disorder','Kdney Disease','Diabetes')
brfss2013 %>%
select(sex) %>%
data.frame(brfss2013[, colnames(brfss2013) %in% chron]) -> `dominant chron`We print the first six rows of our data subset of interest
print(head(`dominant chron`))## sex cvdinfr4 cvdcrhd4 cvdstrk3 asthma3 chcscncr chcocncr chccopd1 havarth3
## 1 Female No <NA> No Yes No No Yes Yes
## 2 Female No No No No No No No No
## 3 Female No No No No No No No Yes
## 4 Female No No No No No No No No
## 5 Male No No No Yes No Yes No No
## 6 Female No No No No No No No No
## addepev2 chckidny diabete3
## 1 Yes Yes No
## 2 Yes No No
## 3 Yes No No
## 4 No No No
## 5 No No No
## 6 No No No
We them go through the structure of the data to reveal some inherent properties as a function of its variables
str(`dominant chron`)## 'data.frame': 491775 obs. of 12 variables:
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ cvdinfr4: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ cvdcrhd4: Factor w/ 2 levels "Yes","No": NA 2 2 2 2 2 2 1 2 2 ...
## $ cvdstrk3: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ asthma3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 2 ...
## $ chcscncr: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ chcocncr: Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 2 2 2 ...
## $ chccopd1: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
## $ havarth3: Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 1 1 1 2 ...
## $ addepev2: Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
## $ chckidny: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
## $ diabete3: Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 3 3 3 3 3 3 3 3 3 3 ...
As its visibly printed to screen above, all variables of the dataset are categorical with two levels: Yes and No. A look at the summary of basic statistics of the data
summary(`dominant chron`)## sex cvdinfr4 cvdcrhd4 cvdstrk3 asthma3
## Male :201313 Yes : 29284 Yes : 29064 Yes : 20391 Yes : 67204
## Female:290455 No :459904 No :458288 No :469917 No :423012
## NA's : 7 NA's: 2587 NA's: 4423 NA's: 1467 NA's: 1559
##
##
## chcscncr chcocncr chccopd1 havarth3 addepev2
## Yes : 45446 Yes : 47074 Yes : 40660 Yes :165152 Yes : 95779
## No :445020 No :443553 No :448391 No :323653 No :393707
## NA's: 1309 NA's: 1148 NA's: 2724 NA's: 2970 NA's: 2289
##
##
## chckidny diabete3
## Yes : 15901 Yes : 62363
## No :474153 Yes, but female told only during pregnancy: 4602
## NA's: 1721 No :415374
## No, pre-diabetes or borderline diabetes : 8604
## NA's : 832
The output above shows the times observations frequent each column/variable including missing values.
Although, the above summary displays frequency of responses in each variable including missing values, we explicitly print to screen the number of non-responses in each column of the data subset dominant chron
colSums(is.na(`dominant chron`))## sex cvdinfr4 cvdcrhd4 cvdstrk3 asthma3 chcscncr chcocncr chccopd1
## 7 2587 4423 1467 1559 1309 1148 2724
## havarth3 addepev2 chckidny diabete3
## 2970 2289 1721 832
And since we have no need for those, we remove them and their corresponding responses in the other columns since we are only interested in respondents whose sex where identified to be either male or female, thus, respondents identified to be persons
`dominant chron` <- `dominant chron`[!is.na(`dominant chron`$sex), ]
colSums(is.na(`dominant chron`))## sex cvdinfr4 cvdcrhd4 cvdstrk3 asthma3 chcscncr chcocncr chccopd1
## 0 2583 4419 1463 1554 1305 1143 2719
## havarth3 addepev2 chckidny diabete3
## 2965 2284 1715 828
Now all non-responses for sex are removed and now we compute the number of “yeses” in each column of the dataset dominant chron
#function to enumerate yeses in a vector
yes <- function(x){
x[!is.na(x)] -> x
x[x == "Yes"] %>%
length
}
#looping the yes function over the columns of the data "dominant chron"
sapply(`dominant chron`, yes) %>%
print()## sex cvdinfr4 cvdcrhd4 cvdstrk3 asthma3 chcscncr chcocncr chccopd1
## 0 29282 29062 20390 67203 45445 47074 40660
## havarth3 addepev2 chckidny diabete3
## 165151 95778 15901 62361
The dataframe above shows the number of yeses in each column which means the number of persons to have ever suffered the chronic health condition which are coded as print(chron) for column names as opposed to their full names; print(names).
In order of descent is the highlights the most diagnosed chronic health condition in the US.
yes <- function(x){
x[!is.na(x)] -> x
x[x == "Yes"] %>%
length
}
sapply(`dominant chron`, yes) %>%
data.frame() -> a; a <- data.frame(
`Chronic disorder` = rownames(a)[-1],
`full name` = name,
Infections = a[,1][-1]
) %>%
arrange(desc(Infections)) %>%
data.frame() -> a
print(a)## Chronic.disorder full.name Infections
## 1 havarth3 Arthritis 165151
## 2 addepev2 Depressive Disorder 95778
## 3 asthma3 Asthma 67203
## 4 diabete3 Diabetes 62361
## 5 chcocncr Other Types of Cancer 47074
## 6 chcscncr Skin Cancer 45445
## 7 chccopd1 Chronic Ostructive Pulmonary Disease (Copd) 40660
## 8 cvdinfr4 heart attack 29282
## 9 cvdcrhd4 Angina Or Coronary Heart Disease 29062
## 10 cvdstrk3 Stroke 20390
## 11 chckidny Kdney Disease 15901
For a graphical illustration of the finding above, we use ggplot to create a visualization of the most diagnosed in the US.
fct_reorder(a[ ,1] ,a[,3]) -> label
color <- colorRampPalette(c("red","green","blue", "cyan", "grey","black","yellow"))
ggplot(a, aes(label, a[,3], fill = color(11))) +
geom_col(alpha = 1) +
scale_fill_identity() +
labs(
title = "Dominant Chronic Health Conditon in the US",
x ="Disorder",
y = expression("Number of Infections")
) +
theme(plot.title = element_text(size = 26, hjust = 0.5, face = "bold"),
axis.title = element_text(size = 21, face = "bold"),
axis.text = element_text(size = 19, face = "bold", angle = 270)) The barplot above represents the most chronic health conditions in the US which are represented on the x-axis by their code names which represent the following beginning from the right,respectively.
## chron name
## 1 cvdinfr4 heart attack
## 2 cvdcrhd4 Angina Or Coronary Heart Disease
## 3 cvdstrk3 Stroke
## 4 asthma3 Asthma
## 5 chcscncr Skin Cancer
## 6 chcocncr Other Types of Cancer
## 7 chccopd1 Chronic Ostructive Pulmonary Disease (Copd)
## 8 havarth3 Arthritis
## 9 addepev2 Depressive Disorder
## 10 chckidny Kdney Disease
## 11 diabete3 Diabetes
From the graph, Athritis is shown to have the tallest bar, hence the dominant disorder .This however doesn’t seem to corroborate that of the CDC which highlights heart diseases and cancer to be most chronic illness amongst the US population. This calls for further statistical inference to ascertain which indeed is the most chronic disorder in the US..