Exploring the BRFSS data

Part 3: Exploratory data analysis

Research question 1: Between men and women, who have suffered the most heart attacks?
From an article Based on a study in Norway, the Harvard Medical School in 2016 published men tend to suffer heart attacks than women even at an early age. There somehow appear to be a global unanimity on this finding and we will therefore like to see if the US can also corroborate this finding.
For this inquiry, we will require the variables sex and cvdinfr4. The method by which the data was collected will only suggests an association between the variables of interest since the data generated wasn’t effected in an experiment but by just observation as such only a generalization could be made but no causality. Nonetheless, the method of data collection reeks potential bias since it only reaches people with tele and cellular phones and as a result, the supposed random sampling employed may not be representative of the entire population, hence, sex representation may be biased.
The data of interest is subsetted using the code below

brfss2013[, c("sex", "cvdinfr4")] -> `data for q1`

Below are the first and last six observations of the dataset data for q1

`data for q1`[1:6, ] %>% 
       print

##      sex cvdinfr4
## 1 Female       No
## 2 Female       No
## 3 Female       No
## 4 Female       No
## 5   Male       No
## 6 Female       No

`data for q1`[(nrow(`data for q1`)-5):nrow(`data for q1`), ] %>% 
       print

##           sex cvdinfr4
## 491770   Male       No
## 491771 Female      Yes
## 491772   Male       No
## 491773 Female       No
## 491774 Female       No
## 491775   Male       No

To look at the structure of the data or the variables for our analysis of Research question 1 1, we run the code below

str(`data for q1`)

## 'data.frame':    491775 obs. of  2 variables:
##  $ sex     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ cvdinfr4: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...

The code chunk above prints to screen the the class of each vector or variable for this analysis. They are factors (categorical variables*) of two levels each.

Now we check to see if there are missing values in each of the variables and if so, how many there are

any(is.na(`data for q1`))

## [1] TRUE

Indeed there are. Now, how many are there?

colSums(is.na(`data for q1`))

##      sex cvdinfr4 
##        7     2587

Apparently, there are 7 and 2587 in the variables sex and cvdinfr4, respectively.

Now we compute how many males and females were involved in the study

`data for q1` %>% 
       group_by(sex) %>% 
       summarise(length = length(sex)) %>% 
       data.frame() %>% 
       print

##      sex length
## 1   Male 201313
## 2 Female 290455
## 3   <NA>      7

Of all who participated in the study, 201313 responded as males, 290455 responded as females and 7 responded to be neither of the two.
Of these statistics, how many males responded yes to having been diagnosed with heart attack?

`data for q1` %>% 
       filter(!(is.na(cvdinfr4))) %>% 
       filter(sex == "Male") %>% 
       group_by(cvdinfr4) %>% 
       summarise(length = length(cvdinfr4)) %>% 
       rename(`Diagnosed with heart attack` = cvdinfr4) %>% 
       data.frame() %>% 
       print

##   Diagnosed.with.heart.attack length
## 1                         Yes  16059
## 2                          No 184087

How many females responded yes to having been diagnosed with heart attack?

`data for q1` %>% 
       filter(!(is.na(cvdinfr4))) %>% 
       filter(sex == "Female") %>% 
       group_by(cvdinfr4) %>% 
       summarise(length = length(cvdinfr4)) %>% 
       rename(`Diagnosed with heart attack` = cvdinfr4) %>% 
       data.frame() %>% 
       print

##   Diagnosed.with.heart.attack length
## 1                         Yes  13223
## 2                          No 275816

To illustrate the above distribution of having been diagnosed with a heart attack conditioned by sex, here is a barplot.

drop_na(`data for q1`) %>% 
       filter(cvdinfr4 == "Yes") %>% 
       ggplot(aes(x = sex)) + 
              geom_bar(fill = c("darkblue","brown")) +
              labs(title = "Number of Males and Females Diagnosed with Heart Attack") +
              theme(axis.title = element_text(size = 21, face = "bold"),
                   axis.text = element_text(size = 19, face = "bold"),
                   plot.title = element_text(size = 26, face = "bold", hjust = 0.5))

The graph above shows and re-affirms how males have been known to have suffered the most of heart attacks and continuous to do so. This therefore suggests the most of heart attacks are suffered by males in the US..

Research question 2: What is the relationship between walking as an exercise and the number of days full of energy?
Walking amongst other exercises have been proven to prevent heart disease, lower blood pressure, strengthen muscles, and maintains a healthy weight according to the online group Healthy and Natural World. Healthy living is been known to be a precursor for happy a life, so we would like to explore the association between engaging in walking as exercise or just walking most often and the length of liveliness of happy days one gets to enjoy. The variables of interest in this analysis are; qlhlth2,exract11,exeroft1. Just like the preceding research question, this analysis can only and will only explore the relationship between the variable concerned since the method of data collection was just observation and didn’t involve assigning participants to treatments. Although, the method of data collection makes this certain, the generalizability may come from an unrepresentative sample since not everybody has access to cellular or telephones. Hence, the to-be established relationship between walking as exercise and days full of energy are not representative of the population of the US.
The code below prints the first 6 rows of the data of interest to screen

head(brfss2013[ ,c("exract11","exeroft1","qlhlth2")]) %>% print

##                     exract11 exeroft1 qlhlth2
## 1                       <NA>       NA       0
## 2                    Walking      105      25
## 3                       <NA>       NA       2
## 4                    Walking      205      20
## 5                       <NA>       NA      NA
## 6 Bicycling machine exercise      102      NA

Below is the class of each variable in the aboe dataframe

str(brfss2013[ ,c("exract11","exeroft1","qlhlth2")]) %>% print

## 'data.frame':    491775 obs. of  3 variables:
##  $ exract11: Factor w/ 75 levels "Active Gaming Devices (Wii Fit, Dance, Dance revolution)",..: NA 64 NA 64 NA 6 64 64 7 64 ...
##  $ exeroft1: int  NA 105 NA 205 NA 102 220 102 102 220 ...
##  $ qlhlth2 : int  0 25 2 20 NA NA NA NA NA NA ...
## NULL

The information above reveals the first variable exract11 which represents type of physical activityis categorical with 75 levels, exeroft1 which represents number of times exercised per week and month is an integer and qlhlth2 also an integer represents number of days full of energy.

The summary of statistics inherent in the data is below

summary(brfss2013[ ,c("exract11","exeroft1","qlhlth2")]) %>% print

##                                            exract11         exeroft1     
##  Walking                                       :180051   Min.   :  0.0   
##  Running                                       : 23152   1st Qu.:103.0   
##  Gardening (spading, weeding, digging, filling): 20026   Median :105.0   
##  Other                                         : 14119   Mean   :135.8   
##  Weight lifting                                : 10226   3rd Qu.:203.0   
##  (Other)                                       : 83253   Max.   :299.0   
##  NA's                                          :160948   NA's   :164165  
##     qlhlth2      
##  Min.   :  0.0   
##  1st Qu.:  2.0   
##  Median : 15.0   
##  Mean   : 15.9   
##  3rd Qu.: 28.0   
##  Max.   :243.0   
##  NA's   :491310

The first column of the output above represents the times each observation or each physical activity is frequented in the distribution, the second and third columns display the basic statistics(mean,quartiles,min,max and NAs) of the number of times exercised per week/month and number of days full of energy.

From the codebook, if an observation of the exeroft1 variable falls within 101-199, it represents the number of times the subject or respondent exercised per week and if falls with 201-299, it represents the number of times respondent exercised per month. since our interest is in ascertaining the relationship between healthy days and number of walking exercises per week, we subset or further narrow down our data to only responses on number of times walked per week.

brfss2013[
  brfss2013$exract11 == "Walking" & (brfss2013$exeroft1 >= 101 & brfss2013$exeroft1 <= 199), c("exract11","exeroft1","qlhlth2")
  ] -> `walk per week`
head(`walk per week`)

##      exract11 exeroft1 qlhlth2
## NA       <NA>       NA      NA
## 2     Walking      105      25
## NA.1     <NA>       NA      NA
## NA.2     <NA>       NA      NA
## 8     Walking      102      NA
## NA.3     <NA>       NA      NA

Since we are only interested in responses == Walking and not non-responses, we delete all NAs within the exract11 column or vector with their corresponfing responses in the remainiing

`walk per week`[!is.na(`walk per week`), ] -> `walk per week`
print(head(`walk per week`))

##    exract11 exeroft1 qlhlth2
## 2   Walking      105      25
## 8   Walking      102      NA
## 20  Walking      107      NA
## 26  Walking      105      NA
## 27  Walking      107      NA
## 31  Walking      107      NA

In order to establish this relationship, we compare the number of days of respondents whose values fall below the 25th percentile, those that are >= the 25th but < the 75th and those >= 75th percentile to their equivalent values in qlhlth2column. thus, length of days full of energy .

quantile(`walk per week`[,2], probs = c(.25, .75), na.rm = T)

## 25% 75% 
## 103 107

The following above are the 25th and 75th percentile or quartile marks of the distribution or variable exeroft1 or number of exercise per week. so now we check the average exeroft1 (number of days full of energy) for the variable qlhlth2 (number of times of walking per week) that - fall below the 25th percentile. - are above the least 25% but below the highest 25%. - are in the top 25% of number of times walking per week .

data.frame(
  `least 25 perc` = mean(`walk per week`[`walk per week`[,2] < 103, ][,3], na.rm = T),
  `Middle 50 perc` = mean(`walk per week`[`walk per week`[,2] >= 103 & `walk per week`[,2] < 107, ][,3], na.rm = T), 
  `Highest 25 perc` = mean(`walk per week`[`walk per week`[,2] >= 107, ][,3], na.rm = T) 
) -> per_avg
print(per_avg)

##   least.25.perc Middle.50.perc Highest.25.perc
## 1      13.58824       19.65152        18.72222

From the table above, people whose walk time per week lie within the least 25% spend about 13.6 days full of energy which is the least, whereas those within the middle 50% have an average length of about 19.7 days. However, people who walk for the most times per week enjoy an average of 18.7 days full of energy just a little short of those in the middle 50%. The analysis seems to corroborate the old saying “too much of everything is bad”, such that, people who spent a moderate amount of time walking as an exercise, have the most days full energy although it doesn’t deviate significantly from that of the top 25%.
A graphical representation of the data above lies below

data.frame(
  names = colnames(per_avg) %>% 
  fct_reorder((data.frame(t(per_avg)))[,1]),
  av = (data.frame(t(per_avg)))[,1]
) %>% 
  ggplot(aes(names, av, fill = colorRampPalette(c("brown","grey"))(3))) + 
  geom_col() + 
  scale_fill_identity() + 
  labs(title = "Days full of Energy for Each Group", y = "Length of Days", x = "Group") +
  theme(plot.title = element_text(size = 26, face = "bold",hjust = 0.5),
        axis.title = element_text(size = 21, face = "bold"),
        axis.text = element_text(size = 19))

The barplot doesn’t seem to show any form of direct relationship between the walking and number of Days full of energy. However, it revealed that not exercise enough robs you of a healthy and energetic life style and somehow, it subtly highlights excessive walking/exercise is also not all that good.

Research question 3:What is the most diagnosed chronic health condition in the US
The Center for Disease COntrol and Prevention estimates that, Chronic diseases are the leading causes of death and disability in the United States. They are also leading drivers of the nation’s $3.5 trillion in annual health care cost. It also states that, diabetes, heart diseases and cancer are the most prevalent of these health conditions, hence we would like asses the most diagnosed of the chronic conditions available to see if it corroborates with existing assertions and findings. Again the most diagnosed chronic health condition to be identified via this analysis is potentially biased since only persons with cellular and telephone devices can conveniently participate in the study and hence are certainly not representative of the US population.
The variables of interest for the analysis are sex,cvdinfr4,cvdcrhd4,cvdstrk3,asthma3,chcscncr,chcocncr,chccopd1,havarth3,addepev2,chckidny,diabete3 which represent an inquiry of having been diagnosed with the following chronic health conditions,respectively. heart attack,Angina Or Coronary Hear Disease,Stroke,Asthma,Skin Cancer,Other Types of Cancer,Chronic Ostructive Pulmonary Disease (Copd),Arthritis,Depressive Disorder,Kdney Disease,Diabetes.
This inquiry only presents a total of diagnoses or number of persons to have suffered a disease but not establish an association and will not be scientifically able to establish causation. Now we subset our variables of interest from the parent data brfss2013.

#codenames
chron <- c('cvdinfr4','cvdcrhd4','cvdstrk3','asthma3','chcscncr','chcocncr', 
           'chccopd1','havarth3','addepev2','chckidny','diabete3')
#codename in full
name <- c('heart attack','Angina Or Coronary Heart Disease','Stroke','Asthma','Skin Cancer',
          'Other Types of Cancer', 'Chronic Ostructive Pulmonary Disease (Copd)','Arthritis',
          'Depressive Disorder','Kdney Disease','Diabetes')

brfss2013 %>% 
       select(sex) %>% 
       data.frame(brfss2013[, colnames(brfss2013) %in% chron]) -> `dominant chron`

We print the first six rows of our data subset of interest

print(head(`dominant chron`))

##      sex cvdinfr4 cvdcrhd4 cvdstrk3 asthma3 chcscncr chcocncr chccopd1 havarth3
## 1 Female       No     <NA>       No     Yes       No       No      Yes      Yes
## 2 Female       No       No       No      No       No       No       No       No
## 3 Female       No       No       No      No       No       No       No      Yes
## 4 Female       No       No       No      No       No       No       No       No
## 5   Male       No       No       No     Yes       No      Yes       No       No
## 6 Female       No       No       No      No       No       No       No       No
##   addepev2 chckidny diabete3
## 1      Yes      Yes       No
## 2      Yes       No       No
## 3      Yes       No       No
## 4       No       No       No
## 5       No       No       No
## 6       No       No       No

We them go through the structure of the data to reveal some inherent properties as a function of its variables

str(`dominant chron`)

## 'data.frame':    491775 obs. of  12 variables:
##  $ sex     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ cvdinfr4: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cvdcrhd4: Factor w/ 2 levels "Yes","No": NA 2 2 2 2 2 2 1 2 2 ...
##  $ cvdstrk3: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ asthma3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 2 ...
##  $ chcscncr: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ chcocncr: Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 2 2 2 ...
##  $ chccopd1: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
##  $ havarth3: Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 1 1 1 2 ...
##  $ addepev2: Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
##  $ chckidny: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
##  $ diabete3: Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 3 3 3 3 3 3 3 3 3 3 ...

As its visibly printed to screen above, all variables of the dataset are categorical with two levels: Yes and No. A look at the summary of basic statistics of the data

summary(`dominant chron`)

##      sex         cvdinfr4      cvdcrhd4      cvdstrk3      asthma3      
##  Male  :201313   Yes : 29284   Yes : 29064   Yes : 20391   Yes : 67204  
##  Female:290455   No  :459904   No  :458288   No  :469917   No  :423012  
##  NA's  :     7   NA's:  2587   NA's:  4423   NA's:  1467   NA's:  1559  
##                                                                         
##                                                                         
##  chcscncr      chcocncr      chccopd1      havarth3      addepev2     
##  Yes : 45446   Yes : 47074   Yes : 40660   Yes :165152   Yes : 95779  
##  No  :445020   No  :443553   No  :448391   No  :323653   No  :393707  
##  NA's:  1309   NA's:  1148   NA's:  2724   NA's:  2970   NA's:  2289  
##                                                                       
##                                                                       
##  chckidny                                            diabete3     
##  Yes : 15901   Yes                                       : 62363  
##  No  :474153   Yes, but female told only during pregnancy:  4602  
##  NA's:  1721   No                                        :415374  
##                No, pre-diabetes or borderline diabetes   :  8604  
##                NA's                                      :   832

The output above shows the times observations frequent each column/variable including missing values.

Although, the above summary displays frequency of responses in each variable including missing values, we explicitly print to screen the number of non-responses in each column of the data subset dominant chron

colSums(is.na(`dominant chron`))

##      sex cvdinfr4 cvdcrhd4 cvdstrk3  asthma3 chcscncr chcocncr chccopd1 
##        7     2587     4423     1467     1559     1309     1148     2724 
## havarth3 addepev2 chckidny diabete3 
##     2970     2289     1721      832

And since we have no need for those, we remove them and their corresponding responses in the other columns since we are only interested in respondents whose sex where identified to be either male or female, thus, respondents identified to be persons

`dominant chron` <- `dominant chron`[!is.na(`dominant chron`$sex), ]
colSums(is.na(`dominant chron`))

##      sex cvdinfr4 cvdcrhd4 cvdstrk3  asthma3 chcscncr chcocncr chccopd1 
##        0     2583     4419     1463     1554     1305     1143     2719 
## havarth3 addepev2 chckidny diabete3 
##     2965     2284     1715      828

Now all non-responses for sex are removed and now we compute the number of “yeses” in each column of the dataset dominant chron

#function to enumerate yeses in a vector
yes <- function(x){
       x[!is.na(x)] -> x
       x[x == "Yes"] %>% 
              length
}
#looping the yes function over the columns of the data "dominant chron"
sapply(`dominant chron`, yes) %>% 
       print()

##      sex cvdinfr4 cvdcrhd4 cvdstrk3  asthma3 chcscncr chcocncr chccopd1 
##        0    29282    29062    20390    67203    45445    47074    40660 
## havarth3 addepev2 chckidny diabete3 
##   165151    95778    15901    62361

The dataframe above shows the number of yeses in each column which means the number of persons to have ever suffered the chronic health condition which are coded as print(chron) for column names as opposed to their full names; print(names).
In order of descent is the highlights the most diagnosed chronic health condition in the US.

       yes <- function(x){
       x[!is.na(x)] -> x
       x[x == "Yes"] %>% 
              length
}
sapply(`dominant chron`, yes) %>% 
       data.frame() -> a; a <- data.frame(
              `Chronic disorder` = rownames(a)[-1],
              `full name` = name,
              Infections = a[,1][-1] 
       ) %>% 
       arrange(desc(Infections)) %>% 
       data.frame() -> a
       print(a)

##    Chronic.disorder                                   full.name Infections
## 1          havarth3                                   Arthritis     165151
## 2          addepev2                         Depressive Disorder      95778
## 3           asthma3                                      Asthma      67203
## 4          diabete3                                    Diabetes      62361
## 5          chcocncr                       Other Types of Cancer      47074
## 6          chcscncr                                 Skin Cancer      45445
## 7          chccopd1 Chronic Ostructive Pulmonary Disease (Copd)      40660
## 8          cvdinfr4                                heart attack      29282
## 9          cvdcrhd4            Angina Or Coronary Heart Disease      29062
## 10         cvdstrk3                                      Stroke      20390
## 11         chckidny                               Kdney Disease      15901

For a graphical illustration of the finding above, we use ggplot to create a visualization of the most diagnosed in the US.

fct_reorder(a[ ,1] ,a[,3]) -> label
color <- colorRampPalette(c("red","green","blue", "cyan", "grey","black","yellow"))

ggplot(a, aes(label, a[,3], fill = color(11))) +
  geom_col(alpha = 1) +
  scale_fill_identity() + 
  labs(
    title = "Dominant Chronic Health Conditon in the US",
    x ="Disorder", 
    y = expression("Number of Infections")
  ) +
  theme(plot.title = element_text(size = 26, hjust = 0.5, face = "bold"),
    axis.title = element_text(size = 21, face = "bold"),
    axis.text = element_text(size = 19, face = "bold", angle = 270))

The barplot above represents the most chronic health conditions in the US which are represented on the x-axis by their code names which represent the following beginning from the right,respectively.

##       chron                                        name
## 1  cvdinfr4                                heart attack
## 2  cvdcrhd4            Angina Or Coronary Heart Disease
## 3  cvdstrk3                                      Stroke
## 4   asthma3                                      Asthma
## 5  chcscncr                                 Skin Cancer
## 6  chcocncr                       Other Types of Cancer
## 7  chccopd1 Chronic Ostructive Pulmonary Disease (Copd)
## 8  havarth3                                   Arthritis
## 9  addepev2                         Depressive Disorder
## 10 chckidny                               Kdney Disease
## 11 diabete3                                    Diabetes

From the graph, Athritis is shown to have the tallest bar, hence the dominant disorder .This however doesn’t seem to corroborate that of the CDC which highlights heart diseases and cancer to be most chronic illness amongst the US population. This calls for further statistical inference to ascertain which indeed is the most chronic disorder in the US..

Exploring the BRFSS data

Setup

Load packages

Load data

Part 1: Data

Part 2: Research questions

Part 3: Exploratory data analysis