Introduction

For this analysis I looked at the Behavioral Risk Factor Surveillance System (BRFSS) dataset. It is sourced from the Centers for Disease Control and Prevention (CDC), which describes it as the nation's premier system of health-related telephone surveys. It collects surveys from over 400,000 adults per year across all 50 states. Its purpose is to collect data on US residents' health-related risk behaviors, chronic health conditions, and use of preventive services. The data set used here is made up of 8887 rows and 36 fields, drawn from the years 2017, 2018, 2021, and 2022. This large sample gives analysts and researchers the opportunity to make inferences about the rest of the US population based on the gathered information.

This dataset was chosen as a way to gain insight into health-related conditions in the US population based on various factors. In this analysis I hope to gain insight into individuals' BMI, days of not good physical health, days of not good mental health, and days of poor mental or physical health based on the other health-related information in the dataset. These variables, along with year, made up the integer values in the data set. In the data cleaning process I converted all Yes/No columns to 1 and 0 respectively and recoded other responses, such as not sure and missing, to NA. This converted every column in this categorical format to numeric for analysis purposes. The rest of the columns were converted into factors, as they contained multiple character levels.
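The recoding step described above can be sketched in base R roughly as follows. This is a minimal illustration, not the exact cleaning code: the helper name `recode_yes_no` and the toy vector are assumptions, and only the response labels come from the survey output.

```r
# Hypothetical helper: map "Yes" to 1, "No" to 0, and anything else to NA.
recode_yes_no <- function(x) {
  ifelse(x == "Yes", 1, ifelse(x == "No", 0, NA))
}

# Toy responses using labels that appear in the survey fields.
responses <- c("No", "Yes", "Not asked or Missing", "Refused", "Yes")
recoded <- recode_yes_no(responses)
print(recoded)
```

Because every non-Yes/No label becomes NA, functions such as `mean()` need `na.rm = TRUE` (or an explicit filter) when applied to the recoded column.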

I then removed BMI outliers to get a better picture of the typical values in the data. I filtered out all values above an upper bound based on the 75th percentile and below a lower bound based on the 25th percentile.
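One common way to build such bounds is the 1.5 × IQR rule. The sketch below assumes that rule; the report does not state the exact multiplier, so `k = 1.5`, the helper name `filter_iqr`, and the toy values are all illustrative assumptions.

```r
# Hypothetical helper: keep values inside the k * IQR fences around Q1 and Q3.
filter_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr   # lower fence, built from the 25th percentile
  upper <- q[2] + k * iqr   # upper fence, built from the 75th percentile
  x[!is.na(x) & x >= lower & x <= upper]
}

# Toy BMI values on the same scale as x_bmi5 (made-up numbers).
bmi <- c(2200, 2400, 2600, 2800, 3000, 3200, 9000)
kept <- filter_iqr(bmi)
print(kept)
```

In this toy vector only the extreme value 9000 falls outside the fences and is dropped.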

## [1] "No"                  "Yes"                 "Don’t know/Not sure"
## [4] "Refused"
## [1] "No"                  "Yes"                 "Don’t know/Not sure"
## [4] "Refused"
## [1] "No"                   "Yes"                  "Not asked or Missing"
## [1] "No"                  "Yes"                 "Don’t know/Not sure"
## [1] "Good"                "Fair/Poor"           "Don’t know/Not sure"
## [1] "Former"  "Current" ""
##  [1] "Employed for wages"               "Refused"                         
##  [3] "A homemaker"                      "A student"                       
##  [5] "Self-employed"                    "Out of work for less than 1 year"
##  [7] "Retired"                          "Out of work for 1 year or more"  
##  [9] "Unable to work"                   "Not asked or Missing"
## [1] "College 1 year to 3 years (Some college or technical school)"
## [2] "Grade 12 or GED (High school graduate)"                      
## [3] "College 4 years or more (College graduate)"                  
## [4] "Grades 9 through 11 (Some high school)"                      
## [5] "Grades 1 through 8 (Elementary)"                             
## [6] "Never attended school or only kindergarten"                  
## [7] "Refused"
## [1] "Not asked or Missing" "Yes"                  "No"                  
## [4] "Don’t know/Not sure"  "Refused"
## [1] "5 or more years ago" "Within past 2 years" "Within past year"   
## [4] "Within past 5 years" "Don’t know/Not sure" "Never"              
## [7] "Refused"

Data Analysis

When looking at the BRFSS dataset I noticed several trends. The first is that the data is made up entirely of people who currently smoke or are former smokers. Another interesting finding is that the majority of respondents in this sample (about 85.5%, based on the mean of x_rfbing5) were binge drinkers. These two factors led me to conclude that the sample is built this way by design, and it should be kept in mind throughout the analysis that most of those in the sample already have underlying habits that can lead to poor health conditions. The rest of the variables were made up of demographic information such as income, education, employment, region, and state. The data also included self-reported answers, in a yes or no format, to questions such as whether the respondent had ever had a chronic health condition. Other variables were the number of days of poor physical health, poor mental health, or both, as well as each individual's BMI.

I first summarized key descriptive statistics for each numerical variable, keeping in mind that Yes and No were converted to 1 and 0. The descriptive statistics generated were the count of observations for that variable, mean, standard deviation, minimum, and maximum. For categorical variables in binary format the mean can be read as the proportion of respondents answering Yes. There were notable outliers in the physical, mental, and poor health variables, with some individuals reporting all 30 days as not good for the given category. I chose not to filter these outliers because I wanted to focus on respondents who reported at least one such day rather than zero days. BMI had significant outliers that were filtered out in the data cleaning process.
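A summary of this shape can be produced with a small base R helper. The sketch below is an assumption about how such a table might be built; the function name `describe` and the toy data frame are hypothetical, and only the column names and row labels mirror the report's output.

```r
# Hypothetical helper: n, mean, standard deviation, min, and max, ignoring NAs.
describe <- function(x) {
  x <- x[!is.na(x)]
  c(n = length(x), mean = mean(x), stdev = sd(x), min = min(x), max = max(x))
}

# Toy data: one count-of-days column and one recoded Yes/No column.
toy <- data.frame(
  physhlth  = c(0, 0, 2, 30, NA),
  diabete_2 = c(1, 0, 0, NA, 1)
)

# One column of statistics per variable.
stats <- round(sapply(toy, describe), 3)
print(stats)
```

For the 0/1 column, the mean row is the share of non-missing respondents who answered Yes.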

The key variables I analyzed, as mentioned previously, were BMI and days of poor physical health, poor mental health, and poor overall health (both combined). Initially the distributions of poor-health days were extremely right-skewed, with the vast majority of respondents reporting zero days of the given health state. The initial sample means were 3.91, 3.66, and 5.019 respectively. For my analysis, reports of zero days were filtered out, giving a much greater mean. After filtering BMI for outliers beyond the bounds based on the 25th and 75th percentiles, the mean BMI was 2763.4 with a standard deviation of 523.63. The x_bmi5 field stores BMI with two implied decimal places, so this corresponds to a BMI of about 27.63.

For the first portion of my analysis I wanted to see how having any form of cancer diagnosis, across the two variables for skin cancer and any type of cancer, related to days of poor physical health. I generated a frequency table for those answering Yes to at least one of the two cancer variables (first frequency table) and compared it to a table for those who have never had a cancer diagnosis (second frequency table). I wanted to look at those most heavily impacted, so I compared the percentage reporting 30 days of poor physical health in both tables. In the table for respondents with a cancer diagnosis, 142 people reported 30 days, making up 27.68% of all results excluding zero. In the second table, for those who never had a cancer diagnosis, 404 people reported 30 days of poor physical health, making up 19.17% of all nonzero reports with no cancer diagnosis.
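A frequency table of this kind, with zero-day reports excluded, might be built along these lines. This is a base R sketch on made-up values; the report's own tables are tibbles, so the actual code likely differed.

```r
# Toy days-of-poor-physical-health values (made up), including zeros and an NA.
physhlth <- c(0, 0, 1, 2, 2, 30, 30, 30, NA, 5)

# Drop missing values and zero-day reports before counting.
nonzero <- physhlth[!is.na(physhlth) & physhlth > 0]
counts <- table(nonzero)

# Count and percentage of nonzero reports for each number of days.
freq <- data.frame(
  physhlth   = as.integer(names(counts)),
  count      = as.integer(counts),
  percentage = round(100 * as.integer(counts) / length(nonzero), 2)
)
print(freq)
```

The percentage column uses the number of nonzero reports as its denominator, which matches how the 27.68% and 19.17% figures are described in the text.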

I then compared those who reported days of poor health in the entire dataset (again filtering out zero) with those who made $75k or more in a year. I generated two bar graphs to compare these results. There are noticeably more individuals reporting 30 days of poor health in the full data set than among those who make $75k or more per year.

I next wanted to look at how BMI related to those who answered Yes to diabetes. From the descriptive statistics, the mean of diabete_2 was 0.136, which means 13.6% of respondents in the filtered data reported having diabetes. I generated two histograms, each with a red line marking the mean of the distribution, to compare the BMIs. Both appear approximately normal. The mean BMI for those who answered Yes to diabetes was 3014.57 and for those who answered No was 2718.60. This is a sizable difference of about 296 points, or roughly 57% of one standard deviation (523.63 in the filtered data).
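The size of this gap relative to the standard deviation can be checked directly from the reported figures. The snippet below only redoes that arithmetic; the variable names are illustrative.

```r
# Reported values from the analysis (x_bmi5 scale, i.e. BMI times 100).
mean_yes <- 3014.57  # mean BMI, answered Yes to diabetes
mean_no  <- 2718.60  # mean BMI, answered No to diabetes
sd_all   <- 523.63   # standard deviation of BMI in the filtered data

# Difference in means, expressed as a share of one standard deviation.
diff_bmi <- mean_yes - mean_no
ratio <- diff_bmi / sd_all
cat(sprintf("Difference: %.2f (%.1f%% of one SD)\n", diff_bmi, 100 * ratio))
```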

I then constructed a confidence interval for those who answered Yes to diabetes and another for those who answered No, to estimate the range in which the population mean BMI of each group would fall. I used a 95% confidence level to balance precision and confidence. The results can be seen below: the intervals do not overlap, with the upper bound for No to diabetes falling below the lower bound for Yes to diabetes. These intervals are centered near 2880 and 3103 rather than on the group means reported above (2718.60 and 3014.57), which suggests they were computed on a different subset of the data, so the bounds should be read with that caveat.
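For reference, a 95% confidence interval for a mean can be sketched in base R as below. This version uses the t distribution, which is an assumption: the report does not say whether a t or z critical value was used, and the helper name `mean_ci` and the simulated data are hypothetical.

```r
# Hypothetical helper: t-based confidence interval for a population mean.
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)                           # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)   # two-sided critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# Simulated BMI values on the x_bmi5 scale (not the survey data).
set.seed(1)
toy_bmi <- rnorm(200, mean = 2800, sd = 500)
ci <- mean_ci(toy_bmi)
print(round(ci))
```

An interval built this way is always centered on the sample mean, which makes it easy to check a reported interval against the reported mean for the same group.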

##          iyear physhlth menthlth poorhlth   bpmeds cvdinfr4 cvdcrhd4 cvdstrk3
## n     7758.000 7616.000 7633.000 3992.000 3143.000 7730.000 7689.000 7732.000
## mean  2018.969    3.910    3.661    5.019    0.830    0.056    0.056    0.042
## stdev    2.019    8.459    7.836    9.020    0.375    0.231    0.230    0.202
## min   2017.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
## max   2022.000   30.000   30.000   30.000    1.000    1.000    1.000    1.000
##        asthma3 chcscncr chcocncr  chccopd  addepev diabete_2   decide diffwalk
## n     7731.000 7744.000 7747.000 7736.000 7719.000  7527.000 7635.000 7626.000
## mean     0.128    0.102    0.098    0.081    0.191     0.136    0.107    0.155
## stdev    0.334    0.303    0.297    0.273    0.393     0.342    0.309    0.362
## min      0.000    0.000    0.000    0.000    0.000     0.000    0.000    0.000
## max      1.000    1.000    1.000    1.000    1.000     1.000    1.000    1.000
##       diffdres diffalon  x_michd   x_bmi5 x_rfbing5
## n     7642.000 7618.000 7694.000 7758.000  7420.000
## mean     0.041    0.069    0.087 2763.399     0.855
## stdev    0.198    0.253    0.282  523.631     0.352
## min      0.000    0.000    0.000 1325.000     0.000
## max      1.000    1.000    1.000 4233.000     1.000
## # A tibble: 23 × 3
##    physhlth count percentage_yes_cancer
##       <int> <int>                 <dbl>
##  1        1    37                 7.21 
##  2        2    57                11.1  
##  3        3    44                 8.58 
##  4        4    31                 6.04 
##  5        5    42                 8.19 
##  6        6     6                 1.17 
##  7        7    21                 4.09 
##  8        8     4                 0.780
##  9        9     2                 0.390
## 10       10    33                 6.43 
## # ℹ 13 more rows
## # A tibble: 30 × 3
##    physhlth count percentage_no_cancer
##       <int> <int>                <dbl>
##  1        1   280               13.3  
##  2        2   361               17.1  
##  3        3   211               10.0  
##  4        4   102                4.84 
##  5        5   194                9.20 
##  6        6    18                0.854
##  7        7   115                5.46 
##  8        8    14                0.664
##  9        9     3                0.142
## 10       10   118                5.60 
## # ℹ 20 more rows


## [1] 3014.572

## [1] 2718.596
## [1] "The 95% confidence interval for BMI levels for those who answered No to diabetes is: 2847 <= mu <= 2913 ."
## [1] "The 95% confidence interval for BMI levels for those who answered Yes to diabetes is: 3048 <= mu <= 3158 ."

Summary

In this report we looked at various factors, such as diabetes and cancer diagnoses, and their relationship to days of poor physical or mental health. We also looked at how BMI varied with these factors. Throughout this analysis we looked at BMI with outliers removed, and for the days-of-poor-health variables we looked only at cases where at least one day was reported (>0). My first key finding was that those who had some kind of cancer diagnosis were more likely to report 30 days of poor physical health. Another key finding was that those who made $75k or more a year were less likely to report 30 days of poor health than those in the dataset who made less. Lastly, I found a clear difference in the estimated range of the population mean BMI between those who answered Yes to diabetes and those who answered No. For those who answered No, the 95% confidence interval for BMI was 2847 <= mu <= 2913. For those who answered Yes, it was 3048 <= mu <= 3158. This shows that those who had diabetes tended to have a higher BMI.

I would like to further explore how other variables in the data set relate to the days-of-poor-health variables. A procedure similar to the one above could be carried out for different variables, such as using days of poor mental health instead of physical health for cancer diagnosis. I would also like to ask how outcomes compare among demographic groups. For example, would people in the Midwest be more likely to report having had cancer, or would people with less education be more or less likely to report days of poor health?

References

Kabacoff, R. I. (2015). R in Action (2nd ed.). Manning Publications.

Bluman, A. G. (2018). Elementary statistics: A step by step approach (10th ed.). McGraw Hill.

When prompted with “how to replace yes and no with 1 and 0 in R for unique(data$variable) (then uniquie values)?” the ChatGPT-generated text indicated “This code uses the ifelse function to convert “Yes” to 1, “No” to 0, and any other categories to NA (missing values). This way, you’re creating a binary representation of the “Yes/No” responses.” (OpenAI, 2024).

OpenAI. (2024). ChatGPT (March 5 version) [Large language model]. https://chat.openai.com/