Exploratory analysis

The objective

Come up with three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. Along with each research question, include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience. Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.

Of note, each analysis is presented in a separate document.

Setup

Loading packages

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## Warning: package 'readr' was built under R version 4.2.1
## Warning: package 'forcats' was built under R version 4.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggstatsplot)
## Warning: package 'ggstatsplot' was built under R version 4.2.1
## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167

Loading data

brfss<-read.csv("brfss2013.csv")
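Some character columns in this dataset contain blank entries, which matters for the later tabulations. The following is a minimal sketch, using a small inline CSV rather than the real file, of how `read.csv` treats blank fields by default and with the `na.strings` argument. The column names mirror the dataset, but the rows are invented for illustration.

```r
# Toy CSV text standing in for brfss2013.csv (rows are invented for illustration)
csv_text <- "X_state,cvdinfr4,smokday2
Alabama,No,Not at all
Alabama,No,
Alaska,Yes,Every day"

# Default: blank fields in character columns are kept as empty strings ("")
toy_default <- read.csv(text = csv_text)

# Alternative: treat blank fields as missing values (NA)
toy_na <- read.csv(text = csv_text, na.strings = c("", "NA"))

toy_default$smokday2   # second value is an empty string
toy_na$smokday2        # second value is NA
```

This analysis relies on the default behaviour. If blanks were read as NA instead, sums such as `sum(smoke == "Every day")` would need `na.rm = TRUE` to avoid returning NA.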

Data overview

as_tibble(brfss)
## # A tibble: 491,775 × 330
##    X_state fmonth     idate imonth    iday iyear dispcode  seqno  X_psu ctelenum
##    <chr>   <chr>      <int> <chr>    <int> <int> <chr>     <int>  <int> <chr>   
##  1 Alabama January  1092013 January      9  2013 Complet… 2.01e9 2.01e9 Yes     
##  2 Alabama January  1192013 January     19  2013 Complet… 2.01e9 2.01e9 Yes     
##  3 Alabama January  1192013 January     19  2013 Complet… 2.01e9 2.01e9 Yes     
##  4 Alabama January  1112013 January     11  2013 Complet… 2.01e9 2.01e9 Yes     
##  5 Alabama February 2062013 February     6  2013 Complet… 2.01e9 2.01e9 Yes     
##  6 Alabama March    3272013 March       27  2013 Complet… 2.01e9 2.01e9 Yes     
##  7 Alabama March    3222013 March       22  2013 Complet… 2.01e9 2.01e9 Yes     
##  8 Alabama March    3042013 March        4  2013 Complet… 2.01e9 2.01e9 Yes     
##  9 Alabama April    4242013 April       24  2013 Complet… 2.01e9 2.01e9 Yes     
## 10 Alabama April    4242013 April       24  2013 Complet… 2.01e9 2.01e9 Yes     
## # … with 491,765 more rows, and 320 more variables: pvtresd1 <chr>,
## #   colghous <chr>, stateres <chr>, cellfon3 <chr>, ladult <chr>,
## #   numadult <int>, nummen <int>, numwomen <int>, genhlth <chr>,
## #   physhlth <int>, menthlth <int>, poorhlth <int>, hlthpln1 <chr>,
## #   persdoc2 <chr>, medcost <chr>, checkup1 <chr>, sleptim1 <int>,
## #   bphigh4 <chr>, bpmeds <chr>, bloodcho <chr>, cholchk <chr>, toldhi2 <chr>,
## #   cvdinfr4 <chr>, cvdcrhd4 <chr>, cvdstrk3 <chr>, asthma3 <chr>, …

Research question: Relationship between self-reported every day smoking and myocardial infarction

The objective

To examine the relationship between the proportion of subjects reporting every day smoking (BRFSS question: “Do you now smoke cigarettes every day, some days, or not at all?”) and the proportion of subjects who reported ever being diagnosed with a heart attack (BRFSS question: “(Ever told) you had a heart attack, also called a myocardial infarction?”).

Method:

For this evaluation, the 2013 Behavioral Risk Factor Surveillance System (BRFSS) dataset and the Behavioral Risk Factor Surveillance System 2013 Codebook Report, Land-Line and Cell-Phone data (October 24, 2014) were used. All variables used in this analysis are categorical, so tabulation is used to check for missing or invalid values; these are removed where relevant. The proportions of respondents reporting every day smoking and a history of myocardial infarction are calculated at the US state level. A scatter plot with Spearman's correlation is then drawn.
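As a minimal sketch of the tabulation step, the toy vector below (values invented, not taken from the BRFSS data) shows how `table()` exposes blank and missing entries in a categorical variable. The name `n_unusable` is hypothetical and used only for this illustration.

```r
# Toy vector standing in for a categorical survey variable (values invented)
smoke <- c("Every day", "", "Not at all", NA, "Some days", "", "Every day")

# useNA = "ifany" adds a count for NA; blank strings appear as their own level
counts <- table(smoke, useNA = "ifany")
counts

# Number of answers that are blank or missing
n_unusable <- sum(is.na(smoke) | smoke == "")
n_unusable
```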

Results

Descriptive data

table(brfss$X_state, brfss$cvdinfr4, brfss$smokday2, dnn = c("state", "infarction", "smoking"), exclude = NULL)
## , , smoking = 
## 
##                       infarction
## state                           No   Yes
##   0                        0     0     1
##   80                       1     0     0
##   Alabama                 27  3411   186
##   Alaska                  16  2257    59
##   Arizona                 14  2180    91
##   Arkansas                19  2625   141
##   California              17  7214   229
##   Colorado                21  7720   227
##   Connecticut             33  4110   125
##   Delaware                 9  2723   106
##   District of Columbia    20  2815    95
##   Florida                112 16276   939
##   Georgia                 22  5134   216
##   Guam                    13  1090    34
##   Hawaii                  21  4480   113
##   Idaho                   15  3207   124
##   Illinois                 6  3043   138
##   Indiana                 34  5307   239
##   Iowa                    24  4492   192
##   Kansas                  50 12626   497
##   Kentucky                22  5336   297
##   Louisiana                6  2777   135
##   Maine                   17  3914   168
##   Maryland                32  7370   272
##   Massachusetts           46  8071   306
##   Michigan                33  6456   278
##   Minnesota               23  7839   229
##   Mississippi             28  4114   210
##   Missouri                29  3559   190
##   Montana                 20  4941   174
##   Nebraska                52  9603   390
##   Nevada                   9  2647   103
##   New Hampshire           12  3010   129
##   New Jersey              45  7726   283
##   New Mexico              30  4901   200
##   New York                35  5017   176
##   North Carolina          18  4591   200
##   North Dakota            23  4159   154
##   Ohio                    37  6128   299
##   Oklahoma                14  4127   210
##   Oregon                  15  3212    99
##   Pennsylvania            39  5930   250
##   Puerto Rico             10  4062   218
##   Rhode Island            21  3345   137
##   South Carolina          36  5601   228
##   South Dakota            22  3524   158
##   Tennessee               15  2974   178
##   Texas                   41  6462   260
##   Utah                    36  9031   292
##   Vermont                 13  3217   115
##   Virginia                25  4642   176
##   Washington              32  6107   227
##   West Virginia           13  2737   198
##   Wisconsin               11  3460   116
##   Wyoming                 16  3357   169
## 
## , , smoking = Every day
## 
##                       infarction
## state                           No   Yes
##   0                        0     0     0
##   80                       0     0     0
##   Alabama                  4   720    66
##   Alaska                   4   641    25
##   Arizona                  4   400    36
##   Arkansas                 6   678    54
##   California               3   752    34
##   Colorado                 8  1166    70
##   Connecticut              6   658    47
##   Delaware                 3   548    36
##   District of Columbia     1   354    27
##   Florida                 29  3413   369
##   Georgia                  4   845    61
##   Guam                     1   292    19
##   Hawaii                   6   627    36
##   Idaho                    1   535    43
##   Illinois                 4   535    37
##   Indiana                 11  1279   113
##   Iowa                    11   852    79
##   Kansas                  17  2660   207
##   Kentucky                 7  1642   146
##   Louisiana                2   595    43
##   Maine                    5   875    66
##   Maryland                 6  1109    94
##   Massachusetts            4  1418   123
##   Michigan                17  1434   118
##   Minnesota                2  1332    71
##   Mississippi              8   861    71
##   Missouri                 4   935    99
##   Montana                  7  1132    85
##   Nebraska                 7  1771   113
##   Nevada                   4   573    49
##   New Hampshire            5   560    37
##   New Jersey               7  1270    86
##   New Mexico               7   972    64
##   New York                 6   745    56
##   North Carolina           2  1006    91
##   North Dakota             4   932    67
##   Ohio                    12  1669   137
##   Oklahoma                 6  1013   121
##   Oregon                   6   534    45
##   Pennsylvania             7  1386   107
##   Puerto Rico              3   322    10
##   Rhode Island             2   599    41
##   South Carolina           9  1151    87
##   South Dakota             4   773    51
##   Tennessee                9   772    88
##   Texas                    5   895    63
##   Utah                     4   732    38
##   Vermont                  5   588    50
##   Virginia                 5   913    67
##   Washington               7   934    65
##   West Virginia            7  1026   115
##   Wisconsin                3   723    37
##   Wyoming                  5   673    57
## 
## , , smoking = Not at all
## 
##                       infarction
## state                           No   Yes
##   0                        0     0     0
##   80                       0     0     0
##   Alabama                 13  1560   206
##   Alaska                   5  1233    85
##   Arizona                  5  1199   141
##   Arkansas                 9  1285   184
##   California               5  2632   198
##   Colorado                17  3555   278
##   Connecticut             22  2200   162
##   Delaware                11  1373   142
##   District of Columbia     5  1277    88
##   Florida                 76  9714  1228
##   Georgia                 17  1828   178
##   Guam                     2   312    12
##   Hawaii                  13  2105   145
##   Idaho                    5  1362   126
##   Illinois                 3  1438   141
##   Indiana                 21  2557   310
##   Iowa                     6  1989   228
##   Kansas                  32  5647   565
##   Kentucky                10  2511   354
##   Louisiana                6  1228   147
##   Maine                   13  2602   278
##   Maryland                14  3341   301
##   Massachusetts           28  4077   361
##   Michigan                23  3438   390
##   Minnesota               19  3927   301
##   Mississippi             20  1531   195
##   Missouri                 8  1717   232
##   Montana                  8  2597   278
##   Nebraska                38  4064   439
##   Nevada                   6  1307   141
##   New Hampshire           10  1980   155
##   New Jersey              18  3410   289
##   New Mexico              22  2408   212
##   New York                10  2340   207
##   North Carolina          14  2269   225
##   North Dakota            16  1925   193
##   Ohio                    19  2765   352
##   Oklahoma                24  2060   242
##   Oregon                   5  1690   128
##   Pennsylvania            18  2870   319
##   Puerto Rico              3  1084   119
##   Rhode Island             9  1918   192
##   South Carolina          24  2735   299
##   South Dakota             6  1781   197
##   Tennessee                4  1279   190
##   Texas                   31  2421   237
##   Utah                    17  2098   193
##   Vermont                  6  2002   160
##   Virginia                12  2030   186
##   Washington              15  3069   285
##   West Virginia            9  1350   204
##   Wisconsin                3  1780   153
##   Wyoming                  6  1742   191
## 
## , , smoking = Some days
## 
##                       infarction
## state                           No   Yes
##   0                        0     0     0
##   80                       0     0     0
##   Alabama                  5   271    36
##   Alaska                   2   239    12
##   Arizona                  2   170    11
##   Arkansas                 3   241    23
##   California               1   411    22
##   Colorado                 1   556    30
##   Connecticut              0   332    15
##   Delaware                 1   239    15
##   District of Columbia     0   229    20
##   Florida                 19  1355   138
##   Georgia                  0   322    31
##   Guam                     2   112     8
##   Hawaii                   1   302     9
##   Idaho                    1   198    13
##   Illinois                 1   251    11
##   Indiana                  3   425    39
##   Iowa                     3   261    20
##   Kansas                   5   914    62
##   Kentucky                 1   502    49
##   Louisiana                0   259    17
##   Maine                    2   303    30
##   Maryland                 4   426    42
##   Massachusetts            9   596    32
##   Michigan                 3   531    40
##   Minnesota                4   569    24
##   Mississippi              7   374    34
##   Missouri                 5   308    32
##   Montana                  2   425    24
##   Nebraska                 7   608    47
##   Nevada                   1   240    21
##   New Hampshire            5   153    20
##   New Jersey               4   610    28
##   New Mexico               2   461    37
##   New York                 2   361    24
##   North Carolina           4   409    31
##   North Dakota             3   304    26
##   Ohio                     4   499    50
##   Oklahoma                 0   384    43
##   Oregon                   1   204    10
##   Pennsylvania             5   455    43
##   Puerto Rico              0   160     6
##   Rhode Island             0   240    27
##   South Carolina           0   506    41
##   South Dakota             4   348    27
##   Tennessee                0   272    34
##   Texas                    6   466    30
##   Utah                     2   314    12
##   Vermont                  3   219    14
##   Virginia                 5   377    26
##   Washington               2   390    29
##   West Virginia            0   218    22
##   Wisconsin                2   281    20
##   Wyoming                  1   215    22

As can be seen, the variable “state” contains the values “0” and “80”, which do not correspond to any state and carry no relevant information on smoking or myocardial infarction. A new data frame is therefore created containing the variables state (X_state), smoke (smokday2) and prev_inf (cvdinfr4), and rows with “0” or “80” in the state column are removed from it.

# Keep only the three variables needed for this analysis
new_brfss <- tibble(state = brfss$X_state, prev_inf = brfss$cvdinfr4, smoke = brfss$smokday2)
head(new_brfss)
## # A tibble: 6 × 3
##   state   prev_inf smoke       
##   <chr>   <chr>    <chr>       
## 1 Alabama No       "Not at all"
## 2 Alabama No       ""          
## 3 Alabama No       "Some days" 
## 4 Alabama No       ""          
## 5 Alabama No       "Not at all"
## 6 Alabama No       ""
# Drop rows whose state is one of the invalid codes "0" or "80"
new_brfss_clean <- new_brfss[!(new_brfss$state %in% c("0", "80")), ]
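The tabulation above suggests that the state column has no missing values, so the following is only a defensive consideration. It is a toy comparison (invented rows) of two ways of writing this kind of row filter, showing how they differ when the filtered column contains NA.

```r
# Toy data frame with invalid codes and one missing state (values invented)
toy <- data.frame(
  state = c("Alabama", "0", "Alaska", "80", NA),
  smoke = c("Every day", "", "Not at all", "", "Some days")
)

# Comparison with `==` yields NA for the missing state, and indexing by NA
# produces a row in which every column is NA
by_equals <- toy[!(toy$state == "0" | toy$state == "80"), ]

# `%in%` never returns NA, so the row with a missing state is kept unchanged
by_in <- toy[!(toy$state %in% c("0", "80")), ]

nrow(by_equals)  # 3 rows, one of them entirely NA
nrow(by_in)      # 3 rows, including the original row with a missing state
```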

Exploratory data analysis

Proportion of those reporting everyday smoking at the state level:

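# Proportion of all respondents in each state who report smoking every day
# (respondents with a blank smoking response are included in the denominator)
brfss_state_smoke <- new_brfss_clean %>%
  group_by(state) %>%
  summarise(everydaysmoke = sum(smoke == "Every day") / n()) %>%
  arrange(desc(everydaysmoke))
View(brfss_state_smoke)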

Proportion of those reporting myocardial infarction in the medical history:

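# Proportion of all respondents in each state who report a previous heart attack
# (respondents with a blank response are included in the denominator)
brfss_state_inf <- new_brfss_clean %>%
  group_by(state) %>%
  summarise(previnf = sum(prev_inf == "Yes") / n()) %>%
  arrange(desc(previnf))
View(brfss_state_inf)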
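Both proportions divide by `n()`, the number of all respondents in a state, including those with a blank response (which likely include respondents who were not asked the question). The sketch below uses invented rows and base R to show how that choice of denominator changes the result; `prop_all` and `prop_answered` are hypothetical names for this illustration.

```r
# Toy respondents (values invented); "" marks a blank smoking response
toy <- data.frame(
  state = c("A", "A", "A", "A", "B", "B", "B", "B"),
  smoke = c("Every day", "", "", "Not at all",
            "Every day", "Every day", "Some days", "")
)

# Share of everyday smokers among all respondents (blank rows stay in the
# denominator), which is what sum(smoke == "Every day") / n() computes
prop_all <- tapply(toy$smoke == "Every day", toy$state, mean)

# Share among respondents with a non-blank smoking response only
answered <- toy[toy$smoke != "", ]
prop_answered <- tapply(answered$smoke == "Every day", answered$state, mean)

prop_all       # A: 1 of 4, B: 2 of 4
prop_answered  # A: 1 of 2, B: 2 of 3
```

The analysis here uses the first definition, so the reported percentages refer to all respondents in a state.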

Comment: In 2013, West Virginia was the state with the highest reported proportion of both every day smoking (19%) and myocardial infarction in the medical history (9%).

Creating one data frame containing information on the state, proportion of everyday smoking and myocardial infarction in medical history:

brfss_state_smoke_inf <- merge(brfss_state_smoke, brfss_state_inf, by = "state")
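Because both summaries come from the same cleaned data frame, every state should find a match in the merge. As a sketch of how one might verify this, the toy tables below (invented values) show that `merge()` performs an inner join by default and that `all = TRUE` keeps unmatched rows.

```r
# Toy state-level summaries (values invented); state "C" is absent from inf_tbl
smoke_tbl <- data.frame(state = c("A", "B", "C"),
                        everydaysmoke = c(0.10, 0.15, 0.20))
inf_tbl <- data.frame(state = c("A", "B"),
                      previnf = c(0.05, 0.07))

# Default merge() is an inner join: only states present in both inputs are kept
inner <- merge(smoke_tbl, inf_tbl, by = "state")

# all = TRUE keeps unmatched states and fills the missing side with NA
outer <- merge(smoke_tbl, inf_tbl, by = "state", all = TRUE)

nrow(inner)  # 2
nrow(outer)  # 3
```

Comparing the row count of the merged table with the row counts of its inputs is a quick way to confirm that no state was dropped.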

Plotting proportions of those reporting everyday smoking and myocardial infarction in the medical history by state:

ggplot(data = brfss_state_smoke_inf) +
  geom_point(mapping = aes(x = everydaysmoke, y = previnf, color = state))

Plotting correlation between proportions:

ggstatsplot::ggscatterstats(data=brfss_state_smoke_inf, x=everydaysmoke, y=previnf, type="spearman")
## Registered S3 method overwritten by 'ggside':
##   method from   
##   +.gg   ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
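For readers who want to see what the Spearman coefficient measures, the following base R sketch uses invented state-level proportions (not the BRFSS results). It shows that Spearman's rho equals the Pearson correlation of the ranks, and how `cor.test()` provides a p-value.

```r
# Toy state-level proportions (values invented for illustration)
everydaysmoke <- c(0.08, 0.10, 0.12, 0.15, 0.19, 0.11)
previnf       <- c(0.04, 0.06, 0.05, 0.07, 0.09, 0.05)

# Spearman's rho via the built-in method
rho <- cor(everydaysmoke, previnf, method = "spearman")

# Equivalent: Pearson correlation of the ranks (ties receive average ranks)
rho_ranks <- cor(rank(everydaysmoke), rank(previnf))

# Test of rho = 0; exact = FALSE avoids a warning when ties are present
test <- cor.test(everydaysmoke, previnf, method = "spearman", exact = FALSE)

rho
test$p.value
```

Because it works on ranks, the coefficient captures monotonic association and is less sensitive to a few extreme states than Pearson's correlation.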

Conclusion:

The analysis shows that, at the state level, the proportion of respondents reporting that they smoke every day is moderately to strongly correlated with the proportion reporting that they have ever been diagnosed with myocardial infarction (Spearman's rho = 0.62).

Limitations:

Correlation does not imply causation. Although the result is consistent with the well-established knowledge that smoking is a risk factor for cardiovascular disease, no conclusion about causality can be drawn from this evaluation. Data on both variables were collected at the same time, so we do not know whether smoking actually preceded the myocardial infarction diagnosis, although we may assume that it did. Moreover, the data were self-reported and may therefore be affected by respondents' willingness to answer truthfully (self-report bias). A finding of this kind should be regarded as hypothesis-generating and needs to be confirmed by more rigorous research.