Come up with three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. Along with each research question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience. Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.
Note that each analysis is presented in a separate document.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## Warning: package 'readr' was built under R version 4.2.1
## Warning: package 'forcats' was built under R version 4.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggstatsplot)
## Warning: package 'ggstatsplot' was built under R version 4.2.1
## You can cite this package as:
## Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
## Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
# Read the BRFSS 2013 data and preview it as a tibble
brfss <- read.csv("brfss2013.csv")
as_tibble(brfss)
## # A tibble: 491,775 × 330
## X_state fmonth idate imonth iday iyear dispcode seqno X_psu ctelenum
## <chr> <chr> <int> <chr> <int> <int> <chr> <int> <int> <chr>
## 1 Alabama January 1092013 January 9 2013 Complet… 2.01e9 2.01e9 Yes
## 2 Alabama January 1192013 January 19 2013 Complet… 2.01e9 2.01e9 Yes
## 3 Alabama January 1192013 January 19 2013 Complet… 2.01e9 2.01e9 Yes
## 4 Alabama January 1112013 January 11 2013 Complet… 2.01e9 2.01e9 Yes
## 5 Alabama February 2062013 February 6 2013 Complet… 2.01e9 2.01e9 Yes
## 6 Alabama March 3272013 March 27 2013 Complet… 2.01e9 2.01e9 Yes
## 7 Alabama March 3222013 March 22 2013 Complet… 2.01e9 2.01e9 Yes
## 8 Alabama March 3042013 March 4 2013 Complet… 2.01e9 2.01e9 Yes
## 9 Alabama April 4242013 April 24 2013 Complet… 2.01e9 2.01e9 Yes
## 10 Alabama April 4242013 April 24 2013 Complet… 2.01e9 2.01e9 Yes
## # … with 491,765 more rows, and 320 more variables: pvtresd1 <chr>,
## # colghous <chr>, stateres <chr>, cellfon3 <chr>, ladult <chr>,
## # numadult <int>, nummen <int>, numwomen <int>, genhlth <chr>,
## # physhlth <int>, menthlth <int>, poorhlth <int>, hlthpln1 <chr>,
## # persdoc2 <chr>, medcost <chr>, checkup1 <chr>, sleptim1 <int>,
## # bphigh4 <chr>, bpmeds <chr>, bloodcho <chr>, cholchk <chr>, toldhi2 <chr>,
## # cvdinfr4 <chr>, cvdcrhd4 <chr>, cvdstrk3 <chr>, asthma3 <chr>, …
To examine the relationship between the proportion of respondents who report smoking every day (BRFSS question: “Do you now smoke cigarettes every day, some days, or not at all?”) and the proportion of respondents who report ever having been diagnosed with a heart attack (BRFSS question: “(Ever told) you had a heart attack, also called a myocardial infarction?”).
For this evaluation, the 2013 Behavioral Risk Factor Surveillance System (BRFSS) dataset and the BRFSS 2013 Codebook Report (Land-Line and Cell-Phone data, October 24, 2014) were used. All variables used in this analysis are categorical, so tabulation is used to check for missing or invalid values, which are removed where relevant. The proportions of respondents reporting every-day smoking and a history of myocardial infarction are calculated for each US state. A scatter plot with Spearman's correlation is then drawn.
# Cross-tabulate state by infarction by smoking; exclude = NULL keeps every level, including NA
table(brfss$X_state, brfss$cvdinfr4, brfss$smokday2, dnn = c("state", "infarction", "smoking"), exclude = NULL)
## , , smoking =
##
## infarction
## state No Yes
## 0 0 0 1
## 80 1 0 0
## Alabama 27 3411 186
## Alaska 16 2257 59
## Arizona 14 2180 91
## Arkansas 19 2625 141
## California 17 7214 229
## Colorado 21 7720 227
## Connecticut 33 4110 125
## Delaware 9 2723 106
## District of Columbia 20 2815 95
## Florida 112 16276 939
## Georgia 22 5134 216
## Guam 13 1090 34
## Hawaii 21 4480 113
## Idaho 15 3207 124
## Illinois 6 3043 138
## Indiana 34 5307 239
## Iowa 24 4492 192
## Kansas 50 12626 497
## Kentucky 22 5336 297
## Louisiana 6 2777 135
## Maine 17 3914 168
## Maryland 32 7370 272
## Massachusetts 46 8071 306
## Michigan 33 6456 278
## Minnesota 23 7839 229
## Mississippi 28 4114 210
## Missouri 29 3559 190
## Montana 20 4941 174
## Nebraska 52 9603 390
## Nevada 9 2647 103
## New Hampshire 12 3010 129
## New Jersey 45 7726 283
## New Mexico 30 4901 200
## New York 35 5017 176
## North Carolina 18 4591 200
## North Dakota 23 4159 154
## Ohio 37 6128 299
## Oklahoma 14 4127 210
## Oregon 15 3212 99
## Pennsylvania 39 5930 250
## Puerto Rico 10 4062 218
## Rhode Island 21 3345 137
## South Carolina 36 5601 228
## South Dakota 22 3524 158
## Tennessee 15 2974 178
## Texas 41 6462 260
## Utah 36 9031 292
## Vermont 13 3217 115
## Virginia 25 4642 176
## Washington 32 6107 227
## West Virginia 13 2737 198
## Wisconsin 11 3460 116
## Wyoming 16 3357 169
##
## , , smoking = Every day
##
## infarction
## state No Yes
## 0 0 0 0
## 80 0 0 0
## Alabama 4 720 66
## Alaska 4 641 25
## Arizona 4 400 36
## Arkansas 6 678 54
## California 3 752 34
## Colorado 8 1166 70
## Connecticut 6 658 47
## Delaware 3 548 36
## District of Columbia 1 354 27
## Florida 29 3413 369
## Georgia 4 845 61
## Guam 1 292 19
## Hawaii 6 627 36
## Idaho 1 535 43
## Illinois 4 535 37
## Indiana 11 1279 113
## Iowa 11 852 79
## Kansas 17 2660 207
## Kentucky 7 1642 146
## Louisiana 2 595 43
## Maine 5 875 66
## Maryland 6 1109 94
## Massachusetts 4 1418 123
## Michigan 17 1434 118
## Minnesota 2 1332 71
## Mississippi 8 861 71
## Missouri 4 935 99
## Montana 7 1132 85
## Nebraska 7 1771 113
## Nevada 4 573 49
## New Hampshire 5 560 37
## New Jersey 7 1270 86
## New Mexico 7 972 64
## New York 6 745 56
## North Carolina 2 1006 91
## North Dakota 4 932 67
## Ohio 12 1669 137
## Oklahoma 6 1013 121
## Oregon 6 534 45
## Pennsylvania 7 1386 107
## Puerto Rico 3 322 10
## Rhode Island 2 599 41
## South Carolina 9 1151 87
## South Dakota 4 773 51
## Tennessee 9 772 88
## Texas 5 895 63
## Utah 4 732 38
## Vermont 5 588 50
## Virginia 5 913 67
## Washington 7 934 65
## West Virginia 7 1026 115
## Wisconsin 3 723 37
## Wyoming 5 673 57
##
## , , smoking = Not at all
##
## infarction
## state No Yes
## 0 0 0 0
## 80 0 0 0
## Alabama 13 1560 206
## Alaska 5 1233 85
## Arizona 5 1199 141
## Arkansas 9 1285 184
## California 5 2632 198
## Colorado 17 3555 278
## Connecticut 22 2200 162
## Delaware 11 1373 142
## District of Columbia 5 1277 88
## Florida 76 9714 1228
## Georgia 17 1828 178
## Guam 2 312 12
## Hawaii 13 2105 145
## Idaho 5 1362 126
## Illinois 3 1438 141
## Indiana 21 2557 310
## Iowa 6 1989 228
## Kansas 32 5647 565
## Kentucky 10 2511 354
## Louisiana 6 1228 147
## Maine 13 2602 278
## Maryland 14 3341 301
## Massachusetts 28 4077 361
## Michigan 23 3438 390
## Minnesota 19 3927 301
## Mississippi 20 1531 195
## Missouri 8 1717 232
## Montana 8 2597 278
## Nebraska 38 4064 439
## Nevada 6 1307 141
## New Hampshire 10 1980 155
## New Jersey 18 3410 289
## New Mexico 22 2408 212
## New York 10 2340 207
## North Carolina 14 2269 225
## North Dakota 16 1925 193
## Ohio 19 2765 352
## Oklahoma 24 2060 242
## Oregon 5 1690 128
## Pennsylvania 18 2870 319
## Puerto Rico 3 1084 119
## Rhode Island 9 1918 192
## South Carolina 24 2735 299
## South Dakota 6 1781 197
## Tennessee 4 1279 190
## Texas 31 2421 237
## Utah 17 2098 193
## Vermont 6 2002 160
## Virginia 12 2030 186
## Washington 15 3069 285
## West Virginia 9 1350 204
## Wisconsin 3 1780 153
## Wyoming 6 1742 191
##
## , , smoking = Some days
##
## infarction
## state No Yes
## 0 0 0 0
## 80 0 0 0
## Alabama 5 271 36
## Alaska 2 239 12
## Arizona 2 170 11
## Arkansas 3 241 23
## California 1 411 22
## Colorado 1 556 30
## Connecticut 0 332 15
## Delaware 1 239 15
## District of Columbia 0 229 20
## Florida 19 1355 138
## Georgia 0 322 31
## Guam 2 112 8
## Hawaii 1 302 9
## Idaho 1 198 13
## Illinois 1 251 11
## Indiana 3 425 39
## Iowa 3 261 20
## Kansas 5 914 62
## Kentucky 1 502 49
## Louisiana 0 259 17
## Maine 2 303 30
## Maryland 4 426 42
## Massachusetts 9 596 32
## Michigan 3 531 40
## Minnesota 4 569 24
## Mississippi 7 374 34
## Missouri 5 308 32
## Montana 2 425 24
## Nebraska 7 608 47
## Nevada 1 240 21
## New Hampshire 5 153 20
## New Jersey 4 610 28
## New Mexico 2 461 37
## New York 2 361 24
## North Carolina 4 409 31
## North Dakota 3 304 26
## Ohio 4 499 50
## Oklahoma 0 384 43
## Oregon 1 204 10
## Pennsylvania 5 455 43
## Puerto Rico 0 160 6
## Rhode Island 0 240 27
## South Carolina 0 506 41
## South Dakota 4 348 27
## Tennessee 0 272 34
## Texas 6 466 30
## Utah 2 314 12
## Vermont 3 219 14
## Virginia 5 377 26
## Washington 2 390 29
## West Virginia 0 218 22
## Wisconsin 2 281 20
## Wyoming 1 215 22
As the table shows, the state variable contains the values “0” and “80”, which do not correspond to any state and carry no usable information on smoking or myocardial infarction. The unlabeled smoking slice and the unlabeled infarction column hold blank responses. A new data frame is therefore created containing the variables state (X_state), smoke (smokday2) and prev_inf (cvdinfr4), and the rows with “0” or “80” in the state column are removed from it.
new_brfss <- tibble(state = brfss$X_state, prev_inf = brfss$cvdinfr4, smoke = brfss$smokday2)
head(new_brfss)
## # A tibble: 6 × 3
## state prev_inf smoke
## <chr> <chr> <chr>
## 1 Alabama No "Not at all"
## 2 Alabama No ""
## 3 Alabama No "Some days"
## 4 Alabama No ""
## 5 Alabama No "Not at all"
## 6 Alabama No ""
# state is a character column, so compare against the strings "0" and "80"
new_brfss_clean <- new_brfss[!(new_brfss$state %in% c("0", "80")), ]
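The same filter can also be expressed with dplyr's `filter()`, which the report already loads through tidyverse. The following is only a minimal sketch: the tibble `toy` and its rows are made up for illustration and are not BRFSS records.

```r
library(dplyr)

# Toy stand-in for new_brfss; these rows are made up for illustration.
toy <- tibble(
  state    = c("Alabama", "0", "Alaska", "80", "Alabama"),
  prev_inf = c("No", "Yes", "Yes", "", "No"),
  smoke    = c("Every day", "", "", "", "Not at all")
)

# Keep only the rows whose state is not one of the two stray codes.
toy_clean <- toy %>%
  filter(!state %in% c("0", "80"))

# Tabulate the remaining states to confirm the stray codes are gone.
count(toy_clean, state)
```

Using `%in%` with string values makes the comparison explicit and avoids relying on R's implicit conversion of the numbers 0 and 80 to character.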
### Exploratory Data Analysis
Proportion of respondents reporting every-day smoking, by state (the denominator is all respondents in the state, including those with a blank smoking response):
brfss_state_smoke<-new_brfss_clean%>%
group_by(state)%>%
summarise(everydaysmoke=sum(smoke=="Every day")/n())%>%
arrange(desc(everydaysmoke))
View(brfss_state_smoke)
Proportion of respondents reporting a history of myocardial infarction, by state:
brfss_state_inf<-new_brfss_clean%>%
group_by(state)%>%
summarise(previnf=sum(prev_inf=="Yes")/n())%>%
arrange(desc(previnf))
View(brfss_state_inf)
Comment: In 2013, West Virginia had the highest reported proportion of both every-day smoking (19%) and a history of myocardial infarction (9%).
Combining the state, the proportion of every-day smokers and the proportion with a history of myocardial infarction into one data frame:
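As a cross-check, the West Virginia figures can be recomputed by hand from the counts in the three-way table above. The sketch below copies those counts directly; the variable names (`wv_smoke`, `wv_inf_yes`, `wv_total`) are introduced here for illustration only.

```r
# West Virginia counts copied from the three-way table above,
# summed over the three infarction columns within each smoking category.
wv_smoke <- c(
  blank      = 13 + 2737 + 198,
  every_day  = 7 + 1026 + 115,
  not_at_all = 9 + 1350 + 204,
  some_days  = 0 + 218 + 22
)

# "Yes" infarction counts for West Virginia across the four smoking slices.
wv_inf_yes <- 198 + 115 + 204 + 22

# All West Virginia respondents, blank responses included.
wv_total <- sum(wv_smoke)

round(wv_smoke[["every_day"]] / wv_total, 2)  # proportion smoking every day
round(wv_inf_yes / wv_total, 2)               # proportion with previous infarction
```

The two rounded values, 0.19 and 0.09, agree with the percentages quoted in the comment. They also show that the denominator produced by `n()` includes respondents with a blank answer.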
brfss_state_smoke_inf<-merge(brfss_state_smoke,brfss_state_inf, by="state")
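The two state-level summaries and the merge could also be produced in a single `summarise()` call. This is only a sketch of that alternative, run on made-up rows: the tibble `toy` and the state labels "A" and "B" are hypothetical stand-ins for `new_brfss_clean`.

```r
library(dplyr)

# Made-up respondent-level rows standing in for new_brfss_clean.
toy <- tibble(
  state    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  prev_inf = c("Yes", "No", "No", "No", "No", "No", "", "Yes"),
  smoke    = c("Every day", "", "Every day", "Not at all",
               "", "Some days", "", "Every day")
)

# mean() of a logical vector is the proportion of TRUE values,
# so both state-level proportions come out of one summarise().
toy_state <- toy %>%
  group_by(state) %>%
  summarise(
    everydaysmoke = mean(smoke == "Every day"),
    previnf       = mean(prev_inf == "Yes")
  )

toy_state
```

Because both columns are computed from the same grouped data, the result has one row per state and no join is needed.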
Plotting the proportion reporting a history of myocardial infarction against the proportion reporting every-day smoking, by state:
ggplot(data=brfss_state_smoke_inf)+geom_point(mapping=aes(x=everydaysmoke, y=previnf, color=state))
Plotting the correlation between the two proportions:
ggstatsplot::ggscatterstats(data=brfss_state_smoke_inf, x=everydaysmoke, y=previnf, type="spearman")
## Registered S3 method overwritten by 'ggside':
## method from
## +.gg ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The analysis shows that, at the state level, the proportion of respondents who report smoking every day is moderately to strongly positively correlated with the proportion who report having been diagnosed with myocardial infarction (Spearman's rho = 0.62).
Correlation is not causation. Although the result is consistent with the well-established fact that smoking is a risk factor for cardiovascular disease, no conclusion about causality can be drawn from this evaluation. Data on both variables were collected at the same time, so we do not know whether smoking actually preceded the myocardial infarction diagnosis, although it is reasonable to assume that it did. Moreover, the data were self-reported and may therefore be affected by respondents' willingness to tell the truth (self-report bias). A finding of this kind is best regarded as hypothesis-generating and needs to be confirmed by more rigorous research.
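For readers less familiar with Spearman's rho, it is Pearson's correlation computed on the ranks of the two variables, so it measures how consistently one proportion rises with the other, not how linear the relationship is. The sketch below illustrates this with base R on made-up numbers: the two vectors are hypothetical and are not the state-level values from this analysis.

```r
# Made-up state-level proportions, for illustration only.
everydaysmoke <- c(0.08, 0.11, 0.12, 0.15, 0.19, 0.10)
previnf       <- c(0.04, 0.06, 0.05, 0.07, 0.09, 0.055)

# Spearman's rho as computed by cor().
rho <- cor(everydaysmoke, previnf, method = "spearman")

# The same quantity obtained as Pearson's correlation of the ranks.
rho_ranks <- cor(rank(everydaysmoke), rank(previnf))

# The two approaches agree up to floating-point tolerance.
all.equal(rho, rho_ranks)
```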