DSU EDA Analysis

Author

Charles Rose

Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto, see https://quarto.org.

Running Code

When you click the Render button, a document is generated that includes both the content and the output of the embedded code. You can embed code like this:

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: colorspace

Loading required package: grid

The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, will retire in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.
The sp package is now running under evolution status 2
     (status 2 uses the sf package in place of rgdal)

VIM is ready to use.


Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues


Attaching package: 'VIM'


The following object is masked from 'package:datasets':

    sleep

You can add options to executable code like this:

[1] "C:/GitLab Repository/inquisitiveimputers/R code"
# A tibble: 5,000 × 121
   patientuid      gender race  hispanic dob   outcome tract county_fips zipcode
   <chr>           <chr>  <chr> <chr>    <chr> <chr>   <chr> <chr>       <chr>  
 1 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 2 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 3 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 4 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 5 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 6 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 7 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 8 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 9 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
10 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
# ℹ 4,990 more rows
# ℹ 112 more variables: USPS_ZIP_PREF_STATE <chr>, weight_t <dbl>,
#   weight_c <dbl>, missing_geography <chr>, stcnty_c <chr>,
#   rpl_theme1_c <dbl>, rpl_theme2_c <dbl>, rpl_theme3_c <dbl>,
#   rpl_theme4_c <dbl>, rpl_themes_c <dbl>, area_sqmi_c <dbl>,
#   e_totpop_c <dbl>, d_pop_c <dbl>, st_abbr_t <chr>, rpl_theme1_t <dbl>,
#   rpl_theme2_t <dbl>, rpl_theme3_t <dbl>, rpl_theme4_t <dbl>, …
# A tibble: 6 × 121
  patientuid       gender race  hispanic dob   outcome tract county_fips zipcode
  <chr>            <chr>  <chr> <chr>    <chr> <chr>   <chr> <chr>       <chr>  
1 e97d1934-0fb4-4… M      white unknown  2018… 1       2803… 28033       38632  
2 E0051A0F-CE1D-4… M      unkn… unknown  2017… 1       3401… 34013       07003  
3 FB2CFA12-730B-4… M      unkn… unknown  2018… 1       0801… 08013       80504  
4 71de8492-8fb7-4… M      unkn… not his… 2021… 0       1800… 18003       46845  
5 58cdf738-8122-4… F      unkn… unknown  2018… 0       3004… 30047       59864  
6 0196aadf-b7d6-4… M      white unknown  2019… 1       1207… 12073       32317  
# ℹ 112 more variables: USPS_ZIP_PREF_STATE <chr>, weight_t <dbl>,
#   weight_c <dbl>, missing_geography <chr>, stcnty_c <chr>,
#   rpl_theme1_c <dbl>, rpl_theme2_c <dbl>, rpl_theme3_c <dbl>,
#   rpl_theme4_c <dbl>, rpl_themes_c <dbl>, area_sqmi_c <dbl>,
#   e_totpop_c <dbl>, d_pop_c <dbl>, st_abbr_t <chr>, rpl_theme1_t <dbl>,
#   rpl_theme2_t <dbl>, rpl_theme3_t <dbl>, rpl_theme4_t <dbl>,
#   rpl_themes_t <dbl>, area_sqmi_t <dbl>, e_totpop_t <dbl>, d_pop_t <dbl>, …

Explore Missing Data

[1] 566106
[1] 210888

Missing Data Table by Screening

# Build the analysis data frame from the raw ABFM extract.
# NOTE: race is still in its raw lowercase coding here (e.g. "unknown"), so the
# "Missing" check only has an effect if that label already exists in the raw data.
df_all <- df_ABFM %>%
  mutate(
    race     = if_else(race == "Missing", NA_character_, race),
    racenew  = if_else(is.na(race), 1, 0),                             # 1 = race is NA
    Screen   = as.factor(ifelse(as.character(outcome) == "1", 1, 0)),  # levels "0"/"1"
    tract_na = as.factor(ifelse(is.na(tract), 1, 0))                   # 1 = tract is NA
  )

head(df_all)
# A tibble: 6 × 124
  patientuid       gender race  hispanic dob   outcome tract county_fips zipcode
  <chr>            <chr>  <chr> <chr>    <chr> <chr>   <chr> <chr>       <chr>  
1 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
2 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
3 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
4 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
5 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
6 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
# ℹ 115 more variables: USPS_ZIP_PREF_STATE <chr>, weight_t <dbl>,
#   weight_c <dbl>, missing_geography <chr>, stcnty_c <chr>,
#   rpl_theme1_c <dbl>, rpl_theme2_c <dbl>, rpl_theme3_c <dbl>,
#   rpl_theme4_c <dbl>, rpl_themes_c <dbl>, area_sqmi_c <dbl>,
#   e_totpop_c <dbl>, d_pop_c <dbl>, st_abbr_t <chr>, rpl_theme1_t <dbl>,
#   rpl_theme2_t <dbl>, rpl_theme3_t <dbl>, rpl_theme4_t <dbl>,
#   rpl_themes_t <dbl>, area_sqmi_t <dbl>, e_totpop_t <dbl>, d_pop_t <dbl>, …
# Recode the demographic variables into display labels, then fix the level
# order used in the summary tables.
df_all <- df_all %>%
  mutate(gender = ifelse(gender == "Other/Unknown", NA, gender)) %>%
  mutate(gender = recode(gender, "M" = "Male", "F" = "Female")) %>%
  mutate(race = recode(race, "american indian or alaska native" = "AIAN", "asian" = "Asian",
                       "black or african american" = "Black", "multiple races" = "Multiple",
                       "native hawaiian or other pacific islander" = "NHOPI", "unknown" = "Missing",
                       "white" = "White")) %>%
  mutate(hispanic = recode(hispanic, "hispanic or latino" = "Yes", "not hispanic or latino" = "No",
                           "unknown" = "Missing")) %>%
  # tract_na: "0" (tract present) becomes "Yes", "1" (tract absent) becomes "Missing"
  mutate(tract_na = recode(tract_na, "0" = "Yes", "1" = "Missing")) %>%
  mutate(Screen = recode(Screen, "0" = "No", "1" = "Yes")) %>%
  mutate(gender = fct_relevel(gender, "Female", "Male")) %>%
  mutate(race = fct_relevel(race, "AIAN", "Asian", "Black", "NHOPI", "White", "Multiple", "Missing")) %>%
  mutate(hispanic = fct_relevel(hispanic, "No", "Yes", "Missing")) %>%
  # fct_relevel() warns when a requested level (here "Missing") is absent from the factor
  mutate(tract_na = fct_relevel(tract_na, "Yes", "Missing")) %>%
  mutate(Screen = fct_relevel(Screen, "No", "Yes"))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `tract_na = fct_relevel(tract_na, "Yes", "Missing")`.
Caused by warning:
! 1 unknown level in `f`: Missing
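
The warning above appears because no record has a missing census tract, so the level "Missing" never exists in tract_na when fct_relevel() is called. One possible way to avoid it, shown here as a minimal sketch on hypothetical toy data rather than on df_all, is to register the level with forcats::fct_expand() before reordering:

```r
library(forcats)

# Hypothetical toy indicator in which every tract is present,
# so "Yes" is the only level (as in the rendered table).
tract_na <- factor(c("Yes", "Yes", "Yes"))

# Add the unobserved level first; fct_relevel() then has nothing to warn about.
tract_na <- fct_expand(tract_na, "Missing")
tract_na <- fct_relevel(tract_na, "Yes", "Missing")

levels(tract_na)
table(tract_na)
```

Whether an empty "Missing" row is wanted in the final table is a reporting choice; dropping the relevel call for tract_na would be an equally valid way to silence the warning.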
df_all$scrn <- df_all$outcome


var_label(df_all) <- list(
  gender = "Gender",
  race = "Race",
  hispanic = "Hispanic",
  tract_na = "Census Tract",
  Screen = "Screen Test"
)

table1shell <- df_all %>% select(gender, race, hispanic, tract_na, rpl_themes_t, z_SE_nat_t, scrn) %>% 
  tbl_summary(by = scrn) %>%
  add_overall() %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Developmental Screening**") %>%
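  # NOTE: these n values are hard-coded and differ from the column totals in the
  # rendered table (e.g. the Census Tract row: 533,657 and 237,518).
  modify_header(stat_1 = "**No**, n = 153,373", stat_2 = "**Yes**, n = 57,573") %>%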
  bold_labels() %>%
  add_p()
table1shell <- modify_caption(table1shell, caption = "**Example of Table 1 for Descriptive Statistics**")
table1shell
Example of Table 1 for Descriptive Statistics

                                          Developmental Screening
Characteristic    Overall, N = 771,175¹   No, n = 153,373¹    Yes, n = 57,573¹    p-value²
Gender                                                                            <0.001
    Female        372,079 (48%)           255,971 (48%)       116,108 (49%)
    Male          398,424 (52%)           277,157 (52%)       121,267 (51%)
    Unknown       672                     529                 143
Race                                                                              <0.001
    AIAN          18,488 (2.4%)           17,501 (3.3%)       987 (0.4%)
    Asian         11,829 (1.5%)           8,412 (1.6%)        3,417 (1.4%)
    Black         55,166 (7.2%)           37,943 (7.1%)       17,223 (7.3%)
    NHOPI         1,468 (0.2%)            1,110 (0.2%)        358 (0.2%)
    White         365,921 (47%)           249,513 (47%)       116,408 (49%)
    Multiple      3,534 (0.5%)            1,625 (0.3%)        1,909 (0.8%)
    Missing       314,769 (41%)           217,553 (41%)       97,216 (41%)
Hispanic                                                                          <0.001
    No            332,933 (43%)           240,293 (45%)       92,640 (39%)
    Yes           144,082 (19%)           89,307 (17%)        54,775 (23%)
    Missing       294,160 (38%)           204,057 (38%)       90,103 (38%)
Census Tract
    Yes           771,175 (100%)          533,657 (100%)      237,518 (100%)
rpl_themes_t      0.52 (0.27, 0.76)       0.53 (0.28, 0.76)   0.49 (0.25, 0.76)   <0.001
    Unknown       4,295                   2,950               1,345
z_SE_nat_t        0.03 (-0.09, 0.16)      0.02 (-0.10, 0.15)  0.04 (-0.08, 0.18)  <0.001
    Unknown       205,069                 137,123             67,946

¹ n (%); Median (IQR)
² Pearson’s Chi-squared test; Wilcoxon rank sum test
#table1shell <- modify_caption(table1shell, "<div style='text-align: left; font-weight: bold; color: grey'> Table 1. Patient Characteristics</div>")
#table1shell
#save.image(file='myEnvironment.RData')
save(table1shell, file = "C:\\GitLab Repository\\inquisitiveimputers\\Documents\\Results\\EDA\\Table1Desc.Rdata")
#names(df_all)
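
The column headers of Table 1 carry hand-typed counts. As a hedged alternative, gtsummary can fill the group label and count into each header from the data through the {level} and {n} placeholders of modify_header(). The sketch below uses a hypothetical five-row toy data frame, not df_all, and the header wording is only an illustration:

```r
library(gtsummary)

# Hypothetical toy data standing in for df_all.
toy <- data.frame(
  Screen = factor(c("No", "No", "No", "Yes", "Yes"), levels = c("No", "Yes")),
  gender = c("Female", "Male", "Male", "Female", "Male")
)

tbl <- tbl_summary(toy, by = Screen)
tbl <- add_overall(tbl)

# {level} is the by-group label and {n} its row count; both come from the data,
# so the headers cannot drift away from the table body.
tbl <- modify_header(
  tbl,
  all_stat_cols(stat_0 = FALSE) ~ "**{level}**, n = {style_number(n)}"
)
tbl
```

With this pattern the header counts follow whatever data frame is passed in, for example a version of df_all reduced to one row per patient.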

head(df_all, 500)
# A tibble: 500 × 125
   patientuid      gender race  hispanic dob   outcome tract county_fips zipcode
   <chr>           <fct>  <fct> <fct>    <chr> <chr>   <chr> <chr>       <chr>  
 1 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 2 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 3 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 4 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 5 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 6 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 7 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 8 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 9 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
10 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
# ℹ 490 more rows
# ℹ 116 more variables: USPS_ZIP_PREF_STATE <chr>, weight_t <dbl>,
#   weight_c <dbl>, missing_geography <chr>, stcnty_c <chr>,
#   rpl_theme1_c <dbl>, rpl_theme2_c <dbl>, rpl_theme3_c <dbl>,
#   rpl_theme4_c <dbl>, rpl_themes_c <dbl>, area_sqmi_c <dbl>,
#   e_totpop_c <dbl>, d_pop_c <dbl>, st_abbr_t <chr>, rpl_theme1_t <dbl>,
#   rpl_theme2_t <dbl>, rpl_theme3_t <dbl>, rpl_theme4_t <dbl>, …
# Summary table with chi-square test
df_all2 <- df_all %>%
  mutate(race = if_else(race == "Missing", NA_character_, race)) %>%
  mutate(racenew = if_else(is.na(race), 1, 0)) %>%
  distinct(patientuid, .keep_all = TRUE)
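
The two steps used to build df_all2, flagging missing race and keeping one row per patient, can be illustrated on a small example. This is a minimal sketch on hypothetical toy records; na_if() is used here as a shorthand for the if_else() call above and is an assumption, not the report's code:

```r
library(dplyr)

# Hypothetical toy records: patient "a" appears twice, like the repeated
# patientuid rows visible in head(df_all).
toy <- data.frame(
  patientuid = c("a", "a", "b", "c"),
  race       = c("White", "White", "Missing", "Black"),
  stringsAsFactors = FALSE
)

toy_patients <- toy %>%
  mutate(
    race    = na_if(race, "Missing"),      # treat the "Missing" label as NA
    racenew = if_else(is.na(race), 1, 0)   # 1 = race missing, 0 = race observed
  ) %>%
  distinct(patientuid, .keep_all = TRUE)   # keeps the first row for each patient

toy_patients
```

Because distinct() keeps the first row it meets for each patientuid, the result depends on row order whenever a patient's rows differ in other columns.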

missingtableshell <- df_all2 %>% select(Screen, gender, racenew, hispanic, rpl_themes_t,
                                        acs_avg_hh_size_c, acs_pct_foreign_born_t) %>% 
  tbl_summary(
    by = racenew, 
    type = list(Screen ~ "categorical"),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2
    ) %>%
  add_overall() %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Missing Race**") %>%
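  # NOTE: these n values are hard-coded; the Screen Test rows of the rendered table
  # imply column totals of 140,868 and 70,020, which sum to the overall N = 210,888.
  modify_header(stat_1 = "**No**, n = 140,909", stat_2 = "**Yes**, n = 70,037") %>%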
  bold_labels() %>%
  add_p(test = list(
    all_categorical() ~ "chisq.test",
    all_continuous() ~ "t.test"
  ))
missingtableshell <- modify_caption(missingtableshell, caption = "**Example of Table 2 for Missing Race Descriptive Statistics**")
missingtableshell
Example of Table 2 for Missing Race Descriptive Statistics

                                                  Missing Race
Characteristic            Overall, N = 210,888¹   No, n = 140,909¹    Yes, n = 70,037¹    p-value²
Screen Test                                                                               <0.001
    No                    153,319 (73%)           99,681 (71%)        53,638 (77%)
    Yes                   57,569 (27%)            41,187 (29%)        16,382 (23%)
Gender                                                                                    0.003
    Female                101,830 (48%)           68,378 (49%)        33,452 (48%)
    Male                  108,873 (52%)           72,438 (51%)        36,435 (52%)
    Unknown               185                     52                  133
Hispanic                                                                                  <0.001
    No                    104,275 (49%)           96,314 (68%)        7,961 (11%)
    Yes                   33,568 (16%)            18,251 (13%)        15,317 (22%)
    Missing               73,045 (35%)            26,303 (19%)        46,742 (67%)
rpl_themes_t              0.54 (0.27)             0.54 (0.26)         0.55 (0.28)         <0.001
    Unknown               22                      8                   14
acs_avg_hh_size_c         2.58 (0.27)             2.56 (0.25)         2.62 (0.30)         <0.001
acs_pct_foreign_born_t    9.59 (11.94)            8.63 (11.19)        11.53 (13.12)       <0.001
    Unknown               13                      2                   11

¹ n (%); Mean (SD)
² Pearson’s Chi-squared test; Welch Two Sample t-test
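
As a worked check of the Screen Test row, the four cell counts from the table above can be passed to base R's chisq.test(). This is a sketch using only the printed counts; it does not reproduce gtsummary's internals, and the object names are illustrative:

```r
# Screen Test counts copied from the rendered table.
# matrix() fills column by column: first race observed, then race missing.
screen_by_missing <- matrix(
  c(99681, 41187,   # race observed: Screen No, Screen Yes
    53638, 16382),  # race missing:  Screen No, Screen Yes
  nrow = 2,
  dimnames = list(Screen = c("No", "Yes"), MissingRace = c("No", "Yes"))
)

# Column totals implied by the table body.
colSums(screen_by_missing)

# Column percentages, comparable to the (71%, 29%) and (77%, 23%) in the table.
round(100 * prop.table(screen_by_missing, margin = 2))

# Pearson's chi-squared test of independence.
chi <- chisq.test(screen_by_missing)
chi$p.value < 0.001
```

The column totals computed this way sum to the overall N of 210,888 shown in the table.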
table2_m_r_acs <- df_all2 %>% select(racenew, acs_avg_hh_size_t, 
                                        acs_pct_child_disab_t, 
                                        acs_pct_ctz_naturalized_t, acs_pct_ctz_nonus_born_t, 
                                        acs_pct_ctz_us_born_t, acs_pct_foreign_born_t, 
                                        acs_pct_non_citizen_t, acs_pct_api_lang_t, acs_pct_english_t, 
                                        acs_pct_spanish_t, acs_pct_hh_no_internet_t, 
                                        acs_pct_child_1fam_t, acs_pct_children_grandparent_t, 
                                        acs_pct_hh_kid_1prnt_t, acs_pct_not_labor_t, 
                                        acs_pct_unemploy_t, acs_gini_index_t, acs_median_hh_inc_t, 
                                        acs_pct_health_inc_below137_t, acs_pct_inc50_t, 
                                        acs_pct_hh_food_stmp_t, acs_pct_bachelor_dgr_t, 
                                        acs_pct_owner_hu_t, acs_pct_vacant_hu_t, acs_pct_hu_no_veh_t, 
                                        acs_pct_medicaid_any_below64_t, acs_pct_uninsured_below64_t) %>% 
  tbl_summary(
    by = racenew, 
    #type = list(Screen ~ "categorical"),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2
    ) %>%
  add_overall() %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Missing Race**") %>%
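  # NOTE: these n values are hard-coded; 140,909 + 70,037 = 210,946, which exceeds
  # the overall N = 210,888 reported in the rendered table.
  modify_header(stat_1 = "**No**, n = 140,909", stat_2 = "**Yes**, n = 70,037") %>%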
  bold_labels() %>%
  add_p(test = list(
    all_categorical() ~ "chisq.test",
    all_continuous() ~ "t.test"
  ))
table2_m_r_acs <- modify_caption(table2_m_r_acs, caption = "**Example of Table 2 for Missing Race Descriptive Statistics**")
table2_m_r_acs
Example of Table 2 for Missing Race Descriptive Statistics

                                                         Missing Race
Characteristic                   Overall, N = 210,888¹   No, n = 140,909¹        Yes, n = 70,037¹        p-value²
acs_avg_hh_size_t                2.66 (0.50)             2.62 (0.46)             2.72 (0.58)             <0.001
    Unknown                      23                      9                       14
acs_pct_child_disab_t            4.83 (4.49)             4.97 (4.60)             4.55 (4.23)             <0.001
    Unknown                      72                      44                      28
acs_pct_ctz_naturalized_t        3.83 (5.50)             3.54 (5.38)             4.41 (5.69)             <0.001
    Unknown                      13                      2                       11
acs_pct_ctz_nonus_born_t         4.59 (5.86)             4.30 (5.75)             5.16 (6.02)             <0.001
    Unknown                      13                      2                       11
acs_pct_ctz_us_born_t            90.41 (11.94)           91.37 (11.19)           88.47 (13.12)           <0.001
    Unknown                      13                      2                       11
acs_pct_foreign_born_t           9.59 (11.94)            8.63 (11.19)            11.53 (13.12)           <0.001
    Unknown                      13                      2                       11
acs_pct_non_citizen_t            5.01 (7.50)             4.33 (6.62)             6.37 (8.88)             <0.001
    Unknown                      13                      2                       11
acs_pct_api_lang_t               1.56 (3.64)             1.45 (3.59)             1.79 (3.73)             <0.001
    Unknown                      13                      2                       11
acs_pct_english_t                82.63 (22.34)           84.48 (20.85)           78.91 (24.67)           <0.001
    Unknown                      13                      2                       11
acs_pct_spanish_t                13.20 (20.88)           11.58 (19.21)           16.45 (23.55)           <0.001
    Unknown                      13                      2                       11
acs_pct_hh_no_internet_t         15.88 (10.10)           16.39 (10.22)           14.85 (9.78)            <0.001
    Unknown                      21                      8                       13
acs_pct_child_1fam_t             30.63 (18.41)           30.64 (18.86)           30.61 (17.47)           0.7
    Unknown                      154                     94                      60
acs_pct_children_grandparent_t   8.76 (8.06)             8.90 (8.16)             8.50 (7.85)             <0.001
    Unknown                      71                      43                      28
acs_pct_hh_kid_1prnt_t           17.11 (8.87)            16.98 (8.87)            17.39 (8.87)            <0.001
    Unknown                      21                      8                       13
acs_pct_not_labor_t              38.13 (9.84)            38.79 (9.91)            36.82 (9.55)            <0.001
    Unknown                      13                      2                       11
acs_pct_unemploy_t               5.15 (3.94)             5.21 (4.04)             5.02 (3.73)             <0.001
    Unknown                      20                      6                       14
acs_gini_index_t                 0.42 (0.06)             0.42 (0.06)             0.42 (0.06)             <0.001
    Unknown                      32                      15                      17
acs_median_hh_inc_t              59,785.26 (23,425.52)   59,061.72 (23,206.82)   61,239.93 (23,792.82)   <0.001
    Unknown                      354                     268                     86
acs_pct_health_inc_below137_t    21.99 (12.57)           21.95 (12.31)           22.07 (13.07)           0.054
    Unknown                      19                      7                       12
acs_pct_inc50_t                  6.00 (5.18)             5.98 (5.07)             6.03 (5.40)             0.029
    Unknown                      19                      7                       12
acs_pct_hh_food_stmp_t           12.32 (10.10)           12.43 (10.20)           12.11 (9.89)            <0.001
    Unknown                      21                      8                       13
acs_pct_bachelor_dgr_t           16.11 (8.70)            15.97 (8.57)            16.39 (8.97)            <0.001
    Unknown                      13                      2                       11
acs_pct_owner_hu_t               67.40 (19.75)           68.13 (19.38)           65.91 (20.39)           <0.001
    Unknown                      21                      8                       13
acs_pct_vacant_hu_t              12.59 (9.34)            13.14 (9.51)            11.50 (8.90)            <0.001
    Unknown                      21                      8                       13
acs_pct_hu_no_veh_t              6.07 (7.04)             6.21 (7.22)             5.79 (6.64)             <0.001
    Unknown                      21                      8                       13
acs_pct_medicaid_any_below64_t   20.30 (13.15)           20.40 (13.00)           20.10 (13.45)           <0.001
    Unknown                      17                      5                       12
acs_pct_uninsured_below64_t      12.00 (8.38)            11.92 (8.39)            12.18 (8.37)            <0.001
    Unknown                      17                      5                       12

¹ Mean (SD)
² Welch Two Sample t-test
#table1shell <- modify_caption(table1shell, "<div style='text-align: left; font-weight: bold; color: grey'> Table 1. Patient Characteristics</div>")
#table1shell
#save.image(file='myEnvironment.RData')
save(table2_m_r_acs, file = "C:\\GitLab Repository\\inquisitiveimputers\\Documents\\Analysis Plan\\Table2MissingRaceACST.Rdata")
#names(df_all)
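
Each continuous row in the table above compares a tract-level covariate between patients with and without recorded race using a Welch two-sample t-test. A minimal sketch of that comparison in base R follows; the data are simulated and the column name hh_size is hypothetical:

```r
set.seed(1)

# Simulated toy data: racenew = 1 marks missing race, hh_size is a made-up covariate.
toy <- data.frame(
  racenew = rep(c(0, 1), times = c(60, 40)),
  hh_size = c(rnorm(60, mean = 2.6, sd = 0.5), rnorm(40, mean = 2.7, sd = 0.6))
)

# Group means and standard deviations, the "Mean (SD)" shown in each table cell.
aggregate(hh_size ~ racenew, data = toy,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))

# t.test() defaults to var.equal = FALSE, which is the Welch variant.
res <- t.test(hh_size ~ racenew, data = toy)
res$method
res$p.value
```

With more than 200,000 patients, even very small differences in means produce p-values below 0.001, so the group means themselves carry most of the information in that table.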

The echo: false option disables the printing of code (only output is displayed).