DSU EDA Analysis

Author

Charles Rose

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: colorspace

Loading required package: grid

The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, will retire in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.
The sp package is now running under evolution status 2
     (status 2 uses the sf package in place of rgdal)

VIM is ready to use.


Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues


Attaching package: 'VIM'


The following object is masked from 'package:datasets':

    sleep

[1] "C:/GitLab Repository/inquisitiveimputers/R code"

# A tibble: 5,000 × 121
   patientuid      gender race  hispanic dob   outcome tract county_fips zipcode
   <chr>           <chr>  <chr> <chr>    <chr> <chr>   <chr> <chr>       <chr>  
 1 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 2 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 3 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 4 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 5 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 6 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 7 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 8 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
 9 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
10 3a336c26-5e1c-… M      white unknown  2020… 1       3110… 31109       68521  
# ℹ 4,990 more rows
# ℹ 112 more variables: USPS_ZIP_PREF_STATE <chr>, weight_t <dbl>,
#   weight_c <dbl>, missing_geography <chr>, stcnty_c <chr>,
#   rpl_theme1_c <dbl>, rpl_theme2_c <dbl>, rpl_theme3_c <dbl>,
#   rpl_theme4_c <dbl>, rpl_themes_c <dbl>, area_sqmi_c <dbl>,
#   e_totpop_c <dbl>, d_pop_c <dbl>, st_abbr_t <chr>, rpl_theme1_t <dbl>,
#   rpl_theme2_t <dbl>, rpl_theme3_t <dbl>, rpl_theme4_t <dbl>, …

# A tibble: 6 × 121
  patientuid       gender race  hispanic dob   outcome tract county_fips zipcode
  <chr>            <chr>  <chr> <chr>    <chr> <chr>   <chr> <chr>       <chr>  
1 e97d1934-0fb4-4… M      white unknown  2018… 1       2803… 28033       38632  
2 E0051A0F-CE1D-4… M      unkn… unknown  2017… 1       3401… 34013       07003  
3 FB2CFA12-730B-4… M      unkn… unknown  2018… 1       0801… 08013       80504  
4 71de8492-8fb7-4… M      unkn… not his… 2021… 0       1800… 18003       46845  
5 58cdf738-8122-4… F      unkn… unknown  2018… 0       3004… 30047       59864  
6 0196aadf-b7d6-4… M      white unknown  2019… 1       1207… 12073       32317  
# ℹ 112 more variables: USPS_ZIP_PREF_STATE <chr>, weight_t <dbl>,
#   weight_c <dbl>, missing_geography <chr>, stcnty_c <chr>,
#   rpl_theme1_c <dbl>, rpl_theme2_c <dbl>, rpl_theme3_c <dbl>,
#   rpl_theme4_c <dbl>, rpl_themes_c <dbl>, area_sqmi_c <dbl>,
#   e_totpop_c <dbl>, d_pop_c <dbl>, st_abbr_t <chr>, rpl_theme1_t <dbl>,
#   rpl_theme2_t <dbl>, rpl_theme3_t <dbl>, rpl_theme4_t <dbl>,
#   rpl_themes_t <dbl>, area_sqmi_t <dbl>, e_totpop_t <dbl>, d_pop_t <dbl>, …

Explore Missing Data

[1] 566106

[1] 210888

Missing Data Table by Screening

#df_all <- df_all %>%
#  mutate_at(vars(gender),
#            ~labelled(., labels = c(Male = "M", Female = "F", `Other/Unknown` = "Missing")))

# Derive the analysis variables from the raw ABFM data in a single step
df_all <- df_ABFM %>%
  mutate(
    # Treat the literal string "Missing" as NA (other raw values, e.g. "unknown", are unchanged)
    race = if_else(race == "Missing", NA_character_, race),
    # Indicator: 1 if race is NA, 0 otherwise
    racenew = if_else(is.na(race), 1, 0),
    # Screening outcome as a factor with levels "0"/"1" (relabeled to No/Yes further down)
    Screen = as.factor(ifelse(as.character(outcome) == "1", 1, 0)),
    # Indicator for a missing census tract, stored as a factor
    tract_na = as.factor(ifelse(is.na(tract), 1, 0))
  )

head(df_all)

# A tibble: 6 × 124
  patientuid       gender race  hispanic dob   outcome tract county_fips zipcode
  <chr>            <chr>  <chr> <chr>    <chr> <chr>   <chr> <chr>       <chr>  
1 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
2 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
3 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
4 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
5 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
6 3a336c26-5e1c-4… M      white unknown  2020… 1       3110… 31109       68521  
# ℹ 115 more variables: USPS_ZIP_PREF_STATE <chr>, weight_t <dbl>,
#   weight_c <dbl>, missing_geography <chr>, stcnty_c <chr>,
#   rpl_theme1_c <dbl>, rpl_theme2_c <dbl>, rpl_theme3_c <dbl>,
#   rpl_theme4_c <dbl>, rpl_themes_c <dbl>, area_sqmi_c <dbl>,
#   e_totpop_c <dbl>, d_pop_c <dbl>, st_abbr_t <chr>, rpl_theme1_t <dbl>,
#   rpl_theme2_t <dbl>, rpl_theme3_t <dbl>, rpl_theme4_t <dbl>,
#   rpl_themes_t <dbl>, area_sqmi_t <dbl>, e_totpop_t <dbl>, d_pop_t <dbl>, …

# Relabel categories for display and fix the level order used in the tables
df_all <- df_all %>%
  mutate(gender = ifelse(gender == "Other/Unknown", NA, gender)) %>%
  mutate(gender = recode(gender, "M" = "Male", "F" = "Female")) %>%
  mutate(race = recode(race, "american indian or alaska native" = "AIAN", "asian" = "Asian",
                       "black or african american" = "Black", "multiple races" = "Multiple",
                       "native hawaiian or other pacific islander" = "NHOPI", "unknown" = "Missing",
                       "white" = "White")) %>%
  mutate(hispanic = recode(hispanic, "hispanic or latino" = "Yes", "not hispanic or latino" = "No",
                           "unknown" = "Missing")) %>%
  # tract_na has no level "1" in this data, so recode() never creates "Missing"
  # and the fct_relevel() call on tract_na below warns about an unknown level
  mutate(tract_na = recode(tract_na, "0" = "Yes", "1" = "Missing")) %>%
  mutate(Screen = recode(Screen, "0" = "No", "1" = "Yes")) %>%
  mutate(gender = fct_relevel(gender, "Female", "Male")) %>%
  mutate(race = fct_relevel(race, "AIAN", "Asian", "Black", "NHOPI", "White", "Multiple", "Missing")) %>%
  mutate(hispanic = fct_relevel(hispanic, "No", "Yes", "Missing")) %>%
  mutate(tract_na = fct_relevel(tract_na, "Yes", "Missing")) %>%
  mutate(Screen = fct_relevel(Screen, "No", "Yes"))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `tract_na = fct_relevel(tract_na, "Yes", "Missing")`.
Caused by warning:
! 1 unknown level in `f`: Missing
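The warning above is what one would expect when a factor is built from data in which one category never occurs. The following is a minimal base R sketch on hypothetical toy values (the `tract` vector and the `tract_na2` name are illustrative, not from the analysis). It shows how declaring both levels up front keeps an unused "Missing" level available:

```r
# Hypothetical toy data: no tract is missing
tract <- c("31109000100", "28033000200", "34013000300")

# Building the indicator from the data alone yields a single level, "0"
tract_na <- as.factor(ifelse(is.na(tract), 1, 0))
levels(tract_na)

# Declaring levels and labels explicitly keeps "Missing" even when it is unused
tract_na2 <- factor(ifelse(is.na(tract), 1, 0),
                    levels = c(0, 1),
                    labels = c("Yes", "Missing"))
levels(tract_na2)
table(tract_na2)
```

Whether an empty "Missing" row is wanted in the summary table is a design choice. With the levels declared this way, downstream tables may show that category with a zero count.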

# Keep the raw 0/1 outcome as the grouping variable for Table 1
df_all$scrn <- df_all$outcome


var_label(df_all) <- list(
  gender = "Gender",
  race = "Race",
  hispanic = "Hispanic",
  tract_na = "Census Tract",
  Screen = "Screen Test"
)

# Stratify by screening outcome; {n} fills each column header with the actual group size
# instead of a hard-coded count
table1shell <- df_all %>% select(gender, race, hispanic, tract_na, rpl_themes_t, z_SE_nat_t, scrn) %>% 
  tbl_summary(by = scrn) %>%
  add_overall() %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Developmental Screening**") %>%
  modify_header(stat_1 = "**No**, n = {style_number(n)}", stat_2 = "**Yes**, n = {style_number(n)}") %>%
  bold_labels() %>%
  add_p()
table1shell <- modify_caption(table1shell, caption = "**Example of Table 1 for Descriptive Statistics**")
table1shell

**Example of Table 1 for Descriptive Statistics**
Characteristic	Overall, N = 771,175¹	Developmental Screening		p-value²
Characteristic	Overall, N = 771,175¹	No, n = 533,657¹	Yes, n = 237,518¹	p-value²
Gender				<0.001
Female	372,079 (48%)	255,971 (48%)	116,108 (49%)
Male	398,424 (52%)	277,157 (52%)	121,267 (51%)
Unknown	672	529	143
Race				<0.001
AIAN	18,488 (2.4%)	17,501 (3.3%)	987 (0.4%)
Asian	11,829 (1.5%)	8,412 (1.6%)	3,417 (1.4%)
Black	55,166 (7.2%)	37,943 (7.1%)	17,223 (7.3%)
NHOPI	1,468 (0.2%)	1,110 (0.2%)	358 (0.2%)
White	365,921 (47%)	249,513 (47%)	116,408 (49%)
Multiple	3,534 (0.5%)	1,625 (0.3%)	1,909 (0.8%)
Missing	314,769 (41%)	217,553 (41%)	97,216 (41%)
Hispanic				<0.001
No	332,933 (43%)	240,293 (45%)	92,640 (39%)
Yes	144,082 (19%)	89,307 (17%)	54,775 (23%)
Missing	294,160 (38%)	204,057 (38%)	90,103 (38%)
Census Tract
Yes	771,175 (100%)	533,657 (100%)	237,518 (100%)
rpl_themes_t	0.52 (0.27, 0.76)	0.53 (0.28, 0.76)	0.49 (0.25, 0.76)	<0.001
Unknown	4,295	2,950	1,345
z_SE_nat_t	0.03 (-0.09, 0.16)	0.02 (-0.10, 0.15)	0.04 (-0.08, 0.18)	<0.001
Unknown	205,069	137,123	67,946
¹ n (%); Median (IQR)
² Pearson’s Chi-squared test; Wilcoxon rank sum test
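As a rough check on the categorical p-values, the Pearson chi-squared test can be reproduced in base R from the counts printed in the table. This sketch uses the Gender rows and assumes that the "Unknown" row is dropped before testing and that no continuity correction is applied. Both are assumptions about how the table's test was run, not something the output confirms.

```r
# Gender-by-screening counts copied from the Gender rows of the table above
counts <- matrix(
  c(255971, 116108,   # Female: No, Yes
    277157, 121267),  # Male:   No, Yes
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"), screen = c("No", "Yes"))
)

# correct = FALSE requests the plain Pearson statistic (assumed, see lead-in)
res <- chisq.test(counts, correct = FALSE)

# TRUE, in line with the "<0.001" reported for Gender
res$p.value < 0.001
```

With samples this large, even a one-percentage-point difference in the share of females is statistically significant, so the p-value says little about practical importance.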

#table1shell <- modify_caption(table1shell, "<div style='text-align: left; font-weight: bold; color: grey'> Table 1. Patient Characteristics</div>")
#table1shell
#save.image(file='myEnvironment.RData')
save(table1shell, file = "C:\\GitLab Repository\\inquisitiveimputers\\Documents\\Results\\EDA\\Table1Desc.Rdata")
#names(df_all)

head(df_all, 500)

# A tibble: 500 × 125
   patientuid      gender race  hispanic dob   outcome tract county_fips zipcode
   <chr>           <fct>  <fct> <fct>    <chr> <chr>   <chr> <chr>       <chr>  
 1 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 2 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 3 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 4 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 5 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 6 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 7 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 8 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
 9 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
10 3a336c26-5e1c-… Male   White Missing  2020… 1       3110… 31109       68521  
# ℹ 490 more rows
# ℹ 116 more variables: USPS_ZIP_PREF_STATE <chr>, weight_t <dbl>,
#   weight_c <dbl>, missing_geography <chr>, stcnty_c <chr>,
#   rpl_theme1_c <dbl>, rpl_theme2_c <dbl>, rpl_theme3_c <dbl>,
#   rpl_theme4_c <dbl>, rpl_themes_c <dbl>, area_sqmi_c <dbl>,
#   e_totpop_c <dbl>, d_pop_c <dbl>, st_abbr_t <chr>, rpl_theme1_t <dbl>,
#   rpl_theme2_t <dbl>, rpl_theme3_t <dbl>, rpl_theme4_t <dbl>, …

# Summary table with chi-square test
df_all2 <- df_all %>%
  mutate(race = if_else(race == "Missing", NA_character_, race)) %>%
  mutate(racenew = if_else(is.na(race), 1, 0)) %>%
  distinct(patientuid, .keep_all = TRUE)

# Compare patients with and without a recorded race; {n} fills each column header
# with the actual group size instead of a hard-coded count
missingtableshell <- df_all2 %>% select(Screen, gender, racenew, hispanic, rpl_themes_t,
                                        acs_avg_hh_size_c, acs_pct_foreign_born_t) %>% 
  tbl_summary(
    by = racenew, 
    type = list(Screen ~ "categorical"),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2
    ) %>%
  add_overall() %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Missing Race**") %>%
  modify_header(stat_1 = "**No**, n = {style_number(n)}", stat_2 = "**Yes**, n = {style_number(n)}") %>%
  bold_labels() %>%
  add_p(test = list(
    all_categorical() ~ "chisq.test",
    all_continuous() ~ "t.test"
  ))
missingtableshell <- modify_caption(missingtableshell, caption = "**Example of Table 2 for Missing Race Descriptive Statistics**")
missingtableshell

**Example of Table 2 for Missing Race Descriptive Statistics**
Characteristic	Overall, N = 210,888¹	Missing Race		p-value²
Characteristic	Overall, N = 210,888¹	No, n = 140,868¹	Yes, n = 70,020¹	p-value²
Screen Test				<0.001
No	153,319 (73%)	99,681 (71%)	53,638 (77%)
Yes	57,569 (27%)	41,187 (29%)	16,382 (23%)
Gender				0.003
Female	101,830 (48%)	68,378 (49%)	33,452 (48%)
Male	108,873 (52%)	72,438 (51%)	36,435 (52%)
Unknown	185	52	133
Hispanic				<0.001
No	104,275 (49%)	96,314 (68%)	7,961 (11%)
Yes	33,568 (16%)	18,251 (13%)	15,317 (22%)
Missing	73,045 (35%)	26,303 (19%)	46,742 (67%)
rpl_themes_t	0.54 (0.27)	0.54 (0.26)	0.55 (0.28)	<0.001
Unknown	22	8	14
acs_avg_hh_size_c	2.58 (0.27)	2.56 (0.25)	2.62 (0.30)	<0.001
acs_pct_foreign_born_t	9.59 (11.94)	8.63 (11.19)	11.53 (13.12)	<0.001
Unknown	13	2	11
¹ n (%); Mean (SD)
² Pearson’s Chi-squared test; Welch Two Sample t-test
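The table above is built on one row per patient (Overall N = 210,888, versus 771,175 rows before de-duplication) and a 0/1 indicator for missing race. A minimal base R sketch of those two steps on hypothetical toy data follows. The `visits` and `patients` data frames are illustrative only. `!duplicated()` is used here as a base R stand-in for `distinct(patientuid, .keep_all = TRUE)`, which likewise keeps the first row per patient.

```r
# Hypothetical toy data: one row per visit, so patients can repeat
visits <- data.frame(
  patientuid = c("a", "a", "b", "c", "c", "d"),
  race       = c("white", "white", NA, "asian", "asian", NA),
  outcome    = c(1, 1, 0, 1, 1, 0)
)

# Keep the first row for each patient
patients <- visits[!duplicated(visits$patientuid), ]

# 1 when race is missing, 0 otherwise
patients$racenew <- ifelse(is.na(patients$race), 1, 0)

# Cross-tabulate race missingness against the screening outcome
tab <- table(racenew = patients$racenew, outcome = patients$outcome)
tab
```

De-duplicating first matters for the tests in the table: repeated rows for the same patient are not independent observations, and they would inflate the counts behind each p-value.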

# Tract-level ACS characteristics by race missingness; {n} fills each column header
# with the actual group size instead of a hard-coded count
table2_m_r_acs <- df_all2 %>% select(racenew, acs_avg_hh_size_t, 
                                        acs_pct_child_disab_t, 
                                        acs_pct_ctz_naturalized_t, acs_pct_ctz_nonus_born_t, 
                                        acs_pct_ctz_us_born_t, acs_pct_foreign_born_t, 
                                        acs_pct_non_citizen_t, acs_pct_api_lang_t, acs_pct_english_t, 
                                        acs_pct_spanish_t, acs_pct_hh_no_internet_t, 
                                        acs_pct_child_1fam_t, acs_pct_children_grandparent_t, 
                                        acs_pct_hh_kid_1prnt_t, acs_pct_not_labor_t, 
                                        acs_pct_unemploy_t, acs_gini_index_t, acs_median_hh_inc_t, 
                                        acs_pct_health_inc_below137_t, acs_pct_inc50_t, 
                                        acs_pct_hh_food_stmp_t, acs_pct_bachelor_dgr_t, 
                                        acs_pct_owner_hu_t, acs_pct_vacant_hu_t, acs_pct_hu_no_veh_t, 
                                        acs_pct_medicaid_any_below64_t, acs_pct_uninsured_below64_t) %>% 
  tbl_summary(
    by = racenew, 
    #type = list(Screen ~ "categorical"),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2
    ) %>%
  add_overall() %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Missing Race**") %>%
  modify_header(stat_1 = "**No**, n = {style_number(n)}", stat_2 = "**Yes**, n = {style_number(n)}") %>%
  bold_labels() %>%
  add_p(test = list(
    all_categorical() ~ "chisq.test",
    all_continuous() ~ "t.test"
  ))
table2_m_r_acs <- modify_caption(table2_m_r_acs, caption = "**Example of Table 2 for Missing Race Descriptive Statistics**")
table2_m_r_acs

**Example of Table 2 for Missing Race Descriptive Statistics**
Characteristic	Overall, N = 210,888¹	Missing Race		p-value²
Characteristic	Overall, N = 210,888¹	No, n = 140,868¹	Yes, n = 70,020¹	p-value²
acs_avg_hh_size_t	2.66 (0.50)	2.62 (0.46)	2.72 (0.58)	<0.001
Unknown	23	9	14
acs_pct_child_disab_t	4.83 (4.49)	4.97 (4.60)	4.55 (4.23)	<0.001
Unknown	72	44	28
acs_pct_ctz_naturalized_t	3.83 (5.50)	3.54 (5.38)	4.41 (5.69)	<0.001
Unknown	13	2	11
acs_pct_ctz_nonus_born_t	4.59 (5.86)	4.30 (5.75)	5.16 (6.02)	<0.001
Unknown	13	2	11
acs_pct_ctz_us_born_t	90.41 (11.94)	91.37 (11.19)	88.47 (13.12)	<0.001
Unknown	13	2	11
acs_pct_foreign_born_t	9.59 (11.94)	8.63 (11.19)	11.53 (13.12)	<0.001
Unknown	13	2	11
acs_pct_non_citizen_t	5.01 (7.50)	4.33 (6.62)	6.37 (8.88)	<0.001
Unknown	13	2	11
acs_pct_api_lang_t	1.56 (3.64)	1.45 (3.59)	1.79 (3.73)	<0.001
Unknown	13	2	11
acs_pct_english_t	82.63 (22.34)	84.48 (20.85)	78.91 (24.67)	<0.001
Unknown	13	2	11
acs_pct_spanish_t	13.20 (20.88)	11.58 (19.21)	16.45 (23.55)	<0.001
Unknown	13	2	11
acs_pct_hh_no_internet_t	15.88 (10.10)	16.39 (10.22)	14.85 (9.78)	<0.001
Unknown	21	8	13
acs_pct_child_1fam_t	30.63 (18.41)	30.64 (18.86)	30.61 (17.47)	0.7
Unknown	154	94	60
acs_pct_children_grandparent_t	8.76 (8.06)	8.90 (8.16)	8.50 (7.85)	<0.001
Unknown	71	43	28
acs_pct_hh_kid_1prnt_t	17.11 (8.87)	16.98 (8.87)	17.39 (8.87)	<0.001
Unknown	21	8	13
acs_pct_not_labor_t	38.13 (9.84)	38.79 (9.91)	36.82 (9.55)	<0.001
Unknown	13	2	11
acs_pct_unemploy_t	5.15 (3.94)	5.21 (4.04)	5.02 (3.73)	<0.001
Unknown	20	6	14
acs_gini_index_t	0.42 (0.06)	0.42 (0.06)	0.42 (0.06)	<0.001
Unknown	32	15	17
acs_median_hh_inc_t	59,785.26 (23,425.52)	59,061.72 (23,206.82)	61,239.93 (23,792.82)	<0.001
Unknown	354	268	86
acs_pct_health_inc_below137_t	21.99 (12.57)	21.95 (12.31)	22.07 (13.07)	0.054
Unknown	19	7	12
acs_pct_inc50_t	6.00 (5.18)	5.98 (5.07)	6.03 (5.40)	0.029
Unknown	19	7	12
acs_pct_hh_food_stmp_t	12.32 (10.10)	12.43 (10.20)	12.11 (9.89)	<0.001
Unknown	21	8	13
acs_pct_bachelor_dgr_t	16.11 (8.70)	15.97 (8.57)	16.39 (8.97)	<0.001
Unknown	13	2	11
acs_pct_owner_hu_t	67.40 (19.75)	68.13 (19.38)	65.91 (20.39)	<0.001
Unknown	21	8	13
acs_pct_vacant_hu_t	12.59 (9.34)	13.14 (9.51)	11.50 (8.90)	<0.001
Unknown	21	8	13
acs_pct_hu_no_veh_t	6.07 (7.04)	6.21 (7.22)	5.79 (6.64)	<0.001
Unknown	21	8	13
acs_pct_medicaid_any_below64_t	20.30 (13.15)	20.40 (13.00)	20.10 (13.45)	<0.001
Unknown	17	5	12
acs_pct_uninsured_below64_t	12.00 (8.38)	11.92 (8.39)	12.18 (8.37)	<0.001
Unknown	17	5	12
¹ Mean (SD)
² Welch Two Sample t-test
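The footnote names the Welch two-sample t-test, which is what base R's `t.test()` runs by default (`var.equal = FALSE`). The sketch below uses simulated data. The group sizes are hypothetical, and the means and standard deviations are only loosely modeled on the `acs_avg_hh_size_t` row. It shows the call and checks it against the Welch statistic computed by hand.

```r
set.seed(1)

# Simulated stand-ins for the two groups (race recorded vs. race missing)
x <- rnorm(200, mean = 2.62, sd = 0.46)
y <- rnorm(100, mean = 2.72, sd = 0.58)

# Default t.test() does not assume equal variances, i.e. it is the Welch test
res <- t.test(x, y)
res$method

# Welch statistic by hand: mean difference over the unpooled standard error
t_manual <- (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
all.equal(unname(res$statistic), t_manual)
```

Because the standard error shrinks with the square root of the sample size, group sizes in the tens of thousands make very small mean differences significant, which is worth keeping in mind when reading the column of "<0.001" values.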

#table1shell <- modify_caption(table1shell, "<div style='text-align: left; font-weight: bold; color: grey'> Table 1. Patient Characteristics</div>")
#table1shell
#save.image(file='myEnvironment.RData')
save(table2_m_r_acs, file = "C:\\GitLab Repository\\inquisitiveimputers\\Documents\\Analysis Plan\\Table2MissingRaceACST.Rdata")
#names(df_all)
