NDAR Words & Sentences Data

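library(dplyr)  # select(), slice(), rename(), mutate(), %>%
library(tidyr)  # gather()
library(knitr)  # kable()

data_AS_WS <- read.delim("data/ASD CDI data/mci_sentences02.txt", 
                         header = TRUE, sep = "\t", dec = ".")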

# extract the first row (the column descriptions) as a descriptive data frame
description_WS <- data_AS_WS[1, ]
description_WS <- as.data.frame(t(description_WS))
names(description_WS) <- "description"

# columns we drop, listed with their descriptions:
eliminated_WS <- data_AS_WS %>%
  dplyr::select(c(701:709, 780:850)) %>%
  slice(1) %>%
  gather(key = "Column names", value = "Description")

# grammar items (past tense, future, not present, etc.)
#description_WS[701:709,]
# complexity and examples (e.g. longest sentences)
#description_WS[780:850,]

# use mci_sentences02_id as the distinct row id, alongside the NDAR Global Unique Identifier (subjectkey)
data_raw_AS_WS <- data_AS_WS %>%
  dplyr::select(c("mci_sentences02_id","subjectkey","interview_age", 
                  "collection_id", "dataset_id", "interview_date", 
                  "src_subject_id", "sex", 21:779)) %>% # starting from 785 is complexity. we kept vocabs, word endings, word forms.
  dplyr::select(-(689:697))
  #dplyr::select(-(684:692)) # before adding interview_age -> sex...

# promote the description row to column names, then drop that row
colnames(data_raw_AS_WS) <- as.character(unlist(data_raw_AS_WS[1, ]))
data_raw_AS_WS <- data_raw_AS_WS[-1, ]

data_raw_AS_WS <- data_raw_AS_WS %>%
  rename(id = "mci_sentences02_id",
         GUID = "The NDAR Global Unique Identifier (GUID) for research subject", 
         age = "Age in months at the time of the interview/test/sampling/imaging.",
         test_date = "Date on which the interview/genetic test/sampling/imaging/biospecimen was completed. MM/DD/YYYY",
         src_subject_id = "Subject ID how it's defined in lab/project",
         sex = "Sex of the subject") %>%
  mutate(age = as.numeric(as.character(age)))
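
The loading steps above rely on the layout of the NDAR export, in which the first data row holds a human-readable description of each column. A minimal sketch of that pattern on a toy data frame follows; the GUID values and the two-column layout are made up for illustration and are not taken from the real file.

```r
# Toy stand-in for an NDAR export: row 1 holds the column descriptions.
# (Hypothetical values; only the column names mirror the real file.)
toy <- data.frame(
  subjectkey = c("The NDAR Global Unique Identifier (GUID) for research subject",
                 "GUID_A", "GUID_B"),
  interview_age = c("Age in months at the time of the interview/test/sampling/imaging.",
                    "24", "30"),
  stringsAsFactors = FALSE
)

# Keep the description row as a lookup table, one row per variable.
toy_description <- data.frame(description = unlist(toy[1, ]))

# Drop the description row; only then can age be converted to numeric.
toy_data <- toy[-1, ]
toy_data$interview_age <- as.numeric(toy_data$interview_age)
```

Because the description row forces every column to be read as character, numeric conversion has to happen after that row is removed, which is why the report calls as.numeric() only at the end of the pipeline.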

Loaded data from 5101 subjects. Which of these have mcs_vc_total==“999”? (999 is a code often used in SPSS for missing data.) The counts are shown below by collection_id:

nnn <- subset(data_AS_WS, mcs_vc_total=="999")
nnn %>% group_by(collection_id) %>% 
  summarise(n = n()) %>%
  kable()

collection_id	n
2024	425
2368	16
2666	8
6	71
8	92

Item-level data

But many of these actually have at least some item-level data. Four item-level checks report 66, 428, 66, and 66 subjects with non-empty responses (0, 1, or 2); the rendered text labels every one of them ‘grr’.

At least 62 subjects in collection 2024 and 4 subjects in collection 2368 have item-level data for many of the items. Collection 2024’s PI is Athena Vouloumanos, and actual enrollment is listed as 209. Indeed, there do seem to be redundant rows in our data. Out of the 612 entries with mcs_vc_total==“999”, there are only 200 unique subjectkey values, and 198 unique src_subject_id entries. (239 distinct subjectkey x interview_age, so mostly not longitudinal data…)

Collection 2368 (PI: Susan Swedo) reported submitting 125 CDI:WS subjects (and 114 CDI:WG).
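
The redundancy check described above (rows versus unique subjects versus unique subject-by-age pairs) can be sketched in base R. This is an illustration only: the `flagged` data frame and all of its values are hypothetical, and only the column names follow the NDAR file.

```r
# Hypothetical rows standing in for the entries with mcs_vc_total == "999".
flagged <- data.frame(
  subjectkey     = c("S1", "S1", "S2", "S2", "S3"),
  src_subject_id = c("a",  "a",  "b",  "b",  "c"),
  interview_age  = c(24,   24,   30,   36,   28)
)

n_rows     <- nrow(flagged)                       # all entries
n_subjects <- length(unique(flagged$subjectkey))  # distinct subjects

# Distinct subject x age pairs: more pairs than subjects means some
# subjects were measured at more than one age (longitudinal data).
n_subject_age <- nrow(unique(flagged[c("subjectkey", "interview_age")]))

# Rows that repeat an earlier row exactly, i.e. redundant entries.
n_exact_dupes <- sum(duplicated(flagged))
```

When the number of rows is far above the number of subject-by-age pairs, as in the report (612 entries against 239 pairs), repeated submissions rather than repeated testing are the more likely explanation.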

Empty Vocabulary Columns

data_AS_WS$empty_voc_resp = rowSums(data_AS_WS[,21:700]=="")==680

novoc <- data_AS_WS %>% filter(empty_voc_resp)

1486 subjects are entirely blank in their vocabulary columns. Below we show the number of subjects with blank vocabulary along with the total N per collection.
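
The blank-row test above compares every cell to the empty string, which works when blanks are read in as "". If some cells were NA instead, the comparison would return NA for those rows. A minimal sketch of a variant that treats both as blank follows; the `items` data frame is hypothetical, and whether the real file contains NA cells is not checked here.

```r
# Hypothetical item responses: "" and NA both stand for "no response".
items <- data.frame(
  item_1 = c("", "1", NA, ""),
  item_2 = c("", "",  NA, NA),
  stringsAsFactors = FALSE
)

# Comparison in the style of the report: NA cells propagate into the row sum.
strict_blank <- rowSums(items == "") == ncol(items)

# Variant that counts a cell as blank if it is NA or an empty string.
is_blank  <- is.na(items) | items == ""
all_blank <- rowSums(is_blank) == ncol(items)
```

Using ncol(items) instead of a hard-coded count such as 680 also keeps the test correct if the column range changes.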

blank_n <- novoc %>%
  group_by(collection_id) %>%
  summarise(blank_voc_n = n()) 

ws_tab <- data_raw_AS_WS %>% 
  distinct(collection_id, id, GUID, age) %>%
  group_by(collection_id) %>%
  summarise(total_n=n()) %>%
  left_join(blank_n) %>%
  arrange(total_n)

## Joining, by = "collection_id"

#sum(subset(ws_tab, is.na(blank_voc_n))$total_n)

ws_tab %>%
  kable()

collection_id	total_n	blank_voc_n
6	71	71
2664	72	NA
1856	90	NA
8	92	92
2026	115	NA
2355	125	125
2368	272	1
2666	396	396
9	657	657
2024	3210	144

Per the table above, only three collections have no subjects with fully blank item-level vocab responses: collection IDs 2664, 1856, and 2026, with a total of 277 subjects. Collection 2368 has just 1 blank subject out of 272.

There are 3210 subjects in collection 2024 (Vouloumanos)…is that possible? Five experiments are listed, the Shared Data tab doesn’t even mention the CDI, and the Data Expected tab lists a Targeted Enrollment of only 100 for the CDI (and shows 0 Subjects Shared). There are 17 listed publications, but their abstracts don’t report Ns.

Also, there are 23164 “2” responses in vocabulary: are some of these WS administrations actually WG administrations? (Where presumably 1=“understands” and 2=“understands and says”?)
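
One way to probe the question above is to count “2” responses per row and per collection, since rows or collections that carry many of them are the candidates for mislabeled WG administrations. A minimal sketch, assuming responses are stored as character codes; the `resp` data frame, its item columns, and the collection labels are all hypothetical.

```r
# Hypothetical responses: "2" should not occur on a two-valued WS checklist.
resp <- data.frame(
  collection_id = c("A", "B", "B"),
  item_1 = c("1", "2", ""),
  item_2 = c("0", "2", "1"),
  item_3 = c("1", "1", ""),
  stringsAsFactors = FALSE
)
item_cols <- c("item_1", "item_2", "item_3")

# Number of "2" responses in each row.
n_twos <- rowSums(resp[item_cols] == "2")

# Total "2" responses per collection.
twos_by_collection <- tapply(n_twos, resp$collection_id, sum)

# Rows with at least one "2" are candidates for a closer look.
suspect_rows <- which(n_twos > 0)
```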

Now let’s look at the WG data.

NDAR Words & Gestures Data

data_AS_WG <- read.delim("data/ASD CDI data/mci_words_gestures01.txt", 
                         header = TRUE, sep = "\t", dec = ".") 

# extracting first row as a descriptive dataframe
description_WG <- data_AS_WG[1:1,]
description_WG <- as.data.frame(t(description_WG))
names(description_WG) <- "description"

# we only kept vocab 
eliminated_WG <- data_AS_WG %>%
  dplyr::select(c(23:58,454:520))

# use mci_words_gestures01_id as the distinct row id, alongside the NDAR Global Unique Identifier (subjectkey)

data_raw_AS_WG <- data_AS_WG %>%
  dplyr::select(c("collection_id","dataset_id","sex","mci_words_gestures01_id","subjectkey","interview_age", 59:453, 517))

# promote the description row to column names, then drop that row
colnames(data_raw_AS_WG) <- as.character(unlist(data_raw_AS_WG[1, ]))
data_raw_AS_WG <- data_raw_AS_WG[-1, ]

# columns whose (description-based) names repeat an earlier column name
AS_WG_duplicated <- data_raw_AS_WG[duplicated(colnames(data_raw_AS_WG))]

# making column names unique
names(data_raw_AS_WG) <- make.unique(names(data_raw_AS_WG), sep="_")


data_raw_AS_WG <- data_raw_AS_WG %>%
  rename(id = "mci_words_gestures01_id",
         GUID = "The NDAR Global Unique Identifier (GUID) for research subject", 
         age = "Age in months at the time of the interview/test/sampling/imaging.",
         house = "MacArthur Words and Gestures: Vocabulary Checklist: House",
         sex = "Sex of the subject") %>%
  mutate(age = as.numeric(as.character(age)))
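
The column-name cleanup above depends on how duplicated() and make.unique() treat repeats: the first occurrence of a name is left alone and only later copies are flagged or suffixed. A small sketch with made-up names:

```r
# Made-up column names with one label repeated three times.
cols <- c("House", "Dog", "House", "House")

# Only the second and later occurrences count as duplicates.
dup_flags <- duplicated(cols)

# The first occurrence keeps its name; repeats get _1, _2, ...
unique_cols <- make.unique(cols, sep = "_")
```

This is why a rename can still refer to the original, unsuffixed description after make.unique() has run: that name now belongs to the first matching column only.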


nnn_wg <- subset(data_AS_WG, mcg_vc_totcom=="999")

Loaded data from 8489 subjects. There aren’t any “999”s in mcg_vc_totcom, at least. But there are a lot of entries with no item-level data…

Empty Vocabulary Columns

# sum(duplicated(data_AS_WG[,59:517])) # just the vocab columns
# 7089 -- but are all of these just empty?
data_AS_WG$empty_voc_resp = rowSums(data_AS_WG[,59:517]=="")==459

novoc <- data_AS_WG %>% filter(empty_voc_resp)

4572 subjects are entirely blank in their vocabulary columns. Below we show the number of subjects with blank vocabulary along with the total N per collection.

blank_n <- novoc %>%
  group_by(collection_id) %>%
  summarise(blank_voc_n = n()) 

wg_tab <- data_raw_AS_WG %>% 
  distinct(collection_id, id, GUID, age) %>%
  group_by(collection_id) %>%
  summarise(total_n=n()) %>%
  left_join(blank_n) %>%
  arrange(total_n)

## Joining, by = "collection_id"

#sum(subset(wg_tab, is.na(blank_voc_n))$total_n)

wg_tab %>%
  kable()

collection_id	total_n	blank_voc_n
2169	3	3
2503	18	18
2510	18	NA
1952	20	20
2192	30	NA
2878	50	NA
2027	61	NA
2666	112	NA
6	138	138
16	176	176
2355	213	203
8	269	269
2368	290	NA
1885	305	305
2026	781	781
1600	840	NA
9	876	876
2089	918	918
19	1413	NA
2024	1957	865

Eight collections don’t have any missing item-level vocabulary data, and have a total of 2814 subjects.
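
The summary above can be recomputed from the printed table. The sketch below transcribes the WG table into a data frame (named `wg_counts` here so it does not clash with the report's own objects) and derives the counts from it.

```r
# WG table as printed above: total subjects and fully blank subjects per collection.
wg_counts <- data.frame(
  collection_id = c(2169, 2503, 2510, 1952, 2192, 2878, 2027, 2666, 6, 16,
                    2355, 8, 2368, 1885, 2026, 1600, 9, 2089, 19, 2024),
  total_n       = c(3, 18, 18, 20, 30, 50, 61, 112, 138, 176,
                    213, 269, 290, 305, 781, 840, 876, 918, 1413, 1957),
  blank_voc_n   = c(3, 18, NA, 20, NA, NA, NA, NA, 138, 176,
                    203, 269, NA, 305, 781, NA, 876, 918, NA, 865)
)

# Collections with no fully blank subjects (NA after the left join).
no_blank <- subset(wg_counts, is.na(blank_voc_n))
nrow(no_blank)         # 8 collections
sum(no_blank$total_n)  # 2814 subjects

# Collections that are only partly blank.
partly_blank <- subset(wg_counts, !is.na(blank_voc_n) & blank_voc_n < total_n)
partly_blank$collection_id  # 2355 and 2024
```

The remaining ten collections are blank for every listed subject, so only the eight complete collections plus the non-blank rows of 2355 and 2024 carry item-level vocabulary data.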

Investigate Missing Data

George

2022-10-18
