data_AS_WS <- read.delim("data/ASD CDI data/mci_sentences02.txt",
header = TRUE, sep = "\t", dec = ".")
# keep the first row (the column descriptions) as a lookup data frame
description_WS <- data_AS_WS[1, ]
description_WS <- as.data.frame(t(description_WS))
names(description_WS) <- "description"
# what we get rid of:
eliminated_WS <- data_AS_WS %>%
dplyr::select(c(701:709,780:850)) %>%
slice(1:1) %>%
gather(key = "Column names", value = "Description")#%>%
#head(10)
# grammar items (past tense, future, not present, etc.)
#description_WS[701:709,]
# complexity and examples (e.g. longest sentences)
#description_WS[780:850,]
#using mci_sentences02_id as a distinctive id and The NDAR Global Unique Identifier
data_raw_AS_WS <- data_AS_WS %>%
dplyr::select(c("mci_sentences02_id","subjectkey","interview_age",
"collection_id", "dataset_id", "interview_date",
"src_subject_id", "sex", 21:779)) %>% # starting from 785 is complexity. we kept vocabs, word endings, word forms.
dplyr::select(-(689:697))
#dplyr::select(-(684:692)) # before adding interview_age -> sex...
colnames(data_raw_AS_WS) <- as.character(unlist(data_raw_AS_WS[1, ])) # use the description row as column names
data_raw_AS_WS <- data_raw_AS_WS[-1, ] # then drop that row from the data
data_raw_AS_WS <- data_raw_AS_WS %>%
rename(id = "mci_sentences02_id",
GUID = "The NDAR Global Unique Identifier (GUID) for research subject",
age = "Age in months at the time of the interview/test/sampling/imaging.",
test_date = "Date on which the interview/genetic test/sampling/imaging/biospecimen was completed. MM/DD/YYYY",
src_subject_id = "Subject ID how it's defined in lab/project",
sex = "Sex of the subject") %>%
mutate(age = as.numeric(as.character(age)))
Loaded data from 5101 subjects. Which of these have mcs_vc_total==“999” (a code often used in SPSS for missing data)? The counts by collection_id are:
nnn <- subset(data_AS_WS, mcs_vc_total=="999")
nnn %>% group_by(collection_id) %>%
summarise(n = n()) %>%
kable()
collection_id | n |
---|---|
2024 | 425 |
2368 | 16 |
2666 | 8 |
6 | 71 |
8 | 92 |
But many of these actually have at least some item-level data. Four item-level checks, each labelled ‘grr’ in the output, report 66, 428, 66, and 66 subjects with non-empty responses (0, 1, or 2).
At least 62 subjects in collection 2024 and 4 subjects in collection 2368 have item-level data for many of the items. Collection 2024’s PI is Athena Vouloumanos, and actual enrollment is listed as 209. Indeed, there do seem to be redundant rows in our data. Out of the 612 entries with mcs_vc_total==“999”, there are only 200 unique subjectkey values, and 198 unique src_subject_id entries. (239 distinct subjectkey x interview_age, so mostly not longitudinal data…)
Collection 2368’s PI is Susan Swedo, and the collection reports submitting 125 CDI:WS subjects (and 114 CDI:WG).
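The redundancy check described above can be sketched roughly as follows. This is a minimal example on a toy data frame, not the code that produced the counts in the text: `toy_999` and its values are hypothetical, and only the column names (`subjectkey`, `src_subject_id`, `interview_age`) come from the data.

```r
library(dplyr)

# Hypothetical stand-in for the rows with mcs_vc_total == "999"
toy_999 <- data.frame(
  subjectkey     = c("A", "A", "A", "B", "B", "C"),
  src_subject_id = c("s1", "s1", "s1", "s2", "s2", "s3"),
  interview_age  = c("24", "24", "30", "18", "18", "20"),
  stringsAsFactors = FALSE
)

redundancy <- toy_999 %>%
  summarise(
    n_rows           = n(),
    n_subjectkey     = n_distinct(subjectkey),
    n_src_subject_id = n_distinct(src_subject_id),
    # distinct subject x age pairs; many more pairs than subjects
    # would point to longitudinal data rather than duplicated rows
    n_subject_by_age = n_distinct(subjectkey, interview_age)
  )

print(redundancy)
```

If `n_rows` is much larger than `n_subject_by_age`, the extra rows are repeats of the same subject at the same age, which is the pattern suspected here.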
data_AS_WS$empty_voc_resp = rowSums(data_AS_WS[,21:700]=="")==680
novoc <- data_AS_WS %>% filter(empty_voc_resp)
1486 subjects are entirely blank in their vocabulary columns. Below we show the number of subjects with blank vocabulary along with the total N per collection.
blank_n <- novoc %>%
group_by(collection_id) %>%
summarise(blank_voc_n = n())
ws_tab <- data_raw_AS_WS %>%
distinct(collection_id, id, GUID, age) %>%
group_by(collection_id) %>%
summarise(total_n=n()) %>%
left_join(blank_n) %>%
arrange(total_n)
## Joining, by = "collection_id"
#sum(subset(ws_tab, is.na(blank_voc_n))$total_n)
ws_tab %>%
kable()
collection_id | total_n | blank_voc_n |
---|---|---|
6 | 71 | 71 |
2664 | 72 | NA |
1856 | 90 | NA |
8 | 92 | 92 |
2026 | 115 | NA |
2355 | 125 | 125 |
2368 | 272 | 1 |
2666 | 396 | 396 |
9 | 657 | 657 |
2024 | 3210 | 144 |
Only three collections have no subjects with fully blank item-level vocab responses: collection IDs 2664, 1856, and 2026, with a total of 277 subjects. Collection 2368 comes close, with 1 blank subject out of 272.
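The blank-vocabulary flag used for this table hard-codes the column range and the column count, and it would return NA for any row containing NA cells. A hedged variant is sketched below on toy data; `toy`, `item_a`, `item_b`, and `voc_cols` are hypothetical names, and we are assuming that NA cells should count as blank.

```r
library(dplyr)

# Hypothetical toy data: two vocabulary items, one all-empty row, one all-NA row
toy <- data.frame(
  collection_id = c("6", "6", "2024", "2024"),
  item_a = c("", "1", NA, "0"),
  item_b = c("", "", NA, "1"),
  stringsAsFactors = FALSE
)
voc_cols <- c("item_a", "item_b")

# A cell is blank if it is "" or NA; a row is blank if all of its cells are
is_blank <- is.na(toy[, voc_cols]) | toy[, voc_cols] == ""
toy$empty_voc_resp <- rowSums(is_blank) == length(voc_cols)

# Total and blank counts per collection in one pass
blank_tab <- toy %>%
  group_by(collection_id) %>%
  summarise(total_n = n(), blank_voc_n = sum(empty_voc_resp))

print(blank_tab)
```

Computing both counts in a single `summarise()` gives 0 rather than NA for collections with no blank rows, which avoids the `left_join()` and makes the "no blank subjects" collections easy to filter with `blank_voc_n == 0`.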
There are 3210 subjects in collection 2024 (Vouloumanos)…is that possible? 5 experiments are listed, Shared Data tab doesn’t even mention the CDI, and Data Expected tab only lists 100 Targeted Enrollment for the CDI (shows 0 Subjects Shared). There are 17 listed publications, but their abstracts don’t have Ns.
Also, there are 23164 “2” responses in vocabulary: are some of these WS administrations actually WG administrations? (Where presumably 1=“understands” and 2=“understands and says”?)
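One way to follow up on this question would be to check where the “2” responses come from. The sketch below counts them per collection on toy data; `toy`, `item_a`, `item_b`, and `voc_cols` are hypothetical, and in the real data the vocabulary columns would be the ones selected above.

```r
library(dplyr)

# Hypothetical toy data with two vocabulary items
toy <- data.frame(
  collection_id = c("2024", "2024", "2368"),
  item_a = c("2", "1", "1"),
  item_b = c("2", "", "0"),
  stringsAsFactors = FALSE
)
voc_cols <- c("item_a", "item_b")

# Number of "2" responses in each row (NA cells are ignored)
toy$n_twos <- rowSums(toy[, voc_cols] == "2", na.rm = TRUE)

twos_by_collection <- toy %>%
  group_by(collection_id) %>%
  summarise(
    rows_with_twos = sum(n_twos > 0),  # administrations with at least one "2"
    total_twos     = sum(n_twos)       # all "2" responses in the collection
  )

print(twos_by_collection)
```

If the “2”s cluster in a few collections or in a subset of rows, those rows would be the candidates for WG-style coding.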
Now let’s look at the WG data.
data_AS_WG <- read.delim("data/ASD CDI data/mci_words_gestures01.txt",
header = TRUE, sep = "\t", dec = ".")
# keep the first row (the column descriptions) as a lookup data frame
description_WG <- data_AS_WG[1, ]
description_WG <- as.data.frame(t(description_WG))
names(description_WG) <- "description"
# we only kept vocab
eliminated_WG <- data_AS_WG %>%
dplyr::select(c(23:58,454:520))
#using mci_words_gestures01_id as a distinctive id and The NDAR Global Unique Identifier
data_raw_AS_WG <- data_AS_WG %>%
dplyr::select(c("collection_id","dataset_id","sex","mci_words_gestures01_id","subjectkey","interview_age", 59:453, 517))
colnames(data_raw_AS_WG) <- as.character(unlist(data_raw_AS_WG[1, ])) # use the description row as column names
data_raw_AS_WG <- data_raw_AS_WG[-1, ] # then drop that row from the data
# columns whose names duplicate an earlier column name
AS_WG_duplicated <- data_raw_AS_WG[duplicated(colnames(data_raw_AS_WG))]
# making column names unique
names(data_raw_AS_WG) <- make.unique(names(data_raw_AS_WG), sep="_")
data_raw_AS_WG <- data_raw_AS_WG %>%
rename(id = "mci_words_gestures01_id",
GUID = "The NDAR Global Unique Identifier (GUID) for research subject",
age = "Age in months at the time of the interview/test/sampling/imaging.",
house = "MacArthur Words and Gestures: Vocabulary Checklist: House",
sex = "Sex of the subject") %>%
mutate(age = as.numeric(as.character(age)))
nnn_wg <- subset(data_AS_WG, mcg_vc_totcom=="999")
Loaded data from 8489 subjects. There aren’t any “999”s in mcg_vc_totcom, at least. But there are a lot of entries with no item-level data…
# sum(duplicated(data_AS_WG[,59:517])) # just the vocab columns
# 7089 -- but are all of these just empty?
data_AS_WG$empty_voc_resp = rowSums(data_AS_WG[,59:517]=="")==459
novoc <- data_AS_WG %>% filter(empty_voc_resp)
4572 subjects are entirely blank in their vocabulary columns. Below we show the number of subjects with blank vocabulary along with the total N per collection.
blank_n <- novoc %>%
group_by(collection_id) %>%
summarise(blank_voc_n = n())
wg_tab <- data_raw_AS_WG %>%
distinct(collection_id, id, GUID, age) %>%
group_by(collection_id) %>%
summarise(total_n=n()) %>%
left_join(blank_n) %>%
arrange(total_n)
## Joining, by = "collection_id"
#sum(subset(wg_tab, is.na(blank_voc_n))$total_n)
wg_tab %>%
kable()
collection_id | total_n | blank_voc_n |
---|---|---|
2169 | 3 | 3 |
2503 | 18 | 18 |
2510 | 18 | NA |
1952 | 20 | 20 |
2192 | 30 | NA |
2878 | 50 | NA |
2027 | 61 | NA |
2666 | 112 | NA |
6 | 138 | 138 |
16 | 176 | 176 |
2355 | 213 | 203 |
8 | 269 | 269 |
2368 | 290 | NA |
1885 | 305 | 305 |
2026 | 781 | 781 |
1600 | 840 | NA |
9 | 876 | 876 |
2089 | 918 | 918 |
19 | 1413 | NA |
2024 | 1957 | 865 |
Eight collections don’t have any missing item-level vocabulary data, and have a total of 2814 subjects.
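As a quick consistency check, the summary can be recomputed from the printed table. In the sketch below the data frame is retyped by hand from the table above rather than regenerated from the raw file, so it only confirms that the text and the table agree.

```r
# Values copied from the printed WG table
wg_tab <- data.frame(
  collection_id = c(2169, 2503, 2510, 1952, 2192, 2878, 2027, 2666, 6, 16,
                    2355, 8, 2368, 1885, 2026, 1600, 9, 2089, 19, 2024),
  total_n       = c(3, 18, 18, 20, 30, 50, 61, 112, 138, 176,
                    213, 269, 290, 305, 781, 840, 876, 918, 1413, 1957),
  blank_voc_n   = c(3, 18, NA, 20, NA, NA, NA, NA, 138, 176,
                    203, 269, NA, 305, 781, NA, 876, 918, NA, 865)
)

# Collections with no blank-vocabulary subjects show NA after the left join
complete <- subset(wg_tab, is.na(blank_voc_n))

n_complete_collections <- nrow(complete)                         # 8
n_complete_subjects    <- sum(complete$total_n)                  # 2814
n_blank_subjects       <- sum(wg_tab$blank_voc_n, na.rm = TRUE)  # 4572
```

The same three lines applied to `ws_tab` give the corresponding numbers for the WS data.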