R Data Pipeline 101

Purpose

After listening to Abby Hare-Harris at IMFAR 2016, with her presentation of a practical use of an “inefficiency” index to uncover developmental deviance and link it to autism, a few of us thought this might be a good, bite-size project to tackle here at CAR – why not do it collaboratively in a learning environment, and learn about markdown, R, etc., while we’re at it?

A couple of quotes from her abstract, to get us started:

Developmental deviance (DDEV) refers to the non-sequential attainment of milestones within a developmental domain. This observation is in contrast to developmental delay (DD), where milestones are reached in the typical sequence, but the timeline of attainments is delayed.

…at the itemized level, individuals with DDEV exhibit a more scattered pattern of incorrect answers. Differentiating between DDEV and DD may inform prognosis and predict long-term outcomes.

Hypothesis

We hypothesize that we will find that developmental deviance in executive function development as measured by the BRIEF will correlate with ASD diagnosis.

To learn more about the research that inspires this investigation please check out https://imfar.confex.com/imfar/2016/webprogram/Paper21281.html, the abstract for Developmental Deviance of Item-Level Responses on Standardized Language Measures Correlates with Autism Spectrum Disorder Diagnosis by A. E. Hare-Harris et al.

Miscellaneous Assistance

To get R cheat sheets, try https://www.rstudio.com/resources/cheatsheets.

To get CAR data assistance, try http://data.centerforautismresearch.org

Getting Data

To do any data work, first we have to get data!

Static Data

If you’re not connecting to our Data Warehouse, but instead have a file from Joy titled “brief.csv”, skip ahead to “Static Data Part II”! The next few sections are important, because they cover lots of important data reshaping tools, but if you don’t have Data Warehouse access, you’ll just have to learn those skills separately and move ahead to the analysis bit.

Connecting to the Data Warehouse

After making sure we have a few libraries loaded, we establish a SQL connection. This allows us to get the freshest data instead of relying on a possibly out-of-date .csv. This code is not echoed to screen, since it contains our password! Please note here that with great power comes great responsibility!

# Get some libraries.  Error saying you don't have?  Try
# install.packages('RMySQL'), etc.
library(RMySQL)
## Loading required package: DBI
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

# This just shows how to do it... important info is redacted!
warehouse_connection <- dbConnect(MySQL(), user = "******", host = "******.chop.edu", 
    password = "*****", db = "car_tabular_data_warehouse")

Get Some BRIEF Data!

There are a couple of approaches to getting data out of our data warehouse:

learn a lot of SQL and do much of your reshaping in SQL, and pull it into R fully-formed
learn a bit of SQL, do the stuff that’s easier in SQL there, then pull bits into R for further reshaping
just learn the mere basics, like what the table name is you want, and do it all in R

We’ll try the last method today, and use dplyr to do cleanup of our data a bit at a time. In dplyr, select operates on columns/variables, while filter operates on rows/observations.

Please note that we will end up with unbalanced samples, because we’re not matching on anything. We leave this as an exercise for the reader.

# We could do a lot of fancy SQL here, but who has the time?  We'll
# clean it up in R.
brief <- dbGetQuery(warehouse_connection, "SELECT * from brief_item_level_cleaned")

# select just parent reports.  The %>% symbol and 'filter', are our
# first uses of the very handy dplyr package.
brief <- brief %>% filter(brief_form_type == "P")

# Do we have the same subject multiple times?  Just use one.
brief <- brief %>% filter(!duplicated(brief))

BRIEF domains

In R we can use the concatenate operator to make a list: c('eggs', 'milk', 'bread'). We’ll do this to make a list of items belonging to each domain of the BRIEF.

# Make lists for subdomain selections.
inhibit <- c("brief_r38", "brief_r41", "brief_r43", "brief_r44", "brief_r49", 
    "brief_r54", "brief_r55", "brief_r56", "brief_r59", "brief_r65")
shift <- c("brief_r5", "brief_r6", "brief_r8", "brief_r12", "brief_r13", 
    "brief_r23", "brief_r30", "brief_r39")
emot_control <- c("brief_r1", "brief_r7", "brief_r20", "brief_r25", "brief_r26", 
    "brief_r45", "brief_r50", "brief_r62", "brief_r64", "brief_r70")
initiate <- c("brief_r3", "brief_r10", "brief_r16", "brief_r47", "brief_r48", 
    "brief_r61", "brief_r66", "brief_r71")
working_mem <- c("brief_r2", "brief_r9", "brief_r17", "brief_r19", "brief_r24", 
    "brief_r27", "brief_r32", "brief_r33", "brief_r37", "brief_r57")
plan_org <- c("brief_r11", "brief_r15", "brief_r18", "brief_r22", "brief_r28", 
    "brief_r35", "brief_r36", "brief_r40", "brief_r46", "brief_r51", "brief_r53", 
    "brief_r58")
org_materials <- c("brief_r4", "brief_r29", "brief_r67", "brief_r68", "brief_r69", 
    "brief_r72")
monitor <- c("brief_r14", "brief_r21", "brief_r31", "brief_r34", "brief_r42", 
    "brief_r52", "brief_r60", "brief_r63")

Combining data from other sources

OK, so getting the BRIEF data was easy, but what if I want dx? How can I combine two “data frames” in R?

merge is a base R command that, unless you state otherwise, will do the “inner join” (the overlap part of the Venn) of two data frames. Want more info? Type ?merge() . This works for 99% of commands, btw… try it!

Once I merge the data from dx and brief into a single data frame, I want to get rid of any excess data I don’t need any more. In R, it is very easy to have a cluttered environment with so many objects you forget what you’re doing. I use rm() to avoid that.

# Get dx.  The SQL here is a bit more complex bc we choose the fields
# we want, instead of using an asterisk to choose all.  We also rename
# a field.

dx <- dbGetQuery(warehouse_connection, "SELECT study_id,
      caseconcep_dx2012_current as dx FROM dsm_iv_case_concept_cleaned")

# remove incomplete and duplicated dx (not really a problem w/dx, but a
# good idea always)

dx <- unique(dx[which(complete.cases(dx)), ])

# get dx and brief item level together...

brief <- merge(x = dx, y = brief, by.x = "study_id", by.y = "brief_ndar")

# drop unneeded data
rm(dx)

De-identifying Data

In our case, we have a bit of PHI in this table (NDARs might appear in the study_id, and the admin date of the BRIEF is PHI), so we’ll use dplyr select to give us all the columns but those two, first. This way, we can publish our final document without showing PHI.

With dplyr select you can select “not these” by using the minus.

brief <- brief %>% select(-c(study_id, brief_admin_date))

Static Data Part II

Note for anyone who isn’t using the SQL database, because they aren’t on the CHOP network: The following commands will be useful to you. First you want to know where R’s “working directory” is. Stash your .csv there. Then run the read.csv. I have removed PHI data. This will get you to the point after all the data combining your peers just did…

getwd()
brief <- read.csv("brief.csv", stringsAsFactors = FALSE)

Peeking at Data

At this point, it might be nice to take a look at what we have. The BRIEF data frame now has a ton of data in it – from the BRIEF, from the DSM-IV Case Conceptualization project, from SRS-Parent. head() and tail() will give you a quick look at the ends of your data.

head(brief)
##                dx brief_sex brief_age brief_form_type brief_inhibit_raw
## 1 Not met for ASD         F         7               P                18
## 2 Not met for ASD         M         7               P                30
## 3    AUT (Autism)         M        10               P                28
## 4 Not met for ASD         M        10               P                18
## 5    AUT (Autism)         M        12               P                20
## 6    AUT (Autism)         M         9               P                30
##   brief_shift_raw brief_ec_raw brief_initiate_raw brief_wm_raw
## 1              13           20                 14           15
## 2              24           25                 18           30
## 3              23           26                 16           23
## 4              16           22                 17           16
## 5              18           23                 15           28
## 6              23           28                 16           23
##   brief_po_raw brief_org_of_matl_raw brief_monitor_raw brief_bri_raw
## 1           15                    12                16            51
## 2           31                    16                22            79
## 3           27                    15                18            77
## 4           20                    15                18            56
## 5           26                    17                22            61
## 6           27                    15                20            81
##   brief_mi_raw brief_gec_raw brief_negativity_raw brief_inconsistency_raw
## 1           72           123                    1                       3
## 2          117           196                    6                       0
## 3           99           176                    6                       6
## 4           86           142                    3                       7
## 5          108           169                    6                       5
## 6          101           182                    6                       2
##   brief_r1 brief_r2 brief_r3 brief_r4 brief_r5 brief_r6 brief_r7 brief_r8
## 1        2        1        2        2        2        1        1        2
## 2        2        3        3        3        3        3        2        3
## 3        3        3        3        3        3        3        3        3
## 4        2        2        2        3        3        2        1        1
## 5        3        3        2        3        1        2        3        1
## 6        3        2        2        3        3        3        3        2
##   brief_r9 brief_r10 brief_r11 brief_r12 brief_r13 brief_r14 brief_r15
## 1        2         1         1         2         1         2         1
## 2        3         3         1         3         3         3         3
## 3        2         3         2         3         3         2         2
## 4        3         2         2         2         1         3         2
## 5        3         2         1         3         3         3         3
## 6        3         2         2         3         3         2         3
##   brief_r16 brief_r17 brief_r18 brief_r19 brief_r20 brief_r21 brief_r22
## 1         2         2         1         2         2         2         1
## 2         2         3         3         3         3         3         1
## 3         2         2         1         2         1         2         2
## 4         3         1         1         2         2         2         1
## 5         2         3         3         3         1         3         1
## 6         2         2         1         3         2         1         2
##   brief_r23 brief_r24 brief_r25 brief_r26 brief_r27 brief_r28 brief_r29
## 1         2         1         2         2         2         2         2
## 2         3         3         2         2         3         3         3
## 3         3         2         2         2         3         3         2
## 4         2         1         2         2         2         2         3
## 5         3         3         2         1         3         2         3
## 6         3         2         3         2         3         3         3
##   brief_r30 brief_r31 brief_r32 brief_r33 brief_r34 brief_r35 brief_r36
## 1         1         2         1         1         2         1         2
## 2         3         2         3         3         3         3         3
## 3         2         1         2         3         3         3         3
## 4         2         2         1         1         3         3         1
## 5         3         3         2         2         2         1         3
## 6         3         3         2         3         3         3         2
##   brief_r37 brief_r38 brief_r39 brief_r40 brief_r41 brief_r42 brief_r43
## 1         2         3         2         1         2         2         1
## 2         3         3         3         3         3         3         3
## 3         2         3         3         3         3         3         3
## 4         2         3         3         1         2         3         2
## 5         3         2         2         2         3         2         2
## 6         1         3         3         2         3         3         3
##   brief_r44 brief_r45 brief_r46 brief_r47 brief_r48 brief_r49 brief_r50
## 1         2         2         1         2         1         2         2
## 2         3         3         3         3         2         3         3
## 3         3         3         2         2         1         3         3
## 4         2         2         2         2         3         1         3
## 5         1         3         2         3         1         2         3
## 6         3         3         2         2         3         3         3
##   brief_r51 brief_r52 brief_r53 brief_r54 brief_r55 brief_r56 brief_r57
## 1         1         2         2         2         1         1         1
## 2         3         3         2         3         3         3         3
## 3         2         2         3         2         3         3         2
## 4         2         1         2         1         2         2         1
## 5         2         3         3         1         3         2         3
## 6         2         2         3         3         3         3         2
##   brief_r58 brief_r59 brief_r60 brief_r61 brief_r62 brief_r63 brief_r64
## 1         1         2         2         2         3         2         2
## 2         3         3         2         3         3         3         3
## 3         1         2         2         2         3         3         3
## 4         1         2         1         1         3         3         3
## 5         3         2         3         2         1         3         3
## 6         2         3         3         2         3         3         3
##   brief_r65 brief_r66 brief_r67 brief_r68 brief_r69 brief_r70 brief_r71
## 1         2         2         2         2         2         2         2
## 2         3         1         3         3         3         2         1
## 3         3         2         2         3         3         3         1
## 4         1         1         2         2         3         2         3
## 5         2         2         3         3         3         3         1
## 6         3         2         2         1         3         3         1
##   brief_r72 brief_r73 brief_r74 brief_r75 brief_r76 brief_r77 brief_r78
## 1         2         2         2         2         2         1         2
## 2         1         3         2         2         2         3         2
## 3         2         3         2         2         2         1         3
## 4         2         2         2         2         1         1         2
## 5         2         2         3         3         3         2         3
## 6         3         3         2         2         1         1         3
##   brief_r79 brief_r80 brief_r81 brief_r82 brief_r83 brief_r84 brief_r85
## 1         2         1         2         2         2         2         1
## 2         3         3         3         3         2         2         2
## 3         3         3         3         3         2         3         3
## 4         2         3         1         1         1         3         1
## 5         3         3         3         3         3         3         3
## 6         3         3         3         3         2         3         3
##   brief_r86 brief_inhibit_t_score brief_shift_t_score brief_ec_t_score
## 1         1                    58                  56               60
## 2         3                    80                  90               71
## 3         3                    78                  88               71
## 4         2                    55                  64               62
## 5         3                    65                  74               72
## 6         2                    82                  88               76
##   brief_initiate_t_score brief_wm_t_score brief_po_t_score
## 1                     59               49               46
## 2                     68               85               78
## 3                     59               65               65
## 4                     63               49               50
## 5                     56               78               61
## 6                     59               65               65
##   brief_org_of_matl_t_score brief_monitor_t_score brief_bri_t_score
## 1                        52                    62                60
## 2                        66                    76                84
## 3                        61                    59                82
## 4                        61                    59                62
## 5                        66                    75                74
## 6                        61                    66                86
##   brief_mi_t_score brief_gec_t_score brief_inhibit_pct brief_shift_pct
## 1               53                56                84              83
## 2               78                83                99              99
## 3               65                73                98              99
## 4               57                59                75              92
## 5               69                73                91              97
## 6               66                76                99              99
##   brief_ec_pct brief_initiate_pct brief_wm_pct brief_po_pct
## 1           87                 86           54           44
## 2           96                 94           99           99
## 3           98                 84           90           91
## 4           90                 88           61           59
## 5           97                 78           98           86
## 6           98                 84           90           91
##   brief_org_of_matl_pct brief_monitor_pct brief_bri_pct brief_mi_pct
## 1                    67                91            84           66
## 2                    92                99            99           98
## 3                    83                83            99           91
## 4                    83                83            87           73
## 5                    94                99            96           94
## 6                    83                96            99           94
##   brief_gec_pct brief_inhibit_mean brief_shift_mean brief_ec_mean
## 1            75                1.8             1.63           2.0
## 2            99                3.0             3.00           2.5
## 3            97                2.8             2.88           2.6
## 4            81                1.8             2.00           2.2
## 5            97                2.0             2.25           2.3
## 6            98                3.0             2.88           2.8
##   brief_initiate_mean brief_wm_mean brief_po_mean brief_org_of_matl_mean
## 1                1.75           1.5          1.25                   2.00
## 2                2.25           3.0          2.58                   2.67
## 3                2.00           2.3          2.25                   2.50
## 4                2.13           1.6          1.67                   2.50
## 5                1.88           2.8          2.17                   2.83
## 6                2.00           2.3          2.25                   2.50
##   brief_monitor_mean brief_bri_mean brief_mi_mean brief_gec_mean
## 1               2.00           1.82          1.64           1.71
## 2               2.75           2.82          2.66           2.72
## 3               2.25           2.75          2.25           2.44
## 4               2.25           2.00          1.95           1.97
## 5               2.75           2.18          2.45           2.35
## 6               2.50           2.89          2.30           2.53
##   primary_key
## 1           2
## 2           4
## 3           5
## 4           7
## 5           8
## 6          10

What if I wanted to know more about just the dx column? What diagnoses are represented? We can easily create a frequency table:

table(brief$dx)
## 
##         ASD-NOS    AUT (Autism) Not met for ASD             TDC 
##             104             429             108             213

By the way, these are identical… you will find in R that there are many ways to do the same thing!

table(brief$dx)
table(brief["dx"])
table(brief[1])

Subsetting Data for Replacement

In our case we need to combine two categories into one. It makes sense to blend the Autism and ASD-NOS subjects together into one category. We can do this by selecting both subsets and replacing the dx value with “ASD”.

Note that when you use square bracket notation, you always think [RC]: rows first, then columns. So we want to pick rows in which the dx is “ASD-NOS”, and in those rows, pick the dx column, and in that selection of data, make each observation equal to “ASD”. Ditto with “AUT (Autism)”. Instead of combining both into one command (which is possible), let’s promote code understanding and reuse by doing the two selections and replacements separately.

brief[which(brief$dx == "ASD-NOS"), "dx"] <- "ASD"
brief[which(brief$dx == "AUT (Autism)"), "dx"] <- "ASD"

Note I could have used a mixed notation, using $ to indicate the column and [] to choose rows:

brief$dx[which(brief$dx == "ASD-NOS")] <- "ASD"
brief$dx[which(brief$dx == "AUT (Autism)")] <- "ASD"

Inefficiency

We’ll first do some difficulty assessment. The abstract we are using as inspiration states:

“Inefficiency was defined for each subtest as the product of the total number of subtest items and the sum of the weights (percentage of unaffected family members who correctly answered the item) of the items missed. …”

If you imagine a five-item domain with weights 0.90, 0.30, 0.10, 0.50, 0.20, you can visualize a subject’s results in one of two ways. First, a “typical” developmental trajectory like this:

Response	Weight	Cumulative Sum
R	0.90	0
R	0.30	0
W	0.10	0.10
R	0.50	0.10
W	0.20	0.30

We multiply 0.30 by 5 (the domain size) to get an inefficiency score of 1.5. Note that this person missed the two “hardest” items, the ones with the lowest weight (percent correct in TD population).

A different person with developmental deviance may also get two wrong but have a higher inefficiency score:

Response	Weight	Cumulative Sum
W	0.90	0.90
R	0.30	0.90
R	0.10	0.90
R	0.50	0.90
W	0.20	1.10

For a total of 5.5 inefficiency score. This subject missed some “easy” items but not some “difficult” ones (atypical development).

Applying Inefficiency to a Domain

Let’s choose the Initiate domain of the BRIEF and come up with the inefficiency score for each participant.

Aside: this is where Joy cheated and cherry picked a bit from among the subscales, so you may enjoy picking a different subscale to study. The steps are essentially identical but you might have to do a bit of detective work to track down a couple of column names.

We’ll first pull out just the initiate columns, and just the rows that correspond to TDCs.

initiate_norms <- brief %>% filter(dx == "TDC") %>% select(one_of(initiate))

We know the BRIEF is scored on a 3 point Likert. I choose in this example to use a 2 or a 3 as an endorsement (in the parlance of the author above, a “missed” item, an item that demonstrates a lack). A 1 (or a 0, if there’s a missing item) indicates a non-endorsement, aka “correctly answered”.

Aside: I cherry-picked again and used the results to drive my endorsement cutoff. Cutting off at 3 made the weights / difficulty measures cluster together too tightly for my taste. Don’t emulate my “change the methods to suit the outcome” method. Really.

Do I care about selecting on per row or per column? Nope, my entire data frame should be turned into 1s and 0s. This allows me a handy shortcut, where I can select data observations based not on row or column but using their value alone:

initiate_norms[initiate_norms <= 1] <- 1  # 'non-endorsed'
initiate_norms[initiate_norms > 1] <- 0  # 'endorsed' 
head(initiate_norms)
##   brief_r3 brief_r10 brief_r16 brief_r47 brief_r48 brief_r61 brief_r66
## 1        1         1         1         1         1         1         1
## 2        0         0         0         0         1         0         0
## 3        1         1         1         1         1         1         0
## 4        0         0         1         0         1         0         0
## 5        0         0         1         1         1         0         0
## 6        1         1         1         1         1         1         0
##   brief_r71
## 1         1
## 2         1
## 3         1
## 4         0
## 5         0
## 6         0

Great, now we have what looks like a matrix of ones and zeros.

The weight is the percentage of TDs who did not endorse a given item. So we’ll take number non-endorsed / number of TDs. A higher weight = “easier” / less-endorsed, and a lower weight = “harder”, more endorsed.

initiate_difficulty <- colSums(initiate_norms)/nrow(initiate_norms)
initiate_difficulty
##  brief_r3 brief_r10 brief_r16 brief_r47 brief_r48 brief_r61 brief_r66 
## 0.6150235 0.6384977 0.7981221 0.5539906 0.8591549 0.6666667 0.4741784 
## brief_r71 
## 0.7464789
names(initiate_difficulty) = c("weight")
# drop unneeded data
rm(initiate_norms)

To gauge understanding… here are two items from the BRIEF:

Has trouble getting started on homework or chores
Has trouble organizing activites with friends

Which is “easier” or more frequently carried out by typical controls? Starting on homework/chores, or organizing activities with friends?

Measuring inefficiency

OK, now get the initiate data from all of our BRIEF takers. We’ll take their performance and apply item-level weights to it to see how “inefficient” (out-of-order / scattered) they are.

initiate_data <- brief %>% select(one_of(initiate))

This time we flip the script and mark the endorsed ones. Why? Because if it’s endorsed, we’re going to add the weight that corresponds to that “incorrect” question to the sum of weights for this subject. In other words, its multiplicative strength is 1 – I’ll add 1 of the corresponding weight to the weight total. If the item is not endorsed, it has a 0 multiplicative strength – I’ll won’t add any corresponding weight to the weight total. Make sense?

initiate_data[initiate_data <= 1] <- 0  # 'non-endorsed'
initiate_data[initiate_data == 2] <- 1  # 'endorsed' 
initiate_data[initiate_data == 3] <- 1  # 'endorsed' 
# Count up the number endorsed for each subject (row)
initiate_endorsed <- rowSums(initiate_data)

There are many ways to handle the “if it’s 1 (endorsed), add the corresponding weight”, but the easiest may be the use of a matrix dot product. We’re taking element-wise products (1*weight or 0*weight) and summing them. This is the interior or dot product.

# Calculate the dot product of each row with the difficulty matrix.
# Then multiply by the number of endorsements to get inefficiency
# score.
initiate_inefficiency <- as.matrix(initiate_data) %*% as.matrix(initiate_difficulty) * 
    ncol(initiate_data)
# Drop unneeded data
rm(initiate_data)

Now, it makes sense to combine all the info we have into one data frame for analysis. You’ll notice that we pull in some columns directly from BRIEF. We haven’t removed or changed the order of any rows, so that’s ok – we’re lining up the same subject across all these columns. The nice thing is that we have no need to get more data at this point, so there’s not any need to include the study id. This makes it much easier for us to publish this document!

We’re going to use column bind, cbind, to combine diagnosis, age, sex, the number of endorsements, the BRIEF calculated scores from our domain, our inefficiency score, and a new thing we might be interested in, mean inefficiency per endorsement. I just tossed that one in because it’s obvious that we’ll have lots of TD’s with few endorsements and low inefficiency and ASD’s with many endorsements and high inefficiency. Perhaps finding the per-item mean might help us.

initiate_summary <- cbind.data.frame(brief$dx, brief$brief_age, brief$brief_sex, 
    initiate_endorsed, brief$brief_initiate_raw, brief$brief_initiate_t_score, 
    initiate_inefficiency, initiate_inefficiency/initiate_endorsed)
names(initiate_summary) <- c("dx", "age", "sex", "endorsed", "raw", "t", 
    "inefficiency", "ineff_div_endorsed")
# Drop unneeded data
rm(initiate_endorsed, initiate_inefficiency)

If we take a look at initiate_summary, we’ll notice that some of our ineff_dif_endorsed values are not defined, because we divided by 0:

head(initiate_summary$ineff_div_endorsed, 50)
##  [1] 5.139280 5.508607 4.995305 5.615023 4.995305 5.263581 4.926448
##  [8] 5.352113 5.204561 4.995305 5.765258 5.352113 5.211268 5.352113
## [15] 5.333333 5.333333 5.439750 5.508607 5.352113 5.546166 5.641315
## [22] 5.574782 5.708920 4.717371 5.574782 5.352113 5.204561      NaN
## [29] 5.204561 5.352113 5.352113 5.352113 5.204561 5.352113 4.995305
## [36] 3.793427 5.204561 4.926448 5.263581 5.439750 5.402191 5.025352
## [43] 5.263581 5.333333 4.882629 5.543662 5.596244 5.204561 5.352113
## [50] 5.574782

We may consider using a different measure either for inefficiency generally, or as a mean inefficiency measure that controls for number of endorsements. Toss out any suggestions, or we can just keep going…

OK, let’s pause.

I’ve been over a lot of setup, obtaining data, combining data, and reshaping it / manipulating it prior to analysis. Any questions?

Just a shoutout to the data blog, data.centerforautismresearch.org, which has a lot of resources (more coming!) and is searchable…

One last look at our data – do we have anywhere near enough power? Let’s take a look at dx vs number of endorsements:

table(initiate_summary$dx, initiate_summary$endorsed)
##                  
##                     0   1   2   3   4   5   6   7   8  10
##   ASD               1   2   5  18  47  73 113 153 119   2
##   Not met for ASD   0   0   3   4   9  15  18  28  31   0
##   TDC              47  35  33  26  17  25  20   8   2   0

Just eyeballing this, without applying statistical rigor, it’s clear that we don’t have enough subjects at the ends (endorsed = 0, 1, 2, 10), especially where TDC’s are concerned. Long-term, we would need more data. Short-term, let’s see if we can narrow our scope and say that we hypothesize a difference in inefficiency between subjects with ASD and our mixed clinical group where the number of endorsements is between 4 and 8, inclusive. After all, it’s not that tricky to discern ASD from TDC – what we really want to do is separate ASD from our mixed clinicals. Even with pared down data, our samples are a bit impoverished for non-ASD subjects.

Let’s pare our data down.

initiate_summary <- initiate_summary %>% filter(endorsed >= 4, endorsed <= 
    8)

Data Visualization

Base Plotting for quick analysis

First off, does our inefficiency even look normal enough for analysis using parametric tools like anova / t-test?

We can do a quick-and-dirty histogram to see.

hist(initiate_summary$inefficiency)

Not very normal…

You can also execute the following, if you’re interested:

library(moments)
skewness(initiate_summary$inefficiency)
kurtosis(initiate_summary$inefficiency)

What about that thing I made up, the ineff_div_endorsed, the intra-subject mean of missed question?

hist(initiate_summary$ineff_div_endorsed)

More symmetrical, but possibly too pointy to call normal.

For now, in the interest of moving forward, let’s assume that the Very Nice Sample Sizes we (don’t) have will account for the distribution problems for parametric measures. We’ll regard p-values and R^2 values with a grain of salt but use them as guides.

ggplot for better graphics

Let’s plot number endorsed on x and inefficicency on y. Color will be determined by diagnostic category. Often, we won’t use ggplot right away, but use R’s base plotting system for exploratory data viz. But why wait?

ggplot requires you to tell it what data you’re drawing from, the aesthetic input column names (we’re going to visually represent endorsements and inefficiency), and the geometric layer (points, bars, etc.) that you want, at the very least. ggplot is very complex, but makes for great visualizations! With ggplot, you can make publication-quality graphics with relative ease.

Pro tip: What’s the “gg” for? The “grammar of graphics”.

p <- ggplot(initiate_summary, aes(endorsed, inefficiency))
p + geom_point(aes(colour = dx))

Looks like some of our points might be overlaid, meaning we don’t see everything. One answer is to add a bit of noise, or jitter, to the plot. I’ll also add a title and make this a bit nicer to look at:

p <- ggplot(initiate_summary, aes(endorsed, inefficiency))
p + geom_point(position = "jitter", aes(colour = dx)) + labs(title = "Developmental Deviance in the\nInitiate Domain of the BRIEF\n(Noise Added)", 
    x = "Number of Endorsed (value = 2 or 3) Symptoms", y = "Inefficiency Score")

What if we looked at the inefficiency mean instead? This time I’ll add a bit of transparency, using alpha, as well as up the size of the points. Note that we’ll be missing some data here. There will be points left off because the mean is undefined and subjects with no endorsements won’t be included.

p <- ggplot(initiate_summary, aes(endorsed, ineff_div_endorsed))
p + geom_point(size = 3.5, alpha = 0.5, aes(colour = dx))

Sometimes a scatterplot is hard to interpret… what if we did a boxplot? I’ll add labels as well as change the theme to a custom theme that looks like your boxplot is from The Economist:

library(ggthemes)

p <- ggplot(initiate_summary, aes(factor(dx), ineff_div_endorsed))
p + geom_boxplot() + labs(title = "Per-subject Mean Inefficiency\n(Total Inefficiency / Number of Endorsements)", 
    x = "DSM-IV Classification", y = "Mean Inefficiency") + theme_economist_white()
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing

Stats

OK, so this is where you’re the expert and I’m the novice. But let’s forge ahead bravely anyhow!

Linear model

This is tricky because we have such strong interaction between inefficiency and the number of endorsed items, as well as between dx and endorsed (i.e. TD’s endorse fewer items than ASD’s).

Want to see the interaction?

interaction.plot(initiate_summary$endorsed, initiate_summary$dx, initiate_summary$inefficiency)

So how will we get to the controlled interaction of inefficiency and dx? This is where you all are stronger than I am in selecting statistical tools, so this could be where you tell me what to do!

Let’s create a model that controls for endorsements by using the ineff_div_endorsements variable. We can also gauge significance and check the coefficient of determination as well, by using summary:

initiate_model <- lm(ineff_div_endorsed ~ dx, data = initiate_summary)
summary(initiate_model)
## 
## Call:
## lm(formula = ineff_div_endorsed ~ dx, data = initiate_summary)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32761 -0.08281  0.00572  0.10231  0.79445 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.346392   0.009549 559.902   <2e-16 ***
## dxNot met for ASD -0.059701   0.023390  -2.552   0.0109 *  
## dxTDC             -0.277870   0.027032 -10.279   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2146 on 675 degrees of freedom
## Multiple R-squared:  0.1366, Adjusted R-squared:  0.1341 
## F-statistic: 53.42 on 2 and 675 DF,  p-value: < 2.2e-16

You can also plot residuals, q-q, and leverage for the model:

plot(initiate_model)

Or maybe you’d rather do something like the following, to track factor interactions:

initiate_model <- lm(inefficiency ~ dx + sex + dx * sex + dx * endorsed, 
    data = initiate_summary)
Anova(initiate_model, type = "II")

Or just prepare for a simple, one-way ANOVA:

initiate_model <- lm(inefficiency ~ dx, data = initiate_summary)

The construction of your linear model really depends on what factors you consider and whether you want to account for interactions / covariates.

Analysis of Variance

We can now use whatever model you chose (I’m working with the first option) to calculate an anova. But this first anova could be wrong. Why?

anova(initiate_model)
## Analysis of Variance Table
## 
## Response: ineff_div_endorsed
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## dx          2  4.9194 2.45969  53.419 < 2.2e-16 ***
## Residuals 675 31.0809 0.04605                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Luckily, the car package has its own Anova (notice the capital A in this one) that allows you to account for unbalanced samples:

Anova(initiate_model, type = "II")
## Anova Table (Type II tests)
## 
## Response: ineff_div_endorsed
##            Sum Sq  Df F value    Pr(>F)    
## dx         4.9194   2  53.419 < 2.2e-16 ***
## Residuals 31.0809 675                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tukey ‘Honest Significant Difference’

You can also do a Tukey post-hoc significance test on your model, but this is not very reliable for a design as unbalanced as ours. It’s a decent rule of thumb, though, to see what the pairwise significances are.

We’ll do a Tukey on inefficiency divided by endorsed, just to switch things up a bit.

TukeyHSD(aov(ineff_div_endorsed ~ dx, data = initiate_summary))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ineff_div_endorsed ~ dx, data = initiate_summary)
## 
## $dx
##                            diff        lwr          upr     p adj
## Not met for ASD-ASD -0.05970096 -0.1146408 -0.004761117 0.0293519
## TDC-ASD             -0.27787004 -0.3413641 -0.214375930 0.0000000
## TDC-Not met for ASD -0.21816908 -0.2959107 -0.140427465 0.0000000

T-Test

So, to sum up, there appears to be some large statistical significance to inefficiency between ASD and TDC and between Not met for ASD (some sort of mixed clinical) and TDC. There is some difference, likely not statistically significant, between ASD and Not met for ASD, but we should do a T test anyhow. Let’s do a very simple T-test that uses the ineff_div_endorsed variable to control for the various interactions. First, ASD vs TDC – should be pretty cut and dried.

t.test(initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "ASD")], initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "TDC")])
## 
##  Welch Two Sample t-test
## 
## data:  initiate_summary$ineff_div_endorsed[which(initiate_summary$dx ==  and initiate_summary$ineff_div_endorsed[which(initiate_summary$dx ==     "ASD")] and     "TDC")]
## t = 9.017, df = 87.06, p-value = 4.123e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2166198 0.3391203
## sample estimates:
## mean of x mean of y 
##  5.346392  5.068522

Next, let’s take a look at TDC vs. mixed clinical:

t.test(initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "TDC")], initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "Not met for ASD")])
## 
##  Welch Two Sample t-test
## 
## data:  initiate_summary$ineff_div_endorsed[which(initiate_summary$dx ==  and initiate_summary$ineff_div_endorsed[which(initiate_summary$dx ==     "TDC")] and     "Not met for ASD")]
## t = -6.3976, df = 119.91, p-value = 3.18e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2856884 -0.1506498
## sample estimates:
## mean of x mean of y 
##  5.068522  5.286691

Finally, ASD vs mixed clinical (the really interesting bit!)

t.test(initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "ASD")], initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "Not met for ASD")])
## 
##  Welch Two Sample t-test
## 
## data:  initiate_summary$ineff_div_endorsed[which(initiate_summary$dx ==  and initiate_summary$ineff_div_endorsed[which(initiate_summary$dx ==     "ASD")] and     "Not met for ASD")]
## t = 2.9896, df = 166.77, p-value = 0.003217
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02027470 0.09912722
## sample estimates:
## mean of x mean of y 
##  5.346392  5.286691

Do we have a finding? How else should we work on this? Work with non-parametric stats tools? Move into effect size? Find Cohen’s D?

wilcox.test(initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "ASD")], initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "TDC")])
wilcox.test(initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "TDC")], initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "Not met for ASD")])
wilcox.test(initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "ASD")], initiate_summary$ineff_div_endorsed[which(initiate_summary$dx == 
    "Not met for ASD")])

library(lsr)
effect_size_DF <- initiate_summary[which(initiate_summary$dx != "Not met for ASD"), 
    c("ineff_div_endorsed", "dx")]
effect_size_DF$dx <- droplevels(effect_size_DF$dx)
cohensD(ineff_div_endorsed ~ dx, data = effect_size_DF)

More Data Viz!

Well, this Initiate domain certainly looks interesting, even if we don’t have a large sample size. Let’s take a middling sample, say, the subjects with six endorsements, and see how our diagnostic groups differ on their inefficiency score. We’re controlling for the number of endorsements, so the shape of a histogram or density graph should be unaffected by that particular variable. We’ll again use ggplot, first to plot the overall distribution, then the by-dx-distribution:

Density Plots

ggplot(initiate_summary[which(initiate_summary$endorsed == 6), ], aes(inefficiency)) + 
    geom_density() + labs(title = "Distribution of Inefficiency Scores\nFor Subjects With 6 Endorsements", 
    x = "Inefficiency Score", y = "Density")


ggplot(initiate_summary[which(initiate_summary$endorsed == 6), ], aes(inefficiency, 
    color = dx)) + geom_density() + labs(title = "Distribution of Inefficiency Scores\nFor Subjects With 6 Endorsements", 
    x = "Inefficiency Score", y = "Density")

Hmmm… it really does seem that even if they have the same number of endorsements, the samples do differ in inefficiency / developmental trajectory. Could this be misleading? Is this true for every endorsement number, or just for six endorsements?

Facets

ggplot(initiate_summary, aes(inefficiency, color = dx)) + geom_density() + 
    facet_wrap(~endorsed)

That spike in 8 is annoying. We’ll correct it by allowing for free reassignment of scale, instead of preserving the same scale for each graph. Let’s also change some visual elements of the graph.

ggplot(initiate_summary, aes(inefficiency, color = dx)) + geom_density() + 
    facet_wrap(~endorsed, scales = "free") + labs(title = "Inefficiency Score Distribution\nby Number of Endorsements", 
    x = "Inefficiency Score", y = "Density") + theme(text = element_text(colour = "navy"), 
    panel.background = element_rect(fill = "white", colour = "grey"), panel.grid.major = element_line(colour = "lightgrey", 
        size = 0.25))

Nope, it definitely looks like something’s up. But we haven’t separated males from females, or kiddos by age. Let’s do a grid of facets where we look at the inefficiency distribution for everyone with 6 endorsements, but broken up by sex as columns and age as rows:

When Facets Go Bad…

ggplot(initiate_summary[which(initiate_summary$endorsed == 6), ], aes(inefficiency, 
    color = dx)) + geom_density() + facet_grid(age ~ sex)

Wow, that didn’t really work. Too many ages, and too many categories where there isn’t anyone! We’re running up against the limits of our sample.

By the way, there are several ways to find out whether you have a wonky sample (like a subset ends up really small), before getting to t-tests and the like.

You can get the structure of your data using str, or you can use the environment pane in your RStudio IDE, or get the dimensions of your data with dim, or just ask for column and row numbers. You can also do n-dimensional frequency tables or get statistical thumbnails:

str(initiate_summary)
## 'data.frame':    678 obs. of  8 variables:
##  $ dx                : Factor w/ 3 levels "ASD","Not met for ASD",..: 2 2 1 2 1 1 1 1 1 1 ...
##  $ age               : int  7 7 10 10 12 9 9 6 8 12 ...
##  $ sex               : Factor w/ 2 levels "F","M": 1 2 2 2 2 2 2 2 2 1 ...
##  $ endorsed          : num  6 6 6 6 6 7 6 8 7 6 ...
##  $ raw               : int  14 18 16 17 15 16 15 23 19 18 ...
##  $ t                 : int  59 68 59 63 56 59 56 84 69 69 ...
##  $ inefficiency      : num  30.8 33.1 30 33.7 30 ...
##  $ ineff_div_endorsed: num  5.14 5.51 5 5.62 5 ...
dim(initiate_summary)
## [1] 678   8
ncol(initiate_summary)
## [1] 8
nrow(initiate_summary)
## [1] 678
table(initiate_summary$age, initiate_summary$sex)
##     
##       F  M
##   6   4 67
##   7   6 72
##   8  15 76
##   9   9 77
##   10 17 59
##   11  8 49
##   12 12 51
##   13  2 49
##   14  3 38
##   15  0 18
##   16  3 22
##   17  2 17
##   18  0  2
summary(initiate_summary)
##                dx           age        sex        endorsed    
##  ASD            :505   Min.   : 6.00   F: 81   Min.   :4.000  
##  Not met for ASD:101   1st Qu.: 8.00   M:597   1st Qu.:5.000  
##  TDC            : 72   Median :10.00           Median :7.000  
##                        Mean   :10.15           Mean   :6.345  
##                        3rd Qu.:12.00           3rd Qu.:7.000  
##                        Max.   :18.00           Max.   :8.000  
##       raw              t          inefficiency   ineff_div_endorsed
##  Min.   : 9.00   Min.   :40.00   Min.   :18.25   Min.   :4.019     
##  1st Qu.:15.00   1st Qu.:56.00   1st Qu.:28.15   1st Qu.:5.205     
##  Median :17.00   Median :63.00   Median :35.94   Median :5.352     
##  Mean   :17.13   Mean   :64.08   Mean   :33.74   Mean   :5.308     
##  3rd Qu.:19.00   3rd Qu.:71.00   3rd Qu.:39.02   3rd Qu.:5.440     
##  Max.   :24.00   Max.   :91.00   Max.   :45.97   Max.   :6.141