Purpose of this Markdown Document

This is a markdown document for importing, tidying, and recoding the Pilot Data from the MMIEL Summer School.

Insofar as it is possible, the data tidying RMD that we produce for the summer school on the full data should follow this format - for that, we may want to mark some of the R blocks as include = FALSE, because there is some behind-the-scences recoding of columns etc. that won’t be super relevant to the participants - or at least to those who aren’t reasonably advanced in using R to manipulate data.

We will start, as always, by importing the tidyverse, and then importing our data frame

library(tidyverse)
library(plyr)
pilotdata <- read.csv("F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/pilotData.csv")

# First we need a chunk of code that redoes the condition column, because "Condition" current includes the participant number for each condition - SS participants will not need to see this, so in the final version use include = FALSE

pilotdata <- separate(data = pilotdata, col = condition,
                      into = c('Focal1', 'Focal2', "Focal3", "ParticipantNum"), 
                      sep = "-", remove = FALSE)

pilotdata$condition <- paste(pilotdata$Focal1, pilotdata$Focal2, pilotdata$Focal3, sep='-')

Sanity Checks

We can then do some pretty basic sanity checks on the data

length(unique(pilotdata$subject))  #1- This tells us how many experimental participants we have

nrow(pilotdata) / length(unique(pilotdata$subject))  #2- This tells us how many trials there are per                                                               participant

unique(pilotdata$condition)                         #3- Lists all of the "conditions"

length(unique(pilotdata$condition))                 #4- Tells us how many conditions there are

table(pilotdata$condition)/96                       #5- Tells us how many participants are in each                                                            condition

unique(pilotdata$choice)                            #6- This checks that we only have responses of 0 or 1

##ADD OTHER SANITY CHECKS AS NEEDED

So, those sanity checks tell us what we want to see- there are 61 participants, each of whom completed 96 trials in 6 Conditions.

The 6 conditions were Noise-Shape-Speed, Pitch-Size-Color, Brightness-Amp-Affect, Noise-Brightness-Color, Pitch-Shape-Affect, Amp-Size-Speed and there are between 10 and 11 participants per condition.

Finally, we have recorded only legal responses from our participants, both 1, which means that participants chose that the left inducer goes with the top concurrent (and thus the right inducer with the bottom concurrent) and 0, which means the opposite.

Manipulating the Dataframe

There are now some columns that we need to create in the dataframe that will make the comparisons we are interested in doing statistically possible

First, we need to take the InducerL, InducerR, ConcurrentL, and ConcurrentR columns and break them apart into their component bits- Their Domain (e.g. Pitch), their token set from that domain (e.g. Hum), and their specific token (e.g. “high”)

Second, we code a column that tells us which inducer token and concurrent token corresponds to a response (choice) of 0

#1- Separate columns into their components
pilotdata <- separate(data = pilotdata, col = InducerL, 
                      into = c('IndDomainL', 'IndSetL', "IndTokenL"), 
                      sep = "-", remove = FALSE)
pilotdata <- separate(data = pilotdata, col = InducerR,
                      into = c('IndDomainR', 'IndSetR', "IndTokenR"),
                      sep = "-", remove = FALSE)
pilotdata <- separate(data = pilotdata, col = ConcurrentL,
                      into = c('ConDomainL', 'ConSetL', "ConTokenL"),
                      sep = "-", remove = FALSE)
pilotdata <- separate(data = pilotdata, col = ConcurrentR,
                      into = c('ConDomainR', 'ConSetR', "ConTokenR"),
                      sep = "-", remove = FALSE)

#2- Codes a column that says which token (Low or High) the Left Inducer and Top Concurrent are, as a single value (e.g. "H H")
pilotdata$LeftPair <- paste(pilotdata$IndTokenL, pilotdata$ConTokenL)

#3- Codes a new response column where if the participant has made the choice '0' (Left Inducer Matches with Top Concurrent)
pilotdata$Resp <- ifelse(pilotdata$choice == 0, 
                         paste(pilotdata$IndTokenL, pilotdata$ConTokenL), 
                         paste(pilotdata$IndTokenL, pilotdata$ConTokenR))

So, that’s our data clean, to a first approximation- we’ve added some columns we need, and recoded some other columns so that they are more informative.

Now what we need to do is code “correctness” for each trial- something that we can sensibly put into a our linear mixed effects models and other statistical techniques.

In the course of our analysis, we are going to “correctness” in at least 3 ways

Magnitude Bias

The first way that we are going to code correctness is just via Magnitude Bias- this is the assumption that everything we have coded as “high” (i.e. Loud, Noisy, High Pitched, Jagged, Large, Bright, Fast, Happy, Stressed, Excited, and Pleased) should be associated.

In this case, if a participant responded that High goes with High and Low goes with Low they are “correct”

pilotdata$MagRespCorr <- as.numeric(mapvalues(pilotdata$Resp,
                                        from= c("H H", "L L", "H L", "L H"),
                                        to= c(1, 1, 0, 0)))

Predicted Data Response Correctness

The data so far has correctness coded based only on Magnitue Symbolism - where we assume that a “correct” response is one where “High” goes with “High”

This is the most basic explanation of the data that we are interested in, but it isn’t what we think is likely to be the best accounting of the data, for a number of reasons.

Other predictions from the literature

First, there are some pretty clear places where there are already established findings in crossmodal research that are not in the direction of magnitude symbolism- for example people associate small objects (Size = L) with bright colors (brightness = H) and fast moving objects (Speed = H), both of which would be marked as “incorrect” based on magnitude symbolism, so we need to consider those things in our best-specified set of predictions.

Affect

Second, two of our dimensions don’t very neatly partial out into “High” vs. “Low”.

In the case of Affect, we have the pairs Happy vs. Sad, Excited vs. Bored, Stressed vs. Calm, and Pleased vs. Disgusted

We can generally think of one in the pair as “higher” than the others, but this “highness” varies as whether they are high for valence or high for affect - Happy is high affect and valence, whereas “Excited” is High Valence but closer to neutral affect.

So for Affect we want to make some predictions on a by-token set basis - e.g. whether Brightness is associated with Happy vs. Sad, Excited vs. Bored, etc.

Color

Color is even more “problematic” - the color token pairs are Red vs. Blue (triangles), Red vs. Green (diamonds), Red vs. Yellow (circles), and Yellow vs. Blue (squares)

We can’t really consider any of these “higher” or “lower” than each other- the way I have them set up has an implied heirarchy that Red>Yellow=Green>Blue

Realistically though, color data should probably be broken down by hue and chroma- this is a learning experience that color is a really tough dimension to deal with in this sort of experiment - That type of analysis is likely too much for the summer school.

To that end, we will take a look at the data both including and excluding color - technically, it feels like it should be a pretty messy dimension

First, we want to consider all affect associations independently (because we want to make predictions about specific affect tokens) To this end, rather than treating “Affect” as an overarching domain, we consider all of them independently

Second, we do the same for color

Third, we enummerate all of the possible comparisons in the data frame

Fourth, we remove all comparisons that don’t actually exist (mostly, domains being compared to themselves)

#1- Code "Affect" and "Color" as Tokens- once for each relevant column
pilotdata$IndDomainL2 <- ifelse(pilotdata$IndDomainL == "Affect"|pilotdata$IndDomainL == "Color",                              paste(pilotdata$IndDomainL, pilotdata$IndSetL, sep = " "), 
                       pilotdata$IndDomainL)
pilotdata$IndDomainR2 <- ifelse(pilotdata$IndDomainR == "Affect"|pilotdata$IndDomainR == "Color",                              paste(pilotdata$IndDomainR, pilotdata$IndSetR, sep = " "), 
                       pilotdata$IndDomainR)
pilotdata$ConDomainL2 <- ifelse(pilotdata$ConDomainL == "Affect"|pilotdata$ConDomainL == "Color",                              paste(pilotdata$ConDomainL, pilotdata$ConSetL, sep = " "), 
                       pilotdata$ConDomainL)
pilotdata$ConDomainR2 <- ifelse(pilotdata$ConDomainR == "Affect"|pilotdata$ConDomainR == "Color",                              paste(pilotdata$ConDomainR, pilotdata$ConSetR, sep = " "),   
                       pilotdata$ConDomainR)                                       

#3- Enumerate all possible combinations
Inducers <- unique(pilotdata$IndDomainL2)       #All possible inducer token sets   
Concurrents <- unique(pilotdata$ConDomainL2)    #All possible concurrent token sets

Combinations <- expand.grid(Inducer = Inducers, Concurrent = Concurrents) # Gives all combinations


Combinations <- separate(data=Combinations, col= Inducer,     #split columns back up for subsetting
                         into= c("IndType", "IndToken"), sep = " ", remove = FALSE)
Combinations <- separate(data=Combinations, col= Concurrent, 
                         into= c("ConType", "ConToken"), sep = " ", remove = FALSE)

#4- Removing impossible combinations
Combinations <- subset(Combinations, IndType != ConType)

#5- Make a data frame that just has the various combinations of inducers and concurrent, agnostic to which is inducer or concurrent (i.e. Pitch-Noise is the same as Noise-Pitch)
Predictions <- as.data.frame(paste(Combinations$Inducer, Combinations$Concurrent, sep = "-"))
colnames(Predictions)[1] <- "Comparison"
Predictions <- separate(data=Predictions, col= Comparison, 
                         into= c("Ind", "Conc"), sep = "-", remove = FALSE)
Predictions$BA <- paste(Predictions$Conc, Predictions$Ind, sep = "-")
Predictions <- arrange(Predictions, Comparison)

#6- Makes an empty vector called pairs (we will use this in the next chunk)
pairs = c()

So, from this we have a list of 186 total comparisons that we need to make predictions for, but these are currently bi-directional- that is, we are asked to make predictions both for the association between two domains when one is the inducer and one is the concurrent, and vice-versa (e.g. Noise to Pitch and Pitch to Noise)

We generally want to operate under the assumption that these relationships are symmetrical (although we’ll test whether that is true statistically later)- so we’re going to make only half of the predictions (i.e. assume that Noise to Pitch and Pitch to Noise have the same associations)

To that end we define a function that goes through our list of combinations and takes half of it.

We also run another chunk of code that loops through the Predictions data frame and deletes the same rows

DOING IT THIS WAY IS CLUNKY, BUT I CAN’T FIGURE OUT HOW TO GET JUSTIN’S CODE TO TRIM THE PREDICTIONS DATAFRAME, RATHER THAN JUST OUTPUTTING PAIRS- FORTUNATELY I USE HIS LIST PAIRS LATER IN THE DF

#1- Function for halving the list of combinations to remove mirrored duplicates
pickOne <- function(col1,col2){
  if(col1 %in% pairs | col2 %in% pairs){
    return(ifelse(col1 %in% pairs, toString(col1), toString(col2)))
  } else {
    pairs <<- append(pairs, toString(col1))
    return(col1)
  }
}

#2- Calling the function
Predictions %>% rowwise %>% mutate(use = pickOne(Comparison, BA))

## Warning in if (col1 %in% pairs | col2 %in% pairs) {: the condition has
## length > 1 and only the first element will be used

## Source: local data frame [186 x 5]
## Groups: <by row>
## 
## # A tibble: 186 x 5
##              Comparison       Ind       Conc                   BA
##  *               <fctr>     <chr>      <chr>                <chr>
##  1        Affect EB-Amp Affect EB        Amp        Amp-Affect EB
##  2 Affect EB-Brightness Affect EB Brightness Brightness-Affect EB
##  3   Affect EB-Color RB Affect EB   Color RB   Color RB-Affect EB
##  4   Affect EB-Color RG Affect EB   Color RG   Color RG-Affect EB
##  5   Affect EB-Color RY Affect EB   Color RY   Color RY-Affect EB
##  6   Affect EB-Color YB Affect EB   Color YB   Color YB-Affect EB
##  7      Affect EB-Noise Affect EB      Noise      Noise-Affect EB
##  8      Affect EB-Pitch Affect EB      Pitch      Pitch-Affect EB
##  9      Affect EB-Shape Affect EB      Shape      Shape-Affect EB
## 10       Affect EB-Size Affect EB       Size       Size-Affect EB
## # ... with 176 more rows, and 1 more variables: use <fctr>

#3- Code to remove mirrored duplicates from dataframe
delRows = NULL # the rows to be removed
for(i in 1:nrow(Predictions)){
  j = which(Predictions$Ind == Predictions$Conc[i] & Predictions$Conc == Predictions$Ind[i])
  j = j [j > i]
  if (length(j) > 0){
    delRows = c(delRows, j)
  }
}
Predictions <- Predictions[-delRows,]

rm(i, j, delRows,pickOne) #Removes some bits we no longer need  
Predictions$Comparison <- sort(as.vector(Predictions$Comparison))
as.vector(Predictions$Comparison)

##  [1] "Affect EB-Amp"        "Affect EB-Brightness" "Affect EB-Color RB"  
##  [4] "Affect EB-Color RG"   "Affect EB-Color RY"   "Affect EB-Color YB"  
##  [7] "Affect EB-Noise"      "Affect EB-Pitch"      "Affect EB-Shape"     
## [10] "Affect EB-Size"       "Affect EB-Speed"      "Affect HS-Amp"       
## [13] "Affect HS-Brightness" "Affect HS-Color RB"   "Affect HS-Color RG"  
## [16] "Affect HS-Color RY"   "Affect HS-Color YB"   "Affect HS-Noise"     
## [19] "Affect HS-Pitch"      "Affect HS-Shape"      "Affect HS-Size"      
## [22] "Affect HS-Speed"      "Affect PD-Amp"        "Affect PD-Brightness"
## [25] "Affect PD-Color RB"   "Affect PD-Color RG"   "Affect PD-Color RY"  
## [28] "Affect PD-Color YB"   "Affect PD-Noise"      "Affect PD-Pitch"     
## [31] "Affect PD-Shape"      "Affect PD-Size"       "Affect PD-Speed"     
## [34] "Affect SC-Amp"        "Affect SC-Brightness" "Affect SC-Color RB"  
## [37] "Affect SC-Color RG"   "Affect SC-Color RY"   "Affect SC-Color YB"  
## [40] "Affect SC-Noise"      "Affect SC-Pitch"      "Affect SC-Shape"     
## [43] "Affect SC-Size"       "Affect SC-Speed"      "Amp-Brightness"      
## [46] "Amp-Color RB"         "Amp-Color RG"         "Amp-Color RY"        
## [49] "Amp-Color YB"         "Amp-Noise"            "Amp-Pitch"           
## [52] "Amp-Shape"            "Amp-Size"             "Amp-Speed"           
## [55] "Brightness-Color RB"  "Brightness-Color RG"  "Brightness-Color RY" 
## [58] "Brightness-Color YB"  "Brightness-Noise"     "Brightness-Pitch"    
## [61] "Brightness-Shape"     "Brightness-Size"      "Brightness-Speed"    
## [64] "Color RB-Noise"       "Color RB-Pitch"       "Color RB-Shape"      
## [67] "Color RB-Size"        "Color RB-Speed"       "Color RG-Noise"      
## [70] "Color RG-Pitch"       "Color RG-Shape"       "Color RG-Size"       
## [73] "Color RG-Speed"       "Color RY-Noise"       "Color RY-Pitch"      
## [76] "Color RY-Shape"       "Color RY-Size"        "Color RY-Speed"      
## [79] "Color YB-Noise"       "Color YB-Pitch"       "Color YB-Shape"      
## [82] "Color YB-Size"        "Color YB-Speed"       "Noise-Pitch"         
## [85] "Noise-Shape"          "Noise-Size"           "Noise-Speed"         
## [88] "Pitch-Shape"          "Pitch-Size"           "Pitch-Speed"         
## [91] "Shape-Size"           "Shape-Speed"          "Size-Speed"

Predictions <- (subset(Predictions, select = "Comparison"))
Predictions$Comparison <- factor(Predictions$Comparison)

Predictions$Prediction <- c(1, 1, 1, 0,   #1-4
                            1, 1, 1, 1,   #5-8
                            1, 0, 1, 1,   #9-12
                            1, 1, 0, 0,   #13-16
                            1, 0, 1, 1,   #17-20
                            0, 1, 0, 1,   #21-24
                            0, 0, 0, 1,   #25-28
                            0, 1, 1, 0,   #29-32
                            1, 1, 1, 1,   #33-36
                            1, 1, 1, 1,   #37-40
                            1, 1, 0, 1,   #41-44
                            1, 1, 1, 1,   #45-48
                            1, 1, 0, 1,   #49-52
                            1, 1, 1, 0,   #53-56
                            0, 1, 1, 1,   #57-60
                            1, 0, 1, 1,   #61-64
                            1, 1, 0, 1,   #65-68
                            1, 0, 1, 0,   #69-72
                            1, 1, 0, 1,   #73-76
                            1, 1, 1, 1,   #77-80
                            1, 0, 1, 0,   #81-84
                            1, 0, 1, 0,   #85-88
                            0, 1, 0, 1,   #89-92
                            0)            #93

Predictions$MagSym <- 1
Predictions$Same1 <- ifelse(Predictions$MagSym == Predictions$Prediction, 1, 0)

# Output the prediction file for Bill to use in Python
# write.csv(Predictions, "F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/predictions.csv")

pilotdata$IndCon <- paste(pilotdata$IndDomainL2, pilotdata$ConDomainL2, sep = '-')
pilotdata$ConInd <- paste(pilotdata$ConDomainL2, pilotdata$IndDomainL2, sep = '-')

pilotdata$Comparison <- ifelse(pilotdata$IndCon %in% Predictions$Comparison,
                               pilotdata$IndCon,
                               pilotdata$ConInd)

NCComparisons <- sort(factor(unique(pilotdata$Comparison)))
PredComparisons <- sort(factor(unique(Predictions$Comparison)))

setdiff(PredComparisons, NCComparisons)  # Test whether we have the same mirrored associations in both                                              data frames (for mapping values)

## character(0)

pilotdata$Prediction <- mapvalues(pilotdata$Comparison,
                                from = Predictions$Comparison,
                                to = Predictions$Prediction)

pilotdata$Resp2 <- ifelse(pilotdata$Resp == "H H" |pilotdata$Resp == "L L", 1, 0)
pilotdata$RespCorrPred <- ifelse(pilotdata$Resp2 == pilotdata$Prediction, 1, 0)

There will be more here, including coding a third column for “Correctness” based on predictions imputed from the Affect data, but for now this is what we’ve got.

So in the short term, we export the CSV of data for Bill to play with in Python, and for others to start thinking about looking at with statistics

CleanDataTemp <- subset(pilotdata, select = -c(X, ParticipantNum, IndDomainL, IndSetL, IndTokenL, IndDomainR, IndSetR, IndTokenR, ConDomainL, ConSetL, ConTokenL, ConDomainR, ConSetR, ConTokenR, IndDomainR2, IndCon, ConInd))

#write.csv(CleanDataTemp, "F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/CleanDataTemp.csv")

Pilot Data Cleaning and Manipulation

Alan Nielsen

September 12, 2017