Purpose of this Markdown Document

This is a markdown document for importing, tidying, and recoding the data from the MMIEL Summer School.

Insofar as it is possible, the data-tidying Rmd that we produce for the summer school on the full data should follow this format. For that version, we may want to mark some of the R blocks as include = FALSE, because there is some behind-the-scenes recoding of columns that won’t be relevant to the participants, or at least to those who aren’t reasonably advanced in using R to manipulate data.

First, we load the libraries we need; second, we read in our data; third, we label our data.

#1 Loading Libraries
library(plyr)       # load plyr before the tidyverse so that it does not mask dplyr functions
library(tidyverse)

#2 Reading in the Data
pilotdata <- read.csv("F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/pilotData.csv")
affectdata <- read.csv("F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/affectData.csv")
# linguisticdata <- read.csv("F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/linguisticData.csv")
# fulldata <- read.csv("F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/fullData.csv")
simdata <- read.csv("F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/SimulatedData_SubjectSpecificMean.csv")

#3 Labelling the Data
pilotdata$DataSet <- "Pilot"
affectdata$DataSet <- "Affect"
#linguisticdata$DataSet <- "Linguistic"
#fulldata$DataSet <- "Full"
simdata$DataSet <- "Simulated"

Next, we need to redo the “condition” column. It currently contains the focal dimensions and a participant number, which won’t do for analysis, so we split it into its parts.

The Affect-central version of the experiment, on the other hand, has only a single focal domain, so we fill its “Focal2” and “Focal3” columns with empty strings.

pilotdata <- separate(data = pilotdata, col = condition,
                      into = c('Focal1', 'Focal2', "Focal3", "ParticipantNum"), 
                      sep = "-", remove = FALSE)
pilotdata$condition <- paste(pilotdata$Focal1, pilotdata$Focal2, pilotdata$Focal3, sep='-')

affectdata <- separate(data = affectdata, col = condition,
                      into = c('Focal1', "ParticipantNum"), 
                      sep = "-", remove = FALSE)
affectdata$condition <- affectdata$Focal1
affectdata$Focal2 <- ''
affectdata$Focal3 <- ''
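
To make the split concrete, here is a minimal base-R sketch of the same idea on two made-up rows. The condition strings are invented for illustration (we only assume the "Focal1-Focal2-Focal3-ParticipantNum" layout implied by the code above), and strsplit stands in for separate so the example runs without any packages.

```r
# Toy rows only: these condition strings are invented for illustration.
toy <- data.frame(condition = c("Noise-Shape-Speed-12", "Pitch-Size-Color-3"),
                  stringsAsFactors = FALSE)

# Split each string on "-" and bind the pieces into a character matrix
parts <- do.call(rbind, strsplit(toy$condition, "-", fixed = TRUE))
colnames(parts) <- c("Focal1", "Focal2", "Focal3", "ParticipantNum")
toy <- cbind(toy, as.data.frame(parts, stringsAsFactors = FALSE))

# Rebuild the condition label without the participant number
toy$condition <- paste(toy$Focal1, toy$Focal2, toy$Focal3, sep = "-")
print(toy)
```

After this, condition holds only the three focal dimensions, and the participant number lives in its own column.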

Now we can stick the data frames together into a single data frame

SSData <- rbind(pilotdata, affectdata)
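
rbind only works here if both data frames have the same set of column names; it matches columns by name, not by position. A small sketch on invented one-row data frames (pilot_toy and affect_toy are hypothetical stand-ins) shows a check we could run before combining:

```r
# Hypothetical one-row stand-ins for the two data sets; note the different column order.
pilot_toy  <- data.frame(subject = 1, Focal1 = "Noise", Focal2 = "Shape",
                         stringsAsFactors = FALSE)
affect_toy <- data.frame(Focal2 = "", subject = 2, Focal1 = "Affect",
                         stringsAsFactors = FALSE)

# Fail early if the column names differ between the two frames
stopifnot(setequal(names(pilot_toy), names(affect_toy)))

# rbind lines columns up by name and keeps the column order of the first frame
combined <- rbind(pilot_toy, affect_toy)
print(combined)
```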

We can then do some basic sanity checks on the data. We do these in an R code chunk but hide the results, so that we can show the participants how to call R directly in text.

length(unique(pilotdata$subject))                    #1- How many experimental participants we have

nrow(pilotdata) / length(unique(pilotdata$subject))  #2- How many trials there are per participant

unique(pilotdata$condition)                          #3- Lists all of the "conditions"

length(unique(pilotdata$condition))                  #4- How many conditions there are

table(pilotdata$condition) / 96                      #5- How many participants are in each condition (96 trials each)

unique(pilotdata$choice)                             #6- Checks that we only have responses of 0 or 1

##ADD OTHER SANITY CHECKS AS NEEDED
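
Some of these checks could also be written as assertions that stop the script when something is off, rather than values we have to read by eye. The sketch below is one way to do that on an invented toy data set; the column names follow the ones used above, but the rows and the toy object are made up.

```r
# Invented toy data: three participants, four trials each, two conditions.
toy <- data.frame(subject   = rep(c("s1", "s2", "s3"), each = 4),
                  condition = rep(c("A", "A", "B"), each = 4),
                  choice    = c(0, 1, 1, 0,  1, 1, 0, 0,  0, 0, 1, 1),
                  stringsAsFactors = FALSE)

# Every participant should have completed the same number of trials
trials_per_subject <- table(toy$subject)
stopifnot(length(unique(as.vector(trials_per_subject))) == 1)

# Participants per condition, counted from distinct subject/condition pairs
# instead of dividing the row count by a hard-coded number of trials
subj_cond <- unique(toy[, c("subject", "condition")])
n_per_condition <- table(subj_cond$condition)

# Only legal responses
stopifnot(all(toy$choice %in% c(0, 1)))
print(n_per_condition)
```

Counting distinct subject/condition pairs avoids having to know the trial count (96 or 64) in advance.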

Pilot Data Report

So, those sanity checks tell us what we want to see: there are 61 participants, each of whom completed 96 trials, spread across 6 conditions.

The 6 conditions were Noise-Shape-Speed, Pitch-Size-Color, Brightness-Amp-Affect, Noise-Brightness-Color, Pitch-Shape-Affect, and Amp-Size-Speed, and there are between 10 and 11 participants per condition.

Finally, we have recorded only legal responses from our participants: 0, which means that the participant chose that the left inducer goes with the top concurrent (and thus the right inducer with the bottom concurrent), and 1, which means the opposite.

length(unique(affectdata$subject))                     #1- How many experimental participants we have

nrow(affectdata) / length(unique(affectdata$subject))  #2- How many trials there are per participant

unique(affectdata$condition)                           #3- Lists all of the "conditions"

length(unique(affectdata$condition))                   #4- How many conditions there are

table(affectdata$condition) / 64                       #5- How many participants are in each condition (64 trials each)

unique(affectdata$choice)                              #6- Checks that we only have responses of 0 or 1

##ADD OTHER SANITY CHECKS AS NEEDED

Affect Data Report

So, those sanity checks tell us what we want to see: there are 55 participants, each of whom completed 64 trials in a single condition.

Finally, we have recorded only legal responses from our participants: 0, which means that the participant chose that the left inducer goes with the top concurrent (and thus the right inducer with the bottom concurrent), and 1, which means the opposite.

Manipulating the Dataframe

There are now some columns that we need to create in the dataframe to make the comparisons we are interested in statistically possible.

First, we need to take the InducerL, InducerR, ConcurrentL, and ConcurrentR columns and break them apart into their components: their domain (e.g. Pitch), their token set from that domain (e.g. Hum), and their specific token (e.g. “high”).

Second, we code a column that tells us which inducer token and concurrent token correspond to a response (choice) of 0.

#1- Separate columns into their components
SSData <- separate(data = SSData, col = InducerL, 
                      into = c('IndDomainL', 'IndSetL', "IndTokenL"), 
                      sep = "-", remove = FALSE)
SSData <- separate(data = SSData, col = InducerR,
                      into = c('IndDomainR', 'IndSetR', "IndTokenR"),
                      sep = "-", remove = FALSE)
SSData <- separate(data = SSData, col = ConcurrentL,
                      into = c('ConDomainL', 'ConSetL', "ConTokenL"),
                      sep = "-", remove = FALSE)
SSData <- separate(data = SSData, col = ConcurrentR,
                      into = c('ConDomainR', 'ConSetR', "ConTokenR"),
                      sep = "-", remove = FALSE)

#2- Codes a column that says which token (Low or High) the Left Inducer and Top Concurrent are, as a single value (e.g. "H H")
SSData$LeftPair <- paste(SSData$IndTokenL, SSData$ConTokenL)

#3- Codes a new response column recording the pair the participant matched: for choice '0' the Left Inducer with the Top Concurrent (ConTokenL), otherwise the Left Inducer with the Bottom Concurrent (ConTokenR)
SSData$Resp <- ifelse(SSData$choice == 0, 
                         paste(SSData$IndTokenL, SSData$ConTokenL), 
                         paste(SSData$IndTokenL, SSData$ConTokenR))

#4- Makes a numeric response column, where '1' means that the participant matched like with like (High with High, or Low with Low) and '0' means that the participant matched High with Low
SSData$Resp2 <- ifelse(SSData$Resp == "H H"| SSData$Resp == "L L", 1, 0)
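
A worked example on four invented trials may help to see how the two recoding steps fit together. The "H"/"L" token labels follow the "H H" example in the comments above; the rows themselves are made up.

```r
# Four invented trials covering both choices
toy <- data.frame(
  IndTokenL = c("H", "H", "L", "L"),
  ConTokenL = c("H", "L", "H", "L"),
  ConTokenR = c("L", "H", "L", "H"),
  choice    = c(0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# choice 0: left inducer paired with the left concurrent token;
# any other choice: left inducer paired with the right concurrent token
toy$Resp <- ifelse(toy$choice == 0,
                   paste(toy$IndTokenL, toy$ConTokenL),
                   paste(toy$IndTokenL, toy$ConTokenR))

# 1 = like matched with like (H with H, or L with L), 0 = H matched with L
toy$Resp2 <- ifelse(toy$Resp %in% c("H H", "L L"), 1, 0)
print(toy)
```

Note that the third trial ends up as "L L" even though the left pair on screen was "L H": the participant chose 1, so the left inducer was matched with the other concurrent.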

So, the data is clean, at least to a first approximation: we’ve added some columns we need, and recoded some other columns so that they are more informative.

This should allow us to do some of our most basic statistics, but there are many more things that we are interested in looking at that aren’t currently possible with our data frame as it stands.

First, we need to further break apart some of the domains. For each domain we have four sets of tokens, and broadly speaking we don’t think that there should be differences between the sets: a high-pitched hum should have the same associations as a high-pitched tone, piano note, or pulse.

For some of our domains however, this isn’t the case.

Affect

For our Affect tokens, although we have generally tried to pick pairs where one token is high valence/arousal and the other is not, we do not expect participants to treat all high valence/high arousal tokens as equivalent. “Happy” has very different connotations from “Stressed” or “Excited”, so we need to consider Affect tokens individually, rather than lumping all of the Affect trials together.

Color

Color is even more “problematic” than Affect: the color token pairs are Red vs. Blue (triangles), Red vs. Green (diamonds), Red vs. Yellow (circles), and Yellow vs. Blue (squares), so each set contrasts a different pair of colors.

#1- Code "Affect" and "Color" as Tokens- once for each relevant column
#(Note we don't need to do this for *all* columns)
SSData$IndDomainL2 <- ifelse(SSData$IndDomainL == "Affect"|SSData$IndDomainL == "Color", 
                             paste(SSData$IndDomainL, SSData$IndSetL, sep = " "), 
                       SSData$IndDomainL)
SSData$ConDomainL2 <- ifelse(SSData$ConDomainL == "Affect"|SSData$ConDomainL == "Color", 
                             paste(SSData$ConDomainL, SSData$ConSetL, sep = " "), 
                       SSData$ConDomainL)
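
The effect of this recoding is easiest to see on a few invented rows. In the sketch below, "Hum" is the Pitch set mentioned earlier, while "Happy" and "Triangle" are hypothetical set labels used only for illustration.

```r
# Invented rows: one ordinary domain, plus the two domains we split by token set
toy <- data.frame(
  IndDomainL = c("Pitch", "Affect", "Color"),
  IndSetL    = c("Hum", "Happy", "Triangle"),
  stringsAsFactors = FALSE
)

# Affect and Color keep their set name; every other domain stays as it is
toy$IndDomainL2 <- ifelse(toy$IndDomainL %in% c("Affect", "Color"),
                          paste(toy$IndDomainL, toy$IndSetL, sep = " "),
                          toy$IndDomainL)
print(toy$IndDomainL2)
```

The columns need to be character vectors for this to work: if IndDomainL were a factor, the last branch of ifelse would return integer level codes rather than domain names.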

The data frame now has what we need, but it has also become bloated with columns that we don’t need at all, so let’s get rid of those and reorganise the columns a bit.

This may look like we are cutting a lot of information, and indeed we are, but it can all be recovered from the original files if we so desire, and only a few pieces of information are required for our analyses.

SSData <- subset(SSData, select = c(DataSet, subject, condition, Focal1, Focal2, Focal3, trialNum, IndDomainL2, ConDomainL2, Resp2))

So this is a good slice of the data, but there are a few more things we can add.

Currently we have our Inducer and Concurrent domains coded in the data frame, which would let us look at all of the comparisons we are interested in. This is likely to be too much, though: with Color and Affect broken up there are 186 possible comparisons in the data.

But we don’t really think that there will be much of a difference between asking someone whether a blue triangle is high pitched and asking whether a high-pitched sound goes best with a blue triangle. That is, we don’t suspect that it matters whether a domain is an inducer or a concurrent; we expect the data to be symmetrical.

This is something we can test, so we will leave our inducer and concurrent columns in the data set, but we will also code in a “Comparison” column that tells us (insensitive to order) which two domains are being compared. This will also make generating predictions simpler.

Inducers <- unique(SSData$IndDomainL2)       #All possible inducer token sets   
Concurrents <- unique(SSData$ConDomainL2)    #All possible concurrent token sets

Combinations <- expand.grid(Inducer = Inducers, Concurrent = Concurrents,
                            stringsAsFactors = FALSE) # Gives all combinations, kept as character so later comparisons don't depend on factor levels


Combinations <- separate(data=Combinations, col= Inducer,     #split columns back up for subsetting
                         into= c("IndType", "IndToken"), sep = " ", remove = FALSE)
Combinations <- separate(data=Combinations, col= Concurrent, 
                         into= c("ConType", "ConToken"), sep = " ", remove = FALSE)

Combinations <- subset(Combinations, IndType != ConType)  #4- Removing impossible combinations

Combinations$Comparison <- paste(Combinations$Inducer, Combinations$Concurrent, sep = '-') #Make a comparison column
Combinations <- arrange(Combinations, Comparison)  # Order the data frame alphabetically by the comparison column

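delRows <- integer(0) # rows holding the mirror image of an earlier row
for (i in seq_len(nrow(Combinations))) {
  j <- which(Combinations$Inducer == Combinations$Concurrent[i] &
             Combinations$Concurrent == Combinations$Inducer[i])
  delRows <- c(delRows, j[j > i])  # keep the first row of each mirrored pair
}
if (length(delRows) > 0) {  # indexing with an empty negative index would drop every row
  Combinations <- Combinations[-delRows, ]
}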

# Code the comparison column into the SSData frame
SSData$IndCon <- paste(SSData$IndDomainL2, SSData$ConDomainL2, sep = '-')
SSData$ConInd <- paste(SSData$ConDomainL2, SSData$IndDomainL2, sep = '-')

SSData$Comparison <- ifelse(SSData$IndCon %in% Combinations$Comparison,
                               SSData$IndCon,
                               SSData$ConInd)


# Now we need to put our simdata into the same format as the rest of the data and attach it to the data frame
simdata$condition <- paste(simdata$Focal1, simdata$Focal2, simdata$Focal3, sep = '-')  # same '-' separator as the pilot conditions
simdata <- subset(simdata, select = c(DataSet, Id, condition, Focal1, Focal2, Focal3, TrialNum,
                                      IndDomainL2, ConDomainL2, Response_SIMULATED, IndCon, ConInd, Comparison))
colnames(simdata) <- c("DataSet", "subject", "condition", "Focal1", "Focal2", "Focal3", "trialNum",
                       "IndDomainL2", "ConDomainL2", "Resp2", "IndCon", "ConInd", "Comparison")

SSData <- rbind(SSData, simdata)

So now we have a lovely “Comparison” column in the dataframe, which we can use for some statistical tests
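
For comparison, an order-insensitive key can also be built directly, without enumerating combinations first, by sorting the two labels within each row. This is an alternative sketch rather than the approach used above, and the rows are invented. The keys it produces would only be usable if they follow the same ordering as the labels in the predictions file.

```r
# Invented rows: the first two are mirror images of each other
toy <- data.frame(
  Inducer    = c("Pitch", "Color Triangle", "Size"),
  Concurrent = c("Color Triangle", "Pitch", "Pitch"),
  stringsAsFactors = FALSE
)

# pmin/pmax compare the two character columns element by element,
# so A-B and B-A collapse onto the same key
first  <- pmin(toy$Inducer, toy$Concurrent)
second <- pmax(toy$Inducer, toy$Concurrent)
toy$Comparison <- paste(first, second, sep = "-")
print(toy)
```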

The last thing we want to do is code “Correctness” for each trial. Of course in this task there is no such thing as a “correct” answer- participants choose what they want and cannot be wrong.

We do, however, have a series of predictions that can be made about this data- so we can code in “Correctness” according to those sets of predictions

We have made those predictions elsewhere and stored them as ImputedPredictions.csv.

So we simply need to load in those predictions, then use that file to populate some additional columns in our dataframe

We have three sets of predictions in our “Predictions” file: (1) Magnitude Symbolism, (2) predictions generated from a lit review, and (3) predictions imputed from our Affect version of the experiment (absolute).

Predictions <- read.csv("F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/ImputedPredictions.csv")

SSData$Magnitude <- mapvalues(SSData$Comparison,
                                from = Predictions$Comparison,
                                to = Predictions$MagSym)
SSData$LitReview <- mapvalues(SSData$Comparison,
                                from = Predictions$Comparison,
                                to = Predictions$Prediction)
SSData$Affect <- mapvalues(SSData$Comparison,
                                from = Predictions$Comparison,
                                to = Predictions$ImputedPrediction)

As it stands, this tells us what response each set predicts, not whether the participant’s response matched that prediction. We can recode that fairly simply.

SSData$Magnitude <- ifelse(SSData$Magnitude == SSData$Resp2, 1, 0)
SSData$LitReview <- ifelse(SSData$LitReview == SSData$Resp2, 1, 0)
SSData$Affect <- ifelse(SSData$Affect == SSData$Resp2, 1, 0)
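
One thing to watch with this approach: mapvalues leaves any value it cannot find in from unchanged, so a comparison that is missing from the predictions file keeps its label and is then scored 0 by the equality test. The sketch below shows a base-R lookup with match that marks such trials as NA instead. The predictions and toy data frames are invented, and only the MagSym column name is taken from the code above.

```r
# Invented predictions table and trials; "Noise-Speed" has no prediction
predictions <- data.frame(
  Comparison = c("Pitch-Size", "Color Triangle-Pitch"),
  MagSym     = c(1, 0),
  stringsAsFactors = FALSE
)
toy <- data.frame(
  Comparison = c("Pitch-Size", "Color Triangle-Pitch", "Noise-Speed"),
  Resp2      = c(1, 1, 0),
  stringsAsFactors = FALSE
)

# match returns NA where a comparison is not in the predictions table
idx <- match(toy$Comparison, predictions$Comparison)
toy$Predicted <- predictions$MagSym[idx]

# 1 = response agrees with the prediction, 0 = disagrees, NA = no prediction
toy$Magnitude <- as.integer(toy$Predicted == toy$Resp2)
print(toy)
```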

So that’s our data pretty much as we are going to use it, but let’s make sure it is nicely formatted. To do so, we will remove a few columns and rename the rest.

SSData <- subset(SSData, select = c(DataSet, subject, condition, trialNum,
                                    IndDomainL2, ConDomainL2, Comparison, Resp2, Magnitude, LitReview, Affect))

colnames(SSData) <- c("DataSet", "Subject", "Condition", "TrialNum", "Inducer", "Concurrent", "Comparison", "Response",
                      "Magnitude", "LitReview", "Affect")

And that is the data as we need it. It is currently in wide format and not aggregated. In the actual analyses we’ll be using the data in long, aggregated format, but because we need to aggregate and reshape the data differently depending on the analysis, in this file we stop with the data wide and unaggregated.

write.csv(SSData, "F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/CleanData.csv", row.names = FALSE)
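
As a rough sketch of what one such later aggregation could look like, the example below computes the mean response per participant and comparison with base R's aggregate. The toy rows are invented, and which grouping variables an actual analysis needs is an assumption here; only the column names follow the cleaned data.

```r
# Invented trials: two participants, two comparisons, two trials each
toy <- data.frame(
  Subject    = rep(c("s1", "s2"), each = 4),
  Comparison = rep(c("Pitch-Size", "Pitch-Size", "Noise-Speed", "Noise-Speed"), times = 2),
  Response   = c(1, 1, 0, 1,  0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# One row per Subject x Comparison, holding the proportion of '1' responses
agg <- aggregate(Response ~ Subject + Comparison, data = toy, FUN = mean)
print(agg)
```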

Data Cleaning and Manipulation

Alan Nielsen

September 16, 2017