This is a markdown document for importing, tidying, and recoding the Pilot Data from the MMIEL Summer School.
Insofar as it is possible, the data tidying RMD that we produce for the summer school on the full data should follow this format - for that, we may want to mark some of the R blocks as include = FALSE, because there is some behind-the-scences recoding of columns etc. that won’t be super relevant to the participants - or at least to those who aren’t reasonably advanced in using R to manipulate data.
We will start, as always, by importing the tidyverse, and then importing our data frame
library(tidyverse)
library(plyr)
pilotdata <- read.csv("F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/pilotData.csv")
# First we need a chunk of code that redoes the condition column, because "Condition" current includes the participant number for each condition - SS participants will not need to see this, so in the final version use include = FALSE
pilotdata <- separate(data = pilotdata, col = condition,
into = c('Focal1', 'Focal2', "Focal3", "ParticipantNum"),
sep = "-", remove = FALSE)
pilotdata$condition <- paste(pilotdata$Focal1, pilotdata$Focal2, pilotdata$Focal3, sep='-')
We can then do some pretty basic sanity checks on the data
length(unique(pilotdata$subject)) #1- This tells us how many experimental participants we have
nrow(pilotdata) / length(unique(pilotdata$subject)) #2- This tells us how many trials there are per participant
unique(pilotdata$condition) #3- Lists all of the "conditions"
length(unique(pilotdata$condition)) #4- Tells us how many conditions there are
table(pilotdata$condition)/96 #5- Tells us how many participants are in each condition
unique(pilotdata$choice) #6- This checks that we only have responses of 0 or 1
##ADD OTHER SANITY CHECKS AS NEEDED
So, those sanity checks tell us what we want to see- there are 61 participants, each of whom completed 96 trials in 6 Conditions.
The 6 conditions were Noise-Shape-Speed, Pitch-Size-Color, Brightness-Amp-Affect, Noise-Brightness-Color, Pitch-Shape-Affect, Amp-Size-Speed and there are between 10 and 11 participants per condition.
Finally, we have recorded only legal responses from our participants, both 1, which means that participants chose that the left inducer goes with the top concurrent (and thus the right inducer with the bottom concurrent) and 0, which means the opposite.
There are now some columns that we need to create in the dataframe that will make the comparisons we are interested in doing statistically possible
First, we need to take the InducerL, InducerR, ConcurrentL, and ConcurrentR columns and break them apart into their component bits- Their Domain (e.g. Pitch), their token set from that domain (e.g. Hum), and their specific token (e.g. “high”)
Second, we code a column that tells us which inducer token and concurrent token corresponds to a response (choice) of 0
#1- Separate columns into their components
pilotdata <- separate(data = pilotdata, col = InducerL,
into = c('IndDomainL', 'IndSetL', "IndTokenL"),
sep = "-", remove = FALSE)
pilotdata <- separate(data = pilotdata, col = InducerR,
into = c('IndDomainR', 'IndSetR', "IndTokenR"),
sep = "-", remove = FALSE)
pilotdata <- separate(data = pilotdata, col = ConcurrentL,
into = c('ConDomainL', 'ConSetL', "ConTokenL"),
sep = "-", remove = FALSE)
pilotdata <- separate(data = pilotdata, col = ConcurrentR,
into = c('ConDomainR', 'ConSetR', "ConTokenR"),
sep = "-", remove = FALSE)
#2- Codes a column that says which token (Low or High) the Left Inducer and Top Concurrent are, as a single value (e.g. "H H")
pilotdata$LeftPair <- paste(pilotdata$IndTokenL, pilotdata$ConTokenL)
#3- Codes a new response column where if the participant has made the choice '0' (Left Inducer Matches with Top Concurrent)
pilotdata$Resp <- ifelse(pilotdata$choice == 0,
paste(pilotdata$IndTokenL, pilotdata$ConTokenL),
paste(pilotdata$IndTokenL, pilotdata$ConTokenR))
So, that’s our data clean, to a first approximation- we’ve added some columns we need, and recoded some other columns so that they are more informative.
Now what we need to do is code “correctness” for each trial- something that we can sensibly put into a our linear mixed effects models and other statistical techniques.
In the course of our analysis, we are going to “correctness” in at least 3 ways
The first way that we are going to code correctness is just via Magnitude Bias- this is the assumption that everything we have coded as “high” (i.e. Loud, Noisy, High Pitched, Jagged, Large, Bright, Fast, Happy, Stressed, Excited, and Pleased) should be associated.
In this case, if a participant responded that High goes with High and Low goes with Low they are “correct”
pilotdata$MagRespCorr <- as.numeric(mapvalues(pilotdata$Resp,
from= c("H H", "L L", "H L", "L H"),
to= c(1, 1, 0, 0)))
The data so far has correctness coded based only on Magnitue Symbolism - where we assume that a “correct” response is one where “High” goes with “High”
This is the most basic explanation of the data that we are interested in, but it isn’t what we think is likely to be the best accounting of the data, for a number of reasons.
First, there are some pretty clear places where there are already established findings in crossmodal research that are not in the direction of magnitude symbolism- for example people associate small objects (Size = L) with bright colors (brightness = H) and fast moving objects (Speed = H), both of which would be marked as “incorrect” based on magnitude symbolism, so we need to consider those things in our best-specified set of predictions.
Second, two of our dimensions don’t very neatly partial out into “High” vs. “Low”.
In the case of Affect, we have the pairs Happy vs. Sad, Excited vs. Bored, Stressed vs. Calm, and Pleased vs. Disgusted
We can generally think of one in the pair as “higher” than the others, but this “highness” varies as whether they are high for valence or high for affect - Happy is high affect and valence, whereas “Excited” is High Valence but closer to neutral affect.
So for Affect we want to make some predictions on a by-token set basis - e.g. whether Brightness is associated with Happy vs. Sad, Excited vs. Bored, etc.
Color is even more “problematic” - the color token pairs are Red vs. Blue (triangles), Red vs. Green (diamonds), Red vs. Yellow (circles), and Yellow vs. Blue (squares)
We can’t really consider any of these “higher” or “lower” than each other- the way I have them set up has an implied heirarchy that Red>Yellow=Green>Blue
Realistically though, color data should probably be broken down by hue and chroma- this is a learning experience that color is a really tough dimension to deal with in this sort of experiment - That type of analysis is likely too much for the summer school.
To that end, we will take a look at the data both including and excluding color - technically, it feels like it should be a pretty messy dimension
First, we want to consider all affect associations independently (because we want to make predictions about specific affect tokens) To this end, rather than treating “Affect” as an overarching domain, we consider all of them independently
Second, we do the same for color
Third, we enummerate all of the possible comparisons in the data frame
Fourth, we remove all comparisons that don’t actually exist (mostly, domains being compared to themselves)
#1- Code "Affect" and "Color" as Tokens- once for each relevant column
pilotdata$IndDomainL2 <- ifelse(pilotdata$IndDomainL == "Affect"|pilotdata$IndDomainL == "Color", paste(pilotdata$IndDomainL, pilotdata$IndSetL, sep = " "),
pilotdata$IndDomainL)
pilotdata$IndDomainR2 <- ifelse(pilotdata$IndDomainR == "Affect"|pilotdata$IndDomainR == "Color", paste(pilotdata$IndDomainR, pilotdata$IndSetR, sep = " "),
pilotdata$IndDomainR)
pilotdata$ConDomainL2 <- ifelse(pilotdata$ConDomainL == "Affect"|pilotdata$ConDomainL == "Color", paste(pilotdata$ConDomainL, pilotdata$ConSetL, sep = " "),
pilotdata$ConDomainL)
pilotdata$ConDomainR2 <- ifelse(pilotdata$ConDomainR == "Affect"|pilotdata$ConDomainR == "Color", paste(pilotdata$ConDomainR, pilotdata$ConSetR, sep = " "),
pilotdata$ConDomainR)
#3- Enumerate all possible combinations
Inducers <- unique(pilotdata$IndDomainL2) #All possible inducer token sets
Concurrents <- unique(pilotdata$ConDomainL2) #All possible concurrent token sets
Combinations <- expand.grid(Inducer = Inducers, Concurrent = Concurrents) # Gives all combinations
Combinations <- separate(data=Combinations, col= Inducer, #split columns back up for subsetting
into= c("IndType", "IndToken"), sep = " ", remove = FALSE)
Combinations <- separate(data=Combinations, col= Concurrent,
into= c("ConType", "ConToken"), sep = " ", remove = FALSE)
#4- Removing impossible combinations
Combinations <- subset(Combinations, IndType != ConType)
#5- Make a data frame that just has the various combinations of inducers and concurrent, agnostic to which is inducer or concurrent (i.e. Pitch-Noise is the same as Noise-Pitch)
Predictions <- as.data.frame(paste(Combinations$Inducer, Combinations$Concurrent, sep = "-"))
colnames(Predictions)[1] <- "Comparison"
Predictions <- separate(data=Predictions, col= Comparison,
into= c("Ind", "Conc"), sep = "-", remove = FALSE)
Predictions$BA <- paste(Predictions$Conc, Predictions$Ind, sep = "-")
Predictions <- arrange(Predictions, Comparison)
#6- Makes an empty vector called pairs (we will use this in the next chunk)
pairs = c()
So, from this we have a list of 186 total comparisons that we need to make predictions for, but these are currently bi-directional- that is, we are asked to make predictions both for the association between two domains when one is the inducer and one is the concurrent, and vice-versa (e.g. Noise to Pitch and Pitch to Noise)
We generally want to operate under the assumption that these relationships are symmetrical (although we’ll test whether that is true statistically later)- so we’re going to make only half of the predictions (i.e. assume that Noise to Pitch and Pitch to Noise have the same associations)
To that end we define a function that goes through our list of combinations and takes half of it.
We also run another chunk of code that loops through the Predictions data frame and deletes the same rows
#1- Function for halving the list of combinations to remove mirrored duplicates
pickOne <- function(col1,col2){
if(col1 %in% pairs | col2 %in% pairs){
return(ifelse(col1 %in% pairs, toString(col1), toString(col2)))
} else {
pairs <<- append(pairs, toString(col1))
return(col1)
}
}
#2- Calling the function
Predictions %>% rowwise %>% mutate(use = pickOne(Comparison, BA))
## Warning in if (col1 %in% pairs | col2 %in% pairs) {: the condition has
## length > 1 and only the first element will be used
## Source: local data frame [186 x 5]
## Groups: <by row>
##
## # A tibble: 186 x 5
## Comparison Ind Conc BA
## * <fctr> <chr> <chr> <chr>
## 1 Affect EB-Amp Affect EB Amp Amp-Affect EB
## 2 Affect EB-Brightness Affect EB Brightness Brightness-Affect EB
## 3 Affect EB-Color RB Affect EB Color RB Color RB-Affect EB
## 4 Affect EB-Color RG Affect EB Color RG Color RG-Affect EB
## 5 Affect EB-Color RY Affect EB Color RY Color RY-Affect EB
## 6 Affect EB-Color YB Affect EB Color YB Color YB-Affect EB
## 7 Affect EB-Noise Affect EB Noise Noise-Affect EB
## 8 Affect EB-Pitch Affect EB Pitch Pitch-Affect EB
## 9 Affect EB-Shape Affect EB Shape Shape-Affect EB
## 10 Affect EB-Size Affect EB Size Size-Affect EB
## # ... with 176 more rows, and 1 more variables: use <fctr>
#3- Code to remove mirrored duplicates from dataframe
delRows = NULL # the rows to be removed
for(i in 1:nrow(Predictions)){
j = which(Predictions$Ind == Predictions$Conc[i] & Predictions$Conc == Predictions$Ind[i])
j = j [j > i]
if (length(j) > 0){
delRows = c(delRows, j)
}
}
Predictions <- Predictions[-delRows,]
rm(i, j, delRows,pickOne) #Removes some bits we no longer need
Predictions$Comparison <- sort(as.vector(Predictions$Comparison))
as.vector(Predictions$Comparison)
## [1] "Affect EB-Amp" "Affect EB-Brightness" "Affect EB-Color RB"
## [4] "Affect EB-Color RG" "Affect EB-Color RY" "Affect EB-Color YB"
## [7] "Affect EB-Noise" "Affect EB-Pitch" "Affect EB-Shape"
## [10] "Affect EB-Size" "Affect EB-Speed" "Affect HS-Amp"
## [13] "Affect HS-Brightness" "Affect HS-Color RB" "Affect HS-Color RG"
## [16] "Affect HS-Color RY" "Affect HS-Color YB" "Affect HS-Noise"
## [19] "Affect HS-Pitch" "Affect HS-Shape" "Affect HS-Size"
## [22] "Affect HS-Speed" "Affect PD-Amp" "Affect PD-Brightness"
## [25] "Affect PD-Color RB" "Affect PD-Color RG" "Affect PD-Color RY"
## [28] "Affect PD-Color YB" "Affect PD-Noise" "Affect PD-Pitch"
## [31] "Affect PD-Shape" "Affect PD-Size" "Affect PD-Speed"
## [34] "Affect SC-Amp" "Affect SC-Brightness" "Affect SC-Color RB"
## [37] "Affect SC-Color RG" "Affect SC-Color RY" "Affect SC-Color YB"
## [40] "Affect SC-Noise" "Affect SC-Pitch" "Affect SC-Shape"
## [43] "Affect SC-Size" "Affect SC-Speed" "Amp-Brightness"
## [46] "Amp-Color RB" "Amp-Color RG" "Amp-Color RY"
## [49] "Amp-Color YB" "Amp-Noise" "Amp-Pitch"
## [52] "Amp-Shape" "Amp-Size" "Amp-Speed"
## [55] "Brightness-Color RB" "Brightness-Color RG" "Brightness-Color RY"
## [58] "Brightness-Color YB" "Brightness-Noise" "Brightness-Pitch"
## [61] "Brightness-Shape" "Brightness-Size" "Brightness-Speed"
## [64] "Color RB-Noise" "Color RB-Pitch" "Color RB-Shape"
## [67] "Color RB-Size" "Color RB-Speed" "Color RG-Noise"
## [70] "Color RG-Pitch" "Color RG-Shape" "Color RG-Size"
## [73] "Color RG-Speed" "Color RY-Noise" "Color RY-Pitch"
## [76] "Color RY-Shape" "Color RY-Size" "Color RY-Speed"
## [79] "Color YB-Noise" "Color YB-Pitch" "Color YB-Shape"
## [82] "Color YB-Size" "Color YB-Speed" "Noise-Pitch"
## [85] "Noise-Shape" "Noise-Size" "Noise-Speed"
## [88] "Pitch-Shape" "Pitch-Size" "Pitch-Speed"
## [91] "Shape-Size" "Shape-Speed" "Size-Speed"
Predictions <- (subset(Predictions, select = "Comparison"))
Predictions$Comparison <- factor(Predictions$Comparison)
Predictions$Prediction <- c(1, 1, 1, 0, #1-4
1, 1, 1, 1, #5-8
1, 0, 1, 1, #9-12
1, 1, 0, 0, #13-16
1, 0, 1, 1, #17-20
0, 1, 0, 1, #21-24
0, 0, 0, 1, #25-28
0, 1, 1, 0, #29-32
1, 1, 1, 1, #33-36
1, 1, 1, 1, #37-40
1, 1, 0, 1, #41-44
1, 1, 1, 1, #45-48
1, 1, 0, 1, #49-52
1, 1, 1, 0, #53-56
0, 1, 1, 1, #57-60
1, 0, 1, 1, #61-64
1, 1, 0, 1, #65-68
1, 0, 1, 0, #69-72
1, 1, 0, 1, #73-76
1, 1, 1, 1, #77-80
1, 0, 1, 0, #81-84
1, 0, 1, 0, #85-88
0, 1, 0, 1, #89-92
0) #93
Predictions$MagSym <- 1
Predictions$Same1 <- ifelse(Predictions$MagSym == Predictions$Prediction, 1, 0)
# Output the prediction file for Bill to use in Python
# write.csv(Predictions, "F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/predictions.csv")
pilotdata$IndCon <- paste(pilotdata$IndDomainL2, pilotdata$ConDomainL2, sep = '-')
pilotdata$ConInd <- paste(pilotdata$ConDomainL2, pilotdata$IndDomainL2, sep = '-')
pilotdata$Comparison <- ifelse(pilotdata$IndCon %in% Predictions$Comparison,
pilotdata$IndCon,
pilotdata$ConInd)
NCComparisons <- sort(factor(unique(pilotdata$Comparison)))
PredComparisons <- sort(factor(unique(Predictions$Comparison)))
setdiff(PredComparisons, NCComparisons) # Test whether we have the same mirrored associations in both data frames (for mapping values)
## character(0)
pilotdata$Prediction <- mapvalues(pilotdata$Comparison,
from = Predictions$Comparison,
to = Predictions$Prediction)
pilotdata$Resp2 <- ifelse(pilotdata$Resp == "H H" |pilotdata$Resp == "L L", 1, 0)
pilotdata$RespCorrPred <- ifelse(pilotdata$Resp2 == pilotdata$Prediction, 1, 0)
So in the short term, we export the CSV of data for Bill to play with in Python, and for others to start thinking about looking at with statistics
CleanDataTemp <- subset(pilotdata, select = -c(X, ParticipantNum, IndDomainL, IndSetL, IndTokenL, IndDomainR, IndSetR, IndTokenR, ConDomainL, ConSetL, ConTokenL, ConDomainR, ConSetR, ConTokenR, IndDomainR2, IndCon, ConInd))
#write.csv(CleanDataTemp, "F:/Google Drive/GitHub Repos/Crossmodality-Toolkit/data/CleanDataTemp.csv")