The following is the data analysis for Stekic, Kovic, and Nielsen (2019), in which we explored the learnability of artificial languages that differed in the relationship between labels and the objects they describe. All files for this experiment can be found at our GitHub Repository, and further details of the experimental design can be found at our OSF Repository.
In the two code chunks below we load our libraries and read in our data files.
Note that in this document all code is "folded" by default: you can unfold a section (and look at the code) by clicking the "Code" button at the top right of each block.
library(data.table)
library(tidyverse)
library(ggthemes)
library(outliers)
library(lme4)
library(lmerTest)
library(afex)
library(kableExtra)
# Our data is in two formats because of a change in jsPsych version halfway through data collection.
# We therefore read the two file types (distinguishable by file size) in separately, then combine them into a single large data frame.
setwd("C:/Users/Alan/Documents/GitHub/Stekic-et-al/Data/50s/")
files50 <- list.files(pattern = '\\.csv')
tables50 <- lapply(files50, read.csv, header = TRUE)
combined.df.50 <- do.call(rbind , tables50)
combined.df.50[] <- lapply(combined.df.50, function(x) gsub("\\\\", "", x))
setwd("C:/Users/Alan/Documents/GitHub/Stekic-et-al/Data/40s/")
files40 <- list.files(pattern = '\\.csv')
tables40 <- lapply(files40, read.csv, header = TRUE)
combined.df.40 <- do.call(rbind , tables40)
colnames(combined.df.50)<- c("rtD", "key_press", "trialtype2", "TrialIndex", "elapsed", "node_id", "viewhist", "responses", "Yoking", "RT", "RespKey", "RespCorr", "TrialType", "Image", "Label", "Location", "CorrectResponse", "Block", "BlockTrial", "Condition", "LabelType", "Subcondition", "TrialNum")
colnames(combined.df.40)<- c("rtD", "key_press", "trialtype2", "TrialIndex", "elapsed", "node_id", "viewhist", "responses", "Yoking", "RT", "RespKey", "RespCorr", "TrialType", "Image", "Label", "Location", "CorrectResponse", "Block", "BlockTrial", "Condition", "LabelType", "Subcondition", "TrialNum")
combined.df <- rbind(combined.df.50, combined.df.40)
#1- Substitute out some special characters
combined.df[] <- lapply(combined.df, function(x) gsub("\\\\", "", x))
combined.df[] <- lapply(combined.df, function(x) gsub("[{}]", "", x))
combined.df[] <- lapply(combined.df, function(x) gsub("\"", "", x))
#2- Add in a column with the biographical data (which is currently stored in a single value on the fourth line of each participant's file)
biodata <- combined.df[seq(4, nrow(combined.df), 246),]
biodata <- as.data.frame(biodata$responses)
colnames(biodata) <- "biodata"
biodata <- separate(biodata, col=biodata, into = c("Age", "Gender", "Specify"), sep = ",")
biodata$Age <- sub("age:", "", biodata$Age)
biodata$Gender <- sub("gender:", "", biodata$Gender)
biodata$Specify <- sub("specify:", "", biodata$Specify)
combined.df$Age <- rep(biodata$Age, each = 246)
combined.df$Gender <- rep(biodata$Gender, each = 246)
#3- Add a unique participantID (actually the name of each file)
files <- c(files40, files50)
files <- sub(".csv", "", files)
combined.df$ParticipantID <- rep(files, each= 246)
#4- Clean up our Data, Get rid of some useless columns, and re-sort the remaining columns into ones we will actually use
CleanData <- subset(combined.df, select = c("ParticipantID", "Condition", "Subcondition", "Yoking", "TrialNum", "TrialType", "Block", "BlockTrial", "Image", "Label", "Location", "CorrectResponse", "RespKey", "RespCorr", "RT"))
#5- Get rid of extra lines from the jsPsych output- leaving us with only our Trial data (everything else of use we've extracted and added as columns)
CleanData <- subset(CleanData, TrialNum > 0)
#6- Set the data types of our various columns
CleanData$ParticipantID <- as.factor(CleanData$ParticipantID)
CleanData$Condition <- as.factor(CleanData$Condition)
CleanData$Subcondition <- as.factor(CleanData$Subcondition)
CleanData$Yoking <- as.factor(CleanData$Yoking)
CleanData$TrialType <- as.factor(CleanData$TrialType)
CleanData$Location <- as.factor(CleanData$Location)
CleanData$TrialNum <- as.numeric(CleanData$TrialNum)
CleanData$Block <- as.numeric(CleanData$Block)
CleanData$BlockTrial <- as.numeric(CleanData$BlockTrial)
CleanData$RespCorr <- as.numeric(CleanData$RespCorr)
CleanData$RT <- as.numeric(CleanData$RT)
We now have all of our data combined, but it needs to be cleaned up considerably (tidied).
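Before cleaning, a quick structural check can be useful. A minimal sketch using the frame built above:
str(CleanData)
length(levels(CleanData$ParticipantID))   # one factor level per data file read in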
#1- Removing participant 7gtriiTixvBaQ
RespCorrStrange <- CleanData[is.na(CleanData$RespCorr),]
CleanData.RespCorr <- subset(CleanData, ParticipantID != "7gtriiTixvBaQ")
#2A- Removing Participants with impossible negative RT values
RTNegative <- subset(CleanData.RespCorr, RT < 0)
CleanData.RTNegCorr <- subset(CleanData.RespCorr, ParticipantID != "60CBPiTiO1gKw")
CleanData.RTNegCorr <- subset(CleanData.RTNegCorr, ParticipantID != "k3LHwiTiOx7fx")
# Verify that we have removed all negative RT values from the data frame
RTNegative2 <- subset(CleanData.RTNegCorr, RT < 0)
#2B- Removing Participants with very large single RT values
RTHigh <- subset(CleanData.RespCorr, RT>120000 )
#Drop factor levels that are no longer present
RTHigh$ParticipantID <- factor(RTHigh$ParticipantID)
RTHighs <- as.data.frame(table(RTHigh$ParticipantID))
colnames(RTHighs) <- c("Participant", "Count")
#re-order by count
RTHighs <- RTHighs[order(-RTHighs$Count),]
#Give shorter participantIDs
RTHighs$Participant <- substring(RTHighs$Participant, 1,3)
#Making the Participant Column into an Index
RTHighs2 <- RTHighs[-1]
row.names(RTHighs2) <- RTHighs$Participant
#Transpose for output
RTHighsT <- as.data.frame(t(RTHighs2))
The basics of tidying are now done, but we should also take a visual look at the data, for example checking for impossible values.
I leave this as an exercise for the reader, but will point out the things I noticed:
I sorted the "RespCorr" column (correctness of response, where 0 = Incorrect and 1 = Correct) and noted that we have a single case where the value is "NA": Testing Trial 1 for participant 7gtriiTixvBaQ. Because we can't be certain what went wrong here, or whether the other data obtained from this participant are of good quality (visually they appear to be fairly good, with the participant fairly uniformly taking about 5 seconds per trial and responding correctly on about half of trials), the most conservative approach is to simply throw out all of the data from this participant, in case something larger went wrong.
I also sorted the RT (response time) column, revealing a number of strange values.
First, there are a number of negative RT values, which are impossible. These come from two participants: 17 from participant 60CBPiTiO1gKw and 1 from participant k3LHwiTiOx7fx. As above, we simply remove these participants from our data entirely.
Second, there are some very high RT values, the highest being 2460.57 seconds (about 41 minutes). This is patently too long for a single trial, well beyond the realm where a participant might simply be evaluating possible correct responses, and a break this long almost certainly has negative consequences for learning. As such, we set a cutoff for removing participants who have any individual trial over a certain length. Here we use 2 minutes (1.2 × 10^5, i.e. 120,000 milliseconds) as the cutoff: a participant with any trial over this length has all of their trials eliminated. Note that we remove individual trials longer than 30 seconds in a section below; those we view as outliers from participants who are otherwise responding normally, whereas trials over 2 minutes long suggest a failure of attention and/or multitasking that could deleteriously affect the results on all trials.
The prospect of divided attention is broadly supported by looking at these trials: the 32 trials with RTs over 2 minutes come from 21 different participants.
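For reference, here is a minimal sketch of how the checks described above can be reproduced (using the data frames defined earlier; the counts reported in the text were obtained after the preceding exclusions):
summary(CleanData$RT)                                       # spot negative and extreme RT values
CleanData[is.na(CleanData$RespCorr), "ParticipantID"]       # the trial with the NA response
negRT <- droplevels(subset(CleanData, RT < 0))
table(negRT$ParticipantID)                                  # negative-RT trials per participant
longRT <- droplevels(subset(CleanData.RespCorr, RT > 120000))
length(unique(longRT$ParticipantID))                        # participants with any trial over 2 minutes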
The offending participants can be seen below:
knitr::kable(RTHighsT, caption = 'Number of Trials over 2 Minutes long by ParticipantID') %>%
kable_styling() %>%
scroll_box(width = "800px", height = "150px")
| | iN9 | YtJ | oOr | 5KG | Ecp | lDj | 5bM | 60C | 6T8 | 7U9 | 9wR | AVp | d3i | fTX | IOe | jwE | QBc | SWq | tHU | y9y | YVW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 4 | 4 | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
#Get the participantIDs for participants with too-long RT values
participants <- unique(RTHigh$ParticipantID)
#Write a for loop that removes all the lines of the data frame for each of these participants
CleanData.HighRTCorr <- CleanData.RTNegCorr
for (participant in participants) {
CleanData.HighRTCorr <- subset(CleanData.HighRTCorr, ParticipantID != participant)
}
CleanData2 <- CleanData.HighRTCorr
#Output this clean data to a csv
write.csv(CleanData2, file=("C:/Users/Alan/Documents/GitHub/Stekic-et-al/Data/CleanData.csv"))
In the code chunk above, we remove all of the data for these 21 participants. We could be more liberal with our criteria and only remove participants with multiple lapses, but given our large amount of data a conservative policy is likely best.
This gives us a first bash at clean data. We started with 429 participants and removed: 1 participant (Participant ID 7gtriiTixvBaQ) based on an NA value on one of their trials; 2 participants (Participant IDs: 60CBPiTiO1gKw, k3LHwiTiOx7fx) with impossible negative RT values; and a further 20 participants with overly long RTs (at least 2 minutes) on at least one trial (Participant IDs: 5bMnRiTi8m8VY, fTXj1iTiH9yh4, jwEltiTiE1yXC, 5KGA3iTidksdQ, 6T8RyiTitqvgs, 7U90riTiU7b4A, 9wR2NiTi3j05w, AVpqMiTidvCFP, d3iMTiTi0J4mA, EcpbriTixVTkS, iN9miiTiQr9C6, IOeHuiTiGoOjS, lDjT0iTiBH9Bd, oOrsQiTiA7N0u, QBc3UiTitqvaT, SWqVniTixSi42, tHUnMiTikaIIL, y9y6biTidLSDu, YtJcdiTiDvWGS, YVWDNiTi5c584; 60CBPiTiO1gKw also exceeded the 2-minute cutoff but had already been removed for its negative RTs).
This leaves us with a total of 406 participants across our 10 experimental conditions, which is still more than we expected to have access to.
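A quick sanity check on these counts (a minimal sketch; the expected values follow from the numbers above):
length(files)                               # number of data files read in - should be 429
length(unique(CleanData2$ParticipantID))    # participants remaining after exclusions - should be 406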
We have removed whole participants under a number of criteria, but what about outliers, especially for things like RT? How do we want to handle those?
The best way to get some estimate of how they are likely to affect our data is to look at histograms (or density plots) of response time, and perhaps QQ plots to get some idea of their normality.
ggplot(CleanData2, aes(RT)) +
geom_density() #+
#xlim(0, 40000)
So it doesn’t actually look like we have a lot of outliers, but this doesn’t mean they won’t have an effect on the data.
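The density plot above covers the histogram side of that suggestion; a QQ plot gives a rough sense of (non-)normality. A minimal sketch:
ggplot(CleanData2, aes(sample = RT)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("QQ Plot of RTs against a Normal Distribution")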
As a first bash we'll use 1.5x the interquartile range as our cutoff. We can reconsider this as needed.
#Calculating the interquartile range
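# quantile() with default settings returns the 0%, 25%, 50%, 75% and 100% points, so elements 2 and 4 below are the first and third quartiles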
lowerquart <- quantile(CleanData2$RT)[2]
upperquart <- quantile(CleanData2$RT)[4]
Interquartile <- upperquart - lowerquart
#Calculating thresholds
#Mild thresholds are 1.5* interquartile range
mild.low <- lowerquart - (Interquartile * 1.5)
mild.high <- upperquart + (Interquartile * 1.5)
#Extremes are 3* interquartile range
extreme.low <- lowerquart - (Interquartile * 3)
extreme.high <- upperquart + (Interquartile * 3)
#1- Removing all outliers (anything beyond the mild thresholds)
CleanData.RTTrim1 <- subset(CleanData2, RT > mild.low & RT < mild.high)
ggplot(CleanData.RTTrim1, aes(RT)) +
geom_density() +
ggtitle("Density Plot of RTs- All Outliers Removed")
#2- Replacing all Outliers with the Mean
CleanData.RTTrim2 <- CleanData2
#Compute non-outlier means
NOMean1 <- mean(CleanData.RTTrim1$RT)
CleanData.RTTrim2$RT <- ifelse(CleanData.RTTrim2$RT < mild.low,
NOMean1,
CleanData.RTTrim2$RT)
CleanData.RTTrim2$RT <- ifelse(CleanData.RTTrim2$RT > mild.high,
NOMean1,
CleanData.RTTrim2$RT)
ggplot(CleanData.RTTrim2, aes(RT)) +
geom_density() +
ggtitle("Density Plot of RTs- All Outliers Replaced with Mean")
#3- Removing only extreme outliers
CleanData.RTTrim3 <- subset(CleanData2, RT > extreme.low & RT < extreme.high)
ggplot(CleanData.RTTrim3, aes(RT)) +
geom_density() +
ggtitle("Density Plot of RTs- Extreme Outliers Removed")
#4- Capping outliers at the mild outlier boundaries
CleanData.RTTrim4 <- CleanData2
CleanData.RTTrim4$RT <- ifelse(CleanData.RTTrim4$RT < mild.low,
mild.low,
CleanData.RTTrim4$RT)
CleanData.RTTrim4$RT <- ifelse(CleanData.RTTrim4$RT > mild.high,
mild.high,
CleanData.RTTrim4$RT)
ggplot(CleanData.RTTrim4, aes(RT)) +
geom_density() +
ggtitle("Density Plot of RTs- Outliers Trimmed to Mild Outlier Boundary")
#5- Replacing all outliers with the mean on a by-subject basis
participantIDs <- unique(CleanData2$ParticipantID)
participantdata <- list()
CleanData.RTTrim5 <- list()
for (participant in participantIDs) {
participantdata <- subset(CleanData2, ParticipantID == participant)
#Get Non-Outlier Mean
participantdataNO <- subset(participantdata, RT > mild.low & RT < mild.high)
NOMean2 <- mean(participantdataNO$RT)
participantdata$RT <- ifelse(participantdata$RT < mild.low,
NOMean2,
participantdata$RT)
participantdata$RT <- ifelse(participantdata$RT > mild.high,
NOMean2,
participantdata$RT)
CleanData.RTTrim5 <- rbind(CleanData.RTTrim5, participantdata)
}
ggplot(CleanData.RTTrim5, aes(RT)) +
geom_density() +
ggtitle("Density Plot of RTs- Outliers Replaced with Mean (By Subject)")
CleanData3 <- CleanData.RTTrim5
Above I output five graphs, each informatively titled so that it should be clear what was done in each case.
I think the final option (by-subject replacement) is probably the best one, although we'll likely find that whichever version of the data we run with, the choice won't make much of a statistical difference.
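For reference, the same by-subject replacement can be written more compactly with dplyr (already loaded via the tidyverse). This is a sketch assuming the mild.low and mild.high thresholds computed above; it should reproduce CleanData.RTTrim5 up to row order:
CleanData.RTTrim5.alt <- CleanData2 %>%
  group_by(ParticipantID) %>%
  mutate(NOMean = mean(RT[RT > mild.low & RT < mild.high]),           # per-participant non-outlier mean
         RT = ifelse(RT < mild.low | RT > mild.high, NOMean, RT)) %>%
  ungroup() %>%
  select(-NOMean)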
#Splitting Testing Trials
participantIDs <- unique(CleanData3$ParticipantID)
participantdata <- list()
participantdata.training <- list()
participantdata.testing <- list()
testingtrials <- list()
trainingtrials <- list()
for (participant in participantIDs) {
participantdata <- subset(CleanData3, ParticipantID == participant)
participantdata.training <- subset(participantdata, TrialType == "Training")
participantdata.testing <- subset(participantdata, TrialType == "Testing")
TrainingFigures <- unique(participantdata.training$Image)
participantdata.training$Generalisation <- NA
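# Testing images that also appeared during Training are labelled Old; novel exemplars are labelled New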
participantdata.testing$Generalisation <- ifelse(
participantdata.testing$Image %in% TrainingFigures,
"Old",
"New")
trainingtrials <- rbind(trainingtrials, participantdata.training)
testingtrials <- rbind(testingtrials, participantdata.testing)
}
CleanData4 <- rbind(trainingtrials, testingtrials)
#Factor levels, adding Trial Type 2
CleanData4$TrialType <- factor(CleanData4$TrialType, levels = c("Training", "Testing"))
CleanData4$TrialType2 <- paste(CleanData4$TrialType, CleanData4$Generalisation, sep = "-")
CleanData4$Generalisation <- as.factor(CleanData4$Generalisation)
CleanData4$TrialType2 <- factor(CleanData4$TrialType2, levels = c("Training-NA", "Testing-Old", "Testing-New"),
                                labels = c("Training", "Testing-Old", "Testing-New"))
# Obtaining the curviness of images from our original script
CleanData4$Image <- sub("Stims/Figures/", "", CleanData4$Image)
CleanData4$Image <- sub(".bmp", "", CleanData4$Image)
CleanData4 <- separate(CleanData4, col=Image, into = c("ImageSeed", "JaggedvsCurved", "Curviness", "Set"), sep = "-")
Our experiment is split into Training Trials and Testing Trials, and that distinction is already in the data frame, but a few additional distinctions are required.
The first of these is splitting up our testing trials so we can look at Generalisation. Testing trials in this experiment use both the initial set of 16 stimuli seen by each participant and also introduce 8 novel exemplars, to test participants' ability to generalise the rules they have learned.
The second of these is adding in information about how curvy vs. jagged the various images are.
Recall that our images were generated along a range of "Curviness" from 0 (very jagged) to 30 (very curvy), with 10 (somewhat jagged) and 20 (somewhat curvy) also used.
We thus used 4 possible pairings of curviness: 0 vs 30, 10 vs 20, 0 vs 20, and 10 vs 30 (a sketch of how to recover these pairings from the data follows below).
This means there may be many ways to look at this data.
All of this information is already coded in the current filename structure, so we just need to expand it out into separate columns, as the separate() call above does.
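As an illustration of that last point, here is a sketch of how one could tabulate which curviness values (and hence which pairing) appear in a given participant's trials; the name PairingLookup and the summary format are ours, not part of the pipeline:
PairingLookup <- CleanData4 %>%
  group_by(ParticipantID) %>%
  summarise(CurvinessPairing = paste(sort(unique(as.numeric(Curviness))), collapse = " vs ")) %>%
  ungroup()
head(PairingLookup)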
#Name the levels of Condition, rather than having them be numbers
ReplicationData2007 <- subset(CleanData4, Condition == 10|Condition == "3A"|Condition == "3B")
ReplicationData2007$Condition <- factor(ReplicationData2007$Condition,
levels= c("3A", "3B", 10),
labels = c("Conventional Category","Conventional Category","No Label"))
#There are 2 subconditions here - 3A and 3B - we're going to ignore the difference for now
Rep2007Agg1 <- aggregate(RespCorr ~ Condition + TrialType + Block , data=ReplicationData2007, mean, na.rm= FALSE)
Rep2007Agg1$Block <- factor(Rep2007Agg1$Block)
That was a lot of work to get us here, but we’re ready (in theory) to start looking at our data.
We already know, because of the various crossings in the data, that we're going to have a very hard time getting any models to converge. Even though we have a lot of data, our design isn't fully crossed and specifying the models can be very difficult. I've previously attempted to get those models to converge and had no luck, so instead we'll go about this a little bit more systematically (a sketch of the kind of model in question is given below).
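For context, the models in question have roughly the shape sketched below. This is illustrative only: the particular fixed and random-effects terms shown are an example of the kind of specification that tends not to converge on a design like this, not the exact model we previously attempted.
# Illustrative sketch of a near-maximal mixed model of accuracy (lme4 is loaded above)
m.full <- glmer(RespCorr ~ Condition * TrialType * Block +
                  (1 + TrialType * Block | ParticipantID),
                data = CleanData4, family = binomial,
                control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)))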
The first thing we can do in a fairly straightforward fashion is see how well we have replicated the results of the papers we are extending. The first of these is Lupyan et al. (2007): Language is Not Just for Talking.
In this paper, from which we took our basic design of Training vs. Testing Trials, Lupyan et al. compared the learnability of categories with no label to those with a single label for each category. Both of these are included in our design as well, as No Label (Condition 10) and Conventional Category Label (Condition 3).
So let's take a look at our data.
ggplot(data=Rep2007Agg1, aes(x=Block, y=RespCorr, group= Condition)) +
#geom_line(aes(color= Condition)) +
#geom_point(size=1.75, aes(colour = Condition)) +
geom_smooth(method='loess', formula= y ~ x, se= FALSE, aes(colour = Condition)) +
# scale_linetype_manual(values = c("solid", "solid", "solid",
# "longdash", "longdash", "longdash", "dotdash",
# "dotted")) +
scale_color_manual(values= c("#0066CC", "#CC0033","#33FF00", "#0066CC", "#CC0033","#33FF00", "#33FF00", "#000000")) +
labs(x="Block", y="Proportion of Correct Responses") +
scale_y_continuous(limits = c(0.45,1), breaks=c(0.5,0.6,0.7,0.8,0.9,1.0)) +
facet_grid(~TrialType, scales="free", space= "free_x") +
theme_tufte()
These are interesting results. Having a label is helpful, but when participants reach the testing phase their performance falls off a cliff (and doesn't recover), suggesting that they may be relying too heavily on the words and haven't learned very much about the differences between the images.
For the best comparison, let's plot this alongside Lupyan et al.'s data.
#Reading in Gary's Data
Lupyan2007 <- read.csv("C:/Users/Alan/Documents/GitHub/Stekic-et-al/RScripts/Lupyan2007Data.csv")
Rep2007Agg2 <- Rep2007Agg1
#Adding a "study" column to our data
Rep2007Agg2$Study <- "Stekic et al (2019)"
#Combining the 2 dataframes into 1
CombinedData <- rbind(Rep2007Agg2, Lupyan2007)
CombinedData$Group <- paste(CombinedData$Study, CombinedData$Condition)
#Plotting the Data
ggplot(data=CombinedData, aes(x=Block, y=RespCorr, group= Group)) +
#geom_line(aes(color= Condition)) +
#geom_point(size=1.75, aes(colour = Condition)) +
geom_smooth(method='loess', formula= y ~ x, se= FALSE, aes(colour = Condition, linetype= Study)) +
# scale_linetype_manual(values = c("solid", "solid", "solid",
# "longdash", "longdash", "longdash", "dotdash",
# "dotted")) +
scale_color_manual(values= c("#0066CC", "#CC0033","#33FF00", "#0066CC", "#CC0033","#33FF00", "#33FF00", "#000000")) +
labs(x="Block", y="Proportion of Correct Responses") +
scale_y_continuous(limits = c(0.45,1), breaks=c(0.5,0.6,0.7,0.8,0.9,1.0)) +
facet_grid(~TrialType, scales="free", space= "free_x") +
theme_tufte()
So there are some very interesting similarities, and some interesting differences as well.
Performance in our no-label condition (dashed red line) is around 10% lower than in Lupyan et al. (2007) across the board, but the general trajectory is bang on, including the fact that performance peaks at the end of training and is essentially maintained in the testing trials.
Our Conventional Category Label condition is almost identical to Lupyan et al.'s during Training, but totally different in Testing: there is around a 20% difference, with testing performance falling off a cliff for our participants. This suggests that in our experiment they may not be learning very much about the image types at all, and are instead learning only the words, such that they cannot generalise when given new stimuli.
Rep2007Agg3 <- aggregate(RespCorr ~ Condition + TrialType2 + Block , data=ReplicationData2007, mean, na.rm= FALSE)
Rep2007Agg3$Block <- factor(Rep2007Agg3$Block)
ggplot(data=Rep2007Agg3, aes(x=Block, y=RespCorr, group= Condition)) +
#geom_line(aes(color= Condition)) +
#geom_point(size=1.75, aes(colour = Condition)) +
geom_smooth(method='loess', formula= y ~ x, se= FALSE, aes(colour = Condition)) +
# scale_linetype_manual(values = c("solid", "solid", "solid",
# "longdash", "longdash", "longdash", "dotdash",
# "dotted")) +
scale_color_manual(values= c("#0066CC", "#CC0033","#33FF00", "#0066CC", "#CC0033","#33FF00", "#33FF00", "#000000")) +
labs(x="Block", y="Proportion of Correct Responses") +
scale_y_continuous(limits = c(0.45,1), breaks=c(0.5,0.6,0.7,0.8,0.9,1.0)) +
facet_grid(~TrialType2, scales="free", space= "free_x") +
theme_tufte()
This interpretation is supported by breaking apart the data and looking at the two types of testing trials separately.
For participants in the no-label condition, Testing Trials differ from Training Trials only in the absence of feedback and the introduction of new exemplars to generalise to. Unsurprisingly, these participants find Old (familiar) images easier than New images, which they have to generalise to (though they do get better at the latter).
Participants in the Conventional Category Label condition, however, don't show this pattern: performance on all testing trials falls off a cliff and there is no difference between familiar and new images, which suggests that they are actually learning very little about the image categories.