Referential Complexity Analyses

M. Lewis

August 04, 2014



Analyses:

  1. Cross-linguistic analyses
    (A) Complexity Norms
    (B) Correlation between all lengths
    (C) Correlation between all lengths, controlling for frequency, open class only
    (D) Correlation between all lengths and complexity, controlling for frequency
    (E) Translation check data

  2. High frequency words in mapping task

  3. Novel real objects
    (A) Norms
    (B) Mapping task (adults) TO DO
    (C) Mapping task (children) TO DO
    (D) Production task (labels + descriptions) TO DO

  4. Geons
    (A) Norms
    (B) Mapping task TO DO

To do:

  - Figure out how to clear before starting a new experiment
  - Save to GitHub
  - Clean up so that only plots and critical statistical results are shown
  - Check that all experiments remove duplicates



Set global variables

processNorms = TRUE # process norms or load norms? 
removeRepeatSubj = TRUE  # remove repeat subjects?
savePlots = FALSE # save plots to pdf?

LOAD PACKAGES, FUNCTIONS, AND REPEAT SUBJ DATA FILE
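
The loading chunk itself is not shown in this output. As a rough sketch, the code in this section relies on at least ggplot2 (`ggplot`), corrplot (`corrplot`), stringr (`str_split`), and psych (`partial.r`). This list is inferred from the function calls below and may be incomplete; the helper functions and the repeat-subject file are not reproduced here.

```r
# Packages inferred from the calls used in this section
# (an assumption, not the original loading chunk)
pkgs <- c("ggplot2", "corrplot", "stringr", "psych")

# Find which of these are not installed, without attaching anything yet
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the ones that are available
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```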



(1) Cross-linguistic analyses (Complexity norms task)

(A) Norms

(B) Correlation between all lengths

read in data and merge with English complexity norms

xling = read.csv("data/xling_csv.csv") 
xling = merge(xling, englishComplexityNorms, by.x = "ENGLISH", by.y = "word")

# get rid of bad item (peso)
xling = xling[xling$ENGLISH != "peso",]
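
By default, `merge()` keeps only rows whose `ENGLISH` value has a match in the norms, so unmatched words drop out silently. A minimal sketch of a check for this, using small made-up data frames (`toy_xling`, `toy_norms`, and the column `ENGLISH_LEN` are hypothetical stand-ins, not the real data):

```r
# Toy stand-ins for xling and englishComplexityNorms (hypothetical values)
toy_xling <- data.frame(ENGLISH = c("dog", "peso", "table", "idea"),
                        ENGLISH_LEN = c(3, 4, 5, 4),
                        stringsAsFactors = FALSE)
toy_norms <- data.frame(word = c("dog", "peso", "table"),
                        complexity = c(2.1, 3.0, 2.7),
                        stringsAsFactors = FALSE)

# Words that would be lost in the inner merge
unmatched <- setdiff(toy_xling$ENGLISH, toy_norms$word)

merged <- merge(toy_xling, toy_norms, by.x = "ENGLISH", by.y = "word")
merged <- merged[merged$ENGLISH != "peso", ]  # drop the bad item, as above

unmatched      # words with no norm
nrow(merged)   # rows that survive the merge and the exclusion
```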

word class distribution

xling$Open_class = as.factor(xling$Open_class)
counts = as.data.frame(summary(xling$Open_class))
counts$class = c("closed class", "open class bare", "open class inflected")
names(counts) = c("freq", "class")

ggplot(counts, aes(class, freq, fill = class)) + 
  geom_bar(stat = "identity") +
  ggtitle("Word types in corpus")


lens = c(which(grepl("LEN",names(xling)))) # get length column indices
col1 <- colorRampPalette(c("blue", "white" , "red"))

## Correlations between all lengths, all words
xling_len = xling[, lens] 
names(xling_len) = as.character(tolower(lapply(str_split(names(xling_len),"_"),function(x) {x[1]})))

# Correlations between all lengths
cmat = cor(xling_len, use = "pairwise.complete.obs")
corrplot(cmat,  tl.cex=.5, tl.srt=45, method = "color", tl.col = "black" ,col =col1(100),order = "FPC")
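
The renaming idiom used above keeps the part of each column name before the first underscore and lower-cases it. A small base-R sketch of the same transformation, wrapped in a hypothetical helper (`clean_names` is not part of the original script, and the example column names are assumed):

```r
# Hypothetical helper: keep the text before the first "_" and lower-case it
clean_names <- function(x) {
  first_part <- vapply(strsplit(x, "_", fixed = TRUE),
                       function(parts) parts[1],
                       character(1))
  tolower(first_part)
}

clean_names(c("ENGLISH_LEN", "SPANISH_LEN", "log.e.freq"))
```

Names without an underscore, such as the frequency column, pass through unchanged apart from lower-casing.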


mean(cmat)
## [1] 0.3213

## Correlations between all lengths, open class words only
xlingO = xling[xling$Open_class != 0,lens] 
names(xlingO) = as.character(tolower(lapply(str_split(names(xlingO),"_"),function(x) {x[1]})))

# correlations between all lengths
cmat = cor(xlingO, use = "pairwise.complete.obs")
corrplot(cmat,  tl.cex=.5, tl.srt=45, method = "color", tl.col = "black" ,col =col1(100), order = "FPC")


mean(cmat)
## [1] 0.2876
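
Note that `mean(cmat)` averages every cell of the matrix, including the diagonal of 1s and both copies of each pair. If the quantity of interest is the mean correlation between distinct pairs of languages, one option is to average the upper triangle only. A sketch on simulated data (the values are random, not the study's):

```r
set.seed(1)
# Simulated stand-in for the length columns: 50 words x 4 languages
toy_len <- as.data.frame(matrix(rnorm(200), ncol = 4))
toy_cmat <- cor(toy_len, use = "pairwise.complete.obs")

mean_all   <- mean(toy_cmat)                        # includes the diagonal
mean_pairs <- mean(toy_cmat[upper.tri(toy_cmat)])   # distinct pairs only

# The two are related through the matrix size k:
# mean_all = (k + k * (k - 1) * mean_pairs) / k^2
k <- ncol(toy_cmat)
all.equal(mean_all, (k + k * (k - 1) * mean_pairs) / k^2)
```

With 80 length columns, the diagonal accounts for 80 of the 6,400 cells, so the two means differ only slightly, but the upper-triangle version is the one that describes pairs of different languages.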

(C) Correlation between all lengths, controlling for frequency
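
Controlling for frequency here means partialling log frequency out of each pair of length variables. With a single covariate, the partial correlation equals the correlation between the residuals of each variable regressed on that covariate. A base-R sketch on simulated data (variable names and values are made up; the analysis in this section uses `partial.r` from psych instead):

```r
set.seed(2)
n <- 200
freq  <- rnorm(n)                  # stand-in for log frequency
len_a <- -0.5 * freq + rnorm(n)    # two simulated length variables that
len_b <- -0.4 * freq + rnorm(n)    # both depend on frequency

# Route 1: correlate the residuals after regressing each on frequency
r_resid <- cor(resid(lm(len_a ~ freq)), resid(lm(len_b ~ freq)))

# Route 2: closed-form first-order partial correlation
r_ab <- cor(len_a, len_b)
r_af <- cor(len_a, freq)
r_bf <- cor(len_b, freq)
r_formula <- (r_ab - r_af * r_bf) / sqrt((1 - r_af^2) * (1 - r_bf^2))

all.equal(r_resid, r_formula)
```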

## all words
xling_len_p = xling[,c(lens, which(names(xling)== "log.e.freq"))] 
names(xling_len_p) = as.character(tolower(lapply(str_split(names(xling_len_p),"_"),function(x) {x[1]})))

# partial correlations between all lengths, all words (column 81 = log frequency)
cmat.p = partial.r(xling_len_p,1:80,81 ) 
mean(cmat.p)
## [1] 0.216

## open class words only
xlingOF = xling[xling$Open_class !=0 ,c(lens, which(names(xling)== "log.e.freq"))] 
names(xlingOF) = as.character(tolower(lapply(str_split(names(xlingOF),"_"),function(x) {x[1]})))

# partial correlations between all lengths, open class only (column 81 = log frequency)
cmat.p = partial.r(xlingOF,1:80,81 ) 

# sorted by first principal component
if (savePlots) {pdf('sort.pdf',height = 10, width = 10)}
corrplot(cmat.p,  tl.cex=.5, tl.srt=45,  order = "FPC", method = "color", tl.col = "black" ,col =col1(100))


if (savePlots) {dev.off() }

# sorted by angular order of the eigenvectors
corrplot(cmat.p,  tl.cex=.5, tl.srt=45,  order = "AOE", method = "color", tl.col = "black" ,col =col1(100))


# sorted by hierarchical clustering
corrplot(cmat.p,  tl.cex=.5, tl.srt=45,  order = "hclust", method = "color", tl.col = "black", col =col1(100) )