Analyzing Linguistic Data: Chapter 2

14/05/2013

library(languageR)
library(MASS)
library(lattice)

1. The data set warlpiri (data courtesy Carmel O’Shannessy) provides information about the use of the ergative case in Lajamanu Warlpiri. Data were elicited for adults and children of various ages. The question of interest is to what extent the use of the ergative case marker is predictable from the animacy of the subject, word order, and the age of the speaker (adult versus child). Explore this data set with respect to this issue by means of a mosaic plot. (First construct a contingency table with xtabs(), then supply this contingency table as argument to mosaicplot().)

str(warlpiri)

## 'data.frame':    347 obs. of  9 variables:
##  $ Speaker          : Factor w/ 27 levels "Sub10","Sub1027",..: 27 27 27 27 27 27 27 27 27 27 ...
##  $ Sentence         : Factor w/ 343 levels "and_nganayi_kurdu_pawu_ngu_manu_maliki_ji_",..: 50 73 76 145 153 233 248 258 259 260 ...
##  $ AgeGroup         : Factor w/ 2 levels "adult","child": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CaseMarking      : Factor w/ 2 levels "ergative","other": 1 1 1 1 1 1 1 1 1 1 ...
##  $ WordOrder        : Factor w/ 2 levels "subInitial","subNotInitial": 2 1 1 2 1 1 1 2 1 2 ...
##  $ AnimacyOfSubject : Factor w/ 2 levels "animate","inanimate": 1 1 1 1 1 2 2 1 1 2 ...
##  $ OvertnessOfObject: Factor w/ 2 levels "notOvert","overt": 2 1 1 1 1 1 2 2 1 1 ...
##  $ AnimacyOfObject  : Factor w/ 2 levels "animate","inanimate": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Text             : Factor w/ 3 levels "texta","textb",..: 1 2 2 1 1 3 3 3 3 3 ...

(warlpiri.xtabs <- xtabs(~CaseMarking + AnimacyOfSubject + WordOrder + AgeGroup, 
    data = warlpiri))

## , , WordOrder = subInitial, AgeGroup = adult
## 
##            AnimacyOfSubject
## CaseMarking animate inanimate
##    ergative     102        16
##    other          9         5
## 
## , , WordOrder = subNotInitial, AgeGroup = adult
## 
##            AnimacyOfSubject
## CaseMarking animate inanimate
##    ergative      38         9
##    other          4         4
## 
## , , WordOrder = subInitial, AgeGroup = child
## 
##            AnimacyOfSubject
## CaseMarking animate inanimate
##    ergative      64        13
##    other         23         4
## 
## , , WordOrder = subNotInitial, AgeGroup = child
## 
##            AnimacyOfSubject
## CaseMarking animate inanimate
##    ergative      40        10
##    other          3         3

mosaicplot(warlpiri.xtabs, main = "Usage of ergative")

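The mosaic plot can be complemented with the proportion of ergative marking in each cell. The sketch below rebuilds the contingency table from the counts printed above, so that it runs without languageR; the names `counts`, `props` and `erg.prop` are illustrative and not part of the original solution.

```r
# Counts copied from the xtabs() output above.
# R fills arrays column-wise, so CaseMarking varies fastest.
counts <- array(
  c(102,  9, 16, 5,   # WordOrder = subInitial,    AgeGroup = adult
     38,  4,  9, 4,   # WordOrder = subNotInitial, AgeGroup = adult
     64, 23, 13, 4,   # WordOrder = subInitial,    AgeGroup = child
     40,  3, 10, 3),  # WordOrder = subNotInitial, AgeGroup = child
  dim = c(2, 2, 2, 2),
  dimnames = list(
    CaseMarking      = c("ergative", "other"),
    AnimacyOfSubject = c("animate", "inanimate"),
    WordOrder        = c("subInitial", "subNotInitial"),
    AgeGroup         = c("adult", "child")
  )
)

# Proportions within each AnimacyOfSubject x WordOrder x AgeGroup cell,
# so that every (ergative, other) pair sums to 1.
props <- prop.table(counts, margin = c(2, 3, 4))

# Keep only the proportion of ergative marking.
erg.prop <- props["ergative", , , ]
round(erg.prop, 2)
```

With the real data loaded, the same call should work directly on the table: `prop.table(warlpiri.xtabs, margin = c(2, 3, 4))`. For animate subjects in initial position, for example, the counts give 102/111 (about 0.92) for adults and 64/87 (about 0.74) for children.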

2. In Chapter 1 we created a data frame with mean reaction times and mean base frequencies for neologisms in the Dutch suffix -heid. Reconstruct the data frame heid2. Both reaction times and frequencies are logarithmically transformed. Use exp() to undo these transformations and make a scatterplot of the averaged reaction times (MeanRT) against the frequency of the base (BaseFrequency). Compare this scatterplot with a scatterplot using the log-transformed values.

# reconstruct heid2 as it appears on p. 17
heid2 <- aggregate(heid$RT, list(heid$Word), mean)
colnames(heid2) <- c("Word", "MeanRT")
items <- heid[, c("Word", "BaseFrequency")]
items <- unique(items)
(heid2 <- merge(heid2, items, by.x = "Word", by.y = "Word"))

##            Word MeanRT BaseFrequency
## 1   aftandsheid  6.705          4.20
## 2    antiekheid  6.542          6.75
## 3    banaalheid  6.588          5.74
## 4    basaalheid  6.586          3.56
## 5   bebrildheid  6.673          3.61
## 6   beschutheid  6.552          4.79
## 7       beuheid  6.637          5.07
## 8   bezweetheid  6.500          4.75
## 9  blusbaarheid  6.691          0.00
## 10  contentheid  6.548          4.50
## 11  coulantheid  6.538          1.79
## 12   dementheid  6.524          3.83
## 13    enormheid  6.453          8.42
## 14   erkendheid  6.466          6.52
## 15   gammelheid  6.578          4.91
## 16 geurloosheid  6.539          2.56
## 17    jofelheid  6.505          3.33
## 18 kalkrijkheid  6.644          3.14
## 19    koketheid  6.587          4.79
## 20   kortafheid  6.508          5.87
## 21   labielheid  6.565          4.38
## 22   lobbigheid  6.660          0.00
## 23   ludiekheid  6.666          4.14
## 24  markantheid  6.594          5.16
## 25 onattentheid  6.610          1.61
## 26 onbelastheid  6.692          3.18
## 27   ondiepheid  6.612          5.66
## 28 ontroerdheid  6.481          5.55
## 29    onwelheid  6.645          3.97
## 30    ovaalheid  6.487          5.86
## 31     pipsheid  6.571          3.33
## 32  pitloosheid  6.542          0.00
## 33    riantheid  6.632          4.53
## 34   royaalheid  6.543          6.09
## 35  saprijkheid  6.687          0.00
## 36  summierheid  6.666          5.30
## 37  tactvolheid  6.583          4.75
## 38  tembaarheid  6.640          0.00
## 39  tilbaarheid  6.638          0.00
## 40  visrijkheid  6.521          3.04

colnames(heid2)

## [1] "Word"          "MeanRT"        "BaseFrequency"

# rename some columns
colnames(heid2)[2] <- "logMeanRT"
colnames(heid2)[3] <- "logBaseFrequency"
# add two new columns with the back-transformed (original-scale) values
heid2$MeanRT <- exp(heid2$logMeanRT)
heid2$BaseFrequency <- exp(heid2$logBaseFrequency)
str(heid2)

## 'data.frame':    40 obs. of  5 variables:
##  $ Word            : Factor w/ 40 levels "aftandsheid",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ logMeanRT       : num  6.71 6.54 6.59 6.59 6.67 ...
##  $ logBaseFrequency: num  4.2 6.75 5.74 3.56 3.61 4.79 5.07 4.75 0 4.5 ...
##  $ MeanRT          : num  816 694 726 725 791 ...
##  $ BaseFrequency   : num  66.7 854.1 311.1 35.2 37 ...

# draw the scatterplots: reaction time (y axis) against base frequency (x axis)
par(mfrow = c(2, 1))
plot(heid2$BaseFrequency, heid2$MeanRT, xlab = "Base Frequency", ylab = "Mean RT")
plot(heid2$logBaseFrequency, heid2$logMeanRT, xlab = "Log Base Frequency", ylab = "Log Mean RT")

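As a quick sanity check on the back-transformation, the following sketch applies exp() to the first five rows of heid2 as printed above and verifies that log() recovers the original values. The data frame name `heid2.head` is illustrative only, and the values are the rounded ones shown in the output, not the full-precision data.

```r
# First five rows of heid2, copied from the printed output (log scale).
heid2.head <- data.frame(
  Word = c("aftandsheid", "antiekheid", "banaalheid", "basaalheid", "bebrildheid"),
  logMeanRT = c(6.705, 6.542, 6.588, 6.586, 6.673),
  logBaseFrequency = c(4.20, 6.75, 5.74, 3.56, 3.61)
)

# Undo the logarithmic transformation.
heid2.head$MeanRT <- exp(heid2.head$logMeanRT)
heid2.head$BaseFrequency <- exp(heid2.head$logBaseFrequency)

# Rounded reaction times, to compare with the str(heid2) output.
round(heid2.head$MeanRT)

# log() is the inverse of exp(), so the round trip returns the log values.
roundtrip.ok <- isTRUE(all.equal(log(heid2.head$MeanRT), heid2.head$logMeanRT))
roundtrip.ok
```

Passing `log = "xy"` to `plot()` with the back-transformed columns draws the raw values on logarithmic axes, which yields the same point pattern as plotting the log-transformed columns.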

3. The data set moby is a character vector with the text of Melville’s Moby Dick. In this exercise, we consider whether Zipf’s law holds for Moby Dick. According to Zipf’s law (Zipf, 1949), the frequency of a word is inversely proportional to its rank in a numerically sorted list. The word with the highest frequency has rank 1, the word with the next highest frequency has rank 2, etc. If Zipf’s law holds, a plot of log frequency against log rank should reveal a straight line. We make a table of word frequencies with table()—we cannot use xtabs(), because words is a vector and xtabs() expects a data frame—and sort the frequencies in reverse numerical order:

moby.table <- table(moby)
moby.table <- sort(moby.table, decreasing = TRUE)
moby.table[1:10]

## moby
##   the    of   and     a    to    in  that   his    it     I 
## 13717  6512  6008  4551  4514  3908  2982  2457  2209  2122

We now have the word frequencies. We use the colon operator and length(), which returns the length of a vector, to construct the corresponding ranks:

ranks <- 1:length(moby.table)
ranks[1:10]

##  [1]  1  2  3  4  5  6  7  8  9 10

Make a scatterplot of log frequency against log rank.

par(mfrow = c(1, 1))
plot(log(ranks), log(moby.table), xlab = "Log Rank", ylab = "Log Frequency")

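If Zipf's law holds, frequency is proportional to 1/rank, so a straight line fitted to log frequency against log rank should have a slope near -1. The sketch below fits such a line with lm() to the ten highest frequencies printed above. This is for illustration only: ten points are far too few to assess the law, and the names `freq`, `top.ranks`, `zipf.fit` and `slope` are not part of the original solution.

```r
# Ten highest word frequencies, copied from the moby.table output above.
freq <- c(13717, 6512, 6008, 4551, 4514, 3908, 2982, 2457, 2209, 2122)
top.ranks <- seq_along(freq)

# Least-squares line through the points in the log-log plane.
zipf.fit <- lm(log(freq) ~ log(top.ranks))

# The second coefficient is the slope of log frequency on log rank.
slope <- unname(coef(zipf.fit)[2])
slope
```

With the full data, the same model can be fitted as `lm(log(as.vector(moby.table)) ~ log(ranks))`, and `abline()` can add the fitted line to the scatterplot.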

4. The column labeled Trial in the data set lexdec specifies, for each subject, the trial number of the responses. For a given subject, the first trial in the experiment has trial number 1, the second has trial number 2, etc. Use xylowess.fnc() to explore the possibility that the subjects proceeded through the experiment in different ways, some revealing effects of learning, and others effects of fatigue.

par(mfrow = c(1, 1))
xylowess.fnc(RT ~ Trial | Subject, data = lexdec, xlab = "trial number", ylab = "log reaction time")


# the plot suggests that subjects such as T2 show a learning effect (as
# they complete more trials, their response times decrease); other
# subjects such as D clearly show fatigue, since their response times
# increase over the last trials they complete.
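The visual impression of learning versus fatigue can be summarized numerically by fitting a separate straight line of RT on Trial for each subject: a negative slope points to learning, a positive one to fatigue. The sketch below uses simulated data, not lexdec, and the subject labels "learner" and "tired" are invented for the example. A linear slope is also a coarser summary than the lowess smoother and will miss non-linear trends.

```r
set.seed(1)
trials <- 1:100

# Simulated (not real) data: "learner" speeds up, "tired" slows down.
sim <- data.frame(
  Subject = rep(c("learner", "tired"), each = length(trials)),
  Trial   = rep(trials, times = 2)
)
trend <- ifelse(sim$Subject == "learner", -0.002, 0.002)
sim$RT <- 6.5 + trend * sim$Trial + rnorm(nrow(sim), sd = 0.05)

# Slope of RT on Trial, fitted separately for each subject.
slopes <- sapply(split(sim, sim$Subject), function(d) {
  coef(lm(RT ~ Trial, data = d))[["Trial"]]
})
slopes
```

With the real data, the same idea applies to `split(lexdec, lexdec$Subject)`.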

5. The data set english lists lexical decision and word naming latencies for two age groups. Inspect the distribution of the naming latencies (RTnaming). First plot a histogram for the naming latencies with truehist(). Then plot the density.

par(mfrow = c(2, 1))
truehist(english$RTnaming, col = "lightgrey", xlab = "naming latencies", main = "histogram")
plot(density(english$RTnaming), main = "density for naming latencies")


The voicekey registering the naming responses is sensitive to the different acoustic properties of a word’s initial phoneme. The column Voice specifies whether a word’s initial phoneme was voiced or voiceless. Use bwplot() to make a trellis boxplot for the distribution of the naming latencies across voiced and voiceless phonemes with the age group of the subjects (AgeSubject) as grouping factor.

par(mfrow = c(1, 1))
# AgeSubject is the conditioning (grouping) factor: one panel per age group
bwplot(RTnaming ~ Voice | AgeSubject, data = english, 
    xlab = "Initial phoneme", ylab = "naming latencies")



# do the medians agree with the boxplot?
(tapply(english[english$AgeSubject == "old", ]$RTnaming, english[english$AgeSubject == 
    "old", ]$Voice, median))

##    voiced voiceless 
##     6.476     6.500

(tapply(english[english$AgeSubject == "young", ]$RTnaming, english[english$AgeSubject == 
    "young", ]$Voice, median))

##    voiced voiceless 
##     6.130     6.165
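The two tapply() calls above can be combined into one by passing both factors as a list, which returns the medians as an age group by voicing matrix. The sketch below shows this on a small made-up data frame (`toy`, with invented latencies), so that it runs without languageR.

```r
# Toy data with invented values; the column names mirror those in english.
toy <- data.frame(
  AgeSubject = rep(c("old", "young"), each = 6),
  Voice      = rep(c("voiced", "voiceless"), times = 6),
  RTnaming   = c(6.45, 6.49, 6.48, 6.50, 6.47, 6.52,
                 6.10, 6.15, 6.13, 6.17, 6.14, 6.16)
)

# One median per combination of AgeSubject (rows) and Voice (columns).
med <- with(toy, tapply(RTnaming, list(AgeSubject, Voice), median))
med
```

With the real data, the equivalent call would be `with(english, tapply(RTnaming, list(AgeSubject, Voice), median))`.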