library(languageR)

Baayen 2008 Page 20

The data set spanishMeta contains metadata about fifteen texts sampled from three Spanish authors. Each line in this file provides information on a single text. Later in this book we will consider whether these authors can be distinguished on the basis of the quantitative characteristics of their personal styles (gauged by the relative frequencies of function words and tag trigrams).

Exercise 1 Display this data frame in the R terminal. Extract the column names from the data frame. Also extract the number of rows.

spanishMeta #Display this data frame in the R terminal.
##    Author YearOfBirth  TextName PubDate Nwords    FullName
## 1       C        1916 X14458gll    1983   2972        Cela
## 2       C        1916 X14459gll    1951   3040        Cela
## 3       C        1916 X14460gll    1956   3066        Cela
## 4       C        1916 X14461gll    1948   3044        Cela
## 5       C        1916 X14462gll    1942   3053        Cela
## 6       M        1943 X14463gll    1986   3013     Mendoza
## 7       M        1943 X14464gll    1992   3049     Mendoza
## 8       M        1943 X14465gll    1989   3042     Mendoza
## 9       M        1943 X14466gll    1982   3039     Mendoza
## 10      M        1943 X14467gll    2002   3045     Mendoza
## 11      V        1936 X14472gll    1965   3037 VargasLLosa
## 12      V        1936 X14473gll    1963   3067 VargasLLosa
## 13      V        1936 X14474gll    1977   3020 VargasLLosa
## 14      V        1936 X14475gll    1987   3016 VargasLLosa
## 15      V        1936 X14476gll    1981   3054 VargasLLosa
colnames(spanishMeta)  #Extract the column names from the data frame.
## [1] "Author"      "YearOfBirth" "TextName"    "PubDate"     "Nwords"     
## [6] "FullName"
nrow(spanishMeta) #Also extract the number of rows.
## [1] 15

Exercise 2 Calculate how many different texts are available in meta for each author. Also calculate the mean publication date of the texts sampled for each author.

AuthC=subset(spanishMeta,Author=="C")
AuthM=subset(spanishMeta,Author=="M")
AuthV=subset(spanishMeta,Author=="V")

nrow(AuthC) #Number of Texts available for Cela
## [1] 5
nrow(AuthM) #Number of Texts available for Mendoza
## [1] 5
nrow(AuthV) #Number of Texts available for VargasLLosa
## [1] 5
xtabs(~spanishMeta$Author)
## spanishMeta$Author
## C M V 
## 5 5 5
mean(AuthC[,4]) #Mean publication date for Cela
## [1] 1956
mean(AuthM[,4]) #Mean publication date for Mendoza
## [1] 1990.2
mean(AuthV[,4]) #Mean publication date for VargasLLosa
## [1] 1974.6
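
The per-author counts and means above can also be computed in one step, without building a separate subset for each author. The following is a minimal sketch in base R. The data frame `meta` is a hypothetical stand-in that copies only the Author and PubDate columns from the listing in Exercise 1, so the example runs without languageR.

```r
# Hypothetical stand-in for spanishMeta: Author and PubDate copied from the listing above
meta <- data.frame(
  Author  = rep(c("C", "M", "V"), each = 5),
  PubDate = c(1983, 1951, 1956, 1948, 1942,
              1986, 1992, 1989, 1982, 2002,
              1965, 1963, 1977, 1987, 1981)
)

# Number of texts per author
texts_per_author <- table(meta$Author)

# Mean publication date per author, computed group by group
mean_pubdate <- tapply(meta$PubDate, meta$Author, mean)

print(texts_per_author)
print(mean_pubdate)
```

With the real data, the same calls should work on spanishMeta$PubDate and spanishMeta$Author directly.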

Exercise 3 Sort the rows in meta by year of birth (YearOfBirth) and the number of words sampled from the texts (Nwords).

spanishMeta[order(spanishMeta$YearOfBirth, spanishMeta$Nwords),] #Sort the rows in meta by year of birth (YearOfBirth) and the number of words sampled from the texts (Nwords).
##    Author YearOfBirth  TextName PubDate Nwords    FullName
## 1       C        1916 X14458gll    1983   2972        Cela
## 2       C        1916 X14459gll    1951   3040        Cela
## 4       C        1916 X14461gll    1948   3044        Cela
## 5       C        1916 X14462gll    1942   3053        Cela
## 3       C        1916 X14460gll    1956   3066        Cela
## 14      V        1936 X14475gll    1987   3016 VargasLLosa
## 13      V        1936 X14474gll    1977   3020 VargasLLosa
## 11      V        1936 X14472gll    1965   3037 VargasLLosa
## 15      V        1936 X14476gll    1981   3054 VargasLLosa
## 12      V        1936 X14473gll    1963   3067 VargasLLosa
## 6       M        1943 X14463gll    1986   3013     Mendoza
## 9       M        1943 X14466gll    1982   3039     Mendoza
## 8       M        1943 X14465gll    1989   3042     Mendoza
## 10      M        1943 X14467gll    2002   3045     Mendoza
## 7       M        1943 X14464gll    1992   3049     Mendoza

Exercise 4 Extract the vector of publication dates from meta. Sort this vector. Consult the help page for sort() and sort the vector in reverse numerical order. Also sort the row names of meta.

spanishMeta[,4] #Extract the vector of publication dates from meta.
##  [1] 1983 1951 1956 1948 1942 1986 1992 1989 1982 2002 1965 1963 1977 1987
## [15] 1981
PubYear=spanishMeta[,4]
sort(PubYear) #Sort this vector.
##  [1] 1942 1948 1951 1956 1963 1965 1977 1981 1982 1983 1986 1987 1989 1992
## [15] 2002
sort(PubYear, decreasing=TRUE) #Consult the help page for sort() and sort the vector in reverse numerical order.
##  [1] 2002 1992 1989 1987 1986 1983 1982 1981 1977 1965 1963 1956 1951 1948
## [15] 1942
RowNames=rownames(spanishMeta)
sort(RowNames) #Also sort the row names of meta.
##  [1] "1"  "10" "11" "12" "13" "14" "15" "2"  "3"  "4"  "5"  "6"  "7"  "8" 
## [15] "9"
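
Note that the row names are character strings, so sort() orders them as text ("10" comes before "2"). If numerical order is wanted instead, one option is to convert the names to numbers first. A small sketch, using a hypothetical vector `row_names` that mimics the row names of a 15-row data frame:

```r
# Row names are stored as character strings ("1", "2", ..., "15")
row_names <- as.character(1:15)

# Character sort: strings are compared as text, not by numeric value
char_order <- sort(row_names)

# Numeric sort: convert to numbers first, then sort
numeric_order <- sort(as.numeric(row_names))

print(char_order)
print(numeric_order)
```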

Exercise 5 Extract from meta all rows with texts that were published before 1980.

subset(spanishMeta,PubDate<1980) #Extract from meta all rows with texts that were published before 1980.
##    Author YearOfBirth  TextName PubDate Nwords    FullName
## 2       C        1916 X14459gll    1951   3040        Cela
## 3       C        1916 X14460gll    1956   3066        Cela
## 4       C        1916 X14461gll    1948   3044        Cela
## 5       C        1916 X14462gll    1942   3053        Cela
## 11      V        1936 X14472gll    1965   3037 VargasLLosa
## 12      V        1936 X14473gll    1963   3067 VargasLLosa
## 13      V        1936 X14474gll    1977   3020 VargasLLosa

Exercise 6 Calculate the mean publication date for all texts. The arithmetic mean is defined as the sum of the observations in a vector divided by the number of elements in the vector. The length of a vector is provided by the function length(). Recalculate the mean year of publication by means of the functions sum() and length().

mean(spanishMeta$PubDate) #Calculate the mean publication date for all texts.
## [1] 1973.6
sum(spanishMeta$PubDate)/length(spanishMeta$PubDate) #Recalculate the mean year of publication by means of the functions sum() and length().
## [1] 1973.6

Exercise 7 We create a new data frame with fictitious information on each author’s favorite composer with the function data.frame(). Add the information in this new data frame to meta with merge().

composer = data.frame(Author = c("Cela","Mendoza","VargasLLosa"), Favorite = c("Stravinsky", "Bach", "Villa-Lobos"))
merge(spanishMeta,composer, by.x="FullName", by.y="Author") #Add the information in this new data frame to meta with merge().
##       FullName Author YearOfBirth  TextName PubDate Nwords    Favorite
## 1         Cela      C        1916 X14458gll    1983   2972  Stravinsky
## 2         Cela      C        1916 X14459gll    1951   3040  Stravinsky
## 3         Cela      C        1916 X14460gll    1956   3066  Stravinsky
## 4         Cela      C        1916 X14461gll    1948   3044  Stravinsky
## 5         Cela      C        1916 X14462gll    1942   3053  Stravinsky
## 6      Mendoza      M        1943 X14463gll    1986   3013        Bach
## 7      Mendoza      M        1943 X14464gll    1992   3049        Bach
## 8      Mendoza      M        1943 X14465gll    1989   3042        Bach
## 9      Mendoza      M        1943 X14466gll    1982   3039        Bach
## 10     Mendoza      M        1943 X14467gll    2002   3045        Bach
## 11 VargasLLosa      V        1936 X14472gll    1965   3037 Villa-Lobos
## 12 VargasLLosa      V        1936 X14473gll    1963   3067 Villa-Lobos
## 13 VargasLLosa      V        1936 X14474gll    1977   3020 Villa-Lobos
## 14 VargasLLosa      V        1936 X14475gll    1987   3016 Villa-Lobos
## 15 VargasLLosa      V        1936 X14476gll    1981   3054 Villa-Lobos

Pages 45-6

Exercise 1 The data set warlpiri provides information about the use of the ergative case in Lajamanu Warlpiri. Data were elicited for adults and children of various ages. The question of interest is to what extent the use of the ergative case marker is predictable from the animacy of the subject, word order, and the age of the speaker (adult versus child). Explore this data set with respect to this issue by means of a mosaic plot.

ergative.xtabs=xtabs(~AgeGroup+WordOrder+CaseMarking+AnimacyOfSubject, data=warlpiri)
mosaicplot(ergative.xtabs, main="Ergative") #I found that there are many ways to make a mosaic plot (depending on the order of the variables), and that this configuration was most informative.

Exercise 2 In Chapter 1 we created a data frame with mean reaction times and mean base frequencies for neologisms in the Dutch suffix -heid. Reconstruct the data frame heid2. Both reaction times and frequencies are logarithmically transformed. Use exp() to undo these transformations and make a scatterplot of the averaged reaction times (MeanRT) against the frequency of the base (BaseFrequency). Compare this scatterplot with a scatterplot using the log-transformed values.

heid2=aggregate(heid$RT,list(heid$Word),mean)
colnames(heid2)=c("Word","MeanRT")
items=heid[,c("Word","BaseFrequency")]
items=unique(items)
heid2=merge(heid2,items,by.x="Word",by.y="Word")
heid2$ExpMeanRT=exp(heid2$MeanRT)
heid2$ExpBaseFrequency=exp(heid2$BaseFrequency)
par(mfrow=c(1,2))
plot(heid2$ExpBaseFrequency,heid2$ExpMeanRT,xlab="Base Frequency",ylab="Mean Reaction Time")
plot(heid2$BaseFrequency,heid2$MeanRT,xlab="Log Base Frequency",ylab="Log Mean Reaction Time")

Exercise 3 The data set moby is a character vector with the text of Melville’s Moby Dick. In this exercise, we consider whether Zipf’s law holds for Moby Dick. According to Zipf’s law [Zipf, 1949], the frequency of a word is inversely proportional to its rank in a numerically sorted list. The word with the highest frequency has rank 1, the word with the one but highest frequency has rank 2, etc. If Zipf’s law holds, a plot of log frequency against log rank should reveal a straight line. We make a table of word frequencies with table() – we cannot use xtabs(), because moby is a vector and xtabs() expects a data frame – and sort the frequencies in reverse numerical order.

moby.table=table(moby)
moby.table=sort(moby.table,decreasing=TRUE)
moby.table[1:5]
## moby
##   the    of   and     a    to 
## 13717  6512  6008  4551  4514

We now have the word frequencies. We use the colon operator and length(), which returns the length of a vector, to construct the corresponding ranks.

ranks=1:length(moby.table)
ranks[1:5]
## [1] 1 2 3 4 5

Make a scatterplot of log frequency against log rank.

MobyLogFrequency=log(moby.table)
MobyLogRank=log(ranks)
par(mfrow=c(1,1))
plot(MobyLogRank,MobyLogFrequency,xlab="Log Rank",ylab="Log Frequency")
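
Beyond inspecting the scatterplot by eye, one way to make "a straight line" concrete is to fit a line to log frequency as a function of log rank. If frequency is exactly proportional to 1/rank, that line has slope -1. The sketch below illustrates this on idealized, made-up frequencies rather than on the moby data; the constant 10000 and the names `freq`, `fit`, and `zipf_slope` are assumptions for illustration only.

```r
# Idealized frequencies that follow Zipf's law exactly: frequency = C / rank
# (C = 10000 is an arbitrary, hypothetical constant; this is not the moby data)
ranks <- 1:1000
freq  <- 10000 / ranks

# Fit a straight line to log frequency as a function of log rank
fit <- lm(log(freq) ~ log(ranks))

# The slope of the fitted line; -1 (up to rounding) for exact inverse proportionality
zipf_slope <- unname(coef(fit)[2])
print(zipf_slope)
```

The same fit could be tried on the real data with as.numeric(moby.table) in place of `freq`. A slope close to -1 and little curvature in the plot would be consistent with Zipf's law; real texts typically deviate somewhat at the highest and lowest ranks.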

Exercise 4 The column labeled Trial in the data set lexdec specifies, for each subject, the trial number of the responses. For a given subject, the first trial in the experiment has trial number 1, the second has trial number 2, etc. Use xylowess.fnc() to explore the possibility that the subjects proceeded through the experiment in different ways, some revealing effects of learning, and others effects of fatigue.

# Center each reaction time on the mean reaction time for that word (meanRT, averaged over participants).
# A positive value indicates a faster-than-average RT; a negative value indicates a slower-than-average RT.
lexdec$DiffFromMeanRT=lexdec$meanRT-lexdec$RT
# Plot the difference from the mean reaction time over trials, one panel per subject.
# A smoother that rises over trials suggests the subject is getting faster (learning);
# a smoother that falls suggests the subject is slowing down, possibly from fatigue.
# One could also look at accuracy over trials (lexdec has a Correct column), but I did not explore that here.
xylowess.fnc(DiffFromMeanRT~Trial|Subject, data=lexdec)
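
As a rough numerical complement to the per-subject panels, one could fit a separate straight line for each subject and compare the slopes. The sketch below uses a hypothetical toy data frame (`toy`, with invented subjects S1 and S2), not lexdec, and works on raw RT, so the sign convention is the reverse of DiffFromMeanRT: a negative slope means the subject is getting faster over trials, and a positive slope means the subject is getting slower. A straight line is only a crude summary; the lowess smoother can reveal nonlinear trends that a single slope hides.

```r
# Hypothetical toy data: two subjects, ten trials each (not the lexdec data)
toy <- data.frame(
  Subject = rep(c("S1", "S2"), each = 10),
  Trial   = rep(1:10, times = 2),
  RT      = c(6.5 - 0.02 * (1:10),   # S1 gets faster over trials (learning)
              6.3 + 0.03 * (1:10))   # S2 gets slower over trials (fatigue)
)

# Slope of RT over Trial, fitted separately for each subject
slopes <- sapply(split(toy, toy$Subject), function(d) {
  unname(coef(lm(RT ~ Trial, data = d))[2])
})
print(slopes)
```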