2 Main steps
Once the hotwarm.RData file has been downloaded, the easiest way to load the data is by double-clicking the file to open an R session. Alternatively, one may load it from an R console using load(), provided that the .RData is stored in the same working directory with the current R session.
load(file = "hotwarm.RData")
The R workspace consists of a data frame called hotwarm with the following format.
head(hotwarm)
## decade word hot warm freq diff
## 1 1860 admirer 0.00000000 0.4087567 0.4087567 11.0
## 2 1860 affection 0.00000000 0.9343010 0.9343010 25.2
## 3 1860 afternoon 0.23357525 0.2919691 0.5255443 1.0
## 4 1860 air 0.64233194 1.0510886 1.6934206 1.3
## 5 1860 attachment 0.00000000 0.4087567 0.4087567 11.0
## 6 1860 bed 0.05839381 0.5255443 0.5839381 7.1
The decade represents the COHA decades. In the paper, we track the nominal collocational change of hot and warm across fifteen decades, from the 1860s to the 2000s. The hot and warm columns show the normalised co-occurrence frequencies per million words of the nouns (in the word column) with each hot and warm respectively in the [adj + Noun] schema. In other words, we retrieved the R1 collocates of the two adjectives from COHA. The freq column contains the joint frequency of the nouns in the specified schema across hot and warm (i.e., the sum of values in hot and warm). Finally, the diff column uses the values from the “SCORE” column produced by the “compare” feature of COHA (this choice follows the suggestion in Martin Hilpert’s tutorial).
2.1 Base R version of the static motion chart
The steps for generating the static chart with the base R are described below.
- Generate character vectors for the labels of each decade.
decade.label <- as.character(unique(hotwarm$decade))
decade.label
## [1] "1860" "1870" "1880" "1890" "1900" "1910" "1920" "1930" "1940" "1950"
## [11] "1960" "1970" "1980" "1990" "2000"
- Create a character vector containing collocates whose distribution with hot and warm will be traced through. This vector will be used as the selected labels for the bubbles in the scatter plots.
selected.collocs <- c("bath","day","dog","heart","pursuit","smile","spot","water","welcome")
- Subset the data points for the selected collocate-labels defined in step (2) from the
hotwarmdata frame.
dframe.label <- subset(hotwarm, word %in% selected.collocs)
head(dframe.label)
## decade word hot warm freq diff
## 16 1860 day 0.99269482 1.3430577 2.3357525 1.1
## 26 1860 heart 0.05839381 2.3941463 2.4525401 32.2
## 32 1860 pursuit 0.81751338 0.0000000 0.8175134 35.6
## 45 1860 water 5.89777509 1.1094824 7.0072575 6.8
## 47 1860 welcome 0.00000000 0.5255443 0.5255443 14.2
## 54 1870 bath 0.21493636 1.5582886 1.7732249 5.6
- Split the
hotwarmdata frame into a list of subset data frame by decades.
# data frame for all data points
df.main <- split(hotwarm, hotwarm$decade)
# check the type of the data and the length
typeof(df.main)
## [1] "list"
length(df.main)
## [1] 15
- Split also the
dframe.labelproduced in (3) above into a list of data frame per decade, containing the data for the collocates labels.
# data frame for the selected collocates
df.labs <- split(dframe.label, dframe.label$decade)
- Now comes the
forloop to generate the scatter plots for each of the fifteen decades. Readers might do some adjustment from the codes below, such as the limits, or tick-labels, of the x- (xlim) and y-axis (ylim), given the nature of their data. Before running theforloop, we need to define a number of graphical parameters for the ploting (withpar(), cf. 6.1 in the chunk below), especially for the number of row and column in the plot. Thehotwarmdata consists of fifteen decades. So we divide the plot into 5-by-3 matrix (hence,mfrow = c(5, 3)). Details for the list of arguments in thepar()call below can be reviewed by typing?par()in the console.
# 6.1 Define plotting parameter---------
par(mfrow = c(5, 3), # set a 5-by-3 plotting space
cex = 0.425,
mar = c(2.5, 2, 1, 1),
mgp = c(3, 0.75, 0),
tck = -0.025,
las = 1)
# 6.2 "for" loop to generate the sequential scatterplots---------
for (i in seq_along(decade.label)) { # loop sequence is derived from the total number of (15) decades
# preparing empty plot field from the dataset of all decades
plot(df.main[[i]][,3], # column [,3] for the x-axis
df.main[[i]][,4], # column [,4] for the y-axis
type = "n",
xlim = c(-0.5, max(hotwarm$hot)+1),
ylim = c(-0.5, max(hotwarm$warm)+0.5),
yaxt = "n",
xaxt = "n",
ann = F)
# adding major y-axis to the empty plot
axis(side = 2, at = c(0, 1, 2, 3), labels = c(0, 1, 2, 3))
# (OPTIONAL) adding minor y-axis to the empty plot
#axis(side = 2, at = c(1, 3), labels = FALSE, tcl = -0.2)
# adding major x-axis to the empty plot
axis(side = 1, at = c(0, 2, 4, 6, 8), labels = c(0, 2, 4, 6, 8), tick = TRUE)
# (OPTIONAL) adding minor x-axis to the empty plot
#axis(side = 1, at = c(1, 3, 5, 7), labels = FALSE, tcl = -0.2)
# adding gridlines onto the empty plot field
grid(lwd = 1.15)
# adding the decade labels onto the plot
text(4, 3, labels = decade.label[i], col = "grey85", cex = 4.15, font = 2)
# adding the bubble symbol onto the plot from all data points (cf. Kabacoff, 2011, pp. 278-279)
symbols(df.main[[i]][,3],
df.main[[i]][,4],
inches = 0.150,
fg = "white",
bg = "grey85",
circles = sqrt(df.main[[i]][,5]/pi), # column [,5] for the "freq" used as bubble-size input
add = TRUE)
# adding the bubble symbol onto the plot with darker colour for the labelled data points
symbols(df.labs[[i]][,3],
df.labs[[i]][,4],
inches = 0.150,
fg = "white",
bg = "grey65",
circles = sqrt(df.labs[[i]][,5]/pi),
add = TRUE)
# adding the selected collocate labels onto the plot
text(df.labs[[i]][,3],
df.labs[[i]][,4],
labels = df.labs[[i]][,2],
cex = 1.25)
} # end of "for-loop"
Figure 2.1: Changes in the collocational profiles of hot (x-axis) and warm (y-axis), COHA 1860s-2000s (Primahadi-Wijaya-R. & Rajeg, 2014, p. 253).
In order to save the above plot, first set up a .png file (or other formats, such as .jpeg or .pdf) for the resulting plot, before executing the above codes, and define the plot’s resolution and sizes. After that, run the previous codes in 6.1 and 6.2 (i.e., step (6)), and end with dev.off() to close the graphic device and save the plot. See the code-chunk below.
# set up a .png file for the resulting plot, including the plot's resolution and sizes
png("hotwarmPLOT-loop.png",
res = 450,
width = 6.5 * 500,
height = 8.5 * 500)
# Insert here (and then run) the (i) plotting parameter call (6.1) and (ii) the for loop (6.2)
# close the active graphic device to save the plot
dev.off ()
Note that to achieve the desired results in terms of the size and resolution of the plot, a couple of trial-and-errors should be expected.
2.2 The ggplot2 version of the static motion chart
The ggplot2 package offers a facetting feature that is the key to a (much) simpler code for producing the static motion charts. In ggplot2, we can use the facet_wrap() function, which can be directly fed with the decade column and let ggplot2 do the job to split the plots by decades. To run the following codes, readers need to either install ggplot2 package separately or together as a part of a suite of packages called the tidyverse (Wickham & Grolemund, 2017). Optionally, readers might also want to install the ggrepel package that is used here to handle the overlapping labels.
Here are the steps for the ggplot2 version.
- Prepare two data frames: (i) one for the selected collocate/item labels (
df.label), and (ii) one for the remaining items (df.others), which will not be labelled in the plot. Here, we also generate the axes labels.
# get the data for the selected collocates
df.label <- subset(hotwarm, word %in% selected.collocs)
# get the data for the remaining collocates
df.others <- subset(hotwarm, !word %in% selected.collocs)
# x-axis labels
xaxis <- expression(paste("Co-occurrence frequency per million words with ", italic("hot"), sep = ""))
# y-axis labels
yaxis <- expression(paste("Co-occurrence frequency per million words with ", italic("warm"), sep = ""))
- Generate the first
ggplot()call for the non-labelled data and save the output intop.
# load the ggplot2 via tidyverse (or individually) and load the ggrepel
library(ggplot2)
library(ggrepel)
# generate plot for the non-labelled data
p <- ggplot(data = df.others,
aes(x = hot, y = warm)) +
geom_point(aes(size = freq),
alpha = 1/8,
show.legend = FALSE) +
facet_wrap(~decade) # here is the key to split the plot by decades!
- Add the labelled data points to the previous plot (
p) on the basis of thedf.labeldata frame.
p <- p + geom_point(data = df.label,
aes(size = freq),
alpha = 1/2,
colour = "royalblue",
show.legend = FALSE) +
# if you are not using `ggrepel` below, use the standard `geom_text()` function
geom_text_repel(data = df.label,
aes(label = word),
size = 2.5) +
labs(x = xaxis, # insert the `xaxis` data
y = yaxis) + # insert the `yaxis` data
theme_bw()
# call out the plot
p
Figure 2.2: Changes in the nominal collocational profiles of hot and warm, COHA 1860s-2000s (Primahadi-Wijaya-R. & Rajeg, 2014, p. 253).
- Save the plot as
.pngfile usingggsave().
ggsave(file = "hotwarm.png", plot = p, width = 20, height = 20, unit = "cm", dpi = 600)
Note that trial and errors are expected to get the desired results in terms of the size and resolution. For the previous plot, the parameters for width, height, and resolution (i.e., dpi) given above produce a desirable result from our own perspective.
2.3 Recreating previous motion chart in Hilpert (2011)
To recreate the chart (i.e., “Figure 8”) in Hilpert’s (2011) paper, we need to download the data from Martin Hilpert’s motion charts resource page. The data is available as zip file, which contains the mot.RData workspace. After the downloads, open/load the R workspace that contains two data frames: (i) convdata and (ii) compdata. The former is the data for Hilpert’s (2011, p. 444) case study 1 on the “ambicategorical nouns and verbs”. We will instead use the compdata to produce the static motion charts for the English complement-taking predicates (readers can try the codes for themselves with the convdata).
load("mot.RData")
head(compdata)
## x y decade verb freq
## 1 -0.29379995 0.1350332 1860 acknowledge 12.583587
## 2 -0.38929710 0.2318209 1860 admit 34.346505
## 3 -0.54856108 0.3116948 1860 affirm 8.206687
## 4 0.01545954 -0.6818823 1860 appreciate 1.519757
## 5 0.08741028 -0.7591102 1860 await 5.957447
## 6 -0.28720186 0.2074660 1860 believe 178.115502
The x- and y-axis represent the dimensions of the metric Multidimensional Scaling (MDS) from the (dis)similarity measures between the verbs on the basis of their complementation profiles (Hilpert, 2011, pp. 451–452, 454). The freq column indicates the normalised text frequencies of the verbs (Hilpert, 2011, p. 452). The steps to generate the ggplot2 motion charts from the compdata data are as follows.
- As above, define two data frames, one for the labelled items and one for the remainder of the data points. We also store the labels for the axes (cf., Hilpert, 2011, p. 453, for the interpretation given on the x- and y-axes).
# selected complement-taking predicates
predicates <- c("confirm", "await", "enjoy", "demand", "consider", "remember", "hope", "dislike", "want", "try")
# get the data for the selected predicates
df.label <- subset(compdata, verb %in% predicates)
# get the data for the remaining predicates
df.others <- subset(compdata, !verb %in% predicates)
# xaxis label
xaxis <- expression(paste(italic("that"), "-clauses <---------- ----------> ", italic("to"), "-infinitives", sep = ""))
# yaxis label
yaxis <- "nominal compl. <---------- ----------> complex verbal/clausal compl."
- Generate the first
ggplot()call for the non-labelled data and save the output intop.
# generate plot for the non-labelled data
p <- ggplot(data = df.others,
aes(x = x, y = y)) +
geom_point(aes(size = freq),
alpha = 1/8,
show.legend = FALSE) +
facet_wrap(~decade) # here is the key to split the plot by decades!
- Add the labelled data points to the previous plot (
p) on the basis of thedf.labeldata frame.
p <- p + geom_point(data = df.label,
aes(size = freq),
alpha = 1/2,
colour = "royalblue",
show.legend = FALSE) +
# if you are not using `ggrepel` below, use the standard `geom_text()` function
geom_text_repel(data = df.label,
aes(label = verb),
size = 2.5) +
labs(x = xaxis, # insert the `xaxis` data
y = yaxis) + # insert the `yaxis` data
theme_bw()
# call out the plot
p
Figure 2.3: Two-dimensional MDS solution for the English complement-taking predicates, COHA 1860s-2000s (Hilpert, 2011, p. 455, Figure 8).
We can save the plot using the ggsave() function following the code in section 2.2 with the hotwarm data.