This markdown replicates and extends the analysis of double modals (DMs) in US geolocated Twitter data collected between 2013 and 2014 presented in Grieve et al. (2015) at NWAV Toronto. This analysis is to be reported in a paper in prep by me and Cameron Morin that follows up on the NWAV analyses. I’m redoing and extending the code and visualisations here and walking through the steps of this analysis, including extensive informal discussion of the results.
Our basic goal is to use a very large Twitter corpus to describe the use of one of the most elusive grammatical constructions in American English: the DM.
In essence this is a descriptive and exploratory study. There is a fair bit of theory on DMs, but it seems to me fairly superficial, since all it really takes as empirical input is that DMs exist in general (i.e. the theories are just trying to explain or model the existence of DMs, regardless of the pairing) and, to a lesser extent, that certain specific DMs exist. And that is because that is literally all the information anyone had access to: no one has really observed DMs in natural language at scale and in an open way. Detailed information about variation across DMs has therefore basically been lacking – or people have guessed at it, trying to generalise from anecdotal evidence (and as we’ll see, those guesses are often wrong). All we really know is that DMs exist, and so research questions and theories about the nature of DMs have been limited.
So for specific RQs (aside from the methodological one, i.e. can we find them at all? The answer is clearly yes):
The data is based on an analysis of approximately 1 billion geolocated tweets from across the contiguous US that were collected between October 2013 and November 2014 using the Twitter API. The corpus totals 8.9 billion words. Notably, every tweet is geolocated, so we know the precise longitude and latitude of the user when they posted that message (on their smartphone with the geolocation option activated). This allows for the fine-grained analysis of regional variation. Based on this information we stratified the corpus into counties, so that every tweet is linked to a county (or a county equivalent). There are about 3,000 of these in the US (see below for the exact numbers). In this way we can measure and map linguistic features across the US. For more information on this corpus see Huang et al. (2016) and Grieve et al. (2017, 2018).
The dataset analysed in this study was derived from this corpus. To identify DMs we automatically extracted all modal-modal sequences, focusing on the 9 primary modals (can, could, may, might, shall, should, will, would, must) and 2 of the semi-modals (used to, ought to). We ignored all duplicated double modals (e.g. could could), which are surprisingly common, other semi-modals (e.g. gonna), modals followed by have (e.g. woulda), and various other forms (e.g. might as well). This gives (11 × 10) 110 different possible double modal types.
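As a minimal sketch of the combinatorics (illustrative only; this is not the original extraction script):

modals <- c("can", "could", "may", "might", "shall", "should", "will",
            "would", "must", "used to", "ought to")
patterns <- expand.grid(first = modals, second = modals, stringsAsFactors = FALSE)
patterns <- patterns[patterns$first != patterns$second,] # drop duplicated modals (e.g. could could)
nrow(patterns) # 110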
Through this process we identified 10,137 tokens of potential double modals in the corpus. However, when we looked at the Tweets containing these forms, we found that a substantial number were not true double modals. We therefore went through each of the tokens manually to identify true double modals. After removing all problematic tokens, we were left with 5,439 legitimate double modal tokens.
Before we get to maps, we have some data related to the overall frequencies of the forms. There are a few different data sets we’re going to want to load in and generate some basic statistics and graphs for.
We can start with just the overall frequency distribution of the DMs in the corpus.
First we read in the data.
dm_dist <- read.table("DM_DIST.txt", header = TRUE, sep = "\t")
We can take a look at it.
dm_dist
## DM tokens
## 1 Can Could 135
## 2 Can May 21
## 3 Can Might 15
## 4 Can Must 11
## 5 Can Ought to 0
## 6 Can Shall 2
## 7 Can Should 20
## 8 Can Used to 0
## 9 Can Will 111
## 10 Can Would 14
## 11 Could Can 60
## 12 Could May 10
## 13 Could Might 20
## 14 Could Must 0
## 15 Could Ought to 0
## 16 Could Shall 0
## 17 Could Should 19
## 18 Could Used to 0
## 19 Could Will 16
## 20 Could Would 41
## 21 May Can 273
## 22 May Could 53
## 23 May Might 7
## 24 May Must 4
## 25 May Ought to 4
## 26 May Shall 3
## 27 May Should 14
## 28 May Used to 2
## 29 May Will 22
## 30 May Would 14
## 31 Might Can 1733
## 32 Might Could 932
## 33 Might May 13
## 34 Might Must 8
## 35 Might Ought to 32
## 36 Might Shall 1
## 37 Might Should 146
## 38 Might Used to 1
## 39 Might Will 52
## 40 Might Would 243
## 41 Must Can 164
## 42 Must Could 3
## 43 Must May 0
## 44 Must Might 8
## 45 Must Ought to 0
## 46 Must Shall 0
## 47 Must Should 1
## 48 Must Used to 5
## 49 Must Will 5
## 50 Must Would 3
## 51 Ought to Can 0
## 52 Ought to Could 0
## 53 Ought to May 0
## 54 Ought to Might 0
## 55 Ought to Must 0
## 56 Ought to Shall 0
## 57 Ought to Should 1
## 58 Ought to Used to 0
## 59 Ought to Will 0
## 60 Ought to Would 0
## 61 Shall Can 3
## 62 Shall Could 0
## 63 Shall May 0
## 64 Shall Might 0
## 65 Shall Must 2
## 66 Shall Ought to 0
## 67 Shall Should 5
## 68 Shall Used to 0
## 69 Shall Will 3
## 70 Shall Would 1
## 71 Should Can 14
## 72 Should Could 33
## 73 Should May 0
## 74 Should Might 8
## 75 Should Must 12
## 76 Should Ought to 3
## 77 Should Shall 1
## 78 Should Used to 1
## 79 Should Will 16
## 80 Should Would 177
## 81 Used to Can 8
## 82 Used to Could 144
## 83 Used to May 0
## 84 Used to Might 0
## 85 Used to Must 0
## 86 Used to Ought to 0
## 87 Used to Shall 0
## 88 Used to Should 0
## 89 Used to Will 0
## 90 Used to Would 20
## 91 Will Can 111
## 92 Will Could 20
## 93 Will May 26
## 94 Will Might 59
## 95 Will Must 11
## 96 Will Ought to 1
## 97 Will Shall 21
## 98 Will Should 23
## 99 Will Used to 0
## 100 Will Would 80
## 101 Would Can 9
## 102 Would Could 171
## 103 Would May 8
## 104 Would Might 29
## 105 Would Must 4
## 106 Would Ought to 0
## 107 Would Shall 0
## 108 Would Should 81
## 109 Would Used to 40
## 110 Would Will 52
Note that all 110 possible DMs are included in this dataset, but we observe at least one occurrence for only 76 of these forms.
nrow(dm_dist)
## [1] 110
dm_dist_trim <- dm_dist[dm_dist$tokens>0,]
nrow(dm_dist_trim)
## [1] 76
We can also order this by frequency.
dm_dist_trim <- dm_dist_trim[order(dm_dist_trim$tokens, decreasing = TRUE ),]
dm_dist_trim
## DM tokens
## 31 Might Can 1733
## 32 Might Could 932
## 21 May Can 273
## 40 Might Would 243
## 80 Should Would 177
## 102 Would Could 171
## 41 Must Can 164
## 37 Might Should 146
## 82 Used to Could 144
## 1 Can Could 135
## 9 Can Will 111
## 91 Will Can 111
## 108 Would Should 81
## 100 Will Would 80
## 11 Could Can 60
## 94 Will Might 59
## 22 May Could 53
## 39 Might Will 52
## 110 Would Will 52
## 20 Could Would 41
## 109 Would Used to 40
## 72 Should Could 33
## 35 Might Ought to 32
## 104 Would Might 29
## 93 Will May 26
## 98 Will Should 23
## 29 May Will 22
## 2 Can May 21
## 97 Will Shall 21
## 7 Can Should 20
## 13 Could Might 20
## 90 Used to Would 20
## 92 Will Could 20
## 17 Could Should 19
## 19 Could Will 16
## 79 Should Will 16
## 3 Can Might 15
## 10 Can Would 14
## 27 May Should 14
## 30 May Would 14
## 71 Should Can 14
## 33 Might May 13
## 75 Should Must 12
## 4 Can Must 11
## 95 Will Must 11
## 12 Could May 10
## 101 Would Can 9
## 34 Might Must 8
## 44 Must Might 8
## 74 Should Might 8
## 81 Used to Can 8
## 103 Would May 8
## 23 May Might 7
## 48 Must Used to 5
## 49 Must Will 5
## 67 Shall Should 5
## 24 May Must 4
## 25 May Ought to 4
## 105 Would Must 4
## 26 May Shall 3
## 42 Must Could 3
## 50 Must Would 3
## 61 Shall Can 3
## 69 Shall Will 3
## 76 Should Ought to 3
## 6 Can Shall 2
## 28 May Used to 2
## 65 Shall Must 2
## 36 Might Shall 1
## 38 Might Used to 1
## 47 Must Should 1
## 57 Ought to Should 1
## 70 Shall Would 1
## 77 Should Shall 1
## 78 Should Used to 1
## 96 Will Ought to 1
Next let’s plot the data. For the initial submission we can pull the figures right out of this rmd.
purp <- rgb(.2,.15,.57)
red <- rgb(1,.25,0) # We'll reuse these colours
grey <- rgb(.4,.4,.4)
# Barplot of the 76 attested DMs, ordered by frequency
barplot(dm_dist_trim$tokens,
names.arg = dm_dist_trim$DM,
las = 2,
cex.names = .4,
col = red, border = "NA", axes=FALSE)
axis(2,
cex.axis = .8,
las = 1,
col = "white",
col.ticks = "white",
col.axis = grey,
at=c(dm_dist_trim$tokens[1],
dm_dist_trim$tokens[2],
dm_dist_trim$tokens[3],
0))
Might Could is not the most frequent DM in this corpus – not even close. Might Can is far more common. This is very surprising given previous research, which generally highlights Might Could, often saying it is the most common, and which sometimes ignores Might Can altogether. This is a big finding – arguably our biggest – and certainly one we’ll want to build on.
I think we should first acknowledge that this could be a register effect of Twitter. That is presumably what most people who are invested in the idea that Might Could is most common would say. And it is true that frequency distributions don’t always align across registers: most notably, ‘the’ is generally the most common word in English corpora, while ‘I’ is the most common here on Twitter. But I think a register explanation is unlikely, as this is a really large difference between two low-frequency and otherwise comparable forms, whereas ‘the’ and ‘I’ are both very frequent, very different words associated with broad register differences.
But more to the point, regardless of what happens in other contexts, Might Can is clearly the most common DM on Twitter – we need to be clear about that – and that is very surprising given what previous studies have said. And crucially, no one else has ever made this kind of comparative frequency analysis in any other context, at least based on sufficient amounts of natural data. I assume some of the existing data is based on fieldworkers asking informants directly about specific forms – which is totally irrelevant to the question of relative frequency. So there isn’t much of anything to compare against. In many ways this is the first empirical data on which form is most frequent.
We should give examples of Might Can in context here from the corpus, as this is our first chance to really discuss and exemplify what DMs in this corpus look like. I think a series of aligned concordance lines makes sense here (using a monospace font). We’ll want to show what the meaning is and that the meaning is consistent. Giving tweets as examples is somewhat problematic, but for now let’s just use them.
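Something like the following could produce the aligned lines. This is only a sketch: it assumes a hypothetical data frame dm_tweets with one row per hand-checked token and columns text and dm, which is not among the files loaded here.

concord <- function(texts, node, width = 40) {
  # Align each text on the first occurrence of the node phrase.
  pos <- regexpr(node, texts, ignore.case = TRUE)
  texts <- texts[pos > 0]
  pos <- pos[pos > 0]
  left <- formatC(substr(texts, 1, pos - 1), width = width) # right-justified left context
  right <- substr(texts, pos + nchar(node), nchar(texts))
  cat(paste0(left, " [", node, "] ", right), sep = "\n")
}
# concord(dm_tweets$text[dm_tweets$dm == "Might Can"], "might can")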
We need to discuss this prominently right away. And we should note that (reassuringly) Might Could is second. We’ll want to show examples for Might Could too and discuss the meaning, and then we’ll want to discuss the similarity between these two forms, setting the stage for the direct comparison between them that we’ll make later.
Furthermore, we should go through that list of most common DMs, as we discuss in the intro, and explicitly link back to it: noting the other consistencies and inconsistencies along with Might Can. I haven’t looked at this at all honestly. And we shouldn’t make a big deal about ranking at this point, since the token counts are low, so there might not be much to say except broad overall agreement: the usual suspects are all accounted for. But there might be some expected ones that hardly occur at all, or a couple of unexpected ones that are surprisingly high. We should note those, and we should definitely be at least thinking about how any unexpected results might relate to the Might Can finding.
I think this is a second major result. Maybe this is even more surprising than the Might Can result. We need to highlight how many more DM types this is than anyone has thought possible. We need to be clear here that these are valid, hand-checked and interpretable DMs that make sense in context, even though most are exceedingly uncommon. And we need to provide examples to demonstrate that.
This would also be the time to point out some of these extremely rare DMs that have never been discussed in the literature. Really, it would be great if you went through the list, counted them, and provided a full list somehow of the newly observed DMs. And then, again, somewhere in this section demonstrate via examples that these are real uses, with consistently interpretable meanings. So we could show all 10 or however many examples there are for one of these and show what it means. I think that is important.
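A sketch of the counting step. Note that the attested list below is a placeholder I have made up for illustration; the real list needs to be compiled from the previous literature.

# Placeholder list of DMs previously attested in the literature (hypothetical).
attested <- c("Might Could", "Might Can", "Might Should", "Might Would",
              "May Can", "May Could", "Used to Could", "Might Ought to")
new_dms <- dm_dist_trim[!(dm_dist_trim$DM %in% attested),]
nrow(new_dms) # number of newly observed DM types
new_dms$DM    # the full list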
And finally we should note that the distribution is highly sub-linear, aligning broadly at least with a Zipfian distribution, which I think is exactly what we’d expect, except that this long tail of really uncommon and previously unattested DMs is pretty surprising. Really, I think it’s fair to say this result suggests that with enough data you’d see all 110 double modals.
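We can eyeball the Zipfian claim with a quick rank-frequency plot: an approximately straight line on log-log axes is the expected shape.

plot(log(1:nrow(dm_dist_trim)),
     log(dm_dist_trim$tokens),
     xlab = "Log Rank",
     ylab = "Log Frequency")
# Dashed reference line from a simple linear fit on the logged values.
abline(lm(log(dm_dist_trim$tokens) ~ log(1:nrow(dm_dist_trim))), lty = "dashed")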
So far we’ve only looked at the double modals as units, but of course they consist of two modals: first position and second position. So we can also look at patterns there, starting with a heatmap.
First we read in the data: this is an 11-by-11 matrix of modals, showing the frequency of the first modal (rows) combining with the second modal (columns). We did not consider DMs with a repeated modal (because they seem much more likely to be typos or emphasis, and they’re really hard to judge).
dmmat <- as.matrix(read.table("DM_FREQ_SORT.txt", sep="\t", header=TRUE, row.names=1))
dmmat
## Might Can Could Would Will Should May Must Used.to Shall Ought.to
## Might 0 1733 932 243 52 146 13 8 1 1 32
## Can 15 0 135 14 111 20 21 11 0 2 0
## Could 20 60 0 41 16 19 10 0 0 0 0
## Would 29 9 171 0 52 81 8 4 40 0 0
## Will 59 111 20 80 0 23 26 11 0 21 1
## Should 8 14 33 177 16 0 0 12 1 1 3
## May 7 273 53 14 22 14 0 4 2 3 4
## Must 8 164 3 3 5 1 0 0 5 0 0
## Used to 0 8 144 20 0 0 0 0 0 0 0
## Shall 0 3 0 1 3 5 0 2 0 0 0
## Ought to 0 0 0 0 0 1 0 0 0 0 0
Now we can plot that full co-occurrence matrix as a heatmap. I like using image here, but it rotates the image, so I make a rotate function and then I plot the rotated image and add axes.
rotate <- function(x) t(apply(x, 2, rev)) # rotates the matrix 90° clockwise so image() draws it in reading order
image(rotate(dmmat),
col = hcl.colors(7, "Purples2", rev = TRUE),
breaks = c(0,1,5,10,50,100,500,100000),
tck = 0,
axes = TRUE,
frame.plot = TRUE,
xaxt='n',
yaxt = 'n')
axis(side = 2,
at = c(1, .9, .8, .7, .6, .5, .4, .3, .2, .1, 0),
labels = rownames(dmmat),
las = 2,
col.axis = grey,
cex.axis = .8,
tick = FALSE,
line = FALSE)
axis(side = 3,
at = c(0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1),
labels = rownames(dmmat),
las = 2,
col.axis = grey,
cex.axis = .8,
tick = FALSE,
line = FALSE)
I think we also want another heatmap to show the overall frequencies of the modals in the two positions, regardless of the other modal in the pair.
First we get totals for each modal by row (first position) and column (second position).
first <- as.matrix(rowSums(dmmat))
second <- as.matrix(colSums(dmmat))
dmmat2 <- cbind(first, second)
dmmat2
## [,1] [,2]
## Might 3161 146
## Can 329 2375
## Could 166 1491
## Would 394 593
## Will 352 277
## Should 265 310
## May 396 78
## Must 189 52
## Used to 172 49
## Shall 14 28
## Ought to 1 40
And then we can plot like before.
image(rotate(dmmat2),
col = hcl.colors(9, "Purples2", rev = TRUE),
breaks = c(0,1,20,50,100,200,500,1000,2000,100000),
tck = 0,
axes = TRUE,
frame.plot = TRUE,
xaxt='n',
yaxt = 'n')
axis(side = 2,
at = c(1, .9, .8, .7, .6, .5, .4, .3, .2, .1, 0),
labels = rownames(dmmat2),
las = 2,
col.axis = grey,
cex.axis = .8,
tick = FALSE,
line = FALSE)
axis(side = 3,
at = c(.33, .66),
labels = c("Modal 1", "Modal 2"),
las = 1,
col.axis = grey,
cex.axis = .8,
tick = FALSE,
line = FALSE)
(Note: I wouldn’t include this image in the paper)
Specifically, we can calculate the percent of each modal’s occurrences that are in first position (as opposed to second) as follows.
100*dmmat2[,1]/(dmmat2[,1]+dmmat2[,2])
## Might Can Could Would Will Should May Must
## 95.585122 12.167160 10.018105 39.918946 55.961844 46.086957 83.544304 78.423237
## Used to Shall Ought to
## 77.828054 33.333333 2.439024
This graph provides a good picture of it.
plot(log(dmmat2[,2]),
log(dmmat2[,1]),
xlim = c(0, 8),
ylim = c(0, 8),
xlab = "Frequency of Modal in 2nd Position (Logged)",
ylab = "Frequency of Modal in 1st Position (Logged)",
col = grey,
cex.axis = .85,
cex.lab = .85,
col.lab = grey,
col.axis = grey,
bty = "n",xaxt = "n", yaxt = "n", type = "n"
)
text(log(dmmat2[,2]),
log(dmmat2[,1]),
labels = row.names(dmmat2),
col = c(red, purp, purp, purp, red, purp, red, red, red, purp, purp),
cex = .75)
abline(a = 0, b = 1, lty = "dashed", lwd = .5)
axis(side = 1,
lwd = .5,
col.axis = grey,
cex.axis = .75)
axis(side = 2,
lwd = .5,
col.axis = grey,
cex.axis = .75)
box(which = "plot", col = grey, lwd = .5)
Let’s also redo this just looking at types, so ignoring frequency, and seeing whether the modals tend to occur more in one position than the other.
First, we’ll save the data to a new object and then get rid of the ‘to’ to make this easier to split up.
dm_type <- dm_dist_trim
dm_type$DM<-gsub("to ", "", dm_type$DM)
dm_type$DM<-gsub(" to", "", dm_type$DM)
dm_type
## DM tokens
## 31 Might Can 1733
## 32 Might Could 932
## 21 May Can 273
## 40 Might Would 243
## 80 Should Would 177
## 102 Would Could 171
## 41 Must Can 164
## 37 Might Should 146
## 82 Used Could 144
## 1 Can Could 135
## 9 Can Will 111
## 91 Will Can 111
## 108 Would Should 81
## 100 Will Would 80
## 11 Could Can 60
## 94 Will Might 59
## 22 May Could 53
## 39 Might Will 52
## 110 Would Will 52
## 20 Could Would 41
## 109 Would Used 40
## 72 Should Could 33
## 35 Might Ought 32
## 104 Would Might 29
## 93 Will May 26
## 98 Will Should 23
## 29 May Will 22
## 2 Can May 21
## 97 Will Shall 21
## 7 Can Should 20
## 13 Could Might 20
## 90 Used Would 20
## 92 Will Could 20
## 17 Could Should 19
## 19 Could Will 16
## 79 Should Will 16
## 3 Can Might 15
## 10 Can Would 14
## 27 May Should 14
## 30 May Would 14
## 71 Should Can 14
## 33 Might May 13
## 75 Should Must 12
## 4 Can Must 11
## 95 Will Must 11
## 12 Could May 10
## 101 Would Can 9
## 34 Might Must 8
## 44 Must Might 8
## 74 Should Might 8
## 81 Used Can 8
## 103 Would May 8
## 23 May Might 7
## 48 Must Used 5
## 49 Must Will 5
## 67 Shall Should 5
## 24 May Must 4
## 25 May Ought 4
## 105 Would Must 4
## 26 May Shall 3
## 42 Must Could 3
## 50 Must Would 3
## 61 Shall Can 3
## 69 Shall Will 3
## 76 Should Ought 3
## 6 Can Shall 2
## 28 May Used 2
## 65 Shall Must 2
## 36 Might Shall 1
## 38 Might Used 1
## 47 Must Should 1
## 57 Ought Should 1
## 70 Shall Would 1
## 77 Should Shall 1
## 78 Should Used 1
## 96 Will Ought 1
Now we can split the DM column into first and second position at the space.
library(stringr)
temp<-str_split_fixed(dm_type$DM, " ", 2)
dm_type$first <- temp[,1]
dm_type$second <- temp[,2]
dm_type
## DM tokens first second
## 31 Might Can 1733 Might Can
## 32 Might Could 932 Might Could
## 21 May Can 273 May Can
## 40 Might Would 243 Might Would
## 80 Should Would 177 Should Would
## 102 Would Could 171 Would Could
## 41 Must Can 164 Must Can
## 37 Might Should 146 Might Should
## 82 Used Could 144 Used Could
## 1 Can Could 135 Can Could
## 9 Can Will 111 Can Will
## 91 Will Can 111 Will Can
## 108 Would Should 81 Would Should
## 100 Will Would 80 Will Would
## 11 Could Can 60 Could Can
## 94 Will Might 59 Will Might
## 22 May Could 53 May Could
## 39 Might Will 52 Might Will
## 110 Would Will 52 Would Will
## 20 Could Would 41 Could Would
## 109 Would Used 40 Would Used
## 72 Should Could 33 Should Could
## 35 Might Ought 32 Might Ought
## 104 Would Might 29 Would Might
## 93 Will May 26 Will May
## 98 Will Should 23 Will Should
## 29 May Will 22 May Will
## 2 Can May 21 Can May
## 97 Will Shall 21 Will Shall
## 7 Can Should 20 Can Should
## 13 Could Might 20 Could Might
## 90 Used Would 20 Used Would
## 92 Will Could 20 Will Could
## 17 Could Should 19 Could Should
## 19 Could Will 16 Could Will
## 79 Should Will 16 Should Will
## 3 Can Might 15 Can Might
## 10 Can Would 14 Can Would
## 27 May Should 14 May Should
## 30 May Would 14 May Would
## 71 Should Can 14 Should Can
## 33 Might May 13 Might May
## 75 Should Must 12 Should Must
## 4 Can Must 11 Can Must
## 95 Will Must 11 Will Must
## 12 Could May 10 Could May
## 101 Would Can 9 Would Can
## 34 Might Must 8 Might Must
## 44 Must Might 8 Must Might
## 74 Should Might 8 Should Might
## 81 Used Can 8 Used Can
## 103 Would May 8 Would May
## 23 May Might 7 May Might
## 48 Must Used 5 Must Used
## 49 Must Will 5 Must Will
## 67 Shall Should 5 Shall Should
## 24 May Must 4 May Must
## 25 May Ought 4 May Ought
## 105 Would Must 4 Would Must
## 26 May Shall 3 May Shall
## 42 Must Could 3 Must Could
## 50 Must Would 3 Must Would
## 61 Shall Can 3 Shall Can
## 69 Shall Will 3 Shall Will
## 76 Should Ought 3 Should Ought
## 6 Can Shall 2 Can Shall
## 28 May Used 2 May Used
## 65 Shall Must 2 Shall Must
## 36 Might Shall 1 Might Shall
## 38 Might Used 1 Might Used
## 47 Must Should 1 Must Should
## 57 Ought Should 1 Ought Should
## 70 Shall Would 1 Shall Would
## 77 Should Shall 1 Should Shall
## 78 Should Used 1 Should Used
## 96 Will Ought 1 Will Ought
We can count the number of types with each modal in first and second position and compute the first-position percentage.
results <- list()
results$first <- table(dm_type$first)
results$second <-table(dm_type$second)
results$perc <-100*results$first/(results$first+results$second)
results
## $first
##
## Can Could May Might Must Ought Shall Should Used Will Would
## 8 6 10 10 7 1 5 9 3 9 8
##
## $second
##
## Can Could May Might Must Ought Shall Should Used Will Would
## 9 8 5 7 7 4 5 9 5 8 9
##
## $perc
##
## Can Could May Might Must Ought Shall Should
## 47.05882 42.85714 66.66667 58.82353 50.00000 20.00000 50.00000 50.00000
## Used Will Would
## 37.50000 52.94118 47.05882
Also, let’s just repeat this with only the DMs with at least ten tokens.
dm_type_ten <- dm_type[dm_type$tokens > 9,]
results_ten<-c()
temp <- table(dm_type_ten$first)
# We have to pad manually because Ought to and Shall don't occur in first position here
results_ten$first[1:5] <- temp[1:5]
results_ten$first[6:7] <- 0
results_ten$first[8:11] <- temp[6:9]
results_ten$second <-table(dm_type_ten$second)
results_ten$perc <-100*results_ten$first/(results_ten$first+results_ten$second)
results_ten
## $first
## [1] 7 6 5 7 1 0 0 5 2 8 5
##
## $second
##
## Can Could May Might Must Ought Shall Should Used Will Would
## 6 7 4 4 3 1 1 6 1 6 7
##
## $perc
##
## Can Could May Might Must Ought Shall Should
## 53.84615 46.15385 55.55556 63.63636 25.00000 0.00000 0.00000 45.45455
## Used Will Would
## 66.66667 57.14286 41.66667
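Incidentally, a tidier way to get those zero counts (a sketch, equivalent to the manual padding above) is to coerce the position columns to factors with the full set of 11 modal levels before tabling:

lv <- sort(unique(dm_type$first)) # all 11 modals occur in first position somewhere
results_ten2 <- list()
results_ten2$first <- table(factor(dm_type_ten$first, levels = lv))
results_ten2$second <- table(factor(dm_type_ten$second, levels = lv))
results_ten2$perc <- 100 * results_ten2$first / (results_ten2$first + results_ten2$second)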
And then we can plot those type counts like before…
plot(as.numeric(results_ten$second),
as.numeric(results_ten$first),
xlim = c(0, 8),
ylim = c(0, 8),
xlab = "Types with Modal in 2nd Position",
ylab = "Types with Modal in 1st Position",
col = grey,
cex.axis = .85,
cex.lab = .85,
col.lab = grey,
col.axis = grey,
bty = "n",xaxt = "n", yaxt = "n", type = "n")
text(as.numeric(results_ten$second),
as.numeric(results_ten$first),
cex = .75,
labels = c("Can", "Could", "May", "Might", "Must", "Ought to",
"Shall", "Should", "Used to", "Will", "Would"),
col = c(red, purp, red, red, purp, purp, purp, purp, red, red, purp))
abline(a = 0, b = 1, lty = "dashed", lwd = .5)
axis(side = 1,
lwd = .5,
col.axis = grey,
cex.axis = .75)
axis(side = 2,
lwd = .5,
col.axis = grey,
cex.axis = .75)
box(which = "plot", col = grey, lwd = .5)
And lastly a heatmap to show their overall frequencies.
dmmat3 <- as.matrix(dmmat2[,1]+dmmat2[,2])
image(rotate(dmmat3),
col = hcl.colors(9, "Purples2", rev = TRUE),
breaks = c(0,1,20,50,100,200,500,1000,2000,100000),
tck = 0,
axes = TRUE,
frame.plot = TRUE,
xaxt='n',
yaxt = 'n')
axis(side = 2,
at = c(1, .9, .8, .7, .6, .5, .4, .3, .2, .1, 0),
labels = rownames(dmmat3),
las = 2,
col.axis = grey,
cex.axis = .8,
tick = FALSE,
line = FALSE)
axis(side = 3,
at = c(.5),
labels = c("Overall"),
las = 1,
col.axis = grey,
cex.axis = .8,
tick = FALSE,
line = FALSE)
(Note: I wouldn’t include this image in the paper either)
And, better, a barplot.
barplot(dmmat3[,1],
cex.names = .7,
col = red, border = "NA", axes=FALSE)
axis(2,
cex.axis = .7,
las = 1,
col = "white",
col.ticks = "white",
col.axis = grey,
at=c(dmmat3[1,1],
dmmat3[2,1],
dmmat3[3,1],
dmmat3[4,1],
dmmat3[5,1],
dmmat3[8,1],
0))
Lastly, we can plot the overall structure as a network, in the style of a Markov chain. It’s really just an alternative way of representing the main co-occurrence matrix, but I think it gives us a different perspective, showing which modals are most often linked.
First we’ll need a network graphing library.
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
We’ll keep working with the same data matrix, but to make the network diagram easier to read and to focus in on the main relationships we’ll strip away pairs that are relatively infrequent (fewer than 50 hits) by just replacing their counts with 0. We could play around with this cutoff.
dmmat4 <- dmmat
dmmat4[dmmat4 < 50] <- 0
dmmat4
## Might Can Could Would Will Should May Must Used.to Shall Ought.to
## Might 0 1733 932 243 52 146 0 0 0 0 0
## Can 0 0 135 0 111 0 0 0 0 0 0
## Could 0 60 0 0 0 0 0 0 0 0 0
## Would 0 0 171 0 52 81 0 0 0 0 0
## Will 59 111 0 80 0 0 0 0 0 0 0
## Should 0 0 0 177 0 0 0 0 0 0 0
## May 0 273 53 0 0 0 0 0 0 0 0
## Must 0 164 0 0 0 0 0 0 0 0 0
## Used to 0 0 144 0 0 0 0 0 0 0 0
## Shall 0 0 0 0 0 0 0 0 0 0 0
## Ought to 0 0 0 0 0 0 0 0 0 0 0
After doing that, the rows and columns for Shall and Ought to are all zeroes, so we can drop those altogether.
dmmat4 <- dmmat4[-10:-11,-10:-11]
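Equivalently (a sketch), we could find and drop the all-zero rows and columns programmatically rather than by index:

nz <- rowSums(dmmat4) > 0 | colSums(dmmat4) > 0 # keep modals active in either position
dmmat4 <- dmmat4[nz, nz]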
And then we can convert it to a graph object and plot it (note: there is a line from Might to Should which can be obscured; we could position these nodes ourselves).
IG <- graph_from_adjacency_matrix(dmmat4, weighted=TRUE, mode = "directed")
par(mar=c(.0,0,0,0))
plot.igraph(IG,
vertex.size = 24,
vertex.color=red,
vertex.shape="circle",
vertex.frame.color=red,
vertex.frame.lwd=1,
vertex.label.color="black",
vertex.label.family="sans",
vertex.label.cex=.75,
edge.color="black",
edge.width=.8,
edge.arrow.size=.5)
Note that I’ve simplified here on purpose: there is no attempt to represent the frequencies aside from the cutting of less frequent nodes and links. Note also that the arrows only represent order (which we assume matters, e.g. we don’t conflate Might Could and Could Might), so there is nothing here at odds with CxG. By using the arrows, however, we can cut the number of nodes in half, which makes this much more readable than if we had separate nodes for first and second position. This also lets the visualisation complement the heatmap, which separates out the modals by position. Specifically, I think it does a good job of highlighting which modals are really central to this phenomenon (regardless of position): might, could, and can are the core here.
Also, I really think we should go through here and provide examples (say 10 each) for the top 10 or so double modals to really establish their meanings. Maybe a few others? It would be good to see how each modal works in first position, assuming DMs are basically adding a modal in front of a modal. People don’t know this, and it is important both in general and to our argument (e.g. Might Can vs. Might Could) later on.
Just looking at these top 10 myself, some have more transparent meanings to me than others, although I’m clearly coming to this with a lot of pre-existing knowledge and biases. Might __ and May __ are easy for me to understand (I think), just by replacing either with ‘Maybe’. Must __ seems fine too (replace with ‘Necessarily’?), as well as Used to __ (‘Once’) and Can __ (‘Certainly’). But I don’t really know how to read Should Would or Would Could.
The main observation here is, I think, that no matter how you look at it, which modals are involved in DMs is not just random or evenly distributed. We need to explain that.
I’ll break it down along several lines, but let me first say I’m not really sure how these patterns relate to what has been said in the formal literature (we should look into that and discuss it) or how they might tie in with CxG considerations. Here is my quick, general descriptive take on what’s happening, writing out my thoughts as I think them through…
At the most basic level, Might, Can, and Could are doing most of the work. Not just Might Can and Might Could, although those are the vast majority of hits, but also Might being followed by other modals (esp. Would, Will) and Can and Could being preceded by other modals.
I’m not totally sure what to make of all this. But clearly the frequencies of the individual modals aren’t predicting all of it. It’s not that the modals just combine freely (although, as I said, it looks like basically any combo is possible): the main constraint on the DM frequencies IS NOT just the frequencies of the modals individually.
We should definitely run the overall frequencies of the individual modals for our corpus, but in Longman (page 486) it’s:

Whereas we have this over all DMs and both positions:
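We can compute that directly from the matrix we already have – the percent of all DM tokens that involve each modal in either position:

sort(round(100 * (rowSums(dmmat) + colSums(dmmat)) / sum(dmmat), 1),
     decreasing = TRUE)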
So, the relatively low use of Would and especially Will in DMs (in either position) is really notable!
Why don’t these two forms participate more in DMs?
And then there is the massive rise of Might. And so along with Can and Could, why do these participate so much more?
So that is the initial finding here: the balance between which modals participate in DMs is not what we’d expect.
So the question is why?
I think this all implies that there are either grammatical or more likely semantic/pragmatic/functional reasons for this hierarchy.
I’m not sure this can be explained just on the basis of the characteristics of the different modals: it seems to me we’ll need to look at the combinations/positions as well. But let’s start with the individual modals, and then move on to the combinations. First, though, I need at least to review modal classifications.
I think we need to review this information in the literature review probably. I think what I’m writing is all right, but you should check/correct/extend it – esp. because some of this I’m grabbing from wikipedia.
First, modals are generally grouped into so-called present-preterite pairs, although the preterite forms aren’t really preterite; they are more subjunctive, expressing the speaker’s attitude, as if signalling that they have a choice in the matter but there is doubt, whereas the present forms are more neutral.
There are also various ways of classifying modals into larger categories.
For example, according to Longman, there are three types of modals, notably splitting Should and Shall, but none of the other pairs: Possibility/permission/ability (can, could, may, might), Obligation/necessity (must, should) and Volition/prediction (will, would, shall).
I mean, these don’t seem airtight to me: e.g. ‘He should win the race’ or ‘I think he can do it’ seem like prediction, ‘He will have to do it’ seems like necessity, and ‘He could do it last week’ seems like none of them. I feel like the distinction between the Possibility modals and the others is maybe the clearest.
There is also the traditional distinction between Epistemic modals (concerning the possibility of a proposition being true or untrue) and Deontic modals (concerning the possibility of being free to act), as well as, sometimes, Dynamic modals (which are seen as a sub-type of Deontic modals, where it’s the individual’s own will to act).
But my understanding is that any given modal can often be used either way, so to make use of these categories we’d need to code individual uses, and here those uses would be inside DMs, so it’s not even clear how to do this. Honestly, I don’t find the categories very clear either, but they do seem useful.
With all that said, a few notable things about the individual modals involved in the DMs we observe.
Might, our top modal, is according to Wikipedia the only modal that is exclusively epistemic. Thinking about it I’m not so sure, but I guess it makes sense in general: it’s kind of the modal marking generic, straightforward doubt about whether something is true or not.
And then Can and Could, as well as Would (which is in fourth place overall), are all used epistemically as well, and are also the only modals listed as having dynamic meaning (although I have seen Will listed as having dynamic meaning too, which would totally shoot this explanation down).
The top 3 individual modals (Might, Can, Could) are all from the Possibility type, as are all the modals in the top 3 DMs (Might Can, Might Could and May Can).
Would appears in the next 3 top DMs, and Would and Will are the next two most common modals overall; both are from the Prediction type.
Therefore the Necessity type is the least active in DMs, although Must and Should do participate in DMs in the top 10 (Should Would, Must Can, Might Should).
So, overall I think there is at least possibly something here around DMs primarily involving the expression, or maybe even the adding, of possibility and/or epistemic modality – but I think really figuring that out would require looking at examples of the DMs in context.
We also need to look into the patterns of how the modals combine: that may help us explain why certain modals are more or less common in DMs in general, and it’s also a result in and of itself that we should try to explain.
First, let’s look at just the positions of the modals, taking frequencies into consideration.
Some initial observations:

- Most modals strongly favour one position in DMs
- But Would, Should and Will are about 50/50

So, in other words, once again, this is far from random.
So these results basically let us cluster the modals into three groups of three, based on where they tend to occur (ignoring the semi-modals):
Now, coding those three groups in terms of our two classification systems, we get:
- preSent (S) / preTerite (T): group 1: T, S, S; group 2: S, T, S; group 3: S, T, T
- pOssibility (O) / Necessity (N) / pRediction (R): group 1: O, O, N; group 2: O, O, R; group 3: R, R, N
And I don’t really know how to code Epistemic/Deontic, but it’s notable that the less Deontic ones (Might, Could, Would) are spread across all three groups.
So overall, not much here!
The only real pattern I see is that the Possibility modals, which are the most frequent, tend to be more fixed in place.
A possible issue here, depending on what exactly we want, is that we’re taking frequency into account, so Might Can and Might Could alone are largely driving these results.
So let’s next consider the types, ignoring their frequencies. If we look at all types that occur at least once, classifying modals in the 45%–55% range as Neutral, and ignoring Ought to, Used to and Shall, which are uncommon, we get:
And then looking just at DM types with token counts of at least 10:
Taking this altogether, we get a hierarchy here, running roughly from first-position dominant through neutral to second-position dominant, focusing on the top 7, with Must being a bit less clear than the others.
Looking at this now just impressionistically, what I see is something like a cline based on whether there is a statement being made about the possibility of a future event.
So ‘I Might/May/Will/Must go to work’ are all referring to the specific possibility of going to work in a specific instance in the future, ordered roughly from least to most likely to happen. They are all claims about whether a specific event will or won’t happen – claims which can be true or not true. I guess these are all more epistemic?
But ‘I Can/Should/Could/Would go to work’ are all speaking in the hypothetical in a way; there is more missing information being signalled, like it’s up to the subject in different ways – more of a judgment, I guess, on their freedom to act, so more deontic/dynamic?
So overall it feels like maybe there is something of a pattern here, where more Epistemic/Possibility/Subjunctive modals tend to go in first position and more Deontic modals tend to go in second. I guess that is the main result/explanation here. At least possibly.
It would be good to think that through a bit. I’m not sure I totally get the Epistemic vs. Deontic distinction, but this seems very important. We also might want to look at examples from the most common DMs and see if the meaning shakes out this way in the DMs themselves, since most modals can have both Epistemic and Deontic meanings individually in context.
Next, we can look at which modals tend to pair together.
Some observations about pairing, including the workings of the Present-Preterite and the Possibility-Necessity-Prediction systems (see the sketch after this list):

- Aside from Might and May, the modals that most often come before Can (Will, Must) and Could (Would, Used to) are different.
- We also get both Can Could and Could Can relatively frequently.
- Might Can is the most common DM and splits the Present-Preterite distinction, whereas Might Could is Preterite-Preterite and May Can is Present-Present.
- Combining present-preterite pairs is fairly uncommon, but they do occur: Can Could (135), Will Would (80), Could Can (60), Would Will (52), Might May (13), May Might (7), Shall Should (5), Should Shall (1). The relatively high frequency of Will Would and Would Will is maybe especially notable, given the otherwise low participation of these two frequent modals in DMs.
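As flagged above, we can pull those present-preterite pairings straight out of the co-occurrence matrix, as a sanity check on the counts:

pairs <- rbind(c("Can", "Could"), c("Will", "Would"),
               c("May", "Might"), c("Shall", "Should"))
for (i in 1:nrow(pairs)) {
  cat(pairs[i,1], pairs[i,2], "=", dmmat[pairs[i,1], pairs[i,2]],
      "|", pairs[i,2], pairs[i,1], "=", dmmat[pairs[i,2], pairs[i,1]], "\n")
}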
We can split up the top 10 in terms of these two systems (note: Can and May on the one hand and Could and Might on the other come out the same in this system; also note: since basically all of these DMs can occur, we can at best describe which ones tend to occur or not occur).
So from this we can say:
So overall it seems like Preterite/Subjunctive and Possibility modals are favoured, regardless of position, and that Present/Preterite agreement is generally maintained.
That might suggest the goal is most often to attach a possibility meaning on top of a base modal meaning, but we get 4 Poss-Poss DMs, including the top 3, and these don’t all feel like simple emphasis: each modal brings a distinct meaning in an additive way, I guess just as you’d expect.
So I think overall what we see is that DMs often involve specifying or complicating the expression of possibility. I don’t know. Maybe there is more here?
I guess my general conclusion would be that, broadly speaking, DMs are used to express more complex modalities in an additive way, where the first modal modifies the meaning of the second in an adverbial-like way. They seem especially suited to expressing fine-grained modal meaning, as opposed to just emphasis, and especially to extending meanings involving possibility/epistemic/preterite-subjunctive knowledge – specifically, for expressing nuance about the outlook or opinion of the speaker on the proposition (are DMs used mostly with personal pronouns/proper nouns??). The first modal is more often than not of one of those types (possibility etc.), and the second is more often deontic/prediction etc.
Can you think about this a bit – assess what I’m saying critically (maybe I’m wrong), see what else you can pull from the analysis, and see if you can glean more from an analysis of the DMs in context (maybe I’ve missed something)? Can you clean up the argument and the support for the argument, and write this up?
Also, it may be that there are two different patterns here: a Southern pattern and a rest-of-the-US pattern, or a Black pattern and a White pattern – at least among the most frequent/Southeastern DMs. In that case we would expect things to be somewhat unclear. We might note that. We’ll definitely follow it up below via mapping.
Next let’s move on to mapping.
I’m going to do this in old school library(sp) style because I’m better with that. Nowadays you should probably learn to do this using library(sf), which is better integrated with the tidyverse. I’ve been switching over, but I’m more confident this way and don’t want to make mistakes.
library(maps)
library(sp)
library(maptools)
## Checking rgeos availability: TRUE
library(classInt)
First we can read in the data.
dm_maps <- as.data.frame(read.table("REG_MATRIX_DM_NORM.txt",
sep=",",
header=TRUE))
summary(dm_maps)
## fips STATE NAME LONG
## Min. : 1001 Length:3075 Length:3075 Min. :-124.14
## 1st Qu.:19030 Class :character Class :character 1st Qu.: -98.09
## Median :29179 Mode :character Mode :character Median : -90.40
## Mean :30461 Mean : -91.80
## 3rd Qu.:45050 3rd Qu.: -83.57
## Max. :56045 Max. : -67.65
## LAT CAN_COULD CAN_MAY CAN_MIGHT
## Min. :25.53 Min. : 0.00 Min. : 0.000 Min. : 0.0000
## 1st Qu.:34.63 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.0000
## Median :38.37 Median : 0.00 Median : 0.000 Median : 0.0000
## Mean :38.29 Mean : 13.89 Mean : 3.945 Mean : 0.7633
## 3rd Qu.:41.73 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.0000
## Max. :48.82 Max. :11639.00 Max. :4232.000 Max. :1455.0000
## CAN_MUST CAN_SHALL CAN_SHOULD CAN_WILL
## Min. : 0.0000 Min. : 0.0000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 0.0000 Median : 0.0000 Median : 0.000 Median : 0.00
## Mean : 0.6904 Mean : 0.1538 Mean : 1.161 Mean : 14.71
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.00
## Max. :925.0000 Max. :464.0000 Max. :1159.000 Max. :7750.00
## CAN_WOULD COULD_CAN COULD_MAY COULD_MIGHT
## Min. : 0.0000 Min. : 0.00 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 Median : 0.0000 Median : 0.000
## Mean : 0.4959 Mean : 7.52 Mean : 0.6732 Mean : 4.193
## 3rd Qu.: 0.0000 3rd Qu.: 0.00 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :472.0000 Max. :4438.00 Max. :954.0000 Max. :10962.000
## COULD_SHOULD COULD_WILL COULD_WOULD MAY_CAN
## Min. : 0.0000 Min. : 0.0000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 0.0000 Median : 0.0000 Median : 0.000 Median : 0.00
## Mean : 0.5958 Mean : 0.4397 Mean : 7.606 Mean : 48.65
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.00
## Max. :570.0000 Max. :451.0000 Max. :6151.000 Max. :19757.00
## MAY_COULD MAY_MIGHT MAY_MUST MAY_OUGHT_TO
## Min. : 0.00 Min. : 0.0000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.00 Median : 0.0000 Median : 0.0000 Median : 0.000
## Mean : 13.54 Mean : 0.2751 Mean : 0.7259 Mean : 0.718
## 3rd Qu.: 0.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :9509.00 Max. :430.0000 Max. :2072.0000 Max. :934.000
## MAY_SHALL MAY_SHOULD MAY_USED_TO MAY_WILL
## Min. : 0.000 Min. : 0.0000 Min. : 0.00000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0000 Median : 0.00000 Median : 0.0000
## Mean : 1.113 Mean : 0.8852 Mean : 0.03967 Mean : 0.8439
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :3378.000 Max. :1058.0000 Max. :63.00000 Max. :461.0000
## MAY_WOULD MIGHT_CAN MIGHT_COULD MIGHT_MAY
## Min. : 0.000 Min. : 0.0 Min. : 0.0 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0 Median : 0.0 Median : 0.0000
## Mean : 2.834 Mean : 324.8 Mean : 174.6 Mean : 0.3236
## 3rd Qu.: 0.000 3rd Qu.: 0.0 3rd Qu.: 0.0 3rd Qu.: 0.0000
## Max. :4829.000 Max. :243368.0 Max. :33218.0 Max. :325.0000
## MIGHT_MUST MIGHT_OUGHT_TO MIGHT_SHALL MIGHT_SHOULD
## Min. : 0.0000 Min. : 0.000 Min. : 0.00000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.00
## Median : 0.0000 Median : 0.000 Median : 0.00000 Median : 0.00
## Mean : 0.2504 Mean : 7.798 Mean : 0.00878 Mean : 24.61
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.00000 3rd Qu.: 0.00
## Max. :330.0000 Max. :10880.000 Max. :27.00000 Max. :6947.00
## MIGHT_USED_TO MIGHT_WILL MIGHT_WOULD MUST_CAN
## Min. : 0.0000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.0000 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 0.0787 Mean : 11.75 Mean : 75.03 Mean : 20.47
## 3rd Qu.: 0.0000 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :242.0000 Max. :18419.00 Max. :19487.00 Max. :4619.00
## MUST_COULD MUST_MIGHT MUST_SHOULD MUST_USED_TO
## Min. : 0.00000 Min. : 0.0000 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.000
## Median : 0.00000 Median : 0.0000 Median : 0.00000 Median : 0.000
## Mean : 0.03837 Mean : 0.7301 Mean : 0.02797 Mean : 1.886
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 0.000
## Max. :70.00000 Max. :789.0000 Max. :86.00000 Max. :4775.000
## MUST_WILL MUST_WOULD OUGHT_TO_SHOULD SHALL_CAN
## Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.00000 Median : 0.00000 Median : 0.0000
## Mean : 0.4598 Mean : 0.09236 Mean : 0.00878 Mean : 0.2091
## 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :1105.0000 Max. :227.00000 Max. :27.00000 Max. :449.0000
## SHALL_MUST SHALL_SHOULD SHALL_WILL SHALL_WOULD
## Min. : 0.00000 Min. : 0.0000 Min. : 0.00000 Min. : 0.00000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000
## Median : 0.00000 Median : 0.0000 Median : 0.00000 Median : 0.00000
## Mean : 0.02959 Mean : 0.2491 Mean : 0.08715 Mean : 0.03382
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 0.00000
## Max. :67.00000 Max. :423.0000 Max. :231.00000 Max. :104.00000
## SHOULD_CAN SHOULD_COULD SHOULD_MIGHT SHOULD_MUST
## Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.0000 Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 0.1568 Mean : 5.192 Mean : 2.682 Mean : 1.524
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :95.0000 Max. :5866.000 Max. :2239.000 Max. :4023.000
## SHOULD_OUGHT_TO SHOULD_SHALL SHOULD_USED_TO SHOULD_WILL
## Min. : 0.0000 Min. : 0.00000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00000 Median : 0.000 Median : 0.000
## Mean : 0.2361 Mean : 0.03415 Mean : 0.905 Mean : 4.377
## 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :656.0000 Max. :105.00000 Max. :2783.000 Max. :8342.000
## SHOULD_WOULD USED_TO_CAN USED_TO_COULD USED_TO_WOULD
## Min. : 0.00 Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.0000 Median : 0.00 Median : 0.000
## Mean : 22.28 Mean : 0.9076 Mean : 23.18 Mean : 5.225
## 3rd Qu.: 0.00 3rd Qu.: 0.0000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :12912.00 Max. :2052.0000 Max. :10622.00 Max. :4829.000
## WILL_CAN WILL_COULD WILL_MAY WILL_MIGHT
## Min. : 0.0 Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.0 Median : 0.000 Median : 0.0000 Median : 0.000
## Mean : 20.2 Mean : 1.635 Mean : 0.7197 Mean : 4.989
## 3rd Qu.: 0.0 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :30470.0 Max. :1760.000 Max. :361.0000 Max. :3991.000
## WILL_MUST WILL_OUGHT_TO WILL_SHALL WILL_SHOULD
## Min. :0 Min. : 0.000000 Min. : 0.0000 Min. : 0.000
## 1st Qu.:0 1st Qu.: 0.000000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median :0 Median : 0.000000 Median : 0.0000 Median : 0.000
## Mean :0 Mean : 0.009431 Mean : 0.7912 Mean : 1.376
## 3rd Qu.:0 3rd Qu.: 0.000000 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :0 Max. :29.000000 Max. :650.0000 Max. :558.000
## WILL_WOULD WOULD_CAN WOULD_COULD WOULD_MAY
## Min. : 0.000 Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.000 Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 5.626 Mean : 0.4859 Mean : 17.6 Mean : 2.112
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 0.000
## Max. :5578.000 Max. :700.0000 Max. :4112.0 Max. :3599.000
## WOULD_MIGHT WOULD_MUST WOULD_SHOULD WOULD_USED_TO
## Min. : 0.000 Min. : 0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.0000 Median : 0.000 Median : 0.000
## Mean : 2.989 Mean : 0.7184 Mean : 8.794 Mean : 2.125
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :1843.000 Max. :1722.0000 Max. :7207.000 Max. :1589.000
## WOULD_WILL
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 7.964
## 3rd Qu.: 0.000
## Max. :12379.000
dim(dm_maps)
## [1] 3075 81
So basically what we’ve got here are 5 metadata variables and 76 DM variables, which show the relative frequency (per billion words) of each DM across the 3075 counties/county equivalents in the contiguous US. This is the data we’ll map.
Note that, as expected, the data is super skewed.
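Just to put a rough number on that skew, here is the proportion of counties with a zero rate for the two most frequent DMs:

mean(dm_maps$MIGHT_CAN == 0)
mean(dm_maps$MIGHT_COULD == 0)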
The metadata is the county FIPS code, state, county name, and the centroid longitude and latitude for that county.
NOTE: Will Must has 0 counts here, despite a count of 11 overall. I’m not totally sure what is happening, especially when other DMs with overall counts of 1 don’t sum to 0 here. I guess it’s possible that the 11 tokens are all in very big counties.
summary(dm_maps$WILL_MUST)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
summary(dm_maps$SHOULD_SHALL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03415 0.00000 105.00000
So we can drop it, leaving 75 maps. WE NEED TO CHECK THIS. IN GENERAL, WE SHOULD BE WORKING WITH THE RAW FREQUENCIES BY COUNTY, NOT THE RELATIVE FREQUENCIES BY COUNTY, WHEN COMPUTING STATE AVERAGES.
dm_maps <- dm_maps[,c(1:68,70:ncol(dm_maps))]
dim(dm_maps)
## [1] 3075 80
Aside from this, there are some other issues with this data: the order is weird, and the rows aren’t coded with the county names used by the mapping approach we’ll take.
tail(dm_maps[2:3], 40)
## STATE NAME
## 3036 colorado kit carson
## 3037 colorado lake
## 3038 colorado la plata
## 3039 colorado larimer
## 3040 colorado las animas
## 3041 colorado lincoln
## 3042 colorado logan
## 3043 colorado mesa
## 3044 colorado mineral
## 3045 colorado moffat
## 3046 colorado montezuma
## 3047 colorado montrose
## 3048 colorado morgan
## 3049 colorado otero
## 3050 colorado ouray
## 3051 colorado park
## 3052 colorado phillips
## 3053 colorado pitkin
## 3054 colorado prowers
## 3055 colorado pueblo
## 3056 colorado rio blanco
## 3057 colorado rio grande
## 3058 colorado routt
## 3059 colorado saguache
## 3060 colorado san juan
## 3061 colorado san miguel
## 3062 colorado sedgwick
## 3063 colorado summit
## 3064 colorado teller
## 3065 colorado washington
## 3066 colorado weld
## 3067 colorado yuma
## 3068 connecticut fairfield
## 3069 connecticut hartford
## 3070 connecticut litchfield
## 3071 connecticut middlesex
## 3072 connecticut new haven
## 3073 connecticut new london
## 3074 connecticut tolland
## 3075 connecticut windham
To fix that, we merge our DM map data with the list of US counties used in library(maps) (county.fips) using the FIPS codes, basically creating an ordered version of our old data frame with a ‘polyname’ variable added that matches the us_map object.
dm_align <- merge(county.fips, dm_maps, by="fips")
summary(dm_align[,1:10])
## fips polyname STATE NAME
## Min. : 1001 Length:3083 Length:3083 Length:3083
## 1st Qu.:19032 Class :character Class :character Class :character
## Median :29183 Mode :character Mode :character Mode :character
## Mean :30477
## 3rd Qu.:45052
## Max. :56045
## LONG LAT CAN_COULD CAN_MAY
## Min. :-124.14 Min. :25.53 Min. : 0.00 Min. : 0.000
## 1st Qu.: -98.11 1st Qu.:34.63 1st Qu.: 0.00 1st Qu.: 0.000
## Median : -90.41 Median :38.37 Median : 0.00 Median : 0.000
## Mean : -91.83 Mean :38.29 Mean : 13.86 Mean : 3.935
## 3rd Qu.: -83.58 3rd Qu.:41.75 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. : -67.65 Max. :48.82 Max. :11639.00 Max. :4232.000
## CAN_MIGHT CAN_MUST
## Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.0000
## Mean : 0.7587 Mean : 0.6886
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :1455.0000 Max. :925.0000
Next, we create a map object of US counties.
us_map <- map("county", plot = FALSE, fill = TRUE)
We want to project the map so it looks nice. This is kind of messy; for some reason we need to do it in two steps.
wgs84 <- CRS("+proj=longlat +datum=WGS84")
us_map_sp <- map2SpatialPolygons(us_map, IDs = us_map$names, proj4string = wgs84)
albers <- CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
us_map_sp <- spTransform(us_map_sp, CRS = albers)
And we’re also going to make just a state map, so we can overlay that with different borders.
state <- map("state", plot = FALSE, fill = TRUE, res = 0)
state_sp <- map2SpatialPolygons(state, IDs = state$names, proj4string = wgs84)
state_sp <- spTransform(state_sp, CRS = albers)
Last, we define a map function, which we can use over and over again. This does most of the work and it’s pretty complicated, mostly for aesthetics’ sake: most of it is colour palette and legend.
map_us <- function(feat, title, round) {
# Getting the choropleth colours and cuts (0 + quartiles of the non-zeroes)
pal <- colorRampPalette(c("white", rgb(1,.27,0)), bias = 1)
collist <- pal(5)
feat <- feat/1000
temp <- feat
temp[temp == 0] <- NA
temp[is.nan(temp)] <- NA
qu<-quantile(temp, na.rm=TRUE)
class <- classIntervals(feat,
5,
style = "fixed",
fixedBreaks = c(0, qu[1], qu[2], qu[3], qu[4], qu[5]))
colour <- findColours(class, collist)
# This is the map stuff
par(mar = c(0, 0, 0, 0))
plot(us_map_sp, lwd = 0.25, col = colour, border="white")
plot(state_sp, lwd = 0.5, add = TRUE)
# Legend
rect(-2220000, -980000, -2020000, -1080000, col = collist[1], lwd = 0.5)
rect(-2020000, -980000, -1820000, -1080000, col = collist[2], lwd = 0.5)
rect(-1820000, -980000, -1620000, -1080000, col = collist[3], lwd = 0.5)
rect(-1620000, -980000, -1420000, -1080000, col = collist[4], lwd = 0.5)
rect(-1420000, -980000, -1220000, -1080000, col = collist[5], lwd = 0.5)
text(-2220000, -1160000, cex = 0.5, labels = paste(round(class$brks[1]*100, round)))
text(-2020000, -1160000, cex = 0.5, labels = paste(round(class$brks[2]*100, round)))
text(-1820000, -1160000, cex = 0.5, labels = paste(round(class$brks[3]*100, round)))
text(-1620000, -1160000, cex = 0.5, labels = paste(round(class$brks[4]*100, round)))
text(-1420000, -1160000, cex = 0.5, labels = paste(round(class$brks[5]*100, round)))
text(-1220000, -1160000, cex = 0.5, labels = paste(round(class$brks[6]*100, round)))
text(-1700000, -860000, title, cex = 1)
text(-1700000, -1260000, "rate of use per million words", cex = .65)
}
First we can map all DMs combined, which is something I didn’t do before.
Something wasn’t working here at first: the totals were saved to a free-standing vector, but the map call referenced dm_align$all, which didn’t exist. The map function just needs the vector itself.

all_dms <- rowSums(dm_align[,7:ncol(dm_align)])
map_us(all_dms, "All Double Modals", 0)
And next let’s map the top 10.
map_us(dm_align$MIGHT_CAN, "Might Can", 0)
map_us(dm_align$MIGHT_COULD, "Might Could", 0)
map_us(dm_align$MAY_CAN, "May Can", 0)
map_us(dm_align$MIGHT_WOULD, "Might Would", 0)
map_us(dm_align$SHOULD_WOULD, "Should Would", 0)
map_us(dm_align$WOULD_COULD, "Would Could", 0)
map_us(dm_align$MUST_CAN, "Must Can", 0)
map_us(dm_align$MIGHT_SHOULD, "Might Should", 0)
map_us(dm_align$USED_TO_COULD, "Used to Could", 0)
map_us(dm_align$CAN_COULD, "Can Could", 0)
We can definitely see there are two different types of patterns, at least in these maps: Southeastern ones, which I assume are at least the ones we were expecting, and ones with wider distributions, which seem more unexpected.
Given that these maps differ in very broad ways, one way to quickly visualise and assess the differences is to work with state-level maps for a bit.
Using the data at hand, for each DM type let’s compute the mean relative frequency across the counties of each state.
dm_states <- as.data.frame(dm_align[,c(3,7:ncol(dm_align))])
dm_states <- aggregate(. ~ STATE, dm_states, mean)
row.names(dm_states) <- dm_states$STATE
dm_states <- dm_states[,2:ncol(dm_states)]
st_norm <- as.data.frame(100*scale(dm_states,
center=FALSE,
scale=colSums(dm_states)))
state$names
## [1] "alabama" "arizona"
## [3] "arkansas" "california"
## [5] "colorado" "connecticut"
## [7] "delaware" "district of columbia"
## [9] "florida" "georgia"
## [11] "idaho" "illinois"
## [13] "indiana" "iowa"
## [15] "kansas" "kentucky"
## [17] "louisiana" "maine"
## [19] "maryland" "massachusetts:martha's vineyard"
## [21] "massachusetts:main" "massachusetts:nantucket"
## [23] "michigan:north" "michigan:south"
## [25] "minnesota" "mississippi"
## [27] "missouri" "montana"
## [29] "nebraska" "nevada"
## [31] "new hampshire" "new jersey"
## [33] "new mexico" "new york:manhattan"
## [35] "new york:main" "new york:staten island"
## [37] "new york:long island" "north carolina:knotts"
## [39] "north carolina:main" "north carolina:spit"
## [41] "north dakota" "ohio"
## [43] "oklahoma" "oregon"
## [45] "pennsylvania" "rhode island"
## [47] "south carolina" "south dakota"
## [49] "tennessee" "texas"
## [51] "utah" "vermont"
## [53] "virginia:chesapeake" "virginia:chincoteague"
## [55] "virginia:main" "washington:san juan island"
## [57] "washington:lopez island" "washington:orcas island"
## [59] "washington:whidbey island" "washington:main"
## [61] "west virginia" "wisconsin"
## [63] "wyoming"
# Repeat rows because the map has separate polygons for islands and the like
st_align <- rbind(st_norm[1:20,], # through massachusetts
st_norm[20,], # massachusetts x3
st_norm[20,],
st_norm[21,], # michigan x2
st_norm[21,],
st_norm[22:31,], # minnesota through new york
st_norm[31,], # new york x4
st_norm[31,],
st_norm[31,],
st_norm[32,], # north carolina x3
st_norm[32,],
st_norm[32,],
st_norm[33:45,], # north dakota through virginia
st_norm[45,], # virginia x3
st_norm[45,],
st_norm[46,], # washington x5
st_norm[46,],
st_norm[46,],
st_norm[46,],
st_norm[46,],
st_norm[47:49,]) # west virginia, wisconsin, wyoming
Still working through this. One remaining issue is that the quantile breaks are calculated over the repeated rows (the duplicated polygons for islands and the like), which distorts the class intervals a bit; there are probably some other issues here too.
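One way around the repeated-rows issue (just a sketch): compute the breaks on the unrepeated st_norm values and pass them in as fixed breaks, as in the county map function.

qu <- quantile(st_norm$MIGHT_COULD, probs = seq(0, 1, .2)) # breaks from the 49 real states
class <- classIntervals(st_align$MIGHT_COULD, 5,
                        style = "fixed",
                        fixedBreaks = as.numeric(qu))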
map_st <- function(feat, title, round) {
pal <- colorRampPalette(c("white", rgb(1,.27,0)), bias = 1)
collist <- pal(5)
class <- classIntervals(feat,
5,
style = "quantile")
colour <- findColours(class, collist)
# This is the map stuff
par(mar = c(0, 0, 0, 0))
plot(state_sp, lwd = 0.5, col = colour, border="black")
# Legend
rect(-2220000, -980000, -2020000, -1080000, col = collist[1], lwd = 0.5)
rect(-2020000, -980000, -1820000, -1080000, col = collist[2], lwd = 0.5)
rect(-1820000, -980000, -1620000, -1080000, col = collist[3], lwd = 0.5)
rect(-1620000, -980000, -1420000, -1080000, col = collist[4], lwd = 0.5)
rect(-1420000, -980000, -1220000, -1080000, col = collist[5], lwd = 0.5)
text(-2220000, -1160000, cex = 0.5, labels = paste(round(class$brks[1], round)))
text(-2020000, -1160000, cex = 0.5, labels = paste(round(class$brks[2], round)))
text(-1820000, -1160000, cex = 0.5, labels = paste(round(class$brks[3], round)))
text(-1620000, -1160000, cex = 0.5, labels = paste(round(class$brks[4], round)))
text(-1420000, -1160000, cex = 0.5, labels = paste(round(class$brks[5], round)))
text(-1220000, -1160000, cex = 0.5, labels = paste(round(class$brks[6], round)))
text(-1700000, -860000, title, cex = 1)
text(-1700000, -1260000, "percent of tokens (normalised)", cex = .65)
}
map_st(st_align$MIGHT_COULD, "Might Could", 0)
map_st(st_align$MIGHT_CAN, "Might Can", 0)
I’m just thinking about how best to sort the maps. We could do it by hand, or maybe by taking state proportions with a simple threshold. Or we could do it the way I would usually find dialect regions, but that seems like a lot of complication here.
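A rough sketch of the state-proportion threshold idea (note: this set of Southern states is hand-picked by me purely for illustration):

south <- c("alabama", "arkansas", "georgia", "louisiana", "mississippi",
           "north carolina", "south carolina", "tennessee", "texas", "virginia")
# st_norm columns each sum to 100, so this is each DM's percent of usage in these states.
south_share <- colSums(st_norm[rownames(st_norm) %in% south, ])
sort(south_share, decreasing = TRUE)[1:10]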
library(data.table) # for transpose()

data <- dm_align[,7:ncol(dm_align)]
dm_as_obs <- transpose(data) # one row per DM, one column per county
rownames(dm_as_obs) <- colnames(data)
d <- dist(dm_as_obs)
mds <- cmdscale(d, 2)
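And, assuming we go the MDS route, a quick look at where the DMs land in the first two dimensions:

plot(mds, type = "n",
     xlab = "MDS Dimension 1",
     ylab = "MDS Dimension 2")
text(mds, labels = rownames(dm_as_obs), cex = .6, col = grey)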