1 Introduction

This markdown replicates and extends the analysis of double modals (DMs) in US geolocated Twitter data collected between 2013 and 2014 presented in Grieve et al. (2015) at NWAV Toronto. This analysis is to be reported in a paper in prep by me and Cameron Morin that follows up on the NWAV analyses. I’m redoing and extending the code and visualisations here and walking through the steps of this analysis, including extensive informal discussion of the results.

2 Research Questions

Our basic goal is to use a very large Twitter corpus to describe the use of one of the most elusive grammatical constructions in American English: the DM.

In essence this is a descriptive and exploratory study. There is a fair bit of theory on DMs, but it seems to me fairly superficial, since all it really takes as empirical data is that DMs exist in general (i.e. these accounts are just trying to explain or model the existence of DMs, regardless of the pairing) and, to a lesser extent, that certain DMs exist. That is because this is literally all the information anyone had access to, as no one has really observed DMs in natural language at scale and in an open way. So detailed information about variation across DMs has basically been lacking, or sometimes people have just guessed at it, trying to generalise from anecdotal evidence (and as we’ll see, those guesses are often wrong). All we really know is that DMs exist, and therefore research questions and theories about the nature of DMs have been limited.

So for specific RQs (aside from the methodological one, i.e. can we find them? The answer is clearly yes):

  • What DMs exist in AmE? What DMs are most frequent?
  • How are DMs used?
    • What is their range of meanings?
  • What constrains (categorically or probabilistically) the combining of DMs?
    • Are certain modals more common in certain positions?
    • Are certain pairings more common?
    • Can DMs be grouped into some kind of structural/semantic taxonomy?
  • What are the regional patterns associated with DMs?
    • Do DMs generally only occur in the Southeast?
    • Are there differences in the regional patterns of DMs in the Southeast?
    • Can we infer any social differences within the Southeast?
  • Can we model these results in terms of grammaticality?
    • Are they consistent/inconsistent with predictions of existing theories?
    • In general, can we model DMs at all (something one could test without our results)?
    • And specifically, can we model the specific patterns we observe?

3 Data

3.1 Background

The data is based on an analysis of approximately 1 billion geolocated tweets from across the contiguous US that were collected between October 2013 and November 2014 using the Twitter API. The corpus totals 8.9 billion words. Notably, every tweet is geolocated, so we know the precise longitude and latitude of the user when they posted that message (on their smartphone with the geolocation option activated). This allows for the fine-grained analysis of regional variation. Based on this information we stratified the corpus into counties, so that every tweet is linked to a county (or a county equivalent). There are about 3,000 of these in the US (see below for the exact numbers). In this way we can measure and map linguistic features across the US. For more information on this corpus see Huang et al. (2016) and Grieve et al. (2017, 2018).

The dataset analysed in this study was based on this corpus. To identify DMs we automatically extracted all modal-modal sequences, focusing on the 9 primary modals (can, could, may, might, shall, should, will, would, must) and 2 of the semi-modals (used to, ought to). We ignored all duplicated double modals (e.g. could could), which are surprisingly common, as well as other semi-modals (e.g. gonna), modals followed by have (e.g. woulda), and various other forms (e.g. might as well). This gives 110 (11 x 10) different possible double modal types.
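As a quick check on that arithmetic, the set of candidate types can be enumerated directly. This is only a minimal sketch and not the extraction code that was run on the corpus; the object names (`modals`, `pairs`, `dm_types`) are made up for illustration.

```r
# The 11 modals considered (9 primary modals plus 2 semi-modals)
modals <- c("Can", "Could", "May", "Might", "Must", "Ought to",
            "Shall", "Should", "Used to", "Will", "Would")

# All ordered pairs; expand.grid varies its first argument fastest,
# so putting `second` first keeps the pairs grouped by first modal
pairs <- expand.grid(second = modals, first = modals,
                     stringsAsFactors = FALSE)

# Drop repeated modals (e.g. "Could Could"), which were not considered
pairs <- pairs[pairs$first != pairs$second, ]

dm_types <- paste(pairs$first, pairs$second)
length(dm_types)
head(dm_types)
```

The resulting labels follow the same alphabetical ordering as the DM column in the frequency table loaded below.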

Through this process we identified 10,137 tokens of potential double modals in the corpus. However, when we looked at the Tweets containing these forms, we found that a substantial number were not true double modals. We therefore went through each of the tokens manually to identify true double modals. After removing all problematic tokens, we were left with 5,439 legitimate double modal tokens.

3.2 Overall Frequency Data

Before we get to maps, we have some data related to the overall frequencies of the forms. There are a few different data sets we’re going to want to load in and generate some basic statistics and graphs for.

We can start with just the overall frequency distribution of the DMs in the corpus.

First we read in the data.

dm_dist <- read.table("DM_DIST.txt", header = TRUE, sep = "\t")

We can take a look at it.

dm_dist
##                   DM tokens
## 1          Can Could    135
## 2            Can May     21
## 3          Can Might     15
## 4           Can Must     11
## 5       Can Ought to      0
## 6          Can Shall      2
## 7         Can Should     20
## 8        Can Used to      0
## 9           Can Will    111
## 10         Can Would     14
## 11         Could Can     60
## 12         Could May     10
## 13       Could Might     20
## 14        Could Must      0
## 15    Could Ought to      0
## 16       Could Shall      0
## 17      Could Should     19
## 18     Could Used to      0
## 19        Could Will     16
## 20       Could Would     41
## 21           May Can    273
## 22         May Could     53
## 23         May Might      7
## 24          May Must      4
## 25      May Ought to      4
## 26         May Shall      3
## 27        May Should     14
## 28       May Used to      2
## 29          May Will     22
## 30         May Would     14
## 31         Might Can   1733
## 32       Might Could    932
## 33         Might May     13
## 34        Might Must      8
## 35    Might Ought to     32
## 36       Might Shall      1
## 37      Might Should    146
## 38     Might Used to      1
## 39        Might Will     52
## 40       Might Would    243
## 41          Must Can    164
## 42        Must Could      3
## 43          Must May      0
## 44        Must Might      8
## 45     Must Ought to      0
## 46        Must Shall      0
## 47       Must Should      1
## 48      Must Used to      5
## 49         Must Will      5
## 50        Must Would      3
## 51      Ought to Can      0
## 52    Ought to Could      0
## 53      Ought to May      0
## 54    Ought to Might      0
## 55     Ought to Must      0
## 56    Ought to Shall      0
## 57   Ought to Should      1
## 58  Ought to Used to      0
## 59     Ought to Will      0
## 60    Ought to Would      0
## 61         Shall Can      3
## 62       Shall Could      0
## 63         Shall May      0
## 64       Shall Might      0
## 65        Shall Must      2
## 66    Shall Ought to      0
## 67      Shall Should      5
## 68     Shall Used to      0
## 69        Shall Will      3
## 70       Shall Would      1
## 71        Should Can     14
## 72      Should Could     33
## 73        Should May      0
## 74      Should Might      8
## 75       Should Must     12
## 76   Should Ought to      3
## 77      Should Shall      1
## 78    Should Used to      1
## 79       Should Will     16
## 80      Should Would    177
## 81       Used to Can      8
## 82     Used to Could    144
## 83       Used to May      0
## 84     Used to Might      0
## 85      Used to Must      0
## 86  Used to Ought to      0
## 87     Used to Shall      0
## 88    Used to Should      0
## 89      Used to Will      0
## 90     Used to Would     20
## 91          Will Can    111
## 92        Will Could     20
## 93          Will May     26
## 94        Will Might     59
## 95         Will Must     11
## 96     Will Ought to      1
## 97        Will Shall     21
## 98       Will Should     23
## 99      Will Used to      0
## 100       Will Would     80
## 101        Would Can      9
## 102      Would Could    171
## 103        Would May      8
## 104      Would Might     29
## 105       Would Must      4
## 106   Would Ought to      0
## 107      Would Shall      0
## 108     Would Should     81
## 109    Would Used to     40
## 110       Would Will     52

Note that there are all 110 possible DMs in this dataset, but we observe at least one occurrence for only 76 of these forms.

nrow(dm_dist)
## [1] 110
dm_dist_trim <- dm_dist[dm_dist$tokens>0,]
nrow(dm_dist_trim)
## [1] 76

We can also order this by frequency.

dm_dist_trim <- dm_dist_trim[order(dm_dist_trim$tokens, decreasing = TRUE ),]
dm_dist_trim
##                  DM tokens
## 31        Might Can   1733
## 32      Might Could    932
## 21          May Can    273
## 40      Might Would    243
## 80     Should Would    177
## 102     Would Could    171
## 41         Must Can    164
## 37     Might Should    146
## 82    Used to Could    144
## 1         Can Could    135
## 9          Can Will    111
## 91         Will Can    111
## 108    Would Should     81
## 100      Will Would     80
## 11        Could Can     60
## 94       Will Might     59
## 22        May Could     53
## 39       Might Will     52
## 110      Would Will     52
## 20      Could Would     41
## 109   Would Used to     40
## 72     Should Could     33
## 35   Might Ought to     32
## 104     Would Might     29
## 93         Will May     26
## 98      Will Should     23
## 29         May Will     22
## 2           Can May     21
## 97       Will Shall     21
## 7        Can Should     20
## 13      Could Might     20
## 90    Used to Would     20
## 92       Will Could     20
## 17     Could Should     19
## 19       Could Will     16
## 79      Should Will     16
## 3         Can Might     15
## 10        Can Would     14
## 27       May Should     14
## 30        May Would     14
## 71       Should Can     14
## 33        Might May     13
## 75      Should Must     12
## 4          Can Must     11
## 95        Will Must     11
## 12        Could May     10
## 101       Would Can      9
## 34       Might Must      8
## 44       Must Might      8
## 74     Should Might      8
## 81      Used to Can      8
## 103       Would May      8
## 23        May Might      7
## 48     Must Used to      5
## 49        Must Will      5
## 67     Shall Should      5
## 24         May Must      4
## 25     May Ought to      4
## 105      Would Must      4
## 26        May Shall      3
## 42       Must Could      3
## 50       Must Would      3
## 61        Shall Can      3
## 69       Shall Will      3
## 76  Should Ought to      3
## 6         Can Shall      2
## 28      May Used to      2
## 65       Shall Must      2
## 36      Might Shall      1
## 38    Might Used to      1
## 47      Must Should      1
## 57  Ought to Should      1
## 70      Shall Would      1
## 77     Should Shall      1
## 78   Should Used to      1
## 96    Will Ought to      1

3.2.1 Plot 1

Next let’s plot the data. For the initial submission we can pull the figures right out of this rmd.

purp <- rgb(.2,.15,.57) 
red <- rgb(1,.25,0)     # We'll reuse these colours
grey <- rgb(.4,.4,.4)

barplot(dm_dist_trim$tokens, 
        names.arg = dm_dist_trim$DM,
        las = 2,
        cex.names = .4, 
        col = red, border = "NA", axes=FALSE)
axis(2, 
     cex.axis = .8, 
     las = 1, 
     col = "white", 
     col.ticks = "white", 
     col.axis = grey,
     at=c(dm_dist_trim$tokens[1],
          dm_dist_trim$tokens[2],
          dm_dist_trim$tokens[3],
          0))

3.2.2 Finding 1: Might Can

Might Could is not the most frequent DM in this corpus, not even close. Might Can is far more common. This is very surprising given previous research, which generally highlights Might Could, often saying it’s the most common, and which sometimes ignores Might Can altogether. This is a big finding, arguably our biggest, and certainly one we’ll want to build on.
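To put rough numbers on "not even close", here is a small sketch that computes each form's share of all DM tokens and the ratio between the two. The counts are copied from the frequency table above; the variable names are just for illustration.

```r
# Token counts taken from the frequency table above
might_can   <- 1733
might_could <- 932
total_dms   <- 5439   # all hand-checked DM tokens in the corpus

# Percent of all DM tokens accounted for by each form
share_can   <- 100 * might_can / total_dms
share_could <- 100 * might_could / total_dms

# How many Might Can tokens there are per Might Could token
ratio <- might_can / might_could

round(c(share_can = share_can, share_could = share_could, ratio = ratio), 2)
```

This is purely descriptive; it says nothing about whether the same gap would hold in other registers.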

I think we should first acknowledge that this could be a register effect of Twitter. I think that is what most people who are invested in the idea that Might Could is most common would say. And it is true that frequency distributions don’t always align: most notably, ‘the’ is generally the most common word in English corpora, while ‘I’ is the most common here on Twitter. But I think a register effect is unlikely in this case, as this is a really huge difference between really low frequency and comparable forms, whereas ‘the’ and ‘I’ are very frequent and very different and are associated with broad register differences.

But more to the point, regardless of what happens in other contexts, Might Can is clearly most common on Twitter (we need to be clear about that), and that is very surprising given what previous studies have said. And crucially, no one else has ever made this kind of comparative frequency analysis in any other context, at least based on sufficient amounts of natural data. I mean, I assume some of the existing data is based on field workers asking informants directly about specific forms, which is then totally irrelevant to the question of relative frequency. So there isn’t much of anything to compare against. In many ways this is the first empirical data on which form is most frequent.

We should give examples of Might Can in context here from the corpus, as this is our first chance to really discuss and exemplify what DMs in this corpus look like. I think a series of lines in like aligned concordance line format makes sense here (using a monospace font). We’ll want to show what the meaning is and that there is consistent meaning. Giving Tweets as examples is somewhat problematic though, but for now let’s just use them.

We need to discuss this prominently right away. And we should note that (reassuringly) Might Could is second. We’ll want to show examples for Might Could too and discuss the meaning, and then we’ll want to discuss the similarity between these two forms and start setting the stage for a direct comparison between them, which is what we’ll do.

3.2.3 Finding 1.1: Overall Alignment

Furthermore, we should go through the lists of most common DMs that we discuss in the intro and explicitly link back to them, noting the other consistencies and inconsistencies along with Might Can. I haven’t looked at this at all, honestly. And we shouldn’t make a big deal about ranking at this point, since the token counts are low, and so there might not be much to say except broad overall agreement: the usual suspects are all accounted for. But there might be some of the expected ones that hardly occur at all, or a couple of unexpected ones that are surprisingly high. We should note those, and we should definitely at least be thinking about how any unexpected results might relate to the Might Can finding.

3.2.4 Finding 2: So Many DMs

I think this is a second major result. Maybe this is more surprising than the Might Can result even. We need to highlight how many more this is than anyone has thought possible. We need to be clear here that these are valid, hand checked and interpretable DMs that make sense in context, even though most are exceedingly uncommon. And we need to provide examples to demonstrate that.

This would also be the time to point out some of these extremely rare DMs that have never been discussed in the literature. Really, it would be great if you went through the list and counted them and provided a full list somehow of the newly observed DMs. And then, again, somewhere in this section demonstrate via examples that these are real uses, with consistently interpretable meanings. So we could show all 10 or so examples for one of these and show what it means. I think that is important.

3.2.5 Finding 2.1: Zipf Dist

And finally we should note the distribution is highly sub-linear, aligning broadly at least with a Zipfian distribution, which is exactly what I think we’d expect, except I guess that this long tail of really uncommon and previously unattested DMs is pretty surprising. Really, I think it’s fair to say this result suggests that with enough data you’d see all the double modals.
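One rough way to look at the Zipf-like shape is to regress log frequency on log rank for the 76 attested types. The sketch below is only a descriptive check, not a formal test of Zipf's law: the counts are copied from the sorted table above, tied counts are simply given consecutive ranks, and the names (`tokens`, `fit`, `slope`) are made up for illustration.

```r
# Token counts of the 76 attested DM types, in decreasing order
tokens <- c(1733, 932, 273, 243, 177, 171, 164, 146, 144, 135,
            111, 111, 81, 80, 60, 59, 53, 52, 52, 41,
            40, 33, 32, 29, 26, 23, 22, 21, 21, 20,
            20, 20, 20, 19, 16, 16, 15, 14, 14, 14,
            14, 13, 12, 11, 11, 10, 9, 8, 8, 8,
            8, 8, 7, 5, 5, 5, 4, 4, 4, 3,
            3, 3, 3, 3, 3, 2, 2, 2, 1, 1,
            1, 1, 1, 1, 1, 1)

# Rank 1 is the most frequent type; ties get consecutive ranks
rank <- seq_along(tokens)

# Straight-line fit on the log-log scale; a roughly linear relationship
# with a negative slope is what a Zipf-like distribution would show
fit   <- lm(log(tokens) ~ log(rank))
slope <- unname(coef(fit)[2])
slope

# The long tail: how many types have fewer than 10 tokens
sum(tokens < 10)
```

Plotting `log(rank)` against `log(tokens)` and adding `abline(fit)` would show how closely the points follow that line.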

3.3 Overall Frequency Data by Position

So far we’ve only looked at the double modals as units, but of course they consist of two modals: first position and second position. So we can also look at patterns there, starting with a heatmap.

3.3.1 Data

First we read in the data: this is an 11 modal by 11 modal matrix, showing the frequency of the first modal (rows) combining with the second modal (columns). We did not consider DMs with a repeated modal (because they seem much more likely to be typos or emphasis and they’re really hard to judge).

dmmat <- as.matrix(read.table("DM_FREQ_SORT.txt", sep="\t", header=TRUE, row.names=1))

dmmat
##          Might  Can Could Would Will Should May Must Used.to Shall Ought.to
## Might        0 1733   932   243   52    146  13    8       1     1       32
## Can         15    0   135    14  111     20  21   11       0     2        0
## Could       20   60     0    41   16     19  10    0       0     0        0
## Would       29    9   171     0   52     81   8    4      40     0        0
## Will        59  111    20    80    0     23  26   11       0    21        1
## Should       8   14    33   177   16      0   0   12       1     1        3
## May          7  273    53    14   22     14   0    4       2     3        4
## Must         8  164     3     3    5      1   0    0       5     0        0
## Used to      0    8   144    20    0      0   0    0       0     0        0
## Shall        0    3     0     1    3      5   0    2       0     0        0
## Ought to     0    0     0     0    0      1   0    0       0     0        0
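Since this matrix and the long-format table come from separate files, it is worth confirming that they agree. The sketch below re-enters the matrix printed above by hand (so it does not depend on `DM_FREQ_SORT.txt`) and checks the three figures reported earlier: 110 possible types, 76 attested types and 5,439 tokens. The name `dmmat_check` is made up to avoid overwriting `dmmat`.

```r
modals <- c("Might", "Can", "Could", "Would", "Will", "Should",
            "May", "Must", "Used to", "Shall", "Ought to")

# Rows are first position, columns are second position (as printed above)
dmmat_check <- matrix(c(
   0, 1733, 932, 243,  52, 146, 13,  8,  1,  1, 32,
  15,    0, 135,  14, 111,  20, 21, 11,  0,  2,  0,
  20,   60,   0,  41,  16,  19, 10,  0,  0,  0,  0,
  29,    9, 171,   0,  52,  81,  8,  4, 40,  0,  0,
  59,  111,  20,  80,   0,  23, 26, 11,  0, 21,  1,
   8,   14,  33, 177,  16,   0,  0, 12,  1,  1,  3,
   7,  273,  53,  14,  22,  14,  0,  4,  2,  3,  4,
   8,  164,   3,   3,   5,   1,  0,  0,  5,  0,  0,
   0,    8, 144,  20,   0,   0,  0,  0,  0,  0,  0,
   0,    3,   0,   1,   3,   5,  0,  2,  0,  0,  0,
   0,    0,   0,   0,   0,   1,  0,  0,  0,  0,  0),
  nrow = 11, byrow = TRUE, dimnames = list(modals, modals))

# Off-diagonal cells are the possible DM types
off_diag   <- row(dmmat_check) != col(dmmat_check)
n_possible <- sum(off_diag)
n_attested <- sum(dmmat_check[off_diag] > 0)
n_tokens   <- sum(dmmat_check)

c(possible = n_possible, attested = n_attested, tokens = n_tokens)
```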

3.3.2 Plot 2

Now we can plot that full co-occurrence matrix as a heatmap. I like using image here, but it rotates the image, so I make a rotate function and then I plot the rotated image and add axes.

rotate <- function(x) t(apply(x, 2, rev))

image(rotate(dmmat),
      col = hcl.colors(7, "Purples2", rev = TRUE), 
      breaks = c(0,1,5,10,50,100,500,100000),
      tck = 0, 
      axes = TRUE,      
      frame.plot = TRUE, 
      xaxt='n', 
      yaxt = 'n')

  axis(side = 2, 
       at = c(1, .9, .8, .7, .6, .5, .4, .3, .2, .1, 0), 
       labels = rownames(dmmat),
       las = 2,
       col.axis = grey,
       cex.axis = .8,
       tick = FALSE, 
       line = FALSE)
       
  axis(side = 3, 
       at = c(0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1), 
       labels = rownames(dmmat),
       las = 2,
       col.axis = grey,
       cex.axis = .8,
       tick = FALSE, 
       line = FALSE)    
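To see what the rotate helper does, it can be applied to a tiny matrix. This is just an illustration on toy data (the name `m` is made up): `image` draws the first row of a matrix along the bottom-left of the x axis, so without the rotation the heatmap would not match the printed layout of the table.

```r
rotate <- function(x) t(apply(x, 2, rev))

# A small 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)
m

# Reversing each column and transposing turns the matrix
# 90 degrees clockwise, giving a 3 x 2 result
r <- rotate(m)
r
```

After the rotation the first row of `m` ends up as the last column of `r`, which `image` then draws along the top of the plot, so the picture reads the same way as the printed matrix.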

3.3.3 Position Totals by Token

I think we also want another heat map to show the overall frequencies of the modals in the two positions, regardless of the other modal in the pair.

First we get totals for each modal by row (first position) and column (second position).

first <- as.matrix(rowSums(dmmat))
second <- as.matrix(colSums(dmmat))
dmmat2 <- cbind(first, second)
dmmat2
##          [,1] [,2]
## Might    3161  146
## Can       329 2375
## Could     166 1491
## Would     394  593
## Will      352  277
## Should    265  310
## May       396   78
## Must      189   52
## Used to   172   49
## Shall      14   28
## Ought to    1   40

And then we can plot like before.

image(rotate(dmmat2),
      col = hcl.colors(9, "Purples2", rev = TRUE), 
      breaks = c(0,1,20,50,100,200,500,1000,2000,100000),
      tck = 0, 
      axes = TRUE,      
      frame.plot = TRUE, 
      xaxt='n', 
      yaxt = 'n')

  axis(side = 2, 
       at = c(1, .9, .8, .7, .6, .5, .4, .3, .2, .1, 0), 
       labels = rownames(dmmat2),
       las = 2,
       col.axis = grey,
       cex.axis = .8,
       tick = FALSE, 
       line = FALSE)
       
  
  axis(side = 3, 
       at = c(.33, .66), 
       labels = c("Modal 1", "Modal 2"),
       las = 1,
       col.axis = grey,
       cex.axis = .8,
       tick = FALSE, 
       line = FALSE)    

(Note: I wouldn’t include this image in the paper)

Specifically, we can calculate the percentage of each modal’s tokens that occur in first position (as opposed to second) as follows.

100*dmmat2[,1]/(dmmat2[,1]+dmmat2[,2])
##     Might       Can     Could     Would      Will    Should       May      Must 
## 95.585122 12.167160 10.018105 39.918946 55.961844 46.086957 83.544304 78.423237 
##   Used to     Shall  Ought to 
## 77.828054 33.333333  2.439024

3.3.4 Plot 3

This graph provides a good picture of it.

plot(log(dmmat2[,2]), 
     log(dmmat2[,1]), 
     xlim = c(0, 8), 
     ylim = c(0, 8),
     xlab = "Frequency of Modal in 2nd Position (Logged)",
     ylab = "Frequency of Modal in 1st Position (Logged)",
     col = grey,
     cex.axis = .85,
     cex.lab = .85,
     col.lab = grey,
     col.axis = grey,
     bty = "n",xaxt = "n", yaxt = "n", type = "n"
)
text(log(dmmat2[,2]),
     log(dmmat2[,1]), 
     labels = row.names(dmmat2),
     col = c(red, purp, purp, purp, red, purp,  red, red, red, purp, purp),
     cex = .75)
abline(a = 0, b = 1, lty = "dashed", lwd = .5)

axis(side = 1, 
     lwd = .5,
       col.axis = grey,
       cex.axis = .75)
       
  
  axis(side = 2, 
       lwd = .5,
       col.axis = grey,
       cex.axis = .75)    
box(which = "plot", col = grey, lwd = .5)
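The label colours in the plot above are typed in by hand: red for modals that are more frequent in first position, purple for those more frequent in second. A small sketch of how that vector could be derived from the totals instead, so it stays correct if the data change. The totals are copied from the `dmmat2` output above; `first_tot`, `second_tot` and `label_col` are names made up for illustration.

```r
purp <- rgb(.2, .15, .57)
red  <- rgb(1, .25, 0)

modals <- c("Might", "Can", "Could", "Would", "Will", "Should",
            "May", "Must", "Used to", "Shall", "Ought to")

# Position totals as printed for dmmat2
first_tot  <- c(3161, 329, 166, 394, 352, 265, 396, 189, 172, 14, 1)
second_tot <- c(146, 2375, 1491, 593, 277, 310, 78, 52, 49, 28, 40)

# Red if the modal is more common in first position, purple otherwise
label_col <- ifelse(first_tot > second_tot, red, purp)
names(label_col) <- modals
label_col
```

Passing `col = label_col` to `text` would then reproduce the hand-typed colour vector.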

3.3.5 Position Totals by Type

Let’s also redo this just looking at types, so ignoring frequency, but looking at whether the modals occur more in different positions.

First, we’ll save the data to a new object and then get rid of the ‘to’ to make this easier to split up.

dm_type <- dm_dist_trim
dm_type$DM<-gsub("to ", "", dm_type$DM)
dm_type$DM<-gsub(" to", "", dm_type$DM)
dm_type 
##               DM tokens
## 31     Might Can   1733
## 32   Might Could    932
## 21       May Can    273
## 40   Might Would    243
## 80  Should Would    177
## 102  Would Could    171
## 41      Must Can    164
## 37  Might Should    146
## 82    Used Could    144
## 1      Can Could    135
## 9       Can Will    111
## 91      Will Can    111
## 108 Would Should     81
## 100   Will Would     80
## 11     Could Can     60
## 94    Will Might     59
## 22     May Could     53
## 39    Might Will     52
## 110   Would Will     52
## 20   Could Would     41
## 109   Would Used     40
## 72  Should Could     33
## 35   Might Ought     32
## 104  Would Might     29
## 93      Will May     26
## 98   Will Should     23
## 29      May Will     22
## 2        Can May     21
## 97    Will Shall     21
## 7     Can Should     20
## 13   Could Might     20
## 90    Used Would     20
## 92    Will Could     20
## 17  Could Should     19
## 19    Could Will     16
## 79   Should Will     16
## 3      Can Might     15
## 10     Can Would     14
## 27    May Should     14
## 30     May Would     14
## 71    Should Can     14
## 33     Might May     13
## 75   Should Must     12
## 4       Can Must     11
## 95     Will Must     11
## 12     Could May     10
## 101    Would Can      9
## 34    Might Must      8
## 44    Must Might      8
## 74  Should Might      8
## 81      Used Can      8
## 103    Would May      8
## 23     May Might      7
## 48     Must Used      5
## 49     Must Will      5
## 67  Shall Should      5
## 24      May Must      4
## 25     May Ought      4
## 105   Would Must      4
## 26     May Shall      3
## 42    Must Could      3
## 50    Must Would      3
## 61     Shall Can      3
## 69    Shall Will      3
## 76  Should Ought      3
## 6      Can Shall      2
## 28      May Used      2
## 65    Shall Must      2
## 36   Might Shall      1
## 38    Might Used      1
## 47   Must Should      1
## 57  Ought Should      1
## 70   Shall Would      1
## 77  Should Shall      1
## 78   Should Used      1
## 96    Will Ought      1

Now we can split the DM column into first and second position on the space.

library(stringr)

temp<-str_split_fixed(dm_type$DM, " ", 2)
dm_type$first <- temp[,1]
dm_type$second <- temp[,2]
dm_type
##               DM tokens  first second
## 31     Might Can   1733  Might    Can
## 32   Might Could    932  Might  Could
## 21       May Can    273    May    Can
## 40   Might Would    243  Might  Would
## 80  Should Would    177 Should  Would
## 102  Would Could    171  Would  Could
## 41      Must Can    164   Must    Can
## 37  Might Should    146  Might Should
## 82    Used Could    144   Used  Could
## 1      Can Could    135    Can  Could
## 9       Can Will    111    Can   Will
## 91      Will Can    111   Will    Can
## 108 Would Should     81  Would Should
## 100   Will Would     80   Will  Would
## 11     Could Can     60  Could    Can
## 94    Will Might     59   Will  Might
## 22     May Could     53    May  Could
## 39    Might Will     52  Might   Will
## 110   Would Will     52  Would   Will
## 20   Could Would     41  Could  Would
## 109   Would Used     40  Would   Used
## 72  Should Could     33 Should  Could
## 35   Might Ought     32  Might  Ought
## 104  Would Might     29  Would  Might
## 93      Will May     26   Will    May
## 98   Will Should     23   Will Should
## 29      May Will     22    May   Will
## 2        Can May     21    Can    May
## 97    Will Shall     21   Will  Shall
## 7     Can Should     20    Can Should
## 13   Could Might     20  Could  Might
## 90    Used Would     20   Used  Would
## 92    Will Could     20   Will  Could
## 17  Could Should     19  Could Should
## 19    Could Will     16  Could   Will
## 79   Should Will     16 Should   Will
## 3      Can Might     15    Can  Might
## 10     Can Would     14    Can  Would
## 27    May Should     14    May Should
## 30     May Would     14    May  Would
## 71    Should Can     14 Should    Can
## 33     Might May     13  Might    May
## 75   Should Must     12 Should   Must
## 4       Can Must     11    Can   Must
## 95     Will Must     11   Will   Must
## 12     Could May     10  Could    May
## 101    Would Can      9  Would    Can
## 34    Might Must      8  Might   Must
## 44    Must Might      8   Must  Might
## 74  Should Might      8 Should  Might
## 81      Used Can      8   Used    Can
## 103    Would May      8  Would    May
## 23     May Might      7    May  Might
## 48     Must Used      5   Must   Used
## 49     Must Will      5   Must   Will
## 67  Shall Should      5  Shall Should
## 24      May Must      4    May   Must
## 25     May Ought      4    May  Ought
## 105   Would Must      4  Would   Must
## 26     May Shall      3    May  Shall
## 42    Must Could      3   Must  Could
## 50    Must Would      3   Must  Would
## 61     Shall Can      3  Shall    Can
## 69    Shall Will      3  Shall   Will
## 76  Should Ought      3 Should  Ought
## 6      Can Shall      2    Can  Shall
## 28      May Used      2    May   Used
## 65    Shall Must      2  Shall   Must
## 36   Might Shall      1  Might  Shall
## 38    Might Used      1  Might   Used
## 47   Must Should      1   Must Should
## 57  Ought Should      1  Ought Should
## 70   Shall Would      1  Shall  Would
## 77  Should Shall      1 Should  Shall
## 78   Should Used      1 Should   Used
## 96    Will Ought      1   Will  Ought

We can count the number of types in first and second position and compute the first position percent.

results<-c()
results$first <- table(dm_type$first)
results$second <-table(dm_type$second)
results$perc <-100*results$first/(results$first+results$second)
results
## $first
## 
##    Can  Could    May  Might   Must  Ought  Shall Should   Used   Will  Would 
##      8      6     10     10      7      1      5      9      3      9      8 
## 
## $second
## 
##    Can  Could    May  Might   Must  Ought  Shall Should   Used   Will  Would 
##      9      8      5      7      7      4      5      9      5      8      9 
## 
## $perc
## 
##      Can    Could      May    Might     Must    Ought    Shall   Should 
## 47.05882 42.85714 66.66667 58.82353 50.00000 20.00000 50.00000 50.00000 
##     Used     Will    Would 
## 37.50000 52.94118 47.05882

Also let’s just repeat it with only the DMs that have at least ten tokens.

dm_type_ten <- dm_type[dm_type$tokens > 9,]

results_ten<-c()
temp <- table(dm_type_ten$first)

# Ought and Shall never occur in first position at this threshold, so table()
# drops them; pad their slots (6 and 7 alphabetically) with zeros
results_ten$first[1:5] <- temp[1:5]
results_ten$first[6:7] <- 0
results_ten$first[8:11] <- temp[6:9]

results_ten$second <-table(dm_type_ten$second)
results_ten$perc <-100*results_ten$first/(results_ten$first+results_ten$second)
results_ten
## $first
##  [1] 7 6 5 7 1 0 0 5 2 8 5
## 
## $second
## 
##    Can  Could    May  Might   Must  Ought  Shall Should   Used   Will  Would 
##      6      7      4      4      3      1      1      6      1      6      7 
## 
## $perc
## 
##      Can    Could      May    Might     Must    Ought    Shall   Should 
## 53.84615 46.15385 55.55556 63.63636 25.00000  0.00000  0.00000 45.45455 
##     Used     Will    Would 
## 66.66667 57.14286 41.66667
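The manual padding above works, but it depends on knowing which modals are missing and where they fall alphabetically. One alternative, sketched here under the assumption that the counts can be taken straight from the co-occurrence matrix, is to convert the modals to a factor with all 11 levels so that `table` keeps the zero counts. The matrix is re-entered by hand from the output shown earlier, and names such as `dmmat_types`, `first_ten` and `second_ten` are made up for illustration.

```r
modals <- c("Might", "Can", "Could", "Would", "Will", "Should",
            "May", "Must", "Used to", "Shall", "Ought to")

# Rows are first position, columns are second position
dmmat_types <- matrix(c(
   0, 1733, 932, 243,  52, 146, 13,  8,  1,  1, 32,
  15,    0, 135,  14, 111,  20, 21, 11,  0,  2,  0,
  20,   60,   0,  41,  16,  19, 10,  0,  0,  0,  0,
  29,    9, 171,   0,  52,  81,  8,  4, 40,  0,  0,
  59,  111,  20,  80,   0,  23, 26, 11,  0, 21,  1,
   8,   14,  33, 177,  16,   0,  0, 12,  1,  1,  3,
   7,  273,  53,  14,  22,  14,  0,  4,  2,  3,  4,
   8,  164,   3,   3,   5,   1,  0,  0,  5,  0,  0,
   0,    8, 144,  20,   0,   0,  0,  0,  0,  0,  0,
   0,    3,   0,   1,   3,   5,  0,  2,  0,  0,  0,
   0,    0,   0,   0,   0,   1,  0,  0,  0,  0,  0),
  nrow = 11, byrow = TRUE, dimnames = list(modals, modals))

# Row/column positions of every DM type with at least ten tokens
keep <- which(dmmat_types >= 10, arr.ind = TRUE)

# Fixing the factor levels makes table() report zeros for absent modals
lev        <- sort(modals)
first_ten  <- table(factor(modals[keep[, "row"]], levels = lev))
second_ten <- table(factor(modals[keep[, "col"]], levels = lev))
perc_ten   <- 100 * first_ten / (first_ten + second_ten)

rbind(first = first_ten, second = second_ten)
round(perc_ten, 1)
```

The same approach would work for any other cutoff by changing the `>= 10` condition.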

3.3.6 Plot 4

And then plot that like before…

plot(as.numeric(results_ten$second),
     as.numeric(results_ten$first),
     xlim = c(0, 8), 
     ylim = c(0, 8),
     xlab = "Types with Modal in 2nd Position",
     ylab = "Types with Modal in 1st Position",
     col = grey,
     cex.axis = .85,
     cex.lab = .85,
     col.lab = grey,
     col.axis = grey,
     bty = "n",xaxt = "n", yaxt = "n", type = "n")

text(as.numeric(results_ten$second),
     as.numeric(results_ten$first),
     cex = .75,
     labels = c("Can", "Could", "May", "Might", "Must", "Ought to", 
                "Shall", "Should", "Used to", "Will", "Would"), 
     col = c(red, purp, red, red, purp, purp, purp, purp, red, red, purp))


abline(a = 0, b = 1, lty = "dashed", lwd = .5)

axis(side = 1, 
     lwd = .5,
       col.axis = grey,
       cex.axis = .75)
       
  
  axis(side = 2, 
       lwd = .5,
       col.axis = grey,
       cex.axis = .75)    
  
box(which = "plot", col = grey, lwd = .5)

3.3.7 Overall Totals

And lastly a heatmap to show their overall frequencies.

dmmat3 <- as.matrix(dmmat2[,1]+dmmat2[,2])

image(rotate(dmmat3),
      col = hcl.colors(9, "Purples2", rev = TRUE), 
      breaks = c(0,1,20,50,100,200,500,1000,2000,100000),
      tck = 0, 
      axes = TRUE,      
      frame.plot = TRUE, 
      xaxt='n', 
      yaxt = 'n')

  axis(side = 2, 
       at = c(1, .9, .8, .7, .6, .5, .4, .3, .2, .1, 0), 
       labels = rownames(dmmat3),
       las = 2,
       col.axis = grey,
       cex.axis = .8,
       tick = FALSE, 
       line = FALSE)
       
  
  axis(side = 3, 
       at = c(.5), 
       labels = c("Overall"),
       las = 1,
       col.axis = grey,
       cex.axis = .8,
       tick = FALSE, 
       line = FALSE)    

(Note: I wouldn’t include this image in the paper either)

3.3.8 Plot 5

A barplot shows this better.

barplot(dmmat3[,1], 
        cex.names = .7, 
        col = red, border = "NA", axes=FALSE)
axis(2, 
     cex.axis = .7, 
     las = 1, 
     col = "white", 
     col.ticks = "white", 
     col.axis = grey,
     at=c(dmmat3[1,1],
          dmmat3[2,1],
          dmmat3[3,1],
          dmmat3[4,1],
          dmmat3[5,1],
          dmmat3[8,1],
          0))

3.3.9 Markov Chain

Lastly, we can plot the overall structure as a network using a Markov chain. It’s really just an alternative way of representing the main co-occurrence matrix, but I think it gives us a different perspective, showing which modals are most often linked.

First we’ll need a network graphing library.

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

We’ll keep working with the same data matrix, but to make the network diagram easier to read and to focus in on the main relationships we’ll strip away pairs that are relatively infrequent (fewer than 50 hits), by just replacing them with 0. We could play around with this cutoff.

dmmat4 <- dmmat
dmmat4[dmmat4 < 50] <- 0
dmmat4
##          Might  Can Could Would Will Should May Must Used.to Shall Ought.to
## Might        0 1733   932   243   52    146   0    0       0     0        0
## Can          0    0   135     0  111      0   0    0       0     0        0
## Could        0   60     0     0    0      0   0    0       0     0        0
## Would        0    0   171     0   52     81   0    0       0     0        0
## Will        59  111     0    80    0      0   0    0       0     0        0
## Should       0    0     0   177    0      0   0    0       0     0        0
## May          0  273    53     0    0      0   0    0       0     0        0
## Must         0  164     0     0    0      0   0    0       0     0        0
## Used to      0    0   144     0    0      0   0    0       0     0        0
## Shall        0    0     0     0    0      0   0    0       0     0        0
## Ought to     0    0     0     0    0      0   0    0       0     0        0

After doing that, the rows and columns for Shall and Ought to are all zeroes, so we can drop those altogether.

dmmat4 <- dmmat4[-10:-11,-10:-11]
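Dropping rows and columns by index works here, but it depends on the column order and on the cutoff. A minimal sketch of doing the same thing programmatically, on a toy matrix with hypothetical counts (the names `m`, `cutoff` and `m_trim` are made up for illustration):

```r
# Toy co-occurrence matrix (hypothetical counts, not the real data);
# rows = first modal, columns = second modal
m <- matrix(c( 0, 120, 3,
              60,   0, 0,
               2,   0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Zero out cells below the cutoff
cutoff <- 50
m[m < cutoff] <- 0

# Keep a modal only if it still occurs in at least one pair,
# in either position
keep <- rowSums(m) > 0 | colSums(m) > 0
m_trim <- m[keep, keep, drop = FALSE]
m_trim
```

This way the set of dropped modals follows automatically if we change the cutoff.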

And then we can convert to a graph object and plot it (note: there is a line from Might to Should which can be obscured; we can position these nodes ourselves).

IG <- graph_from_adjacency_matrix(dmmat4, weighted=TRUE, mode = "directed")

par(mar=c(.0,0,0,0))
plot.igraph(IG, 
              vertex.size = 24,
              vertex.color=red,
              vertex.shape="circle",
              vertex.frame.color=red,
              vertex.frame.lwd=1,
              vertex.label.color="black",
              vertex.label.family="sans",
              vertex.label.cex=.75,
              edge.color="black",
              edge.width=.8,
              edge.arrow.size=.5)

Note that I’ve simplified here on purpose: there is no attempt to represent the frequencies aside from the cutting of less frequent nodes and links. Note also that the arrows are only representing order (which we assume matters, e.g. we don’t combine Might Could and Could Might), so there is nothing here at odds with CxG. By using the arrows, however, we can cut the number of nodes in half, which makes this much more readable than if we had separate nodes for first and second position. This also lets this visualisation complement the heat map, which separates out the modals by position. Specifically, what I think it does a good job of is highlighting which modals are really central to this phenomenon (regardless of position): might, could, and can are the core here.

3.3.10 Action: Describe Meaning

Also I really think we should go through here and provide examples (like 10) for each of the top say 10 double modals to really establish their meaning. Maybe a few others? It would be good to see how each modal works in first position, assuming DMs are basically adding a modal in front of a modal. People don’t know this and it is important in general and to our argument (e.g. Might Can vs. Might Could) later on.

Like, just looking at these top 10 myself, some have more transparent meaning to me, although I’m clearly coming to this with a lot of pre-existing knowledge and biases. But like the Might __ and May __ are easy to understand (I think), just by replacing either with ‘Maybe’. Must __ seems fine too (replace with ‘Necessarily’?), as well as Used to __ (‘Once’) and Can __ (‘Certainly’). But I don’t really know how to read Should Would or Would Could.

3.3.11 Finding 3: Co-occurrence Patterns

The main observation here is, I think, that no matter how you look at it, it’s not just random or even in terms of which modals are involved in DMs. We need to explain it.

I’ll break it down along several lines, but let me first say I’m not really sure how those patterns relate to what has been said in the formal literature (we should look into that and discuss it) or how that might tie in with CxG considerations, but here is my quick, general descriptive take on what’s happening, writing out my thoughts as I think them through…

3.3.11.1 Finding 3.1: Modals Involved

At the most basic level, Might, Can, and Could are doing most of the work. Not just Might Can and Might Could, although those are the vast majority of hits, but also Might being followed by other modals (esp. Would, Will) and Can and Could being preceded by other modals.

  • Of the top 10 DMs, 9 have might/can/could in them.
  • Of the top 20 DMs, 18 do.

I’m not totally sure what to make of all this. But clearly the frequencies of these modals individually aren’t predicting all this. It’s not like they just combine freely (although, like I said, it looks like basically any combo is possible): the main constraint on the DM frequencies IS NOT just the frequencies of the modals individually.

We should definitely run these for the corpus, but in Longman (page 486) it’s

  • Will >> Would > Can >> Could >> May = Should > Must > Might >> Shall

Whereas we have this across all DMs and both positions

  • Might > Can >> Could >> Would > Will = Should > May >> Must >> Shall

So, the relatively low use of Would and esp. Will in DMs (in either position) is really notable!
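To make the comparison between the two orderings a bit more concrete, here is a rough sketch that turns each list into ranks and looks at how far each modal moves. The ranks are transcribed by hand from the two lists above; items joined by “=” share a rank, and the difference between “>” and “>>” is ignored, so this is only a crude summary.

```r
# Ranks transcribed from the two orderings above (ties share a rank)
longman <- c(Will = 1, Would = 2, Can = 3, Could = 4, May = 5.5,
             Should = 5.5, Must = 7, Might = 8, Shall = 9)
dm      <- c(Might = 1, Can = 2, Could = 3, Would = 4, Will = 5.5,
             Should = 5.5, May = 7, Must = 8, Shall = 9)

# Positive = ranked lower in DMs than in general use;
# negative = ranked higher in DMs than in general use
shift <- dm[names(longman)] - longman
sort(shift, decreasing = TRUE)
```

Once we have the individual modal frequencies for our own corpus, the same comparison could be run on those instead of the Longman ordering.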

Why don’t these two forms participate more in DMs?

  • of the top 10 DMs, 0 have Will, and 2 have Would
  • but then 6 of the next 10 have Will, and 4 have Would

And then there is the massive rise of Might. And so along with Can and Could, why do these participate so much more?

So that is the initial finding here: the balance between which modals participate in DMs is not what we’d expect.

So the question is why?

I think this all implies that there are either grammatical or more likely semantic/pragmatic/functional reasons for this hierarchy.

I’m not sure if that can just be explained based on the characteristics of these different modals: It seems to me like we’ll need to look at the combinations/positions as well, but let’s start with the individual modals, and then move on to the combinations. But first I think I need at least to review modal classifications.

3.3.11.2 Types of Modals

I think we need to review this information in the literature review probably. I think what I’m writing is all right, but you should check/correct/extend it – esp. because some of this I’m grabbing from Wikipedia.

First, modals are generally grouped into so-called present-preterite pairs, although the preterite forms aren’t really preterite, being more subjunctive, as in expressing the speaker’s attitude, as if signalling that they have a choice in the matter but there is doubt, whereas the present is more neutral.

  • Can, Could
  • May, Might
  • Shall, Should
  • Will, Would
  • Must, no pret

There are also various ways of classifying modals into larger categories.

For example, according to Longman, there are three types of modals, notably splitting Should and Shall, but none of the others:

  1. Permission/possibility/ability: Can, Could, May, Might
  2. Obligation/necessity: Must, Should, Ought to [plus (had) better, have (got) to, need to, be supposed to]
  3. Volition/prediction: Will, Would, Shall [plus be going to]

I mean these don’t seem airtight to me: e.g. ‘He should win the race’ or ‘I think he can do it’ seem like prediction, ‘He will have to do it’ seems like necessity, and ‘He could do it last week’ seems like none. I feel like the distinction between the Possibility modals and the others is maybe clearest?

There is also the traditional distinction between Epistemic modals (possibility of being true or untrue) and Deontic modals (possibility of being free to act), as well as sometimes Dynamic modals (which are seen as a sub-type of Deontic modals, where it’s the individual’s will to act).

But my understanding is that often any given modal can be used either way, and so to make use of these categories we’d need to code individual uses, and that would be in the DM, so it’s not even clear how to do this. Honestly I don’t find the categories very clear either but they do seem useful.

3.3.11.3 Finding 3.1: Follow up

With all that said, a few notable things about the individual modals involved in the DMs we observe.

  • Might, our top modal, according to Wikipedia is the only modal that is exclusively epistemic. I’m not so sure thinking about it, but I guess that makes sense in general: it’s kind of the modal marking generic, straightforward doubt that something is true or not.

  • And then Can and Could as well as Would, which is in fourth place overall, are all used epistemically as well, and are also the only modals listed as having dynamic meaning (although I have seen Will listed as having dynamic meaning as well, which would totally shoot this explanation down).

  • The top 3 individual modals (Might, Can, Could) are all from the Possibility Type, as well as the top 3 DMs (Might Can, Might Could and May Can)

  • Would is in the next 3 top DMs, and Would and Will are the next two most common overall and both are from the Prediction Type.

  • Therefore the Necessity type is the least active in DMs, although Must and Should do participate in DMs in the top 10 (Should Would, Must Can, Might Should).

So, overall I think there is at least something possibly here around DMs primarily involving the expression or maybe even the adding of possibility and/or epistemic modality – but I think really figuring that out would require looking at examples of the DMs in context.

3.3.11.4 Finding 3.2: Modal positions

We also need to look into the patterns of how the modals combine: that may help us explain why certain modals are more or less common in general in DMs, and it’s also a result in and of itself that we should try to explain.

First, let’s look at just the positions of the modals, taking frequencies into consideration.

Some initial observations:

  • Most modals are primarily positional in DMs

    • First position: Might, May, Must, Used to
    • Second position: Can, Could, Shall, Ought to
  • But Would, Should and Will are about 50/50

  • So in other words, once again, this is far from random.

So these results basically let us cluster the modals into three groups of three, based on where they tend to occur. Ignoring the semi-modals, we get

  1. Might, May, Must
  2. Can, Could, Shall
  3. Will, Would, Should

Now, in terms of our categories we get

preSent/preTerite

  1. T, S, S
  2. S, T, S
  3. S, T, T

pOssibility/Necessity/pRediction

  1. O, O, N
  2. O, O, R
  3. R, R, N
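These codings can be reproduced with a couple of lookup vectors, which makes it easy to re-check them or to swap in a different classification. A minimal sketch: the S/T and O/N/R codes follow the present-preterite pairs and the Longman types listed earlier, while the group labels `first`, `second` and `mixed` are just made-up names for the three positional groups.

```r
# Category codings from the "Types of Modals" section
tense <- c(Can = "S", Could = "T", May = "S", Might = "T", Shall = "S",
           Should = "T", Will = "S", Would = "T", Must = "S")
type  <- c(Can = "O", Could = "O", May = "O", Might = "O",
           Must = "N", Should = "N",
           Will = "R", Would = "R", Shall = "R")

# The three positional groups (labels are hypothetical)
groups <- list(first  = c("Might", "May", "Must"),
               second = c("Can", "Could", "Shall"),
               mixed  = c("Will", "Would", "Should"))

# One row per group, with the codes of its members in order
coding <- data.frame(
  group = names(groups),
  tense = sapply(groups, function(g) paste(tense[g], collapse = ", ")),
  type  = sapply(groups, function(g) paste(type[g],  collapse = ", "))
)
coding
```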

And I don’t really know how to code Epistemic/Deontic, but it’s notable that the less Deontic ones (Might, Could, Would), are spread across all three types.

So overall, not much here!

The only real pattern I see is that Possibility modals, which are most frequent, tend to be more fixed in place.

The possible issue here, depending on what exactly we want, is that we’re taking frequency into account, so like Might Can and Might Could alone are driving these results.

So let’s next consider the types, ignoring their frequencies. If we look at all types that occur at least once, with 45%-55% classified as Neutral, and ignoring Ought to, Used to, Shall, which are uncommon, we get:

  • First position: May, Might
  • Second position: Could
  • Neutral: Can, Must, Should, Will, Would

And then looking just at DM types with token counts of at least 10:

  • First position: May, Might, Will
  • Second position: Must, Would
  • Neutral: Can, Could, Should
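The First/Second/Neutral classes above come down to a simple rule on the share of a modal’s occurrences that fall in first position. A minimal sketch of that rule, using the 45%-55% Neutral band from the text; the function name `classify_position` and the counts are hypothetical:

```r
# Classify a modal by the share of its DM occurrences in first position.
# Shares between lo and hi (inclusive) count as Neutral.
classify_position <- function(first, second, lo = 0.45, hi = 0.55) {
  share <- first / (first + second)
  ifelse(share > hi, "First",
         ifelse(share < lo, "Second", "Neutral"))
}

first_n  <- c(A = 9, B = 5, C = 2)   # made-up counts, first slot
second_n <- c(A = 1, B = 5, C = 8)   # made-up counts, second slot
classify_position(first_n, second_n)
```

Assuming the rows of the co-occurrence matrix are the first modal and the columns the second, a type-based version could pass in something like `rowSums(dmmat > 0)` and `colSums(dmmat > 0)`, and a token-based version `rowSums(dmmat)` and `colSums(dmmat)`.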

Taking this all together, we get a hierarchy something like the following, running from first-position dominant to neutral to second-position dominant, focusing on the top 7, with Must being a bit less clear than the others.

  • Might > May >> Will > Must >> Can >> Should >> Could > Would

Looking at this now just impressionistically, what I see is like a cline based on whether there is a statement being made about the possibility of a future event.

So like ‘I Might/May/Will/Must go to work’ are all referring to the specific possibility of going to work in a specific instance in the future, ordered roughly from least to most likely to happen. They are all claims about whether a specific event will or won’t happen – which can be true or not true. I guess these are all more epistemic?

But ‘I Can/Should/Could/Would go to work’ are all speaking in the hypothetical in a way, like there is more missing information being signaled, like it’s up to the subject in different ways, more of a judgment, I guess on their freedom to act, so more deontic/dynamic?

So overall it feels like maybe there is something of a pattern here, where more Epistemic/Possibility/Subjunctive Modals tend to go first position and more Deontic modals tend to go second? I guess that is the main result here/explanation. At least possibly.

It would be good to think that through a bit. I’m not sure I totally get the Epistemic vs. Deontic distinction, but this seems very important. We also might want to look at examples from the most common DMs and see if meaning shakes out this way in the DMs, since most modals can have both Epistemic and Deontic meanings individually in context.

3.3.11.5 Finding 3.3: Modal pairings

Next, we can look at which modals tend to pair together.

Some observations about pairing, including working off the Present-Preterite and the Possibility-Necessity-Prediction systems:

  • aside from Might and May, the modals that most often come before Can (Will, Must) and Could (Would, Used to) are different.

  • we also get both Can Could and Could Can relatively frequently.

  • Might Can is the most common DM and splits the Present-Preterite distinction, whereas Might Could is Preterite-Preterite, and May Can is Present-Present.

  • Combining Present-Preterite pairs is fairly uncommon, but they do occur: Can Could (135), Will Would (80), Could Can (60), Would Will (32), Might May (13), May Might (7), Shall Should (5), Should Shall (1), with the relatively high frequency of Will Would and Would Will maybe being especially notable, given the relatively low frequency of these two otherwise frequent modals in other DMs.

  • We can split up the top 10 in terms of these 2 systems (note: Can and May on the one hand and Could and Might on the other come out the same in this system; also note: since basically all these DMs can occur, we can at best describe which ones tend to occur or not to occur)

    • Might Can 1733: Pret/Poss + Pres/Poss
    • Might Could 932: Pret/Poss + Pret/Poss
    • May Can 273: Pres/Poss + Pres/Poss
    • Might Would 243: Pret/Poss + Pret/Pred
    • Should Would 177: Pret/Nec + Pret/Pred
    • Would Could 171: Pret/Pred + Pret/Poss
    • Must Can 164: Pres/Nec + Pres/Poss
    • Might Should 146: Pret/Poss + Pret/Nec
    • Used to Could 144: Pret/??? + Pret/Poss
    • Can Could 135: Pres/Poss + Pret/Poss
  • So from this we can say:

    • Pret in first position (7/10)
    • Poss in first position (6/9 and 4/4)
    • Pret in second position (7/10)
    • Poss in second position (7/10)
    • Agreement on Pret/Pres (8/10 but 0/1)
    • Agreement on Type, all Poss-Poss (4/9 but 3/3)
    • Has at least one Poss modal (9/10)
  • So overall it seems like Preterite-Subjunctive and Possibility Modals are favoured, regardless of position, and that generally the Pret/Pres Agreement is present.

  • So that might seem like the goal most often is to attach a possibility modality meaning on top of a base modality meaning, but you get 4 Poss-Poss DMs, including the top 3, and these don’t all feel like they are just emphasis: each modal brings a distinct meaning in an additive way, I guess just like you’d expect.

So I think overall what we see is that DMs often involve specifying or complicating the expression of possibility. I don’t know. Maybe there is more here?
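Since the tallies for the top 10 are easy to get wrong by hand, here is a small sketch that recomputes them from the list above. The pairs and the Pres/Pret and Poss/Nec/Pred codings are the ones given in the list; Used to has no type coding, so it is treated as missing and drops out of the type-based counts. The object names are made up for illustration.

```r
# Top 10 DMs, in order of frequency, as listed above
top10 <- data.frame(
  first  = c("Might", "Might", "May", "Might", "Should",
             "Would", "Must", "Might", "Used to", "Can"),
  second = c("Can", "Could", "Can", "Would", "Would",
             "Could", "Can", "Should", "Could", "Could"),
  stringsAsFactors = FALSE
)

# Codings from the list above ("Used to" type is unknown, so NA)
tense <- c(Can = "Pres", Could = "Pret", May = "Pres", Might = "Pret",
           Must = "Pres", Should = "Pret", Will = "Pres", Would = "Pret",
           `Used to` = "Pret")
type  <- c(Can = "Poss", Could = "Poss", May = "Poss", Might = "Poss",
           Must = "Nec", Should = "Nec", Will = "Pred", Would = "Pred",
           `Used to` = NA)

t1 <- tense[top10$first]; t2 <- tense[top10$second]
k1 <- type[top10$first];  k2 <- type[top10$second]

# Counts out of 10 (type counts involving "Used to" are skipped)
tally <- c(
  pret_first  = sum(t1 == "Pret"),
  pret_second = sum(t2 == "Pret"),
  poss_first  = sum(k1 == "Poss", na.rm = TRUE),
  poss_second = sum(k2 == "Poss"),
  tense_agree = sum(t1 == t2),
  poss_poss   = sum(k1 == "Poss" & k2 == "Poss", na.rm = TRUE),
  any_poss    = sum((!is.na(k1) & k1 == "Poss") | k2 == "Poss")
)
tally
```

The same table could be extended to the top 20 by adding rows to `top10`.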

3.3.11.6 Summary

I guess my general conclusion would be that, broadly speaking, DMs are used to express more complex modalities in an additive way, where the first modal modifies the meaning of the second in an adverbial-like way. The point is to express fine-grained modal meaning, as opposed to just emphasis, especially fine-grained meanings involving possibility/epistemic/preterite-subjunctive knowledge, and specifically nuance about the outlook/opinion of the speaker on the proposition (like are DMs used mostly with personal pronouns/proper nouns??). The first modal is more often than not one of those types (possibility/etc.), and the second is often deontic/prediction/etc.

Can you think about this a bit – assess what I’m saying critically (maybe I’m wrong), see what else you can pull from the analysis and whether you can glean more from an analysis of the DMs in context (maybe I’ve missed something), and clean up the argument and the support for the argument? And write this up?

Also it may be that there are two different patterns here. A Southern pattern and a rest of the US pattern. A Black pattern and a White pattern – at least among the most frequent/Southeastern DMs. In that case we would expect stuff to be somewhat unclear. We might note that. We’ll definitely follow it up below via mapping.

4 Mapping

Next let’s move on to mapping.

4.1 Libraries

I’m going to do this in old school library(sp) style because I’m better with that. Nowadays, you should probably learn how to do this using library(sf) which is better integrated with the tidyverse. I’ve been switching over but I’m more confident this way and don’t want to make mistakes.

library(maps)
library(sp)
library(maptools)
## Checking rgeos availability: TRUE
library(classInt)

4.2 Data

First we can read in the data.

dm_maps <- as.data.frame(read.table("REG_MATRIX_DM_NORM.txt",
                                    sep=",",
                                    header=TRUE))
summary(dm_maps)
##       fips          STATE               NAME                LONG        
##  Min.   : 1001   Length:3075        Length:3075        Min.   :-124.14  
##  1st Qu.:19030   Class :character   Class :character   1st Qu.: -98.09  
##  Median :29179   Mode  :character   Mode  :character   Median : -90.40  
##  Mean   :30461                                         Mean   : -91.80  
##  3rd Qu.:45050                                         3rd Qu.: -83.57  
##  Max.   :56045                                         Max.   : -67.65  
##       LAT          CAN_COULD           CAN_MAY           CAN_MIGHT        
##  Min.   :25.53   Min.   :    0.00   Min.   :   0.000   Min.   :   0.0000  
##  1st Qu.:34.63   1st Qu.:    0.00   1st Qu.:   0.000   1st Qu.:   0.0000  
##  Median :38.37   Median :    0.00   Median :   0.000   Median :   0.0000  
##  Mean   :38.29   Mean   :   13.89   Mean   :   3.945   Mean   :   0.7633  
##  3rd Qu.:41.73   3rd Qu.:    0.00   3rd Qu.:   0.000   3rd Qu.:   0.0000  
##  Max.   :48.82   Max.   :11639.00   Max.   :4232.000   Max.   :1455.0000  
##     CAN_MUST          CAN_SHALL          CAN_SHOULD          CAN_WILL      
##  Min.   :  0.0000   Min.   :  0.0000   Min.   :   0.000   Min.   :   0.00  
##  1st Qu.:  0.0000   1st Qu.:  0.0000   1st Qu.:   0.000   1st Qu.:   0.00  
##  Median :  0.0000   Median :  0.0000   Median :   0.000   Median :   0.00  
##  Mean   :  0.6904   Mean   :  0.1538   Mean   :   1.161   Mean   :  14.71  
##  3rd Qu.:  0.0000   3rd Qu.:  0.0000   3rd Qu.:   0.000   3rd Qu.:   0.00  
##  Max.   :925.0000   Max.   :464.0000   Max.   :1159.000   Max.   :7750.00  
##    CAN_WOULD          COULD_CAN         COULD_MAY         COULD_MIGHT       
##  Min.   :  0.0000   Min.   :   0.00   Min.   :  0.0000   Min.   :    0.000  
##  1st Qu.:  0.0000   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.:    0.000  
##  Median :  0.0000   Median :   0.00   Median :  0.0000   Median :    0.000  
##  Mean   :  0.4959   Mean   :   7.52   Mean   :  0.6732   Mean   :    4.193  
##  3rd Qu.:  0.0000   3rd Qu.:   0.00   3rd Qu.:  0.0000   3rd Qu.:    0.000  
##  Max.   :472.0000   Max.   :4438.00   Max.   :954.0000   Max.   :10962.000  
##   COULD_SHOULD        COULD_WILL        COULD_WOULD          MAY_CAN        
##  Min.   :  0.0000   Min.   :  0.0000   Min.   :   0.000   Min.   :    0.00  
##  1st Qu.:  0.0000   1st Qu.:  0.0000   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :  0.0000   Median :  0.0000   Median :   0.000   Median :    0.00  
##  Mean   :  0.5958   Mean   :  0.4397   Mean   :   7.606   Mean   :   48.65  
##  3rd Qu.:  0.0000   3rd Qu.:  0.0000   3rd Qu.:   0.000   3rd Qu.:    0.00  
##  Max.   :570.0000   Max.   :451.0000   Max.   :6151.000   Max.   :19757.00  
##    MAY_COULD         MAY_MIGHT           MAY_MUST          MAY_OUGHT_TO    
##  Min.   :   0.00   Min.   :  0.0000   Min.   :   0.0000   Min.   :  0.000  
##  1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:  0.000  
##  Median :   0.00   Median :  0.0000   Median :   0.0000   Median :  0.000  
##  Mean   :  13.54   Mean   :  0.2751   Mean   :   0.7259   Mean   :  0.718  
##  3rd Qu.:   0.00   3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:  0.000  
##  Max.   :9509.00   Max.   :430.0000   Max.   :2072.0000   Max.   :934.000  
##    MAY_SHALL          MAY_SHOULD         MAY_USED_TO          MAY_WILL       
##  Min.   :   0.000   Min.   :   0.0000   Min.   : 0.00000   Min.   :  0.0000  
##  1st Qu.:   0.000   1st Qu.:   0.0000   1st Qu.: 0.00000   1st Qu.:  0.0000  
##  Median :   0.000   Median :   0.0000   Median : 0.00000   Median :  0.0000  
##  Mean   :   1.113   Mean   :   0.8852   Mean   : 0.03967   Mean   :  0.8439  
##  3rd Qu.:   0.000   3rd Qu.:   0.0000   3rd Qu.: 0.00000   3rd Qu.:  0.0000  
##  Max.   :3378.000   Max.   :1058.0000   Max.   :63.00000   Max.   :461.0000  
##    MAY_WOULD          MIGHT_CAN         MIGHT_COULD        MIGHT_MAY       
##  Min.   :   0.000   Min.   :     0.0   Min.   :    0.0   Min.   :  0.0000  
##  1st Qu.:   0.000   1st Qu.:     0.0   1st Qu.:    0.0   1st Qu.:  0.0000  
##  Median :   0.000   Median :     0.0   Median :    0.0   Median :  0.0000  
##  Mean   :   2.834   Mean   :   324.8   Mean   :  174.6   Mean   :  0.3236  
##  3rd Qu.:   0.000   3rd Qu.:     0.0   3rd Qu.:    0.0   3rd Qu.:  0.0000  
##  Max.   :4829.000   Max.   :243368.0   Max.   :33218.0   Max.   :325.0000  
##    MIGHT_MUST       MIGHT_OUGHT_TO       MIGHT_SHALL        MIGHT_SHOULD    
##  Min.   :  0.0000   Min.   :    0.000   Min.   : 0.00000   Min.   :   0.00  
##  1st Qu.:  0.0000   1st Qu.:    0.000   1st Qu.: 0.00000   1st Qu.:   0.00  
##  Median :  0.0000   Median :    0.000   Median : 0.00000   Median :   0.00  
##  Mean   :  0.2504   Mean   :    7.798   Mean   : 0.00878   Mean   :  24.61  
##  3rd Qu.:  0.0000   3rd Qu.:    0.000   3rd Qu.: 0.00000   3rd Qu.:   0.00  
##  Max.   :330.0000   Max.   :10880.000   Max.   :27.00000   Max.   :6947.00  
##  MIGHT_USED_TO        MIGHT_WILL        MIGHT_WOULD          MUST_CAN      
##  Min.   :  0.0000   Min.   :    0.00   Min.   :    0.00   Min.   :   0.00  
##  1st Qu.:  0.0000   1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:   0.00  
##  Median :  0.0000   Median :    0.00   Median :    0.00   Median :   0.00  
##  Mean   :  0.0787   Mean   :   11.75   Mean   :   75.03   Mean   :  20.47  
##  3rd Qu.:  0.0000   3rd Qu.:    0.00   3rd Qu.:    0.00   3rd Qu.:   0.00  
##  Max.   :242.0000   Max.   :18419.00   Max.   :19487.00   Max.   :4619.00  
##    MUST_COULD         MUST_MIGHT        MUST_SHOULD        MUST_USED_TO     
##  Min.   : 0.00000   Min.   :  0.0000   Min.   : 0.00000   Min.   :   0.000  
##  1st Qu.: 0.00000   1st Qu.:  0.0000   1st Qu.: 0.00000   1st Qu.:   0.000  
##  Median : 0.00000   Median :  0.0000   Median : 0.00000   Median :   0.000  
##  Mean   : 0.03837   Mean   :  0.7301   Mean   : 0.02797   Mean   :   1.886  
##  3rd Qu.: 0.00000   3rd Qu.:  0.0000   3rd Qu.: 0.00000   3rd Qu.:   0.000  
##  Max.   :70.00000   Max.   :789.0000   Max.   :86.00000   Max.   :4775.000  
##    MUST_WILL           MUST_WOULD        OUGHT_TO_SHOULD      SHALL_CAN       
##  Min.   :   0.0000   Min.   :  0.00000   Min.   : 0.00000   Min.   :  0.0000  
##  1st Qu.:   0.0000   1st Qu.:  0.00000   1st Qu.: 0.00000   1st Qu.:  0.0000  
##  Median :   0.0000   Median :  0.00000   Median : 0.00000   Median :  0.0000  
##  Mean   :   0.4598   Mean   :  0.09236   Mean   : 0.00878   Mean   :  0.2091  
##  3rd Qu.:   0.0000   3rd Qu.:  0.00000   3rd Qu.: 0.00000   3rd Qu.:  0.0000  
##  Max.   :1105.0000   Max.   :227.00000   Max.   :27.00000   Max.   :449.0000  
##    SHALL_MUST        SHALL_SHOULD        SHALL_WILL         SHALL_WOULD       
##  Min.   : 0.00000   Min.   :  0.0000   Min.   :  0.00000   Min.   :  0.00000  
##  1st Qu.: 0.00000   1st Qu.:  0.0000   1st Qu.:  0.00000   1st Qu.:  0.00000  
##  Median : 0.00000   Median :  0.0000   Median :  0.00000   Median :  0.00000  
##  Mean   : 0.02959   Mean   :  0.2491   Mean   :  0.08715   Mean   :  0.03382  
##  3rd Qu.: 0.00000   3rd Qu.:  0.0000   3rd Qu.:  0.00000   3rd Qu.:  0.00000  
##  Max.   :67.00000   Max.   :423.0000   Max.   :231.00000   Max.   :104.00000  
##    SHOULD_CAN       SHOULD_COULD       SHOULD_MIGHT       SHOULD_MUST      
##  Min.   : 0.0000   Min.   :   0.000   Min.   :   0.000   Min.   :   0.000  
##  1st Qu.: 0.0000   1st Qu.:   0.000   1st Qu.:   0.000   1st Qu.:   0.000  
##  Median : 0.0000   Median :   0.000   Median :   0.000   Median :   0.000  
##  Mean   : 0.1568   Mean   :   5.192   Mean   :   2.682   Mean   :   1.524  
##  3rd Qu.: 0.0000   3rd Qu.:   0.000   3rd Qu.:   0.000   3rd Qu.:   0.000  
##  Max.   :95.0000   Max.   :5866.000   Max.   :2239.000   Max.   :4023.000  
##  SHOULD_OUGHT_TO     SHOULD_SHALL       SHOULD_USED_TO      SHOULD_WILL      
##  Min.   :  0.0000   Min.   :  0.00000   Min.   :   0.000   Min.   :   0.000  
##  1st Qu.:  0.0000   1st Qu.:  0.00000   1st Qu.:   0.000   1st Qu.:   0.000  
##  Median :  0.0000   Median :  0.00000   Median :   0.000   Median :   0.000  
##  Mean   :  0.2361   Mean   :  0.03415   Mean   :   0.905   Mean   :   4.377  
##  3rd Qu.:  0.0000   3rd Qu.:  0.00000   3rd Qu.:   0.000   3rd Qu.:   0.000  
##  Max.   :656.0000   Max.   :105.00000   Max.   :2783.000   Max.   :8342.000  
##   SHOULD_WOULD       USED_TO_CAN        USED_TO_COULD      USED_TO_WOULD     
##  Min.   :    0.00   Min.   :   0.0000   Min.   :    0.00   Min.   :   0.000  
##  1st Qu.:    0.00   1st Qu.:   0.0000   1st Qu.:    0.00   1st Qu.:   0.000  
##  Median :    0.00   Median :   0.0000   Median :    0.00   Median :   0.000  
##  Mean   :   22.28   Mean   :   0.9076   Mean   :   23.18   Mean   :   5.225  
##  3rd Qu.:    0.00   3rd Qu.:   0.0000   3rd Qu.:    0.00   3rd Qu.:   0.000  
##  Max.   :12912.00   Max.   :2052.0000   Max.   :10622.00   Max.   :4829.000  
##     WILL_CAN         WILL_COULD          WILL_MAY          WILL_MIGHT      
##  Min.   :    0.0   Min.   :   0.000   Min.   :  0.0000   Min.   :   0.000  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:  0.0000   1st Qu.:   0.000  
##  Median :    0.0   Median :   0.000   Median :  0.0000   Median :   0.000  
##  Mean   :   20.2   Mean   :   1.635   Mean   :  0.7197   Mean   :   4.989  
##  3rd Qu.:    0.0   3rd Qu.:   0.000   3rd Qu.:  0.0000   3rd Qu.:   0.000  
##  Max.   :30470.0   Max.   :1760.000   Max.   :361.0000   Max.   :3991.000  
##    WILL_MUST WILL_OUGHT_TO         WILL_SHALL        WILL_SHOULD     
##  Min.   :0   Min.   : 0.000000   Min.   :  0.0000   Min.   :  0.000  
##  1st Qu.:0   1st Qu.: 0.000000   1st Qu.:  0.0000   1st Qu.:  0.000  
##  Median :0   Median : 0.000000   Median :  0.0000   Median :  0.000  
##  Mean   :0   Mean   : 0.009431   Mean   :  0.7912   Mean   :  1.376  
##  3rd Qu.:0   3rd Qu.: 0.000000   3rd Qu.:  0.0000   3rd Qu.:  0.000  
##  Max.   :0   Max.   :29.000000   Max.   :650.0000   Max.   :558.000  
##    WILL_WOULD         WOULD_CAN         WOULD_COULD       WOULD_MAY       
##  Min.   :   0.000   Min.   :  0.0000   Min.   :   0.0   Min.   :   0.000  
##  1st Qu.:   0.000   1st Qu.:  0.0000   1st Qu.:   0.0   1st Qu.:   0.000  
##  Median :   0.000   Median :  0.0000   Median :   0.0   Median :   0.000  
##  Mean   :   5.626   Mean   :  0.4859   Mean   :  17.6   Mean   :   2.112  
##  3rd Qu.:   0.000   3rd Qu.:  0.0000   3rd Qu.:   0.0   3rd Qu.:   0.000  
##  Max.   :5578.000   Max.   :700.0000   Max.   :4112.0   Max.   :3599.000  
##   WOULD_MIGHT         WOULD_MUST         WOULD_SHOULD      WOULD_USED_TO     
##  Min.   :   0.000   Min.   :   0.0000   Min.   :   0.000   Min.   :   0.000  
##  1st Qu.:   0.000   1st Qu.:   0.0000   1st Qu.:   0.000   1st Qu.:   0.000  
##  Median :   0.000   Median :   0.0000   Median :   0.000   Median :   0.000  
##  Mean   :   2.989   Mean   :   0.7184   Mean   :   8.794   Mean   :   2.125  
##  3rd Qu.:   0.000   3rd Qu.:   0.0000   3rd Qu.:   0.000   3rd Qu.:   0.000  
##  Max.   :1843.000   Max.   :1722.0000   Max.   :7207.000   Max.   :1589.000  
##    WOULD_WILL       
##  Min.   :    0.000  
##  1st Qu.:    0.000  
##  Median :    0.000  
##  Mean   :    7.964  
##  3rd Qu.:    0.000  
##  Max.   :12379.000
dim(dm_maps)
## [1] 3075   81

So basically what we’ve got here are 5 metadata variables and 76 DM variables, which show the relative frequency (per billion words) of each DM across 3075 counties/county equivalents in the contiguous US. This is the data we’ll map.

Note as expected it’s super skewed data.
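A quick way to put a number on the skew is to compare the share of zeros, the median and the mean for a column. A minimal sketch on a made-up vector (the values are hypothetical, not taken from the data):

```r
# Made-up relative frequencies for 10 counties: mostly zeros, a few big values
x <- c(0, 0, 0, 0, 0, 0, 0, 0, 250, 12000)

skew_check <- c(prop_zero = mean(x == 0),  # share of counties with no hits
                median    = median(x),
                mean      = mean(x))
skew_check
```

On the real data the same three summaries could be taken from any DM column, e.g. `dm_maps$MIGHT_CAN`.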

The metadata is the county FIPS code, State, County Name, and the Centroid Longitude and Latitude for that county.

NOTE: Will Must has 0 counts in every county. That is despite a count of 11 overall. I’m not totally sure what is happening here, especially when you have other DMs with counts of 1 overall that don’t sum to 0. I guess it’s possible that the 11 are in big counties.

summary(dm_maps$WILL_MUST)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0
summary(dm_maps$SHOULD_SHALL)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.00000   0.00000   0.00000   0.03415   0.00000 105.00000

So we can drop it, leaving 75 maps. WE NEED TO CHECK THIS. IN GENERAL WE SHOULD BE WORKING WITH THE FREQUENCIES BY COUNTY NOT THE RELATIVE FREQUENCIES BY COUNTY WHEN COMPUTING STATE AVERAGES.
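The point about state averages can be illustrated with a toy example: averaging county-level relative frequencies gives every county equal weight, so small counties with a single hit can dominate, whereas summing the raw counts first gives the actual state-level rate. The numbers below are hypothetical.

```r
# Hypothetical counties in one state: DM hits and total word counts
counties <- data.frame(
  hits  = c(0, 1, 20),
  words = c(1e5, 1e5, 1e8)
)

# Relative frequency per billion words in each county
counties$rate <- counties$hits / counties$words * 1e9

# (a) Unweighted mean of county rates: small counties dominate
mean_of_rates <- mean(counties$rate)

# (b) Pooled state rate: sum the raw counts first, then normalise
pooled_rate <- sum(counties$hits) / sum(counties$words) * 1e9

c(mean_of_rates = mean_of_rates, pooled_rate = pooled_rate)
```

Here the mean of the county rates comes out far higher than the pooled rate, which is why the raw county frequencies are the safer input for state-level figures.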

dm_maps <- dm_maps[,c(1:68,70:ncol(dm_maps))]
dim(dm_maps)
## [1] 3075   80
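Dropping by position works as long as the column order never changes. A slightly safer sketch drops the column by name instead; the toy data frame below is a hypothetical stand-in for dm_maps.

```r
# Toy data frame standing in for dm_maps (hypothetical values)
toy <- data.frame(fips       = c(1001, 1003),
                  WILL_MIGHT = c(0, 12),
                  WILL_MUST  = c(0, 0),
                  WILL_SHALL = c(3, 0))

# Drop the all-zero column by name rather than by index
toy <- toy[, setdiff(names(toy), "WILL_MUST")]
names(toy)
```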

Aside from this, there are some other issues with this data: the order is weird and we don’t have the rows coded with the county names used for the approach to mapping we’ll use.

tail(dm_maps[2:3], 40)
##            STATE       NAME
## 3036    colorado kit carson
## 3037    colorado       lake
## 3038    colorado   la plata
## 3039    colorado    larimer
## 3040    colorado las animas
## 3041    colorado    lincoln
## 3042    colorado      logan
## 3043    colorado       mesa
## 3044    colorado    mineral
## 3045    colorado     moffat
## 3046    colorado  montezuma
## 3047    colorado   montrose
## 3048    colorado     morgan
## 3049    colorado      otero
## 3050    colorado      ouray
## 3051    colorado       park
## 3052    colorado   phillips
## 3053    colorado     pitkin
## 3054    colorado    prowers
## 3055    colorado     pueblo
## 3056    colorado rio blanco
## 3057    colorado rio grande
## 3058    colorado      routt
## 3059    colorado   saguache
## 3060    colorado   san juan
## 3061    colorado san miguel
## 3062    colorado   sedgwick
## 3063    colorado     summit
## 3064    colorado     teller
## 3065    colorado washington
## 3066    colorado       weld
## 3067    colorado       yuma
## 3068 connecticut  fairfield
## 3069 connecticut   hartford
## 3070 connecticut litchfield
## 3071 connecticut  middlesex
## 3072 connecticut  new haven
## 3073 connecticut new london
## 3074 connecticut    tolland
## 3075 connecticut    windham

To fix that we merge our DM map data to the list of US counties used in library(maps) (county.fips) using the FIPS codes, basically creating an ordered version of our old data frame with a ‘polyname’ variable added, matching the us_map object.

dm_align <- merge(county.fips, dm_maps, by="fips")
summary(dm_align[,1:10])
##       fips         polyname            STATE               NAME          
##  Min.   : 1001   Length:3083        Length:3083        Length:3083       
##  1st Qu.:19032   Class :character   Class :character   Class :character  
##  Median :29183   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :30477                                                           
##  3rd Qu.:45052                                                           
##  Max.   :56045                                                           
##       LONG              LAT          CAN_COULD           CAN_MAY        
##  Min.   :-124.14   Min.   :25.53   Min.   :    0.00   Min.   :   0.000  
##  1st Qu.: -98.11   1st Qu.:34.63   1st Qu.:    0.00   1st Qu.:   0.000  
##  Median : -90.41   Median :38.37   Median :    0.00   Median :   0.000  
##  Mean   : -91.83   Mean   :38.29   Mean   :   13.86   Mean   :   3.935  
##  3rd Qu.: -83.58   3rd Qu.:41.75   3rd Qu.:    0.00   3rd Qu.:   0.000  
##  Max.   : -67.65   Max.   :48.82   Max.   :11639.00   Max.   :4232.000  
##    CAN_MIGHT            CAN_MUST       
##  Min.   :   0.0000   Min.   :  0.0000  
##  1st Qu.:   0.0000   1st Qu.:  0.0000  
##  Median :   0.0000   Median :  0.0000  
##  Mean   :   0.7587   Mean   :  0.6886  
##  3rd Qu.:   0.0000   3rd Qu.:  0.0000  
##  Max.   :1455.0000   Max.   :925.0000
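Since the map colours get matched to polygons purely by position, it seems worth checking that the merged rows really line up with the polygon names. Here is a minimal sketch of one way to do that with match(), using toy data: every object ending in _toy and all the values are made up for illustration, and whether the real dm_align needs this step is an assumption I haven’t verified.

```r
# Toy stand-ins (hypothetical values) for county.fips, dm_maps and us_map$names
county_fips_toy <- data.frame(
  fips = c(1001, 1003, 12091, 12091),
  polyname = c("alabama,autauga", "alabama,baldwin",
               "florida,okaloosa:main", "florida,okaloosa:spit"),
  stringsAsFactors = FALSE
)
dm_toy <- data.frame(fips = c(12091, 1001), MIGHT_COULD = c(5, 2))
map_names_toy <- county_fips_toy$polyname

# Same merge as above: one row per polygon name that has data
merged_toy <- merge(county_fips_toy, dm_toy, by = "fips")

# Reorder to one row per map polygon, in polygon order;
# polygons without data come out as NA rows
aligned_toy <- merged_toy[match(map_names_toy, merged_toy$polyname), ]
aligned_toy$MIGHT_COULD
```

With the real objects the equivalent would be something like dm_align[match(us_map$names, dm_align$polyname), ], after which a feature column has exactly one value per polygon.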

4.3 Preparing the Map

Next, we create a map object of US counties.

us_map <- map("county", plot = FALSE, fill = TRUE)

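We want to project the map so it looks nice. This is kind of messy, because it takes two steps: first we convert the map to a spatial object, declaring that its coordinates are plain longitude and latitude (WGS84), and then we transform that object to an Albers equal-area projection.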

wgs84 <- CRS("+proj=longlat +datum=WGS84")
us_map_sp <- map2SpatialPolygons(us_map, IDs = us_map$names, proj4string = wgs84)
albers <- CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
us_map_sp <- spTransform(us_map_sp, CRS = albers)

And we’re also going to make just a state map, so we can overlay that with different borders.

state <- map("state", plot = FALSE, fill = TRUE, resolution = 0)
state_sp <- map2SpatialPolygons(state, IDs = state$names, proj4string = wgs84)
state_sp <- spTransform(state_sp, CRS = albers)

4.4 Map Function

Last, we define a map function, which we can reuse for every map. It does most of the work and it’s pretty complicated, mostly just for aesthetics’ sake: most of it is colour palette and legend.

map_us <- function(feat, title, round) {
 
  # Getting the choropleth colours and cuts (0 + quartiles of non-zeroes)
  
  pal <- colorRampPalette(c("white", rgb(1,.27,0)), bias = 1)
  collist <- pal(5)
  feat <- feat/1000
  temp <- feat
  temp[temp == 0] <- NA
  temp[is.nan(temp)] <- NA  # is.nan() is needed here: x == NaN is never TRUE
  qu<-quantile(temp, na.rm=TRUE)
  
  class <- classIntervals(feat, 
                          5, 
                          style = "fixed", 
                          fixedBreaks = c(0, qu[1], qu[2], qu[3], qu[4], qu[5]))
  
  colour <- findColours(class, collist)
  
  
  # This is the map stuff
  
  par(mar = c(0, 0, 0, 0))
  plot(us_map_sp, lwd = 0.25, col = colour, border="white")
  plot(state_sp, lwd = 0.5, add = TRUE)
  
  # Legend
  
  rect(-2220000, -980000, -2020000, -1080000, col = collist[1], lwd = 0.5)
  rect(-2020000, -980000, -1820000, -1080000, col = collist[2], lwd = 0.5)
  rect(-1820000, -980000, -1620000, -1080000, col = collist[3], lwd = 0.5)
  rect(-1620000, -980000, -1420000, -1080000, col = collist[4], lwd = 0.5)
  rect(-1420000, -980000, -1220000, -1080000, col = collist[5], lwd = 0.5)
  
  text(-2220000, -1160000, cex = 0.5, labels = paste(round(class$brks[1]*100, round)))
  text(-2020000, -1160000, cex = 0.5, labels = paste(round(class$brks[2]*100, round)))
  text(-1820000, -1160000, cex = 0.5, labels = paste(round(class$brks[3]*100, round)))
  text(-1620000, -1160000, cex = 0.5, labels = paste(round(class$brks[4]*100, round)))
  text(-1420000, -1160000, cex = 0.5, labels = paste(round(class$brks[5]*100, round)))
  text(-1220000, -1160000, cex = 0.5, labels = paste(round(class$brks[6]*100, round)))
  
  text(-1700000, -860000, title, cex = 1)

  text(-1700000, -1260000, "rate of use per million words", cex = .65)
}
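To make the class breaks in map_us a bit more concrete, here is a minimal base-R sketch of the "zero plus quartiles of the non-zero values" idea on toy data. The rate vector is made up, and findInterval() is only my approximation of what classIntervals() and findColours() do with left-closed intervals, so treat it as an illustration rather than the exact behaviour of classInt.

```r
# Hypothetical county rates: lots of zeroes, as in the real data
rate <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8)

# Quartiles are computed on the non-zero values only
nonzero <- rate[!is.na(rate) & rate > 0]
breaks <- c(0, quantile(nonzero, names = FALSE))
breaks

# Left-closed classes, with the last class closed on the right
class_id <- findInterval(rate, breaks, rightmost.closed = TRUE)
class_id

# Same white-to-orange palette as map_us, one colour per county
pal <- colorRampPalette(c("white", rgb(1, .27, 0)))
collist <- pal(5)
colour <- collist[class_id]
```

The point of the design is that the first (white) class ends up holding just the zero counties, so the four coloured classes split the counties where the DM is actually attested into roughly equal groups.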

4.5 Maps

First we can map all DMs, which is something I didn’t do before.

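This wasn’t working at first because the row sums live in a standalone vector rather than in a column of dm_align, so dm_align$all is NULL. Passing the vector directly should do it.

all_dms <- rowSums(dm_align[,7:ncol(dm_align)])
map_us(all_dms, "All Double Modals", 0)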

And next let’s map the top 10.

map_us(dm_align$MIGHT_CAN, "Might Can", 0)

map_us(dm_align$MIGHT_COULD, "Might Could", 0)

map_us(dm_align$MAY_CAN, "May Can", 0)

map_us(dm_align$MIGHT_WOULD, "Might Would", 0)

map_us(dm_align$SHOULD_WOULD, "Should Would", 0)

map_us(dm_align$WOULD_COULD, "Would Could", 0)

map_us(dm_align$MUST_CAN, "Must Can", 0)

map_us(dm_align$MIGHT_SHOULD, "Might Should", 0)

map_us(dm_align$USED_TO_COULD, "Used to Could", 0)

map_us(dm_align$CAN_COULD, "Can Could", 0)

We can definitely see there are two different types of patterns, at least in these maps: Southeastern ones, which I assume are at least the ones we’re expecting, and ones with wider distributions, which seem more unexpected.

4.6 State Level Maps

Given that these maps are very broadly different, one way to quickly visualise and assess the differences is to work with state-level maps for a bit.

Using the data at hand, for each DM type let’s compute the mean relative frequency across the counties in each state, and then rescale each DM so that its state values sum to 100.

dm_states <- as.data.frame(dm_align[,c(3,7:ncol(dm_align))])
dm_states <- aggregate(. ~ STATE, dm_states, mean)
row.names(dm_states) <- dm_states$STATE
dm_states <- dm_states[,2:ncol(dm_states)]

st_norm <- as.data.frame(100*scale(dm_states, 
                                 center=FALSE, 
                                 scale=colSums(dm_states)))
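To show what the scale() call is doing, here is a small sketch with made-up state means (the state_means_toy values are hypothetical): dividing each column by its column sum and multiplying by 100 turns each DM into a percentage distribution over states.

```r
# Hypothetical state-level means for two DMs
state_means_toy <- data.frame(
  MIGHT_COULD = c(6, 3, 1),
  MIGHT_CAN   = c(2, 2, 0),
  row.names   = c("georgia", "texas", "ohio")
)

# Same normalisation as st_norm: each column divided by its sum, times 100
st_pct_toy <- as.data.frame(100 * scale(state_means_toy,
                                        center = FALSE,
                                        scale = colSums(state_means_toy)))
st_pct_toy
colSums(st_pct_toy)
```

One thing to keep in mind is that a plain mean over counties gives every county the same weight, whatever its size, and that a DM with no tokens at all would have a column sum of zero and come out as NaN.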


state$names
##  [1] "alabama"                         "arizona"                        
##  [3] "arkansas"                        "california"                     
##  [5] "colorado"                        "connecticut"                    
##  [7] "delaware"                        "district of columbia"           
##  [9] "florida"                         "georgia"                        
## [11] "idaho"                           "illinois"                       
## [13] "indiana"                         "iowa"                           
## [15] "kansas"                          "kentucky"                       
## [17] "louisiana"                       "maine"                          
## [19] "maryland"                        "massachusetts:martha's vineyard"
## [21] "massachusetts:main"              "massachusetts:nantucket"        
## [23] "michigan:north"                  "michigan:south"                 
## [25] "minnesota"                       "mississippi"                    
## [27] "missouri"                        "montana"                        
## [29] "nebraska"                        "nevada"                         
## [31] "new hampshire"                   "new jersey"                     
## [33] "new mexico"                      "new york:manhattan"             
## [35] "new york:main"                   "new york:staten island"         
## [37] "new york:long island"            "north carolina:knotts"          
## [39] "north carolina:main"             "north carolina:spit"            
## [41] "north dakota"                    "ohio"                           
## [43] "oklahoma"                        "oregon"                         
## [45] "pennsylvania"                    "rhode island"                   
## [47] "south carolina"                  "south dakota"                   
## [49] "tennessee"                       "texas"                          
## [51] "utah"                            "vermont"                        
## [53] "virginia:chesapeake"             "virginia:chincoteague"          
## [55] "virginia:main"                   "washington:san juan island"     
## [57] "washington:lopez island"         "washington:orcas island"        
## [59] "washington:whidbey island"       "washington:main"                
## [61] "west virginia"                   "wisconsin"                      
## [63] "wyoming"
# The state map has separate polygons for islands and other detached parts,
# so the rows for those states are repeated to match state$names (63 entries)


st_align <- rbind(st_norm[1:20,],
                  st_norm[20,],         #MA
                  st_norm[20,],
                  st_norm[21,],          #MI
                  st_norm[21,],
                  st_norm[22:31,],
                  st_norm[31,],           #NY
                  st_norm[31,],
                  st_norm[31,],
                  st_norm[32,],           #NC
                  st_norm[32,],
                  st_norm[32,],
                  st_norm[33:45,],       #VA (33:45, not 22:45, to keep 63 rows)
                  st_norm[45,],
                  st_norm[45,],
                  st_norm[46,],      #WA
                  st_norm[46,], 
                  st_norm[46,], 
                  st_norm[46,], 
                  st_norm[46,], 
                  st_norm[47:49,])
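Counting rows by hand like this is easy to get wrong, so here is a sketch of an alternative that builds the same kind of table by name. It assumes the row names of the state table are the plain lowercase state names and that polygon names look like "state:part"; the _toy objects and their values are made up for illustration.

```r
# Hypothetical normalised values for three states
st_norm_toy <- data.frame(
  MIGHT_COULD = c(10, 30, 60),
  row.names   = c("maryland", "massachusetts", "michigan")
)

# Polygon names in the style of state$names
poly_names_toy <- c("maryland",
                    "massachusetts:martha's vineyard",
                    "massachusetts:main",
                    "massachusetts:nantucket",
                    "michigan:north",
                    "michigan:south")

# Strip everything from the colon on to get the state for each polygon
poly_state_toy <- sub(":.*$", "", poly_names_toy)

# One row per polygon, repeating the state row wherever needed
st_align_toy <- st_norm_toy[match(poly_state_toy, rownames(st_norm_toy)), ,
                            drop = FALSE]
st_align_toy
```

With the real objects this would be roughly st_norm[match(sub(":.*$", "", state$names), rownames(st_norm)), ], which keeps working if the set of polygons changes.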

This is still not working, and I’m still working through it. One issue is that the breaks are calculated on the repeated rows, so states with several polygons count more than once; there are probably some other issues here too.

map_st <- function(feat, title, round) {
 
  pal <- colorRampPalette(c("white", rgb(1,.27,0)), bias = 1)
  collist <- pal(5)
  class <- classIntervals(feat, 
                          5, 
                          style = "quantile")
  
  colour <- findColours(class, collist)
  
  
  # This is the map stuff
  
  par(mar = c(0, 0, 0, 0))
  plot(state_sp, lwd = 0.5, col = colour, border="black")

  
  # Legend
  
  rect(-2220000, -980000, -2020000, -1080000, col = collist[1], lwd = 0.5)
  rect(-2020000, -980000, -1820000, -1080000, col = collist[2], lwd = 0.5)
  rect(-1820000, -980000, -1620000, -1080000, col = collist[3], lwd = 0.5)
  rect(-1620000, -980000, -1420000, -1080000, col = collist[4], lwd = 0.5)
  rect(-1420000, -980000, -1220000, -1080000, col = collist[5], lwd = 0.5)
  
  text(-2220000, -1160000, cex = 0.5, labels = paste(round(class$brks[1], round)))
  text(-2020000, -1160000, cex = 0.5, labels = paste(round(class$brks[2], round)))
  text(-1820000, -1160000, cex = 0.5, labels = paste(round(class$brks[3], round)))
  text(-1620000, -1160000, cex = 0.5, labels = paste(round(class$brks[4], round)))
  text(-1420000, -1160000, cex = 0.5, labels = paste(round(class$brks[5], round)))
  text(-1220000, -1160000, cex = 0.5, labels = paste(round(class$brks[6], round)))
  
  text(-1700000, -860000, title, cex = 1)

  text(-1700000, -1260000, "percent of tokens (normalised)", cex = .65)
}
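On the repeated-rows problem, one possible way around it is to compute the breaks once on the one-row-per-state values and only then look up a class for each polygon. Here is a minimal base-R sketch on made-up numbers; state_pct_toy and poly_state_toy are hypothetical, and findInterval() stands in for the classInt functions used in map_st.

```r
# Hypothetical normalised values, one per state
state_pct_toy <- c(maryland = 10, massachusetts = 30, michigan = 60)

# State for each polygon (some states have several polygons)
poly_state_toy <- c("maryland",
                    "massachusetts", "massachusetts", "massachusetts",
                    "michigan", "michigan")

# Quintile breaks from the unique state values only
breaks <- quantile(state_pct_toy, probs = seq(0, 1, length.out = 6),
                   names = FALSE)

# Class for each polygon, looked up through its state
class_id <- findInterval(state_pct_toy[poly_state_toy], breaks,
                         rightmost.closed = TRUE)
class_id
```

This way Massachusetts contributes one value to the breaks however many islands it has, while every one of its polygons still gets the same colour.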





map_st(st_align$MIGHT_COULD, "Might Could", 0)

map_st(st_align$MIGHT_CAN, "Might Can", 0)

I’m just thinking about how best to sort the maps. We could just do it by hand, or maybe take state proportions with a threshold. Or we could do it the way I would usually find dialect regions, but that seems like a lot of complication here.

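library(data.table)

# Treat each DM as an observation and each county as a variable
data <- dm_align[,7:ncol(dm_align)]
dm_as_obs <- transpose(data)
rownames(dm_as_obs) <- colnames(data)
rownames(dm_as_obs)

# Distances between DMs, then a 2-dimensional MDS
d <- dist(dm_as_obs)
mds <- cmdscale(d, 2)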
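Here is a small self-contained sketch of the same idea in base R only, with t() in place of data.table’s transpose(): DMs become rows, counties become columns, and classical MDS places DMs with similar county profiles near each other. The counts_toy values are made up, and using raw Euclidean distances on unscaled counts is just the simplest choice here, not necessarily the right one.

```r
# Hypothetical counts: rows are counties, columns are DMs
counts_toy <- data.frame(
  MIGHT_COULD = c(5, 4, 0, 0),
  MIGHT_CAN   = c(4, 5, 0, 1),
  WOULD_COULD = c(1, 1, 1, 1)
)

# Transpose so that each DM is an observation
dm_rows_toy <- t(as.matrix(counts_toy))

# Pairwise distances between DMs, then a 2-dimensional MDS
d_toy <- dist(dm_rows_toy)
mds_toy <- cmdscale(d_toy, k = 2)
mds_toy
```

In this toy case the two "might" DMs share a county profile and end up close together, with the evenly spread one further away. On the real data the high-frequency DMs would probably dominate the distances unless each DM is rescaled first.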