In the context of the U.S. political party system, a party platform refers to the non-binding principles, goals, and strategies that party members use to address political issues. Platforms are generally announced at the party’s national conventions, which happen every four years and result in the official selection of that party’s nominees for president and vice president. These nominees, as well as party candidates in other elections, are meant to run on the issues and positions presented in the party platform. Sources: TeachDemocracy and Ballotpedia.
I confirm that I did this. Note: I included the preambles when I copied the platform texts, and also got rid of any extraneous characters (ex: the 2012 Democratic Party Platform included some stars in their text, which I deleted).
# First, load the packages (assumed to be installed) and read in the data:
library(readtext)
library(quanteda)
data_raw <- readtext("/Users/linde/Downloads/QuantPol 2/R Text Analysis/HW6/platformdocuments/*.txt", cache = FALSE)
data_raw
## readtext object consisting of 8 documents and 0 docvars.
## # A data frame: 8 × 2
## doc_id text
## <chr> <chr>
## 1 Dem_2004.txt "\"PREAMBLE\nA\"..."
## 2 Dem_2008.txt "\"RENEWING A\"..."
## 3 Dem_2012.txt "\"Moving Ame\"..."
## 4 Dem_2016.txt "\"Preamble\nI\"..."
## 5 Rep_2004.txt "\"2004 Repub\"..."
## 6 Rep_2008.txt "\"This platf\"..."
## # ℹ 2 more rows
# Then, make the initial corpus:
corpus_1 <- corpus(data_raw)
What a corpus is: per the quanteda documentation, a corpus is a “‘library’ of original documents that have been converted to plain, UTF-8 encoded text…designed to be a more or less static container of texts with respect to processing and analysis.” This is why the pre-processing steps in Part 7 are performed on the dfm instead of the corpus itself. And since the dfms are NOT static, I kept a raw version and made a processed version.
summary(corpus_1)
## Corpus consisting of 8 documents, showing 8 documents:
##
## Text Types Tokens Sentences
## Dem_2004.txt 3388 19705 943
## Dem_2008.txt 4547 28677 1119
## Dem_2012.txt 4213 29284 1084
## Dem_2016.txt 4497 28654 1065
## Rep_2004.txt 5877 45843 1761
## Rep_2008.txt 4745 26197 1046
## Rep_2012.txt 5625 33835 1280
## Rep_2016.txt 6102 39425 1597
Interpreting the table:
“Text”: this is the list of all the texts that comprise the corpus. It’s the Democratic and Republican Party Platforms for 2004, 2008, 2012, and 2016.
“Types”: this is the number of unique tokens (in this case, words, punctuation marks, numbers, etc.) in each text. We can see that the 2016 Republican Party Platform has the most unique tokens (6,102), while the 2004 Democratic Party Platform has the fewest (3,388).
“Tokens”: this is the total number of tokens (see above for a definition) in each text. We can see a loose adherence to the pattern observed by the UCSB APP, which is that the length in words of party platforms has been increasing over time.
“Sentences”: this is the number of sentences in each text. Unsurprisingly, texts with a larger number of tokens have a larger number of sentences, and vice versa.
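To make the types/tokens distinction concrete, here is a minimal base-R sketch on a made-up sentence. The sentence and the crude regex tokenizer are assumptions for illustration only; quanteda's tokens() splits text more carefully, so counts on real documents will differ.

```r
# Hypothetical toy sentence, not taken from any platform
txt <- "We the people believe in the people."

# Crude tokenizer (an assumption): runs of letters/digits, or single punctuation marks
toy_tokens <- regmatches(txt, gregexpr("[[:alnum:]]+|[[:punct:]]", txt))[[1]]

# Tokens = every item counted; types = distinct items only
n_tokens <- length(toy_tokens)
n_types  <- length(unique(toy_tokens))

print(toy_tokens)
print(c(tokens = n_tokens, types = n_types))
```

Because “the” and “people” each appear twice, the number of types comes out smaller than the number of tokens, which is the same relationship shown in the summary table. In quanteda, ntoken() and ntype() report the equivalent counts.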
dfm_raw <- dfm(tokens(corpus_1))
Apply stemming + remove punctuation and stopwords:
# Remove punctuation
dfm_pro <- dfm(tokens(corpus_1, remove_punct=TRUE))
# Remove stopwords
dfm_pro <- dfm_remove(dfm_pro, stopwords("english"))
# Stem words
dfm_pro <- dfm_wordstem(dfm_pro)
Why we preprocess: when copying the party platform texts into .txt files, I noticed that there were a lot of asterisks and backslashes and hyphens and numbers and commas and other characters that don’t provide useful information for text analysis. This is the case for many documents you might want to run text analysis on, hence the need for pre-processing. Pre-processing techniques, including removing punctuation marks, removing stopwords (a type of filler word that lacks analysis-worthy substance), and stemming words (“economy” and “economies” both become the single feature “economi” so that they can be properly quantified; note that “economic” stems separately to “econom”, which is why both stems appear in the top features), help Wordfish/Wordscore/other algorithms analyze the text data more efficiently and consistently.
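As a small illustration of what these steps do, the sketch below runs the same three quanteda calls on two made-up sentences. The toy texts and object names are assumptions, not part of the assignment data.

```r
library(quanteda)

# Hypothetical toy texts, not taken from the platforms
toy <- c(doc1 = "The economy grows, and economies grow!",
         doc2 = "Economic growth helps the economy.")

# Raw dfm: punctuation and stopwords are kept as features
dfm_toy_raw <- dfm(tokens(toy))

# Processed dfm: same steps as above (punctuation, stopwords, stemming)
dfm_toy_pro <- dfm(tokens(toy, remove_punct = TRUE))
dfm_toy_pro <- dfm_remove(dfm_toy_pro, stopwords("english"))
dfm_toy_pro <- dfm_wordstem(dfm_toy_pro)

# Compare the number of features before and after
print(c(raw = nfeat(dfm_toy_raw), processed = nfeat(dfm_toy_pro)))
print(featnames(dfm_toy_pro))
```

Printing the feature names of the processed dfm is a quick way to see which surface forms were merged into a single stem and which were kept apart.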
Count the # of columns (features) in the DFM post-preprocessing:
ncol(dfm_pro)
## [1] 7491
topfeatures(dfm_pro, 50)
## american support nation presid state america govern
## 1350 1033 996 836 834 827 784
## must peopl work health right republican secur
## 759 733 703 683 682 633 617
## democrat protect care famili feder make new
## 600 575 553 544 535 529 524
## countri need law job can world tax
## 523 494 491 487 482 474 473
## provid ensur communiti program educ also econom
## 457 441 436 433 429 408 407
## continu unit commit includ believ help effort
## 404 403 394 392 389 386 377
## public administr year develop creat polici increas
## 376 373 363 362 355 351 343
## economi
## 342
First, have to figure out which row is which:
dfm_pro[1:8,]
## Document-feature matrix of: 8 documents, 7,491 features (60.78% sparse) and 0 docvars.
## features
## docs preambl come togeth declar vision democrat mind challeng time
## Dem_2004.txt 1 10 11 2 6 44 1 26 29
## Dem_2008.txt 1 28 18 1 6 49 2 35 51
## Dem_2012.txt 0 16 22 0 7 148 1 31 48
## Dem_2016.txt 1 12 16 0 3 230 1 22 27
## Rep_2004.txt 1 15 17 9 10 33 1 27 55
## Rep_2008.txt 1 7 7 2 7 31 4 21 24
## features
## docs new
## Dem_2004.txt 68
## Dem_2008.txt 91
## Dem_2012.txt 57
## Dem_2016.txt 42
## Rep_2004.txt 128
## Rep_2008.txt 34
## [ reached max_ndoc ... 2 more documents, reached max_nfeat ... 7,481 more features ]
It cut off, but I got the basic idea. So Dem 2016 is row 4 and Rep 2016 is row 8. Now, I know which rows to call when extracting the top features for each document:
# 2016 Dem:
topfeatures(dfm_pro[4,], 50)
## democrat american support health believ peopl right communiti
## 230 176 143 130 121 112 108 101
## nation protect work must make countri also public
## 96 92 92 91 91 83 82 79
## state america ensur educ care famili fight provid
## 75 71 71 70 68 67 67 66
## feder job need includ invest servic secur program
## 65 64 64 64 63 62 60 59
## continu expand worker creat student commit govern law
## 59 58 57 53 53 52 51 51
## world can access school help economi live end
## 50 49 49 49 48 47 47 47
## women build
## 46 46
# 2016 Rep:
topfeatures(dfm_pro[8,], 50)
## state american govern nation feder right must
## 185 180 178 162 142 131 126
## republican support peopl law protect america administr
## 119 116 110 107 101 87 86
## countri congress presid secur constitut current can
## 85 82 81 72 72 70 69
## public famili polici unit need new educ
## 69 66 66 63 62 60 60
## program world power econom forc call act
## 60 59 58 58 57 57 56
## advanc regul parti militari economi includ provid
## 55 55 53 53 52 52 51
## democrat job communiti privat amend year use
## 50 50 50 50 50 49 49
## make
## 49
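The row numbers used above were read off the printed matrix. An alternative that does not depend on document order is to look the row up by document name. The sketch below uses a made-up two-document dfm, so the texts and the names dfm_toy and row_dem are assumptions for illustration.

```r
library(quanteda)

# Hypothetical toy texts, named like the platform files
toy <- c(Dem_2016.txt = "we believe in health care",
         Rep_2016.txt = "we believe in the constitution")
dfm_toy <- dfm(tokens(toy))

# List the documents, then find a row index by name instead of by eye
print(docnames(dfm_toy))
row_dem <- which(docnames(dfm_toy) == "Dem_2016.txt")

# Use the looked-up index exactly as the hard-coded 4 and 8 were used
print(topfeatures(dfm_toy[row_dem, ], 5))
```

The same lookup on the real dfm would keep working even if files were added to or removed from the folder.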
Per your suggestion, I first make a df of the words that appear: 1) in both texts; 2) just in the 2016 Dem text; and 3) just in the 2016 Rep text.
dem_2016 <- topfeatures(dfm_pro[4, ], 50)
rep_2016 <- topfeatures(dfm_pro[8, ], 50)
overlap <- intersect(names(dem_2016), names(rep_2016))
just_dem <- setdiff(names(dem_2016), names(rep_2016))
just_rep <- setdiff(names(rep_2016), names(dem_2016))
pad_to <- function(x, n) c(x, rep(NA, n - length(x)))
n <- max(length(overlap), length(just_dem), length(just_rep))
comparison_df <- data.frame(
overlap = pad_to(overlap, n),
just_dem = pad_to(just_dem, n),
just_rep = pad_to(just_rep, n),
stringsAsFactors = FALSE)
comparison_df
## overlap just_dem just_rep
## 1 democrat health republican
## 2 american believ administr
## 3 support work congress
## 4 peopl also presid
## 5 right ensur constitut
## 6 communiti care current
## 7 nation fight polici
## 8 protect invest unit
## 9 must servic new
## 10 make continu power
## 11 countri expand econom
## 12 public worker forc
## 13 state creat call
## 14 america student act
## 15 educ commit advanc
## 16 famili access regul
## 17 provid school parti
## 18 feder help militari
## 19 job live privat
## 20 need end amend
## 21 includ women year
## 22 secur build use
## 23 program <NA> <NA>
## 24 govern <NA> <NA>
## 25 law <NA> <NA>
## 26 world <NA> <NA>
## 27 can <NA> <NA>
## 28 economi <NA> <NA>
Overlap/Neutral:
Generic stems like “american”, “govern”, “law”, “world”, “economi” and “make” are present in significant numbers in both party platforms.
This makes sense, because why wouldn’t you mention the United States of America in a party platform for an American political party? Similarly, stems like “govern”, “law”, and “economi” just make sense in a party platform. The government does exist, so does the economy, and a big part of being an elected official is engaging with the government and economy.
Stems like “make” have significant overlap, which is intuitive because when it comes to verbs, “make” is pretty neutral, especially when compared to other verbs used in the texts that DON’T overlap (see below).
Democrat v. Republican Differences:
The most frequent stem in the 2016 Democratic Party Platform was “democrat”. The 2016 Republican Party Platform also has “republican” as its 8th most common stem, but interestingly, while the Dem platform does not have “republican” in its top 50 features, the Rep platform has “democrat” at #43. The Dem party is seeming pretty egoistic right now, but maybe that’s because “democrat” is also the stem for “Democratic” (as in “Democratic Party”), not just “Democrat(s)”? I’m not sure; as far as I can tell, “democracy” gets its own stem (“democraci”) under the default stemmer. Wish there were more transparency on what all these stems encompass.
Both platforms frequently mention the stem “public” (79 times in the Dem platform and 69 in the Rep platform), but only the Rep platform has “privat” in its top 50 features. This makes sense given the two parties’ general economic positions, but I’d imagine that the Dem platform mentions “privat” a fair amount, too, even if not in the top 50 features, just by virtue of platform rhetoric tending to emphasize contrasts (“not that, but this”). I’d be curious to look at the top 100 features and see if this holds true.
The Dem platform emphasizes “worker”, “student” and “women”, while the Rep platform focuses on “congress” and “militari”. Both have “famili” and “communiti” as top 50 features, but those are kinda generic so taking them with a grain of salt when it comes to groups of interest. I’d interpret the relevant groups as not the audiences of the platforms (because who is really reading this outside of the convention), but as the groups that campaign messaging should target. Very interesting, because a hallmark of Republican messaging that I grew up with was workers and the “common man”, but we don’t see that reflected in the texts, or at least not in the top 50 features; quite the opposite.
The verbs used in the two texts are very curious. Among the stems unique to the Dem platform’s top 50 are “believ”, “ensur”, “care”, “invest” and “help” (“support” and “includ” rank highly too, but they appear in both lists). The Rep platform uses stems like “call”, “act”, “advanc”, “regul”, and “amend”. At least at this surface level, the Dem platform seems to use flashier, more people-oriented buzzwords, while the Rep platform’s verbs tend to take pieces of legislation as their objects.
All this being said, I’m not yet comfortable assigning ideological leanings to most of these stems. I’ll give the Republican party “militari”, “constitut”, and “regul”, and the Democratic party “women”, “student”, “school” and “health”, partially because of the lack of overlap and partially because it aligns with my own background ideas of what the parties are about. To give any leanings to the rest of the words, I’d want to discuss with other people of different backgrounds.
Research relevant to these data:
There are two broad areas of research that these reference data could be helpful in: 1) predicting the ideologies of virgin texts (text-focused); and 2) comparing text analysis methods (method-focused).
In the case of 1), these 8 documents would be treated as reference texts with known ideological positions (found via hand-coding as in Part 10 or otherwise), and virgin texts would be assigned ideological scores based on their similarities to the reference texts. Since we have 8 documents and not just 2, this could be accomplished using the Wordscore or Wordfish methods (though Wordfish, being unsupervised, would scale the reference and virgin texts together rather than use the known positions directly). The dictionary method would require us to pick just 2 texts - 1 from each party - so I’m only focusing on Wordscore and Wordfish.
In the case of 2), these data could help researchers understand the limits of certain analysis methods, such as handcoding or LLMs. Without having assigned prior ideological positions to these texts, researchers could compare the results of handcoding and unsupervised techniques with the goal of understanding what these different methods are more sensitive to. Kind of a robustness check in a way. I specified that these texts would not be a priori coded because the goal isn’t to see which methods are more right or wrong, but to descriptively assess how they tend to behave. I guess you could do the former, too, though, as in the paper we were shown in class. I’m just an anti-automation fan who wants a little more supervision of these unsupervised algorithms.
Wordscore:
Assumptions:
More important (ideologically-charged) words are used more often.
Word position doesn’t matter (bag of words).
The texts all use the same lexicon, in the same context (ex: can’t look at 2016 party platforms at the same time as 1812 election speeches).
The differences in word frequency are driven primarily by ideology (as opposed to the drafters just really liking specific words or wanting to one-up the other party by using GRE words). But maybe not?
Specific to Wordscore:
Researchers must have access to confident estimates of/assumptions about the reference texts’ ideological positions.
Policy positions of the reference texts span the dimensions that researchers are interested in (ex: both extreme and more centrist positions on the issue of interest, such as healthcare policy or immigration).
The set of reference texts contains as many different words as possible.
Pros:
More robust than the dictionary method in that it allows for multiple reference texts.
Since it’s on a computer, it’s generally faster + more economical than handcoding.
Can handle multiple languages.
Cons:
The assumptions are quite confining. Rarely is word choice influenced chiefly by ideology. The stars have to align for a text to always use the same words in the same ways. These party platforms are drafted by humans and thus deserve analysis by humans.
Formatting is ignored. I didn’t think this was a big deal prior to starting this assignment, but when copying and pasting the party text, I noticed that the Republican Party Platforms tended to have separate preambles and tables of contents, as well as lots of quotes in italics, while the Democratic Party Platforms tended not to have those things but use a lot of asterisk lines. Maybe this isn’t an informative observation, maybe the structure of platforms is just passed down at this point, but it could be useful context for inferring the values of the parties beyond their lexicon. Could a computer do that? No. It can only handle plain text.
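To show how the Wordscore assumptions above turn into numbers, here is a minimal base-R sketch of the standard Wordscores arithmetic. Each word is scored by the reference positions weighted by how likely each reference text is given that word, and a virgin text is scored by the frequency-weighted mean of its word scores. All counts, the two reference positions (-1 and 1), and the object names are made up for illustration; none of them come from the platform data.

```r
# Hypothetical toy counts: rows are reference texts, columns are stems
ref_counts <- matrix(
  c(8, 2, 5,
    2, 8, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("ref_left", "ref_right"),
                  c("worker", "militari", "nation"))
)

# Assumed a priori positions for the reference texts (invented values)
ref_scores <- c(ref_left = -1, ref_right = 1)

# Step 1: relative frequency of each word within each reference text
rel_freq <- ref_counts / rowSums(ref_counts)

# Step 2: P(reference text | word), i.e. normalise each column to sum to 1
p_ref_given_word <- sweep(rel_freq, 2, colSums(rel_freq), "/")

# Step 3: word score = expected reference position given the word
# (ref_scores is recycled down each column, so row i is multiplied by score i)
word_scores <- colSums(p_ref_given_word * ref_scores)

# Step 4: virgin text score = frequency-weighted mean of its word scores
virgin_counts <- c(worker = 6, militari = 1, nation = 3)
virgin_score  <- sum(virgin_counts / sum(virgin_counts) *
                       word_scores[names(virgin_counts)])

print(word_scores)
print(virgin_score)
```

A stem used equally often in both reference texts (“nation” here) gets a score halfway between the two reference positions and contributes nothing to separating the texts, which mirrors the overlap/neutral stems discussed earlier. The quanteda.textmodels package offers textmodel_wordscores() for doing this on a real dfm.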
Wordfish:
Assumptions:
Pros:
As an unsupervised algorithm, there’s no need to locate and vet reference texts, which frees up time and energy.
Faster + more economical than handcoding.
Can handle multiple languages.
Cons:
It can be hard to meet all the assumptions, especially if you’re webscraping your data instead of manually vetting it as you would have to for reference texts.
Since it’s unsupervised, we’re not really sure what the scale is (is it really Republican vs. Democrat?).
Depending on the size of the data you’re working with, running Wordfish could draw electricity from data centers, which are disproportionately sited in places with higher populations of marginalized peoples, and which are known to increase residents’ utility rates, contaminate nearby groundwater, pollute the surrounding air with toxic chemicals that have caused an increase in asthma and mortality, generate noise pollution capable of damaging people’s hearing, and create intense light pollution that has contributed to changes in birds’ migration paths.
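For context on the “what is the scale” concern above, the Wordfish model as it is usually stated in the general literature (a Poisson scaling model; this statement is not taken from the assignment itself) can be sketched as:

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i
```

Here y_{ij} is the count of word j in document i, \alpha_i is a document effect (roughly, document length), \psi_j is a word effect (roughly, overall word frequency), \beta_j is how strongly word j discriminates between positions, and \theta_i is the estimated position of document i. Flipping the signs of every \beta_j and every \theta_i leaves \lambda_{ij} unchanged, so the direction of the scale is arbitrary until the analyst fixes it, and nothing in the model says the recovered dimension is left-right ideology. In quanteda.textmodels, the dir argument of textmodel_wordfish() is how the direction gets anchored to two chosen documents.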