POLSGU4712 HW6

Part 1: Defining “Party Platform”

In the context of the U.S. political party system, a party platform refers to the non-binding principles, goals, and strategies that party members use to address political issues. Platforms are generally announced at the party’s national conventions, which happen every four years and result in the official selection of that party’s nominees for president and vice president. These nominees, as well as party candidates in other elections, are meant to run on the issues and positions presented in the party platform. Sources: TeachDemocracy and Ballotpedia.

Parts 2 & 3: Organizing Files on my PC

I confirm that I did this. Note: I included the preambles when I copied the platform texts, and also got rid of any extraneous characters (ex: the 2012 Democratic Party Platform included some stars in its text, which I deleted).

Part 4: Loading Data + Corpus

# Load the packages used throughout:
library(readtext)
library(quanteda)

# First, load in the data: 
data_raw <- readtext("/Users/linde/Downloads/QuantPol 2/R Text Analysis/HW6/platformdocuments/*.txt", cache = FALSE)
data_raw

## readtext object consisting of 8 documents and 0 docvars.
## # A data frame: 8 × 2
##   doc_id       text                
##   <chr>        <chr>               
## 1 Dem_2004.txt "\"PREAMBLE\nA\"..."
## 2 Dem_2008.txt "\"RENEWING A\"..." 
## 3 Dem_2012.txt "\"Moving Ame\"..." 
## 4 Dem_2016.txt "\"Preamble\nI\"..."
## 5 Rep_2004.txt "\"2004 Repub\"..." 
## 6 Rep_2008.txt "\"This platf\"..." 
## # ℹ 2 more rows

# Then, make the initial corpus: 
corpus_1 <- corpus(data_raw)

What a corpus is: per the quanteda documentation, a corpus is a “‘library’ of original documents that have been converted to plain, UTF-8 encoded text…designed to be a more or less static container of texts with respect to processing and analysis.” This is why the pre-processing steps in Part 7 are performed on the tokens and the dfm instead of the corpus itself. And since the dfms are NOT static, I kept a raw version and made a processed version.
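The readtext output above reports 0 docvars, so party and year currently live only in the file names. Here is a minimal sketch of how they could be attached to a corpus as document variables, assuming the `Party_Year.txt` naming pattern seen above. The toy texts and the object names (`toy_texts`, `toy_corpus`, `ids`) are invented for illustration and are not part of the assignment data.

```r
library(quanteda)

# Toy stand-ins for two platform files; the texts are invented placeholders.
toy_texts <- c("Dem_2016.txt" = "We believe in public schools.",
               "Rep_2016.txt" = "We support the Constitution.")
toy_corpus <- corpus(toy_texts)

# Parse party and year out of document names shaped like "Party_Year.txt".
ids <- docnames(toy_corpus)
docvars(toy_corpus, "party") <- sub("_.*$", "", ids)
docvars(toy_corpus, "year")  <- as.integer(sub("^[^_]*_([0-9]{4})\\.txt$", "\\1", ids))

# Inspect the attached document variables.
docvars(toy_corpus)
```

With docvars in place, later steps could subset by party or year instead of relying on row positions.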

Part 5: Descriptive Table

summary(corpus_1)

## Corpus consisting of 8 documents, showing 8 documents:
## 
##          Text Types Tokens Sentences
##  Dem_2004.txt  3388  19705       943
##  Dem_2008.txt  4547  28677      1119
##  Dem_2012.txt  4213  29284      1084
##  Dem_2016.txt  4497  28654      1065
##  Rep_2004.txt  5877  45843      1761
##  Rep_2008.txt  4745  26197      1046
##  Rep_2012.txt  5625  33835      1280
##  Rep_2016.txt  6102  39425      1597

Interpreting the table:

“Text”: this is the list of all the texts that comprise the corpus. It’s the Democratic and Republican Party Platforms for 2004, 2008, 2012, and 2016.
“Types”: this is the number of unique tokens (in this case, words, punctuation marks, numbers, etc.) in each text. We can see that the 2016 Republican Party Platform has the most types (6,102), while the 2004 Democratic Party Platform has the fewest (3,388).
“Tokens”: this is the number of tokens (see above for a definition) in each text. We can see a loose adherence to the pattern observed by the UCSB APP, which is that the length in words of party platforms has been increasing over time: the Democratic platforms grew from 19,705 tokens in 2004 to roughly 28,000 to 29,000 afterwards, and the Republican platforms grew from 2008 through 2016, but the 2004 Republican platform (45,843 tokens) is the longest of all.
“Sentences”: this is the number of sentences in each text. Unsurprisingly, texts with a larger number of tokens have a larger number of sentences, and vice versa.
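To make the types/tokens distinction concrete, here is a small sketch on an invented sentence (the names `toy` and `toks` are hypothetical). It uses quanteda's `ntoken()` and `ntype()` and shows that punctuation marks count as tokens unless they are removed.

```r
library(quanteda)

# An invented one-sentence "document", all lowercase to keep the counts simple.
toy <- c(doc1 = "jobs, jobs, and more jobs.")

# Default tokenization keeps punctuation marks as tokens.
toks <- tokens(toy)
ntoken(toks)  # every token, repeats and punctuation included
ntype(toks)   # distinct tokens only

# Dropping punctuation lowers the token count.
toks_nopunct <- tokens(toy, remove_punct = TRUE)
ntoken(toks_nopunct)
```

This is also why the token counts in the table are a bit higher than a pure word count would be.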

Part 6: Create a DFM

dfm_raw <- dfm(tokens(corpus_1))

Part 7: Preprocessing

Apply stemming + remove punctuation and stopwords:

# Remove punctuation 
dfm_pro <- dfm(tokens(corpus_1, remove_punct=TRUE))

# Remove stopwords 
dfm_pro <- dfm_remove(dfm_pro, stopwords("english"))

# Stem words 
dfm_pro <- dfm_wordstem(dfm_pro)

Why we preprocess: when copying the party platform texts into .txt files, I noticed that there were a lot of asterisks and backslashes and hyphens and numbers and commas and other characters that don’t provide useful information for text analysis. This is the case for many documents you might want to run text analysis on, hence the need for pre-processing. Pre-processing techniques, including removing punctuation marks, removing stopwords (common filler words, like “the” and “of”, that lack analysis-worthy substance), and stemming words (“economy” and “economies” both become the single feature “economi” so that they can be counted together; note that “economic” and “economics” stem to the separate feature “econom”, which is why both stems show up in Part 8), help Wordfish/Wordscore/other algorithms analyze the text data more efficiently and consistently.
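As a rough illustration of what these three steps do to the size of a dfm, here is a sketch on an invented sentence. The object names (`toy`, `dfm_raw_toy`, `dfm_pro_toy`) are hypothetical, and the steps mirror the ones used above rather than adding anything new.

```r
library(quanteda)

# An invented sentence standing in for platform text.
toy <- c(doc1 = "The economy grows, and the economies of states grow.")

# Raw dfm: punctuation and stopwords are kept, nothing is stemmed.
dfm_raw_toy <- dfm(tokens(toy))

# Processed dfm: same three steps as in Part 7.
dfm_pro_toy <- dfm(tokens(toy, remove_punct = TRUE))
dfm_pro_toy <- dfm_remove(dfm_pro_toy, stopwords("english"))
dfm_pro_toy <- dfm_wordstem(dfm_pro_toy)

# Compare the number of features (columns) before and after.
nfeat(dfm_raw_toy)
nfeat(dfm_pro_toy)
featnames(dfm_pro_toy)
```

Here “economy” and “economies” collapse into one stem, and “grows” and “grow” into another, so the processed dfm has far fewer columns than the raw one.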

Count the # of columns (features) in the DFM post-preprocessing:

ncol(dfm_pro)

## [1] 7491

Part 8: Top 50 Words in the DFM:

topfeatures(dfm_pro, 50)

##   american    support     nation     presid      state    america     govern 
##       1350       1033        996        836        834        827        784 
##       must      peopl       work     health      right republican      secur 
##        759        733        703        683        682        633        617 
##   democrat    protect       care     famili      feder       make        new 
##        600        575        553        544        535        529        524 
##    countri       need        law        job        can      world        tax 
##        523        494        491        487        482        474        473 
##     provid      ensur  communiti    program       educ       also     econom 
##        457        441        436        433        429        408        407 
##    continu       unit     commit     includ     believ       help     effort 
##        404        403        394        392        389        386        377 
##     public  administr       year    develop      creat     polici    increas 
##        376        373        363        362        355        351        343 
##    economi 
##        342

Part 9: Top 50 Words in the 2016 Dem and 2016 Rep Party Platforms:

First, I have to figure out which row (document) is which:

dfm_pro[1:8,]

## Document-feature matrix of: 8 documents, 7,491 features (60.78% sparse) and 0 docvars.
##               features
## docs           preambl come togeth declar vision democrat mind challeng time
##   Dem_2004.txt       1   10     11      2      6       44    1       26   29
##   Dem_2008.txt       1   28     18      1      6       49    2       35   51
##   Dem_2012.txt       0   16     22      0      7      148    1       31   48
##   Dem_2016.txt       1   12     16      0      3      230    1       22   27
##   Rep_2004.txt       1   15     17      9     10       33    1       27   55
##   Rep_2008.txt       1    7      7      2      7       31    4       21   24
##               features
## docs           new
##   Dem_2004.txt  68
##   Dem_2008.txt  91
##   Dem_2012.txt  57
##   Dem_2016.txt  42
##   Rep_2004.txt 128
##   Rep_2008.txt  34
## [ reached max_ndoc ... 2 more documents, reached max_nfeat ... 7,481 more features ]

It cut off, but I got the basic idea. So Dem 2016 is row 4 and Rep 2016 is row 8. Now, I know which rows to call when extracting the top features for each document:
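Since the printout gets cut off, an alternative is to look up the row by document name instead of reading it off the display. Below is a minimal sketch on an invented two-document dfm; `toy_dfm` and `rep_row` are hypothetical names, and the texts are placeholders.

```r
library(quanteda)

# Invented two-document dfm whose document names mimic the platform file names.
toy_dfm <- dfm(tokens(c("Dem_2016.txt" = "health care for workers",
                        "Rep_2016.txt" = "state and federal law")))

# List every document name in row order (nothing gets cut off).
docnames(toy_dfm)

# Find the row index for a given document name, then use it as before.
rep_row <- match("Rep_2016.txt", docnames(toy_dfm))
topfeatures(toy_dfm[rep_row, ], 3)
```

Looking rows up by name keeps the code correct even if files are added or the ordering changes.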

# 2016 Dem: 
topfeatures(dfm_pro[4,], 50)

##  democrat  american   support    health    believ     peopl     right communiti 
##       230       176       143       130       121       112       108       101 
##    nation   protect      work      must      make   countri      also    public 
##        96        92        92        91        91        83        82        79 
##     state   america     ensur      educ      care    famili     fight    provid 
##        75        71        71        70        68        67        67        66 
##     feder       job      need    includ    invest    servic     secur   program 
##        65        64        64        64        63        62        60        59 
##   continu    expand    worker     creat   student    commit    govern       law 
##        59        58        57        53        53        52        51        51 
##     world       can    access    school      help   economi      live       end 
##        50        49        49        49        48        47        47        47 
##     women     build 
##        46        46

# 2016 Rep: 
topfeatures(dfm_pro[8,], 50)

##      state   american     govern     nation      feder      right       must 
##        185        180        178        162        142        131        126 
## republican    support      peopl        law    protect    america  administr 
##        119        116        110        107        101         87         86 
##    countri   congress     presid      secur  constitut    current        can 
##         85         82         81         72         72         70         69 
##     public     famili     polici       unit       need        new       educ 
##         69         66         66         63         62         60         60 
##    program      world      power     econom       forc       call        act 
##         60         59         58         58         57         57         56 
##     advanc      regul      parti   militari    economi     includ     provid 
##         55         55         53         53         52         52         51 
##   democrat        job  communiti     privat      amend       year        use 
##         50         50         50         50         50         49         49 
##       make 
##         49

Part 10: Qualitative Comparison of the 2016 Dem and Rep Lists

Per your suggestion, I first make a df of the words that appear: 1) in both texts; 2) just in the 2016 Dem text; and 3) just in the 2016 Rep text.

dem_2016 <- topfeatures(dfm_pro[4, ], 50)
rep_2016 <- topfeatures(dfm_pro[8, ], 50)

overlap   <- intersect(names(dem_2016), names(rep_2016))
just_dem  <- setdiff(names(dem_2016), names(rep_2016))
just_rep  <- setdiff(names(rep_2016), names(dem_2016))

pad_to <- function(x, n) c(x, rep(NA, n - length(x)))

n <- max(length(overlap), length(just_dem), length(just_rep))

comparison_df <- data.frame(
  overlap  = pad_to(overlap, n),
  just_dem = pad_to(just_dem, n),
  just_rep = pad_to(just_rep, n),
  stringsAsFactors = FALSE)

comparison_df

##      overlap just_dem   just_rep
## 1   democrat   health republican
## 2   american   believ  administr
## 3    support     work   congress
## 4      peopl     also     presid
## 5      right    ensur  constitut
## 6  communiti     care    current
## 7     nation    fight     polici
## 8    protect   invest       unit
## 9       must   servic        new
## 10      make  continu      power
## 11   countri   expand     econom
## 12    public   worker       forc
## 13     state    creat       call
## 14   america  student        act
## 15      educ   commit     advanc
## 16    famili   access      regul
## 17    provid   school      parti
## 18     feder     help   militari
## 19       job     live     privat
## 20      need      end      amend
## 21    includ    women       year
## 22     secur    build        use
## 23   program     <NA>       <NA>
## 24    govern     <NA>       <NA>
## 25       law     <NA>       <NA>
## 26     world     <NA>       <NA>
## 27       can     <NA>       <NA>
## 28   economi     <NA>       <NA>

Overlap/Neutral:

Generic stems like “american”, “govern”, “law”, “world”, “economi” and “make” are present in significant numbers in both party platforms.
This makes sense, because why wouldn’t you mention the United States of America in a party platform for an American political party? Similarly, stems like “govern”, “law”, and “economi” just make sense in a party platform. The government does exist, so does the economy, and a big part of being an elected official is engaging with the government and economy.
Stems like “make” have significant overlap, which is intuitive because when it comes to verbs, “make” is pretty neutral, especially when compared to other verbs used in the texts that DON’T overlap (see below).

Democrat v. Republican Differences:

The most frequent stem in the 2016 Democratic Party Platform was “democrat”. The 2016 Republican Party Platform also has “republican” as its 8th most common stem, but interestingly, while the Dem platform does not have “republican” in its top 50 features, the Rep platform has “democrat” at #43. The Dem party is seeming pretty egoistic right now, but maybe that’s because “democrat” is also the stem for “democracy”? I’m not sure. Wish there was more transparency on what all these stems encompassed.
The Rep platform frequently mentions the stem “privat” (50 times), which does not make the Dem top 50, while “public” actually appears in both top 50 lists (79 times for Dem, 69 for Rep). The “privat” gap makes sense, especially given the two parties’ general economic positions, but I’d imagine that the Dem platform would mention “privat” a fair amount, too, even if not in the top 50 features, just by virtue of platform rhetoric tending to emphasize contrasts (“not that, but this”). I’d be curious to look at the top 100 features and see if this holds.
The Dem platform emphasizes “worker”, “student” and “women”, while the Rep platform focuses on “congress” and “militari”. Both have “famili” and “communiti” as top 50 features, but those are kinda generic, so I’m taking them with a grain of salt when it comes to groups of interest. I’d interpret the relevant groups as not the audiences of the platforms (because who is really reading this outside of the convention), but as the groups that campaign messaging should target. Very interesting, because a hallmark of Republican messaging that I grew up with was workers and the “common man”, but we don’t see that reflected in the texts, or at least not in the top 50 features; quite the opposite.
The verbs used in the two texts are very curious. Looking only at stems that don’t overlap, the Dem platform uses stems like “believ”, “ensur”, “care”, “invest” and “help”. The Rep platform uses stems like “call”, “act”, “advanc”, “regul”, and “amend”. At least at this surface level, the Dem platform seems to use flashier, more people-oriented buzzwords, while the Rep platform’s verbs tend to take pieces of legislation as their objects.
All this being said, I’m not yet comfortable assigning ideological leanings to most of these stems. I’ll give the Republican party “militari”, “constitut”, and “regul”, and the Democratic party “women”, “student”, “school” and “health”, partially because of the lack of overlap and partially because it aligns with my own background ideas of what the parties are about. To give any leanings to the rest of the words, I’d want to discuss with other people of different backgrounds.
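On the question of what a stem like “democrat” actually encompasses: one way to get some transparency is to stem the individual words and list which surface forms land on a given stem. This is a sketch on an invented sentence, using quanteda's `char_wordstem()`; the names `toy_text`, `words`, and `stems` are hypothetical, and the same idea would need to be run on the real platform tokens to answer the question for these data.

```r
library(quanteda)

# Invented sentence containing several "democra-" words.
toy_text <- "Democrats believe democracy works. The Democratic Party defends democratic values."

# Tokenize, flatten to a plain lowercase character vector, and stem each word.
toks  <- tokens(toy_text, remove_punct = TRUE)
words <- tolower(unlist(as.list(toks), use.names = FALSE))
stems <- char_wordstem(words)

# Which surface words were folded into the stem "democrat", and how often?
table(words[stems == "democrat"])

# Full word-to-stem map for this sentence.
unique(data.frame(word = words, stem = stems))
```

In this toy sentence, “democrats” and “democratic” share the stem “democrat”, while “democracy” ends up under a different stem, which suggests the high “democrat” count may come mostly from the party's own name.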

Part 11: Theory, etc.

Research relevant to these data:

There are two broad areas of research that these reference data could be helpful in: 1) predicting the ideologies of virgin texts (text-focused); and 2) comparing text analysis methods (method-focused).
In the case of 1), these 8 documents would be treated as reference texts with known ideological positions (found via hand-coding as in Part 10 or otherwise), and virgin texts would be assigned ideological scores based on their similarities to the reference texts. Since we have 8 documents and not just 2, this could be accomplished using the Wordscore method; Wordfish, being unsupervised, would instead place the platforms and the new texts on a common scale without using any reference scores. The dictionary method would require us to pick just 2 texts - 1 from each party - so I’m only focusing on Wordscore and Wordfish.
In the case of 2), these data could help researchers understand the limits of certain analysis methods, such as handcoding or LLMs. Without having assigned prior ideological positions to these texts, researchers could compare the results of handcoding and unsupervised techniques with the goal of understanding what these different methods are more sensitive to. Kind of a robustness check in a way. I specified that these texts would not be a priori coded because the goal isn’t to see which methods are more right or wrong, but to descriptively assess how they tend to behave. I guess you could do the former, too, though, as in the paper we were shown in class. I’m just an anti-automation fan who wants a little more supervision of these unsupervised algorithms.

Wordscore:

Assumptions:
- More important (ideologically-charged) words are used more often.
- Word position doesn’t matter (bag of words).
- The texts all use the same lexicon, in the same context (ex: can’t look at 2016 party platforms at the same time as 1812 election speeches).
- The differences in word frequency are driven primarily by ideology (as opposed to the drafters just really liking specific words or wanting to one-up the other party by using GRE words). But maybe not?
- Specific to Wordscore:
  - Researchers must have access to confident estimates of/assumptions about the reference texts’ ideological positions.
  - Policy positions of the reference texts span the dimensions that researchers are interested in (ex: both extreme and more centrist positions on the issue of interest, such as healthcare policy or immigration).
  - The set of reference texts contains as many different words as possible.
Pros:
- More robust than the dictionary method in that it allows for multiple reference texts.
- Since it’s on a computer, it’s generally faster + more economical than handcoding.
- Can handle multiple languages.
Cons:
- The assumptions are quite confining. Rarely is word choice influenced chiefly by ideology. The stars have to align for a text to always use the same words in the same ways. These party platforms are drafted by humans and thus deserve analysis by humans.
- Formatting is ignored. I didn’t think this was a big deal prior to starting this assignment, but when copying and pasting the party text, I noticed that the Republican Party Platforms tended to have separate preambles and tables of contents, as well as lots of quotes in italics, while the Democratic Party Platforms tended not to have those things but used a lot of asterisk lines. Maybe this isn’t an informative observation, maybe the structure of platforms is just passed down at this point, but it could be useful context for inferring the values of the parties beyond their lexicon. Could a computer do that? No. It can only handle plain text.
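For reference, here is a hedged sketch of what the Wordscore workflow looks like in code. It assumes the `quanteda.textmodels` package is installed and uses its bundled example dfm (`data_dfm_lbgexample`, five reference texts R1 to R5 plus one virgin text V1) rather than the platforms, since I haven't assigned reference scores to the platforms. The reference scores below are the illustrative ones that go with that example, not real ideological positions.

```r
library(quanteda)
library(quanteda.textmodels)

# Reference scores for R1-R5; NA marks the virgin text (V1) to be scored.
ref_scores <- c(seq(-1.5, 1.5, by = 0.75), NA)

# Fit word scores from the reference texts only.
ws_model <- textmodel_wordscores(data_dfm_lbgexample, y = ref_scores)

# Score every document, including the virgin text, from its word frequencies.
ws_pred <- predict(ws_model, newdata = data_dfm_lbgexample)
ws_pred["V1"]
```

For the platforms, `y` would hold the a priori positions of the reference platforms and `NA` for any new text, which is exactly where the "confident estimates" assumption bites.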

Wordfish:

Assumptions:
- Same as in the above Wordscore section, minus the “Specific to Wordscore” ones.
Pros:
- As an unsupervised algorithm, there’s no need to locate and vet reference texts, which frees up time and energy.
- Faster + more economical than handcoding.
- Can handle multiple languages.
Cons:
- It can be hard to meet all the assumptions, especially if you’re webscraping your data instead of manually vetting it as you would have to for reference texts.
- Since it’s unsupervised, we’re not really sure what the scale is (is it really Republican vs. Democrat?)
- Depending on the size of the data you’re working with, running Wordfish could draw electricity from data centers, which are disproportionately sited in places with higher populations of marginalized peoples, and which are known to increase residents’ utility rates, contaminate nearby groundwater, pollute the surrounding air with toxic chemicals that have caused an increase in asthma and mortality, generate noise pollution capable of damaging people’s hearing, and create intense light pollution that has contributed to changes in birds’ migration paths.
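And a matching sketch for Wordfish, again assuming `quanteda.textmodels` is installed and using its bundled example dfm instead of the platforms. No reference scores go in; the only input beyond the dfm is `dir`, which names two documents to fix the direction of the scale (here, document 1 is placed to the left of document 5). The object name `wf_model` is hypothetical.

```r
library(quanteda)
library(quanteda.textmodels)

# Fit the unsupervised scaling model; dir only orients the axis.
wf_model <- textmodel_wordfish(data_dfm_lbgexample, dir = c(1, 5))

# theta: estimated position of each document on the single latent dimension.
data.frame(doc = docnames(data_dfm_lbgexample), theta = wf_model$theta)

# beta: how strongly each word pulls a document toward one end of the scale.
word_weights <- data.frame(feature = featnames(data_dfm_lbgexample),
                           beta = wf_model$beta)
head(word_weights[order(word_weights$beta), ])
```

Nothing in the output says what the dimension means, so checking which words have extreme `beta` values is one way to supervise the result a little.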