Part 1: Defining “Party Platform”

In the context of the U.S. political party system, a party platform is the non-binding set of principles, goals, and strategies that a party adopts to address political issues. Platforms are generally announced at each party’s national convention, which is held every four years and results in the official selection of that party’s nominees for president and vice president. These nominees, as well as the party’s candidates in other elections, are meant to run on the issues and positions presented in the platform. Sources: TeachDemocracy and Ballotpedia.

Parts 2 & 3: Organizing Files on my PC

I confirm that I did this. Note: I included the preambles when I copied the platform texts, and I removed extraneous characters (e.g., the 2012 Democratic Party Platform included some stars in its text, which I deleted).

Part 4: Loading Data + Corpus

# Load the packages used throughout:
library(readtext)
library(quanteda)

# First, load in the data:
data_raw <- readtext("/Users/linde/Downloads/QuantPol 2/R Text Analysis/HW6/platformdocuments/*.txt", cache = FALSE)
data_raw
## readtext object consisting of 8 documents and 0 docvars.
## # A data frame: 8 × 2
##   doc_id       text                
##   <chr>        <chr>               
## 1 Dem_2004.txt "\"PREAMBLE\nA\"..."
## 2 Dem_2008.txt "\"RENEWING A\"..." 
## 3 Dem_2012.txt "\"Moving Ame\"..." 
## 4 Dem_2016.txt "\"Preamble\nI\"..."
## 5 Rep_2004.txt "\"2004 Repub\"..." 
## 6 Rep_2008.txt "\"This platf\"..." 
## # ℹ 2 more rows
# Then, make the initial corpus: 
corpus_1 <- corpus(data_raw)

What a corpus is: per the quanteda documentation, a corpus is a “‘library’ of original documents that have been converted to plain, UTF-8 encoded text…designed to be a more or less static container of texts with respect to processing and analysis.” This is why the pre-processing steps in Part 7 are applied to the tokens and the dfm rather than to the corpus itself. And because a dfm, unlike the corpus, is meant to be modified, I kept a raw version (dfm_raw) and made a processed version (dfm_pro).

Part 5: Descriptive Table

summary(corpus_1)
## Corpus consisting of 8 documents, showing 8 documents:
## 
##          Text Types Tokens Sentences
##  Dem_2004.txt  3388  19705       943
##  Dem_2008.txt  4547  28677      1119
##  Dem_2012.txt  4213  29284      1084
##  Dem_2016.txt  4497  28654      1065
##  Rep_2004.txt  5877  45843      1761
##  Rep_2008.txt  4745  26197      1046
##  Rep_2012.txt  5625  33835      1280
##  Rep_2016.txt  6102  39425      1597
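For reading the columns above: Tokens counts every word occurrence, while Types counts each distinct word once. The base R sketch below shows the difference on a made-up sentence (the sentence and variable names are hypothetical, and splitting on whitespace is only a rough stand-in for quanteda's tokenizer, which also counts punctuation marks as tokens by default).

```r
# Toy sentence (hypothetical, not taken from the platforms)
txt <- "We the people support the people"

# Rough tokenization: lowercase, then split on whitespace
toks <- strsplit(tolower(txt), "\\s+")[[1]]

n_tokens <- length(toks)          # every occurrence counts
n_types  <- length(unique(toks))  # each distinct word counts once

c(tokens = n_tokens, types = n_types)
```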

Interpreting the table:

Part 6: Create a DFM

dfm_raw <- dfm(tokens(corpus_1))

Part 7: Preprocessing

Apply stemming + remove punctuation and stopwords:

# Remove punctuation 
dfm_pro <- dfm(tokens(corpus_1, remove_punct=TRUE))

# Remove stopwords 
dfm_pro <- dfm_remove(dfm_pro, stopwords("english"))

# Stem words 
dfm_pro <- dfm_wordstem(dfm_pro)

Why we preprocess: when copying the party platform texts into .txt files, I noticed many asterisks, backslashes, hyphens, numbers, commas, and other characters that carry no useful information for text analysis. This is true of many documents you might want to analyze, hence the need for pre-processing. The steps used here are removing punctuation, removing stopwords (very common function words such as “the” and “of” that carry little substantive meaning), and stemming, which collapses inflected forms into one feature so they are counted together (e.g., “economies” and “economy” both become “economi”; note that “economic” and “economics” stem separately to “econom”, which is why both stems appear in the Part 8 output). Together these steps shrink the feature set and help Wordfish, Wordscores, and similar algorithms analyze the text data more efficiently and consistently.
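A small sketch of what these three steps do to a single sentence may help. The sentence and object names are made up for illustration, and the steps are applied at the tokens stage (tokens_remove, tokens_wordstem) rather than with the dfm-level calls used above, so this is a variant of the pipeline and not the exact code that produced dfm_pro.

```r
library(quanteda)

# Toy sentence (hypothetical, not taken from the platforms)
toy <- c(doc1 = "The economy, and the economies of the states, must grow!")

toks <- tokens(toy, remove_punct = TRUE)           # drop punctuation
toks <- tokens_remove(toks, stopwords("english"))  # drop stopwords
toks <- tokens_wordstem(toks)                      # stem what is left

# Tokens that survive preprocessing
result <- as.list(toks)[["doc1"]]
result

# Check how the stemmer treats related words
char_wordstem(c("economics", "economies", "economy"))
```

Running the last line is a quick way to see which word forms actually end up merged into the same feature.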

Count the number of columns (features) in the DFM after preprocessing:

ncol(dfm_pro) 
## [1] 7491

Part 8: Top 50 Words in the DFM:

topfeatures(dfm_pro, 50)
##   american    support     nation     presid      state    america     govern 
##       1350       1033        996        836        834        827        784 
##       must      peopl       work     health      right republican      secur 
##        759        733        703        683        682        633        617 
##   democrat    protect       care     famili      feder       make        new 
##        600        575        553        544        535        529        524 
##    countri       need        law        job        can      world        tax 
##        523        494        491        487        482        474        473 
##     provid      ensur  communiti    program       educ       also     econom 
##        457        441        436        433        429        408        407 
##    continu       unit     commit     includ     believ       help     effort 
##        404        403        394        392        389        386        377 
##     public  administr       year    develop      creat     polici    increas 
##        376        373        363        362        355        351        343 
##    economi 
##        342

Part 9: Top 50 Words in the 2016 Dem and 2016 Rep Party Platforms:

First, I have to figure out which row corresponds to which document:

dfm_pro[1:8,]
## Document-feature matrix of: 8 documents, 7,491 features (60.78% sparse) and 0 docvars.
##               features
## docs           preambl come togeth declar vision democrat mind challeng time
##   Dem_2004.txt       1   10     11      2      6       44    1       26   29
##   Dem_2008.txt       1   28     18      1      6       49    2       35   51
##   Dem_2012.txt       0   16     22      0      7      148    1       31   48
##   Dem_2016.txt       1   12     16      0      3      230    1       22   27
##   Rep_2004.txt       1   15     17      9     10       33    1       27   55
##   Rep_2008.txt       1    7      7      2      7       31    4       21   24
##               features
## docs           new
##   Dem_2004.txt  68
##   Dem_2008.txt  91
##   Dem_2012.txt  57
##   Dem_2016.txt  42
##   Rep_2004.txt 128
##   Rep_2008.txt  34
## [ reached max_ndoc ... 2 more documents, reached max_nfeat ... 7,481 more features ]

The printout cut off after six documents, but the document order matches the corpus summary in Part 5, so Dem 2016 is row 4 and Rep 2016 is row 8. Now I know which rows to call when extracting the top features for each document:

# 2016 Dem: 
topfeatures(dfm_pro[4,], 50) 
##  democrat  american   support    health    believ     peopl     right communiti 
##       230       176       143       130       121       112       108       101 
##    nation   protect      work      must      make   countri      also    public 
##        96        92        92        91        91        83        82        79 
##     state   america     ensur      educ      care    famili     fight    provid 
##        75        71        71        70        68        67        67        66 
##     feder       job      need    includ    invest    servic     secur   program 
##        65        64        64        64        63        62        60        59 
##   continu    expand    worker     creat   student    commit    govern       law 
##        59        58        57        53        53        52        51        51 
##     world       can    access    school      help   economi      live       end 
##        50        49        49        49        48        47        47        47 
##     women     build 
##        46        46
# 2016 Rep: 
topfeatures(dfm_pro[8,], 50) 
##      state   american     govern     nation      feder      right       must 
##        185        180        178        162        142        131        126 
## republican    support      peopl        law    protect    america  administr 
##        119        116        110        107        101         87         86 
##    countri   congress     presid      secur  constitut    current        can 
##         85         82         81         72         72         70         69 
##     public     famili     polici       unit       need        new       educ 
##         69         66         66         63         62         60         60 
##    program      world      power     econom       forc       call        act 
##         60         59         58         58         57         57         56 
##     advanc      regul      parti   militari    economi     includ     provid 
##         55         55         53         53         52         52         51 
##   democrat        job  communiti     privat      amend       year        use 
##         50         50         50         50         50         49         49 
##       make 
##         49
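Because the printed dfm truncates after six documents, the row positions can also be looked up by document name instead of being read off the printout. The sketch below shows the idea on a toy dfm. The texts are made up, and only the document names mimic the real files.

```r
library(quanteda)

# Toy texts (hypothetical); names mimic the real file names
toy <- c("Dem_2016.txt" = "health care health workers",
         "Rep_2016.txt" = "state government state law")
toy_dfm <- dfm(tokens(toy))

# List every document name; nothing is cut off here
docnames(toy_dfm)

# Find a row by name rather than by counting printed rows
row_dem <- which(docnames(toy_dfm) == "Dem_2016.txt")
topfeatures(toy_dfm[row_dem, ], 2)
```

On the real data, the same which(docnames(dfm_pro) == ...) lookup should return 4 and 8 if the row order assumed above is correct.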

Part 10: Qualitative Comparison of the 2016 Dem and Rep Lists

Per your suggestion, I first made a data frame of the top-50 stems that appear: 1) in both 2016 lists; 2) only in the 2016 Dem list; and 3) only in the 2016 Rep list.

dem_2016 <- topfeatures(dfm_pro[4, ], 50)
rep_2016 <- topfeatures(dfm_pro[8, ], 50)

overlap   <- intersect(names(dem_2016), names(rep_2016))
just_dem  <- setdiff(names(dem_2016), names(rep_2016))
just_rep  <- setdiff(names(rep_2016), names(dem_2016))

pad_to <- function(x, n) c(x, rep(NA, n - length(x)))

n <- max(length(overlap), length(just_dem), length(just_rep))

comparison_df <- data.frame(
  overlap  = pad_to(overlap, n),
  just_dem = pad_to(just_dem, n),
  just_rep = pad_to(just_rep, n),
  stringsAsFactors = FALSE)

comparison_df
##      overlap just_dem   just_rep
## 1   democrat   health republican
## 2   american   believ  administr
## 3    support     work   congress
## 4      peopl     also     presid
## 5      right    ensur  constitut
## 6  communiti     care    current
## 7     nation    fight     polici
## 8    protect   invest       unit
## 9       must   servic        new
## 10      make  continu      power
## 11   countri   expand     econom
## 12    public   worker       forc
## 13     state    creat       call
## 14   america  student        act
## 15      educ   commit     advanc
## 16    famili   access      regul
## 17    provid   school      parti
## 18     feder     help   militari
## 19       job     live     privat
## 20      need      end      amend
## 21    includ    women       year
## 22     secur    build        use
## 23   program     <NA>       <NA>
## 24    govern     <NA>       <NA>
## 25       law     <NA>       <NA>
## 26     world     <NA>       <NA>
## 27       can     <NA>       <NA>
## 28   economi     <NA>       <NA>

Overlap/Neutral:

Democrat v. Republican Differences:

Part 11: Theory, etc.

Research relevant to these data:

Wordscores:

Wordfish: