Blog Post 4 (Take 2)

Continuing description of data collection for TaD Final Project.

Lissie Bates-Haus, Ph.D. https://github.com/lbateshaus (U Mass Amherst DACSS MS Student) https://www.umass.edu/sbs/data-analytics-and-computational-social-science-program/ms
2022-03-26

For this blog post/stage of the project, I’m going to work on pulling articles into R for analysis. I’m going to start with the articles from one journal, the American Journal of Political Science, as I already have them all downloaded.

Step 1

I’ve printed out my Endnote bibliography and I’m cross-checking that with the actual files in the directory to make sure I have them all.
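The same cross-check could be done in R instead of by eye. This is a minimal sketch, assuming the Endnote titles are available as a character vector. The export step is not shown, the helpers `title_key` and `missing_titles` are hypothetical, and the third title is made up for illustration.

```r
# Reduce a title or file name to lowercase words so that punctuation
# differences (":" in a title vs "_" in a file name) do not block a match.
title_key <- function(x) trimws(gsub("[^a-z0-9]+", " ", tolower(x)))

# Return the titles that have no matching pdf file name.
missing_titles <- function(titles, file_names) {
  file_keys <- title_key(sub("\\.pdf$", "", file_names, ignore.case = TRUE))
  titles[!(title_key(titles) %in% file_keys)]
}

titles <- c(
  "The Moderating Effect of Debates on Political Attitudes",
  # the colon is an assumption about how the bibliography writes this title
  "Explaining Explanations: How Legislators Explain their Policy Positions and How Citizens React",
  "A Hypothetical Article Not Yet Downloaded"  # made up for illustration
)
file_names <- c(
  "The Moderating Effect of Debates on Political Attitudes.pdf",
  "Explaining Explanations_ How Legislators Explain their Policy Positions and How Citizens React.pdf"
)

still_missing <- missing_titles(titles, file_names)
print(still_missing)
```

In the real project, `file_names` would come from `list.files(pattern = "pdf$")`.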

  1. Three articles were missing, so I downloaded them using my U Mass login for authorization.
  2. Compressed all files in my American Journal of Political Science folder and moved them to my DACSS R directory.
  3. Read in the practice .zip file using readr.
  4. Did a Google search on how to do this and found instructions for reading in the pdfs.
  5. Set the working directory to the folder where the AJPS pdfs are.
library(readr)
library(pdftools)  #tool used by the web tutorial
setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")

files <- list.files(pattern = "pdf$")

AJPS <- lapply(files, pdf_text)  #There were a bunch of errors

I got a number of parsing errors, which I believe means that some of the characters could not be parsed by R (R could not read them).
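To see which files cause problems, each file can be read inside a handler that records warnings and errors rather than letting them scroll past. This is a minimal sketch: `read_each` is a hypothetical helper, and it only sees conditions raised through R. Whether pdftools reports every parsing problem that way is an assumption, so some messages may still print straight to the console. The stand-in reader below just imitates a warning and an error.

```r
# Read each path with `reader`, recording R-level warnings and errors per file
# instead of stopping at the first failure.
read_each <- function(paths, reader) {
  results <- lapply(paths, function(p) {
    problems <- character(0)
    text <- withCallingHandlers(
      tryCatch(reader(p), error = function(e) {
        problems <<- c(problems, paste("error:", conditionMessage(e)))
        NULL  # a failed file yields no text
      }),
      warning = function(w) {
        problems <<- c(problems, paste("warning:", conditionMessage(w)))
        invokeRestart("muffleWarning")  # keep reading after a warning
      }
    )
    list(text = text, problems = problems)
  })
  names(results) <- paths
  results
}

# Stand-in reader for illustration only; not a real pdf reader.
fake_reader <- function(p) {
  if (p == "bad.pdf") stop("cannot open")
  if (p == "odd.pdf") warning("strange font")
  c("page one", "page two")
}

checked <- read_each(c("ok.pdf", "odd.pdf", "bad.pdf"), fake_reader)

# In the project this might be:
# checked <- read_each(files, pdftools::pdf_text)
```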

I’m going to run a sample of the OCR function. So far, I’m not able to run this successfully.

library(tesseract)  #first run failed, needed this package

# AJPS1 <- pdftools::pdf_ocr_text(pdf = "Able and Mostly
# Willing_ An Empirical Anatomy of Information's Effect on
# Voter‐Driven Accountability in Senegal.pdf")

Note: Should I think about simplifying all the article file names?
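One way to simplify the names is to keep only lowercase letters and digits and join the words with underscores. This is a minimal sketch, assuming every file ends in `.pdf`; `simplify_name` is a hypothetical helper, and the rename itself is left commented out so nothing on disk changes by accident.

```r
# Turn a long article file name into a short, shell-friendly one.
simplify_name <- function(file_name) {
  stem <- sub("\\.pdf$", "", file_name, ignore.case = TRUE)
  stem <- gsub("[^a-z0-9]+", "_", tolower(stem))  # runs of other characters become "_"
  stem <- gsub("^_+|_+$", "", stem)               # trim leading/trailing "_"
  paste0(stem, ".pdf")
}

old_name <- "How Markets Shape Values and Political Preferences  A Field Experiment.pdf"
new_name <- simplify_name(old_name)
print(new_name)

# Before renaming for real, check that no two files collapse to the same name:
# new_names <- simplify_name(files)
# stopifnot(anyDuplicated(new_names) == 0)
# file.rename(files, new_names)
```

Keeping a table of old and new names would make it possible to map results back to the article titles later.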

Okay, trying the original code again.

library(tm)  #text mining analysis
library(pdftools)  #reads pdf documents

setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")

files <- list.files(pattern = "pdf$")
AJPS <- lapply(files, pdf_text)  #loads all pdf files
length(AJPS)  #verify how many files loaded in
[1] 44
lapply(AJPS, length)  #length of each pdf
[[1]]
[1] 27

[[2]]
[1] 16

[[3]]
[1] 14

[[4]]
[1] 16

[[5]]
[1] 18

[[6]]
[1] 17

[[7]]
[1] 14

[[8]]
[1] 19

[[9]]
[1] 15

[[10]]
[1] 16

[[11]]
[1] 18

[[12]]
[1] 21

[[13]]
[1] 16

[[14]]
[1] 15

[[15]]
[1] 15

[[16]]
[1] 17

[[17]]
[1] 17

[[18]]
[1] 14

[[19]]
[1] 16

[[20]]
[1] 13

[[21]]
[1] 15

[[22]]
[1] 20

[[23]]
[1] 15

[[24]]
[1] 20

[[25]]
[1] 17

[[26]]
[1] 19

[[27]]
[1] 18

[[28]]
[1] 16

[[29]]
[1] 16

[[30]]
[1] 18

[[31]]
[1] 17

[[32]]
[1] 15

[[33]]
[1] 12

[[34]]
[1] 1

[[35]]
[1] 14

[[36]]
[1] 13

[[37]]
[1] 16

[[38]]
[1] 22

[[39]]
[1] 19

[[40]]
[1] 17

[[41]]
[1] 18

[[42]]
[1] 19

[[43]]
[1] 20

[[44]]
[1] 16
# AJPS[1] it does look like the whole article read in, very
# messy!

Okay it looks like maybe the whole pdf did load.
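A compact table is easier to scan than the long list output, and it makes odd cases stand out, such as the single-page entry at position 34 above. This is a minimal sketch on toy data; `summarise_pages` is a hypothetical helper, and `docs` stands in for the list returned by `lapply(files, pdf_text)`.

```r
# One row per document: file name, page count, and total characters extracted.
summarise_pages <- function(docs, doc_names) {
  data.frame(
    file  = doc_names,
    pages = lengths(docs),
    chars = vapply(docs, function(pages) as.numeric(sum(nchar(pages))), numeric(1)),
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for the real list of page vectors.
docs <- list(c("abc", "de"), "x")
page_summary <- summarise_pages(docs, c("a.pdf", "b.pdf"))
print(page_summary)

# Documents worth a second look: very few pages or almost no text.
suspect <- page_summary$file[page_summary$pages <= 1]

# In the project this might be:
# page_summary <- summarise_pages(AJPS, files)
```

A document with many pages but very few characters would suggest a scanned pdf with no text layer, which is the case where OCR is actually needed.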

Now I’m going to attempt to create a Corpus. Note: I had to reset the working directory so that the command could find files.

setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")

AJPScorpus <- Corpus(URISource(files), readerControl = list(reader = readPDF))

From here, I’m going to create a term-document matrix and do some initial cleanup. Because I am going to be looking for specific terms, consistent with my original paper, I am going to remove punctuation, stopwords, and numbers, convert everything to lower case, and stem the terms.

Terms from my original paper:

  * “ethic-”
  * “IRB”, “institutional”, “review board”, “human”, “subjects”, “committee”
  * “informed”, “consent”
  * “harm”, “burden”, “mitigat-”, “minimi-” and “safe-”
  * “benefit”
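The term list can be turned into search patterns up front. This is a minimal sketch, assuming that a trailing hyphen marks a stem that should match any ending and that the other terms should match as whole words only. That second choice is my assumption, and it means “harm” would not match “harms”. `to_pattern` is a hypothetical helper.

```r
# Terms written with a trailing hyphen in the list are treated as stems.
stems <- c("ethic", "mitigat", "minimi", "safe")
words <- c("IRB", "institutional", "review board", "human", "subjects",
           "committee", "informed", "consent", "harm", "burden", "benefit")

# Stems match any word starting with the stem; other terms match whole words.
to_pattern <- function(term, is_stem) {
  if (is_stem) paste0("\\b", term, "\\w*") else paste0("\\b", term, "\\b")
}

patterns <- c(
  vapply(stems, to_pattern, character(1), is_stem = TRUE),
  vapply(words, to_pattern, character(1), is_stem = FALSE)
)

# Quick checks on toy text.
grepl(patterns[["ethic"]], "Ethical review", ignore.case = TRUE, perl = TRUE)
grepl(patterns[["harm"]], "pharmacy", ignore.case = TRUE, perl = TRUE)
```

The word boundary `\\b` keeps “harm” from matching inside “pharmacy”, which a plain substring search would not.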

AJPStdm <- TermDocumentMatrix(AJPScorpus, control = list(removePunctuation = TRUE,
    stopwords = TRUE, tolower = TRUE, stemming = TRUE, removeNumbers = TRUE))

# bounds = list(global = c(3, Inf)) - I'm removing this
# command because I don't actually want to remove the
# sparse terms. In fact, I might even want to remove the
# frequent terms! I'll have to think on this

So, apparently I have now created my term-document matrix! [Note: I think I need to go back and review what I’m doing - do I need a term-document matrix for my project?]

Now I’m going to inspect my matrix.

inspect(AJPStdm)
<<TermDocumentMatrix (terms: 18195, documents: 44)>>
Non-/sparse entries: 73069/727511
Sparsity           : 91%
Maximal term length: 59
Weighting          : term frequency (tf)
Sample             :
           Docs
Terms       Able and Mostly Willing_ An Empirical Anatomy of Information's Effect on Voter‐Driven Accountability in Senegal.pdf
  effect                                                                                                                     68
  elect                                                                                                                      43
  experi                                                                                                                     10
  group                                                                                                                       6
  inform                                                                                                                    196
  polit                                                                                                                      57
  result                                                                                                                     11
  treatment                                                                                                                  60
  vote                                                                                                                       93
  voter                                                                                                                     164
           Docs
Terms       Comparing and Combining List and Endorsement Experiments_ Evidence from Afghanistan.pdf
  effect                                                                                         31
  elect                                                                                          17
  experi                                                                                        205
  group                                                                                          53
  inform                                                                                          5
  polit                                                                                          21
  result                                                                                         43
  treatment                                                                                      37
  vote                                                                                            1
  voter                                                                                           1
           Docs
Terms       Does Race Affect Access to Government Services_ An Experiment Exploring Street‐Level Bureaucrats and Access to Public Housing.pdf
  effect                                                                                                                                   22
  elect                                                                                                                                    10
  experi                                                                                                                                   19
  group                                                                                                                                    32
  inform                                                                                                                                   28
  polit                                                                                                                                    32
  result                                                                                                                                   34
  treatment                                                                                                                                22
  vote                                                                                                                                     10
  voter                                                                                                                                     1
           Docs
Terms       Explaining Explanations_ How Legislators Explain their Policy Positions and How Citizens React.pdf
  effect                                                                                                    43
  elect                                                                                                     10
  experi                                                                                                    70
  group                                                                                                     19
  inform                                                                                                    11
  polit                                                                                                     57
  result                                                                                                    46
  treatment                                                                                                 47
  vote                                                                                                     138
  voter                                                                                                     25
           Docs
Terms       How Markets Shape Values and Political Preferences  A Field Experiment.pdf
  effect                                                                           103
  elect                                                                              5
  experi                                                                            25
  group                                                                             40
  inform                                                                            11
  polit                                                                             52
  result                                                                            31
  treatment                                                                        131
  vote                                                                              17
  voter                                                                             11
           Docs
Terms       Non‐Governmental Monitoring of Local Governments Increases Compliance with Central Mandates_ A National‐Scale Field Experiment in China.pdf
  effect                                                                                                                                             65
  elect                                                                                                                                               3
  experi                                                                                                                                             13
  group                                                                                                                                              30
  inform                                                                                                                                             71
  polit                                                                                                                                              44
  result                                                                                                                                             27
  treatment                                                                                                                                          98
  vote                                                                                                                                                0
  voter                                                                                                                                               0
           Docs
Terms       Political Determinants of Economic Exchange_ Evidence from a Business Experiment in Senegal.pdf
  effect                                                                                                 38
  elect                                                                                                   2
  experi                                                                                                 20
  group                                                                                                  21
  inform                                                                                                 16
  polit                                                                                                 189
  result                                                                                                 39
  treatment                                                                                              47
  vote                                                                                                    1
  voter                                                                                                   0
           Docs
Terms       The Impact of Elections on Cooperation_ Evidence from a Lab‐in‐the‐Field Experiment in Uganda.pdf
  effect                                                                                                   41
  elect                                                                                                   104
  experi                                                                                                   51
  group                                                                                                    80
  inform                                                                                                   19
  polit                                                                                                    44
  result                                                                                                   21
  treatment                                                                                                19
  vote                                                                                                      5
  voter                                                                                                     1
           Docs
Terms       The Moderating Effect of Debates on Political Attitudes.pdf
  effect                                                            110
  elect                                                              14
  experi                                                              9
  group                                                              14
  inform                                                             66
  polit                                                              68
  result                                                             46
  treatment                                                          57
  vote                                                               42
  voter                                                              97
           Docs
Terms       Urbanization Patterns, Information Diffusion, and Female Voting in Rural Paraguay.pdf
  effect                                                                                      172
  elect                                                                                        31
  experi                                                                                       21
  group                                                                                        37
  inform                                                                                       64
  polit                                                                                        41
  result                                                                                       33
  treatment                                                                                   102
  vote                                                                                         67
  voter                                                                                        32
# When I remove the 'bounds' code, my sparsity increases
# from 75% to 91%
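The sparsity figure can be reproduced from the counts that `inspect()` prints: it is the share of term-document cells that are zero. A short worked check using only the numbers in the output above:

```r
n_terms <- 18195
n_docs  <- 44
nonzero <- 73069    # "Non-/sparse entries", first number
zero    <- 727511   # "Non-/sparse entries", second number

# Every cell of the matrix is either zero or non-zero.
total_cells <- n_terms * n_docs
stopifnot(total_cells == nonzero + zero)

sparsity_pct <- round(100 * zero / total_cells)
print(sparsity_pct)  # 91
```

So 91% sparsity means about nine in ten term-document pairs never occur, which is expected once rare terms are kept.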

From here, I’m going back to review our first few weeks to see if I can run the analyses I’m interested in.

I think what I want to do is use RegEx to find my search terms?

library(stringr)

sum(str_detect(AJPS, "ethic*"))
[1] 24
sum(str_count(AJPS, "ethic*"))
[1] 57

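So, I believe I am understanding this: my search term appears in 24 of my elements (documents), for a total of 57 matches. One caveat: as a regular expression, “ethic*” means “ethi” followed by zero or more “c” characters, so it matches any lowercase “ethi” and, being case-sensitive, misses capitalized forms such as “Ethics”.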
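A stricter pattern can be compared against the original one on toy text before rerunning it on the articles. This is a minimal sketch in base R; `collapse_docs` and `count_matches` are hypothetical helpers, and the sample texts are made up. The stringr equivalent of the stricter search would be `str_count(x, regex("\\bethic", ignore_case = TRUE))`.

```r
# Join each document's pages so that every document is a single string.
collapse_docs <- function(docs) vapply(docs, paste, character(1), collapse = " ")

# Count regex matches per text.
count_matches <- function(texts, pattern, ignore_case = FALSE) {
  hits <- gregexpr(pattern, texts, ignore.case = ignore_case, perl = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))  # gregexpr gives -1 for no match
}

# Made-up documents, each a vector of pages.
toy_docs <- list(
  c("Ethics review and", "ethical concerns"),
  "methionine is unrelated",
  "fieldwork in Ethiopia's highlands, no review"
)
texts <- collapse_docs(toy_docs)

loose  <- count_matches(texts, "ethic*")                       # original pattern
strict <- count_matches(texts, "\\bethic", ignore_case = TRUE) # word-initial, any case

print(loose)
print(strict)
```

On this toy text the original pattern counts “methionine” and misses “Ethics”, while the stricter one counts only the two forms of “ethic-” in the first document.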

Next Steps

  1. Repeat these steps for my other three journals.
  2. Do I want to run searches for other terms? Do I need to go back to the original sources to see the context of these terms?
  3. Think about my specific research question - this is an extension of my pilot study. Why does this have value?
  4. Think about my specific visualizations and how to do them.
  5. Think about comparisons? Are there differences among the journals?
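For the first step, the reading code could be wrapped in a function that takes a folder, so the working directory never needs to be reset. This is a minimal sketch: `read_journal` is a hypothetical helper, the folder paths in the comments are placeholders, and the demo uses a temporary folder with a plain-text stand-in instead of real pdfs.

```r
# Read every pdf in `dir` with `reader`, returning a list named by file.
read_journal <- function(dir, reader) {
  paths <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  docs <- lapply(paths, reader)
  names(docs) <- basename(paths)
  docs
}

# Demo on a temporary folder; readLines stands in for a real pdf reader.
demo_dir <- file.path(tempdir(), "journal_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("x", file.path(demo_dir, "a.pdf"))
writeLines("y", file.path(demo_dir, "notes.txt"))
demo_docs <- read_journal(demo_dir, readLines)

# In the project this might be (paths are placeholders):
# journal_dirs <- c(AJPS = "path/to/AJPS", Other = "path/to/other journal")
# all_journals <- lapply(journal_dirs, read_journal, reader = pdftools::pdf_text)
```

Because `full.names = TRUE` returns complete paths, later steps can find the files from any working directory.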