Blog Post 5 - Getting the Data

The Process of Data Acquisition for my TaD Final Project.

Lissie Bates-Haus, Ph.D. (https://github.com/lbateshaus), UMass Amherst DACSS MS student (https://www.umass.edu/sbs/data-analytics-and-computational-social-science-program/ms)
2022-04-16

This blog post walks through the process of data collection, plus some initial text analysis, for my final Text as Data project.

The previous blog posts in this series can be found here: Blog Post 4 and Blog Post 4 (Take 2).

Today I am working on finalizing my complete data collection. I have collected all of the articles I am interested in using and cited them in EndNote. Using that bibliography, I have cross-checked against the folder system on my personal computer where I organize the pdfs, and ensured that I have a pdf of every article in my EndNote bibliography, sorted into folders by publication.

Using the code I worked on in Blog 4 (Take 2), I’m going to read in all of the articles.

library(readr)
library(pdftools)  #pdf_text(), the extraction tool used by the web tutorial
library(tm)        #text mining utilities

American Journal of Political Science:

#set WD to the proper subfolder:
setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")

AJPSfiles <- list.files(pattern = "pdf$")

AJPS <- lapply(AJPSfiles, pdf_text)

As before, there were a large number of parsing errors, which I take to mean that some characters in the pdfs could not be read in correctly.
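One quick sanity check of my own (not from the tutorial): count how many pages of each article came back with no extractable text at all, since badly parsed files can show up as blank pages.

#pages per article that contain no extractable text
sapply(AJPS, function(pages) sum(nchar(trimws(pages)) == 0))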

Check how many files loaded in:

length(AJPS)  #verify how many files loaded in
[1] 44

Cross-check this with EndNote and the downloaded files. No discrepancies found.

Check the length of each file (pdf_text() returns one string per page, so the length is the article's page count):

lapply(AJPS, length)  #number of pages in each pdf
[[1]]
[1] 27

[[2]]
[1] 16

[[3]]
[1] 14

[[4]]
[1] 16

[[5]]
[1] 18

[[6]]
[1] 17

[[7]]
[1] 14

[[8]]
[1] 19

[[9]]
[1] 15

[[10]]
[1] 16

[[11]]
[1] 18

[[12]]
[1] 21

[[13]]
[1] 16

[[14]]
[1] 15

[[15]]
[1] 15

[[16]]
[1] 17

[[17]]
[1] 17

[[18]]
[1] 14

[[19]]
[1] 16

[[20]]
[1] 13

[[21]]
[1] 15

[[22]]
[1] 20

[[23]]
[1] 15

[[24]]
[1] 20

[[25]]
[1] 17

[[26]]
[1] 19

[[27]]
[1] 18

[[28]]
[1] 16

[[29]]
[1] 16

[[30]]
[1] 18

[[31]]
[1] 17

[[32]]
[1] 15

[[33]]
[1] 12

[[34]]
[1] 1

[[35]]
[1] 14

[[36]]
[1] 13

[[37]]
[1] 16

[[38]]
[1] 22

[[39]]
[1] 19

[[40]]
[1] 17

[[41]]
[1] 18

[[42]]
[1] 19

[[43]]
[1] 20

[[44]]
[1] 16
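
As an aside, sapply() would return these same page counts as a single integer vector instead of a long list (a small convenience of my own, not part of the tutorial code):

sapply(AJPS, length)  #page counts as one vector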

Repeat this for each journal. Since the steps are identical every time, a small helper that avoids repeating the setwd() and list.files() boilerplate is sketched just below; for now I keep the explicit per-journal code.
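A minimal sketch of that helper (read_journal is my own hypothetical name, not from the tutorial):

read_journal <- function(folder) {
  #read every pdf in a journal's subfolder without changing the working directory
  files <- list.files(folder, pattern = "pdf$", full.names = TRUE)
  lapply(files, pdf_text)
}

#e.g. APSR <- read_journal("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Political Science Review")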

#American Political Science Review

setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Political Science Review")

APSRfiles <- list.files(pattern = "pdf$")

APSR <- lapply(APSRfiles, pdf_text)

length(APSR)
[1] 25
lapply(APSR, length) 
[[1]]
[1] 17

[[2]]
[1] 24

[[3]]
[1] 12

[[4]]
[1] 19

[[5]]
[1] 17

[[6]]
[1] 18

[[7]]
[1] 17

[[8]]
[1] 16

[[9]]
[1] 20

[[10]]
[1] 22

[[11]]
[1] 24

[[12]]
[1] 23

[[13]]
[1] 18

[[14]]
[1] 13

[[15]]
[1] 14

[[16]]
[1] 7

[[17]]
[1] 22

[[18]]
[1] 16

[[19]]
[1] 25

[[20]]
[1] 15

[[21]]
[1] 19

[[22]]
[1] 12

[[23]]
[1] 16

[[24]]
[1] 14

[[25]]
[1] 17

Cross-check with EndNote and the pdf folder. No discrepancies found.

#American Politics Research

setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Politics Research")

APRfiles <- list.files(pattern = "pdf$")

APR <- lapply(APRfiles, pdf_text)

length(APR)
[1] 13
lapply(APR, length) 
[[1]]
[1] 28

[[2]]
[1] 28

[[3]]
[1] 19

[[4]]
[1] 27

[[5]]
[1] 26

[[6]]
[1] 14

[[7]]
[1] 23

[[8]]
[1] 26

[[9]]
[1] 9

[[10]]
[1] 16

[[11]]
[1] 27

[[12]]
[1] 26

[[13]]
[1] 16

Cross-check with EndNote and the pdf folder. No discrepancies found.

#Journal of Experimental Political Science

setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Journal of Experimental Political Science")

JEPSfiles <- list.files(pattern = "pdf$")

JEPS <- lapply(JEPSfiles, pdf_text)

length(JEPS)
[1] 19
lapply(JEPS, length) 
[[1]]
[1] 10

[[2]]
[1] 12

[[3]]
[1] 12

[[4]]
[1] 6

[[5]]
[1] 10

[[6]]
[1] 10

[[7]]
[1] 10

[[8]]
[1] 10

[[9]]
[1] 11

[[10]]
[1] 6

[[11]]
[1] 9

[[12]]
[1] 14

[[13]]
[1] 12

[[14]]
[1] 16

[[15]]
[1] 16

[[16]]
[1] 7

[[17]]
[1] 15

[[18]]
[1] 9

[[19]]
[1] 12

Cross-check with EndNote and the pdf folder. Discrepancy found in EndNote: 20 citations in that database versus 19 pdfs. I realize I accidentally re-imported a citation I already had, and delete the duplicate. After that, no discrepancies remain.

#Political Science Research and Methods

setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Political Science Research and Methods")

PSRMfiles <- list.files(pattern = "pdf$")

PSRM <- lapply(PSRMfiles, pdf_text)

length(PSRM)
[1] 10
lapply(PSRM, length) 
[[1]]
[1] 19

[[2]]
[1] 21

[[3]]
[1] 18

[[4]]
[1] 15

[[5]]
[1] 12

[[6]]
[1] 16

[[7]]
[1] 11

[[8]]
[1] 9

[[9]]
[1] 16

[[10]]
[1] 21

Cross-check with EndNote and the pdf folder. No discrepancies found.

#Research and Politics

setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Research and Politics")

RPfiles <- list.files(pattern = "pdf$")

RP <- lapply(RPfiles, pdf_text)

length(RP)
[1] 10
lapply(RP, length) 
[[1]]
[1] 8

[[2]]
[1] 9

[[3]]
[1] 7

[[4]]
[1] 8

[[5]]
[1] 13

[[6]]
[1] 7

[[7]]
[1] 7

[[8]]
[1] 7

[[9]]
[1] 9

[[10]]
[1] 3

Cross-check with EndNote and the pdf folder. No discrepancies found.

AND THERE THEY ALL ARE!! This is very exciting.

When I go back to my original research question, it is this:

How are ethical standards related to experimental research disseminated to students, young professionals, and an academic discipline as a whole? I have uploaded the PowerPoint I created for the DACSS Three Minute Thesis research presentation to my GitHub.

In my pilot study, I did a simple word search, and that is where I am going to begin with this expanded data set as well; a quick sketch of that search follows.
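As a first pass (my own base-R sketch, separate from the tutorial code that follows), I can check which AJPS articles mention the stem "ethic" on any page:

#which AJPS articles mention "ethic" (and thus "ethics", "ethical") anywhere?
mentions <- sapply(AJPS, function(pages) any(grepl("ethic", pages, ignore.case = TRUE)))
which(mentions)  #indices of the matching articles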

Using a tutorial on text mining here, I am going to take an initial look at my documents, first by journal and then as a whole.

library(textmineR)  #CreateDtm()
library(SnowballC)  #wordStem()
library(dplyr)      #rename(), mutate(), arrange(), filter(), pull()

#CreateDtm() expects one string per document, so first collapse each
#article's pages into a single string
AJPSdocs <- sapply(AJPS, paste, collapse = " ")

AJPSdtm <- CreateDtm(AJPSdocs, ngram_window = c(1, 1),
                lower = TRUE,
                remove_punctuation = TRUE,
                remove_numbers = TRUE,
                stem_lemma_function = wordStem)

#all tokens in one document, sorted by count
AJPSget.doc.tokens <- function(AJPSdtm, docid)
  AJPSdtm[docid, ] %>% as.data.frame() %>% rename(count=".") %>%
  mutate(token=row.names(.)) %>% arrange(-count)

#per-document counts of one token, sorted by count
#(note: the "token" column in this output actually holds the document ids)
AJPSget.token.occurrences <- function(AJPSdtm, token)
  AJPSdtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
  mutate(token=row.names(.)) %>% arrange(-count)

#total occurrences of one token across all documents
AJPSget.total.freq <- function(AJPSdtm, token) AJPSdtm[, token] %>% sum

#number of documents in which one token appears at least once
AJPSget.doc.freq <- function(AJPSdtm, token)
  AJPSdtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
  filter(count>0) %>% pull(count) %>% length
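For example, the first helper would list the ten most frequent stemmed tokens in a given article (a usage sketch of my own; output not shown):

AJPSdtm %>% AJPSget.doc.tokens(1) %>% head(10)  #top tokens in the first article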

See if I can pull from the AJPS set which documents mention ethics (using stemming), and in how many of the documents in this dataset the word appears.

#which documents have the word?
AJPSdtm %>% AJPSget.token.occurrences(wordStem('ethics')) %>% head(10)
   count token
4      7     4
40     7    40
25     6    25
8      5     8
37     4    37
11     3    11
28     3    28
30     3    30
10     2    10
18     2    18
#in how many documents does the word appear?
AJPSdtm %>% AJPSget.doc.freq(wordStem('ethics'))
[1] 20

I need to make sure I am understanding this output correctly, as it's somewhat different from what I had done in Blog Post 4 (Take 2).

My first result is the number of times the token (the stem of "ethics") appears in a particular document, along with that document's ID. The second result is not a total count of the word but its document frequency: the number of documents in which the word appears at least once.
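If I want the total number of occurrences across the whole AJPS set instead, the total-frequency helper defined above should give that (my example call; output not shown):

#total occurrences of the stemmed word across all AJPS documents
AJPSdtm %>% AJPSget.total.freq(wordStem('ethics'))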

I"m going to stop here and submit this, and plan to make an appointment to talk with Professor Song about the best way to do this analysis.