Analysis for Final Project TAKE 1

Preliminary Analysis on Text

Lissie Bates-Haus, Ph.D. (https://github.com/lbateshaus), U Mass Amherst DACSS MS Student (https://www.umass.edu/sbs/data-analytics-and-computational-social-science-program/ms)
2022-04-20

I realized that I’ve been going about this the wrong way, so I’m going to comment out all of my code chunks but leave them in to demonstrate the work that I did.

Step 1: Read in data from my computer

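library(readr)
library(pdftools)  #tool used by the web tutorial
library(tm)
library(quanteda)  #provides corpus(), used further down
library(stringr)   #string cleaning functions used in the conversion function
library(magrittr)  #provides the %>% pipe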

American Journal of Political Science:

#set WD to the proper subfolder:
#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")

#AJPSfiles <- list.files(pattern = "pdf$")

#AJPS <- lapply(AJPSfiles, pdf_text)

As before, there were a large number of parsing errors, which I take to mean that not every character could be read in.

Check how many files loaded in:

#length(AJPS)  #verify how many files loaded in

Cross-check this with EndNote and the downloaded files. No discrepancies found.

Check the length (number of pages) of each file:

#lapply(AJPS, length)  #length of each pdf
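Because pdf_text() returns one character string per page, the length of each element is that article's page count. Here is a minimal sketch of the same check using base R's lengths(), which gives a named vector rather than a list. The toy_pdfs object is a made-up stand-in, not my data.

```r
# Hypothetical stand-in for the list that lapply(files, pdf_text) returns:
# one character vector per PDF, one string per page.
toy_pdfs <- list(
  "article_a.pdf" = c("page 1 text", "page 2 text", "page 3 text"),
  "article_b.pdf" = c("page 1 text", "page 2 text")
)

# lengths() returns a named integer vector, which is easier to scan
# than the list printed by lapply(x, length).
page_counts <- lengths(toy_pdfs)
page_counts

# Flag any file that came back with zero pages.
empty_files <- names(page_counts)[page_counts == 0]
empty_files
```

A file with zero pages would be a sign that the PDF did not load properly.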

Repeat this for each journal.

American Political Science Review:

#American Political Science Review

#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Political Science Review")

#APSRfiles <- list.files(pattern = "pdf$")

#APSR <- lapply(APSRfiles, pdf_text)

#length(APSR)
#lapply(APSR, length) 

Cross-check with EndNote and the PDF folder. No discrepancies found.

American Politics Research:

#American Politics Research

#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Politics Research")

#APRfiles <- list.files(pattern = "pdf$")

#APR <- lapply(APRfiles, pdf_text)

#length(APR)
#lapply(APR, length) 

Cross-check with EndNote and the PDF folder. No discrepancies found.

Journal of Experimental Political Science:

#Journal of Experimental Political Science

#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Journal of Experimental Political Science")

#JEPSfiles <- list.files(pattern = "pdf$")

#JEPS <- lapply(JEPSfiles, pdf_text)

#length(JEPS)
#lapply(JEPS, length) 

Cross-check with EndNote and the PDF folder. Discrepancy found with EndNote (20 citations in that database). I realized I had accidentally re-imported a citation I already had, and deleted the duplicate. After that, no discrepancies found.

Political Science Research and Methods:

#Political Science Research and Methods

#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Political Science Research and Methods")

#PSRMfiles <- list.files(pattern = "pdf$")

#PSRM <- lapply(PSRMfiles, pdf_text)

#length(PSRM)
#lapply(PSRM, length) 

Cross-check with EndNote and the PDF folder. No discrepancies found.

Research and Politics:

#Research and Politics

#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Research and Politics")

#RPfiles <- list.files(pattern = "pdf$")

#RP <- lapply(RPfiles, pdf_text)

#length(RP)
#lapply(RP, length) 

Cross-check with EndNote and the PDF folder. No discrepancies found.

AND THERE THEY ALL ARE!! This is very exciting.
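The six journal chunks all follow the same pattern, so they could probably be collapsed into one helper. This is only a sketch under my own assumptions: read_journal and its reader argument are names I made up, and the folder names in the commented usage are copied from the setwd() calls and would have to match what is actually on disk.

```r
# Sketch only: read every PDF in one journal folder without calling setwd().
# `reader` defaults to pdftools::pdf_text; it is a parameter so the function
# can be tried out without real PDFs.
read_journal <- function(dir, reader = pdftools::pdf_text) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  texts <- lapply(files, reader)
  names(texts) <- basename(files)  # keep track of which text is which file
  texts
}

# Hypothetical usage (commented out, like the chunks above):
# base_dir <- "~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs"
# journals <- c(AJPS = "American Journal of Political Science",
#               APSR = "American Political Science Review")
# all_pdfs <- lapply(file.path(base_dir, journals), read_journal)
# names(all_pdfs) <- names(journals)
# lengths(all_pdfs)  # number of files loaded per journal
```

Using full paths instead of setwd() also means the working directory never changes partway through the document.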

When I go back to my original research question, it is this:

How are ethical standards related to experimental research disseminated to students, young professionals and to an academic discipline as a whole? I have uploaded the PowerPoint I created for the DACSS Three Minute Thesis research presentation to my GitHub.

From my pilot study:

The general topics that were searched on included:
● Was there any mention of ethics in the article?
● Was IRB approval reported?
● Was informed consent obtained by the researchers from the research participants?
● Was potential harm to research subjects or staff discussed?

Searches included:
● “ethic-”
● “IRB”, “institutional”, “review board”, “human”, “subjects”, “committee”
● “informed”, “consent”
● “harm”, “burden”, “mitigat-”, “minimi-” and “safe-”
● “benefit”

In addition, assessment was made as to the following:
● Was contact information for the authors provided?
● How were research subjects selected and/or recruited?
● Was the study registered or preregistered?
● Was the general conflict of interest disclosure included in the article?
● Was the data made publicly available?
● Was any financial support or funding acknowledged?
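One way the pilot-study searches might translate into code is a named vector of regular expressions, one per topic. This is only a sketch. The patterns are my own guess at how to express the search terms (truncated terms like “ethic-” become prefixes, and I have combined some single words into phrases), and toy_docs is made-up text, not my articles.

```r
# Hypothetical patterns: each search topic becomes one regular expression.
search_patterns <- c(
  ethics  = "ethic",
  irb     = "IRB|institutional review board|human subjects",
  consent = "informed consent",
  harm    = "harm|burden|mitigat|minimi|safe",
  benefit = "benefit"
)

# Made-up example texts standing in for article full texts.
toy_docs <- c(
  doc1 = "The study received IRB approval and informed consent was obtained.",
  doc2 = "We discuss the ethical implications and potential harm to subjects.",
  doc3 = "No relevant statements appear in this text."
)

# Logical matrix: rows are documents, columns are topics.
hits <- sapply(search_patterns,
               function(p) grepl(p, toy_docs, ignore.case = TRUE))
rownames(hits) <- names(toy_docs)
hits

# Number of documents that mention each topic at least once.
colSums(hits)
```

These are plain substring matches, so a pattern like "harm" would also match "pharmacy". Any hit would still need to be read in context, as in the pilot study.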

What information am I looking for?

  1. Number of documents (aka journal articles) that contain the word “ethics”, by journal
     a. Do I want to include all of the search terms I used in my pilot study?
  2. Can I connect my authors with ethics mentions? And connect to my Networks project?
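The first question (number of documents mentioning ethics, by journal) could be sketched roughly like this once the texts are actually loaded. Everything here is an assumption for illustration: toy_texts is made-up data, and matching on the prefix "ethic" is just one possible way to define a mention.

```r
# Made-up texts grouped by journal; in practice each element would hold
# one cleaned full-text string per article.
toy_texts <- list(
  AJPS = c("An ethical review was completed.", "Nothing relevant here."),
  APSR = c("Ethics are discussed at length.",
           "The ethics board approved.",
           "No mention.")
)

# For each journal: how many documents mention "ethic" (ethics, ethical, ...)?
ethics_by_journal <- data.frame(
  journal  = names(toy_texts),
  n_docs   = lengths(toy_texts),
  n_ethics = sapply(toy_texts,
                    function(txt) sum(grepl("ethic", txt, ignore.case = TRUE))),
  row.names = NULL
)

# Share of documents per journal with at least one mention.
ethics_by_journal$share <- ethics_by_journal$n_ethics / ethics_by_journal$n_docs
ethics_by_journal
```

The share column matters because the journals contribute different numbers of articles, so raw counts alone would not be comparable.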

Create corpus

#APSRcorpus <- corpus(APSRfiles)
#APSRsummary <- summary(APSRcorpus)
#APSRsummary

I’m realizing that what I thought was pulling in the PDFs was just creating a corpus from the list of file names: APSRfiles holds the names returned by list.files(), not the text of the articles.
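A quick way to catch this kind of mix-up is to look at the size of each “document”: file names are a few dozen characters at most, while article texts run to thousands. Here is a minimal sketch with base R, using made-up values. toy_files, toy_text and looks_like_text are hypothetical, and the 500-character cutoff is arbitrary.

```r
# What list.files() returns: file names only, never the file contents.
toy_files <- c("smith_2019.pdf", "jones_2020.pdf")

# A stand-in for real extracted text (one long string per article).
toy_text <- c(
  paste(rep("Lorem ipsum dolor sit amet.", 200), collapse = " "),
  paste(rep("Consectetur adipiscing elit.", 200), collapse = " ")
)

nchar(toy_files)  # a handful of characters each: these are just names
nchar(toy_text)   # thousands of characters each: this is actual content

# A rough guard to run before building a corpus; the cutoff is an
# arbitrary assumption, not a rule.
looks_like_text <- function(x, min_chars = 500) all(nchar(x) >= min_chars)
looks_like_text(toy_files)
looks_like_text(toy_text)
```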

Trying something based on code found here.

#function to perform pdf to text conversion for many documents
#Just for AJPS for now

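convertpdf2txt <- function(dirpath){
  #list every file in the directory, with full paths
  files <- list.files(dirpath, full.names = TRUE)
  #for each file: extract the text, replace line breaks, carriage returns,
  #tabs and double quotes with spaces, collapse all pages into one string,
  #squish repeated whitespace, and rejoin words hyphenated at line ends
  x <- sapply(files, function(x){
    x <- pdftools::pdf_text(x) %>%
      paste(sep = " ") %>%
      stringr::str_replace_all(stringr::fixed("\n"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\r"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\t"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\""), " ") %>%
      paste(sep = " ", collapse = " ") %>%
      stringr::str_squish() %>%
      stringr::str_replace_all("- ", "")
    return(x)
  })
  return(x)
}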
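To see what the cleaning steps in the function do without needing a PDF, here is a sketch that applies the same kind of steps to a made-up page of text. It uses base R's gsub() rather than stringr, so it is an approximation of the pipeline and not the tutorial's code. toy_page and clean_text are hypothetical names.

```r
# Made-up page text with a line-break hyphen, a tab, and double quotes.
toy_page <- "Informed con-\nsent was\tobtained  from \"all\" subjects."

clean_text <- function(x) {
  x <- gsub("[\n\r\t\"]", " ", x)    # line breaks, tabs, quotes -> space
  x <- gsub("\\s+", " ", trimws(x))  # collapse runs of whitespace
  gsub("- ", "", x, fixed = TRUE)    # rejoin words hyphenated at line ends
}

clean_text(toy_page)
# "Informed consent was obtained from all subjects."
```

One thing to watch: removing every "- " also affects legitimate constructions such as "pre- and post-test", which would become "preand post-test". For a keyword search on terms like “informed consent” that trade-off seems acceptable, since rejoining hyphenated words is what makes those terms findable.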

Now apply the function to the directory:

AJPStxts <- convertpdf2txt("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")

Inspect the structure of the txts element:

str(AJPStxts)

AJPScorpus <- corpus(AJPStxts)
AJPSsummary <- summary(AJPScorpus)
AJPSsummary

OKAY THIS LOOKS MORE PROMISING!! I’m going to go ahead and save this and publish to RPubs then create a new document to do this correctly so that I actually have my texts.