Preliminary Analysis on Text
I realized that I’ve been going about this the wrong way, so I’m going to comment out all of my code chunks but leave them in to demonstrate the work that I did.
American Journal of Political Science:
#set WD to the proper subfolder:
#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")
#AJPSfiles <- list.files(pattern = "pdf$")
#AJPS <- lapply(AJPSfiles, pdf_text)
As before, there were a large number of parsing errors, which I take to mean that some characters could not be read in.
Check how many files loaded in:
#length(AJPS) #verify how many files loaded in
Cross-check this with EndNote and the downloaded files. No discrepancies found.
Check the length of each file:
#lapply(AJPS, length) #length of each pdf
Repeat this for each journal.
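Rather than repeating the setwd() and lapply() steps by hand, the same checks could probably be wrapped in one helper that walks every journal folder. This is only a sketch: read_journal_pdfs, base_dir and journals are hypothetical names that do not appear in my original code, and it assumes the pdftools package is installed and that each journal has its own subfolder under the path used above.

```r
# Sketch only: read every journal folder in one pass instead of calling setwd() per journal.
# `read_journal_pdfs`, `base_dir` and `journals` are hypothetical names for illustration.
read_journal_pdfs <- function(base_dir, journals) {
  out <- lapply(journals, function(journal) {
    # Full paths to the PDFs in this journal's subfolder
    files <- list.files(file.path(base_dir, journal),
                        pattern = "\\.pdf$", full.names = TRUE)
    # pdf_text() returns one character vector per PDF, with one element per page
    texts <- lapply(files, function(f) pdftools::pdf_text(f))
    names(texts) <- basename(files)
    texts
  })
  names(out) <- journals
  out
}

journals <- c(
  "American Journal of Political Science",
  "American Political Science Review",
  "American Politics Research",
  "Journal of Experimental Political Science",
  "Political Science Research and Methods",
  "Research and Politics"
)
base_dir <- "~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs"

# Not run here; uncomment to use:
# all_pdfs <- read_journal_pdfs(base_dir, journals)
# sapply(all_pdfs, length)    # number of files loaded per journal
# lapply(all_pdfs, lengths)   # number of pages in each PDF
```

Using full.names = TRUE means the working directory never has to change, which is the main reason to prefer this over repeated setwd() calls.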
American Political Science Review:
#American Political Science Review
#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Political Science Review")
#APSRfiles <- list.files(pattern = "pdf$")
#APSR <- lapply(APSRfiles, pdf_text)
#length(APSR)
#lapply(APSR, length)
Cross-check with EndNote and pdf folder. No discrepancies found.
American Politics Research:
#American Politics Research
#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Politics Research")
#APRfiles <- list.files(pattern = "pdf$")
#APR <- lapply(APRfiles, pdf_text)
#length(APR)
#lapply(APR, length)
Cross-check with EndNote and pdf folder. No discrepancies found.
Journal of Experimental Political Science:
#Journal of Experimental Political Science
#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Journal of Experimental Political Science")
#JEPSfiles <- list.files(pattern = "pdf$")
#JEPS <- lapply(JEPSfiles, pdf_text)
#length(JEPS)
#lapply(JEPS, length)
Cross-check with EndNote and pdf folder. Discrepancy found with EndNote (20 citations in that database). I realized I had accidentally re-imported a citation I already had, and deleted the duplicate. After that, no discrepancies found.
Political Science Research and Methods:
#Political Science Research and Methods
#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Political Science Research and Methods")
#PSRMfiles <- list.files(pattern = "pdf$")
#PSRM <- lapply(PSRMfiles, pdf_text)
#length(PSRM)
#lapply(PSRM, length)
Cross-check with EndNote and pdf folder. No discrepancies found.
Research and Politics:
#Research and Politics
#setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/Research and Politics")
#RPfiles <- list.files(pattern = "pdf$")
#RP <- lapply(RPfiles, pdf_text)
#length(RP)
#lapply(RP, length)
Cross-check with EndNote and pdf folder. No discrepancies found.
AND THERE THEY ALL ARE!! This is very exciting.
When I go back to my original research question, it is this:
How are ethical standards related to experimental research disseminated to students, young professionals and to an academic discipline as a whole? I have uploaded the PowerPoint I created for the DACSS Three Minute Thesis research presentation to my GitHub.
From my pilot study:
The general topics that were searched on included:
● Was there any mention of ethics in the article?
● Was IRB approval reported?
● Was informed consent obtained by the researchers from the research participants?
● Was potential harm to research subjects or staff discussed?
Searches included:
● “ethic-”
● “IRB”, “institutional”, “review board”, “human”, “subjects”, “committee”
● “informed”, “consent”
● “harm”, “burden”, “mitigat-”, “minimi-” and “safe-”
● “benefit”
In addition, assessment was made as to the following:
● Was contact information for the authors provided?
● How were research subjects selected and/or recruited?
● Was the study registered or preregistered?
● Was the general conflict of interest disclosure included in the article?
● Was the data made publicly available?
● Was any financial support or funding acknowledged?
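One way the pilot-study search terms might carry over to the text analysis is as a small set of regular expressions, one per topic. This is a sketch under assumptions: search_patterns and flag_topics are hypothetical names, the trailing hyphens in the pilot list are treated as word stems, and the word-boundary choices are mine. Broad terms such as “human” and “subjects” will almost certainly over-match, so any hits would still need to be read in context.

```r
# Hypothetical sketch: pilot-study search terms as regular expressions, grouped by topic.
# Stems such as "ethic-" are matched at the start of a word.
search_patterns <- c(
  ethics  = "\\bethic",
  irb     = "\\bIRB\\b|institutional|review board|\\bhuman\\b|\\bsubjects\\b|committee",
  consent = "\\binformed\\b|\\bconsent\\b",
  harm    = "\\bharm|\\bburden|\\bmitigat|\\bminimi|\\bsafe",
  benefit = "\\bbenefit"
)

# Returns a logical matrix: one row per text, one column per topic.
flag_topics <- function(texts, patterns = search_patterns) {
  hits <- do.call(cbind, lapply(patterns, function(p) {
    grepl(p, texts, ignore.case = TRUE, perl = TRUE)
  }))
  rownames(hits) <- names(texts)
  hits
}

# Toy input (made-up sentences, not from any article)
toy <- c(
  a = "The study received IRB approval and informed consent was obtained.",
  b = "We estimate a regression model."
)
flag_topics(toy)
```

Applied to a named character vector with one element per article, the result could be summed by column to count how many articles touch each topic.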
What information am I looking for?
Do I want to include all of the search terms I used in my pilot study?
Can I connect my authors with ethics mentions? And connect to my Networks project?
Create corpus
#APSRcorpus <- corpus(APSRfiles)
#APSRsummary <- summary(APSRcorpus)
#APSRsummary
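I’m realizing that what I thought was pulling the pdfs in is just creating a list of file names: APSRfiles comes from list.files(), so a corpus built from it contains the names of the files, not the text of the articles.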
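A quick guard against this mistake might be to check what the object holds before building a corpus from it. The helper below is a sketch with a hypothetical name, looks_like_file_names, and the file names in the example are made up.

```r
# Hypothetical helper: TRUE when every element of x looks like a PDF file name
# rather than document text.
looks_like_file_names <- function(x) {
  is.character(x) && length(x) > 0 &&
    all(grepl("\\.pdf$", x, ignore.case = TRUE))
}

# Made-up example inputs
files <- c("article_one.pdf", "article_two.pdf")
texts <- c("Full text of the first article.", "Full text of the second article.")

looks_like_file_names(files)  # TRUE: these are names, so read the files first
looks_like_file_names(texts)  # FALSE: safe to pass on to corpus()
nchar(files)                  # only a few characters each, another warning sign
```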
Trying something based on code found here.
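# Function to perform pdf-to-text conversion for many documents
# Just for AJPS for now
# Note: the escaped characters in the first three replacements ("\n", "\r", "\t")
# are assumed; they did not survive in the rendered version of this chunk.
library(magrittr)  # provides %>%
convertpdf2txt <- function(dirpath){
  # full paths to every file in the folder
  files <- list.files(dirpath, full.names = TRUE)
  sapply(files, function(x){
    x <- pdftools::pdf_text(x) %>%
      paste(sep = " ") %>%
      stringr::str_replace_all(stringr::fixed("\n"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\r"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\t"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\""), " ") %>%
      paste(sep = " ", collapse = " ") %>%  # join all pages into one string
      stringr::str_squish() %>%             # collapse repeated whitespace
      stringr::str_replace_all("- ", "")    # rejoin words hyphenated across line breaks
    return(x)
  })
}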
Now apply the function to the directory:
AJPStxts <- convertpdf2txt("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")
str(AJPStxts)
AJPScorpus <- corpus(AJPStxts)
AJPSsummary <- summary(AJPScorpus)
AJPSsummary
OKAY THIS LOOKS MORE PROMISING!! I’m going to go ahead and save this and publish to RPubs then create a new document to do this correctly so that I actually have my texts.