Continuing the description of data collection for the TaD Final Project.
For this blog post/stage of the project, I’m going to work on pulling articles into R for analysis. I’m going to start with the articles from one journal, the American Journal of Political Science, since I already have them all downloaded.
I’ve printed out my EndNote bibliography and I’m cross-checking it against the actual files in the directory to make sure I have them all.
I got a number of parsing errors, which I believe means that some of the characters could not be parsed (R could not read them).
I’m going to try running the OCR function on a sample file. So far, I haven’t been able to run it successfully.
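The route I’ve been trying looks roughly like this (a sketch using pdftools’ pdf_ocr_text(), which renders pages to images and runs them through tesseract; the file name here is a placeholder, not one of my actual article files):

library(pdftools)
#pdf_ocr_text() needs the tesseract package installed
ocr_sample <- pdf_ocr_text("scanned_article.pdf") #placeholder file name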
Note: Should I think about simplifying all the article file names?
Okay, trying the original code again.
library(tm) #text mining analysis
library(pdftools) #reads pdf documents
setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")
files <- list.files(pattern = "pdf$")
AJPS <- lapply(files, pdf_text) #loads all pdf files
length(AJPS) #verify how many files loaded in
[1] 44
lapply(AJPS, length) #length of each pdf
[[1]]
[1] 27
[[2]]
[1] 16
[[3]]
[1] 14
[[4]]
[1] 16
[[5]]
[1] 18
[[6]]
[1] 17
[[7]]
[1] 14
[[8]]
[1] 19
[[9]]
[1] 15
[[10]]
[1] 16
[[11]]
[1] 18
[[12]]
[1] 21
[[13]]
[1] 16
[[14]]
[1] 15
[[15]]
[1] 15
[[16]]
[1] 17
[[17]]
[1] 17
[[18]]
[1] 14
[[19]]
[1] 16
[[20]]
[1] 13
[[21]]
[1] 15
[[22]]
[1] 20
[[23]]
[1] 15
[[24]]
[1] 20
[[25]]
[1] 17
[[26]]
[1] 19
[[27]]
[1] 18
[[28]]
[1] 16
[[29]]
[1] 16
[[30]]
[1] 18
[[31]]
[1] 17
[[32]]
[1] 15
[[33]]
[1] 12
[[34]]
[1] 1
[[35]]
[1] 14
[[36]]
[1] 13
[[37]]
[1] 16
[[38]]
[1] 22
[[39]]
[1] 19
[[40]]
[1] 17
[[41]]
[1] 18
[[42]]
[1] 19
[[43]]
[1] 20
[[44]]
[1] 16
# AJPS[1] it does look like the whole article read in, very
# messy!
Okay, it looks like the whole PDF did load. One element stands out, though: [[34]] came back with a length of 1, so that file is worth a closer look (it may be the one that needs OCR).
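A quick way to flag short reads like that one, reusing the lengths from above:

which(sapply(AJPS, length) < 5) #indices of pdfs whose text came back suspiciously short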
Now I’m going to attempt to create a Corpus. Note: I had to reset the working directory so that the command could find the files.
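The corpus call itself isn’t shown here, but the tm + pdftools route would look something like this (a sketch; it assumes the same files vector, re-listed after resetting the working directory):

setwd("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/American Journal of Political Science")
files <- list.files(pattern = "pdf$")
AJPScorpus <- Corpus(URISource(files),
    readerControl = list(reader = readPDF(engine = "pdftools"))) #one document per pdf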
From here, I’m going to create a term-document matrix and do some initial cleanup. Because I am going to be looking for specific terms, consistent with my original paper, I’m going to remove stopwords and numbers and convert everything to lower case.
Terms from my original paper:

* “ethic-”
* “IRB”, “institutional”, “review board”, “human”, “subjects”, “committee”
* “informed”, “consent”
* “harm”, “burden”, “mitigat-”, “minimi-”, and “safe-”
* “benefit”
AJPStdm <- TermDocumentMatrix(AJPScorpus, control = list(removePunctuation = TRUE,
stopwords = TRUE, tolower = TRUE, stemming = TRUE, removeNumbers = TRUE))
# bounds = list(global = c(3, Inf)) - I'm removing this
# command because I don't actually want to remove the
# sparse terms. In fact, I might even want to remove the
# frequent terms! I'll have to think on this
So, apparently I have now created my term-document matrix! [Note: I think I need to go back and review what I’m doing. Do I actually need a tdm for my project?]
Now I’m going to inspect my matrix.
inspect(AJPStdm)
<<TermDocumentMatrix (terms: 18195, documents: 44)>>
Non-/sparse entries: 73069/727511
Sparsity : 91%
Maximal term length: 59
Weighting : term frequency (tf)
Sample (the 10 sample terms, counted in 10 of the 44 documents):

| Document | effect | elect | experi | group | inform | polit | result | treatment | vote | voter |
|---|---|---|---|---|---|---|---|---|---|---|
| Able and Mostly Willing_ An Empirical Anatomy of Information's Effect on Voter‐Driven Accountability in Senegal.pdf | 68 | 43 | 10 | 6 | 196 | 57 | 11 | 60 | 93 | 164 |
| Comparing and Combining List and Endorsement Experiments_ Evidence from Afghanistan.pdf | 31 | 17 | 205 | 53 | 5 | 21 | 43 | 37 | 1 | 1 |
| Does Race Affect Access to Government Services_ An Experiment Exploring Street‐Level Bureaucrats and Access to Public Housing.pdf | 22 | 10 | 19 | 32 | 28 | 32 | 34 | 22 | 10 | 1 |
| Explaining Explanations_ How Legislators Explain their Policy Positions and How Citizens React.pdf | 43 | 10 | 70 | 19 | 11 | 57 | 46 | 47 | 138 | 25 |
| How Markets Shape Values and Political Preferences A Field Experiment.pdf | 103 | 5 | 25 | 40 | 11 | 52 | 31 | 131 | 17 | 11 |
| Non‐Governmental Monitoring of Local Governments Increases Compliance with Central Mandates_ A National‐Scale Field Experiment in China.pdf | 65 | 3 | 13 | 30 | 71 | 44 | 27 | 98 | 0 | 0 |
| Political Determinants of Economic Exchange_ Evidence from a Business Experiment in Senegal.pdf | 38 | 2 | 20 | 21 | 16 | 189 | 39 | 47 | 1 | 0 |
| The Impact of Elections on Cooperation_ Evidence from a Lab‐in‐the‐Field Experiment in Uganda.pdf | 41 | 104 | 51 | 80 | 19 | 44 | 21 | 19 | 5 | 1 |
| The Moderating Effect of Debates on Political Attitudes.pdf | 110 | 14 | 9 | 14 | 66 | 68 | 46 | 57 | 42 | 97 |
| Urbanization Patterns, Information Diffusion, and Female Voting in Rural Paraguay.pdf | 172 | 31 | 21 | 37 | 64 | 41 | 33 | 102 | 67 | 32 |
# When I remove the 'bounds' code, my sparsity increases
# from 75% to 91%
From here, I’m going back to review the first few weeks of the course to see if I can run the analyses I’m interested in.
I think what I want to do is use RegEx to find my search terms?
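Something along these lines is what I have in mind (a sketch in base R; “consent” is a stand-in for whichever term from my list I’m counting, and AJPS is the list of pdf_text() results from above):

term <- "consent" #stand-in search term
counts <- sapply(AJPS, function(pages) {
    text <- tolower(paste(pages, collapse = " ")) #collapse pages into one string
    length(regmatches(text, gregexpr(term, text))[[1]]) #matches in this document
})
sum(counts > 0) #number of documents containing the term
sum(counts) #total occurrences across all documents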
So, I believe I am understanding this: my search term appears in 24 of my 44 elements (documents), for a total of 57 occurrences.