This has to be the last one.
I’ve tried a lot of things that did not work, and I’ve come to the conclusion that my topic is poorly suited to a text-as-data approach; the amount of interpretation my project requires really calls for a qualitative method.
I struggled enormously with preprocessing my PDFs, and when I finally ran my models I realized that the state my PDFs came in means I cannot trust my results. The models I was getting were nonsensical (see any of my attempts posted on my GitHub). I’ve put a lot of hours into this, and it’s been challenging.
What I ended up doing was taking my EndNote citations and getting them into an Excel spreadsheet (to save as a CSV), and copying all the abstracts into it. Using the results from my R searches, I was able to identify which articles used the term ethic*, and coded those; a minimal sketch of that flagging step follows. I then pulled up each article individually, found the term ethics, and copied the relevant passages into a field in my spreadsheet. I also assessed and coded whether the mentions were: substantive (i.e., actually part of a discussion of the ethical concerns of the research), topical (i.e., part of the research topic, not an overarching discussion), procedural (mentioned as a pro forma report of IRB approval), or citation (the term appears only in the bibliography). Note: I assessed as substantive any article that mentioned further discussion of ethical concerns being included in online supplemental material.
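For reference, the flagging step amounted to something like this minimal sketch (the file name and the mentionsEthics column name are just illustrative stand-ins; the Abstract column is the one used later in this post):
library(dplyr)
library(stringr)
# read the spreadsheet saved from Excel ("ethicsArticles.csv" is a stand-in)
abstracts <- read.csv("ethicsArticles.csv")
# flag rows whose abstract matches ethic* (ethic, ethics, ethical, ...)
abstracts <- abstracts %>%
  mutate(mentionsEthics = str_detect(Abstract,
                                     regex("ethic", ignore_case = TRUE)))
table(abstracts$mentionsEthics)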
I do not know how I would use any sort of automated text-as-data tools to make these kinds of evaluations.
From here, I want to see the proportion of articles that mention ethics, grouped by journal:
I am going to run these numbers and visualizations in Excel, because it’s easier and faster than doing it with ggplot (and I have better control over my plot). For reference, a sketch of how the same proportions could be computed in R follows.
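A minimal dplyr sketch (the Journal column name is an assumption about my spreadsheet; mentionsEthics is the illustrative flag from the sketch above):
library(dplyr)
# share of articles mentioning ethic*, by journal
abstracts %>%
  group_by(Journal) %>%
  summarise(nArticles = n(),
            propEthics = mean(mentionsEthics)) %>%
  arrange(desc(propEthics))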
Now I’m going to work with just my ethics texts to create a word cloud, simply because I like them.
Load Libraries:
Please note: this code is almost entirely from my project 911 5b-12 TAKE 4. New code will be cited as necessary.
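The library() calls themselves didn’t make it into this post, so here is the set the code below depends on (reconstructed from the functions used later):
library(dplyr)       # pipes, count(), mutate()
library(tidytext)    # unnest_tokens(), the stop_words data
library(ggplot2)     # bar plots
library(stringr)     # string helpers
library(wordcloud2)  # word clouds
library(paletteer)   # color palettes
library(htmlwidgets) # saving a cloud as html
library(webshot)     # snapshotting html to pdf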
Explore Common Words
ethicsWords <- ethics %>%
  dplyr::select(Paragraph) %>%
  unnest_tokens(word, Paragraph)
head(ethicsWords)
# A tibble: 6 × 1
word
<chr>
1 shalvi
2 shaul
3 jason
4 dana
5 michel
6 j
Plot the top 30 words:
# plot the top 30 words
ethicsWords %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words Found in Ethic* Paragraphs")
Deal with Stop Words
A majority of these seem to be stop words, so let’s fix that!
library(stringr)
# the stop_words data ships with tidytext
data("stop_words")
# how many words do you have including the stop words?
nrow(ethicsWords)
[1] 8337
ethicsClean <- ethicsWords %>%
  anti_join(stop_words)
# how many words after removing the stop words?
nrow(ethicsClean)
[1] 4308
Replot the top 30 words:
# plot the top 30 words
ethicsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words (Cleaned) Found in Ethic* Paragraphs")
A little more data cleaning…
# replace all numbers with empty string
ethicsClean$word <- gsub("[0-9]+", "", ethicsClean$word)
# drop observations that are only empty strings
ethicsClean <- ethicsClean[ethicsClean$word != "",]
# how many words after removing numbers?
nrow(ethicsClean)
[1] 4094
Replot the top 30 words:
# plot the top 30 words
ethicsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words (Cleaned/No #) Found in Ethic* Paragraphs")
This is weird: the number is gone, but an “si” has appeared? My best guess is that gsub() strips digits from inside tokens too, so something like “si2” becomes the leftover fragment “si” instead of disappearing.
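If I were redoing this step, a gentler pass would drop only the tokens made entirely of digits, leaving mixed tokens intact; a minimal sketch (untested on this data):
# alternative to the gsub() step: drop all-digit tokens only,
# so a mixed token like "si2" stays whole instead of becoming "si"
ethicsClean <- ethicsWords %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "^[0-9]+$"))
For now, I’m going to go ahead and run my word cloud as-is: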
Word Cloud
word_cloud <- ethicsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(40)

library(wordcloud2)
library(paletteer)

set.seed(1010)
wordcloud2(data = word_cloud, size = .75,
           color = "random-dark")
Now to export it as a PDF!
setwd("~/DACCS R/Text as Data")

# load webshot and install PhantomJS (a one-time setup step)
library(webshot)
webshot::install_phantomjs()

# make the graph
set.seed(1010)
my_graph <- wordcloud2(data = word_cloud, size = .75,
                       color = "random-dark")
my_graph

# save it as html
library(htmlwidgets)
saveWidget(my_graph, "tmp.html", selfcontained = FALSE)

# then snapshot the html as a pdf
webshot("tmp.html", "finalEthicsWC.pdf", delay = 5, vwidth = 1000, vheight = 800)
Even though I set the seed, the cloud comes out different each run (and also looks different in the saved version). As best I can tell, that’s because wordcloud2 lays out the cloud in JavaScript via htmlwidgets, so R’s set.seed() has no effect on it.
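If a reproducible layout mattered more to me here, one workaround (a sketch, not something I ran for this post; the output file name is just illustrative) would be the older wordcloud package, which draws with base graphics and does respect R’s seed:
library(wordcloud)
library(RColorBrewer)
# base-graphics cloud: the layout uses R's RNG, so set.seed()
# makes it reproducible, and pdf() can capture it directly
set.seed(1010)
pdf("ethicsWC_base.pdf", width = 10, height = 8)
wordcloud(words = word_cloud$word, freq = word_cloud$n,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
dev.off()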
Now to repeat with the Abstracts:
Explore Common Words
abstractsWords <- abstracts %>%
  dplyr::select(Abstract) %>%
  unnest_tokens(word, Abstract)
head(abstractsWords)
# A tibble: 6 × 1
word
<chr>
1 traditionally
2 the
3 virtue
4 of
5 democratic
6 elections
Plot the top 30 words:
# plot the top 30 words
abstractsWords %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words Found in Abstracts")
Deal with Stop Words
A majority of these seem to be stop words, so let’s fix that!
library(stringr)
# the stop_words data ships with tidytext
data("stop_words")
# how many words do you have including the stop words?
nrow(abstractsWords)
[1] 18378
abstractsClean <- abstractsWords %>%
  anti_join(stop_words)
# how many words after removing the stop words?
nrow(abstractsClean)
[1] 9986
Replot the top 30 words:
# plot the top 30 words
abstractsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words (Cleaned) Found in Abstracts")
A little more data cleaning…
# replace all numbers with empty string
abstractsClean$word <- gsub("[0-9]+", "", abstractsClean$word)
# drop observations that are only empty strings
abstractsClean <- abstractsClean[abstractsClean$word != "",]
# how many words after removing numbers?
nrow(abstractsClean)
[1] 9884
Replot the top 30 words:
# plot the top 30 words
abstractsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words (Cleaned/No #) Found in Abstracts")
Word Cloud
word_cloudA <- abstractsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(40)

library(wordcloud2)
library(paletteer)

set.seed(1010)
wordcloud2(data = word_cloudA, size = .75,
           color = "random-dark")
Now to export it as a PDF!
setwd("~/DACCS R/Text as Data")

# load webshot (PhantomJS is already installed from the step above)
library(webshot)

# make the graph
set.seed(1010)
my_graph1 <- wordcloud2(data = word_cloudA, size = .75,
                        color = "random-dark")
my_graph1

# save it as html
library(htmlwidgets)
saveWidget(my_graph1, "tmp.html", selfcontained = FALSE)

# then snapshot the html as a pdf
webshot("tmp.html", "finalAbstractsWC.pdf", delay = 5, vwidth = 1000, vheight = 800)