This has to be the last one.
I’ve tried a lot of things that did not work, and I’ve come to the conclusion that my topic is poorly suited to a text-as-data approach; the amount of interpretation my project requires really calls for a qualitative method.
I struggled enormously with preprocessing my PDFs, and when I finally ran my models I realized that the state my PDFs came in means I cannot trust my results. The models I was getting were nonsensical (see any of my attempts posted on my GitHub). I’ve put a lot of hours into this, and it’s been challenging.
What I ended up doing was taking my EndNote citations and getting them into an Excel spreadsheet (to save as a CSV), and copying all the abstracts into it. Using the results from my R searches, I was able to identify which articles used the term ethic*, and coded those; a minimal sketch of that flagging step follows. I then pulled up each article individually, found the term ethics, and copied the relevant passages into a field in my spreadsheet. I also assessed and coded whether the mentions were: substantive (i.e., actually part of a discussion of the ethical concerns of the research), topical (i.e., part of the research topic, not an overarching discussion), procedural (mentioned as a pro forma report of IRB approval), or citation (the term appears only in the bibliography). Note: I assessed as substantive any article that mentioned further discussion of ethical concerns being included in online supplemental material.
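For reference, the flagging step amounted to something like this minimal sketch (the file name and the mentionsEthics column name are just illustrative stand-ins; the Abstract column is the one used later in this post):
library(dplyr)
library(stringr)
# read the spreadsheet saved from Excel ("ethicsArticles.csv" is a stand-in)
abstracts <- read.csv("ethicsArticles.csv")
# flag rows whose abstract matches ethic* (ethic, ethics, ethical, ...)
abstracts <- abstracts %>%
  mutate(mentionsEthics = str_detect(Abstract,
                                     regex("ethic", ignore_case = TRUE)))
table(abstracts$mentionsEthics)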
I do not know how I would use any sort of automated text-as-data tools to make these kinds of evaluations.
From here, I want to see the proportion of articles that mention ethics, grouped by journal:
I am going to run these numbers and visualizations in Excel, because it’s easier and faster than doing it with ggplot (and I have better control over my plot). For reference, a sketch of how the same proportions could be computed in R follows.
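A minimal dplyr sketch (the Journal column name is an assumption about my spreadsheet; mentionsEthics is the illustrative flag from the sketch above):
library(dplyr)
# share of articles mentioning ethic*, by journal
abstracts %>%
  group_by(Journal) %>%
  summarise(nArticles = n(),
            propEthics = mean(mentionsEthics)) %>%
  arrange(desc(propEthics))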
Now I’m going to work with just my ethics texts to create a word cloud, simply because I like them.
Load Libraries:
Please note: this code is almost entirely from my project 911 5b-12 TAKE 4. New code will be cited as necessary.
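The library() calls themselves didn’t make it into this post, so here is the set the code below depends on (reconstructed from the functions used later):
library(dplyr)       # pipes, count(), mutate()
library(tidytext)    # unnest_tokens(), the stop_words data
library(ggplot2)     # bar plots
library(stringr)     # string helpers
library(wordcloud2)  # word clouds
library(paletteer)   # color palettes
library(htmlwidgets) # saving a cloud as html
library(webshot)     # snapshotting html to pdf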
Explore Common Words
ethicsWords <- ethics %>%
  dplyr::select(Paragraph) %>%
  unnest_tokens(word, Paragraph)
head(ethicsWords)
# A tibble: 6 × 1
word
<chr>
1 shalvi
2 shaul
3 jason
4 dana
5 michel
6 j
Plot the top 30 words:
# plot the top 30 words
ethicsWords %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words Found in Ethic* Paragraphs")
Deal with Stop Words
A majority of these seem to be stop words, so let’s fix that!
library(stringr)
# the stop_words data ships with tidytext
data("stop_words")
# how many words do you have including the stop words?
nrow(ethicsWords)
[1] 8337
ethicsClean <- ethicsWords %>%
  anti_join(stop_words)
# how many words after removing the stop words?
nrow(ethicsClean)
[1] 4308
Replot the top 30 words:
# plot the top 30 words
ethicsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words (Cleaned) Found in Ethic* Paragraphs")
A little more data cleaning…
# replace all numbers with empty string
ethicsClean$word <- gsub("[0-9]+", "", ethicsClean$word)
# drop observations that are only empty strings
ethicsClean <- ethicsClean[ethicsClean$word != "",]
# how many words after removing numbers?
nrow(ethicsClean)
[1] 4094
Replot the top 30 words:
# plot the top 30 words
ethicsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words (Cleaned/No #) Found in Ethic* Paragraphs")
This is weird: the number is gone, but an “si” has appeared? My best guess is that gsub() strips digits from inside tokens too, so something like “si2” becomes the leftover fragment “si” instead of disappearing.
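If I were redoing this step, a gentler pass would drop only the tokens made entirely of digits, leaving mixed tokens intact; a minimal sketch (untested on this data):
# alternative to the gsub() step: drop all-digit tokens only,
# so a mixed token like "si2" stays whole instead of becoming "si"
ethicsClean <- ethicsWords %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "^[0-9]+$"))
For now, I’m going to go ahead and run my word cloud as-is: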
Word Cloud
word_cloud <- ethicsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(40)

library(wordcloud2)
library(paletteer)

set.seed(1010)
wordcloud2(data = word_cloud, size = .75,
           color = "random-dark")
Now to export it as a PDF!
setwd("~/DACCS R/Text as Data")

# load webshot and install PhantomJS (a one-time setup step)
library(webshot)
webshot::install_phantomjs()

# make the graph
set.seed(1010)
my_graph <- wordcloud2(data = word_cloud, size = .75,
                       color = "random-dark")
my_graph

# save it as html
library(htmlwidgets)
saveWidget(my_graph, "tmp.html", selfcontained = FALSE)

# then snapshot the html as a pdf
webshot("tmp.html", "finalEthicsWC.pdf", delay = 5, vwidth = 1000, vheight = 800)
Even though I set the seed, the cloud comes out different each run (and also looks different in the saved version). As best I can tell, that’s because wordcloud2 lays out the cloud in JavaScript via htmlwidgets, so R’s set.seed() has no effect on it.
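If a reproducible layout mattered more to me here, one workaround (a sketch, not something I ran for this post; the output file name is just illustrative) would be the older wordcloud package, which draws with base graphics and does respect R’s seed:
library(wordcloud)
library(RColorBrewer)
# base-graphics cloud: the layout uses R's RNG, so set.seed()
# makes it reproducible, and pdf() can capture it directly
set.seed(1010)
pdf("ethicsWC_base.pdf", width = 10, height = 8)
wordcloud(words = word_cloud$word, freq = word_cloud$n,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
dev.off()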
Now to repeat with the Abstracts:
Explore Common Words
abstractsWords <- abstracts %>%
  dplyr::select(Abstract) %>%
  unnest_tokens(word, Abstract)
head(abstractsWords)
# A tibble: 6 × 1
word
<chr>
1 traditionally
2 the
3 virtue
4 of
5 democratic
6 elections
Plot the top 30 words:
# plot the top 30 words
abstractsWords %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words Found in Abstracts")
Deal with Stop Words
A majority of these seem to be stop words, so let’s fix that!
library(stringr)
# the stop_words data ships with tidytext
data("stop_words")
# how many words do you have including the stop words?
nrow(abstractsWords)
[1] 18378
abstractsClean <- abstractsWords %>%
  anti_join(stop_words)
# how many words after removing the stop words?
nrow(abstractsClean)
[1] 9986
Replot the top 30 words:
# plot the top 30 words
abstractsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words (Cleaned) Found in Abstracts")
A little more data cleaning…
# replace all numbers with empty string
abstractsClean$word <- gsub("[0-9]+", "", abstractsClean$word)
# drop observations that are only empty strings
abstractsClean <- abstractsClean[abstractsClean$word != "",]
# how many words after removing numbers?
nrow(abstractsClean)
[1] 9884
Replot the top 30 words:
# plot the top 30 words
abstractsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Unique Words (Cleaned/No #) Found in Abstracts")
Word Cloud
word_cloudA <- abstractsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(40)

library(wordcloud2)
library(paletteer)

set.seed(1010)
wordcloud2(data = word_cloudA, size = .75,
           color = "random-dark")
Now to export it as a PDF!
setwd("~/DACCS R/Text as Data")

# load webshot (PhantomJS is already installed from the step above)
library(webshot)

# make the graph
set.seed(1010)
my_graph1 <- wordcloud2(data = word_cloudA, size = .75,
                        color = "random-dark")
my_graph1

# save it as html
library(htmlwidgets)
saveWidget(my_graph1, "tmp.html", selfcontained = FALSE)

# then snapshot the html as a pdf
webshot("tmp.html", "finalAbstractsWC.pdf", delay = 5, vwidth = 1000, vheight = 800)