LDA and STM Models for Final Project

Attempt #2

Lissie Bates-Haus, Ph.D. https://github.com/lbateshaus (U Mass Amherst DACSS MS Student) https://www.umass.edu/sbs/data-analytics-and-computational-social-science-program/ms
2022-05-07

[Note: because my knits keep failing, I’m going to take out all code chunks to see if that can get it to knit.]

{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)


### Load in libraries:

{r, warning=FALSE, message=FALSE}
#load in initial necessary libraries

library(quanteda)
library(readr)
library(dplyr)
library(stringr)
library(tidytext)

Load in all articles as text:

{r}
#function to perform pdf to text conversion for many documents

convertpdf2txt <- function(dirpath){
  files <- list.files(dirpath, full.names = T)
  x <- sapply(files, function(x){
    x <- pdftools::pdf_text(x) %>%
      paste(sep = " ") %>%
      stringr::str_replace_all(fixed("\n"), " ") %>%
      stringr::str_replace_all(fixed("\r"), " ") %>%
      stringr::str_replace_all(fixed("\t"), " ") %>%
      stringr::str_replace_all(fixed("\""), " ") %>%
      stringr::str_replace_all(fixed("[:digits:]"), " ") %>% #Lissie put in to take out num
      paste(sep = " ", collapse = " ") %>%
      stringr::str_squish() %>%
      stringr::str_replace_all("- ", "")
    return(x)
  })
}


### Apply the function to the directory to pull in texts:

{r, results=FALSE, message=FALSE, warning=FALSE}
texts <- convertpdf2txt("~/DACSS/697D Text as Data/Final Project Materials/Articles pdfs/All Articles")

{r, warning=FALSE, message=FALSE}
library(tm)
library(lexicon)
library(wordcloud)
library(textstem)
library(tidyverse)
library(quanteda)
library(purrr)
library(rvest)
library(curl)
library(httr)
library(text2vec)
library(LDAvis)
library(caret)
library(randomForest)
library(caTools)
library(ldatuning)


### Save texts as data frame and add unique document identifier

{r}
texts.df <- as.data.frame(texts)
texts.df$ID <- 1:nrow(texts.df)
dim(texts.df)
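
A quick sanity check of what came out of the PDF conversion might be worth running here. A minimal sketch (not run), assuming the text column created by as.data.frame() is named texts:

{r, eval=FALSE}
# Peek at the start of the first document and count digit characters overall
substr(texts.df$texts[1], 1, 300)
sum(stringr::str_count(texts.df$texts, "[0-9]"))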

All code taken from the Week 9 TaD Tutorial.

### Tokenization

{r}
# Creates string of combined lowercased words
tokens <- tolower(texts.df)

# Performs tokenization
tokens <- word_tokenizer(tokens)

The first issue I see is that there are a LOT of numbers in my texts. Did my function not clean them out? Probably not: fixed("[:digits:]") treats that pattern as a literal string rather than a regex, so the digit-removal line never actually matched anything.
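
If the numbers need to come out at this stage, a regex-based replacement would do it. A minimal sketch (not run), applied to the texts vector rather than inside the function:

{r, eval=FALSE}
# Remove digit runs with a regex; fixed("[:digits:]") in the function above
# only matches that literal string, so it leaves numbers alone.
texts_nonum <- stringr::str_remove_all(texts, "[0-9]+")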

Moved over to quanteda to see if that works better?

{r}
texts1 <- tolower(texts)
corpus <- corpus(texts1)
tokens1 <- quanteda::tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
tokens1 <- tokens_select(tokens1, pattern = c(stopwords("en")), selection = "remove")
tokens1 <- tokens_select(tokens1, pattern = "[:alnum:] ", selection = "remove", valuetype = c("regex") )
#typesnum <- grep("[[:digit:]]", types(corpus_tok2), value = TRUE)
#corpus_tok2 <- tokens_remove(corpus_tok2,c(typesnum))

Why does quanteda return an object with 121 elements (the number of documents) while text2vec returns an object with 2 elements? Most likely because tolower() was applied to the whole texts.df data frame, which collapses it to one string per column, and texts.df has two columns (texts and ID).
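
A minimal check of that explanation (not run): lowercasing just the text column should give text2vec one element per document. This assumes the text column in texts.df is named texts:

{r, eval=FALSE}
# Lowercase only the text column, then tokenize with text2vec
tokens_alt <- word_tokenizer(tolower(texts.df$texts))
length(tokens_alt)  # should equal the number of documents (121)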

### Iteration

{r}
# Iterates over each token
it <- itoken(as.list(tokens1), ids = texts.df$ID, progressbar = FALSE)

# Prints iterator
it


### Vocabulary-Based Vectorization

{r}
# Build the vocabulary
v <- create_vocabulary(it)

# Print vocabulary
#v
class(v)

So the issue is that this v list is really messy even after I convert it to lowercase. I'm going to write it to a CSV, pull it into Excel to prune out all the numbers, and bring it back in.
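
Before the Excel detour, a rough way to gauge how number-heavy the vocabulary actually is. A sketch (not run), assuming v keeps text2vec's standard term and term_count columns:

{r, eval=FALSE}
# Count vocabulary terms that contain at least one digit
sum(grepl("[0-9]", v$term))

# Look at the most frequent terms
head(v[order(-v$term_count), ], 20)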

{r}
setwd("~/DACCS R/Text as Data/Final Project")
#write_as_csv(v, "vocab.csv")

vocab1 <- read_csv("vocab1.csv", show_col_types = FALSE)


[Note: according to R, my "v" list is 50,062 obs, but it downloads to a CSV as 56,997 obs, and when I trim the numbers and pull it back in, it's now at 52,762 obs. I don't know why this is?]

[Note2: after the crash, my "v" object is now only 49,896 obs, but I'll be pulling vocab1 in, which I created before the crash.]

[Note3: got my code back so I'm retrying things!]

Do I want to prune my vocab list? 

In looking at the terms that include the pattern "ethic", I'm going to prune all words that appear <3 times, which keeps the terms "ethical," "ethically," "ethic," and "ethics." The other terms look like either encoding errors or very specific one-offs. They are:

conethical
ethica
ethicality
forethical
how-to-make-field-experiments-more-ethical
publiclyethical
unethical

[The only one I'm really interested in is "unethical" - but that can be qualitative investigation at some point.]

### This code chunk has given me a fatal error twice so I'm going to comment it out and figure something else out

{r}

#vocab2 <- create_vocabulary(vocab1)

# Prunes vocabulary
#vocab2 <- prune_vocabulary(vocab2, term_count_min = 3)

# Check dimensions
#dim(vocab2)
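
The fatal error is most likely because create_vocabulary() expects an itoken iterator, not a data frame read back from a CSV. A minimal alternative sketch (not run), pruning the original v object directly, assuming it is still in memory:

{r, eval=FALSE}
# Prune the document-based vocabulary without the CSV round trip
v_pruned <- prune_vocabulary(v, term_count_min = 3)
dim(v_pruned)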

Steps for tomorrow:

- Pull the CSV back in
- Convert it to a text2vec vocabulary object
- Prune it
- Continue from there

5/7/2022

I’ve pulled my vocab1 back in. I believe I need to tokenize it, iterate over it, and then create a vocabulary list? I’m going to try this.

{r}
vocab1 <- as.character(vocab1)
texts2 <- tolower(vocab1)
corpus1 <- corpus(texts2)
tokens2 <- quanteda::tokens(corpus1, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
tokens2 <- tokens_select(tokens2, pattern = c(stopwords("en")), selection = "remove")
tokens2 <- tokens_select(tokens2, pattern = "[:alnum:]", selection = "remove", valuetype = c("regex"))


{r}
# Iterates over each token
it1 <- itoken(as.list(tokens2), ids = texts.df$ID, progressbar = FALSE)

# Prints iterator
it1

# Build the vocabulary
v1 <- create_vocabulary(it1)

# Print vocabulary
#v1
class(v1)

So this only brought me down to 49,170 observations? But I’m going to move forward for now!

Okay, I just went back and checked to make sure I’d actually saved the vocab1 CSV in Excel, and I’m still getting all these numbers. Is this an encoding issue? A quick search shows me I still have a lot of numbers. Back to scrolling and deleting for data cleanup.

Going to remove all the URLs. Removed all single characters, stray punctuation, etc. Since I’m already pruning in Excel, I’m going to remove all of my terms that appear 1 or 2 times.
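
For reference, the same cleanup could be sketched in R on the vocabulary data frame instead of Excel. Untested here, and it assumes text2vec's standard term / term_count columns:

{r, eval=FALSE}
library(dplyr)
library(stringr)

# Drop rare terms, number-containing terms, URL fragments, and single characters
vocab_clean <- v %>%
  filter(term_count >= 3) %>%
  filter(!str_detect(term, "[0-9]")) %>%
  filter(!str_detect(term, "http|www\\.")) %>%
  filter(nchar(term) > 1)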

Now I’m going to repeat the process.

{r}
setwd("~/DACCS R/Text as Data/Final Project")
#vocab2 <- read_csv("vocab2.csv", show_col_types = FALSE)
vocab2 <- as.character(vocab2)
texts3 <- tolower(vocab2)
corpus2 <- corpus(texts3)
tokens3 <- quanteda::tokens(corpus2, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
tokens3 <- quanteda::tokens_select(tokens3, pattern = c(stopwords("en")), selection = "remove")

#tokens2 <- tokens_select(tokens2, pattern = "[:alnum:]", selection = "remove", valuetype = c("regex")) - don't think I need this anymore?


{r}
# Iterates over each token
it2 <- itoken(as.list(tokens3), ids = texts.df$ID, progressbar = FALSE)

# Prints iterator
it2

# Build the vocabulary
v2 <- create_vocabulary(it2)

# Print vocabulary
#v2
class(v2)

I cannot figure out what’s going on here. I am absolutely pulling in the file where I’ve removed all my terms that appear 1-2 times, but they’re still showing up? I don’t get it. So I guess I’ll try to prune here?

{r}
dim(v2)

# Prunes vocabulary
#v2 <- prune_vocabulary(v2, term_count_min = 3)

# Check dimensions
#dim(v2)


Okay, that's a disaster, because now I'm down to literally one token? What the hell happened? I'm going to rerun my stuff and comment out the pruning.

OH, I just realized: I'm tokenizing a frequency list, so OF COURSE each item only appears once.

Okay, I'm going to start over AGAIN and see if I can figure this out.
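
One possible shape for the restart, sketched but not run: keep the vocabulary built from the document-level iterator, prune it in text2vec, and build the document-term matrix from there. This assumes the it iterator and v vocabulary from earlier (the iterator may need to be recreated with itoken() if it has already been consumed):

{r, eval=FALSE}
# Prune the document-based vocabulary, then vectorize for topic modeling
v_pruned   <- prune_vocabulary(v, term_count_min = 3)
vectorizer <- vocab_vectorizer(v_pruned)
dtm        <- create_dtm(it, vectorizer)  # documents x terms matrix for LDA
dim(dtm)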