Introduction
The main focus of this research project is to analyze open-ended survey responses in an automated content analysis context and to automatically identify the underlying topic composition of open-ended (OE) responses in surveys. The study focuses on natural language processing (NLP) techniques suitable for analyzing OE survey responses. A review of the existing literature shows a variety of methods for topic categorization, including content analysis of OE responses as well as work in other disciplines.
Previous literature has tackled text analysis with multiple techniques, which are reviewed in this section. Open-ended questions have long been used in surveys to gather qualitative responses. Gendall, Menelaou and Brennan (1996) analyzed data from a mail survey to test two hypotheses about responses to open-ended questions. Their research suggests that researchers need to decide on their objective for using an open-ended question and use question cues to achieve it. Their work underscores the importance of how open-ended questions are framed and the qualitative data they provide. OE responses complement quantitative data, which makes them significant to analyze. Modern techniques for analyzing text data have been developed over the years.
Aggarwal and Zhai (2012) provided an overview of different methods and algorithms, with a particular focus on mining unstructured text. Mining is the initial step in statistically organizing and categorizing text data into a structured format. They further propose analyzing text data at different levels of representation (p. 3), such as treating unstructured text as a bag of words or as a string of words; whichever approach is taken, it has to preserve enough semantic meaning for the analysis.
The categorization of such raw text is the task of topic modeling. Bicalho, Pita, Pedrosa, Lacerda and Pappa (2017) proposed a general framework for topic modeling of short text that creates larger pseudo-document representations from the original documents, considering two methods: word co-occurrence and distributed word vector representations. In experiments on seven datasets, comparing against other methods that extract topics by generating pseudo-documents or by modifying topic models for short text, their approach showed significant improvements in normalized pointwise mutual information. A classification task was also used to evaluate the quality of the topics in terms of document representation. Much of the topic modeling and text analysis literature builds on Latent Dirichlet Allocation (LDA).
Nguyen, Billingsley, Du and Johnson (2015) extended two Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learned on a smaller corpus. Their results indicated that models using information from the external corpora produced significant improvements in topic coherence, document clustering, and document classification, especially on datasets with few or short documents. Similarly, Wei and Croft (2006) studied how to use LDA efficiently to improve ad-hoc retrieval, showing improvements over retrieval with cluster-based models at reasonable computational cost. Hall, Jurafsky and Manning (2008) applied unsupervised topic modeling with LDA to identify topic clusters in the field of Computational Linguistics and examined the strength of each topic over time.
Closer to this project, other studies have analyzed OE responses in an automated content analysis context. Simon and Xenos (2004) presented a method for using dimensionality reduction in the analysis of political content based on latent semantic analysis (LSA). They suggest that factor analysis of word frequencies generated from text such as OE survey responses provides adequate content analysis categories and can substitute for more commonly practiced techniques. The authors analyzed responses collected during an experiment on the topic of partial-birth abortion and compared the factor analysis against human coding of the same material.
Furthermore, ten Kleij and Musters (2003) applied correspondence analysis to OE survey responses to construct a visualization comparable to a preference map of consumer preferences for products; their results indicate agreement between the correspondence map and the preference map. Jackson and Trochim (2002) presented concept mapping of open-ended survey responses as an alternative to code-based and word-based text analysis techniques. Pietsch and Lessmann (2018) analyzed OE survey responses by testing three short text topic models: Latent Feature Latent Dirichlet Allocation, the Biterm Topic Model and the Word Network Topic Model. Their results indicate that topic models are a viable alternative to human coding of OE survey responses.
Hypothesis / Problem Statement
The OE responses come from a political message testing study conducted by Mercury Analytics, a private marketing and polling research firm. The raw survey data includes an OE text response to the question of interest, supplemented with demographic information that will not be used in this research study. Respondents were shown an essay before being asked the open-ended question.
The final dataset includes open-ended responses to the essay follow-up question “If someone asked your opinion about the Op-Ed you just read, what would you say to them? What is your overall reaction?”, which gauges respondents’ reaction to the text they viewed prior to answering.
Categorization of such open-ended responses is typically done manually by humans, who formulate categories a priori or after reviewing each response individually. This research examines the effectiveness of categorizing open-ended survey responses using topic modeling.
Statistical Analysis Plan
This research paper implements Latent Dirichlet Allocation (LDA), based on the underlying assumptions that similar topics use similar words and that documents discuss several topics whose statistical distribution can be estimated. The purpose of utilizing LDA is to map each document in our corpus of OE responses to a set of topics that accounts for a significant share of the words in the document.
There is an underlying assumption that the documents (responses) are written with arrangements of words and that those arrangements determine topics. LDA ignores syntactic information and treats documents as bags of words. Additionally, it assumes that every word in a document can be assigned a probability of belonging to a topic. The main goal of LDA is to determine the mixture of topics that a document contains.
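For reference, the generative process that LDA assumes can be written compactly (standard notation, not specific to this dataset): each topic $k$ draws a word distribution $\phi_k \sim \mathrm{Dirichlet}(\beta)$; each document $d$ draws a topic mixture $\theta_d \sim \mathrm{Dirichlet}(\alpha)$; and each word position $n$ in document $d$ first draws a topic $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$ and then a word $w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})$. Fitting the model inverts this process, recovering $\theta_d$ (the topic mixture of each response) and $\phi_k$ (the word distribution of each topic) from the observed words alone.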
Method - Data - Variables
Participants in this study are respondents to an online survey fielded to a nationally representative sample of US adults (18 years of age or older). The survey was fielded to a total of 731 respondents, 518 of whom successfully completed it.
For the purposes of this research study, the analysis uses the open-ended question from the online survey that asks for respondents’ opinion of the essay. The open-ended response was recorded for each respondent in the variable “MOST_IMPORTANT”.
This study proposes to implement automated, topic-model-based coding of the open-ended responses in the “MOST_IMPORTANT” variable.
Statistical Analysis Results
Pre-load all essential R package libraries.
library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)
library(stringr)
library(dplyr)
library(tidyr)
library(lsa)
library(LSAfun)
library(ngram)
library(reticulate)
py_module_available("gensim")
py_module_available("pyLDAvis")
Load the Python libraries or functions.
##python chunk
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
from gensim import corpora
from gensim.models import LsiModel
from gensim.models import ldamodel
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt
Read the data file.
A data frame of the responses will be used to create a corpus.
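A minimal sketch of the import step is shown below, assuming the responses were exported to a CSV file; the file name survey_data.csv is a placeholder for the actual export.
##r chunk
#read the survey export into a data frame (file name is an assumption)
surveydata = read.csv("survey_data.csv", stringsAsFactors = FALSE)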
##r chunk
import_corpus = Corpus(VectorSource(surveydata$MOST_IMPORTANT))
The text in the responses will be pre-processed for the Latent Dirichlet Allocation (LDA) analysis and used to create the Document Term Matrix.
##r chunk
import_mat =
  DocumentTermMatrix(import_corpus,
                     control = list(stemming = TRUE, #create root words
                                    stopwords = TRUE, #remove stop words
                                    minWordLength = 5, #cut out small words
                                    removeNumbers = TRUE, #take out the numbers
                                    removePunctuation = TRUE)) #take out punctuation
This matrix will be weighted with a tf-idf scheme so that uninformative, very frequent terms (and empty documents) can be removed.
##r chunk
#weight the space
import_weight = tapply(import_mat$v/row_sums(import_mat)[import_mat$i],
                       import_mat$j,
                       mean) *
  log2(nDocs(import_mat)/col_sums(import_mat > 0))
#ignore very frequent and 0 terms
import_mat = import_mat[ , import_weight >= .1]
import_mat = import_mat[ row_sums(import_mat) > 0, ]
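The weight computed above is a mean tf-idf score per term. For term $j$, with $D_j$ the set of documents containing $j$, $\mathrm{tf}_{ij}$ the count of term $j$ in document $i$, and $N$ the number of documents:
$$\mathrm{weight}_j = \Big(\frac{1}{|D_j|}\sum_{i \in D_j}\frac{\mathrm{tf}_{ij}}{\sum_{j'}\mathrm{tf}_{ij'}}\Big)\cdot\log_2\frac{N}{|D_j|}$$
Terms that appear in many documents receive a small $\log_2(N/|D_j|)$ factor and fall below the 0.1 cutoff, which is how the very frequent, uninformative words are dropped.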
Lastly, the LDA topic model will be fit to estimate the topics identified by the algorithm. Run an LDA fit model.
##r chunk
k = 3 #set the number of topics
SEED = 5451 #set a random number
LDA_fit = LDA(import_mat, k = k,
              control = list(seed = SEED))
topics(LDA_fit, k)
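The topics() call above returns only the most likely topic assignments. To see how strongly each response loads on each topic, the full posterior document-topic distribution can be inspected; a minimal sketch using the posterior() function from topicmodels:
##r chunk
#posterior document-topic probabilities (one row per response, one column per topic)
doc_topics = posterior(LDA_fit)$topics
head(doc_topics)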
Create a plot of the top ten terms for each topic.
##r chunk
terms(LDA_fit,10)
## Topic 1 Topic 2 Topic 3
## [1,] "think" "trump" "presid"
## [2,] "presid" "presid" "believ"
## [3,] "someon" "dont" "trump"
## [4,] "good" "agre" "true"
## [5,] "like" "peopl" "time"
## [6,] "trump" "white" "person"
## [7,] "say" "hous" "well"
## [8,] "just" "know" "written"
## [9,] "peopl" "need" "elect"
## [10,] "right" "countri" "peopl"
##r chunk
#use tidyverse to clean up the fit
LDA_fit_topics = tidy(LDA_fit, matrix = "beta")
#create the top ten terms for each topic
top_terms = LDA_fit_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
#a clean ggplot theme
cleanup = theme(panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                panel.background = element_blank(),
                axis.line.x = element_line(color = "black"),
                axis.line.y = element_line(color = "black"),
                legend.key = element_rect(fill = "white"),
                text = element_text(size = 10))
#make the plot
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  cleanup +
  coord_flip()

Gensim Modeling in Python.
Transfer the variable MOST_IMPORTANT to Python and convert it to a list for processing.
##python chunk
res = list(filter(None, r.surveydata["MOST_IMPORTANT"]))
res[0]
## 'This is a piece of garbage.'
Certainly very colorful and interesting responses…but I digress.
Process the text using Python.
##python chunk
##create a spot to save the processed text
processed_text = []
##loop through each item in the list
for answer in res:
    #lower case
    answer = answer.lower()
    #create tokens
    answer = nltk.word_tokenize(answer)
    #take out stop words
    answer = [word for word in answer if word not in stopwords.words('english')]
    #stem the words
    answer = [ps.stem(word = word) for word in answer]
    #add it to our list
    processed_text.append(answer)
processed_text[0]
## ['piec', 'garbag', '.']
Create the dictionary and term document matrix in Python.
##python chunk
#create a dictionary of the words
dictionary = corpora.Dictionary(processed_text)
#create a TDM
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_text]
Create the LDA topic model in Python using the same number of topics picked for the R LDA fit model.
##python chunk
lda_model = LdaModel(corpus = doc_term_matrix,
                     id2word = dictionary,
                     num_topics = 3,
                     random_state = 100,
                     update_every = 1,
                     chunksize = 100,
                     passes = 10,
                     alpha = 'auto',
                     per_word_topics = True)
print(lda_model.print_topics())
## [(0, '0.018*"trump" + 0.013*"countri" + 0.013*"america" + 0.012*"treason" + 0.012*"state" + 0.010*"articl" + 0.009*"destroy" + 0.009*"insid" + 0.009*"fire" + 0.008*"worst"'), (1, '0.055*"." + 0.023*"person" + 0.022*"someon" + 0.017*"," + 0.017*"say" + 0.016*"written" + 0.015*"would" + 0.013*"!" + 0.013*"agre" + 0.012*"n\'t"'), (2, '0.107*"." + 0.043*"," + 0.031*"presid" + 0.016*"peopl" + 0.015*"trump" + 0.012*"good" + 0.011*"n\'t" + 0.010*"believ" + 0.010*"know" + 0.010*"think"')]
Interpretation and Discussion
Looking at the beta values across all the OE responses, topics 1, 2, and 3 are fairly uniformly distributed across the responses.
Two words appear frequently in all three topics: “president” and “trump”.
The dominant terms in topic 1 indicate support for President Trump after reading the essay. Topic 2 is similar, with more words expressing positive affirmation toward the question asked; it also includes entities besides President Trump, such as the White House, and general terms such as “country”. The third topic has a theme of general praise for the essay itself, describing it as well written.
The results from this study will be used to aid human coding of OE responses. This research also contributes to automated coding of OE responses in the survey research domain.
References
Gendall, P., Menelaou, H., & Brennan, M. (1996). Open-ended questions: Some implications for mail survey research. Marketing Bulletin, 7, 1-8.
Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Springer Science & Business Media.
Bicalho, P., Pita, M., Pedrosa, G., Lacerda, A., & Pappa, G. L. (2017). A general framework to expand short text for topic modeling. Information Sciences, 393, 66-81.
Nguyen, D. Q., Billingsley, R., Du, L., & Johnson, M. (2015). Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3, 299-313.
Wei, X., & Croft, W. B. (2006, August). LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 178-185).
Hall, D., Jurafsky, D., & Manning, C. D. (2008, October). Studying the history of ideas using topic models. In Proceedings of the 2008 conference on empirical methods in natural language processing (pp. 363-371).
Simon, A. F., & Xenos, M. (2004). Dimensional reduction of word-frequency data as a substitute for intersubjective content analysis. Political Analysis, 12(1), 63-75.
ten Kleij, F., & Musters, P. A. (2003). Text analysis of open-ended survey responses: A complementary method to preference mapping. Food Quality and Preference, 14(1), 43-52.
Jackson, K. M., & Trochim, W. M. (2002). Concept mapping as an alternative approach for the analysis of open-ended survey responses. Organizational Research Methods, 5(4), 307-336.
Pietsch, A. S., & Lessmann, S. (2018). Topic modeling for analyzing open-ended survey responses. Journal of Business Analytics, 1(2), 93-116.