Setup

If the tidyverse is not yet installed on your computer (that is, you haven’t used it in R before), you need to install it before loading the library. You only need to do this once per computer, unless you want to update an installed package. The library() function, by contrast, must be called whenever you start from a clean R session.

install.packages("tidyverse")
library(tidyverse)

Ditto for the tidytext package:

install.packages("tidytext")
library(tidytext)
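
If the script is run repeatedly, the install step can be skipped when a package is already present. A minimal sketch, assuming a hypothetical helper called install_if_missing (it is not part of any package):

```r
# Hypothetical helper: install a package only if it is not yet available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(pkg)
}

# Check both packages used in this walkthrough
for (pkg in c("tidyverse", "tidytext")) {
  install_if_missing(pkg)
}
```

The library() calls are still needed in every fresh session; the helper only avoids reinstalling.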

Data import

We read the data with read_csv (from the readr package, part of the tidyverse) rather than base R’s read.csv, so that the result is a tibble.

text_df <- read_csv("survey-v1.csv")

The columns are named “Respondent” and Q1 to Q9.

Later we’ll have to do some data cleaning, but let’s do some analysis first to get a feeling for what that means.

Text analysis for Q1

To get started, we first look at one question only: the answers to Q1. We do so by creating a data frame (df) just for Q1. This is not how one would do it in general; later we’ll perform the analysis for the answers to all questions, held in one data frame with a tidy structure.

Word frequencies

Select all Q1 answers:

text_q1 <- select(text_df, Respondent, Q1)

(Note: at a minimum, to make the analysis more general, we would use question-neutral variable names here, such as text rather than text_q1; that would make copying and modifying the code easier.)

Get all the words from the answers into a df along with the respondent’s ID. The result is still a tibble:

words_q1 <- text_q1 %>%
  unnest_tokens(word, Q1)
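
To see what unnest_tokens does, it can help to run it on a tiny example. The sketch below uses made-up answers (not taken from the survey file), and the name toy is purely illustrative:

```r
library(tibble)
library(dplyr)
library(tidytext)

# Made-up answers, for illustration only
toy <- tibble(
  Respondent = c(1, 2),
  Q1 = c("I like the course", "Great course!")
)

# One row per word; the Respondent column is carried along
toy_words <- toy %>%
  unnest_tokens(word, Q1)

toy_words
```

By default the words are converted to lower case and punctuation is dropped, so "Great course!" yields the two rows "great" and "course".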

Let’s look at the most frequent words:

words_q1 %>%
  count(word, sort = TRUE) %>%
  top_n(10)
Selecting by n

Clearly, these include the typical stop words that we don’t want in the counts. Let’s remove them. Tidytext provides a data frame of stop words for us:

data(stop_words)

We can View(stop_words) to see them, and add our own, of course. Here are the first six stop words:

head(stop_words)
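
Adding our own stop words amounts to appending rows with the same columns (word and lexicon). A small sketch, where the extra words and the lexicon label "custom" are assumptions chosen for illustration:

```r
library(tibble)
library(dplyr)
library(tidytext)

data(stop_words)

# Hypothetical domain-specific words we might not want to count
my_stop_words <- tibble(
  word = c("course", "module"),
  lexicon = "custom"
)

# Append them to the built-in list
all_stop_words <- bind_rows(stop_words, my_stop_words)
tail(all_stop_words)
```

all_stop_words could then be used in place of stop_words in the anti_join steps.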

Note that it is important that stop_words has a column ‘word’, just as words_q1 does. To remove the stop words from the Q1 answers, do this:

words_q1 <- anti_join(words_q1, stop_words)

This removes all rows in words_q1 that are matched by a stop word.
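
The join column can also be stated explicitly, which documents the intent and avoids the "Joining, by" message. A sketch on made-up words (toy_words is an illustrative name):

```r
library(tibble)
library(dplyr)
library(tidytext)

data(stop_words)

# Made-up tokens, for illustration only
toy_words <- tibble(
  Respondent = c(1, 1, 1, 2),
  word = c("the", "seminar", "was", "helpful")
)

# Drop every row whose word appears in the stop word list
toy_kept <- toy_words %>%
  anti_join(stop_words, by = "word")

toy_kept
```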

Now the frequencies are more informative (top ten only are shown):

words_q1 %>%
  count(word, sort = TRUE) %>%
  top_n(10)
Selecting by n

A bit of sentiment analysis

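Tidytext comes with three lexicons that contain sentiment words, more than 27,000 words overall. (This describes the version used here; in more recent versions of tidytext, the sentiments data set holds only the bing lexicon, and get_sentiments() obtains the others, such as nrc, through the textdata package.)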

head(sentiments)

Let’s use the nrc lexicon and look for joyful words in Q1. First, get the joyful words into a df:

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

Then look for them in answers to Q1:

words_q1 %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
Joining, by = "word"

So there, a bit of joy! You can perform additional (sentiment) analyses by following the examples in this book; see in particular the chapter on Twitter analysis, where the data structure is similar to that of a survey. Before that, however, it is a good idea to change our data structure so that all the survey answers are in a tidy format.

Tidying the survey data

While it is sensible to analyse the answers separately for each question, it is less than elegant to do this based on a data frame that has just one question in it, as we have done so far. It would mean copying the analysis 8 times and replacing all

To get the data into tidy format, we need to ‘lengthen’ the original data: we gather the question responses into one column instead of leaving them spread over 9 columns. For this we need a new column, question. The gather function is part of the tidyr package, which we loaded along with the other tidyverse packages above.

text_df <- text_df %>%
  gather(Q1:Q9, key = "question", value = "words")

We now have a table with 207 rows in three columns: respondent ID, question number, and words (the response).

head(text_df)
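
In tidyr 1.0.0 and later, gather is superseded by pivot_longer, which does the same lengthening. A sketch on made-up wide data with only two questions (toy_wide and its contents are assumptions for illustration):

```r
library(tibble)
library(dplyr)
library(tidyr)

# Made-up wide survey data, for illustration only
toy_wide <- tibble(
  Respondent = c(1, 2),
  Q1 = c("Good pace", "Too fast"),
  Q2 = c("More examples", "Nothing")
)

# One row per respondent and question
toy_long <- toy_wide %>%
  pivot_longer(cols = Q1:Q2, names_to = "question", values_to = "words")

toy_long
```

Note that pivot_longer orders the rows by respondent, whereas gather stacks the question columns one after another; the content is the same.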

Text mining on the tidy data set

Let’s get to the words:

words <- text_df %>%
  unnest_tokens(word, words)

and remove stopwords:

words <- anti_join(words, stop_words)
head(words)

We may also want to remove numbers and special characters. First we need a tibble to store these kinds of stop symbols:

stop_symbols <- tibble(word = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "0", "#", "(", ")", "."))

Since the survey data is in one-word-per-row format, we can remove stop symbols with an anti_join (from dplyr):

words <- words %>%
  anti_join(stop_symbols)

The data are now cleaner, but not shown here.
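
An explicit list only catches the tokens it names, so a multi-digit number such as "2019" would survive. As an alternative, a pattern-based filter can drop every token that consists solely of digits and punctuation. A sketch on made-up tokens (toy_words is an illustrative name):

```r
library(tibble)
library(dplyr)

# Made-up tokens, for illustration only
toy_words <- tibble(word = c("lecture", "2", "2019", "slides", "10"))

# Keep only tokens that are not made up entirely of digits and punctuation
toy_clean <- toy_words %>%
  filter(!grepl("^[[:digit:][:punct:]]+$", word))

toy_clean
```

Tokens that merely contain a digit, such as "q1", are kept by this pattern.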

This is also the point where we could do some stemming. For this we need this library:

library(SnowballC)

This gives us the function wordStem:

words_stemmed <- mutate(words, word = wordStem(word))
head(words_stemmed, n=20L)

But we don’t want to use the stemmed words just now because that would interfere with the sentiment analysis. Stemming is useful, but only once we no longer need “real” words, such as for topic modelling and the like.

Word frequencies

Now we can calculate word frequencies for each person. First, we group by person and count how many times each person used each word. Then we use left_join() to add a column with the total number of words used by each person. Finally, we calculate a frequency for each person and word. (Whether this is really meaningful across questions, as done here, is debatable, but it is mainly a demonstration. See below for filtering and grouping.)

frequency <- words %>% 
  group_by(Respondent) %>% 
  count(word, sort = TRUE) %>% 
  left_join(words %>% 
              group_by(Respondent) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
Joining, by = "Respondent"
head(frequency)
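
The same per-person frequencies can be computed without a join, by counting first and then summing the counts within each respondent. A sketch on made-up data (toy_words is an illustrative name):

```r
library(tibble)
library(dplyr)

# Made-up tokens, for illustration only
toy_words <- tibble(
  Respondent = c(1, 1, 1, 2, 2),
  word = c("slides", "slides", "pace", "pace", "exam")
)

toy_frequency <- toy_words %>%
  count(Respondent, word, sort = TRUE) %>%   # n = uses of each word per person
  group_by(Respondent) %>%
  mutate(total = sum(n),                     # all words used by that person
         freq = n / total) %>%
  ungroup()

toy_frequency
```

Within each respondent the freq values add up to 1, which is a quick sanity check on the result.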

Sentiment analysis

Sentiments across all questions

To repeat from the very first example: if we are interested in joyfulness, we can get joy words from, for instance, the nrc lexicon:

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

And then look for them in the answers to Q1 through Q9:

words %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  top_n(10)
Joining, by = "word"
Selecting by n

Sentiment analysis on one question

Just for Q1:

words %>%
  filter(question == "Q1") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
Joining, by = "word"

Note that question IDs are now row values, so we need to filter (rows) rather than select (variables).

Sentiment for each question

words %>%
  group_by(question) %>%
  inner_join(nrc_joy) %>%
  count(word, sort = FALSE)
Joining, by = "word"

We turned off sorting by count here because we want the results grouped by question ID. With sort = TRUE, the rows would be ordered by count first, so the questions would no longer appear together.
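
If the words should also be ranked within each question, one option is to count by question and word and then sort explicitly. The sketch below uses made-up tokens and a tiny stand-in lexicon (toy_joy) instead of the nrc lexicon:

```r
library(tibble)
library(dplyr)

# Made-up tokens and a stand-in "joy" lexicon, for illustration only
toy_words <- tibble(
  question = c("Q2", "Q1", "Q1", "Q2", "Q2", "Q1"),
  word = c("fun", "happy", "happy", "fun", "enjoy", "fun")
)
toy_joy <- tibble(word = c("happy", "fun", "enjoy"))

toy_counts <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%
  count(question, word) %>%
  arrange(question, desc(n))   # questions together, most frequent word first

toy_counts
```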

Sentiment analysis on a sub-set of all questions

Let’s pick Q1, Q3 and Q7:

words %>%
  filter(question == "Q1" | question == "Q3" | question == "Q7") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
Joining, by = "word"
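
With more than two or three questions, chaining comparisons with | becomes unwieldy. The %in% operator expresses the same filter more compactly. A sketch on made-up data (toy_words and selected are illustrative names):

```r
library(tibble)
library(dplyr)

# Made-up tokens, for illustration only
toy_words <- tibble(
  question = c("Q1", "Q2", "Q3", "Q7", "Q9"),
  word = c("happy", "fun", "enjoy", "glad", "joy")
)

# The questions we want to keep
selected <- c("Q1", "Q3", "Q7")

toy_subset <- toy_words %>%
  filter(question %in% selected)

toy_subset
```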

Most common positive and negative sentiment words

Using the bing lexicon, we can find out how much each word contributes to each sentiment:

bing_word_counts <- words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
Joining, by = "word"
head(bing_word_counts)

Let’s graph this with ggplot2:

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
Selecting by n

The top_n(10) call does not seem to limit the plot to ten words per sentiment, but let’s worry about that later.
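
One possible explanation, which is an assumption since the counts are not shown here, is that top_n() keeps all tied rows: with many words sharing the same small count, more than ten rows per sentiment come back. In dplyr 1.0.0 and later, slice_max() with with_ties = FALSE returns an exact number of rows. A sketch on made-up counts:

```r
library(tibble)
library(dplyr)

# Made-up counts with ties for second place, for illustration only
toy_counts <- tibble(
  word = c("good", "great", "nice", "bad", "poor", "slow"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative"),
  n = c(5, 2, 2, 4, 1, 1)
)

# top_n(2) keeps the tied rows, so each sentiment retains three words
with_ties <- toy_counts %>%
  group_by(sentiment) %>%
  top_n(2, n) %>%
  ungroup()

# slice_max() without ties returns exactly two words per sentiment
no_ties <- toy_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)
nrow(no_ties)
```

Which of the tied words is dropped is arbitrary, so this trades completeness for a fixed number of bars in the plot.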

There is obviously much more one can do with sentiment analysis; see the tidytext book, among others.

