The following code explores a preliminary sentiment analysis of semi-structured interview transcriptions. One use case for this method is supporting human exploration of semi-structured narrative text for evidence of psychological ambivalence. All data has been anonymized and de-identified for the purpose of this example.
# load packages: the tidyverse supplies dplyr, stringr, and ggplot2; tidytext supplies the text mining functions.
library(tidyverse)
library(tidytext)

# import stop word dictionary.
data(stop_words)

# modify stop word dictionary.
custom_stop_words <-
  bind_rows(tibble(word = c("hum", "yeah"),
                   lexicon = c("custom")),
            stop_words)

# import multiple files as a corpus, one data frame per .txt file.
temp <- list.files(pattern = "\\.txt$")
for (i in seq_along(temp))
  assign(temp[i], read.delim(temp[i]))

# clean text, format to select answer portion of text data.
ef_1007 <- `interview_1.txt` %>%
  filter(Q. == "A:") %>%
  rename(nar_text = starts_with("Tell.me")) %>%
  select(-Q.) %>%
  mutate(nar_text =
           str_replace_all(nar_text, "’", "'")) %>%
  mutate(nar_text =
           str_replace_all(nar_text, "\\d", ""))

# label answer numbers.
ans_num <- 1:101
ef_1007 <- cbind(ef_1007, ans_num)
ef_1007$id <- 1007

# tokenize answer data and remove stop words.
ef_1007_token <- ef_1007 %>%
  as_tibble %>%
  unnest_tokens(word, nar_text) %>%
  anti_join(custom_stop_words)

Text is now tokenized with a transcript ID and answer number linked to each token. We repeat this process with three other transcripts.
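Because the same steps are repeated for every transcript, they could be wrapped in a function. The following is only a sketch under stated assumptions: the name `tokenize_transcript()` and its arguments are hypothetical and not part of the original analysis, and answers are numbered with `row_number()` rather than a hard-coded range. The toy data frame stands in for a real transcript.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical helper (not from the original analysis): applies the same
# cleaning and tokenizing steps to one transcript.
#   raw           data frame with a speaker column `Q.` and one text column
#   text_prefix   start of the text column's name, e.g. "Tell.me"
#   transcript_id numeric ID to attach to every token
#   stop_tbl      stop word table with a `word` column
tokenize_transcript <- function(raw, text_prefix, transcript_id, stop_tbl) {
  raw %>%
    filter(Q. == "A:") %>%
    rename(nar_text = starts_with(text_prefix)) %>%
    select(-Q.) %>%
    mutate(
      nar_text = str_replace_all(nar_text, "’", "'"),
      nar_text = str_replace_all(nar_text, "\\d", ""),
      ans_num  = row_number(),   # number answers in order of appearance
      id       = transcript_id
    ) %>%
    as_tibble() %>%
    unnest_tokens(word, nar_text) %>%
    anti_join(stop_tbl, by = "word")
}

# Toy transcript, invented for illustration only.
toy_raw <- data.frame(
  Q. = c("Q:", "A:", "Q:", "A:"),
  Tell.me.about = c("Tell me more", "Yeah I felt happy",
                    "And then?", "Hum, it was hard for 2 years"),
  stringsAsFactors = FALSE
)
toy_stop <- tibble(word = c("hum", "yeah"))

toy_tokens <- tokenize_transcript(toy_raw, "Tell.me", 9999, toy_stop)
```

With a helper like this, each transcript would need a single call, and the answer count would no longer have to be typed in by hand.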
# repeat process with three other transcripts.
ef_1029 <- `interview_2.txt` %>%
  filter(Q. == "A:") %>%
  rename(nar_text = starts_with("So.I.")) %>%
  select(-Q.) %>%
  mutate(nar_text =
           str_replace_all(nar_text, "’", "'")) %>%
  mutate(nar_text =
           str_replace_all(nar_text, "\\d", ""))
ans_num <- 1:79
ef_1029 <- cbind(ef_1029, ans_num)
ef_1029$id <- 1029
ef_1029_token <- ef_1029 %>%
  as_tibble %>%
  unnest_tokens(word, nar_text) %>%
  anti_join(custom_stop_words)

ef_1031 <- `interview_3.txt` %>%
  filter(Q. == "A:") %>%
  rename(nar_text = starts_with("Can.you.")) %>%
  select(-Q.) %>%
  mutate(nar_text =
           str_replace_all(nar_text, "’", "'")) %>%
  mutate(nar_text =
           str_replace_all(nar_text, "\\d", ""))
ans_num <- 1:111
ef_1031 <- cbind(ef_1031, ans_num)
ef_1031$id <- 1031
ef_1031_token <- ef_1031 %>%
  as_tibble %>%
  unnest_tokens(word, nar_text) %>%
  anti_join(custom_stop_words)

ef_1040 <- `interview_4.txt` %>%
  filter(Q. == "A:") %>%
  rename(nar_text = starts_with("I.d.like")) %>%
  select(-Q.) %>%
  mutate(nar_text =
           str_replace_all(nar_text, "’", "'")) %>%
  mutate(nar_text =
           str_replace_all(nar_text, "\\d", ""))
ans_num <- 1:224
ef_1040 <- cbind(ef_1040, ans_num)
ef_1040$id <- 1040
ef_1040_token <- ef_1040 %>%
  as_tibble %>%
  unnest_tokens(word, nar_text) %>%
  anti_join(custom_stop_words)

# bind all tokenized data into a single data frame.
two_sents <-
  rbind(ef_1007_token, ef_1029_token,
        ef_1031_token, ef_1040_token)

# join tokens with "afinn" sentiment dictionary words.
two_sents_org <-
  two_sents %>%
  inner_join(get_sentiments("afinn")) %>%
  count(id, questions = ans_num, value)

The AFINN sentiment lexicon is a list of English terms, each with a valence score between -5 and 5. It was created by Finn Årup Nielsen and is distributed under the Open Database License (ODbL) v1.0.
# filter to a single interview.
int_one_eval <- two_sents_org %>%
  filter(id == 1007)

# view sentiment across answers of numbered questions.
ggplot(int_one_eval,
       aes(questions, value, fill = id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~id, ncol = 2, scales = "free_x")

Here we can evaluate an interviewee's responses as a whole, after filtering for tokens that are present in the AFINN sentiment lexicon. The construct of psychological ambivalence would be expressed here as question responses that contain both high and low values together.
If qualitative analysis and exploration of these specific questions interested us, we could follow up with a close reading of the question and interviewee response.
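One way to shortlist answers for a close reading is to flag those that contain both positive and negative scores. The sketch below assumes a table shaped like `two_sents_org` (columns `id`, `questions`, `value`, `n`). The `toy_scores` values are invented for illustration, and the rule "at least one score above zero and one below" is an assumption, not a validated measure of ambivalence.

```r
library(dplyr)

# Toy stand-in for two_sents_org; all values are made up.
toy_scores <- tibble(
  id        = 1007,
  questions = c(1, 1, 2, 3, 3),
  value     = c(3, -2, 2, -3, -1),
  n         = c(1, 2, 1, 1, 1)
)

# Keep answers whose scores span both sides of zero.
mixed_answers <- toy_scores %>%
  group_by(id, questions) %>%
  summarise(
    max_value = max(value),
    min_value = min(value),
    .groups   = "drop"
  ) %>%
  filter(max_value > 0, min_value < 0)
```

In the toy data only answer 1 carries both a positive and a negative score, so it is the only row kept.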
# plot all sentiment scores as columns.
ggplot(two_sents_org,
       aes(questions, value, fill = id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~id, ncol = 2, scales = "free_x")

Looking at several interviews with this method is helpful, but it also highlights limitations and suggests next steps.
We notice at once the different length of each interview. A more formal analysis would need to align questions to answers in a data cleaning or post-hoc wrangling process, allowing for aggregate comparison across individual responses to specific questions.
As we are observing open-ended narrative text, the number of words in a participant's response may be related to the intensity of the ambivalence score. Developing a scaled value, something like afinn_words_per_answer / tokens_per_answer, would help standardize ambivalence as we hope to measure the construct.
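A minimal sketch of that ratio follows. The tokens and the three-word `toy_lexicon` are invented stand-ins (not real AFINN entries), and `sentiment_density` is a hypothetical name. Whether the denominator should count tokens before or after stop word removal is left open here.

```r
library(dplyr)

# Toy tokens and a toy lexicon, both made up for illustration.
toy_tokens <- tibble(
  ans_num = c(1, 1, 1, 1, 2, 2, 2),
  word    = c("love", "job", "hate", "commute", "fine", "today", "thanks")
)
toy_lexicon <- tibble(
  word  = c("love", "hate", "fine"),
  value = c(3, -3, 2)
)

answer_density <- toy_tokens %>%
  left_join(toy_lexicon, by = "word") %>%   # unmatched words get NA
  group_by(ans_num) %>%
  summarise(
    tokens_per_answer      = n(),
    afinn_words_per_answer = sum(!is.na(value)),
    sentiment_density      = afinn_words_per_answer / tokens_per_answer,
    .groups = "drop"
  )
```

Here answer 1 has two scored words out of four tokens and answer 2 has one out of three, so the longer answer is not favored simply for being longer.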
The tokens we have used are single words. N-gram tokens of two or more words could help us put words in context and account for negation: “I’m hungry” vs. “I’m not hungry”, for example.
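As a rough sketch, bigrams can be produced with `unnest_tokens(token = "ngrams", n = 2)` and then split into two columns so that a sentiment word preceded by a negation can be found. The list of negation words and the toy sentences are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two toy answers, invented for illustration.
toy_text <- tibble(
  ans_num  = 1:2,
  nar_text = c("I'm hungry", "I'm not hungry")
)

# Assumed, non-exhaustive list of negation words.
negation_words <- c("not", "no", "never", "without")

negated <- toy_text %>%
  unnest_tokens(bigram, nar_text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% negation_words)
```

The second word of each remaining bigram could then be joined to a sentiment lexicon and its score reversed or set aside, depending on how negation should be treated.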
Finally, there is the choice of sentiment dictionary. The AFINN lexicon is useful for quantifying sentiment on a numeric scale, but exploring and comparing results across different sentiment dictionaries may add depth to the analysis and give a more accurate expression of the construct we seek to evaluate.
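For comparison, the same tokens could be joined to a categorical lexicon. The sketch below uses the "bing" lexicon through `get_sentiments("bing")`, which labels words as positive or negative rather than scoring them. The toy tokens are invented, and no particular result is implied for the interview data.

```r
library(dplyr)
library(tidytext)

# Toy tokens, made up for illustration.
toy_tokens <- tibble(
  ans_num = c(1, 1, 2),
  word    = c("happy", "sad", "table")
)

# Count positive and negative words per answer under the bing lexicon.
bing_counts <- toy_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(ans_num, sentiment)
```

An answer with both positive and negative counts under this lexicon could be compared with the answers that show mixed AFINN values, to see how far the two dictionaries agree.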
Questions, comments, concerns? Feel free to contact me: avery.richards@berkeley.edu. Any sort of feedback is welcome!