text <- read.csv("C:/Users/tijeo/OneDrive/Documents/M731/Electricity_OpenEnds.csv")
# head(text )
txt <- text[, 4 ]
# head(txt)
# str(txt)

Marketing research questionnaires often contain one or more open-ended questions that ask respondents to answer in their own words; these responses are called verbatim answers. Reading those detailed verbatims can provide good insights into customers’ thoughts, but that is usually not the preferred method for using that information. Usually those verbatims are disassembled into sentences, phrases and words, with code numbers assigned, and tabulated in lists, tables and sometimes graphs. Regardless of the mode of survey administration, coding and analyzing those open-ends can be expensive, time-consuming and error-prone. Traditionally, that coding has been done manually.

Developments in machine learning and artificial intelligence have made it possible to mechanize the analysis of verbatim answers fully or partially. Those software developments have emerged from the area of Natural Language Processing (NLP) and linguistic research. While coding questionnaires manually is one of the more basic functions of marketing research, NLP has evolved into a highly sophisticated area of research and practical applications.

1 Studies of electricity pricing

Professors Ken Deal and Dean Mountain of the DeGroote School of Business have conducted many studies of the pricing of electricity. Several of those projects were conducted as part of large electricity pricing pilots. An electricity pilot project is basically an experiment where a variety of stimuli are offered to electricity customers for a period of time, usually a year or so.

The projects mentioned in this document provided different electricity pricing plans to selected customers according to an experimental design and asked that they manage their electricity usage as best as possible for one year. These pricing plans included time-of-use (TOU) pricing, variable peak pricing (VPP) and the current regulated pricing plan. One variation of a time-of-use pricing plan that was part of a question in the survey is shown in Figure 1.

Figure 1: A typical time-of-use pricing plan

knitr::include_graphics("http://bus-sawtooth.mcmaster.ca/M731/OEB_TOU.jpg")

Surveys were conducted prior to the pilot, during and afterwards. The purpose of those surveys was to obtain customer preferences to better design the pilot pricing plans and to understand how customers behaved during the pilot and what they liked and disliked about the pilot, the pricing plans and electricity pricing.

2 Natural language processing (NLP) of verbatim comments

During a survey that investigated electricity customers’ perceptions and preferences for electricity pricing, subjects were asked for their verbatim comments about electricity pricing. Those verbatim qualitative raw responses are the “data” that will be analyzed in this document.

The first 10 rows of those verbatim comments are shown below. Of course, while all of those comments for 1,203 respondents could be printed out for the client, most will want that information summarized. This is the raw material that will be processed by NLP methods.

text[1:10, 4]

 [1] "ease of simplicity"                                                                                                                                                                                  
 [2] "For me, I'm retired,so the change would not affect me that much,for family of 4or5,it maybe a probleme,morning rush,supper meham......It would not work..for them.......Thanks "                     
 [3] "?"                                                                                                                                                                                                   
 [4] "Seemed to  be the most static for determining what I pay. Also buy energy efficient qppliances for my home and realize savings this way includinga heat pump, etc.."                                 
 [5] "allows me to plan usage to save money"                                                                                                                                                               
 [6] "less complicated"                                                                                                                                                                                    
 [7] "it seems like the best rate"                                                                                                                                                                         
 [8] "because im not usually active in my home during these times."                                                                                                                                        
 [9] "it is more practical"                                                                                                                                                                                
[10] "Peak periods are morning and after work.  I can consciously try to use less energy during the evening period.  This is about conserving energy and getting the best price for your customer, right ?"

3 Insights

In Figure 2, we see that the more prominent words in the word cloud include “peak”, “time”, “period”, “plan”, “hour” and a few others. This word cloud shows the relative frequency of words that appeared in the respondents’ answers. The aforementioned words were therefore common in respondents’ answers when describing their views on the pricing of electricity. These words are expected, as peak times and periods are when electricity is most expensive. Figure 3 is a bar chart that gives more information on how often these words were mentioned: peak, time, period, plan and hour were mentioned about 249, 180, 120, 70 and 63 times, respectively.
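The idea of relative word frequency behind the word cloud can be illustrated with a minimal base R sketch. The three comments below are hypothetical stand-ins, not survey data, and the simple split on non-letters is only an assumption for illustration; the analysis in the Appendix uses udpipe lemmas instead.

```r
# Hypothetical comments, invented for illustration (not taken from the survey)
comments <- c("peak time is expensive",
              "I avoid peak time",
              "off peak plan saves money")

# Lower-case, split on anything that is not a letter, and drop short words
tokens <- unlist(strsplit(tolower(comments), "[^a-z]+"))
tokens <- tokens[nchar(tokens) >= 3]

# Absolute and relative frequencies, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
rel_freq <- round(freq / sum(freq), 2)

print(freq)
print(rel_freq)
```

A word cloud simply scales each word's size by these counts, so the most frequent word in this toy example, "peak", would be drawn largest.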

Figures 4 and 5 give further insight into the relative frequency of phrases in respondents’ answers. The phrases “peak period”, “peak time”, “peak hour”, “standard plan” and “use electricity”, in order of frequency, are the most common phrases in respondents’ answers. Because the top three all refer to peaks, we can infer that peak periods dominate how consumers view the pricing of electricity.

Figures 6 and 7 show the top few phrases that consumers used in describing their attitudes towards electricity pricing. From these figures, we can see that consumers are most concerned or vocal about the peak period, time or hour, and the pricing surrounding these peaks.

In conclusion, and based on respondents’ comments on electricity pricing, it can be deduced that consumers are aware of the difference in pricing during peak periods and that some try to time their energy use according to these periods. However, the terms “on-peak”, “mid-peak” and “off-peak” rarely appear in their comments. This may indicate that although consumers are aware of the existence of peak periods, they may not be aware of the difference between on-peak, mid-peak and off-peak periods and how energy consumption during these periods directly affects their electricity bills. More research and analysis, especially of closed-ended survey questions, will be needed to obtain a holistic view of consumers’ attitudes towards electricity pricing.

4 NLP Research: The Use of NLP for Marketing Research

According to Leeson et al. (2019), Natural Language Processing (NLP) is a machine learning technique from computer science that uses algorithms to analyze textual data. Looking into the history of NLP, the first market research report that referred to NLP technology was published in 1985 by Ovum (Dale, 2017). However, due to the widespread interest in Artificial Intelligence in the last 7 years, the commercialisation of NLP has become more noticeable. More specifically, in verbatim text coding, the verbatim answers to open-ended questions are usually classified and coded by human coders, even though some reports exist regarding computer-assisted coding systems for qualitative research (Leung & Yeh, 2022). Leung and Yeh (2022) asserted that NLP reports can be used to describe basic patterns of data, even though it is difficult to construct a knowledge base to comply with the various explanatory models in marketing research.

In the commercial world, pre-built and pre-configured models now exist to aid the analysis of verbatim qualitative comments from customers. The text software would provide the means to use models from other projects and would identify themes from the dataset or pick up “topics” based on the clustering ability of the model.

Guetterman et al. (2018) conducted a study to compare the effectiveness of a traditional qualitative text analysis (which involves researchers reading, cleaning/coding data and iteratively building findings), a natural language processing analysis, and an enhanced approach that combines both the traditional qualitative and NLP methods. The study involved the analysis of data generated through two text message survey questions sent to people aged 14-24 years. One question was about prescription drugs while the other was about police interactions; a question was randomly assigned to each of two experienced qualitative analysis teams for independent coding and analysis while a third team separately conducted NLP analysis of the same two questions.

Results were assessed and the similarity of discoveries obtained, quality of insights, and the time spent on analysis were compared. The traditional method produced four major findings for both the drug-related and police interaction questions. The NLP method produced three findings that missed contextual elements in the drug-related question and four slightly different findings (in comparison to the traditional method) in the police interaction question. Guetterman and colleagues’ enhanced approach that combined both traditional and NLP methods produced the most comprehensive and highest quality findings through details and frequencies. Observing the time taken, the traditional approach took 270 minutes for each question. The NLP approach took 120 minutes and 40 minutes for the drug and police questions respectively. An approach starting with the traditional method followed by the enhanced method took 450 and 390 minutes for the drug and police questions respectively, while an approach starting with the NLP method followed by the enhanced method took 240 and 220 minutes for the drug and police questions respectively.
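The time comparison above can be made concrete with a little arithmetic. The sketch below uses only the minutes quoted in the preceding paragraph; the savings and percentages are derived here for illustration and are not figures reported by Guetterman et al. (2018). The column names are chosen for this example.

```r
# Minutes quoted above for the two combined workflows
timings <- data.frame(
  question = c("drug", "police"),
  traditional_then_enhanced = c(450, 390),
  nlp_then_enhanced = c(240, 220)
)

# Time saved by starting with NLP instead of traditional coding
timings$minutes_saved <- timings$traditional_then_enhanced - timings$nlp_then_enhanced
timings$percent_saved <- round(100 * timings$minutes_saved /
                                 timings$traditional_then_enhanced, 1)

print(timings)
```

On these numbers, starting with NLP shortens the combined workflow by 210 and 170 minutes, or roughly 47 and 44 percent, for the drug and police questions respectively.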

In a nutshell, the study showed that NLP can be used as a method to validate qualitative findings and can reduce the time spent on qualitative coding if used prior to it. It also showed that the NLP approach could identify major themes in the respondents’ answers but could not identify nuances in their responses.

5 Advantages and Disadvantages of Using NLP for Analyzing Verbatim Qualitative Comments

One major advantage of using natural language processing in analyzing verbatim qualitative comments is that it saves the time of coders dealing with the raw data. NLP can group together similar answers in an enormous list of verbatims. Additionally, it can rank the words or phrases that were most frequent in respondents’ answers, providing insight into patterns or trends in the data.

Another advantage of NLP is that it can capture multilingual verbatims. For example, if one customer in the United States of America says “expensive pricing” and another customer in Mexico says “precios caros”, a model leveraging NLP would be able to capture both reviews as conservative attitudes towards pricing accurately and consistently. Models can be built in any language that can be translated to a base or local language.

NLP can also capture every related comment or expression from respondents that human coders may miss. This further contributes to the elimination of bias that can arise when humans manually analyse the verbatim text. The accurate and unbiased analysis of keywords and themes can help profile the respondents and hence tailor marketing efforts to those consumers.

One disadvantage of using NLP in analyzing verbatim qualitative research is that it does not readily account for negations in respondents’ answers. Obvious negators such as “not impressed” can be easily analyzed by rules-based systems. However, unspoken negators such as “I’ve experienced better” would require customized rules or models leveraging machine learning to analyze accurately.
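The difference between obvious and unspoken negators can be illustrated with a minimal rules-based sketch in base R. The negator list and the function name flag_negation are hypothetical, chosen only for this example; a production rule set would need to be far more complete.

```r
# Hypothetical list of explicit negators (an assumption for illustration)
negators <- c("not", "no", "never")

# Flag comments that contain one of the explicit negators as a whole word
flag_negation <- function(comments) {
  pattern <- paste0("\\b(", paste(negators, collapse = "|"), ")\\b")
  grepl(pattern, tolower(comments), perl = TRUE)
}

examples <- c("not impressed",
              "I've experienced better",
              "it seems like the best rate")

print(data.frame(comment = examples, negation_flagged = flag_negation(examples)))
```

The simple rule flags “not impressed” but misses “I've experienced better”, which is the kind of implied negation that needs customized rules or a trained model.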

NLP also struggles to identify emotions such as anger or confusion in the answers of respondents, and it is unable to account for expressions involving sarcasm and the nuances in respondents’ answers. If used alone, it may provide skewed results without any explanation of what the results mean. This also makes it difficult to translate the results of an NLP analysis into a human-readable format (Mikolov et al., 2013).

Lastly, for better accuracy, some NLP models will require training (machine learning). The amount of data required to train such models is enormous. Consequently, training can take a long time, possibly days or weeks.

6 Conclusion

Natural Language Processing has made a substantial mark in the market research industry; it provides a solid foundation for coding qualitative data more efficiently and serves as a validation method. However, used alone, this approach is not helpful in recognizing subtleties in qualitative verbatim comments.

The results of the study conducted by Guetterman and colleagues support the insights and recommendations drawn from the electricity pricing attitude question. If the results derived from the electricity pricing NLP analysis were combined with findings from traditional qualitative research, a more holistic and accurate conclusion could be reached.

It is evident that the traditional qualitative text analysis and NLP analysis are two methods that complement each other and are most efficient when used together. Although NLP captures frequencies of “buzz words”, identifies statements that human coders may miss, and adds minimal time to analysis, the traditional qualitative coding provides context for these NLP identified “buzz words” and detects the attitudes of respondents more dependably. To interpret responses more accurately and efficiently, market researchers using NLP methods should consider analyzing a subset of their data with the traditional qualitative text analysis methods to ensure that important context and nuances are not missed. A combined approach will leverage the strengths of both methods.

7 References

Dale, R. (2017, June 5). The commercial NLP landscape in 2017: Natural Language Engineering. Cambridge Core. Retrieved March 20, 2022, from https://www.cambridge.org/core/journals/natural-language-engineering/article/commercial-nlp-landscape-in-2017/CDCE7C93C48CFDF9094908F3A0DB9E26
Eng, C. (2021, April 5). Five use cases for NLP techniques in Marketing Analytics. AiThority. Retrieved March 21, 2022, from https://aithority.com/natural-language/five-use-cases-for-natural-language-processing-nlp-techniques-in-marketing-analytics/
Guetterman, T. C., Chang, T., DeJonckheere, M., Basu, T., Scruggs, E., & Vydiswaran, V. G. V. (2018, June). Augmenting qualitative text analysis with Natural Language Processing: Methodological Study. Journal of Medical Internet Research. Retrieved March 21, 2022, from https://www.jmir.org/2018/6/e231/
Leung, J., & Yeh, C.-L. (2022). (rep.). Natural Language Processing for Verbatim Text Coding and Data Mining Report Generation (pp. 1–14). Pittsburgh, PA: Pennsylvania State University.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013, September 7). Efficient estimation of word representations in vector space. arXiv.org. Retrieved March 21, 2022, from https://arxiv.org/abs/1301.3781
Qualtrics. (2022, January 25). The Definitive Guide to Text Analysis. Retrieved March 21, 2022, from https://www.qualtrics.com/experience-management/research/text-analysis/

8 Appendix

8.1 The udpipe model for NLP

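library(udpipe)  # udpipe_download_model(), udpipe_annotate(), document_term_matrix(), ...
library(dplyr)   # %>% and distinct()
library(sjmisc)  # frq()

 #ud_model <- udpipe_download_model(language = "english")
 #ud_model <- udpipe_load_model(ud_model$file_model)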

ud_model <- udpipe_download_model(language = "english", model_dir="C:/FlexPlan2a/Verbatim opinions, Electricity/", udpipe_cores=4)
ud_model <- udpipe_load_model( "C:/FlexPlan2a/Verbatim opinions, Electricity/english-ewt-ud-2.5-191206.udpipe" )

pacman::p_load(utf8)
txt <- utf8_format(txt)
# head(txt)

x <- udpipe_annotate(ud_model, x = txt, trace = FALSE)
x <- as.data.frame(x)
 #str(x)

# function to expand contractions in an English-language source
fix.contractions <- function(doc) {
  doc <- gsub("won't", "will not", doc)
  doc <- gsub("can't", "can not", doc)
  doc <- gsub("n't", " not", doc)
  doc <- gsub("'ll", " will", doc)
  doc <- gsub("'re", " are", doc)
  doc <- gsub("'ve", " have", doc)
  doc <- gsub("'m", " am", doc)
  doc <- gsub("'d", " would", doc)
  doc <- gsub("'s", "", doc)
  return(doc)
}
x$token <- sapply(x$lemma, fix.contractions)

undesirable_words <- c( "!!!!!!!!!!!!!!!!!!!!!!!!!!1111", "%bonus", ",s", "100", "6�", "2", "am", "thing", "NA", "na na", "na na na", "na i", "na na na na", "na na i", "na it", "na we", "na pricing", "na na na i", "that i", "na na it", "?", "t",  "4", "ooh", "uurh", "pheromone", "poompoom", "3121",  "matic", "ai", "c", "la", "hey", "na",  "da", "uh", "tin", "ll", "transcription",  "repeats", "la", "da", "uh", "ah", "e.g", "i.e", "@", "/", "\\|", "'ve", "'re", "'m", "'ll", "â???Ts", "â???Tve", "â???Tre", " s", "'s", ".", "..", "...", "....our", "lifestyles..e.g..wells" )
# undesirable_words
#Create tidy text format: Unnested, Unsummarized, -Undesirables, Stop and Short words

# txt_tidy <- text_df %>%
#   unnest_tokens(text) 

# str(tidy_books)
 x <- x  %>%
  dplyr::filter(!lemma %in% undesirable_words) %>% #Remove undesirables
  dplyr::filter(!nchar(lemma) < 3) %>% #Remove lemmas shorter than 3 characters, e.g. "ah" or "oo"
  distinct()

frq(x$upos, out="v", title= "Universal parts of speech identified by udpipe's annotation function")

Universal parts of speech identified by udpipe’s annotation function
val	label	frq	raw.prc	valid.prc	cum.prc
ADJ		1130	12.32	12.32	12.32
ADP		541	5.90	5.90	18.21
ADV		783	8.53	8.53	26.75
AUX		360	3.92	3.92	30.67
CCONJ		310	3.38	3.38	34.05
DET		825	8.99	8.99	43.04
INTJ		6	0.07	0.07	43.11
NOUN		2553	27.83	27.83	70.93
NUM		94	1.02	1.02	71.96
PART		229	2.50	2.50	74.45
PRON		337	3.67	3.67	78.13
PROPN		93	1.01	1.01	79.14
PUNCT		5	0.05	0.05	79.19
SCONJ		223	2.43	2.43	81.62
VERB		1671	18.21	18.21	99.84
X		15	0.16	0.16	100.00
NA	NA	0	0.00	NA	NA
total N=9175 · valid N=9175 · x̄=7.57 · σ=4.68

x$topic_level_id <- unique_identifier(x, fields = c("doc_id", "paragraph_id", "sentence_id"))
x$linenumber <- 1:nrow(x)

x.noun <- subset(x, upos %in% c("NOUN", "ADJ") & 
                !lemma %in% c(".", "I", "na", "we", "....our", "lifestyles..e.g..well") |
                 lemma %in% c("Hydro") )

dtf.an <- document_term_frequencies(x.noun, document = "topic_level_id", term = "lemma")

stats.an <- txt_freq(dtf.an$term)

dtm.na <- document_term_matrix(dtf.an)

dtm_clean <- dtm_remove_lowfreq(dtm.na, minfreq = 5)
dtm_clean <- dtm_remove_terms(dtm_clean, terms = c("!!!!!!!!!!!!!!!!!!!!!!!!!!1111", "%bonus", ",s", "100", "6�") )
dtm_clean <- dtm_remove_terms(dtm_clean, terms = c("na na", "na na na", "na i", "na na na na", "na na i", "na it", "na we", "na pricing", "na na na i", "that i", "na na it") )
dtm_clean <- dtm_remove_terms(dtm_clean, terms = c("am", "thing", ",s", "100", "6�", "pm", "am") )
# colnames(dtm_clean)
dtm_clean <- dtm_remove_terms(dtm_clean, terms = c( "!!!!!!!!!!!!!!!!!!!!!!!!!!1111", "%bonus", ",s", "100", "6�", "2", "am", "thing", "na na", "na na na", "na i", "na na na na", "na na i", "na it", "na we", "na pricing", "na na na i", "that i", "na na it", "?", "t",  "4", "ooh", "uurh", "pheromone", "poompoom", "3121",  "matic", "ai", "c", "la", "hey", "na",  "da", "uh", "tin", "ll", "transcription",  "repeats", "la", "da", "uh", "ah", "e.g", "i.e", "@", "/", "\\|", "'ve", "'re", "'m", "'ll", "â???Ts", "â???Tve", "â???Tre", " s", "'s" ) )

dcs <- dtm_colsums(dtm_clean)
# Build the word/frequency table directly from the named vector of column sums.
# Going through as.list() and as.data.frame() would rewrite terms such as
# "off-peak" to syntactic names ("off.peak").
dd <- data.frame(word = names(dcs), freq = as.numeric(dcs), stringsAsFactors = FALSE)

8.2 Analysis of nouns and adjective words

Figure 2: Electricity pricing attitudes, noun and adjective frequencies

library(wordcloud)  # wordcloud(); also attaches RColorBrewer, which provides brewer.pal()
set.seed(1234)
wordcloud(words = stats.an$key, freq = stats.an$freq, min.freq = 2,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

# title(main = "Electricity pricing attitudes", font.main = 1, col.main = "black", cex.main = 1.5)

Figure 3: Electricity pricing attitudes

library(lattice)
stats.an$key <- factor( stats.an$key, levels = rev(stats.an$key) )
barchart(key ~ freq, data = head(stats.an, 30), col = "blue", main = "Electricity pricing attitudes, \nMost occurring nouns and adjectives", xlab = "Freq")

8.3 Extracting bigrams, trigrams and longer phrases

8.4 Extracting keywords and ngrams based on nouns and adjectives

library(textrank)
stats.bi <- textrank_keywords(x$lemma, 
                          relevant = x$upos %in% c("NOUN", "ADJ"), 
                          ngram_max = 8, sep = " ")
stats.bi <- subset(stats.bi$keywords, ngram > 1 & freq >= 2)

Figure 4: Electricity pricing attitudes, bigrams and trigrams

library(wordcloud)
# stats <- stats[!stats$text == '',]
# head(stats.bi, 30)
# str(stats.bi)
wordcloud(words = stats.bi$keyword, freq = stats.bi$freq, min.freq = 2,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

# title(main = "Electricity pricing attitudes, \nMost occurring bigrams and trigrams", font.main = 1, col.main = "black", cex.main = 1.5)

Figure 5: Electricity pricing attitudes

library(lattice)
stats.bi$keyword <- factor( stats.bi$keyword, levels = rev(stats.bi$keyword) )
barchart(keyword ~ freq, data = head(stats.bi, 30), col = "blue", main = "Electricity pricing attitudes, \nMost occurring bigrams and trigrams", xlab = "Freq")

8.5 Text rank keywords diagrams for nouns and adjectives

The ‘textrank’ package provides the “textrank_keywords()” function that focuses on words that follow one another. A link is established when words follow one another and that link is strengthened as that behaviour is found to occur more often. An importance measure is developed and the top one-third of the words is kept and presented.
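The notion of a link that is strengthened each time two words follow one another can be sketched in base R. This is only an illustration of the link-counting idea on hypothetical phrases; count_links is a name made up for this example, and the importance measure that the textrank package derives from such links is not reproduced here.

```r
# Count how often each pair of adjacent words occurs across a set of phrases
count_links <- function(sentences) {
  words_by_sentence <- strsplit(tolower(sentences), "\\s+")
  pairs <- unlist(lapply(words_by_sentence, function(w) {
    if (length(w) < 2) return(character(0))
    # Pair each word with the word that follows it
    paste(head(w, -1), tail(w, -1))
  }))
  sort(table(pairs), decreasing = TRUE)
}

# Hypothetical phrases, invented for illustration
toy <- c("peak period price", "high peak period", "peak time price")
links <- count_links(toy)
print(links)
```

Here the link between "peak" and "period" is observed twice and so is the strongest, while every other pair occurs once.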

Figure 6: Electricity pricing attitudes, Most important keywords

library(textrank)
stats.bi <- textrank_keywords(x.noun$lemma, 
                          relevant = x.noun$upos %in% c("NOUN", "ADJ"), 
                          ngram_max = 8, p = 1/2, sep = " ")
stats.bi <- subset(stats.bi$keywords, ngram > 1 & freq >= 2)

library(wordcloud)
# stats <- stats[!stats$text == '',]
# head(stats.bi, 30)
# str(stats.bi)
wordcloud(words = stats.bi$keyword, freq = stats.bi$freq, min.freq = 2,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

# title(main = "Electricity pricing attitudes, \nMost important keywords", font.main = 1, col.main = "black", cex.main = 1.5)

Figure 7: Electricity pricing attitudes, Most important keywords

library(lattice)
stats.bi$keyword <- factor( stats.bi$keyword, levels = rev(stats.bi$keyword) )
barchart(keyword ~ freq, data = head(stats.bi, 50), col = "blue", main = "Electricity pricing attitudes, \nMost important keywords", xlab = "Freq")

M731 Assignment 3: NLP for Verbatims

Reducing Coding Time of Qualitative Information with AI

Olufunmilayo (Tije) Oyetunji, MBA Candidate

21 March 2022