NLP most often works with the character string object class rather than with factors. Factors have distinct, often repeated levels, such as a roster listing the university each athlete attended. In contrast, character strings may or may not repeat, and they most often convey varying meaning. Thus, NLP requires a good understanding of the character class and of how to manipulate, correct, and adjust strings for analysis.
Given that string data such as social media content are often unorganized and complex, you are expected to possess good data wrangling skills, for example with gsub(), strsplit(), tolower(), trimws(), grep(), and grepl(), among others. You should be able to find similar functions in the pandas package in Python.
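As a minimal sketch of what these base R functions do, here is a toy example using made-up strings rather than the chapter data:
x <- c("  The Hawks WON!! ", "Visit www.nba.com", "great game tonight")
tolower(x) #convert every string to lowercase
trimws(x) #strip leading and trailing whitespace
gsub("!", "", x) #replace every "!" with an empty string
strsplit(x, " ") #split each string on spaces; returns a list of word vectors
grep("game", x) #indices of the elements containing "game"
grepl("game", x) #logical vector: does each element contain "game"?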
In online social communities, people interact with each other using text as the dominant mode of communication:
* Twitter feeds
* Emails
* Facebook posts
* Reddit comments
* Blog postings
In the sport business, understanding customer feedback often requires understanding text:
* Product reviews
* Customer feedback
* Fans’ opinion
* Sport news
Simply put, because text is unstructured. Text does not have structure the way relational data do:
* Words with varying lengths
* Text fields with varying numbers of words
* Word order may matter
* Other “dirty” attributes
- ungrammatical sentences
- misspelled words
- unpredictable abbreviations
- words run together
- random punctuation
- variability of linguistic expression: many forms can mean the same thing
- ambiguity of linguistic expression: one form can mean many things
Basic steps of an NLP project are outlined below:
1. Problem definition;
2. Identifying text sources;
3. Pre-processing and feature extraction;
4. Analytics;
5. Insight and recommendations.
The first step is to clearly define the aim of the project. Without a problem definition, you will be doing “curiosity analysis” with no direction. Given the challenges of NLP, strive to be as succinct as possible and be willing to iterate and adjust along the way. Once you have the problem reasonably defined, it should point you toward a channel and the specific pieces of text to analyze. You may use online reviews, contracts, or something else, but it is rarely the entire Internet or some vast collection of unrelated and diverse documents. Next, you need to preprocess the documents, which entails organization and feature extraction. A simple example would be collecting 10,000 tweets mentioning a player. Once the tweets are organized into a corpus, or collection of related documents, you can extract features such as sentiment scores from those documents. The features or values extracted vary depending on the type of analysis you expect to perform. Once the appropriate features have been extracted from the documents, you then run the analysis and finally seek to address the problem definition. Here, addressing the problem statement may be as simple as providing a visual such as a word cloud, or as complex as feeding the output of the analysis into a customer propensity machine learning model.
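As a minimal sketch of steps 3 and 4, the following toy example (three made-up documents, not the chapter data) organizes text into a corpus with the tm package and extracts a simple feature, term counts:
library(tm)
docs <- c("great win tonight", "tough loss but great effort", "win win win")
toyCorp <- VCorpus(VectorSource(docs)) #organize the documents into a corpus
toyDTM <- DocumentTermMatrix(toyCorp) #extract a basic feature: term counts per document
as.matrix(toyDTM) #rows are documents, columns are terms, cells are counts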
The case below examines fan engagement on social media for various NBA teams and marquee players using multiple common methods. The methods can be applied to other types of documents, though they are not an exhaustive set of approaches. They are, however, useful and satisfying to explore.
Below are specific steps:
1. Identifying text sources: we will focus on a collection of tweets amassed daily throughout the 2019-20 NBA season.
2. Pre-processing and feature extraction: conduct string manipulation and organize the tweets into a document term matrix to obtain term frequencies.
3. Analytics: build visualizations such as bar charts, word clouds, and pyramid plots. Perform word associations and sentiment analysis.
4. Insight and recommendations: within the provided text, identify the most discussed teams, terms, and corresponding sentiment representing the Twitter dialogue of fans and sports professionals.
library(tm) #text mining package
library(slam) #To create sparse matrices
library(plyr) #Data wrangling package
library(dplyr) #Data wrangling package
library(qdapRegex) #Regular expression removal, extraction, and replacement tools
library(ggplot2) #Visualization package
library(ggthemes) #To apply certain plot themes
library(wordcloud) #To generate word cloud plots
library(plotrix) #To create pyramid plot
library(SnowballC) #To implement Porter's word stemming algorithm for collapsing words to a common root to aid comparison of vocabulary
library(SentimentAnalysis) #To generate sentiment value based on lexicon (the vocabulary of natural language)
options(stringsAsFactors = F, scipen = 999)
setwd("C:/Users/wangj/Dropbox/Labs")
tweets <- read.csv("SLM637/nba_oct.csv")
tweets$text <- rm_url(tweets$text) #remove standard URLs from the tweet text
tweets$text <- rm_twitter_url(tweets$text) #remove Twitter-style short links (t.co)
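To see what these two qdapRegex helpers do, here is a toy example on a made-up tweet (not from the data set):
ex <- "Great comeback win https://www.nba.com/news and highlights https://t.co/abc123"
rm_url(ex) #strips standard http/https/www URLs
rm_twitter_url(ex) #strips Twitter short links such as t.co addresses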
textCorp <- VCorpus(DataframeSource(tweets)) #Define the volatile corpus. A corpus is a collection of text documents. "Volatile" means the corpus is held in memory: if the session shuts down without saving the environment, the corpus is lost.
textCorp
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 2
## Content: documents: 453875
content(textCorp[1]) #single brackets return a sub-corpus; printing it shows high-level information about the first document.
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 84
content(textCorp[[1]]) #return the text itself.
## [1] "Ole Miss: 15 Braves: 0 Falcons: 0 Hawks: 0 I enjoy suffering with my Atlanta friends"
#define a list of contextualized stop words
uniqueTerms <- tolower(unique(tweets$team))
uniqueTerms <- unlist(strsplit(uniqueTerms, " "))
uniqueTerms <- c(uniqueTerms, "nba", "game", "basketball", "team", "amp", "preseason", "extension", "season")
stops <- c(stopwords("SMART"), uniqueTerms) #stopwords("SMART") covers 500+ common stop words. This step combines the common "SMART" stop words with some contextualized stop words.
tail(stops, 100) #take a look at the last 100 stop words.
## [1] "will" "willing" "wish" "with" "within"
## [6] "without" "won't" "wonder" "would" "would"
## [11] "wouldn't" "x" "y" "yes" "yet"
## [16] "you" "you'd" "you'll" "you're" "you've"
## [21] "your" "yours" "yourself" "yourselves" "z"
## [26] "zero" "atlanta" "hawks" "boston" "celtics"
## [31] "brooklyn" "nets" "charlotte" "hornets" "chicago"
## [36] "bulls" "cleveland" "cavaliers" "dallas" "mavericks"
## [41] "denver" "nuggets" "detroit" "pistons" "golden"
## [46] "state" "warriors" "houston" "rockets" "indiana"
## [51] "pacers" "la" "clippers" "la" "lakers"
## [56] "memphis" "grizzlies" "miami" "heat" "milwaukee"
## [61] "bucks" "minnesota" "timberwolves" "new" "orleans"
## [66] "pelicans" "new" "york" "knicks" "oklahoma"
## [71] "city" "thunder" "orlando" "magic" "philadelphia"
## [76] "sixers" "phoenix" "suns" "portland" "trail"
## [81] "blazers" "sacramento" "kings" "san" "antonio"
## [86] "spurs" "toronto" "raptors" "utah" "jazz"
## [91] "washington" "wizards" "nba" "game" "basketball"
## [96] "team" "amp" "preseason" "extension" "season"
#Define a function to clean the corpus: lowercase conversion, punctuation removal, stop word removal, number removal, and whitespace trimming.
customClean <- function(corpus, stops = stopwords('SMART')) {
  corpus <- corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeWords, stops) %>%
    tm_map(removeNumbers) %>%
    tm_map(content_transformer(trimws))
  return(corpus)
}
textCorp <- customClean(textCorp, stops) #Apply the previously-defined function to the textCorp.
#To create a unigram or one-word DTM (Document Term Matrix)
textDTM <- DocumentTermMatrix(textCorp) #generate a sparse matrix: each row represents a document (tweet) and each column a unique term, with cells holding term counts. This matrix represents word usage across the hundreds of thousands of tweets.
dim(textDTM)
## [1] 453875 104046
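A matrix with 453,875 rows and 104,046 columns is far too large to print, but you can peek at a small slice; the row and column ranges below are arbitrary:
inspect(textDTM[1:3, 1:5]) #print a 3-document by 5-term slice of the sparse matrix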
#To create a bigram or two-word combination DTM, which is often more insightful than unigrams because phrases are generally easier to understand.
bigramTokens <- function(x){
  unlist(lapply(NLP::ngrams(words(x), 2), paste, collapse = " "),
         use.names = F)
}
bigramDTM <- DocumentTermMatrix(textCorp, control=list(tokenize=bigramTokens))
bigramDTM <- removeSparseTerms(bigramDTM, 0.999) #Because the DTM is a sparse matrix, very sparse terms should be removed before later analysis. 0.999 here means a term is dropped if more than 99.9 percent of its cells are zero.
dim(bigramDTM)
## [1] 453875 837
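To get a quick sense of which two-word phrases dominate, one option is tm's findFreqTerms(), which lists terms at or above a frequency threshold; the cutoff of 1,000 below is an arbitrary illustration:
findFreqTerms(bigramDTM, lowfreq = 1000) #bigrams appearing at least 1,000 times in the corpus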
#Create a data frame containing the words-per-tweet variable
wordsPerTweet <- row_sums(textDTM, na.rm = T)
dlm <- data.frame(doc_id = rownames(textDTM),
                  words_per_tweet = wordsPerTweet,
                  team = tweets$team)
head(dlm)
## doc_id words_per_tweet team
## 1 1179143769574297600 7 Atlanta Hawks
## 2 1179143120375762944 9 Atlanta Hawks
## 3 1179142532191723520 10 Atlanta Hawks
## 4 1179141961753792512 7 Atlanta Hawks
## 5 1179141924218920960 10 Atlanta Hawks
## 6 1179141590612353024 2 Atlanta Hawks
#Calculate the mean words per tweet and the total tweet count by team
fanEffort <- aggregate(words_per_tweet ~team, dlm, mean)
fanCount <- aggregate(doc_id ~team, dlm, length)
fanEffort <- left_join(fanEffort, fanCount, by="team")
#Visualize the words per tweet
ggplot(fanEffort, aes(x=reorder(team, -words_per_tweet), y=words_per_tweet))+
geom_point(aes(size=doc_id))+
labs(title="Words per Tweet")+
theme_hc()+
theme(axis.title.x=element_blank(), axis.title.y=element_blank(), axis.text.x=element_text(angle=90, vjust=0.5, hjust=1))
#Create a data frame of term frequencies across the corpus
termUseInCorpus <- col_sums(textDTM, na.rm=T)
wfm <- data.frame(term = names(termUseInCorpus), frequency=termUseInCorpus)
wfm <- wfm[order(wfm$frequency, decreasing=T),]
head(wfm)
## term frequency
## win win 22455
## fans fans 15331
## live live 14080
## night night 12700
## china china 12622
## tonight tonight 12325
#Word cloud of the most frequent terms in the overall corpus
wordcloud(wfm$term, wfm$frequency, max.words = 100, colors="black")
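Step 3 of the plan also calls for word associations. A sketch using tm's findAssocs(), which reports terms whose usage is correlated with a chosen term above a cutoff (the term "win" and the 0.1 limit are arbitrary choices, and the call can take a while on a matrix this large):
findAssocs(textDTM, "win", corlimit = 0.1) #terms correlated with "win" at r >= 0.1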
#Word clouds for specific teams: commonality and comparison clouds
#Collapse all tweets mentioning the Rockets and the Thunder into a single body of text for each team
rockets <- grep("Houston Rockets", tweets$text) #indices of tweets mentioning the Rockets
rockets <- textCorp[rockets]
rockets <- unlist(lapply(rockets, content)) #produce a vector which contains all the atomic components
rockets <- paste(rockets, collapse = " ")
thunder <- grep("Oklahoma City Thunder", tweets$text)
thunder <- textCorp[thunder]
thunder <- unlist(lapply(thunder, content))
thunder <- paste(thunder, collapse = " ")
bothTeams <- c(rockets, thunder)
bothTeams <- VCorpus((VectorSource(bothTeams)))
bothTeams <- as.matrix(TermDocumentMatrix(bothTeams))
colnames(bothTeams) <- c("Rockets", "Thunder")
set.seed(1234)
commonality.cloud(bothTeams, max.words=50, random.order=F)
set.seed(2020)
comparison.cloud(bothTeams, max.words=50, scale=c(4, .2),
                 title.size = 1,
                 random.order = F,
                 colors = c("black", "darkgrey"))
#To show terms that are shared but not in an equal manner
bothTeamsDF <- as.data.frame(bothTeams)
bothTeamsDF$diff <- abs(bothTeams[,1] - bothTeams[, 2])
top35 <- subset(bothTeamsDF, bothTeamsDF$Rockets>10 & bothTeamsDF$Thunder>10)
top35 <- top35[order(top35$diff, decreasing=T), ]
top35 <- top35[1:35, ]
pyramid.plot(lx = top35$Rockets,
             rx = top35$Thunder,
             labels = row.names(top35),
             top.labels = c("Rockets", "Terms", "Thunder"),
             gap = 350,
             main = "Words in Common",
             unit = "Word frequency",
             lxcol = "darkgrey",
             rxcol = "black")
## 2843 2843
## [1] 5.1 4.1 4.1 2.1
#Calculate and visualize the share of voice
players <- c("kobe bryant", "lebron james", "giannis antetokounmpo", "kawhi leonard")
playerShare <- list()
for (i in 1:length(players)){
  print(paste("working on", players[i]))
  x <- grepl(players[i], tweets$text, ignore.case=T) #Returns a logical vector (match or not for an element)
  y <- as.POSIXct(tweets$created, format="%Y-%m-%d") #Date-time conversion
  dailyShare <- data.frame(player=players[i],
                           doc_id=tweets$doc_id,
                           playerMention=x,
                           date=y)
  nam <- make.names(players[i])
  playerShare[[nam]] <- dailyShare
}
## [1] "working on kobe bryant"
## [1] "working on lebron james"
## [1] "working on giannis antetokounmpo"
## [1] "working on kawhi leonard"
head(playerShare$kobe.bryant)
## player doc_id playerMention date
## 1 kobe bryant 1179143769574297600 FALSE 2019-10-01
## 2 kobe bryant 1179143120375762944 FALSE 2019-10-01
## 3 kobe bryant 1179142532191723520 FALSE 2019-10-01
## 4 kobe bryant 1179141961753792512 FALSE 2019-10-01
## 5 kobe bryant 1179141924218920960 FALSE 2019-10-01
## 6 kobe bryant 1179141590612353024 FALSE 2019-10-01
mentionsByDate <- do.call(rbind, playerShare)
mentionsByDate <- aggregate(playerMention ~ player+date, mentionsByDate, sum)
#Line chart
ggplot(mentionsByDate, aes(date, playerMention, linetype=player))+
geom_line()+
theme_hc()+
ggtitle("Player Mentions by Date")
#Bar chart
ggplot(mentionsByDate, aes(date, playerMention, fill=player))+
geom_bar(position="stack", stat="identity")+
theme_hc()+
scale_fill_grey(start=0, end=0.75)+
ggtitle("Player Mentions by Date")
#Sentiment Analysis
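Before running the loop below, it may help to see what the SentimentAnalysis package returns for a single toy sentence (made-up text, not chapter data); analyzeSentiment() scores the text against several lexicons, and the loop uses the QDAP-based column:
toySent <- analyzeSentiment("What an amazing comeback win tonight")
toySent$WordCount #number of words the scorer counted
toySent$SentimentQDAP #polarity score based on the QDAP dictionary
convertToBinaryResponse(toySent$SentimentQDAP) #collapse the score to positive/negative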
playersSentiment <- list()
for(i in 1:length(players)){
  print(paste("working on", players[i]))
  idx <- grepl(players[i], tweets$text, ignore.case=T) #Returns a logical vector (match or not for player i)
  x <- tweets$text[idx] #select the tweet text mentioning player i
  sentimentX <- analyzeSentiment(x)
  wordCount <- sentimentX$WordCount
  polarityX <- convertToBinaryResponse(sentimentX$SentimentQDAP) #generate negative/positive binary results.
  nam <- make.names(players[i])
  playersSentiment[[nam]] <- data.frame(player=players[i],
                                        polarity = polarityX,
                                        numericPolarity = sentimentX$SentimentQDAP,
                                        wordCount = wordCount,
                                        totalTweets = length(x))
}
## [1] "working on kobe bryant"
## [1] "working on lebron james"
## [1] "working on giannis antetokounmpo"
## [1] "working on kawhi leonard"
playersSentiment <- do.call(rbind, playersSentiment)
avgPolarity <- aggregate(numericPolarity~player, playersSentiment, mean)
avgTweetLength <- aggregate(wordCount~player, playersSentiment, mean)
posNegPlayer <- as.data.frame.matrix(
table(playersSentiment$player, playersSentiment$polarity)
)
posNegPlayer$total <- rowSums(posNegPlayer)
posNegPlayer$player <- rownames(posNegPlayer)
plotDF <- join_all(list(avgPolarity, avgTweetLength, posNegPlayer),
by="player",
type="left")
plotDF
## player numericPolarity wordCount negative positive total
## 1 giannis antetokounmpo 0.041100245 11.18205 100 669 769
## 2 kawhi leonard 0.069035316 13.11159 126 1496 1622
## 3 kobe bryant -0.007226916 12.54478 37 97 134
## 4 lebron james 0.007357175 13.93467 1688 2460 4148
ggplot(plotDF, aes(numericPolarity, wordCount, size=total))+
geom_point(color="darkgrey")+
geom_text(aes(label=player),
hjust="inward", vjust="inward", size=5, color="black")+
theme_hc()+
ggtitle("Oct 2019 Player Sentiment")
Below are some general findings and conclusions. You can dig deeper by combining these analytical results with specific contextual background.
1. Compared with the Thunder, the Rockets have longer tweets, indicating higher author effort within the corpus or a larger and more active fan base;
2. Among Kobe, LeBron, Giannis, and Kawhi, LeBron has the largest share of voice on most days. On multiple days in October, LeBron captures more than 90% of the mentions among the four players examined;
3. Terms associated with LeBron indicate his persona has a larger cultural relevance outside of basketball;
4. As for the sentiment valence of each player, you should look into the relevant tweets and their specific context (Oct 2019) to figure out the reasons.