About this Notebook
In this notebook we are going to analyze tweets from March Madness 2018
Use regular expressions to clean the tweet text
Get familiar with some natural language processing tools
Sentiment Analysis: Natural Language Processing
First, let's read the dataset and inspect the first rows of columns 12 through 21 (head() returns six rows by default).
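library(readr)  # provides read_csv()
tweets <- read_csv("data/march_madness_sent.csv")
# Store tweet IDs as character so the long numeric IDs are not rounded or shown in scientific notation
tweets$tweet_id <- as.character(tweets$tweet_id)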
head(tweets[12:21])
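The goals listed at the top mention cleaning the tweet text with regular expressions. A minimal sketch of what such a step could look like in base R follows. The helper name clean_tweet and the example tweet are hypothetical and not part of the original notebook, and the patterns are only one reasonable choice; the dataset may already have been cleaned.

```r
# Hypothetical helper (an assumption, not the notebook's own code):
# strips common Twitter noise from a character vector.
clean_tweet <- function(x) {
  x <- gsub("^RT\\s+@\\w+:?\\s*", "", x, perl = TRUE)  # leading retweet marker, e.g. "RT @user: "
  x <- gsub("https?://\\S+", "", x, perl = TRUE)       # URLs
  x <- gsub("@\\w+", "", x, perl = TRUE)               # remaining @mentions
  x <- gsub("#", "", x, fixed = TRUE)                  # keep the hashtag word, drop the '#'
  x <- gsub("\\s+", " ", x, perl = TRUE)               # collapse runs of whitespace
  trimws(x)                                            # drop leading/trailing spaces
}

# Made-up example tweet for illustration only
example <- "RT @espn: What a game! #MarchMadness https://t.co/abc123"
clean_tweet(example)
```

If a step like this is used, it could be applied to tweets$text before the annotation step, for example with tweets$text <- clean_tweet(tweets$text).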
For sentiment analysis we are going to use the cleanNLP package, which can run on several natural language processing backends, including Stanford CoreNLP. The code below initializes the udpipe backend and then creates an annotation object from the text column; tweet_id supplies the document IDs, and the remaining columns (all but the first two) are passed as metadata.
library(cleanNLP)  # provides cnlp_init_udpipe() and cnlp_annotate()
cnlp_init_udpipe()
doc <- cnlp_annotate(input = tweets$text, as_strings = TRUE, doc_ids = tweets$tweet_id, meta = tweets[-c(1,2)])
Let's explore the emotions in the tweets in more depth. Here we take the emotion variables (columns 11 to 18) and compute each emotion's share of the total, expressed as a percentage.
value <- as.double(colSums(prop.table(tweets[, 11:18])))
emotion <- names(tweets)[11:18]
emotion <- factor(emotion, levels = names(tweets)[11:18][order(value, decreasing = FALSE)])
# tibble() replaces the deprecated data_frame(); it is exported by both tibble and dplyr
emotions <- tibble(emotion, percent = value * 100)
head(emotions)
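To make the calculation above easier to follow, here is a small self-contained sketch on toy data. The counts are made up and the names toy, share and toy_emotions are hypothetical; only the technique (prop.table() followed by colSums(), then ordering the factor levels by share) mirrors the code above.

```r
# Toy data: made-up emotion counts for three tweets (not from the real dataset)
toy <- data.frame(
  anger = c(1, 0, 0),
  joy   = c(2, 3, 1),
  fear  = c(1, 1, 1)
)

# prop.table() divides every cell by the grand total, so colSums()
# returns each emotion's share of all emotion counts.
share <- colSums(prop.table(as.matrix(toy)))

# Order the factor levels by share so a later plot lists emotions
# from the smallest share to the largest.
emotion <- factor(names(share), levels = names(share)[order(share)])
toy_emotions <- data.frame(emotion, percent = as.double(share) * 100)
toy_emotions
```

The percent column sums to 100 because every cell was divided by the same grand total before the columns were summed.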
