This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
Install the necessary packages; comment these lines out after installation.
#install.packages('tm')
#install.packages('RColorBrewer')
#install.packages('wordcloud')
Load the packages.
library('tm')
## Loading required package: NLP
library('RColorBrewer')
library('wordcloud')
Process data
Data <- readRDS("Communitech.RDS")
tweets <- Data$text
# swap out all non-alphanumeric characters (options below, commented out)
# Note that the definition of what constitutes a letter, a number or a punctuation mark varies slightly depending on your locale, so you may need to experiment a little to get exactly what you want.
# str_replace_all(tweets, "[^[:alnum:]]", " ")
# iconv(tweets, from = 'UTF-8', to = 'ASCII//TRANSLIT')
# Encoding(tweets) <- "UTF-8"
# Function to clean tweets
clean.text = function(x)
{
  # remove "rt" (retweet marker); note this pattern is case-sensitive and
  # also strips "rt" inside words, e.g. "part" becomes "pa"
  x = gsub("rt", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove http links
  x = gsub("http\\w+", "", x)
  # remove runs of two or more spaces or tabs
  x = gsub("[ |\t]{2,}", "", x)
  # remove a blank space at the beginning
  x = gsub("^ ", "", x)
  # remove a blank space at the end
  x = gsub(" $", "", x)
  # convert to lower case (left disabled here)
  # x = tolower(x)
  return(x)
}
# clean tweets
tweets = clean.text(tweets)
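A quick side-by-side look at the raw and cleaned text confirms the cleaning behaved as intended (an optional check; Data$text still holds the originals):
# optional check: compare a few raw tweets with their cleaned versions
head(Data$text, 3)
head(tweets, 3)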
Create a word cloud of the tweets
corpus = Corpus(VectorSource(tweets))
# create term-document matrix
tdm = TermDocumentMatrix(
corpus,
control = list(
wordLengths=c(3,20),
removePunctuation = TRUE,
stopwords = c("the", "a", stopwords("english")),
removeNumbers = TRUE,
# tolower may cause trouble on Windows because of UTF-8 encoding, so it is set to FALSE
tolower = FALSE) )
# convert to a matrix; this may consume nearly 1 GB of RAM
tdm = as.matrix(tdm)
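# Note: if memory is tight, slam::row_sums(tdm) computes the same row sums
# directly on the sparse matrix (slam is a dependency of tm), so the dense
# as.matrix() conversion above can be skipped:
# word_freqs = sort(slam::row_sums(tdm), decreasing=TRUE)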
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
#check top 50 most mentioned words
head(word_freqs, 50)
## tech volunteer learn
## 45 27 27
## amp The help
## 26 24 23
## Communitech community Design
## 23 22 22
## Innovation RTWe TrueNoh
## 21 21 19
## hosting Tech space
## 19 19 19
## good get new
## 18 17 17
## March Jam Lab
## 16 16 16
## StaUpHereTO Canadas RTJoin
## 16 16 15
## access Leadership excited
## 15 15 15
## Minister Waterloos collaborative
## 15 15 15
## colleges everything hubs
## 15 15 15
## incubators local universities
## 15 15 15
## companies RTFrom Founders
## 14 14 13
## Join products RTis
## 13 13 13
## welcomeCEO went chat
## 13 13 13
## Bootcamp Digital CommunitechAcademy
## 13 12 12
## like Small
## 12 12
#remove the top words which don't generate insights, such as "the", "a", "and", etc.
word_freqs = word_freqs[-(1:2)] #Here 1:2 removes the 1st and 2nd words in the list
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot the corpus in a colored graph; needs the RColorBrewer package
wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
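The wordcloud() call accepts further tuning; min.freq and scale (standard arguments of the wordcloud package) set a frequency cutoff and the word-size range:
# a variant with an explicit frequency cutoff and size range (optional)
# wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE,
#           min.freq=2, scale=c(4, 0.5), colors=brewer.pal(8, "Dark2"))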
#check the top 50 most mentioned words again after the removal
head(word_freqs, 50)
## learn amp The
## 27 26 24
## help Communitech community
## 23 23 22
## Design Innovation RTWe
## 22 21 21
## TrueNoh hosting Tech
## 19 19 19
## space good get
## 19 18 17
## new March Jam
## 17 16 16
## Lab StaUpHereTO Canadas
## 16 16 16
## RTJoin access Leadership
## 15 15 15
## excited Minister Waterloos
## 15 15 15
## collaborative colleges everything
## 15 15 15
## hubs incubators local
## 15 15 15
## universities companies RTFrom
## 15 14 14
## Founders Join products
## 13 13 13
## RTis welcomeCEO went
## 13 13 13
## chat Bootcamp Digital
## 13 13 12
## CommunitechAcademy like Small
## 12 12 12
## morning Fierce
## 12 11
# I see some words I don't know or understand, so I retrieve the tweets that contain them
# Here I retrieve all the tweets that have "learn" in them
index = grep("learn", tweets)
tweets[index]
## [1] "RTOn Marchlearn how to access s iPaaS testbed to build Genabled products and services at an info session h"
## [2] "On Marchlearn how to access s iPaaS testbed to build Genabled products and services at an info sess"
## [3] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer o"
## [4] "RTJoin theAcademy DesignThinking Bootcamp to learn how to build products and services customers love Regis"
## [5] "RTJoin theAcademy DesignThinking Bootcamp to learn how to build products and services customers love Regis"
## [6] "RTJoin theAcademy DesignThinking Bootcamp to learn how to build products and services customers love Regis"
## [7] "Join theAcademy DesignThinking Bootcamp to learn how to build products and services customers love"
## [8] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer o"
## [9] "The CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasises a learn"
## [10] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer o"
## [11] "If you want to be a software engineer and found a company just go for it and dont be afraid to learn as you go"
## [12] "On Wed MarchjoinTalent Services team for an interactive and informative session to learn more about"
## [13] "RTHappening today Joinat the Data Hub in Waterloo for OpenGovWeek to learn the basic theory and methods of ve"
## [14] "RTThe CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasizes a learn by doin"
## [15] "Happening today Joinat the Data Hub in Waterloo for OpenGovWeek to learn the basic theory and metho"
## [16] "RTThe CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasizes a learn by doin"
## [17] "The CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasizes a learn"
## [18] "Join CommunitechAcademy Discover Leadership and learn what it means to be a people leader and whether leading othe"
## [19] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [20] "Seats for tomorrows Data Hub session are all filled You can still learn more about the basic theory and methods of"
## [21] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [22] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [23] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer o"
## [24] "Apply and learn more about your oppounities for volunteering at TrueNoh this JuneJoin the team and g"
## [25] "Join CommunitechAcademy Discover Leadership and learn what it means to be a people leader and whether leading othe"
## [26] "The CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasises a learn"
## [27] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [28] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [29] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about your oppounit"
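grep() is case-sensitive by default, so occurrences such as "Learn" at the start of a sentence are missed above; ignore.case = TRUE (a standard argument of grep) catches those as well:
# case-insensitive variant of the same lookup
index = grep("learn", tweets, ignore.case = TRUE)
length(index)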
Prepare for bigram analysis
# Load the following packages
library(dplyr)
library(tidyverse) # data manipulation & plotting
library(stringr) # text cleaning and regular expressions
library(tidytext) # provides additional text mining functions
titles <- c("v")
books <- list(tweets)
series <- tibble()
for(i in seq_along(titles)) {
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
# tokenize the text into bigrams (n = 2)
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
mutate(book = titles[i]) %>%
select(book, everything())
series <- rbind(series, clean)
}
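The same loop produces longer n-grams by changing the n argument of unnest_tokens; for example:
# trigram variant (sketch): swap the unnest_tokens() line above for
# unnest_tokens(trigram, text, token = "ngrams", n = 3)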
Bigrams of the data
temp1 = subset(series, book == 'v') %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(!word1 %in% stop_words$word,
# !word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE)
temp1[1:20,]
## # A tibble: 20 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 to be 23
## 2 on the 20
## 3 design jam 19
## 4 how to 19
## 5 our volunteer 19
## 6 at the 17
## 7 innovation lab 16
## 8 is hosting 16
## 9 lab is 16
## 10 of the 16
## 11 and colleges 15
## 12 and everything 15
## 13 between waterloos 15
## 14 colleges to 15
## 15 everything in 15
## 16 excited to 15
## 17 hubs and 15
## 18 in between 15
## 19 incubators hubs 15
## 20 local universities 15
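Many of the top bigrams above are pure function-word pairs ("to be", "on the"). Enabling the commented stop-word filter, using the stop_words data set bundled with tidytext, removes them:
# same bigram count, with English stop words filtered out
temp2 = subset(series, book == 'v') %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
temp2[1:10,]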
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
library("plyr")
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:plotly':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library("stringi")
money.words = scan('moneyWords.txt', what='character', comment.char=';')
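scan() reads the dictionary as a plain word list, and comment.char=';' tells it to ignore everything after a ';'. A hypothetical moneyWords.txt might look like:
# ; money-related terms, one per line (hypothetical example file)
# funding
# revenue
# investment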
score.topic = function(sentences, dict, .progress='none')
{
# we got a vector of sentences. plyr will handle a list
# or a vector as an "l" for us
# we want a simple array of scores back, so we use
# "l" + "a" + "ply" = "laply":
scores = laply(sentences, function(sentence, dict) {
# clean up sentences with R's regex-driven global substitute, gsub():
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
# and convert to lower case:
# sentence = tolower(sentence)
# split into words. str_split is in the stringr package
word.list = str_split(sentence, '\\s+')
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)
# compare our words to the dictionaries of positive & negative terms
topic.matches = match(words, dict)
# match() returns the position of the matched term or NA
# we just want a TRUE/FALSE:
topic.matches = !is.na(topic.matches)
# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
score = sum(topic.matches)
return(score)
}, dict, .progress=.progress )
topicscores.df = data.frame(score=scores, text=sentences)
return(topicscores.df)
}
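A tiny sanity check makes the scoring concrete (hypothetical sentences and dictionary):
# each score is the number of dictionary words found in the sentence;
# expect 1 for the first sentence and 0 for the second
score.topic(c("we secured new funding", "hello world"), c("funding", "revenue"))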
topic.scores= score.topic(tweets, money.words, .progress='none')
# topic.scores= score.topic(Etweets, fear.words, .progress='none')
topic.mentioned = subset(topic.scores, score !=0)
N= nrow(topic.scores)
Nmentioned = nrow(topic.mentioned)
dftemp=data.frame(topic=c("Mentioned", "Not Mentioned"),
number=c(Nmentioned,N-Nmentioned))
p <- plot_ly(data=dftemp, labels = ~topic, values = ~number, type = 'pie') %>%
layout(title = 'Pie Chart of Tweets Talking about Money',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
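The same split can be read off numerically (a simple check):
# share of tweets mentioning at least one money-related term, in percent
round(Nmentioned / N * 100, 1)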
library(tidytext)
titles <- c("v")
books <- list(tweets)
series <- tibble()
# create a series of "books" holding the raw text lines (tokenization happens later, in the sentiment step)
for(i in seq_along(titles)) {
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
# unnest_tokens(word, text) %>%
mutate(book = titles[i]) %>%
select(book, everything())
series <- rbind(series, clean)
}
# find tweets with "fear"
# other emotion to find
##########################
# anger
# anticipation
# disgust
# fear
# joy
# sadness
# surprise
# trust
##########################
senti <- series %>%
unnest_tokens(word, text) %>%
inner_join(get_sentiments("nrc")) %>%
filter(sentiment=="fear") %>% # replace "fear" with other emotion words
group_by(chapter)
## Joining, by = "word"
sentitext = series[senti$chapter,]
sentitext$sentiment = senti$sentiment
sentitext
## # A tibble: 73 x 4
## book chapter text sentiment
## <chr> <int> <chr> <chr>
## 1 v 1 Many people tell me that they wish they had th… fear
## 2 v 2 RTJoin us at The Fierce Founders finale on Mar… fear
## 3 v 8 Join us at The Fierce Founders finale on March… fear
## 4 v 30 Want to be pa of Canadas tech for good confere… fear
## 5 v 30 Want to be pa of Canadas tech for good confere… fear
## 6 v 31 Tech About Town How to get staed in volunteer … fear
## 7 v 47 RTYou know when its design jam time when your … fear
## 8 v 52 RTYou know when its design jam time when your … fear
## 9 v 55 RTCanada is a trading nation but it doesnt hav… fear
## 10 v 55 RTCanada is a trading nation but it doesnt hav… fear
## # ... with 63 more rows
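Which tweets carry the most fear-associated words can be read off by tallying the joined result (dplyr::count is called explicitly because plyr, loaded earlier, masks count()):
# tally fear words per tweet (chapter) and show the heaviest ones
senti %>% dplyr::count(chapter, sort = TRUE) %>% head()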
Create a word cloud of the tweets expressing fear
corpus = Corpus(VectorSource(sentitext$text))
# create term-document matrix
tdm = TermDocumentMatrix(
corpus,
control = list(
wordLengths=c(3,20),
removePunctuation = TRUE,
stopwords = c("the", "a", stopwords("english")),
removeNumbers = TRUE,
# tolower may cause trouble on Windows because of UTF-8 encoding, so it is set to FALSE
tolower = FALSE) )
# convert to a matrix; this may consume nearly 1 GB of RAM
tdm = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
#check top 50 most mentioned words
head(word_freqs, 50)
## volunteer tech learn Canadas Visit
## 45 25 20 19 19
## conference good page Tech Fierce
## 19 19 19 14 11
## Founders topcompanies RTWant oppoun Want
## 11 11 10 10 9
## government Communitechs Marchand cheer compete
## 9 9 9 9 9
## RTJoin get doesnt enough friendly
## 8 7 7 7 7
## intends nation smallbiz tradersour trading
## 7 7 7 7 7
## great amp another tell About
## 7 7 7 6 6
## How Town staed RTCanada change
## 6 6 6 6 6
## agrand prize challenge discomfo leaders
## 6 6 6 6 6
## ️ StaUpHereTO CASE MISSED Tea
## 6 5 5 5 5
#remove the top words which don't generate insights, such as "the", "a", "and", etc.
word_freqs = word_freqs[-(1:5)] #Here 1:5 means the 1st-5th words in the list are removed
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot corpus in a clored graph; need RColorBrewer package
wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
#check the top 50 most mentioned words again after the removal
head(word_freqs, 50)
## conference good page Tech Fierce
## 19 19 19 14 11
## Founders topcompanies RTWant oppoun Want
## 11 11 10 10 9
## government Communitechs Marchand cheer compete
## 9 9 9 9 9
## RTJoin get doesnt enough friendly
## 8 7 7 7 7
## intends nation smallbiz tradersour trading
## 7 7 7 7 7
## great amp another tell About
## 7 7 7 6 6
## How Town staed RTCanada change
## 6 6 6 6 6
## agrand prize challenge discomfo leaders
## 6 6 6 6 6
## ️ StaUpHereTO CASE MISSED Tea
## 6 5 5 5 5
## YOU basics internet looking using
## 5 5 5 5 5
# I see some words I don't know or understand, so I can retrieve the tweets that contain them,
# as before, by calling grep() on sentitext$text
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.