This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
Install the necessary packages; comment these lines out after installation.
#install.packages('tm')
#install.packages('RColorBrewer')
#install.packages('wordcloud')
Load the packages.
library('tm')
## Loading required package: NLP
library('RColorBrewer')
library('wordcloud')
Process data
Data <- readRDS("Communitech.RDS")
tweets <- Data$text
# swap out all non-alphanumeric characters (options below, commented out)
# Note that the definition of what constitutes a letter, a number or a punctuation mark varies slightly depending on your locale, so you may need to experiment a little to get exactly what you want.
# str_replace_all(tweets, "[^[:alnum:]]", " ")
# iconv(tweets, from = 'UTF-8', to = 'ASCII//TRANSLIT')
# Encoding(tweets) <- "UTF-8"
# Function to clean tweets
clean.text = function(x)
{
  # remove "rt" (retweet marker); note this pattern is case-sensitive and
  # also strips "rt" inside words, e.g. "part" becomes "pa"
  x = gsub("rt", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove http links
  x = gsub("http\\w+", "", x)
  # remove runs of two or more spaces or tabs
  x = gsub("[ |\t]{2,}", "", x)
  # remove a blank space at the beginning
  x = gsub("^ ", "", x)
  # remove a blank space at the end
  x = gsub(" $", "", x)
  # convert to lower case (left disabled here)
  # x = tolower(x)
  return(x)
}
# clean tweets
tweets = clean.text(tweets)
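A quick side-by-side look at the raw and cleaned text confirms the cleaning behaved as intended (an optional check; Data$text still holds the originals):
# optional check: compare a few raw tweets with their cleaned versions
head(Data$text, 3)
head(tweets, 3)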
Create a word cloud of the tweets
corpus = Corpus(VectorSource(tweets))
# create term-document matrix
tdm = TermDocumentMatrix(
corpus,
control = list(
wordLengths=c(3,20),
removePunctuation = TRUE,
stopwords = c("the", "a", stopwords("english")),
removeNumbers = TRUE,
# tolower may cause trouble on Windows because of UTF-8 encoding, so it is set to FALSE
tolower = FALSE) )
# convert to a matrix; this may consume nearly 1 GB of RAM
tdm = as.matrix(tdm)
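# Note: if memory is tight, slam::row_sums(tdm) computes the same row sums
# directly on the sparse matrix (slam is a dependency of tm), so the dense
# as.matrix() conversion above can be skipped:
# word_freqs = sort(slam::row_sums(tdm), decreasing=TRUE)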
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
#check top 50 most mentioned words
head(word_freqs, 50)
## tech volunteer learn
## 45 27 27
## amp The help
## 26 24 23
## Communitech community Design
## 23 22 22
## Innovation RTWe TrueNoh
## 21 21 19
## hosting Tech space
## 19 19 19
## good get new
## 18 17 17
## March Jam Lab
## 16 16 16
## StaUpHereTO Canadas RTJoin
## 16 16 15
## access Leadership excited
## 15 15 15
## Minister Waterloos collaborative
## 15 15 15
## colleges everything hubs
## 15 15 15
## incubators local universities
## 15 15 15
## companies RTFrom Founders
## 14 14 13
## Join products RTis
## 13 13 13
## welcomeCEO went chat
## 13 13 13
## Bootcamp Digital CommunitechAcademy
## 13 12 12
## like Small
## 12 12
#remove the top words which don't generate insights, such as "the", "a", "and", etc.
word_freqs = word_freqs[-(1:2)] #Here 1:2 removes the 1st and 2nd words in the list
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot the corpus in a colored graph; needs the RColorBrewer package
wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
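The wordcloud() call accepts further tuning; min.freq and scale (standard arguments of the wordcloud package) set a frequency cutoff and the word-size range:
# a variant with an explicit frequency cutoff and size range (optional)
# wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE,
#           min.freq=2, scale=c(4, 0.5), colors=brewer.pal(8, "Dark2"))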
#check the top 50 most mentioned words again after the removal
head(word_freqs, 50)
## learn amp The
## 27 26 24
## help Communitech community
## 23 23 22
## Design Innovation RTWe
## 22 21 21
## TrueNoh hosting Tech
## 19 19 19
## space good get
## 19 18 17
## new March Jam
## 17 16 16
## Lab StaUpHereTO Canadas
## 16 16 16
## RTJoin access Leadership
## 15 15 15
## excited Minister Waterloos
## 15 15 15
## collaborative colleges everything
## 15 15 15
## hubs incubators local
## 15 15 15
## universities companies RTFrom
## 15 14 14
## Founders Join products
## 13 13 13
## RTis welcomeCEO went
## 13 13 13
## chat Bootcamp Digital
## 13 13 12
## CommunitechAcademy like Small
## 12 12 12
## morning Fierce
## 12 11
# I see some words I don't know or understand, so I retrieve the tweets that contain them
# Here I retrieve all the tweets that have "learn" in them
index = grep("learn", tweets)
tweets[index]
## [1] "RTOn Marchlearn how to access s iPaaS testbed to build Genabled products and services at an info session h"
## [2] "On Marchlearn how to access s iPaaS testbed to build Genabled products and services at an info sess"
## [3] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer o"
## [4] "RTJoin theAcademy DesignThinking Bootcamp to learn how to build products and services customers love Regis"
## [5] "RTJoin theAcademy DesignThinking Bootcamp to learn how to build products and services customers love Regis"
## [6] "RTJoin theAcademy DesignThinking Bootcamp to learn how to build products and services customers love Regis"
## [7] "Join theAcademy DesignThinking Bootcamp to learn how to build products and services customers love"
## [8] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer o"
## [9] "The CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasises a learn"
## [10] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer o"
## [11] "If you want to be a software engineer and found a company just go for it and dont be afraid to learn as you go"
## [12] "On Wed MarchjoinTalent Services team for an interactive and informative session to learn more about"
## [13] "RTHappening today Joinat the Data Hub in Waterloo for OpenGovWeek to learn the basic theory and methods of ve"
## [14] "RTThe CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasizes a learn by doin"
## [15] "Happening today Joinat the Data Hub in Waterloo for OpenGovWeek to learn the basic theory and metho"
## [16] "RTThe CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasizes a learn by doin"
## [17] "The CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasizes a learn"
## [18] "Join CommunitechAcademy Discover Leadership and learn what it means to be a people leader and whether leading othe"
## [19] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [20] "Seats for tomorrows Data Hub session are all filled You can still learn more about the basic theory and methods of"
## [21] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [22] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [23] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer o"
## [24] "Apply and learn more about your oppounities for volunteering at TrueNoh this JuneJoin the team and g"
## [25] "Join CommunitechAcademy Discover Leadership and learn what it means to be a people leader and whether leading othe"
## [26] "The CommunitechAcademy Design Thinking Bootcamp is running AprilThis twoday bootcamp emphasises a learn"
## [27] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [28] "RTWant to be pa of Canadas tech for good conference Visit our volunteer page and learn more about our volunteer oppoun"
## [29] "Want to be pa of Canadas tech for good conference Visit our volunteer page and learn more about your oppounit"
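grep() is case-sensitive by default, so occurrences such as "Learn" at the start of a sentence are missed above; ignore.case = TRUE (a standard argument of grep) catches those as well:
# case-insensitive variant of the same lookup
index = grep("learn", tweets, ignore.case = TRUE)
length(index)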
Prepare for bigram analysis
# Load the following packages
library(dplyr)
library(tidyverse) # data manipulation & plotting
library(stringr) # text cleaning and regular expressions
library(tidytext) # provides additional text mining functions
titles <- c("v")
books <- list(tweets)
series <- tibble()
for(i in seq_along(titles)) {
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
# tokenize the text into bigrams (n = 2)
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
mutate(book = titles[i]) %>%
select(book, everything())
series <- rbind(series, clean)
}
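The same loop produces longer n-grams by changing the n argument of unnest_tokens; for example:
# trigram variant (sketch): swap the unnest_tokens() line above for
# unnest_tokens(trigram, text, token = "ngrams", n = 3)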
Bigrams of the data
temp1 = subset(series, book == 'v') %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(!word1 %in% stop_words$word,
# !word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE)
temp1[1:20,]
## # A tibble: 20 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 to be 23
## 2 on the 20
## 3 design jam 19
## 4 how to 19
## 5 our volunteer 19
## 6 at the 17
## 7 innovation lab 16
## 8 is hosting 16
## 9 lab is 16
## 10 of the 16
## 11 and colleges 15
## 12 and everything 15
## 13 between waterloos 15
## 14 colleges to 15
## 15 everything in 15
## 16 excited to 15
## 17 hubs and 15
## 18 in between 15
## 19 incubators hubs 15
## 20 local universities 15
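Many of the top bigrams above are pure function-word pairs ("to be", "on the"). Enabling the commented stop-word filter, using the stop_words data set bundled with tidytext, removes them:
# same bigram count, with English stop words filtered out
temp2 = subset(series, book == 'v') %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
temp2[1:10,]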
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
library("plyr")
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:plotly':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library("stringi")
money.words = scan('moneyWords.txt', what='character', comment.char=';')
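scan() reads the dictionary as a plain word list, and comment.char=';' tells it to ignore everything after a ';'. A hypothetical moneyWords.txt might look like:
# ; money-related terms, one per line (hypothetical example file)
# funding
# revenue
# investment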
score.topic = function(sentences, dict, .progress='none')
{
# we got a vector of sentences. plyr will handle a list
# or a vector as an "l" for us
# we want a simple array of scores back, so we use
# "l" + "a" + "ply" = "laply":
scores = laply(sentences, function(sentence, dict) {
# clean up sentences with R's regex-driven global substitute, gsub():
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
# and convert to lower case:
# sentence = tolower(sentence)
# split into words. str_split is in the stringr package
word.list = str_split(sentence, '\\s+')
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)
# compare our words to the dictionaries of positive & negative terms
topic.matches = match(words, dict)
# match() returns the position of the matched term or NA
# we just want a TRUE/FALSE:
topic.matches = !is.na(topic.matches)
# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
score = sum(topic.matches)
return(score)
}, dict, .progress=.progress )
topicscores.df = data.frame(score=scores, text=sentences)
return(topicscores.df)
}
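A tiny sanity check makes the scoring concrete (hypothetical sentences and dictionary):
# each score is the number of dictionary words found in the sentence;
# expect 1 for the first sentence and 0 for the second
score.topic(c("we secured new funding", "hello world"), c("funding", "revenue"))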
topic.scores= score.topic(tweets, money.words, .progress='none')
# topic.scores= score.topic(Etweets, fear.words, .progress='none')
topic.mentioned = subset(topic.scores, score !=0)
N= nrow(topic.scores)
Nmentioned = nrow(topic.mentioned)
dftemp=data.frame(topic=c("Mentioned", "Not Mentioned"),
number=c(Nmentioned,N-Nmentioned))
p <- plot_ly(data=dftemp, labels = ~topic, values = ~number, type = 'pie') %>%
layout(title = 'Pie Chart of Tweets Talking about Money',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
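The same split can be read off numerically (a simple check):
# share of tweets mentioning at least one money-related term, in percent
round(Nmentioned / N * 100, 1)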
library(tidytext)
titles <- c("v")
books <- list(tweets)
series <- tibble()
# create a series of "books" holding the raw text lines (tokenization happens later, in the sentiment step)
for(i in seq_along(titles)) {
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
# unnest_tokens(word, text) %>%
mutate(book = titles[i]) %>%
select(book, everything())
series <- rbind(series, clean)
}
# find tweets with "fear"
# other emotion to find
##########################
# anger
# anticipation
# disgust
# fear
# joy
# sadness
# surprise
# trust
##########################
senti <- series %>%
unnest_tokens(word, text) %>%
inner_join(get_sentiments("nrc")) %>%
filter(sentiment=="fear") %>% # replace "fear" with other emotion words
group_by(chapter)
## Joining, by = "word"
sentitext = series[senti$chapter,]
sentitext$sentiment = senti$sentiment
sentitext
## # A tibble: 73 x 4
## book chapter text sentiment
## <chr> <int> <chr> <chr>
## 1 v 1 Many people tell me that they wish they had th… fear
## 2 v 2 RTJoin us at The Fierce Founders finale on Mar… fear
## 3 v 8 Join us at The Fierce Founders finale on March… fear
## 4 v 30 Want to be pa of Canadas tech for good confere… fear
## 5 v 30 Want to be pa of Canadas tech for good confere… fear
## 6 v 31 Tech About Town How to get staed in volunteer … fear
## 7 v 47 RTYou know when its design jam time when your … fear
## 8 v 52 RTYou know when its design jam time when your … fear
## 9 v 55 RTCanada is a trading nation but it doesnt hav… fear
## 10 v 55 RTCanada is a trading nation but it doesnt hav… fear
## # ... with 63 more rows
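Which tweets carry the most fear-associated words can be read off by tallying the joined result (dplyr::count is called explicitly because plyr, loaded earlier, masks count()):
# tally fear words per tweet (chapter) and show the heaviest ones
senti %>% dplyr::count(chapter, sort = TRUE) %>% head()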
Create a word cloud of the tweets expressing fear
corpus = Corpus(VectorSource(sentitext$text))
# create term-document matrix
tdm = TermDocumentMatrix(
corpus,
control = list(
wordLengths=c(3,20),
removePunctuation = TRUE,
stopwords = c("the", "a", stopwords("english")),
removeNumbers = TRUE,
# tolower may cause trouble on Windows because of UTF-8 encoding, so it is set to FALSE
tolower = FALSE) )
# convert to a matrix; this may consume nearly 1 GB of RAM
tdm = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
#check top 50 most mentioned words
head(word_freqs, 50)
## volunteer tech learn Canadas Visit
## 45 25 20 19 19
## conference good page Tech Fierce
## 19 19 19 14 11
## Founders topcompanies RTWant oppoun Want
## 11 11 10 10 9
## government Communitechs Marchand cheer compete
## 9 9 9 9 9
## RTJoin get doesnt enough friendly
## 8 7 7 7 7
## intends nation smallbiz tradersour trading
## 7 7 7 7 7
## great amp another tell About
## 7 7 7 6 6
## How Town staed RTCanada change
## 6 6 6 6 6
## agrand prize challenge discomfo leaders
## 6 6 6 6 6
## ️ StaUpHereTO CASE MISSED Tea
## 6 5 5 5 5
#remove the top words which don't generate insights, such as "the", "a", "and", etc.
word_freqs = word_freqs[-(1:5)] #Here 1:5 means the 1st-5th words in the list are removed
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot corpus in a clored graph; need RColorBrewer package
wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
#check the top 50 most mentioned words again after the removal
head(word_freqs, 50)
## conference good page Tech Fierce
## 19 19 19 14 11
## Founders topcompanies RTWant oppoun Want
## 11 11 10 10 9
## government Communitechs Marchand cheer compete
## 9 9 9 9 9
## RTJoin get doesnt enough friendly
## 8 7 7 7 7
## intends nation smallbiz tradersour trading
## 7 7 7 7 7
## great amp another tell About
## 7 7 7 6 6
## How Town staed RTCanada change
## 6 6 6 6 6
## agrand prize challenge discomfo leaders
## 6 6 6 6 6
## ️ StaUpHereTO CASE MISSED Tea
## 6 5 5 5 5
## YOU basics internet looking using
## 5 5 5 5 5
# I see some words I don't know or understand, so I can retrieve the tweets that contain them,
# as before, by calling grep() on sentitext$text
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.