---
title: "CS695: Week 3 wordCloud Notebook"
output:
  html_document:
    df_print: paged
---
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
Install the necessary packages. Comment these lines out after installation.
#install.packages('tm')
#install.packages('RColorBrewer')
#install.packages('wordcloud')
#install.packages("slam", type = "binary")
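As an alternative to commenting the install lines in and out by hand, the check can be scripted. The following is a minimal sketch, not part of the original notebook; `missing_packages` is a hypothetical helper name, and the `interactive()` guard is an assumption made so that rendering the notebook never triggers a download.

```r
# Hypothetical helper (not from the notebook): returns the names in `pkgs`
# that cannot be loaded, i.e. that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tm", "RColorBrewer", "wordcloud", "slam")
to_install <- missing_packages(needed)

# Assumption: only install in an interactive session, never while knitting.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

With this in place the chunk can stay in the notebook unchanged, since it does nothing once all four packages are present.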
Include the packages.
library('tm')
## Loading required package: NLP
library('RColorBrewer')
library('wordcloud')
library('slam')
Process data
entrepreneurshipData <- readRDS("entrepreneurship.RDS")
tweets <- entrepreneurshipData$text
# swap out all non-alphanumeric characters
# Note that the definition of a letter, a number, or a punctuation mark varies slightly with your locale, so you may need to experiment a little to get exactly what you want.
# str_replace_all(tweets, "[^[:alnum:]]", " ")
# iconv(tweets, from = 'UTF-8', to = 'ASCII//TRANSLIT')
# Encoding(tweets) <- "UTF-8"
# Function to clean tweets
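clean.text = function(x)
{
  # remove "rt"; the match is case-sensitive and not limited to whole words,
  # so "RT" survives while "startup" becomes "staup"
  x = gsub("rt", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove links (with punctuation already stripped, a URL is a single "http..." token)
  x = gsub("http\\w+", "", x)
  # delete runs of two or more spaces or tabs outright, which joins the
  # neighbouring words (e.g. "RT  The" becomes "RTThe")
  x = gsub("[ |\t]{2,}", "", x)
  # remove a blank space at the beginning
  x = gsub("^ ", "", x)
  # remove a blank space at the end
  x = gsub(" $", "", x)
  # lower-casing is left disabled
  # x = tolower(x)
  return(x)
}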
# clean tweets
tweets = clean.text(tweets)
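The word counts further down contain tokens such as "staup", "suppo", and "RTThe", which come from the `"rt"` pattern matching inside words and from whitespace runs being deleted rather than collapsed. The sketch below shows one possible stricter variant. It is not the notebook's function: the name `clean_text_safe` and the sample tweet are made up for illustration, and replacing punctuation with a space (rather than nothing) is a design assumption that splits contractions such as "Here's" into two tokens.

```r
# Hypothetical variant of clean.text; not part of the original notebook.
clean_text_safe <- function(x) {
  x <- gsub("http\\S+", " ", x)       # remove links before punctuation is stripped
  x <- gsub("\\bRT\\b", " ", x)       # remove the retweet marker as a whole word only
  x <- gsub("@\\w+", " ", x)          # remove @mentions
  x <- gsub("[[:punct:]]", " ", x)    # replace punctuation with a space
  x <- gsub("[[:digit:]]", " ", x)    # replace digits with a space
  x <- gsub("[[:space:]]+", " ", x)   # collapse whitespace runs to a single space
  trimws(x)                           # drop leading and trailing spaces
}

# Made-up sample tweet, for illustration only.
sample_tweet <- "RT @user: Startup support, 2 tips! https://t.co/abc"
clean_text_safe(sample_tweet)
## [1] "Startup support tips"
```

Swapping this in would change every count and tweet listing shown in the rest of the notebook, so the outputs below still reflect the original `clean.text`.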
Create word cloud of tweets
corpus = Corpus(VectorSource(tweets))
# create term-document matrix
tdm = TermDocumentMatrix(
corpus,
control = list(
wordLengths=c(3,20),
removePunctuation = TRUE,
stopwords = c("the", "a", stopwords("english")),
removeNumbers = TRUE,
# tolower may cause trouble on Windows because of UTF-8 encoding, so it is set to FALSE
tolower = FALSE) )
# convert to a dense matrix; this may consume nearly 1 GB of RAM
tdm = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
#check top 50 most mentioned words
head(word_freqs, 50)
## entrepreneurship Entrepreneurship business amp
## 1956 1654 967 966
## The will help entrepreneurs
## 648 567 448 412
## can young Business entrepreneur
## 397 385 344 342
## grow know now Nigeria
## 333 317 315 314
## government person RTIn appointment
## 314 312 307 301
## fast epitome new think
## 300 295 288 276
## Your people This Entrepreneurs
## 226 223 221 219
## RTThe someone You How
## 214 212 211 209
## Marketing inspired entrepre staup
## 209 205 201 200
## success staups appeal lose
## 198 196 191 191
## leadership Innovation like Mallya
## 190 187 187 185
## RTDear Vijay innovation great
## 185 185 184 184
## get social
## 183 182
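The notebook loads `slam` but computes the counts with `as.matrix()` and `rowSums()`, which builds a dense copy of the matrix. To my understanding, a `TermDocumentMatrix` from `tm` is stored as a `slam` simple triplet matrix, so `slam::row_sums()` should be usable on it directly. The sketch below demonstrates the idea on a small hand-built sparse matrix; the terms and counts in `toy_tdm` are made up for illustration.

```r
library(slam)

# Toy sparse term-document matrix: 3 terms x 4 documents (made-up counts).
toy_tdm <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L, 3L),   # term (row) index of each non-zero cell
  j = c(1L, 2L, 2L, 1L, 3L, 4L),   # document (column) index of each cell
  v = c(2, 1, 4, 1, 1, 3),         # count stored in each cell
  nrow = 3, ncol = 4,
  dimnames = list(Terms = c("grow", "help", "young"),
                  Docs = as.character(1:4))
)

# row_sums() operates on the sparse representation; no dense copy is built.
toy_freqs <- sort(row_sums(toy_tdm), decreasing = TRUE)
toy_freqs
```

Under that assumption, the equivalent call in the notebook would be `sort(row_sums(tdm), decreasing = TRUE)` applied before (or instead of) the `as.matrix()` step.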
# remove the five most frequent words, which add little insight here (see the list above)
word_freqs = word_freqs[-(1:5)] # "1:5" selects the 1st to 5th words in the list, which we want to remove
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot the corpus as a colored word cloud; requires the RColorBrewer package
wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
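Dropping entries by position (`-(1:5)`) depends on the current ranking, so it can silently remove different words if the data changes. A possible alternative is to drop words by name. The sketch below uses a toy vector built from counts shown in the output above; `drop_words` is a hypothetical list chosen for illustration.

```r
# Toy named frequency vector, using counts from the notebook's output.
toy_freqs <- c(entrepreneurship = 1956, Entrepreneurship = 1654,
               business = 967, amp = 966, The = 648,
               will = 567, help = 448)

# Hypothetical drop list; adjust it after inspecting head(word_freqs, 50).
drop_words <- c("entrepreneurship", "Entrepreneurship",
                "business", "amp", "The")

# Keep only the entries whose names are not in the drop list.
kept <- toy_freqs[!names(toy_freqs) %in% drop_words]
kept
## will help
##  567  448
```

The same filter leaves the vector untouched when none of the listed words occur, which a positional index would not.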
#check top 50 most mentioned words
head(word_freqs, 50)
## will help entrepreneurs can young
## 567 448 412 397 385
## Business entrepreneur grow know now
## 344 342 333 317 315
## Nigeria government person RTIn appointment
## 314 314 312 307 301
## fast epitome new think Your
## 300 295 288 276 226
## people This Entrepreneurs RTThe someone
## 223 221 219 214 212
## You How Marketing inspired entrepre
## 211 209 209 205 201
## staup success staups appeal lose
## 200 198 196 191 191
## leadership Innovation like Mallya RTDear
## 190 187 187 185 185
## Vijay innovation great get social
## 185 184 184 183 182
## students Development youth first businesses
## 173 173 170 169 168
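Because `tolower = FALSE` was set when building the matrix, the list counts "Innovation" and "innovation" (and similar pairs) separately. One way to merge them after counting is to group the frequencies by the lower-cased term. This is a minimal sketch on a toy vector taken from the counts above, and it assumes that lower-casing the term names works in your locale.

```r
# Toy named frequency vector, using counts from the notebook's output.
toy_freqs <- c(Business = 344, Innovation = 187, innovation = 184, great = 184)

# Group the counts by lower-cased term and add them up within each group.
groups <- split(toy_freqs, tolower(names(toy_freqs)))
merged <- sort(vapply(groups, sum, numeric(1)), decreasing = TRUE)
merged
## innovation   business      great
##        371        344        184
```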
# I see some words I don't know or understand, so I retrieve the tweets that contain them
# Here I retrieve all the tweets that contain "solutions"
index = grep("solutions", tweets)
tweets[index]
## [1] "Entrepreneurship and innovation to make provide lasting solutions to challenges facing Kenya from food insecurity t"
## [2] "Socent has reached beyond Silicon Valley amp utilizing its tools amp resources has found solutions to global problems"
## [3] "Did the things this past week with a small group of ladiesTalked entrepreneurship challenges and solutionsGav"
## [4] "RTCleantechCamp is a suppo program for entrepreneurship in the field of clean energy We look for solutions with"
## [5] "CleantechCamp is a suppo program for entrepreneurship in the field of clean energy We look for solutions w"
## [6] "RTGreat oppounity to promote entrepreneurship amp innovation in high school through designing solutions to improve mental h"
## [7] "RTGreat oppounity to promote entrepreneurship amp innovation in high school through designing solutions to improve mental h"
## [8] "RTGood morning\n\nRecruitment entrepreneurship teamtbs teamkhumo solutions wealth hireagraduate teamkhumo jobseekers"
## [9] "Congratulations you made it through January But how did you get on with your goals and resolutions Heres how t"
## [10] "RTGood morning\n\nRecruitment entrepreneurship teamtbs teamkhumo solutions wealth hireagraduate teamkhumo jobseekers"
## [11] "Good morning\n\nRecruitment entrepreneurship teamtbs teamkhumo solutions wealth hireagraduate teamkhumo"
## [12] "Day two out of fivelooking forward to it Yesterday was all about entrepreneurship problems and solutions"
## [13] "Why does every comrade have to include these points in argument\n Ad hominems\n Government only solutions\n Tax"
## [14] "ICYMI Lighting the way with innovative solutions Five Questions with Latif Jamani president of Calgary Lighting"
## [15] "Entrepreneurship Practical solutions means a billion company entrepreneurship womenpreneurs womeninbiz"
## [16] "RTCreative solutions for food waste water conservation and physical therapy impress judges at rd annualpitch competition"
## [17] "RTHave you met our newest GSBI coho Presenting our largest coho of socentsfocus on offgrid energy solutions"
## [18] "percent of all New Years resolutions never get fulfilled Check out thisfor some t"
## [19] "Creative solutions for food waste water conservation and physical therapy impress judges at rd annualpitch"
## [20] "RTAttention Calling on all students entrepreneurial structures at VUT You are invited to engage in providing solutions"
## [21] "RTGreat oppounity to promote entrepreneurship amp innovation in high school through designing solutions to improve mental h"
## [22] "RTGreat oppounity to promote entrepreneurship amp innovation in high school through designing solutions to improve mental h"
## [23] "Sta with problems Not solutionsThe StaupMedium\n\nentrepreneurship staup"
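Note that `grep("solutions", tweets)` matches substrings, which is why tweets containing only "resolutions" (entries [9] and [18]) appear in the list. If whole-word matches are wanted, a word-boundary pattern is one option. The sketch below uses three shortened strings adapted from the output above, purely as toy input.

```r
# Toy input: shortened strings adapted from the tweets listed above.
toy_tweets <- c(
  "Entrepreneurship problems and solutions",
  "New Years resolutions never get fulfilled",
  "Sta with problems Not solutionsThe StaupMedium"
)

# Substring match: also hits "resolutions" and "solutionsThe".
grep("solutions", toy_tweets)
## [1] 1 2 3

# Whole-word match: \\b requires a word boundary on both sides.
grep("\\bsolutions\\b", toy_tweets, perl = TRUE)
## [1] 1
```

The whole-word pattern also skips run-together tokens such as "solutionsThe", so it may hide tweets that were damaged by the cleaning step.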
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.