---
title: "CS695: Week 3 wordCloud Notebook"
output:
  html_document:
    df_print: paged
---

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

Install the necessary packages. Comment these lines out after installation.

#install.packages('tm')
#install.packages('RColorBrewer')
#install.packages('wordcloud')
#install.packages("slam", type = "binary")
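
Rather than commenting the install lines in and out by hand, one option is to install only what is missing. This is a minimal sketch for the four packages listed above. The names `needed`, `is_installed`, and `missing_pkgs` are illustrative, and the `interactive()` guard is an added assumption so that non-interactive runs (for example, knitting) never trigger an install.

```r
# Packages used in this notebook.
needed <- c("tm", "RColorBrewer", "wordcloud", "slam")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Install only what is missing, and only in an interactive session (assumption).
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```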

Load the packages.

library('tm')
## Loading required package: NLP
library('RColorBrewer')
library('wordcloud')
library('slam')

Process data

entrepreneurshipData <- readRDS("entrepreneurship.RDS")
tweets <- entrepreneurshipData$text

# swap out all non-alphanumeric characters
# Note that what counts as a letter, a number, or a punctuation mark varies slightly with your locale, so you may need to experiment a little to get exactly what you want.
# str_replace_all(tweets, "[^[:alnum:]]", " ")
# iconv(tweets, from = 'UTF-8', to = 'ASCII//TRANSLIT')
# Encoding(tweets)  <- "UTF-8"

# Function to clean tweets
clean.text = function(x)
{
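  # remove "rt" (caution: this also strips "rt" inside words, e.g. "startup" becomes "staup",
  # and uppercase "RT" is left in place because the match is case-sensitive)
  x = gsub("rt", "", x)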
  # remove at
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove links http
  x = gsub("http\\w+", "", x)
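  # remove runs of two or more spaces, tabs, or "|" characters
  # (caution: replacing the run with "" joins the neighbouring words)
  x = gsub("[ |\t]{2,}", "", x)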
  # remove blank spaces at the beginning
  x = gsub("^ ", "", x)
  # remove blank spaces at the end
  x = gsub(" $", "", x)
  # tolower
  # x = tolower(x)
  return(x)
}

# clean tweets
tweets = clean.text(tweets)
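
The frequency listing further down contains fragments such as "staup", "suppo", and "RTIn". These are consistent with the `"rt"` pattern matching inside words and with the whitespace step joining neighbouring words. The sketch below is a stricter variant, not the notebook's own function. The name `clean_text_strict`, the step order, and the patterns are all assumptions chosen for illustration.

```r
# Hypothetical stricter cleaner; base R only.
clean_text_strict <- function(x) {
  # drop links first, while "://" and "." are still present
  x <- gsub("http\\S+", " ", x)
  # drop @mentions
  x <- gsub("@\\w+", " ", x)
  # drop the retweet marker only when it stands alone as a word (any case)
  x <- gsub("\\brt\\b", " ", x, ignore.case = TRUE)
  # replace punctuation and digits with a space so neighbouring words stay separate
  x <- gsub("[[:punct:][:digit:]]+", " ", x)
  # collapse runs of whitespace to a single space
  x <- gsub("[[:space:]]+", " ", x)
  # trim leading and trailing whitespace
  trimws(x)
}

clean_text_strict("RT @user: Startup support in 2019! http://t.co/abc")
```

Swapping this in for `clean.text` would change the word counts reported below, so the printed output in this notebook would no longer match.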

Create word cloud of tweets

corpus = Corpus(VectorSource(tweets))

# create term-document matrix
tdm = TermDocumentMatrix(
  corpus,
  control = list(
    wordLengths=c(3,20),
    removePunctuation = TRUE,
    stopwords = c("the", "a", stopwords("english")),
    removeNumbers = TRUE,
    # tolower may cause trouble on Windows because of UTF-8 encoding, so it is set to FALSE
    tolower = FALSE) )

# convert to a dense matrix; this may consume nearly 1 GB of RAM
tdm = as.matrix(tdm)

# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE) 
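
A `TermDocumentMatrix` from `tm` is stored as a sparse `simple_triplet_matrix` from `slam`, so the row totals can also be computed without the dense conversion. The sketch below uses a toy matrix with made-up counts (`toy_tdm` is hypothetical). On the real data the equivalent call would be `row_sums(tdm)`, applied before the `as.matrix()` step.

```r
library(slam)

# Toy sparse term-document matrix: 3 terms x 2 documents (made-up counts).
toy_tdm <- simple_triplet_matrix(
  i = c(1, 1, 2, 3),   # row (term) index of each non-zero entry
  j = c(1, 2, 1, 2),   # column (document) index of each non-zero entry
  v = c(2, 1, 1, 4),   # the counts themselves
  nrow = 3, ncol = 2,
  dimnames = list(Terms = c("grow", "help", "young"), Docs = c("1", "2"))
)

# Row totals computed directly on the sparse representation.
toy_freqs <- sort(row_sums(toy_tdm), decreasing = TRUE)
toy_freqs
```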

# check the 50 most frequent words
head(word_freqs, 50)
## entrepreneurship Entrepreneurship         business              amp 
##             1956             1654              967              966 
##              The             will             help    entrepreneurs 
##              648              567              448              412 
##              can            young         Business     entrepreneur 
##              397              385              344              342 
##             grow             know              now          Nigeria 
##              333              317              315              314 
##       government           person             RTIn      appointment 
##              314              312              307              301 
##             fast          epitome              new            think 
##              300              295              288              276 
##             Your           people             This    Entrepreneurs 
##              226              223              221              219 
##            RTThe          someone              You              How 
##              214              212              211              209 
##        Marketing         inspired         entrepre            staup 
##              209              205              201              200 
##          success           staups           appeal             lose 
##              198              196              191              191 
##       leadership       Innovation             like           Mallya 
##              190              187              187              185 
##           RTDear            Vijay       innovation            great 
##              185              185              184              184 
##              get           social 
##              183              182
# remove the top words that add little insight (here the search term in both spellings, "business", "amp", and "The")
word_freqs = word_freqs[-(1:5)]  # 1:5 selects the 1st through 5th words in the sorted list, which are dropped

# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
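
Dropping by position depends on the sort order, and with `tolower = FALSE` the same word is counted separately under each capitalisation. The sketch below merges case variants and then drops terms by name. The counts are copied from the top of the listing above; `toy_freqs`, `merged`, `drop_words`, and `kept` are illustrative names, and the drop list is only an example.

```r
# Counts copied from the top of the frequency listing above.
toy_freqs <- c(entrepreneurship = 1956, Entrepreneurship = 1654,
               business = 967, amp = 966, The = 648, will = 567,
               Business = 344)

# Merge case variants: sum the counts that share a lower-cased name.
groups <- split(toy_freqs, tolower(names(toy_freqs)))
merged <- sort(vapply(groups, sum, numeric(1)), decreasing = TRUE)

# Drop uninformative terms by name (hypothetical list; adjust as needed).
drop_words <- c("entrepreneurship", "amp", "the")
kept <- merged[!names(merged) %in% drop_words]
kept
```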

# Plot the corpus as a colored word cloud; requires the RColorBrewer package

wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE, colors=brewer.pal(8, "Dark2"))

# check the 50 most frequent words again, now without the five removed terms
head(word_freqs, 50)
##          will          help entrepreneurs           can         young 
##           567           448           412           397           385 
##      Business  entrepreneur          grow          know           now 
##           344           342           333           317           315 
##       Nigeria    government        person          RTIn   appointment 
##           314           314           312           307           301 
##          fast       epitome           new         think          Your 
##           300           295           288           276           226 
##        people          This Entrepreneurs         RTThe       someone 
##           223           221           219           214           212 
##           You           How     Marketing      inspired      entrepre 
##           211           209           209           205           201 
##         staup       success        staups        appeal          lose 
##           200           198           196           191           191 
##    leadership    Innovation          like        Mallya        RTDear 
##           190           187           187           185           185 
##         Vijay    innovation         great           get        social 
##           185           184           184           183           182 
##      students   Development         youth         first    businesses 
##           173           173           170           169           168
# I see some words I don't know or understand, so I retrieve the tweets that contain them
# Here I retrieve all the tweets that contain "solutions"

index = grep("solutions", tweets)
tweets[index]
##  [1] "Entrepreneurship and innovation to make provide lasting solutions to challenges facing Kenya from food insecurity t"        
##  [2] "Socent has reached beyond Silicon Valley amp utilizing its tools amp resources has found solutions to global problems"      
##  [3] "Did the things this past week with a small group of ladiesTalked entrepreneurship challenges and solutionsGav"              
##  [4] "RTCleantechCamp is a suppo program for entrepreneurship in the field of clean energy We look for solutions with"            
##  [5] "CleantechCamp is a suppo program for entrepreneurship in the field of clean energy We look for solutions w"                 
##  [6] "RTGreat oppounity to promote entrepreneurship amp innovation in high school through designing solutions to improve mental h"
##  [7] "RTGreat oppounity to promote entrepreneurship amp innovation in high school through designing solutions to improve mental h"
##  [8] "RTGood morning\n\nRecruitment entrepreneurship teamtbs teamkhumo solutions wealth hireagraduate teamkhumo jobseekers"       
##  [9] "Congratulations you made it through January But how did you get on with your goals and resolutions Heres how t"             
## [10] "RTGood morning\n\nRecruitment entrepreneurship teamtbs teamkhumo solutions wealth hireagraduate teamkhumo jobseekers"       
## [11] "Good morning\n\nRecruitment entrepreneurship teamtbs teamkhumo solutions wealth hireagraduate teamkhumo"                    
## [12] "Day two out of fivelooking forward to it Yesterday was all about entrepreneurship problems and solutions"                   
## [13] "Why does every comrade have to include these points in argument\n Ad hominems\n Government only solutions\n Tax"            
## [14] "ICYMI Lighting the way with innovative solutions Five Questions with Latif Jamani president of Calgary Lighting"            
## [15] "Entrepreneurship Practical solutions means a billion company entrepreneurship womenpreneurs womeninbiz"                     
## [16] "RTCreative solutions for food waste water conservation and physical therapy impress judges at rd annualpitch competition"   
## [17] "RTHave you met our newest GSBI coho Presenting our largest coho of socentsfocus on offgrid energy solutions"                
## [18] "percent of all New Years resolutions never get fulfilled Check out thisfor some t"                                          
## [19] "Creative solutions for food waste water conservation and physical therapy impress judges at rd annualpitch"                 
## [20] "RTAttention Calling on all students entrepreneurial structures at VUT You are invited to engage in providing solutions"     
## [21] "RTGreat oppounity to promote entrepreneurship amp innovation in high school through designing solutions to improve mental h"
## [22] "RTGreat oppounity to promote entrepreneurship amp innovation in high school through designing solutions to improve mental h"
## [23] "Sta with problems Not solutionsThe StaupMedium\n\nentrepreneurship staup"
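
The lookup above is a case-sensitive substring match, which is why tweets [9] and [18] appear: they contain "resolutions", not "solutions". The sketch below is a whole-word, case-insensitive alternative. `toy_tweets` is a hypothetical vector written to mimic the cleaned tweets.

```r
# Hypothetical examples modelled on the cleaned tweets above.
toy_tweets <- c(
  "Practical solutions for founders",
  "New Years resolutions never get fulfilled",
  "Solutions to global problems",
  "entrepreneurship challenges and solutionsGav"
)

# Substring match, as in the notebook: also hits "resolutions" and "solutionsGav".
substring_hits <- grep("solutions", toy_tweets)

# Whole-word, case-insensitive match.
word_hits <- grep("\\bsolutions\\b", toy_tweets, ignore.case = TRUE)

toy_tweets[word_hits]
```

The whole-word version misses words that the cleaning step has fused together (such as "solutionsGav"), so which pattern is preferable depends on how the text was cleaned.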

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.