The purpose of this document is to perform some natural language processing analysis on text samples from different sources: blogs, news and Twitter. We took the English language as an easy first approach to learn the basic techniques that can be used on text samples; however, the algorithms and techniques presented here could be extended to other languages. Some R code is included in each chapter, but most of the source code can be found in the Appendix of this document.
The original sample data files contain more than 2 million lines. Analyzing samples of that size would require more computing power, so we took a reduced version of the files with approximately 1000 lines each. Each file is therefore a reduced version of the original file, but the analysis remains valid for the whole text data. There are three groups: blogs, news and twitter. These categories identify the type of source where the text data was found.
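The reduced files were prepared beforehand; as a minimal sketch (not part of the original pipeline, and the full-corpus file name and fixed sample size are assumptions), such a reduced file could be produced like this:
# hedged sketch: draw a random sample of ~1000 lines from the full corpus file
fullLines <- readLines("./en_US.twitter.txt", warn = FALSE)  # assumed name of the full file
set.seed(123)                                                # reproducible sample
writeLines(sample(fullLines, 1000), "./en_Red.twitter.txt")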
To read the files we use a helper function extractSet that we designed (see Appendix).
twitterLines <- extractSet("./en_Red.twitter.txt")
blogLines <- extractSet("./en_Red.blogs.txt")
newsLines <- extractSet("./en_Red.news.txt")
nrTwitterLines <- length(twitterLines)
nrTwitterLines
## [1] 1000
nrBlogLines <- length(blogLines)
nrBlogLines
## [1] 999
nrNewsLines <- length(newsLines)
nrNewsLines
## [1] 907
We found a GitHub repository with bad-word lists in many languages: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words The English list is used inside our profanityFilter function (see Appendix) to remove bad words from the sample data, since we do not want to include them in the analysis.
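As a hedged sketch (not part of the original pipeline), the English list could be downloaded into the local file read by profanityFilter roughly as follows; the exact raw-file URL and branch name should be checked against the repository:
# hedged sketch: fetch the English bad-word list into the file used by profanityFilter
download.file("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",
              destfile = "./badWordsEN.txt")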
Thus we can apply this function to our data:
# filter bad words
twitterLines <- profanityFilter(twitterLines)
blogLines <- profanityFilter(blogLines)
newsLines <- profanityFilter(newsLines)
To simplify the remaining transformations and algorithms, we convert the source data into data frames.
# you have to specify the number of lines (size of twitterLines)
twitterLinesDF <- data_frame(line = 1:length(twitterLines), text = twitterLines)
blogLinesDF <- data_frame(line = 1:length(blogLines), text = blogLines)
newsLinesDF <- data_frame(line = 1:length(newsLines), text = newsLines)
Let’s begin with the tokenization of the text. For that purpose we use the tidytext package, which provides the unnest_tokens() function.
# unnest_tokens converts the text to lowercase by default, which helps when comparing terms and counting frequencies
# use to_lower = FALSE if you want to keep the original case
twitterTokensDF <- twitterLinesDF %>% unnest_tokens(word, text)
blogTokensDF <- blogLinesDF %>% unnest_tokens(word, text)
newsTokensDF <- newsLinesDF %>% unnest_tokens(word, text)
Stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can benefit from the stop_words dataset in tidytext to remove stop words in English with the anti_join() function. Stop words in other languages can be found here: https://github.com/dbpedia/fact-extractor/pull/21
data(stop_words)
twitterTokensDF <- twitterTokensDF %>% anti_join(stop_words)
## Joining, by = "word"
blogTokensDF <- blogTokensDF %>% anti_join(stop_words)
## Joining, by = "word"
newsTokensDF <- newsTokensDF %>% anti_join(stop_words)
## Joining, by = "word"
First we simply measure word frequencies using dplyr’s count() to find the most common words in the sample data. Let’s analyze each dataset separately.
The most common words in the Twitter dataset are:
twitterTokensDF %>% count(word, sort = TRUE)
## # A tibble: 3,103 x 2
## word n
## <chr> <int>
## 1 â 75
## 2 love 48
## 3 day 44
## 4 rt 37
## 5 ã 32
## 6 time 32
## 7 lol 28
## 8 å 24
## 9 night 24
## 10 follow 22
## # ... with 3,093 more rows
The first results of this count ranking show that there are still meaningless tokens to remove, in particular ã, â and å (most likely encoding artifacts). We remove them from the dataset and count the word frequencies again.
mystopwords <- data_frame(word = c("ã", "â", "å"))
twitterTokensDF <- anti_join(twitterTokensDF, mystopwords, by = "word")
twitterTokensDF %>% count(word, sort = TRUE)
## # A tibble: 3,100 x 2
## word n
## <chr> <int>
## 1 love 48
## 2 day 44
## 3 rt 37
## 4 time 32
## 5 lol 28
## 6 night 24
## 7 follow 22
## 8 tonight 21
## 9 hey 20
## 10 2 19
## # ... with 3,090 more rows
Some numbers have also been included as “words”. We could design a filter based on a regular expression to count how many isolated numbers there are in this dataset, but for now we keep them in the ranking because they reveal a kind of number-based shorthand in the language: for instance the number 2 can be used to express the word “to”, and the number 4 can be used to express the word “for”. In summary, words such as “love”, “day” or “time” are the most used. The special word “rt” is used on Twitter to indicate a retweet, which is why it ranks so high.
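A minimal sketch of such a regular-expression filter is shown below (an assumption, not part of the original analysis); it only counts the number tokens without removing them:
library(dplyr)
library(stringr)
# hedged sketch: count tokens that consist solely of digits
numberTokens <- twitterTokensDF %>% filter(str_detect(word, "^[0-9]+$"))
nrow(numberTokens)             # total count of numeric tokens
n_distinct(numberTokens$word)  # how many distinct numbers appear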
The most common words in the blogs dataset are:
blogTokensDF %>% count(word, sort = TRUE)
## # A tibble: 7,697 x 2
## word n
## <chr> <int>
## 1 â 305
## 2 time 114
## 3 day 71
## 4 people 62
## 5 iâ 58
## 6 love 50
## 7 itâ 49
## 8 world 41
## 9 life 39
## 10 god 37
## # ... with 7,687 more rows
Again we have some meaningless tokens to remove (most likely encoding artifacts, e.g. from curly apostrophes in “I’m” or “it’s”): â, iâ and itâ.
mystopwords <- data_frame(word = c("iâ", "â", "itâ"))
blogTokensDF <- anti_join(blogTokensDF, mystopwords, by = "word")
blogTokensDF %>% count(word, sort = TRUE)
## # A tibble: 7,694 x 2
## word n
## <chr> <int>
## 1 time 114
## 2 day 71
## 3 people 62
## 4 love 50
## 5 world 41
## 6 life 39
## 7 god 37
## 8 days 35
## 9 lot 35
## 10 donâ 34
## # ... with 7,684 more rows
It is interesting that two of the most common words are again “time” and “day”. However, in blogs “people” is mentioned more often than “love”.
Finally we repeat the analysis with the news dataset.
newsTokensDF %>% count(word, sort = TRUE)
## # A tibble: 7,047 x 2
## word n
## <chr> <int>
## 1 â 411
## 2 ã 138
## 3 time 51
## 4 å 41
## 5 10 40
## 6 home 40
## 7 ï 40
## 8 people 40
## 9 police 39
## 10 1 35
## # ... with 7,037 more rows
We need to remove the same meaningless tokens as for the Twitter dataset (ã, â and å), and additionally ï and the numbers 10 and 1, which carry no meaning in this context.
mystopwords <- data_frame(word = c("ã", "â", "å", "ï","10","1"))
newsTokensDF <- anti_join(newsTokensDF, mystopwords, by = "word")
newsTokensDF %>% count(word, sort = TRUE)
## # A tibble: 7,041 x 2
## word n
## <chr> <int>
## 1 time 51
## 2 home 40
## 3 people 40
## 4 police 39
## 5 game 35
## 6 2 34
## 7 3 33
## 8 day 30
## 9 team 30
## 10 school 29
## # ... with 7,031 more rows
The results share the word “time” with both previous datasets and the word “people” with the blogs dataset; the word “home” also plays an important role here.
We take the head of the result data frames to build a word comparison across the three sources.
headTwitter <- head(twitterTokensDF %>% count(word, sort = TRUE))
headBlogs <- head(blogTokensDF %>% count(word, sort = TRUE))
headNews <- head(newsTokensDF %>% count(word, sort = TRUE))
headTwitter <- mutate(headTwitter,source = "twitter")
headBlogs <- mutate(headBlogs,source = "blogs")
headNews <- mutate(headNews,source = "news")
comparisonDF <- rbind(headTwitter,headBlogs,headNews)
comparisonDF$source <- as.factor(comparisonDF$source)
plotComparison <- comparisonDF %>% arrange(desc(n))
plotComparison %>%
  ggplot(aes(word, n, fill = source)) +
  geom_col() +
  labs(x = NULL, y = "count") +
  coord_flip()
In the graphic we can now see visually which words have the highest counts, and the different colours show how much each type of source contributes. For instance “time” appears in all three sources as a big contributor, and “day” also ranks high in two sources, closely followed by the word “people”.
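As a small variation on the plot above (a hedged sketch, not the original figure), the bars could be ordered by count using reorder(), assuming ggplot2 is loaded as before:
# hedged sketch: same comparison plot with bars sorted by count
plotComparison %>%
  ggplot(aes(reorder(word, n), n, fill = source)) +
  geom_col() +
  labs(x = NULL, y = "count") +
  coord_flip()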
Now we study the relationships between words in one of the three datasets. We choose the “news” dataset, but increase the number of lines to roughly 3000. The “ggraph” and “widyr” libraries are used. Tokenizing into consecutive sequences of words, called n-grams, is available as an option of unnest_tokens().
newsLines <- extractSet("./en_Red3000.news.txt")
newsLines <- profanityFilter(newsLines)
newsLinesDF <- data_frame(line = 1:length(newsLines), text = newsLines)
In this model each token represents a pair of words. Overlapping is allowed, so the same word can be combined with different partners; for instance we see both “home alone” and “wasn’t home”.
newsBigrams <- newsLinesDF %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
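To illustrate the overlap concretely, here is a tiny toy example (the sentence is made up, not taken from the data):
# hedged toy example: one sentence split into overlapping bigrams
toyDF <- data_frame(line = 1, text = "She was home alone all day")
toyDF %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
# yields "she was", "was home", "home alone", "alone all", "all day"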
We can perform the same analysis as before by examining the most common bigrams using dplyr’s count():
newsBigrams %>% count(bigram, sort = TRUE)
## # A tibble: 69,376 x 2
## bigram n
## <chr> <int>
## 1 in the 495
## 2 of the 490
## 3 㢠㢠465
## 4 ã<U+0083> 㢠427
## 5 to the 235
## 6 for the 212
## 7 on the 196
## 8 㢠s 181
## 9 in a 159
## 10 and the 157
## # ... with 69,366 more rows
Again we have plenty of uninteresting tokens, including “ã¢” and “ãf”. The filter used to remove stop words from the bigrams (stopWordsFilter) is provided in the Appendix.
newstopwords <- data_frame(word = c("ã¢", "ãf"))
filteredBigrams <- stopWordsFilter(newsBigrams,2,mystopwords = newstopwords)
If we now repeat the analysis:
filteredBigrams %>% count(bigram, sort = TRUE)
## # A tibble: 19,193 x 2
## bigram n
## <chr> <int>
## 1 st louis 21
## 2 health care 17
## 3 los angeles 17
## 4 0 0 13
## 5 fountain parks 11
## 6 1 0 10
## 7 san francisco 10
## 8 1 2 9
## 9 2 1 9
## 10 7 p.m 9
## # ... with 19,183 more rows
The results show that, apart from references to times of day and other numbers, the news contains a lot of information about “los angeles”, “st louis” and “health care”, which are the most common pairs found.
Let’s see the results with a trigram (3-gram) model.
newsTrigrams <- newsLinesDF %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
newstopwords <- data_frame(word = c("ã¢", "ãf"))
filteredTrigrams <- stopWordsFilter(newsTrigrams,3,mystopwords = newstopwords)
filteredTrigrams %>% count(trigram, sort = TRUE)
## # A tibble: 8,911 x 2
## trigram n
## <chr> <int>
## 1 12u 14u 16u 7
## 2 0 0 0 6
## 3 10u 12u 14u 5
## 4 14u 16u 18u 5
## 5 9 30 a.m 5
## 6 president barack obama 5
## 7 steamboat springs 1 5
## 8 world war ii 5
## 9 4 30 p.m 4
## 10 gov chris christie 4
## # ... with 8,901 more rows
The tokens “12u”, “14u” and so on might refer to units in some specific terminology, for example age divisions in sports listings. Apart from those, the most common three-word groups are “president barack obama”, “steamboat springs 1” and “world war ii”.
With this exploratory analysis we identified the most common words that appear in each of the three provided files. We also investigated the most common groups of two and three words in the “news” dataset. More cleaning could be applied to remove irrelevant tokens, but we decided to keep some of them to get a more realistic idea of the kind of information we might find out there.
extractSet <- function(filepath){
  con <- file(filepath, "r")
  lines <- readLines(con)
  close(con)
  return(lines)
}
library(stringr)
profanityFilter <- function(textInput){
  # read the file with the bad words
  badWords <- extractSet("./badWordsEN.txt")
  # traverse all the bad words and look if there is a match in the text
  outputText <- textInput
  for(i in 1:length(badWords)){
    outputText <- str_replace_all(outputText, badWords[i], "")
  }
  return(outputText)
}
# the input data must contain a column named bigram for n = 2
# and trigram for n = 3. The third argument is optional but if provided
# it should contain only one column named "word"
stopWordsFilter <- function(data, n, mystopwords = NULL){
  data(stop_words)
  allStopW <- stop_words
  if (!is.null(mystopwords)){
    # we have to add the extra column lexicon to be compatible with stop_words
    mystopwords <- mutate(mystopwords, lexicon = "SMART")
    allStopW <- rbind(allStopW, mystopwords)
  }
  # split the words of the n-gram
  if (n == 2) {
    splitData <- data %>% separate(bigram, c("word1", "word2"), sep = " ")
    outputData <- splitData %>%
      filter(!word1 %in% allStopW$word) %>%
      filter(!word2 %in% allStopW$word)
    # reunite data
    outputData <- outputData %>% unite(bigram, word1, word2, sep = " ")
  } else if (n == 3) {
    splitData <- data %>% separate(trigram, c("word1", "word2", "word3"), sep = " ")
    outputData <- splitData %>%
      filter(!word1 %in% allStopW$word) %>%
      filter(!word2 %in% allStopW$word) %>%
      filter(!word3 %in% allStopW$word)
    # reunite data
    outputData <- outputData %>% unite(trigram, word1, word2, word3, sep = " ")
  } else {
    return("Error: the number of the n-grams is wrong")
  }
  return(outputData)
}