Synopsis

The purpose of this document is to perform some natural language processing analysis on text samples from different sources: blogs, news and twitter. We took the English language as an easy first approach to learn the basic techniques that can be used on text samples; however, the algorithms and techniques presented here could be extended to other languages. Some R code is included in each chapter, but most of the source code can be found in the Appendix of this document.
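The code chunks in this report assume that the following packages are loaded. This list is inferred from the functions used throughout the document and is not shown explicitly in each chunk.

# packages assumed to be loaded for the code in this report
library(dplyr)      # data_frame(), count(), mutate(), anti_join()
library(tidytext)   # unnest_tokens(), stop_words dataset
library(tidyr)      # separate(), unite() in the n-gram stop word filter
library(ggplot2)    # comparison plot
library(stringr)    # str_replace_all() in the profanity filter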

Sample Data

The original sample data files contain more than 2 million lines. Analyzing samples of that size would require considerably more computing power, so we work with reduced versions of the files of approximately 1000 lines each. Each file is therefore a reduced version of the original, but the same analysis would be valid on the full text data. There are three groups: blogs, news and twitter. These categories identify the type of source where the text data was found.
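The exact procedure used to create the reduced files is not shown in this report; a minimal sketch of one possible way to do it, assuming the original file name en_US.twitter.txt, would be:

# possible way to build a reduced file (assumed original file name)
set.seed(123)
fullTwitter <- readLines("./en_US.twitter.txt", skipNul = TRUE)
writeLines(sample(fullTwitter, 1000), "./en_Red.twitter.txt")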

To read the files we use a custom function extractSet that we designed (see Appendix).

twitterLines <- extractSet("./en_Red.twitter.txt")
blogLines <- extractSet("./en_Red.blogs.txt")
newsLines <- extractSet("./en_Red.news.txt")

Number of lines in the reduced files

nrTwitterLines <- length(twitterLines)
nrTwitterLines
## [1] 1000
nrBlogLines <- length(blogLines)
nrBlogLines
## [1] 999
nrNewsLines <- length(newsLines)
nrNewsLines
## [1] 907

Profanity filter

We found a GitHub repository with bad-word lists in many languages: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words The English list is used inside our profanityFilter function (see Appendix) to remove bad words from our sample data, since we do not want to include them in the analysis.
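The list can be downloaded and saved as ./badWordsEN.txt, which is the file name the profanityFilter function expects. The raw URL below is an assumption based on the repository layout (one file per language, the English one named en).

# fetch the English list and save it with the name expected by profanityFilter
# (raw URL assumed from the repository layout)
badWordsURL <- "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
download.file(badWordsURL, destfile = "./badWordsEN.txt")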

We can then apply this function to our data:

# filter bad words
twitterLines <- profanityFilter(twitterLines)
blogLines <- profanityFilter(blogLines)
newsLines <- profanityFilter(newsLines)

Data frame format

To make the remaining transformations and algorithms easier to apply, we convert the source data into data frames.

# the line column indexes each entry from 1 to the number of lines in the source
twitterLinesDF <- data_frame(line = 1:length(twitterLines), text = twitterLines)
blogLinesDF <- data_frame(line = 1:length(blogLines), text = blogLines)
newsLinesDF <- data_frame(line = 1:length(newsLines), text = newsLines)

Tokenization

Let’s begin with the tokenization of the text. For that purpose we use the tidytext package, which provides the unnest_tokens() function.

# unnest_tokens() converts the text to lowercase by default (to_lower = TRUE),
# which makes it easier to compare terms and count frequencies;
# set to_lower = FALSE if you want to keep the original case
twitterTokensDF <- twitterLinesDF %>% unnest_tokens(word, text)
blogTokensDF <- blogLinesDF %>% unnest_tokens(word, text)
newsTokensDF <- newsLinesDF %>% unnest_tokens(word, text)

Removing stop words

Stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We use the stop_words dataset from tidytext to remove English stop words with dplyr’s anti_join() function. Stop word lists for other languages can be found here: https://github.com/dbpedia/fact-extractor/pull/21

data(stop_words)
twitterTokensDF <- twitterTokensDF %>% anti_join(stop_words)
## Joining, by = "word"
blogTokensDF <- blogTokensDF %>% anti_join(stop_words)
## Joining, by = "word"
newsTokensDF <- newsTokensDF %>% anti_join(stop_words)
## Joining, by = "word"

Word frequency

First we simply measure the frequency of words, using dplyr’s count() to find the most common words in the sample data. Let’s analyze each dataset separately.

twitter dataset

The most common words in this dataset are:

twitterTokensDF %>% count(word, sort = TRUE)
## # A tibble: 3,103 x 2
##      word     n
##     <chr> <int>
##  1      â    75
##  2   love    48
##  3    day    44
##  4     rt    37
##  5      ã    32
##  6   time    32
##  7    lol    28
##  8      å    24
##  9  night    24
## 10 follow    22
## # ... with 3,093 more rows

The top of this count ranking shows that there are still some tokens to remove, in particular ã, â and å, which are encoding artifacts rather than real words. We remove them from the dataset and count the word frequencies again.

mystopwords <- data_frame(word = c("ã", "â", "å"))
twitterTokensDF <- anti_join(twitterTokensDF, mystopwords, by = "word")
twitterTokensDF %>% count(word, sort = TRUE)
## # A tibble: 3,100 x 2
##       word     n
##      <chr> <int>
##  1    love    48
##  2     day    44
##  3      rt    37
##  4    time    32
##  5     lol    28
##  6   night    24
##  7  follow    22
##  8 tonight    21
##  9     hey    20
## 10       2    19
## # ... with 3,090 more rows

Some numbers have also been included as “words”. We could design a filter based on a regular expression to count how many isolated numbers there are in this dataset, but for now we keep them in the ranking because they reveal a kind of shorthand in the language that uses numbers: for instance, the number 2 can be used to express the word “to” and the number 4 can be used to express the word “for”. In summary, words such as “love”, “day” or “time” are the most used. The special token “rt” is used on twitter to indicate a retweet, which is why it ranks so high.
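As a quick illustration of such a regular-expression filter (not part of the original analysis), the number of purely numeric tokens in the twitter dataset could be counted like this:

# count the tokens that consist only of digits
numericTokens <- twitterTokensDF %>% filter(grepl("^[0-9]+$", word))
nrow(numericTokens)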

blogs dataset

The most common words in the dataset are:

blogTokensDF %>% count(word, sort = TRUE)
## # A tibble: 7,697 x 2
##      word     n
##     <chr> <int>
##  1      â   305
##  2   time   114
##  3    day    71
##  4 people    62
##  5     iâ    58
##  6   love    50
##  7    itâ    49
##  8  world    41
##  9   life    39
## 10    god    37
## # ... with 7,687 more rows

Again we have some meaningless tokens to remove: â, iâ and itâ (most likely encoding artifacts of contractions such as “I’m” and “it’s” rather than real words).

mystopwords <- data_frame(word = c("iâ", "â", "itâ"))
blogTokensDF <- anti_join(blogTokensDF, mystopwords, by = "word")
blogTokensDF %>% count(word, sort = TRUE)
## # A tibble: 7,694 x 2
##      word     n
##     <chr> <int>
##  1   time   114
##  2    day    71
##  3 people    62
##  4   love    50
##  5  world    41
##  6   life    39
##  7    god    37
##  8   days    35
##  9    lot    35
## 10   donâ    34
## # ... with 7,684 more rows

It is interesting that two of the most common words are again “time” and “day”. In blogs, however, “people” is mentioned more often than “love”.

news dataset

Finally we repeat the analysis with the news dataset.

newsTokensDF %>% count(word, sort = TRUE)
## # A tibble: 7,047 x 2
##      word     n
##     <chr> <int>
##  1      â   411
##  2      ã   138
##  3   time    51
##  4      å    41
##  5     10    40
##  6   home    40
##  7      ï    40
##  8 people    40
##  9 police    39
## 10      1    35
## # ... with 7,037 more rows

We need to remove the same tokens as for twitter, and additionally we remove ï and the numbers 10 and 1, which carry no meaning in this context.

mystopwords <- data_frame(word = c("ã", "â", "å", "ï","10","1"))
newsTokensDF <- anti_join(newsTokensDF, mystopwords, by = "word")
newsTokensDF %>% count(word, sort = TRUE)
## # A tibble: 7,041 x 2
##      word     n
##     <chr> <int>
##  1   time    51
##  2   home    40
##  3 people    40
##  4 police    39
##  5   game    35
##  6      2    34
##  7      3    33
##  8    day    30
##  9   team    30
## 10 school    29
## # ... with 7,031 more rows

The results here share the word “time” with the previous datasets and the word “people” with the blogs result; in addition, the word “home” plays an important role in the news dataset.

Comparison plot

We take the head of each result data frame to build a word frequency comparison across the three sources.

headTwitter <- head(twitterTokensDF %>% count(word, sort = TRUE))
headBlogs <- head(blogTokensDF %>% count(word, sort = TRUE))
headNews <- head(newsTokensDF %>% count(word, sort = TRUE))
headTwitter <- mutate(headTwitter,source = "twitter")
headBlogs <- mutate(headBlogs,source = "blogs")
headNews <- mutate(headNews,source = "news")
comparisonDF <- rbind(headTwitter,headBlogs,headNews)
comparisonDF$source <- as.factor(comparisonDF$source)
plotComparison <- comparisonDF %>% arrange(desc(n))

plotComparison %>%
  ggplot(aes(word, n, fill = source)) +
  geom_col() +
  labs(x = NULL, y = "count") +
  coord_flip()

In the graphic we can now see the words with the highest counts, and the different colours let us differentiate the contribution of each type of source. For instance, “time” appears in all three sources as a large common contributor, and “day” also ranks high in two sources, closely followed by the word “people”.
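An alternative view, not included in the original analysis, is to facet the counts by source so that each panel can be read independently. This is only a sketch of the idea:

# facet the comparison by source (sketch, not part of the original plot)
plotComparison %>%
  ggplot(aes(reorder(word, n), n, fill = source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ source, scales = "free") +
  labs(x = NULL, y = "count") +
  coord_flip()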

Word relationships

Now we are going to study the relationships between words in one of these three datasets. We choose the “news” dataset, but we increase the number of lines to close to 3000. The “ggraph” and “widyr” libraries can be used for this kind of analysis. Tokenizing into consecutive sequences of words, called n-grams, is available as an option of unnest_tokens().

newsLines <- extractSet("./en_Red3000.news.txt")
newsLines <- profanityFilter(newsLines)
newsLinesDF <- data_frame(line = 1:length(newsLines), text = newsLines)

2-grams frequency

In this model each token represents a pair of consecutive words. Overlapping is allowed, so the same word can be combined with different partners; for instance, “home alone” and “wasn’t home” both contain the word “home”.

newsBigrams <- newsLinesDF %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

We can perform the same analysis as before by examining the most common bigrams using dplyr’s count():

newsBigrams %>% count(bigram, sort = TRUE)
## # A tibble: 69,376 x 2
##     bigram     n
##      <chr> <int>
##  1  in the   495
##  2  of the   490
##  3   㢠㢠  465
##  4   ã<U+0083> 㢠  427
##  5  to the   235
##  6 for the   212
##  7  on the   196
##  8    㢠s   181
##  9    in a   159
## 10 and the   157
## # ... with 69,366 more rows

Again we have plenty of uninteresting stop words and artifact tokens, including “ã¢” and “ãf”. The filter used to remove stop words from the bigrams is provided in the Appendix.

newstopwords <- data_frame(word = c("ã¢", "ãf"))
filteredBigrams <- stopWordsFilter(newsBigrams,2,mystopwords = newstopwords)

If we now repeat the analysis:

filteredBigrams %>% count(bigram, sort = TRUE)
## # A tibble: 19,193 x 2
##            bigram     n
##             <chr> <int>
##  1       st louis    21
##  2    health care    17
##  3    los angeles    17
##  4            0 0    13
##  5 fountain parks    11
##  6            1 0    10
##  7  san francisco    10
##  8            1 2     9
##  9            2 1     9
## 10          7 p.m     9
## # ... with 19,183 more rows

The results show that, apart from the time references and numbers, the news contains a lot of information about “st louis”, “health care” and “los angeles”, which are the most common pairs found.
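Since ggraph was mentioned above but no network plot is shown, a possible sketch of how the filtered bigrams could be visualized as a word graph follows. It relies on igraph’s graph_from_data_frame(), an extra dependency not listed before, and the frequency cutoff of 5 is an arbitrary choice.

library(igraph)
library(ggraph)

# build a graph from the most frequent filtered bigrams
bigramGraph <- filteredBigrams %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(n > 5) %>%
  graph_from_data_frame()

# plot the word network
set.seed(2017)
ggraph(bigramGraph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)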

3-grams frequency

Let’s see the results with a 3-grams model.

newsTrigrams <- newsLinesDF %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
newstopwords <- data_frame(word = c("ã¢", "ãf"))
filteredTrigrams <- stopWordsFilter(newsTrigrams,3,mystopwords = newstopwords)
filteredTrigrams %>% count(trigram, sort = TRUE)
## # A tibble: 8,911 x 2
##                   trigram     n
##                     <chr> <int>
##  1            12u 14u 16u     7
##  2                  0 0 0     6
##  3            10u 12u 14u     5
##  4            14u 16u 18u     5
##  5               9 30 a.m     5
##  6 president barack obama     5
##  7    steamboat springs 1     5
##  8           world war ii     5
##  9               4 30 p.m     4
## 10     gov chris christie     4
## # ... with 8,901 more rows

The tokens “12u”, “14u” and so on might refer to units in a specific terminology, for example youth sports age divisions (12-and-under, 14-and-under). Apart from these, the most common three-word groups are “president barack obama”, “steamboat springs 1” and “world war ii”.

Conclusion

With this exploratory analysis we identified the most common words that appear in each of the three provided files. We also investigated the most common groups of two and three words in the “news” dataset. More cleaning could be applied to remove irrelevant tokens, but we decided to keep some of them to get a more realistic idea of the kind of information we might find in real data.

Appendix

extractSet <- function(filepath){
  con = file(filepath, "r")
  lines <- readLines(con)
  close(con)
  return(lines)
}
library(stringr)
profanityFilter <- function(textInput){
  # read the file with the bad words (one word per line)
  badWords <- extractSet("./badWordsEN.txt")
  # remove each bad word from the text; the word boundaries (\\b) ensure that
  # only whole words are removed, not substrings of longer, harmless words
  outputText <- textInput
  for(i in 1:length(badWords)){
    outputText <- str_replace_all(outputText, paste0("\\b", badWords[i], "\\b"), "")
  }
  return(outputText)
}
# the input data must contain a column named bigram for n = 2
# and trigram for n = 3. The third argument is optional but if provided
# it should contain only one column named "word"
stopWordsFilter <- function(data,n, mystopwords = NULL){
    data(stop_words)
    allStopW <- stop_words
    if ( !is.null(mystopwords)){
        # we have to add the extra column lexicon to be compatible with stop_words
        mystopwords <- mutate(mystopwords, lexicon = "SMART")
        allStopW <- rbind(allStopW,mystopwords)    
    }
    # split the two words
    if ( n == 2 ) {
        splitData <- data %>% separate(bigram, c("word1", "word2"), sep = " ")
        outputData <- splitData %>%
            filter(!word1 %in% allStopW$word) %>%
            filter(!word2 %in% allStopW$word)
        # reunite data
        outputData <- outputData %>% unite(bigram, word1, word2, sep = " ")
    }else if ( n == 3) {
        splitData <- data %>% separate(trigram, c("word1", "word2", "word3"), sep = " ")
        outputData <- splitData %>%
          filter(!word1 %in% allStopW$word) %>%
          filter(!word2 %in% allStopW$word) %>%
          filter(!word3 %in% allStopW$word)
        # reunite data
        outputData <- outputData %>% unite(trigram, word1, word2, word3, sep = " ")
    }else{
        return("Error: the number of the n-grams is wrong")
    }
    return(outputData)
}