Goal of the analysis

This report concentrates on Natural Language Processing (NLP), a field of data science concerned with the analysis of text. Our source material consists of data scraped from blogs, news articles and Twitter. The goal of this publication is to perform an initial exploratory analysis of that data: we take a look at the source material and outline its basic features.

Reading data

We start by reading the data as a corpus, that is, a collection of text documents. The corpus consists of three documents, containing data from blogs, news and Twitter respectively.
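The analysis relies on a number of R packages. The setup sketch below is an assumption inferred from the functions used later in the report, and the value of data.path is a hypothetical placeholder for the directory holding the three source files.

library(tm)           # corpus handling and text cleaning
library(ngram)        # wordcount()
library(RWeka)        # NGramTokenizer() for n-gram extraction
library(lexicon)      # profanity_banned and grady_augmented word lists
library(wordcloud)    # word cloud visualisation
library(RColorBrewer) # colour palettes for the word clouds
library(dplyr)        # data manipulation
library(tidyr)        # gather()
library(ggplot2)      # plotting

# hypothetical path to the directory containing the three en_US text files
data.path <- "data/final/en_US"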

collection <- VCorpus(DirSource(data.path, encoding = "UTF-8"))
summary(collection)
##                   Length Class             Mode
## en_US.blogs.txt   2      PlainTextDocument list
## en_US.news.txt    2      PlainTextDocument list
## en_US.twitter.txt 2      PlainTextDocument list

Basic features

To start the analysis, we look at basic features of the documents, such as their length, number of words and average number of characters per line.

textDesc <- function(txt) {
    #' returns basic statistics about a text: number of lines, number of
    #' words, mean characters per line and mean words per line
    lines <- length(txt)
    words <- wordcount(txt)
    avgChar <- round(mean(nchar(txt)), 2)
    avgWords <- round(words / lines, 2)
    c(lines = lines, words = words, avgChar = avgChar, avgWords = avgWords)
}

rbind(c(title = collection[[1]]$meta$id, textDesc(collection[[1]]$content)),
      c(title = collection[[2]]$meta$id, textDesc(collection[[2]]$content)),
      c(title = collection[[3]]$meta$id, textDesc(collection[[3]]$content)))
##      title               lines     words      avgChar  avgWords
## [1,] "en_US.blogs.txt"   "899288"  "37334131" "229.99" "41.52" 
## [2,] "en_US.news.txt"    "77259"   "2643969"  "202.43" "34.22" 
## [3,] "en_US.twitter.txt" "2360148" "30373543" "68.68"  "12.87"

We can see that the Twitter data contains the largest number of lines, but its lines are significantly shorter than those in the news or blog texts. It is also visible that the files are very large, so in order to perform more advanced analysis, we will only use a sample of each text.

# fix the random seed for reproducibility
set.seed(2137)

# sample 10,000 lines from each document and rebuild the corpus
blogs_sample <- sample(collection[[1]]$content, 10000)
news_sample <- sample(collection[[2]]$content, 10000)
twitter_sample <- sample(collection[[3]]$content, 10000)
collection <- VCorpus(VectorSource(list(blogs_sample,
                                        news_sample,
                                        twitter_sample)))

# restore meaningful document IDs
collection[[1]]$meta$id <- "blogs_sample"
collection[[2]]$meta$id <- "news_sample"
collection[[3]]$meta$id <- "twitter_sample"

Exploratory analysis

Cleaning the data

Before performing additional analyses, we clean up the data by performing the following transformations:

  • Converting to lowercase
  • Removing stopwords (the, of, etc.)
  • Removing profanity
  • Removing punctuation
  • Removing numbers
  • Stripping excess whitespace

This will allow us to look at more interesting and readable information.

# transforming to all lowercase
collection <- tm_map(collection, content_transformer(tolower))
# removing stopwords (English by default)
collection <- tm_map(collection, removeWords, stopwords())
# removing profanities, using the list of profanities from the lexicon package
data("profanity_banned")
collection <- tm_map(collection, removeWords, profanity_banned)
# removing punctuation (including Unicode punctuation)
collection <- tm_map(collection, removePunctuation, ucp = TRUE)
# removing numbers, as they are less interesting in text mining
collection <- tm_map(collection, removeNumbers)
# removing dollar signs, a very common character, but useless without numbers
subSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
collection <- tm_map(collection, subSpace, "\\$")
# stripping whitespace
collection <- tm_map(collection, stripWhitespace)

Exploration

We can visualise the most common words in each text by creating a word cloud for each document.

# BLOGS
wordcloud(collection[[1]]$content, max.words = 40, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

# NEWS
wordcloud(collection[[2]]$content, max.words = 40, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

# TWITTER
wordcloud(collection[[3]]$content, max.words = 40, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

To get a more precise look at the word frequencies, we will create frequency tables. Apart from single-word frequencies, we will also look at the 2- and 3-word phrases that appear most often in the texts.

# tokenizers producing 1-, 2- and 3-grams (RWeka), and the corresponding
# term-document matrices
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm1 <- TermDocumentMatrix(collection, control = list(tokenize = UnigramTokenizer))
tdm2 <- TermDocumentMatrix(collection, control = list(tokenize = BigramTokenizer))
tdm3 <- TermDocumentMatrix(collection, control = list(tokenize = TrigramTokenizer))
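To turn the term-document matrices into frequency tables, we can sum each term's occurrences across the three samples and keep the most frequent ones. The helper below is a small sketch introduced for illustration (topTerms is a hypothetical name, not part of the original code).

# hypothetical helper: the n most frequent terms in a term-document matrix
topTerms <- function(tdm, n = 10) {
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    data.frame(phrase = names(head(freq, n)), count = head(freq, n),
               row.names = NULL)
}

# most common single words across the three samples
topTerms(tdm1)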

Next, we take a look at bigrams: the 2-word phrases that are most common in the texts.
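A sketch of this lookup, reusing the hypothetical topTerms helper defined above:

topTerms(tdm2)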

Similarly, we take a look at the most common trigrams:
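Again using the hypothetical helper:

topTerms(tdm3)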

Language detection

Finally, we attempt a more demanding task: detecting foreign words in the text. To do this, we will use an English dictionary (grady_augmented) from the lexicon package.

# convert the unigram term-document matrix to a data frame with one row per word
tdm.df <- data.frame(
            phrase = rownames(as.matrix(tdm1)),
            as.matrix(tdm1)
            )

# share of dictionary (English) vs. non-dictionary words in each source
eng.words <- tdm.df %>%
    mutate(is.english = phrase %in% grady_augmented) %>%
    group_by(is.english) %>%
    summarise(sum_blogs = sum(blogs_sample),
              sum_news = sum(news_sample),
              sum_twitter = sum(twitter_sample)) %>%
    transmute(is.english = is.english,
              Blogs = sum_blogs / sum(sum_blogs),
              News = sum_news / sum(sum_news),
              Twitter = sum_twitter / sum(sum_twitter)) %>%
    gather(key = "Source", value = "Percent", -is.english)

g_english <- ggplot(data = eng.words,
                    aes(x = Source,
                        y = Percent,
                        fill = is.english,
                        label = round(Percent * 100, 0))) +
    geom_bar(stat = "identity") +
    geom_text(size = 3, position = position_stack(vjust = 0.5))

g_english

We see that the Twitter data contains the largest share of words identified as foreign. However, we have to take into account that spelling mistakes are also counted as foreign words. Hashtags made of many concatenated words (like #blackhistorymonth or #happyeaster) will likewise be treated as foreign, because they do not appear in the English dictionary. In order to better understand the data, we would need more advanced analytics, involving spelling-correction mechanisms.