Exploratory Data Analysis

The purpose of this report is to present an initial exploratory analysis of the text data that will be used to train a predictive model. The model will be used to predict the next word as a user types.

Loading & summarizing the data

The data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
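A minimal sketch of that download step (the local zip file name is an assumption; extracting the archive produces the final/en_US/ files read below):

# Download and extract the corpus zip if it is not already present
if (!dir.exists("final/en_US")) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # creates the final/ directory used below
}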

# Libraries used throughout this analysis
library(stringi)  # word counts
library(tm)       # corpus handling and cleaning
library(RWeka)    # n-gram tokenization
library(knitr)    # kable tables

blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)

# Gather summary statistics for each dataset: blogs, news, and twitter
blogsWords <- stri_count_words(blogs)
sumStatsBlogs <- cbind("Total Lines" = length(blogs), "Total Words" = sum(blogsWords), "Median Words Per Line" = median(blogsWords), "Mean Words Per Line" = mean(blogsWords), "Most Words Per Line" = max(blogsWords))

newsWords <- stri_count_words(news)
sumStatsNews <- cbind("Total Lines" = length(news), "Total Words" = sum(newsWords), "Median Words Per Line" = median(newsWords), "Mean Words Per Line" = mean(newsWords), "Most Words Per Line" = max(newsWords))

twitterWords <- stri_count_words(twitter)
sumStatsTwitter <- cbind("Total Lines" = length(twitter), "Total Words" = sum(twitterWords), "Median Words Per Line" = median(twitterWords), "Mean Words Per Line" = mean(twitterWords), "Most Words Per Line" = max(twitterWords))

sumStats <- rbind(`row.names<-`(sumStatsBlogs, "Blogs"), `row.names<-`(sumStatsNews, "News"), `row.names<-`(sumStatsTwitter, "Twitter"))

kable(sumStats, format.args = list(decimal.mark =".", big.mark=",", nsmall = 0))
          Total Lines   Total Words   Median Words Per Line   Mean Words Per Line   Most Words Per Line
Blogs         899,288    38,154,238                      29              42.42716                 6,726
News           77,259     2,693,898                      32              34.86840                 1,123
Twitter     2,360,148    30,218,166                      12              12.80350                    60

Initial Observations

The corpus provided is composed of three source types: blog posts, news articles, and tweets. The table shows the number of lines and words found in each dataset. A line presumably represents a single topic or message and may be more or less verbose.

  • Blogs have the most words and the most words per topic

  • Blogs may represent a more ‘natural’ way of communicating

  • News articles represent the smallest sample and appear more concise

  • True to the platform, tweets are numerous, and each contains only about 13 words on average (median of 12)

Sampling

The sample corpus for this exploratory analysis will consist of as many lines as is practical given processing power: 50,000 blog lines, 25,000 news lines, and 200,000 tweets, for a total of 275,000 lines (roughly 8% of all lines).

set.seed(1001) # So that this may be reproduced...

blogsC <- VCorpus(VectorSource(blogs[sample(1:length(blogs),50000)]))
newsC <- VCorpus(VectorSource(news[sample(1:length(news),25000)]))
twitterC <- VCorpus(VectorSource(twitter[sample(1:length(twitter),200000)]))

# Create a matrix to show an example line from each source before and after cleaning
# (note that the second sampled tweet is used for Twitter)
BA_Matrix <- matrix(nrow = 3, ncol = 2)
dimnames(BA_Matrix) = list(c("Blogs", "News", "Twitter"), c("Before", "After"))
BA_Matrix[1,1] <- as.character(blogsC[[1]])
BA_Matrix[2,1] <- as.character(newsC[[1]])
BA_Matrix[3,1] <- as.character(twitterC[[2]])

Cleaning the Sample Corpus

The following code cleans the text using a number of functions from the tm package.

# Create a function to clean the text in the corpora
cleanC <- function(corpus) {

  # This was a useful reference: R_textMining.pdf document page 19,
  # and http://www.rdatamining.com/books/rdm/faq/removeurlsfromtext

  corpus <- tm_map(corpus, content_transformer(tolower))          # Convert everything to lowercase

  ID_URLs <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x) # Found regular expression from above web site 
  corpus <- tm_map(corpus, content_transformer(ID_URLs))          # Remove URLs

  badWords <- c("ass", "asshole", "cunt", "fuck", "goddamn", "motherfucker", "nigger", "shit")
  corpus <- tm_map(corpus, removeWords, badWords)                 # Professor said no profanity

  corpus <- tm_map(corpus, removePunctuation)                     # Remove punctuation
  corpus <- tm_map(corpus, removeNumbers)                         # Remove numbers
  corpus <- tm_map(corpus, stripWhitespace)                       # Collapse extra white space (best to do this last)
  corpus                                                          # Return the cleaned corpus
}

# Run each corpus through cleaner function
blogsC <- cleanC(blogsC)
newsC <- cleanC(newsC)
twitterC <- cleanC(twitterC)

# Fill in the matrix with the same example lines after cleaning
BA_Matrix[1,2] <- as.character(blogsC[[1]])
BA_Matrix[2,2] <- as.character(newsC[[1]])
BA_Matrix[3,2] <- as.character(twitterC[[2]])

The text before and after cleaning looks like this:

kable(BA_Matrix)
Blogs
  Before: Love the use of onomatopoeia, and I wish they made dumplings with just scallions and cabbage. But there are plenty of places on 8th where you can buy dumplings (veggie or otherwise) for 4 for a dollar or so. Several places that look like they don’t sell anything (just a white booth with a see-through window) actually sell delicious stuff to take home a cook. It’s inexpensive and usually quite good. Enjoy.
  After:  love the use of onomatopoeia and i wish they made dumplings with just scallions and cabbage but there are plenty of places on th where you can buy dumplings veggie or otherwise for for a dollar or so several places that look like they dont sell anything just a white booth with a seethrough window actually sell delicious stuff to take home a cook its inexpensive and usually quite good enjoy

News
  Before: “With no status in the country, the cycle can continue indefinitely, with the migrant re-traded once the employer no longer needs their services,” he said.
  After:  with no status in the country the cycle can continue indefinitely with the migrant retraded once the employer no longer needs their services he said

Twitter
  Before: don’t sleep on me yet I’m still getting calls! I promise you will see me soon;)
  After:  dont sleep on me yet im still getting calls i promise you will see me soon

Word Sequences (n-grams)

Count the most common sequences of words (n-grams):

# Tokenizer functions that return n-grams of a given length
oneG <- function(x) {NGramTokenizer(x, Weka_control(min = 1, max = 1))}
twoG <- function(x) {NGramTokenizer(x, Weka_control(min = 2, max = 2))}
threeG <- function(x) {NGramTokenizer(x, Weka_control(min = 3, max = 3))}
fourG <- function(x) {NGramTokenizer(x, Weka_control(min = 4, max = 4))}

blogsDTM <- DocumentTermMatrix(blogsC, control = list(tokenize = oneG))
blogsFreq_1 <- sort(colSums(as.matrix(removeSparseTerms(blogsDTM, .97))), decreasing = TRUE)

blogsDTM <- DocumentTermMatrix(blogsC, control = list(tokenize = twoG))
blogsFreq_2 <- sort(colSums(as.matrix(removeSparseTerms(blogsDTM, .99))), decreasing = TRUE)

blogsDTM <- DocumentTermMatrix(blogsC, control = list(tokenize = threeG))
blogsFreq_3 <- sort(colSums(as.matrix(removeSparseTerms(blogsDTM, .999))), decreasing = TRUE)

blogsDTM <- DocumentTermMatrix(blogsC, control = list(tokenize = fourG))
blogsFreq_4 <- sort(colSums(as.matrix(removeSparseTerms(blogsDTM, .999))), decreasing = TRUE)

# This code was run again for each source, but hidden for the purposes of 
# keeping this summary concise

Top 20 Unigrams

Top 20 Bigrams

Top 20 Trigrams

Top 20 Four-grams
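The bar charts themselves are not reproduced in this extract. A minimal sketch of how one of them (the blogs unigram chart) could be drawn from blogsFreq_1, assuming ggplot2 is used for plotting:

library(ggplot2)

# Top 20 most frequent unigrams in the blogs sample
top20 <- data.frame(term = names(blogsFreq_1)[1:20], freq = blogsFreq_1[1:20])

ggplot(top20, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Unigrams (Blogs)", x = "Unigram", y = "Frequency")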

Prediction Algorithm

Because this app will likely be used for mobile messaging or other short messages, the training sample may be composed of strategically sized portions of each source. For example, because Twitter lines average only about 13 words and may best represent how people will communicate with the app, the corpus could include a larger proportion of Twitter lines.
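As a rough sketch of that idea, the training corpus could be rebuilt by re-sampling the sources with different weights; the proportions below are illustrative assumptions, not final choices:

# Illustrative (not final) source weights for a Twitter-heavy training corpus
set.seed(1001)
trainLines <- c(sample(twitter, 150000),  # largest share: short, message-like text
                sample(blogs,    75000),  # longer, more 'natural' prose
                sample(news,     25000))  # smallest share: concise news text
trainC <- cleanC(VCorpus(VectorSource(trainLines)))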