The purpose of this report is to present an initial exploratory analysis of the texts that will be used to train a predictive model. The model will be used to predict the next word as a user types.
The data was downloaded from here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
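For reference, a minimal sketch of how the file could be retrieved and unzipped (the local file name and working directory are assumptions, not part of the original analysis):

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip") # extracts the final/en_US/ files read below
}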
# Packages used throughout: stringi (word counts), tm (corpus and cleaning),
# RWeka (n-gram tokenization), knitr (tables)
library(stringi); library(tm); library(RWeka); library(knitr)
blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)
# Gather Stats for each dataset, blogs, news, and twitter
sumStatsBlogs <- cbind("Total Lines" = length(blogs), "Total Words" = sum(stri_count_words(blogs)), "Median Words Per Line" = median(stri_count_words(blogs)), "Mean Words Per Line" = mean(stri_count_words(blogs)), "Most Words Per Line" = max(stri_count_words(blogs)))
sumStatsNews <- cbind("Total Lines" = length(news), "Total Words" = sum(stri_count_words(news)), "Median Words Per Line" = median(stri_count_words(news)), "Mean Words Per Line" = mean(stri_count_words(news)), "Most Words Per Line" = max(stri_count_words(news)))
sumStatsTwitter <- cbind("Total Lines" = length(twitter), "Total Words" = sum(stri_count_words(twitter)), "Median Words Per Line" = median(stri_count_words(twitter)), "Mean Words Per Line" = mean(stri_count_words(twitter)), "Most Words Per Line" = max(stri_count_words(twitter)))
sumStats <- rbind(`row.names<-`(sumStatsBlogs, "Blogs"), `row.names<-`(sumStatsNews, "News"), `row.names<-`(sumStatsTwitter, "Twitter"))
kable(sumStats, format.args = list(decimal.mark =".", big.mark=",", nsmall = 0))
| | Total Lines | Total Words | Median Words Per Line | Mean Words Per Line | Most Words Per Line |
|---|---|---|---|---|---|
| Blogs | 899,288 | 38,154,238 | 29 | 42.42716 | 6,726 |
| News | 77,259 | 2,693,898 | 32 | 34.86840 | 1,123 |
| Twitter | 2,360,148 | 30,218,166 | 12 | 12.80350 | 60 |
The corpus provided is composed of three source types: blog posts, news articles, and tweets. The table above shows the line and word counts for each dataset. A line presumably represents a single topic and may be more or less verbose.
- Blogs have the most words and the most words per topic.
- Blogs may represent a more ‘natural’ way of communicating.
- News articles represent the smallest sample and appear more concise.
- True to the platform, tweets are numerous and each topic consists of about 12 words on average.
The sample corpus for this exploratory analysis will consist of as many lines as is practical given the available processing power: 50,000 blog lines, 25,000 news lines, and 200,000 tweets, about 275,000 lines in total (roughly 8% of all lines).
set.seed(1001) # So that this may be reproduced...
blogsC <- VCorpus(VectorSource(blogs[sample(1:length(blogs),50000)]))
newsC <- VCorpus(VectorSource(news[sample(1:length(news),25000)]))
twitterC <- VCorpus(VectorSource(twitter[sample(1:length(twitter),200000)]))
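# Illustrative check (not part of the original report): the sampled lines above
# amount to roughly 8% of the full corpus
(50000 + 25000 + 200000) / (length(blogs) + length(news) + length(twitter))
# ~ 0.08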
# Create a matrix to show an example line from each source before and after cleaning
BA_Matrix <- matrix(nrow = 3, ncol = 2)
dimnames(BA_Matrix) = list(c("Blogs", "News", "Twitter"), c("Before", "After"))
BA_Matrix[1,1] <- as.character(blogsC[[1]])
BA_Matrix[2,1] <- as.character(newsC[[1]])
BA_Matrix[3,1] <- as.character(twitterC[[2]])
The following code cleans the text using a number of functions from the tm package.
# Create a function to clean the text in the corpora
cleanC <- function(corpus) {
  # This was a useful reference: R_textMining.pdf document page 19,
  # and http://www.rdatamining.com/books/rdm/faq/removeurlsfromtext
  corpus <- tm_map(corpus, content_transformer(tolower)) # Convert everything to lowercase
  ID_URLs <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x) # Regular expression from the site above
  corpus <- tm_map(corpus, content_transformer(ID_URLs)) # Remove URLs
  badWords <- c("ass", "asshole", "cunt", "fuck", "goddamn", "motherfucker", "nigger", "shit")
  corpus <- tm_map(corpus, removeWords, badWords) # Professor said no profanity
  corpus <- tm_map(corpus, removePunctuation) # Remove punctuation
  corpus <- tm_map(corpus, removeNumbers) # Remove numbers
  corpus <- tm_map(corpus, stripWhitespace) # Remove extra white space (best to do this last)
  corpus # Explicitly return the cleaned corpus
}
# Run each corpus through cleaner function
blogsC <- cleanC(blogsC)
newsC <- cleanC(newsC)
twitterC <- cleanC(twitterC)
# Complete the matrix with the same example line from each source after cleaning
BA_Matrix[1,2] <- as.character(blogsC[[1]])
BA_Matrix[2,2] <- as.character(newsC[[1]])
BA_Matrix[3,2] <- as.character(twitterC[[2]])
The table below shows what the text looks like before and after cleaning:
kable(BA_Matrix)
| | Before | After |
|---|---|---|
| Blogs | Love the use of onomatopoeia, and I wish they made dumplings with just scallions and cabbage. But there are plenty of places on 8th where you can buy dumplings (veggie or otherwise) for 4 for a dollar or so. Several places that look like they don’t sell anything (just a white booth with a see-through window) actually sell delicious stuff to take home a cook. It’s inexpensive and usually quite good. Enjoy. | love the use of onomatopoeia and i wish they made dumplings with just scallions and cabbage but there are plenty of places on th where you can buy dumplings veggie or otherwise for for a dollar or so several places that look like they dont sell anything just a white booth with a seethrough window actually sell delicious stuff to take home a cook its inexpensive and usually quite good enjoy |
| News | “With no status in the country, the cycle can continue indefinitely, with the migrant re-traded once the employer no longer needs their services,” he said. | with no status in the country the cycle can continue indefinitely with the migrant retraded once the employer no longer needs their services he said |
| Twitter | don’t sleep on me yet I’m still getting calls! I promise you will see me soon;) | dont sleep on me yet im still getting calls i promise you will see me soon |
Count the most common sequences of words (n-grams)…
# Functions to return n-grams of length one through four
oneG <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
twoG <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
threeG <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
fourG <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
# Build a document-term matrix for each n-gram length, drop very sparse terms
# to keep the matrices manageable, and sort the terms by total frequency
blogsDTM <- DocumentTermMatrix(blogsC, control = list(tokenize = oneG))
blogsFreq_1 <- sort(colSums(as.matrix(removeSparseTerms(blogsDTM, .97))), decreasing = TRUE)
blogsDTM <- DocumentTermMatrix(blogsC, control = list(tokenize = twoG))
blogsFreq_2 <- sort(colSums(as.matrix(removeSparseTerms(blogsDTM, .99))), decreasing = TRUE)
blogsDTM <- DocumentTermMatrix(blogsC, control = list(tokenize = threeG))
blogsFreq_3 <- sort(colSums(as.matrix(removeSparseTerms(blogsDTM, .999))), decreasing = TRUE)
blogsDTM <- DocumentTermMatrix(blogsC, control = list(tokenize = fourG))
blogsFreq_4 <- sort(colSums(as.matrix(removeSparseTerms(blogsDTM, .999))), decreasing = TRUE)
# This code was run again for each source, but hidden for the purposes of
# keeping this summary concise
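As an illustration only (this is not the hidden code mentioned above), the resulting frequency vectors could be visualized with a simple base-graphics bar plot of the top terms:

# Plot the ten most frequent bigrams from the blogs sample (illustrative)
barplot(head(blogsFreq_2, 10), las = 2, cex.names = 0.7,
        main = "Most common bigrams in the blogs sample", ylab = "Frequency")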
Because this app will likely be used for mobile messaging or other short messages, the sample may be composed of strategically sized components from each source. For example, because tweets average only 12 words per line and may best represent how people would communicate with the app, the corpus could include a larger proportion of Twitter lines.
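As a rough sketch of that idea (the weighting below is a placeholder rather than a final choice, and `mix`, `combinedSample`, and `combinedC` are names introduced here for illustration):

set.seed(1001)
sampleSize <- 300000
mix <- c(blogs = 0.25, news = 0.15, twitter = 0.60) # assumed split favoring tweets
combinedSample <- c(sample(blogs, round(sampleSize * mix["blogs"])),
                    sample(news, round(sampleSize * mix["news"])),
                    sample(twitter, round(sampleSize * mix["twitter"])))
combinedC <- cleanC(VCorpus(VectorSource(combinedSample))) # clean with the same function as above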