Introduction

Finally, here we are at the first steps of the Capstone Project that completes the Data Science Specialization.
The goal is to gather a group of texts, as random and comprehensive as possible, into a bundle and analyse it statistically, searching for clues that could help predict the next letters in a word and the next words in a phrase.
There are many theories about how to do this: considering the roots of the language being analysed, creating hand-made rules, mixing different flavors of analysis. In this work, we will try to base the machine's decisions on statistics and on tools available to data scientists.

First step: create the bundle

SwiftKey, partner of the Johns Hopkins Bloomberg School of Public Health for the Coursera Data Science Specialization, provided three groups of texts:
texts from blogs
texts from news
texts from Twitter.
We need to know how big these files are and what the profile of their contents is.

#Set the work directory and inspect the number of lines of each file
setwd("C:/Coursera/10_Data_Science_Capstone")
linesBlogs <- readLines("en_US.blogs.txt")
length(linesBlogs)
## [1] 899289
linesNews <- readLines("en_US.news.txt")
length(linesNews)
## [1] 1010243
linesTwitter <- readLines("en_US.twitter.txt")
length(linesTwitter)
## [1] 2360149
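
Line counts alone do not show how heavy the files are on disk. As a minimal sketch, base R's file.info() can report the size of the same three files in megabytes:

# Approximate size of each file in megabytes (base R only)
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
round(file.info(files)$size / 1024^2, 1)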

We need to know whether there is enough text and whether the contents are diverse. So we will count lines and characters and then clean the texts, collapsing excess blank spaces and runs of repeated letters and removing some bad words that, for the purpose of this project, are not considered (a simple sketch of that cleaning appears after the sample summary below).

#Read a sample of 400 lines and make a summary to understand the data profile
linestored<-400
english_blogs<-readLines("en_US.blogs.txt", n=linestored)
english_news<-readLines("en_US.news.txt", n=linestored)
english_twitter<-readLines("en_US.twitter.txt", n=linestored)
bundle<-data.frame(Blogs=english_blogs, News=english_news, Twitter=english_twitter, stringsAsFactors=FALSE)
summary(bundle)
##     Blogs               News             Twitter         
##  Length:400         Length:400         Length:400        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
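
The collapsing of excess blank spaces and runs of repeated letters mentioned above can be sketched with base R's gsub(); the patterns are illustrative, not the final cleaning rules:

# Minimal sketch of the regex cleaning described above (illustrative patterns)
squeeze <- function(x) {
        x <- gsub("\\s+", " ", x)                                    # collapse runs of blank spaces
        x <- gsub("([a-z])\\1{2,}", "\\1\\1", x, ignore.case=TRUE)   # "soooo" becomes "soo"
        trimws(x)                                                    # drop leading/trailing blanks
}
squeeze("This  is   soooooo    cool ")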

The tm library provides tools to process texts, allowing some cleaning before the analysis begins. Thanks to Ingo Feinerer for his “Introduction to the tm Package - Text mining in R”.

library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
tokenmaker <- function(x) {
        corpus <- Corpus(VectorSource(x))                            # build a corpus from the character vector
        corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
        corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
        corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra blanks
        corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop English stop words
        corpus <- tm_map(corpus, removeNumbers)                      # drop digits
        corpus <- tm_map(corpus, PlainTextDocument)                  # coerce to plain text documents
        corpus <- tm_map(corpus, stemDocument)                       # reduce words to their stems
        corpus <- Corpus(VectorSource(corpus))                       # rebuild the corpus for matrix building
        return(corpus)
}


wordcounter <- function(x) {
        dtm <- DocumentTermMatrix(x)                    # one row per document, one column per term
        dtm_matrix <- as.matrix(dtm)
        word_freq <- colSums(dtm_matrix)                # total frequency of each term
        word_freq <- sort(word_freq, decreasing = TRUE)
        words <- names(word_freq)
        return(list(words, word_freq))                  # [[1]] terms, [[2]] their frequencies
}
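
As a quick sanity check, the two functions can be chained on a toy character vector (the sentences below are invented for illustration):

toy <- c("The cats were running fast.", "Running cats run faster than dogs!")
toy_words <- wordcounter(tokenmaker(toy))
toy_words[[2]]   # stemmed terms and their frequencies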

Then we will process the three columns, applying the two functions, making a summary of each kind of data, inspecting its profile by printing the first lines, and building a playful graphical expression of the 100 most common words in each subject. This graphical wordcloud gets a more “scientific” expression in the histogram displayed after it.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
blogs_token <- tokenmaker(bundle [,1])
blogs_words <- wordcounter(blogs_token)
summary(nchar(bundle [,1]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    47.0   180.5   250.2   361.5  1461.0
head(bundle [,1])
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"
wordcloud(blogs_words[[1]], blogs_words[[2]], max.words=100)

tdm_Blogs<-TermDocumentMatrix(blogs_token)
m_Blogs<-as.matrix(tdm_Blogs)
v_Blogs<-sort(rowSums(m_Blogs),decreasing=TRUE)
d_Blogs<-data.frame(word=names(v_Blogs),freq=v_Blogs)
head(v_Blogs, 25)
##  like  time  just   one  will   get  make  work   can   day  know  good 
##    74    69    58    57    54    52    45    40    38    38    38    34 
##   now   use  much think   way  year  come  want thing  even first peopl 
##    32    32    31    31    30    30    29    29    28    27    27    27 
##  back 
##    26
p<-ggplot(subset(d_Blogs, freq>30),  aes(word,freq)) 
p<-p+geom_bar(stat="identity")
p<-p+ theme(axis.text.x=element_text(angle=45,hjust=1))
p
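
By default ggplot2 orders the bars alphabetically. If a frequency-ordered chart is preferred, reorder() can be used inside aes(); this is a small optional tweak to the plot above:

p<-ggplot(subset(d_Blogs, freq>30),  aes(reorder(word, -freq),freq))
p<-p+geom_bar(stat="identity")
p<-p+ theme(axis.text.x=element_text(angle=45,hjust=1))
p<-p+xlab("word")
p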

news_token <- tokenmaker(bundle[,2])
news_words <- wordcounter(news_token)
summary(nchar(bundle [,2]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0   109.0   183.0   200.5   270.5   982.0
head(bundle [,2])
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
wordcloud(news_words[[1]],news_words[[2]], max.words=100)

tdm_News<-TermDocumentMatrix(news_token)
m_News<-as.matrix(tdm_News)
v_News<-sort(rowSums(m_News),decreasing=TRUE)
d_News<-data.frame(word=names(v_News),freq=v_News)
head(v_News, 25)
##   said   year   will   time    new    two  first   like   make school 
##    122     41     34     33     31     30     27     27     24     24 
##    one    day   last  polic   work    say   also   citi    get    now 
##     23     22     22     22     22     21     20     20     20     19 
##   come   just  peopl  state  offic 
##     18     18     18     18     17
p<-ggplot(subset(d_News, freq>20),  aes(word,freq)) 
p<-p+geom_bar(stat="identity")
p<-p+ theme(axis.text.x=element_text(angle=45,hjust=1))
p

twitter_token <- tokenmaker(bundle[,3])
head(twitter_token)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
twitter_words <- wordcounter(twitter_token)
summary(nchar(bundle [,3]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   34.75   63.50   66.56   92.25  140.00
head(bundle [,3])
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
wordcloud(twitter_words[[1]],twitter_words[[2]], max.words=100)

tdm_Twitter<-TermDocumentMatrix(twitter_token)
m_Twitter<-as.matrix(tdm_Twitter)
v_Twitter<-sort(rowSums(m_Twitter),decreasing=TRUE)
d_Twitter<-data.frame(word=names(v_Twitter),freq=v_Twitter)
head(v_Twitter, 25)
##    like     day     get    just     can    good    love   thank    will 
##      27      24      24      24      22      22      22      20      20 
##     one    know    need     new    dont   great    time  follow     now 
##      19      18      17      15      14      14      14      13      13 
##     got    show tonight    much   right     hey     let 
##      12      12      12      11      11      10      10
p<-ggplot(subset(d_Twitter, freq>15),  aes(word,freq)) 
p<-p+geom_bar(stat="identity")
p<-p+ theme(axis.text.x=element_text(angle=45,hjust=1))
p
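
To compare the three sources side by side, the frequency data frames built above (d_Blogs, d_News, d_Twitter) can be stacked with a source label and faceted; a minimal sketch:

d_Blogs$source<-"Blogs"; d_News$source<-"News"; d_Twitter$source<-"Twitter"
d_all<-rbind(d_Blogs, d_News, d_Twitter)
p<-ggplot(subset(d_all, freq>20),  aes(word,freq))
p<-p+geom_bar(stat="identity")
p<-p+facet_wrap(~source, scales="free")
p<-p+ theme(axis.text.x=element_text(angle=45,hjust=1))
p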

Second step: profanity filtering - removing profanity and other words you do not want to predict

After this task is executed, a new word count of the dataset is made, to know how many words remain in the bundle once the profane words listed in the badwords.txt file are removed.

# Print the number of words before filtering out the selected bad words
print(matrix(c("Blogs", "News", "Twitter", length(blogs_words[[1]]), length(news_words[[1]]),length(twitter_words[[1]])),nrow=3))
##      [,1]      [,2]  
## [1,] "Blogs"   "3581"
## [2,] "News"    "3359"
## [3,] "Twitter" "1429"
badwordremover <- function(x) {
        # Download the profanity list once, then strip those words from the corpus
        if (!file.exists("badwords.txt")){download.file(url="https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",destfile="badwords.txt")}
        profanity <- readLines("badwords.txt")
        corpus <- tm_map(x, removeWords, profanity)
        return(corpus)
}

blogs_token <- badwordremover(blogs_token)
blogs_words <- wordcounter(blogs_token)

news_token <- badwordremover(news_token)
news_words <- wordcounter(news_token)

twitter_token <- badwordremover(twitter_token)
twitter_words <- wordcounter(twitter_token)
# Print the number of words after removing the selected bad words
print(matrix(c("Blogs", "News", "Twitter", length(blogs_words[[1]]), length(news_words[[1]]),length(twitter_words[[1]])),nrow=3))
##      [,1]      [,2]  
## [1,] "Blogs"   "3577"
## [2,] "News"    "3359"
## [3,] "Twitter" "1429"

Third step: prepare for the next steps

We have managed the three text files, understood their profile, and prepared tools to start working with the whole bunch of texts, paying attention to the processing time and memory needed, and to start searching not only for single words but for pairs, trigrams and other sets of words and their frequency of occurrence.
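
As a preview of that next step, bigrams can already be counted with the NLP helpers that tm loads, following the tokenizer pattern from the tm FAQ; the sketch below reuses the blogs sample, and the tokenizer name is only illustrative:

# Sketch of a bigram count on the blogs sample, using the NLP helpers loaded with tm
BigramTokenizer <- function(x)
        unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
bigram_tdm <- TermDocumentMatrix(VCorpus(VectorSource(bundle[,1])),
                                 control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
head(bigram_freq, 10)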

Thanks for reading.