The Capstone Project is about predicting the next word of a sentence based on the words that precede it.
The dataset used for exploration and initial modeling is obtained from the Corpora collection, available at the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
This dataset was collected from publicly available sources by a web crawler. The data is classified into three main sources: blogs, news and twitter.
This report presents the first four tasks of the Capstone Project of the Data Science Specialization.
The following topics will be evaluated:
In this section the data will be downloaded and a very basic examination of the data will be carried out.
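A minimal sketch of the download step is shown below (only needed if the data is not yet on disk; the code in this report assumes the files have already been extracted to a local folder):

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## Download and unpack the archive only if it is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")
}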
## Setting the path where data is stored
setwd("C:/Users/andgo/Coursera - Data Science/Course 10 - Capstone Project/final/en_US")
## Loading texts from blog data
fileblog <- file("./en_US.blogs.txt", "r")
blogdata <-readLines(fileblog, encoding="latin1")
close(fileblog)
## Number of text samples obtained from blogs
length(blogdata)
## [1] 899288
## View of content of a random text
blogdata[sample(c(1:length(blogdata)), 1, replace = FALSE, prob = NULL)]
## [1] "In between races we high-five and giggle. This is sweet. We do it every morning. And by the second race (thanks to my coffee), I am totally into it - we have a blast!"
## Loading texts from twitter data
filetwitter <- file("./en_US.twitter.txt", "r")
twitterdata <-readLines(filetwitter, encoding="latin1")
close(filetwitter)
## Number of text samples obtained from twitter
length(twitterdata)
## [1] 2360148
## View of content of a random text
twitterdata[sample(c(1:length(twitterdata)), 1, replace = FALSE, prob = NULL)]
## [1] "Hhahaha I am so disappointed I'll never be able to try any of it. Cheesey for days"
## Loading texts from news data
filenews <- file("./en_US.news.txt", "r")
newsdata <-readLines(filenews, encoding="latin1")
close(filenews)
## Number of text samples obtained from news
length(newsdata)
## [1] 77259
## View of content of a random text
newsdata[sample(c(1:length(newsdata)), 1, replace = FALSE, prob = NULL)]
## [1] "The dominant force at Melbourne Park this century, Williams had lost only two matches at the Australian Open since winning the first of her five titles here in 2003. She was on a 17-match winning streak after capturing titles in 2009 and 2010 and missing last year due to injury."
The total numbers of entries for blogs, twitter and news are 899288, 2360148 and 77259, respectively.
In this section, the most common steps for cleaning raw text will be applied before further evaluation.
All exploratory analysis will be carried out on a sample of the data. The sample size is set to 1500 entries from each source: blogs, twitter and news. One sample from each source (blogs, twitter and news, respectively) is presented below, after a sketch of the sampling step.
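The sketch below is illustrative only: the seed is arbitrary, and the corpus objects corpusblog, corpustwitter and corpusnews used later in this report are assumed to be built this way with the tm package.

library(tm)
set.seed(1234)  ## arbitrary seed, for reproducibility only
sampleblog <- sample(blogdata, 1500)
sampletwitter <- sample(twitterdata, 1500)
samplenews <- sample(newsdata, 1500)
## One corpus per source, built from the sampled lines
corpusblog <- VCorpus(VectorSource(sampleblog))
corpustwitter <- VCorpus(VectorSource(sampletwitter))
corpusnews <- VCorpus(VectorSource(samplenews))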
## [1] "For three consecutive days beginning Saturday July 2nd, and every Sunday thereafter, The Altered Page will be presenting the results of this year's artist survey as a series of mini projects."
## [1] "He should be... he was in the damn movie!"
## [1] "\"Oh that we could outlaw all behavior that offends us!"
Next, the cleaning steps described above (conversion to lower case and removal of punctuation, numbers and special characters) are performed.
Important: the function to remove special characters was obtained from: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
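A sketch of these cleaning steps using the tm package is shown below. The special-character transformer follows the approach from the post linked above; the exact regular expression used in the original analysis may differ.

## Replace special characters by spaces (content_transformer approach from the linked post)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
cleanCorpus <- function(corpus) {
    corpus <- tm_map(corpus, toSpace, "[^[:alnum:][:space:]]")
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus
}
corpusblog <- cleanCorpus(corpusblog)
corpustwitter <- cleanCorpus(corpustwitter)
corpusnews <- cleanCorpus(corpusnews)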
The same samples are presented below to illustrate the cleaning process.
## [1] "for three consecutive days beginning saturday july nd and every sunday thereafter the altered page will be presenting the results of this years artist survey as a series of mini projects"
## [1] "he should be he was in the damn movie"
## [1] "oh that we could outlaw all behavior that offends us"
The next step removes inappropriate words that may offend the user. The list of profanity words was obtained from:
http://www.cs.cmu.edu/~biglou/resources/bad-words.txt
In my opinion, not all words in this list are inappropriate for users. A list of words to be removed from the initial filter was obtained from: https://rpubs.com/Nikotino/58395
profanity <- scan("./profanity.txt", character(0), sep = "\n", encoding="UTF-8")
profanity <- profanity[-(which(profanity %in% c(
    "refugee","reject","remains","screw","welfare","sweetness","shoot","sick",
    "shooting","servant","sex","radical","racial","racist","republican","public",
    "molestation","mexican","looser","lesbian","liberal","kill","killing","killer",
    "heroin","fraud","fire","fight","fairy","^die","death","desire","deposit",
    "crash","^crim","crack","^color","cigarette","church","^christ","canadian",
    "cancer","^catholic","cemetery","buried","burn","breast","^bomb","^beast",
    "attack","australian","balls","baptist","^addict","abuse","abortion","amateur",
    "asian","aroused","angry","arab","bible")))]
## removeWords expects a plain character vector of terms to remove
corpusblog <- tm_map(corpusblog, removeWords, profanity)
corpustwitter <- tm_map(corpustwitter, removeWords, profanity)
corpusnews <- tm_map(corpusnews, removeWords, profanity)
One common task in NLP is removing stopwords, which are very common words such as "the", "a", etc. This is important for computational optimization, as a fair amount of words is then not processed.
This may not be appropriate in this situation, as we want to predict the next word in a sentence and stopwords are themselves valid predictions. In this exploratory analysis, two sets of data will be evaluated: one with stopwords removed and one with stopwords kept.
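A sketch of how the second set could be produced, assuming the standard English stopword list shipped with tm:

## Second data set: same corpora, with English stopwords removed
corpusblogNS <- tm_map(corpusblog, removeWords, stopwords("english"))
corpustwitterNS <- tm_map(corpustwitter, removeWords, stopwords("english"))
corpusnewsNS <- tm_map(corpusnews, removeWords, stopwords("english"))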
Next, the n-grams will be created. In this report, 1-gram (single word), 2-gram and 3-gram tokenizations will be evaluated.
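A sketch of the tokenization step is shown below, assuming the RWeka package is used to build the 2-gram and 3-gram term-document matrices (the original analysis may have used a different tokenizer):

library(RWeka)
## Tokenizers for 2-grams and 3-grams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
## Term-document matrices for the blog corpus (twitter and news are analogous)
tdm1blog <- TermDocumentMatrix(corpusblog)
tdm2blog <- TermDocumentMatrix(corpusblog, control = list(tokenize = BigramTokenizer))
tdm3blog <- TermDocumentMatrix(corpusblog, control = list(tokenize = TrigramTokenizer))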
Now, the frequency of each word will be evaluated. The top 15 most frequent words for each source are shown below, after a short sketch of how they can be computed.
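The helper below is illustrative; here it is applied to the blog 1-gram matrix from the sketch above.

## Top n most frequent terms of a term-document matrix
topTerms <- function(tdm, n = 15) {
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    data.frame(Terms = names(freq)[1:n], Freq = freq[1:n], row.names = NULL)
}
topTerms(tdm1blog)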
## Blog.Terms B.Freq Twitter.Terms T.Freq News.Terms N.Freq
## 1 the 3117 the 583 the 2922
## 2 and 1850 you 365 and 1322
## 3 that 778 and 287 for 504
## 4 for 661 for 224 that 478
## 5 you 549 that 142 with 409
## 6 was 476 with 127 said 382
## 7 with 464 this 116 was 335
## 8 this 452 your 114 are 234
## 9 but 356 have 111 his 232
## 10 have 353 just 109 have 222
## 11 are 335 are 106 from 218
## 12 not 290 but 87 but 215
## 13 from 263 all 86 this 204
## 14 they 237 love 72 has 180
## 15 about 216 not 69 not 177
Now, we will repeat the previous task with the second dataset, without the stopwords.
## Blog.Terms B.Freq Twitter.Terms T.Freq News.Terms N.Freq
## 1 will 205 just 109 said 382
## 2 just 195 love 72 will 167
## 3 like 173 get 67 one 121
## 4 one 170 good 67 year 120
## 5 can 168 like 63 first 103
## 6 time 164 thanks 62 new 103
## 7 get 137 follow 54 two 102
## 8 know 113 day 53 time 101
## 9 people 113 know 53 just 88
## 10 now 107 one 52 can 86
## 11 make 98 time 52 years 83
## 12 new 97 will 51 also 82
## 13 first 91 can 50 like 82
## 14 little 91 great 48 last 77
## 15 also 90 dont 47 people 76
We can see that the most common words in the complete data set are almost all stopwords. The most frequent words after stopword removal have much lower counts and, for most sources, would barely reach the top 10 of the complete data set.
Further analysis will be carried out keeping the stopwords. This decision can be revisited in the future if the runtime of the prediction tool is too long.
Let's check the most common 2-grams and 3-grams:
## Blog.Terms B.Freq Twitter.Terms T.Freq News.Terms N.Freq
## 1 of the 289 for the 48 in the 283
## 2 in the 265 on the 45 of the 265
## 3 on the 138 in the 44 to the 131
## 4 to be 120 of the 35 on the 111
## 5 to the 114 thanks for 30 for the 96
## 6 for the 108 to be 29 at the 89
## 7 and the 105 to get 27 and the 78
## 8 i was 90 at the 26 to be 70
## 9 it was 84 to the 25 in a 67
## 10 in a 80 going to 24 of a 59
## 11 and i 78 do you 23 with the 56
## 12 at the 78 have a 23 with a 53
## 13 i have 78 if you 22 he said 52
## 14 i am 76 the best 22 and a 50
## 15 is a 74 i have 21 from the 50
## Blog.Terms B.Freq Twitter.Terms T.Freq News.Terms N.Freq
## 1 one of the 22 thanks for the 13 a lot of 18
## 2 a lot of 18 for the follow 8 in the first 13
## 3 you want to 16 do you know 7 going to be 12
## 4 as well as 13 i have to 6 said in a 12
## 5 i have a 12 i love you 6 one of the 11
## 6 it is a 12 a lot of 5 some of the 10
## 7 i had a 11 have a great 5 the united states 10
## 8 i want to 11 have to be 5 according to the 9
## 9 this is the 11 i want to 5 it was a 9
## 10 be able to 10 looking forward to 5 the end of 9
## 11 there is a 10 cant wait to 4 to be a 9
## 12 you have to 10 going to be 4 out of the 8
## 13 i wanted to 9 i dont know 4 be able to 7
## 14 some of the 9 i have a 4 end of the 7
## 15 the fact that 9 i need to 4 from to pm 7
In this section we will evaluate how many unique words are necessary to cover 90% of all word occurrences in the dataset.
The figure above shows that 90% of all word occurrences come from 52% of the unique words in the blog data. For the twitter data, 90% of all word occurrences come from 67% of the unique words. Finally, for the news data, 90% of all word occurrences come from 59% of the unique words.
This means that a relatively small set of unique words accounts for almost all word occurrences in the data set. This information will be very useful for optimization purposes: the number of words the predictor has to deal with can be reduced without a large loss in accuracy.
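A sketch of how these coverage percentages could be computed from a word-frequency vector (for example, the row sums of a 1-gram term-document matrix):

## Fraction of unique words needed to cover a target share of all occurrences
coverageFraction <- function(freq, target = 0.9) {
    freq <- sort(freq, decreasing = TRUE)
    needed <- which(cumsum(freq) / sum(freq) >= target)[1]
    needed / length(freq)
}
coverageFraction(rowSums(as.matrix(tdm1blog)))  ## about 0.52 for the blog sample, per the figure above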
The initial approach is to find the most common 2-grams with the analyzed word in first place. Then, the most common 3-grams with the analyzed word in second place will be found.
As an example, let's take the word "you" and perform the tasks described above.
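A sketch of this lookup, assuming the n-gram counts are stored in data frames with the same columns as the tables below (term.1, term.2, term.3, Freq); the data frame names are hypothetical:

## Most common 2-grams starting with "you"
bigramYou <- subset(bigramFreq, term.1 == "you")
head(bigramYou[order(-bigramYou$Freq), ])
## Most common 3-grams with "you" in second place
trigramYou <- subset(trigramFreq, term.2 == "you")
head(trigramYou[order(-trigramYou$Freq), ])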
## term.1 term.2 Freq
## 1 you can 43
## 2 you are 36
## 3 you have 30
## 4 you will 23
## 5 you want 21
## 6 you know 20
## term.1 term.2 term.3 Freq
## 1 if you are 8
## 2 if you want 5
## 3 so you can 5
## 4 do you think 4
## 5 do you want 4
## 6 where you can 4
Let’s evaluate the chance of a good prediction.
For a prediction based only on the 2-gram model, the total number of events is 529. The most common 2-gram starting with the word "you" has 43 events, so the chance of a good prediction with only one suggestion is 43/529 ≈ 8.1%. If the number of suggestions were increased to 3, the chance of a good prediction would increase to (43 + 36 + 30)/529 ≈ 20.6%.
Let's evaluate the chance of a good prediction with a 3-gram model. As an example, we will examine the phrase "if you".
## term.1 term.2 term.3 Freq
## 1 if you are 8
## 2 if you want 5
## 3 if you dont 3
## 4 if you have 3
## 5 if you enjoy 2
## 6 if you watch 2
In this example, the total number of events is 41. The most common 3-gram with the word "if" followed by "you" has 8 events, so the chance of a good prediction with only one suggestion is 8/41 ≈ 19.5%. If the number of suggestions were increased to 3, the chance of a good prediction would increase to (8 + 5 + 3)/41 ≈ 39%.