The purpose of this analysis is to explore the text files. The results of this initial exploration are necessary for the further development of the word prediction application.
The issues I want to explore are described in the sections below.
setwd("~/R files/Natural Language Processing/Coursera-SwiftKey/en_US")
Sys.setlocale(category = "LC_ALL", locale = "English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
blogs<-readLines("en_US.blogs.txt", encoding="UTF-8")
news<-readLines("en_US.news.txt", encoding="UTF-8")
## Warning: incomplete final line found on 'en_US.news.txt'
twitter<-readLines("en_US.twitter.txt")
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
For the initial analysis I take 5K-line samples from each file.
sampleblog<-sample(blogs,5000)
samplenews<-sample(news, 5000)
sampletwitter<-sample(twitter, 5000)
#dictionary of profanity words
badwords<-readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
# Remove terms that are not really profanity from the dictionary.
# Note: %in% matches entries literally, so regex-style items such as "^die" or "^bomb" have no effect here.
keep <- c("refugee","reject","remains","screw","welfare","sweetness","shoot","sick","shooting",
          "servant","sex","radical","racial","racist","republican","public","molestation","mexican",
          "looser","lesbian","liberal","kill","killing","killer","heroin","fraud","fire","fight",
          "fairy","^die","death","desire","deposit","crash","^crim","crack","^color","cigarette",
          "church","^christ","canadian","cancer","^catholic","cemetery","buried","burn","breast",
          "^bomb","^beast","attack","australian","balls","baptist","^addict","abuse","abortion",
          "amateur","asian","aroused","angry","arab","bible")
badwords <- badwords[!(badwords %in% keep)]
Let’s take the first 100 words of the twitter text as an example, before and after cleaning.
## [1] "never know where news will take you. Today: accused serial stalker's home. Out of jail-wants to clear name. His Interview @ 11 Sentence found in the YMCA men’s room: “He fell out of love with his sideburns but they refused to leave on their own.” Can anyone claim? Actually I forgot not likes to throw mike under the bus! \"If you don't like change you're going to like irrelevance even less\" -US Army Gen Shinseki via Show tomorrow night, 7pm at Clearwater Theater in West Dundee with our friends Indolent, Zero to End, and Dysfunctional Mariachi. $8 for tix \"It"
## [1] "never know where news will take you STOP today accused serial stalker's home STOP out of jailwants to clear name STOP his interview NUM sentence found in the ymca menвs room вhe fell out of love with his sideburns but they refused to leave on their own STOPв can anyone claim STOP actually i forgot not likes to throw mike under the bus STOP if you don't like change you're going to like irrelevance even less us army gen shinseki via show tomorrow night NUM at clearwater theater in west dundee with our friends indolent zero to end and"
Let’s look at the top 10 words, top 2-grams and top 3-grams of the texts.
## Warning: package 'tm' was built under R version 3.1.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.1.2
## Warning: package 'RWeka' was built under R version 3.1.2
## Warning: package 'reshape2' was built under R version 3.1.1
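The word and n-gram frequency tables below were built with the tm and RWeka packages; a minimal sketch of how such counts can be produced (object names here are illustrative, not necessarily those used in this report):
library(tm)
library(RWeka)

# Illustrative: count 2-grams in the cleaned blog sample
corpus <- VCorpus(VectorSource(sampleblog))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
bigram_counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(bigram_counts, 10)  # the 10 most frequent 2-grams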
## blog news twitter
## [1,] "the" "the" "the"
## [2,] "and" "and" "you"
## [3,] "that" "NUM" "NUM"
## [4,] "NUM" "for" "and"
## [5,] "for" "that" "for"
## [6,] "you" "with" "that"
## [7,] "with" "said" "profanity"
## [8,] "was" "was" "this"
## [9,] "this" "his" "with"
## [10,] "have" "from" "your"
## blog news twitter
## [1,] "of the" "in the" "i m"
## [2,] "in the" "of the" "for the"
## [3,] "to the" "to the" "in the"
## [4,] "to be" "for the" "it s"
## [5,] "on the" "on the" "don t"
## [6,] "and the" "at the" "of the"
## [7,] "i was" "in NUM" "can t"
## [8,] "for the" "it s" "on the"
## [9,] "and i" "in a" "thanks for"
## [10,] "it is" "the NUM" "you re"
## blog news twitter
## [1,] "one of the" "more than NUM" "thanks for the"
## [2,] "a lot of" "one of the" "can t wait"
## [3,] "i don t" "NUM percent of" "i don t"
## [4,] "some of the" "a lot of" "i can t"
## [5,] "to be a" "in the NUM" "looking forward to"
## [6,] "i m not" "NUM and NUM" "i want to"
## [7,] "i didn t" "i don t" "t wait to"
## [8,] "in the NUM" "NUM NUM NUM" "going to be"
## [9,] "this is a" "NUM to NUM" "i ll be"
## [10,] "as well as" "in the first" "for the follow"
The conclusions I can make from this first review:
- the most used phrases are fairly similar across the 3 texts, but I’ll explore this deeper in the next sections;
- the news text uses many more numeric tokens than the other two texts;
- the twitter text appears to contain more profanity than the other two texts.
Let’s compare the most used words and phrases of the 3 files. The chart below shows the % of common words between the blog, twitter and news texts.
## Warning: package 'ggplot2' was built under R version 3.1.1
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
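The chart itself is produced with ggplot2; the underlying overlap could be computed along these lines (blog_words, news_words and twitter_words stand for the cleaned, tokenised word vectors and are assumptions, not objects from this report):
# Illustrative: share of one text's unique words that also occur in another text
common_share <- function(words_a, words_b) {
  round(100 * mean(unique(words_a) %in% unique(words_b)), 1)
}
common_share(blog_words, news_words)      # expected around 80% judging by the chart
common_share(news_words, twitter_words)   # expected around 60% judging by the chart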
The above chart shows that the similarity is highest between the blog and news texts (~80% of the dictionary in common) and lowest between the news and twitter texts (~60% of the dictionary).
Let’s also look at the % of profanity words in the 3 text files.
## [1] "0.47%"
## [1] "1.04%"
## [1] "0.32%"
The chart below shows what share of the dictionary is required to cover a given share of the text, for the 3 types of texts.
The chart shows that the majority of the blog text can be covered by a smaller share of its vocabulary than the news and twitter texts. To cover 90% of the text, I need only about 34% of the unique words for the blog text, but around 50% for twitter. This means the blog text has a larger share of rarely used words than twitter and news.
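A sketch of how such a coverage curve can be derived from the word frequencies (assuming blog_words holds the tokenised blog sample):
# Illustrative coverage curve: share of the text covered as more of the dictionary is used
word_freq   <- sort(table(blog_words), decreasing = TRUE)
coverage    <- cumsum(word_freq) / sum(word_freq)        # share of the text covered
vocab_share <- seq_along(coverage) / length(coverage)    # share of the dictionary used
min(vocab_share[coverage >= 0.9])                        # smallest dictionary share covering 90% of the text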
The approach I’m going to apply in the application is to label words that were used only once in the corpus as a generic “rare word” token. Let’s see what share of the text would be covered in this case.
## [1] 0.9308
## [1] 0.8757
## [1] 0.9221
This approach would give around 90% coverage for all texts.
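These coverage figures follow directly from the same frequency table; a minimal sketch, again assuming blog_words is the tokenised sample:
# Illustrative: share of the text still covered after collapsing words used only once into a "rare word" token
word_freq <- table(blog_words)
sum(word_freq[word_freq > 1]) / sum(word_freq)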
Let’s take the sample word “work”.
# number of times "work" appears in the text
ungramblog$count[ungramblog$Terms=="work"]
## [1] 197
Here are the top 5 words following the word “work”:
## term2 term1 count
## 114506 work STOP 31
## 114525 work with 14
## 114440 work and 13
## 114489 work on 10
## 114515 work to 7
If we consider several 2-grams ending with “work”, we’ll see that the most likely next word differs depending on which word comes before “work”:
## term3 term2 term1 count
## 173571 to work STOP 6
## 173577 to work with 4
## term3 term2 term1 count
## 61650 hard work and 1
## 61651 hard work day 1
## term3 term2 term1 count
## 162773 the work of 2
## 4183 a work of 1
Therefore, predicting the next word based on 3-grams instead of 2-grams should improve the result for certain words.
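A minimal sketch of such a lookup, preferring the 3-gram table and backing off to the 2-gram table (trigrams and bigrams stand for data frames shaped like the printouts above; this is an illustration, not the report’s actual code):
# Illustrative back-off lookup: term3/term2 are the preceding words, term1 is the candidate next word
predict_next <- function(w1, w2, trigrams, bigrams) {
  hits <- trigrams[trigrams$term3 == w1 & trigrams$term2 == w2, ]
  if (nrow(hits) == 0) hits <- bigrams[bigrams$term2 == w2, ]   # back off to 2-grams
  if (nrow(hits) == 0) return(NA)                               # unseen context
  hits$term1[which.max(hits$count)]
}
# e.g. predict_next("to", "work", trigrams, bigrams) would return "STOP" given the counts above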
Using longer n-grams might provide even more precise predictions; however, the chance that the same n-gram appears again in new text is lower, and we would have to compromise on calculation time and dictionary size. Therefore, we have to experiment with different n-gram sizes and find the optimal length.