--- title: "Text corpus analysis" author: "Veronika Nuretdinova" date: "Monday, February 2, 2014" output: html_document --- The purpose of this analysis is the text files exploration. The results of the initial exploration are neccesary for further development of word prediction application. The issues I want to explore: - diversity of the library, ie what % of the word covers majority (eg 95%) of the text. This would help reduce the dictionaries and deal with rare word in the application - difference between 3 files to be taken into account when I create the application library - % of profanity word - ngrams: how well the words can be predicted by 1, 2 or 3 previous words #1. Read the file. ```r setwd("~/R files/Natural Language Processing/Coursera-SwiftKey/en_US") Sys.setlocale(category = "LC_ALL", locale = "English") ``` ``` ## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" ``` ```r blogs<-readLines("en_US.blogs.txt", encoding="UTF-8") news<-readLines("en_US.news.txt", encoding="UTF-8") ``` ``` ## Warning: incomplete final line found on 'en_US.news.txt' ``` ```r twitter<-readLines("en_US.twitter.txt") ``` ``` ## Warning: line 167155 appears to contain an embedded nul ## Warning: line 268547 appears to contain an embedded nul ## Warning: line 1274086 appears to contain an embedded nul ## Warning: line 1759032 appears to contain an embedded nul ``` For the initial analysis I take 5K line samples from each files. ```r sampleblog<-sample(blogs,5000) samplenews<-sample(news, 5000) sampletwitter<-sample(twitter, 5000) ``` #2. Cleaning of the files includes: - remove profanity words - remove special symbols - convert to lower, so that words in the beggining of the sentence would not be read as different words. - change words containing digit to the NUM symbol meaning "numeric word". - change ".","?","!" to STOP sign meaning end of the sentence after given phrase, while other punctuation would be removed. Let's take the example of the first 100 words in blog text before and after cleaning ``` ## [1] "He jumped out of the bed, pulled on this track-suit bottoms and ran to check the other bedrooms. All empty. Sometimes we just need to stop and watch the fishes. The day after he had killed King Arthur, Mordred opened his eyes to flickering candlelight and damp rock. There had been nightmares, screaming, and much pain. Terrible pain such as his pampered body had never felt before. But the worst had passed. His crippled form stirred in the shadows and his remaining hand closed about cold metal. Not his axe – he’d lost that on the battlefield, along with his" ``` ``` ## [1] "he jumped out of the bed pulled on this tracksuit bottoms and ran to check the other bedrooms STOP all empty STOP sometimes we just need to stop and watch the fishes STOP the day after he had profanity king arthur mordred opened his eyes to flickering candlelight and damp rock STOP there had been nightmares screaming and much pain STOP terrible pain such as his pampered body had never felt before STOP but the worst had passed STOP his crippled form stirred in the shadows and his remaining hand closed about cold metal STOP not his axe hed" ``` #3. Tokenize the texts, create n-grams. Let's look at top 10 words and top 2-grams and 3-grams from one of the file, I take the blogs text. 
```
##       1-gram 2-gram    3-gram       
##  [1,] "the"  "of the"  "one of the" 
##  [2,] "and"  "in the"  "a lot of"   
##  [3,] "that" "to the"  "i don t"    
##  [4,] "NUM"  "on the"  "as well as" 
##  [5,] "for"  "to be"   "it is a"    
##  [6,] "you"  "and the" "out of the" 
##  [7,] "with" "and i"   "a couple of"
##  [8,] "was"  "for the" "to be a"    
##  [9,] "this" "is a"    "going to be"
## [10,] "have" "i am"    "be able to" 
```

#4. Comparison of the 3 text sources.

Let's compare the most used words and phrases of the 3 files. The chart below shows the % of common words between the blog, twitter and news texts.

![plot of chunk unnamed-chunk-7](figure/unnamed-chunk-7.png) 

The chart above shows that the similarity is highest between the blog and news texts (~80% of the dictionary in common) and lowest between news and twitter (~60% of the dictionary).

If we look at the % of profanity words in the 3 text files, we see that the news text uses slightly fewer profanity words than twitter and blog. But overall, all texts contain roughly 0.5-1.2% profanity words.

```
## [1] 0.005813
```

```
## [1] 0.01245
```

```
## [1] 0.005438
```

#5. What % of the dictionary covers a given share of the text in the 3 files

The chart below demonstrates what % of the dictionary is required to cover a given share of the text, for the 3 types of texts.

![plot of chunk unnamed-chunk-9](figure/unnamed-chunk-9.png) 

The chart shows that the majority of the blog text can be covered by a smaller share of the words used in the text than for news and twitter. If I want to cover 90% of the text, I need only 34% of the words for the blog text, and around 50% for news and twitter. This means that the blog text has more rarely used words than twitter and news.

The approach I'm going to apply in the application is labeling words which were used only once in the corpus as "rare word". Let's see what share of the text would be covered in this case.

```
## [1] 0.9297
```

```
## [1] 0.8785
```

```
## [1] 0.9215
```

This approach would give around 90% coverage for all texts.

#6. How well the next word can be predicted by the previous words

Let's take the sample word "broken".


```r
# number of times "broken" appears in the text
ungramblog$count[ungramblog$Terms=="broken"]
```

```
## [1] 22
```

Here are the top 5 words following the word "broken":

```
##        term2 term1 count
## 16195 broken  into     3
## 16189 broken   and     2
## 16190 broken    by     2
## 16204 broken    up     2
## 16191 broken china     1
```

If we consider several 2-grams ending with "broken", we'll see that the next word differs depending on what word is in front of "broken": a verb (has/is) + broken would typically be followed by a STOP, an adverb or a preposition, while an article or preposition (with/in/a/the/from) + broken would be followed by a noun.

```
##          term2  term1 count
## 5880       and broken     2
## 36340      get broken     2
## 47168       is broken     2
## 112807    your broken     2
## 132          a broken     1
## 3874     album broken     1
## 9144       are broken     1
## 11916     bars broken     1
## 39375      has broken     1
## 48944    jeans broken     1
## 49478     just broken     1
## 53228   longer broken     1
## 63581       of broken     1
## 80177     seem broken     1
## 80911  severed broken     1
```


```r
threegramblog[which(threegramblog$term2=="broken"),]
```

```
##        term3  term2 term1 count
## 59737    has broken  into     1
## 75396     is broken    by     1
```

```
##        term3  term2 term1 count
## 307        a broken   leg     1
## 101681    of broken walls     1
```

Therefore, predicting the next word based on 3-grams instead of 2-grams should improve the result.
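This suggests a simple back-off scheme for the prediction step. The sketch below is my own illustration rather than code from the report: it looks up the last two words in the 3-gram table, backs off to the 2-gram table, and finally falls back to the most frequent single word. It assumes data frames with the term/count columns printed above; the bigram table name `twogramblog` is an assumption, since only `ungramblog` and `threegramblog` appear in the report.

```r
# Sketch of a back-off next-word lookup over the n-gram tables above (illustrative).
# Assumed layout, based on the printed output: 3-grams have columns term3 (first word),
# term2 (second) and term1 (third); 2-grams have term2 (first) and term1 (second);
# 1-grams have Terms; all tables have a count column.
predict_next <- function(w1, w2, trigrams, bigrams, unigrams) {
  cand <- trigrams[trigrams$term3 == w1 & trigrams$term2 == w2, ]
  if (nrow(cand) > 0) return(cand$term1[which.max(cand$count)])
  cand <- bigrams[bigrams$term2 == w2, ]            # back off to 2-grams
  if (nrow(cand) > 0) return(cand$term1[which.max(cand$count)])
  unigrams$Terms[which.max(unigrams$count)]         # fallback: most frequent word
}

# predict_next("has", "broken", threegramblog, twogramblog, ungramblog)
```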
Using longer n-grams might provide even more precise predictions; however, the chance that the same n-gram appears again in the text is lower, and we would have to compromise on calculation time and dictionary volume. Therefore, we have to experiment with different n-gram sizes and find the optimal length.

#7. Next steps

- create the text corpus based on the 3 texts to be used in the word prediction application. This includes cleaning the text and labeling rarely used words
- create the algorithm for next word prediction. The algorithm would be based on n-grams built from the text corpus, i.e. next word = f(last words, number of last words)
- experiment with different n-gram sizes to find the optimum length in terms of prediction quality/calculation time
- analyze whether other features can be taken into account in the algorithm, e.g. part-of-speech or the endings of the words in the text: do the additional features improve the performance, and what is their cost in terms of memory and calculation time? In that case, next word = f(last words, number of last words, ending of the last word, POS of the last words)