The main objective of this Milestone Report is to display and explain the major features of the data that have been identified, and to briefly summarize the plans for creating the prediction algorithm and the Shiny app behind the final product, as stated in the assignment description.
The data originates from a corpus called HC Corpora and can be downloaded at the following link. The corpora have been collected from publicly available sources by a web crawler and include tweets, blogs and news in English, German, Finnish and Russian.
A meaningful excerpt from the About the Corpora informational page:
'You may still find lines of entirely different languages in the corpus. There are 2 main reasons for that: 1. Similar languages. Some languages are very similar, and the automatic language checker could therefore erroneously accept the foreign language text. 2. "Embedded" foreign languages. While a text may be mainly in the desired language there may be parts of it in another language. Since the text is then split up into individual lines, it is possible to see entire lines written in a foreign language. Whereas number 1 is just an out-and-out error, I think number 2 is actually desirable, as it will give a picture of when foreign language is used within the main language.'
Note! The focus of the analysis is on the English language only ('en_US'), covering tweets (twitter), news and blogs.
| source | number of lines | max number of characters | min number of characters |
|---|---|---|---|
| twitter | 2360148 | 213 | 2 |
| news | 1010242 | 11384 | 1 |
| blogs | 899288 | 40835 | 1 |
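As a reference, a minimal sketch of how these figures can be computed is shown below; the file names and the latin1 encoding are assumptions based on the downloaded en_US dataset, not taken from the report itself.

```r
# Summarize each en_US file: number of lines and max/min line length
# (sketch; file names are assumed to match the downloaded dataset)
files <- c(twitter = "en_US.twitter.txt",
           news    = "en_US.news.txt",
           blogs   = "en_US.blogs.txt")

summaries <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "latin1", skipNul = TRUE)
  chars <- nchar(lines)
  data.frame(lines     = length(lines),
             max_chars = max(chars),
             min_chars = min(chars))
})
do.call(rbind, summaries)
```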
Some considerations:

- the amount of data available across the tweet, news and blog entries. For simplification, the exploration has been done on a representative sample, used to infer facts about the whole population (also taking into account the limitations of the available processing hardware);
- the minimum size, in terms of number of characters, of the different entries. There are tweets, news and blogs with only a few characters - are they relevant?
When loading the data, the locale/encoding English_United States.1252 (ISO-8859-1) has been used. Inspecting the loaded data, it is possible to identify some encoding issues (gremlins) due to unrecognized characters (unsupported languages, emoticons, etc.), for example:
'I'm doing it!👦'
'Wilted Greens Salad with Squash, Apples, and Country Ham Recipe from Bon Appétit'
'Everything is good in its season 鬼もå八番茶も出花'
In order to remove such gremlins, the following strategy and simplification has been adopted: limit the set of available characters to the ASCII charset, removing all non-ASCII characters. The same entries after the removal:
'I'm doing it!'
'Wilted Greens Salad with Squash, Apples, and Country Ham Recipe from Bon Apptit'
'Everything is good in its season '
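A minimal sketch of this cleaning step, using `iconv` to drop any character that cannot be represented in ASCII (the UTF-8 input encoding is an assumption here):

```r
# Remove non-ASCII characters from each entry
# (sketch; 'entries' is a character vector of tweets/news/blogs,
# assumed to be UTF-8 encoded)
remove_non_ascii <- function(entries) {
  iconv(entries, from = "UTF-8", to = "ASCII", sub = "")
}

remove_non_ascii("Country Ham Recipe from Bon Appétit")
# [1] "Country Ham Recipe from Bon Apptit"
```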
There are around 214097 tweets (about 9%) that are less than 20 characters long. A few examples of such tweets can be found below:
## [1] "send me beats fam" "My moms so annoying!" "knowledge is power!"
## [4] "126 square blocks" "thanks, love! :)" "fun :D"
## [7] "oh no!" "Missing my hubby..." "M.O.B"
## [10] "Gonna be a long day" "Ok brotha thanks!!!"
Because of the limited number of such tweets and the "irrelevance" of their content (especially the ones with fewer than 10 characters), it has been decided to remove them from the twitter corpus.
There are around 30644 news entries (about 3%) that are less than 20 characters long. A few examples of such entries can be found below:
## [1] "BL Knight 9." "In other trading:" "Drage Vukcevich"
## [4] "Chain" "10. Youngstown" "Aberdeen"
## [7] "Radio Radio" "A gust of popularity" "(Ticker Tape)"
## [10] "last."
Because of the limited number of such news entries and the "irrelevance" of their content, it has been decided to remove them from the news corpus.
There are around 77241 blog entries (about 9%) that are less than 20 characters long. A few examples of such entries can be found below:
## [1] "If I were a bear," "Tis all." "1/3 cup tomato paste"
## [4] "3 T ketchup" "M. Blakeman Ingle" "Sphere: V = 4/3"
## [7] "Rm25" "So " "In shrouds of words,"
## [10] "You die?"
Because of the limited number of such blog entries and the "irrelevance" of their content, it has been decided to remove them from the blogs corpus.
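A minimal sketch of this filtering step, using the 20-character threshold mentioned above:

```r
# Drop entries shorter than a minimum number of characters
# (sketch; 'entries' is a character vector of tweets/news/blogs)
drop_short_entries <- function(entries, min_chars = 20) {
  entries[nchar(entries) >= min_chars]
}

drop_short_entries(c("fun :D", "This entry is long enough to keep."))
# [1] "This entry is long enough to keep."
```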
Note that the data contain words of offensive and profane meaning. Some examples …
!++-~| G1 CERTIFIED WET TSHIRT CONTEST --- FRIDAY CLUB DRAMA --WANT TO GET IN FOR FREE?? TXT ME I WILL TELL YOU HOW---214 609 3316 --
Wisconsin Governor Walker Attacks Sex Ed
Profane words will be removed from the corpora (treated as stopwords). An external resource providing a comprehensive list of 1383 profanity words is used.
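One possible way to filter them out is sketched below, using a plain regular-expression substitution rather than a stopword facility; the `profanity` vector is assumed to have been read from the external blacklist, and placeholder words are used in the example.

```r
# Remove blacklisted (profane) words from each entry
# (sketch; 'profanity' is assumed to be the character vector read from
# the external blacklist file - placeholders are used in the example)
remove_profanity <- function(entries, profanity) {
  pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  gsub(pattern, "", entries, ignore.case = TRUE, perl = TRUE)
}

remove_profanity("a badword1 example", profanity = c("badword1", "badword2"))
# [1] "a  example"
```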
For this analysis it is not necessary to use all of the data. Often, relatively few randomly selected rows or chunks are enough to get an accurate approximation of the results that would be obtained using all of the data. A "biased coin" approach has been used to select the tweets, news and blogs to be included in the analysis, based on the following percentages.
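A minimal sketch of such a "biased coin" selection; the acceptance probability `p` and the seed below are placeholders, not the values actually used in the report.

```r
# "Biased coin" sampling: keep each entry independently with probability p
# (sketch; p and the seed are placeholders, not the report's actual values)
set.seed(1234)
sample_entries <- function(entries, p) {
  entries[rbinom(length(entries), size = 1, prob = p) == 1]
}

length(sample_entries(as.character(1:10000), p = 0.1))  # roughly 1000 entries
```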
Exploration of the corpora is done using natural language processing techniques, specifically term frequency analysis using n-grams (1-grams, 2-grams and 3-grams). Before tokenizing the corpora, the following steps are performed:
- the ' (apostrophe) is kept, in order not to lose contractions (e.g. I'll, I'm, etc.)
- a `<s>` marker is added at the beginning of each entry (tweet, news, blog)
- a `</s>` marker is added at the end of each entry (tweet, news, blog)

Wordclouds and barplots are used to visualize the most frequent words/tokens for the different n-grams. When plotting the barplots only the most frequent terms are shown (top 30), and at most 200 terms are shown in the wordclouds. Note: for 2-grams and 3-grams, a token like `<s> at the` refers to 'at the' at the beginning of the entry (tweet, news or blog), while `top </s>` refers to 'top' at the end of the entry (tweet, news or blog).
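A minimal, base-R sketch of the n-gram term-frequency computation described above; the actual analysis may rely on dedicated text-mining packages, and the naive tokenizer below (lowercasing and keeping only letters and apostrophes) is an assumption.

```r
# Term frequencies for word n-grams, with <s>/</s> entry markers
# (sketch; the naive tokenizer keeps only letters and apostrophes)
ngram_freq <- function(entries, n = 1) {
  tokens <- strsplit(tolower(entries), "[^a-z']+")
  ngrams <- unlist(lapply(tokens, function(w) {
    w <- c("<s>", w[w != ""], "</s>")
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

head(ngram_freq(c("I'm doing it!", "Everything is good in its season"), n = 2))
```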
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

The tables below report, for each of the explored term-frequency tables, the total number of tokens (N), the vocabulary size (V) and the number of unique words needed to reach 50% and 90% coverage.

| N = number of tokens | V = vocabulary size | 50% coverage | 90% coverage |
|---|---|---|---|
| 1103869 | 59091 | 211 | 6772 |
| 1533686 | 544485 | 16697 | 391117 |
| 1426576 | 1061053 | 347765 | 918396 |

| N = number of tokens | V = vocabulary size | 50% coverage | 90% coverage |
|---|---|---|---|
| 2731479 | 88107 | 395 | 9581 |
| 3445777 | 1170126 | 39386 | 825549 |
| 3348002 | 2508621 | 834620 | 2173821 |

| N = number of tokens | V = vocabulary size | 50% coverage | 90% coverage |
|---|---|---|---|
| 2863955 | 91345 | 237 | 8284 |
| 3750369 | 1154501 | 25933 | 779465 |
| 3668201 | 2632369 | 798269 | 2265549 |
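Coverage figures like the ones above can be derived from a term-frequency table with a cumulative sum. A minimal sketch, assuming `freq` is a named vector of counts such as the output of the `ngram_freq` sketch shown earlier:

```r
# Number of unique terms needed to cover a given fraction of all instances
# (sketch; 'freq' is a named numeric vector of term counts)
coverage <- function(freq, target = 0.5) {
  freq <- sort(freq, decreasing = TRUE)
  unname(which(cumsum(freq) / sum(freq) >= target)[1])
}

freq <- c(the = 40, of = 25, cat = 15, dog = 10, emu = 10)
coverage(freq, 0.5)  # 2 terms ("the" + "of" already cover 65% >= 50%)
coverage(freq, 0.9)  # 4 terms
```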