Synopsis

The main objective of this Milestone Report, as stated in the assignment description, is to display and explain the major features of the data identified so far and to briefly summarize the plans for creating the prediction algorithm and the Shiny app behind the final product.

The Data

The data originates from a corpus called HC Corpora and can be downloaded at the following link. The corpora have been collected from publicly available sources by a web crawler and include tweets, blogs and news in English, German, Finnish and Russian.

A meaningful excerpt from the About the Corpora informational page:

'You may still find lines of entirely different languages in the corpus. There are 2 main reasons for that: 1. Similar languages. Some languages are very similar, and the automatic language checker could therefore erroneously accept the foreign language text. 2. "Embedded" foreign languages. While a text may be mainly in the desired language there may be parts of it in another language. Since the text is then split up into individual lines, it is possible to see entire lines written in a foreign language. Whereas number 1 is just an out-and-out error, I think number 2 is actually desirable, as it will give a picture of when foreign language is used within the main language.'

Note! The focus of the analysis is on the English language only (‘en_US’), covering tweets (Twitter), news and blogs.

Some basic statistics about the Corpora
sources   noOfLines   maxNoOfChar   minNoOfChar
twitter     2360148           213             2
news        1010242         11384             1
blogs        899288         40835             1
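
These counts are not accompanied by code in this report; a minimal sketch of how they could be computed in R (the file names are assumptions based on the standard en_US data set):

# Assumed file names from the en_US portion of the HC Corpora download.
files <- c(twitter = "en_US.twitter.txt",
           news    = "en_US.news.txt",
           blogs   = "en_US.blogs.txt")

stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "latin1", skipNul = TRUE)
  chars <- nchar(lines, type = "bytes")  # byte counts avoid multibyte errors
  data.frame(noOfLines = length(lines),
             maxNoOfChar = max(chars),
             minNoOfChar = min(chars))
})
do.call(rbind, stats)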

Some considerations

Cleaning the Data

Encoding Issues (Gremlins)

When loading the data, the following locale/encoding has been used: English_United States.1252, ISO8859-1. Inspecting the loaded data, it is possible to identify some encoding issues (gremlins) caused by unrecognized characters (unsupported languages, emoticons, etc.):

'I'm doing it!👦'
'Wilted Greens Salad with Squash, Apples, and Country Ham Recipe from Bon Appétit'
'Everything is good in its season 鬼も十八番茶も出花'

In order to remove such gremlins, the following strategy and simplification has been adopted: limit the set of available characters to the ASCII charset, removing all non-ASCII characters.

'I'm doing it!'
'Wilted Greens Salad with Squash, Apples, and Country Ham Recipe from Bon Apptit'
'Everything is good in its season '
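
The exact call used for this step is not shown in the report; a minimal sketch with iconv(), assuming the text is held in memory as UTF-8, reproduces the effect illustrated above:

# Convert to ASCII; every unsupported (non-ASCII) character is replaced by "".
remove_gremlins <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

remove_gremlins("Everything is good in its season 鬼も十八番茶も出花")
## [1] "Everything is good in its season "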

Entries with a limited number of characters

Twitter Corpus

There are around 214097 tweets (about 9%) that are fewer than 20 characters long. A few examples of such tweets can be found below:

##  [1] "send me beats fam"    "My moms so annoying!" "knowledge is power!" 
##  [4] "126 square blocks"    "thanks, love! :)"     "fun :D"              
##  [7] "oh no!"               "Missing my hubby..."  "M.O.B"               
## [10] "Gonna be a long day"  "Ok brotha thanks!!!"

Because of the limited number of such tweets and the “irrelevance” of their content (especially those with fewer than 10 characters), it has been decided to remove them from the Twitter corpus.

News Corpus

There are around 30644 news entries (about 3%) that are fewer than 20 characters long. A few examples of such entries can be found below:

##  [1] "BL  Knight 9."        "In other trading:"    "Drage Vukcevich"     
##  [4] "Chain"                "10. Youngstown"       "Aberdeen"            
##  [7] "Radio Radio"          "A gust of popularity" "(Ticker Tape)"       
## [10] "last."

Because of the limited number of such entries and the “irrelevance” of their content, it has been decided to remove them from the news corpus.

Blogs Corpus

There are around 77241 blog entries (about 9%) that are fewer than 20 characters long. A few examples of such entries can be found below:

##  [1] "If I were a bear,"    "Tis all."             "1/3 cup tomato paste"
##  [4] "3 T ketchup"          "M. Blakeman Ingle"    "Sphere: V = 4/3"     
##  [7] "Rm25"                 "So "                  "In shrouds of words,"
## [10] "You die?"

Because of the limited number of such entries and the “irrelevance” of their content, it has been decided to remove them from the blogs corpus.
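
A minimal sketch of this filtering step, using the 20-character threshold mentioned above (the object names tweets, news and blogs are illustrative):

# Drop entries shorter than 20 characters from each corpus.
keep_long <- function(x, min_chars = 20) x[nchar(x) >= min_chars]

tweets <- keep_long(tweets)
news   <- keep_long(news)
blogs  <- keep_long(blogs)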

Removal of Profanity Words

Note that the data contain words of offensive and profane meaning. Some examples …

!++-~| G1 CERTIFIED WET TSHIRT CONTEST --- FRIDAY CLUB DRAMA --WANT TO GET IN FOR FREE?? TXT ME I WILL TELL YOU HOW---214 609 3316 --
Wisconsin Governor Walker Attacks Sex Ed

Profanity words will be removed from the corpora (treated as stopwords). An external resource providing a comprehensive list of 1383 profanity words is used.
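
A minimal sketch of the profanity filtering, assuming the tm package is used; profanity_list.txt stands in for the external resource, which is not reproduced here:

library(tm)

# Placeholder file name for the external list of 1383 profanity words.
profanity <- readLines("profanity_list.txt")

# Treat the profanities as stopwords and blank them out of every entry.
tweets <- removeWords(tweets, profanity)
news   <- removeWords(news, profanity)
blogs  <- removeWords(blogs, profanity)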

Others

  • replace contractions “u, r, c’mon, doin’, y’all, ya’ll, ma’am” with “you, are, come on, doing, you all, madam”
  • remove links (e.g. “https://www.coursera.org/” or “http://www.coursera.org/”)
  • remove “RT” (only for the Twitter corpus); a sketch of these substitutions is shown below
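
A minimal sketch of these substitutions with base-R regular expressions (the patterns are illustrative, not the exact ones used):

clean_entry <- function(x, is_twitter = FALSE) {
  # Expand informal contractions and abbreviations.
  x <- gsub("\\bu\\b", "you", x, ignore.case = TRUE)
  x <- gsub("\\br\\b", "are", x, ignore.case = TRUE)
  x <- gsub("c'mon", "come on", x, ignore.case = TRUE)
  x <- gsub("doin'", "doing", x, ignore.case = TRUE)
  x <- gsub("y'all|ya'll", "you all", x, ignore.case = TRUE)
  x <- gsub("ma'am", "madam", x, ignore.case = TRUE)
  # Remove http/https links.
  x <- gsub("https?://\\S+", "", x)
  # Remove the retweet marker (Twitter corpus only).
  if (is_twitter) x <- gsub("\\bRT\\b", "", x)
  x
}

tweets <- clean_entry(tweets, is_twitter = TRUE)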

Sampling of the Corpora

For this analysis it is not necessary to use all of the data. Often, relatively few randomly selected rows or chunks are enough to obtain an accurate approximation of the results that would be obtained using all the data. A “biased coin” approach has been used to select the tweets, news and blog entries to be included in the analysis, based on the following percentages (a sketch of the sampling step follows the list):

  • 5% of the tweets (107110 tweets)
  • 10% of the news (97775 news)
  • 10% of the blogs (82168 blogs)
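
A minimal sketch of the “biased coin” selection using rbinom(); the seed value and object names are illustrative:

set.seed(12345)  # illustrative seed, for reproducibility

# Flip a biased coin for every entry and keep it when the coin comes up 1.
sample_corpus <- function(x, rate) x[rbinom(length(x), size = 1, prob = rate) == 1]

tweets_sample <- sample_corpus(tweets, 0.05)
news_sample   <- sample_corpus(news,   0.10)
blogs_sample  <- sample_corpus(blogs,  0.10)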

Exploring the (Sample) Corpora

Exploration of the corpora is done using natural language processing techniques, specifically term frequency analysis using n-grams (1-grams, 2-grams and 3-grams). Before tokenizing, the corpora are cleaned as described above and each entry is wrapped in the sentence boundary markers <s> and </s>.

Wordclouds and barplots are used to visualize the most frequent words/tokens for the different n-grams. The barplots show only the most frequent terms (top 30), while the wordclouds show at most 200 terms. Note: for 2-grams and 3-grams, a token like “<s> at the” refers to “at the” at the beginning of the entry (tweet, news or blog), while “the top </s>” refers to “the top” at the end of the entry.
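
The tokenization code is not shown here; a minimal sketch with the quanteda package (the package choice is an assumption, any n-gram tokenizer would work) builds the frequency tables used below:

library(quanteda)

# Tokenize, build n-grams and return term frequencies in decreasing order.
ngram_freq <- function(texts, n = 1) {
  toks <- tokens(texts, remove_punct = TRUE, remove_numbers = TRUE)
  toks <- tokens_ngrams(toks, n = n, concatenator = " ")
  sort(colSums(dfm(toks)), decreasing = TRUE)
}

unigrams <- ngram_freq(tweets_sample, 1)
bigrams  <- ngram_freq(tweets_sample, 2)
head(unigrams, 30)  # terms shown in the top-30 barplot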

Twitter Corpus

1-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             1103869                 59091            211           6772
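
Given a frequency-sorted vector of counts such as the one produced by the sketch above, the 50% and 90% coverage figures can be computed as follows (for the Twitter 1-grams this should return the 211 and 6772 reported in the table):

# Smallest number of most-frequent terms whose counts reach the given share
# of all token instances.
coverage <- function(freqs, share) {
  freqs <- sort(freqs, decreasing = TRUE)
  which(cumsum(freqs) / sum(freqs) >= share)[1]
}

coverage(unigrams, 0.50)  # 50% coverage
coverage(unigrams, 0.90)  # 90% coverage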

2-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             1533686                544485          16697         391117

3-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             1426576               1061053         347765         918396

News Corpus

1-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             2731479                 88107            395           9581

2-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             3445777               1170126          39386         825549

3-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             3348002               2508621         834620        2173821

Blogs Corpus

1-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             2863955                 91345            237           8284

2-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             3750369               1154501          25933         779465

3-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             3668201               2632369         798269        2265549

Next Steps