Synopsis

The main objective of this Milestone Report, as stated in the assignment description, is to display and explain the major features of the data identified so far and to briefly summarize the plans for creating the prediction algorithm and the Shiny app behind the final product.

The Data

The data originates from a corpus called HC Corpora and can be downloaded at the following link. The corpora have been collected from publicly available sources by a web crawler and include tweets, blogs and news in English, German, Finnish and Russian.

A meaningful excerpt from the About the Corpora informational page:

'You may still find lines of entirely different languages in the corpus. There are 2 main reasons for that: 1. Similar languages. Some languages are very similar, and the automatic language checker could therefore erroneously accept the foreign language text. 2. "Embedded" foreign languages. While a text may be mainly in the desired language there may be parts of it in another language. Since the text is then split up into individual lines, it is possible to see entire lines written in a foreign language. Whereas number 1 is just an out-and-out error, I think number 2 is actually desirable, as it will give a picture of when foreign language is used within the main language.'

Note! The focus of the analysis is on the English language only (‘en_US’), covering tweets (Twitter), news and blogs.

Some basic statistics about the Corpora
sources   noOfLines   maxNoOfChar   minNoOfChar
twitter     2360148           213             2
news        1010242         11384             1
blogs        899288         40835             1
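
These counts are not accompanied by code in this report; a minimal sketch of how they could be computed in R (the file names are assumptions based on the standard en_US data set):

# Assumed file names from the en_US portion of the HC Corpora download.
files <- c(twitter = "en_US.twitter.txt",
           news    = "en_US.news.txt",
           blogs   = "en_US.blogs.txt")

stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "latin1", skipNul = TRUE)
  chars <- nchar(lines, type = "bytes")  # byte counts avoid multibyte errors
  data.frame(noOfLines = length(lines),
             maxNoOfChar = max(chars),
             minNoOfChar = min(chars))
})
do.call(rbind, stats)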

Some considerations

Cleaning the Data

Encoding Issues (Gremlins)

When loading the data, the following locale/encoding has been used: English_United States.1252, ISO8859-1. Inspecting the loaded data, it is possible to identify some encoding issues (gremlins) caused by unrecognized characters (unsupported languages, emoticons, etc.):

'I'm doing it!👦'
'Wilted Greens Salad with Squash, Apples, and Country Ham Recipe from Bon Appétit'
'Everything is good in its season 鬼も十八番茶も出花'

In order to remove such gremlins, the following strategy and simplification has been adopted: limit the set of available characters to the ASCII charset, removing all non-ASCII characters.

'I'm doing it!'
'Wilted Greens Salad with Squash, Apples, and Country Ham Recipe from Bon Apptit'
'Everything is good in its season '
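
The exact call used for this step is not shown in the report; a minimal sketch with iconv(), assuming the text is held in memory as UTF-8, reproduces the effect illustrated above:

# Convert to ASCII; every unsupported (non-ASCII) character is replaced by "".
remove_gremlins <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

remove_gremlins("Everything is good in its season 鬼も十八番茶も出花")
## [1] "Everything is good in its season "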

Entries with a limited number of characters

Twitter Corpus

There are around 214097 tweets (about 9%) that are fewer than 20 characters long. A few examples of such tweets can be found below:

##  [1] "send me beats fam"    "My moms so annoying!" "knowledge is power!" 
##  [4] "126 square blocks"    "thanks, love! :)"     "fun :D"              
##  [7] "oh no!"               "Missing my hubby..."  "M.O.B"               
## [10] "Gonna be a long day"  "Ok brotha thanks!!!"

Because of the limited number of such tweets and the “irrelevance” of their content (especially those with fewer than 10 characters), it has been decided to remove them from the Twitter corpus.

News Corpus

There are around 30644 news entries (about 3%) that are fewer than 20 characters long. A few examples of such entries can be found below:

##  [1] "BL  Knight 9."        "In other trading:"    "Drage Vukcevich"     
##  [4] "Chain"                "10. Youngstown"       "Aberdeen"            
##  [7] "Radio Radio"          "A gust of popularity" "(Ticker Tape)"       
## [10] "last."

Because of the limited number of such entries and the “irrelevance” of their content, it has been decided to remove them from the news corpus.

Blogs Corpus

There are around 77241 blog entries (about 9%) that are fewer than 20 characters long. A few examples of such entries can be found below:

##  [1] "If I were a bear,"    "Tis all."             "1/3 cup tomato paste"
##  [4] "3 T ketchup"          "M. Blakeman Ingle"    "Sphere: V = 4/3"     
##  [7] "Rm25"                 "So "                  "In shrouds of words,"
## [10] "You die?"

Because of the limited number of such entries and the “irrelevance” of their content, it has been decided to remove them from the blogs corpus.
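
A minimal sketch of this filtering step, using the 20-character threshold mentioned above (the object names tweets, news and blogs are illustrative):

# Drop entries shorter than 20 characters from each corpus.
keep_long <- function(x, min_chars = 20) x[nchar(x) >= min_chars]

tweets <- keep_long(tweets)
news   <- keep_long(news)
blogs  <- keep_long(blogs)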

Removal of Profanity Words

Note that the data contain words of offensive and profane meaning. Some examples …

!++-~| G1 CERTIFIED WET TSHIRT CONTEST --- FRIDAY CLUB DRAMA --WANT TO GET IN FOR FREE?? TXT ME I WILL TELL YOU HOW---214 609 3316 --
Wisconsin Governor Walker Attacks Sex Ed

Profanity words will be removed from the corpora (treated as stopwords). An external resource providing a comprehensive list of 1383 profanity words is used.
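
A minimal sketch of the profanity filtering, assuming the tm package is used; profanity_list.txt stands in for the external resource, which is not reproduced here:

library(tm)

# Placeholder file name for the external list of 1383 profanity words.
profanity <- readLines("profanity_list.txt")

# Treat the profanities as stopwords and blank them out of every entry.
tweets <- removeWords(tweets, profanity)
news   <- removeWords(news, profanity)
blogs  <- removeWords(blogs, profanity)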

Others

  • replace contractions “u, r, c’mon, doin’, y’all, ya’ll, ma’am” with “you, are, come on, doing, you all, madam”
  • remove links (e.g. “https://www.coursera.org/” or “http://www.coursera.org/”)
  • remove “RT” (only for the Twitter corpus); a sketch of these substitutions is shown below
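
A minimal sketch of these substitutions with base-R regular expressions (the patterns are illustrative, not the exact ones used):

clean_entry <- function(x, is_twitter = FALSE) {
  # Expand informal contractions and abbreviations.
  x <- gsub("\\bu\\b", "you", x, ignore.case = TRUE)
  x <- gsub("\\br\\b", "are", x, ignore.case = TRUE)
  x <- gsub("c'mon", "come on", x, ignore.case = TRUE)
  x <- gsub("doin'", "doing", x, ignore.case = TRUE)
  x <- gsub("y'all|ya'll", "you all", x, ignore.case = TRUE)
  x <- gsub("ma'am", "madam", x, ignore.case = TRUE)
  # Remove http/https links.
  x <- gsub("https?://\\S+", "", x)
  # Remove the retweet marker (Twitter corpus only).
  if (is_twitter) x <- gsub("\\bRT\\b", "", x)
  x
}

tweets <- clean_entry(tweets, is_twitter = TRUE)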

Sampling of the Corpora

For this analysis it is not necessary to use all of the data. Often, relatively few randomly selected rows or chunks are enough to obtain an accurate approximation of the results that would be obtained using all the data. A “biased coin” approach has been used to select the tweets, news and blog entries to be included in the analysis, based on the following percentages (a sketch of the sampling step follows the list):

  • 5% of the tweets (107110 tweets)
  • 10% of the news (97775 news)
  • 10% of the blogs (82168 blogs)
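
A minimal sketch of the “biased coin” selection using rbinom(); the seed value and object names are illustrative:

set.seed(12345)  # illustrative seed, for reproducibility

# Flip a biased coin for every entry and keep it when the coin comes up 1.
sample_corpus <- function(x, rate) x[rbinom(length(x), size = 1, prob = rate) == 1]

tweets_sample <- sample_corpus(tweets, 0.05)
news_sample   <- sample_corpus(news,   0.10)
blogs_sample  <- sample_corpus(blogs,  0.10)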

Exploring the (Sample) Corpora

Exploration of the corpora is done using natural language processing techniques, specifically term frequency analysis using n-grams (1-grams, 2-grams and 3-grams). Before tokenizing, the corpora are cleaned as described above and each entry is wrapped in the sentence boundary markers <s> and </s>.

Wordclouds and barplots are used to visualize the most frequent words/tokens for the different n-grams. The barplots show only the most frequent terms (top 30), while the wordclouds show at most 200 terms. Note: for 2-grams and 3-grams, a token like “<s> at the” refers to “at the” at the beginning of the entry (tweet, news or blog), while “the top </s>” refers to “the top” at the end of the entry.
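
The tokenization code is not shown here; a minimal sketch with the quanteda package (the package choice is an assumption, any n-gram tokenizer would work) builds the frequency tables used below:

library(quanteda)

# Tokenize, build n-grams and return term frequencies in decreasing order.
ngram_freq <- function(texts, n = 1) {
  toks <- tokens(texts, remove_punct = TRUE, remove_numbers = TRUE)
  toks <- tokens_ngrams(toks, n = n, concatenator = " ")
  sort(colSums(dfm(toks)), decreasing = TRUE)
}

unigrams <- ngram_freq(tweets_sample, 1)
bigrams  <- ngram_freq(tweets_sample, 2)
head(unigrams, 30)  # terms shown in the top-30 barplot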

Twitter Corpus

1-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             1103869                 59091            211           6772
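
Given a frequency-sorted vector of counts such as the one produced by the sketch above, the 50% and 90% coverage figures can be computed as follows (for the Twitter 1-grams this should return the 211 and 6772 reported in the table):

# Smallest number of most-frequent terms whose counts reach the given share
# of all token instances.
coverage <- function(freqs, share) {
  freqs <- sort(freqs, decreasing = TRUE)
  which(cumsum(freqs) / sum(freqs) >= share)[1]
}

coverage(unigrams, 0.50)  # 50% coverage
coverage(unigrams, 0.90)  # 90% coverage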

2-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             1533686                544485          16697         391117

3-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             1426576               1061053         347765         918396

News Corpus

1-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             2731479                 88107            395           9581

2-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             3445777               1170126          39386         825549

3-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             3348002               2508621         834620        2173821

Blogs Corpus

1-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             2863955                 91345            237           8284

2-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             3750369               1154501          25933         779465

3-grams

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

N = number of tokens   V = vocabulary size   50% coverage   90% coverage
             3668201               2632369         798269        2265549

Next Steps