Introduction

This milestone report was created as part of the Data Science Capstone course, prepared by JHU and available on Coursera. Its aim is to present the results of exploring the raw data, along with considerations that arise from the exploratory analysis and might be significant in later stages of the project. It is not going to answer any questions, but rather highlight things that might influence future decisions.
I’m going to investigate the raw data from all 3 sources (News, Blogs, Twitter) in order to determine how different or similar they are. Hopefully this will help me decide whether I should treat them as independent sources in later phases of model building.

Exploratory Analysis

I’ve merged all data sources into a single dataset, keeping the source of each text available for analysis. At this stage I didn’t want to bother too much with data cleaning; the only thing I’ve done to the dataset is remove non-ASCII characters from it. This is the result I’ve got:

##      source            text          
##  Blogs  : 899288   Length:4269678    
##  News   :1010242   Class :character  
##  Twitter:2360148   Mode  :character

All right, so the three datasets have different numbers of lines (I do have to provide this info in this report for some reason).
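The merge and clean-up step can be illustrated with a minimal sketch. It is written in Python with made-up toy inputs, not the code actually used for this report; the names `strip_non_ascii` and `merge_sources` are hypothetical.

```python
from collections import Counter


def strip_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def merge_sources(sources):
    """Flatten {source: [lines]} into a list of (source, cleaned_text) rows."""
    return [
        (name, strip_non_ascii(line))
        for name, lines in sources.items()
        for line in lines
    ]


# Tiny made-up inputs standing in for the real Blogs/News/Twitter files.
toy = {
    "Blogs": ["A caf\u00e9 review"],
    "News": ["Prices rose 5%"],
    "Twitter": ["so good \u2764 lol", "hi"],
}
rows = merge_sources(toy)

# Line counts per source, analogous to the summary shown above.
counts = Counter(source for source, _ in rows)
print(counts)
```

Dropping characters this way leaves gaps such as double spaces behind, which a later cleaning pass would have to deal with.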

The next thing I did was break up all texts from all sources into sentences, using the Apache OpenNLP library. Although it took a lot of time, I didn’t want to pick random samples from the distinct sources of data, so I processed all of it, as I think it’ll be useful in later phases of model building. This is what I’ve got:

##      source          sentence        
##  Blogs  :2320088   Length:8827361    
##  News   :2900951   Class :character  
##  Twitter:3606322   Mode  :character

And here are a few samples:

source sentence
News “The industry lost $11 billion in 2009 and will probably lose $5.6 billion in 2010,” he told AP. “The emphasis at airlines is saving cash, managing capacity as effectively as possible, and cutting costs.”
News Sometimes Kauai is just Kauai, as in Elvis Presley’s “Blue Hawaii” or “Soul Surfer,” the 2011 drama based on the shark attack suffered by athlete Bethany Hamilton.
News The topic is not on the table, Marchionne said at the 2012 North American International Auto Show.
News Note: Defensive end Mario Ojemudia of Farmington Hills Harrison enrolled today for spring term, a U-M official confirmed.
News He joins three other February signees who enrolled in January for winter term.
News The rest of the class will arrive in June.
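To give an idea of what sentence splitting involves, here is a deliberately naive Python sketch. It is a regex stand-in, not the OpenNLP sentence detector used for the numbers above, and `split_sentences` is a hypothetical name.

```python
import re

# Split on whitespace that follows ., ! or ?, optionally with a closing
# quote in between. Two separate lookbehinds keep each one fixed-width.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+|(?<=[.!?]["\'\u201d])\s+')


def split_sentences(text):
    """Return the non-empty pieces of text between sentence boundaries."""
    return [piece for piece in _BOUNDARY.split(text.strip()) if piece]


print(split_sentences(
    "He joins three other signees. The rest of the class will arrive in June."
))
# A rule this simple also breaks after abbreviations:
print(split_sentences("Dr. Smith left."))
```

The second call shows the weakness of the rule-based approach: it splits after "Dr.", which is the kind of case a trained sentence detector is meant to handle.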

So now I have close to 9 million sentences, which I’m going to treat as my raw data source for modelling and model training. Notice that I did not remove punctuation or any kind of stopwords, and didn’t use word stemming or similar techniques that would alter the source data in one way or another. Most likely I’ll do some of that later in the process. For now, however, I approach this as a complete dataset of raw data, which I’m going to use to build n-grams and later analyze them to see whether some sort of data adjustment could improve the predictive power of my model. I might try to identify persons, places or other types of information contained in the raw text, and I’m afraid that if I strip it too much, I might lose some vital information that helps existing libraries work with texts efficiently.

So let’s get to the comparison of data coming from the different sources (News, Blogs, Twitter). Since the dataset has almost 9 million rows, I’m going to randomly select 1% of the data from each partition (News, Blogs, Twitter).

##      source        sentence        
##  Blogs  :23201   Length:88275      
##  News   :29010   Class :character  
##  Twitter:36064   Mode  :character
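A per-partition 1% draw of this kind can be sketched as follows. This is a Python illustration with toy data, not the original sampling code; `ceil_percent` and `sample_per_source` are hypothetical names. Rounding each partition's sample size up happens to reproduce the three counts above, but how the original sampling rounded is not stated in this report.

```python
import random


def ceil_percent(n, percent):
    """Smallest integer >= n * percent / 100, using integer arithmetic only."""
    return -(-n * percent // 100)


def sample_per_source(rows, percent, seed=0):
    """Draw `percent` % of the sentences from each source, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_source = {}
    for source, sentence in rows:
        by_source.setdefault(source, []).append(sentence)
    sample = []
    for source, sentences in by_source.items():
        k = ceil_percent(len(sentences), percent)
        sample.extend((source, s) for s in rng.sample(sentences, k))
    return sample


# Toy data: 250 sentences from one source and 101 from another.
toy_rows = [("A", "a%d" % i) for i in range(250)] + [("B", "b%d" % i) for i in range(101)]
toy_sample = sample_per_source(toy_rows, percent=1)
print(len(toy_sample))
```

Because every partition is sampled at the same rate, the sample keeps the proportions of the full dataset.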

As you can see, the sampled records are weighted by the total number of sentences in each partition. Let’s see some descriptive statistics (provided by the qdap library).

For a detailed explanation of what each parameter means, refer to the documentation of the qdap package here. What I’m interested in particularly is wps, the words-per-sentence statistic, which clearly indicates that Twitter sentences are shorter; this might be important when building n-grams. Also notice that Twitter’s proportion of statements (p.state) and proportion of exclamations (p.exclm) are noticeably different from those of Blogs and News. In general it looks like News and Blogs are more or less similar and Twitter is a bit different.
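As a rough illustration of what these per-source statistics measure, here is a Python sketch of words per sentence and the share of exclamations. It is a simplified analogue of the qdap output, not a reimplementation: words are just whitespace-separated tokens, an exclamation is any sentence ending in "!", and the function name `describe` is hypothetical.

```python
def describe(rows):
    """Per source: mean words per sentence and share of sentences ending in '!'."""
    totals = {}
    for source, sentence in rows:
        t = totals.setdefault(source, {"sentences": 0, "words": 0, "exclaims": 0})
        t["sentences"] += 1
        t["words"] += len(sentence.split())  # whitespace tokens as "words"
        t["exclaims"] += int(sentence.rstrip().endswith("!"))
    return {
        source: {
            "wps": t["words"] / t["sentences"],
            "p_exclm": t["exclaims"] / t["sentences"],
        }
        for source, t in totals.items()
    }


# Made-up rows, only to show the shape of the result.
stats = describe([
    ("Twitter", "so good!"),
    ("Twitter", "lol"),
    ("News", "The rest of the class will arrive in June."),
])
print(stats)
```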

I’m going to examine one more thing: the 200 most frequent words in each data partition, and how they compare to the 200 most frequent words overall.

What this figure shows is the rank of individual words (a higher number means the word is used less frequently) that are among the 200 most popular words overall in the entire dataset of sentences, but fall outside the 200 most popular words in one or two of the three partitions (Blogs, News, Twitter). For example, the word “lol” is ranked 1296.5 in Blogs and 52398.5 in News, and is within the top 200 most popular words in Twitter. What this figure suggests is that all common words are frequently used in Twitter. However, there are certain words that are frequently used in Twitter and are not so popular in the other 2 sources, Blogs and News. This suggests to me that I’ll have to take a closer look at the Twitter data and most likely apply some additional data processing rules when working with it.
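The comparison behind the figure can be sketched in Python as below. This is an illustration under assumptions, not the code that produced the figure: the half ranks quoted above (such as 1296.5) look like tied words sharing an average rank, so the sketch ranks that way, and the names `average_ranks` and `outside_top_n` are hypothetical.

```python
from collections import Counter


def average_ranks(counts):
    """Rank words by descending count; tied words share the mean of their positions."""
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            j += 1
        mean_rank = (i + 1 + j) / 2  # mean of 1-based positions i+1 .. j
        for word, _ in ordered[i:j]:
            ranks[word] = mean_rank
        i = j
    return ranks


def outside_top_n(rows, n):
    """Words in the overall top n that miss the top n of some source.

    Returns {word: {source: rank}}; rank is None if the word never occurs there.
    """
    per_source = {}
    overall = Counter()
    for source, sentence in rows:
        words = sentence.lower().split()
        per_source.setdefault(source, Counter()).update(words)
        overall.update(words)
    top_overall = [w for w, r in average_ranks(overall).items() if r <= n]
    report = {}
    for source, counts in per_source.items():
        ranks = average_ranks(counts)
        for word in top_overall:
            rank = ranks.get(word)
            if rank is None or rank > n:
                report.setdefault(word, {})[source] = rank
    return report


# Made-up rows: "lol" is common overall but absent from News.
toy_rows = [
    ("Twitter", "lol lol lol the"),
    ("News", "the the market"),
    ("Blogs", "the the lol"),
]
print(outside_top_n(toy_rows, n=2))
```

With n set to 200 and the sampled sentences as input, the result would list each overall top-200 word together with the partitions where it falls outside the top 200.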

Conclusions

So far I’m leaning towards the following:

Modelling Approach

At the moment my plan is as follows (now hold on to your chair):