WEEK 2 - TASK 2

The motivation for this project is to:

  1. Demonstrate that I’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics using tables and plots to illustrate essential summaries of the data set.
  3. Report any exciting findings that I have amassed so far.
  4. Report the next steps to create a prediction algorithm and Shiny app.

Step 1: I downloaded the zipped file and unzipped it.
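
The download and extraction can be sketched as below; the data-set URL and local file name are my assumptions about the course download link, not taken from this report.

```r
# Sketch only: download and unzip the Coursera SwiftKey data set.
# The URL and file name are assumptions, not confirmed by the report.
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
}
unzip(zip_file)  # extracts a final/ directory containing the locale sub-folders
```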

Step 2: I counted the lines of each unzipped file. I verified that the Twitter file has the most lines, the blogs file is second, and the news file has the fewest. I then read a subset of each source, combining 20,000 lines from each of blogs, news, and Twitter to generate a single corpus of 60,000 lines.

## [1] 899288
## [1] 77259
## [1] 2360148
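
A minimal sketch of how the line counts above and the 60,000-line sample could be produced is shown below; the `final/en_US/` paths, the `skipNul` option, and the use of `sample()` are assumptions rather than the exact code behind this report.

```r
# Sketch only: count lines per source and build a 60,000-line sample corpus.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

length(blogs); length(news); length(twitter)   # line counts per source

set.seed(123)                                  # arbitrary seed for reproducibility
sample_text <- c(sample(blogs,   20000),
                 sample(news,    20000),
                 sample(twitter, 20000))       # 60,000 lines in total
```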

Step 3: I tokenized the corpus, treating the text as English.

## Document-feature matrix of: 60,000 documents, 60,706 features (99.98% sparse) and 1 docvar.
##        features
## docs    year thereaft oil field platform name pagan god love mr
##   text1    1        1   1     1        1    1     1   1    0  0
##   text2    0        0   0     0        0    0     0   0    1  1
##   text3    0        0   0     0        0    0     0   0    1  0
##   text4    0        0   0     0        0    0     0   0    0  0
##   text5    0        0   0     0        0    0     0   0    0  0
##   text6    0        0   0     0        0    0     0   0    0  0
## [ reached max_ndoc ... 59,994 more documents, reached max_nfeat ... 60,696 more features ]
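
The output above is a quanteda document-feature matrix. A sketch of how it could be built from the 60,000-line sample is below; the preprocessing choices (lower-casing, punctuation removal, stopword removal, stemming) are assumptions suggested by the stemmed features shown, such as "thereaft", and `sample_text` is the object from the previous step.

```r
library(quanteda)

# Sketch only: corpus construction, tokenization, and document-feature matrix.
corp <- corpus(sample_text)

toks <- tokens(corp,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

dfmat <- dfm(toks)
dfmat <- dfm_wordstem(dfmat)   # stemming would explain features such as "thereaft"
dfmat
```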

Step 4: I applied a dictionary-based ideology analysis to the tokenized corpus using the Wordstat dictionary of political left-right ideology keywords (Laver and Garry 2000). This method cannot detect multi-word expressions, since a document-feature matrix does not store information about the positions of words.
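
As a sketch, quanteda can read a WordStat-format dictionary file and apply it to the document-feature matrix; the local file name `laver-garry.cat` is an assumption about where the Laver and Garry (2000) dictionary is stored.

```r
# Sketch only: apply the Laver-Garry WordStat dictionary to the dfm.
# "laver-garry.cat" is an assumed local path to the dictionary file.
dict_lg <- dictionary(file = "laver-garry.cat", format = "wordstat")

dfm_ideology <- dfm_lookup(dfmat, dictionary = dict_lg)
topfeatures(dfm_ideology)      # counts per ideology category
```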

Step 5: Analyse the bigram and trigram models.

An N-gram model predicts the occurrence of a word based on the occurrence of its N – 1 previous words. A bigram model (N = 2) therefore predicts the occurrence of a word given only its previous word (N – 1 = 1), and a trigram model (N = 3) predicts it based on its previous two words (N – 1 = 2). Assume a bigram model: we want the probability of a word wn given only its previous word wp. This probability is the number of times wp occurs immediately before wn, divided by the number of times wp occurs in the corpus:

P(wn | wp) = Count(wp wn) / Count(wp)

Analysing the frequencies of 2-grams and 3-grams in the corpus, I identified the most frequent word pairs and triples.
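
A sketch of the 2-gram and 3-gram frequency counts with quanteda follows; `toks` is the token object built earlier, and the last lines only illustrate the Count(wp wn) / Count(wp) formula for an assumed example pair ("new", "york").

```r
# Sketch only: bigram and trigram frequency counts from the tokens.
bigrams  <- tokens_ngrams(toks, n = 2)   # features joined with "_", e.g. "new_york"
trigrams <- tokens_ngrams(toks, n = 3)

topfeatures(dfm(bigrams),  10)           # 10 most frequent 2-grams
topfeatures(dfm(trigrams), 10)           # 10 most frequent 3-grams

# Illustration of P(wn | wp) = Count(wp wn) / Count(wp) for an assumed pair;
# returns NA if the pair does not occur in the sample.
unigram_counts <- colSums(dfm(toks))
bigram_counts  <- colSums(dfm(bigrams))
bigram_counts["new_york"] / unigram_counts["new"]
```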

Next Steps

I will identify how many unique words a frequency-sorted dictionary needs to cover 50% of all word instances in the language. Then I will choose the best model for prediction based on the previous analysis. Finally, I will work on the Shiny app and publish it on the Shiny site.
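
As a sketch of the first of these steps, the number of unique words needed to cover 50% of all word instances can be read off the cumulative distribution of sorted word frequencies; `dfmat` is the document-feature matrix from the earlier step, and this is only an illustration of the planned analysis.

```r
# Sketch only: how many unique words cover 50% of all word instances?
word_freq    <- sort(colSums(dfmat), decreasing = TRUE)
coverage     <- cumsum(word_freq) / sum(word_freq)
words_for_50 <- which(coverage >= 0.5)[1]
words_for_50
```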