Exploratory Analysis of text-based databases

Introduction

We perform an exploratory analysis of three text databases: one of blog entries, one from a Twitter feed, and one from newspaper pages. Unfortunately, because of processing restrictions, we perform the analysis on samples from those databases. This report is part of the Capstone Project course in the Data Science Specialization on Coursera.

Blogs database

After loading the blogs database, we look at a summary of the number of characters in each line.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   157.0   231.7   331.0 40835.0

The number of lines in this database is 899288, the number of words is 37334131, and the average number of words per line is 41.5152109.

A document-feature matrix was created from a sample with 20% of the size of the first database. Then common words such as pronouns, along with punctuation and numbers, were removed.

The number of unique words in this sample is:

## [1] 172471
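The sampling and cleaning steps can be sketched as follows. This is an illustrative Python version under stated assumptions, not the pipeline behind the number above: the stop-word list is a tiny hypothetical one, the seed and the function names (`sample_lines`, `clean_tokens`, `count_unique_words`) are made up for the example, and tokens are restricted to lowercase ASCII letters with an optional apostrophe.

```python
import random
import re

# Assumption: a tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"i", "you", "he", "she", "it", "we", "they",
              "the", "a", "an", "and", "of", "to", "in", "is"}


def sample_lines(lines, fraction=0.2, seed=1234):
    """Draw a random sample of lines without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)


def clean_tokens(line):
    """Lowercase a line, keep alphabetic tokens only, and drop stop words.

    Punctuation and numbers never match the pattern, so they are removed.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", line.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def count_unique_words(lines):
    """Count distinct cleaned tokens across all lines."""
    vocabulary = set()
    for line in lines:
        vocabulary.update(clean_tokens(line))
    return len(vocabulary)


# Hypothetical toy corpus standing in for the real database.
toy_lines = [f"Post number {i}: I really like the new blog!" for i in range(10)]
toy_sample = sample_lines(toy_lines)
print(len(toy_sample), count_unique_words(toy_sample))
```

Fixing the seed keeps the sample reproducible between runs, which makes the reported counts easier to check.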

The 20 most common words and their counts are:

##    one   just   like    can   time    get    now   know people   also    new 
##  25005  19909  19619  19466  17596  14081  11897  11876  11873  11123  10828 
##   even    day  first really   make   back    see   much     us 
##  10286  10160  10159  10019   9967   9940   9867   9685   9585
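A frequency table like the one above amounts to counting tokens and sorting by count. Below is a minimal Python sketch using the standard library; the function name `top_words` and the toy token lists are hypothetical, and the input is assumed to be already cleaned and tokenized.

```python
from collections import Counter


def top_words(token_lists, n=20):
    """Return the n most frequent tokens with their counts, most frequent first."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


# Hypothetical cleaned documents, one list of tokens per line.
toy_tokens = [["like", "time"], ["time", "new"], ["time", "like"]]
top_two = top_words(toy_tokens, n=2)
print(top_two)
```

The same counts can feed a word cloud, where each word is drawn at a size proportional to its frequency.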

Now, a wordcloud with the 100 most common words:

Twitter database

After loading the Twitter database, we look at a summary of the number of characters in each line.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0

The number of lines in this database is 2360148, the number of words is 30373543, and the average number of words per line is 12.8693383.

A document-feature matrix was created from a sample with 20% of the size of the first database. Then common words such as pronouns, along with punctuation and numbers, were removed.

The number of unique words in this sample is:

## [1] 167273

The 20 most common words and their counts are:

##   just   like    get   love   good    can    day thanks     rt    now    one 
##  30149  24314  22397  21560  20087  17823  17809  17760  17556  16854  16314 
##   know      u  great   time  today     go    new    lol    see 
##  15809  15329  15248  15064  14620  14586  13957  13942  13379

Now, a wordcloud with the 100 most common words:

News database

After loading the news database, we look at a summary of the number of characters in each line.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     111     186     203     270    5760

The number of lines in this database is 77259, the number of words is 2643969, and the average number of words per line is 34.2221489.

A document-feature matrix was created from a sample with 20% of the size of the first database. Then common words such as pronouns, along with punctuation and numbers, were removed.

The number of unique words in this sample is:

## [1] 41448

The 20 most common words and their counts are:

##    said     one     new     two    also     can    year    time    just   first 
##    3861    1315    1072     912     912     886     871     858     848     835 
##    last   state   years  people    like     get    city     now   three percent 
##     812     790     745     731     731     690     599     571     545     539

Now, a wordcloud with the 200 most common words:

Next steps

After analysing the data, we are going to use samples of those datasets to create a smaller training set. We will clean and smooth this data so that we can fit a Katz's back-off model to predict the next 3 words from a given sentence in a Shiny app.
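To make the back-off idea concrete, here is a simplified Python sketch, not the planned implementation. It is limited to bigrams backing off to unigrams, and it uses a fixed absolute discount in place of the Good-Turing discounts of Katz's original formulation. The class name `BackoffBigramModel`, the discount value of 0.5, and the toy sentences are all hypothetical.

```python
from collections import Counter, defaultdict


class BackoffBigramModel:
    """Bigram model that backs off to unigrams (simplified Katz-style scheme)."""

    def __init__(self, sentences, discount=0.5):
        self.discount = discount  # assumption: fixed discount, not Good-Turing
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for tokens in sentences:
            self.unigrams.update(tokens)
            for prev, nxt in zip(tokens, tokens[1:]):
                self.bigrams[prev][nxt] += 1
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, word):
        return self.unigrams[word] / self.total

    def prob(self, prev, word):
        """P(word | prev): discounted bigram estimate, else backed-off unigram."""
        followers = self.bigrams.get(prev)
        if not followers:
            # Context never seen before a word: use the unigram estimate.
            return self.unigram_prob(word)
        context_count = sum(followers.values())
        if word in followers:
            return (followers[word] - self.discount) / context_count
        # Probability mass freed by discounting the seen bigrams.
        alpha = self.discount * len(followers) / context_count
        # Unigram mass of the words never seen after this context.
        unseen_mass = 1.0 - sum(self.unigram_prob(w) for w in followers)
        if unseen_mass <= 0:
            return 0.0
        return alpha * self.unigram_prob(word) / unseen_mass

    def predict(self, prev, k=3):
        """Return the k most probable next words after prev."""
        scored = {w: self.prob(prev, w) for w in self.unigrams}
        ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
        return [w for w, _ in ranked[:k]]


# Hypothetical toy training set, already tokenized.
toy_sentences = [["i", "like", "cats"], ["i", "like", "dogs"], ["i", "love", "cats"]]
model = BackoffBigramModel(toy_sentences)
print(model.predict("i", k=3))
```

In this sketch the discount removes a little probability from every observed bigram and redistributes it over unseen continuations in proportion to their unigram frequency, so the conditional probabilities for a context still sum to one. A full model would extend the same pattern to trigrams and higher orders.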

Filipe Lima

16/11/2020
