We will perform an exploratory analysis of three text databases: one of blog entries, one from a Twitter feed, and one from newspaper pages. Unfortunately, because of processing restrictions, we are going to perform the analysis using samples from those databases. This report is part of the Capstone Project course in the Data Science Specialization on Coursera.
After loading the blogs database, we will look at a summary of the number of characters in each line.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 47.0 157.0 231.7 331.0 40835.0
The number of lines in this database is 899288, the number of words is 37334131, and the average number of words per line is 41.5152109.
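The figures above (line count, word count, words per line, and the character-length summary) can be reproduced with a few lines of code. The report's own code is not shown, so the following is only a minimal Python sketch: the function name `line_stats` and the toy input are hypothetical, words are assumed to be whitespace-separated tokens, and the `inclusive` quantile method is chosen because it should correspond to the default quartiles printed by R's `summary()`.

```python
import statistics


def line_stats(lines):
    """Summarise a list of text lines: counts plus a character-length summary."""
    char_lengths = [len(line) for line in lines]
    # Assumption: a "word" is any whitespace-separated token.
    word_counts = [len(line.split()) for line in lines]
    # Quartiles of the characters-per-line distribution.
    q1, median, q3 = statistics.quantiles(char_lengths, n=4, method="inclusive")
    return {
        "lines": len(lines),
        "words": sum(word_counts),
        "words_per_line": sum(word_counts) / len(lines),
        "chars_min": min(char_lengths),
        "chars_q1": q1,
        "chars_median": median,
        "chars_mean": statistics.mean(char_lengths),
        "chars_q3": q3,
        "chars_max": max(char_lengths),
    }


# Toy stand-in for a real file; in practice `lines` would be read from disk.
toy_lines = ["a short line", "another slightly longer line here", "ok"]
stats = line_stats(toy_lines)
print(stats)
```

Different tokenization rules (for example, how hyphens or URLs are split) would change the word totals, so counts from this sketch need not match the report's exactly.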
A document-feature matrix was created from a sample containing 20% of this database. Then common words such as pronouns, along with punctuation and numbers, were removed.
The number of unique words in this sample is:
## [1] 172471
The 20 most common words and their counts are:
## one just like can time get now know people also new
## 25005 19909 19619 19466 17596 14081 11897 11876 11873 11123 10828
## even day first really make back see much us
## 10286 10160 10159 10019 9967 9940 9867 9685 9585
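The sampling, cleaning, and counting steps described above can be sketched as follows. This is a Python illustration under stated assumptions, not the report's actual code: `sample_lines`, `word_frequencies`, and the tiny `STOPWORDS` set are hypothetical, and the real stopword list used in the report is not shown.

```python
import random
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; the report's list is not shown.
STOPWORDS = {"i", "you", "he", "she", "it", "we", "they",
             "the", "a", "an", "and", "of", "to", "in", "is"}


def sample_lines(lines, fraction=0.2, seed=42):
    """Draw a random sample (without replacement) of the given fraction of lines."""
    rng = random.Random(seed)
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)


def word_frequencies(lines, stopwords=STOPWORDS):
    """Count words after lowercasing and dropping punctuation, numbers, and stopwords."""
    counts = Counter()
    for line in lines:
        # Alphabetic tokens only (with an optional internal apostrophe),
        # which discards punctuation and digits.
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", line.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


toy = ["I like the new day", "You like 2 new things!",
       "We just like it", "The day is new", "Just one day"]
freqs = word_frequencies(toy)
print(len(freqs))             # number of unique words
print(freqs.most_common(20))  # most common words with their counts
```

On the full data one would call `word_frequencies(sample_lines(lines))`; the toy list is counted in full here only because a 20% sample of five lines is a single line.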
Now, a wordcloud with the 100 most common words:
After loading the Twitter database, we will look at a summary of the number of characters in each line.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.8 100.0 213.0
The number of lines in this database is 2360148, the number of words is 30373543, and the average number of words per line is 12.8693383.
A document-feature matrix was created from a sample containing 20% of this database. Then common words such as pronouns, along with punctuation and numbers, were removed.
The number of unique words in this sample is:
## [1] 167273
The 20 most common words and their counts are:
## just like get love good can day thanks rt now one
## 30149 24314 22397 21560 20087 17823 17809 17760 17556 16854 16314
## know u great time today go new lol see
## 15809 15329 15248 15064 14620 14586 13957 13942 13379
Now, a wordcloud with the 100 most common words:
After loading the news database, we will look at a summary of the number of characters in each line.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 111 186 203 270 5760
The number of lines in this database is 77259, the number of words is 2643969, and the average number of words per line is 34.2221489.
A document-feature matrix was created from a sample containing 20% of this database. Then common words such as pronouns, along with punctuation and numbers, were removed.
The number of unique words in this sample is:
## [1] 41448
The 20 most common words and their counts are:
## said one new two also can year time just first
## 3861 1315 1072 912 912 886 871 858 848 835
## last state years people like get city now three percent
## 812 790 745 731 731 690 599 571 545 539
Now, a wordcloud with the 200 most common words:
After analysing the data, we are going to use samples of those datasets to create a smaller training set. We will clean and smooth this data so that we can fit a Katz back-off model to predict the next 3 words from a given sentence in a Shiny app.
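To make the back-off idea concrete, here is a minimal Python sketch of a bigram model that backs off to unigrams. It is an illustration under stated assumptions, not the planned implementation: the class `BackoffBigram` and the toy corpus are hypothetical, and a fixed absolute discount (0.5) stands in for the Good-Turing discounting that Katz's method uses. The structure is otherwise the same: seen bigrams get a discounted probability, and the mass freed by discounting is shared among unseen words in proportion to their unigram probabilities.

```python
from collections import Counter, defaultdict


class BackoffBigram:
    """Bigram model with Katz-style back-off to unigrams (absolute discounting)."""

    def __init__(self, sentences, discount=0.5):
        self.discount = discount  # assumption: fixed discount, not Good-Turing
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for sentence in sentences:
            tokens = sentence.lower().split()
            self.unigrams.update(tokens)
            for prev, nxt in zip(tokens, tokens[1:]):
                self.bigrams[prev][nxt] += 1
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, word):
        return self.unigrams[word] / self.total

    def prob(self, prev, word):
        """P(word | prev), backing off to unigrams for unseen bigrams."""
        followers = self.bigrams.get(prev)
        if not followers:
            # Unseen context: use the unigram distribution directly.
            return self.unigram_prob(word)
        context_count = sum(followers.values())
        if word in followers:
            # Seen bigram: discounted relative frequency.
            return (followers[word] - self.discount) / context_count
        # Probability mass freed by discounting the seen bigrams.
        leftover = self.discount * len(followers) / context_count
        # Unigram mass of the words never seen after `prev`.
        unseen_mass = 1.0 - sum(self.unigram_prob(w) for w in followers)
        if unseen_mass <= 0:
            return 0.0
        return leftover * self.unigram_prob(word) / unseen_mass

    def predict(self, prev, k=3):
        """Return the k most probable next words after `prev`."""
        ranked = sorted(self.unigrams, key=lambda w: self.prob(prev, w), reverse=True)
        return ranked[:k]


corpus = ["i like new things", "i like the day", "i just like time"]
model = BackoffBigram(corpus)
print(model.predict("i", k=3))
```

Here `predict` ranks three candidate next words, which is one reading of "the next 3 words". A production model would typically use higher-order n-grams (trigrams backing off to bigrams, then unigrams) and precomputed lookup tables so the app can respond quickly.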