Introduction:

This Capstone project is designed to demonstrate the data scientist’s ability to load the data, explore it, record its summary statistics, report any findings that shed light on its features, and present those findings visually. In this report we work with text collected from Twitter, blogs, and news sources, and we analyze it to prepare for the eventual prediction app.

Getting and Cleaning Data:

library(readr)
library(quanteda)
## Package version: 1.3.14
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Load the data:
twitter <- read_lines("en_US.twitter.txt")
news <- read_lines("en_US.news.txt")
blogs <- read_lines("en_US.blogs.txt")
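
Before building corpora it can help to record a few basic statistics for each file. The following is a minimal base R sketch, not part of the original analysis: `file_stats` is a hypothetical helper, and a small temporary file stands in for the `en_US.*.txt` files so the example runs anywhere.

```r
# Hypothetical stand-in for one of the en_US.*.txt files, so the sketch is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("How are you?",
             "We love you Mr. Brown.",
             "He wasn't home alone, apparently."), tmp)

# Summarise a text file: number of lines, longest line in characters, size on disk in bytes.
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  data.frame(
    lines = length(lines),
    max_chars = max(nchar(lines)),
    size_bytes = file.size(path)
  )
}

stats <- file_stats(tmp)
print(stats)
```

Calling `file_stats("en_US.twitter.txt")` and its siblings would give the same three numbers for the real files, assuming they sit in the working directory as in the code above.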

Next we create a corpus for each source, plus one combined corpus that contains all three:

twit_corp <- corpus(twitter)
docvars(twit_corp, "Source") <- "Twitter"

news_corp <- corpus(news)
docvars(news_corp, "Source") <- "News"

blogs_corp <- corpus(blogs)
docvars(blogs_corp, "Source") <- "Blogs"

all_corp <- twit_corp + blogs_corp + news_corp

We then convert each corpus to a document-feature matrix (DFM) on which we can conduct our analysis:

twit_dfm <- dfm(twit_corp, remove_punct = TRUE, remove = stopwords("english"), verbose = TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 2,360,148 documents, 441,359 features
##    ... removed 175 features
##    ... created a 2,360,148 x 441,184 sparse dfm
##    ... complete. 
## Elapsed time: 225 seconds.
news_dfm <- dfm(news_corp, remove_punct = TRUE, remove = stopwords("english"), verbose = TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 1,010,242 documents, 379,883 features
##    ... removed 175 features
##    ... created a 1,010,242 x 379,708 sparse dfm
##    ... complete. 
## Elapsed time: 135 seconds.
blogs_dfm <- dfm(blogs_corp, remove_punct = TRUE, remove = stopwords("english"), verbose = TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 899,288 documents, 402,860 features
##    ... removed 175 features
##    ... created a 899,288 x 402,685 sparse dfm
##    ... complete. 
## Elapsed time: 149 seconds.
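
To make clear what a document-feature matrix holds, here is a toy version built in base R. This is only an illustrative sketch, not how quanteda builds its sparse matrix: the two documents, the `tokenise` helper, and the two-word stopword list are all assumptions made for the example.

```r
docs <- c(d1 = "The cat sat on the mat.", d2 = "The dog chased the cat!")

# Tiny illustrative stopword list (an assumption; stopwords("english") is much longer).
stop_words <- c("the", "on")

# Lowercase, strip punctuation, split on whitespace, drop stopwords.
tokenise <- function(txt) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(txt)), "\\s+")[[1]]
  words[nzchar(words) & !(words %in% stop_words)]
}
tokens <- lapply(docs, tokenise)

# One row per document, one column per feature; each cell holds a count.
features <- sort(unique(unlist(tokens)))
counts <- lapply(tokens, function(tk) as.vector(table(factor(tk, levels = features))))
toy_dfm <- do.call(rbind, counts)
colnames(toy_dfm) <- features
print(toy_dfm)

# Column sums give the overall frequency of each feature across documents.
print(sort(colSums(toy_dfm), decreasing = TRUE))
```

The real matrices above follow the same layout, with millions of rows and hundreds of thousands of columns, which is why they are stored in sparse form.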

Data Exploration:

Now that our corpora are ready and the document-feature matrices are created, we can explore the features of each dataset:

Twitter texts:

#Let's get a feel for what the data looks like
texts(twit_corp)[1:3]
##                                                                                                             text1 
##   "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." 
##                                                                                                             text2 
## "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason." 
##                                                                                                             text3 
##                                                                        "they've decided its more fun if I don't."
#The mean number of tokens per tweet, and a per-document summary
mean(ntoken(twit_corp, remove_punct = TRUE))
## [1] 12.69384
head(summary(twit_corp))
##    Text Types Tokens Sentences  Source
## 1 text1    25     30         5 Twitter
## 2 text2    19     24         2 Twitter
## 3 text3     9      9         1 Twitter
## 4 text4    20     23         1 Twitter
## 5 text5    13     13         2 Twitter
## 6 text6    13     18         4 Twitter
#Now look at the top 20 features (punctuation and English stopwords excluded) and their frequencies
topfeatures(twit_dfm, 20)
##   just   like    get   love   good    day    can thanks     rt    now 
## 151016 122019 112316 106175 100749  90071  89732  89526  89300  83794 
##    one   know      u  great   time  today     go    lol    new    see 
##  82009  79817  77137  75991  75581  72847  72402  70026  69622  66954
#Extract the top 10 two-word collocations from a sample of 100 tweets (sampled to save time):
head(arrange(textstat_collocations(corpus_sample(twit_corp, 100), size = 2, min_count = 3), desc(count)), 10)
##    collocation count count_nested length   lambda        z
## 1       in the     5            0      2 2.512887 4.757471
## 2       we can     4            0      2 5.580514 6.715054
## 3      for all     4            0      2 5.005734 5.178950
## 4      will be     3            0      2 6.179138 5.799826
## 5     when you     3            0      2 4.389371 5.109840
## 6        in my     3            0      2 3.270240 4.814394
## 7      need to     3            0      2 3.561677 4.582567
## 8       i just     3            0      2 3.665507 4.344232
## 9      i don't     3            0      2 3.075905 4.208589
## 10 looking for     3            0      2 5.854817 3.834754
#Now measure the mean Flesch Reading Ease score of a sample of 100 tweets
mean(textstat_readability(corpus_sample(twit_corp, 100), measure = "Flesch")$Flesch)
## [1] 78.54892

Let’s do the same for the News dataset:

texts(news_corp)[1:3]
##                                                                                                                                                                               text1 
##                                                                                                                                                 "He wasn't home alone, apparently." 
##                                                                                                                                                                               text2 
##                         "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s." 
##                                                                                                                                                                               text3 
## "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
mean(ntoken(news_corp, remove_punct = TRUE))
## [1] 33.91165
head(summary(news_corp))
##    Text Types Tokens Sentences Source
## 1 text1     7      7         1   News
## 2 text2    27     33         3   News
## 3 text3    28     32         2   News
## 4 text4    69     92         3   News
## 5 text5    39     45         4   News
## 6 text6    32     40         1   News
topfeatures(news_dfm, 20)
##    said     one     new     can    also     two    year    just   first 
##  250393   83146   70330   58758   58745   57334   54829   53149   52636 
##    last    time    like   state  people   years     get    city     now 
##   51516   51152   49417   48079   47572   46809   43498   37185   36080 
## percent  school 
##   34423   34216
head(arrange(textstat_collocations(corpus_sample(news_corp, 100), size = 2, min_count = 3), desc(count)), 10)
##    collocation count count_nested length    lambda        z
## 1       of the    21            0      2 1.8965898 7.197480
## 2       in the    18            0      2 2.0588844 7.153772
## 3       to the    12            0      2 0.9913146 3.157605
## 4         in a     8            0      2 1.8895024 4.866915
## 5        for a     7            0      2 2.6816978 6.105332
## 6         is a     6            0      2 3.1987703 6.325363
## 7        to be     6            0      2 3.3190815 6.076730
## 8     from the     6            0      2 2.6457869 5.015949
## 9       as the     6            0      2 1.5024485 3.373771
## 10    has been     5            0      2 5.2643109 8.298356
mean(textstat_readability(corpus_sample(news_corp, 100), measure = "Flesch")$Flesch)
## [1] 60.25528

And one more time for the Blogs dataset:

texts(blogs_corp)[1:3]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  text1 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       "In the years thereafter, most of the Oil fields and platforms were named after pagan \"gods\"." 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  text2 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               "We love you Mr. Brown." 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  text3 
## "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
mean(ntoken(blogs_corp, remove_punct = TRUE))
## [1] 41.47307
head(summary(blogs_corp))
##    Text Types Tokens Sentences Source
## 1 text1    18     20         1  Blogs
## 2 text2     6      7         1  Blogs
## 3 text3   104    154         7  Blogs
## 4 text4    36     43         1  Blogs
## 5 text5    91    119         5  Blogs
## 6 text6    13     13         1  Blogs
topfeatures(blogs_dfm, 20)
##    one   just   like    can   time    get   know    now people   also 
## 124740 100592  98643  98282  88560  70866  60260  60114  59409  55331 
##    new   even  first    day   make   back     us    see really   much 
##  54475  51991  50877  50653  50648  50560  50185  50053  49936  48920
head(arrange(textstat_collocations(corpus_sample(blogs_corp, 100), size = 2, min_count = 3), desc(count)), 10)
##    collocation count count_nested length    lambda        z
## 1       in the    18            0      2 1.9948673 7.095819
## 2       of the    18            0      2 1.8510198 6.678056
## 3       to the    10            0      2 0.7893654 2.353179
## 4         is a     9            0      2 2.5179630 6.460429
## 5      for the     9            0      2 1.7564450 4.689014
## 6     into the     8            0      2 3.7416348 6.347521
## 7       on the     8            0      2 2.0825026 5.083801
## 8       it was     7            0      2 3.7682379 8.045318
## 9       i have     7            0      2 3.4336693 7.293072
## 10       to be     6            0      2 2.7797634 5.555722
mean(textstat_readability(corpus_sample(blogs_corp, 100), measure = "Flesch")$Flesch)
## [1] 66.25273
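
To help interpret the three sample means reported above (roughly 78.5 for Twitter, 60.3 for news, and 66.3 for blogs), recall that the Flesch Reading Ease score rises as sentences and words get shorter. The sketch below implements the standard formula in base R. The function name and the word, sentence, and syllable counts are hypothetical, and quanteda's own counts may differ from what a hand count would give.

```r
# Flesch Reading Ease: higher scores indicate text that is easier to read.
flesch_reading_ease <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Hypothetical counts: short sentences with short words ...
short_text <- flesch_reading_ease(n_words = 100, n_sentences = 10, n_syllables = 130)
# ... versus long sentences with longer words.
long_text <- flesch_reading_ease(n_words = 100, n_sentences = 4, n_syllables = 170)

print(c(short_text = short_text, long_text = long_text))
```

On this reading, the sampled tweets score as the easiest to read and the sampled news items as the hardest, which is consistent with the mean token counts per document reported earlier.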

Plotting results:

Bar plots of the most frequent words in each corpus are easy to create and informative:

barplot(topfeatures(twit_dfm, 20), width = 3, col = "royalblue4", xlab = "Top Featured Words", ylab = "Frequency", main = "Top Words in Twitter Corpus")

barplot(topfeatures(news_dfm, 20), width = 3, col = "springgreen4", xlab = "Top Featured Words", ylab = "Frequency", main = "Top Words in News Corpus")

barplot(topfeatures(blogs_dfm, 20), width = 3, col = "orangered4", xlab = "Top Featured Words", ylab = "Frequency", main = "Top Words in Blogs Corpus")

Now let’s look at the most frequent words across all corpora combined. This time we use a word cloud and, to save time, a sample of 5,000 documents from the combined corpus:

sam <- corpus_sample(all_corp, 5000)
sam_dfm <- dfm(sam, remove = stopwords("english"), remove_punct = TRUE, verbose = TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 5,000 documents, 17,423 features
##    ... removed 170 features
##    ... created a 5,000 x 17,253 sparse dfm
##    ... complete. 
## Elapsed time: 2.31 seconds.
textplot_wordcloud(sam_dfm, rotation = 0.25, min_count = 20, max_words = 50, color = RColorBrewer::brewer.pal(8, "Dark2"))
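
Because the samples in this report are drawn at random, the collocation tables, readability means, and word cloud will change from run to run. Setting a seed before sampling makes them repeatable. The sketch below shows the idea with base R's `sample()` on a made-up vector of documents; we assume, but have not verified here, that `corpus_sample()` draws from the same random number generator. The seed value is arbitrary.

```r
# Hypothetical stand-in for a corpus: 1,000 labelled documents.
docs <- paste("document", 1:1000)

set.seed(2019)                 # arbitrary seed, chosen only for illustration
first_draw <- sample(docs, 5)

set.seed(2019)                 # resetting the seed reproduces the same draw
second_draw <- sample(docs, 5)

print(identical(first_draw, second_draw))
```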

Conclusion:

Now that we have loaded our data, cleaned it, and investigated its features, we can turn our attention to building the prediction model, which will eventually be deployed in a Shiny app. We will split the data into training and testing sets and use the training set to build the prediction model with the N-gram method. We will then create a Shiny app that hosts the model: users type words, and the app predicts the next word.
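
As a rough preview of that plan, the base R sketch below splits a handful of made-up sentences into training and testing sets and builds a simple bigram (2-gram) lookup table. It is a toy illustration under our own assumptions, not the final model: `build_bigrams` and `predict_next` are hypothetical helpers, the 80/20 split ratio is arbitrary, and the real model will likely need higher-order N-grams and a strategy for unseen words.

```r
sentences <- c("i love you", "i love this day", "thanks for the follow",
               "thanks for the rt", "have a great day", "i love this song")

# Random 80/20 split into training and testing sets (ratio is an assumption).
set.seed(1)
train_idx <- sample(seq_along(sentences), size = floor(0.8 * length(sentences)))
train <- sentences[train_idx]
test <- sentences[-train_idx]

# Count how often each word is followed by each other word.
build_bigrams <- function(texts) {
  words <- strsplit(tolower(texts), "\\s+")
  firsts <- unlist(lapply(words, function(w) head(w, -1)))
  seconds <- unlist(lapply(words, function(w) tail(w, -1)))
  as.data.frame(table(first = firsts, second = seconds), stringsAsFactors = FALSE)
}

# Return the most frequent follower of `word`, or NA if the word was never seen.
predict_next <- function(model, word) {
  cand <- model[model$first == tolower(word) & model$Freq > 0, ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$second[which.max(cand$Freq)]
}

model <- build_bigrams(train)
print(predict_next(model, "thanks"))
print(predict_next(model, "zebra"))   # unseen word, so no prediction is available
```

The held-out `test` sentences could then be used to check how often the predicted word matches the actual next word.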