This Capstone project demonstrates the data scientist's ability to load the data, explore it, record its summary statistics, report any findings that shed light on its features, and present them visually. In this report we analyze text data collected from Twitter, blogs, and news sources, in preparation for building a next-word prediction app.
library(readr)
library(quanteda)
## Package version: 1.3.14
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Load the data:
twitter <- read_lines("en_US.twitter.txt")
news <- read_lines("en_US.news.txt")
blogs <- read_lines("en_US.blogs.txt")
Next, we create a corpus from each source, plus one combined corpus that contains all three:
twit_corp <- corpus(twitter)
docvars(twit_corp, "Source") <- "Twitter"
news_corp <- corpus(news)
docvars(news_corp, "Source") <- "News"
blogs_corp <- corpus(blogs)
docvars(blogs_corp, "Source") <- "Blogs"
all_corp <- twit_corp + blogs_corp + news_corp
We then convert each corpus to a document-feature matrix (DFM) on which we can conduct our analysis:
twit_dfm <- dfm(twit_corp, remove_punct = TRUE, remove = stopwords("english"), verbose = TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 2,360,148 documents, 441,359 features
## ... removed 175 features
## ... created a 2,360,148 x 441,184 sparse dfm
## ... complete.
## Elapsed time: 225 seconds.
news_dfm <- dfm(news_corp, remove_punct = TRUE, remove = stopwords("english"), verbose = TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 1,010,242 documents, 379,883 features
## ... removed 175 features
## ... created a 1,010,242 x 379,708 sparse dfm
## ... complete.
## Elapsed time: 135 seconds.
blogs_dfm <- dfm(blogs_corp, remove_punct = TRUE, remove = stopwords("english"), verbose = TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 899,288 documents, 402,860 features
## ... removed 175 features
## ... created a 899,288 x 402,685 sparse dfm
## ... complete.
## Elapsed time: 149 seconds.
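To make the structure of a document-feature matrix concrete, here is a minimal base R sketch on two toy documents. It is not how quanteda builds its sparse matrices internally; the documents, the one-word stopword list, and the `tokenize()` helper are assumptions made purely for illustration.

```r
# Toy documents (hypothetical; not taken from the datasets)
docs <- c(d1 = "The cat sat.", d2 = "The dog sat, the dog ran!")

# Hypothetical one-word stopword list; stopwords("english") is far longer
stops <- c("the")

# Lowercase, strip punctuation, split on whitespace
tokenize <- function(x) {
  x <- gsub("[[:punct:]]", "", tolower(x))
  toks <- strsplit(x, "\\s+")[[1]]
  toks[nzchar(toks)]
}

# Tokenize each document and drop stopwords
toks <- lapply(docs, tokenize)
toks <- lapply(toks, function(tk) tk[!tk %in% stops])

# One column per distinct feature, one row per document
features <- sort(unique(unlist(toks)))
toy_dfm <- t(vapply(
  toks,
  function(tk) as.integer(table(factor(tk, levels = features))),
  integer(length(features))
))
colnames(toy_dfm) <- features
print(toy_dfm)

# Column sums are the corpus-wide feature frequencies
print(sort(colSums(toy_dfm), decreasing = TRUE))
```

Each row is a document and each cell counts how often a feature occurs in it, so the column sums are the kind of corpus-wide frequencies that `topfeatures()` ranks.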
Now that the corpora and document-feature matrices are ready, we can analyze each dataset to identify its features:
# Let's get a feel for what the data look like
texts(twit_corp)[1:3]
## text1
## "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## text2
## "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## text3
## "they've decided its more fun if I don't."
# Mean number of tokens per tweet (punctuation excluded), then a per-document summary
mean(ntoken(twit_corp, remove_punct = TRUE))
## [1] 12.69384
head(summary(twit_corp))
## Text Types Tokens Sentences Source
## 1 text1 25 30 5 Twitter
## 2 text2 19 24 2 Twitter
## 3 text3 9 9 1 Twitter
## 4 text4 20 23 1 Twitter
## 5 text5 13 13 2 Twitter
## 6 text6 13 18 4 Twitter
# Now look at the 20 most frequent words (punctuation and English stopwords excluded) and their frequencies
topfeatures(twit_dfm, 20)
## just like get love good day can thanks rt now
## 151016 122019 112316 106175 100749 90071 89732 89526 89300 83794
## one know u great time today go lol new see
## 82009 79817 77137 75991 75581 72847 72402 70026 69622 66954
# Extract the top 10 two-word expressions from a sample of 100 tweets (sampled to save time):
head(arrange(textstat_collocations(corpus_sample(twit_corp, 100), size = 2, min_count = 3), desc(count)), 10)
## collocation count count_nested length lambda z
## 1 in the 5 0 2 2.512887 4.757471
## 2 we can 4 0 2 5.580514 6.715054
## 3 for all 4 0 2 5.005734 5.178950
## 4 will be 3 0 2 6.179138 5.799826
## 5 when you 3 0 2 4.389371 5.109840
## 6 in my 3 0 2 3.270240 4.814394
## 7 need to 3 0 2 3.561677 4.582567
## 8 i just 3 0 2 3.665507 4.344232
## 9 i don't 3 0 2 3.075905 4.208589
## 10 looking for 3 0 2 5.854817 3.834754
# Measure the mean Flesch readability score of another random sample of 100 tweets
mean(textstat_readability(corpus_sample(twit_corp, 100), measure = "Flesch")$Flesch)
## [1] 78.54892
texts(news_corp)[1:3]
## text1
## "He wasn't home alone, apparently."
## text2
## "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## text3
## "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
mean(ntoken(news_corp, remove_punct = TRUE))
## [1] 33.91165
head(summary(news_corp))
## Text Types Tokens Sentences Source
## 1 text1 7 7 1 News
## 2 text2 27 33 3 News
## 3 text3 28 32 2 News
## 4 text4 69 92 3 News
## 5 text5 39 45 4 News
## 6 text6 32 40 1 News
topfeatures(news_dfm, 20)
## said one new can also two year just first
## 250393 83146 70330 58758 58745 57334 54829 53149 52636
## last time like state people years get city now
## 51516 51152 49417 48079 47572 46809 43498 37185 36080
## percent school
## 34423 34216
head(arrange(textstat_collocations(corpus_sample(news_corp, 100), size = 2, min_count = 3), desc(count)), 10)
## collocation count count_nested length lambda z
## 1 of the 21 0 2 1.8965898 7.197480
## 2 in the 18 0 2 2.0588844 7.153772
## 3 to the 12 0 2 0.9913146 3.157605
## 4 in a 8 0 2 1.8895024 4.866915
## 5 for a 7 0 2 2.6816978 6.105332
## 6 is a 6 0 2 3.1987703 6.325363
## 7 to be 6 0 2 3.3190815 6.076730
## 8 from the 6 0 2 2.6457869 5.015949
## 9 as the 6 0 2 1.5024485 3.373771
## 10 has been 5 0 2 5.2643109 8.298356
mean(textstat_readability(corpus_sample(news_corp, 100), measure = "Flesch")$Flesch)
## [1] 60.25528
texts(blogs_corp)[1:3]
## text1
## "In the years thereafter, most of the Oil fields and platforms were named after pagan \"gods\"."
## text2
## "We love you Mr. Brown."
## text3
## "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
mean(ntoken(blogs_corp, remove_punct = TRUE))
## [1] 41.47307
head(summary(blogs_corp))
## Text Types Tokens Sentences Source
## 1 text1 18 20 1 Blogs
## 2 text2 6 7 1 Blogs
## 3 text3 104 154 7 Blogs
## 4 text4 36 43 1 Blogs
## 5 text5 91 119 5 Blogs
## 6 text6 13 13 1 Blogs
topfeatures(blogs_dfm, 20)
## one just like can time get know now people also
## 124740 100592 98643 98282 88560 70866 60260 60114 59409 55331
## new even first day make back us see really much
## 54475 51991 50877 50653 50648 50560 50185 50053 49936 48920
head(arrange(textstat_collocations(corpus_sample(blogs_corp, 100), size = 2, min_count = 3), desc(count)), 10)
## collocation count count_nested length lambda z
## 1 in the 18 0 2 1.9948673 7.095819
## 2 of the 18 0 2 1.8510198 6.678056
## 3 to the 10 0 2 0.7893654 2.353179
## 4 is a 9 0 2 2.5179630 6.460429
## 5 for the 9 0 2 1.7564450 4.689014
## 6 into the 8 0 2 3.7416348 6.347521
## 7 on the 8 0 2 2.0825026 5.083801
## 8 it was 7 0 2 3.7682379 8.045318
## 9 i have 7 0 2 3.4336693 7.293072
## 10 to be 6 0 2 2.7797634 5.555722
mean(textstat_readability(corpus_sample(blogs_corp, 100), measure = "Flesch")$Flesch)
## [1] 66.25273
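For context on the readability figures above, the standard Flesch Reading Ease formula is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words), where a higher score means easier text. The sketch below evaluates that formula from counts; the function name `flesch_reading_ease()` and the example counts are hypothetical, and the way quanteda counts syllables is not reproduced here.

```r
# Flesch Reading Ease computed from raw counts.
# Higher scores indicate text that is easier to read.
flesch_reading_ease <- function(words, sentences, syllables) {
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# Hypothetical counts: 100 words, 5 sentences, 150 syllables
score <- flesch_reading_ease(words = 100, sentences = 5, syllables = 150)
print(score)
```

Longer sentences and longer words both lower the score, which is consistent with the news sample (about 60) scoring below the Twitter sample (about 79).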
Bar plots of the most frequent words in each corpus give a quick visual comparison:
barplot(topfeatures(twit_dfm, 20), width = 3, col = "royalblue4", xlab = "Top Featured Words", ylab = "Frequency", main = "Top Words in Twitter Corpus")
barplot(topfeatures(news_dfm, 20), width = 3, col = "springgreen4", xlab = "Top Featured Words", ylab = "Frequency", main = "Top Words in News Corpus")
barplot(topfeatures(blogs_dfm, 20), width = 3, col = "orangered4", xlab = "Top Featured Words", ylab = "Frequency", main = "Top Words in Blogs Corpus")
Now let's look at the most frequent words across all corpora combined. This time we use a word cloud and, to save time, a sample of 5,000 documents from the combined corpus:
sam <- corpus_sample(all_corp, 5000)
sam_dfm <- dfm(sam, remove = stopwords("english"), remove_punct = TRUE, verbose = TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 5,000 documents, 17,423 features
## ... removed 170 features
## ... created a 5,000 x 17,253 sparse dfm
## ... complete.
## Elapsed time: 2.31 seconds.
textplot_wordcloud(sam_dfm, rotation = 0.25, min_count = 20, max_words = 50, color = RColorBrewer::brewer.pal(8, "Dark2"))
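Because the samples in this report are drawn at random, the collocation tables, readability means, and word cloud will differ from run to run. A minimal sketch of making a random draw repeatable with base R's `set.seed()` follows; the seed value is arbitrary, and it is an assumption (not verified here) that `corpus_sample()` draws from the same random number generator as base `sample()`.

```r
# Fix the random number generator state before sampling.
set.seed(2024)            # arbitrary seed, chosen only for illustration
idx_a <- sample(1000, 5)  # five random indices out of 1,000

# Resetting the same seed reproduces the same draw.
set.seed(2024)
idx_b <- sample(1000, 5)

print(identical(idx_a, idx_b))
```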
Now that we have loaded the data, cleaned it, and investigated its features, we turn to building the prediction model, which will eventually be deployed in a Shiny app. We will split the data into training and testing sets and use the training set to build the prediction model with the n-gram method. We will then create a Shiny app that serves the model: the user types words, and the app predicts the next word.
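As a rough illustration of that plan, the base R sketch below splits a toy set of lines into training and testing sets, counts bigrams in the training set, and predicts the most frequent follower of a word. The toy lines, the 80/20 split, and the helpers `bigram_counts()` and `predict_next()` are all assumptions for illustration; the final model may use longer n-grams, smoothing, or back-off.

```r
# Toy corpus (hypothetical lines standing in for the real data)
lines <- c("i love you", "we love you", "they love you",
           "i love you too", "you love coffee")

# Split into training and testing sets (the 80/20 ratio is an assumption)
set.seed(1)
train_idx <- sample(length(lines), size = floor(0.8 * length(lines)))
train <- lines[train_idx]
test  <- lines[-train_idx]

# Lowercase and split on whitespace
tokenize <- function(x) strsplit(tolower(x), "\\s+")[[1]]

# Count every adjacent word pair (bigram) in a set of texts
bigram_counts <- function(texts) {
  firsts <- character(0)
  seconds <- character(0)
  for (x in texts) {
    w <- tokenize(x)
    if (length(w) >= 2) {
      firsts  <- c(firsts, head(w, -1))
      seconds <- c(seconds, tail(w, -1))
    }
  }
  as.data.frame(table(first = firsts, second = seconds),
                stringsAsFactors = FALSE)
}

# Return the most frequent word seen after `word`, or NA if it was never seen
predict_next <- function(counts, word) {
  cand <- counts[counts$first == tolower(word) & counts$Freq > 0, ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$second[which.max(cand$Freq)]
}

counts <- bigram_counts(train)
print(predict_next(counts, "love"))
```

The held-out `test` lines can then be used to check how often the predicted word matches the word that actually follows.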