The goal of this project is just to display that you've gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on RPubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
To reiterate, to build models you don't need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to "flip a biased coin" to determine whether you sample a line of text or not.
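For illustration, here is a minimal sketch of that sub-sampling idea (assuming the blogs file used later in this report; the 5% sampling rate and output file name are arbitrary choices):
# Sketch: keep roughly 5% of the blog lines by "flipping a biased coin" per line
# with rbinom, then write the sub-sample to a separate file for later reuse
set.seed(1234)
blog_lines <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
keep <- rbinom(length(blog_lines), size = 1, prob = 0.05) == 1
writeLines(blog_lines[keep], "en_US.blogs.sample.txt")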
# Clear JAVA_HOME if it is set, to avoid rJava picking up a conflicting Java installation
if (Sys.getenv("JAVA_HOME") != "")
  Sys.setenv(JAVA_HOME = "")
library(rJava)
library(stringi)
library(openNLP)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
## The following objects are masked from 'package:base':
##
## Filter, proportions
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
##
## ngrams
##
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
library(tokenizers)
library(NLP)
library(RWeka)
# Read the blogs and twitter files using readLines
blog_data <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
twitter_data <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
# Read the news file in binary mode, as it contains special characters
con <- file("final/en_US/en_US.news.txt", open="rb")
news_data <- readLines(con, encoding = "UTF-8")
close(con)
rm(con)
# Size of the data files (in megabytes)
blog_dim <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
sprintf("The en_US.blogs.txt file is: %s Megabytes", blog_dim)
## [1] "The en_US.blogs.txt file is: 200.424207687378 Megabytes"
news_dim <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
sprintf("The en_US.news.txt file is: %s Megabytes", news_dim)
## [1] "The en_US.news.txt file is: 196.277512550354 Megabytes"
twitter_dim <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
sprintf("The en_US.twitter.txt file is: %s Megabytes", twitter_dim)
## [1] "The en_US.twitter.txt file is: 159.364068984985 Megabytes"
Data_info <- data.frame(
  File            = c("Blogs", "News", "Twitter"),
  FileSizeinMB    = c(blog_dim, news_dim, twitter_dim),
  NumberofLines   = sapply(list(blog_data, news_data, twitter_data), length),
  TotalCharacters = sapply(list(blog_data, news_data, twitter_data), function(x) sum(nchar(x))),
  # Row 4 of stri_stats_latex() is the word count
  TotalWords      = sapply(list(blog_data, news_data, twitter_data), stri_stats_latex)[4, ],
  MaxCharacters   = sapply(list(blog_data, news_data, twitter_data), function(x) max(nchar(x)))
)
Data_info
## File FileSizeinMB NumberofLines TotalCharacters TotalWords MaxCharacters
## 1 Blogs 200.4242 899288 206824505 37570839 40833
## 2 News 196.2775 1010242 203223159 34494539 11384
## 3 Twitter 159.3641 2360148 162096031 30451128 140
As the table shows, each file is roughly 200 MB or smaller, and each contains more than 30 million words.
- Twitter has by far the most lines but the fewest characters per line (no line exceeds the 140-character tweet limit).
- Blogs contain the single longest line (40,833 characters).
- News lines are also relatively long, averaging about 200 characters.
In the en_US twitter data set, if you divide the number of lines where the word "love" (all lowercase) occurs by the number of lines the word "hate" (all lowercase) occurs, about what do you get?
love_hate<-length(grep("love", twitter_data))/length(grep("hate", twitter_data))
sprintf("We get around: %s", love_hate)
## [1] "We get around: 4.10859156202006"
The one tweet in the en_US twitter data set that matches the word "biostats" says what?
tweet<-grep("biostats", twitter_data, value=TRUE)
tweet
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
How many tweets have the exact characters "A computer once beat me at chess, but it was no match for me at kickboxing". (I.e. the line matches those characters exactly.)
tweet_match<-grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitter_data)
tweet_match
## [1] 519059 835824 2283423
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
We can now perform some exploratory analysis on a random sample of the data. A simple word cloud helps visualize the most frequent words.
# Take a random sample of 1,000 lines from each source and split it into sentences
blogsmp <- sample(blog_data, 1000)
newsmp <- sample(news_data, 1000)
twittersmp <- sample(twitter_data, 1000)
sample <- c(blogsmp, newsmp, twittersmp)
txt <- sent_detect(sample)
# Free memory: the full data sets are no longer needed
remove(blogsmp, newsmp, twittersmp, blog_data, news_data, twitter_data, sample)
Now we remove numbers and punctuation, strip extra whitespace, convert everything to lowercase, and drop empty sentences:
txt <- removeNumbers(txt)
txt <- removePunctuation(txt)
txt <- stripWhitespace(txt)
txt <- tolower(txt)
txt <- txt[which(txt!="")]
txt <- data.frame(txt,stringsAsFactors = FALSE)
We now tokenize the sample, build frequency-ordered data frames of 1-grams, 2-grams, and 3-grams, and draw word clouds of the most frequent words.
# Tokenize into single words and into 1- to 3-grams (NGramTokenizer's default)
words <- WordTokenizer(txt)
grams <- NGramTokenizer(txt)
# The combined output lists the trigrams first, then the bigrams, then the unigrams;
# the loops below locate where the bigrams (i) and the unigrams (j) begin
for (i in 1:length(grams)) {
  if (length(WordTokenizer(grams[i])) == 2) break
}
for (j in 1:length(grams)) {
  if (length(WordTokenizer(grams[j])) == 1) break
}
# Frequency tables, ordered from most to least frequent
onegrams <- data.frame(table(words))
onegrams <- onegrams[order(onegrams$Freq, decreasing = TRUE), ]
bigrams <- data.frame(table(grams[i:(j - 1)]))
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE), ]
trigrams <- data.frame(table(grams[1:(i - 1)]))
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE), ]
remove(i, j, grams)
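The index-based split above relies on the ordering of NGramTokenizer's combined output. An alternative sketch (not the approach used in this report) is to request each n-gram size explicitly through Weka_control:
# Alternative sketch: tokenize directly into bigrams and trigrams
bigrams_alt <- data.frame(table(NGramTokenizer(txt$txt, Weka_control(min = 2, max = 2))))
trigrams_alt <- data.frame(table(NGramTokenizer(txt$txt, Weka_control(min = 3, max = 3))))
bigrams_alt <- bigrams_alt[order(bigrams_alt$Freq, decreasing = TRUE), ]
trigrams_alt <- trigrams_alt[order(trigrams_alt$Freq, decreasing = TRUE), ]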
library(wordcloud)
wordcloud(words, scale=c(5,0.1), max.words=100, random.order=FALSE,
rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
wordcloud(onegrams$words, onegrams$Freq, scale=c(5,0.5), max.words=300, random.order=FALSE,
rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
The first word cloud shows the distribution of words in the corpus excluding common stop words such as "the", "a", "of", and "to". The second shows the distribution of all single words. The frequencies range from 3,796 down to 1.
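As a rough illustration of how skewed this distribution is (a sketch that reuses the frequency-ordered onegrams table built above), we can count how many unique words are needed to cover 50% and 90% of all word instances in the sample:
# Cumulative share of word instances covered by the most frequent words
coverage <- cumsum(onegrams$Freq) / sum(onegrams$Freq)
c(words_for_50pct = which(coverage >= 0.5)[1],
  words_for_90pct = which(coverage >= 0.9)[1])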
What are the frequencies of 2-grams and 3-grams in the dataset?
# Top 20 most frequent bigrams
barplot(bigrams[1:20, 2], col = "plum3",
        names.arg = bigrams$Var1[1:20], srt = 45,
        space = 0.1, xlim = c(0, 20), las = 2)
# Top 20 most frequent trigrams
barplot(trigrams[1:20, 2], col = "plum3",
        names.arg = trigrams$Var1[1:20], srt = 45,
        space = 0.1, xlim = c(0, 20), las = 2)
The goal is to create a predictive model that suggests the most probable next words given a user's input. This model will be evaluated and deployed as a Shiny application.
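As a rough illustration of the intended approach (a sketch only, reusing the trigram and bigram tables built above; the final algorithm, including smoothing and backoff details, is still to be decided), a simple next-word lookup could work like this:
# Sketch: suggest the next word from the last one or two words typed,
# backing off from the trigram table to the bigram table when needed
predict_next <- function(input, trigrams, bigrams, n = 3) {
  words <- tolower(unlist(strsplit(input, "\\s+")))
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- trigrams[grepl(paste0("^", prefix, " "), trigrams$Var1), ]
    if (nrow(hits) > 0) return(head(sub(".* ", "", as.character(hits$Var1)), n))
  }
  prefix <- tail(words, 1)
  hits <- bigrams[grepl(paste0("^", prefix, " "), bigrams$Var1), ]
  head(sub(".* ", "", as.character(hits$Var1)), n)
}
predict_next("thank you for", trigrams, bigrams)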