Overview

The goal of this project is to demonstrate that you have become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

  1. Demonstrate that you've downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Sampling

To reiterate, to build models you don't need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to "flip a biased coin" to determine whether you sample a line of text or not.
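
As a hedged illustration of this approach (the 5% sampling rate, the seed value, and the output file name are assumptions for the sketch, not part of the assignment), a subsample could be written to disk once and then reused:

# Sketch: keep roughly 5% of the blog lines by flipping a biased coin per line,
# then save the subsample so it does not have to be recreated every run.
set.seed(1234)
blog_lines <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
keep <- rbinom(length(blog_lines), size = 1, prob = 0.05) == 1
writeLines(blog_lines[keep], "en_US.blogs.sample.txt")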

Loading libraries

# Clear a pre-set JAVA_HOME so rJava falls back to the JVM it was configured with
if (Sys.getenv("JAVA_HOME") != "")
      Sys.setenv(JAVA_HOME = "")
library(rJava)
library(stringi)
library(openNLP)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following objects are masked from 'package:base':
## 
##     Filter, proportions
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
## 
##     ngrams
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
library(tokenizers)
library(NLP)
library(RWeka)

Loading data

# Read the blogs and twitter files using readLines
blog_data <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
twitter_data <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

# Read the news file through a binary connection because it contains special characters that can cut a plain-text read short
con <- file("final/en_US/en_US.news.txt", open="rb")
news_data <- readLines(con, encoding = "UTF-8")
close(con)
rm(con)

Calculating file sizes in megabytes

## size of the data
blog_dim <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
sprintf("The en_US.blogs.txt file is: %s Megabytes", blog_dim)
## [1] "The en_US.blogs.txt file is: 200.424207687378 Megabytes"
news_dim <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
sprintf("The en_US.news.txt file is: %s Megabytes", news_dim)
## [1] "The en_US.news.txt file is: 196.277512550354 Megabytes"
twitter_dim <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
sprintf("The en_US.twitter.txt file is: %s Megabytes", twitter_dim)
## [1] "The en_US.twitter.txt file is: 159.364068984985 Megabytes"

Data Summary

Data_info <- data.frame("File" = c("Blogs", "News", "Twitter"),
                        "FileSizeinMB" = c(blog_dim, news_dim, twitter_dim),
                        "NumberofLines" = sapply(list(blog_data, news_data, twitter_data), length),
                        "TotalCharacters" = sapply(list(blog_data, news_data, twitter_data), function(x) sum(nchar(x))),
                        "TotalWords" = sapply(list(blog_data, news_data, twitter_data), stri_stats_latex)[4, ],
                        "MaxCharacters" = sapply(list(blog_data, news_data, twitter_data), function(x) max(nchar(x)))
                        )

Data_info
##      File FileSizeinMB NumberofLines TotalCharacters TotalWords MaxCharacters
## 1   Blogs     200.4242        899288       206824505   37570839         40833
## 2    News     196.2775       1010242       203223159   34494539         11384
## 3 Twitter     159.3641       2360148       162096031   30451128           140

As the table shows, each file is about 200 MB or smaller and contains more than 30 million words.
- Twitter has by far the most lines, but the shortest lines (at most 140 characters per tweet).
- Blogs contain the single longest line (40,833 characters).
- News lines fall in between: fairly long on average, with a maximum of 11,384 characters.

Counting words

In the en_US twitter data set, if you divide the number of lines where the word "love" (all lowercase) occurs by the number of lines the word "hate" (all lowercase) occurs, about what do you get?

love_hate<-length(grep("love", twitter_data))/length(grep("hate", twitter_data))
sprintf("We get around: %s", love_hate)
## [1] "We get around: 4.10859156202006"

The one tweet in the en_US twitter data set that matches the word "biostats" says what?

tweet<-grep("biostats", twitter_data, value=TRUE)
tweet
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

How many tweets have the exact characters "A computer once beat me at chess, but it was no match for me at kickboxing". (I.e. the line matches those characters exactly.)

tweet_match <- grep("A computer once beat me at chess, but it was no match for me at kickboxing",
                    twitter_data, fixed = TRUE)  # fixed = TRUE treats the phrase as literal text
tweet_match
## [1]  519059  835824 2283423

Task 2 - Exploratory Data Analysis & Task 3 - Modeling

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

Tasks to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationship between the words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider (exploratory analysis)

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages? (A rough check is sketched after this list.)
  5. Can you think of a way to increase the coverage -- identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
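
One rough way to approach question 4 is to flag tokens that contain non-ASCII characters. This is only a crude proxy (many foreign words are pure ASCII, and some English text contains accented characters), and word_sample below is an assumed stand-in for a character vector of tokens:

# Crude sketch: share of tokens containing non-ASCII characters, as a rough
# proxy for foreign-language words (word_sample is a hypothetical token vector).
non_ascii <- !stri_enc_isascii(word_sample)
mean(non_ascii)               # fraction of tokens with at least one non-ASCII character
head(word_sample[non_ascii])  # inspect a few of the flagged tokens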

Questions to consider (modeling)

  1. How can you efficiently store an n-gram model (think Markov Chains)?
  2. How can you use the knowledge about word frequencies to make your model smaller and more efficient?
  3. How many parameters do you need (i.e. how big is n in your n-gram model)?
  4. Can you think of simple ways to "smooth" the probabilities (think about giving all n-grams a non-zero probability even if they aren't observed in the data) ?
  5. How do you evaluate whether your model is any good?
  6. How can you use backoff models to estimate the probability of unobserved n-grams?
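
To make questions 1 and 6 more concrete, here is a minimal sketch of a backoff lookup (trigram first, then bigram) built on frequency tables shaped like the trigrams and bigrams data frames created later in this report (columns Var1 and Freq). The function name predict_next and its details are assumptions for illustration, not the final model:

# Minimal sketch of next-word prediction with a simple backoff:
# try the trigram table first, then back off to the bigram table.
predict_next <- function(phrase, trigrams, bigrams) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  tokens <- tokens[tokens != ""]
  n <- length(tokens)
  if (n == 0) return(NA_character_)
  last_word <- function(ng) tail(strsplit(as.character(ng), " ")[[1]], 1)
  if (n >= 2) {
    # trigrams whose first two words match the last two words of the phrase
    hits <- trigrams[grepl(paste0("^", tokens[n - 1], " ", tokens[n], " "), trigrams$Var1), ]
    if (nrow(hits) > 0) return(last_word(hits$Var1[which.max(hits$Freq)]))
  }
  # back off: bigrams whose first word matches the last word of the phrase
  hits <- bigrams[grepl(paste0("^", tokens[n], " "), bigrams$Var1), ]
  if (nrow(hits) > 0) return(last_word(hits$Var1[which.max(hits$Freq)]))
  NA_character_  # a fuller model would back off again to the most frequent unigram
}

In a full model the lower-order counts would be discounted (stupid backoff typically uses a factor of 0.4), and the tables would be stored in a compact, indexed form such as a data.table keyed on the n-gram prefix.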

For the exploratory analysis we draw a small random sample (1,000 lines from each source), split it into sentences, and use simple word clouds to visualize the most frequent words.

blogsmp <- sample(blog_data, 1000)
newsmp <- sample(news_data, 1000)
twittersmp <- sample(twitter_data, 1000)
samp <- c(blogsmp, newsmp, twittersmp)   # name the combined sample "samp" to avoid masking base::sample()
txt <- sent_detect(samp)                 # split the sampled lines into sentences (qdap)
remove(blogsmp, newsmp, twittersmp, blog_data, news_data, twitter_data, samp)

Next we clean the text: remove numbers and punctuation, collapse extra whitespace, convert everything to lower case, and drop empty strings.

txt <- removeNumbers(txt)
txt <- removePunctuation(txt)
txt <- stripWhitespace(txt)
txt <- tolower(txt)
txt <- txt[which(txt!="")]
txt <- data.frame(txt,stringsAsFactors = FALSE)

We now tokenize the cleaned text and build frequency-sorted data frames of 1-grams, 2-grams, and 3-grams, which feed the word clouds and bar plots below.

words <- WordTokenizer(txt)     # individual word tokens
grams <- NGramTokenizer(txt)    # 1- to 3-grams; the code below assumes trigrams come first, then bigrams, then unigrams

# Find where the trigrams end (first 2-word token) and where the bigrams end (first 1-word token)
for (i in 1:length(grams)) {
  if (length(WordTokenizer(grams[i])) == 2) break
}
for (j in 1:length(grams)) {
  if (length(WordTokenizer(grams[j])) == 1) break
}

onegrams <- data.frame(table(words))
onegrams <- onegrams[order(onegrams$Freq, decreasing = TRUE), ]
bigrams  <- data.frame(table(grams[i:(j - 1)]))
bigrams  <- bigrams[order(bigrams$Freq, decreasing = TRUE), ]
trigrams <- data.frame(table(grams[1:(i - 1)]))
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE), ]
remove(i, j, grams)

Word cloud

library(wordcloud)
wordcloud(words, scale=c(5,0.1), max.words=100, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

wordcloud(onegrams$words, onegrams$Freq, scale=c(5,0.5), max.words=300, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))

The first word cloud shows the distribution of words in the sample with common stop words such as "the", "a", "of", and "to" removed (the wordcloud call builds its own corpus and drops them, hence the warnings above). The second cloud shows the distribution of all single words from the onegrams table; their frequencies range from 1 to 3,796.
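
A quick way to approach question 3 above (how many unique words are needed to cover 50% or 90% of all word instances) is to take cumulative sums over the frequency-sorted onegrams table; note this is only a sketch on the small sample, not the full corpus:

# Sketch: cumulative coverage of word instances by the frequency-sorted dictionary.
coverage <- cumsum(onegrams$Freq) / sum(onegrams$Freq)
words_for_50 <- which(coverage >= 0.5)[1]  # unique words needed for 50% coverage
words_for_90 <- which(coverage >= 0.9)[1]  # unique words needed for 90% coverage
c(words_for_50 = words_for_50, words_for_90 = words_for_90)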

What are the frequencies of 2-grams and 3-grams in the dataset?

Bar plots of the most frequent 2-grams and 3-grams

barplot(bigrams[1:20, 2], col = "plum3",
        names.arg = bigrams$Var1[1:20], las = 2,   # las = 2 rotates the axis labels
        main = "Top 20 bigrams",
        space = 0.1, xlim = c(0, 20))

barplot(trigrams[1:20, 2], col = "plum3",
        names.arg = trigrams$Var1[1:20], las = 2,
        main = "Top 20 trigrams",
        space = 0.1, xlim = c(0, 20))

Future work

The goal is to create a predictive model that suggests the most probable next word(s) following a phrase typed by the user. The model will be evaluated and then deployed as a Shiny application.
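
As a rough sketch of how that deployment could look (the layout is an assumption, and predict_next() refers to the hypothetical backoff helper sketched earlier together with the trigrams and bigrams tables), a minimal Shiny skeleton might be:

# Minimal Shiny skeleton for the planned next-word app (illustrative only).
library(shiny)

ui <- fluidPage(
  titlePanel("Next word prediction (sketch)"),
  textInput("phrase", "Type a phrase:", value = ""),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("Waiting for input...")
    # predict_next() is the hypothetical backoff helper sketched earlier
    as.character(predict_next(input$phrase, trigrams, bigrams))
  })
}

shinyApp(ui = ui, server = server)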