The goal of this project is simply to demonstrate that you have become comfortable working with the data and that you are on track to create your prediction algorithm.
First, we access and explore the dataset containing the text that will be used to build the language model. The dataset consists of text from blogs, Twitter, and news articles.
The table below shows the following information for each file: file size, total number of lines, length of the longest line, total number of words, and average number of words per line.
| File Names  | Size in MB | Total Lines | Longest Line | Total Words | Average Words per Line |
|-------------|-----------:|------------:|-------------:|------------:|-----------------------:|
| blogs.txt   | 200.42     | 899288      | 40835        | 38601176    | 42.92                  |
| news.txt    | 196.28     | 1010242     | 11384        | 35806831    | 35.44                  |
| twitter.txt | 159.36     | 2360148     | 213          | 31130623    | 13.19                  |
From previous attempts, creating a dfm for the whole document takes up a lot of processing time, especially when we start looking at occurrences of sequences of 2 adjacent words (bigrams) and 3 adjacent words (trigrams). Because of this, I chose to take 10% of each text as a representative sample to look at the different features.
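For reference, here is a minimal sketch (not part of the original analysis) of how that processing cost could be compared between the full corpus and the 10% sample. It assumes quanteda's tokens()/tokens_ngrams() API and the en_US file paths used in the appendix; the helper time_bigram_dfm() is hypothetical.

library(quanteda)

# Hypothetical helper: elapsed seconds to build a bigram dfm from a character vector
time_bigram_dfm <- function(txt) {
  system.time({
    toks <- tokens(txt)
    dfm(tokens_ngrams(toks, n = 2))
  })["elapsed"]
}

blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
set.seed(123)
keep <- as.logical(rbinom(length(blogs), size = 1, prob = 0.10))

# compare the full blogs corpus against the 10% sample
c(full = time_bigram_dfm(blogs), sample10 = time_bigram_dfm(blogs[keep]))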
Below you will find word clouds showing the most frequent words used in tweets (blue), blogs (green), and news (pink) after removing English stop words. Viewing the three plots side by side shows the difference in vocabulary across the three sources.
I also generated a frequency matrix for bigrams from each of the texts. The histograms of the top bigrams for each text are shown below.
1. Perform additional data cleansing, particularly profanity filtering and the removal of punctuation and special characters; the word clouds above show that the data still needs further cleanup.
2. Related to the first item, explore both the tm and quanteda packages to see which one best caters to all the cleanup that needs to be done.
3. Identify what percentage of the dataset is needed to cover most of the words used in the text. Right now I chose an arbitrary 10% of the data to speed up processing, without much thought on whether this is a large enough sample for predictions (see the coverage sketch after this list).
4. Explore smoothing techniques to account for words not present in the text, or at least not in the sample of text I will be using for modelling.
5. Explore trigrams, the runtime needed to generate them, and whether they are more useful for predictions than bigrams.
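As a first pass at item 3, the sketch below (again, not part of the original analysis; it reuses the sampled blogs text and the quanteda API from the appendix) estimates how many unique words are needed to cover a given fraction of all word instances in the sample.

library(quanteda)

blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
set.seed(123)
sampleBlogs <- blogs[as.logical(rbinom(length(blogs), size = 1, prob = 0.10))]

# sorted word frequencies and their cumulative share of all word instances
freqs <- sort(colSums(dfm(tokens(sampleBlogs, remove_punct = TRUE))), decreasing = TRUE)
coverage <- cumsum(freqs) / sum(freqs)

which(coverage >= 0.5)[1]  # unique words covering 50% of word instances
which(coverage >= 0.9)[1]  # unique words covering 90% of word instances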
To read the files, the function readLines() was used at first:
news <- readLines("final/en_US/en_US.news.txt")
## Warning in readLines("final/en_US/en_US.news.txt"): incomplete final line found
## on 'final/en_US/en_US.news.txt'
After manually opening the file in a text editor (Notepad++), some non-text characters were found (particularly on line 77259), so the approach was changed to read the files in binary mode.
library(stringr)
fileNames <- list.files("final/en_US", full.names=TRUE)
totalLines <- NULL
longestLine <- NULL
size <- NULL
totalWords <- NULL
aveWords <- NULL
# compute per-file statistics in a single pass over each file
for (i in 1:length(fileNames)) {
conn <- file(fileNames[i], open = "rb")  # binary mode avoids the embedded-nul issue
lines <- readLines(conn, skipNul = TRUE)
totalLines <- c(totalLines, length(lines))
longestLine <- c(longestLine, max(nchar(lines)))
size <- c(size, round(file.size(fileNames[i])/1024/1024, 2))
totalWords <- c(totalWords, sum(str_count(lines, '\\w+')))
aveWords <- c(aveWords, round(mean(str_count(lines, '\\w+')), 2))
close(conn)
}
rm(lines)
filestats <- data.frame(fileNames, size, totalLines, longestLine, totalWords, aveWords)
library(knitr)
kable(filestats, col.names = c("File Names", "Size in MB", "Total Lines", "Longest Line", "Total Words", "Average Words per Line"))
The code snippet below shows how the samples were drawn and how the word clouds and bigram plots were generated.
suppressMessages(library(quanteda))
suppressMessages(library(wordcloud))
suppressMessages(library(ggplot2))
par(mfrow=c(1, 3))
#Generate and plot twitter dfm
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)
set.seed(123)
sample <- as.logical(rbinom(n = length(twitter), size = 1, prob = 0.10))
sampleTweets <- twitter[sample]
rm(twitter)
twitterDFM <- dfm(sampleTweets, verbose = FALSE, remove = stopwords("english"))
suppressMessages(wordcloud(names(topfeatures(twitterDFM, 100)), topfeatures(twitterDFM,100), colors="steelblue3"))
#Generate and plot blog dfm
blogs <- readLines("final/en_US/en_US.blogs.txt")
sample <- as.logical(rbinom(n = length(blogs), size = 1, prob = 0.10))
sampleBlogs <- blogs[sample]
rm(blogs)
blogsDFM <- dfm(sampleBlogs, verbose = FALSE, remove = stopwords("english"))
suppressMessages(wordcloud(names(topfeatures(blogsDFM, 100)), topfeatures(blogsDFM,100), colors="darkolivegreen4"))
#Generate and plot news dfm
conn <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(conn, skipNul = TRUE)
sample <- as.logical(rbinom(n = length(news), size = 1, prob = 0.10))
sampleNews <- news[sample]
rm(news)
rm(sample)
newsDFM <- dfm(sampleNews, verbose = FALSE, remove = stopwords("english"))
suppressMessages(wordcloud(names(topfeatures(newsDFM, 100)), topfeatures(newsDFM,100), colors="coral3"))
suppressMessages(twitterBigram <- dfm(tokens_ngrams(tokens(sampleTweets), n = 2)))
topTwitter <- data.frame(frequency=topfeatures(twitterBigram,30), row.names = NULL)
topTwitter$ngram <- names(topfeatures(twitterBigram,30))
par(mfrow=c(1,1))
barplot(topTwitter$frequency, names.arg = topTwitter$ngram, xlab = "Bigrams", ylab = "Frequency", main = "Top Twitter Bigrams", col = "steelblue", cex.names = 0.3)
suppressMessages(BlogBigram <- dfm(tokens_ngrams(tokens(sampleBlogs), n = 2)))
topBlog <- data.frame(frequency=topfeatures(BlogBigram,30), row.names = NULL)
topBlog$ngram <- names(topfeatures(BlogBigram,30))
barplot(topBlog$frequency, names.arg = topBlog$ngram, xlab = "Bigrams", ylab = "Frequency", main = "Top Blogs Bigrams", col = "darkolivegreen4", cex.names = 0.3)
suppressMessages(NewsBigram <- dfm(tokens_ngrams(tokens(sampleNews), n = 2)))
topNews <- data.frame(frequency=topfeatures(NewsBigram,30), row.names = NULL)
topNews$ngram <- names(topfeatures(NewsBigram,30))
barplot(topNews$frequency, names.arg = topNews$ngram, xlab = "Bigrams", ylab = "Frequency", main = "Top News Bigrams", col = "coral3", cex.names = 0.3)