Milestone Report

This document explains the major features of the data. It is produced with echo=FALSE.

The document should be concise and explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.

For this document, some helper functions were created:
- Helpers/DownloadFile() - downloads and unzips the file from the web if it is missing.
- All paths are declared in the Helpers/filePaths.R file and loaded into the Global.
- Helpers/CountFileLines() - counts the lines in a file by reading them one by one.
- Helpers/GetLongestLine() - reads the lines one by one and finds the longest line.
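
The source of the helpers is not shown in this report, so the following is only a minimal sketch of how the line count and the longest line could be computed without loading a whole file into memory. The function name scan_file_stats and the chunk_size parameter are hypothetical, and the sketch reads in chunks rather than strictly one line at a time.

```r
# Hypothetical stand-in for CountFileLines() and GetLongestLine();
# the report's actual helpers are not shown and may differ.
scan_file_stats <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n_lines <- 0L
  longest <- 0L
  repeat {
    # Read the next chunk of lines; an empty result means end of file
    chunk <- readLines(con, n = chunk_size, warn = FALSE,
                       encoding = "UTF-8", skipNul = TRUE)
    if (length(chunk) == 0L) break
    n_lines <- n_lines + length(chunk)
    # nchar() returns NA for invalid strings when allowNA = TRUE; skip those
    longest <- max(longest, nchar(chunk, type = "chars", allowNA = TRUE),
                   na.rm = TRUE)
  }
  list(lines = n_lines, longest = longest)
}
```

Reading in chunks keeps memory use bounded by chunk_size lines while avoiding the overhead of one readLines() call per line.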

## File exists in: C:/Users/Dmitri/Desktop/Documents/R Projects/Coursera/10 - DSS Capstone Project

The data consist of four directories, one per language, with three files in each. The directories are: de_DE, en_US, fi_FI, ru_RU.

The files are collections of text from different media:

- .blogs.txt - blog posts.
- .news.txt - news.
- .twitter.txt - tweets.

For the most part, only the en_US locale is used in this assignment.

2. Create a basic report of summary statistics about the data sets

The blogs file is the largest and contains the longest line, measured in characters. The Twitter file is the smallest and has the shortest maximum line length, but it contains the most lines of the three sets.

File	Size	Number of lines	Longest line (characters)
en_US.blogs.txt	210.2 MB	899,289	40,835
en_US.news.txt	205.8 MB	77,260	5,760
en_US.twitter.txt	167.1 MB	2,360,149	213

Given Twitter's known limit of 140 characters per tweet, a longest line of 213 characters suggests that the Twitter file will require some cleaning before processing.

3. Report any interesting findings that you amassed so far.

Cleaning

As mentioned earlier, the Twitter file contains some garbage. It turned out that most of it consists of UTF codes written out as plain text, like:

<f0><U+009F><U+0098><U+0096><f0><U+009F><U+0098>

These codes are removed while reading, using a simple regular expression: <(U\\+00..|f0)>
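
As an illustration, the pattern can be applied with base R's gsub(). This is a minimal sketch: the function name strip_utf_codes and the sample string are made up for the example, and the report's own reading code may apply the pattern differently.

```r
# Hypothetical wrapper around the regular expression quoted in the report.
strip_utf_codes <- function(x) {
  # Matches literal "<f0>" and "<U+00..>" sequences and deletes them
  gsub("<(U\\+00..|f0)>", "", x)
}

dirty <- "so happy <f0><U+009F><U+0098><U+0096> today"
clean <- strip_utf_codes(dirty)
```

Note that the pattern only covers codes of the form <U+00..> and <f0>; other escaped byte sequences, if present, would be left in place.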

Reading the file

The files were too big to fit into memory, so a small sample of the first 25,000 lines was used. Given the total number of lines in the files, this is not a “fair” (representative) sample, and it may cause problems in the future.

Another problem is the Twitter file, because it has a large number of short lines. So, instead of reading the first N rows, the SampleFileByObjectSize function reads the files line by line until a given memory limit is reached.
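
The idea of sampling by memory size rather than by row count could look roughly like the sketch below. It is not the actual SampleFileByObjectSize implementation, which is not shown here; the name sample_lines_by_size and the parameters max_bytes and chunk_size are assumptions made for the example.

```r
# Hypothetical sketch: read lines until the in-memory size of the sample
# reaches max_bytes (as measured by object.size()).
sample_lines_by_size <- function(path, max_bytes, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  sampled <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE,
                       encoding = "UTF-8", skipNul = TRUE)
    if (length(chunk) == 0L) break          # end of file
    sampled <- c(sampled, chunk)
    # Stop once the limit is reached; may overshoot by up to one chunk
    if (as.numeric(object.size(sampled)) >= max_bytes) break
  }
  sampled
}
```

With this approach a file of many short lines (such as the Twitter set) yields more lines than a file of long lines for the same memory budget, which is consistent with the Twitter DFM having more documents than the other two.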

## Processed samples files are in: C:/Users/Dmitri/Desktop/Documents/R Projects/Coursera/10 - DSS Capstone Project/sample

For the analysis, the quanteda package was used instead of the suggested tm package because of its speed and lower memory usage.

There were some problems with quanteda. Some documentation examples didn’t work as expected:

- dfm didn’t lowercase the words when given the output of the tokenizer.
- toLower couldn’t accept the tokenized object, although the documentation says it can.

Most of the time, dfm objects would be constructed from texts or a corpus, without calling tokenize() as an intermediate step. Here, punctuation and numbers must be removed, then the tokenizer is called; toLower must also be applied before the tokenizer.
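
The order of operations described above can be illustrated with a base R stand-in. This sketch deliberately does not use quanteda calls, since their exact behavior depends on the package version; the function name normalize_and_tokenize is hypothetical.

```r
# Base R illustration of the preprocessing order (not the quanteda API):
# 1) lowercase, 2) drop punctuation and digits, 3) split into tokens.
normalize_and_tokenize <- function(texts) {
  lowered <- tolower(texts)
  # Replace runs of punctuation/digits with a space so words stay separated;
  # note this also splits contractions such as "don't" into "don" and "t"
  cleaned <- gsub("[[:punct:][:digit:]]+", " ", lowered)
  tokens <- strsplit(trimws(cleaned), "[[:space:]]+")
  # Drop any empty tokens left over from the cleaning step
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

example_tokens <- normalize_and_tokenize(c("Hello, World 42!", "R2D2 rocks"))
```

Lowercasing before tokenizing means the tokenizer never sees mixed-case variants of the same word, so the later counts are not split between, say, "The" and "the".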

Creating a document-feature matrix

The easiest way to examine the documents is to create a document-feature matrix (DFM), which counts how often each feature (word) occurs in each document.

The resulting DFMs have the following dimensions (documents, features):

Blogs: 25020, 48663
News: 27720, 49022
Twitter: 53400, 40956
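
To make the structure concrete, here is a tiny base R sketch of a document-feature matrix built from already tokenized documents. It only illustrates the idea behind the dimensions above; the real DFMs in this report were built with quanteda, and the function name build_dfm and the toy documents are made up.

```r
# Hypothetical base R illustration of a document-feature matrix:
# one row per document, one column per distinct token, cells are counts.
build_dfm <- function(token_list) {
  vocab <- sort(unique(unlist(token_list)))
  counts <- lapply(token_list, function(tok) {
    tabulate(match(tok, vocab), nbins = length(vocab))
  })
  matrix(unlist(counts),
         nrow = length(token_list), ncol = length(vocab), byrow = TRUE,
         dimnames = list(paste0("doc", seq_along(token_list)), vocab))
}

toy_docs <- list(c("a", "b", "a"), c("b", "c"))
toy_dfm <- build_dfm(toy_docs)

# Column sums give the overall feature frequencies (the "top features")
toy_top <- sort(colSums(toy_dfm), decreasing = TRUE)
```

A dense matrix like this would not scale to tens of thousands of documents and features; sparse storage is one reason a dedicated package is used for the real data.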

Side by side, the word clouds look like this (blogs, news, tweets):

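library(quanteda)      # provides the plot method for dfm objects
library(RColorBrewer)  # provides brewer.pal()

# Put the three word clouds in one row: mfrow is a vector of the form c(nr, nc)
par(mfrow = c(1, 3))

plot(dfm.blogs, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))
plot(dfm.news, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))
plot(dfm.twitter, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))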

The top 20 features of each set are presented in the tables below.

library(knitr)  # provides kable()

# Take the 20 most frequent features and render them as a table
topFeatures.blogs <- topfeatures(dfm.blogs, n = 20, decreasing = TRUE)
df_freq.blogs <- data.frame(keyName = names(topFeatures.blogs), value = topFeatures.blogs, row.names = NULL)
kable(df_freq.blogs, caption = "Top features of Blogs DFM.")

Top features of Blogs DFM.
keyName	value
one	3523
will	3093
just	2849
like	2806
can	2742
time	2616
get	1942
know	1774
now	1687
people	1577
new	1536
also	1498
first	1473
us	1441
even	1437
much	1416
good	1406
back	1401
really	1392
make	1390

topFeatures.news <- topfeatures(dfm.news, n = 20, decreasing = TRUE)
df_freq.news <- data.frame(keyName = names(topFeatures.news), value = topFeatures.news, row.names = NULL)
kable(df_freq.news, caption = "Top features of News DFM.")

Top features of News DFM.
keyName	value
said	6865
will	3008
one	2420
year	2155
new	1868
two	1796
also	1631
can	1612
time	1599
first	1595
just	1506
like	1441
last	1404
state	1392
people	1343
years	1311
get	1175
three	1126
now	1028
city	1008

topFeatures.twitter <- topfeatures(dfm.twitter, n = 20, decreasing = TRUE)
df_freq.twitter <- data.frame(keyName = names(topFeatures.twitter), value = topFeatures.twitter, row.names = NULL)
kable(df_freq.twitter, caption = "Top features of Twitter DFM.")

Top features of Twitter DFM.
keyName	value
just	3366
like	2832
get	2513
love	2418
u	2299
good	2261
will	2170
can	2134
day	2083
thanks	2073
rt	1991
now	1947
one	1946
know	1777
great	1720
time	1718
go	1674
today	1612
lol	1547
new	1524

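As shown, there are some differences between the sets. For example, "said" is by far the most frequent word in the news set, while the Twitter set contains informal tokens such as "u", "rt" and "lol".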

4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

There is some uncertainty in the plans. Mostly, I plan to continue examining n-grams of orders 1 to 4 and to predict the results with a simple fall-back strategy.

That is, try to find a matching 4-gram first, and then move to lower orders if nothing similar is found.
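
A minimal sketch of this fall-back idea in base R is shown below. It is an assumption about how the plan might be implemented, not the final algorithm: the names build_model and predict_next, the toy corpus, and the choice to pick the most frequent continuation without any smoothing are all made up for illustration.

```r
# Hypothetical model: for each order n (2..max_order), map an (n-1)-word
# context to the words that followed it in the training tokens.
build_model <- function(tokens, max_order = 4L) {
  stopifnot(max_order >= 2L, length(tokens) >= 1L)
  contexts <- list()
  for (n in 2:max_order) {
    key <- as.character(n)
    if (length(tokens) < n) {
      contexts[[key]] <- list()
      next
    }
    starts <- seq_len(length(tokens) - n + 1L)
    ctx <- vapply(starts, function(i) {
      paste(tokens[i:(i + n - 2L)], collapse = " ")
    }, character(1))
    following <- tokens[starts + n - 1L]
    contexts[[key]] <- split(following, ctx)
  }
  list(max_order = max_order,
       contexts = contexts,
       unigrams = sort(table(tokens), decreasing = TRUE))
}

# Try the longest available context first, then fall back to shorter ones;
# if no context matches, return the most frequent single word.
predict_next <- function(model, history) {
  for (n in model$max_order:2) {
    k <- n - 1L
    if (length(history) < k) next
    key <- paste(utils::tail(history, k), collapse = " ")
    candidates <- model$contexts[[as.character(n)]][[key]]
    if (!is.null(candidates)) {
      return(names(which.max(table(candidates))))
    }
  }
  names(model$unigrams)[1]
}

toy_tokens <- c("i", "love", "you", "i", "love", "cake",
                "i", "love", "you", "so", "much")
toy_model <- build_model(toy_tokens)
guess_long <- predict_next(toy_model, c("cake", "i", "love"))  # 4-gram match
guess_short <- predict_next(toy_model, c("we", "love"))        # falls back to 2-gram
```

This sketch ignores probabilities entirely: a match at a higher order always wins, however rare it is. Weighting or smoothing the counts across orders would address that, at the cost of more computation per prediction.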

The most concerning parts are smoothing and predicting unknown words. Another concern is the very limited amount of time (predictions must be made on the fly) and resources.