This document explains the major features of the data. It was produced with echo=FALSE.

The document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.

  1. For the purpose of this document some helper functions were created.
    • Helpers/DownloadFile() - downloads and unzips file from web if file is missing.
    • All paths are declared in the Helpers/filePaths.R file and loaded into the Global.
    • Helpers/CountFileLines() - counts lines in file by reading them one by one.
    • Helpers/GetLongestLine() - reads lines one by one and finds the longest line.
## File exists in: C:/Users/Dmitri/Desktop/Documents/R Projects/Coursera/10 - DSS Capstone Project
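
The helper code itself is not shown in this report. As a rough illustration of how the line count and the longest line could be gathered in one pass, here is a minimal sketch. The function name `count_lines_and_longest` and the chunked reading are assumptions made for this example, not the actual Helpers/ implementation, which reads lines one by one.

```r
# Hypothetical helper, not the report's own Helpers/ code.
# Streams a text file in chunks so the whole file never sits in memory.
count_lines_and_longest <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n_lines <- 0L
  longest <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(lines) == 0L) break          # end of file reached
    n_lines <- n_lines + length(lines)
    # allowNA = TRUE yields NA for invalid strings instead of an error
    lens <- nchar(lines, type = "chars", allowNA = TRUE)
    longest <- max(longest, lens, na.rm = TRUE)
  }
  list(lines = n_lines, longest = longest)
}

# Usage on a small temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(c("a", "bcd", "ef"), demo_path)
stats <- count_lines_and_longest(demo_path)
print(stats)
```

Reading in chunks keeps memory use bounded while avoiding the overhead of one `readLines()` call per line.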

The data consists of 4 directories, each corresponding to a different language, with 3 files in each. Those directories are: de_DE, en_US, fi_FI, ru_RU.

The files are collections of posts from different media: blogs, news and Twitter.

For the most part, only the en_US locale is used in this assignment.

2. Create a basic report of summary statistics about the data sets

The blogs file is the biggest and contains the longest document, measured in characters. The twitter file is the smallest and its lines are the shortest, but it has the most documents of all the sets.

| File | Size | Number of lines | Longest line (chars) |
|---|---|---|---|
| en_US.blogs.txt | 210.2 MB | 899,289 | 40,835 |
| en_US.news.txt | 205.8 MB | 77,260 | 5,760 |
| en_US.twitter.txt | 167.1 MB | 2,360,149 | 213 |

Tweets are known to be limited to 140 characters, yet the longest line in the twitter file has 213 characters, so the twitter file will require some cleaning before processing.

3. Report any interesting findings that you amassed so far.

Cleaning

As mentioned earlier, the twitter file contains some garbage. It turned out that most of that garbage is UTF codes in plain text, like:

<f0><U+009F><U+0098><U+0096><f0><U+009F><U+0098>

Those codes are deleted at read time using a simple regex: <(U\\+00..|f0)>
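
A minimal sketch of how this pattern could be applied with base R's `gsub()`. The wrapper name `strip_utf_codes` is made up for this example; only the regular expression comes from the report.

```r
# Hypothetical wrapper around the regex quoted above.
# Removes literal "<f0>" and "<U+00xx>" markers from each string.
strip_utf_codes <- function(x) {
  gsub("<(U\\+00..|f0)>", "", x)
}

raw_line <- "so happy <f0><U+009F><U+0098><U+0096> today"
print(strip_utf_codes(raw_line))
```

Codes that fall outside this pattern are left untouched, so a broader expression may be needed if other markers show up in the data.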

Reading the file

The file was too big to fit into memory, so a small sample of the first 25,000 lines was used. Given the total number of lines in the files, that was not a “fair” sample, and it may cause problems in the future.

Another problem is the twitter file, because it has a large number of short lines. So, instead of reading the first N rows, the SampleFileByObjectSize function reads each file line by line until a given memory limit is reached.
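
The body of SampleFileByObjectSize is not shown here. The idea can be sketched roughly as follows, assuming a hypothetical function `sample_by_object_size` that reads small batches (rather than single lines) and uses `object.size()` to track how much memory the sample occupies.

```r
# Hypothetical sketch, not the report's SampleFileByObjectSize itself.
# Reads from the top of the file until the sample reaches max_bytes.
sample_by_object_size <- function(path, max_bytes, batch = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  used <- 0
  repeat {
    lines <- readLines(con, n = batch, warn = FALSE, skipNul = TRUE)
    if (length(lines) == 0L) break               # end of file
    kept[[length(kept) + 1L]] <- lines
    used <- used + as.numeric(object.size(lines))
    if (used >= max_bytes) break                 # memory budget reached
  }
  as.character(unlist(kept, use.names = FALSE))
}

# Usage: a tiny budget stops after the first batch
demo_path <- tempfile(fileext = ".txt")
writeLines(sprintf("tweet number %d", 1:100), demo_path)
demo_sample <- sample_by_object_size(demo_path, max_bytes = 1, batch = 10L)
print(length(demo_sample))
```

Like reading the first N rows, this still samples only from the start of the file, so it equalizes sample size in memory across the three sets rather than making the sample random.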

## Processed samples files are in: C:/Users/Dmitri/Desktop/Documents/R Projects/Coursera/10 - DSS Capstone Project/sample

For the analysis, the quanteda package was used instead of the suggested tm package because of its speed and lower memory usage.

There were some problems with quanteda. Some documentation examples didn’t work as expected.

  • dfm didn’t lowercase the words after the tokenizer was applied.
  • toLower couldn’t accept the tokenized object, although the documentation says it can.

Most of the time, dfm objects would be constructed from texts or a corpus, without calling tokenize() as an intermediate step. Here, punctuation and numbers must be removed, so the tokenizer is called explicitly, and toLower is applied before the tokenizer.

Creating a document-feature matrix

The easiest way to examine the documents is to create a document-feature matrix (DFM), which counts how often each feature (word) occurs in each document.

The dimensions of the produced DFMs (documents × features) are:

  • Blogs: 25020, 48663
  • News: 27720, 49022
  • Twitter: 53400, 40956
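
To make the shape of such a matrix concrete, here is a small base-R sketch on two toy documents. It deliberately avoids quanteda and only mimics the steps described above (lowercase, strip punctuation and numbers, tokenize, count); the texts and variable names are invented for the illustration.

```r
# Toy illustration in base R; the report itself builds its DFMs with quanteda.
texts <- c(d1 = "The cat sat. The cat ran!", d2 = "A dog sat in 2016.")

clean <- tolower(texts)                                # lowercase first
clean <- gsub("[[:punct:][:digit:]]+", " ", clean)     # drop punctuation, numbers
tokens <- strsplit(trimws(clean), "\\s+")              # split on whitespace

vocab <- sort(unique(unlist(tokens)))                  # one column per feature
counts <- vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
toy_dfm <- t(counts)                                   # rows = documents
colnames(toy_dfm) <- vocab

print(dim(toy_dfm))   # number of documents, number of features
print(toy_dfm)
```

The two numbers reported for each set above are read the same way: the first is the number of documents, the second the number of distinct features.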

Side by side, the wordclouds look like this (blogs, news, tweets):

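library(quanteda)      # dfm objects and their plot() wordcloud method
library(RColorBrewer)  # brewer.pal() colour palettes

# Lay the three wordclouds out in one row; mfrow is a vector of the form c(nr, nc)
par(mfrow = c(1, 3))

plot(dfm.blogs, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))
plot(dfm.news, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))
plot(dfm.twitter, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))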

The top features of each set are presented in the tables below.

# The 20 most frequent features of the blogs DFM, as a two-column table
topFeatures.blogs <- topfeatures(dfm.blogs, decreasing = TRUE, n = 20)
df_freq.blogs <- data.frame(keyName = names(topFeatures.blogs), value = topFeatures.blogs, row.names = NULL)
kable(df_freq.blogs, caption = "Top features of Blogs DFM.")

Top features of Blogs DFM.

| keyName | value |
|---|---|
| one | 3523 |
| will | 3093 |
| just | 2849 |
| like | 2806 |
| can | 2742 |
| time | 2616 |
| get | 1942 |
| know | 1774 |
| now | 1687 |
| people | 1577 |
| new | 1536 |
| also | 1498 |
| first | 1473 |
| us | 1441 |
| even | 1437 |
| much | 1416 |
| good | 1406 |
| back | 1401 |
| really | 1392 |
| make | 1390 |

# The 20 most frequent features of the news DFM
topFeatures.news <- topfeatures(dfm.news, decreasing = TRUE, n = 20)
df_freq.news <- data.frame(keyName = names(topFeatures.news), value = topFeatures.news, row.names = NULL)
kable(df_freq.news, caption = "Top features of News DFM.")

Top features of News DFM.

| keyName | value |
|---|---|
| said | 6865 |
| will | 3008 |
| one | 2420 |
| year | 2155 |
| new | 1868 |
| two | 1796 |
| also | 1631 |
| can | 1612 |
| time | 1599 |
| first | 1595 |
| just | 1506 |
| like | 1441 |
| last | 1404 |
| state | 1392 |
| people | 1343 |
| years | 1311 |
| get | 1175 |
| three | 1126 |
| now | 1028 |
| city | 1008 |

# The 20 most frequent features of the twitter DFM
topFeatures.twitter <- topfeatures(dfm.twitter, decreasing = TRUE, n = 20)
df_freq.twitter <- data.frame(keyName = names(topFeatures.twitter), value = topFeatures.twitter, row.names = NULL)
kable(df_freq.twitter, caption = "Top features of Twitter DFM.")

Top features of Twitter DFM.

| keyName | value |
|---|---|
| just | 3366 |
| like | 2832 |
| get | 2513 |
| love | 2418 |
| u | 2299 |
| good | 2261 |
| will | 2170 |
| can | 2134 |
| day | 2083 |
| thanks | 2073 |
| rt | 1991 |
| now | 1947 |
| one | 1946 |
| know | 1777 |
| great | 1720 |
| time | 1718 |
| go | 1674 |
| today | 1612 |
| lol | 1547 |
| new | 1524 |

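As shown, there are some differences between the sets: "said" dominates the news sample, while the twitter sample is full of informal tokens such as "u", "rt" and "lol".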

4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

There is some uncertainty in the plans. Mostly, I plan to continue examining n-grams of orders 1 to 4 and to predict the next word using a simple fall-back strategy.

That is, try to find a matching 4-gram first, and then move to lower orders if nothing similar is found.
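
The fall-back idea can be sketched in base R as follows. This is only an illustration under simple assumptions: the functions `build_ngrams` and `predict_next` are hypothetical, the toy token vector is made up, raw counts are used without any smoothing, and ties are broken by alphabetical order.

```r
# Hypothetical sketch of a fall-back (back-off) next-word predictor.
# model[[n]] is a table: rows = (n-1)-word context, columns = next word.
build_ngrams <- function(tokens, max_n = 4L) {
  model <- list()
  for (n in seq_len(max_n)) {
    if (length(tokens) < n) break
    starts <- seq_len(length(tokens) - n + 1L)
    context <- vapply(
      starts,
      function(i) paste(tokens[i + seq_len(n - 1L) - 1L], collapse = " "),
      character(1)
    )
    nxt <- tokens[starts + n - 1L]
    model[[n]] <- table(context, nxt)
  }
  model
}

# Try the highest order first, then fall back to shorter contexts.
predict_next <- function(model, history) {
  for (n in rev(seq_along(model))) {
    k <- n - 1L                                  # context length for order n
    if (length(history) < k) next
    ctx <- paste(tail(history, k), collapse = " ")
    tab <- model[[n]]
    if (ctx %in% rownames(tab)) {
      row <- tab[ctx, , drop = FALSE]
      return(colnames(row)[which.max(row)])      # most frequent continuation
    }
  }
  NA_character_
}

toy_tokens <- c("i", "love", "you", "i", "love", "it", "i", "love", "you")
toy_model <- build_ngrams(toy_tokens, max_n = 4L)
print(predict_next(toy_model, c("i", "love")))
print(predict_next(toy_model, c("never", "seen", "love")))
```

In the second call no 4-gram or 3-gram context matches, so the prediction falls back to the bigram context "love". An unseen context at every order ends at the unigram table, which is where smoothing and unknown-word handling would have to come in.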

The most concerning part is smoothing and predicting unknown words. Another concern is the very limited amount of time (predictions must be made on the fly) and resources.