This document explains the major features of the data. The document is produced with echo=FALSE.
The document should be concise: it explains only the major features of the data identified so far and briefly summarizes the plans for creating the prediction algorithm and Shiny app in a way that a non-data-scientist manager can understand. Tables and plots illustrate the important summaries of the data set. The motivation for this project is to:
## File exists in: C:/Users/Dmitri/Desktop/Documents/R Projects/Coursera/10 - DSS Capstone Project
The data consists of four directories, each corresponding to a different language, with three files in each. The directories are: de_DE, en_US, fi_FI, ru_RU.
The files in each directory are collections of text from different media: blogs, news, and Twitter.
For the most part, only the en_US locale is used in this assignment.
The biggest file is the blogs file, and it also contains the longest document, measured in characters. The Twitter file is the smallest and has the shortest lines, but it contains the most documents of all three sets.
| File | Size | Number of lines | Longest line |
|---|---|---|---|
| en_US.blogs.txt | 210.2 MB | 899,289 | 40,835 |
| en_US.news.txt | 205.8 MB | 77,260 | 5,760 |
| en_US.twitter.txt | 167.1 MB | 2,360,149 | 213 |
Given Twitter's known limit of 140 characters per tweet, a longest line of 213 characters means the Twitter file will require some cleaning before processing.
As mentioned above, the Twitter file contains some garbage. It turned out that most of that garbage is UTF codes written out in plain text, like:
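The figures in the table can be reproduced with a few lines of base R. The following is a minimal sketch, not the code used for this report; the helper name `summarise_text_file` is hypothetical, and it assumes the file is UTF-8 text that fits into memory.

```r
# Hypothetical helper: size, line count and longest line of one text file.
# Assumes UTF-8 input that fits into memory.
summarise_text_file <- function(path) {
  lines <- readLines(path, warn = FALSE, skipNul = TRUE, encoding = "UTF-8")
  data.frame(
    file         = basename(path),
    size_mb      = round(file.size(path) / 1024^2, 1),
    n_lines      = length(lines),
    # allowNA = TRUE avoids an error on invalid multibyte strings
    longest_line = max(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}
```

Calling it on each of the three en_US files and combining the rows with `rbind()` would give a table of the same shape as the one above.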
<f0><U+009F><U+0098><U+0096><f0><U+009F><U+0098>
These codes are removed while reading, using a simple regular expression: <(U\\+00..|f0)>
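As an illustration, the regular expression can be applied with base R's `gsub()`. The wrapper name `strip_utf_debris` is an assumption made for this sketch; only the pattern itself comes from the report.

```r
# Remove plain-text UTF code debris such as <f0> or <U+009F> from each line.
# The pattern is the one quoted in the text; the function name is hypothetical.
strip_utf_debris <- function(lines) {
  gsub("<(U\\+00..|f0)>", "", lines)
}

cleaned <- strip_utf_debris("so happy <f0><U+009F><U+0098><U+0096>")
```

Note that the pattern only covers codes of the form `<U+00..>` and `<f0>`, so other escaped sequences would pass through unchanged.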
The files were too big to fit into memory, so a small sample of the first 25,000 lines was used. Given the total number of lines in the files, this is not a “fair” (random) sample, and it may cause problems later.
The Twitter file poses another problem because it consists of a large number of short lines. So, instead of reading the first N rows, the SampleFileByObjectSize function reads each file line by line until a given memory limit is reached.
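The idea of reading until a memory limit is reached can be sketched as follows. This is not the original SampleFileByObjectSize implementation, which is not shown in this report; the function name `sample_lines_by_size` and its parameters are assumptions. It reads in chunks for speed, and setting `chunk = 1L` gives strict line-by-line reading.

```r
# Hypothetical sketch: collect lines from a file until the in-memory size of
# the collected character vector reaches `max_bytes`.
sample_lines_by_size <- function(path, max_bytes, chunk = 1000L) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  lines <- character(0)
  repeat {
    batch <- readLines(con, n = chunk, warn = FALSE, skipNul = TRUE)
    if (length(batch) == 0L) break                      # end of file
    lines <- c(lines, batch)
    if (as.numeric(object.size(lines)) >= max_bytes) break  # limit reached
  }
  lines
}
```

Like the first-N-rows approach, this still takes lines from the start of the file, so the sample is bounded in size but not random.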
## Processed samples files are in: C:/Users/Dmitri/Desktop/Documents/R Projects/Coursera/10 - DSS Capstone Project/sample
For the analysis, the quanteda package was used instead of the suggested tm package because of its speed and lower memory usage.
There were some problems with quanteda: some documentation examples did not work as expected.
Most of the time, dfm objects would be constructed from texts or a corpus, without calling tokenize() as an intermediate step. Here, punctuation and numbers must be removed, so the tokenizer is called explicitly, and toLower is applied before tokenizing.
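The preprocessing order described here (lowercase first, then drop punctuation and numbers, then split into tokens) can be illustrated without quanteda. The sketch below uses base R only; `tokenize_simple` is a hypothetical name and is not a quanteda function.

```r
# Hypothetical base-R tokenizer: lowercase, replace punctuation and digits
# with spaces, then split on whitespace. Returns one token vector per text.
tokenize_simple <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]+", " ", text)
  text <- gsub("[[:digit:]]+", " ", text)
  tokens <- strsplit(trimws(text), "[[:space:]]+")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

toks <- tokenize_simple("Hello, World! It's 2016.")
```

Because punctuation is replaced by a space, contractions such as "It's" are split into two tokens ("it" and "s"); a real tokenizer would typically handle apostrophes more carefully.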
The easiest way to examine the documents is to create a document-feature matrix (DFM).
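To make the structure concrete, here is a toy document-feature matrix built in base R: one row per document, one column per feature (word), and each cell holding a count. The data and object names are made up for illustration; the report itself builds its DFMs with quanteda.

```r
# Toy input: already-tokenized documents (hypothetical data).
tokens <- list(
  doc1 = c("the", "cat", "sat"),
  doc2 = c("the", "dog", "sat", "sat")
)

# All distinct features across documents, in alphabetical order.
features <- sort(unique(unlist(tokens)))

# Count each feature per document; vapply yields features x documents,
# so transpose to get documents x features.
dfm_toy <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, features), nbins = length(features)),
  integer(length(features))
))
colnames(dfm_toy) <- features

# Most frequent features over all documents, analogous to a top-features list.
top_toy <- sort(colSums(dfm_toy), decreasing = TRUE)
```

A real DFM is stored as a sparse matrix, since most words do not occur in most documents; the dense matrix here is only practical for tiny examples.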
The produced DFMs look like:
Side by side, the word clouds look like this (blogs, news, tweets):

```r
library(RColorBrewer)  # provides brewer.pal()

# Arrange the three word clouds in one row: mfrow = c(nr, nc)
par(mfrow = c(1, 3))
plot(dfm.blogs, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))
plot(dfm.news, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))
plot(dfm.twitter, max.words = 80, colors = brewer.pal(8, "Dark2"), scale = c(8, .5))
```
The top 20 features of each set are presented in the following tables.

```r
library(knitr)  # provides kable()

# 20 most frequent features of the blogs DFM, as a two-column table
topFeatures.blogs <- topfeatures(dfm.blogs, decreasing = TRUE, n = 20)
df_freq.blogs <- data.frame(keyName = names(topFeatures.blogs), value = topFeatures.blogs, row.names = NULL)
kable(df_freq.blogs, caption = "Top features of Blogs DFM.")
```
| keyName | value |
|---|---|
| one | 3523 |
| will | 3093 |
| just | 2849 |
| like | 2806 |
| can | 2742 |
| time | 2616 |
| get | 1942 |
| know | 1774 |
| now | 1687 |
| people | 1577 |
| new | 1536 |
| also | 1498 |
| first | 1473 |
| us | 1441 |
| even | 1437 |
| much | 1416 |
| good | 1406 |
| back | 1401 |
| really | 1392 |
| make | 1390 |
```r
# 20 most frequent features of the news DFM
topFeatures.news <- topfeatures(dfm.news, decreasing = TRUE, n = 20)
df_freq.news <- data.frame(keyName = names(topFeatures.news), value = topFeatures.news, row.names = NULL)
kable(df_freq.news, caption = "Top features of News DFM.")
```
| keyName | value |
|---|---|
| said | 6865 |
| will | 3008 |
| one | 2420 |
| year | 2155 |
| new | 1868 |
| two | 1796 |
| also | 1631 |
| can | 1612 |
| time | 1599 |
| first | 1595 |
| just | 1506 |
| like | 1441 |
| last | 1404 |
| state | 1392 |
| people | 1343 |
| years | 1311 |
| get | 1175 |
| three | 1126 |
| now | 1028 |
| city | 1008 |
```r
# 20 most frequent features of the Twitter DFM
topFeatures.twitter <- topfeatures(dfm.twitter, decreasing = TRUE, n = 20)
df_freq.twitter <- data.frame(keyName = names(topFeatures.twitter), value = topFeatures.twitter, row.names = NULL)
kable(df_freq.twitter, caption = "Top features of Twitter DFM.")
```
| keyName | value |
|---|---|
| just | 3366 |
| like | 2832 |
| get | 2513 |
| love | 2418 |
| u | 2299 |
| good | 2261 |
| will | 2170 |
| can | 2134 |
| day | 2083 |
| thanks | 2073 |
| rt | 1991 |
| now | 1947 |
| one | 1946 |
| know | 1777 |
| great | 1720 |
| time | 1718 |
| go | 1674 |
| today | 1612 |
| lol | 1547 |
| new | 1524 |
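As shown, there are some differences between the sets. News is dominated by reporting words such as "said", "year" and "state", while Twitter contains informal tokens such as "u", "rt" and "lol" that do not appear among the top features of the other two sets.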
There is some uncertainty in the plans. Mostly, I plan to continue examining n-grams of order 1 to 4 and to predict the next word with a simple fall-back strategy:
try to find a matching 4-gram first, and move to lower orders if nothing similar is found.
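The fall-back strategy can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the function names `build_ngrams` and `predict_next` are hypothetical, counts are raw frequencies with no smoothing, and the final fallback is simply the most frequent single word.

```r
# Count n-grams of order 1..max_n in a token vector.
# Returns a list; element n is a named integer vector of n-gram counts.
build_ngrams <- function(tokens, max_n = 4L) {
  lapply(seq_len(max_n), function(n) {
    if (length(tokens) < n) return(integer(0))
    idx <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(idx, function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
                    character(1))
    counts <- table(grams)
    setNames(as.integer(counts), names(counts))
  })
}

# Predict the next word: try the highest order first, then back off.
predict_next <- function(context, ngrams) {
  for (n in rev(seq_len(length(ngrams))[-1])) {   # e.g. 4, 3, 2
    k <- n - 1L                                   # words of context needed
    tab <- ngrams[[n]]
    if (length(context) < k || length(tab) == 0L) next
    prefix <- paste0(paste(tail(context, k), collapse = " "), " ")
    hits <- tab[startsWith(names(tab), prefix)]
    if (length(hits) > 0L) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1L))
    }
  }
  # Nothing matched at any order: fall back to the most frequent unigram.
  names(ngrams[[1]])[which.max(ngrams[[1]])]
}

toy <- c("i", "love", "you", "i", "love", "cats", "i", "love", "you")
ng <- build_ngrams(toy)
guess <- predict_next(c("i", "love"), ng)
```

Scanning all n-gram names with `startsWith()` is fine for a toy corpus but too slow for on-the-fly prediction; a lookup table keyed by prefix would be the more realistic choice.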
The most concerning parts are smoothing and predicting unknown words. Another concern is the very limited amount of time (predictions must be made on the fly) and resources.