This is the R Markdown document for the Data Science Capstone course assignment. It explains my milestone report submission on the way to the Data Science Specialization final project.
The data for this analysis comes from a corpus called “HC Corpora”.
The main tasks done so far are:
The dataset was downloaded from the course repository, available at: Capstone Dataset.
I saved it on my desktop in a subdirectory of the main project folder called “Capstone_Dataset”. Under it there are 4 subfolders, one per language (de - German, en - English, fi - Finnish, ru - Russian). Each one contains 3 similar files representing the corpora from SwiftKey - blogs, news and twitter. They hold a large number of sentences and millions of words.
So far I have analyzed only the English files:
| Locale | File | Size (MB) | Lines | Words |
|---|---|---|---|---|
| en_US | blogs | 200.4 | 1,010,242 | 37,334,131 |
| en_US | news | 196.3 | 899,288 | 34,372,530 |
| en_US | twitter | 159.4 | 2,360,148 | 30,373,583 |
| | Total | 556.1 | 4,269,678 | 102,080,244 |
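For reference, a minimal sketch of how these figures can be computed in R (the path and the whitespace-based word count are assumptions; the milestone script may count differently):

```r
# Sketch: size (MB), line count and word count for one corpus file
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(size_mb = round(file.size(path) / 1024^2, 1),
             lines   = length(lines),
             words   = sum(lengths(strsplit(lines, "\\s+"))))
}

# Example (illustrative path):
# file_stats("Capstone_Dataset/en_US/en_US.blogs.txt")
```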
To sample the files above, I created a set of functions stored in this .R script:
-> sampling.R (click this link to view the source on my GitHub).
It includes two other scripts:
A main function called “read_dataset()” reads the 3 files (blogs, news and twitter), collects a systematic sample of 30% of the lines, cleans it, and saves the result as 3 .RDS files (to save space and future processing time).
This process took 334 minutes (about 5.6 hours) and its log was saved in this file: sampling_output.txt.
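A minimal sketch of the systematic-sampling step behind read_dataset() (the helper name sample_to_rds() and the paths are illustrative assumptions; the actual code is in sampling.R):

```r
# Sketch: keep ~30% of the lines, evenly spaced, and save them as .RDS
sample_to_rds <- function(txt_path, rds_path, rate = 0.30) {
  lines <- readLines(txt_path, encoding = "UTF-8", skipNul = TRUE)
  n     <- length(lines)
  idx   <- round(seq(1, n, length.out = round(rate * n)))  # systematic (evenly spaced) indices
  saveRDS(lines[idx], rds_path)                            # compact binary file, fast to reload
  invisible(lines[idx])
}

# Example (illustrative paths):
# sample_to_rds("Capstone_Dataset/en_US/en_US.blogs.txt", "blogs.rds")
```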
This was the hardest part of the job. I had some trouble converting the .RDS files into a source for the ‘tm’ Corpus() method. Afterwards I did another transformation to provide the input expected by the ‘quanteda’ package [its corpus() method]. While generating two intermediate .txt files I did a second cleaning pass, removing some features that can cause trouble in the exploratory analysis and NLP processing - tweet characters (hashtags, @mentions), numbers and extra spaces. But some methods from these packages didn’t work well. In ‘qdapRegex’, the methods to remove tweet features performed poorly: they took far too long and I had to abort their use. I intend to solve this from now on.
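As one possible workaround (an idea I still need to validate, not the code used for this report), a plain gsub() pass in base R can strip the same tweet features and may avoid the performance issue:

```r
# Sketch: remove Twitter-specific features, numbers and extra spaces with base-R regex
clean_tweet_features <- function(x) {
  x <- gsub("(?:#|@)\\S+", " ", x, perl = TRUE)  # hashtags and @mentions
  x <- gsub("http\\S+", " ", x)                  # URLs
  x <- gsub("[0-9]+", " ", x)                    # numbers
  x <- gsub("\\s+", " ", x)                      # collapse extra spaces
  trimws(x)
}

# clean_tweet_features("RT @user check #rstats at http://example.com in 2016")
# -> "RT check at in"
```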
To save time, the script checks whether part of the necessary work has already been done before running each step.
This is the core [R] code I’ve used:
SAMPLE_DIR = "./Samples/"
SAMPLE_CLEAN = "sample_clean.txt"
cFiles <- c("news","twitter","blogs")
# generate 3 txt files with the corpora samples from the SwiftKey dataset, if they do not exist yet
if (!file.exists(SAMPLE_DIR)) {
  dir.create(SAMPLE_DIR)
  # objRDS() is a helper from my scripts that loads the saved .RDS sample for file i
  lapply(cFiles, function(i) {
    write.table(objRDS(i), paste0(SAMPLE_DIR, i, ".txt"), sep = "\t",
                col.names = FALSE, quote = FALSE, row.names = FALSE, append = TRUE)
  })
}
print(paste("Directory",SAMPLE_DIR,"Corpus' created with 3 sample text files."))
# create a primary Corpus with the "tm" package, containing all sample texts (as a Source) to be cleaned
# (because the cleaning functions of the "quanteda" package aren't working well yet...)
# generate a clean text file with the corpora sample, if it does not exist yet
if (!file.exists(SAMPLE_CLEAN)) {
  cp <- Corpus(DirSource(SAMPLE_DIR), readerControl = list(language = "lat"))
  print("Corpus object created.")
  #cp <- removeTweetFeatures(cp)
  cp <- removeOtherFeatures(cp, numbers = TRUE, punctuation = FALSE, spaces = TRUE, stopwords = TRUE)
  # and now, rewrite the cleaned text to another file representing the sample corpora already cleaned up
  dfClean <- data.frame(text = sapply(cp, as.character), stringsAsFactors = FALSE)
  write.table(dfClean, SAMPLE_CLEAN, sep = "", col.names = FALSE, quote = FALSE,
              row.names = FALSE, append = TRUE)
  print(paste(SAMPLE_CLEAN, "is generated"))
  rm(cp, dfClean)  # (release memory)
}
# this clean sample text file can be used as a source for the "corpus" method of the "quanteda" package
mycorp <- corpus(textfile(SAMPLE_CLEAN))
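removeTweetFeatures() and removeOtherFeatures() are helpers from my own scripts. For context, a minimal sketch of what removeOtherFeatures() could look like using the standard ‘tm’ transformations (this body is an assumption, not my actual implementation):

```r
library(tm)

# Sketch: apply the selected standard 'tm' transformations to a Corpus object
removeOtherFeatures <- function(cp, numbers = TRUE, punctuation = TRUE,
                                spaces = TRUE, stopwords = TRUE) {
  if (numbers)     cp <- tm_map(cp, removeNumbers)
  if (punctuation) cp <- tm_map(cp, removePunctuation)
  if (stopwords)   cp <- tm_map(cp, removeWords, tm::stopwords("english"))
  if (spaces)      cp <- tm_map(cp, stripWhitespace)
  cp
}
```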
Below we show a graph known as a “Word Cloud”. It is a handy tool to highlight the most frequent words found in these corpora (I chose the top 100). After removing stop words, we can see visually that a few tokens prevail: will, said, just, one, get, can, like.
[R] function to produce the word cloud plot, taking a ‘tm’ corpus object as parameter:
library(wordcloud)
library(RColorBrewer)
plot_word_cloud <- function(corp) {
  tdm <- TermDocumentMatrix(corp)             # term-document matrix from the 'tm' corpus
  m   <- as.matrix(tdm)
  v   <- sort(rowSums(m), decreasing = TRUE)  # total frequency of each term
  d   <- data.frame(word = names(v), freq = v)
  pal <- brewer.pal(9, "BuGn")
  pal <- pal[-(1:2)]                          # drop the two lightest shades
  png("wordcloud.png", width = 1280, height = 800)
  wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100,
            random.order = TRUE, rot.per = .15, colors = pal,
            vfont = c("sans serif", "plain"))
  dev.off()
}
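As a usage sketch (rebuilding a small ‘tm’ corpus from the cleaned sample file, since the cp object was released earlier; the object name cp_clean is an assumption):

```r
# rebuild a 'tm' corpus from the cleaned sample file and draw the cloud
cp_clean <- Corpus(VectorSource(readLines(SAMPLE_CLEAN, skipNul = TRUE)))
plot_word_cloud(cp_clean)   # writes wordcloud.png with the 100 most frequent terms
```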
To build the n-gram tokens I used the ‘tokenize’ function from the ‘quanteda’ package. This is the [R] code excerpt:
# build 1-, 2- and 3-grams from the quanteda corpus and plot their frequency distributions
for (i in 1:3) {
  tk <- tokenize(mycorp, ngrams = i, concatenator = ',')
  plot_bar_gram(table_tokens(tk), i)
}
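table_tokens() and plot_bar_gram() are small helpers defined in my scripts; a minimal sketch of what such helpers could do (the bodies below are assumptions, not my actual implementation):

```r
# Sketch: count token frequencies and keep the top n
table_tokens <- function(tk, n = 20) {
  freq <- sort(table(unlist(tk)), decreasing = TRUE)
  head(data.frame(token = names(freq), freq = as.integer(freq),
                  stringsAsFactors = FALSE), n)
}

# Sketch: horizontal bar chart of the most frequent n-grams
plot_bar_gram <- function(df, n_gram) {
  par(mar = c(4, 10, 2, 1))   # wide left margin for the token labels
  barplot(rev(df$freq), names.arg = rev(df$token), horiz = TRUE, las = 1,
          main = paste0(n_gram, "-gram frequency"), xlab = "frequency")
}
```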
These are the token frequency distribution graphs (from the document-feature matrix, dfm) that I obtained after applying the ‘quanteda’ function ‘tokenize’ (the code above):
You can see the complete code used to produce this exploratory analysis in my GitHub repository for this project:
To enhance the prediction modelling and finish my project, I am taking into account: