The report contains an exploratory analysis of text data for the Coursera Data Science Capstone Project. The source data includes texts from three sources: Twitter, blogs and news. Four languages are available: English, Finnish, Russian and German. This analysis focuses on the English sources. The goal of the project is to build an application based on a predictive text model that is capable of predicting the next word for a given phrase. The data for training the predictive model were provided by SwiftKey.
The first step is to download the data file “Coursera-SwiftKey.zip” from the Internet.
# data URL
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download the data file
if (!dir.exists('data')) {
    dir.create('data')
}
if (!file.exists('./data/Coursera-SwiftKey.zip')) {
    download.file(url,
                  destfile = "./data/Coursera-SwiftKey.zip")
}
# unzip the data file
if (!dir.exists('./data/final')) {
    unzip('./data/Coursera-SwiftKey.zip', exdir = './data')
}
The data sources are organised by language:
# list the data directories
dir('./data/final')
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
Let’s read the Twitter, blog and news texts in English and count the lines.
# load the text-mining package and read the English texts
library(tm)
en.corp <- VCorpus(DirSource(directory = './data/final/en_US/', encoding = 'UTF-8'),
                   readerControl = list(language = 'en'))
Text source | Number of lines | File size, Mb |
---|---|---|
Twitter | 2360148 | 159.4 |
Blogs | 899288 | 200.4 |
News | 1010242 | 196.3 |
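The figures in the table can be reproduced, for example, with base R (a sketch; reading the full files with readLines may take a while):
# count lines and report file sizes (in Mb) for the English source files
files <- dir('./data/final/en_US', full.names = TRUE)
data.frame(file = basename(files),
           lines = sapply(files, function(f) length(readLines(f, skipNul = TRUE)),
                          USE.NAMES = FALSE),
           size.Mb = round(file.size(files) / 2^20, 1))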
The application we are going to create will not know whether the phrase it has to complete comes from Twitter, a blog or a news article, so we will ignore the information about text origin and eventually combine the Twitter, blog and news texts.
The data files are rather large, so we will use 10% of the lines as a training sample. We also need to clean the data:
# training share (10% of lines, as stated above) and random seed;
# the seed value here is an assumption, used in case it is not set earlier
train.part <- 0.1
my.seed <- 123
for (i in 1:3) {
    # select the training sample
    n <- length(en.corp[[i]]$content)
    set.seed(my.seed)
    inTrain <- sample(1:n, size = train.part * n)
    en.corp[[i]]$content <- en.corp[[i]]$content[inTrain]
    # remove URLs from the text
    en.corp[[i]]$content <- gsub(pattern = '(f|ht)tp(s?)://(.*)[.][a-z]+',
                                 replacement = '',
                                 x = en.corp[[i]]$content)
}
# clean the text data
en.corp <- tm_map(en.corp,
                  content_transformer(function(x) iconv(enc2utf8(x),
                                                        sub = 'byte')))
en.corp <- tm_map(en.corp, removePunctuation)
en.corp <- tm_map(en.corp, removeNumbers)
en.corp <- tm_map(en.corp, content_transformer(tolower))
my.stopwords.en <- c(stopwords('en'), 'rt', 'via')
en.corp <- tm_map(en.corp, removeWords, my.stopwords.en)
en.corp <- tm_map(en.corp, stripWhitespace)
save.image('./data/corp_en.RData')
For any further analysis we will need document-term matrices, which count word occurrences in the text corpora.
# remove documents that are empty or consist of a single space
for (i in 1:3) {
    en.corp[[i]]$content <- en.corp[[i]]$content[!(en.corp[[i]]$content %in% c(' ', ''))]
}
# document-term matrix for the blogs corpus
dtm.bl <- DocumentTermMatrix(VCorpus(VectorSource(en.corp[["en_US.blogs.txt"]]$content)))
dtm.bl <- removeSparseTerms(dtm.bl, 0.999)
freq.bl <- sort(colSums(as.matrix(dtm.bl)), decreasing = TRUE)
# document-term matrix for the news corpus
dtm.nw <- DocumentTermMatrix(VCorpus(VectorSource(en.corp[["en_US.news.txt"]]$content)))
dtm.nw <- removeSparseTerms(dtm.nw, 0.999)
freq.nw <- sort(colSums(as.matrix(dtm.nw)), decreasing = TRUE)
# document-term matrix for the twitter corpus
dtm.tw <- DocumentTermMatrix(VCorpus(VectorSource(en.corp[["en_US.twitter.txt"]]$content)))
dtm.tw <- removeSparseTerms(dtm.tw, 0.999)
freq.tw <- sort(colSums(as.matrix(dtm.tw)), decreasing = TRUE)
A word cloud plot shows words scaled in proportion to their frequency.
# word cloud for blogs
library(wordcloud)
wordcloud(names(freq.bl), freq.bl, max.words = 80, random.order = F)
mtext('Top-80 words from blogs', side = 3, line = 1, col = 'blue', font = 3, cex = .9)
head(freq.bl)
## one will just can like time
## 12379 11186 9946 9925 9709 8725
# word cloud for news
wordcloud(names(freq.nw), freq.nw, max.words = 80, random.order = F)
mtext('Top-80 words from news', side = 3, line = 1, col = 'blue', font = 3, cex = .9)
head(freq.nw)
## said will one new also two
## 24972 10719 8114 7060 5926 5786
# word cloud for twitter
wordcloud(names(freq.tw), freq.tw, max.words = 80, random.order = F)
mtext('Top-80 words from twitter', side = 3, line = 1, col = 'blue', font = 3, cex = .9)
head(freq.tw)
## just like get love good will
## 15056 12089 11114 10444 9957 9518
As the plots above suggest, news texts have a distinctive style: a relatively small set of characteristic words (such as “said”) is repeated there especially often. Overall, the text corpora from news, blogs and Twitter contain 3223, 3072 and 1037 unique words respectively, so the vocabulary of the news texts is the most diverse. The vocabulary of Twitter microblogs is the most limited, which is explained by the restriction on the length of tweets. For further analysis it is important to answer the question: is the share of words common to the three text sources large enough to treat these sources equally?
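The unique-word counts quoted above correspond to the numbers of terms left in the frequency vectors:
# number of unique terms per source (after removing sparse terms)
sapply(list(news = freq.nw, blogs = freq.bl, twitter = freq.tw), length)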
common.en <- intersect(intersect(names(freq.bl), names(freq.nw)), names(freq.tw))
common.en <- data.frame(word = common.en,
                        freq.bl = freq.bl[common.en],
                        freq.nw = freq.nw[common.en],
                        freq.tw = freq.tw[common.en])
common.en$rank.freq.bl <- rank(common.en$freq.bl)
common.en$rank.freq.nw <- rank(common.en$freq.nw)
common.en$rank.freq.tw <- rank(common.en$freq.tw)
# Kendall's concordance analysis for the ranks of the common words in the corpora
library(vegan)
kndl <- kendall.global(common.en[, grep(pattern = 'rank.', x = colnames(common.en), value = T)])
kndl
## $Concordance_analysis
## Group.1
## W 7.317040e-01
## F 5.454454e+00
## Prob.F 1.242922e-199
## Chi2 1.927308e+03
## Prob.perm 1.000000e-03
##
## attr(,"class")
## [1] "kendall.global"
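For reference, the coefficient of concordance W reported above is defined, in its basic form (without the correction for tied ranks), as

$$W = \frac{12\sum_{i=1}^{n}\left(R_i - \bar{R}\right)^2}{m^2\left(n^3 - n\right)},$$

where $m = 3$ is the number of rankings (text sources), $n$ is the number of common words, $R_i$ is the sum of the ranks of word $i$ across the sources and $\bar{R}$ is the mean of the $R_i$; $W = 1$ corresponds to identical rankings and $W = 0$ to no agreement at all.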
879 words occur in all three text corpora; they account for 27.3%, 84.8% and 28.6% of the news, Twitter and blog vocabularies respectively. Kendall’s coefficient of concordance between the ranks of the words by their frequency in the three types of texts is statistically significant and shows strong positive agreement: W = 0.73. This means that all three types of texts can be used as a combined sample for training the model.
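Given this conclusion, the three cleaned sources can later be pooled into a single training sample, for example as one character vector (a sketch; the object name en.all is illustrative):
# pool the cleaned twitter, blog and news lines into one character vector
en.all <- unlist(lapply(1:3, function(i) en.corp[[i]]$content), use.names = FALSE)
length(en.all)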
The predictive text model will be based on the frequencies of occurrence of n-grams. According to Wikipedia, “an n-gram is a contiguous sequence of n items from a given sequence of text or speech”. In order to build the model we need to accomplish a number of steps using our corpus:
* build n-gram frequency matrices (a sketch follows below);
* find associations between words and n-grams, using the frequency matrices.
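As a sketch of the first step, a bigram frequency matrix for the blogs corpus could be built with tm by passing a custom tokenizer based on NLP::ngrams (NLP is loaded together with tm); the names BigramTokenizer, dtm.bl.2 and freq.bl.2 are illustrative, and the other two sources are handled analogously:
# tokenizer that splits a document into bigrams ("word1 word2")
BigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = ' '),
           use.names = FALSE)
}
# bigram frequency matrix for the blogs corpus
dtm.bl.2 <- DocumentTermMatrix(VCorpus(VectorSource(en.corp[["en_US.blogs.txt"]]$content)),
                               control = list(tokenize = BigramTokenizer))
# column sums on the sparse matrix (slam is a dependency of tm) avoid a dense conversion
freq.bl.2 <- sort(slam::col_sums(dtm.bl.2), decreasing = TRUE)
head(freq.bl.2)
Associations between words and n-grams can then be explored on the same matrices, for example with tm’s findAssocs().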