####Diyana Nanova
####June 2022
The objective of this report is to show various statistical properties of the data set that can later be used when building the prediction model for the final data product, the Shiny application. Using exploratory data analysis, this report describes the major features of the training data; this is the basis for creating a predictive model.
The goal is to apply the skills acquired in the specialization to create a predictive text model, using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.
The data can be found at the following link on Coursera:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The data set includes text files in several languages. For this project, the German language files are used. The model will be trained using a unified document corpus compiled from the following three sources of text data: blogs (de_DE.blogs.txt), news (de_DE.news.txt), and Twitter messages (de_DE.twitter.txt).
library(doParallel)
library(stringi)
library(dplyr)
library(kableExtra)
library(SnowballC)
library(ggplot2)
library(gridExtra)
library(stringr)
library(tidyverse)
library(tidytext)
library(tm)
library(NLP)
library(ggraph)
library(foreach)
library(iterators)
library(parallel)
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
## [1] "de_DE.blogs.txt" "de_DE.news.txt" "de_DE.twitter.txt"
## [4] "gg.Rmd" "project10_week2.Rmd" "project10.R"
Smaller datasets are used for the analysis: the first 10,000 lines of each file.
# Read the blogs file and keep the first 10,000 lines as a sample
blogs_con <- file(paste0(directory_de, "/de_DE.blogs.txt"), "r")
blogs <- readLines(blogs_con, encoding="UTF-8", skipNul = TRUE)
close(blogs_con)
blogs_sm <- blogs[1:10000]
# Read the news file and keep the first 10,000 lines as a sample
news_con <- file(paste0(directory_de, "/de_DE.news.txt"), "r")
news <- readLines(news_con, encoding="UTF-8", skipNul = TRUE)
close(news_con)
news_sm <- news[1:10000]
# Read the twitter file and keep the first 10,000 lines as a sample
twitter_con <- file(paste0(directory_de, "/de_DE.twitter.txt"), "r")
twitter <- readLines(twitter_con, encoding="UTF-8", skipNul = TRUE)
close(twitter_con)
twitter_sm <- twitter[1:10000]
####Create corpus data
The datasets are saved as corpora and then cleaned. This includes the following transformation steps for each document: removing numbers, removing punctuation, stripping extra whitespace, converting to lower case, and removing German stop words.
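A sketch of this pipeline, assuming the sampled vectors from above are turned into tm corpora named data_doc_b, data_doc_n and data_doc_t (the names that appear in the warnings below); the report does not echo the exact chunk:

# Blogs: build a corpus from the sample and apply the cleaning steps
data_doc_b <- Corpus(VectorSource(blogs_sm)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
data_doc_b <- tm_map(data_doc_b, content_transformer(tolower))
data_doc_b <- tm_map(data_doc_b, removeWords, stopwords("de"))
# data_doc_n and data_doc_t are built from news_sm and twitter_sm in the same way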
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
(The same warning appears for each transformation applied to each of the three corpora data_doc_b, data_doc_n and data_doc_t: removeNumbers, removePunctuation, stripWhitespace, content_transformer(tolower), and removeWords with stopwords("de").)
####Create document term matrices for the datasets
The next step is to create document-term matrices for the individual datasets.
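A sketch of this step, assuming the cleaned corpora data_doc_b, data_doc_n and data_doc_t from above (the matrix names are illustrative):

# One document-term matrix per cleaned corpus: rows are documents (lines), columns are terms
dtm_blogs   <- DocumentTermMatrix(data_doc_b)
dtm_news    <- DocumentTermMatrix(data_doc_n)
dtm_twitter <- DocumentTermMatrix(data_doc_t)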
An initial investigation shows clear differences between the three sources. News has the most words per line on average, followed by blogs, while Twitter has by far the fewest, as expected for short messages. The news file also contains the most characters and the fewest lines.
How many lines are in the files?
numLines <- sapply(list(blogs, news, twitter), length)
numLines
## [1] 371440 244743 947774
How many characters are there per file?
numChars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
numChars
## [1] 83204145 93388799 72776717
How many words are there per file?
numWords <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]
numWords
## [1] 12496671 13140403 11542946
How many words are there per line (summary)?
wplSummary = sapply(list(blogs, news, twitter),
                    function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wplSummary) = c('Min', 'Mean', 'Max')
colnames(wplSummary) = c('blogs', 'news', 'twitter')
wplSummary
##           blogs      news  twitter
## Min     0.00000   1.00000  0.00000
## Mean   34.14457  54.64954 12.28779
## Max  1638.00000 603.00000 42.00000
Prior to building the unified document corpus and cleaning the data, a basic summary of the three text files is provided: the number of lines, characters, and words in each source file, together with basic statistics on the number of words per line (min, mean, and max).
# Combine the file-level statistics into a single summary table
summary <- data.frame(
  File = c("de_DE.blogs.txt", "de_DE.news.txt", "de_DE.twitter.txt"),
  Lines = numLines,
  Characters = numChars,
  Words = numWords,
  t(round(wplSummary))
)
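Since kableExtra is loaded, the summary table is presumably rendered along these lines (a sketch; the report's exact call is not shown):

knitr::kable(summary, caption = "Summary of the three source files") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)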
####Blogs histogram
####News histogram
####Twitter histogram
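Assuming the histograms above show the distribution of words per line in each source, they could be produced along these lines with ggplot2 and gridExtra (a sketch; the report's plotting code is not shown, and the sampled line vectors from above are used):

# Histogram of words per line for one source (helper name is illustrative)
plot_wpl <- function(lines, title) {
  data.frame(wpl = stri_count_words(lines)) %>%
    ggplot(aes(x = wpl)) +
    geom_histogram(bins = 50) +
    labs(title = title, x = "Words per line", y = "Frequency")
}
grid.arrange(plot_wpl(blogs_sm, "Blogs"),
             plot_wpl(news_sm, "News"),
             plot_wpl(twitter_sm, "Twitter"),
             ncol = 1)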
For this step, 50,000 rows of the data sets are used, which makes processing faster.
###Most used words
The predictive model I plan to develop for the Shiny application will handle unigrams and bigrams. In this section, I tokenize the sample data and construct matrices of unigrams and bigrams; a sketch of this step follows the "Tokenizing and N-Gram Generation" heading below.
####Most used words: In section blogs
####Most used words: In section news
####Most used words: In section twitter
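The per-source charts above could be produced from the document-term matrices sketched earlier (shown for blogs; news and twitter are handled the same way; the helper name is illustrative):

# Term frequencies from a document-term matrix, highest first
top_terms <- function(dtm, n = 20) {
  freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
  data.frame(word = names(freq)[1:n], freq = freq[1:n])
}
top_terms(dtm_blogs) %>%
  ggplot(aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most used words: blogs", x = NULL, y = "Frequency")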
###Tokenizing and N-Gram Generation
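A sketch of the tokenization with tidytext, assuming the three samples are first combined into one data frame (the object names here are illustrative, not the report's):

# Combine the samples into one data frame of lines
sample_df <- tibble(
  source = rep(c("blogs", "news", "twitter"),
               times = c(length(blogs_sm), length(news_sm), length(twitter_sm))),
  text   = c(blogs_sm, news_sm, twitter_sm)
)
# Unigrams: one row per word, counted over the whole sample
unigrams <- sample_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
# Bigrams: one row per pair of consecutive words
bigrams <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)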
The next step is to create word clouds for the news and Twitter samples and for all data sets combined.
###Create word clouds
####Word cloud for news
####Word cloud for twitter
####Word clouds for all data sets
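The clouds could be generated with the wordcloud package from the term frequencies sketched above (a sketch; the plot parameters are assumptions):

# Word cloud for the news sample from its document-term matrix
freq_news <- sort(colSums(as.matrix(dtm_news)), decreasing = TRUE)
wordcloud(names(freq_news), freq_news,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
# Combined cloud for all data sets, reusing the unigram counts from the tokenization step
wordcloud(unigrams$word, unigrams$n,
          max.words = 150, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))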