Analysis of the data for the Coursera Data Science Capstone project

#### Diyana Nanova

#### June 2022

The objective of this report is to show various statistical properties of the data set that can later be used when building the prediction model for the final data product, the Shiny application. Using exploratory data analysis, this report describes the major features of the training data, which forms the basis for building a predictive model.

The goal of the report is to apply the skills acquired in the specialization to create a predictive text model, using a large corpus of text documents as training data. Natural language processing techniques will be used to perform the analysis and to build the predictive model.

The data can be found at the following link on Coursera:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data set includes text files in several languages. For this project, the German-language files are used. The model will be trained on a unified document corpus compiled from the following three sources of text data:

  1. Blogs
  2. News
  3. Twitter

Load required libraries

library(doParallel)
library(stringi)
library(dplyr)
library(kableExtra)
library(SnowballC)
library(ggplot2)
library(gridExtra)
library(stringr)
library(tidyverse)
library(tidytext)
library(tm)
library(NLP)
library(ggraph)
library(foreach)
library(iterators)
library(parallel)
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)

Load the training data corpus and convert the text

## [1] "de_DE.blogs.txt"     "de_DE.news.txt"      "de_DE.twitter.txt"  
## [4] "gg.Rmd"              "project10_week2.Rmd" "project10.R"

Load files and show summaries

Smaller subsets of the data are used for the analysis.

# Read each German source file as UTF-8, skipping embedded nulls,
# and keep the first 10,000 lines as a smaller working sample.
blogs_con <- file(paste0(directory_de, "/de_DE.blogs.txt"), "r")
blogs <- readLines(blogs_con, encoding = "UTF-8", skipNul = TRUE)
close(blogs_con)
blogs_sm <- blogs[1:10000]

news_con <- file(paste0(directory_de, "/de_DE.news.txt"), "r")
news <- readLines(news_con, encoding = "UTF-8", skipNul = TRUE)
close(news_con)
news_sm <- news[1:10000]

twitter_con <- file(paste0(directory_de, "/de_DE.twitter.txt"), "r")
twitter <- readLines(twitter_con, encoding = "UTF-8", skipNul = TRUE)
close(twitter_con)
twitter_sm <- twitter[1:10000]

#### Create corpus data

The datasets are converted into corpora and then cleaned. This includes the following transformation steps for each document (a sketch of the pipeline is shown after the list):

  1. Remove numbers
  2. Remove punctuation marks
  3. Strip extra whitespace
  4. Convert all text to lower case
  5. Remove common German stop words
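
A minimal sketch of this cleaning pipeline, assuming the corpora are built with tm's Corpus() and VectorSource() and named data_doc_b, data_doc_n and data_doc_t (consistent with the warnings below):

clean_corpus <- function(corpus) {
  corpus %>%
    tm_map(removeNumbers) %>%                  # 1. remove numbers
    tm_map(removePunctuation) %>%              # 2. remove punctuation marks
    tm_map(stripWhitespace) %>%                # 3. strip extra whitespace
    tm_map(content_transformer(tolower)) %>%   # 4. convert to lower case
    tm_map(removeWords, stopwords("de"))       # 5. remove German stop words
}

data_doc_b <- clean_corpus(Corpus(VectorSource(blogs_sm)))
data_doc_n <- clean_corpus(Corpus(VectorSource(news_sm)))
data_doc_t <- clean_corpus(Corpus(VectorSource(twitter_sm)))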
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(data_doc_b, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(data_doc_b, removeWords, stopwords("de")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(data_doc_n, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(data_doc_n, removeWords, stopwords("de")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(data_doc_t, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(data_doc_t, removeWords, stopwords("de")):
## transformation drops documents

#### Create document-term matrices for the datasets

The next step is to create a document-term matrix (DTM) for each of the individual datasets.
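
A brief sketch of this step, assuming the cleaned corpora from above (the dtm_* object names are placeholders):

dtm_blogs   <- DocumentTermMatrix(data_doc_b)
dtm_news    <- DocumentTermMatrix(data_doc_n)
dtm_twitter <- DocumentTermMatrix(data_doc_t)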

Basic Data Summary

An initial investigation of the data shows that, on average, all three text sources have a relatively low number of words per line. News has the most words per line, followed by blogs, while Twitter has the fewest, as expected given its character limit. The news data also has the highest character count and the lowest number of lines.

Number of lines per file

How many lines are in the files?

numLines <- sapply(list(blogs, news, twitter), length)
numLines
## [1] 371440 244743 947774

Number of characters per file

How many characters are in each file?

numChars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
numChars
## [1] 83204145 93388799 72776717

Number of words per file

How many words are in each file?

# stri_stats_latex() returns several text statistics; row 4 ("Words") is the word count
numWords <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]
numWords
## [1] 12496671 13140403 11542946

Words per line (summary)

How many words are there per line in each file?

wplSummary = sapply(list(blogs, news, twitter),
                    function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wplSummary) = c('Min', 'Mean', 'Max')
wplSummary
##            [,1]      [,2]     [,3]
## Min     0.00000   1.00000  0.00000
## Mean   34.14457  54.64954 12.28779
## Max  1638.00000 603.00000 42.00000

Summary of all files

Prior to building the unified document corpus and cleaning the data, a basic summary of the three text corpora is provided, including the number of lines, characters, and words for each source file, together with basic statistics on the number of words per line (min, mean, and max).

# Combine the per-file line, character, and word counts with the
# rounded words-per-line statistics (min, mean, max)
summary <- data.frame(
  File = c("de_DE.blogs.txt", "de_DE.news.txt", "de_DE.twitter.txt"),
  Lines = numLines,
  Characters = numChars,
  Words = numWords,
  t(round(wplSummary))
)
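
One possible way to render this summary as a formatted table, using the kableExtra package loaded above (the styling options are only an illustration; knitr is assumed to be available for kable()):

knitr::kable(summary, row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)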

Histogram of Words per Line

#### Blogs histogram

#### News histogram

#### Twitter histogram
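
A minimal sketch of how one of these histograms could be produced with ggplot2, shown here for the blogs sample (the data frame and column names are placeholders):

# Count the words in each line of the blogs sample and plot the distribution
blogs_wpl <- data.frame(words = stri_count_words(blogs_sm))

ggplot(blogs_wpl, aes(x = words)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  labs(title = "Words per line: blogs", x = "Words per line", y = "Number of lines")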

Create small files

For this step, 50,000 rows of each data set are used, which makes the data easier to process.
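
For illustration, a minimal sketch of taking such subsets (the *_50k object names are placeholders):

# Keep the first 50,000 lines of each source as a smaller working set
blogs_50k   <- blogs[1:50000]
news_50k    <- news[1:50000]
twitter_50k <- twitter[1:50000]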

### Most used words

The predictive model I plan to develop for the Shiny application will handle unigrams and bigrams. In this section, I tokenize the sample data and construct matrices of unigrams and bigrams.

#### Most used words: blogs

#### Most used words: news

#### Most used words: twitter
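
A sketch of how the most frequent words can be read off a document-term matrix, shown for the blogs sample (dtm_blogs is the placeholder name introduced above):

# Column sums of the document-term matrix give the total frequency of each term
freq_blogs <- sort(colSums(as.matrix(dtm_blogs)), decreasing = TRUE)
head(freq_blogs, 10)   # ten most frequent words in the blogs sample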

### Tokenizing and N-Gram Generation

#### Bigrams for blogs

#### Bigrams for news

#### Bigrams for twitter
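
A minimal sketch of bigram tokenization with tidytext, shown for the blogs sample (object and column names are assumptions):

# Split each line into overlapping two-word sequences and count them
blogs_bigrams <- tibble(text = blogs_sm) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

head(blogs_bigrams, 10)   # most frequent bigrams in the blogs sample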

Word cloud

The next step is to create a word cloud for each data set.

### Create word clouds

#### Word cloud for blogs

#### Word cloud for news

#### Word cloud for twitter

#### Word cloud for all data sets
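
A sketch of how one of these word clouds could be drawn with the wordcloud package, shown for the blogs sample and reusing the freq_blogs frequencies from above:

set.seed(1234)   # make the layout reproducible
wordcloud(words = names(freq_blogs), freq = freq_blogs, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))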