Data_Science_Capstone_Week_2A

Thomas J James

5/9/2021


Introduction

The Coursera Data Science Capstone - Milestone Report (aka, “the report”) is intended to give an introductory look at analyzing the SwiftKey data set and figuring out:

  1. What the data consists of, and
  2. Which standard tools and models are used for this type of data.

The report is written in a clear, concise style that data scientists and non-data scientists alike can understand and make sense of.

Purpose

The purpose of the report is a four-fold exploratory data analysis that will:

  1. Demonstrate that the data has been downloaded from SwiftKey (via Coursera) and successfully loaded into R.

  2. Create a basic report of summary statistics about the data sets.

  3. Report any interesting findings in the data amassed so far.

  4. Present the basic plan for creating a prediction algorithm and Shiny app from the data.

Load Data, Calculate Size, Word Count, and Summarize

The SwiftKey data consists of four data sets, one each in German, English, Russian, and Finnish, with each containing random:

  1. blog entries
  2. news entries
  3. Twitter feeds

For this report, we will process the English data, referencing the German, Finnish, and Russian sets only as needed to match foreign-language characters and/or words embedded in the English data.

Download the data and unzip it on a Windows 10 PC

The Coursera-SwiftKey.zip archive was downloaded via Coursera and extracted (“Extract all…”) into a local folder on a Windows 10 PC.
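
For reproducibility, the download and extraction can also be scripted directly in R. A minimal sketch follows, assuming the zip URL distributed on the Coursera course page:

zip_url  <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")  # binary mode for the zip on Windows
}
unzip(zip_file, exdir = "Coursera-SwiftKey")  # extracts the final/ folder of text files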

Read the data into R from the connection (simple text files)

# Read each English file line by line; skipNul = TRUE drops embedded NUL characters
# and warn = FALSE suppresses readLines warnings (e.g., about a missing final end-of-line).
blog_entries<-readLines("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", skipNul = TRUE, warn = FALSE)
news_entries<-readLines("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt", skipNul = TRUE, warn = FALSE)
twitter_feeds<-readLines("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", skipNul = TRUE, warn = FALSE)

Calculate the SIZE of the English dataset (in megabytes) and display

blog_entries_size<-file.info("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size/ 1024 ^ 2
news_entries_size<-file.info("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt")$size/ 1024 ^ 2
twitter_feeds_size<-file.info("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size/ 1024 ^ 2
eng_data_set_size<-c(blog_entries_size,news_entries_size,twitter_feeds_size)
data_frame_size<-data.frame(eng_data_set_size)
names(data_frame_size)[1] <-"MBs"
row.names(data_frame_size) <- c("Blog entries", "News entries", "Twitter Feeds")
data_frame_size
##                    MBs
## Blog entries  200.4242
## News entries  196.2775
## Twitter Feeds 159.3641

Calculate the LINE COUNT of the English dataset and display

blog_entries_line_count<-length(blog_entries)
news_entries_line_count<-length(news_entries)
twitter_feeds_line_count<-length(twitter_feeds)
data_set_length <-c(blog_entries_line_count,news_entries_line_count, twitter_feeds_line_count)
eng_data_frame_line_count <-data.frame(data_set_length)
names(eng_data_frame_line_count)[1] <-"Line Count"
row.names(eng_data_frame_line_count) <- c("Blog entries", "News entries", "Twitter Feeds")
eng_data_frame_line_count
##               Line Count
## Blog entries      899288
## News entries       77259
## Twitter Feeds    2360148

Calculate the WORD COUNT of the English dataset and display

library(ngram)
blog_entries_word_count <-wordcount(blog_entries)
news_entries_word_count <-wordcount(news_entries)
twitter_feeds_word_count <-wordcount(twitter_feeds)
data_set_word_count <-c(blog_entries_word_count, news_entries_word_count, twitter_feeds_word_count)
eng_data_frame_word_count <-data.frame(data_set_word_count)
names(eng_data_frame_word_count)[1] <-"Word Count"
row.names(eng_data_frame_word_count) <- c("Blog entries", "News entries", "Twitter Feeds")
eng_data_frame_word_count
##               Word Count
## Blog entries    37334131
## News entries     2643969
## Twitter Feeds   30373583

SUMMARIZE the English dataset and display two sample entries from each file

summary(blog_entries)
##    Length     Class      Mode 
##    899288 character character
head(blog_entries,2)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235."
## [2] "We love you Mr. Brown."
summary(news_entries)
##    Length     Class      Mode 
##     77259 character character
head(news_entries,2)
## [1] "He wasn't home alone, apparently."                                                                                                                        
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
summary(twitter_feeds)
##    Length     Class      Mode 
##   2360148 character character
head(twitter_feeds,2)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."

Basic Plots

Because of the amount of data that needs processing, and because this is an exploratory data analysis, we will work with a small sample of each data file, extract the thirty most frequently used words from the combined sample, and then proceed with some basic plotting.

Reduced Sample Size - training sets

We will randomly sample 1/100th of each data file for our reduced sample size and create the corresponding data subsets. A fixed seed is set first so the sample is reproducible.

# Fix the random seed so the sample is reproducible (any constant value works)
set.seed(1234)
sample_size <- 0.01
blogs_index <- sample(seq_len(blog_entries_line_count), blog_entries_line_count*sample_size)
news_index <- sample(seq_len(news_entries_line_count), news_entries_line_count*sample_size)
twitter_index <- sample(seq_len(twitter_feeds_line_count), twitter_feeds_line_count*sample_size)
blogs_sub <- blog_entries[blogs_index]
news_sub <- news_entries[news_index]
twitter_sub <- twitter_feeds[twitter_index]

Combine the Subsets - final training set

We will now create a corpus from the data subsets; the tm library will assist us in this task. The cleaning process involves removing all non-ASCII characters, punctuation marks, excess white space, and numeric data, converting the remaining alphabetic characters to lower case, and storing the entire corpus as plain text documents. A brief summary of the corpus is provided.

library(tm)
## Warning: package 'tm' was built under R version 4.0.5
## Loading required package: NLP
korpus <- Corpus(VectorSource(c(blogs_sub, news_sub, twitter_sub)),readerControl=list(reader=readPlain,language="en"))
korpus <- Corpus(VectorSource(sapply(korpus, function(row) iconv(row, "latin1", "ASCII", sub=""))))
korpus <- tm_map(korpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(korpus, removePunctuation): transformation drops
## documents
korpus <- tm_map(korpus, stripWhitespace) 
## Warning in tm_map.SimpleCorpus(korpus, stripWhitespace): transformation drops
## documents
korpus <- tm_map(korpus, content_transformer(tolower)) 
## Warning in tm_map.SimpleCorpus(korpus, content_transformer(tolower)):
## transformation drops documents
korpus <- tm_map(korpus, removeNumbers) 
## Warning in tm_map.SimpleCorpus(korpus, removeNumbers): transformation drops
## documents
korpus <- tm_map(korpus, PlainTextDocument) 
## Warning in tm_map.SimpleCorpus(korpus, PlainTextDocument): transformation drops
## documents
korpus <- Corpus(VectorSource(korpus))
head(korpus,5)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3

Display as a BAR CHART

We will now use our cleaned data subset to generate a bar chart of the thirty most frequently used words in the corpus. The slam and ggplot2 libraries will help with this task.

library(slam)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
s_korpus <-TermDocumentMatrix(korpus,control=list(minWordLength=1))
wordFrequency <-rowapply_simple_triplet_matrix(s_korpus,sum)
wordFrequency <-wordFrequency[order(wordFrequency,decreasing=T)]
mostFrequent30 <-as.data.frame(wordFrequency[1:30])
mostFrequent30 <-data.frame(Words = row.names(mostFrequent30),mostFrequent30)
names(mostFrequent30)[2] = "Frequency"
row.names(mostFrequent30) <-NULL
mf30Plot <- ggplot(data=mostFrequent30, aes(x=Words, y=Frequency, fill=Frequency)) +
  geom_bar(stat="identity") +
  guides(fill=FALSE) +
  theme(axis.text.x=element_text(angle=90))
# mf30Plot +labs(title="30 Most Frequently Used Words")
mf30Plot + ggtitle("30 Most Frequently Used Words") + theme(plot.title = element_text(hjust = 0.5))

Display as a PIE CHART

We will now use our cleaned data subset to generate a pie chart of the five most frequently used words in the corpus. The plotrix library will help with this task.

library(plotrix)
mostFrequent5 <- head(mostFrequent30, 5)
pie3D(mostFrequent5$Frequency, labels = mostFrequent5$Words,
      main = "Pie of Five Greatest Word Frequencies",
      explode = 0.1, radius = 1.8, labelcex = 1.3, start = 0.7)

Interesting Findings

With the exploratory data analysis done on the English data set to this point, the findings regarding the thirty most frequently occurring words, including the top five, are not that surprising: the bulk of them are common function words such as articles and pronouns. Further analysis using bigrams and trigrams would give a much better picture of the most frequently used phrases. That type of finding could then be used to predict trends in the data and to create a predictive model of English text.
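
As a pointer toward that next step, the sketch below counts bigram and trigram frequencies with the ngram package loaded earlier; it assumes the sampled vectors blogs_sub, news_sub, and twitter_sub created in the sampling step above.

library(ngram)
# Collapse the sampled text into one lowercase string and keep only letters,
# apostrophes, and spaces so the tokenizer sees clean words.
sample_text <- tolower(paste(c(blogs_sub, news_sub, twitter_sub), collapse = " "))
sample_text <- gsub("[^a-z' ]", " ", sample_text)

bigrams  <- ngram(sample_text, n = 2)
trigrams <- ngram(sample_text, n = 3)

# get.phrasetable() lists every n-gram with its frequency and proportion
head(get.phrasetable(bigrams), 10)
head(get.phrasetable(trigrams), 10)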

Basic Project Plan

The basic plan is to use the initial data analysis presented herein to make further progress on the prediction algorithm needed for the Shiny application: a predictive model of English text. One way of doing this might be to investigate what is possible with Markov chains, where the next word depends only on the preceding few words. Further analysis will be done using n-gram modeling to predict the next word with reasonable accuracy. All of this will be incorporated into a user-friendly Shiny front end that allows the user to enter text and receive logical next-word suggestions.
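
As an illustration of the general idea (not the final algorithm), the following sketch suggests a next word from the bigram frequency table produced above; predict_next_word is a hypothetical helper name, and bigrams is assumed to be the object from the previous sketch.

# Hypothetical helper: suggest the next word from the bigram table built above.
# `bigram_table` is assumed to be the output of get.phrasetable(bigrams), which
# has columns `ngrams`, `freq`, and `prop`.
predict_next_word <- function(word, bigram_table, n = 3) {
  word <- tolower(word)
  # keep only bigrams whose first token is the given word
  matches <- bigram_table[grepl(paste0("^", word, " "), bigram_table$ngrams), ]
  if (nrow(matches) == 0) return(character(0))
  matches <- matches[order(matches$freq, decreasing = TRUE), ]
  # the second token of each top bigram is a candidate next word
  sapply(strsplit(head(matches$ngrams, n), " "), `[`, 2)
}

# Example usage (assuming `bigrams` from the previous sketch):
# predict_next_word("thanks", get.phrasetable(bigrams))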