Coursera Data Science Capstone - Milestone Report
Thomas J James
5/9/2021
The Coursera Data Science Capstone - Milestone Report (aka, “the report”) is intended to give an introductory look at analyzing the SwiftKey data set.
The report is to be written in a clear, concise style that either a data scientist or a non-data scientist can understand and make sense of.
The purpose of the report is a four-fold exploratory data analysis that will:

1. Demonstrate that the data has been downloaded from SwiftKey (via Coursera) and successfully loaded into R.
2. Create a basic report of summary statistics about the data sets, including file sizes, line counts, and word counts.
3. Report any interesting findings about the data amassed so far.
4. Present the basic plan behind creating a prediction algorithm and Shiny app from the data.
The SwiftKey data consists of four data sets, each in a different language (one each in German, English, Russian, and Finnish), containing random blog entries, news entries, and Twitter feeds.
For this report, we will process the English data, and reference the German, Finnish, and Russian sets to possibly match foreign language characters and/or words embedded in the English data.
# Read the three English-language files, skipping embedded nulls
blog_entries <- readLines("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", skipNul = TRUE, warn = FALSE)
news_entries <- readLines("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt", skipNul = TRUE, warn = FALSE)
twitter_feeds <- readLines("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", skipNul = TRUE, warn = FALSE)
# Compute each file's size in megabytes
blog_entries_size <- file.info("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024^2
news_entries_size <- file.info("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / 1024^2
twitter_feeds_size <- file.info("C:/Users/tjame/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024^2
eng_data_set_size<-c(blog_entries_size,news_entries_size,twitter_feeds_size)
data_frame_size<-data.frame(eng_data_set_size)
names(data_frame_size)[1] <-"MBs"
row.names(data_frame_size) <- c("Blog entries", "News entries", "Twitter Feeds")
data_frame_size
##                    MBs
## Blog entries  200.4242
## News entries  196.2775
## Twitter Feeds 159.3641
blog_entries_line_count<-length(blog_entries)
news_entries_line_count<-length(news_entries)
twitter_feeds_line_count<-length(twitter_feeds)
data_set_length <-c(blog_entries_line_count,news_entries_line_count, twitter_feeds_line_count)
eng_data_frame_line_count <-data.frame(data_set_length)
names(eng_data_frame_line_count)[1] <-"Line Count"
row.names(eng_data_frame_line_count) <- c("Blog entries", "News entries", "Twitter Feeds")
eng_data_frame_line_count
##               Line Count
## Blog entries      899288
## News entries       77259
## Twitter Feeds    2360148
library(ngram)
blog_entries_word_count <-wordcount(blog_entries)
news_entries_word_count <-wordcount(news_entries)
twitter_feeds_word_count <-wordcount(twitter_feeds)
data_set_word_count <-c(blog_entries_word_count, news_entries_word_count, twitter_feeds_word_count)
eng_data_frame_word_count <-data.frame(data_set_word_count)
names(eng_data_frame_word_count)[1] <-"Word Count"
row.names(eng_data_frame_word_count) <- c("Blog entries", "News entries", "Twitter Feeds")
eng_data_frame_word_count
##               Word Count
## Blog entries    37334131
## News entries     2643969
## Twitter Feeds   30373583
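To get a feel for the content, we also summarize each character vector and examine its first two lines. Calls along the lines of the following (a sketch of that inspection step, using base R's summary() and head()) would produce the output shown below.

# Inspect each data set: structural summary plus the first two lines
summary(blog_entries); head(blog_entries, 2)
summary(news_entries); head(news_entries, 2)
summary(twitter_feeds); head(twitter_feeds, 2)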
##    Length     Class      Mode
##    899288 character character
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
##    Length     Class      Mode
##     77259 character character
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
##    Length     Class      Mode
##   2360148 character character
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
Because of the amount of data that needs processing, and because this is an exploratory data analysis, we will work with a reduced sample of each data file, extract the thirty most frequently used words, and then proceed with some basic plotting.
We will use 1/100th of each data file as our reduced sample and create the necessary data subsets from it.
# Sample 1% of the lines from each file
sample_size <- 0.01
blogs_index <- sample(seq_len(blog_entries_line_count), blog_entries_line_count * sample_size)
news_index <- sample(seq_len(news_entries_line_count), news_entries_line_count * sample_size)
twitter_index <- sample(seq_len(twitter_feeds_line_count), twitter_feeds_line_count * sample_size)
blogs_sub <- blog_entries[blogs_index]
news_sub <- news_entries[news_index]
twitter_sub <- twitter_feeds[twitter_index]
We will now create a corpus from the data subsets. The tm library will assist us in this task. The process involves removing all non-ASCII characters, punctuation marks, excess white space, and numeric data, converting the remaining alphabetic characters to lower case, and generating the entire corpus in plain text. A brief summary of the corpus is provided.
library(tm)
## Warning: package 'tm' was built under R version 4.0.5
## Loading required package: NLP
# Build the corpus from the combined subsets and apply the cleaning steps described above
korpus <- Corpus(VectorSource(c(blogs_sub, news_sub, twitter_sub)), readerControl = list(reader = readPlain, language = "en"))
korpus <- Corpus(VectorSource(sapply(korpus, function(row) iconv(row, "latin1", "ASCII", sub = ""))))
korpus <- tm_map(korpus, removePunctuation)
korpus <- tm_map(korpus, stripWhitespace)
korpus <- tm_map(korpus, content_transformer(tolower))
korpus <- tm_map(korpus, removeNumbers)
korpus <- tm_map(korpus, PlainTextDocument)
## Warning in tm_map.SimpleCorpus(korpus, removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(korpus, stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(korpus, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(korpus, removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(korpus, PlainTextDocument): transformation drops
## documents
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3
We will now use our cleaned data subset to generate a histogram of the thirty most frequently used words in the corpus. The libraries slam and ggplot2 will help with this task.
library(slam)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
# Build a term-document matrix and compute the total frequency of each term
s_korpus <- TermDocumentMatrix(korpus, control = list(minWordLength = 1))
wordFrequency <- rowapply_simple_triplet_matrix(s_korpus, sum)
wordFrequency <- wordFrequency[order(wordFrequency, decreasing = TRUE)]
mostFrequent30 <-as.data.frame(wordFrequency[1:30])
mostFrequent30 <-data.frame(Words = row.names(mostFrequent30),mostFrequent30)
names(mostFrequent30)[2] = "Frequency"
row.names(mostFrequent30) <-NULL
mf30Plot <- ggplot(data = mostFrequent30, aes(x = Words, y = Frequency, fill = Frequency)) +
  geom_bar(stat = "identity") +
  guides(fill = FALSE) +
  theme(axis.text.x = element_text(angle = 90))
mf30Plot + ggtitle("30 Most Frequently Used Words") + theme(plot.title = element_text(hjust = 0.5))
We will now use our cleaned data subset to generate a pie chart of the five most frequently used words in the corpus. The library plotrix will help with this task.
library(plotrix)
# Keep only the five most frequent words for the pie chart
mostFrequent5 <- head(mostFrequent30, 5)
pie3D(mostFrequent5$Frequency, labels = mostFrequent5$Words, main = "Pie of Five Greatest Word Frequencies", explode = 0.1, radius = 1.8, labelcex = 1.3, start = 0.7)
With the exploratory data analysis done on the English data set to this point, the findings regarding the thirty most frequently occurring words, including the five most frequent, are not surprising: the bulk of them are articles and pronouns. Further analysis using bigrams and trigrams would give a much better picture of the most frequently used phrases, as sketched below. That type of finding could then be used to predict trends in the data and to create a predictive model of English text.
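As a rough illustration of that next step, the ngram package loaded earlier can tabulate bigram frequencies from the cleaned sample. The snippet below is only a sketch: korpus_text and bigram_table are names introduced here for illustration, and the final analysis may prepare the corpus text differently.

library(ngram)
# Collapse the cleaned corpus documents into one string (sketch only)
korpus_text <- paste(unlist(sapply(korpus, as.character)), collapse = " ")
# Tabulate bigram frequencies across the sample
bigram_table <- get.phrasetable(ngram(korpus_text, n = 2))
head(bigram_table, 10)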
The basic plan is to use the initial data analysis presented herein to further develop the prediction algorithm needed for the Shiny application: a predictive model of English text. One way of doing this might be to investigate what is possible using Markov chains. Further analysis will be done using n-gram modeling to predict the next word with reasonable accuracy, along the lines of the sketch below. All of this will be incorporated into a user-friendly Shiny front end that allows the user to interact with the data and make logical next-word selections.
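A minimal sketch of how such a bigram table could drive next-word selection follows; predict_next_word is a hypothetical helper built on the bigram_table sketched above, not the final prediction algorithm.

# Hypothetical helper: return the most frequent words that follow a given word,
# according to the bigram table (sketch only; the deployed model would be more refined)
predict_next_word <- function(word, phrase_table, top_n = 3) {
  word <- tolower(word)
  # entries in the phrase table look like "word next ", so match on the first token
  matches <- phrase_table[grepl(paste0("^", word, " "), phrase_table$ngrams), ]
  # strip the leading word, leaving the predicted continuations
  head(sub(paste0("^", word, " "), "", trimws(matches$ngrams)), top_n)
}

predict_next_word("the", bigram_table)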