author: "John Mastapeter"
date: "3/15/2021"
The goal of this project is just to display that you've gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Developer Goals

People do not write or converse the same way in every medium, and one could argue that each text provided for analysis represents a different medium for relating a story: news reads like an official report, blogs like a personal narrative, and Twitter is the most conversational. Since individuals would not use the same kinds of words in casual conversation as they would in a detailed report, the goal of my analysis is to visualize the difference, if any, in the words used in each set of texts.
Before developing and running any other code for the analysis, set up the coding environment with the proper working directory and the libraries needed for the rest of the analysis. Set the seed at this point as well so it applies to all of the code that follows.
#Set working Directory
setwd('C:/Users/mastapeterj/Documents/Coursera_DataScience/Capstone/final/en_US')
#Load Relevant Libraries
library(tm)
library(RWeka)
library(textcat)
library(ggplot2)
library(stringi)
library(gridExtra)
#Set Seed
set.seed(371293)
Next, load each of the data sets: Twitter, blogs, and news. To remove irrelevant characters from the text, set the encoding to UTF-8 and then strip the characters that do not contribute to words.
Provide a brief profile of the data by extracting the file size and word count for each set of text.
#Load Twitter, Blogs, and News Data and Prep by Converting to UTF-8 and Removing Special Characters
twitter <- readLines('en_US.twitter.txt')
Encoding(twitter) <- "UTF-8"
twitter <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", twitter)
twitter_size <- file.info('en_US.twitter.txt')$size
twitter_count <- stri_count_words(twitter)
blogs <- readLines('en_US.blogs.txt')
Encoding(blogs) <- "UTF-8"
blogs <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", blogs)
blogs_size <- file.info('en_US.blogs.txt')$size
blogs_count <- stri_count_words(blogs)
news <- readLines('en_US.news.txt')
Encoding(news) <- "UTF-8"
news <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", news)
news_size <- file.info('en_US.news.txt')$size
news_count <- stri_count_words(news)
Print a summary of the text data and review any noticeable characteristics of the texts that might affect the future analysis of the data.
#Provide a Summary of the Data
doc_sum <- data.frame(source = c("blogs", "news", "twitter"),
file_size_bytes = c(blogs_size, news_size, twitter_size),
Lines = c(length(blogs), length(news), length(twitter)),
Words = c(sum(blogs_count), sum(news_count), sum(twitter_count)),
Words_Per_Entry = c(mean(blogs_count), mean(news_count), mean(twitter_count)))
doc_sum
## source file_size_bytes Lines Words Words_Per_Entry
## 1 blogs 210160014 899288 37445286 41.63881
## 2 news 205811889 77259 2670920 34.57099
## 3 twitter 167105338 2360148 29915324 12.67519
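For readability, the raw byte counts could optionally be converted to megabytes. This is a small variant on the summary above, not part of the original output.
#Optional variant: express file sizes in megabytes for easier comparison
doc_sum$file_size_mb <- round(doc_sum$file_size_bytes / 1024^2, 1)
doc_sum[, c("source", "file_size_mb", "Lines", "Words", "Words_Per_Entry")]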
Initial Analysis

From the start a few things are clear: blogs is the largest file and contributes the most words, and blog entries are also the longest on average; Twitter contributes the most lines but the fewest words per entry; news contributes far fewer lines and words than either.
Potential Consequences

How will the writing style of Twitter impact the tool? How relevant will the news text be considering its size and word count?
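One thing worth checking here, as an assumption on my part rather than something verified in this report: a comparatively low news line count can occur when readLines() hits an embedded control character and treats it as end-of-file in text mode. Re-reading the file through a binary-mode connection would confirm whether any lines were lost.
#Hypothetical check (not run for this report): re-read the news file through a
#binary-mode connection with skipNul = TRUE, in case an embedded control
#character truncated the text-mode read above
con <- file('en_US.news.txt', open = 'rb')
news_check <- readLines(con, skipNul = TRUE)
close(con)
length(news_check)   #compare against length(news) in the summary table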
Before running any analysis on the data, it is best practice to take a sample so as not to run the entire set through the code, which could take more time than necessary.
At this point, also generate a sample that combines the Twitter, blogs, and news texts to determine how much influence each set has on the whole provided to the user.
Convert the sample data into corpora (via vector sources) so it can be run through the analysis tools that follow.
#Generate Sample Data
blog_sample <- sample(blogs, length(blogs)*0.01)
news_sample <- sample(news, length(news)*0.01)
twitter_sample <- sample(twitter, length(twitter)*0.01)
#Generate Sample of Entire Data Package
data_sample <- c(sample(blogs, length(blogs)*0.01),
sample(news, length(news)*0.01),
sample(twitter, length(twitter)*0.01))
#Create Vector Source for Sample Data
blog_clean <- VCorpus(VectorSource(blog_sample))
news_clean <- VCorpus(VectorSource(news_sample))
twitter_clean <- VCorpus(VectorSource(twitter_sample))
data_clean <- VCorpus(VectorSource(data_sample))
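As a quick sanity check, illustrative and not part of the original output, the combined corpus can be inspected to confirm it holds the expected number of documents and that the text survived the conversion.
#Quick sanity check (illustrative only)
length(data_clean)               #number of sampled entries in the combined corpus
as.character(data_clean[[1]])    #text of the first sampled entry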
The cleaning process began when the data was loaded, but this step completes it, leaving only meaningful English words to analyze.
Since all of the data samples need to be cleaned in exactly the same way, writing a single cleaning function avoids duplicating code unnecessarily.
This function removes URLs, Twitter handles, stop words, numbers, punctuation, and extra whitespace, and converts all remaining letters to lower case.
#Function to Clean Data of Irrelevant Formats and Characters
cleaning_func <- function(clean){
  #Transformer that replaces a matched pattern with a space
  spacing <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  cleaned <- tm_map(clean, spacing, "(f|ht)tp(s?)://(.*)[.][a-z]+")  #remove URLs
  cleaned <- tm_map(cleaned, spacing, "@[^\\s]+")                    #remove handles
  cleaned <- tm_map(cleaned, content_transformer(tolower))           #lower case
  cleaned <- tm_map(cleaned, removeWords, stopwords("en"))           #drop English stop words
  cleaned <- tm_map(cleaned, removePunctuation)
  cleaned <- tm_map(cleaned, removeNumbers)
  cleaned <- tm_map(cleaned, stripWhitespace)
  cleaned <- tm_map(cleaned, PlainTextDocument)                      #restore document class
  return(cleaned)
}
#Generate "Cleaned" Data for Analysis
twitter_cleaned <- cleaning_func(twitter_clean)
blog_cleaned <- cleaning_func(blog_clean)
news_cleaned <- cleaning_func(news_clean)
data_cleaned <- cleaning_func(data_clean)
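To illustrate what the function strips out, a toy string (made up for this example, not drawn from the data) can be passed through it; the URL, handle, number, stop words, and punctuation should all disappear, leaving only lower-case words.
#Toy example (illustrative only)
toy <- VCorpus(VectorSource("Check http://example.com at 10 PM!! Thanks @someone"))
as.character(cleaning_func(toy)[[1]])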
With the data cleaned of characters irrelevant to further analysis, the next step is to count the individual words, word pairs, and word triples that repeat throughout each text.
At this point, also set the number of cores used to process the data.
#Calculate Counts of Words and Patterns of Words
options(mc.cores=1)
#calc_freq expects a term-document matrix and returns a data frame of terms
#sorted by total frequency across the sample
calc_freq <- function(cleaned) {
  wordfreq <- sort(rowSums(as.matrix(cleaned)), decreasing = TRUE)
  return(data.frame(word = names(wordfreq), wordfreq = wordfreq))
}
#Bigram and trigram tokenizers for use with TermDocumentMatrix
word_pairs <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
word_triples <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
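For context, here is what the pair tokenizer does to a toy sentence (not taken from the data); these overlapping sequences are what the term-document matrices below count.
#Illustrative only: tokenize a toy sentence into overlapping word pairs
word_pairs("the quick brown fox")
#should return pairs along the lines of "the quick", "quick brown", "brown fox"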
To visualize the results, simple bar charts displaying the top 25 words and word patterns provide context for further analysis.
#Develop Bar Plots of Word and Pattern Counts
freq_plot <- function(freq, label) {
ggplot(freq[1:25,], aes(reorder(word, -wordfreq), wordfreq))+
labs(x = label, y = "Word Count")+
theme(axis.text.x = element_text(angle = 60, size = 8, hjust = 1))+
geom_bar(stat="identity", fill = I("black"))
}
Run the sample data through the count and plotting code.
knitr::knit_meta(class=NULL, clean = TRUE)
## list()
#Review Blogs
blogs_word1 <- calc_freq(removeSparseTerms(TermDocumentMatrix(blog_cleaned), 0.99))
blogs_word2 <- calc_freq(removeSparseTerms(TermDocumentMatrix(blog_cleaned, control = list(tokenize = word_pairs)), 0.999))
#Running blogs at the same precision as the other data sets continued to cause knitr to error and fail.
#While increasing the memory limit with the following code inside RStudio allowed the higher precision to complete:
#memory.limit(10000)
#I could not find a way to mimic that functionality in knitr, so I reduced the precision requirement to allow the plot to be visualized.
#The memory limit setting and the original code are included (commented out) so other users can reproduce the results.
#blogs_word3 <- calc_freq(removeSparseTerms(TermDocumentMatrix(blog_cleaned, control = list(tokenize = word_triples)), 0.9999))
blog_plot_1 <- freq_plot(blogs_word1, "25 Top Common Words in Blogs")
blog_plot_2 <- freq_plot(blogs_word2, "25 Top Common Word Pairs in Blogs")
#blog_plot_3 <- freq_plot(blogs_word3, "25 Top Common Word Triples in Blogs")
#Review Twitter
twitter_word1 <- calc_freq(removeSparseTerms(TermDocumentMatrix(twitter_cleaned), 0.99))
twitter_word2 <- calc_freq(removeSparseTerms(TermDocumentMatrix(twitter_cleaned, control = list(tokenize = word_pairs)), 0.999))
twitter_word3 <- calc_freq(removeSparseTerms(TermDocumentMatrix(twitter_cleaned, control = list(tokenize = word_triples)), 0.9999))
twitter_plot_1 <- freq_plot(twitter_word1, "25 Top Common Words in Twitter")
twitter_plot_2 <- freq_plot(twitter_word2, "25 Top Common Word Pairs in Twitter")
twitter_plot_3 <- freq_plot(twitter_word3, "25 Top Common Word Triples in Twitter")
#Review News
news_word1 <- calc_freq(removeSparseTerms(TermDocumentMatrix(news_cleaned), 0.99))
news_word2 <- calc_freq(removeSparseTerms(TermDocumentMatrix(news_cleaned, control = list(tokenize = word_pairs)), 0.999))
news_word3 <- calc_freq(removeSparseTerms(TermDocumentMatrix(news_cleaned, control = list(tokenize = word_triples)), 0.9999))
news_plot_1 <- freq_plot(news_word1, "25 Top Common Words in News")
news_plot_2 <- freq_plot(news_word2, "25 Top Common Word Pairs in News")
news_plot_3 <- freq_plot(news_word3, "25 Top Common Word Triples in News")
#Review Complete Data
data_word1 <- calc_freq(removeSparseTerms(TermDocumentMatrix(data_cleaned), 0.99))
data_word2 <- calc_freq(removeSparseTerms(TermDocumentMatrix(data_cleaned, control = list(tokenize = word_pairs)), 0.999))
data_word3 <- calc_freq(removeSparseTerms(TermDocumentMatrix(data_cleaned, control = list(tokenize = word_triples)), 0.9999))
data_plot_1 <- freq_plot(data_word1, "25 Top Common Words in Data")
data_plot_2 <- freq_plot(data_word2, "25 Top Common Word Pairs in Data")
data_plot_3 <- freq_plot(data_word3, "25 Top Common Word Triples in Data")
Once all the data has been run through the analysis code, print the plots to analyze the words frequently used in each source format and throughout the entire data set.
#Plot all Results
words_I_plot <-grid.arrange(twitter_plot_1, news_plot_1, blog_plot_1, data_plot_1, ncol = 2, nrow = 2)
words_II_plot <-grid.arrange(twitter_plot_2, news_plot_2, blog_plot_2, data_plot_2, ncol = 2, nrow = 2)
words_III_plot <-grid.arrange(twitter_plot_3, news_plot_3, data_plot_3, ncol = 2, nrow = 2)
A few things stand out in the plots, especially how frequently patterns from the Twitter data appear among the most common patterns in the combined data. While not completely unexpected given the line count of the Twitter file, patterns frequent in Twitter influenced the combined data to a greater degree than the blogs text did, even though blogs was a larger file and contained more words.
While there were issues visualizing the blogs output within knitr, the influence of that text can still be seen in the combined data results: many patterns trace back to their frequency in the Twitter text, but others clearly come from the blogs.
As predicted, the news text did not influence pattern frequency as much as the Twitter or blogs text. The news text contained numerous patterns built from proper nouns such as locations, or from words related to scientific or medical documentation, which may not be worth spelling out in more personal formats like Twitter or blogs.
This method of analysis demonstrates how the medium a sample comes from may affect word use, highlighting the risk that a generic model might recommend the wrong words depending on whether an individual is writing a quick Twitter message or a published news account.
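Looking ahead to the prediction algorithm and Shiny app, one possible direction, sketched here only as an illustration and not as the final design, is a frequency-based lookup over the word-pair and word-triple tables built above, backing off from triples to pairs when no match is found. The predict_next function below is hypothetical and assumes the data_word2 and data_word3 tables from this report.
#A minimal, hypothetical sketch of a backoff-style next-word lookup using the
#frequency tables built above; this is illustrative only, not the final algorithm
predict_next <- function(phrase, pairs = data_word2, triples = data_word3) {
  last_two <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  #First look for word triples that begin with the last two words of the phrase
  hits <- triples[grepl(paste0("^", paste(last_two, collapse = " "), " "), triples$word), ]
  if (nrow(hits) == 0) {
    #Back off to word pairs that begin with the last word of the phrase
    hits <- pairs[grepl(paste0("^", tail(last_two, 1), " "), pairs$word), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  #Both tables are already sorted by frequency, so take the last word of the top match
  tail(unlist(strsplit(as.character(hits$word[1]), " ")), 1)
}
#Example usage: predict_next("thanks for the")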