Milestone Report

Background

This milestone report is part of the capstone project for the Data Science Specialization curriculum by Johns Hopkins University on Coursera. This project involves applying data science in the area of natural language processing.

The goal of this report is to present an exploratory analysis of the data and to outline the goals for the eventual prediction app and algorithm.

Data

The data for this project come from a Coursera-provided corpus called HC Corpora. The corpora were collected from publicly available sources by a web crawler. The crawler checks for language, so that each corpus consists mainly of text in the desired language.

Each entry is tagged with its date of publication. Where user comments are included, they are tagged with the date of the main entry.

Reading and Processing Data

First, we read in the datasets and checked the line counts for each.

blogs <- readLines('en_US.blogs.txt', warn = TRUE)
news <- readLines('en_US.news.txt', warn = TRUE)
twitter <- readLines('en_US.twitter.txt', warn = TRUE)
## Warning in readLines("en_US.twitter.txt", warn = TRUE): line 167155 appears to
## contain an embedded nul
## Warning in readLines("en_US.twitter.txt", warn = TRUE): line 268547 appears to
## contain an embedded nul
## Warning in readLines("en_US.twitter.txt", warn = TRUE): line 1274086 appears to
## contain an embedded nul
## Warning in readLines("en_US.twitter.txt", warn = TRUE): line 1759032 appears to
## contain an embedded nul
##      blogs_lines news_lines twitter_lines
## [1,]      899288    1010242       2360148
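The chunk that produced the line-count table above is not shown in the report; a minimal sketch of how it could be computed from the vectors read in above:

# Each element returned by readLines is one line of the source file
line_counts <- cbind(blogs_lines   = length(blogs),
                     news_lines    = length(news),
                     twitter_lines = length(twitter))
line_counts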

Then we drew a random sample of 5,000 lines from each file and computed word counts and basic summary tables.

set.seed(1211)
blogs_sam <- blogs[sample(1:length(blogs), 5000)]
news_sam <- news[sample(1:length(news), 5000)]
twit_sam <- twitter[sample(1:length(twitter), 5000)]
##      blogs_totalwc news_totalwc twit_totalwc
## [1,]        208824       176948        66613
##    Length     Class      Mode 
##      5000 character character
##    Length     Class      Mode 
##      5000 character character
##    Length     Class      Mode 
##      5000 character character
## [1] "These numbers say either that the country is going down the tubes fast, or that most PhD degrees are either nearly worthless to begin with or have rapidly become obsolete for today's economy. Actually, I suspect that all three of these statements are true to a certain extent."                                        
## [2] "To start, I painted the edges of each page with Ranger Paint Dabber and then covered each one with paper. I used the UHU Stic to glue it all in place and then used my Crop-A-Dile to repunch the holes. A light sanding around the edges adds a bit of “time” to the album and gets rid of any paper hanging over the edge."
## [3] "All good things must come to an end I suppose. I had to make it back today so I could go in for at least some part of my last week of work. Jon has until Thursday, still not enough time, I think, to chase the feeling, the exhilaration, that the drive gave us."
## [1] "The gathering, where Shannon's new \"Championship Filet\" also made its debut, was a fundraiser for the Ronald McDonald house."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [2] "Others have found ways to work within the system. Bassem Youssef, a heart surgeon who tended to the wounded during the revolution, vaulted to fame when his homemade YouTube videos pointing out the foibles of those in power caught on in the months after the revolution. Soon he became a star, landing his own television show, \"El Bernameg\" (The Program) on an independent satellite station. Working with a staff of just four, Youssef has pulled off what is exceedingly rare in any Middle Eastern country — a satire program along the lines of \"The Daily Show\" that stands in sharp contrast to the party-line programs that populate state-run news stations. (In June, he will spend a few days at the Comedy Central series' New York set.) Bits on his show frequently lampoon pronouncements of SCAF or the Muslim Brotherhood by contrasting them with footage and facts from modern-day Egypt."
## [3] "California has its own statute forbidding such practices, though the Golden State quit regulating debt collection agencies in 1992."
## [1] "u jive turkey lmao !!!"           "Tech Kidds So Aggravating."      
## [3] ":( my opinion doesn't matter lol"

Then we combined each sample set into one dataset for use in the remainder of our exploratory analysis.

sam_data <- c(blogs_sam, news_sam, twit_sam)
file_name <- "data.txt"
writeLines(sam_data, file_name)

Next we preprocessed the data using tm package tools to remove English stopwords, punctuation, and extra white space. We also converted all strings to lowercase and filtered offensive words from the data. For the last step, we used a downloaded profanity word list.
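The preprocessing chunk itself is not included in this report; a minimal sketch of the described steps with the tm package, assuming `profanity` is a character vector read from the downloaded word list:

library(tm)

corpus <- VCorpus(VectorSource(sam_data))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # English stopwords
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeWords, profanity)             # offensive words (assumed vector)
corpus <- tm_map(corpus, stripWhitespace)                    # extra white space

# Flatten back to a plain character vector for the analysis below
cleandata <- sapply(corpus, as.character)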

Exploratory Analysis

To begin exploring the data, we converted the preprocessed text to a data frame and tokenized it to discover the most frequent words.

library(dplyr)
library(tidytext)

cleandf <- as.data.frame(cleandata)
token_data <- cleandf %>% unnest_tokens(word, cleandata)   # one row per word
tokenCount <- token_data %>% count(word, sort = TRUE)      # word frequencies
head(tokenCount, 5)
##   word    n
## 1 said 1410
## 2  one 1254
## 3 just 1102
## 4 like 1034
## 5  can  993
Figure: Most Frequent Words
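The chart above was built from `tokenCount`; a minimal sketch of how such a frequency plot could be drawn with ggplot2:

library(ggplot2)

tokenCount %>%
    head(20) %>%                                 # 20 most frequent words
    ggplot(aes(x = reorder(word, n), y = n)) +
    geom_col() +
    coord_flip() +
    labs(x = "Word", y = "Count", title = "Most Frequent Words")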

Then we created frequency graphs for the most common n-grams of length 1, 2, and 3.
Figure: Unigrams

Figure: Bigrams

Figure: Trigrams
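The chunks behind these figures are not shown; a minimal sketch of how the bigram and trigram frequency tables could be built with tidytext, mirroring the unigram tokenization above:

bigramCount <- cleandf %>%
    unnest_tokens(bigram, cleandata, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE)

trigramCount <- cleandf %>%
    unnest_tokens(trigram, cleandata, token = "ngrams", n = 3) %>%
    count(trigram, sort = TRUE)

head(bigramCount, 5)    # plotted in the bigram figure above
head(trigramCount, 5)   # plotted in the trigram figure above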

We also considered how many unique words are needed to cover 50% and 90% of all word instances in the dataset. This analysis shows that 1116 unique words are needed to account for 50% of all word instances, and 14355 are needed to account for 90%.

# Cumulative number of word instances covered, adding words in
# descending order of frequency
tokenCountfreq <- tokenCount %>% 
    mutate(word_freq = cumsum(n))
total <- sum(tokenCountfreq$n)

# Convert the cumulative count to a percentage of all word instances
tokenCountfreq <- tokenCountfreq %>% 
    mutate(perc_cov = ((word_freq / total) * 100))

# Unique words needed to cover 50% of all word instances
sum(tokenCountfreq$perc_cov < 50) + 1
## [1] 1116
# Unique words needed to cover 90% of all word instances
sum(tokenCountfreq$perc_cov < 90) + 1
## [1] 14355

Next Steps

The above exploratory analysis informs us about the frequency of individual words, word pairs, and word triads. In the next stages of the course we will use these n-grams and the frequency tables built from them to finalize a predictive algorithm, deploy the model as a Shiny application, and create a slide deck to present the final result.

The predictive algorithm will use an n-gram backoff model: it will first look for trigrams whose first two words match the end of the provided text and predict the most frequent continuation; if no match is found, it will back off to bigrams, and ultimately to the most frequent unigrams.
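As a rough illustration of the planned backoff logic (not the final implementation), the sketch below assumes trigram, bigram, and unigram frequency tables with hypothetical columns `prefix`, `next_word`, and `n`:

predict_next <- function(input, trigram_freq, bigram_freq, unigram_freq) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    nw <- length(words)

    # 1. Try trigrams: match on the last two words of the input
    if (nw >= 2) {
        hits <- trigram_freq[trigram_freq$prefix == paste(words[nw - 1], words[nw]), ]
        if (nrow(hits) > 0) return(hits$next_word[which.max(hits$n)])
    }

    # 2. Back off to bigrams: match on the last word only
    if (nw >= 1) {
        hits <- bigram_freq[bigram_freq$prefix == words[nw], ]
        if (nrow(hits) > 0) return(hits$next_word[which.max(hits$n)])
    }

    # 3. Final fallback: the most frequent unigram overall
    unigram_freq$next_word[which.max(unigram_freq$n)]
}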