#Clear the environment
rm(list=ls())
The goal of this milestone report for the Data Science Capstone Project is to demonstrate basic proficiency in working with the data and to provide evidence that the project is on track for eventual completion.
Accordingly, the following demonstration will show:
1. That the data has been successfully downloaded and accessed in R
2. Basic summary statistics for the data sets
3. Preliminary findings
4. Initial plans for creating a prediction algorithm and Shiny app
During the initial data wrangling process, several natural language processing libraries for R were tested and evaluated. As a first step, we load the libraries and associated dependencies:
library(ngram)
library(tm)
## Loading required package: NLP
library(tokenizers)
##
## Attaching package: 'tokenizers'
## The following object is masked from 'package:tm':
##
## stopwords
Now, we shall attempt to read in the data using R’s readLines function. As an exploratory evaluation, this demonstration will only attempt to read the first 1000 lines of each document.
knitr::opts_chunk$set(echo = TRUE)
setwd("~/Academic/R Projects/Capstone")
blogs.dat <- file("~/Academic/R Projects/Capstone/SwiftKey_Data/en_US/en_US.blogs.txt", "r")
news.dat <- file("~/Academic/R Projects/Capstone/SwiftKey_Data/en_US/en_US.news.txt", "r")
twitter.dat <- file("~/Academic/R Projects/Capstone/SwiftKey_Data/en_US/en_US.twitter.txt", "r")
blogs <- readLines(blogs.dat, n=1000, skipNul=TRUE) #Remove n=1000 to read the full dataset
news <- readLines(news.dat, n=1000, skipNul=TRUE)
twitter <- readLines(twitter.dat, n=1000, skipNul=TRUE)
close(blogs.dat); close(news.dat); close(twitter.dat) #Close the file connections
Next, let us examine the content of each of the three corpora:
## Overview summaries
cat("Blog Data:\n")
## Blog Data:
head(blogs,4)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan âgodsâ."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
tail(blogs,4)
## [1] "âI just repainted my office using H2 Ahh!, my signature color from Ellenâs Designerâs Palette. I had originally painted my bedroom this color with ânot full spectrumâ blended paint from another company. When Ellen invited me to be one of the designers in her line I knew I wanted my signature blue/green/gray and she formulated it as a full spectrum blend."
## [2] "There are several varieties of kelp: true kelp, which thrives in cool seas; giant kelp, and bladder kelp, which grow in the North Pacific. Giant kelp is so named because it grows to 213 ft (65 m). Kelp anchors itself to rocky surfaces via tentacle-like roots. From these roots grows a slender stalk with long, leaf-like blades."
## [3] "Spoon out about 1/3 cup of dough for each shortcake onto the baking sheet, leaving about 3 inches of space between the mounds. Pat each mound down until it is between 3/4 and 1 inch high. (The shortcakes can be made to this point and frozen on the baking sheet, then wrapped airtight and kept in the freezer for up to 2 months. Bake without defrosting â just add at least 5 more minutes to the oven time.)"
## [4] "Apparently this is available on a DVD of Frank Tashlinâs THE GLASS BOTTOM BOAT, which is vaguely apt, but it should really be an extra with VERTIGO. Both because of the ways in which Jonesâs visuals approach Saul Bassâs (the YouTuber who posted it apparently thinks itâs by Norman McLaren â a fair guess, but WRONG), and in the way the short reverses the sympathies engendered in Hitchcockâs film â a woman trapped and torn and manipulated and molded between two horrible men is replaced by a female manipulator who remodels the men in her life, rejecting the less adaptable model in favour of the one who can literally be bent to her will."
cat("\n\nNews Data:\n")
##
##
## News Data:
head(news,4)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
tail(news,4)
## [1] "$1.1 trillion: Projected overall cost for Alzheimerâs patients in 2050"
## [2] "Charles R. Priestly, 48, ran a Louisiana company called Hummingbird Aviation. Although he won a contract to supply helicopters and personnel to the U.S. Transportation Command for use in Afghanistan, the Federal Aviation Administration denied clearance for Hummingbird to operate there, killing the deal."
## [3] "The next wave of valley stock launches may well be made by less-sexy enterprise software companies like Palo Alto Networks, which filed plans earlier this month for a $175 million offering. The Santa Clara-based maker of network security products reported $119 million in fiscal year 2011 revenues, which would have placed it 142nd on this year's list."
## [4] "The men in the car were 25-year-old Bloomfield residents Joshua Rubens, Rabbiel Williams and Stephan Thompson, 24-year-old Bloomfield resident Emotes Furet and 23-year-old Bloomfield resident Willie Parnagot. On Thursday they were being held at the Bloomfield Police Department in lieu of $250,000 bail each and were awaiting transfer to Essex County Jail."
cat("\n\nTwitter Data:\n")
##
##
## Twitter Data:
head(twitter,4)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
tail(twitter,4)
## [1] "Alright, W. Wager. I don't know who you are or how you got on my exam list, but I will take your advice in *Enough Is as Good as a Feast.*"
## [2] "Boo. I accidentally threw away my list of story ideas for next semester!!"
## [3] "Art washes from the soul the dust of everyday life. -Pablo Picasso"
## [4] "as of right now at this sec pass me by!"
Now, let’s count the number of lines for each collection:
## Count number of lines for each collection
paste("Number of lines in blogs collection:", length(blogs))
## [1] "Number of lines in blogs collection: 1000"
paste("Number of lines in news collection:", length(news))
## [1] "Number of lines in news collection: 1000"
paste("Number of lines in twitter collection:", length(twitter))
## [1] "Number of lines in twitter collection: 1000"
Next, we shall count the number of words within each collection:
## Count number of words in each collection
blogs.length <- sum(sapply(strsplit(blogs,"\\s+"),length))
news.length <- sum(sapply(strsplit(news,"\\s+"),length))
twitter.length <- sum(sapply(strsplit(twitter,"\\s+"),length))
paste("Number of words in blogs collection:", blogs.length)
## [1] "Number of words in blogs collection: 41890"
paste("Number of words in news collection:", news.length)
## [1] "Number of words in news collection: 33489"
paste("Number of words in twitter collection:", twitter.length)
## [1] "Number of words in twitter collection: 12782"
Finally, let’s calculate the average number of words per line for each collection:
paste("Average words per line in blogs collection:", blogs.length/length(blogs))
## [1] "Average words per line in blogs collection: 41.89"
paste("Average words per line in news collection:", news.length/length(news))
## [1] "Average words per line in news collection: 33.489"
paste("Average words per line in twitter collection:", twitter.length/length(twitter))
## [1] "Average words per line in twitter collection: 12.782"
To further our exploration of natural language processing, we will need to tokenize each corpus. For the sake of simplicity, we shall tokenize to single-word instances and stem each word to its lexical root. We will then remove tokens containing digits, remove English stop words, and drop any token of two characters or fewer:
## Tokenize and stem words
blogs.words <- unlist(tokenize_word_stems(blogs))
blogs.words <- gsub("\\w*[0-9]+\\w*\\s*", "", blogs.words) #Remove tokens containing digits
blogs.words <- removeWords(blogs.words, stopwords("en")) #Remove English stopwords
blogs.words <- blogs.words[nchar(blogs.words)>2] #Drop stopword residue and tokens of two characters or fewer
news.words <- unlist(tokenize_word_stems(news))
news.words <- gsub("\\w*[0-9]+\\w*\\s*", "", news.words) #Remove tokens containing digits
news.words <- removeWords(news.words, stopwords("en")) #Remove English stopwords
news.words <- news.words[nchar(news.words)>2] #Drop stopword residue and tokens of two characters or fewer
twitter.words <- unlist(tokenize_word_stems(twitter))
twitter.words <- gsub("\\w*[0-9]+\\w*\\s*", "", twitter.words) #Remove tokens containing digits
twitter.words <- removeWords(twitter.words, stopwords("en")) #Remove English stopwords
twitter.words <- twitter.words[nchar(twitter.words)>2] #Drop stopword residue and tokens of two characters or fewer
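Because the same cleaning steps are applied to all three corpora, they could optionally be consolidated into a single helper function. The sketch below is equivalent to the code above; the name clean_tokens is our own and not part of the original pipeline.
## Optional helper consolidating the cleaning pipeline (clean_tokens is a hypothetical name)
clean_tokens <- function(text_lines) {
  words <- unlist(tokenize_word_stems(text_lines)) #Tokenize and stem
  words <- gsub("\\w*[0-9]+\\w*\\s*", "", words) #Remove tokens containing digits
  words <- removeWords(words, stopwords("en")) #Blank out English stopwords
  words[nchar(words) > 2] #Drop stopword residue and tokens of two characters or fewer
}
#blogs.words <- clean_tokens(blogs)
#news.words <- clean_tokens(news)
#twitter.words <- clean_tokens(twitter)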
Having completed tokenization and stop word removal, we now generate bar charts displaying the 25 most frequent words in each corpus:
blogs.tab <- table(blogs.words)
barplot(blogs.tab[order(blogs.tab, decreasing = TRUE)[1:25]], las=2, ylab='Frequency', xlab='Word', main='Blogs: Most Frequent Words', col='green')
news.tab <- table(news.words)
barplot(news.tab[order(news.tab, decreasing = TRUE)[1:25]], las=2, ylab='Frequency', xlab='Word', main='News: Most Frequent Words', col='yellow')
twitter.tab <- table(twitter.words)
barplot(twitter.tab[order(twitter.tab, decreasing = TRUE)[1:25]], las=2, ylab='Frequency', xlab='Word', main='Twitter: Most Frequent Words', col='blue')
Pursuant to our preliminary analysis of the data, the news articles contained the word “said” as their most frequent word, which is to be expected of media reports citing and quoting sources directly. Interestingly, the news data showed “year” to be the second most frequent word, while the Twitter data contained the word “day” at an equally high level of frequency. This perhaps demonstrates a temporal difference between news reporting, which often covers issues within the historic context of years and decades, and Twitter, which is more often spontaneous and temporally ephemeral.
In furthering the analysis, subsequent steps will attempt to build a predictive algorithm that assesses the likely origin of a document (news, blogs, or Twitter) given its component parts (n-grams or bag-of-words). This task will be executed by generating term frequency-inverse document frequency (TF-IDF) weights for each corpus and building a generative model using Latent Dirichlet Allocation or another popular topic-modeling approach.
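As a rough illustration of that plan, the sketch below builds document-term matrices with tm and fits a small topic model with the topicmodels package (an additional dependency not loaded above). Treating each source as a single document, the choice of k = 3 topics, and the seed are illustrative assumptions only; note that LDA is fit on raw term counts, since it does not accept TF-IDF weights, while the TF-IDF matrix is kept for inspecting distinctive terms.
## Sketch of the planned next steps (assumes the topicmodels package; k and seed are illustrative)
library(topicmodels)
docs <- VCorpus(VectorSource(c(paste(blogs, collapse = " "),
                               paste(news, collapse = " "),
                               paste(twitter, collapse = " "))))
dtm.counts <- DocumentTermMatrix(docs) #Raw term counts, as required by LDA
dtm.tfidf <- DocumentTermMatrix(docs, control = list(weighting = weightTfIdf)) #TF-IDF weights for distinctive terms
lda.fit <- LDA(dtm.counts, k = 3, control = list(seed = 1234))
terms(lda.fit, 10) #Top terms per topic
topics(lda.fit) #Most likely topic for each source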