The objective of this milestone report is to outline the data analysis steps for the Capstone project of the Coursera data science course offered by Johns Hopkins University.
The project is to create a text-prediction algorithm based on a corpus built from three different data sets.
The following code loads the required libraries and downloads the data.
library(tidyverse)
library(tidytext)
library(ggplot2)   # already attached by tidyverse; loaded explicitly for clarity

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Create the data directory, then download and unzip the corpus if not already present
if(!dir.exists("~/R/Capestone/data/")){
    dir.create("~/R/Capestone/data/")}
if(!file.exists("~/R/Capestone/data/data.zip")){
    download.file(url, destfile = "~/R/Capestone/data/data.zip")
    unzip("~/R/Capestone/data/data.zip", exdir = "~/R/Capestone/data/")
}
With the files downloaded, they can now be read and stored in memory. The skipNul = TRUE argument guards against embedded null characters in the files, which can otherwise trigger warnings or truncate the read.
blog <- readLines("~/R/Capestone/data/final/en_US/en_US.blogs.txt", skipNul = TRUE)
news <- readLines("~/R/Capestone/data/final/en_US/en_US.news.txt", skipNul = TRUE)
twitter <- readLines("~/R/Capestone/data/final/en_US/en_US.twitter.txt", skipNul = TRUE)

# Store each source as a data frame with one row per line of text
blog <- data.frame(line = seq_along(blog), text = blog)
news <- data.frame(line = seq_along(news), text = news)
twitter <- data.frame(line = seq_along(twitter), text = twitter)
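Since all three sources are held in memory at once, it can be worth confirming their footprint before tokenizing. A minimal check using base R's object.size():

# Report the approximate memory used by each source (illustration only)
sapply(list(blog = blog, news = news, twitter = twitter),
       function(x) format(object.size(x), units = "Mb"))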
The first step is to tokenize the data, which makes it easier to analyze. This is easy to achieve with the unnest_tokens() function from the tidytext package, which is designed to work with the tidyverse. The profanity filter could be applied now, but it takes a long time to run over the entire data set; it is better to apply it to a sample instead (a sketch of that approach follows the tokenization code below).
Tblog <- unnest_tokens(blog, input = text, output = word, format = "text",
                       token = "words", drop = TRUE, to_lower = TRUE)
Tnews <- unnest_tokens(news, input = text, output = word, format = "text",
                       token = "words", drop = TRUE, to_lower = TRUE)
Ttwitter <- unnest_tokens(twitter, input = text, output = word, format = "text",
                          token = "words", drop = TRUE, to_lower = TRUE)
The count() function in the dplyr package is also useful for tallying the occurrences of each unique word.
Cblog <- count(Tblog, word)
Cnews <- count(Tnews, word)
Ctwitter <- count(Ttwitter, word)
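As a quick sanity check, count() can also sort by frequency at creation time (illustration only):

# Peek at the three most frequent blog words (sort = TRUE orders by count)
head(count(Tblog, word, sort = TRUE), 3)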
The following code computes some useful statistics for each source and collects them in a table.
Table1 <- data.frame(Source = c("Blog", "News", "Twitter"))
Table1$Lines <- c(nrow(blog), nrow(news), nrow(twitter))
Tablelist <- list(Cblog$n, Cnews$n, Ctwitter$n)
Table1$Count <- sapply(Tablelist, sum)      # total word count per source
Table1$Unique <- sapply(Tablelist, length)  # unique word count per source
Table1$"Words per line" <- round(Table1$Count / Table1$Lines, 1)
Table1$"% unique" <- round(Table1$Unique / Table1$Count * 100, 1)
It is helpful to look at the ten most frequently used words in each source.
topblog <- Cblog %>% arrange(desc(n)) %>% filter(row_number() < 11)
topnews <- Cnews %>% arrange(desc(n)) %>% filter(row_number() < 11)
toptwitter <- Ctwitter %>% arrange(desc(n)) %>% filter(row_number() < 11)
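On dplyr 1.0 or later (an assumption about the installed version), the same selection can be written more directly with slice_max(), for example:

# Equivalent selection of the ten most frequent blog words, ignoring ties
topblog <- slice_max(Cblog, n, n = 10, with_ties = FALSE)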
The top ten results can be represented in a bar graph with the following code:
Datasum <- data.frame(Source = c(rep("Blog", 10), rep("News", 10), rep("Twitter", 10)),
                      Word = c(topblog$word, topnews$word, toptwitter$word),
                      Count = c(topblog$n, topnews$n, toptwitter$n))
g <- ggplot(Datasum, aes(x = Word, y = Count, fill = Source)) +
    geom_col() +
    labs(title = "Top 10 word counts by source")
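Note that assigning the plot to g does not display it; the object still has to be printed, and it can optionally be written to disk with ggsave():

print(g)  # display the bar graph
# ggsave("top10_words.png", plot = g, width = 8, height = 5)  # optional: save to file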
The summary table of statistics for the different data sets, generated above, is shown here:
## Source Lines Count Unique Words per line % unique
## 1 Blog 899288 38154238 360784 42.4 0.9
## 2 News 77259 2693898 89335 34.9 3.3
## 3 Twitter 2360148 30218125 383565 12.8 1.3
The plot of the top ten results by source is shown here:
From the summary table it is clear that the largest data sets, in terms of both the number of lines and the number of words, come from Twitter and from blogs. The news set may be the smallest, but it has the highest percentage of unique words. It is not surprising that the Twitter set has the fewest words per line, given the platform's character limit. In terms of the top ten words, there is a large degree of overlap among the sets: all three top-ten lists are drawn from a pool of just 13 distinct words. The words themselves appear to be among the most common words in the English language.
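Since the dominant words are common English stop words, a natural next step is to exclude them before examining word frequencies further. A minimal sketch using the stop_words data set that ships with tidytext:

# Sketch: recount blog words after removing common English stop words
data(stop_words)
Cblog_nostop <- Tblog %>%
    anti_join(stop_words, by = "word") %>%
    count(word, sort = TRUE)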