Overview

This document presents a simple exploratory data analysis of the news, blog and Twitter data.

rm(list = ls())
library(data.table)
library(tidyverse)
library(tidytext)
library(dplyr)
library(ggplot2)

Loading the Data

The data was loaded and saved in a separate script, shown in the Appendix below. We start by checking a summary of the data.

load("./dta/textDta.RData")
summary(textDta)
##                       text            source         
##  Thanks for the RT!     :    571   Length:3336695    
##  Thank you!             :    547   Class :character  
##  thank you!             :    382   Mode  :character  
##  Thanks for the follow! :    326                     
##  Thanks for the mention!:    188                     
##  thanks for the RT!     :    185                     
##  (Other)                :3334496
summary(as.factor(textDta$source))
##    blog    news twitter 
##  899288   77259 2360148

Since the data is too large to process in full, we will use only 5% of the observations, selected at random. Below is the distribution of the sample per source.

textDta <- textDta[sample(1:nrow(textDta), 0.05*nrow(textDta)),]
summary(as.factor(textDta$source))
##    blog    news twitter 
##   45309    3847  117678
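
The draw above is made without a fixed seed, so the exact counts will vary slightly from run to run. A minimal sketch of a reproducible version (the seed value 1234 is an arbitrary choice, not the one used here):

set.seed(1234)  ## fix the random number generator so the 5% sample can be reproduced
textDta <- textDta[sample(1:nrow(textDta), 0.05*nrow(textDta)),]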

Stop Words

Depending on the type of analysis, stop words are used to exclude words that are very common. In this project I used the pre-existing stop word list "stop_words" from the tidytext package. Its structure is shown below.

## Classes 'tbl_df', 'tbl' and 'data.frame':    1149 obs. of  2 variables:
##  $ word   : chr  "a" "a's" "able" "about" ...
##  $ lexicon: chr  "SMART" "SMART" "SMART" "SMART" ...

There are 1,149 words in the stop word list, which we will use to filter the tokens in the following data exploration.
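
To illustrate how the list is applied, a filtering step along the following lines drops stop words from a tokenized data frame with anti_join. This is a minimal sketch on a toy input; in the actual exploration the input would be the tokenized sample of textDta.

library(tidytext)
library(dplyr)
library(tibble)

data("stop_words")

## toy tokenized input: one word per row, as produced by unnest_tokens
tokens <- tibble(word = c("the", "quick", "brown", "fox", "is", "here"))

## keep only the tokens that do not appear in the stop word list
tokens_clean <- anti_join(tokens, stop_words, by = "word")
tokens_clean  ## "quick", "brown" and "fox" remain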

Tokenization of the Data

Tokenization is the process of breaking a sequence of text into words or phrases. The unit used to describe the length of these sequences is the "n-gram", where "n" is the number of words in the sequence. Here we will explore three sequence lengths: 1-grams, 2-grams and 3-grams.

The following are the top 20 1-grams, 2-grams and 3-grams. It is important to note that these counts come from a sample, so a margin of error should be considered when analysing them.
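
The counts behind those rankings can be produced with tidytext's unnest_tokens. The chunk below is a sketch of that approach rather than the exact code used for the figures; the object names unigrams, bigrams and trigrams are placeholders, and stop words are only removed from the 1-grams for brevity.

library(tidytext)
library(dplyr)

## text may have been stored as a factor, so coerce it back to character first
textDta$text <- as.character(textDta$text)

## 1-grams: one word per row, stop words removed, top 20 by frequency
unigrams <- textDta %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words, by = "word") %>%
        count(word, sort = TRUE) %>%
        head(20)

## 2-grams and 3-grams: sequences of 2 and 3 consecutive words
bigrams <- textDta %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
        count(bigram, sort = TRUE) %>%
        head(20)

trigrams <- textDta %>%
        unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
        count(trigram, sort = TRUE) %>%
        head(20)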

Path Forward

The project requires predicting the next word as the user types on a keyboard. The best approach is to build n-gram frequency tables and then apply a prediction algorithm on top of them. I will most likely use the Stupid Backoff algorithm, as it is well suited to this type of data.
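
As a rough illustration of the idea (not the final implementation), Stupid Backoff scores a candidate word using the longest n-gram context that has actually been observed, multiplying the score by a fixed penalty (commonly 0.4) each time it backs off to a shorter context. The sketch below assumes hypothetical count tables trigram_counts (columns w1, w2, w3, n) and bigram_counts (columns w1, w2, n); neither exists yet in this project.

## score a candidate next word given the two preceding words,
## backing off from trigram counts to bigram counts with a fixed penalty
stupid_backoff <- function(w1, w2, candidate,
                           trigram_counts, bigram_counts, alpha = 0.4) {
        tri <- trigram_counts[trigram_counts$w1 == w1 &
                              trigram_counts$w2 == w2 &
                              trigram_counts$w3 == candidate, ]
        context <- bigram_counts[bigram_counts$w1 == w1 &
                                 bigram_counts$w2 == w2, ]
        if (nrow(tri) > 0 && nrow(context) > 0) {
                ## trigram seen: relative frequency within its two-word context
                return(sum(tri$n) / sum(context$n))
        }
        bi <- bigram_counts[bigram_counts$w1 == w2 &
                            bigram_counts$w2 == candidate, ]
        if (nrow(bi) > 0) {
                ## back off to the bigram, applying the penalty
                return(alpha * sum(bi$n) / sum(bigram_counts$n[bigram_counts$w1 == w2]))
        }
        0  ## unseen at both orders; a fuller version would back off to unigram counts
}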

Appendix

Below is the R code I used to read and save the data.

rm(list = ls())
library(tidyverse)
library(tidytext)
library(ggplot2)

## download the data ##
if(!dir.exists("./dta/raw")){
        dir.create("./dta/raw", recursive = TRUE)   ## also creates ./dta if it does not exist yet
}
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "./dta/raw/Coursera-SwiftKey.zip", mode = "wb")
unzip(zipfile = "./dta/raw/Coursera-SwiftKey.zip", exdir = "./dta/raw")    ## unzip to open files
path <- file.path("./dta/raw", "final", "en_US")   ## the zip extracts into final/en_US
files <- list.files(path, recursive = TRUE)

## open twitter data ##
con <- file("./dta/raw/final/en_US/en_US.twitter.txt", "r") 
twitterDta <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
twitterDta <- as.data.frame(twitterDta)
names(twitterDta)[1] <- "text"
twitterDta$source <- "twitter"  ## add an identifier to what data source it belongs
close(con)
save(twitterDta, file = "./dta/twitterDta.RData")

## open news data ##
con <- file("./dta/raw/final/en_US/en_US.news.txt", "r") 
newsDta <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
newsDta <- as.data.frame(newsDta)
names(newsDta)[1] <- "text"
newsDta$source <- "news"  ## add an identifier to what data source it belongs
close(con)
save(newsDta, file = "./dta/newsDta.RData")

## open blog data ##
con <- file("./dta/raw/final/en_US/en_US.blogs.txt", "r") 
blogDta <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
blogDta <- as.data.frame(blogDta)
names(blogDta)[1] <- "text"
blogDta$source <- "blog"  ## add an identifier to what data source it belongs
close(con)
save(blogDta, file = "./dta/blogDta.RData")

## bind data to make 1 data frame ## 
textDta <- rbind(blogDta, newsDta, twitterDta)
save(textDta, file = "./dta/textDta.RData")