Overview

The data used in this project comes from a corpus called HC Corpora, collected from publicly available sources by a web crawler. This data is used by SwiftKey to build a smart keyboard featuring text suggestions generated by predictive text models.

The data contains the text of Twitter posts, blogs, and news articles, made available for applying data science in the area of natural language processing. The overall folder is 548.0 MB; here are the steps taken to download and read selected files. The download takes about 8 minutes.

# set a longer timeout for the large download
options(timeout = 6000)

# assign the data link to a variable named url
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# create the destination folder if needed and set a destination path
dir.create("data", showWarnings = FALSE)
dest <- file.path("data", "Coursera-SwiftKey.zip")
# start the clock
s <- Sys.time()
# download the data
downloader::download(url = url, destfile = dest)
e <- Sys.time()
# check the time taken to download
e - s

In particular, the Coursera-SwiftKey.zip archive contains locale data in English, German, Russian, and Finnish, stored in a sub-folder named final.

This project will use the English data sets.
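The listing below can be reproduced without extracting the archive; a minimal sketch, assuming the dest path from the download step above:

# list the English files inside the archive without extracting them
contents <- unzip(dest, list = TRUE)
contents[grepl("en_US", contents$Name), ]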

                           Name    Length                Date
1                  final/en_US/         0 2014-07-22 10:10:00
2 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
3    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
4   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00

Exploratory Data Analysis

To see the first line of the en_US.twitter.txt file, the readLines() function is used.
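A minimal sketch, reading straight from the archive through an unz() connection (the dest path is assumed from the download step):

# read only the first line of the Twitter set from inside the zip
readLines(unz(dest, "final/en_US/en_US.twitter.txt"), n = 1)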

[1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

Here is a function to read random lines from one of the selected sets. The function accesses the .zip archive, extracts the requested lines, and returns a data frame with two columns, text and type, so that all sets can later be merged by rows of random lines.

It might take a moment to return the result, since each request loads from the raw data; this is something that would need to be improved before it could be used in a Shiny app for text prediction.
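The source of randomRead() is not shown here; the following is a minimal sketch consistent with the call below, where deriving the type label from the file name is an assumption:

randomRead <- function(data, file, n = 1) {
  # read the requested file directly from the zip archive
  lines <- readLines(unz(data, file), warn = FALSE)
  # derive the type label ("blogs", "news", or "twitter") from the file name
  type <- sub("\\.txt$", "", sub("^.*en_US\\.", "", file))
  # return n random lines tagged with their source type
  data.frame(text = sample(lines, n), type = type, stringsAsFactors = FALSE)
}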

randomRead(data = "data/Coursera-SwiftKey.zip",
           file = "final/en_US/en_US.blogs.txt",
           n = 1)
                    text  type
1 We love you Mr. Brown. blogs

The objective of this project is to let the user select one or more source types (twitter, blogs, or news) and visualize randomly drawn text. Further work will involve tokenizing the sentences to extract only certain types of words and special signs, such as the hashtag (#) sign.

To introduce what will be shown, here are some visualizations of the data; producing them will be automated by loading the sets into the environment, to be used for the rest of the project.

All three data sets take about 50 seconds to load; the images are then saved as .RData files, ready to be used by loading them into the environment.
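A sketch of that caching step, assuming the dest path from the download step and these object and file names:

# load the three sets once from the archive (about 50 seconds)
blogs   <- readLines(unz(dest, "final/en_US/en_US.blogs.txt"),   warn = FALSE)
twitter <- readLines(unz(dest, "final/en_US/en_US.twitter.txt"), warn = FALSE)
news    <- readLines(unz(dest, "final/en_US/en_US.news.txt"),    warn = FALSE)
# save an image of the sets for fast reuse
save(blogs, twitter, news, file = "data/en_US_sets.RData")
# later sessions only need to restore the image
load("data/en_US_sets.RData")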

Data Visualization

The blogs set has about 900,000 rows, the twitter set about 2.4 million rows, while the news set comes with about a million lines of information.

Here is a table summarizing the number of lines and the maximum string length of each set.
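A sketch of how this table could be built, assuming the blogs, twitter, and news vectors from the caching step above:

sets <- list(blogs = blogs, twitter = twitter, news = news)
data.frame(data = names(sets),
           dim = sapply(sets, length),
           max_string_length = sapply(sets, function(x) max(nchar(x))),
           row.names = NULL)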

     data     dim max_string_length
1   blogs  899288             40833
2 twitter 2360148               140
3    news 1010242             11384

An interesting finding comes from counting how many Twitter posts contain the hashtag (#) sign, based on the available data.
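A one-line count, assuming the twitter vector loaded above:

# count the tweets containing at least one "#"
sum(grepl("#", twitter, fixed = TRUE))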

[1] 214369

Only 9% of the available tweets contain a hashtag!

\[\frac{\text{n of rows containing the # sign}}{\text{total n of rows}} \times 100\]
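In R, again assuming the twitter vector:

# percentage of tweets containing a hashtag
sum(grepl("#", twitter, fixed = TRUE)) / length(twitter) * 100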

[1] 9.082863

Tokenization

Tokenization is used to identify the appropriate tokens, such as words, punctuation, or numbers. Two functions are used that take a file as input and return a tokenized version of the sampled strings:

firstToken()
lastToken()
  firstToken
1         In
2         We
3       Chad
  lastToken
1   “gods”.
2    Brown.
3      him.
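The definitions of firstToken() and lastToken() are not shown here; a minimal sketch consistent with the outputs, assuming each takes a character vector and a sample size, could be:

firstToken <- function(x, n = 3) {
  # split sampled lines on whitespace and keep the first word
  data.frame(firstToken = sapply(strsplit(sample(x, n), "\\s+"), `[`, 1))
}
lastToken <- function(x, n = 3) {
  # split sampled lines on whitespace and keep the last word
  data.frame(lastToken = sapply(strsplit(sample(x, n), "\\s+"),
                                function(w) w[length(w)]))
}

To join the two outputs on the same lines, as below, the same sampled row indices would be passed to both functions.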

Here is a random result of the two functions' outputs joined into one data frame. It shows the first and last word of three random lines, along with their row identification numbers.

       firstToken     lastToken
747958        but          just
74522        Have relationship.
260593      There           on.

Profanity filtering

One more tool used to remove profanity and other words that should not be included in the prediction algorithm is the profanityRead() function. It applies readLines() to a bad-words database and returns a list to be used for filtering out these words.
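The source of profanityRead() is not shown; a sketch consistent with the call below, where the bad-words file path is an assumption, could be:

profanityRead <- function(x, n = 1) {
  # read the bad-words list (path is an assumption)
  bad <- readLines("data/bad-words.txt", warn = FALSE)
  # keep the bad words that actually occur in the given set
  found <- intersect(bad, unlist(strsplit(tolower(x), "\\s+")))
  # return n of them at random
  data.frame(bad = sample(found, n))
}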

set.seed(2222)
profanityRead(twitter, n = 1)
    bad
1 fight

Prediction algorithm and Shiny app

Finally, the objective of this capstone is to build a predictive algorithm embedded in a Shiny app, taking strings of text as input and returning predictions drawn from the data.

It will provide visualizations of the statistics for selected data and a word cloud of the most frequent words in the text.

Additional features are still to be tested.