Capstone

Introduction

The Coursera week 2 assignment provided a link to download a data file, and students are expected to perform natural language processing on that data. The download contains text files from three sources: blogs, Twitter and news. The following sections cover downloading the data, processing it and producing a few visual summaries.

Downloading and reading data

In this section I answer the following questions from the review criteria:
- Does the link lead to an HTML page describing the exploratory analysis of the training data set?
- Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
- Has the data scientist made basic plots, such as histograms, to illustrate features of the data?
- Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

# Download the data from the URL given in the first week, unless it already exists locally

dataURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dataDIR <- "final"

if (!dir.exists(dataDIR)) {
    dataZipName <- "Coursera-SwiftKey.zip"
    # Only download the archive if it is not already in the working directory
    if (!file.exists(dataZipName)) {
        download.file(dataURL, dataZipName, method = "auto")
    }
    unzip(dataZipName)
}

# Summary of the downloaded files

list.dirs(path = "./final", full.names = TRUE, recursive = TRUE)

## [1] "./final"       "./final/de_DE" "./final/en_US" "./final/fi_FI"
## [5] "./final/ru_RU"

# The three English-language text files that were downloaded are listed here
list.files(path = "./final/en_US", full.names = TRUE, recursive = TRUE)

## [1] "./final/en_US/en_US.blogs.txt"   "./final/en_US/en_US.news.txt"   
## [3] "./final/en_US/en_US.twitter.txt"

# Read the blogs file; each element of en_US.blogs holds one line of text
con_blogs     <- file("final/en_US/en_US.blogs.txt")
en_US.blogs   <- readLines(con_blogs)
close(con_blogs)

# The other two text files can be read in the same way
# con_twitter   <- file("final/en_US/en_US.twitter.txt")
# con_news      <- file("final/en_US/en_US.news.txt")
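Reading files like these with readLines can produce warnings about embedded nuls or an incomplete final line, and on some platforms a text-mode read may stop early at a control character. The helper below is a minimal sketch of one common workaround, not part of the original analysis: read_corpus is a hypothetical name, and opening the connection in binary mode with skipNul = TRUE is an assumption about what suits this data. Counts obtained this way may differ from those reported in this document.

```r
# Hypothetical helper (not in the original report): read a text file in
# binary mode and drop embedded nul bytes instead of warning about them.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage, assuming the archive has been unzipped into "final":
# en_US.twitter <- read_corpus("final/en_US/en_US.twitter.txt")
# en_US.news    <- read_corpus("final/en_US/en_US.news.txt")
```

The on.exit call closes the connection even if readLines fails, so no connection is left open.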

Word count, line count, data tables and plots

Basic summaries of the three files are given in terms of word counts, line counts and simple plots. All three files are in US English and come from different sources: one from blogs, one from Twitter and one from news.

# Summary for blogs
library(tokenizers)

## Warning: package 'tokenizers' was built under R version 3.4.2

con_blogs         <- file("final/en_US/en_US.blogs.txt")
en_US.blogs       <- readLines(con_blogs)

# Word count (reuses the lines read above, so the file is not read a second time)
blogs_text        <- paste(en_US.blogs, collapse = "\n")
blogs_word_count  <- tokenize_words(blogs_text)
length(blogs_word_count[[1]])

## [1] 38154238

# Line count
blogs_line_count  <- NROW(en_US.blogs)
blogs_line_count

## [1] 899288

# Word counts of the 20 longest lines
blogs_line        <- tokenize_words(en_US.blogs)
blogs_line_length <- lengths(blogs_line)
blogs_length_sort <- sort(blogs_line_length)
blogs_top_length  <- tail(blogs_length_sort, 20)
plot(blogs_top_length, col = "dark green", bg = "dark green", pch = 19,
     main = "Word count of top 20 longest lines for blogs text",
     ylab = "Word count")

close(con_blogs)
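The review criteria mention histograms, while the plot above shows only the 20 longest lines. As a rough sketch of how the full distribution of words per line could be drawn, the block below uses base R only. The names words_per_line and plot_line_lengths are hypothetical, and splitting on whitespace is a simpler approximation than tokenize_words, so its counts will not match the figures in this report exactly.

```r
# Approximate words per line by splitting on whitespace (an assumption;
# tokenize_words applies different rules and may give other counts).
words_per_line <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# Draw a histogram of words per line and return the counts invisibly.
plot_line_lengths <- function(lines, source_name, colour = "grey") {
  n <- words_per_line(lines)
  hist(n, breaks = 50, col = colour,
       main = paste("Words per line:", source_name),
       xlab = "Words per line")
  invisible(n)
}

# Toy example; with the real data one would pass en_US.blogs instead.
toy_lines <- c("a short line", "", "a somewhat longer line of text")
plot_line_lengths(toy_lines, "toy example")
```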

# Summary for twitter

con_twitter         <- file("final/en_US/en_US.twitter.txt")
en_US.twitter       <- readLines(con_twitter)

## Warning in readLines(con_twitter): line 167155 appears to contain an
## embedded nul

## Warning in readLines(con_twitter): line 268547 appears to contain an
## embedded nul

## Warning in readLines(con_twitter): line 1274086 appears to contain an
## embedded nul

## Warning in readLines(con_twitter): line 1759032 appears to contain an
## embedded nul

# Word count (reuses the lines read above, so the file is not read a second time)
twitter_text        <- paste(en_US.twitter, collapse = "\n")
twitter_word_count  <- tokenize_words(twitter_text)
length(twitter_word_count[[1]])

## [1] 30218125

# Line count
twitter_line_count  <- NROW(en_US.twitter)
twitter_line_count

## [1] 2360148

# Word counts of the 20 longest lines
twitter_line        <- tokenize_words(en_US.twitter)
twitter_line_length <- lengths(twitter_line)
twitter_length_sort <- sort(twitter_line_length)
twitter_top_length  <- tail(twitter_length_sort, 20)
plot(twitter_top_length, col = "blue", bg = "blue", pch = 19,
     main = "Word count of top 20 longest lines for twitter text",
     ylab = "Word count")

close(con_twitter)

# Summary for news

con_news         <- file("final/en_US/en_US.news.txt")
en_US.news       <- readLines(con_news)

## Warning in readLines(con_news): incomplete final line found on 'final/
## en_US/en_US.news.txt'

# Word count (reuses the lines read above, so the file is not read a second time)
news_text        <- paste(en_US.news, collapse = "\n")
news_word_count  <- tokenize_words(news_text)
length(news_word_count[[1]])

## [1] 2693898

# Line count
news_line_count  <- NROW(en_US.news)
news_line_count

## [1] 77259

# Word counts of the 20 longest lines
news_line        <- tokenize_words(en_US.news)
news_line_length <- lengths(news_line)
news_length_sort <- sort(news_line_length)
news_top_length  <- tail(news_length_sort, 20)
plot(news_top_length, col = "purple", bg = "purple", pch = 19,
     main = "Word count of top 20 longest lines for news text",
     ylab = "Word count")

close(con_news)
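The line and word counts printed above can be gathered into one basic data table. The block below is a small sketch that hard-codes the figures reported in this document and derives the average number of words per line. The names corpus_summary and words_per_line are introduced here for illustration only.

```r
# Counts copied from the console output shown in this report.
corpus_summary <- data.frame(
  source = c("blogs", "twitter", "news"),
  lines  = c(899288, 2360148, 77259),
  words  = c(38154238, 30218125, 2693898),
  stringsAsFactors = FALSE
)

# Derived column: average number of words per line, to one decimal place.
corpus_summary$words_per_line <-
  round(corpus_summary$words / corpus_summary$lines, 1)

print(corpus_summary)
```

In a live analysis the hard-coded vectors could be replaced by the computed variables, for example blogs_line_count and length(blogs_word_count[[1]]).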

Conclusion

This report presents my preliminary exploration of the data for natural language processing. In future work, I would like to predict the words (up to a combination of three words) that would follow as a person writes a text message electronically.
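As a rough illustration of the prediction idea, the sketch below counts three-word sequences in a toy corpus and suggests the word that most often follows a two-word prefix. It is a minimal base R example under stated assumptions, not the planned method: toy_lines, tokenize_simple, make_ngrams and predict_next are hypothetical names, and the tokenizer is a simple regular-expression split rather than tokenize_words.

```r
# Toy corpus standing in for the real text (an assumption for illustration).
toy_lines <- c("i love you so much", "i love you too", "i love the news")

# Split a line into lowercase word tokens (simple approximation).
tokenize_simple <- function(line) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Build all n-grams of a token vector as space-joined strings.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count every trigram in the toy corpus.
trigrams <- unlist(lapply(toy_lines,
                          function(l) make_ngrams(tokenize_simple(l), 3)))
trigram_counts <- sort(table(trigrams), decreasing = TRUE)

# Suggest the most frequent third word after a two-word prefix,
# or NA when the prefix was never seen.
predict_next <- function(prefix, counts) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)
}

predict_next("i love", trigram_counts)
```

A real model would also need a fallback to shorter sequences when a prefix has not been seen, which this sketch leaves out.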

Sreeya Sreevatsa

October 28, 2017
