This is a report on the basic exploratory data analysis of English text from news, blogs, and Twitter, as a first step in a natural language processing project. It describes three steps: loading the data, summarizing it, and sampling it for further analysis. The goal is to investigate the corpus data and use it to build a text prediction model: given an input string, the model should predict the most probable next word.
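To make the goal concrete, here is a toy illustration of the intended interface; the function, table, and counts below are hypothetical, not part of the analysis.

# Hypothetical example: predict the most frequent continuation of a word
# from a table of observed bigram counts
predict_next <- function(word, bigram_freq) {
  candidates <- bigram_freq[bigram_freq$w1 == word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$w2[which.max(candidates$n)]
}
bigram_freq <- data.frame(w1 = c("new", "new"), w2 = c("york", "year"),
                          n = c(120, 80), stringsAsFactors = FALSE)
predict_next("new", bigram_freq)  # "york"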
A few things to note about the given corpus data:

- The data contains a lot of unnecessary noise, foreign words, and words from different encodings.
- Most words occur only a few times, so associating each word with its neighbors is important for predicting the next word.
- I select only English-language words using a regular expression (a sketch follows the list).
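The exact expression is not reproduced here, so the pattern below is an assumption: it drops non-ASCII bytes with iconv() and then keeps only letters, apostrophes, and spaces.

# Assumed filter: strip non-ASCII bytes, then replace everything except
# letters, apostrophes and spaces with a space
keep_english <- function(lines) {
  lines <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
  gsub("[^A-Za-z' ]", " ", lines)
}
keep_english("Café 123, déjà vu!")  # accented characters and digits are removed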
Make sure the working directory is set to the location where your files are stored. The data, provided by Coursera in partnership with SwiftKey, contains text in several languages, such as Russian, German, Finnish, and English. We are interested in English, so let's load the files inside the "en_US" folder.
# Load Required Packages
library(ggplot2)
library(tm)
library(qdap)
library(rJava)
library(RWekajars)
library(RWeka) # See NOTE in the description
library(dplyr)
library(wordcloud)
NOTE: You may face problems while installing the rJava and RWeka packages. I used macOS; the following workaround was tested on Mac OS X 10.10 - 10.12.
1. Download the Java Development Kit (jdk-8u11**.dmg) from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. CHECK: after installing, you should see a jdk… directory under /Library/Java/JavaVirtualMachines/.
2. Download Java for Mac OS from https://support.apple.com/kb/DL1572?locale=en_US.
3. In the Mac Terminal, run the following command: sudo R CMD javareconf
4. In the R console (or RStudio), install rJava and RWeka:
install.packages("rJava", repos = "http://rforge.net/", type = "source")
install.packages("RWeka")
# Open connections
con_twitter <- file("en_US.twitter.txt", "r")
con_news <- file("en_US.news.txt", "r")
con_blogs <- file("en_US.blogs.txt", "r")
# Read all lines from the connections opened above
# (skipNul = TRUE avoids warnings about embedded NUL characters)
news <- readLines(con_news, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(con_twitter, encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(con_blogs, encoding = "UTF-8", skipNul = TRUE)
# Close connections
close(con_twitter)
close(con_news)
close(con_blogs)
# Summary of the dataset
news_mat <- matrix(c("en_US.news.txt",
                     file.info("en_US.news.txt")$size / 1024 / 1024,
                     length(news), sum(nchar(news))), nrow = 1, byrow = TRUE)
twitter_mat <- matrix(c("en_US.twitter.txt",
                        file.info("en_US.twitter.txt")$size / 1024 / 1024,
                        length(twitter), sum(nchar(twitter))), nrow = 1, byrow = TRUE)
blogs_mat <- matrix(c("en_US.blogs.txt",
                      file.info("en_US.blogs.txt")$size / 1024 / 1024,
                      length(blogs), sum(nchar(blogs))), nrow = 1, byrow = TRUE)
summary <- data.frame(matrix(c(news_mat, twitter_mat, blogs_mat), nrow = 3, byrow = TRUE))
colnames(summary) <- c("File Name", "File Size (MB)", "Total #Lines", "Total #Chars")
summary
## File Name File Size (MB) Total #Lines Total #Chars
## 1 en_US.news.txt 196.277512550354 1010242 203223159
## 2 en_US.twitter.txt 159.364068984985 2360148 162096031
## 3 en_US.blogs.txt 200.424207687378 899288 206824505
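Note that the last column comes from sum(nchar()) and therefore counts characters, not words. To estimate word counts instead, one could split each line on whitespace; a minimal sketch (the helper name is my own):

# Approximate word counts by splitting each line on runs of whitespace
count_words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))
count_words(c("hello world", "one two three"))  # returns 5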
These files are huge, so let's subset the data by random sampling so that the subset is still a representative sample. For simplicity, I'm considering only 10% of the data from each file.
set.seed(1234)  # fix the seed so the sample is reproducible
news_sample <- sample(news, round(length(news) * 0.1), replace = FALSE)
twitter_sample <- sample(twitter, round(length(twitter) * 0.1), replace = FALSE)
blogs_sample <- sample(blogs, round(length(blogs) * 0.1), replace = FALSE)
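With the samples in hand, we can start associating words with their neighbors, which is what the RWeka package was loaded for. Below is a minimal sketch of bigram tokenization with NGramTokenizer; the min/max settings and the five-line slice are illustrative, not final choices.

# Extract bigrams (adjacent word pairs) from a small slice of the blog sample
bigrams <- NGramTokenizer(paste(blogs_sample[1:5], collapse = " "),
                          Weka_control(min = 2, max = 2))
head(bigrams)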
The following plot shows the number of lines of text in each file.
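The figure itself is not embedded here; a minimal sketch of how such a plot could be produced with the ggplot2 package loaded above:

# Bar chart of line counts per file
line_counts <- data.frame(file  = c("news", "twitter", "blogs"),
                          lines = c(length(news), length(twitter), length(blogs)))
ggplot(line_counts, aes(x = file, y = lines)) +
  geom_col() +
  labs(title = "Lines per file", x = "File", y = "Number of lines")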
The following plot shows the number of words per line in each file.
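Again, the figure is not embedded; one way to draw it, assuming whitespace-delimited words and using the 10% samples to keep it fast:

# Histogram of words per line, one panel per file
words_per_line <- function(lines) lengths(strsplit(lines, "\\s+"))
wpl <- rbind(data.frame(file = "news",    words = words_per_line(news_sample)),
             data.frame(file = "twitter", words = words_per_line(twitter_sample)),
             data.frame(file = "blogs",   words = words_per_line(blogs_sample)))
ggplot(wpl, aes(x = words)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ file, scales = "free_y") +
  labs(title = "Words per line", x = "Words per line", y = "Number of lines")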