This is a report on the basic exploratory data analysis of English text from news, blogs, and Twitter, as a first step in a natural language processing project. It describes three steps: loading the data, summarizing it, and sampling it for further analysis. The goal is to investigate the corpus data and use it to build a text prediction model: given an input string, the model should predict the most probable next word.
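To make the goal concrete, here is a toy illustration of the intended interface; the function, table, and counts below are hypothetical, not part of the analysis.

# Hypothetical example: predict the most frequent continuation of a word
# from a table of observed bigram counts
predict_next <- function(word, bigram_freq) {
  candidates <- bigram_freq[bigram_freq$w1 == word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$w2[which.max(candidates$n)]
}
bigram_freq <- data.frame(w1 = c("new", "new"), w2 = c("york", "year"),
                          n = c(120, 80), stringsAsFactors = FALSE)
predict_next("new", bigram_freq)  # "york"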
A few things to note about the given corpus data:

- The data contains a lot of unnecessary noise, foreign words, and words from different encodings.
- Most words occur only a few times, so associating each word with its neighbors is important for predicting the next word.
- I select only English-language words using a regular expression (a sketch follows the list).
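The exact expression is not reproduced here, so the pattern below is an assumption: it drops non-ASCII bytes with iconv() and then keeps only letters, apostrophes, and spaces.

# Assumed filter: strip non-ASCII bytes, then replace everything except
# letters, apostrophes and spaces with a space
keep_english <- function(lines) {
  lines <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
  gsub("[^A-Za-z' ]", " ", lines)
}
keep_english("Café 123, déjà vu!")  # accented characters and digits are removed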
Make sure the working directory is set to the location where your files are stored. The data, provided by Coursera in partnership with SwiftKey, contains text in several languages, such as Russian, German, Finnish, and English. We are interested in English, so let's load the files inside the "en_US" folder.
# Load Required Packages
library(ggplot2)
library(tm)
library(qdap)
library(rJava)
library(RWekajars)
library(RWeka) # See NOTE in the description
library(dplyr)
library(wordcloud)
NOTE: You may face problems while installing the rJava and RWeka packages. I used macOS; the following workaround was tested on Mac OS X 10.10 - 10.12.
1. Download the Java Development Kit (jdk-8u11**.dmg) from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. CHECK: after installing, you should see a jdk… directory under /Library/Java/JavaVirtualMachines/.
2. Download Java for Mac OS from https://support.apple.com/kb/DL1572?locale=en_US.
3. In the Mac Terminal, run the following command: sudo R CMD javareconf
4. In the R console (or RStudio), install rJava and RWeka:
install.packages("rJava", repos = "http://rforge.net/", type = "source")
install.packages("RWeka")
# Open connections
con_twitter <- file("en_US.twitter.txt", "r")
con_news <- file("en_US.news.txt", "r")
con_blogs <- file("en_US.blogs.txt", "r")
# Read all lines from the connections opened above
# (skipNul = TRUE avoids warnings about embedded NUL characters)
news <- readLines(con_news, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(con_twitter, encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(con_blogs, encoding = "UTF-8", skipNul = TRUE)
# Close connections
close(con_twitter)
close(con_news)
close(con_blogs)
# Summary of the dataset
news_mat <- matrix(c("en_US.news.txt",
                     file.info("en_US.news.txt")$size / 1024 / 1024,
                     length(news), sum(nchar(news))), nrow = 1, byrow = TRUE)
twitter_mat <- matrix(c("en_US.twitter.txt",
                        file.info("en_US.twitter.txt")$size / 1024 / 1024,
                        length(twitter), sum(nchar(twitter))), nrow = 1, byrow = TRUE)
blogs_mat <- matrix(c("en_US.blogs.txt",
                      file.info("en_US.blogs.txt")$size / 1024 / 1024,
                      length(blogs), sum(nchar(blogs))), nrow = 1, byrow = TRUE)
summary <- data.frame(matrix(c(news_mat, twitter_mat, blogs_mat), nrow = 3, byrow = TRUE))
colnames(summary) <- c("File Name", "File Size (MB)", "Total #Lines", "Total #Chars")
summary
## File Name File Size (MB) Total #Lines Total #Chars
## 1 en_US.news.txt 196.277512550354 1010242 203223159
## 2 en_US.twitter.txt 159.364068984985 2360148 162096031
## 3 en_US.blogs.txt 200.424207687378 899288 206824505
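Note that the last column comes from sum(nchar()) and therefore counts characters, not words. To estimate word counts instead, one could split each line on whitespace; a minimal sketch (the helper name is my own):

# Approximate word counts by splitting each line on runs of whitespace
count_words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))
count_words(c("hello world", "one two three"))  # returns 5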
These files are huge, so let's subset the data by random sampling so that the subset is still a representative sample. For simplicity, I'm considering only 10% of the data from each file.
set.seed(1234)  # fix the seed so the sample is reproducible
news_sample <- sample(news, round(length(news) * 0.1), replace = FALSE)
twitter_sample <- sample(twitter, round(length(twitter) * 0.1), replace = FALSE)
blogs_sample <- sample(blogs, round(length(blogs) * 0.1), replace = FALSE)
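With the samples in hand, we can start associating words with their neighbors, which is what the RWeka package was loaded for. Below is a minimal sketch of bigram tokenization with NGramTokenizer; the min/max settings and the five-line slice are illustrative, not final choices.

# Extract bigrams (adjacent word pairs) from a small slice of the blog sample
bigrams <- NGramTokenizer(paste(blogs_sample[1:5], collapse = " "),
                          Weka_control(min = 2, max = 2))
head(bigrams)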
The following plot shows the number of lines of text in each file.
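The figure itself is not embedded here; a minimal sketch of how such a plot could be produced with the ggplot2 package loaded above:

# Bar chart of line counts per file
line_counts <- data.frame(file  = c("news", "twitter", "blogs"),
                          lines = c(length(news), length(twitter), length(blogs)))
ggplot(line_counts, aes(x = file, y = lines)) +
  geom_col() +
  labs(title = "Lines per file", x = "File", y = "Number of lines")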
The following plot shows the number of words per line in each file.
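Again, the figure is not embedded; one way to draw it, assuming whitespace-delimited words and using the 10% samples to keep it fast:

# Histogram of words per line, one panel per file
words_per_line <- function(lines) lengths(strsplit(lines, "\\s+"))
wpl <- rbind(data.frame(file = "news",    words = words_per_line(news_sample)),
             data.frame(file = "twitter", words = words_per_line(twitter_sample)),
             data.frame(file = "blogs",   words = words_per_line(blogs_sample)))
ggplot(wpl, aes(x = words)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ file, scales = "free_y") +
  labs(title = "Words per line", x = "Words per line", y = "Number of lines")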