Jenifer PK & Srinath KS
26th August 2015
ABOUT THE DATASET:
The project that we have selected for this course is “TEXT PREDICTION - EXPLORATORY ANALYSIS ON SWIFTKEY DATASETS”. The HC Corpora dataset comprises text collected from news sites, blogs and Twitter. It contains three files for each of four languages (Russian, Finnish, German and English). This project focuses on the English-language files, which are named as follows:
en_US.blogs.txt
en_US.twitter.txt
en_US.news.txt
The datasets are referred to as “Blogs”, “Twitter” and “News” for the remainder of this report.
SUMMARY & DESCRIPTION OF THE DATASET:
| Dataset | Words | Characters | Letters | Lines | Avg Word Length | Avg Words/Line |
| Blogs | 37,242,000 | 206,824,000 | 163,815,000 | 899,000 | 4.4 | 41.41 |
| News | 34,275,000 | 203,223,000 | 162,803,000 | 1,010,000 | 4.75 | 33.93 |
| Twitter | 29,876,000 | 162,122,000 | 125,998,000 | 2,360,000 | 4.22 | 12.66 |
| Total | 101,393,000 | 572,170,000 | 452,617,000 | 4,269,000 | 4.46 | 23.75 |
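The two averages in the table appear to follow from the other columns: average word length is letters divided by words (for Blogs, 163,815,000 / 37,242,000 ≈ 4.4), and average words per line is words divided by lines, up to rounding of the displayed counts. The sketch below shows one way such figures could be computed in base R. The three sample lines and all variable names are made up for illustration, and splitting on whitespace is only an assumption about how words were counted.

```r
# Illustrative sample; the real input would be the lines of a corpus file.
sample_lines <- c("The quick brown fox", "jumps over", "the lazy dog today")

# Words: split each line on runs of whitespace and count the pieces.
words_per_line <- lengths(strsplit(trimws(sample_lines), "\\s+"))

n_lines   <- length(sample_lines)
n_words   <- sum(words_per_line)
n_chars   <- sum(nchar(sample_lines))                         # all characters
n_letters <- sum(nchar(gsub("[^A-Za-z]", "", sample_lines)))  # letters only

avg_word_length    <- n_letters / n_words
avg_words_per_line <- n_words / n_lines

print(c(words = n_words, characters = n_chars, letters = n_letters,
        lines = n_lines))
print(round(c(avg_word_length = avg_word_length,
              avg_words_per_line = avg_words_per_line), 2))
```

Counting letters with a plain `[^A-Za-z]` filter ignores accented characters, so a real corpus may need a broader character class.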
TASK:
Explore the data
Profanity filtering - removing profanity and other words you do not want to predict
Tokenization - identifying appropriate tokens such as words, punctuation, and numbers
Train a model on the data using Natural Language Processing
Build a Shiny app with the model
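The profanity-filtering and tokenization steps listed above can be sketched in base R as follows. This is a minimal illustration, not the pipeline used in this report: the function names `tokenize` and `remove_banned` and the placeholder word list `banned_words` are hypothetical, and the tokenizer keeps only lowercase word tokens (it drops punctuation and numbers, which a fuller tokenizer might retain).

```r
# Hypothetical placeholder list; a real filter would load a proper word list.
banned_words <- c("darn", "heck")

# Lowercase the text, replace everything except letters, apostrophes and
# spaces with a space, then split on whitespace.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]", " ", text)
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Drop every token that appears in the banned list.
remove_banned <- function(tokens, banned) {
  tokens[!tokens %in% banned]
}

tokens <- remove_banned(tokenize("Well, heck! That's a darn good idea."),
                        banned_words)
print(tokens)
```

Filtering after tokenization keeps the two steps independent, so the word list can be swapped without touching the tokenizer.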
The task is to build a predictive model that predicts the next word when a user types a word or phrase, similar to how Google predicts what you want to search for based on the most popular search terms.
Using the English SwiftKey files (Twitter, News and Blogs), we will create a data product that predicts the next word. The Blogs and News datasets are each approximately 200 megabytes, and the Twitter dataset is approximately 160 megabytes.
In this project we will apply Natural Language Processing (NLP), text mining and the tools available in R, first for exploratory data analysis and then for the subsequent text modelling and prediction.
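To make the idea of next-word prediction concrete, here is a minimal sketch of one simple approach: count bigrams (pairs of adjacent words) and suggest the word that most often follows the one typed. It is an assumption for illustration only, not the model built in this project. The helper names `train_bigrams` and `predict_next` and the three-line toy corpus are hypothetical.

```r
# Collect every pair of adjacent words from a character vector of lines.
train_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
    if (length(w) < 2) return(NULL)
    data.frame(first = w[-length(w)], second = w[-1],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, Filter(Negate(is.null), pairs))
}

# Return the most frequent follower of `word`, or NA if it was never seen.
predict_next <- function(bigrams, word) {
  candidates <- bigrams$second[bigrams$first == tolower(word)]
  if (length(candidates) == 0) return(NA_character_)
  counts <- table(candidates)
  names(counts)[which.max(counts)]
}

# Toy corpus, made up for this example.
corpus <- c("i love new york", "new york is big", "i love new ideas")
model <- train_bigrams(corpus)
print(predict_next(model, "new"))
```

A practical model would likely use longer n-grams with a fallback to shorter ones when a phrase has not been seen, but the lookup logic stays the same.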
DATASET DOWNLOAD:
We downloaded the dataset archive for this project and unzipped it into our working directory using the following code:
fileName = "Coursera-SwiftKey.zip"
if(!file.exists(fileName)){
  # Download the dataset (mode = "wb" keeps the zip archive intact on Windows)
  download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = fileName, mode = "wb")
  # Record and show when the download happened
  Download_Date <- Sys.time()
  print(Download_Date)
  # Unzip the dataset
  unzip(zipfile = fileName, overwrite = TRUE)
}else{
  print("Dataset is already downloaded!")
}
## [1] "Dataset is already downloaded!"
PACKAGES TO BE LOADED:
library(NLP)
library(openNLP)
library(tm)
library(RWeka)
library(qdapDictionaries)
library(qdapRegex)
library(qdapTools)
library(RColorBrewer)
library(qdap)
library(stringr)
library(ggplot2)
library(SnowballC)
library(wordcloud)
Number of Characters & Lines in Blogs
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.blogs.txt", "r")
blogs = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharBlog = sum(nchar(blogs)); lenBlog = length(blogs)
nCharBlog;lenBlog
## [1] 206824505
## [1] 899288
Number of Characters & Lines in News
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.news.txt", "r")
news = readLines (connection, n = -1, encoding = "UTF-8")
## Warning in readLines(connection, n = -1, encoding = "UTF-8"): incomplete
## final line found on 'E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/
## en_US.news.txt'
close (connection)
nCharNews = sum(nchar(news)); lenNews = length(news)
nCharNews;lenNews
## [1] 15639408
## [1] 77259
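The line count above (77,259) is far below the roughly 1,010,000 lines listed for News in the summary table, and the warning about an incomplete final line suggests that the read stopped early. One commonly reported cause on Windows is a control character (byte 0x1A) that a text-mode connection treats as end-of-file. A minimal sketch, assuming that is the cause here, is to open the connection in binary mode (`"rb"`). The temporary file and its contents below are made up purely for illustration.

```r
# Build a small temporary file with a SUB byte (0x1A) in the middle.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\nsecond"),
           as.raw(0x1a),
           charToRaw(" line\nthird line\n")),
         tmp)

# Open in binary mode so no byte is interpreted as end-of-file.
con <- file(tmp, open = "rb")
lines_read <- readLines(con, n = -1, encoding = "UTF-8")
close(con)

print(length(lines_read))
unlink(tmp)
```

Applied to the News file, the same change would mean replacing `"r"` with `"rb"` in the `file()` call; the resulting counts would need to be re-run rather than taken from the output shown above.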
Number of Characters & Lines in Twitter
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.twitter.txt", "r")
tweets = readLines (connection, n = -1, encoding = "UTF-8")
## Warning in readLines(connection, n = -1, encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines(connection, n = -1, encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines(connection, n = -1, encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines(connection, n = -1, encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul
close (connection)
nCharTweets = sum(nchar(tweets)); lenTweets = length (tweets)
nCharTweets;lenTweets
## [1] 162096031
## [1] 2360148
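The "embedded nul" warnings above mean that a few Twitter lines contain NUL bytes. `readLines()` has a `skipNul` argument that drops such bytes while reading. The sketch below demonstrates it on a temporary file whose contents are made up for illustration. Whether skipping the NULs is the right treatment for this corpus is an assumption.

```r
# Build a small temporary file where the second line contains a NUL byte.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("good tweet\nbad"),
           as.raw(0x00),
           charToRaw(" tweet\n")),
         tmp)

# skipNul = TRUE removes the NUL bytes instead of warning about them.
con <- file(tmp, open = "rb")
tweets_demo <- readLines(con, n = -1, encoding = "UTF-8", skipNul = TRUE)
close(con)

print(tweets_demo)
unlink(tmp)
```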
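READ & SUMMARIZE DATA:
# Each file is read in full to collect per-line character counts and the line count,
# then re-read to keep only the first linesToRead lines as a working sample.
linesToRead = 500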
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.blogs.txt", "r")
blogs = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharBlog = nchar (blogs); lenBlog = length (blogs)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.blogs.txt", "r")
blogs = readLines (connection, n = linesToRead, encoding = "UTF-8")
close (connection)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.news.txt", "r")
news = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharNews = nchar (news); lenNews = length (news)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.news.txt", "r")
news = readLines (connection, n = linesToRead, encoding = "UTF-8")
close (connection)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.twitter.txt", "r")
tweets = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharTweets = nchar (tweets); lenTweets = length (tweets)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.twitter.txt", "r")
tweets = readLines (connection, n = linesToRead, encoding = "UTF-8")
close (connection)