Background

Mobile devices have become ubiquitous and integral to our lives. We communicate through email, social media, text messages, messaging apps, and more. Most of this activity involves typing, and typing long texts on a mobile device is not easy. The good news is that companies such as SwiftKey are trying to make typing on mobile devices easier by building sophisticated text-prediction applications using state-of-the-art techniques from natural language processing. This capstone project tests the skills acquired throughout the specialization by analyzing the text documents supplied by SwiftKey. The ultimate goal is to come up with an algorithm that predicts, with a high level of accuracy, the next word the user is likely to type.

Since R's memory limitations are well known, we need to plan ahead to avoid memory issues later on and sample the data in a way that still represents the population.

Methodology

  1. Read the data completely.
  2. Perform basic statistical analysis.
  3. Analyse the basic patterns in the text.
  4. Sample the data, taking more samples from Twitter, since tweets are very similar to the way we chat and text these days.
  5. Tokenize and normalize the data (see the sketch after this list).
  6. Keep stop words, since removing them would eliminate the word 'the' and several others we need to predict.
  7. Remove single letters.
  8. Perform exploratory analysis.
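
A minimal sketch of steps 5-7 using only base R. This is illustrative, not the code actually used (the real cleaning steps appear in the Preprocessing section below), and sampleLines, normalizeLines and tokenizeLines are made-up names.

# illustrative helpers: lower-case, strip punctuation/digits, tokenize, drop single letters
normalizeLines <- function(lines) {
  lines <- tolower(lines)                              # normalize case
  lines <- gsub("[[:punct:][:digit:]]+", " ", lines)   # drop punctuation and digits
  gsub("[[:blank:]]+", " ", trimws(lines))             # collapse whitespace
}

tokenizeLines <- function(lines) {
  words <- unlist(strsplit(normalizeLines(lines), " ")) # split lines into word tokens
  words[nchar(words) > 1]                               # drop single letters (step 7)
}

sampleLines <- c("Typing on mobile devices isn't easy!", "SwiftKey predicts the next word.")
tokenizeLines(sampleLines)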

Setup and Getting Data

setwd("/home/kanudutta/Desktop/capstone")

if (!file.exists("Coursera-SwiftKey.zip")) 
{ 
  url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" 
  download.file(url, destfile="./Coursera-SwiftKey.zip") 
  dateDownloaded <- date() 
  unzip("Coursera-SwiftKey.zip", overwrite = TRUE) 
} 

Check for the required packages; install any that are not found.

packages <- c("ggplot2", "tm","RCurl","quanteda","dplyr") 

if (length(setdiff(packages, rownames(installed.packages()))) > 0) 
{ 
  install.packages(setdiff(packages, rownames(installed.packages())),dependencies = TRUE)
} 

# Loading packages
sapply(packages,function(x) library(x,character.only=TRUE)) 

Preprocessing the data

blogfileName <- 'en_US.blogs.txt' 
newsfileName <- 'en_US.news.txt' 
twitfileName <-'en_US.twitter.txt' 
profName <- 'bad-words.txt'

# establish file connections

blogConn <- file(blogfileName,open="r") 

# 'rb' mode is used because the entire text was not being read in text mode
newsConn <- file(newsfileName,open="rb")  

twitConn <- file(twitfileName,open="r") 

profnConn <- file(profName,open="r")

# read the files in their entirety, line by line

blogsData <- readLines(blogConn,encoding = "UTF-8")
newsData <- readLines(newsConn,encoding = "UTF-8" )
twitData <- readLines(twitConn,encoding = "UTF-8" )
profnData <- readLines(profnConn,encoding = "UTF-8") 


Remove punctuation and digits

# apply the cleaning to the UTF-8 text read above
blogsData1 <- gsub("[[:punct:][:blank:]]+", " ", tolower(blogsData))
newsData1 <- gsub("[[:punct:][:blank:]]+", " ", tolower(newsData))
twitData1 <- gsub("[[:punct:][:blank:]]+", " ", tolower(twitData))

blogsData2 <- gsub('[[:digit:]]+', '', blogsData1)
newsData2 <- gsub('[[:digit:]]+', '', newsData1)
twitData2 <- gsub('[[:digit:]]+', '', twitData1)

# close connections 

close(blogConn)
close(newsConn)
close(twitConn) 
close(profnConn)
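
The profanity list read above (profnData) is not used anywhere in the code shown; a minimal sketch of one way it could be applied to the cleaned text (removeProfanity and the *Data3 names are illustrative, not part of the original code):

# drop any word that appears in the bad-words list from each line
removeProfanity <- function(lines, badWords) {
  tokens <- strsplit(lines, "[[:blank:]]+")
  vapply(tokens,
         function(w) paste(w[!w %in% badWords], collapse = " "),
         character(1))
}

blogsData3 <- removeProfanity(blogsData2, profnData)
newsData3 <- removeProfanity(newsData2, profnData)
twitData3 <- removeProfanity(twitData2, profnData)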

Summary Stats

- Basic summary of the text files: file name, file size, number of lines, and total number of words.

df2
## [1] Frequency cumsum    Perct    
## <0 rows> (or 0-length row.names)
df4
## [1] Frequency cumsum    Perct    
## <0 rows> (or 0-length row.names)
df6
## [1] Frequency cumsum    Perct    
## <0 rows> (or 0-length row.names)
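
The objects sizeF, blogswordsVector, newswordsVector and twitswordsVector used in the summary below are not built in the code shown; a minimal sketch of one plausible construction (expressing file size in MB is an assumption based on the printed summary table):

## file sizes in MB (assumed unit)
sizeF <- round(c(file.info(blogfileName)$size,
                 file.info(newsfileName)$size,
                 file.info(twitfileName)$size) / 1024^2, 2)

## split the cleaned text into word tokens
blogswordsVector <- unlist(strsplit(blogsData2, "[[:blank:]]+"))
newswordsVector <- unlist(strsplit(newsData2, "[[:blank:]]+"))
twitswordsVector <- unlist(strsplit(twitData2, "[[:blank:]]+"))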
## number of lines in the files 
lenF <- c(length(blogsData),length(newsData),length(twitData)) 

## longest line in each file 
charF <- c(max(nchar(blogsData)),max(nchar(newsData)),max(nchar(twitData))) 

## Total number of unique words 
blogswordsvectorUniq <- unique(blogswordsVector) 
newswordsvectorUniq<- unique(newswordsVector)  
twitswordsvectorUniq<- unique(twitswordsVector) 

## Summary files

summary.df <- rbind(sizeF, lenF, charF,
                    cbind(length(blogswordsVector), length(newswordsVector), length(twitswordsVector)),
                    cbind(length(blogswordsvectorUniq), length(newswordsvectorUniq), length(twitswordsvectorUniq)))
rownames(summary.df) <- c("File Size","Total Line","Max Line Length","Word Count","Unique Word Count")
colnames(summary.df) <- c(blogfileName,newsfileName,twitfileName)
summary.df
##                   en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## File Size                  210.16         205.81            167.11
## Total Line              899288.00     1010242.00        2360148.00
## Max Line Length          40833.00       11384.00          11384.00
## Word Count                   0.00           0.00              0.00
## Unique Word Count            0.00           0.00              0.00

Because R keeps its objects in memory, working with the full corpora can quickly exhaust the available RAM; this can be overcome by sampling.

Sampling

blogsSam <- sample(blogsData,size=100000)
newsSam <- sample(newsData,size=100000)
tweetSam <- sample(twitData,size=150000)
combinedSam1 <- c(blogsSam,newsSam,tweetSam)  # concatenate the three samples into one character vector
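
Note that sample() draws a different sample on each run; for a reproducible sample one would typically set a seed first, for example:

set.seed(12345)   # any fixed seed makes the sampling reproducible
blogsSam <- sample(blogsData, size = 100000)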

N-gram generation

# unigram 

docMat1 <- dfm(combinedSam1, ngrams = 1, verbose = FALSE, concatenator = " ",
               stem=FALSE, 
               removeNumbers=TRUE,removeSeparators=TRUE,removeTwitter=TRUE)
docFreq1 <- docfreq(docMat1)
docFreqDF1 <- as.data.frame(docFreq1)

docFreqDFSort1 <- sort(rowSums(docFreqDF1), decreasing=TRUE)
docFreqDFSort12 <- data.frame(Words=names(docFreqDFSort1), Frequency = docFreqDFSort1)
topFeatureDf1 <- as.data.frame(topfeatures(docMat1, 40),stringsAsFactors=FALSE)
topFeatureDf12 <- data.frame(Words=row.names(topFeatureDf1), Frequency = topFeatureDf1,stringsAsFactors=FALSE)
names(topFeatureDf12) <- c('Words','Frequency')
topFeatureDf13 <- tbl_df(topFeatureDf12)
topFeatureDf14 <- filter(topFeatureDf13,nchar(Words)>2)
plot1 <- ggplot(topFeatureDf14,aes(Words,Frequency))
plot1+labs(x="Words" , y="Freq", title="Top 20 Unigram Word Frequency")+geom_bar(stat='identity',color="yellow",fill="orange")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# bigram
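
The object topFeatureDf22 plotted below is not built in the code shown; a minimal sketch mirroring the unigram step above, with ngrams = 2:

docMat2 <- dfm(combinedSam1, ngrams = 2, verbose = FALSE, concatenator = " ",
               stem = FALSE,
               removeNumbers = TRUE, removeSeparators = TRUE, removeTwitter = TRUE)
topFeatureDf2 <- as.data.frame(topfeatures(docMat2, 40), stringsAsFactors = FALSE)
topFeatureDf22 <- data.frame(Words = row.names(topFeatureDf2), Frequency = topFeatureDf2,
                             stringsAsFactors = FALSE)
names(topFeatureDf22) <- c('Words', 'Frequency')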

plot2 <- ggplot(topFeatureDf22,aes(Words,Frequency))
plot2+labs(x="2-Gram" , y="Freq", title="Top 20 Bigram Word Frequency")+geom_bar(stat='identity',color="yellow",fill="orange")+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

# trigram
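
Likewise, topFeatureDf32 is not built in the code shown; the same sketch with ngrams = 3:

docMat3 <- dfm(combinedSam1, ngrams = 3, verbose = FALSE, concatenator = " ",
               stem = FALSE,
               removeNumbers = TRUE, removeSeparators = TRUE, removeTwitter = TRUE)
topFeatureDf3 <- as.data.frame(topfeatures(docMat3, 40), stringsAsFactors = FALSE)
topFeatureDf32 <- data.frame(Words = row.names(topFeatureDf3), Frequency = topFeatureDf3,
                             stringsAsFactors = FALSE)
names(topFeatureDf32) <- c('Words', 'Frequency')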

plot3 <- ggplot(topFeatureDf32,aes(Words,Frequency))
plot3+labs(x="3-Gram" , y="Freq", title="Top 20 Trigram Word Frequency")+geom_bar(stat='identity',color="yellow",fill="orange")+
  theme(axis.text.x = element_text(angle = 65, hjust = 1))

Conclusion

The exploratory analysis and the cumulative-sum table for each text showed that the top 7,000 words account for roughly 90% of all word occurrences.
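
The cumulative-sum tables referenced here (df2, df4 and df6 in the summary section) are not built in the code shown; a minimal sketch of how such a coverage table could be computed for one corpus, assuming the word vector blogswordsVector:

# word frequencies sorted from most to least common
wordFreq <- sort(table(blogswordsVector), decreasing = TRUE)

# cumulative count and percentage of all word occurrences covered by the top-ranked words
df2 <- data.frame(Frequency = as.integer(wordFreq),
                  cumsum = cumsum(as.integer(wordFreq)))
df2$Perct <- 100 * df2$cumsum / sum(df2$Frequency)

# number of top-ranked words needed to cover 90% of all word occurrences
which(df2$Perct >= 90)[1]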