Mobile devices have become ubiquitous and integral to our lives: we communicate through email, social media, text messages, messaging apps, and more. Most of this activity involves typing, and typing long texts on a mobile device is not easy. The good news is that companies such as SwiftKey are working to make mobile typing easier by building sophisticated text-prediction applications with state-of-the-art techniques from natural language processing. This capstone project tests the skills acquired throughout the specialization by analyzing text documents supplied by SwiftKey. The ultimate goal is to come up with an algorithm that can predict, with a high level of accuracy, the next word the user is likely to type.
Since R's memory limitations are well known, we plan ahead to avoid problems later on and sample the data in a way that still represents the full corpus.
setwd("/home/kanudutta/Desktop/capstone")
if (!file.exists("Coursera-SwiftKey.zip"))
{
url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile="./Coursera-SwiftKey.zip")
dateDownloaded <- date()
unzip("Coursera-SwiftKey.zip", overwrite = TRUE)
}
Check for the required packages and install any that are missing.
packages <- c("ggplot2", "tm","RCurl","quanteda","dplyr")
if (length(setdiff(packages, rownames(installed.packages()))) > 0)
{
install.packages(setdiff(packages, rownames(installed.packages())),dependencies = TRUE)
}
# Loading packages
sapply(packages,function(x) library(x,character.only=TRUE))
blogfileName <- 'en_US.blogs.txt'
newsfileName <- 'en_US.news.txt'
twitfileName <-'en_US.twitter.txt'
profName <- 'bad-words.txt'
# establish file connections
blogConn <- file(blogfileName,open="r")
# 'rb' mode is used for the news file because otherwise the entire text is not read
newsConn <- file(newsfileName,open="rb")
twitConn <- file(twitfileName,open="r")
profnConn <- file(profName,open="r")
# read each file completely, line by line
blogsData <- readLines(blogConn,encoding = "UTF-8")
newsData <- readLines(newsConn,encoding = "UTF-8" )
twitData <- readLines(twitConn,encoding = "UTF-8" )
profnData <- readLines(profnConn,encoding = "UTF-8")
# clean the text: lowercase, collapse punctuation/whitespace, then strip digits
# (the data read above is reused; re-reading the open connections would return
# nothing, since they have already been read to the end)
blogsData1 <- gsub("[[:punct:][:blank:]]+", " ", tolower(blogsData))
newsData1 <- gsub("[[:punct:][:blank:]]+", " ", tolower(newsData))
twitData1 <- gsub("[[:punct:][:blank:]]+", " ", tolower(twitData))
blogsData2 <- gsub('[[:digit:]]+', '', blogsData1)
newsData2 <- gsub('[[:digit:]]+', '', newsData1)
twitData2 <- gsub('[[:digit:]]+', '', twitData1)
# close connections
close(blogConn)
close(newsConn)
close(twitConn)
close(profnConn)
- Basic summary of the text files: file name, file size, number of lines, and total number of words.
df2
## [1] Frequency cumsum Perct
## <0 rows> (or 0-length row.names)
df4
## [1] Frequency cumsum Perct
## <0 rows> (or 0-length row.names)
df6
## [1] Frequency cumsum Perct
## <0 rows> (or 0-length row.names)
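The file sizes (sizeF) and word vectors (blogswordsVector, newswordsVector, twitswordsVector) used in the summary below are not defined in the code shown here. A minimal sketch of how they could be produced from the data already read in, assuming sizeF holds the file sizes in megabytes and the word vectors come from a simple whitespace split of the cleaned text, is:
# assumed sketch: file sizes in MB
sizeF <- round(file.info(c(blogfileName, newsfileName, twitfileName))$size / 10^6, 2)
# assumed sketch: split the cleaned text on whitespace to get word tokens
blogswordsVector <- unlist(strsplit(blogsData2, "\\s+"))
newswordsVector <- unlist(strsplit(newsData2, "\\s+"))
twitswordsVector <- unlist(strsplit(twitData2, "\\s+"))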
## number of lines in the files
lenF <- c(length(blogsData),length(newsData),length(twitData))
## longest line in each file
charF <- c(max(nchar(blogsData)),max(nchar(newsData)),max(nchar(twitData)))
## Total number of unique words
blogswordsvectorUniq <- unique(blogswordsVector)
newswordsvectorUniq<- unique(newswordsVector)
twitswordsvectorUniq<- unique(twitswordsVector)
## Summary files
summary.df <- rbind(sizeF,lenF,charF,cbind(length(blogswordsVector),length(newswordsVector),
length(twitswordsVector)),cbind(length(blogswordsvectorUniq),
length(newswordsvectorUniq),length(twitswordsvectorUniq)))
rownames(summary.df) <- c("File Size","Total Line","Max Line Length","Word Count","Unique Word Count")
colnames(summary.df) <- c(blogfileName,newsfileName,twitfileName)
summary.df
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## File Size 210.16 205.81 167.11
## Total Line 899288.00 1010242.00 2360148.00
## Max Line Length 40833.00 11384.00 11384.00
## Word Count 0.00 0.00 0.00
## Unique Word Count 0.00 0.00 0.00
Because R keeps all of its objects in memory, working with the full corpora is impractical; this limitation can be overcome by sampling.
blogsSam <- sample(blogsData,size=100000)
newsSam <- sample(newsData,size=100000)
tweetSam <- sample(twitData,size=150000)
combinedSam1 <- combine(blogsSam,newsSam,tweetSam)
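The profanity list read in earlier (profnData) is not used anywhere in the code shown here; presumably it is applied to filter the sample before the n-gram analysis. One possible approach, a sketch rather than the author's actual method, is to drop sampled lines that contain a listed word:
# hypothetical sketch: drop lines containing words from the profanity list
# (assumes the list holds plain words with no regex metacharacters)
profWords <- tolower(trimws(profnData))
profPattern <- paste0("\\b(", paste(profWords, collapse = "|"), ")\\b")
combinedSam1 <- combinedSam1[!grepl(profPattern, combinedSam1, ignore.case = TRUE)]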
# unigram
docMat1 <- dfm(combinedSam1, ngrams = 1, verbose = FALSE, concatenator = " ",
stem=FALSE,
removeNumbers=TRUE,removeSeparators=TRUE,removeTwitter=TRUE)
docFreq1 <- docfreq(docMat1)
docFreqDF1 <- as.data.frame(docFreq1)
docFreqDFSort1 <- sort(rowSums(docFreqDF1), decreasing=TRUE)
docFreqDFSort12 <- data.frame(Words=names(docFreqDFSort1), Frequency = docFreqDFSort1)
topFeatureDf1 <- as.data.frame(topfeatures(docMat1, 40),stringsAsFactors=FALSE)
topFeatureDf12 <- data.frame(Words=row.names(topFeatureDf1), Frequency = topFeatureDf1,stringsAsFactors=FALSE)
names(topFeatureDf12) <- c('Words','Frequency')
topFeatureDf13 <- tbl_df(topFeatureDf12)
topFeatureDf14 <- filter(topFeatureDf13,nchar(Words)>2)
plot1 <- ggplot(topFeatureDf14,aes(Words,Frequency))
plot1+labs(x="Words" , y="Freq", title="Top 20 Unigram Word Frequency")+geom_bar(stat='identity',color="yellow",fill="orange")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# bigram
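# topFeatureDf22 is not built in the code shown above; presumably it mirrors the
# unigram steps with ngrams = 2 (a sketch, the original code may differ)
docMat2 <- dfm(combinedSam1, ngrams = 2, verbose = FALSE, concatenator = " ",
               stem=FALSE,
               removeNumbers=TRUE,removeSeparators=TRUE,removeTwitter=TRUE)
topFeatureDf2 <- as.data.frame(topfeatures(docMat2, 40),stringsAsFactors=FALSE)
topFeatureDf22 <- data.frame(Words=row.names(topFeatureDf2), Frequency = topFeatureDf2,stringsAsFactors=FALSE)
names(topFeatureDf22) <- c('Words','Frequency')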
plot2 <- ggplot(topFeatureDf22,aes(Words,Frequency))
plot2+labs(x="2-Gram" , y="Freq", title="Top 20 Bigram Word Frequency")+geom_bar(stat='identity',color="yellow",fill="orange")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
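# trigram
# topFeatureDf32 is likewise not built above; a matching sketch with ngrams = 3
docMat3 <- dfm(combinedSam1, ngrams = 3, verbose = FALSE, concatenator = " ",
               stem=FALSE,
               removeNumbers=TRUE,removeSeparators=TRUE,removeTwitter=TRUE)
topFeatureDf3 <- as.data.frame(topfeatures(docMat3, 40),stringsAsFactors=FALSE)
topFeatureDf32 <- data.frame(Words=row.names(topFeatureDf3), Frequency = topFeatureDf3,stringsAsFactors=FALSE)
names(topFeatureDf32) <- c('Words','Frequency')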
plot3 <- ggplot(topFeatureDf32,aes(Words,Frequency))
plot3+labs(x="3-Gram" , y="Freq", title="Top 20 Trigram Word Frequency")+geom_bar(stat='identity',color="yellow",fill="orange")+
theme(axis.text.x = element_text(angle = 65, hjust = 1))
The exploratory analysis and the cumulative-sum table for each text show that the top roughly 7,000 words account for about 90% of all word occurrences.
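A coverage table of this kind (the Frequency, cumsum and Perct columns shown earlier for df2, df4 and df6) can be computed from the sorted frequency data frame; a minimal sketch using the unigram frequencies in docFreqDFSort12 is:
# cumulative coverage of the top-ranked words
coverage <- data.frame(Frequency = docFreqDFSort12$Frequency)
coverage$cumsum <- cumsum(coverage$Frequency)
coverage$Perct <- 100 * coverage$cumsum / sum(coverage$Frequency)
# smallest number of top words needed to reach 90% coverage
min(which(coverage$Perct >= 90))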