This report presents the progress made on the peer-graded assignment of the Data Science Capstone project. More precisely, it describes how the data were downloaded and read, and gives basic exploratory statistics for the datasets.
Note: The code below uses custom functions that are described in the appendix.
The code first checks whether the data have already been downloaded and downloads them if not. It then sources the main R file, which contains all the functions used to perform the data analysis.
if(!file.exists("data.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "data.zip")
  unzip(zipfile = "data.zip")
}
source("DataAnalysis.R")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Setting the file paths
filePathBlog <- "final/en_US/en_US.blogs.txt"
filePathNews <- "final/en_US/en_US.news.txt"
filePathTwitter <- "final/en_US/en_US.twitter.txt"
The function token() is a custom function that processes the text line by line, removing punctuation and malformed words (for example, words containing non-letter characters or the same letter three times in a row). It takes a character vector of lines or phrases as input and returns a character vector of words. Details of the token() function are available in the appendix.
textLineBlog <- readLines(filePathBlog)
textLineNews <- readLines(filePathNews)
textLineTwitter <- readLines(filePathTwitter)
wordsBlog <- token(textLineBlog)
wordsNews <- token(textLineNews)
wordsTwitter <- token(textLineTwitter)
blogSum <- c(length(textLineBlog), length(wordsBlog))
newsSum <- c(length(textLineNews), length(wordsNews))
TwitterSum <- c(length(textLineTwitter), length(wordsTwitter))
rowLabels <- c("Number of lines", "Number of words")
summary <- data.frame(Blog = blogSum, News = newsSum, Twitter = TwitterSum)
# Total over the three files (row sums, not row means)
summary$Total <- rowSums(summary)
row.names(summary) <- rowLabels
summary
##                     Blog    News  Twitter    Total
## Number of lines   899288   77259  2360148  3336695
## Number of words 36369316 2542354 28993657 67905327
The Twitter file contains the most lines, while the Blog file contains the most words.
The total number of words is very large. For performance reasons, only a fraction of it can be used to build the prediction model. The following code samples the lines and reprocesses the words using the token() function. The Blog, News and Twitter samples are then merged.
sampling <- 0.001
set.seed(460)
wordsBlogSampled <- token(textLineBlog,sampling)
wordsNewsSampled <- token(textLineNews,sampling)
wordsTwitterSampled <- token(textLineTwitter,sampling)
words <- append(wordsBlogSampled,wordsNewsSampled)
words <- append(words,wordsTwitterSampled)
wordsLength <- length(words)
The resulting vector words contains only 67074 words, which allows the upcoming calculations to be carried out faster.
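The line sampling described above can be sketched on toy data. This is a minimal illustration, assuming a small hypothetical character vector `toyLines` in place of the real files; the helper `sampleLines` is also hypothetical and is not part of DataAnalysis.R. It shows that resetting the same seed reproduces the same sample.

```r
# Hypothetical helper (not part of DataAnalysis.R): keep a random fraction of the lines
sampleLines <- function(lines, fraction) {
  n <- floor(length(lines) * fraction)   # number of lines to keep
  lines[sample(seq_along(lines), n)]     # draw n distinct line indices
}

toyLines <- paste("line", 1:1000)        # toy stand-in for a text file

set.seed(460)
first <- sampleLines(toyLines, 0.01)
set.seed(460)
second <- sampleLines(toyLines, 0.01)

length(first)             # 10 lines, i.e. 1% of 1000
identical(first, second)  # TRUE: the same seed gives the same sample
```

Setting the seed once before all three calls, as in the report, is enough for the whole run to be reproducible, provided the calls are always made in the same order.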
profFilt() is a function that downloads a list of profane words and removes those words from the input character vector.
words <- profFilt(words)
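The filtering idea can be illustrated with logical indexing. This is only a sketch, assuming a short hypothetical block list `blockList` and toy input `toyWords`; the real profFilt() reads its list from a downloaded CSV file.

```r
# Hypothetical block list and input, used only for illustration
blockList <- c("darn", "heck")
toyWords <- c("what", "the", "heck", "is", "this", "darn", "thing")

# %in% returns TRUE for each word found in the block list; ! keeps the others
cleanWords <- toyWords[!(toyWords %in% blockList)]
cleanWords   # "what" "the" "is" "this" "thing"
```

Negating with `!` matters here: a minus sign would coerce the logical vector to the index values 0 and -1 rather than select the unmatched words.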
The function freqExplo() counts the n-grams, orders them by frequency and plots the 20 most frequent ones.
freqExplo(words)
The function wordSeqFunc() takes as input a character vector (the words) and the “n” of the n-grams. It returns a data frame with one row per n-gram and one column per word position.
wordSeq2 <- wordSeqFunc(words,2)
freqExplo(wordSeq2,20)
wordSeq3 <- wordSeqFunc(words,3)
freqExplo(wordSeq3,20)
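For illustration, the same kind of n-gram counting can be reproduced on a toy vector with base R only. This is a sketch under stated assumptions, not the appendix implementation: `toyWords` and `buildNgrams` are hypothetical names, and the approach pastes shifted copies of the word vector together instead of looping over the words.

```r
# Hypothetical helper: paste each run of n consecutive words into one string
buildNgrams <- function(words, n = 2) {
  count <- length(words) - n + 1       # number of complete n-grams
  if (count < 1) return(character(0))
  # Shifted copies of the vector: the j-th copy starts at word j
  shifted <- lapply(seq_len(n), function(j) words[j:(j + count - 1)])
  do.call(paste, shifted)              # paste() joins the copies with a space
}

toyWords <- c("i", "like", "tea", "and", "i", "like", "coffee")
bigrams <- buildNgrams(toyWords, 2)
bigrams
# "i like" "like tea" "tea and" "and i" "i like" "like coffee"

# Frequency table sorted from most to least frequent; "i like" comes first with 2
sort(table(bigrams), decreasing = TRUE)
```

Because the shifted copies are built in one step, this version avoids growing a data frame row by row, which may help if a larger sample is used.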
The predictive algorithm will use these most frequently used n-grams to propose the next word. When the user finishes a word by tapping space, a word should be proposed quickly and automatically. As the user keeps typing, the algorithm should keep suggesting words, taking into consideration the 2-3 words before the last word. A more advanced algorithm could take the context (i.e. the lexical field) into consideration for more accurate suggestions. The specific case of a word that is not in the database should be handled correctly. Finally, the model should be tested to evaluate its prediction accuracy and its speed in order to have something usable.
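One possible starting point for such a predictor is sketched below. It assumes a simple bigram lookup that falls back to the most frequent word overall when the previous word has never been seen. The function `predictNext` and the toy data are hypothetical, and the final model may well differ.

```r
# Toy corpus standing in for the cleaned word vector
toyWords <- c("i", "like", "tea", "and", "i", "like", "coffee",
              "and", "i", "like", "tea", "i", "do")

# Bigram table: each word paired with the word that follows it
bigrams <- data.frame(prev = head(toyWords, -1),
                      nxt  = tail(toyWords, -1),
                      stringsAsFactors = FALSE)

# Hypothetical predictor: most frequent follower of lastWord, or the most
# frequent word overall if lastWord was never seen (a crude back-off)
predictNext <- function(lastWord, bigrams, words) {
  followers <- bigrams$nxt[bigrams$prev == lastWord]
  pool <- if (length(followers) > 0) followers else words
  counts <- sort(table(pool), decreasing = TRUE)
  names(counts)[1]
}

predictNext("like", bigrams, toyWords)    # "tea": follows "like" twice, "coffee" once
predictNext("zebra", bigrams, toyWords)   # "i": unseen word, most frequent word overall
```

The same lookup could be extended to trigrams by keying on the last two words and backing off to the bigram table when no match is found.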
The token function:
token <- function(textLine, sampling = 1){
  ### Drawing a random subset of the lines (sampling = 1 keeps every line, in shuffled order)
  sampleIndex <- sample(seq_along(textLine), floor(length(textLine)*sampling))
  textLine <- textLine[sampleIndex]
  text <- gsub("[ :,]", " ", textLine)    ### Replacing spaces and mid-sentence punctuation with a space
  text <- gsub("[.!?]", "", text)         ### Removing end-of-sentence punctuation
  text <- gsub("[^a-zA-Z. ]", "_", text)  ### Marking every remaining non-letter character with "_"
  text <- gsub("\\S+_", "", text)         ### Removing marked words up to their last "_"
  text <- gsub("_", "", text)             ### Removing any remaining markers
  text <- gsub(" +", " ", text)           ### Collapsing repeated spaces into a single space
  words <- unlist(strsplit(text, " "))
  words <- tolower(words)
  words <- words[words != ""]
  ### Dropping words with the same letter three times in a row (e.g. "sooo")
  words <- words[!grepl("([[:alpha:]])\\1\\1", words)]
  return(words)
}
The profFilt function:
profFilt <- function(list){
  if(!file.exists("badWords.zip")){
    download.file("https://www.freewebheaders.com/wordpress/wp-content/uploads/base-list-of-bad-words-csv-file_2018_03_26_26.zip", destfile = "badWords.zip")
    unzip(zipfile = "badWords.zip")
  }
  badWords <- read.csv("base-list-of-bad-words-csv-file_2018_03_26.csv")
  ### Keeping only the words that are not in the list (! negates the logical vector)
  wordFiltered <- list[!(list %in% badWords$arse)]
  return(wordFiltered)
}
The freqExplo function:
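freqExplo <- function(dataFrame, topNum = 20) {
  dataFrame <- as.data.frame(dataFrame)
  ### Pasting the word columns of each row into a single n-gram string (tidyr::unite)
  list <- unite(dataFrame, united, colnames(dataFrame), sep = " ")
  ### Counting the occurrences of each distinct n-gram
  df <- as.data.frame(table(list))
  ### Keeping the topNum most frequent n-grams (dplyr::arrange)
  top <- head(arrange(df, desc(Freq)), topNum)
  barplot(top$Freq, main = "Frequencies of the top n-grams", names.arg = top$list, las = 2)
}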
The wordSeqFunc function:
wordSeqFunc <- function(list, seqLength = 2){
  ### Empty data frame with columns word1, word2, ..., one per position in the n-gram
  wordseq <- data.frame(matrix(ncol = seqLength, nrow = 0))
  colnames(wordseq) <- paste("word", seq_len(seqLength), sep = "")
  ### Stopping at the last word that still starts a complete n-gram,
  ### so that no n-gram runs past the end of the vector
  loopLength <- max(length(list) - seqLength + 1, 0)
  for(i in seq_len(loopLength)){ ### Looping over each starting word
    newLine <- data.frame(word1 = as.character(list[i]))
    for(j in seq_len(seqLength)[-1]){ ### Adding the following words of the sequence
      newNewLine <- data.frame(as.character(list[i+j-1]))
      colnames(newNewLine) <- paste("word", j, sep = "")
      newLine <- cbind(newLine, newNewLine)
    }
    wordseq <- rbind(wordseq, newLine)
  }
  return(wordseq)
}