Synopsis

The purpose of this report is to present the progress made on the peer-graded assignment of the Data Science Capstone project. More precisely, it gives an overview of how the data were downloaded and read, together with basic exploratory statistics of the datasets.

Note: The code below uses custom functions that are described in the appendix.

Analysis

Initialization

First, check whether the data have already been downloaded; if not, download and unzip them. Then source the main R file containing all the functions used to perform the data analysis.

if(!file.exists("data.zip")){
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",destfile = "data.zip")
    unzip(zipfile = "data.zip")
}
source("DataAnalysis.R")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Setting the file paths

filePathBlog <- "final/en_US/en_US.blogs.txt"
filePathNews <- "final/en_US/en_US.news.txt"
filePathTwitter <- "final/en_US/en_US.twitter.txt"
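
As a quick sanity check, the size of each file can be inspected before reading it. This is not part of the original analysis; it is only a small base R sketch.

sizesMB <- file.size(c(filePathBlog, filePathNews, filePathTwitter)) / 1024^2 ### File sizes in megabytes
round(sizesMB, 1)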

Reading the data and extracting words

The function token() is a custom function that processes the text line by line, removing punctuation and likely misspelled words. It takes a character vector (lines or phrases) as input and returns a character vector (words). Details of the token() function are available in the appendix.

textLineBlog <- readLines(filePathBlog)
textLineNews <- readLines(filePathNews)
textLineTwitter <- readLines(filePathTwitter)

wordsBlog <- token(textLineBlog)
wordsNews <- token(textLineNews)
wordsTwitter <- token(textLineTwitter)
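
As an illustration of what token() does, a hypothetical toy line (not taken from the corpus) can be passed through it:

token(c("Hello, world! This is a SHORT example line."))
### expected result: "hello" "world" "this" "is" "a" "short" "example" "line"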

Exploratory analysis

blogSum <- c(length(textLineBlog),length(wordsBlog))
newsSum <- c(length(textLineNews),length(wordsNews))
TwitterSum <- c(length(textLineTwitter),length(wordsTwitter))
rownames <- c("Number of lines", "Number of words")
summary <- data.frame(Blog = blogSum, News = newsSum, Twitter = TwitterSum)
summary$Total <- rowSums(summary)
row.names(summary) <- rownames
summary
##                     Blog    News  Twitter    Total
## Number of lines   899288   77259  2360148  3336695
## Number of words 36369316 2542354 28993657 67905327

Blog is the file containing the most words, while Twitter contains the most lines.

Sampling

The total number of words is very high. For performance reasons, only a fraction of it can be used for prediction. The following code samples the lines and reprocesses the words using the token() function. The Blog, News and Twitter samples are then merged.

sampling <- 0.001
set.seed(460)

wordsBlogSampled <- token(textLineBlog,sampling)
wordsNewsSampled <- token(textLineNews,sampling)
wordsTwitterSampled <- token(textLineTwitter,sampling)

words <- append(wordsBlogSampled,wordsNewsSampled)
words <- append(words,wordsTwitterSampled)
wordsLength <- length(words)

The vector “words” now contains only 67074 words, allowing the upcoming calculations to be carried out faster.
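
The effective fraction of the corpus kept in the sample can be checked against the full word counts computed earlier (a small sketch, not part of the original code):

wordsLength / (length(wordsBlog) + length(wordsNews) + length(wordsTwitter)) ### Should be close to the 0.001 sampling rate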

Profanity filtering

profFilt() is a function that downloads a database of profanity and removes those bad words from the input character vector.

words <- profFilt(words)
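
Since wordsLength still holds the pre-filtering count, the number of words removed by the filter can be checked as follows (the exact count depends on the downloaded word list):

wordsLength - length(words) ### Number of profane words removed from the sample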

Understanding frequencies of words, 2-grams and 3-grams

The 20 most used words

The function freqExplo() orders the N-grams by frequency and plots the top 20 most used N-grams.

freqExplo(words)
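
For single words, the same top 20 can also be obtained directly with base R, which provides a quick cross-check of freqExplo() (a sketch, not part of the original code):

head(sort(table(words), decreasing = TRUE), 20) ### Top 20 most frequent words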

The 20 most used 2-grams

The function wordSeqFunc() takes as input a character vector (the words) and the “n” of the N-grams. It returns a data frame in which each row is one N-gram.

wordSeq2 <- wordSeqFunc(words,2)
freqExplo(wordSeq2,20)

The 20 most used 3-grams

wordSeq3 <- wordSeqFunc(words,3)
freqExplo(wordSeq3,20)
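
As a side note on performance, building the N-gram data frame row by row with rbind() is slow on larger samples. A vectorized construction of 2-grams could look like the sketch below; this is only an alternative illustration, not the report's wordSeqFunc().

bigrams <- paste(head(words, -1), tail(words, -1)) ### Pairing each word with its successor
head(sort(table(bigrams), decreasing = TRUE), 20) ### Top 20 most frequent 2-grams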

Plans for creating a prediction algorithm and Shiny app

The predictive algorithm will use these most frequent n-grams to propose the next word. When the user finishes a word by typing a space, a word should be quickly and automatically proposed. As the user keeps typing, the algorithm should keep suggesting words, taking into consideration the two or three words preceding the last one. A more advanced algorithm could also take the context (i.e. the lexical field) into account for more accurate suggestions. The specific case of a word that is not in the database should be handled correctly. Finally, the model should be tested to evaluate its accuracy and speed in order to obtain something usable.
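
A minimal version of such a predictor could look up the most frequent 3-gram starting with the user's last two words. The sketch below assumes the wordSeq3 data frame built above (columns word1, word2, word3) and only illustrates the intended idea; the unseen-context case would be handled by backing off to 2-grams or single-word frequencies.

predictNext <- function(w1, w2, ngrams = wordSeq3){
    matches <- ngrams[ngrams$word1 == w1 & ngrams$word2 == w2, ] ### 3-grams whose first two words match the input
    if(nrow(matches) == 0) return(NA) ### Unseen context: a back-off strategy would go here
    names(which.max(table(matches$word3))) ### Most frequent third word among the matches
}
predictNext("one", "of") ### Hypothetical call; might return "the"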

Appendix

The token function:

token <- function(textLine, sampling = 1){
    sampleIndex <- sample(1:length(textLine), length(textLine)*sampling)
    textLine <- textLine[sampleIndex]
    
    text <- gsub("[ :,]", " ",textLine) ### Removing space and middle of the phrase ponctuation
    text <- gsub("[.!?]", "",text) ### Removing end of the line ponctuations
    text <- gsub("[^a-zA-Z. ]", "_",text) ### Keeping only words that have only letters
    text <- gsub("\\S+_", "",text) ### Removing strong words
    text <- gsub("_", "",text) ### Removing strong words
    text <- gsub("  ", " ",text) ### Removing double space
    text <- gsub("   ", " ",text) ### Removing triple space
    text <- gsub("   ", " ",text) ### Removing qudruple space
    
    words <-  unlist(strsplit(text," "))
    
    words <-  tolower(words)
    words <- words[words != " "]
    words <- words[words != ""]
    
    badWordsIndex <- grep("([[:alpha:]])\\1\\1",words) ### Words with a letter repeated three or more times (likely misspelled)
    if(length(badWordsIndex) > 0){
        words <- words[-badWordsIndex]
    }
    
    return(words)
}

The profFilt function:

profFilt <- function(words){
    
    if(!file.exists("badWords.zip")){
        download.file("https://www.freewebheaders.com/wordpress/wp-content/uploads/base-list-of-bad-words-csv-file_2018_03_26_26.zip",destfile = "badWords.zip")
        unzip(zipfile = "badWords.zip")
    }
    
    badWords <- read.csv("base-list-of-bad-words-csv-file_2018_03_26.csv", header = FALSE)
    
    wordsFiltered <- words[!(words %in% badWords[[1]])] ### Keeping only the words that are not in the profanity list
    
    return(wordsFiltered)
}

The freqExplo function

freqExplo <-  function(dataFrame,topNum = 20) {
    ### Requires the dplyr and tidyr packages
    dataFrame <- as.data.frame(dataFrame)
    ngrams <- unite(dataFrame, united, colnames(dataFrame),sep = " ") ### Combining the columns into a single n-gram string
    df <- as.data.frame(table(ngrams))
    top <- head(arrange(df, desc(Freq)), topNum) ### Keeping the topNum most frequent n-grams
    barplot(top$Freq, main = "Frequencies of top n-grams", names = top[[1]],las=2)
}

The wordSeqFunc function

wordSeqFunc <- function(list, seqLength = 2){
    wordseq <- data.frame(matrix(ncol = seqLength, nrow = 0))
    
    colnames <- "word1"
    for(i in 2:seqLength){
        newColnames <- paste("word",i,sep = "")
        colnames <- append(colnames,newColnames)
    }
    colnames(wordseq) <- colnames
        
    loopLength <- length(list)-seqLength+1 ### Stopping early enough so that every N-gram is complete
    for(i in 1:loopLength){ ### Looping over each starting word
        newLine <- data.frame(word1 = as.character(list[i]))
        for(j in 2:seqLength){ ### Looping seq of word
            newNewLine <- data.frame(as.character(list[i+j-1]))
            newNewlineColName <- paste("word",j,sep = "")
            colnames(newNewLine) <- newNewlineColName
            newLine <- cbind(newLine,newNewLine)
        }

        wordseq <- rbind(wordseq,newLine)
    }
    return(wordseq)
}