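Load libraries

# Packages inferred from the functions used in the rest of this report
library(dplyr)      # %>%, filter(), count(), tibble()
library(tidyr)      # separate(), unite()
library(tidytext)   # unnest_tokens(), stop_words
library(ggplot2)    # ggplot(), geom_bar()
library(qdapRegex)  # rm_white()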

Reading blogs, twitter, and news files

blogs_file<-"Coursera-SwiftKey/final/de_DE/de_DE.blogs.txt"
twitter_file<-"Coursera-SwiftKey/final/de_DE/de_DE.twitter.txt"
news_file<-"Coursera-SwiftKey/final/de_DE/de_DE.news.txt"

blogs <- readLines(blogs_file, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_file, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(news_file, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)

Summarizing the number of lines and words

## blogs number of lines
length(blogs)

## [1] 371440

## twitter number of lines
length(twitter)

## [1] 947774

## news number of lines
length(news)

## [1] 244743
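
Word counts can be obtained in a similar way. The following is a minimal base-R sketch, assuming that whitespace-separated tokens are an acceptable approximation of words; `count_words` and the `toy` vector are hypothetical names used only for illustration.

```r
# Count whitespace-separated tokens across a character vector.
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")  # one character vector per line
  sum(lengths(tokens))                       # empty lines contribute 0
}

# Toy input standing in for blogs, twitter, or news
toy <- c("Guten Morgen zusammen", "  Hallo   Welt ", "")
count_words(toy)
## [1] 5
```

The same call could be applied to each data set, for example `count_words(blogs)`.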

Collection of a data sample: 1% of each data set for scoping analysis

set.seed(12345)
twitter_sample  <- sample(twitter, length(twitter) * 0.01, replace = FALSE)
blogs_sample    <- sample(blogs, length(blogs) * 0.01, replace = FALSE)
news_sample     <- sample(news, length(news) * 0.01, replace = FALSE)
data_sample <- c(twitter_sample, blogs_sample, news_sample)
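
The sampling step can be wrapped in a small helper so that the sample size and the seed are explicit. This is only a sketch: `sample_fraction` is a hypothetical helper, and unlike the code above it resets the seed on every call, which is a design choice rather than a requirement.

```r
# Draw a reproducible random fraction of a character vector.
sample_fraction <- function(x, fraction, seed = 12345) {
  set.seed(seed)                                  # same seed, same sample
  size <- floor(length(x) * fraction)             # whole number of lines
  sample(x, size = size, replace = FALSE)
}

toy <- paste("line", 1:1000)                      # stand-in for a data set
s1 <- sample_fraction(toy, 0.01)
s2 <- sample_fraction(toy, 0.01)
length(s1)          # 10 lines, i.e. 1% of 1000
identical(s1, s2)   # TRUE, because the seed is reset on each call
```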

Sample and Clean the Data

# 1. Remove lines containing characters that cannot be converted to ASCII.
#    Note: this also drops every line containing an umlaut or ß.
NotKnown <- grepl("NotKnown", iconv(data_sample, "latin1", "ASCII", sub = "NotKnown"))
data_sample <- data_sample[!NotKnown]  # logical index is safe even when nothing matches
# 2. Simple cleaning
data_sample <- gsub("&amp;?", "", data_sample) # remove HTML ampersand entities
data_sample <- gsub("RT :|@[a-z,A-Z]*: ", "", data_sample) # remove retweet markers and "@user: " prefixes
data_sample <- gsub("@\\w+", "", data_sample) # remove remaining @mentions
data_sample <- gsub("[[:digit:]]", "", data_sample) # remove digits
data_sample <- gsub(" #\\S*", "", data_sample) # remove hashtags
data_sample <- gsub(" ?(f|ht)tps?://\\S+", "", data_sample) # remove URLs up to the next space
data_sample <- rm_white(data_sample) # remove extra spaces
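
Because the corpus is German, the ASCII check in step 1 removes many otherwise valid lines. One possible alternative, shown here only as a sketch and not as the method used in this report, is to drop only lines that are not valid UTF-8 by using base R's `validUTF8()`. The `toy` vector is made up for illustration.

```r
# Toy lines: valid German text, plain ASCII, and one line with an invalid byte
toy <- c("Sch\u00f6ne Gr\u00fc\u00dfe", "plain ascii", "bad\xffbyte")

# The latin1 -> ASCII conversion flags every line with a non-ASCII byte,
# so the valid German line is flagged too.
flagged <- grepl("NotKnown", iconv(toy, "latin1", "ASCII", sub = "NotKnown"))

# validUTF8() flags only byte sequences that are not valid UTF-8.
valid <- validUTF8(toy)
kept  <- toy[valid]   # keeps the German line and the ASCII line
```

With this variant the umlauts survive into the n-gram counts, at the cost of keeping other non-ASCII characters such as emoji, which may need their own cleaning rule.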

Creating dataframe from data sample for n-gram analysis

# data_frame() is deprecated; tibble() is its replacement
data_sample_df <- tibble(line = seq_along(data_sample), text = data_sample)

Creating And Displaying Unigram

UnigramFreq <- data_sample_df %>%
    unnest_tokens(unigram, text, token = "words") %>%   # one row per word
    filter(!unigram %in% stop_words$word) %>%           # note: stop_words is an English list
    count(unigram, sort = TRUE)

ggplot(head(UnigramFreq,15), aes(reorder(unigram,n), n)) +   
    geom_bar(stat="identity") + coord_flip() + 
    xlab("Unigrams") + ylab("Frequency") +
    ggtitle("Most frequent unigrams")
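
The `stop_words` data set shipped with tidytext contains English stop words, so for a German corpus a German list would be needed to remove function words. The sketch below shows the filtering idea in base R with a small hand-written list; `german_stop` and `tokens` are illustrative values, not a complete lexicon. In practice, `tidytext::get_stopwords(language = "de")` is one source of a fuller list, assuming the stopwords package is installed.

```r
# Illustrative (incomplete) German stop word list
german_stop <- c("der", "die", "das", "und", "ist", "ich", "nicht")

# Toy tokens standing in for the output of unnest_tokens()
tokens <- c("ich", "liebe", "das", "wetter", "und", "die", "sonne")

# Keep only tokens that are not in the stop word list
content_tokens <- tokens[!tokens %in% german_stop]
content_tokens   # "liebe" "wetter" "sonne"
```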

Creating And Displaying Bigram

BigramFreq <- data_sample_df %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ", 
             extra = "drop", fill = "right") %>%
    filter(!word1 %in% stop_words$word,
           !word2 %in% stop_words$word) %>%
    unite(bigram, word1, word2, sep = " ") %>%
    count(bigram, sort = TRUE)

ggplot(head(BigramFreq,15), aes(reorder(bigram,n), n)) +   
    geom_bar(stat="identity") + coord_flip() + 
    xlab("Bigrams") + ylab("Frequency") +
    ggtitle("Most frequent bigrams")
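
To make the role of the `n` argument concrete, here is a base-R sketch of sliding-window n-gram extraction for a single line of text. It is a simplified stand-in for what a tokenizer does, not the tidytext implementation; `make_ngrams` is a hypothetical helper and it splits on whitespace only.

```r
# Return all n-grams (as space-joined strings) from one line of text.
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  if (length(words) < n) return(character(0))   # line too short for an n-gram
  starts <- seq_len(length(words) - n + 1)      # one window per start position
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- make_ngrams("das ist ein kleiner test", 2)
# "das ist", "ist ein", "ein kleiner", "kleiner test"
trigrams <- make_ngrams("das ist ein kleiner test", 3)
# "das ist ein", "ist ein kleiner", "ein kleiner test"
```

A line with five words yields four bigrams but only three trigrams, which is why the value of `n` has to match the n-gram size being counted.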

Creating And Displaying Trigram

TrigramFreq <- data_sample_df %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
    separate(trigram, c("word1", "word2", "word3"), sep = " ", 
             extra = "drop", fill = "right") %>%
    filter(!word1 %in% stop_words$word,
           !word2 %in% stop_words$word,
           !word3 %in% stop_words$word) %>%
    unite(trigram, word1, word2, word3, sep = " ") %>%
    count(trigram, sort = TRUE)

ggplot(head(TrigramFreq,15), aes(reorder(trigram,n), n)) +   
    geom_bar(stat="identity", fill = "#C4961A") + coord_flip() + 
    xlab("Trigrams") + ylab("Frequency") +
    ggtitle("Most frequent trigrams")

Tell us about any interesting discoveries you’ve made so far.

For this project, in addition to the packages used above, I studied tm, quanteda, and n-gram. Of these, n-gram was the slowest. quanteda was much faster than tm, except when I used keywords in the dfm, in which case quanteda was slower than tm. I therefore combined the two: I removed stop words with tm and then built the dfm from the tm corpus. While I was trying to work out how to improve or correct quanteda's keyword handling, I found another way to do this project, which was the best and most accurate for me.

Feedback on your plans for creating a prediction algorithm and a Shiny application.

In the Shiny application, I am going to write text into an input field and, based on that input, predict the next word.
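
One simple way such a prediction could work is a bigram lookup: take the last word of the input and return the word that most often follows it. The sketch below is only an illustration under stated assumptions. The `bigram_counts` table, its `word1`/`word2`/`n` layout, and the `predict_next` function are hypothetical; a real table could be derived from the bigram counts before the two words are united into one column.

```r
# Hypothetical bigram counts (made-up numbers for illustration)
bigram_counts <- data.frame(
  word1 = c("guten", "guten", "vielen", "guten"),
  word2 = c("morgen", "tag", "dank", "abend"),
  n     = c(12, 7, 20, 3),
  stringsAsFactors = FALSE
)

# Predict the most frequent follower of the last word in `input`.
predict_next <- function(input, counts) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  if (length(words) == 0) return(NA_character_)        # empty input
  last <- words[length(words)]
  candidates <- counts[counts$word1 == last, ]
  if (nrow(candidates) == 0) return(NA_character_)     # unseen word
  candidates$word2[which.max(candidates$n)]
}

predict_next("Ich sage guten", bigram_counts)   # "morgen"
```

Returning `NA` for unseen words keeps the sketch short; a fuller model would typically back off to unigram frequencies in that case.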

Assignment Milestone Report

Abdelbasset Boukdir

June 09, 2020

Introductions