Exploratory Analysis Milestone Report

Introduction

This report summarises the data used to build the prediction model for the final capstone project. The raw data is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The initial analysis involved the following steps:
1. Downloading the raw data.
2. Sampling and tidying the data (including removing bad and degrading words).
3. Performing an exploratory analysis of the tidy data, examining the relationships between words and how the words are distributed.
4. Constructing tables, figures and graphs to show the frequencies of single words, bigrams and trigrams in the data.
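
Steps 1 and 2 could be sketched roughly as follows. This is an illustration in base R under stated assumptions, not the code used for this report: the helper names `fetch_corpus`, `sample_lines` and `remove_flagged`, the `data` directory, the 5% sampling fraction and the seed are all hypothetical choices, and the list of words to remove has to be supplied by the caller.

```r
# URL taken from the introduction of this report.
corpus_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Download and unpack the archive once (hypothetical helper, defined but not called here).
fetch_corpus <- function(url = corpus_url, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE)
  zip_path <- file.path(dest_dir, basename(url))
  if (!file.exists(zip_path)) {
    download.file(url, zip_path, mode = "wb")
  }
  unzip(zip_path, exdir = dest_dir)  # returns the paths of the extracted files
}

# Draw a reproducible random sample of lines, keeping their original order.
# The fraction and seed are assumed values. Note that set.seed() resets the
# global random number generator.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  set.seed(seed)
  n_keep <- max(1, floor(length(lines) * fraction))
  lines[sort(sample.int(length(lines), n_keep))]
}

# Remove every occurrence of the flagged words, then collapse leftover spaces.
remove_flagged <- function(lines, flagged_words) {
  pattern <- paste0("\\b(", paste(flagged_words, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", lines, ignore.case = TRUE, perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned))
}
```

Sampling before cleaning keeps the later tokenization steps fast; a fixed seed makes the sample repeatable across runs.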

Analysis of the Raw Data

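Statistics of the different files (Twitter, blogs and news). The first three columns are totals per file; the last three describe the number of words per entry.

| File    | totalLines | totalWords | totalChars | averageWords | minWords | maxWords |
|---------|-----------:|-----------:|-----------:|-------------:|---------:|---------:|
| Blog    |     899288 |   37546239 |  206824382 |        41.75 |        0 |     6726 |
| News    |      77259 |    2674536 |   15639408 |        34.62 |        1 |     1123 |
| Twitter |    2360148 |   30093413 |  162096241 |        12.75 |        1 |       47 |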
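
A summary of this kind could be computed along the following lines. This is only a sketch: the report does not show how its counts were produced, so the helper name `summarise_lines` and the rule "a word is a run of non-whitespace characters" are assumptions, and a different word definition would give somewhat different totals.

```r
# Summarise a character vector holding one text entry per element.
summarise_lines <- function(lines, label) {
  # Split on runs of whitespace; an empty entry yields zero words.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    file         = label,
    totalLines   = length(lines),
    totalWords   = sum(words_per_line),
    totalChars   = sum(nchar(lines)),
    averageWords = round(mean(words_per_line), 2),
    minWords     = min(words_per_line),
    maxWords     = max(words_per_line)
  )
}

# Toy input standing in for a real file read with readLines().
demo <- summarise_lines(c("one two three", "", "four five"), "Demo")
print(demo)
```

Calling the helper once per file and combining the results with `rbind()` would give one row per source, as in the table above.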

Next, we create a tokenized version of each data set. Extremely common words (of, in, the, etc.) are removed.

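library(dplyr)     # %>% and anti_join()
library(tidytext)  # unnest_tokens() and get_stopwords()

## Each clean_*_df is a cleaned data frame with a `text` column.
## unnest_tokens() gives one word per row; anti_join() drops the stop words.

## Creating a tokenized version of twitter
tokenized_twitter <- clean_twitter_df %>%
        unnest_tokens(output = word, input = text) %>%
        anti_join(get_stopwords(), by = "word")

## Creating a tokenized version of blogs
tokenized_blogs <- clean_blogs_df %>%
        unnest_tokens(output = word, input = text) %>%
        anti_join(get_stopwords(), by = "word")

## Creating a tokenized version of news
tokenized_news <- clean_news_df %>%
        unnest_tokens(output = word, input = text) %>%
        anti_join(get_stopwords(), by = "word")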

Visualization of the most common words in each file.
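
The counts behind such a plot could be obtained as sketched here. The helper `top_words` and the toy word vector are hypothetical, and base R is used only to keep the example self-contained; with the tokenized data frames above, `count(word, sort = TRUE)` from dplyr would serve the same purpose.

```r
# Return the n most frequent words with their counts, most frequent first.
top_words <- function(words, n = 10) {
  counts <- sort(table(words), decreasing = TRUE)
  result <- data.frame(
    word = names(counts),
    freq = as.integer(counts),
    stringsAsFactors = FALSE
  )
  head(result, n)
}

# Toy input standing in for the `word` column of a tokenized data set.
demo_words <- c("time", "day", "time", "love", "day", "time")
print(top_words(demo_words, n = 2))
```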

### Creating and Analysing N-Grams

An n-gram “is a contiguous sequence of n items from a given sample of text or speech”. Unigrams, bigrams and trigrams are the cases n = 1, 2 and 3 respectively; for example, “come in” is a bigram. The following code extracts the top bigrams and trigrams in each file so that they can be visualised; the most common unigrams were visualised earlier. Bad and foul words were removed during cleaning, and n-grams that contain very common words (for, is, was, etc.) are also filtered out.
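
To make the definition concrete, the following base R sketch lists the n-grams of a single sentence. The helper `make_ngrams` is hypothetical and is not part of tidytext; the report's own code relies on `unnest_tokens()` for this step.

```r
# List all n-grams (as space-separated strings) of one sentence.
make_ngrams <- function(sentence, n) {
  # Lower-case the text and split on anything that is not a letter or apostrophe.
  tokens <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

print(make_ngrams("Please come in and sit down", 2))
# "please come" "come in" "in and" "and sit" "sit down"
```

A sentence of k words yields k - n + 1 n-grams, so the six-word example gives five bigrams and four trigrams.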

# Creating bigrams
# Returns the 10 most frequent bigrams in the `text` column of `df`
# that contain no stop words, labelled with the source file type.
make_bigrams <- function(df, filetype) {
        df %>%
                unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
                tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
                na.omit() %>%
                filter(!word1 %in% stop_words$word,
                       !word2 %in% stop_words$word) %>%
                count(word1, word2, sort = TRUE) %>%
                # keep exactly 10 rows, even if several bigrams tie for 10th place
                slice_max(n, n = 10, with_ties = FALSE) %>%
                mutate(bigram = paste(word1, word2, sep = " "),
                       file = filetype)
}
bigram_of_blogs <- make_bigrams(clean_blogs_df, "Blogs")
bigram_of_news <- make_bigrams(clean_news_df, "News")
bigram_of_twitter <- make_bigrams(clean_twitter_df, "Twitter")
# Combine the bigrams into one data frame for easy visualization.
total_bigram <- bind_rows(bigram_of_blogs, bigram_of_news, bigram_of_twitter)

# Creating trigrams
# Returns the 10 most frequent trigrams in the `text` column of `df`
# that contain no stop words, labelled with the source file type.
make_trigrams <- function(df, filetype) {
        df %>%
                unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
                tidyr::separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
                na.omit() %>%
                filter(!word1 %in% stop_words$word,
                       !word2 %in% stop_words$word,
                       !word3 %in% stop_words$word) %>%
                count(word1, word2, word3, sort = TRUE) %>%
                # keep exactly 10 rows, even if several trigrams tie for 10th place
                slice_max(n, n = 10, with_ties = FALSE) %>%
                mutate(trigram = paste(word1, word2, word3, sep = " "),
                       file = filetype)
}
trigram_of_blogs <- make_trigrams(clean_blogs_df, "Blogs")
trigram_of_news <- make_trigrams(clean_news_df, "News")
trigram_of_twitter <- make_trigrams(clean_twitter_df, "Twitter")
# Combine the trigrams into one data frame for easy visualization.
total_trigram <- bind_rows(trigram_of_blogs, trigram_of_news, trigram_of_twitter)

Visualize the Bigrams & Trigrams

Conclusion and Next Steps

Through this exploratory analysis, we gathered key information about the raw data set. We can now build the predictive model for the final capstone project.

### Our main steps will be

- Clean and tokenize the SwiftKey data set into n-grams of 1 to 4 words (with bad words already removed).
- Save each n-gram data set separately.
- Create a prediction model that analyses an input string according to its length and returns the most probable next word.
- Create a Shiny app that uses this algorithm to predict words from user input.
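
One possible shape for such a model is a simple back-off lookup, sketched here in base R. This is an assumption about how the plan might be realised, not the model that will actually be built: the helpers `tokenise`, `count_ngrams` and `predict_next` and the toy corpus are hypothetical, the tables stop at trigrams for brevity, and n-grams are allowed to run across sentence boundaries.

```r
# Split text into lower-case word tokens.
tokenise <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[nzchar(tokens)]
}

# Count (prefix, next word) pairs for n-grams of order n.
# For n = 1 the prefix is the empty string, giving plain word counts.
count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) {
    return(data.frame(prefix = character(0), nxt = character(0), freq = integer(0)))
  }
  starts <- seq_len(length(tokens) - n + 1)
  prefix <- vapply(
    starts,
    function(i) paste(tokens[seq_len(n - 1) + i - 1], collapse = " "),
    character(1)
  )
  pairs <- data.frame(prefix = prefix, nxt = tokens[starts + n - 1],
                      freq = 1L, stringsAsFactors = FALSE)
  aggregate(freq ~ prefix + nxt, data = pairs, FUN = sum)
}

# Use the longest context the input allows, then back off to shorter ones.
predict_next <- function(text, tables) {
  tokens <- tokenise(text)
  for (n in seq(min(length(tables), length(tokens) + 1), 1)) {
    context <- paste(tail(tokens, n - 1), collapse = " ")
    hits <- tables[[n]][tables[[n]]$prefix == context, ]
    if (nrow(hits) > 0) {
      return(as.character(hits$nxt[which.max(hits$freq)]))
    }
  }
  NA_character_
}

# Toy corpus standing in for the sampled SwiftKey text.
corpus <- "the cat sat on the mat. the cat ate the fish. the dog sat on the rug"
tables <- lapply(1:3, function(n) count_ngrams(tokenise(corpus), n))
print(predict_next("she sat on", tables))  # trigram context "sat on"
print(predict_next("zebra dog", tables))   # backs off to bigram context "dog"
```

The number of words in the input decides which table is tried first, which is one reading of "analyses a string by its length"; an unseen context falls back to the next shorter one and finally to the most frequent single word.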