Data Science Capstone Milestone Report

Exploratory Analysis Milestone Report

Introduction

This report summarises the data used to build the prediction model for the final capstone project. The raw data is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The initial analysis consisted of the following steps:
1. Downloading the raw data
2. Sampling and tidying the data (including removing profane and offensive words)
3. Performing an exploratory analysis of the tidy data, in which the relationships between words and their distributions were calculated
4. Constructing tables, figures and graphs to depict the frequencies of single words, bigrams and trigrams in the data
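
Steps 1 and 2 can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code used for this report: the helper names `download_corpus`, `sample_lines` and `remove_flagged_words`, the 10% default sampling fraction, the seed and the toy word list are all hypothetical.

```r
# Hypothetical helper: download and unpack the archive (not called below,
# because it needs network access and the archive is large).
download_corpus <- function(url, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE)
  zip_path <- file.path(dest_dir, "Coursera-SwiftKey.zip")
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = dest_dir)
}

# Hypothetical helper: keep a random fraction of the lines of a character vector.
# Note that set.seed() also resets the global random number stream.
sample_lines <- function(lines, fraction = 0.1, seed = 1234) {
  stopifnot(fraction > 0, fraction <= 1)
  if (length(lines) == 0) return(lines)
  set.seed(seed)
  n_keep <- max(1, floor(length(lines) * fraction))
  # Sorting the sampled indices keeps the lines in their original order.
  lines[sort(sample.int(length(lines), n_keep))]
}

# Hypothetical helper: delete whole-word matches of a block list, ignoring case.
# Assumes the flagged words contain no regular-expression metacharacters.
remove_flagged_words <- function(lines, flagged_words) {
  if (length(flagged_words) == 0) return(lines)
  pattern <- paste0("\\b(?:", paste(flagged_words, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", lines, ignore.case = TRUE, perl = TRUE)
  # Collapse the whitespace left behind by the deleted words.
  trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
}

toy_lines <- c("what a lovely day", "this is darn annoying", "see you soon", "Darn it")
remove_flagged_words(toy_lines, flagged_words = c("darn"))
sample_lines(toy_lines, fraction = 0.5)
```

Sampling before cleaning keeps the later tokenization steps fast; the fraction would need tuning against available memory.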

Analysis of the Raw Data

Statistics of the different files (Twitter, blogs and news):

File	Total lines	Total words	Total characters	Average words per entry	Min words per entry	Max words per entry
Blog	899288	37546239	206824382	41.75	0	6726
News	77259	2674536	15639408	34.62	1	1123
Twitter	2360148	30093413	162096241	12.75	1	47
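
Statistics of this kind can be computed from a character vector that holds one entry per line. The sketch below is one possible way to do it in base R; the helper name `summarise_entries` and the whitespace-based word count are assumptions, and the figures in the table above may have been produced with a different word-counting rule.

```r
# Hypothetical helper: summarise a character vector with one entry per element.
summarise_entries <- function(lines, label) {
  trimmed <- trimws(lines)
  # Words are counted by splitting on runs of whitespace.
  words_per_entry <- lengths(strsplit(trimmed, "\\s+"))
  # Make sure empty entries count as zero words.
  words_per_entry[!nzchar(trimmed)] <- 0L
  data.frame(
    File         = label,
    totalLines   = length(lines),
    totalWords   = sum(words_per_entry),
    totalChars   = sum(nchar(lines)),
    averageWords = round(mean(words_per_entry), 2),
    minWords     = min(words_per_entry),
    maxWords     = max(words_per_entry),
    stringsAsFactors = FALSE
  )
}

toy_entries <- c("hello world", "", "one two three")
summarise_entries(toy_entries, "Toy")
```

Calling the helper once per file and stacking the rows with `rbind()` would give a table with the same columns as the one above.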

Now let's create a tokenized version of each dataset. Extremely common words (of, in, the, etc.) are removed.

## Load the packages used by the tokenization code
library(dplyr)
library(tidytext)

## Creating a tokenized version of twitter (one word per row, stop words removed)
tokenized_twitter <- clean_twitter_df %>%
        unnest_tokens(output = word, input = text) %>%
        anti_join(get_stopwords(), by = "word")

## Creating a tokenized version of blogs
tokenized_blogs <- clean_blogs_df %>%
        unnest_tokens(output = word, input = text) %>%
        anti_join(get_stopwords(), by = "word")

## Creating a tokenized version of news
tokenized_news <- clean_news_df %>%
        unnest_tokens(output = word, input = text) %>%
        anti_join(get_stopwords(), by = "word")

Visualization of the most common words in each file.

### Creating and Analysing N-Grams

An n-gram “is a contiguous sequence of n items from a given sample of text or speech”. For n = 1, 2 and 3 these are called unigrams, bigrams and trigrams respectively. For example, “come in” is a bigram. The following code helps visualise the top bigrams and trigrams in each file. The most common unigrams were visualised earlier. Profane words were removed earlier, and very common words (for, is, was, etc.) are also removed.
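
To make the definition concrete, here is a minimal base R sketch that lists the n-grams of a single sentence. The helper `make_ngrams` is hypothetical and separate from the tidytext-based functions used in this report; it lower-cases the input and splits on whitespace only.

```r
# Hypothetical helper: return every contiguous run of n words in a sentence.
make_ngrams <- function(sentence, n) {
  words <- strsplit(tolower(trimws(sentence)), "\\s+")[[1]]
  # A sentence shorter than n words contains no n-grams.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

make_ngrams("Please come in now", 2)  # "please come" "come in" "in now"
make_ngrams("Please come in now", 3)  # "please come in" "come in now"
```

A sentence of k words yields k - n + 1 n-grams, which is why longer n-grams are sparser in the data.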

# Creating bigrams (requires dplyr and tidytext to be loaded)
make_bigrams <- function(df, filetype) {
        df %>%
                unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
                tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
                na.omit() %>%
                # drop bigrams in which either word is a stop word
                filter(!word1 %in% stop_words$word,
                       !word2 %in% stop_words$word) %>%
                count(word1, word2, sort = TRUE) %>%
                # keep exactly the 10 most frequent bigrams; with_ties = FALSE
                # prevents extra rows when several bigrams tie for 10th place
                slice_max(n, n = 10, with_ties = FALSE) %>%
                mutate(bigram = paste(word1, word2, sep = " "),
                       file = filetype)
}
bigram_of_blogs <- make_bigrams(clean_blogs_df, "Blogs")
bigram_of_news <- make_bigrams(clean_news_df, "News")
bigram_of_twitter <- make_bigrams(clean_twitter_df, "Twitter")
# Now let's combine the bigrams into one data frame.
total_bigram <- bind_rows(bigram_of_blogs, bigram_of_news, bigram_of_twitter)

# Creating trigrams (requires dplyr and tidytext to be loaded)
make_trigrams <- function(df, filetype) {
        df %>%
                unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
                tidyr::separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
                na.omit() %>%
                # drop trigrams in which any word is a stop word
                filter(!word1 %in% stop_words$word,
                       !word2 %in% stop_words$word,
                       !word3 %in% stop_words$word) %>%
                count(word1, word2, word3, sort = TRUE) %>%
                # keep exactly the 10 most frequent trigrams, without ties
                slice_max(n, n = 10, with_ties = FALSE) %>%
                mutate(trigram = paste(word1, word2, word3, sep = " "),
                       file = filetype)
}
trigram_of_blogs <- make_trigrams(clean_blogs_df, "Blogs")
trigram_of_news <- make_trigrams(clean_news_df, "News")
trigram_of_twitter <- make_trigrams(clean_twitter_df, "Twitter")
# Combine the trigrams into one data frame for easy visualization.
total_trigram <- bind_rows(trigram_of_blogs, trigram_of_news, trigram_of_twitter)

Visualize the Bigrams & Trigrams

Conclusion and Next Steps

To conclude the exploratory analysis, we gathered key information about the raw dataset presented to us. We can now build the predictive model for the final capstone project.

### Our main steps would be

- Clean and tokenize the SwiftKey dataset into n-grams of 1-4 words (bad words will already have been removed).
- Save each n-gram data set separately.
- Create a prediction model which analyses a string by its length and returns the most probable word that comes next.
- Create a Shiny app that uses this algorithm to predict words based on user input.
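
The prediction step has not been built yet, so the following is only a sketch of one plausible approach: a simple back-off lookup that tries the longest available history first and falls back to shorter ones. The table layout (`history`, `next_word`, `count`), the toy counts and the function name `predict_next` are all assumptions made for illustration.

```r
# Hypothetical n-gram table: an empty history holds the unigram counts.
toy_ngrams <- data.frame(
  history   = c("", "", "thank", "thank you", "thank you", "you"),
  next_word = c("the", "you", "you", "very", "for", "are"),
  count     = c(50, 30, 12, 7, 4, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: predict the next word with a simple back-off lookup.
predict_next <- function(input, ngrams, max_history = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Try the last k words as history, from the longest usable k down to 0.
  for (k in seq(min(max_history, length(words)), 0)) {
    history <- paste(tail(words, k), collapse = " ")
    candidates <- ngrams[ngrams$history == history, ]
    if (nrow(candidates) > 0) {
      # Return the continuation seen most often after this history.
      return(candidates$next_word[which.max(candidates$count)])
    }
  }
  NA_character_
}

predict_next("Thank you", toy_ngrams)  # "very"
predict_next("see you", toy_ngrams)    # "are"
predict_next("hello", toy_ngrams)      # "the"
```

With `max_history = 3` the lookup uses up to the last three words, which matches n-grams of 1-4 words. A fuller model would also weight or smooth the counts rather than take the raw maximum.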