Peer-graded Assignment: Milestone Report

Instructions

The goal of this project is just to display that you've gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

## Review Criteria

## Does the link lead to an HTML page describing the exploratory analysis of the training data set?

1. The file can be downloaded from the URL https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Since the files are too large to download comfortably from within R, we did not script the download; instead we downloaded the archive manually, unzipped it, and placed the files in the working directory.
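For reproducibility, the same step could be scripted; a sketch of what that would look like (not run here because of the download size, and assuming the archive unzips into the usual final/en_US/ folder):

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")  # binary mode for a zip
  unzip("Coursera-SwiftKey.zip")  # should extract final/en_US/en_US.*.txt
}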

2. To start, we load the required libraries:

library(qdap)
## Warning: package 'qdap' was built under R version 3.6.3
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Warning: package 'qdapRegex' was built under R version 3.6.3
## Loading required package: qdapTools
## Warning: package 'qdapTools' was built under R version 3.6.3
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following object is masked from 'package:base':
## 
##     Filter
library(RCurl)
## Warning: package 'RCurl' was built under R version 3.6.3
library(stringi)
## Warning: package 'stringi' was built under R version 3.6.2
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.6.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:qdapRegex':
## 
##     %+%
library(slam)
## Warning: package 'slam' was built under R version 3.6.2
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.6.3
## Package version: 2.1.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:qdap':
## 
##     %>%, as.DocumentTermMatrix, as.wfm
## The following object is masked from 'package:utils':
## 
##     View
3. There are three files in the downloaded zip: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Of these three, the blogs and twitter files can be read directly, but reading the news file in text mode can stop early (it contains an embedded control character), so we open it for reading in binary mode.

blogs <- readLines("en_US.blogs.txt", skipNul=TRUE, encoding="UTF-8")
twitter <- readLines("en_US.twitter.txt", skipNul=TRUE, encoding="UTF-8")
con <- file("en_US.news.txt", open="rb")   # binary mode avoids the early stop
news <- readLines(con, encoding="UTF-8")
close(con)                                 # close the connection when done

## Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?

4. Now that the files have been loaded, let us summarize the data sets. We start with the file name, file size, and line count of each file:

file_name <- c("blogs", "twitter", "news")
file_size_in_MB <- c(file.info("en_US.blogs.txt")$size / 1024^2, 
                     file.info("en_US.twitter.txt")$size / 1024^2, 
                     file.info("en_US.news.txt")$size / 1024^2)
line_counts <- c(length(blogs), length(twitter), length(news))
5. Before counting words, we clean the text to remove unwanted characters. To do that we execute the following code:
# Keep only letters, whitespace, and apostrophes, then strip the
# mis-encoded byte sequences (â, ã, ð) left behind by encoding problems
clean_text <- function(x) {
  x <- gsub("[^[:alpha:][:space:]']", " ", x)
  x <- gsub("â ", "'", x)
  x <- gsub("ã", "'", x)
  gsub("ð", "'", x)
}

blogs <- clean_text(blogs)
news <- clean_text(news)
twitter <- clean_text(twitter)

# qdap's clean() collapses repeated whitespace; Trim() strips
# leading and trailing spaces
blogs <- Trim(clean(blogs))
news <- Trim(clean(news))
twitter <- Trim(clean(twitter))
# Word counts for each data set via stringi
wc_blogs <- sum(stri_count_words(blogs))
wc_news <- sum(stri_count_words(news))
wc_twitter <- sum(stri_count_words(twitter))

word_counts <- c(wc_blogs, wc_twitter, wc_news)  # same order as file_name
6. Now we combine these measures into a summary table:
summary_table <- data.frame(file_name,file_size_in_MB,line_counts, word_counts)
summary_table
##   file_name file_size_in_MB line_counts word_counts
## 1     blogs        200.4242      899288    37526965
## 2   twitter        159.3641     2360148    29748148
## 3      news        196.2775     1010242    34057134

According to this summary, the twitter file has by far the most lines, while the blogs file contains the most words.

## Has the data scientist made basic plots, such as histograms to illustrate features of the data?

7. Before we start to plot, we need to keep in mind that the data set is huge, so we sample it. We take a random sample of 2,500 lines to reduce the processing time:
set.seed(1234)  # an assumed seed, so the sample is reproducible
sample.text <- sample(c(blogs, news, twitter), 2500, replace=FALSE)
sample.text <- gsub(" #\\S*", "", sample.text)                     # remove hashtags
sample.text <- gsub("(f|ht)(tp)(s?)(://)(\\S*)", "", sample.text)  # remove URLs
sample.text <- gsub("[^0-9A-Za-z/' ]", "", sample.text)            # remove special characters

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Below we build unigram, bigram, and trigram tables and plot them to explore our data.
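As a quick illustration of the idea (using quanteda's tokens_ngrams(), which we also rely on below), the bigrams of a short sentence are simply its overlapping word pairs:

# Bigrams (n = 2) of a toy sentence
as.character(tokens_ngrams(tokens("to be or not to be"), n = 2, concatenator = " "))
## "to be" "be or" "or not" "not to" "to be"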

I tried using the tm package, but I met with errors while trying to coerce the TermDocumentMatrix into a matrix or data frame. Luckily, I read about another package, quanteda, whose functions dfm() and docfreq() seem to work faster and better.

# creating the document-feature matrices
# Note: quanteda 2.x ignores the old ngrams/concatenator/stopwords
# arguments to dfm() (the original call warned about this and produced
# identical unigram dfms for all three objects), so we build the
# bigrams and trigrams from tokens with tokens_ngrams() instead
toks <- tokens(sample.text)
myDfm.ng1 <- dfm(toks, verbose = TRUE)
myDfm.ng2 <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "), verbose = TRUE)
myDfm.ng3 <- dfm(tokens_ngrams(toks, n = 3, concatenator = " "), verbose = TRUE)
# docfreq() gives the number of sampled documents in which each feature
# appears; sorting the named vectors puts the most common n-grams first
ng1.sorted <- sort(docfreq(myDfm.ng1), decreasing=TRUE)
ng2.sorted <- sort(docfreq(myDfm.ng2), decreasing=TRUE)
ng3.sorted <- sort(docfreq(myDfm.ng3), decreasing=TRUE)
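Before plotting, we can sanity-check the tables by peeking at the most frequent entries (the exact words and counts will vary with the random sample):

head(ng1.sorted, 5)   # five most common unigrams
head(ng2.sorted, 5)   # five most common bigrams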
8. Now we go ahead and plot a bar chart of the top 15 unigrams:
ng1.FreqTable <- data.frame(Words=names(ng1.sorted), Frequency = ng1.sorted)
ng1.Plot <- ggplot(within(ng1.FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
ng1.Plot <- ng1.Plot + geom_bar(stat="identity", fill="purple") + ggtitle("Top 15 Unigrams")
ng1.Plot <- ng1.Plot + theme(axis.text.x=element_text(angle=45, hjust=1))
ng1.Plot

9. Next we plot the top 15 bigrams:

ng2.FreqTable <- data.frame(Words=names(ng2.sorted), Frequency = ng2.sorted)
ng2.Plot <- ggplot(within(ng2.FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
ng2.Plot <- ng2.Plot + geom_bar(stat="identity", fill="maroon") + ggtitle("Top 15 Bigrams")
ng2.Plot <- ng2.Plot + theme(axis.text.x=element_text(angle=45, hjust=1))
ng2.Plot

10. Next we plot the top 15 trigrams:

ng3.FreqTable <- data.frame(Words=names(ng3.sorted), Frequency = ng3.sorted)
ng3.Plot <- ggplot(within(ng3.FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
ng3.Plot <- ng3.Plot + geom_bar(stat="identity", fill="blue") + ggtitle("Top 15 Trigrams")
ng3.Plot <- ng3.Plot + theme(axis.text.x=element_text(angle=45, hjust=1))
ng3.Plot
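These n-gram tables are the raw material for the eventual prediction algorithm: given the last word (or two) the user has typed, we look up the most frequent n-grams that begin with those words. Below is a minimal sketch of that idea against the bigram table built above; predict_next() is a hypothetical helper, and the real model will need smoothing and backoff rather than a raw frequency lookup:

# Hypothetical next-word lookup against the sorted bigram table.
# names(ng2.sorted) are bigrams like "of the", already ordered by
# document frequency, so the first matches are the most common ones.
predict_next <- function(word, ngrams = ng2.sorted, k = 3) {
  prefix <- paste0("^", word, " ")   # assumes 'word' contains no regex metacharacters
  hits <- grep(prefix, names(ngrams), value = TRUE)
  if (length(hits) == 0) return(character(0))
  sub(prefix, "", head(hits, k))     # strip the prefix, keep up to k candidates
}

predict_next("in")   # might return e.g. "the", "a", "my" for this sample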

## Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

I have tried to keep the analysis brief and understandable for a non-technical reader.

As the next step, I plan to use these n-gram frequencies to build the word-prediction algorithm and wrap it in a Shiny app that suggests the next word as the user types. I hope that I have done the analysis as per your expectations. Thank you for reading my assignment.