## Instructions
The goal of this project is just to display that you've gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
## Review Criteria
## Does the link lead to an HTML page describing the exploratory analysis of the training data set?

The data can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
Since the files are large, we did not download them from within R; instead we downloaded the zip manually, extracted it, and placed the text files in the working directory.
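For reproducibility, the same download could also be scripted; a minimal sketch, assuming the working directory is writable (the zip extracts into a final/ folder, from which the en_US files would then be copied):

# not run in this report: the zip is large, so it was fetched manually instead
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")   # creates final/<locale>/ subfolders
}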
We begin by loading the required libraries:
library(qdap)      # text cleaning helpers: clean() and Trim()
library(RCurl)
library(stringi)   # fast string utilities: stri_count_words()
library(RWeka)
library(ggplot2)   # plotting
library(slam)      # sparse matrix support
library(quanteda)  # tokenisation, dfm() and docfreq()
Of the three English files, the blogs and Twitter files can be read directly, but the news file contains embedded control characters that cause a text-mode readLines() to stop early, so we open it for reading in binary mode.
blogs   <- readLines("en_US.blogs.txt", skipNul = TRUE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
con  <- file("en_US.news.txt", open = "rb")   # binary mode avoids early EOF
news <- readLines(con, encoding = "UTF-8")
close(con)                                    # release the connection
## Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?

Now that the files are loaded, we summarise the datasets by file name, file size, line count, and word count.
file_name <- c("blogs", "twitter", "news")
file_size_in_MB <- c(file.info("en_US.blogs.txt")$size / 1024^2,
                     file.info("en_US.twitter.txt")$size / 1024^2,
                     file.info("en_US.news.txt")$size / 1024^2)
line_counts <- c(length(blogs), length(twitter), length(news))
# Keep only letters, spaces and apostrophes, repair common mojibake
# characters ("â ", "ã", "ð") left behind by encoding problems, then
# collapse extra whitespace with qdap's clean() and Trim().
clean_text <- function(x) {
  x <- gsub("[^[:alpha:][:space:]']", " ", x)
  x <- gsub("â |ã|ð", "'", x)
  Trim(clean(x))
}
blogs   <- clean_text(blogs)
news    <- clean_text(news)
twitter <- clean_text(twitter)
wc_blogs <- sum(stri_count_words(blogs))
wc_news <- sum(stri_count_words(news))
wc_twitter <- sum(stri_count_words(twitter))
word_counts <- c(wc_blogs, wc_twitter, wc_news)
summary_table <- data.frame(file_name,file_size_in_MB,line_counts, word_counts)
summary_table
## file_name file_size_in_MB line_counts word_counts
## 1 blogs 200.4242 899288 37526965
## 2 twitter 159.3641 2360148 29748148
## 3 news 196.2775 1010242 34057134
The table shows that each file is roughly 160-200 MB. Twitter contributes by far the most lines (short messages), while the blogs file contains the most words.
## Has the data scientist made basic plots, such as histograms to illustrate features of the data?

Since the combined corpora contain over four million lines, we explore a random sample of 2,500 lines, cleaned a little further before building n-grams.
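Note that sample() draws lines at random, so the exact frequencies below will vary between runs; setting a seed first (the value 1234 is an arbitrary choice, not part of the original analysis) makes the sample reproducible:

set.seed(1234)  # arbitrary seed; makes the random sample below reproducible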
sample.text <- sample(c(blogs, news, twitter), 2500, replace=FALSE)
sample.text <- gsub(" #\\S*","", sample.text) # remove hash tags
sample.text <- gsub("(f|ht)(tp)(s?)(://)(\\S*)", "", sample.text) # remove url
sample.text <- gsub("[^0-9A-Za-z///' ]", "", sample.text) # remove special characters
In computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Below we build unigram, bigram, and trigram frequency tables from the sample and plot them to explore the data.
We first tried the tm package but met with errors while coercing its TermDocumentMatrix into a matrix or data frame. The quanteda package, with its dfm() and docfreq() functions, turned out to be faster and more reliable.
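As a quick illustration of how quanteda builds n-grams (a toy example, not part of the analysis):

# toy example: bigrams of a four-word sentence
toy <- tokens("thanks for the follow")
tokens_ngrams(toy, n = 2, concatenator = " ")
# yields the bigrams "thanks for", "for the", "the follow"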
# creating the document-feature matrices
# (in quanteda 2.x the ngrams/concatenator/stopwords arguments to dfm()
#  are ignored with a warning; n-grams are instead built from a tokens
#  object with tokens_ngrams())
toks <- tokens(sample.text, remove_punct = TRUE)
myDfm.ng1 <- dfm(toks)
myDfm.ng2 <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))
myDfm.ng3 <- dfm(tokens_ngrams(toks, n = 3, concatenator = " "))
# docfreq() returns the document frequency of each feature;
# we coerce the named vector to a data frame for sorting and plotting
myDfm.ng1.mat <- as.data.frame(as.matrix(docfreq(myDfm.ng1)))
myDfm.ng2.mat <- as.data.frame(as.matrix(docfreq(myDfm.ng2)))
myDfm.ng3.mat <- as.data.frame(as.matrix(docfreq(myDfm.ng3)))
# sorting the n-grams for plotting
ng1.sorted <- sort(rowSums(myDfm.ng1.mat), decreasing=TRUE)
ng2.sorted <- sort(rowSums(myDfm.ng2.mat), decreasing=TRUE)
ng3.sorted <- sort(rowSums(myDfm.ng3.mat), decreasing=TRUE)
First we plot the top 15 unigrams:
ng1.FreqTable <- data.frame(Words=names(ng1.sorted), Frequency = ng1.sorted)
ng1.Plot <- ggplot(within(ng1.FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
ng1.Plot <- ng1.Plot + geom_bar(stat="identity", fill="purple") + ggtitle("Top 15 Unigrams")
ng1.Plot <- ng1.Plot + theme(axis.text.x=element_text(angle=45, hjust=1))
ng1.Plot
Next we plot the top 15 bigrams:
ng2.FreqTable <- data.frame(Words=names(ng2.sorted), Frequency = ng2.sorted)
ng2.Plot <- ggplot(within(ng2.FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
ng2.Plot <- ng2.Plot + geom_bar(stat="identity", fill="maroon") + ggtitle("Top 15 Bigrams")
ng2.Plot <- ng2.Plot + theme(axis.text.x=element_text(angle=45, hjust=1))
ng2.Plot
Finally, we plot the top 15 trigrams:
ng3.FreqTable <- data.frame(Words=names(ng3.sorted), Frequency = ng3.sorted)
ng3.Plot <- ggplot(within(ng3.FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
ng3.Plot <- ng3.Plot + geom_bar(stat="identity", fill="blue") + ggtitle("Top 15 Trigrams")
ng3.Plot <- ng3.Plot + theme(axis.text.x=element_text(angle=45, hjust=1))
ng3.Plot
## Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
This concludes the exploratory analysis. The plan for the next phase is to turn the n-gram frequency tables above into a next-word prediction model: given the last one or two words a user types, look up the most frequent trigrams or bigrams that begin with them and suggest the words that complete them, backing off to shorter n-grams when no match exists. The model will then be wrapped in a Shiny app with a simple text box, so a user can type a phrase and see suggested next words. Thank you for reading; feedback on these plans is welcome.
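As a first sketch of how that lookup could work against the ng3.FreqTable built above (the function name predict_next and its internals are illustrative assumptions, not the final implementation):

# look up the most frequent trigrams beginning with the user's last
# two words and return the words that complete them
predict_next <- function(last_two, freq_table, n_suggestions = 3) {
  prefix  <- paste0("^", last_two, " ")
  matches <- freq_table[grepl(prefix, freq_table$Words), ]
  matches <- matches[order(-matches$Frequency), ]
  completions <- sub(prefix, "", head(matches$Words, n_suggestions))
  if (length(completions) == 0) NA_character_ else completions
}
# hypothetical usage: predict_next("thanks for", ng3.FreqTable)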