This is the milestone report for the Data Science Capstone project. The report contains the following parts:

- Downloading the data.
- A basic exploratory analysis of the data.
- A more in-depth exploratory study on a sample of the data set, to get a sense of what the data look like.
- A description of the next steps.
# Link to the location of the text data file
UrlLink <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Create the data directory (if it does not exist)
if (!file.exists("data")) {
  dir.create("data")
}
# Download the file (if it does not exist)
if (!file.exists("./data/Coursera-SwiftKey.zip")) {
  download.file(UrlLink, destfile = "./data/Coursera-SwiftKey.zip")
}
# Unzip the file (if it has not been unzipped yet)
if (!file.exists("./data/final")) {
  unzip("./data/Coursera-SwiftKey.zip", exdir = "./data")
}
# Read the three English corpora; skipNul = TRUE suppresses warnings about embedded nul characters
en_US.Blogs   <- readLines(con = "./data/final/en_US/en_US.blogs.txt",   skipNul = TRUE)
en_US.news    <- readLines(con = "./data/final/en_US/en_US.news.txt",    skipNul = TRUE)
en_US.twitter <- readLines(con = "./data/final/en_US/en_US.twitter.txt", skipNul = TRUE)
Build a table of basic statistics for the three data sets: 1. Number of lines. 2. Number of words. 3. Number of characters. A sketch of how these counts can be computed is shown after the table.
|   | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|---|---|---|---|
| Lines | 899288 | 77259 | 2360148 |
| Total Number of Words | 37334131 | 2643969 | 30373543 |
| Total Number of Chars | 208361438 | 15683765 | 162384825 |
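The chunk that produced this table is not echoed in the report; a minimal sketch of how the counts could be computed, assuming the stringi package is used for word counting, is:

# Sketch only: line, word, and character counts per corpus (word counts assume stringi)
library(stringi)
Corpus.Stats <- data.frame(
  File  = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  Lines = c(length(en_US.Blogs), length(en_US.news), length(en_US.twitter)),
  Words = c(sum(stri_count_words(en_US.Blogs)),
            sum(stri_count_words(en_US.news)),
            sum(stri_count_words(en_US.twitter))),
  Chars = c(sum(nchar(en_US.Blogs)), sum(nchar(en_US.news)), sum(nchar(en_US.twitter)))
)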
Sampling: the amount of data is significant, and running the calculations on the entire corpus would take a long time and consume a large amount of memory, so the exploratory analysis is done on a 10% sample of each file.
# Fix the random seed for reproducibility, then draw a 10% sample from each corpus without replacement
set.seed(534)
Sample.Text <- c(sample(en_US.Blogs,   length(en_US.Blogs)   * (10/100), replace = FALSE),
                 sample(en_US.news,    length(en_US.news)    * (10/100), replace = FALSE),
                 sample(en_US.twitter, length(en_US.twitter) * (10/100), replace = FALSE))
# Free the memory used by the full corpora
rm(en_US.Blogs, en_US.news, en_US.twitter)
The cleaning process includes:

- Tokenizing into sentences (each line may contain more than one sentence).
- Removing sentences that contain non-English words or characters.
- Removing punctuation.
- Removing Twitter @handles and #hashtags.
- Removing URLs.
- Removing stop words (stop words are removed only for the exploratory step).
# Convert the Sample.Text vector to a data frame (tibble() replaces the deprecated data_frame())
library(dplyr)
library(tidytext)
library(quanteda)
DF <- tibble(text = Sample.Text)
rm(Sample.Text)
# Split into sentences (each line may contain more than one sentence)
Text.Sentences<-DF %>%
unnest_tokens(sentence, text, token = "sentences")
# Convert sentences with non-English (non-ASCII) characters to NA; iconv() is vectorized, so no lapply() is needed
Text.Sentences <- iconv(Text.Sentences$sentence, from = "latin1", to = "ASCII")
# If a sentence contained a non-English word or character it is now NA, so drop the entire sentence
Text.Sentences <- Text.Sentences[!is.na(Text.Sentences)]
# Remove punctuation, Twitter handles/hashtags, numbers, hyphens, symbols, and URLs
# (the remove_twitter and remove_hyphens arguments belong to tokens() in quanteda 1.x)
Text.Sentences <- tokens(
x = tolower(Text.Sentences),
remove_punct = TRUE,
remove_twitter = TRUE,
remove_numbers = TRUE,
remove_hyphens = TRUE,
remove_symbols = TRUE,
remove_url = TRUE)
# Remove English stop words
Text.Sentences <- tokens_remove(Text.Sentences, stopwords("english"))
Create unigram (1-gram) word frequency statistics and plot them using ggplot2. The log below comes from building the document-feature matrix; a sketch of the underlying code follows it.
## Creating a dfm from a tokens input...
## ... lowercasing
## ... found 428,783 documents, 128,404 features
## ... stemming features (English), trimmed 33914 feature variants
## ... created a 428,783 x 94,490 sparse dfm
## ... complete.
## Elapsed time: 3 seconds.
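The chunk that built the unigram dfm was not echoed; a minimal sketch of how it could look, assuming quanteda's dfm() with stem = TRUE (matching the stemming step in the log) and textstat_frequency() for the top features (that function moved to quanteda.textstats in later quanteda versions), is:

# Sketch only: build a stemmed unigram dfm and plot the 20 most frequent features
Unigram.DFM  <- dfm(Text.Sentences, stem = TRUE)
Unigram.Freq <- textstat_frequency(Unigram.DFM, n = 20)
library(ggplot2)
ggplot(Unigram.Freq, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram (stemmed)", y = "Frequency")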
Create bigram (2-gram) word frequency statistics and plot them using ggplot2.
## Creating a dfm from a tokens input...
## ... lowercasing
## ... found 428,783 documents, 1,781,719 features
## ... created a 428,783 x 1,781,719 sparse dfm
## ... complete.
## Elapsed time: 9.77 seconds.
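The bigram chunk was not echoed either; one way to produce it, assuming quanteda's tokens_ngrams() and a small hypothetical helper ngram_freq() that wraps the dfm and frequency steps used above, is:

# Hypothetical helper: dfm plus frequency table for n-grams built from the cleaned tokens
ngram_freq <- function(toks, n, top = 20) {
  ngram.tokens <- tokens_ngrams(toks, n = n, concatenator = " ")
  textstat_frequency(dfm(ngram.tokens), n = top)
}
Bigram.Freq <- ngram_freq(Text.Sentences, n = 2)
# Bigram.Freq feeds the same ggplot() call used for the unigrams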
Create trigram (3-gram) word frequency statistics and plot them using ggplot2.
## Creating a dfm from a tokens input...
## ... lowercasing
## ... found 428,783 documents, 2,205,684 features
## ... created a 428,783 x 2,205,684 sparse dfm
## ... complete.
## Elapsed time: 12.6 seconds.
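Trigrams follow the same pattern, reusing the hypothetical ngram_freq() helper sketched above:

Trigram.Freq <- ngram_freq(Text.Sentences, n = 3)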