In this document I summarize the findings from my exploratory analysis of the data provided in the Coursera Data Science Specialization capstone.
The goal of this document, as stated by the course mentors, is:
“The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to: 1. Demonstrate that you’ve downloaded the data and have successfully loaded it in. 2. Create a basic report of summary statistics about the data sets. 3. Report any interesting findings that you amassed so far. 4. Get feedback on your plans for creating a prediction algorithm and Shiny app.”
The data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzipped with the following code:
strURL<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
strZIPFile<-"dataCapstone.zip"
strDir<-"final"
downloadTextData<-function(){
  # download the zip file only if it is not already on disk
  if(!file.exists(strZIPFile)){
    download.file(strURL,strZIPFile)
  }
  # unzip only if the "final" directory has not been created yet
  if(!dir.exists(strDir)){
    unzip(strZIPFile)
  }
}
downloadTextData()
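For completeness, these are the library calls the rest of the code in this report assumes:
library(data.table)   # data.table() for the summary tables
library(knitr)        # kable() for printing tables
library(stringr)      # str_remove_all() for cleaning
library(tm)           # removePunctuation(), removeNumbers(), stripWhitespace()
library(text2vec)     # tokenization and n-gram vocabularies
library(ggplot2)      # n-gram frequency plots
library(wordcloud)    # word clouds
library(RColorBrewer) # brewer.pal() color palettes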
After that the data was loaded and each of the files was read.
The code below uses a helper function I wrote. My plan is to modularize further; I use some constants that will live in separate code units in the final version. These constants are only used to construct the paths to the files.
## Language ids
ENGLISH_ID<-"en"
FINNISH_ID<-"fn"
RUSSIAN_ID<-"ru"
DEUTSCH_ID<-"de"
## Document types
BLOGS_ID<-"blogs"
NEWS_ID<-"news"
TWITTER_ID<-"twitter"
## Language codes used in the file names
IDIOM_CODE_DEUTSCH<-"de_DE"
IDIOM_CODE_ENGLISH<-"en_US"
IDIOM_CODE_FINNISH<-"fi_FI"
IDIOM_CODE_RUSSIAN<-"ru_RU"
getFileName<-function(language=ENGLISH_ID,type=TWITTER_ID){
  # map the short language id to the language code used in the file names
  strIdiomCode<-switch(language,
    "de" = IDIOM_CODE_DEUTSCH,
    "fn" = IDIOM_CODE_FINNISH,
    "en" = IDIOM_CODE_ENGLISH,
    "ru" = IDIOM_CODE_RUSSIAN
  )
  strFileName<-paste(strIdiomCode,".",type,".txt",sep="")
  file.path(strDir,strIdiomCode,strFileName)
}
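For the English blogs file, for example, this builds the path inside the unzipped final directory:
getFileName(ENGLISH_ID,BLOGS_ID)
# "final/en_US/en_US.blogs.txt"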
language=ENGLISH_ID
strFileBlogs<-getFileName(language,BLOGS_ID)
strFileNews<-getFileName(language,NEWS_ID)
strFileTwitter<-getFileName(language,TWITTER_ID)
fileBlogs<-file(strFileBlogs)
linesBlogs<-readLines(fileBlogs,skipNul = TRUE,encoding = "UTF-8")
close(fileBlogs)
fileNews<-file(strFileNews)
linesNews<-readLines(fileNews,skipNul=TRUE,encoding = "UTF-8")
close(fileNews)
fileTwitter<-file(strFileTwitter)
linesTwitter<-readLines(fileTwitter,skipNul=TRUE,encoding = "UTF-8")
close(fileTwitter)
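As a side note, this reading step could be wrapped in a small helper that opens the files in binary mode; binary mode is a common way to avoid readLines() stopping early when a file contains embedded control characters. A sketch (the helper name is my own):
readTextFile<-function(path){
  # open in binary mode so embedded control characters do not end the read early
  con<-file(path,open="rb")
  on.exit(close(con))
  readLines(con,skipNul=TRUE,encoding="UTF-8")
}
# usage would be, e.g.: linesBlogs<-readTextFile(strFileBlogs)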
With the files loaded I can compute the total number of lines in each file, along with the maximum, minimum, and mean line length in characters:
filesLinesSummary<-data.table(files=c(BLOGS_ID,TWITTER_ID,NEWS_ID),
  noLines=c(length(linesBlogs),length(linesTwitter),length(linesNews)),
  maximumLength=c(max(nchar(linesBlogs)),max(nchar(linesTwitter)),max(nchar(linesNews))),
  minimumLength=c(min(nchar(linesBlogs)),min(nchar(linesTwitter)),min(nchar(linesNews))),
  mean=c(mean(nchar(linesBlogs)),mean(nchar(linesTwitter)),mean(nchar(linesNews))))
kable(filesLinesSummary)
| files | noLines | maximumLength | minimumLength | mean |
|---|---|---|---|---|
| blogs | 899288 | 40833 | 1 | 229.98695 |
| twitter | 2360148 | 140 | 2 | 68.68054 |
| news | 77259 | 5760 | 2 | 202.42830 |
This is a useful overview, but it is not necessary to use all the lines for a preliminary analysis. I will take samples based on the text length: the mean line length for blogs and news is similar and roughly three times the mean for twitter. This is an ad hoc approach, but it is enough to get an idea of the bigrams and trigrams that could be useful for the algorithm. So I will sample 50000 lines each from news and blogs and 150000 lines from twitter.
## num of lines to consider
linesToReadBlogs<-50000
linesToReadNews<-50000
linesToReadTwitter<-150000
set.seed(33234)
linesBlogs<-sample(linesBlogs,linesToReadBlogs)
linesTwitter<-sample(linesTwitter,linesToReadTwitter)
linesNews<-sample(linesNews,linesToReadNews)
linesToProcess<-c(linesBlogs,linesNews,linesTwitter)
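For context, these sample sizes cover very different fractions of each corpus; a quick side calculation using the line counts from the summary table above:
# rough fraction of each corpus covered by the samples
round(c(blogs=50000/899288,news=50000/77259,twitter=150000/2360148),3)
# blogs 0.056, news 0.647, twitter 0.064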
So linesToProcess holds the lines I will use in my analysis. Now I face another problem: the text is raw and full of punctuation marks and spaces that will not help. I will clean the data using str_remove_all from the stringr package and removePunctuation, stripWhitespace, and removeNumbers from tm. This removes some dirty characters that I found in the text, along with numbers and punctuation marks that are useless and would complicate the construction of the n-grams.
cleanText<-function(dataTable){
  #dataTable$text<-removeWords(dataTable$text,stopwords('en'))
  # explicitly strip hash signs, dashes, and curly quotes
  dataTable$text<- str_remove_all(dataTable$text,"#")
  dataTable$text<- str_remove_all(dataTable$text,"–")
  dataTable$text<- str_remove_all(dataTable$text,"-")
  dataTable$text<- str_remove_all(dataTable$text,"“")
  dataTable$text<- str_remove_all(dataTable$text,"”")
  # standard tm cleanup plus lower casing
  dataTable$text<- removePunctuation(dataTable$text)
  dataTable$text<- removeNumbers(dataTable$text)
  dataTable$text<- stripWhitespace(dataTable$text)
  dataTable$text<- tolower(dataTable$text)
  dataTable
}
dataTable<-data.table(text=linesToProcess)
dataTable<-cleanText(dataTable)
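To show what cleanText does, here is a quick check on a made-up line (the input string is my own example):
cleanText(data.table(text="Look – 3 “new” e-mails, #wow!"))$text
# should return "look new emails wow"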
Now that I have a data table with the cleaned lines, I can build the n-grams and get a word count for the whole sample. I will discard the terms that appear fewer than five times using prune_vocabulary.
I will use functions from the text2vec library.
tokens = space_tokenizer(dataTable$text)
vocab3gram <- create_vocabulary(itoken(tokens),ngram = c(3,3))
vocab2gram <- create_vocabulary(itoken(tokens),ngram = c(2,2))
vocab1gram <- create_vocabulary(itoken(tokens),ngram = c(1,1))
vocab3gram <- prune_vocabulary(vocab3gram, term_count_min = 5)
vocab2gram <- prune_vocabulary(vocab2gram, term_count_min = 5)
vocab1gram <- prune_vocabulary(vocab1gram, term_count_min = 5)
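Before looking at the real data, a toy example (my own) shows what these vocabulary objects contain: one row per term, with its total count and the number of documents it appears in.
toyTokens<-space_tokenizer(c("the cat sat","the dog sat down"))
toyVocab<-create_vocabulary(itoken(toyTokens))
toyVocab[toyVocab$term=="the",]   # term_count = 2, doc_count = 2 ("the" occurs in both documents)
toyVocab[toyVocab$term=="down",]  # term_count = 1, doc_count = 1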
Here I pause briefly to show the word counts in the corpus and the number of documents in which each word appears. I show only the top 30 terms for brevity.
kable(head(vocab1gram[order(vocab1gram$term_count,decreasing=TRUE),],30))
| | term | term_count | doc_count |
|---|---|---|---|
| 161925 | the | 260218 | 113631 |
| 161924 | to | 154159 | 92477 |
| 161923 | and | 132652 | 77277 |
| 161922 | a | 131996 | 81584 |
| 161921 | of | 109547 | 66497 |
| 161920 | i | 96352 | 57119 |
| 161919 | in | 90450 | 62283 |
| 161918 | for | 62356 | 50235 |
| 161917 | is | 60656 | 46021 |
| 161916 | that | 57089 | 41534 |
| 161915 | you | 55539 | 40584 |
| 161914 | it | 51758 | 38412 |
| 161913 | on | 45884 | 38378 |
| 161912 | with | 39540 | 32363 |
| 161911 | my | 35773 | 27372 |
| 161910 | was | 34614 | 24457 |
| 161909 | at | 31868 | 27206 |
| 161908 | this | 30947 | 25935 |
| 161907 | be | 30794 | 26156 |
| 161906 | have | 29661 | 25072 |
| 161905 | are | 27478 | 22827 |
| 161904 | but | 26600 | 23743 |
| 161903 | as | 26546 | 19214 |
| 161902 | we | 23372 | 16566 |
| 161901 | he | 22852 | 15281 |
| 161900 | not | 22763 | 19552 |
| 161899 | so | 22233 | 19598 |
| 161898 | me | 21897 | 18667 |
| 161897 | from | 21201 | 18542 |
| 161896 | all | 18854 | 16797 |
Now I can show the top n-grams for my analysis, starting with a plot of the 1-gram frequencies:
gg<- ggplot(head(vocab1gram[order(vocab1gram$term_count,decreasing=TRUE),],30),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()
For 2-grams:
gg<- ggplot(head(vocab2gram[order(vocab2gram$term_count,decreasing=TRUE),],20),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()
And finally for 3-grams:
gg<- ggplot(head(vocab3gram[order(vocab3gram$term_count,decreasing=TRUE),],10),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()
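In these plots the terms appear in alphabetical order along the x axis, which makes the ranking hard to read; a bar chart with the terms reordered by frequency may read better. A sketch for the 1-grams:
top1gram<-head(vocab1gram[order(vocab1gram$term_count,decreasing=TRUE),],30)
ggplot(top1gram,aes(x=reorder(term,term_count),y=term_count,fill=doc_count))+
  geom_col()+
  coord_flip()+   # horizontal bars keep the term labels readable
  labs(x="term",y="term_count")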
Even with these graphs it is difficult to see the real context of the n-grams, so I will use the wordcloud library to build word clouds; they are a fancier way to display this kind of information.
For 1-grams:
pal <- brewer.pal(9,"BuGn")
pal <- pal[-(1:2)]
wordcloud(vocab1gram$term,vocab1gram$term_count,scale=c(3,.5),min.freq=2,
          max.words=250,random.order=FALSE,rot.per=.15,colors=pal)
For 2-grams:
pal <- brewer.pal(9,"PuBu")
pal <- pal[-(1:2)]
wordcloud(vocab2gram$term,vocab2gram$term_count,scale=c(2.5,0.4),min.freq=2,
          max.words=200,random.order=FALSE,rot.per=.15,colors=pal)
For 3-grams:
pal <- brewer.pal(9,"PuBu")
pal <- pal[-(1:2)]
wordcloud(vocab3gram$term,vocab3gram$term_count,scale=c(1.5,0.25),min.freq=2,
          max.words=150,random.order=FALSE,rot.per=.15,colors=pal)
The n-grams can be seen more clearly in these graphs.
The next steps are:
I have reviewed Markov networks and I think they are a very good starting point for building the predictor; you can find a nice introductory video here.
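To make this plan more concrete, here is a minimal sketch of a next-word lookup built on the pruned n-gram tables above, using a simple most-frequent-continuation rule with backoff. The helper predictNextWord is my own illustration, not a final design, and it assumes text2vec's default "_" separator inside n-gram terms:
# given the last two words, look up the most frequent 3-gram continuation,
# then back off to the 2-gram table, then to the most frequent unigram
predictNextWord<-function(w1,w2){
  lastWord<-function(term) sub(".*_","",term)
  # try 3-grams of the form w1_w2_*
  hits<-vocab3gram[grepl(paste0("^",w1,"_",w2,"_"),vocab3gram$term),]
  if(nrow(hits)>0){
    return(lastWord(hits$term[which.max(hits$term_count)]))
  }
  # back off to 2-grams of the form w2_*
  hits<-vocab2gram[grepl(paste0("^",w2,"_"),vocab2gram$term),]
  if(nrow(hits)>0){
    return(lastWord(hits$term[which.max(hits$term_count)]))
  }
  # final fallback: the most frequent unigram
  vocab1gram$term[which.max(vocab1gram$term_count)]
}
# e.g. predictNextWord("one","of")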