Introduction

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

Objectives

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Project

Loading the data

The data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and comes from a corpus called HC Corpora (www.corpora.heliohost.org). In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts. For this exercise we will use the English sets, which we load from three separate files:

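# Packages used throughout this report
library(ggplot2)    # bar charts
library(gridExtra)  # grid.arrange()
library(quanteda)   # dfm(), topfeatures(), textplot_wordcloud()

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)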
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

The three files are large and contain lines of text taken from news articles, blogs and Twitter.
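The warning about an incomplete final line can mean that reading stopped before the real end of the file. One commonly suggested workaround is to read through a connection opened in binary mode. The sketch below shows the idea on a small temporary file. It is an assumption that this applies to en_US.news.txt, since that was not checked here, and the helper name read_lines_binary is hypothetical.

```r
# Hypothetical helper (not part of the original report): read every line
# of a file through a connection opened in binary mode ("rb").
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
unlink(tmp)
```

If the line count returned this way differs from the count obtained with a plain readLines() call on the same file, the plain call was probably cut short.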

Exploratory Analysis

It is important to know what our data contains, so for each file we are going to determine its size in memory, its number of lines, and the average and maximum number of characters per line.

blogst <- data.frame(Data="Blogs", 'Size in MB'=as.numeric(object.size(blogs)/(1024^2)), 
                     'Number of Rows'=length(blogs), 
                     'Mean Characters'=mean(nchar(blogs)),'Max Characters'=max(nchar(blogs)))

newst <- data.frame(Data="News", 'Size in MB'=as.numeric(object.size(news)/(1024^2)), 
                    'Number of Rows'=length(news), 
                    'Mean Characters'=mean(nchar(news)),'Max Characters'=max(nchar(news)))

twittert <- data.frame(Data="Twitter", 'Size in MB'=as.numeric(object.size(twitter)/(1024^2)), 
                       'Number of Rows'=length(twitter), 
                       'Mean Characters'=mean(nchar(twitter)),'Max Characters'=max(nchar(twitter)))

all <- rbind(blogst,newst,twittert)
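The three data.frame() calls above differ only in their input, so they could be folded into one small function. The following is a minimal sketch of that idea, not the code used for this report. The helper name summarise_lines and the toy vectors are invented for illustration.

```r
# Hypothetical helper: build one summary row for a character vector of lines
summarise_lines <- function(lines, label) {
  n_chars <- nchar(lines)
  data.frame(
    Data            = label,
    Size.in.MB      = as.numeric(object.size(lines)) / 1024^2,
    Number.of.Rows  = length(lines),
    Mean.Characters = mean(n_chars),
    Max.Characters  = max(n_chars)
  )
}

# Toy stand-ins for the real corpora (invented purely for illustration)
toy <- list(
  Blogs   = c("a fairly long blog post line", "another one"),
  News    = c("short news line"),
  Twitter = c("tweet", "lol", "ok")
)

# One row per source, stacked into a single table
toy_summary <- do.call(rbind, Map(summarise_lines, toy, names(toy)))
rownames(toy_summary) <- NULL
toy_summary
```

With the real data the same call would take list(Blogs = blogs, News = news, Twitter = twitter) as its input.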

Based on our data, the largest data set is Twitter, although its lines contain fewer characters than those of the other two. Blogs have the longest lines. Now let’s see it graphically:

p1 = ggplot (all,aes(x=Data,y=Size.in.MB))+ 
  geom_bar(stat="identity",fill="steelblue") + labs(x="") + theme_minimal()
p2 = ggplot (all,aes(x=Data,y=Number.of.Rows))+ 
  geom_bar(stat="identity",fill="steelblue") + labs(x="") + theme_minimal()
p3 = ggplot (all,aes(x=Data,y=Mean.Characters))+ 
  geom_bar(stat="identity",fill="steelblue") + labs(x="") + theme_minimal()
p4 = ggplot (all,aes(x=Data,y=Max.Characters))+ 
  geom_bar(stat="identity",fill="steelblue") + labs(x="") + theme_minimal()

grid.arrange(p1,p2,p3,p4,nrow=1, top="Exploratory Analysis on English Data by Type")

Some observations

The files are quite big, so unless we have a supercomputer we will have to split them. It would also be useful to merge all the data and then sample it randomly, because the properties differ among the three groups. For the purpose of this first report we are just going to take the smallest file, the “news” file, and dig deeper.
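The merge-then-sample idea mentioned above could look roughly like the sketch below. The toy vectors stand in for the real files, and the 10% fraction and the seed are arbitrary assumptions chosen only for illustration.

```r
set.seed(1234)  # arbitrary seed so the sample is reproducible

# Toy stand-ins for the three corpora (invented for illustration)
blogs_toy   <- paste("blog line", 1:100)
news_toy    <- paste("news line", 1:50)
twitter_toy <- paste("tweet", 1:200)

# Merge all sources into one vector, then draw a simple random sample
combined        <- c(blogs_toy, news_toy, twitter_toy)
sample_fraction <- 0.10  # hypothetical fraction, to be tuned to available memory
sample_size     <- round(length(combined) * sample_fraction)
corpus_sample   <- sample(combined, size = sample_size)

length(corpus_sample)
```

Because the sample is drawn from the merged vector, each source is represented roughly in proportion to its number of lines.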

Analysis of the News File

To provide a better explanation and start working towards a predictive algorithm, we are going to use the quanteda package to divide the text into n-grams, meaning one-word, two-word and three-word combinations. Knowing the most common combinations in a large data set, we might be able to create a prediction algorithm. Here we will see the 30 most common single-word n-grams, or unigrams.

dataframe1 <- dfm(news, ngrams = 1, verbose = FALSE, concatenator = " ", remove = stopwords("english"), remove_punct = TRUE)

features1 <- topfeatures(dataframe1,30)

table1 <- data.frame(list(term=names(features1),frequency=unname(features1)))

And now, to get a better idea, we can show it graphically:

ggplot (table1,aes(x=term,y=frequency))+ 
  geom_bar(stat="identity",fill="steelblue") + 
  labs(x="",title="Word Count on News File") + coord_flip() + theme_minimal()

And as a word cloud:

textplot_wordcloud(dataframe1, min_count = 1000, random_order = FALSE)

We can see that the word “said” is the most common in the English news data set (after removing stop words), which makes sense as we are talking about the news.
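To make the two-word and three-word combinations concrete, here is a minimal base R sketch that builds and counts them on two invented sentences. It is only an illustration of the idea, not the quanteda-based code used in this report, and the helper name make_ngrams is hypothetical.

```r
# Hypothetical helper: build word n-grams from one line of text
make_ngrams <- function(text, n) {
  # Lower-case the text and split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy sentences (invented for illustration)
toy_lines <- c("The mayor said the budget is balanced",
               "The mayor said the plan would work")

bigrams  <- unlist(lapply(toy_lines, make_ngrams, n = 2))
trigrams <- unlist(lapply(toy_lines, make_ngrams, n = 3))

# Frequency tables, most common combinations first
bigram_counts  <- sort(table(bigrams), decreasing = TRUE)
trigram_counts <- sort(table(trigrams), decreasing = TRUE)

head(bigram_counts, 3)
head(trigram_counts, 3)
```

Unlike the unigram table above, this sketch keeps stop words, since combinations such as "said the" are exactly what a next-word predictor needs to see.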

Future Directions

Now that we have developed a single-word n-gram model, going forward it would be ideal to include at least 2-grams and 3-grams to provide a more accurate prediction. I am still looking for the best method and have read a little about Markov chains. So the plan would be: