Summary

Goals to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset? (see the sketch after this list)
  3. How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%? (see the sketch after this list)
  4. How do you evaluate how many of the words come from foreign languages?
  5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
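
Questions 2 and 3 lend themselves to small code experiments. The base-R sketch below shows both a 2-gram counter and a coverage calculator; the helper names (count_bigrams, coverage) are illustrative and not part of the analysis that follows.

# Count 2-grams in cleaned, lower-cased text
count_bigrams <- function(text) {
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nzchar(words)]
  sort(table(paste(head(words, -1), tail(words, -1))), decreasing = TRUE)
}

# Smallest dictionary size covering a fraction p of all word instances,
# given word counts sorted in decreasing order
coverage <- function(freq, p) {
  cum <- cumsum(as.numeric(freq)) / sum(freq)
  unname(which(cum >= p)[1])
}

count_bigrams("the quick brown fox and the quick dog")
coverage(c(10, 5, 3, 2), 0.5)  # toy counts: 10/20 = 50%, so the answer is 1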

Data and exploration

For this data exploration I am going to load 500 lines from each document to save time.
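
One way such a sample could be built with the tm package is sketched below; the source file names and the sample directory are assumptions inferred from the document names in the summary output, so the original script may differ.

# Write a 500-line sample of each source file, then load the samples as a corpus
library(tm)

src <- c(blogs = "en_US.blogs.txt",
         news = "en_US.news.txt",
         twitter = "en_US.twitter.txt")
dir.create("sample", showWarnings = FALSE)
for (n in names(src)) {
  lines <- readLines(src[[n]], n = 500, skipNul = TRUE)
  writeLines(lines, file.path("sample", paste0("en_US.samp-", n, ".txt")))
}
docs <- VCorpus(DirSource("sample"))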

# Summary
summary(docs)
##                        Length Class             Mode
## en_US.samp-blogs.txt   2      PlainTextDocument list
## en_US.samp-news.txt    2      PlainTextDocument list
## en_US.samp-twitter.txt 2      PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 120089
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 100367
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 33474

Clean the imported data

This raw data needs a fair amount of cleaning: dropping non-useful characters, punctuation, and numbers; then normalizing all words to lower case; and finally removing extra whitespace.

docs <- tm_map(docs, removePunctuation)             # strip punctuation
docs <- tm_map(docs, removeNumbers)                 # strip digits
docs <- tm_map(docs, content_transformer(tolower))  # lower-case (content_transformer preserves the corpus in tm >= 0.6)
docs <- tm_map(docs, stripWhitespace)               # collapse extra whitespace
docs <- tm_map(docs, PlainTextDocument)             # restore the PlainTextDocument class
summary(docs)
##              Length Class             Mode
## character(0) 2      PlainTextDocument list
## character(0) 2      PlainTextDocument list
## character(0) 2      PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 115596
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 95306
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 31468

Explore the data

To explore the data quickly, we compute a word frequency count, sort it, and peek at the results below.
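
One way this count could have been produced is the following sketch (assuming the cleaned docs corpus from above; the object names tdm and freq are illustrative):

# Build a term-document matrix and sum word counts across the three documents
tdm <- TermDocumentMatrix(docs)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 15)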

##  the  and that  for with  you  was  but this have  not  are said from  its 
## 2214 1146  520  472  354  314  307  239  233  225  199  195  170  165  162

These are the top 15 words in the corpus, and they are what you would expect for typical English-language text.

Plotted, the top word frequencies look like this:

[Figure: bar plot of the top word frequencies]

Sorted in order of the word instead, you can see the same counts arranged alphabetically:

[Figure: the same word frequencies sorted alphabetically by word]
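
A plot along these lines could be produced roughly as follows; this is a sketch assuming ggplot2 and the freq vector from the earlier sketch.

# Bar plot of the 15 most frequent words
library(ggplot2)

top <- data.frame(word = names(freq)[1:15], count = as.numeric(freq[1:15]))
ggplot(top, aes(x = reorder(word, -count), y = count)) +
  geom_col() +
  labs(x = "word", y = "frequency", title = "Top 15 word frequencies")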

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Please upload the URL of an R Pubs document describing your exploratory analysis (http://rpubs.com/; be sure that the URL you submit is http and not https).

Does the link lead to an HTML page describing the exploratory analysis of the training data set?

  0: no, the link does not lead to a document describing the exploratory analysis
  1: yes, the link leads to a document describing the exploratory analysis

Has the data scientist done basic summaries of the three files? Word counts, line counts, and basic data tables?

  0: no, the data scientist has not evaluated basic summaries of the data such as word and line counts
  1: yes, the data scientist has evaluated basic summaries of the data such as word and line counts

Has the data scientist made basic plots, such as histograms, to illustrate features of the data?

  0: no, the data scientist has not made basic plots, such as histograms, to illustrate features of the data
  1: yes, the data scientist has made basic plots, such as histograms, to illustrate features of the data

Was the report written in a brief, concise style, in a way that a non-data-scientist manager could appreciate?

  0: no, the report is not brief and concise and cannot be understood by a non-data scientist
  1: yes, the report is brief and concise, but could not be understood by a non-data scientist
  2: yes, the report could be understood by a non-data scientist, but is not brief and concise
  3: yes, the report could be understood by a non-data scientist and is brief and concise

An important part of being a data scientist is being able to provide your colleagues with constructive feedback that they can then use to improve their own work. This is the most important evaluation criterion. In the space below, we want you to do just that. Give the data scientist good, useful, and actionable feedback about the strengths of their work and the areas that need improvement. Give them advice about what they can do to take their work to the next level. (25 word minimum)