Goals to accomplish
Questions to consider
For this data exploration I am going to load 500 lines from each document to save time.
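The loading code is not shown in this report, so the following is only a minimal base R sketch of reading the first 500 lines of a file. The helper name `read_head` and the temporary demo file are hypothetical stand-ins for the real en_US files.

```r
# Hypothetical helper: read at most `n` lines from the start of a text file.
read_head <- function(path, n = 500L) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))  # always release the connection
  readLines(con, n = n, warn = FALSE, skipNul = TRUE)
}

# Self-contained demonstration on a temporary file standing in for one of
# the en_US files (the real paths are not given in this report).
demo_path <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:600), demo_path)
demo_lines <- read_head(demo_path, n = 500L)
length(demo_lines)
```

Taking the first lines of a file is fast, but it is not a random sample, so the counts below may not be representative of the full files.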
# Summary
summary(docs)
## Length Class Mode
## en_US.samp-blogs.txt 2 PlainTextDocument list
## en_US.samp-news.txt 2 PlainTextDocument list
## en_US.samp-twitter.txt 2 PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 120089
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 100367
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 33474
The raw data needs a fair amount of cleanup: dropping punctuation, numbers, and other characters that are not useful for word counts. The text is then normalized by converting all words to lower case, and finally extra whitespace is stripped.
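# Clean-up steps; stripWhitespace runs last to tidy gaps left by earlier steps
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
# tolower is a base R function, so wrap it to keep each element a text document
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, stripWhitespace)
# Re-wrap as PlainTextDocument; this appears to reset the document ids (character(0) below)
docs <- tm_map(docs, PlainTextDocument)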
summary(docs)
## Length Class Mode
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 115596
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 95306
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 31468
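To make the effect of the cleanup steps concrete, here is a minimal base R sketch applied to a single string. It approximates the four steps with regular expressions and is not tm's own implementation; the function name `clean_text` and the sample sentence are hypothetical.

```r
# Hypothetical base R approximation of the cleanup steps on plain strings.
clean_text <- function(x) {
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- tolower(x)                    # normalize case
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  x
}

cleaned <- clean_text("Hello,  World! It's 2015.")
cleaned
```

One side effect worth noting: removing punctuation turns contractions such as "It's" into "its", which may inflate that word in a frequency count. A trailing space can also remain where a number was removed at the end of a line.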
For a quick look at the data, we count how often each word occurs, sort the counts in decreasing order, and peek at the top of the list:
## the and that for with you was but this have not are said from its
## 2214 1146 520 472 354 314 307 239 233 225 199 195 170 165 162
These are the 15 most frequent words in the corpus, and they are what you would expect for ordinary English text.
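The code behind this count is not shown, so the following is a minimal base R sketch of the idea rather than the method actually used. The helper `top_terms` and the sample text are hypothetical. The three-character minimum is an assumption: the table above contains no very short words such as "to" or "of", which would be consistent with tm's default minimum word length of three for term counts.

```r
# Hypothetical helper: count words and return the `n` most frequent ones.
top_terms <- function(text, n = 15L, min_chars = 3L) {
  words <- unlist(strsplit(text, "[[:space:]]+"))  # split on whitespace
  words <- words[nchar(words) >= min_chars]        # assumed length filter
  counts <- sort(table(words), decreasing = TRUE)  # most frequent first
  head(counts, n)
}

# Tiny made-up sample, already lower-cased and free of punctuation.
sample_text <- c("the cat and the dog", "the dog and the bird")
top3 <- top_terms(sample_text, n = 3L)
top3
```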
Plotted, the top word frequencies look like this:
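A bar chart of these counts can be redrawn with base R graphics. This is a minimal sketch: the counts are copied from the frequency output above, while the variable names and the chart labels are hypothetical choices, not the original plotting code.

```r
# Top 15 word counts, copied from the frequency output in this report.
top_words <- c("the", "and", "that", "for", "with", "you", "was", "but",
               "this", "have", "not", "are", "said", "from", "its")
top_counts <- c(2214, 1146, 520, 472, 354, 314, 307, 239,
                233, 225, 199, 195, 170, 165, 162)
freq <- setNames(top_counts, top_words)

# Draw the bars; las = 2 rotates the word labels so they do not overlap.
mids <- barplot(freq, las = 2,
                main = "Top 15 words by frequency",
                ylab = "Count")
```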
Sorted in order of the word, you can see
The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Please upload the URL of an R Pubs document describing your exploratory analysis (http://rpubs.com/; be sure that the URL you submit is http and not https).
Does the link lead to an HTML page describing the exploratory analysis of the training data set? 0: no, the link does not lead to a document describing the exploratory analysis 1: yes, the link leads to a document describing the exploratory analysis
Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables? 0: no, the data scientist has not evaluated basic summaries of the data such as word and line counts 1: yes, the data scientist has evaluated basic summaries of the data such as word and line counts
Has the data scientist made basic plots, such as histograms to illustrate features of the data? 0: no, the data scientist has not made basic plots, such as histograms to illustrate features of the data 1: yes, the data scientist has made basic plots, such as histograms to illustrate features of the data
Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate? 0: no, the report is not brief and concise and cannot be understood by a non-data scientist 1: yes, the report is brief and concise, but could not be understood by a non-data scientist 2: yes, the report could be understood by a non-data scientist, but is not brief and concise 3: yes, the report could be understood by a non-data scientist and is brief and concise
An important part of being a data scientist is being able to provide your colleagues with constructive feedback that they can then use to improve their own work. This is the most important evaluation criterion. In the space below, we want you to do just that. Give the data scientist good, useful, and actionable feedback about the strengths of their work and the areas that need improvement. Give them advice about what they can do to take their work to the next level. (25 word minimum)