This report summarizes the second week of work on the Capstone project in the Data Science specialization on Coursera.
The goal of the report is to demonstrate downloading the project data, structuring it for analysis, and cleaning and preparing it, and to present some summary statistics about the data along with my initial plans for the proposed text prediction algorithm and app.
The raw data for this project was downloaded to my working directory from the Capstone data set.
This data set contains zipped .txt files for 4 different languages, with 3 data sources for each language: Blogs, News, and Twitter. For the purpose of this project I will be working with the 3 English-language source files.
# File paths have been defined in a hidden block
dat_blog <- readLines(txt_blog, skipNul = TRUE)
dat_news <- readLines(txt_news, skipNul = TRUE)
dat_twit <- readLines(txt_twit, skipNul = TRUE)
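One thing worth checking with reads like the ones above: on some platforms a text-mode read can stop early if a file contains an embedded control character. Whether that applies to these files is an assumption, not something this report verifies. A minimal sketch of a more defensive read, using a hypothetical helper `read_lines_binary` and a temporary demo file, opens the connection in binary mode first:

```r
# Hypothetical helper (not part of the original report): read every line of a
# text file through a binary-mode connection so that embedded control
# characters are less likely to end the read early.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")   # binary mode
  on.exit(close(con))              # always release the connection
  readLines(con, skipNul = TRUE, warn = FALSE)
}

# Small self-contained demo on a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
unlink(tmp)
```

If the line counts from both approaches agree for `txt_blog`, `txt_news`, and `txt_twit`, the simpler `readLines(path, skipNul = TRUE)` call is sufficient.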
Number of lines and words in each document
| Source | Number of lines | Number of words |
|---|---|---|
| Blogs | 899,288 | 37,334,441 |
| News | 77,259 | 2,643,972 |
| Twitter | 2,360,148 | 30,373,832 |
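
The report does not show how these counts were produced, so the following is only a sketch of one way a table like this could be computed. It assumes a "word" is any whitespace-separated token, and both the helper `count_lines_words` and the tiny `sample_docs` list are hypothetical stand-ins for the real data.

```r
# Hypothetical helper: count lines and whitespace-separated words
# in a character vector (one element per line).
count_lines_words <- function(lines) {
  # Split each line on runs of whitespace
  tokens <- strsplit(trimws(lines), "\\s+")
  # Count non-empty tokens per line, then total them
  n_words <- sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
  c(lines = length(lines), words = n_words)
}

# Illustrative stand-in data, NOT the capstone corpus
sample_docs <- list(
  Blogs   = c("the quick brown fox", "jumps over"),
  News    = c("a single headline"),
  Twitter = c("short tweet", "", "another one here")
)

# One row per source, one column per statistic
summary_tbl <- as.data.frame(do.call(rbind, lapply(sample_docs, count_lines_words)))
names(summary_tbl) <- c("Number.of.lines", "Number.of.words")
print(summary_tbl)
```

With the real data, the same helper could be applied to `list(Blogs = dat_blog, News = dat_news, Twitter = dat_twit)`. A different tokenization rule (for example, stripping punctuation first) would give somewhat different word totals.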