About this report

This report summarizes the second week of work on the Capstone project in the Data Science specialization on Coursera.

The goal of the report is to demonstrate downloading the project data, structuring it for analysis, and cleaning and preparing it. It also presents some summary statistics about the data and my initial plans for the proposed text prediction algorithm and app.

Getting and cleaning the data

The raw data for this project was downloaded to my working directory from the Capstone data set.

This data set contains zipped .txt files for 4 different languages, with 3 data sources for each language: Blogs, News and Twitter. For this project I will work only with the 3 English-language source files.

# File paths (txt_blog, txt_news, txt_twit) are defined in a hidden block
dat_blog <- readLines(txt_blog, skipNul = TRUE)
dat_news <- readLines(txt_news, skipNul = TRUE)
dat_twit <- readLines(txt_twit, skipNul = TRUE)
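One caveat worth checking when loading files this way: on some platforms (Windows in particular), reading in text mode may stop early if a file contains certain control characters, and a line count far below what is expected can be a symptom of this. The sketch below is one possible workaround, not the method used in this report. It opens the file through a binary connection before calling readLines. The helper name read_lines_binary is hypothetical, and the UTF-8 encoding is an assumption about the source files.

```r
# Hypothetical helper: read a text file through a binary connection so that
# embedded control characters are not treated as end-of-file.
# Assumes the file is UTF-8 encoded.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))            # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage, with the same path variables as above:
# dat_news <- read_lines_binary(txt_news)
```

Comparing length(dat_news) from both approaches is a quick way to see whether the file was read completely.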

Description of raw data

Number of lines and words in each document

Source     Number.of.lines   Number.of.words
Blogs              899,288        37,334,441
News                77,259         2,643,972
Twitter          2,360,148        30,373,832
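Counts like those in the table can be computed in a few lines of base R. The following is a minimal sketch rather than the code that produced the figures above: it assumes a word is any run of non-whitespace characters, so its totals may differ from those of another tokenizer. The names count_words and summarise_source are hypothetical.

```r
# Count whitespace-separated tokens across a character vector.
# Assumption: a "word" is any run of non-whitespace characters.
count_words <- function(x) {
  tokens <- strsplit(trimws(x), "\\s+")  # trimws avoids a leading empty token
  sum(lengths(tokens))                   # empty lines contribute 0
}

# One-row summary for a single source.
summarise_source <- function(name, x) {
  data.frame(
    Source          = name,
    Number.of.lines = length(x),
    Number.of.words = count_words(x),
    stringsAsFactors = FALSE
  )
}

# Usage with the vectors loaded earlier:
# stats <- rbind(
#   summarise_source("Blogs",   dat_blog),
#   summarise_source("News",    dat_news),
#   summarise_source("Twitter", dat_twit)
# )
# format(stats, big.mark = ",")
```

Splitting on whitespace keeps the example dependency-free. A dedicated tokenizer would handle punctuation and contractions more carefully.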