Capstone project - week 2 report

About this report

This report summarizes the second week of work on the Capstone project of the Data Science specialization on Coursera.

The goal of the report is to demonstrate downloading the project data, structuring it for analysis, and cleaning and preparing it, and to present some summary statistics about the data along with my initial plans for the proposed text prediction algorithm and app.

Getting and cleaning the data

The raw data for this project was downloaded to my working directory from the Capstone data set.

This data set includes zipped .txt files for 4 different languages, with 3 data sources for each language: Blogs, News, and Twitter. For the purpose of this project I will be working with the 3 English-language source files.

# File paths have been defined in a hidden block
dat_blog <- readLines(txt_blog, skipNul = TRUE)
dat_news <- readLines(txt_news, skipNul = TRUE)
dat_twit <- readLines(txt_twit, skipNul = TRUE)
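The hidden block that defines the file paths is not shown in this report. A minimal sketch of what such a block might look like follows, assuming the archive was unzipped into a `final/en_US` folder under the working directory. The folder name, the file names and the `data_dir`, `paths` and `found` variables are assumptions made for illustration, not taken from the report.

```r
# Hypothetical location of the unzipped English files (an assumption).
data_dir <- file.path("final", "en_US")

# Paths consumed by the readLines() calls; the file names are assumed.
txt_blog <- file.path(data_dir, "en_US.blogs.txt")
txt_news <- file.path(data_dir, "en_US.news.txt")
txt_twit <- file.path(data_dir, "en_US.twitter.txt")

# Check which files are actually present before trying to read them.
paths <- c(Blogs = txt_blog, News = txt_news, Twitter = txt_twit)
found <- file.exists(paths)
```

Checking `found` before calling `readLines()` gives a clearer failure than the error raised when a path is wrong.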

Description of raw data

Number of lines and words in each document

Source	Number of lines	Number of words
Blogs	899,288	37,334,441
News	77,259	2,643,972
Twitter	2,360,148	30,373,832
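The report does not show how these counts were produced. One way such a table could be computed is sketched below, assuming a word is any run of characters separated by whitespace. The `count_lines_words` helper and the tiny `docs` stand-in are hypothetical and are not the real data.

```r
# Count lines and whitespace-separated words in a character vector of lines.
count_lines_words <- function(lines) {
  # Split on runs of whitespace; empty tokens (from leading spaces) are dropped.
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  c(lines = length(lines), words = sum(nzchar(tokens)))
}

# Tiny stand-in documents (not the real data) to show the shape of the result.
docs <- list(
  Blogs   = c("a short blog line", "another one"),
  News    = c("one news line"),
  Twitter = c("tweet one", "tweet two", "  tweet three")
)

# One row per document, one column per count.
summary_tab <- as.data.frame(t(sapply(docs, count_lines_words)))
names(summary_tab) <- c("Number.of.lines", "Number.of.words")
print(summary_tab)
```

On the real data, the same helper would be applied to `dat_blog`, `dat_news` and `dat_twit`. A different definition of a word (for example, after removing punctuation) would give different totals.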

Yoav Pridor

January 7, 2018
