Summary

The goal of this project is to load and explore the corpus data and to show that the work is on track toward a prediction algorithm.

Results to include:

  1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?

  2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?

  3. Has the data scientist made basic plots, such as histograms to illustrate features of the data?

  4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

The first step is an exploratory analysis of the major features of the three English data sets: blogs, news, and twitter. The report includes summary statistics for these data sets and any interesting findings. It also includes an analysis of a sample data set, using methods that will ultimately be applied to the full data set. Additionally, the report briefly summarizes a plan for creating the prediction algorithm and the Shiny app used to present it.

Downloading Data for blogs, news, twitter

Download the data zip file to the working directory: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
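The download step can be sketched as follows. This is a minimal sketch, not the report's own code: the helper name `download_corpus` and the `data` destination directory are assumptions made for illustration, and the call itself is left commented out because the download may take a while.

```r
# Hypothetical helper: download the archive once, then unzip it.
# The function name and dest_dir are assumptions, not from the report.
download_corpus <- function(url, dest_dir = "data") {
  if (!dir.exists(dest_dir)) dir.create(dest_dir, recursive = TRUE)
  zip_path <- file.path(dest_dir, basename(url))
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary zip intact on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  unzip(zip_path, exdir = dest_dir)
  invisible(zip_path)
}

corpus_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download_corpus(corpus_url)  # uncomment to fetch and unpack the archive
```

Skipping the download when the zip file already exists keeps repeated knits of the report from fetching the archive again.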

Identify English Files for blogs, news, twitter

blogs <- file.path(en_dir, "en_US.blogs.txt", fsep = "\\")

news <- file.path(en_dir, "en_US.news.txt", fsep = "\\")

twitter <- file.path(en_dir, "en_US.twitter.txt", fsep = "\\")

Read Files

blogs_content <- readLines(blogs, encoding = "UTF-8", skipNul = TRUE)

news_content <- readLines(news, encoding = "UTF-8", skipNul = TRUE)

twitter_content <- readLines(twitter, encoding = "UTF-8", skipNul = TRUE)

Data Analysis

File Size

file.size(blogs)

file.size(news)

file.size(twitter)

Line Count

blogs_lines <- length(blogs_content)

news_lines <- length(news_content)

twitter_lines <- length(twitter_content)

Word Count

library(stringr)

blogs_string <- paste(blogs_content, collapse = " ")

blogs_wordcount <- str_count(blogs_string, "\\S+")

news_string <- paste(news_content, collapse = " ")

news_wordcount <- str_count(news_string, "\\S+")

twitter_string <- paste(twitter_content, collapse = " ")

twitter_wordcount <- str_count(twitter_string, "\\S+")

Character Count

blogs_char_count <- nchar(blogs_content)

news_char_count <- nchar(news_content)

twitter_char_count <- nchar(twitter_content)
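The line, word, and character counts above can be gathered into a single summary table. The sketch below is one possible way to do that in base R; it runs on toy vectors standing in for `blogs_content`, `news_content`, and `twitter_content`, and the helper name `summarise_source` is an assumption for illustration.

```r
# Toy stand-ins for blogs_content, news_content and twitter_content,
# so the sketch runs without the real files.
toy <- list(
  blogs   = c("a first blog line", "second line"),
  news    = c("one news line"),
  twitter = c("tweet one", "tweet two", "tweet three")
)

# Hypothetical helper: one row per source with line count, word count,
# total characters and the length of the longest line.
summarise_source <- function(lines) {
  words <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    lines      = length(lines),
    words      = sum(words),
    characters = sum(nchar(lines)),
    max_chars  = max(nchar(lines))
  )
}

# Stack the per-source rows; the list names become the row names.
summary_table <- do.call(rbind, lapply(toy, summarise_source))
print(summary_table)
```

Note that `nchar()` returns one value per line, so it has to be summed to get a single character count per file.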

Exploratory Values


Training Data Set

For exploratory purposes, the training data set is drawn as a sample from the blogs, news, and twitter data sets.

library(caret)

Partitioning Sample

Take a sample of 10,000 rows

blogs_data_sample <- blogs_content_df[sample(1:nrow(blogs_content_df),10000), ,drop=FALSE ]

Split the sample into a training set (70%) and a test set (30%)

blogs_train_indices <- sample(1:nrow(blogs_data_sample), size = 0.7 * nrow(blogs_data_sample))

blogs_train_set <- blogs_data_sample[blogs_train_indices, , drop = FALSE]

blogs_test_set <- blogs_data_sample[-blogs_train_indices, , drop = FALSE]

The same process was applied to the news and twitter data sets.

The training data sets for blogs, news, and twitter each contained 7,000 rows.

The test data sets for blogs, news, and twitter each contained 3,000 rows.
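The sampling above calls `sample()` without a fixed seed, so each run draws different rows. The sketch below shows one way to make the 70/30 split repeatable. It uses a toy data frame in place of `blogs_content_df`, and the seed value 1234 is an arbitrary assumption.

```r
# Toy stand-in for blogs_content_df: one text column, 20,000 rows.
toy_df <- data.frame(text = paste("line", 1:20000), stringsAsFactors = FALSE)

set.seed(1234)  # arbitrary seed (an assumption); any fixed value works

# Draw 10,000 rows, then split them 70/30 as described in the report.
sample_df <- toy_df[sample(nrow(toy_df), 10000), , drop = FALSE]
train_idx <- sample(nrow(sample_df), size = round(0.7 * nrow(sample_df)))
train_set <- sample_df[train_idx, , drop = FALSE]
test_set  <- sample_df[-train_idx, , drop = FALSE]

# Row counts of the two partitions.
c(train = nrow(train_set), test = nrow(test_set))
```

Because the test set is built with negative indexing on the training indices, no row can appear in both partitions.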

File Content Values

Interesting Finding

Word frequencies were computed to investigate the data. For example, the blogs data was split into individual words and the number of occurrences of each word was counted.

blogs_words_freq <- data.frame(value = blogs_words) %>% group_by(value) %>% summarize(count = n())

blogs_words_freq <- arrange(blogs_words_freq,desc(count))
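The report does not show how `blogs_words` was produced. One possible approach, assuming a simple split on whitespace, is sketched below on toy input. It uses base R's `table()` in place of the dplyr pipeline so that it runs without extra packages; the names `toy_lines`, `toy_words`, and `toy_freq` are illustrative only.

```r
# Toy stand-in for blogs_content.
toy_lines <- c("the cat and the dog", "the end")

# Split every line on runs of whitespace and flatten to one vector.
toy_words <- unlist(strsplit(toy_lines, "\\s+"))

# Count each distinct word, then sort by descending count.
counts <- table(toy_words)
toy_freq <- data.frame(
  value = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)
toy_freq <- toy_freq[order(-toy_freq$count), ]

head(toy_freq, 3)
```

Splitting on whitespace alone leaves punctuation attached to words, which is consistent with the punctuated tokens that appear in the tail of the blogs table.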

The head of the blogs frequency table shows nothing unexpected: the most frequent words are common English function words.

A tibble: 6 × 2

  value   count
1 the   1666685
2 to    1048248
3 and   1031345
4 of     865177
5 a      861941

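However, the tail of the table contains words that occur only once, many of them with trailing punctuation, mixed case, or long runs of repeated letters. These may need further cleaning as the project continues.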

   value                            count
 1 zzshazz                          1
 2 zzz’s                            1
 3 zzzZZZZZZzzzz                    1
 4 zzzs                             1
 5 zzzz                             1
 6 zzzz’s                           1
 7 zzzzz’s.                         1
 8 zzzzz.                           1
 9 zzzzzzz..                        1
10 zzzzzzzzz………                     1
11 zzzzzzzzzzzzzzzzzzzzzzzzzzz      1
12 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz. 1
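Many of these one-off tokens differ only in case, trailing punctuation, or the number of repeated letters. The sketch below illustrates one possible normalisation on a few tokens copied from the tail. The cleaning rules (lower-casing, trimming punctuation, collapsing runs of three or more identical characters to two) are assumptions for illustration, not steps taken in the report.

```r
# A few tokens copied from the tail of the blogs frequency table.
tail_tokens <- c("zzzZZZZZZzzzz", "zzzs", "zzzz", "zzzzz.", "zzzzzzz..")

# Hypothetical cleaning rule (an assumption, not the report's method):
# lower-case, trim leading/trailing punctuation, then collapse any run
# of three or more identical characters down to two.
normalise_token <- function(x) {
  x <- tolower(x)
  x <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", x, perl = TRUE)
  gsub("(.)\\1{2,}", "\\1\\1", x, perl = TRUE)
}

cleaned <- normalise_token(tail_tokens)
table(cleaned)
```

Under these assumed rules, four of the five sample tokens collapse to the same form, which suggests how much of the long tail may be noise rather than distinct vocabulary.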