The goal of this project is to load and explore the corpus data and to show that the work is on track toward a prediction algorithm.
Results to include:
Does the link lead to an HTML page describing the exploratory analysis of the training data set?
Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
Has the data scientist made basic plots, such as histograms to illustrate features of the data?
Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
The first step is an exploratory analysis of the three English data sets: blogs, news, and twitter. The report includes summary statistics about the major features of the data sets and any interesting findings. It also includes an analysis of a sample data set, with the approach ultimately to be applied to the full data set. Additionally, the report briefly summarizes a plan for creating the prediction algorithm and the Shiny app for presentation.
Download the data zip file to the working directory: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
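The download step can be sketched as follows. This is a minimal sketch, not the code used for the report: the helper name fetch_corpus, the local file name, and the extraction directory are assumptions made for illustration. Only the URL comes from the report.

```r
# URL taken from the report; everything else here is illustrative.
corpus_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Hypothetical helper: download the archive once, then extract it.
fetch_corpus <- function(url, zip_path = basename(url), exdir = ".") {
  # Skip the download when the archive is already present.
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # unzip() returns the paths of the extracted files.
  unzip(zip_path, exdir = exdir)
}

# Example call, left commented out because the archive is large:
# extracted <- fetch_corpus(corpus_url)
```

Checking for the archive first means that re-knitting the report does not download the file again.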
blogs <- file.path(en_dir, "en_US.blogs.txt", fsep = "\\")
news <- file.path(en_dir, "en_US.news.txt", fsep = "\\")
twitter <- file.path(en_dir, "en_US.twitter.txt", fsep = "\\")
blogs_content <- readLines(blogs, encoding = "UTF-8", skipNul = TRUE)
news_content <- readLines(news, encoding = "UTF-8", skipNul = TRUE)
twitter_content <- readLines(twitter, encoding = "UTF-8", skipNul = TRUE)
file.size(blogs)
file.size(news)
file.size(twitter)
blogs_lines <- length(blogs_content)
news_lines <- length(news_content)
twitter_lines <- length(twitter_content)
library(stringr)
blogs_string <- paste(blogs_content, collapse = " ")
blogs_wordcount <- str_count(blogs_string, "\\S+")
news_string <- paste(news_content, collapse = " ")
news_wordcount <- str_count(news_string, "\\S+")
twitter_string <- paste(twitter_content, collapse = " ")
twitter_wordcount <- str_count(twitter_string, "\\S+")
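The line and word counts can be gathered into a single summary table. The sketch below is one possible way to do this, under stated assumptions: toy_corpora is a small stand-in for blogs_content, news_content, and twitter_content, and count_words is a hypothetical base R helper that counts whitespace-separated tokens line by line instead of calling str_count on one pasted string.

```r
# Hypothetical helper: count runs of non-whitespace characters in each line.
count_words <- function(lines) {
  matches <- gregexpr("\\S+", lines)
  # gregexpr() returns -1 for a line with no match, so only positive positions count.
  sum(vapply(matches, function(m) sum(m > 0), integer(1)))
}

# Toy stand-ins for the three character vectors read with readLines().
toy_corpora <- list(
  blogs = c("a short blog post", "another line here"),
  news = c("breaking news today"),
  twitter = c("lol", "so tired zzz")
)

# One row per source, with line and word counts side by side.
corpus_summary <- data.frame(
  source = names(toy_corpora),
  lines = vapply(toy_corpora, length, integer(1)),
  words = vapply(toy_corpora, count_words, integer(1)),
  row.names = NULL
)
corpus_summary$words_per_line <- corpus_summary$words / corpus_summary$lines
print(corpus_summary)
```

Counting per line avoids building one very large string per corpus, which may matter for memory on the full files.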
blogs_char_count <- nchar(blogs_content)
news_char_count <- nchar(news_content)
twitter_char_count <- nchar(twitter_content)
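The per-line character counts lend themselves to a histogram. The following is a minimal sketch on toy data: toy_lines stands in for a vector such as twitter_content, and the bin width of 10 characters is an arbitrary choice for illustration.

```r
# Toy stand-in for twitter_content; the real vector comes from readLines().
toy_lines <- c(
  "lol",
  "so tired zzz",
  "good morning everyone",
  "ok",
  "reading the news on the train this morning"
)
toy_char_count <- nchar(toy_lines)

# hist() draws the plot and also returns the bin counts for inspection.
char_hist <- hist(
  toy_char_count,
  breaks = seq(0, 50, by = 10),
  main = "Characters per line (toy data)",
  xlab = "Characters per line",
  col = "grey80"
)
print(char_hist$counts)
```

On the full data sets the distribution is likely to be strongly skewed, so plotting log10 of the counts may be easier to read.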
For exploratory purposes the training data set is selected as a sample data set from the blogs, news, and twitter data sets.
library(caret)
Partitioning Sample
Take a sample of 10,000 rows
blogs_data_sample <- blogs_content_df[sample(1:nrow(blogs_content_df),10000), ,drop=FALSE ]
The training data set is 70% of the sample and the test data set is the remaining 30%.
blogs_train_indices <- sample(1:nrow(blogs_data_sample), size = 0.7 * nrow(blogs_data_sample))
blogs_train_set <- blogs_data_sample[blogs_train_indices, , drop = FALSE]
blogs_test_set <- blogs_data_sample[-blogs_train_indices, , drop = FALSE]
The same process was applied to the news and twitter data sets.
The training data sets for blogs, news, and twitter each contained 7,000 rows.
The test data sets for blogs, news, and twitter each contained 3,000 rows.
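The sampling and partitioning steps above can be wrapped in a single function so that the same split is applied to each data set. This is a sketch under assumptions: split_sample is a hypothetical helper, toy_df stands in for a data frame such as blogs_content_df, and the fixed seed is an addition for reproducibility that the report does not state.

```r
# Hypothetical helper: sample n_sample rows, then split them into train and test.
split_sample <- function(df, n_sample, train_frac = 0.7, seed = 1234) {
  set.seed(seed)  # assumption: a fixed seed so the split can be reproduced
  sampled <- df[sample(seq_len(nrow(df)), n_sample), , drop = FALSE]
  # round() guards against floating-point results such as 6999.999.
  n_train <- round(train_frac * nrow(sampled))
  train_idx <- sample(seq_len(nrow(sampled)), size = n_train)
  list(
    train = sampled[train_idx, , drop = FALSE],
    test = sampled[-train_idx, , drop = FALSE]
  )
}

# Toy data frame with one text column, standing in for blogs_content_df.
toy_df <- data.frame(text = paste("line", 1:100), stringsAsFactors = FALSE)
parts <- split_sample(toy_df, n_sample = 50)
print(c(train = nrow(parts$train), test = nrow(parts$test)))
```

With the report's settings, a call such as split_sample(blogs_content_df, n_sample = 10000) would yield the 7,000 and 3,000 row sets described above.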
Word frequencies were computed to investigate the data. For example, the blogs data was cleaned and split into words, and the frequency of each word was counted.
library(dplyr)
blogs_words_freq <- data.frame(value = blogs_words) %>% group_by(value) %>% summarize(count = n())
blogs_words_freq <- arrange(blogs_words_freq, desc(count))
The head of the frequency table showed nothing unexpected: the most frequent entries are common English function words.
A tibble: 6 × 2
value count
1 the 1666685
2 to 1048248
3 and 1031345
4 of 865177
5 a 861941
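The top of the frequency table can also be shown as a bar chart. The sketch below re-enters the five counts printed above by hand, so it runs without the full blogs_words_freq table. The column named relative is a hypothetical addition for comparing each word with the most frequent one.

```r
# Counts copied from the head of the frequency table in the report.
top_words <- data.frame(
  value = c("the", "to", "and", "of", "a"),
  count = c(1666685, 1048248, 1031345, 865177, 861941)
)

# Frequency of each word relative to the most frequent word.
top_words$relative <- round(top_words$count / top_words$count[1], 2)

# barplot() draws the chart and returns the bar midpoints.
bar_midpoints <- barplot(
  top_words$count / 1e6,
  names.arg = top_words$value,
  ylab = "Occurrences (millions)",
  main = "Most frequent words in the blogs data"
)
print(top_words)
```

With the full table available, head(blogs_words_freq, 20) could replace the hand-entered data frame.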
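However, the tail provided results of interest that may need further examination as the project continues. It consists of words that occur only once, many of them elongated "zzz" strings with trailing punctuation still attached.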
value count
1 zzshazz 1
2 zzz’s 1
3 zzzZZZZZZzzzz 1
4 zzzs 1
5 zzzz 1
6 zzzz’s 1
7 zzzzz’s. 1
8 zzzzz. 1
9 zzzzzzz.. 1
10 zzzzzzzzz……… 1
11 zzzzzzzzzzzzzzzzzzzzzzzzzzz 1
12 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz. 1
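One possible follow-up to the tail results is an extra cleaning pass before counting. The sketch below is an assumption about how that could look, not the cleaning used in this report: clean_tokens is a hypothetical helper that lowercases the text, drops punctuation and digits, and collapses runs of three or more identical letters to two. It assumes straight apostrophes; curly apostrophes are treated as separators.

```r
# Hypothetical helper: turn raw lines into cleaned word tokens.
clean_tokens <- function(lines) {
  text <- tolower(lines)
  # Replace anything that is not a letter, apostrophe, or whitespace with a space.
  text <- gsub("[^a-z'\\s]", " ", text, perl = TRUE)
  tokens <- unlist(strsplit(text, "\\s+"))
  # Strip apostrophes left at the start or end of a token.
  tokens <- gsub("^'+|'+$", "", tokens)
  # Collapse runs of three or more identical letters to two ("zzzzz" -> "zz").
  tokens <- gsub("([a-z])\\1{2,}", "\\1\\1", tokens, perl = TRUE)
  # Drop empty strings produced by leading whitespace.
  tokens[nzchar(tokens)]
}

# Toy input modeled on the kind of tokens seen in the tail.
toy_tail <- c("zzzzz's. zzzz", "ZZZzzz.. the zzzzzzz")
word_freq <- sort(table(clean_tokens(toy_tail)), decreasing = TRUE)
print(word_freq)
```

Collapsing repeated letters merges many one-off variants into a single token, which shortens the tail. Whether that merging is desirable for the prediction algorithm is a design decision still to be examined.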