Introduction

The goal of this project is to explore the provided text data and outline a plan for building a prediction algorithm and Shiny application.
This report summarizes the key characteristics of the data in a concise and non-technical manner.


Loading the Data

Download and unzip the dataset (only if not already present)

if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(
    "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    destfile = "Coursera-SwiftKey.zip",
    mode = "wb"
  )
  unzip("Coursera-SwiftKey.zip")
}

Read the data files

blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
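
The three reads above differ only in the file path, so they could be wrapped in one small helper. The sketch below is an assumption, not part of the original analysis: `read_corpus` is a hypothetical name, and opening the connection in binary mode is a common precaution against a text-mode read stopping early at an embedded control character on some platforms.

```r
# Hypothetical helper (not in the original report): read one corpus file
# through an explicitly opened binary connection.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection, even on error
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained demonstration using a temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), demo_path)
demo_lines <- read_corpus(demo_path)
length(demo_lines)
```

With the dataset unzipped as above, a call such as `read_corpus("final/en_US/en_US.news.txt")` would stand in for the corresponding `readLines()` call.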

# Line and word counts per source; words are approximated by splitting on single spaces
summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(sapply(strsplit(blogs, " "), length)),
    sum(sapply(strsplit(news, " "), length)),
    sum(sapply(strsplit(twitter, " "), length))
  )
)

summary_table
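
The word counts in the table come from splitting each line on a single space, which counts an extra empty token wherever two spaces occur in a row and ignores tabs. A minimal sketch of a stricter count follows, assuming that a "word" is any run of non-whitespace characters; `count_words` is a hypothetical helper, not something the original analysis defines.

```r
# Hypothetical helper: words per line, splitting on runs of whitespace.
count_words <- function(lines) {
  trimmed <- trimws(lines)  # drop leading/trailing whitespace first
  # strsplit() yields character(0) for an empty string, so empty lines count as 0
  lengths(strsplit(trimmed, "[[:space:]]+"))
}

sample_lines <- c("one two  three", "  leading and trailing  ", "")
count_words(sample_lines)
# 3 3 0
```

Replacing the per-source expression with `sum(count_words(blogs))` (and likewise for the other two sources) would give totals under this stricter definition; the figures may differ slightly from the single-space counts.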

# Words per line for each source (same single-space split as the summary table)
blog_lengths    <- sapply(strsplit(blogs, " "), length)
news_lengths    <- sapply(strsplit(news, " "), length)
twitter_lengths <- sapply(strsplit(twitter, " "), length)

hist(blog_lengths, breaks = 50, main = "Blog Line Lengths", xlab = "Words per Line")
hist(news_lengths, breaks = 50, main = "News Line Lengths", xlab = "Words per Line")
hist(twitter_lengths, breaks = 50, main = "Twitter Line Lengths", xlab = "Words per Line")
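
The histograms can be complemented by a few summary statistics per source, which are easier to compare at a glance. The sketch below is illustrative only: `summarize_lengths` is a hypothetical helper and the toy vectors are made-up values, not results from the dataset.

```r
# Hypothetical helper: one row of summary statistics per named length vector.
summarize_lengths <- function(lengths_by_source) {
  stat <- function(f) unname(vapply(lengths_by_source, f, numeric(1)))
  data.frame(
    Dataset = names(lengths_by_source),
    Min     = stat(min),
    Median  = stat(median),
    Mean    = stat(mean),
    Max     = stat(max)
  )
}

# Toy input for demonstration; not real corpus figures
toy_lengths <- list(Blogs = c(5, 40, 12), Twitter = c(3, 8, 14))
summarize_lengths(toy_lengths)
```

With the vectors computed above, the call would be `summarize_lengths(list(Blogs = blog_lengths, News = news_lengths, Twitter = twitter_lengths))`.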