1. Introduction

This report presents an initial exploratory analysis of the SwiftKey dataset provided for the Coursera Data Science Capstone.
The goal is to:

This document is written in a simple, business-friendly style so that non-technical stakeholders can follow the project's progress.


2. Dataset Overview

The dataset contains text data from three sources: blogs, news articles, and Twitter.

We use only the English-language files for model training.

# Load packages
library(stringi)
library(ggplot2)
library(dplyr)
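
# Read the three English-language source files, one element per line of text.
# warn = FALSE silences the incomplete-final-line warning; skipNul = TRUE drops embedded NUL bytes.
blogs   <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt", warn = FALSE, skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, skipNul = TRUE)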
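
Once the files are loaded, a quick size check helps confirm that each one was read in full. The sketch below is one possible way to do this with stringi. It is an illustration only: the helper name `summarise_corpus` and the small `sample_blogs` and `sample_news` vectors are hypothetical stand-ins, not part of the original analysis, and are used so the example runs without the data files.

```r
library(stringi)

# Hypothetical helper (not from the original report): builds a one-row
# summary for a character vector holding one text entry per element.
summarise_corpus <- function(name, lines) {
  data.frame(
    source    = name,
    lines     = length(lines),                 # number of entries
    words     = sum(stri_count_words(lines)),  # total word count
    max_chars = max(stri_length(lines)),       # length of the longest entry
    stringsAsFactors = FALSE
  )
}

# Tiny stand-in vectors (assumed sample text, not real data)
sample_blogs <- c("This is a short blog post.", "Another line of text here.")
sample_news  <- c("Markets rose on Tuesday.")

# Stack the per-source summaries into a single overview table
overview <- rbind(
  summarise_corpus("blogs", sample_blogs),
  summarise_corpus("news", sample_news)
)
print(overview)
```

With the real data, the same helper could be called as `summarise_corpus("blogs", blogs)`, and likewise for `news` and `twitter`, assuming the three vectors were read as shown above.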