This report presents an initial exploratory analysis of the SwiftKey
dataset provided for the Coursera Data Science Capstone.
The goal is to:
This document is written in a simple, business-friendly style so that non-technical stakeholders can understand the progress.
The dataset contains text data from three sources: blogs, news articles, and Twitter.
We use only the English-language (en_US) files for model training.
# Load packages
library(stringi)
library(ggplot2)
library(dplyr)
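# Read the three English-language files, one element per line of text.
# skipNul = TRUE drops embedded NUL characters instead of truncating lines at them.
blogs <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", warn = FALSE, skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, skipNul = TRUE)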
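The three reads above pass a file path directly to readLines, which opens the file in text mode. On some platforms (notably Windows), a text-mode read may stop early if a file contains a stray control character, so it can be worth comparing line counts against a binary-mode read. The sketch below is one possible way to do that. The helper name read_corpus and the temporary demo file are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper (not from the original report): read a text file
# through a binary-mode connection so that stray control characters are
# less likely to cut the read short.
read_corpus <- function(path) {
  con <- file(path, open = "rb")        # open in binary mode
  on.exit(close(con), add = TRUE)       # always close the connection
  readLines(con, warn = FALSE, skipNul = TRUE, encoding = "UTF-8")
}

# Self-contained demonstration on a small temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
demo <- read_corpus(demo_path)
length(demo)   # number of lines read
unlink(demo_path)
```

If this approach were adopted, the same helper could be called with each of the three en_US paths, and the resulting lengths compared with those of blogs, news, and twitter.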