Data Science Capstone: Milestone Report

C Innes | February 2020

The data analysed in this project is text data collected from three sources: Twitter, news feeds and blogs. The data has been downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The aim of this project is to produce a model that will predict the next word, given a single word or a string of words. This document details the initial exploratory data analysis and tokenizes the data into n-grams (sequences of 1, 2, 3 or 4 words that occur frequently in the full data set).

Once the data is downloaded and unzipped, we load it into R and start pre-processing.
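For completeness, a minimal sketch of the download step (base R only; the zip extracts to the ./final directory used in the code below):

## Download and unzip the data, if not already present
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
if (!dir.exists("./final")) {
    unzip("Coursera-SwiftKey.zip")
}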

## Load the required packages: tm for text mining (which loads NLP) and ggplot2 for plotting
library(tm)
library(ggplot2)
## First we must create a data connection for the data
## - we're only showing the process for the Twitter data here; the same approach applies to all three files
conT <- file("./final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(conT, skipNul = TRUE)
close(conT)
## Next we sample the data, as the full data set is too large for analysis and will likely cause bottlenecks
## (system resource constraints on my machine prevent processing anything larger than this)
set.seed(999)
## Draw a random 1% sample of the tweets
sampleTwitter <- sample(twitter, round(length(twitter) * 0.01))
## Remove the full data set so we are not holding unnecessary data in memory
rm(twitter)

The above process is repeated for all three data sources.

We then perform some basic summary analysis on the sampled data: namely the word count and record (line) count for each source.
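A minimal sketch of how this summary might have been assembled (the word-count method is an assumption; here words are counted as whitespace-separated tokens):

## Count records (lines) and words per source and collect into a data frame
countWords <- function(x) sum(lengths(strsplit(x, "\\s+")))
df <- data.frame(Source = c("Twitter", "Blogs", "News"),
                 `Word Count` = c(countWords(sampleTwitter),
                                  countWords(sampleBlogs),
                                  countWords(sampleNews)),
                 `Record Count` = c(length(sampleTwitter),
                                    length(sampleBlogs),
                                    length(sampleNews)),
                 check.names = FALSE)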

df
##    Source Word Count Record Count
## 1 Twitter     304790        23601
## 2   Blogs     370780         8992
## 3    News     340276        10102

We can see that Twitter has the largest record count but the lowest word count, while Blogs has the largest word count but the smallest record count. This is unsurprising given the character limit on tweets.

Now that we know a little bit about the data, we can look at splitting it into common words/phrases.

## First we combine all of our sampled data into one data set before converting it into a corpus (a collection of documents)

sampleAll <- c(sampleTwitter, sampleBlogs, sampleNews)

## Remove the individual samples to free up resources
rm(sampleTwitter, sampleBlogs, sampleNews)

Now we assemble this data into a corpus and cleanse it before analysing the word patterns.
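A minimal sketch of this step using the tm package (the exact cleaning steps applied are an assumption, based on common practice for this task):

## Build a corpus from the combined sample and apply standard cleaning steps
corpus <- VCorpus(VectorSource(sampleAll))
corpus <- tm_map(corpus, content_transformer(tolower))  ## lower-case everything
corpus <- tm_map(corpus, removePunctuation)             ## strip punctuation
corpus <- tm_map(corpus, removeNumbers)                 ## strip digits
corpus <- tm_map(corpus, stripWhitespace)               ## collapse repeated whitespace

Stop words are deliberately kept, since common words such as "the" and "of" are exactly what a next-word predictor needs to learn.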

Once cleansing is complete, we look at the frequency charts of these word patterns, produced by a helper function ngramsChart().
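The helper's definition is not shown above; a plausible sketch, assuming the RWeka package for n-gram tokenisation (RWeka requires Java) alongside the tm and ggplot2 packages already loaded:

library(RWeka)  ## assumed dependency, for NGramTokenizer

## Chart the top 10 most frequent n-grams of a given order
ngramsChart <- function(n) {
    tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
    freq <- sort(slam::row_sums(tdm), decreasing = TRUE)[1:10]
    plotData <- data.frame(ngram = names(freq), freq = freq)
    ggplot(plotData, aes(x = reorder(ngram, freq), y = freq)) +
        geom_col() +
        coord_flip() +
        labs(x = paste0(n, "-gram"), y = "Frequency")
}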

Uni-grams (Single Words) Top 10

ngramsChart(1)

Bi-grams (Word-Pairs) Top 10

ngramsChart(2)

Tri-grams (Word-Trios) Top 10

ngramsChart(3)

Next Steps

The next step is to create a model that will predict the next word given a phrase of, say, three words. Once this model is created, I will build a Shiny app that takes an input word/phrase from the user and produces a prediction of the next word.
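As a rough illustration of the intended prediction step (the n-gram frequency tables are assumed to be named numeric vectors, one per n-gram order; this is a sketch, not the final model):

## Predict the next word by matching the last (n-1) words of the input against
## stored n-grams, backing off to shorter n-grams when no match is found
predictNext <- function(phrase, ngramFreqs) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    if (length(words) == 0) return(NA_character_)
    for (n in min(length(words) + 1, 4):2) {
        prefix <- paste(tail(words, n - 1), collapse = " ")
        hits <- ngramFreqs[[n]][startsWith(names(ngramFreqs[[n]]), paste0(prefix, " "))]
        if (length(hits) > 0) {
            best <- names(hits)[which.max(hits)]
            return(sub(".* ", "", best))  ## the final word of the best-matching n-gram
        }
    }
    NA_character_  ## no prediction found
}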