Please note that the purpose of this document is to fully document the process of reading, cleaning, and modeling the data such that one could reproduce the results of this project, if desired. For a more polished and less detailed writeup, see the short report.
Links: source code | intermediate output files
Note: You may have to open the links in a new tab.
For clarification on any functions that are referred to in this section, see the ‘Functions’ section at the end of the document which contains full details of each function’s input/output values and functionality.
To begin, it seemed most reasonable to get an idea of which words are the most frequently used across each data source (data was collected from blogs, news articles, and Twitter). First, the data sets were read in using R’s native readLines function with the default parameters. filterSpecial was used to remove any punctuation that could otherwise produce inaccurate word counts due to words being attached to a period, semicolon, parenthesis, etc. Initial observations of the data sets showed that there were around 900,000 blog entries, about 1,000,000 news article entries, and nearly 2,400,000 Twitter entries. These data sets are extremely large (on the order of 200 MB+ collectively). To get around memory limitations, n = 20 was used for the blog and news entries in the filterSpecial function, and n = 12 was used for the Twitter entries (this argument tells the function to randomly sample 1/n of the available items). This gave us randomly sampled sets of approximately 45,000 blog entries, 50,000 news articles, and 200,000 Twitter posts. The larger sample size of Twitter posts seemed reasonable, since the average Twitter entry had significantly fewer words than the average entry from either of the other two sources (Blogs: 230, News: 201, Twitter: 69).
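As a rough illustration of this step, the read-and-sample logic might look like the sketch below (this is not the project's filterSpecial implementation; the helper name and the exact punctuation handling are assumptions):

```r
# Minimal sketch of the read-and-sample step (hypothetical helper, not the project's
# filterSpecial code). Assumes the raw .txt files are in the working directory.
set.seed(1234)

read_and_sample <- function(path, n = 10) {
  lines <- readLines(path)                                 # one entry per line
  keep  <- sample(seq_along(lines), length(lines) %/% n)   # keep roughly 1/n of the entries
  gsub("[[:punct:]]", "", lines[keep])                     # strip punctuation before counting
}

# blogs   <- read_and_sample("en_US.blogs.txt",   n = 20)
# news    <- read_and_sample("en_US.news.txt",    n = 20)
# twitter <- read_and_sample("en_US.twitter.txt", n = 12)
```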
Special Note: When reading in the news articles on a Windows 10 machine, an ‘incomplete final line’ error kept appearing. This caused the function to stop reading in the data part of the way through the file. The dimension of the data that had already been read in was used as a reference to determine which line of the source text file caused this error. Upon further observation, unicode U+001A (‘SUB’) characters were detected in lines 77259, 766277, 926143, and 948564 of the en_US.news.txt file. These were manually deleted, and readLines was then able to load the file to completion. This character appeared to be present near or in place of numbers, most notably in entries containing recipes or measurements. It should also be noted that this error only appeared when reading in the data on a Windows 10 machine. Reading the data on OS X Mojave produced no error, and the file was fully loaded by readLines without any special handling. R 3.6.1 was used to read in the data on both operating systems.
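For anyone reproducing this step on Windows, one way to confirm the presence of the SUB bytes before editing the file by hand is sketched below (a hypothetical check, not part of the project code):

```r
# Hypothetical check for the U+001A ('SUB') bytes that halted readLines on Windows.
# Reading the raw bytes avoids the premature end-of-file behavior of text connections.
news_path <- "en_US.news.txt"
raw_bytes <- readBin(news_path, what = "raw", n = file.info(news_path)$size)
sub_positions <- which(raw_bytes == as.raw(0x1A))   # byte offsets of each SUB character
length(sub_positions)                               # how many need to be removed
```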
Upon initial read, each data set was processed using countWords, countTwo, and countMany. countWords created a CSV totaling the number of occurrences of each detected word across all strings in the set. countTwo totaled the number of occurrences of each two word sequence across all strings. countMany totaled the number of occurrences of each n-gram (word sequences of length n) across all strings. countMany was used to find the frequencies of each three, four, and five word sequence. For each type of data, five CSV files were created: ‘Blogs Single Word Counts’, …, ‘Blogs Five Word Counts’, ‘News Single Word Counts’, … . The single word counts were simply stored as a two-column CSV, with the first column containing a word and the second column indicating how many times that word was found. All other counts were stored as three-column CSVs. The first column contains the first n - 1 words of a sequence, the second column contains the final word, and the third column contains the number of times the sequence (column 1 + column 2) was detected. In total, 15 pre-processed CSV files were produced from these function calls. Throughout the rest of this document, words from the single word CSV will be referred to as ‘single words’, word sequences from the first column of any other CSV will be referred to as ‘leading phrases’, and the words in column 2 of these CSVs will be referred to as ‘next words’. ‘Word sequences’ will refer to the combination of a leading phrase with its respective next word. See a sample of the ‘Twitter Five Word Counts’ table below.
| Leading_Phrase | Next_Word | Count |
|---|---|---|
| happy mothers day to | all | 70 |
| cant wait to see | you | 69 |
| thank you so much | for | 63 |
| thanks for the shout | out | 61 |
| thanks so much for | the | 60 |
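For illustration, a compact base-R stand-in for this n-gram counting step is sketched below (the project's countMany uses hash tables instead, as documented in the Functions section; the function and column names here are assumptions):

```r
# Base-R sketch of n-gram counting for n >= 2 (illustrative stand-in for countMany).
count_ngrams <- function(phrases, n = 3) {
  words <- strsplit(tolower(phrases), "\\s+")
  grams <- unlist(lapply(words, function(w) {
    w <- w[w != ""]                                   # drop empty tokens left by leading spaces
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(
    Leading_Phrase = sub(" [^ ]+$", "", names(counts)),  # first n - 1 words
    Next_Word      = sub("^.* ",    "", names(counts)),  # final word
    Count          = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# head(count_ngrams(c("thank you so much for the kind words"), n = 5))
```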
All preprocessed CSV files were then run through the filter function, which combines four different filter functionalities and one modifier. Each of these filters is also stored as a separate .R file, though those standalone files are not used in the final implementation.
removeSingles gets rid of any words or sequences assigned a count of 1. A count of 1 means that, in a sample of over 40,000 sentences and paragraphs, the word or sequence showed up only once. Any items falling under this category are either very uncommon and unlikely to be reproduced by the user, or simply contain ‘words’ that are not actually English words. Hence, it seemed reasonable to apply this filter. This first filter removed 75% to 90% of the entries from each data set, with the effect understandably being more pronounced in the sets with longer word sequences.
removeSymbols does just as its name suggests: It removes entries with words containing any ‘special’ characters that do not fall in the alphanumeric (Aa-Zz, 0-9) range. This helps remove foreign words and other items unlikely to be entered by the user, including emojis. The use of many emojis varies greatly from user to user, and would ultimately be rather difficult to accurately predict.
removeNoAlpha removes any words that contain no characters in the English alphabet (Aa-Zz). This helps to remove any ‘words’ that consist entirely of numbers, as they have little use to us. Unlike removeSymbols, which detects any occurrence of invalid characters whatsoever, this function will allow any string containing at least one alphabetic character through. This could cause the filter to fail to catch, for example, an uncommon word sequence containing ‘6 foot 3 210’ as the leading phrase and ‘pounds’ as the next word, since both the leading phrase and the next word contain at least one alphabetic character. This behavior is necessary, however, to ensure that commonly used phrases containing numbers, such as ‘365 days in a’ → ‘year’, are not filtered out. Any ‘happy medium’ between these all-or-nothing approaches would greatly increase the complexity of this filter while providing minimal benefit to the overall cleaning process.
removeProfanity filters through and removes any entries containing profanity. A list of inappropriate words was taken from the freewebheaders.com ‘Base List of Bad Words in English’. Since the filtration process requires that each entry be searched against every inappropriate word, the time complexity of this operation is O(m*n), where m is the number of entries and n is the number of banned words. Thus, to reduce processing time, a subset of these words was used. Words removed from the set were either very mildly profane or contained character strings that would be detected by other items in the set. The final subset can be found here.
convertContractions restores the apostrophes that were originally removed from the phrases. This was accomplished using a list of common English contractions from textfixer.com.
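To illustrate the kind of checks the symbol and no-alpha filters perform, a minimal sketch is shown below (these are not the project's exact implementations, which live in their own .R files):

```r
# Illustrative one-line equivalents of the removeSymbols and removeNoAlpha tests.
has_symbol <- function(x) grepl("[^a-zA-Z0-9 ]", x)   # removeSymbols: any non-alphanumeric character
no_alpha   <- function(x) !grepl("[a-zA-Z]", x)       # removeNoAlpha: no letters at all

phrases <- c("365 days in a", "6 foot 3 210", "café au lait", "42")
phrases[!has_symbol(phrases) & !no_alpha(phrases)]
# [1] "365 days in a" "6 foot 3 210"
```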
After applying the filter function, the size of each CSV file was reduced by an average of 95%. Overall data was reduced from about 440 MB to 15 MB.
The next step was to merge the data sets of each word/sequence size across all sources. For our purposes, we will assume that the blogs, news, and Twitter source data collectively represent the typical typing pattern of the average individual. The function mergeSetsSingles accomplished this task for the two-column ‘Single Word Counts’ files, while mergeSets did the same for all other files.
Side note: Leading spaces were detected in a large majority of the single words and leading phrases. This was causing issues with the prediction algorithm, so a step to strip leading spaces was added at the end of the mergeSets and mergeSetsSingles functions to fix this issue before the data was written to the CSV file.
The result of the merging process was five ‘Master n Word Counts’ files, each of which contained the merged sums of the counts of n-word sequences across all three sources, for n = 1, 2, 3, 4, 5. The separate (non-merged) CSV files were kept separate for potential later use in a more application-specific text prediction algorithm.
Finally, each of the master data files except for the single words set was run through topThree, a function that, for each single word or leading phrase, determines the three next words that are most likely to follow that word or phrase. Since the data was already sorted by count, this involved simply taking the first three occurrences of each leading phrase and storing the respective next words. The single words set was omitted since it contains no leading phrase-next word sequences to analyze. The analyzed data was stored in a five-column CSV file, with the five columns representing the leading phrase, the word that most commonly followed that word or phrase, the second most common word, the third most common word, and the number of times each of the word sequences appeared (stored as a string of up to 3 numbers). The result is four master data files that map a series of leading words or phrases to the three words that most commonly followed them. A sample table from the four word counts is shown below, and a simplified sketch of this step follows the sample table. Contrary to what is shown in this table, the majority of the entries (~80%) contained only one next word suggestion.
| Leading_Phrase | Most_Common_Next_Word | Second_Most_Common | Third_Most_Common | Counts |
|---|---|---|---|---|
| all going to | die | a | be | 3 2 2 |
| all good things | must | | | 5 |
| all had a | wonderful | great | | 6 5 |
| all hands on | deck | | | 2 |
| all have a | great | wonderful | good | 8 8 2 |
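A minimal sketch of the topThree idea, assuming the merged data frame has columns Leading_Phrase, Next_Word, and Count (column names are assumptions) and is already sorted by Count in descending order:

```r
# Sketch of selecting the top three next words per leading phrase (illustrative only).
top_three <- function(df) {
  pick <- function(sub) {
    sub <- head(sub, 3)                              # rows are already sorted by count
    data.frame(
      Leading_Phrase        = sub$Leading_Phrase[1],
      Most_Common_Next_Word = sub$Next_Word[1],
      Second_Most_Common    = if (nrow(sub) >= 2) sub$Next_Word[2] else "",
      Third_Most_Common     = if (nrow(sub) >= 3) sub$Next_Word[3] else "",
      Counts                = paste(sub$Count, collapse = " "),
      stringsAsFactors = FALSE
    )
  }
  out <- do.call(rbind, lapply(split(df, df$Leading_Phrase), pick))
  out[order(out$Leading_Phrase), ]                   # sorted alphabetically, as topThree does
}
```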
This final step led to an approximately 65% reduction in file sizes. The end result was a set of five files totaling approximately 8 MB.
Once the data was sorted into leading words/phrases with up to three suggested next words, the prediction model could be built. This requires all four of the ‘Top Three’ data sets we previously generated. To allow the model to give quick feedback to the user, each of these data sets is converted into a hash table, with the key being the leading word or phrase, and the value being a list of up to three suggested next words. The Single Word Counts data set will be used to help with the model’s autocorrect features. The four hash tables and the list of single words are combined into a list object that can then be passed to the prediction function.
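A sketch of this conversion step using the hash package is shown below (the actual unpack function may differ; the list element names and Top Three column names are assumptions):

```r
# Build one hash table per 'Top Three' data frame: key = leading phrase,
# value = character vector of up to three suggested next words.
library(hash)

to_hash <- function(topThreeDF) {
  h <- hash()
  for (i in seq_len(nrow(topThreeDF))) {
    suggestions <- c(topThreeDF$Most_Common_Next_Word[i],
                     topThreeDF$Second_Most_Common[i],
                     topThreeDF$Third_Most_Common[i])
    h[[ topThreeDF$Leading_Phrase[i] ]] <- suggestions[suggestions != ""]
  }
  h
}

# hashList <- list(singleWords   = masterSingles[[1]],      # vector of known words
#                  twoWordHash   = to_hash(masterTwos),
#                  threeWordHash = to_hash(masterThrees),
#                  fourWordHash  = to_hash(masterFours),
#                  fiveWordHash  = to_hash(masterFives))
```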
Before any modeling is done, the user input is processed. First, any punctuation is removed. Next, the last 6 words of the user input are run through the autocorrect function. This function first looks for any user-input words that are not present in the ‘Single Word Counts’ data set. Any words not matched are then processed through a series of algorithms that attempt to catch common typing mistakes, including: typing an extra character, omitting a character, swapping two characters, typing an incorrect character in place of a correct one, missing a space between two words, adding additional characters at the start, or adding additional characters at the end. The final two algorithms continuously delete characters off of the front or end of the word until the string either becomes too short or a match is made. This is meant as a fail-safe to ensure that the majority of mistyped words can be matched to some word in the single words set. If a correction algorithm is able to guess which word the user meant to input, this word is inserted into the string in place of the mistyped word. If not, the mistyped word is returned to the input string. Since the method which detects two words that were mistakenly concatenated (by the user missing a space between them) can split one word into two, the length of the returned string may exceed 6 words. The returned string is therefore cut down to its last 6 words, representing the last 6 words the user entered. At this point, the prediction model takes in the string and begins executing.
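To illustrate one of these correction passes, here is a minimal sketch of the swapped-characters check, assuming wordBank is the vector of known single words (the project's actual implementation also skips the first letter pair, as noted in the autocorrect documentation):

```r
# Illustrative swapped-characters correction: try every adjacent swap and keep the
# first candidate that exists in the word bank (the real version skips the first pair).
fix_swapped <- function(word, wordBank) {
  chars <- strsplit(word, "")[[1]]
  for (i in seq_len(length(chars) - 1)) {
    swapped <- chars
    swapped[c(i, i + 1)] <- swapped[c(i + 1, i)]      # swap one adjacent pair
    candidate <- paste(swapped, collapse = "")
    if (candidate %in% wordBank) return(candidate)    # first match wins
  }
  word                                                # no match: keep the original word
}

# fix_swapped("tehre", c("there", "these"))   # returns "there"
```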
This model looks at the four most recent words that the user typed and attempts to match that four word sequence to a leading phrase in the ‘Five Word Counts’ data set. If fewer than 3 suggestions are found, the algorithm looks at the three most recent words and repeats the process with the ‘Four Word Counts’ data set. This continues with the two most recent words and the single most recent word until 3 matches are made. If, by the time the model finishes searching for a match with the most recently used word, no matches have been found, the most recent word is removed, and the new four most recent words are passed on to another call of the function. This process continues until at least one match has been found, or until the function has run recursively two additional times. If available, up to three suggestions are then sent back to the user. The removal of the last word is done in case the user’s last word cannot be matched by the autocorrect function. An unidentified word would obviously not be matched to any leading phrase, thus preventing the model from making any suggestions. Removing the last word allows the model to use previous input to attempt to make a match.
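A back-off sketch of this search, assuming the hashList layout from the unpack sketch above (element names are assumptions; the real predict function also handles autocorrect and optional logging):

```r
# Illustrative back-off search: try 4-, 3-, 2-, then 1-word leading phrases, and
# drop the most recent word and recurse if nothing matches at all.
library(hash)

suggest_next <- function(words, hashList, maxCalls = 3, calls = 0) {
  tables <- list(hashList$fiveWordHash, hashList$fourWordHash,
                 hashList$threeWordHash, hashList$twoWordHash)
  suggestions <- character(0)
  for (k in 4:1) {
    if (length(words) < k) next
    key <- paste(tail(words, k), collapse = " ")
    tab <- tables[[5 - k]]                            # k-word phrases live in the (k+1)-gram table
    if (has.key(key, tab)) suggestions <- unique(c(suggestions, tab[[key]]))
    if (length(suggestions) >= 3) return(head(suggestions, 3))
  }
  if (length(suggestions) == 0 && calls < maxCalls && length(words) > 1) {
    return(suggest_next(head(words, -1), hashList, maxCalls, calls + 1))
  }
  suggestions
}
```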
If you wish to reproduce this project, the consoleInput.R file contains all of the necessary code to reproduce the reading, processing, and analysis of the data. See the consoleInput documentation at the end of the functions section for more details on proper usage. Please note that, with the settings I used, the entire reading, cleaning, and processing pipeline took around 7 and a half hours.
This section contains full documentation on all of the functions that were written and used in processing the data and building a final model.
Functionality:
Reads in a list of strings and removes any punctuation marks or special characters contained in [[:punct:]]. Returns the filtered list of strings. Capable of selecting and filtering a random subset of the original strings. The fractional size of the subset is determined by the argument n.
Parameters:
phrases (required) is a list of strings to be filtered. The data passed to this parameter was obtained using R’s readLines function from a .txt file.
n (default = 10) is an integer that indicates what (1/n) fraction of the strings you would like filtered. n = 1 will process the entire list of available strings. n = 20 will process 1/20, or 5%, of the available strings. The subset of strings is selected using sample with seed 1234.
log (default = TRUE) is a logical indicating whether the progress log should be printed as the function executes. This setting is recommended for larger files, as they may take a while to execute.
Functionality:
Takes a list of strings as input, and counts the total number of occurrences of each ‘word’. In this process, all characters are reduced to lowercase. Returns a data frame with two columns. Column 1 contains a word that was detected in at least one string. Column 2 indicates how many times that word was detected across all strings. Returned data frames are sorted in descending order by count.
Additional Notes:
To improve performance, this function uses a hash table that maps each detected word to the number of times it was detected, using the hash library. The hash table must then be unhashed to construct the final data frame that is returned. Estimated time complexity is O(n*m), where n is the length of the strings list and m is the average length of each string.
Parameters:
phrases (required) is a list of strings to be analyzed.
log (default = TRUE) is a logical indicating whether the progress log should be printed as the function executes. This setting is recommended for larger files, as they may take a considerable amount of time to execute.
Functionality:
Works similarly to countWords. Takes a list of strings as input, and counts the total number of occurrences of each two word sequence observed. In this process, all characters are reduced to lowercase. Returns a data frame with three columns. Column 1 contains the first word in each detected pair. Column 2 contains the second word of the detected pair. Column 3 indicates how many times the word pair was detected. Returned data frames are sorted in descending order by count.
Additional Notes:
This function uses a two-layer hash table method for storing the counts of each pair of words. The primary hash table takes the leading phrases (which appear in column 1 of the output) as keys. Each key maps to a secondary hash table, whose keys are the next words (column 2 of the output). These keys are used to locate the count (column 3 of the output) of the number of observations of the respective word sequence. This two-layer hash functionality was used so that the words could easily be separated into two vectors and returned in the data frame. The alternative would have been to use a single hash table taking a longer string as its key.
Visualization: Primary Hash –First Word (key)–> Secondary Hash –Second Word (key)–> Number of Occurrences of First Word + Second Word
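A minimal sketch of this two-layer increment using the hash package (illustrative only; the real countTwo and countMany add logging and final sorting on top of this):

```r
# Increment the count for a (leading phrase, next word) pair in a two-layer hash.
library(hash)

increment_pair <- function(primary, leading, nextWord) {
  if (!has.key(leading, primary)) primary[[leading]] <- hash()   # create the secondary hash
  secondary <- primary[[leading]]
  if (has.key(nextWord, secondary)) {
    secondary[[nextWord]] <- secondary[[nextWord]] + 1
  } else {
    secondary[[nextWord]] <- 1
  }
}

# counts <- hash()
# increment_pair(counts, "thank you so much", "for")
# increment_pair(counts, "thank you so much", "for")
# counts[["thank you so much"]][["for"]]   # 2
```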
Parameters:
phrases (required) is a list of strings to be analyzed.
log (default = TRUE) is a logical indicating whether the progress log should be printed as the function executes. This setting is recommended for larger files, as they may take a considerable amount of time to execute.
Functionality:
Works similarly to countTwo, but allows for the detection of word sequences of a desired length n. Takes a list of sentence strings as input, and returns the number of occurrences of each detected word sequence of length n. In this process, all words are reduced to lowercase. Outputs a data frame with three columns. Column 1 contains the first n - 1 words (leading phrase) of each detected sequence. Column 2 contains the last word of the detected sequence (next word). Column 3 indicates how many times the word sequence was detected. Returned data frames are sorted in descending order by count.
Additional Notes: Similar to countTwo, this function uses a two-layer hash table implementation. The first hash key is a string containing the leading phrase in the detected sequence, and the second hash key is the next word.
Parameters:
phrases (required) is a list of strings to be analyzed.
n (default = 3) indicates the length of word sequences to be detected.
log (default = TRUE) is a logical indicating whether the progress log should be printed as the function executes. This setting is recommended for larger files, as they may take a considerable amount of time to execute.
Functionality:
A combined filter that removes irrelevant items from the pre-processed data (requires that the raw data has been run through either countWords, countTwo, or countMany). Takes a pre-processed data frame as input and applies the following four filters and one modifier:
1. removeSingles: Removes any words or sequences that appeared only one time (if column 3 “Count” == 1) (column 2 for single word data sets).
2. removeSymbols: Removes any words that contain any non-alphanumeric characters, i.e. anything outside [[:alnum:]] ([A-Za-z0-9]). For data frames with 3 columns (countTwo or countMany), a detection of non-alphanumeric characters in either column 1 or column 2 will remove the entire entry.
3. removeNoAlpha: Removes any ‘words’ that contain no letters whatsoever. Similarly to removeSymbols, if either entry (column 1 or column 2) contains a no-alpha string, the entire entry will be removed. However, if any alphabetic character is present in the leading phrase whatsoever (e.g. ‘6 ft 3 210’), the filter will not remove the entry.
4. removeProfanity: Checks all entries across a list of inappropriate English words, and removes any entries containing an instance of one of these words. See project documentation for details on the list that was used.
5. convertContractions: Adds apostrophes back into any contractions that previously had their punctuation removed. This is accomplished by comparing a list of English contractions to the ‘next words’ column of each data set. See project documentation for details on the list that was used.
Parameters:
data (required) is the preprocessed data frame to be cleaned.
removeSingles (default = TRUE) is a logical indicating whether to use the removeSingles filter.
removeSymbols (default = TRUE) is a logical indicating whether to use the removeSymbols filter.
removeNoAlpha (default = TRUE) is a logical indicating whether to use the removeNoAlpha filter.
removeProfanity (default = TRUE) is a logical indicating whether to use the removeProfanity filter.
convertContractions (default = TRUE) is a logical indicating whether to use the convertContractions modifier.
twoOrMoreWords (default = TRUE) is a logical that should be set to FALSE if and only if single word counts are being cleaned (i.e., a data frame generated by countWords is passed to data). This parameter is needed for compatibility of the function across data frames generated by countWords, which contain only two columns, and those generated by countTwo or countMany, which contain three columns. Incorrect configuration of this parameter may result in an indexing error. Incorrectly setting it to TRUE will cause all of the filters to fail; incorrectly setting it to FALSE will cause removeSingles to fail and cause the other filters to overlook any character inputs in column 2 of the data frame.
Functionality:
Takes three pre-processed data frames of single word counts (resulting from a call to countWords) and merges them into one. Any words that appear in more than one of the frames will have their counts summed in the merged table. Returns a two-column data frame where the first column contains the detected words, and the second column contains their respective counts. Returned data frames are sorted in descending order by count.
Parameters:
data1, data2, data3 (required): Data frames to include in the merge.
Functionality:
Has the same functionality as mergeSetsSingles, but for data frames of word sequences (resulting from a call to countTwo or countMany). Any sequences that appear in more than one of the frames will have their counts summed in the merged table. Returns a three-column data frame where the first column contains the leading phrases, the second column contains the next words, and the third column contains their respective counts. Returned data frames are sorted in descending order by count.
Parameters:
data1, data2, data3 (required): Data frames to include in the merge.
Functionality:
Only works on data frames of word sequences; i.e., it will not work for output produced by countWords. For each sequence in column 1 of the data frame, determines the three words that are most commonly used immediately following that sequence. Returns a data table with five columns. Column 1 contains the sequence for which we would like to predict the next word. Column 2 contains the word that was found to most commonly follow the sequence in column 1. Column 3 contains the second most common word, and column 4 contains the third most common. Column 5 contains up to three space-separated integer values which indicate how many times the matches in columns 2-4 were found when processing the data. Returned data tables are sorted alphabetically by leading phrase.
Functionality:
Takes the Master Single Word Counts data set, along with the other four ‘Top Three’ data sets, and converts them into an object that can be efficiently used by the prediction model. From the Single Word Counts data set, a list of all of the identified words is created. From each of the remaining data sets, a hash table is generated to allow the prediction algorithm to quickly look up items based on the user input. Each hash table uses the leading word/phrase as a key, and stores a list of up to three suggested next words as a value. The returned object is a list containing the list of single words and the four hash tables.
Additional Notes:
The Master Single Word Counts dataset should be the one generated by the call to mergeSetsSingles.
Parameters:
data1, data2, data3, data4, data5 (required): The five Master data sets generated by the processing and filtering functions. It is important that these are passed to the function in the appropriate order: Master Single Word Counts, Master Two Word Counts, Master Three Word Counts, Master Four Word Counts, and Master Five Word Counts. The first should be the resulting data frame from the call to mergeSetsSingles. The last four should be the outputs of calls to topThree.
Functionality:
Takes user text input as a string and suggests some next words for them to enter. First, any punctuation included in the user input is removed. Then, the input is passed on to the autocorrect function, which attempts to fix any typos the user may have made (see below for further details). Only the six most recent words are passed to the autocorrect function. If autocorrect returns more than six words, only the six most recent words are used in the analysis.
The filtered input is then analyzed by the prediction model, which will return up to three suggested next words for the user based on the input they provided. First, the four most recent words entered by the user will be considered. This four word string will be passed to the Five Word Counts hash table, and any matched values will be added to the suggestions list. If fewer than three matches have been made, the three most recent words are then passed to the Four Word Counts hash table, and any matches are again added to the suggestions list. This is repeated for the two most recent and single most recent words. If, by this point, no matches have been found, the most recent word will be removed from the input string, and a recursive call to the predict function will be made to try and find a match with this new string. If the most recent word entered by the user contained major typos that could not be corrected, it would have prevented any matches from being found and thus should be removed. Removing this word allows for matches to be made even in the presence of such an error. This recursive calling continues until either a match is found, or maxCalls has been reached. The suggested values, if present, are then returned to the user.
Parameters:
text (required): A string of text provided by the user.
hashList (required): The object provided by unpack.
maxCalls (default = 3): The maximum number of recursions that may be executed.
mode (default = ‘quick’): The autocorrect mode to be used (see autocorrect documentation for more details).
calls (DO NOT MODIFY): An argument passed during recursion that tracks the number of times that the function has been executed. Modifying this argument is not recommended, and may cause unexpected behavior.
An additional setting within the file, showThinking, prints output to the console showing how the algorithm ‘thinks’. This can be disabled by setting the value to FALSE.
Functionality:
Please note that this is a utility function that was created to improve the readability of the predict function. It is automatically called by predict, and no user calls to the function are necessary.
First checks if any of the words inputted by the user cannot be matched to an entry in the Single Word Counts data set. For any such words, a series of algorithms are applied in an attempt to guess what word the user intended to input. The algorithms applied depend on which mode the function is set to. They are applied in the following order:
Search for missing characters (‘full’ mode only): In case the user missed a letter when typing. Inserts every letter, plus an apostrophe, into each slot of the string and checks the new strings against the data set. This was selected as the first algorithm since incorrectly inserting a letter is not very likely to produce a mistaken match. This is not always executed, as it requires 27*nchar*length(single words set) combination and comparison operations per word.
Search for swapped characters (all modes): In case a user typed two letters out of order. For example, ’tehre’, ‘blikn’. Attempts to switch every adjacent pair of letters except for the first two. The reasoning is that the first letter is not often mistyped, and that attempting to check this may lead to incorrect matches.
Search for mistyped characters (‘full’ mode only): In case a user typed the wrong letter in place of another. Attempts to replace each slot in the string with every letter, plus an apostrophe and a hyphen, and checks each of these strings against the data set. This is not always executed, as it also requires 27*nchar*length(single words set) combination and comparison operations per word.
Search for added characters (all modes): In case the user inserted an extra letter by mistake. Deletes one letter at a time and checks if the resulting string matches with any words in the data set.
Check if two words were concatenated (all modes): In case the user missed a space when typing, will check between all adjacent characters and see if splitting the word into two creates two new valid words. If so, these two new words will be added to the return string.
Delete from the end (all modes): A sort of last resort when none of the other algorithms are able to find a matching word, this algorithm deletes characters off of the end of the word one by one, and checks if any of the resulting words match a value in the data set. Continues until a minimum string length is reached or a match is found.
Delete from the front (all modes): Paired with the previous algorithm, this one deletes characters off of the front of the word one by one, and checks if any of the resulting words match a value in the data set. It also continues until either a minimum string length is reached or a match is found. If no matches are found by either of the two algorithms, the original word is returned to the input string. In the case that both of these algorithms find a match, the tiebreaker is the number of characters that were deleted prior to finding a match, with the lower value winning. A lower number of deleted characters indicates a more complete match. In the case that this is also a tie, the word matched to the single words set with the lower index wins. A lower index means that the word was found more frequently in the training set, and thus should have a higher chance of matching the user’s intended input. (A sketch of these two trimming passes and their tie-break appears below.)
If, after passing through all of these filters, no matches are found, the original ‘word’ will be returned to the input string.
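As an illustration of the two trimming fail-safes and their tie-break, a hedged sketch is given below (function and argument names are assumptions, and wordBank is assumed to be sorted by descending frequency):

```r
# Trim characters from one side of an unmatched word until a known word appears.
trim_match <- function(word, wordBank, minLength = 3, from = c("end", "front")) {
  from <- match.arg(from)
  deleted <- 1
  while (nchar(word) - deleted >= minLength) {
    if (from == "end") {
      stub <- substr(word, 1, nchar(word) - deleted)
    } else {
      stub <- substr(word, 1 + deleted, nchar(word))
    }
    idx <- match(stub, wordBank)
    if (!is.na(idx)) return(list(word = stub, deleted = deleted, index = idx))
    deleted <- deleted + 1
  }
  NULL                                               # no match within the length limit
}

# Tie-break: fewer deleted characters wins, then the lower (more frequent) index.
best_trim <- function(word, wordBank, minLength = 3) {
  a <- trim_match(word, wordBank, minLength, "end")
  b <- trim_match(word, wordBank, minLength, "front")
  if (is.null(a)) return(b)
  if (is.null(b)) return(a)
  if (a$deleted != b$deleted) return(if (a$deleted < b$deleted) a else b)
  if (a$index <= b$index) a else b
}
```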
Parameters:
text (required): The user input string passed from predict. Automatically passed by predict.
wordBank (required): The first argument of the hashList object from predict. Contains all of the words from the single word counts data set. Automatically passed from predict.
minLength (required): The minimum length of a word in order for it to be passed through the final two filters. Automatically passed from predict with a default of 3.
mode (default = ‘quick’): Helps to reduce runtime for longer strings input into predict. The current implementation only supports ‘quick’ and ‘full’ autocorrect modes. ‘quick’ mode does not apply the costly ‘insert letter’ and ‘replace letter’ filters, sacrificing accuracy for speed. ‘full’ mode applies all available filters. ‘full’ mode is used by default by predict.
If you want to try the algorithm on your computer, you just need to make calls to the following two functions.
consoleInput is the ‘do all’ function that takes care of all of the processing and formatting of the entire set of data. It has the following settings (not parameters; these will need to be adjusted within the function body):
1. debugMode (bool, default = FALSE): If set, will only run about 0.05% of the data in an attempt to check whether your system has been set up correctly. It is highly recommended that you run the program once through in debug mode before attempting a full analysis. The process should take about 90 seconds. A full analysis may take several hours depending on the settings used.
2. requireInput (bool, default = FALSE): Will take a break between major analysis cycles (initial reading, filtering, merging, top three sorting). This is an old feature that allowed the user to copy files over between cycles before they were overwritten. Files are now automatically stored in separate folders at each stage of analysis, so it is recommended that this feature be set to FALSE to increase automation and save time.
3. mode (string): Allows you to save and easily switch between different commonly used working directories. The directories can manually be changed in the lines underneath the debugMode section.
Usage: Prior to use, make sure to do the following:
unpack and then loaded into predict. By default, progress logs will be printed to the console as the function runs to completion. For reference, the functions will be called in the following order:
appRun can be used to run the app in RStudio. The only required configuration is setting the function directory at the top of the file. This should be the same function directory supplied to consoleInput. As long as the final data files remain in their default directories with their default names, the program should run without trouble. By default, the prediction algorithm shows you its ‘thought’ process. This can be disabled in the predict.R file at the top, by setting showThinking = FALSE.
These times were logged on a Windows 10 machine with an i5-7600K 7th Gen 3.8 GHz (4.2 GHz turbo) processor and 16 GB DDR4 3000 MHz RAM.
| Process | Time (seconds) |
|---|---|
| 01_Blogs_Read | 2.05 |
| 01_Blogs_Filter | 0.85 |
| 01_Blogs_Single | 24.60 |
| 01_Blogs_Two | 78.61 |
| 01_Blogs_Three | 159.31 |
| 01_Blogs_Four | 297.88 |
| 01_Blogs_Five | 470.21 |
| 01_Blogs_Total | 1033.51 |
| 01_News_Read | 2.22 |
| 01_News_Filter | 0.83 |
| 01_News_Single | 80.17 |
| 01_News_Two | 243.07 |
| 01_News_Three | 401.21 |
| 01_News_Four | 585.21 |
| 01_News_Five | 764.95 |
| 01_News_Total | 2077.66 |
| 01_Twitter_Read | 7.50 |
| 01_Twitter_Filter | 12.00 |
| 01_Twitter_Single | 206.05 |
| 01_Twitter_Two | 706.14 |
| 01_Twitter_Three | 951.47 |
| 01_Twitter_Four | 1159.81 |
| 01_Twitter_Five | 1301.94 |
| 01_Twitter_Total | 4344.91 |
| 01_Total | 7456.08 |
| 02_Blogs | 4926.44 |
| 02_News | 4508.75 |
| 02_Twitter | 5976.17 |
| 02_Total | 15411.36 |
| 03_Blogs | 273.38 |
| 03_News | 250.67 |
| 03_Twitter | 329.87 |
| 03_Total | 853.92 |
| 04_Singles | 38.83 |
| 04_Twos | 343.31 |
| 04_Threes | 317.16 |
| 04_Fours | 129.79 |
| 04_Fives | 39.33 |
| 04_Totals | 868.44 |
| 05_Twos | 179.91 |
| 05_Threes | 255.51 |
| 05_Fours | 139.91 |
| 05_Fives | 49.44 |
| 05_Total | 624.77 |
| Total_Runtime | 25214.58 |
| Data_Class | PreProcessed_01 | Cleaned_02 | Merged_03 | TopThree_Separate_04a | TopThree_Merged_04b |
|---|---|---|---|---|---|
| Blogs 1 WC | 83859 | 34146 | NA | NA | NA |
| News 1 WC | 81574 | 36798 | NA | NA | NA |
| Twitter 1 WC | 104270 | 36721 | NA | NA | NA |
| Total 1 WC | 269703 | 107665 | 60844 | NA | NA |
| Blogs 2 WC | 699723 | 146197 | NA | 14735 | NA |
| News 2 WC | 715086 | 146901 | NA | 16535 | NA |
| Twitter 2 WC | 809216 | 167076 | NA | 13589 | NA |
| Total 2 WC | 2224025 | 460174 | 318151 | 44859 | 25338 |
| Blogs 3 WC | 1377571 | 122231 | NA | 45265 | NA |
| News 3 WC | 1312527 | 104692 | NA | 46938 | NA |
| Twitter 3 WC | 1504954 | 147062 | NA | 51171 | NA |
| Total 3 WC | 4195052 | 373985 | 295258 | 143374 | 102557 |
| Blogs 4 WC | 1630316 | 43113 | NA | 29840 | NA |
| News 4 WC | 1485037 | 33778 | NA | 26261 | NA |
| Twitter 4 WC | 1681374 | 61559 | NA | 40929 | NA |
| Total 4 WC | 4796727 | 138450 | 122213 | 97030 | 80928 |
| Blogs 5 WC | 1653934 | 10586 | NA | 9341 | NA |
| News 5 WC | 1479701 | 9347 | NA | 8624 | NA |
| Twitter 5 WC | 1587137 | 21480 | NA | 18541 | NA |
| Total 5 WC | 4720772 | 41413 | 39279 | 36506 | 33934 |
    R version 3.6.1 (2019-07-05)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 17134)

    Matrix products: default

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] hash_2.2.6.1

    loaded via a namespace (and not attached):
    [1] compiler_3.6.1 tools_3.6.1