Marcel Merchat
December 14, 2016
This project developed a text prediction algorithm that predicts the next word a user is likely to want. Because the emphasis is on speed, all predictions are drawn from data frames: a dictionary called “Gram-1,” two-word phrases in “Gram-2,” and three-word phrases in “Gram-3.” For short sentence fragments of only a few words, the prediction is based on the phrase with the highest frequency of occurrence. If the fragment is long enough to contain at least two non-trivial words, the five most likely words are ranked by how often they appear together with those non-trivial words.
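The highest-frequency lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy tables, their column names (`w1`, `w2`, `word`, `freq`), the function name `predict_next`, and the back-off order from three-word to two-word phrases to the dictionary are all assumptions made for the example.

```r
# Hypothetical toy n-gram tables; the column names are assumptions.
gram1 <- data.frame(word = c("the", "home", "house"),
                    freq = c(50, 7, 5), stringsAsFactors = FALSE)
gram2 <- data.frame(w1 = c("our", "our", "get"),
                    word = c("home", "house", "our"),
                    freq = c(4, 3, 6), stringsAsFactors = FALSE)
gram3 <- data.frame(w1 = c("get", "get"), w2 = c("our", "our"),
                    word = c("home", "money"),
                    freq = c(3, 1), stringsAsFactors = FALSE)

# Return the highest-frequency continuation, backing off from
# three-word phrases to two-word phrases to the dictionary.
predict_next <- function(fragment) {
  words <- strsplit(tolower(trimws(fragment)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  if (n >= 2) {
    # Match the last two words of the fragment against Gram-3.
    hits <- gram3[gram3$w1 == words[n - 1] & gram3$w2 == words[n], ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  if (n >= 1) {
    # Match the last word of the fragment against Gram-2.
    hits <- gram2[gram2$w1 == words[n], ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  # Nothing matched: fall back to the most frequent dictionary word.
  gram1$word[which.max(gram1$freq)]
}

predict_next("when we get our")
predict_next("zebra")
```

Because each step is a filter on a small data frame followed by `which.max`, the lookup stays cheap, which fits the emphasis on speed.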
During development, the more unusual words in the fragment were used to find other associated words. A formula was tuned to return about one hundred such words without regard to their frequency of appearance, since this initial search only had to produce candidates. The candidate words were then ranked by their frequency of occurrence with the non-trivial words in the fragment. For a word like “offense,” a related word like “defense” could usually be found among the candidates. This associated-word search was dropped to obtain faster results, and only candidates from the 1-gram, 2-gram, and 3-gram data files remain in the final program.
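One way to rank candidate words by how often they occur with the non-trivial words of a fragment is sketched below. The toy corpus, the stop-word list, the function names, and the scoring rule (the share of sentences that contain both the candidate and at least one non-trivial word) are assumptions for illustration; the report does not give the actual formula.

```r
# Hypothetical toy corpus, one sentence per element.
sentences <- c(
  "we painted the house before we moved into our new home",
  "the way to our home is long",
  "our home decor is stored away",
  "the team lost its way on offense and defense"
)
# Assumed list of trivial words to ignore in the fragment.
stop_words <- c("the", "we", "our", "to", "is", "its", "on", "and",
                "into", "before")

# Share of sentences containing both the candidate and a key word.
cooccurrence_score <- function(candidate, key_words, sentences) {
  tokens <- strsplit(sentences, "\\s+")
  has_candidate <- vapply(tokens, function(tk) candidate %in% tk, logical(1))
  has_key <- vapply(tokens, function(tk) any(key_words %in% tk), logical(1))
  sum(has_candidate & has_key) / length(sentences)
}

# Score every candidate against the fragment and sort, best first.
rank_candidates <- function(candidates, fragment, sentences) {
  words <- strsplit(tolower(fragment), "\\s+")[[1]]
  key_words <- setdiff(words, stop_words)
  scores <- vapply(candidates, cooccurrence_score, numeric(1),
                   key_words = key_words, sentences = sentences,
                   USE.NAMES = FALSE)
  ranked <- data.frame(Word_Choices = candidates, Odds_Vector = scores,
                       stringsAsFactors = FALSE)
  ranked[order(-ranked$Odds_Vector), ]
}

rank_candidates(c("house", "way", "home"), "our decor stored away", sentences)
```

Restricting the count to words within the same sentence is what makes the earlier one-sentence-per-line preprocessing useful.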
The raw data files were first divided into 1024 segments, and a relatively small number of these segments were randomly selected as the training set used to develop the model. Before the tm package was used to clean the data, all lines of text were divided into sentences so that each line held exactly one sentence. This made it possible to mine related words that occur together in the same sentence.
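The segmentation and random selection could look roughly like the sketch below. The stand-in data, the seed, and the number of segments drawn (`n_train`) are assumptions; the report states only that there were 1024 segments and that the training set was relatively small.

```r
set.seed(2016)  # assumed seed, used only to make the sketch reproducible

# Hypothetical stand-in for the lines of a raw text file.
raw_lines <- sprintf("line %d of raw text", 1:5000)

n_segments <- 1024
# Assign each line to one of 1024 roughly equal contiguous segments.
segment_id <- ceiling(seq_along(raw_lines) * n_segments / length(raw_lines))
segments <- split(raw_lines, factor(segment_id, levels = seq_len(n_segments)))

# Randomly select a small number of segments as the training set.
n_train <- 16  # assumed size; the source does not state how many were used
chosen <- sort(sample(n_segments, n_train))
training <- unlist(segments[chosen], use.names = FALSE)

length(segments)
length(training)
```

Sampling whole segments rather than individual lines keeps neighbouring lines together while still drawing from across the file.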
Long Twitter lines and other long lines were broken up so that the prediction model was based only on phrases and relationships within a sentence. The function get_sentences broke up long lines of raw text into sentences. For example, the final sentence of training[5] was isolated as line 17 and is shown under “Only Sentences” below.
traininga <- get_sentences(training)
training[5]
[1] “so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.”
## Only Sentences
get_sentences(traininga)[17]
[1] “i have all these amazing images stored away ready to come to life when we get our home”
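The source of get_sentences is not shown in this report. A minimal base R sketch that reproduces the output above might look like the following; `split_sentences` and `drop_last_word` are hypothetical helpers, not the author's functions, and the splitting rule (break after `.`, `!`, or `?`) is an assumption that will also split abbreviations such as "e.g.".

```r
# Hypothetical sentence splitter; not the author's get_sentences().
split_sentences <- function(lines) {
  # Split on sentence-ending punctuation followed by whitespace or end of line.
  parts <- unlist(strsplit(lines, "[.!?]+(\\s+|$)"))
  parts <- trimws(parts)
  # Drop empty pieces left over from blank input.
  parts[nzchar(parts)]
}

# Hypothetical helper: remove the final word so it can serve as the target.
drop_last_word <- function(sentence) {
  sub("\\s+\\S+$", "", sentence)
}

line <- paste("so anyways, i am going to share some home decor inspiration",
              "that i have been storing in my folder on the puter.",
              "i have all these amazing images stored away ready to come",
              "to life when we get our home.")
only_sentences <- split_sentences(line)
only_sentences[2]
drop_last_word(only_sentences[2])
```

Dropping the last word of a held-out sentence gives a fragment whose true next word is known, which is how the shortened phrases that follow can be compared against the ranked choices.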
shortened_phrase
[1] “i have all these amazing images stored away ready to come to life when we get our”
print(x1, type="html") ## Ranked Choices
| | Word_Choices | Odds_Vector |
|---|---|---|
| 2 | way | 0.0048 |
| 1 | house | 0.0000 |
shortened_phrase2
[1] “they were all phenomenal for being out in the rain for so long and”
print(x2, type="html") ## Ranked Choices
| | Word_Choices | Odds_Vector |
|---|---|---|
| 3 | hard | 0.0071 |
| 2 | the | 0.0031 |
| 1 | short | 0.0000 |