Synopsis

Creating a text predictor requires understanding which words are used most often, so that the most probable candidates can be offered as the user starts typing, as well as the relationship between a word and the one that follows it. To obtain this information, three sources were used: Twitter, blogs and news.

The following steps were taken to prepare the data for use:

  1. Downloading and reading the information from the three sources (Twitter, blogs and news); a reading sketch is shown after this list.
  2. Getting to know the raw state of the information and using it to decide how the text should be cleaned so that it can be used.
  3. Cleaning: defining rules and strategies. The complete rules can be found in the Cleaning section of this article.
  4. Exploration: obtaining frequencies of use for single words and two-word phrases. With this information, strategies for reducing data size and balancing accuracy and efficiency were made.
  5. Strategies: deciding how we want the predictor to work, customization, learning, and defining limits, such as words from other languages, slang and profanity.
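
A minimal sketch of steps 1 and 2 in base R is shown below; the file names are assumptions and should be adjusted to wherever the raw files were downloaded.

# Step 1 (sketch): read the three raw sources into character vectors.
# File names below are assumptions; adjust to the actual download location.
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

# Step 2: a first look at the raw state of each source
length(twitter); length(blogs); length(news)
head(twitter, 3)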

Cleaning

Cleaning is an essential part of the process and was divided into two parts:

  1. Remove text formatting, punctuation, extra spaces, etc. This way identical words are grouped correctly; with punctuation, upper case letters and stray spaces, occurrences of the same word can be incorrectly identified as different words.
    1. Convert to lower case the first letter of words at the beginning of phrases, i.e. after a period.
    2. All other upper case words were considered names and were left as is; when the reduction is made, frequently used names will remain and the rest will not. This is an error margin we can accept.
    3. Remove hashtags. They are commonly one or more words joined with no spaces, and they are also of little use because they are only used during a short period of time.
    4. Remove punctuation symbols, keeping periods and apostrophes. We keep apostrophes because we want to predict contractions; most of the remaining apostrophes belong to possessives, usually of names, which will be discarded if not used frequently. Periods will be eliminated after separating phrases, so that the last word of one phrase and the first word of the next, which are not necessarily related, are not paired together.
  2. Remove words and phrases we do not want to include, as a personal or customer preference, and that are not necessary for the predictor to work (see the sketch after this list).
    1. Remove words that are not English.
    2. Remove profanities.
    3. Remove incorrectly used but commonly seen words and phrases.
    4. Combine all three databases or use them separately; this proposal is discussed in the conclusions.
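
A minimal sketch of part 2 is shown below, applied once the text has been split into word tokens (as in the Exploration section). The file badwords.txt is a hypothetical one-word-per-line profanity list, and filtering out words that contain non-ASCII characters is only a rough approximation of removing non-English words.

# Part 2 (sketch): drop unwanted tokens from a vector of words.
profanity <- readLines("badwords.txt", encoding = "UTF-8")   # hypothetical profanity list

drop_unwanted <- function(words) {
  words <- words[!tolower(words) %in% tolower(profanity)]    # 2.2 remove profanities
  words[!grepl("[^ -~]", words)]                             # 2.1 rough non-English filter: drop words with non-ASCII characters
}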

Exploration

After an initial exploration of the raw data, the cleaning requirements and strategies were defined. With the data cleaned, we can then look at its structure and start defining strategies for the prediction.
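
The frequency tables behind the figures below can be obtained with base R alone. The following is only a sketch, assuming clean is the cleaned text vector produced by the code in the Appendix; splitting on periods first keeps two-word phrases from crossing sentence boundaries.

# Sketch: build one-word and two-word frequency tables from the cleaned text.
sentences <- unlist(strsplit(clean, "\\."))            # split on periods so phrases stay inside one sentence
tokens    <- strsplit(trimws(sentences), "\\s+")       # word tokens per sentence
tokens    <- lapply(tokens, function(w) w[nzchar(w)])  # drop empty tokens

unigrams <- sort(table(unlist(tokens)), decreasing = TRUE)

pairs   <- lapply(tokens, function(w) if (length(w) < 2) character(0) else paste(head(w, -1), tail(w, -1)))
bigrams <- sort(table(unlist(pairs)), decreasing = TRUE)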

From the news database we found the following:

The most common words are shown in the figure.

Of the 337101 distinct words found, the 10431 most frequent (3.09%) account for 90% of all word occurrences. We can therefore reduce the database by removing the remaining 96.91% of words, the least frequent ones, while still covering 90% of the text.
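
The 90% figure can be reproduced from the unigram table of the previous sketch by looking at the cumulative share of word occurrences covered by the most frequent words (again only a sketch, assuming the unigrams table built above):

# Sketch: how many of the most frequent words cover 90% of all occurrences?
coverage   <- cumsum(unigrams) / sum(unigrams)
n_keep     <- which(coverage >= 0.90)[1]         # number of words needed for 90% coverage
n_keep / length(unigrams)                        # fraction of the vocabulary we keep
keep_words <- names(unigrams)[seq_len(n_keep)]   # reduced word list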

The most common two-word phrases are:

In the same way, 49.6% of the distinct two-word phrases account for 90% of the phrase occurrences. Removing the remaining phrases shrinks the database by 50.4%, which should speed up the calculation while keeping 90% coverage.

After the prediction model is built, these coverage thresholds may change as we seek the best trade-off between accuracy and speed.

From the Twitter database we found the following:

The most common words are shown in the figure.

Of the 468438 distinct words found, the 8864 most frequent (1.89%) account for 90% of all word occurrences. We can therefore reduce the database by removing the remaining 98.11% of words, the least frequent ones, while still covering 90% of the text.

The most common two-word phrases are:

In the same way, 56.22% of the distinct two-word phrases account for 90% of the phrase occurrences. Removing the remaining phrases shrinks the database by 43.78%, which should speed up the calculation while keeping 90% coverage.

After the prediction model is built, these coverage thresholds may change as we seek the best trade-off between accuracy and speed.

From the blogs database we found the following:

The most common words are shown in the figure.

Of the 426776 distinct words found, the 9019 most frequent (2.11%) account for 90% of all word occurrences. We can therefore reduce the database by removing the remaining 97.89% of words, the least frequent ones, while still covering 90% of the text.

The most common two-word phrases are:

In the same way, 46.48% of the distinct two-word phrases account for 90% of the phrase occurrences. Removing the remaining phrases shrinks the database by 53.52%, which should speed up the calculation while keeping 90% coverage.

After the prediction model is built, these coverage thresholds may change as we seek the best trade-off between accuracy and speed.

Conclusions

After cleaning the information and doing some exploration, we can define our strategy for building the predictor.

  1. Remove the less frequently used words and phrases from each database, keeping 90% coverage.
  2. Predict contractions and common names (words with capital letters).
  3. Remove profanities.
  4. Measure the accuracy of two-word phrase predictions.
  5. Test improvements against the initial accuracy:
    • Three-word phrases.
    • Searching for a database of correctly spelled words, in case a considerable number of mistakes is found.
  6. Because we cannot show the user more than 3 or 4 options (the ones with the highest probabilities), the final database will keep only the 5 most frequent phrases that start with each word (see the sketch after this list).
  7. The predictor must learn the user's common words and phrases (as a user choice).
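
A sketch of strategy 6 is shown below, assuming the bigrams table built in the Exploration section; it keeps only the five most frequent second words for each first word.

# Strategy 6 (sketch): keep at most 5 continuations per first word.
bigram_df <- data.frame(
  first  = sub(" .*", "", names(bigrams)),
  second = sub(".* ", "", names(bigrams)),
  freq   = as.integer(bigrams)
)
bigram_df <- bigram_df[order(bigram_df$first, -bigram_df$freq), ]
top5 <- do.call(rbind, lapply(split(bigram_df, bigram_df$first), head, 5))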

Because many keyboards and predictors can already be found on the market, we will offer a new characteristic: this predictor will allow the user to choose among three different databases.

  1. News - formal language, correct spelling, grammar structure.
  2. Blogs - everyday language with correct grammar structure; commonly used words.
  3. Twitter - informal, daily-use language.
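
One way this source selection could look in code is sketched below; the function name and the per-source tables (top5_news, top5_blogs, top5_twitter) are assumptions, standing for the reduced bigram tables built separately from each source.

# Sketch: let the user choose which source the predictor draws from.
predict_next <- function(word, source = c("news", "blogs", "twitter")) {
  source <- match.arg(source)
  tab <- switch(source, news = top5_news, blogs = top5_blogs, twitter = top5_twitter)
  head(tab$second[tab$first == word], 4)   # show at most 4 suggestions
}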

In some applications, like WhatsApp, the app will select the preferred language level according to the group or user being written to. This will allow a more natural writing experience, adapting to the different activities and relationships people normally have in their lives.

Appendix

A sample of the raw data looks like this:

##  [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
##  [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
##  [3] "they've decided its more fun if I don't."                                                                       
##  [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
##  [5] "Words from a complete stranger! Made my birthday even better :)"                                                
##  [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"                                  
##  [7] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"          
##  [8] "I'm coo... Jus at work hella tired r u ever in cali"                                                            
##  [9] "The new sundrop commercial ...hehe love at first sight"                                                         
## [10] "we need to reconnect THIS WEEK"

The code used for cleaning the data is:
Lower-case letters after a period
clean <- gsub("(\\.)\\s*([[:upper:]])", "\\1 \\L\\2", database, perl = TRUE)  # \L lower-cases the captured letter

Remove hashtags
clean <- gsub("#\\S+", "", clean)

Remove punctuation, except periods and apostrophes
clean <- gsub("[^[:alnum:][:space:]'.]", " ", clean)

Remove numbers
clean <- gsub("[[:digit:]]+", "", clean)
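
For convenience, the snippets above can be chained into a single helper; this is only a sketch of how the steps fit together, applied here to the Twitter sample.

clean_text <- function(database) {
  clean <- gsub("(\\.)\\s*([[:upper:]])", "\\1 \\L\\2", database, perl = TRUE)  # lower case after periods
  clean <- gsub("#\\S+", "", clean)                                             # hashtags
  clean <- gsub("[^[:alnum:][:space:]'.]", " ", clean)                          # punctuation except . and '
  gsub("[[:digit:]]+", "", clean)                                               # numbers
}

head(clean_text(twitter), 3)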
