Creating a text predictor requires understanding which words are used most often, so that the predictor achieves a higher probability of success when the user starts typing, as well as the relationship between a word and the one that follows it. To obtain this information, three sources were used: Twitter, blogs, and news.
The following steps were taken to prepare the data for use:
Cleaning is an essential part of the process and was divided into two parts:
After an initial exploration of the raw data, the cleaning requirements and strategies were defined. With the data cleaned, we can examine its structure and start defining strategies for the prediction.
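As a sketch of the loading step, assuming the three corpora are plain-text files with one document per line (the file names below are assumptions, not confirmed by this report):

news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)  # file names assumed
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)

# Explore a random sample first to keep iteration fast
set.seed(123)
twitter_sample <- sample(twitter, 10000)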
From the news database we found the following:
The most common words are shown in the figure.
From a total of 337,101 distinct words found, 10,431 words, or 3.09%, account for 90% of all word occurrences; the remaining 96.91% of low-frequency words can be removed while still covering 90% of the language.
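A minimal sketch of this calculation, assuming news_clean holds the cleaned news text (the same computation is repeated for the Twitter and blogs corpora):

# Split the cleaned text into lowercase word tokens
words <- unlist(strsplit(tolower(news_clean), "\\s+"))
words <- words[words != ""]

# Frequency table sorted from most to least common, plus cumulative coverage
freq     <- sort(table(words), decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)

# Number of distinct words needed to cover 90% of all word occurrences
which(coverage >= 0.9)[1]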
The most common two-word phrases are:
In the same way, 49.6% of the distinct two-word phrases cover 90% of the language used. Removing the remaining 50.4% of phrases reduces the database size by 50.4%, speeding up the calculation while keeping 90% coverage.
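The bigram figure comes from the same kind of calculation; a sketch under the same assumptions (note that naive pasting like this also pairs words across sentence boundaries):

# Pair each word with the one that follows it
bigrams <- paste(head(words, -1), tail(words, -1))

# Fraction of distinct two-word phrases needed for 90% coverage
bfreq     <- sort(table(bigrams), decreasing = TRUE)
bcoverage <- cumsum(bfreq) / sum(bfreq)
which(bcoverage >= 0.9)[1] / length(bfreq)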
Once the prediction model is built, these coverage thresholds may be revisited, seeking the best trade-off between accuracy and speed.
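To make that tuning straightforward, the cutoff can be wrapped in a small helper where the coverage level is a parameter (a sketch; 0.90 is the level assumed throughout this report):

# Fraction of distinct terms that must be kept to reach a given coverage level
keep_fraction <- function(freq, coverage = 0.90) {
  cum <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
  which(cum >= coverage)[1] / length(freq)
}

keep_fraction(table(words), 0.90)   # current setting
keep_fraction(table(words), 0.95)   # more accurate, but a larger database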
From the Twitter database we found the following:
The most common words are shown in the figure.
From a total of 468,438 distinct words found, 8,864 words, or 1.89%, account for 90% of all word occurrences; the remaining 98.11% of low-frequency words can be removed while still covering 90% of the language.
The most common two-word phrases are:
In the same way, 56.22% of the distinct two-word phrases cover 90% of the language used. Removing the remaining 43.78% of phrases reduces the database size by 43.78%, speeding up the calculation while keeping 90% coverage.
Once the prediction model is built, these coverage thresholds may be revisited, seeking the best trade-off between accuracy and speed.
From the blogs database we found the following:
The most common words are shown in the figure.
From a total of 426,776 distinct words found, 9,019 words, or 2.11%, account for 90% of all word occurrences; the remaining 97.89% of low-frequency words can be removed while still covering 90% of the language.
The most common two-word phrases are:
In the same way, 46.48% of the distinct two-word phrases cover 90% of the language used. Removing the remaining 53.52% of phrases reduces the database size by 53.52%, speeding up the calculation while keeping 90% coverage.
Once the prediction model is built, these coverage thresholds may be revisited, seeking the best trade-off between accuracy and speed.
After cleaning the data and doing some exploration, we can define our strategy for building the predictor.
Because many keyboards and predictors are already on the market, we will offer a new feature: this predictor will let the user choose among the three different databases.
In some applications, such as WhatsApp, the app would select the preferred language register according to the group or person being written to. This allows more natural writing, adapting to the different activities and relationships people normally have in their lives.
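As an illustration of the proposed feature, the interface could look roughly like this; predict_next, models, and lookup_next_word are hypothetical names used only for this sketch, not a final API:

# Hypothetical interface: the user chooses which corpus drives the prediction
predict_next <- function(phrase, source = c("news", "twitter", "blogs")) {
  source <- match.arg(source)
  model  <- models[[source]]        # one pre-built n-gram table per corpus
  lookup_next_word(model, phrase)   # assumed helper that queries the table
}

predict_next("how are", source = "twitter")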
A sample of the raw data looks like this:
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [7] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
## [8] "I'm coo... Jus at work hella tired r u ever in cali"
## [9] "The new sundrop commercial ...hehe love at first sight"
## [10] "we need to reconnect THIS WEEK"
The code used to clean the data is:
Lowercase the capital letter that follows a period
clean <- gsub("(\\.)\\s*([[:upper:]])", "\\1 \\L\\2", database, perl = TRUE)
Remove hashtags
clean <- gsub("#\\S+", "", clean)
Remove punctuation, except period and apostrophe
clean <- gsub("[^[:alnum:][:space:]'.]", " ", clean)
Remove numbers
clean <- gsub("[[:digit:]]+", "", clean)
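These steps can be collected into one function so the identical cleaning is applied to each corpus; this is just the pipeline above wrapped for reuse:

clean_text <- function(database) {
  # Lowercase the capital letter that follows a period
  clean <- gsub("(\\.)\\s*([[:upper:]])", "\\1 \\L\\2", database, perl = TRUE)
  # Remove hashtags
  clean <- gsub("#\\S+", "", clean)
  # Remove punctuation, except period and apostrophe
  clean <- gsub("[^[:alnum:][:space:]'.]", " ", clean)
  # Remove numbers
  clean <- gsub("[[:digit:]]+", "", clean)
  clean
}

news_clean <- clean_text(news)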