Summary

This report summarizes my exploratory analysis of the blog, news, and Twitter data samples. The three sources vary substantially in line length and in the number of entries. Combined, these data sources are more than sufficient to train a text prediction algorithm, but the full files are too large for a lightweight mobile application. I will need to take a subsample of the data and carefully configure my application to balance memory and storage usage.

Major Features of the Data

Before building a text prediction application, I needed to get a large enough sample of existing text to train my model. I used the blog, news, and Twitter samples already provided for this purpose. Preliminary analysis of these samples suggests that they are large enough and have enough variety of text to train an accurate model. In the first stage of my preliminary analysis, I explored the basic structure of each of the three samples: in particular, how many characters were in each line and how many lines were in each sample.
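
The line-length summaries below were produced along these lines; this is a minimal sketch that assumes the provided samples are stored as en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt, which may differ from the actual file names.

# Read each sample as a character vector of lines (file names assumed)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Summarize the number of characters per line in each source
summary(nchar(blogs))
summary(nchar(news))
summary(nchar(twitter))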

Blog Sample Line Lengths Summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   157.0   231.7   331.0 40835.0

News Sample Line Lengths Summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     111     186     203     270    5760

Twitter Sample Line Lengths Summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0

By mean line length, the blog sample had the longest lines (231.7 characters on average), followed by news (203), then Twitter (68.8).

Twitter had the most lines of text, followed by blogs, then news. Both patterns make sense: Twitter posts are short but frequent, whereas news sources tend to publish longer articles less frequently.
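
The relative number of lines can be checked directly; this sketch reuses the vectors assumed above.

# Count the number of entries (lines) in each source
c(blogs = length(blogs), news = length(news), twitter = length(twitter))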

Finally, I viewed the first six entries from each source to examine some of the text itself (see below). As expected, the blog and Twitter entries included informal language, whereas the news sample was more formal. This variety is good for training a general-purpose prediction model, since it will account for different styles of language. It would not be as good for a context-specific application, such as predicting text inputs for the Twitter app.
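
The entries shown below were obtained with head(), which returns the first six elements of each character vector (again reusing the objects assumed above).

# Display the first six entries from each source
head(blogs)
head(news)
head(twitter)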

Blogs:

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"

News:

## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."

Twitter:

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"

Prediction Algorithm Plans

These three text sources provide a range of length and complexity, as well as varying levels of formality. These characteristics should support training a fairly accurate general-purpose prediction model. However, the size of the text files and the number of lines make it impractical to train the model on the entire sample. To make the application useful for mobile devices, I will need to take a subsample from each of the three sources. I plan to start with 10,000 lines randomly selected from each source.
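
A minimal sketch of the planned subsampling step, assuming the full vectors read in earlier and an arbitrary seed for reproducibility:

# Draw 10,000 lines at random from each source and combine them
set.seed(1234)   # arbitrary seed so the subsample is reproducible
sample_size   <- 10000
training_text <- c(sample(blogs,   sample_size),
                   sample(news,    sample_size),
                   sample(twitter, sample_size))
length(training_text)   # 30,000 entries in total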

After collecting a random sample of 30,000 total entries, I plan to train a Markov chain model based on trigrams. This approach involves breaking the text sample into every sequence of three consecutive words (trigrams) found in the sample. Then, using the observed frequency of each trigram, I will train a model that predicts the next word based on the previous two words a user has already entered. For example, if “I will go” is the most common combination and the user has already entered “I will,” then the model would predict “go” as the most likely next word. This approach should balance speed, accuracy, memory, and storage for the application.
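
As a rough illustration of this approach (not the final implementation), the trigram counting and lookup could be sketched in base R as follows; the tokenization pattern and the predict_next() helper are simplifications assumed for this example.

# Lowercase each entry and split it into word tokens, dropping empty strings
tokens <- lapply(strsplit(tolower(training_text), "[^a-z']+"),
                 function(words) words[words != ""])

# Build every sequence of three consecutive words (trigram) in each entry
make_trigrams <- function(words) {
  n <- length(words)
  if (n < 3) return(character(0))
  paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n])
}
trigrams <- unlist(lapply(tokens, make_trigrams))

# Tabulate trigram frequencies, most common first
trigram_counts <- sort(table(trigrams), decreasing = TRUE)

# Predict the next word from the two words the user has already entered
predict_next <- function(word1, word2) {
  prefix  <- paste(word1, word2, "")   # e.g. "i will "
  matches <- trigram_counts[startsWith(names(trigram_counts), prefix)]
  if (length(matches) == 0) return(NA_character_)
  # The first match is the most frequent; its last word is the prediction
  tail(strsplit(names(matches)[1], " ")[[1]], 1)
}

predict_next("i", "will")   # returns "go" if "i will go" is the most common such trigram

In practice, the lookup table would need to be pruned and stored in a more compact form to respect the memory and storage constraints discussed above.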