This is a simple R Markdown file that analyzes three files: a file of blogs called en_US.blogs.txt, a file of tweets called en_US.twitter.txt, and a file of news stories called en_US.news.txt.
In this first assignment, I am mainly learning how to read in the files, tokenize them, and perform basic textual operations. I found key code for these basic tasks in Silge and Robinson, Text Mining with R, particularly pp. 4-6.
I am still trying to determine my ultimate goal in this class, beyond learning the basic mechanics. Much of my goal is simply to improve my R programming and visualization skills. The program was run on 2019-01-27 at 14:20:26.
Now the actual work. I’ll start with the blog dataset. First I use read.delim to read it in. I then use readLines to process it, determine the number of lines and the maximum line length, and show the head of the file. In all the processing below, I have suppressed warning messages for null lines and similar issues.
## individual_line_passage
## 1 We love you Mr. Brown.
## 2 Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
## 3 so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
## 4 With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
## 5 If you have an alternative argument, let's hear it! :)
## line_length
## 1 22
## 2 692
## 3 199
## 4 608
## 5 54
The number of lines in the blogging dataset is 899288. The maximum line length is 40833. (See below for bar graphs.) We will get the number of words a little later when we have the code to do so.
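The line count, maximum line length, and head table reported above can be computed with base R. What follows is only a minimal sketch, not the code that produced this report: the helper name `summarize_lines` and the temporary demo file are assumptions for illustration, and the real input would be en_US.blogs.txt.

```r
# Hypothetical helper: summarize a text file read line by line.
summarize_lines <- function(path) {
  # skipNul = TRUE drops embedded NUL characters; warn = FALSE silences
  # the warning about a missing final end-of-line
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  lengths <- nchar(lines, type = "chars", allowNA = TRUE)
  list(
    n_lines = length(lines),
    max_length = max(lengths, na.rm = TRUE),
    head = data.frame(
      individual_line_passage = head(lines, 5),
      line_length = head(lengths, 5),
      stringsAsFactors = FALSE
    )
  )
}

# Small self-contained demonstration on a temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(
  c("We love you Mr. Brown.", "Short line", "A somewhat longer line of text"),
  demo_path
)
demo_summary <- summarize_lines(demo_path)
print(demo_summary$n_lines)
print(demo_summary$max_length)
print(demo_summary$head)
```

Calling `summarize_lines("en_US.blogs.txt")` would apply the same steps to the full blog file, assuming it sits in the working directory.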
Below I do the same for the Twitter file and for the news file, and find some of the items required for the first quiz. Just for the fun of it, I also counted the tweets referring to President Obama and found 4633.
## individual_line_passage
## 1 When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
## 2 they've decided its more fun if I don't.
## 3 So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)
## 4 Words from a complete stranger! Made my birthday even better :)
## 5 First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!
## line_length
## 1 111
## 2 40
## 3 84
## 4 63
## 5 77
## [1] "GOP line on Obama gay marriage stance seems to be that he flip-flopped. Really want to use that with Romney as your presidential candidate?"
## [2] "#WordsYouWillNeverHearMeSay Obama rocks!"
## [3] "#askObama Ans 7. Promotes Collective Bargaining as the reason for week-ends off, minimum wage. but all have to make adjustments in this econ"
## [4] "Worried about missing the State of the Union so you can get your trivia fix? Don't. We're moving trivia to 7 so you can get your Obama on"
## [5] "get the name right, dumbass! OSAMA!! NOT OBAMA!!!"
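A count of tweets mentioning Obama can be obtained with a pattern match over the lines. The sketch below assumes a case-insensitive match, which is a guess based on the all-caps example in the output above. The helper name `find_mentions` and the three demo tweets are illustrative only.

```r
# Hypothetical helper: return the lines that mention a term, ignoring case.
find_mentions <- function(lines, term) {
  # term is treated as a regular expression; ignore.case makes
  # "Obama", "OBAMA", and "#askObama" all match
  lines[grepl(term, lines, ignore.case = TRUE)]
}

demo_tweets <- c(
  "#WordsYouWillNeverHearMeSay Obama rocks!",
  "First Cubs game ever! Wrigley field is gorgeous.",
  "OSAMA!! NOT OBAMA!!!"
)
obama_tweets <- find_mentions(demo_tweets, "obama")
print(length(obama_tweets))
print(head(obama_tweets, 5))
```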
## individual_line_passage
## 1 The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
## 2 WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
## 3 The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15.
## 4 And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?
## 5 There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day.
## line_length
## 1 153
## 2 177
## 3 498
## 4 246
## 5 200
The number of lines in the Twitter dataset is 2360148. The maximum line length is 140. The number of lines in the news dataset is 1010242. The maximum line length is 11384.
We will get the number of words a little later when we have the code to do so.
Now I really get into producing and analyzing what is called a tokenized dataset, following the general coding approach presented by Silge and Robinson. A tokenized dataset is one that has been divided into small pieces called tokens. In this case the tokens are words. The tokenization process also throws away various nonword characters such as quotation marks.
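As a small illustration of tokenization in the Silge and Robinson style, here is a sketch using `unnest_tokens()` from the tidytext package. The two demo lines and the object names are made up for the example; the real input would be the lines read from the three files.

```r
library(dplyr)
library(tidytext)

# Tiny stand-in for one of the datasets: one row per line of text
demo_text <- tibble(
  line = 1:2,
  text = c("We love you Mr. Brown.", "\"Go Cubs Go!\" she said.")
)

# unnest_tokens() splits the text column into one lowercase word per row
# and strips punctuation such as quotation marks along the way
demo_word_tokens <- demo_text %>%
  unnest_tokens(word, text)

print(demo_word_tokens)
```

Each row of the result holds one word together with the line number it came from, which is the shape the later filtering and counting steps work on.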
After unnesting the tokens, I remove what are called stop words, such as "and", "of", and "the". These are words that are not typically valuable for search.
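Stop word removal is usually done with an anti-join against the `stop_words` table that ships with tidytext, as in Silge and Robinson. The sketch below uses a made-up token table; whether this report used exactly this lexicon is an assumption.

```r
library(dplyr)
library(tidytext)

# Tiny stand-in for a tokenized dataset (one word per row)
demo_stop_input <- tibble(
  word = c("the", "kids", "and", "the", "game", "of", "skylander")
)

# stop_words ships with tidytext; anti_join() keeps only the rows whose
# word does not appear in it
data(stop_words)
demo_content_words <- demo_stop_input %>%
  anti_join(stop_words, by = "word")

print(demo_content_words$word)
```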
I then remove profanity as found in the file https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en.
I didn’t remove every word in that profanity file, since some are standard anatomical or sexual terms that may be of public health interest. Instead, I created a file, HAP_allowed_profane_words.txt, listing the terms I want to retain for later use.
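The filtering with an allow list can be sketched as a set difference followed by an anti-join. The placeholder terms below stand in for the real lists, which would be read from the downloaded profanity file and from HAP_allowed_profane_words.txt; all object names are hypothetical.

```r
library(dplyr)

# Placeholder terms stand in for the real lists, which could be read with
# readLines() so that they arrive as character vectors rather than factors
profane_words <- c("badword1", "badword2", "anatomy1")
allowed_words <- c("anatomy1")

# Keep the allowed terms out of the removal list
blocked <- tibble(word = setdiff(profane_words, allowed_words))

# Tiny stand-in for a tokenized dataset
demo_profanity_input <- tibble(
  word = c("hello", "badword1", "anatomy1", "world")
)

# Drop only the blocked words; allowed terms survive
demo_clean <- demo_profanity_input %>%
  anti_join(blocked, by = "word")

print(demo_clean$word)
```

Keeping both word lists as character vectors also avoids type coercion when the join column is compared.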
Then I plotted the results with ggplot.
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
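For reference, a word-frequency bar chart in the style of the Silge and Robinson example can be sketched as below. This assumes the plot shows the most frequent words, which the text above does not confirm, and the demo tokens are invented.

```r
library(dplyr)
library(ggplot2)

# Tiny stand-in for a cleaned, tokenized dataset
demo_plot_tokens <- tibble(
  word = c("time", "love", "time", "day", "love", "time")
)

# Count occurrences and order the words by frequency for plotting
word_counts <- demo_plot_tokens %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n))

# Horizontal bar chart of word frequencies
word_plot <- ggplot(word_counts, aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")

print(word_plot)
```

On the full datasets, a `filter(n > some_threshold)` step before plotting would keep the chart readable; the threshold would need to be chosen per dataset.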