In this assignment I present the exploratory data analysis performed on the provided data. We have three text datasets: one with Twitter posts, one with blog posts, and one with news texts. I show the exploratory analysis I did in order to understand the data, including the word distributions, the most frequent words, and bi-grams and tri-grams; this is the first step toward developing a text prediction model.
The data was downloaded from the URL provided on the Coursera page of the course. There are three files for each language (English, German, and Finnish); here I will be using the English files, named en_US.blogs, en_US.news, and en_US.twitter.
Below we can see the first six lines of each file (blogs, news, and Twitter, in that order).
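For reference, the files can be loaded with `readLines`; the paths below are an assumption, since the report does not record where the files were stored:

```r
# Load the three English files (paths assume a final/en_US/ directory)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

head(blogs); head(news); head(twitter)
```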
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
In the next step I compute, for each file, the number of lines, the total number of words, and the minimum, maximum, and mean number of words per line. I also computed the percentage of the combined corpus that each file's words would represent if the three files were merged. The table below shows this summary; the last column is the fraction I should sample from each file so that each contributes roughly the same number of words to the merged sample, about 0.07% of the combined data each.
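A sketch of how such a summary can be computed, here using the `stringi` package for word counts (the target of roughly 50,000 words per file is an assumption inferred from the sample sizes further below):

```r
library(stringi)

# Summarize one file: line count, word count, words-per-line stats,
# and the fraction to sample so the file contributes ~target_words words
summarize_file <- function(lines, name, target_words = 50000) {
  wpl <- stri_count_words(lines)  # words per line
  data.frame(File = name, Lines = length(lines), Words = sum(wpl),
             MeanWordsPerLine = mean(wpl), MinWords = min(wpl),
             MaxWords = max(wpl), SampleFraction = target_words / sum(wpl))
}

summary_tbl <- rbind(summarize_file(twitter, "twitter"),
                     summarize_file(news,    "news"),
                     summarize_file(blogs,   "blogs"))
summary_tbl$WordsPercentage <- 100 * summary_tbl$Words / sum(summary_tbl$Words)
```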
| File | Lines | Words | WordsPercentage | MeanWordsPerLine | MinWords | MaxWords | SampleFraction |
|---|---|---|---|---|---|---|---|
| twitter | 2360148 | 31003501 | 43.027720 | 13.13625 | 1 | 47 | 0.0016269 |
| news | 77259 | 2741594 | 3.804878 | 35.48576 | 1 | 1522 | 0.0183974 |
| blogs | 899288 | 38309620 | 53.167402 | 42.59995 | 1 | 6851 | 0.0013166 |
Now I sample from each file according to the fraction in the last column of the table and merge the samples together. The table below summarizes these samples.
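A sketch of the sampling step; each line is kept independently with probability equal to the file's sample fraction (the seed is an arbitrary choice for reproducibility):

```r
set.seed(1234)  # arbitrary seed for reproducibility

# Keep each line with probability equal to the file's sample fraction
sample_lines <- function(lines, fraction) {
  lines[runif(length(lines)) < fraction]
}

sampled <- c(sample_lines(twitter, 0.0016269),
             sample_lines(news,    0.0183974),
             sample_lines(blogs,   0.0013166))
```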
| File | Lines | Words |
|---|---|---|
| twitter | 3839 | 50621 |
| news | 1421 | 49638 |
| blogs | 1183 | 49123 |
Next, I split the resulting sample into two datasets: a training set with 75% of the data and a test set with the remaining 25%.
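The split can be done with a simple random index, along these lines:

```r
# 75/25 train/test split over the sampled lines
train_idx <- sample(seq_along(sampled), size = round(0.75 * length(sampled)))
training  <- sampled[train_idx]
testing   <- sampled[-train_idx]
```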
In the next step, I create the tokens and build a data frame with two columns: the first holds each distinct token and the second the number of times that token appears in the training set. Below are the first 10 tokens in decreasing order of frequency, that is, the 10 most frequent words in the training set.
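A sketch of the tokenization and frequency table, here using the `quanteda` package (the report does not name the library used, so this is one of several reasonable choices):

```r
library(quanteda)

# Tokenize the training lines, lowercase, and count token frequencies
toks <- tokens_tolower(tokens(training, remove_punct = TRUE, remove_numbers = TRUE))
dfm1 <- dfm(toks)

freq <- data.frame(word = featnames(dfm1), freq = colSums(dfm1),
                   row.names = NULL, stringsAsFactors = FALSE)
freq <- freq[order(-freq$freq), ]
head(freq, 10)
```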
| word | freq |
|---|---|
| the | 5012 |
| to | 2863 |
| and | 2584 |
| a | 2517 |
| of | 2084 |
| i | 1841 |
| in | 1725 |
| for | 1197 |
| is | 1170 |
| that | 1079 |
In the next graph, we can see the 20 most frequent words.
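A plot along these lines can be produced with `ggplot2` (the original plotting code is not shown, so this is an illustrative sketch):

```r
library(ggplot2)

# Barplot of the 20 most frequent words
top20 <- head(freq, 20)
ggplot(top20, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "20 most frequent words")
```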
Next, I generate bi-gram tokens. Below are the first ten terms of the bi-gram frequency table, followed by a barplot of the 20 most frequent bi-grams.
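Bi-grams can be built from the same tokens with `tokens_ngrams`, whose default `_` concatenator matches the terms shown below:

```r
# Count bi-gram frequencies (tokens joined with "_" by default)
dfm2  <- dfm(tokens_ngrams(toks, n = 2))
freq2 <- data.frame(word = featnames(dfm2), freq = colSums(dfm2),
                    row.names = NULL, stringsAsFactors = FALSE)
freq2 <- freq2[order(-freq2$freq), ]
head(freq2, 10)
```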
| word | freq |
|---|---|
| of_the | 468 |
| in_the | 426 |
| to_the | 237 |
| on_the | 229 |
| for_the | 224 |
| to_be | 171 |
| at_the | 152 |
| and_the | 132 |
| in_a | 119 |
| it_is | 117 |
I repeat the previous step for tri-grams.
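The same code applies with `n = 3`:

```r
# Count tri-gram frequencies
dfm3  <- dfm(tokens_ngrams(toks, n = 3))
freq3 <- data.frame(word = featnames(dfm3), freq = colSums(dfm3),
                    row.names = NULL, stringsAsFactors = FALSE)
freq3 <- freq3[order(-freq3$freq), ]
head(freq3, 10)
```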
| word | freq |
|---|---|
| a_lot_of | 37 |
| one_of_the | 36 |
| thanks_for_the | 27 |
| it_was_a | 25 |
| i_want_to | 23 |
| to_be_a | 22 |
| i’m_going_to | 22 |
| can’t_wait_to | 20 |
| going_to_be | 18 |
| be_able_to | 17 |
The previous graphs showed the most frequent words, bi-grams, and tri-grams. However, the goal is a model that predicts the word that follows a given word or phrase. To accomplish this, I plan to use a back-off model with smoothing and to measure the model's quality with perplexity. I am also considering pruning the n-gram tables, since I may face memory limitations.
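As a preview of the back-off idea, here is a minimal sketch that predicts the next word from the last two words by falling back from tri-grams to bi-grams to the most frequent unigram. It is an illustration only, not the final model: it omits smoothing, assumes the frequency tables are sorted in decreasing order (as built above), and uses a naive regex lookup that a real model would replace with properly indexed tables.

```r
# Hypothetical sketch of a back-off lookup: tri-gram -> bi-gram -> unigram.
# Assumes freq, freq2, freq3 are sorted by decreasing frequency, and that
# w1 and w2 contain no regex metacharacters (fine for plain words).
predict_next <- function(w1, w2, freq3, freq2, freq) {
  tri <- freq3$word[grepl(paste0("^", w1, "_", w2, "_"), freq3$word)]
  if (length(tri) > 0) return(sub(".*_", "", tri[1]))  # most frequent match
  bi <- freq2$word[grepl(paste0("^", w2, "_"), freq2$word)]
  if (length(bi) > 0) return(sub(".*_", "", bi[1]))
  freq$word[1]  # last resort: the most frequent word overall
}

predict_next("one", "of", freq3, freq2, freq)  # likely returns "the"
```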