Milestone Report

Introduction

The aim of this milestone report is to describe the data stored in the files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt and to apply cleaning operations to these data, such as removing punctuation symbols, discarding unnecessary words, and filtering profanity. These results will help to develop a predictive text model.

Basic summaries

We analyze the files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt, which contain text collected from blogs, news and Twitter respectively. The data come from the web site http://www.corpora.heliohost.org/. The following table shows basic information about these files.

File	Number of lines	Size in MB
en_US.blogs.txt	899288	201
en_US.news.txt	1010242	197
en_US.twitter.txt	2360148	160

Because the files are large, we took a sample of 0.5 percent of the lines in each file and wrote these lines to the files sample.en_US.blogs.txt, sample.en_US.news.txt and sample.en_US.twitter.txt. These sample files were then split into 48, 52, and 118 files respectively so that they could be handled more easily by the statistical software R.
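The sampling and splitting step can be sketched as follows. This is an illustrative Python sketch, not the R code used for this report; the function names sample_lines and split_into_chunks, the fixed seed, and the choice to keep each line independently with probability 0.005 are all assumptions.

```python
import random

def sample_lines(lines, fraction=0.005, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

def split_into_chunks(lines, n_chunks):
    """Split a list of lines into n_chunks consecutive, nearly equal parts."""
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `extra` chunks receive one additional line
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks
```

With this approach the sample size is only approximately 0.5 percent of the input, because each line is drawn independently; sampling a fixed number of lines would be an equally reasonable choice.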

Tokenization

The first step in textual data analysis is tokenization. Using the R package tm, the following tasks were applied to the sample data:

Change letters to lowercase
Remove punctuation symbols and numbers
Delete stop words, which are the most common words in a language; removing them is a common task in natural language processing
Remove the newline symbol “\n” and strip extra white space
Erase words that belong to other languages, such as Spanish, Chinese and Arabic
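The tasks above can be sketched in a few lines. The report used the R package tm; the Python sketch below is only an illustration of the same steps. The function name clean_text, the tiny stop-word list, and the use of an ASCII-letters test as a stand-in for removing foreign words are assumptions, and the last one is a crude proxy that would not catch foreign words written with plain Latin letters.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def clean_text(text, stop_words=STOP_WORDS):
    """Apply the cleaning steps listed above and return a list of tokens."""
    text = text.lower()                      # change letters to lowercase
    text = text.replace("\n", " ")           # remove newline symbols
    text = re.sub(r"[0-9]+", " ", text)      # remove numbers
    # remove ASCII punctuation symbols
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                    # splitting also strips extra white space
    # keep only tokens made of the letters a-z (crude proxy for English words)
    tokens = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]
    # delete stop words
    return [t for t in tokens if t not in stop_words]
```

For example, clean_text("The 3 cats,\nand el niño sat in 2014!") returns ["cats", "el", "sat"]: the token "niño" is dropped by the letter filter, while "el" survives because it contains only ASCII letters.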

Profanity filtering

Because the data contain a large number of swear words and other profane language, these words were removed in order to prepare the data for exploratory analysis.
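A minimal sketch of such a filter is shown below, in Python for illustration only. The report does not say which word list was used, so the PROFANITY set here holds mild placeholder entries and the function name remove_profanity is hypothetical; in practice the set would be loaded from a published list of banned words.

```python
# Placeholder entries (assumption); a real list would be loaded from a file.
PROFANITY = {"darn", "heck"}

def remove_profanity(tokens, banned=PROFANITY):
    """Drop every token that appears in the banned-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in banned]
```

Filtering at the token level keeps the rest of each line intact; dropping whole lines that contain a banned word is an alternative that avoids creating artificial word sequences.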

Exploratory analysis

In the graph below, the x-axis is the frequency of contiguous sequences of 1, 2 and 3 words, and the y-axis is the number of sequences that occur with that frequency. The graph clearly shows that the majority of contiguous sequences of words have a low frequency: the count falls steeply as the frequency increases.

Figure 1: Frequency of contiguous sequences of 1, 2 and 3 words in the blogs data
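The quantity plotted in such a graph can be computed as in the following Python sketch, which is illustrative and not the code used for this report. The function names ngrams and frequency_of_frequencies are assumptions; the input is a list of cleaned tokens.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_of_frequencies(tokens, n):
    """Map each frequency to the number of distinct n-word sequences that have it."""
    counts = Counter(ngrams(tokens, n))   # sequence -> how often it occurs
    return Counter(counts.values())       # frequency -> number of sequences
```

For the tokens "a b a b c", the 2-word sequences are (a, b) twice, (b, a) once and (b, c) once, so the result for n = 2 maps frequency 1 to two sequences and frequency 2 to one sequence.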

The frequency distributions of contiguous sequences of 1, 2 and 3 words in the blogs data are shown in Tables 1, 2 and 3 of Appendix A respectively. Words with a frequency between 31 and 678 cover 54% of all word instances. We will take this fact into account and focus on the words that appear most frequently in the blogs data in order to improve the predictive model.
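A coverage figure of this kind can be computed as in the sketch below. It is a Python illustration under the assumption that coverage means the share of all word instances that belong to words whose frequency lies in a given range; the function name coverage and the example sentence are hypothetical.

```python
from collections import Counter

def coverage(counts, low, high):
    """Share of all word instances whose word has a frequency in [low, high]."""
    total = sum(counts.values())
    covered = sum(c for c in counts.values() if low <= c <= high)
    return covered / total if total else 0.0

# Toy example: "the" occurs 3 times out of 8 word instances.
counts = Counter("the cat saw the dog and the bird".split())
print(coverage(counts, 2, 3))  # prints 0.375
```

Applied to the word counts of the sample, the same function with low = 31 and high = 678 would give the kind of percentage quoted above.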

Figure 2 presents the correlations for words with a frequency greater than or equal to 31 and a correlation of at least 0.40. The terms “alon” and “along” appear to be similar; however, the term “alon” is a typo and will be replaced by the correct term “along”.

Figure 2: Visualization of correlations within the blogs data
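The report does not state how the correlations were computed. One common reading, assumed here, is the Pearson correlation between the per-document counts of two terms. The Python sketch below illustrates that reading; the function names term_vector, pearson and correlated_terms are hypothetical, and each document is a list of tokens.

```python
import math

def term_vector(docs, term):
    """Count occurrences of `term` in each tokenized document."""
    return [doc.count(term) for doc in docs]

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # undefined for a constant vector; treated as 0 here
    return cov / (sx * sy)

def correlated_terms(docs, term, vocabulary, min_corr=0.40):
    """Terms in `vocabulary` whose correlation with `term` is at least min_corr."""
    base = term_vector(docs, term)
    result = {}
    for other in vocabulary:
        if other == term:
            continue
        r = pearson(base, term_vector(docs, other))
        if r >= min_corr:
            result[other] = r
    return result
```

The vocabulary passed in would be restricted beforehand to words with a frequency of at least 31, matching the threshold described above.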

Appendices B and C show the corresponding information for the news and twitter data, which have features similar to those of the blogs data. The typo “alon” also appears in these data, so we will fix it by replacing “alon” with “along”. With this, the modelling phase of the data science capstone project is ready to begin.

Appendix A

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,31)	13367	0.9519	0.9519
[31,61)	383	0.0273	0.9792
[61,91)	150	0.0107	0.9899
[91,121)	49	0.0035	0.9934
[121,151)	25	0.0018	0.9952
[151,181)	16	0.0011	0.9963
[181,211)	11	0.0008	0.9971
[211,241)	10	0.0007	0.9978
[241,271)	5	0.0004	0.9981
[271,301)	6	0.0004	0.9986
[301,331)	8	0.0006	0.9991
[331,361)	3	0.0002	0.9994
[361,391)	0	0.0000	0.9994
[391,421)	2	0.0001	0.9995
[421,451)	0	0.0000	0.9995
[451,481)	1	0.0001	0.9996
[481,511)	0	0.0000	0.9996
[511,541)	2	0.0001	0.9997
[541,571)	2	0.0001	0.9999
[571,601)	0	0.0000	0.9999
[601,631)	1	0.0001	0.9999
[631,661)	0	0.0000	0.9999
[661,691)	1	0.0001	1.0000
Sum	14042	1.0000	NA

Table 1: Frequency distribution of contiguous sequences of 1 word in the blogs data

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,6)	91092	0.9970	0.9970
[6,11)	215	0.0024	0.9993
[11,16)	38	0.0004	0.9998
[16,21)	13	0.0001	0.9999
[21,26)	3	0.0000	0.9999
[26,31)	5	0.0000	1.0000
[31,36)	1	0.0000	1.0000
[36,41)	0	0.0000	1.0000
[41,46)	1	0.0000	1.0000
Sum	91368	1.0000	NA

Table 2: Frequency distribution of contiguous sequences of 2 words in the blogs data

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,2)	100545	0.9957	0.9957
[2,3)	394	0.0039	0.9996
[3,4)	26	0.0003	0.9999
[4,5)	12	0.0001	1.0000
[5,6)	1	0.0000	1.0000
Sum	100978	1.0000	NA

Table 3: Frequency distribution of contiguous sequences of 3 words in the blogs data

Appendix B

Figure 1: Frequency of contiguous sequences of 1, 2 and 3 words in the news data

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,31)	13943	0.9536	0.9536
[31,61)	423	0.0289	0.9826
[61,91)	127	0.0087	0.9912
[91,121)	54	0.0037	0.9949
[121,151)	19	0.0013	0.9962
[151,181)	23	0.0016	0.9978
[181,211)	7	0.0005	0.9983
[211,241)	5	0.0003	0.9986
[241,271)	5	0.0003	0.9990
[271,301)	7	0.0005	0.9995
[301,331)	1	0.0001	0.9995
[331,361)	2	0.0001	0.9997
[361,391)	1	0.0001	0.9997
[391,421)	1	0.0001	0.9998
[421,451)	0	0.0000	0.9998
[451,481)	0	0.0000	0.9998
[481,511)	0	0.0000	0.9998
[511,541)	1	0.0001	1.0000
Sum	14619	1.0000	NA

Table 1: Frequency distribution of contiguous sequences of 1 word in the news data

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,6)	88473	0.9970	0.9970
[6,11)	207	0.0023	0.9993
[11,16)	32	0.0004	0.9997
[16,21)	13	0.0001	0.9998
[21,26)	5	0.0001	0.9999
[26,31)	1	0.0000	0.9999
[31,36)	2	0.0000	0.9999
[36,41)	1	0.0000	0.9999
[41,46)	1	0.0000	0.9999
[46,51)	2	0.0000	1.0000
[51,56)	1	0.0000	1.0000
Sum	88738	1.0000	NA

Table 2: Frequency distribution of contiguous sequences of 2 words in the news data

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,2)	97008	0.9952	0.9952
[2,3)	410	0.0042	0.9994
[3,4)	37	0.0004	0.9998
[4,5)	14	0.0001	0.9999
[5,6)	3	0.0000	1.0000
Sum	97472	1.0000	NA

Table 3: Frequency distribution of contiguous sequences of 3 words in the news data

Figure 2: Visualization of correlations within the news data

Appendix C

Figure 1: Frequency of contiguous sequences of 1, 2 and 3 words in the twitter data

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,31)	12334	0.9637	0.9637
[31,61)	251	0.0196	0.9834
[61,91)	77	0.0060	0.9894
[91,121)	45	0.0035	0.9929
[121,151)	20	0.0016	0.9945
[151,181)	17	0.0013	0.9958
[181,211)	11	0.0009	0.9966
[211,241)	8	0.0006	0.9973
[241,271)	5	0.0004	0.9977
[271,301)	3	0.0002	0.9979
[301,331)	6	0.0005	0.9984
[331,361)	2	0.0002	0.9985
[361,391)	5	0.0004	0.9989
[391,421)	3	0.0002	0.9991
[421,451)	3	0.0002	0.9994
[451,481)	1	0.0001	0.9995
[481,511)	0	0.0000	0.9995
[511,541)	1	0.0001	0.9995
[541,571)	1	0.0001	0.9996
[571,601)	1	0.0001	0.9997
[601,631)	1	0.0001	0.9998
[631,661)	0	0.0000	0.9998
[661,691)	0	0.0000	0.9998
[691,721)	3	0.0002	1.0000
Sum	12798	1.0000	NA

Table 1: Frequency distribution of contiguous sequences of 1 word in the twitter data

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,6)	73426	0.9940	0.9940
[6,11)	330	0.0045	0.9984
[11,16)	60	0.0008	0.9992
[16,21)	30	0.0004	0.9996
[21,26)	5	0.0001	0.9997
[26,31)	5	0.0001	0.9998
[31,36)	3	0.0000	0.9998
[36,41)	4	0.0001	0.9999
[41,46)	3	0.0000	0.9999
[46,51)	2	0.0000	0.9999
[51,56)	1	0.0000	1.0000
[56,61)	0	0.0000	1.0000
[61,66)	1	0.0000	1.0000
Sum	73870	1.0000	NA

Table 2: Frequency distribution of contiguous sequences of 2 words in the twitter data

Interval	Freq	Rel.Freq	Rel.Cum.Freq
[1,2)	84686	0.9944	0.9944
[2,3)	395	0.0046	0.9990
[3,4)	53	0.0006	0.9996
[4,5)	14	0.0002	0.9998
[5,6)	6	0.0001	0.9999
[6,7)	2	0.0000	0.9999
[7,8)	1	0.0000	0.9999
[8,9)	1	0.0000	0.9999
[9,10)	2	0.0000	1.0000
Sum	85160	1.0000	NA

Table 3: Frequency distribution of contiguous sequences of 3 words in the twitter data

Figure 2: Visualization of correlations within the twitter data