The purpose of this report is to explain our preliminary analysis and our goals for a possible application and algorithm. It describes the main features of the data that we have identified and briefly outlines our plans for creating a prediction algorithm and a Shiny application. We start with three text documents that contain English sentences collected from tweets, news and blogs. We clean them, explore them and try to understand the language patterns they contain. This is the first step of the Data Science Capstone project, whose ultimate goal is to create predictive models that simplify text entry on various devices.
We do not include the full code here, because the assignment asks for a report that explains our goals to readers without a data-processing background.
We start by loading the data. As already noted, we receive several text documents containing random phrases. In this document, we use only the English files.
The original files are quite large, so we take a 1% sample from each file and combine the three samples into one object, which we use for the rest of the analysis. A 10% sample would be more representative, but processing it takes too much time and computing resources, so a 1% sample is sufficient for the initial analysis.
length(sample_data)
## [1] 33365
head(sample_data)
## [1] "I agree. I think there are plenty of reasons for flight attendants to be unhappy, but none to be impolite. I'm especially troubled by a saying that's used a lot, mostly privately, among flight attendants: \"We're here to save your butt - not kiss it.\""
## [2] "Katselas isn't the only local company letting the riot's emotions bubble to the surface once more. Just a mile and a half away, at the Company of Angels in the Los Feliz-Silverlake area, the nonprofit theater's playwrights group is presenting \"L.A. ViewsV:April 29, 1992,\" a series of eight short plays by different authors all set during the time of the disturbance."
## [3] "This was the second consecutive year Chrysler was among the top 50, which for 2012 were selected from 587 participating companies. Selections are based on data submitted in the 2011 DiversityInc annual top 50 companies for diversity survey. The top 50 for 2012 were announced Tuesday in New York City."
## [4] "One more piece awaits, Otis. You will have to have your mojo working again, and this time it may be more challenging than dumping those bloated contracts. Trading smalls for bigs is never easy."
## [5] "Trace levels of the fungicide carbendazim were discovered in domestic orange juice samples, the Food and Drug Administration reported Thursday. But the FDA said the levels pose no safety risk, and the orange juice will not be recalled."
## [6] "\"This is amazing,\" she said on the phone Monday. \"This is a dream come true.\" We'll be rooting for you, kid."
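For readers who want to see the idea, the sampling step could look roughly like the sketch below. It is not the code used for this report: the helper `sample_lines` and the three toy vectors are hypothetical stand-ins for the real blog, news and tweet files, and only base R is used.

```r
# Hypothetical helper: keep a random fraction of the lines of a character vector.
sample_lines <- function(lines, fraction = 0.01) {
  n_keep <- round(length(lines) * fraction)   # number of lines to keep
  lines[sample.int(length(lines), n_keep)]    # draw without replacement
}

# Toy stand-ins for the three source files (assumption: one phrase per line).
blogs   <- paste("blog line", 1:300)
news    <- paste("news line", 1:200)
twitter <- paste("tweet", 1:500)

set.seed(2024)  # fix the random seed so the sample is reproducible

# Take 1% from each source and combine everything into one object.
sample_data <- c(sample_lines(blogs), sample_lines(news), sample_lines(twitter))
length(sample_data)
```

Sampling each file separately keeps the three sources in the same proportions as in the full data.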
Before starting the analysis, we should clean our data set. First, we remove numbers. We also convert all words to a single case (lower or upper), so that differences in letter case do not affect the search for words. Finally, we remove punctuation, since it is irrelevant for our purposes, and unnecessary spaces, which also reduces the size of the object under study.
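The cleaning steps above can be sketched in base R as follows. The function name `clean_text` is hypothetical and this is only one possible way to do it; in particular, deleting punctuation outright turns a word such as "isn't" into "isnt", which is a simplification.

```r
# Hypothetical helper: apply the cleaning steps to a character vector.
clean_text <- function(x) {
  x <- tolower(x)                        # bring all words to lower case
  x <- gsub("[[:digit:]]+", "", x)       # remove numbers
  x <- gsub("[[:punct:]]+", "", x)       # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)      # collapse repeated whitespace
  trimws(x)                              # drop leading and trailing spaces
}

cleaned <- clean_text("The FDA said: 2 of 587  samples were OK!")
cleaned
# "the fda said of samples were ok"
```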
In essence, this part is an exploratory analysis. After the preliminary processing, we analyze N-grams: sequences of N consecutive words, such as single words, pairs of words (“bigrams”), triples (“trigrams”), and so on. Splitting the corpus into N-grams is called tokenization, and the result is stored in a term-document matrix. The RWeka package contains the tools for this task. For each N-gram length, we count how often every N-gram occurs.
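To illustrate what counting N-grams means, here is a small sketch in base R. It deliberately does not use RWeka, so that it runs without extra packages. The function `ngram_counts` is hypothetical, and the sketch assumes the text has already been cleaned so that words are separated by single spaces.

```r
# Hypothetical helper: count all N-grams of length n in a character vector.
ngram_counts <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, " ", fixed = TRUE), function(words) {
    words <- words[nzchar(words)]                  # ignore empty tokens
    if (length(words) < n) return(character(0))    # phrase too short for an n-gram
    starts <- seq_len(length(words) - n + 1)       # start position of each n-gram
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)            # most frequent n-grams first
}

texts   <- c("the cat sat on the mat", "the cat ran")
bigrams <- ngram_counts(texts, 2)
bigrams[["the cat"]]
# 2
```

N-grams are built within each phrase separately, so the last word of one phrase is never joined with the first word of the next.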
In this document, we have shown how text documents can be processed to identify common turns of phrase. The process itself does not depend on knowing the language. Since we do understand English, we can confirm that it has produced satisfactory results. The next steps are: