First Steps

Analyse Dataset

Tasks to accomplish

Obtaining the data - Can you download the data and load/manipulate it in R?

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Answer: Yes. The data set Coursera-SwiftKey.zip was downloaded from the URL above and unpacked, producing the three English-language files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
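The download and loading step could be sketched in R roughly as follows. This is a minimal sketch, not the exact commands used for this report: the local folder name "data" and the helper names fetch_dataset and read_corpus are assumptions made for illustration. The file is opened in binary mode because text-mode reads can stop early on stray control characters on some platforms.

```r
# Sketch only: data_dir, zip_path and both helper names are illustrative assumptions.
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
data_dir <- "data"
zip_path <- file.path(data_dir, "Coursera-SwiftKey.zip")

# Download the archive only if it is not already present, then unpack it.
fetch_dataset <- function(url, zip_path, exdir) {
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")
  }
  unzip(zip_path, exdir = exdir)
}

# Read a text file line by line through a binary-mode connection.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Example use (the archive is large, so these calls are left commented out):
# fetch_dataset(data_url, zip_path, data_dir)
# en_files <- list.files(data_dir, pattern = "^en_US\\..*\\.txt$",
#                        recursive = TRUE, full.names = TRUE)
# blogs <- read_corpus(grep("blogs", en_files, value = TRUE))
```

Searching for the files with list.files avoids hard-coding the folder layout inside the archive.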

Familiarizing yourself with NLP and text mining

Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.

Questions to consider:

What do the data look like?

The file en_US.blogs.txt contains a collection of text passages, presumably extracted from English-language blogs, judging by the file name. The same applies to en_US.news.txt, a collection of passages from English-language news, and to en_US.twitter.txt, a collection of short messages from Twitter, also in English.
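A quick way to check what a file looks like is to print a few lines and summarise the line lengths. The sketch below uses a small made-up vector in place of a real file, and the helper name summarise_lines is an assumption for illustration.

```r
# Summarise a character vector of text lines (one element per line).
summarise_lines <- function(lines) {
  n_chars <- nchar(lines, type = "chars", allowNA = TRUE)
  data.frame(
    lines        = length(lines),
    longest_line = max(n_chars, na.rm = TRUE),
    mean_chars   = mean(n_chars, na.rm = TRUE)
  )
}

# Made-up stand-in for the contents of one of the en_US files.
sample_lines <- c(
  "Short message!",
  "A somewhat longer sentence of the kind a blog post might contain."
)

head(sample_lines, 2)          # peek at the first lines
summarise_lines(sample_lines)  # line count and length statistics
```

Applied to each of the three files, the same summary would show how much shorter the Twitter lines are than the blog and news lines.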

Where do the data come from?

SwiftKey, a partner of Johns Hopkins University, prepared the dataset for use in the Coursera Capstone Project. It was collected from publicly available sources by a web crawler, in four languages: English, Russian, German and Finnish. This Capstone uses the English data.

Can you think of any other data sources that might help you in this project?

Yes: literary texts or poems, texts with regional vocabulary, and texts from writers who use new words such as slang.

What are the common steps in natural language processing?

Retrieving, cleaning, exploring and processing the data.
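The cleaning and exploring steps can be illustrated with a small base R sketch. The rules chosen here (lower-casing, keeping only the letters a to z, apostrophes and spaces) are one simple assumption among many possible cleaning choices, and the function names are hypothetical.

```r
# Lower-case the text and replace anything that is not a letter,
# apostrophe or space with a space, then collapse repeated whitespace.
clean_text <- function(lines) {
  x <- tolower(lines)
  x <- gsub("[^a-z' ]+", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Split cleaned lines into individual word tokens.
tokenize <- function(lines) {
  tokens <- unlist(strsplit(clean_text(lines), " ", fixed = TRUE))
  tokens[nzchar(tokens)]
}

# Count how often each token occurs, most frequent first.
word_freq <- function(tokens) {
  sort(table(tokens), decreasing = TRUE)
}

tokens <- tokenize(c("Hello, World! It's 2 late.", "Hello again."))
word_freq(tokens)
```

Note that this rule also drops digits and every non-ASCII character, which may or may not be wanted for the real data.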

What are some common issues in the analysis of text data?

Informal texts can be expected to contain slang, foreign words, misspellings and new vocabulary created as the language evolves.

What is the relationship between NLP and the concepts you have learned in the Specialization?

NLP draws directly on the statistical inference and machine learning topics covered in the Specialization. Since the 1990s, much natural language processing research has relied heavily on machine learning. Instead of relying on hand-written rules, the machine-learning paradigm uses statistical inference to learn such rules automatically through the analysis of large corpora of typical real-world examples. Systems based on machine-learning algorithms have many advantages over hand-produced rules.

(Wikipedia)
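
As one simple illustration of learning from a corpus rather than writing rules by hand, the sketch below counts adjacent word pairs and suggests the word seen most often after a given word. It is not taken from the quoted source; the function names and the toy token vector are assumptions, and the dense table used here only suits very small inputs.

```r
# Count adjacent word pairs (bigrams) in a vector of tokens.
bigram_counts <- function(tokens) {
  stopifnot(length(tokens) >= 2)
  first  <- head(tokens, -1)
  second <- tail(tokens, -1)
  table(first, second)
}

# Return the word most often seen after `word`, or NA if `word` never
# appears as the first element of a pair. Ties go to the first column.
predict_next <- function(counts, word) {
  if (!word %in% rownames(counts)) return(NA_character_)
  colnames(counts)[which.max(counts[word, ])]
}

# Toy corpus, already tokenized.
tokens <- c("i", "like", "tea", "and", "i", "like", "coffee",
            "and", "i", "drink", "tea")
counts <- bigram_counts(tokens)
predict_next(counts, "i")  # returns "like"
```

The suggestion comes entirely from counts observed in the data, which is the sense in which such a model is learned rather than hand-coded.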