In the capstone project we will be applying data science in the area of natural language processing.
This report shows the first steps of the project:
loading the data sets,
getting a first look at the data,
getting a first idea of cleansing and tokenization.
In a later step a model will be built that proposes a list of next-to-be-typed words after a user enters a phrase of text.
The data set originally comes from a corpus called HC Corpora.
This is the training data that will be the basis for most of the capstone. The data must be downloaded from the Coursera site and not from external websites.
The data set contains separate folders for different languages. I’ll focus on the English one, because English is the defined language for the course participants.
Folder: en_US
Files:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
##      File TotNumber_Lines MaxNumber_CharsPerLine AvgNumber_CharsPerLine
## 1    BLOG          899288                  40833              229.98695
## 2    NEWS           77259                   5760              202.42830
## 3 TWITTER         2360148                    140               68.68054
##      File TotNumber_Words MaxNumber_WordsPerLine AvgNumber_WordsPerLine
## 1    BLOG        37334131                   6630               41.51521
## 2    NEWS         2643969                   1031               34.22215
## 3 TWITTER        30373583                     47               12.86936
The first summary shows nothing surprising.
The blog file contains the most data per line.
The twitter file has the shortest lines but the highest number of lines.
The news file lies between blog and twitter with regard to these measures, but a little closer to blog than to twitter.
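For reference, a minimal sketch of how such line and word counts could be computed with base R and the stringi package. The file paths and the use of ‘stri_count_words’ are assumptions, not necessarily the exact code behind the tables above.
library(stringi)

# Sketch: basic line and word statistics for one file
file_stats <- function(path, label) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  words <- stri_count_words(lines)
  data.frame(File = label,
             TotNumber_Lines = length(lines),
             MaxNumber_CharsPerLine = max(chars),
             AvgNumber_CharsPerLine = mean(chars),
             TotNumber_Words = sum(words),
             MaxNumber_WordsPerLine = max(words),
             AvgNumber_WordsPerLine = mean(words))
}

rbind(file_stats("final/en_US/en_US.blogs.txt",   "BLOG"),
      file_stats("final/en_US/en_US.news.txt",    "NEWS"),
      file_stats("final/en_US/en_US.twitter.txt", "TWITTER"))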
For the next steps I use a smaller set of data for performance reasons.
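As a sketch, such a smaller working set could simply be the first lines of each file (100 lines per file, as used further below); the file paths are assumptions.
n_small <- 100
blog_small    <- readLines("final/en_US/en_US.blogs.txt",   n = n_small, encoding = "UTF-8", skipNul = TRUE)
news_small    <- readLines("final/en_US/en_US.news.txt",    n = n_small, encoding = "UTF-8", skipNul = TRUE)
twitter_small <- readLines("final/en_US/en_US.twitter.txt", n = n_small, encoding = "UTF-8", skipNul = TRUE)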
In the next section I will take a closer look at the data to get a better feeling for
Tokenization
Profanity filtering
For this task I create a small example of an entry, e.g. from twitter, which helps me understand what I want and what I can use from the tm package ([1]). The package has a number of predefined transformations which can be used (see ‘getTransformations()’).
Are the results of removing punctuation and numbers as expected?
library(tm)   # provides removePunctuation(), removeNumbers(), removeWords(), stopwords()
testentry <- "Hello, how do you do? I'm fine#. Today (9th 11.) we had good weather 20.5°! You too?"
removePunctuation(testentry)
## [1] "Hello how do you do Im fine Today 9th 11 we had good weather 205 You too"
removeNumbers(testentry)
## [1] "Hello, how do you do? I'm fine#. Today (th .) we had good weather .°! You too?"
From my point of view the ‘removePunctuation’ function is of limited use for me, because some punctuation marks, namely the ones that separate sentences, are important: every sentence is a logical unit, so the last word of one sentence does not determine the first word of the next sentence.
What I really want is to first split the lines in the files into their logical parts and then analyse the words. Removing punctuation might be a solution for a later step after the splitting.
strsplit(testentry, "\\. |\\! |\\? |\\: |\\; ")
## [[1]]
## [1] "Hello, how do you do"
## [2] "I'm fine#"
## [3] "Today (9th 11.) we had good weather 20.5°"
## [4] "You too?"
A second aspect are contractions like ‘I’m’ for ‘I am’ or ‘I’ve’ for ‘I have’. When I use the ‘removePunctuation’ function they are also changed, in a way that might cause the algorithm trouble later.
The stop word list ‘stopwords(“english”)’ in the package does not seem suitable for me, because in combination with ‘removeWords’ it would eliminate important parts, and in combination with ‘removePunctuation’ I was not able to get it to work. So I will use ‘removeWords’ only for my list of ‘bad’ words.
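A minimal sketch of how the profanity filtering with ‘removeWords’ could look; the word list here is only a placeholder, a real list would be loaded from a file.
library(tm)

# Placeholder profanity list; in practice it would be read from a file
badwords <- c("badword1", "badword2")

sentence <- "this badword1 sentence contains a badword2"
removeWords(sentence, badwords)
# the listed words are blanked out; the extra whitespace could be
# cleaned up afterwards with stripWhitespace()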
When I now look again at the first 100 lines of each file with the ‘line-based’ view, I get the following measures:
##      File TotNumber_Words MaxNumber_WordsPerLine AvgNumber_WordsPerLine
## 1    BLOG            4704                    275                  47.04
## 2    NEWS            3222                    142                  32.22
## 3 TWITTER            1275                     28                  12.75
After changing the logical units from lines to sentences, the values of measures like ‘total number of words per line’ change:
##  TotNumber_Words MinNumber_CharPerWord MaxNumber_CharPerWord
##  Min.   : 1.00   Min.   : 0.00         Min.   : 1.00
##  1st Qu.: 4.00   1st Qu.: 1.00         1st Qu.: 7.00
##  Median :10.00   Median : 2.00         Median : 9.00
##  Mean   :13.06   Mean   : 1.98         Mean   : 8.74
##  3rd Qu.:19.00   3rd Qu.: 2.00         3rd Qu.:11.00
##  Max.   :60.00   Max.   :12.00         Max.   :22.00
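A sketch of how these sentence-level measures could be derived; it is shown for the twitter subset only and with assumed variable names, so it will not reproduce the exact numbers above.
# Split a small subset of lines into sentences and compute per-sentence measures
lines_small <- readLines("final/en_US/en_US.twitter.txt", n = 100, encoding = "UTF-8", skipNul = TRUE)
sentences   <- unlist(strsplit(lines_small, "\\. |\\! |\\? |\\: |\\; "))
sentences   <- sentences[nchar(sentences) > 0]
word_lists  <- strsplit(sentences, "\\s+")

measures <- data.frame(
  TotNumber_Words       = sapply(word_lists, length),
  MinNumber_CharPerWord = sapply(word_lists, function(w) min(nchar(w))),
  MaxNumber_CharPerWord = sapply(word_lists, function(w) max(nchar(w)))
)
summary(measures)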
After tokenization I store the sentences in a DocumentTermMatrix. In the next step I take a look at the frequency of the words. This frequency is calculated as the sum of the frequencies of each word over the single sentences.
## the and that for with was you not but have this will your all are
## 433 244  108  81   68  64  63  53  50   50   39   38   38  32  32
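A minimal sketch of how such a frequency list could be produced with the tm package; the example sentences are made up, in the report the sentences come from the splitting step above.
library(tm)

# Build a corpus where every sentence is one document, create a
# DocumentTermMatrix and sum the term frequencies over all sentences
sentences <- c("the weather is good today",
               "today the sun is shining and the weather is fine")
corpus <- VCorpus(VectorSource(sentences))
dtm    <- DocumentTermMatrix(corpus)

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 15)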
Natural Language Processing (NLP) is a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages ([2]). This leads directly to the goal of the project/model, which is to determine the next word given the preceding words in a phrase.
Based on the first examination of the data and the tm package ([1]), the plan is as follows.
The next step is finding the n-grams in the sentences on different levels (2-grams, 3-grams and so on) and defining rules, like “when n-grams of different levels exist, which one should be taken” or, the other way round, “when no n-gram of a certain level exists, which one should be taken instead”; a rough sketch of the n-gram counting follows below.
The model will be trained on a sample combined from all three files. The last step is a Shiny app providing an interactive front end for the algorithm.
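As a rough base R sketch of the n-gram counting step, not necessarily the final implementation:
# Count n-grams of a given level over a vector of sentences
ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) {
    words <- unlist(strsplit(tolower(s), "\\s+"))
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

sentences <- c("i am fine today", "i am happy", "today i am fine")
ngrams(sentences, 2)   # e.g. "i am" occurs three times
ngrams(sentences, 3)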