In the capstone project we will be applying data science in the area of natural language processing.
This report shows the first steps of the project:
loading the data sets,
getting a first look at the data,
getting a first idea of cleansing and tokenization.
In a later step a model will be built that proposes a list of next-to-be-typed words after a user enters a phrase of text.
The data set originally comes from a corpus called HC Corpora.
This is the training data that will be the basis for most of the capstone. The data must be downloaded from the Coursera site and not from external websites.
The data set contains separate folders for different languages. I’ll focus on the English one, because English is the defined language for the course participants.
Folder: en_US
Files:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
##      File TotNumber_Lines MaxNumber_CharsPerLine AvgNumber_CharsPerLine
## 1    BLOG          899288                  40833              229.98695
## 2    NEWS           77259                   5760              202.42830
## 3 TWITTER         2360148                    140               68.68054
##      File TotNumber_Words MaxNumber_WordsPerLine AvgNumber_WordsPerLine
## 1    BLOG        37334131                   6630               41.51521
## 2    NEWS         2643969                   1031               34.22215
## 3 TWITTER        30373583                     47               12.86936
The first summary shows nothing surprising.
The blog file contains the most data per line.
The twitter file has the shortest lines but the highest number of lines.
The news file lies between blog and twitter with regard to these measures, but a little closer to blog than to twitter.
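For reference, a minimal sketch of how such line and word counts could be computed with base R and the stringi package. The file paths and the use of ‘stri_count_words’ are assumptions, not necessarily the exact code behind the tables above.
library(stringi)

# Sketch: basic line and word statistics for one file
file_stats <- function(path, label) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  words <- stri_count_words(lines)
  data.frame(File = label,
             TotNumber_Lines = length(lines),
             MaxNumber_CharsPerLine = max(chars),
             AvgNumber_CharsPerLine = mean(chars),
             TotNumber_Words = sum(words),
             MaxNumber_WordsPerLine = max(words),
             AvgNumber_WordsPerLine = mean(words))
}

rbind(file_stats("final/en_US/en_US.blogs.txt",   "BLOG"),
      file_stats("final/en_US/en_US.news.txt",    "NEWS"),
      file_stats("final/en_US/en_US.twitter.txt", "TWITTER"))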
For the next steps I use a smaller set of data for performance reasons.
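As a sketch, such a smaller working set could simply be the first lines of each file (100 lines per file, as used further below); the file paths are assumptions.
n_small <- 100
blog_small    <- readLines("final/en_US/en_US.blogs.txt",   n = n_small, encoding = "UTF-8", skipNul = TRUE)
news_small    <- readLines("final/en_US/en_US.news.txt",    n = n_small, encoding = "UTF-8", skipNul = TRUE)
twitter_small <- readLines("final/en_US/en_US.twitter.txt", n = n_small, encoding = "UTF-8", skipNul = TRUE)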
In the next section I will take a closer look at the data to get a better feeling for
Tokenization
Profanity filtering
For this task I create a small example of an entry, e.g. from twitter, which helps me understand what I want and what I can use from the tm package ([1]). The package has a number of predefined transformations which can be used (see ‘getTransformations()’).
Are the results of removing punctuation and numbers as expected?
library(tm)   # provides removePunctuation(), removeNumbers(), removeWords(), stopwords()
testentry <- "Hello, how do you do? I'm fine#. Today (9th 11.) we had good weather 20.5°! You too?"
removePunctuation(testentry)
## [1] "Hello how do you do Im fine Today 9th 11 we had good weather 205 You too"
removeNumbers(testentry)
## [1] "Hello, how do you do? I'm fine#. Today (th .) we had good weather .°! You too?"
From my point of view the ‘removePunctuation’ function is of limited use for me, because some punctuation marks, namely the ones that separate sentences, are important: every sentence is a logical unit, so the last word of one sentence does not determine the first word of the next sentence.
What I really want is to first split the lines in the files into their logical parts and then analyse the words. Removing punctuation might be a solution for a later step after the splitting.
strsplit(testentry, "\\. |\\! |\\? |\\: |\\; ")
## [[1]]
## [1] "Hello, how do you do"
## [2] "I'm fine#"
## [3] "Today (9th 11.) we had good weather 20.5°"
## [4] "You too?"
A second aspect are contractions like ‘I’m’ for ‘I am’ or ‘I’ve’ for ‘I have’. When I use the ‘removePunctuation’ function they are also changed, in a way that might cause the algorithm trouble later.
The stop word list ‘stopwords(“english”)’ in the package does not seem suitable for me, because in combination with ‘removeWords’ it would eliminate important parts, and in combination with ‘removePunctuation’ I was not able to get it to work. So I will use ‘removeWords’ only for my list of ‘bad’ words.
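A minimal sketch of how the profanity filtering with ‘removeWords’ could look; the word list here is only a placeholder, a real list would be loaded from a file.
library(tm)

# Placeholder profanity list; in practice it would be read from a file
badwords <- c("badword1", "badword2")

sentence <- "this badword1 sentence contains a badword2"
removeWords(sentence, badwords)
# the listed words are blanked out; the extra whitespace could be
# cleaned up afterwards with stripWhitespace()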
When I now look again at the first 100 lines of each file with the ‘line-based’ view, I get the following measures:
##      File TotNumber_Words MaxNumber_WordsPerLine AvgNumber_WordsPerLine
## 1    BLOG            4704                    275                  47.04
## 2    NEWS            3222                    142                  32.22
## 3 TWITTER            1275                     28                  12.75
After changing the logical units from lines to sentences, the values of measures like ‘total number of words per line’ change:
##  TotNumber_Words MinNumber_CharPerWord MaxNumber_CharPerWord
##  Min.   : 1.00   Min.   : 0.00         Min.   : 1.00
##  1st Qu.: 4.00   1st Qu.: 1.00         1st Qu.: 7.00
##  Median :10.00   Median : 2.00         Median : 9.00
##  Mean   :13.06   Mean   : 1.98         Mean   : 8.74
##  3rd Qu.:19.00   3rd Qu.: 2.00         3rd Qu.:11.00
##  Max.   :60.00   Max.   :12.00         Max.   :22.00
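A sketch of how these sentence-level measures could be derived; it is shown for the twitter subset only and with assumed variable names, so it will not reproduce the exact numbers above.
# Split a small subset of lines into sentences and compute per-sentence measures
lines_small <- readLines("final/en_US/en_US.twitter.txt", n = 100, encoding = "UTF-8", skipNul = TRUE)
sentences   <- unlist(strsplit(lines_small, "\\. |\\! |\\? |\\: |\\; "))
sentences   <- sentences[nchar(sentences) > 0]
word_lists  <- strsplit(sentences, "\\s+")

measures <- data.frame(
  TotNumber_Words       = sapply(word_lists, length),
  MinNumber_CharPerWord = sapply(word_lists, function(w) min(nchar(w))),
  MaxNumber_CharPerWord = sapply(word_lists, function(w) max(nchar(w)))
)
summary(measures)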
After tokenization I store the sentences in a DocumentTermMatrix. In the next step I take a look at the frequency of the words. This frequency is calculated as the sum of the frequencies of each word over the single sentences.
## the and that for with was you not but have this will your all are
## 433 244  108  81   68  64  63  53  50   50   39   38   38  32  32
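A minimal sketch of how such a frequency list could be produced with the tm package; the example sentences are made up, in the report the sentences come from the splitting step above.
library(tm)

# Build a corpus where every sentence is one document, create a
# DocumentTermMatrix and sum the term frequencies over all sentences
sentences <- c("the weather is good today",
               "today the sun is shining and the weather is fine")
corpus <- VCorpus(VectorSource(sentences))
dtm    <- DocumentTermMatrix(corpus)

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 15)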
Natural Language Processing (NLP) is a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages ([2]). This leads directly to the goal of the project/model, which is to determine the next word given the preceding words in a phrase.
Based on the first examination of the data and the tm package ([1]), the plan is as follows.
The next step is finding the n-grams in the sentences on different levels (2-grams, 3-grams and so on) and defining rules, like “when n-grams of different levels exist, which one should be taken” or, the other way round, “when no n-gram of a certain level exists, which one should be taken instead”; a rough sketch of the n-gram counting follows below.
The model will be trained on a sample combined from all three files. The last step is a Shiny app providing an interactive front end for the algorithm.
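As a rough base R sketch of the n-gram counting step, not necessarily the final implementation:
# Count n-grams of a given level over a vector of sentences
ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) {
    words <- unlist(strsplit(tolower(s), "\\s+"))
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

sentences <- c("i am fine today", "i am happy", "today i am fine")
ngrams(sentences, 2)   # e.g. "i am" occurs three times
ngrams(sentences, 3)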