The goal of this project is to work with a data type we haven't dealt with much so far: language and text. Our data sample comes from a corpus called HC Corpora, which collects text samples from publicly available sources on the internet. In our case those sources are blogs, news sites and Twitter. For now we'll use only the English files, although samples are also provided in German, Russian and Finnish. The ultimate goal is to build a model that predicts the next word a user may write based on the previously written words. In this document I'll outline the first steps towards that goal: mainly the cleaning of the data and some exploratory analysis.
We get the data directly from the website provided by Coursera. There are three basic sources, blogs, news and tweets, and we'll read them in separately.
source <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download and extract the data if it is not already present
if(!file.exists("Coursera-SwiftKey.zip")){
    download.file(source, destfile = "Coursera-SwiftKey.zip")
    unzip("Coursera-SwiftKey.zip")
}
# Read the three English source files line by line
blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE, warn = FALSE)
news <- readLines("final/en_US/en_US.news.txt", skipNul = TRUE, warn = FALSE)
tweets <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE, warn = FALSE)
Let's look at a short summary of the data.
| Source | Number of rows | Mean characters per row | Total number of words |
|---|---|---|---|
| Blogs | 899288 | 232 | 37334131 |
| News | 77259 | 203 | 2643969 |
| Tweets | 2360148 | 69 | 30373583 |
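The numbers in this table were obtained roughly as follows; the whitespace-based word count is my assumption about how the totals were computed. The `numrows` vector is reused below when drawing the samples.

# Number of lines per source (reused later when sampling)
numrows <- c(length(blogs), length(news), length(tweets))
# Mean characters per line and a simple whitespace-based word count
meanchars <- sapply(list(blogs, news, tweets), function(x) round(mean(nchar(x))))
numwords  <- sapply(list(blogs, news, tweets),
                    function(x) sum(lengths(strsplit(x, "\\s+"))))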
We see that the blog and Twitter data are fairly similar with respect to the total number of words (which is what we're interested in, since we want to predict the next word based on the last words). The news data is considerably smaller.
I don't think this will be a big problem for our prediction, since we're going to build a general model rather than one conditioned on the type of text the user is writing. But it's definitely something to keep in mind.
Just to give a feeling for what the data looks like, we show the beginning (first 80 characters) of the first entry of each dataset.
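These previews were produced along these lines (a sketch):

# Print the first 80 characters of the first entry of each dataset
print(paste0("Blogs: ",  substr(blogs[1],  1, 80), " ..."))
print(paste0("News: ",   substr(news[1],   1, 80), " ..."))
print(paste0("Tweets: ", substr(tweets[1], 1, 80), " ..."))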
## [1] "Blogs: In the years thereafter, most of the Oil fields and platforms were named after p ..."
## [1] "News: He wasn't home alone, apparently. ..."
## [1] "Tweets: How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see ..."
One last thing before we start cleaning the data: we take a sample of each of the datasets. We do this because otherwise the sets would be too big and computations would take endless time. We're going to take a random subsample of 10% of each source. Since we'll use the samples later, we save them to the working directory so they can be retrieved again.
set.seed(42)
# Draw a 10% random subsample of each source and save it as plain text for later reuse
blogsam <- blogs[sample(1:numrows[1], numrows[1]*0.1, replace = FALSE)]
writeLines(blogsam, "en_US.blogs.sample.txt")
newsam <- news[sample(1:numrows[2], numrows[2]*0.1, replace = FALSE)]
writeLines(newsam, "en_US.news.sample.txt")
tweetsam <- tweets[sample(1:numrows[3], numrows[3]*0.1, replace = FALSE)]
writeLines(tweetsam, "en_US.twitter.sample.txt")
As we saw above, the data currently contains a lot of things that make life difficult when trying to use it for prediction models. We want to deal with all of the following in the sample:

- mixed upper and lower case letters
- punctuation
- numbers
- internet links
- special characters ($, £, ...)
- profanity (we didn't see this in the example, but we still want to remove it)
So, let's read in our sample data and combine it into one object, so that we don't have to do every operation three times.
blogs <- readLines("en_US.blogs.sample.csv",
skipNul = TRUE,
warn = FALSE)
news <- readLines("en_US.news.sample.csv",
skipNul = TRUE,
warn = FALSE)
tweets <- readLines("en_US.twitter.sample.csv",
skipNul = TRUE,
warn = FALSE)
#combine the samples into one sample
sample <- c(blogs,news,tweets)
Before we remove all the things mentioned above, we'd like to make sure that we're not going to change the data significantly by doing so. We check this by looking at what percentage of the characters in our sampled dataset are numbers or special characters.
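The check can be done roughly like this; the exact definition of which characters count as "numbers or special characters" is my assumption:

# Share of characters that are neither letters, spaces nor basic punctuation
# (so digits and special characters are counted)
special_chars <- sum(nchar(gsub("[A-Za-z .,!?'\"-]", "", sample)))
total_chars <- sum(nchar(sample))
special_chars / total_chars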
## [1] 0.01068226
We can live with that, so let's go on with cleaning the dataset.
library(stringr)

# Repair a common mis-encoded apostrophe
sample <- gsub("â", "'", sample)
# Remove all hashtags
sample <- gsub(" #\\S*", "", sample)
# Remove internet links
sample <- gsub("(f|ht)(tp)(s?)(://)(\\S*)", "", sample)
# Remove everything that is not a number, a letter, an apostrophe, a slash or a space
sample <- gsub("[^0-9A-Za-z/' ]", "", sample)
# Remove special characters and encoding remnants that may have been missed
sample <- str_replace_all(string = sample, pattern = "[&…™ðŸ¥]", replacement = "")
Next we're going to use R's tm package (text mining) to transform the data into an object of class corpus and do some further cleaning. We remove the numbers, since they're not going to have much predictive value. We also remove all punctuation. That may lead to some problems, since in English the apostrophe is used to negate statements, to abbreviate 'would' or 'had', or to build a genitive, but let's deal with that later. We'll also collapse runs of whitespace and convert everything to lower case.
library(tm)

Csample <- Corpus(VectorSource(sample))
# Remove numbers, since they're not going to have a lot of predictive value
Csample <- tm_map(Csample, removeNumbers)
# Remove remaining punctuation
Csample <- tm_map(Csample, removePunctuation)
# Collapse multiple whitespace characters into one
Csample <- tm_map(Csample, stripWhitespace)
# Rewrite the sample in all lower case
Csample <- tm_map(Csample, content_transformer(tolower))
Remember that we wanted to remove profanity as well? A short Google search provided this list of words that may be found offensive. Note that the list is very conservative: for instance the word "Africa" is on it, a word that can obviously be used in a lot of non-offensive ways. But since the list is not hard-coded, such entries can be changed later.
# Remove words that appear on the external profanity list
badwords <- readLines("profanity.txt")
Csample <- tm_map(Csample, removeWords, badwords)

# Convert the tm corpus into a quanteda corpus for the analysis below
library(quanteda)
Csample <- corpus(Csample)
So, now the data is as we want it and we can start to explore it.
We'll use the quanteda package to tokenize the sample (splitting it into single words) and count the frequency of each word. The wordcloud package then lets us create a (...wait for it) wordcloud, which is a nice and very intuitive way to visualize which words are frequent in the set. We'll also plot a barplot of the frequencies.
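A sketch of this step; note that in newer quanteda versions textstat_frequency() lives in the separate quanteda.textstats package:

library(quanteda)
library(quanteda.textstats)
library(wordcloud)

# Tokenize into single words and count how often each word occurs
toks1 <- tokens(Csample, what = "word")
dfm1 <- dfm(toks1)
freq1 <- textstat_frequency(dfm1)

# Wordcloud of the 100 most frequent words
wordcloud(words = freq1$feature, freq = freq1$frequency,
          max.words = 100, random.order = FALSE)

# Barplot of the 20 most frequent words
barplot(freq1$frequency[1:20], names.arg = freq1$feature[1:20], las = 2)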
It is not very surprising that the most common words are articles, prepositions and all sorts of filler words.
The same thing we did for single words we can also do for every combination of two consecutive words (bigrams) in the sample.
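The bigram counts follow the same pipeline, sketched here with tokens_ngrams():

# Build two-word combinations (bigrams) and count their frequencies
toks2 <- tokens_ngrams(toks1, n = 2)
freq2 <- textstat_frequency(dfm(toks2))
head(freq2, 10)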
Also not very surprisingly, the most frequent combinations are built from words that are themselves among the most frequent single words. In the end, the fewer words that occur only once or twice in our sample, the better our model will be.
And of course we can do the same thing for all combinations of three words (trigrams), which is what we do next.
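The trigram version is the same sketch with n = 3:

# Three-word combinations (trigrams)
toks3 <- tokens_ngrams(toks1, n = 3)
freq3 <- textstat_frequency(dfm(toks3))
head(freq3, 10)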
Remember that we said the fewer rare words (frequency 1 or 2) we have in the sample, the better our prediction is going to be? Let's look at that. The table below shows, for frequencies from one to ten, how many distinct words occur exactly that often in the whole sample.
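The counts in the table can be produced roughly like this (a sketch using dplyr on the unigram frequencies from above):

library(dplyr)

# For each frequency up to 10, count how many distinct words occur that often
freq1 %>%
  count(frequency) %>%
  filter(frequency <= 10)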
| frequency | n |
|---|---|
| 1 | 99060 |
| 2 | 18547 |
| 3 | 8990 |
| 4 | 5408 |
| 5 | 3876 |
| 6 | 2865 |
| 7 | 2285 |
| 8 | 1887 |
| 9 | 1525 |
| 10 | 1283 |
We see that words with frequency one make up a huge part of the set. Those words can probably be omitted from the model later, because they're mostly names and typos, which hold no predictive value for us. Let's look at a handful of words with frequency 1 to check that.
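One way to draw such a sample (a sketch):

# Draw a few random words that occur exactly once in the sample
once <- freq1[freq1$frequency == 1, ]
once[sample(nrow(once), 6), ]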
| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| doggerel | 1 | 68217 | 1 | all |
| sympathized | 1 | 68217 | 1 | all |
| roadtown | 1 | 68217 | 1 | all |
| selfdestruct | 1 | 68217 | 1 | all |
| rcis | 1 | 68217 | 1 | all |
| fogginess | 1 | 68217 | 1 | all |
Another question that was asked is how many unique words are needed to cover 50% or 90% of our corpus. The following plot, which shows the cumulative frequency of single words, answers this question; the two printed numbers are the word counts for 90% and 50% coverage.
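The two numbers come from the cumulative word frequencies, for example like this (a sketch):

# Cumulative share of all word occurrences covered by the n most frequent words
coverage <- cumsum(freq1$frequency) / sum(freq1$frequency)
# Number of unique words needed to cover 90% and 50% of the corpus
which(coverage >= 0.9)[1]
which(coverage >= 0.5)[1]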
## [1] 6497
## [1] 116
As we can see, we could use only about 6'500 words and still capture 90% of all the words in the sample, and to capture 50% of all words we'd need only about 120 words; both thresholds are marked by the red lines in the plot. When you think about it, that's not really a lot.
The last question to consider is how we can make sure that all the words we're looking at are English. We could, of course, take an English dictionary and check each of our tokens against it, but that would be rather time-consuming and I don't think we would gain a lot by doing it. There is also the approach of stemming, where a word is cut down to its stem and all prefixes and suffixes are omitted, but I think that is more trouble than it's worth here, since we seem to have a rather "nice" dataset in the sense that after cleaning we didn't find anything unexpected or unwanted in the data.
The cleaning of the data and the exploratory analysis were, of course, just the beginning. The real work will be to find a working algorithm that predicts the next word a user will type. The prediction will rely on these n-grams; so far we have calculated them for n = 1, 2 and 3, and we shall see whether we also need n = 4. We will probably also use correlations of words within a sentence. The tm package provides some functionality that will be useful for calculating those values.