Large databases comprising text in a target language are commonly used when generating language models for various purposes. In this exercise, you will use the English database, but you may also consider three other databases in German, Russian, and Finnish.
The goal of this task is to become familiar with the databases and do the necessary cleaning. After this exercise, you should understand what real data looks like and how much effort is needed to clean it. When you begin developing for a new language, the first step is to understand the language and its peculiarities with respect to your target. You can learn to read, speak, and write the language; alternatively, you can study data and learn about the language from existing literature and the internet. At the very least, you need to understand how the language is written: its script, existing input methods, some phonetic knowledge, and so on.
Note that the data contain offensive and profane words. They are left in intentionally to highlight the fact that the developer has to deal with them.
Tasks to accomplish
Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (see the sketch after this list).
Profanity filtering - removing profanity and other words you do not want to predict.
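A minimal tokenizer sketch, assuming a whitespace-and-punctuation split is good enough for English; the function name tokenize_file and the regular expressions are illustrative choices, not prescribed by the assignment:

tokenize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # surround each punctuation mark with spaces so it becomes its own token
  spaced <- gsub("([[:punct:]])", " \\1 ", lines)
  # split on runs of whitespace and drop empty strings
  tokens <- unlist(strsplit(spaced, "[[:space:]]+"))
  tokens[tokens != ""]
}

Profanity filtering can then be a simple exclusion against a word list, e.g. tokens[!tokens %in% badwords].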
Tips, tricks, and hints
library(tm)
## Loading required package: NLP
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "capstone_dataset.zip")
unzip("capstone_dataset.zip")
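If the report is knitted more than once, the download and unzip can be guarded so they only run when the archive is missing (a small optional sketch using the same file name):

if (!file.exists("capstone_dataset.zip")) {
  download.file(url, destfile = "capstone_dataset.zip")
  unzip("capstone_dataset.zip")
}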
url_directory <- "C:/Users/maigu/OneDrive/Documentos/Eli/datasciencecoursera/Data-Science-Capstone/final/en_US/"
url_twitter <- paste(url_directory, "en_US.twitter.txt", sep = "")
url_blogs <- paste(url_directory, "en_US.blogs.txt", sep = "")
url_news <- paste(url_directory, "en_US.news.txt", sep = "")
twitter_file <- readLines(url_twitter, encoding="UTF-8")
## Warning in readLines(url_twitter, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(url_twitter, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(url_twitter, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(url_twitter, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
blogs_file <- readLines(url_blogs, encoding="UTF-8")
news_file <- readLines(url_news, encoding="UTF-8")
## Warning in readLines(url_news, encoding = "UTF-8"): incomplete final line found
## on 'C:/Users/maigu/OneDrive/Documentos/Eli/datasciencecoursera/Data-Science-
## Capstone/final/en_US/en_US.news.txt'
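The warning means en_US.news.txt does not end with a newline. On Windows, text-mode reading can also stop early at an embedded SUB (Ctrl-Z) byte and silently drop the rest of the file, so opening the connection in binary mode is safer (an optional sketch; news_full is our own name):

con <- file(url_news, open = "rb")
news_full <- readLines(con, encoding = "UTF-8")
close(con)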
badwords_url <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(badwords_url, destfile = "bad-words.txt")
badwords <- readLines("bad-words.txt", encoding="UTF-8")
# Read the first 5,000 lines of each file as a working subset
twitter_subset <- readLines(url_twitter, n = 5000)
blogs_subset <- readLines(url_blogs, n = 5000)
news_subset <- readLines(url_news, n = 5000)
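Note that the first 5,000 lines are not a random sample. Since the full files are already in memory, a random subset may be more representative (a sketch with an arbitrary seed; the *_sample names are ours):

set.seed(1234)
twitter_sample <- sample(twitter_file, 5000)
blogs_sample <- sample(blogs_file, 5000)
news_sample <- sample(news_file, 5000)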
# Combine the three subsets into one character vector
subset_all_data <- c(twitter_subset, blogs_subset, news_subset)
# Copy the subset into a working vector
clean_data <- subset_all_data
# Remove non-ASCII (special) characters
clean_data <- iconv(clean_data, "UTF-8", "ASCII", sub = "")
# Remove numbers
clean_data <- removeNumbers(clean_data)
# Lower-case the entire dataset (before word removal, since the bad-words list is lower case)
clean_data <- tolower(clean_data)
# Remove punctuation
clean_data <- removePunctuation(clean_data)
# Remove bad words
clean_data <- removeWords(clean_data, badwords)
# Strip white space last, so gaps left by removed words collapse
clean_data <- stripWhitespace(clean_data)
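Equivalently, the steps above can be wrapped into one reusable function so the same cleaning can be applied to any later corpus (a sketch; clean_text is our own name):

clean_text <- function(x, profanity) {
  x <- iconv(x, "UTF-8", "ASCII", sub = "")  # drop non-ASCII characters
  x <- removeNumbers(x)
  x <- tolower(x)
  x <- removePunctuation(x)
  x <- removeWords(x, profanity)
  stripWhitespace(x)  # collapse leftover spaces last
}
clean_data <- clean_text(subset_all_data, badwords)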
# Summarise the three files using the vectors already in memory,
# avoiding a second full read of each ~200 MB file
data.frame(file = c("Twitter", "Blogs", "News"),
           size_MB = c(file.info(url_twitter)$size / 1024^2,
                       file.info(url_blogs)$size / 1024^2,
                       file.info(url_news)$size / 1024^2),
           lines = c(length(twitter_file),
                     length(blogs_file),
                     length(news_file)),
           longest_line = c(max(nchar(twitter_file)),
                            max(nchar(blogs_file)),
                            max(nchar(news_file))))
## file size_MB lines longest_line
## 1 Twitter 159.3641 2360148 140
## 2 Blogs 200.4242 899288 40833
## 3 News 196.2775 77259 5760
The en_US.blogs.txt file is how many megabytes?
Answer: According to the table above, the blogs file is 200.4242 MB.
The en_US.twitter.txt has how many lines of text?
Answer: According to the table above, the Twitter file has 2,360,148 lines.
What is the length of the longest line seen in any of the three en_US data sets?
Answer: According to the table above, the blogs file has the longest line, at 40,833 characters.
In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines where the word “hate” (all lowercase) occurs, about what do you get?
sum(grepl("love", twitter_file)) / sum(grepl("hate", twitter_file))
## [1] 4.108592
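Answer: about 4; lines containing “love” outnumber lines containing “hate” by a ratio of roughly 4.1.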
The one tweet in the en_US twitter data set that matches the word “biostats” says what?
twitter_file[grepl("biostats", twitter_file)]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”? (I.e., the line matches those characters exactly.)
grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitter_file)
## [1] 519059 835824 2283423
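Answer: three tweets match, at lines 519059, 835824, and 2283423.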