Large databases of text in a target language are commonly used when building language models for various purposes. In this exercise, you will use the English database, but you may also consider the three other databases in German, Russian, and Finnish.

The goal of this task is to become familiar with the databases and do the necessary cleaning. After this exercise, you should understand what real data looks like and how much effort you need to put into cleaning it. When you begin developing for a new language, the first thing is to understand the language and its peculiarities with respect to your target. You can learn to read, speak and write the language. Alternatively, you can study data and learn about the language from existing information in the literature and on the internet. At the very least, you need to understand how the language is written: its writing script, existing input methods, some phonetic knowledge, etc.

Note that the data contain words of offensive and profane meaning. They are left there intentionally to highlight the fact that the developer has to deal with them.

Tasks to accomplish

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (a minimal sketch appears after this list).

  2. Profanity filtering - removing profanity and other words you do not want to predict.
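
For Task 1, a minimal tokenizer in base R might look like the sketch below; the function name tokenize_file and the splitting pattern are illustrative assumptions, not part of the course material.

# Read a text file and return a vector of lower-case word tokens
tokenize_file <- function(path) {
  text <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  text <- tolower(text)
  # split on anything that is not a letter, digit, or apostrophe
  tokens <- unlist(strsplit(text, "[^a-z0-9']+"))
  tokens[tokens != ""]   # drop empty strings produced by leading separators
}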

Tips, tricks, and hints

  1. Loading the data in. This dataset is fairly large. We emphasize that you don’t necessarily need to load the entire dataset in to build your algorithms (see point 2 below). At least initially, you might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. You can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R. For example, the following code could be used to read the first few lines of the English Twitter dataset:
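
A minimal sketch of that approach is shown below; it assumes en_US.twitter.txt is in the working directory.

con <- file("en_US.twitter.txt", "r")  # open a read connection to the file
readLines(con, 1)                      # read the first line
readLines(con, 5)                      # read the next five lines
close(con)                             # close the connection when done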

Load the libraries necessary for the project

library(tm)
## Loading required package: NLP

Download the file from the web and unzip it (date: March 1st, 2022)

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "capstone_dataset.zip")
unzip("capstone_dataset.zip")
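
Since the archive is large, a small guard like the following (a sketch, not part of the run above) avoids downloading it again when the code is re-run:

# Skip the download and unzip when the archive is already on disk
if (!file.exists("capstone_dataset.zip")) {
  download.file(url, destfile = "capstone_dataset.zip")
  unzip("capstone_dataset.zip")
}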

Read the files

url_directory <- "C:/Users/maigu/OneDrive/Documentos/Eli/datasciencecoursera/Data-Science-Capstone/final/en_US/"
url_twitter <- paste(url_directory, "en_US.twitter.txt", sep = "")
url_blogs <- paste(url_directory, "en_US.blogs.txt", sep = "")
url_news <- paste(url_directory, "en_US.news.txt", sep = "")


twitter_file <- readLines(url_twitter, encoding="UTF-8")
## Warning in readLines(url_twitter, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(url_twitter, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(url_twitter, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(url_twitter, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
blogs_file <- readLines(url_blogs, encoding="UTF-8")
news_file <- readLines(url_news, encoding="UTF-8")
## Warning in readLines(url_news, encoding = "UTF-8"): incomplete final line found
## on 'C:/Users/maigu/OneDrive/Documentos/Eli/datasciencecoursera/Data-Science-
## Capstone/final/en_US/en_US.news.txt'
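
readLines also accepts skipNul = TRUE, which silently drops the embedded nuls and avoids the warnings above; a minimal variant (a sketch, not used in the rest of this report):

twitter_file <- readLines(url_twitter, encoding = "UTF-8", skipNul = TRUE)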

Getting the bad-words list from the web

badwords_url <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(badwords_url, destfile = "bad-words.txt")
badwords <- readLines("bad-words.txt", encoding="UTF-8")

Setting a subset of the data to work with

twitter_subset <- readLines(url_twitter, 5000)
blogs_subset <- readLines(url_blogs, 5000)
news_subset <- readLines(url_news, 5000)

# Combine the three subsets into a single character vector
# (c() stacks the lines; paste() would merge unrelated lines element-wise)
subset_all_data <- c(twitter_subset, blogs_subset, news_subset)
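
Taking the first 5,000 lines biases the subset toward the beginning of each file. A random sample drawn from the vectors read above is an alternative (a sketch, not the subset used below):

set.seed(1234)   # make the sample reproducible
twitter_subset <- sample(twitter_file, 5000)
blogs_subset <- sample(blogs_file, 5000)
news_subset <- sample(news_file, 5000)
subset_all_data <- c(twitter_subset, blogs_subset, news_subset)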

Clean and organize the data

# Get subset into clean data to work
clean_data <- subset_all_data

# Remove non-ASCII (special) characters from the dataset
clean_data <- iconv(clean_data, "UTF-8", "ASCII", sub = "")

# Remove numbers from dataset
clean_data <- removeNumbers(clean_data)

# Collapse repeated white space in the dataset
clean_data <- stripWhitespace(clean_data) 

# Set the entire dataset to lower case
clean_data <- tolower(clean_data) 

# Remove punctuation
clean_data <- removePunctuation(clean_data)

# Remove profanity (the bad-words list) from the dataset
clean_data <- removeWords(clean_data, badwords)
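
removeWords leaves blank gaps where the deleted words used to be, so an extra whitespace pass afterwards (an optional sketch) keeps the text tidy:

# Collapse the double spaces left behind by removeWords
clean_data <- stripWhitespace(clean_data)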

Getting information about the files

data.frame(file = c("Twitter", "Blogs", "News"),
           size_MB = c(file.info(url_twitter)$size / 1024^2,
                       file.info(url_blogs)$size / 1024^2,
                       file.info(url_news)$size / 1024^2),
           lines = c(length(readLines(url_twitter)),
                     length(readLines(url_blogs)),
                     length(readLines(url_news))),
           longest_line = c(summary(nchar(twitter_file))[6],
                            summary(nchar(blogs_file))[6],
                            summary(nchar(news_file))[6]))
## Warning in readLines(url_twitter): line 167155 appears to contain an embedded
## nul
## Warning in readLines(url_twitter): line 268547 appears to contain an embedded
## nul
## Warning in readLines(url_twitter): line 1274086 appears to contain an embedded
## nul
## Warning in readLines(url_twitter): line 1759032 appears to contain an embedded
## nul
## Warning in readLines(url_news): incomplete final line found on 'C:/Users/maigu/
## OneDrive/Documentos/Eli/datasciencecoursera/Data-Science-Capstone/final/en_US/
## en_US.news.txt'
##      file  size_MB   lines longest_line
## 1 Twitter 159.3641 2360148          140
## 2   Blogs 200.4242  899288        40833
## 3    News 196.2775   77259         5760
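
The warnings reappear because readLines runs again inside the data.frame call. A variant that reuses the vectors already in memory (a sketch, not the code used for the table above) avoids the second pass over the files:

data.frame(file = c("Twitter", "Blogs", "News"),
           size_MB = c(file.info(url_twitter)$size,
                       file.info(url_blogs)$size,
                       file.info(url_news)$size) / 1024^2,
           lines = c(length(twitter_file), length(blogs_file), length(news_file)),
           longest_line = c(max(nchar(twitter_file)),
                            max(nchar(blogs_file)),
                            max(nchar(news_file))))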

Quiz 1: Getting Started

Question 1

The en_US.blogs.txt file is how many megabytes?

Answer: According to the table above, the Blogs file is 200.4242 MB.

Question 2

The en_US.twitter.txt has how many lines of text?

Answer: According to the table above, the Twitter file has 2,360,148 lines.

Question 3

What is the length of the longest line seen in any of the three en_US data sets?

Answer: According to the table above, the Blogs file has the longest line, at 40833 characters.

Question 4

In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?

sum(grepl("love", twitter_file)) / sum(grepl("hate", twitter_file))
## [1] 4.108592

Answer: The ratio is roughly 4: about four lines mention "love" for every line that mentions "hate". Note that grepl matches substrings, so words such as "lovely" or "hateful" are also counted; the quiz only asks for an approximate value.
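
A stricter variant that matches the words with boundaries (a sketch, not run for this report):

sum(grepl("\\blove\\b", twitter_file)) / sum(grepl("\\bhate\\b", twitter_file))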

Question 5

The one tweet in the en_US twitter data set that matches the word “biostats” says what?

twitter_file[grepl("biostats", twitter_file)]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

Question 6

How many tweets have the exact characters "A computer once beat me at chess, but it was no match for me at kickboxing"? (I.e., the line matches those characters exactly.)

grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitter_file)
## [1]  519059  835824 2283423

Answer: Three tweets (lines 519059, 835824, and 2283423 of the file) contain that exact sentence.