A natural language or ordinary language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are distinguished from constructed and formal languages, such as those used to program computers or to study logic. Natural languages contain many nuances and variations that depend on culture and society. Analyzing natural language data is essential for developing the text prediction algorithms and software used, for example, in smartphone keyboards and in the text editors of word processors and email clients.
This exercise is part of Johns Hopkins University's Data Science Specialization offered through Coursera. It consists of three activities: downloading the data, reading it into R, and exploring its general characteristics.
The following code downloads a zip file of natural language datasets and extracts them into a specific directory.
# DOWNLOADING THE DATA
setwd("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos")
datos_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
datos_local <- "Coursera-SwiftKey.zip"
options(timeout = 1000000) # EXTENDS THE DOWNLOAD TIME LIMIT BEFORE DOWNLOADING THE LARGE FILE
download.file(datos_url, datos_local)
dir <- getwd()
# UNZIPPING THE DATA
unzip(paste0(dir, "/", datos_local), files = NULL, list = FALSE, overwrite = TRUE,
      junkpaths = FALSE, exdir = ".", unzip = "internal",
      setTimes = FALSE)
After downloading and extracting the data, a directory named “final” should appear with the following 4 subdirectories: de_DE, en_US, fi_FI and ru_RU.
Each subdirectory contains three text files with data from Twitter, news and blogs. The data was acquired with a web crawler and, according to the course, has been processed to respect copyright laws.
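The extracted structure can be checked directly from R. This is only a quick sketch and assumes the working directory is still the one the zip file was extracted into:
# LIST THE EXTRACTED FILES TO CONFIRM THE DIRECTORY STRUCTURE
list.files("final", recursive = TRUE)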
Reading the lines consists of two steps: 1) opening a connection to each text file and assigning it to a variable, and 2) reading the lines with readLines().
The first step is optional (readLines() can also take a file path directly), but it will be very useful for the exercise. The following code accesses the text files of the English language dataset in a specific directory (“final/en_US”) and assigns the data to variables. There are three files, with data from Twitter, blogs and news in US English.
# READING en_US DATA FROM EXTRACTED FOLDERS
# ASSIGNING A VARIABLE TO EACH OF THE TEXT FILES
setwd("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final/en_US")
twitter_us_en <- file("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final/en_US/en_US.twitter.txt", "r")
blogs_us_en <- file("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final/en_US/en_US.blogs.txt", "r")
news_us_en <- file("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final/en_US/en_US.news.txt", "r")
# READING THE LINES
txt_twitter_en_us <- readLines(twitter_us_en)
txt_blogs_en_us <- readLines(blogs_us_en)
txt_news_en_us <- readLines(news_us_en)
This process can be repeated, or put into a loop, to read the datasets provided in the other languages (a sketch is shown after the note below).
NOTE: After reading the lines, the data is available in the R environment. Once a connection is no longer needed, it is important to close it.
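For example, the reading (and the closing of each connection) could be wrapped in a loop covering all four language folders. This is only a sketch: it assumes the other datasets follow the same language/language.source.txt naming pattern seen in en_US, that the working directory is the one containing “final”, and that the non-English files read correctly with the default encoding.
# SKETCH: READ EVERY LANGUAGE/SOURCE COMBINATION AND CLOSE EACH CONNECTION
languages <- c("de_DE", "en_US", "fi_FI", "ru_RU")
sources <- c("twitter", "blogs", "news")
texts <- list()
for (lang in languages) {
  for (src in sources) {
    path <- file.path("final", lang, paste0(lang, ".", src, ".txt"))
    con <- file(path, "r")
    texts[[paste(lang, src, sep = "_")]] <- readLines(con)
    close(con) # CLOSE THE CONNECTION AS SOON AS THE DATA HAS BEEN READ
  }
}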
The following scripts analyze 1) the general dimensions of the dataset, 2) the length of the lines in characters, 3) the length of the lines in words and 4) word-specific data. The data is also used to create basic plots that help to better understand its general characteristics through visualization.
# SIZE OF EACH DATASET IN MEMORY (BYTES)
bytes_en_us <- cbind(object.size(txt_twitter_en_us), object.size(txt_blogs_en_us), object.size(txt_news_en_us))
# NUMBER OF TOTAL LINES
lines_en_us <- cbind(length(txt_twitter_en_us), length(txt_blogs_en_us), length(txt_news_en_us))
# NUMBER OF TOTAL WORDS (NOT INCLUDED IN THE SUMMARY TABLE)
# words_en_us <- cbind(sum(sapply(strsplit(txt_twitter_en_us, " "), length)), sum(sapply(strsplit(txt_blogs_en_us, " "), length)), sum(sapply(strsplit(txt_news_en_us, " "), length)))
# NUMBER OF TOTAL CHARACTERS
characters_en_us <- cbind(sum(nchar(txt_twitter_en_us)), sum(nchar(txt_blogs_en_us)), sum(nchar(txt_news_en_us)))
data_summary <- rbind(bytes_en_us, lines_en_us, characters_en_us)
rownames(data_summary) <- c("Bytes", "Lines", "Characters")
colnames(data_summary) <- c("Twitter", "Blogs", "News")
data_summary
## Twitter Blogs News
## Bytes 334484736 267758632 269840992
## Lines 2360148 899288 1010242
## Characters 162096031 206824505 203223159
# NUMBER OF CHARACTERS PER LINE (SUMMARIES)
summary(nchar(txt_twitter_en_us))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.7 100.0 140.0
summary(nchar(txt_blogs_en_us))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40833
summary(nchar(txt_news_en_us))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 110 185 201 268 11384
# NUMBER OF CHARACTERS PER LINE DISTRIBUTION (HISTOGRAMS)
hist(nchar(txt_twitter_en_us), main = "Twitter line length histogram (characters)",
     xlab = "Number of characters")
hist(nchar(txt_blogs_en_us), main = "Blog line length histogram (characters)",
     xlab = "Number of characters")
hist(nchar(txt_news_en_us), main = "News line length histogram (characters)",
     xlab = "Number of characters")
# NUMBER OF WORDS PER LINE
wordcount_line_en_us <- cbind(sapply(strsplit(txt_twitter_en_us, " "), length), sapply(strsplit(txt_blogs_en_us, " "), length), sapply(strsplit(txt_news_en_us, " "), length))
summary(sapply(strsplit(txt_twitter_en_us, " "), length))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 12.0 12.9 18.0 47.0
summary(sapply(strsplit(txt_blogs_en_us, " "), length))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 9 28 42 59 6630
summary(sapply(strsplit(txt_news_en_us, " "), length))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 19 31 34 45 1792
# NUMBER OF WORDS PER LINE (HISTOGRAM)
hist(sapply(strsplit(txt_twitter_en_us, " "), length), main = "Twitter line length histogram (words)", xlab = "Number of words")
hist(sapply(strsplit(txt_blogs_en_us, " "), length), main = "Blog line length histogram (words)", xlab = "Number of words")
hist(sapply(strsplit(txt_news_en_us, " "), length), main = "News line length histogram (words)", xlab = "Number of words")
This part of the activity analyzes specific words used in the datasets: 1) the use of explicit language (“fuck”, “shit” and “bitch”), 2) the use of “stopwords” and 3) the use of words related to the environment.
NOTE: “Stopwords” are words usually excluded from searches to help improve indexing. The list of “stopwords” may vary, but it usually consists of connectors, auxiliaries and conjunctions that do not help to classify or sort the information each text file contains.
# COUNT LINES CONTAINING EACH EXPLICIT WORD
explicit_twitter_en_us <- c(sum(grepl("[Ff]uck", txt_twitter_en_us)), sum(grepl("[Ss]hit", txt_twitter_en_us)), sum(grepl("[Bb]itch", txt_twitter_en_us)))
explicit_blogs_en_us <- c(sum(grepl("[Ff]uck", txt_blogs_en_us)), sum(grepl("[Ss]hit", txt_blogs_en_us)), sum(grepl("[Bb]itch", txt_blogs_en_us)))
explicit_news_en_us <- c(sum(grepl("[Ff]uck", txt_news_en_us)), sum(grepl("[Ss]hit", txt_news_en_us)), sum(grepl("[Bb]itch", txt_news_en_us)))
explicit_en_us <- cbind(explicit_twitter_en_us, explicit_blogs_en_us, explicit_news_en_us)
colnames(explicit_en_us) <- c("Twitter", "Blogs", "News")
rownames(explicit_en_us) <- c("Use of word 'fuck'", "Use of 'shit'", "Use of word 'bitch'")
explicit_en_us
## Twitter Blogs News
## Use of word 'fuck' 22221 3033 0
## Use of 'shit' 19554 3000 49
## Use of word 'bitch' 10238 923 74
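Note that grepl() flags lines containing at least one match, so the counts above are numbers of matching lines rather than total occurrences of each word. If total occurrences were wanted instead, a small helper along the following lines could be used (a sketch based on base R's gregexpr(); the function name count_occurrences is only illustrative):
# SKETCH: COUNT TOTAL OCCURRENCES OF A PATTERN INSTEAD OF MATCHING LINES
count_occurrences <- function(pattern, text) {
  matches <- gregexpr(pattern, text)
  sum(sapply(matches, function(m) if (m[1] == -1) 0 else length(m)))
}
count_occurrences("[Ff]uck", txt_twitter_en_us)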
“Stopwords” are words that do not contribute much to the classification or analysis of a text, such as conjunctions and connecting words. There are special R packages for this type of analysis. However, in this case it will be done by providing a manually created text file with a list of words to search for.
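As an alternative to a manual list, a ready-made English stopword list could be obtained from a package such as tm. This is only a sketch (the package would need to be installed first, and the variable name stopwords_tm is illustrative):
# SKETCH: OBTAIN A PREDEFINED ENGLISH STOPWORD LIST FROM THE tm PACKAGE
library(tm)
stopwords_tm <- stopwords("english")
head(stopwords_tm)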
# READ THE LIST OF STOPWORDS AND ASSIGN IT TO THE VARIABLE "stopwords_list"
setwd("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos")
stopwords_con <- file("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/stopwords_list.txt", "r")
stopwords_list <- readLines(stopwords_con)
close(stopwords_con) # CLOSE THE CONNECTION ONCE THE LIST HAS BEEN READ
# COUNT LINES CONTAINING AT LEAST ONE STOPWORD; grepl() ONLY USES THE FIRST ELEMENT OF A VECTOR PATTERN, SO THE LIST IS COLLAPSED INTO A SINGLE REGULAR EXPRESSION
stopwords_pattern <- paste(stopwords_list, collapse = "|")
stopwords_en_us <- cbind(sum(grepl(stopwords_pattern, txt_twitter_en_us)), sum(grepl(stopwords_pattern, txt_blogs_en_us)), sum(grepl(stopwords_pattern, txt_news_en_us)))
stopwords_en_us <- rbind(stopwords_en_us, colSums(wordcount_line_en_us), (stopwords_en_us/colSums(wordcount_line_en_us)*100))
colnames(stopwords_en_us) <- c("Twitter", "Blogs", "News")
rownames(stopwords_en_us) <- c("Stopwords", "Total words", "stopwords (%)")
stopwords_en_us
## Twitter Blogs News
## Stopwords 2324675.00 885716.000 997976.00
## Total words 30373543.00 97962848.000 80326185.00
## stopwords (%) 7.65 0.904 1.24
Twitter line lengths are more uniform because of the 140-character limit. The blog and news data vary much more; in fact, they vary so much that their distributions are hard to visualize directly with histograms. This is indicative of the wide range of line lengths used in those entries.
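One way to make such skewed distributions easier to inspect is to plot the line lengths on a logarithmic scale, for example:
# SKETCH: HISTOGRAM OF BLOG LINE LENGTHS ON A LOG10 SCALE
hist(log10(nchar(txt_blogs_en_us)), main = "Blog line length histogram (log10 characters)",
     xlab = "log10(number of characters)")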
Twitter lines contained the most explicit words (namely “fuck”, “shit” and “bitch”), followed by blogs and then news. This is not surprising, since many blogs are independent and personal, while news data likely comes from companies or businesses that have to conform to editorial standards.
A very low percentage of stopwords was found in the natural language datasets, yet this was the part of the processing that took the longest, which raises two questions: 1) is it always worth eliminating stopwords? and 2) are there more efficient ways to find them?
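Regarding the second question, one possibly more efficient approach (a sketch using only base R; the function name count_stopwords is illustrative) is to split each corpus into lowercase words once and count exact matches against the stopword list, instead of running a regular expression over every line:
# SKETCH: COUNT STOPWORDS BY EXACT WORD MATCHING INSTEAD OF REGULAR EXPRESSIONS
count_stopwords <- function(text, stopwords) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  sum(words %in% tolower(stopwords))
}
count_stopwords(txt_twitter_en_us, stopwords_list)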