Natural language data

A natural language or ordinary language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are distinguished from constructed and formal languages, such as those used to program computers or to study logic. Natural languages contain many nuances and variations that depend on culture and society. Analyzing natural language data is essential for developing text prediction algorithms and software, such as the predictive keyboards on smartphones and the text editors used in word processors and email clients.

Exploratory Data Analysis of natural language datasets

This exercise is part of the Johns Hopkins University Data Science Specialization offered through Coursera. It consists of four activities:

  1. Download and extract natural language datasets in English, German, Finnish and Russian.
  2. Read the lines in the datasets
  3. Basic data exploration
  4. Analyze any relevant findings

I. Download and extract the natural language datasets

The following code downloads a zip file of natural language datasets and extracts them into a specific directory.

# DOWNLOADING THE DATA
setwd("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos")
datos_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
options(timeout = 1000000) # EXTENDS THE DOWNLOAD TIME LIMIT BEFORE DOWNLOADING THE LARGE FILE
download.file(datos_url, "Coursera-SwiftKey.zip")
datos_local <- "Coursera-SwiftKey.zip"
dir <- getwd()

# UNZIPPING THE DATA INTO THE WORKING DIRECTORY
unzip(datos_local, files = NULL, list = FALSE, overwrite = TRUE,
      junkpaths = FALSE, exdir = ".", unzip = "internal",
      setTimes = FALSE)

After downloading and extracting the data, a directory named “final” should appear with the following four subdirectories:

  • fi_FI
  • en_US
  • ru_RU
  • de_DE

Each subdirectory contains three text files with data from Twitter, news and blogs. The data was acquired through the use of a web crawler. According to the course, the data has been processed in order to respect copyright laws.

II. Read the lines in the datasets

Reading the lines consists of two steps:

  1. Open a connection to each file using file().
  2. Read the data using readLines().

The first step is optional (readLines() can also take a file path directly), but it will be very useful for this exercise. The following code accesses the text files in a specific directory (“final/en_US”) and assigns the data to variables. There are three files with US English data from Twitter, blogs and news.

# READING en_US DATA FROM EXTRACTED FOLDERS

# ASSIGNING A VARIABLE TO EACH OF THE TEXT FILES
setwd("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final/en_US")
twitter_us_en <- file("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final/en_US/en_US.twitter.txt", "r") 
blogs_us_en <- file("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final/en_US/en_US.blogs.txt", "r")
news_us_en <- file("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final/en_US/en_US.news.txt", "r")

# READING THE LINES
txt_twitter_en_us <- readLines(twitter_us_en)
txt_blogs_en_us <- readLines(blogs_us_en)
txt_news_en_us <- readLines(news_us_en)

This process can be repeated or put into a loop in order to read the other provided datasets in other languages.
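As a sketch of that idea (assuming the other subdirectories follow the same <locale>/<locale>.<source>.txt naming pattern as en_US), all twelve files could be read in a nested loop:

# SKETCH: READ ALL LANGUAGES IN A LOOP (ASSUMES <locale>/<locale>.<source>.txt NAMING)
base_dir <- "/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/final"
locales <- c("en_US", "de_DE", "fi_FI", "ru_RU")
sources <- c("twitter", "blogs", "news")

datasets <- list()
for (loc in locales) {
  for (src in sources) {
    path <- file.path(base_dir, loc, paste0(loc, ".", src, ".txt"))
    con <- file(path, "r") # NON-ENGLISH FILES MAY NEED encoding = "UTF-8" HERE
    datasets[[paste(src, loc, sep = "_")]] <- readLines(con)
    close(con) # CLOSE EACH CONNECTION ONCE THE LINES ARE IN MEMORY
  }
}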

NOTE: After reading the lines, the data is available in the R environment. If it is not in use it is important to close the connection to the data.
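For the connections opened above, that simply means calling close() on each one after the lines have been read:

# CLOSING THE CONNECTIONS ONCE THE TEXT IS LOADED INTO MEMORY
close(twitter_us_en)
close(blogs_us_en)
close(news_us_en)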

III. Basic data exploration

The following scripts analyze 1) the general dimensions of the dataset, 2) the length of the lines in characters, 3) the length of the lines in words and 4) the use of specific words. The data is also used to create basic plots that help illustrate the general characteristics of the data through visualization.

1. General dataset dimensions

# OBJECT SIZE IN MEMORY (BYTES)
bytes_en_us <- cbind(object.size(txt_twitter_en_us), object.size(txt_blogs_en_us), object.size(txt_news_en_us))
# NUMBER OF TOTAL LINES
lines_en_us <- cbind(length(txt_twitter_en_us), length(txt_blogs_en_us), length(txt_news_en_us))
# NUMBER OF TOTAL CHARACTERS
characters_en_us <- cbind(sum(nchar(txt_twitter_en_us)), sum(nchar(txt_blogs_en_us)), sum(nchar(txt_news_en_us)))
data_summary <- rbind(bytes_en_us, lines_en_us, characters_en_us)
rownames(data_summary) <- c("Bytes", "Lines", "Characters")
colnames(data_summary) <- c("Twitter", "Blogs", "News")
data_summary
##              Twitter     Blogs      News
## Bytes      334484736 267758632 269840992
## Lines        2360148    899288   1010242
## Characters 162096031 206824505 203223159
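The total number of words per source (used later for the stopword percentages) can be added in the same style; a minimal sketch, splitting each line on spaces (the name words_en_us is introduced here for illustration), might be:

# SKETCH: TOTAL WORD COUNT PER SOURCE, SPLITTING EACH LINE ON SPACES
words_en_us <- cbind(sum(sapply(strsplit(txt_twitter_en_us, " "), length)),
                     sum(sapply(strsplit(txt_blogs_en_us, " "), length)),
                     sum(sapply(strsplit(txt_news_en_us, " "), length)))
colnames(words_en_us) <- c("Twitter", "Blogs", "News")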

2. Number of characters per line

# NUMBER OF CHARACTERS PER LINE (SUMMARY)
summary(nchar(txt_twitter_en_us))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.7   100.0   140.0
summary(nchar(txt_blogs_en_us))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833
summary(nchar(txt_news_en_us))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     110     185     201     268   11384
# DISTRIBUTION OF CHARACTERS PER LINE (HISTOGRAMS)
hist(nchar(txt_twitter_en_us), main = "Twitter line length histogram (characters)",
     xlab = "Number of characters")

hist(nchar(txt_blogs_en_us), main = "Blog line length histogram (characters)",
     xlab = "Number of characters")

hist(nchar(txt_news_en_us), main = "News line length histogram (characters)",
     xlab = "Number of characters")

3. Number of words per line

# NUMBER OF WORDS PER LINE (SPLITTING EACH LINE ON SPACES)

wordcount_line_en_us <- cbind(sapply(strsplit(txt_twitter_en_us, " "), length),
                              sapply(strsplit(txt_blogs_en_us, " "), length),
                              sapply(strsplit(txt_news_en_us, " "), length))

summary(sapply(strsplit(txt_twitter_en_us, " "), length))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     7.0    12.0    12.9    18.0    47.0
summary(sapply(strsplit(txt_blogs_en_us, " "), length))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       9      28      42      59    6630
summary(sapply(strsplit(txt_news_en_us, " "), length))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      19      31      34      45    1792
# NUMBER OF WORDS PER LINE (HISTOGRAM)
hist(sapply(strsplit(txt_twitter_en_us, " "), length), main = "Twitter line length histogram (words)", xlab = "Number of words")

hist(sapply(strsplit(txt_blogs_en_us, " "), length), main = "Blog line length histogram (words)", xlab = "Number of words")

hist(sapply(strsplit(txt_news_en_us, " "), length), main = "News line length histogram (words)", xlab = "Number of words")

4. Specific word use in the lines

This part of the activity analyzes the use of specific words in the datasets: 1) the use of explicit language (“fuck”, “shit” and “bitch”) and 2) the use of “stopwords”.

NOTE: “Stopwords” are words usually excluded from searches to help improve indexing. The list of “stopwords” may vary, but it usually includes connectors, auxiliaries and conjunctions that do not contribute to classifying or sorting the information in each text file.

Analyzing the use of explicit language

# NOTE: grepl() FLAGS EACH LINE THAT CONTAINS THE PATTERN AT LEAST ONCE,
# SO THESE FIGURES COUNT LINES, NOT TOTAL OCCURRENCES
explicit_twitter_en_us <- c(sum(grepl("[Ff]uck", txt_twitter_en_us)), sum(grepl("[Ss]hit", txt_twitter_en_us)), sum(grepl("[Bb]itch", txt_twitter_en_us)))
explicit_blogs_en_us <- c(sum(grepl("[Ff]uck", txt_blogs_en_us)), sum(grepl("[Ss]hit", txt_blogs_en_us)), sum(grepl("[Bb]itch", txt_blogs_en_us)))
explicit_news_en_us <- c(sum(grepl("[Ff]uck", txt_news_en_us)), sum(grepl("[Ss]hit", txt_news_en_us)), sum(grepl("[Bb]itch", txt_news_en_us)))

explicit_en_us <- cbind(explicit_twitter_en_us, explicit_blogs_en_us, explicit_news_en_us)
colnames(explicit_en_us) <- c("Twitter", "Blogs", "News")
rownames(explicit_en_us) <- c("Use of word 'fuck'", "Use of word 'shit'", "Use of word 'bitch'")
explicit_en_us
##                     Twitter Blogs News
## Use of word 'fuck'    22221  3033    0
## Use of word 'shit'    19554  3000   49
## Use of word 'bitch'   10238   923   74
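Since grepl() only flags lines, total occurrences would need a different approach; a sketch using gregexpr() (the helper count_occurrences is introduced here for illustration) could look like this:

# SKETCH: COUNT TOTAL OCCURRENCES OF A PATTERN, NOT JUST LINES CONTAINING IT
count_occurrences <- function(pattern, lines) {
  matches <- gregexpr(pattern, lines) # MATCH POSITIONS PER LINE (-1 MEANS NO MATCH)
  sum(sapply(matches, function(m) if (m[1] == -1) 0 else length(m)))
}
count_occurrences("[Ff]uck", txt_twitter_en_us) # TOTAL MATCHES IN THE TWITTER DATA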

Analyzing the use of “stopwords”

“Stopwords” are words that do not contribute much to the classification or analysis of a text, such as conjunctions and connecting words. There are R packages dedicated to this type of analysis. However, in this case it will be done with a manually created file (stopwords_list.txt) containing the list of words to search for.

# READ THE LIST OF STOPWORDS AND ASSIGN IT TO THE VARIABLE "stopwords_list"
setwd("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos")
stopwords_con <- file("/media/iskar/archivos/R/DATA_SCIENCE_CAPSTONE/datos/stopwords_list.txt", "r")
stopwords_list <- readLines(stopwords_con)
close(stopwords_con)

# grepl() EXPECTS A SINGLE PATTERN, SO THE LIST IS COLLAPSED INTO ONE ALTERNATION PATTERN;
# EACH SUM COUNTS LINES CONTAINING AT LEAST ONE STOPWORD
stopwords_pattern <- paste(stopwords_list, collapse = "|")
stopwords_en_us <- cbind(sum(grepl(stopwords_pattern, txt_twitter_en_us)), sum(grepl(stopwords_pattern, txt_blogs_en_us)), sum(grepl(stopwords_pattern, txt_news_en_us)))

stopwords_en_us <- rbind(stopwords_en_us, colSums(wordcount_line_en_us), (stopwords_en_us/colSums(wordcount_line_en_us)*100))

colnames(stopwords_en_us) <- c("Twitter", "Blogs", "News")
rownames(stopwords_en_us) <- c("Stopwords", "Total words", "stopwords (%)")
stopwords_en_us
##                   Twitter        Blogs        News
## Stopwords      2324675.00   885716.000   997976.00
## Total words   30373543.00 97962848.000 80326185.00
## stopwords (%)        7.65        0.904        1.24
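Because grepl() flags whole lines, the table above compares lines containing at least one stopword against total word counts. A token-based count would make the percentage directly comparable; a minimal sketch (the helper count_stopword_tokens is introduced here for illustration) is:

# SKETCH: COUNT STOPWORD TOKENS INSTEAD OF LINES
count_stopword_tokens <- function(lines, stopwords) {
  tokens <- unlist(strsplit(tolower(lines), "\\s+")) # SPLIT EVERY LINE INTO WORDS
  sum(tokens %in% tolower(stopwords))                # COUNT TOKENS THAT ARE STOPWORDS
}
stopword_tokens_en_us <- c(Twitter = count_stopword_tokens(txt_twitter_en_us, stopwords_list),
                           Blogs   = count_stopword_tokens(txt_blogs_en_us, stopwords_list),
                           News    = count_stopword_tokens(txt_news_en_us, stopwords_list))
stopword_tokens_en_us / colSums(wordcount_line_en_us) * 100 # PERCENTAGE OF STOPWORD TOKENS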

IV. Analyze any relevant findings

Regarding line length

Twitter line length is more structured due to the 140-character limit. News and blog line lengths vary widely; a few very long entries dominate the scale, which makes the raw histograms hard to read directly. This indicates a wide range of line lengths in those entries.
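One way to make those skewed distributions readable, sketched below, is to plot the character counts on a log scale or to restrict the x-axis to the bulk of the data:

# SKETCH: LOG-SCALE HISTOGRAM TO HANDLE THE LONG TAIL OF BLOG LINE LENGTHS
hist(log10(nchar(txt_blogs_en_us)), breaks = 50,
     main = "Blog line length histogram (log10 characters)",
     xlab = "log10(number of characters)")

# ALTERNATIVELY, ZOOM IN ON THE BULK OF THE DISTRIBUTION
hist(nchar(txt_blogs_en_us), breaks = 1000, xlim = c(0, 1000),
     main = "Blog line length histogram (zoomed)",
     xlab = "Number of characters")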

Regarding explicit words

Twitter lines used explicit words (namely “fuck”, “shit” and “bitch”) the most, followed by blogs and then news. This is not surprising, since many blogs are independent and personal, while news data generally comes from companies that must conform to editorial standards.

Regarding stopwords

Only a very low percentage of stopwords was reported for the natural language datasets, although this figure compares lines containing at least one stopword with total word counts, so it understates actual stopword usage. This was also the part of the processing that took the longest, which raises two questions: 1) is it always worth eliminating stopwords? 2) are there more efficient ways to find stopwords?