Data Science Capstone

Milestone Report

Author: Felipe Ruiz Bruzzone

Affiliation: Coursera & Johns Hopkins University - Data Science Specialization

Published: April 9, 2024

1 Introduction

In this milestone report we present the first steps toward building a prediction app for Coursera’s Data Science Capstone Project course. Throughout this executive report we show, in a reproducible way, the steps taken to perform an exploratory analysis of the data.

2 Data preparation

In this section we show how the data are prepared and transformed for exploratory analysis.

2.1 Data sets download

The first step is to download the original data from the web. The data are stored in a zip file at the URL defined in the code below, so we build a conditional download that fetches the archive only when the data are not already on disk, and we automate its extraction. We work only with the English-language files, which are listed after the extraction.

# Define download URL
trainURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Conditional creation of 'data' folder
if (!dir.exists("data")) {
  dir.create("data")
}

# Conditional download & unzip of the data
if (!dir.exists("data/final/en_US")) {
  tempFile <- tempfile(fileext = ".zip")
  download.file(trainURL, tempFile, mode = "wb")  # binary mode keeps the zip intact on Windows
  unzip(tempFile, exdir = "data")
  unlink(tempFile)   # delete the temporary archive
  rm(tempFile)       # only defined inside this branch, so remove it here
}
rm(trainURL)

list.files("data/final/en_US")
[1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
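The code above unzips the whole archive. A minimal sketch of extracting only the English files could look as follows, assuming the archive stores them under a path that contains en_US (as the listing above suggests). The helper name english_entries and the zipFile variable are hypothetical and not part of the original report.

```r
# Hypothetical helper: keep only the archive entries that belong to the
# English (en_US) corpus and are text files.
english_entries <- function(paths) {
  paths[grepl("en_US", paths, fixed = TRUE) & grepl("\\.txt$", paths)]
}

# Usage sketch (assumes the zip was downloaded to `zipFile`):
# entries <- unzip(zipFile, list = TRUE)$Name   # list contents without extracting
# unzip(zipFile, files = english_entries(entries), exdir = "data")

# Toy check on made-up entry names
english_entries(c("final/de_DE/de_DE.news.txt",
                  "final/en_US/",
                  "final/en_US/en_US.news.txt"))
```

Filtering the entry names first keeps the extraction step a single unzip() call and avoids writing the other languages to disk.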

Once the files are extracted, we load the data separately by source: blogs, news and Twitter.

## Extraction of corpus by source

# blogs
blogsFileName <- "data/final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# news
newsFileName <- "data/final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# twitter
twitterFileName <- "data/final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con, twitterFileName, newsFileName, blogsFileName)
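The three loading steps above repeat the same pattern, so they could be wrapped in a small helper. The sketch below is one possible version, not the author's code: the name read_corpus is hypothetical, and opening the connection in binary mode ("rb") is an assumption intended to keep stray control characters from cutting the read short on some platforms.

```r
# Hypothetical helper: read every line of a text file.
# "rb" opens the connection in binary mode so that control characters
# embedded in the text are not treated as end-of-file on some platforms.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection, even on error
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage sketch with the paths from this report:
# blogs   <- read_corpus("data/final/en_US/en_US.blogs.txt")
# news    <- read_corpus("data/final/en_US/en_US.news.txt")
# twitter <- read_corpus("data/final/en_US/en_US.twitter.txt")
```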

2.2 Language encoding transformation

We need to convert all characters to ASCII because the news file had special characters (emoticons) that can cause problems in later computations; characters with no ASCII equivalent are dropped. After that step, we save the cleaned files in .txt format.

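# The files were read as UTF-8; sub = "" drops every byte with no ASCII equivalent
blogs <- iconv(blogs, "UTF-8", "ASCII", sub = "")
news <- iconv(news, "UTF-8", "ASCII", sub = "")
twitter <- iconv(twitter, "UTF-8", "ASCII", sub = "")

# Save the data to plain-text .txt files
# (save() would write R's binary format, whatever the file extension)
writeLines(blogs, "data/blogs.txt")
writeLines(news, "data/news.txt")
writeLines(twitter, "data/twitter.txt")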
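To illustrate what the conversion does, the toy example below runs iconv on a few made-up strings. It is only a sketch: it declares the input as UTF-8, which is an assumption based on how the files were read, and the sample strings are not taken from the corpus.

```r
# Toy input: an accented letter and an emoji, written with Unicode escapes
x <- c("caf\u00e9", "great day \U0001F600", "plain ascii")

# sub = "" drops every byte that cannot be represented in ASCII
ascii <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
print(ascii)

# Check that only ASCII code points remain
all(utf8ToInt(paste(ascii, collapse = "")) < 128)
```

Note that characters are deleted rather than transliterated, so the accented letter disappears instead of becoming a plain "e".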

2.3 Basic Statistics of original files

In the following table (Table 1) we summarise the properties of the files themselves. The Blogs file is the largest in size (200,42 MB), whereas the Twitter file has the most lines (2.360.148). The Blogs file also has the most words (37.546.250), characters (206.824.505) and empty spaces (36.434.843).

Table 1: File properties, by source

Source    Size (MB)   Total lines   Total words   Total characters   Empty spaces
Blogs        200,42       899.288    37.546.250        206.824.505     36.434.843
Twitter      159,36     2.360.148    30.093.413        162.096.241     28.013.435
News         196,28     1.010.242    34.762.395        203.223.159     33.362.288
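The code that produced Table 1 is not shown in this report. A minimal sketch of how such figures could be computed is given below, reusing the stringi functions that the report applies to the sample. The helper name text_stats and its arguments are hypothetical.

```r
library(stringi)

# Hypothetical helper: one row of summary statistics for a character vector.
# `path` is optional; when given, the file size on disk is reported in MB.
text_stats <- function(lines, source, path = NA_character_) {
  data.frame(
    source      = source,
    size_mb     = if (is.na(path)) NA_real_ else round(file.info(path)$size / 1024^2, 2),
    lines       = length(lines),
    words       = sum(stri_count_words(lines)),
    chars       = sum(stri_length(lines)),
    chars_empty = sum(stri_count_fixed(lines, " "))
  )
}

# Toy example; with the real data one could call, for instance,
# rbind(text_stats(blogs, "Blogs", "data/final/en_US/en_US.blogs.txt"), ...)
text_stats(c("hello world", "a b c"), "toy")
```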

2.4 Data sampling and preliminary corpus features

Given the large size of these files, a sampling procedure is needed to keep computation manageable while we design and test our prediction app. We sample 10.000 lines from each file and combine the results into a single corpus called all_samp.

library(stringi)  # stri_count_words(), stri_length(), stri_count_fixed()

# Draw 10000 random lines from each source and combine them into one corpus
blogs_samp <- sample(blogs, 10000)
news_samp <- sample(news, 10000)
twitter_samp <- sample(twitter, 10000)
all_samp <- c(blogs_samp, news_samp, twitter_samp)

# Save the sampled data to a .txt file
writeLines(all_samp, "data/all_samp.txt")

# Statistics for the sample
samp_size <- file.info("data/all_samp.txt")$size / 1024^2  # file size in MB
samp_lines <- length(all_samp)
samp_words <- sum(stri_count_words(all_samp))
samp_char <- sum(stri_length(all_samp))
samp_empty_char <- sum(stri_count_fixed(all_samp, " "))

# Create table with results
all_samp_table <- data.frame(source = "corpus",
                             size = round(samp_size, digits = 2),
                             lines = samp_lines,
                             words = samp_words,
                             chars = samp_char,
                             chars_empty = samp_empty_char)

In the following table (Table 2) we summarise the properties of the sampled corpus: size (in MB), total lines, total words, total characters and total empty spaces.

Table 2: Sample properties

Source    Size (MB)   Total lines   Total words   Total characters   Empty spaces
corpus         4,77        30.000       886.262          4.970.887        852.233
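Because the sample is random, the figures in Table 2 change from run to run unless the random seed is fixed. The sketch below shows one way to make the draw repeatable. The helper name sample_lines and the seed value are arbitrary assumptions, not the settings used to produce Table 2.

```r
# Hypothetical helper: draw `n` lines from `x` after fixing the random seed.
sample_lines <- function(x, n, seed) {
  set.seed(seed)                 # same seed -> same sample on every run
  sample(x, min(n, length(x)))   # never ask for more lines than exist
}

toy <- paste("line", 1:100)
first  <- sample_lines(toy, 10, seed = 2024)  # the seed value is arbitrary
second <- sample_lines(toy, 10, seed = 2024)
identical(first, second)
```

With the real data, the same helper could be called once per source, for example sample_lines(blogs, 10000, seed = 2024).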

3 Corpus preparation

3.1 Corpus preparation: overview

Starting from the corpus saved in all_samp.txt, we use the tidytext package, which includes natural language processing tools, to perform the following transformations on our corpus:

  1. Convert all words to lower case.
  2. Strip away all white spaces.
  3. Strip away all punctuation marks.
  4. Strip away all numbers.
  5. Strip away various non-alphanumeric characters.
  6. Remove stop words, that is, words that appear frequently in written text but are not relevant for analysis (such as “the”, “and”, “also”, etc.).
  7. Strip away all links to webpages (URL addresses).
  8. Remove profanity.
  9. Stemming to remove common word endings (e.g. ‘’s’, ‘ing’, etc.).

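Several of these operations are performed by the unnest_tokens function from the tidytext package, which by default converts tokens to lower case and strips punctuation and white space.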
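The link-removal step from the list above does not appear in the code of this report. One way to do it before tokenization is sketched below. The helper name strip_urls and the regular expression are assumptions, and the pattern only catches links that start with http://, https:// or www.

```r
# Hypothetical helper: delete web links, then collapse leftover white space.
strip_urls <- function(x) {
  x <- gsub("(https?://|www\\.)\\S+", "", x, perl = TRUE)  # drop the link itself
  trimws(gsub("\\s+", " ", x, perl = TRUE))                # tidy up spacing
}

strip_urls(c("read this https://example.com/post now", "no link here"))
```

Removing links before tokenization matters because the tokenizer would otherwise split each URL into several fragments that look like ordinary words.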

3.2 Tokenization, Text Cleaning and Normalization

After cleaning and validating our corpus, we build unigrams and bigrams. These new representations of the corpus allow us to perform exploratory analysis, such as computing word frequencies and correlations between words. The next piece of code reports all the operations for the unigram calculation.

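library(dplyr)      # filter(), anti_join(), mutate()
library(tidytext)   # unnest_tokens(), stop_words
library(SnowballC)  # wordStem()

# Set as data frame, one line of text per row
corpus <- data.frame(all_samp = all_samp, stringsAsFactors = FALSE)

# Tokenization, text cleaning & normalization (unigram)
# `profanities` is a data frame with a `word` column; its creation is not shown in this report
unigram <- corpus |>
  unnest_tokens(output = word, input = all_samp) |>  # split text into words
  filter(!grepl("[0-9]", word)) |>                   # remove numbers
  anti_join(stop_words, by = "word") |>              # remove stop words
  anti_join(profanities, by = "word") |>             # remove profanities
  mutate(stem = wordStem(word))                      # stem words into a new column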
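The profanities object used above is not defined in this report. A minimal sketch of how such a table could be built is shown below, assuming a plain-text word list with one term per line. The helper name read_word_list and the toy file are hypothetical; the actual list used by the author is unknown.

```r
# Hypothetical helper: turn a one-word-per-line text file into a data frame
# with a `word` column, the shape that the filtering steps expect.
read_word_list <- function(path) {
  words <- tolower(trimws(readLines(path, warn = FALSE)))
  data.frame(word = unique(words[nzchar(words)]), stringsAsFactors = FALSE)
}

# Toy demonstration with a temporary file standing in for a real list
tmp <- tempfile(fileext = ".txt")
writeLines(c("Darn", "heck", "", "heck"), tmp)
toy_profanities <- read_word_list(tmp)
unlink(tmp)
toy_profanities
```

Lower-casing the list matters because unnest_tokens lower-cases the corpus by default, so a capitalised entry would never match.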

Following the same processing plan, the next piece of code reports all the operations for the bigram calculation.

library(tidyr)  # separate(), unite(); dplyr, tidytext and SnowballC must also be loaded

# Tokenization, text cleaning & normalization (bigram)
bigram <- corpus |>
  unnest_tokens(output = word, input = all_samp, token = "ngrams", n = 2) |>  # split text into bigrams
  filter(!is.na(word)) |>                            # drop lines with fewer than two words
  separate(word, c("word1", "word2"), sep = " ") |>  # separate bigram for cleaning
  filter(!grepl("[0-9]", word1)) |>                  # remove numbers
  filter(!grepl("[0-9]", word2)) |>                  # remove numbers
  filter(!word1 %in% stop_words$word) |>             # remove stop words
  filter(!word2 %in% stop_words$word) |>             # remove stop words
  filter(!word1 %in% profanities$word) |>            # remove profanities
  filter(!word2 %in% profanities$word) |>            # remove profanities
  mutate(stem1 = wordStem(word1),
         stem2 = wordStem(word2)) |>                 # stem each word into a new column
  unite(stem, stem1, stem2, sep = " ")               # rejoin the stemmed pair

4 Exploratory Data Analysis

With our data prepared, in this section we present histograms to explore the frequencies of words in our corpus. The following figure shows the fifty most common unique words in our corpus.

Figure 1: Unigrams

The following figure shows the fifty most common two-word combinations in our corpus.

Figure 2: Bigrams
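The code that produced the two figures is not included in this report. The sketch below shows one way the term frequencies could be counted and plotted with dplyr and ggplot2, assuming a data frame with a stem column such as the ones built in Section 3.2. The helper name top_terms, the toy data and the plot styling are assumptions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: the `n` most frequent values of the `stem` column
top_terms <- function(tokens, n = 50) {
  tokens |>
    count(stem, sort = TRUE, name = "freq") |>  # frequency per stem, descending
    head(n)
}

# Toy tokens standing in for the `unigram` or `bigram` data frames
toy <- data.frame(stem = c("time", "dai", "time", "love", "time", "dai"))
top <- top_terms(toy, n = 2)

# Horizontal bar chart, most frequent term at the top
ggplot(top, aes(x = reorder(stem, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")
```

With the real data, top_terms(unigram) and top_terms(bigram) would give the fifty most frequent stems and stem pairs respectively.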

5 Conclusions

5.1 Observations and next steps

  1. Eliminating non-useful characters improved our corpus for data analysis and predictive applications.
  2. Stemming the corpus made our computations more efficient.
  3. There are, however, some stemming problems that must be resolved before the corpus is used for prediction. For example, the corpus contains terms like happi instead of happy.
  4. Removing stop words made our corpus cleaner, but it remains an open question whether that exclusion is useful in predictive computations.
  5. Although the speed of computations improved after the processing in Section 3.2, we recommend finding further ways to improve the computational efficiency of the future predictive application.
