Coursera & Johns Hopkins University - Data Science Specialization
Published
April 9, 2024
1 Introduction
In this milestone report we present the first steps toward building a prediction app for Coursera’s Data Science Capstone Project course. Throughout this executive report we show, in a reproducible way, the steps taken to perform an exploratory analysis of the data.
2 Data preparation
In this section we show how the data is prepared and transformed so that the exploratory analysis can be performed.
2.1 Data sets download
The first step is to download the original data from the web. The data is stored in a zip file which may be downloaded here; for that reason, we build a conditional download of the data and automate its extraction from the compressed file. We extract only the English language files, as can be seen by checking the downloaded data.
After the original data has been extracted correctly, we load it by source: blogs, news and twitter.
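The conditional download and selective extraction described above could be sketched as follows. This is a minimal sketch, not the report's actual code: the helper names fetch_if_missing and extract_english, the placeholder URL and the local paths are all assumptions introduced for illustration.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` is missing.
# Returns TRUE if a download was attempted, FALSE if the file was already there.
fetch_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  download.file(url, destfile = dest, mode = "wb")  # "wb" keeps the zip intact on Windows
  TRUE
}

# Hypothetical helper: extract only the archive entries whose path contains `pattern`.
extract_english <- function(zip_path, exdir = "data", pattern = "en_US") {
  entries <- unzip(zip_path, list = TRUE)$Name
  wanted <- grep(pattern, entries, value = TRUE, fixed = TRUE)
  unzip(zip_path, files = wanted, exdir = exdir)
}

# Usage (placeholder URL and file name; use the link given in the course materials):
# data_url <- "https://example.org/capstone-data.zip"
# fetch_if_missing(data_url, "data/capstone-data.zip")
# extract_english("data/capstone-data.zip")
```

Checking for the file before downloading means the report can be re-rendered without fetching the archive again.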
## Extraction of corpus by source
# blogs
blogsFileName <- "data/final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# news
newsFileName <- "data/final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# twitter
twitterFileName <- "data/final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
remove(twitterFileName, newsFileName, blogsFileName)
rm(con)
2.2 Language encoding transformation
We need to convert all characters to ASCII because the news file had special characters (emoticons) that can cause problems in further computations. After that step, we save the data as .txt files.
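# Drop every character that cannot be represented in ASCII
blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")
# Save the data to .txt files
# (writeLines() writes plain text; save() would write R's binary format)
writeLines(blogs, "data/blogs.txt")
writeLines(news, "data/news.txt")
writeLines(twitter, "data/twitter.txt")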
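To illustrate what this conversion does, the toy example below applies the same iconv call to two made-up strings. With sub = "", every byte that cannot be represented in ASCII is dropped rather than replaced. The sample strings are invented for this sketch.

```r
# Made-up strings containing an accented letter and an emoji
x <- c("caf\u00e9 au lait", "great day \U0001F600")

# Same call as in the report: non-convertible bytes are removed (sub = "")
x_ascii <- iconv(x, "latin1", "ASCII", sub = "")

print(x_ascii)

# Check that only ASCII characters remain in every string
only_ascii <- all(vapply(x_ascii, function(s) all(utf8ToInt(s) < 128L), logical(1)))
print(only_ascii)
```

Note that dropping characters can leave stray whitespace behind, for instance where an emoji stood at the end of a line.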
2.3 Basic Statistics of original files
In the following table (Table 1) we summarise the properties of the files themselves. The Blogs file is the largest in size (200,42 MB), whereas the Twitter file has the most lines (2.360.148). The Blogs file is also the largest in total words (37.546.250), total characters (206.824.505) and total empty spaces (36.434.843).
Table 1: File properties, by source
Source    Size (MB)   Total lines   Total words   Total characters   Empty spaces
Blogs     200,42      899.288       37.546.250    206.824.505        36.434.843
Twitter   159,36      2.360.148     30.093.413    162.096.241        28.013.435
News      196,28      1.010.242     34.762.395    203.223.159        33.362.288
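The report does not show the code behind Table 1. A minimal sketch of how such figures could be computed is given below, assuming the stringi package and a hypothetical helper named file_stats; the exact numbers depend on whether the statistics are taken before or after the ASCII conversion.

```r
library(stringi)

# Hypothetical helper: summarise a character vector of lines.
# `path` is optional; when given, the file size in MB is included.
file_stats <- function(lines, source, path = NULL) {
  size_mb <- if (is.null(path)) NA_real_ else round(file.info(path)$size / 1024^2, 2)
  data.frame(
    source = source,
    size = size_mb,
    lines = length(lines),
    words = sum(stri_count_words(lines)),
    chars = sum(stri_length(lines)),
    chars_empty = sum(stri_count_fixed(lines, " "))
  )
}

# Toy input so the sketch runs on its own
toy <- c("hello world", "one two three")
toy_stats <- file_stats(toy, "toy")
print(toy_stats)

# With the real data, one row per source could be stacked like this:
# rbind(
#   file_stats(blogs, "Blogs", "data/final/en_US/en_US.blogs.txt"),
#   file_stats(twitter, "Twitter", "data/final/en_US/en_US.twitter.txt"),
#   file_stats(news, "News", "data/final/en_US/en_US.news.txt")
# )
```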
2.4 Data sampling and preliminary corpus features
Given the large size of these files, sampling is needed to keep computation efficient while we design and test our prediction app. We sample 10.000 lines from each file and combine the results into a single corpus called all_samp.
# Draw 10,000 random lines from each source
blogs_samp <- blogs[sample(length(blogs), 10000)]
news_samp <- news[sample(length(news), 10000)]
twitter_samp <- twitter[sample(length(twitter), 10000)]
all_samp <- c(blogs_samp, news_samp, twitter_samp)
# Save the sampled data to a .txt file
writeLines(all_samp, "data/all_samp.txt")
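Because sample() draws at random, rerunning the sampling code yields a different corpus each time. One common way to make the draw repeatable is to fix the random seed first; the sketch below shows the idea on a toy vector. The seed value 1234 and the helper name sample_lines are arbitrary choices for this illustration, not taken from the report.

```r
# Hypothetical helper: draw `n` lines at random, capped at the vector length
sample_lines <- function(lines, n) {
  lines[sample(length(lines), min(n, length(lines)))]
}

toy <- paste("line", 1:100)

set.seed(1234)  # arbitrary seed, assumed for illustration
first_draw <- sample_lines(toy, 10)

set.seed(1234)  # resetting the seed reproduces the same draw
second_draw <- sample_lines(toy, 10)

print(identical(first_draw, second_draw))
```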
library(stringi)  # stri_count_words(), stri_length(), stri_count_fixed()
# Statistics for the sample
samp_size <- file.info("data/all_samp.txt")$size / 1024.0^2
samp_lines <- length(all_samp)
samp_words <- sum(stri_count_words(all_samp))
samp_char <- sum(stri_length(all_samp))
samp_empty_char <- sum(stri_count_fixed(all_samp, ' '))
# Create table with results
all_samp_table <- data.frame(
  source = "corpus",
  size = round(samp_size, digits = 2),
  lines = samp_lines,
  words = samp_words,
  chars = samp_char,
  chars_empty = samp_empty_char
)
In the following table (Table 2) we summarise the properties of the sampled corpus: size (in MB), total lines, total words, total characters and total empty spaces.
Table 2: Sample properties
Source   Size (MB)   Total lines   Total words   Total characters   Empty spaces
corpus   4,77        30.000        886.262       4.970.887          852.233
3 Corpus preparation
3.1 Corpus preparation: overview
Starting from the corpus saved in all_samp.txt, we use the tidytext library, which includes natural language processing tools, to perform the following transformations on the corpus:
Convert all words to lower case.
Strip away all white spaces.
Strip away all punctuation marks.
Strip away all numbers.
Strip away various non-alphanumeric characters.
Remove stop words, that is, words that are not relevant for analysis but appear frequently in written text (such as “the”, “and”, “also”, etc.).
Strip away all links to webpages (URL addresses).
Remove profanity.
Stemming to remove common word endings (e.g. “’s”, “ing”, etc.).
Several of these operations, such as lower-casing and punctuation stripping, are performed by default by the unnest_tokens function of the tidytext package.
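The code shown in this report does not include the link-removal step, so the sketch below gives one possible way to do it with base R before tokenization. The helper name strip_urls and the regular expression are assumptions; the pattern only covers links starting with http://, https:// or www. and may miss other forms.

```r
# Hypothetical helper: remove URLs, then collapse the whitespace left behind
strip_urls <- function(x) {
  no_urls <- gsub("(https?://|www\\.)\\S+", "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", no_urls, perl = TRUE))
}

# Made-up example line
cleaned <- strip_urls("Read this https://example.com/post?id=1 and www.example.org today")
print(cleaned)
```

Removing links before tokenization matters because the tokenizer would otherwise split a URL into several word-like fragments.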
3.2 Tokenization, Text Cleaning and Normalization
After cleaning and validating the corpus, we build unigrams and bigrams. These new representations of the corpus allow us to perform exploratory analysis, such as computing word frequencies and correlations between words. The next piece of code reports all the operations for the unigram calculations.
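library(dplyr)      # filter(), anti_join(), mutate()
library(tidytext)   # unnest_tokens(), stop_words
library(SnowballC)  # wordStem()
# Set as df
corpus <- as.data.frame(all_samp)
# Tokenization, text cleaning & normalization (unigram)
# `profanities` is expected to be a data frame with a `word` column
unigram <- corpus |>
  unnest_tokens(output = word, input = all_samp) |>  # split text into words
  filter(!grepl('[0-9]', word)) |>                   # remove numbers
  anti_join(stop_words, by = "word") |>              # remove stop words
  anti_join(profanities, by = "word") |>             # remove profanities
  mutate(stem = wordStem(word))                      # stem words into a new column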
Following the same processing plan, the next piece of code reports all the operations for the bigram calculations.
library(tidyr)  # separate(), unite()
# Tokenization, text cleaning & normalization (bigram)
bigram <- corpus |>
  unnest_tokens(output = word, input = all_samp, token = "ngrams", n = 2) |>  # split text into bigrams
  separate(word, c("word1", "word2"), sep = " ") |>  # separate bigram for cleaning
  filter(!grepl('[0-9]', word1)) |>                  # remove numbers
  filter(!grepl('[0-9]', word2)) |>                  # remove numbers
  filter(!word1 %in% stop_words$word) |>             # remove stop words
  filter(!word2 %in% stop_words$word) |>             # remove stop words
  filter(!word1 %in% profanities$word) |>            # remove profanities
  filter(!word2 %in% profanities$word) |>            # remove profanities
  mutate(stem1 = wordStem(word1),
         stem2 = wordStem(word2)) |>                 # stem both words into new columns
  unite(stem, stem1, stem2, sep = " ")               # rejoin the stems into one column
4 Exploratory Data Analysis
With our data prepared, in this section we present histograms to explore the frequencies of words in our corpus. The following figure shows the fifty most common words in our corpus.
Figure 1: Unigrams
The following figure shows the fifty most common two-word combinations in our corpus.
Figure 2: Bigrams
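The plotting code is not shown in the report. Assuming the unigram data frame built in Section 3.2 (one row per token, with a word column), the counts behind a chart like Figure 1 could be obtained roughly as follows. The toy data frame, the helper name top_terms and the use of ggplot2 are assumptions made for this sketch.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: the `n` most frequent values of `word`, most frequent first
top_terms <- function(tokens, n = 50) {
  tokens |>
    count(word, sort = TRUE, name = "freq") |>
    slice_head(n = n)
}

# Toy tokens standing in for the real `unigram` data frame
toy_tokens <- data.frame(word = c("time", "day", "time", "love", "day", "time"))
top_words <- top_terms(toy_tokens, n = 2)
print(top_words)

# Horizontal bar chart, most frequent term at the top
p <- ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")
```

For the bigrams, the same approach applies once the two words of each pair are held in a single column.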
5 Conclusions
5.1 Observations and next steps
Removing non-useful characters improved our corpus for data analysis and predictive applications.
Stemming the corpus made our computations more efficient.
There are, however, some stemming problems that must be resolved before the corpus is used for prediction. For example, the corpus contains terms like happi instead of happy.
Removing stop words made our corpus cleaner, but it remains an open question whether excluding them is useful for predictive computations.
Although the speed of computations improved after the Section 3.2 process, we recommend finding further ways to improve the computing efficiency of the future predictive application.
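The happi example mentioned above can be reproduced directly. The short sketch below assumes the wordStem function comes from the SnowballC package and is used with its default Porter stemmer; the extra input words are made up for illustration.

```r
library(SnowballC)

# "happy" is the example from the report; the other words are invented
words <- c("happy", "happiness", "running")
stems <- wordStem(words)  # default language is "porter"

print(data.frame(word = words, stem = stems))
```

Because stems such as happi are not valid words, a prediction app would need to map them back to a readable form before showing suggestions.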