Introduction

This analysis focuses on processing and summarizing data from three text files: Blogs, News, and Twitter. Each file is analyzed for its size, number of lines, and word counts. Given the large size of the files, we will use a small random sample from each dataset for further analysis.

Reading Data and Summarizing

First, we read the text data from the three sources and calculate basic statistics: file size, number of lines, and word counts.

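# Load the packages used in this report
library(stringi)       # stri_count_words()
library(knitr)         # kable()
library(tm)            # Corpus(), tm_map(), TermDocumentMatrix()
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

# Set the working directory to the folder that holds the en_US files
setwd('~/OneDrive/IPEA/licenca capacitacao 2024/Data Science Specialization/files/project capston/final/en_US')

# Read the files; skipNul = TRUE skips embedded NUL characters
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)
blogs <- readLines("en_US.blogs.txt", skipNul = TRUE)
news <- readLines("en_US.news.txt", warn = FALSE, skipNul = TRUE)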

# Initialize a list to store text data
text <- list(blogs = blogs, news = news, twitter = twitter)

# Create a summary matrix for file size, lines, and word counts
matrix.summary <- matrix(0, nrow = 3, ncol = 3, 
                         dimnames = list(c("blogs", "news", "twitter"),
                                         c("file size, Mb", "lines", "words")))

# Fill the matrix with actual values; the order blogs, news, twitter matches the row names
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# File size in bytes converted to megabytes
matrix.summary[, "file size, Mb"] <- file.info(files)$size / (1024 * 1024)

# Number of lines and total number of words per source
matrix.summary[, "lines"] <- sapply(text, length)
matrix.summary[, "words"] <- sapply(text, function(x) sum(stri_count_words(x)))

# Print the summary table
kable(matrix.summary)
          file size, Mb      lines       words
blogs          200.4242     899288    37546250
news           196.2775    1010242    34762395
twitter        159.3641    2360148    30093413

The table below summarizes the file size, number of lines, and word counts for each dataset:

          file size, Mb        lines         words
blogs            200.42      899,288    37,546,246
news             196.28    1,010,242    34,762,395
twitter          159.36    2,360,148    30,093,410
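
The counts in the table above also give the average line length in words, which comes out to roughly 42 for blogs, 34 for news, and 13 for Twitter. A minimal sketch of that arithmetic, with the values copied by hand from the table (the names `n_lines`, `n_words`, and `words_per_line` are illustrative and not part of the original analysis):

```r
# Line and word counts copied from the summary table
n_lines <- c(blogs = 899288, news = 1010242, twitter = 2360148)
n_words <- c(blogs = 37546246, news = 34762395, twitter = 30093410)

# Average number of words per line, rounded to one decimal place
words_per_line <- round(n_words / n_lines, 1)
print(words_per_line)
```

The much shorter Twitter lines are the reason that file has the most lines but the fewest words.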

Sampling the Data

Since the files are large, we randomly select 0.5% of the lines from each file, without replacement, for further analysis. The seed is fixed so that the same sample is drawn on every run.

set.seed(123)
blogs_sample <- sample(text$blogs, 0.005 * length(text$blogs))
news_sample <- sample(text$news, 0.005 * length(text$news))
twitter_sample <- sample(text$twitter, 0.005 * length(text$twitter))
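
The same sampling step can be wrapped in a small helper that makes the integer sample size explicit. This is only a sketch on toy data: `sample_lines`, `toy`, and `toy_sample` are hypothetical names, and the toy vector stands in for a real file.

```r
# Hypothetical helper: draw a fixed fraction of lines without replacement
sample_lines <- function(x, frac) {
  n <- floor(frac * length(x))      # explicit integer sample size
  x[sample.int(length(x), n)]       # sample positions, then pick those lines
}

set.seed(123)                       # fix the seed so the draw is reproducible
toy <- paste("line", 1:1000)        # stand-in for a real text file
toy_sample <- sample_lines(toy, 0.005)
length(toy_sample)                  # 5 lines, i.e. 0.5% of 1000
```

Calling `set.seed()` with the same value before the call reproduces the same sample.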

Text Processing and Word Frequency Analysis

Next, we clean the sampled text by converting it to lowercase and removing punctuation, numbers, English stopwords, and extra whitespace. We then find the ten most frequent words in each dataset and plot them as a bar chart and a word cloud.
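
To make the effect of these cleaning steps concrete, here is a minimal base-R sketch applied to a single sentence. It is not the tm pipeline used in this report: `clean_text` and the three-word stopword list are hypothetical, and tm's `stopwords("english")` list is considerably longer.

```r
# Hypothetical helper: apply the cleaning steps in the same order as the report
clean_text <- function(x, stop_words) {
  x <- tolower(x)                               # 1. lowercase
  x <- gsub("[[:punct:]]+", "", x)              # 2. remove punctuation
  x <- gsub("[[:digit:]]+", "", x)              # 3. remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)        # 4. remove whole-word stopwords
  x <- gsub("\\s+", " ", x)                     # 5. collapse runs of whitespace
  trimws(x)                                     # also trim the ends (an extra step in this sketch)
}

cleaned <- clean_text("The 3 quick foxes, and the lazy dog!",
                      stop_words = c("the", "and", "a"))
print(cleaned)   # "quick foxes lazy dog"
```

The order matters: lowercasing first lets a lowercase stopword list match "The", and whitespace is collapsed last because the earlier removals leave gaps behind.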

Blogs Data

# Create corpus for blogs
corpus1 <- Corpus(VectorSource(blogs_sample))

# Clean the text: lowercase, then remove punctuation, numbers, stopwords, and extra whitespace.
# On a SimpleCorpus each tm_map() call warns "transformation drops documents";
# the warning is triggered by missing document names, not by lost documents.
corpus1 <- tm_map(corpus1, content_transformer(tolower))
corpus1 <- tm_map(corpus1, removePunctuation)
corpus1 <- tm_map(corpus1, removeNumbers)
corpus1 <- tm_map(corpus1, removeWords, stopwords("english"))
corpus1 <- tm_map(corpus1, stripWhitespace)

# Build the term-document matrix once and reuse it for both plots
term.doc.matrix1 <- TermDocumentMatrix(corpus1)
word.freqs1 <- sort(rowSums(as.matrix(term.doc.matrix1)), decreasing = TRUE)

# Ten most frequent words
frequentWords <- head(word.freqs1, 10)

# Barplot of frequent words
barplot(frequentWords, 
        main = "Blogs Data: Most Frequent Words", 
        xlab = "Word", 
        ylab = "Count", 
        col = "lightblue")

# Word cloud of the words that occur at least 100 times in the sample
dm1 <- data.frame(word = names(word.freqs1), freq = word.freqs1)
wordcloud(dm1$word, dm1$freq, min.freq = 100, random.order = TRUE, rot.per = 0.25, colors = brewer.pal(8, "Dark2"))

News Data

# Create corpus for news
corpus2 <- Corpus(VectorSource(news_sample))

# Clean the text: lowercase, then remove punctuation, numbers, stopwords, and extra whitespace.
# On a SimpleCorpus each tm_map() call warns "transformation drops documents";
# the warning is triggered by missing document names, not by lost documents.
corpus2 <- tm_map(corpus2, content_transformer(tolower))
corpus2 <- tm_map(corpus2, removePunctuation)
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
corpus2 <- tm_map(corpus2, stripWhitespace)

# Build the term-document matrix once and reuse it for both plots
term.doc.matrix2 <- TermDocumentMatrix(corpus2)
word.freqs2 <- sort(rowSums(as.matrix(term.doc.matrix2)), decreasing = TRUE)

# Ten most frequent words
frequentWords <- head(word.freqs2, 10)

# Barplot of frequent words
barplot(frequentWords, 
        main = "News Data: Most Frequent Words", 
        xlab = "Word", 
        ylab = "Count", 
        col = "lightgreen")

# Word cloud of the words that occur at least 100 times in the sample
dm2 <- data.frame(word = names(word.freqs2), freq = word.freqs2)
wordcloud(dm2$word, dm2$freq, min.freq = 100, random.order = TRUE, rot.per = 0.25, colors = brewer.pal(8, "Dark2"))

Twitter Data

# Create corpus for twitter
corpus3 <- Corpus(VectorSource(twitter_sample))

# Clean the text: lowercase, then remove punctuation, numbers, stopwords, and extra whitespace.
# On a SimpleCorpus each tm_map() call warns "transformation drops documents";
# the warning is triggered by missing document names, not by lost documents.
corpus3 <- tm_map(corpus3, content_transformer(tolower))
corpus3 <- tm_map(corpus3, removePunctuation)
corpus3 <- tm_map(corpus3, removeNumbers)
corpus3 <- tm_map(corpus3, removeWords, stopwords("english"))
corpus3 <- tm_map(corpus3, stripWhitespace)

# Build the term-document matrix once and reuse it for both plots
term.doc.matrix3 <- TermDocumentMatrix(corpus3)
word.freqs3 <- sort(rowSums(as.matrix(term.doc.matrix3)), decreasing = TRUE)

# Ten most frequent words
frequentWords <- head(word.freqs3, 10)

# Barplot of frequent words
barplot(frequentWords, 
        main = "Twitter Data: Most Frequent Words", 
        xlab = "Word", 
        ylab = "Count", 
        col = "lightcoral")

# Word cloud of the words that occur at least 100 times in the sample
dm3 <- data.frame(word = names(word.freqs3), freq = word.freqs3)
wordcloud(dm3$word, dm3$freq, min.freq = 100, random.order = FALSE, rot.per = 0.25, colors = brewer.pal(8, "Dark2"))
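
The frequency step in the three sections above is the same each time: count how often each term occurs and keep the largest counts. A minimal base-R sketch of that idea on toy data follows. It does not use tm's `TermDocumentMatrix`, and `top_words` and the toy strings are hypothetical.

```r
# Hypothetical helper: count tokens and return the n most frequent ones
top_words <- function(text_lines, n = 10) {
  tokens <- unlist(strsplit(text_lines, "\\s+"))   # split on runs of whitespace
  tokens <- tokens[nchar(tokens) > 0]              # drop empty strings from leading spaces
  freqs <- sort(table(tokens), decreasing = TRUE)  # counts, largest first
  head(freqs, n)
}

toy_lines <- c("good day good night", "good luck", "night shift")
top_two <- top_words(toy_lines, n = 2)
print(top_two)   # "good" occurs 3 times, "night" twice
```

On the real samples, the input would be the cleaned text rather than raw lines, so that stopwords and punctuation do not dominate the counts.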

Conclusion

This analysis identifies the most frequent words in the Blogs, News, and Twitter datasets, using a 0.5% random sample of each. After converting the text to lowercase and removing punctuation, numbers, and common stop words, the remaining high-frequency terms give a clearer view of what each source talks about. We expect blogs to show more personal and casual language and news to emphasize terms tied to current events and factual reporting. Twitter, a platform built for short, real-time messages, is likely to show a more dynamic vocabulary driven by trending topics.

Random sampling keeps the computational load manageable. A 0.5% sample should still reflect the most common vocabulary of the full files, although rare words may be missing from it. For each dataset, we built a text corpus, applied the preprocessing steps, and counted term frequencies to find the most common words. The result highlights the dominant terms in each dataset and gives a first look at how language use differs across the three sources.

In the next phase of the project, we plan to take this analysis further by implementing n-gram models. Unlike simple word frequency analysis, n-grams (such as bigrams and trigrams) capture sequences of words that commonly appear together. This approach will allow us to identify common phrases or combinations of terms that occur more frequently than by random chance. For instance, bigrams could reveal pairs of words frequently used together in news reports, such as “breaking news” or “government policy,” while trigrams might expose longer patterns, such as “global climate change” or “social media trends.”
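
As a rough illustration of what an n-gram is, the base-R sketch below slides a window of n words over one cleaned string. The function `make_ngrams` is hypothetical and is not the implementation planned for the next phase, which may rely on a dedicated tokenizer package.

```r
# Hypothetical helper: build all n-grams from one cleaned, lowercase string
make_ngrams <- function(text, n = 2) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]        # split into words
  if (length(tokens) < n) return(character(0))         # too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)            # start position of each window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- make_ngrams("global climate change policy", n = 2)
trigrams <- make_ngrams("global climate change policy", n = 3)
print(bigrams)    # "global climate" "climate change" "change policy"
print(trigrams)   # "global climate change" "climate change policy"
```

Counting these strings in the same way as single words gives the n-gram frequencies.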

Building on this n-gram analysis, we also aim to create predictive models. These models will use the patterns in word frequencies and sequences to predict the next word or phrase from the preceding text. Such models could be useful for applications such as text auto-completion, chatbots, and other natural language processing (NLP) tasks like sentiment analysis or topic modeling. Models trained on large text corpora can learn not only common phrases but also the context in which certain terms are likely to appear, which makes them more effective for real-world language prediction tasks.
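
The simplest version of such a model predicts the word that most often followed the previous word in the training text. The sketch below shows that idea with bigram counts on three toy lines. It is an assumption about how a first model might look, not the planned implementation, and `predict_next` and the toy data are hypothetical.

```r
# Hypothetical helper: predict the next word from bigram counts
predict_next <- function(corpus_lines, previous_word) {
  tokens_by_line <- strsplit(trimws(tolower(corpus_lines)), "\\s+")
  # Collect (first, second) word pairs within each line
  pairs <- do.call(rbind, lapply(tokens_by_line, function(tok) {
    if (length(tok) < 2) return(NULL)
    cbind(first = head(tok, -1), second = tail(tok, -1))
  }))
  if (is.null(pairs)) return(NA_character_)
  followers <- pairs[pairs[, "first"] == previous_word, "second"]
  if (length(followers) == 0) return(NA_character_)   # word never seen as a first word
  # Most frequent follower; ties go to the alphabetically first word
  names(which.max(table(followers)))
}

toy_corpus <- c("breaking news today", "breaking news tonight", "breaking bread")
prediction <- predict_next(toy_corpus, "breaking")
print(prediction)   # "news", which follows "breaking" twice versus once for "bread"
```

A practical model would also need a fallback for unseen words, for example backing off to the overall most frequent word instead of returning `NA`.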

In summary, this initial analysis identified the most common words in each dataset. Future work will focus on deeper linguistic structure by incorporating n-grams and developing predictive models. This will enable a more sophisticated understanding of the text and open new avenues for applying the results in NLP tasks.