Exploratory Data Analysis

1. Introduction

The exploratory analysis aimed to provide an overview of the training data set, which comprised three text files: blogs.txt, news.txt, and twitter.txt. The primary objective was to extract key insights and present them in a clear, concise manner suitable for non-technical stakeholders, such as managers.

2. Read In Data

The data was read from the three files using R, with the stringr, dplyr, and ggplot2 libraries supporting data manipulation and visualization. The process involved checking that each file existed, reading its contents line by line, and preparing the lines for analysis.
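The reading step described above could look roughly like the following sketch. The helper name read_corpus_file and the assumption that the files sit in the working directory are illustrative, not taken from the original analysis. Binary mode is used as a precaution, since text-mode reads can stop early at certain embedded control characters on some platforms.

```r
# Hypothetical helper: read one corpus file into a character vector of lines.
read_corpus_file <- function(path) {
  # Fail early with a clear message if the file is missing.
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # Open in binary mode and make sure the connection is closed on exit.
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Assumed file locations; adjust to wherever the data actually lives.
files <- c("blogs.txt", "news.txt", "twitter.txt")
available <- files[file.exists(files)]

# One character vector of lines per file, named after the file.
corpus <- lapply(available, read_corpus_file)
names(corpus) <- available
```

Keeping the lines in a named list makes it straightforward to apply the same summary function to every file later on.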

3. Basic Summaries

A summary of each file’s line and word counts was generated to highlight the differences in the length and verbosity of content among the files.
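As a minimal sketch of how such a summary table might be built, the code below counts lines and whitespace-separated words with base R. The helpers count_words and summarise_corpus and the toy input are hypothetical; the original analysis may have counted words differently (for example with stringr), so the exact figures depend on the tokenization rule chosen.

```r
# Hypothetical helper: count whitespace-separated tokens across all lines.
count_words <- function(lines) {
  # Trim first so leading/trailing spaces do not create empty tokens.
  tokens <- strsplit(trimws(lines), "\\s+")
  as.numeric(sum(lengths(tokens)))
}

# Hypothetical helper: one summary row per named vector of lines.
summarise_corpus <- function(corpus) {
  data.frame(
    File = names(corpus),
    LineCount = vapply(corpus, length, integer(1), USE.NAMES = FALSE),
    WordCount = vapply(corpus, count_words, numeric(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for the real files (not the actual data).
toy <- list(
  "a.txt" = c("one two three", "four"),
  "b.txt" = c("", "  five six  ")
)
stats_toy <- summarise_corpus(toy)
print(stats_toy)
```

Applied to the real files, the same function would take the named list of lines read from blogs.txt, news.txt, and twitter.txt.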

print(stats_df)

##          File LineCount WordCount
## 1   blogs.txt    899288  38309621
## 2    news.txt     77259   2741594
## 3 twitter.txt   2360148  31003502
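One way to make the verbosity difference concrete is to divide each word count by the corresponding line count. The sketch below rebuilds the table from the printed values; the WordsPerLine column is an illustrative addition and was not part of the original output.

```r
# Counts copied from the summary table above.
stats_df <- data.frame(
  File = c("blogs.txt", "news.txt", "twitter.txt"),
  LineCount = c(899288, 77259, 2360148),
  WordCount = c(38309621, 2741594, 31003502)
)

# WordsPerLine is a hypothetical column added for illustration only.
stats_df$WordsPerLine <- stats_df$WordCount / stats_df$LineCount
print(stats_df)
```

On these counts, blogs.txt and news.txt come out at roughly 43 and 35 words per line, against about 13 for twitter.txt, which is consistent with the short-message nature of tweets.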

4. Plots

To visualize the data, bar plots were created to display the line count and the word count of each file.

Line Count

The line count plot showed that twitter.txt had the most lines, followed by blogs.txt and then news.txt.

# Bar chart of line count per file; x-axis labels are rotated to avoid overlap.
ggplot(stats_df, aes(x = File, y = LineCount)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Line Count by File", x = "File", y = "Line Count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Word Count

The word count plot showed that blogs.txt contained the highest total word count, followed by twitter.txt and then news.txt.

# Bar chart of total word count per file; x-axis labels are rotated to avoid overlap.
ggplot(stats_df, aes(x = File, y = WordCount)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(title = "Word Count by File", x = "File", y = "Word Count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

5. Conclusion

This analysis provided a clear overview of the training data set and showed that each text source has distinct characteristics. The results were presented through concise summaries and bar plots, keeping the insights accessible to managers without a data science background. The key takeaway was the large difference in line counts and word counts across the three files: twitter.txt had the most lines, while blogs.txt had the most words. These findings lay the foundation for further analysis and modeling.