The exploratory analysis aimed to provide an overview of the training data set, which comprised three text files: blogs.txt, news.txt, and twitter.txt. The primary objective was to extract key insights and present them in a clear, concise manner suitable for non-technical stakeholders, such as managers.
The data was read from the three files using R, with the stringr, dplyr, and ggplot2 libraries supporting data manipulation and visualization. The process involved checking that each file existed, reading its contents line by line, and preparing the text for analysis.
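The reading step described above could look roughly like the following. This is a minimal sketch, not the code used in the analysis: the helper name `read_text_file`, the `data_dir` location, and the choice to open files in binary mode with UTF-8 encoding are all assumptions.

```r
# Hypothetical helper: read one text file into a character vector of lines.
read_text_file <- function(path) {
  # Fail early with a clear message if the file is missing
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # Binary mode is an assumption; it avoids stopping early on stray
  # control characters embedded in the text
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Assumed location of the three files; adjust to the real directory
data_dir <- "."
files <- c("blogs.txt", "news.txt", "twitter.txt")
paths <- file.path(data_dir, files)

# Read whichever of the three files are present, keyed by file name
present <- file.exists(paths)
texts <- setNames(lapply(paths[present], read_text_file), files[present])
```

Each element of `texts` is then a character vector with one string per line, ready for counting.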
A summary of each file’s line and word counts was generated to highlight the differences in the length and verbosity of content among the files.
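A summary table of this kind could be built along the following lines. This is a sketch under stated assumptions: the helper `summarise_texts` is hypothetical, a "word" is taken to be any run of non-whitespace characters, and the toy input stands in for the real files.

```r
library(stringr)

# Hypothetical helper: line and word counts for a named list of
# character vectors (one list element per file, one string per line).
summarise_texts <- function(texts) {
  data.frame(
    File = names(texts),
    LineCount = vapply(texts, length, integer(1)),
    # Assumption: a word is a run of non-whitespace characters
    WordCount = vapply(
      texts,
      function(x) as.numeric(sum(str_count(x, "\\S+"))),
      numeric(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for the real files
example_texts <- list(
  "a.txt" = c("one two three", "four"),
  "b.txt" = c("five six")
)
example_stats <- summarise_texts(example_texts)
```

Other tokenization rules (for example, splitting on punctuation) would give somewhat different word counts, so the definition of a word should be reported alongside the totals.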
print(stats_df)
##          File LineCount WordCount
## 1   blogs.txt    899288  38309621
## 2    news.txt     77259   2741594
## 3 twitter.txt   2360148  31003502
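One way to make the difference in verbosity concrete is to divide each file's word count by its line count. The sketch below hard-codes the counts from the table above; the `counts` data frame and the `WordsPerLine` column are illustrative names, not part of the original analysis.

```r
# Counts copied from the summary table above
counts <- data.frame(
  File = c("blogs.txt", "news.txt", "twitter.txt"),
  LineCount = c(899288, 77259, 2360148),
  WordCount = c(38309621, 2741594, 31003502)
)

# Hypothetical derived column: average number of words per line
counts$WordsPerLine <- round(counts$WordCount / counts$LineCount, 1)
print(counts[, c("File", "WordsPerLine")])
```

By this measure, lines in blogs.txt and news.txt average roughly 43 and 35 words respectively, against about 13 for twitter.txt.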
To visualize the data, bar plots were created to display the line count and the word count of each file.
The line count plot showed that twitter.txt had the most lines, followed by blogs.txt and then news.txt.
ggplot(stats_df, aes(x = File, y = LineCount)) +
  geom_col(fill = "blue") +
  labs(title = "Line Count by File", x = "File", y = "Line Count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
The word count plot showed that blogs.txt contained the highest total word count, followed by twitter.txt and then news.txt.
ggplot(stats_df, aes(x = File, y = WordCount)) +
  geom_col(fill = "green") +
  labs(title = "Word Count by File", x = "File", y = "Word Count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
This analysis provided a clear overview of the training data set and showed that each text source has distinct characteristics. The results were presented through concise summaries and plots intended to be accessible to managers without a data science background. The key takeaway was the contrast between line counts and word counts across the three files: twitter.txt had by far the most lines, blogs.txt had the most words, and news.txt was the smallest on both measures. These findings lay the foundation for further analysis and modeling.