# Set global options for the chunks in the document
knitr::opts_chunk$set(
echo = TRUE, # Show the code in the final output
results = 'markup' # Show results in the document
)
#Introduction
#This report demonstrates the exploratory data analysis (EDA) of blog text data sourced from multiple locales (en_US, de_DE, ru_RU, fi_FI) as part of a Natural Language Processing (NLP) capstone project. The goal is to explore and understand the data, and outline the next steps for developing a prediction algorithm and Shiny app. The focus is on preparing for a model that could classify text based on sentiment or topic.
#Data Overview
#The dataset consists of blog posts in different languages and is sourced from the HC Corpora. In this analysis, the focus is on the English (en_US), German (de_DE), Russian (ru_RU), and Finnish (fi_FI) blog data.
#Below is an overview of the key steps taken in the data exploration:
#Data Exploration
#load required libraries
library(ggplot2)
library(stringi)
#load the datasets
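# encoding = "UTF-8" marks the non-ASCII text (German, Russian, Finnish) correctly on Windows
blogs_en <- readLines("C:\\Users\\user\\Documents\\en_US\\en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
blogs_de <- readLines("C:\\Users\\user\\Documents\\de_DE\\de_DE.blogs.txt", warn = FALSE, encoding = "UTF-8")
blogs_ru <- readLines("C:\\Users\\user\\Documents\\ru_RU\\ru_RU.blogs.txt", warn = FALSE, encoding = "UTF-8")
blogs_fi <- readLines("C:\\Users\\user\\Documents\\fi_FI\\fi_FI.blogs.txt", warn = FALSE, encoding = "UTF-8")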
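#Function to calculate the basic data summary (lines, words, characters)
get_summary <- function(data) {
  lines <- length(data)                  # number of lines in the file
  words <- sum(stri_count_words(data))   # total word count across all lines
  characters <- sum(nchar(data))         # total character count across all lines
  return(c(Lines = lines, Words = words, Characters = characters))
}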
#get summaries for each data set
en_summary=get_summary(blogs_en)
de_summary=get_summary(blogs_de)
ru_summary=get_summary(blogs_ru)
fi_summary=get_summary(blogs_fi)
#create a summary table
summary_table=data.frame(
Locale= c("en_US", "de_DE", "ru_RU","fi_FI" ),
Lines=c(en_summary[1],de_summary[1],ru_summary[1],fi_summary[1]),
Words=c(en_summary[2],de_summary[2],ru_summary[2],fi_summary[2]),
Characters=c(en_summary[3],de_summary[3],ru_summary[3],fi_summary[3])
)
# Print the summary table nicely in the R Markdown output
knitr::kable(summary_table, caption = "Summary of Blog Data")
# Summary of Blog Data
# | Locale | Lines  | Words    | Characters |
# |--------|--------|----------|------------|
# | en_US  | 899288 | 37546250 | 206824509  |
# | de_DE  | 181958 | 6205913  | 40729299   |
# | ru_RU  | 337100 | 9388482  | 64103385   |
# | fi_FI  | 439785 | 12785318 | 102911937  |
#Interesting Findings
#The datasets contain large amounts of text (from about 6.2 million words in de_DE to about 37.5 million words in en_US), making them suitable for large-scale text analysis and modeling.
#The datasets differ widely in size (en_US has nearly five times as many lines as de_DE), and the writing styles across languages may also vary significantly; both points will matter when building the model.
#Initial Data Visualization
#A basic word-length analysis was performed on the English blog dataset (en_US.blogs.txt). The code below tokenizes the text and plots the distribution of word lengths as a histogram.
# Tokenize the English blogs text into words
words_en <- unlist(strsplit(blogs_en, "\\W+")) # Split lines into words using non-word characters as delimiters
words_en <- words_en[words_en != ""] # Remove empty strings
# Calculate the length of each word
word_lengths <- nchar(words_en)
# Create a histogram of word lengths
ggplot(data.frame(word_lengths), aes(x = word_lengths)) +
geom_histogram(binwidth = 1, fill = "blue", color = "red", alpha = 0.7) + # one bin per word length; a binwidth of 10 would lump almost all words into one bar
labs(
title = "Word Length Distribution in en_US Blogs",
x = "Word Length (Number of Characters)",
y = "Frequency"
) +
theme_minimal()
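#A word frequency count is a natural companion to the word-length histogram. The following is a minimal sketch on a small, hypothetical sample (sample_lines stands in for a few lines of blogs_en); it is not part of the original analysis and uses only base R.

```r
# Hypothetical sample standing in for a few lines of blogs_en
sample_lines <- c(
  "The cat sat on the mat.",
  "The dog chased the cat!"
)

# Lowercase, split on non-word characters, drop empty tokens
tokens <- unlist(strsplit(tolower(sample_lines), "\\W+"))
tokens <- tokens[tokens != ""]

# Count each distinct word and sort from most to least frequent
word_freq <- sort(table(tokens), decreasing = TRUE)

# Keep the top 3 words as a data frame for printing or plotting
top_words <- head(as.data.frame(word_freq, stringsAsFactors = FALSE), 3)
names(top_words) <- c("Word", "Count")
print(top_words)
```

#On the full en_US data the same steps would likely need to run on a random sample of lines to keep memory use manageable.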

#Next Steps
#The exploratory data analysis has provided insights into the blog data. Based on these findings, here are the planned next steps for the project:
#Prediction Algorithm
# 1.Data Preprocessing:
#Clean the text by removing stopwords, punctuation, numbers, and unnecessary spaces.
#Convert the text to lowercase for consistency.
#Perform stemming or lemmatization to reduce words to their base forms.
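#The cleaning steps above could be sketched in base R as follows. The function name clean_text and the short stopword list are assumptions made for illustration; a real pipeline would use a fuller stopword list and add stemming or lemmatization, which this sketch omits.

```r
# Hypothetical stopword list; a real project would use a fuller one
stopwords_en <- c("the", "a", "an", "is", "and", "of", "to", "in", "on")

clean_text <- function(x, stopwords = stopwords_en) {
  x <- tolower(x)                        # convert to lowercase
  x <- gsub("[[:punct:]]+", " ", x)      # replace punctuation with spaces
  x <- gsub("[[:digit:]]+", " ", x)      # replace numbers with spaces
  tokens <- strsplit(trimws(x), "\\s+")  # split each document on whitespace
  # Drop stopwords and empty tokens, then rejoin each document
  vapply(tokens, function(tok) {
    paste(tok[!(tok %in% stopwords) & tok != ""], collapse = " ")
  }, character(1))
}

clean_text("The 3 cats, and a DOG, sat on the mat!")
# returns "cats dog sat mat"
```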
# 2.Feature Extraction:
#Tokenize the text into n-grams (e.g., unigrams, bigrams).
#Use techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to transform text into numerical features.
#Explore word embeddings (e.g., Word2Vec or GloVe) for capturing semantic relationships.
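#To make the n-gram and TF-IDF ideas concrete, here is a small base R sketch. The helper names make_ngrams and tf_idf are hypothetical, and the weighting used (term count divided by document length, times log(N / document frequency)) is one common TF-IDF variant among several. Word embeddings are not covered here.

```r
# Build word n-grams from one document's tokens
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# TF-IDF matrix (documents x terms) from a list of token vectors
tf_idf <- function(docs_tokens) {
  vocab <- sort(unique(unlist(docs_tokens)))
  counts <- matrix(0, nrow = length(docs_tokens), ncol = length(vocab),
                   dimnames = list(NULL, vocab))
  for (i in seq_along(docs_tokens)) {
    tab <- table(docs_tokens[[i]])
    if (length(tab) > 0) counts[i, names(tab)] <- as.numeric(tab)
  }
  tf <- counts / pmax(rowSums(counts), 1)         # term frequency per document
  idf <- log(nrow(counts) / colSums(counts > 0))  # inverse document frequency
  sweep(tf, 2, idf, `*`)                          # scale each term column by its idf
}

# Hypothetical two-document corpus
docs <- list(c("cats", "chase", "mice"), c("dogs", "chase", "cats"))
bigrams <- make_ngrams(docs[[1]], n = 2)  # "cats chase" "chase mice"
weights <- tf_idf(docs)
print(round(weights, 3))
```

#Terms that appear in every document ("cats", "chase") get a weight of zero, while terms unique to one document ("mice", "dogs") get positive weights.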
# 3.Model Training:
#Train a machine learning model (e.g., Naive Bayes, Random Forest, or SVM) for sentiment or topic classification.
#Evaluate model performance using metrics like accuracy, precision, recall, and F1 score.
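#The evaluation metrics named above can be computed directly from predicted and actual labels. This is a minimal sketch for the binary case; the function name classification_metrics and the example labels are made up for illustration and do not come from a trained model.

```r
# Accuracy, precision, recall, and F1 for one "positive" class
classification_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(accuracy = mean(actual == predicted),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

# Hypothetical labels for six documents
actual    <- c("pos", "pos", "neg", "neg", "pos", "neg")
predicted <- c("pos", "neg", "neg", "neg", "pos", "neg")

metrics <- classification_metrics(actual, predicted, positive = "pos")
print(round(metrics, 3))
```

#Here one positive document is missed, so precision is 1 while recall is 2/3, giving an F1 score of 0.8.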
#Shiny App Development
# 1.User Interface:
#Create an interactive Shiny app with a text input box for users to type or paste text.
#Add a submit button that triggers the prediction.
# 2.Model Integration:
#Embed the trained predictive model within the Shiny app for real-time text classification.
#Display the predicted sentiment or topic dynamically.
# 3.User Experience:
#Include visualizations, such as word clouds or confidence scores, to enhance the user interface.
#Optimize the app for speed and ease of use.
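#A skeleton of the planned app might look like the sketch below. It assumes the shiny package is installed; classify_text is a hypothetical keyword-based placeholder standing in for the trained model, and build_app and the input/output IDs are illustrative names only.

```r
# Hypothetical placeholder for the trained model: labels text with a
# simple keyword rule so the app can run before a real model exists
classify_text <- function(text) {
  if (grepl("good|great|love", tolower(text))) "positive" else "negative"
}

# Build the app object; shiny is only needed when this function is called
build_app <- function() {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Text Classifier (sketch)"),
    shiny::textAreaInput("user_text", "Type or paste text:", rows = 5),
    shiny::actionButton("submit", "Submit"),
    shiny::verbatimTextOutput("prediction")
  )
  server <- function(input, output, session) {
    # Recompute the prediction only when the submit button is clicked
    result <- shiny::eventReactive(input$submit, {
      classify_text(input$user_text)
    })
    output$prediction <- shiny::renderText(result())
  }
  shiny::shinyApp(ui, server)
}

# Launch only in an interactive session with shiny available
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  shiny::runApp(build_app())
}
```

#Keeping the classifier behind a single function such as classify_text means the placeholder can later be swapped for the real model without changing the UI code.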
#Conclusion
#This exploratory analysis has provided a solid understanding of the blog datasets and laid the foundation for building a robust predictive algorithm. The next steps will involve preprocessing, modeling, and creating the interactive Shiny app.