This report presents an exploratory analysis of text datasets in four languages: English, German, Finnish, and Russian. The goal is to summarize the datasets’ structure, visualize important features, and outline plans for creating a predictive text model and Shiny app.
We start by loading the required libraries and reading in the data files.
# Load necessary libraries
library(stringi) # For string processing
library(ggplot2) # For creating visualizations
library(knitr) # For creating clean tables
# File paths for the datasets
files <- list(
  en = "/Users/valel/Downloads/LOCALE/en_US.blogs.txt",
  de = "/Users/valel/Downloads/LOCALE/de_DE.blogs.txt",
  ru = "/Users/valel/Downloads/LOCALE/ru_RU.blogs.txt",
  fi = "/Users/valel/Downloads/LOCALE/fi_FI.blogs.txt"
)
# Function to load the data
load_data <- function(filepath) {
  con <- file(filepath, "r")
  data <- readLines(con, warn = FALSE, encoding = "UTF-8")
  close(con)
  return(data)
}
# Load all datasets
datasets <- lapply(files, load_data)
names(datasets) <- c("English", "German", "Russian", "Finnish")
# Function to calculate summary statistics
summarize_data <- function(data, lang) {
  # Count words once and reuse the result for both the total and the mean
  words_per_line <- stri_count_words(data)
  data.frame(
    Language = lang,
    Lines = length(data),
    Words = sum(words_per_line),
    Characters = sum(nchar(data)),
    Avg_Words_Per_Line = mean(words_per_line, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}
# Apply the summary function to all datasets
summaries <- lapply(names(datasets), function(lang) summarize_data(datasets[[lang]], lang))
summaries_df <- do.call(rbind, summaries)
# Display the summary table
kable(summaries_df, caption = "Summary Statistics of the Datasets")
| Language | Lines | Words | Characters | Avg_Words_Per_Line |
|---|---|---|---|---|
| English | 15 | 102 | 504 | 6.8 |
| German | 15 | 102 | 504 | 6.8 |
| Russian | 15 | 102 | 504 | 6.8 |
| Finnish | 15 | 102 | 504 | 6.8 |
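Because the datasets include non-Latin scripts, it helps to be clear about what the Characters column measures. The short sketch below uses a made-up one-word example rather than the report's data. It shows that `nchar()` counts characters by default, which for Cyrillic text differs from the number of bytes stored on disk.

```r
# "привет" written with Unicode escapes so the script itself stays ASCII
# (illustrative sample word, not taken from the datasets)
word <- enc2utf8("\u043f\u0440\u0438\u0432\u0435\u0442")

# Default behaviour of nchar(): number of characters
n_chars <- nchar(word, type = "chars")

# Storage size in UTF-8: each Cyrillic letter takes two bytes
n_bytes <- nchar(word, type = "bytes")

print(c(chars = n_chars, bytes = n_bytes))
```

The Characters totals in the table are therefore comparable across languages as character counts, not as file sizes.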
# Calculate word counts per line for each dataset
word_counts <- lapply(datasets, function(data) stri_count_words(data))
# Combine word counts into a single data frame
word_counts_df <- data.frame(
  Language = rep(names(word_counts), sapply(word_counts, length)),
  Word_Counts = unlist(word_counts)
)
# Plot the normalized histogram with facets
# after_stat(density * width) turns each bar into the proportion of lines within its own panel
ggplot(word_counts_df, aes(x = Word_Counts, y = after_stat(density * width), fill = Language)) +
  geom_histogram(binwidth = 2, alpha = 0.7, position = "identity", color = "black") +
  facet_wrap(~Language, scales = "free_y") +
  labs(title = "Histogram of Word Counts per Line (Normalized)", x = "Word Count", y = "Proportion") +
  theme_minimal() +
  theme(legend.position = "none")
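A histogram is normalized per language when the bar heights within each panel sum to 1. The base R sketch below illustrates that idea on invented word counts for two hypothetical languages, `A` and `B`. The bin edges are chosen for the example and will not necessarily match the bins ggplot2 draws.

```r
# Toy word counts per line for two hypothetical languages (not the report's data)
toy_counts <- list(
  A = c(1, 2, 2, 3, 5, 8),
  B = c(4, 4, 6, 7)
)

# Shared bin edges with a width of 2, wide enough to cover every value
binwidth <- 2
breaks <- seq(0, 10, by = binwidth)

# For each language, count lines per bin and divide by that language's total
bin_props <- lapply(toy_counts, function(x) {
  counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))
  counts / sum(counts)
})

print(bin_props)

# Each language's proportions sum to 1, regardless of how many lines it has
print(sapply(bin_props, sum))
```

Dividing within each language, rather than by the grand total, keeps the panels comparable even when the datasets differ in size.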
Here are some key findings based on the analysis:
This exploratory analysis provides valuable insights into the structure and distribution of the datasets. The following steps will be crucial in the development of the predictive model and Shiny app:
With these steps, we aim to create a robust predictive text solution that adapts to the nuances of each language while maintaining usability on mobile devices.
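The report does not name a modelling approach. As one possible starting point, and purely as an assumption, a predictive text model could begin with a bigram frequency table that suggests the word most often seen after the previous one. The sketch below uses a made-up two-line corpus, and the helper `predict_next` is hypothetical rather than part of this analysis.

```r
# Hypothetical toy corpus; the real input would be the lines held in `datasets`
corpus <- c("the cat sat on the mat", "the cat ate the fish")

# Lowercase each line and split it into words on whitespace
tokens <- strsplit(tolower(corpus), "\\s+")

# Build (previous word, next word) pairs within each line
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Count how often each bigram occurs and drop combinations never observed
bigram_counts <- as.data.frame(table(pairs$prev, pairs$nxt), stringsAsFactors = FALSE)
names(bigram_counts) <- c("prev", "nxt", "n")
bigram_counts <- bigram_counts[bigram_counts$n > 0, ]

# Hypothetical helper: return the most frequent follower of `word`, or NA if unseen
predict_next <- function(word, counts) {
  cand <- counts[counts$prev == tolower(word), ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$nxt[which.max(cand$n)]
}

print(predict_next("the", bigram_counts))
```

A real model would likely need proper tokenization per language, higher-order n-grams, and a fallback for unseen words. This sketch only shows the basic lookup.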