Introduction:
The dataset is a large collection of text gathered from blogs, news articles, and Twitter feeds in several languages. This project works with four of them: English (en_US), German (de_DE), Finnish (fi_FI), and Russian (ru_RU). The analysis in this report loads the blogs file for each language.
The main goal of this exploratory analysis is to understand the structure and content of the dataset, perform basic text cleaning and sampling, and evaluate word-frequency patterns using unigrams, bigrams, and trigrams, that is, single words and two-word and three-word phrases.
Loading Required Libraries:
library(stringi)
library(ggplot2)
library(knitr)
library(dplyr)
library(tm)
library(tidytext)
library(tidyr)
Loading the Data:
data_dir <- "final"
files <- list(
  en = file.path(data_dir, "en_US/en_US.blogs.txt"),
  de = file.path(data_dir, "de_DE/de_DE.blogs.txt"),
  fi = file.path(data_dir, "fi_FI/fi_FI.blogs.txt"),
  ru = file.path(data_dir, "ru_RU/ru_RU.blogs.txt")
)
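# Fix: re-save each file with normalized line endings.
# Note: this overwrites the source files in place.
for (lang in names(files)) {
  lines <- readLines(files[[lang]], encoding = "UTF-8", skipNul = TRUE)
  # useBytes = TRUE writes the strings as read, without re-encoding them
  writeLines(lines, files[[lang]], useBytes = TRUE)
}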
texts <- lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE)
names(texts) <- c("en", "de", "fi", "ru")
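The re-save loop above rewrites the source files in place. A less invasive option, sketched here under the assumption that the read problems come from control characters or mixed line endings being mishandled in text mode, is to read each file through a binary-mode connection. The helper name `read_lines_binary` is hypothetical and is not part of the original report.

```r
# Hypothetical helper: read a text file through a binary-mode connection so
# that control characters and mixed line endings are not interpreted by the
# platform's text mode. The source file is left untouched.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Small self-check on a temporary file whose first line contains a
# SUB byte (0x1A) in the middle.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first"), as.raw(0x1a), charToRaw("line\nsecond\n")), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If this approach works for the data at hand, `texts <- lapply(files, read_lines_binary)` could stand in for both the re-save loop and the subsequent read.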
Summary Statistics:
summary_df <- data.frame(
  Language = character(),
  Lines = integer(),
  Words = integer(),
  Characters = integer(),
  stringsAsFactors = FALSE
)
for (lang in names(texts)) {
  lines <- texts[[lang]]
  summary_df <- rbind(summary_df, data.frame(
    Language = lang,
    Lines = length(lines),
    Words = sum(stri_count_words(lines)),
    Characters = sum(nchar(lines))
  ))
}
kable(summary_df, caption = "Summary Statistics for Each Language File")
| Language | Lines | Words | Characters |
|---|---|---|---|
| en | 658440 | 27466217 | 151298523 |
| de | 181958 | 6206038 | 40729298 |
| fi | 439785 | 12803410 | 102911932 |
| ru | 337100 | 9388810 | 64103385 |
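The totals in the table are easier to compare once they are turned into per-line and per-word averages. The sketch below copies the four rows from the table above into a small data frame (named `summary_tbl` here, an illustrative name) and derives those ratios. Characters per word is only a rough figure, since the character counts include spaces and punctuation.

```r
# Values copied from the summary table above.
summary_tbl <- data.frame(
  Language   = c("en", "de", "fi", "ru"),
  Lines      = c(658440, 181958, 439785, 337100),
  Words      = c(27466217, 6206038, 12803410, 9388810),
  Characters = c(151298523, 40729298, 102911932, 64103385),
  stringsAsFactors = FALSE
)

# Per-line and per-word averages derived from the totals.
summary_tbl$CharsPerLine <- round(summary_tbl$Characters / summary_tbl$Lines, 1)
summary_tbl$WordsPerLine <- round(summary_tbl$Words / summary_tbl$Lines, 1)
summary_tbl$CharsPerWord <- round(summary_tbl$Characters / summary_tbl$Words, 1)

print(summary_tbl[, c("Language", "CharsPerLine", "WordsPerLine", "CharsPerWord")])
```

With the report's own `summary_df` in the session, the same three columns can be added to it directly instead of retyping the totals.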
Histogram of the Line Length Distribution:
line_lengths <- lapply(texts, nchar)
df_lines <- do.call(rbind, lapply(names(line_lengths), function(lang) {
  data.frame(Language = lang, LineLength = line_lengths[[lang]])
}))
ggplot(df_lines, aes(x = LineLength, fill = Language)) +
  geom_histogram(bins = 50, alpha = 0.6) +
  facet_wrap(~Language, scales = "free") +
  labs(title = "Histogram of Line Lengths by Language", x = "Line Length (characters)", y = "Frequency") +
  theme_minimal()
Histogram of the Word Count Per Line:
word_counts <- lapply(texts, stri_count_words)
df_words <- do.call(rbind, lapply(names(word_counts), function(lang) {
  data.frame(Language = lang, WordCount = word_counts[[lang]])
}))
ggplot(df_words, aes(x = WordCount, fill = Language)) +
  geom_histogram(bins = 50, alpha = 0.7) +
  facet_wrap(~Language, scales = "free") +
  labs(title = "Histogram of Word Counts Per Line by Language", x = "Words per Line", y = "Frequency") +
  theme_minimal()
Word Frequency in an English Sample:
sample_en <- sample(texts$en, 10000)
corpus_en <- VCorpus(VectorSource(sample_en))
corpus_en <- tm_map(corpus_en, content_transformer(tolower))
corpus_en <- tm_map(corpus_en, removePunctuation)
corpus_en <- tm_map(corpus_en, removeNumbers)
corpus_en <- tm_map(corpus_en, removeWords, stopwords("en"))
corpus_en <- tm_map(corpus_en, stripWhitespace)
dtm <- DocumentTermMatrix(corpus_en)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
# row.names = NULL keeps the words out of the row names, so kable prints them only once
word_freq <- data.frame(Word = names(freq), Frequency = freq, row.names = NULL)
kable(head(word_freq, 10), caption = "Top 10 Most Frequent Words in English Sample")
| Word | Frequency |
|---|---|
| one | 1362 |
| will | 1265 |
| just | 1108 |
| can | 1078 |
| like | 1057 |
| time | 971 |
| get | 744 |
| now | 672 |
| know | 665 |
| people | 630 |
ggplot(head(word_freq, 20), aes(x = reorder(Word, -Frequency), y = Frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Top 20 Most Frequent Words (English)", x = "Word", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interesting Findings:
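English has the most lines and the most words. Measured in characters, Finnish has the longest average line (about 234 characters), narrowly ahead of English (about 230) and German (about 224), while Russian has the shortest (about 190). Words per line vary widely, from about 28 for Russian to about 42 for English, which points to longer words in Finnish, Russian, and German than in English. After stop-word removal, the most frequent English words are common verbs, nouns, and adverbs such as "one", "will", "just", and "like". Text preparation depends heavily on the cleaning and normalization steps.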
Plan for Prediction Algorithm & Shiny App
Goals:
Predict the next word from text entered by the user. Support English, German, Finnish, and Russian.
Strategy:
1. Sampling & Cleaning:
Draw a random sample of 10,000 lines from each language file. Clean the sample: convert to lowercase and remove punctuation, numbers, and stop words.
2. Tokenization & N-gram Modeling:
Use tidytext to generate unigrams, bigrams, and trigrams, then build a frequency table for each n-gram order.
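To make concrete what an n-gram frequency table contains, here is a minimal sketch on three toy sentences. It deliberately uses only base R so that it runs without extra packages. It is not the tidytext pipeline the plan calls for, and the function names `tokenize_words` and `ngram_freq` are hypothetical. The character class used for cleaning assumes a UTF-8 locale when applied to non-English text.

```r
# Split each line into lowercase word tokens (letters and apostrophes only).
tokenize_words <- function(lines) {
  cleaned <- gsub("[^[:alpha:]' ]+", " ", tolower(lines))
  tokens <- strsplit(trimws(cleaned), "\\s+")
  lapply(tokens, function(x) x[nzchar(x)])
}

# Count n-grams of order n within each line; n-grams never span two lines.
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(tokenize_words(lines), function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), n = integer(0),
                      stringsAsFactors = FALSE))
  }
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(freq), n = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Toy input, for illustration only.
toy_lines <- c("I like green tea", "I like black tea", "I like green apples")
bigrams  <- ngram_freq(toy_lines, 2)
trigrams <- ngram_freq(toy_lines, 3)
head(bigrams, 3)
```

On the real samples, the same kind of table (one row per n-gram with its count, sorted by frequency) is what the tidytext step is expected to produce for each n-gram order.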
3. Prediction Algorithm:
Use a Stupid Backoff model: match the last two words of the input against the trigrams. If no match is found, back off to the bigrams using the last word. If that also fails, fall back to the most common unigrams.
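The cascade described above can be sketched as a simple lookup. This is a minimal illustration under stated assumptions: the n-gram tables are small data frames with columns `ngram` and `n`, the counts are made up, and the function names `best_continuation` and `predict_next` are hypothetical. It implements only the trigram, bigram, unigram fallback order. The full Stupid Backoff scheme also scores candidates by relative frequency and applies a fixed penalty each time it backs off, and that scoring is omitted here.

```r
# Toy n-gram tables with made-up counts, for illustration only.
unigrams <- data.frame(ngram = c("tea", "like", "green"),
                       n = c(5, 4, 3), stringsAsFactors = FALSE)
bigrams  <- data.frame(ngram = c("like green", "like black", "green tea"),
                       n = c(2, 1, 1), stringsAsFactors = FALSE)
trigrams <- data.frame(ngram = c("i like green", "i like black"),
                       n = c(2, 1), stringsAsFactors = FALSE)

# Return the most frequent word that follows `context` in `table`,
# or NULL when no n-gram in the table starts with that context.
best_continuation <- function(context, table) {
  prefix <- paste0(context, " ")
  hits <- table[startsWith(table$ngram, prefix), , drop = FALSE]
  if (nrow(hits) == 0) return(NULL)
  top <- hits$ngram[which.max(hits$n)]
  substring(top, nchar(prefix) + 1)
}

# Try trigrams first, then bigrams, then the most common unigram.
predict_next <- function(input, trigrams, bigrams, unigrams) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  k <- length(words)
  if (k >= 2) {
    guess <- best_continuation(paste(words[(k - 1):k], collapse = " "), trigrams)
    if (!is.null(guess)) return(guess)
  }
  if (k >= 1) {
    guess <- best_continuation(words[k], bigrams)
    if (!is.null(guess)) return(guess)
  }
  unigrams$ngram[which.max(unigrams$n)]
}

predict_next("I like", trigrams, bigrams, unigrams)
```

Keeping one table per n-gram order per language would let the same function serve all four languages, with the app passing in the tables for the selected language.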
4. Shiny App:
An input box for user text, a language selector, and a display of the top predicted next word.
Conclusion:
The blogs data for all four languages has been loaded and explored, and basic statistics and visualizations are complete. The design of the prediction model and the app is settled. The upcoming tasks are building and testing the n-gram models, evaluating their performance, and creating the Shiny interface.