This report presents a brief exploratory analysis of a sample corpus and outlines the initial steps toward building a predictive text algorithm and Shiny application. The goal is to demonstrate familiarity with the data and readiness to develop a scalable model.
The corpus consists of five English-language sentences related to sustainability and development. It was successfully loaded and cleaned using basic text preprocessing techniques.
##   Lines Total_Words Avg_Words_Per_Line
## 1     5          27                5.4
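A summary like the one above can be computed in a few lines. The following is a minimal sketch in base R, not the report's actual code: the two sentences are hypothetical stand-ins (the real corpus is not reproduced here), and `clean_text` and `corpus_summary` are illustrative names.

```r
# Hypothetical stand-in corpus; the report's actual five sentences are not shown.
corpus <- c(
  "Sustainable development needs clean energy.",
  "Clean energy supports local communities!"
)

# Basic cleaning: lowercase, replace anything that is not a letter or a space,
# collapse repeated spaces, and trim the ends. This is deliberately crude
# (for example, it splits contractions such as "don't" into two tokens).
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]+", " ", x)
  x <- gsub(" +", " ", x)
  trimws(x)
}

cleaned <- clean_text(corpus)

# Tokenize on single spaces and count the words on each line.
tokens <- strsplit(cleaned, " ", fixed = TRUE)
words_per_line <- lengths(tokens)

# Same three columns as the summary table in the report.
corpus_summary <- data.frame(
  Lines = length(cleaned),
  Total_Words = sum(words_per_line),
  Avg_Words_Per_Line = mean(words_per_line)
)
print(corpus_summary)
```

With a larger corpus, the same three columns give a quick check that loading and cleaning did not silently drop lines or words.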
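The model uses n-grams (unigrams, bigrams, and trigrams) to predict the next word from the preceding context. When the longest available context has not been seen in the corpus, a backoff strategy falls back to a shorter one: from trigrams to bigrams, and from bigrams to the most frequent unigram. The model is designed to keep memory use and lookup time low.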
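The backoff idea can be illustrated with a short base R sketch. This is an assumption about how such a model might look, not the report's implementation: the toy corpus is hypothetical, `count_ngrams` and `predict_next` are illustrative names, and the backoff shown is the simplest form (pick the most frequent continuation, with no discounting or scoring).

```r
# Hypothetical toy corpus, assumed to be already cleaned and lowercased.
corpus <- c(
  "clean energy supports sustainable development",
  "clean energy supports local communities",
  "sustainable development needs clean water"
)
tokens <- strsplit(corpus, " ", fixed = TRUE)

# Count every n-gram of order n, keyed by its space-joined words.
count_ngrams <- function(tokens, n) {
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# counts[[1]] holds unigrams, counts[[2]] bigrams, counts[[3]] trigrams.
counts <- lapply(1:3, function(n) count_ngrams(tokens, n))

# Predict the next word: try the longest context first, then back off.
predict_next <- function(context, counts) {
  words <- strsplit(tolower(trimws(context)), " +")[[1]]
  for (k in min(length(words), 2):0) {
    tab <- counts[[k + 1]]
    # No usable context left: return the most frequent unigram.
    if (k == 0) return(names(tab)[which.max(tab)])
    # Keep the (k + 1)-grams whose first k words match the context.
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- tab[startsWith(names(tab), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))
    }
  }
}

predict_next("clean energy", counts)    # trigram match: "supports"
predict_next("unknown energy", counts)  # backs off to the bigram: "supports"
predict_next("xyz", counts)             # backs off to the top unigram: "clean"
```

Storing counts in named tables keeps the sketch short. For a larger corpus, a keyed data frame or hashed environment would likely scale better.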
Next steps:
Expand the corpus with real-world documents
Apply smoothing techniques
Integrate the model into a Shiny app for interactive predictions
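As one illustration of the smoothing step listed above, the sketch below applies add-k (Laplace) smoothing to bigram counts in base R. The report does not say which smoothing technique will be used, so this is only an assumed example. The corpus is hypothetical and `add_k_prob` is an illustrative name.

```r
# Hypothetical toy corpus, assumed to be already cleaned and lowercased.
corpus <- c(
  "clean energy supports development",
  "clean water supports communities"
)
tokens <- strsplit(corpus, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Bigram count matrix: rows are the previous word, columns the next word.
bigram_counts <- matrix(0, nrow = length(vocab), ncol = length(vocab),
                        dimnames = list(vocab, vocab))
for (w in tokens) {
  for (i in seq_len(max(length(w) - 1, 0))) {
    bigram_counts[w[i], w[i + 1]] <- bigram_counts[w[i], w[i + 1]] + 1
  }
}

# Add-k smoothing: P(next | prev) = (count + k) / (row total + k * V),
# where V is the vocabulary size. Unseen bigrams get a small nonzero probability.
add_k_prob <- function(counts, k = 1) {
  (counts + k) / (rowSums(counts) + k * ncol(counts))
}

probs <- add_k_prob(bigram_counts)
probs["clean", "energy"]    # seen once after "clean": (1 + 1) / (2 + 6) = 0.25
probs["clean", "supports"]  # never seen after "clean": (0 + 1) / (2 + 6) = 0.125
```

Add-k smoothing is easy to reason about but tends to give too much probability mass to unseen events on large vocabularies, so it serves here only as a baseline to compare other techniques against.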
This report confirms that the data has been successfully loaded and explored. The initial predictive model is functional, and the next phase will focus on scaling and deploying it via Shiny.