Text mining (TM) is the process of extracting useful information, patterns, or knowledge from unstructured text.
It consists of three stages:
1. Data acquisition: Optical character recognition (OCR) is a technology that converts images of text into editable text. It is also known as text extraction from images.
2. Data exploration: Graphical or visual representation of the data for interpretation. The most common methods are Sentiment Analysis, Word Clouds, and Topic Modeling.
3. Predictive analysis: Statistical techniques and models for predicting future outcomes. The most widely used models are Random Forests, Neural Networks, and Regressions.
#install.packages("tidyverse") # Data wrangling
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("tesseract") # OCR
library(tesseract)
## Warning: package 'tesseract' was built under R version 4.3.1
#install.packages("magick") # PNG
library(magick)
## Warning: package 'magick' was built under R version 4.3.1
## Linking to ImageMagick 6.9.12.93
## Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fftw, ghostscript, x11
#install.packages("officer") # Office (word)
library(officer)
## Warning: package 'officer' was built under R version 4.3.1
#install.packages("pdftools") # PDF
library(pdftools)
## Warning: package 'pdftools' was built under R version 4.3.1
## Using poppler version 23.04.0
#install.packages("purrr") # Para la función map
library(purrr)
#install.packages("tm") # Text Mining
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
#install.packages("RColorBrewer") # Colores
library(RColorBrewer)
#install.packages("wordcloud") # Nubre de palabras
library(wordcloud)
#install.packages("topicmodels") # Modelos de temas
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 4.3.1
library(ggplot2)
imagen1 <- image_read("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/BD/imagen1.PNG")
texto1 <- ocr(imagen1)
texto1
## [1] "Linear regression with one variable x is also known as univariate linear regression\nor simple linear regression. Simple linear regression is used to predict a single\noutput from a single input. This is an example of supervised learning, which means\nthat the data is labeled, i.e., the output values are known in the training data. Let us\nfit a line through the data using simple linear regression as shown in Fig. 4.1.\n"
doc1 <- read_docx() # Create a blank document
doc1 <- doc1 %>% body_add_par(texto1, style = "Normal")
print(doc1, target = "texto1.docx")
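Before trusting the extracted text, it can help to inspect the OCR quality. A minimal sketch using ocr_data() from the tesseract package, which returns one row per recognized word along with a confidence score (0-100); the object name conf_df is purely illustrative:
conf_df <- ocr_data(imagen1, engine = tesseract("eng")) # One row per recognized word
head(conf_df[, c("word", "confidence")]) # Words and their confidence scores
mean(conf_df$confidence) # Rough page-level quality indicator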
imagen2 <- image_read("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/BD/imagen2.PNG")
tesseract_download("spa")
## [1] "/Users/genarorodriguezalcantara/Library/Application Support/tesseract5/tessdata/spa.traineddata"
texto2 <- ocr(imagen2, engine = tesseract("spa"))
texto2
## [1] "Un importante, y quizá controversial, asunto político es el que se refiere al efecto del salario mínimo sobre\nlas tasas de desempleo en diversos grupos de trabajadores. Aunque este problema puede ser estudiado con\ndiversos tipos de datos (corte transversal, series de tiempo o datos de panel), suelen usarse las series de\ntiempo para observar los efectos agregados. En la tabla 1.3 se presenta un ejemplo de una base de datos\nde series de tiempo sobre tasas de desempleo y salarios mínimos.\n"
doc2 <- read_docx() # Create a blank Word document
doc2 <- doc2 %>% body_add_par(texto2, style = "Normal")
#print(doc2, target = "texto2.docx")
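When OCR quality is poor, preprocessing the image with magick often helps. A hedged sketch using grayscale conversion, upscaling, and deskewing; which combination actually helps depends on the source image, and the names imagen2_limpia and texto2_limpio are illustrative:
imagen2_limpia <- imagen2 %>%
  image_convert(colorspace = "gray") %>% # Drop color information
  image_resize("2000x") %>% # Upscale so tesseract has more pixels to work with
  image_deskew() # Straighten slightly rotated scans
texto2_limpio <- ocr(imagen2_limpia, engine = tesseract("spa"))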
pdf1 <- pdf_convert("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/BD/pdf1.pdf", dpi = 600) %>% map(ocr)
## Converting page 1 to pdf1_1.png... done!
## Converting page 2 to pdf1_2.png... done!
## Converting page 3 to pdf1_3.png... done!
## Converting page 4 to pdf1_4.png... done!
## Converting page 5 to pdf1_5.png... done!
## Converting page 6 to pdf1_6.png... done!
## Converting page 7 to pdf1_7.png... done!
## Converting page 8 to pdf1_8.png... done!
pdf2 <- pdf_convert("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/BD/eso3.pdf", dpi = 600) %>% map(ocr)
## Converting page 1 to eso3_1.png... done!
## Converting page 2 to eso3_2.png... done!
## Converting page 3 to eso3_3.png... done!
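Note that OCR is only necessary when the PDF is a scan. If the PDF has an embedded text layer, pdftools can read it directly with pdf_text(), which is faster and avoids recognition errors. A minimal sketch on the same file, assuming it contains a text layer:
texto_pdf <- pdf_text("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/BD/pdf1.pdf") # One string per page
substr(texto_pdf[1], 1, 200) # Preview the beginning of the first page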
# Extract the text from the PNG files generated by pdf_convert
eso1 <- image_read("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/Code/eso3_1.png")
eso2 <- image_read("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/Code/eso3_2.png")
eso3 <- image_read("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/Code/eso3_3.png")
tesseract_download("spa")
## [1] "/Users/genarorodriguezalcantara/Library/Application Support/tesseract5/tessdata/spa.traineddata"
eso_1 <- ocr(eso1, engine = tesseract("spa"))
eso_2 <- ocr(eso2, engine = tesseract("spa"))
eso_3 <- ocr(eso3, engine = tesseract("spa"))
doc_it <- read_docx()
doc_it <- doc_it %>% body_add_par(eso_1, style = "Normal") %>% body_add_par(eso_2, style = "Normal") %>% body_add_par(eso_3, style = "Normal")
# Save the Word document
print(doc_it, target = "textos_extraidos.docx")
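The frequency analysis further below reads a plain-text file named textos_extraidos.txt, which this script never creates explicitly. A minimal sketch of one way to produce it from the OCR results:
writeLines(c(eso_1, eso_2, eso_3), "textos_extraidos.txt") # One element per page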
text <- readLines("http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt")
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
## transformation drops documents
# corpus <- tm_map(corpus, removeWords, c("dream","will")) # Remove specific words
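Another optional preprocessing step is stemming, which collapses inflected forms ("places", "place") onto a common root. A hedged sketch with tm's stemDocument, which requires the SnowballC package to be installed; corpus_stem is an illustrative copy so the corpus used below stays untouched:
corpus_stem <- tm_map(corpus, stemDocument) # e.g. "mountains" -> "mountain"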
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm) # Counts of each word per document (rows = words)
frecuencia <- sort(rowSums(m), decreasing = TRUE) # Total frequency of each word across the whole text
frecuencia_df <- data.frame(word=names(frecuencia), freq=frecuencia) # Convert the frequencies to a data frame
ggplot(head(frecuencia_df, 20), aes(x = word, y = freq)) +
geom_bar(stat = "identity") +
labs(title = "Top 20 most frequent words", x = "Word", y = "Frequency", subtitle = "'I Have a Dream' speech by M. L. King") +
ylim(0, 20)
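By default ggplot2 orders the bars alphabetically. To sort them by frequency instead, reorder() can be applied inside aes(); a sketch of the same plot with sorted bars:
ggplot(head(frecuencia_df, 20), aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity") +
labs(title = "Top 20 most frequent words", x = "Word", y = "Frequency")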
inspect(corpus)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 46
##
## [1]
## [2] even though face difficulties today tomorrow still dream dream deeply rooted american dream
## [3]
## [4] dream one day nation will rise live true meaning creed
## [5]
## [6] hold truths selfevident men created equal
## [7]
## [8] dream one day red hills georgia sons former slaves sons former slave owners will able sit together table brotherhood
## [9]
## [10] dream one day even state mississippi state sweltering heat injustice sweltering heat oppression will transformed oasis freedom justice
## [11]
## [12] dream four little children will one day live nation will judged color skin content character
## [13]
## [14] dream today
## [15]
## [16] dream one day alabama vicious racists governor lips dripping words interposition nullification one day right alabama little black boys black girls will able join hands little white boys white girls sisters brothers
## [17]
## [18] dream today
## [19]
## [20] dream one day every valley shall exalted every hill mountain shall made low rough places will made plain crooked places will made straight glory lord shall revealed flesh shall see together
## [21]
## [22] hope faith go back south
## [23]
## [24] faith will able hew mountain despair stone hope faith will able transform jangling discords nation beautiful symphony brotherhood faith will able work together pray together struggle together go jail together stand freedom together knowing will free one day
## [25]
## [26] will day will day god s children will able sing new meaning
## [27]
## [28] country tis thee sweet land liberty thee sing
## [29] land fathers died land pilgrim s pride
## [30] every mountainside let freedom ring
## [31] america great nation must become true
## [32] let freedom ring prodigious hilltops new hampshire
## [33] let freedom ring mighty mountains new york
## [34] let freedom ring heightening alleghenies pennsylvania
## [35] let freedom ring snowcapped rockies colorado
## [36] let freedom ring curvaceous slopes california
## [37]
## [38]
## [39] let freedom ring stone mountain georgia
## [40] let freedom ring lookout mountain tennessee
## [41] let freedom ring every hill molehill mississippi
## [42] every mountainside let freedom ring
## [43] happens allow freedom ring let ring every village every hamlet every state every city will able speed day god s children black men white men jews gentiles protestants catholics will able join hands sing words old negro spiritual
## [44] free last free last
## [45]
## [46] thank god almighty free last
# The data preprocessing before the word cloud is the same as in the frequency analysis, from importing the text through frecuencia_df
set.seed(123)
wordcloud(words = frecuencia_df$word, freq = frecuencia_df$freq, min.freq = 1, random.order = FALSE, colors = brewer.pal(8, "RdPu"))
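wordcloud() accepts several layout parameters beyond those used above; a hedged variant capping the cloud at 100 words and rotating a fraction of them:
set.seed(123)
wordcloud(words = frecuencia_df$word, freq = frecuencia_df$freq,
  max.words = 100, # Keep only the 100 most frequent words
  scale = c(3, 0.5), # Size range between the largest and smallest word
  rot.per = 0.2, # Fraction of words drawn vertically
  random.order = FALSE, colors = brewer.pal(8, "RdPu"))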
text2 <- readLines("/Users/genarorodriguezalcantara/Desktop/Tec/AI - Concentración/Módulo 2 - Machine Learning/Code/textos_extraidos.txt", encoding = "UTF-8")
# Convert the text to UTF-8 if necessary
text2 <- iconv(text2, to = "UTF-8", sub = "byte")
# Create a corpus from the text
corpus <- Corpus(VectorSource(text2))
# Text preprocessing
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("spanish")):
## transformation drops documents
# Create a term-document matrix
tdm <- TermDocumentMatrix(corpus)
# Convert to a matrix and compute frequencies
m <- as.matrix(tdm)
frecuencia2 <- sort(rowSums(m), decreasing = TRUE)
frecuencia_df2 <- data.frame(word = names(frecuencia2), freq = frecuencia2)
ggplot(head(frecuencia_df2, 10), aes(x = word, y = freq)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 most frequent words", x = "Word", y = "Frequency", subtitle = "Analysis of the textos_extraidos file") +
ylim(0, max(frecuencia_df2$freq[1:10]) + 10) # Raise the upper limit for better visualization
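Finally, topicmodels is loaded at the top but no topic model is fitted in this script. A minimal hedged sketch of LDA on the Spanish corpus: empty documents must be dropped first, since LDA requires every row of the document-term matrix to contain at least one term, and the choice k = 2 is purely illustrative:
dtm <- DocumentTermMatrix(corpus) # Documents as rows, terms as columns
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ] # LDA fails on empty documents
lda <- LDA(dtm, k = 2, control = list(seed = 123)) # Fit a 2-topic model
terms(lda, 5) # Top 5 terms per topic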