Structured Data

Structured data have a fixed structure. They are easy to query, classify, and analyze. They usually come in tabular form, with rows, columns, and headers. Examples: spreadsheets, delimited text files, relational databases.

Unstructured Data

Unstructured data have no fixed structure. They lack a predefined organization, are usually unprocessed, and may arrive in real time. Examples: images, books, websites.

Reading data

Importing SPSS, Stata, and SAS files

library(haven)
# Stata
dataStata <- read_dta("ejemploStata.dta")
head(dataStata)

## # A tibble: 6 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
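
The heading above also mentions SPSS and SAS, while the example only reads a Stata file. As a minimal sketch, the same pattern can be tried with haven's read_sav(); here the built-in mtcars data set is first written to a temporary .sav file so the example does not depend on any external file. The object names ruta_sav and dataSpss, and the SAS file name in the comment, are assumptions made for illustration.

```r
library(haven)

# Temporary SPSS file, so the sketch does not need an existing file
ruta_sav <- tempfile(fileext = ".sav")
write_sav(mtcars, ruta_sav)

# SPSS: read_sav() plays the same role as read_dta() for Stata
dataSpss <- read_sav(ruta_sav)
head(dataSpss)

# SAS: read_sas() follows the same pattern, given a .sas7bdat file
# dataSas <- read_sas("ejemploSas.sas7bdat")  # hypothetical file name
```

All three readers return a tibble, so the rest of the workflow should stay the same regardless of the source format.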

library(readr)

dataCSV <- readr::read_csv("ejemplo.csv")
head(dataCSV)

## # A tibble: 6 x 12
##   X1             mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 W~  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4 Hornet 4 Dr~  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5 Hornet Spor~  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

library(openxlsx)
dataXlsx <- read.xlsx("ejemplo.xlsx")
head(dataXlsx)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Exporting data

x <- 1:5
y <- letters[1:5]
z <- letters[26:22]
data <- data.frame(Numero=x,LetrasInicio=y,LetrasFinal=z)
data

##   Numero LetrasInicio LetrasFinal
## 1      1            a           z
## 2      2            b           y
## 3      3            c           x
## 4      4            d           w
## 5      5            e           v

write.xlsx(data,"nombre.xlsx")
write.csv(data,"nombre.csv")
write_dta(data,"nombre.dta")
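
One detail worth checking after an export is that the file reads back as expected. The sketch below, which uses only base R, writes the same data frame to a temporary CSV and reads it again. By default write.csv() also writes the row names as an unnamed first column, which is a likely origin of the X1 column seen in the CSV example above; row.names = FALSE avoids it. The names ruta and releido are illustrative.

```r
# Same toy data frame as in the export example
data <- data.frame(Numero = 1:5,
                   LetrasInicio = letters[1:5],
                   LetrasFinal = letters[26:22])

# Write to a temporary file without the row-name column
ruta <- tempfile(fileext = ".csv")
write.csv(data, ruta, row.names = FALSE)

# Read the file back and compare its shape with the original
releido <- read.csv(ruta)
identical(dim(releido), dim(data))
identical(names(releido), names(data))
```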

Reading information from PDF files

https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html#Read_from_PDF_files

library(pdftools)

## Using poppler version 21.04.0

pngfile <- pdf_convert('https://jeroen.github.io/images/ocrscan.pdf', dpi = 600)

## Converting page 1 to ocrscan_1.png... done!

text <- tesseract::ocr(pngfile)
cat(text)

## | SAPORS LANE - BOOLE - DORSET - BH25 8 ER
## TELEPHONE BOOLE (945 13) 51617 - TELEX 123456
## 
## Our Ref. 350/PJC/EAC 18th January, 1972.
## Dr. P.N. Cundall,
## Mining Surveys Ltd.,
## Holroyd Road,
## Reading,
## Berks.
## Dear Pete,
## 
## Permit me to introduce you to the facility of facsimile
## transmission.
## 
## In facsimile a photocell is caused to perform a raster scan over
## 
## the subject copy. The variations of print density on the document
## cause the photocell to generate an analogous electrical video signal.
## This signal is used to modulate a carrier, which is transmitted to a
## remote destination over a radio or cable communications link.
## 
## At the remote terminal, demodulation reconstructs the video
## signal, which is used to modulate the density of print produced by a
## printing device. This device is scanning in a raster scan synchronised
## with that at the transmitting terminal. As a result, a facsimile
## copy of the subject document is produced.
## 
## Probably you have uses for this facility in your organisation.
## 
## Yours sincerely,
## 44, f
## P.J. CROSS
## Group Leader - Facsimile Research
## Registered in England: No. 2038
## No. 1 Registered Office: GO Vicara Lane, Ilford. Eseex.
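
OCR is needed above because the PDF is a scanned image. When a PDF already contains a text layer, pdftools::pdf_text() can usually read it directly, without the conversion to PNG. The following is a minimal sketch under that assumption: it creates a small text-based PDF with R's own pdf() device and reads it back. The file and the sample sentence are made up for the example.

```r
library(pdftools)

# Build a small text-based PDF with base R graphics (illustrative file)
ruta_pdf <- tempfile(fileext = ".pdf")
pdf(ruta_pdf)
plot.new()
text(0.5, 0.5, "Texto de prueba")
dev.off()

# pdf_text() returns one character string per page
texto <- pdf_text(ruta_pdf)
length(texto)
cat(texto)
```

If pdf_text() returns empty strings for a document, that is a hint that the pages are images and that the OCR route shown above is the one to use.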

Web Scraping

Web scraping is a technique for extracting information from websites by means of algorithms and software programs.

Scraping

Libraries

library(rvest)
library(tidyverse)

Web pages

Functions

read_html(): read a web page
html_nodes(): select the nodes that match a CSS selector or an XPath expression (for example, a div)
html_text(): extract the text of a node
html_attrs(): extract the attributes of a node
html_table(): extract a table

To select by class, use html_nodes(".class"); to select by ID, use html_nodes("#id"); to select with an XPath expression, use html_nodes(xpath = "xpath").
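
The three kinds of selector can be tried without a network connection on a small HTML string. This is a minimal sketch; the page content, the class name "name", and the ID "principal" are invented for the example.

```r
library(rvest)

# Hypothetical page, passed to read_html() as a literal string
pagina <- read_html('<html><body>
  <div class="name">First post</div>
  <div id="principal">Main block</div>
  <p>Paragraph</p>
</body></html>')

por_clase <- html_text(html_nodes(pagina, ".name"))        # by class
por_id    <- html_text(html_nodes(pagina, "#principal"))   # by ID
por_xpath <- html_text(html_nodes(pagina, xpath = "//p"))  # by XPath

por_clase
por_id
por_xpath
```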

Example: scraping the Banco Pichincha blog

We download the HTML of the page and store it in the variable blog.

url <- "https://www.pichincha.com/portal/blog"
blog <- read_html(url)
blog

## {html_document}
## <html lang="es-ES">
## [1] <head id="Head">\n<!-- Google Tag Manager --><script>(function(w,d,s,l,i) ...
## [2] <body id="Body">\n<script async src="https://www.googletagmanager.com/gta ...

We select the titles, which are stored in elements of class "name".

title_html <- html_nodes(blog, ".name")
title_html

## {xml_nodeset (9)}
## [1] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...
## [2] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...
## [3] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...
## [4] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...
## [5] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...
## [6] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...
## [7] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...
## [8] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...
## [9] <div class="name"><h5><a href="https://www.pichincha.com/portal/blog/post ...

We extract the text of each div.

html_text(title_html)

## [1] "¿Demasiadas claves y usuarios? Usa un gestor de contraseñas para no olvidar ninguna"                    
## [2] "Cinco tips que necesitas poner en práctica antes de hacer una inversión inmobiliaria"                   
## [3] "Cinco preguntas sobre la contribución a SOLCA para el financiamiento de la atención integral del cáncer"
## [4] "La innovación como oportunidad de negocio para las PYMES ecuatorianas"                                  
## [5] "Seis cualidades de los personajes de Friends que te harán mejor emprendedor"                            
## [6] "¿Tienes capacidad de ahorro? Cuánto debes ahorrar del sueldo para alcanzar tus metas"                   
## [7] "Diferencia entre el pago mínimo y el pago total en tu tarjeta de crédito"                               
## [8] "Entre troyanos y gusanos: qué es un malware y cómo funciona"                                            
## [9] "Factura falsa: un fraude que puede hacerle perder mucho dinero a tu negocio"

We select the a element inside the first title and extract the URL string stored in it.

html_nodes(title_html,"a")[1]

## {xml_nodeset (1)}
## [1] <a href="https://www.pichincha.com/portal/blog/post/gestor-de-contrasenas ...

unlist(sapply(strsplit(as.character(html_nodes(title_html,"a")[1]), split = '"'),
                                 function(i){
                                   x <- i[ grepl("http", i)]
                                   if(length(x) == 0) x <- NA
                                   x
                                 }))

## [1] "https://www.pichincha.com/portal/blog/post/gestor-de-contrasenas"
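
The string splitting above works, but rvest also provides html_attr(), which reads a single attribute directly. As a sketch, assuming the links sit in a elements inside the "name" divs as in the output shown earlier, the href values can be collected like this. The fragment and its URLs are made up so the example runs offline.

```r
library(rvest)

# Hypothetical fragment imitating the structure of the blog titles
fragmento <- read_html('
<div class="name"><h5><a href="https://example.com/post/uno">Uno</a></h5></div>
<div class="name"><h5><a href="https://example.com/post/dos">Dos</a></h5></div>')

# Select every link inside a "name" div and read its href attribute
enlaces <- html_nodes(fragmento, ".name a")
urls <- html_attr(enlaces, "href")
urls
```

With the real page, the same two calls on title_html would be expected to return one URL per title; html_attr() gives NA for nodes that lack the attribute.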

Example: scraping a table

url <- "https://pokemondb.net/pokedex/all"
pokemon <- read_html(url)


tablaPokedex <- html_nodes(pokemon,".resp-scroll") 
tablaPokedex <- html_table(tablaPokedex)
head(tablaPokedex)

## [[1]]
## # A tibble: 1,045 x 10
##      `#` Name        Type   Total    HP Attack Defense `Sp. Atk` `Sp. Def` Speed
##    <int> <chr>       <chr>  <int> <int>  <int>   <int>     <int>     <int> <int>
##  1     1 Bulbasaur   Grass~   318    45     49      49        65        65    45
##  2     2 Ivysaur     Grass~   405    60     62      63        80        80    60
##  3     3 Venusaur    Grass~   525    80     82      83       100       100    80
##  4     3 VenusaurMe~ Grass~   625    80    100     123       122       120    80
##  5     4 Charmander  Fire     309    39     52      43        60        50    65
##  6     5 Charmeleon  Fire     405    58     64      58        80        65    80
##  7     6 Charizard   FireF~   534    78     84      78       109        85   100
##  8     6 CharizardM~ FireD~   634    78    130     111       130        85   100
##  9     6 CharizardM~ FireF~   634    78    104      78       159       115   100
## 10     7 Squirtle    Water    314    44     48      65        50        64    43
## # ... with 1,035 more rows
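
As the [[1]] in the output suggests, html_table() applied to a set of nodes returns a list with one table per matched node, so a single table has to be taken out of the list before it can be used as a data frame. A minimal offline sketch, with an invented three-column table:

```r
library(rvest)

# Hypothetical page containing a single table inside a wrapper div
pagina <- read_html('
<div class="resp-scroll">
  <table>
    <tr><th>Name</th><th>Type</th><th>HP</th></tr>
    <tr><td>Bulbasaur</td><td>Grass</td><td>45</td></tr>
    <tr><td>Charmander</td><td>Fire</td><td>39</td></tr>
  </table>
</div>')

# One list element per matched table node
tablas <- html_table(html_nodes(pagina, "table"))
length(tablas)

# Take the first (and only) table out of the list
tabla <- tablas[[1]]
tabla
```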

Example: extracting information from Google News

busqueda = "Quito"
news_pag = "https://news.google.com/"
html_dir = paste0(news_pag,"search?q=",gsub(" ","+",busqueda),"&hl=es-419&gl=US&ceid=US:es-419")
google_news = read_html(html_dir)
noticias =  html_nodes(google_news,".ipQwMb")
head(noticias)

## {xml_nodeset (6)}
## [1] <h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/CBMiYWh0dHBzOi8vd3d3 ...
## [2] <h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/CBMicWh0dHBzOi8vd3d3 ...
## [3] <h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/CBMigAFodHRwczovL3d3 ...
## [4] <h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/CBMiW2h0dHBzOi8vd3d3 ...
## [5] <h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/CAIiEDEQs08yyFUFlK26 ...
## [6] <h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/CBMiX2h0dHBzOi8vd3d3 ...

html_text(noticias[1:3])

## [1] "Quito se consolida como un destino renovado"                                                                            
## [2] "Quito una metrópoli para admirar desde el cielo"                                                                        
## [3] "Gabriel Di Noia y Ariel Varady dirigirán a Liga de Quito en la Supercopa; ¿hay candidatos a entrenadores? - El Comercio"

periodico <-  html_nodes(google_news,".wEwyrc") 
periodico <-  html_text(periodico )
periodico[1:3]

## [1] "Expreso.info"          "El Caribe"             "El Comercio (Ecuador)"
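
The search URL above is built by replacing spaces with "+". A slightly more general sketch, using base R's URLencode(), percent-encodes the search term so that other reserved characters are handled too. The search term is an example. Note also that class names such as ".ipQwMb" and ".wEwyrc" look auto-generated, so it is an assumption that they still exist on the live page.

```r
# Example search term containing a space
busqueda <- "Quito Ecuador"

# Percent-encode the term so it is safe inside a query string
consulta <- utils::URLencode(busqueda, reserved = TRUE)

# Same base URL and parameters as in the example above
news_pag <- "https://news.google.com/"
html_dir <- paste0(news_pag, "search?q=", consulta,
                   "&hl=es-419&gl=US&ceid=US:es-419")
html_dir
```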

Extracting data from Twitter

The rtweet library connects to the Twitter API to retrieve posts.

library(rtweet)
# appname, key, and secret hold the credentials of the Twitter app and must be defined beforehand
twitter_token <- create_token(app = appname, consumer_key = key, consumer_secret = secret)

busqueda1 = search_tweets("Manta, Manabi", n = 7, include_rts = FALSE, lang="es", token = twitter_token, place_country="EC")
busqueda1 <- busqueda1 %>% dplyr::select(screen_name, created_at, status_id, text)
head(busqueda1)

## # A tibble: 6 x 4
##   screen_name   created_at          status_id    text                           
##   <chr>         <dttm>              <chr>        <chr>                          
## 1 ABorreroVega  2021-06-17 18:32:58 14055943325~ "Aplaudimos al alcalde de Mant~
## 2 PeriodismoP_~ 2021-06-17 18:32:29 14055942114~ "El vicepresidente @ABorreroVe~
## 3 MetroEcuador  2021-06-17 18:24:17 14055921505~ "#Noticias Nuevo sismo de 4.37~
## 4 Vocero593     2021-06-17 18:18:02 14055905740~ "#SISMO ID: igepn2021luma Revi~
## 5 PoliciaEcuad~ 2021-06-17 17:56:48 14055852326~ "En Manta #Manabí, aprehendimo~
## 6 NoticiasMund~ 2021-06-17 17:52:58 14055842666~ "SISMO ID: igepn2021luma Revis~

Text cleaning and tokenization

# Text cleaning and tokenization
limpiar_tokenizar <- function(texto){
  # The order of the cleaning steps matters
  # Convert all text to lowercase
  nuevo_texto <- tolower(texto)
  # Remove retweet prefixes (lowercase "rt", because the text was lowercased above)
  nuevo_texto <- str_replace_all(nuevo_texto, "rt @\\w+: ", "")
  # Remove hashtags
  nuevo_texto <- str_replace_all(nuevo_texto, "#\\w+", "")
  # Remove references to other screen names
  nuevo_texto <- str_replace_all(nuevo_texto, "@\\w+", "")

  # Remove web addresses (words that start with "http" followed
  # by anything other than whitespace)
  nuevo_texto <- str_replace_all(nuevo_texto, "http\\S*", "")
  # Remove punctuation marks
  nuevo_texto <- str_replace_all(nuevo_texto, "[[:punct:]]", " ")
  # Remove numbers
  nuevo_texto <- str_replace_all(nuevo_texto, "[[:digit:]]", " ")
  # Collapse repeated whitespace
  nuevo_texto <- str_replace_all(nuevo_texto, "[\\s]+", " ")
  # Tokenize into individual words
  nuevo_texto <- str_split(nuevo_texto, " ")[[1]]
  # Drop tokens with fewer than 2 characters
  nuevo_texto <- keep(.x = nuevo_texto, .p = function(x){str_length(x) > 1})
  return(nuevo_texto)
}
# Apply the cleaning and tokenization function to each tweet
busqueda1  <- busqueda1  %>% mutate(texto_tokenizado = map(.x = text,
                                                   .f = limpiar_tokenizar))
# Exploratory analysis
tweets_tidy <- busqueda1 %>% dplyr::select(-text) %>% unnest(cols = texto_tokenizado)

tweets_tidy <- tweets_tidy %>% rename(token = texto_tokenizado)
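
The whole pipeline (clean, tokenize, unnest, count) can be tried without Twitter credentials on a couple of invented tweets. This is a simplified sketch, not the function above: tokenizar_demo is a hypothetical, shortened tokenizer, and the two texts are made up.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)

# Hypothetical toy data standing in for the downloaded tweets
ejemplo <- tibble(
  status_id = c("1", "2"),
  text = c("RT @usuario: Nuevo sismo en #Manta https://t.co/abc",
           "Sismo de 4.37 cerca de Manta")
)

# Shortened tokenizer: lowercase, drop URLs, retweet prefixes, hashtags,
# mentions, punctuation, and digits, then split on single spaces
tokenizar_demo <- function(texto) {
  texto <- str_to_lower(texto)
  texto <- str_replace_all(texto, "http\\S*|rt @\\w+: |[#@]\\w+", " ")
  texto <- str_replace_all(texto, "[[:punct:]]|[[:digit:]]", " ")
  tokens <- str_split(str_squish(texto), " ")[[1]]
  keep(tokens, function(x) str_length(x) > 1)
}

# One row per token, using the explicit cols argument of unnest()
tokens_demo <- ejemplo %>%
  mutate(token = map(text, tokenizar_demo)) %>%
  select(-text) %>%
  unnest(cols = token)

# Frequency of each token across all tweets
conteo <- count(tokens_demo, token, sort = TRUE)
conteo
```

Short function words such as "de" survive this cleaning, so a real analysis would probably also want a stop-word filter before counting.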

Building a word cloud

library(wordcloud)

## Loading required package: RColorBrewer

library(RColorBrewer)
df <- tweets_tidy %>% group_by(token) %>% summarise(frecuencia=n())
total <- sum(df$frecuencia)
df <- df%>% 
  mutate(frecuencia=frecuencia/total)
wordcloud(words = df$token, freq = df$frecuencia,
            max.words = 400, random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))

Data Mining - Data Extraction

Andrés Vinueza

16/6/2021