Practical Assignment No. 3: "Capturing and Exploring Twitter Data"

The city we will continue working with is the Autonomous City of Buenos Aires.

library(rtweet)
library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()  masks stats::filter()
## x purrr::flatten() masks rtweet::flatten()
## x dplyr::lag()     masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(ggmap)

## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.

## Please cite ggmap if you use it! See citation("ggmap") for details.

library(leaflet)

II. Analysis:

We will analyze all tweets related to the following terms: DeudaSostenible, Deuda, FuturoSostenible, acreedores and reestructuración.

# Search for tweets matching any of the terms, within 20 miles of central Buenos Aires
tweets_deuda <- search_tweets(
  q = "DeudaSostenible OR Deuda OR FuturoSostenible OR acreedores OR reestructuración",
  geocode = "-34.603722,-58.381592,20mi",
  include_rts = FALSE,
  n = 100000,
  retryonratelimit = TRUE
)

The search returned the following number of tweets:

count(tweets_deuda)

## # A tibble: 1 x 1
##       n
##   <int>
## 1  4107

head(tweets_deuda)

## # A tibble: 6 x 90
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 125011~ 12588849~ 2020-05-08 22:22:03 dialogo2000 @Fre~ Twitt~
## 2 125011~ 12588846~ 2020-05-08 22:20:56 dialogo2000 @Fre~ Twitt~
## 3 125011~ 12567150~ 2020-05-02 22:39:42 dialogo2000 Exce~ Twitt~
## 4 420256~ 12570344~ 2020-05-03 19:49:06 NicoReggio~ @Ser~ Twitt~
## 5 420256~ 12585962~ 2020-05-08 03:15:09 NicoReggio~ @Ser~ Twitt~
## 6 420256~ 12588845~ 2020-05-08 22:20:51 NicoReggio~ @San~ Twitt~
## # ... with 84 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

a. Which messages had the greatest impact? What do they say?

The messages with the greatest impact, measured by retweet count, are the following:

# Rank tweets by retweet count, keeping only the columns needed to read them
tweets_deuda_max <- tweets_deuda %>%
  select(screen_name, retweet_count, text) %>%
  arrange(desc(retweet_count))
head(tweets_deuda_max, 3)

## # A tibble: 3 x 3
##   screen_name   retweet_count text                                              
##   <chr>                 <int> <chr>                                             
## 1 SantoroLeand~          3554 "Notable como todos le piden a Guzmán precisiones~
## 2 RCachanosky            1114 "Es lo mismo que yo le pido a los políticos!. Que~
## 3 JMilei                 1036 "Aquí Martín Guzmán negociando la deuda con acree~

# Distribution of retweet counts on a log scale
ggplot(tweets_deuda) +
  geom_histogram(aes(x = retweet_count)) +
  scale_x_log10()

## Warning: Transformation introduced infinite values in continuous x-axis

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2847 rows containing non-finite values (stat_bin).

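As the table shows, the three tweets listed above are the ones that exceed 1,000 retweets, while the vast majority have between 0 and 10. A log scale cannot display zero, so the 2,847 rows removed in the warning (out of 4,107 tweets) correspond to tweets with no retweets at all.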
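
The warnings above arise because log10(0) is -Inf, so scale_x_log10() discards every tweet with zero retweets. One possible way to keep those tweets in the picture is to add one before taking the logarithm. The following is only a sketch on made-up numbers: tweets_demo, n_zero and log_retweets are hypothetical names, not objects from the analysis.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for tweets_deuda with only the column used here
tweets_demo <- tibble(retweet_count = c(0L, 0L, 0L, 1L, 2L, 9L, 150L, 3554L))

# Rows that a plain log10 scale would drop (log10(0) is -Inf)
n_zero <- sum(tweets_demo$retweet_count == 0)

# Shift by one before taking the logarithm so zero maps to 0 instead of -Inf
tweets_demo <- tweets_demo %>%
  mutate(log_retweets = log10(retweet_count + 1))

# Histogram on the shifted log scale; no rows are removed
p <- ggplot(tweets_demo) +
  geom_histogram(aes(x = log_retweets), bins = 10) +
  labs(x = "log10(retweets + 1)", y = "Tweets")
```

With this variant the first bar holds the zero-retweet tweets, which makes the weight of the "0 to 10" group visible in the plot itself.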

b. At what time of day are the most tweets posted? Plot it.

# Make sure created_at is parsed as a date-time
tweets_deuda <- tweets_deuda %>%
  mutate(created_at = ymd_hms(created_at))
# Hourly time series of tweet counts
ts_plot(tweets_deuda, "hours")

Most tweets are posted in the late afternoon and evening, with peaks on May 4 and May 7.
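
ts_plot() draws a time series, so the time-of-day pattern has to be read across several days. A complementary view is to count tweets by local hour. The sketch below uses made-up timestamps; tweets_demo, hora_local and tweets_por_hora are hypothetical names, and it assumes that created_at is stored in UTC, which is why it is converted to Buenos Aires time first.

```r
library(dplyr)
library(lubridate)

# Hypothetical timestamps in UTC standing in for tweets_deuda$created_at
tweets_demo <- tibble(
  created_at = ymd_hms(
    c("2020-05-04 22:10:00", "2020-05-04 23:45:00",
      "2020-05-07 22:30:00", "2020-05-08 03:15:09"),
    tz = "UTC"
  )
)

# Convert to local time, extract the hour of day, and count tweets per hour
tweets_por_hora <- tweets_demo %>%
  mutate(hora_local = hour(with_tz(created_at, "America/Argentina/Buenos_Aires"))) %>%
  count(hora_local, sort = TRUE)
```

The resulting table has one row per hour of the day, so it can be passed to geom_col() to plot the daily profile directly.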

c. How is user popularity distributed? Who are the 5 users with the most followers? Plot it.

# One row per account: average follower count across that account's tweets
tweets_deuda_seguidores <- tweets_deuda %>%
  group_by(screen_name) %>%
  summarise(seguidores = mean(followers_count)) %>%
  arrange(desc(seguidores))

# Avoid scientific notation in the axis labels
options(scipen = 20)
ggplot(tweets_deuda_seguidores) +
  geom_histogram(aes(x = seguidores)) +
  scale_x_log10()

## Warning: Transformation introduced infinite values in continuous x-axis

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 14 rows containing non-finite values (stat_bin).

Most users have on the order of 1,000 followers. We used "group_by + summarise" because many newspapers had posted several tweets, so the same account appeared more than once. The 14 rows removed in the warning are accounts with zero followers, which a log scale cannot display.
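
The effect of the "group_by + summarise" step can be seen on a toy table. This is only an illustration: the account names and follower counts below are made up, and tweets_demo and seguidores_demo are hypothetical names.

```r
library(dplyr)

# Hypothetical tweets: one account posts three times, another once
tweets_demo <- tibble(
  screen_name = c("diario_a", "diario_a", "diario_a", "usuario_b"),
  followers_count = c(1000L, 1000L, 1010L, 50L)
)

# Collapse to one row per account so repeated posters are counted once
seguidores_demo <- tweets_demo %>%
  group_by(screen_name) %>%
  summarise(seguidores = mean(followers_count)) %>%
  arrange(desc(seguidores))
```

Without this step, the histogram would count "diario_a" three times and overstate the share of large accounts.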

The five users with the most followers are the following:

head(tweets_deuda_seguidores, 5)

## # A tibble: 5 x 2
##   screen_name     seguidores
##   <chr>                <dbl>
## 1 clarincom          3109488
## 2 C5N                2715918
## 3 FernandezAnibal    1088748
## 4 Cris_noticias       915803
## 5 RiverLPM            768530

d. Isolating the tweets that have geographic coordinates (lat and long), create maps showing the position of the tweets and the follower count of the user who posted them.

# Add lat/lng columns and keep only the tweets that have coordinates
tweets_deuda_geo <- lat_lng(tweets_deuda) %>%
  select(-geo_coords, -coords_coords, -bbox_coords) %>%
  filter(!is.na(lat), !is.na(lng))
nrow(tweets_deuda_geo)

## [1] 381

bbox <- make_bbox(lon = tweets_deuda_geo$lng, lat = tweets_deuda_geo$lat)

bbox

##      left    bottom     right       top 
## -58.83245 -34.85154 -58.17054 -34.42451

mapa_BA <- get_stamenmap(bbox, maptype = "toner-lite", zoom = 11)

ggmap(mapa_BA)

# Sort so that accounts with more followers are drawn last, on top of the others
tweets_deuda_geo <- arrange(tweets_deuda_geo, followers_count)
ggmap(mapa_BA) +
  geom_point(data = tweets_deuda_geo,
             aes(x = lng, y = lat, color = followers_count, size = retweet_count),
             alpha = .5) +
  scale_color_distiller(palette = "Spectral") +
  labs(title = "Greater Buenos Aires",
       subtitle = "Geographic Location of Tweets",
       caption = "Source: Twitter",
       color = "Followers",
       size = "Retweets")

# Color palette mapped to the range of follower counts
paleta <- colorNumeric(
  palette = "viridis",
  domain = tweets_deuda_geo$followers_count)

# Interactive map: one marker per tweet, colored by the author's follower count
leaflet(tweets_deuda_geo) %>%
  addTiles() %>%
  addCircleMarkers(popup = ~text,
                   color = ~paleta(followers_count)) %>%
  addLegend(title = "Followers", pal = paleta, values = ~followers_count)

## Assuming "lng" and "lat" are longitude and latitude, respectively

TP3 Submission (Data Science 2)

Lavezzolo

8 May 2020
