Vamos a usar un dataset sobre preguntas y respuestas sobre R de la web stackoverflow.com

  1. Lee los cuatro ficheros externos answers.rds, question_tags.rds, questions.rds y tags.rds mediante la función readRDS()
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd(".../Practica_11")
questions <- readRDS(file = "questions.rds")
question_tags <- readRDS(file = "question_tags.rds")
answers <- readRDS(file = "answers.rds")
tags <- readRDS(file = "tags.rds")

Explora las cuatro tablas y encuentra las posibles relaciones entre ellas.

questions:

## # A tibble: 6 x 3
##         id creation_date score
##      <int> <date>        <int>
## 1 22557677 2014-03-21        1
## 2 22557707 2014-03-21        2
## 3 22558084 2014-03-21        2
## 4 22558395 2014-03-21        2
## 5 22558613 2014-03-21        0
## 6 22558677 2014-03-21        2

question_tags:

## # A tibble: 6 x 2
##   question_id tag_id
##         <int>  <int>
## 1    22557677     18
## 2    22557677    139
## 3    22557677  16088
## 4    22557677   1672
## 5    22558084   6419
## 6    22558084  92764

answers:

## # A tibble: 6 x 4
##         id creation_date question_id score
##      <int> <date>              <int> <int>
## 1 39143713 2016-08-25       39143518     3
## 2 39143869 2016-08-25       39143518     1
## 3 39143935 2016-08-25       39142481     0
## 4 39144014 2016-08-25       39024390     0
## 5 39144252 2016-08-25       39096741     6
## 6 39144375 2016-08-25       39143885     5

tags:

## # A tibble: 6 x 2
##       id tag_name                 
##    <dbl> <chr>                    
## 1 124399 laravel-dusk             
## 2 124402 spring-cloud-vault-config
## 3 124404 spring-vault             
## 4 124405 apache-bahir             
## 5 124407 astc                     
## 6 124408 simulacrum
  1. Une la tabla questions con question_tags, sin perder información de questions y sin agregar tags que no estén en questions.
questions_question_tags <- questions %>%
  left_join(question_tags, by = c("id" = "question_id"))
head(questions_question_tags)
## # A tibble: 6 x 4
##         id creation_date score tag_id
##      <int> <date>        <int>  <int>
## 1 22557677 2014-03-21        1     18
## 2 22557677 2014-03-21        1    139
## 3 22557677 2014-03-21        1  16088
## 4 22557677 2014-03-21        1   1672
## 5 22557707 2014-03-21        2     NA
## 6 22558084 2014-03-21        2   6419
  1. Agrega la información de la tabla tags a la consulta anterior, sin perder información de la consulta del ejercicio anterior.
questions_question_tags_tags <- questions %>%
  left_join(question_tags, by = c("id" = "question_id")) %>%
  left_join(tags, by = c("tag_id" = "id")) 
head(questions_question_tags_tags)
## # A tibble: 6 x 5
##         id creation_date score tag_id tag_name       
##      <int> <date>        <int>  <dbl> <chr>          
## 1 22557677 2014-03-21        1     18 regex          
## 2 22557677 2014-03-21        1    139 string         
## 3 22557677 2014-03-21        1  16088 time-complexity
## 4 22557677 2014-03-21        1   1672 backreference  
## 5 22557707 2014-03-21        2     NA <NA>           
## 6 22558084 2014-03-21        2   6419 time-series
  1. Sustituye los NAs de la columna tag_name, usando replace_na(), por el texto “only-r” y guarda el resultado de la consulta en la variable questions_with_tags
questions_with_tags <- questions %>%
  left_join(question_tags, by = c("id" = "question_id")) %>%
  left_join(tags, by = c("tag_id" = "id")) %>%
  replace_na(list(tag_name="only_r"))
head(questions_with_tags)
## # A tibble: 6 x 5
##         id creation_date score tag_id tag_name       
##      <int> <date>        <int>  <dbl> <chr>          
## 1 22557677 2014-03-21        1     18 regex          
## 2 22557677 2014-03-21        1    139 string         
## 3 22557677 2014-03-21        1  16088 time-complexity
## 4 22557677 2014-03-21        1   1672 backreference  
## 5 22557707 2014-03-21        2     NA only_r         
## 6 22558084 2014-03-21        2   6419 time-series
  1. Utilizando el dataset questions_with_tags, completa el siguiente código:
avg_tag_sum_question <- questions_with_tags %>%
    # Agrupa por tag_name
    --- %>%
    # Agrega los datos y obten la media de puntuaciones para cada tag_name y el total de preguntas
    summarize(score = ---,
              num_questions = ---) %>%
    # Ordena por número de preguntas en orden descendente
    ---
avg_tag_sum_question <- questions_with_tags %>%
    # Group by tag_name
    group_by(tag_name) %>%
    # Get mean score and num_questions
    summarize(score = mean(score),
              num_questions = n()) %>%
    # Sort num_questions in descending order
    arrange(desc(num_questions))
avg_tag_sum_question
## # A tibble: 7,841 x 3
##    tag_name   score num_questions
##    <chr>      <dbl>         <int>
##  1 only_r     1.26          48541
##  2 ggplot2    2.61          28228
##  3 dataframe  2.31          18874
##  4 shiny      1.45          14219
##  5 dplyr      1.95          14039
##  6 plot       2.24          11315
##  7 data.table 2.97           8809
##  8 matrix     1.66           6205
##  9 loops      0.743          5149
## 10 regex      2              4912
## # ... with 7,831 more rows
  1. La tabla tags incluye todas las etiquetas de Stack Overflow, incluyendo las que no se refieren al lenguaje R. Localiza, usando el verbo anti_join las etiquetas de la tabla tags que nunca aparecen en una pregunta sobre R. ¿Cuántas etiquetas no pertenecen a R?
tags %>%
   anti_join(question_tags, by=c("id" = "tag_id")) 
## # A tibble: 40,459 x 2
##        id tag_name                 
##     <dbl> <chr>                    
##  1 124399 laravel-dusk             
##  2 124402 spring-cloud-vault-config
##  3 124404 spring-vault             
##  4 124405 apache-bahir             
##  5 124407 astc                     
##  6 124408 simulacrum               
##  7 124410 angulartics2             
##  8 124411 django-rest-viewsets     
##  9 124414 react-native-lightbox    
## 10 124417 java-module              
## # ... with 40,449 more rows
  1. Localiza, usando el verbo semi_join las etiquetas de la tabla tags que son sobre R. ¿Cuántas etiquetas pertenecen a R?
tags %>%
   semi_join(question_tags, by=c("id" = "tag_id")) 
## # A tibble: 7,840 x 2
##        id tag_name              
##     <dbl> <chr>                 
##  1 124426 tidyquant             
##  2 124454 catalan               
##  3 124469 salesforce-lightning  
##  4 124500 hyperledger-composer  
##  5 124540 mini-batch            
##  6 124556 pacman-package-manager
##  7 124614 amazon-linux          
##  8 124620 longitudinal          
##  9 124657 hypernym              
## 10 124678 modelr                
## # ... with 7,830 more rows
  1. Vamos a relacionar ahora preguntas con respuestas. Con ello, podemos medir el tiempo entre preguntas y respuestas.
questions %>%
    # Realiza un Inner join entre questions y answers con los sufijos apropiados (_question y _answer) para diferenciar las columnas entre preguntas y respuestas
    inner_join(---, ---, suffix = ---) ---
    # Haz una resta de creation_date_question a creation_date_answer para crear la columna gap
    ---(gap = as.integer(---))
questions %>%
    # Inner join questions and answers with proper suffixes
    inner_join(answers, by=c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
    # Subtract creation_date_question from creation_date_answer to create gap
    mutate(gap = as.integer(creation_date_answer - creation_date_question))
## # A tibble: 380,643 x 7
##          id creation_date_question score_question id_answer creation_date_answer
##       <int> <date>                          <int>     <int> <date>              
##  1 22557677 2014-03-21                          1  22560670 2014-03-21          
##  2 22557707 2014-03-21                          2  22558516 2014-03-21          
##  3 22557707 2014-03-21                          2  22558726 2014-03-21          
##  4 22558084 2014-03-21                          2  22558085 2014-03-21          
##  5 22558084 2014-03-21                          2  22606545 2014-03-24          
##  6 22558084 2014-03-21                          2  22610396 2014-03-24          
##  7 22558084 2014-03-21                          2  34374729 2015-12-19          
##  8 22558395 2014-03-21                          2  22559327 2014-03-21          
##  9 22558395 2014-03-21                          2  22560102 2014-03-21          
## 10 22558395 2014-03-21                          2  22560288 2014-03-21          
## # ... with 380,633 more rows, and 2 more variables: score_answer <int>,
## #   gap <int>
  1. Cuenta y ordena las diferentes preguntas (columna question id) utilizando la tabla anwswers y solamente el verbo count(). Guarda el resultado en answer_counts
--- <- --- 
    count(---, --- = )
answer_counts <- answers %>%
    count(question_id, sort=TRUE)
answer_counts
## # A tibble: 243,930 x 2
##    question_id     n
##          <int> <int>
##  1     1295955    34
##  2     2547402    30
##  3     1358003    27
##  4     4090169    26
##  5     1535021    25
##  6     1189759    24
##  7     1815606    24
##  8     5963269    23
##  9    17200114    22
## 10      102056    21
## # ... with 243,920 more rows
  1. Combina las tablas questions con answer_counts y sustituye los NA de la columna “n” creada mediante count() en answer_counts por ceros. Guarda el resultado en question_answer_counts.
--- <- questions %>%
  ---(---,---) %>%
  replace_na(---(--- = 0))
question_answer_counts <- questions %>%
    left_join(answer_counts, by=c("id" = "question_id")) %>%
    # Replace the NAs in the n column
    replace_na(list(n = 0))
  1. Une question_answer_counts con question_tags y con tags, usando inner_join dos veces. Guarda el resultado en tagged_answers
tagged_answers  <- question_answer_counts %>%
    ---
    ---
tagged_answers <- question_answer_counts %>%
    inner_join(question_tags, by=c("id" = "question_id")) %>%
    inner_join(tags, by = c("tag_id" = "id"))
  1. Agrega los datos anteriores, agrupando por tag_name. Calcula el total y el promedio de respuestas por pregunta.
tagged_answers %>%
    # Agrupa por tag_name
  ---  %>%
  # Agrega el número de preguntas total por tag y el promedio de respuestas
    ---(questions = ---,
              average_answers = ---) %>%
    # Ordena de mayor a menor
    ---(---)
tagged_answers %>%
    group_by(tag_name) %>%
    summarize(questions = n(),
              average_answers = mean(n)) %>%
    arrange(desc(questions))
## # A tibble: 7,840 x 3
##    tag_name   questions average_answers
##    <chr>          <int>           <dbl>
##  1 ggplot2        28228           1.15 
##  2 dataframe      18874           1.67 
##  3 shiny          14219           0.921
##  4 dplyr          14039           1.55 
##  5 plot           11315           1.23 
##  6 data.table      8809           1.47 
##  7 matrix          6205           1.45 
##  8 loops           5149           1.39 
##  9 regex           4912           1.91 
## 10 function        4892           1.30 
## # ... with 7,830 more rows
  1. Ahora vamos a crear dos tablas con la misma estructura una para preguntas y otra para respuestas, para realizar una comparativa del nº de post de preguntas y respuestas.
questions_with_tags <- --- %>%
  inner_join(---, ---) %>%
  inner_join(---, ---) 

answers_with_tags <- answers %>%
  inner_join(question_tags, ---) %>%
  inner_join(tags, ---) 
questions_with_tags <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))

answers_with_tags <- answers %>%
  inner_join(question_tags, by = c("question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))
  1. Añade la columna type a cada una de las dos tablas anteriores, dependiendo de si son preguntas o respuestas el valor será “question” o “answer”
questions_with_tags <- questions_with_tags ---
  ---()

answers_with_tags <- answers_with_tags ---
  ---()
questions_with_tags <- questions_with_tags %>%
  mutate(type="question")
answers_with_tags <- answers_with_tags %>%
  mutate(type="answer")
  1. Utilizando el verbo union_all(), une las tablas questions_with_tags y answers_with_tags. El resultado de la unión será post_with_tags
posts_with_tags <- union_all(---, ---)
posts_with_tags
posts_with_tags <- union_all(questions_with_tags, answers_with_tags)
posts_with_tags
## # A tibble: 1,122,998 x 7
##          id creation_date score tag_id tag_name            type     question_id
##       <int> <date>        <int>  <dbl> <chr>               <chr>          <int>
##  1 22557677 2014-03-21        1     18 regex               question          NA
##  2 22557677 2014-03-21        1    139 string              question          NA
##  3 22557677 2014-03-21        1  16088 time-complexity     question          NA
##  4 22557677 2014-03-21        1   1672 backreference       question          NA
##  5 22558084 2014-03-21        2   6419 time-series         question          NA
##  6 22558084 2014-03-21        2  92764 panel-data          question          NA
##  7 22558395 2014-03-21        2   5569 function            question          NA
##  8 22558395 2014-03-21        2    134 sorting             question          NA
##  9 22558395 2014-03-21        2   9412 vectorization       question          NA
## 10 22558395 2014-03-21        2  18621 operator-precedence question          NA
## # ... with 1,122,988 more rows
  1. Agregamos la columna año y contamos por type, year y tag_name
library(lubridate)
by_type_year_tag <- --- %>%
    mutate(year = year(creation_date)) %>%
    count(---, ---, ---)
by_type_year_tag
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
by_type_year_tag <- posts_with_tags %>%
    mutate(year = year(creation_date)) %>%
    count(type, year, tag_name)
by_type_year_tag
## # A tibble: 58,299 x 4
##    type    year tag_name                      n
##    <chr>  <dbl> <chr>                     <int>
##  1 answer  2008 bayesian                      1
##  2 answer  2008 dataframe                     3
##  3 answer  2008 dirichlet                     1
##  4 answer  2008 eof                           1
##  5 answer  2008 file                          1
##  6 answer  2008 file-io                       1
##  7 answer  2008 function                      7
##  8 answer  2008 global-variables              7
##  9 answer  2008 math                          2
## 10 answer  2008 mathematical-optimization     1
## # ... with 58,289 more rows
  1. Filtramos por tag dplr y ggplot2 y dibujamos una gráfica para identificar la frecuencia de apariciones en post de ambos tags, diferenciando preguntas y respuestas por color:
by_type_year_tag_filtered <- --- %>%
  filter(--- %in% ---)

ggplot(---, aes(---, ---, color = ---)) +
  geom_line() +
  facet_wrap(~ ---)
by_type_year_tag_filtered <- by_type_year_tag %>%
  filter(tag_name %in% c("dplyr", "ggplot2"))

ggplot(by_type_year_tag_filtered, aes(year, n, color = type)) +
  geom_line() +
  facet_wrap(~ tag_name)