Workshop 1

.title[
# Workshop 1
]
.subtitle[
## Introduction to R, RStudio, and the tidyverse package
]
.author[
### Carlos Daboín
]
.date[
### September 2023
]

---


## Data analysis workshops

After each lecture we will hold a workshop to review some ways of applying the methods presented in class.

We will rely on **R** and **RStudio** for this.

If you have not done so yet:

.pull-left[
1. Download and install [R](https://www.r-project.org/).
2. Then download and install [RStudio](https://www.rstudio.com/products/rstudio/).
]

*Note: You can also work from your web browser through [RStudio Cloud](https://www.rstudio.com/products/cloud/). The free plan limits storage and processing, but it is enough for the course assignments if you use it carefully.*

---

## R and RStudio

**R** is a free software ecosystem for statistical analysis and data visualization. **RStudio** is an Integrated Development Environment (IDE) that helps R users write code comfortably.

Think of **R as the engine** running your analysis, and of **RStudio as the cockpit.**

]

---

## Why R and RStudio?

- It is free

- An active community and constant innovation (tip: follow [@R4DScommunity](https://twitter.com/R4DScommunity?s=20&t=ALR2omSKksL53wWja8JHxg) on Twitter)

- Excellent libraries for data analysis and visualization

- Convenient tools for creating [reports](https://rmarkdown.rstudio.com/gallery.html), [presentations](https://arm.rbind.io/slides/xaringan.html#94), books, [web pages](https://nz-stefan.shinyapps.io/commute-explorer-2/), [APIs](https://www.rplumber.io/), and more

- High demand in the job market

---

## A look at your working environment (the RStudio IDE)
.center[
<img src="images/Rstudio1.png" width="75%" height="75%" />
]

---

## Principles of programming in R

1. Everything is an **object**

2. Every object has a **name** and a **value**

3. You can pass objects to **functions**

4. Functions come with **instructions** (documentation)

5. Functions are packaged into **libraries**

6. Functions issue **warnings about possible errors**


]

`precio`

`precio<-100`

`log(precio, base = 10)`

<br>
`?log`

`library(ggplot2)`

`log(-1)`

]
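The principles above can be tried directly in the console. The following is a minimal sketch that reuses the slide's own `precio` object; the names `log_precio` and `aviso` are introduced here only for illustration.

```r
# Principles 1-2: an object has a name (precio) and a value (100).
precio <- 100

# Principle 3: pass the object to a function; log base 10 of 100 is 2.
log_precio <- log(precio, base = 10)
print(log_precio)

# Principle 6: log(-1) is undefined for real numbers, so R returns NaN
# and issues a warning instead of stopping. tryCatch() captures its text.
aviso <- tryCatch(log(-1), warning = function(w) conditionMessage(w))
print(aviso)
```

Running `?log` (principle 4) opens the help page that documents arguments such as `base`.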

---
layout:true

## COPY, PASTE, RUN

---

Matrices are useful objects when programming and applying linear methods.

Let's see how to create a matrix in R with the `matrix()` function.

```r
## check the documentation
?matrix
# data: optional; takes a vector of values
# nrow: number of rows in the matrix
# ncol: number of columns in the matrix
```
]

<img src="images/help-matrix.jpg" width="100%" height="80%" />
]

---

```r
# Create a vector holding a single zero
obj_1 <- 0

# Matrix A: 5x2, filled with zeros
A <- matrix(data = obj_1, nrow = 5, ncol = 2)

# Look at matrix A
A
```

```
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0
## [4,]    0    0
## [5,]    0    0
```
]

```r
# Create a vector with the numbers 1 to 10
obj_2 <- 1:10

# Matrix B: 5x2, filled column by column with a numeric sequence
B <- matrix(data = obj_2, nrow = 5, ncol = 2)

# Look at matrix B
B
```

```
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
```

]
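Since matrices matter mostly for linear methods, it may help to see two operations that come up constantly. This is a small sketch, not part of the original workshop; `B_filas` and `BtB` are names chosen here for illustration.

```r
# Same matrix B as above: 5x2, filled column by column.
B <- matrix(data = 1:10, nrow = 5, ncol = 2)

# byrow = TRUE fills the matrix row by row instead.
B_filas <- matrix(data = 1:10, nrow = 5, ncol = 2, byrow = TRUE)
B_filas

# t() transposes; %*% is matrix multiplication (not element-wise *).
BtB <- t(B) %*% B   # 2x5 times 5x2 gives a 2x2 matrix
dim(BtB)
BtB
```

The product `t(B) %*% B` is the kind of cross-product that appears in least-squares formulas.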

---
layout: true
## Funciones para análisis estadístico

---

This is the code we are going to run.

```r
x<-c(1:10) # vector x
y<-x*2+5   # vector y
# Mean
mean(x)
# Median
median(x)
# Std. dev. and variance
sd(x)
var(x)
# Min. and max.
min(x)
max(x)
# Correlation/covariance
cor(x, y)
cov(x, y)
# Quartiles and mean of x
summary(x)
```
]

This is the output we will see in the RStudio console:

```
## [1] 5.5
```

```
## [1] 3.02765
```

```
## [1] 9.166667
```

```
## [1] 1
```

```
## [1] 10
```

```
## [1] 1
```

```
## [1] 18.33333
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00
```

]
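Two of the results above follow directly from how `y` was built. The sketch below, using only base R, checks why the correlation is exactly 1 and why the covariance is twice the variance of `x`.

```r
x <- 1:10
y <- x * 2 + 5

# y is an exact increasing linear function of x, so the correlation is 1.
cor(x, y)

# Adding 5 does not change the covariance; multiplying by 2 doubles it,
# so cov(x, y) equals 2 * var(x).
all.equal(cov(x, y), 2 * var(x))

# The standard deviation is the square root of the variance.
all.equal(sd(x), sqrt(var(x)))
```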

---
Other functions

This is the code we are going to run:

```r
# Set seed (pin down random number generation)
set.seed(1)
# 4 random draws from N(3,5)
rnorm(n = 4, mean = 3, sd = sqrt(5))
# CDF for N(0,1) at z=1.96
pnorm(q = 1.96, mean = 0, sd = 1)
# Sample 5 draws from x w/ repl.
sample(
  x = x,
  size = 5,
  replace = T
)
# First and last 3 elements of x
head(x, 3)
tail(x, 3)
```
]

```
## [1] 1.599207 3.410639 1.131478 6.567156
```

```
## [1] 0.9750021
```

```
## [1] 2 3 1 5 5
```

```
## [1] 1 2 3
```

```
## [1]  8  9 10
```

]

---

Other functions

```r
# Set seed (pin down random number generation)
set.seed(1)
# 4 random draws from N(3,5)
distribucion_normal<-rnorm(n = 4, mean = 3, sd = sqrt(5))
# CDF for N(0,1) at z=1.96
cdf<-pnorm(q = 1.96, mean = 0, sd = 1)
# Sample 5 draws from x w/ repl.
muestra<-sample(
  x = x,
  size = 5,
  replace = T
)
# First and last 3 elements of x
head_x<-head(x, 3)
tail_x<-tail(x, 3)
```
]

<br>
<br>
<br>
<br>
<br>
<br>
**Now I don't see the output. What happened?**

]

---

Other functions

.pull-right[
The objects we defined were saved when we ran the new code. The Environment panel (the data structure panel) keeps a record of them.

![panel_de_datos](images/panel_de_estructura_de_datos.png)

]
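A short sketch of the behavior described here: assignment stores a value without printing it. It reuses the slide's `head_x` and `tail_x` names and only base R.

```r
x <- 1:10

# Assigning with <- stores the result silently; nothing is printed.
head_x <- head(x, 3)

# Typing the name (or calling print) displays the stored value.
print(head_x)

# Wrapping an assignment in parentheses stores and prints in one step.
(tail_x <- tail(x, 3))

# ls() lists the names of the objects saved in the current environment.
ls()
```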

---
layout:true

## Importación o lectura de datos

---

Some of the most widely used libraries and functions are:

.pull-left[
**readr library** for files in a variety of formats.
* readr::read_csv()
* readr::read_delim()
* readr::read_rds()

**readxl library** for xls, xlsx, and similar files.
* readxl::read_xlsx()

**haven library** for files native to other programs (e.g. Stata).
* haven::read_dta()
]

```r
readr::read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv")
```

```r
readxl::read_excel("../data/news-release-table1-202307.xlsx", 
                   range = "A4:K45")
```

```r
haven::read_dta("https://raw.github.com/scunning1975/mixtape/master/titanic.dta")
```

]

---
layout:false

## Let's pause to "read" the data we are going to use

World Bank data on GDP, population, and life expectancy at birth in its original format ([access it here](https://www.bing.com/search?pglt=43&q=world+bank+development+indicators&cvid=c163e5df13924e76b9846b044539d96c&aqs=edge.0.0j69i64j0l7.8635j0j1&FORM=ANAB01&PC=U531))

```r
WDI_wide <- read_csv("../data/WDI_extract_data.csv")
```

World Bank data in tidy format (we will discuss where it comes from later):

```r
WDI_long <- read_csv("../data/WDI_extract_data_long.csv")
```

The file path depends on the directory you are working in. Choose your working directory with Ctrl+Shift+H.
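Before reading a file through a relative path, it can help to check where R is looking. The following base R sketch assumes the same `../data/` layout used above; `ruta` is a name introduced here for illustration.

```r
# Where is R looking right now? Relative paths start from here.
getwd()

# A relative path such as "../data/file.csv" means: go up one folder,
# then into data/. file.exists() checks the path before reading.
ruta <- file.path("..", "data", "WDI_extract_data_long.csv")
file.exists(ruta)

# normalizePath() shows the absolute path that the relative one resolves to.
normalizePath(ruta, mustWork = FALSE)
```

If `file.exists()` returns `FALSE`, the working directory is probably not the one the path was written for.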

---
layout: true

## Introducción al tidyverse

---

### (Tidy ~ ordered) + (verse ~ universe)

- The tidyverse is a set of R libraries built on a shared philosophy
- It has its own syntax and was designed to be more intuitive than R's "base" functions

### It promotes keeping data in *tidy* format:

.pull-left[
1. Each variable has its own column
2. Each observation has its own row
3. Each value has its own cell
]

![Tidy data](images/tidy-1.png)

]

**What do we gain from this?** Order. There are a thousand ways for data to be messy, but only one way for it to be tidy.

---

### It offers tools for every stage of data analysis

It includes at least 8 libraries that we will use throughout the course.

![Analisis](images/data-science-explore.png)

]

#### This week:

- **dplyr** to manipulate data in tidy format.
- **ggplot2** to visualize it.

]

---
layout: true
## Tidy or not?

---

### World development indicators (World Bank database)  
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Country Name </th>
   <th style="text-align:left;"> Series Name </th>
   <th style="text-align:left;"> Series Code </th>
   <th style="text-align:left;"> 1960 [YR1960] </th>
   <th style="text-align:left;"> 1961 [YR1961] </th>
   <th style="text-align:left;"> 1962 [YR1962] </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> GDP per capita (current US$) </td>
   <td style="text-align:left;"> NY.GDP.PCAP.CD </td>
   <td style="text-align:left;"> 59.7732337032148 </td>
   <td style="text-align:left;"> 59.8608999923829 </td>
   <td style="text-align:left;"> 58.4580086983139 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> GDP (current US$) </td>
   <td style="text-align:left;"> NY.GDP.MKTP.CD </td>
   <td style="text-align:left;"> 537777811.111111 </td>
   <td style="text-align:left;"> 548888895.555556 </td>
   <td style="text-align:left;"> 546666677.777778 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> Life expectancy at birth, total (years) </td>
   <td style="text-align:left;"> SP.DYN.LE00.IN </td>
   <td style="text-align:left;"> 32.446 </td>
   <td style="text-align:left;"> 32.962 </td>
   <td style="text-align:left;"> 33.471 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> Population, total </td>
   <td style="text-align:left;"> SP.POP.TOTL </td>
   <td style="text-align:left;"> 8996967 </td>
   <td style="text-align:left;"> 9169406 </td>
   <td style="text-align:left;"> 9351442 </td>
  </tr>
</tbody>
</table>

**Opinions: is it tidy? Why?**

---

### World development indicators, tidy version

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> year </th>
   <th style="text-align:left;"> continent_name </th>
   <th style="text-align:left;"> country_name </th>
   <th style="text-align:right;"> gdp_pc </th>
   <th style="text-align:right;"> life_exp </th>
   <th style="text-align:right;"> population </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1960 </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:right;"> 59.77323 </td>
   <td style="text-align:right;"> 32.446 </td>
   <td style="text-align:right;"> 8996967 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1961 </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:right;"> 59.86090 </td>
   <td style="text-align:right;"> 32.962 </td>
   <td style="text-align:right;"> 9169406 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1962 </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:right;"> 58.45801 </td>
   <td style="text-align:right;"> 33.471 </td>
   <td style="text-align:right;"> 9351442 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1963 </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:right;"> 78.70643 </td>
   <td style="text-align:right;"> 33.971 </td>
   <td style="text-align:right;"> 9543200 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1964 </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:right;"> 82.09531 </td>
   <td style="text-align:right;"> 34.463 </td>
   <td style="text-align:right;"> 9744772 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1965 </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:right;"> 101.10833 </td>
   <td style="text-align:right;"> 34.948 </td>
   <td style="text-align:right;"> 9956318 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1966 </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:right;"> 137.59430 </td>
   <td style="text-align:right;"> 35.430 </td>
   <td style="text-align:right;"> 10174840 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1967 </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:right;"> 160.89843 </td>
   <td style="text-align:right;"> 35.914 </td>
   <td style="text-align:right;"> 10399936 </td>
  </tr>
</tbody>
</table>
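The slides do not show how the wide table becomes the tidy one. One possible route, sketched here with the tidyr functions `pivot_longer()` and `pivot_wider()`, is shown below. The toy table `wide`, its `series` column, and the object names are assumptions made for illustration; they are not the workshop's actual script.

```r
library(tidyr)

# A toy table in the wide layout: one row per country-series,
# one column per year (values copied from the first table above).
wide <- data.frame(
  country_name = c("Afghanistan", "Afghanistan"),
  series = c("life_exp", "population"),
  `1960` = c(32.446, 8996967),
  `1961` = c(32.962, 9169406),
  check.names = FALSE
)

# Step 1: years move from column names into a single year column.
long <- pivot_longer(wide, cols = c("1960", "1961"),
                     names_to = "year", values_to = "value")

# Step 2: each series becomes its own column (one variable per column).
tidy <- pivot_wider(long, names_from = series, values_from = value)

tidy$year <- as.integer(tidy$year)
tidy
```

The result has one row per country and year, with `life_exp` and `population` as separate columns.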

---
layout: true

## The dplyr library

---

This library is used to manipulate data in tidy format (1 variable -> 1 column, 1 observation -> 1 row).

To install it on your machine:

```r
install.packages("dplyr")
```

To load all of its functions into your session:

```r
library(dplyr)
```

To access a single function directly:

```r
dplyr::
```

]

![](images/EPP_gif_1.gif)

]

---

**Main functions:**
- `filter()`: Returns the **rows** that meet certain conditions.
- `select()`: Selects, reorders, and renames **variables**.
- `arrange()`: Sorts rows by one or more variables.
- `mutate()`: Creates or transforms variables.
- `summarise()`: Collapses all individual rows into a single one.
- `sample_frac()`: Draws a random sample of the observations.
- `group_by()`: Makes all of the above operate by group.
- `lag()` and `lead()`: Access the values of previous or following observations.

]

**Examples:**

```r
filter(datos, continent == "Europe")
```

```r
select(datos, year, pais = country, poblacion = pop)
```

```r
arrange(datos, country, year)
```

```r
mutate(datos, gdp = pop * gdpPercap)
```

```r
summarise(filter(datos, year == 2007),
          lifeExp = mean(lifeExp))
```

```r
sample_frac(datos, size = 0.5)
```

```r
group_by(datos, country, year)
```

]
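The examples above refer to an object called `datos` that the slides never define. To try the verbs, a small stand-in is enough. The data frame below is hypothetical, with made-up values and gapminder-style column names chosen only so the calls run.

```r
library(dplyr)

# Hypothetical stand-in for `datos` (made-up values).
datos <- data.frame(
  country   = c("Spain", "Spain", "Chile", "Chile"),
  continent = c("Europe", "Europe", "Americas", "Americas"),
  year      = c(2002, 2007, 2002, 2007),
  pop       = c(40, 45, 15, 16),
  gdpPercap = c(25, 30, 10, 13)
)

# Keep only the European rows.
europa <- filter(datos, continent == "Europe")

# Add a new column computed from existing ones.
con_gdp <- mutate(datos, gdp = pop * gdpPercap)

# One summary row per continent.
por_continente <- summarise(group_by(datos, continent),
                            gdp_pc_medio = mean(gdpPercap))
por_continente
```

Note that the data frame is always the first argument, which is what makes the pipe operator convenient.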

---
layout: true

## The pipe operator %>%

---
It places **any object** on its left into the first argument of **any function** on its right.

It chains the results of several functions together and makes the code more readable.

.footnote[1) **Shortcut:** Ctrl+Shift+M. 2) Recent versions of R (4.1 and later) include the native operator |>, which works the same way in examples like these.]
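To make the "first argument" rule concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The vector and the names `anidado` and `con_pipe` are chosen here for illustration.

```r
library(dplyr)  # attaching dplyr makes %>% available

x <- c(1, 4, 9)

# Nested call: read from the inside out.
anidado <- round(mean(sqrt(x)), 1)

# Piped call: read from top to bottom. x becomes the first argument of
# sqrt(), whose result becomes the first argument of mean(), and so on.
con_pipe <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(anidado, con_pipe)
```

Both versions compute the same value; the pipe only changes the order in which the steps are written.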

---

**Original examples:**

```r
filter(datos, continent == "Europe")
```

```r
select(datos, year, pais = country, pob = pop)
```

```r
arrange(datos, country, year)
```

```r
mutate(datos, gdp = pop * gdpPercap)
```

```r
summarise(filter(datos, year == 2007),
          lifeExp = mean(lifeExp))
```

]

**Examples with %>% :**

```r
datos %>% 
  filter(continent=="Europe")
```

```r
datos %>% 
  select(year,pais=country,poblacion=pop)
```

```r
datos %>% arrange(country,year)
```

```r
datos %>% 
  mutate(gdp=pop*gdpPercap)
```

```r
datos %>% 
  filter(year==2007) %>%  
  summarise(lifeExp=mean(lifeExp))
```

]

---
layout: true

## dplyr in action

Use **select()** to bring order to your data

Imagine you are asked to work with this dataset

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> nombre_Terrible </th>
   <th style="text-align:right;"> VALUE </th>
   <th style="text-align:left;"> Nombre.Peor </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> -0.01 </td>
   <td style="text-align:left;"> I </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 2.40 </td>
   <td style="text-align:left;"> I </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 0.76 </td>
   <td style="text-align:left;"> II </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> -0.80 </td>
   <td style="text-align:left;"> II </td>
  </tr>
</tbody>
</table>

]

---

You can make your own copy of it like this. Without a call to `set.seed()`, the random values will differ on each run:

```r
datos_terribles<-data.frame(
  nombre_Terrible=c(rep("A",2),rep("B",2)),
  VALUE=round(rnorm(4,mean = 0,sd=1),2),
  Nombre.Peor=c(rep("I",2),rep("II",2))
)
```
]

---
.pull-right[

Reorder the columns

```r
datos_terribles %>% 
  select(nombre_Terrible, Nombre.Peor,VALUE) 
```

```
##   nombre_Terrible Nombre.Peor VALUE
## 1               A           I -1.15
## 2               A           I -0.29
## 3               B          II -0.30
## 4               B          II -0.41
```
]

---
.pull-right[

Reorder, drop, and rename columns:

```r
datos_terribles %>% 
  select(categoria=nombre_Terrible,
         valor=VALUE)
```

```
##   categoria valor
## 1         A -1.15
## 2         A -0.29
## 3         B -0.30
## 4         B -0.41
```
]

---
.pull-right[

Select variables by type

```r
datos_terribles %>% 
  select(where(is.numeric)) 
```

```
##   VALUE
## 1 -1.15
## 2 -0.29
## 3 -0.30
## 4 -0.41
```

]

---
.pull-right[

Select variables by name

```r
datos_terribles %>% 
  select(starts_with("nombre"))
```

```
##   nombre_Terrible Nombre.Peor
## 1               A           I
## 2               A           I
## 3               B          II
## 4               B          II
```

]

---

Use the `janitor::clean_names()` function to standardize the names:

```r
library(janitor) # must be installed first: install.packages("janitor")

datos_terribles %>% 
  clean_names() 
```

```
##   nombre_terrible value nombre_peor
## 1               A -1.15           I
## 2               A -0.29           I
## 3               B -0.30          II
## 4               B -0.41          II
```

]

---

## dplyr in action

---

**Restrict your analysis to specific groups:** Show GDP per capita for countries in the Americas in 2018.

```r
WDI_long %>% 
  ## keep only the year 2018
  filter(year==2018) %>% 
  ## keep only countries in the Americas
  filter(continent_name=="Americas" ) %>% 
  ## keep only year, country, and per capita GDP
  select(year,country_name,gdp_pc)
```

]

```
## # A tibble: 46 × 3
##     year country_name            gdp_pc
##    <dbl> <chr>                    <dbl>
##  1  2018 Antigua and Barbuda     16673.
##  2  2018 Argentina               11633.
##  3  2018 Aruba                   30253.
##  4  2018 Bahamas, The            33768.
##  5  2018 Barbados                17745.
##  6  2018 Belize                   5001.
##  7  2018 Bermuda                113023.
##  8  2018 Bolivia                  3549.
##  9  2018 Brazil                   9151.
## 10  2018 British Virgin Islands     NA 
## # ℹ 36 more rows
```

]

---

**Create new variables:** Compute gross domestic product by country and year.

**mutate()**

```r
WDI_long %>% 
  ## select variables of your interest
  select(year,country_name,gdp_pc ,population) %>% 
  ## estimate total GDP by country (in billions). Keep 3 decimals
  mutate(gdp_bn=round(population*gdp_pc/(10^9),3)) %>% 
  head()
```
]

```
## # A tibble: 6 × 5
##    year country_name gdp_pc population gdp_bn
##   <dbl> <chr>         <dbl>      <dbl>  <dbl>
## 1  1960 Afghanistan    59.8    8996967  0.538
## 2  1961 Afghanistan    59.9    9169406  0.549
## 3  1962 Afghanistan    58.5    9351442  0.547
## 4  1963 Afghanistan    78.7    9543200  0.751
## 5  1964 Afghanistan    82.1    9744772  0.8  
## 6  1965 Afghanistan   101.     9956318  1.01
```

]

---

Obtain totals, averages, and other aggregate measures

```r
WDI_long %>% 
  ## keep year 2018
  filter(year==2018) %>% 
  ## estimate total GDP by country (in billions). Keep 3 decimals
  mutate(gdp_bn=round(population*gdp_pc/(10^9),3)) %>% 
  ## estimate world gdp and world population
  summarise(life_exp=mean(life_exp, na.rm = T),
            population=sum(population, na.rm = T),
            gdp_bn=sum(gdp_bn, na.rm = T))
```

]

```
## # A tibble: 1 × 3
##   life_exp population gdp_bn
##      <dbl>      <dbl>  <dbl>
## 1     72.8 7828350268 87756.
```

]

---
Extend your calculations across different groups

**group_by()**

```r
WDI_long %>%
  filter(year==2018) %>% 
  group_by(continent_name) %>% 
  summarise(count=n(),
            mean_life_exp=mean(life_exp,na.rm = T),
            sd_life_exp=sd(life_exp,na.rm = T))
```
]

```
## # A tibble: 5 × 4
##   continent_name count mean_life_exp sd_life_exp
##   <chr>          <int>         <dbl>       <dbl>
## 1 Africa            54          63.8        5.98
## 2 Americas          46          75.7        3.64
## 3 Asia              51          74.4        5.04
## 4 Europe            54          78.7        3.80
## 5 Oceania           19          73.4        5.99
```

]

---

Compute changes relative to earlier periods, or differences between units.

**lag()**

```r
WDI_long %>%
  # A single country
  filter(country_name=="Argentina") %>% 
  # Sort from oldest to most recent
  arrange(year) %>% 
  # Change over 10 years (assumes one row per year, with no gaps)
  mutate(change=life_exp-lag(life_exp,10) ) %>% 
  # Show only 4 years
  filter(year==2018 | year==2008 |year==1998  |year==1988) %>%
  select(country_name, year, life_exp, change)
```
]

```
## # A tibble: 4 × 4
##   country_name  year life_exp change
##   <chr>        <dbl>    <dbl>  <dbl>
## 1 Argentina     1988     71.2   2.30
## 2 Argentina     1998     73.2   2.03
## 3 Argentina     2008     75.0   1.74
## 4 Argentina     2018     76.5   1.57
```

]

.footnote[**Careful:** Check the order of the observations and the grouping level of the data so that you do not compare apples with oranges.]
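The warning about grouping can be made concrete with a toy panel. In the sketch below (made-up values, two hypothetical countries "A" and "B"), `lag()` without `group_by()` subtracts the last value of country A from the first value of country B, while the grouped version returns `NA` there.

```r
library(dplyr)

# Toy panel with made-up life expectancy values.
panel <- data.frame(
  country_name = c("A", "A", "A", "B", "B", "B"),
  year         = c(2000, 2001, 2002, 2000, 2001, 2002),
  life_exp     = c(70, 71, 72, 50, 52, 54)
)

# Without grouping: the first row of B is compared with the last row of A.
sin_grupo <- panel %>%
  arrange(country_name, year) %>%
  mutate(change = life_exp - lag(life_exp))

# With grouping: lag() restarts within each country.
con_grupo <- panel %>%
  arrange(country_name, year) %>%
  group_by(country_name) %>%
  mutate(change = life_exp - lag(life_exp)) %>%
  ungroup()

sin_grupo$change[4]  # 50 - 72, a meaningless cross-country difference
con_grupo$change[4]  # NA, because B has no earlier observation
```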

---

**Data:** Per capita income of each country from 1950 to 2018.

**Challenge:** Summarize the income distribution by continent in the most recent year. Which syntax seems clearer to you?

### (a) With base R functions:

```r
aggregate(x = WDI_long[WDI_long$year==max(WDI_long$year),"gdp_pc"]  , 
          list(continent = WDI_long[WDI_long$year==max(WDI_long$year),]$continent_name), 
          FUN = function(x) c(min=min(x,na.rm = T),
                              mean=mean(x,na.rm = T),
                              max=max(x,na.rm = T))) %>% as_tibble()
```

]

### (b) With dplyr:

```r
WDI_long %>%
  filter(year==max(year)) %>% 
  group_by(continent_name) %>% 
  summarize(min=min(gdp_pc,na.rm = T),
            mean=mean(gdp_pc,na.rm = T),
            max=max(gdp_pc,na.rm = T))
```

]

---

**Income distribution by continent in 2018**

### (a) With base R functions:

```
## # A tibble: 5 × 2
##   continent gdp_pc[,"min"] [,"mean"] [,"max"]
##   <chr>              <dbl>     <dbl>    <dbl>
## 1 Africa              261.     2620.   16199.
## 2 Americas           1272.    18536.  117098.
## 3 Asia                507.    15233.   86118.
## 4 Europe             3663.    33384.  190513.
## 5 Oceania            1655.    13524.   55057.
```

]

### (b) With dplyr:

```
## # A tibble: 5 × 4
##   continent_name   min   mean     max
##   <chr>          <dbl>  <dbl>   <dbl>
## 1 Africa          261.  2620.  16199.
## 2 Americas       1272. 18536. 117098.
## 3 Asia            507. 15233.  86118.
## 4 Europe         3663. 33384. 190513.
## 5 Oceania        1655. 13524.  55057.
```

]

---

## A picture is worth more than 1000 *lines of code*
<img src="Workshop_01_slides_files/figure-html/unnamed-chunk-55-1.png" width="50%" style="display: block; margin: auto;" />

---
layout: true

## ggplot2 and the grammar of graphics

---

.pull-left[
1. Your plot is linked to the data through aesthetic mappings
2. Once those mappings are defined, you can draw the data in different shapes (geoms), such as points, lines, bars, etc.
3. You can add as many layers as you like to a plot
]

![The basics](images/ggplot-grammar.png)

]

---
layout: false

## Visualizing global GDP per capita

Let's start by visualizing the following time series:

```r
gdp_pc_by_year<-WDI_long %>% 
  filter(year>1990) %>% 
  group_by(year) %>% 
  summarise(gdp_pc=weighted.mean(gdp_pc,
                                 w=population,
                                 na.rm = TRUE))
```
]

.pull-right[
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> gdp_pc </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1991 </td>
   <td style="text-align:right;"> 4565.059 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1992 </td>
   <td style="text-align:right;"> 4703.727 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1993 </td>
   <td style="text-align:right;"> 4650.490 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1994 </td>
   <td style="text-align:right;"> 4875.684 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1995 </td>
   <td style="text-align:right;"> 5351.510 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1996 </td>
   <td style="text-align:right;"> 5393.152 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1997 </td>
   <td style="text-align:right;"> 5304.659 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1998 </td>
   <td style="text-align:right;"> 5221.885 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1999 </td>
   <td style="text-align:right;"> 5288.473 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2000 </td>
   <td style="text-align:right;"> 5396.982 </td>
  </tr>
</tbody>
</table>
]
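The series above is a population-weighted mean, which differs from the simple mean across countries. A base R sketch with made-up numbers shows the difference, along with one caveat about missing weights.

```r
gdp_pc     <- c(1000, 50000, NA)   # made-up values for three countries
population <- c(900, 100, 50)

# Simple mean: every country counts the same.
media_simple <- mean(gdp_pc, na.rm = TRUE)

# Weighted mean: large countries count more.
# (1000 * 900 + 50000 * 100) / (900 + 100)
media_ponderada <- weighted.mean(gdp_pc, w = population, na.rm = TRUE)

# na.rm = TRUE only drops missing values in x. A missing weight still
# produces NA, so rows with unknown population need to be filtered first.
con_peso_na <- weighted.mean(c(1000, 50000), w = c(900, NA), na.rm = TRUE)

c(media_simple, media_ponderada, con_peso_na)
```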

---
layout:false

## Aesthetic mappings and applying geoms

```r
ggplot(data = gdp_pc_by_year) +
* geom_point(
*      aes(x = year, y = gdp_pc)) +
  labs(x = "Year",
       y = "GDP per capita")
```

]

```r
ggplot(data = gdp_pc_by_year) +
* geom_line(
*      aes(x = year, y = gdp_pc)) +
  labs(x = "Year",
       y = "GDP per capita")
```

<img src="Workshop_01_slides_files/figure-html/unnamed-chunk-59-1.png" width="100%" />
]

```r
ggplot(data = gdp_pc_by_year, 
       aes(x = year, y = gdp_pc)) +
* geom_point()+
* geom_line() +
  labs(x = "Year",
       y = "GDP per capita")
```

<img src="Workshop_01_slides_files/figure-html/unnamed-chunk-60-1.png" width="100%" />
]

---

## Attributes mapped to data vs. fixed attributes

```r
ggplot(data = gdp_pc_by_year, 
       aes(x = year, y = gdp_pc)) +
*geom_point(aes(color=gdp_pc))+
  labs(x = "Year",
       y = "GDP per capita")
```

]

```r
ggplot(data = gdp_pc_by_year, 
       aes(x = year, y = gdp_pc)) +
*geom_line(color="navy")+
  labs(x = "Year",
       y = "GDP per capita")
```

]

```r
ggplot(data = gdp_pc_by_year, 
       aes(x = year, y = gdp_pc)) +
*geom_line(aes(color="navy"))+
  labs(x = "Year",
       y = "GDP per capita")
```

]
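The last two snippets look almost identical but behave differently. The sketch below, with a small made-up data frame, shows the distinction; `p_fijo` and `p_mapeado` are names introduced here for illustration.

```r
library(ggplot2)

# Made-up values, only to have something to draw.
df <- data.frame(year = 1991:1995, gdp_pc = c(10, 12, 11, 13, 15))

# Fixed attribute: color is set outside aes(), so the line is drawn in navy.
p_fijo <- ggplot(df, aes(x = year, y = gdp_pc)) +
  geom_line(color = "navy")

# Mapped attribute: inside aes(), "navy" is treated as data, a constant
# category labelled "navy". ggplot2 picks the color from its default
# palette and adds a legend, so the line is not navy.
p_mapeado <- ggplot(df, aes(x = year, y = gdp_pc)) +
  geom_line(aes(color = "navy"))

# The fixed value is stored as a layer parameter, not as a mapping.
p_fijo$layers[[1]]$aes_params$colour
```

As a rule of thumb, values that come from a column go inside `aes()`, and constants go outside it.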

---
layout: false

## Let's talk about the evolution of income per capita and life expectancy

World development indicators (World Bank)
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country_name </th>
   <th style="text-align:left;"> country_code </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> gdp_pc </th>
   <th style="text-align:right;"> gdp </th>
   <th style="text-align:right;"> life_exp </th>
   <th style="text-align:right;"> population </th>
   <th style="text-align:left;"> continent_name </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> AFG </td>
   <td style="text-align:right;"> 1960 </td>
   <td style="text-align:right;"> 59.77323 </td>
   <td style="text-align:right;"> 537777811 </td>
   <td style="text-align:right;"> 32.446 </td>
   <td style="text-align:right;"> 8996967 </td>
   <td style="text-align:left;"> Asia </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> AFG </td>
   <td style="text-align:right;"> 1961 </td>
   <td style="text-align:right;"> 59.86090 </td>
   <td style="text-align:right;"> 548888896 </td>
   <td style="text-align:right;"> 32.962 </td>
   <td style="text-align:right;"> 9169406 </td>
   <td style="text-align:left;"> Asia </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> AFG </td>
   <td style="text-align:right;"> 1962 </td>
   <td style="text-align:right;"> 58.45801 </td>
   <td style="text-align:right;"> 546666678 </td>
   <td style="text-align:right;"> 33.471 </td>
   <td style="text-align:right;"> 9351442 </td>
   <td style="text-align:left;"> Asia </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> AFG </td>
   <td style="text-align:right;"> 1963 </td>
   <td style="text-align:right;"> 78.70643 </td>
   <td style="text-align:right;"> 751111191 </td>
   <td style="text-align:right;"> 33.971 </td>
   <td style="text-align:right;"> 9543200 </td>
   <td style="text-align:left;"> Asia </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> AFG </td>
   <td style="text-align:right;"> 1964 </td>
   <td style="text-align:right;"> 82.09531 </td>
   <td style="text-align:right;"> 800000044 </td>
   <td style="text-align:right;"> 34.463 </td>
   <td style="text-align:right;"> 9744772 </td>
   <td style="text-align:left;"> Asia </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> AFG </td>
   <td style="text-align:right;"> 1965 </td>
   <td style="text-align:right;"> 101.10833 </td>
   <td style="text-align:right;"> 1006666638 </td>
   <td style="text-align:right;"> 34.948 </td>
   <td style="text-align:right;"> 9956318 </td>
   <td style="text-align:left;"> Asia </td>
  </tr>
</tbody>
</table>
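Several plots in the rest of the deck use an object named `WDI_long_2018` whose definition does not appear in these slides. A plausible sketch, assuming it is simply the 2018 cross-section of `WDI_long`, is shown below with a toy stand-in (made-up values) so that it runs on its own.

```r
library(dplyr)

# Toy stand-in for WDI_long with made-up values.
WDI_long <- data.frame(
  country_name   = c("A", "A", "B", "B"),
  continent_name = c("Asia", "Asia", "Europe", "Europe"),
  year           = c(2017, 2018, 2017, 2018),
  gdp_pc         = c(500, 520, 30000, NA),
  life_exp       = c(64, 65, 80, 81)
)

# Assumption: WDI_long_2018 keeps only the rows for 2018.
WDI_long_2018 <- WDI_long %>%
  filter(year == 2018)

WDI_long_2018
```

With the real data, the same `filter()` call applied to the `WDI_long` object read earlier should be enough, if that assumption holds.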

---
layout:true
## Distribution plots

---

Density: geom_density

```r
 ggplot(data=WDI_long_2018, 
        aes(x=gdp_pc))+
* geom_density()+
  labs(title="Distribucion de ingreso por habitante")
```

]

```r
 ggplot(data=WDI_long_2018, 
        aes(x=gdp_pc))+
* geom_histogram()+
  labs(title="Distribucion de ingreso por habitante")
```

]

```r
 ggplot(data=WDI_long_2018, 
        aes(x=gdp_pc))+
* geom_dotplot()+
  labs(title="Distribucion de ingreso por habitante")
```

]

---
**Distribution of income per capita**

```r
# Total
 ggplot(data=WDI_long_2018, aes(x=gdp_pc))+
  # Density geom. The fill color is set outside the aesthetics
  geom_density(fill="gray")+
  # Vertical line geom. It needs the value where the line is drawn
  # We map the word "Mean" to linetype so that the legend of the
  # line shows that word (not correct, but effective).
* geom_vline(aes(xintercept = mean(gdp_pc, na.rm = TRUE),linetype="Mean"))+
  labs(title="Distribución completa",
       linetype="Stats",
       x="Ingreso por habitante")
```

]

]

---
**Distribution of income per capita, by continent**

```r
# By continent
  ggplot(data= WDI_long_2018, aes(x=gdp_pc))+
  # Density geom, specifying groups
  geom_density(aes(fill=continent_name), alpha=0.4) +
  # Vertical lines by continent. We pass continent-level aggregates as data to draw several lines
* geom_vline(data= group_by(WDI_long_2018,continent_name) %>%
*              summarise(gdp_pc=mean(gdp_pc, na.rm = TRUE)),
*            aes(xintercept = gdp_pc,color=continent_name),
*            show.legend = F)+
  scale_x_log10(labels=scales::number_format())+
  labs(title="Distribución por continente",
       fill="Continente",
       linetype="Stats",
       x="Ingreso por habitante")
```
]

.pull-right[
<img src="Workshop_01_slides_files/figure-html/unnamed-chunk-69-1.png" width="100%" />
]

---

**Distribution of life expectancy in the world**

```r
WDI_long_2018 %>%
  ggplot(aes(x=life_exp))+
  # Density geom. The fill color is set outside the aesthetics
  geom_density(fill="gray")+
  # Vertical line geom. It needs the value where the line is drawn
  # We map the word "Mean" to linetype so that the legend of the
  # line shows that word (not correct, but effective).
* geom_vline(aes(xintercept = mean(life_exp, na.rm = TRUE),linetype="Mean"))+
  labs(title="Distribución completa",
       linetype="Stats",
       x="Esperanza de vida")
```

]
 
.pull-right[

]

---

**Distribution of life expectancy, by continent**
.pull-left[

```r
WDI_long_2018 %>%
  ggplot(aes(x=life_exp))+
  # Density geom, specifying groups
  geom_density(aes(fill=continent_name), alpha=0.4) +
  # Vertical lines by continent
  # We pass continent-level aggregates as data to draw several lines
* geom_vline(data= WDI_long_2018 %>%
*              group_by(continent_name) %>%
*              summarise(life_exp=mean(life_exp, na.rm = TRUE)),
*            aes(xintercept = life_exp,color=continent_name),
*            show.legend = F)+
  labs(title="Distribución por continente",
       fill="Muestra",
       linetype="Stats",
       x="Esperanza de vida")
```
]

]

---
layout:true
## Relationship between two continuous variables

---

**Define the data, the coordinates, and the shape**
.pull-left[

```r
# Data and coordinates
ggplot(data=WDI_long_2018, 
       mapping = aes(y=life_exp,x=gdp_pc))+
  # Shapes or geometries
  geom_point()+
  labs(x="Gdp Per Capita",
       y="Life expectancy")
```
]

]

---

**Add other shapes and make formatting changes**
.pull-left[

```r
ggplot(data=WDI_long_2018, 
       mapping = aes(y=life_exp,x=gdp_pc))+
  # coordinates (aesthetics) for a specific geometry
  geom_point(aes(size=population/1000000, 
*                color=continent_name))+
  scale_size_continuous(labels=scales::number_format(), 
                        breaks = c(50,500,1000))+
  labs(x="Gdp Per Capita",
       y="Life expectancy",
       color="Continent",
       size="Pop (millions)")
```
]

.pull-right[
<img src="Workshop_01_slides_files/figure-html/unnamed-chunk-73-1.png" width="100%" />
]

---

**Let's change the scale of GDP per capita. What do we gain from logs?**
.pull-left[

```r
ggplot(data=WDI_long_2018, 
       mapping = aes(y=life_exp,x=gdp_pc))+
  # coordinates (aesthetics) for a specific geometry
  geom_point(aes(size=population/1000000, 
                 color=continent_name))+
  scale_size_continuous(labels=scales::number_format(), 
                        breaks = c(50,500,1000))+
* scale_x_log10(labels=scales::number_format())+
  labs(x="Gdp Per Capita (log scale)",
       y="Life expectancy",
       color="Continent",
       size="Pop (millions)")
```
]

.pull-right[
<img src="Workshop_01_slides_files/figure-html/unnamed-chunk-74-1.png" width="100%" />
]
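
---

**Why logs help: equal ratios become equal distances**

A small numeric sketch of what `scale_x_log10()` does to the horizontal axis. The income values below are made up for illustration and are not taken from the WDI data:

```r
# Hypothetical incomes, each ten times larger than the previous one
income <- c(100, 1000, 10000, 100000)

# On the raw scale the gaps grow with the level: 900, 9000, 90000,
# so low-income observations end up squeezed against the left edge
diff(income)

# On the log10 scale every tenfold increase covers the same distance: 1, 1, 1
diff(log10(income))
```

On the log scale a country that doubles its income moves the same distance whether it starts poor or rich, which spreads out the crowded low-income region of the plot.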

---

**The publisher asks us to make the chart accessible to colorblind readers**
.pull-left[

```r
ggplot(data=WDI_long_2018, 
       mapping = aes(y=life_exp,x=gdp_pc))+
  # coordinates (aesthetics) for a specific geometry
  geom_point(aes(size=population/1000000, 
*                shape=continent_name))+
  scale_size_continuous(labels=scales::number_format(), 
                        breaks = c(50,500,1000))+
  scale_x_log10(labels=scales::number_format())+
  labs(x="Gdp Per Capita (log scale)",
       y="Life expectancy",
*      shape="Continent",
       size="Pop (millions)")
```
]

.pull-right[
<img src="Workshop_01_slides_files/figure-html/unnamed-chunk-75-1.png" width="100%" />
]
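
---

**A possible variant: keep color, but use a colorblind-friendly palette**

The slide above replaces color with shape. One alternative, not part of the original workshop code, is to map the group to both shape and color and to use the viridis palette that ships with ggplot2 (`scale_color_viridis_d()`). The sketch below assumes the built-in `iris` data as a stand-in for `WDI_long_2018`; `p_cb` is an illustrative name:

```r
library(ggplot2)

p_cb <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  # The group is encoded twice: by shape and by color
  geom_point(aes(color = Species, shape = Species), size = 2) +
  # Discrete viridis palette, designed to remain distinguishable
  # under common forms of color blindness
  scale_color_viridis_d() +
  labs(x = "Sepal length",
       y = "Petal length",
       color = "Species",
       shape = "Species")

p_cb
```

Because both aesthetics share the same legend title, ggplot2 merges them into a single legend. Encoding the group twice also keeps the chart readable when it is printed in grayscale.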

---
layout: false
### Supporting material

Recommended sources to keep learning:

- [R for Economists video series](https://www.youtube.com/watch?v=dFSPmjSynCs&list=PLcTBLulJV_AIuXCxr__V8XAzWZosMQIfW&index=3) (by Nick Huntington-Klein)
- [R for Data Science](https://r4ds.had.co.nz/index.html) (Wickham & Grolemund, 2017)
- [Statistical Inference via Data Science](https://moderndive.com/) (Ismay & Kim, 2022)
- [Top 50 ggplot2 Visualizations - The Master List](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#1.%20Correlation) (by Selva Prabhakaran)
- [Video: The best stats you'll ever see](https://www.youtube.com/watch?v=hVimVzgtD6w) (by Hans Rosling)
- [Video: Statistics without the agonizing pain](https://www.youtube.com/watch?v=5Dnw46eC-0o) (by John Rauser)

---

### Tips

.pull-left[
[Use ChatGPT](https://openai.com/chatgpt):

- Ask it to explain complicated blocks of code step by step.
- Ask it to recommend libraries and packages for a specific task.
- Remember that the free version can make mistakes, so don't ask for a complex solution all at once. Ask for it step by step so that you can catch the errors.
]

.pull-right[
Use the help window in R (`?funcion`):

![select code + ctrl + enter](images/EPP_gif_2.gif)

]
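
---

### Tips: asking R for help from the console

A short sketch of the base R help tools mentioned in the previous slide. The functions looked up here (`mean`, `sd`) are arbitrary examples chosen for illustration:

```r
# Open the help page of a function (two equivalent forms)
?mean
help("mean")

# Point help() at a specific package
help("sd", package = "stats")

# Search the installed help pages when you don't know the function name
help.search("standard deviation")   # shorthand: ??"standard deviation"

# Quick look at the arguments of a function without leaving the console
args(sd)
```

In RStudio the first three calls open the page in the Help pane; in a plain terminal the same text is printed to the console.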

---
### Remember

1. Download R and RStudio, and install the following libraries:

```r
# Libraries (the vector has an ASCII name so the script runs in any locale)
packages <- c("tidyverse", "readxl", "haven",
              "lubridate","stringr","gt","stargazer",
              "causaldata","wooldridge","palmerpenguins","AER",
              "broom", "modelr","modelsummary",
              "knitr","rmarkdown","skimr")

# Install the libraries that are not installed yet
is_installed <- packages %in% rownames(installed.packages())
if (any(!is_installed)) {
  install.packages(packages[!is_installed])
}
```

2. I left you a practice exercise in the Practice slide below.

3. The practical assignment for the first block has been published. Reading it can help you reinforce what you learned this week and anticipate the content of the upcoming classes.

---
class: center middle
# End of the first workshop
Thank you

---

## Practice

Every week we will recommend short activities to reinforce what was covered in the workshops. At the start of each in-person class there will be time to clear up any questions that came up.

1. Install R and RStudio and try to replicate all the slides.

We discussed univariate plots for continuous variables (distributions), but we did not get to univariate plots for categorical variables. Practice them at home.

2. Install the `causaldata` library. Once it is installed, you will have access to several datasets that we will use throughout the course.

3. Read the `causaldata::gapminder` dataset.

4. Create a data frame with the average life expectancy of each continent in 2007.

5. Show the average life expectancy of each continent in a bar chart (good tutorial here).

.footnote[These practice exercises are not graded, but we sincerely believe they will help you keep up with the concepts and make the practical assignments of each block easier to tackle.]
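
---

## Practice: a warm-up sketch

The practice asks for a group average shown as a bar chart, plus some work on univariate plots for categorical variables. The sketch below shows both patterns on the built-in `mtcars` data rather than on `gapminder`, so the exercise itself is left to you. The object names (`mpg_by_cyl`, `p_bar`, `p_count`) are illustrative:

```r
library(dplyr)
library(ggplot2)

# Group average: one row per category
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Bar chart of a precomputed value: geom_col() uses y as the bar height
p_bar <- ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average miles per gallon")

# Univariate categorical plot: geom_bar() counts the rows in each category
p_count <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Number of cars")

p_bar
p_count
```

For the gapminder exercise the same structure should apply after filtering the data to 2007, with the continent as the grouping variable.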