DATA ANALYTICS COURSE / Tidyverse (Beta)

2023-2P

INTRODUCTION TO THE TIDYVERSE IN R

The tidyverse is a collection of R packages for working with data efficiently

Developing Advanced Analyses

How Does It Work?

Getting Started with the TIDYVERSE

Load the gapminder package

# install.packages("gapminder")
library(gapminder)

Load the dplyr package

library(dplyr)

Look at the gapminder dataset

gapminder

Example of a tibble

gapminder

## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

Summary of a Dataset

summary(gapminder)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

FILTER IN R

The filter verb extracts particular observations based on a condition. In this exercise you will filter the observations for a given year.

Filter the gapminder dataset for the year 1957

gapminder %>% filter(year == 1957)

## # A tibble: 142 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1957    30.3  9240934      821.
##  2 Albania     Europe     1957    59.3  1476505     1942.
##  3 Algeria     Africa     1957    45.7 10270856     3014.
##  4 Angola      Africa     1957    32.0  4561361     3828.
##  5 Argentina   Americas   1957    64.4 19610538     6857.
##  6 Australia   Oceania    1957    70.3  9712569    10950.
##  7 Austria     Europe     1957    67.5  6965860     8843.
##  8 Bahrain     Asia       1957    53.8   138655    11636.
##  9 Bangladesh  Asia       1957    39.3 51365468      662.
## 10 Belgium     Europe     1957    69.2  8989111     9715.
## # ℹ 132 more rows

Filter the gapminder dataset for the year 1957 the traditional (base R) way

gapminder[gapminder$year == 1957, ]

## # A tibble: 142 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1957    30.3  9240934      821.
##  2 Albania     Europe     1957    59.3  1476505     1942.
##  3 Algeria     Africa     1957    45.7 10270856     3014.
##  4 Angola      Africa     1957    32.0  4561361     3828.
##  5 Argentina   Americas   1957    64.4 19610538     6857.
##  6 Australia   Oceania    1957    70.3  9712569    10950.
##  7 Austria     Europe     1957    67.5  6965860     8843.
##  8 Bahrain     Asia       1957    53.8   138655    11636.
##  9 Bangladesh  Asia       1957    39.3 51365468      662.
## 10 Belgium     Europe     1957    69.2  8989111     9715.
## # ℹ 132 more rows

Multiple Filters

You can also use the filter() verb to set two conditions, which may retrieve a single observation.

As in the previous exercise, you can do this in two lines of code, starting with gapminder %>% and putting filter() on the second line.

Keeping one verb per line helps keep the code readable. Note that each time you place the pipe %>% at the end of the first line (as in gapminder %>%); putting the pipe at the beginning of the second line throws an error.
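To make the role of the pipe concrete, here is a minimal sketch (assuming the gapminder and dplyr packages are installed) that compares a piped call with the equivalent nested call. The object names piped and nested are illustrative only.

```r
library(gapminder)
library(dplyr)

# x %>% f(y) is evaluated as f(x, y), so these two calls are equivalent
piped <- gapminder %>%
  filter(year == 1957)
nested <- filter(gapminder, year == 1957)

# Compare the two results
identical(piped, nested)
```

A line that begins with %>% fails because R has already treated the previous line as a complete expression, so the pipe has nothing on its left-hand side.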

NOTE

The & symbol means that both conditions must hold. Other operators work in a similar way:

Symbol	Condition
==	is equal to
!=	is not equal to
<	is less than
<=	is less than or equal to
>	is greater than
>=	is greater than or equal to
%in%	is one of the given values
&	both conditions hold
|	at least one of the conditions holds
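The operators in the table can be used inside filter(). The following sketch, assuming gapminder and dplyr are installed, shows %in% and | in use; the object names two_countries and first_or_last are made up for this illustration.

```r
library(gapminder)
library(dplyr)

# %in%: keep rows whose country is one of the listed values.
# Conditions separated by a comma are combined like &.
two_countries <- gapminder %>%
  filter(country %in% c("China", "India"), year == 2007)

# |: keep rows where at least one of the conditions holds
first_or_last <- gapminder %>%
  filter(year == 1952 | year == 2007)

nrow(two_countries)
nrow(first_or_last)
```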

Filter for China in 2002

gapminder %>% filter(country=="China" &  year== 2002)

## # A tibble: 1 × 6
##   country continent  year lifeExp        pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
## 1 China   Asia       2002    72.0 1280400000     3119.

ARRANGE IN R

arrange() sorts the observations in ascending or descending order of a particular variable.

In this case, you will sort the data by the lifeExp variable.

Sort in ascending order of lifeExp

gapminder %>% arrange(lifeExp)

## # A tibble: 1,704 × 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203      737.
##  2 Afghanistan  Asia       1952    28.8 8425333      779.
##  3 Gambia       Africa     1952    30    284320      485.
##  4 Angola       Africa     1952    30.0 4232095     3521.
##  5 Sierra Leone Africa     1952    30.3 2143249      880.
##  6 Afghanistan  Asia       1957    30.3 9240934      821.
##  7 Cambodia     Asia       1977    31.2 6978607      525.
##  8 Mozambique   Africa     1952    31.3 6446316      469.
##  9 Sierra Leone Africa     1957    31.6 2295678     1004.
## 10 Burkina Faso Africa     1952    32.0 4469979      543.
## # ℹ 1,694 more rows

Sort in descending order of lifeExp

gapminder %>% arrange(desc(lifeExp))

## # A tibble: 1,704 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # ℹ 1,694 more rows
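arrange() also accepts more than one sorting variable, with later variables breaking ties in earlier ones. This is a small sketch under the assumption that gapminder and dplyr are installed; by_year is an illustrative name.

```r
library(gapminder)
library(dplyr)

# Sort by year first (ascending); within each year, highest lifeExp first
by_year <- gapminder %>%
  arrange(year, desc(lifeExp))

head(by_year, 3)
```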

Filtering and arranging

You will often need the pipe operator %>% to combine several dplyr verbs in a row. In this case, you will combine filter() with arrange() to find the most populous countries in a given year.

Filter for the year 1957, then sort in descending order of population

gapminder %>% filter(year==1957)%>% arrange(desc(pop))

## # A tibble: 142 × 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 China          Asia       1957    50.5 637408000      576.
##  2 India          Asia       1957    40.2 409000000      590.
##  3 United States  Americas   1957    69.5 171984000    14847.
##  4 Japan          Asia       1957    65.5  91563009     4318.
##  5 Indonesia      Asia       1957    39.9  90124000      859.
##  6 Germany        Europe     1957    69.1  71019069    10188.
##  7 Brazil         Americas   1957    53.3  65551171     2487.
##  8 United Kingdom Europe     1957    70.4  51430000    11283.
##  9 Bangladesh     Asia       1957    39.3  51365468      662.
## 10 Italy          Europe     1957    67.8  49182000     6249.
## # ℹ 132 more rows
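If only the first few rows of the sorted result are of interest, one possible extension is to add slice_head() to the end of the pipeline. This sketch assumes dplyr 1.0.0 or later, where slice_head() is available; top3_1957 is an illustrative name.

```r
library(gapminder)
library(dplyr)

# Keep only the three most populous countries in 1957
top3_1957 <- gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop)) %>%
  slice_head(n = 3)

top3_1957
```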

Mutate in R

Using mutate() in R

Suppose we want to measure life expectancy in months instead of years: we would multiply the existing value by 12.

You can use the mutate() verb to change this column or to create a new column computed this way.

Use mutate to convert lifeExp to months, either overwriting the column or adding a new one

gapminder %>% mutate( lifeExp = 12*lifeExp)

gapminder %>% mutate( lifeExpMonths = 12*lifeExp)
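mutate() can also combine several existing columns. As an illustrative sketch (assuming gapminder and dplyr are installed), the total GDP of each row can be approximated as population times GDP per capita. The column name gdp and the object with_gdp are not part of the original material.

```r
library(gapminder)
library(dplyr)

# New column computed from two existing ones
with_gdp <- gapminder %>%
  mutate(gdp = pop * gdpPercap)

# mutate() returns a new tibble; gapminder itself is left unchanged
ncol(gapminder)
ncol(with_gdp)
```

To keep the new column, the result has to be assigned to an object, as with with_gdp here.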

Filter, mutate, and arrange with gapminder

In this exercise, you will combine the three verbs covered so far to find the countries with the highest life expectancy, in months, in the year 2007.

gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  arrange(desc(lifeExpMonths))

## # A tibble: 142 × 7
##    country          continent  year lifeExp       pop gdpPercap lifeExpMonths
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>         <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.          991.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.          986.
##  3 Iceland          Europe     2007    81.8    301931    36181.          981.
##  4 Switzerland      Europe     2007    81.7   7554661    37506.          980.
##  5 Australia        Oceania    2007    81.2  20434176    34435.          975.
##  6 Spain            Europe     2007    80.9  40448191    28821.          971.
##  7 Sweden           Europe     2007    80.9   9031088    33860.          971.
##  8 Israel           Asia       2007    80.7   6426679    25523.          969.
##  9 France           Europe     2007    80.7  61083916    30470.          968.
## 10 Canada           Americas   2007    80.7  33390141    36319.          968.
## # ℹ 132 more rows

Visualization with GGPLOT2

Create a dataset called gapminder_2007

gapminder_2007 <- gapminder %>%
                            filter(year == 2007)

gapminder_2007

## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ℹ 132 more rows

Traditional (Base R) Visualization

plot(gapminder_2007$gdpPercap, gapminder_2007$lifeExp )

Scatterplot with GGPLOT2

#install.packages("ggplot2")
library(ggplot2)
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point()

Log Scale

ggplot(gapminder_2007, aes(x = gdpPercap,
y = lifeExp)) + geom_point() +scale_x_log10()
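Because a ggplot is an ordinary R object, it can be stored and extended one layer at a time. The sketch below assumes gapminder, dplyr and ggplot2 are installed; base_plot and log_plot are illustrative names.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# Store the basic scatterplot once ...
base_plot <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# ... then add the log scale (or any other layer) to the stored object
log_plot <- base_plot + scale_x_log10()

# Printing the object draws the plot
log_plot
```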

Adding color to the aesthetic

ggplot(gapminder_2007, aes(x = gdpPercap,
y = lifeExp, color = continent)) +
geom_point() + scale_x_log10()

Adding size and color to the aesthetic

ggplot(gapminder_2007, aes(x = gdpPercap,
y = lifeExp, color = continent,
size = pop)) + geom_point() + scale_x_log10()

Faceting

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
scale_x_log10() + facet_wrap(~ continent)

Replicate This Plot

G1 = gapminder
ggplot(G1, aes(x = gdpPercap, y = lifeExp, color = continent, 
               size = pop)) +
geom_point() + scale_x_log10() + facet_wrap(~ year)

The summarize Function

gapminder %>%
summarize(meanLifeExp = mean(lifeExp))

## # A tibble: 1 × 1
##   meanLifeExp
##         <dbl>
## 1        59.5

The summarize Function (several columns at once)

gapminder %>%
filter(year == 2007) %>%
# as.numeric() avoids integer overflow when summing pop
summarize(meanLifeExp = mean(lifeExp),
totalPop = sum(as.numeric(pop)))

## # A tibble: 1 × 2
##   meanLifeExp   totalPop
##         <dbl>      <dbl>
## 1        67.0 6251013179
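The total population is reported as a double rather than an integer. The reason is that pop is stored as an integer column, and the world total is larger than the biggest value an R integer can hold, so sum() on the raw integers would give NA with an overflow warning. A minimal sketch of the check, assuming gapminder and dplyr are installed (pop_2007 and total_pop are illustrative names):

```r
library(gapminder)
library(dplyr)

pop_2007 <- gapminder %>%
  filter(year == 2007) %>%
  pull(pop)

is.integer(pop_2007)   # pop is stored as integer
.Machine$integer.max   # largest value an R integer can hold

# Converting to double before summing avoids the overflow
total_pop <- sum(as.numeric(pop_2007))
total_pop > .Machine$integer.max
```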

Functions You Can Use

mean
sum
median
min
max
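Any of the functions listed above can be used inside summarize(). The following sketch, assuming gapminder and dplyr are installed, computes three of them for 2007; the object name life_2007 and the column names are chosen for illustration.

```r
library(gapminder)
library(dplyr)

# Median, minimum and maximum life expectancy in 2007
life_2007 <- gapminder %>%
  filter(year == 2007) %>%
  summarize(medianLifeExp = median(lifeExp),
            minLifeExp = min(lifeExp),
            maxLifeExp = max(lifeExp))

life_2007
```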

The group_by

Grouped by year

gapminder %>%
    group_by(year) %>%
            summarize(meanLifeExp = mean(lifeExp),
                      totalPop = sum(as.numeric(pop)))

## # A tibble: 12 × 3
##     year meanLifeExp   totalPop
##    <int>       <dbl>      <dbl>
##  1  1952        49.1 2406957150
##  2  1957        51.5 2664404580
##  3  1962        53.6 2899782974
##  4  1967        55.7 3217478384
##  5  1972        57.6 3576977158
##  6  1977        59.6 3930045807
##  7  1982        61.5 4289436840
##  8  1987        63.2 4691477418
##  9  1992        64.2 5110710260
## 10  1997        65.0 5515204472
## 11  2002        65.7 5886977579
## 12  2007        67.0 6251013179

Grouped by continent

gapminder %>%
    group_by(continent) %>%
            summarize(meanLifeExp = mean(lifeExp),
                      totalPop = sum(as.numeric(pop)))

## # A tibble: 5 × 3
##   continent meanLifeExp    totalPop
##   <fct>           <dbl>       <dbl>
## 1 Africa           48.9  6187585961
## 2 Americas         64.7  7351438499
## 3 Asia             60.1 30507333901
## 4 Europe           71.9  6181115304
## 5 Oceania          74.3   212992136

Grouped by year and continent

gapminder %>%
    group_by(year,continent) %>%
            summarize(meanLifeExp = mean(lifeExp),
                      totalPop = sum(as.numeric(pop)))

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

## # A tibble: 60 × 4
## # Groups:   year [12]
##     year continent meanLifeExp   totalPop
##    <int> <fct>           <dbl>      <dbl>
##  1  1952 Africa           39.1  237640501
##  2  1952 Americas         53.3  345152446
##  3  1952 Asia             46.3 1395357351
##  4  1952 Europe           64.4  418120846
##  5  1952 Oceania          69.3   10686006
##  6  1957 Africa           41.3  264837738
##  7  1957 Americas         56.0  386953916
##  8  1957 Asia             49.3 1562780599
##  9  1957 Europe           66.7  437890351
## 10  1957 Oceania          70.3   11941976
## # ℹ 50 more rows
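The message printed with this result appears because summarize() removes only the last grouping level by default, so the output stays grouped by year. One way to get an ungrouped result without the message is the .groups argument. This sketch assumes dplyr 1.0.0 or later; by_year_continent is an illustrative name.

```r
library(gapminder)
library(dplyr)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(as.numeric(pop)),
            .groups = "drop")

is_grouped_df(by_year_continent)
```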

Plotting: The group_by

Grouped results stored in an object: by year

O1= gapminder %>%
      group_by(year) %>%
            summarize(meanLifeExp = mean(lifeExp),
                      totalPop = sum(as.numeric(pop)))

ggplot(O1, aes(x = year, y = totalPop)) +
geom_point()

Grouped results stored in an object: by year and continent

O2= gapminder %>%
      group_by(year,continent) %>%
            summarize(meanLifeExp = mean(lifeExp),
                      totalPop = sum(as.numeric(pop)))

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

ggplot(O2, aes(x = year, y = totalPop, color= continent)) +
geom_point() + expand_limits(y = 0)

Thank You