Data Manipulation and Graphics in R

Rafael Henrique Pertille e Silvia Scariotto

October 11, 2018

Why R?

Practicality and speed (ready-made scripts)
- Routines (reports)
The Big Data era - not a fad, it is reality!
Oracle R Enterprise and Microsoft R Server (greater speed and processing capacity)
A large number of analyses and ways of presenting data
Data wrangling takes 50% to 80% of a data scientist's time
Scientific reproducibility - a trend in the worldwide scientific community.
- GitHub, RPubs
- Code sharing
A very large collaborative community
- Enables methods and tools that are not feasible in other statistical and graphing software

Mention the R-br mailing list

Tidyverse

The tidyverse movement
Production of functional packages with a syntax that is easy to write and read

Data manipulation - Data Wrangling

MS Excel is normally used
- A slow process with large data sets.
R makes this processing easier and allows greater flexibility
- It takes time to learn…
Some packages make data manipulation fast and intuitive.
- dplyr, tidyr
  - lubridate, forcats, stringr, purrr

Tidy Data

Data

Tidy data structure
- Each variable is a column
- Each observation is a row

Article

Examples

Unacceptable structure

Acceptable structure
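As a minimal sketch of the two structures, the example below builds a small, hypothetical temperature table (the `site`, year, and `temp` names and values are invented for illustration) in an untidy layout and reshapes it into a tidy one with `gather()` from tidyr.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy layout: the variable "year" is hidden in the column names
untidy <- tribble(
  ~site, ~`2012`, ~`2013`,
  "A",      21.0,    19.5,
  "B",      17.4,    18.2
)

# Tidy layout: one column per variable (site, year, temp),
# one row per observation (one site in one year)
tidy <- gather(untidy, key = year, value = temp, -site)
tidy
```

In the tidy version every value of `temp` sits in a single column, which is the shape that dplyr and ggplot2 functions expect.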

Packages

dplyr

Group, filter, select, summarize, mutate, join columns, join rows …
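A short sketch of three of these verbs on hypothetical data (the tables `campaign_1`, `campaign_2`, and `cultivars` are invented for illustration). Reading "join columns" as a keyed join with `left_join()` is an assumption; `bind_cols()` would be the other candidate.

```r
library(dplyr)

# Hypothetical yield measurements from two field campaigns
campaign_1 <- tibble(plot = c(1, 2), yield = c(10, 12))
campaign_2 <- tibble(plot = c(3, 4), yield = c(9, 15))
cultivars  <- tibble(plot = c(1, 2, 3, 4), cultivar = c("x", "x", "y", "y"))

result <- bind_rows(campaign_1, campaign_2) %>%   # join rows: stack the two tables
  left_join(cultivars, by = "plot") %>%           # join columns: match on the key `plot`
  mutate(rel_yield = yield / max(yield))          # modify: add a derived variable

result
```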

tidyr

Gather, separate, spread …
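A minimal sketch of `separate()` and `spread()` on a hypothetical table (the `id` and `temp` columns and their values are invented for illustration):

```r
library(tibble)
library(tidyr)

# Hypothetical data: site and year are fused into a single column
obs <- tibble(
  id   = c("A_2012", "A_2013", "B_2012", "B_2013"),
  temp = c(21.0, 19.5, 17.4, 18.2)
)

# separate(): split one column into two at the underscore
long <- separate(obs, id, into = c("site", "year"), sep = "_")

# spread(): one column per year (the inverse of gather())
wide <- spread(long, key = year, value = temp)
wide
```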

magrittr

The “pipe” operator ( %>% )
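A small sketch of what the pipe does, using an arbitrary numeric vector: `x %>% f(y)` is evaluated as `f(x, y)`, so a chain of steps reads left to right instead of inside out. magrittr also provides the assignment pipe `%<>%`, which pipes an object through the chain and writes the result back to it.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same steps with the pipe, read left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)

# Assignment pipe: equivalent to y <- y %>% sqrt()
y <- x
y %<>% sqrt()
y
```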

Examples

Example

library(tidyverse)
library(magrittr)

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

iris %>% group_by(Species) %>% summarize_all(mean) # In RStudio, Ctrl + Shift + M inserts %>%

## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03
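In dplyr 1.0 and later, `summarize_all()` is marked as superseded. As a sketch, the same grouped means can be written with `across()`; this is an equivalent formulation, not the call used in the slide above.

```r
library(dplyr)

# Mean of every non-grouping column, per species
means <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

means
```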

library(tidyverse)
library(magrittr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(nycflights13) # data package

Filter by condition

nycflights13::flights

jan_1 <- flights %>% filter(month == 1, day == 1) # flights on January 1

filter(flights, arr_delay <= 120, dep_delay <= 120)

## # A tibble: 316,050 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 316,040 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

filter(flights, arr_delay > 120 & dep_delay == 0)

## # A tibble: 3 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013    10     7     1350           1350         0     1736
## 2  2013     5    23     1810           1810         0     2208
## 3  2013     7     1      905            905         0     1443
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

filter(flights, month == 7 | month == 8 | month == 9)

## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7     1        1           2029       212      236
##  2  2013     7     1        2           2359         3      344
##  3  2013     7     1       29           2245       104      151
##  4  2013     7     1       43           2130       193      322
##  5  2013     7     1       44           2150       174      300
##  6  2013     7     1       46           2051       235      304
##  7  2013     7     1       48           2001       287      308
##  8  2013     7     1       58           2155       183      335
##  9  2013     7     1      100           2146       194      327
## 10  2013     7     1      100           2245       135      337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
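Chained `|` comparisons on the same column can be shortened. The sketch below shows two alternatives for the July to September filter, `%in%` from base R and `between()` from dplyr, and checks that they keep the same number of rows as the version above.

```r
library(dplyr)
library(nycflights13)

summer_or      <- filter(flights, month == 7 | month == 8 | month == 9)
summer_in      <- filter(flights, month %in% c(7, 8, 9))
summer_between <- filter(flights, between(month, 7, 9))

# All three should select the same flights
nrow(summer_or) == nrow(summer_in)
nrow(summer_or) == nrow(summer_between)
```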

Sort by variables

arrange(flights, year, month, day) # sorts by year, month, and day.

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

arrange(flights, carrier) # sorts by carrier.

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      810            810         0     1048
##  2  2013     1     1     1451           1500        -9     1634
##  3  2013     1     1     1452           1455        -3     1637
##  4  2013     1     1     1454           1500        -6     1635
##  5  2013     1     1     1507           1515        -8     1651
##  6  2013     1     1     1530           1530         0     1650
##  7  2013     1     1     1546           1540         6     1753
##  8  2013     1     1     1550           1550         0     1844
##  9  2013     1     1     1552           1600        -8     1749
## 10  2013     1     1     1554           1600        -6     1701
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

arrange(flights, desc(arr_delay)) # use 'desc' to sort in descending order.

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     7    22     2257            759       898      121
##  9  2013    12     5      756           1700       896     1058
## 10  2013     5     3     1133           2055       878     1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
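`arrange()` accepts several sort keys, and `desc()` can be applied to any of them. As a sketch, the call below sorts by carrier and, within each carrier, by departure delay from largest to smallest; the choice of `dep_delay` as the second key is only an illustration.

```r
library(dplyr)
library(nycflights13)

# First key: carrier (ascending); second key: departure delay (descending)
by_carrier <- arrange(flights, carrier, desc(dep_delay))

by_carrier %>% select(carrier, dep_delay) %>% head()
```

Rows with a missing `dep_delay` are placed at the end of each carrier's block.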

Select variables

flights %>% select(year, month, day) # selects only the variables year, month, and day.

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
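`select()` also understands column ranges, negation, and helper functions. A few variations on the call above, as a sketch:

```r
library(dplyr)
library(nycflights13)

date_cols <- flights %>% select(year:day)               # range of adjacent columns
no_date   <- flights %>% select(-(year:day))            # everything except year, month, day
delays    <- flights %>% select(ends_with("delay"))     # dep_delay and arr_delay
reordered <- flights %>% select(carrier, everything())  # move carrier to the front

names(delays)
```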

Climate data for building a plot

# Note: read_table2() is deprecated as of readr 2.0.0 in favour of read_table()
c2012 <- read_table2("2012.txt", col_names = TRUE)
c2013 <- read_table2("2013.txt", col_names = TRUE)
c2014 <- read_table2("2014.txt", col_names = TRUE)

Do the same for the other years
Inspect the data structure

str(c2012)

## Classes 'tbl_df', 'tbl' and 'data.frame':    19 obs. of  5 variables:
##  $ ano  : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ data : chr  "3/15/2012" "4/5/2012" "4/26/2012" "5/17/2012" ...
##  $ max  : num  29.9 27 26.7 22.1 23.4 ...
##  $ media: num  21 17.4 17.1 13 14.2 ...
##  $ min  : num  14.05 4.15 9 0.025 8.55 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 5
##   .. ..$ ano  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ data : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ max  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ media: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ min  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

dados <- rbind(c2012,c2013,c2014)
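# Parse the dates, drop the year, and re-parse: as.Date() fills in the current
# year (2018 in the output below), presumably so all years share one time axis
dados$data %<>% mdy() %>% format('%m-%d') %>% as.Date('%m-%d')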

str(dados)

## Classes 'tbl_df', 'tbl' and 'data.frame':    61 obs. of  5 variables:
##  $ ano  : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ data : Date, format: "2018-03-15" "2018-04-05" ...
##  $ max  : num  29.9 27 26.7 22.1 23.4 ...
##  $ media: num  21 17.4 17.1 13 14.2 ...
##  $ min  : num  14.05 4.15 9 0.025 8.55 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 5
##   .. ..$ ano  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ data : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ max  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ media: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ min  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
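The date conversion above can be followed step by step on a few hypothetical date strings in the same month/day/year format. When the format string passed to `as.Date()` omits the year, R fills in the current one, which is why the `str()` output shows 2018. One caveat: a date such as "02-29" may become `NA` when the current year is not a leap year. Setting the year explicitly with lubridate's `year<-` is sketched as an alternative; the value 2018 is only an illustration.

```r
library(lubridate)

raw <- c("3/15/2012", "4/5/2013", "4/26/2014")  # hypothetical input strings

step1 <- mdy(raw)                 # Date, original years kept
step2 <- format(step1, "%m-%d")   # character, month and day only
step3 <- as.Date(step2, "%m-%d")  # Date again, current year filled in

# Alternative: keep the Date class and overwrite the year explicitly
fixed <- step1
year(fixed) <- 2018
fixed
```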

dt <- dados %>% gather(key = variavel, value = temperatura, max,media,min)
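Since the original `.txt` files are not available here, the sketch below uses a small, hypothetical stand-in for `dados` (values invented) to show the long format and one possible plot of it. It also uses `pivot_longer()`, the tidyr 1.0+ counterpart of `gather()`; the result has the same content, although the rows come out in a different order. The plot layout (one line per variable, one panel per year) is an assumption, not the figure from the original slides.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for `dados`, with the same column names and types
dados_demo <- data.frame(
  ano   = c(2012L, 2012L, 2013L, 2013L),
  data  = as.Date(c("2018-03-15", "2018-04-05", "2018-03-15", "2018-04-05")),
  max   = c(29.9, 27.0, 28.1, 25.3),
  media = c(21.0, 17.4, 20.2, 16.8),
  min   = c(14.1, 4.2, 12.5, 8.0)
)

# Long format: one row per date and temperature variable
dt_demo <- pivot_longer(dados_demo, cols = c(max, media, min),
                        names_to = "variavel", values_to = "temperatura")

# One line per variable, one panel per year
p <- ggplot(dt_demo, aes(x = data, y = temperatura, colour = variavel)) +
  geom_line() +
  facet_wrap(~ ano)
p
```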

Packages for graphics

ggplot2 *
lattice
plot (base)

Data visualization

Descriptive plots - Exploration

library(ggplot2)
# Species is already a factor, so no conversion is needed
ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot()

ggplot(iris) +
  geom_histogram(aes(Sepal.Width))

ggplot(diamonds, aes(carat, fill = cut)) +
  geom_density(position = "stack")

Scatter plots, line plots, etc…

### scatter plot
ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cty, shape = class))

## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.

## Warning: Removed 62 rows containing missing values (geom_point).
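The first warning appears because `class` has 7 levels while the default shape palette provides only 6, so the points of the level left without a shape are dropped, which is presumably what the second warning reports. One way to address this, as the warning suggests, is to assign shapes manually with `scale_shape_manual()`. The shape codes 0 to 6 below are an arbitrary choice.

```r
library(ggplot2)

# One explicit shape code per level of mpg$class (7 levels)
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cty, shape = class)) +
  scale_shape_manual(values = c(0, 1, 2, 3, 4, 5, 6))
p
```

Mapping `class` to `colour` instead of `shape` is another option when there are many levels.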