Rafael Henrique Pertille and Silvia Scariotto
October 11, 2018
Practicality and speed (ready-made scripts)
The Big Data era: not a fad, but reality!
Oracle R Enterprise and Microsoft R Server (greater speed and processing capacity)
A wide range of analyses and ways of presenting data
Data wrangling: takes up 50% to 80% of a data scientist's time
Scientific reproducibility: a trend in the scientific community worldwide.
The tidyverse movement
Packages that are functional and whose syntax is easy to write and read
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
library(dplyr)
iris %>% group_by(Species) %>% summarize_all(mean) # In RStudio, Ctrl + Shift + M inserts the %>% pipe
## # A tibble: 3 x 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
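Since dplyr 1.0.0, the scoped verb `summarize_all()` has been superseded by `across()`. A minimal sketch of the same grouped means in that style, assuming dplyr 1.0.0 or later is installed (the name `iris_means` is illustrative only):

```r
library(dplyr)

# Mean of every non-grouping column, per species.
# across(everything(), ...) skips the grouping column automatically.
iris_means <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

iris_means
```

Both forms should give the same three-row summary; `across()` simply makes the column selection explicit.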
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## # A tibble: 316,050 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 316,040 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
## # A tibble: 3 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 10 7 1350 1350 0 1736
## 2 2013 5 23 1810 1810 0 2208
## 3 2013 7 1 905 905 0 1443
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 1 1 2029 212 236
## 2 2013 7 1 2 2359 3 344
## 3 2013 7 1 29 2245 104 151
## 4 2013 7 1 43 2130 193 322
## 5 2013 7 1 44 2150 174 300
## 6 2013 7 1 46 2051 235 304
## 7 2013 7 1 48 2001 287 308
## 8 2013 7 1 58 2155 183 335
## 9 2013 7 1 100 2146 194 327
## 10 2013 7 1 100 2245 135 337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
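The code that produced the table above is not shown. As a sketch only, a `filter()` call on the `flights` table from the nycflights13 package that keeps July through September returns rows of this shape. The condition and the name `summer_flights` are assumptions, not the authors' code:

```r
library(dplyr)
library(nycflights13)

# Keep only flights whose month is July, August, or September.
summer_flights <- flights %>%
  filter(month %in% 7:9)

summer_flights
```

`filter()` keeps the rows for which the condition is TRUE and leaves the columns untouched.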
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 810 810 0 1048
## 2 2013 1 1 1451 1500 -9 1634
## 3 2013 1 1 1452 1455 -3 1637
## 4 2013 1 1 1454 1500 -6 1635
## 5 2013 1 1 1507 1515 -8 1651
## 6 2013 1 1 1530 1530 0 1650
## 7 2013 1 1 1546 1540 6 1753
## 8 2013 1 1 1550 1550 0 1844
## 9 2013 1 1 1552 1600 -8 1749
## 10 2013 1 1 1554 1600 -6 1701
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 7 22 2257 759 898 121
## 9 2013 12 5 756 1700 896 1058
## 10 2013 5 3 1133 2055 878 1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
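The departure delays in the table above decrease from row to row, which is the result of sorting with `arrange()` and `desc()`. A minimal sketch, assuming the `flights` table from the nycflights13 package (the name `most_delayed` is illustrative only):

```r
library(dplyr)
library(nycflights13)

# Sort by departure delay, largest first; rows with a missing delay go last.
most_delayed <- flights %>%
  arrange(desc(dep_delay))

head(most_delayed, 10)
```

Sorting does not drop any rows, which matches the unchanged row count in the output.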
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
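A three-column table like the one above is what `select()` returns when given column names. A minimal sketch, again assuming the `flights` table from the nycflights13 package (the name `flight_dates` is illustrative only):

```r
library(dplyr)
library(nycflights13)

# Keep only the date columns; every row is retained.
flight_dates <- flights %>%
  select(year, month, day)

flight_dates
```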
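library(readr)
# read_table2() is deprecated as of readr 2.0.0, where read_table() replaces it
c2012 <- read_table2("2012.txt", col_names = TRUE)
c2013 <- read_table2("2013.txt", col_names = TRUE)
c2014 <- read_table2("2014.txt", col_names = TRUE)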
Do the same for the remaining years
Inspect the data structure
## Classes 'tbl_df', 'tbl' and 'data.frame': 19 obs. of 5 variables:
## $ ano : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ data : chr "3/15/2012" "4/5/2012" "4/26/2012" "5/17/2012" ...
## $ max : num 29.9 27 26.7 22.1 23.4 ...
## $ media: num 21 17.4 17.1 13 14.2 ...
## $ min : num 14.05 4.15 9 0.025 8.55 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ ano : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ data : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ max : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ media: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ min : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
## Classes 'tbl_df', 'tbl' and 'data.frame': 61 obs. of 5 variables:
## $ ano : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ data : Date, format: "2018-03-15" "2018-04-05" ...
## $ max : num 29.9 27 26.7 22.1 23.4 ...
## $ media: num 21 17.4 17.1 13 14.2 ...
## $ min : num 14.05 4.15 9 0.025 8.55 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ ano : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ data : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ max : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ media: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ min : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
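The second structure above has 61 rows and a `data` column of class Date, which is consistent with stacking the yearly tables and parsing the date strings. The authors' code for this step is not shown. The sketch below is one way to do it, assuming `bind_rows()` from dplyr and `mdy()` from lubridate. The object name `clima` and the 2013 rows are placeholders; the 2012 rows reuse values printed above.

```r
library(dplyr)
library(lubridate)

# Stand-ins for the tables read from 2012.txt and 2013.txt.
# The 2013 values are made-up placeholders for illustration only.
c2012 <- tibble(
  ano = c(2012L, 2012L),
  data = c("3/15/2012", "4/5/2012"),
  max = c(29.9, 27.0), media = c(21.0, 17.4), min = c(14.05, 4.15)
)
c2013 <- tibble(
  ano = c(2013L, 2013L),
  data = c("3/14/2013", "4/4/2013"),
  max = c(28.0, 26.0), media = c(20.0, 18.0), min = c(12.0, 10.0)
)

# Stack the yearly tables, then parse month/day/year strings into Date values.
clima <- bind_rows(c2012, c2013) %>%
  mutate(data = mdy(data))

str(clima)
```

Parsing the full month/day/year string keeps each observation's own year in the resulting date.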
### Scatter plot
library(ggplot2)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cty, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
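The warnings above arise because the default shape palette covers only six values while `class` has seven, so one class is dropped from the plot. As the first warning suggests, the shapes can be supplied manually. A minimal sketch using `scale_shape_manual()`; the shape codes chosen here are arbitrary and the names `class_shapes` and `p` are illustrative only:

```r
library(ggplot2)

# One shape code per vehicle class (mpg has 7 classes); the codes are arbitrary.
class_shapes <- c(0, 1, 2, 3, 4, 5, 6)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cty, shape = class)) +
  scale_shape_manual(values = class_shapes)

p
```

With seven shapes available, every class gets a point symbol and no rows are removed. Mapping `class` to `colour` instead of `shape` is another common choice when there are many categories.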