yes

A Unidade 3 é estruturada considerando os seguintes tópicos: - Manipulação e transformação de dados no R: funções para manipulação e tratamento de dados. - Estatística descritiva e mineração de dados

Diretrizes gerais:

  1. Baixe o arquivo .Rmd e abra no RStudio.

Arquivo

  1. Siga as diretrizes da atividade.

  2. Rode o arquivo .Rmd por meio do ícone knitr

  3. Salve o .Rmd e submeta-o por meio da tarefa no Sigaa.


Manipulação de dados

Exercício 1.

Carregue os dados no seu ambiente.

#install.packages("tidyverse")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.1.0     v forcats 0.5.1
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(knitr)


```r
# From TidyTuesday: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md
#install.packages("readr")
library(readr)
hotels <- read_csv(file = "C:\\Users\\Fernando\\Downloads\\hotels.csv")

Exercício 2.

As pessoas estão viajando por um capricho? Vamos ver…

Preencha os espaços em branco para filtragem de reservas de hotel onde o hóspede é não dos EUA (código do país USA) e o lead_time é menos de 1 dia.

Nota: Você precisará definir eval=TRUE quando tiver uma resposta que queira experimentar.

# on the fly
hotels %>%
  filter(
    country != "USA", 
    lead_time < 1
    )
## # A tibble: 6,174 x 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl>             <dbl> <chr>             
##  1 Resort Hotel           0         0              2015 July              
##  2 Resort Hotel           0         0              2015 July              
##  3 Resort Hotel           0         0              2015 July              
##  4 Resort Hotel           0         0              2015 July              
##  5 Resort Hotel           0         0              2015 July              
##  6 Resort Hotel           0         0              2015 July              
##  7 Resort Hotel           0         0              2015 July              
##  8 Resort Hotel           0         0              2015 July              
##  9 Resort Hotel           0         0              2015 July              
## 10 Resort Hotel           0         0              2015 July              
## # ... with 6,164 more rows, and 27 more variables:
## #   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, ...
# Com registro de objeto

hotel <- hotels %>% 
  filter(country != "USA", lead_time < 1) %>% 
  select(country, lead_time) 

# Sem pipe

hotels_sem_pipe <- filter(hotels, country != "USA", lead_time < 1)
hotels_sem_pipe <- select(hotels_sem_pipe, country, lead_time)

Exercício 3.

Quantas marcações envolvem pelo menos 1 criança ou bebê?

No seguinte chunk, substitua

  • [AT LEAST] com o operador lógico para “pelo menos” (em dois lugares)
  • [OR] com o operador lógico para “ou”

Nota: Você precisará definir eval=TRUE quando tiver uma resposta que queira experimentar.

hotels %>%
  filter(children >= 1 | babies >= 1)
## # A tibble: 9,332 x 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl>             <dbl> <chr>             
##  1 Resort Hotel           0        18              2015 July              
##  2 Resort Hotel           1        47              2015 July              
##  3 Resort Hotel           0         1              2015 July              
##  4 Resort Hotel           0        10              2015 July              
##  5 Resort Hotel           1        79              2015 July              
##  6 Resort Hotel           0       101              2015 July              
##  7 Resort Hotel           0        92              2015 July              
##  8 Resort Hotel           1        26              2015 July              
##  9 Resort Hotel           0       102              2015 July              
## 10 Resort Hotel           0        78              2015 July              
## # ... with 9,322 more rows, and 27 more variables:
## #   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, ...
# Com registro de objeto
hotel_bebe_chil <- hotels %>%
  filter(children >= 1 | babies >= 1)

# Sem pipe

hotel_bebe_chil_sem_pipe <- filter(hotels, children >= 1 | babies >= 1)

Exercício 4.

Você acha que é mais provável encontrar reservas com crianças ou bebês em hotéis urbanos ou resorts hoteleiros? Teste sua intuição.

Usando filter() determine o número de reservas em hotéis resort que têm mais de 1 criança ou bebê no quarto?

Então, faça o mesmo para hotéis urbanos, e compare o número de linhas no dataframe filtrado resultantes.

# Com registro de objeto
 hotel_bebe_chil_resort <- hotels %>%
  filter(children >= 1 | babies >= 1) %>% 
  filter(hotel == "Resort Hotel")

# Sem pipe

hotel_bebe_chil_sem_pipe <- filter(hotels, children >= 1 | babies >= 1)
hotel_bebe_chil_resort_sem_pipe <- filter(hotel_bebe_chil_resort, hotel == "Resort Hotel")
hotel_bebe_chil_resort_sem_pipe
## # A tibble: 3,929 x 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl>             <dbl> <chr>             
##  1 Resort Hotel           0        18              2015 July              
##  2 Resort Hotel           1        47              2015 July              
##  3 Resort Hotel           0         1              2015 July              
##  4 Resort Hotel           0        10              2015 July              
##  5 Resort Hotel           1        79              2015 July              
##  6 Resort Hotel           0       101              2015 July              
##  7 Resort Hotel           0        92              2015 July              
##  8 Resort Hotel           1        26              2015 July              
##  9 Resort Hotel           0       102              2015 July              
## 10 Resort Hotel           0        78              2015 July              
## # ... with 3,919 more rows, and 27 more variables:
## #   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, ...
# Com registro de objeto
hotel_bebe_chil_city <- hotels %>%
  filter(children >= 1 | babies >= 1) %>% 
  mutate(hotel = tolower(hotel)) %>% 
  filter(hotel == "city hotel")

# Sem pipe

hotel_bebe_chil_sem_pipe <- filter(hotels, children >= 1 | babies >= 1)

hotel_bebe_chil_city_sem_pipe <- filter(hotel_bebe_chil_sem_pipe, hotel == "City Hotel")
hotel_bebe_chil_city_sem_pipe
## # A tibble: 5,403 x 32
##    hotel      is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>            <dbl>     <dbl>             <dbl> <chr>             
##  1 City Hotel           1       100              2015 July              
##  2 City Hotel           0        69              2015 July              
##  3 City Hotel           0        67              2015 July              
##  4 City Hotel           1        60              2015 July              
##  5 City Hotel           0         0              2015 July              
##  6 City Hotel           0         0              2015 July              
##  7 City Hotel           0         0              2015 July              
##  8 City Hotel           0        84              2015 July              
##  9 City Hotel           0        74              2015 July              
## 10 City Hotel           0        75              2015 July              
## # ... with 5,393 more rows, and 27 more variables:
## #   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, ...
hotel_bebe_chil_class <- hotels %>% 
  filter(children >= 1 | babies >= 1) %>% 
  group_by(hotel) %>% 
  summarise(n = max(stays_in_weekend_nights))
hotel_bebe_chil_class
## # A tibble: 2 x 2
##   hotel            n
##   <chr>        <dbl>
## 1 City Hotel       8
## 2 Resort Hotel     8
list(hotel_bebe_chil_class)
## [[1]]
## # A tibble: 2 x 2
##   hotel            n
##   <chr>        <dbl>
## 1 City Hotel       8
## 2 Resort Hotel     8
hotels %>%
  filter(children >= 1 | babies >= 1) %>%    
  group_by(country, hotel) %>% 
  summarise(tempomedio = mean(lead_time), reservas = n(), bebes = mean(babies), criancas = mean(children)) %>%
ggplot(aes(x= country, y= tempomedio, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

hotels
## # A tibble: 119,390 x 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl>             <dbl> <chr>             
##  1 Resort Hotel           0       342              2015 July              
##  2 Resort Hotel           0       737              2015 July              
##  3 Resort Hotel           0         7              2015 July              
##  4 Resort Hotel           0        13              2015 July              
##  5 Resort Hotel           0        14              2015 July              
##  6 Resort Hotel           0        14              2015 July              
##  7 Resort Hotel           0         0              2015 July              
##  8 Resort Hotel           0         9              2015 July              
##  9 Resort Hotel           1        85              2015 July              
## 10 Resort Hotel           1        75              2015 July              
## # ... with 119,380 more rows, and 27 more variables:
## #   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, ...

Exercício 5.

Criar uma tabela de freqüência do número de adults em uma reserva.

Mostre os resultados em ordem decrescente para que a observação mais comum esteja no topo.

Qual é o número mais comum de adultos em reservas neste conjunto de dados?

Há algum resultado surpreendente?

Nota: Não esqueça de rotular também seu chunk R (onde diz lable-me-1). Seu rótulo deve ser curto, informativo, e não deve incluir espaços. Também não deve repetir uma etiqueta anterior, caso contrário o R Markdown lhe dará um erro sobre a repetição de etiquetas R em pedaços.

str(hotels)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>
adultos <- hotels %>%
  mutate(adults = as.factor (adults), children = as.factor (children))
table(adultos$adults)
## 
##     0     1     2     3     4     5     6    10    20    26    27    40    50 
##   403 23027 89680  6202    62     2     1     1     2     5     2     1     1 
##    55 
##     1
table(hotels$adults, hotels$children)
##     
##          0     1     2     3    10
##   0    180     4   208    11     0
##   1  22587   279   157     4     0
##   2  82278  4089  3248    61     1
##   3   5675   487    39     0     0
##   4     60     2     0     0     0
##   5      2     0     0     0     0
##   6      1     0     0     0     0
##   10     1     0     0     0     0
##   20     2     0     0     0     0
##   26     5     0     0     0     0
##   27     2     0     0     0     0
##   40     1     0     0     0     0
##   50     1     0     0     0     0
##   55     1     0     0     0     0
hist(hotels$adults)

library(gmodels)
tab1 <- CrossTable(hotels$adults, hotels$children, chisq=TRUE)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation may be
## incorrect
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  119386 
## 
##  
##               | hotels$children 
## hotels$adults |         0 |         1 |         2 |         3 |        10 | Row Total | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             0 |       180 |         4 |       208 |        11 |         0 |       403 | 
##               |   100.634 |     9.384 |  3105.820 |   449.907 |     0.003 |           | 
##               |     0.447 |     0.010 |     0.516 |     0.027 |     0.000 |     0.003 | 
##               |     0.002 |     0.001 |     0.057 |     0.145 |     0.000 |           | 
##               |     0.002 |     0.000 |     0.002 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             1 |     22587 |       279 |       157 |         4 |         0 |     23027 | 
##               |    69.287 |   462.606 |   425.386 |     7.750 |     0.193 |           | 
##               |     0.981 |     0.012 |     0.007 |     0.000 |     0.000 |     0.193 | 
##               |     0.204 |     0.057 |     0.043 |     0.053 |     0.000 |           | 
##               |     0.189 |     0.002 |     0.001 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             2 |     82278 |      4089 |      3248 |        61 |         1 |     89677 | 
##               |    10.767 |    52.457 |    92.890 |     0.268 |     0.082 |           | 
##               |     0.917 |     0.046 |     0.036 |     0.001 |     0.000 |     0.751 | 
##               |     0.743 |     0.841 |     0.889 |     0.803 |     1.000 |           | 
##               |     0.689 |     0.034 |     0.027 |     0.001 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             3 |      5675 |       487 |        39 |         0 |         0 |      6201 | 
##               |     1.107 |   217.827 |   119.706 |     3.947 |     0.052 |           | 
##               |     0.915 |     0.079 |     0.006 |     0.000 |     0.000 |     0.052 | 
##               |     0.051 |     0.100 |     0.011 |     0.000 |     0.000 |           | 
##               |     0.048 |     0.004 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             4 |        60 |         2 |         0 |         0 |         0 |        62 | 
##               |     0.105 |     0.109 |     1.897 |     0.039 |     0.001 |           | 
##               |     0.968 |     0.032 |     0.000 |     0.000 |     0.000 |     0.001 | 
##               |     0.001 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.001 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             5 |         2 |         0 |         0 |         0 |         0 |         2 | 
##               |     0.011 |     0.081 |     0.061 |     0.001 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             6 |         1 |         0 |         0 |         0 |         0 |         1 | 
##               |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            10 |         1 |         0 |         0 |         0 |         0 |         1 | 
##               |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            20 |         2 |         0 |         0 |         0 |         0 |         2 | 
##               |     0.011 |     0.081 |     0.061 |     0.001 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            26 |         5 |         0 |         0 |         0 |         0 |         5 | 
##               |     0.028 |     0.204 |     0.153 |     0.003 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            27 |         2 |         0 |         0 |         0 |         0 |         2 | 
##               |     0.011 |     0.081 |     0.061 |     0.001 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            40 |         1 |         0 |         0 |         0 |         0 |         1 | 
##               |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            50 |         1 |         0 |         0 |         0 |         0 |         1 | 
##               |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            55 |         1 |         0 |         0 |         0 |         0 |         1 | 
##               |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##  Column Total |    110796 |      4861 |      3652 |        76 |         1 |    119386 | 
##               |     0.928 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  5133.464     d.f. =  52     p =  0 
## 
## 
## 
tab2 <- CrossTable(adultos$adults, adultos$children)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  119386 
## 
##  
##                | adultos$children 
## adultos$adults |         0 |         1 |         2 |         3 |        10 | Row Total | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##              0 |       180 |         4 |       208 |        11 |         0 |       403 | 
##                |   100.634 |     9.384 |  3105.820 |   449.907 |     0.003 |           | 
##                |     0.447 |     0.010 |     0.516 |     0.027 |     0.000 |     0.003 | 
##                |     0.002 |     0.001 |     0.057 |     0.145 |     0.000 |           | 
##                |     0.002 |     0.000 |     0.002 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##              1 |     22587 |       279 |       157 |         4 |         0 |     23027 | 
##                |    69.287 |   462.606 |   425.386 |     7.750 |     0.193 |           | 
##                |     0.981 |     0.012 |     0.007 |     0.000 |     0.000 |     0.193 | 
##                |     0.204 |     0.057 |     0.043 |     0.053 |     0.000 |           | 
##                |     0.189 |     0.002 |     0.001 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##              2 |     82278 |      4089 |      3248 |        61 |         1 |     89677 | 
##                |    10.767 |    52.457 |    92.890 |     0.268 |     0.082 |           | 
##                |     0.917 |     0.046 |     0.036 |     0.001 |     0.000 |     0.751 | 
##                |     0.743 |     0.841 |     0.889 |     0.803 |     1.000 |           | 
##                |     0.689 |     0.034 |     0.027 |     0.001 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##              3 |      5675 |       487 |        39 |         0 |         0 |      6201 | 
##                |     1.107 |   217.827 |   119.706 |     3.947 |     0.052 |           | 
##                |     0.915 |     0.079 |     0.006 |     0.000 |     0.000 |     0.052 | 
##                |     0.051 |     0.100 |     0.011 |     0.000 |     0.000 |           | 
##                |     0.048 |     0.004 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##              4 |        60 |         2 |         0 |         0 |         0 |        62 | 
##                |     0.105 |     0.109 |     1.897 |     0.039 |     0.001 |           | 
##                |     0.968 |     0.032 |     0.000 |     0.000 |     0.000 |     0.001 | 
##                |     0.001 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.001 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##              5 |         2 |         0 |         0 |         0 |         0 |         2 | 
##                |     0.011 |     0.081 |     0.061 |     0.001 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##              6 |         1 |         0 |         0 |         0 |         0 |         1 | 
##                |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             10 |         1 |         0 |         0 |         0 |         0 |         1 | 
##                |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             20 |         2 |         0 |         0 |         0 |         0 |         2 | 
##                |     0.011 |     0.081 |     0.061 |     0.001 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             26 |         5 |         0 |         0 |         0 |         0 |         5 | 
##                |     0.028 |     0.204 |     0.153 |     0.003 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             27 |         2 |         0 |         0 |         0 |         0 |         2 | 
##                |     0.011 |     0.081 |     0.061 |     0.001 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             40 |         1 |         0 |         0 |         0 |         0 |         1 | 
##                |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             50 |         1 |         0 |         0 |         0 |         0 |         1 | 
##                |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##             55 |         1 |         0 |         0 |         0 |         0 |         1 | 
##                |     0.006 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
##                |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##   Column Total |    110796 |      4861 |      3652 |        76 |         1 |    119386 | 
##                |     0.928 |     0.041 |     0.031 |     0.001 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
qplot(hotels$adults,
      geom="histogram",
      binwidth=5,  
      main="Histograma por adultos", 
      xlab="Adultos", 
      fill=I("blue"), 
      col=I("red"))

Exercício 6.

Repita o exercício 5, uma vez para reservas canceladas (is_canceled codificado como 1) e uma vez para reservas não canceladas (is_canceled codificado como 0).

O que isto revela sobre os resultados surpreendentes que você observou no exercício anterior?

Note: Não se esqueça de rotular também seu chunk de R (onde diz label-me-2).

hist(hotels$is_canceled,
 main="Histograma por cancelamentos")

Exercício 7.

Calcular a tarifa mínima, média, mediana e máxima média diária (adr) agrupados por tipo de hotel para que você possa obter estas estatísticas separadamente para hotéis de resorts e cidades.

Que tipo de hotel é mais caro, em média? RESORTS

hotels %>%
  group_by(hotel) %>% 
  summarise(mediaadr = mean(adr)) %>%
ggplot(aes(x= mediaadr, y= hotel, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()

hotels %>%
  group_by(hotel) %>% 
  summarise(maxadr = max(adr)) %>%
ggplot(aes(x= maxadr, y= hotel, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()

hotels %>%
  group_by(hotel) %>% 
  summarise(minadr = min(adr)) %>%
ggplot(aes(x= minadr, y= hotel, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()

hotels %>%
  group_by(hotel) %>% 
  summarise(medianaadr = median(adr)) %>%
ggplot(aes(x= medianaadr, y= hotel, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()

Exercício 8.

Observamos dois valores incomuns nas estatísticas resumidas acima – um mínimo negativo, e um máximo muito alto). Que tipos de hotéis são estes?

Localize estas observações no conjunto de dados e descubra a data de chegada (ano e mês), assim como quantas pessoas (adultos, crianças e bebês) permaneceram no quarto.

Você pode investigar os dados no espectador para localizar estes valores, mas de preferência você deve identificá-los de forma reprodutível com algum código.

Dica: Por exemplo, você pode filter para o dado quantidade adr e select as colunas relevantes.

filter(hotels, hotels$adr<=-6)
## # A tibble: 1 x 32
##   hotel        is_canceled lead_time arrival_date_ye~ arrival_date_mo~ arrival_date_we~
##   <chr>              <dbl>     <dbl>            <dbl> <chr>                       <dbl>
## 1 Resort Hotel           0       195             2017 March                          10
## # ... with 26 more variables: arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, ...
filter(hotels, hotels$adr>=5000)
## # A tibble: 1 x 32
##   hotel      is_canceled lead_time arrival_date_ye~ arrival_date_mo~ arrival_date_we~
##   <chr>            <dbl>     <dbl>            <dbl> <chr>                       <dbl>
## 1 City Hotel           1        35             2016 March                          13
## # ... with 26 more variables: arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, ...

Dicionário de dados

Abaixo está o dicionário de dados completo. Note que é longo (há muitas variáveis nos dados), mas utilizamos um conjunto limitado de variáveis para nossa análise.

variable class description
hotel character Hotel (H1 = Resort Hotel or H2 = City Hotel)
is_canceled double Value indicating if the booking was canceled (1) or not (0)
lead_time double Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year double Year of arrival date
arrival_date_month character Month of arrival date
arrival_date_week_number double Week number of year for arrival date
arrival_date_day_of_month double Day of arrival date
stays_in_weekend_nights double Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights double Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults double Number of adults
children double Number of children
babies double Number of babies
meal character Type of meal booked. Categories are presented in standard hospitality meal packages:
Undefined/SC – no meal package;
BB – Bed & Breakfast;
HB – Half board (breakfast and one other meal – usually dinner);
FB – Full board (breakfast, lunch and dinner)
country character Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment character Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel character Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest double Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations double Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled double Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type character Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type character Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes double Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type character Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:
No Deposit – no deposit was made;
Non Refund – a deposit was made in the value of the total stay cost;
Refundable – a deposit was made with a value under the total cost of stay.
agent character ID of the travel agency that made the booking
company character ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list double Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type character Type of booking, assuming one of four categories:
Contract - when the booking has an allotment or other type of contract associated to it;
Group – when the booking is associated to a group;
Transient – when the booking is not part of a group or contract, and is not associated to other transient booking;
Transient-party – when the booking is transient, but is associated to at least other transient booking
adr double Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces double Number of car parking spaces required by the customer
total_of_special_requests double Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status character Reservation last status, assuming one of three categories:
Canceled – booking was canceled by the customer;
Check-Out – customer has checked in but already departed;
No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date double Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

Junção de dataframes

Exercício 1.

Carregar arquivos

  1. Façam o download do arquivo .zip.
  2. Removam os arquivos de dados do arquivo .zip.
  3. Carregue os arquivos de dados usando a função read_csv e atribuindo cada conjunto de dados a um objeto

Exercício 2.

Manipulação de dados - mulheres cientistas

  1. Faça a junção dos dataframes

  2. Ordene por ano de nascimento

  3. Filtre apenas as linhas que têm informações sobre ano de nascimento e ano de morte

  4. Encontre as pesquisadoras que estão vivas

  5. Gere um dataframe considerando apenas a coluna com o nome das cientistas.

Exercício 3.

Manipulação de dados - matrículas

  1. Gere um dataframe com os alunos matriculados e que responderam estar frequentando as aulas.

  2. Gere um dataframe com os alunos matriculados e que responderam não estar frequentando as aulas.

  3. Gere um dataframe com os alunos que estão frequentando as aulas e que não estão matriculados.

Exercício 4.

Manipulação de dados - compras

  1. Gere um dataframe com a síntese da soma da receita da loja por consumidor