A Unidade 3 é estruturada considerando os seguintes tópicos: - Manipulação e transformação de dados no R: funções para manipulação e tratamento de dados. - Estatística descritiva e mineração de dados
Siga as diretrizes da atividade.
Rode o arquivo .Rmd por meio do ícone knitr
Salve o .Rmd e submeta-o por meio da tarefa no Sigaa.
Carregue os dados no seu ambiente.
#install.packages("tidyverse")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.1.0 v forcats 0.5.1
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(knitr)
```r
# From TidyTuesday: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md
#install.packages("readr")
library(readr)
hotels <- read_csv(file = "C:\\Users\\Fernando\\Downloads\\hotels.csv")
As pessoas estão viajando por um capricho? Vamos ver…
Preencha os espaços em branco para filtragem de reservas de hotel onde o hóspede é não dos EUA (código do país USA) e o lead_time é menos de 1 dia.
Nota: Você precisará definir eval=TRUE quando tiver uma resposta que queira experimentar.
# on the fly
hotels %>%
filter(
country != "USA",
lead_time < 1
)
## # A tibble: 6,174 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 0 2015 July
## 2 Resort Hotel 0 0 2015 July
## 3 Resort Hotel 0 0 2015 July
## 4 Resort Hotel 0 0 2015 July
## 5 Resort Hotel 0 0 2015 July
## 6 Resort Hotel 0 0 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 0 2015 July
## 9 Resort Hotel 0 0 2015 July
## 10 Resort Hotel 0 0 2015 July
## # ... with 6,164 more rows, and 27 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
# Com registro de objeto
hotel <- hotels %>%
filter(country != "USA", lead_time < 1) %>%
select(country, lead_time)
# Sem pipe
hotels_sem_pipe <- filter(hotels, country != "USA", lead_time < 1)
hotels_sem_pipe <- select(hotels_sem_pipe, country, lead_time)
Quantas marcações envolvem pelo menos 1 criança ou bebê?
No seguinte chunk, substitua
[AT LEAST] com o operador lógico para “pelo menos” (em dois lugares)[OR] com o operador lógico para “ou”Nota: Você precisará definir eval=TRUE quando tiver uma resposta que queira experimentar.
hotels %>%
filter(children >= 1 | babies >= 1)
## # A tibble: 9,332 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 18 2015 July
## 2 Resort Hotel 1 47 2015 July
## 3 Resort Hotel 0 1 2015 July
## 4 Resort Hotel 0 10 2015 July
## 5 Resort Hotel 1 79 2015 July
## 6 Resort Hotel 0 101 2015 July
## 7 Resort Hotel 0 92 2015 July
## 8 Resort Hotel 1 26 2015 July
## 9 Resort Hotel 0 102 2015 July
## 10 Resort Hotel 0 78 2015 July
## # ... with 9,322 more rows, and 27 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
# Com registro de objeto
hotel_bebe_chil <- hotels %>%
filter(children >= 1 | babies >= 1)
# Sem pipe
hotel_bebe_chil_sem_pipe <- filter(hotels, children >= 1 | babies >= 1)
Você acha que é mais provável encontrar reservas com crianças ou bebês em hotéis urbanos ou resorts hoteleiros? Teste sua intuição.
Usando filter() determine o número de reservas em hotéis resort que têm mais de 1 criança ou bebê no quarto?
Então, faça o mesmo para hotéis urbanos, e compare o número de linhas no dataframe filtrado resultantes.
# Com registro de objeto
hotel_bebe_chil_resort <- hotels %>%
filter(children >= 1 | babies >= 1) %>%
filter(hotel == "Resort Hotel")
# Sem pipe
hotel_bebe_chil_sem_pipe <- filter(hotels, children >= 1 | babies >= 1)
hotel_bebe_chil_resort_sem_pipe <- filter(hotel_bebe_chil_resort, hotel == "Resort Hotel")
hotel_bebe_chil_resort_sem_pipe
## # A tibble: 3,929 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 18 2015 July
## 2 Resort Hotel 1 47 2015 July
## 3 Resort Hotel 0 1 2015 July
## 4 Resort Hotel 0 10 2015 July
## 5 Resort Hotel 1 79 2015 July
## 6 Resort Hotel 0 101 2015 July
## 7 Resort Hotel 0 92 2015 July
## 8 Resort Hotel 1 26 2015 July
## 9 Resort Hotel 0 102 2015 July
## 10 Resort Hotel 0 78 2015 July
## # ... with 3,919 more rows, and 27 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
# Com registro de objeto
hotel_bebe_chil_city <- hotels %>%
filter(children >= 1 | babies >= 1) %>%
mutate(hotel = tolower(hotel)) %>%
filter(hotel == "city hotel")
# Sem pipe
hotel_bebe_chil_sem_pipe <- filter(hotels, children >= 1 | babies >= 1)
hotel_bebe_chil_city_sem_pipe <- filter(hotel_bebe_chil_sem_pipe, hotel == "City Hotel")
hotel_bebe_chil_city_sem_pipe
## # A tibble: 5,403 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 City Hotel 1 100 2015 July
## 2 City Hotel 0 69 2015 July
## 3 City Hotel 0 67 2015 July
## 4 City Hotel 1 60 2015 July
## 5 City Hotel 0 0 2015 July
## 6 City Hotel 0 0 2015 July
## 7 City Hotel 0 0 2015 July
## 8 City Hotel 0 84 2015 July
## 9 City Hotel 0 74 2015 July
## 10 City Hotel 0 75 2015 July
## # ... with 5,393 more rows, and 27 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
hotel_bebe_chil_class <- hotels %>%
filter(children >= 1 | babies >= 1) %>%
group_by(hotel) %>%
summarise(n = max(stays_in_weekend_nights))
hotel_bebe_chil_class
## # A tibble: 2 x 2
## hotel n
## <chr> <dbl>
## 1 City Hotel 8
## 2 Resort Hotel 8
list(hotel_bebe_chil_class)
## [[1]]
## # A tibble: 2 x 2
## hotel n
## <chr> <dbl>
## 1 City Hotel 8
## 2 Resort Hotel 8
hotels %>%
filter(children >= 1 | babies >= 1) %>%
group_by(country, hotel) %>%
summarise(tempomedio = mean(lead_time), reservas = n(), bebes = mean(babies), criancas = mean(children)) %>%
ggplot(aes(x= country, y= tempomedio, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
hotels
## # A tibble: 119,390 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## # ... with 119,380 more rows, and 27 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
Criar uma tabela de freqüência do número de adults em uma reserva.
Mostre os resultados em ordem decrescente para que a observação mais comum esteja no topo.
Qual é o número mais comum de adultos em reservas neste conjunto de dados?
Há algum resultado surpreendente?
Nota: Não esqueça de rotular também seu chunk R (onde diz lable-me-1). Seu rótulo deve ser curto, informativo, e não deve incluir espaços. Também não deve repetir uma etiqueta anterior, caso contrário o R Markdown lhe dará um erro sobre a repetição de etiquetas R em pedaços.
str(hotels)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
adultos <- hotels %>%
mutate(adults = as.factor (adults), children = as.factor (children))
table(adultos$adults)
##
## 0 1 2 3 4 5 6 10 20 26 27 40 50
## 403 23027 89680 6202 62 2 1 1 2 5 2 1 1
## 55
## 1
table(hotels$adults, hotels$children)
##
## 0 1 2 3 10
## 0 180 4 208 11 0
## 1 22587 279 157 4 0
## 2 82278 4089 3248 61 1
## 3 5675 487 39 0 0
## 4 60 2 0 0 0
## 5 2 0 0 0 0
## 6 1 0 0 0 0
## 10 1 0 0 0 0
## 20 2 0 0 0 0
## 26 5 0 0 0 0
## 27 2 0 0 0 0
## 40 1 0 0 0 0
## 50 1 0 0 0 0
## 55 1 0 0 0 0
hist(hotels$adults)
library(gmodels)
tab1 <- CrossTable(hotels$adults, hotels$children, chisq=TRUE)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation may be
## incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 119386
##
##
## | hotels$children
## hotels$adults | 0 | 1 | 2 | 3 | 10 | Row Total |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 180 | 4 | 208 | 11 | 0 | 403 |
## | 100.634 | 9.384 | 3105.820 | 449.907 | 0.003 | |
## | 0.447 | 0.010 | 0.516 | 0.027 | 0.000 | 0.003 |
## | 0.002 | 0.001 | 0.057 | 0.145 | 0.000 | |
## | 0.002 | 0.000 | 0.002 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 22587 | 279 | 157 | 4 | 0 | 23027 |
## | 69.287 | 462.606 | 425.386 | 7.750 | 0.193 | |
## | 0.981 | 0.012 | 0.007 | 0.000 | 0.000 | 0.193 |
## | 0.204 | 0.057 | 0.043 | 0.053 | 0.000 | |
## | 0.189 | 0.002 | 0.001 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 82278 | 4089 | 3248 | 61 | 1 | 89677 |
## | 10.767 | 52.457 | 92.890 | 0.268 | 0.082 | |
## | 0.917 | 0.046 | 0.036 | 0.001 | 0.000 | 0.751 |
## | 0.743 | 0.841 | 0.889 | 0.803 | 1.000 | |
## | 0.689 | 0.034 | 0.027 | 0.001 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 5675 | 487 | 39 | 0 | 0 | 6201 |
## | 1.107 | 217.827 | 119.706 | 3.947 | 0.052 | |
## | 0.915 | 0.079 | 0.006 | 0.000 | 0.000 | 0.052 |
## | 0.051 | 0.100 | 0.011 | 0.000 | 0.000 | |
## | 0.048 | 0.004 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 60 | 2 | 0 | 0 | 0 | 62 |
## | 0.105 | 0.109 | 1.897 | 0.039 | 0.001 | |
## | 0.968 | 0.032 | 0.000 | 0.000 | 0.000 | 0.001 |
## | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 2 | 0 | 0 | 0 | 0 | 2 |
## | 0.011 | 0.081 | 0.061 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 10 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 20 | 2 | 0 | 0 | 0 | 0 | 2 |
## | 0.011 | 0.081 | 0.061 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 26 | 5 | 0 | 0 | 0 | 0 | 5 |
## | 0.028 | 0.204 | 0.153 | 0.003 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 27 | 2 | 0 | 0 | 0 | 0 | 2 |
## | 0.011 | 0.081 | 0.061 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 40 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 50 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 55 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 110796 | 4861 | 3652 | 76 | 1 | 119386 |
## | 0.928 | 0.041 | 0.031 | 0.001 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 5133.464 d.f. = 52 p = 0
##
##
##
tab2 <- CrossTable(adultos$adults, adultos$children)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 119386
##
##
## | adultos$children
## adultos$adults | 0 | 1 | 2 | 3 | 10 | Row Total |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 180 | 4 | 208 | 11 | 0 | 403 |
## | 100.634 | 9.384 | 3105.820 | 449.907 | 0.003 | |
## | 0.447 | 0.010 | 0.516 | 0.027 | 0.000 | 0.003 |
## | 0.002 | 0.001 | 0.057 | 0.145 | 0.000 | |
## | 0.002 | 0.000 | 0.002 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 22587 | 279 | 157 | 4 | 0 | 23027 |
## | 69.287 | 462.606 | 425.386 | 7.750 | 0.193 | |
## | 0.981 | 0.012 | 0.007 | 0.000 | 0.000 | 0.193 |
## | 0.204 | 0.057 | 0.043 | 0.053 | 0.000 | |
## | 0.189 | 0.002 | 0.001 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 82278 | 4089 | 3248 | 61 | 1 | 89677 |
## | 10.767 | 52.457 | 92.890 | 0.268 | 0.082 | |
## | 0.917 | 0.046 | 0.036 | 0.001 | 0.000 | 0.751 |
## | 0.743 | 0.841 | 0.889 | 0.803 | 1.000 | |
## | 0.689 | 0.034 | 0.027 | 0.001 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 5675 | 487 | 39 | 0 | 0 | 6201 |
## | 1.107 | 217.827 | 119.706 | 3.947 | 0.052 | |
## | 0.915 | 0.079 | 0.006 | 0.000 | 0.000 | 0.052 |
## | 0.051 | 0.100 | 0.011 | 0.000 | 0.000 | |
## | 0.048 | 0.004 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 60 | 2 | 0 | 0 | 0 | 62 |
## | 0.105 | 0.109 | 1.897 | 0.039 | 0.001 | |
## | 0.968 | 0.032 | 0.000 | 0.000 | 0.000 | 0.001 |
## | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 2 | 0 | 0 | 0 | 0 | 2 |
## | 0.011 | 0.081 | 0.061 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 10 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 20 | 2 | 0 | 0 | 0 | 0 | 2 |
## | 0.011 | 0.081 | 0.061 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 26 | 5 | 0 | 0 | 0 | 0 | 5 |
## | 0.028 | 0.204 | 0.153 | 0.003 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 27 | 2 | 0 | 0 | 0 | 0 | 2 |
## | 0.011 | 0.081 | 0.061 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 40 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 50 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 55 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.006 | 0.041 | 0.031 | 0.001 | 0.000 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 110796 | 4861 | 3652 | 76 | 1 | 119386 |
## | 0.928 | 0.041 | 0.031 | 0.001 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
qplot(hotels$adults,
geom="histogram",
binwidth=5,
main="Histograma por adultos",
xlab="Adultos",
fill=I("blue"),
col=I("red"))
Repita o exercício 5, uma vez para reservas canceladas (is_canceled codificado como 1) e uma vez para reservas não canceladas (is_canceled codificado como 0).
O que isto revela sobre os resultados surpreendentes que você observou no exercício anterior?
Note: Não se esqueça de rotular também seu chunk de R (onde diz label-me-2).
hist(hotels$is_canceled,
main="Histograma por cancelamentos")
Calcular a tarifa mínima, média, mediana e máxima média diária (adr) agrupados por tipo de hotel para que você possa obter estas estatísticas separadamente para hotéis de resorts e cidades.
Que tipo de hotel é mais caro, em média? RESORTS
hotels %>%
group_by(hotel) %>%
summarise(mediaadr = mean(adr)) %>%
ggplot(aes(x= mediaadr, y= hotel, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()
hotels %>%
group_by(hotel) %>%
summarise(maxadr = max(adr)) %>%
ggplot(aes(x= maxadr, y= hotel, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()
hotels %>%
group_by(hotel) %>%
summarise(minadr = min(adr)) %>%
ggplot(aes(x= minadr, y= hotel, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()
hotels %>%
group_by(hotel) %>%
summarise(medianaadr = median(adr)) %>%
ggplot(aes(x= medianaadr, y= hotel, color= hotel)) +
geom_point() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
coord_flip ()
Observamos dois valores incomuns nas estatísticas resumidas acima – um mínimo negativo, e um máximo muito alto). Que tipos de hotéis são estes?
Localize estas observações no conjunto de dados e descubra a data de chegada (ano e mês), assim como quantas pessoas (adultos, crianças e bebês) permaneceram no quarto.
Você pode investigar os dados no espectador para localizar estes valores, mas de preferência você deve identificá-los de forma reprodutível com algum código.
Dica: Por exemplo, você pode filter para o dado quantidade adr e select as colunas relevantes.
filter(hotels, hotels$adr<=-6)
## # A tibble: 1 x 32
## hotel is_canceled lead_time arrival_date_ye~ arrival_date_mo~ arrival_date_we~
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Resort Hotel 0 195 2017 March 10
## # ... with 26 more variables: arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, ...
filter(hotels, hotels$adr>=5000)
## # A tibble: 1 x 32
## hotel is_canceled lead_time arrival_date_ye~ arrival_date_mo~ arrival_date_we~
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 City Hotel 1 35 2016 March 13
## # ... with 26 more variables: arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, ...
Abaixo está o dicionário de dados completo. Note que é longo (há muitas variáveis nos dados), mas utilizamos um conjunto limitado de variáveis para nossa análise.
| variable | class | description |
|---|---|---|
| hotel | character | Hotel (H1 = Resort Hotel or H2 = City Hotel) |
| is_canceled | double | Value indicating if the booking was canceled (1) or not (0) |
| lead_time | double | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| arrival_date_year | double | Year of arrival date |
| arrival_date_month | character | Month of arrival date |
| arrival_date_week_number | double | Week number of year for arrival date |
| arrival_date_day_of_month | double | Day of arrival date |
| stays_in_weekend_nights | double | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | double | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | double | Number of adults |
| children | double | Number of children |
| babies | double | Number of babies |
| meal | character | Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner) |
| country | character | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
| market_segment | character | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| distribution_channel | character | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| is_repeated_guest | double | Value indicating if the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | double | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | double | Number of previous bookings not cancelled by the customer prior to the current booking |
| reserved_room_type | character | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
| assigned_room_type | character | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
| booking_changes | double | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
| deposit_type | character | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay. |
| agent | character | ID of the travel agency that made the booking |
| company | character | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
| days_in_waiting_list | double | Number of days the booking was in the waiting list before it was confirmed to the customer |
| customer_type | character | Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking |
| adr | double | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| required_car_parking_spaces | double | Number of car parking spaces required by the customer |
| total_of_special_requests | double | Number of special requests made by the customer (e.g. twin bed or high floor) |
| reservation_status | character | Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why |
| reservation_status_date | double | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |
Carregar arquivos
read_csv e atribuindo cada conjunto de dados a um objetoManipulação de dados - mulheres cientistas
Faça a junção dos dataframes
Ordene por ano de nascimento
Filtre apenas as linhas que têm informações sobre ano de nascimento e ano de morte
Encontre as pesquisadoras que estão vivas
Gere um dataframe considerando apenas a coluna com o nome das cientistas.
Manipulação de dados - matrículas
Gere um dataframe com os alunos matriculados e que responderam estar frequentando as aulas.
Gere um dataframe com os alunos matriculados e que responderam não estar frequentando as aulas.
Gere um dataframe com os alunos que estão frequentando as aulas e que não estão matriculados.
Manipulação de dados - compras