library(tidyverse)
library(titanic)
library(janitor)
library(nycflights13)
titanic <- titanic_train
Answer these questions using the titanic data frame.
Make sure variable names adhere to tidyverse style. Do not forget to
save the changes to the titanic data frame so that you can
use cleaner names moving forward.
titanic <- clean_names(titanic)
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ passenger_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, …
## $ pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, …
## $ name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (F…
## $ sex <chr> "male", "female", "female", "female", "male", "male", "ma…
## $ age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14,…
## $ sib_sp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, …
## $ parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, …
## $ ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "3…
## $ fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625…
## $ cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "…
## $ embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S…
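To show what clean_names() is doing here, a minimal sketch on a made-up data frame follows. The `raw` data frame and its values are assumptions for illustration; only the column names mimic the original Titanic names.

```r
library(janitor)

# Hypothetical data frame with CamelCase names like the raw Titanic columns
raw <- data.frame(PassengerId = 1:2, SibSp = c(1L, 0L), Pclass = c(3L, 1L))

# clean_names() converts the column names to snake_case by default
cleaned <- clean_names(raw)
names(cleaned)
```

The result has to be assigned (here to `cleaned`), otherwise the original names stay in place.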
What are the names of the passengers who survived, were male, were older than 18 years, embarked on the Titanic at Cherbourg, and were traveling in first class?
# pclass and survived are still integers at this point, so compare with numbers
titanic %>%
  filter(age > 18, sex == "male", survived == 1, pclass == 1, embarked == "C") %>%
  select(name)
## name
## 1 Greenfield, Mr. William Bertram
## 2 Blank, Mr. Henry
## 3 Harder, Mr. George Achilles
## 4 Goldenberg, Mr. Samuel L
## 5 Bishop, Mr. Dickinson H
## 6 Frolicher-Stehli, Mr. Maxmillian
## 7 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")
## 8 Homer, Mr. Harry ("Mr E Haven")
## 9 Stahelin-Maeglin, Dr. Max
## 10 Harper, Mr. Henry Sleeper
## 11 Simonius-Blumer, Col. Oberst Alfons
## 12 Cardeza, Mr. Thomas Drake Martinez
## 13 Hassab, Mr. Hammad
## 14 Lesurer, Mr. Gustave J
## 15 Behr, Mr. Karl Howell
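One detail of the filter above is worth a note: passengers with a missing age are dropped, because a comparison with NA gives NA and filter() keeps only rows where the condition is TRUE. A small sketch on made-up data (the `toy` tibble and its values are assumptions, not Titanic records):

```r
library(dplyr)

# Hypothetical mini data set; values are made up for illustration
toy <- tibble(
  name = c("A", "B", "C", "D"),
  age  = c(25, NA, 40, 12),
  sex  = c("male", "male", "female", "male")
)

# Conditions separated by commas are combined with AND, the same as using &
adult_men <- toy %>%
  filter(age > 18, sex == "male") %>%
  select(name)

adult_men  # B is not returned because its age is NA
```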
Are there any differences in the fare paid depending on the ticket class (first, second, or third)? Answer this question with a visual. Are you running into an obstacle? Why do you think it is happening?
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ passenger_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, …
## $ pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, …
## $ name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (F…
## $ sex <chr> "male", "female", "female", "female", "male", "male", "ma…
## $ age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14,…
## $ sib_sp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, …
## $ parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, …
## $ ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "3…
## $ fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625…
## $ cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "…
## $ embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S…
This is happening because pclass is stored as an integer (int), so ggplot2 treats it as a continuous variable and does not split the fares into three groups. We want pclass to be a factor, so I changed its type from integer to factor before plotting.
titanic %>%
  mutate(pclass = as.factor(pclass)) %>%
  ggplot(aes(x = pclass, y = fare)) +
  geom_violin()
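To see the obstacle in isolation, here is a small sketch on made-up data (the `toy` data frame is an assumption, not part of the assignment). An integer column counts as numeric, so ggplot2 reads it as a continuous axis; wrapping it in factor() gives one group per class.

```r
library(ggplot2)

# Hypothetical fares for three classes; values are illustrative only
toy <- data.frame(
  pclass = c(1L, 1L, 2L, 2L, 3L, 3L),
  fare   = c(80, 60, 20, 14, 13, 8)
)

pclass_is_numeric <- is.numeric(toy$pclass)   # integer columns are numeric
n_groups <- nlevels(factor(toy$pclass))       # number of distinct classes

# factor() can also be applied inside aes() without changing the data frame
p <- ggplot(toy, aes(x = factor(pclass), y = fare)) +
  geom_boxplot()
p
```

A boxplot is used here only as an alternative visual; geom_violin() works with the same aesthetics.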
During the first week, you had to decide what type each variable in
titanic should have. Now is the time to make these changes.
For each variable, decide whether its type is correct. If it is not, then
change the variable type. Make sure that you save your data frame
titanic at the end so that, moving forward, titanic has
the correct variable types.
titanic <- titanic %>% mutate(pclass = as.factor(pclass),
age = as.integer(age),
sex = as.factor(sex),
survived = as.factor(survived))
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ passenger_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, …
## $ pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, …
## $ name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (F…
## $ sex <fct> male, female, female, female, male, male, male, male, fem…
## $ age <int> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14,…
## $ sib_sp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, …
## $ parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, …
## $ ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "3…
## $ fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625…
## $ cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "…
## $ embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S…
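When several columns need the same conversion, dplyr's across() can apply one function to all of them. The following is a sketch on a made-up tibble (the `toy` data is an assumption); it is an alternative to listing each column separately, not the approach used above.

```r
library(dplyr)

# Hypothetical data with the same kinds of columns as titanic
toy <- tibble(
  survived = c(0L, 1L, 1L),
  pclass   = c(3L, 1L, 2L),
  sex      = c("male", "female", "female"),
  fare     = c(7.25, 71.28, 13.00)
)

# Apply as.factor() to every listed column in one step and save the result
toy <- toy %>%
  mutate(across(c(survived, pclass, sex), as.factor))

sapply(toy, class)  # fare is left untouched
```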
Create a new variable called child. If a person is younger
than 18 years, the variable child should have a value of
TRUE; if a person is 18 years old or older, then the
variable child should have a value of FALSE.
Make sure not to use quotation marks around TRUE or FALSE. These are values
known to R. If you use quotes, R will make child a character
variable. If you do not use quotes, then it will be a logical variable,
as it should be. After creating the child variable, count how many
children were on board.
titanic %>%
mutate(child = case_when(
age < 18 ~ TRUE,
age >= 18 ~ FALSE)) %>%
count(child == TRUE)
## child == TRUE n
## 1 FALSE 601
## 2 TRUE 113
## 3 NA 177
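There were 113 children on board among the passengers with a recorded age. Another 177 passengers have a missing age, so child is NA for them.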
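Since a comparison such as `age < 18` already returns TRUE, FALSE, or NA, the same variable can be sketched without case_when(). The `toy` tibble below is made up for illustration and is not the Titanic data.

```r
library(dplyr)

# Hypothetical ages, including one missing value
toy <- tibble(age = c(4, 17, 18, 35, NA))

toy <- toy %>%
  mutate(child = age < 18)   # logical column; NA where age is missing

n_children <- sum(toy$child, na.rm = TRUE)   # counts TRUE values, ignoring NA
n_unknown  <- sum(is.na(toy$child))          # rows where the status is unknown

count(toy, child)
```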
What are the mean and median fare paid? What are the mean
and median fare paid for each passenger class?
mean(titanic$fare)
## [1] 32.20421
median(titanic$fare)
## [1] 14.4542
The mean of fare is 32.20421. The median of fare is 14.4542.
titanic %>%
group_by(pclass) %>%
summarize(mean(fare), median(fare))
## # A tibble: 3 × 3
## pclass `mean(fare)` `median(fare)`
## <fct> <dbl> <dbl>
## 1 1 84.2 60.3
## 2 2 20.7 14.2
## 3 3 13.7 8.05
First-class passengers paid a mean fare of about 84.15 (84.15369, to be exact). The median fare for first class was about 60.29 (60.2875).
Second-class passengers paid a mean fare of about 20.66 (20.66218). Their median fare was 14.25.
Third-class passengers paid a mean fare of about 13.68 (13.67555). Their median fare was 8.05.
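Naming the summary columns makes the table easier to read and to reuse later. A minimal sketch, assuming a made-up `toy` tibble and the hypothetical column names `mean_fare` and `median_fare`:

```r
library(dplyr)

# Hypothetical fares for three classes; values are illustrative only
toy <- tibble(
  pclass = factor(c(1, 1, 2, 2, 3, 3)),
  fare   = c(80, 60, 20, 14, 13, 8)
)

# name = expression gives each summary column a readable name
fare_by_class <- toy %>%
  group_by(pclass) %>%
  summarize(
    mean_fare   = mean(fare),
    median_fare = median(fare)
  )

fare_by_class
```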
Construct a histogram to look at the distribution of fare. Use a bin
size of 30.
What is the shape of the distribution? What is the best numerical
summary to describe the fare? Why? Calculate the numerical summary you
chose.
ggplot(data = titanic,
aes(x = fare)) +
geom_histogram(binwidth = 30)
summarize(titanic, quantile(fare, c(0.25, 0.50, 0.75)))
## quantile(fare, c(0.25, 0.5, 0.75))
## 1 7.9104
## 2 14.4542
## 3 31.0000
The distribution is strongly right-skewed: most fares are bunched at the low end, with a long tail of expensive tickets. Because of that skew, the mean (32.20) is pulled well above the median (14.45), so I chose the quartiles as the numerical summary, since they are not affected much by a few extreme fares. They are calculated above: 25% of passengers paid 7.91 or less, half (50%) paid 14.45 or less, and three quarters (75%) paid 31.00 or less.
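The effect of skew on the mean can be sketched with a few made-up fares (the `fares` vector is an assumption for illustration only). One very expensive ticket moves the mean a lot, while the median and quartiles barely notice it.

```r
# Made-up fares with one very expensive ticket (illustrative values only)
fares <- c(7, 8, 8, 10, 13, 15, 26, 30, 80, 500)

fare_mean      <- mean(fares)      # pulled upward by the 500 fare
fare_median    <- median(fares)    # middle of the sorted values
fare_quartiles <- quantile(fares, c(0.25, 0.50, 0.75))
fare_iqr       <- IQR(fares)       # spread of the middle 50% of fares

fare_mean > fare_median            # TRUE for this right-skewed example
```

As a side note, in dplyr 1.1.0 and later, a summarize() call that returns several rows per group is deprecated in favor of reframe(); calling quantile() directly, as in this sketch, avoids the issue.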