## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
- The data used in this project is the Titanic dataset, which I downloaded from Kaggle.
- The data consists of 1309 observations: 891 observations include a column that tells whether the passenger survived, and the remaining 418 do not. The data has 12 variables, as follows:
- PassengerId: passenger ID
- Survived: 1 for survived, 0 for dead; categorical
- Pclass: passenger's class; categorical (ordinal)
- Name: passenger's name; categorical
- Sex: passenger's sex; categorical
- Age: passenger's age; numerical (continuous)
- SibSp: number of siblings or spouses on the ship; numerical (discrete)
- Parch: number of children or parents on the ship; numerical (discrete)
- Ticket: ticket number; mixed
- Fare: ticket fare; numerical (continuous)
- Cabin: cabin number; mixed
- Embarked: port of embarkation; categorical
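The variable types listed above can be made explicit in R. The following is a minimal sketch on a few toy rows (the values are illustrative, not read from the Kaggle file); the level labels such as "died" and "3rd" are assumptions chosen for readability.

```r
# Toy rows that mimic some of the Titanic columns (illustrative values only).
passengers <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0, 1, 1, 0),
  Pclass      = c(3, 1, 3, 2),
  Sex         = c("male", "female", "female", "male"),
  Age         = c(22, 38, NA, 35),
  Embarked    = c("S", "C", "S", "Q"),
  stringsAsFactors = FALSE
)

# Nominal variables become unordered factors.
passengers$Survived <- factor(passengers$Survived, levels = c(0, 1),
                              labels = c("died", "survived"))
passengers$Sex      <- factor(passengers$Sex)
passengers$Embarked <- factor(passengers$Embarked, levels = c("C", "Q", "S"))

# Pclass is ordinal, so it becomes an ordered factor (3rd < 2nd < 1st).
passengers$Pclass <- factor(passengers$Pclass, levels = c(3, 2, 1),
                            labels = c("3rd", "2nd", "1st"), ordered = TRUE)

# Inspect the resulting column types.
str(passengers)
```

Encoding Pclass as an ordered factor lets comparisons such as `passengers$Pclass > "3rd"` work, which a plain number or unordered factor would not express as clearly.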
- The main research question is how strongly each variable correlates with the chances of survival.
- Although a certain amount of luck was involved in surviving the sinking, some groups of individuals, such as women, children, and the upper class, had a higher chance of survival than others.
- I will examine and visualize how the survival rate correlates with some of the variables.
## Rows: 418 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## cols(
## PassengerId = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
• The final outcome is whether a passenger's survival is correlated with any of the variables: sex, passenger class, age, number of siblings or spouses, number of parents or children, ticket fare, and port of embarkation.
• I will discuss the effect of different variables on the outcome. For example: does sex affect the probability of death, does age play a part in keeping passengers alive, and does passenger class affect the survival rate?
Steps:
1) Import and clean the data.
2) Check for missing data and fix it.
3) Add extra columns to be used.
4) Visualize the data to understand it better and learn more about the correlations between some of the variables.
5) Draw a heat map to see the correlation between each variable and the Survived column.
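Steps 2, 3 and 5 can be sketched in base R as follows. This is only an illustration on toy data, not the code behind this report: median imputation for Age, the `FamilySize` column, and the `IsFemale` indicator are all assumptions introduced for the example.

```r
# Toy data standing in for the Kaggle training file (illustrative values only).
train <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 1, 1),
  Pclass   = c(3, 1, 3, 1, 3, 3, 1, 2),
  Sex      = c("male", "female", "female", "female",
               "male", "male", "male", "female"),
  Age      = c(22, 38, 26, 35, NA, 54, NA, 14),
  SibSp    = c(1, 1, 0, 1, 0, 0, 0, 1),
  Parch    = c(0, 0, 0, 0, 0, 0, 1, 2),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 30.00, 30.07)
)

# Step 2: count missing values per column, then fill Age with the median.
# Median imputation is an assumption, not necessarily the choice made here.
missing_counts <- colSums(is.na(train))
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)

# Step 3: add extra columns. FamilySize and IsFemale are hypothetical examples.
train$FamilySize <- train$SibSp + train$Parch + 1
train$IsFemale   <- as.integer(train$Sex == "female")

# Step 5: correlation matrix of the numeric columns, then the Survived column
# of that matrix, which is what the heat map is meant to highlight.
numeric_cols <- c("Pclass", "Age", "SibSp", "Parch", "Fare",
                  "FamilySize", "IsFemale")
cor_matrix   <- cor(train[c("Survived", numeric_cols)])
survived_cor <- round(cor_matrix[numeric_cols, "Survived"], 2)

print(missing_counts)
print(survived_cor)
```

The full `cor_matrix` can then be handed to a plotting function, for example base R's `heatmap()`, to draw the heat map itself.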
I believe we can use all of these variables to predict whether a passenger was likely to survive or die. After building the model with the variables we have, we will run it on the test data and check whether our hypothesis was right.
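The text does not say which model is meant, so the following is only one possible sketch: a logistic regression fitted with `glm()` on synthetic data. The data, the assumed survival probabilities, and the 75/25 holdout split are all assumptions made so the example runs on its own; none of them come from the Kaggle files.

```r
set.seed(42)  # make the synthetic data reproducible

# Synthetic passengers: NOT the Kaggle data, just a stand-in with similar columns.
n <- 400
passengers <- data.frame(
  Sex    = sample(c("female", "male"), n, replace = TRUE),
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)

# Assumed survival probabilities, chosen only so the example contains a signal.
p_survive <- ifelse(passengers$Sex == "female", 0.75, 0.20) -
  0.05 * (passengers$Pclass - 1)
passengers$Survived <- rbinom(n, size = 1, prob = p_survive)

# Hold out a quarter of the labelled rows to check the model.
holdout_idx <- sample(n, size = n / 4)
fit_data    <- passengers[-holdout_idx, ]
holdout     <- passengers[holdout_idx, ]

# Logistic regression: one common model for a binary outcome.
model <- glm(Survived ~ Sex + Pclass + Age, data = fit_data, family = binomial)

# Predicted survival probabilities and accuracy at a 0.5 cutoff.
pred_prob  <- predict(model, newdata = holdout, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
accuracy   <- mean(pred_class == holdout$Survived)

print(coef(model))
print(accuracy)
```

A holdout taken from labelled rows is used here because accuracy can only be computed where the Survived value is known.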
Next, we will produce some data visualizations to understand the data better and to see how the variables correlate with one another.
- The figure above shows that most of the people on the Titanic were men. It also shows that the survival rate of men was much lower than that of women.
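The numbers behind a figure like this can be tabulated directly. Below is a minimal sketch with dplyr on toy data (illustrative values, not the Kaggle file); the summary column names are assumptions.

```r
library(dplyr)

# Toy data (illustrative values only).
train <- data.frame(
  Sex      = c("male", "male", "male", "male", "male",
               "female", "female", "female"),
  Survived = c(0, 0, 0, 1, 0, 1, 1, 0)
)

# Count passengers and compute the share that survived within each sex.
survival_by_sex <- train %>%
  group_by(Sex) %>%
  summarise(
    passengers    = n(),
    survivors     = sum(Survived),
    survival_rate = mean(Survived)
  )

print(survival_by_sex)
```

Because Survived is coded 0/1, its mean within a group is that group's survival rate.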
## Warning: Removed 177 rows containing non-finite values (stat_density).
The distribution shows that 25 is the modal age of passengers on the Titanic.
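A mode can be read from data in two ways: as the most frequent value, or as the peak of a density estimate. The sketch below shows both on a toy vector of ages (illustrative values, not the real Age column), and also counts the missing values, which are the kind of rows a density plot drops with a warning.

```r
# Toy ages with missing values (illustrative, not the real column).
ages <- c(22, 25, 25, NA, 31, 25, 40, NA, 31, 18)

# Number of missing ages.
n_missing <- sum(is.na(ages))

# Discrete mode: the most frequent observed age.
age_counts    <- table(ages)  # table() leaves out NA by default
discrete_mode <- as.numeric(names(age_counts)[which.max(age_counts)])

# Continuous mode: the location of the peak of a kernel density estimate.
dens         <- density(ages, na.rm = TRUE)
density_mode <- dens$x[which.max(dens$y)]

print(n_missing)
print(discrete_mode)
print(density_mode)
```

The density peak depends on the bandwidth, so it will usually be close to, but not exactly equal to, the most frequent value.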
- Class 3 had the majority of passengers and also the majority of deaths, followed by Class 2 and then Class 1 (the elite).
- Class 1 had the highest fares, followed by Class 2 and then Class 3.
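Both observations about passenger class can be checked with a grouped summary. This is a minimal dplyr sketch on toy data (illustrative values only); using the median as the fare summary is an assumption, chosen because fares tend to be skewed.

```r
library(dplyr)

# Toy data (illustrative values, not the Kaggle file).
train <- data.frame(
  Pclass   = c(1, 1, 2, 2, 3, 3, 3, 3),
  Survived = c(1, 1, 1, 0, 0, 0, 1, 0),
  Fare     = c(80, 60, 20, 14, 8, 7, 9, 8)
)

# Passengers, deaths and median fare per class.
class_summary <- train %>%
  group_by(Pclass) %>%
  summarise(
    passengers  = n(),
    deaths      = sum(Survived == 0),
    median_fare = median(Fare)
  ) %>%
  arrange(Pclass)

print(class_summary)
```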
## [1] 418
## [1] 413
## [1] 86
## spec_tbl_df [418 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ PassengerId: num [1:418] 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : num [1:418] 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr [1:418] "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr [1:418] "male" "female" "male" "male" ...
## $ Age : num [1:418] 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : num [1:418] 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : num [1:418] 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr [1:418] "330911" "363272" "240276" "315154" ...
## $ Fare : num [1:418] 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr [1:418] NA NA NA NA ...
## $ Embarked : chr [1:418] "Q" "S" "Q" "S" ...
## - attr(*, "spec")=
## .. cols(
## .. PassengerId = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. SibSp = col_double(),
## .. Parch = col_double(),
## .. Ticket = col_character(),
## .. Fare = col_double(),
## .. Cabin = col_character(),
## .. Embarked = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
- In conclusion, after cleaning and analyzing the dataset, I saw a direct relationship between the sex of the passengers and the survival rate.
- Even though there were more males on board the Titanic, more females survived, largely because women and children were allowed onto the lifeboats and rescued first.
- Also, the elite class had the highest survival rate. This is because they were located closest to the exits and were given second priority after women and children.