Project_Presentation

Chinyere Akaigwe

2022-08-16

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Introduction

- The data for this project is the Titanic dataset, which I downloaded from Kaggle.
- The data consists of 1309 observations: 891 include a column that tells whether the passenger survived, and the remaining 418 do not. There are 12 variables: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

Analysis Question

- The main research question is how the variables correlate with a passenger's chance of survival.
- Although some luck was involved in surviving the sinking, some groups had a higher chance of survival than others, such as women, children, and the upper class.
- I will measure and visualize how the survival rate is correlated with some of the variables.

Import the dataset and read the csv file

## Rows: 418 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## cols(
##   PassengerId = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )
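
The import step can be sketched as follows. This is a minimal illustration, assuming readr (loaded with the tidyverse); the temporary file and its rows are made-up stand-ins for the Kaggle download, and `toy_data` is a hypothetical name. Passing `col_types` explicitly is one way to follow the hint in the message above and avoid the column-guessing output.

```r
library(readr)

# Stand-in for the Kaggle file: a tiny CSV with made-up rows, written to a
# temporary path so the sketch runs anywhere. In the project this would be
# the path of the downloaded file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Pclass,Sex,Age,Fare",
  "1,3,male,30,7.5",
  "2,1,female,,80"
), csv_path)

# Declaring the column types up front means read_csv() has nothing to guess,
# so it does not print the column specification message.
toy_data <- read_csv(
  csv_path,
  col_types = cols(
    PassengerId = col_double(),
    Pclass      = col_double(),
    Sex         = col_character(),
    Age         = col_double(),
    Fare        = col_double()
  )
)

# An empty field (the second passenger's Age) is read as NA.
n_missing_age <- sum(is.na(toy_data$Age))
```

Calling `spec(toy_data)` afterwards prints the same kind of `cols(...)` listing shown above.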

Section III- Data Analysis Plan

- The outcome of interest is whether the passenger survived, and whether it is correlated with any of the variables: sex, passenger class, age, number of siblings or spouses aboard (SibSp), number of parents or children aboard (Parch), ticket fare, and port of embarkation.
- I will discuss the effect of different variables on the outcome. For example: does sex affect the probability of death, does age play a part in survival, and does passenger class affect the survival rate?

Steps to achieve the Analysis plan

Steps:
1) Import and clean the data.
2) Check for missing data and fix it.
3) Add extra columns to be used.
4) Do some data visualization to understand the data better and learn about the correlations between some of the variables.
5) Draw a heat map to see the correlation between each variable and the (Survived) column.

I believe we can use all of these variables to predict whether a passenger was likely to survive or die. After we build the model from the variables we have, we will test it on the test data and check whether our hypothesis was right.

Data Visualization

Now we will make some data visualizations to understand the data better and to see how the variables correlate with one another.

Plot males vs females with the survival rate shown

- The figure above shows that most of the people on the Titanic were men, and that the survival rate of men was much lower than that of women.
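
The numbers behind a plot like this can be sketched as follows. The data frame `toy` is made up for illustration and is not the Kaggle data; the column names `Sex` and `Survived` follow the dataset. The plotting call is one plausible ggplot2 form, not necessarily the one used for the figure.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Sex      = c("male", "male", "male", "male", "female", "female"),
  Survived = c(0, 0, 0, 1, 1, 1)
)

# Passenger counts per sex and outcome: the heights a stacked bar chart draws.
counts <- table(toy$Sex, toy$Survived)

# Survival rate per sex: the mean of the 0/1 Survived column within each group.
rates <- tapply(toy$Survived, toy$Sex, mean)

# Build the stacked bar chart only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = Sex, fill = factor(Survived))) +
    geom_bar() +
    labs(fill = "Survived", y = "Passengers")
}
```

Comparing `rates` rather than raw counts matters here: a group can have more survivors in absolute terms and still have a lower survival rate if it is much larger.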

Plot that shows the age distribution on the Titanic

## Warning: Removed 177 rows containing non-finite values (stat_density).

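The distribution shows that 25 is the most common age (the mode) among passengers with a recorded age. The 177 rows without an age were dropped from the plot, as the warning above notes.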
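
Reading the peak off a density curve can be checked numerically. The sketch below uses base R's `density()` on made-up ages (not the Kaggle data), with `NA` standing in for passengers whose age was not recorded; `peak_age` is a hypothetical name.

```r
# Made-up ages; NA marks a passenger with no recorded age.
age <- c(22, 24, 25, 25, 26, 27, 30, 38, 54, NA, NA)

# Kernel density estimate. na.rm = TRUE drops the missing ages, which is
# what the "Removed ... rows" warning reports for the plot.
dens <- density(age, na.rm = TRUE)

# Age at which the estimated density is highest (the peak of the curve).
peak_age <- dens$x[which.max(dens$y)]

# Number of ages that were left out of the estimate.
n_missing <- sum(is.na(age))
```

The peak of a smoothed density depends on the bandwidth, so it is an estimate of the most common age rather than an exact count-based mode.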

Plot that shows different classes with the survival rate for each

- Class 3 had the most passengers and also the most deaths, followed by Class 2 and then Class 1 (the elite).

Plot that shows the different ports of embarkation with the survival rate for each

Box plot to show the fare with respect to each class

- Class 1 had the highest fares, followed by Class 2 and then Class 3.

Missing Values from the test_data

## [1] 418

## [1] 413

## [1] 86
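
The output above is not labelled, so the following does not try to reproduce it. It is only a sketch of how missing values can be counted in base R, using a made-up data frame `toy` whose column names follow the dataset.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Age   = c(34.5, NA, 62, 27),
  Fare  = c(7.83, 7, NA, 8.66),
  Cabin = c(NA, NA, NA, "B45"),
  stringsAsFactors = FALSE
)

# Total number of rows.
n_rows <- nrow(toy)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(toy))

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(toy))
```

Counting per column first shows which variables need fixing before any rows are dropped or values are filled in.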

Looking at the Categories of each column

## spec_tbl_df [418 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PassengerId: num [1:418] 892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : num [1:418] 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr [1:418] "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr [1:418] "male" "female" "male" "male" ...
##  $ Age        : num [1:418] 34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : num [1:418] 0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : num [1:418] 0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr [1:418] "330911" "363272" "240276" "315154" ...
##  $ Fare       : num [1:418] 7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr [1:418] NA NA NA NA ...
##  $ Embarked   : chr [1:418] "Q" "S" "Q" "S" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PassengerId = col_double(),
##   ..   Pclass = col_double(),
##   ..   Name = col_character(),
##   ..   Sex = col_character(),
##   ..   Age = col_double(),
##   ..   SibSp = col_double(),
##   ..   Parch = col_double(),
##   ..   Ticket = col_character(),
##   ..   Fare = col_double(),
##   ..   Cabin = col_character(),
##   ..   Embarked = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Plotting passengers that survived and those that did not

Correlation between different parameters and (Survived) column
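
The values a correlation heat map displays come from a correlation matrix. The sketch below computes one with base R's `cor()` on made-up data (not the Kaggle data). Encoding `Sex` as a 0/1 column named `SexFemale` is an assumption made here so that it can enter the matrix; the report does not say how categorical columns were encoded.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1, 0, 1),
  Sex      = c("male", "male", "male", "female",
               "female", "female", "male", "male"),
  Pclass   = c(3, 3, 2, 1, 1, 2, 3, 1),
  Fare     = c(7.25, 8.05, 13, 71.3, 53.1, 26, 7.9, 30)
)

# Assumed encoding: 1 for female, 0 for male.
toy$SexFemale <- as.numeric(toy$Sex == "female")

# Keep only numeric columns; cor() does not accept character columns.
num <- toy[, c("Survived", "SexFemale", "Pclass", "Fare")]

# Pearson correlations, using all pairs of rows where both values are present.
cor_matrix <- cor(num, use = "pairwise.complete.obs")

# The column of interest: correlation of every variable with Survived.
cor_with_survived <- cor_matrix[, "Survived"]
```

A negative value for `Pclass` means survival was more common in lower-numbered (higher) classes, since Class 1 is coded as 1 and Class 3 as 3.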

Conclusion

- In conclusion, after cleaning and analyzing the dataset, I saw a direct relationship between passenger sex and survival rate.
- Even though there were more males on board the Titanic, more females survived, largely because women and children were allowed onto the lifeboats and rescued first.
- The elite class also had the highest survival rate, because they were located closest to the exits and were given second priority after women and children.