Data 101 Course Project

### Libraries

library('tidyverse')

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library('codebook')
library('ggplot2') 
library('dplyr')
library('skimr')

#### Introduction

The data that I will be using for this project is the Titanic data, which I downloaded from Kaggle. The data consists of 1309 observations: 891 observations include a column that says whether the passenger survived, and the remaining 418 do not. The goal is to predict survival for those 418 passengers. The data has 12 variables, described as follows:

| Variable | Description | Type |
|---|---|---|
| PassengerID | Passenger ID | |
| Survived | 1 for survived, 0 for died | Categorical |
| Pclass | Passenger's class | Categorical (Ordinal) |
| Name | Passenger's name | Categorical |
| Sex | Passenger's sex | Categorical |
| Age | Passenger's age | Numerical (continuous) |
| SibSp | Number of siblings or spouses on the ship | Numerical (discrete) |
| Parch | Number of parents or children on the ship | Numerical (discrete) |
| Ticket | Ticket number | (Mixed) |
| Fare | Ticket fare | Numerical (continuous) |
| Cabin | Cabin number | (Mixed) |
| Embarked | Port of embarkation | Categorical |
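To make the variable types above concrete, here is a minimal sketch of how the columns could be typed in R. The rows are invented purely for illustration and are not taken from the Kaggle files; the data frame name `passengers` and the labels `died`/`survived` are my own assumptions.

```r
# Toy rows shaped like the Titanic columns (values are made up for illustration)
passengers <- data.frame(
  PassengerID = 1:6,
  Survived    = c(0, 1, 1, 0, 1, 0),
  Pclass      = c(3, 1, 3, 2, 1, 3),
  Sex         = c("male", "female", "female", "male", "female", "male"),
  Age         = c(22, 38, NA, 35, 4, NA),
  SibSp       = c(1, 1, 0, 0, 1, 0),
  Parch       = c(0, 0, 0, 0, 2, 0),
  Fare        = c(7.25, 71.28, 7.92, 13.00, 120.00, 8.05),
  Embarked    = c("S", "C", "S", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

# Categorical variables become factors; Pclass is ordinal, so it is an ordered factor
passengers$Survived <- factor(passengers$Survived, levels = c(0, 1),
                              labels = c("died", "survived"))
passengers$Pclass   <- factor(passengers$Pclass, levels = c(1, 2, 3), ordered = TRUE)
passengers$Sex      <- factor(passengers$Sex)
passengers$Embarked <- factor(passengers$Embarked)

# Age, SibSp, Parch and Fare stay numeric
str(passengers)
```

Typing the columns this way lets summaries and plots treat class and sex as groups rather than as numbers or free text.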

### Analysis Question

The main research question is how to determine whether each of the 418 passengers without a survival label survived or not. Although a certain amount of luck was involved in surviving the sinking, some groups of individuals, such as women, children, and the upper class, had a higher chance of survival than others. We will examine and visualize how the survival rate is related to some of the variables.
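One simple way to look at how survival relates to a grouping variable is to compute the survival rate within each group. The sketch below uses base R and a small invented data frame (`toy` is a hypothetical name, and the values are not real Titanic records), so the numbers only show the mechanics.

```r
# Invented example rows: Survived is 1 (survived) or 0 (died)
toy <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1),
  Sex      = c("female", "female", "female", "male",
               "male", "male", "male", "female"),
  Pclass   = c(1, 2, 3, 1, 2, 3, 3, 3)
)

# The mean of a 0/1 column within a group is that group's survival rate
rate_by_sex   <- tapply(toy$Survived, toy$Sex, mean)
rate_by_class <- tapply(toy$Survived, toy$Pclass, mean)

print(rate_by_sex)
print(rate_by_class)
```

The same two lines would apply to the labelled Kaggle rows once they are loaded, with the real column names substituted.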

### Import the dataset and read the CSV file

test <- read.csv("/cloud/project/Data 101 Project/test.csv")

attach(test)

### Section III- Data Analysis Plan

• The final outcome is whether the passenger survived or not, and the variables are sex, passenger class, age, number of siblings or spouses, number of parents or children, ticket fare, and port of embarkation.

• I will compare whether family members died together or whether passengers died as individuals.

• I will take care of the missing data.

• I will discuss the effect of different variables on the outcome. For example: does sex affect the probability of death, does age play a part in keeping passengers alive, and does passenger class affect the survival rate?

#### Steps to achieve the Analysis plan

1) Import and clean the data.
2) Check for missing data and fix it.
3) Add extra columns to be used in the analysis.
4) Do some data visualization to understand the data better and to see how some of the variables relate to one another.
5) Build a heat map of the correlation values between the variables and the (Survived) column.
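Steps 2, 3 and 5 can be sketched in base R as follows. This is only an illustration under my own assumptions: the rows are invented, median imputation for Age is one possible fix among several, and the column names `FamilySize` and `IsAlone` are hypothetical extras rather than part of the Kaggle data.

```r
# Invented rows with a few missing ages (not real Titanic records)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0, 0, 1),
  Pclass   = c(3, 1, 3, 2, 1, 3, 2, 2),
  Age      = c(22, 38, NA, 35, 4, NA, 54, 27),
  SibSp    = c(1, 1, 0, 0, 1, 0, 0, 0),
  Parch    = c(0, 0, 0, 0, 2, 0, 0, 2),
  Fare     = c(7.25, 71.28, 7.92, 13.00, 120.00, 8.05, 26.00, 11.13)
)

# Step 2: count missing values per column, then fill missing ages with the median
missing_counts <- colSums(is.na(toy))
toy$Age[is.na(toy$Age)] <- median(toy$Age, na.rm = TRUE)

# Step 3: extra columns (the passenger plus relatives on board, and a travelling-alone flag)
toy$FamilySize <- toy$SibSp + toy$Parch + 1
toy$IsAlone    <- as.integer(toy$FamilySize == 1)

# Step 5: correlation matrix of the numeric columns, and the row for Survived
corr <- cor(toy)
survived_corr <- sort(corr["Survived", colnames(corr) != "Survived"])

print(missing_counts)
print(round(survived_corr, 2))
```

The `corr` matrix is what a heat map would display; it can be passed to a plotting function such as base R's `heatmap()` or reshaped for `ggplot2`.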

I believe we can use all these variables to predict whether a passenger was likely to survive or die, and I think a random forest model will be our way to do so. After we build the model using the variables we have, we will test it on the test data and then check whether our hypothesis was right or not.
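A minimal sketch of what the random forest step could look like is shown below. It assumes the `randomForest` package is installed, and it uses synthetic data with a made-up survival rule, so the accuracy it prints says nothing about the real Titanic data. Because the 418 passengers to be predicted have no Survived column, the sketch holds back part of the labelled rows to check the model.

```r
library(randomForest)  # assumption: the randomForest package is installed

set.seed(101)

# Synthetic passengers (invented values, not the Kaggle data)
n <- 200
toy <- data.frame(
  Sex    = factor(sample(c("male", "female"), n, replace = TRUE)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Age    = round(runif(n, 1, 70)),
  Fare   = round(runif(n, 5, 100), 2)
)

# Made-up labelling rule for illustration only: women and young children survive
toy$Survived <- factor(ifelse(toy$Sex == "female" | toy$Age < 12, 1, 0))

# Hold back 50 labelled rows to evaluate the fitted model
train_idx <- sample(n, 150)
train     <- toy[train_idx, ]
holdout   <- toy[-train_idx, ]

# Fit a classification forest and predict the held-back rows
rf   <- randomForest(Survived ~ Sex + Pclass + Age + Fare, data = train, ntree = 200)
pred <- predict(rf, newdata = holdout)

accuracy <- mean(pred == holdout$Survived)
print(accuracy)
```

With the real data, the same pattern would apply: fit on most of the labelled rows, check accuracy on the rest, and only then predict the unlabelled passengers.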


Chinyere Akaigwe

2022-08-05