M4 Project R: Titanic - Markdown

titanic_data <- read.csv("C:\\Users\\18328\\Desktop\\train.csv")

Analysis 1: Passenger Age

In this analysis, my goal is to understand the age distribution of passengers.

Why?

I want to identify the age groups present among Titanic passengers.

What?

I will analyze the age data to create a summary and visualize the distribution using a histogram.

Including Plots

summary(titanic_data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177
hist(titanic_data$Age)
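
The summary above reports 177 missing ages, so it helps to check how missing values are treated before reading the histogram. The following is a minimal sketch on a small made-up vector (`ages` is hypothetical and not taken from train.csv); it shows that `hist()` drops `NA` values, so the bars only describe passengers with a recorded age.

```r
# Hypothetical toy ages (not taken from train.csv); NA marks a missing age
ages <- c(22, 38, NA, 35, NA, 54, 2, 27)

n_missing <- sum(is.na(ages))        # number of missing values
share_missing <- mean(is.na(ages))   # fraction of values that are missing

# hist() drops NA values; plot = FALSE returns the bin counts without drawing
age_bins <- hist(ages, breaks = seq(0, 60, by = 10), plot = FALSE)
n_plotted <- sum(age_bins$counts)    # number of non-missing ages that were binned

n_missing
share_missing
n_plotted
```

On the real data the same checks would be `sum(is.na(titanic_data$Age))` and `mean(is.na(titanic_data$Age))`.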

Analysis 2: Gender Distribution

Why:

To understand the distribution of genders among passengers.

What:

Analyzing gender data to identify the proportion of males and females.

Including Plots

table(titanic_data$Sex)
## 
## female   male 
##    314    577
barplot(table(titanic_data$Sex), main="Gender Distribution", xlab="Gender", ylab="Count", col=c("blue", "pink"))
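
Since the stated goal is the proportion of males and females, the counts can be converted to percentages. This is a small sketch that re-enters the two counts printed above by hand (the vector name `sex_counts` is my own); on the real data, `prop.table(table(titanic_data$Sex))` should give the same result.

```r
# Counts copied from the table(titanic_data$Sex) output above
sex_counts <- c(female = 314, male = 577)

# prop.table() divides each count by the total
sex_props <- prop.table(sex_counts)

# Percentages rounded to one decimal place
sex_pct <- round(sex_props * 100, 1)
sex_pct
```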

Analysis 3: Fare Distribution

Why:

To understand the distribution of fares paid by passengers.

What:

Analyzing fare data to identify the financial status of passengers, as financial status may impact overall well-being and health.

summary(titanic_data$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.33
hist(titanic_data$Fare, main="Fare Distribution", xlab="Fare", ylab="Count", col="navy")

What information/conclusion do I get from the analysis results/plots?

To answer this question, I will analyze how survival relates to the variables I have explored (Age, Gender, and Fare).

Analysis 4: Age and Survival

Convert the existing Survived column (1 = survived, 0 = did not survive) to a factor so it can be used as a grouping variable

titanic_data$Survived <- factor(titanic_data$Survived)

Visualize the relationship between Age and Survival

boxplot(Age ~ Survived, data = titanic_data, col = c("yellow", "blue"), main = "Age vs. Survival", xlab = "Survived", ylab = "Age")

Compare the mean age between survivors and non-survivors

t.test(Age ~ Survived, data = titanic_data)
## 
##  Welch Two Sample t-test
## 
## data:  Age by Survived
## t = 2.046, df = 598.84, p-value = 0.04119
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.09158472 4.47339446
## sample estimates:
## mean in group 0 mean in group 1 
##        30.62618        28.34369
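
The printed test result can also be read programmatically, which makes it easier to quote the exact numbers in the interpretation. The sketch below uses a small made-up data frame (`toy` is hypothetical, not train.csv) and assumes the same column names as the report. It also shows that the formula interface drops rows with a missing age before testing.

```r
# Hypothetical toy data (not from train.csv); NA marks a missing age
toy <- data.frame(
  Age = c(22, 38, 26, 35, NA, 54, 2, 27, 14, 58, 20, NA),
  Survived = factor(c(0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1))
)

# Welch two-sample t-test; rows with a missing Age are omitted
res <- t.test(Age ~ Survived, data = toy)

res$estimate   # group means (group 0 first, then group 1)
res$conf.int   # 95% confidence interval for mean(group 0) - mean(group 1)
res$p.value    # two-sided p-value
```

The confidence interval refers to the mean of group 0 minus the mean of group 1, which is why a positive interval means non-survivors were older on average.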

Analysis 5: Gender and Survival

Create a contingency table for Gender and Survival

gender_survival_table <- table(titanic_data$Sex, titanic_data$Survived)

Perform a chi-square test for independence

chisq.test(gender_survival_table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gender_survival_table
## X-squared = 260.72, df = 1, p-value < 2.2e-16
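
The chi-square test shows that gender and survival are associated, but not which group fared better; row proportions of the contingency table show the direction. The report does not print the table itself, so the four cell counts below are an assumption: they are the values commonly reported for the Kaggle train.csv, and they agree with the row totals (314 and 577) and the X-squared value shown above. On the real data, `prop.table(gender_survival_table, margin = 1)` would be the direct equivalent.

```r
# Assumed cell counts (not printed in this report); rows = Sex, columns = Survived
gender_survival <- matrix(
  c(81, 233,
    468, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"), Survived = c("0", "1"))
)

# Row proportions: share of each gender that did not survive / survived
row_props <- prop.table(gender_survival, margin = 1)
round(row_props * 100, 1)

# Same test as in the report (Yates' correction is the default for 2x2 tables)
chi <- chisq.test(gender_survival)
chi$statistic
```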

Analysis 6: Fare and Survival

Visualize the relationship between Fare and Survival

boxplot(Fare ~ Survived, data = titanic_data, col = c("blue", "red"), main = "Fare vs. Survival", xlab = "Survived", ylab = "Fare")

Compare the mean fare between survivors and non-survivors

t.test(Fare ~ Survived, data = titanic_data)
## 
##  Welch Two Sample t-test
## 
## data:  Fare by Survived
## t = -6.8391, df = 436.7, p-value = 2.699e-11
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -33.82912 -18.72592
## sample estimates:
## mean in group 0 mean in group 1 
##        22.11789        48.39541
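
The fare summary earlier (median 14.45, mean 32.20, maximum 512.33) suggests a right-skewed distribution, so a rank-based test can serve as a robustness check alongside the t-test. The sketch below swaps in a Wilcoxon rank-sum test, which is not part of the original analysis, and runs it on a small made-up data frame (`toy` is hypothetical, not train.csv).

```r
# Hypothetical toy data (not from train.csv)
toy <- data.frame(
  Fare = c(7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07),
  Survived = factor(c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1))
)

# Group medians are less sensitive to a few very high fares than group means
fare_medians <- tapply(toy$Fare, toy$Survived, median)
fare_medians

# Wilcoxon rank-sum test; exact = FALSE uses the normal approximation
w <- wilcox.test(Fare ~ Survived, data = toy, exact = FALSE)
w$p.value
```

On the real data the call would be `wilcox.test(Fare ~ Survived, data = titanic_data)`.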

Interpreting R Outputs for Analysis:

Age and Survival:

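Survivors were slightly younger on average (mean age 28.34) than non-survivors (mean age 30.63). The Welch t-test gives p-value = 0.04119, so the difference is statistically significant at the 5% level, but it is small: the 95 percent confidence interval for the difference in means runs from 0.09 to 4.47 years. The boxplot shows largely overlapping age distributions, so age alone does not separate survivors from non-survivors clearly. The 177 passengers with a missing age are excluded from this comparison.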

Gender and Survival:

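Women had a higher survival rate than men. The chi-square test (X-squared = 260.72, df = 1, p-value < 2.2e-16) indicates a strong association between gender and survival. The test itself only shows that the two variables are related; the direction of the relationship comes from the contingency table.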

Fare and Survival:

Passengers who paid higher fares tended to have better chances of survival. Survivors paid a mean fare of 48.40, compared with 22.12 for non-survivors. The t-test p-value is very small (p-value = 2.699e-11), indicating a significant difference in the average fare between the two groups.