library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_minimal())
titanic <- read.csv("data/titanic.csv")
# Recode the integer/character codes as labelled factors.
TITANIC_1 <- titanic %>%
  mutate(
    Survived = factor(Survived, levels = c(1, 0), labels = c("Yes", "No")),
    Pclass   = factor(Pclass, labels = c("1st", "2nd", "3rd")),
    Embarked = factor(Embarked, levels = c("C", "Q", "S"),
                      labels = c("Cherbourg", "Queenstown", "Southampton"))
  )
Data description:
Calculate the survival rate for male and female passengers. Visualize the survival rates using a bar plot.
# Mean of the 0/1 Survived indicator within each sex is the survival rate.
survival_rates <- aggregate(Survived ~ Sex, data = titanic, mean)
ggplot(survival_rates, aes(x = Sex, y = Survived, fill = Sex)) +
  geom_bar(stat = "identity") +
  labs(title = "Survival rate by sex", x = "Sex", y = "Survival rate") +
  theme_minimal()
Investigate the relationship between ticket class (Pclass) and survival. What is the survival rate for each passenger class? Visualize it using a stacked bar plot.
titanic %>%
  group_by(Pclass) %>%
  summarise(survival_rate = mean(Survived))
## # A tibble: 3 × 2
##   Pclass survival_rate
##    <int>         <dbl>
## 1      1         0.630
## 2      2         0.473
## 3      3         0.242
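The prompt also asks for a stacked bar plot, which the summary table above does not provide. A minimal sketch follows, assuming `Survived` is coded 0/1 and `Pclass` is coded 1/2/3 as in the output above. The helper name `plot_survival_by_class` is hypothetical and not part of the original analysis.

```r
library(tidyverse)

# Hypothetical helper: one stacked bar per ticket class, split by survival.
# Assumes `df` has a 0/1 Survived column and a Pclass column.
plot_survival_by_class <- function(df) {
  df %>%
    mutate(
      Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")),
      Pclass   = factor(Pclass)
    ) %>%
    ggplot(aes(x = Pclass, fill = Survived)) +
    # position = "fill" rescales every bar to height 1, so the segments
    # show within-class proportions rather than raw counts.
    geom_bar(position = "fill") +
    labs(title = "Survival by ticket class",
         x = "Ticket class", y = "Proportion of passengers")
}
```

Calling `plot_survival_by_class(titanic)` on the data frame loaded earlier should draw three bars whose "Yes" segments match the survival rates in the table.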
Analyze the age distribution of passengers on the Titanic. Compare the distribution for those who survived and those who did not using a boxplot.
titanic %>%
  group_by(Survived) %>%
  summarise(count = n())
## # A tibble: 2 × 2
## Survived count
## <int> <int>
## 1 0 549
## 2 1 342
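The counts above give the size of each group but not the age distribution the prompt asks about. One possible sketch of the boxplot is shown here, assuming the data has a numeric `Age` column with missing values stored as `NA` (the column name is an assumption, since it is not shown in this section). The helper name `plot_age_by_survival` is hypothetical.

```r
library(tidyverse)

# Hypothetical helper: boxplot of age for survivors vs. non-survivors.
# Assumes `df` has a numeric Age column and a 0/1 Survived column.
plot_age_by_survival <- function(df) {
  df %>%
    filter(!is.na(Age)) %>%   # drop passengers with unknown age
    mutate(Survived = factor(Survived, levels = c(0, 1),
                             labels = c("No", "Yes"))) %>%
    ggplot(aes(x = Survived, y = Age, fill = Survived)) +
    geom_boxplot() +
    labs(title = "Age distribution by survival",
         x = "Survived", y = "Age (years)")
}
```

`plot_age_by_survival(titanic)` would then show the two boxes side by side. Rows with missing ages are dropped explicitly so that the omission is visible in the code rather than reported only as a plotting warning.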
Analyze how family size (number of siblings, spouses, parents, or children aboard) affects the likelihood of survival. Visualize the relationship between family size and survival rate using an appropriate plot.
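No answer is given for this prompt, so the following is only a sketch of one approach. It assumes the data has the columns `SibSp` (siblings/spouses aboard) and `Parch` (parents/children aboard), and it defines family size as their sum plus one for the passenger. The column names, that definition, and the helper names are all assumptions.

```r
library(tidyverse)

# Hypothetical helper: survival rate for each family size.
# Assumes `df` has SibSp, Parch and a 0/1 Survived column.
family_survival <- function(df) {
  df %>%
    mutate(FamilySize = SibSp + Parch + 1) %>%  # +1 counts the passenger
    group_by(FamilySize) %>%
    summarise(survival_rate = mean(Survived), n = n(), .groups = "drop")
}

# Hypothetical helper: bar plot of the rates computed above.
plot_family_survival <- function(df) {
  ggplot(family_survival(df),
         aes(x = factor(FamilySize), y = survival_rate)) +
    geom_col() +
    labs(title = "Survival rate by family size",
         x = "Family size (including the passenger)", y = "Survival rate")
}
```

The `n` column is kept in the summary because rates for large families are likely to rest on very few passengers and should be read with caution.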
Visualize the age distribution of passengers based on their class (Pclass) and gender (Sex) using a facet grid plot.
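A facet grid for this prompt could be sketched as below, with sex on the rows and ticket class on the columns. It assumes a numeric `Age` column. The 5-year bin width and the helper name `plot_age_by_class_sex` are illustrative choices, not something the original specifies.

```r
library(tidyverse)

# Hypothetical helper: age histograms in a Sex-by-Pclass grid.
# Assumes `df` has Age, Sex and Pclass columns.
plot_age_by_class_sex <- function(df) {
  df %>%
    filter(!is.na(Age)) %>%   # drop passengers with unknown age
    ggplot(aes(x = Age)) +
    geom_histogram(binwidth = 5, boundary = 0) +
    # Rows are sexes, columns are classes; label_both prints "Pclass: 1" etc.
    facet_grid(Sex ~ Pclass, labeller = label_both) +
    labs(title = "Age distribution by class and sex",
         x = "Age (years)", y = "Number of passengers")
}
```

`plot_age_by_class_sex(titanic)` should give a 2 × 3 grid of panels that share both axes, which makes the panels directly comparable.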
Group passengers into age categories (children, adults, elderly) and calculate the survival rate for each age group. Visualize the survival rates using a bar plot. Hint: you may need to create a new column.
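The new column mentioned in the hint can be created with `cut()`. The sketch below assumes a numeric `Age` column and uses cut-offs of 18 and 65 years. Those thresholds are an assumption, since the prompt does not define the categories, and the helper names are hypothetical.

```r
library(tidyverse)

# Hypothetical helper: survival rate per age group.
# Assumed groups: Child [0, 18), Adult [18, 65), Elderly [65, Inf).
survival_by_age_group <- function(df) {
  df %>%
    filter(!is.na(Age)) %>%   # passengers with unknown age cannot be grouped
    mutate(AgeGroup = cut(Age, breaks = c(0, 18, 65, Inf), right = FALSE,
                          labels = c("Child", "Adult", "Elderly"))) %>%
    group_by(AgeGroup) %>%
    summarise(survival_rate = mean(Survived), n = n(), .groups = "drop")
}

# Hypothetical helper: bar plot of the rates computed above.
plot_survival_by_age_group <- function(df) {
  ggplot(survival_by_age_group(df),
         aes(x = AgeGroup, y = survival_rate, fill = AgeGroup)) +
    geom_col() +
    labs(title = "Survival rate by age group",
         x = "Age group", y = "Survival rate")
}
```

With `right = FALSE` each interval includes its lower bound, so a passenger aged exactly 18 is counted as an adult and one aged exactly 65 as elderly.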
chisq.test(table(titanic$Sex, titanic$Survived), correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: table(titanic$Sex, titanic$Survived)
## X-squared = 263.05, df = 1, p-value < 2.2e-16
Analyze the correlation between fare and survival. Use a scatter plot to visualize the relationship and calculate the correlation coefficient. Hint: use cor.test() to test the significance of the relationship.
Ans:
ggplot(titanic, aes(x = Fare, y = as.factor(Survived))) +
  geom_point() +
  labs(title = "Fare by survival status", x = "Fare", y = "Survived") +
  theme_minimal()
# The scatter plot above visualizes the relationship between fare and survival.
# Pearson correlation between fare and the 0/1 survival indicator
# (with a binary variable this is the point-biserial correlation).
correlation_test <- cor.test(titanic$Fare, titanic$Survived, method = "pearson")
correlation_test  # print the estimate, confidence interval, and p-value
# Interpretation: The positive correlation coefficient indicates a weak positive relationship between fare and survival, meaning that passengers who paid higher fares were slightly more likely to survive. The p-value is far below 0.05, so the correlation is statistically significant.