Joyce Fang Project 1

Part 1: Information about the Data

The dataset that I chose was the Titanic passenger dataset (https://www.kaggle.com/datasets/yasserh/titanic-dataset/data), which lists the name of each passenger, additional demographic information about them (sex, ticket fare, class), and whether they survived the Titanic. I wanted to see whether I could predict a passenger's survival from their demographic information, and whether any of those variables were correlated with survival. In particular, I wanted to test the correlation between ticket class (1st, 2nd, and 3rd) and survival rate to see whether higher socioeconomic status increased a passenger's chance of survival.

Qualitative Variables: Survived (Binary), Pclass (Ticket Class), Sex, Name, Ticket (Ticket Number), Cabin, Embarked (Port of Embarkation)

Numeric Variables: PassengerId, Age, SibSp (Siblings/Spouses), Parch (Parents/Children), Fare (Passenger Fare)

The source was not listed on Kaggle, but after extensive research, the data seems to have been derived from Encyclopedia Titanica.

Part 2: Loading Libraries + Loading Dataset

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer) #used for data visualization

#loading dataset 
df <- read_csv("Titanic-Dataset.csv")
Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view(df) to check if loaded in correctly 

print(colnames(df))
 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
 [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
[11] "Cabin"       "Embarked"   

Part 3: Preparing Data + Linear Regression

#Creating a model: linear regression of Survived on Pclass and Fare
model <- lm(Survived ~ Pclass + Fare, data = df)
summary(model)

Call:
lm(formula = Survived ~ Pclass + Fare, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8297 -0.2635 -0.2458  0.4035  0.7620 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.7309907  0.0599330  12.197  < 2e-16 ***
Pclass      -0.1643246  0.0219056  -7.501 1.53e-13 ***
Fare         0.0010003  0.0003686   2.714  0.00677 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4565 on 888 degrees of freedom
Multiple R-squared:  0.1219,    Adjusted R-squared:  0.1199 
F-statistic: 61.61 on 2 and 888 DF,  p-value: < 2.2e-16

Equation of model: Survival_rate = 0.7309907 - 0.1643246(Pclass) + 0.0010003(Fare)

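Both variables are statistically significant because their p-values are low (1.53e-13 for Pclass and 0.00677 for Fare, both well below 0.05). The coefficient for the fare variable is very small, but that partly reflects its units: it is the change in predicted survival for a one-unit increase in fare, so fare may still contribute only modestly to the model. However, the R-squared value is low (0.1219), meaning the model explains only about 12% of the variation in survival. This suggests that the model is not as robust as its overall p-value (< 2.2e-16), which is very significant, might imply.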
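As a rough check on how the fitted equation behaves, the sketch below plugs two made-up passengers into the coefficients reported in the summary output. The helper name predict_survival and the two example passengers are hypothetical and only for illustration; the coefficient values are the only part taken from the model above.

```r
# Coefficients copied from the summary(model) output above
b_intercept <- 0.7309907
b_pclass    <- -0.1643246
b_fare      <- 0.0010003

# Hypothetical helper (not part of the original analysis):
# applies the fitted linear equation to a class and a fare
predict_survival <- function(pclass, fare) {
  b_intercept + b_pclass * pclass + b_fare * fare
}

# Two made-up passengers: 1st class paying 80, 3rd class paying 8
first_class <- predict_survival(pclass = 1, fare = 80)
third_class <- predict_survival(pclass = 3, fare = 8)

# Effect of a 100-unit fare increase with class held fixed
fare_effect_100 <- b_fare * 100

print(round(c(first_class, third_class, fare_effect_100), 3))
```

Under these assumptions, the first-class passenger gets a predicted value of about 0.65 and the third-class passenger about 0.25, while a 100-unit increase in fare adds only about 0.10. This is consistent with class carrying more of the prediction than fare.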

Part 4: Exploring the Data + Analyzing the Model

#To analyze the data through visualizations, a continuous variable has to be created
#Group by Pclass and Fare, then compute the survival rate within each group
df1 <- df %>% 
  group_by(Pclass, Fare) %>% 
  summarize(survival_rate = mean(Survived))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by Pclass and Fare.
ℹ Output is grouped by Pclass.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(Pclass, Fare))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
#Plot looking at the relationship between survival rate and class
df1 %>% 
  ggplot(aes(x = Pclass, y = survival_rate)) + 
  geom_point()

#there definitely is a correlation between survival rate and class, could it look more defined with a box plot? 
df1 %>% 
  ggplot(aes(x = as.factor(Pclass), y = survival_rate)) + #as.factor puts one labeled box per class on the x-axis
  geom_boxplot()

#medians decrease as Pclass increases (passengers are poorer)
#I thought that a histogram would be the best choice because I can see the number of passengers at each fare as a bar

#looking at distribution of fare 
df %>% 
  ggplot(aes(x = Fare)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

#looking at correlation between class and fare 
df %>% 
  ggplot(aes(x = Fare, group = Pclass, fill = as.factor(Pclass))) + 
  geom_histogram() + 
  scale_fill_brewer(palette = "Pastel1")
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

df %>% 
  ggplot(aes(x = Fare, y = Survived, color = as.factor(Pclass))) + #using as.factor so scale_color_brewer gets a discrete variable
  geom_point() + 
  scale_color_brewer(palette = "Dark2", name = "Passenger Class") +
  labs(
    x = "Fare",
    y = "Survival Rate",
    title = "Fare vs. Survival Rate on the Titanic",
    caption = "Data Source: Encyclopedia Titanica"
  ) + theme_minimal() 

#No actual information can be gathered because each point is only at 1 and 0 

#Trying to create a survival rate 
#Looking at the correlation between survival rate and fare 
df1 <- df %>% 
  group_by(Pclass, Fare) %>% 
  summarize(survival_rate = mean(Survived))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by Pclass and Fare.
ℹ Output is grouped by Pclass.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(Pclass, Fare))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
df1 %>% 
  ggplot(aes(x = Fare, y = survival_rate, color = as.factor(Pclass))) + #using as.factor so scale_color_brewer gets a discrete variable
  geom_point() + 
  scale_color_brewer(palette = "Dark2", name = "Passenger Class") +
  labs(
    x = "Fare",
    y = "Survival Rate",
    title = "Fare vs. Survival Rate on the Titanic",
    caption = "Data Source: Encyclopedia Titanica"
  ) + theme_minimal() 

#Still too many data points at the top and bottom, which introduces too much variability; trying to group the ticket fares into 10-dollar intervals next

By looking at these exploratory graphs and data visualizations, I see that there is a strong correlation between survival rate and class. However, there may not be a strong correlation between fare and survival rate, because the groups with fares above 200 are highly variable: they contain so few passengers that even one passenger being unlucky would significantly change the graph. This means that the model could be more accurate at predicting survival for passengers with fares of $200 or below, where there is a lot less variability.
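The point about one unlucky passenger can be made concrete with a little arithmetic: in a group of n passengers, changing one outcome shifts that group's survival rate by 1/n. The group sizes below are made-up illustrations, not counts taken from the dataset.

```r
# Hypothetical group sizes (not taken from the Titanic data)
group_sizes <- c(2, 5, 50)

# Changing one passenger's outcome moves the group mean by 1/n
shift_per_passenger <- 1 / group_sizes

print(data.frame(n = group_sizes, shift = shift_per_passenger))
```

A group of 2 moves by 0.5 when one outcome flips, while a group of 50 moves by only 0.02, which is why the sparse high-fare groups look so erratic.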

Part 5: Creating a Plot

#I ultimately decided on creating a scatterplot to see the correlation between fare and survival rate, while also including passenger class and the number of passengers in each group

#in order to create a graph that communicates the information, I have to round each fare to the nearest 10 (as seen by the exploration process)
df2 <- df %>% 
  mutate(fare_rounded = round(Fare/10) * 10) %>% 
  group_by(fare_rounded, Pclass) %>% 
  summarize(survival_rate = mean(Survived), n = n())
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by fare_rounded and Pclass.
ℹ Output is grouped by fare_rounded.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(fare_rounded, Pclass))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
df2 %>% 
  ggplot(aes(x = fare_rounded, y = survival_rate, color = as.factor(Pclass), size = n)) + #using as.factor so scale_color_brewer gets a discrete variable
  geom_point() + 
  scale_color_brewer(palette = "Dark2") +
  labs(
    x = "Fare",
    y = "Survival Rate",
    title = "Fare vs. Survival Rate on the Titanic",
    caption = "Data Source: Encyclopedia Titanica", 
    size = "Number of Datapoints", 
    color = "Passenger Class"
  ) + theme_minimal()

Part 6: Brief Essay

Most of the data preparation went into the visualization, because a scatterplot needs a numeric way to describe whether or not someone survived. The dataset itself was already well structured, so no data cleaning was needed to perform the linear regression. However, plotting the Survived variable directly as the y-variable would be very hard to understand, since it is binary and every point sits at either 0 or 1. To get a numeric variable that could actually communicate information, I grouped the passengers and computed a survival rate for each group.
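Averaging works here because the mean of a 0/1 variable is the proportion of 1s. A minimal sketch with a made-up six-passenger data frame follows; the name toy and its values are hypothetical, and base R's tapply stands in for the group_by and summarize pipeline used in the analysis.

```r
# Made-up toy data: 1 = survived, 0 = did not survive
toy <- data.frame(
  Pclass   = c(1, 1, 1, 3, 3, 3),
  Survived = c(1, 1, 0, 0, 0, 1)
)

# Mean of the binary column within each class = share who survived
rate_by_class <- tapply(toy$Survived, toy$Pclass, mean)
print(rate_by_class)
```

In this toy data, two of the three first-class passengers survived, so the rate for class 1 is 2/3, and one of three in class 3 gives 1/3.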

Additionally, all fares were rounded to the nearest 10 by creating a new column called fare_rounded with the mutate function, and the data was then grouped by class and rounded fare. This decreased the number of data points on the graph so that it communicates the correlation better. The size of each group is denoted by the size of its point on the scatterplot. The visualization is then able to describe the correlation between fare and survival rate on the Titanic clearly, while also showing the number of passengers behind each point and the passenger class.
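To illustrate the rounding step, the sketch below applies the same round(Fare/10) * 10 rule to a handful of made-up fares and counts how many land in each bin (the example values are assumptions, not rows from the dataset). One detail worth knowing is that R's round() sends exact halves to the even digit, so a fare of exactly 25 falls into the 20 bin rather than the 30 bin.

```r
# Made-up example fares (not rows from the dataset)
fares <- c(7.25, 8.05, 25, 26.55, 71.28, 512.33)

# Same rounding rule as the mutate() step: nearest multiple of 10
fare_rounded <- round(fares / 10) * 10

# Count how many fares fall into each bin (the n used for point size)
bin_counts <- table(fare_rounded)

print(fare_rounded)
print(bin_counts)
```

Here the two lowest fares share the 10 bin, so that point would be drawn larger than the others, while the fare near 512 sits alone in the 510 bin.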

The visualization was a little surprising, given the outlier in the 500s range compared with the rest of the data. It is also interesting how much variability there was in 1st class: because so few people were part of this minority, one person being unlucky on the Titanic would disrupt the whole pattern. I wish that I could have included gender in some way, but I ultimately decided against it because of how much information the graph already communicates and because I wanted it to stay easy to understand.