2025-03-25

Titanic Dataset Overview

The dataset has 12 variables in total. I consider 5 of them meaningful and suitable for simple linear regression and hypothesis testing:

  • Survived
    • 0 = Didn’t survive
    • 1 = Survived
  • Pclass
    • Passenger’s ticket class, which represents the passenger’s economic status
    • 1 = First class
    • 2 = Second class
    • 3 = Third class
  • Sex
    • “male” and “female”
  • Age
    • Passengers’ ages
  • Fare
    • Ticket price that passengers paid

Simple Linear Regression

Simple linear regression models the relationship between 2 quantitative variables. It is expressed by this equation:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

For the first analysis, I’ll explore the relationship between Survived and Fare. The regression model in this case is:

\[ Survived = \beta_0 + \beta_1 \times Fare \]

After fitting the model (the code is shown in the next section), we get:

\[ Survived = 0.3197 + 0.0025 \times Fare \]
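
The fitted values come from ordinary least squares, which is what `lm()` computes. As a reminder of what that calculation does, the standard closed-form estimates for a single predictor are given here. The bars and hats (\(\bar{x}\), \(\bar{y}\), \(\hat{\beta}\)) are notation introduced for this note and are not used elsewhere in the report; \(x\) stands for Fare and \(y\) for Survived:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]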

Code Snippet for Point Estimate (Survived and Fare)

Below is the R code for the linear regression model:

fare_LRM <- lm(Survived ~ Fare, data = ttdf_filtered)
#summary(fare_LRM)

Interpretation

Intercept = 0.3197: when Fare is 0, the predicted survival probability is around 31.97%. Fare coefficient = 0.00249: each 1-unit increase in Fare raises the predicted survival probability by about 0.249 percentage points.
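
To make the interpretation concrete, the fitted line can be evaluated by hand. This is a minimal sketch, assuming the rounded estimates 0.3197 and 0.00249 (the midpoints of the `confint()` intervals reported in this document). The fare values are hypothetical examples chosen for illustration, not passengers from the dataset:

```r
# Point estimates recovered from the model output (rounded)
b0 <- 0.3197    # intercept
b1 <- 0.00249   # Fare coefficient

# Hypothetical fares, for illustration only
fares <- c(0, 10, 50, 100)

# Predicted survival probability from the fitted line
predicted <- b0 + b1 * fares
data.frame(Fare = fares, Predicted = predicted)
```

With the actual model object, `predict(fare_LRM, newdata = data.frame(Fare = fares))` should give the same values up to rounding.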

Code Snippet for Confidence Interval (Survived and Fare)

confint(fare_LRM)
##                   2.5 %      97.5 %
## (Intercept) 0.278114862 0.361384438
## Fare        0.001832389 0.003148964
#summary(fare_LRM)
t_alpha <- qt(0.975, df = 712)
t_alpha
## [1] 1.963301

Confidence Interval

\[ \text{CI} = \beta_1 \pm t_{\alpha/2} \times \text{SE}(\beta_1) \]

Using the estimate and standard error of the Fare coefficient from summary(fare_LRM), we get:

\[ \text{CI} = 0.0024907 \pm 1.963301 \times 0.0003353 \]

\[ = [0.001832, 0.003149] \]
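
The same arithmetic can be checked in R without the dataset. This is a sketch that hard-codes the estimate, standard error, and degrees of freedom quoted above; the variable names (`beta1_hat`, `se_beta1`, `df_resid`) are chosen for this example only:

```r
# Values quoted in this document for the Fare coefficient
beta1_hat <- 0.0024907   # point estimate
se_beta1  <- 0.0003353   # standard error
df_resid  <- 712         # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Margin of error and interval bounds
margin <- t_crit * se_beta1
ci <- c(lower = beta1_hat - margin, upper = beta1_hat + margin)
round(ci, 6)
```

The rounded bounds should agree with the Fare row of the `confint(fare_LRM)` output.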

Interpretation

Confidence Intervals: At the 95% confidence level, the true intercept lies between about 0.2781 and 0.3614, and the true Fare coefficient lies between about 0.0018 and 0.0031. Neither interval contains 0, so both estimates are significantly positive.

P-value: For both the intercept (< 2e-16) and Fare (3.16e-13), the P-values are extremely small, which means both are statistically significant. In particular, the P-value of Fare is far below 0.05, so we reject the null hypothesis that its coefficient is 0 and conclude that Fare has a statistically significant association with survival probability.
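
The P-value for Fare follows from the t statistic, which is the estimate divided by its standard error. This is a sketch using the values quoted in this document (estimate 0.0024907, standard error 0.0003353, 712 degrees of freedom); the variable names are chosen for this example only:

```r
# Values quoted in this document for the Fare coefficient
beta1_hat <- 0.0024907
se_beta1  <- 0.0003353
df_resid  <- 712

# t statistic for H0: beta_1 = 0
t_stat <- beta1_hat / se_beta1

# Two-sided P-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

t_stat
p_value
```

Because the inputs are rounded, the result should land close to, but not necessarily exactly on, the 3.16e-13 reported by `summary(fare_LRM)`.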

Fare and Survival (Scatter Plot)

## `geom_smooth()` using formula = 'y ~ x'

Age and Survival (Scatter Plot)

## `geom_smooth()` using formula = 'y ~ x'

Passenger Class and Survival (Stacked Bar Plot)

Gender and Survival (Stacked Bar Plot)

As the plots show, passengers who were female or who traveled in a higher passenger class were more likely to survive.

3D Scatter Plot (Fare & Age & Survived)

I picked “Fare” and “Age” because their values vary over a much wider range than Pclass, which only takes the values 1 to 3.