Abstract

This study is about social class and survival chance on Titanic. Titanic was called ‘unsinkable’ luxurious ship yet sank in North Atlantic Ocean on April 15th 1912 after it struck an iceberg. And there were nearly 1501 of her 2207 passengers and crew loss because of very limited lifeboats for the passengers. It is well-known ‘Women and Children First’. For this result, we want to see if relationship in these variable by using statistical analysis.

Research questions

  1. Does ‘Women First’ true?

  2. Does woman increase the likelihood of a person getting more survival chance?

  3. Does woman increase the likelihood of a person getting more survival chance in each social classes?

  4. Does ‘Children First’ true?

  5. Does ‘Children First’ equaly apply in each social classes?

  6. Does missing passenger ‘survived’ predictable?

suppressMessages(suppressWarnings(library('tidyverse')))
suppressMessages(suppressWarnings(library('data.table')))
suppressMessages(suppressWarnings(library('dplyr')))
suppressMessages(suppressWarnings(library('ggplot2')))
suppressMessages(suppressWarnings(library('reshape2')))
suppressMessages(suppressWarnings(library('scales')))
suppressMessages(suppressWarnings(library('rmarkdown')))
suppressMessages(suppressWarnings(library('tidyr')))
suppressMessages(suppressWarnings(library('plotly')))
suppressMessages(suppressWarnings(library('psych')))
suppressMessages(suppressWarnings(library('aod')))

Data Preparation

Data Collection: The data collected by individual-level. link

Cases: There are 1313 observations of 6 cases in the data set.

Variables: Name, PClass, Age, Sex, Survived, SexCode

  • numerical variables: Age

  • categorical variables: PClass, Sex, Survived, SexCode

Type of study: This is experience study in which we want to investigate if sexual, class and age are matter in Titanic survival case.

Scope of inference - generalizability: Population size is 2207, but we only have 1313 sample data on hand.


Titanic Data

I construct a table to summerize data by class and sex in the following. It gives me overview of 3 social classes.

#removed the index while load the data set
Titanic<-read.csv(url("https://raw.githubusercontent.com/czhu505/606-project-proposal-/master/Titanic.csv"), stringsAsFactors = FALSE)[-1]
summary(Titanic)
##      Name              PClass               Age            Sex           
##  Length:1313        Length:1313        Min.   : 0.17   Length:1313       
##  Class :character   Class :character   1st Qu.:21.00   Class :character  
##  Mode  :character   Mode  :character   Median :28.00   Mode  :character  
##                                        Mean   :30.40                     
##                                        3rd Qu.:39.00                     
##                                        Max.   :71.00                     
##                                        NA's   :557                       
##     Survived         SexCode      
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.3427   Mean   :0.3519  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000  
## 
str(Titanic)
## 'data.frame':    1313 obs. of  6 variables:
##  $ Name    : chr  "Allen, Miss Elisabeth Walton" "Allison, Miss Helen Loraine" "Allison, Mr Hudson Joshua Creighton" "Allison, Mrs Hudson JC (Bessie Waldo Daniels)" ...
##  $ PClass  : chr  "1st" "1st" "1st" "1st" ...
##  $ Age     : num  29 2 30 25 0.92 47 63 39 58 71 ...
##  $ Sex     : chr  "female" "female" "male" "female" ...
##  $ Survived: int  1 0 0 0 1 1 1 0 1 0 ...
##  $ SexCode : int  1 1 0 1 0 0 1 0 1 0 ...
#passenger_Class: included count numbers and ratios in classes(1 row in PClass with unknow value '*' will be excludsive in the analysis.)
T1 <- data.frame(Titanic$PClass,Titanic$Survived, Titanic$SexCode)

C1<-T1$Titanic.PClass=="1st"
C2<-T1$Titanic.PClass=="2nd"
C3<-T1$Titanic.PClass=="3rd"

S<-T1$Titanic.Survived>0
M<-T1$Titanic.SexCode<1
W<-T1$Titanic.SexCode>0

C1_M<-T1[which(C1 & M),] #dataset for 1st Class men
C2_M<-T1[which(C2 & M),] #dataset for 2nd Class men
C3_M<-T1[which(C3 & M),] #dataset for 3rd Class men
C1_W<-T1[which(C1 & W),] #dataset for 1st Class women
C2_W<-T1[which(C2 & W),] #dataset for 2nd Class women
C3_W<-T1[which(C3 & W),] #dataset for 3rd Class women

m1<-nrow(C1_M) #number of 1st class men
m2<-nrow(C2_M) #number of 2nd class men
m3<-nrow(C3_M) #number of 3rd class men
w1<-nrow(C1_W) #number of 1st class women
w2<-nrow(C2_W) #number of 2nd class women
w3<-nrow(C3_W) #number of 3rd class women


C1_S_M<-T1[which(C1 & S & M),] #dataset for 1st Class survial men
C2_S_M<-T1[which(C2 & S & M),] #dataset for 2nd Class survial men
C3_S_M<-T1[which(C3 & S & M),] #dataset for 3rd Class survial men
C1_S_W<-T1[which(C1 & S & W),] #dataset for 1st Class survial women
C2_S_W<-T1[which(C2 & S & W),] #dataset for 2nd Class survial women
C3_S_W<-T1[which(C3 & S & W),] #dataset for 3rd Class survial women

m_s1<-nrow(C1_S_M) #number of 1st class survial men
m_s2<-nrow(C2_S_M) #number of 2nd class survial men
m_s3<-nrow(C3_S_M) #number of 3rd class survial men
w_s1<-nrow(C1_S_W) #number of 1st class survial women
w_s2<-nrow(C2_S_W) #number of 2nd class survial women
w_s3<-nrow(C3_S_W) #number of 3rd class survial women

t<-matrix(c(m1,m2,m3,w1,w2,w3,m_s1,m_s2,m_s3,w_s1,w_s2,w_s3),ncol = 4) #creat a matrix 
colnames(t)<-c('Men','Women','Saved_Men','Saved_Women')
row.names(t)<-c('1st','2nd','3rd')
passenger_Class<-as.data.frame(t) # creat passenger_Class data frame
passenger_Class$Sum_inClass<-c(322,279,711) #add column Sum_inClass 
passenger_Class$Saved_inClass<- passenger_Class$Saved_Men + passenger_Class$Saved_Women #add column Saved_inClass
#print(passenger_Class[,1:6]) # 5.1


passenger_Class$Men_inClass<-passenger_Class$Men/passenger_Class$Sum_inClass #add column Men_inClass
passenger_Class$Women_inClass<-1-passenger_Class$Men_inClass #add column Women_inClass

passenger_Class$Saved_Men_inClass<-passenger_Class$Saved_Men/passenger_Class$Men #add column Saved_Men_inClass
passenger_Class$Saved_Women_inClass<-passenger_Class$Saved_Women/passenger_Class$Women #add column Saved_Women_inClass

passenger_Class$TotalSaved_inClass<-passenger_Class$Saved_inClass/passenger_Class$Sum_inClass #add column TotalSaved_inClass
passenger_Class$TotalSaved<-passenger_Class$Saved_inClass/1313 #add column TotalSaved
#print(passenger_Class[,7:12]) 


#Removed 557 NA from Age,remained 756 observations
T2 <- na.omit(data.frame(Titanic$Age,Titanic$PClass,Titanic$Sex, Titanic$Survived, Titanic$SexCode)) #remove NA and * dataset
vPClass<-as.integer(T2$Titanic.PClass)-1 #PClass is changed factor to numeric data type 
passenger_Age<-data.frame(T2$Titanic.Age,vPClass,T2$Titanic.Sex,T2$Titanic.Survived,T2$Titanic.SexCode) # creat passenger_Age data frame
names(passenger_Age)<-c("Age","PClass","Sex","Survived","SexCode") #rename column names


# Creat passenger_Age_Survived subset for Surivals (313 out of 756 samples)
passenger_Age_Survived<-subset(passenger_Age,passenger_Age$Survived>0)
passenger_Age_Survived[4]<-NULL #remove 3rd column 
names(passenger_Age_Survived)<-c("Age_S","PClass_S","Sex_S","SexCode_S") #rename column names

Summery in quantity:

By columns:

  • The numbers of Men and women is more even in higher classes. Men and women ’s ratios in 1st,2nd and 3rd classes are 1.25, 1.6, and 2.35 respectively. There are more survival women than men in each class.

By rows:

  • Numbers of men and women in 3rd class are more than numbers of men and women in other classes. The 3rd class men are 2.79 times of men in 1st class and 2.9 times of men in 2nd class. The 3rd class women are 1.48 times of women in 1st class, 1.98 times of women in 2nd class.

  • Total number of survived man and women in 1st class are higher than other two classes.

  • The numbers of survivals in each class look fairly close.

print(passenger_Class[,1:6])
##     Men Women Saved_Men Saved_Women Sum_inClass Saved_inClass
## 1st 179   143        59         134         322           193
## 2nd 172   107        25          94         279           119
## 3rd 499   212        58          80         711           138

Summery in ratios:

By columns:

The variance of ratios between Men and women is increasing from high class to low class. The percentage survival in women are much higher chance than men in each class.

By rows:

  • Majority women in 1st and 2nd class were rescued.

  • Total number of survived man and women in 1st class are higher than other classes, nearly 40% 3rd class women were rescued.

  • The percentage of total survivals in 1st class is pretty high comparing to 2nd and 3rd classes percentage, according to the percentage of total saved ratios in 60%, 43% and 19% respectively.

  • Though percentages based on total amount of passengers in each class is fairly close, the percentage of survival in each class show higher class has higher survival chance.

print(passenger_Class[,7:12])
##     Men_inClass Women_inClass Saved_Men_inClass Saved_Women_inClass
## 1st   0.5559006     0.4440994         0.3296089           0.9370629
## 2nd   0.6164875     0.3835125         0.1453488           0.8785047
## 3rd   0.7018284     0.2981716         0.1162325           0.3773585
##     TotalSaved_inClass TotalSaved
## 1st          0.5993789 0.14699162
## 2nd          0.4265233 0.09063214
## 3rd          0.1940928 0.10510282

Statistical Analysis

Pooled Proportion & Hypothesis Testing

“Woman First”?

The following I use Hypothesis tests for pooled proportion \(p^f\) - \(p^m\).

Before staring, it requires two conditions: the sample observations are independent and at least 10 sucesses and 10 failure in the female and male sample groups (success-failure condition). If these two conditions are met, then the sampling distribution of p is nearly normal with mean p and standard error SE.

table(Titanic$Survived, Titanic$Sex)
   
    female male
  0    154  709
  1    308  142

The following result shows the difference of female survived rate over man survived rate.

n.f<-154+308
n.m<-709+142
f<-308
m<-142
p.f<-f/n.f
p.m<-m/n.m

SEsq.f<-p.f*(1-p.f)/n.f
SEsq.m<-p.m*(1-p.m)/n.m
SE<-sqrt(SEsq.f+SEsq.m)

diff<-p.f-p.m
diff
## [1] 0.4998042

Simulation for Casual Relationship

Does woman increase the likelihood of a person getting more survial chance?

The following, I use simulation to approach for investigating whether or not woman increase survival chance is to use a randomization technique.I write ‘0’ as dead on 863 cards representing a passenger who were dead of the study, and ‘1’ as ‘alive’ on 450 cards representing who were not. Then, we shuffle these cards and split them into two groups: one group of size 462 representing female, and another group of size 851 representing male.I calculate the difference between the proportion of ‘1’ cards in the female and male groups and record (alivefemal - alivemale) value. I repeat this 1000 times to build a distribution centered at 0. By one more step, I calculate the fraction of simulations where the simulated differences in proportions are 0 or higher. If this fraction low, I conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.

n.dead <- 863
n.alive <- 450
n.female <- 462
n.male <- 851
n.samples <- 1000

cards <- c(rep(0, n.dead), rep(1, n.alive))

diff<-0
FemaleAndAlive<-0
MaleAndAlive<-0
 
set.seed(1234) # To reproduce exact results

simulation <- data.frame()
for(i in seq_len(n.samples)) {
       test <- data.frame(Survived=cards, Sex='male', stringsAsFactors = FALSE)
       test[sample(nrow(test), n.female ),]$Sex <-'female'
       
    # prop.table(table(Titanic$Survived, Titanic$Sex))
    simulation <- rbind(simulation, data.frame(
        iter = i,
        diff = (sum(test$Survived==1 & test$Sex == 'female') / n.female)-(sum(test$Survived==1 & test$Sex == 'male') / n.male), 
        stringsAsFactors = FALSE))
    simulation
}

ggplot(simulation, aes(x = diff,fill = (diff > 0)))+geom_histogram()+stat_bin(bins = 1000)+
   labs(x="propotion of survial (femal-male)", title='Difference of survial chance from femal to male')+
   geom_density() 

sum(simulation$diff > 0 ) / nrow(simulation)
[1] 0.506

From the graph result, I have seen 50% true for ‘a woman has more suvivial chance than a man’. And the fraction is more than 50%, I conclude woman increase the likelihood of a person getting more survial chance.


“Woman First” in all social classes? Does ‘Sex’ has casual relationship with ‘Survived’ ?

I use same Hypothesis Test and simulation technique apply for three social classes.

1st Class:**
class1<- Titanic[Titanic$PClass %in% c("1st"),]
table(class1$Survived, class1$Sex)
   
    female male
  0      9  120
  1    134   59
prop.table(table(class1$Survived, class1$Sex))
   
        female       male
  0 0.02795031 0.37267081
  1 0.41614907 0.18322981
n.f<-9+134
n.m<-120+59
f<-120
m<-59
p.f<-f/n.f
p.m<-m/n.m

SEsq.f<-p.f*(1-p.f)/n.f
SEsq.m<-p.m*(1-p.m)/n.m
SE<-sqrt(SEsq.f+SEsq.m)

diff<-p.f-p.m
diff
## [1] 0.5095519

Hypothesis tests for \(p^f\) - \(p^m\)

\(H_0:\) The female survived rate was 50% higher than man survived rate.

\(H_1:\) The female survived rate was not 50% higher than man survived rate.

#1.96 is 95% confident interval
c(diff-SE*1.96, diff+SE*1.96)
## [1] 0.4180743 0.6010295
z<-diff/SE
pnorm(z)
## [1] 1

Conclusion: The female survived rate was 50% higher than man survived rate.I am 95% confident that the survived rate changes between 42% to 60%. Since p-value is larger than 0.05, I do not reject the null hypothesis. Therefore, “Woman First” is True in 1st class.


n.dead <- 129
n.alive <- 193
n.female <- 143
n.male <- 179
n.samples <- 1000

cards <- c(rep(0, n.dead), rep(1, n.alive))

diff<-0
FemaleAndAlive<-0
MaleAndAlive<-0
 
set.seed(1234) # To reproduce exact results

simulation <- data.frame()
for(i in seq_len(n.samples)) {
       test <- data.frame(Survived=cards, Sex='male', stringsAsFactors = FALSE)
       test[sample(nrow(test), n.female ),]$Sex <-'female'
       
    # prop.table(table(Titanic$Survived, Titanic$Sex))
    simulation <- rbind(simulation, data.frame(
        iter = i,
        diff = (sum(test$Survived==1 & test$Sex == 'female') / n.female)-(sum(test$Survived==1 & test$Sex == 'male') / n.male), 
        stringsAsFactors = FALSE))
    simulation
}

ggplot(simulation, aes(x = diff,fill = (diff > 0)))+geom_histogram()+stat_bin(bins = 100)+
   labs(x="propotion of survial (femal-male)", title='Difference of survial chance from femal to male')+
   geom_density() 

sum(simulation$diff > 0 ) / nrow(simulation)
[1] 0.501

From 1000 simulation results, I have seen nearly 50% true for ‘a woman has more suvivial chance than a man’. And the fraction is more than 50%, I conclude woman increase the likelihood of a person getting more survial chance in 1st class.


2nd Class:
class2<- Titanic[Titanic$PClass %in% c("2nd"),]
table(class2$Survived, class2$Sex)
   
    female male
  0     13  147
  1     94   25
prop.table(table(class2$Survived, class2$Sex))
   
        female       male
  0 0.04659498 0.52688172
  1 0.33691756 0.08960573
n.f<-13+94
n.m<-147+25
f<-94
m<-25
p.f<-f/n.f
p.m<-m/n.m

SEsq.f<-p.f*(1-p.f)/n.f
SEsq.m<-p.m*(1-p.m)/n.m
SE<-sqrt(SEsq.f+SEsq.m)

diff<-p.f-p.m
diff
## [1] 0.7331558
n.dead <- 160
n.alive <- 119
n.female <- 143
n.male <- 107
n.samples <- 1000

cards <- c(rep(0, n.dead), rep(1, n.alive))

diff<-0
FemaleAndAlive<-0
MaleAndAlive<-0
 
set.seed(1234) # To reproduce exact results

simulation <- data.frame()
for(i in seq_len(n.samples)) {
       test <- data.frame(Survived=cards, Sex='male', stringsAsFactors = FALSE)
       test[sample(nrow(test), n.female ),]$Sex <-'female'
       
    # prop.table(table(Titanic$Survived, Titanic$Sex))
    simulation <- rbind(simulation, data.frame(
        iter = i,
        diff = (sum(test$Survived==1 & test$Sex == 'female') / n.female)-(sum(test$Survived==1 & test$Sex == 'male') / n.male),
        stringsAsFactors = FALSE))
    simulation
}

ggplot(simulation, aes(x = diff,fill = (diff > 0)))+geom_histogram()+stat_bin(bins = 100)+
   labs(x="propotion of survial (femal-male)", title='Difference of survial chance from femal to male')+geom_density() 

sum(simulation$diff > 0 ) / nrow(simulation)
[1] 0.01

From 1000 simulation results, I have seen very small chance for ‘a woman has more suvivial chance than a man’. And the fraction is very low. I conclude woman in 2nd class do not increase the likelihood of a person getting more survial chance.


3rd Class:
class3<- Titanic[Titanic$PClass %in% c("3rd"),]
table(class3$Survived, class3$Sex)
   
    female male
  0    132  441
  1     80   58
prop.table(table(class3$Survived, class3$Sex))
   
        female       male
  0 0.18565401 0.62025316
  1 0.11251758 0.08157525
n.f<-132+80
n.m<-441+58
f<-80
m<-142
p.f<-f/n.f
p.m<-m/n.m

SEsq.f<-p.f*(1-p.f)/n.f
SEsq.m<-p.m*(1-p.m)/n.m
SE<-sqrt(SEsq.f+SEsq.m)

diff<-p.f-p.m
diff
## [1] 0.09278935
n.dead <- 573
n.alive <- 138
n.female <- 212
n.male <- 599
n.samples <- 1000

cards <- c(rep(0, n.dead), rep(1, n.alive))

diff<-0
FemaleAndAlive<-0
MaleAndAlive<-0
 
set.seed(1234) # To reproduce exact results

simulation <- data.frame()
for(i in seq_len(n.samples)) {
       test <- data.frame(Survived=cards, Sex='male', stringsAsFactors = FALSE)
       test[sample(nrow(test), n.female ),]$Sex <-'female'
       
    # prop.table(table(Titanic$Survived, Titanic$Sex))
    simulation <- rbind(simulation, data.frame(
        iter = i,
        diff = (sum(test$Survived==1 & test$Sex == 'female') / n.female)-(sum(test$Survived==1 & test$Sex == 'male') / n.male), 
        stringsAsFactors = FALSE))
    simulation
}

ggplot(simulation, aes(x = diff,fill = (diff > 0)))+geom_histogram()+stat_bin(bins = 100)+
   labs(x="propotion of survial (femal-male)", title='Difference of survial chance from femal to male')+
   geom_density() 

sum(simulation$diff > 0 ) / nrow(simulation)
[1] 0.914

From 1000 simulation results, I have seen very small chance for ‘a woman has more suvivial chance than a man’. And the fraction is very high. I conclude woman in 3rd class do not increase the likelihood of a person getting more survial chance.


T Test

“Children First”?

There are 557 data missing age. I splited indepent data in to two groups, ‘alive’ and ‘dead’. And I need to find two distribution alive in age and dead in age, and see if they are nearly normal.

AgeTitanic<-na.omit(read.csv(url("https://raw.githubusercontent.com/czhu505/606-project-proposal-/master/Titanic.csv"), stringsAsFactors = FALSE)[-1])

by(AgeTitanic$Age, AgeTitanic$Survived, length)
AgeTitanic$Survived: 0
[1] 443
-------------------------------------------------------- 
AgeTitanic$Survived: 1
[1] 313
AgeTitanic1<-AgeTitanic[which(AgeTitanic$Survived>0),] 

AgeTitanic0<- AgeTitanic[which(AgeTitanic$Survived<1),]

qplot(AgeTitanic1$Age,geom="histogram",binwidth = 0.5, main = "Histogram for alive in Age ",fill=I("steelblue"))

qplot(AgeTitanic0$Age,geom="histogram",binwidth = 0.5, main = "Histogram for dead in Age ",fill=I("steelblue"))


Make a side-by-side boxplot of Age and survived. The box plots show how the medians of the two distributions compare.

plot_ly(y = ~AgeTitanic$Age, x= ~AgeTitanic$Survived, type = "box",color = ~AgeTitanic$Survived)

From the box plot, there are some outliner in ‘0’, but with 443 observations in the group, this outlier is not a concern. However, I don’t see if the age has statistically significant difference for the ages between alive and dead. Therefore, I use t test to value two means. We can compute the group means and group variance of the distribution.

We use the hypotheses for testing if the average ages of alive and dead are different.

\(H_0\): There is no difference statistically significant in Age.

\(H_1\): There is difference statistically significant in Age.

t.test(AgeTitanic1$Age,AgeTitanic0$Age)

    Welch Two Sample t-test

data:  AgeTitanic1$Age and AgeTitanic0$Age
t = -1.648, df = 615.49, p-value = 0.09987
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.8838049  0.3396334
sample estimates:
mean of x mean of y 
 29.35958  31.13167 

Conclusion: The t test result shows there are 95% confident that difference between alive and dead in age is within (-3.8838049,0.3396334) years. Sicne $p_value$ > 0.05, we reject H0. There is statistically significant difference for Age from alive and dead.


Correlations & Least Squares

It is not clear to see correlations between ‘Age’ and ‘Survived’. The following I create a table with Survived Rate in splited Age, then find their Correlations.

Age10<-subset(AgeTitanic, Age %in% 0:10)
Age20<-subset(AgeTitanic, Age %in% 11:20)
Age30<-subset(AgeTitanic, Age %in% 21:30)
Age40<-subset(AgeTitanic, Age %in% 31:40)
Age50<-subset(AgeTitanic, Age %in% 41:50)
Age60<-subset(AgeTitanic, Age %in% 51:60)
Age70<-subset(AgeTitanic, Age %in% 61:70)
Age80<-subset(AgeTitanic, Age %in% 71:80)


y<-function(x,z){
  n<-nrow(x)
   tab<-table(x$Survived, x$PClass)
   tab<- cbind(tab, Sum = rowSums(tab)) 
   tab<- cbind(tab, Age = z)  
   tab<- cbind(tab, Total=n) 
   tab
}
t10<-y(Age10,10)
t20<-y(Age20,20)
t30<-y(Age30,30)
t40<-y(Age40,40)
t50<-y(Age50,50)
t60<-y(Age60,60)
t70<-y(Age70,70)
t80<-y(Age80,80)

a1<-bind_rows(t10[2,],t20[2,],t30[2,],t40[2,],t50[2,],t60[2,],t70[2,])
a1<-cbind(a1,avgR=a1$Sum/a1$Total) 
a1
  1st 2nd 3rd Sum Age Total      avgR
1   2  18  12  32  10    47 0.6808511
2  17  15  18  50  20   117 0.4273504
3  30  28  28  86  30   260 0.3307692
4  33  21  13  67  40   150 0.4466667
5  29  11   4  44  50   104 0.4230769
6  23   1   0  24  60    48 0.5000000
7   4   0   1   5  70    19 0.2631579
corr.test(a1)
Call:corr.test(x = a1)
Correlation matrix 
        1st   2nd   3rd   Sum   Age Total  avgR
1st    1.00  0.40  0.29  0.73  0.10  0.70 -0.25
2nd    0.40  1.00  0.91  0.91 -0.74  0.84  0.15
3rd    0.29  0.91  1.00  0.85 -0.71  0.86 -0.06
Sum    0.73  0.91  0.85  1.00 -0.50  0.96 -0.08
Age    0.10 -0.74 -0.71 -0.50  1.00 -0.36 -0.59
Total  0.70  0.84  0.86  0.96 -0.36  1.00 -0.29
avgR  -0.25  0.15 -0.06 -0.08 -0.59 -0.29  1.00
Sample Size 
[1] 7
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
       1st  2nd  3rd  Sum  Age Total avgR
1st   0.00 1.00 1.00 0.91 1.00  0.96    1
2nd   0.37 0.00 0.09 0.09 0.83  0.31    1
3rd   0.53 0.00 0.00 0.25 0.96  0.25    1
Sum   0.06 0.00 0.01 0.00 1.00  0.01    1
Age   0.82 0.06 0.07 0.25 0.00  1.00    1
Total 0.08 0.02 0.01 0.00 0.43  0.00    1
avgR  0.60 0.75 0.90 0.87 0.16  0.53    0

 To see confidence intervals of the correlations, print with the short=FALSE option
ggplot(a1, aes(x=Age, y=avgR)) + 
  geom_point()+
  geom_smooth(method=lm)+
  labs(x="Alive in Age (0-70)", title='Survived Weight in Age')+
  geom_line() 

m1 <- lm(avgR ~ Age, data = a1)
plot(m1$residuals ~ a1$avgR)
abline(h = 0, lty = 3) 

conclusion: After breaking down the data into 7 ranges of age, there is a inconspicuous decreasing trend line in scatter plot for the weighted survived rate in different age. Though the lease squares line scatter plot wasn’t show “age is matter” very influential in the survival rate because most residuals are far from the central.


F test and ANOVA

Does “Children First” in each class are same?

** I want to check age distribution for the woman and man in each of the class.**

qplot(Age,data= passenger_Age, fill= Sex,facets = .~PClass,bins=30)+
    scale_alpha_continuous(guide=FALSE)

qplot(Age,data= AgeTitanic1, fill= Sex,facets = .~PClass,bins=30)+
    scale_alpha_continuous(guide=FALSE)+
  labs(x="Alive in Age", title='Survived Count in Age')

From histogram for survived passenerger:

  • Each number of women in different class is bigger than men.

  • Age distribution for men and women are similar in each class which means samples were evenly selected in Sex dataset.

  • 3rd class has bigger sample size than 1st and 2nd classes because 3rd classes had many more people than other two classes.


qplot(PClass_S,Age_S,data=passenger_Age_Survived,color= Sex_S,facets = .~PClass_S)

By observed scatter plot in 3rd class, survived young men age between 19 and 40 had higher ratio (men vs women) than other two classes.


Since each group is nearly normal and 3 grpups are all indepent, I use the F test and ANOVA to observe whether the differences in sample means could have happened just by chance even if there was no difference in the respective population means.

s<-AgeTitanic[AgeTitanic$Survived > 0,]
fit=lm(Age ~ PClass, s)
anova(fit)
Analysis of Variance Table

Response: Age
           Df Sum Sq Mean Sq F value    Pr(>F)    
PClass      2  13888  6944.2  36.351 6.569e-15 ***
Residuals 310  59219   191.0                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From F-test result, p-value is less than 0.05, reject H0. The data provide convincing evidence of a difference among the survival ages in 3 different social classes.


Chi-square test

In the following table, I want to see if every class treat equally in each age. I caulate expectied count and Chi-square for each class. For satisfy Chi-square test, each case that contributes a count to the table must be indepent of all the other cases in the table. Also, ezch particular scenario must have at least 5 expected cases.

a0<-bind_rows(t10[1,],t20[1,],t30[1,],t40[1,],t50[1,],t60[1,],t70[1,])
a1<-cbind(a1,e1=(a1$`1st`+a0$`1st`)*a1$avgR) #E1st is expectied count for 1st 
a1<-cbind(a1,e2=(a1$`2nd`+a0$`2nd`)*a1$avgR) #E2nd is expectied count for 2nd 
a1<-cbind(a1,e3=(a1$`3rd`+a0$`3rd`)*a1$avgR) #E3rd is expectied count for 3rd

\(H_0\): Classes have same survied rate at same age.

\(H_1\): Classes do not have same survied rate at same age.

a1<-cbind(a1,chi1=(a1$`1st`-a1$e1)^2/a1$e1)
a1<-cbind(a1,chi2=(a1$`2nd`-a1$e2)^2/a1$e2)
a1<-cbind(a1,chi3=(a1$`3rd`-a1$e3)^2/a1$e3)
a1<-cbind(a1,sumchi=a1$chi1+a1$chi2+a1$chi3)
a1<-cbind(a1,chi1chi2=a1$chi1+a1$chi2)
a1<-cbind(a1,chi1chi3=a1$chi1+a1$chi3)
a1<-cbind(a1,chi3chi2=a1$chi3+a1$chi2)
a1
  1st 2nd 3rd Sum Age Total      avgR        e1         e2         e3
1   2  18  12  32  10    47 0.6808511  2.042553 12.2553191 17.7021277
2  17  15  18  50  20   117 0.4273504  8.119658 12.3931624 29.4871795
3  30  28  28  86  30   260 0.3307692 14.884615 27.7846154 43.3307692
4  33  21  13  67  40   150 0.4466667 21.886667 19.2066667 25.9066667
5  29  11   4  44  50   104 0.4230769 22.846154 10.5769231 10.5769231
6  23   1   0  24  60    48 0.5000000 18.500000  4.5000000  1.0000000
7   4   0   1   5  70    19 0.2631579  4.210526  0.2631579  0.5263158
          chi1        chi2      chi3    sumchi   chi1chi2   chi1chi3
1 8.865248e-04 2.692819149 1.8367430  4.530449  2.6937057  1.8376296
2 9.712290e+00 0.548334807 4.4750056 14.735630 10.2606245 14.1872953
3 1.534973e+01 0.001669648 5.4241475 20.775549 15.3514013 20.7738792
4 5.642987e+00 0.167444174 6.4300841 12.240515  5.8104313 12.0730712
5 1.657602e+00 0.016923077 4.0896503  5.764175  1.6745247  5.7472520
6 1.094595e+00 2.722222222 1.0000000  4.816817  3.8168168  2.0945946
7 1.052632e-02 0.263157895 0.4263158  0.700000  0.2736842  0.4368421
   chi3chi2
1 4.5295622
2 5.0233404
3 5.4258172
4 6.5975282
5 4.1065734
6 3.7222222
7 0.6894737

For chi-square test, with df=2, uppertail<0.05, x^2 >5.99, reject H_0.

For chi-square test, with df=1, uppertail<0.05, x^2 >3.84, reject H_0.

However, there are serveral cases at two tails of e1,e2,e3 lesser than 5, the result might not accurate.

A<-c(10,20,30,40,50,60,70)
sumchiTest<-c("Not sure", "reject","reject","reject","Not reject","Not reject","Not sure")
chi1chi2Test<-c("Not sure", "reject","reject","reject","Not reject","Not reject","Not sure")
chi1chi3Test<-c("Not sure", "reject","reject","reject","reject","Not reject","Not sure")
chi3chi2Test<-c("reject", "reject","reject","reject","reject","Not reject","Not sure")

data.frame(A,sumchiTest,chi1chi2Test,chi1chi3Test,chi3chi2Test)
   A sumchiTest chi1chi2Test chi1chi3Test chi3chi2Test
1 10   Not sure     Not sure     Not sure       reject
2 20     reject       reject       reject       reject
3 30     reject       reject       reject       reject
4 40     reject       reject       reject       reject
5 50 Not reject   Not reject       reject       reject
6 60 Not reject   Not reject   Not reject   Not reject
7 70   Not sure     Not sure     Not sure     Not sure

From Chi-square test results, there are more than 50% results are faile to reject H0. I concules three social Classes do not have same survied rate at same age.


Logistic Regression

“Does missing passenger ‘survived’ predictable?”

The following I try to predicte the ‘Survived’ statues for missing passengers. If response variable ‘Survived’ has two level categories, and all predictor ‘SexCode’,‘PClass’ and ‘Age’ variables are linearly related to logit(pi) if they are held constant, I can use training data to built a logistic regression model prdicting the ‘Survived’ statues.

df<- subset(AgeTitanic,select=c(2,3,5,6))

#chang the PClass as numeric type 
df$PClass<-as.numeric(gsub("([0-9]+).*$", "\\1", df$PClass))

df$SexCode <- factor(df$SexCode)
df$Survived<- factor(df$Survived)

mylogit <- glm(Survived ~ SexCode + Age +PClass , data = df, family = "binomial")
summary(mylogit )

Call:
glm(formula = Survived ~ SexCode + Age + PClass, family = "binomial", 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7165  -0.7130  -0.3901   0.6454   2.5309  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.36932    0.43794   5.410 6.30e-08 ***
SexCode1     2.63170    0.20156  13.056  < 2e-16 ***
Age         -0.03899    0.00751  -5.192 2.08e-07 ***
PClass      -1.25868    0.13768  -9.142  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1025.57  on 755  degrees of freedom
Residual deviance:  695.16  on 752  degrees of freedom
AIC: 703.16

Number of Fisher Scoring iterations: 5

‘SexCode’,‘Age’ and ‘PClass’ all three P-value are lesser than 0.05, which means all three variables have statistical significant for ‘Survived’. From the summary, it tells to increase 1 survived person, increasing 2.6317 unit in SexCode, decreasing 0.03899 unit in Age and 1.25868 in PClass.

hist(mylogit$residuals)

qqnorm(mylogit$residuals)
qqline(mylogit$residuals)

The scattplot of residuals shows the glm which including ‘SexCode’,‘Age’ and ‘PClass’ has very good apporch for predictive ‘Survived’ variable, even though there are some outliner at right tails.


In the following, I try to remove ‘PClass’ from the regression model, and check if the new model has better approach.

mylogit <- glm(Survived ~ SexCode + Age, data = df, family = "binomial")
hist(mylogit$residuals)

qqnorm(mylogit$residuals)
qqline(mylogit$residuals)

The new model shows residuals increasing a lot, which has bad approach to predicting ‘Survived’ variable. Therefore, I should keep ‘SexCode’,‘Age’ and ‘PClass’ three in my regression model.