This study is about social class and survival chance on Titanic. Titanic was called ‘unsinkable’ luxurious ship yet sank in North Atlantic Ocean on April 15th 1912 after it struck an iceberg. And there were nearly 1501 of her 2207 passengers and crew loss because of very limited lifeboats for the passengers. It is well-known ‘Women and Children First’. For this result, we want to see if relationship in these variable by using statistical analysis.
Does ‘Women First’ true?
Does woman increase the likelihood of a person getting more survival chance?
Does woman increase the likelihood of a person getting more survival chance in each social classes?
Does ‘Children First’ true?
Does ‘Children First’ equaly apply in each social classes?
Does missing passenger ‘survived’ predictable?
suppressMessages(suppressWarnings(library('tidyverse')))
suppressMessages(suppressWarnings(library('data.table')))
suppressMessages(suppressWarnings(library('dplyr')))
suppressMessages(suppressWarnings(library('ggplot2')))
suppressMessages(suppressWarnings(library('reshape2')))
suppressMessages(suppressWarnings(library('scales')))
suppressMessages(suppressWarnings(library('rmarkdown')))
suppressMessages(suppressWarnings(library('tidyr')))
suppressMessages(suppressWarnings(library('plotly')))
suppressMessages(suppressWarnings(library('psych')))
suppressMessages(suppressWarnings(library('aod')))
Data Collection: The data collected by individual-level. link
Cases: There are 1313 observations of 6 cases in the data set.
Variables: Name, PClass, Age, Sex, Survived, SexCode
numerical variables: Age
categorical variables: PClass, Sex, Survived, SexCode
Type of study: This is experience study in which we want to investigate if sexual, class and age are matter in Titanic survival case.
Scope of inference - generalizability: Population size is 2207, but we only have 1313 sample data on hand.
I construct a table to summerize data by class and sex in the following. It gives me overview of 3 social classes.
#removed the index while load the data set
Titanic<-read.csv(url("https://raw.githubusercontent.com/czhu505/606-project-proposal-/master/Titanic.csv"), stringsAsFactors = FALSE)[-1]
summary(Titanic)
## Name PClass Age Sex
## Length:1313 Length:1313 Min. : 0.17 Length:1313
## Class :character Class :character 1st Qu.:21.00 Class :character
## Mode :character Mode :character Median :28.00 Mode :character
## Mean :30.40
## 3rd Qu.:39.00
## Max. :71.00
## NA's :557
## Survived SexCode
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3427 Mean :0.3519
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
##
str(Titanic)
## 'data.frame': 1313 obs. of 6 variables:
## $ Name : chr "Allen, Miss Elisabeth Walton" "Allison, Miss Helen Loraine" "Allison, Mr Hudson Joshua Creighton" "Allison, Mrs Hudson JC (Bessie Waldo Daniels)" ...
## $ PClass : chr "1st" "1st" "1st" "1st" ...
## $ Age : num 29 2 30 25 0.92 47 63 39 58 71 ...
## $ Sex : chr "female" "female" "male" "female" ...
## $ Survived: int 1 0 0 0 1 1 1 0 1 0 ...
## $ SexCode : int 1 1 0 1 0 0 1 0 1 0 ...
#passenger_Class: included count numbers and ratios in classes(1 row in PClass with unknow value '*' will be excludsive in the analysis.)
T1 <- data.frame(Titanic$PClass,Titanic$Survived, Titanic$SexCode)
C1<-T1$Titanic.PClass=="1st"
C2<-T1$Titanic.PClass=="2nd"
C3<-T1$Titanic.PClass=="3rd"
S<-T1$Titanic.Survived>0
M<-T1$Titanic.SexCode<1
W<-T1$Titanic.SexCode>0
C1_M<-T1[which(C1 & M),] #dataset for 1st Class men
C2_M<-T1[which(C2 & M),] #dataset for 2nd Class men
C3_M<-T1[which(C3 & M),] #dataset for 3rd Class men
C1_W<-T1[which(C1 & W),] #dataset for 1st Class women
C2_W<-T1[which(C2 & W),] #dataset for 2nd Class women
C3_W<-T1[which(C3 & W),] #dataset for 3rd Class women
m1<-nrow(C1_M) #number of 1st class men
m2<-nrow(C2_M) #number of 2nd class men
m3<-nrow(C3_M) #number of 3rd class men
w1<-nrow(C1_W) #number of 1st class women
w2<-nrow(C2_W) #number of 2nd class women
w3<-nrow(C3_W) #number of 3rd class women
C1_S_M<-T1[which(C1 & S & M),] #dataset for 1st Class survial men
C2_S_M<-T1[which(C2 & S & M),] #dataset for 2nd Class survial men
C3_S_M<-T1[which(C3 & S & M),] #dataset for 3rd Class survial men
C1_S_W<-T1[which(C1 & S & W),] #dataset for 1st Class survial women
C2_S_W<-T1[which(C2 & S & W),] #dataset for 2nd Class survial women
C3_S_W<-T1[which(C3 & S & W),] #dataset for 3rd Class survial women
m_s1<-nrow(C1_S_M) #number of 1st class survial men
m_s2<-nrow(C2_S_M) #number of 2nd class survial men
m_s3<-nrow(C3_S_M) #number of 3rd class survial men
w_s1<-nrow(C1_S_W) #number of 1st class survial women
w_s2<-nrow(C2_S_W) #number of 2nd class survial women
w_s3<-nrow(C3_S_W) #number of 3rd class survial women
t<-matrix(c(m1,m2,m3,w1,w2,w3,m_s1,m_s2,m_s3,w_s1,w_s2,w_s3),ncol = 4) #creat a matrix
colnames(t)<-c('Men','Women','Saved_Men','Saved_Women')
row.names(t)<-c('1st','2nd','3rd')
passenger_Class<-as.data.frame(t) # creat passenger_Class data frame
passenger_Class$Sum_inClass<-c(322,279,711) #add column Sum_inClass
passenger_Class$Saved_inClass<- passenger_Class$Saved_Men + passenger_Class$Saved_Women #add column Saved_inClass
#print(passenger_Class[,1:6]) # 5.1
passenger_Class$Men_inClass<-passenger_Class$Men/passenger_Class$Sum_inClass #add column Men_inClass
passenger_Class$Women_inClass<-1-passenger_Class$Men_inClass #add column Women_inClass
passenger_Class$Saved_Men_inClass<-passenger_Class$Saved_Men/passenger_Class$Men #add column Saved_Men_inClass
passenger_Class$Saved_Women_inClass<-passenger_Class$Saved_Women/passenger_Class$Women #add column Saved_Women_inClass
passenger_Class$TotalSaved_inClass<-passenger_Class$Saved_inClass/passenger_Class$Sum_inClass #add column TotalSaved_inClass
passenger_Class$TotalSaved<-passenger_Class$Saved_inClass/1313 #add column TotalSaved
#print(passenger_Class[,7:12])
#Removed 557 NA from Age,remained 756 observations
T2 <- na.omit(data.frame(Titanic$Age,Titanic$PClass,Titanic$Sex, Titanic$Survived, Titanic$SexCode)) #remove NA and * dataset
vPClass<-as.integer(T2$Titanic.PClass)-1 #PClass is changed factor to numeric data type
passenger_Age<-data.frame(T2$Titanic.Age,vPClass,T2$Titanic.Sex,T2$Titanic.Survived,T2$Titanic.SexCode) # creat passenger_Age data frame
names(passenger_Age)<-c("Age","PClass","Sex","Survived","SexCode") #rename column names
# Creat passenger_Age_Survived subset for Surivals (313 out of 756 samples)
passenger_Age_Survived<-subset(passenger_Age,passenger_Age$Survived>0)
passenger_Age_Survived[4]<-NULL #remove 3rd column
names(passenger_Age_Survived)<-c("Age_S","PClass_S","Sex_S","SexCode_S") #rename column names
Summery in quantity:
By columns:
By rows:
Numbers of men and women in 3rd class are more than numbers of men and women in other classes. The 3rd class men are 2.79 times of men in 1st class and 2.9 times of men in 2nd class. The 3rd class women are 1.48 times of women in 1st class, 1.98 times of women in 2nd class.
Total number of survived man and women in 1st class are higher than other two classes.
The numbers of survivals in each class look fairly close.
print(passenger_Class[,1:6])
## Men Women Saved_Men Saved_Women Sum_inClass Saved_inClass
## 1st 179 143 59 134 322 193
## 2nd 172 107 25 94 279 119
## 3rd 499 212 58 80 711 138
Summery in ratios:
By columns:
The variance of ratios between Men and women is increasing from high class to low class. The percentage survival in women are much higher chance than men in each class.
By rows:
Majority women in 1st and 2nd class were rescued.
Total number of survived man and women in 1st class are higher than other classes, nearly 40% 3rd class women were rescued.
The percentage of total survivals in 1st class is pretty high comparing to 2nd and 3rd classes percentage, according to the percentage of total saved ratios in 60%, 43% and 19% respectively.
Though percentages based on total amount of passengers in each class is fairly close, the percentage of survival in each class show higher class has higher survival chance.
print(passenger_Class[,7:12])
## Men_inClass Women_inClass Saved_Men_inClass Saved_Women_inClass
## 1st 0.5559006 0.4440994 0.3296089 0.9370629
## 2nd 0.6164875 0.3835125 0.1453488 0.8785047
## 3rd 0.7018284 0.2981716 0.1162325 0.3773585
## TotalSaved_inClass TotalSaved
## 1st 0.5993789 0.14699162
## 2nd 0.4265233 0.09063214
## 3rd 0.1940928 0.10510282
The following I use Hypothesis tests for pooled proportion \(p^f\) - \(p^m\).
Before staring, it requires two conditions: the sample observations are independent and at least 10 sucesses and 10 failure in the female and male sample groups (success-failure condition). If these two conditions are met, then the sampling distribution of p is nearly normal with mean p and standard error SE.
table(Titanic$Survived, Titanic$Sex)
female male
0 154 709
1 308 142
The following result shows the difference of female survived rate over man survived rate.
n.f<-154+308
n.m<-709+142
f<-308
m<-142
p.f<-f/n.f
p.m<-m/n.m
SEsq.f<-p.f*(1-p.f)/n.f
SEsq.m<-p.m*(1-p.m)/n.m
SE<-sqrt(SEsq.f+SEsq.m)
diff<-p.f-p.m
diff
## [1] 0.4998042
The following, I use simulation to approach for investigating whether or not woman increase survival chance is to use a randomization technique.I write ‘0’ as dead on 863 cards representing a passenger who were dead of the study, and ‘1’ as ‘alive’ on 450 cards representing who were not. Then, we shuffle these cards and split them into two groups: one group of size 462 representing female, and another group of size 851 representing male.I calculate the difference between the proportion of ‘1’ cards in the female and male groups and record (alivefemal - alivemale) value. I repeat this 1000 times to build a distribution centered at 0. By one more step, I calculate the fraction of simulations where the simulated differences in proportions are 0 or higher. If this fraction low, I conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
n.dead <- 863
n.alive <- 450
n.female <- 462
n.male <- 851
n.samples <- 1000
cards <- c(rep(0, n.dead), rep(1, n.alive))
diff<-0
FemaleAndAlive<-0
MaleAndAlive<-0
set.seed(1234) # To reproduce exact results
simulation <- data.frame()
for(i in seq_len(n.samples)) {
test <- data.frame(Survived=cards, Sex='male', stringsAsFactors = FALSE)
test[sample(nrow(test), n.female ),]$Sex <-'female'
# prop.table(table(Titanic$Survived, Titanic$Sex))
simulation <- rbind(simulation, data.frame(
iter = i,
diff = (sum(test$Survived==1 & test$Sex == 'female') / n.female)-(sum(test$Survived==1 & test$Sex == 'male') / n.male),
stringsAsFactors = FALSE))
simulation
}
ggplot(simulation, aes(x = diff,fill = (diff > 0)))+geom_histogram()+stat_bin(bins = 1000)+
labs(x="propotion of survial (femal-male)", title='Difference of survial chance from femal to male')+
geom_density()
sum(simulation$diff > 0 ) / nrow(simulation)
[1] 0.506
From the graph result, I have seen 50% true for ‘a woman has more suvivial chance than a man’. And the fraction is more than 50%, I conclude woman increase the likelihood of a person getting more survial chance.
There are 557 data missing age. I splited indepent data in to two groups, ‘alive’ and ‘dead’. And I need to find two distribution alive in age and dead in age, and see if they are nearly normal.
AgeTitanic<-na.omit(read.csv(url("https://raw.githubusercontent.com/czhu505/606-project-proposal-/master/Titanic.csv"), stringsAsFactors = FALSE)[-1])
by(AgeTitanic$Age, AgeTitanic$Survived, length)
AgeTitanic$Survived: 0
[1] 443
--------------------------------------------------------
AgeTitanic$Survived: 1
[1] 313
AgeTitanic1<-AgeTitanic[which(AgeTitanic$Survived>0),]
AgeTitanic0<- AgeTitanic[which(AgeTitanic$Survived<1),]
qplot(AgeTitanic1$Age,geom="histogram",binwidth = 0.5, main = "Histogram for alive in Age ",fill=I("steelblue"))
qplot(AgeTitanic0$Age,geom="histogram",binwidth = 0.5, main = "Histogram for dead in Age ",fill=I("steelblue"))
Make a side-by-side boxplot of Age and survived. The box plots show how the medians of the two distributions compare.
plot_ly(y = ~AgeTitanic$Age, x= ~AgeTitanic$Survived, type = "box",color = ~AgeTitanic$Survived)
From the box plot, there are some outliner in ‘0’, but with 443 observations in the group, this outlier is not a concern. However, I don’t see if the age has statistically significant difference for the ages between alive and dead. Therefore, I use t test to value two means. We can compute the group means and group variance of the distribution.
We use the hypotheses for testing if the average ages of alive and dead are different.
\(H_0\): There is no difference statistically significant in Age.
\(H_1\): There is difference statistically significant in Age.
t.test(AgeTitanic1$Age,AgeTitanic0$Age)
Welch Two Sample t-test
data: AgeTitanic1$Age and AgeTitanic0$Age
t = -1.648, df = 615.49, p-value = 0.09987
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.8838049 0.3396334
sample estimates:
mean of x mean of y
29.35958 31.13167
Conclusion: The t test result shows there are 95% confident that difference between alive and dead in age is within (-3.8838049,0.3396334) years. Sicne $p_value$ > 0.05, we reject H0. There is statistically significant difference for Age from alive and dead.
It is not clear to see correlations between ‘Age’ and ‘Survived’. The following I create a table with Survived Rate in splited Age, then find their Correlations.
Age10<-subset(AgeTitanic, Age %in% 0:10)
Age20<-subset(AgeTitanic, Age %in% 11:20)
Age30<-subset(AgeTitanic, Age %in% 21:30)
Age40<-subset(AgeTitanic, Age %in% 31:40)
Age50<-subset(AgeTitanic, Age %in% 41:50)
Age60<-subset(AgeTitanic, Age %in% 51:60)
Age70<-subset(AgeTitanic, Age %in% 61:70)
Age80<-subset(AgeTitanic, Age %in% 71:80)
y<-function(x,z){
n<-nrow(x)
tab<-table(x$Survived, x$PClass)
tab<- cbind(tab, Sum = rowSums(tab))
tab<- cbind(tab, Age = z)
tab<- cbind(tab, Total=n)
tab
}
t10<-y(Age10,10)
t20<-y(Age20,20)
t30<-y(Age30,30)
t40<-y(Age40,40)
t50<-y(Age50,50)
t60<-y(Age60,60)
t70<-y(Age70,70)
t80<-y(Age80,80)
a1<-bind_rows(t10[2,],t20[2,],t30[2,],t40[2,],t50[2,],t60[2,],t70[2,])
a1<-cbind(a1,avgR=a1$Sum/a1$Total)
a1
1st 2nd 3rd Sum Age Total avgR
1 2 18 12 32 10 47 0.6808511
2 17 15 18 50 20 117 0.4273504
3 30 28 28 86 30 260 0.3307692
4 33 21 13 67 40 150 0.4466667
5 29 11 4 44 50 104 0.4230769
6 23 1 0 24 60 48 0.5000000
7 4 0 1 5 70 19 0.2631579
corr.test(a1)
Call:corr.test(x = a1)
Correlation matrix
1st 2nd 3rd Sum Age Total avgR
1st 1.00 0.40 0.29 0.73 0.10 0.70 -0.25
2nd 0.40 1.00 0.91 0.91 -0.74 0.84 0.15
3rd 0.29 0.91 1.00 0.85 -0.71 0.86 -0.06
Sum 0.73 0.91 0.85 1.00 -0.50 0.96 -0.08
Age 0.10 -0.74 -0.71 -0.50 1.00 -0.36 -0.59
Total 0.70 0.84 0.86 0.96 -0.36 1.00 -0.29
avgR -0.25 0.15 -0.06 -0.08 -0.59 -0.29 1.00
Sample Size
[1] 7
Probability values (Entries above the diagonal are adjusted for multiple tests.)
1st 2nd 3rd Sum Age Total avgR
1st 0.00 1.00 1.00 0.91 1.00 0.96 1
2nd 0.37 0.00 0.09 0.09 0.83 0.31 1
3rd 0.53 0.00 0.00 0.25 0.96 0.25 1
Sum 0.06 0.00 0.01 0.00 1.00 0.01 1
Age 0.82 0.06 0.07 0.25 0.00 1.00 1
Total 0.08 0.02 0.01 0.00 0.43 0.00 1
avgR 0.60 0.75 0.90 0.87 0.16 0.53 0
To see confidence intervals of the correlations, print with the short=FALSE option
ggplot(a1, aes(x=Age, y=avgR)) +
geom_point()+
geom_smooth(method=lm)+
labs(x="Alive in Age (0-70)", title='Survived Weight in Age')+
geom_line()
m1 <- lm(avgR ~ Age, data = a1)
plot(m1$residuals ~ a1$avgR)
abline(h = 0, lty = 3)
conclusion: After breaking down the data into 7 ranges of age, there is a inconspicuous decreasing trend line in scatter plot for the weighted survived rate in different age. Though the lease squares line scatter plot wasn’t show “age is matter” very influential in the survival rate because most residuals are far from the central.
** I want to check age distribution for the woman and man in each of the class.**
qplot(Age,data= passenger_Age, fill= Sex,facets = .~PClass,bins=30)+
scale_alpha_continuous(guide=FALSE)
qplot(Age,data= AgeTitanic1, fill= Sex,facets = .~PClass,bins=30)+
scale_alpha_continuous(guide=FALSE)+
labs(x="Alive in Age", title='Survived Count in Age')
From histogram for survived passenerger:
Each number of women in different class is bigger than men.
Age distribution for men and women are similar in each class which means samples were evenly selected in Sex dataset.
3rd class has bigger sample size than 1st and 2nd classes because 3rd classes had many more people than other two classes.
qplot(PClass_S,Age_S,data=passenger_Age_Survived,color= Sex_S,facets = .~PClass_S)
By observed scatter plot in 3rd class, survived young men age between 19 and 40 had higher ratio (men vs women) than other two classes.
Since each group is nearly normal and 3 grpups are all indepent, I use the F test and ANOVA to observe whether the differences in sample means could have happened just by chance even if there was no difference in the respective population means.
s<-AgeTitanic[AgeTitanic$Survived > 0,]
fit=lm(Age ~ PClass, s)
anova(fit)
Analysis of Variance Table
Response: Age
Df Sum Sq Mean Sq F value Pr(>F)
PClass 2 13888 6944.2 36.351 6.569e-15 ***
Residuals 310 59219 191.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From F-test result, p-value is less than 0.05, reject H0. The data provide convincing evidence of a difference among the survival ages in 3 different social classes.
In the following table, I want to see if every class treat equally in each age. I caulate expectied count and Chi-square for each class. For satisfy Chi-square test, each case that contributes a count to the table must be indepent of all the other cases in the table. Also, ezch particular scenario must have at least 5 expected cases.
a0<-bind_rows(t10[1,],t20[1,],t30[1,],t40[1,],t50[1,],t60[1,],t70[1,])
a1<-cbind(a1,e1=(a1$`1st`+a0$`1st`)*a1$avgR) #E1st is expectied count for 1st
a1<-cbind(a1,e2=(a1$`2nd`+a0$`2nd`)*a1$avgR) #E2nd is expectied count for 2nd
a1<-cbind(a1,e3=(a1$`3rd`+a0$`3rd`)*a1$avgR) #E3rd is expectied count for 3rd
\(H_0\): Classes have same survied rate at same age.
\(H_1\): Classes do not have same survied rate at same age.
a1<-cbind(a1,chi1=(a1$`1st`-a1$e1)^2/a1$e1)
a1<-cbind(a1,chi2=(a1$`2nd`-a1$e2)^2/a1$e2)
a1<-cbind(a1,chi3=(a1$`3rd`-a1$e3)^2/a1$e3)
a1<-cbind(a1,sumchi=a1$chi1+a1$chi2+a1$chi3)
a1<-cbind(a1,chi1chi2=a1$chi1+a1$chi2)
a1<-cbind(a1,chi1chi3=a1$chi1+a1$chi3)
a1<-cbind(a1,chi3chi2=a1$chi3+a1$chi2)
a1
1st 2nd 3rd Sum Age Total avgR e1 e2 e3
1 2 18 12 32 10 47 0.6808511 2.042553 12.2553191 17.7021277
2 17 15 18 50 20 117 0.4273504 8.119658 12.3931624 29.4871795
3 30 28 28 86 30 260 0.3307692 14.884615 27.7846154 43.3307692
4 33 21 13 67 40 150 0.4466667 21.886667 19.2066667 25.9066667
5 29 11 4 44 50 104 0.4230769 22.846154 10.5769231 10.5769231
6 23 1 0 24 60 48 0.5000000 18.500000 4.5000000 1.0000000
7 4 0 1 5 70 19 0.2631579 4.210526 0.2631579 0.5263158
chi1 chi2 chi3 sumchi chi1chi2 chi1chi3
1 8.865248e-04 2.692819149 1.8367430 4.530449 2.6937057 1.8376296
2 9.712290e+00 0.548334807 4.4750056 14.735630 10.2606245 14.1872953
3 1.534973e+01 0.001669648 5.4241475 20.775549 15.3514013 20.7738792
4 5.642987e+00 0.167444174 6.4300841 12.240515 5.8104313 12.0730712
5 1.657602e+00 0.016923077 4.0896503 5.764175 1.6745247 5.7472520
6 1.094595e+00 2.722222222 1.0000000 4.816817 3.8168168 2.0945946
7 1.052632e-02 0.263157895 0.4263158 0.700000 0.2736842 0.4368421
chi3chi2
1 4.5295622
2 5.0233404
3 5.4258172
4 6.5975282
5 4.1065734
6 3.7222222
7 0.6894737
For chi-square test, with df=2, uppertail<0.05, x^2 >5.99, reject H_0.
For chi-square test, with df=1, uppertail<0.05, x^2 >3.84, reject H_0.
However, there are serveral cases at two tails of e1,e2,e3 lesser than 5, the result might not accurate.
A<-c(10,20,30,40,50,60,70)
sumchiTest<-c("Not sure", "reject","reject","reject","Not reject","Not reject","Not sure")
chi1chi2Test<-c("Not sure", "reject","reject","reject","Not reject","Not reject","Not sure")
chi1chi3Test<-c("Not sure", "reject","reject","reject","reject","Not reject","Not sure")
chi3chi2Test<-c("reject", "reject","reject","reject","reject","Not reject","Not sure")
data.frame(A,sumchiTest,chi1chi2Test,chi1chi3Test,chi3chi2Test)
A sumchiTest chi1chi2Test chi1chi3Test chi3chi2Test
1 10 Not sure Not sure Not sure reject
2 20 reject reject reject reject
3 30 reject reject reject reject
4 40 reject reject reject reject
5 50 Not reject Not reject reject reject
6 60 Not reject Not reject Not reject Not reject
7 70 Not sure Not sure Not sure Not sure
From Chi-square test results, there are more than 50% results are faile to reject H0. I concules three social Classes do not have same survied rate at same age.
The following I try to predicte the ‘Survived’ statues for missing passengers. If response variable ‘Survived’ has two level categories, and all predictor ‘SexCode’,‘PClass’ and ‘Age’ variables are linearly related to logit(pi) if they are held constant, I can use training data to built a logistic regression model prdicting the ‘Survived’ statues.
df<- subset(AgeTitanic,select=c(2,3,5,6))
#chang the PClass as numeric type
df$PClass<-as.numeric(gsub("([0-9]+).*$", "\\1", df$PClass))
df$SexCode <- factor(df$SexCode)
df$Survived<- factor(df$Survived)
mylogit <- glm(Survived ~ SexCode + Age +PClass , data = df, family = "binomial")
summary(mylogit )
Call:
glm(formula = Survived ~ SexCode + Age + PClass, family = "binomial",
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7165 -0.7130 -0.3901 0.6454 2.5309
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.36932 0.43794 5.410 6.30e-08 ***
SexCode1 2.63170 0.20156 13.056 < 2e-16 ***
Age -0.03899 0.00751 -5.192 2.08e-07 ***
PClass -1.25868 0.13768 -9.142 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1025.57 on 755 degrees of freedom
Residual deviance: 695.16 on 752 degrees of freedom
AIC: 703.16
Number of Fisher Scoring iterations: 5
‘SexCode’,‘Age’ and ‘PClass’ all three P-value are lesser than 0.05, which means all three variables have statistical significant for ‘Survived’. From the summary, it tells to increase 1 survived person, increasing 2.6317 unit in SexCode, decreasing 0.03899 unit in Age and 1.25868 in PClass.
hist(mylogit$residuals)
qqnorm(mylogit$residuals)
qqline(mylogit$residuals)
The scattplot of residuals shows the glm which including ‘SexCode’,‘Age’ and ‘PClass’ has very good apporch for predictive ‘Survived’ variable, even though there are some outliner at right tails.
In the following, I try to remove ‘PClass’ from the regression model, and check if the new model has better approach.
mylogit <- glm(Survived ~ SexCode + Age, data = df, family = "binomial")
hist(mylogit$residuals)
qqnorm(mylogit$residuals)
qqline(mylogit$residuals)
The new model shows residuals increasing a lot, which has bad approach to predicting ‘Survived’ variable. Therefore, I should keep ‘SexCode’,‘Age’ and ‘PClass’ three in my regression model.