Introduction

This is our final data analysis project. We decided to work with the training set within the Titanic data after spending some time searching, as it is a classic case in Kaggle. Then, we downloaded this dataset from Kaggle. The dataset contains 891 Titanic passengers. To be specific, 577 females and 314 males are in the Titanic dataset. Our goal is to obtain the surviving probability of a 10-year-old boy in the first Pclass when we hold all other variables at an average number. The first limitation of our dataset are that several values are missing in our dataset from the data themselves. The second limitation is the sample size is relative small. Speicifilly, there are two dataset inside the Titanic, one is training set, another is test set in the Kaggle. However, there are no survived column in the test set. Thus, we decided to not use the test dataset since our question is study the survival rate.

1.Load Data and Generate an Overview

We read our dataset into the data frame by using the read.csv() function. Afterward, we used the summary() function to generate the result of the summary statistics analysis. To determine the size of this dataset, we used the dim() function. Then, we used the str() function to see the structure of each variable in our dataset. Moreover, we used the head() function to see the first several rows of the data frame. Because we were studying the survival rate, we used the table() function to build a contingency table. Then, we used the prop.table() function to see the ratio of survivors.

train<-read.csv("~/Downloads/train.csv",stringsAsFactors = F)
summary(train)

##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
##

dim(train)

## [1] 891  12

str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

names(train)

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

head(train)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

table(train$Survived)

## 
##   0   1 
## 549 342

prop.table(table(train$Survived))

## 
##         0         1 
## 0.6161616 0.3838384

In the contingency table, “0” represents that people did not survive. On the contrary, “1” means that people remain alive. In looking at the output, we learned that 61.6% of people who embarked did not survive. In addition, by looking at the dimensions, we found that 891 rows and 12 columns are in the train dataset. Furthermore, by looking at the output of the structure, we found that “Age” and “Embarked” are characters. Therefore, we had to convert them into factors.

2.Exploratory Data Analysis

In order to do the basic analysis, we converted the “Sex”,“Embarked”,“Pclass”, and “Survived” in to factors at first.

#convert 'sex','embarked', 'pclass','survived' into factors
train$Sex<-factor(train$Sex)
train$Embarked<-factor(train$Embarked)
train$Survived<-factor(train$Survived)
train$Pclass<-factor(train$Pclass)
str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

2.1 Univariate Analysis

Response: Survived

To analyze the variable “Survived,” we verified whether a value was missing first. Then, we took a look at the number of people who survived or died by using the table() function. For the data visualization, we used a bar chart to display our result.

sum(is.na(train$Survived))

## [1] 0

table(train$Survived)

## 
##   0   1 
## 549 342

h1 <- barplot(table(train$Survived), main = "Number of Died vs Survived", ylab = "Number of People", col = c("grey", "red"), ylim = c(0, 700), xlim = c(0, 5)) 
text (h1, table(train$Survive), label = table(train$Survive), pos =3)

From the output, we saw that no values are missing in the “Survived” column. Moreover, we can clearly see that the number of deaths was far greater than the number of people living.

Response: Sex

Then, we used the same method to analyze the variable “Sex.”

sum(is.na(train$Sex))

## [1] 0

#Take a look at the number of female vs male
table(train$Sex)

## 
## female   male 
##    314    577

h2 <- barplot(table(train$Sex), main = "Number of Female vs Male", ylab = "Number of People", col = c("darkred", "darkblue"), border = "white", ylim = c(0, 700), xlim= c(0, 5))
text(h2, table(train$Sex), label = table(train$Sex), pos =3)

There are also no values are missing for the variable “Sex.” From the bar chart, we see that the number of males was much greater than the number of females on the Titanic.

Response:Pclass

To analyze the variable “Pclass,” we completed an additional step, which involved calculating the percentage of each class.

sum(is.na(train$Pclass))

## [1] 0

table(train$Pclass)

## 
##   1   2   3 
## 216 184 491

#the number of people in each class
firstclass = length(which(train[,3]=="1"))
secondclass = length(which(train[,3]=="2"))
thirdclass = length(which(train[,3]=="3"))
firstclass

## [1] 216

secondclass

## [1] 184

thirdclass

## [1] 491

total = firstclass + secondclass + thirdclass
total

## [1] 891

percent1stclass = firstclass/total
percent2ndclass = secondclass/total
percent3rdclass = thirdclass/total
percent1stclass * 100

## [1] 24.24242

percent2ndclass  * 100

## [1] 20.65095

percent3rdclass * 100

## [1] 55.10662

h3 <- barplot(table(train$Pclass),xlim=c(0,5),ylim=c(0,600), xlab = "Class", ylab = "Frequency", main = "Histogram of Class", col = c("lightyellow","yellow","orange"), border = "darkgreen")
text(h3, table(train$Pclass), label = table(train$Pclass), pos =3)

After our analysis, we, again, found no missing values for the variable “Pclass.” Furthermore, we found that the majority of people were located in the third class.

Response: Age

Because “Age” is a numerical variable, we took the additional step of calculating the mean and standard deviation of it. We use plot to display the “Age” column.To visualization clearer, we used the “Sex” as color, which mean red reprensents female, and black represent male. Moreover, to observe whether certain data points differed from others, we used a boxplot to analyze the outliers. In addition, we used the qq-plot to check the data normality by comparing each variable with the normal distribution

sum(is.na(train$Age)) #There are 177 missing values in the variable age

## [1] 177

head(train$Age)

## [1] 22 38 26 35 35 NA

mean.age <- mean(train$Age, na.rm = TRUE)
sd.age <- sd(train$Age, na.rm = TRUE)
mean.age

## [1] 29.69912

sd.age

## [1] 14.5265

boxplot(train$Age,notch=T, horizontal=T,main="Boxplot of Age",  col = "orange")

plot(train$Age, xlab = "Passengers", ylab = "Frequency", main = "Distribution of Age", col = train$Sex)

#Histogram of "Age"
h<-hist(train$Age,plot=F)
h

## $breaks
## [1]  0 10 20 30 40 50 60 70 80
## 
## $counts
## [1]  64 115 230 155  86  42  17   5
## 
## $density
## [1] 0.0089635854 0.0161064426 0.0322128852 0.0217086835 0.0120448179
## [6] 0.0058823529 0.0023809524 0.0007002801
## 
## $mids
## [1]  5 15 25 35 45 55 65 75
## 
## $xname
## [1] "train$Age"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

str(h)

## List of 6
##  $ breaks  : num [1:9] 0 10 20 30 40 50 60 70 80
##  $ counts  : int [1:8] 64 115 230 155 86 42 17 5
##  $ density : num [1:8] 0.00896 0.01611 0.03221 0.02171 0.01204 ...
##  $ mids    : num [1:8] 5 15 25 35 45 55 65 75
##  $ xname   : chr "train$Age"
##  $ equidist: logi TRUE
##  - attr(*, "class")= chr "histogram"

plot(h, col = heat.colors(length(h$mids)),ylim = c(0,500))
text(h$mids, h$count, h$count, pos=3)

rm(h)

#normality test of age
qqnorm(train$Age, main="Q-Q Plot of Age", col = train$Age)
qqline(train$Age)

From the output, we learned that the average age of passengers was 29.70 in the Titanic. It also can be see from scatter plot. In addition, the boxplot revealed several outliers in the “Age” column. From the qq-plot, we can easily see that the variable “age” is normally distributed.

Then, we converted the “Age” column into a factor column and added a new feature called age_Group to see which group had more people: Any age between (0, 18) belongs to group 1; any age between (18, 50) belongs to group 2; and any age above 50 belongs to group 3.

train$Age_Group <- 'Child'

train$Age_Group[train$Age>18 & train$Age<=50]<-'Adult'
train$Age_Group[train$Age>50]<-'Old'

train$Age_Group <- as.factor(train$Age_Group)

table(train$Age_Group)

## 
## Adult Child   Old 
##   511   316    64

#There are 139 people in age group 1, 511 people in age group 2, 64 people in age group 3

h4 <- barplot(table(train$Age_Group), main = "Number of Passengers in Different Age Group",ylim=c(0,700),xlim=c(0,5), xlab = "Groups of Different Ages", ylab = "Frequencies of Ages", col = c("orange","lightblue","darkgrey"))
text(h4, table(train$Age_Group), label = table(train$Age_Group), pos =3)

The bar chart depicts that adults made up the majority on the Titanic.

Now, let us take a look at our updated dataset “passenger.”

summary(train)

##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin           Embarked Age_Group  
##  Min.   :  0.00   Length:891          :  2    Adult:511  
##  1st Qu.:  7.91   Class :character   C:168    Child:316  
##  Median : 14.45   Mode  :character   Q: 77    Old  : 64  
##  Mean   : 32.20                      S:644               
##  3rd Qu.: 31.00                                          
##  Max.   :512.33                                          
##

str(train)

## 'data.frame':    891 obs. of  13 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ Age_Group  : Factor w/ 3 levels "Adult","Child",..: 1 1 1 1 1 2 3 2 1 2 ...

names(train)

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"    "Age_Group"

2.2 Bivariate Analysis

Correlation

Because our question involved the surviving probability of a 10-year-old boy placed in the first Pclass, we needed to find the relationship between “sex” and “survived,”Pclass" and “Survived,” and “Age” and “Survived.” We used the chi-square test to evaluate whether an association exists between the two variables.

Relationship Between “Sex” and “Survived” by Using Chi-Square Test

First, we overserved “Sex” and “Survived.” We used the str function to check the types of these two variables. Then, we conducted the chi-square test.

str(train$Sex)

##  Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...

str(train$Survived)

##  Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...

#both "Sex"" and "Survived"" are factor variables
chisq.test(train$Sex, train$Survived)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  train$Sex and train$Survived
## X-squared = 260.72, df = 1, p-value < 2.2e-16

tbl <- table(train$Sex, train$Survived)
tblsexsur<-  chisq.test(tbl)
tblsexsur$observed

##         
##            0   1
##   female  81 233
##   male   468 109

tblsexsur$expected

##         
##                 0        1
##   female 193.4747 120.5253
##   male   355.5253 221.4747

tblsexsur$residuals

##         
##                  0         1
##   female -8.086170 10.245095
##   male    5.965128 -7.557757

From the output, we can see that these two variables are factors with the same lengths, which mean if we want to run a test of impendence and chi-square is perfect. When we ran the chi-square test, we obtained a p-value of less than .05, meaning our test is significant. This means that a statistically significant relationship exists between these two variables. By looking at the expected values and the observed values, we can see that the observed values are different from the expected ones. From the residual, we can see the difference between the actual observed response values and the response values that the model predicted.

Relationship Between “Pclass” and “Survived” by Using Chi-Square Test

We completed the same steps to test whether an association exists between “Pclass” and “Survived.”

str(train$Survived)

##  Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...

str(train$Pclass)

##  Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...

#both "Pclass"" and "Survived"" are factor variables
chisq.test(train$Pclass, train$Survived)

## 
##  Pearson's Chi-squared test
## 
## data:  train$Pclass and train$Survived
## X-squared = 102.89, df = 2, p-value < 2.2e-16

tbl <- table(train$Pclass, train$Survived)

tblsexsur<-  chisq.test(tbl)
tblsexsur$observed

##    
##       0   1
##   1  80 136
##   2  97  87
##   3 372 119

tblsexsur$expected

##    
##            0         1
##   1 133.0909  82.90909
##   2 113.3737  70.62626
##   3 302.5354 188.46465

tblsexsur$residuals

##    
##             0         1
##   1 -4.601993  5.830678
##   2 -1.537771  1.948340
##   3  3.993703 -5.059981

The p-value is also less than .05; therefore, our test is also significant, which means a statistically significant relationship exists between “Pclass” and “Survived.”

Relationship between “Age” and “Survived” by Uusing Chi-Square Test

When we verified the association between “Age” and “Survived,” we converted “Age” into a factor variable at first since the “Age” column was an integer when we used str () to test it. We converted the “Age” column into two groups and used the chi-square test to test it.

train$Age_Group2[train$Age <= 18] = "child"
train$Age_Group2[train$Age>18] = "adult"
train$Age_Group2 = as.factor(train$Age_Group2)
str(train$Age_Group2)

##  Factor w/ 2 levels "adult","child": 1 1 1 1 1 NA 1 2 1 2 ...

str(train$Survived)

##  Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...

#both "Pclass"" and "Survived"" are factor variables
chisq.test(train$Age_Group2, train$Survived)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  train$Age_Group2 and train$Survived
## X-squared = 6.3013, df = 1, p-value = 0.01206

tbl <- table(train$Age_Group2, train$Survived)
tblsexsur<- chisq.test(tbl)
tblsexsur$observed

##        
##           0   1
##   adult 355 220
##   child  69  70

tblsexsur$expected

##        
##                 0         1
##   adult 341.45658 233.54342
##   child  82.54342  56.45658

tblsexsur$residuals

##        
##                  0          1
##   adult  0.7329267 -0.8862259
##   child -1.4906890  1.8024821

From the output, we obtained a p-value of more than 0.05, which means our test is not statistically significant. In other words, no statistically significant difference in the surviving rate existed between people older than 18 years old and people younger than 18 years old.

3.Clean Data and Data Manipulation

First, we tested which column has missing values.

#missing value
NA_rate<-colMeans(sapply(train,is.na))
table(NA_rate)

## NA_rate
##                 0 0.198653198653199 
##                12                 2

print('---------------------')

## [1] "---------------------"

sort(NA_rate,decreasing = T)

##         Age  Age_Group2 PassengerId    Survived      Pclass        Name 
##   0.1986532   0.1986532   0.0000000   0.0000000   0.0000000   0.0000000 
##         Sex       SibSp       Parch      Ticket        Fare       Cabin 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
##    Embarked   Age_Group 
##   0.0000000   0.0000000

From the output, we found 12 variables in the train dataset, with only the “Age” column having a missing value. The missing value rate is less than 20%.

Thus, we used ggplot () to visualize our missing value.

miss<-data.frame(names(train),NA_rate,row.names=NULL,stringsAsFactors = F)
miss<-miss[order(NA_rate,decreasing=T),]

library(ggplot2)
ggplot(miss,aes(x=reorder(miss[,1],-NA_rate),y=NA_rate))+geom_bar(stat='identity',fill='red')+
    labs(x='variables',title='missing rate of the variables')+
    theme(axis.text.x = element_text(angle=90,hjust=1))

Because we found missing values in the variable “Age,” we had to address this. Therefore, we filled in the missing value.

#missing value imputation
sum(is.na(train$Age))

## [1] 177

summary(train$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

par(mfrow=c(1,2))
#histogram of age before imputation
h <-hist(train$Age,xlab = 'Train$Age Before Imputation', ylab = 'Number of Passengers', main = 'Histogran of Age Before Imputation', col = c("orange","lightblue","darkgrey"), ylim=range(0,500))
#histogram after imputation
train$Age[is.na(train$Age)]<-mean(train$Age[!is.na(train$Age)])
hist(train$Age,xlab = 'Train$Age After Imputation', ylab = 'Number of Passengers',main = 'Histogran of Age After Imputation',col = c("orange","lightblue","darkgrey"), ylim=range(0,500))

Because the missing rate was relatively low (177 in total), we decided to fill in the missing age values with the mean age. Then, we compared the histogram graph before and after imputation, and we found no major difference. Thus, we were confident about filling the age with the age mean.

4.Model

After addressing the missing data, we conducted the model to continue our research. First, we split the training dataset and the testing dataset.

ind<-sample(2,nrow(train),prob = c(0.7,0.3),replace = T)
titanic_train<-train[ind==1,]
titanic_test<-train[ind==2,]

Then, we dropped several variables, including “Passengerid,” “Names,” and “Ticket” to obtain the raw model. We used the whole dataset compare all other variables except for “Passengerid,” “Names,” and “Ticket.” We used the glm () command to perform a logistic regression, and we set family as a binomial. Then, we ran the model and obtained the summary of our raw model.

library(glmnet)

## Warning: package 'glmnet' was built under R version 3.4.2

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-13

model<-glm(Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Cabin+Embarked,family='binomial',
           data=titanic_train)
summary(model)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Cabin + Embarked, family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.11887  -0.54229  -0.36465   0.00018   2.48168  
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           1.734e+00  9.043e-01   1.918 0.055153 .  
## Pclass2               2.806e-01  6.918e-01   0.406 0.685050    
## Pclass3              -5.709e-01  7.130e-01  -0.801 0.423273    
## Sexmale              -2.453e+00  2.600e-01  -9.432  < 2e-16 ***
## Age                  -4.011e-02  1.195e-02  -3.355 0.000794 ***
## SibSp                -5.260e-01  1.577e-01  -3.336 0.000851 ***
## Parch                -1.262e-01  1.515e-01  -0.833 0.404778    
## Fare                  2.337e-02  9.956e-03   2.347 0.018902 *  
## CabinA16              1.779e+01  6.523e+03   0.003 0.997823    
## CabinA19             -1.726e+01  6.523e+03  -0.003 0.997888    
## CabinA24             -1.778e+01  6.523e+03  -0.003 0.997824    
## CabinA26              2.014e+01  6.523e+03   0.003 0.997537    
## CabinA31              1.960e+01  6.523e+03   0.003 0.997602    
## CabinA32             -1.783e+01  6.523e+03  -0.003 0.997820    
## CabinA36             -1.628e+01  6.523e+03  -0.002 0.998008    
## CabinA5              -1.637e+01  6.523e+03  -0.003 0.997997    
## CabinA6               1.958e+01  6.523e+03   0.003 0.997605    
## CabinB101             8.150e+00  6.523e+03   0.001 0.999003    
## CabinB18              1.680e+01  6.523e+03   0.003 0.997944    
## CabinB19             -1.618e+01  6.523e+03  -0.002 0.998020    
## CabinB20              1.867e+01  3.771e+03   0.005 0.996049    
## CabinB22              1.687e+01  6.523e+03   0.003 0.997937    
## CabinB28              1.703e+01  4.510e+03   0.004 0.996986    
## CabinB3               1.374e+01  6.523e+03   0.002 0.998319    
## CabinB35              1.561e+01  6.523e+03   0.002 0.998090    
## CabinB37             -1.678e+01  6.523e+03  -0.003 0.997947    
## CabinB38             -1.666e+01  6.523e+03  -0.003 0.997962    
## CabinB39              1.625e+01  6.523e+03   0.002 0.998013    
## CabinB4               1.739e+01  6.523e+03   0.003 0.997873    
## CabinB41              1.993e+01  6.523e+03   0.003 0.997562    
## CabinB49              1.714e+01  3.872e+03   0.004 0.996467    
## CabinB5               1.262e+01  6.523e+03   0.002 0.998456    
## CabinB50              1.929e+01  6.523e+03   0.003 0.997640    
## CabinB51 B53 B55     -1.664e+01  6.523e+03  -0.003 0.997964    
## CabinB57 B59 B63 B66  1.222e+01  4.611e+03   0.003 0.997885    
## CabinB58 B60         -5.246e+00  2.566e+00  -2.044 0.040937 *  
## CabinB69              1.738e+01  6.523e+03   0.003 0.997873    
## CabinB71             -1.729e+01  6.523e+03  -0.003 0.997885    
## CabinB73              1.585e+01  6.523e+03   0.002 0.998061    
## CabinB77              1.601e+01  6.523e+03   0.002 0.998041    
## CabinB78              1.456e+01  6.523e+03   0.002 0.998219    
## CabinB79              1.545e+01  6.523e+03   0.002 0.998110    
## CabinB80              1.517e+01  6.523e+03   0.002 0.998144    
## CabinB82 B84         -1.842e+01  6.523e+03  -0.003 0.997747    
## CabinB86             -1.930e+01  6.523e+03  -0.003 0.997639    
## CabinB94             -1.624e+01  6.523e+03  -0.002 0.998013    
## CabinB96 B98          1.770e+01  6.523e+03   0.003 0.997835    
## CabinC101             1.881e+01  6.523e+03   0.003 0.997700    
## CabinC103             1.854e+01  6.523e+03   0.003 0.997732    
## CabinC104             2.066e+01  6.523e+03   0.003 0.997473    
## CabinC106             1.976e+01  6.523e+03   0.003 0.997583    
## CabinC110            -1.718e+01  6.523e+03  -0.003 0.997899    
## CabinC118            -1.750e+01  6.523e+03  -0.003 0.997860    
## CabinC123             2.208e-01  1.809e+00   0.122 0.902874    
## CabinC124            -1.701e+01  4.572e+03  -0.004 0.997032    
## CabinC125             1.485e+01  6.523e+03   0.002 0.998184    
## CabinC128            -1.747e+01  6.523e+03  -0.003 0.997862    
## CabinC148             1.906e+01  6.523e+03   0.003 0.997668    
## CabinC2               1.668e+01  6.523e+03   0.003 0.997959    
## CabinC22 C26         -2.752e+00  1.918e+00  -1.435 0.151304    
## CabinC23 C25 C27     -3.962e+00  2.623e+00  -1.511 0.130849    
## CabinC30             -1.635e+01  6.523e+03  -0.003 0.997999    
## CabinC32              1.454e+01  6.523e+03   0.002 0.998221    
## CabinC46             -1.772e+01  6.523e+03  -0.003 0.997832    
## CabinC49             -1.953e+01  6.523e+03  -0.003 0.997611    
## CabinC52              1.972e+01  4.610e+03   0.004 0.996587    
## CabinC68             -1.246e+00  1.976e+00  -0.631 0.528212    
## CabinC70              1.706e+01  6.523e+03   0.003 0.997913    
## CabinC78             -6.638e-01  2.121e+00  -0.313 0.754259    
## CabinC85              1.665e+01  6.523e+03   0.003 0.997963    
## CabinC87             -1.598e+01  6.523e+03  -0.002 0.998045    
## CabinC91             -1.978e+01  6.523e+03  -0.003 0.997580    
## CabinC92              1.804e+01  3.706e+03   0.005 0.996117    
## CabinC93              1.923e+01  6.523e+03   0.003 0.997648    
## CabinC95             -2.184e+01  6.523e+03  -0.003 0.997328    
## CabinC99              1.507e+01  6.523e+03   0.002 0.998157    
## CabinD                3.019e-01  1.470e+00   0.205 0.837333    
## CabinD10 D12          1.829e+01  6.523e+03   0.003 0.997763    
## CabinD11              1.758e+01  6.523e+03   0.003 0.997849    
## CabinD15              1.577e+01  6.523e+03   0.002 0.998071    
## CabinD17              1.817e+01  4.612e+03   0.004 0.996856    
## CabinD19              2.027e+01  6.523e+03   0.003 0.997521    
## CabinD20              1.713e+01  6.523e+03   0.003 0.997904    
## CabinD26             -1.736e+01  6.523e+03  -0.003 0.997876    
## CabinD28              1.668e+01  6.523e+03   0.003 0.997960    
## CabinD30             -1.780e+01  6.523e+03  -0.003 0.997823    
## CabinD33              1.697e+01  6.523e+03   0.003 0.997925    
## CabinD35              1.938e+01  4.082e+03   0.005 0.996213    
## CabinD37              1.744e+01  6.523e+03   0.003 0.997866    
## CabinD45              1.977e+01  6.523e+03   0.003 0.997581    
## CabinD46             -1.676e+01  6.523e+03  -0.003 0.997950    
## CabinD49              1.801e+01  6.523e+03   0.003 0.997797    
## CabinD56              2.006e+01  6.523e+03   0.003 0.997546    
## CabinD6              -1.739e+01  6.523e+03  -0.003 0.997873    
## CabinD7               1.806e+01  6.523e+03   0.003 0.997791    
## CabinD9               1.585e+01  6.523e+03   0.002 0.998061    
## CabinE10              2.095e+01  6.523e+03   0.003 0.997437    
## CabinE101             1.732e+01  4.578e+03   0.004 0.996981    
## CabinE121             1.993e+01  6.523e+03   0.003 0.997562    
## CabinE24              2.022e+01  4.603e+03   0.004 0.996495    
## CabinE25              2.011e+01  6.523e+03   0.003 0.997540    
## CabinE31             -1.691e+01  6.523e+03  -0.003 0.997932    
## CabinE33              1.655e+01  6.523e+03   0.003 0.997975    
## CabinE34              1.538e+01  6.523e+03   0.002 0.998118    
## CabinE38             -1.586e+01  6.523e+03  -0.002 0.998060    
## CabinE46             -1.689e+01  6.523e+03  -0.003 0.997933    
## CabinE49              1.654e+01  6.523e+03   0.003 0.997977    
## CabinE63             -1.686e+01  6.523e+03  -0.003 0.997937    
## CabinE67              1.074e-01  1.936e+00   0.055 0.955747    
## CabinE68              1.594e+01  6.523e+03   0.002 0.998050    
## CabinE77             -1.854e+01  6.523e+03  -0.003 0.997732    
## CabinE8               1.744e+01  6.523e+03   0.003 0.997867    
## CabinF G73           -1.658e+01  4.605e+03  -0.004 0.997128    
## CabinF2               1.913e+01  6.523e+03   0.003 0.997660    
## CabinF33              1.746e+01  4.588e+03   0.004 0.996965    
## CabinF38             -1.665e+01  6.523e+03  -0.003 0.997963    
## CabinF4               1.842e+01  3.990e+03   0.005 0.996316    
## CabinG6              -4.737e-01  1.079e+00  -0.439 0.660630    
## CabinT               -1.687e+01  6.523e+03  -0.003 0.997936    
## EmbarkedC             5.634e-01  3.375e-01   1.669 0.095056 .  
## EmbarkedQ             3.854e-01  4.016e-01   0.960 0.337286    
## EmbarkedS                    NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 820.51  on 622  degrees of freedom
## Residual deviance: 447.11  on 502  degrees of freedom
## AIC: 689.11
## 
## Number of Fisher Scoring iterations: 17

Let us take a look the output. We see quite a few things. The AIC is too large, which is not good. In addition, the residual deviance reduction from big number to a small number is also not great. Moreover, some coefficients are negative, whereas some are positive. Take “Age” as an example. The coefficient for “Age” is negative, which means as the age goes up, the probability of survival goes down. In our dataset, 0 is for female, and 1 is for male. Thus, the “Sexmale” coefficient is also negative, thus hinting that when from female going to the male, the survival rate is going down. Moreover, by looking at the p-value, we found that “Pclass3,” “Sexmale,” “Age,” “SibSp,” and “Embarked C” are significant because the p-value is less than 0.05, which means these variables correlate to “Survived.” Others do not correlate to one another much, which means no multicollinearity probably exists. Thus, we removed other variables that are non-significant to improve our model.

Thus, we obtained the new model

model2<-glm(Survived~Pclass+Sex+Age+SibSp,family='binomial',
           data=titanic_train)
summary(model2)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2867  -0.6095  -0.4289   0.6737   2.4383  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.00543    0.48529   8.254  < 2e-16 ***
## Pclass2     -1.21995    0.32155  -3.794 0.000148 ***
## Pclass3     -2.34480    0.28974  -8.093 5.83e-16 ***
## Sexmale     -2.54332    0.22410 -11.349  < 2e-16 ***
## Age         -0.04416    0.00944  -4.678 2.90e-06 ***
## SibSp       -0.36305    0.12070  -3.008 0.002631 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 820.51  on 622  degrees of freedom
## Residual deviance: 570.91  on 617  degrees of freedom
## AIC: 582.91
## 
## Number of Fisher Scoring iterations: 5

By looking the output, we see that the AIC of our new model is much lower than our raw model, which means the new model is better compared with the raw model since lower AIC is better. The reductions of residual deviance are also less than those of the raw model. Thus, our new model is better than the raw model is.

5.Model Fit

To judge whether our model works well, we used the Hosmer-Lemeshow Goodness of Fit Test. First, we converted our coefficients to %, which is more usable, by using the exp() function because the output is in log terms. Afterward, we tested our model’s Goodness of Fit.

#model fit
exp(coef(model2))

## (Intercept)     Pclass2     Pclass3     Sexmale         Age       SibSp 
## 54.89521476  0.29524577  0.09586651  0.07860504  0.95680429  0.69555378

library(ResourceSelection)

## ResourceSelection 0.3-2   2017-02-28

#Test our models goodness of fit 
hoslem.test(titanic_train$Survived, fitted(model2))

## Warning in Ops.factor(1, y): '-' not meaningful for factors

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  titanic_train$Survived, fitted(model2)
## X-squared = 623, df = 8, p-value < 2.2e-16

In the output, a p-value lower than 0.05 means the data not well distributed and suggests that the model is not ideal.

Therefore, we measured the error rate between the prediction and the real values.

prob=predict(model2,titanic_test,type = c("response"))
survive<-ifelse(prob>0.5,1,0)
mean(survive!=titanic_test$Survived)

## [1] 0.2126866

From the output, we saw a prediction error rate of around 20%.

Then, we used the ROC and AUC to conduct the model evaluation. To be specific, the ROC is a way of verifying the sensitivity and specificity of the model. Meanwhile, the AUC is a measure for telling between acceptance and a decline.

#ROC curve gauge
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following object is masked from 'package:glmnet':
## 
##     auc

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

h <- roc(titanic_test$Survived~prob, data=titanic_train)
h

## 
## Call:
## roc.formula(formula = titanic_test$Survived ~ prob, data = titanic_train)
## 
## Data: prob in 156 controls (titanic_test$Survived 0) < 112 cases (titanic_test$Survived 1).
## Area under the curve: 0.8867

plot(h)

The AUC is good if it is bigger than 0.8. From the output, we saw an AUC is bigger than 0.8, which means our model is good.

Then, we also calculated a Pseudo R by using the pR2 function

#Psuedo R
library(pscl)

## Warning: package 'pscl' was built under R version 3.4.2

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

pR2(model2)

##          llh      llhNull           G2     McFadden         r2ML 
## -285.4567199 -410.2570627  249.6006857    0.3042004    0.3301109 
##         r2CU 
##    0.4509268

The output revealed a McFadden value is between 0.2 and 0.4. Thus, our model is considered excellent.

6. Conclusion

Then, we predicted the surviving probability of a 10-year-old boy in the first Pclass use the predict function based on our new model.

#answer the question
average_sibsp<-round(sum(train$SibSp)/nrow(train))
average_sibsp

## [1] 1

df<-data.frame(Age=10,Sex="male",Pclass="1",SibSp=1)

str(df)

## 'data.frame':    1 obs. of  4 variables:
##  $ Age   : num 10
##  $ Sex   : Factor w/ 1 level "male": 1
##  $ Pclass: Factor w/ 1 level "1": 1
##  $ SibSp : num 1

str(train)

## 'data.frame':    891 obs. of  14 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ Age_Group  : Factor w/ 3 levels "Adult","Child",..: 1 1 1 1 1 2 3 2 1 2 ...
##  $ Age_Group2 : Factor w/ 2 levels "adult","child": 1 1 1 1 1 NA 1 2 1 2 ...

#predict the surviving probability of the 10-year-old boy in the 1st pclass.
predict.df<-predict(model2,df,type='response')
predict.df

##         1 
## 0.6586976

Finally, we got the surviving probability of a 10-year-old boy in the 1st Pclass is around 60%. Because our model can be considered excellent as we mentioned before, our result is reliable. To make our model more accurate, we need got the whole test set and merge it into our train dataset.

Titanic

Xiaodan Chen, Zhen Zhang, Zhaoyang Chen

2017.12.13