Intro

For this discussion I wanted to take a first pass at a challenge I have been meaning to attempt: the introductory Kaggle challenge on Titanic survivorship.

This training data set is effectively a ship manifest for the Titanic, with an added field indicating whether each passenger survived the iceberg or not. I converted some of the fields to “dummy variables,” which I explain below.

library(RCurl)  # provides getURL()
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/TitanicTrain.csv")
Train_Set <- read.csv(text = x)

head(Train_Set)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked ES EC EQ Male C1 C2 C3
## 1        A/5 21171  7.2500              S  1  0  0    1  0  0  1
## 2         PC 17599 71.2833   C85        C  0  1  0    0  1  0  0
## 3 STON/O2. 3101282  7.9250              S  1  0  0    0  0  0  1
## 4           113803 53.1000  C123        S  1  0  0    0  1  0  0
## 5           373450  8.0500              S  1  0  0    1  0  0  1
## 6           330877  8.4583              Q  0  0  1    1  0  0  1

Survived is a Boolean field, with 1 indicating that the passenger did in fact survive. Pclass is a categorical variable (ticket class), which I converted to the “dummy” Boolean variables C1-C3. Sex has also been converted to a dummy variable, Male, with male = 1. SibSp is the number of siblings or spouses the passenger had on the ship, and Parch is the number of parents or children. Embarked indicates the port from which the passenger embarked, converted to the dummy variables ES, EC and EQ.
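The dummy columns were prepared before the CSV was uploaded, so the conversion itself does not appear in this post. As a rough sketch of how such columns could be built, assuming a small made-up data frame named toy (hypothetical, not the real manifest) with the same three categorical fields:

```r
# Toy stand-in for a few manifest rows (values are illustrative only)
toy <- data.frame(
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "Q", "S")
)

# One 0/1 indicator column per category level
toy$Male <- as.integer(toy$Sex == "male")

toy$C1 <- as.integer(toy$Pclass == 1)
toy$C2 <- as.integer(toy$Pclass == 2)
toy$C3 <- as.integer(toy$Pclass == 3)

toy$ES <- as.integer(toy$Embarked == "S")
toy$EC <- as.integer(toy$Embarked == "C")
toy$EQ <- as.integer(toy$Embarked == "Q")

toy
```

Each row gets exactly one 1 among C1-C3 and, as long as Embarked is not blank, exactly one 1 among ES, EC and EQ.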

First LM:

A first pass at a linear model using all variables yields an \( R^2 \) of about 0.40, not great:

LM1 <- lm(Survived ~ Age + SibSp + Parch + ES + EC + EQ + Male + C1 + C2 + C3, data = Train_Set)

summary(LM1)
## 
## Call:
## lm(formula = Survived ~ Age + SibSp + Parch + ES + EC + EQ + 
##     Male + C1 + C2 + C3, data = Train_Set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08245 -0.23181 -0.06494  0.22897  1.00105 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.934032   0.275754   3.387 0.000745 ***
## Age         -0.006434   0.001135  -5.669 2.09e-08 ***
## SibSp       -0.049595   0.017361  -2.857 0.004407 ** 
## Parch       -0.008637   0.018715  -0.462 0.644576    
## ES          -0.159532   0.273286  -0.584 0.559572    
## EC          -0.088670   0.273738  -0.324 0.746093    
## EQ          -0.190863   0.282434  -0.676 0.499405    
## Male        -0.486009   0.031583 -15.388  < 2e-16 ***
## C1           0.387684   0.040227   9.637  < 2e-16 ***
## C2           0.194558   0.036501   5.330 1.32e-07 ***
## C3                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3821 on 704 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.4032, Adjusted R-squared:  0.3955 
## F-statistic: 52.84 on 9 and 704 DF,  p-value: < 2.2e-16
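The NA row for C3 and the note "1 not defined because of singularities" are what lm() reports when one predictor is an exact linear combination of the others. Here C1 + C2 + C3 always equals 1, which duplicates the intercept, so the last of the three is dropped. A minimal sketch that reproduces the behavior on made-up data (the toy data frame and its y column are hypothetical):

```r
# Deterministic toy data: 60 rows, 20 in each of three classes
n   <- 60
cls <- rep(1:3, times = n / 3)

toy <- data.frame(
  y  = rep(c(0, 1, 1, 0, 1), length.out = n),  # arbitrary 0/1 outcome
  C1 = as.integer(cls == 1),
  C2 = as.integer(cls == 2),
  C3 = as.integer(cls == 3)
)

# All three dummies plus an intercept are perfectly collinear
fit <- lm(y ~ C1 + C2 + C3, data = toy)

coef(fit)  # the C3 coefficient comes back as NA
```

Leaving one level out of the formula (for example, fitting on C1 and C2 only) gives the same fit without the NA row, and the omitted class becomes the baseline that the other coefficients are measured against.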

Second LM:

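A second pass, removing the least predictive variables by p-value, is no better: multiple \( R^2 \) slips from 0.4032 to 0.3998, while adjusted \( R^2 \) is essentially unchanged (0.3956 vs. 0.3955):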

LM2 <- lm(Survived ~ Age + SibSp + Male + C1 + C2 , data = Train_Set)

summary(LM2)
## 
## Call:
## lm(formula = Survived ~ Age + SibSp + Male + C1 + C2, data = Train_Set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12876 -0.23267 -0.06678  0.23129  0.99905 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.784143   0.042484  18.458  < 2e-16 ***
## Age         -0.006583   0.001126  -5.845 7.71e-09 ***
## SibSp       -0.054888   0.016274  -3.373 0.000785 ***
## Male        -0.486974   0.030577 -15.926  < 2e-16 ***
## C1           0.412669   0.038053  10.844  < 2e-16 ***
## C2           0.194451   0.036169   5.376 1.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3821 on 708 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.3998, Adjusted R-squared:  0.3956 
## F-statistic: 94.33 on 5 and 708 DF,  p-value: < 2.2e-16
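Since the second model is nested inside the first, one way to check whether the dropped variables mattered is to put the adjusted \( R^2 \) values side by side and run an F-test with anova(). The sketch below uses the built-in mtcars data as a stand-in so that it runs without the download; the model names full and reduced are hypothetical. For this post the equivalent call would be anova(LM2, LM1), which assumes both models were fit to the same rows (both summaries report 177 observations deleted).

```r
# Stand-in models on a built-in data set: 'reduced' is nested in 'full'
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared penalizes extra predictors, unlike multiple R-squared
adj_r2 <- c(
  full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared
)
adj_r2

# F-test of whether the extra terms in 'full' reduce residual error
# by more than chance would explain
cmp <- anova(reduced, full)
cmp
```

A large p-value in the second row of the anova table suggests the dropped terms added little, which is consistent with an adjusted \( R^2 \) that barely moves.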

Conclusions / Next Steps:

The next step would be to engineer some variables. A continuous variable like Age is not well suited to this type of analysis as it stands, since the model forces its effect on survival to be a single straight line. I would break ages into categories, likely “under 20”, “adult 20-50” and “senior 50+”. I would also do similar work with Fare, the variable indicating ticket price.
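A minimal sketch of that binning with cut(), run on made-up values rather than the real columns. The exact cut points, the choice to put age 50 in the senior group, the quartile bins for fare, and all names here (age, fare, age_group, fare_group) are assumptions for illustration:

```r
# Illustrative values only; NA mimics passengers with a missing Age
age  <- c(2, 19, 20, 35, 49, 50, 71, NA)
fare <- c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)

# right = FALSE makes the intervals [low, high), so 20 counts as an adult
# and 50 counts as a senior
age_group <- cut(
  age,
  breaks = c(-Inf, 20, 50, Inf),
  right  = FALSE,
  labels = c("Under20", "Adult20to50", "Senior50plus")
)
table(age_group, useNA = "ifany")

# Quartile bins for fare; include.lowest keeps the minimum fare in bin 1
fare_group <- cut(
  fare,
  breaks = quantile(fare, probs = 0:4 / 4),
  include.lowest = TRUE,
  labels = c("Q1", "Q2", "Q3", "Q4")
)
table(fare_group)
```

Because cut() returns a factor, the result can go straight into a model formula and lm() will create the dummy columns itself, dropping one level as the baseline. Missing ages stay NA, so the 177 incomplete rows would still be excluded unless they are given a category of their own.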