Discussion 11

For this discussion I wanted to take a first pass at a challenge I have been meaning to attempt: the intro Kaggle challenge on Titanic survivor ship.

This training data set is effectively a ship manifest for the Titanic, with an added field indicating whether they survived the iceberg or not. I converted some of the fields to “dummy variables,” which I will explain below.

x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/TitanicTrain.csv")
Train_Set <-  read.csv(text = x)

head(Train_Set)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked ES EC EQ Male C1 C2 C3
## 1        A/5 21171  7.2500              S  1  0  0    1  0  0  1
## 2         PC 17599 71.2833   C85        C  0  1  0    0  1  0  0
## 3 STON/O2. 3101282  7.9250              S  1  0  0    0  0  0  1
## 4           113803 53.1000  C123        S  1  0  0    0  1  0  0
## 5           373450  8.0500              S  1  0  0    1  0  0  1
## 6           330877  8.4583              Q  0  0  1    1  0  0  1

Survived is a Boolean field, 1 indicating they did in fact survive. Pclass is a categorical variable, which I had converted to “dummy” Boolean variables C1-C3. Sex has also been converted to a dummy variable with male = 1. SibSp and Parch indicate the number of siblings or parents the passengers have on the ship. Embarked indicated they port which the passenger embarked from, turned to dummy variables Es, Ec and Eq.

First LM:

A first pass a linear model using all variables yields an \[ R^2 \] of .42, not great:

LM1 <- lm(Survived ~ Age + SibSp + Parch + ES + EC + EQ + Male + C1 + C2 + C3, data = Train_Set)

summary(LM1)
## 
## Call:
## lm(formula = Survived ~ Age + SibSp + Parch + ES + EC + EQ + 
##     Male + C1 + C2 + C3, data = Train_Set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08245 -0.23181 -0.06494  0.22897  1.00105 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.934032   0.275754   3.387 0.000745 ***
## Age         -0.006434   0.001135  -5.669 2.09e-08 ***
## SibSp       -0.049595   0.017361  -2.857 0.004407 ** 
## Parch       -0.008637   0.018715  -0.462 0.644576    
## ES          -0.159532   0.273286  -0.584 0.559572    
## EC          -0.088670   0.273738  -0.324 0.746093    
## EQ          -0.190863   0.282434  -0.676 0.499405    
## Male        -0.486009   0.031583 -15.388  < 2e-16 ***
## C1           0.387684   0.040227   9.637  < 2e-16 ***
## C2           0.194558   0.036501   5.330 1.32e-07 ***
## C3                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3821 on 704 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.4032, Adjusted R-squared:  0.3955 
## F-statistic: 52.84 on 9 and 704 DF,  p-value: < 2.2e-16

Second LM:

A second pass removing our least predictive variables by Pvalues is actually worse than above:

LM2 <- lm(Survived ~ Age + SibSp + Male + C1 + C2 , data = Train_Set)

summary(LM2)
## 
## Call:
## lm(formula = Survived ~ Age + SibSp + Male + C1 + C2, data = Train_Set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12876 -0.23267 -0.06678  0.23129  0.99905 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.784143   0.042484  18.458  < 2e-16 ***
## Age         -0.006583   0.001126  -5.845 7.71e-09 ***
## SibSp       -0.054888   0.016274  -3.373 0.000785 ***
## Male        -0.486974   0.030577 -15.926  < 2e-16 ***
## C1           0.412669   0.038053  10.844  < 2e-16 ***
## C2           0.194451   0.036169   5.376 1.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3821 on 708 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.3998, Adjusted R-squared:  0.3956 
## F-statistic: 94.33 on 5 and 708 DF,  p-value: < 2.2e-16

Conclusions / Next Steps:

The next step would be to engineer some variables. A continuous variable like age is not well suited for this type of analysis. I would break ages into categories: likely “under 20” “Adult 20-50” and “senior 50+”. I would also do similar work with the variable indicating ticket price.

Discussion 12

Well, here we are a week later. I wanted to take the steps as I outlined above to see how close I could get this model.

I did some exploratory analysis in Excel, and came up with a few new “dummy variables” I wanted to try in my model. I noticed the Cabin value was mostly blank, but those who had a cabin were significantly more likely to survive. While the Cabin value contains the cabin number for all passengers with a cabin, the Cab2 variable effectively translates this to a Boolean value - 1 indicates that the passenger has a cabin.

Similarly, the Parch variable indicates the quantity of children or parents the passenger is traveling with. My “Family” variable turns this into a Boolean value with 1 indicating they are either a child or parent.

Finally, I broke the continuous Age variable into categories. Age_Children is for those under 14, Age YA for those between 14 and 25, Age_Adult for those between 25 and 45 and Age_Senior for 45+. In my exploratory analysis I found that the Children category was the only significant predictor amount these.

Third LM:

x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/TitanicTrain2.csv")
Train_Set2 <-  read.csv(text = x)

head(Train_Set2)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked ES EC EQ Male C1 C2 C3 Age_Unknown
## 1        A/5 21171  7.2500              S  1  0  0    1  0  0  1           0
## 2         PC 17599 71.2833   C85        C  0  1  0    0  1  0  0           0
## 3 STON/O2. 3101282  7.9250              S  1  0  0    0  0  0  1           0
## 4           113803 53.1000  C123        S  1  0  0    0  1  0  0           0
## 5           373450  8.0500              S  1  0  0    1  0  0  1           0
## 6           330877  8.4583              Q  0  0  1    1  0  0  1           1
##   Age_Children Age_YA Age_Adult Age_Senior Family Cab1 Cab2
## 1            0      1         0          0      0         0
## 2            0      0         1          0      0    C    1
## 3            0      1         0          0      0         0
## 4            0      0         1          0      0    C    1
## 5            0      0         1          0      0         0
## 6            0      0         0          0      0         0
LM3 <- lm(Survived ~ Male + Age_Children + C1 + C2 + Cab2 , data = Train_Set2)

summary(LM3)
## 
## Call:
## lm(formula = Survived ~ Male + Age_Children + C1 + C2 + Cab2, 
##     data = Train_Set2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1389 -0.2344 -0.0723  0.2248  0.9277 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.57034    0.02668  21.381  < 2e-16 ***
## Male         -0.49804    0.02726 -18.270  < 2e-16 ***
## Age_Children  0.20481    0.04796   4.270 2.16e-05 ***
## C1            0.21822    0.05069   4.305 1.85e-05 ***
## C2            0.16212    0.03329   4.870 1.32e-06 ***
## Cab2          0.14553    0.05011   2.904  0.00378 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3819 on 885 degrees of freedom
## Multiple R-squared:  0.3876, Adjusted R-squared:  0.3841 
## F-statistic:   112 on 5 and 885 DF,  p-value: < 2.2e-16

As we see, our R squared has not improved at all. This is likely due to the fact that we have a categorical response variable, and our fitted values are continuous. In order to truly measure how predictive our model is, we need to set a barrier to convert these fitted values to a Boolean response variable.

LM3DF <- data.frame(Train_Set2$Survived, LM3$fitted.values)
names(LM3DF) <- c("Actual", "Fitted")

ggplot(LM3DF, aes(x = Actual, y = Fitted )) + geom_jitter()

The above plot has the fitted values on Y axis, with the actual Survive outcome on the X. We see that the line should go somewhere around .5. Lets start with that:

LM3DF$Predicted <- ""

  for (i in 1:nrow(LM3DF)){
    if (LM3DF$Fitted[i] < .5){LM3DF$Predicted[i] = 0} else {LM3DF$Predicted[i] = 1}
  }

LM3DF$Correct <- LM3DF$Actual == LM3DF$Predicted

nrow(LM3DF[LM3DF$Correct == TRUE,]) / nrow(LM3DF)
## [1] 0.7934905

We find that the model predicted nearly 80% of the outcomes correctly.

Fourth LM

Since we have already preformed one analysis, we can try adding in one last variable. Through some trial and error, I found that for whatever reason, passengers who embarked from the “C” port had a better chance of survival. Moving this into our model, we can see if this improves or weakens our predictions:

LM4 <- lm(Survived ~ Male + Age_Children + C1 + C2 + Cab2 + Family + EC , data = Train_Set2)

summary(LM4)
## 
## Call:
## lm(formula = Survived ~ Male + Age_Children + C1 + C2 + Cab2 + 
##     Family + EC, data = Train_Set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10765 -0.23682 -0.06928  0.23848  0.93072 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.57644    0.02877  20.036  < 2e-16 ***
## Male         -0.50716    0.02787 -18.198  < 2e-16 ***
## Age_Children  0.25933    0.05447   4.761 2.25e-06 ***
## C1            0.18796    0.05154   3.647 0.000281 ***
## C2            0.16754    0.03316   5.052 5.31e-07 ***
## Cab2          0.16105    0.05010   3.214 0.001354 ** 
## Family       -0.07713    0.03539  -2.179 0.029570 *  
## EC            0.08786    0.03421   2.568 0.010395 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3799 on 883 degrees of freedom
## Multiple R-squared:  0.3952, Adjusted R-squared:  0.3904 
## F-statistic: 82.44 on 7 and 883 DF,  p-value: < 2.2e-16
LM4DF <- data.frame(Train_Set2$Survived, LM4$fitted.values)
names(LM4DF) <- c("Actual", "Fitted")

ggplot(LM4DF, aes(x = Actual, y = Fitted )) + geom_jitter()

LM4DF$Predicted <- ""

  for (i in 1:nrow(LM4DF)){
    if (LM4DF$Fitted[i] < .50){LM4DF$Predicted[i] = 0} else {LM4DF$Predicted[i] = 1}
  }

LM4DF$Correct <- LM4DF$Actual == LM4DF$Predicted

nrow(LM4DF[LM4DF$Correct == TRUE,]) / nrow(LM4DF)
## [1] 0.8148148

Up to 81%! Not bad.

Quickly, lets see the distribution of our residuals:

LM4DF$Residuals <- LM4$residuals

ggplot(LM4DF, aes(x = LM4DF$Fitted, y = LM4DF$Residuals, color = LM4DF$Correct)) + geom_jitter()

ggplot(LM4DF,aes(x=LM4DF$Residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

While our Residual histogram is vaguely normal, the trend line in the residuals to fitted graph is concerning. I imagine this is due to the categorical response variable, but I will need to due more research to learn if this is valid or not.