For this discussion I wanted to take a first pass at a challenge I have been meaning to attempt: the introductory Kaggle challenge on Titanic survivorship.
This training data set is effectively a ship manifest for the Titanic, with an added field indicating whether each passenger survived the iceberg. I converted some of the fields to “dummy variables,” which I will explain below.
library(RCurl)   # provides getURL()
library(ggplot2) # used for the plots further down

x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/TitanicTrain.csv")
Train_Set <- read.csv(text = x)
head(Train_Set)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked ES EC EQ Male C1 C2 C3
## 1 A/5 21171 7.2500 S 1 0 0 1 0 0 1
## 2 PC 17599 71.2833 C85 C 0 1 0 0 1 0 0
## 3 STON/O2. 3101282 7.9250 S 1 0 0 0 0 0 1
## 4 113803 53.1000 C123 S 1 0 0 0 1 0 0
## 5 373450 8.0500 S 1 0 0 1 0 0 1
## 6 330877 8.4583 Q 0 0 1 1 0 0 1
Survived is a Boolean field, with 1 indicating the passenger did in fact survive. Pclass is a categorical variable for passenger class, which I converted to the “dummy” Boolean variables C1-C3. Sex has also been converted to a dummy variable, with male = 1. SibSp and Parch count the siblings/spouses and parents/children each passenger has on the ship. Embarked indicates the port from which the passenger embarked, converted to the dummy variables ES, EC and EQ.
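For reference, here is a minimal sketch of how those dummy columns could be built in base R. The CSV above already contains them, so this is illustrative only:

# Recreate the dummy columns from the raw fields (illustrative; the CSV has them already)
Train_Set$Male <- as.integer(Train_Set$Sex == "male")
Train_Set$C1 <- as.integer(Train_Set$Pclass == 1)
Train_Set$C2 <- as.integer(Train_Set$Pclass == 2)
Train_Set$C3 <- as.integer(Train_Set$Pclass == 3)
Train_Set$ES <- as.integer(Train_Set$Embarked == "S")
Train_Set$EC <- as.integer(Train_Set$Embarked == "C")
Train_Set$EQ <- as.integer(Train_Set$Embarked == "Q")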
A first pass at a linear model using all variables yields an \(R^2\) of about 0.40, not great:
LM1 <- lm(Survived ~ Age + SibSp + Parch + ES + EC + EQ + Male + C1 + C2 + C3, data = Train_Set)
summary(LM1)
##
## Call:
## lm(formula = Survived ~ Age + SibSp + Parch + ES + EC + EQ +
## Male + C1 + C2 + C3, data = Train_Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08245 -0.23181 -0.06494 0.22897 1.00105
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.934032 0.275754 3.387 0.000745 ***
## Age -0.006434 0.001135 -5.669 2.09e-08 ***
## SibSp -0.049595 0.017361 -2.857 0.004407 **
## Parch -0.008637 0.018715 -0.462 0.644576
## ES -0.159532 0.273286 -0.584 0.559572
## EC -0.088670 0.273738 -0.324 0.746093
## EQ -0.190863 0.282434 -0.676 0.499405
## Male -0.486009 0.031583 -15.388 < 2e-16 ***
## C1 0.387684 0.040227 9.637 < 2e-16 ***
## C2 0.194558 0.036501 5.330 1.32e-07 ***
## C3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3821 on 704 degrees of freedom
## (177 observations deleted due to missingness)
## Multiple R-squared: 0.4032, Adjusted R-squared: 0.3955
## F-statistic: 52.84 on 9 and 704 DF, p-value: < 2.2e-16
Note that C3 comes back NA above: C1, C2 and C3 always sum to 1, so C3 is perfectly collinear with the intercept (the classic dummy-variable trap) and R drops it. A second pass, removing our least predictive variables by p-value, is actually slightly worse than the above:
LM2 <- lm(Survived ~ Age + SibSp + Male + C1 + C2 , data = Train_Set)
summary(LM2)
##
## Call:
## lm(formula = Survived ~ Age + SibSp + Male + C1 + C2, data = Train_Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12876 -0.23267 -0.06678 0.23129 0.99905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.784143 0.042484 18.458 < 2e-16 ***
## Age -0.006583 0.001126 -5.845 7.71e-09 ***
## SibSp -0.054888 0.016274 -3.373 0.000785 ***
## Male -0.486974 0.030577 -15.926 < 2e-16 ***
## C1 0.412669 0.038053 10.844 < 2e-16 ***
## C2 0.194451 0.036169 5.376 1.03e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3821 on 708 degrees of freedom
## (177 observations deleted due to missingness)
## Multiple R-squared: 0.3998, Adjusted R-squared: 0.3956
## F-statistic: 94.33 on 5 and 708 DF, p-value: < 2.2e-16
The next step would be to engineer some variables. A continuous variable like age is not well suited to this type of analysis. I would break ages into categories: likely “Under 20,” “Adult 20-50” and “Senior 50+.” I would also do similar work with the variable indicating ticket price.
Well, here we are a week later. I wanted to take the steps I outlined above and see how close I could get this model.
I did some exploratory analysis in Excel and came up with a few new “dummy variables” I wanted to try in my model. I noticed the Cabin value was mostly blank, but that passengers who had a cabin were significantly more likely to survive. While the Cabin value contains the cabin number for passengers who have one, the Cab2 variable translates this to a Boolean value, with 1 indicating that the passenger has a cabin.
Similarly, the Parch variable counts the parents or children the passenger is traveling with. My “Family” variable turns this into a Boolean value, with 1 indicating the passenger is traveling with at least one parent or child.
Finally, I broke the continuous Age variable into categories: Age_Children for those under 14, Age_YA for those between 14 and 25, Age_Adult for those between 25 and 45 and Age_Senior for 45+. In my exploratory analysis I found that the Children category was the only significant predictor among these. (A sketch of how these columns could be built in R appears below.)
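The same feature engineering can be done in R rather than Excel. A minimal sketch; the CSV loaded next already contains these columns, and the exact boundary handling (e.g. whether an age of exactly 25 falls in YA or Adult) is my assumption:

# Illustrative R version of the Excel feature engineering above
Train_Set$Cab2 <- as.integer(Train_Set$Cabin != "")  # 1 if the passenger has any cabin listed
Train_Set$Family <- as.integer(Train_Set$Parch > 0)  # 1 if traveling with a parent or child
Train_Set$Age_Unknown <- as.integer(is.na(Train_Set$Age))
Age <- Train_Set$Age
Train_Set$Age_Children <- as.integer(!is.na(Age) & Age < 14)
Train_Set$Age_YA <- as.integer(!is.na(Age) & Age >= 14 & Age < 25)  # boundary handling assumed
Train_Set$Age_Adult <- as.integer(!is.na(Age) & Age >= 25 & Age < 45)
Train_Set$Age_Senior <- as.integer(!is.na(Age) & Age >= 45)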
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/TitanicTrain2.csv")
Train_Set2 <- read.csv(text = x)
head(Train_Set2)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked ES EC EQ Male C1 C2 C3 Age_Unknown
## 1 A/5 21171 7.2500 S 1 0 0 1 0 0 1 0
## 2 PC 17599 71.2833 C85 C 0 1 0 0 1 0 0 0
## 3 STON/O2. 3101282 7.9250 S 1 0 0 0 0 0 1 0
## 4 113803 53.1000 C123 S 1 0 0 0 1 0 0 0
## 5 373450 8.0500 S 1 0 0 1 0 0 1 0
## 6 330877 8.4583 Q 0 0 1 1 0 0 1 1
## Age_Children Age_YA Age_Adult Age_Senior Family Cab1 Cab2
## 1 0 1 0 0 0 0
## 2 0 0 1 0 0 C 1
## 3 0 1 0 0 0 0
## 4 0 0 1 0 0 C 1
## 5 0 0 1 0 0 0
## 6 0 0 0 0 0 0
LM3 <- lm(Survived ~ Male + Age_Children + C1 + C2 + Cab2 , data = Train_Set2)
summary(LM3)
##
## Call:
## lm(formula = Survived ~ Male + Age_Children + C1 + C2 + Cab2,
## data = Train_Set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1389 -0.2344 -0.0723 0.2248 0.9277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.57034 0.02668 21.381 < 2e-16 ***
## Male -0.49804 0.02726 -18.270 < 2e-16 ***
## Age_Children 0.20481 0.04796 4.270 2.16e-05 ***
## C1 0.21822 0.05069 4.305 1.85e-05 ***
## C2 0.16212 0.03329 4.870 1.32e-06 ***
## Cab2 0.14553 0.05011 2.904 0.00378 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3819 on 885 degrees of freedom
## Multiple R-squared: 0.3876, Adjusted R-squared: 0.3841
## F-statistic: 112 on 5 and 885 DF, p-value: < 2.2e-16
As we see, our R-squared has not improved at all. This is likely because we have a binary response variable while our fitted values are continuous. To truly measure how predictive our model is, we need to set a threshold that converts these fitted values into a Boolean prediction.
LM3DF <- data.frame(Train_Set2$Survived, LM3$fitted.values)
names(LM3DF) <- c("Actual", "Fitted")
ggplot(LM3DF, aes(x = Actual, y = Fitted )) + geom_jitter()
The above plot has the fitted values on the Y axis and the actual Survived outcome on the X. We can see that the cutoff should go somewhere around 0.5. Let's start with that:
LM3DF$Predicted <- ifelse(LM3DF$Fitted < 0.5, 0, 1) # threshold the fitted values at 0.5
LM3DF$Correct <- LM3DF$Actual == LM3DF$Predicted
nrow(LM3DF[LM3DF$Correct == TRUE,]) / nrow(LM3DF)
## [1] 0.7934905
We find that the model predicted nearly 80% of the outcomes correctly.
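Accuracy alone hides which way the model errs. A quick confusion table, sketched from the same data frame, would split the misses into false positives and false negatives:

# Cross-tabulate actual vs. predicted survival
table(Actual = LM3DF$Actual, Predicted = LM3DF$Predicted)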
Since we have already performed one analysis, we can try folding in our remaining variables. Through some trial and error, I found that, for whatever reason, passengers who embarked from the “C” port (Cherbourg) had a better chance of survival. Moving this, along with the Family flag, into our model, we can see whether this improves or weakens our predictions:
LM4 <- lm(Survived ~ Male + Age_Children + C1 + C2 + Cab2 + Family + EC , data = Train_Set2)
summary(LM4)
##
## Call:
## lm(formula = Survived ~ Male + Age_Children + C1 + C2 + Cab2 +
## Family + EC, data = Train_Set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10765 -0.23682 -0.06928 0.23848 0.93072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.57644 0.02877 20.036 < 2e-16 ***
## Male -0.50716 0.02787 -18.198 < 2e-16 ***
## Age_Children 0.25933 0.05447 4.761 2.25e-06 ***
## C1 0.18796 0.05154 3.647 0.000281 ***
## C2 0.16754 0.03316 5.052 5.31e-07 ***
## Cab2 0.16105 0.05010 3.214 0.001354 **
## Family -0.07713 0.03539 -2.179 0.029570 *
## EC 0.08786 0.03421 2.568 0.010395 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3799 on 883 degrees of freedom
## Multiple R-squared: 0.3952, Adjusted R-squared: 0.3904
## F-statistic: 82.44 on 7 and 883 DF, p-value: < 2.2e-16
LM4DF <- data.frame(Train_Set2$Survived, LM4$fitted.values)
names(LM4DF) <- c("Actual", "Fitted")
ggplot(LM4DF, aes(x = Actual, y = Fitted )) + geom_jitter()
LM4DF$Predicted <- ifelse(LM4DF$Fitted < 0.5, 0, 1) # same 0.5 threshold as before
LM4DF$Correct <- LM4DF$Actual == LM4DF$Predicted
nrow(LM4DF[LM4DF$Correct == TRUE,]) / nrow(LM4DF)
## [1] 0.8148148
Up to 81%! Not bad.
Quickly, let's look at the distribution of our residuals:
LM4DF$Residuals <- LM4$residuals
ggplot(LM4DF, aes(x = Fitted, y = Residuals, color = Correct)) + geom_jitter()
ggplot(LM4DF, aes(x = Residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
While our residual histogram is vaguely normal, the trend in the residuals-vs-fitted plot is concerning. I imagine this is due to the binary response variable (with only two possible outcomes, every residual is either 0 minus or 1 minus the fitted value, which creates the two parallel bands), but I will need to do more research to learn whether this is valid.
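If the binary response is indeed the issue, the standard next step would be logistic regression, which models the survival probability directly and keeps fitted values between 0 and 1. A minimal sketch using the same predictors (not run here, so treat it as a starting point rather than a result):

# Logistic regression: the usual choice for a Boolean response like Survived
GLM1 <- glm(Survived ~ Male + Age_Children + C1 + C2 + Cab2 + Family + EC,
            family = binomial, data = Train_Set2)
mean(as.integer(fitted(GLM1) >= 0.5) == Train_Set2$Survived) # training accuracy at a 0.5 cutoff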