train <- read.csv("C:/Users/jonah/Downloads/train.csv")
install.packages("psych")
## Installing package into 'C:/Users/jonah/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'psych' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\jonah\AppData\Local\Temp\RtmpaE19Kg\downloaded_packages
library(psych)
describe(train)
## vars n mean sd median trimmed mad min max range
## PassengerId 1 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Survived 2 891 0.38 0.49 0.00 0.35 0.00 0.00 1.00 1.00
## Pclass 3 891 2.31 0.84 3.00 2.39 0.00 1.00 3.00 2.00
## Name* 4 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Sex* 5 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00
## Age 6 714 29.70 14.53 28.00 29.27 13.34 0.42 80.00 79.58
## SibSp 7 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00
## Parch 8 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00
## Ticket* 9 891 339.52 200.83 338.00 339.65 268.35 1.00 681.00 680.00
## Fare 10 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33
## Cabin* 11 891 18.63 38.14 1.00 8.29 0.00 1.00 148.00 147.00
## Embarked* 12 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00
## skew kurtosis se
## PassengerId 0.00 -1.20 8.62
## Survived 0.48 -1.77 0.02
## Pclass -0.63 -1.28 0.03
## Name* 0.00 -1.20 8.62
## Sex* -0.62 -1.62 0.02
## Age 0.39 0.16 0.54
## SibSp 3.68 17.73 0.04
## Parch 2.74 9.69 0.03
## Ticket* 0.00 -1.28 6.73
## Fare 4.77 33.12 1.66
## Cabin* 2.09 3.07 1.28
## Embarked* -1.27 -0.16 0.03
head(train$PassengerId)
## [1] 1 2 3 4 5 6
tail(train$PassengerId)
## [1] 886 887 888 889 890 891
PassengerId is a qualitative variable. Though its values look numeric, it doesn’t measure anything and can be used as a surrogate for the specific person in the data set. Because of that, we know that it’s nominal as well.
Age, however, is a quantitative variable. It is a measurement taken on each passenger, with a true zero and meaningful differences and ratios between values, which tells us that it’s measured on a ratio scale rather than an ordinal one.
colSums(is.na(train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Age is the only variable with missing values: 177 of its 891 observations are NA.
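One caveat worth checking: `is.na()` only counts values that were read in as `NA`. With the `read.csv()` defaults, blank cells in character columns come through as empty strings, so a text column could contain blanks without showing up in the counts above. A minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical, not taken from train.csv):

```r
# Hypothetical toy data: one true NA in Age, two blank strings in Cabin
toy <- data.frame(
  Age   = c(22, NA, 35),
  Cabin = c("", "C85", ""),
  stringsAsFactors = FALSE
)

# Counts only true NA values
na_counts <- colSums(is.na(toy))

# Counts blank strings; !is.na() guards against NA == "" evaluating to NA
blank_counts <- sapply(toy, function(x) sum(!is.na(x) & x == ""))

na_counts
blank_counts
```

Passing `na.strings = c("", "NA")` to `read.csv()` would make blanks count as missing at load time, if that is the behavior wanted.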
train$Age[is.na(train$Age)] <- median(train$Age, na.rm=TRUE)
sum(is.na(train$Age))
## [1] 0
train$SibSp[is.na(train$SibSp)] <- median(train$SibSp, na.rm = TRUE)
sum(is.na(train$SibSp))
## [1] 0
train$Parch[is.na(train$Parch)] <- median(train$Parch, na.rm = TRUE)
sum(is.na(train$Parch))
## [1] 0
describe(train$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.36 13.02 28 28.83 8.9 0.42 80 79.58 0.51 0.97 0.44
The two averages this function provides (mean 29.36 and median 28) are close in value but not the same. The mean sitting above the median implies a slight pull towards higher values, which is supported by the skew value. A skew of 0.51 is close to symmetric, but suggests a slight tail towards older passengers. The kurtosis value (0.97) suggests a distribution somewhat more peaked than normal; part of that peak comes from the imputation, since placing 177 values at the median raised the kurtosis from 0.16 and lowered the standard deviation from 14.53 to 13.02. Lastly, the minimum (0.42), maximum (80), and range (79.58) show that passengers of nearly all ages boarded the Titanic.
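Because these Age statistics are computed after the median imputation, it helps to see how filling gaps with a single value affects the spread. A small sketch with made-up numbers (the vector `x` is hypothetical, not the real Age column):

```r
# Hypothetical ages with two missing values
x <- c(2, 18, 24, 28, 31, 40, 65, NA, NA)

# Fill the gaps with the median of the observed values
x_imputed <- x
x_imputed[is.na(x_imputed)] <- median(x, na.rm = TRUE)

sd_observed <- sd(x, na.rm = TRUE)  # spread of the observed values only
sd_imputed  <- sd(x_imputed)        # spread after imputation

c(observed = sd_observed, imputed = sd_imputed)
```

The imputed vector has the smaller standard deviation, because every filled-in value sits at the center of the distribution.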
describe(train$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.52 1.1 0 0.27 0 0 8 8 3.68 17.73 0.04
The descriptive statistics here paint a very different picture than the ones for Age did. A median of 0 means that at least half of the passengers had neither siblings nor spouses on the ship with them, and the mean of 0.52 is pulled above it by a small number of larger values. This is supported by the skew value of 3.68, which indicates a long tail extending towards higher values, with most values at the lower end of the spectrum. The kurtosis value of 17.73 points the same way: a sharply peaked distribution with a heavy tail. A minimum value of 0 doesn’t add much on its own, since we saw that the median is also 0. A maximum value of 8 suggests a large family, presumably eight siblings or one spouse and seven siblings.
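A median of 0 only guarantees that at least half of the values are 0; a frequency table gives the exact share. A sketch with a made-up vector (`sibsp` here is hypothetical; on the real data one would pass `train$SibSp` instead):

```r
# Hypothetical sibling/spouse counts, not the real SibSp column
sibsp <- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 8)

# Share of passengers at each count
shares <- prop.table(table(sibsp))

# Share travelling with no siblings or spouse
share_zero <- mean(sibsp == 0)

shares
share_zero
```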
describe(train$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.38 0.81 0 0.18 0 0 6 6 2.74 9.69 0.03
The descriptive statistics here resemble those of the SibSp variable. Once again, the mean is close to the median (0.38 compared to 0, respectively), and because the median is 0, at least half of the passengers did not board with parents or children. Again, a skew of 2.74 suggests a long tail to the right, and a kurtosis of 9.69 suggests a peaked distribution. This makes sense - if our typical passenger did not have parents or children on board with them, the data should cluster at 0 with comparatively few entries for the higher values. This is consistent with the standard deviation of 0.81. In a normal distribution, about 68% of the data falls within one standard deviation of the mean (here -0.43 to 1.19); this distribution is far from normal, but that interval still covers the values 0 and 1, where most passengers sit.
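The interval quoted above and the normal-distribution benchmark can be reproduced directly. This sketch uses only the mean and standard deviation printed by `describe()`; the variable names are my own:

```r
# Mean and standard deviation of Parch, copied from the describe() output
parch_mean <- 0.38
parch_sd   <- 0.81

# One-standard-deviation interval around the mean
interval <- c(lower = parch_mean - parch_sd, upper = parch_mean + parch_sd)

# Share of a normal distribution within one standard deviation of its mean
normal_share <- pnorm(1) - pnorm(-1)

interval
round(normal_share, 4)
```

Since Parch only takes whole-number values starting at 0, the interval effectively covers passengers with 0 or 1 parents or children aboard.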
table(train$Survived, train$Sex)
##
## female male
## 0 81 468
## 1 233 109
Proportionately, approximately 74% of women survived (233 of 314), compared to approximately 19% of men (109 of 577). I also notice that more men did not survive than there were women in the data set (468 men who did not survive compared to 314 women in total). Additionally, more men died (468) than there were total survivors (342).
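The percentages can be checked from the counts in the table. A minimal sketch that rebuilds the table by hand (the `counts` and `rates` objects are my own names):

```r
# Counts copied from the table above (rows: Survived 0/1, columns: Sex)
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Column proportions: survival rate within each sex
rates <- prop.table(counts, margin = 2)

round(rates["1", ], 2)
```

On the data frame itself, `prop.table(table(train$Survived, train$Sex), margin = 2)` would give the same rates.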
boxplot(Age ~ Survived, data = train, notch = TRUE, horizontal = TRUE)
The interquartile range for ages looks similar for both groups, regardless of whether the passengers survived. Going strictly by the graphic, the first quartile for the group of passengers who survived looks slightly lower than that of the other group. There appear to be relatively few extreme values in the group that survived compared to the other group (the survivor extreme values all lying beyond the upper whisker), suggesting that most of the passengers who survived were close to that group’s median age.
This is contrasted by the box plot for the passengers who did not survive: while the box itself is more condensed, it also appears to have many more extreme values. Because points are flagged relative to the interquartile range, the narrowness of the box might mean that more ages fall outside the whiskers and get drawn as outliers. Additionally, the median age of passengers who did not survive looks to be about the same as the median age of the survivors, suggesting that age alone might not have been a determining factor in who did or did not survive.
The relative lack of extreme values in the box plot of the survivors is consistent with there being fewer survivors overall (342 compared to 549, as the table in the prior question showed), since a smaller group offers fewer chances for extreme ages to appear. This would make sense, since in another prior question we saw that the descriptive statistics for Age suggested that it was distributed close to normally. A random sample of a normally distributed data set is not guaranteed to also be normal, but most of it will still sit near the center, since about 68% of the data falls within one standard deviation of the mean. As we saw before, the mean Age across the whole data set was 29.36 (with a median of 28), and both box plots look to have their medians right around those values.
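To count extreme values rather than judge them by eye, `boxplot.stats()` returns the points that fall beyond the whiskers. A sketch on made-up ages (the `ages` vector is hypothetical, not drawn from train.csv):

```r
# Hypothetical ages for one group, with one unusually old passenger
ages <- c(19, 22, 24, 26, 28, 28, 30, 33, 36, 71)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses by default
stats <- boxplot.stats(ages)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points that would be drawn individually as extreme values
```

On the real data, something like `tapply(train$Age, train$Survived, function(a) length(boxplot.stats(a)$out))` would give the number of flagged points in each group.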