HW 1 - Titanic

Q1a. What are the types of variable (quantitative/qualitative) and levels of measurement (nominal/ordinal/interval/ratio) for PassengerId and Age?

train <- read.csv("C:/Users/jonah/Downloads/train.csv")
# Install psych only if it is missing, so re-running the script does not reinstall it
if (!requireNamespace("psych", quietly = TRUE)) install.packages("psych")

library(psych)
describe(train)

##             vars   n   mean     sd median trimmed    mad  min    max  range
## PassengerId    1 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Survived       2 891   0.38   0.49   0.00    0.35   0.00 0.00   1.00   1.00
## Pclass         3 891   2.31   0.84   3.00    2.39   0.00 1.00   3.00   2.00
## Name*          4 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Sex*           5 891   1.65   0.48   2.00    1.68   0.00 1.00   2.00   1.00
## Age            6 714  29.70  14.53  28.00   29.27  13.34 0.42  80.00  79.58
## SibSp          7 891   0.52   1.10   0.00    0.27   0.00 0.00   8.00   8.00
## Parch          8 891   0.38   0.81   0.00    0.18   0.00 0.00   6.00   6.00
## Ticket*        9 891 339.52 200.83 338.00  339.65 268.35 1.00 681.00 680.00
## Fare          10 891  32.20  49.69  14.45   21.38  10.24 0.00 512.33 512.33
## Cabin*        11 891  18.63  38.14   1.00    8.29   0.00 1.00 148.00 147.00
## Embarked*     12 891   3.53   0.80   4.00    3.66   0.00 1.00   4.00   3.00
##              skew kurtosis   se
## PassengerId  0.00    -1.20 8.62
## Survived     0.48    -1.77 0.02
## Pclass      -0.63    -1.28 0.03
## Name*        0.00    -1.20 8.62
## Sex*        -0.62    -1.62 0.02
## Age          0.39     0.16 0.54
## SibSp        3.68    17.73 0.04
## Parch        2.74     9.69 0.03
## Ticket*      0.00    -1.28 6.73
## Fare         4.77    33.12 1.66
## Cabin*       2.09     3.07 1.28
## Embarked*   -1.27    -0.16 0.03

head(train$PassengerId)

## [1] 1 2 3 4 5 6

tail(train$PassengerId)

## [1] 886 887 888 889 890 891

PassengerId is a qualitative variable. Though its values look numeric, it doesn’t measure anything; it simply labels each passenger in the data set, and arithmetic on it (its mean of 446, for example) has no meaning. Its level of measurement is therefore nominal.

Age, however, is a quantitative variable. It measures how long each passenger had lived, in years. Because it has a true zero and ratios are meaningful (a 40-year-old is twice as old as a 20-year-old), its level of measurement is ratio.
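One quick check that supports treating PassengerId as a label is to confirm that it has no missing or repeated values, which is what we would expect of an identifier. The sketch below uses made-up vectors and a hypothetical helper, `is_identifier`; uniqueness alone does not prove a variable is nominal, but it is consistent with that reading.

```r
# Hypothetical helper: TRUE when every value is present and appears exactly once
is_identifier <- function(x) !anyNA(x) && anyDuplicated(x) == 0

ids  <- c(1, 2, 3, 4, 5, 6)        # stands in for train$PassengerId
ages <- c(22, 38, 26, 35, 35, NA)  # stands in for train$Age (repeats and an NA)

print(is_identifier(ids))
print(is_identifier(ages))
```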

Q1b. Which variable has the most missing observations?

colSums(is.na(train))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

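Among the values R codes as NA, Age has the most: 177 missing observations. One caveat: read.csv() leaves blank text fields as empty strings rather than NA, so blanks in character columns such as Cabin and Embarked are not counted here. In the describe() output above, Cabin* has a median of 1, meaning at least half of its entries share a single value, which is probably the empty string. If blanks are counted as missing, Cabin likely has more missing observations than Age.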
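The count above only sees values that R codes as NA. A minimal sketch of a check that also counts empty strings is shown below; the data frame `toy` is made up for illustration and the helper `count_missing` is hypothetical, so no counts for the real train.csv are implied.

```r
# Toy data frame standing in for train; the values are made up for illustration
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count NA values plus empty strings in each column
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | (is.character(col) & col == "")))
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

Applied to the real data this would be count_missing(train). An alternative is to convert blanks at load time with the na.strings argument of read.csv(), for example na.strings = c("", "NA").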

Q2. Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode.

train$Age[is.na(train$Age)] <- median(train$Age, na.rm=TRUE)
sum(is.na(train$Age))

## [1] 0

train$SibSp[is.na(train$SibSp)] <- median(train$SibSp, na.rm = TRUE)
sum(is.na(train$SibSp))

## [1] 0

train$Parch[is.na(train$Parch)] <- median(train$Parch, na.rm = TRUE)
sum(is.na(train$Parch))

## [1] 0
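The question also mentions the column mode, which base R does not provide as a function for data values. The sketch below shows one way the same imputation could be wrapped up so that numeric columns get the median and other columns get the mode; `stat_mode` and `impute` are hypothetical helpers, and the vectors are made up.

```r
# Hypothetical helper: most frequent non-missing value
# (ties go to the value that appears first in the data)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: fill NA with the median (numeric) or the mode (anything else)
impute <- function(x) {
  fill <- if (is.numeric(x)) median(x, na.rm = TRUE) else stat_mode(x)
  x[is.na(x)] <- fill
  x
}

ages_filled  <- impute(c(22, NA, 38, 26, NA))    # numeric: NA becomes the median
ports_filled <- impute(c("S", "C", NA, "S"))     # character: NA becomes the mode
print(ages_filled)
print(ports_filled)
```

With the real data, the three lines above would become train$Age <- impute(train$Age), and likewise for SibSp and Parch.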

Q3. Install the psych package in R: install.packages(‘psych’). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)). Please comment on what you observe from the summary statistics.

describe(train$Age)

##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44

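The mean (29.36) is slightly higher than the median (28), which points to a mild right skew; the skew value of 0.51 supports this. A skew of 0.51 is not far from the 0 of a normal distribution, but it indicates a modest tail towards older passengers. The kurtosis value of 0.97 (describe() reports excess kurtosis, so a normal distribution would score 0) suggests a distribution somewhat more peaked and heavier-tailed than normal. Part of that peak is an artifact of Q2: filling 177 missing ages with the median of 28 piles extra observations at the center, which is why the standard deviation fell from 14.53 to 13.02 and the kurtosis rose from 0.16 to 0.97 compared with the first describe() output. Lastly, the minimum of 0.42 and the maximum of 80 show that passengers ranged from infants to the elderly.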
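The Age statistics here were computed after the median imputation in Q2, and they differ from the first describe() output (sd 14.53 before, 13.02 after). A small sketch of why filling gaps with the median shrinks the spread is shown below, using a made-up vector rather than the real data.

```r
# Made-up ages with three missing values; stands in for train$Age before Q2
ages_raw <- c(2, 18, 24, 28, 30, 36, 54, NA, NA, NA)

# Same imputation as in Q2: replace NA with the median of the observed values
ages_filled <- ages_raw
ages_filled[is.na(ages_filled)] <- median(ages_raw, na.rm = TRUE)

sd_before <- sd(ages_raw, na.rm = TRUE)  # spread of the observed ages only
sd_after  <- sd(ages_filled)             # spread after adding values at the median

print(c(before = sd_before, after = sd_after))
```

Every imputed value sits at the center of the distribution, so the standard deviation can only be pulled down, and the peak at the median becomes taller.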

describe(train$SibSp)

##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04

The descriptive statistics here paint a very different picture from the ones for Age. A mean of 0.52 against a median of 0 shows that at least half of the passengers had neither siblings nor spouses on the ship with them. This is supported by the skew value of 3.68, which indicates a long tail towards higher values: most values sit at the low end of the spectrum, with a few large ones. The kurtosis value of 17.73 points the same way, indicating a sharply peaked distribution with heavy tails. The minimum of 0 adds little on its own, since the median is also 0. The maximum of 8 indicates a large family, for example eight siblings, or one spouse and seven siblings.
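For a count variable with so few distinct values, a frequency table can confirm directly what the median only implies. The sketch below uses a made-up vector in place of train$SibSp.

```r
# Made-up sibling/spouse counts; stands in for train$SibSp
sibsp <- c(0, 0, 0, 1, 0, 1, 0, 2, 0, 8)

freq   <- table(sibsp)       # how many passengers have each count
shares <- prop.table(freq)   # the same counts as proportions of the total

share_zero <- mean(sibsp == 0)  # proportion travelling without siblings or spouse

print(freq)
print(round(shares, 2))
```

With the real data, prop.table(table(train$SibSp)) would show whether passengers with a value of 0 are a plurality or an outright majority.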

describe(train$Parch)

##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03

The descriptive statistics here resemble those of the SibSp variable. The mean (0.38) is again above the median (0), and a median of 0 means that at least half of the passengers did not board with parents or children. Again, a skew of 2.74 indicates a long tail to the right, and a kurtosis of 9.69 indicates a peaked, heavy-tailed distribution. This makes sense: if the typical passenger had no parents or children on board, the data should cluster at 0, with comparatively few entries at higher values. The standard deviation of 0.81 is consistent with this, being small relative to the range of 6. Because the distribution is far from normal, the rule that about 68% of values fall within one standard deviation of the mean does not apply here; still, the interval from -0.43 to 1.19 covers the values 0 and 1, and since the median is 0 it must contain at least half of the passengers.
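Rather than relying on a rule of thumb for normal data, the share of observations within one standard deviation of the mean can be computed directly. The sketch below does this for a made-up, right-skewed vector standing in for train$Parch and compares it with the share a normal distribution would give.

```r
# Made-up parent/child counts; stands in for train$Parch
parch <- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 6)

lo <- mean(parch) - sd(parch)   # lower end of the one-sd interval
hi <- mean(parch) + sd(parch)   # upper end of the one-sd interval

# Empirical share of observations inside [lo, hi]
share_within <- mean(parch >= lo & parch <= hi)

# Share a normal distribution places within one sd of its mean (about 0.68)
normal_share <- pnorm(1) - pnorm(-1)

print(c(observed = share_within, normal = normal_share))
```

For skewed data the observed share can sit well away from the normal figure, which is why the empirical calculation is the safer one to quote.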

Q4. Provide a cross-tabulation of Survived and Sex. What do you notice?

table(train$Survived, train$Sex)

##    
##     female male
##   0     81  468
##   1    233  109

Approximately 74% of women survived (233 of 314), compared with approximately 19% of men (109 of 577). I also notice that more men died (468) than there were women on board (314). Additionally, more men died than there were survivors in total (468 versus 342).
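The survival percentages can be read off the table directly by converting the counts to column proportions. The sketch below rebuilds the cross-tabulation from the counts printed above, so it runs without the data file; the object names are illustrative.

```r
# Counts copied from the cross-tabulation above (rows: Survived, columns: Sex)
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 divides each column by its total, giving rates within each sex
rates <- prop.table(counts, margin = 2)

print(round(rates, 3))
```

With the data loaded, the equivalent call would be prop.table(table(train$Survived, train$Sex), margin = 2).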

Q5. Provide notched boxplots for Survived and Age. What do you notice?

boxplot(Age ~ Survived, data = train, notch = TRUE, horizontal = TRUE)

The interquartile range of ages looks similar for both groups, regardless of whether they survived. Going strictly by the graphic, the first quartile for the passengers who survived looks slightly lower than that of the other group. There appear to be relatively few extreme values among the survivors compared to the other group, and they all lie beyond the upper whisker, suggesting that most of the passengers who survived were close to that group’s median age.

This contrasts with the box plot for the passengers who did not survive: the box itself is narrower, but it has many more extreme values. Because points are flagged when they lie more than 1.5 times the interquartile range beyond the box, a narrow box makes it easier for a value to be flagged, so these points are not necessarily unusual ages. Some of the narrowness may also come from Q2, since every imputed age sits exactly at 28. Additionally, the median age of passengers who did not survive looks about the same as that of the survivors; if the notches of the two boxes overlap, that is consistent with no clear difference between the medians, suggesting that age alone might not have been a determining factor in survival.

The relative lack of extreme values among the survivors is partly a matter of group size: there were fewer survivors overall (342 versus 549, from the cross-tabulation in Q4), so fewer points can fall beyond the whiskers. It also fits what we saw in Q3, where the descriptive statistics for Age suggested a distribution fairly close to normal. A subset of roughly normal data is not guaranteed to look normal as well, and the survivors are not a random sample of passengers, but both groups appear to be centered in the same place. As we saw before, the mean Age in the training data after imputation was 29.36 (with a median of 28), and both box plots look to have their medians right around those values.
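Reading medians and extreme values off a plot is approximate, so the numbers behind the boxes can be pulled out with boxplot.stats(), which returns the notch limits (conf) and the flagged points (out). The sketch below uses two made-up age vectors in place of the two survival groups, and `notches_overlap` is a hypothetical helper.

```r
# Made-up ages for two groups; stand-ins for Age split by Survived
died     <- c(1, 20, 22, 24, 26, 28, 30, 32, 34, 80)
survived <- c(5, 21, 23, 25, 27, 29, 31, 33, 35, 60)

stats_died     <- boxplot.stats(died)
stats_survived <- boxplot.stats(survived)

# Points beyond 1.5 times the box width from the box edges (the plotted circles)
out_died     <- stats_died$out
out_survived <- stats_survived$out

# Hypothetical helper: TRUE when the two notch intervals share any values
notches_overlap <- function(a, b) {
  ca <- boxplot.stats(a)$conf
  cb <- boxplot.stats(b)$conf
  ca[1] <= cb[2] && cb[1] <= ca[2]
}

overlap <- notches_overlap(died, survived)
print(list(out_died = out_died, out_survived = out_survived, overlap = overlap))
```

With the real data, the two inputs would be train$Age[train$Survived == 0] and train$Age[train$Survived == 1]. Overlapping notches are only a rough guide, not a formal test of the difference in medians.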