Importing the CSV from Kaggle yields a data set of 891 observations and 12 variables: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp (number of siblings and spouses), Parch (number of parents and children), Ticket, Fare, Cabin, and Embarked (port of embarkation).
# Read the .csv file from a hard-coded local path (adjust for your machine)
titanic <- read.csv("D:\\eric\\Boston College\\1. ADEC7310.02 - Data Analysis\\Week 1\\train.csv")
# Load psych and Hmisc packages
library(psych)
library(Hmisc)
What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?
# Display the structure of each variable in question
str(titanic$PassengerId)
## int [1:891] 1 2 3 4 5 6 7 8 9 10 ...
str(titanic$Age)
## num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
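Passenger ID is a qualitative variable and Age is a quantitative variable. The levels of measurement are nominal for Passenger ID (the ID is only a unique label assigned to each passenger, so arithmetic on it is meaningless) and ratio for Age (age has a true zero, so a statement such as “twice as old” is meaningful).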
Which variable has the most missing observations?
# I explicitly called out the package to use since both psych and Hmisc have
# describe() functions
Hmisc::describe(titanic)
## titanic
##
## 12 Variables 891 Observations
## --------------------------------------------------------------------------------
## PassengerId
## n missing distinct Info Mean Gmd .05 .10
## 891 0 891 1 446 297.3 45.5 90.0
## .25 .50 .75 .90 .95
## 223.5 446.0 668.5 802.0 846.5
##
## lowest : 1 2 3 4 5, highest: 887 888 889 890 891
## --------------------------------------------------------------------------------
## Survived
## n missing distinct Info Sum Mean Gmd
## 891 0 2 0.71 342 0.3838 0.4735
##
## --------------------------------------------------------------------------------
## Pclass
## n missing distinct Info Mean Gmd
## 891 0 3 0.81 2.309 0.8631
##
## Value 1 2 3
## Frequency 216 184 491
## Proportion 0.242 0.207 0.551
## --------------------------------------------------------------------------------
## Name
## n missing distinct
## 891 0 891
##
## lowest : Abbing, Mr. Anthony Abbott, Mr. Rossmore Edward Abbott, Mrs. Stanton (Rosa Hunt) Abelson, Mr. Samuel Abelson, Mrs. Samuel (Hannah Wizosky)
## highest: Yousseff, Mr. Gerious Yrois, Miss. Henriette ("Mrs Harbeck") Zabour, Miss. Hileni Zabour, Miss. Thamine Zimmerman, Mr. Leo
## --------------------------------------------------------------------------------
## Sex
## n missing distinct
## 891 0 2
##
## Value female male
## Frequency 314 577
## Proportion 0.352 0.648
## --------------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 714 177 88 0.999 29.7 16.21 4.00 14.00
## .25 .50 .75 .90 .95
## 20.12 28.00 38.00 50.00 56.00
##
## lowest : 0.42 0.67 0.75 0.83 0.92, highest: 70.00 70.50 71.00 74.00 80.00
## --------------------------------------------------------------------------------
## SibSp
## n missing distinct Info Mean Gmd
## 891 0 7 0.669 0.523 0.823
##
## lowest : 0 1 2 3 4, highest: 2 3 4 5 8
##
## Value 0 1 2 3 4 5 8
## Frequency 608 209 28 16 18 5 7
## Proportion 0.682 0.235 0.031 0.018 0.020 0.006 0.008
## --------------------------------------------------------------------------------
## Parch
## n missing distinct Info Mean Gmd
## 891 0 7 0.556 0.3816 0.6259
##
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##
## Value 0 1 2 3 4 5 6
## Frequency 678 118 80 5 4 5 1
## Proportion 0.761 0.132 0.090 0.006 0.004 0.006 0.001
## --------------------------------------------------------------------------------
## Ticket
## n missing distinct
## 891 0 681
##
## lowest : 110152 110413 110465 110564 110813
## highest: W./C. 6608 W./C. 6609 W.E.P. 5734 W/C 14208 WE/P 5735
## --------------------------------------------------------------------------------
## Fare
## n missing distinct Info Mean Gmd .05 .10
## 891 0 248 1 32.2 36.78 7.225 7.550
## .25 .50 .75 .90 .95
## 7.910 14.454 31.000 77.958 112.079
##
## lowest : 0.0000 4.0125 5.0000 6.2375 6.4375
## highest: 227.5250 247.5208 262.3750 263.0000 512.3292
## --------------------------------------------------------------------------------
## Cabin
## n missing distinct
## 204 687 147
##
## lowest : A10 A14 A16 A19 A20, highest: F33 F38 F4 G6 T
## --------------------------------------------------------------------------------
## Embarked
## n missing distinct
## 889 2 3
##
## Value C Q S
## Frequency 168 77 644
## Proportion 0.189 0.087 0.724
## --------------------------------------------------------------------------------
Cabin has the most missing values: 687 of 891 entries, a proportion of 0.771. Age has the second most, with 177 missing entries (a proportion of 0.199).
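As a cross-check on the Hmisc::describe() output, the missing counts can also be tallied in base R. The following is a minimal sketch on a made-up toy data frame; the `toy` values and the `is_missing` helper are illustrative assumptions, not part of the original analysis. It counts empty strings as missing, because read.csv() by default leaves blank text fields (such as Cabin) as "" rather than NA.

```r
# Toy data frame standing in for the Titanic data (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 26, NA, 35),
  Cabin    = c("", "C85", "", "", "E46"),
  Embarked = c("S", "C", "", "S", "Q"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat NA, and for text columns "", as missing
is_missing <- function(x) {
  if (is.character(x)) is.na(x) | x == "" else is.na(x)
}

# Count and proportion of missing entries per column
missing_counts <- sapply(toy, function(x) sum(is_missing(x)))
missing_prop   <- missing_counts / nrow(toy)

# Columns ordered from most to fewest missing entries
sort(missing_counts, decreasing = TRUE)

# Name of the column with the most missing entries
most_missing <- names(which.max(missing_counts))
most_missing
```

On the real data, passing na.strings = c("", "NA") to read.csv() should turn blank fields into NA at import time, after which colSums(is.na(titanic)) would give the same tally.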
Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)]=median(mydata$Age, na.rm=TRUE).
# Determine the median age, ignoring missing values
median_age <- median(titanic$Age, na.rm = TRUE)
# Keep a copy of the original data in case we make a mistake or want it for
# comparison
original_age_data <- titanic$Age
# Replace NAs with the median age
titanic$Age[is.na(titanic$Age)] <- median_age
# Keep a copy of the imputed data so it can be restored later
modified_age_data <- titanic$Age
# Comparison between original data and adjusted data
summary(original_age_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
summary(titanic$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
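The code above makes the changes. SibSp and Parch have no missing values (see the Hmisc::describe() output), so only Age needed imputation. As expected, the min, median, and max values remain unchanged. The mean has decreased slightly and both quartiles have been pulled closer to the median, because 177 additional observations now sit exactly at the median (the number of non-missing ages is roughly 25% larger, 891 versus 714).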
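The effect of median imputation on the quartiles can be reproduced on a small example. This is a sketch with made-up numbers: the `age` and `sibsp` vectors and the `col_mode` helper are assumptions for illustration only. It also includes one possible way to compute a column mode, since base R has no built-in function for the statistical mode.

```r
# Made-up ages with two missing entries
age <- c(2, 10, 20, 28, 30, 40, 60, NA, NA)

# Replace missing ages with the median of the observed ages
imputed <- age
imputed[is.na(imputed)] <- median(age, na.rm = TRUE)

# The median is unchanged, but the quartiles move toward it
quantile(age, c(0.25, 0.5, 0.75), na.rm = TRUE)
quantile(imputed, c(0.25, 0.5, 0.75))

# Hypothetical helper: most frequent non-missing value (first one on ties)
col_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up count data with one missing entry, imputed with the mode
sibsp <- c(0, 1, 0, NA, 2, 0)
sibsp[is.na(sibsp)] <- col_mode(sibsp)
sibsp
```

Because every imputed value equals the median, the spread (interquartile range and standard deviation) can only stay the same or shrink, which matches the drop in sd from 14.53 to 13.02 reported by psych::describe().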
Install the psych package in R: install.packages("psych"). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)).
# Package has already been loaded at this point so we'll jump straight to
# descriptive statistics
# Age (for this one we'll also run descriptive statistics on the original data)
psych::describe(original_age_data)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 714 29.7 14.53 28 29.27 13.34 0.42 80 79.58 0.39 0.16 0.54
psych::describe(titanic$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.36 13.02 28 28.83 8.9 0.42 80 79.58 0.51 0.97 0.44
hist(titanic$Age, xlab = "Age", main = "Age")
# SibSp
psych::describe(titanic$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.52 1.1 0 0.27 0 0 8 8 3.68 17.73 0.04
hist(titanic$SibSp, xlab = "Siblings / Spouses", main = "Siblings / Spouses")
# Parch
psych::describe(titanic$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.38 0.81 0 0.18 0 0 6 6 2.74 9.69 0.03
hist(titanic$Parch, xlab = "Parents / Children", main = "Parents / Children")
The code above provides the descriptive statistics for the three variables. I also generated histograms to better visualize the distribution and skewness of each variable.
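To make the skew column more concrete, here is a minimal base R sketch of one common definition of sample skewness (third central moment divided by the second central moment to the power 3/2). The `moment_skewness` helper and both vectors are made up for illustration; psych::describe() may apply a slightly different small-sample adjustment, so its values need not match this formula exactly.

```r
# Hypothetical helper: moment-based sample skewness
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up counts shaped like SibSp: mostly zeros with a few large values
right_skewed <- c(0, 0, 0, 0, 0, 1, 1, 2, 8)
# Made-up symmetric data for comparison
symmetric <- c(1, 2, 3, 4, 5)

moment_skewness(right_skewed)  # positive: long right tail
moment_skewness(symmetric)     # zero: no skew
```

A large positive value, like those reported for SibSp and Parch, indicates that most observations sit near the low end with a long tail to the right, which is what the histograms show.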
Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?
table(titanic$Survived, titanic$Sex)
##
## female male
## 0 81 468
## 1 233 109
What I noticed in this table is a large disparity between the proportion of men who survived (109 of 577, about 19%) and the proportion of women who survived (233 of 314, about 74%). This makes sense given that “women and children first” was the standard policy for allocating lifeboats when evacuating a ship at the time. An interesting follow-up would be to categorize all passengers as either child or adult and then cross-tabulate that category against survival in two additional tables (one for females and one for males). Some additional research would be needed to determine at what age a boy was no longer considered a child in 1912.
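The survival rates by sex can be read directly off the cross-tabulation with prop.table(). The sketch below rebuilds the table from the four counts printed above so that it runs on its own; with the data loaded, table(titanic$Survived, titanic$Sex) could be passed in instead. The object names are illustrative.

```r
# Rebuild the cross-tabulation from the counts shown above
# (rows: Survived 0/1, columns: Sex)
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Column proportions: share of each sex that perished (0) or survived (1)
survival_by_sex <- prop.table(counts, margin = 2)
round(survival_by_sex, 3)

# Survival rate for each sex
survival_by_sex["1", ]
```

Using margin = 2 normalizes within each sex, which is the comparison of interest here; margin = 1 would instead show the sex composition of survivors and non-survivors.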
Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?
# NOTE: for this one I was curious to see the differences in the notched
# boxplots if I used the original age data or the modified data (NAs replaced
# with the median age)
# First with the Modified Age Data
boxplot(titanic$Age ~ titanic$Survived, notch = TRUE, horizontal = TRUE,
        ylab = "Survived", xlab = "Age",
        main = "Modified Age and Survival Rate (1 = Survived)")
# Additional Summaries to view the specific numbers
print("Survived")
## [1] "Survived"
summary(titanic$Age[titanic$Survived==1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 21.00 28.00 28.29 35.00 80.00
print("Perished")
## [1] "Perished"
summary(titanic$Age[titanic$Survived==0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 23.00 28.00 30.03 35.00 74.00
# Replace age data with original
titanic$Age <- original_age_data
# Now with the Unmodified Age Data
boxplot(titanic$Age ~ titanic$Survived, notch = TRUE, horizontal = TRUE,
        ylab = "Survived", xlab = "Age",
        main = "Original Age and Survival Rate (1 = Survived)")
# Additional summaries would help to see the specific numbers
print("Survived")
## [1] "Survived"
summary(titanic$Age[titanic$Survived==1])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 19.00 28.00 28.34 36.00 80.00 52
print("Perished")
## [1] "Perished"
summary(titanic$Age[titanic$Survived==0])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 21.00 28.00 30.63 39.00 74.00 125
# Put the modified data back into the data frame
titanic$Age <- modified_age_data
The boxplots actually have quite a bit of overlap around the median age. A lot more children survived than perished (although there were four outliers showing that some children did not manage to get off the ship). More passengers over the age of 60 perished than managed to board a lifeboat. The perished boxplot is more tightly centered about the median, shrinking the range of the data compared to the survived data. This should be taken with a grain of salt, given that ~20% of the ages are not truly known and were substituted with the median age.
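The overlap of the notches can also be checked numerically. In base R the notch spans the median plus or minus 1.58 * IQR / sqrt(n), with the IQR taken between the hinges, and boxplot.stats() reports this interval in its conf element. The sketch below uses made-up ages for the two groups; the vectors and the `notch_interval` helper are assumptions for illustration.

```r
# Made-up ages for the two groups
age_survived <- c(4, 19, 22, 27, 28, 31, 36, 45, 52)
age_perished <- c(18, 21, 24, 28, 28, 30, 33, 39, 47, 61)

# Hypothetical helper: notch limits around the median, using the hinges
# from fivenum() as boxplot.stats() does
notch_interval <- function(x) {
  x <- x[!is.na(x)]
  hinges <- fivenum(x)
  hinges[3] + c(-1.58, 1.58) * (hinges[4] - hinges[2]) / sqrt(length(x))
}

ci_survived <- notch_interval(age_survived)
ci_perished <- notch_interval(age_perished)

# Same interval as reported by base R
boxplot.stats(age_survived)$conf

# TRUE when the two notch intervals overlap
notches_overlap <- ci_survived[1] <= ci_perished[2] &&
  ci_perished[1] <= ci_survived[2]
notches_overlap
```

Non-overlapping notches are usually read as evidence that the two medians differ. Overlapping notches, as in the plots above where both medians are 28, give no such evidence.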