As noted in the instructions and the first video from week 1, I downloaded the data set from Kaggle and saved it in my week 1 folder as both a text file and an Excel file.
mydata <- read.csv('./train.csv')
str(mydata)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
head(mydata)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
names(mydata)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
summary(mydata)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
Passenger ID
str(mydata$PassengerId)
## int [1:891] 1 2 3 4 5 6 7 8 9 10 ...
Passenger ID is a qualitative variable, and its level of measurement is nominal. As our slides note, ID numbers are nominal because they only label each observation; the numbers carry no order or magnitude. Nominal variables are qualitative by definition.
Age
str(mydata$Age)
## num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
Age is a quantitative variable, and its level of measurement is ratio. We know that age is quantitative because it is real valued. It has an order and is measured on a scale of equal-sized units, which makes it at least interval, and it also has a true zero (a 40-year-old is twice as old as a 20-year-old), which makes it ratio.
mydata[mydata == ""] <- NA
mydata[mydata == " "] <- NA
colSums(is.na(mydata))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
‘Cabin’ is the column with the most missing observations, with 687. It is followed by ‘Age’ with 177 and ‘Embarked’ with 2. The remaining columns have no missing values. When I first ran the code, Age was the only column reporting missing values. That did not look right given the data, and I realized my initial code was only counting NA values, while the blanks in the character columns are read in as empty strings. I then replaced empty and single-space strings with NA so that the code counts every missing value.
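An alternative to replacing blanks after loading is to declare them as missing when the file is read. The sketch below is only an illustration: it writes a small made-up CSV to a temporary file (the rows are not from the Titanic data) and reads it back with the `na.strings` and `strip.white` arguments of `read.csv`.

```r
# Write a tiny stand-in CSV so the sketch runs on its own (rows are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Age,Cabin,Embarked",
  "1,22,,S",
  "2,38,C85,C",
  "3,,,",
  "4,35, ,S"
), tmp)

# na.strings lists the raw strings that should be read as NA;
# strip.white trims surrounding blanks so " " is handled like ""
toy <- read.csv(tmp, na.strings = c("", " ", "NA"), strip.white = TRUE)

# Count missing values per column
colSums(is.na(toy))
```

On the real file, the same arguments could be passed to `read.csv('./train.csv', ...)`, assuming the blanks there are plain empty or single-space fields.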
mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE)
mydata$SibSp[is.na(mydata$SibSp)] <- median(mydata$SibSp, na.rm=TRUE)
mydata$Parch[is.na(mydata$Parch)] <- median(mydata$Parch, na.rm=TRUE)
summary(mydata$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
summary(mydata$SibSp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.523 1.000 8.000
summary(mydata$Parch)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3816 0.0000 6.0000
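Following the original code, I imputed the missing data with the column median and then ran a summary of each column to check the result. None of the summaries list any NA's. Only Age had missing values (177), so SibSp and Parch are unchanged. For Age, the median stays at 28, while the mean moves from 29.70 to 29.36 and the first and third quartiles move from 20.12 and 38.00 to 22.00 and 35.00.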
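Because imputed values are indistinguishable from observed ones afterwards, it can help to record which rows were filled in. The following is a minimal sketch, not the course's code: `impute_median` is a hypothetical helper, and the input is the first ten ages shown in the `str()` output above.

```r
# Hypothetical helper: fill NAs with the median and remember which were filled
impute_median <- function(x) {
  missing <- is.na(x)
  x[missing] <- median(x, na.rm = TRUE)
  list(values = x, imputed = missing)
}

# First ten ages from the str() output above (one NA)
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)
res <- impute_median(age)

res$values        # the NA is replaced by the median of the nine observed ages
sum(res$imputed)  # number of values that were filled in
```

Keeping the `imputed` flag makes it possible to redo later plots on the observed ages only.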
library(psych)
#Age
describe(mydata$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.36 13.02 28 28.83 8.9 0.42 80 79.58 0.51 0.97 0.44
#SibSp
describe(mydata$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.52 1.1 0 0.27 0 0 8 8 3.68 17.73 0.04
#Parch
describe(mydata$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.38 0.81 0 0.18 0 0 6 6 2.74 9.69 0.03
hist(mydata$Age, breaks = 39,
xlab = "Age",
ylab = "Frequency",
main = "Histogram of Age")
To take a deeper dive into the descriptive statistics, I thought it would be helpful to visualize the data. I chose a histogram of the ages with bars 2 years wide. I passed breaks = 39 because the descriptive statistics show a range of 79.58, which divides into about 40 two-year bins; hist() treats a single number as a suggestion and picks round break points. As the graph shows, the bar containing the median age of 28 is by far the tallest. This makes sense because we substituted the median for the 177 missing ages, which creates the large jump. If age were normally distributed, about 95% of passengers would fall within 2 standard deviations of the mean. With a mean of 29.36 and a standard deviation of 13.02, that is roughly ages 3 to 55. Because of the imputation spike and the positive skew (0.51), this is only an approximation.
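The interval arithmetic and the bin width can be checked with a few lines of R. This is a sketch under stated assumptions: the mean and standard deviation are copied from the `describe()` output above, and `share_within` is a hypothetical helper for comparing the 95% rule of thumb with the observed share.

```r
# Mean and standard deviation copied from the describe() output above
age_mean <- 29.36
age_sd <- 13.02

# Mean plus or minus 2 standard deviations
interval <- c(lower = age_mean - 2 * age_sd, upper = age_mean + 2 * age_sd)
round(interval, 2)

# Explicit break points force exactly 2-year bins from 0 to 80
two_year_breaks <- seq(0, 80, by = 2)
length(two_year_breaks) - 1  # number of bins

# Hypothetical helper: share of values that fall inside an interval
share_within <- function(x, lower, upper) {
  mean(x >= lower & x <= upper, na.rm = TRUE)
}
```

With the data loaded, `hist(mydata$Age, breaks = two_year_breaks)` would pin the bins, and `share_within(mydata$Age, interval["lower"], interval["upper"])` would give the actual share of passengers inside the interval.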
table(mydata$Survived, mydata$Sex)
##
## female male
## 0 81 468
## 1 233 109
More than twice as many women as men survived (233 vs. 109), and more than five times as many men as women did not survive (468 vs. 81). When a difference is this clear, we need to ask why. It makes sense given that, when the Titanic was evacuated, women and children were sent to the lifeboats first.
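Raw counts are hard to compare because there were more men than women on board, so survival rates within each sex are a useful complement. A minimal sketch, rebuilding the table from the counts printed above rather than from `mydata`:

```r
# Counts copied from the table() output above
counts <- matrix(
  c(81, 233, 468, 109), nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 divides each column by its total, giving rates within each sex
rates <- prop.table(counts, margin = 2)
round(rates, 3)
```

The `Survived = 1` row shows that about 74% of women survived (233 of 314) compared with about 19% of men (109 of 577). On the loaded data, `prop.table(table(mydata$Survived, mydata$Sex), margin = 2)` gives the same rates.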
boxplot(mydata$Age ~ mydata$Survived, notch = TRUE, horizontal = TRUE,
xlab = "Age",
ylab = "Survived 1 = Yes",
main = "Survived vs Age")
In this box plot, the median age barely changes between the two groups, likely because we replaced so many values with the median age. However, the non-survivors appear to extend further above the third quartile, meaning more of the older passengers did not survive. This supports our hypothesis from question four that women and children were evacuated first.
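The quartiles that the box plot draws can also be printed as numbers, which makes the two groups easier to compare. The sketch below uses a toy data frame with made-up ages, purely to show the call; it is not the Titanic data.

```r
# Toy data frame standing in for mydata (ages are made up for illustration)
toy <- data.frame(
  Survived = c(0, 0, 0, 0, 1, 1, 1, 1),
  Age      = c(20, 28, 40, 60, 4, 24, 28, 36)
)

# Quartiles of Age within each Survived group
quartiles <- tapply(toy$Age, toy$Survived, quantile,
                    probs = c(0.25, 0.5, 0.75))
quartiles
```

On the real data, the same call would be `tapply(mydata$Age, mydata$Survived, quantile, probs = c(0.25, 0.5, 0.75))`.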