Downloading the dataset

Following the instructions and the first video from week 1, I downloaded the dataset from Kaggle into my week 1 folder as both a text file and an Excel file.

mydata <- read.csv('./train.csv')

Setting Up Data

str(mydata)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
head(mydata)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
names(mydata)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"
summary(mydata)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

Q1a. What are the types of variable (quantitative / qualitative) and (nominal / ordinal / interval / ratio) for PassengerId and Age?

Passenger ID

str(mydata$PassengerId)
##  int [1:891] 1 2 3 4 5 6 7 8 9 10 ...

Passenger ID is a qualitative variable, and its level of measurement is nominal. Although it is stored as an integer, the number only labels each passenger. As our slides note, ID numbers are nominal because they identify observations rather than measure a quantity, and nominal variables are qualitative.

Age

str(mydata$Age)
##  num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...

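Age is a quantitative variable, and its level of measurement is ratio. We know that age is quantitative because it’s real-valued. It has an order and is measured on a scale of equal-sized units, which makes it at least interval. Because it also has a true zero (a 40-year-old is twice as old as a 20-year-old), it is a ratio variable.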

Q1b. Which variable has the most missing observations? You could have Googled for “Count NA values in R for all columns in a dataframe” or something like that.

mydata[mydata == ""] <- NA
mydata[mydata == " "] <- NA
colSums(is.na(mydata))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

We can see here that ‘Cabin’ is the column with the most missing observations, with 687. It is followed by ‘Age’ with 177 and ‘Embarked’ with 2. The remaining columns have no missing values. When I first ran the code, the only column reporting missing values was Age. That did not seem right when looking at the data, and I realized my initial code counted only true NA values, while the character columns stored their missing entries as empty strings. I then replaced the empty and single-space strings with NA so that colSums(is.na(mydata)) counts every missing value.
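As an alternative to replacing empty strings after loading, the same result can be obtained at read time. The sketch below is a minimal illustration, assuming a small made-up CSV (the `csv_text` string and the `toy` data frame are hypothetical stand-ins for train.csv); it uses the `na.strings` argument of `read.csv` so that empty fields in character columns are read as NA.

```r
# Toy CSV text standing in for train.csv (rows are made up for illustration).
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,C
3,,,
"

# na.strings lists the raw strings that read.csv should treat as NA.
# Without "" in this list, empty character fields would be read as "".
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Count missing values per column, as in Q1b.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

With the real file, the equivalent call would presumably be `read.csv('./train.csv', na.strings = c("", "NA"))`, after which `colSums(is.na(mydata))` should need no extra replacement step.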

Q2. Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE) . You can read up on indexing in dataframe.

mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE)
mydata$SibSp[is.na(mydata$SibSp)] <- median(mydata$SibSp, na.rm=TRUE)
mydata$Parch[is.na(mydata$Parch)] <- median(mydata$Parch, na.rm=TRUE)
summary(mydata$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00
summary(mydata$SibSp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.523   1.000   8.000
summary(mydata$Parch)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3816  0.0000  6.0000

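I followed the suggested code to impute any missing data with the column median, then ran a summary for each column to check the result. None of the summaries lists any NA’s. Since SibSp and Parch had no missing values to begin with, only Age changed: its mean dropped from 29.70 to 29.36, and its quartiles moved closer to the median (20.12 to 22 and 38 to 35) because 177 values were filled in at 28.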
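Q2 also mentions the column mode, which the code above does not use because all three columns are numeric. The sketch below shows one way both cases could be handled with a single helper. The function name `impute_column` and the `toy` data frame are hypothetical and exist only for illustration; this is not the approach prescribed by the assignment.

```r
# Hypothetical helper: fill NAs with the median (numeric columns)
# or with the most frequent value, the mode (all other columns).
impute_column <- function(x) {
  if (!anyNA(x)) return(x)               # nothing to impute
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    counts <- table(x)                   # table() ignores NA by default
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}

# Made-up data with one missing value per column.
toy <- data.frame(
  Age      = c(22, 38, NA, 35, 54),
  Embarked = c("S", "C", "S", NA, "S"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column while keeping the data frame structure.
toy[] <- lapply(toy, impute_column)
print(toy)
```

Here the missing Age becomes 36.5 (the median of 22, 35, 38, and 54) and the missing Embarked becomes "S", the most frequent value.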

Q3. Install the psych package in R: install.packages('psych'). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)). Please comment on what you observe from the summary statistics.

library(psych)

#Age
describe(mydata$Age)
##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
#SibSp
describe(mydata$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04
#Parch
describe(mydata$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03
hist(mydata$Age, breaks = 39, 
     xlab = "Age",
     ylab = "Frequency",
     main = "Histogram of Age")

To dig deeper into the descriptive statistics, I thought it would be helpful to visualize the data. I chose a histogram of Age with bars about 2 years wide. I passed breaks = 39 because the descriptive statistics show a range of 79.58, and 79.58 / 2 is roughly 40. Note that hist() treats a single number for breaks as a suggestion only, so the actual bin edges may differ. As the graph shows, the bar containing our median age of 28 has by far the most passengers. This makes sense, since we replaced the 177 missing ages with the median, creating a large spike at that value. If Age were normally distributed, about 95% of passengers would fall within 2 standard deviations of the mean. With a mean of 29.36 and a standard deviation of 13.02, that is roughly ages 3 to 55. The skew of 0.51 indicates a somewhat longer right tail, so the normal approximation is only a rough guide.
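The two-standard-deviation range and the 2-year bins can be worked out directly. The sketch below plugs in the mean and standard deviation reported by describe() above; the variable names (`age_mean`, `age_sd`, `interval`, `bin_edges`) are introduced here only for illustration. Passing an explicit vector of bin edges, rather than a single number, is one way to make hist() use exactly 2-year bins.

```r
# Values taken from the describe(mydata$Age) output above.
age_mean <- 29.36
age_sd   <- 13.02

# Range covering about 95% of a normal distribution: mean +/- 2 sd.
interval <- age_mean + c(-2, 2) * age_sd
print(round(interval, 2))   # 3.32 55.40

# Explicit bin edges every 2 years from 0 to 80 give exactly 40 bins.
bin_edges <- seq(0, 80, by = 2)
print(length(bin_edges) - 1)   # 40

# With the data loaded, the histogram call would then be:
# hist(mydata$Age, breaks = bin_edges, xlab = "Age", main = "Histogram of Age")
```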

Q4. Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?

table(mydata$Survived, mydata$Sex)
##    
##     female male
##   0     81  468
##   1    233  109

The number of women who survived (233) is more than double the number of men who survived (109), and the number of men who did not survive (468) is more than five times the number of women who did not (81). When a difference is this clear, we need to think about why that may be the case. It is consistent with what we know about the evacuation of the Titanic, when women and children were sent to the lifeboats first.
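Because there were many more men than women on board, the raw counts are easier to compare as proportions within each sex. The sketch below rebuilds the table from the counts printed above (the objects `tab` and `rates` are names introduced here for illustration) and uses prop.table() with margin = 2 so that each column sums to 1.

```r
# Rebuild the cross-tabulation from the counts shown in the output above.
tab <- matrix(
  c(81, 233,    # female: did not survive, survived
    468, 109),  # male:   did not survive, survived
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Proportions within each sex (each column sums to 1).
rates <- prop.table(tab, margin = 2)
print(round(rates, 3))
```

On these counts, about 74.2% of women survived (233 of 314) compared with about 18.9% of men (109 of 577). With the data loaded, the same result should come from `prop.table(table(mydata$Survived, mydata$Sex), margin = 2)`.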

Q5. Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?

boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T,
        xlab = "Age",
        ylab = "Survived 1 = Yes", 
        main = "Survived vs Age")

In this boxplot we can see that the median age barely differs between the two groups, likely in part because we replaced so many missing values with the median age. We can see, however, that among passengers above the third quartile in age, many more did not survive than survived. This is consistent with our hypothesis from question four that women and children were evacuated first.
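The medians and notches drawn by the boxplot can also be inspected numerically. The sketch below is only an illustration: it uses the first ten Age and Survived values shown in the str() output above, with the one missing age filled in as 28 to mirror the imputation, so its numbers do not describe the full dataset. The names `age`, `survived`, `medians`, and `notches` are introduced here for the example.

```r
# First ten passengers from the str() output; the NA age is set to 28,
# mirroring the median imputation used in Q2.
age      <- c(22, 38, 26, 35, 35, 28, 54, 2, 27, 14)
survived <- c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1)

# Median age within each survival group.
medians <- tapply(age, survived, median)
print(medians)

# boxplot.stats()$conf gives the lower and upper notch extents per group;
# notches that do not overlap suggest the medians differ.
notches <- sapply(split(age, survived), function(a) boxplot.stats(a)$conf)
print(notches)
```

For these ten rows the group medians are 28 (did not survive) and 27 (survived). On the full data, the same two calls with mydata$Age and mydata$Survived would give the values behind the plot.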