DATASET: I have to import the Titanic survival data set from this website: http://www.personal.psu.edu/dlp/w540/titanic540.csv

Before I begin…

I have to load all the necessary packages, and I don’t want to display any warning messages.

library(utils)
library(datasets)
library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Task 1: Import the titanic540.csv dataset into R

  • This is a large data set (1,309 observations), so I have chosen to hide the output rather than print every row.
titanic <- read.csv("http://www.personal.psu.edu/dlp/w540/titanic540.csv")
titanic
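As an alternative to hiding the output, a quick look at the dimensions and the first few rows is usually enough. The sketch below is only an illustration: it reads a tiny made-up CSV from a string (the object names csv_text and toy are hypothetical) so that it runs without the download, and it sets stringsAsFactors = TRUE explicitly because newer versions of R no longer create factors by default.

```r
# Hypothetical four-row extract in the same shape as the real file
csv_text <- "pclass,survived,sex,age
1,1,female,29
1,0,male,NA
3,0,male,22
2,1,female,"

# read.csv() accepts text = ... in place of a file path or URL
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

dim(toy)       # rows and columns, without printing the data
head(toy, 2)   # first two rows only
```

With the real data the same two calls would be dim(titanic) and head(titanic).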

Task 2: Convert the titanic540.csv dataset from a data frame into a “tibble”

To convert the data set, I need to use the tbl_df function in the tibble package. So I’ll load the package and use the function.

library(tibble)
## Warning: package 'tibble' was built under R version 3.4.1
titanic.tbl <- tbl_df(titanic)
titanic.tbl
## # A tibble: 1,309 x 8
##    pclass survived    sex   age sibsp parch   fare embarked
##     <int>    <int> <fctr> <int> <int> <int>  <dbl>   <fctr>
##  1      1        1 female    29     0     0 211.34        S
##  2      1        1   male     1     1     2 151.55        S
##  3      1        0 female     2     1     2 151.55        S
##  4      1        0   male    30     1     2 151.55        S
##  5      1        0 female    25     1     2 151.55        S
##  6      1        1   male    48     0     0  26.55        S
##  7      1        1 female    63     1     0  77.96        S
##  8      1        0   male    39     0     0   0.00        S
##  9      1        1 female    53     2     0  51.48        S
## 10      1        0   male    71     0     0  49.50        C
## # ... with 1,299 more rows
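In more recent releases of dplyr, tbl_df() is deprecated and as_tibble() from the tibble package is the suggested replacement. A minimal sketch on a made-up two-row data frame (toy and toy_tbl are hypothetical names), assuming the tibble package is installed:

```r
library(tibble)

# Hypothetical stand-in for the imported data frame
toy <- data.frame(pclass = c(1L, 3L), survived = c(1L, 0L))

# as_tibble() converts a data frame to a tibble
toy_tbl <- as_tibble(toy)
class(toy_tbl)
```

The same call on the real data would be as_tibble(titanic).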

Task 3: Calculate the proportion of surviving passengers

  • I can do this in several different ways.
  • First I’ll get the structure and summary of the data to figure out the variable types and see if any data is missing.
str(titanic.tbl)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1309 obs. of  8 variables:
##  $ pclass  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age     : int  29 1 2 30 25 48 63 39 53 71 ...
##  $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
##  $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
##  $ fare    : num  211 152 152 152 152 ...
##  $ embarked: Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
summary(titanic.tbl)
##      pclass         survived         sex           age      
##  Min.   :1.000   Min.   :0.000   female:466   Min.   : 0.0  
##  1st Qu.:2.000   1st Qu.:0.000   male  :843   1st Qu.:21.0  
##  Median :3.000   Median :0.000                Median :28.0  
##  Mean   :2.295   Mean   :0.382                Mean   :29.9  
##  3rd Qu.:3.000   3rd Qu.:1.000                3rd Qu.:39.0  
##  Max.   :3.000   Max.   :1.000                Max.   :80.0  
##                                               NA's   :263   
##      sibsp            parch            fare        embarked
##  Min.   :0.0000   Min.   :0.000   Min.   :  0.00    :  2   
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:  7.90   C:270   
##  Median :0.0000   Median :0.000   Median : 14.45   Q:123   
##  Mean   :0.4989   Mean   :0.385   Mean   : 33.30   S:914   
##  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.: 31.28           
##  Max.   :8.0000   Max.   :9.000   Max.   :512.33           
##                                   NA's   :1

From the summary I gather that no values are missing for the ‘survived’ variable.

survivors <- table(titanic.tbl$survived)
survivors
## 
##   0   1 
## 809 500
500/1309
## [1] 0.381971

The overall survival rate was about 38%.

I could also run the ‘prop.table’ function to automatically calculate the proportion of survivors.

survivorprop <- table(titanic.tbl$survived==1)
survivorprop
## 
## FALSE  TRUE 
##   809   500
prop.table(survivorprop)
## 
##    FALSE     TRUE 
## 0.618029 0.381971
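Because ‘survived’ is coded 0/1, its mean is also the proportion of survivors. The sketch below rebuilds a 0/1 vector from the counts reported above (809 and 500) instead of downloading the file; the names survived_vec and survival_rate are hypothetical.

```r
# Rebuild a 0/1 vector from the counts in the table above:
# 809 passengers with survived == 0 and 500 with survived == 1
survived_vec <- rep(c(0, 1), times = c(809, 500))

# The mean of a 0/1 variable is the share of 1s
survival_rate <- mean(survived_vec)
round(survival_rate, 3)
```

On the tibble itself this would be mean(titanic.tbl$survived), which matches the Mean entry for survived in the summary output.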

Task 4: Calculate the proportion of surviving passengers by sex.

  • For this calculation, I first use the ‘group_by’ function to group the passengers by sex.
  • I can then use the summarise verb to find the mean of ‘survived’ within each group. Since the variable is coded 0/1, that mean is the proportion of survivors.
titanic.tbl %>%
  group_by(sex) %>%
  summarise(survivors_by_sex = mean(survived))
## # A tibble: 2 x 2
##      sex survivors_by_sex
##   <fctr>            <dbl>
## 1 female        0.7274678
## 2   male        0.1909846

The survival rate for women was almost 73%, while that for men was 19%.
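The grouped mean can be cross-checked in base R with tapply(). This is a sketch on six made-up passengers, not the real data; toy and rate_by_sex are hypothetical names.

```r
# Hypothetical data: two women (one survived) and four men (one survived)
toy <- data.frame(
  sex      = c("female", "female", "male", "male", "male", "male"),
  survived = c(1, 0, 0, 0, 1, 0)
)

# Apply mean() to survived within each level of sex
rate_by_sex <- tapply(toy$survived, toy$sex, mean)
rate_by_sex
```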

Task 5: Calculate the mean (average) age of surviving female passengers.

This one requires the use of another piped command. Filter to identify the relevant observations and summarise to find the average.

titanic.tbl %>%
  filter(sex=="female", survived==1) %>%
  summarise(female_survivors = mean(age, na.rm=TRUE))
## # A tibble: 1 x 1
##   female_survivors
##              <dbl>
## 1          28.6933

The mean (average) age of surviving female passengers was about 29 (28.7).
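The na.rm=TRUE argument matters here because age has 263 missing values. A small sketch with made-up ages (ages and mean_age are hypothetical names) shows what happens with and without it:

```r
# Hypothetical ages with two missing values
ages <- c(22, NA, 35, 48, NA)

# Without na.rm, a single NA makes the result NA
mean(ages)

# With na.rm = TRUE, the NAs are dropped before averaging
mean_age <- mean(ages, na.rm = TRUE)
mean_age
```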

Task 6: Calculate the number of surviving passengers 10 years old or younger.

I filter the relevant observations and count the total number of observations that return.

titanic.tbl %>%
  filter(age<=10, survived==1)
## # A tibble: 50 x 8
##    pclass survived    sex   age sibsp parch   fare embarked
##     <int>    <int> <fctr> <int> <int> <int>  <dbl>   <fctr>
##  1      1        1   male     1     1     2 151.55        S
##  2      1        1   male     4     0     2  81.86        S
##  3      1        1   male     6     0     2 134.50        C
##  4      2        1   male     1     2     1  39.00        S
##  5      2        1 female     4     2     1  39.00        S
##  6      2        1   male     1     0     2  29.00        S
##  7      2        1 female     8     0     2  26.25        S
##  8      2        1   male     8     1     1  36.75        S
##  9      2        1   male     8     0     2  32.50        S
## 10      2        1   male     1     1     1  14.50        S
## # ... with 40 more rows

The number of surviving passengers 10 years old or younger was 50.
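Instead of reading the row count off the tibble header, the count can be computed directly. The sketch below uses made-up data (toy and the result names are hypothetical) and shows that rows with a missing age are not counted, which is also how filter behaves.

```r
# Hypothetical data: one passenger has no recorded age
toy <- data.frame(
  age      = c(4, 10, 11, NA, 8),
  survived = c(1, 1, 1, 1, 0)
)

# Sum the logical condition; na.rm = TRUE skips the row with a missing age
n_young_survivors <- sum(toy$age <= 10 & toy$survived == 1, na.rm = TRUE)

# subset() drops rows where the condition is NA, so nrow() agrees
n_rows <- nrow(subset(toy, age <= 10 & survived == 1))
```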

Task 7: Calculate the maximum, minimum, and median age of surviving passengers 10 years old or older.

I use another piped command for this: filter the relevant observations, then summarise the maximum, minimum and median.

titanic.tbl %>%
  filter(age>=10, survived==1) %>%
  summarise(max=max(age, na.rm=TRUE), min=min(age, na.rm=TRUE), median(age, na.rm = TRUE))
## # A tibble: 1 x 3
##     max   min `median(age, na.rm = TRUE)`
##   <dbl> <dbl>                       <int>
## 1    80    11                          30

The maximum age of surviving passengers 10 years or older was 80, the minimum was 11 and the median was 30.
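Note that the filters in Task 6 (age<=10) and Task 7 (age>=10) both include anyone aged exactly 10. The sketch below uses made-up ages to show the overlap, and also names all three summary values so that none of them gets an automatic label; every object name in it is hypothetical.

```r
# Hypothetical ages, including one passenger aged exactly 10
ages <- c(5, 10, 11, 30, 80)

at_most_10  <- ages[ages <= 10]
at_least_10 <- ages[ages >= 10]

# Ages that fall into both groups
overlap <- intersect(at_most_10, at_least_10)

# Named summary of the "10 or older" group
age_summary <- c(
  max    = max(at_least_10),
  min    = min(at_least_10),
  median = median(at_least_10)
)
age_summary
```

In the real output the minimum is 11, which suggests that no survivor with a recorded age was exactly 10, so the overlap does not change the counts here.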

Task 8: Calculate the proportion of surviving passengers by port of embarkation.

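I will use the prop.table function in R for this. I first need to create a table of survival status by port of embarkation. Called without a margin argument, prop.table reports each cell as a share of all 1,309 passengers rather than the survival rate within each port; the unlabelled first column holds the two passengers with no recorded port.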

survivors_by_port <- table(titanic.tbl$survived, titanic.tbl$embarked)
survivors_by_port
##    
##           C   Q   S
##   0   0 120  79 610
##   1   2 150  44 304
prop.table(survivors_by_port)
##    
##                           C           Q           S
##   0 0.000000000 0.091673033 0.060351413 0.466004584
##   1 0.001527884 0.114591291 0.033613445 0.232238350
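To get the survival rate within each port, prop.table can be given margin = 2 so that every column sums to 1. The sketch below rebuilds the table from the counts printed above rather than from the file; the label "blank" for the unlabelled column and the object name rate_by_port are assumptions made for the example.

```r
# Counts copied from the survivors_by_port output above
# (rows: survived 0/1; columns: blank, C, Q, S)
survivors_by_port <- matrix(
  c(0, 2, 120, 150, 79, 44, 610, 304),
  nrow = 2,
  dimnames = list(survived = c("0", "1"),
                  embarked = c("blank", "C", "Q", "S"))
)

# margin = 2 divides each cell by its column total
rate_by_port <- prop.table(survivors_by_port, margin = 2)

# Row "1" is the proportion of survivors among passengers from each port
round(rate_by_port["1", ], 3)
```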

Task 9: Calculate the number of surviving female passengers over the age of 40 years old by port of embarkation.

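This one’s a little complicated. I have to string together several piped commands: filter, select, group_by and finally count. Note that the filter as written keeps every female passenger aged 40 or older, whether or not she survived; adding survived==1 and changing the age test to age>40 would match the task wording. Converting embarked with as.numeric also replaces the port labels with factor level codes: 1 is the blank level, 2 is C, 3 is Q and 4 is S.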

titanic.tbl$embarked <- as.numeric(titanic.tbl$embarked)
sur.fem <- titanic.tbl %>%
  filter(sex=="female", age>=40) %>%
  select(sex, age, embarked) %>%
  group_by(embarked)
sur.fem
## # A tibble: 84 x 3
## # Groups:   embarked [3]
##       sex   age embarked
##    <fctr> <int>    <dbl>
##  1 female    63        4
##  2 female    53        4
##  3 female    50        2
##  4 female    47        4
##  5 female    42        2
##  6 female    58        4
##  7 female    45        2
##  8 female    44        2
##  9 female    59        4
## 10 female    60        2
## # ... with 74 more rows
count(sur.fem)
## # A tibble: 3 x 2
## # Groups:   embarked [3]
##   embarked     n
##      <dbl> <int>
## 1        1     1
## 2        2    31
## 3        4    52
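A sketch of how the count could be made to match the task wording while keeping the port labels. It runs on five made-up passengers, so the numbers are not the real answer; toy, codes, keep and counts are hypothetical names.

```r
# Hypothetical data with the same four embarked levels: "", C, Q, S
toy <- data.frame(
  sex      = c("female", "female", "female", "male", "female"),
  age      = c(45, 40, 63, 50, NA),
  survived = c(1, 1, 0, 1, 1),
  embarked = factor(c("S", "C", "S", "", "Q"))
)

# as.integer() on a factor returns level codes, not the labels
codes <- as.integer(toy$embarked)

# Surviving women strictly older than 40 with a recorded age
keep <- with(toy, sex == "female" & survived == 1 & !is.na(age) & age > 40)

# Count by port label, dropping ports with no matching passengers
counts <- table(droplevels(toy$embarked[keep]))
counts
```

Leaving embarked as a factor keeps the labels C, Q and S in the result, so no lookup of the numeric codes is needed.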

Task 10: Calculate the mean (average) fare that passengers paid by port of embarkation. Note: some values of fare are missing.

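I use the group_by and summarise verbs to calculate the mean. Since some values for the fare are missing, I have to use na.rm=TRUE. Because embarked was converted to numeric in Task 9, the groups appear as codes 1 to 4 rather than as port letters.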

titanic.tbl %>%
  group_by(embarked) %>%
  summarise (avg_fare = mean(fare,na.rm = TRUE)) 
## # A tibble: 4 x 2
##   embarked avg_fare
##      <dbl>    <dbl>
## 1        1 80.00000
## 2        2 62.33719
## 3        3 12.40935
## 4        4 27.41963
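The numeric codes can be mapped back to the port labels by indexing into the factor levels shown in the str output ("", "C", "Q", "S"). The sketch copies the averages from the output above; the "(blank)" label and the object names are assumptions made for readability.

```r
# Averages copied from the summarise output above
avg_fare_by_code <- data.frame(
  embarked = 1:4,
  avg_fare = c(80.00000, 62.33719, 12.40935, 27.41963)
)

# Level order taken from the str() output; "(blank)" stands in for ""
port_levels <- c("(blank)", "C", "Q", "S")

# Index the labels by code to recover the port for each row
avg_fare_by_code$port <- port_levels[avg_fare_by_code$embarked]
avg_fare_by_code
```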

Task 11: Calculate number of surviving passengers who had any siblings/spouses aboard the Titanic.

I use the filter verb and count the number of observations in the tibble that’s created.

titanic.tbl %>%
  filter(survived==1, sibsp>0)
## # A tibble: 191 x 8
##    pclass survived    sex   age sibsp parch   fare embarked
##     <int>    <int> <fctr> <int> <int> <int>  <dbl>    <dbl>
##  1      1        1   male     1     1     2 151.55        4
##  2      1        1 female    63     1     0  77.96        4
##  3      1        1 female    53     2     0  51.48        4
##  4      1        1 female    18     1     0 227.53        2
##  5      1        1   male    37     1     1  52.55        4
##  6      1        1 female    47     1     1  52.55        4
##  7      1        1   male    25     1     0  91.08        2
##  8      1        1 female    19     1     0  91.08        2
##  9      1        1 female    59     2     0  51.48        4
## 10      1        1   male    11     1     2 120.00        4
## # ... with 181 more rows

Total number of surviving passengers who had any siblings/spouses aboard the Titanic was 191.

Task 12: Calculate number of surviving passengers who had any parents/children aboard the Titanic.

I filter the relevant observations and count the total number that return.

titanic.tbl %>%
  filter(survived==1, parch>0)
## # A tibble: 164 x 8
##    pclass survived    sex   age sibsp parch   fare embarked
##     <int>    <int> <fctr> <int> <int> <int>  <dbl>    <dbl>
##  1      1        1   male     1     1     2 151.55        4
##  2      1        1 female    50     0     1 247.52        2
##  3      1        1   male    37     1     1  52.55        4
##  4      1        1 female    47     1     1  52.55        4
##  5      1        1 female    22     0     1  55.00        4
##  6      1        1   male    36     0     1 512.33        2
##  7      1        1 female    58     0     1 512.33        2
##  8      1        1   male    11     1     2 120.00        4
##  9      1        1 female    14     1     2 120.00        4
## 10      1        1   male    36     1     2 120.00        4
## # ... with 154 more rows

Total number of surviving passengers who had any parents/children aboard the Titanic was 164.
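The counts for Tasks 11 and 12 can be computed together, along with the number of survivors who fall into both groups. This is a sketch on five made-up passengers; all object names are hypothetical.

```r
# Hypothetical data
toy <- data.frame(
  survived = c(1, 1, 1, 0, 1),
  sibsp    = c(1, 0, 2, 1, 0),
  parch    = c(0, 2, 1, 3, 0)
)

# Keep survivors only
survivors <- toy[toy$survived == 1, ]

# Summing a logical vector counts the TRUE values
n_with_sibsp <- sum(survivors$sibsp > 0)
n_with_parch <- sum(survivors$parch > 0)
n_with_both  <- sum(survivors$sibsp > 0 & survivors$parch > 0)
```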

Task 13: Calculate the mean (average) fare that passengers paid by passenger class.

It’s time to calculate mean again. That means using the summarise verb. Of course, I have to use group_by as I have to calculate the mean by passenger class.

titanic.tbl %>%
  group_by(pclass) %>%
  summarise(avg_fare=mean(fare, na.rm=TRUE))
## # A tibble: 3 x 2
##   pclass avg_fare
##    <int>    <dbl>
## 1      1 87.50935
## 2      2 21.17928
## 3      3 13.30414
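A base R cross-check for a grouped mean is aggregate() with a formula. With the formula interface, rows with a missing fare are dropped by default, so no na.rm argument is needed. The data below are made up, and toy and avg_fare are hypothetical names.

```r
# Hypothetical fares, with one missing value in third class
toy <- data.frame(
  pclass = c(1, 1, 2, 3, 3),
  fare   = c(100, 50, 20, NA, 10)
)

# Mean fare for each passenger class
avg_fare <- aggregate(fare ~ pclass, data = toy, FUN = mean)
avg_fare
```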

Task 14: Calculate a regular frequency distribution of the number of parents/children aboard the Titanic of female passengers.

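I have to calculate the frequency distribution. That means using the ftable function. But first I have to filter the relevant observations and select the relevant variables. The filter below keeps surviving female passengers with at least one parent/child aboard, so passengers with parch equal to 0 do not appear in the distribution.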

frq.dist1 <- titanic.tbl %>%
  filter(survived==1, sex=="female", parch>0) %>%
  select(survived, parch)
frq.dist1
## # A tibble: 121 x 2
##    survived parch
##       <int> <int>
##  1        1     1
##  2        1     1
##  3        1     1
##  4        1     1
##  5        1     2
##  6        1     2
##  7        1     1
##  8        1     1
##  9        1     2
## 10        1     2
## # ... with 111 more rows
ftable(frq.dist1)
##          parch  1  2  3  4  5
## survived                     
## 1              70 44  5  1  1

Task 15: Calculate a regular frequency distribution of the number of siblings/spouses of male passengers who had at least one or more siblings/spouses aboard the Titanic.

Another frequency distribution. That means using the ftable function again, after I filter the relevant observations and select the relevant variables.

frq.dist2 <- titanic.tbl %>%
  filter(sex=="male", sibsp>=1) %>%
  select(sibsp)
frq.dist2
## # A tibble: 214 x 1
##    sibsp
##    <int>
##  1     1
##  2     1
##  3     1
##  4     1
##  5     1
##  6     1
##  7     1
##  8     1
##  9     1
## 10     1
## # ... with 204 more rows
ftable(frq.dist2)
## x   1   2   3   4   5   8
##                          
##   159  23   8  15   4   5
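The counts can also be turned into relative frequencies with prop.table. The sketch below copies the six counts from the ftable output above; sibsp_freq and rel_freq are hypothetical names.

```r
# Counts copied from the ftable output above, named by sibsp value
sibsp_freq <- c("1" = 159, "2" = 23, "3" = 8, "4" = 15, "5" = 4, "8" = 5)

# The counts add up to the 214 rows of frq.dist2
sum(sibsp_freq)

# Share of each sibsp value among these male passengers
rel_freq <- prop.table(sibsp_freq)
round(rel_freq, 3)
```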