Now that our data sets are set up, let’s take a look at them and start to analyze what happened. (We will learn some R basics along the way.)

Let’s focus on the Titanic.

Start by loading the dataset that you previously saved.

  # I had to add the extra folder level because I use projects; adjust the path to match where you saved the file.
  load(file.path("~", "fall16/data/RMS_Titanic.Rda"))
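If you are unsure what load() does, the following minimal sketch may help. It uses a made-up data frame called toy and a temporary file (both are assumptions for illustration, not part of the course data) to show that load() restores objects under the names they were saved with.

```r
# Hypothetical example object, not the Titanic data
toy <- data.frame(Id = 1:3)

# Save it to a temporary file, then remove it from the workspace
path <- file.path(tempdir(), "toy.Rda")
save(toy, file = path)
rm(toy)

# load() puts the object back under its original name and
# returns the names of everything it restored
restored <- load(path)
restored
exists("toy")
```

This is why, after loading RMS_Titanic.Rda, the data set is available as RMS_Titanic without assigning the result of load() to anything.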

Describing the Data

The data we downloaded contain one row per person on board a given ship, in this case the RMS Titanic.

Find some basic information about the RMS_Titanic data set using the str() and summary() functions. To do that, put the name of the data set inside the parentheses.

str(RMS_Titanic)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2208 obs. of  20 variables:
##  $ Id_8                       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Ship Id                    : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ Year                       : num  1912 1912 1912 1912 1912 ...
##  $ Nationality of the Ship    : chr  "U.K" "U.K" "U.K" "U.K" ...
##  $ Women and children first   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Quick                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cause                      : chr  "Collision" "Collision" "Collision" "Collision" ...
##  $ No. of passengers          : num  1317 1317 1317 1317 1317 ...
##  $ No. of women passengers    : num  463 463 463 463 463 463 463 463 463 463 ...
##  $ Women passengers/passengers: num  0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.352 ...
##  $ Ship size                  : num  2208 2208 2208 2208 2208 ...
##  $ Length of voyage           : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ Gender                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age                        : num  42 16 14 21 30 33 30 26 26 20 ...
##  $ Child                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Crew                       : num  0 0 0 1 0 1 0 0 1 1 ...
##  $ Passenger Class            : num  3 3 3 NA 2 NA 3 3 NA NA ...
##  $ Nationality of Passenger   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Companionship              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Survival                   : num  0 0 0 0 0 0 0 0 0 0 ...
summary(RMS_Titanic)
##       Id_8           Ship Id       Year      Nationality of the Ship
##  Min.   :   1.0   Min.   :8   Min.   :1912   Length:2208            
##  1st Qu.: 552.8   1st Qu.:8   1st Qu.:1912   Class :character       
##  Median :1104.5   Median :8   Median :1912   Mode  :character       
##  Mean   :1104.5   Mean   :8   Mean   :1912                          
##  3rd Qu.:1656.2   3rd Qu.:8   3rd Qu.:1912                          
##  Max.   :2208.0   Max.   :8   Max.   :1912                          
##                                                                     
##  Women and children first     Quick      Cause           No. of passengers
##  Min.   :1                Min.   :0   Length:2208        Min.   :1317     
##  1st Qu.:1                1st Qu.:0   Class :character   1st Qu.:1317     
##  Median :1                Median :0   Mode  :character   Median :1317     
##  Mean   :1                Mean   :0                      Mean   :1317     
##  3rd Qu.:1                3rd Qu.:0                      3rd Qu.:1317     
##  Max.   :1                Max.   :0                      Max.   :1317     
##                                                                           
##  No. of women passengers Women passengers/passengers   Ship size   
##  Min.   :463             Min.   :0.352               Min.   :2208  
##  1st Qu.:463             1st Qu.:0.352               1st Qu.:2208  
##  Median :463             Median :0.352               Median :2208  
##  Mean   :463             Mean   :0.352               Mean   :2208  
##  3rd Qu.:463             3rd Qu.:0.352               3rd Qu.:2208  
##  Max.   :463             Max.   :0.352               Max.   :2208  
##                                                                    
##  Length of voyage     Gender            Age            Child     
##  Min.   :5        Min.   :0.0000   Min.   : 0.00   Min.   : NA   
##  1st Qu.:5        1st Qu.:0.0000   1st Qu.:22.00   1st Qu.: NA   
##  Median :5        Median :0.0000   Median :29.00   Median : NA   
##  Mean   :5        Mean   :0.2201   Mean   :29.91   Mean   :NaN   
##  3rd Qu.:5        3rd Qu.:0.0000   3rd Qu.:36.00   3rd Qu.: NA   
##  Max.   :5        Max.   :1.0000   Max.   :74.00   Max.   : NA   
##                                    NA's   :10      NA's   :2208  
##       Crew        Passenger Class Nationality of Passenger Companionship 
##  Min.   :0.0000   Min.   :1.000   Min.   : NA              Min.   : NA   
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.: NA              1st Qu.: NA   
##  Median :0.0000   Median :3.000   Median : NA              Median : NA   
##  Mean   :0.4035   Mean   :2.292   Mean   :NaN              Mean   :NaN   
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.: NA              3rd Qu.: NA   
##  Max.   :1.0000   Max.   :3.000   Max.   : NA              Max.   : NA   
##                   NA's   :891     NA's   :2208             NA's   :2208  
##     Survival     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3225  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
## 

You should notice some odd things.

Some variables are all set to NA. Which ones are these?

Keep in mind that some variables are not available for every ship. When a variable is not available for a ship, all of its values are missing (NA).
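Reading the summary() output is one way to spot these variables. As a rough sketch of another way, the code below counts missing values per column on a small made-up data frame (toy and its columns are assumptions for illustration) and keeps the columns where every value is missing.

```r
# Hypothetical toy data, not the Titanic data
toy <- data.frame(
  Age   = c(42, 16, NA, 21),
  Child = c(NA, NA, NA, NA),
  Crew  = c(0, 0, 1, 1)
)

# is.na() flags every missing cell; colSums() counts them per column
n_missing <- colSums(is.na(toy))

# A column is entirely missing when its NA count equals the number of rows
all_missing <- names(toy)[n_missing == nrow(toy)]
all_missing
```

The same two lines should work on RMS_Titanic if you replace toy with the name of the data set.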

There are some variables where all the values are the same. Which are these?

Why do some variables have the same value for every observation? Think about what they refer to.
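One possible way to check this in code, sketched on made-up data (toy and its values are assumptions), is to count the distinct values in each column and keep the columns that have only one.

```r
# Hypothetical toy data, not the Titanic data
toy <- data.frame(
  Ship.Id  = c(8, 8, 8, 8),
  Year     = c(1912, 1912, 1912, 1912),
  Survival = c(0, 1, 0, 1)
)

# Count the distinct values in each column
n_distinct <- sapply(toy, function(x) length(unique(x)))

# Columns with exactly one distinct value are constant
constant <- names(n_distinct)[n_distinct == 1]
constant
```

Note that a column that is entirely NA also has only one distinct value, so this check would flag it as well.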

Browse the dataset. There is something interesting about how the Crew and Passenger Class variables relate to each other. What is that?
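Besides browsing, a two-way table can show how the missing values of one variable line up with the values of another. The sketch below uses base R's table() on made-up data (toy, Crew, and Class are illustrative assumptions); whether RMS_Titanic shows a similar pattern is for you to check.

```r
# Hypothetical toy data, not the Titanic data
toy <- data.frame(
  Crew  = c(0, 0, 1, 1, 0),
  Class = c(3, 2, NA, NA, 1)
)

# Cross-tabulate one variable against whether the other is missing
tab <- table(Crew = toy$Crew, ClassMissing = is.na(toy$Class))
tab
```

With the real data, a column name that contains a space has to be wrapped in backticks when used with $, for example RMS_Titanic$`Passenger Class`.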

Tables

Now let’s get some descriptive information about the people on the ship using the crosstab function. The variables we are interested in from the RMS_Titanic data set are: Gender, Passenger Class, Crew, and Survival.

# Here is the first one. You add the others.
crosstab(RMS_Titanic, row.vars = "Gender")
##       
## Gender   Count Total %
##    0   1722.00   77.99
##    1    486.00   22.01
##    Sum 2208.00  100.00
crosstab(RMS_Titanic, row.vars = "Passenger Class")
##                
## Passenger.Class   Count Total %
##             1    324.00   24.60
##             2    285.00   21.64
##             3    708.00   53.76
##             Sum 1317.00  100.00
crosstab(RMS_Titanic, row.vars = "Survival")
##         
## Survival   Count Total %
##      0   1496.00   67.75
##      1    712.00   32.25
##      Sum 2208.00  100.00
crosstab(RMS_Titanic, row.vars = "Crew")
##      
## Crew    Count Total %
##   0   1317.00   59.65
##   1    891.00   40.35
##   Sum 2208.00  100.00
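If the crosstab function is not available in your session, base R can produce similar counts and percentages. The sketch below is one way to do it, using a small made-up vector (gender and its values are assumptions, so the numbers are not the Titanic's).

```r
# Hypothetical toy values, not the Titanic data
gender <- c(0, 0, 0, 1, 0, 1, 0, 0)

# table() counts how often each value occurs
counts <- table(gender)

# prop.table() turns the counts into proportions; multiply by 100 for percent
percents <- round(100 * prop.table(counts), 2)

# Put counts and percentages side by side
cbind(Count = counts, Percent = percents)
```

Unlike the crosstab output above, this does not add a Sum row; addmargins(counts) is one way to get the total count.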

We would also like to know the proportion of people aged 15 and under. The problem is that the Child variable is entirely missing, while Age is given in years. Let’s update the Child variable with information from the Age variable and then run the crosstab. (For your other ship, you need to check whether this is necessary and possible. You may need to do similar things with other variables.)

RMS_Titanic$Child <- as.numeric(RMS_Titanic$Age) <= 15

# Add the crosstab below
crosstab(RMS_Titanic, row.vars = "Child")
##        
## Child     Count Total %
##   FALSE 2063.00   93.86
##   TRUE   135.00    6.14
##   Sum   2198.00  100.00
crosstab(RMS_Titanic, col.vars = c("Child"), row.vars=c("Survival"), type = c("c"))
##          Child  FALSE   TRUE
## Survival                    
## 0               68.78  49.63
## 1               31.22  50.37
## Sum            100.00 100.00

Table: Survival for children and adults
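The table above reports column percentages: within each Child group, the share who did and did not survive. As a hedged sketch of the same idea in base R, the code below uses prop.table() with margin = 2 on made-up data (toy and its values are assumptions, not the Titanic figures).

```r
# Hypothetical toy data, not the Titanic data
toy <- data.frame(
  Survival = c(0, 0, 1, 0, 1, 1, 0, 1),
  Child    = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Two-way table of counts: rows are Survival, columns are Child
counts <- table(Survival = toy$Survival, Child = toy$Child)

# margin = 2 divides each cell by its column total,
# so each column adds up to 100 percent
col_pct <- round(100 * prop.table(counts, margin = 2), 2)
col_pct
```

Using margin = 1 instead would give row percentages, which answer a different question: among survivors, what share were children?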

Explain what you think the code RMS_Titanic$Child <- as.numeric(RMS_Titanic$Age) <= 15 does.
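Trying the same expression on a few made-up ages may help with the explanation. The vector age below is an assumption for illustration; pay attention to what happens to the missing value.

```r
# Hypothetical ages, including one missing value
age <- c(42, 15, 14, NA, 16)

# The comparison is applied to each element in turn
child <- as.numeric(age) <= 15
child

# Comparing NA with a number gives NA, not FALSE
sum(is.na(child))

# table() leaves NA out by default, so the total can be smaller
# than the number of rows
table(child)
```

This may explain why the Child crosstab above sums to 2198 rather than 2208: the summary() output reports 10 missing values for Age.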

Summary

Write a paragraph describing the people who were on board the Titanic and what happened to them.