Now that our data sets are set up, let’s take a look at them and start to analyze what happened. (We will learn some R basics along the way.)
Let’s focus on the Titanic.
Start by loading the dataset that you previously saved.
# I had to add the extra level because I have projects.
load(paste("~","fall16/data/RMS_Titanic.Rda", sep="/"))
The data we downloaded contain one row per person on board a given ship, in this case the RMS Titanic.
Get some basic information about the RMS_Titanic data set using the str() and summary() functions. To do that, put the name of the data set inside the parentheses.
str(RMS_Titanic)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2208 obs. of 20 variables:
## $ Id_8 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Ship Id : num 8 8 8 8 8 8 8 8 8 8 ...
## $ Year : num 1912 1912 1912 1912 1912 ...
## $ Nationality of the Ship : chr "U.K" "U.K" "U.K" "U.K" ...
## $ Women and children first : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Quick : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Cause : chr "Collision" "Collision" "Collision" "Collision" ...
## $ No. of passengers : num 1317 1317 1317 1317 1317 ...
## $ No. of women passengers : num 463 463 463 463 463 463 463 463 463 463 ...
## $ Women passengers/passengers: num 0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.352 ...
## $ Ship size : num 2208 2208 2208 2208 2208 ...
## $ Length of voyage : num 5 5 5 5 5 5 5 5 5 5 ...
## $ Gender : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : num 42 16 14 21 30 33 30 26 26 20 ...
## $ Child : num NA NA NA NA NA NA NA NA NA NA ...
## $ Crew : num 0 0 0 1 0 1 0 0 1 1 ...
## $ Passenger Class : num 3 3 3 NA 2 NA 3 3 NA NA ...
## $ Nationality of Passenger : num NA NA NA NA NA NA NA NA NA NA ...
## $ Companionship : num NA NA NA NA NA NA NA NA NA NA ...
## $ Survival : num 0 0 0 0 0 0 0 0 0 0 ...
summary(RMS_Titanic)
## Id_8 Ship Id Year Nationality of the Ship
## Min. : 1.0 Min. :8 Min. :1912 Length:2208
## 1st Qu.: 552.8 1st Qu.:8 1st Qu.:1912 Class :character
## Median :1104.5 Median :8 Median :1912 Mode :character
## Mean :1104.5 Mean :8 Mean :1912
## 3rd Qu.:1656.2 3rd Qu.:8 3rd Qu.:1912
## Max. :2208.0 Max. :8 Max. :1912
##
## Women and children first Quick Cause No. of passengers
## Min. :1 Min. :0 Length:2208 Min. :1317
## 1st Qu.:1 1st Qu.:0 Class :character 1st Qu.:1317
## Median :1 Median :0 Mode :character Median :1317
## Mean :1 Mean :0 Mean :1317
## 3rd Qu.:1 3rd Qu.:0 3rd Qu.:1317
## Max. :1 Max. :0 Max. :1317
##
## No. of women passengers Women passengers/passengers Ship size
## Min. :463 Min. :0.352 Min. :2208
## 1st Qu.:463 1st Qu.:0.352 1st Qu.:2208
## Median :463 Median :0.352 Median :2208
## Mean :463 Mean :0.352 Mean :2208
## 3rd Qu.:463 3rd Qu.:0.352 3rd Qu.:2208
## Max. :463 Max. :0.352 Max. :2208
##
## Length of voyage Gender Age Child
## Min. :5 Min. :0.0000 Min. : 0.00 Min. : NA
## 1st Qu.:5 1st Qu.:0.0000 1st Qu.:22.00 1st Qu.: NA
## Median :5 Median :0.0000 Median :29.00 Median : NA
## Mean :5 Mean :0.2201 Mean :29.91 Mean :NaN
## 3rd Qu.:5 3rd Qu.:0.0000 3rd Qu.:36.00 3rd Qu.: NA
## Max. :5 Max. :1.0000 Max. :74.00 Max. : NA
## NA's :10 NA's :2208
## Crew Passenger Class Nationality of Passenger Companionship
## Min. :0.0000 Min. :1.000 Min. : NA Min. : NA
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.: NA 1st Qu.: NA
## Median :0.0000 Median :3.000 Median : NA Median : NA
## Mean :0.4035 Mean :2.292 Mean :NaN Mean :NaN
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.: NA 3rd Qu.: NA
## Max. :1.0000 Max. :3.000 Max. : NA Max. : NA
## NA's :891 NA's :2208 NA's :2208
## Survival
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3225
## 3rd Qu.:1.0000
## Max. :1.0000
##
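You should notice some odd things. Several variables, such as Ship Id, Year, and No. of passengers, have the same value in every row because they describe the ship rather than the person. Child, Nationality of Passenger, and Companionship are missing (NA) in all 2208 rows, and Passenger Class is missing in 891 rows.
Some variables are not available for every ship. When a variable is not available for a ship, all of its values are missing.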
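One way to check which variables are entirely missing is to count the NA values in each column. The sketch below uses a small made-up data frame called toy (not the real ship data) so that it runs on its own; the same two steps should work on RMS_Titanic.

```r
# Toy stand-in for the ship data; the values are made up for illustration.
toy <- data.frame(
  Age   = c(42, 16, NA, 21),
  Child = c(NA, NA, NA, NA),
  Crew  = c(0, 0, 1, 1)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns in which every value is missing
all_missing <- names(na_counts)[na_counts == nrow(toy)]
print(all_missing)
```

A column whose NA count equals the number of rows is one that was not recorded for that ship.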
Now let’s get some descriptive information about the people on the ship using the crosstab function. The variables we are interested in from the RMS_Titanic data set are Gender, Passenger Class, Crew, and Survival.
#Here is the first. You add the others.
crosstab(RMS_Titanic, row.vars = "Gender")
##
## Gender Count Total %
## 0 1722.00 77.99
## 1 486.00 22.01
## Sum 2208.00 100.00
crosstab(RMS_Titanic, row.vars = "Passenger Class")
##
## Passenger.Class Count Total %
## 1 324.00 24.60
## 2 285.00 21.64
## 3 708.00 53.76
## Sum 1317.00 100.00
crosstab(RMS_Titanic, row.vars = "Survival")
##
## Survival Count Total %
## 0 1496.00 67.75
## 1 712.00 32.25
## Sum 2208.00 100.00
crosstab(RMS_Titanic, row.vars = "Crew")
##
## Crew Count Total %
## 0 1317.00 59.65
## 1 891.00 40.35
## Sum 2208.00 100.00
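The crosstab function is not part of base R, so it has to be loaded before these commands will run. As a rough cross-check, the same kind of counts and percentages can be sketched with the base functions table() and prop.table(). The gender vector below is made up for illustration.

```r
# Made-up values; the 0/1 coding mirrors the Gender column above.
gender <- c(0, 0, 0, 1, 0, 1, 0, 0)

counts   <- table(gender)                        # how often each value occurs
percents <- round(100 * prop.table(counts), 2)   # share of the total, in percent

print(cbind(Count = counts, Percent = percents))
```

By default table() leaves out NA values, which is consistent with the Passenger Class table above summing to 1317 rather than 2208.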
We would also like to know the proportion of people who are 15 or under. The problem is that the Child variable is entirely missing, while age is given in years in the Age variable. Let’s fill in the Child variable using the information in Age and then run the crosstab. (For your other ship, you need to check whether this is necessary and possible. You may need to do similar things with other variables.)
RMS_Titanic$Child <- as.numeric(RMS_Titanic$Age) <= 15
# Add the crosstab below
crosstab(RMS_Titanic, row.vars = "Child")
##
## Child Count Total %
## FALSE 2063.00 93.86
## TRUE 135.00 6.14
## Sum 2198.00 100.00
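The Child table sums to 2198 rather than 2208 because a comparison with a missing age is itself missing. A minimal sketch with a made-up ages vector shows this behavior:

```r
ages <- c(42, 14, NA, 15)    # made-up ages, one of them missing

child <- ages <= 15          # TRUE for 15 or under, FALSE for older, NA if age is unknown
print(child)

print(sum(child, na.rm = TRUE))   # number of children among the known ages
print(sum(is.na(child)))          # number of people whose age is unknown
```

The result is a logical vector, which is why the crosstab rows are labeled FALSE and TRUE instead of 0 and 1.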
crosstab(RMS_Titanic, col.vars = c("Child"), row.vars=c("Survival"), type = c("c"))
## NA NA NA NA
## 1 Child FALSE TRUE
## 2 Survival
## 3 0 68.78 49.63
## 4 1 31.22 50.37
## 5 Sum 100.00 100.00
Table: Survival for children and adults
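In the table above, each column adds up to 100, so the percentages are taken within each Child group. Assuming that is what type = c("c") requests, a comparable table can be sketched in base R with prop.table() and margin = 2. The two vectors below are made up for illustration.

```r
# Made-up data: survival (0 = died, 1 = survived) and child status for 8 people
survival <- c(0, 0, 1, 0, 1, 1, 0, 0)
child    <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)

tab <- table(Survival = survival, Child = child)

# margin = 2 divides each cell by its column total, giving column percentages
col_pct <- round(100 * prop.table(tab, margin = 2), 2)
print(col_pct)
```

Reading down a column gives the survival rate within that group, which is what you need to compare children with adults.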
Explain what RMS_Titanic$Child <- as.numeric(RMS_Titanic$Age) <= 15 does.
Write a paragraph describing the people who were on board the Titanic and what happened to them.