DATASET I have to import the Titanic survival data set from this website: [http://www.personal.psu.edu/dlp/w540/titanic540.csv]
First I load all the necessary packages. I don’t want to display any warning messages.
library(utils)
library(datasets)
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
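The attach and masking notes above are package startup messages rather than warnings. Here is a minimal sketch of one way to silence them, assuming dplyr is installed. In an R Markdown document, the knitr chunk options message=FALSE and warning=FALSE should have a similar effect.

```r
# Startup notes such as "Attaching package: 'dplyr'" are messages, not warnings,
# so suppressWarnings() alone would not hide them.
suppressPackageStartupMessages(library(dplyr))

# Confirm the package is attached even though nothing was printed.
"package:dplyr" %in% search()
```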
- This is a large data set (1,309 observations), so I have chosen to hide the results rather than print every observation.
titanic <- read.csv("http://www.personal.psu.edu/dlp/w540/titanic540.csv")
titanic
To convert the data frame to a tibble, I use the tbl_df function, so I’ll load the tibble package and call it. Newer dplyr releases deprecate tbl_df in favour of as_tibble from the tibble package.
library(tibble)
## Warning: package 'tibble' was built under R version 3.4.1
titanic.tbl <- tbl_df(titanic)
titanic.tbl
## # A tibble: 1,309 x 8
## pclass survived sex age sibsp parch fare embarked
## <int> <int> <fctr> <int> <int> <int> <dbl> <fctr>
## 1 1 1 female 29 0 0 211.34 S
## 2 1 1 male 1 1 2 151.55 S
## 3 1 0 female 2 1 2 151.55 S
## 4 1 0 male 30 1 2 151.55 S
## 5 1 0 female 25 1 2 151.55 S
## 6 1 1 male 48 0 0 26.55 S
## 7 1 1 female 63 1 0 77.96 S
## 8 1 0 male 39 0 0 0.00 S
## 9 1 1 female 53 2 0 51.48 S
## 10 1 0 male 71 0 0 49.50 C
## # ... with 1,299 more rows
- I can find the proportion of survivors in several different ways.
- First I’ll get the structure and summary of the data to figure out the variable types and see if any data is missing.
str(titanic.tbl)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1309 obs. of 8 variables:
## $ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
## $ survived: int 1 1 0 0 0 1 1 0 1 0 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : int 29 1 2 30 25 48 63 39 53 71 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ fare : num 211 152 152 152 152 ...
## $ embarked: Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
summary(titanic.tbl)
## pclass survived sex age
## Min. :1.000 Min. :0.000 female:466 Min. : 0.0
## 1st Qu.:2.000 1st Qu.:0.000 male :843 1st Qu.:21.0
## Median :3.000 Median :0.000 Median :28.0
## Mean :2.295 Mean :0.382 Mean :29.9
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:39.0
## Max. :3.000 Max. :1.000 Max. :80.0
## NA's :263
## sibsp parch fare embarked
## Min. :0.0000 Min. :0.000 Min. : 0.00 : 2
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 7.90 C:270
## Median :0.0000 Median :0.000 Median : 14.45 Q:123
## Mean :0.4989 Mean :0.385 Mean : 33.30 S:914
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 31.28
## Max. :8.0000 Max. :9.000 Max. :512.33
## NA's :1
From the summary I gather that no values are missing for the ‘survived’ variable.
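Missing values can also be counted directly instead of being read off the summary. The sketch below uses a hypothetical five-row data frame (toy) in place of the real tibble. Note that a blank string, like the two blank embarked entries in the summary, is not counted as NA.

```r
# Hypothetical five-row stand-in for the Titanic tibble.
toy <- data.frame(
  survived = c(1L, 0L, 0L, 1L, 1L),
  age      = c(29L, NA, 2L, NA, 25L),
  embarked = c("S", "C", "", "S", "Q"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))
na_counts

# Blank strings are not NA, so they need a separate check.
blank_ports <- sum(toy$embarked == "")
blank_ports
```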
survivors <- table(titanic.tbl$survived)
survivors
##
## 0 1
## 809 500
500/1309
## [1] 0.381971
I could also run the ‘prop.table’ function to automatically calculate the proportion of survivors.
survivorprop <- table(titanic.tbl$survived==1)
survivorprop
##
## FALSE TRUE
## 809 500
prop.table(survivorprop)
##
## FALSE TRUE
## 0.618029 0.381971
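A third route is to take the mean of the 0/1 indicator itself. This sketch rebuilds the survived column from the counts printed above (809 and 500) rather than downloading the file, so the vector is a reconstruction, not the original data.

```r
# Rebuild the 0/1 indicator from the counts in the table above:
# 809 passengers died (0) and 500 survived (1).
survived <- rep(c(0L, 1L), times = c(809L, 500L))

# The mean of a 0/1 indicator is the share of ones.
survival_rate <- mean(survived)
survival_rate

# prop.table() on the plain frequency table gives both shares at once.
shares <- prop.table(table(survived))
shares
```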
- For this calculation, I must first use the ‘group_by’ function to group the passengers by sex.
- I can then use the summarise verb to find the mean. Because ‘survived’ is coded 0/1, its mean within each group is the proportion who survived.
titanic.tbl %>%
group_by(sex) %>%
summarise(survivors_by_sex = mean(survived))
## # A tibble: 2 x 2
## sex survivors_by_sex
## <fctr> <dbl>
## 1 female 0.7274678
## 2 male 0.1909846
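As a cross-check, the same pipeline can be run on a data frame rebuilt from counts. The counts below (339 of 466 women, 161 of 843 men) are back-calculated from the rates and totals printed above, so they are an assumption of this sketch and not values read from the file.

```r
library(dplyr)

# Counts back-calculated from the output above (an assumption of this sketch):
# 339 of 466 women and 161 of 843 men survived.
passengers <- data.frame(
  sex      = rep(c("female", "male"), times = c(466, 843)),
  survived = c(rep(c(1L, 0L), times = c(339, 127)),
               rep(c(1L, 0L), times = c(161, 682)))
)

# Mean of the 0/1 indicator within each group = survival rate per sex.
rates <- passengers %>%
  group_by(sex) %>%
  summarise(survivors_by_sex = mean(survived))
rates
```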
This one requires another piped command: filter to identify the relevant observations, then summarise to find the average age.
titanic.tbl %>%
filter(sex=="female", survived==1) %>%
summarise(female_survivors = mean(age, na.rm=TRUE))
## # A tibble: 1 x 1
## female_survivors
## <dbl>
## 1 28.6933
I filter the relevant observations and read the number of rows returned (50) from the tibble header.
titanic.tbl %>%
filter(age<=10, survived==1)
## # A tibble: 50 x 8
## pclass survived sex age sibsp parch fare embarked
## <int> <int> <fctr> <int> <int> <int> <dbl> <fctr>
## 1 1 1 male 1 1 2 151.55 S
## 2 1 1 male 4 0 2 81.86 S
## 3 1 1 male 6 0 2 134.50 C
## 4 2 1 male 1 2 1 39.00 S
## 5 2 1 female 4 2 1 39.00 S
## 6 2 1 male 1 0 2 29.00 S
## 7 2 1 female 8 0 2 26.25 S
## 8 2 1 male 8 1 1 36.75 S
## 9 2 1 male 8 0 2 32.50 S
## 10 2 1 male 1 1 1 14.50 S
## # ... with 40 more rows
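Rather than reading the count off the tibble header, the pipeline can return it as a number. This is a minimal sketch on a hypothetical six-row data frame (toy). It also shows that filter drops rows where the condition evaluates to NA, which matters here because age has missing values.

```r
library(dplyr)

# Hypothetical toy data; one age is missing.
toy <- data.frame(
  age      = c(1L, 4L, 30L, NA, 8L, 10L),
  survived = c(1L, 1L, 1L, 1L, 0L, 1L)
)

# filter() keeps rows where every condition is TRUE, so the NA age is dropped.
n_young_survivors <- toy %>%
  filter(age <= 10, survived == 1) %>%
  nrow()
n_young_survivors
```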
I use another piped command for this: filter the relevant observations and summarise the maximum, minimum and median age.
titanic.tbl %>%
filter(age>=10, survived==1) %>%
summarise(max=max(age, na.rm=TRUE), min=min(age, na.rm=TRUE), median(age, na.rm = TRUE))
## # A tibble: 1 x 3
## max min `median(age, na.rm = TRUE)`
## <dbl> <dbl> <int>
## 1 80 11 30
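I will use the prop.table function in R for this. I first need to create a table of the surviving passengers by port of embarkation. Called without a margin argument, prop.table divides every cell by the grand total, so each entry is a share of all 1,309 passengers. The unlabelled first column holds the two passengers with no recorded port.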
survivors_by_port <- table(titanic.tbl$survived, titanic.tbl$embarked)
survivors_by_port
##
## C Q S
## 0 0 120 79 610
## 1 2 150 44 304
prop.table(survivors_by_port)
##
## C Q S
## 0 0.000000000 0.091673033 0.060351413 0.466004584
## 1 0.001527884 0.114591291 0.033613445 0.232238350
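If the survival rate within each port is wanted instead, prop.table accepts a margin argument. The sketch below rebuilds the table from the counts printed above and normalises by column. The object name rate_by_port is my own choice.

```r
# Counts copied from the survivors_by_port table above; the first column
# (empty label) holds the two passengers with no recorded port.
survivors_by_port <- as.table(matrix(
  c(0, 2, 120, 150, 79, 44, 610, 304),
  nrow = 2,
  dimnames = list(survived = c("0", "1"), embarked = c("", "C", "Q", "S"))
))

# margin = 2 divides each cell by its column total, giving rates within a port.
rate_by_port <- prop.table(survivors_by_port, margin = 2)
round(rate_by_port, 3)
```

With margin = 2 every column sums to 1, so the row labelled 1 can be read as the share of each port's passengers who survived.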
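This one’s a little complicated. I have to string together several piped commands: filter, select, group_by and finally count. I first convert embarked to numeric, which replaces each port with its factor level code (1 = blank, 2 = C, 3 = Q, 4 = S) for this and all later steps.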
titanic.tbl$embarked <- as.numeric(titanic.tbl$embarked)
sur.fem <- titanic.tbl %>%
filter(sex=="female", age>=40) %>%
select(sex, age, embarked) %>%
group_by(embarked)
sur.fem
## # A tibble: 84 x 3
## # Groups: embarked [3]
## sex age embarked
## <fctr> <int> <dbl>
## 1 female 63 4
## 2 female 53 4
## 3 female 50 2
## 4 female 47 4
## 5 female 42 2
## 6 female 58 4
## 7 female 45 2
## 8 female 44 2
## 9 female 59 4
## 10 female 60 2
## # ... with 74 more rows
count(sur.fem)
## # A tibble: 3 x 2
## # Groups: embarked [3]
## embarked n
## <dbl> <int>
## 1 1 1
## 2 2 31
## 3 4 52
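The numeric codes in the output come from the factor's level order. This sketch uses a hypothetical five-element factor with the same four levels reported by str earlier, and shows how the codes arise and how to map them back to port letters.

```r
# Hypothetical values; the level order matches the str() output: "", C, Q, S.
embarked <- factor(c("S", "C", "", "Q", "S"), levels = c("", "C", "Q", "S"))

# as.numeric() on a factor returns the position of each value in levels().
codes <- as.numeric(embarked)
codes

# Indexing the levels by those codes recovers the original labels.
labels <- levels(embarked)[codes]
labels
```

Keeping the original factor in a separate column would avoid having to translate the codes back when reading later output.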
I use the group_by and summarise verbs to calculate the mean. Since some values for the fare are missing, I have to use na.rm=TRUE.
titanic.tbl %>%
group_by(embarked) %>%
summarise (avg_fare = mean(fare,na.rm = TRUE))
## # A tibble: 4 x 2
## embarked avg_fare
## <dbl> <dbl>
## 1 1 80.00000
## 2 2 62.33719
## 3 3 12.40935
## 4 4 27.41963
I use the filter verb and read the number of observations (191) from the header of the tibble that’s created.
titanic.tbl %>%
filter(survived==1, sibsp>0)
## # A tibble: 191 x 8
## pclass survived sex age sibsp parch fare embarked
## <int> <int> <fctr> <int> <int> <int> <dbl> <dbl>
## 1 1 1 male 1 1 2 151.55 4
## 2 1 1 female 63 1 0 77.96 4
## 3 1 1 female 53 2 0 51.48 4
## 4 1 1 female 18 1 0 227.53 2
## 5 1 1 male 37 1 1 52.55 4
## 6 1 1 female 47 1 1 52.55 4
## 7 1 1 male 25 1 0 91.08 2
## 8 1 1 female 19 1 0 91.08 2
## 9 1 1 female 59 2 0 51.48 4
## 10 1 1 male 11 1 2 120.00 4
## # ... with 181 more rows
I filter the relevant observations and count the total number that return.
titanic.tbl %>%
filter(survived==1, parch>0)
## # A tibble: 164 x 8
## pclass survived sex age sibsp parch fare embarked
## <int> <int> <fctr> <int> <int> <int> <dbl> <dbl>
## 1 1 1 male 1 1 2 151.55 4
## 2 1 1 female 50 0 1 247.52 2
## 3 1 1 male 37 1 1 52.55 4
## 4 1 1 female 47 1 1 52.55 4
## 5 1 1 female 22 0 1 55.00 4
## 6 1 1 male 36 0 1 512.33 2
## 7 1 1 female 58 0 1 512.33 2
## 8 1 1 male 11 1 2 120.00 4
## 9 1 1 female 14 1 2 120.00 4
## 10 1 1 male 36 1 2 120.00 4
## # ... with 154 more rows
It’s time to calculate the mean again. That means using the summarise verb. Of course, I have to use group_by first, as I have to calculate the mean by passenger class.
titanic.tbl %>%
group_by(pclass) %>%
summarise(avg_fare=mean(fare, na.rm=TRUE))
## # A tibble: 3 x 2
## pclass avg_fare
## <int> <dbl>
## 1 1 87.50935
## 2 2 21.17928
## 3 3 13.30414
I have to calculate the frequency distribution. That means using the ftable function. But first I have to filter the relevant observations and select the relevant variables.
frq.dist1 <- titanic.tbl %>%
filter(survived==1, sex=="female", parch>0) %>%
select(survived, parch)
frq.dist1
## # A tibble: 121 x 2
## survived parch
## <int> <int>
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 2
## 6 1 2
## 7 1 1
## 8 1 1
## 9 1 2
## 10 1 2
## # ... with 111 more rows
ftable(frq.dist1)
## parch 1 2 3 4 5
## survived
## 1 70 44 5 1 1
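For a single variable, base R's table function gives the same frequency distribution, and prop.table turns it into relative frequencies. This sketch rebuilds parch from the counts printed above, so the vector is a reconstruction rather than the filtered tibble.

```r
# Rebuild parch from the counts in the ftable output above (121 passengers).
parch <- rep(1:5, times = c(70, 44, 5, 1, 1))

# Absolute frequencies.
freq <- table(parch)
freq

# Relative frequencies, rounded for display.
round(prop.table(freq), 3)
```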
Another frequency distribution. That means using the ftable function again, after I filter the relevant observations and select the relevant variables.
frq.dist2 <- titanic.tbl %>%
filter(sex=="male", sibsp>=1) %>%
select(sibsp)
frq.dist2
## # A tibble: 214 x 1
## sibsp
## <int>
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## # ... with 204 more rows
ftable(frq.dist2)
## x 1 2 3 4 5 8
##
## 159 23 8 15 4 5