- Complete either Task 1 or Task 2
- Complete Task 3
Note: If your Rmd file submission knits, you will receive a total of 5 points.
# load the packages needed
library(PASWR2)
library(ggplot2)
library(dplyr)
library(lattice)
Answer:
Note: Problem 8/p. 196 is modified
Some claim that the final hours aboard the Titanic were marked by class warfare; others claim they were characterized by male chivalry. The data frame TITANIC3 from the PASWR2 package contains information pertaining to class status (pclass), survival of passengers (survived), and gender (sex), among others. Based on the information in the data frame:
A description of the variables can be found by running the code:
help("TITANIC3")
data("TITANIC3")
TITANIC3 data? Hint: Use the function dim(), glimpse() or str().
titanicRow <- function() nrow(TITANIC3)  # number of rows; no argument is needed
titanicCol <- function() ncol(TITANIC3)  # number of columns
titanicRow()
[1] 1309
titanicCol()
[1] 14
Answer: There are 1309 rows and 14 columns in TITANIC3.
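The same counts can be read without wrapper functions. A minimal sketch on a small made-up data frame (`toy` is a hypothetical stand-in, not part of PASWR2), showing that `nrow()` and `ncol()` return the two elements of `dim()`:

```r
# Hypothetical stand-in for TITANIC3 so the sketch runs without PASWR2
toy <- data.frame(pclass = c("1st", "2nd", "3rd"),
                  survived = c(1L, 0L, 0L),
                  age = c(29, 41, NA))

dims <- dim(toy)     # c(rows, columns)
n_rows <- nrow(toy)  # same value as dims[1]
n_cols <- ncol(toy)  # same value as dims[2]
cat("rows:", n_rows, "columns:", n_cols, "\n")
```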
TITANIC3 data?
TITANIC3 %>% select(1:6)
survived variable in the TITANIC3 data, which is of type integer (0/1): mutate it to a factor variable by running the code below and create a new data frame TITANIC. What are the new levels of survived and its type?
Answer: No and Yes
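A minimal sketch of how an integer 0/1 column can be turned into a factor with the levels reported in the answer. The data frame `toy` is hypothetical, and the label-to-code mapping (0 = "No", 1 = "Yes") is an assumption based on that answer. Inside a dplyr pipeline the same `factor()` call would sit inside `mutate()`.

```r
# Hypothetical 0/1 survival codes standing in for TITANIC3$survived
toy <- data.frame(survived = c(0L, 1L, 1L, 0L, 0L))

# factor() maps code 0 to "No" and code 1 to "Yes" (assumed mapping)
toy$survived <- factor(toy$survived, levels = c(0, 1), labels = c("No", "Yes"))

levels(toy$survived)  # "No" "Yes"
class(toy$survived)   # "factor"
table(toy$survived)
```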
TITANIC data. Write code using the pipe %>% operator that produces the same result.
summary(TITANIC)
YOUR CODE HERE:
TITANIC %>% summary()
pclass survived name sex age sibsp parch ticket
1st:323 No :809 Connolly, Miss. Kate : 2 female:466 Min. : 0.1667 Min. :0.0000 Min. :0.000 CA. 2343: 11
2nd:277 Yes:500 Kelly, Mr. James : 2 male :843 1st Qu.:21.0000 1st Qu.:0.0000 1st Qu.:0.000 1601 : 8
3rd:709 Abbing, Mr. Anthony : 1 Median :28.0000 Median :0.0000 Median :0.000 CA 2144 : 8
Abbott, Master. Eugene Joseph : 1 Mean :29.8811 Mean :0.4989 Mean :0.385 3101295 : 7
Abbott, Mr. Rossmore Edward : 1 3rd Qu.:39.0000 3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7
Abbott, Mrs. Stanton (Rosa Hunt: 1 Max. :80.0000 Max. :8.0000 Max. :9.000 347082 : 7
(Other) :1301 NA's :263 (Other) :1261
fare cabin embarked boat body home.dest
Min. : 0.000 :1014 : 2 :823 Min. : 1.0 :564
1st Qu.: 7.896 B57 B59 B63 B66: 5 Cherbourg :270 13 : 39 1st Qu.: 72.0 New York, NY : 64
Median : 14.454 C23 C25 C27 : 5 Queenstown :123 C : 38 Median :155.0 London : 14
Mean : 33.295 G6 : 5 Southampton:914 15 : 37 Mean :160.8 Montreal, PQ : 10
3rd Qu.: 31.275 B96 B98 : 4 14 : 33 3rd Qu.:256.0 Cornwall / Akron, OH: 9
Max. :512.329 C22 C26 : 4 4 : 31 Max. :328.0 Paris, France : 9
NA's :1 (Other) : 272 (Other):308 NA's :1188 (Other) :639
survived) according to class (pclass). Hint: Uncomment one of the first 3 lines in the code chunk below and then use the prop.table() function.
T1 <- xtabs(~survived + pclass, data = TITANIC)
T1 <- table(TITANIC$survived,TITANIC$pclass)
T1 <- TITANIC %>% select(survived, pclass) %>% table()
T1
pclass
survived 1st 2nd 3rd
No 123 158 528
Yes 200 119 181
prop.table(T1, margin = 2) # to produce the proportion per column (2), per row would be margin = 1
pclass
survived 1st 2nd 3rd
No 0.3808050 0.5703971 0.7447109
Yes 0.6191950 0.4296029 0.2552891
Answer: The percent survived is 61.92% in 1st class (200 of 323), 42.96% in 2nd class (119 of 277), and 25.53% in 3rd class (181 of 709).
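The survival percentages per class come from conditioning on the class column. A small sketch that rebuilds the T1 counts by hand (the object names `T1_counts` and `pct_by_class` are illustrative) and applies `prop.table()` with `margin = 2`:

```r
# Counts copied from the T1 table above (filled column by column)
T1_counts <- matrix(c(123, 200, 158, 119, 528, 181), nrow = 2,
                    dimnames = list(survived = c("No", "Yes"),
                                    pclass = c("1st", "2nd", "3rd")))

# margin = 2 divides each cell by its column (class) total
pct_by_class <- round(100 * prop.table(T1_counts, margin = 2), 2)
pct_by_class["Yes", ]
```

Each column of `pct_by_class` sums to 100 (up to rounding), which is a quick check that the proportions were taken within class rather than over all passengers.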
Hint: Use the code below that creates 3-way table and then use prop.table() similarly to part a).
T2 <- TITANIC %>% select(pclass, sex, survived) %>% table()
T2
, , survived = No
sex
pclass female male
1st 5 118
2nd 12 146
3rd 110 418
, , survived = Yes
sex
pclass female male
1st 139 61
2nd 94 25
3rd 106 75
prop.table(T2)
, , survived = No
sex
pclass female male
1st 0.003819710 0.090145149
2nd 0.009167303 0.111535523
3rd 0.084033613 0.319327731
, , survived = Yes
sex
pclass female male
1st 0.106187930 0.046600458
2nd 0.071810542 0.019098549
3rd 0.080977846 0.057295646
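Answer: prop.table(T2) without a margin divides every cell by all 1309 passengers, so 8.10% and 4.66% are shares of everyone aboard, not survival rates. Conditioning on class and sex, 106 of 216 women in third class survived (49.07%), while 61 of 179 men in first class survived (34.08%).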
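Survival rates within each class-and-sex group can be obtained by passing both grouping dimensions as the margin. A sketch that re-enters the T2 counts as an array (the names `T2_counts` and `rates` are illustrative):

```r
# Counts copied from the T2 table above: pclass x sex x survived
T2_counts <- array(c(5, 12, 110, 118, 146, 418,   # survived = No
                     139, 94, 106, 61, 25, 75),   # survived = Yes
                   dim = c(3, 2, 2),
                   dimnames = list(pclass = c("1st", "2nd", "3rd"),
                                   sex = c("female", "male"),
                                   survived = c("No", "Yes")))

# margin = c(1, 2): within each pclass-by-sex cell, No and Yes sum to 1
rates <- prop.table(T2_counts, margin = c(1, 2))
round(100 * rates[, , "Yes"], 2)
```

With `margin = c(1, 2)` the denominator is the number of passengers of that sex in that class, which is what a question of the form "what percent of women in third class survived" asks for.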
Hint: Run the code below, which produces some summary statistics and the density distribution. The commented code is an older style of R programming; it is shown because it may resemble the textbook examples.
TITANIC %>% summarise(mean = mean(age, na.rm = TRUE), median = median(age, na.rm = TRUE))
# IQR(TITANIC$age, na.rm = TRUE)
TITANIC %>% pull(age) %>% IQR(na.rm = TRUE) # pull() extracts the column from the data frame as a vector
[1] 18
# look at the density function to see if it is uni or bi-modal distribution
ggplot(data = TITANIC, aes(x = age)) +
geom_density(fill = "lightgreen") +
theme_bw()
Answer: Positively skewed
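A rough numeric check of skew direction is to compare the mean with the median. The sketch below uses made-up ages (the vector `ages` is hypothetical, not taken from TITANIC):

```r
# Hypothetical ages; the NA mimics the missing values in TITANIC$age
ages <- c(2, 18, 21, 24, 28, 30, 36, 45, 62, 80, NA)

age_mean <- mean(ages, na.rm = TRUE)
age_median <- median(ages, na.rm = TRUE)

# A mean above the median is a rough sign of positive (right) skew
age_mean > age_median
```

In the summary(TITANIC) output above, the mean age (29.88) also exceeds the median (28), which is consistent with a positively skewed distribution. This comparison is only a heuristic, so the density plot remains the better evidence.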
Hint: Using the dplyr package functions and the pipe operator %>%, elegant code can produce summaries for each statistic: mean, median, sd, IQR.
pclass variable, namely regardless of passenger class:
TITANIC %>% group_by(sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
TITANIC %>% group_by(sex, survived) %>% summarise(stdev = sd(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
TITANIC %>% group_by(sex, survived) %>% summarise(med = median(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
TITANIC %>% group_by(sex, survived) %>% summarise(IQR = IQR(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
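The four statistics can also be collected in a single grouped table. In dplyr this would be one `summarise()` call with four named arguments; the sketch below uses base R `aggregate()` on a made-up data frame (`toy` is hypothetical) so that it runs without extra packages:

```r
# Hypothetical mini data set with the same column names as TITANIC
toy <- data.frame(
  sex = c("female", "female", "female", "male", "male", "male", "male"),
  survived = c("Yes", "Yes", "No", "Yes", "No", "No", "Yes"),
  age = c(30, 40, 25, 20, 50, NA, 30)
)

# The formula interface drops rows with a missing age before grouping
stats <- aggregate(age ~ sex + survived, data = toy,
                   FUN = function(a) c(avg = mean(a), med = median(a),
                                       stdev = sd(a), iqr = IQR(a)))

# Flatten the matrix-valued result into ordinary columns age.avg, age.med, ...
stats <- do.call(data.frame, stats)
stats
```

Groups with a single observation get `NA` for the standard deviation, which is expected rather than an error.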
Based on the summaries, answer the question below:
d-1) For those who survived, the mean age for females is higher than the mean age for males?
d-2) For those who survived, the median age for females is higher than the median age for males?
Answer: _ _ _
survived variable in the TITANIC data too, create similar summary statistics and answer the questions below.
For those who survived, in which class is the mean age for females less than the mean age for males? 3rd class
For those who survived, in which class is the median age for females greater than the median age for males? 2nd class
Write your code in the chunk below:
TITANIC %>% group_by(pclass, sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE), med = median(age, na.rm = TRUE))
`summarise()` has grouped output by 'pclass', 'sex'. You can override using the `.groups` argument.
Hint: Read the output of the code in part d)
Answer: Mean Survived: 26.97778. Mean Did Not Survive: 31.51641. The mean for Did Not Survive is higher. Median Survived: 27. Median Did Not Survive: 29. The median for Did Not Survive is higher.
Hint: Complete the code below by specifying which variable you want to be arranged.
Arranging in descending order is achieved by specifying in the arrange() function desc(var_name).
YOUR CODE HERE:
Answer:
Oldest female in 1st class survived was 76 years of age.
Oldest male in 1st class survived was 80 years of age.
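One way to find the oldest survivor per sex is to filter, sort by age in descending order, and keep the first row for each sex. In dplyr that would be `filter()` followed by `arrange(desc(age))`; the sketch below does the same in base R on made-up rows (`toy` and its values are hypothetical, not taken from TITANIC):

```r
# Hypothetical passengers; column names mirror TITANIC
toy <- data.frame(
  pclass = c("1st", "1st", "1st", "2nd", "1st"),
  sex = c("female", "male", "female", "male", "male"),
  survived = c("Yes", "Yes", "No", "Yes", "Yes"),
  age = c(58, 80, 63, 70, 36)
)

# Keep 1st-class survivors, then sort by age from oldest to youngest
first_surv <- subset(toy, pclass == "1st" & survived == "Yes")
first_surv <- first_surv[order(first_surv$age, decreasing = TRUE), ]

# After sorting, the first row seen for each sex is its oldest survivor
oldest <- first_surv[!duplicated(first_surv$sex), c("sex", "age")]
oldest
```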
Hint: Review and explain the exploratory graphs created by the code chunk. How do they support your justification?
addCircles() function. Try, e.g., addCircles(weight = 1, radius = sqrt(UNC_schools$size)*100)
set.seed(2020-02-01)
library(leaflet)
# The code below will create list of 5 UNC university data points with lat & lng, name and school size
# Create data frame with column variables name (of UNC school), students (size), lat, lng)
UNC_schools <- data.frame(name = c("NC State", "UNC Chapel Hill", "FSU", "ECU", "UNC Charlotte"),
size = c(30130, 28136, 6000, 25990, 25990),
lat = c(36.0373638, 35.9050353, 35.0726, 35.6073769, 35.2036325),
lng = c(-79.0355663, -79.0477533, -78.8924739, -77.3671566, -80.8401145))
# Use the data frame to draw map and circles proportional to the school sizes of the cities
UNC_schools %>%
leaflet() %>%
addTiles() %>%
addCircles(weight = 1, radius = sqrt(UNC_schools$size)*150) # try adjusting the radius by multiplying by 50 instead of 150. What do you notice?
Assuming "lng" and "lat" are longitude and latitude, respectively
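The square root in the radius matters because a circle's area grows with the square of its radius. The sketch below uses the sizes from UNC_schools to show that a radius of sqrt(size) times a constant keeps the drawn area proportional to school size, and that the constant only rescales all circles together. The variable names are illustrative; `addCircles()` interprets the radius in meters.

```r
# School sizes copied from UNC_schools above
size <- c(30130, 28136, 6000, 25990, 25990)

radius_150 <- sqrt(size) * 150  # radius in meters, as passed to addCircles()
radius_50  <- sqrt(size) * 50

# Circle area is pi * r^2, so area / size is the same constant for every school
area_per_student <- pi * radius_150^2 / size
range(area_per_student)

# Changing the multiplier from 150 to 50 shrinks every radius by a factor of 3
unique(round(radius_150 / radius_50, 10))
```

Because every radius shrinks by the same factor, the relative sizes of the circles stay the same; only the overlap and visibility at a given zoom level change.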