- Complete either Task 1 or Task 2
- Complete Task 3
Note: If your Rmd file submission knits,
you will receive a total of 5 points.
# load the packages needed
library(PASWR2)
library(ggplot2)
library(tidyverse)
library(lattice)
ANSWER: 4 (Though 19 check marks appear under packages)
Note: Problem 8/p. 196 is modified
Some claim that the final hours aboard the Titanic were marked by
class warfare; others claim they were characterized by male chivalry. The
data frame TITANIC3 from the PASWR2 package
contains information pertaining to class status
(pclass), survival of passengers (survived), and
gender (sex), among others. Based on the information in the
data frame:
A description of the variables can be found by running the code:
help("TITANIC3")
data("TITANIC3")
TITANIC3 data? Hint: Use the function dim(), glimpse(), or
str().
dim(TITANIC3)
[1] 1309 14
ANSWER: There are 1309 rows and 14 columns in
TITANIC3.
TITANIC3 data?
TITANIC3 %>% head()
survived
variable in the TITANIC3 data, which is of type integer
(0/1), mutate it to a factor variable by running the code
below and create a new data frame
TITANIC. What are the new levels of survived and its type?
TITANIC <- TITANIC3 %>% mutate(survived = factor(survived, levels = 0:1, labels = c("No", "Yes")))
ANSWER: Levels: No, Yes (in that order). Type: factor.
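The effect of the factor() call can be checked on a small vector. The sketch below uses a made-up integer vector `x` (an assumption for illustration, not taken from TITANIC3) to show that the order given in `levels` fixes the order of the labels.

```r
# Hypothetical 0/1 vector standing in for the survived column
x <- c(1L, 0L, 0L, 1L, 0L)

# Map 0 -> "No" and 1 -> "Yes"; the order of `levels` sets the order of the labels
f <- factor(x, levels = 0:1, labels = c("No", "Yes"))

levels(f)   # level names, in order
class(f)    # class of the converted object
table(f)    # counts per level
```

The same three calls, applied to TITANIC$survived, answer the question about levels and type directly.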
TITANIC data. Write code using the pipe %>% operator
that produces the same result.
summary(TITANIC)
pclass survived name sex age
1st:323 No :809 Connolly, Miss. Kate : 2 female:466 Min. : 0.1667
2nd:277 Yes:500 Kelly, Mr. James : 2 male :843 1st Qu.:21.0000
3rd:709 Abbing, Mr. Anthony : 1 Median :28.0000
Abbott, Master. Eugene Joseph : 1 Mean :29.8811
Abbott, Mr. Rossmore Edward : 1 3rd Qu.:39.0000
Abbott, Mrs. Stanton (Rosa Hunt: 1 Max. :80.0000
(Other) :1301 NA's :263
sibsp parch ticket fare cabin
Min. :0.0000 Min. :0.000 CA. 2343: 11 Min. : 0.000 :1014
1st Qu.:0.0000 1st Qu.:0.000 1601 : 8 1st Qu.: 7.896 B57 B59 B63 B66: 5
Median :0.0000 Median :0.000 CA 2144 : 8 Median : 14.454 C23 C25 C27 : 5
Mean :0.4989 Mean :0.385 3101295 : 7 Mean : 33.295 G6 : 5
3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7 3rd Qu.: 31.275 B96 B98 : 4
Max. :8.0000 Max. :9.000 347082 : 7 Max. :512.329 C22 C26 : 4
(Other) :1261 NA's :1 (Other) : 272
embarked boat body home.dest
: 2 :823 Min. : 1.0 :564
Cherbourg :270 13 : 39 1st Qu.: 72.0 New York, NY : 64
Queenstown :123 C : 38 Median :155.0 London : 14
Southampton:914 15 : 37 Mean :160.8 Montreal, PQ : 10
14 : 33 3rd Qu.:256.0 Cornwall / Akron, OH: 9
4 : 31 Max. :328.0 Paris, France : 9
(Other):308 NA's :1188 (Other) :639
YOUR CODE HERE:
TITANIC %>% summary()
pclass survived name sex age
1st:323 No :809 Connolly, Miss. Kate : 2 female:466 Min. : 0.1667
2nd:277 Yes:500 Kelly, Mr. James : 2 male :843 1st Qu.:21.0000
3rd:709 Abbing, Mr. Anthony : 1 Median :28.0000
Abbott, Master. Eugene Joseph : 1 Mean :29.8811
Abbott, Mr. Rossmore Edward : 1 3rd Qu.:39.0000
Abbott, Mrs. Stanton (Rosa Hunt: 1 Max. :80.0000
(Other) :1301 NA's :263
sibsp parch ticket fare cabin
Min. :0.0000 Min. :0.000 CA. 2343: 11 Min. : 0.000 :1014
1st Qu.:0.0000 1st Qu.:0.000 1601 : 8 1st Qu.: 7.896 B57 B59 B63 B66: 5
Median :0.0000 Median :0.000 CA 2144 : 8 Median : 14.454 C23 C25 C27 : 5
Mean :0.4989 Mean :0.385 3101295 : 7 Mean : 33.295 G6 : 5
3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7 3rd Qu.: 31.275 B96 B98 : 4
Max. :8.0000 Max. :9.000 347082 : 7 Max. :512.329 C22 C26 : 4
(Other) :1261 NA's :1 (Other) : 272
embarked boat body home.dest
: 2 :823 Min. : 1.0 :564
Cherbourg :270 13 : 39 1st Qu.: 72.0 New York, NY : 64
Queenstown :123 C : 38 Median :155.0 London : 14
Southampton:914 15 : 37 Mean :160.8 Montreal, PQ : 10
14 : 33 3rd Qu.:256.0 Cornwall / Akron, OH: 9
4 : 31 Max. :328.0 Paris, France : 9
(Other):308 NA's :1188 (Other) :639
survived) according to class (pclass). Hint: Uncomment one of the first 3 lines in the code
chunk below and then use the prop.table() function.
T1 <- xtabs(~survived + pclass, data = TITANIC)
T1 <- table(TITANIC$survived,TITANIC$pclass)
T1 <- TITANIC %>% select(survived, pclass) %>% table()
T1
pclass
survived 1st 2nd 3rd
No 123 158 528
Yes 200 119 181
prop.table(T1, margin = 2) # to produce the proportion per column (2), per row would be margin = 1
pclass
survived 1st 2nd 3rd
No 0.3808050 0.5703971 0.7447109
Yes 0.6191950 0.4296029 0.2552891
ANSWER: The percentage that survived was 61.92% in 1st class, 42.96% in 2nd class, and 25.53% in 3rd class.
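The role of the margin argument can be seen by rebuilding the table by hand. This is a minimal sketch that copies the counts from the T1 output above into a matrix (the object names `T1_counts`, `by_class`, and `by_outcome` are made up for illustration), so it runs without the PASWR2 package.

```r
# Counts copied from the T1 output above (rows: survived, columns: pclass)
T1_counts <- matrix(
  c(123, 200, 158, 119, 528, 181), nrow = 2,
  dimnames = list(survived = c("No", "Yes"),
                  pclass = c("1st", "2nd", "3rd"))
)

by_class   <- prop.table(T1_counts, margin = 2)  # divide by column totals
by_outcome <- prop.table(T1_counts, margin = 1)  # divide by row totals

round(100 * by_class["Yes", ], 2)  # percent survived within each class
colSums(by_class)                  # each column sums to 1
rowSums(by_outcome)                # each row sums to 1
```

With margin = 2 the question "what share of each class survived?" is answered; with margin = 1 the question becomes "what share of survivors came from each class?".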
Hint: Use the code below, which creates a 3-way table, and then use
prop.table() similarly to part a).
T2 <- TITANIC %>% select(pclass, sex, survived) %>% table()
T2
, , survived = No
sex
pclass female male
1st 5 118
2nd 12 146
3rd 110 418
, , survived = Yes
sex
pclass female male
1st 139 61
2nd 94 25
3rd 106 75
prop.table(T2)
, , survived = No
sex
pclass female male
1st 0.003819710 0.090145149
2nd 0.009167303 0.111535523
3rd 0.084033613 0.319327731
, , survived = Yes
sex
pclass female male
1st 0.106187930 0.046600458
2nd 0.071810542 0.019098549
3rd 0.080977846 0.057295646
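ANSWER: Third-class women who survived make up 8.1% of all passengers, and first-class men who survived make up 4.7%, because prop.table(T2) divides every cell by the grand total. Within each group, 106 of 216 third-class women survived (49.1%) compared with 61 of 179 first-class men (34.1%). Women had the higher rate.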
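Called without a margin, prop.table(T2) divides each cell by the total of 1309 passengers, so its entries are shares of all passengers. To get a survival rate within each class and sex combination, one option is to pass margin = c(1, 2). A minimal sketch, rebuilding the counts from the T2 output above (the names `T2_counts` and `rates` are made up for illustration):

```r
# Counts copied from the T2 output above; array() fills pclass first, then sex, then survived
T2_counts <- array(
  c(5, 12, 110,  118, 146, 418,   # survived = No:  female (1st, 2nd, 3rd), male (1st, 2nd, 3rd)
    139, 94, 106,  61, 25, 75),   # survived = Yes: female (1st, 2nd, 3rd), male (1st, 2nd, 3rd)
  dim = c(3, 2, 2),
  dimnames = list(pclass = c("1st", "2nd", "3rd"),
                  sex = c("female", "male"),
                  survived = c("No", "Yes"))
)

# Divide each cell by the total for its pclass x sex combination
rates <- prop.table(T2_counts, margin = c(1, 2))

round(rates[, , "Yes"], 3)  # survival rate for each class and sex
```

In this table, the "No" and "Yes" entries for any one class and sex add up to 1, which is what makes them comparable across groups of different sizes.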
Hint: Run the code below, which produces some summary statistics and the density plot. The lines marked "old style" use an older style of R programming; they are shown because they may resemble the textbook examples.
# Finding summary statistics
median(TITANIC$age, na.rm = TRUE) # old style
[1] 28
mean(TITANIC$age, na.rm = TRUE) # old style
[1] 29.88113
# dplyr style
TITANIC %>% summarise(mean = mean(age, na.rm = TRUE), median = median(age, na.rm = TRUE))
IQR(TITANIC$age, na.rm = TRUE)
[1] 18
TITANIC %>% pull(age) %>% IQR(na.rm = TRUE) # pull() extracts the column from the data frame as a vector
[1] 18
# look at the density function to see if it is uni or bi-modal distribution
ggplot(data = TITANIC, aes(x = age)) +
geom_density(fill = "lightgreen") +
theme_bw()
ANSWER: Positively skewed. The mean (29.88) is larger than the median (28), and there were a few older passengers.
Hint: Using the dplyr package functions
and the pipe operator %>%, elegant code can produce
summaries for each statistic: mean, median, sd, IQR.
pclass variable, namely
regardless of passenger class:
# mean summaries
TITANIC %>% group_by(sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
# sd summaries
TITANIC %>% group_by(sex, survived) %>% summarise(stdev = sd(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
# median summaries
TITANIC %>% group_by(sex, survived) %>% summarise(med = median(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
# IQR summaries
TITANIC %>% group_by(sex, survived) %>% summarise(IQR = IQR(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
Based on the summaries, answer the question below:
d-1) For those who survived, the mean age for females is 29.81535 (about 30) and the mean age for males is 26.97778 (about 27).
d-2) For those who survived, the median age for females is 28.5 (about 29) and the median age for males is 27.
ANSWER: Women who survived were older than the ones who did not.
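The four separate summaries can also be computed in a single summarise() call. The sketch below is one possible way to do it, run on a made-up data frame `toy` whose column names mirror TITANIC but whose values are invented; the `.groups = "drop"` argument is the one mentioned in the message printed by summarise().

```r
library(dplyr)

# Made-up toy data; only the column names mirror TITANIC
toy <- data.frame(
  sex      = c("female", "female", "female", "female", "male", "male", "male"),
  survived = c("Yes", "Yes", "Yes", "No", "Yes", "No", "No"),
  age      = c(30, 20, NA, 35, 40, 50, 60)
)

stats <- toy %>%
  group_by(sex, survived) %>%
  summarise(
    avg   = mean(age, na.rm = TRUE),
    med   = median(age, na.rm = TRUE),
    stdev = sd(age, na.rm = TRUE),    # NA for groups with a single observed age
    iqr   = IQR(age, na.rm = TRUE),
    .groups = "drop"                  # return an ungrouped result without the grouping message
  )

stats
```

Replacing `toy` with TITANIC would give one row per sex and survival status, with all four statistics side by side.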
6. (10 points) Now consider the
survived variable in the TITANIC data too,
create similar summary statistics, and answer the questions below.
For those who survived, in which class is the mean age for females less than the mean age for males? Third
For those who survived, in which class is the median age for females greater than the median age for males? Second
Write your code in the chunk below:
# mean summaries
TITANIC %>% group_by(pclass, sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))
`summarise()` has grouped output by 'pclass', 'sex'. You can override using the `.groups` argument.
# median summaries
TITANIC %>% group_by(pclass, sex, survived) %>% summarise(med = median(age, na.rm = TRUE))
`summarise()` has grouped output by 'pclass', 'sex'. You can override using the `.groups` argument.
Hint: Read the output of the code in part d)
ANSWER: Overall, the males who survived were younger than those who did not. The males who survived in 1st and 2nd class were younger, but in third class the age was 25 for both groups. For first class, the mean was 36 and the median 36. For second class, the mean was 17 and the median 19. For third class, the mean was 22 and the median 25.
Hint: Complete the code below by specifying which variable you want to arrange by.
TITANIC %>% filter(sex == "female" & survived == "Yes" & pclass == "1st") %>% arrange(age)
TITANIC %>% filter(sex == "female" & survived == "Yes" & pclass == "1st") %>% arrange(desc(age))
To arrange in descending order, wrap the variable in desc() inside
arrange(), as in arrange(desc(var_name)).
YOUR CODE HERE:
TITANIC %>% filter(sex == "female" & survived == "Yes" & pclass == "1st") %>% arrange(desc(age))
TITANIC %>% filter(sex == "male" & survived == "Yes" & pclass == "1st") %>% arrange(desc(age))
ANSWER:
The oldest surviving female in 1st class was Cavendish, Mrs. Tyrell William, 76 years of age.
The oldest surviving male in 1st class was Barkworth, Mr. Algernon Henry Wilson, 80 years of age.
Hint: Review and explain the exploratory graphs created by the code chunk. How do they support your justification?
TITANIC %>% ggplot(aes(x = survived)) +
geom_bar(aes(fill = sex), stat = "count", position = "stack" ) +
theme_bw()
TITANIC %>% ggplot(aes(x = survived)) +
geom_bar(aes(fill = pclass), stat = "count", position = "stack" ) +
theme_bw()
I would infer that the older male passengers did, in fact, try to protect the young and the women. The first graph shows that more women survived than men. The second graph shows that a large number of 3rd class passengers died. That may be because of the location of their cabins, or because higher-class passengers used their status to get ahead of 3rd class passengers.
Comment: In most of the code you used or wrote in Task
1, functions were called with the argument
na.rm = TRUE, instructing R to drop the NA values
for the computations.
part 1) (5 points) Use the function
na.omit() (or the filter() function from the
dplyr package) to create a clean data set
that removes subjects if any observations on the subject are
unknown. Store the modified data frame in a data frame
named CLEAN. Run the function dim() on the
data frame CLEAN to find the number of observations (rows)
in the CLEAN data.
COMPLETE THE CODE HERE, uncomment necessary lines before running:
CLEAN <- na.omit(TITANIC)
#print the dimensions
dim(CLEAN)
[1] 119 14
part 2) (5 points) How many missing values in the
data frame TITANIC are there? How many rows of
TITANIC have no missing values, one missing value, two
missing values, and three missing values, respectively? Note: the number
of rows in CLEAN should agree with your answer for the
number of rows in TITANIC that have no missing values. What
are the cons of cleaning the data in the suggested way?
Use the code, explain what it does.
#get the number of missing values in columns
colNAs<- colSums(is.na(TITANIC))
(colNAs <- as.vector(colSums(is.na(TITANIC)))) # coerce to a vector
[1] 0 0 0 0 263 0 0 0 1 0 0 0 1188 0
rowNAs <- table(rowSums(is.na(TITANIC)))
(rowNAs <- as.vector(table(rowSums(is.na(TITANIC))))) # coerce to a vector
[1] 119 928 262
Comment: The missing values are in the variables age (263), fare (1), and body (1188), for a total of 1452 missing values.
Comment: There are 119 rows with no missing values, 928 rows with 1 missing value, and 262 rows with 2 missing values. No row has 3 missing values.
Comment on how this aligns with the dimensions of your CLEAN
data.
> Your comment: The number of rows with no missing values (119) is the
same as the number of rows in CLEAN. Because na.omit() removes rows only,
CLEAN keeps all 14 columns.
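What the counting code does, and what na.omit() costs, can be seen on a small example. The sketch below uses a made-up data frame `toy` (invented values, with column names borrowed from TITANIC for readability). In TITANIC itself, body is missing for 1188 of the 1309 rows, so dropping every incomplete row leaves only 119 observations, which is the main drawback of cleaning the data this way.

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  age  = c(22, NA, 35, NA),
  fare = c(7.25, 8.05, 9.00, 10.50),
  body = c(NA, NA, 12, NA)
)

colSums(is.na(toy))          # missing values per column
table(rowSums(is.na(toy)))   # number of rows with 0, 1, 2, ... missing values
sum(is.na(toy))              # total number of missing values

clean <- na.omit(toy)        # keeps only rows with no missing values
dim(clean)
sum(complete.cases(toy))     # agrees with nrow(clean)
```

is.na() returns a logical matrix, so colSums() and rowSums() count the TRUE entries per column and per row; table() then tallies how many rows share each count.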
Good practice: Save your customized data frame
CLEAN in your working directory as a *.csv
file using the function write.csv() with the argument
row.names = FALSE.
write.csv(CLEAN, file="TITANIC_CLEAN.csv", row.names=FALSE)
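A quick round trip shows why row.names = FALSE is useful. This is a minimal sketch with a made-up data frame `toy` standing in for CLEAN; it writes to a temporary file so the working directory is left untouched.

```r
# Made-up data frame standing in for CLEAN
toy <- data.frame(name = c("a", "b"), age = c(30, 41))

# Temporary file path, so nothing is written to the working directory
path <- tempfile(fileext = ".csv")

write.csv(toy, file = path, row.names = FALSE)
back <- read.csv(path)

names(back)  # only the original columns, no extra row-name column
nrow(back)
```

Without row.names = FALSE, write.csv() adds a leading column of row names, which read.csv() would then read back as an extra column.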
Note: This is not a guided task; you have to write your own code from scratch!
Use the CARS2004 data frame from the PASWR2 package,
which contains the numbers of cars per 1000 inhabitants
(cars), the total number of known mortal accidents
(deaths), and the country population/1000
(population) for the 25 member countries of the European
Union for the year 2004.
YOUR CODE:
library()
ANSWER:
YOUR CODE:
ANSWER:
YOUR CODE:
ANSWER:
YOUR CODE:
ANSWER:
YOUR CODE:
ANSWER:
YOUR CODE:
ANSWER:
YOUR CODE:
ANSWER:
YOUR CODE:
ANSWER:
addCircles()
function. Try, e.g.:
addCircles(weight = 1, radius = sqrt(UNC_schools$size)*100)
set.seed(123)
library(leaflet)
# The code below creates a data frame of 7 UNC system schools
# with the columns name (of the school), size (number of students), lat, and lng
UNC_schools <- data.frame(name = c("NC State", "UNC Chapel Hill", "FSU", "ECU", "UNC Charlotte", "UNC Greensboro", "UNC Wilmington"),
size = c(30130, 28136, 6000, 25990, 25990, 19653, 17843),
lat = c(36.0373638, 35.9050353, 35.0726, 35.6073769, 35.2036325, 36.0689, 34.2239),
lng = c(-79.0355663, -79.0477533, -78.8924739, -77.3671566, -80.8401145, -79.8102, -77.8696))
# Use the data frame to draw a map with circles whose radii scale with the square root of school size
UNC_schools %>%
leaflet() %>%
addTiles() %>%
addCircles(weight = 1, radius = sqrt(UNC_schools$size)*100) # try adjusting the radius by multiplying with 50 instead of 100. What do you notice?
# Circles are smaller
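The effect of the multiplier can be checked without drawing the map. The sketch below reuses the school sizes from the data frame above and compares the two radius vectors (the names `radius_100` and `radius_50` are made up for illustration). It assumes, as the leaflet documentation describes, that addCircles() interprets radius in meters.

```r
# School sizes copied from the UNC_schools data frame above
size <- c(30130, 28136, 6000, 25990, 25990, 19653, 17843)

radius_100 <- sqrt(size) * 100  # radii used in the original call
radius_50  <- sqrt(size) * 50   # radii after changing the multiplier to 50

round(range(radius_100))               # smallest and largest radius
all.equal(radius_50, radius_100 / 2)   # halving the multiplier halves every radius
```

Because the radius is proportional to the square root of size, the area of each circle is proportional to the school size itself; changing the multiplier rescales every circle by the same factor and keeps their relative sizes.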