- Complete either Task 1 or Task 2
- Complete Task 3
Note: If your Rmd file submission knits,
you will receive a total of 5 points.
# load the packages needed
library(PASWR2)
Loading required package: lattice
Loading required package: ggplot2
library(ggplot2)
library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lattice)
ANSWER: 4 packages were loaded and 8 packages came from tidyverse for a total of 12.
Note: Problem 8/p. 196 is modified
Some claim that the final hours aboard the Titanic were marked by
class warfare; others claim they were characterized by male chivalry. The
data frame TITANIC3 from the PASWR2 package
contains information pertaining to class status pclass,
survival of passengers survived, and gender
sex, among others. Based on the information in the data
frame:
A description of the variables can be found by running the code:
help("TITANIC3")
data("TITANIC3")
TITANIC3 data?
Hint: Use the function dim(), glimpse(), or str().
glimpse(TITANIC3)
Rows: 1,309
Columns: 14
$ pclass <fct> 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1…
$ survived <int> 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
$ name <fct> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor", "Allison, Miss. Hel…
$ sex <fct> female, male, female, male, female, male, female, male, female, male, male, female, fe…
$ age <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 63.0000, 39.0000, 53.0000, 71.0000…
$ sibsp <int> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,…
$ parch <int> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,…
$ ticket <fct> 24160, 113781, 113781, 113781, 113781, 19952, 13502, 112050, 11769, PC 17609, PC 17757…
$ fare <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26.5500, 77.9583, 0.0000, 51.4792, 4…
$ cabin <fct> B5, C22 C26, C22 C26, C22 C26, C22 C26, E12, D7, A36, C101, , C62 C64, C62 C64, B35, ,…
$ embarked <fct> Southampton, Southampton, Southampton, Southampton, Southampton, Southampton, Southamp…
$ boat <fct> 2, 11, , , , 3, 10, , D, , , 4, 9, 6, B, , , 6, 8, A, 5, 5, 5, 4, 8, , 7, 7, 8, D, , 7…
$ body <int> NA, NA, NA, 135, NA, NA, NA, NA, NA, 22, 124, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ home.dest <fct> "St Louis, MO", "Montreal, PQ / Chesterville, ON", "Montreal, PQ / Chesterville, ON", …
ANSWER : There are 1309 rows and 14 columns in
TITANIC3. (Fill in the blanks)
TITANIC3 data?
TITANIC3 %>% head(n = 6)
TITANIC3 %>% tail(n=6)
survived
variable in the TITANIC3 data, which is of type integer
(0/1), mutate it to a factor variable by running the code
below, and create a new data frame
TITANIC. What are the new levels of survived and its type?
TITANIC <- TITANIC3 %>% mutate(survived = factor(survived, levels = 0:1, labels = c("No", "Yes")))
# Check the levels of the 'survived' variable
levels(TITANIC$survived)
[1] "No" "Yes"
# Check the type of the 'survived' variable
class(TITANIC$survived)
[1] "factor"
ANSWER: The levels are now No and Yes and the type is factor.
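The conversion above can be illustrated on a small vector. This is a minimal sketch in base R; the vector `x` is made up for illustration and is not taken from TITANIC3.

```r
# Hypothetical 0/1 vector standing in for an integer survival indicator
x <- c(1L, 0L, 0L, 1L, 1L)

# Map 0 -> "No" and 1 -> "Yes"; the order of `labels` follows `levels`
f <- factor(x, levels = 0:1, labels = c("No", "Yes"))

levels(f)       # the labels become the factor levels
class(f)        # the type is now "factor"
as.integer(f)   # underlying codes are 1 and 2, not the original 0 and 1
table(f)        # counts per level
```

One thing to keep in mind: after the conversion, as.integer() returns the level codes (1 and 2), so the original 0/1 coding is not recovered by a simple cast.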
TITANIC Data. Write code using the pipe %>% operator
that produces the same result.
summary(TITANIC)
pclass survived name sex age
1st:323 No :809 Connolly, Miss. Kate : 2 female:466 Min. : 0.1667
2nd:277 Yes:500 Kelly, Mr. James : 2 male :843 1st Qu.:21.0000
3rd:709 Abbing, Mr. Anthony : 1 Median :28.0000
Abbott, Master. Eugene Joseph : 1 Mean :29.8811
Abbott, Mr. Rossmore Edward : 1 3rd Qu.:39.0000
Abbott, Mrs. Stanton (Rosa Hunt: 1 Max. :80.0000
(Other) :1301 NA's :263
sibsp parch ticket fare cabin
Min. :0.0000 Min. :0.000 CA. 2343: 11 Min. : 0.000 :1014
1st Qu.:0.0000 1st Qu.:0.000 1601 : 8 1st Qu.: 7.896 B57 B59 B63 B66: 5
Median :0.0000 Median :0.000 CA 2144 : 8 Median : 14.454 C23 C25 C27 : 5
Mean :0.4989 Mean :0.385 3101295 : 7 Mean : 33.295 G6 : 5
3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7 3rd Qu.: 31.275 B96 B98 : 4
Max. :8.0000 Max. :9.000 347082 : 7 Max. :512.329 C22 C26 : 4
(Other) :1261 NA's :1 (Other) : 272
embarked boat body home.dest
: 2 :823 Min. : 1.0 :564
Cherbourg :270 13 : 39 1st Qu.: 72.0 New York, NY : 64
Queenstown :123 C : 38 Median :155.0 London : 14
Southampton:914 15 : 37 Mean :160.8 Montreal, PQ : 10
14 : 33 3rd Qu.:256.0 Cornwall / Akron, OH: 9
4 : 31 Max. :328.0 Paris, France : 9
(Other):308 NA's :1188 (Other) :639
YOUR CODE HERE:
TITANIC %>% summary()
pclass survived name sex age
1st:323 No :809 Connolly, Miss. Kate : 2 female:466 Min. : 0.1667
2nd:277 Yes:500 Kelly, Mr. James : 2 male :843 1st Qu.:21.0000
3rd:709 Abbing, Mr. Anthony : 1 Median :28.0000
Abbott, Master. Eugene Joseph : 1 Mean :29.8811
Abbott, Mr. Rossmore Edward : 1 3rd Qu.:39.0000
Abbott, Mrs. Stanton (Rosa Hunt: 1 Max. :80.0000
(Other) :1301 NA's :263
sibsp parch ticket fare cabin
Min. :0.0000 Min. :0.000 CA. 2343: 11 Min. : 0.000 :1014
1st Qu.:0.0000 1st Qu.:0.000 1601 : 8 1st Qu.: 7.896 B57 B59 B63 B66: 5
Median :0.0000 Median :0.000 CA 2144 : 8 Median : 14.454 C23 C25 C27 : 5
Mean :0.4989 Mean :0.385 3101295 : 7 Mean : 33.295 G6 : 5
3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7 3rd Qu.: 31.275 B96 B98 : 4
Max. :8.0000 Max. :9.000 347082 : 7 Max. :512.329 C22 C26 : 4
(Other) :1261 NA's :1 (Other) : 272
embarked boat body home.dest
: 2 :823 Min. : 1.0 :564
Cherbourg :270 13 : 39 1st Qu.: 72.0 New York, NY : 64
Queenstown :123 C : 38 Median :155.0 London : 14
Southampton:914 15 : 37 Mean :160.8 Montreal, PQ : 10
14 : 33 3rd Qu.:256.0 Cornwall / Akron, OH: 9
4 : 31 Max. :328.0 Paris, France : 9
(Other):308 NA's :1188 (Other) :639
survived) according to class (pclass).
Hint: Uncomment the 3 lines in the code chunk below
and then use the prop.table function.
T1 <- TITANIC %>% select(survived, pclass) %>% table
#
T1
pclass
survived 1st 2nd 3rd
No 123 158 528
Yes 200 119 181
#
prop.table(T1, margin = 2) # margin = 2 gives proportions within each column; margin = 1 gives proportions within each row
pclass
survived 1st 2nd 3rd
No 0.3808050 0.5703971 0.7447109
Yes 0.6191950 0.4296029 0.2552891
ANSWER (Fill in the blank spaces): In 1st class, 61.92% survived; in 2nd class, 42.96%; in 3rd class, 25.53%.
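To see what the margin argument changes, the counts printed in T1 can be re-entered by hand. This is a sketch in base R; the object names `T1_counts`, `col_prop`, and `row_prop` are introduced here for illustration only.

```r
# Counts copied from the T1 table above (rows: survived, columns: pclass)
T1_counts <- matrix(
  c(123, 200,   # 1st class: No, Yes
    158, 119,   # 2nd class: No, Yes
    528, 181),  # 3rd class: No, Yes
  nrow = 2,
  dimnames = list(survived = c("No", "Yes"),
                  pclass   = c("1st", "2nd", "3rd"))
)

col_prop <- prop.table(T1_counts, margin = 2)  # each column sums to 1
row_prop <- prop.table(T1_counts, margin = 1)  # each row sums to 1

round(100 * col_prop["Yes", ], 2)  # percent surviving within each class
round(100 * row_prop["Yes", ], 2)  # class shares among the survivors
```

The two calls answer different questions: margin = 2 conditions on class ("what share of 1st class survived?"), while margin = 1 conditions on survival ("what share of survivors were in 1st class?").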
Hint: Use the code below that creates 3-way table and then use
prop.table() similarly to part a).
T2 <- TITANIC %>% select(pclass, sex, survived) %>% table()
T2
, , survived = No
sex
pclass female male
1st 5 118
2nd 12 146
3rd 110 418
, , survived = Yes
sex
pclass female male
1st 139 61
2nd 94 25
3rd 106 75
prop.table(T2)
, , survived = No
sex
pclass female male
1st 0.003819710 0.090145149
2nd 0.009167303 0.111535523
3rd 0.084033613 0.319327731
, , survived = Yes
sex
pclass female male
1st 0.106187930 0.046600458
2nd 0.071810542 0.019098549
3rd 0.080977846 0.057295646
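ANSWER (Fill in the blank spaces): Of all passengers, 8.0977846 % were women in third class who survived and 4.6600458 % were men in first class who survived. Because prop.table(T2) is called without a margin, each count is divided by all 1309 passengers; among the 500 survivors alone, women in third class make up 106/500 = 21.2 % and men in first class 61/500 = 12.2 %.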
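The difference between proportions of all passengers and proportions among survivors can be checked directly from the counts printed in T2. This is a sketch in base R; the array `counts` re-enters those counts by hand, and the names `overall` and `by_survival` are introduced here for illustration.

```r
# Counts copied from the T2 output above; array() fills pclass first,
# then sex, then survived
counts <- array(
  c(5, 12, 110,     # survived = No,  female: 1st, 2nd, 3rd
    118, 146, 418,  # survived = No,  male
    139, 94, 106,   # survived = Yes, female
    61, 25, 75),    # survived = Yes, male
  dim = c(3, 2, 2),
  dimnames = list(pclass   = c("1st", "2nd", "3rd"),
                  sex      = c("female", "male"),
                  survived = c("No", "Yes"))
)

overall     <- prop.table(counts)              # divides by all passengers
by_survival <- prop.table(counts, margin = 3)  # divides within each survived level

overall["3rd", "female", "Yes"]      # share of all passengers
by_survival["3rd", "female", "Yes"]  # share of the survivors
by_survival["1st", "male", "Yes"]    # share of the survivors
```

With margin = 3, each slice of the array (survived = No, survived = Yes) sums to 1, which is the conditioning that a phrase such as "of those who survived" calls for.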
Hint: Run the code below, which produces some summary statistics and the density distribution. The commented-out code is the older base R style; it is shown because it may resemble the textbook examples.
# Finding summary statistics
#median(TITANIC$age, na.rm = TRUE) # old style
#mean(TITANIC$age, na.rm = TRUE) # old style
# dplyr style
TITANIC %>% summarise(mean = mean(age, na.rm = TRUE), median = median(age, na.rm = TRUE))
# IQR(TITANIC$age, na.rm = TRUE)
TITANIC %>% pull(age) %>% IQR(na.rm = TRUE) # pull() extracts the column from the data frame as a vector
[1] 18
# look at the density function to see if it is uni or bi-modal distribution
ggplot(data = TITANIC, aes(x = age)) +
geom_density(fill = "lightgreen") +
theme_bw()
Warning: Removed 263 rows containing non-finite outside the scale range (`stat_density()`).
ANSWER: This is a positively skewed bi-modal distribution.
Hint: Using the dplyr package functions
and the pipe operator %>%, elegant code can produce
summaries for each statistic: mean, median, sd, IQR.
pclass variable, namely
regardless of passenger class:
# mean summaries
TITANIC %>% group_by(sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
# sd summaries
TITANIC %>% group_by(sex, survived) %>% summarise(stdev = sd(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
# median summaries
TITANIC %>% group_by(sex, survived) %>% summarise(med = median(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
# IQR summaries
TITANIC %>% group_by(sex, survived) %>% summarise(IQR = IQR(age, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
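The four separate pipelines above could also be combined into a single summarise() call. The following is a minimal sketch on made-up data (the data frame `toy` is hypothetical and is not the TITANIC data); it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, not taken from TITANIC
toy <- data.frame(
  sex      = c("female", "female", "female", "male", "male", "male"),
  survived = c("Yes", "Yes", "No", "Yes", "No", "No"),
  age      = c(30, 20, 25, 40, NA, 10)
)

toy_summary <- toy %>%
  group_by(sex, survived) %>%
  summarise(
    avg     = mean(age, na.rm = TRUE),
    stdev   = sd(age, na.rm = TRUE),      # NA when a group has one value
    med     = median(age, na.rm = TRUE),
    iqr     = IQR(age, na.rm = TRUE),
    .groups = "drop"                      # ungrouped result, no grouping message
  )

toy_summary
```

Setting .groups = "drop" returns an ungrouped tibble and suppresses the "has grouped output by" message that appears after each call above.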
Based on the summaries, answer the question below:
d-1) For those who survived, the mean age for females is about 3 years older than the mean age for males?
d-2) For those who survived, the median age for females is 1.5 years older than the median age for males?
ANSWER: The mean and median ages of females who survived are higher than those of females who died. Women who survived had a mean age of 29.81535, compared with 26.25521 for women who died. The median age was 28.5 for those who survived and 24.5 for those who died.
survived variable in the TITANIC data too,
create similar summary statistics and answer the questions below.
For those who survived, in which class is the mean age for females less than the mean age for males?
Third class
For those who survived, in which class is the median age for females greater than the median age for males?
Second class
Write your code in the chunk below:
# mean summaries
TITANIC %>% group_by(pclass, sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))
`summarise()` has grouped output by 'pclass', 'sex'. You can override using the `.groups` argument.
# median summaries
TITANIC %>% group_by(pclass, sex, survived) %>% summarise(med = median(age, na.rm = TRUE))
`summarise()` has grouped output by 'pclass', 'sex'. You can override using the `.groups` argument.
# write your code here
Hint: Read the output of the code in part d)
ANSWER: The mean and median ages of males who survived are lower than those of males who died. Males who survived had a mean age of 26.97778, compared with 31.5164 for those who died. The median age was 27 for males who survived and 29 for those who died.
Hint: Complete the code below by specifying which variable you want to be arranged.
14 years old.
TITANIC %>% filter(sex == "female" & survived == "Yes" & pclass == "1st") %>% arrange(age)
Arranging in descending order is achieved by specifying in the
arrange() function desc(var_name).
YOUR CODE HERE:
TITANIC %>% filter(sex == "female" & survived == "Yes" & pclass == "1st") %>% arrange(desc(age))
TITANIC %>% filter(sex == "male" & survived == "Yes" & pclass == "1st") %>% arrange(desc(age))
ANSWER:
The oldest female survivor in 1st class was 76 years old.
The oldest male survivor in 1st class was 80 years old.
Hint: Review and explain the exploratory graphs created by the code chunk. How do they support your justification?
I think the data suggest a combination of both, because a large portion of those who died were men and a large portion were in 3rd class.
TITANIC %>% ggplot(aes(x = survived)) +
geom_bar(aes(fill = sex), stat = "count", position = "stack" ) +
theme_bw()
TITANIC %>% ggplot(aes(x = survived)) +
geom_bar(aes(fill = pclass), stat = "count", position = "stack" ) +
theme_bw()
NA
Comment: In most of the code you used or wrote in Task
1, functions were called with the argument
na.rm = TRUE, instructing R to drop the NA values
before the computations.
part 1) (5 points) Use the function
na.omit() (or the filter() function from the
dplyr package) to create a clean data set
that removes subjects if any observations on the subject are
unknown. Store the modified data frame in a data frame
named CLEAN. Run the function dim() on the
data frame CLEAN to find the number of observations (rows)
in the CLEAN data.
COMPLETE THE CODE HERE, uncomment necessary lines before running:
CLEAN <- na.omit(TITANIC)
#or
#CLEAN <- TITANIC %>% filter(complete.cases(.))
#print the dimensions
dim(CLEAN)
[1] 119 14
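The two ways of dropping incomplete rows can be compared on a small example. This is a sketch in base R with a made-up data frame `toy`; it is not the TITANIC data.

```r
# Hypothetical toy data with missing values in two columns
toy <- data.frame(
  id   = 1:4,
  age  = c(29, NA, 41, 35),
  fare = c(7.25, 8.05, NA, 26)
)

clean_a <- na.omit(toy)                 # drops every row containing an NA
clean_b <- toy[complete.cases(toy), ]   # same rows, selected by a logical vector

dim(clean_a)
complete.cases(toy)                     # TRUE only for rows with no NA
identical(clean_a$id, clean_b$id)       # both keep the same rows
```

The two results hold the same rows, but na.omit() also attaches an na.action attribute recording which rows were dropped, so identical(clean_a, clean_b) on the whole objects would be FALSE.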
part 2) (5 points) How many missing values in the
data frame TITANIC are there? How many rows of
TITANIC have no missing values, one missing value, two
missing values, and three missing values, respectively? Note: the number
of rows in CLEAN should agree with your answer for the
number of rows in TITANIC that have no missing values. What
are the cons of cleaning the data in the suggested way?
Use the code, explain what it does.
#get the number of missing values in columns
colNAs<- colSums(is.na(TITANIC))
(colNAs <- as.vector(colSums(is.na(TITANIC)))) # coerce to a vector
[1] 0 0 0 0 263 0 0 0 1 0 0 0 1188 0
rowNAs <- table(rowSums(is.na(TITANIC)))
(rowNAs <- as.vector(table(rowSums(is.na(TITANIC))))) # coerce to a vector
[1] 119 928 262
Comment: The missing values are in the variables age (263), fare (1), and body (1188), for a total of 1452 missing values.
Comment: There are 119 rows with no missing values, 928 rows with 1 missing value, 262 rows with 2 missing values, and no rows with 3 missing values.
Comment on how this aligns with the dimensions of your CLEAN
data.
Your comment: The 119 rows of CLEAN match the 119 rows of TITANIC
that have no missing values in any of the 14 columns.
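What the counting code does can be traced on a small example. This is a sketch in base R; the data frame `toy` and the names `col_na` and `row_na` are made up for illustration.

```r
# Hypothetical toy data; NA marks a missing value
toy <- data.frame(
  age  = c(29, NA, NA, 35),
  fare = c(7.25, 8.05, NA, 26),
  body = c(NA, NA, NA, 135)
)

is.na(toy)                             # logical matrix, TRUE where a value is missing

col_na <- colSums(is.na(toy))          # missing values per column (names kept)
row_na <- table(rowSums(is.na(toy)))   # number of rows having 0, 1, 2, ... NAs

col_na
row_na
sum(is.na(toy))                        # total number of missing values
```

Skipping the as.vector() step keeps the column names and the table labels (0, 1, 2, ...), which makes the counts easier to read. The count under label 0 is the number of rows that na.omit() would keep.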
Good practice: Save your customized data frame
CLEAN in your working directory as a *.csv
file using the function write.csv() with the argument
row.names = FALSE.
write.csv(CLEAN, file="TITANIC_CLEAN.csv", row.names=FALSE)
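The effect of row.names = FALSE can be seen by writing a small data frame and reading it back. This is a sketch in base R; `toy` is made up and the files are written to temporary paths rather than the working directory.

```r
# Hypothetical toy data
toy <- data.frame(name = c("a", "b"), age = c(29, 35))

# Without row names: the file holds only the data columns
path_clean <- tempfile(fileext = ".csv")
write.csv(toy, file = path_clean, row.names = FALSE)
back_clean <- read.csv(path_clean)

# With the default row.names = TRUE: an extra unnamed first column is written
path_rows <- tempfile(fileext = ".csv")
write.csv(toy, file = path_rows)
back_rows <- read.csv(path_rows)

names(back_clean)
names(back_rows)   # read.csv() names the extra column "X"
```

Leaving row names out means the file reads back with the same columns it was written with, which is why the argument is recommended here.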
Note: This is not a guided task; you have to write your own code from scratch!
Use the CARS2004 data frame from the PASWR2 package,
which contains the numbers of cars per 1000 inhabitants
(cars), the total number of known mortal accidents
(deaths), and the country population/1000
(population) for the 25 member countries of the European
Union for the year 2004.
total.cars. Determine the total number of known
automobile fatalities in 2004 divided by the total number of cars for
each country and store the result in an object named
death.rate.
YOUR CODE:
library()
ANSWER:
YOUR CODE:
ANSWER:
YOUR CODE:
ANSWER:
YOUR CODE:
ANSWER:
total.cars. Superimpose the least
squares line on the scatterplot from (d). What population does the least
squares model predict for a country with a total.cars value
of 19224.630? Find the difference between the population predicted from
the least squares model and the actual population for the country with a
total.cars value of 19224.630.
YOUR CODE:
ANSWER:
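As a generic illustration of fitting a least squares line, predicting from it, and comparing the prediction with an observed value, here is a sketch in base R on made-up numbers. The data frame `toy` and its columns `x` and `y` are hypothetical; this is not the CARS2004 data and not a solution to the task.

```r
# Hypothetical toy data
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit <- lm(y ~ x, data = toy)   # least squares fit of y on x
coef(fit)                      # intercept and slope

# Predicted y for a chosen x value
pred <- predict(fit, newdata = data.frame(x = 3))

# Observed minus predicted for the row with that x value
resid_3 <- toy$y[toy$x == 3] - pred

pred
resid_3
```

On a base R scatterplot the fitted line can be drawn with abline(fit); with ggplot2, geom_smooth(method = "lm", se = FALSE) adds the same line.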
total.cars versus death.rate. How would you
characterize the relationship between the two variables?
YOUR CODE:
ANSWER:
total.cars and death.rate.
(Hint: Use cor(x, y, method = "spearman").) What is this
coefficient measuring?
YOUR CODE:
ANSWER:
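To get a feel for how the Spearman coefficient differs from the default Pearson coefficient, here is a sketch in base R on made-up vectors `x` and `y` (hypothetical values, not the CARS2004 data).

```r
# Hypothetical data: y increases with x, but not along a straight line
x <- c(1, 2, 3, 4, 5)
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")   # strength of the linear association
spearman <- cor(x, y, method = "spearman")  # association between the ranks

pearson
spearman

# Spearman is the Pearson correlation computed on the ranks
cor(rank(x), rank(y))
```

Because Spearman works on ranks, any strictly increasing relationship reaches the maximum value even when it is far from linear, and extreme values have less influence on it than on Pearson.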
total.cars versus the logarithm of death.rate.
How would you characterize the relationship?
YOUR CODE:
ANSWER:
addCircles()
function.
Try, e.g.:
addCircles(weight = 1, radius = sqrt(UNC_schools$size)*100)