This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. Review this website for more details on using R Markdown http://rmarkdown.rstudio.com.
Use RStudio for this assignment. Complete the assignment by inserting your code wherever you see the string “#INSERT YOUR ANSWER HERE”.
When you click the Knit button, a document (PDF, Word, or HTML format) will be generated that includes both the assignment content as well as the output of any embedded R code chunks.
NOTE: YOU SHOULD NEVER HAVE
install.packages IN YOUR CODE; OTHERWISE, THE
Knit OPTION WILL GIVE AN ERROR. COMMENT OUT ALL PACKAGE
INSTALLATIONS.
Submit both the rmd and generated
output files. Failing to submit both files will be subject
to mark deduction.
Use seq() to create the vector \((100, 97, \dots, 4)\).
seq(100, 3, -3)
## [1] 100 97 94 91 88 85 82 79 76 73 70 67 64 61 58 55 52 49 46
## [20] 43 40 37 34 31 28 25 22 19 16 13 10 7 4
The Titanic Passenger Survival Data Set provides information on the
fate of passengers on the fatal maiden voyage of the ocean liner
“Titanic.” The dataset is available from the Department of Biostatistics
at the Vanderbilt University School of Medicine (https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv)
in several formats. store the Titanic Data Set into the variable
titanic_train using the following commands.
library(readr)
# Read the Titanic Data Set from the CSV file
titanic_train <- read_csv("https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv")
## Rows: 1309 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, sex, ticket, cabin, embarked, boat, home.dest
## dbl (7): pclass, survived, age, sibsp, parch, fare, body
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
1.a) Extract the columns cabin, ticket,
embarked, name into a new data frame named
titanicSubset and show its head.
(5 points)
# Extract the desired columns into a new DataFrame
titanicSubset <- titanic_train[, c("cabin", "ticket", "embarked", "name")]
# Show the head of the new DataFrame
head(titanicSubset)
## # A tibble: 6 × 4
## cabin ticket embarked name
## <chr> <chr> <chr> <chr>
## 1 B5 24160 S Allen, Miss. Elisabeth Walton
## 2 C22 C26 113781 S Allison, Master. Hudson Trevor
## 3 C22 C26 113781 S Allison, Miss. Helen Loraine
## 4 C22 C26 113781 S Allison, Mr. Hudson Joshua Creighton
## 5 C22 C26 113781 S Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
## 6 E12 19952 S Anderson, Mr. Harry
1.b) Categorical data: Use the hist() function to
display the histogram of the age of passengers.
(5 points)
# Plot histogram of age
hist(titanic_train$age,
breaks = "Sturges",
main = "Histogram of Age",
xlab = "Age",
ylab = "Frequency")
1.c) Pivot Table: In a data frame, show the number of survived/not
survived people per gender. In other words, the table should have three
columns: sex, survived, and n
(i.e., the count per each case).
HINT: Use count() and group_by() functions
from the dplyr package to calculate the number of
survived/not survived cases by sex. group_by()
should be used first and then pipe %>%the result to
count() to calculate the output.
(5 points)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
titanic_pivot <- titanic_train %>%
group_by(sex)
titanic_pivot <- titanic_pivot %>%
count(survived) %>%
rename(n = n)
print(titanic_pivot)
## # A tibble: 4 × 3
## # Groups: sex [2]
## sex survived n
## <chr> <dbl> <int>
## 1 female 0 127
## 2 female 1 339
## 3 male 0 682
## 4 male 1 161
1.d) Use the aggregate() function to calculate the ‘survivalCount’ of
each sex and calculate the survival rate of each sex (i.e.,
the percentage of survived people per gender). Then draw the conclusion
on which sex has the higher survival rate.
OPTIONAL: try to do the same task using group_by,
summarize, and mean.
(5 points)
survival_count <- aggregate(survived ~ sex, data = titanic_train, FUN = sum)
survival_count$survival_rate <- survival_count$survived / table(titanic_train$sex)[survival_count$sex] * 100
print(survival_count)
## sex survived survival_rate
## 1 female 339 72.74678
## 2 male 161 19.09846
1.e) Using boxplot to display the distribution of age
for each class (pclass) and infer which class is the
wealthiest.
(5 points)
# Load the required library
library(ggplot2)
# Create a boxplot
ggplot(titanic_train, aes(x = as.factor(pclass), y = age)) +
geom_boxplot() +
labs(x = "Class (pclass)", y = "Age") +
ggtitle("Distribution of Age by Class")
## Warning: Removed 263 rows containing non-finite values (`stat_boxplot()`).
1.f) Calculate the average fare per class and describe if the calculation agrees with the box plot.
(5 points)
# Calculate average fare per class
# Calculate the average fare per class
average_fare_per_class <- titanic_train %>%
group_by(pclass) %>%
summarize(avg_fare = mean(fare))
# Print the average fare per class
print(average_fare_per_class)
## # A tibble: 3 × 2
## pclass avg_fare
## <dbl> <dbl>
## 1 1 87.5
## 2 2 21.2
## 3 3 NA
1.g) Use the for loop and if control
statements to list the women’s names whose ages are less than 30 that
embarked from C (Cherbourg) on the Titanic.
(10 points)
# Create an empty vector to store the names
women_names <- c()
# Iterate through each row in the Titanic data frame
for (i in 1:nrow(titanic_train)) {
# Check if the passenger is female, age is less than 30, and embarked from Cherbourg
if (!is.na(titanic_train$sex[i]) &&
!is.na(titanic_train$age[i]) &&
titanic_train$sex[i] == "female" &&
titanic_train$age[i] < 30 &&
titanic_train$embarked[i] == "C") {
# Append the name to the vector
women_names <- c(women_names, titanic_train$name[i])
}
}
# Print the list of women's names
print(women_names)
## [1] "Astor, Mrs. John Jacob (Madeleine Talmadge Force)"
## [2] "Aubart, Mme. Leontine Pauline"
## [3] "Bishop, Mrs. Dickinson H (Helen Walton)"
## [4] "Clark, Mrs. Walter Miller (Virginia McDowell)"
## [5] "Douglas, Mrs. Frederick Charles (Mary Helene Baxter)"
## [6] "Earnshaw, Mrs. Boulton (Olive Potter)"
## [7] "Frolicher, Miss. Hedwig Margaritha"
## [8] "Gibson, Miss. Dorothy Winifred"
## [9] "Harder, Mrs. George Achilles (Dorothy Annan)"
## [10] "Hays, Miss. Margaret Bechstein"
## [11] "Hippach, Miss. Jean Gertrude"
## [12] "Mayne, Mlle. Berthe Antonine (\"Mrs de Villiers\")"
## [13] "Newell, Miss. Marjorie"
## [14] "Ostby, Miss. Helene Ragnhild"
## [15] "Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)"
## [16] "Ryerson, Miss. Emily Borie"
## [17] "Ryerson, Miss. Susan Parker \"Suzette\""
## [18] "Sagesser, Mlle. Emma"
## [19] "Abelson, Mrs. Samuel (Hannah Wizosky)"
## [20] "del Carlo, Mrs. Sebastiano (Argenia Genovesi)"
## [21] "Duran y More, Miss. Asuncion"
## [22] "Jerwan, Mrs. Amin S (Marie Marthe Thuillard)"
## [23] "Laroche, Miss. Louise"
## [24] "Laroche, Miss. Simonne Marie Anne Andree"
## [25] "Laroche, Mrs. Joseph (Juliette Marie Louise Lafargue)"
## [26] "Lehmann, Miss. Bertha"
## [27] "Mallet, Mrs. Albert (Antoinette Magnin)"
## [28] "Nasser, Mrs. Nicholas (Adele Achem)"
## [29] "Abrahim, Mrs. Joseph (Sophie Halaut Easu)"
## [30] "Attalah, Miss. Malake"
## [31] "Ayoub, Miss. Banoura"
## [32] "Baclini, Miss. Eugenie"
## [33] "Baclini, Miss. Helene Barbara"
## [34] "Baclini, Miss. Marie Catherine"
## [35] "Baclini, Mrs. Solomon (Latifa Qurban)"
## [36] "Barbara, Miss. Saiide"
## [37] "Boulos, Miss. Nourelain"
## [38] "Karun, Miss. Manca"
## [39] "Najib, Miss. Adele Kiamie \"Jane\""
## [40] "Nakid, Miss. Maria (\"Mary\")"
## [41] "Nakid, Mrs. Said (Waika \"Mary\" Mowad)"
## [42] "Nicola-Yarred, Miss. Jamila"
## [43] "Thomas, Mrs. Alexander (Thamine \"Thelma\")"
## [44] "Touma, Miss. Maria Youssef"
## [45] "Touma, Mrs. Darwis (Hanne Youssef Razi)"
## [46] "Yasbeck, Mrs. Antoni (Selini Alexander)"
## [47] "Zabour, Miss. Hileni"
20 engines work together in a sequence. The historical data shows that the probability of each engine’s failure is 0.15. We know that if one engine fails, the whole system stops.
2.a) What is the probability that the system operates without failure?
(5 points)
# Calculate the probability of the system operating without failure
probability_without_failure <- 0.85^20
# Print the result
print(probability_without_failure)
## [1] 0.03875953
2.b) Use the Binomial approximation to calculate the probability that at least 10 engines are defective?— Please check
(5 points)
# Define the parameters
n <- 20 # Number of trials (engines)
p <- 0.15 # Probability of failure for each engine
# Calculate the probability of at least 10 defective engines
probability_at_least_10_defective <- 1 - pbinom(9, n, p)
# Print the result
print(probability_at_least_10_defective)
## [1] 0.000248382
2.c) What is the probability that the second engine (B) is defective P(B|A) given the first engine (A) is not defective and the first and second engines are independent.
(10 points)
# Define the probability of an engine's failure
p_failure <- 0.15
# Calculate the probability that the second engine (B) is defective (P(B|A))
p_B_given_A <- p_failure
# Print the result
print(p_B_given_A)
## [1] 0.15
On average, Sarah visits her parents three times a week.
3.a) Find the probabilities that she visits her parents at most 2 times a week?
(5 points)
# Define the average number of visits per week
average_visits <- 3
# Calculate the probabilities of visiting at most 2 times a week
probability_at_most_2 <- ppois(2, lambda = average_visits)
# Print the result
print(probability_at_most_2)
## [1] 0.4231901
3.b) Find the probability that Sarah visits her parents at least twice a week?
(5 points)
# Define the average number of visits per week
average_visits <- 3
# Calculate the probability of at least two visits per week
probability_at_least_2 <- 1 - ppois(1, lambda = average_visits)
# Print the result
print(probability_at_least_2)
## [1] 0.8008517
3.c) Comparing the similarity between Binomial and Poisson distribution.
Create 55,000 samples for a Binomial random variable using parameters described in Question 2.
Create 55,000 samples for a Poisson random variable using parameters described in Question 3.
Illustrate how well the Poisson probability distribution approximates the Binomial probability distribution.
HINT: You may use multhist() from the
plotrix package to show their histogram next to each other,
or you can create two separate histogram plots with the same x and y
axis ranges.
(15 points@ 5 points each)
# Set the random seed for reproducibility
set.seed(42)
# Define the number of samples
num_samples <- 55000
# Define the parameters for the Binomial and Poisson distributions
n <- 20 # Number of trials for the Binomial distribution
p <- 0.15 # Probability of success for the Binomial distribution
lambda <- 3 # Average rate for the Poisson distribution
# Generate samples from the Binomial distribution
binomial_samples <- rbinom(num_samples, n, p)
# Generate samples from the Poisson distribution
poisson_samples <- rpois(num_samples, lambda)
# Create separate histogram plots for the Binomial and Poisson samples
hist(binomial_samples, col = "blue", xlab = "Number of Events",
ylab = "Frequency", main = "Binomial Distribution")
hist(poisson_samples, col = "green", xlab = "Number of Events",
ylab = "Frequency", main = "Poisson Distribution")
legend("topright", legend = c("Binomial", "Poisson"), fill = c("blue", "green"))
Write scripts in \(\texttt{R}\) to compute the following probabilities of a random variable following the normal distribution with the mean of 12 and the variance of 25
4.a) The probability that it lies between 8.2 and 11.2 (inclusive)
(5 points)
# Define the mean and variance
mean <- 12
variance <- 25
# Define the lower and upper bounds
lower <- 8.2
upper <- 11.2
# Calculate the probabilities using the cumulative distribution function (CDF)
probability <- pnorm(upper, mean, sqrt(variance)) - pnorm(lower, mean, sqrt(variance))
# Print the result
print(probability)
## [1] 0.2128132
4.b) The probability that it is greater than 12.0
(5 points)
# Define the mean and variance
mean <- 12
variance <- 25
# Define the threshold value
threshold <- 12.0
# Calculate the complementary probability using the cumulative distribution function (CDF)
probability <- 1 - pnorm(threshold, mean, sqrt(variance))
# Print the result
print(probability)
## [1] 0.5
4.c) The probability that it is less than 7.5 or greater than 12.5
(5 points)
# Define the mean and variance
mean <- 12
variance <- 25
# Define the lower and upper bounds
lower <- 7.5
upper <- 12.5
# Calculate the probabilities using the cumulative distribution function (CDF)
probability_less_than_7.5 <- pnorm(lower, mean, sqrt(variance))
probability_greater_than_12.5 <- 1 - pnorm(upper, mean, sqrt(variance))
# Calculate the probability of being less than 7.5 or greater than 12.5
probability <- probability_less_than_7.5 + probability_greater_than_12.5
# Print the result
print(probability)
## [1] 0.6442323
END of Assignment #2.