CIND 123 - Data Analytics: Basic Methods

Assignment 2 (10%)

Total 100 Marks (5 bonus marks, up to a maximum of 100)

[Insert your full name]

[Insert course section & student number]


Instructions

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com.

Use RStudio for this assignment. Complete the assignment by inserting your R code wherever you see the string “#INSERT YOUR ANSWER HERE”.

When you click the Knit button, a document (PDF, Word, or HTML format) will be generated that includes both the assignment content as well as the output of any embedded R code chunks.

Submit both the .Rmd file and the generated output file. Failure to submit both files will result in a mark deduction.

Sample Question and Solution

Use seq() to create the vector (3, 5, …, 29).

seq(3, 30, 2)
##  [1]  3  5  7  9 11 13 15 17 19 21 23 25 27 29
seq(3, 29, 2)
##  [1]  3  5  7  9 11 13 15 17 19 21 23 25 27 29
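
Both calls in the sample solution return the same vector because seq() stops at the last value that does not exceed the upper limit. The following is a minimal sketch of how one might check this; the variable names a and b are illustrative only.

```r
# Two ways of writing the same sequence: the upper limit 30 is never
# reached with a step of 2 starting from 3, so the last element is 29.
a <- seq(3, 30, by = 2)
b <- seq(3, 29, by = 2)

identical(a, b)  # TRUE: the two vectors are the same
length(a)        # 14 elements
```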

Question 1 (40 points)

The Titanic Passenger Survival Data Set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner “Titanic.” The dataset is available in several formats from the Department of Biostatistics at the Vanderbilt University School of Medicine (https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv). Store the Titanic data set in a data frame named titanicDataset using the following commands.

#install.packages("titanic")
library(titanic)
titanicDataset <- read.csv(file = "https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv", stringsAsFactors = FALSE)
summary(titanicDataset)
##      pclass         survived         name               sex           
##  Min.   :1.000   Min.   :0.000   Length:1309        Length:1309       
##  1st Qu.:2.000   1st Qu.:0.000   Class :character   Class :character  
##  Median :3.000   Median :0.000   Mode  :character   Mode  :character  
##  Mean   :2.295   Mean   :0.382                                        
##  3rd Qu.:3.000   3rd Qu.:1.000                                        
##  Max.   :3.000   Max.   :1.000                                        
##                                                                       
##       age            sibsp            parch          ticket         
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.000   Length:1309       
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000   Class :character  
##  Median :28.00   Median :0.0000   Median :0.000   Mode  :character  
##  Mean   :29.88   Mean   :0.4989   Mean   :0.385                     
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000                     
##  Max.   :80.00   Max.   :8.0000   Max.   :9.000                     
##  NA's   :263                                                        
##       fare            cabin             embarked             boat          
##  Min.   :  0.000   Length:1309        Length:1309        Length:1309       
##  1st Qu.:  7.896   Class :character   Class :character   Class :character  
##  Median : 14.454   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 33.295                                                           
##  3rd Qu.: 31.275                                                           
##  Max.   :512.329                                                           
##  NA's   :1                                                                 
##       body        home.dest        
##  Min.   :  1.0   Length:1309       
##  1st Qu.: 72.0   Class :character  
##  Median :155.0   Mode  :character  
##  Mean   :160.8                     
##  3rd Qu.:256.0                     
##  Max.   :328.0                     
##  NA's   :1188
# if you get the following error message when you knit 
#     Error in contrib.url(repos,"source")
# you can fix it in 2 ways
# Method 1: comment out the install.packages() in your code
# Method 2: specify a repos address in the install.packages() call, e.g. install.packages("package_name", repos = "http://cran.us.r-project.org")
  1. Extract the columns sex, age, fare, cabin and survived into a new data frame named ‘titanicSubset’ and display it. (5 points)
#INSERT YOUR ANSWER HERE
  2. Numerical data: Use the count() function from the dplyr package to display the total number of passengers within each ticket class (pclass). (5 points) HINT: use %>% to pipe the data frame to count().
#INSERT YOUR ANSWER HERE
  3. Categorical data: Use the count() and group_by() functions from the dplyr package to calculate the number of passengers by sex. (5 points) HINT: use group_by() first, then pipe the result to count() to calculate the number of passengers per sex.
#INSERT YOUR ANSWER HERE
  4. Use the aggregate() function to calculate the ‘survivalCount’ of each sex, then calculate the survival rate of each sex. Conclude which sex has the higher survival rate. (5 points)
#INSERT YOUR ANSWER HERE
  5. Use a boxplot to display the distribution of age for each sex and infer whether the male or the female group is older. (5 points)
#INSERT YOUR ANSWER HERE
  6. Calculate the average age of male and female passengers and describe whether the result agrees with the box plot. (5 points)
#INSERT YOUR ANSWER HERE
  7. Use a for loop and if statements to list the names of the women aged 34 or older who embarked at S (Southampton). (10 points)
#INSERT YOUR ANSWER HERE
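
As a reminder of how aggregate() summarises a 0/1 column by group, here is a minimal sketch on a small made-up data frame. The data frame toy and its columns group and outcome are hypothetical and are not part of the Titanic data; summing a 0/1 column gives a count, and averaging it gives a rate.

```r
# Hypothetical toy data: 'outcome' is coded 1 (yes) or 0 (no).
toy <- data.frame(
  group   = c("a", "a", "a", "b", "b"),
  outcome = c(1, 0, 1, 0, 0)
)

# Count of 1s per group: the sum of a 0/1 column.
counts <- aggregate(outcome ~ group, data = toy, FUN = sum)

# Proportion of 1s per group: the mean of a 0/1 column.
rates <- aggregate(outcome ~ group, data = toy, FUN = mean)

counts
rates
```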

Question 2 (20 points)

In a shipment of 20 engines, history shows that the probability of any one engine proving unsatisfactory is 0.1.

  1. Calculate the probability that all 20 engines are satisfactory. (5 points)
#INSERT YOUR ANSWER HERE
  2. Use the binomial distribution to calculate the probability that more than 10 engines are defective. (5 points)
#INSERT YOUR ANSWER HERE
  3. Use the binomial distribution to calculate the probability that at most three engines are defective. (5 points)
#INSERT YOUR ANSWER HERE
  4. What is the probability that the second engine is defective given that the first engine is not defective? From the result, conclude whether the first and second engines are dependent or independent. (5 points)
#INSERT YOUR ANSWER HERE
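
Base R provides dbinom() for the probability of an exact count and pbinom() for cumulative probabilities. The sketch below uses deliberately different, hypothetical parameters (10 trials with success probability 0.5), so it illustrates the calls rather than the answers to this question.

```r
# Hypothetical parameters, NOT those of Question 2.
n <- 10
p <- 0.5

# P(X = 0): probability of exactly zero successes.
p_none <- dbinom(0, size = n, prob = p)

# P(X <= 3): cumulative probability up to and including 3.
p_at_most_3 <- pbinom(3, size = n, prob = p)

# P(X > 5): upper tail, strictly greater than 5.
p_more_than_5 <- pbinom(5, size = n, prob = p, lower.tail = FALSE)

c(p_none, p_at_most_3, p_more_than_5)
```

Note that lower.tail = FALSE returns P(X > q), not P(X >= q), which matters for a discrete variable.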

Question 3 (25 points)

On average, John goes to his parents’ place twice a week for visits.

  1. Find the probabilities that John goes to his parents’ place 0 to 5 times in a week. (5 points)
#INSERT YOUR ANSWER HERE
  2. Find the probability that John goes to his parents’ place 3 times or fewer in a week. (5 points)
#INSERT YOUR ANSWER HERE
  3. Compare the binomial and Poisson distributions. (15 points, 5 points each)
     1. Create 100,000 samples for a binomial random variable using the parameters described in Question 2.
     2. Create 100,000 samples for a Poisson random variable using the parameters described in Question 3.
     3. Illustrate how well the Poisson probability distribution approximates the binomial probability distribution. (BONUS) HINT: use multhist() from the ‘plotrix’ package.
#INSERT YOUR ANSWER HERE
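
A Poisson distribution with rate n * p is a common approximation to a binomial distribution when n is large and p is small. The sketch below compares the two probability functions for hypothetical parameters (n = 1000, p = 0.003, which are assumptions and not those of Questions 2 or 3) using dbinom() and dpois().

```r
# Hypothetical parameters, NOT those of Questions 2 or 3.
n <- 1000
p <- 0.003
k <- 0:10                      # counts at which to compare the two models

binom_probs <- dbinom(k, size = n, prob = p)
pois_probs  <- dpois(k, lambda = n * p)   # Poisson rate matched to n * p

# Largest pointwise difference between the two probability functions.
max_gap <- max(abs(binom_probs - pois_probs))
max_gap

# Cumulative Poisson probability P(X <= 2) for the same rate.
ppois(2, lambda = n * p)
```

The same side-by-side comparison can be made on simulated samples from rbinom() and rpois(), for example with overlaid histograms.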

Question 4 (20 points)

Write an R script to compute the probability that a normal random variable with mean 16 and variance 9:

  1. lies between 14.4 and 20.3 (inclusive) (5 points)
#INSERT YOUR ANSWER HERE
  2. is greater than 21.8 (5 points)
#INSERT YOUR ANSWER HERE
  3. is less than or equal to 10.5 (5 points)
#INSERT YOUR ANSWER HERE
  4. is less than 13 or greater than 19 (5 points)
#INSERT YOUR ANSWER HERE
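
When using pnorm(), keep in mind that its sd argument is the standard deviation, which is the square root of the variance. The sketch below uses hypothetical parameters (mean 0, variance 4) rather than those of this question.

```r
# Hypothetical parameters, NOT those of Question 4.
mu      <- 0
sigma   <- sqrt(4)   # pnorm() expects the standard deviation, not the variance

# P(-2 <= X <= 2): difference of two cumulative probabilities.
p_between <- pnorm(2, mean = mu, sd = sigma) - pnorm(-2, mean = mu, sd = sigma)

# P(X > 0): upper tail.
p_above <- pnorm(0, mean = mu, sd = sigma, lower.tail = FALSE)

# P(X < -2 or X > 2): the complement of the interval probability.
p_outside <- 1 - p_between

c(p_between, p_above, p_outside)
```

Because the normal distribution is continuous, including or excluding the endpoints of an interval does not change the probability.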

END of Assignment #2.