Instructions

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

x_int <- c(1, 2, 3)
x_int
## [1] 1 2 3

Please submit your answers by 5:59 pm on 01/21/2019

Section 1: Probability : Total points 50

Q1. Which of the following numbers cannot be probability? Explain why.

  1. -0.0001
  2. 0.05
  3. 3.415
  4. 20%
  5. 1

Ans 1. a) -0.0001 and c) 3.415. Probability is between 0 and 1, cannot be negative or over 1.

Q2. A card is drawn randomly from a deck of ordinary playing cards. The game rules are that you win if the card is a spade or an ace. What is the probability that you will win the game?

Ans 2. 4/13 (13 spades, 4 aces, 1 ace that is a spade so 13/52 + 4/52 - 1/52)

Q3. An urban hospital has a 20% mortality rate on average for admitted patients. If on a particular day, 17 patients got admitted, what are:

  1. the chances that exactly 7 will survive?

  2. the chances that at least 15 patients will survive?

Ans 3. a) 0.0004176 b) 0.3086

Q4. Let F and G be two events such that P(F) is 0.4, P(G) is 0.8. F and G are independent events. Fill in the remaining elements of the table.

Table \(G\) \(\bar{G}\) Marginal
\(F\) 0.32 0.08 0.4
\(\bar{F}\) 0.48 0.12 0.6
Marginal 0.8 0.2 1

Q5. Let F and G be two events such that P(F) is 0.2, P(G) is 0.7. Now, the conditional probability P(G|F) is given as 0.4. Fill in the remaining elements of the table.

Table \(G\) \(\bar{G}\) Marginal
\(F\) 0.8 0.12 0.2
\(\bar{F}\) 0.62 0.18 0.8
Marginal 0.7 0.3 1

Q6. A survey was conducted among 100 patients about smoking status. The following is the sample size split by smoking status (Yes or No) and gender (Male or Female).

Table Smoking (Yes) Smoking(No) Total
Male 19 36 55
Female 13 32 45
Total 32 68 100

The probability that a randomly selected patient is a male who smokes is 0.19.

  1. Fill in all the elements of the table

  2. What is the probability of a randomly selected patient being a female? 45/100

  3. What is the probability of a randomly selected patient being a smoker? 32/100

  4. What is the probability of a randomly selected smoker being a female? 13/32

Section 2: Data Analysis using R: Total points 25

Q1 : Using the dataset provided (“sample_patient_dataset.csv”), the task to build a 2x2 table for the studying the association between age at admission >70 and cardiac arrests. You can either use the sample table given below or build your own. Rememer to output both count and % in the table. Be sure to round the % to the nearest integer (e.g, 0.674 will be 67% and 0.675 will be 68%, see notes in Lecture2 on summary statistics as example). Fill in the code in the shaded areas.

Table Cardiac Arrests (Yes) Cardiac Arrests (No) Total
Age > 70 (%) 453 (2%) 4728 (20%) 5181
Age <= 70 (%) 1672 (7%) 17254 (72%) 18926
Total 2125 21982 24,107
### Insert code here
rm(list=ls())
patient_data <- read.csv("~/Desktop/sample_patient_dataset.csv")

library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
## 
##     here
## The following object is masked from 'package:base':
## 
##     date
patient_data <- mutate(patient_data, dob_formatted = mdy(patient_data$dob), hosp_admission_form = mdy(patient_data$hosp_admission))

patient_data <- mutate(patient_data, age_at_admit=interval(dob_formatted,hosp_admission_form) / dyears(1))

patient_data.ca <- filter(patient_data, had_cardiac_arrests == 1)
patient_data.no.ca <- filter(patient_data, had_cardiac_arrests == 0)

patient_data.younger70 <- filter(patient_data, age_at_admit <= 70)
patient_data.older70 <- filter(patient_data, age_at_admit > 70)

patient_data.younger70.ca <- filter(patient_data.younger70, had_cardiac_arrests == 1)
patient_data.younger70.no.ca <- filter(patient_data.younger70, had_cardiac_arrests == 0)

patient_data.older70.ca <- filter(patient_data.older70, had_cardiac_arrests == 1)
patient_data.older70.no.ca <- filter(patient_data.older70, had_cardiac_arrests == 0)

#percentage calculations
453/24107
## [1] 0.01879122
1672/24107
## [1] 0.06935745
4728/24107
## [1] 0.1961256
17254/24107
## [1] 0.7157257

Q2: Create your own de-identified version of “patient_dataset.csv”. Upload your de-identified dataset onto Canvas and write the de-identification code below. You will need to refer to the document “Deidentification.pdf” (on Canvas, look under files -> lectures -> lecture_2).

Insert code here

### Insert code here
library(plyr)
library(dplyr)
library(lubridate)

patient_data <- read.csv("~/Desktop/patient_dataset.csv")

all.patients <- patient_data %>%
  select(patient.names) %>%
  unique()
all.patients$random_id <- sample(nrow(all.patients), replace = FALSE)

patient_data <- merge(patient_data, all.patients, by = "patient.names")

patient_data <- patient_data %>%
  select(-c(patient.names))

patient_data <- patient_data %>%
  mutate(hosp_admission_form = mdy(hosp_admission), hosp_discharge_form = mdy(hosp_discharge))

num_patients <- nrow(patient_data)
random_shift <- sample(seq(1,365), size=num_patients, replace = TRUE)

patient_data <- patient_data %>%
  mutate(hosp_admission = hosp_admission + ddays(random_shift), hosp_discharge = hosp_discharge + ddays(random_shift))
## Warning in Ops.factor(dur@.Data, num): '+' not meaningful for factors

## Warning in Ops.factor(dur@.Data, num): '+' not meaningful for factors
patient_data <- patient_data %>%
  mutate(dob_form = mdy(patient_data$dob)) %>%
  mutate(temp_interval = interval(dob_form, hosp_admission_form)) %>%
  mutate(age_at_admit = temp_interval / dyears(1))

patient_data <- patient_data %>%
  select(-c(hosp_admission, hosp_discharge, hosp_admission_form, hosp_discharge_form, dob, dob_form, temp_interval))

patient_data <- patient_data %>%
  select(-c(street_address, city, zip_code, contact_number, admitting_provider))

write.csv(patient_data, "~/Desktop/patient_dataset1.csv", row.names = FALSE, quote = TRUE)