This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
x_int <- c(1, 2, 3)
x_int
## [1] 1 2 3
Please submit your answers by 5:59 pm on 01/21/2019
Q1. Which of the following numbers cannot be a probability? Explain why.
Ans 1. a) -0.0001 and c) 3.415. A probability must lie between 0 and 1 inclusive, so it cannot be negative or greater than 1.
Q2. A card is drawn randomly from a deck of ordinary playing cards. The game rules are that you win if the card is a spade or an ace. What is the probability that you will win the game?
Ans 2. 4/13. There are 13 spades and 4 aces, and 1 card (the ace of spades) is both, so the probability is 13/52 + 4/52 - 1/52 = 16/52 = 4/13.
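The inclusion-exclusion count can be checked by enumerating the deck. The following is a minimal base R sketch; the names `deck`, `is_spade`, and `is_ace` are illustrative and not part of the assignment.

```r
# Build a standard 52-card deck: 13 ranks x 4 suits.
ranks <- c("A", 2:10, "J", "Q", "K")
suits <- c("spades", "hearts", "diamonds", "clubs")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

is_spade <- deck$suit == "spades"
is_ace   <- deck$rank == "A"

# Direct count: fraction of cards that are a spade OR an ace.
p_direct <- mean(is_spade | is_ace)

# Inclusion-exclusion: P(spade) + P(ace) - P(spade and ace).
p_incl_excl <- mean(is_spade) + mean(is_ace) - mean(is_spade & is_ace)

print(c(direct = p_direct, inclusion_exclusion = p_incl_excl, expected = 4 / 13))
```

Both routes count the same 16 winning cards out of 52.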
Q3. An urban hospital has an average mortality rate of 20% for admitted patients. If 17 patients are admitted on a particular day, what are:
a) the chances that exactly 7 will survive?
b) the chances that at least 15 patients will survive?
Ans 3. Treating the number of survivors as binomial with n = 17 and survival probability 0.8: a) 0.0004176 b) 0.3096
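These values can be computed with R's built-in binomial functions. This is a sketch that assumes each patient survives independently with probability 1 - 0.20 = 0.8; the variable names are illustrative.

```r
n <- 17                 # patients admitted
p_survive <- 1 - 0.20   # survival probability per patient

# a) P(exactly 7 survive)
p_exactly_7 <- dbinom(7, size = n, prob = p_survive)

# b) P(at least 15 survive) = P(15) + P(16) + P(17)
p_at_least_15 <- sum(dbinom(15:17, size = n, prob = p_survive))

# Equivalent upper-tail form: P(X > 14)
p_at_least_15_alt <- pbinom(14, size = n, prob = p_survive, lower.tail = FALSE)

print(round(p_exactly_7, 7))    # 0.0004176
print(round(p_at_least_15, 4))  # 0.3096
```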
Q4. Let F and G be two events such that P(F) is 0.4, P(G) is 0.8. F and G are independent events. Fill in the remaining elements of the table.
| Table | \(G\) | \(\bar{G}\) | Marginal |
|---|---|---|---|
| \(F\) | 0.32 | 0.08 | 0.4 |
| \(\bar{F}\) | 0.48 | 0.12 | 0.6 |
| Marginal | 0.8 | 0.2 | 1 |
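Because F and G are independent, every cell is the product of its row and column marginals. A short base R sketch of that rule follows; `p_F`, `p_G`, and `joint` are illustrative names.

```r
# Marginal distributions of F and G.
p_F <- c(F = 0.4, not_F = 0.6)
p_G <- c(G = 0.8, not_G = 0.2)

# Under independence, P(row and column) = P(row) * P(column).
joint <- outer(p_F, p_G)

# Append the marginals to reproduce the full table layout.
full <- rbind(cbind(joint, Marginal = rowSums(joint)),
              Marginal = c(colSums(joint), sum(joint)))
print(full)
```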
Q5. Let F and G be two events such that P(F) is 0.2, P(G) is 0.7. Now, the conditional probability P(G|F) is given as 0.4. Fill in the remaining elements of the table.
| Table | \(G\) | \(\bar{G}\) | Marginal |
|---|---|---|---|
| \(F\) | 0.08 | 0.12 | 0.2 |
| \(\bar{F}\) | 0.62 | 0.18 | 0.8 |
| Marginal | 0.7 | 0.3 | 1 |
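Here the events are not independent, so the joint cell comes from the multiplication rule P(F and G) = P(G|F) P(F), and the remaining cells follow by subtraction from the marginals. A minimal sketch with illustrative variable names:

```r
p_F <- 0.2
p_G <- 0.7
p_G_given_F <- 0.4

# Multiplication rule for the joint cell.
p_F_G <- p_G_given_F * p_F            # 0.08

# Remaining cells by subtracting from the marginals.
p_F_notG    <- p_F - p_F_G            # 0.12
p_notF_G    <- p_G - p_F_G            # 0.62
p_notF_notG <- (1 - p_F) - p_notF_G   # 0.18

joint <- matrix(c(p_F_G, p_F_notG, p_notF_G, p_notF_notG),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("F", "not_F"), c("G", "not_G")))
print(joint)
```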
Q6. A survey was conducted among 100 patients about smoking status. The following is the sample size split by smoking status (Yes or No) and gender (Male or Female).
| Table | Smoking (Yes) | Smoking (No) | Total |
|---|---|---|---|
| Male | 19 | 36 | 55 |
| Female | 13 | 32 | 45 |
| Total | 32 | 68 | 100 |
The probability that a randomly selected patient is a male who smokes is 0.19.
Fill in all the elements of the table
What is the probability of a randomly selected patient being a female? 45/100
What is the probability of a randomly selected patient being a smoker? 32/100
What is the probability of a randomly selected smoker being a female? 13/32
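The three probabilities can be read off the count table. The sketch below re-enters the four cell counts from the table; the object names are illustrative.

```r
# Cell counts from the survey table (rows: gender, columns: smoking status).
counts <- matrix(c(19, 36,
                   13, 32),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("Male", "Female"),
                                 smoking = c("Yes", "No")))
n <- sum(counts)  # 100 patients

p_female <- sum(counts["Female", ]) / n   # 45/100
p_smoker <- sum(counts[, "Yes"]) / n      # 32/100

# Conditional probability: restrict the denominator to smokers only.
p_female_given_smoker <- counts["Female", "Yes"] / sum(counts[, "Yes"])  # 13/32

print(c(p_female, p_smoker, p_female_given_smoker))
```

The last value differs from 13/100 because the sample space is the 32 smokers, not all 100 patients.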
Q1: Using the dataset provided (“sample_patient_dataset.csv”), the task is to build a 2x2 table for studying the association between age at admission >70 and cardiac arrests. You can either use the sample table given below or build your own. Remember to output both count and % in the table. Be sure to round the % to the nearest integer (e.g., 0.674 will be 67% and 0.675 will be 68%; see the notes in Lecture2 on summary statistics for an example). Fill in the code in the shaded areas.
| Table | Cardiac Arrests (Yes) | Cardiac Arrests (No) | Total |
|---|---|---|---|
| Age > 70 (%) | 453 (2%) | 4728 (20%) | 5181 |
| Age <= 70 (%) | 1672 (7%) | 17254 (72%) | 18926 |
| Total | 2125 | 21982 | 24107 |
### Insert code here
rm(list=ls())
patient_data <- read.csv("~/Desktop/sample_patient_dataset.csv")
library(plyr)
library(dplyr)
library(lubridate)
# Parse the date columns (month/day/year text) into Date objects.
patient_data <- mutate(patient_data, dob_formatted = mdy(dob), hosp_admission_form = mdy(hosp_admission))
# Age at admission in years.
patient_data <- mutate(patient_data, age_at_admit = interval(dob_formatted, hosp_admission_form) / dyears(1))
# Column totals: split by cardiac arrest status.
patient_data.ca <- filter(patient_data, had_cardiac_arrests == 1)
patient_data.no.ca <- filter(patient_data, had_cardiac_arrests == 0)
# Row totals: split by age group.
patient_data.younger70 <- filter(patient_data, age_at_admit <= 70)
patient_data.older70 <- filter(patient_data, age_at_admit > 70)
# Cells: age group crossed with cardiac arrest status; nrow() of each subset gives the count.
patient_data.younger70.ca <- filter(patient_data.younger70, had_cardiac_arrests == 1)
patient_data.younger70.no.ca <- filter(patient_data.younger70, had_cardiac_arrests == 0)
patient_data.older70.ca <- filter(patient_data.older70, had_cardiac_arrests == 1)
patient_data.older70.no.ca <- filter(patient_data.older70, had_cardiac_arrests == 0)
# Percentage calculations: each cell count divided by the overall total (24107)
453/24107
## [1] 0.01879122
1672/24107
## [1] 0.06935745
4728/24107
## [1] 0.1961256
17254/24107
## [1] 0.7157257
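The "count (%)" cells can also be assembled in one step rather than by dividing each count by hand. The sketch below re-enters the four cell counts reported in the table above; it does not read the CSV, and the object names are illustrative.

```r
# Cell counts taken from the 2x2 table (rows: age group, columns: cardiac arrest).
counts <- matrix(c(453, 4728,
                   1672, 17254),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Age > 70", "Age <= 70"),
                                 c("Cardiac Arrests (Yes)", "Cardiac Arrests (No)")))

# Percent of the overall total, rounded to the nearest integer.
pct <- round(100 * counts / sum(counts))

# Combine counts and percentages into "count (pct%)" strings.
cells <- matrix(paste0(counts, " (", pct, "%)"),
                nrow = 2, dimnames = dimnames(counts))

# Add row and column totals to match the table layout.
result <- rbind(cbind(cells, Total = rowSums(counts)),
                Total = c(colSums(counts), sum(counts)))
print(result, quote = FALSE)
```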
Q2: Create your own de-identified version of “patient_dataset.csv”. Upload your de-identified dataset onto Canvas and write the de-identification code below. You will need to refer to the document “Deidentification.pdf” (on Canvas, look under files -> lectures -> lecture_2).
### Insert code here
library(plyr)
library(dplyr)
library(lubridate)
patient_data <- read.csv("~/Desktop/patient_dataset.csv")
# Assign each unique patient name a random ID, then drop the name.
all.patients <- patient_data %>%
  select(patient.names) %>%
  unique()
all.patients$random_id <- sample(nrow(all.patients), replace = FALSE)
patient_data <- merge(patient_data, all.patients, by = "patient.names")
patient_data <- patient_data %>%
  select(-c(patient.names))
# Parse the admission and discharge dates (month/day/year text) into Date objects.
patient_data <- patient_data %>%
  mutate(hosp_admission_form = mdy(hosp_admission), hosp_discharge_form = mdy(hosp_discharge))
# One random shift (1 to 365 days) per row, applied to both dates so length of stay is preserved.
num_rows <- nrow(patient_data)
random_shift <- sample(seq(1, 365), size = num_rows, replace = TRUE)
# Shift the parsed Date columns; the raw hosp_admission and hosp_discharge columns are factors and cannot be added to.
patient_data <- patient_data %>%
  mutate(hosp_admission = hosp_admission_form + ddays(random_shift), hosp_discharge = hosp_discharge_form + ddays(random_shift))
# Replace date of birth with age at admission (in years), computed from the unshifted dates.
patient_data <- patient_data %>%
  mutate(dob_form = mdy(dob)) %>%
  mutate(temp_interval = interval(dob_form, hosp_admission_form)) %>%
  mutate(age_at_admit = temp_interval / dyears(1))
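# Drop all date columns and the intermediate helper columns.
patient_data <- patient_data %>%
  select(-c(hosp_admission, hosp_discharge, hosp_admission_form, hosp_discharge_form, dob, dob_form, temp_interval))
# Drop the remaining direct identifiers (address, contact details, provider).
patient_data <- patient_data %>%
  select(-c(street_address, city, zip_code, contact_number, admitting_provider))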
write.csv(patient_data, "~/Desktop/patient_dataset1.csv", row.names = FALSE, quote = TRUE)
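The code above draws one random shift per row. If shifted dates were kept in the output, one possible variation would draw a single shift per patient, so that the gaps between one patient's admissions survive the shift. The base R sketch below illustrates this on a hypothetical three-row data frame; the toy values and the names `toy`, `shift_by_id`, and `row_shift` are made up for illustration, and nothing here confirms that the assignment requires this.

```r
set.seed(1)  # only so that the sketch is reproducible

# Hypothetical toy data: patient 101 has two admissions.
toy <- data.frame(
  random_id      = c(101, 101, 102),
  hosp_admission = as.Date(c("2018-01-05", "2018-03-10", "2018-02-01")),
  hosp_discharge = as.Date(c("2018-01-09", "2018-03-12", "2018-02-04"))
)

# Draw one shift (1 to 365 days) per unique patient, then look it up for each row.
ids <- unique(toy$random_id)
shift_by_id <- setNames(sample(1:365, length(ids), replace = TRUE), ids)
row_shift <- unname(shift_by_id[as.character(toy$random_id)])

# Apply the same shift to admission and discharge.
shifted <- toy
shifted$hosp_admission <- toy$hosp_admission + row_shift
shifted$hosp_discharge <- toy$hosp_discharge + row_shift

# Length of stay per row is unchanged by the shift.
print(as.numeric(shifted$hosp_discharge - shifted$hosp_admission))
```

With a per-row shift, the interval between patient 101's two admissions would generally change; with a per-patient shift it does not.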