Assignment I

Instructions

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

x_int <- c(1, 2, 3)
x_int

## [1] 1 2 3

Please submit your answers by 5:59 pm on 01/21/2019

Section 1: Probability : Total points 50

Q1. Which of the following numbers cannot be probability? Explain why.

-0.0001
0.05
3.415
20%
1

Ans 1. a) -0.0001 and c) 3.415. Probability is between 0 and 1, cannot be negative or over 1.

Q2. A card is drawn randomly from a deck of ordinary playing cards. The game rules are that you win if the card is a spade or an ace. What is the probability that you will win the game?

Ans 2. 4/13 (13 spades, 4 aces, 1 ace that is a spade so 13/52 + 4/52 - 1/52)

Q3. An urban hospital has a 20% mortality rate on average for admitted patients. If on a particular day, 17 patients got admitted, what are:

the chances that exactly 7 will survive?
the chances that at least 15 patients will survive?

Ans 3. a) 0.0004176 b) 0.3086

Q4. Let F and G be two events such that P(F) is 0.4, P(G) is 0.8. F and G are independent events. Fill in the remaining elements of the table.

Table	\(G\)	\(\bar{G}\)	Marginal
\(F\)	0.32	0.08	0.4
\(\bar{F}\)	0.48	0.12	0.6
Marginal	0.8	0.2	1

Q5. Let F and G be two events such that P(F) is 0.2, P(G) is 0.7. Now, the conditional probability P(G|F) is given as 0.4. Fill in the remaining elements of the table.

Table	\(G\)	\(\bar{G}\)	Marginal
\(F\)	0.8	0.12	0.2
\(\bar{F}\)	0.62	0.18	0.8
Marginal	0.7	0.3	1

Q6. A survey was conducted among 100 patients about smoking status. The following is the sample size split by smoking status (Yes or No) and gender (Male or Female).

Table	Smoking (Yes)	Smoking(No)	Total
Male	19	36	55
Female	13	32	45
Total	32	68	100

The probability that a randomly selected patient is a male who smokes is 0.19.

Fill in all the elements of the table
What is the probability of a randomly selected patient being a female? 45/100
What is the probability of a randomly selected patient being a smoker? 32/100
What is the probability of a randomly selected smoker being a female? 13/32

Section 2: Data Analysis using R: Total points 25

Q1 : Using the dataset provided (“sample_patient_dataset.csv”), the task to build a 2x2 table for the studying the association between age at admission >70 and cardiac arrests. You can either use the sample table given below or build your own. Rememer to output both count and % in the table. Be sure to round the % to the nearest integer (e.g, 0.674 will be 67% and 0.675 will be 68%, see notes in Lecture2 on summary statistics as example). Fill in the code in the shaded areas.

Table	Cardiac Arrests (Yes)	Cardiac Arrests (No)	Total
Age > 70 (%)	453 (2%)	4728 (20%)	5181
Age <= 70 (%)	1672 (7%)	17254 (72%)	18926
Total	2125	21982	24,107

### Insert code here
rm(list=ls())
patient_data <- read.csv("~/Desktop/sample_patient_dataset.csv")

library(plyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:plyr':
## 
##     here

## The following object is masked from 'package:base':
## 
##     date

patient_data <- mutate(patient_data, dob_formatted = mdy(patient_data$dob), hosp_admission_form = mdy(patient_data$hosp_admission))

patient_data <- mutate(patient_data, age_at_admit=interval(dob_formatted,hosp_admission_form) / dyears(1))

patient_data.ca <- filter(patient_data, had_cardiac_arrests == 1)
patient_data.no.ca <- filter(patient_data, had_cardiac_arrests == 0)

patient_data.younger70 <- filter(patient_data, age_at_admit <= 70)
patient_data.older70 <- filter(patient_data, age_at_admit > 70)

patient_data.younger70.ca <- filter(patient_data.younger70, had_cardiac_arrests == 1)
patient_data.younger70.no.ca <- filter(patient_data.younger70, had_cardiac_arrests == 0)

patient_data.older70.ca <- filter(patient_data.older70, had_cardiac_arrests == 1)
patient_data.older70.no.ca <- filter(patient_data.older70, had_cardiac_arrests == 0)

#percentage calculations
453/24107

## [1] 0.01879122

1672/24107

## [1] 0.06935745

4728/24107

## [1] 0.1961256

17254/24107

## [1] 0.7157257

Q2: Create your own de-identified version of “patient_dataset.csv”. Upload your de-identified dataset onto Canvas and write the de-identification code below. You will need to refer to the document “Deidentification.pdf” (on Canvas, look under files -> lectures -> lecture_2).

Insert code here

### Insert code here
library(plyr)
library(dplyr)
library(lubridate)

patient_data <- read.csv("~/Desktop/patient_dataset.csv")

all.patients <- patient_data %>%
  select(patient.names) %>%
  unique()
all.patients$random_id <- sample(nrow(all.patients), replace = FALSE)

patient_data <- merge(patient_data, all.patients, by = "patient.names")

patient_data <- patient_data %>%
  select(-c(patient.names))

patient_data <- patient_data %>%
  mutate(hosp_admission_form = mdy(hosp_admission), hosp_discharge_form = mdy(hosp_discharge))

num_patients <- nrow(patient_data)
random_shift <- sample(seq(1,365), size=num_patients, replace = TRUE)

patient_data <- patient_data %>%
  mutate(hosp_admission = hosp_admission + ddays(random_shift), hosp_discharge = hosp_discharge + ddays(random_shift))

## Warning in Ops.factor(dur@.Data, num): '+' not meaningful for factors

## Warning in Ops.factor(dur@.Data, num): '+' not meaningful for factors

patient_data <- patient_data %>%
  mutate(dob_form = mdy(patient_data$dob)) %>%
  mutate(temp_interval = interval(dob_form, hosp_admission_form)) %>%
  mutate(age_at_admit = temp_interval / dyears(1))

patient_data <- patient_data %>%
  select(-c(hosp_admission, hosp_discharge, hosp_admission_form, hosp_discharge_form, dob, dob_form, temp_interval))

patient_data <- patient_data %>%
  select(-c(street_address, city, zip_code, contact_number, admitting_provider))

write.csv(patient_data, "~/Desktop/patient_dataset1.csv", row.names = FALSE, quote = TRUE)

Assignment I

January 15, 2019

Instructions

Section 1: Probability : Total points 50

Section 2: Data Analysis using R: Total points 25

Insert code here