Final Exam

You are given two CSV files (132539.txt.csv and 132540.txt.csv). Each file represents a patient who was admitted to the ICU. Each file has several columns. The first column represents the Time at which a particular physiological measurement was taken on the patient. The next two columns hold the patient’s RecordID and HospitalDeath (0 for survived, 1 for died in hospital). The remaining columns represent the following measurements:

ALP,ALT,AST,Age,Albumin,BUN,Bilirubin,Cholesterol,Creatinine,DiasABP,FiO2,GCS,Gender,Glucose,HCO3,HCT,HR,Height,ICUType,K,Lactate,MAP,MechVent,Mg,NIDiasABP,NIMAP,NISysABP,Na,PaCO2,PaO2,Platelets,RespRate,SaO2,SysABP,Temp,TroponinI,TroponinT,Urine,WBC,Weight,pH.

If a measurement was not taken at a particular time, it is marked as NA (a missing value). Note that some metrics, such as Age and Gender, are constant throughout the patient’s stay. You will also see many NA values for these metrics because they were not recorded at every time point.

Question 1

A. First design an R function that does the following: [10 points]

 Input parameter: a string representing name of the patient csv file

 The function should:

1. Read the input patient csv file.

2. Select the columns representing the following 5 patient metrics: HR, pH, Glucose, Urine, RespRate. (use the select function from the dplyr package)

3. Summarize (compute the average values of) the 5 metrics. The function should return a single data frame with 5 columns (the column names being the 5 metric names) and one row containing the averaged metrics. If a metric was not measured at all, you will get NA or NaN as the averaged value. (use summarize from the dplyr package)

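# Requires dplyr for select(), summarize(), and %>%
library(dplyr)

SummarizeMetrics <- function(file){
  # Read the patient file; "NA" entries are parsed as missing values
  patient <- read.csv(file)
  patient %>%
    # Keep only the five metrics of interest
    select(HR, pH, Glucose, Urine, RespRate) %>%
    # Average each metric, ignoring missing values; a metric that was
    # never measured yields NaN
    summarize(HR = mean(HR, na.rm = TRUE),
              pH = mean(pH, na.rm = TRUE),
              Glucose = mean(Glucose, na.rm = TRUE),
              Urine = mean(Urine, na.rm = TRUE),
              RespRate = mean(RespRate, na.rm = TRUE))
}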

B. Now apply this function to compute the average of the 5 metrics for patients 132539.txt.csv and 132540.txt.csv for the five different unique time groups as mentioned above. [10 points]

patient1<-"/Users/paulasantamaria/Desktop/Working Directory/132539.txt.csv"
patient2<-"/Users/paulasantamaria/Desktop/Working Directory/132540.txt.csv"
SummarizeMetrics(patient1)
##         HR  pH Glucose    Urine RespRate
## 1 70.81081 NaN     160 164.8649 17.42857
SummarizeMetrics(patient2)
##         HR    pH Glucose   Urine RespRate
## 1 80.79412 7.395   125.5 151.561      NaN
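
The NaN entries in the output above come from averaging a column that has no non-missing values. The following is a minimal base R sketch of that behavior; the vectors are made-up illustrations, not values from the patient files.

```r
# Toy column: a metric that was never measured (every value is missing)
never_measured <- c(NA, NA, NA)

# With na.rm = TRUE the NAs are dropped first, leaving an empty vector,
# and the mean of an empty vector is NaN
avg_empty <- mean(never_measured, na.rm = TRUE)

# Without na.rm, a single missing value makes the whole mean NA
avg_na <- mean(c(70, NA, 72))

# is.na() is TRUE for both NA and NaN, so one check covers both cases
is.na(c(avg_empty, avg_na))
```

Because is.na() treats NaN as missing, a single is.na() test is enough to find metrics that were never measured.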

C. For these two patients, find out which metrics (out of the 5 metrics) were not measured during the entire stay in the hospital. [5 points]

Note here that you don’t have to group the data since we are interested in the entire duration of the patient’s stay.

P1<-SummarizeMetrics(patient1)
P2<-SummarizeMetrics(patient2)

Not_Measured_Patient1<-names(P1)[is.na(P1)]
Not_Measured_Patient1
## [1] "pH"
Not_Measured_Patient2<-names(P2)[is.na(P2)]
Not_Measured_Patient2
## [1] "RespRate"

Question 2

A. First design an R function that does the following: [15 points]

 Input parameters: string representing the name of the patient csv file

 The function should:

o Read the input patient csv file

o Select columns representing the Time and following 5 patient’s metrics: HR, pH, Glucose, Urine, RespRate. (use select function from dplyr package)

o Replace the values in the Time column using the following criteria:

   Between 0 and 12: replace by 1
   Between 12 and 24: replace by 2
   Between 24 and 36: replace by 3
   Between 36 and 48: replace by 4

o Next, group the data by the modified Time column (use the group_by function from the dplyr package); there should now be 4 unique time groups. Then, using the grouped data, summarize (obtain the average of) the following metrics (columns): HR, pH, Glucose, Urine, RespRate (use summarize from the dplyr package; if a metric wasn’t measured at all during a particular time group, you will get NA or NaN). The function should return a data frame with 4 rows (representing the four time groups) and 5 other columns representing the averaged metrics.

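# Requires dplyr for select(), group_by(), summarise(), and %>%
library(dplyr)

SummarizeMetricsTimed <- function(file){
  patient <- read.csv(file)
  # Keep the time stamp and the five metrics of interest
  f <- select(patient, Time, HR, pH, Glucose, Urine, RespRate)
  # Recode Time into four 12-hour groups. The [0, 12] recode runs first, and
  # the later conditions (all above 12) never match an already recoded value.
  f$Time[f$Time>=0 & f$Time<=12] <- 1
  f$Time[f$Time>12 & f$Time<=24] <- 2
  f$Time[f$Time>24 & f$Time<=36] <- 3
  f$Time[f$Time>36 & f$Time<=48] <- 4
  # Average each metric within each time group, ignoring missing values
  f %>%
    group_by(Time) %>%
    summarise(Avg_HR = mean(HR, na.rm = TRUE),
              Avg_pH = mean(pH, na.rm = TRUE),
              Avg_Glucose = mean(Glucose, na.rm = TRUE),
              Avg_Urine = mean(Urine, na.rm = TRUE),
              Avg_RespRate = mean(RespRate, na.rm = TRUE))
}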
SummarizeMetricsTimed(patient1)
## # A tibble: 4 x 6
##    Time Avg_HR Avg_pH Avg_Glucose Avg_Urine Avg_RespRate
##   <dbl>  <dbl>  <dbl>       <dbl>     <dbl>        <dbl>
## 1     1   67.7    NaN         205      166.         17.3
## 2     2   65.4    NaN         NaN       80          15.2
## 3     3   78.8    NaN         115      157.         17.9
## 4     4   78.6    NaN         NaN      271.         19.6
SummarizeMetricsTimed(patient2)
## # A tibble: 4 x 6
##    Time Avg_HR Avg_pH Avg_Glucose Avg_Urine Avg_RespRate
##   <dbl>  <dbl>  <dbl>       <dbl>     <dbl>        <dbl>
## 1     1   87.4   7.40         NaN      169.          NaN
## 2     2   80.1   7.4          105      101.          NaN
## 3     3   76.4 NaN            NaN      155           NaN
## 4     4   73.2   7.38         146      188.          NaN
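
The four range assignments can also be written as a single binning step. The sketch below uses base R's cut() and assumes, as the function above does, that Time is a numeric value in hours; the time_hours vector is hypothetical and not taken from the patient files.

```r
# Hypothetical measurement times in hours (not from the patient files)
time_hours <- c(0, 5.5, 12, 12.1, 24, 30, 36.5, 48)

# Breaks at 0, 12, 24, 36, 48 define the intervals
# [0,12], (12,24], (24,36], (36,48]; include.lowest = TRUE keeps Time == 0
time_group <- as.numeric(
  cut(time_hours, breaks = c(0, 12, 24, 36, 48), include.lowest = TRUE)
)
time_group
```

One difference to keep in mind: cut() returns NA for values outside the 0 to 48 range, whereas the range assignments leave such values unchanged.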

Question 3

Below is an R function that takes a numeric vector as an argument and returns the difference between the largest and smallest item in the vector x.

A. Now use map function and the above DiffMaxMin function to get the difference between the largest and smallest item in each column of a numeric data frame df given below. Note that the purpose of map functions is to avoid using loops. [5 points]

# Function DiffMaxMin
DiffMaxMin <- function(x){
return(max(x)-min(x))
}

# Data frame given to apply function above 
df <- data.frame(a = rnorm(10),b = rnorm(10),c = rnorm(10),d = rnorm(10))
df
##             a           b          c           d
## 1   0.7127483 -0.62852760  0.8164389 -1.99202593
## 2   0.2726864  1.29739949 -0.7232404  0.04811085
## 3   0.9345936  0.38503426 -0.6914450  0.94675172
## 4  -1.4270545 -0.90896279  0.4516017  1.23184936
## 5   0.2710854  0.26247313 -2.2648521 -2.22041628
## 6   0.9439881  0.47217305  0.7979807  0.50499838
## 7   1.4083819  1.14444112 -0.4227630 -1.59247615
## 8   0.9691734 -0.34581869 -0.3883645 -0.98901219
## 9  -1.9940022 -0.06446111 -0.7956670  1.74565761
## 10  0.5199629 -1.28097264 -0.3406422  0.57369883

# Results with map functionalities (map comes from the purrr package)
library(purrr)
Results<-unlist(map(df,DiffMaxMin))
Results
##        a        b        c        d 
## 3.402384 2.578372 3.081291 3.966074
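
As a possible variation, purrr's map_dbl returns a named numeric vector directly, so the unlist step is not needed. The sketch below uses a small fixed data frame (toy, an illustrative name with made-up values) instead of random numbers so that the result is easy to check by hand.

```r
library(purrr)

# Same helper as in the question: range width of a numeric vector
DiffMaxMin <- function(x) {
  max(x) - min(x)
}

# Toy data frame with known column ranges (made-up values)
toy <- data.frame(a = c(1, 4, 2), b = c(10, 10, 10), c = c(-3, 0, 3))

# map_dbl applies the function to each column and returns a named
# numeric vector; it raises an error if any result is not a single number
ranges <- map_dbl(toy, DiffMaxMin)
ranges
```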

B. Perform Q3(A) using a for loop, without using map or lapply functionalities. [5 points]

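DiffMaxMin1 <- function(x){
  # Preallocate one result slot per column of the data frame passed in
  a <- numeric(ncol(x))
  for(i in seq_along(x)){
    # x[[i]] extracts column i as a numeric vector
    a[i] <- max(x[[i]]) - min(x[[i]])
  }
  return(a)
}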

DiffMaxMin1(df)
## [1] 3.402384 2.578372 3.081291 3.966074

Question 4

For this question, use the file covid_19.data.csv, which contains tracking data for covid-19 positive cases and deaths in the United States. This file is a simpler version of the data found at https://covidtracking.com/data. The covid_19.data.csv file contains four columns (date, state, cumulative positive cases and cumulative deaths). [25 points]

A. First, select any five states of your choice. It will be good to choose Texas and some other states that have been in the current news.

# Packages used in this question: dplyr (filter, group_by, mutate) and lubridate (ymd)
library(dplyr)
library(lubridate)

# Loading Covid-19 data
covid<-read.csv("/Users/paulasantamaria/Desktop/covid_19.data.csv",stringsAsFactors = F)

# Parsing the date column with the lubridate package
covid$date<-ymd(covid$date)

# Selecting states to graph Covid-19 cases
states<-filter(covid,state%in% c("NY","TX","NJ","FL","CA"))
head(states,5)
##         date state positive death
## 1 2020-04-27    CA    43464  1755
## 2 2020-04-27    FL    32138  1101
## 3 2020-04-27    NJ   111188  6044
## 4 2020-04-27    NY   291996 17303
## 5 2020-04-27    TX    25297   663

B. Plot a single graph (use ggplot2) that shows the daily increase in covid-19 positive cases for the states of your choice. Here, the plots for the states should be separate, but should be shown on the same graph for comparison purposes.

daily_positives <- states %>%
  group_by(state) %>%
  arrange(state, date) %>%
  mutate(positive_daily_frequency = abs(positive - lag(positive)))
head(daily_positives,5)
## # A tibble: 5 x 5
## # Groups:   state [1]
##   date       state positive death positive_daily_frequency
##   <date>     <chr>    <int> <int>                    <int>
## 1 2020-03-04 CA          53     0                       NA
## 2 2020-03-05 CA          53     0                        0
## 3 2020-03-06 CA          60     0                        7
## 4 2020-03-07 CA          69     0                        9
## 5 2020-03-08 CA          88     0                       19
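
The plot that Part B asks for does not appear above. The following is a minimal sketch of one way to draw it with ggplot2, not necessarily the graph originally submitted. It runs on a small made-up data frame (toy, with invented counts) so that it is self-contained; the column names date, state and positive mirror those of covid_19.data.csv, while positive_daily is an illustrative name.

```r
library(dplyr)
library(ggplot2)

# Toy cumulative counts for two states (made-up numbers, not real data)
toy <- data.frame(
  date = rep(as.Date("2020-04-01") + 0:3, times = 2),
  state = rep(c("TX", "NY"), each = 4),
  positive = c(10, 15, 25, 40, 100, 160, 250, 300)
)

# Daily increase = difference between consecutive cumulative counts,
# computed separately within each state
toy_daily <- toy %>%
  group_by(state) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(positive_daily = positive - lag(positive)) %>%
  ungroup()

# One line per state on shared axes; na.rm = TRUE silently drops the
# first day of each state, which has no previous value to subtract
daily_plot <- ggplot(toy_daily, aes(x = date, y = positive_daily, color = state)) +
  geom_line(na.rm = TRUE) +
  labs(x = "Date", y = "Daily new positive cases", color = "State")
daily_plot
```

Mapping state to color keeps each state as a separate line on the same axes, which matches the comparison the question asks for. The same pattern would apply to daily deaths with the death column.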

C. Plot a single graph (use ggplot2) that shows the cumulative positive cases for the states of your choice with respect to time (date on x-axis and cumulative positive cases on y-axis). Here, the plots for the states should be separate, but should be shown on the same graph for comparison purposes.
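
One possible way to draw such a cumulative graph is sketched here on made-up data. The data frame toy_cum and its counts are invented for illustration; only the column names follow covid_19.data.csv.

```r
library(ggplot2)

# Toy cumulative counts (made-up numbers, not from covid_19.data.csv)
toy_cum <- data.frame(
  date = rep(as.Date("2020-04-01") + 0:3, times = 2),
  state = rep(c("TX", "NY"), each = 4),
  positive = c(10, 15, 25, 40, 100, 160, 250, 300)
)

# Date on the x-axis, cumulative positives on the y-axis, one line per state
cumulative_plot <- ggplot(toy_cum, aes(x = date, y = positive, color = state)) +
  geom_line() +
  geom_point() +
  labs(x = "Date", y = "Cumulative positive cases", color = "State")
cumulative_plot
```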

D. Plot a single graph (use ggplot2) that shows the daily increase in covid-19 death cases for the states of your choice. Here, the plots for the states should be separate, but should be shown on the same graph for comparison purposes.

daily_deaths <- states %>%
  group_by(state) %>%
  arrange(state, date) %>%
  mutate(daily_deaths_frequency = abs(death - lag(death)))
tail(daily_deaths,5)
## # A tibble: 5 x 5
## # Groups:   state [1]
##   date       state positive death daily_deaths_frequency
##   <date>     <chr>    <int> <int>                  <int>
## 1 2020-04-23 TX       21944   561                     18
## 2 2020-04-24 TX       22806   593                     32
## 3 2020-04-25 TX       23773   623                     30
## 4 2020-04-26 TX       24631   648                     25
## 5 2020-04-27 TX       25297   663                     15

E. Plot a single graph (use ggplot2) that shows the cumulative death cases for the states of your choice with respect to time (date on x-axis and cumulative death cases on y-axis). Here, the plots for the states should be separate, but should be shown on the same graph for comparison purposes.

Question 5

You are given a text file that contains the genome sequence of the SARS-CoV-2 virus. The genome sequence is a single string that consists of 4 letters (A, T, G, or C). Using stringr package in R, perform the following tasks.[20 points]

A. What is the length of the SARS-CoV-2 virus genome?

# Reading the file (the stringr package is used throughout this question)
library(stringr)
Genome<-readLines("/Users/paulasantamaria/Desktop/sars-cov-2.txt")

# Length of the Covid-19 Genome
str_length(Genome)
## [1] 29903

B. In the sequence of the virus, how many A’s, T’s, C’s and G’s are there?

a<-str_count(Genome,"A")
a
## [1] 8954
t<-str_count(Genome,"T")
t
## [1] 9594
c<-str_count(Genome,"C")
c
## [1] 5492
g<-str_count(Genome,"G")
g
## [1] 5863
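
Since str_count is vectorized over its pattern argument, the four counts can also be obtained in one call. The sketch below uses a short made-up sequence (toy_seq) rather than the genome file; the check that the counts add up to the string length is a convenient sanity test for a sequence that contains only these four letters.

```r
library(stringr)

# Short made-up sequence (not the virus genome)
toy_seq <- "ATTAAAGG"
bases <- c("A", "T", "C", "G")

# One count per pattern, labeled with the corresponding letter
base_counts <- setNames(str_count(toy_seq, bases), bases)
base_counts

# For a sequence made only of A, T, C and G the counts sum to its length
sum(base_counts) == str_length(toy_seq)
```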

C. Come up with a regular expression that represents a string of length 4 that begins with ‘A’ and ends with ‘A’. Find out how many times this pattern occurs in the virus genome.

# Dividing the Genome into consecutive, non-overlapping strings of 4 characters
Div<-unlist(str_extract_all(Genome,"[A-Z]{4}"))
head(Div,5)
## [1] "ATTA" "AAGG" "TTTA" "TACC" "TTCC"

# Regular expression keeping the 4-letter strings that start with "A" and end with "A"
# (any two letters in between)
b<-str_subset(Div,"^A..A$")
head(b,10)
##  [1] "ATTA" "ATAA" "AAGA" "ACGA" "AAGA" "AAAA" "AACA" "AGCA" "ATAA" "AGCA"

# Number of 4-letter chunks (starting at positions 1, 5, 9, ...) that match the pattern
length(b)
## [1] 675
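
The count above only considers chunks that start at positions 1, 5, 9, and so on. If "occurs" is instead read as a match at any position, the pattern can be counted directly on the full string. The sketch below compares three readings on a short made-up sequence (toy_seq, not the genome); the lookahead form (?=A..A) counts every start position, including overlapping matches.

```r
library(stringr)

# Short made-up sequence (not the virus genome)
toy_seq <- "TAAAAATTA"

# 1. Aligned chunks only: split into 4-letter pieces, then test each piece
chunks <- unlist(str_extract_all(toy_seq, "[A-Z]{4}"))
n_aligned <- sum(str_detect(chunks, "^A..A$"))

# 2. Non-overlapping matches found while scanning left to right
n_nonoverlap <- str_count(toy_seq, "A..A")

# 3. All start positions, overlaps included: the lookahead matches an empty
#    string wherever "A..A" begins, so it never consumes characters
n_overlap <- str_count(toy_seq, "(?=A..A)")

c(aligned = n_aligned, nonoverlap = n_nonoverlap, overlap = n_overlap)
```

On this toy sequence the three readings give different counts, so it is worth stating which one an answer reports.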

D. Extract the last 50 letters in the genome of the virus.

last50<-str_sub(Genome,-50,-1)
last50
## [1] "CTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"

E. Extract the first 50 letters in the genome of the virus.

first50<-str_sub(Genome,1,50)
first50
## [1] "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTC"