R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

  1. Read in the Evans data set. Type in the R expessions below. Do not copy and paste. After each expression describe what the step accomplished.
###1
edat <- read.table("evans.txt", header = T, sep = "")
str(edat)
## 'data.frame':    609 obs. of  12 variables:
##  $ id : int  21 31 51 71 74 91 111 131 141 191 ...
##  $ chd: int  0 0 1 0 0 0 1 0 0 0 ...
##  $ cat: int  0 0 1 1 0 0 0 0 0 0 ...
##  $ age: int  56 43 56 64 49 46 52 63 42 55 ...
##  $ chl: int  270 159 201 179 243 252 179 217 176 250 ...
##  $ smk: int  0 1 1 1 1 1 1 0 1 0 ...
##  $ ecg: int  0 0 1 0 0 0 1 0 0 1 ...
##  $ dbp: int  80 74 112 100 82 88 80 92 76 114 ...
##  $ sbp: int  138 128 164 200 145 142 128 135 114 182 ...
##  $ hpt: int  0 0 1 1 0 0 0 0 0 1 ...
##  $ ch : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ cc : int  0 0 201 179 0 0 0 0 0 0 ...
#Edat CHD2 Generation
xtabs(~chd, data = edat)
## chd
##   0   1 
## 538  71
edat$chd2 <- factor(edat$chd, level = 0:1, labels = c("No", "Yes"))
xtabs(~chd2, data = edat)
## chd2
##  No Yes 
## 538  71
#EDAT CAT2 Generation
xtabs(~cat, data = edat)
## cat
##   0   1 
## 487 122
edat$cat2 <- factor(edat$cat, level = 0:1, labels = c("Normal", "High"))
xtabs(~cat2, data = edat)
## cat2
## Normal   High 
##    487    122
#EDAT SMK2 Generation
xtabs(~smk, data = edat)
## smk
##   0   1 
## 222 387
edat$smk2 <- factor(edat$smk, level = 0:1, labels = c("Never", "Ever"))
xtabs(~smk2, data = edat)
## smk2
## Never  Ever 
##   222   387
#EDAT ECG2 Generation
xtabs(~ecg, data = edat)
## ecg
##   0   1 
## 443 166
edat$ecg2 <- factor(edat$ecg, level = 0:1, labels = c("Normal", "Abnormal"))
xtabs(~ecg2, data = edat)
## ecg2
##   Normal Abnormal 
##      443      166
#HPT2 Generation
xtabs(~hpt, data = edat)
## hpt
##   0   1 
## 354 255
edat$hpt2 <- factor(edat$hpt, level = 0:1, labels = c("No", "Yes"))
xtabs(~hpt2, data = edat)
## hpt2
##  No Yes 
## 354 255

This series of commands does the following:

  1. Imports the Evans Dataset
  2. Creates new variables in the data frame, CHD2, CAT2, SMK2, ECG2, and HPT2, which has proper factors levels for each one.
  3. Checks the starting and ending variables, to make sure that numbers coorespond and that labels are properly applied.
###2
quantile(edat$age)
##   0%  25%  50%  75% 100% 
##   40   46   52   60   76
edat$age4 <- cut(edat$age, quantile(edat$age, right = F, include.lowest = T))
xtabs(~age4, data = edat)
## age4
## (40,46] (46,52] (52,60] (60,76] 
##     157     148     140     152

This series of commands first cuts age into quantiles and shows us the distribution. For example, the range from 0-25 % == (40-46]. Then, it creates a new variable, age4, that uses these quantile ranges as levels in a factored variable.

##3
hptnew <- rep(NA, nrow(edat))
normal <- edat$sbp < 120 & edat$dbp < 80
hptnew[normal] <- 1

prehyp <- (edat$spb >= 120 & edat$sbp < 140) | (edat$dbp >= 80 & edat$dbp < 90)
hptnew[prehyp] <- 2

stage1 <- (edat$sbp >= 140 & edat$sbp < 160) | (edat$dbp >= 90 & edat$dbp < 100)
hptnew[stage1] <- 3

stage2 <- edat$sbp >= 160 | edat$dbp >= 100
hptnew[stage2] <- 4

edat$hpt4 <- factor(hptnew, level = 1:4, labels = c("Normal", "PreHTN", "HTN.Stage1", "HTN.Stage2"))
xtabs(~hpt4, data = edat)
## hpt4
##     Normal     PreHTN HTN.Stage1 HTN.Stage2 
##         56          0        177        211

This set of commands creates a new set of data based on a person’s systolic and diastolylic blood pressue. It defines 4 levels based on these categories and then arranges them into a new variabled called hpt4.

###4
xtabs(~hpt2 + hpt4, data = edat)
##      hpt4
## hpt2  Normal PreHTN HTN.Stage1 HTN.Stage2
##   No      56      0        133          0
##   Yes      0      0         44        211

This final set of commands checks the hpt (hypertension) value versus an alternate hypertension calculation. The table is in line with what you’d expected – normal people do not have hypertension and all people with Stage 2 have hypertension.

Exercise #2

Review the Califonria 2004 surveillance data on human West Nile virus cases available at: [url]. Read in the data setting as.is = T, taking into account missing values (using na.strings option). Convert the calendar dates into the international standard format. Using the write.table function export the data as an ASCII text file.

#Loading data, using specified parameters
wnv2004 <- read.table("wnv2004raw.txt", header = T, sep = ",", as.is = T, na.strings = "N/A")
#Testing
head(wnv2004)
##   ID         County Age Sex Syndrome Date.onset  Date.test Death
## 1  1 San Bernardino  40   F      WNF 05/19/2004 06/02/2004    No
## 2  2 San Bernardino  64   F      WNF 05/22/2004 06/16/2004    No
## 3  3 San Bernardino  19   M      WNF 05/22/2004 06/16/2004    No
## 4  4 San Bernardino  12   M      WNF 05/16/2004 06/16/2004    No
## 5  5 San Bernardino  12   M      WNF 05/14/2004 06/16/2004    No
## 6  6 San Bernardino  17   M      WNF 06/07/2004 06/17/2004    No
summary(wnv2004)
##        ID         County              Age                Sex           
##  Min.   :  1   Length:781         Length:781         Length:781        
##  1st Qu.:196   Class :character   Class :character   Class :character  
##  Median :391   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :391                                                           
##  3rd Qu.:586                                                           
##  Max.   :781                                                           
##    Syndrome          Date.onset         Date.test        
##  Length:781         Length:781         Length:781        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##     Death          
##  Length:781        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
#Converting Dates into Juline Dates for reformmatting
wnv.onset.time <- as.Date(wnv2004$Date.onset, format = "%m/%d/%Y")
wnv.test.time <- as.Date(wnv2004$Date.test, format = "%m/%d/%Y")

#Sanit check
class(wnv.onset.time)
## [1] "Date"
class(wnv.test.time)
## [1] "Date"
#Converting Time to Standard format DD/MM/YYYY
wnv.onset.julian <- julian(wnv.onset.time)
wnv.test.julian <- julian(wnv.test.time)

#Converting into Standard International Foramt YYYY/MM/DD
wnv.onset.intl <- as.Date(wnv.onset.julian, origin = as.Date('1970-01-01'))
wnv.test.intl <- as.Date(wnv.test.julian, origin = as.Date('1970-01-01'))

#Sanity Check
head(wnv.onset.intl)
## [1] "2004-05-19" "2004-05-22" "2004-05-22" "2004-05-16" "2004-05-14"
## [6] "2004-06-07"
head(wnv.test.intl)
## [1] "2004-06-02" "2004-06-16" "2004-06-16" "2004-06-16" "2004-06-16"
## [6] "2004-06-17"
#Appendding Vectors to Origianl Data Frame
wnv2004$Date.onset <- wnv.onset.intl
wnv2004$Date.test <- wnv.test.intl
head(wnv2004)
##   ID         County Age Sex Syndrome Date.onset  Date.test Death
## 1  1 San Bernardino  40   F      WNF 2004-05-19 2004-06-02    No
## 2  2 San Bernardino  64   F      WNF 2004-05-22 2004-06-16    No
## 3  3 San Bernardino  19   M      WNF 2004-05-22 2004-06-16    No
## 4  4 San Bernardino  12   M      WNF 2004-05-16 2004-06-16    No
## 5  5 San Bernardino  12   M      WNF 2004-05-14 2004-06-16    No
## 6  6 San Bernardino  17   M      WNF 2004-06-07 2004-06-17    No
#Lastly, Exporting this into a file
write.table(wnv2004, "wnv2004 dates fixed")

4.12.3

On April 19th 1940 the local health officer in the village of Lycoming, Oswego County, New York reported the occurence of an outbreak of acute gastrointestinal illness to the District Health Officer in Syracuse. Dr. AM Rubin, epidemilogist-in-training was assigned to conduct an investigation….

  1. Using RStudio plot the cases by time of onset of illness. What does the graph tell you? (Process the text data and then use the hist) function.

  2. Are there any cases for which the times of onset are inconsistent with the general experience? How might they be explained?

  3. How could the data be sorted by illness status and illness onset times?

  4. Where poissible, calculate incubation periods and illustarte their distribution with an appropriate graph. Determine the mean, median, and range of the incubation period.

oswego <- read.table("oswego.txt", header = T, sep = "", as.is = T, na.strings = "N/A")
summary(oswego)
##        id            age            sex             meal.time        
##  Min.   : 1.0   Min.   : 3.00   Length:75          Length:75         
##  1st Qu.:19.5   1st Qu.:16.50   Class :character   Class :character  
##  Median :38.0   Median :36.00   Mode  :character   Mode  :character  
##  Mean   :38.0   Mean   :36.81                                        
##  3rd Qu.:56.5   3rd Qu.:57.50                                        
##  Max.   :75.0   Max.   :77.00                                        
##      ill             onset.date         onset.time       
##  Length:75          Length:75          Length:75         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   baked.ham           spinach          mashed.potato     
##  Length:75          Length:75          Length:75         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  cabbage.salad         jello              rolls          
##  Length:75          Length:75          Length:75         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  brown.bread            milk              coffee         
##  Length:75          Length:75          Length:75         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##     water              cakes           van.ice.cream     
##  Length:75          Length:75          Length:75         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  choc.ice.cream     fruit.salad       
##  Length:75          Length:75         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
#Saity Check
class(oswego$onset.date)
## [1] "character"
#Add Year to easily work with data as Julian Values
oswego$onset.date <- paste(oswego$onset.date, "1940", sep = "/")
head(oswego)
##   id age sex meal.time ill onset.date onset.time baked.ham spinach
## 1  2  52   F   8:00 PM   Y  4/19/1940   12:30 AM         Y       Y
## 2  3  65   M   6:30 PM   Y  4/19/1940   12:30 AM         Y       Y
## 3  4  59   F   6:30 PM   Y  4/19/1940   12:30 AM         Y       Y
## 4  6  63   F   7:30 PM   Y  4/18/1940   10:30 PM         Y       Y
## 5  7  70   M   7:30 PM   Y  4/18/1940   10:30 PM         Y       Y
## 6  8  40   F   7:30 PM   Y  4/19/1940    2:00 AM         N       N
##   mashed.potato cabbage.salad jello rolls brown.bread milk coffee water
## 1             Y             N     N     Y           N    N      Y     N
## 2             Y             Y     N     N           N    N      Y     N
## 3             N             N     N     N           N    N      Y     N
## 4             N             Y     Y     N           N    N      N     Y
## 5             Y             N     Y     Y           Y    N      Y     Y
## 6             N             N     N     N           N    N      N     N
##   cakes van.ice.cream choc.ice.cream fruit.salad
## 1     N             Y              N           N
## 2     N             Y              Y           N
## 3     Y             Y              Y           N
## 4     N             Y              N           N
## 5     N             Y              N           N
## 6     N             Y              Y           N
#Variable that stores Oswego Meal Time and Day (MealTD)
oswego.MealTD <- paste("4/18/1940", oswego$meal.time, sep = "")

#Variable that stores Oswego Onset Time and Day (OnsetTD)
oswego.OnsetTD <- paste(oswego$onset.date, oswego$onset.time, sep = "")

#Next, conversting these to proper time functions using the strptime function.
oswego$MealTD <- as.POSIXct(strptime(oswego.MealTD, format = "%m/%d/%Y %I:%M %p"))
oswego$OnsetTD <- as.POSIXct(strptime(oswego.OnsetTD, format = "%m/%d/%Y %I:%M %p"))


#Creating the Plot
library(ggplot2)
a <- ggplot(data = oswego, aes(OnsetTD))
a + geom_histogram(fill = 'Blue', color = 'Grey', alpha = .8, bins =  20) + xlab("Onset Time") + ylab('People') + ggtitle("Onset Time of Infection in Oswego Data Set")
## Warning: Removed 29 rows containing non-finite values (stat_bin).

  1. The graph shows us that the onset time is clustered around Apr 19th at midnight.

  2. There are 2 odd readings – one occured very early on, before the pinic time and one significantly afterwards. The early one could be explained by prior illness and the late one could also be that, or a late onset of the disease, transmitted by somebody else.

  3. Only those that got the illness would also have illness onset times. So you could approach this in two ways. The first is to check to see if a person will have variable oswego$ill = Y. The other way would be to check to see if somebody has an onset ilness i.e. onset.time != n.a.

oswego$IncubPeriod <- (oswego$OnsetTD - oswego$MealTD)
b <- ggplot(data = oswego, aes(IncubPeriod))


b + geom_histogram(fill = 'brown', color = 'Grey', alpha = .6, binwidth = 1) + xlab('Incubation Period In Hours') + ylab('Number of People') + ggtitle("Incubation Period in Oswego Data Set")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Removed 53 rows containing non-finite values (stat_bin).

mean(oswego$IncubPeriod, na.rm = T)
## Time difference of 4.295455 hours
median(oswego$IncubPeriod, na.rm = T)
## Time difference of 4 hours
range(oswego$IncubPeriod, na.rm = T)
## Time differences in hours
## [1] 3 7

The histogram shows the incubation period to be between 3 and 7 hours, with a plurality having an incubation time of around 3 hours. The mean is 4.3 hours and a Median of 4 hours suggests a left skewed distribution.