This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
###1
edat <- read.table("evans.txt", header = T, sep = "")
str(edat)
## 'data.frame': 609 obs. of 12 variables:
## $ id : int 21 31 51 71 74 91 111 131 141 191 ...
## $ chd: int 0 0 1 0 0 0 1 0 0 0 ...
## $ cat: int 0 0 1 1 0 0 0 0 0 0 ...
## $ age: int 56 43 56 64 49 46 52 63 42 55 ...
## $ chl: int 270 159 201 179 243 252 179 217 176 250 ...
## $ smk: int 0 1 1 1 1 1 1 0 1 0 ...
## $ ecg: int 0 0 1 0 0 0 1 0 0 1 ...
## $ dbp: int 80 74 112 100 82 88 80 92 76 114 ...
## $ sbp: int 138 128 164 200 145 142 128 135 114 182 ...
## $ hpt: int 0 0 1 1 0 0 0 0 0 1 ...
## $ ch : int 0 0 1 1 0 0 0 0 0 0 ...
## $ cc : int 0 0 201 179 0 0 0 0 0 0 ...
#Edat CHD2 Generation
xtabs(~chd, data = edat)
## chd
## 0 1
## 538 71
edat$chd2 <- factor(edat$chd, level = 0:1, labels = c("No", "Yes"))
xtabs(~chd2, data = edat)
## chd2
## No Yes
## 538 71
#EDAT CAT2 Generation
xtabs(~cat, data = edat)
## cat
## 0 1
## 487 122
edat$cat2 <- factor(edat$cat, level = 0:1, labels = c("Normal", "High"))
xtabs(~cat2, data = edat)
## cat2
## Normal High
## 487 122
#EDAT SMK2 Generation
xtabs(~smk, data = edat)
## smk
## 0 1
## 222 387
edat$smk2 <- factor(edat$smk, level = 0:1, labels = c("Never", "Ever"))
xtabs(~smk2, data = edat)
## smk2
## Never Ever
## 222 387
#EDAT ECG2 Generation
xtabs(~ecg, data = edat)
## ecg
## 0 1
## 443 166
edat$ecg2 <- factor(edat$ecg, level = 0:1, labels = c("Normal", "Abnormal"))
xtabs(~ecg2, data = edat)
## ecg2
## Normal Abnormal
## 443 166
#HPT2 Generation
xtabs(~hpt, data = edat)
## hpt
## 0 1
## 354 255
edat$hpt2 <- factor(edat$hpt, level = 0:1, labels = c("No", "Yes"))
xtabs(~hpt2, data = edat)
## hpt2
## No Yes
## 354 255
###2
quantile(edat$age)
## 0% 25% 50% 75% 100%
## 40 46 52 60 76
edat$age4 <- cut(edat$age, quantile(edat$age, right = F, include.lowest = T))
xtabs(~age4, data = edat)
## age4
## (40,46] (46,52] (52,60] (60,76]
## 157 148 140 152
This series of commands first cuts age into quantiles and shows us the distribution. For example, the range from 0-25 % == (40-46]. Then, it creates a new variable, age4, that uses these quantile ranges as levels in a factored variable.
##3
hptnew <- rep(NA, nrow(edat))
normal <- edat$sbp < 120 & edat$dbp < 80
hptnew[normal] <- 1
prehyp <- (edat$spb >= 120 & edat$sbp < 140) | (edat$dbp >= 80 & edat$dbp < 90)
hptnew[prehyp] <- 2
stage1 <- (edat$sbp >= 140 & edat$sbp < 160) | (edat$dbp >= 90 & edat$dbp < 100)
hptnew[stage1] <- 3
stage2 <- edat$sbp >= 160 | edat$dbp >= 100
hptnew[stage2] <- 4
edat$hpt4 <- factor(hptnew, level = 1:4, labels = c("Normal", "PreHTN", "HTN.Stage1", "HTN.Stage2"))
xtabs(~hpt4, data = edat)
## hpt4
## Normal PreHTN HTN.Stage1 HTN.Stage2
## 56 0 177 211
This set of commands creates a new set of data based on a person’s systolic and diastolylic blood pressue. It defines 4 levels based on these categories and then arranges them into a new variabled called hpt4.
###4
xtabs(~hpt2 + hpt4, data = edat)
## hpt4
## hpt2 Normal PreHTN HTN.Stage1 HTN.Stage2
## No 56 0 133 0
## Yes 0 0 44 211
This final set of commands checks the hpt (hypertension) value versus an alternate hypertension calculation. The table is in line with what you’d expected – normal people do not have hypertension and all people with Stage 2 have hypertension.
Exercise #2
Review the Califonria 2004 surveillance data on human West Nile virus cases available at: [url]. Read in the data setting as.is = T, taking into account missing values (using na.strings option). Convert the calendar dates into the international standard format. Using the write.table function export the data as an ASCII text file.
#Loading data, using specified parameters
wnv2004 <- read.table("wnv2004raw.txt", header = T, sep = ",", as.is = T, na.strings = "N/A")
#Testing
head(wnv2004)
## ID County Age Sex Syndrome Date.onset Date.test Death
## 1 1 San Bernardino 40 F WNF 05/19/2004 06/02/2004 No
## 2 2 San Bernardino 64 F WNF 05/22/2004 06/16/2004 No
## 3 3 San Bernardino 19 M WNF 05/22/2004 06/16/2004 No
## 4 4 San Bernardino 12 M WNF 05/16/2004 06/16/2004 No
## 5 5 San Bernardino 12 M WNF 05/14/2004 06/16/2004 No
## 6 6 San Bernardino 17 M WNF 06/07/2004 06/17/2004 No
summary(wnv2004)
## ID County Age Sex
## Min. : 1 Length:781 Length:781 Length:781
## 1st Qu.:196 Class :character Class :character Class :character
## Median :391 Mode :character Mode :character Mode :character
## Mean :391
## 3rd Qu.:586
## Max. :781
## Syndrome Date.onset Date.test
## Length:781 Length:781 Length:781
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Death
## Length:781
## Class :character
## Mode :character
##
##
##
#Converting Dates into Juline Dates for reformmatting
wnv.onset.time <- as.Date(wnv2004$Date.onset, format = "%m/%d/%Y")
wnv.test.time <- as.Date(wnv2004$Date.test, format = "%m/%d/%Y")
#Sanit check
class(wnv.onset.time)
## [1] "Date"
class(wnv.test.time)
## [1] "Date"
#Converting Time to Standard format DD/MM/YYYY
wnv.onset.julian <- julian(wnv.onset.time)
wnv.test.julian <- julian(wnv.test.time)
#Converting into Standard International Foramt YYYY/MM/DD
wnv.onset.intl <- as.Date(wnv.onset.julian, origin = as.Date('1970-01-01'))
wnv.test.intl <- as.Date(wnv.test.julian, origin = as.Date('1970-01-01'))
#Sanity Check
head(wnv.onset.intl)
## [1] "2004-05-19" "2004-05-22" "2004-05-22" "2004-05-16" "2004-05-14"
## [6] "2004-06-07"
head(wnv.test.intl)
## [1] "2004-06-02" "2004-06-16" "2004-06-16" "2004-06-16" "2004-06-16"
## [6] "2004-06-17"
#Appendding Vectors to Origianl Data Frame
wnv2004$Date.onset <- wnv.onset.intl
wnv2004$Date.test <- wnv.test.intl
head(wnv2004)
## ID County Age Sex Syndrome Date.onset Date.test Death
## 1 1 San Bernardino 40 F WNF 2004-05-19 2004-06-02 No
## 2 2 San Bernardino 64 F WNF 2004-05-22 2004-06-16 No
## 3 3 San Bernardino 19 M WNF 2004-05-22 2004-06-16 No
## 4 4 San Bernardino 12 M WNF 2004-05-16 2004-06-16 No
## 5 5 San Bernardino 12 M WNF 2004-05-14 2004-06-16 No
## 6 6 San Bernardino 17 M WNF 2004-06-07 2004-06-17 No
#Lastly, Exporting this into a file
write.table(wnv2004, "wnv2004 dates fixed")
4.12.3
On April 19th 1940 the local health officer in the village of Lycoming, Oswego County, New York reported the occurence of an outbreak of acute gastrointestinal illness to the District Health Officer in Syracuse. Dr. AM Rubin, epidemilogist-in-training was assigned to conduct an investigation….
Using RStudio plot the cases by time of onset of illness. What does the graph tell you? (Process the text data and then use the hist) function.
Are there any cases for which the times of onset are inconsistent with the general experience? How might they be explained?
How could the data be sorted by illness status and illness onset times?
Where poissible, calculate incubation periods and illustarte their distribution with an appropriate graph. Determine the mean, median, and range of the incubation period.
oswego <- read.table("oswego.txt", header = T, sep = "", as.is = T, na.strings = "N/A")
summary(oswego)
## id age sex meal.time
## Min. : 1.0 Min. : 3.00 Length:75 Length:75
## 1st Qu.:19.5 1st Qu.:16.50 Class :character Class :character
## Median :38.0 Median :36.00 Mode :character Mode :character
## Mean :38.0 Mean :36.81
## 3rd Qu.:56.5 3rd Qu.:57.50
## Max. :75.0 Max. :77.00
## ill onset.date onset.time
## Length:75 Length:75 Length:75
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## baked.ham spinach mashed.potato
## Length:75 Length:75 Length:75
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## cabbage.salad jello rolls
## Length:75 Length:75 Length:75
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## brown.bread milk coffee
## Length:75 Length:75 Length:75
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## water cakes van.ice.cream
## Length:75 Length:75 Length:75
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## choc.ice.cream fruit.salad
## Length:75 Length:75
## Class :character Class :character
## Mode :character Mode :character
##
##
##
#Saity Check
class(oswego$onset.date)
## [1] "character"
#Add Year to easily work with data as Julian Values
oswego$onset.date <- paste(oswego$onset.date, "1940", sep = "/")
head(oswego)
## id age sex meal.time ill onset.date onset.time baked.ham spinach
## 1 2 52 F 8:00 PM Y 4/19/1940 12:30 AM Y Y
## 2 3 65 M 6:30 PM Y 4/19/1940 12:30 AM Y Y
## 3 4 59 F 6:30 PM Y 4/19/1940 12:30 AM Y Y
## 4 6 63 F 7:30 PM Y 4/18/1940 10:30 PM Y Y
## 5 7 70 M 7:30 PM Y 4/18/1940 10:30 PM Y Y
## 6 8 40 F 7:30 PM Y 4/19/1940 2:00 AM N N
## mashed.potato cabbage.salad jello rolls brown.bread milk coffee water
## 1 Y N N Y N N Y N
## 2 Y Y N N N N Y N
## 3 N N N N N N Y N
## 4 N Y Y N N N N Y
## 5 Y N Y Y Y N Y Y
## 6 N N N N N N N N
## cakes van.ice.cream choc.ice.cream fruit.salad
## 1 N Y N N
## 2 N Y Y N
## 3 Y Y Y N
## 4 N Y N N
## 5 N Y N N
## 6 N Y Y N
#Variable that stores Oswego Meal Time and Day (MealTD)
oswego.MealTD <- paste("4/18/1940", oswego$meal.time, sep = "")
#Variable that stores Oswego Onset Time and Day (OnsetTD)
oswego.OnsetTD <- paste(oswego$onset.date, oswego$onset.time, sep = "")
#Next, conversting these to proper time functions using the strptime function.
oswego$MealTD <- as.POSIXct(strptime(oswego.MealTD, format = "%m/%d/%Y %I:%M %p"))
oswego$OnsetTD <- as.POSIXct(strptime(oswego.OnsetTD, format = "%m/%d/%Y %I:%M %p"))
#Creating the Plot
library(ggplot2)
a <- ggplot(data = oswego, aes(OnsetTD))
a + geom_histogram(fill = 'Blue', color = 'Grey', alpha = .8, bins = 20) + xlab("Onset Time") + ylab('People') + ggtitle("Onset Time of Infection in Oswego Data Set")
## Warning: Removed 29 rows containing non-finite values (stat_bin).
The graph shows us that the onset time is clustered around Apr 19th at midnight.
There are 2 odd readings – one occured very early on, before the pinic time and one significantly afterwards. The early one could be explained by prior illness and the late one could also be that, or a late onset of the disease, transmitted by somebody else.
Only those that got the illness would also have illness onset times. So you could approach this in two ways. The first is to check to see if a person will have variable oswego$ill = Y. The other way would be to check to see if somebody has an onset ilness i.e. onset.time != n.a.
oswego$IncubPeriod <- (oswego$OnsetTD - oswego$MealTD)
b <- ggplot(data = oswego, aes(IncubPeriod))
b + geom_histogram(fill = 'brown', color = 'Grey', alpha = .6, binwidth = 1) + xlab('Incubation Period In Hours') + ylab('Number of People') + ggtitle("Incubation Period in Oswego Data Set")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Removed 53 rows containing non-finite values (stat_bin).
mean(oswego$IncubPeriod, na.rm = T)
## Time difference of 4.295455 hours
median(oswego$IncubPeriod, na.rm = T)
## Time difference of 4 hours
range(oswego$IncubPeriod, na.rm = T)
## Time differences in hours
## [1] 3 7
The histogram shows the incubation period to be between 3 and 7 hours, with a plurality having an incubation time of around 3 hours. The mean is 4.3 hours and a Median of 4 hours suggests a left skewed distribution.