1. Using the source function

Create an R script file named problem1.R and save it in your project1 folder. Type print("Hello World") in it, then source the file.

Printing “Hello World”

print("Hello World")
## [1] "Hello World"

Sourcing the file

source("/Users/dr.auwal/Desktop/Personals/UC Berkeley/Classes/Fall, 2019/PB HLTH 251D Applied Epidemiology Using R/Project/Project 1/problem1.R")
## [1] "Hello World"
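The absolute path above only resolves on one machine. As a minimal sketch, assuming a throwaway script written to a temporary file (script_path and msg are hypothetical names, standing in for problem1.R), the same source call can be tried anywhere:

```r
# Write a one-line script to a temporary file (hypothetical stand-in for problem1.R)
script_path <- tempfile(fileext = ".R")
writeLines('msg <- "Hello World"; print(msg)', script_path)

# source() evaluates the file; echo = TRUE also prints each expression as it runs
source(script_path, echo = TRUE)

# Objects created by the script are now available to the caller
exists("msg")
```

If the working directory is set to the project folder, a relative call such as source("problem1.R") avoids the long path altogether.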

2. Read an ASCII data set

The Evans data set (evans.txt) is here: https://github.com/taragonmd/data. Alternatively, here is the raw Evans data set: https://raw.githubusercontent.com/taragonmd/data/master/evans.txt. Demonstrate reading the Evans data file (evans.txt) to create a data frame, and use the str function to explore the structure of the data set. The data dictionary is in Appendix C of the PHDS book. Show the R code chunk and results below.

Reading the Evans data file

evans <- read.table("https://raw.githubusercontent.com/taragonmd/data/master/evans.txt", header = TRUE, sep = "")

Creating a data frame

# read.table() already returns a data frame; data.frame() copies it under a new name
data.frame.evans <- data.frame(evans)

Using the str function

str(data.frame.evans)
## 'data.frame':    609 obs. of  12 variables:
##  $ id : int  21 31 51 71 74 91 111 131 141 191 ...
##  $ chd: int  0 0 1 0 0 0 1 0 0 0 ...
##  $ cat: int  0 0 1 1 0 0 0 0 0 0 ...
##  $ age: int  56 43 56 64 49 46 52 63 42 55 ...
##  $ chl: int  270 159 201 179 243 252 179 217 176 250 ...
##  $ smk: int  0 1 1 1 1 1 1 0 1 0 ...
##  $ ecg: int  0 0 1 0 0 0 1 0 0 1 ...
##  $ dbp: int  80 74 112 100 82 88 80 92 76 114 ...
##  $ sbp: int  138 128 164 200 145 142 128 135 114 182 ...
##  $ hpt: int  0 0 1 1 0 0 0 0 0 1 ...
##  $ ch : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ cc : int  0 0 201 179 0 0 0 0 0 0 ...
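To see what header = TRUE and sep = "" do without a network connection, here is a small sketch that feeds read.table three rows and four columns copied from the str output above through its text argument. The names evans_excerpt and mini are hypothetical.

```r
# Three rows and four columns taken from the str() output above, as inline text
evans_excerpt <- "id chd cat age
21 0 0 56
31 0 0 43
51 1 1 56"

# sep = "" (the default) treats any run of whitespace as the delimiter
mini <- read.table(text = evans_excerpt, header = TRUE, sep = "")

class(mini)   # read.table() already returns a data frame
str(mini)
```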

3. Discretizing a continuous variable into a categorical variable

Use the cut function to discretize age into the following age categories and make a table of counts and a table of proportions.

30-39, 40-49, 50-59, 60-69, >70

Be sure to pay attention to age interval transitions.

Making a table of counts

evans$agecat <- cut(evans$age, breaks = c(30, 40, 50, 60, 70, 100), right = FALSE)
evans.table <- table(evans$agecat)
evans.table
## 
##  [30,40)  [40,50)  [50,60)  [60,70) [70,100) 
##        0      247      203      115       44

Making a table of proportions

sweep(evans.table, 1, sum(evans.table), "/")
## 
##    [30,40)    [40,50)    [50,60)    [60,70)   [70,100) 
## 0.00000000 0.40558292 0.33333333 0.18883415 0.07224959
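The interval transitions depend on right = FALSE, which makes each interval closed on the left so that age 40 lands in [40,50) rather than in the interval below it. The sketch below checks this on a few hypothetical ages placed on the boundaries. It also uses an open-ended top break, readable labels, and prop.table in place of sweep; these are optional choices, not part of the answer above.

```r
# Hypothetical ages chosen to sit on the interval boundaries
ages <- c(39, 40, 49, 50, 69, 70, 76)

# right = FALSE gives left-closed intervals [30,40), [40,50), ...
# labels are optional and only change how the categories print
agecat <- cut(ages,
              breaks = c(30, 40, 50, 60, 70, Inf),
              right  = FALSE,
              labels = c("30-39", "40-49", "50-59", "60-69", "70+"))

counts <- table(agecat)
counts

# prop.table() divides each count by the total, like the sweep() call above
prop.table(counts)
```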

4. Working with dates and times

President Donald Trump was elected on “November 8, 2016”. Convert this character string into an R date object. Show how to use R to display (a) the Julian date; (b) the day of the week; and (c) the week of the year.

(a) the Julian date

trump <- as.Date("November 8, 2016", format = "%B %d, %Y")
trump
## [1] "2016-11-08"
julian(trump)
## [1] 17113
## attr(,"origin")
## [1] "1970-01-01"

(b) the day of the week

weekdays(trump)
## [1] "Tuesday"

(c) the week of the year

format(trump, format='%U')
## [1] "45"
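In R, julian returns the number of days since an origin (1970-01-01 by default), which is different from the day of the year. Likewise, %U counts weeks that start on Sunday, while %V gives the ISO 8601 week number. A short sketch of these related format codes, with election as a hypothetical variable name:

```r
election <- as.Date("2016-11-08")   # ISO format needs no format string

as.numeric(julian(election))        # days since 1970-01-01
format(election, "%j")              # day of the year, 001-366
format(election, "%U")              # week of year, weeks start on Sunday
format(election, "%V")              # ISO 8601 week number, weeks start on Monday
```

Parsing month names with %B depends on the session locale, so the ISO form used here is the safer input when the string is under your control.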

5. Simple two-way analysis

Create a simple 2x2 table of smoking (smk) and coronary heart disease (chd). Use the fisher.test on this 2x2 table and describe your findings.

Creating a simple 2x2 table of smoking (smk) and coronary heart disease (chd)

evans2by2 <- xtabs(~ smk + chd, data = evans)
evans2by2
##    chd
## smk   0   1
##   0 205  17
##   1 333  54

Using the fisher.test on the 2x2 table

fisher.test(evans2by2)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  evans2by2
## p-value = 0.02512
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.079813 3.697097
## sample estimates:
## odds ratio 
##   1.953491

Describing findings:

Smokers (smk = 1) had higher odds of coronary heart disease (chd) than non-smokers: the estimated odds ratio is about 1.95, with a 95% confidence interval of 1.08 to 3.70. The interval excludes 1 and the p-value (0.025) is below 0.05, so we reject the null hypothesis that the odds ratio equals 1 and conclude that smoking is associated with chd in these data.
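The odds ratio printed by fisher.test is a conditional maximum likelihood estimate, so it differs slightly from the familiar cross-product odds ratio. As a sketch, with the counts copied from the 2x2 table above into a hypothetical matrix tab, the two can be compared directly:

```r
# Counts copied from the 2x2 table above (rows: smk, columns: chd)
tab <- matrix(c(205, 333, 17, 54), nrow = 2,
              dimnames = list(smk = c("0", "1"), chd = c("0", "1")))

# Cross-product (sample) odds ratio:
# (exposed cases * unexposed non-cases) / (unexposed cases * exposed non-cases)
or_sample <- (tab["1", "1"] * tab["0", "0"]) / (tab["0", "1"] * tab["1", "0"])
or_sample

# fisher.test() reports the conditional maximum likelihood estimate instead
or_conditional <- unname(fisher.test(tab)$estimate)
or_conditional
```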

6. Write your own function

Now, write a function to calculate the risk ratio of your 2x2 table above. The exposure is smoking status and the outcome is coronary heart disease.

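# Risk ratio from the four cells of a 2x2 table:
# a = exposed cases, b = unexposed cases,
# c = exposed non-cases, d = unexposed non-cases
epitable <- function(a, b, c, d) {
  N1 <- a + c    # total exposed
  N0 <- b + d    # total unexposed
  R1 <- a / N1   # risk among the exposed
  R0 <- b / N0   # risk among the unexposed
  R1 / R0        # risk ratio
}
epitable(54, 17, 333, 205)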
## [1] 1.822161
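Typing the four cells by hand invites transposition errors. One alternative, sketched here under the assumption that the table is laid out like xtabs(~ smk + chd) with the unexposed row and the non-case column first, is a helper that takes the table itself. The names risk_ratio and smk_chd are hypothetical.

```r
# Hypothetical helper: risk ratio from a 2x2 table laid out like
# xtabs(~ smk + chd): rows = exposure (0, 1), columns = outcome (0, 1)
risk_ratio <- function(tab) {
  stopifnot(all(dim(tab) == c(2, 2)))
  risks <- tab[, 2] / rowSums(tab)   # risk of the outcome in each exposure row
  unname(risks[2] / risks[1])        # exposed risk over unexposed risk
}

# Counts copied from the smk-by-chd table in problem 5
smk_chd <- matrix(c(205, 333, 17, 54), nrow = 2,
                  dimnames = list(smk = c("0", "1"), chd = c("0", "1")))
risk_ratio(smk_chd)
```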

7.

Now, use the xtabs function to create a 3-D array object of chd, hpt, and smk. Now use the addmargins function on this object.

Using the xtabs function to create a 3-D array object of chd, hpt, and smk

evans3darray <- xtabs(~ chd + hpt + smk, data = evans)
evans3darray
## , , smk = 0
## 
##    hpt
## chd   0   1
##   0 122  83
##   1   6  11
## 
## , , smk = 1
## 
##    hpt
## chd   0   1
##   0 204 129
##   1  22  32

Using the addmargins function on the object

addmargins(evans3darray)
## , , smk = 0
## 
##      hpt
## chd     0   1 Sum
##   0   122  83 205
##   1     6  11  17
##   Sum 128  94 222
## 
## , , smk = 1
## 
##      hpt
## chd     0   1 Sum
##   0   204 129 333
##   1    22  32  54
##   Sum 226 161 387
## 
## , , smk = Sum
## 
##      hpt
## chd     0   1 Sum
##   0   326 212 538
##   1    28  43  71
##   Sum 354 255 609
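A 3-D table can also be sliced by stratum or collapsed over one of its dimensions. The sketch below rebuilds the array from the counts printed above (arr is a hypothetical name), then extracts the smoker stratum and collapses over hpt to recover the smk-by-chd counts.

```r
# Rebuild the chd x hpt x smk array from the counts printed above
arr <- array(c(122, 6, 83, 11,     # smk = 0
               204, 22, 129, 32),  # smk = 1
             dim = c(2, 2, 2),
             dimnames = list(chd = c("0", "1"),
                             hpt = c("0", "1"),
                             smk = c("0", "1")))

arr[, , "1"]                   # the chd-by-hpt table among smokers
margin.table(arr, c(3, 1))     # collapse over hpt: smk-by-chd counts
```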

8. Create a PNG graph and save file

From the Evans data create a histogram of age (age). Label with a title and axis labels. Output to a PNG file using the png function. Hint is provided.

png(filename = "myplot.png")
hist(evans$age, breaks = 16, col = "brown", xlab = "Age", main = "Histogram of Age of Participants")
dev.off()
## quartz_off_screen 
##                 2

9. Display PNG file in your Rmarkdown document

Using Rmarkdown syntax, display the PNG file you created above. Hint: use the include_graphics function from the knitr package.

library(knitr)
include_graphics('myplot.png')

10. Using regular expressions

Here are the California counties: https://raw.githubusercontent.com/taragonmd/data/master/calcounty.txt Read in data using the scan function. Hint provided below. Remove the “California” entry. Use regular expressions to identify and display the County names that start with "San " and end with "o".

Reading in data using the scan function

cac <- scan("https://raw.githubusercontent.com/taragonmd/data/master/calcounty.txt", what = "")

Removing the “California” entry

cac <- cac[cac != "California"]

Using regular expressions to identify and display the County names that start with "San " and end with "o".

grep("^San .*o$", cac, value = TRUE)
## [1] "San Benito"      "San Bernardino"  "San Diego"       "San Francisco"  
## [5] "San Luis Obispo" "San Mateo"
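The prompt asks for names that start with "San ", including the trailing space. For the county list, a pattern with or without that space returns the same six names, but the two are not equivalent in general. A minimal sketch on hypothetical place names (places is not part of the assignment data):

```r
# Hypothetical mix of names, including one that starts with "San" but not "San "
places <- c("San Diego", "San Mateo", "San Joaquin", "Santo Domingo", "Sacramento")

# ^San.+o$ only requires the letters "San" at the start
loose <- grep("^San.+o$", places, value = TRUE)
loose

# ^San .*o$ also requires the space, so "Santo Domingo" drops out
strict <- grep("^San .*o$", places, value = TRUE)
strict
```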