Use this template and R Markdown to demonstrate the following skills:

1. Using the source function

Create an R script file named problem1.R and save it in your project1 folder. Type print("Hello World") in the file, then source it.

source("problem1.R")
## [1] "Hello World"
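
The same step can be sketched without relying on the project folder. This is only an illustration: the temporary path from tempfile() is an assumption standing in for problem1.R, and capture.output() is used just so the printed line can be inspected afterwards.

```r
# Write a one-line script to a temporary file; this path is hypothetical
# and stands in for problem1.R in the project folder.
script_path <- tempfile(fileext = ".R")
writeLines('print("Hello World")', script_path)

# source() evaluates the expressions in the file in the current session.
# capture.output() collects whatever the script prints.
printed <- capture.output(source(script_path))
printed
```

Because the script calls print() explicitly, the text appears even though source() does not echo or auto-print expressions by default.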

2. Read an ASCII data set

The Evans data set (evans.txt) is here: https://github.com/taragonmd/data.

Alternatively, here is the raw Evans data set: https://raw.githubusercontent.com/taragonmd/data/master/evans.txt.

Demonstrate reading the Evans data file (evans.txt) to create a data frame, and use the str function to explore the structure of the data set. The data dictionary is in Appendix C of the PHDS book.

Show the R code chunk and results below.

evans <- read.table("https://raw.githubusercontent.com/taragonmd/data/master/evans.txt", header = TRUE)
str(evans)
## 'data.frame':    609 obs. of  12 variables:
##  $ id : int  21 31 51 71 74 91 111 131 141 191 ...
##  $ chd: int  0 0 1 0 0 0 1 0 0 0 ...
##  $ cat: int  0 0 1 1 0 0 0 0 0 0 ...
##  $ age: int  56 43 56 64 49 46 52 63 42 55 ...
##  $ chl: int  270 159 201 179 243 252 179 217 176 250 ...
##  $ smk: int  0 1 1 1 1 1 1 0 1 0 ...
##  $ ecg: int  0 0 1 0 0 0 1 0 0 1 ...
##  $ dbp: int  80 74 112 100 82 88 80 92 76 114 ...
##  $ sbp: int  138 128 164 200 145 142 128 135 114 182 ...
##  $ hpt: int  0 0 1 1 0 0 0 0 0 1 ...
##  $ ch : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ cc : int  0 0 201 179 0 0 0 0 0 0 ...

3. Discretizing a continuous variable into a categorical variable

Use the cut function to discretize age into the following age categories and make a table of counts and a table of proportions.

Be sure to pay attention to age interval transitions.

evans$age_cut<-cut(evans$age, breaks=c(30,40,50,60,70,100),right=FALSE)
table_evans<-table(evans$age_cut); table_evans
## 
##  [30,40)  [40,50)  [50,60)  [60,70) [70,100) 
##        0      247      203      115       44
prop.table(table_evans)
## 
##    [30,40)    [40,50)    [50,60)    [60,70)   [70,100) 
## 0.00000000 0.40558292 0.33333333 0.18883415 0.07224959
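
To see what "interval transitions" means in practice, the sketch below applies cut to a few made-up ages that sit exactly on the breakpoints. The ages vector is hypothetical and is not part of the Evans data; only the breaks match the ones used above.

```r
# Hypothetical ages chosen to sit on or next to the interval boundaries.
ages <- c(39, 40, 49, 50, 69, 70)
breaks <- c(30, 40, 50, 60, 70, 100)

# right = FALSE gives left-closed intervals [a, b): age 40 falls in [40,50).
left_closed <- cut(ages, breaks = breaks, right = FALSE)

# The default right = TRUE gives right-closed intervals (a, b]: age 40 falls in (30,40].
right_closed <- cut(ages, breaks = breaks)

data.frame(ages, left_closed, right_closed)
```

With right = FALSE, a 40-year-old is counted in the 40 to 49 group, which is usually what age categories such as "40-49" are meant to express.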

4. Working with dates and times

President Donald Trump was elected on “November 8, 2016”. Convert this character string into an R date object. Show how to use R to display (a) the Julian date; (b) the day of the week, and (c) the week of the year.

trump<-c("11/8/2016")
trump<-as.Date(trump, format="%m/%d/%Y"); trump
## [1] "2016-11-08"
julian<-julian(trump); julian
## [1] 17113
## attr(,"origin")
## [1] "1970-01-01"
day<-weekdays(trump); day
## [1] "Tuesday"
week<-format(trump, format='%U'); week
## [1] "45"
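
As an alternative sketch, the string can be parsed exactly as written in the prompt rather than retyped as "11/8/2016". This assumes an English locale, because the %B month name and the weekdays output are locale-dependent; the variable name elected is introduced here only for illustration.

```r
# Parse the string as it appears in the prompt.
# %B matches the full month name (assumes an English locale).
elected <- as.Date("November 8, 2016", format = "%B %d, %Y")

julian(elected)          # days since the origin 1970-01-01
weekdays(elected)        # day of the week (locale-dependent)
format(elected, "%U")    # week of the year, with weeks starting on Sunday
format(elected, "%j")    # day of the year, another common meaning of "Julian date"
```

Note that julian() counts days from an origin date, whereas "%j" gives the day number within the year; which one is meant by "Julian date" depends on context.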

5. Simple two-way analysis

Create a simple 2x2 table of smoking (smk) and coronary heart disease (chd). Use the fisher.test on this 2x2 table and describe your findings.

table2x2<-xtabs(~smk+chd, data=evans)
fisher.test(table2x2)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table2x2
## p-value = 0.02512
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.079813 3.697097
## sample estimates:
## odds ratio 
##   1.953491

The p-value (0.025) is below the 0.05 significance level, so we reject the null hypothesis that the odds ratio equals 1 (that is, that smoking and coronary heart disease are independent). The estimated odds ratio is about 1.95 (95% confidence interval 1.08 to 3.70), so in this study smokers had nearly twice the odds of coronary heart disease compared with nonsmokers.
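
The odds ratio reported by fisher.test is a conditional maximum likelihood estimate, which is close to but not the same as the familiar cross-product ratio. The sketch below illustrates the difference using the smk by chd counts of the Evans data typed in by hand (205, 17, 333, 54); the object names tab, or_sample, and or_conditional are introduced here only for illustration.

```r
# Counts of the smk-by-chd table of the Evans data, entered by hand
# (rows: smk = 0, 1; columns: chd = 0, 1).
tab <- matrix(c(205, 333, 17, 54), nrow = 2,
              dimnames = list(smk = c("0", "1"), chd = c("0", "1")))

# Sample (cross-product) odds ratio: (a * d) / (b * c).
or_sample <- (tab["0", "0"] * tab["1", "1"]) / (tab["0", "1"] * tab["1", "0"])

# fisher.test() reports the conditional maximum likelihood estimate,
# so the two values are close but not identical.
or_conditional <- unname(fisher.test(tab)$estimate)

c(sample = or_sample, conditional = or_conditional)
```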

6. Write your own function

Now, write a function to calculate the risk ratio of your 2x2 table above. The exposure is smoking status and the outcome is coronary heart disease.

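rr <- function(x){
  # Rows are exposure (smk = 0, 1); columns are outcome (chd = 0, 1)
  re <- x[2,2]/sum(x[2,])  # risk among the exposed (smokers)
  ru <- x[1,2]/sum(x[1,])  # risk among the unexposed (nonsmokers)
  rr <- re/ru
  list(rr)
}

rr(table2x2)
## [[1]]
## [1] 1.822161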

7. Using the xtabs function

Use the xtabs function to create a 3-D array object of chd, hpt, and smk, then apply the addmargins function to this object.

# The result is a 2 x 2 x 2 array (smk by chd by hpt)
table2x2x2<-xtabs(~smk+chd+hpt, data=evans)
addmargins(table2x2x2)
## , , hpt = 0
## 
##      chd
## smk     0   1 Sum
##   0   122   6 128
##   1   204  22 226
##   Sum 326  28 354
## 
## , , hpt = 1
## 
##      chd
## smk     0   1 Sum
##   0    83  11  94
##   1   129  32 161
##   Sum 212  43 255
## 
## , , hpt = Sum
## 
##      chd
## smk     0   1 Sum
##   0   205  17 222
##   1   333  54 387
##   Sum 538  71 609
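
One use of a 3-D table like this is to look at the smk and chd association within each level of hpt. The sketch below rebuilds the array from the counts printed above (without the Sum margins) and computes a cross-product odds ratio per stratum; the object names counts, marginal, and or_by_hpt are hypothetical and chosen for illustration.

```r
# Counts copied from the addmargins output above (Sum margins omitted).
# Dimensions: smk (0, 1) x chd (0, 1) x hpt (0, 1).
counts <- array(c(122, 204, 6, 22,    # hpt = 0
                  83, 129, 11, 32),   # hpt = 1
                dim = c(2, 2, 2),
                dimnames = list(smk = c("0", "1"),
                                chd = c("0", "1"),
                                hpt = c("0", "1")))

# Summing over the third dimension (hpt) recovers the 2x2 smk-by-chd table.
marginal <- apply(counts, c(1, 2), sum)
marginal

# Cross-product odds ratio within each hpt stratum.
or_by_hpt <- apply(counts, 3, function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1]))
or_by_hpt
```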

8. Create a PNG graph and save file

From the Evans data create a histogram of age (age). Label with a title and axis labels. Output to a PNG file using the png function. Hint is provided.

library(ggplot2)
png(file = "myplot.png") # start PNG device 
ggplot(evans, aes(x=age))+
  geom_histogram()+
  labs(x="Age", title="Histogram of Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
dev.off() # close device
## png 
##   2

9. Display PNG file in your R Markdown document

Using R Markdown syntax, display the PNG file you created above. Hint: use the include_graphics function from the knitr package.

library(knitr)
include_graphics("/cloud/project/myplot.png")

10. Using regular expressions

Here are the California counties: https://raw.githubusercontent.com/taragonmd/data/master/calcounty.txt

Read in data using the scan function. Hint provided below.

Remove the “California” entry.

Use regular expressions to identify and display the County names that start with "San " and end with "o".

cac <-
scan('https://raw.githubusercontent.com/taragonmd/data/master/calcounty.txt', what="")
nocal <- cac[cac!="California"]; nocal
##  [1] "Alameda"         "Alpine"          "Amador"         
##  [4] "Butte"           "Calaveras"       "Colusa"         
##  [7] "Contra Costa"    "Del Norte"       "El Dorado"      
## [10] "Fresno"          "Glenn"           "Humboldt"       
## [13] "Imperial"        "Inyo"            "Kern"           
## [16] "Kings"           "Lake"            "Lassen"         
## [19] "Los Angeles"     "Madera"          "Marin"          
## [22] "Mariposa"        "Mendocino"       "Merced"         
## [25] "Modoc"           "Mono"            "Monterey"       
## [28] "Napa"            "Nevada"          "Orange"         
## [31] "Placer"          "Plumas"          "Riverside"      
## [34] "Sacramento"      "San Benito"      "San Bernardino" 
## [37] "San Diego"       "San Francisco"   "San Joaquin"    
## [40] "San Luis Obispo" "San Mateo"       "Santa Barbara"  
## [43] "Santa Clara"     "Santa Cruz"      "Shasta"         
## [46] "Sierra"          "Siskiyou"        "Solano"         
## [49] "Sonoma"          "Stanislaus"      "Sutter"         
## [52] "Tehama"          "Trinity"         "Tulare"         
## [55] "Tuolumne"        "Ventura"         "Yolo"           
## [58] "Yuba"
san_o<-grep("^San .*o$", nocal, value = TRUE); san_o
## [1] "San Benito"      "San Bernardino"  "San Diego"       "San Francisco"  
## [5] "San Luis Obispo" "San Mateo"
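
The task asks for names that start with "San " including the trailing space, and the space matters in the pattern. The sketch below uses a small made-up vector (these place names are hypothetical test values, not the county file) to show how a pattern without the space can also match names that merely begin with the letters "San".

```r
# Hypothetical names, chosen to show why the space after "San" matters.
places <- c("San Diego", "Santiago", "Santa Cruz", "San Jose", "Sacramento")

# Without the space, any name beginning with the letters "San" can match.
san_loose <- grep("^San.*o$", places, value = TRUE)
san_loose

# With the space, only names whose first word is exactly "San" match.
san_strict <- grep("^San .*o$", places, value = TRUE)
san_strict
```

For the California counties both patterns happen to return the same six names, because no "Santa" county ends in "o".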