Use this template and R Markdown to demonstrate the following skills:
source function
Create an R script file named problem1.R and save it into your project1 folder. Type print("Hello World") and source this file.
source("problem1.R")
## [1] "Hello World"
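As a self-contained sketch of the same step, the snippet below writes a one-line script to a temporary file and sources it. The temporary path (script_path) is an assumption made for illustration; in the exercise the file is problem1.R in the project1 folder.

```r
# Write a one-line script to a temporary file (hypothetical stand-in for problem1.R)
script_path <- tempfile(fileext = ".R")
writeLines('print("Hello World")', script_path)

# source() evaluates the expressions in the file in order
source(script_path)

# echo = TRUE also shows each expression before its result
source(script_path, echo = TRUE)
```

Sourcing a file rather than pasting its contents keeps the report short and lets the same script be reused elsewhere.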
The Evans data set (evans.txt) is here: https://github.com/taragonmd/data.
Alternatively, here is the raw Evans data set: https://raw.githubusercontent.com/taragonmd/data/master/evans.txt.
Demonstrate reading the Evans data file (evans.txt) to create a data frame, and use the str function to explore the structure of the data set. The data dictionary is in Appendix C of the PHDS book.
Show the R code chunk and results below.
evans <- read.table("https://raw.githubusercontent.com/taragonmd/data/master/evans.txt", header = TRUE)
str(evans)
## 'data.frame': 609 obs. of 12 variables:
## $ id : int 21 31 51 71 74 91 111 131 141 191 ...
## $ chd: int 0 0 1 0 0 0 1 0 0 0 ...
## $ cat: int 0 0 1 1 0 0 0 0 0 0 ...
## $ age: int 56 43 56 64 49 46 52 63 42 55 ...
## $ chl: int 270 159 201 179 243 252 179 217 176 250 ...
## $ smk: int 0 1 1 1 1 1 1 0 1 0 ...
## $ ecg: int 0 0 1 0 0 0 1 0 0 1 ...
## $ dbp: int 80 74 112 100 82 88 80 92 76 114 ...
## $ sbp: int 138 128 164 200 145 142 128 135 114 182 ...
## $ hpt: int 0 0 1 1 0 0 0 0 0 1 ...
## $ ch : int 0 0 1 1 0 0 0 0 0 0 ...
## $ cc : int 0 0 201 179 0 0 0 0 0 0 ...
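The same read-and-inspect pattern can be tried offline. The sketch below parses a small inline string with read.table; the three rows are the first values of id, chd, age, and chl shown in the str output above, and the object names (txt, mini) are assumptions for illustration.

```r
# First three rows of a few Evans columns, copied from the str() output
txt <- "id chd age chl
21 0 56 270
31 0 43 159
51 1 56 201"

# text = lets read.table() parse a string as if it were a file
mini <- read.table(text = txt, header = TRUE)
str(mini)

# Columns made up of whole numbers are read as integer
sapply(mini, class)
```

Checking the column classes right after import is a cheap way to catch a missing header = TRUE, which would turn every column into character data.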
Use the cut function to discretize age into the following age categories and make a table of counts and a table of proportions. Be sure to pay attention to age interval transitions.
evans$age_cut<-cut(evans$age, breaks=c(30,40,50,60,70,100),right=FALSE)
table_evans<-table(evans$age_cut); table_evans
##
## [30,40) [40,50) [50,60) [60,70) [70,100)
## 0 247 203 115 44
prop.table(table_evans)
##
## [30,40) [40,50) [50,60) [60,70) [70,100)
## 0.00000000 0.40558292 0.33333333 0.18883415 0.07224959
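To see why the interval transitions matter, here is a minimal sketch that applies the same breaks to a few made-up ages sitting on or next to the boundaries. The vector ages is hypothetical and chosen only to expose the difference between right = FALSE and right = TRUE.

```r
# Hypothetical ages on or next to the break points
ages   <- c(39, 40, 49, 50, 69, 70)
breaks <- c(30, 40, 50, 60, 70, 100)

# right = FALSE: intervals are closed on the left, [40,50)
left_closed  <- cut(ages, breaks = breaks, right = FALSE)

# right = TRUE (the default): intervals are closed on the right, (30,40]
right_closed <- cut(ages, breaks = breaks, right = TRUE)

data.frame(ages, left_closed, right_closed)
```

With right = FALSE a 40-year-old falls in [40,50), whereas the default places the same person in (30,40], so the choice changes the counts whenever ages land exactly on a break.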
President Donald Trump was elected on “November 8, 2016”. Convert this character string into an R date object. Show how to use R to display (a) the Julian date, (b) the day of the week, and (c) the week of the year.
trump<-c("11/8/2016")
trump<-as.Date(trump, format="%m/%d/%Y"); trump
## [1] "2016-11-08"
julian<-julian(trump); julian
## [1] 17113
## attr(,"origin")
## [1] "1970-01-01"
day<-weekdays(trump); day
## [1] "Tuesday"
week<-format(trump, format='%U'); week
## [1] "45"
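The string can also be parsed exactly as written in the question. The sketch below assumes English month names, so it first sets the time locale to "C"; the object name election is an assumption for illustration.

```r
# Month and weekday names depend on the locale; "C" gives English names
invisible(Sys.setlocale("LC_TIME", "C"))

# %B = full month name, %d = day of month, %Y = four-digit year
election <- as.Date("November 8, 2016", format = "%B %d, %Y")
election

as.numeric(election)     # days since 1970-01-01 (the Julian date)
weekdays(election)       # day of the week
format(election, "%j")   # day of the year
format(election, "%U")   # week of the year, with weeks starting on Sunday
```

Note that %U counts weeks from the first Sunday of the year; other conventions exist, so the week number should be reported together with the convention used.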
Create a simple 2x2 table of smoking (smk) and coronary heart disease (chd). Use the fisher.test function on this 2x2 table and describe your findings.
table2x2<-xtabs(~smk+chd, data=evans)
fisher.test(table2x2)
##
## Fisher's Exact Test for Count Data
##
## data: table2x2
## p-value = 0.02512
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.079813 3.697097
## sample estimates:
## odds ratio
## 1.953491
The p-value (0.025) is below the 0.05 significance level, so we reject the null hypothesis that the odds ratio equals 1 (that is, that smoking and coronary heart disease are independent). The estimated odds ratio is about 1.95 (95% confidence interval 1.08 to 3.70), so in this study smokers had roughly twice the odds of coronary heart disease compared with nonsmokers.
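The odds ratio reported by fisher.test is a conditional maximum likelihood estimate, which differs slightly from the familiar cross-product ratio. The sketch below compares the two; the counts are hard-coded from the marginal smk-by-chd table that appears in the addmargins output later in this document, and the object names (tab, or_sample, ft) are assumptions for illustration.

```r
# Counts for smk (rows: 0, 1) by chd (columns: 0, 1)
tab <- matrix(c(205, 333, 17, 54), nrow = 2,
              dimnames = list(smk = c("0", "1"), chd = c("0", "1")))

# Sample (cross-product) odds ratio: (a * d) / (b * c)
or_sample <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
or_sample

# fisher.test() returns the conditional MLE and an exact confidence interval
ft <- fisher.test(tab)
ft$estimate
ft$conf.int
```

The two estimates agree to about two decimal places here; with small cell counts they can differ more noticeably.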
Now, write a function to calculate the risk ratio of your 2x2 table above. The exposure is smoking status and the outcome is coronary heart disease.
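rr <- function(x){
  # rows are exposure (smk = 0, 1); columns are outcome (chd = 0, 1)
  re <- x[2,2]/sum(x[2,]) # risk among the exposed: cases / all exposed
  ru <- x[1,2]/sum(x[1,]) # risk among the unexposed: cases / all unexposed
  re/ru
}
rr(table2x2)
## [1] 1.822161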
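For comparison, here is a sketch that defines each risk as cases divided by the row total (all exposed or all unexposed) and adds an approximate 95% confidence interval. The counts are hard-coded from the smk-by-chd table; the Wald interval on the log scale is an assumption of this sketch (a large-sample approximation), not part of the original exercise.

```r
# Counts from the smk-by-chd table: exposure = smoking, outcome = chd
a  <- 54; n1 <- 387   # chd cases and total among smokers
c0 <- 17; n0 <- 222   # chd cases and total among nonsmokers

risk_exposed   <- a / n1
risk_unexposed <- c0 / n0
risk_ratio     <- risk_exposed / risk_unexposed

# Wald interval on the log scale (large-sample normal approximation)
se_log_rr <- sqrt(1 / a - 1 / n1 + 1 / c0 - 1 / n0)
ci <- exp(log(risk_ratio) + c(-1, 1) * qnorm(0.975) * se_log_rr)

round(c(rr = risk_ratio, lower = ci[1], upper = ci[2]), 3)
```

Dividing cases by non-cases instead of by the row total would give odds rather than risks, and the resulting ratio would be the odds ratio, not the risk ratio.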
Now, use the xtabs function to create a 3-D array object of chd, hpt, and smk. Now use the addmargins function on this object.
table3x3<-xtabs(~smk+chd+hpt, data=evans)
addmargins(table3x3)
## , , hpt = 0
##
## chd
## smk 0 1 Sum
## 0 122 6 128
## 1 204 22 226
## Sum 326 28 354
##
## , , hpt = 1
##
## chd
## smk 0 1 Sum
## 0 83 11 94
## 1 129 32 161
## Sum 212 43 255
##
## , , hpt = Sum
##
## chd
## smk 0 1 Sum
## 0 205 17 222
## 1 333 54 387
## Sum 538 71 609
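A 3-D table can be sliced and collapsed like any other array. The sketch below rebuilds the array from the counts printed above (so it runs without the data file) and shows one slice, the table collapsed over hpt, and the sample odds ratio within each hpt stratum. The object name tab3 is an assumption for illustration.

```r
# Counts copied from the printed table; the first dimension (smk) varies fastest
tab3 <- array(c(122, 204, 6, 22,     # hpt = 0
                83, 129, 11, 32),    # hpt = 1
              dim = c(2, 2, 2),
              dimnames = list(smk = c("0", "1"),
                              chd = c("0", "1"),
                              hpt = c("0", "1")))

# One 2x2 slice: smk by chd among those with hpt = 1
tab3[, , "1"]

# Collapse over hpt to recover the marginal smk-by-chd table
apply(tab3, c(1, 2), sum)

# Sample (cross-product) odds ratio within each hpt stratum
apply(tab3, 3, function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1]))
```

Comparing the stratum-specific odds ratios with the marginal one is a quick first look at whether hpt might modify or confound the smoking and chd association.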
From the Evans data create a histogram of age (age). Label with a title and axis labels. Output to a PNG file using the png function. A hint is provided.
library(ggplot2)
png(file = "myplot.png") # start PNG device
p <- ggplot(evans, aes(x = age)) +
  geom_histogram(binwidth = 5) + # explicit bin width instead of the default 30 bins
  labs(x = "Age", y = "Count", title = "Histogram of Age")
print(p) # print explicitly so the plot is drawn even when the code is sourced
dev.off() # close device
## png
## 2
Using R Markdown syntax, display the PNG file you created above. Hint: use the include_graphics function from the knitr package.
library(knitr)
include_graphics("myplot.png") # relative path, matching the file written by png() above
Here are the California counties: https://raw.githubusercontent.com/taragonmd/data/master/calcounty.txt
Read in the data using the scan function. A hint is provided below.
Remove the “California” entry.
Use regular expressions to identify and display the county names that start with "San " and end with "o".
cac <- scan('https://raw.githubusercontent.com/taragonmd/data/master/calcounty.txt', what="")
nocal <- cac[cac!="California"]; nocal
## [1] "Alameda" "Alpine" "Amador"
## [4] "Butte" "Calaveras" "Colusa"
## [7] "Contra Costa" "Del Norte" "El Dorado"
## [10] "Fresno" "Glenn" "Humboldt"
## [13] "Imperial" "Inyo" "Kern"
## [16] "Kings" "Lake" "Lassen"
## [19] "Los Angeles" "Madera" "Marin"
## [22] "Mariposa" "Mendocino" "Merced"
## [25] "Modoc" "Mono" "Monterey"
## [28] "Napa" "Nevada" "Orange"
## [31] "Placer" "Plumas" "Riverside"
## [34] "Sacramento" "San Benito" "San Bernardino"
## [37] "San Diego" "San Francisco" "San Joaquin"
## [40] "San Luis Obispo" "San Mateo" "Santa Barbara"
## [43] "Santa Clara" "Santa Cruz" "Shasta"
## [46] "Sierra" "Siskiyou" "Solano"
## [49] "Sonoma" "Stanislaus" "Sutter"
## [52] "Tehama" "Trinity" "Tulare"
## [55] "Tuolumne" "Ventura" "Yolo"
## [58] "Yuba"
san_o <- grep("^San .*o$", nocal, value = TRUE); san_o # the space after "San" excludes names such as "Santa ..."
## [1] "San Benito" "San Bernardino" "San Diego" "San Francisco"
## [5] "San Luis Obispo" "San Mateo"
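The space after "San" matters in the pattern. The sketch below contrasts a pattern without the space against one with it, using a small made-up vector: four real county names plus "Santiago", which is not a California county and is included only to show the difference.

```r
# Test vector: four county names plus "Santiago" (hypothetical, not a county)
x <- c("San Diego", "San Joaquin", "Santa Cruz", "San Mateo", "Santiago")

# ^ anchors the start, $ anchors the end, .* matches anything in between
loose  <- grep("^San.*o$",  x, value = TRUE)  # no space required after "San"
strict <- grep("^San .*o$", x, value = TRUE)  # literal space after "San"

loose
strict

# Cross-check without regular expressions
x[startsWith(x, "San ") & endsWith(x, "o")]
```

On the actual county list both patterns happen to return the same six names, because no "Santa ..." county ends in "o", but only the pattern with the space matches the stated requirement.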