Create a project folder called project1 on your computer. You will put all your Project 1 files in this folder.
Go to my GitHub site at https://github.com/taragonmd/data.
Go into the project1 folder.
Download this R Markdown template (PH251D2018_LastName_Project1.Rmd) and edit it. Use R Markdown to demonstrate the following skills:
source function
Download the problem1.R file and save it to the project1 folder. Run the program file (problem1.R) using the “source” command. Show the R code chunk and results below.
source('problem1.R')
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
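As a minimal, self-contained sketch of what source() does, the snippet below writes a small script to a temporary file and then runs it. The script contents and the object name `m` are hypothetical stand-ins and are not the actual contents of problem1.R.

```r
# Write a tiny script to a temporary file (hypothetical stand-in for problem1.R)
script <- tempfile(fileext = ".R")
writeLines(c(
  "m <- matrix(1:4, nrow = 2)",
  "print(m)  # explicit print: source() does not auto-print by default"
), script)

# source() parses and evaluates the file; by default the objects it
# creates are placed in the global environment
source(script)

# The matrix defined inside the script is now available here
dim(m)
```

Calling `source(script, echo = TRUE)` would also echo each expression and auto-print its value, which can be handy when a script has no explicit print() calls.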
The Evans data set (evans.txt) is here: https://github.com/taragonmd/data.
Alternatively, here is the raw Evans data set: https://raw.githubusercontent.com/taragonmd/data/master/evans.txt.
Demonstrate reading the Evans data file (evans.txt) to create a data frame, and use the str function to explore the structure of the data set. Show the R code chunk and results below.
evans <- read.table('https://raw.githubusercontent.com/taragonmd/data/master/evans.txt', header = TRUE)
str(evans)
## 'data.frame': 609 obs. of 12 variables:
## $ id : int 21 31 51 71 74 91 111 131 141 191 ...
## $ chd: int 0 0 1 0 0 0 1 0 0 0 ...
## $ cat: int 0 0 1 1 0 0 0 0 0 0 ...
## $ age: int 56 43 56 64 49 46 52 63 42 55 ...
## $ chl: int 270 159 201 179 243 252 179 217 176 250 ...
## $ smk: int 0 1 1 1 1 1 1 0 1 0 ...
## $ ecg: int 0 0 1 0 0 0 1 0 0 1 ...
## $ dbp: int 80 74 112 100 82 88 80 92 76 114 ...
## $ sbp: int 138 128 164 200 145 142 128 135 114 182 ...
## $ hpt: int 0 0 1 1 0 0 0 0 0 1 ...
## $ ch : int 0 0 1 1 0 0 0 0 0 0 ...
## $ cc : int 0 0 201 179 0 0 0 0 0 0 ...
head(evans)
## id chd cat age chl smk ecg dbp sbp hpt ch cc
## 1 21 0 0 56 270 0 0 80 138 0 0 0
## 2 31 0 0 43 159 1 0 74 128 0 0 0
## 3 51 1 1 56 201 1 1 112 164 1 1 201
## 4 71 0 1 64 179 1 0 100 200 1 1 179
## 5 74 0 0 49 243 1 0 82 145 0 0 0
## 6 91 0 0 46 252 1 0 88 142 0 0 0
Total cholesterol levels less than 200 milligrams per deciliter (mg/dL) are considered desirable (normal) for adults. A reading between 200 and 239 mg/dL is considered borderline high and a reading of 240 mg/dL and above is considered high.1
The Evans data dictionary is in Appendix D of the PHDSwR book. Convert the total cholesterol variable (chl) into a categorical variable (factor) with the three levels described above.
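# right = TRUE closes each interval on the right, so a reading of exactly 200 is labeled
# 'Normal' and exactly 240 'Borderline'; right = FALSE would match the cutoffs above exactly
evans$Category <- cut(evans$chl, breaks = c(0, 200, 240, 500), right = TRUE, labels = c('Normal', 'Borderline', 'High'))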
summary(evans$Category)
## Normal Borderline High
## 257 223 129
str(evans$Category)
## Factor w/ 3 levels "Normal","Borderline",..: 3 1 2 1 3 3 1 2 1 3 ...
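Because the cutoffs are stated as "less than 200" and "240 and above", the choice of the `right` argument in cut() matters for readings that fall exactly on a break. The sketch below uses four hypothetical readings (not taken from the Evans data) to show how the two settings differ.

```r
# Hypothetical readings chosen to sit on and around the cutoffs
chl <- c(199, 200, 239, 240)
breaks <- c(0, 200, 240, 500)
labs <- c("Normal", "Borderline", "High")

# right = TRUE: intervals are (0,200], (200,240], (240,500]
closed_right <- cut(chl, breaks = breaks, right = TRUE, labels = labs)

# right = FALSE: intervals are [0,200), [200,240), [240,500)
closed_left <- cut(chl, breaks = breaks, right = FALSE, labels = labs)

# Side-by-side comparison of the two labelings
data.frame(chl, closed_right, closed_left)
```

Only the readings of 200 and 240 change category, so the two settings give different counts only if the data contain values exactly on a break.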
President John F. Kennedy was assassinated on “November 22, 1963”. Convert this character string into an R Date object. Show how to use R to display (a) the Julian date, (b) the day of the week, and (c) the week of the year.
ken.death <- as.Date("November 22, 1963", format = "%B %d, %Y" )
julian(ken.death)
## [1] -2232
## attr(,"origin")
## [1] "1970-01-01"
weekdays(ken.death)
## [1] "Friday"
# %U is the week of the year (00-53), counting weeks from the first Sunday
x <- format(ken.death, "%U")
paste(x, "is the week of the year in which Kennedy was killed", sep = ' ')
## [1] "46 is the week of the year in which Kennedy was killed"
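"Week of the year" has more than one definition, and format() exposes several of them through different conversion codes. The following sketch, which assumes the date is supplied in ISO format to avoid locale-dependent month names, compares the conventions for the same date.

```r
# Use an unambiguous ISO-format string so parsing does not depend on the locale
d <- as.Date("1963-11-22")

day_of_year <- format(d, "%j")  # day of the year, 001-366
week_sunday <- format(d, "%U")  # week of the year, weeks start on Sunday
week_monday <- format(d, "%W")  # week of the year, weeks start on Monday
week_iso    <- format(d, "%V")  # ISO 8601 week number

c(day = day_of_year, U = week_sunday, W = week_monday, ISO = week_iso)
```

For this date, %U and %W both give 46 while the ISO 8601 week is 47, so it is worth stating which convention an answer uses.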
Create a simple 2x2 table of smoking (smk) and coronary heart disease (chd). Use fisher.test on this 2x2 table and describe your findings.
smk.chd <- xtabs(~ evans$smk + evans$chd)
smk.chd
## evans$chd
## evans$smk 0 1
## 0 205 17
## 1 333 54
fisher.test(smk.chd)
##
## Fisher's Exact Test for Count Data
##
## data: smk.chd
## p-value = 0.02512
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.079813 3.697097
## sample estimates:
## odds ratio
## 1.953491
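Fisher’s exact test gives a p-value of 0.025 and an estimated odds ratio of 1.95 (95% confidence interval 1.08 to 3.70). At the 0.05 level this is evidence against independence of smoking and CHD: in these data, smokers have roughly twice the odds of CHD as nonsmokers.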
Now, write a function to calculate the odds ratio of your 2x2 table above.
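# Build a standard epi contingency table (exposure in rows, outcome in columns,
# 'Yes' first) with margins, and compute the odds ratio
Con_Table <- function(y) {
  # y is a 2x2 table with rows = smk (0, 1) and columns = chd (0, 1)
  c_table <- matrix(c(y[2,2], y[2,1], y[1,2], y[1,1]), nrow = 2, byrow = TRUE)
  c_table <- rbind(c_table, colSums(c_table))  # column totals
  c_table <- cbind(c_table, rowSums(c_table))  # row totals
  dimnames(c_table) <- list(Smoking = c('Yes', 'No', 'Total'), CHD = c('Yes', 'No', 'Total'))
  # Cross-product ratio: (exposed cases * unexposed non-cases) / (exposed non-cases * unexposed cases)
  Odds <- (c_table[1,1] * c_table[2,2]) / (c_table[1,2] * c_table[2,1])
  return(list(c_table, Odds))
}
Con_Table(smk.chd)
## [[1]]
## CHD
## Smoking Yes No Total
## Yes 54 333 387
## No 17 205 222
## Total 71 538 609
##
## [[2]]
## [1] 1.955485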
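The hand-computed odds ratio (1.955485) differs slightly from the fisher.test estimate (1.953491). That is expected: fisher.test reports the conditional maximum likelihood estimate, not the sample cross-product ratio. The sketch below re-enters the counts from the 2x2 table above as a plain matrix, so that it runs without the Evans file, and computes both.

```r
# Counts copied from the smk-by-chd table above (rows: smk 0/1, columns: chd 0/1)
tab <- matrix(c(205, 333, 17, 54), nrow = 2,
              dimnames = list(smk = c("0", "1"), chd = c("0", "1")))

# Sample (cross-product) odds ratio: ad / bc
sample_or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# fisher.test() reports the conditional maximum likelihood estimate instead
conditional_or <- unname(fisher.test(tab)$estimate)

c(sample = sample_or, conditional = conditional_or)
```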
for loops
Write nested for loops to create a multiplication table for the numbers 1 to 10.
# Preallocate the 10 x 10 result, then fill one cell per iteration
multi.table <- matrix(0, nrow = 10, ncol = 10)
for (y in 1:10) {
  for (z in 1:10) {
    multi.table[y, z] <- y * z
  }
}
multi.table
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 2 4 6 8 10 12 14 16 18 20
## [3,] 3 6 9 12 15 18 21 24 27 30
## [4,] 4 8 12 16 20 24 28 32 36 40
## [5,] 5 10 15 20 25 30 35 40 45 50
## [6,] 6 12 18 24 30 36 42 48 54 60
## [7,] 7 14 21 28 35 42 49 56 63 70
## [8,] 8 16 24 32 40 48 56 64 72 80
## [9,] 9 18 27 36 45 54 63 72 81 90
## [10,] 10 20 30 40 50 60 70 80 90 100
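The exercise asks for nested loops, but it can be useful to check the result against a vectorized alternative. The sketch below swaps in base R's outer(), which multiplies every pair of elements, and compares it with a loop-built table; the names `loop_table` and `outer_table` are illustrative.

```r
# Nested-loop version, filling a preallocated matrix
loop_table <- matrix(0, nrow = 10, ncol = 10)
for (i in 1:10) {
  for (j in 1:10) {
    loop_table[i, j] <- i * j
  }
}

# Vectorized version: outer() multiplies every pair of elements
outer_table <- outer(1:10, 1:10)

# Both approaches give the same values
all(loop_table == outer_table)
```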
From the Evans data create a histogram of the total cholesterol (chl). Label with a title and axis labels. Output to a PNG file using the png function.
library(ggplot2)
pl <- ggplot(evans, aes(x = chl, fill = Category)) +
  geom_histogram(color = 'grey', breaks = seq(80, 370, 10)) +
  xlab('Cholesterol mg/dL') +
  ylab('People') +
  ggtitle("Evans Dataset Cholesterol by Value and Category")
pl + theme_minimal() + theme(legend.position = c(.8, .8))
ggsave("Cholgraph.png", device = 'png')
## Saving 7 x 5 in image
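The prompt asks for the png function, whereas the code above saves the plot with ggsave. As a rough sketch of the png() device pattern, the snippet below draws a base R histogram of simulated values (hypothetical, standing in for evans$chl) and writes it to a temporary file. The file name and image dimensions are assumptions.

```r
# Simulated cholesterol values (hypothetical; a stand-in for evans$chl)
set.seed(1)
chl_sim <- round(rnorm(200, mean = 210, sd = 40))

# Open a PNG device, draw the plot, then close the device to write the file
out_file <- file.path(tempdir(), "chol_hist.png")
png(out_file, width = 700, height = 500)
hist(chl_sim,
     main = "Simulated total cholesterol",
     xlab = "Cholesterol (mg/dL)",
     ylab = "People")
dev.off()

# Confirm the file was written
file.exists(out_file)
```

The same open, draw, close pattern should work for a ggplot object by calling print() on the plot between png() and dev.off().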
Using Rmarkdown syntax, display the PNG you created above.
Here are the California counties: https://github.com/taragonmd/data/blob/master/calcounty.txt
Remove the “California” entry.
Use regular expressions to identify and display the County names that start with two or three letters followed by a space (e.g., "San ").
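CalCounty <- read.table('calcounty.txt', header = FALSE)
# Drop row 59, the "California" entry
CalCounty <- CalCounty[-59, ]
# Names that start with exactly two letters followed by a word boundary
CalCounty.concat1 <- grep('^[A-Za-z][A-Za-z]\\b', CalCounty, value = TRUE)
# Names that start with exactly three letters followed by a word boundary
CalCounty.concat2 <- grep('^[A-Za-z][A-Za-z][A-Za-z]\\b', CalCounty, value = TRUE)
CalCounty.concat <- append(CalCounty.concat1, CalCounty.concat2)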
CalCounty.concat
## [1] "El Dorado" "Del Norte" "Los Angeles"
## [4] "San Benito" "San Bernardino" "San Diego"
## [7] "San Francisco" "San Joaquin" "San Luis Obispo"
## [10] "San Mateo"
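The two patterns above can be combined into one by using a repetition quantifier and a literal space, which also matches the wording of the task more closely. The sketch below runs on a small hypothetical vector of names, not the full calcounty.txt file, and removes the "California" entry by value instead of by row position.

```r
# Hypothetical subset of county names, plus the entry to drop
counties <- c("Alameda", "Del Norte", "El Dorado", "Los Angeles",
              "San Diego", "Santa Clara", "Yolo", "California")

# Drop the state entry by value instead of by row position
counties <- counties[counties != "California"]

# ^ anchors at the start, {2,3} allows two or three letters, then a literal space
short_prefix <- grep("^[A-Za-z]{2,3} ", counties, value = TRUE)
short_prefix
```

With a single pattern the matches come back in the order they appear in the input, so two-letter and three-letter prefixes are interleaved instead of grouped.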