Create a project folder called project1 on your computer. You will put all your Project 1 files in this folder.

Go to my GitHub site at https://github.com/taragonmd/data.

Go into the project1 folder.

Download this Rmarkdown template (PH251D2018_LastName_Project1.Rmd) and edit. Use R Markdown to demonstrate the following skills:

1. Using the source function

Download the problem1.R file and save to the project1 folder. Run the program file (problem1.r) using the ‘source’ command. Show the R code chunk and results below.

source('problem1.R')
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

2. Read an ASCII data set

The Evans data set (evans.txt) is here: https://github.com/taragonmd/data.

Alternatively, here is the raw Evans data set: https://raw.githubusercontent.com/taragonmd/data/master/evans.txt.

Demonstrate reading the Evans data file (evans.txt) to create a data frame, and use the str function to explore the structure of the data set. Show the R code chunk and results below.

evans <- data.frame(read.table('https://raw.githubusercontent.com/taragonmd/data/master/evans.txt', header = T))
str(evans)
## 'data.frame':    609 obs. of  12 variables:
##  $ id : int  21 31 51 71 74 91 111 131 141 191 ...
##  $ chd: int  0 0 1 0 0 0 1 0 0 0 ...
##  $ cat: int  0 0 1 1 0 0 0 0 0 0 ...
##  $ age: int  56 43 56 64 49 46 52 63 42 55 ...
##  $ chl: int  270 159 201 179 243 252 179 217 176 250 ...
##  $ smk: int  0 1 1 1 1 1 1 0 1 0 ...
##  $ ecg: int  0 0 1 0 0 0 1 0 0 1 ...
##  $ dbp: int  80 74 112 100 82 88 80 92 76 114 ...
##  $ sbp: int  138 128 164 200 145 142 128 135 114 182 ...
##  $ hpt: int  0 0 1 1 0 0 0 0 0 1 ...
##  $ ch : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ cc : int  0 0 201 179 0 0 0 0 0 0 ...
head(evans)
##   id chd cat age chl smk ecg dbp sbp hpt ch  cc
## 1 21   0   0  56 270   0   0  80 138   0  0   0
## 2 31   0   0  43 159   1   0  74 128   0  0   0
## 3 51   1   1  56 201   1   1 112 164   1  1 201
## 4 71   0   1  64 179   1   0 100 200   1  1 179
## 5 74   0   0  49 243   1   0  82 145   0  0   0
## 6 91   0   0  46 252   1   0  88 142   0  0   0

3. Discretizing a continuous variable into a categorical variable

Total cholesterol levels less than 200 milligrams per deciliter (mg/dL) are considered desirable (normal) for adults. A reading between 200 and 239 mg/dL is considered borderline high and a reading of 240 mg/dL and above is considered high.1

The Evan data dictionary is in Appendix D of the PHDSwR book. Convert total cholesterol variable (chl) into a categorical variable (factor) with the three levels described above.

evans$Category <- cut(evans$chl, breaks = c(0, 200, 240, 500), right = T, labels = c('Normal', 'Borderline', 'High'))
summary(evans$Category)
##     Normal Borderline       High 
##        257        223        129
str(evans$Category)
##  Factor w/ 3 levels "Normal","Borderline",..: 3 1 2 1 3 3 1 2 1 3 ...

4. Working with dates and times

President John F. Kennedy was assassinated on “November 22, 1963”. Convert this character string into a R date object. Show how to use R to display (a) the Julian date; (b) the day of the week, and (c) the week of the year.

ken.death <- as.Date("November 22, 1963", format = "%B %d, %Y" )
julian(ken.death)
## [1] -2232
## attr(,"origin")
## [1] "1970-01-01"
weekdays(ken.death)
## [1] "Friday"
x <- as.numeric(ken.death) %% 365
x <- as.integer(x / 7)
paste(x, "is the week of the year in which Kennedy was killed", sep = ' ')
## [1] "46 is the week of the year in which Kennedy was killed"

5. Simple two-way analysis

Create a simple 2x2 table of smoking (smk) and coronary heart disease (chd). Use the fisher.test on this 2x2 table and describe your findings.

smk.chd <- xtabs(~ evans$smk + evans$chd)
smk.chd
##          evans$chd
## evans$smk   0   1
##         0 205  17
##         1 333  54
fisher.test(smk.chd)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  smk.chd
## p-value = 0.02512
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.079813 3.697097
## sample estimates:
## odds ratio 
##   1.953491

The Fisher’s Test has a p value of .02, which indicates that these variables are not independent.

6. Write your own function

Now, write a function to calculate the odds ratio of your 2x2 table above.

#Function Constructing a standard epi contingency table
Con_Table <- function(y) {
  c_table <- matrix(c(y[2,2], y[2,1], y[2,2]+y[2,1], y[1,2],   y[1,1], y[1,2]+y[1,1]), nrow = 3)
  c_table <- cbind(c_table, rowSums(c_table))
  dimnames(c_table) <- list(Smoking = c('Yes', 'No', 'Total'),CVD = c('Yes', 'No', 'Total'))
  Odds <- (c_table[1,1] * c_table[2,2]) / (c_table[1,2] * c_table[2,1])
  
  return(list(c_table, Odds))
}
Con_Table(smk.chd)
## [[1]]
##        CVD
## Smoking Yes  No Total
##   Yes    54  17    71
##   No    333 205   538
##   Total 387 222   609
## 
## [[2]]
## [1] 1.955485

7. Nested for loops

Write a nested for loops to create a mulitiplication table for the numbers 1 to 10.

multi.table <- c()
for(y in 1:10){
  for(z in 1:10){
    multi.table <- c(multi.table, y*z)
  }
}
multi.table <- matrix(multi.table, nrow = 10)
multi.table
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]    2    4    6    8   10   12   14   16   18    20
##  [3,]    3    6    9   12   15   18   21   24   27    30
##  [4,]    4    8   12   16   20   24   28   32   36    40
##  [5,]    5   10   15   20   25   30   35   40   45    50
##  [6,]    6   12   18   24   30   36   42   48   54    60
##  [7,]    7   14   21   28   35   42   49   56   63    70
##  [8,]    8   16   24   32   40   48   56   64   72    80
##  [9,]    9   18   27   36   45   54   63   72   81    90
## [10,]   10   20   30   40   50   60   70   80   90   100

8. Create a simple graph

From the Evans data create a histogram of the total cholesterol (chl). Label with a title and axis labels. Output to a PNG file using the png function.

library(ggplot2)
pl <- ggplot(evans, aes(x = chl, fill = Category)) + geom_histogram(color = 'grey', breaks = seq(80, 370, 10)) + xlab('Cholesterol mg/dL') + ylab('People') + ggtitle("Evans Dataset Cholesterol by Value and Category")
pl +theme_minimal() + theme(legend.position = c(.8, .8))

ggsave("Cholgraph.png", device = 'png')
## Saving 7 x 5 in image

9. Display PNG file in your Rmarkdown document

Using Rmarkdown syntax, display the PNG you created above.

10. Using regular expressions

Here are the California counties: https://github.com/taragonmd/data/blob/master/calcounty.txt

Remove the “California” entry.

Use regular expressions to identify and display the County names that start with two or three letters followed by a space (e.g., "San ").

CalCounty <- read.table('calcounty.txt', header = F)

CalCounty <- CalCounty[-59, ]
CalCounty.concat1 <- grep('^[A-Za-z][A-Za-z]\\b', CalCounty, value = T)
CalCounty.concat2 <- grep('^[A-Za-z][A-Za-z][A-Za-z]\\b', CalCounty, value = T)
CalCounty.concat <- append(CalCounty.concat1, CalCounty.concat2)
CalCounty.concat
##  [1] "El Dorado"       "Del Norte"       "Los Angeles"    
##  [4] "San Benito"      "San Bernardino"  "San Diego"      
##  [7] "San Francisco"   "San Joaquin"     "San Luis Obispo"
## [10] "San Mateo"

  1. Source: https://www.medicalnewstoday.com/articles/315900.php