February 14, 2018

agenda

  • review indexing data objects
  • review manipulating data frames with dplyr
  • split-apply-combine
  • packages for categorical data analysis

what is indexing?

  • simply subsetting a data object (e.g., vector, matrix, array, data frame, list)
  • you can subset data by providing a vector
    • numerical vector – subsetting based on position
    • character vector – subsetting by names
    • logical vector – subsetting with TRUE/FALSE (boolean) values

indexing with a numerical vector

  • based on position
my_rows <- c(1,3,5)
my_cols <- c(1,2)
cars[my_rows,my_cols]
##   speed dist
## 1     4    2
## 3     7    4
## 5     8   16
my_rows <- seq(from=1,to=5,by=2)
my_cols <- 2:1
cars[my_rows,my_cols]
##   dist speed
## 1    2     4
## 3    4     7
## 5   16     8

indexing with a character vector

  • based on column name
my_rows <- c(1,3,5)
my_cols <- "dist"
cars[my_rows,my_cols]
## [1]  2  4 16
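  • note: selecting a single column this way drops the result to a plain vector; to keep a data frame, base R's drop argument can be used (a small sketch)
cars[my_rows, my_cols, drop = FALSE]  # keeps the data frame structure
##   dist
## 1    2
## 3    4
## 5   16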

indexing with a logical vector

  • with the help of a character vector
my_match <- c("A","B")
my_logical <- OrchardSprays$treatment %in% my_match
OrchardSprays[ my_logical, "treatment" ]
  • return all columns
my_match <- c("A","B")
my_logical <- OrchardSprays$treatment %in% my_match
OrchardSprays[ my_logical,  ]

indexing with a logical vector

  • with the help of a numeric vector
my_logical <- cars[, 1] <= 10
cars[my_logical, ]

my_logical <- cars[, 1] %in% c(4,7,8,9,10)
cars[my_logical, ]
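  • logical conditions can also be combined with & (and) and | (or) before indexing (a minimal sketch)
# cars with speed at most 10 that still needed more than 10 ft to stop
my_logical <- cars[, "speed"] <= 10 & cars[, "dist"] > 10
cars[my_logical, ]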

indexing vs dplyr

  1. speed — for common operations, dplyr is anywhere between 20–100 times faster than base R

  2. cleaner syntax — allows for function chaining, preventing potential clutter in the code

  3. simpler code — a limited number of functions focused on the most common data manipulation requirements
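
  • to make the comparison concrete, here is the same subset written both ways (a sketch; both return the same rows, though dplyr drops the row names)
# base R indexing
mtcars[mtcars$hp > 150, c("mpg", "hp")]
# dplyr equivalent
library(dplyr)
mtcars %>% filter(hp > 150) %>% select(mpg, hp)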

dplyr verbs

using dplyr

function or 'verb'            description
filter()                      select observations based on their values
arrange()                     reorder the observations
select() and rename()         select variables based on their names
mutate() and transmute()      add new variables as functions of existing variables
summarise()                   condense multiple values to a single value
sample_n() and sample_frac()  take random samples

let's try taking a 20% random sample of mtcars

library(dplyr)
sample_frac(mtcars, size = 0.2)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## AMC Javelin       15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
  • note that we will get a different random sample every time we run sample_frac() (see the set.seed() sketch below)
  • how do we filter so that horsepower (hp) is >150?
    • please use pipes
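  • if we need the same sample on every run (e.g., for a reproducible report), we can fix the random seed first; a minimal sketch using base R's set.seed()
set.seed(42)                    # any fixed integer works
sample_frac(mtcars, size = 0.2) # now the same 20% sample is drawn each run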

let's filter so that horsepower (hp) is >150

library(dplyr)
sample_frac(mtcars, size = 0.2) %>% 
  filter(hp >150)
##    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## 1 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4
## 2 15.0   8 301.0 335 3.54 3.57 14.60  0  1    5    8
## 3 13.3   8 350.0 245 3.73 3.84 15.41  0  0    3    4
## 4 16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3
  • how do we arrange by hp and the number of cylinders (cyl)?

let's arrange by hp and the number of cylinders (cyl)

library(dplyr)
sample_frac(mtcars, size = 0.2) %>% 
  filter(hp >150) %>%
  arrange(hp, cyl)
##    mpg cyl  disp  hp drat   wt qsec vs am gear carb
## 1 17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
## 2 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3
## 3 15.0   8 301.0 335 3.54 3.57 14.6  0  1    5    8
  • how do we select only the following columns: mpg, cyl, hp, wt?

let's select only the following columns: mpg, cyl, hp, wt

library(dplyr)
sample_frac(mtcars, size = 0.2) %>% 
  filter(hp >150) %>%
  arrange(hp, cyl) %>%
  select(mpg, cyl, hp, wt) 
##    mpg cyl  hp    wt
## 1 14.7   8 230 5.345
## 2 14.3   8 245 3.570
## 3 13.3   8 245 3.840
## 4 15.8   8 264 3.170
  • how can we summarise the data by taking the mean weight (wt)?

let's summarise the data by taking the mean weight (wt)

library(dplyr)
sample_frac(mtcars, size = 0.2) %>% 
  filter(hp >150) %>%
  arrange(hp, cyl) %>%
  select(mpg, cyl, hp, wt) %>%
  summarise(mean(wt))
##   mean(wt)
## 1 3.806667
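  • summarise() is even more useful when paired with group_by(), which computes the summary within each group; a small sketch previewing the split-apply-combine idea coming up
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_wt = mean(wt)) # one mean weight per cylinder group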

there are multiple ways to do things in R

  • dplyr is a package built to make data manipulation easier
  • however, relying on packages may hinder our ability to understand how base R works
  • there is functionality available in base R that doesn't require packages
    • almost all statistical tests
  • so we will still see concepts of base R throughout the course

split-apply-combine example (using base R)

  • can all data analyses be broken up into 3 steps?
  1. split — separate your data into different groups
  2. apply — run an analysis on each group
  3. combine — aggregate the results in a table
  • let's calculate the average decrease in OrchardSprays by treatment

split df — OrchardSprays

  • subset the data down to the vector of interest
OrchardSprays_decrease <- OrchardSprays$decrease
  • split OrchardSprays by treatment
split_OrchardSprays_decrease <- split(OrchardSprays_decrease, OrchardSprays$treatment)
  • calculate the average for each list element using an apply function
list_OrchardSprays_decrease <- lapply(split_OrchardSprays_decrease, mean)
unlist(list_OrchardSprays_decrease)
##      A      B      C      D      E      F      G      H 
##  4.625  7.625 25.250 35.000 63.125 69.000 68.500 90.250
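  • as an aside, base R's tapply() collapses the three steps into a single call (it reproduces the averages above)
tapply(X = OrchardSprays$decrease, INDEX = OrchardSprays$treatment, FUN = mean)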

you try — calculate max mpg after splitting mtcars by cyl

  • here is the skeleton of the answer
  • fill in "???"
my_df <- mtcars[, c(???)]
list_my_df <- split(x = ???, f = ???)
lapply(X = list_my_df, FUN = ???)

you try — calculate max mpg after splitting mtcars by cyl

  • here is the skeleton of the answer
my_df <- mtcars[, c("mpg","cyl")]
list_my_df <- split(x = my_df, f = mtcars$cyl)
lapply(X = list_my_df, FUN = max)
## $`4`
## [1] 33.9
## 
## $`6`
## [1] 21.4
## 
## $`8`
## [1] 19.2
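  • note that FUN = max above scans both columns of each piece; it works here only because mpg always exceeds cyl
  • a variant that targets the mpg column explicitly and returns a named vector (a sketch using base R's sapply())
sapply(X = list_my_df, FUN = function(d) max(d$mpg))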

utility of packages

  • packages are collections of community-created functions
  • these functions can be written in different languages
    • Java, C++, and even R!

tidyverse packages — can likely use in everyday data analyses

core package  description
ggplot2       a system for declaratively creating graphics, based on The Grammar of Graphics
dplyr         a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges
tidyr         a set of functions that help you get to tidy data (data in a consistent form)
readr         a fast and friendly way to read rectangular data
purrr         enhances R's functional programming (FP) toolkit with a complete and consistent set of tools for working with functions and vectors
tibble        a modern re-imagining of the data frame, keeping what time has proven to be effective and throwing out what it has not
stringr       a cohesive set of functions designed to make working with strings as easy as possible
forcats       a suite of useful tools that solve common problems with factors

other packages relevant to public health

use                         packages
epidemiology                epitab, epitools, epiR
GIS                         sp, maptools, rgdal
geo- and spatial stats      gstat, geoR, starr
dashboards                  leaflet, flexdashboard, highcharter
next generation sequencing  Bioconductor (installed via biocLite())

R for categorical analysis

  • some useful functions available in base R
statistical test                                                        package::function
inference on a single proportion | 2-sample test of equal proportions   stats::prop.test
Fisher's exact test for count data                                      stats::fisher.test
Mantel-Haenszel chi-squared test for count data                         stats::mantelhaen.test
independent 2-group t-test | paired t-test | one-sample t-test          stats::t.test
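
  • as a quick illustration, fisher.test() accepts a 2x2 table of counts directly; a minimal sketch with illustrative numbers
counts <- matrix(c(12, 17, 20, 26), nrow = 2,
                 dimnames = list(exposed = c("N", "Y"), ill = c("N", "Y")))
fisher.test(counts)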

test of equal or given proportions

  • Ho: the proportions in two or more groups are the same
    • Ha: !Ho
  • Ho: all 5 populations from which the patients were drawn have the same true proportion of smokers
    • Ha: at least 1 population from which the patients were drawn has a different true proportion of smokers
# prop.test() treats column 1 as successes and column 2 as failures
sample_of_proportions <- matrix(data = c(80, 82, 81, 79, 132, 201, 203, 199, 195, 206),
                                ncol = 2,
                                dimnames = list(NULL, c("smokers", "patients")))
prop.test(sample_of_proportions)
## 
##  5-sample test for equality of proportions without continuity
##  correction
## 
## data:  sample_of_proportions
## X-squared = 12.87, df = 4, p-value = 0.01193
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3    prop 4    prop 5 
## 0.2846975 0.2877193 0.2892857 0.2883212 0.3905325

recommend storing the object — examine its structure

sample_of_proportions <- matrix(data = c(80, 82, 81, 79, 132, 201, 203, 199, 195, 206),
                                ncol = 2,
                                dimnames = list(NULL, c("smokers", "patients")))
results <- prop.test(sample_of_proportions)
str(results)
## List of 9
##  $ statistic  : Named num 12.9
##   ..- attr(*, "names")= chr "X-squared"
##  $ parameter  : Named num 4
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.0119
##  $ estimate   : Named num [1:5] 0.285 0.288 0.289 0.288 0.391
##   ..- attr(*, "names")= chr [1:5] "prop 1" "prop 2" "prop 3" "prop 4" ...
##  $ null.value : NULL
##  $ conf.int   : NULL
##  $ alternative: chr "two.sided"
##  $ method     : chr "5-sample test for equality of proportions without continuity correction"
##  $ data.name  : chr "sample_of_proportions"
##  - attr(*, "class")= chr "htest"
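  • because results is just a list (of class "htest"), individual pieces can be pulled out with $, which is handy for reporting
results$p.value  # just the p-value
results$estimate # the five sample proportions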

t-test to compare 2 sample means

  • Ho: the means of the two groups are not significantly different
    • Ha: !Ho
  • Ho: the mean age-adjusted death rates are not significantly different between Latinos and Asians
    • Ha: the mean age-adjusted death rates are significantly different between Latinos and Asians
  • Work on your own
library(RSocrata)
url <- "https://data.cityofnewyork.us/resource/uvxr-2jwn.json"
NYC_COD_json <- read.socrata(url)
NYC_COD_json$age_adjusted_death_rate <- as.numeric(NYC_COD_json$age_adjusted_death_rate)
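  • if you are unsure how the categories are spelled, a frequency table shows the exact race_ethnicity strings to match on
table(NYC_COD_json$race_ethnicity)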

t-test to compare age-adjusted death rates between Latinos and Asians

## subset data using base R
rates_latino <- NYC_COD_json[NYC_COD_json$race_ethnicity == "Hispanic", "age_adjusted_death_rate"]
rates_asian <- NYC_COD_json[NYC_COD_json$race_ethnicity == "Asian and Pacific Islander", "age_adjusted_death_rate"]
t.test(rates_latino,rates_asian)
## 
##  Welch Two Sample t-test
## 
## data:  rates_latino and rates_asian
## t = 2.8664, df = 317.27, p-value = 0.004429
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.84144 26.03314
## sample estimates:
## mean of x mean of y 
##  49.82429  34.38701

packages useful for epidemiology

  • epitab, epitools, epiR
  • epitools — tools for training and practicing epidemiologists, including 2- and multi-way contingency tables

epitools example

  • what are the odds of getting sick if you ate baked ham?
library(epitools)
data("oswego")
ob <- oswego # ob is for outbreak
# ob[, 5] is the ill column; ob[, 8] is baked.ham
oddsratio(x = ob[, 5], y = ob[, 8])
## $data
##          Outcome
## Predictor  N  Y Total
##     N     12 17    29
##     Y     17 29    46
##     Total 29 46    75
## 
## $measure
##          odds ratio with 95% C.I.
## Predictor estimate     lower    upper
##         N 1.000000        NA       NA
##         Y 1.202151 0.4552657 3.149389
## 
## $p.value
##          two-sided
## Predictor midp.exact fisher.exact chi.square
##         N         NA           NA         NA
##         Y  0.7078076    0.8087389  0.7017014
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "median-unbiased estimate & mid-p exact CI"

epitools example

  • what are the odds of getting sick if you ate spinach?

epitools example

  • what are the odds of getting sick if you ate spinach?
library(epitools)
data("oswego")
ob <- oswego # ob is for outbreak
# ob[, 9] is spinach
oddsratio(x = ob[, 5], y = ob[, 9])
## $data
##          Outcome
## Predictor  N  Y Total
##     N     12 17    29
##     Y     20 26    46
##     Total 32 43    75
## 
## $measure
##          odds ratio with 95% C.I.
## Predictor  estimate     lower    upper
##         N 1.0000000        NA       NA
##         Y 0.9202213 0.3509368 2.377177
## 
## $p.value
##          two-sided
## Predictor midp.exact fisher.exact chi.square
##         N         NA           NA         NA
##         Y  0.8640061            1  0.8579544
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "median-unbiased estimate & mid-p exact CI"

example epitools with lapply

  • how do I calculate the odds ratio for every single food item, ob[, 8:21]?
  • fill in the "???"
library(epitools)
data("oswego")
ob <- oswego #ob is for outbreak
list_odds <- lapply(X = ??? , FUN = oddsratio, x = ???)

example epitools with lapply

library(epitools)
data("oswego")
ob <- oswego #ob is for outbreak
list_odds <- lapply(X = ob[, 8:21], FUN = oddsratio, x = ob[, 5])
## Warning in chisq.test(xx, correct = correction): Chi-squared approximation
## may be incorrect

## Warning in chisq.test(xx, correct = correction): Chi-squared approximation
## may be incorrect
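
  • each element of list_odds has the same structure as the single oddsratio() output above, so the exposed-group results can be collected into one table (a sketch)
# row 2 of $measure is the exposed-group OR with its 95% CI
t(sapply(list_odds, function(o) o$measure[2, ]))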

prepare oswego dataset for logistic model

  • based on the epi investigation and odds ratios, the predictors of interest are ob$baked.ham + ob$mashed.potato + ob$brown.bread + ob$milk
library(epitools)
data("oswego")
ob <- data.frame(oswego)
  • the code below will not work: y values must be 0 <= y <= 1, but ill is coded "Y"/"N"
ill_model <- glm(ob$ill ~ ob$baked.ham + 
                   ob$mashed.potato + ob$brown.bread + 
                   ob$milk, family = binomial())

recode "Y" and "N"

library(dplyr)
ob$ill <- ob$ill %>% recode(Y = 1, N = 0)
ill_model <- glm(ob$ill ~ ob$baked.ham + 
                   ob$mashed.potato + ob$brown.bread + 
                   ob$milk, family = binomial())
ob$ill
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [71] 0 0 0 0 0
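  • to inspect the fitted coefficients and their p-values, summary() works on glm objects as usual
summary(ill_model)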

what is next?

  • check the syllabus