R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Project management and workflow

Please create a project folder for this R bootcamp and keep everything related in there. This tends to reduce the problem with working directory becoming something other than what you want.

The typical data analysis work flow can be understood like this.

Loading a library.

Use library(package_name)!

## Load package tidyverse
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Loading data files

Use appropriate functions depending on the type of your dataset. Don’t forget to assign it to a named object with the arrow or equal sign.

fram <- read_csv("framingham.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   SYSBP = col_double(),
##   DIABP = col_double(),
##   BMI = col_double(),
##   TIMEAP = col_double(),
##   TIMEMI = col_double(),
##   TIMEMIFC = col_double(),
##   TIMECHD = col_double(),
##   TIMESTRK = col_double(),
##   TIMECVD = col_double(),
##   TIMEDTH = col_double(),
##   TIMEHYP = col_double()
## )
## See spec(...) for full column specifications.
fram2 <- readxl::read_excel("framingham.xlsx")

Examining data

Typing the dataset name and evaluating shows a small part of dataset. Do you see that the columns are variables and rows are observations (people in this case). This structure is called tidy data.

fram
## # A tibble: 4,434 x 37
##      SEX TOTCHOL   AGE SYSBP DIABP CURSMOKE CIGPDAY   BMI DIABETES BPMEDS
##    <int>   <int> <int> <dbl> <dbl>    <int>   <int> <dbl>    <int>  <int>
##  1     1     195    39 106.0    70        0       0 26.97        0      0
##  2     0     250    46 121.0    81        0       0 28.73        0      0
##  3     1     245    48 127.5    80        1      20 25.34        0      0
##  4     0     225    61 150.0    95        1      30 28.58        0      0
##  5     0     285    46 130.0    84        1      23 23.10        0      0
##  6     0     228    43 180.0   110        0       0 30.30        0      0
##  7     0     205    63 138.0    71        0       0 33.11        0      0
##  8     0     313    45 100.0    71        1      20 21.68        0      0
##  9     1     260    52 141.5    89        0       0 26.36        0      0
## 10     1     225    43 162.0   107        1      30 23.61        0      0
## # ... with 4,424 more rows, and 27 more variables: HEARTRTE <int>,
## #   GLUCOSE <int>, PREVCHD <int>, PREVAP <int>, PREVMI <int>,
## #   PREVSTRK <int>, PREVHYP <int>, DEATH <int>, ANGINA <int>,
## #   HOSPMI <int>, MI_FCHD <int>, ANYCHD <int>, STROKE <int>, CVD <int>,
## #   HYPERTEN <int>, TIMEAP <dbl>, TIMEMI <dbl>, TIMEMIFC <dbl>,
## #   TIMECHD <dbl>, TIMESTRK <dbl>, TIMECVD <dbl>, TIMEDTH <dbl>,
## #   TIMEHYP <dbl>, bmicat <int>, agecat <int>, highbp <int>, packs <int>

To see the list of all variables, glimpse() is helpful.

glimpse(fram)
## Observations: 4,434
## Variables: 37
## $ SEX      <int> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,...
## $ TOTCHOL  <int> 195, 250, 245, 225, 285, 228, 205, 313, 260, 225, 254...
## $ AGE      <int> 39, 46, 48, 61, 46, 43, 63, 45, 52, 43, 50, 43, 46, 4...
## $ SYSBP    <dbl> 106.0, 121.0, 127.5, 150.0, 130.0, 180.0, 138.0, 100....
## $ DIABP    <dbl> 70.0, 81.0, 80.0, 95.0, 84.0, 110.0, 71.0, 71.0, 89.0...
## $ CURSMOKE <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,...
## $ CIGPDAY  <int> 0, 0, 20, 30, 23, 0, 0, 20, 0, 30, 0, 0, 15, 0, 9, 20...
## $ BMI      <dbl> 26.97, 28.73, 25.34, 28.58, 23.10, 30.30, 33.11, 21.6...
## $ DIABETES <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BPMEDS   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ HEARTRTE <int> 80, 95, 75, 65, 85, 77, 60, 79, 76, 93, 75, 72, 98, 6...
## $ GLUCOSE  <int> 77, 76, 70, 103, 85, 99, 85, 78, 79, 88, 76, 61, 64, ...
## $ PREVCHD  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PREVAP   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PREVMI   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PREVSTRK <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PREVHYP  <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,...
## $ DEATH    <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,...
## $ ANGINA   <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HOSPMI   <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MI_FCHD  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ANYCHD   <int> 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ STROKE   <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CVD      <int> 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HYPERTEN <int> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,...
## $ TIMEAP   <dbl> 24.0000000, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMEMI   <dbl> 17.6262834, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMEMIFC <dbl> 17.6262834, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMECHD  <dbl> 17.6262834, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMESTRK <dbl> 24.0000000, 24.0000000, 24.0000000, 5.7193703, 24.000...
## $ TIMECVD  <dbl> 17.6262834, 24.0000000, 24.0000000, 5.7193703, 24.000...
## $ TIMEDTH  <dbl> 24.0000000, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMEHYP  <dbl> 24.000000, 24.000000, 24.000000, 0.000000, 11.731691,...
## $ bmicat   <int> 3, 3, 3, 3, 2, 4, 4, 2, 3, 2, 2, 3, 3, 4, 2, 2, 2, 2,...
## $ agecat   <int> 1, 2, 2, 4, 2, 2, 4, 2, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2,...
## $ highbp   <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,...
## $ packs    <int> 0, 0, 1, 2, 2, 0, 0, 1, 0, 2, 0, 0, 1, 0, 1, 1, 1, 1,...

Use select(data, var1, var2, …) to select columns (variables).

select(fram, AGE, SEX, TOTCHOL)
## # A tibble: 4,434 x 3
##      AGE   SEX TOTCHOL
##    <int> <int>   <int>
##  1    39     1     195
##  2    46     0     250
##  3    48     1     245
##  4    61     0     225
##  5    46     0     285
##  6    43     0     228
##  7    63     0     205
##  8    45     0     313
##  9    52     1     260
## 10    43     1     225
## # ... with 4,424 more rows

Use filter(data, condition1, condition2, …) to filter rows (observations) based on conditions. These are taken to be AND conditions.

filter(fram, AGE > 65, AGE <= 70)
## # A tibble: 132 x 37
##      SEX TOTCHOL   AGE SYSBP DIABP CURSMOKE CIGPDAY   BMI DIABETES BPMEDS
##    <int>   <int> <int> <dbl> <dbl>    <int>   <int> <dbl>    <int>  <int>
##  1     0     254    67   157  89.0        0       0 24.25        0      0
##  2     0     311    66   154  80.0        0       0 28.55        0      0
##  3     0     278    66   187  88.0        0       0 40.52        0      0
##  4     0     264    67   139  80.0        0       0 25.75        0      0
##  5     1     288    66   109  71.0        0       0 29.29        0      0
##  6     0     214    66   212 104.0        0       0 25.32        0      0
##  7     0     259    67   151 101.0        0       0 21.67        0      0
##  8     1     257    67   125  67.5        0       0 25.95        0      0
##  9     0     249    67   128  68.0        0       0 25.81        0      0
## 10     0     285    66   166  98.0        0       0 26.04        0      0
## # ... with 122 more rows, and 27 more variables: HEARTRTE <int>,
## #   GLUCOSE <int>, PREVCHD <int>, PREVAP <int>, PREVMI <int>,
## #   PREVSTRK <int>, PREVHYP <int>, DEATH <int>, ANGINA <int>,
## #   HOSPMI <int>, MI_FCHD <int>, ANYCHD <int>, STROKE <int>, CVD <int>,
## #   HYPERTEN <int>, TIMEAP <dbl>, TIMEMI <dbl>, TIMEMIFC <dbl>,
## #   TIMECHD <dbl>, TIMESTRK <dbl>, TIMECVD <dbl>, TIMEDTH <dbl>,
## #   TIMEHYP <dbl>, bmicat <int>, agecat <int>, highbp <int>, packs <int>

Manipulating variables

To rename variables use rename(), if you want it to persist. Assign it to a named object (here fram3).

fram3 <- rename(fram,
                MALE = SEX,
                sytolic_blood_pressure = SYSBP)

To create a new variable using some transformation, use mutate(). Here hypertension is 1 if SYSBP is greater than 140 mmHg OR (vertical bar) DIABP is greater than 90 mmHg.

fram4 <- mutate(fram,
                htn = as.numeric(SYSBP > 140 | DIABP > 90))

Piping with the percent-greater_than-percent operator is helpful is you want to sequentially perform multiple operations. A pipe takes the result on the left-hand side and feed it to the expression on the right-hand side. Here we rename first, and then create a new variable next. Again don’t forget to assign it to a named object if you want to keep it.

fram5 <- fram %>%
  rename(MALE = SEX) %>%
  mutate(htn = as.numeric(SYSBP > 140 | DIABP > 90))

More explicitly, you can use . (period) to express where the result from the LHS should go in. Here fram is inserted to . in rename() and then the result after rename is inserted into . in mutate(). When the function takes the dataset as the first argument, you can omit it as in the previous code example.

fram5 <- fram %>%
  rename(., MALE = SEX) %>%
  mutate(., htn = as.numeric(SYSBP > 140 | DIABP > 90))

Summarizing

summarize() create a single number summary based on some transformation. Here we checked mean, variance (var), and standard deviation (sd).

fram %>%
  summarize(mean_age = mean(AGE),
            var_age = var(AGE),
            sd_age = sd(AGE),
            mean_div_sd = mean_age / sd_age)
## # A tibble: 1 x 4
##   mean_age var_age   sd_age mean_div_sd
##      <dbl>   <dbl>    <dbl>       <dbl>
## 1  49.9258 75.2891 8.676929    5.753856

Often it’s necessary to summarize based on subgroups. Here get the subgroup mean ages grouped by sex and current smoking status. Use the group_by() function. Additionally, I’m reordering the results by the newly created mean_age variable (ascending order) using arrange().

fram %>%
  group_by(SEX, CURSMOKE) %>%
  summarise(mean_age = mean(AGE)) %>%
  arrange(mean_age)
## # A tibble: 4 x 3
## # Groups:   SEX [2]
##     SEX CURSMOKE mean_age
##   <int>    <int>    <dbl>
## 1     0        1 47.33201
## 2     1        1 48.70298
## 3     1        0 51.44213
## 4     0        0 51.86658

Let’s try graphics

Data visualization is very helpful understanding the data. ggplot2 is the package name. This is automatically loaded with tidyverse. The command is called ggplot.

aes() creates the aesthetic mapping between variables in your dataset and the structure of the figure. geom_X (geometric objects) will actually create a figure with geometry X. The plus sign means the expressions are combined to create a final figure.

factor() is used to make R interpret SEX variable as strictly categorical. Because it is code 0,1, sometimes it is interpreted as a numerical variable and gives you a funny result.

## <- These hashtags mean comments in a code chunk, but a header outside.
## this is a comment, ignored
ggplot(data = fram,
       mapping = aes(x = AGE, y = SYSBP, color = factor(SEX))) +
  geom_point()

Single variable figures

ggplot(data = fram, mapping = aes(x = AGE)) +
  geom_histogram(binwidth = 5)

ggplot(fram, aes(x = AGE)) +
  geom_density()

ggplot(fram, aes(x = agecat)) +
  geom_bar()

Two variable figures

ggplot(fram, aes(x = factor(agecat), y = BMI)) +
  geom_boxplot()
## Warning: Removed 19 rows containing non-finite values (stat_boxplot).

geom_smooth() can be used to add mean tendency of y variable over the x variable. Here we used a simple linear model (lm) to fit a straight line to represent the relationship.

ggplot(fram, aes(x = AGE, y = BMI)) +
  geom_point() +
  geom_smooth(method = "lm")
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).

Multi-panel figure

Sometimes it’s helpful to have multiple panels classified by additional variables. This is done with facet.

ggplot(fram, aes(x = AGE, y = BMI)) +
  geom_point() +
  facet_grid(SEX ~ agecat)
## Warning: Removed 19 rows containing missing values (geom_point).