This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
Please create a project folder for this R bootcamp and keep everything related in there. This tends to reduce the problem with working directory becoming something other than what you want.
The typical data analysis work flow can be understood like this.
Use library(package_name)!
## Load package tidyverse
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
Use appropriate functions depending on the type of your dataset. Don’t forget to assign it to a named object with the arrow or equal sign.
fram <- read_csv("framingham.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## SYSBP = col_double(),
## DIABP = col_double(),
## BMI = col_double(),
## TIMEAP = col_double(),
## TIMEMI = col_double(),
## TIMEMIFC = col_double(),
## TIMECHD = col_double(),
## TIMESTRK = col_double(),
## TIMECVD = col_double(),
## TIMEDTH = col_double(),
## TIMEHYP = col_double()
## )
## See spec(...) for full column specifications.
fram2 <- readxl::read_excel("framingham.xlsx")
Typing the dataset name and evaluating shows a small part of dataset. Do you see that the columns are variables and rows are observations (people in this case). This structure is called tidy data.
fram
## # A tibble: 4,434 x 37
## SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY BMI DIABETES BPMEDS
## <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <int>
## 1 1 195 39 106.0 70 0 0 26.97 0 0
## 2 0 250 46 121.0 81 0 0 28.73 0 0
## 3 1 245 48 127.5 80 1 20 25.34 0 0
## 4 0 225 61 150.0 95 1 30 28.58 0 0
## 5 0 285 46 130.0 84 1 23 23.10 0 0
## 6 0 228 43 180.0 110 0 0 30.30 0 0
## 7 0 205 63 138.0 71 0 0 33.11 0 0
## 8 0 313 45 100.0 71 1 20 21.68 0 0
## 9 1 260 52 141.5 89 0 0 26.36 0 0
## 10 1 225 43 162.0 107 1 30 23.61 0 0
## # ... with 4,424 more rows, and 27 more variables: HEARTRTE <int>,
## # GLUCOSE <int>, PREVCHD <int>, PREVAP <int>, PREVMI <int>,
## # PREVSTRK <int>, PREVHYP <int>, DEATH <int>, ANGINA <int>,
## # HOSPMI <int>, MI_FCHD <int>, ANYCHD <int>, STROKE <int>, CVD <int>,
## # HYPERTEN <int>, TIMEAP <dbl>, TIMEMI <dbl>, TIMEMIFC <dbl>,
## # TIMECHD <dbl>, TIMESTRK <dbl>, TIMECVD <dbl>, TIMEDTH <dbl>,
## # TIMEHYP <dbl>, bmicat <int>, agecat <int>, highbp <int>, packs <int>
To see the list of all variables, glimpse() is helpful.
glimpse(fram)
## Observations: 4,434
## Variables: 37
## $ SEX <int> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,...
## $ TOTCHOL <int> 195, 250, 245, 225, 285, 228, 205, 313, 260, 225, 254...
## $ AGE <int> 39, 46, 48, 61, 46, 43, 63, 45, 52, 43, 50, 43, 46, 4...
## $ SYSBP <dbl> 106.0, 121.0, 127.5, 150.0, 130.0, 180.0, 138.0, 100....
## $ DIABP <dbl> 70.0, 81.0, 80.0, 95.0, 84.0, 110.0, 71.0, 71.0, 89.0...
## $ CURSMOKE <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,...
## $ CIGPDAY <int> 0, 0, 20, 30, 23, 0, 0, 20, 0, 30, 0, 0, 15, 0, 9, 20...
## $ BMI <dbl> 26.97, 28.73, 25.34, 28.58, 23.10, 30.30, 33.11, 21.6...
## $ DIABETES <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BPMEDS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ HEARTRTE <int> 80, 95, 75, 65, 85, 77, 60, 79, 76, 93, 75, 72, 98, 6...
## $ GLUCOSE <int> 77, 76, 70, 103, 85, 99, 85, 78, 79, 88, 76, 61, 64, ...
## $ PREVCHD <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PREVAP <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PREVMI <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PREVSTRK <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PREVHYP <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,...
## $ DEATH <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,...
## $ ANGINA <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HOSPMI <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MI_FCHD <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ANYCHD <int> 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ STROKE <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CVD <int> 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HYPERTEN <int> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,...
## $ TIMEAP <dbl> 24.0000000, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMEMI <dbl> 17.6262834, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMEMIFC <dbl> 17.6262834, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMECHD <dbl> 17.6262834, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMESTRK <dbl> 24.0000000, 24.0000000, 24.0000000, 5.7193703, 24.000...
## $ TIMECVD <dbl> 17.6262834, 24.0000000, 24.0000000, 5.7193703, 24.000...
## $ TIMEDTH <dbl> 24.0000000, 24.0000000, 24.0000000, 8.0930869, 24.000...
## $ TIMEHYP <dbl> 24.000000, 24.000000, 24.000000, 0.000000, 11.731691,...
## $ bmicat <int> 3, 3, 3, 3, 2, 4, 4, 2, 3, 2, 2, 3, 3, 4, 2, 2, 2, 2,...
## $ agecat <int> 1, 2, 2, 4, 2, 2, 4, 2, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2,...
## $ highbp <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,...
## $ packs <int> 0, 0, 1, 2, 2, 0, 0, 1, 0, 2, 0, 0, 1, 0, 1, 1, 1, 1,...
Use select(data, var1, var2, …) to select columns (variables).
select(fram, AGE, SEX, TOTCHOL)
## # A tibble: 4,434 x 3
## AGE SEX TOTCHOL
## <int> <int> <int>
## 1 39 1 195
## 2 46 0 250
## 3 48 1 245
## 4 61 0 225
## 5 46 0 285
## 6 43 0 228
## 7 63 0 205
## 8 45 0 313
## 9 52 1 260
## 10 43 1 225
## # ... with 4,424 more rows
Use filter(data, condition1, condition2, …) to filter rows (observations) based on conditions. These are taken to be AND conditions.
filter(fram, AGE > 65, AGE <= 70)
## # A tibble: 132 x 37
## SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY BMI DIABETES BPMEDS
## <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <int>
## 1 0 254 67 157 89.0 0 0 24.25 0 0
## 2 0 311 66 154 80.0 0 0 28.55 0 0
## 3 0 278 66 187 88.0 0 0 40.52 0 0
## 4 0 264 67 139 80.0 0 0 25.75 0 0
## 5 1 288 66 109 71.0 0 0 29.29 0 0
## 6 0 214 66 212 104.0 0 0 25.32 0 0
## 7 0 259 67 151 101.0 0 0 21.67 0 0
## 8 1 257 67 125 67.5 0 0 25.95 0 0
## 9 0 249 67 128 68.0 0 0 25.81 0 0
## 10 0 285 66 166 98.0 0 0 26.04 0 0
## # ... with 122 more rows, and 27 more variables: HEARTRTE <int>,
## # GLUCOSE <int>, PREVCHD <int>, PREVAP <int>, PREVMI <int>,
## # PREVSTRK <int>, PREVHYP <int>, DEATH <int>, ANGINA <int>,
## # HOSPMI <int>, MI_FCHD <int>, ANYCHD <int>, STROKE <int>, CVD <int>,
## # HYPERTEN <int>, TIMEAP <dbl>, TIMEMI <dbl>, TIMEMIFC <dbl>,
## # TIMECHD <dbl>, TIMESTRK <dbl>, TIMECVD <dbl>, TIMEDTH <dbl>,
## # TIMEHYP <dbl>, bmicat <int>, agecat <int>, highbp <int>, packs <int>
To rename variables use rename(), if you want it to persist. Assign it to a named object (here fram3).
fram3 <- rename(fram,
MALE = SEX,
sytolic_blood_pressure = SYSBP)
To create a new variable using some transformation, use mutate(). Here hypertension is 1 if SYSBP is greater than 140 mmHg OR (vertical bar) DIABP is greater than 90 mmHg.
fram4 <- mutate(fram,
htn = as.numeric(SYSBP > 140 | DIABP > 90))
Piping with the percent-greater_than-percent operator is helpful is you want to sequentially perform multiple operations. A pipe takes the result on the left-hand side and feed it to the expression on the right-hand side. Here we rename first, and then create a new variable next. Again don’t forget to assign it to a named object if you want to keep it.
fram5 <- fram %>%
rename(MALE = SEX) %>%
mutate(htn = as.numeric(SYSBP > 140 | DIABP > 90))
More explicitly, you can use . (period) to express where the result from the LHS should go in. Here fram is inserted to . in rename() and then the result after rename is inserted into . in mutate(). When the function takes the dataset as the first argument, you can omit it as in the previous code example.
fram5 <- fram %>%
rename(., MALE = SEX) %>%
mutate(., htn = as.numeric(SYSBP > 140 | DIABP > 90))
summarize() create a single number summary based on some transformation. Here we checked mean, variance (var), and standard deviation (sd).
fram %>%
summarize(mean_age = mean(AGE),
var_age = var(AGE),
sd_age = sd(AGE),
mean_div_sd = mean_age / sd_age)
## # A tibble: 1 x 4
## mean_age var_age sd_age mean_div_sd
## <dbl> <dbl> <dbl> <dbl>
## 1 49.9258 75.2891 8.676929 5.753856
Often it’s necessary to summarize based on subgroups. Here get the subgroup mean ages grouped by sex and current smoking status. Use the group_by() function. Additionally, I’m reordering the results by the newly created mean_age variable (ascending order) using arrange().
fram %>%
group_by(SEX, CURSMOKE) %>%
summarise(mean_age = mean(AGE)) %>%
arrange(mean_age)
## # A tibble: 4 x 3
## # Groups: SEX [2]
## SEX CURSMOKE mean_age
## <int> <int> <dbl>
## 1 0 1 47.33201
## 2 1 1 48.70298
## 3 1 0 51.44213
## 4 0 0 51.86658
Data visualization is very helpful understanding the data. ggplot2 is the package name. This is automatically loaded with tidyverse. The command is called ggplot.
aes() creates the aesthetic mapping between variables in your dataset and the structure of the figure. geom_X (geometric objects) will actually create a figure with geometry X. The plus sign means the expressions are combined to create a final figure.
factor() is used to make R interpret SEX variable as strictly categorical. Because it is code 0,1, sometimes it is interpreted as a numerical variable and gives you a funny result.
## <- These hashtags mean comments in a code chunk, but a header outside.
## this is a comment, ignored
ggplot(data = fram,
mapping = aes(x = AGE, y = SYSBP, color = factor(SEX))) +
geom_point()
ggplot(data = fram, mapping = aes(x = AGE)) +
geom_histogram(binwidth = 5)
ggplot(fram, aes(x = AGE)) +
geom_density()
ggplot(fram, aes(x = agecat)) +
geom_bar()
ggplot(fram, aes(x = factor(agecat), y = BMI)) +
geom_boxplot()
## Warning: Removed 19 rows containing non-finite values (stat_boxplot).
geom_smooth() can be used to add mean tendency of y variable over the x variable. Here we used a simple linear model (lm) to fit a straight line to represent the relationship.
ggplot(fram, aes(x = AGE, y = BMI)) +
geom_point() +
geom_smooth(method = "lm")
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).
Sometimes it’s helpful to have multiple panels classified by additional variables. This is done with facet.
ggplot(fram, aes(x = AGE, y = BMI)) +
geom_point() +
facet_grid(SEX ~ agecat)
## Warning: Removed 19 rows containing missing values (geom_point).