A dplyr tutorial

dplyr is probably my favorite R “power tool”. It provides a simple grammar for data wrangling that’s very easy to learn and it’s also very fast in terms of performance. In this document we’ll use it to briefly look at the alzheimer’s dataset.

library(dplyr)
library(ggplot2)

alz <- read.csv('Alzheimers.csv')

dplyr offers its own function for inspecting the contents of a dataframe called glimpse(). It’s like str() but shows more data.

glimpse(alz)

## Observations: 333
## Variables: 132
## $ ACE_CD143_Angiotensin_Converti   (dbl) 2.0031003, 1.5618560, 1.52065...
## $ ACTH_Adrenocorticotropic_Hormon  (dbl) -1.3862944, -1.3862944, -1.71...
## $ AXL                              (dbl) 1.09838668, 0.68328157, -0.14...
## $ Adiponectin                      (dbl) -5.360193, -5.020686, -5.8091...
## $ Alpha_1_Antichymotrypsin         (dbl) 1.7404662, 1.4586150, 1.19392...
## $ Alpha_1_Antitrypsin              (dbl) -12.631361, -11.909882, -13.6...
## $ Alpha_1_Microglobulin            (dbl) -2.577022, -3.244194, -2.8824...
## $ Alpha_2_Macroglobulin            (dbl) -72.65029, -154.61228, -136.5...
## $ Angiopoietin_2_ANG_2             (dbl) 1.06471074, 0.74193734, 0.832...
## $ Angiotensinogen                  (dbl) 2.510547, 2.457283, 1.976365,...
## $ Apolipoprotein_A_IV              (dbl) -1.427116, -1.660731, -1.6607...
## $ Apolipoprotein_A1                (dbl) -7.402052, -7.047017, -7.6842...
## $ Apolipoprotein_A2                (dbl) -0.26136476, -0.86750057, -0....
## $ Apolipoprotein_B                 (dbl) -4.624044, -6.747507, -3.9760...
## $ Apolipoprotein_CI                (dbl) -1.2729657, -1.2729657, -1.71...
## $ Apolipoprotein_CIII              (dbl) -2.312635, -2.343407, -2.7488...
## $ Apolipoprotein_D                 (dbl) 2.0794415, 1.3350011, 1.33500...
## $ Apolipoprotein_E                 (dbl) 3.7545215, 3.0971187, 2.75305...
## $ Apolipoprotein_H                 (dbl) -0.15734908, -0.57539617, -0....
## $ B_Lymphocyte_Chemoattractant_BL  (dbl) 2.2969819, 1.6731213, 1.67312...
## $ BMP_6                            (dbl) -2.200744, -1.728053, -2.0624...
## $ Beta_2_Microglobulin             (dbl) 0.69314718, 0.47000363, 0.336...
## $ Betacellulin                     (int) 34, 53, 49, 52, 67, 51, 41, 4...
## $ C_Reactive_Protein               (dbl) -4.074542, -6.645391, -8.0471...
## $ CD40                             (dbl) -0.7964147, -1.2733760, -1.24...
## $ CD5L                             (dbl) 0.09531018, -0.67334455, 0.09...
## $ Calbindin                        (dbl) 33.21363, 25.27636, 22.16609,...
## $ Calcitonin                       (dbl) 1.3862944, 3.6109179, 2.11625...
## $ CgA                              (dbl) 397.6536, 465.6759, 347.8639,...
## $ Clusterin_Apo_J                  (dbl) 3.555348, 3.044522, 2.772589,...
## $ Complement_3                     (dbl) -10.36305, -16.10824, -16.108...
## $ Complement_Factor_H              (dbl) 3.5737252, 3.6000471, 4.47456...
## $ Connective_Tissue_Growth_Factor  (dbl) 0.5306283, 0.5877867, 0.64185...
## $ Cortisol                         (dbl) 10.0, 12.0, 10.0, 14.0, 11.0,...
## $ Creatine_Kinase_MB               (dbl) -1.710172, -1.751002, -1.3835...
## $ Cystatin_C                       (dbl) 9.041922, 9.067624, 8.954157,...
## $ EGF_R                            (dbl) -0.1354543, -0.3700474, -0.73...
## $ EN_RAGE                          (dbl) -3.688879, -3.816713, -4.7559...
## $ ENA_78                           (dbl) -1.349543, -1.356595, -1.3906...
## $ Eotaxin_3                        (int) 53, 62, 62, 44, 64, 57, 64, 6...
## $ FAS                              (dbl) -0.08338161, -0.52763274, -0....
## $ FSH_Follicle_Stimulation_Hormon  (dbl) -0.6516715, -1.6272839, -1.56...
## $ Fas_Ligand                       (dbl) 3.1014922, 2.9788133, 1.36000...
## $ Fatty_Acid_Binding_Protein       (dbl) 2.5208712, 2.2477966, 0.90630...
## $ Ferritin                         (dbl) 3.329165, 3.932959, 3.176872,...
## $ Fetuin_A                         (dbl) 1.2809338, 1.1939225, 1.41098...
## $ Fibrinogen                       (dbl) -7.035589, -8.047190, -7.1954...
## $ GRO_alpha                        (dbl) 1.381830, 1.372438, 1.412679,...
## $ Gamma_Interferon_induced_Monokin (dbl) 2.949822, 2.721793, 2.762231,...
## $ Glutathione_S_Transferase_alpha  (dbl) 1.0641271, 0.8670202, 0.88901...
## $ HB_EGF                           (dbl) 6.559746, 8.754531, 7.745463,...
## $ HCC_4                            (dbl) -3.036554, -4.074542, -3.6496...
## $ Hepatocyte_Growth_Factor_HGF     (dbl) 0.58778666, 0.53062825, 0.095...
## $ I_309                            (dbl) 3.433987, 3.135494, 2.397895,...
## $ ICAM_1                           (dbl) -0.1907787, -0.4620172, -0.46...
## $ IGF_BP_2                         (dbl) 5.609472, 5.347108, 5.181784,...
## $ IL_11                            (dbl) 5.121987, 4.936704, 4.665910,...
## $ IL_13                            (dbl) 1.282549, 1.269463, 1.274133,...
## $ IL_16                            (dbl) 4.192081, 2.876338, 2.616102,...
## $ IL_17E                           (dbl) 5.731246, 6.705891, 4.149327,...
## $ IL_1alpha                        (dbl) -6.571283, -8.047190, -8.1807...
## $ IL_3                             (dbl) -3.244194, -3.912023, -4.6459...
## $ IL_4                             (dbl) 2.484907, 2.397895, 1.824549,...
## $ IL_5                             (dbl) 1.09861229, 0.69314718, -0.24...
## $ IL_6                             (dbl) 0.26936976, 0.09622438, 0.185...
## $ IL_6_Receptor                    (dbl) 0.64279595, 0.43115645, 0.096...
## $ IL_7                             (dbl) 4.8050453, 3.7055056, 1.00562...
## $ IL_8                             (dbl) 1.711325, 1.675557, 1.691393,...
## $ IP_10_Inducible_Protein_10       (dbl) 6.242223, 5.686975, 5.049856,...
## $ IgA                              (dbl) -6.812445, -6.377127, -6.3199...
## $ Insulin                          (dbl) -0.6258253, -0.9431406, -1.44...
## $ Kidney_Injury_Molecule_1_KIM_1   (dbl) -1.204295, -1.197703, -1.1911...
## $ LOX_1                            (dbl) 1.7047481, 1.5260563, 1.16315...
## $ Leptin                           (dbl) -1.5290628, -1.4660558, -1.66...
## $ Lipoprotein_a                    (dbl) -4.268698, -4.933674, -5.8430...
## $ MCP_1                            (dbl) 6.740519, 6.849066, 6.767343,...
## $ MCP_2                            (dbl) 1.9805094, 1.8088944, 0.40059...
## $ MIF                              (dbl) -1.237874, -1.897120, -2.3025...
## $ MIP_1alpha                       (dbl) 4.968453, 3.690160, 4.049508,...
## $ MIP_1beta                        (dbl) 3.258097, 3.135494, 2.397895,...
## $ MMP_2                            (dbl) 4.478566, 3.781473, 2.866631,...
## $ MMP_3                            (dbl) -2.207275, -2.465104, -2.3025...
## $ MMP10                            (dbl) -3.270169, -3.649659, -2.7333...
## $ MMP7                             (dbl) -3.7735027, -5.9681907, -4.03...
## $ Myoglobin                        (dbl) -1.89711998, -0.75502258, -1....
## $ NT_proBNP                        (dbl) 4.553877, 4.219508, 4.248495,...
## $ NrCAM                            (dbl) 5.003946, 5.209486, 4.744932,...
## $ Osteopontin                      (dbl) 5.356586, 6.003887, 5.017280,...
## $ PAI_1                            (dbl) 1.00350156, -0.03059880, 0.43...
## $ PAPP_A                           (dbl) -2.902226, -2.813276, -2.9355...
## $ PLGF                             (dbl) 4.442651, 4.025352, 4.510860,...
## $ PYY                              (dbl) 3.218876, 3.135494, 2.890372,...
## $ Pancreatic_polypeptide           (dbl) 0.57878085, 0.33647224, -0.89...
## $ Prolactin                        (dbl) 0.00000000, -0.51082562, -0.1...
## $ Prostatic_Acid_Phosphatase       (dbl) -1.620527, -1.739232, -1.6366...
## $ Protein_S                        (dbl) -1.784998, -2.463991, -2.2591...
## $ Pulmonary_and_Activation_Regulat (dbl) -0.8439701, -2.3025851, -1.66...
## $ RANTES                           (dbl) -6.214608, -6.938214, -6.6453...
## $ Resistin                         (dbl) -16.475315, -16.025283, -16.4...
## $ S100b                            (dbl) 1.5618560, 1.7566212, 1.43572...
## $ SGOT                             (dbl) -0.94160854, -0.65392647, 0.3...
## $ SHBG                             (dbl) -1.897120, -1.560648, -2.2072...
## $ SOD                              (dbl) 5.609472, 5.814131, 5.723585,...
## $ Serum_Amyloid_P                  (dbl) -5.599422, -6.119298, -5.3816...
## $ Sortilin                         (dbl) 4.908629, 5.478731, 3.810182,...
## $ Stem_Cell_Factor                 (dbl) 4.174387, 3.713572, 3.433987,...
## $ TGF_alpha                        (dbl) 8.649098, 11.331619, 10.85849...
## $ TIMP_1                           (dbl) 15.204651, 11.266499, 12.2828...
## $ TNF_RII                          (dbl) -0.06187540, -0.32850407, -0....
## $ TRAIL_R3                         (dbl) -0.1829004, -0.5007471, -0.92...
## $ TTR_prealbumin                   (dbl) 2.944439, 2.833213, 2.944439,...
## $ Tamm_Horsfall_Protein_THP        (dbl) -3.095810, -3.111190, -3.1667...
## $ Thrombomodulin                   (dbl) -1.340566, -1.675252, -1.5342...
## $ Thrombopoietin                   (dbl) -0.1026334, -0.6733501, -0.92...
## $ Thymus_Expressed_Chemokine_TECK  (dbl) 4.149327, 3.810182, 2.791992,...
## $ Thyroid_Stimulating_Hormone      (dbl) -3.863233, -4.828314, -4.9908...
## $ Thyroxine_Binding_Globulin       (dbl) -1.4271164, -1.6094379, -1.89...
## $ Tissue_Factor                    (dbl) 2.04122033, 2.02814825, 1.435...
## $ Transferrin                      (dbl) 3.332205, 2.890372, 2.890372,...
## $ Trefoil_Factor_3_TFF3            (dbl) -3.381395, -3.912023, -3.7297...
## $ VCAM_1                           (dbl) 3.258097, 2.708050, 2.639057,...
## $ VEGF                             (dbl) 22.03456, 18.60184, 17.47619,...
## $ Vitronectin                      (dbl) -0.04082199, -0.38566248, -0....
## $ von_Willebrand_Factor            (dbl) -3.146555, -3.863233, -3.5404...
## $ age                              (dbl) 88.52057, 80.33331, 83.20507,...
## $ tau                              (dbl) 6.297754, 6.659294, 6.270988,...
## $ p_tau                            (dbl) 4.348108, 4.859967, 4.400247,...
## $ Ab_42                            (dbl) 12.019678, 11.015759, 12.3022...
## $ male                             (int) 0, 0, 1, 0, 0, 1, 1, 1, 0, 0,...
## $ Genotype                         (fctr) E3E3, E3E4, E3E4, E3E4, E3E3...
## $ response                         (fctr) NotImpaired, NotImpaired, No...
## $ gender                           (fctr) Female, Female, M, Female, F...

Since there are quite a few variables lets just get a few that might be interesting. dplyr provides a select() function that works much like the SQL equivalent:

cort <- select(alz, age, gender, Cortisol, Apolipoprotein_A1, Apolipoprotein_A2)
summary(cort)

##       age            gender       Cortisol     Apolipoprotein_A1
##  Min.   : 65.00   female:  2   Min.   : 0.10   Min.   :-8.680   
##  1st Qu.: 73.99   Female:202   1st Qu.: 9.50   1st Qu.:-7.775   
##  Median : 80.23   M     :  3   Median :12.00   Median :-7.470   
##  Mean   : 79.99   male  :  2   Mean   :11.68   Mean   :-7.484   
##  3rd Qu.: 86.05   Male  :124   3rd Qu.:14.00   3rd Qu.:-7.195   
##  Max.   :100.00                Max.   :29.00   Max.   :-6.166   
##  Apolipoprotein_A2
##  Min.   :-1.9661  
##  1st Qu.:-0.9676  
##  Median :-0.6931  
##  Mean   :-0.6462  
##  3rd Qu.:-0.3285  
##  Max.   : 0.9555

In pretty much every dplyr function the first argument is always the dataframe we’re performing an operation on, followed by function specific arguments which in this case are just the columns of interest. For example, we can order by a particular column or columns using the arrange function:

head(cort)

##        age gender Cortisol Apolipoprotein_A1 Apolipoprotein_A2
## 1 88.52057 Female       10         -7.402052       -0.26136476
## 2 80.33331 Female       12         -7.047017       -0.86750057
## 3 83.20507      M       10         -7.684284       -0.65392647
## 4 83.40201 Female       14         -8.047190       -1.23787436
## 5 85.96176 Female       11         -6.725434        0.09531018
## 6 69.37921   Male       13         -7.402052       -0.27443685

cort <- arrange(cort, age)
head(cort)

##        age gender Cortisol Apolipoprotein_A1 Apolipoprotein_A2
## 1 65.00000 Female      9.0         -7.130899        -0.6931472
## 2 65.15299 Female      9.8         -8.180721        -1.4271164
## 3 65.30548 Female      5.5         -7.621105        -1.0788097
## 4 65.30548 Female     12.0         -8.145630        -1.3862944
## 5 65.30548   Male     13.0         -7.354042        -0.1053605
## 6 65.60895   Male      9.8         -6.725434         0.4054651

We can create a new column as a function of the data in other columns with mutate():

cort <- mutate(cort, Apolipoprotein_ATotal = Apolipoprotein_A1 + Apolipoprotein_A2)
head(cort)

##        age gender Cortisol Apolipoprotein_A1 Apolipoprotein_A2
## 1 65.00000 Female      9.0         -7.130899        -0.6931472
## 2 65.15299 Female      9.8         -8.180721        -1.4271164
## 3 65.30548 Female      5.5         -7.621105        -1.0788097
## 4 65.30548 Female     12.0         -8.145630        -1.3862944
## 5 65.30548   Male     13.0         -7.354042        -0.1053605
## 6 65.60895   Male      9.8         -6.725434         0.4054651
##   Apolipoprotein_ATotal
## 1             -7.824046
## 2             -9.607837
## 3             -8.699915
## 4             -9.531924
## 5             -7.459403
## 6             -6.319969

But maybe it doesn’t make sense to have an Apolipoprotein_ATotal column, so let’s get rid of it. We can use select() again only this time with the ‘-’ operator to remove the column.

cort <- select(cort, -Apolipoprotein_ATotal)
head(cort)

##        age gender Cortisol Apolipoprotein_A1 Apolipoprotein_A2
## 1 65.00000 Female      9.0         -7.130899        -0.6931472
## 2 65.15299 Female      9.8         -8.180721        -1.4271164
## 3 65.30548 Female      5.5         -7.621105        -1.0788097
## 4 65.30548 Female     12.0         -8.145630        -1.3862944
## 5 65.30548   Male     13.0         -7.354042        -0.1053605
## 6 65.60895   Male      9.8         -6.725434         0.4054651

We can also get rid of rows we don’t need using filter(). This is great for getting rid of outliers. For the sake of this tutorial, let’s consider an outlier to be anyone with Cortisol more than 2 standard deviations from the mean.

cort$Cortisol_z <- scale(cort$Cortisol)[,1]
summary(cort)

##       age            gender       Cortisol     Apolipoprotein_A1
##  Min.   : 65.00   female:  2   Min.   : 0.10   Min.   :-8.680   
##  1st Qu.: 73.99   Female:202   1st Qu.: 9.50   1st Qu.:-7.775   
##  Median : 80.23   M     :  3   Median :12.00   Median :-7.470   
##  Mean   : 79.99   male  :  2   Mean   :11.68   Mean   :-7.484   
##  3rd Qu.: 86.05   Male  :124   3rd Qu.:14.00   3rd Qu.:-7.195   
##  Max.   :100.00                Max.   :29.00   Max.   :-6.166   
##  Apolipoprotein_A2   Cortisol_z      
##  Min.   :-1.9661   Min.   :-2.91771  
##  1st Qu.:-0.9676   1st Qu.:-0.54980  
##  Median :-0.6931   Median : 0.07996  
##  Mean   :-0.6462   Mean   : 0.00000  
##  3rd Qu.:-0.3285   3rd Qu.: 0.58377  
##  Max.   : 0.9555   Max.   : 4.36234

cort <- filter(cort, Cortisol_z > -2, Cortisol_z < 2)
summary(cort)

##       age            gender       Cortisol      Apolipoprotein_A1
##  Min.   : 65.00   female:  2   Min.   : 4.000   Min.   :-8.680   
##  1st Qu.: 73.99   Female:190   1st Qu.: 9.625   1st Qu.:-7.793   
##  Median : 80.12   M     :  3   Median :12.000   Median :-7.470   
##  Mean   : 79.80   male  :  2   Mean   :11.554   Mean   :-7.485   
##  3rd Qu.: 85.73   Male  :117   3rd Qu.:13.750   3rd Qu.:-7.209   
##  Max.   :100.00                Max.   :19.000   Max.   :-6.166   
##  Apolipoprotein_A2   Cortisol_z      
##  Min.   :-1.9661   Min.   :-1.93528  
##  1st Qu.:-0.9676   1st Qu.:-0.51831  
##  Median :-0.6832   Median : 0.07996  
##  Mean   :-0.6435   Mean   :-0.03227  
##  3rd Qu.:-0.3182   3rd Qu.: 0.52079  
##  Max.   : 0.9555   Max.   : 1.84329

Notice you can use multiple filtering expressions. In this case, the comma seperator serves as an implicit ‘&’. We could have used ‘|’ instead if we’d wanted to.

The gender column is a mess so let’s clean that up using R’s built in functions.

levels(cort$gender)[levels(cort$gender)=='female'] <- 'Female'
levels(cort$gender)[levels(cort$gender)=='male'] <- 'Male'
levels(cort$gender)[levels(cort$gender)=='M'] <- 'Male'
summary(cort$gender)

## Female   Male 
##    192    122

Let’s see how cortisol varies by gender. We can find this out by using the dplyr functions group_by() and summarize() or “summarise()” if you’re British ;)

cort.g <- group_by(cort, gender)
cort.s <- summarize(cort.g, Count=n(), AvgCortisol=mean(Cortisol))
cort.s

## Source: local data frame [2 x 3]
## 
##   gender Count AvgCortisol
##   (fctr) (int)       (dbl)
## 1 Female   192    11.24583
## 2   Male   122    12.04016

Notice we had to create an intermediate object, cort.g, to pass to summarize(). We can skip the explicit creation of this object by chaining together dplyr functions using the %>% operator.

cort.s <- group_by(cort, gender) %>% summarize(Count=n(), AvgCortisol=mean(Cortisol))

Whenever we chain dplyr functions together like this we omit the explicit first argument to successive functions. It’s still there; its just implicit.

There’s also sample_n() which is useful for drawing random samples of rows from a data frame.

sample_means <- rep(0,1000)
for (i in 1:1000){
  sample_means[i] <- mean(sample_n(cort, size=25)$Cortisol)
}

qplot(sample_means, binwidth=0.5)

That’s it for this tutorial. There are several other dplyr functions that I haven’t covered here but I hope this gives you a sense of how simple this handy little library is. Thank you for your time and interest!

A dplyr tutorial

Bradley Mitchell

June 17, 2016