25 March, 2020

Day 1 goals

Introduce R

  • What it is, and why that goofy name
  • Understand a bit about R’s capabilities and why people use it
  • Install R and RStudio and understand the difference between them
  • Start RStudio and look around

Do some computing in R

  • Objects: variables and functions
  • Sequences (vectors) and making comparisons

What is “R”?

R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

https://en.wikipedia.org/wiki/R_(programming_language)

R is named partly after the first names of the first two R authors and partly as a play on the name of [an earlier programming language called] S.

Why use it?

R offers powerful and flexible data processing, beautiful graphics, and the ability to analyze data more quickly and reproducibly than e.g. Excel.

Why use it?

R is command-line software: you type and the computer (well, R) computes, displays figures…whatever you order it to do.

RStudio is a popular interface for R. It sits on top of R and, for many people, makes it easier to interact with and use the underlying R.

Us <—> RStudio <—> R

Installation

Let’s look around RStudio

In particular, we’ll most use two parts of RStudio:

  • The console
  • The editor

Do something with R

We can type calculations into the console and R will immediately respond:

1 + 1
## [1] 2
2 * 3
## [1] 6
(135 + 2) ^ 3
## [1] 2571353

Do something with R

We can create a sequence (or a vector):

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

Oooooh. What will this do?

1:10 * 2
##  [1]  2  4  6  8 10 12 14 16 18 20

What R is doing here, applying the operation to every element of the sequence, is a fundamental characteristic of the language.
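
A minimal sketch of this behaviour, usually called vectorization. The variable names are illustrative only and not part of the workshop code.

```r
# Arithmetic on a vector is applied element by element
v <- 1:5
doubled <- v * 2        # 2 4 6 8 10
shifted <- v + 10       # 11 12 13 14 15

# Two vectors of the same length are combined position by position
pairwise <- v + c(10, 20, 30, 40, 50)   # 11 22 33 44 55

print(doubled)
print(shifted)
print(pairwise)
```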

Variables

Our first variable:

x <- 1   # assign the value "1" to the name "x"
x
## [1] 1
x * 3
## [1] 3
# R ignores anything after the '#' symbol
# These comments are a critical way to make code readable to PEOPLE

Variables

We can have as many variables as we like (though there are some rules about naming them). If we re-assign their values, the old value is lost.

my_variable <- 2 * 3 + 4
z <- my_variable
y <- 1000 + z
y <- 23
y
## [1] 23

Note that my_variable, x, y, and z all refer to separate objects, whether or not they have identical values.
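
A small sketch of what "separate objects" means in practice. The names `first` and `second` are made up for this illustration.

```r
first <- 5
second <- first   # 'second' now holds its own value of 5
first <- 100      # re-assigning 'first' does not touch 'second'

print(first)      # 100
print(second)     # 5
```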

Sequences

We saw 1:10 before but let’s look at sequences (or vectors) a bit more.

my_sequence <- 5:8
my_sequence[2:3]
## [1] 6 7

Another way of making a vector, the c() notation:

my_sequence <- c(5, 6, 7, 8)
my_sequence[2:3]
## [1] 6 7

Variables

Variables are often numeric, but don’t have to be.

a <- "hello"
a
## [1] "hello"
b <- TRUE
print(b)
## [1] TRUE

Functions

Wait, what was the print(b) on the previous slide?

Very broadly, there are two kinds of objects: variables and functions.

Functions take inputs (technically called parameters) and produce an output. print is a function; so is mean:

x <- c(1, 3, 4, 9)
mean(x)  # send 'x' to mean
## [1] 4.25

Functions

sum is similar. It takes a sequence of numbers and returns a single value:

x <- c(1, 3, 4, 9)
y <- x + 10
z <- sum(y)
print(z)
## [1] 57

There are lots of functions in R, and they are the fundamental workhorse of R computing.

Comparisons

The double equals (==) is the standard in computer programming for making comparisons.

1 == 1
## [1] TRUE
"ben" == "handsome"
## [1] FALSE

Ouch.

Comparisons

We can compare a sequence, producing another sequence:

x <- c(1, 3, 2, 2, 3)
x == 3
## [1] FALSE  TRUE FALSE FALSE  TRUE
any(x == 3) # the 'any' function returns TRUE or FALSE
## [1] TRUE
which(x == 3)
## [1] 2 5

Comparisons

What will this print? Why? Think about it before typing.

x <- c(1, 3, 2, 2, 3)
x_two_values <- which(x == 2)
x_two_values
## [1] 3 4
x[x_two_values]
## [1] 2 2
x[-x_two_values]
## [1] 1 3 3
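
As a possible shortcut, the comparison itself can go inside the brackets, skipping the which step. This is a sketch of that alternative, not a replacement for the code above; `twos` and `not_twos` are illustrative names.

```r
x <- c(1, 3, 2, 2, 3)

# A logical vector can be used directly inside the brackets
twos <- x[x == 2]        # keeps elements where the comparison is TRUE
not_twos <- x[x != 2]    # keeps everything else

print(twos)       # 2 2
print(not_twos)   # 1 3 3
```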

Summary

What have we learned today in the R language?

  • R as a calculator
  • Assigning variables
  • Variables can have different types (numeric, character, logical, …)
  • Variables can be sequences of values
  • Indexing sequences: x[2]
  • Functions
  • Making comparisons
  • Comments

Day 2 goals

  • Review
  • Making a script to run code
  • Control: if and for
  • Data frames
  • Functions

Review and remember…

  • The difference between R and RStudio
  • Vectors (sequences) of numbers or other types of data
  • Variables and how they behave
  • Types of data/variables: numeric, character, logical
  • Comparing data
  • Built-in functions and how they behave

Scripts

Create a new R script file in RStudio and let’s make a short program:

  x <- 1:10
  print(sum(x))
  print(paste("This is", x))
  
  plot(x)

If you click Source in the upper-right of the window, R will execute your commands one by one.

Scripts let us save work between sessions and generally make our lives easier.

Flow control

We often want to execute parts of our code multiple times, or make choices.

R has a number of standard programming constructs to do this.

x <- 5
if(x > 3) {
  # this block gets executed if the 'if' condition is TRUE
  print("Greater than 3")
} else {
  # ...and this if it's FALSE
  print("Less than or equal to 3")
}
## [1] "Greater than 3"

Flow control

for loops execute blocks of code multiple times.

This is very common in programming. It’s less used in R, because many R operations already work on a whole vector at once (e.g. 1:10 + 1), but it’s still sometimes necessary or just handy.

for(i in 1:10) {
  print(paste("This is", i))   # we don't need a for loop to do this!
}
## [1] "This is 1"
## [1] "This is 2"
## [1] "This is 3"
## [1] "This is 4"
## [1] "This is 5"
## [1] "This is 6"
## [1] "This is 7"
## [1] "This is 8"
## [1] "This is 9"
## [1] "This is 10"
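
For comparison, here is a sketch of the loop-free way to build the same ten strings. The variable name `messages` is made up for this example.

```r
# paste() is vectorized, so it builds all ten strings in one call
messages <- paste("This is", 1:10)
print(messages)
```

Note that print shows the whole vector at once here, rather than one line per string as the loop does.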

Data frames

For many people this is the single most common data type they use in R.

It’s a table of data, where each column can be a different type.
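
To make this concrete, here is a sketch that builds a tiny data frame by hand. The `pets` data are invented for illustration.

```r
# A small, made-up data frame: each column is a vector of a single type
pets <- data.frame(
  name = c("Rex", "Tom", "Polly"),     # character
  legs = c(4, 4, 2),                   # numeric
  can_fly = c(FALSE, FALSE, TRUE),     # logical
  stringsAsFactors = FALSE             # keep text as character, not factor
)

str(pets)     # shows the type of each column
nrow(pets)    # number of rows
ncol(pets)    # number of columns
```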

Data frames

The cars dataset is one of many that come with R.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
dim(cars)  # dimensions
## [1] 50  2
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
plot(cars$speed, cars$dist)

Data frames

Accessing rows and columns and entries of a data frame:

cars[1,]
##   speed dist
## 1     4    2
cars[,1]
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25
cars[1,2]
## [1] 2

Data frames

Accessing rows and columns and entries of a data frame:

cars$speed
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25
mean(cars$dist)
## [1] 42.98
cars[1:3,]
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4

Data frames

We saw plotting in base R briefly a few slides ago:

plot(cars$speed, cars$dist)

But we can make a much nicer plot…so it’s time to use our first R package!

What’s a package?

R packages are collections of functions and data sets developed by the community. They increase the power of R by improving existing base R functionalities, or by adding new ones.

(This DataCamp page has a good introduction.)

library(ggplot2)

You may need to install ggplot2 first.

Plotting cars

library(ggplot2)  # load the 'ggplot2' package into our current R session
qplot(speed, dist, data = cars)  # qplot is a function in ggplot2

Plotting cars

qplot(speed, dist, color = speed, data = cars)

Plotting cars

qplot(speed, dist, size = dist, data = cars)

Plotting cars

qplot(speed, dist, color = speed, size = dist, data = cars)

Plotting cars

As long as we’re talking about cars…

  1. Take a look at the mpg dataset (included with ggplot2)
  2. Use summary on it
  3. Use qplot to plot. Try plotting displacement (on the x axis) versus city mileage (y axis), coloring by class of car
  4. Try other plots!

Remember, qplot is part of the ggplot2 package.

“A package is a like a book, a library is like a library; you use library() to check a package out of the library.” Source

Making our own functions

As we’ve seen, R has lots of built-in functions. These

  • Take zero or more inputs
  • Return zero or one outputs
sum(1, 2) # 2 inputs -> 1 output
print(1)  # 1 input -> displayed on screen (print also returns its input invisibly)

But we can also write our own functions.

Making our own functions

extract_row_1 <- function(x) {
  return(x[1,])
}

extract_row_1(cars)
##   speed dist
## 1     4    2
extract_row_1(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa

Making our own functions

Functions have their own “scope”. This is important.

x <- 1
f <- function(x) {
  print(x)
}
f(2)  # what gets printed?

Another example:

x <- 1
y <- 2
f <- function(x) {
  y <- x
}
f(2)  # what are the values of x and y after this?
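
One way to check your answer to the second example is to run something like the sketch below. It uses 10 rather than 2 as the input (a change made here so the values are easy to tell apart) and adds an explicit return value so there is something to print.

```r
x <- 1
y <- 2
f <- function(x) {
  y <- x    # this 'y' exists only inside the function
  y         # the value of the last expression is returned
}

result <- f(10)
print(result)   # 10: inside f, 'x' is the argument we passed in
print(x)        # 1: the x outside the function is untouched
print(y)        # 2: the y outside the function is untouched as well
```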

Exercises

Write a function that takes two parameters, x and n, and returns the nth row of x.

extract_row <- function(x, n) {
  return(x[n,])
}

extract_row(cars, 2)
##   speed dist
## 2     4   10

Exercises: Fibonacci

Write a function that computes the nth Fibonacci number.

This is a classic problem in computer science. It’s also slightly tricky, and interesting…let’s do it together.

Exercises: Fibonacci

fibonacci <- function(n) {
  if(n < 0) {
    stop("n has to be zero or greater")
  } else if(n == 0) {
    return(0)
  } else if(n == 1) {
    return(1)
  } else {
    return(fibonacci(n - 1) + fibonacci(n - 2))
  }
}

Whoah! A function calling itself. This is called recursion.
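
Recursion is elegant, but this recursive version recomputes the same values many times, so it gets slow as n grows. Below is a sketch of a loop-based alternative. It is not the workshop's solution, and the name `fibonacci_loop` is made up here.

```r
# Iterative alternative: build up from F(0) and F(1) instead of recursing
fibonacci_loop <- function(n) {
  if (n < 0) {
    stop("n has to be zero or greater")
  }
  previous <- 0   # F(0)
  current <- 1    # F(1)
  if (n == 0) {
    return(previous)
  }
  # seq_len(n - 1) is empty when n is 1, so the loop is skipped in that case
  for (i in seq_len(n - 1)) {
    next_value <- previous + current
    previous <- current
    current <- next_value
  }
  return(current)
}

fibonacci_loop(12)
```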

Exercises: Fibonacci

fibonacci(2)
## [1] 1
fibonacci(3)
## [1] 2
fibonacci(12)
## [1] 144
# try this: fibonacci(-1)

Day 3 goals

  • Review
  • RStudio projects
  • Reading data into R
  • The aggregate function
  • More plotting
  • Saving graphs and data

Review and remember…

  • Types of variables and how they behave
  • Functions, both built in and ones we write
  • Scripts - writing, saving, running
  • Program flow: if...else and for
  • Vectors and data frames - bracket subsetting
  • Missing values: NA
  • Packages
  • Visualizing data using ggplot2

RStudio projects

RStudio projects make it straightforward to divide your work into multiple contexts.

RStudio projects are associated with R working directories. You can create an RStudio project:

  • In a brand new directory
  • In an existing directory where you already have R code and data
  • By cloning a version control (Git or Subversion) repository

Let’s create a new project for our work today.

RStudio projects

The working directory of our project is set to the project folder.

getwd()
## [1] "/Users/d3x290/Dropbox/Documents/Work/Mini-projects/2020/Rworkshops"
list.files()
##  [1] "deniro.csv"              "deniro.pdf"             
##  [3] "deniro.R"                "good_deniro_films.csv"  
##  [5] "images"                  "Introduction_to_R_files"
##  [7] "Introduction_to_R.html"  "Introduction_to_R.Rmd"  
##  [9] "LICENSE"                 "README.md"              
## [11] "rsconnect"               "Rworkshops.Rproj"

All reading and writing takes place starting here.

You can set the working directory (see ?setwd) but I strongly discourage this.

Reading data into R

Data files come in lots of different formats:

  • image files
  • tabular data
  • binary data
  • text

Which you use depends on your task, the format the dataset is provided in, and other factors.

Reading data into R

Tabular data are a perfect match for our data frames, and are very common, so let’s start with that. The most common form of these is CSV (comma separated values).

Download a sample data file:

https://github.com/JGCRI/Rworkshops

Note: I put this file in the repository for these workshops. The original file is here.

Reading data into R

What does deniro.csv look like?

## "Year", "Score", "Title"
## 1968,  86, "Greetings"
## 1970,  17, "Bloody Mama"
## 1970,  73, "Hi, Mom!"
## 1971,  40, "Born to Win"

We use the read.csv function to read it into R:

deniro <- read.csv("deniro.csv")

Note that read.csv can also read directly from the Internet!

deniro <- read.csv("https://people.sc.fsu.edu/~jburkardt/data/csv/deniro.csv")
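
If you want to experiment without downloading anything, read.csv also accepts CSV content through its text argument. A minimal sketch, using invented film data rather than the real deniro.csv:

```r
# Made-up CSV content, supplied as text instead of a file on disk
csv_text <- "Year,Score,Title
1999,75,Example Film
2001,42,Another Example"

films <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(films)   # Year and Score are read as integers, Title as character
```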

Exploring deniro

Use head, tail, and summary to look at the De Niro data.

Use bracket notation to look at which films he made in 1990:

which_years_are_1990 <- deniro$Year == 1990
deniro[which_years_are_1990,]
##    Year Score           Title
## 26 1990    88      Awakenings
## 27 1990    29  Stanley & Iris
## 28 1990    96      Goodfellas

Exploring deniro

Another way to do this is using R’s subset command.

subset(deniro, Year == 1990)
##    Year Score           Title
## 26 1990    88      Awakenings
## 27 1990    29  Stanley & Iris
## 28 1990    96      Goodfellas

Play around with subset a bit. Subset his ‘good’ and ‘bad’ films, recent and old ones.

A script

Let’s write a script that

  • reads in deniro
  • prints a summary of the dataset
  • assigns a good/ok/bad label to each movie
  • plots scores by year, coloring by category
  • saves the plot
  • calculates how many movies there are in each category (this is new)
  • saves his good films (Rotten Tomatoes score >80)

Read in deniro and summarize

Hint: use read.csv and summary

deniro <- read.csv("deniro.csv")

print(summary(deniro))
##       Year          Score                 Title   
##  Min.   :1968   Min.   :  4.0    15 Minutes  : 1  
##  1st Qu.:1988   1st Qu.: 38.0    1900        : 1  
##  Median :1997   Median : 65.0    A Bronx Tale: 1  
##  Mean   :1996   Mean   : 58.2    Analyze That: 1  
##  3rd Qu.:2007   3rd Qu.: 80.0    Analyze This: 1  
##  Max.   :2016   Max.   :100.0    Angel Heart : 1  
##                                 (Other)      :81

Assign labels to each movie

Hint: use a logical vector

print(paste("Total movies:", nrow(deniro)))
## [1] "Total movies: 87"
goods <- deniro$Score > 80
print(paste("Good movies:", sum(goods)))
## [1] "Good movies: 21"
bads <- deniro$Score < 20
print(paste("Bad movies:", sum(bads)))
## [1] "Bad movies: 11"
# Assign good/okay/bad categories

deniro$Category <- "Okay"
deniro$Category[goods] <- "Good"
deniro$Category[bads] <- "Bad"
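
The same labelling can also be sketched with nested ifelse calls, which some people find easier to read. The scores below are hypothetical, chosen to sit on both sides of each cutoff.

```r
# Hypothetical scores, not taken from deniro.csv
scores <- c(95, 80, 50, 20, 5)

# Same thresholds as above: > 80 is Good, < 20 is Bad, the rest are Okay
category <- ifelse(scores > 80, "Good",
                   ifelse(scores < 20, "Bad", "Okay"))
print(category)
```

Scores of exactly 80 or exactly 20 end up as "Okay", because the comparisons are strict.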

Plot our data!

We’ll use qplot in the ggplot2 package

library(ggplot2)
print(qplot(Year, Score, color = Category, data = deniro))

ggsave("deniro.pdf")
## Saving 7.5 x 4 in image

Summarize the data

This is something we haven’t seen before.

howmany <- aggregate(Score ~ Category, data = deniro, FUN = length)
print(howmany)
##   Category Score
## 1      Bad    11
## 2     Good    21
## 3     Okay    55
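
To see what aggregate is doing, it can help to try it on a data frame small enough to check by eye. The `toy` data below are invented for this sketch.

```r
# Toy data standing in for deniro (values are invented)
toy <- data.frame(
  Score = c(90, 85, 50, 10, 60),
  Category = c("Good", "Good", "Okay", "Bad", "Okay"),
  stringsAsFactors = FALSE
)

# Count rows per category; the count lands in the 'Score' column
counts <- aggregate(Score ~ Category, data = toy, FUN = length)
print(counts)

# Swapping FUN changes what is computed for each group
means <- aggregate(Score ~ Category, data = toy, FUN = mean)
print(means)
```

The formula reads as "Score, grouped by Category". FUN is applied to the Score values within each group.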

Save the good movies

Because we need something to watch while in quarantine.

good_movies <- subset(deniro, goods)
write.csv(good_movies, "good_deniro_films.csv")
list.files()
##  [1] "deniro.csv"              "deniro.pdf"             
##  [3] "deniro.R"                "good_deniro_films.csv"  
##  [5] "images"                  "Introduction_to_R_files"
##  [7] "Introduction_to_R.html"  "Introduction_to_R.Rmd"  
##  [9] "LICENSE"                 "README.md"              
## [11] "rsconnect"               "Rworkshops.Rproj"

Day 4 goals

  • Review
  • Dipping a toe into the tidyverse

Review and remember…

Types of variables, vectors, data frames

x <- 1
y <- x  # creates a separate copy
z <- 100:1
my_df <- data.frame(a = c("a", "b", "c"), b = 1:3)
length(x)
## [1] 1
length(z)
## [1] 100

Review and remember…

Bracket subsetting and subset

z[12]
## [1] 89
my_df[1,]
##   a b
## 1 a 1
my_df$a
## [1] a b c
## Levels: a b c

Review and remember…

Functions and scope

times_two <- function(x) {
  x * 2
}

x <- 10
times_two(3)
## [1] 6

Review and remember…

Program control flow

if(x > 5) {
  print("more than 5")
} else {
  print("less than or equal to 5")
}
## [1] "more than 5"

Review and remember…

Missing values

x <- c(1, 2, NA, 4, 5)
sum(x)
## [1] NA
sum(x, na.rm = TRUE)
## [1] 12
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE

Review and remember…

Packages

library(ggplot2)  # loads the ggplot2 package into our current R session
qplot(speed, dist, data = cars)

Review and remember…

Reading data into R

deniro <- read.csv("deniro.csv")
dim(deniro)
## [1] 87  3
summary(deniro)
##       Year          Score                 Title   
##  Min.   :1968   Min.   :  4.0    15 Minutes  : 1  
##  1st Qu.:1988   1st Qu.: 38.0    1900        : 1  
##  Median :1997   Median : 65.0    A Bronx Tale: 1  
##  Mean   :1996   Mean   : 58.2    Analyze That: 1  
##  3rd Qu.:2007   3rd Qu.: 80.0    Analyze This: 1  
##  Max.   :2016   Max.   :100.0    Angel Heart : 1  
##                                 (Other)      :81

Review and remember…

Aggregating data

by_year <- aggregate(Title ~ Year, data = deniro, FUN = length)
qplot(Year, Title, data = by_year, geom = "line", ylab = "# of movies")

The tidyverse

The last slide showed that aggregate has problems: you can only compute one thing at a time; the resulting columns might not be what you want. Also, it can be kind of slow.

This and other problems prompted the development of a related group of R packages known as the tidyverse that have become very popular.

dplyr

The dplyr package uses verbs (functions) to operate on tibbles (data frames).

some_data_frame %>%
  do_something() %>%
  do_something_else() %>%
  finish_up()

This says, “take some data frame, do something; then take the results, and do something else; then take THOSE results, and finish up”.

pipes

dplyr uses %>%, the pipe operator, which allows us to pipe an object forward into a function or call expression. Pipes make it easy to see each step of a multi-step process.

Note that x %>% f is usually equivalent to f(x):

cars %>% nrow()
## [1] 50
cars %>% nrow() %>% length()
## [1] 1

Using dplyr: filtering rows

dplyr provides a filter function:

# You may need to `install.packages("dplyr")` first.
library(dplyr)
cars %>%
  filter(speed < 10)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Using dplyr: filtering rows

dplyr provides a filter function:

cars %>%
  filter(speed < 10, dist > 20)
##   speed dist
## 1     7   22

This selects rows based on particular condition(s).

Using dplyr: filtering rows

dplyr provides a filter function:

cars %>%
  filter(speed < 10 | dist > 20)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   26
## 8     10   34
## 9     11   28
## 10    12   24
## 11    12   28
## 12    13   26
## 13    13   34
## 14    13   34
## 15    13   46
## 16    14   26
## 17    14   36
## 18    14   60
## 19    14   80
## 20    15   26
## 21    15   54
## 22    16   32
## 23    16   40
## 24    17   32
## 25    17   40
## 26    17   50
## 27    18   42
## 28    18   56
## 29    18   76
## 30    18   84
## 31    19   36
## 32    19   46
## 33    19   68
## 34    20   32
## 35    20   48
## 36    20   52
## 37    20   56
## 38    20   64
## 39    22   66
## 40    23   54
## 41    24   70
## 42    24   92
## 43    24   93
## 44    24  120
## 45    25   85

This selects rows based on particular condition(s).
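
For comparison with Day 2, the filters above can be sketched in base R with bracket subsetting. Separating conditions with a comma inside filter behaves like & (AND). The names `both` and `either` are made up for this example.

```r
# Base R equivalents of the filters above, using bracket subsetting
both <- cars[cars$speed < 10 & cars$dist > 20, ]     # AND: both must hold
either <- cars[cars$speed < 10 | cars$dist > 20, ]   # OR: at least one holds

nrow(both)     # 1 row, matching the filter(speed < 10, dist > 20) output
nrow(either)   # 45 rows, matching the filter(speed < 10 | dist > 20) output
```

One difference to be aware of: the base R results keep the original row names, whereas the dplyr output is renumbered from 1.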

Using dplyr: selecting columns

dplyr::select picks out particular columns, dropping the others:

cars %>%
  select(speed)
##    speed
## 1      4
## 2      4
## 3      7
## 4      7
## 5      8
## 6      9
## 7     10
## 8     10
## 9     10
## 10    11
## 11    11
## 12    12
## 13    12
## 14    12
## 15    12
## 16    13
## 17    13
## 18    13
## 19    13
## 20    14
## 21    14
## 22    14
## 23    14
## 24    15
## 25    15
## 26    15
## 27    16
## 28    16
## 29    17
## 30    17
## 31    17
## 32    18
## 33    18
## 34    18
## 35    18
## 36    19
## 37    19
## 38    19
## 39    20
## 40    20
## 41    20
## 42    20
## 43    20
## 44    22
## 45    23
## 46    24
## 47    24
## 48    24
## 49    24
## 50    25

Using dplyr: selecting columns

dplyr::select picks out particular columns, dropping the others:

cars %>%
  select(dist)
##    dist
## 1     2
## 2    10
## 3     4
## 4    22
## 5    16
## 6    10
## 7    18
## 8    26
## 9    34
## 10   17
## 11   28
## 12   14
## 13   20
## 14   24
## 15   28
## 16   26
## 17   34
## 18   34
## 19   46
## 20   26
## 21   36
## 22   60
## 23   80
## 24   20
## 25   26
## 26   54
## 27   32
## 28   40
## 29   32
## 30   40
## 31   50
## 32   42
## 33   56
## 34   76
## 35   84
## 36   36
## 37   46
## 38   68
## 39   32
## 40   48
## 41   52
## 42   56
## 43   64
## 44   66
## 45   54
## 46   70
## 47   92
## 48   93
## 49  120
## 50   85

Using dplyr: selecting columns

dplyr::select picks out particular columns, dropping the others:

cars %>%
  select(-dist)
##    speed
## 1      4
## 2      4
## 3      7
## 4      7
## 5      8
## 6      9
## 7     10
## 8     10
## 9     10
## 10    11
## 11    11
## 12    12
## 13    12
## 14    12
## 15    12
## 16    13
## 17    13
## 18    13
## 19    13
## 20    14
## 21    14
## 22    14
## 23    14
## 24    15
## 25    15
## 26    15
## 27    16
## 28    16
## 29    17
## 30    17
## 31    17
## 32    18
## 33    18
## 34    18
## 35    18
## 36    19
## 37    19
## 38    19
## 39    20
## 40    20
## 41    20
## 42    20
## 43    20
## 44    22
## 45    23
## 46    24
## 47    24
## 48    24
## 49    24
## 50    25

Using dplyr: sorting data

dplyr::arrange sorts data:

cars %>%
  arrange(dist)
##    speed dist
## 1      4    2
## 2      7    4
## 3      4   10
## 4      9   10
## 5     12   14
## 6      8   16
## 7     11   17
## 8     10   18
## 9     12   20
## 10    15   20
## 11     7   22
## 12    12   24
## 13    10   26
## 14    13   26
## 15    14   26
## 16    15   26
## 17    11   28
## 18    12   28
## 19    16   32
## 20    17   32
## 21    20   32
## 22    10   34
## 23    13   34
## 24    13   34
## 25    14   36
## 26    19   36
## 27    16   40
## 28    17   40
## 29    18   42
## 30    13   46
## 31    19   46
## 32    20   48
## 33    17   50
## 34    20   52
## 35    15   54
## 36    23   54
## 37    18   56
## 38    20   56
## 39    14   60
## 40    20   64
## 41    22   66
## 42    19   68
## 43    24   70
## 44    18   76
## 45    14   80
## 46    18   84
## 47    25   85
## 48    24   92
## 49    24   93
## 50    24  120

Using dplyr: summarizing

Thinking back to the typical data pipeline, we often want to summarize data by groups as an intermediate or final step. For example, for each subgroup we might want to:

  • Compute mean, max, min, etc. (n->1)
  • Compute rolling mean and other window functions (n->n)
  • Fit models and extract their parameters, goodness of fit, etc.

For example: what is the average age of JGCRI staff, by sex?

Using dplyr: summarizing

dplyr verbs become particularly powerful when used in conjunction with groups we define in the dataset. This doesn’t change the data but instead groups it, ready for the next operation we perform.

cars %>%
  group_by(speed)
## # A tibble: 50 x 2
## # Groups:   speed [19]
##    speed  dist
##    <dbl> <dbl>
##  1     4     2
##  2     4    10
##  3     7     4
##  4     7    22
##  5     8    16
##  6     9    10
##  7    10    18
##  8    10    26
##  9    10    34
## 10    11    17
## # … with 40 more rows

Using dplyr: summarizing

Once data are grouped we can use summarise:

cars %>%
  group_by(speed) %>% 
  summarise(avg_dist = mean(dist))
## # A tibble: 19 x 2
##    speed avg_dist
##    <dbl>    <dbl>
##  1     4      6  
##  2     7     13  
##  3     8     16  
##  4     9     10  
##  5    10     26  
##  6    11     22.5
##  7    12     21.5
##  8    13     35  
##  9    14     50.5
## 10    15     33.3
## 11    16     36  
## 12    17     40.7
## 13    18     64.5
## 14    19     50  
## 15    20     50.4
## 16    22     66  
## 17    23     54  
## 18    24     93.8
## 19    25     85

Using dplyr: summarizing

Once data are grouped we can use summarise:

cars %>%
  group_by(speed) %>% 
  summarise(n = n(),
            avg_dist = mean(dist))
## # A tibble: 19 x 3
##    speed     n avg_dist
##    <dbl> <int>    <dbl>
##  1     4     2      6  
##  2     7     2     13  
##  3     8     1     16  
##  4     9     1     10  
##  5    10     3     26  
##  6    11     2     22.5
##  7    12     4     21.5
##  8    13     4     35  
##  9    14     4     50.5
## 10    15     3     33.3
## 11    16     2     36  
## 12    17     3     40.7
## 13    18     4     64.5
## 14    19     3     50  
## 15    20     5     50.4
## 16    22     1     66  
## 17    23     1     54  
## 18    24     4     93.8
## 19    25     1     85

Verbalize: how does split-apply-combine apply here?
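
As a hint, the three steps can be spelled out by hand in base R. This is only a sketch to make the idea visible, not how dplyr works internally. The names `pieces`, `piece_means` and `avg_by_speed` are invented here.

```r
# Split: one vector of distances per speed value
pieces <- split(cars$dist, cars$speed)

# Apply: compute the mean of each piece
piece_means <- sapply(pieces, mean)

# Combine: put the results back into a single data frame
avg_by_speed <- data.frame(
  speed = as.numeric(names(piece_means)),
  avg_dist = as.numeric(piece_means)
)
head(avg_by_speed)
```

In the dplyr version, group_by plays the role of the split step, and summarise does the apply and combine steps.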

babynames

Install the babynames package, and take a look.

library(babynames)
babynames
## # A tibble: 1,924,665 x 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows

Explore babynames a bit. How many rows, columns does it have? How many unique names?

babynames

Let’s use dplyr to calculate the number of babies born each year.

babynames %>% 
  group_by(year) %>% 
  summarise(babies = sum(n))
## # A tibble: 138 x 2
##     year babies
##    <dbl>  <int>
##  1  1880 201484
##  2  1881 192696
##  3  1882 221533
##  4  1883 216946
##  5  1884 243462
##  6  1885 240854
##  7  1886 255317
##  8  1887 247394
##  9  1888 299473
## 10  1889 288946
## # … with 128 more rows

babynames

Let’s use dplyr to calculate the number of babies born each year.

babynames %>% 
  group_by(year) %>% 
  summarise(babies = sum(n)) %>% 
  ggplot(aes(year, babies)) + geom_line()

babynames

Let’s use dplyr to calculate the number of babies born each year, this time split by sex.

babynames %>% 
  group_by(year, sex) %>% 
  summarise(babies = sum(n)) %>% 
  ggplot(aes(year, babies, color = sex)) + geom_line()

babynames

Exercise.

Filter babynames for your name and plot the number of kids born by year.

babynames

babynames %>% 
  filter(name == "Benjamin") %>% 
  group_by(year, sex) %>% 
  summarise(babies = sum(n)) %>% 
  ggplot(aes(year, babies, color = sex)) + geom_line() + ggtitle("Benjamin")

babynames

Now plot the prop column (proportion of all names) for your name.

babynames %>% 
  filter(name == "Benjamin") %>% 
  ggplot(aes(year, prop, color = sex)) + geom_line() + ggtitle("Benjamin")

babynames

We can wrap up by playing a bit more with dplyr and babynames (no slides, depends on time).

Thanks for attending this introduction to R workshop! I hope it was useful.

These slides are available at https://rpubs.com/bpbond/581300.