Challenge 3

https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv

Download the file (CTRL + S, right mouse click -> “Save as”, or File -> “Save page as”)
Make sure it’s saved under the name gapminder_data.csv
Save the file in the data/ folder within your project.

We will load and inspect these data later.

Challenge 4

It is useful to get some general idea about the dataset, directly from the command line, before loading it into R. Understanding the dataset better will come in handy when making decisions on how to load it in R. Use the command-line shell to answer the following questions:

What is the size of the file?
How many rows of data does it contain?
What kinds of values are stored in this file?

Challenge 1

Look at the help for the c function. What kind of vector do you expect you will create if you evaluate the following:

c(1, 2, 3)
c('d', 'e', 'f')
c(1, 2, 'f')

Challenge 2

Look at the help for the paste function. You’ll need to use this later. What is the difference between the sep and collapse arguments?

Challenge 3

Use help to find a function (and its associated parameters) that you could use to load data from a tabular file in which columns are delimited with “\t” (tab) and the decimal point is a “.” (period). This check for the decimal separator is important, especially if you are working with international colleagues, because different countries have different conventions for the decimal point (i.e., comma vs. period). Hint: use ??"read table" to look up functions related to reading in tabular data.

Code to create first dataset

cats <- data.frame(coat = c("calico", "black", "tabby"),    
                    weight = c(2.1, 5.0, 3.2),    
                    likes_string = c(1, 0, 1))    
write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)

Read dataset

cats <- read.csv(file = "data/feline-data.csv", stringsAsFactors = TRUE)
cats

Five Main data types

double
integer
complex
logical
character
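
One way to see the five types side by side is to ask typeof() about one value of each kind. This is a minimal sketch; the example values and the names examples and types are arbitrary and not part of the lesson data.

```r
# One example value for each of the five basic types (values are arbitrary)
examples <- list(3.14, 1L, 1 + 1i, TRUE, "banana")

# typeof() reports the basic type; sapply() applies it to every list element
types <- sapply(examples, typeof)
print(types)
# [1] "double"    "integer"   "complex"   "logical"   "character"
```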

Content for second dataset

Create new text file (File -> New File -> Text File)

Copy and paste:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1

Save file as “data/feline-data_v2.csv”

Load new cats data

cats <- read.csv(file="data/feline-data_v2.csv", stringsAsFactors = TRUE)
typeof(cats$weight)

Data structure

A structure which R knows how to build out of the basic data types

For example, data.frame

Vectors

an ordered list of things
all things must be the same basic data type
an empty vector created with vector() defaults to type logical (TRUE/FALSE)
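
The default type can be checked by creating empty vectors with vector(). A small sketch; the names default_vec and char_vec are made up for illustration.

```r
# With no mode given, vector() fills the new vector with FALSE (logical)
default_vec <- vector(length = 3)
typeof(default_vec)   # "logical"

# Asking for another mode gives that type's empty value instead
char_vec <- vector(mode = "character", length = 3)
typeof(char_vec)      # "character"
```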

Data frame columns are all vectors

Discussion 1

Why is R so opinionated about what we put in our columns of data? How does this help us?

Coercion

When we create a data frame, R will choose a data type for each column and force the elements to be that data type

Coercion rules (converted from top to bottom):

logical
integer
double (also called numeric)
complex
character
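
The coercion order above can be observed by combining two neighbouring types with c() and checking the result. This is an illustrative sketch with made-up values and variable names.

```r
# c() must return a single type, so every element is converted to
# whichever of the combined types comes later in the list above
logical_integer   <- c(TRUE, 2L)      # TRUE becomes 1L
integer_double    <- c(2L, 3.5)
double_complex    <- c(3.5, 1 + 0i)
complex_character <- c(1 + 0i, "a")

typeof(logical_integer)     # "integer"
typeof(integer_double)      # "double"
typeof(double_complex)      # "complex"
typeof(complex_character)   # "character"
```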

Challenge 1

Start by making a vector with the numbers 1 through 26. Multiply the vector by 2, and give the resulting vector names A through Z (hint: there is a built-in vector called LETTERS).

Factors

useful when character vectors represent a small set of categories
factors are really integers underneath with associated labels or “levels”
can manually specify the order of levels with the levels argument
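
The three points above can be sketched with a small, hypothetical character vector (sizes is not part of the lesson data):

```r
# A character vector with a small set of categories (hypothetical values)
sizes <- c("small", "large", "medium", "small")

# By default, levels are sorted alphabetically
default_factor <- factor(sizes)
levels(default_factor)    # "large" "medium" "small"

# The levels argument sets the order explicitly
custom_factor <- factor(sizes, levels = c("small", "medium", "large"))
levels(custom_factor)     # "small" "medium" "large"

# Underneath, a factor is a vector of integer codes pointing at the levels
typeof(custom_factor)      # "integer"
as.integer(custom_factor)  # 1 3 2 1
```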

Challenge 2

Is there a factor in our cats data.frame? What is its name? Try using ?read.csv to figure out how to keep text columns as character vectors instead of factors; then write a command or two to show that the factor in cats is actually a character vector when loaded in this way.

Solution to Challenge 2

Lists

can contain anything
surprise: a data frame is a list! (but all elements are vectors of the same length)
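
A short sketch of both points, using a made-up list (mixed) and a made-up data frame (pets):

```r
# A list can hold elements of different types
mixed <- list(1, "a", TRUE, 1 + 4i)
element_types <- sapply(mixed, typeof)
print(element_types)   # "double" "character" "logical" "complex"

# A data frame is a list whose elements (columns) are equal-length vectors
pets <- data.frame(name = c("Rex", "Tom"), legs = c(4, 4))
typeof(pets)    # "list"
is.list(pets)   # TRUE
length(pets)    # 2, the number of columns
```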

Retrieve parts of the data frame

Just for columns: data_frame_name$column_name

this works for all lists where elements have names

data_frame_name[row_number, column_number]

a single row is still a data frame; a single column is just a vector
can leave either number blank to get full row or full column
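
The access patterns above might look like this on a small, hypothetical data frame. The name pets and its contents are assumptions for illustration; stringsAsFactors = FALSE is set so the result is the same in every R version.

```r
pets <- data.frame(name = c("Rex", "Tom", "Ada"),
                   legs = c(4, 4, 2),
                   stringsAsFactors = FALSE)

legs_col  <- pets$legs   # column by name: a numeric vector
one_cell  <- pets[2, 1]  # row 2, column 1: "Tom"
one_row   <- pets[2, ]   # row 2, all columns: still a data frame
first_col <- pets[, 1]   # all rows, column 1: a character vector

class(one_row)     # "data.frame"
class(first_col)   # "character"
```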

Challenge 3

There are several subtly different ways to call variables, observations and elements from data.frames:

cats[1]
cats[[1]]
cats$coat
cats["coat"]
cats[1, 1]
cats[, 1]
cats[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function typeof() to examine what is returned in each case.

Matrices

two-dimensional data structure, like data.frames
can still use matrix[row_number,col_number] to access elements
difference: all elements must be same data type
can build a matrix from a single value or an existing vector
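
A brief sketch of building matrices both ways and of the single-type rule. The object names are made up, and the fill order is deliberately left for the challenges to explore.

```r
# Build a matrix from a single value (repeated to fill every cell)
zeros <- matrix(0, nrow = 3, ncol = 6)
dim(zeros)         # 3 6
zeros[2, 5]        # 0, accessed as [row_number, col_number]

# Build a matrix from an existing vector
from_vector <- matrix(1:18, nrow = 3, ncol = 6)
dim(from_vector)   # 3 6

# All elements share one type: a single string coerces the rest to character
mixed <- matrix(c(1, 2, "a", 4), nrow = 2)
typeof(mixed)      # "character"
```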

Final set of data structures

data.frame
vector
factor
list
matrix

Challenge 4

What do you think will be the result of length(matrix_example)? Try it. Were you right? Why / why not?

Challenge 5

Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for matrix!)

Challenge 6

Create a list of length two containing a character vector for each of the sections in this part of the workshop:

Data types
Data structures

Populate each character vector with the names of the data types and data structures we’ve seen so far.

Challenge 7

Consider the R output of the matrix below:

     [,1] [,2]
[1,]    4    1
[2,]    9    5
[3,]   10    7

What was the correct command used to write this matrix? Examine each command and try to figure out the correct one before typing them. Think about what matrices the other commands will produce.

matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)

5. Exploring Data Frames

Challenge 1

Let’s imagine that 1 cat year is equivalent to 7 human years.

Create a vector called human_age by multiplying cats$age by 7.
Convert human_age to a factor.
Convert human_age back to a numeric vector using the as.numeric() function. Now divide it by 7 to get the original ages back. Explain what happened.

Challenge 2

You can create a new data frame right from within R with the following syntax:

df <- data.frame(id = c("a", "b", "c"),
                 x = 1:3,
                 y = c(TRUE, TRUE, FALSE),
                 stringsAsFactors = FALSE)

Make a data frame that holds the following information for yourself:

first name
last name
lucky number

Then use rbind to add an entry for the people sitting beside you. Finally, use cbind to add a column with each person’s answer to the question, “Is it time for coffee break?”

Challenge 3

It’s good practice to also check the last few lines of your data and some in the middle. How would you do this?

Searching for ones specifically in the middle isn’t too hard, but we could ask for a few lines at random. How would you code this?

Challenge 4

Go to File -> New File -> R Script, and write an R script to load in the gapminder dataset. Put it in the scripts/ directory and add it to version control.

Run the script using the source function, using the file path as its argument (or by pressing the “source” button in RStudio).

Challenge 5

Read the output of str(gapminder) again; this time, use what you’ve learned about factors, lists and vectors, as well as the output of functions like colnames and dim to explain what everything that str prints out for gapminder means. If there are any parts you can’t interpret, discuss with your neighbors!

6. Subsetting Data

Square Brackets

It may look different, but the square brackets operator ([]) is a function. For vectors (and matrices), it works as “get me the nth element,” where n is a number that goes inside the brackets.

x[1]

Multiple elements

Use a numerical vector to ask for multiple elements by their positions

x[c(1, 3)]

x[c(1,1,3)]

x[1:4]

(Note: the : operator creates a sequence of numbers from the left element to the right.)

Skipping elements

Use - to skip an element or set of elements by position

x[-4]

x[c(-1, -5)]

x[-(1:3)]
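
To try the positional patterns above, any vector will do. This sketch uses a hypothetical vector v rather than the x from the slides:

```r
v <- c(10, 20, 30, 40, 50)

v[1]            # 10
v[c(1, 3)]      # 10 30
v[c(1, 1, 3)]   # 10 10 30 (a position can be requested more than once)
v[1:4]          # 10 20 30 40
v[-4]           # 10 20 30 50
v[c(-1, -5)]    # 20 30 40
v[-(1:3)]       # 40 50

# The brackets are a function, so they can also be called like one
first <- `[`(v, 1)
first           # 10
```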

Challenge 1

Given the following code:

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)

##   a   b   c   d   e 
## 5.4 6.2 7.1 4.8 7.5

Come up with at least 2 different commands that will produce the following output:

  b   c   d 
6.2 7.1 4.8

After you find 2 different commands, compare notes with your neighbour. Did you have different strategies?

Subsetting by name

For named vectors, you can use names inside the brackets instead of numerical positions.

x["a"]

x[c("a","c")]

x[names(x) == "a"]

Note: name is more reliable than position, since position can change if you subset multiple times.
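
The note about reliability can be illustrated with a small named vector. The names scores and kept are hypothetical:

```r
scores <- c(a = 5, b = 6, c = 7)

scores["a"]           # the element named "a"
scores[c("a", "c")]   # the elements named "a" and "c"

# After dropping the first element, positions shift but names do not
kept <- scores[-1]
names(kept)[1]        # "b": position 1 now holds a different element
kept["b"]             # the name still finds the same value, 6
```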

Subsetting by true/false

You can also use a logical vector to subset

x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]

Or you can use an expression that produces a logical vector

x[x > 7]
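
It can help to look at the intermediate logical vector before using it as a subset. A minimal sketch with a made-up vector v:

```r
v <- c(10, 20, 30, 40, 50)

# The comparison is evaluated first and yields one TRUE/FALSE per element
is_large <- v > 25
print(is_large)        # FALSE FALSE  TRUE  TRUE  TRUE

# Only the positions marked TRUE are kept
large_values <- v[is_large]
print(large_values)    # 30 40 50
```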

Challenge 2

Given the following code:

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)

##   a   b   c   d   e 
## 5.4 6.2 7.1 4.8 7.5

Write a subsetting command to return the values in x that are greater than 4 and less than 7.

Skipping Named Elements

x[names(x) != "a"]

What about multiple?

x[names(x) != c("a","c")]

Troubleshooting subsetting

If you aren’t sure what is going wrong with your subset, try the logical expression on its own. Look for errors or unexpected results.

names(x) != c("a","c")

Comparing character vectors

Better Logic

x[names(x) != "a" & names(x) != "c"]

Using the %in% operator

%in% compares each element in the left argument with every element in the right argument

x[! names(x) %in% c("a","c") ]
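
The difference between != and %in% is easiest to see by printing both logical vectors. This sketch uses a hypothetical four-element vector v, so the shorter vector is recycled without any warning:

```r
v <- c(a = 1, b = 2, c = 3, d = 4)

# != compares element by element, recycling the shorter vector:
# "a" != "a", "b" != "c", "c" != "a", "d" != "c"
pairwise <- names(v) != c("a", "c")
print(pairwise)     # FALSE  TRUE  TRUE  TRUE  ("c" is wrongly kept)

# %in% asks, for each name, whether it appears anywhere on the right
membership <- names(v) %in% c("a", "c")
print(membership)   #  TRUE FALSE  TRUE FALSE

kept <- v[!membership]
print(kept)         # elements b and d
```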

Challenge 3

Selecting elements of a vector that match any of a list of components is a very common data analysis task. For example, the gapminder data set contains country and continent variables, but no information between these two scales. Suppose we want to pull out information from southeast Asia: how do we set up an operation to produce a logical vector that is TRUE for all of the countries in southeast Asia and FALSE otherwise?

Suppose you have these data:

seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
## read in the gapminder data that we downloaded in episode 2
gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
## extract the `country` column from a data frame (we'll see this later);
## convert from a factor to a character (a no-op if the column is already character, as in R 4.0 and later);
## and get just the non-repeated elements
countries <- unique(as.character(gapminder$country))

There’s a wrong way (using only ==), which will give you a warning; a clunky way (using the logical operators == and |); and an elegant way (using %in%). See whether you can come up with all three and explain how they (don’t) work.

Returning logical vectors

>, >=, <, <=, ==
& (and), | (or)
! (not)
all(), any()
%in%
is.na(), !is.na()

(Can also directly filter out NA values with na.omit().)

Subsetting factors

Factor subsetting works like vector subsetting, but note that removing elements will not change the valid levels.
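
A small sketch of levels surviving a subset. The factor f is made up, and droplevels() is a base R function that the slides do not mention; it is shown here only as one possible way to discard unused levels.

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

# Keep only the "b" and "c" elements
f_subset <- f[f %in% c("b", "c")]

# The unused levels "a" and "d" are still listed as valid
levels(f_subset)              # "a" "b" "c" "d"

# droplevels() removes levels that no longer occur
f_trimmed <- droplevels(f_subset)
levels(f_trimmed)             # "b" "c"
```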

Subsetting matrices

For matrices, the square brackets take two arguments: row position, column position.

You can leave one argument blank to select an entire row or column, but you still need to include the comma (,).

Can use drop = FALSE if you want to return a matrix.
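
The effect of drop = FALSE can be seen by comparing the two results. The matrix here is a hypothetical 2 x 3 example:

```r
m_small <- matrix(1:6, nrow = 2, ncol = 3)

# Leaving the column blank selects the whole first row, simplified to a vector
row_as_vector <- m_small[1, ]
is.matrix(row_as_vector)   # FALSE

# drop = FALSE keeps the result as a 1 x 3 matrix
row_as_matrix <- m_small[1, , drop = FALSE]
dim(row_as_matrix)         # 1 3
```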

Challenge 4

Given the following code:

m <- matrix(1:18, nrow=3, ncol=6)
print(m)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18

Which of the following commands will extract the values 11 and 14?

A. m[2,4,2,5]
B. m[2:5]
C. m[4:5,2]
D. m[2,c(4,5)]

Subsetting lists

There are three functions for subsetting lists:

[
[[
$

Square brackets

We’ve already used square brackets with vectors, and they work the same for lists. Square brackets select elements but return the same data structure, even if you’re selecting a single element.

Subsetting a vector with square brackets returns a vector.

Subsetting a list with square brackets returns a list.

We could consider this “subsetting” (limiting the elements of a data structure).

Double square brackets

If you want to break an element out of its container list, you need to use double square brackets ([[).

typeof(xlist[1])
# returns "list"

typeof(xlist[[1]])
# returns "character"

Double square brackets only work for one element at a time.

We could consider this “extracting” (removing a data element from its data structure), instead of “subsetting.”

Double square brackets can include element position number or name.

Dollar sign

The dollar sign was mentioned previously as a way to retrieve a data frame column. The dollar sign will extract an element by name only.

xlist$a

xlist$b

xlist$data
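
The three list operators can be compared on a small, hypothetical list (recipe is not part of the lesson):

```r
recipe <- list(title = "Pancakes", servings = 4, steps = c("mix", "fry"))

# Single brackets keep the container: the result is a list of length 1
class(recipe[1])       # "list"

# Double brackets extract the element itself, by position or by name
class(recipe[[1]])     # "character"
recipe[["servings"]]   # 4

# The dollar sign extracts by name only
recipe$steps           # "mix" "fry"
```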

Challenge 5

Given the following list:

xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the “b” item in the list.

Challenge 6

Given a linear model:

mod <- aov(pop ~ lifeExp, data=gapminder)

Extract the residual degrees of freedom (hint: attributes() will help you)

Subsetting Data frames

Because data frames are lists, you can subset a data frame using the same functions for lists.

[ (with one argument) works like it does for lists - it subsets to one or more elements. In data frames, elements are columns. This use of the function will return the same data structure - in this case, another (smaller) data frame.

Data frames are also two-dimensional objects, like matrices. If you use [ with two arguments, you can subset by rows or columns or both. If you subset by row, you will get a data frame. If, however, you subset by just a single column, you will extract the vector version of the column.

# single argument, so treating the data frame as a list of columns
class(gapminder[1])
# returns "data.frame"

# double argument but only specifying first argument (rows)
class(gapminder[1,])
# returns "data.frame"

# double argument, specifying second argument (columns)
class(gapminder[,1:2])
# returns "data.frame"

# double argument, only a single column
class(gapminder[,1])
# returns "character"

# double argument, only a single column, but keep in data frame
class(gapminder[,1, drop=FALSE])
# returns "data.frame"

Extracting from data frames

[[ also works like it does for lists, but instead of returning a smaller data frame, it returns the vector version of a single column.

$ again works like it does for lists. Since the data frame’s columns are its named elements, you can use the dollar sign to extract a single column by its name. This function will return just the vector version of the column, like [[.
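
A compact comparison of the three operators on one column, using a made-up data frame called pets:

```r
pets <- data.frame(name = c("Rex", "Tom"),
                   legs = c(4, 4),
                   stringsAsFactors = FALSE)

class(pets["legs"])     # "data.frame": [ returns a smaller data frame
class(pets[["legs"]])   # "numeric":    [[ returns the column as a vector
class(pets$legs)        # "numeric":    $ does the same, by name

# [[ and $ give exactly the same vector
same_vector <- identical(pets[["legs"]], pets$legs)
same_vector             # TRUE
```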

Challenge 7

Fix each of the following common data frame subsetting errors:

Extract observations collected for the year 1957

gapminder[gapminder$year = 1957,]

Extract all columns except 1 through to 4

gapminder[,-1:4]

Extract the rows where the life expectancy is longer than 80 years

gapminder[gapminder$lifeExp > 80]

Extract the first row, and the fourth and fifth columns (continent and lifeExp).

gapminder[1, 4, 5]

Advanced: extract rows that contain information for the years 2002 and 2007

gapminder[gapminder$year == 2002 | 2007,]

Challenge 8

Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?
Create a new data.frame called gapminder_small that only contains rows 1 through 9 and 19 through 23. You can do this in one or two steps.

7. Control Flow

Sometimes we want our code to run only under certain circumstances - if one or more conditions have been met.

Conditional statements control the flow of R code by testing one or more conditions before executing the code.

A basic `if` statement

if (condition is true) {
  perform action
}

Providing a default response with `else`

if (condition is true) {
  perform action
} else {  # that is, if the condition is false,
  perform alternative action
}

Adding additional conditions with `else if`

if (condition is true) {
  perform action
} else if (second condition is true) { # if first condition fails, try another
  perform second action
} else {  
  perform final action
}
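
Filled in with real conditions, the three templates above could look like the following sketch. The variable x and the message text are arbitrary choices for illustration.

```r
x <- 8

if (x >= 10) {
  message_text <- "x is greater than or equal to 10"
} else if (x > 5) {   # only tested when the first condition is FALSE
  message_text <- "x is greater than 5, but less than 10"
} else {              # reached when every condition above is FALSE
  message_text <- "x is less than or equal to 5"
}

print(message_text)
# [1] "x is greater than 5, but less than 10"
```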

Challenge 1

Use an if() statement to print a suitable message reporting whether there are any records from 2002 in the gapminder dataset. Now do the same for 2012.

Warning

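The if() function expects a single TRUE or FALSE value, not a logical vector (a vector of true/false values). Given a longer vector, R 4.2.0 and later stop with an error; earlier versions test only the first element and issue a warning.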
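
One common way around this is to collapse the logical vector to a single value with any() or all() before handing it to if(). A minimal sketch with hypothetical readings:

```r
temps <- c(12, 25, 31)   # hypothetical readings

# temps > 30 is a logical vector of length 3; any() reduces it to one value
if (any(temps > 30)) {
  status <- "at least one reading is above 30"
} else {
  status <- "no reading is above 30"
}

print(status)
# [1] "at least one reading is above 30"
```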

Another option: `ifelse()`

# often written on one line, but spread out here for clarity

ifelse(
    condition,
    action to perform if true,
    second action to perform if false
)
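
Unlike if(), ifelse() works element by element on a whole vector. A short sketch with made-up values:

```r
y <- c(-3, 0, 7)

# Each element of y is tested; the result has one label per element
sign_label <- ifelse(y >= 0, "non-negative", "negative")

print(sign_label)
# [1] "negative"     "non-negative" "non-negative"
```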

Repeating code

If you have code you want to execute a certain number of times, you can create a special kind of loop called a for loop.

for (iterator in set of values) {
  do a thing
}

An iterator is something that keeps track of what repetition you are on. The name is arbitrary, but it is often just a letter, like i or j.

The set of values defines how many repetitions you want. It should be a vector, and it often includes values that you want to use inside the R code.

Each time the code in the braces is completed, the iterator will move to the next element in the set of values and the loop will repeat until the iterator reaches the end of the vector.
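
As a concrete sketch of the template, this loop stores the square of each value. The names squares and i are arbitrary.

```r
# Create the result vector first, then fill it one element per repetition
squares <- numeric(3)

for (i in 1:3) {
  squares[i] <- i^2   # i takes the values 1, 2, 3 in turn
}

print(squares)
# [1] 1 4 9
```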

Another loop - the `while` loop

Sometimes you don’t know how many times the loop needs to run; you just want it to stop when a condition is met.

while(this condition is true){
  do a thing
}

Warning: it’s important to be careful that your while loop condition will eventually become false. Something that is always true creates an “infinite loop” that will never stop on its own.
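
In this sketch the loop body changes total on every pass, so the condition is guaranteed to become false eventually. The variable names are made up.

```r
total <- 0
n <- 0

# Keep adding 1, 2, 3, ... until the running total reaches at least 10
while (total < 10) {
  n <- n + 1
  total <- total + n
}

print(n)       # 4
print(total)   # 10
```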

Challenge 2

Compare the objects output_vector and output_vector2. Are they the same? If not, why not? How would you change the last block of code to make output_vector2 the same as output_vector?

Challenge 3

Write a script that loops through the gapminder data by continent and prints out whether the mean life expectancy is smaller or larger than 50 years.

Challenge 4

Modify the script from Challenge 3 to loop over each country. This time print out whether the life expectancy is smaller than 50, between 50 and 70, or greater than 70.

Challenge 5 - Advanced

Write a script that loops over each country in the gapminder dataset, tests whether the country starts with a ‘B’, and graphs life expectancy against time as a line graph if the mean life expectancy is under 50 years.

8. Creating Publication-Quality Graphics with ggplot2

Basic plotting systems in R

base plotting system
lattice package
ggplot2 package

Grammar of graphics

Any plot can be expressed from the same set of components:

a data set
a coordinate system
a set of geoms

Challenge 1

Modify the example so that the figure shows how life expectancy has changed over time:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
    geom_point()

Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.

Challenge 2

In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column.

What trends do you see in the data? Are they what you expected?

Challenge 3

Switch the order of the point and line layers from the previous example.

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, by=country)) +
  geom_line(mapping = aes(color=continent)) + geom_point()

What happened?

Challenge 4a

Modify the color and size of the points on the point layer in the previous example.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() + geom_smooth(method="lm", size=1.5)

Hint: do not use the aes function.

Challenge 4b

Modify your solution to Challenge 4a so that the points are now a different shape and are colored by continent with new trendlines. Hint: The color argument can be used inside the aesthetic.

Challenge 5

Generate boxplots to compare life expectancy between the different continents during the available years.

Advanced:

Rename y axis as Life Expectancy.
Remove x axis labels.

9. Vectorization

Challenge 1

Let’s try this on the pop column of the gapminder dataset.

Make a new column in the gapminder data frame that contains population in units of millions of people. Check the head or tail of the data frame to make sure it worked.

Challenge 2

On a single graph, plot population, in millions, against year, for all countries. Don’t worry about identifying which country is which.

Repeat the exercise, graphing only for China, India, and Indonesia. Again, don’t worry about which is which.

Challenge 3

Given the following matrix:

m <- matrix(1:12, nrow=3, ncol=4)
m

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Write down what you think will happen when you run:

m ^ -1
m * c(1, 0, -1)
m > c(0, 20)
m * c(1, 0, -1, 2)

Did you get the output you expected? If not, ask a helper!

Challenge 4

We’re interested in looking at the sum of the following sequence of fractions:

x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)

This would be tedious to type out, and impossible for high values of n. Use vectorisation to compute x when n=100. What is the sum when n=10,000?

10. Functions Explained

Challenge 1

Write a function called kelvin_to_celsius() that takes a temperature in Kelvin and returns that temperature in Celsius.

Hint: To convert from Kelvin to Celsius you subtract 273.15

Challenge 2

Define the function to convert directly from Fahrenheit to Celsius, by reusing the two functions above (or using your own functions if you prefer).

Challenge 3

Use defensive programming to ensure that our fahr_to_celsius() function throws an error immediately if the argument temp is specified inappropriately.

Challenge 4

Test out your GDP function by calculating the GDP for New Zealand in 1987. How does this differ from New Zealand’s GDP in 1952?

Challenge 5

The paste() function can be used to combine text together, e.g:

best_practice <- c("Write", "programs", "for", "people", "not", "computers")
paste(best_practice, collapse=" ")

## [1] "Write programs for people not computers"

Write a function called fence() that takes two vectors as arguments, called text and wrapper, and prints out the text wrapped with the wrapper:

fence(text=best_practice, wrapper="***")

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: “ ”. The default for paste0() is no space: “”.
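
The separator defaults can be checked directly. A tiny sketch with placeholder strings:

```r
with_default <- paste("a", "b")              # sep defaults to a space
with_dash    <- paste("a", "b", sep = "-")   # custom separator
no_separator <- paste0("a", "b")             # paste0() uses no separator

print(with_default)   # "a b"
print(with_dash)      # "a-b"
print(no_separator)   # "ab"
```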

11. Writing Data

Challenge 1

Rewrite your ‘pdf’ command to print a second page in the pdf, showing a facet plot (hint: use facet_grid) of the same data with one panel per continent.

Challenge 2

Write a data-cleaning script file that subsets the gapminder data to include only data points collected since 1990.

Use this script to write out the new subset to a file in the cleaned-data/ directory.

12. Splitting and Combining Data Frames with plyr

13. Data frame Manipulation with dplyr

The `dplyr` package

Common functions:

select()
filter()
group_by()
summarize()
mutate()
%>% (the “pipe” operator)

Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other continents. How many rows does your data frame have and why?

Challenge 2

Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?

Advanced Challenge

Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent. Then arrange the continent names in reverse order. Hint: use the dplyr functions arrange() and sample_n(); they have similar syntax to other dplyr functions.

14. Data frame Manipulation with tidyr

Challenge 1

Is gapminder a purely long, purely wide, or some intermediate format?

Challenge 2

Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson.

R for Reproducible Scientific Analysis: Slides
