Angela Zoss
2/12/2021
Which of the following are valid R variable names?
What will be the value of each variable after each statement in the following program?
Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age?
Clean up your working environment by deleting the mass and age variables.
Install the following packages: ggplot2, plyr, gapminder
We’re going to create a new project in RStudio:
Save the gapminder data file as gapminder_data.csv in the data/ folder within your project.
We will load and inspect these data later.
It is useful to get some general idea about the dataset, directly from the command line, before loading it into R. Understanding the dataset better will come in handy when making decisions on how to load it in R. Use the command-line shell to answer the following questions:
Look at the help for the c function. What kind of vector do you expect you will create if you evaluate the following:
Look at the help for the paste function. You’ll need to use this later. What is the difference between the sep and collapse arguments?
Use help to find a function (and its associated parameters) that you could use to load data from a tabular file in which columns are delimited with “\t” (tab) and the decimal point is a “.” (period). This check for the decimal separator is important, especially if you are working with international colleagues, because different countries have different conventions for the decimal point (i.e. comma vs period). Hint: use ??"read table" to look up functions related to reading in tabular data.
Create a new text file (File -> New File -> Text File)
Copy and paste:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1
Save file as “data/feline-data_v2.csv”
A structure which R knows how to build out of the basic data types
For example, data.frame
logical, or true/false
Why is R so opinionated about what we put in our columns of data? How does this help us?
When we create a data frame, R will choose a data type for each column and force the elements to be that data type
Coercion rules (converted from top to bottom):
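As a minimal sketch of these rules, assuming the usual hierarchy (logical, then integer, then double, then complex, then character), combining values with c() converts everything to the most general type present. The variable names here are illustrative only.

```r
# Each vector adds one more general type; c() coerces all elements to it.
v_int <- c(TRUE, 2L)             # logical + integer   -> integer
v_dbl <- c(TRUE, 2L, 3.5)        # ... + double        -> double
v_chr <- c(TRUE, 2L, 3.5, "a")   # ... + character     -> character

typeof(v_int)   # "integer"
typeof(v_dbl)   # "double"
typeof(v_chr)   # "character"
v_chr           # "TRUE" "2" "3.5" "a"
```

Note that the numbers in v_chr are now text, so arithmetic on them will fail until they are converted back.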
Start by making a vector with the numbers 1 through 26. Multiply the vector by 2, and give the resulting vector names A through Z (hint: there is a built-in vector called LETTERS).
levels argument
Is there a factor in our cats data.frame? What is its name? Try using ?read.csv to figure out how to keep text columns as character vectors instead of factors; then write a command or two to show that the factor in cats is actually a character vector when loaded in this way.
Just for columns: data_frame_name$column_name
data_frame_name[row_number, column_number]
There are several subtly different ways to call variables, observations and elements from data.frames:
cats[1]
cats[[1]]
cats$coat
cats["coat"]
cats[1, 1]
cats[, 1]
cats[1, ]
Try out these examples and explain what is returned by each one.
Hint: Use the function typeof() to examine what is returned in each case.
matrix[row_number,col_number] to access elements
What do you think will be the result of length(matrix_example)? Try it. Were you right? Why / why not?
Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for matrix!)
Create a list of length two containing a character vector for each of the sections in this part of the workshop:
Populate each character vector with the names of the data types and data structures we’ve seen so far.
Consider the R output of the matrix below:
What was the correct command used to write this matrix? Examine each command and try to figure out the correct one before typing them. Think about what matrices the other commands will produce.
matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
Let’s imagine that 1 cat year is equivalent to 7 human years.
Create a vector called human_age by multiplying cats$age by 7.
Convert human_age to a factor.
Convert human_age back to a numeric vector using the as.numeric() function. Now divide it by 7 to get the original ages back. Explain what happened.
You can create a new data frame right from within R with the following syntax:
df <- data.frame(id = c("a", "b", "c"),
                 x = 1:3,
                 y = c(TRUE, TRUE, FALSE),
                 stringsAsFactors = FALSE)
Make a data frame that holds the following information for yourself:
Then use rbind to add an entry for the people sitting beside you. Finally, use cbind to add a column with each person’s answer to the question, “Is it time for coffee break?”
It’s good practice to also check the last few lines of your data and some in the middle. How would you do this?
Searching for ones specifically in the middle isn’t too hard, but we could ask for a few lines at random. How would you code this?
Go to File -> New File -> R Script, and write an R script to load in the gapminder dataset. Put it in the scripts/ directory and add it to version control.
Run the script using the source function, using the file path as its argument (or by pressing the “source” button in RStudio).
Read the output of str(gapminder) again; this time, use what you’ve learned about factors, lists and vectors, as well as the output of functions like colnames and dim to explain what everything that str prints out for gapminder means. If there are any parts you can’t interpret, discuss with your neighbors!
It may look different, but the square brackets operator ([]) is a function. For vectors (and matrices), it works as “get me the nth element,” where n is a number that goes inside the brackets.
x[1]
Use a numerical vector to ask for multiple elements by their positions
x[c(1, 3)]
x[c(1,1,3)]
x[1:4]
(Note: the : operator creates a sequence of numbers from the left element to the right.)
Use - to skip an element or set of elements by position
x[-4]
x[c(-1, -5)]
x[-(1:3)]
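The parentheses in the last command matter. As a small sketch, assuming x is a named vector like the one printed in the next challenge, here is what happens with and without them:

```r
# Assumed example vector; the values mirror the output shown in the next challenge.
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

x[c(1, 1, 3)]   # positions can repeat: returns a, a, c
x[-(1:3)]       # drops positions 1 to 3: returns d, e

# Without parentheses, -1:3 means c(-1, 0, 1, 2, 3). R refuses to mix
# negative and positive positions, so this call raises an error.
result <- tryCatch(x[-1:3], error = function(e) "error: cannot mix signs")
result
```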
Given the following code:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c("a", "b", "c", "d", "e")
print(x)
##   a   b   c   d   e
## 5.4 6.2 7.1 4.8 7.5
Come up with at least 2 different commands that will produce the following output:
After you find 2 different commands, compare notes with your neighbour. Did you have different strategies?
For named vectors, you can use names inside the brackets instead of numerical positions.
x["a"]
x[c("a","c")]
x[names(x) == "a"]
Note: name is more reliable than position, since position can change if you subset multiple times.
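To see why, here is a short sketch (the vector x and the intermediate y are assumed for illustration). After one subset, position 3 points at a different element, while the name still finds the same one:

```r
# Assumed example vector for illustration.
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

y <- x[-1]   # drop the first element; y now holds b, c, d, e

x[3]         # c = 7.1
y[3]         # d = 4.8, because everything shifted left by one
y["c"]       # c = 7.1, the name still finds the same element
```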
You can also use a logical vector to subset
x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
Or you can use an expression that produces a logical vector
x[x > 7]
Given the following code:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c("a", "b", "c", "d", "e")
print(x)
##   a   b   c   d   e
## 5.4 6.2 7.1 4.8 7.5
Write a subsetting command to return the values in x that are greater than 4 and less than 7.
x[names(x) != "a"]
What about multiple?
x[names(x) != c("a","c")]
If you aren’t sure what is going wrong with your subset, try the logical expression on its own. Look for errors or unexpected results.
names(x) != c("a","c")
x[names(x) != "a" & names(x) != "c"]
%in% compares each element in the left argument with every element in the right argument
x[! names(x) %in% c("a","c") ]
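The difference between the two approaches can be sketched as follows, assuming the same named vector x as in the earlier examples. With !=, R recycles the shorter vector and compares element by element, so "c" is compared against "a" and wrongly kept.

```r
# Assumed example vector for illustration.
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# != recycles c("a", "c") to a, c, a, c, a and compares pairwise:
# a!=a, b!=c, c!=a, d!=c, e!=a (R also warns about the length mismatch)
wrong <- suppressWarnings(names(x) != c("a", "c"))
wrong                  # FALSE TRUE TRUE TRUE TRUE

# %in% checks each name against the whole set, so both a and c are caught
right <- !(names(x) %in% c("a", "c"))
right                  # FALSE TRUE FALSE TRUE TRUE

x[right]               # b, d, e
```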
Selecting elements of a vector that match any of a list of components is a very common data analysis task. For example, the gapminder data set contains country and continent variables, but no information between these two scales. Suppose we want to pull out information from southeast Asia: how do we set up an operation to produce a logical vector that is TRUE for all of the countries in southeast Asia and FALSE otherwise?
Suppose you have these data:
seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
## read in the gapminder data that we downloaded in episode 2
gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
## extract the `country` column from a data frame (we'll see this later);
## convert from a factor to a character;
## and get just the non-repeated elements
countries <- unique(as.character(gapminder$country))
There’s a wrong way (using only ==), which will give you a warning; a clunky way (using the logical operators == and |); and an elegant way (using %in%). See whether you can come up with all three and explain how they (don’t) work.
>, >=, <, <=, ==
& (and), | (or)
! (not)
all(), any()
%in%
is.na(), !is.na()
(Can also directly filter out NA values with na.omit().)
Factor subsetting works like vector subsetting, but note that removing elements will not change the valid levels.
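A brief sketch of this behaviour, using a made-up factor f. The droplevels() function from base R is one way to discard the unused levels afterwards.

```r
# Hypothetical factor for illustration.
f <- factor(c("a", "a", "b", "c", "c", "d"))

f_sub <- f[f %in% c("b", "c")]   # keep only the b and c elements

levels(f_sub)                    # "a" "b" "c" "d": unused levels remain
levels(droplevels(f_sub))        # "b" "c": unused levels removed
```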
For matrices, the square brackets take two arguments: row position, column position.
You can leave one argument blank to select an entire row or column, but you still need to include the comma (,).
By default, selecting a single row or column returns a vector. Use drop = FALSE if you want the result to stay a matrix.
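These rules can be sketched with a small matrix (the name small and its contents are assumed for illustration only):

```r
# Hypothetical 2 x 3 matrix, filled by column: rows are (1 3 5) and (2 4 6).
small <- matrix(1:6, nrow = 2)

small[2, 3]                          # single element: 6

col2 <- small[, 2]                   # blank row argument selects the whole column
col2                                 # 3 4, returned as a plain vector
dim(col2)                            # NULL, the matrix structure was dropped

col2_mat <- small[, 2, drop = FALSE] # keep the matrix structure
dim(col2_mat)                        # 2 1
```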
Given the following code:
m <- matrix(1:18, nrow = 3, ncol = 6)
print(m)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18
Which of the following commands will extract the values 11 and 14?
A. m[2,4,2,5]
B. m[2:5]
C. m[4:5,2]
D. m[2,c(4,5)]
There are three functions for subsetting lists:
[
[[
$
We’ve already used square brackets with vectors, and they work the same for lists. Square brackets select elements but return the same data structure, even if you’re selecting a single element.
Subsetting a vector with square brackets returns a vector.
Subsetting a list with square brackets returns a list.
We could consider this “subsetting” (limiting the elements of a data structure).
If you want to break an element out of its container list, you need to use double square brackets ([[).
Double square brackets only work for one element at a time.
We could consider this “extracting” (removing a data element from its data structure), instead of “subsetting.”
Double square brackets can include element position number or name.
The dollar sign was mentioned previously as a way to retrieve a data frame column. The dollar sign will extract an element by name only.
xlist$a
xlist$b
xlist$data
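The contrast between subsetting and extracting can be sketched with a small made-up list (pets is hypothetical and unrelated to xlist):

```r
# Hypothetical list for illustration.
pets <- list(name = "Tabby", ages = c(2, 5, 7))

sub <- pets["ages"]        # single brackets: still a list, of length 1
class(sub)                 # "list"

ext <- pets[["ages"]]      # double brackets: the element itself
class(ext)                 # "numeric"

# $ extracts by name, and [[ also accepts a position
identical(pets$ages, pets[[2]])   # TRUE
```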
Given the following list:
Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the “b” item in the list.
Given a linear model:
Extract the residual degrees of freedom (hint: attributes() will help you)
Because data frames are lists, you can subset a data frame using the same functions for lists.
[ (with one argument) works like it does for lists - it subsets to one or more elements. In data frames, elements are columns. This use of the function will return the same data structure - in this case, another (smaller) data frame.
Data frames are also two-dimensional objects, like matrices. If you use [ with two arguments, you can subset by rows or columns or both. If you subset by row, you will get a data frame. If, however, you subset by just a single column, you will extract the vector version of the column.
# single argument, so treating the data frame as a list of columns
class(gapminder[1])
# returns "data.frame"
# double argument but only specifying first argument (rows)
class(gapminder[1,])
# returns "data.frame"
# double argument, specifying second argument (columns)
class(gapminder[,1:2])
# returns "data.frame"
# double argument, only a single column
class(gapminder[,1])
# returns "character"
# double argument, only a single column, but keep in data frame
class(gapminder[,1, drop=FALSE])
# returns "data.frame"
[[ also works like it does for lists, but instead of returning a smaller data frame, it returns the vector version of a single column.
$ again works like it does for lists. Since the data frame’s columns are its named elements, you can use the dollar sign to extract a single column by its name. This function will return just the vector version of the column, like [[.
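As a small sketch of [[ and $ on a data frame, here is a cut-down version of the cats data built by hand (the first three rows of the CSV shown earlier, so no file needs to be loaded):

```r
# Hand-built stand-in for the cats data; values copied from the CSV rows above.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

class(cats["weight"])     # "data.frame": single brackets keep the container
class(cats[["weight"]])   # "numeric": double brackets return the column vector
class(cats$weight)        # "numeric": same result, selected by name

identical(cats[["weight"]], cats$weight)   # TRUE
```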
Fix each of the following common data frame subsetting errors:
continent and lifeExp).
Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?
Create a new data.frame called gapminder_small that only contains rows 1 through 9 and 19 through 23. You can do this in one or two steps.
Sometimes we want our code to run only under certain circumstances - if one or more conditions have been met.
Conditional statements control the flow of R code by testing one or more conditions before executing the code.
else
else if
Use an if() statement to print a suitable message reporting whether there are any records from 2002 in the gapminder dataset. Now do the same for 2012.
The if() statement expects a single logical value, so it won’t process a whole logical vector (a vector of true/false values) element by element.
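One way to work with this restriction, sketched here with made-up temperature values, is to collapse the logical vector to a single value with any() or all() before handing it to if(). For an element-by-element choice, the vectorised ifelse() function is the usual alternative.

```r
# Hypothetical values for illustration.
temps <- c(12, 25, 31)

# temps > 30 is a logical vector of length 3; any() reduces it to one value
if (any(temps > 30)) {
  status <- "at least one hot day"
} else if (all(temps < 0)) {
  status <- "all days below freezing"
} else {
  status <- "nothing unusual"
}
status

# ifelse() tests every element and returns a vector of the same length
labels <- ifelse(temps > 30, "hot", "mild")
labels    # "mild" "mild" "hot"
```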
ifelse()
If you have code you want to execute a certain number of times, you can create a special kind of loop called a for loop.
An iterator is something that keeps track of what repetition you are on. The name is arbitrary, but it is often just a letter, like i or j.
The set of values defines how many repetitions you want. It should be a vector, and it often includes values that you want to use inside the R code.
Each time the code in the curly braces is completed, the iterator will move to the next element in the set of values, and the loop will repeat until the iterator reaches the end of the vector.
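A minimal sketch of these pieces, with made-up names: i is the iterator, seq_along(values) is the set of values, and the code in the braces runs once per element.

```r
# Hypothetical input values for illustration.
values <- c(2, 4, 6)

# Pre-allocate the result so the loop fills it in rather than growing it
squares <- numeric(length(values))

for (i in seq_along(values)) {   # i takes the values 1, 2, 3 in turn
  squares[i] <- values[i]^2
}

squares   # 4 16 36
```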
while loop
Sometimes you don’t know how many times the loop needs to run; you just want it to stop when a condition is met.
Warning: it’s important to be careful that your while loop condition will eventually become false. Something that is always true creates an “infinite loop” that will never stop on its own.
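As a small sketch (names and numbers are illustrative), this loop keeps adding until a running total reaches 10. The condition is guaranteed to become false because total grows on every pass.

```r
total <- 0
steps <- 0

# Runs while the condition is TRUE; total increases each time, so it must end
while (total < 10) {
  steps <- steps + 1
  total <- total + steps   # adds 1, then 2, then 3, then 4
}

steps   # 4
total   # 10
```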
Compare the objects output_vector and output_vector2. Are they the same? If not, why not? How would you change the last block of code to make output_vector2 the same as output_vector?
Write a script that loops through the gapminder data by continent and prints out whether the mean life expectancy is smaller or larger than 50 years.
Modify the script from Challenge 3 to loop over each country. This time print out whether the life expectancy is smaller than 50, between 50 and 70, or greater than 70.
Write a script that loops over each country in the gapminder dataset, tests whether the country starts with a ‘B’, and graphs life expectancy against time as a line graph if the mean life expectancy is under 50 years.
lattice package
ggplot2 package
Any plot can be expressed from the same set of components:
Modify the example so that the figure shows how life expectancy has changed over time:
Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.
In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column.
What trends do you see in the data? Are they what you expected?
Switch the order of the point and line layers from the previous example.
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, by=country)) +
  geom_line(mapping = aes(color=continent)) + geom_point()
What happened?
Modify the color and size of the points on the point layer in the previous example.
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() + geom_smooth(method="lm", size=1.5)
Hint: do not use the aes function.
Modify your solution to Challenge 4a so that the points are now a different shape and are colored by continent with new trendlines. Hint: The color argument can be used inside the aesthetic.
Generate boxplots to compare life expectancy between the different continents during the available years.
Advanced:
Let’s try this on the pop column of the gapminder dataset.
Make a new column in the gapminder data frame that contains population in units of millions of people. Check the head or tail of the data frame to make sure it worked.
On a single graph, plot population, in millions, against year, for all countries. Don’t worry about identifying which country is which.
Repeat the exercise, graphing only for China, India, and Indonesia. Again, don’t worry about which is which.
Given the following matrix:
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
Write down what you think will happen when you run:
m ^ -1
m * c(1, 0, -1)
m > c(0, 20)
m * c(1, 0, -1, 2)
Did you get the output you expected? If not, ask a helper!
We’re interested in looking at the sum of the following sequence of fractions:
This would be tedious to type out, and impossible for high values of n. Use vectorisation to compute x when n=100. What is the sum when n=10,000?
Write a function called kelvin_to_celsius() that takes a temperature in Kelvin and returns that temperature in Celsius.
Hint: To convert from Kelvin to Celsius you subtract 273.15
Define the function to convert directly from Fahrenheit to Celsius, by reusing the two functions above (or using your own functions if you prefer).
Use defensive programming to ensure that our fahr_to_celsius() function throws an error immediately if the argument temp is specified inappropriately.
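One possible sketch of these functions is shown below. It assumes the standard conversion formulas and uses stopifnot() as the defensive check; your own solution may be structured differently.

```r
# Fahrenheit to Kelvin, using the standard conversion formula
fahr_to_kelvin <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# Kelvin to Celsius: subtract 273.15
kelvin_to_celsius <- function(temp) {
  temp - 273.15
}

# Fahrenheit to Celsius by chaining the two functions above.
# stopifnot() halts with an error straight away if temp is not numeric.
fahr_to_celsius <- function(temp) {
  stopifnot(is.numeric(temp))
  kelvin_to_celsius(fahr_to_kelvin(temp))
}

fahr_to_celsius(32)   # freezing point of water: 0

# A character argument now fails immediately instead of deep inside the arithmetic
bad_input <- tryCatch(fahr_to_celsius("32"), error = function(e) "error caught")
bad_input
```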
Test out your GDP function by calculating the GDP for New Zealand in 1987. How does this differ from New Zealand’s GDP in 1952?
The paste() function can be used to combine text together, e.g:
best_practice <- c("Write", "programs", "for", "people", "not", "computers")
paste(best_practice, collapse=" ")
## [1] "Write programs for people not computers"
Write a function called fence() that takes two vectors as arguments, called text and wrapper, and prints out the text wrapped with the wrapper:
Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no space: "".
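A short sketch of how sep and collapse differ, using made-up inputs: sep goes between the arguments being pasted together, while collapse joins the elements of the resulting vector into a single string.

```r
# sep separates the arguments, element by element
by_sep <- paste(c("a", "b"), c("x", "y"), sep = "-")
by_sep        # "a-x" "b-y"

# collapse joins the elements of one vector into a single string
by_collapse <- paste(c("a", "b", "c"), collapse = "-")
by_collapse   # "a-b-c"

# Both together: paste pairwise with sep, then join the results with collapse
both <- paste(c("a", "b"), c("x", "y"), sep = "-", collapse = " ")
both          # "a-x b-y"

# paste0() is paste() with no separator
paste0("a", "b")   # "ab"
```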
Rewrite your ‘pdf’ command to print a second page in the pdf, showing a facet plot (hint: use facet_grid) of the same data with one panel per continent.
Write a data-cleaning script file that subsets the gapminder data to include only data points collected since 1990.
Use this script to write out the new subset to a file in the cleaned-data/ directory.
dplyr package
Common functions:
select()
filter()
group_by()
summarize()
mutate()
%>% (the “pipe” operator)
select()
Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other Continents. How many rows does your data frame have and why?
group_by()
Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?
Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent. Then arrange the continent names in reverse order. Hint: Use the dplyr functions arrange() and sample_n(), they have similar syntax to other dplyr functions.
Is gapminder a purely long, purely wide, or some intermediate format?
Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson
Take this one step further and create a gap_ludicrously_wide data frame by pivoting over countries, years, and the 3 metrics. Hint: this new data frame should only have 5 rows.
Create a new R Markdown document. Delete all of the R code chunks and write a bit of Markdown (some sections, some italicized text, and an itemized list).
Convert the document to a webpage.
Add code chunks to:
Use chunk options to control the size of a figure and to hide the code.
Try out a bit of in-line R code.