Fork me on GitHub

1. Installation of R, RStudio, and R libraries

Please install R.

Please install the latest Preview version of RStudio.

Installing additional packages

R comes with a base system and some contributed packages. This is what you just downloaded. The functionality of R can be significantly extended by using additional contributed packages, also called libraries. Those packages typically contain commands (functions) for more specialized tasks. They can also contain example datasets. We will make use of a series of external packages to work with spatial data later.

To install additional packages there are two main options:

  1. You can use the RStudio interface like this:

alt text alt text

  1. You can install from the R commandline like this:
install.packages("NameOfTheLibraryToInstall", dependencies = TRUE)

Make use of the installed packages

In order to actually use commands from the installed packages you also will need to load the installed packages. This can be automated (whenever you launch R it will also load the libraries for you - see for example here) or otherwise you need to sumbit a command:

library(NameOfTheLibraryToLoad)

or

require(NameOfTheLibraryToLoad)

The difference between the two is that library will result in an error, if the library does not exist, whereas require will result in a warning.


Exercise 1

  1. Launch RStudio
  2. Find the Console window, type in the following line, hit enter and see what happens.

    volcano.r <- raster(volcano); spplot(volcano.r)
  3. Install the R package called raster, including its dependencies
  4. Load the library and observe the message you receive
  5. Use the "up" arrow on your keyboard to recall the line that you last typed in and hit enter


2. Overview of the RStudio environment and main features

RStudio is a development environment that makes working with R easier. In order to use it you need to also have R installed (which we did above). These are two independent software pieces, however they work together seamlessly. When launching RStudio it automatically detects your R installation and starts it up as well.

Some features of RStudio:

[ Demo ]

3. R syntax, basic data structures, how to read data in

Vector

Here is how you assign a value to a variable. Variable names cannot contain spaces and they cannot begin with a dot followed by a number. For example, The Moon and .2TheMoon will not work. .ToTheMoon will work, but is not recommended. (To find reserved words in R type ?reserved in your console or search for reserved in the help).

x <- 14  # x = 14 also works, but some feel strongly that it should not be used and this is a comment, BTW

(Tip: Use Alt and - as shortcut for <- on a Mac.)

Here is how you retrieve the value from a variable:

x
[1] 14

Variable names are case sensitive. Thevariable is not the same as thevariable which is not the same as theVariable! Try the commands below and see that happens. (Tip: Concatenate commands in one line with a semicolon: ;)

myvar <- .37; myvar
myVar <- "dog"; myvar

(Most common) data types in R are:

  • logical,
  • integer,
  • double (often called numeric), and
  • character.

The typeof() command tells you the data type of an object.

typeof(x)
[1] "double"

Even though x contains only one value, in R it is called a “vector”.

Now let’s construct some longer vectors in R. One common way to do this is to use a function called c(). This is how it’s done:

a <- c(1, 2, 5.3, 6, -2, 4)  # numeric vector
b <- c("one", "two", "three")  # character vector
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)  #logical vector

A convenient command to find out about the structure of a object is str(). For example:

str(c)
 logi [1:6] TRUE TRUE TRUE FALSE TRUE FALSE

And note that c in this example is the name of a vector, but that c is also the name of the function used to construct a vector. OUCH. It works, but is confusing. So avoid it.

Now try this and see what happens here:

d <- c("a", 1); str(d)

Subsetting:

Oftentimes we want to select one or several elements from a vector. This is called subsetting. Below are a few examples.

a[1]  # note that in R the first position in the index is 1, not 0!
[1] 1
a[-1]  # all except the first element
[1]  2.0  5.3  6.0 -2.0  4.0
a[c(2, 4)]  # 2nd and 4th elements of vector
[1] 2 6
a[2:4]  # from 2nd to 4th element
[1] 2.0 5.3 6.0
a[a > 5]  # this is sometimes convenient
[1] 5.3 6.0

We can do math with vectors:

a + 1
[1]  2.0  3.0  6.3  7.0 -1.0  5.0
a * a
[1]  1.00  4.00 28.09 36.00  4.00 16.00

When the vectors operated upon are of unequal lengths, then the shorter vector is “recycled” as often as necessary to match the length of the longer vector. If the vectors are of different lengths because of a programming error, this can lead to unexpected results. But sometimes recycling of short vectors is the basis of clever programming.

Here is an example.

e <- c(1, 10)
a * e
[1]  1.0 20.0  5.3 60.0 -2.0 40.0

Now let’s try to understand what happens here.

a[c(TRUE, FALSE)]
[1]  1.0  5.3 -2.0

Matrix

If we arrange data elements of a vector in a two-dimensional rectangular layout we have a matrix. To construct a matrix, we use a function conveniently called matrix().

y <- matrix(1:20, nrow = 5, ncol = 4)  # generates 5 x 4 numeric matrix

Subset a matrix with [row , column]:

y[, 4]  # 4th column of matrix
y[3, ]  # 3rd row of matrix
y[2:4, 1:3]  # rows 2,3,4 of columns 1,2,3

Not surprisingly 2-dimensional matrices play an important role when working with raster data. We will come back to that at a later time.

List

Lists can have elements of any type. Here is how we construct lists. You may have guessed that to construct a list, we use the list() function:

myl <- list(name = "Sue", mynumbers = a, mymatrix = y, age = 5.3)  # example of a list with 4 components

myl[[2]]  # 2nd component of the list
myl[["mynumbers"]]  # component named mynumbers in list

Lists will be important as they are used to construct vector data of the sp* type in R. This sounds forgeign to you now, but you will see this shortly.

Data frame

A data frame is the most common way of storing tabular data in R and something you will likely deal with a lot. For example, attribute data for vector based spatial objects in R are stored as a data frame. You can really think of it as a table or a spreadsheet. It a 2-dimensional structure and columns can be of different element types.

Here is how you could construct a data frame.

mydf <- data.frame(ID=c(1:4),
                   
                   Color=c("red", "white", "red", NA),
                   
                   Passed=c(TRUE,TRUE,TRUE,FALSE),
                   
                   Weight=c(99, 54, 85, 70),
                   
                   Height=c(1.78, 1.67, 1.82, 1.59))

mydf
  ID Color Passed Weight Height
1  1   red   TRUE     99   1.78
2  2 white   TRUE     54   1.67
3  3   red   TRUE     85   1.82
4  4  <NA>  FALSE     70   1.59

And subsetting it (try the following commands yourself):

mydf$Weight               # Weight column
mydf[c("ID","Weight")]    # columns ID and Weight
mydf[,2:4]                # columns 2,3,4 of dataframe
mydf[mydf$Height > 1.8,]  # use logical conditions to filter rows
mydf[mydf$Passed,]    

A useful command to look at the top rows of a data frame is head(). Column names can be either retrieved or assigned with names(). We can also assign row names with rownames().

We can easily create a new column in a data frame, calculated from existing columns, for example:

mydf$bmi <- mydf$Weight/mydf$Height^2
mydf
  ID Color Passed Weight Height      bmi
1  1   red   TRUE     99   1.78 31.24605
2  2 white   TRUE     54   1.67 19.36247
3  3   red   TRUE     85   1.82 25.66115
4  4  <NA>  FALSE     70   1.59 27.68878

A word about using functions in R

By now we have used several R functions. I have also -sloppily, perhaps- called them commands. They all have in common that they are executed by typing their name followed by round brackets, in which we provide one or more parameters (or arguments) for the function to do something, separated by commas. Each function requires their specific arguments and those can be looked up with the help and all the arguments have names.

As example let us revisit the matrix() function from above. If you look up the documentation, this is what you will find under the usage section:

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
       dimnames = NULL)

Now, here is what we said earlier:

matrix(1:20, nrow = 5, ncol = 4)

So you can see that we have not consistently named our parameters, but R still knows what we want1. The reason is that R evaluates function arguments in three steps: first, by exact matching on argument name, then by partial matching on argument name, and finally by position.

A second thing to notice is that often you do not have to specify all of the arguments. If you don’t, R will use default values if they are specified by the function. If no default value is specified, you will receive an error.

Functions usually return someting back to you as output. Whatever they return (a table, some informational text, a logical value, …) is by default written to the console, so you can see it right away.

Oftentimes, however, we want re-use the output of such a function. This is what we did above with the matrix example and this is also what we will now do to read in some data.

Read data into R

One the most common ways of getting data into R is to read in a table. And – you guessed it – we read it into a data frame! The function we will use for this is read.csv(). We will take a simple CSV file as example.

What is a CSV file?


Exercise 2

  1. Create a new directory R_Workshop on your Desktop.
  2. Download and unzip table_10.zip in this directory.
  3. Set your working directory to R_Workshop:

alt text

  1. Load table_10.csv2 into R and assign it to a data frame calle dfm.(Hint: Use the help tab in RStudio to search for the command and find the exact syntax)
  2. OPTIONAL: Try to use the read.table() function. What is different?
  3. Check out the column names
  4. Look at the first few rows to see if you like what you did
  5. How would you retreive all records for “New England”?
  6. Using the columns with totalpop and totalIL calculate the percentage of the illiterate population, call it pctIL and add it as a new column to the data frame.
  7. FOR FUN:
dColors <- data.frame(division = levels(dfm$division), color = rainbow(nlevels(dfm$division)))
dfm.col <- merge(dfm, dColors)
dfm.ord <- dfm.col[order(dfm.col$pctIL), ]
barplot(dfm.ord$pctIL, names.arg = dfm.ord$state, horiz = TRUE, las = 2, cex.names = 0.5, 
    col = dfm.ord$color)
legend("bottomright", legend = dColors$division, fill = dColors$color)

4. Overview of some convenient data manipulation tools

reshape

reshape2 is an external library. It contains two functions that help to transform data tables between wide and long formats.

A table in wide format there is a column for each variable and it looks like this:

  subject age height
1   Adele  20   1.76
2   Adele  21   1.77

A table in long format there is at least one column for the so called “ID variables”, a column for possible variable types and a column for the values of those variables. For the above table would look like this using subject as the ID variable:

  subject variable value
1   Adele      age 20.00
2   Adele      age 21.00
3   Adele   height  1.76
4   Adele   height  1.77

While we perhaps tend to record data in wide format, in R long-format data often needed, for example when plotting with ggplot.

melt turns wide format into long format.
cast turns long format into wide format.

So for the above example the command

wide.df <- data.frame(subject=c("Adele", "Adele"), # some sample data in wide format
                      age=c(20, 21), 
                      height=c(1.76, 1.77))

melt(wide.df) # convert to long format
  subject variable value
1   Adele      age 20.00
2   Adele      age 21.00
3   Adele   height  1.76
4   Adele   height  1.77

By default melt() uses all columns that have numeric values as variables for the values. But we could also tell it otherwise, for example:

melt(wide.df, id.vars = c("subject", "age"))
  subject age variable value
1   Adele  20   height  1.76
2   Adele  21   height  1.77

This is another possible long format for our table above!

Now let’s save this long format to a variable and cast it back into wide format. There are actually several commands for this, we will use dcast() as it produces a data frame as output, which is what we most typially want. Input variables for dcast() are: the data frame, and a formula for the ID variables and the values.

long.df <- melt(wide.df, id.vars = c("subject", "age"))
dcast(long.df, subject + age ~ variable)
  subject age height
1   Adele  20   1.76
2   Adele  21   1.77

This is one of the simplest cases. Typically going from long to wide data tables takes a bit more tinkering as there are a number of more advanced options as you can see in the help.


Exercise 3 (Optional)

  1. Install and load the reshape2 package.
  2. Use the data frame dfm you created earlier (from table_10.csv)
  3. Select the first four columns and save them into a new data frame dfm.select
  4. Convert the selection from wide to long format, using state and division as ID variables and save into a new data frame dfm.long

More..

Just to mention two more packages that can be helpful with data wrangling.

tidyr is a package similar to reshape2, with extended functionality. It can transform between long and short form and has some additional convenient functionalities, like renaming, concatenating and separating columns.

dplyr is a package that makes working with tables a little easier. In addition to simple operations like filtering, reordering of colums, selecting unique values, one can connect directly to databases. Most conveniently it also has a set of grouping functions that allow to calculate statistics for subgroups of the entire dataset. This is how it works:

  1. split up the original data in subgroups,
  2. apply an operate on those subgrouops, and
  3. combine the results back into a table.

This process is accordingly called split-apply-combine and illustrated below.

Split - Apply - Combine. From Karthik Ram 2012: Introduction to R (http://inundata.org/2012/04/05/an-intro-to-r/)

We will use the dataframe we created earlier in long format dfm.select for an example. We want to calculate the percentage Illiterate per division, not per state as we did above. Following the recipe from above we would:

  1. split the data into divisions
  2. calculate percentage for each of those
  3. combine all back into a table
library(dplyr)
by_division <- group_by(dfm.select, division)
summarise(by_division, pct = sum(totalIL)/sum(totalpop)*100)

The end. For today.


  1. It is strongly discouraged to omit argument names when you actually write programs in R.

  2. Manfred te Grotenhuis, Rob Eisinga, and SV Subramanian: Robinson’s Ecological Correlations and the Behavior of Individuals: methodological corrections. International journal of epidemiology. Data Source: U.S. Census Bureau (1933). Fifteenth Census of the United States: 1930. Population, Volume II, General Report. Statistics by Subjects, Chapter 13, Page 1229: Table 10.-Illiteracy in the population 10 years old and over, by color and nativity, by divisions and states: 1930. Retrieved October, 2010 [http://www2.census.gov/prod2/decennial/documents/16440598v2ch16.pdf].