Please install R.
Please install the latest Preview version of RStudio.
R comes with a base system and some contributed packages. This is what you just downloaded. The functionality of R can be significantly extended by using additional contributed packages, also called libraries. Those packages typically contain commands (functions) for more specialized tasks. They can also contain example datasets. We will make use of a series of external packages to work with spatial data later.
To install additional packages there are two main options:
install.packages("NameOfTheLibraryToInstall", dependencies = TRUE)
In order to actually use commands from an installed package you will also need to load it. This can be automated (so that whenever you launch R the libraries are loaded for you as well), or you submit a command:
library(NameOfTheLibraryToLoad)
or
require(NameOfTheLibraryToLoad)
The difference between the two is that library() will result in an error if the package is not installed, whereas require() will only give a warning (and return FALSE).
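Here is a small sketch of the difference (notapackage is a made-up name standing for a package you do not have installed):
library(notapackage) # stops with an error: there is no package called 'notapackage'
require(notapackage) # gives a warning only and returns FALSE
A common idiom building on this behavior is to install a package only if it cannot be loaded:
if (!require(raster)) install.packages("raster", dependencies = TRUE)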
Install the library raster, including its dependencies, and load it.
Then find the Console window, type in the following line, hit enter and see what happens:
volcano.r <- raster(volcano); spplot(volcano.r)
Use the "up" arrow on your keyboard to recall the line that you last typed in and hit enter.
RStudio is a development environment that makes working with R easier. In order to use it you need to also have R installed (which we did above). These are two independent software pieces, however they work together seamlessly. When launching RStudio it automatically detects your R installation and starts it up as well.
Some features of RStudio:
[ Demo ]
Here is how you assign a value to a variable. Variable names cannot contain spaces and they cannot begin with a dot followed by a number. For example, The Moon and .2TheMoon will not work. .ToTheMoon will work, but is not recommended. (To find reserved words in R type ?reserved in your console or search for reserved in the help).
x <- 14 # x = 14 also works, but some feel strongly that it should not be used and this is a comment, BTW
(Tip: In RStudio, use Alt and - (Option and - on a Mac) as a shortcut for <-.)
Here is how you retrieve the value from a variable:
x
[1] 14
Variable names are case sensitive. Thevariable is not the same as thevariable which is not the same as theVariable! Try the commands below and see what happens. (Tip: Concatenate commands in one line with a semicolon: ;)
myvar <- .37; myvar
myVar <- "dog"; myvar
The most common data types in R are: character, numeric (double), integer, and logical.
The typeof() command tells you the data type of an object.
typeof(x)
[1] "double"
Even though x contains only one value, in R it is called a “vector”.
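You can check this yourself:
length(x) # a single number is simply a vector of length 1
[1] 1
is.vector(x)
[1] TRUE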
Now let’s construct some longer vectors in R. One common way to do this is to use a function called c(). This is how it’s done:
a <- c(1, 2, 5.3, 6, -2, 4) # numeric vector
b <- c("one", "two", "three") # character vector
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE) #logical vector
A convenient command to find out about the structure of an object is str(). For example:
str(c)
logi [1:6] TRUE TRUE TRUE FALSE TRUE FALSE
Note that c in this example is the name of a vector, but c is also the name of the function used to construct a vector. OUCH. It works, but it is confusing. So avoid it.
Now try this and see what happens here:
d <- c("a", 1); str(d)
Subsetting:
Oftentimes we want to select one or several elements from a vector. This is called subsetting. Below are a few examples.
a[1] # note that in R the first position in the index is 1, not 0!
[1] 1
a[-1] # all except the first element
[1] 2.0 5.3 6.0 -2.0 4.0
a[c(2, 4)] # 2nd and 4th elements of vector
[1] 2 6
a[2:4] # from 2nd to 4th element
[1] 2.0 5.3 6.0
a[a > 5] # this is sometimes convenient
[1] 5.3 6.0
We can do math with vectors:
a + 1
[1] 2.0 3.0 6.3 7.0 -1.0 5.0
a * a
[1] 1.00 4.00 28.09 36.00 4.00 16.00
When the vectors operated upon are of unequal lengths, then the shorter vector is “recycled” as often as necessary to match the length of the longer vector. If the vectors are of different lengths because of a programming error, this can lead to unexpected results. But sometimes recycling of short vectors is the basis of clever programming.
Here is an example.
e <- c(1, 10)
a * e
[1] 1.0 20.0 5.3 60.0 -2.0 40.0
Now let’s try to understand what happens here.
a[c(TRUE, FALSE)]
[1] 1.0 5.3 -2.0
If we arrange data elements of a vector in a two-dimensional rectangular layout we have a matrix. To construct a matrix, we use a function conveniently called matrix().
y <- matrix(1:20, nrow = 5, ncol = 4) # generates 5 x 4 numeric matrix
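If you print y you can see that, by default, the matrix is filled column by column (set byrow = TRUE to fill it row by row instead):
y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20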
Subset a matrix with [row, column]:
y[, 4] # 4th column of matrix
y[3, ] # 3rd row of matrix
y[2:4, 1:3] # rows 2,3,4 of columns 1,2,3
Not surprisingly 2-dimensional matrices play an important role when working with raster data. We will come back to that at a later time.
Lists can have elements of any type. Here is how we construct lists. You may have guessed that to construct a list, we use the list() function:
myl <- list(name = "Sue", mynumbers = a, mymatrix = y, age = 5.3) # example of a list with 4 components
myl[[2]] # 2nd component of the list
myl[["mynumbers"]] # component named mynumbers in list
Lists will be important as they are used to construct vector data of the sp* type in R. This may sound foreign to you now, but you will see it shortly.
A data frame is the most common way of storing tabular data in R and something you will likely deal with a lot. For example, attribute data for vector-based spatial objects in R are stored as a data frame. You can really think of it as a table or a spreadsheet. It is a 2-dimensional structure and the columns can be of different element types.
Here is how you could construct a data frame.
mydf <- data.frame(ID=c(1:4),
Color=c("red", "white", "red", NA),
Passed=c(TRUE,TRUE,TRUE,FALSE),
Weight=c(99, 54, 85, 70),
Height=c(1.78, 1.67, 1.82, 1.59))
mydf
ID Color Passed Weight Height
1 1 red TRUE 99 1.78
2 2 white TRUE 54 1.67
3 3 red TRUE 85 1.82
4 4 <NA> FALSE 70 1.59
And subsetting it (try the following commands yourself):
mydf$Weight # Weight column
mydf[c("ID","Weight")] # columns ID and Weight
mydf[,2:4] # columns 2,3,4 of dataframe
mydf[mydf$Height > 1.8,] # use logical conditions to filter rows
mydf[mydf$Passed,]
A useful command to look at the top rows of a data frame is head(). Column names can be either retrieved or assigned with names(). We can also assign row names with rownames().
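For example:
head(mydf, 2) # show only the first two rows
names(mydf) # retrieve the column names
[1] "ID"     "Color"  "Passed" "Weight" "Height"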
We can easily create a new column in a data frame, calculated from existing columns, for example:
mydf$bmi <- mydf$Weight/mydf$Height^2
mydf
ID Color Passed Weight Height bmi
1 1 red TRUE 99 1.78 31.24605
2 2 white TRUE 54 1.67 19.36247
3 3 red TRUE 85 1.82 25.66115
4 4 <NA> FALSE 70 1.59 27.68878
By now we have used several R functions. I have also - sloppily, perhaps - called them commands. They all have in common that they are executed by typing their name followed by round brackets, in which we provide one or more parameters (or arguments), separated by commas, for the function to do something. Each function requires its own specific arguments, all of which have names, and they can be looked up in the help.
As an example let us revisit the matrix() function from above. If you look up the documentation, this is what you will find under the usage section:
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
dimnames = NULL)
Now, here is what we said earlier:
matrix(1:20, nrow = 5, ncol = 4)
So you can see that we have not consistently named our parameters, but R still knows what we want [1]. The reason is that R evaluates function arguments in three steps: first by exact matching on argument name, then by partial matching on argument name, and finally by position.
A second thing to notice is that often you do not have to specify all of the arguments. If you don’t, R will use default values if they are specified by the function. If no default value is specified, you will receive an error.
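For example, with matrix() itself:
matrix() # all arguments left at their defaults: a 1 x 1 matrix containing NA
matrix(1:6, nr = 2) # "nr" is unambiguous, so it is partially matched to nrow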
Functions usually return something back to you as output. Whatever they return (a table, some informational text, a logical value, …) is by default written to the console, so you can see it right away.
Oftentimes, however, we want to re-use the output of such a function. This is what we did above with the matrix example, and it is also what we will now do to read in some data.
One of the most common ways of getting data into R is to read in a table. And – you guessed it – we read it into a data frame! The function we will use for this is read.csv(). We will take a simple CSV file as an example.
Exercise:
Create a folder R_Workshop on your Desktop.
Download table_10.csv [2] into the R_Workshop folder.
Read table_10.csv into R and assign it to a data frame called dfm. (Hint: Use the help tab in RStudio to search for the command and find the exact syntax.)
Try the same with the read.table() function. What is different?
From the columns totalpop and totalIL calculate the percentage of the illiterate population, call it pctIL and add it as a new column to the data frame.
With this in place we can plot the percentage of the illiterate population by state, colored by division:
dColors <- data.frame(division = levels(dfm$division), color = rainbow(nlevels(dfm$division)))
dfm.col <- merge(dfm, dColors)
dfm.ord <- dfm.col[order(dfm.col$pctIL), ]
barplot(dfm.ord$pctIL, names.arg = dfm.ord$state, horiz = TRUE, las = 2, cex.names = 0.5,
col = dfm.ord$color)
legend("bottomright", legend = dColors$division, fill = dColors$color)
reshape2 is an external library. It contains two functions that help to transform data tables between wide and long formats.
In a table in wide format there is a column for each variable. It looks like this:
subject age height
1 Adele 20 1.76
2 Adele 21 1.77
In a table in long format there is at least one column for the so-called "ID variables", a column for the possible variable types, and a column for the values of those variables. Using subject as the ID variable, the above table would look like this:
subject variable value
1 Adele age 20.00
2 Adele age 21.00
3 Adele height 1.76
4 Adele height 1.77
While we perhaps tend to record data in wide format, in R long-format data is often needed, for example when plotting with ggplot.
melt turns wide format into long format.
cast turns long format into wide format.
So for the above example:
wide.df <- data.frame(subject=c("Adele", "Adele"), # some sample data in wide format
age=c(20, 21),
height=c(1.76, 1.77))
library(reshape2) # melt() and dcast() come from the reshape2 package
melt(wide.df) # convert to long format
subject variable value
1 Adele age 20.00
2 Adele age 21.00
3 Adele height 1.76
4 Adele height 1.77
By default melt() uses all columns with numeric values as the variables whose values are melted. But we could also tell it otherwise, for example:
melt(wide.df, id.vars = c("subject", "age"))
subject age variable value
1 Adele 20 height 1.76
2 Adele 21 height 1.77
This is another possible long format for our table above!
Now let’s save this long format to a variable and cast it back into wide format. There are actually several commands for this; we will use dcast() as it produces a data frame as output, which is what we most typically want. The inputs for dcast() are the data frame and a formula for the ID variables and the values.
long.df <- melt(wide.df, id.vars = c("subject", "age"))
dcast(long.df, subject + age ~ variable)
subject age height
1 Adele 20 1.76
2 Adele 21 1.77
This is one of the simplest cases. Typically going from long to wide data tables takes a bit more tinkering as there are a number of more advanced options as you can see in the help.
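One example: if the formula does not uniquely identify each row, dcast() needs an aggregation function (here the mean, purely as an illustration):
dcast(long.df, subject ~ variable, fun.aggregate = mean)
  subject height
1   Adele  1.765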
Exercise:
Load the reshape2 package.
From the dfm you created earlier (from table_10.csv), select the columns state, division, totalpop and totalIL and save them into a new data frame dfm.select.
Melt dfm.select using state and division as ID variables and save the result into a new data frame dfm.long.
Finally, just to mention two more packages that can be helpful with data wrangling.
tidyr is a package similar to reshape2, with extended functionality. It can transform between long and wide format and has some additional convenient functionality, like renaming, concatenating and separating columns.
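As a rough sketch, the melt() call from above could be written with tidyr's gather() function like this (newer versions of tidyr offer pivot_longer() for the same purpose):
library(tidyr)
gather(wide.df, variable, value, -subject, -age) # same result as melt(wide.df, id.vars = c("subject", "age"))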
dplyr is a package that makes working with tables a little easier. In addition to simple operations like filtering, reordering of columns and selecting unique values, you can also connect directly to databases. Most conveniently it has a set of grouping functions that allow you to calculate statistics for subgroups of the entire dataset. This is how it works: the data is split into groups, a function is applied to each group, and the results are combined again.
This process is accordingly called split-apply-combine and is illustrated below.
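As a small self-contained sketch with the built-in mtcars dataset (before we turn to our own data):
library(dplyr)
by_cyl <- group_by(mtcars, cyl) # split: group the rows by number of cylinders
summarise(by_cyl, mean_mpg = mean(mpg)) # apply and combine: mean mpg per group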
As an example we will use the data frame dfm.select that we created in the exercise above. We want to calculate the percentage of the illiterate population per division, not per state as we did above. Following the recipe from above we would:
library(dplyr)
by_division <- group_by(dfm.select, division)
summarise(by_division, pct = sum(totalIL)/sum(totalpop)*100)
The end. For today.
[1] It is strongly discouraged to omit argument names when you actually write programs in R.
[2] Manfred te Grotenhuis, Rob Eisinga, and SV Subramanian: Robinson’s Ecological Correlations and the Behavior of Individuals: methodological corrections. International Journal of Epidemiology. Data source: U.S. Census Bureau (1933). Fifteenth Census of the United States: 1930. Population, Volume II, General Report. Statistics by Subjects, Chapter 13, Page 1229: Table 10. Illiteracy in the population 10 years old and over, by color and nativity, by divisions and states: 1930. Retrieved October 2010 [http://www2.census.gov/prod2/decennial/documents/16440598v2ch16.pdf].