The basics

Setting up RStudio
Working in RStudio
Reading in data
Data structures
Resources

Setting up RStudio

To use RStudio, you must have both R and RStudio installed on your computer. Installing the most recent version of each is recommended.

Installation

Get R from the Comprehensive R Archive Network (CRAN): Windows Mac

RStudio is an interface to R with added functionality that makes it more user-friendly. While RStudio is not required to use R, it is very useful for data visualization because it supports tools like Shiny, ggvis, and R Markdown. Download the version of RStudio recommended for your system.

Orientation to RStudio

RStudio has 4 windows:
Upper-left: Code Editor, for composing R scripts (code you write) and viewing objects like data frames.
Lower-left: R Console. This is the “real” R: if you opened R alone (without RStudio), it would look and behave like this window.
Upper-right: Environment and History, shows objects like dataframes, variables and vectors (in Environment) and the code you’ve submitted (in History).
Lower-right: Tabs for viewing plots, installing packages, viewing help files, etc.

Under RStudio >> Preferences in the menu bar you can set RStudio options, including reassigning the tabs in each window and setting the color scheme of your display. Under Code in this menu, I suggest selecting Insert matching parens/quotes. With this selected, any time you type ( or " in your code, the matching closing character appears automatically, which prevents common errors caused by unclosed parentheses and quotation marks. Selecting Tools >> Global Options >> Code >> Editing >> Soft-wrap R source files wraps long lines in your R script so that they are always visible within the Code Editor window.

R scripts

To start a new R script, go to File >> New File >> R Script. Then, under File >> Save As, save the script in the folder that will be your working directory. Use the script to draft the code you send to the console. While you can write code directly in the console, best practice is to document every step in a script. This way you can go back to edit and troubleshoot, and you end up with a completely reproducible workflow.

In the following sections, chunks of R code can be copied and pasted directly into your R script. They will appear like this:

R code will look like this

Executing code will generate output which looks like this:

## [1] "R outputs will look like this"

Working in RStudio

In this section you will learn how to run, write, and save code in an R script.

Executing code

To run code from your R script, press Ctrl+Enter (Cmd+Return on a Mac) or click Run in the upper right-hand corner of the script editor. This executes the line or chunk of code the cursor is on, or any highlighted code. Code that has run is printed in the console along with any associated output. Go to RStudio >> Preferences >> Code to define what the keyboard shortcut executes. Selecting Multi-line R statement runs complete statements that span multiple lines together (recommended).

To practice running code, type a math expression and execute it to find the answer.

453*4.6/2+34

## [1] 1075.9

3^9

## [1] 19683

Commands can also be written directly in the console, but you will want to write the majority of your code in an annotated R script.

Annotating code

Using the # symbol you can annotate your code and take notes as you work. Use this to include descriptions of what you are doing in the R script.

# a math equation
1-23+456/7890

## [1] -21.94221

When a comment is written on the same line as R code, only the text to the left of the # symbol is run as code.

1-23+456/7890 # a math equation

## [1] -21.94221

‘<-’ operator

The ‘<-’ operator assigns a value to a name and is used to store data while you are working. It can be typed as either ‘<-’ or ‘->’; the name the arrow points to is assigned whatever is written on the other side.

value<-1-23+456/7890
value

## [1] -21.94221
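
As a small illustration of the two arrow directions described above, the sketch below stores the same result with ‘<-’ and with ‘->’ and checks that both names hold the same value. The name value2 is made up for this example.

```r
# Leftward assignment: the name on the left receives the result
value <- 1 - 23 + 456 / 7890

# Rightward assignment: the name on the right receives the result
1 - 23 + 456 / 7890 -> value2

# Both names now hold the same number
identical(value, value2)
```

In practice most R code uses only the leftward form, which keeps the name at the start of the line where it is easy to find.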

Whatever is stored under a name can then be accessed by typing the name, and used in subsequent operations. Names should start with a letter and can contain any combination of letters, numbers, underscores, and periods.

myval<-144
myval

## [1] 144

myval.sqrt<-sqrt(myval) # calculate the square root of the value stored in 'myval'
myval.sqrt

## [1] 12

Packages

When you install R from CRAN, it comes with the basic functionality (base R) needed to implement the R language. The R community has developed thousands of packages that extend it, and many of these are hosted on CRAN.

To install new packages from CRAN:

install.packages('ggplot2') # a core package for data visualization

install.packages(c('ggplot2','tidyverse')) # to install multiple packages at once

R Task Views highlight useful packages on CRAN for specific topics, like graphics or phylogenetics.

There are also many excellent packages that are not available from CRAN but can be installed from GitHub via the developer. If you don’t know where to start, check out the tidyverse, rOpenSci, and Bioconductor for a wide array of packages that support research.

To be able to use an installed package, it must be loaded into R:

library(ggplot2) # load ggplot2
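
One possible pattern for the install-then-load steps above is to install only the packages that are missing before loading them. This is a sketch rather than a required workflow; the package names 'stats' and 'utils' ship with R and are used here only as stand-ins for whatever packages your own script needs.

```r
# Packages this script needs ('stats' and 'utils' are placeholders that
# ship with R; replace them with your own list, e.g. 'ggplot2')
needed <- c("stats", "utils")

# requireNamespace() returns TRUE when a package is installed,
# without attaching it to the session
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Install only what is missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Load every package; character.only = TRUE lets library() take a string
for (pkg in needed) {
  library(pkg, character.only = TRUE)
}
```

Checking first avoids reinstalling packages every time the script is run.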

Typically, all packages needed to run a given R script are loaded at the beginning of the script. Packages come with documentation that describes each function and its use in R. Many R packages also have vignettes that demonstrate how they can be used.

ggplot2 index
ggplot2 reference manual
ggplot2 aesthetic specifications vignette

Working directory

To be able to read in data, R needs to know exactly where the data are. By setting your working directory in the very beginning of a script, you can give R the location to look for files to read in. It is also the location where R will save files.

To set your working directory, modify the following code to the appropriate location on your computer:

# On mac
setwd('/Users/collnell/rstats/data viz/GWU-visual')

# On PC
setwd('C:/collnell/rstats/data viz/GWU-visual')

To determine your current working directory:

getwd()

## [1] "/Users/collnell/rstats/"

To list the contents of your working directory:

list.files()
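
The three functions above can be tried safely in a throwaway folder. The sketch below assumes nothing about your computer: it creates a folder inside R's temporary directory (the folder name wd_demo and file name example.csv are made up), saves a file there, and then restores the original working directory.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Create a throwaway folder inside R's temporary directory
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Point R at the new folder and save a small file there
setwd(demo_dir)
writeLines(c("a,b", "1,2"), "example.csv")

# The file is written to, and listed from, the working directory
files_found <- list.files()
print(files_found)

# Restore the original working directory
setwd(old_wd)
```

Building paths with file.path() rather than pasting strings together keeps the separators consistent on both Mac and PC.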

Reading in data

You can import many types of data into R, the simplest being .csv or .txt files, which can be read in using base functions (no packages needed). For any file in your working directory, you can simply give the relative path to the file to import it. The relative path is the file's location within your working directory.

.csv

data.in<-read.csv('data_birdpredation.csv')

This says the ‘data_birdpredation.csv’ file is in my working directory, and assigns the data to ‘data.in’

Or give the complete filepath:

data.in<-read.csv('/Users/collnell/rstats/data viz/GWU-visual/data_birdpredation.csv')

.txt

For .txt files, read the file in as a table and define the field separator character. The default is to use white space, but sep may need to be set based on how your data are saved.

data.in<-read.table('data_birdpredation.txt', sep = ',')

If the rows in your dataset have unequal length, set fill = TRUE to fill out each cell and be able to read in the dataframe.
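
To see what fill = TRUE does without needing a file, the sketch below parses a short made-up string in which the first data row is missing its last field. The column names and values are invented for illustration.

```r
# Comma-separated text where the first data row has only two of three fields
raw_text <- "plot,abundance,richness
3,2
9,11,4"

# text = lets read.table() parse a string instead of a file on disk;
# fill = TRUE pads the short row with NA instead of stopping with an error
demo <- read.table(text = raw_text, sep = ",", header = TRUE, fill = TRUE)

demo

# The missing cell is recorded as NA
is.na(demo$richness[1])
```

Rows padded this way are worth inspecting afterwards, since a short row can also mean the data were saved incorrectly.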

.xlsx

To read in Excel files you can use the readxl package from the tidyverse:

install.packages('readxl')
library(readxl)

data.in<-read_excel("data_birdpredation.xlsx", sheet = 1) # sheet selects which sheet in the file to read

via GitHub

You can also read data into R directly from the internet. GitHub is commonly used to host data, which can be accessed directly using the web address of the raw file. To find the address of a raw file, select the data file on github.com and click ‘Raw’ in the upper right-hand corner. Then copy the file's URL from your browser.

data.in<-read.csv('https://raw.githubusercontent.com/collnell/GWU-visual/master/data_birdpredation.csv')

Data structures

It is essential to be familiar with your data! Understanding the structure of your data will make it easier to work with. To look at the structure:

class(data.in)

## [1] "data.frame"

str(data.in)

## 'data.frame':    29 obs. of  13 variables:
##  $ diversity : Factor w/ 2 levels "M","P": 1 1 1 1 1 1 1 1 1 1 ...
##  $ plot      : int  3 9 12 17 20 21 30 38 39 53 ...
##  $ tree.sps  : Factor w/ 19 levels "A","ABCD","ABCE",..: 17 1 18 19 1 10 17 10 15 19 ...
##  $ abundance : int  2 11 4 11 5 21 3 13 4 21 ...
##  $ richness  : int  1 4 3 5 2 6 2 7 4 7 ...
##  $ FD        : num  0.189 0.581 0.48 0.649 0.352 0.788 0.352 0.719 0.541 0.845 ...
##  $ predation : num  0.12 0.157 0.152 0.106 0.156 0.2 0.105 0.224 0.208 0.111 ...
##  $ cwm.inv   : num  30 45 56.7 56 70 ...
##  $ cwm.canopy: num  0 5 0 4 10 ...
##  $ DBH       : num  15.5 22.7 17.2 22.1 21.3 ...
##  $ DBH_sd    : num  6.19 8.82 6.68 8.02 7.63 ...
##  $ height    : num  7.65 8.07 6.11 9.45 7.88 ...
##  $ height_sd : num  1.12 1.12 1.08 1.59 1.19 ...

This tells us that ‘data.in’ is a data.frame with 29 rows (observations) and 13 columns (variables). For each column, str() gives the name of the variable (e.g. diversity, plot…), the type of variable (Factor, int, num, or chr), and the first few values. Not everything is a data frame. You may be working with lists, vectors, matrices, or other objects, which produce different output from str().
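
For comparison with the data frame output above, here is a short sketch of str() applied to a vector, a matrix, and a list. The objects and their names are made up for this example.

```r
# A numeric vector: one dimension, one data type
v <- c(0.12, 0.157, 0.152)

# A matrix: two dimensions, one data type
m <- matrix(1:6, nrow = 2)

# A list: elements can differ in type and length
l <- list(site = "A", values = v)

# str() summarizes each structure differently
str(v)
str(m)
str(l)
```

For the list, str() describes each element on its own line, much as it does for the columns of a data frame.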

The ‘$’ symbol on each line indicates that the following information describes a column in the dataset. The ‘$’ symbol can also be used to select variables within a dataframe (see ‘Indexing dataframes’).

To view the dataframe:

View(data.in)

Or just look at the first few rows:

head(data.in)

Or the last few rows:

tail(data.in)

Data types

The data type assigned to each variable affects how you can work with that data in R. Checking data types is a good first step when you are getting errors in R. To assign a new data type, use ‘as.factor()’, ‘as.character()’, ‘as.numeric()’, or ‘as.integer()’.

For example, in the dataframe above, the ‘tree.sps’ variable was read in as a factor, but it should be character.

# convert factor to character
data.in$tree.sps<-as.character(data.in$tree.sps)
str(data.in)

## 'data.frame':    29 obs. of  13 variables:
##  $ diversity : Factor w/ 2 levels "M","P": 1 1 1 1 1 1 1 1 1 1 ...
##  $ plot      : int  3 9 12 17 20 21 30 38 39 53 ...
##  $ tree.sps  : chr  "D" "A" "E" "F" ...
##  $ abundance : int  2 11 4 11 5 21 3 13 4 21 ...
##  $ richness  : int  1 4 3 5 2 6 2 7 4 7 ...
##  $ FD        : num  0.189 0.581 0.48 0.649 0.352 0.788 0.352 0.719 0.541 0.845 ...
##  $ predation : num  0.12 0.157 0.152 0.106 0.156 0.2 0.105 0.224 0.208 0.111 ...
##  $ cwm.inv   : num  30 45 56.7 56 70 ...
##  $ cwm.canopy: num  0 5 0 4 10 ...
##  $ DBH       : num  15.5 22.7 17.2 22.1 21.3 ...
##  $ DBH_sd    : num  6.19 8.82 6.68 8.02 7.63 ...
##  $ height    : num  7.65 8.07 6.11 9.45 7.88 ...
##  $ height_sd : num  1.12 1.12 1.08 1.59 1.19 ...

Now when we look at the data structure we can see it is character.
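
One conversion deserves care: applying as.numeric() directly to a factor returns the internal level codes, not the numbers the labels show. The sketch below uses a made-up factor to show the difference. Note also that from R 4.0.0 onward, read.csv() no longer converts text columns to factors by default, so output like the above may show chr where older versions showed Factor.

```r
# A factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

# Converting directly returns the level codes (position of each label
# in the sorted set of levels)
codes <- as.numeric(f)
codes

# Converting to character first recovers the values shown in the labels
vals <- as.numeric(as.character(f))
vals
```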

Indexing dataframes

Being able to manipulate your data in R is essential. To access specific rows and/or columns in a dataframe R uses a bracket notation [,]. The comma between the two brackets reflects the 2 dimensions of the dataframe, the rows and columns. To select a specific row, enter the row number to the left of the comma.

Subset to the 3rd row of ‘data.in’

data.in[3,]

Leaving the right-hand side of the comma empty returns all columns. Columns can be subset in the same way, by entering the column number to the right of the comma:

data.in[3,2]

## [1] 12

Now a single value is returned that is the cell value for row 3, column 2.

You can also subset to multiple rows or columns at once. To do this, use the ‘:’ operator to select a range.

# subset to rows 1 to 5 in column 4
data.in[1:5,4]

## [1]  2 11  4 11  5

# subset to rows 1 to 5 in columns 4 to 8
data.in[1:5, 4:8]

A range only works if you want every row or column within it. You can also subset by providing a vector of row or column values:

# select columns 1,2,3 & 5 (not 4) and rows 1-5
data.in[1:5,c(1,2,3,5)]

Columns may also be indexed by their name:

data.in[1:5,'diversity']

## [1] M M M M M
## Levels: M P

cwm.df<-data.in[1:5,c('cwm.inv','cwm.canopy')]
head(cwm.df)

And as shown above, the ‘$’ symbol can be used to select specific columns:

cwm.df$cwm.inv

## [1] 30.000 45.000 56.667 56.000 70.000
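
The indexing forms above can also be tried without the bird predation file. The sketch below builds a small made-up data frame (the name toy and its values are invented for illustration) and applies each form to it.

```r
# A small stand-in data frame; stringsAsFactors = FALSE keeps text as character
toy <- data.frame(plot = c(3, 9, 12),
                  abundance = c(2, 11, 4),
                  diversity = c("M", "M", "P"),
                  stringsAsFactors = FALSE)

# Row 3, all columns
toy[3, ]

# Row 3, column 2: a single cell
cell <- toy[3, 2]

# Rows 1 to 2 of two columns selected by name
sub <- toy[1:2, c("plot", "abundance")]

# One column selected with $
abund <- toy$abundance
```

Selecting several columns returns a data frame, while selecting a single column or cell returns a plain vector.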

Some basic syntax

Troubleshooting

If you are not sure how to use a function, you can search the help files by typing ‘?’ in front of the function name:

?read.csv

The Help window gives an explanation of usage, followed by a description of each of the arguments and the expected output. Many functions also provide an example at the bottom with sample data, which can be very useful.
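
When you only need a quick reminder of a function's arguments, they can also be listed from the console without opening the help file. This is a small sketch using two base functions, args() and formals(); the name arg_names is made up for this example.

```r
# Print the function's arguments and their default values
args(read.csv)

# Or collect just the argument names as a character vector
arg_names <- names(formals(read.csv))
arg_names
```

The help file remains the place to find out what each argument actually does.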

Errors can occur when R is out of date, when there are conflicts between packages, or due to changes in functions between different versions of a package. To see which version of R you are running, as well as the packages that are loaded in your environment:

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.1.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       bindr_0.1.1      knitr_1.21       magrittr_1.5    
##  [5] tidyselect_0.2.5 munsell_0.5.0    colorspace_1.3-2 R6_2.3.0        
##  [9] rlang_0.3.0.1    stringr_1.3.1    plyr_1.8.4       dplyr_0.7.8     
## [13] tools_3.5.1      grid_3.5.1       gtable_0.2.0     xfun_0.4        
## [17] withr_2.1.2      htmltools_0.3.6  assertthat_0.2.0 yaml_2.2.0      
## [21] lazyeval_0.2.1   digest_0.6.18    tibble_1.4.2     crayon_1.3.4    
## [25] bindrcpp_0.2.2   purrr_0.2.5      glue_1.3.0       evaluate_0.12   
## [29] rmarkdown_1.11   stringi_1.2.4    compiler_3.5.1   pillar_1.3.0    
## [33] scales_1.0.0     jsonlite_1.6     pkgconfig_2.0.2
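
If you only need one piece of this information, it can be requested directly. The sketch below looks up the R version and the version of a single package; 'stats' is used because it ships with R, and the minimum version "3.0.0" is an arbitrary value chosen for illustration.

```r
# The R version as a single string
r_version <- R.version.string
r_version

# The installed version of one package
stats_version <- packageVersion("stats")
stats_version

# Versions can be compared against a minimum requirement
meets_minimum <- stats_version >= "3.0.0"
meets_minimum
```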

Resources

For an intro to summarizing, filtering, and creating new variables with dplyr see my tutorial on data wrangling.
Best practices
R graph gallery