Introduction to R

Downloading R and RStudio

We will use an integrated development environment (IDE) to write and run code in the R programming language. An IDE lets us test code piece by piece before we have our final script. The IDE we will use for R is RStudio. You will need to download the latest versions of both R and RStudio. R can be downloaded from the Comprehensive R Archive Network (CRAN) website, and RStudio can be downloaded from the RStudio website. We will go over the different windows and functions of RStudio before jumping into the code below.

Using Functions

All functions share the same syntax: the function name is followed by parentheses that hold the inputs (arguments). The accepted types and default values of each function’s arguments are listed in the package documentation, which can be found on CRAN or GitHub or opened from the console by putting a ? in front of the function name, like ?plot.
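As a brief illustration of this syntax, the sketch below calls two base R functions, seq() and sd(). The input values are made up for the example.

```r
#arguments can be matched by position, in the order the function defines them
by_position = seq(1, 10, 3)

#or matched by name, in which case the order does not matter
by_name = seq(by = 3, to = 10, from = 1)

#list the argument names that a function accepts
argument_names = names(formals(sd))

by_position
by_name
argument_names
```

Both calls to seq() return the same vector (1, 4, 7, 10). Naming arguments is usually easier to read once a function takes more than one or two inputs.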

R Packages

R packages are collections of functions that can be used for any number of purposes. Packages are extremely useful because the code for their functions has already been written, so we don’t need to write it ourselves. The base installation of R already includes a number of useful functions for data analysis and plotting, but because R is an open-source programming language, users have created many other packages for a variety of purposes. For example, packages such as dataRetrieval and fasstr (Flow Analysis Summary Statistics Tool for R) allow easy access to the flow and water quality data of the United States Geological Survey (USGS) and the Water Survey of Canada. Packages are stored in your system library and can be easily installed through RStudio or from CRAN or GitHub. Every package includes written documentation for each of its functions.

Cheat sheets for most packages are available online and are very helpful resources.

#load an installed package from your library into the current R session
library(fasstr)

#get the list of packages in your library
library()
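A small sketch of the rest of the package workflow follows. The install.packages() call is commented out because it only needs to be run once per computer and requires an internet connection. The stats package is used for the remaining lines only because it ships with every R installation; the flow values are made up.

```r
#install a package from CRAN (run once, then load it with library() in each session)
#install.packages("fasstr")

#check whether a package is installed without loading it
stats_installed = requireNamespace("stats", quietly = TRUE)

#call a single function from a package without loading it, using package::function()
middle_flow = stats::median(c(0.09, 0.13, 0.21, 0.26))

stats_installed
middle_flow
```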

Working Directories

The working directory is the location on your computer or hard drive where R will search for files you reference. RStudio will automatically set your working directory to the source location of the R project file that you open. You can check your working directory with the following command:

getwd()

## [1] "C:/Users/Lexie/OneDrive - The Pennsylvania State University/WEB Lab Files/Courses/FOR 602/Lesson Plans/R Analyses/Intro to R"

To set your working directory in your R script, use the following function with the folder location as the input. Note that you need to use “/” instead of “\” in your file location.

setwd("C:/Users/Lexie/OneDrive - The Pennsylvania State University/WEB Lab Files/Courses/FOR 602/Lesson Plans/R Analyses")
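Once the working directory is set, file locations can be built relative to it instead of being typed out in full. The sketch below uses the base R functions file.path() and file.exists(). The file name example1.csv is the example file read later in this lesson and may not exist on your computer.

```r
#build a path to a file inside the current working directory
data_path = file.path(getwd(), "example1.csv")

#check that the file is where R expects it before trying to read it
file_found = file.exists(data_path)

data_path
file_found
```

If file_found is FALSE, the working directory or the file name usually needs to be corrected before reading the file.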

Comments in R and Good Coding Habits

In R, comments can accompany your code. Anything that follows a hashtag (#) on a line will not be run as code, but will remain as a comment in your console and script. R does not have multi-line comments, so you must put a hashtag at the start of each line of a comment. Comments are critical for understanding code. No one can remember every function in R, so comments help keep track of what the code actually does. Always put a comment before each function you use. You can also add multiple hashtags in a row (######) to start a section that can be easily referenced in RStudio.

Beyond providing adequate comments in your code, make sure to use indents appropriately. You can split a line of code into multiple lines that are indented from the starting point of the first line. Always break up long lines of code so that you can read them easily. Additionally, if you start new functions or loops within your code, indent that code further. Keeping your code organized like this makes it much easier to read and to recall its purpose.

Below is an example of proper commenting and indentation. We will discuss what this type of code means later on.

#set the colors for the graph
mycolors = c("dodgerblue","darkred")

#create a plot of percent change in flow rates
plot1 = ggplot(Data, aes(x=BFI, y=PercentChange, color=Year)) +
  scale_color_manual(values = mycolors) +                         #set the colors for each year manually
  geom_point() +                                                  #add points to the graph
  labs(x="Average BFI",y="Change in 10th Percentile Flow (%)") +  #create labels for your x and y axes
  theme(legend.position = c(0.8,0.9))                             #set the position of the legend

Data and Data Structures in R

Below are four common types of data that you will encounter in R

Type of Data	Description	Example
Double	Numbers that can contain a decimal	23.5
Integer	Whole numbers (typed with an L suffix, otherwise R stores them as doubles)	23L
Logical	TRUE or FALSE values	TRUE
Character	Sequences of characters (text)	"Florida"

Common classes of data include numeric, character, and factor.
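The types and classes above can be checked directly with typeof() and class(). The values below are illustrative only.

```r
#typeof() reports how R stores a value
type_double = typeof(23.5)          #"double"
type_integer = typeof(23L)          #"integer"
type_default = typeof(23)           #"double", because there is no L suffix
type_logical = typeof(TRUE)         #"logical"
type_character = typeof("Florida")  #"character"

#a factor stores categories; class() reports how R treats the object
land_use = factor(c("Agriculture", "Mixed", "Urban", "Mixed"))
class_factor = class(land_use)      #"factor"
category_names = levels(land_use)   #the unique categories, in alphabetical order

category_names
```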

All data in R is stored in objects. Some common types of objects are vectors, lists, and data frames. Vectors store data of a single type in one dimension. Vectors can also have attributes attached to them, such as names. Base R has many simple math functions that can be applied to vectors. Let’s go through some basic syntax of creating a vector in R.

#create a vector using the c() or concatenate function
#note: <- or = can be used interchangeably to assign a value to an object
vector1 = c(1,2,3,4)

vector1

## [1] 1 2 3 4

#check to determine what type of data is in vector1
typeof(vector1)

## [1] "double"

#find the mean of vector1
mean(vector1)

## [1] 2.5

#find length of vector1
length(vector1)

## [1] 4

#to save the manipulated vector, assign it to a new object
vector2 = vector1 + 2

vector2

## [1] 3 4 5 6
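The name attribute mentioned above can be sketched as follows. The flow values are rounded from the example data used later in this lesson, and the object names are only for illustration.

```r
#create a vector of annual flows (cms) and attach a name to each value
flow = c(0.1258, 0.2139, 0.2642, 0.0939)
names(flow) = c("Cedar Run", "Slab Cabin Run", "Thompson Run", "Spring Creek")

#extract a value by its name instead of its position
thompson_flow = flow["Thompson Run"]

#math functions and comparisons are applied to every value in the vector
largest_flow = max(flow)
count_above = sum(flow > 0.10)   #number of stations with flow above 0.10 cms
flow_cfs = flow * 35.31          #names are carried over to the new vector

flow_cfs
```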

Lists are similar to vectors in that they have only one dimension; however, each element of a list can hold an entire object rather than a single value. The list() function creates lists, similar to how c() creates vectors.

list1 = list(vector1, vector2)

list1

## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] 3 4 5 6
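Elements of a list are pulled out with double brackets or, if the elements are named, with $. The sketch below rebuilds the two vectors so it can run on its own; the element names are made up for the example.

```r
vector1 = c(1, 2, 3, 4)
vector2 = vector1 + 2

#a list can mix objects of different types and lengths
list2 = list(original = vector1, shifted = vector2, label = "example")

#extract an element by position, by name with [[ ]], or by name with $
second_element = list2[[2]]
label_element = list2[["label"]]
shifted_element = list2$shifted

#length() counts the elements of the list, not the values inside them
list_length = length(list2)

second_element
list_length
```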

Data frames are the most common data structure that you will be using. Data frames are essentially the two-dimensional version of a list. Like an Excel spreadsheet, a data frame stores equal-length vectors of any class as columns.

Creating Data Frames

You can load data frames from .csv and .txt files using the read.csv() and read.delim() functions, respectively. RStudio also has an Import Dataset button above the global environment. You can also create data frames using the data.frame() function. You can check the structure of your data frame using the str() function.

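Some notes on formatting your data spreadsheets before uploading:

Column names should not include spaces. R will automatically replace each space with a period.
Do not include symbols (such as %, $, or #) in your spreadsheet. R replaces symbols in column names with periods, and a symbol inside a numeric column causes the whole column to be read as text.
All of your rows need to be the same length, or R will fill in the missing data with NAs.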
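These formatting rules can be seen on a tiny, made-up CSV. The sketch below passes the file contents to read.csv() through its text argument, so no file is needed; the column names and values are hypothetical.

```r
#a small CSV whose header contains spaces and whose last row is missing a value
csv_text = "Station Name,Annual Q\nCedar Run,0.13\nSpring Creek,\n"

demo = read.csv(text = csv_text, header = TRUE)

#spaces in the column names were replaced with periods
column_names = names(demo)

#the missing flow value was filled in with NA
missing_flow = is.na(demo$Annual.Q)

column_names
demo
```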

#upload example data from the file in your directory
data1 = read.csv("example1.csv", header=TRUE)

#check the structure of your data frame
str(data1)

## 'data.frame':    4 obs. of  3 variables:
##  $ Station    : chr  "Cedar Run" "Slab Cabin Run" "Thompson Run" "Spring Creek"
##  $ AnnualQ_cms: num  0.1258 0.2139 0.2642 0.0939
##  $ Land_Use   : chr  "Agriculture" "Mixed" "Urban" "Mixed"

#get number of rows
nrow(data1)

## [1] 4

#get number of columns
ncol(data1)

## [1] 3
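If the example file is not available, the same data frame can be built by hand with data.frame(). The values below are copied from the printed output of data1 in this lesson; the object name data_manual is only for illustration.

```r
#each argument becomes a column; all columns must be the same length
data_manual = data.frame(
  Station = c("Cedar Run", "Slab Cabin Run", "Thompson Run", "Spring Creek"),
  AnnualQ_cms = c(0.1258454, 0.2139495, 0.2642315, 0.0938724),
  Land_Use = c("Agriculture", "Mixed", "Urban", "Mixed")
)

#check the structure and the dimensions (rows, columns)
str(data_manual)
frame_dimensions = dim(data_manual)

frame_dimensions
```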

You can click on your data frame in the environment window in RStudio to look at the data frame in a spreadsheet format.

Subsetting Data

Data can be extracted from a data frame using the subset() function, which takes a logical condition that selects the rows you want to keep. You can also select rows and columns based on their position in the data frame with dataframe[row number(s), column number(s)]; a negative number removes that row or column instead.

Logical Operators in R

Logical Operator	Meaning
==	Equal to
!	Logical negation (not)
| and ||	Or
& and &&	And
<, >, >=, <=	Less than, greater than, greater than or equal to, less than or equal to

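A short sketch of how these operators behave on vectors: each comparison returns one TRUE or FALSE per value, and & (and) and | (or) combine two conditions value by value. The vectors are rounded from the example data.

```r
flow = c(0.1258, 0.2139, 0.2642, 0.0939)
land_use = c("Agriculture", "Mixed", "Urban", "Mixed")

above_threshold = flow > 0.10                  #TRUE TRUE TRUE FALSE
is_mixed = land_use == "Mixed"                 #FALSE TRUE FALSE TRUE
not_mixed = !is_mixed                          #TRUE FALSE TRUE FALSE

#both conditions must be TRUE
both_conditions = above_threshold & is_mixed   #FALSE TRUE FALSE FALSE

#at least one condition must be TRUE
either_condition = above_threshold | is_mixed  #TRUE TRUE TRUE TRUE

both_conditions
```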
#subset based on condition
data2 = subset(data1, AnnualQ_cms > 0.10)

data2

##          Station AnnualQ_cms    Land_Use
## 1      Cedar Run   0.1258454 Agriculture
## 2 Slab Cabin Run   0.2139495       Mixed
## 3   Thompson Run   0.2642315       Urban

#subset by removing one row based on its location
#remove row 3 and keep all columns
data3 = data1[-c(3),]

data3

##          Station AnnualQ_cms    Land_Use
## 1      Cedar Run   0.1258454 Agriculture
## 2 Slab Cabin Run   0.2139495       Mixed
## 4   Spring Creek   0.0938724       Mixed
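Bracket notation also accepts logical conditions and column names, not only row and column numbers. The sketch below rebuilds data1 from the values printed above so that it runs on its own.

```r
data1 = data.frame(
  Station = c("Cedar Run", "Slab Cabin Run", "Thompson Run", "Spring Creek"),
  AnnualQ_cms = c(0.1258454, 0.2139495, 0.2642315, 0.0938724),
  Land_Use = c("Agriculture", "Mixed", "Urban", "Mixed")
)

#keep the rows where land use is mixed, and keep all columns
mixed_rows = data1[data1$Land_Use == "Mixed", ]

#keep rows 1 and 2 and only two named columns
first_two = data1[1:2, c("Station", "AnnualQ_cms")]

#extract a single column as a vector
flow_column = data1$AnnualQ_cms

mixed_rows
first_two
```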

Some Helpful dplyr Functions

The dplyr package, part of the tidyverse collection of data management packages, offers some very useful functions for data management and cleaning. This package allows you to easily subset, extract, and manipulate data. The filter() function works similarly to subset() to extract rows that meet a logical condition. The select() function works similarly to extract columns. The mutate() function is very helpful for computing new columns. The RStudio cheat sheet for dplyr provides information on all of the functions available in the package.

#call dplyr into your library
library(dplyr)

#use filter to subset flow data greater than 0.1 cms
data4 = filter(data1, AnnualQ_cms > 0.10)

data4

##          Station AnnualQ_cms    Land_Use
## 1      Cedar Run   0.1258454 Agriculture
## 2 Slab Cabin Run   0.2139495       Mixed
## 3   Thompson Run   0.2642315       Urban

#use mutate to create a column with flow in cfs
data4 = mutate(data4, AnnualQ_cfs = AnnualQ_cms*35.31)

data4

##          Station AnnualQ_cms    Land_Use AnnualQ_cfs
## 1      Cedar Run   0.1258454 Agriculture    4.443601
## 2 Slab Cabin Run   0.2139495       Mixed    7.554556
## 3   Thompson Run   0.2642315       Urban    9.330014

#use select to keep only the station and flow in cfs columns
data5 = select(data4, Station, AnnualQ_cfs)

data5

##          Station AnnualQ_cfs
## 1      Cedar Run    4.443601
## 2 Slab Cabin Run    7.554556
## 3   Thompson Run    9.330014

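Another useful tool in dplyr is the pipe. A pipe lets you run multiple functions on a data frame in sequence within one statement. The pipe notation is x %>% f(), which passes the object on the left of “%>%” to the function on the right as its first argument. A “%>%” must be placed between each pair of steps, at the end of the line when the pipe is split over several lines.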

#a pipe that incorporates all of the above example functions
data6 <- data1 %>%
  filter(AnnualQ_cms > 0.10) %>%
  mutate(AnnualQ_cfs = AnnualQ_cms*35.31) %>%
  select(Station, AnnualQ_cfs)

data6

##          Station AnnualQ_cfs
## 1      Cedar Run    4.443601
## 2 Slab Cabin Run    7.554556
## 3   Thompson Run    9.330014
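To make the pipe less mysterious, the sketch below writes the same two steps once as nested function calls and once as a pipe, then checks that the results match. It assumes dplyr is installed; the data frame is rebuilt from the values printed earlier, and the object names are only for illustration.

```r
library(dplyr)

flows = data.frame(
  Station = c("Cedar Run", "Slab Cabin Run", "Thompson Run", "Spring Creek"),
  AnnualQ_cms = c(0.1258454, 0.2139495, 0.2642315, 0.0938724)
)

#nested calls are read from the inside out
nested_result = select(filter(flows, AnnualQ_cms > 0.10), Station)

#the pipe is read from top to bottom, one step per line
piped_result = flows %>%
  filter(AnnualQ_cms > 0.10) %>%
  select(Station)

#both approaches return the same data frame
same_result = identical(nested_result, piped_result)

same_result
```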

Graphics in R

Base R

Base R has plotting functions that can be used to easily create plots. There are separate functions for scatter plots and barplots. Remember that you can use ?plot or ?barplot to check the arguments for each function. There are many resources online as well.

#grid plot of all of the data in the data frame
plot(data1)

#create a barplot of the flow data in data5
barplot(data5$AnnualQ_cfs, ylab="Annual Q (cfs)",
        names.arg = data5$Station)
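A labeled scatter plot can be made with the same plot() function by giving it an x and a y vector. The sketch below also uses par(mfrow = ...), which arranges base R plots in a grid, to place a scatter plot and a barplot side by side. The vectors are copied from the example output above; the colors and point style are arbitrary choices.

```r
station = c("Cedar Run", "Slab Cabin Run", "Thompson Run")
flow_cms = c(0.1258454, 0.2139495, 0.2642315)
flow_cfs = flow_cms * 35.31

#arrange the next base R plots in 1 row and 2 columns, saving the old settings
old_par = par(mfrow = c(1, 2))

#scatter plot with axis labels, filled points, and a color
plot(flow_cms, flow_cfs,
     xlab = "Annual Q (cms)", ylab = "Annual Q (cfs)",
     pch = 16, col = "dodgerblue")

#barplot() returns the x position of the middle of each bar
bar_midpoints = barplot(flow_cfs, names.arg = station,
                        ylab = "Annual Q (cfs)", col = "darkred")

#restore the original plot layout
par(old_par)
```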

Plotting with ggplot2

The ggplot2 package is another useful graphing tool that creates high-quality figures. We will use it in our next R lab to graph precipitation data. A cheat sheet for ggplot2 is available online. There are a number of functions and formatting options that can be applied to a graph in ggplot2. To add formatting to a plot, a “+” needs to be included between commands, at the end of each line that is followed by another command.

library(ggplot2)

#create a barplot of data5
ggplot(data5, mapping = aes(x=Station, y=AnnualQ_cfs)) + #set data to be plotted
  geom_bar(stat="identity") +   #define what type of plot (barplot)
  labs(y="Annual Q (cfs)")      #define y-axis label

You can set the colors of your plot manually or by a specified variable. The fill aesthetic in ggplot2 controls the color that fills the bars of a barplot. The color aesthetic controls the color of the points in a scatter plot.

#set the colors to be by station
ggplot(data5, mapping = aes(x=Station, y=AnnualQ_cfs, fill=Station)) + #set fill to station to color the bars by station
  geom_bar(stat="identity") +  
  labs(y="Annual Q (cfs)")

When a color is added, a legend appears. You can format the legend position as needed, or, in this case, remove the legend because the x-axis label already includes that information. Legend formatting is done within the theme() function.

#note: par(mfrow) only arranges base R plots and has no effect on ggplot2 figures

#set the colors to be by station
ggplot(data5, mapping = aes(x=Station, y=AnnualQ_cfs, fill=Station)) +  
  geom_bar(stat="identity") +  
  labs(y="Annual Q (cfs)")  +
  theme(legend.position = "none")

#manually set the colors and remove the gray background
ggplot(data5, mapping = aes(x=Station, y=AnnualQ_cfs, fill=Station)) +  
  geom_bar(stat="identity") +
  scale_fill_manual(values=alpha(c("skyblue","deepskyblue", "blue4"),0.5)) + #manually sets colors with 50% transparency
  labs(y="Annual Q (cfs)")  +
  theme_bw() +            #theme to remove gray background
  theme(legend.position = "none")

#scatter plot using geom_point() instead of geom_bar()
ggplot(data5, mapping = aes(x=Station, y=AnnualQ_cfs, color=Station)) +  
  geom_point(size=7) +
  scale_color_manual(values=alpha(c("skyblue","deepskyblue", "blue4"),0.5)) + #manually sets colors with 50% transparency
  labs(y="Annual Q (cfs)")  +
  theme_bw() +            #theme to remove gray background
  theme(legend.position = "none")
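Instead of removing the legend, it can be moved. The sketch below keeps the legend and places it under the plot, and it stores the plot in an object so it can be printed or reused later. It assumes ggplot2 is installed; data5 is rebuilt from the values printed earlier, and the object name legend_plot is only for illustration.

```r
library(ggplot2)

data5 = data.frame(
  Station = c("Cedar Run", "Slab Cabin Run", "Thompson Run"),
  AnnualQ_cfs = c(4.443601, 7.554556, 9.330014)
)

#store the plot in an object instead of printing it right away
legend_plot = ggplot(data5, mapping = aes(x=Station, y=AnnualQ_cfs, fill=Station)) +
  geom_bar(stat="identity") +
  labs(y="Annual Q (cfs)") +
  theme_bw() +                          #complete theme goes first
  theme(legend.position = "bottom")     #then adjust individual settings

#typing the object name draws the plot
legend_plot
```

The order of the last two lines matters: a complete theme such as theme_bw() resets the theme settings, so a theme() adjustment placed before it would be overwritten.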

There are numerous ways to format plots in ggplot2. I would recommend playing around with the different features to determine the style you like best.