Topics

In this session we’ll be covering:

Remember from last session…



Functions

  • Now that we have learned how to group data and assign it a variable name, we can start performing tasks on the data.
  • We do this with ‘functions’
  • Functions are a way to repeat the same task on different data
  • R has many built-in functions that perform common tasks
x <- c(4, 8, 1, 14, 34)
mean(x) # Calculate the mean of the data set
## [1] 12.2
y <- c(1, 4, 3, 5, 10, NA)
mean(y, na.rm = TRUE) # If your data set has missing values (NA), tell mean() to remove them before doing the calculation.
## [1] 4.6
log(27)  #Natural logarithm
## [1] 3.295837
log10(100) #base 10 logarithm
## [1] 2
sqrt(225) # Square root 
## [1] 15
abs(-5) #Absolute value 
## [1] 5
  • They all have the form function()
  • “function” is the name, which usually gives you a clue about what it does
  • () is where you put your data or indicate options. These are referred to as ‘Arguments’.
  • To see what goes inside (), type a question mark in front of the function name and run it
?mean

In RStudio, you will see the help page for mean() in the bottom right corner

  • On the help page, under Usage, you see mean(x, ...)
  • This means that the only required argument inside () is x
  • On the help page under Arguments you will find a description of what x needs to be
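
The argument list can also be checked from the console without opening the help page. The sketch below uses base R's args() and formals(). It also suggests where na.rm comes from: the ... in mean(x, ...) passes extra arguments on to mean.default(), the method R uses for ordinary numeric vectors.

```r
# Print the argument list of mean(): only x and ... are declared
args(mean)

# The default method, used for plain numeric vectors, also declares trim and na.rm
args(mean.default)

# Collect the argument names as character vectors for comparison
generic_args <- names(formals(mean))
default_args <- names(formals(mean.default))
```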

You can also use functions in combination with objects you have created

answer <- 1+1
log(25 + answer)
## [1] 3.295837

Many built-in functions in R have multiple arguments, so you have to give the function some more information so that it can perform the correct calculation.

round(12.3456) # default is to round to the nearest integer
## [1] 12
round(12.3456, digits=3)  
## [1] 12.346
round(12.3456, digits=1)
## [1] 12.3


Other Common and Useful functions

The seq() or ‘sequence’ function is commonly used to create a vector containing a regular sequence of numbers. This is used a lot when writing functions. Its basic form is seq(from, to, by), where by is the step size.

seq(1, 5, by = 1)
## [1] 1 2 3 4 5
x <- 1:5  #Here we are using the colon operator to create a sequence from 1 to 5.  This is a shortcut if you just need to sequence through numbers or a vector, without skipping.
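
A few more variations on seq(), as a sketch: a fractional step, a requested number of values via the length.out argument, and a negative step for counting down.

```r
# Step by a fraction
quarters <- seq(from = 0, to = 1, by = 0.25)

# Ask for a number of evenly spaced values instead of a step size
four_values <- seq(from = 1, to = 10, length.out = 4)

# Count down with a negative step
countdown <- seq(from = 10, to = 1, by = -3)
```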


The paste function

The paste function concatenates (combines) vectors into character strings. paste() only works with characters, so if you give it numbers to paste, it converts them to characters first.

x <- "Hello"
y <- "world!"
paste(x, y, sep = " ")
## [1] "Hello world!"
x <- 5
y <- 3
z <- paste(x, y, sep = "")
str(z)
##  chr "53"
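
paste() also works element by element on longer vectors. The sketch below builds several labels at once, then joins them into a single string with the collapse argument. paste0() is a shortcut for sep = "". The label names are made up for illustration.

```r
# One label per element of 1:3
ids <- paste("site", 1:3, sep = "_")

# collapse joins all elements into a single string
one_string <- paste(ids, collapse = ", ")

# paste0() pastes with no separator
label <- paste0("PM", 2.5)
```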


The substr function

As data analysts we often run into messy data. The use of substr allows you to pull out only the elements you care about in a date, address, monitor ID, parameter description, etc.

  • In AQS data a monitor ID may be written in the following format: [State code - County code - Site number - Parameter code - POC]. If we only wanted to pull out the site number for this monitor ID we could do the following:
wi.mon <- c('55-021-0015-44201-2')  # Ozone monitor in Columbia County, WI
site.id <- substr(wi.mon, start = 8, stop = 11)  # Provide the start and stop position within the character string.
site.id
## [1] "0015"
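
When the pieces are separated by a consistent delimiter, splitting on the delimiter is an alternative to counting character positions. This is a sketch using base R's strsplit(); the field names and the second monitor ID are hypothetical, added only for illustration.

```r
wi.mon <- '55-021-0015-44201-2'

# Split the ID at each hyphen; strsplit() returns a list, so take element 1
parts <- strsplit(wi.mon, split = "-", fixed = TRUE)[[1]]
names(parts) <- c("state", "county", "site", "parameter", "poc")
parts["site"]

# substr() is vectorized, so it handles several IDs at once.
# The second ID is hypothetical, made up for illustration.
mon.ids <- c('55-021-0015-44201-2', '17-031-0001-44201-1')
site.ids <- substr(mon.ids, start = 8, stop = 11)
```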


Nesting functions

  • You can place a function inside of another function to perform multiple tasks on data in one step. For instance if you want to create a sequence of numbers and then take the mean of that sequence, you could either do it in a couple of steps, or all at once.
#Two steps
x <- seq(from=1, to=10, by=3)
mean(x)
## [1] 5.5
#One step
mean(seq(from=1, to=10, by=3))  #Make sure you have the parentheses located in the correct spot as R will evaluate from the inside out.
## [1] 5.5

Note: Typically you don’t want to have too many nested functions because the code becomes difficult for somebody else (or yourself later on…) to interpret. You will learn more about this later when you are writing your own functions.
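
To illustrate the trade-off, here is a sketch of a three-level nested call next to the same calculation written step by step. The intermediate names (roots, avg_root) are arbitrary.

```r
x <- c(4, 8, 1, 14, 34)

# Nested: read from the inside out (sqrt, then mean, then round)
nested <- round(mean(sqrt(x)), digits = 2)

# Step by step: each intermediate result gets a name
roots <- sqrt(x)
avg_root <- mean(roots)
stepwise <- round(avg_root, digits = 2)

# Both approaches should give the same answer
identical(nested, stepwise)
```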


Using R packages

  • R comes with basic functionality, meaning that some functions will always be available when you start an R session
  • However, anyone can write functions for R that are not part of the base functionality and make it available to other R users in a package
  • A package must be installed first and then loaded before you can use it
  • This is similar to a mobile app: you must first install the R package (like first downloading an app) then you must load the package before using its functions (like opening an app to use it)
  • For example, let’s say that R doesn’t have a function you need
  • The best way to find out if another R package does have that function is to ask Google
  • Use a search with key words describing what you want the function to do and just add “R package” to the end

  • Let’s say what you want to do is find serial correlation in an environmental data set
  • Google tells you that the R package EnvStats has a function called serialCorrelationTest()
  • First, try to use the function

x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)
  • It’s not available because we need to install the package first (again, like initially downloading an app)
  • In the bottom right panel of RStudio, click on the “Packages” tab then click “Install Packages” in the tool bar

  • A window will pop up
  • Start typing “EnvStats” into the “Packages” box, select that package, and click “Install”

  • Now that we’ve installed the package, we still can’t use the function we want
  • We’ve got to load the package first (opening the app)
  • For this, we will use the library() function

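install.packages("EnvStats")  # Command-line equivalent of the "Install Packages" window; only needed once
library(EnvStats)  # Loads the package; needed in every new R session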


Now we can use the function we want

x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)
## 
## Results of Hypothesis Test
## --------------------------
## 
## Null Hypothesis:                 rho = 0
## 
## Alternative Hypothesis:          True rho is not equal to 0
## 
## Test Name:                       Rank von Neumann Test for
##                                  Lag-1 Autocorrelation
##                                  (Exact Method)
## 
## Estimated Parameter(s):          rho = -0.0187589
## 
## Estimation Method:               Yule-Walker
## 
## Data:                            x
## 
## Sample Size:                     5
## 
## Test Statistic:                  RVN = 1.8
## 
## P-value:                         0.7833333
## 
## Confidence Interval for:         rho
## 
## Confidence Interval Method:      Normal Approximation
## 
## Confidence Interval Type:        two-sided
## 
## Confidence Level:                95%
## 
## Confidence Interval:             LCL = -0.8951272
##                                  UCL =  0.8576094

Here is a link to a page that lists many, many useful packages for environmental data analysis: https://cran.r-project.org/web/views/Environmetrics.html

  • Remember, when you close down RStudio, then start it up again, you don’t have to download the package again
  • But you do have to load the package to use any function that’s not in the R core functionality (this is very easy to forget)
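
One way to guard against forgetting either step is to check for the package at the top of a script. This is a minimal sketch using base R's requireNamespace(); it only prints a reminder rather than installing anything automatically.

```r
pkg <- "EnvStats"

# requireNamespace() returns TRUE if the package is installed and can be loaded
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  # character.only = TRUE lets library() take the package name from a variable
  library(pkg, character.only = TRUE)
} else {
  message("Run install.packages(\"", pkg, "\") once, then load it with library()")
}
```

Once a package is installed, a single function can also be called without loading the whole package by using the :: operator, for example EnvStats::serialCorrelationTest(x).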

Importing data

Download example Excel file

  • For this example, we will be using a file named “chicago_air.xlsx” located in the ‘datasets’ folder on the thumb drive that was provided
  • Alternatively, you can download the file here. Save the Excel file to the folder called “datasets” on the thumb drive.

Save as csv file

  • Open the spreadsheet in Excel
  • Click the “Office Button” in the upper-left corner
  • Choose “Save As”, then “Other Formats”
  • Select “CSV (Comma delimited)” from “Save as type:”
  • Click “Save”
  • Close Excel

Importing csv files in RStudio

  • In RStudio, go to “Tools” -> “Import Dataset” -> “From Text File”
  • Select your file using the window
  • Accept the defaults in the popup window and click “Import”

  • In the top right panel you will see a variable named chicago_air (by default RStudio names it after the file) that is a data frame of the spreadsheet we imported
  • In the top left panel the data frame is displayed


Importing csv files with read.csv()

  • You can also easily import csv files from the command line
  • read.csv() is a function that takes the name of a csv file as its main argument
  • It reads the csv file and converts it to a data frame
  • It assumes that the first row contains column names
  • You must assign the output of read.csv() to a variable to be able to work with the data

  • Since you saved your chicago_air.csv file to the folder named “datasets” in your thumb drive, you will use the entire file path as the argument in read.csv()
  • Assign the data frame to a name such as “chi_aq” to save it in your environment.

#  chi_aq <- read.csv("E:/datasets/chicago_air.csv", header=TRUE, stringsAsFactors = FALSE)
#   head(chi_aq)
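
If the thumb drive is not at hand, the same steps can be tried on a small temporary file. The sketch below writes the first three rows shown later in this session (date, ozone, temp only) to a temporary csv file and reads them back. The name chi_small and the temporary path are stand-ins, not part of the training data.

```r
# Create a small csv file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("date,ozone,temp",
    "2013-01-01,0.032,17",
    "2013-01-02,0.020,15",
    "2013-01-03,0.021,28"),
  csv_path
)

# The first row becomes the column names because header = TRUE
chi_small <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Check that text came in as character and numbers as numeric
str(chi_small)
```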

read.table() is a function that helps you import data from a text file, and read.delim() is a variant of it with defaults suited to delimited files. This is what you want to use if you are importing an AQS AMP 350 Raw Data Work File in pipe-delimited format.

#    aqs1 <- read.delim("E:/datasets/aqs1.txt", sep = "|", comment.char = "#", skip = 5, header = FALSE)  #   This is just example code, but it should work for importing an AMP 350WF.

You will notice that this includes several more arguments than what we used for read.csv(). This is because the raw AQS file type is a little more complicated than a nice, clean, formatted csv file. You will want to review the help file (?read.delim) if you are using this format.
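
To see what these arguments do, here is a sketch on a tiny pipe-delimited file. The file layout is hypothetical and much simpler than a real AMP 350 work file; it only reuses the monitor ID pieces and ozone values that appear elsewhere in this session. Setting colClasses = "character" is an extra choice made here so that codes such as 021 keep their leading zeros.

```r
# Write a small pipe-delimited file with one comment line (hypothetical layout)
txt_path <- tempfile(fileext = ".txt")
writeLines(
  c("# hypothetical header line",
    "55|021|0015|44201|2|0.032",
    "55|021|0015|44201|2|0.020"),
  txt_path
)

# sep = "|" splits on the pipe; comment.char = "#" skips the comment line;
# header = FALSE because the file has no row of column names
aqs_small <- read.delim(txt_path, sep = "|", comment.char = "#",
                        header = FALSE, colClasses = "character")

# Columns are named V1, V2, ... when there is no header
aqs_small$V3
```

Columns read as character can be converted afterwards, for example with as.numeric(aqs_small$V6).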

You can also use the write.csv function or write.table function to save your data.frame to your computer for later use, or to share with others.

#  write.csv(x = dataframe, file = where to write the file, row.names = FALSE)

#  write.csv(chicago_air, file = "E:/datasets/mydata.csv", row.names = FALSE) 

#You typically want to set row.names = FALSE unless you want to retain the row numbers for some reason. 
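
The effect of row.names is easiest to see by writing a small data frame both ways and reading it back. A sketch, using a temporary file and a three-row stand-in data frame (chi_small) built from values shown in this session:

```r
chi_small <- data.frame(
  date  = c("2013-01-01", "2013-01-02", "2013-01-03"),
  ozone = c(0.032, 0.020, 0.021),
  stringsAsFactors = FALSE
)

out_path <- tempfile(fileext = ".csv")

# Without row names: the file has only the data columns
write.csv(chi_small, file = out_path, row.names = FALSE)
without_rn <- read.csv(out_path, stringsAsFactors = FALSE)
names(without_rn)

# With row names: an extra unnamed column is written, which read.csv() calls X
write.csv(chi_small, file = out_path, row.names = TRUE)
with_rn <- read.csv(out_path, stringsAsFactors = FALSE)
names(with_rn)
```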

Note: You can use the foreign package (load it with library(foreign)) to write SAS, SPSS and Stata files.



The setwd() function is your friend!

setwd() allows you to tell R exactly where to look for the data you want. It’s like creating a shortcut to the folder. So if you always store your data for air quality analysis at one location, for example C:/Data/AQAnalysis/R/, you can set this as your working directory at the beginning of your session. Then you only have to give R the name of the specific file you want. Check out the example below.

#  setwd('C:/Data/AQAnalysis/R/')

So instead of typing this:

#  aqs1 <- read.csv('C:/Data/AQAnalysis/R/aqs1.txt')

You can just type the filename like this:

#  aqs1 <- read.csv('aqs1.txt')

So let’s go ahead and set our working directory for today’s session

#  setwd("E:/datasets")  #Set this to whatever path your thumb drive appears as

This is really handy when you are importing and exporting a lot of data from the same folder.
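
A sketch of the same idea that runs anywhere: it uses R's temporary folder as a stand-in for your data folder, checks the current location with getwd(), and restores the original working directory afterwards. file.path() is shown as an alternative that builds a full path without changing the working directory.

```r
old_wd <- getwd()              # Remember the current working directory

setwd(tempdir())               # Temporary folder, standing in for your data folder
writeLines(c("a,b", "1,2"), "example.csv")

d <- read.csv("example.csv")   # A bare file name is looked up in the working directory

setwd(old_wd)                  # Restore the original working directory

# file.path() joins folder and file name with a forward slash
full_path <- file.path("E:/datasets", "chicago_air.csv")
```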



Quick Data exploration

Now that we have imported some data let’s learn more about it.

  • The data we imported in the previous section can actually be obtained in R by using the data() function and a package we created for this training.
  • Use the code below to obtain the data that we will use moving forward. This is very similar to what is in the Excel file.
require(devtools)
install_github("natebyers/region5air")
library(region5air)
data(chicago_air)
chicago_air
##         date ozone temp solar month weekday
## 1 2013-01-01 0.032   17  0.65     1       3
## 2 2013-01-02 0.020   15  0.61     1       4
## 3 2013-01-03 0.021   28  0.17     1       5
## 4 2013-01-04 0.028   18  0.62     1       6
## 5 2013-01-05 0.025   26  0.48     1       7
## 6 2013-01-06 0.026   36  0.47     1       1
  • chicago_air is a data frame with 2013 ozone (ppm), temperature (F), and solar radiation (W/m2) readings from a monitor in the Chicago area
  • What column names are in the data frame?
colnames(chicago_air)
## [1] "date"    "ozone"   "temp"    "solar"   "month"   "weekday"
  • How many observations does the dataset contain?
  • We use the nrow() function to get the number of rows
nrow(chicago_air)
## [1] 365

Viewing the data

RStudio has a special function called View() that makes it easier to look at data in a data frame

View(chicago_air)

More functions for viewing data attributes

head(chicago_air)   ##Looks at the first 6 rows in the dataset (the default)
##         date ozone temp solar month weekday
## 1 2013-01-01 0.032   17  0.65     1       3
## 2 2013-01-02 0.020   15  0.61     1       4
## 3 2013-01-03 0.021   28  0.17     1       5
## 4 2013-01-04 0.028   18  0.62     1       6
## 5 2013-01-05 0.025   26  0.48     1       7
## 6 2013-01-06 0.026   36  0.47     1       1
tail(chicago_air)  ##Looks at the last 6 rows in the dataset (the default)
##           date ozone temp solar month weekday
## 360 2013-12-26 0.026   NA  0.41    12       5
## 361 2013-12-27 0.021   NA  0.62    12       6
## 362 2013-12-28 0.026   NA  0.61    12       7
## 363 2013-12-29 0.029   NA  0.08    12       1
## 364 2013-12-30 0.024   NA  0.44    12       2
## 365 2013-12-31 0.021   NA  0.49    12       3
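
Both functions take an n argument to change how many rows are shown. The sketch below uses airquality, a data set that ships with R (it is not the chicago_air data), so it runs without installing anything.

```r
# airquality is built into R; it is used here only as a stand-in
first_three <- head(airquality, n = 3)
last_two <- tail(airquality, n = 2)

nrow(first_three)
nrow(last_two)

# dim() gives the number of rows and columns in one call
dim(airquality)
```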

The str function is important because it describes the basic structure of the dataset. This lets you know if all the data was imported the way it was intended, i.e. numbers came in as numeric, text came in as characters, etc. This is great if you want a snapshot of the data structure.

str(chicago_air)  ##Describes the basic structure of the dataset
## 'data.frame':    365 obs. of  6 variables:
##  $ date   : chr  "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" ...
##  $ ozone  : num  0.032 0.02 0.021 0.028 0.025 0.026 0.024 0.021 0.031 0.024 ...
##  $ temp   : num  17 15 28 18 26 36 25 30 41 33 ...
##  $ solar  : num  0.65 0.61 0.17 0.62 0.48 0.47 0.65 0.39 0.65 0.42 ...
##  $ month  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday: num  3 4 5 6 7 1 2 3 4 5 ...
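
In the output above, date came in as character (chr). If you need to do date arithmetic, one option is to convert it with as.Date(). A sketch on a three-row stand-in data frame (chi_small) built from the values shown above:

```r
chi_small <- data.frame(
  date  = c("2013-01-01", "2013-01-02", "2013-01-03"),
  ozone = c(0.032, 0.020, 0.021),
  stringsAsFactors = FALSE
)
class(chi_small$date)   # "character"

# Convert the text to Date; the yyyy-mm-dd format is recognized by default
chi_small$date <- as.Date(chi_small$date)
class(chi_small$date)   # "Date"

# Dates support arithmetic: number of days between the first and last reading
day_span <- as.numeric(max(chi_small$date) - min(chi_small$date))
```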

The summary function goes a step further than str if you are working with a lot of numeric values, because it automatically computes summary statistics for any numbers in your vector or data.frame.

summary(chicago_air)
##      date               ozone              temp            solar      
##  Length:365         Min.   :0.00400   Min.   :-17.00   Min.   :0.040  
##  Class :character   1st Qu.:0.02500   1st Qu.: 36.75   1st Qu.:0.510  
##  Mode  :character   Median :0.03400   Median : 59.50   Median :0.910  
##                     Mean   :0.03567   Mean   : 54.84   Mean   :0.841  
##                     3rd Qu.:0.04500   3rd Qu.: 73.00   3rd Qu.:1.200  
##                     Max.   :0.08100   Max.   : 92.00   Max.   :1.490  
##                     NA's   :26        NA's   :109                     
##      month           weekday     
##  Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 4.000   1st Qu.:2.000  
##  Median : 7.000   Median :4.000  
##  Mean   : 6.526   Mean   :3.997  
##  3rd Qu.:10.000   3rd Qu.:6.000  
##  Max.   :12.000   Max.   :7.000  
## 
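
summary() also works on a single column, and the NA counts it reports can be obtained directly. The sketch below again uses R's built-in airquality data as a stand-in for chicago_air.

```r
# Count missing values in every column: is.na() marks each NA as TRUE,
# and colSums() adds up the TRUEs per column
na_counts <- colSums(is.na(airquality))
na_counts

# Summary statistics for one column only
ozone_summary <- summary(airquality$Ozone)
names(ozone_summary)
```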

The $ operator in R

  • You can refer to specific column names in a data frame with the $ operator
  • You can use this to feed specific columns into a function
mean(chicago_air$ozone, na.rm = TRUE) # Calculates the mean of the ozone column. Because there is missing data for ozone, we must tell the mean function to ignore it.
## [1] 0.03567257
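
The $ operator can also be used to create a new column. The sketch below uses a four-row stand-in data frame (chi_small) with values from the first rows shown earlier, except that one ozone value is replaced by NA for illustration. The ozone_ppb column name is an assumption made for this example.

```r
chi_small <- data.frame(
  ozone = c(0.032, 0.020, NA, 0.028),   # NA inserted for illustration
  temp  = c(17, 15, 28, 18)
)

# Without na.rm = TRUE, a single NA makes the result NA
mean_with_na <- mean(chi_small$ozone)

# With na.rm = TRUE, the NA is dropped before averaging
mean_no_na <- mean(chi_small$ozone, na.rm = TRUE)

# Assigning to a new name after $ adds a column: ozone converted from ppm to ppb
chi_small$ozone_ppb <- chi_small$ozone * 1000
colnames(chi_small)
```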

Now let’s try some exercises to test our understanding of functions and importing data.

Exercise 2

http://rpubs.com/kfrost14/Ex2