In this session we’ll be covering:
Remember from last session…
x <- c(4, 8, 1, 14, 34)
mean(x) # Calculate the mean of the data set
## [1] 12.2
y <- c(1, 4, 3, 5, 10, NA)
mean(y, na.rm = TRUE) # If you have missing data in your dataset you need to tell the mean function to remove it before doing the calculation.
## [1] 4.6
log(27) #Natural logarithm
## [1] 3.295837
log10(100) #base 10 logarithm
## [1] 2
sqrt(225) # Square root
## [1] 15
abs(-5) #Absolute value
## [1] 5
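To see why na.rm matters, here is a minimal sketch using the same kind of vector as above. The object name y_demo is made up for this illustration.

```r
# A small vector with one missing value (hypothetical example data)
y_demo <- c(1, 4, 3, 5, 10, NA)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(y_demo)

# With na.rm = TRUE, the NA is dropped before the mean is calculated
mean_without_na <- mean(y_demo, na.rm = TRUE)

# Count how many values are missing
n_missing <- sum(is.na(y_demo))

mean_with_na     # NA
mean_without_na  # 4.6
n_missing        # 1
```

Checking the number of missing values first is a good habit, because na.rm = TRUE silently drops them.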
The () after a function name is where you put your data or indicate options. These are referred to as ‘arguments’.
To see which arguments go inside the (), type a question mark in front of the function name and run it:
?mean
In RStudio, you will see the help page for mean() in the bottom right corner.
Under Usage, you see mean(x, ...). This tells you that the first thing to go inside the () is x.
Under Arguments, you will find a description of what x needs to be.
You can also use functions in combination with objects you have created.
answer <- 1+1
log(25 + answer)
## [1] 3.295837
Many built-in functions in R have multiple arguments, so you have to give the function some more information so that it can perform the correct calculation.
round(12.3456) # default is to round to the nearest integer
## [1] 12
round(12.3456, digits=3)
## [1] 12.346
round(12.3456, digits=1)
## [1] 12.3
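Arguments can be matched by name or by position. The sketch below, using only round(), shows that both forms should give the same answer. The object names are made up for this illustration.

```r
# Matching by name: the argument is spelled out
by_name <- round(12.3456, digits = 2)

# Matching by position: the second value is taken as 'digits'
by_position <- round(12.3456, 2)

by_name      # 12.35
by_position  # 12.35

# Both calls return the same value
identical(by_name, by_position)  # TRUE
```

Naming arguments takes a little more typing, but it makes the code easier to read later.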
The seq() or ‘sequence’ function is commonly used to create a vector containing a regular sequence of values. It is used a lot when writing functions. Its main arguments are from (where to start), to (where to stop), and by (the step size): seq(from, to, by)
seq(1, 5, by = 1)
## [1] 1 2 3 4 5
x <- 1:5 #Here we are using the colon operator to create a sequence from 1 to 5. This is a shortcut if you just need to sequence through numbers or a vector, without skipping.
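A few more ways to build sequences are sketched below. The object names are made up for this illustration. length.out is a standard seq() argument that sets the number of values instead of the step size.

```r
# Step by 3 instead of 1
every_third <- seq(from = 1, to = 10, by = 3)

# Ask for a fixed number of evenly spaced values instead of a step size
five_values <- seq(from = 0, to = 1, length.out = 5)

# The colon operator also counts down
countdown <- 5:1

every_third  # 1 4 7 10
five_values  # 0.00 0.25 0.50 0.75 1.00
countdown    # 5 4 3 2 1
```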
paste function
The paste function will concatenate or combine vectors. Paste only works with characters, so if you give it numbers to paste, it will convert them to characters first.
x <- "Hello"
y <- "world!"
paste(x, y, sep = " ")
## [1] "Hello world!"
x <- 5
y <- 3
z <- paste(x, y, sep = "")
str(z)
## chr "53"
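paste also works element by element on longer vectors. Here is a small sketch with made-up labels. paste0() and the collapse argument are both part of base R.

```r
# paste is vectorized: the single string "site" is recycled for each number
site_labels <- paste("site", 1:3, sep = "_")

# paste0() is a shortcut for paste(..., sep = "")
run_label <- paste0("O3", "-", 2013)

# collapse joins the elements of one vector into a single string
one_string <- paste(c("a", "b", "c"), collapse = ", ")

site_labels  # "site_1" "site_2" "site_3"
run_label    # "O3-2013"
one_string   # "a, b, c"
```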
substr function
As data analysts we often run into messy data. The use of substr allows you to pull out only the elements you care about in a date, address, monitor ID, parameter description, etc.
wi.mon <- c('55-021-0015-44201-2') # Ozone monitor in Columbia County, WI
site.id <- substr(wi.mon, start = 8, stop = 11) # Provide the start and stop position within the character string.
site.id
## [1] "0015"
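The same approach can pull out the other pieces of the monitor ID. The sketch below assumes the ID is laid out as state, county, site, parameter, and a final one-digit code, separated by dashes. That layout and the object names are assumptions for this illustration. It also shows strsplit(), a base R alternative when the pieces are separated by a consistent character.

```r
wi.mon <- "55-021-0015-44201-2"

# Pull out fixed positions with substr (field meanings are assumed)
state.code  <- substr(wi.mon, start = 1, stop = 2)
county.code <- substr(wi.mon, start = 4, stop = 6)
param.code  <- substr(wi.mon, start = 13, stop = 17)

# Or split on the dash and keep every piece at once
id.parts <- strsplit(wi.mon, split = "-", fixed = TRUE)[[1]]

state.code   # "55"
county.code  # "021"
param.code   # "44201"
id.parts     # "55" "021" "0015" "44201" "2"
```

substr is the better choice when positions are fixed but there is no separator. strsplit is easier when the separator is reliable but the lengths vary.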
#Two steps
x <- seq(from=1, to=10, by=3)
mean(x)
## [1] 5.5
#One step
mean(seq(from=1, to=10, by=3)) #Make sure you have the parentheses located in the correct spot as R will evaluate from the inside out.
## [1] 5.5
Note: Typically you don’t want to have too many nested functions because the code becomes difficult for somebody else (or yourself later on…) to interpret. You will learn more about this later when you are writing your own functions.
Use a search with key words describing what you want the function to do and just add “R package” to the end
EnvStats has a function called serialCorrelationTest()
First, try to use the function before the package is installed and loaded (R will report that it cannot find the function)
x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)
In the bottom right panel of RStudio, click on the “Packages” tab then click “Install Packages” in the tool bar
Start typing “EnvStats” into the “Packages” box, select that package, and click “Install”
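Once a package is installed, it must be loaded before you can use its functions. For this, we will use the library() function. You can also install a package from the console with install.packages()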
install.packages("EnvStats")
library(EnvStats)
Now we can use the function we want
x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)
##
## Results of Hypothesis Test
## --------------------------
##
## Null Hypothesis: rho = 0
##
## Alternative Hypothesis: True rho is not equal to 0
##
## Test Name: Rank von Neumann Test for
## Lag-1 Autocorrelation
## (Exact Method)
##
## Estimated Parameter(s): rho = -0.0187589
##
## Estimation Method: Yule-Walker
##
## Data: x
##
## Sample Size: 5
##
## Test Statistic: RVN = 1.8
##
## P-value: 0.7833333
##
## Confidence Interval for: rho
##
## Confidence Interval Method: Normal Approximation
##
## Confidence Interval Type: two-sided
##
## Confidence Level: 95%
##
## Confidence Interval: LCL = -0.8951272
## UCL = 0.8576094
Here is a link to a page that lists many, many useful packages for environmental data analysis: https://cran.r-project.org/web/views/Environmetrics.html
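Because install.packages() downloads the package every time it runs, it can be handy to check first whether a package is already installed. The sketch below is one possible way to do that. is_installed is a hypothetical helper written for this example, not a base R function. It relies on requireNamespace(), which is part of base R.

```r
# Hypothetical helper: returns TRUE if the package can be found, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg <- "EnvStats"

if (is_installed(pkg)) {
  # Load the package by its name stored in a variable
  library(pkg, character.only = TRUE)
} else {
  message("Package '", pkg, "' is not installed. Run install.packages(\"", pkg, "\") first.")
}
```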
Packages for reading Excel files include xlsx and XLConnect
Accept the defaults in the popup window and click “Import”
You should now have an object called airquality that is a data frame of the spreadsheet we imported
read.csv() is a function that takes the name of a csv file as its main argument
You must assign the output of read.csv() to a variable to be able to work with the data
read.csv()
Assign the data.frame a name such as “chi_aq” to save it to your environment.
# chi_aq <- read.csv("E:/datasets/chicago_air.csv", header=TRUE, stringsAsFactors = FALSE)
# head(chi_aq)
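If you do not have the thumb drive handy, the sketch below shows the same read.csv() pattern on a tiny file written to a temporary location. The file contents and the object names csv_path and demo_aq are made up for this illustration.

```r
# Write a small made-up csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("date,ozone,temp",
    "2013-01-01,0.032,17",
    "2013-01-02,0.020,15",
    "2013-01-03,,28"),   # the blank ozone field will be read as NA
  csv_path
)

# Read it back in and assign it to a name so we can work with it
demo_aq <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

head(demo_aq)  # look at the first rows
str(demo_aq)   # check that each column came in as the type you expected
```

Note that the blank ozone value in the third row comes in as NA, which is why functions like mean() need na.rm = TRUE.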
read.table is a function that helps you import data from a text file. read.delim, used in the example below, is a variant of read.table with defaults suited to delimited files. This is what you want to use if you are importing an AQS AMP 350 Raw Data Work File in pipe delimited format.
# aqs1 <- read.delim("E:/datasets/aqs1.txt", sep = "|", comment.char = "#", skip = 5, header = FALSE) # This is just example code, but it should work for importing an AMP 350WF.
You will notice that this includes several more arguments than what we used for read.csv(). This is because the raw AQS file type is a little more complicated than a clean, nicely formatted csv file. You will want to review the help file with ?read.delim if you are using this format.
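To see what those extra arguments do, here is a small sketch that reads pipe delimited text directly from a character vector. The lines below are made up for this illustration and are NOT the real AMP 350 layout. The text argument is a standard read.table/read.delim argument, and colClasses = "character" is used here so that codes keep their leading zeros.

```r
# Made-up pipe delimited lines with two comment lines at the top
raw_lines <- c(
  "# hypothetical header line",
  "# another hypothetical header line",
  "RD|I|55|021|0015|44201|1|20130101|0.032",
  "RD|I|55|021|0015|44201|1|20130102|0.020"
)

demo_aqs <- read.delim(
  text = raw_lines,
  sep = "|",                  # fields are separated by pipes
  comment.char = "#",         # skip lines that start with #
  header = FALSE,             # the data has no column names
  colClasses = "character"    # keep codes such as "021" as text
)

dim(demo_aqs)   # 2 rows, 9 columns
demo_aqs$V4     # "021" "021"
```

Without a header, R names the columns V1, V2, and so on. You can rename them with colnames().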
You can also use the write.csv function or write.table function to save your data.frame to your computer for later use, or to share with others.
# write.csv(x = dataframe, file = where to write the file, row.names = FALSE)
# write.csv(chicago_air, file = "E:/datasets/mydata.csv", row.names = FALSE)
#You typically want to set row.names = FALSE unless you want to retain the row numbers for some reason.
Note: You can use the foreign package, loaded with library("foreign"), to write to SAS, SPSS and STATA files.
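To see the effect of row.names, the sketch below writes a small made-up data frame to a temporary file twice and reads each version back. The object names are hypothetical.

```r
# A small made-up data frame
demo_df <- data.frame(
  site = c("A", "B"),
  ozone = c(0.032, 0.020),
  stringsAsFactors = FALSE
)

# Write without row names (usually what you want)
path_no_rows <- tempfile(fileext = ".csv")
write.csv(demo_df, file = path_no_rows, row.names = FALSE)

# Write with the default, which keeps the row numbers
path_with_rows <- tempfile(fileext = ".csv")
write.csv(demo_df, file = path_with_rows)

# Read both back and compare the column names
names_no_rows <- names(read.csv(path_no_rows))
names_with_rows <- names(read.csv(path_with_rows))

names_no_rows    # "site" "ozone"
names_with_rows  # "X" "site" "ozone"
```

The extra X column is the row numbers coming back in as data, which is rarely useful.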
setwd() function is your friend!
setwd() allows you to tell R exactly where to look for the data you want. It’s like creating a shortcut to the folder. So if you always store your data for air quality analysis at this location, for example: C:/Data/AQAnalysis/R/ you can set this as your working directory at the beginning of your session. Then you only have to tell R the name of the specific file you want. Check out the example below.
# setwd('C:/Data/AQAnalysis/R/')
So instead of typing this:
# aqs1 <- read.csv('C:/Data/AQAnalysis/R/aqs1.txt')
You can just type the filename like this:
# aqs1 <- read.csv('aqs1.txt')
So let’s go ahead and set our working directory for today’s session
# setwd("E:/datasets") #Set this to whatever path your thumb drive appears as
This is really handy when you are importing and exporting a lot of data from the same folder.
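Here is a minimal sketch of the same idea that you can run anywhere, using R's temporary folder in place of a real data folder. The file name demo.csv and the object names are made up for this illustration. getwd() reports the current working directory, so you can save it and switch back afterwards.

```r
# Remember where we started so we can return later
old_wd <- getwd()

# Use the session's temporary folder as a stand-in for your data folder
demo_dir <- tempdir()
writeLines(c("x,y", "1,2"), file.path(demo_dir, "demo.csv"))

# Point R at that folder
setwd(demo_dir)

# Now the file name alone is enough
demo <- read.csv("demo.csv")

# Switch back to the original working directory
setwd(old_wd)

demo
```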
We will use the data() function and a package we created for this training.
library(devtools)
install_github("natebyers/region5air")
library(region5air)
data(chicago_air)
chicago_air
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 2 2013-01-02 0.020 15 0.61 1 4
## 3 2013-01-03 0.021 28 0.17 1 5
## 4 2013-01-04 0.028 18 0.62 1 6
## 5 2013-01-05 0.025 26 0.48 1 7
## 6 2013-01-06 0.026 36 0.47 1 1
chicago_air is a data frame with 2013 ozone (ppm), temperature (F), and solar radiation (W/m2) readings from a monitor in the Chicago area.
colnames(chicago_air)
## [1] "date" "ozone" "temp" "solar" "month" "weekday"
Use the nrow() function to get the number of rows
nrow(chicago_air)
## [1] 365
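If you want to try these functions without installing the training package, the sketch below uses a tiny made-up data frame called demo_air. It also shows ncol() and dim(), two related base R functions.

```r
# A tiny made-up data frame with the same kind of columns
demo_air <- data.frame(
  date = c("2013-01-01", "2013-01-02", "2013-01-03"),
  ozone = c(0.032, 0.020, NA),
  temp = c(17, 15, 28),
  stringsAsFactors = FALSE
)

colnames(demo_air)  # "date" "ozone" "temp"
nrow(demo_air)      # 3 (number of rows)
ncol(demo_air)      # 3 (number of columns)
dim(demo_air)       # 3 3 (rows, then columns)
```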
RStudio has a special function called View() that makes it easier to look at data in a data frame
View(chicago_air)
head(chicago_air) ##Looks at the first 6 lines in the dataset
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 2 2013-01-02 0.020 15 0.61 1 4
## 3 2013-01-03 0.021 28 0.17 1 5
## 4 2013-01-04 0.028 18 0.62 1 6
## 5 2013-01-05 0.025 26 0.48 1 7
## 6 2013-01-06 0.026 36 0.47 1 1
tail(chicago_air) ##Looks at the last 6 lines in the dataset
## date ozone temp solar month weekday
## 360 2013-12-26 0.026 NA 0.41 12 5
## 361 2013-12-27 0.021 NA 0.62 12 6
## 362 2013-12-28 0.026 NA 0.61 12 7
## 363 2013-12-29 0.029 NA 0.08 12 1
## 364 2013-12-30 0.024 NA 0.44 12 2
## 365 2013-12-31 0.021 NA 0.49 12 3
The str function is important because it describes the basic structure of the dataset. This lets you know whether all the data was imported the way it was intended, e.g. numbers came in as numeric, text came in as characters, etc. This is great if you want a snapshot of the data structure.
str(chicago_air) ##Describes the basic structure of the dataset
## 'data.frame': 365 obs. of 6 variables:
## $ date : chr "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" ...
## $ ozone : num 0.032 0.02 0.021 0.028 0.025 0.026 0.024 0.021 0.031 0.024 ...
## $ temp : num 17 15 28 18 26 36 25 30 41 33 ...
## $ solar : num 0.65 0.61 0.17 0.62 0.48 0.47 0.65 0.39 0.65 0.42 ...
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday: num 3 4 5 6 7 1 2 3 4 5 ...
The summary function goes a step further than str if you are working with a lot of numeric values, because it automatically calculates summary statistics (minimum, quartiles, median, mean, maximum, and the number of NA's) for any numbers in your vector or data.frame.
summary(chicago_air)
## date ozone temp solar
## Length:365 Min. :0.00400 Min. :-17.00 Min. :0.040
## Class :character 1st Qu.:0.02500 1st Qu.: 36.75 1st Qu.:0.510
## Mode :character Median :0.03400 Median : 59.50 Median :0.910
## Mean :0.03567 Mean : 54.84 Mean :0.841
## 3rd Qu.:0.04500 3rd Qu.: 73.00 3rd Qu.:1.200
## Max. :0.08100 Max. : 92.00 Max. :1.490
## NA's :26 NA's :109
## month weekday
## Min. : 1.000 Min. :1.000
## 1st Qu.: 4.000 1st Qu.:2.000
## Median : 7.000 Median :4.000
## Mean : 6.526 Mean :3.997
## 3rd Qu.:10.000 3rd Qu.:6.000
## Max. :12.000 Max. :7.000
##
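summary() also works on a single vector. The sketch below uses a short made-up vector of temperatures with one missing value, so you can see where the NA count comes from. The object names are hypothetical.

```r
# Made-up temperatures with one missing value
temps <- c(17, 15, 28, 18, NA)

# Summary statistics, including a count of missing values
temp_summary <- summary(temps)
temp_summary

# Individual statistics can be pulled out by name
temp_summary[["Median"]]  # 17.5
temp_summary[["NA's"]]    # 1

# The NA count matches what you get by counting them yourself
sum(is.na(temps))         # 1
```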
$ operator in R
Use the $ operator to pull a single column out of a data frame by name.
mean(chicago_air$ozone, na.rm = TRUE) # Calculates the mean of the ozone column. Because there is missing data for ozone, we must tell the mean function to ignore it.
## [1] 0.03567257
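Here is a small sketch of the $ operator on a made-up data frame, so you can run it without the training package. The name demo_air and the new column temp_c are assumptions for this illustration. The double bracket form is a base R alternative that does the same job.

```r
# A tiny made-up data frame
demo_air <- data.frame(
  ozone = c(0.032, NA, 0.021, 0.028),
  temp = c(17, 15, 28, 18)
)

# Pull out one column as a vector
demo_air$ozone

# Use the column inside a function
ozone_mean <- mean(demo_air$ozone, na.rm = TRUE)
ozone_mean  # 0.027

# Double brackets with the column name in quotes return the same vector
identical(demo_air$temp, demo_air[["temp"]])  # TRUE

# $ can also create a new column, here temperature converted from F to C
demo_air$temp_c <- (demo_air$temp - 32) * 5 / 9
colnames(demo_air)  # "ozone" "temp" "temp_c"
```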
Now let’s try some exercises to test our understanding of functions and importing data.