Variables and Data Types

R Objects
Built-in functions
Using R packages
Importing data
Data exploration

Creating a variable

Remember from last session…

An object is used to store information in R
To create an object or ‘variable’ in R we use an arrow symbol pointing left <-
Below, we create the objects x and y by assigning numbers to them

x <- 10
y <- 5
x + y

## [1] 15

x <- 10
y <- 5
answer1 <- x + y
answer2 <- x * y
answer3 <- answer1 + answer2
answer3

## [1] 65

You can start to see that storing information as objects has the potential to be very powerful. This is true because you can store lists of items or even entire dataframes (spreadsheets) as an object and perform all sorts of math or statistics on that object.

R Object types

Three of the most common data types in R are:

Type	Description	Examples
`character`	letters and words	`"z"`, `"red"`, `"H2O"`
`numeric`	numbers	`1`, `3.14`, `log(10)`
`logical`	binary	`TRUE`, `FALSE`
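
As a quick check of the table above, the built-in class() function reports the type of any value. This is a minimal sketch; the variable names are only for illustration.

```r
# class() reports the data type of a value
word_type   <- class("red")    # "character"
number_type <- class(3.14)     # "numeric"
log_type    <- class(log(10))  # "numeric", because log() returns a number
flag_type   <- class(TRUE)     # "logical"

word_type
number_type
flag_type
```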

Grouping Data

There are several ways to group data to make them easier to work with:
Vectors: contain multiple values of the same type (e.g., all numbers or all words)
Lists: contain multiple values of different types (e.g., some numbers and some words)
Matrices: a table, like a spreadsheet, with only one data type
Data frames: like a matrix, but you can mix data types
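
Matrices are the one grouping in this list that the later slides do not demonstrate, so here is a minimal sketch using the built-in matrix() function. The object name m is just an example.

```r
# Build a 2-row, 3-column matrix from the numbers 1 to 6.
# By default, matrix() fills the values in column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)
m

dim(m)    # number of rows, then number of columns: 2 3
m[2, 1]   # the value in row 2, column 1: 2
```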

Vectors

Vectors are variables with an ordered set of values
They contain only one type of data (numeric, character, or logical)
We use c( ) as a container for vector elements. Think of the c as concatenating or combining elements.

x <- c(1, 2, 3, 4, 5)
x

## [1] 1 2 3 4 5

fruit <- c("apples", "bananas", "oranges")
fruit

## [1] "apples"  "bananas" "oranges"
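
To illustrate the "only one type of data" rule, this small sketch shows what happens if you try to mix types inside c(): R silently converts everything to a single type. The object name mixed is only for illustration.

```r
# Mixing a number, a word, and a logical value in one vector
mixed <- c(1, "two", TRUE)

mixed         # every element has been converted to text: "1" "two" "TRUE"
class(mixed)  # "character"
```

If you need to keep the original types, a list is the better container.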

Lists

Lists are like vectors but can contain any mix of data types
We use list() as a container for list items

x <- list("Benzene", 1.3, TRUE)
x

## [[1]]
## [1] "Benzene"
## 
## [[2]]
## [1] 1.3
## 
## [[3]]
## [1] TRUE
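
Once a list exists, you can pull individual items back out. The sketch below assumes you give the items names (the names pollutant, concentration, and carcinogen are chosen here for illustration); double square brackets and the $ operator both return a single item with its original type.

```r
# A list with named items
sample_info <- list(pollutant = "Benzene", concentration = 1.3, carcinogen = TRUE)

sample_info[[2]]              # second item, by position: 1.3
sample_info$pollutant         # item by name: "Benzene"
class(sample_info$carcinogen) # each item keeps its own type: "logical"
```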

Data frames

Data frames are spreadsheet-like tables in R
We use data.frame() as a container for many vectors of the same length
Dataframes are the most common objects that we typically work with while analyzing environmental data, but vectors and lists can be important when you start using R as a programming language. We’ll talk about that later…

pollutant <- c("Benzene", "Toluene", "Xylenes")
concentration <- c(1.3, 5.5, 6.0)
carcinogen <- c(TRUE, FALSE, FALSE)
my.data <- data.frame(pollutant, concentration, carcinogen)
my.data

##   pollutant concentration carcinogen
## 1   Benzene           1.3       TRUE
## 2   Toluene           5.5      FALSE
## 3   Xylenes           6.0      FALSE

If you try to create a data frame from columns that are not all the same length, R will return an error

pollutant <- c("Benzene", "Toluene")
concentration <- c(1.3, 5.5, 6.0)
carcinogen <- c(TRUE, FALSE, FALSE)
my.data <- data.frame(pollutant, concentration, carcinogen)

## Error in data.frame(pollutant, concentration, carcinogen): arguments imply differing number of rows: 2, 3

cbind(), rbind()
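
The slide above only names these two functions, so the following is a hedged sketch of how they are commonly used with data frames: cbind() binds on a new column and rbind() binds on a new row. The Styrene row is a made-up value, included only to have something to add.

```r
pollutant <- c("Benzene", "Toluene", "Xylenes")
concentration <- c(1.3, 5.5, 6.0)
my.data <- data.frame(pollutant, concentration)

# cbind() adds a column; the new vector must have one value per row
carcinogen <- c(TRUE, FALSE, FALSE)
with_column <- cbind(my.data, carcinogen = carcinogen)

# rbind() adds a row; the column names of the new row must match
new_row <- data.frame(pollutant = "Styrene", concentration = 2.1)  # hypothetical values
with_row <- rbind(my.data, new_row)

dim(with_column)  # 3 rows, 3 columns
dim(with_row)     # 4 rows, 2 columns
```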

Functions

Functions are a way to repeat the same task on different data
R has many built-in functions that perform common tasks

x <- c(4, 8, 1, 14, 34)
mean(x) # Calculate the mean of the data set

## [1] 12.2

y <- c(1, 4, 3, 5, 10)
mean(y) # Mean of a different data set

## [1] 4.6

log(27)  #Natural logarithm

## [1] 3.295837

log10(100) #base 10 logarithm

## [1] 2

sqrt(225) # Square root

## [1] 15

abs(-5) #Absolute value

## [1] 5

You can use functions in combination with objects you have created

answer <- 1+1
log(25 + answer)

## [1] 3.295837

Many built-in functions in R have multiple arguments, which means you have to give the function some more information so that it can perform the correct calculation

round(12.3456, digits=3)

## [1] 12.346

round(12.3456, digits=1)

## [1] 12.3
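
Arguments can be given by name or by position, and some have default values. This short sketch uses round() again to show the three options; the object names are only for illustration.

```r
by_name     <- round(12.3456, digits = 1)  # argument given by name
by_position <- round(12.3456, 1)           # same argument, given by position
default     <- round(12.3456)              # digits left out, so the default of 0 is used

by_name
by_position
default
```

Naming arguments takes a little more typing but makes the code easier to read later.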

Other Common and Useful functions

The seq() or ‘sequence’ function is commonly used to create a vector containing a regular sequence of numbers. This is used a lot when writing functions.
The length() function returns the number of elements in a vector (for a data frame, it returns the number of columns). It is also used to tell another function that you want it to work along the whole length of something else.
The rep() or ‘repeat’ function is often used to create a repeating pattern of numbers

seq(1,5,by=1)

## [1] 1 2 3 4 5

x <- 1:5  #Here we are using the colon operator to create a sequence from 1 to 5.  This is a shortcut if you just need to sequence through numbers or a vector, without skipping.
length(x)

## [1] 5

#rep()
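
The rep() placeholder above has no example, so here is a minimal sketch of its two most common arguments, times and each, alongside a seq() call that skips values. The object names are only for illustration.

```r
threes    <- rep(1, times = 3)        # 1 1 1
repeated  <- rep(c(1, 2), times = 3)  # whole vector repeated: 1 2 1 2 1 2
each_twice <- rep(c(1, 2), each = 3)  # each value repeated: 1 1 1 2 2 2
by_fives  <- seq(0, 10, by = 5)       # sequence that skips: 0 5 10

repeated
each_twice
by_fives
```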

Nesting functions

rep(seq(1, 3), times = 2)  # seq() runs first; its result, 1 2 3, is passed to rep()

Note on commenting

To write a comment in your script that will not be evaluated, type # in front of your comment
The text after # will not be evaluated
Run all of the code below and see what gets returned in the R console (bottom left panel in RStudio)

# Full line comment
x # partial line comment
"new line"

Functions

Back to functions: they all have the form function()
“function” is the name, which usually gives you a clue about what it does
() is where you put your data or indicate options
To see what goes inside (), type a question mark in front of the function and run it

?mean

In RStudio, you will see the help page for mean() in the bottom right corner

On the help page, under Usage, you see mean(x, ...)
This means that x is the only argument you are required to put inside ()
On the help page under Arguments you will find a description of what x needs to be
(For most purposes, you will want the x in the mean function to be a numeric vector)
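
The ... in mean(x, ...) stands for optional extra arguments. One that the help page lists is na.rm, which tells mean() to drop missing values before calculating. A minimal sketch, using a made-up vector named readings:

```r
readings <- c(4, 8, NA, 14, 34)  # hypothetical values with one missing reading

with_na    <- mean(readings)                # NA, because one value is missing
without_na <- mean(readings, na.rm = TRUE)  # 15, the mean of the four known values

with_na
without_na
```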

Using R packages

R comes with basic functionality, meaning that some functions will always be available when you start an R session
However, anyone can write functions for R that are not part of the base functionality and make it available to other R users in a package
A package must be installed first and then loaded before you can use it
This is similar to a mobile app: you must first install the R package (like first downloading an app) then you must load the package before using its functions (like opening an app to use it)
For example, let’s say that R doesn’t have a function you need
The best way to find out if another R package does have that function is to ask Google
Use a search with key words describing what you want the function to do and just add “R package” to the end
Let’s say what you want to do is find serial correlation in an environmental data set
Google tells you that the R package EnvStats has a function called serialCorrelationTest()
First, try to use the function

x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)

It’s not available because we need to install the package first (again, like initially downloading an app)
In the bottom right panel of RStudio, click on the “Packages” tab then click “Install Packages” in the tool bar
A window will pop up
Start typing “EnvStats” into the “Packages” box, select that package, and click “Install”
Now that we’ve installed the package, we still can’t use the function we want
We’ve got to load the package first (opening the app)
For this, we will use the library() function

library("EnvStats")

## Warning: package 'EnvStats' was built under R version 3.1.3

## 
## Attaching package: 'EnvStats'
## 
## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm
## 
## The following object is masked from 'package:base':
## 
##     print.default

Now we can use the function we want

x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)

## 
## Results of Hypothesis Test
## --------------------------
## 
## Null Hypothesis:                 rho = 0
## 
## Alternative Hypothesis:          True rho is not equal to 0
## 
## Test Name:                       Rank von Neumann Test for
##                                  Lag-1 Autocorrelation
##                                  (Exact Method)
## 
## Estimated Parameter(s):          rho = -0.0187589
## 
## Estimation Method:               Yule-Walker
## 
## Data:                            x
## 
## Sample Size:                     5
## 
## Test Statistic:                  RVN = 1.8
## 
## P-value:                         0.7833333
## 
## Confidence Interval for:         rho
## 
## Confidence Interval Method:      Normal Approximation
## 
## Confidence Interval Type:        two-sided
## 
## Confidence Level:                95%
## 
## Confidence Interval:             LCL = -0.8951272
##                                  UCL =  0.8576094

Remember, when you close down RStudio, then start it up again, you don’t have to download the package again
But you do have to load the package to use any function that’s not in the R core functionality (this is very easy to forget)
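
The install-then-load steps above can also be done from the console with install.packages() and library(). The sketch below wraps them in a small helper so the download only happens when the package is missing; the helper name load_or_install is hypothetical and not part of R or any package.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs an internet connection
  }
  library(pkg, character.only = TRUE)  # load the package by its name as text
  invisible(TRUE)
}

load_or_install("stats")       # "stats" ships with R, so nothing is downloaded
# load_or_install("EnvStats")  # would install EnvStats the first time it is run
```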

Importing data

There are many ways to import data into R
We will be importing from an Excel spreadsheet
Packages exist to automate this, such as xlsx and XLConnect
But it’s often easier to just save the Excel document as a csv file first, and then import that file

Download example Excel file

For this example, we will be using a file named “chicago_air.xlsx”
This file is located on our GitHub site here.
Download the Excel file to whatever location you prefer

Save as csv file

Open the spreadsheet in Excel
Click the “Office Button” in the upper-left corner
Choose “Save As”, then “Other Formats”

Select “CSV (Comma delimited)” from “Save as type:”
Click “Save”
Close Excel

Importing csv files in RStudio

In RStudio, go to “Tools” -> “Import Dataset” -> “From Text File”
Select your file using the window

Accept the defaults in the popup window and click “Import”

In the top right panel you will see a variable named airquality that is a data frame of the spreadsheet we imported
In the top left panel the data frame is displayed

Importing csv files with read.csv()

You can also easily import csv files from the command line
read.csv() is a function that takes the name of a csv file as its main argument
It reads the csv file and converts it to a data frame
It assumes that the first row contains column names
You must assign the output of read.csv() to a variable to be able to work with the data
Let’s say you downloaded your file to a folder named “My Data” in your C: drive
Use the entire file path as the argument in read.csv()

#airquality <- read.csv("C:/My Data/chicago_air.csv")
#airquality
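
If you do not have the csv file handy, the following self-contained sketch shows the same read.csv() workflow on a small temporary file. The three data rows are copied from the first rows of the Chicago data shown later in this session; the temporary file and the object name air are only for illustration.

```r
# Write a tiny csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,ozone,temp",
             "2013-01-01,0.032,17",
             "2013-01-02,0.020,15",
             "2013-01-03,0.021,28"),
           csv_path)

# read.csv() treats the first row as column names and returns a data frame
air <- read.csv(csv_path)
air
colnames(air)
```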

read.table() is a function that imports data from a delimited text file; read.delim(), used below, is a variant of it. This is what you want to use if you are importing a raw AQS file in pipe-delimited format.

metals_Lake <- read.delim("C:/My Data/aqsprodFKP1036275-0.txt", sep = "|", comment.char = "#", skip = 5, header = FALSE)
head(metals_Lake)

setwd()
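
The slide above only names setwd(), so this is a hedged sketch of its usual purpose: setting the working directory so that files can be read by name instead of by full path. It uses R's temporary folder and a made-up file name, example.csv, so that it runs on any computer.

```r
old_dir <- getwd()  # remember the current working directory
setwd(tempdir())    # switch to R's temporary folder

# Files can now be written and read using just their names
writeLines(c("pollutant,concentration", "Benzene,1.3"), "example.csv")
example <- read.csv("example.csv")
example

setwd(old_dir)      # switch back when finished
```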

Data exploration

About the data

The data we imported in the previous section can actually be obtained in R by using the data() function
If you were not able to download that Excel file, use the code below to obtain the data

library(devtools)
install_github("natebyers/region5air")
library(region5air)
data(chicago_air)
head(chicago_air)  # Show the first 6 rows

##         date ozone temp solar month weekday
## 1 2013-01-01 0.032   17  0.65     1       3
## 2 2013-01-02 0.020   15  0.61     1       4
## 3 2013-01-03 0.021   28  0.17     1       5
## 4 2013-01-04 0.028   18  0.62     1       6
## 5 2013-01-05 0.025   26  0.48     1       7
## 6 2013-01-06 0.026   36  0.47     1       1

chicago_air is a data frame with ozone readings from a monitor in Chicago
What column names are in the data frame?

colnames(chicago_air)

## [1] "date"    "ozone"   "temp"    "solar"   "month"   "weekday"

How many observations does the dataset contain?
We use the nrow() function to get the number of rows

nrow(chicago_air)

## [1] 365

Viewing the data

RStudio has a special function called View() that makes it easier to look at data in a data frame

View(chicago_air)

More functions for viewing data attributes

tail(chicago_air)  # Looks at the last 6 rows in the dataset

##           date ozone temp solar month weekday
## 360 2013-12-26 0.026   NA  0.41    12       5
## 361 2013-12-27 0.021   NA  0.62    12       6
## 362 2013-12-28 0.026   NA  0.61    12       7
## 363 2013-12-29 0.029   NA  0.08    12       1
## 364 2013-12-30 0.024   NA  0.44    12       2
## 365 2013-12-31 0.021   NA  0.49    12       3

The str function is important because it describes the basic structure of the dataset. This lets you know whether all the data was imported the way it was intended, i.e., numbers came in as numeric, text came in as characters, etc. This is great if you want a snapshot of the data structure.

str(chicago_air)  ##Describes the basic structure of the dataset

## 'data.frame':    365 obs. of  6 variables:
##  $ date   : chr  "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" ...
##  $ ozone  : num  0.032 0.02 0.021 0.028 0.025 0.026 0.024 0.021 0.031 0.024 ...
##  $ temp   : num  17 15 28 18 26 36 25 30 41 33 ...
##  $ solar  : num  0.65 0.61 0.17 0.62 0.48 0.47 0.65 0.39 0.65 0.42 ...
##  $ month  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday: num  3 4 5 6 7 1 2 3 4 5 ...

The summary function goes a step further than str when you are working with a lot of numeric values: it automatically calculates summary statistics (minimum, quartiles, median, mean, maximum, and the number of NA values) for every numeric column in your vector or data.frame.

summary(chicago_air)

##      date               ozone              temp            solar      
##  Length:365         Min.   :0.00400   Min.   :-17.00   Min.   :0.040  
##  Class :character   1st Qu.:0.02500   1st Qu.: 36.75   1st Qu.:0.510  
##  Mode  :character   Median :0.03400   Median : 59.50   Median :0.910  
##                     Mean   :0.03567   Mean   : 54.84   Mean   :0.841  
##                     3rd Qu.:0.04500   3rd Qu.: 73.00   3rd Qu.:1.200  
##                     Max.   :0.08100   Max.   : 92.00   Max.   :1.490  
##                     NA's   :26        NA's   :109                     
##      month           weekday     
##  Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 4.000   1st Qu.:2.000  
##  Median : 7.000   Median :4.000  
##  Mean   : 6.526   Mean   :3.997  
##  3rd Qu.:10.000   3rd Qu.:6.000  
##  Max.   :12.000   Max.   :7.000  
##

The table function is helpful for summarizing your data by counts.

table(chicago_air$ozone)  # Summarizes by counts
plot(table(chicago_air$ozone))  # Quickly plot the counts; like a histogram, except no binning occurs
hist(chicago_air$ozone)  # Histogram, which groups the values into bins

## 
## 0.004 0.008  0.01 0.011 0.013 0.014 0.015 0.016 0.017 0.018 0.019  0.02 
##     1     1     1     1     1     3     6     4     5     3     3     6 
## 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029  0.03 0.031 0.032 
##    11    10    12    12    12    11     6    13    12     8     5     6 
## 0.033 0.034 0.035 0.036 0.037 0.038 0.039  0.04 0.041 0.042 0.043 0.044 
##    12     8    13     8     8     8    11     6     9     4     4     7 
## 0.045 0.046 0.047 0.048 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 
##     6     4     5     6     5     7     6     5     4     5     6     3 
## 0.057 0.058 0.059  0.06 0.061 0.062 0.064 0.065 0.066 0.067 0.068 0.069 
##     3     3     3     2     1     2     2     1     2     1     1     2 
## 0.074 0.078 0.081 
##     1     1     1

Working with data frames

You can refer to specific columns in a data frame with the $ operator
Use this to feed specific columns into a function

mean(airquality$Temp) # Calculate the mean temperature (airquality is a data set built into R, with a Temp column)

## [1] 77.88235
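
The same $ pattern works on any data frame. As a self-contained sketch, here it is applied to the small pollutant table built earlier in this session (recreated here so the code runs on its own):

```r
my.data <- data.frame(
  pollutant = c("Benzene", "Toluene", "Xylenes"),
  concentration = c(1.3, 5.5, 6.0)
)

my.data$concentration                      # the column as a numeric vector
avg_conc <- mean(my.data$concentration)    # mean of that column
max_conc <- max(my.data$concentration)     # largest value in that column

avg_conc
max_conc
```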

Data Types & Importing, Exporting Data

Kali Frost and Nathan Byers

Tuesday, February 2, 2016

Exercise 2