Variables and Data Types

Creating a variable

Remember from last session…

  • An object is used to store information in R
  • To create an object or ‘variable’ in R we use an arrow symbol pointing left <-
  • Below, we create the objects x and y by assigning numbers to them
x <- 10
y <- 5
x + y
## [1] 15
x <- 10
y <- 5
answer1 <- x + y
answer2 <- x * y
answer3 <- answer1 + answer2
answer3
## [1] 65

You can start to see that storing information as objects has the potential to be very powerful. This is true because you can store lists of items or even entire dataframes (spreadsheets) as an object and perform all sorts of math or statistics on that object.

R Object types

R has three main object types:

Type        Description         Examples
character   letters and words   "z", "red", "H2O"
numeric     numbers             1, 3.14, log(10)
logical     binary              TRUE, FALSE
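
To check which type an object has, one option is the base R function class(). The sketch below uses made-up object names (chemical, reading, detected) purely for illustration.

```r
# Hypothetical example objects, one of each main type
chemical <- "H2O"   # character
reading  <- 3.14    # numeric
detected <- TRUE    # logical

class(chemical)   # "character"
class(reading)    # "numeric"
class(detected)   # "logical"
```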

Grouping Data

There are several ways to group data to make them easier to work with:

  • Vectors: contain multiple values of the same type (e.g., all numbers or all words)
  • Lists: contain multiple values of different types (e.g., some numbers and some words)
  • Matrices: tables, like a spreadsheet, with only one data type
  • Data frames: like a matrix, but you can mix data types

Vectors

  • Vectors are variables with an ordered set of values
  • They contain only one type of data (numeric, character, or logical)
  • We use c( ) as a container for vector elements. Think of the c as concatenating or combining elements.
x <- c(1, 2, 3, 4, 5)
x
## [1] 1 2 3 4 5
fruit <- c('apples', 'bananas', 'oranges')
fruit
## [1] "apples"  "bananas" "oranges"
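
Because a vector holds only one type of data, R converts mixed input to a single common type. The sketch below illustrates this with a made-up object named mixed, and also shows how square brackets pick out one element of a vector.

```r
# Mixing types in c() does not fail; everything is converted to character
mixed <- c(1, "apples", TRUE)
class(mixed)    # "character"
mixed           # "1" "apples" "TRUE"

# Square brackets select elements by position (counting starts at 1)
fruit <- c("apples", "bananas", "oranges")
fruit[2]        # "bananas"
length(fruit)   # 3
```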

Lists

  • Lists are like vectors but can contain any mix of data types
  • We use list() as a container for list items
x <- list("Benzene", 1.3, TRUE)
x
## [[1]]
## [1] "Benzene"
## 
## [[2]]
## [1] 1.3
## 
## [[3]]
## [1] TRUE
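
List items can also be given names, which makes them easier to pull out again. This is a small sketch; the object name sample_info and the item names are assumptions chosen for illustration.

```r
# A list with named items (names are illustrative)
sample_info <- list(pollutant = "Benzene", concentration = 1.3, carcinogen = TRUE)

sample_info[[1]]                # first item by position: "Benzene"
sample_info$concentration       # item by name: 1.3
class(sample_info$carcinogen)   # each item keeps its own type: "logical"
```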

Data frames

  • Data frames are spreadsheet-like tables in R
  • We use data.frame() as a container for many vectors of the same length
  • Dataframes are the most common objects that we typically work with while analyzing environmental data, but vectors and lists can be important when you start using R as a programming language. We’ll talk about that later…
pollutant <- c("Benzene", "Toluene", "Xylenes")
concentration <- c(1.3, 5.5, 6.0)
carcinogen <- c(TRUE, FALSE, FALSE)
my.data <- data.frame(pollutant, concentration, carcinogen)
my.data
##   pollutant concentration carcinogen
## 1   Benzene           1.3       TRUE
## 2   Toluene           5.5      FALSE
## 3   Xylenes           6.0      FALSE

If you try to create a data frame from columns that are not all the same length, R returns an error:

pollutant <- c("Benzene", "Toluene")
concentration <- c(1.3, 5.5, 6.0)
carcinogen <- c(TRUE, FALSE, FALSE)
my.data <- data.frame(pollutant, concentration, carcinogen)
## Error in data.frame(pollutant, concentration, carcinogen): arguments imply differing number of rows: 2, 3

cbind(), rbind()
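
As a rough sketch of what these two functions do: cbind() binds columns onto a data frame and rbind() binds rows onto it. The Styrene row below is made up for illustration and is not part of the original example.

```r
pollutant     <- c("Benzene", "Toluene", "Xylenes")
concentration <- c(1.3, 5.5, 6.0)
my.data <- data.frame(pollutant, concentration)

# cbind() adds a column; the new vector needs one value per existing row
wider <- cbind(my.data, carcinogen = c(TRUE, FALSE, FALSE))

# rbind() adds rows; the new rows need the same column names
# (hypothetical extra row, for illustration only)
extra_row <- data.frame(pollutant = "Styrene", concentration = 2.1, carcinogen = FALSE)
longer <- rbind(wider, extra_row)

dim(longer)   # 4 rows, 3 columns
```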

Functions

  • Functions are a way to repeat the same task on different data
  • R has many built-in functions that perform common tasks
x <- c(4, 8, 1, 14, 34)
mean(x) # Calculate the mean of the data set
## [1] 12.2
y <- c(1, 4, 3, 5, 10)
mean(y) # Mean of a different data set
## [1] 4.6
log(27)  # Natural logarithm
## [1] 3.295837
log10(100)  # Base 10 logarithm
## [1] 2
sqrt(225)  # Square root
## [1] 15
abs(-5)  # Absolute value
## [1] 5

You can use functions in combination with objects you have created

answer <- 1+1
log(25 + answer)
## [1] 3.295837

Many built-in functions in R have multiple arguments, which means you have to give the function some more information so that it can perform the correct calculation

round(12.3456, digits=3)
## [1] 12.346
round(12.3456, digits=1)
## [1] 12.3
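
Arguments can be given by name, as above, or simply by position. If an argument with a default value is left out, the default is used. A short sketch using the same number:

```r
round(12.3456, digits = 2)   # argument given by name: 12.35
round(12.3456, 2)            # same argument given by position: 12.35
round(12.3456)               # digits left out, so the default of 0 is used: 12
```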

Other Common and Useful functions

The seq() or ‘sequence’ function is commonly used to create a vector that follows a regular sequence. It is used a lot when writing functions.
The length() function returns the length of a vector (for a data frame, the number of columns). It is also handy for telling another function to work along the whole length of something else.
The rep() or ‘repeat’ function is often used to create a repeating pattern of values.

seq(1,5,by=1)
## [1] 1 2 3 4 5
x <- 1:5  #Here we are using the colon operator to create a sequence from 1 to 5.  This is a shortcut if you just need to sequence through numbers or a vector, without skipping.
length(x)
## [1] 5
#rep()
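
A few illustrative calls to rep() and seq() are sketched below. The numbers are arbitrary; times and each are standard arguments of rep().

```r
rep(0, times = 3)            # repeat one value: 0 0 0
rep(c(1, 2), times = 2)      # repeat the whole vector: 1 2 1 2
rep(c(1, 2), each = 2)       # repeat each element in turn: 1 1 2 2

steps <- seq(0, 10, by = 5)  # sequence that skips by 5: 0 5 10
length(steps)                # 3
```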

Nesting functions

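Functions can be nested: R evaluates the innermost function first and passes its result to the function around it.

rep(seq(1, 3), times = 2)  # seq() runs first, then rep() repeats its result
## [1] 1 2 3 1 2 3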

Note on commenting

  • To write a comment in your script that will not be evaluated, type # in front of your comment
  • The text after the # will not be evaluated
  • Run all of the code below and see what gets returned in the R console (bottom left panel in RStudio)
# Full line comment
x # partial line comment
"new line"

Functions

  • Back to functions: they all have the form function()
  • “function” is the name, which usually gives you a clue about what it does
  • () is where you put your data or indicate options
  • To see what goes inside (), type a question mark in front of the function and run it
?mean

In RStudio, you will see the help page for mean() in the bottom right panel

  • On the help page, under Usage, you see mean(x, ...)
  • This means that the only necessary thing that has to go into () is x
  • On the help page under Arguments you will find a description of what x needs to be
  • (For most purposes, you will want the x in the mean function to be a numeric vector)
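
The ... in mean(x, ...) stands for optional arguments that are described further down the help page. One of them, trim, drops a fraction of the smallest and largest values before averaging. A small sketch with made-up numbers (the object name readings is an assumption):

```r
readings <- c(1, 2, 3, 100)   # made-up values with one extreme reading

mean(readings)                # ordinary mean: 26.5
mean(readings, trim = 0.25)   # drops the lowest and highest 25%, leaving 2 and 3: 2.5
```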

Using R packages

  • R comes with basic functionality, meaning that some functions will always be available when you start an R session
  • However, anyone can write functions for R that are not part of the base functionality and make it available to other R users in a package
  • A package must be installed first, then loaded, before you can use it
  • This is similar to a mobile app: you must first install the R package (like first downloading an app) then you must load the package before using its functions (like opening an app to use it)
  • For example, let’s say that R doesn’t have a function you need
  • The best way to find out if another R package does have that function is to ask Google
  • Use a search with key words describing what you want the function to do and just add “R package” to the end

  • Let’s say what you want to do is find serial correlation in an environmental data set
  • Google tells you that the R package EnvStats has a function called serialCorrelationTest()
  • First, try to use the function

x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)
  • It’s not available because we need to install the package first (again, like initially downloading an app)
  • In the bottom right panel of RStudio, click on the “Packages” tab then click “Install Packages” in the tool bar

  • A window will pop up
  • Start typing “EnvStats” into the “Packages” box, select that package, and click “Install”
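
Packages can also be installed from the console with install.packages(). The sketch below wraps it in a helper named ensure_package, which is a hypothetical name and not part of R, so that a package is only downloaded when it is missing. It is demonstrated on "stats", which ships with R, so running it downloads nothing.

```r
# Hypothetical helper (not part of R): install a package only if it is missing
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from CRAN; needs an internet connection
  }
  requireNamespace(pkg, quietly = TRUE)   # TRUE if the package is now available
}

ensure_package("stats")        # "stats" comes with R, so nothing is downloaded
# ensure_package("EnvStats")   # uncomment to install EnvStats if it is missing
```

Installing a package this way still does not load it; library() is needed for that.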

  • Now that we’ve installed the package, we still can’t use the function we want
  • We’ve got to load the package first (opening the app)
  • For this, we will use the library() function

library("EnvStats")
## Warning: package 'EnvStats' was built under R version 3.1.3
## 
## Attaching package: 'EnvStats'
## 
## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm
## 
## The following object is masked from 'package:base':
## 
##     print.default


Now we can use the function we want

x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)
## 
## Results of Hypothesis Test
## --------------------------
## 
## Null Hypothesis:                 rho = 0
## 
## Alternative Hypothesis:          True rho is not equal to 0
## 
## Test Name:                       Rank von Neumann Test for
##                                  Lag-1 Autocorrelation
##                                  (Exact Method)
## 
## Estimated Parameter(s):          rho = -0.0187589
## 
## Estimation Method:               Yule-Walker
## 
## Data:                            x
## 
## Sample Size:                     5
## 
## Test Statistic:                  RVN = 1.8
## 
## P-value:                         0.7833333
## 
## Confidence Interval for:         rho
## 
## Confidence Interval Method:      Normal Approximation
## 
## Confidence Interval Type:        two-sided
## 
## Confidence Level:                95%
## 
## Confidence Interval:             LCL = -0.8951272
##                                  UCL =  0.8576094
  • Remember, when you close down RStudio, then start it up again, you don’t have to download the package again
  • But you do have to load the package to use any function that’s not in the R core functionality (this is very easy to forget)

Importing data

Download example Excel file

  • For this example, we will be using a file named “chicago_air.xlsx”
  • This file is located on our GitHub site here.
  • Download the Excel file to whatever location you prefer

Save as csv file

  • Open the spreadsheet in Excel
  • Click the “Office Button” in the upper-left corner
  • Choose “Save As”, then “Other Formats”
  • Select “CSV (Comma delimited)” from “Save as type:”
  • Click “Save”
  • Close Excel

Importing csv files in RStudio

  • In RStudio, go to “Tools” -> “Import Dataset” -> “From Text File”
  • Select your file using the window
  • Accept the defaults in the popup window and click “Import”
  • In the top right panel you will see a variable named airquality that is a data frame of the spreadsheet we imported
  • In the top left panel the data frame is displayed

Importing csv files with read.csv()

  • You can also easily import csv files from the command line
  • read.csv() is a function that takes the name of a csv file as its main argument
  • It reads the csv file and converts it to a data frame
  • It assumes that the first row contains column names
  • You must assign the output of read.csv() to a variable to be able to work with the data

  • Let’s say you downloaded your file to a folder named “My Data” in your C: drive
  • Use the entire file path as the argument in read.csv()

#airquality <- read.csv("C:/My Data/chicago_air.csv")
#airquality
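
To try read.csv() without the downloaded file, one option is to write a tiny csv file to a temporary location first. This is only a sketch: the object names csv_path and air_sample are assumptions, and the two data rows are copied from the first rows of the Chicago data.

```r
# Write a tiny csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,ozone,temp",
             "2013-01-01,0.032,17",
             "2013-01-02,0.020,15"), csv_path)

# Read it back in; the first row becomes the column names
air_sample <- read.csv(csv_path)
colnames(air_sample)   # "date" "ozone" "temp"
nrow(air_sample)       # 2
```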

read.table() is a function that helps you import data from a delimited text file; read.delim() is a variant of it with different defaults. This is what you want to use if you are importing a raw AQS file in pipe-delimited format.

metals_Lake <- read.delim("C:/My Data/aqsprodFKP1036275-0.txt", sep = "|", comment.char = "#", skip = 5, header = FALSE)
head(metals_Lake)
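
The sep, comment.char, and header arguments can be tried out on a small stand-in file. The file contents below are made up for illustration and are not a real AQS record layout.

```r
# Made-up pipe-delimited file with one '#' comment line
txt_path <- tempfile(fileext = ".txt")
writeLines(c("# this comment line is skipped",
             "RD|17|031|0.032",
             "RD|17|031|0.020"), txt_path)

# sep = "|" splits on pipes; header = FALSE means the file has no column names
raw <- read.table(txt_path, sep = "|", comment.char = "#", header = FALSE)
dim(raw)        # 2 rows, 4 columns
colnames(raw)   # R supplies the names "V1" "V2" "V3" "V4"
```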

setwd()
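
setwd() sets the working directory, which is the folder R looks in when a file name is given without a full path; getwd() reports the current one. In the sketch below, tempdir() stands in for a folder such as "C:/My Data", and the file and object names are made up.

```r
old_dir <- getwd()    # remember the current working directory
setwd(tempdir())      # tempdir() stands in for a folder such as "C:/My Data"

# With the working directory set, the file name alone is enough
write.csv(data.frame(ozone = c(0.032, 0.020)), "example.csv", row.names = FALSE)
example_data <- read.csv("example.csv")

setwd(old_dir)        # switch back to the original folder
```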

Data exploration

About the data
  • The data we imported in the previous section can actually be obtained in R by using the data() function
  • If you were not able to download that Excel file, use the code below to obtain the data
require(devtools)
install_github("natebyers/region5air")
library(region5air)
data(chicago_air)
head(chicago_air)
##         date ozone temp solar month weekday
## 1 2013-01-01 0.032   17  0.65     1       3
## 2 2013-01-02 0.020   15  0.61     1       4
## 3 2013-01-03 0.021   28  0.17     1       5
## 4 2013-01-04 0.028   18  0.62     1       6
## 5 2013-01-05 0.025   26  0.48     1       7
## 6 2013-01-06 0.026   36  0.47     1       1
  • chicago_air is a data frame with ozone readings from a monitor in Chicago
  • What column names are in the data frame?
colnames(chicago_air)
## [1] "date"    "ozone"   "temp"    "solar"   "month"   "weekday"
  • How many observations does the dataset contain?
  • We use the nrow() function to get the number of rows
nrow(chicago_air)
## [1] 365
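
Related functions are ncol() for the number of columns and dim() for both dimensions at once. The sketch below uses a small stand-in data frame (the name air_sample is an assumption) built from the first three rows shown above, so it runs without the region5air package.

```r
# Small stand-in built from the first three rows of chicago_air
air_sample <- data.frame(
  date  = c("2013-01-01", "2013-01-02", "2013-01-03"),
  ozone = c(0.032, 0.020, 0.021),
  temp  = c(17, 15, 28)
)

nrow(air_sample)   # number of rows: 3
ncol(air_sample)   # number of columns: 3
dim(air_sample)    # rows, then columns: 3 3
```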

Viewing the data

RStudio has a special function called View() that makes it easier to look at data in a data frame

View(chicago_air)

More functions for viewing data attributes

tail(chicago_air)  ##Looks at the last 6 rows in the dataset
##           date ozone temp solar month weekday
## 360 2013-12-26 0.026   NA  0.41    12       5
## 361 2013-12-27 0.021   NA  0.62    12       6
## 362 2013-12-28 0.026   NA  0.61    12       7
## 363 2013-12-29 0.029   NA  0.08    12       1
## 364 2013-12-30 0.024   NA  0.44    12       2
## 365 2013-12-31 0.021   NA  0.49    12       3

The str function is important because it describes the basic structure of the dataset. This lets you know whether all the data were imported the way you intended, i.e., numbers came in as numeric, text came in as character, etc. It is a quick snapshot of the data structure.

str(chicago_air)  ##Describes the basic structure of the dataset
## 'data.frame':    365 obs. of  6 variables:
##  $ date   : chr  "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" ...
##  $ ozone  : num  0.032 0.02 0.021 0.028 0.025 0.026 0.024 0.021 0.031 0.024 ...
##  $ temp   : num  17 15 28 18 26 36 25 30 41 33 ...
##  $ solar  : num  0.65 0.61 0.17 0.62 0.48 0.47 0.65 0.39 0.65 0.42 ...
##  $ month  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday: num  3 4 5 6 7 1 2 3 4 5 ...

The summary function complements str when you are working with a lot of numeric values, because it automatically calculates summary statistics for every numeric column in your data frame (or for a numeric vector).

summary(chicago_air)
##      date               ozone              temp            solar      
##  Length:365         Min.   :0.00400   Min.   :-17.00   Min.   :0.040  
##  Class :character   1st Qu.:0.02500   1st Qu.: 36.75   1st Qu.:0.510  
##  Mode  :character   Median :0.03400   Median : 59.50   Median :0.910  
##                     Mean   :0.03567   Mean   : 54.84   Mean   :0.841  
##                     3rd Qu.:0.04500   3rd Qu.: 73.00   3rd Qu.:1.200  
##                     Max.   :0.08100   Max.   : 92.00   Max.   :1.490  
##                     NA's   :26        NA's   :109                     
##      month           weekday     
##  Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 4.000   1st Qu.:2.000  
##  Median : 7.000   Median :4.000  
##  Mean   : 6.526   Mean   :3.997  
##  3rd Qu.:10.000   3rd Qu.:6.000  
##  Max.   :12.000   Max.   :7.000  
## 

The table function is helpful for summarizing your data by counts.

table(chicago_air$ozone)  # Counts how many times each ozone value occurs
plot(table(chicago_air$ozone))  # Quickly plot the counts; like a histogram, except no binning occurs
hist(chicago_air$ozone)  # Histogram, which groups the values into bins
## 
## 0.004 0.008  0.01 0.011 0.013 0.014 0.015 0.016 0.017 0.018 0.019  0.02 
##     1     1     1     1     1     3     6     4     5     3     3     6 
## 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029  0.03 0.031 0.032 
##    11    10    12    12    12    11     6    13    12     8     5     6 
## 0.033 0.034 0.035 0.036 0.037 0.038 0.039  0.04 0.041 0.042 0.043 0.044 
##    12     8    13     8     8     8    11     6     9     4     4     7 
## 0.045 0.046 0.047 0.048 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 
##     6     4     5     6     5     7     6     5     4     5     6     3 
## 0.057 0.058 0.059  0.06 0.061 0.062 0.064 0.065 0.066 0.067 0.068 0.069 
##     3     3     3     2     1     2     2     1     2     1     1     2 
## 0.074 0.078 0.081 
##     1     1     1

Working with data frames

  • You can refer to specific columns in a data frame with the $ operator
  • Use this to feed specific columns into a function
mean(airquality$Temp) # Mean temperature; this output comes from R's built-in airquality data, whose column is named Temp
## [1] 77.88235
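
In the Chicago data the temperature column is named temp and contains missing values (NA), so mean() needs the na.rm argument to return a number. A minimal sketch, using a stand-in data frame (air_sample is an assumed name) built from rows 1 to 3 and row 365 shown earlier:

```r
# Stand-in with one missing temperature, as in the last rows of chicago_air
air_sample <- data.frame(
  date = c("2013-01-01", "2013-01-02", "2013-01-03", "2013-12-31"),
  temp = c(17, 15, 28, NA)
)

mean(air_sample$temp)                 # NA, because one value is missing
mean(air_sample$temp, na.rm = TRUE)   # missing value removed first: 20
```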

Exercise 2

-Data Types Exercises

Next session: