Remember from last session…
Use <- to create the objects x and y by assigning some numbers to them
x <- 10
y <- 5
x + y
## [1] 15
x <- 10
y <- 5
answer1 <- x + y
answer2 <- x * y
answer3 <- answer1 + answer2
answer3
## [1] 65
You can start to see that storing information as objects has the potential to be very powerful. You can store lists of items or even entire data frames (spreadsheets) as a single object and perform all sorts of math or statistics on that object.
R has three main object types:
| Type | Description | Examples |
|---|---|---|
| character | letters and words | "z", "red", "H2O" |
| numeric | numbers | 1, 3.14, log(10) |
| logical | binary | TRUE, FALSE |
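If you are not sure which type an object has, you can ask R. The short sketch below uses class() and the is.*() helpers on one value of each type; the object names (word, number, flag) are made up for illustration.

```r
# One example value for each of the three main types
word <- "red"
number <- 3.14
flag <- TRUE

# class() reports the type of an object
class(word)     # "character"
class(number)   # "numeric"
class(flag)     # "logical"

# is.numeric(), is.character(), and is.logical() test for one specific type
is.numeric(log(10))   # TRUE
is.character("H2O")   # TRUE
is.logical(number)    # FALSE
```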
There are several ways to group data to make them easier to work with:
- Vectors: contain multiple values of the same type (e.g., all numbers or all words)
- Lists: contain multiple values of different types (e.g., some numbers and some words)
- Matrices: a table, like a spreadsheet, with only one data type
- Data frames: like a matrix, but you can mix data types
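The "same type" rule for vectors and matrices can be seen in a small sketch. The object names (mixed, m) are hypothetical, and the example only uses base R.

```r
# A vector holds one type; mixing types converts every element to a common type
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"

# A matrix is a table with a single data type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)         # 2 rows, 3 columns
class(m[1, 1]) # "integer"
```

If you need to keep numbers as numbers and words as words, a list or a data frame is the better container.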
Use c( ) as a container for vector elements. Think of the c as concatenating or combining elements.
x <- c(1, 2, 3, 4, 5)
x
## [1] 1 2 3 4 5
fruit <- c('apples', 'bananas', 'oranges')
fruit
## [1] "apples" "bananas" "oranges"
Use list() as a container for list items
x <- list("Benzene", 1.3, TRUE)
x
## [[1]]
## [1] "Benzene"
##
## [[2]]
## [1] 1.3
##
## [[3]]
## [1] TRUE
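Once a list exists, you usually want to pull individual items back out. The sketch below is one way to do that, assuming a list with named items; the name sample_info and the item names are made up for illustration.

```r
# A list can hold items of different types, and the items can be named
sample_info <- list(pollutant = "Benzene", concentration = 1.3, carcinogen = TRUE)

# Double square brackets pull out one item by position
sample_info[[1]]             # "Benzene"

# The $ operator pulls out one item by name
sample_info$concentration    # 1.3

# Each item keeps its own type
class(sample_info[[3]])      # "logical"
```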
Use data.frame() as a container for many vectors of the same length
pollutant <- c("Benzene", "Toluene", "Xylenes")
concentration <- c(1.3, 5.5, 6.0)
carcinogen <- c(TRUE, FALSE, FALSE)
my.data <- data.frame(pollutant, concentration, carcinogen)
my.data
## pollutant concentration carcinogen
## 1 Benzene 1.3 TRUE
## 2 Toluene 5.5 FALSE
## 3 Xylenes 6.0 FALSE
If you try to create a data.frame from columns that are not all the same length, R will return an error
pollutant <- c("Benzene", "Toluene")
concentration <- c(1.3, 5.5, 6.0)
carcinogen <- c(TRUE, FALSE, FALSE)
my.data <- data.frame(pollutant, concentration, carcinogen)
## Error in data.frame(pollutant, concentration, carcinogen): arguments imply differing number of rows: 2, 3
cbind(), rbind()
x <- c(4, 8, 1, 14, 34)
mean(x) # Calculate the mean of the data set
## [1] 12.2
y <- c(1, 4, 3, 5, 10)
mean(y) # Mean of a different data set
## [1] 4.6
log(27) # Natural logarithm
## [1] 3.295837
log10(100) # Base 10 logarithm
## [1] 2
sqrt(225) # Square root
## [1] 15
abs(-5) # Absolute value
## [1] 5
You can use functions in combination with objects you have created
answer <- 1+1
log(25 + answer)
## [1] 3.295837
Many built-in functions in R have multiple arguments, which means you have to give the function some more information so that it can perform the correct calculation
round(12.3456, digits=3)
## [1] 12.346
round(12.3456, digits=1)
## [1] 12.3
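Arguments can be matched by name or by position, and many have a default value that is used when you leave them out. A small sketch with round(), using made-up object names:

```r
# Argument matched by name
by_name <- round(12.3456, digits = 2)

# The same call with the second argument matched by position
by_position <- round(12.3456, 2)

# Leaving digits out uses its default value of 0
default_digits <- round(12.3456)

by_name          # 12.35
by_position      # 12.35
default_digits   # 12
```

Naming the argument takes a few more keystrokes, but it makes the code easier to read later.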
The seq() or ‘sequence’ function is commonly used to create a vector that follows a regular sequence. This is used a lot when writing functions.
The length() function can be used to find out the length of a vector (for a data frame, it returns the number of columns) or to tell another function that you want it to look at the whole length of something else.
The rep() or ‘repeat’ function is often used to create a pattern of numbers
seq(1,5,by=1)
## [1] 1 2 3 4 5
x <- 1:5 #Here we are using the colon operator to create a sequence from 1 to 5. This is a shortcut if you just need to sequence through numbers or a vector, without skipping.
length(x)
## [1] 5
# rep() repeats the values in a vector
rep(1:3, times = 2)
## [1] 1 2 3 1 2 3
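A few more variations on seq(), rep(), and length() are sketched below. The object names are hypothetical; the arguments by, length.out, times, and each are all part of base R.

```r
# Step through a range in increments of 2
evens <- seq(0, 10, by = 2)                # 0 2 4 6 8 10

# Ask for a fixed number of evenly spaced values instead of a step size
five_points <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# Repeat the whole vector three times
pattern_times <- rep(c(1, 2), times = 3)   # 1 2 1 2 1 2

# Repeat each element three times before moving to the next
pattern_each <- rep(c(1, 2), each = 3)     # 1 1 1 2 2 2

# length() counts the elements in a vector
length(evens)                              # 6
```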
Put a # in front of your comment
Anything after # on a line will not be evaluated
# Full line comment
x # partial line comment
"new line"
Functions take the form function()
Inside the () is where you put your data or indicate options
To see what goes inside (), type a question mark in front of the function and run it
?mean()
In RStudio, you will see the help page for mean() in the bottom right corner
Under Usage, you see mean(x, ...)
This means the only thing that has to go inside () is x
Under Arguments you will find a description of what x needs to be (R expects x in the mean function to be a numeric vector)
To find a package, use a search with key words describing what you want the function to do and just add “R package” to the end
For example, the package EnvStats has a function called serialCorrelationTest()
First, try to use the function
x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)
In the bottom right panel of RStudio, click on the “Packages” tab then click “Install Packages” in the tool bar
Start typing “EnvStats” into the “Packages” box, select that package, and click “Install”
Once the package is installed, you need to load it into your R session. For this, we will use the library() function
library("EnvStats")
## Warning: package 'EnvStats' was built under R version 3.1.3
##
## Attaching package: 'EnvStats'
##
## The following objects are masked from 'package:stats':
##
## predict, predict.lm
##
## The following object is masked from 'package:base':
##
## print.default
Now we can use the function we want
x <- c(1.3, 3.5, 2.6, 3.4, 6.4)
serialCorrelationTest(x)
##
## Results of Hypothesis Test
## --------------------------
##
## Null Hypothesis: rho = 0
##
## Alternative Hypothesis: True rho is not equal to 0
##
## Test Name: Rank von Neumann Test for
## Lag-1 Autocorrelation
## (Exact Method)
##
## Estimated Parameter(s): rho = -0.0187589
##
## Estimation Method: Yule-Walker
##
## Data: x
##
## Sample Size: 5
##
## Test Statistic: RVN = 1.8
##
## P-value: 0.7833333
##
## Confidence Interval for: rho
##
## Confidence Interval Method: Normal Approximation
##
## Confidence Interval Type: two-sided
##
## Confidence Level: 95%
##
## Confidence Interval: LCL = -0.8951272
## UCL = 0.8576094
R packages such as xlsx and XLConnect can read Excel files
Accept the defaults in the popup window and click “Import”
You now have an object called airquality that is a data frame of the spreadsheet we imported
read.csv() is a function that takes the name of a csv file as its main argument
You must assign the output of read.csv() to a variable to be able to work with the data
Use the entire file path as the argument in read.csv()
#airquality <- read.csv("C:/My Data/chicago_air.csv")
#airquality
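If you do not have a csv file handy, the sketch below writes a tiny one to a temporary location and reads it back, so it should run on any machine. The file contents are illustrative values, and the names csv_path and air_sample are made up for this example.

```r
# Write a tiny csv file to a temporary location so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,ozone,temp",
             "2013-01-01,0.032,17",
             "2013-01-02,0.020,15"),
           csv_path)

# Check that the path is correct before reading
file.exists(csv_path)

# Assign the output of read.csv() to an object so you can work with it
air_sample <- read.csv(csv_path)
air_sample
```

With a real file, replace csv_path with the full path to your own csv, using forward slashes as in the commented example above.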
read.table() is a function that helps you import data from a text file. This is what you want to use if you are importing a raw AQS file in pipe-delimited format. The example here uses read.delim(), which is read.table() with different defaults.
metals_Lake <- read.delim("C:/My Data/aqsprodFKP1036275-0.txt", sep = "|", comment.char = "#", skip = 5, header = FALSE)
head(metals_Lake)
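To see what the sep, comment.char, and header arguments do without needing a real file, here is a minimal sketch that reads pipe-delimited text held in memory. The three lines of text are made up and are not a real AQS record layout; the column names are also hypothetical.

```r
# Made-up pipe-delimited lines; the first one is a comment that should be skipped
raw_lines <- c("# made-up comment line",
               "Lead|031|0.012",
               "Arsenic|031|0.003")

# text = reads from a character vector instead of a file
# colClasses keeps codes such as "031" as text so the leading zero is not lost
pipe_data <- read.table(text = raw_lines,
                        sep = "|",
                        comment.char = "#",
                        header = FALSE,
                        col.names = c("parameter", "county", "value"),
                        colClasses = c("character", "character", "numeric"))
pipe_data
```

Without colClasses, R would guess that the county column is a number and store 031 as 31, which is worth checking for whenever a file contains codes.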
setwd()
You can load a data set that comes with a package by using the data() function
library(devtools)
install_github("natebyers/region5air")
library(region5air)
data(chicago_air)
head(chicago_air) # Looks at the first 6 rows of the data set
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 2 2013-01-02 0.020 15 0.61 1 4
## 3 2013-01-03 0.021 28 0.17 1 5
## 4 2013-01-04 0.028 18 0.62 1 6
## 5 2013-01-05 0.025 26 0.48 1 7
## 6 2013-01-06 0.026 36 0.47 1 1
chicago_air is a data frame with ozone readings from a monitor in Chicago
colnames(chicago_air)
## [1] "date" "ozone" "temp" "solar" "month" "weekday"
Use the nrow() function to get the number of rows
nrow(chicago_air)
## [1] 365
RStudio has a special function called View() that makes it easier to look at data in a data frame
View(chicago_air)
tail(chicago_air) # Looks at the last 6 rows of the data set
## date ozone temp solar month weekday
## 360 2013-12-26 0.026 NA 0.41 12 5
## 361 2013-12-27 0.021 NA 0.62 12 6
## 362 2013-12-28 0.026 NA 0.61 12 7
## 363 2013-12-29 0.029 NA 0.08 12 1
## 364 2013-12-30 0.024 NA 0.44 12 2
## 365 2013-12-31 0.021 NA 0.49 12 3
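Both head() and tail() return 6 rows by default, and both accept an n argument if you want a different number. A small sketch on a made-up data frame (readings is a hypothetical name):

```r
# A small made-up data frame with 10 rows
readings <- data.frame(day = 1:10,
                       temp = c(17, 15, 28, 18, 26, 36, 25, 30, 41, 33))

# n controls how many rows are returned
first_three <- head(readings, n = 3)
last_two <- tail(readings, n = 2)

first_three
last_two

# dim() returns the number of rows and columns in one call
dim(readings)
```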
The str function is important because it describes the basic structure of the dataset. This lets you check whether the data was imported the way it was intended (for example, numbers came in as numeric and text came in as character). It is a quick way to get a snapshot of the data structure.
str(chicago_air) # Describes the basic structure of the dataset
## 'data.frame': 365 obs. of 6 variables:
## $ date : chr "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" ...
## $ ozone : num 0.032 0.02 0.021 0.028 0.025 0.026 0.024 0.021 0.031 0.024 ...
## $ temp : num 17 15 28 18 26 36 25 30 41 33 ...
## $ solar : num 0.65 0.61 0.17 0.62 0.48 0.47 0.65 0.39 0.65 0.42 ...
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday: num 3 4 5 6 7 1 2 3 4 5 ...
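When str() shows that a column came in as a different type than you wanted, you can convert it. As one possible sketch, the code below turns dates stored as text into R's Date class with as.Date(); the vector names are made up for illustration.

```r
# Dates stored as text, in year-month-day form
dates_chr <- c("2013-01-01", "2013-01-02", "2013-01-03")
class(dates_chr)    # "character"

# as.Date() converts text in this form to the Date class
dates <- as.Date(dates_chr)
class(dates)        # "Date"

# Once converted, you can do date arithmetic
days_between <- as.numeric(dates[3] - dates[1])
days_between        # 2
```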
The summary function goes a step further than str if you are working with a lot of numeric values, because it automatically calculates summary statistics (minimum, quartiles, median, mean, maximum, and the number of NAs) for any numbers in your vector or data.frame.
summary(chicago_air)
## date ozone temp solar
## Length:365 Min. :0.00400 Min. :-17.00 Min. :0.040
## Class :character 1st Qu.:0.02500 1st Qu.: 36.75 1st Qu.:0.510
## Mode :character Median :0.03400 Median : 59.50 Median :0.910
## Mean :0.03567 Mean : 54.84 Mean :0.841
## 3rd Qu.:0.04500 3rd Qu.: 73.00 3rd Qu.:1.200
## Max. :0.08100 Max. : 92.00 Max. :1.490
## NA's :26 NA's :109
## month weekday
## Min. : 1.000 Min. :1.000
## 1st Qu.: 4.000 1st Qu.:2.000
## Median : 7.000 Median :4.000
## Mean : 6.526 Mean :3.997
## 3rd Qu.:10.000 3rd Qu.:6.000
## Max. :12.000 Max. :7.000
##
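summary() reports how many NA (missing) values a column has, and that matters for other functions. A short sketch, using a made-up vector, of how missing values affect mean() and how the na.rm argument handles them:

```r
# A made-up vector with one missing value
temp <- c(17, 15, NA, 18, 26)

# By default, a single NA makes the result NA
mean_with_na <- mean(temp)
mean_with_na      # NA

# na.rm = TRUE drops missing values before calculating
mean_without_na <- mean(temp, na.rm = TRUE)
mean_without_na   # 19

# Count the missing values yourself
n_missing <- sum(is.na(temp))
n_missing         # 1
```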
The table function is helpful for summarizing your data by counts.
table(chicago_air$ozone) # Summarizes by counts
plot(table(chicago_air$ozone)) # Quickly plot the counts; like a histogram except no binning occurs
hist(chicago_air$ozone) # Histogram of the same values, grouped into bins
##
## 0.004 0.008 0.01 0.011 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.02
## 1 1 1 1 1 3 6 4 5 3 3 6
## 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.03 0.031 0.032
## 11 10 12 12 12 11 6 13 12 8 5 6
## 0.033 0.034 0.035 0.036 0.037 0.038 0.039 0.04 0.041 0.042 0.043 0.044
## 12 8 13 8 8 8 11 6 9 4 4 7
## 0.045 0.046 0.047 0.048 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056
## 6 4 5 6 5 7 6 5 4 5 6 3
## 0.057 0.058 0.059 0.06 0.061 0.062 0.064 0.065 0.066 0.067 0.068 0.069
## 3 3 3 2 1 2 2 1 2 1 1 2
## 0.074 0.078 0.081
## 1 1 1
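table() is easier to read on a column with only a few distinct values. Here is a minimal sketch with a made-up character vector; weekday and counts are hypothetical names.

```r
# A made-up vector with repeated values
weekday <- c("Mon", "Tue", "Mon", "Wed", "Mon", "Tue")

# table() counts how many times each value occurs
counts <- table(weekday)
counts

# Pull out the count for a single value by name
counts[["Mon"]]   # 3

# Sort so the most common value comes first
sort(counts, decreasing = TRUE)
```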
Use the $ operator to access a single column of a data frame
mean(airquality$Temp) # Calculate the mean temperature
## [1] 77.88235
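The $ operator works on any data frame. The sketch below rebuilds a small pollutant table and shows a few things you can do with a single column; the object names are for illustration only.

```r
# A small data frame with a text column and a numeric column
my.data <- data.frame(pollutant = c("Benzene", "Toluene", "Xylenes"),
                      concentration = c(1.3, 5.5, 6.0),
                      stringsAsFactors = FALSE)

# $ returns one column as a vector
conc <- my.data$concentration

# Double square brackets with the column name return the same thing
conc_again <- my.data[["concentration"]]

# A column can go straight into a function
highest <- max(my.data$concentration)
highest   # 6

# A column can also be used to pick values from another column
high_pollutants <- my.data$pollutant[my.data$concentration > 5]
high_pollutants   # "Toluene" "Xylenes"
```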