Ben Bellman
August 28, 2018
“R is an integrated suite of software facilities for data manipulation, calculation and graphical display.” (R Project website)
Provides eight base packages for computing and data analysis
True strength of R is its development community and wealth of packages
Huge range of tools; R is becoming a flagship for academic software
Open source, with an ethos of reproducibility and sharing
I think of R as an ecosystem
All kinds of packages
tidyverse and dplyr
ggplot2
Stats and Modeling
obj <- funct(arg1 = data,
             arg2 = T,
             arg3 = "setting",
             ...)
obj = output is stored in object
<- = the assignment operator for storing results
funct = name of function being called
arg1 = first argument is usually object/data being operated on
arg2, arg3 = additional arguments that change how funct works
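As a concrete instance of the call pattern, here is a small sketch using base R's round(). The variable names (values, rounded, rounded_positional) are made up for illustration.

```r
# The data being operated on
values <- c(3.14159, 2.71828, 1.41421)

# x is the first argument (the data); digits changes how round() behaves
rounded <- round(x = values, digits = 2)
rounded

# Arguments can also be matched by position instead of by name
rounded_positional <- round(values, 2)
identical(rounded, rounded_positional)
```

Naming arguments is optional, but it usually makes longer calls easier to read.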
There are a few basic types of objects/data
Other types of data are introduced through packages
We'll start by using the console as a calculator
Results of commands are returned there
2 + 3
[1] 5
a <- 2 + 3
a
[1] 5
string <- "Hello world"
string
[1] "Hello world"
x <- c(2, 18, 12, 23, 73)
x - 3
[1] -1 15 9 20 70
str(x)
num [1:5] 2 18 12 23 73
c function "c"ombines elements into a vector
str function lets us see the "str"ucture of the data
c("Learning", "R", "is", "fun!")
[1] "Learning" "R" "is" "fun!"
c(T, F, F, F, T, F)
[1] TRUE FALSE FALSE FALSE TRUE FALSE
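A vector holds a single type, so c() converts mixed inputs to a common type. The sketch below shows what that looks like; the object names are illustrative only.

```r
# Mixing numbers, strings, and logicals yields a character vector
mixed <- c(1, "a", TRUE)
str(mixed)

# Mixing numbers and logicals yields a numeric vector (TRUE -> 1, FALSE -> 0)
nums_and_logicals <- c(1, TRUE, FALSE)
str(nums_and_logicals)
```

Checking the result with str() is a quick way to notice when a vector did not end up as the type you expected.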
NA is the most important missing value
b <- c(1, 2, 3, NA, 5)
is.na(b)
[1] FALSE FALSE FALSE TRUE FALSE
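Missing values propagate through most calculations, so it helps to see how they behave. A minimal sketch, reusing the same values as b above:

```r
b <- c(1, 2, 3, NA, 5)

# Any calculation that touches NA returns NA by default
mean_with_na <- mean(b)
mean_with_na

# Many base functions accept na.rm = TRUE to drop missing values first
mean_without_na <- mean(b, na.rm = TRUE)
mean_without_na

# Count how many values are missing
n_missing <- sum(is.na(b))
n_missing
```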
Other special values: Inf, -Inf, and NaN
Subset with [n] function/notation
n is a number or vector of numbers/logical values
x <- c("Learning", "R", "is", "fun!")
x[1]
[1] "Learning"
x[c(1, 4)]
[1] "Learning" "fun!"
x[c(T, F, F, T)]
[1] "Learning" "fun!"
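The same bracket notation supports a few other common patterns. The sketch below uses a numeric vector named scores, a made-up name for illustration.

```r
scores <- c(2, 18, 12, 23, 73)

# A range of positions
middle <- scores[2:3]
middle

# A negative index drops that position
without_first <- scores[-1]
without_first

# A logical condition builds the TRUE/FALSE vector for you
large <- scores[scores > 15]
large
```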
v <- c("abcd", "efgh", "ijkl")
v
[1] "abcd" "efgh" "ijkl"
substr(v, 1, 2)
[1] "ab" "ef" "ij"
f <- c("Red", "Blue", "Red", "Blue")
summary(f)
   Length     Class      Mode
        4 character character
f <- factor(f)
summary(f)
Blue  Red
   2    2
temperatures <- c("High", "Low", "High", "Low", "Medium")
temp_factor <- factor(temperatures,
                      ordered = TRUE,
                      levels = c("Low", "Medium", "High"))
temp_factor
[1] High Low High Low Medium
Levels: Low < Medium < High
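Once the levels are ordered, comparisons and summaries respect that order. A short sketch, rebuilding the same factor so the block runs on its own:

```r
temperatures <- c("High", "Low", "High", "Low", "Medium")
temp_factor <- factor(temperatures,
                      ordered = TRUE,
                      levels = c("Low", "Medium", "High"))

# Comparisons use the level order, not alphabetical order
at_least_medium <- temp_factor >= "Medium"
at_least_medium

# Counts are reported in level order
counts <- table(temp_factor)
counts

# Underlying integer codes: Low = 1, Medium = 2, High = 3
codes <- as.integer(temp_factor)
codes

# max() works on ordered factors
hottest <- max(temp_factor)
hottest
```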
matrix(1:9, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
1:9 generates a vector of all integers from 1 through 9
col1 <- c(1, 2, 3)
col2 <- c(4, 5, 6)
col3 <- c(7, 8, 9)
matrix(c(col1, col2, col3), ncol = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
Reference cells with [ ] and two dimensions: [row, column]
m <- matrix(c(col1, col2, col3), ncol = 3)
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
m[2, 2]
[1] 5
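Leaving one of the two positions blank selects a whole row or column. A minimal sketch, building an equivalent matrix from 1:9 so the block is self-contained:

```r
# Filled column by column, like the matrix m above
m <- matrix(1:9, ncol = 3)

# Whole second row
second_row <- m[2, ]
second_row

# Whole third column
third_col <- m[, 3]
third_col

# A sub-matrix: rows 1-2, columns 2-3
block <- m[1:2, 2:3]
block

# Number of rows and columns
dim(m)
```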
l <- list(a, v, matrix(1:9, byrow = TRUE, nrow = 3))
str(l)
List of 3
$ : num 5
$ : chr [1:3] "abcd" "efgh" "ijkl"
$ : int [1:3, 1:3] 1 4 7 2 5 8 3 6 9
[[ ]] extracts an element of a list
l[[2]]
[1] "abcd" "efgh" "ijkl"
[ ] to reference cells of object in list
l[[3]][2, 2]
[1] 5
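List elements can also be named, which makes them easier to reference than positions. The sketch below uses hypothetical names (info, count, labels, grid) chosen for illustration.

```r
info <- list(count = 5,
             labels = c("abcd", "efgh", "ijkl"),
             grid = matrix(1:9, byrow = TRUE, nrow = 3))

# Named elements can be pulled out with $ or [[ ]]
info$labels
info[["count"]]

# Chain a second subset to reach inside the extracted object
centre <- info$grid[2, 2]
centre

# Single brackets keep the result wrapped in a list
class(info["count"])
class(info[["count"]])
```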
data.frame(col1, col2, col3)
  col1 col2 col3
1    1    4    7
2    2    5    8
3    3    6    9
col2 <- c("lmao", "brb", "smh") #character
col3 <- factor(c("Good", "Good", "Bad")) #factor
col4 <- c(T, F, T) #logical
df <- data.frame(col1, col2, col3, col4, stringsAsFactors = F)
df
  col1 col2 col3  col4
1    1 lmao Good  TRUE
2    2  brb Good FALSE
3    3  smh  Bad  TRUE
summary(df)
      col1         col2             col3      col4
 Min.   :1.0   Length:3           Bad :1   Mode :logical
 1st Qu.:1.5   Class :character   Good:2   FALSE:1
 Median :2.0   Mode  :character            TRUE :2
 Mean   :2.0
 3rd Qu.:2.5
 Max.   :3.0
Reference columns with [ ] and $
df$new <- 0
df$new
[1] 0 0 0
df
  col1 col2 col3  col4 new
1    1 lmao Good  TRUE   0
2    2  brb Good FALSE   0
3    3  smh  Bad  TRUE   0
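Data frame columns and rows can be reached in several equivalent ways. A small sketch on a two-column data frame; the name small_df and the doubled column are made up for illustration.

```r
small_df <- data.frame(col1 = c(1, 2, 3),
                       col2 = c("lmao", "brb", "smh"),
                       stringsAsFactors = FALSE)

# Three ways to pull out a single column as a vector
small_df$col2
small_df[["col2"]]
small_df[, "col2"]

# Rows can be filtered with a logical condition before the comma
filtered <- small_df[small_df$col1 > 1, ]
filtered

# New columns can be computed from existing ones
small_df$doubled <- small_df$col1 * 2
small_df
```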
#change to own file path
library(here)
salaries <- read.csv(here("data","white-house-salaries.csv"))
summary(salaries)
             employee_name      salary
 brooke, mary j     :  16   Min.   :     0
 campbell, frances l:  16   1st Qu.: 45000
 droege, philip c   :  16   Median : 61952
 jones, crystal b   :  16   Mean   : 76412
 kalbaugh, david e  :  16   3rd Qu.:102000
 mattson, philip c  :  16   Max.   :225000
 (Other)            :7012
                       position         year
 staff assistant           : 429   Min.   :2001
 associate director        : 221   1st Qu.:2006
 records management analyst: 206   Median :2010
 deputy associate director : 176   Mean   :2010
 executive assistant       : 154   3rd Qu.:2014
 (Other)                   :5920   Max.   :2017
 NA's                      :   2
                  status           party      president      term
 detailee            : 328   democrat  :3740   bush :2991   first :3982
 employee            :6778   republican:3368   obama:3740   second:3126
 employee (part-time):   2                     trump: 377
    gender
 female:3521
 male  :3587
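If the salaries file is not at hand, the read step can still be practiced on a throwaway file. The sketch below writes a tiny, made-up data frame (toy, with invented values) to a temporary CSV and reads it back; it is not the White House data.

```r
# Made-up data purely for illustration
toy <- data.frame(name = c("a", "b", "c"),
                  salary = c(100, 200, NA),
                  stringsAsFactors = FALSE)

# Write to a temporary file, then read it back in
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
loaded <- read.csv(path, stringsAsFactors = FALSE)

# Inspect what came back
str(loaded)
summary(loaded$salary)
```

Setting stringsAsFactors explicitly keeps the result the same across R versions, since the default changed in R 4.0.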
These are the main data types, but there are others
Other packages introduce their own classes
Checking object classes is a good debugging tool
class(df)
[1] "data.frame"
class(df$col4)
[1] "logical"
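A typical case where class() helps is a column of numbers that was read in as text. A minimal sketch with a made-up vector named years:

```r
# Numbers stored as text, as often happens after importing data
years <- c("2001", "2010", "2017")
class(years)

# Arithmetic such as years + 1 would fail here, so convert first
years_num <- as.numeric(years)
class(years_num)

next_years <- years_num + 1
next_years
```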
When coding, think of objects as nouns and functions as verbs
Easy to create custom functions to simplify code
Must pay attention to required arguments of functions
Can always view the documentation of a function with ?funct_name
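As one way to picture a custom function, the sketch below defines a hypothetical helper, rescale01, that rescales a numeric vector to run from 0 to 1. The name and the na.rm default are illustrative choices, not part of any package used here.

```r
# x is the required argument; na.rm is optional and has a default
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Use it like any built-in function
scaled <- rescale01(c(0, 5, 10))
scaled

# Missing values stay missing but do not break the calculation
scaled_na <- rescale01(c(0, 5, NA, 10))
scaled_na
```

Wrapping repeated steps in a function like this keeps the main script short and gives the operation a verb-like name.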
ggplot2
Mara Averick (@dataandme)
Sharon Machlis (@sharon000)
Jenny Bryan (@JennyBryan)
Angela Li (@CivicAngela)
Kyle Walker (@kyle_e_walker)
Thomas Mock (@thomas_mock)