Jesse Yang
Oct 5, 2017
The good and bad of the R language
c(...)
list(...) - analogous to a dictionary in other languages
c(1, 1.)
c(1L, 5L, 6L)
c(TRUE, FALSE, c(9, 2) == 2)
c(1, '2')
<- and =
x <- c('a', 'b', 'c')
x
[1] "a" "b" "c"
x <- matrix(data=c('a', 'b', 'c'), nrow=2)
x
[,1] [,2]
[1,] "a" "c"
[2,] "b" "a"
which creates ambiguities.
str() - not “string”, but “structure”
structure() - actually means “re-structure”
get() - get what?
date() - current date? create a date? day of month of a Date object?
c() - combine? character?
ncol <- 2 - can I use it as a variable when it's already a function?
You may find that many common verbs/nouns you want to use for your own functions are already taken by built-in functions.
R is smart enough to figure out what you mean
ncol
function (x)
dim(x)[2L]
<bytecode: 0x7f9ab8214518>
<environment: namespace:base>
ncol <- 3
dim(matrix(1:12, ncol=ncol))
[1] 4 3
ncol
[1] 3
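The lookup rule behind this can be checked directly: when a name is used in call position, R skips bindings that are not functions. A minimal sketch (the names `m` and `n_cols` are introduced here only for illustration):

```r
# Shadow the base function name with a plain number
ncol <- 3

# As an argument value, `ncol` resolves to the variable
m <- matrix(1:12, ncol = ncol)

# In call position, R skips non-function bindings and finds base::ncol
n_cols <- ncol(m)
print(n_cols)

# Removing the variable ends the shadowing
rm(ncol)
```

This works, but reusing the name still makes the code harder to read, so a distinct variable name is usually the safer choice.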
Therefore the order in which libraries are loaded matters:
library(plyr)
library(dplyr)
is different from
library(dplyr)
library(plyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:plyr’:
arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
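One way to stay independent of load order is to qualify the call with `package::function`. The sketch below uses only base R to stay self-contained: a hypothetical user-defined `filter()` masks `stats::filter()`, yet the qualified name still reaches the original. The same pattern should apply to `dplyr::filter()` or `plyr::mutate()`.

```r
# A user-defined function named `filter` masks stats::filter on the search path
filter <- function(x, ...) "my own filter"

own <- filter(1:5)

# The package-qualified name still reaches the original function
# (here: a centered moving sum over a window of 3)
moving_sum <- as.numeric(stats::filter(1:5, rep(1, 3)))

print(own)
print(moving_sum)
```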
stringsAsFactors=FALSE?
data.frame(name=c('a','b', 'a'), val=c(1,2,3), stringsAsFactors=FALSE) %>% str()
'data.frame': 3 obs. of 2 variables:
$ name: chr "a" "b" "a"
$ val : num 1 2 3
data.frame(name=c('a','b', 'a'), val=c(1,2,3)) %>% str()
'data.frame': 3 obs. of 2 variables:
$ name: Factor w/ 2 levels "a","b": 1 2 1
$ val : num 1 2 3
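The output above reflects the default at the time of writing. Since R 4.0.0 the default is `stringsAsFactors = FALSE`, so passing the argument explicitly keeps the result the same across versions. A small sketch (the names `df_chr` and `df_fct` are illustrative):

```r
# Spell out the argument so the result does not depend on the R version
df_chr <- data.frame(name = c("a", "b", "a"), val = c(1, 2, 3),
                     stringsAsFactors = FALSE)
df_fct <- data.frame(name = c("a", "b", "a"), val = c(1, 2, 3),
                     stringsAsFactors = TRUE)

print(class(df_chr$name))  # character column
print(class(df_fct$name))  # factor column
```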
Can you tell the result of the following code?
as.numeric(factor(5:10))
[1] 1 2 3 4 5 6
as.numeric(factor(c(2, 3, 3, 6, 6, 6)))
[1] 1 2 2 3 3 3
as.numeric(factor(5:10))
[1] 1 2 3 4 5 6
5:10 generates c(5, 6, 7, 8, 9, 10).
factor() coerces it into characters first: as.character(5:10)
factor() then stores each value as an integer code pointing into the sorted levels.
as.numeric() returns these internal integer codes of the factor values; it does not try to convert them into characters first.
as.numeric(factor(c(2, 3, 3, 6, 6, 6)))
[1] 1 2 2 3 3 3
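To recover the original numbers rather than the internal codes, go through the level labels. A minimal sketch of the two common idioms (variable names are illustrative):

```r
f <- factor(c(2, 3, 3, 6, 6, 6))

# Internal integer codes: positions in the sorted levels
codes <- as.numeric(f)

# Original values: convert the labels, not the codes
values <- as.numeric(as.character(f))

# Same result, converting each level once and then indexing by the factor
values_via_levels <- as.numeric(levels(f))[f]

print(codes)
print(values)
```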
Change order of categorical variables
water %>%
  mutate(
    risk = forcats::fct_relevel(risk, 'Less Risk', 'Some Risk', 'More Risk')
  ) %>%
  group_by(risk, section) %>%
  summarise(u238=mean(uranium238)) %>%
  ggplot(aes(x=section, y=u238, fill=risk)) +
  geom_bar(stat="identity", position="dodge")
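The reordering step can also be sketched in base R by passing `levels` explicitly to `factor()`. The toy `risk` vector below is made up for illustration and is not the `water` data used above:

```r
# Toy data; the real `risk` column lives in the `water` data frame
risk <- c("More Risk", "Less Risk", "Some Risk", "Less Risk")

# Without `levels`, the order would be alphabetical
risk <- factor(risk, levels = c("Less Risk", "Some Risk", "More Risk"))

print(levels(risk))
print(as.integer(risk))  # codes follow the custom order
```

Plots and summaries then follow the level order instead of alphabetical order.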
x <- factor(c('a', 'b', 'c'))
x[4] <- 'd'
Warning in `[<-.factor`(`*tmp*`, 4, value = "d"): invalid factor level, NA
generated
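The warning appears because 'd' is not among the existing levels. A minimal sketch of one way around it is to extend the levels before assigning:

```r
x <- factor(c("a", "b", "c"))

# Register the new level first, then the assignment is valid
levels(x) <- c(levels(x), "d")
x[4] <- "d"

print(x)
```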
x <- data.frame(id=factor(c('a','b','a')), val=1:3)
y <- data.frame(id=factor(c('c','b','a')), val=1:3)
x %>% left_join(y, by='id')
Warning: Column `id` joining factors with different levels, coercing to
character vector
id val.x val.y
1 a 1 3
2 b 2 2
3 a 3 3
a <- c(1:5, 7:9)
a > 4
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
which(a > 6 & a < 8)
[1] 6
which(a == 6)
integer(0)
c(TRUE, FALSE, c(9, 2) == 2)
[1] TRUE FALSE FALSE TRUE
c(TRUE, FALSE, c(9, 2) == 2, c(1, c(0, 1, 1)))
[1] 1 0 0 1 1 0 1 1
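The results above follow from the coercion order of c(): logical, then integer, then double, then character. A short sketch that makes the order visible with typeof() (the name `types` is illustrative):

```r
# Each added element pushes the whole vector to the more general type
types <- c(
  typeof(c(TRUE, FALSE)),
  typeof(c(TRUE, 1L)),
  typeof(c(TRUE, 1L, 1.5)),
  typeof(c(TRUE, 1L, 1.5, "a"))
)
print(types)

# Mixing a number with a string turns everything into strings
mixed <- c(1, "2")
print(mixed)
```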
x <- c(2, 4, 6:8) # sequence integers are inclusive
x[c(2:4, 1, 3, 1)] # you can pick a value more than one time
[1] 4 6 7 2 6 2
x <- matrix(1:20, ncol = 5)
x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
x[2, 2:5, drop=FALSE] # what if you don't add drop=FALSE?
[,1] [,2] [,3] [,4]
[1,] 6 10 14 18
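To answer the question in the comment: without drop=FALSE, a single-row selection loses its dimensions and becomes a plain vector. A small sketch comparing the two (the names `kept` and `dropped` are illustrative):

```r
x <- matrix(1:20, ncol = 5)

kept    <- x[2, 2:5, drop = FALSE]  # still a 1 x 4 matrix
dropped <- x[2, 2:5]                # plain integer vector

print(dim(kept))
print(dim(dropped))  # NULL: the matrix structure is gone
```

This matters in functions that later call nrow(), ncol() or apply() on the result.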
x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
x[-c(15:5)]
[1] 1 2 3 4 16 17 18 19 20
Column selection (dplyr)
x <- data.frame(name=c('a','b', 'a'), val1=1:3, val2=5:7)
select(x, val1:val2)
val1 val2
1 1 5
2 2 6
3 3 7
select(x, -val1:val2) # error!
Error in combine_vars(vars, ind_list) :
Each argument must yield either positive or negative integers
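The error comes from operator precedence: unary minus binds tighter than `:`, so -val1:val2 is read as (-val1):val2 and mixes negative and positive positions. The same effect shows up with plain numbers:

```r
# Unary minus is applied before the colon operator
neg_first <- -1:2      # parsed as (-1):2
range_first <- -(1:2)  # negate the whole range

print(neg_first)
print(range_first)
```

With dplyr, wrapping the range in parentheses, select(x, -(val1:val2)), should therefore drop both columns.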
x[, val1:val2]
Error in `[.data.frame`(x, , val1:val2) : object 'val1' not found
cols <- val1:val2
x[, cols]
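Bare column names fail here because `[` evaluates its arguments as ordinary R expressions. In base R, a range of columns by name can still be written with subset(), which evaluates `select` against the column names. A minimal sketch (the name `by_range` is illustrative):

```r
x <- data.frame(name = c("a", "b", "a"), val1 = 1:3, val2 = 5:7,
                stringsAsFactors = FALSE)

# subset() maps val1:val2 to the matching column positions
by_range <- subset(x, select = val1:val2)
print(by_range)
```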
c(1:5) + c(10:15)
[1] 11 13 15 17 19 16
c(1:5) / c(10:15)
[1] 0.10000000 0.18181818 0.25000000 0.30769231 0.35714286 0.06666667
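Both results come from recycling: the shorter vector is repeated until it matches the longer one, and R warns when the longer length is not a multiple of the shorter. A sketch that spells the recycling out with rep_len() (variable names are illustrative):

```r
short <- 1:5
long  <- 10:15

# Implicit recycling; R warns because 6 is not a multiple of 5
summed <- suppressWarnings(short + long)

# The same computation with the recycling made explicit
explicit <- rep_len(short, length(long)) + long

print(summed)
print(identical(summed, explicit))
```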
library(stringr)  # str_c(), str_split(); also re-exports %>%
library(purrr)    # map_chr()

MergeDedup <- function(x, collapse='; ', sep=' *[,;\\*] *') {
  # Merge strings separated by `sep`,
  # removing duplicate and empty items
  str_c(x, collapse = collapse) %>%
    str_split(sep) %>%
    map_chr(function(x) {
      # drop empty strings to keep the final output clean
      x <- unique(x) %>% .[. != '']
      if (length(x) > 0) {
        str_c(x, collapse = collapse)
      } else {
        # typed NA so that map_chr() always gets a character value
        NA_character_
      }
    })
}
You don't need quotes around column names in dplyr functions.
# without pipe
filter(flights, dep_delay > 0)
# with pipe
flights %>%
filter(dep_delay > 0) %>%
group_by(carrier) %>%
summarize(avg.delay = mean(dep_delay)) %>%
arrange(desc(avg.delay))
dep_delay > 0 is evaluated lazily, only when filter() actually needs the result.
At that point, dep_delay is looked up as a column of flights.
The closest base R counterpart:
flights[flights$dep_delay > 0, ]
What really happened:
x <- flights$dep_delay
x > 0 gives a logical vector
Use the logical vector to make a subset:
flights[c(TRUE,FALSE,...,TRUE), ]
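The unquoted-column behaviour can be imitated in base R with substitute() and eval(). The sketch below is not how dplyr is implemented; `my_filter` and `flights_toy` are hypothetical names used only to illustrate the idea of capturing an expression and evaluating it inside a data frame:

```r
# Hypothetical helper: capture `cond` unevaluated, then evaluate it
# with the data frame's columns visible as variables
my_filter <- function(df, cond) {
  expr <- substitute(cond)
  keep <- eval(expr, envir = df, enclos = parent.frame())
  # Treat NA as "do not keep"
  df[!is.na(keep) & keep, , drop = FALSE]
}

# Toy stand-in for the flights data
flights_toy <- data.frame(carrier = c("AA", "UA", "AA"),
                          dep_delay = c(5, -2, NA),
                          stringsAsFactors = FALSE)

delayed <- my_filter(flights_toy, dep_delay > 0)
print(delayed)
```

Rows where the condition is NA are excluded here, unlike plain logical subsetting with `[`, which would return them as NA rows.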
Column selection needs quotes
x <- data.frame(name=c('a','b', 'a'), val1=1:3, val2=5:7)
x[, c('val1', 'val2')]
val1 val2
1 1 5
2 2 6
3 3 7
Conditionally select columns
x[, startsWith(colnames(x), 'val')]
val1 val2
1 1 5
2 2 6
3 3 7
Good practices:
Debug process:
Break in code 
print(str(...)) intermediate results
flights %>%
filter(dep_delay > 0) %>%
group_by(carrier) %>%
summarize(avg.delay = mean(dep_delay)) %>%
arrange(desc(avg.delay))
# --- V.S. -----
flights.delayed <- flights %>% filter(dep_delay > 0)
str(flights.delayed)
summarized <- flights.delayed %>% group_by(carrier) %>% summarize(avg.delay = mean(dep_delay))
summarized %>% arrange(desc(avg.delay))
Create projects
Fold your code
“# ======” or “# -----” both OK
<Tab> to trigger autocomplete.