… that is, Data Science Hello World.
Sure, first we ALWAYS do some data exploration.
data(longley)
head(longley)
GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
1950 89.5 284.599 335.1 165.0 110.929 1950 61.187
1951 96.2 328.975 209.9 309.9 112.075 1951 63.221
1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
ggpairs(longley)
Linear models.
fit <- lm(Employed ~ GNP, longley)
Now what can we do with this thing returned by lm?
# equivalently: print(fit)
fit
Call:
lm(formula = Employed ~ GNP, data = longley)
Coefficients:
(Intercept) GNP
51.84359 0.03475
summary(fit)
Call:
lm(formula = Employed ~ GNP, data = longley)
Residuals:
Min 1Q Median 3Q Max
-0.77958 -0.55440 -0.00944 0.34361 1.44594
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.843590 0.681372 76.09 < 2e-16 ***
GNP 0.034752 0.001706 20.37 8.36e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6566 on 14 degrees of freedom
Multiple R-squared: 0.9674, Adjusted R-squared: 0.965
F-statistic: 415.1 on 1 and 14 DF, p-value: 8.363e-12
predict(fit, newdata = data.frame(GNP = c(222, 223)))
1 2
59.55860 59.59335
par(mfrow=c(2,2))
plot(fit)
dev.off()
null device
1
Like extracting the Akaike Information Criterion …
extractAIC(fit)
[1] 2.00000 -11.59718
Or getting confidence intervals for coefficients.
confint(fit)
2.5 % 97.5 %
(Intercept) 50.38219297 53.30498660
GNP 0.03109391 0.03841068
df <- data.frame(x = 1:8, y = cumsum(rnorm(8)))
df
x y
1 1 -0.4763357
2 2 1.1415539
3 3 1.1076525
4 4 2.8341954
5 5 2.6276654
6 6 3.5959169
7 7 3.3822216
8 8 3.0054988
summary(df)
x y
Min. :1.00 Min. :-0.4763
1st Qu.:2.75 1st Qu.: 1.1331
Median :4.50 Median : 2.7309
Mean :4.50 Mean : 2.1523
3rd Qu.:6.25 3rd Qu.: 3.0997
Max. :8.00 Max. : 3.5959
plot(df)
ts <- ts(cumsum(round(rnorm(120), 2)), start = c(2004,12), frequency = 12)
ts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
2004
2005 -2.50 -0.50 -1.81 -1.14 0.14 0.95 0.07 1.94 2.63 1.07 0.38
2006 -1.55 0.34 -0.36 2.03 2.79 1.58 2.01 2.20 2.70 2.71 2.80
2007 2.47 4.13 5.32 5.06 4.87 3.93 4.74 5.59 6.49 8.15 9.59
2008 6.15 5.49 7.00 6.62 4.35 5.71 6.75 6.49 7.38 6.88 4.76
2009 0.89 1.25 1.64 2.10 3.10 3.04 2.91 4.15 3.30 3.72 4.36
2010 4.79 5.47 4.68 4.24 5.27 2.57 2.17 3.10 2.23 2.05 3.55
2011 1.00 0.35 1.07 1.21 0.89 0.43 -0.77 -1.19 -2.93 -1.90 -0.94
2012 -0.22 0.68 -0.28 -2.17 -1.63 -2.02 -1.58 -2.72 -5.15 -5.32 -6.36
2013 -7.40 -7.26 -6.01 -3.93 -4.39 -3.51 -3.12 -2.89 -1.18 -2.22 -1.74
2014 -0.38 0.18 0.95 -1.95 -1.88 1.67 2.35 3.33 2.96 3.01 2.83
Dec
2004 -0.17
2005 -1.75
2006 2.68
2007 7.23
2008 2.90
2009 4.46
2010 2.24
2011 -1.40
2012 -6.62
2013 -1.28
2014
summary(ts)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-7.400 -1.182 1.805 1.376 3.772 9.590
plot(ts)
m <- matrix(1:10, nrow = 2)
summary(m)
V1 V2 V3 V4
Min. :1.00 Min. :3.00 Min. :5.00 Min. :7.00
1st Qu.:1.25 1st Qu.:3.25 1st Qu.:5.25 1st Qu.:7.25
Median :1.50 Median :3.50 Median :5.50 Median :7.50
Mean :1.50 Mean :3.50 Mean :5.50 Mean :7.50
3rd Qu.:1.75 3rd Qu.:3.75 3rd Qu.:5.75 3rd Qu.:7.75
Max. :2.00 Max. :4.00 Max. :6.00 Max. :8.00
V5
Min. : 9.00
1st Qu.: 9.25
Median : 9.50
Mean : 9.50
3rd Qu.: 9.75
Max. :10.00
plot(function(x) x^2, from = -2, to = 2)
print
function (x, ...)
UseMethod("print")
<bytecode: 0x5618623538c0>
<environment: namespace:base>
UseMethod
function (generic, object) .Primitive("UseMethod")
# a bike constructor
bike <- function(type, color) {structure (list(type = type, color = color), class = 'bike')}
# create an instance of class bike
mybike <- bike('cyclocross', 'green')
class(mybike)
[1] "bike"
# still prints like a list
mybike
$type
[1] "cyclocross"
$color
[1] "green"
attr(,"class")
[1] "bike"
# methods are called <funcname>.<classname>
print.bike <- function(b) {print(paste0('a ', b$color, ' ', b$type, ' bike'))}
mybike
[1] "a green cyclocross bike"
… or just … don't.
# what if we just change the class
class(mybike) <- 'lm'
# and then print it
mybike
Call:
NULL
No coefficients
# let's undo this ASAP ;-)
class(mybike) <- 'bike'
# and nothing's broken
mybike
[1] "a green cyclocross bike"
ebike <- function(type, color) {
parent <- bike(type, color)
structure (c(unclass(parent), motor = TRUE), class = c('ebike', class(parent)))}
theotherpersonsbike <-ebike('mountain', 'red')
class(theotherpersonsbike)
[1] "ebike" "bike"
theotherpersonsbike
[1] "a red mountain bike"
print.ebike <- function(b) {
ptext <- NextMethod()
print('... with a motor!')
}
theotherpersonsbike
[1] "a red mountain bike"
[1] "... with a motor!"
# create a generic function that calls UseMethod to do the dispatching
speed_up <- function(object, ...) UseMethod("speed_up")
# create an implementation for our bike class
speed_up.bike <- function(object, target_speed) {
accelerate_until_at(target_speed)
}
speed_up(mybike, target_speed = 33)
[1] "Now accelerating to 33 km/h"
# also of course create an implementation for the e-bike
speed_up.ebike <- function(object, target_speed) {
adjusted_speed <- ifelse(target_speed <= 25, target_speed, 25) # ... you've seen that coming
accelerate_until_at(adjusted_speed)
}
speed_up(theotherpersonsbike, target_speed = 33)
[1] "Now accelerating to 25 km/h"
The default method for a function will be used.
Remember confint() from above?
# these implementations exist for confint:
methods('confint')
[1] confint.default confint.fracdiff* confint.glm* confint.lm
[5] confint.multinom* confint.nls*
see '?methods' for accessing help and source code
data("lynx")
fit <- auto.arima(lynx)
# same as explicitly calling confint.default
confint(fit)
2.5 % 97.5 %
ar1 1.1491419 1.53501010
ar2 -0.8307363 -0.51688060
ma1 -0.4498853 0.04440014
ma2 -0.4713138 -0.04147972
intercept 1285.8372685 1802.97062435
S4:
Reference classes (RC):
functional programming!
Magic Three in Haskell:
\( map \): map a function over a list of elements
λ: map (+1) [1..10]
[2,3,4,5,6,7,8,9,10,11]
\( filter \): filter a list of elements according to some predicate
λ: filter even [1..10]
[2,4,6,8,10]
\( fold \): combine values recursively (a.k.a. \( reduce \) (Clojure, Java, Python…), \( apply \) (Scheme, …))
λ: foldl (+) 0 [1..10]
55
apply, lapply, sapply, vapply, mapply, tapply … ough!
Basic question: What data structure(s) am I working with?
Use with more-than-one-dimensional data structures: data.frame, matrix, array
m <- matrix(1:12, nrow = 3, ncol = 4)
m
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
# apply mean to the columns
apply(m, 2, mean)
[1] 2 5 8 11
# apply mean to the rows
apply(m, 1, mean)
[1] 5.5 6.5 7.5
Use with one-dimensional stuff (list, vector)
mychars <- c("a", "b"); str(lapply(mychars, toupper))
List of 2
$ : chr "A"
$ : chr "B"
mychars <- c("a", "b"); str(sapply(mychars, toupper))
Named chr [1:2] "A" "B"
- attr(*, "names")= chr [1:2] "a" "b"
mychars <- c("a", "b"); str(vapply(mychars, utf8ToInt, integer(1)))
Named int [1:2] 97 98
- attr(*, "names")= chr [1:2] "a" "b"
# a list of 3
l1 <-list(
col1 = "a",
col2 = "b",
col3 = c("c", "d")
)
str(l1)
List of 3
$ col1: chr "a"
$ col2: chr "b"
$ col3: chr [1:2] "c" "d"
# a list of 2
l2 <- l1[1:2]
str(l2)
List of 2
$ col1: chr "a"
$ col2: chr "b"
# upper case everything
u1 <- sapply(l1, toupper)
# result is still a list!
str(u1)
List of 3
$ col1: chr "A"
$ col2: chr "B"
$ col3: chr [1:2] "C" "D"
u2 <- sapply(l2, toupper)
# result is a vector!
str(u2)
Named chr [1:2] "A" "B"
- attr(*, "names")= chr [1:2] "col1" "col2"
Yes, we have them in R, too:
Map
# same as lapply(1:10, function(x) x+1), but see order of args!
m <- Map(function(x) x+1, 1:10)
Filter
# same as lapply(1:10, function(x) x+1), but see order of args!
Filter(function(x) x %% 2 == 0, 1:10)
[1] 2 4 6 8 10
Reduce
# same as lapply(1:10, function(x) x+1), but see order of args!
Reduce(`+`, 1:10)
[1] 55
m <- map_dbl(1:10, function(x) x+1)
# but
m2 <- map_chr(letters[1:3], toupper)
So … what is purrr?
Too verbose?
m <- map_dbl(1:10, function(x) x+1)
How about partial application:
m <- map_dbl(1:10, partial(`+`,1))
Or, how about function composition?
inttolower <- compose(tolower, intToUtf8)
inttolower(65:68)
[1] "abcd"
We've seen objects, we've seen functions.
But what's R basically made of?
Just wanna use R? Use class():
myfunc <- function(x) x + 1
tests1 <- c(`<-`, `if`, `[`, length, c, sum, nrow, eval, myfunc)
sapply(tests1, class)
[1] "function" "function" "function" "function" "function" "function"
[7] "function" "function" "function"
For the user, these are all functions. Even though they do such different things as
tests1 <- c(`<-`, `if`, `[`, length, c, sum, nrow, eval, myfunc)
sapply(tests1, typeof)
[1] "special" "special" "special" "builtin" "builtin" "builtin" "closure"
[8] "closure" "closure"
So we have three different corresponding object types:
Every user-defined function is a closure.
With closures, we can conveniently print the source code on the console:
nrow
function (x)
dim(x)[1L]
<bytecode: 0x558eaeba6f60>
<environment: namespace:base>
Closures have formals, a body, and an associated environment.
c(formals(myfunc), body(myfunc), environment(myfunc))
$x
[[2]]
x + 1
[[3]]
<environment: R_GlobalEnv>
(Remember, this was a closure, too.)
body(eval)
.Internal(eval(expr, envir, enclos))
Oops!
For .Internal and .Primitive functions (the “builtins” above),
$R_source/src/main/names.c
contains the mapping to the corresponding C function:
Notice something?
Not just the CARs, CADRs, CADDRs…
… S-expressions…
… the whole idea of environments and closures in current R is modeled after Lisp.
(No time now and here, but there's always SICP to read up on environments etc.)
Specials get their arguments passed in quoted and decide themselves when to evaluate what.
What do you think will happen here?
# no confint.bike defined for bikes -> confint.default will get called
# but confint.default needs some other methods that do not exist
if(1 > 0) 1 else confint(mybike)
[1] 1
Just so you believe me an error would get generated if confint(mybike) were called.
confint(mybike)
Error in UseMethod("vcov"): no applicable method for 'vcov' applied to an object of class "bike"
Here we have a user-defined function, f. What will happen?
f <- function(exp1, exp2) {
exp1
}
f(confint(mybike), f)
Error in UseMethod("vcov"): no applicable method for 'vcov' applied to an object of class "bike"
f(123, confint(mybike))
[1] 123
Closures evaluate their arguments lazily (by need).
As users, we can create promises, too:
# normal assignment - this can't work
x <- confint(mybike)
Error in UseMethod("vcov"): no applicable method for 'vcov' applied to an object of class "bike"
# promise to evaluate when needed
# this works without error
delayedAssign('x', confint(mybike))
# here it gets evaluated
x
Error in UseMethod("vcov"): no applicable method for 'vcov' applied to an object of class "bike"
… this could go on for quite some time … but ;-)
Thanks a lot for your attention!
:-)