Last updated: 2020/10/06

First step

install.packages("ISwR")
library(ISwR)
plot(rnorm(1000))

A calculator (in place of a pocket calculator)

2+2
## [1] 4
rnorm(15)
##  [1] -1.47609735  2.34831500 -0.60946411  1.88423948  2.06653744 -2.39010691
##  [7]  0.01326897  0.45297068 -0.21167449 -1.79313276  0.50931746 -0.20721466
## [13]  0.71014616 -0.42244100 -0.15358791
  • \(\exp(-2)=e^{-2}\)
exp(-2)
## [1] 0.1353353

Assignments

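  • Use the symbol <- (a less-than sign followed by a hyphen, with no space between them).
  • It means assignment: the value on the right is stored under the name on the left.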
x <- 2
x
## [1] 2
x + x
## [1] 4
  • An example of an object name: height.1yr
  • R is case-sensitive: Wt and wt are different objects
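A minimal sketch of the two naming points above. The values are invented for illustration; only the names height.1yr, Wt and wt come from the text.

```r
# Hypothetical values, chosen only to illustrate naming
height.1yr <- 76.2   # a period is allowed inside an object name
Wt <- 10
wt <- 20

Wt == wt             # FALSE: Wt and wt are two distinct objects
Wt + wt              # 30: both objects exist side by side
```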

Vectorized arithmetic (vector operations)

weight <- c(60, 72, 57, 90, 95, 72)
weight
## [1] 60 72 57 90 95 72
体重  <- c(60, 72, 57, 90, 95, 72)
体重
## [1] 60 72 57 90 95 72
height <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
height
## [1] 1.65 1.80 1.65 1.90 1.74 1.91

Sorting

h <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
h
## [1] 1.65 1.80 1.65 1.90 1.74 1.91
h[order(h)]
## [1] 1.65 1.65 1.74 1.80 1.90 1.91
order(h)
## [1] 1 3 5 2 4 6

BMI calculation

height <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
weight <- c(60, 72, 57, 90, 95, 72)
bmi <- weight / height ^ 2
bmi
## [1] 22.03857 22.22222 20.93664 24.93075 31.37799 19.73630
  • Sum of the weights
sum(weight)
## [1] 446
  • Number of observations n
length(weight)
## [1] 6
  • Mean weight
sum(weight) / length(weight)
## [1] 74.33333

Computing the standard deviation of the weights

  • Display the contents of the weight vector
weight
## [1] 60 72 57 90 95 72
  • Mean: \(\bar X = \frac{\sum^n_{i=1} X_i}{n}\)
xbar <- sum(weight) / length(weight)
xbar
## [1] 74.33333

  • Deviations from the mean
  • \(X_i - \bar X\)
  • Notice that xbar, a single number, is recycled against every element of weight
weight - xbar
## [1] -14.333333  -2.333333 -17.333333  15.666667  20.666667  -2.333333
  • Squared deviations
  • \((X_i - \bar X)^2\)
(weight - xbar)^2
## [1] 205.444444   5.444444 300.444444 245.444444 427.111111   5.444444

  • Sum of squared deviations
  • \(\sum^n_{i=1} (X_i - \bar X)^2\)
sum((weight - xbar)^2)
## [1] 1189.333
  • Standard deviation
  • Square root of the variance: sqrt()
sqrt(sum((weight - xbar) ^ 2) / (length(weight) - 1))
## [1] 15.42293
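
Written as a single formula, the steps above amount to the usual sample standard deviation, with divisor n - 1 rather than n. The numbers substituted are the ones printed above for weight (n = 6).

```latex
s = \sqrt{\frac{\sum^n_{i=1} (X_i - \bar X)^2}{n - 1}}
  = \sqrt{\frac{1189.333}{6 - 1}}
  \approx 15.42
```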

Using built-in functions

  • Mean: mean()
  • Standard deviation: sd()
mean(weight)
## [1] 74.33333
sd(weight)
## [1] 15.42293

Standard procedures (for example, the t.test() function for the t test)

  • The rule of thumb is that the BMI for a normal-weight individual should be between 20 and 25. We test whether the six BMI values can be assumed to have mean 22.5, the midpoint of that range.
bmi
## [1] 22.03857 22.22222 20.93664 24.93075 31.37799 19.73630

  • Test whether the mean BMI equals 22.5 (one-sample t test)
t.test(bmi, mu=22.5)
## 
##  One Sample t-test
## 
## data:  bmi
## t = 0.60539, df = 5, p-value = 0.5713
## alternative hypothesis: true mean is not equal to 22.5
## 95 percent confidence interval:
##  19.12268 27.95814
## sample estimates:
## mean of x 
##  23.54041
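
To see where the numbers in this output come from, the following sketch recomputes the t statistic, the two-sided p-value and the 95% confidence interval by hand. The variable names se, t.stat, p.value and ci are introduced here for illustration only; pt() and qt() are the t distribution functions in base R.

```r
# Same data as in the text
height <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
weight <- c(60, 72, 57, 90, 95, 72)
bmi <- weight / height^2

n <- length(bmi)
se <- sd(bmi) / sqrt(n)                        # standard error of the mean
t.stat <- (mean(bmi) - 22.5) / se              # one-sample t statistic
p.value <- 2 * pt(-abs(t.stat), df = n - 1)    # two-sided p-value
ci <- mean(bmi) + c(-1, 1) * qt(0.975, df = n - 1) * se  # 95% confidence interval

t.stat    # agrees with the t value reported by t.test(bmi, mu = 22.5)
p.value
ci
```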

Graphics

  • We want to investigate the relation between weight and height
  • Use the plot() function
plot(height, weight)

  • Let’s change the symbol:

plot(height, weight, pch=2)

plot(height, weight, pch=3)

R language essentials

# install.packages("ISwR")
library(ISwR)
height <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
height
## [1] 1.65 1.80 1.65 1.90 1.74 1.91
weight <- c(60, 72, 57, 90, 95, 72)
weight
## [1] 60 72 57 90 95 72
bmi <- weight / height ^ 2
bmi
## [1] 22.03857 22.22222 20.93664 24.93075 31.37799 19.73630

Listing objects: ls()

ls()
## [1] "bmi"    "h"      "height" "weight" "x"      "xbar"   "体重"

Vectors (numeric)

  • We have already seen numeric vectors.
  • There are two further types: character vectors and logical vectors.

Vectors (character)

  • Use quote marks, either double (") or single (')
c("Huey","Dewey","Louie")
## [1] "Huey"  "Dewey" "Louie"
c('Huey','Dewey','Louie')
## [1] "Huey"  "Dewey" "Louie"

Vectors (logical)

c(T,T,F,T)
## [1]  TRUE  TRUE FALSE  TRUE
bmi > 25
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE

Missing values

  • In R, NA means missing value.
c(42,57,12,NA,1,3,4) 
## [1] 42 57 12 NA  1  3  4
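
A short sketch of how NA behaves in computations, using the vector from the text stored under the hypothetical name obs. It relies only on base R functions (is.na, mean, sum).

```r
obs <- c(42, 57, 12, NA, 1, 3, 4)   # 'obs' is a name chosen for this example

is.na(obs)                 # TRUE only at the fourth position
mean(obs)                  # NA: a missing value propagates through arithmetic
mean(obs, na.rm = TRUE)    # mean of the six observed values
sum(is.na(obs))            # number of missing values: 1
obs == NA                  # all NA; use is.na() to test for missingness
```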

Functions that create vectors

c()

c(42,57,12,39,1,3,4) 
## [1] 42 57 12 39  1  3  4
x <- c(1, 2, 3)
y <- c(10, 20)
c(x, y, 5)
## [1]  1  2  3 10 20  5

It is also possible to assign names to the elements.

x <- c(red = "Huey", blue = "Dewey", green = "Louie")
x
##     red    blue   green 
##  "Huey" "Dewey" "Louie"
names(x)
## [1] "red"   "blue"  "green"

seq( )

  • The second function, seq (“sequence”), is used for equidistant series of numbers.
seq(4, 9)
## [1] 4 5 6 7 8 9
  • If you want a sequence in jumps of 2, write
seq(4, 10, 2)
## [1]  4  6  8 10

:

4:9
## [1] 4 5 6 7 8 9

The above is exactly the same as seq(4,9), only easier to read.

seq(4, 9)
## [1] 4 5 6 7 8 9

rep( )

The third function, rep (“replicate”), is used to generate repeated values.

oops <- c(7, 9, 13)
rep(oops, 3)
## [1]  7  9 13  7  9 13  7  9 13
rep(1:2, c(10, 15))
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
  • The following two calls give the same result:
rep(1:2, each = 10)
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
rep(1:2, c(10, 10))
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

Matrices and arrays

A matrix in mathematics is just a two-dimensional array of numbers.

x <- 1:12
dim(x) <- c(3, 4)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
  • Check the difference that the byrow=T option makes
  • Notice how the byrow=T switch causes the matrix to be filled in a rowwise fashion rather than columnwise.
matrix(1:12, nrow = 3, byrow = T)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
matrix(1:12, nrow = 3, byrow = F)
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
x <- matrix(1:12, nrow = 3, byrow = T)
rownames(x) <- LETTERS[1:3]
x
##   [,1] [,2] [,3] [,4]
## A    1    2    3    4
## B    5    6    7    8
## C    9   10   11   12
  • You can “glue” vectors together, columnwise or rowwise, using the cbind and rbind functions.
cbind(A = 1:4, B = 5:8, C = 9:12)
##      A B  C
## [1,] 1 5  9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
rbind(A = 1:4, B = 5:8, C = 9:12)
##   [,1] [,2] [,3] [,4]
## A    1    2    3    4
## B    5    6    7    8
## C    9   10   11   12

Factors (categorical variables)

pain <- c(0, 3, 2, 2, 1)
fpain <- factor(pain, levels = 0:3)
levels(fpain) <- c("none", "mild", "medium", "severe")
fpain
## [1] none   severe medium medium mild  
## Levels: none mild medium severe
as.numeric(fpain)
## [1] 1 4 3 3 2
levels(fpain)
## [1] "none"   "mild"   "medium" "severe"
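
Note that as.numeric() on a factor returns the internal level codes (1 to 4 here), not the original pain scores (0 to 3). The sketch below repeats the example from the text to make the difference explicit.

```r
pain <- c(0, 3, 2, 2, 1)
fpain <- factor(pain, levels = 0:3)
levels(fpain) <- c("none", "mild", "medium", "severe")

as.numeric(fpain)        # internal codes 1 4 3 3 2, not the original 0 3 2 2 1
as.numeric(fpain) - 1    # recovers 0 3 2 2 1, but only because the levels were 0:3
as.character(fpain)      # the labels as a character vector
table(fpain)             # number of cases per level
```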

Lists

  • It is sometimes useful to combine a collection of objects into a larger composite object.
  • This can be done using list().
intake.pre <-c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
intake.post <-c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335)
mylist <- list(before=intake.pre,after=intake.post)
mylist
## $before
##  [1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
## 
## $after
##  [1] 3910 4220 3885 5160 5645 4680 5265 5975 6790 6900 7335
mylist$before
##  [1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
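
Besides the $ notation, list components can be extracted with double brackets, by name or by position. The following sketch uses the same mylist as above and only base R indexing.

```r
intake.pre  <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
intake.post <- c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335)
mylist <- list(before = intake.pre, after = intake.post)

mylist[["after"]]    # same vector as mylist$after
mylist[[1]]          # components can also be extracted by position
mylist["before"]     # single brackets return a list of length one, not a vector
names(mylist)        # "before" "after"
```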

Data frames

  • important!
  • A data frame corresponds to what other statistical packages call a “data matrix” or a “data set”.
d <- data.frame(intake.pre, intake.post)
d
##    intake.pre intake.post
## 1        5260        3910
## 2        5470        4220
## 3        5640        3885
## 4        6180        5160
## 5        6390        5645
## 6        6515        4680
## 7        6805        5265
## 8        7515        5975
## 9        7515        6790
## 10       8230        6900
## 11       8770        7335
  • As with lists, components (i.e., individual variables) can be accessed using the $ notation:
d$intake.pre
##  [1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770

Indexing

  • Often, you need a particular element of a vector
intake.pre[5]
## [1] 6390
intake.pre[c(3, 5, 7)]
## [1] 5640 6390 6805
v <- c(3, 5, 7)
intake.pre[v]
## [1] 5640 6390 6805
intake.pre[1:5]
## [1] 5260 5470 5640 6180 6390
intake.pre[-c(3, 5, 7)]
## [1] 5260 5470 6180 6515 7515 7515 8230 8770

Conditional selection

  • We often need to extract data that satisfy certain criteria, such as data from the males, the prepubertal, or those with chronic diseases.
intake.post[intake.pre > 7000]
## [1] 5975 6790 6900 7335
  • This yields the postmenstrual energy intake for the four women who had an energy intake above 7000 kJ premenstrually.

Logical operators

  • The comparison operators are < (less than), > (greater than), == (equal to), <= (less than or equal to), >= (greater than or equal to), and != (not equal to).

  • To combine several expressions, you can use the logical operators & (“and”), | (“or”), and ! (“not”).
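
As a small illustration of | and !, here is a sketch using the intake.pre vector from the text. The cutoffs 5500 and 8000 and the names low.or.high and not.high are arbitrary choices for this example.

```r
intake.pre <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)

low.or.high <- intake.pre < 5500 | intake.pre > 8000   # "or"
not.high    <- !(intake.pre > 8000)                    # "not"

intake.pre[low.or.high]   # 5260 5470 8230 8770
sum(not.high)             # TRUE counts as 1, so this counts the cases: 9
```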

Conditional selection : example

  • For instance, we want to find ‘the postmenstrual intake for women with a premenstrual intake between 7000 and 8000 kJ’
intake.post[intake.pre > 7000 & intake.pre <= 8000]
## [1] 5975 6790
intake.pre > 7000 & intake.pre <= 8000
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

Indexing

d <- data.frame(intake.pre, intake.post)

# value in row 5, column 1
d[5, 1]
## [1] 6390
# all values in row 5
d[5, ]
##   intake.pre intake.post
## 5       6390        5645

Indexing

  • Use $ to select a column
d$intake.pre
##  [1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770

Indexing

d[d$intake.pre > 7000, ]
##    intake.pre intake.post
## 8        7515        5975
## 9        7515        6790
## 10       8230        6900
## 11       8770        7335

Indexing

sel <- d$intake.pre > 7000
sel
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
d[sel, ]
##    intake.pre intake.post
## 8        7515        5975
## 9        7515        6790
## 10       8230        6900
## 11       8770        7335
  • It is often convenient to look at the first few cases in a data set.
  • This can be done with indexing, like this:
d[1:2, ]
##   intake.pre intake.post
## 1       5260        3910
## 2       5470        4220

Showing the first six rows

  • Usually we use head(), which shows the first six rows by default
head(d)
##   intake.pre intake.post
## 1       5260        3910
## 2       5470        4220
## 3       5640        3885
## 4       6180        5160
## 5       6390        5645
## 6       6515        4680

Grouped data and data frames

energy
##    expend stature
## 1    9.21   obese
## 2    7.53    lean
## 3    7.48    lean
## 4    8.08    lean
## 5    8.09    lean
## 6   10.15    lean
## 7    8.40    lean
## 8   10.88    lean
## 9    6.13    lean
## 10   7.90    lean
## 11  11.51   obese
## 12  12.79   obese
## 13   7.05    lean
## 14  11.85   obese
## 15   9.97   obese
## 16   7.48    lean
## 17   8.79   obese
## 18   9.69   obese
## 19   9.68   obese
## 20   7.58    lean
## 21   9.19   obese
## 22   8.11    lean
  • Sometimes it is desirable to have data in a separate vector for each group.
exp.lean <- energy$expend[energy$stature == "lean"]
exp.obese <- energy$expend[energy$stature == "obese"]
  • Alternatively, you can use the split function, which generates a list of vectors according to a grouping.
l <- split(energy$expend, energy$stature)
l
## $lean
##  [1]  7.53  7.48  8.08  8.09 10.15  8.40 10.88  6.13  7.90  7.05  7.48  7.58
## [13]  8.11
## 
## $obese
## [1]  9.21 11.51 12.79 11.85  9.97  8.79  9.69  9.68  9.19

Implicit loops

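  • A common use of loops is to apply a function to each element of a set of values or vectors and collect the results in a single structure.
  • In R this is abstracted by the functions lapply and sapply.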
  • The former always returns a list (hence the ‘l’), whereas the latter tries to simplify (hence the ‘s’) the result to a vector or a matrix if possible.
  • na.rm=T used to request that missing values be removed
lapply(thuesen, mean, na.rm = T)
## $blood.glucose
## [1] 10.3
## 
## $short.velocity
## [1] 1.325652
sapply(thuesen, mean, na.rm = T)
##  blood.glucose short.velocity 
##      10.300000       1.325652
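
The difference in return type can be checked directly. The sketch below uses a small hypothetical data frame (dat, invented for illustration, not the thuesen data) so that it runs without the ISwR package.

```r
# Hypothetical data with one missing value
dat <- data.frame(a = c(1, 2, NA, 4), b = c(10, 20, 30, 40))

res.list <- lapply(dat, mean, na.rm = TRUE)  # a list with components a and b
res.vec  <- sapply(dat, mean, na.rm = TRUE)  # a named numeric vector

is.list(res.list)                  # TRUE
is.list(res.vec)                   # FALSE: sapply simplified the result
sapply(dat, mean)                  # without na.rm, column a gives NA
sapply(dat, range, na.rm = TRUE)   # length-2 results are simplified to a matrix
```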
  • Sometimes you just want to repeat something a number of times but still collect the results as a vector.
  • Obviously, this makes sense only when the repeated computations actually give different results, the common case being simulation studies.
  • This can be done using sapply, but there is a simplified version called replicate, in which you just have to give a count and the expression to evaluate:
replicate(10, mean(rexp(20)))
##  [1] 1.1543345 0.6253008 1.0642741 0.8529334 1.1719649 0.9398824 0.7693953
##  [8] 0.8281102 1.5938795 1.1165344
m <- matrix(rnorm(12), 4)
m
##            [,1]        [,2]       [,3]
## [1,]  1.6376249 -0.73975702 -0.4987458
## [2,] -0.3125668  0.17665811 -0.3867262
## [3,]  0.3139244  1.59216373 -0.4166742
## [4,] -0.7316883 -0.01078286  0.1755406
m <- matrix(rnorm(12), nrow = 4)
m
##            [,1]        [,2]       [,3]
## [1,] -0.4184491  0.69345588 -0.4237782
## [2,] -0.3193066  2.98385765 -0.6240393
## [3,]  0.6595142 -0.03836443  0.6065398
## [4,] -0.9591165  0.26312178  0.1139435
  • A similar function, apply, allows you to apply a function to the rows or columns of a matrix

  • The second argument is the index (or vector of indices) that defines what the function is applied to; in this case we get the columnwise minima.

apply(m, 2, min)
## [1] -0.95911651 -0.03836443 -0.62403929
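
For comparison, the sketch below applies the same idea with index 1 (rows) as well as 2 (columns). It uses a fixed matrix instead of rnorm() so that the results are reproducible; the name m2 is chosen here to avoid overwriting m.

```r
m2 <- matrix(1:12, nrow = 4)   # columns are 1:4, 5:8 and 9:12

apply(m2, 1, min)     # row-wise minima: 1 2 3 4
apply(m2, 2, min)     # column-wise minima: 1 5 9
apply(m2, 2, range)   # a function returning two values gives a 2 x 3 matrix
```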
  • Also, the function tapply allows you to create tables (hence the ‘t’) of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.
  • In the latter case a cross-classified table is generated.
tapply(energy$expend, energy$stature, median)
##  lean obese 
##  7.90  9.69
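
The cross-classified case can be sketched as follows. The data frame dat and its sex column are invented for illustration (the energy data set has no such variable); passing a list of two factors as the second argument yields a two-way table.

```r
# Hypothetical data, invented for illustration
dat <- data.frame(
  expend  = c(7.5, 8.1, 9.2, 11.5, 7.9, 8.4, 9.7, 12.8),
  stature = factor(c("lean", "lean", "obese", "obese",
                     "lean", "lean", "obese", "obese")),
  sex     = factor(c("F", "F", "F", "F", "M", "M", "M", "M"))
)

tapply(dat$expend, dat$stature, median)                      # one factor: a named vector
tab <- tapply(dat$expend, list(dat$stature, dat$sex), mean)  # two factors: a 2 x 2 table
tab
```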

Sorting

  • Just use the sort function.
intake$post
##  [1] 3910 4220 3885 5160 5645 4680 5265 5975 6790 6900 7335
sort(intake$post)
##  [1] 3885 3910 4220 4680 5160 5265 5645 5975 6790 6900 7335
  • However, sorting a single vector is not always what is required.
  • Often you need to sort a series of variables according to the values of some other variables — blood pressures sorted by sex and age, for instance.
order(intake$post)
##  [1]  3  1  2  6  4  7  5  8  9 10 11
  • The result is the numbers 1 to 11 (or whatever the length of the vector is), sorted according to the size of the argument to order (here intake$post).
  • Interpreting the result of order is a bit tricky. It should be read as follows: you sort intake$post by placing its values in the order no. 3, no. 1, no. 2, no. 6, etc.
o <- order(intake$post)
intake$post[o]
##  [1] 3885 3910 4220 4680 5160 5265 5645 5975 6790 6900 7335
intake$pre[o]
##  [1] 5640 5260 5470 6515 6180 6805 6390 7515 7515 8230 8770
  • It is of course also possible to sort the entire data frame intake
intake
##     pre post
## 1  5260 3910
## 2  5470 4220
## 3  5640 3885
## 4  6180 5160
## 5  6390 5645
## 6  6515 4680
## 7  6805 5265
## 8  7515 5975
## 9  7515 6790
## 10 8230 6900
## 11 8770 7335
intake.sorted <- intake[o,]
  • Sorting by several criteria is done simply by having several arguments to order;

  • for instance, order(sex,age) will give a main division into men and women, and within each sex an ordering by age.

  • The second variable is used when the order cannot be decided from the first variable.

  • Sorting in reverse order can be handled by, for example, changing the sign of the variable.
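
A minimal sketch of sorting by two criteria and of reversing the order within groups. The vectors sex, age and bp are hypothetical values invented for this example.

```r
# Hypothetical data, invented for illustration
sex <- c("M", "F", "M", "F", "F")
age <- c(30, 25, 22, 41, 25)
bp  <- c(120, 110, 118, 135, 112)

o <- order(sex, age)           # women first (alphabetical), then by age within sex
data.frame(sex, age, bp)[o, ]  # rows rearranged as 2, 5, 4, 3, 1

o.rev <- order(sex, -age)      # within each sex, oldest first
age[o.rev]                     # 41 25 25 30 22

order(age, decreasing = TRUE)  # an alternative to changing the sign
```

Ties (the two women aged 25) keep their original relative order, so row 2 comes before row 5.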

Exercises

  • Q1.1 How would you check whether two vectors are the same if they may contain missing (NA) values? (Use of the identical function is considered cheating!)
  • Q1.2 If x is a factor with n levels and y is a length n vector, what happens if you compute y[x]?
  • Q1.3 Write the logical expression to use to extract girls between 7 and 14 years of age in the juul data set.
  • Q1.4 What happens if you change the levels of a factor (with levels) and give the same value to two or more levels?
  • Q1.5 On p.27, replicate was used to simulate the distribution of the mean of 20 random numbers from the exponential distribution by repeating the operation 10 times. How would you do the same thing with sapply?