ISwR Chapter 1 Basics: Summary

Last updated: 2020/10/06

First step

install.packages("ISwR")

library(ISwR)

plot(rnorm(1000))

A calculator (using R in place of a pocket calculator)

2+2

## [1] 4

rnorm(15)

##  [1] -1.47609735  2.34831500 -0.60946411  1.88423948  2.06653744 -2.39010691
##  [7]  0.01326897  0.45297068 -0.21167449 -1.79313276  0.50931746 -0.20721466
## [13]  0.71014616 -0.42244100 -0.15358791

$\exp(-2)=e^{-2}$

exp(-2)

## [1] 0.1353353

Assignments

Use the symbol <- to assign a value to a name.

x <- 2
x

## [1] 2

x + x

## [1] 4

An example of an object name: height.1yr
R is case-sensitive: Wt and wt are different names.

Vectorized arithmetic (vector operations)

weight <- c(60, 72, 57, 90, 95, 72)
weight

## [1] 60 72 57 90 95 72

体重  <- c(60, 72, 57, 90, 95, 72)
体重

## [1] 60 72 57 90 95 72

height <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
height

## [1] 1.65 1.80 1.65 1.90 1.74 1.91

Sorting

h <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
h

## [1] 1.65 1.80 1.65 1.90 1.74 1.91

h[order(h)]

## [1] 1.65 1.65 1.74 1.80 1.90 1.91

order(h)

## [1] 1 3 5 2 4 6

BMI calculation

BMI (body mass index) is weight in kilograms divided by the square of height in meters.

height <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
weight <- c(60, 72, 57, 90, 95, 72)
bmi <- weight / height ^ 2
bmi

## [1] 22.03857 22.22222 20.93664 24.93075 31.37799 19.73630

Sum of the weights

sum(weight)

## [1] 446

Number of observations n

length(weight)

## [1] 6

Mean weight

sum(weight) / length(weight)

## [1] 74.33333

Computing the standard deviation of the weights

Display the contents of the weight vector

weight

## [1] 60 72 57 90 95 72

Mean: $\bar X = \frac{\sum^n_{i=1} X_i}{n}$

xbar <- sum(weight) / length(weight)
xbar

## [1] 74.33333

Deviation
$X_i - \bar X$
Note that xbar, a single number, is recycled to match the length of weight.

weight - xbar

## [1] -14.333333  -2.333333 -17.333333  15.666667  20.666667  -2.333333
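
The subtraction above works because R repeats the shorter operand until it matches the longer one. A minimal sketch of the same rule, using two made-up vectors `a` and `b` that are not part of the original notes:

```r
# Recycling: the shorter vector is repeated to match the longer one.
a <- c(10, 20, 30, 40)
b <- c(1, 2)            # used as 1, 2, 1, 2

a - b                   # 9 18 29 38

# A single number is the length-1 case of the same rule.
a - 5                   # 5 15 25 35
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.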

Squared deviations
$(X_i - \bar X)^2$

(weight - xbar)^2

## [1] 205.444444   5.444444 300.444444 245.444444 427.111111   5.444444

Sum of squared deviations
$\sum^n_{i=1} (X_i - \bar X)^2$

sum((weight - xbar)^2)

## [1] 1189.333

Standard deviation
The square root of the variance, computed with sqrt():
$s = \sqrt{\frac{\sum^n_{i=1} (X_i - \bar X)^2}{n-1}}$

sqrt(sum((weight - xbar) ^ 2) / (length(weight) - 1))

## [1] 15.42293

Using built-in functions

Mean: mean()
Standard deviation: sd()

mean(weight)

## [1] 74.33333

sd(weight)

## [1] 15.42293
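
As a check, the manual steps can be wrapped in a small function and compared with sd(). This is only a sketch: the name `manual_sd` is a hypothetical helper introduced here, not something from the book.

```r
weight <- c(60, 72, 57, 90, 95, 72)

# Hypothetical helper that repeats the manual steps shown earlier.
manual_sd <- function(x) {
  n <- length(x)
  xbar <- sum(x) / n                   # mean
  sqrt(sum((x - xbar)^2) / (n - 1))    # square root of the sample variance
}

manual_sd(weight)

# all.equal() compares numbers up to a small tolerance
all.equal(manual_sd(weight), sd(weight))
```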

Standard procedures (for example, the t test is performed with the `t.test()` function)

The rule of thumb is that the BMI for a normal-weight individual should be between 20 and 25. A one-sample t test can assess whether the BMI of these six people can be assumed to have mean 22.5.

bmi

## [1] 22.03857 22.22222 20.93664 24.93075 31.37799 19.73630

Test whether the mean BMI equals 22.5 (one-sample t test)

t.test(bmi, mu=22.5)

## 
##  One Sample t-test
## 
## data:  bmi
## t = 0.60539, df = 5, p-value = 0.5713
## alternative hypothesis: true mean is not equal to 22.5
## 95 percent confidence interval:
##  19.12268 27.95814
## sample estimates:
## mean of x 
##  23.54041
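
The t statistic and p-value in this output can be recomputed by hand from the mean, the standard deviation and the sample size. The sketch below assumes the usual one-sample formula $t = (\bar X - \mu_0) / (s / \sqrt{n})$; the names `mu0`, `t_stat` and `p_val` are introduced here for illustration.

```r
height <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
weight <- c(60, 72, 57, 90, 95, 72)
bmi <- weight / height^2

n <- length(bmi)
mu0 <- 22.5                                  # hypothesized mean

# distance of the sample mean from mu0, in standard-error units
t_stat <- (mean(bmi) - mu0) / (sd(bmi) / sqrt(n))

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df = n - 1)

# compare with the components stored in the t.test() result
res <- t.test(bmi, mu = mu0)
all.equal(t_stat, unname(res$statistic))
all.equal(p_val, unname(res$p.value))
```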

Graphics

To investigate the relationship between weight and height, use the plot() function.

plot(height, weight)

Let’s change the plotting symbol with the pch argument:

plot(height, weight, pch=2)

plot(height, weight, pch=3)

R language essentials

# install.packages("ISwR")
library(ISwR)

height <- c(1.65, 1.80, 1.65, 1.90, 1.74, 1.91)
height

## [1] 1.65 1.80 1.65 1.90 1.74 1.91

weight <- c(60, 72, 57, 90, 95, 72)
weight

## [1] 60 72 57 90 95 72

bmi <- weight / height ^ 2
bmi

## [1] 22.03857 22.22222 20.93664 24.93075 31.37799 19.73630

Listing objects (`ls()`)

ls()

## [1] "bmi"    "h"      "height" "weight" "x"      "xbar"   "体重"

Vectors (numeric)

We have already seen numeric vectors.
There are two further types,
character vectors and logical vectors.

Vectors (character)

Use quote symbols: double ("") or single ('') quotes both work.

c("Huey","Dewey","Louie")

## [1] "Huey"  "Dewey" "Louie"

c('Huey','Dewey','Louie')

## [1] "Huey"  "Dewey" "Louie"

Vectors (logical)

c(T,T,F,T)

## [1]  TRUE  TRUE FALSE  TRUE

bmi > 25

## [1] FALSE FALSE FALSE FALSE  TRUE FALSE

Missing values

In R, NA means missing value.

c(42,57,12,NA,1,3,4)

## [1] 42 57 12 NA  1  3  4
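
A short sketch of how a missing value behaves in computations, using the same vector. The functions is.na() and the na.rm argument of mean() are standard R; the variable name `x` is chosen here for illustration.

```r
x <- c(42, 57, 12, NA, 1, 3, 4)

is.na(x)                 # TRUE only in the fourth position
mean(x)                  # NA: the missing value propagates
mean(x, na.rm = TRUE)    # mean of the six observed values
x == NA                  # all NA, so == cannot be used to find missing values
```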

Functions that create vectors

`c()`

c(42,57,12,39,1,3,4)

## [1] 42 57 12 39  1  3  4

x <- c(1, 2, 3)
y <- c(10, 20)
c(x, y, 5)

## [1]  1  2  3 10 20  5

It is also possible to assign names to the elements.

x <- c(red = "Huey", blue = "Dewey", green = "Louie")
x

##     red    blue   green 
##  "Huey" "Dewey" "Louie"

names(x)

## [1] "red"   "blue"  "green"

seq( )

The second function, seq (“sequence”),
is used for equidistant series of numbers.

seq(4, 9)

## [1] 4 5 6 7 8 9

If you want a sequence in jumps of 2, write

seq(4, 10, 2)

## [1]  4  6  8 10

`:`

4:9

## [1] 4 5 6 7 8 9

The above is exactly the same as seq(4,9), only easier to read.

seq(4, 9)

## [1] 4 5 6 7 8 9

rep( )

The third function, rep (“replicate”), is used to generate repeated values.

oops <- c(7, 9, 13)
rep(oops, 3)

## [1]  7  9 13  7  9 13  7  9 13

rep(1:2, c(10, 15))

##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

The following two calls give the same result:

rep(1:2, each = 10)

##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

rep(1:2, c(10, 10))

##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

Matrices and arrays

A matrix in mathematics is just a two-dimensional array of numbers.

x <- 1:12
dim(x) <- c(3, 4)
x

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Compare the results with and without the byrow=T option.
Notice how the byrow=T switch causes the matrix to be filled in a rowwise fashion rather than columnwise.

matrix(1:12, nrow = 3, byrow = T)

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

matrix(1:12, nrow = 3, byrow = F)

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

x <- matrix(1:12, nrow = 3, byrow = T)
rownames(x) <- LETTERS[1:3]
x

##   [,1] [,2] [,3] [,4]
## A    1    2    3    4
## B    5    6    7    8
## C    9   10   11   12

You can “glue” vectors together, columnwise or rowwise, using the cbind and rbind functions.

cbind(A = 1:4, B = 5:8, C = 9:12)

##      A B  C
## [1,] 1 5  9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12

rbind(A = 1:4, B = 5:8, C = 9:12)

##   [,1] [,2] [,3] [,4]
## A    1    2    3    4
## B    5    6    7    8
## C    9   10   11   12

Factors (categorical variables)

pain <- c(0, 3, 2, 2, 1)
fpain <- factor(pain, levels = 0:3)
levels(fpain) <- c("none", "mild", "medium", "severe")
fpain

## [1] none   severe medium medium mild  
## Levels: none mild medium severe

as.numeric(fpain)

## [1] 1 4 3 3 2

levels(fpain)

## [1] "none"   "mild"   "medium" "severe"
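
The two steps above (create the factor, then rename its levels) can also be done in one call with the labels argument of factor(). A minimal sketch; `fpain2` is a name introduced here for comparison.

```r
pain <- c(0, 3, 2, 2, 1)

# Two-step version from the notes
fpain <- factor(pain, levels = 0:3)
levels(fpain) <- c("none", "mild", "medium", "severe")

# One-step alternative using the labels argument
fpain2 <- factor(pain, levels = 0:3,
                 labels = c("none", "mild", "medium", "severe"))

as.character(fpain2)                              # the labels as plain text
all(as.character(fpain) == as.character(fpain2))  # both versions agree
as.integer(fpain2)                                # underlying codes 1 to 4
```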

Lists

It is sometimes useful to combine a collection of objects into a larger composite object.
This can be done using list().

intake.pre <-c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
intake.post <-c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335)
mylist <- list(before=intake.pre,after=intake.post)
mylist

## $before
##  [1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
## 
## $after
##  [1] 3910 4220 3885 5160 5645 4680 5265 5975 6790 6900 7335

mylist$before

##  [1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
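
Besides the $ notation, list components can be selected with double brackets, by name or by position. A brief sketch using the same list:

```r
intake.pre  <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
intake.post <- c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335)
mylist <- list(before = intake.pre, after = intake.post)

names(mylist)        # component names
mylist[["after"]]    # same vector as mylist$after
mylist[[1]]          # first component, selected by position
mylist["before"]     # single brackets return a list of length one
```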

Data frames

important!
A data frame corresponds to what other statistical packages call a “data matrix” or a “data set”.

d <- data.frame(intake.pre, intake.post)
d

##    intake.pre intake.post
## 1        5260        3910
## 2        5470        4220
## 3        5640        3885
## 4        6180        5160
## 5        6390        5645
## 6        6515        4680
## 7        6805        5265
## 8        7515        5975
## 9        7515        6790
## 10       8230        6900
## 11       8770        7335

As with lists, components (i.e., individual variables) can be accessed using the $ notation:

d$intake.pre

##  [1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770

Indexing

Often, you need a particular element in a vector.

intake.pre[5]

## [1] 6390

intake.pre[c(3, 5, 7)]

## [1] 5640 6390 6805

v <- c(3, 5, 7)
intake.pre[v]

## [1] 5640 6390 6805

intake.pre[1:5]

## [1] 5260 5470 5640 6180 6390

intake.pre[-c(3, 5, 7)]

## [1] 5260 5470 6180 6515 7515 7515 8230 8770

Conditional selection

Often you need to extract data that satisfy certain criteria,
such as data from the males or the prepubertal or those with chronic diseases.

intake.post[intake.pre > 7000]

## [1] 5975 6790 6900 7335

This yields the postmenstrual energy intake for the four women who had an energy intake above 7000 kJ premenstrually.

Logical operations

The comparison operators are < (less than), > (greater than), == (equal to), <= (less than or equal to), >= (greater than or equal to), and != (not equal to).
To combine several expressions, you can use the logical operators & (“and”), | (“or”), and ! (“not”).

Conditional selection : example

For instance, we want to find ‘the postmenstrual intake for women with a premenstrual intake between 7000 and 8000 kJ’

intake.post[intake.pre > 7000 & intake.pre <= 8000]

## [1] 5975 6790

intake.pre > 7000 & intake.pre <= 8000

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
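
The other two logical operators work the same way. The sketch below applies | and ! to the same data; the cut-off values are chosen here only for illustration.

```r
intake.pre  <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
intake.post <- c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335)

# "or": premenstrual intake below 5500 kJ or above 8000 kJ
intake.post[intake.pre < 5500 | intake.pre > 8000]   # 3910 4220 6900 7335

# "not": everyone except those with a premenstrual intake above 7000 kJ
intake.post[!(intake.pre > 7000)]

# sum() counts the TRUE values, since TRUE is treated as 1
sum(intake.pre > 7000 & intake.pre <= 8000)          # 2
```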

Indexing

d <- data.frame(intake.pre, intake.post)

# value in row 5, column 1
d[5, 1]

## [1] 6390

# all values in row 5
d[5, ]

##   intake.pre intake.post
## 5       6390        5645

Indexing

Use $ to select a column

d$intake.pre

##  [1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770

Indexing

d[d$intake.pre > 7000, ]

##    intake.pre intake.post
## 8        7515        5975
## 9        7515        6790
## 10       8230        6900
## 11       8770        7335

Indexing

sel <- d$intake.pre > 7000
sel

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

d[sel, ]

##    intake.pre intake.post
## 8        7515        5975
## 9        7515        6790
## 10       8230        6900
## 11       8770        7335

It is often convenient to look at the first few cases in a data set.
This can be done with indexing, like this:

d[1:2, ]

##   intake.pre intake.post
## 1       5260        3910
## 2       5470        4220

Showing the first six rows

Usually we use head(), which shows the first six rows by default:

head(d)

##   intake.pre intake.post
## 1       5260        3910
## 2       5470        4220
## 3       5640        3885
## 4       6180        5160
## 5       6390        5645
## 6       6515        4680

Grouped data and data frames

energy

##    expend stature
## 1    9.21   obese
## 2    7.53    lean
## 3    7.48    lean
## 4    8.08    lean
## 5    8.09    lean
## 6   10.15    lean
## 7    8.40    lean
## 8   10.88    lean
## 9    6.13    lean
## 10   7.90    lean
## 11  11.51   obese
## 12  12.79   obese
## 13   7.05    lean
## 14  11.85   obese
## 15   9.97   obese
## 16   7.48    lean
## 17   8.79   obese
## 18   9.69   obese
## 19   9.68   obese
## 20   7.58    lean
## 21   9.19   obese
## 22   8.11    lean

Sometimes it is desirable to have data in a separate vector for each group.

exp.lean <- energy$expend[energy$stature == "lean"]
exp.obese <- energy$expend[energy$stature == "obese"]

Alternatively, you can use the split function, which generates a list of vectors according to a grouping.

l <- split(energy$expend, energy$stature)
l

## $lean
##  [1]  7.53  7.48  8.08  8.09 10.15  8.40 10.88  6.13  7.90  7.05  7.48  7.58
## [13]  8.11
## 
## $obese
## [1]  9.21 11.51 12.79 11.85  9.97  8.79  9.69  9.68  9.19

Implicit loops

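A common use of loops is to apply a function to each element of a set of values or vectors and collect the results in a single structure.
In R this is abstracted by the functions lapply and sapply.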
The former always returns a list (hence the ‘l’), whereas the latter tries to simplify (hence the ‘s’) the result to a vector or a matrix if possible.

The argument na.rm=T requests that missing values be removed before the mean is computed.

lapply(thuesen, mean, na.rm = T)

## $blood.glucose
## [1] 10.3
## 
## $short.velocity
## [1] 1.325652

sapply(thuesen, mean, na.rm = T)

##  blood.glucose short.velocity 
##      10.300000       1.325652
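
To see the difference in return types without loading the thuesen data, here is a sketch on a small made-up data frame `df` (hypothetical values, one of them missing):

```r
# Made-up data frame with one missing value (not the thuesen data)
df <- data.frame(a = c(1, 2, 3, 4), b = c(10, NA, 30, 50))

res_l <- lapply(df, mean, na.rm = TRUE)   # list with components a and b
res_s <- sapply(df, mean, na.rm = TRUE)   # named numeric vector

is.list(res_l)      # TRUE
is.numeric(res_s)   # TRUE
res_s

# Without na.rm = TRUE the column with a missing value yields NA
sapply(df, mean)
```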

Sometimes you just want to repeat something a number of times but still collect the results as a vector.
Obviously, this makes sense only when the repeated computations actually give different results, the common case being simulation studies.
This can be done using sapply, but there is a simplified version called replicate, in which you just have to give a count and the expression to evaluate:

replicate(10, mean(rexp(20)))

##  [1] 1.1543345 0.6253008 1.0642741 0.8529334 1.1719649 0.9398824 0.7693953
##  [8] 0.8281102 1.5938795 1.1165344

m <- matrix(rnorm(12), 4)
m

##            [,1]        [,2]       [,3]
## [1,]  1.6376249 -0.73975702 -0.4987458
## [2,] -0.3125668  0.17665811 -0.3867262
## [3,]  0.3139244  1.59216373 -0.4166742
## [4,] -0.7316883 -0.01078286  0.1755406

m <- matrix(rnorm(12), nrow = 4)
m

##            [,1]        [,2]       [,3]
## [1,] -0.4184491  0.69345588 -0.4237782
## [2,] -0.3193066  2.98385765 -0.6240393
## [3,]  0.6595142 -0.03836443  0.6065398
## [4,] -0.9591165  0.26312178  0.1139435

A similar function, apply, allows you to apply a function to the rows or columns of a matrix.
The second argument is the index (or vector of indices) that defines what the function is applied to (1 for rows, 2 for columns); in this case we get the columnwise minima.

apply(m, 2, min)

## [1] -0.95911651 -0.03836443 -0.62403929
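
Because the matrix above is random, the effect of the second argument is easier to see on fixed values. A minimal sketch with a 3 x 4 matrix of the integers 1 to 12 (chosen here for illustration):

```r
# Fixed values instead of rnorm(), so the results are reproducible
m <- matrix(1:12, nrow = 3, byrow = TRUE)

apply(m, 1, min)   # row minima:    1 5 9
apply(m, 2, min)   # column minima: 1 2 3 4
apply(m, 1, sum)   # row sums:      10 26 42 (same values as rowSums(m))
```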

Also, the function tapply allows you to create tables (hence the ‘t’) of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.
In the latter case a cross-classified table is generated.

tapply(energy$expend, energy$stature, median)

##  lean obese 
##  7.90  9.69
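
The same group medians can be obtained by combining split and sapply. In the sketch below the values are retyped from the energy listing shown earlier so that it runs without the ISwR package; the names `expend`, `stature` and `groups` are introduced here.

```r
# Values retyped from the energy listing (assumption: copied correctly)
expend <- c(9.21, 7.53, 7.48, 8.08, 8.09, 10.15, 8.40, 10.88, 6.13, 7.90, 11.51,
            12.79, 7.05, 11.85, 9.97, 7.48, 8.79, 9.69, 9.68, 7.58, 9.19, 8.11)
stature <- factor(c("obese", rep("lean", 9), "obese", "obese", "lean",
                    "obese", "obese", "lean", "obese", "obese", "obese",
                    "lean", "obese", "lean"))

groups <- split(expend, stature)   # list with one vector per group

sapply(groups, length)             # group sizes: 13 lean, 9 obese
sapply(groups, median)             # same medians as the tapply() call
```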

Sorting

Just use the sort function.

intake$post

##  [1] 3910 4220 3885 5160 5645 4680 5265 5975 6790 6900 7335

sort(intake$post)

##  [1] 3885 3910 4220 4680 5160 5265 5645 5975 6790 6900 7335

However, sorting a single vector is not always what is required.
Often you need to sort a series of variables according to the values of some other variables — blood pressures sorted by sex and age, for instance.

order(intake$post)

##  [1]  3  1  2  6  4  7  5  8  9 10 11

The result is the numbers 1 to 11 (or whatever the length of the vector is), sorted according to the size of the argument to order (here intake$post).
Interpreting the result of order is a bit tricky. It should be read as follows: you sort intake$post by placing its values in the order no. 3, no. 1, no. 2, no. 6, and so on.

o <- order(intake$post)
intake$post[o]

##  [1] 3885 3910 4220 4680 5160 5265 5645 5975 6790 6900 7335

intake$pre[o]

##  [1] 5640 5260 5470 6515 6180 6805 6390 7515 7515 8230 8770

It is of course also possible to sort the entire data frame intake:

intake

##     pre post
## 1  5260 3910
## 2  5470 4220
## 3  5640 3885
## 4  6180 5160
## 5  6390 5645
## 6  6515 4680
## 7  6805 5265
## 8  7515 5975
## 9  7515 6790
## 10 8230 6900
## 11 8770 7335

intake.sorted <- intake[o,]

Sorting by several criteria is done simply by having several arguments to order;
for instance, order(sex,age) will give a main division into men and women, and within each sex an ordering by age.
The second variable is used when the order cannot be decided from the first variable.
Sorting in reverse order can be handled by, for example, changing the sign of the variable.
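
A small sketch of sorting by two criteria and of reversing the order. The vectors `sex`, `age` and `bp` are made-up values for illustration, not data from the book.

```r
# Made-up data for illustration
sex <- c("F", "M", "F", "M", "F")
age <- c(34, 51, 28, 45, 30)
bp  <- c(120, 135, 110, 128, 115)

o <- order(sex, age)            # by sex first, then by age within each sex
o                               # 3 5 1 4 2
data.frame(sex, age, bp)[o, ]   # the whole data frame in that order

order(sex, -age)                # age descending within each sex: 1 5 3 2 4
order(age, decreasing = TRUE)   # a single variable in reverse order
```

Changing the sign works only for numeric variables; for a single sort key the decreasing argument is an alternative.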

Exercises

Q1.1 How would you check whether two vectors are the same if they may contain missing (NA) values? (Use of the identical function is considered cheating!)
Q1.2 If x is a factor with n levels and y is a length n vector, what happens if you compute y[x]?
Q1.3 Write the logical expression to use to extract girls between 7 and 14 years of age in the juul data set.
Q1.4 What happens if you change the levels of a factor (with levels) and give the same value to two or more levels?
Q1.5 On p.27, replicate was used to simulate the distribution of the mean of 20 random numbers from the exponential distribution by repeating the operation 10 times. How would you do the same thing with sapply?
