1 Basic operating instructions / R の基本的な操作

See An Introduction to R https://intro2r.com/ for detailed instructions.

1 + 1
## [1] 2
2 * 3
## [1] 6
2 / 3
## [1] 0.6666667
# ... "#" symbol and after are commented out (i.e., not executed)
2 ^ 3  # 2 to the power of 3
## [1] 8
sqrt(16)  # square root
## [1] 4
log(10)  # natural logarithm
## [1] 2.302585
exp(1)  # exponential
## [1] 2.718282
pi  # pi (circle ratio)
## [1] 3.141593
c(1, 2, 6)  # combine values into vector
## [1] 1 2 6
x <- c(1, 2, 6)  # assign vector to object "x"
min(x)  # minimum value
## [1] 1
max(x)  # maximum value
## [1] 6
x2 <- c(1, 3, 4)  # another vector

How to find “help” of function?

?exp
help(exp)  # same as above

To write multiple commands on a single line, join them with semicolon “;”.

1行に複数のコマンドを書く場合は「;」で繋ぐ.

x; x + x2 + 1
## [1] 1 2 6
## [1]  3  6 11

2 Mean, variance, correlation / 平均,分散,相関

The mean, variance, standard deviation, covariance and (Pearson’s) correlation coefficient can be calculated with mean, var, sd, cov and cor, respectively.

平均は mean,分散は var,標準偏差は sd,共分散は cov,(ピアソンの積率)相関係数は cor でそれぞれ計算できます.

2.1 Mean / 平均

\[ \bar{x} = \frac{1}{n} \sum_i x_i \]

x  # display
## [1] 1 2 6
1 + 2 + 6  # sum (1+2+6=9)
## [1] 9
sum(x)  # sum
## [1] 9
sum(x) / 3  # definition of mean
## [1] 3
mean(x)
## [1] 3
mean(c(1, 2, 6))  # same as above / このように書いても同じ結果が得られる
## [1] 3
summary(x)  # min, 1st quantile, median, mean, 3rd quantile, max
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.5     2.0     3.0     4.0     6.0

2.2 Variance and standard deviation / 分散と標準偏差

\[ V(x) = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2 \]

x - 3
## [1] -2 -1  3
x - mean(x)
## [1] -2 -1  3
(x - mean(x))^2
## [1] 4 1 9
mean((x - mean(x))^2)  # definition of population variance
## [1] 4.666667
sum((x - mean(x))^2) / 3  # same as above
## [1] 4.666667
sum((x - mean(x))^2) / (3-1)  # definition of sample variance (unbiased variance)
## [1] 7
var(x)  # variance defined as sample variance
## [1] 7
sqrt(var(x))  # definition of standard deviation
## [1] 2.645751
sd(x)
## [1] 2.645751

2.2.1 Example of dice roll

Expected value and variance of the dice roll.

サイコロの出目の期待値と分散

\[ E(X) = \sum_j^k x_j f(x_j), \quad V(X) = E[(X-E(X))^2] = \sum_j^k (x_j - E(X))^2 f(x_j) = E(X^2) - (E(X))^2 \]

c(1, 2, 3, 4, 5, 6)
## [1] 1 2 3 4 5 6
1:6  # c(1, 2, 3, 4, 5, 6) と同じ
## [1] 1 2 3 4 5 6
dice <- 1:6
sum(dice * (1/6))  # definition of expectation
## [1] 3.5
mean(dice)  # mean
## [1] 3.5
sum((dice - mean(dice))^2 * (1/6))  # variance: sum[(x-E(X))*f(x)]
## [1] 2.916667
sum(dice^2 * (1/6)) - sum(dice * (1/6))^2  # variance: E(X^2)-E(X)^2
## [1] 2.916667

The value of var(dice) is 3.5, which does not equal the above, but this is because the var function calculates the sample variance (unbiased variance); 3.5*(n-1)/n = 3.5*5/6 ≈ 2.9.

なお var(dice) を計算すると3.5となり,上記と一致しないが,これは var 関数が標本分散(不偏分散)を計算するためである.3.5*(n-1)/n = 3.5*5/6 ≒ 2.9 である.

2.3 Covariance and correlation / 共分散と相関係数

\[ Cov(x, y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y}) \]

plot(x, x2)  # scatter plot

(x - mean(x)) * (x2 - mean(x2))
## [1]  3.3333333 -0.3333333  4.0000000
sum((x - mean(x)) * (x2 - mean(x2))) / 3  # definition of sample covariance
## [1] 2.333333
sum((x - mean(x)) * (x2 - mean(x2))) / (3-1)  # definition of sample covariance
## [1] 3.5
cov(x, x2)  # covariance of unbiased covariance
## [1] 3.5

\[ Cor(x, y) = \frac{Cov(x, y)}{\sqrt{V(x)} \cdot \sqrt{V(y)}} \]

cov(x, x2) / (sd(x) * sd(x2))  # definition of correlation coef.
## [1] 0.8660254
cor(x, x2)  # correlation coefficient
## [1] 0.8660254

Correlation coefficients of 1 or -1 are obtained when one variable is a constant multiple or constant addition of the other variable.

相関係数が 1 や -1 になるのは,一方の変数がもう一方の変数の定数倍・定数加算によって得られるような場合です.

x3 <- x * 2 + 3
x4 <- x * (-0.5) + 3
var(x3); 2^2 * var(x)  # These two should have the same value / この2つは同じ値になるはず
## [1] 28
## [1] 28
par(mfrow = c(1, 2))
plot(x, x3); abline(lm(x3 ~ x), col = "red")  # Add regression line (red line) / 回帰直線を追加
plot(x, x4); abline(lm(x4 ~ x), col = "red")

cor(x, x3)
## [1] 1
cor(x, x4)
## [1] -1

3 Create a dataframe and try to calculate / データフレームを作成して計算してみる

data <- data.frame(x, x2, x3, x4)
data
##   x x2 x3  x4
## 1 1  1  5 2.5
## 2 2  3  7 2.0
## 3 6  4 15 0.0
data$x  # Get variables using the "$" symbol
## [1] 1 2 6
data$x + data$x3
## [1]  6  9 21
plot(data$x, data$x2)  # scatter plot

data[1, 1]  # Get the elements of the 1st row 1 and 1st column
## [1] 1
data[2, 1]
## [1] 2
data[, 1]  # 1st column
## [1] 1 2 6
data[, "x"]  # column named "x"
## [1] 1 2 6
data[, c("x", "x3")]
##   x x3
## 1 1  5
## 2 2  7
## 3 6 15

Change the value of the data frame

データフレームの値を変更する

data[2, 1] <- 100
data[, 2] <- 10 * data[, 2]
data[, c(1, 2)]
##     x x2
## 1   1 10
## 2 100 30
## 3   6 40

4 Using sample data built-in to R / サンプルデータを使って計算してみる

4.1 Swiss Fertility and Socioeconomic Indicators / スイスの出生率と社会経済指標

str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
plot(swiss)  # scatter plot matrix

cor(swiss)  # correlation matrix
##                   Fertility Agriculture Examination   Education   Catholic
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Fertility              0.41655603
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000

4.2 Anscombe’s quartet / アンスコムの数値例

See https://en.wikipedia.org/wiki/Anscombe%27s_quartet

Four sets of variables {x, y} are included, whose mean, variance, and correlation coefficients are approximately the same, although the relationship between the variables is different from each other.

変数の関係が互いに異なるが平均・分散・相関係数が概ね同じ {x, y} という変数の組が4セット含まれる.

str(anscombe)  # display structure of object
## 'data.frame':    11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
summary(anscombe)
##        x1             x2             x3             x4           y1        
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
##        y2              y3              y4        
##  Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :9.260   Max.   :12.74   Max.   :12.500
mean(anscombe$x1)
## [1] 9
apply(anscombe, 2, mean)  # mean
##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
apply(anscombe, 2, var)  # variance
##        x1        x2        x3        x4        y1        y2        y3        y4 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620  4.123249
par(mfrow = c(2, 2))
plot(anscombe$x1, anscombe$y1)
plot(anscombe$x2, anscombe$y2)
plot(anscombe$x3, anscombe$y3)
plot(anscombe$x4, anscombe$y4)

cor(anscombe$x1, anscombe$y1)  # correlation of x1 and y1
## [1] 0.8164205
cor(anscombe$x2, anscombe$y2)
## [1] 0.8162365
cor(anscombe$x3, anscombe$y3)
## [1] 0.8162867
cor(anscombe$x4, anscombe$y4)
## [1] 0.8165214

This example shows us the importance of drawing raw data, rather than just looking at descriptive statistics.

この例は,記述統計だけを見るのではなく,生のデータを描画することの重要性を我々に教えてくれる.

See also: Same Stats, Different Graphs https://www.research.autodesk.com/publications/same-stats-different-graphs/

5 Descriving data / データの記述

Applewood vehicle sales data

  • Data source: Appendix (A.4) Data Set 4 from: Lind, Marchal, and Wathen (2021) “Basic Statistics for Business and Economics”, 10th ed.
  • Age = Age of the buyer at the time of the purchase
  • Profit = Amount earned by the dealership on the sale of each vehicle
  • Location = Dealership where the vehicle was purchased (Tionesta, Sheffield, Kane, and Olean)
  • VehicleType = SUV, sedan, compact, hybrid, or truch
  • Previous = Number of vehicles previously perchased at any of the four Applewood dealership by the customer

5.1 Loading CSV file / CSVファイルを読み込む

setwd("c://ws_stat")  # set working directory
getwd()  # get working directory
## [1] "c:/ws_stat"
apple <- read.csv("Statistics_Data_20231008.csv")
str(apple)
## 'data.frame':    180 obs. of  6 variables:
##  $ ConsumerID : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age        : int  21 23 24 25 26 27 27 28 28 29 ...
##  $ Profit     : int  1387 1754 1817 1040 1273 1529 3082 1951 2692 1206 ...
##  $ Location   : chr  "Tionesta" "Sheffield" "Sheffield" "Sheffield" ...
##  $ VehicleType: chr  "Sedan" "SUV" "Hybrid" "Compact" ...
##  $ Previous   : int  0 1 1 0 1 1 0 1 0 0 ...
summary(apple)
##    ConsumerID          Age            Profit       Location        
##  Min.   :  1.00   Min.   :21.00   Min.   : 294   Length:180        
##  1st Qu.: 45.75   1st Qu.:40.00   1st Qu.:1422   Class :character  
##  Median : 90.50   Median :46.00   Median :1882   Mode  :character  
##  Mean   : 90.50   Mean   :45.88   Mean   :1843                     
##  3rd Qu.:135.25   3rd Qu.:52.25   3rd Qu.:2268                     
##  Max.   :180.00   Max.   :73.00   Max.   :3292                     
##  VehicleType           Previous    
##  Length:180         Min.   :0.000  
##  Class :character   1st Qu.:0.000  
##  Mode  :character   Median :1.000  
##                     Mean   :1.017  
##                     3rd Qu.:1.000  
##                     Max.   :4.000

If the working directory settings do not work, execute the following line to load the CSV file. The “Select File” screen will appear, from which you can select the CSV file to be read.

作業ディレクトリの設定がうまくできない場合は,次の行を実行してCSVファイルを読み込む.実行すると「ファイルを選択」画面が出てきて,そこから読み込むCSVファイルを選択することができる.

apple <- read.csv(file.choose())

5.2 Frequency table / 度数分布表

Construct frequency table of vehicle type sold.

販売された車種の度数分布表を作成.

apple$VehicleType == "Sedan"  # logical ... TRUE or FALSE
##   [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
##  [13]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
##  [49]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE
##  [61] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
##  [73]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
##  [97]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [109] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [121] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [133] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [157]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
1 * (apple$VehicleType == "Sedan")  # =1 if TRUE, =0 if FALSE
##   [1] 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 0
##  [38] 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 1 0
##  [75] 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0 1 0
## [112] 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0
## [149] 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0
sum(apple$VehicleType == "Sedan")  # the number of TRUE = the number of Sedan
## [1] 68
table(apple$VehicleType)  # Frequency table
## 
## Compact  Hybrid   Sedan     SUV   Truck 
##      28       8      68      57      19

5.3 Bar chart / 棒グラフ

barplot(table(apple$VehicleType))

5.4 Pie chart / 円グラフ

pie(table(apple$VehicleType))

Add percentage to labels.

ラベルに「%」を追加する.

apple_type_table <- table(apple$VehicleType)
apple_label <- paste(names(apple_type_table), " ", 
                     round(100 * apple_type_table / sum(apple_type_table), 1), "%", sep = "")
pie(apple_type_table, labels = apple_label)

5.5 Frequency distribution table of quantitative data / 数量データの度数分布表

apple$Profit[1:20]  # first 20 profit data
##  [1] 1387 1754 1817 1040 1273 1529 3082 1951 2692 1206 1342  443  754 1621  870
## [16] 1174 1412 1809 2415 1546
c(200, 600, 1000, 1400, 1800, 2200, 2600, 3000)
## [1]  200  600 1000 1400 1800 2200 2600 3000
apple_profit_breaks <- seq(from = 200, to = 3400, by = 400)  # generating sequence
apple_profit_interval <- cut(apple$Profit, breaks = apple_profit_breaks, right = FALSE)
# [right = F] means intervals should not be closed on the right
table(apple_profit_interval)
## apple_profit_interval
##         [200,600)       [600,1e+03)   [1e+03,1.4e+03) [1.4e+03,1.8e+03) 
##                 8                11                23                38 
## [1.8e+03,2.2e+03) [2.2e+03,2.6e+03)   [2.6e+03,3e+03)   [3e+03,3.4e+03) 
##                45                32                19                 4
table(apple_profit_interval) / sum(table(apple_profit_interval))  # relative frequency
## apple_profit_interval
##         [200,600)       [600,1e+03)   [1e+03,1.4e+03) [1.4e+03,1.8e+03) 
##        0.04444444        0.06111111        0.12777778        0.21111111 
## [1.8e+03,2.2e+03) [2.2e+03,2.6e+03)   [2.6e+03,3e+03)   [3e+03,3.4e+03) 
##        0.25000000        0.17777778        0.10555556        0.02222222

5.6 Histogram / ヒストグラム

hist(apple$Profit)

hist(apple$Profit, breaks = apple_profit_breaks)  # set breakpoints defined above

hist(apple$Profit, breaks = 30)  # set the number of cells

5.7 Descriptive statistics / 記述統計

mean(apple$Profit)  # mean
## [1] 1843.167
var(apple$Profit)  # sample variance
## [1] 414256.6
max(apple$Profit); min(apple$Profit)  # maximum, minimum
## [1] 3292
## [1] 294
max(apple$Profit) - min(apple$Profit)  # range
## [1] 2998
cor(apple$Profit, apple$Age)  # correlation
## [1] 0.261529

5.8 Quantiles and box plot / 分位数と箱ひげ図

5.8.1 Quantiles / 分位数

quantile(c(1, 2, 4, 7, 8, 11, 13, 13, 15, 16, 18))  # returns quartiles as default 四分位数
##   0%  25%  50%  75% 100% 
##  1.0  5.5 11.0 14.0 18.0
quantile(apple$Profit)
##     0%    25%    50%    75%   100% 
##  294.0 1422.5 1882.5 2268.5 3292.0
summary(apple$Profit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     294    1422    1882    1843    2268    3292

The summary function also reports the quartiles, but with a slightly different value than when using the quantile function. See https://en.wikipedia.org/wiki/Quartile.

summary関数も四分位点を報告するが,quantile関数を使った場合と少し値が違う.

quantile(apple$Profit, probs = seq(0.1, 0.9, by = 0.1))  # deciles 十分位数
##    10%    20%    30%    40%    50%    60%    70%    80%    90% 
##  994.2 1315.6 1531.1 1737.0 1882.5 2028.4 2202.8 2394.2 2668.6
quantile(apple$Profit, probs = 0.35)  # 35th percentile
##    35% 
## 1633.8
median(apple$Profit)  # cf. median 中央値
## [1] 1882.5

5.8.2 Box plot / 箱ひげ図

boxplot(apple$Profit)

apple_profit_sedan <- apple$Profit[apple$VehicleType == "Sedan"]  # profit from sedan
apple_profit_suv <- apple$Profit[apple$VehicleType == "SUV"]
boxplot(apple_profit_sedan, apple_profit_suv, names = c("Sedan", "SUV"), ylab = "Profit ($)")

boxplot(apple$Age)  # three outliers

5.9 Scatter plot / 散布図

plot(apple$Age, apple$Profit, xlab = "Age of customer (years)", ylab = "Proft per vehicle ($)")

5.9.1 Scale does matter

Same data but difference scale (on y-axis).

tempx <- 1:10
set.seed(1)
tempy <- tempx + rnorm(10, sd = 4)
par(mfrow = c(1, 2))
plot(tempx, tempy, xlab = "x", ylab = "y")
plot(tempx, tempy, xlab = "x", ylab = "y", ylim = c(-50, 50))

The left figure appears to show a positive correlation, while the right figure will look uncorrelated.

左の図は正の相関を示しているように見えるが,右の図は無相関に見えるだろう.

5.10 Contingency table / 分割表

Specify two arguments, e.g., location of dealership and vehicle type, in the table function.

table(apple$Location, apple$VehicleType)
##            
##             Compact Hybrid Sedan SUV Truck
##   Kane            3      1    20  21     7
##   Olean           9      2    17   8     4
##   Sheffield       8      5    16  11     5
##   Tionesta        8      0    15  17     3

.