See An Introduction to R https://intro2r.com/ for detailed instructions.
1 + 1
## [1] 2
2 * 3
## [1] 6
2 / 3
## [1] 0.6666667
# ... "#" symbol and after are commented out (i.e., not executed)
2 ^ 3 # 2 to the power of 3
## [1] 8
sqrt(16) # square root
## [1] 4
log(10) # natural logarithm
## [1] 2.302585
exp(1) # exponential
## [1] 2.718282
pi # pi (circle ratio)
## [1] 3.141593
c(1, 2, 6) # combine values into vector
## [1] 1 2 6
x <- c(1, 2, 6) # assign vector to object "x"
min(x) # minimum value
## [1] 1
max(x) # maximum value
## [1] 6
x2 <- c(1, 3, 4) # another vector
How to find “help” of function?
?exp
help(exp) # same as above
To write multiple commands on a single line, join them with semicolon “;”.
1行に複数のコマンドを書く場合は「;」で繋ぐ.
x; x + x2 + 1
## [1] 1 2 6
## [1] 3 6 11
The mean, variance, standard deviation, covariance and (Pearson’s) correlation coefficient can be calculated with mean, var, sd, cov and cor, respectively.
平均は mean,分散は var,標準偏差は sd,共分散は cov,(ピアソンの積率)相関係数は cor でそれぞれ計算できます.
\[ \bar{x} = \frac{1}{n} \sum_i x_i \]
x # display
## [1] 1 2 6
1 + 2 + 6 # sum (1+2+6=9)
## [1] 9
sum(x) # sum
## [1] 9
sum(x) / 3 # definition of mean
## [1] 3
mean(x)
## [1] 3
mean(c(1, 2, 6)) # same as above / このように書いても同じ結果が得られる
## [1] 3
summary(x) # min, 1st quantile, median, mean, 3rd quantile, max
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.5 2.0 3.0 4.0 6.0
\[ V(x) = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2 \]
x - 3
## [1] -2 -1 3
x - mean(x)
## [1] -2 -1 3
(x - mean(x))^2
## [1] 4 1 9
mean((x - mean(x))^2) # definition of population variance
## [1] 4.666667
sum((x - mean(x))^2) / 3 # same as above
## [1] 4.666667
sum((x - mean(x))^2) / (3-1) # definition of sample variance (unbiased variance)
## [1] 7
var(x) # variance defined as sample variance
## [1] 7
sqrt(var(x)) # definition of standard deviation
## [1] 2.645751
sd(x)
## [1] 2.645751
Expected value and variance of the dice roll.
サイコロの出目の期待値と分散
\[ E(X) = \sum_j^k x_j f(x_j), \quad V(X) = E[(X-E(X))^2] = \sum_j^k (x_j - E(X))^2 f(x_j) = E(X^2) - (E(X))^2 \]
c(1, 2, 3, 4, 5, 6)
## [1] 1 2 3 4 5 6
1:6 # c(1, 2, 3, 4, 5, 6) と同じ
## [1] 1 2 3 4 5 6
dice <- 1:6
sum(dice * (1/6)) # definition of expectation
## [1] 3.5
mean(dice) # mean
## [1] 3.5
sum((dice - mean(dice))^2 * (1/6)) # variance: sum[(x-E(X))*f(x)]
## [1] 2.916667
sum(dice^2 * (1/6)) - sum(dice * (1/6))^2 # variance: E(X^2)-E(X)^2
## [1] 2.916667
The value of var(dice) is 3.5, which does not equal the
above, but this is because the var function calculates the
sample variance (unbiased variance);
3.5*(n-1)/n = 3.5*5/6 ≈ 2.9.
なお var(dice)
を計算すると3.5となり,上記と一致しないが,これは var
関数が標本分散(不偏分散)を計算するためである.3.5*(n-1)/n = 3.5*5/6 ≒ 2.9
である.
\[ Cov(x, y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y}) \]
plot(x, x2) # scatter plot
(x - mean(x)) * (x2 - mean(x2))
## [1] 3.3333333 -0.3333333 4.0000000
sum((x - mean(x)) * (x2 - mean(x2))) / 3 # definition of sample covariance
## [1] 2.333333
sum((x - mean(x)) * (x2 - mean(x2))) / (3-1) # definition of sample covariance
## [1] 3.5
cov(x, x2) # covariance of unbiased covariance
## [1] 3.5
\[ Cor(x, y) = \frac{Cov(x, y)}{\sqrt{V(x)} \cdot \sqrt{V(y)}} \]
cov(x, x2) / (sd(x) * sd(x2)) # definition of correlation coef.
## [1] 0.8660254
cor(x, x2) # correlation coefficient
## [1] 0.8660254
Correlation coefficients of 1 or -1 are obtained when one variable is a constant multiple or constant addition of the other variable.
相関係数が 1 や -1 になるのは,一方の変数がもう一方の変数の定数倍・定数加算によって得られるような場合です.
x3 <- x * 2 + 3
x4 <- x * (-0.5) + 3
var(x3); 2^2 * var(x) # These two should have the same value / この2つは同じ値になるはず
## [1] 28
## [1] 28
par(mfrow = c(1, 2))
plot(x, x3); abline(lm(x3 ~ x), col = "red") # Add regression line (red line) / 回帰直線を追加
plot(x, x4); abline(lm(x4 ~ x), col = "red")
cor(x, x3)
## [1] 1
cor(x, x4)
## [1] -1
data <- data.frame(x, x2, x3, x4)
data
## x x2 x3 x4
## 1 1 1 5 2.5
## 2 2 3 7 2.0
## 3 6 4 15 0.0
data$x # Get variables using the "$" symbol
## [1] 1 2 6
data$x + data$x3
## [1] 6 9 21
plot(data$x, data$x2) # scatter plot
data[1, 1] # Get the elements of the 1st row 1 and 1st column
## [1] 1
data[2, 1]
## [1] 2
data[, 1] # 1st column
## [1] 1 2 6
data[, "x"] # column named "x"
## [1] 1 2 6
data[, c("x", "x3")]
## x x3
## 1 1 5
## 2 2 7
## 3 6 15
Change the value of the data frame
データフレームの値を変更する
data[2, 1] <- 100
data[, 2] <- 10 * data[, 2]
data[, c(1, 2)]
## x x2
## 1 1 10
## 2 100 30
## 3 6 40
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
plot(swiss) # scatter plot matrix
cor(swiss) # correlation matrix
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
See https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Four sets of variables {x, y} are included, whose mean, variance, and correlation coefficients are approximately the same, although the relationship between the variables is different from each other.
変数の関係が互いに異なるが平均・分散・相関係数が概ね同じ {x, y} という変数の組が4セット含まれる.
str(anscombe) # display structure of object
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
summary(anscombe)
## x1 x2 x3 x4 y1
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840
## y2 y3 y4
## Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median :8.140 Median : 7.11 Median : 7.040
## Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :9.260 Max. :12.74 Max. :12.500
mean(anscombe$x1)
## [1] 9
apply(anscombe, 2, mean) # mean
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
apply(anscombe, 2, var) # variance
## x1 x2 x3 x4 y1 y2 y3 y4
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620 4.123249
par(mfrow = c(2, 2))
plot(anscombe$x1, anscombe$y1)
plot(anscombe$x2, anscombe$y2)
plot(anscombe$x3, anscombe$y3)
plot(anscombe$x4, anscombe$y4)
cor(anscombe$x1, anscombe$y1) # correlation of x1 and y1
## [1] 0.8164205
cor(anscombe$x2, anscombe$y2)
## [1] 0.8162365
cor(anscombe$x3, anscombe$y3)
## [1] 0.8162867
cor(anscombe$x4, anscombe$y4)
## [1] 0.8165214
This example shows us the importance of drawing raw data, rather than just looking at descriptive statistics.
この例は,記述統計だけを見るのではなく,生のデータを描画することの重要性を我々に教えてくれる.
See also: Same Stats, Different Graphs https://www.research.autodesk.com/publications/same-stats-different-graphs/
Applewood vehicle sales data
setwd("c://ws_stat") # set working directory
getwd() # get working directory
## [1] "c:/ws_stat"
apple <- read.csv("Statistics_Data_20231008.csv")
str(apple)
## 'data.frame': 180 obs. of 6 variables:
## $ ConsumerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 21 23 24 25 26 27 27 28 28 29 ...
## $ Profit : int 1387 1754 1817 1040 1273 1529 3082 1951 2692 1206 ...
## $ Location : chr "Tionesta" "Sheffield" "Sheffield" "Sheffield" ...
## $ VehicleType: chr "Sedan" "SUV" "Hybrid" "Compact" ...
## $ Previous : int 0 1 1 0 1 1 0 1 0 0 ...
summary(apple)
## ConsumerID Age Profit Location
## Min. : 1.00 Min. :21.00 Min. : 294 Length:180
## 1st Qu.: 45.75 1st Qu.:40.00 1st Qu.:1422 Class :character
## Median : 90.50 Median :46.00 Median :1882 Mode :character
## Mean : 90.50 Mean :45.88 Mean :1843
## 3rd Qu.:135.25 3rd Qu.:52.25 3rd Qu.:2268
## Max. :180.00 Max. :73.00 Max. :3292
## VehicleType Previous
## Length:180 Min. :0.000
## Class :character 1st Qu.:0.000
## Mode :character Median :1.000
## Mean :1.017
## 3rd Qu.:1.000
## Max. :4.000
If the working directory settings do not work, execute the following line to load the CSV file. The “Select File” screen will appear, from which you can select the CSV file to be read.
作業ディレクトリの設定がうまくできない場合は,次の行を実行してCSVファイルを読み込む.実行すると「ファイルを選択」画面が出てきて,そこから読み込むCSVファイルを選択することができる.
apple <- read.csv(file.choose())
Construct frequency table of vehicle type sold.
販売された車種の度数分布表を作成.
apple$VehicleType == "Sedan" # logical ... TRUE or FALSE
## [1] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## [13] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [49] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
## [61] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [73] TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
## [97] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [109] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [121] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
## [133] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [157] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
1 * (apple$VehicleType == "Sedan") # =1 if TRUE, =0 if FALSE
## [1] 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 0
## [38] 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 1 0
## [75] 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0 1 0
## [112] 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0
## [149] 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0
sum(apple$VehicleType == "Sedan") # the number of TRUE = the number of Sedan
## [1] 68
table(apple$VehicleType) # Frequency table
##
## Compact Hybrid Sedan SUV Truck
## 28 8 68 57 19
barplot(table(apple$VehicleType))
pie(table(apple$VehicleType))
Add percentage to labels.
ラベルに「%」を追加する.
apple_type_table <- table(apple$VehicleType)
apple_label <- paste(names(apple_type_table), " ",
round(100 * apple_type_table / sum(apple_type_table), 1), "%", sep = "")
pie(apple_type_table, labels = apple_label)
apple$Profit[1:20] # first 20 profit data
## [1] 1387 1754 1817 1040 1273 1529 3082 1951 2692 1206 1342 443 754 1621 870
## [16] 1174 1412 1809 2415 1546
c(200, 600, 1000, 1400, 1800, 2200, 2600, 3000)
## [1] 200 600 1000 1400 1800 2200 2600 3000
apple_profit_breaks <- seq(from = 200, to = 3400, by = 400) # generating sequence
apple_profit_interval <- cut(apple$Profit, breaks = apple_profit_breaks, right = FALSE)
# [right = F] means intervals should not be closed on the right
table(apple_profit_interval)
## apple_profit_interval
## [200,600) [600,1e+03) [1e+03,1.4e+03) [1.4e+03,1.8e+03)
## 8 11 23 38
## [1.8e+03,2.2e+03) [2.2e+03,2.6e+03) [2.6e+03,3e+03) [3e+03,3.4e+03)
## 45 32 19 4
table(apple_profit_interval) / sum(table(apple_profit_interval)) # relative frequency
## apple_profit_interval
## [200,600) [600,1e+03) [1e+03,1.4e+03) [1.4e+03,1.8e+03)
## 0.04444444 0.06111111 0.12777778 0.21111111
## [1.8e+03,2.2e+03) [2.2e+03,2.6e+03) [2.6e+03,3e+03) [3e+03,3.4e+03)
## 0.25000000 0.17777778 0.10555556 0.02222222
hist(apple$Profit)
hist(apple$Profit, breaks = apple_profit_breaks) # set breakpoints defined above
hist(apple$Profit, breaks = 30) # set the number of cells
mean(apple$Profit) # mean
## [1] 1843.167
var(apple$Profit) # sample variance
## [1] 414256.6
max(apple$Profit); min(apple$Profit) # maximum, minimum
## [1] 3292
## [1] 294
max(apple$Profit) - min(apple$Profit) # range
## [1] 2998
cor(apple$Profit, apple$Age) # correlation
## [1] 0.261529
quantile(c(1, 2, 4, 7, 8, 11, 13, 13, 15, 16, 18)) # returns quartiles as default 四分位数
## 0% 25% 50% 75% 100%
## 1.0 5.5 11.0 14.0 18.0
quantile(apple$Profit)
## 0% 25% 50% 75% 100%
## 294.0 1422.5 1882.5 2268.5 3292.0
summary(apple$Profit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 294 1422 1882 1843 2268 3292
The summary function also reports the quartiles, but
with a slightly different value than when using the
quantile function. See https://en.wikipedia.org/wiki/Quartile.
summary関数も四分位点を報告するが,quantile関数を使った場合と少し値が違う.
quantile(apple$Profit, probs = seq(0.1, 0.9, by = 0.1)) # deciles 十分位数
## 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 994.2 1315.6 1531.1 1737.0 1882.5 2028.4 2202.8 2394.2 2668.6
quantile(apple$Profit, probs = 0.35) # 35th percentile
## 35%
## 1633.8
median(apple$Profit) # cf. median 中央値
## [1] 1882.5
boxplot(apple$Profit)
apple_profit_sedan <- apple$Profit[apple$VehicleType == "Sedan"] # profit from sedan
apple_profit_suv <- apple$Profit[apple$VehicleType == "SUV"]
boxplot(apple_profit_sedan, apple_profit_suv, names = c("Sedan", "SUV"), ylab = "Profit ($)")
boxplot(apple$Age) # three outliers
plot(apple$Age, apple$Profit, xlab = "Age of customer (years)", ylab = "Proft per vehicle ($)")
Same data but difference scale (on y-axis).
tempx <- 1:10
set.seed(1)
tempy <- tempx + rnorm(10, sd = 4)
par(mfrow = c(1, 2))
plot(tempx, tempy, xlab = "x", ylab = "y")
plot(tempx, tempy, xlab = "x", ylab = "y", ylim = c(-50, 50))
The left figure appears to show a positive correlation, while the right figure will look uncorrelated.
左の図は正の相関を示しているように見えるが,右の図は無相関に見えるだろう.
Specify two arguments, e.g., location of dealership and vehicle type,
in the table function.
table(apple$Location, apple$VehicleType)
##
## Compact Hybrid Sedan SUV Truck
## Kane 3 1 20 21 7
## Olean 9 2 17 8 4
## Sheffield 8 5 16 11 5
## Tionesta 8 0 15 17 3
.