1 Probability / 確率

1.1 Z score

The area of the standard normal distribution to the left of \(z = 1.96\) is the probability of \(\Pr(Z \le 1.96)\).

The function name pnorm comes from p (probability) + norm (normal distribution). pnorm is a function that returns the probability corresponding to a given z-score.

pnorm(q = 1.96, mean = 0, sd = 1)

## [1] 0.9750021

Conversely, qnorm is a function that returns the z-score corresponding to a given probability.

qnorm(p = 0.975, mean = 0, sd = 1)

## [1] 1.959964

qnorm(p = 0.975)  # for N(0,1), the mean and sd arguments may be omitted

## [1] 1.959964

2 Sampling / サンプリング

2.1 Preparation for simulation / シミュレーションの準備

2.1.1 Iterative calculations / 繰り返し計算

To perform iterative calculations, we can use the for statement.

繰り返し計算をするにはfor文を使えばよい．

Integer vectors such as {1, 2, 3, 4, … , n} can be generated by writing 1:n using a colon.
The syntax for (i in 1:n) { ... } means that the operations in the curly brackets { ... } are computed for \(i=1, i=2, i=3, ..., i=n\) sequentially.

1:3

## [1] 1 2 3

for (i in 1:3) {  # i = 1, 2, 3
  print(i)
}

## [1] 1
## [1] 2
## [1] 3

i_cumulative <- 0  # create an object to store cumulative values
for (i in 1:3) {
  i_cumulative <- i_cumulative + i
  print(i_cumulative)
}

## [1] 1
## [1] 3
## [1] 6

i_cumulative  # should be 6 (found by 1+2+3)

## [1] 6

2.1.2 Combining vectors / ベクトルの結合

We can combine multiple vectors (or a single value) into a single vector by using the function c. The function name comes from “combine.”

複数のベクトル（またはスカラー）を結合するには関数 c を使う．関数名はcombineから来ている．

c(1, 5)

## [1] 1 5

c(1, 5:7)

## [1] 1 5 6 7

2.1.3 Generating random numbers / 乱数の生成

To draw random numbers from a uniform distribution, use the runif function. The function name comes from r (random) + unif (uniform distribution).

一様分布から乱数をドローするには，runif関数を使う．この関数名は r (random) + unif (uniform distribution) から来ている．

runif(n = 5, min = 0, max = 1)

## [1] 0.3365511 0.2085515 0.2709944 0.0499206 0.6640037

2.2 Central limit theorem / 中心極限定理

Find the distribution of the sample mean of the uniform random variable \(x \sim U(0, 1)\). We see that the larger the sample size, the closer to a normal distribution, and the smaller the variance of the distribution. Such a simulation is called a Monte Carlo simulation.

一様確率変数の標本平均の分布を求める．標本サイズが大きくなるほど正規分布に近づき，分布の分散が小さくなることが分かる．このようなシミュレーションを，モンテカルロシミュレーションと呼ぶ．

2.2.1 \(n=1\)

This is equivalent to the distribution of the random numbers themselves.

これは乱数そのものの分布と等しい．

\[ \bar{x} = x_i, \quad x_i \sim U(0, 1) \]

x_bar <- NULL  # create empty (null) object
for (i in 1:10000) {
  x_bar_i <- mean(runif(n = 1, min = 0, max = 1))
  x_bar <- c(x_bar, x_bar_i)
}
hist(x_bar, breaks = 50, main = "Sample mean of uniform random variables with sample size n=1")

2.2.2 \(n=2\)

\[ \bar{x} = \frac{1}{2} \sum_{i=1}^{2} x_i, \quad x_i \sim U(0, 1) \]

x_bar <- NULL
for (i in 1:10000) {
  x_bar_i <- mean(runif(n = 2, min = 0, max = 1))
  x_bar <- c(x_bar, x_bar_i)
}
hist(x_bar, breaks = 50, main = "Sample mean of uniform random variables with sample size n=2")

2.2.3 \(n=10\)

\[ \bar{x} = \frac{1}{10} \sum_{i=1}^{10} x_i, \quad x_i \sim U(0, 1) \]

x_bar <- NULL
for (i in 1:10000) {
  x_bar_i <- mean(runif(n = 10, min = 0, max = 1))
  x_bar <- c(x_bar, x_bar_i)
}
hist(x_bar, breaks = 50, main = "Sample mean of uniform random variables with sample size n=10",
     xlim = c(0, 1))

2.2.4 For advanced students / 上級者向け

In fact, we can also write the following using the sapply function without using the for function:

実は，for文を使わずにsapply関数を使って次のように書くこともできる：

hist(sapply(X = 1:10000, FUN = function (x) mean(runif(10)) ), breaks = 50)

3 Confidence interval / 信頼区間

3.1 Population mean with known \(\sigma\) / 標準偏差が既知の場合の母平均

For a sample mean of 100, population standard deviation of 2, and sample size of 100, the confidence interval is calculated as follows:

\[ \bar{x} \pm z \frac{\sigma}{\sqrt{n}} = 100 \pm 1.96 \frac{2}{\sqrt{100}} = 100 \pm 0.392, \quad \mbox{95% CI}: [99.61, 100.39] \]

qnorm(p = 0.025, mean = 0, sd = 1)  # P(Z<z) = 0.025

## [1] -1.959964

qnorm(p = 0.975, mean = 0, sd = 1)  # P(Z<z) = 0.975

## [1] 1.959964

100 + qnorm(p = 0.025) * 2 / sqrt(100)  # lower limit

## [1] 99.60801

100 + qnorm(p = 0.975) * 2 / sqrt(100)  # upper limit

## [1] 100.392

3.2 Population mean with unknown \(\sigma\) / 標準偏差が未知の場合の母平均

For a sample mean of 100, sample standard deviation of 2, and sample size of 100, the confidence interval is calculated as follows:

\[ \bar{x} \pm t \frac{s}{\sqrt{n}} = 100 \pm 1.98 \frac{2}{\sqrt{100}} = 100 \pm 0.396, \quad \mbox{95% CI}: [99.60, 100.40] \]

Note that we use t-value, not z-value.

qt(p = 0.025, df = 100-1)  # P(T<t) = 0.025

## [1] -1.984217

100 + qt(p = 0.025, df = 100-1) * 2 / sqrt(100)  # lower limit

## [1] 99.60316

100 + qt(p = 0.975, df = 100-1) * 2 / sqrt(100)  # upper limit

## [1] 100.3968

3.2.1 Using `t.test` function

The function t.test, which performs a \(t\)-test (to be covered in the next class), also outputs a confidence interval.

\(t\)検定（次回以降の授業で勉強します）を行うt.testという関数は信頼区間をついでに出力してくれる．

wage <- c(1000, 1200, 1300, 1200, 1150, 1000, 1450, 1500, 1150, 1350)
mean(wage)  # mean

## [1] 1230

sd(wage)  # standard deviation

## [1] 170.2939

length(wage)  # sample size (length of the vector)

## [1] 10

mean(wage) + qt(p = 0.025, df = length(wage)-1) * sd(wage) / sqrt(length(wage))  # lower bound

## [1] 1108.179

mean(wage) + qt(p = 0.975, df = length(wage)-1) * sd(wage) / sqrt(length(wage))  # upper bound

## [1] 1351.821

t.test(wage)  # default: 95% CI

## 
##  One Sample t-test
## 
## data:  wage
## t = 22.841, df = 9, p-value = 2.806e-09
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1108.179 1351.821
## sample estimates:
## mean of x 
##      1230

t.test(wage, conf.level = 0.99)  # 99% CI

## 
##  One Sample t-test
## 
## data:  wage
## t = 22.841, df = 9, p-value = 2.806e-09
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
##  1054.991 1405.009
## sample estimates:
## mean of x 
##      1230

3.3 Population proportion / 母比率

\[ p \pm z \sqrt{\frac{p (1-p)}{n}} \]

Example of Banerjee, Duflo, and Glennerster (BMJ 2010)

Of the 382 children, 148 received immunizations.

382人の子供のうち148人がワクチンの予防接種を受けた．

n1 <- 148; n0 <- 382  # intervention B
p <- n1 / n0
p

## [1] 0.3874346

p + qnorm(p = .025) * sqrt(p * (1 - p) / n0)  # lower limit of 95% CI

## [1] 0.3385815

p + qnorm(p = .975) * sqrt(p * (1 - p) / n0)  # upper limit

## [1] 0.4362876

In fact, these figures differ from those reported in the paper. Because a cluster randomized controlled trial was performed using geographic blocks (at the village level), it is likely that the authors have performed a more complex calculation to account for the hierarchical nature of data.

実は，この数値は論文中で報告されている数値とは異なる．地理的なブロック（村）を用いてクラスター無作為化対照試験が実行されたため，著者らはこのデータの階層性を考慮したより複雑な計算をしていると思われる．

4 Statistical hypothesis testing / 統計的仮説検定

4.1 One-sample \(t\) test for population mean / 1標本の母平均の \(t\) 検定

4.1.1 Calculation from the sample mean and sd of the data / データの標本平均・標準偏差から計算

Wage population is normally distributed, \(\sigma\) is unknown, \(\bar{x} = 1230\), \(s=170.3\), \(n = 10\), \(\alpha=0.05\)

\[ H_0: \mu = 1100, \quad H_1: \mu \ne 1100 \]

\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1230 - 1110}{170.3 / \sqrt{10}} \approx 2.41 \] \[ \Pr (T \le 2.41) = 0.98 , \quad 2 \times (1 - \Pr (T \le 2.41)) \approx 0.039 \]

t_value <- (1230 - 1100) / (170.3 / sqrt(10))  # t value
t_value

## [1] 2.413952

pt(q = t_value, df = 10 - 1)  # Pr(T <= t)

## [1] 0.9805023

1 - pt(q = t_value, df = 10 - 1)  # one sided p value ... p = Pr(T > t)

## [1] 0.01949773

2 * (1 - pt(q = t_value, df = 10 - 1))  # two sided p value

## [1] 0.03899546

Thus, \(H_0\) is rejected and \(H_1\) is accepted.

4.1.2 Calculation from the data itself / データそのものから計算

wage <- c(1000, 1200, 1300, 1200, 1150, 1000, 1450, 1500, 1150, 1350)
mean(wage); sd(wage)

## [1] 1230

## [1] 170.2939

t.test(wage, mu = 1100)  # default: two sided ... H1 is [mu /= 1150]

## 
##  One Sample t-test
## 
## data:  wage
## t = 2.414, df = 9, p-value = 0.03899
## alternative hypothesis: true mean is not equal to 1100
## 95 percent confidence interval:
##  1108.179 1351.821
## sample estimates:
## mean of x 
##      1230

t.test(wage, mu = 1100, alternative = "greater")  # one sided ... H1 is [mu > 1150]

## 
##  One Sample t-test
## 
## data:  wage
## t = 2.414, df = 9, p-value = 0.01949
## alternative hypothesis: true mean is greater than 1100
## 95 percent confidence interval:
##  1131.284      Inf
## sample estimates:
## mean of x 
##      1230

In this way, the confidence intervals are calculated at the same time.

このように信頼区間も同時に計算してくれる．

4.2 Two-sample \(t\) test for population mean / 2標本の母平均の \(t\) 検定

4.2.1 Independent \(t\) test

4.2.1.1 Example of “wage” data

wage_jp <- c(1000, 1200, 1300, 1200, 1150, 1000, 1450, 1500, 1150, 1350)  # Japan
wage_us <- c(900, 1300, 1200, 800, 1600, 850, 1000, 950)  # US
t.test(wage_jp, wage_us)  # default: Welch's t test (assuming unequal variance)

## 
##  Welch Two Sample t-test
## 
## data:  wage_jp and wage_us
## t = 1.4041, df = 11.205, p-value = 0.1874
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -87.42307 397.42307
## sample estimates:
## mean of x mean of y 
##      1230      1075

t.test(wage_jp, wage_us, var.equal = TRUE)  # t test (assuming equal variance)

## 
##  Two Sample t-test
## 
## data:  wage_jp and wage_us
## t = 1.479, df = 16, p-value = 0.1586
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -67.16377 377.16377
## sample estimates:
## mean of x mean of y 
##      1230      1075

4.2.1.2 Applewood data

Use the t.test function with the tidyverse package.

library(tidyverse)
setwd("c://ws_stat")
apple <- read.csv("Statistics_Data_20231008.csv")

If the above does not work, try using file.choose() as shown below.

apple <- read.csv(file.choose())

profit_sedan <- apple %>% filter(VehicleType == "Sedan") %>% select(Profit)
profit_suv <- apple %>% filter(VehicleType == "SUV") %>% select(Profit)
t.test(profit_sedan, profit_suv)

## 
##  Welch Two Sample t-test
## 
## data:  profit_sedan and profit_suv
## t = -1.3386, df = 117.65, p-value = 0.1833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -384.7004   74.3851
## sample estimates:
## mean of x mean of y 
##  1747.544  1902.702

You can also specify target_variable ~ group_variable as shown below.

apple %>%
  filter(VehicleType %in% c("Sedan", "SUV")) %>% # the number of groups must be exactly two
  t.test(Profit ~ VehicleType, data = .)

## 
##  Welch Two Sample t-test
## 
## data:  Profit by VehicleType
## t = -1.3386, df = 117.65, p-value = 0.1833
## alternative hypothesis: true difference in means between group Sedan and group SUV is not equal to 0
## 95 percent confidence interval:
##  -384.7004   74.3851
## sample estimates:
## mean in group Sedan   mean in group SUV 
##            1747.544            1902.702

4.2.1.3 Infant Birth Weight and mother’s smoking / 赤ちゃんの出生時体重と母親の喫煙

Do babies of mothers who smoked during pregnancy weigh less? Note, however, that testing for differences in means does not identify causal effects.

妊娠中に喫煙していた母親の赤ちゃんの体重は軽いのか？ただし，平均値の差の検定をしても因果効果を識別できないことに留意してください．

Data source: Hosmer and Lemeshow (1989; 2000 2nd ed) Applied Logistic Regression, John Wiley and Sons. (according to: Venables and Ripley, 2002, p. 194)

library(tidyverse)

bw <- MASS::birthwt  # "birthwt" data in MASS package
boxplot(bwt ~ smoke, data = bw, xlab = "Mother's smoking status during pregnancy", 
        ylab = "Infant birth weight (in grams)", names = c("No smoking (label: 0)", "Smoking (1)"))

bw %>% 
  group_by(smoke) %>% 
  summarise(n = n(), mean = mean(bwt), sd = sd(bwt), median = median(bwt))

## # A tibble: 2 × 5
##   smoke     n  mean    sd median
##   <int> <int> <dbl> <dbl>  <dbl>
## 1     0   115 3056.  753.  3100 
## 2     1    74 2772.  660.  2776.

t.test(bwt ~ smoke, data = bw)

## 
##  Welch Two Sample t-test
## 
## data:  bwt by smoke
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   78.57486 488.97860
## sample estimates:
## mean in group 0 mean in group 1 
##        3055.696        2771.919

4.2.1.4 Miller et al. (AJPS 2016)

Miller, Saunders, and Farhart (AJPS 2016) / Replication Data in Harvard Dataverse

Do those who self-identify as conservatives believe more in conspiracy theories such as “Obama was not born in the U.S.” than those who self-identify as liberals?

自らを保守と自認する人は，自らをリベラルと自認する人と比較して，「オバマは米国で生まれなかった」などの陰謀論をより信じているのか？

conservative_d: dummy variable for self-identified as conservative
conspiracy1: Conspiracy: Obama was not born in the U.S.
conspiracy3: Conspiracy: Government knew about 9/11 prior to attacks

setwd("c://ws_stat")
cons <- read.csv("conspiracy.csv")
library(tidyverse)

# table(cons$Conservative)  # frequency distribution table 
# head(cons %>% select(Conservative, conspiracy1, conspiracy3))
cons <- cons %>% 
  filter(Conservative != "") %>% # remove empty observations
  mutate(conservative_d = 1*(Conservative == "Conservative"))
# table(cons$conservative_d)  # 0: Liberal, 1: Conservative 
t.test(conspiracy1 ~ conservative_d, data = cons)  # Obama

## 
##  Welch Two Sample t-test
## 
## data:  conspiracy1 by conservative_d
## t = -22.968, df = 958.13, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.3345162 -0.2818512
## sample estimates:
## mean in group 0 mean in group 1 
##      0.09259259      0.40077633

t.test(conspiracy3 ~ conservative_d, data = cons)  # 9/11

## 
##  Welch Two Sample t-test
## 
## data:  conspiracy3 by conservative_d
## t = 0.84234, df = 1224.1, p-value = 0.3998
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.01682104  0.04213250
## sample estimates:
## mean in group 0 mean in group 1 
##       0.4337084       0.4210526

4.2.2 Paired (dependent) \(t\) test

wage_w <- c(1000, 1200, 1300, 1200, 1150, 1000, 1450, 1500, 1150, 1350)  # wife
wage_h <- c(900, 1300, 1200, 800, 1600, 850, 1000, 950, 1200, 1400)  # husband
# length(wage_w); length(wage_h)  # = 10
t.test(wage_w, wage_h, paired = TRUE)

## 
##  Paired t-test
## 
## data:  wage_w and wage_h
## t = 1.1638, df = 9, p-value = 0.2744
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -103.8108  323.8108
## sample estimates:
## mean difference 
##             110

t.test(wage_w - wage_h)  # same as an ordinary "one-sample t test"

## 
##  One Sample t-test
## 
## data:  wage_w - wage_h
## t = 1.1638, df = 9, p-value = 0.2744
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -103.8108  323.8108
## sample estimates:
## mean of x 
##       110

4.3 Multiple testing problem: Simulation / 多重検定の問題に関するシミュレーション

One-sample test for population mean: \(H_0: \mu = 0\)

p_value_list <- NULL
for (i in 1:1000) {
  p_value_list <- c(p_value_list, t.test(rnorm(100, mean = 0, sd = 1))$p.value)
}
hist(p_value_list, breaks = 20)

5 Regression / 回帰分析

5.1 Used condominiums data / 中古マンションデータ

Data source: Kano, Shigeki “New Econometrics” Nippon Hyoron Sha, 2015. (鹿野『新しい計量経済学』)

Data on 194 used condominiums transacted in Setagaya-ku, Tokyo from January to March 2010
Variables: transaction price (10,000 JPY), walking time to the nearest station (minutes), age of the building (year), area (m2), studio (= 1 if studio apartment)

setwd("c://ws_stat")
ap <- read.csv("apartment.csv")

If the setewd or read.csv functions do not work, use the following command to select and download the csv file using the GUI.

ap <- read.csv(file.choose())

head(ap, 1)

##   id price min age area studio
## 1  1   620   5  26   15      1

hist(ap$price, breaks = 20)

ap <- ap %>% filter(price <= 10000)  # remove outlier
plot(x = ap$age, y = ap$price, xlab = "Age of the building", 
     ylab = "Transaction price", pch = 20)

cor(ap[, -1])  # correlation coefficient matrix

##             price         min         age        area      studio
## price   1.0000000  0.20241142 -0.43638408  0.84137685 -0.56786171
## min     0.2024114  1.00000000 -0.04834375  0.33528104 -0.25551625
## age    -0.4363841 -0.04834375  1.00000000 -0.09421416 -0.07971788
## area    0.8413768  0.33528104 -0.09421416  1.00000000 -0.65127492
## studio -0.5678617 -0.25551625 -0.07971788 -0.65127492  1.00000000

5.2 Regression / 回帰

\[ price = \beta_0 + \beta_1 age + e \]

lm(price ~ age, data = ap)  # regression (lm: linear model)

## 
## Call:
## lm(formula = price ~ age, data = ap)
## 
## Coefficients:
## (Intercept)          age  
##      4745.5        -70.5

\[ price = \beta_0 + \beta_1 age + \beta_2 area + e \]

lm(price ~ age + area, data = ap)

## 
## Call:
## lm(formula = price ~ age + area, data = ap)
## 
## Coefficients:
## (Intercept)          age         area  
##     1317.02       -58.21        61.95

5.3 Omitted variable bias / 除外変数バイアス

cor(ap[, c("price", "min", "area")])

##           price       min      area
## price 1.0000000 0.2024114 0.8413768
## min   0.2024114 1.0000000 0.3352810
## area  0.8413768 0.3352810 1.0000000

lm(price ~ min, data = ap)

## 
## Call:
## lm(formula = price ~ min, data = ap)
## 
## Coefficients:
## (Intercept)          min  
##     3063.05        69.15

lm(price ~ min + area, data = ap)

## 
## Call:
## lm(formula = price ~ min + area, data = ap)
## 
## Coefficients:
## (Intercept)          min         area  
##      458.21       -30.67        66.86

5.4 Dummy variables / ダミー変数

table(ap$studio)  # = 1 if studio apartment

## 
##   0   1 
## 156  37

lm(price ~ min + area + studio, data = ap)

## 
## Call:
## lm(formula = price ~ min + area + studio, data = ap)
## 
## Coefficients:
## (Intercept)          min         area       studio  
##      603.36       -31.25        64.88      -189.96

5.5 Quadratic function model / 二次関数モデル

# lm(price ~ age + area, data = ap)
lm(price ~ age + area + I(area^2), data = ap)

## 
## Call:
## lm(formula = price ~ age + area + I(area^2), data = ap)
## 
## Coefficients:
## (Intercept)          age         area    I(area^2)  
##    631.4558     -60.2052      93.3760      -0.2799

5.6 Interaction term / 交差項

# lm(price ~ area + studio, data = ap)
lm(price ~ area + studio + area:studio, data = ap)  # [X1:X2] is [X1 times x2]

## 
## Call:
## lm(formula = price ~ area + studio + area:studio, data = ap)
## 
## Coefficients:
## (Intercept)         area       studio  area:studio  
##      489.99        61.75     -1571.53        67.86

5.7 Hypothesis testing of regression coefficient / 回帰係数の検定

summary(lm(price ~ age + area + studio, data = ap))

## 
## Call:
## lm(formula = price ~ age + area + studio, data = ap)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2888.25  -400.05    55.35   412.64  1913.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1815.460    212.234   8.554 4.02e-15 ***
## age          -61.138      4.722 -12.949  < 2e-16 ***
## area          55.457      2.945  18.829  < 2e-16 ***
## studio      -597.632    180.056  -3.319  0.00108 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 734.1 on 189 degrees of freedom
## Multiple R-squared:  0.8456, Adjusted R-squared:  0.8431 
## F-statistic:   345 on 3 and 189 DF,  p-value: < 2.2e-16

5.8 Miller et al. (AJPS 2016)

Miller, Saunders, and Farhart (AJPS 2016) / Replication Data in Harvard Dataverse

Does knowledge of politics have different effects on belief in conservative conspiracy theories (e.g., Obama was not born in the US) for conservatives and the rest of the population (liberals)?

政治に関する知識が保守層の間で有名な陰謀論を信じることに対する影響は，保守層とそれ以外（リベラル層）で異なるのか？

\[ (\mbox{conservative-conspiracy}) = \beta_0 + \beta_1 (\mbox{conservative}) + \beta_2 (\mbox{knowledge}) + \beta_3 (\mbox{conservative}) \times (\mbox{knowledge}) \]

con_index_rep2: conservative conspiracy index
conservative_d: dummy variable for self-identified as conservative
polknow_alt: knowledge of politics

setwd("c://ws_stat")
cons <- read.csv("conspiracy.csv")
library(tidyverse)
cons <- cons %>% 
  filter(Conservative != "") %>% # remove empty observations
  mutate(conservative_d = 1*(Conservative == "Conservative"))
# summary(cons %>% select(con_index_rep2, conservative_d, polknow_alt))
fm <- con_index_rep2 ~ conservative_d + polknow_alt + conservative_d:polknow_alt
summary(lm(formula = fm, data = cons))

## 
## Call:
## lm(formula = fm, data = cons)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4536 -0.1154 -0.0197  0.1023  0.6575 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.39307    0.01349  29.143   <2e-16 ***
## conservative_d              0.03912    0.02309   1.694   0.0904 .  
## polknow_alt                -0.31234    0.01945 -16.061   <2e-16 ***
## conservative_d:polknow_alt  0.33544    0.03333  10.066   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1675 on 2196 degrees of freedom
##   ( 3 個の観測値が欠損のため削除されました )
## Multiple R-squared:  0.3881, Adjusted R-squared:  0.3873 
## F-statistic: 464.3 on 3 and 2196 DF,  p-value: < 2.2e-16

Note: The estimates above differ slightly from Table 2 (p. 832) in the paper because we do not include the control variables used in the main analysis in the paper.

The more knowledgeable a liberal citizen is, the less they believe in conspiracy theories. On the other hand, the more knowledgeable a conservative citizen is, the more they believe in conspiracy theories.

リベラルの市民は知識があるほど陰謀論を信じなくなる一方で，保守の市民は知識があるほど陰謀論を信じる．

6 Discrete choice model: Binary responce / 二値離散選択

Titanic data

Source: Kaggle (CC0: Public domain)
Survived: =1 if Yes, =0 if No
Pclass: Ticket class … 1 = 1st, 2 = 2nd, 3 = 3rd

titanic <- read.csv("https://www.dropbox.com/scl/fi/sup1eqcpn7ffg8puweffl/Titanic.csv?rlkey=fp63rjplysexua8fhhdjz5qfw&dl=1")
# table(titanic$Survived, titanic$Sex)
titanic %>% filter(Sex == "male") %>% select(Survived, Pclass) %>% table  # Male

##         Pclass
## Survived   1   2   3
##        0  77  91 300
##        1  45  17  47

titanic %>% filter(Sex == "female") %>% select(Survived, Pclass) %>% table  # Female

##         Pclass
## Survived  1  2  3
##        0  3  6 72
##        1 91 70 72

6.1 Linear probability model

\[ \mbox{Survived} = \beta_0 + \beta_1 (\mbox{Sex}) + \beta_2 (\mbox{Age}) + \beta_3 (\mbox{2nd class}) + \beta_4 (\mbox{3rd class}) \]

summary(lm(Survived ~ Sex + Age + as.factor(Pclass), data = titanic))

## 
## Call:
## lm(formula = Survived ~ Sex + Age + as.factor(Pclass), data = titanic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11410 -0.25081 -0.06422  0.23015  1.00676 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.125021   0.050671  22.202  < 2e-16 ***
## Sexmale            -0.479456   0.030718 -15.608  < 2e-16 ***
## Age                -0.005460   0.001084  -5.039 5.96e-07 ***
## as.factor(Pclass)2 -0.207747   0.041689  -4.983 7.86e-07 ***
## as.factor(Pclass)3 -0.406618   0.038288 -10.620  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3849 on 709 degrees of freedom
##   ( 177 個の観測値が欠損のため削除されました )
## Multiple R-squared:  0.3902, Adjusted R-squared:  0.3867 
## F-statistic: 113.4 on 4 and 709 DF,  p-value: < 2.2e-16

6.2 Probit model

\[ \Pr(\mbox{Survived} = 1) = \Phi ( \beta_0 + \beta_1 (\mbox{Sex}) + \beta_2 (\mbox{Age}) + \beta_3 (\mbox{2nd class}) + \beta_4 (\mbox{3rd class}) ) \] \[ \mbox{Marginal effect}_k = \beta_k \frac{1}{n} \sum_i \phi \left( \beta_0 + \sum_j \beta_j x_{ij} \right) \]

\(\Phi\): normal distribution function
\(\phi\): normal density function

# install.packages("mfx")
library(mfx)

probitmfx(Survived ~ Sex + Age + as.factor(Pclass), data = titanic)

## Call:
## probitmfx(formula = Survived ~ Sex + Age + as.factor(Pclass), 
##     data = titanic)
## 
## Marginal Effects:
##                         dF/dx  Std. Err.        z     P>|z|    
## Sexmale            -0.5406375  0.0356417 -15.1687 < 2.2e-16 ***
## Age                -0.0077961  0.0016571  -4.7047 2.543e-06 ***
## as.factor(Pclass)2 -0.2635377  0.0492177  -5.3545 8.578e-08 ***
## as.factor(Pclass)3 -0.5123511  0.0467708 -10.9545 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## dF/dx is for discrete change for the following variables:
## 
## [1] "Sexmale"            "as.factor(Pclass)2" "as.factor(Pclass)3"

6.3 Logit model

\[ \Pr(\mbox{Survived} = 1) = \Lambda ( \beta_0 + \beta_1 (\mbox{Sex}) + \beta_2 (\mbox{Age}) + \beta_3 (\mbox{2nd class}) + \beta_4 (\mbox{3rd class}) ) \]

logitmfx(Survived ~ Sex + Age + as.factor(Pclass), data = titanic)

## Call:
## logitmfx(formula = Survived ~ Sex + Age + as.factor(Pclass), 
##     data = titanic)
## 
## Marginal Effects:
##                         dF/dx  Std. Err.        z     P>|z|    
## Sexmale            -0.5553141  0.0364508 -15.2346 < 2.2e-16 ***
## Age                -0.0086430  0.0017856  -4.8404 1.296e-06 ***
## as.factor(Pclass)2 -0.2686912  0.0486271  -5.5255 3.285e-08 ***
## as.factor(Pclass)3 -0.5420565  0.0470137 -11.5298 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## dF/dx is for discrete change for the following variables:
## 
## [1] "Sexmale"            "as.factor(Pclass)2" "as.factor(Pclass)3"

6.4 Without `mfx` package

Use built-in glm function.

# Probit
probit_titanic <- glm(Survived ~ Sex + Age + as.factor(Pclass), data = titanic, 
                      family = binomial(link = "probit"))
summary(probit_titanic)

## 
## Call:
## glm(formula = Survived ~ Sex + Age + as.factor(Pclass), family = binomial(link = "probit"), 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8568  -0.7036  -0.4020   0.6593   2.4823  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.163290   0.219778   9.843  < 2e-16 ***
## Sexmale            -1.485373   0.115945 -12.811  < 2e-16 ***
## Age                -0.020360   0.004331  -4.701 2.59e-06 ***
## as.factor(Pclass)2 -0.755229   0.159447  -4.737 2.17e-06 ***
## as.factor(Pclass)3 -1.447517   0.155482  -9.310  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 649.94  on 709  degrees of freedom
##   ( 177 個の観測値が欠損のため削除されました )
## AIC: 659.94
## 
## Number of Fisher Scoring iterations: 5

# marginal effect
# summary(dnorm(predict(probit_titanic)) * coef(probit_titanic)["Sexmale"])

# Logit
logit_titanic <- glm(Survived ~ Sex + Age + as.factor(Pclass), data = titanic, 
                     family = binomial(link = "logit"))
summary(logit_titanic)

## 
## Call:
## glm(formula = Survived ~ Sex + Age + as.factor(Pclass), family = binomial(link = "logit"), 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7303  -0.6780  -0.3953   0.6485   2.4657  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         3.777013   0.401123   9.416  < 2e-16 ***
## Sexmale            -2.522781   0.207391 -12.164  < 2e-16 ***
## Age                -0.036985   0.007656  -4.831 1.36e-06 ***
## as.factor(Pclass)2 -1.309799   0.278066  -4.710 2.47e-06 ***
## as.factor(Pclass)3 -2.580625   0.281442  -9.169  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 647.28  on 709  degrees of freedom
##   ( 177 個の観測値が欠損のため削除されました )
## AIC: 657.28
## 
## Number of Fisher Scoring iterations: 5

# marginal effect
# summary(dnorm(predict(logit_titanic)) * coef(logit_titanic)["Sexmale"])

Inferential statistics / 推測統計学

Sho Kuroda

2023-11-14 (update: 2024-01-10)