The area of the standard normal distribution to the left of \(z = 1.96\) is the probability of \(\Pr(Z \le 1.96)\).
The function name pnorm comes from p (probability) +
norm (normal distribution). pnorm is a function that
returns the probability corresponding to a given z-score.
pnorm(q = 1.96, mean = 0, sd = 1)
## [1] 0.9750021
Conversely, qnorm is a function that returns the z-score
corresponding to a given probability.
qnorm(p = 0.975, mean = 0, sd = 1)
## [1] 1.959964
qnorm(p = 0.975) # for N(0,1), the mean and sd arguments may be omitted
## [1] 1.959964
To perform iterative calculations, we can use the for
statement.
繰り返し計算をするにはfor文を使えばよい.
1:n using a colon.for (i in 1:n) { ... } means that the
operations in the curly brackets { ... } are computed for
\(i=1, i=2, i=3, ..., i=n\)
sequentially.1:3
## [1] 1 2 3
for (i in 1:3) { # i = 1, 2, 3
print(i)
}
## [1] 1
## [1] 2
## [1] 3
i_cumulative <- 0 # create an object to store cumulative values
for (i in 1:3) {
i_cumulative <- i_cumulative + i
print(i_cumulative)
}
## [1] 1
## [1] 3
## [1] 6
i_cumulative # should be 6 (found by 1+2+3)
## [1] 6
We can combine multiple vectors (or a single value) into a single
vector by using the function c. The function name comes
from “combine.”
複数のベクトル(またはスカラー)を結合するには関数 c
を使う.関数名はcombineから来ている.
c(1, 5)
## [1] 1 5
c(1, 5:7)
## [1] 1 5 6 7
To draw random numbers from a uniform distribution, use the
runif function. The function name comes from r (random) +
unif (uniform distribution).
一様分布から乱数をドローするには,runif関数を使う.この関数名は
r (random) + unif (uniform distribution) から来ている.
runif(n = 5, min = 0, max = 1)
## [1] 0.3365511 0.2085515 0.2709944 0.0499206 0.6640037
Find the distribution of the sample mean of the uniform random variable \(x \sim U(0, 1)\). We see that the larger the sample size, the closer to a normal distribution, and the smaller the variance of the distribution. Such a simulation is called a Monte Carlo simulation.
一様確率変数の標本平均の分布を求める.標本サイズが大きくなるほど正規分布に近づき,分布の分散が小さくなることが分かる.このようなシミュレーションを,モンテカルロシミュレーションと呼ぶ.
This is equivalent to the distribution of the random numbers themselves.
これは乱数そのものの分布と等しい.
\[ \bar{x} = x_i, \quad x_i \sim U(0, 1) \]
x_bar <- NULL # create empty (null) object
for (i in 1:10000) {
x_bar_i <- mean(runif(n = 1, min = 0, max = 1))
x_bar <- c(x_bar, x_bar_i)
}
hist(x_bar, breaks = 50, main = "Sample mean of uniform random variables with sample size n=1")
\[ \bar{x} = \frac{1}{2} \sum_{i=1}^{2} x_i, \quad x_i \sim U(0, 1) \]
x_bar <- NULL
for (i in 1:10000) {
x_bar_i <- mean(runif(n = 2, min = 0, max = 1))
x_bar <- c(x_bar, x_bar_i)
}
hist(x_bar, breaks = 50, main = "Sample mean of uniform random variables with sample size n=2")
\[ \bar{x} = \frac{1}{10} \sum_{i=1}^{10} x_i, \quad x_i \sim U(0, 1) \]
x_bar <- NULL
for (i in 1:10000) {
x_bar_i <- mean(runif(n = 10, min = 0, max = 1))
x_bar <- c(x_bar, x_bar_i)
}
hist(x_bar, breaks = 50, main = "Sample mean of uniform random variables with sample size n=10",
xlim = c(0, 1))
In fact, we can also write the following using the
sapply function without using the for
function:
実は,for文を使わずにsapply関数を使って次のように書くこともできる:
hist(sapply(X = 1:10000, FUN = function (x) mean(runif(10)) ), breaks = 50)
For a sample mean of 100, population standard deviation of 2, and sample size of 100, the confidence interval is calculated as follows:
\[ \bar{x} \pm z \frac{\sigma}{\sqrt{n}} = 100 \pm 1.96 \frac{2}{\sqrt{100}} = 100 \pm 0.392, \quad \mbox{95% CI}: [99.61, 100.39] \]
qnorm(p = 0.025, mean = 0, sd = 1) # P(Z<z) = 0.025
## [1] -1.959964
qnorm(p = 0.975, mean = 0, sd = 1) # P(Z<z) = 0.975
## [1] 1.959964
100 + qnorm(p = 0.025) * 2 / sqrt(100) # lower limit
## [1] 99.60801
100 + qnorm(p = 0.975) * 2 / sqrt(100) # upper limit
## [1] 100.392
For a sample mean of 100, sample standard deviation of 2, and sample size of 100, the confidence interval is calculated as follows:
\[ \bar{x} \pm t \frac{s}{\sqrt{n}} = 100 \pm 1.98 \frac{2}{\sqrt{100}} = 100 \pm 0.396, \quad \mbox{95% CI}: [99.60, 100.40] \]
Note that we use t-value, not z-value.
qt(p = 0.025, df = 100-1) # P(T<t) = 0.025
## [1] -1.984217
100 + qt(p = 0.025, df = 100-1) * 2 / sqrt(100) # lower limit
## [1] 99.60316
100 + qt(p = 0.975, df = 100-1) * 2 / sqrt(100) # upper limit
## [1] 100.3968
t.test functionThe function t.test, which performs a \(t\)-test (to be covered in the next class),
also outputs a confidence interval.
\(t\)検定(次回以降の授業で勉強します)を行うt.testという関数は信頼区間をついでに出力してくれる.
wage <- c(1000, 1200, 1300, 1200, 1150, 1000, 1450, 1500, 1150, 1350)
mean(wage) # mean
## [1] 1230
sd(wage) # standard deviation
## [1] 170.2939
length(wage) # sample size (length of the vector)
## [1] 10
mean(wage) + qt(p = 0.025, df = length(wage)-1) * sd(wage) / sqrt(length(wage)) # lower bound
## [1] 1108.179
mean(wage) + qt(p = 0.975, df = length(wage)-1) * sd(wage) / sqrt(length(wage)) # upper bound
## [1] 1351.821
t.test(wage) # default: 95% CI
##
## One Sample t-test
##
## data: wage
## t = 22.841, df = 9, p-value = 2.806e-09
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1108.179 1351.821
## sample estimates:
## mean of x
## 1230
t.test(wage, conf.level = 0.99) # 99% CI
##
## One Sample t-test
##
## data: wage
## t = 22.841, df = 9, p-value = 2.806e-09
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 1054.991 1405.009
## sample estimates:
## mean of x
## 1230
\[ p \pm z \sqrt{\frac{p (1-p)}{n}} \]
Example of Banerjee, Duflo, and Glennerster (BMJ 2010)
Of the 382 children, 148 received immunizations.
382人の子供のうち148人がワクチンの予防接種を受けた.
n1 <- 148; n0 <- 382 # intervention B
p <- n1 / n0
p
## [1] 0.3874346
p + qnorm(p = .025) * sqrt(p * (1 - p) / n0) # lower limit of 95% CI
## [1] 0.3385815
p + qnorm(p = .975) * sqrt(p * (1 - p) / n0) # upper limit
## [1] 0.4362876
In fact, these figures differ from those reported in the paper. Because a cluster randomized controlled trial was performed using geographic blocks (at the village level), it is likely that the authors have performed a more complex calculation to account for the hierarchical nature of data.
実は,この数値は論文中で報告されている数値とは異なる.地理的なブロック(村)を用いてクラスター無作為化対照試験が実行されたため,著者らはこのデータの階層性を考慮したより複雑な計算をしていると思われる.
Wage population is normally distributed, \(\sigma\) is unknown, \(\bar{x} = 1230\), \(s=170.3\), \(n = 10\), \(\alpha=0.05\)
\[ H_0: \mu = 1100, \quad H_1: \mu \ne 1100 \]
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1230 - 1110}{170.3 / \sqrt{10}} \approx 2.41 \] \[ \Pr (T \le 2.41) = 0.98 , \quad 2 \times (1 - \Pr (T \le 2.41)) \approx 0.039 \]
t_value <- (1230 - 1100) / (170.3 / sqrt(10)) # t value
t_value
## [1] 2.413952
pt(q = t_value, df = 10 - 1) # Pr(T <= t)
## [1] 0.9805023
1 - pt(q = t_value, df = 10 - 1) # one sided p value ... p = Pr(T > t)
## [1] 0.01949773
2 * (1 - pt(q = t_value, df = 10 - 1)) # two sided p value
## [1] 0.03899546
Thus, \(H_0\) is rejected and \(H_1\) is accepted.
wage <- c(1000, 1200, 1300, 1200, 1150, 1000, 1450, 1500, 1150, 1350)
mean(wage); sd(wage)
## [1] 1230
## [1] 170.2939
t.test(wage, mu = 1100) # default: two sided ... H1 is [mu /= 1150]
##
## One Sample t-test
##
## data: wage
## t = 2.414, df = 9, p-value = 0.03899
## alternative hypothesis: true mean is not equal to 1100
## 95 percent confidence interval:
## 1108.179 1351.821
## sample estimates:
## mean of x
## 1230
t.test(wage, mu = 1100, alternative = "greater") # one sided ... H1 is [mu > 1150]
##
## One Sample t-test
##
## data: wage
## t = 2.414, df = 9, p-value = 0.01949
## alternative hypothesis: true mean is greater than 1100
## 95 percent confidence interval:
## 1131.284 Inf
## sample estimates:
## mean of x
## 1230
In this way, the confidence intervals are calculated at the same time.
このように信頼区間も同時に計算してくれる.
wage_jp <- c(1000, 1200, 1300, 1200, 1150, 1000, 1450, 1500, 1150, 1350) # Japan
wage_us <- c(900, 1300, 1200, 800, 1600, 850, 1000, 950) # US
t.test(wage_jp, wage_us) # default: Welch's t test (assuming unequal variance)
##
## Welch Two Sample t-test
##
## data: wage_jp and wage_us
## t = 1.4041, df = 11.205, p-value = 0.1874
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -87.42307 397.42307
## sample estimates:
## mean of x mean of y
## 1230 1075
t.test(wage_jp, wage_us, var.equal = TRUE) # t test (assuming equal variance)
##
## Two Sample t-test
##
## data: wage_jp and wage_us
## t = 1.479, df = 16, p-value = 0.1586
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -67.16377 377.16377
## sample estimates:
## mean of x mean of y
## 1230 1075
Use the t.test function with the tidyverse package.
library(tidyverse)
setwd("c://ws_stat")
apple <- read.csv("Statistics_Data_20231008.csv")
If the above does not work, try using file.choose() as
shown below.
apple <- read.csv(file.choose())
profit_sedan <- apple %>% filter(VehicleType == "Sedan") %>% select(Profit)
profit_suv <- apple %>% filter(VehicleType == "SUV") %>% select(Profit)
t.test(profit_sedan, profit_suv)
##
## Welch Two Sample t-test
##
## data: profit_sedan and profit_suv
## t = -1.3386, df = 117.65, p-value = 0.1833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -384.7004 74.3851
## sample estimates:
## mean of x mean of y
## 1747.544 1902.702
You can also specify target_variable ~ group_variable as
shown below.
apple %>%
filter(VehicleType %in% c("Sedan", "SUV")) %>% # the number of groups must be exactly two
t.test(Profit ~ VehicleType, data = .)
##
## Welch Two Sample t-test
##
## data: Profit by VehicleType
## t = -1.3386, df = 117.65, p-value = 0.1833
## alternative hypothesis: true difference in means between group Sedan and group SUV is not equal to 0
## 95 percent confidence interval:
## -384.7004 74.3851
## sample estimates:
## mean in group Sedan mean in group SUV
## 1747.544 1902.702
Do babies of mothers who smoked during pregnancy weigh less? Note, however, that testing for differences in means does not identify causal effects.
妊娠中に喫煙していた母親の赤ちゃんの体重は軽いのか? ただし,平均値の差の検定をしても因果効果を識別できないことに留意してください.
Data source: Hosmer and Lemeshow (1989; 2000 2nd ed) Applied Logistic Regression, John Wiley and Sons. (according to: Venables and Ripley, 2002, p. 194)
library(tidyverse)
bw <- MASS::birthwt # "birthwt" data in MASS package
boxplot(bwt ~ smoke, data = bw, xlab = "Mother's smoking status during pregnancy",
ylab = "Infant birth weight (in grams)", names = c("No smoking (label: 0)", "Smoking (1)"))
bw %>%
group_by(smoke) %>%
summarise(n = n(), mean = mean(bwt), sd = sd(bwt), median = median(bwt))
## # A tibble: 2 × 5
## smoke n mean sd median
## <int> <int> <dbl> <dbl> <dbl>
## 1 0 115 3056. 753. 3100
## 2 1 74 2772. 660. 2776.
t.test(bwt ~ smoke, data = bw)
##
## Welch Two Sample t-test
##
## data: bwt by smoke
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 78.57486 488.97860
## sample estimates:
## mean in group 0 mean in group 1
## 3055.696 2771.919
See also: https://bookdown.org/cuborican/RE_STAT/hypothesis-testing.html
Miller, Saunders, and Farhart (AJPS 2016) / Replication Data in Harvard Dataverse
Do those who self-identify as conservatives believe more in conspiracy theories such as “Obama was not born in the U.S.” than those who self-identify as liberals?
自らを保守と自認する人は,自らをリベラルと自認する人と比較して,「オバマは米国で生まれなかった」などの陰謀論をより信じているのか?
conservative_d: dummy variable for self-identified as
conservativeconspiracy1: Conspiracy: Obama was not born in the
U.S.conspiracy3: Conspiracy: Government knew about 9/11
prior to attackssetwd("c://ws_stat")
cons <- read.csv("conspiracy.csv")
library(tidyverse)
# table(cons$Conservative) # frequency distribution table
# head(cons %>% select(Conservative, conspiracy1, conspiracy3))
cons <- cons %>%
filter(Conservative != "") %>% # remove empty observations
mutate(conservative_d = 1*(Conservative == "Conservative"))
# table(cons$conservative_d) # 0: Liberal, 1: Conservative
t.test(conspiracy1 ~ conservative_d, data = cons) # Obama
##
## Welch Two Sample t-test
##
## data: conspiracy1 by conservative_d
## t = -22.968, df = 958.13, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.3345162 -0.2818512
## sample estimates:
## mean in group 0 mean in group 1
## 0.09259259 0.40077633
t.test(conspiracy3 ~ conservative_d, data = cons) # 9/11
##
## Welch Two Sample t-test
##
## data: conspiracy3 by conservative_d
## t = 0.84234, df = 1224.1, p-value = 0.3998
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.01682104 0.04213250
## sample estimates:
## mean in group 0 mean in group 1
## 0.4337084 0.4210526
wage_w <- c(1000, 1200, 1300, 1200, 1150, 1000, 1450, 1500, 1150, 1350) # wife
wage_h <- c(900, 1300, 1200, 800, 1600, 850, 1000, 950, 1200, 1400) # husband
# length(wage_w); length(wage_h) # = 10
t.test(wage_w, wage_h, paired = TRUE)
##
## Paired t-test
##
## data: wage_w and wage_h
## t = 1.1638, df = 9, p-value = 0.2744
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -103.8108 323.8108
## sample estimates:
## mean difference
## 110
t.test(wage_w - wage_h) # same as an ordinary "one-sample t test"
##
## One Sample t-test
##
## data: wage_w - wage_h
## t = 1.1638, df = 9, p-value = 0.2744
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -103.8108 323.8108
## sample estimates:
## mean of x
## 110
One-sample test for population mean: \(H_0: \mu = 0\)
p_value_list <- NULL
for (i in 1:1000) {
p_value_list <- c(p_value_list, t.test(rnorm(100, mean = 0, sd = 1))$p.value)
}
hist(p_value_list, breaks = 20)
Data source: Kano, Shigeki “New Econometrics” Nippon Hyoron Sha, 2015. (鹿野『新しい計量経済学』)
setwd("c://ws_stat")
ap <- read.csv("apartment.csv")
If the setewd or read.csv functions do not work, use the following command to select and download the csv file using the GUI.
ap <- read.csv(file.choose())
head(ap, 1)
## id price min age area studio
## 1 1 620 5 26 15 1
hist(ap$price, breaks = 20)
ap <- ap %>% filter(price <= 10000) # remove outlier
plot(x = ap$age, y = ap$price, xlab = "Age of the building",
ylab = "Transaction price", pch = 20)
cor(ap[, -1]) # correlation coefficient matrix
## price min age area studio
## price 1.0000000 0.20241142 -0.43638408 0.84137685 -0.56786171
## min 0.2024114 1.00000000 -0.04834375 0.33528104 -0.25551625
## age -0.4363841 -0.04834375 1.00000000 -0.09421416 -0.07971788
## area 0.8413768 0.33528104 -0.09421416 1.00000000 -0.65127492
## studio -0.5678617 -0.25551625 -0.07971788 -0.65127492 1.00000000
\[ price = \beta_0 + \beta_1 age + e \]
lm(price ~ age, data = ap) # regression (lm: linear model)
##
## Call:
## lm(formula = price ~ age, data = ap)
##
## Coefficients:
## (Intercept) age
## 4745.5 -70.5
\[ price = \beta_0 + \beta_1 age + \beta_2 area + e \]
lm(price ~ age + area, data = ap)
##
## Call:
## lm(formula = price ~ age + area, data = ap)
##
## Coefficients:
## (Intercept) age area
## 1317.02 -58.21 61.95
cor(ap[, c("price", "min", "area")])
## price min area
## price 1.0000000 0.2024114 0.8413768
## min 0.2024114 1.0000000 0.3352810
## area 0.8413768 0.3352810 1.0000000
lm(price ~ min, data = ap)
##
## Call:
## lm(formula = price ~ min, data = ap)
##
## Coefficients:
## (Intercept) min
## 3063.05 69.15
lm(price ~ min + area, data = ap)
##
## Call:
## lm(formula = price ~ min + area, data = ap)
##
## Coefficients:
## (Intercept) min area
## 458.21 -30.67 66.86
table(ap$studio) # = 1 if studio apartment
##
## 0 1
## 156 37
lm(price ~ min + area + studio, data = ap)
##
## Call:
## lm(formula = price ~ min + area + studio, data = ap)
##
## Coefficients:
## (Intercept) min area studio
## 603.36 -31.25 64.88 -189.96
# lm(price ~ age + area, data = ap)
lm(price ~ age + area + I(area^2), data = ap)
##
## Call:
## lm(formula = price ~ age + area + I(area^2), data = ap)
##
## Coefficients:
## (Intercept) age area I(area^2)
## 631.4558 -60.2052 93.3760 -0.2799
# lm(price ~ area + studio, data = ap)
lm(price ~ area + studio + area:studio, data = ap) # [X1:X2] is [X1 times x2]
##
## Call:
## lm(formula = price ~ area + studio + area:studio, data = ap)
##
## Coefficients:
## (Intercept) area studio area:studio
## 489.99 61.75 -1571.53 67.86
summary(lm(price ~ age + area + studio, data = ap))
##
## Call:
## lm(formula = price ~ age + area + studio, data = ap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2888.25 -400.05 55.35 412.64 1913.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1815.460 212.234 8.554 4.02e-15 ***
## age -61.138 4.722 -12.949 < 2e-16 ***
## area 55.457 2.945 18.829 < 2e-16 ***
## studio -597.632 180.056 -3.319 0.00108 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 734.1 on 189 degrees of freedom
## Multiple R-squared: 0.8456, Adjusted R-squared: 0.8431
## F-statistic: 345 on 3 and 189 DF, p-value: < 2.2e-16
Miller, Saunders, and Farhart (AJPS 2016) / Replication Data in Harvard Dataverse
Does knowledge of politics have different effects on belief in conservative conspiracy theories (e.g., Obama was not born in the US) for conservatives and the rest of the population (liberals)?
政治に関する知識が保守層の間で有名な陰謀論を信じることに対する影響は,保守層とそれ以外(リベラル層)で異なるのか?
\[ (\mbox{conservative-conspiracy}) = \beta_0 + \beta_1 (\mbox{conservative}) + \beta_2 (\mbox{knowledge}) + \beta_3 (\mbox{conservative}) \times (\mbox{knowledge}) \]
con_index_rep2: conservative conspiracy indexconservative_d: dummy variable for self-identified as
conservativepolknow_alt: knowledge of politicssetwd("c://ws_stat")
cons <- read.csv("conspiracy.csv")
library(tidyverse)
cons <- cons %>%
filter(Conservative != "") %>% # remove empty observations
mutate(conservative_d = 1*(Conservative == "Conservative"))
# summary(cons %>% select(con_index_rep2, conservative_d, polknow_alt))
fm <- con_index_rep2 ~ conservative_d + polknow_alt + conservative_d:polknow_alt
summary(lm(formula = fm, data = cons))
##
## Call:
## lm(formula = fm, data = cons)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4536 -0.1154 -0.0197 0.1023 0.6575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.39307 0.01349 29.143 <2e-16 ***
## conservative_d 0.03912 0.02309 1.694 0.0904 .
## polknow_alt -0.31234 0.01945 -16.061 <2e-16 ***
## conservative_d:polknow_alt 0.33544 0.03333 10.066 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1675 on 2196 degrees of freedom
## ( 3 個の観測値が欠損のため削除されました )
## Multiple R-squared: 0.3881, Adjusted R-squared: 0.3873
## F-statistic: 464.3 on 3 and 2196 DF, p-value: < 2.2e-16
Note: The estimates above differ slightly from Table 2 (p. 832) in the paper because we do not include the control variables used in the main analysis in the paper.
The more knowledgeable a liberal citizen is, the less they believe in conspiracy theories. On the other hand, the more knowledgeable a conservative citizen is, the more they believe in conspiracy theories.
リベラルの市民は知識があるほど陰謀論を信じなくなる一方で,保守の市民は知識があるほど陰謀論を信じる.
Titanic data
titanic <- read.csv("https://www.dropbox.com/scl/fi/sup1eqcpn7ffg8puweffl/Titanic.csv?rlkey=fp63rjplysexua8fhhdjz5qfw&dl=1")
# table(titanic$Survived, titanic$Sex)
titanic %>% filter(Sex == "male") %>% select(Survived, Pclass) %>% table # Male
## Pclass
## Survived 1 2 3
## 0 77 91 300
## 1 45 17 47
titanic %>% filter(Sex == "female") %>% select(Survived, Pclass) %>% table # Female
## Pclass
## Survived 1 2 3
## 0 3 6 72
## 1 91 70 72
\[ \mbox{Survived} = \beta_0 + \beta_1 (\mbox{Sex}) + \beta_2 (\mbox{Age}) + \beta_3 (\mbox{2nd class}) + \beta_4 (\mbox{3rd class}) \]
summary(lm(Survived ~ Sex + Age + as.factor(Pclass), data = titanic))
##
## Call:
## lm(formula = Survived ~ Sex + Age + as.factor(Pclass), data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11410 -0.25081 -0.06422 0.23015 1.00676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.125021 0.050671 22.202 < 2e-16 ***
## Sexmale -0.479456 0.030718 -15.608 < 2e-16 ***
## Age -0.005460 0.001084 -5.039 5.96e-07 ***
## as.factor(Pclass)2 -0.207747 0.041689 -4.983 7.86e-07 ***
## as.factor(Pclass)3 -0.406618 0.038288 -10.620 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3849 on 709 degrees of freedom
## ( 177 個の観測値が欠損のため削除されました )
## Multiple R-squared: 0.3902, Adjusted R-squared: 0.3867
## F-statistic: 113.4 on 4 and 709 DF, p-value: < 2.2e-16
\[ \Pr(\mbox{Survived} = 1) = \Phi ( \beta_0 + \beta_1 (\mbox{Sex}) + \beta_2 (\mbox{Age}) + \beta_3 (\mbox{2nd class}) + \beta_4 (\mbox{3rd class}) ) \] \[ \mbox{Marginal effect}_k = \beta_k \frac{1}{n} \sum_i \phi \left( \beta_0 + \sum_j \beta_j x_{ij} \right) \]
# install.packages("mfx")
library(mfx)
probitmfx(Survived ~ Sex + Age + as.factor(Pclass), data = titanic)
## Call:
## probitmfx(formula = Survived ~ Sex + Age + as.factor(Pclass),
## data = titanic)
##
## Marginal Effects:
## dF/dx Std. Err. z P>|z|
## Sexmale -0.5406375 0.0356417 -15.1687 < 2.2e-16 ***
## Age -0.0077961 0.0016571 -4.7047 2.543e-06 ***
## as.factor(Pclass)2 -0.2635377 0.0492177 -5.3545 8.578e-08 ***
## as.factor(Pclass)3 -0.5123511 0.0467708 -10.9545 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## dF/dx is for discrete change for the following variables:
##
## [1] "Sexmale" "as.factor(Pclass)2" "as.factor(Pclass)3"
\[ \Pr(\mbox{Survived} = 1) = \Lambda ( \beta_0 + \beta_1 (\mbox{Sex}) + \beta_2 (\mbox{Age}) + \beta_3 (\mbox{2nd class}) + \beta_4 (\mbox{3rd class}) ) \]
logitmfx(Survived ~ Sex + Age + as.factor(Pclass), data = titanic)
## Call:
## logitmfx(formula = Survived ~ Sex + Age + as.factor(Pclass),
## data = titanic)
##
## Marginal Effects:
## dF/dx Std. Err. z P>|z|
## Sexmale -0.5553141 0.0364508 -15.2346 < 2.2e-16 ***
## Age -0.0086430 0.0017856 -4.8404 1.296e-06 ***
## as.factor(Pclass)2 -0.2686912 0.0486271 -5.5255 3.285e-08 ***
## as.factor(Pclass)3 -0.5420565 0.0470137 -11.5298 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## dF/dx is for discrete change for the following variables:
##
## [1] "Sexmale" "as.factor(Pclass)2" "as.factor(Pclass)3"
mfx packageUse built-in glm function.
# Probit
probit_titanic <- glm(Survived ~ Sex + Age + as.factor(Pclass), data = titanic,
family = binomial(link = "probit"))
summary(probit_titanic)
##
## Call:
## glm(formula = Survived ~ Sex + Age + as.factor(Pclass), family = binomial(link = "probit"),
## data = titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8568 -0.7036 -0.4020 0.6593 2.4823
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.163290 0.219778 9.843 < 2e-16 ***
## Sexmale -1.485373 0.115945 -12.811 < 2e-16 ***
## Age -0.020360 0.004331 -4.701 2.59e-06 ***
## as.factor(Pclass)2 -0.755229 0.159447 -4.737 2.17e-06 ***
## as.factor(Pclass)3 -1.447517 0.155482 -9.310 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 649.94 on 709 degrees of freedom
## ( 177 個の観測値が欠損のため削除されました )
## AIC: 659.94
##
## Number of Fisher Scoring iterations: 5
# marginal effect
# summary(dnorm(predict(probit_titanic)) * coef(probit_titanic)["Sexmale"])
# Logit
logit_titanic <- glm(Survived ~ Sex + Age + as.factor(Pclass), data = titanic,
family = binomial(link = "logit"))
summary(logit_titanic)
##
## Call:
## glm(formula = Survived ~ Sex + Age + as.factor(Pclass), family = binomial(link = "logit"),
## data = titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7303 -0.6780 -0.3953 0.6485 2.4657
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.777013 0.401123 9.416 < 2e-16 ***
## Sexmale -2.522781 0.207391 -12.164 < 2e-16 ***
## Age -0.036985 0.007656 -4.831 1.36e-06 ***
## as.factor(Pclass)2 -1.309799 0.278066 -4.710 2.47e-06 ***
## as.factor(Pclass)3 -2.580625 0.281442 -9.169 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 647.28 on 709 degrees of freedom
## ( 177 個の観測値が欠損のため削除されました )
## AIC: 657.28
##
## Number of Fisher Scoring iterations: 5
# marginal effect
# summary(dnorm(predict(logit_titanic)) * coef(logit_titanic)["Sexmale"])
.