Learn Statistics and Probability Theory with R

R Basics

Operations in R

# Addition, subtraction, multiplication, and division
x = 1
y = x+x
y

## [1] 2

z = y*y
z

## [1] 4

w = z^z
w

## [1] 256

v = w%%3
v

## [1] 1
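Subtraction and division are not exercised in the lines above. A minimal sketch of the remaining operators follows; the names `a` and `b` and their values are introduced here for illustration only.

```r
# Illustrative values; `a` and `b` are not part of the original example.
a <- 256
b <- 3

a - b    # subtraction
a / b    # ordinary division, returns a double
a %/% b  # integer division, the counterpart of %%
a %% b   # remainder, as in the example above

# Integer division and remainder together reconstruct the original value.
(a %/% b) * b + (a %% b) == a
```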

R data
Vectors: c()
1. max, min, range
2. sum, prod, length
3. median, mean, var, sort

x = c(1,2,3)
x

## [1] 1 2 3

max(x)

## [1] 3

min(x)

## [1] 1

range(x)

## [1] 1 3

sum(x)

## [1] 6

prod(x)

## [1] 6

length(x)

## [1] 3
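The third group of functions (median, mean, var, sort) can be tried on the same vector. This is a small sketch; the `decreasing = TRUE` argument is an illustrative choice, not something the notes above prescribe.

```r
x <- c(1, 2, 3)

median(x)  # middle value
mean(x)    # arithmetic mean
var(x)     # sample variance (divides by n - 1)
sort(x, decreasing = TRUE)  # returns a sorted copy; x itself is unchanged
```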

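Basic R syntax

Common probability distributions in R

Discrete

binom: binomial distribution
geom: geometric distribution
hyper: hypergeometric distribution
nbinom: negative binomial distribution
pois: Poisson distribution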

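Continuous

norm: normal distribution
unif: uniform distribution
exp: exponential distribution
gamma: gamma distribution
beta: beta distribution
t: t distribution
f: F distribution
chisq: chi-squared distribution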

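The four prefixes in R

d: probability density function f (probability mass function for discrete distributions)
p: cumulative distribution function F
q: quantile function, the inverse of F
r: random number generation

Examples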

dbinom(50,100,0.5)

## [1] 0.07958924

pbinom(50,100,0.5)

## [1] 0.5397946

pbinom(40,100,0.5)

## [1] 0.02844397

qnorm(0.05)

## [1] -1.644854

runif(10,0,1)

##  [1] 0.01364474 0.74041046 0.06923150 0.51625336 0.52835178 0.37823481
##  [7] 0.38284709 0.05395771 0.62948439 0.48128880
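The four prefixes are related to one another, and a short check can make that concrete. The sketch below reuses the arguments from the examples above; the seed value and the names `cdf_from_pmf`, `p`, `u1`, and `u2` are assumptions introduced for illustration.

```r
# For a discrete distribution, p is the running sum of d.
cdf_from_pmf <- sum(dbinom(0:50, size = 100, prob = 0.5))
all.equal(cdf_from_pmf, pbinom(50, size = 100, prob = 0.5))

# For a continuous distribution, q inverts p.
p <- pnorm(-1.644854)
qnorm(p)

# r draws differ on every run unless the seed is fixed first.
set.seed(1)  # arbitrary seed
u1 <- runif(3)
set.seed(1)
u2 <- runif(3)
identical(u1, u2)
```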

Multivariate normal

library(mvtnorm)
mu = c(1,-1)
Sig = matrix(c(1,0.5,0.5,1),2,2)
x = rmvnorm(100,mu,Sig)
x[1:10,]

##             [,1]       [,2]
##  [1,]  1.4879518 -0.7996319
##  [2,]  1.1731242 -1.8413677
##  [3,]  0.6725695 -1.1549749
##  [4,] -0.5045210 -3.4477486
##  [5,]  1.8706699 -0.7890664
##  [6,]  1.9422948 -0.6224334
##  [7,]  1.0743796 -2.8168067
##  [8,]  0.9174477 -1.5483637
##  [9,]  1.8671996 -1.0035270
## [10,] -0.8740146 -2.9342584

plot(x=x[,1],y=x[,2])
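For readers without the mvtnorm package, the same kind of sample can be sketched in base R with a Cholesky factor of the covariance matrix. This is an alternative technique, not what `rmvnorm` is claimed to do internally; the seed, the sample size `n`, and the name `x_chol` are assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

mu  <- c(1, -1)
Sig <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
n   <- 10000  # larger than the 100 draws above so the estimates are stable

# chol() returns an upper-triangular R with t(R) %*% R equal to Sig.
R <- chol(Sig)

# Rows of z are independent standard normal pairs, so z %*% R has covariance Sig.
z <- matrix(rnorm(n * 2), nrow = n, ncol = 2)
x_chol <- sweep(z %*% R, 2, mu, "+")  # add the mean to each column

colMeans(x_chol)  # should be close to mu
cov(x_chol)       # should be close to Sig
```

Comparing `colMeans()` and `cov()` against `mu` and `Sig` is a quick sanity check that also works on the `rmvnorm` output above.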

Common statistics

Mean

data("mtcars")
attach(mtcars)
mean(wt)

## [1] 3.21725

Variance and standard deviation

var(wt)

## [1] 0.957379

sd(wt)

## [1] 0.9784574

sd(wt)^2

## [1] 0.957379

Median and quantiles

median(wt)

## [1] 3.325
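Only the median is computed above; quantiles can be obtained with `quantile()`. A minimal sketch, assuming the quartiles are the quantiles of interest (the `probs` values and the name `q` are illustrative):

```r
data("mtcars")

# Quartiles of car weight; the probabilities are chosen for illustration.
q <- quantile(mtcars$wt, probs = c(0.25, 0.5, 0.75))
q

# The 50% quantile coincides with the median.
isTRUE(all.equal(unname(q["50%"]), median(mtcars$wt)))

# Interquartile range: the distance between the outer quartiles.
IQR(mtcars$wt)
```

Referring to the column as `mtcars$wt` keeps the example independent of `attach()`.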

Synthetic control method

library(tidysynth)
data("smoking")
smoking %>% dplyr::glimpse()

## Rows: 1,209
## Columns: 7
## $ state     <chr> "Rhode Island", "Tennessee", "Indiana", "Nevada", "Louisiana…
## $ year      <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, …
## $ cigsale   <dbl> 123.9, 99.8, 134.6, 189.5, 115.9, 108.4, 265.7, 93.8, 100.3,…
## $ lnincome  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ beer      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ age15to24 <dbl> 0.1831579, 0.1780438, 0.1765159, 0.1615542, 0.1851852, 0.175…
## $ retprice  <dbl> 39.3, 39.9, 30.6, 38.9, 34.3, 38.4, 31.4, 37.3, 36.7, 28.8, …

smoking_out <-
  smoking %>%
  # California is the treated unit; the intervention year is 1988
  synthetic_control(outcome = cigsale,
                    unit = state,
                    time = year,
                    i_unit = "California",
                    i_time = 1988,
                    generate_placebos = TRUE) %>%
  # Predictors averaged over 1980-1988
  generate_predictor(time_window = 1980:1988,
                     ln_income = mean(lnincome, na.rm = TRUE),
                     ret_price = mean(retprice, na.rm = TRUE),
                     youth = mean(age15to24, na.rm = TRUE)) %>%
  # Beer sales averaged over 1984-1988
  generate_predictor(time_window = 1984:1988,
                     beer_sales = mean(beer, na.rm = TRUE)) %>%
  # Cigarette sales in single years as additional predictors
  generate_predictor(time_window = 1975,
                     cigsale_1975 = cigsale) %>%
  generate_predictor(time_window = 1980,
                     cigsale_1980 = cigsale) %>%
  generate_predictor(time_window = 1988,
                     cigsale_1988 = cigsale) %>%
  # Fit the weights on 1970-1988
  generate_weights(optimization_window = 1970:1988,
                   margin_ipop = .02, sigf_ipop = 7, bound_ipop = 6) %>%
  # Build the synthetic control
  generate_control()

smoking_out %>% plot_trends()

smoking_out %>% plot_differences()

smoking_out %>% plot_weights()

smoking_out %>% grab_balance_table()

## # A tibble: 7 × 4
##   variable     California synthetic_California donor_sample
##   <chr>             <dbl>                <dbl>        <dbl>
## 1 ln_income        10.1                  9.84         9.83 
## 2 ret_price        89.4                 89.4         87.3  
## 3 youth             0.174                0.174        0.173
## 4 beer_sales       24.3                 24.3         23.7  
## 5 cigsale_1975    127.                 127.         137.   
## 6 cigsale_1980    120.                 120.         138.   
## 7 cigsale_1988     90.1                 90.8        114.

smoking_out %>% plot_placebos()

smoking_out %>% plot_placebos(prune = FALSE)

smoking_out %>% plot_mspe_ratio()

smoking_out %>% grab_signficance()

## # A tibble: 39 × 8
##    unit_name     type    pre_mspe post_mspe mspe_ratio  rank fishers_e…¹ z_score
##    <chr>         <chr>      <dbl>     <dbl>      <dbl> <int>       <dbl>   <dbl>
##  1 California    Treated     3.94     390.       99.0      1      0.0256  5.13  
##  2 Georgia       Donor       3.48     174.       49.8      2      0.0513  2.33  
##  3 Virginia      Donor       5.86     171.       29.2      3      0.0769  1.16  
##  4 Indiana       Donor      18.4      415.       22.6      4      0.103   0.787 
##  5 West Virginia Donor      14.3      287.       20.1      5      0.128   0.646 
##  6 Connecticut   Donor      27.3      335.       12.3      6      0.154   0.202 
##  7 Nebraska      Donor       6.47      54.3       8.40     7      0.179  -0.0189
##  8 Missouri      Donor       9.19      77.0       8.38     8      0.205  -0.0199
##  9 Texas         Donor      24.5      160.        6.54     9      0.231  -0.125 
## 10 Idaho         Donor      53.2      340.        6.39    10      0.256  -0.133 
## # … with 29 more rows, and abbreviated variable name ¹fishers_exact_pvalue
## # ℹ Use `print(n = ...)` to see more rows
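Judging from the numbers in the table, `mspe_ratio` appears to be `post_mspe / pre_mspe` and `fishers_exact_pvalue` appears to be the rank divided by the number of units. The sketch below checks this reading in base R using the rounded values of the first three rows; it is an assumption about how the columns relate, not a statement of how tidysynth computes them, and the names `sig` and `n_units` are introduced here.

```r
# Rounded values copied from the first three rows of the table above.
sig <- data.frame(
  unit_name = c("California", "Georgia", "Virginia"),
  pre_mspe  = c(3.94, 3.48, 5.86),
  post_mspe = c(390, 174, 171)
)
n_units <- 39  # total number of rows reported in the tibble header

# Ratio of post- to pre-intervention mean squared prediction error.
sig$mspe_ratio <- sig$post_mspe / sig$pre_mspe

# Rank 1 is the largest ratio. Ranking only these rows matches the full
# ranking because they are the top three units in the table.
sig$rank <- rank(-sig$mspe_ratio)
sig$p_value <- sig$rank / n_units
sig
```

Because the inputs are rounded, the recomputed ratios agree with the table only approximately.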