Maintainer: Chunkai

## Loading required package: s20x

Handout 1: Introduction and null model

Working with R and R markdown
The null model for iid normal data, $y \sim N(\mu,\sigma^2)$.
The null linear model gives a one-sample t-test of $H_0: \mu=0$.
Manually calculating the t-statistic
- $t-stat = (\hat{mean} - \mu_0)/se$
Manually calculating the 95% confidence interval for $\mu$.
- $tmult = qt(1-.05/2, df=n-1)$
- $se(y) = sd(y)/\sqrt{(n)}$
- $CI = mean(y) + c(-1,+1) \times tmult \times se(y)$
The paired t-test is a one-sample test of within-pair differences
- 配对t检验，其实就是将成对样本只差作为单样本的t检验。
Our Executive Summary makes conclusions about the expected value, $\mu$.
expected mean，总体均值$\mu$
中央极限定理CLT：总体容量足够大的时候，我们能取到的所有样本的均值近似正态分布
T型分布
- $\frac{\hat Y - \mu}{\frac{S}{\sqrt n}} \sim T_{n-1}$
- 自由度为n-1的t型分布在概率论和统计学中，学生t-分布（Student’s t-distribution），可简称为t分布，用于根据小样本来估计呈正态分布且方差未知的总体的均值。如果总体方差已知（例如在样本数量足够多时），则应该用正态分布来估计总体均值。它是对两个样本均值差异进行显著性测试的学生t检定的基础。学生t检定改进了Z检定（Z-test），因为Z检定以母体标准差已知为前提。虽然在样本数量大（超过30个）时，可以应用Z检定来求得近似值，但Z检定用在小样本会产生很大的误差，因此必须改用学生t检定以求准确。
Null model summary 只能给出总体均值的期望值还有它为0的概率
t.test 可以指定$\mu$ 所以可以用来检测样本均值跟假定均值是否有差距。

Handout 2: Simple linear model

A single numeric explanatory variable: \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
$R^2$
Model checking:
- Independence of observations
- Constant scatter and no trend in residuals
- Residuals approximately normal
- Checking for (and possibly removing) any highly influential points
Confidence and prediction intervals using the predict function
- Confidence intervals are for estimating an expected value
- Prediction intervals are for predicting a new $y$ value.
Writing the equation for the model
Our Executive Summary makes conclusions about the expected change in $y$ for a change in $x$.
- Typically not of interest to report about the intercept ($\beta_0$).

Handout 3: Fitting curves using linear models

If the scatter plot of $x$ and $y$ suggests nonlinearity
- the residual plot from a simple linear model will display stronger nonlinearity
- 拟合新型模型后的残差图 do not appear to randomly scattered around 0.
- 中间低两边高总之不是直线
Add a quadratic term \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \]
Determine whether the quadratic term is needed, by testing $H_0: \beta_2=0$.
- If not signficant, the simple linear model is preferred.
- 如果它看起来像个曲线那就fit一个曲线，只要响应变量恒定的散布在曲线周围即可。

Handout 4: Categorical explanatory variable:

A single categorical explanatory variable with two levels (i.e., two-sample t-test situation)
Expressing the model using dummy variables (that takes the value 0 or 1): \[ y_i = \beta_0 + \beta_1 D_i + \epsilon_i \]
Interpreting the coefficient $\beta_1$ relative to the baseline $\beta_0$.

Handout 5: Multiplicative models — working with logged response variable 乘法模型

$y=\alpha \cdot {(e^{\beta_1})}^x$
Symptoms that indicate we should workwith $\log y$ include:
- $y$ values are right-skewed(右偏，又称为正偏) around trend line and variability increases with increasing $y$.
- Scatter plot of $x$ vs $y$ showing exponential increase or decrease
- Our intuition telling us that effects will operate on the multiplicative scale
- So a log-linear model was fitted
Fit an appropriate linear model to $\log y$ \[log(y_i)=\beta_0+\beta_1x_i+\epsilon_i\]
Backtransform CIs using $exp$.
Interpretation is on the multiplicative scale — e.g., percentage change
The percentage change is in the median of y.

Handout 6: Categorical and numeric explanatory variables (a.k.a. ANCOVA model)

One numeric explanatory variable and one 2-level categorical/factor explanatory variable (i.e., grouping variable).
Fit interaction model
Dummy variables for both intercept and slope: \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \beta_3 D_i x_i + \epsilon_i \]
Interpreting the coefficients $\beta_2$ and $\beta_3$ relative to the coefficients of the baseline level (i.e, group 1) of the categorical variable:
- $\beta_2$ is change in intercept relative to baseline level
- $\beta_3$ is change in slope (i.e., change in effect of $x$) relative to baseline.
Check p-value of $\beta_3$ to determine if interaction is significant – remove if not.
Model with interaction corresponds to different slopes in the two groups
- Typically not of interest to report on intercept coefficients.

Handout 7: Categorical and numeric explanatory variables (a.k.a. ANCOVA model) 分类和数值变量

Example where the categorical variable has 3 levels
Use anova function to jointly test the statistical significance of the interaction terms.
- 用anov 函数来一并测试全部交互项的统计学显著程度，是否有common slope。
- 它可以同时检查两个或者更多个系数是否为零。
- 如果交互项是显著的（i.e., 系数不为零）则即使该项的斜率为0，也不能去掉该项；
- 与二次方模型相同，如果有二次方项，则不能去掉其1次方项，即使它的技术为0.
Apply Occam’s razor (i.e., simplify the model by removing interaction term) if interaction not significant.
- This gives the main-effects model (a.k.a. parallel lines).
  - 主效应模型，即不管属于哪个类别，随数字变量的变化效果都相同。
- Further simplify model if common slope not significantly different from zero, or difference in intercepts not significant.
  - 如果这个共同的斜率common slope等于0，不管属于哪个分类，岁数字变量的变化都为0，也就是说分类变量已经没有作用了，见CS7.2 — 亦即，分类变量对响应变量没有影响，无须解释。
For main-effects model:（有common slope $\neq 0$）
- Report on the slope (which is common to all groups)
- Report on the difference in intercepts (i.e, the vertical separation between the parallel lines)

Handout 8: Models with several explanatory variables 多重回归模型

Use pairs20x plot to look at patterns and strength of correlations
- 对于pairs20x
- 对角线是概率密度分布图
- 左下角是各变量的系数，表明他们的关系强度
- 右上角是y相对x的散点图
本质上，解释变量的统计意义（P-Value）衡量的是：在响应变量的变化性被其他解释变量调整以后，该解释变量还能解释这个变异性的大小
Using confounding (highly correlated) explanatory variables causes problems:
- Loose statistical signficance even when explanatory variables have clear association with $y$.
- Can be hard to interpret the effects
Avoid using highly correlated explanatory variables.
Interpret the effect as conditional on all other explanatory terms in the model

Handout 9: Power law models 幂律模型

Response increases/decreases to some power of the numeric explanatory variable: $y=\alpha x^{\beta_1}$
相应变量增加或减少到解释变量的n次幂
- Typically, the variability increases with increasing $y$.
Take the log of both $y$ and $x$, and fit a linear model: \[ \log(y_i) = \beta_0 + \beta_1 \log(x_i) + \epsilon_i \]
Interpret $\beta_1$ as a percentage change in the median of $y$ from a 1% increase in $x$.
- Or, could report percentage change due to doubling of $x$, say.

Handout 10: 多级分类变量以及多级比较问题

$lm(y ~ fact1)$
One-Way ANOVA 只有一个解释变量，不存在interaction
能够通过ANOVA 输出计算$R^2$
- $R^2 = 100 \times (1 - \frac{"Residuals \, Sum \, Sq"}{"Sum\,Sq列之和""})$
多个levels要同时进行比较，否则:
- We have 5% chance of erroneously making an untrue conclusion each time.
- This is known as multiple comparison problem
- There will be erroneous evidence of effect of …
- So, we need to use founction multipleComp(data.fit)
步骤：
- iid检查
- anova(fit) 检查P-Value 得出各组数据是否存在差异，可计算$R^2$
- summary1way(fit) 可以得出总体均值，Table of Effects：各组均值相对总体均值的偏差（reference是总体均值，不是G1）
- multipleComp(fit) 得出所有比较结果，提取其中P-value小于0.05 的项用来解释
- 如果y值有取过对数，则记得做exp逆运算。
- Executive Summary:

Handout 11: 两个分类变量

ANOVA 的解释：ANOVA（方差分析）主要描述解释变量全部是分类的线性回归。
两个解释变量用 Two-Way ANOVA
步骤：
- interactionPlots(y ~ x1 + x2, data = df)
  - 看不到平行线 parallel lines indicating interaction between two factors
- fit with interaction lm(y~x1*x2, data = df)
- iid check
- anova(fit) 检查相关性是否确立
- summary2way(fit, page="interaction") 如果有interaction
- Executive Summary: 我们常常这样解释
  - We estimate that for someone who …,(因为互相影响，所以要先设定一个相同的前提下讨论另一个区别) those who with … can expect to …. more than those who ….
- anova 还能确认其中一个factor 是否有影响（就是看这个factor里面不同组对应的响应变量是否有所不同。），如果P-Value小于0.05，则去掉一个解释变量，变成One-Way Anova，如果只有两个levels，又变成了two-sample t-test.

Handout 12: 分析计数值数据

计数值数据

计数值数据常出现在对分类数据的观察中
如果对每个对象的度量都是分类变量（即：测量结果不是数值，而是归入那一类）
分类的级级别就是categories（分类变量可以取到的不同值）
计数值 count 就是每种分类组合出现的次数
rowdistr(.tbl) 查看行内百分比
count 数据不是正态分布的，线性回归有时候可以应用到count data（通常要log-transformed）

泊松分布

non-negative 非负整数
no upper limit 没有上限
discrete variable 离散变量
重要特性 $\sigma^2 = \mu$ : $Var(Y)=E(Y)$
right skew when $\mu$ is small, like normal when $\mu$ is large 右偏态
- 不看峰值，看尾部，尾部在哪边就是哪偏，右偏尾部为正直所以又称为正偏。
cell counts 大于5的时候可以用卡方检测，将count当成正态分布而不是泊松分布

Useful functions

dpois(12, 9.61)

## [1] 0.08685078

rpois(20, 10)

##  [1]  7 20  9 11 14 13  5  6 10  9 10  9 14 14  8 11  8 12 17 11

barplot(dpois(1:12, 3), names=1:12, las=1)

barplot(dpois(0:250,100), names=0:250, las=1)

## 用卡方检测，检查2-way table的相关性 * 什么是卡方分布若n个相互独立的随机变量ξ₁、ξ₂、……、ξn，均服从标准正态分布（也称独立同分布于标准正态分布），
则这n个服从标准正态分布的随机变量的平方和 \[Q = \sum_{i=1}^n \xi_i^2\] 构成一新的随机变量，其分布规律称为$\chi^2$分布 (原来某某分布指的就是随机变量值的分布规律)
（chi-square distribution），其中参数n称为自由度，正如正态分布中均值或方差不同就是另一个正态分布一样，自由度不同就是另一个$\chi^2$分布,记为$Q \sim \chi^2(n)$ 或者 $Q \sim \chi_n^2$.
卡方分布是由正态分布构造而成的一个新的分布，当自由度n很大时, $\chi^2$分布近似为正态分布。
对于任意正整数x，自由度为 k的卡方分布是一个随机变量X的机率分布.

期望和方差
- 分布的均值$\mu$为自由度 n，记为 $E(\chi^2) = n$。
- 分布的方差$\sigma^2$为2倍的自由度(2n)，记为 $D(\chi^2) = 2n$。
- 回忆：泊松分布的方差等于均值$\sigma^2 = E(y) = \hat\mu$
性质
- 分布在第一象限内(指的是概率密度函数)，卡方值（随机变量每个值都是平方和）都是正值，呈正偏态（右偏态），随着参数 n 的增大，$\chi^2$分布趋近于正态分布；卡方分布密度曲线下的面积都是1.
- 分布的均值与方差可以看出，随着自由度n的增大，$\chi^2$分布向正无穷方向延伸（因为均值n越来越大），分布曲线也越来越低阔（因为方差2n越来越大）。
- 不同的自由度决定不同的卡方分布，自由度越小，分布越偏斜。
- 若 $\chi_{v_1}^2$ 与 $\chi_{v2}^2$ 互相独立，则：$\chi_{v_1}^2 + \chi_{v2}^2$ 服从$\chi_{v_1+v_2}^2$分布，自由度为 $v_1+v_2$

有效条件：

卡方检测用到了Z统计量，这就要求$n_{ij}$ 正态分布：is at least approximately normally distributed.
$\Leftarrow$ Expected counts $\hat{n_{ij}}$ to be sufficient $\geq 5$

举例：

The $\chi^2$ test of independence compares the oberved counts to thoes one would expect if there was no association between row and column factors/

AP.tb1 = matrix(c(17,83,27,19), nrow = 2, byrow = T, dimnames = list(c("attand", "not.attend"),c("fail","pass")))
# AP.tb1
# rowdistr(AP.tb1)
# show to show row proportions

# chisq.test(AP.tb1)
# chisq.test(AP.tb1, correct = F)

默认chisq.test 是修正的，这个修正是为了通过修正$Z_{ij}$的非连续性，来修正Counts的非连续整数的性质
使用卡方检验的条件是：count不能太小，太小就不正态分布了，$n_{ij}>5$ 如果小于则代码会有warning 1. P-Value表明了Null hypothesis was rejected. 他们之间有相关性

AP.chisq=chisq.test(AP.tb1, correct = F)
AP.chisq$expected

##                fail     pass
## attand     30.13699 69.86301
## not.attend 13.86301 32.13699

X = (17-30.14)^2/30.14 + (83-69.86)^2/69.86 + (27-13.86)^2/13.86 + (19-32.14)^2/32.14
X

## [1] 26.02961

chisq.test(AP.tb1, correct = F)

## 
##  Pearson's Chi-squared test
## 
## data:  AP.tb1
## X-squared = 26.016, df = 1, p-value = 3.386e-07

从估值也能看出他们的$H_0$假设不成立

The next step is to quantify the strength of the association.

rowdistr(AP.tb1, comp = "between", plot = F)

## Row Proportions
##            fail pass Totals   n
## attand     0.17 0.83      1 100
## not.attend 0.59 0.41      1  46
## 
## 95% CIs for diffs between proportions with fac2 = fail 
## (rowname-colname) 
##        not.attend      
## attand (-0.577,-0.257)*
## 
## 95% CIs for diffs between proportions with fac2 = pass 
## (rowname-colname) 
##        not.attend    
## attand (0.257,0.577)*

rowdistr(AP.tb1, comp = "between", suppressText = T)

通过比较同一列在不同行内的概率的不同，可以知道行列有相关性（interactionPlot 不行）但并不能比较相关性的强弱，即概率差相同但在尺度上不同，不能认为他们的相关度就相同

卡方检验的问题：

count < 5 会报错，因为cell里的count
需要chisq.test(*.tb, correct=F), TRUE和F要对比着看，如果差异太大也有问题默认是True，因为需要修正Zij的非连续性，以便修正count不是连续数的问题
chisq.test()$expected 可以查看估值
rowdistr(*.tb, comp=“between”, plot=F) 可以查看同一列内不同的概率，以及95% CI 概率差估值，根据概率不同，可以看到他们的有关联度。
根据概率差的大小可以看出他们的关联度强弱
同一列内不同行的概率差的置信区间（xxx, xxx）这样的区间在2-by-2 contingency table里有两个，在column为two levels factor 时他们绝对值相同互为正负
这个置信区间可以手动计算
- $se(\hat p_1 - \hat p_2) = \sqrt{\frac{\hat p_1(1-\hat p_1)}{1} + \frac{\hat p_2(1- \hat p_2)}{1}}$
- $\hat p_1 - \hat p_2 \pm 1.96 \times se(\hat p_1 - \hat p_2)$
- 这个概率差是总结时需要用到的。
这个不同row的概率差，在不同的column有所不同，提供了evidence of association. 但其很难在尺度上衡量association的strength。
interactionPlots()不同，他的值是统计次数，还要看斜率才能粗略知道是否相关。

Handout 13：比值比

AP.tb1

##            fail pass
## attand       17   83
## not.attend   27   19

手动求CI for OR的公式知道contingency table 可以交叉求出$\hat{OR}$,
注意排列顺序不同$\hat{OR}$可能不同，但也可能相同。
$\hat{OR} = \frac{\hat{Odds_1}}{\hat{Odds_2}} = \frac{n_{11}n_{22}}{n_{12}n_{21}}$
- 这个$\hat{OR}$ 算出来就可以知道同在列1组内不同行事件概率的比值了，而且可以推广到不同table排列方式的比值
- $se\left(\log(\hat{OR})\right) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$ 根据STATS703
- 当样本比较大的时候$\log(\hat{OR})$ 接近正态分布 is approximately normal,
- so CI for $\log(\hat{OR})$:
  - $\Rightarrow \log(\hat{OR}) \pm 1.96 \times se\left(\log(\hat{OR})\right) = (a,b)$
  - $\Rightarrow$ 95% CI for $\hat{OR}$:
  - $\exp(a), \exp(b)$ 用以解释数据的行列比较
Test the association using the log odds: $H_0$ means no row-column association $\Rightarrow \hat{OR}=1 \Rightarrow log(\hat{OR})=0$ To test $H_0$ we use Z statisic $Z = \frac{\log(\hat{OR}) - 0}{\rm se(\log(\hat{OR}))}$ $\Rightarrow P-value 进而判断 \hat{OR}=1 是否极端$ 其实这个P-Value就是测试$H_0:log(\hat{OR}=0)$ 跟第15章的 summary(glm.fit) 出来的第四二个 $P-value 测试 coefficient[4] \beta_3=0$ 是一样的，p值也一样 z-test 适用于大型数据，多余30个个体以上，如果知道标准差，那最好用z!

Handout 14：费舍尔检验 FET

如果行列不关联，即选中观察值的完全是凭运气猜的。
能够选择到这样的表格的概率$P_1$+所有比这个表格更难猜到的表格概率$P_2, P_3, \cdots$ 之和要小于0.05 那么我们就说这个表格不是乱猜出来的。
这说明行列有关联，是因为这个关联性才得出这样的统计数字。

Handout 15：推广线性统计模型到关联列表

多重线性模型提供了很多便利性：
- simple linear regression
- ANOVA two-way ANOVA summary2way()
- ANCOVA
如何让多重线性模型在响应变量是count的时候仍然有效？
log不是好办法
泊松分布
相关性：
- 关联列表中没有相关性，第一行不同列的期望值差，与第二行内不同列期望值之差相同，就说他们有相关性，这是不对的。
- 要看他们不同期望值之间的倍数关系而不是之差。
- 这个倍数差别相同才能说他们有相关性。
- 如果他们在倍数尺度上的差别相同
- 这点关联列表跟two-way ANOVA 有所不同，关联列表里面的数是两分类变量组合出现的个数。要倍数差别相同才能说没有相关性，比如来听课的和不来听课的学生，开始通过的和不通过的人数都是8：2或者说是4倍，这样才能说没有相关性。但two-way ANOVA因为其响应变量不是出现个数而是领一个值，所以只要增加减少的幅度想等就可以说没有相关性了，所以看two-way ANOVA可以直接看斜率，而关联列表不行。
分析步骤：

fisher.test(tbl)$p.val $H_0$ 是无相关或者 `chisq.test(tbl)$expected 倍数差距相等则无相关
模型： $n_{11} = n \times r_1 \times c_1$
$log(n_{11}) = log\,n + log\, r_1 + log\, c_1 = \beta_0$
$log(n_{12}) = \beta_0 + \beta_1 \Rightarrow \beta_1 = log c_2 - log c_1$
$log(n_{21}) = \beta_0 + \beta_2 \Rightarrow \beta_2 = log r_2 - log r_1$
$log(n_{22}) = \beta_0 + \beta_1 + \beta_2$ no interaction
$log(n_{22}) = \beta_0 + \beta_1 + \beta_2 + \beta_3$ interaction
$\beta_3 = log(\hat{OR})$
fit = glm(y~fac1 * fac2, family = poisson, data = df)
summary(fit) residual deviance means residual sum-of-squares 4. summary 出来就coefficient $\beta_0, \beta_1, \beta_2$ 将他们加起来再exp 就可以得到4个格子的期望值了。似乎没什么意思，估出来跟观测值相同，因为模型中我们加入了相关性参数。
如果$\beta_3 \neq 0$
- 其中$\beta_3$ 的解释比较多取exp 可以得到 $\hat{OR}$
- 用起来解释不同组合的odds之比
如果$\beta_3 = 0$
- 没有interaction的情况下，我们可以说，独立的针对行内和列内进行比较
- $log(n_{12}) = \beta_0 + \beta_1 \Rightarrow \beta_1 = log c_2 - log c_1 \Rightarrow exp^{\beta_1} = \frac{c_2}{c_1}$
- $log(n_{21}) = \beta_0 + \beta_2 \Rightarrow \beta_2 = log r_2 - log r_1 \Rightarrow exp^{\beta_2} = \frac{r_2}{r_1}$
- c2列出现的数量是c1列出现数量的多少倍
- r2行出现的数量是r1行出现数量的多少倍

Handout 16：推广线性统计模型到计数数据

这次数据类型不是列联表了，不过他的因变量仍然是一个count
数据类型，只不过可以说是数据个数少了吧。
如何将多重线性模型在响应变量是count的时候仍然有效？
之前已经学习了各种情况下拟合线性模型，现在y是一个count的时候只需要在R code里稍稍改一点点，重要的是如何解释输出。
$log(E[Y]) = linear model$
重要的是我们不需要变换Y 因为Rcode 已经内建了变换
以前用lm 的倍数模型拟合count 数据的问题是，Y的值如果为0，如何log(0)呢，这样就会人为的造成outlier
fit = glm(y~f2 * x2, family = poisson, data = df)
1-pchisq(deviance(gfit), df = df.residual(gfit))求P值用来假设泊松假设：方差等于均值
- 如果模型正确，残差偏差应该是卡方分布的。
- 如果p小于0.05
- 说明我们的模型不适当not adequate
- family = quasipoisson
family = quasipoisson 说明数据的方差比他们满足泊松分布应有的方差要大。
$Var(Y) > E(Y) \quad or\quad Var(Y) = k\,E(Y)$
K > 1
其结果是我们估计效果的标准差变大了，但估计效果本身没有变
z-value 的绝对值变小了，p值变大了。
用lm模型来拟合log(Y) 没有考虑到数据的方差随着均值的增大而增大。同时，被迫抛弃一些有效的数据点，因为他们的cook’s distance 太大。

举例1：

地震的例子其实是：一个 ANCOVA with interaction
对于glm(fac1*x2, family = poisson)

由于$log(E[Y]) 而且 \beta_2 是相对于 reference 的斜率$
要求另一组斜率就要调换fac1 里面两个levels的顺序，或者$\beta_2 + \beta_3$得出另一个斜率

举例2：

大学录取性别歧视检测其实是一个 two-way ANOVA
最开始的时候只考虑性别和是否录取：two-way ANOVA
只不过原始数据没有给出关联列表，而是要用xtabs(y~gender+outcome) 得出关联列表
后来又考虑到学院就开始要用xtabs(Y~x1 + x2 + x3, data=df) 了
- 这里有个著名的辛普森悖论，即，不同组里出现的趋势，在合并这些组以后改变了。
- 模型变成了3个解释变量，变成$fac_1 \times x_3 + fac_2 \times x_3 + fac_1 \times fac_2$

举例3：

鲷鱼在两个地区及在其保护区与否，的不同数量count
interactionPlots(y~fac_1 \times fac_2) 斜率相等不能证明没有相关性了。要在倍数尺度（multiplicative scale）上相等才行
$glm(y~ fac_1 * fac_2)$
summary(glm) $residual deviance \approx df \Rightarrow no need quasi+$
$P-value < 0.05 \Rightarrow no association$
Explain:
- $\beta_1: c2/c1$
- $\beta_2: r2/r1$

Handout 17.5 逻辑回归

当响应变量是二元量的时候 -（逻辑回归logistic regression）

Y是伯努利随机变量，它的取值为1/0，该伯努利实验有个参数：成功率（Y=1）的几率为p，
$E(Y)=p$ 伯努利随机变量的均值 = p
为了fit model：如何将区间在[0,1]的二元数据扩展到为$[-\infty, +\infty]$
$Odds = \frac{p}{1-p}$ $\subset [0, +\infty]$
log-odds is $\log(\frac{p}{1-p}) = x$ $\subset [-\infty, +\infty]$ 为logit:分对数函数
$\Rightarrow p = \frac{exp(x)}{1+exp(x)}$ 为logistic
技术上来讲，它也就是一个二项式的广义线性模型二项式广义线性模型也可以用在概率上，只要你知道尝试的总数。成功次数加起来除以总数就得到概率
逻辑函数
- 具体步骤：
- 用xtabs()帮你统计成功数，并用：100*成功数/次数=得出成功率，但这不重要（每个性别在每个距离上都是投了10次，所以trial 次数为10）
- $logit(E[Y_i]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3(x_1 \times x_2)$
- fit = glm(Y~X_1 \times X_2, family = binomial, data = df)
- $Y_i$是值为0，1的随机变量通过取它的odds的对数而不是直接取它的对数，变为正态分布
- 书本的男女不同距离投篮例子其实还是一个 two-way ANOVA
- anova 这里给出的信息非常多：
- p-value >0.05 的系数可以被简化掉了。基本上这样就够了
- 1-pchisq(Resid.Dev (Residual Deviance) Resid.Df (Residual Df)) 小于.05 则用 quasi+
- 根据这两个数可以求p-value = 1-pchisq(RD, DF) 当做泊松分布来计算行列相关性，>0.05则说明没有相关性得到的结果应该是很接近的
- 也可以根据DF 求卡方分布的标准差$s = \sqrt{2*DF}$
refit的model
summary(fit, family=binary, data =)
将系数还原$exp(\beta)$ 得到的值是对Odds的影响
分析个系数的意义 So a 1 metre increase in distance would result in a 92.8% reduction in odds. Base on the p-value from the summary, simplify the fitted model. exponentitate the coefficient to get the multiplicative effect on the odds

当相应变量是成功数的时候

用xtabs(y~x, df = ) 统计成功数
只是因变量不是[0,1]指，而是1的个数：the count of ‘successes’
此处将因变量变化为：$fit=glm(cbind(success, fail)~x, family = binormial, data = df)$
其他相同
还是一个逻辑回归 logistic regression

Handout 18：当相应变量是百分比的时候

Handout 18.5：垃圾邮件

要建模我们需要一些训练数据
逻辑回归可以称为监管学习方法
fit = glm(spam~.) . 超级牛，涵盖所有解释变量除了spam本身。 sapply(email, function(x){if(is.factor(x)){table(email$spam,x)}}) 找出level 中没有case 或者control的category
anova 去掉not significant variables
- dropColumns 去掉不显著的columns
- refit
- 设定分类依据 classification rules
- 其实就是设定判断垃圾邮件的threshold
- 这个threshold affect the “true positive rate and false negative rate”
- `predSpam = ifelse(fitted(spam.fit2) > 0.5, 1, 0)
- 预测的正确率：
  - 通过一个混合矩阵：confusion matrix，就是一个two-way table，行是真实值，列是估值，对照一看就知道正确率了，这其实还是回到看关联列表的相关性和相关程度上来了。有意思
  - conf.matric = table(email$spam, predSpam)
  - 这个table的，对角线之和除以所有条码之和，就得到正确率了
  - pred.accuracy = sum(diag(conf.matrix))/sum(conf.matrix)
  - tpr = conf.matrix[2,2]/sum(conf.matrix[2,])
  - 这样得出的准确率是高，但是垃圾邮件的检测成功率却不高，这是因为模型把大部分精力用在了拟合费垃圾邮件上，必经费垃圾邮件数量庞大得多。
  - 给每个判断做一个权重，spam的权重为1，不是垃圾的权重为367/3554.
  - 重新refit
  - 这样准确率降低了一些，但垃圾邮件的检测成功率大幅提升了。
- 交叉验证
- 用同一组数据进行建模以及对模型的验证，多半会导致对建模的效果过分乐观。最好是有一组数据专门做建模，又叫训练模型。另一组数据用于测试，又叫测试模型。但通常我们没有机会或者这些信息。怎么办？
- 交叉验证：用一部分portion数据用来建模，一部分用来测试，这个比例proportion 有很多种：
  - 留一法，5倍法，10倍法，k倍法，
  - 10倍法 10-fold 就是90%用来训练模型，10%用来测试模型
  - 按这个比例随机抽取数据得到两个子集subset of the data，然后进行前面的计算。
  - 讲这个随机抽取和计算重复100遍，就可以得到这个模型的性能的均值，已经这个性能的变化范围（变异性）

Handout 时间序列

介绍时间序列

时间序列分析：

一个时间序列由在连续递增的时间上对数据的收集、重新排序和观察组成。
时间间隔相等：时、日、月、季度、周年等
时间序列的数据是有序的，不像回归，这里数据的顺序很关键
注意，时间是有方向的，如果过去能够影响将来，那么观察值就不是相互独立的。

研究方法

head(data.df, 5)/tail 查看变量、值的类型
pairs20x(data.df[,2:5]) 为数据帧绘制直方图（对角线）和散点图（右上角）
没有把时间放进去研究的时候
plot(data.df, which = 1) 查看残差分布，constant and with no trend
normcheck(lm.fit)
cooks20x(lm.fit) no influential outlier
由于曲线弯曲，所以log是试试
refit.fit = lm(log(y)~x, data.df)
summary(lm.fit)
查看残差随自变量的分布趋势
把时间放进去一起研究的时候残差图看起来更像随机分布在0附近了。可见年顺序在时间序列分析中的作用。

时间序列的数据结构

随着时间记录下来的响应变量值
以及一系列在同样时间点记录下来的解释变量（表示为：$X_{t1},X_{t2},X_{t3},...,X_{tk},$)
- 我们将用它们解释相应变量的行为，或者预测相应变量未来的值。
在金融和经济领域，我们通常关心股票市场、消费价格等随时间的变化。
这类变量的观察值常常以指数形式出现：
- 消费价格指数 CPI
- 交易权重指数 TWI
- 股指 NZSX50
附带解释变量的观察值
$时间段 \quad 观察值 \quad 解释变量$
$\quad 1 \qquad Y_1 \qquad X_{11} \quad X_{12} \quad \cdots \quad X_{1k}$
$\quad 2 \qquad Y_2 \qquad X_{21} \quad X_{22} \quad \cdots \quad X_{2k}$
$\quad \vdots \qquad \vdots \qquad \quad \vdots \qquad \vdots \quad \; \vdots \qquad \vdots$
$\quad t \qquad Y_t \; \qquad X_{t1} \; \quad X_{t2} \quad \cdots \quad X_{tk}$
$\quad \vdots \qquad \vdots \qquad \quad \vdots \qquad \vdots \quad \; \vdots \qquad \vdots$
$\quad T \qquad Y_T \qquad X_{T1} \quad X_{T2} \quad \cdots \quad X_{Tk}$
这里多了一列时间，它的顺序必须是升序排列

时间序列的成分

时间序列模型

在为时间序列数据建模的时候，分解成四个成分可以帮助理解和量化数据随时间的变化例如：我们的模型可以如下：
\[Y_t = T_t + C_t + S_T + R_t\]

我们可以将这几个成分重新合并成一个新模型来做预测，但这不是理想的方法。

四个成分

趋势成分：
最慢的成分，我们的问题是：这个趋势是上升的还是下降的？
- 最慢、长期、平滑
- 趋势通常来自总体的增长、变化、技术性的改变等等。
- 趋势可以随着时间或者在一段时间内，上升或下降。
周期成分：
不规则波浪形行的模式可以出现在不同的时间间隔，并且每次出现都不一样。
- 第二慢，周期性的，一系列波浪形的
- 波动的波长和振幅既不固定也无法预测，所以是不规则的模式
- 很多时间序列的周期性行为主要来自商业周期注：如果趋势成分允许非线性，则很难区分趋势成分和周期（波动）成分，这样他们就可以合并成一个趋势成分。
季节成分：
有规律的重复的模式，有固定长度的时间间隔，例如周年或者月。
- 变化比周期更迅速，为时间序列中，相等间隔规律性的重复模式。
- 季节性基于日历或者小时
- 可以字面上理解为三个月（一季度），也可以理解为月，周，日或年。
随机成分(余数)
无法预测的随机变化
- 总是存在的，变化最快的成分。
- 它由那些无法预测的关于趋势，周期，季节成分的变化组成。
- 但他可以当作其他模型的观测值一样拟合，例如标准正态分布$N(0, \sigma^2)$, 但它可以不满足独立性原则。
- 它可以理解为当其他成分都去掉以后留下的成分
- 它是时间序列中无法被其他东西解释的成分
- 在季节变化的数据中，随机成分可以从季节成分无法准确重复出现看出。

STL（Seasonal Trend Loess）函数分解时间序列

将响应变量转换成时间序列类型对象 ts()
stl()函数执行季节趋势回归分解

data("airpass.df")
passengers.ts = ts(airpass.df$passengers, start = 1949, frequency = 12)
plot(passengers.ts, main = "", ylab = "")

passengers.stl = stl(passengers.ts, s.window = "periodic")
plot(passengers.stl)
abline(v=1950:1960, lty=2)

如何读图

解释变量和它的值取自连续的时间间隔
- 响应变量相对于时间的图有助于发现是否存在时间趋势。

plot(passengers.ts, main = "", ylab = "")

remainder = passengers.stl$time.series[,3]

- 有对时间进行回归分析得出的残差图可以显示出其他解释变量所没有捕捉到的时间的作用。
- 将观察值的点连接起来有利于发现时间顺序的模式。plot(y~x, data=data.df) lines(y~x, data=data.df)

例1：

data("mening.df")
mening.ts = ts(mening.df$mening, frequency=12, start=c(1990,1))
plot(mening.ts, main="")
abline(v=1991:2001, lty = 2)

# airpass.df = within(airpass.df, {month = factor(rep(month.abb, 12), levels = month.abb); time = 1:144})
# lines(passengers~month, data = airpass.df)

解读：

可以看出有年度季节性模式
每隔12个月放一个垂直线，将数据按成年分段：abline(v=1991:2001, lty=2)
总体来看这个图是在重复的。
所以有季节性。

例2

data(beer.df)
beer.ts = ts(beer.df, frequency = 12, start = c(1971,7))
plot(beer.ts, main = "US Beer Production, July 1970 to June 1978")
abline(v = 1971:1978, lty = 2)

解读：

上图总体来看也是重复波动的，而且它的常规峰值出现在一年中相同的时间

例3

passengers.ts = ts(airpass.df$passengers, start = 1949, frequency = 12)
plot(passengers.ts, main = "", ylab="")
abline(v=1950:1960, lty=2)

解读：

我们可以看到季节性的涨落幅度更大了；
如果这个幅度随时间增加或者减少，我们应该用乘法分解代替加法分解。 $Y_t = T_t \times C_t \times S_t \times R_t$
由于响应变量是一个计数值，所以用乘法模型是很自然的
原则上，我可以通过扩展处理泊松数据的方法，来将其用在计数值时间序列上。
这里我们用乘法模型，通过$log(Y_t)$来实现，前提是$Y_t$都比较大，如本例中。
对数据取自然对数可以是季节变化变得恒定，但时间成分的作用将变成倍数（增加多少时间->数据增加多少倍）

log.passengers.ts = log(passengers.ts)
plot(log.passengers.ts, main = "", ylab="")
abline(v=1950:1960, lty=2)

分解图：

分解图将帮助我们去掉季节和随机成分，去了解潜在的趋势或波动：

passengers.stl = stl(log.passengers.ts, s.window = "periodic")
plot(passengers.stl)
title("Decomposition of log(Passengers)")

* 趋势存在少量的非线性，最多表明数据中可能有较弱的波动。

分解时间序列：

要解决的问题：

相对12月，1月的销量往往有所下降，我们想知道这个下降是否仅仅是正常的季节变化引起的？
同样的，由于学生毕业加入劳动力人群，在每个学年末失业率都有所提升。年末提升的失业率表明失业率真的有所提升还是仅仅是正常的季节变化？

解决的方法：

研究去季节化后的值（时间序列去掉季节成分）
1. 我们估计去季节后的值存在趋势或波动（或者存在解释变量的作用）。研究潜在的趋势或者波动
2. 如果没有证据证明趋势的存在，我们就可以推断我们看到的只是时间序列上正常的季节变化
3. 否则我们就说存在根本性的变化，而不是季节
用stl分解
我们从数据中减掉季节成分再在其上做回归分析；注：这里要小心，因为不满足独立假设。
1. 获得季节成分
2. decomp.stl = stl(data.ts, s.window = “periodic”)
3. sa.data = data.ts - decomp.stl$time.series[,2]

# 代码段2 ----
passengers.ts = ts(log(airpass.df$passengers), start = 1949, frequency = 12)
plot(passengers.ts, main = "Data with Seasonal Component")

passengers.stl = stl(passengers.ts, s.window = "periodic")
sa.passengers = passengers.ts - passengers.stl$time.series[, "seasonal"]
plot(sa.passengers, main = "Data with seasonal component removed")

用一个解释分类变量来去掉季节成分

将一个分类变量用在时间序列的回归模型中，分类的级别对应每个季节。
用这种方法不需要调整数据去掉季节成分，季节成分是模型设定的一部分
这种回归方法适合用在预测，因为它考虑到了整个时间序列。
另一种预测方法是Holt-Winters 指数平滑。

预测

指数平滑（exponential smoothing）

概念

指数平滑是能用过去的所有信息来预测将来的方法
每个平滑值是序列中之前所有值的加权平均数（weighted average）
指数序列标记为：$S_t$ (区别于之前的原始序列$Y_t$)
$S_t$ 的平滑度有平滑参数$\alpha$控制

基本形式：

\[S_t = \alpha Y_t + (1 - \alpha) S_{t-1}\]

平滑常量$\alpha$越小，结果序列越平滑
之所以叫指数平滑是因为距离现在越远的观察值其权重随时间指数级的变小（exponentially smaller）
表示如下： \[S_t = \alpha Y_t + \alpha (1 - \alpha) Y_{t-1} + \alpha (1 - \alpha)^2 Y_{t-2} + \cdots + (1-\alpha)^{t-1} Y_1\]
从等式的左往右，随着t变得非常大，表达式的系数变得更小（以指数速度 at an exponential rate）

性质

平滑后的序列有赖于之前的所有值，其中离现在最近的值的权重最大
指数平滑需要大量的观测值
以上指数平滑的基本形式不适合存在趋势或者季节性的数据
指数平滑在R中用 HoltWinters 函数

指数平滑的基本形式：（废除其他成分，免得成为通用形式）

HoltWinters(data.df, alpha = x, beta = FALSE, gamma = FALSE) $\alpha$ 由x来制定，beta 和 gamma 用在有趋势和季节性的时候

data("larain.df")
LArain.ts = ts(rev(larain.df$LA.Rain) * 25.4, frequency = 1, start = 1877)
plot(LArain.ts, xlab = "Season", ylab = "Total rainfall (mm)")

LArain.hw = HoltWinters(LArain.ts, alpha = 0.5, beta = FALSE, gamma = FALSE)
plot(LArain.hw, lwd = 2, col.predicted = "blue", main = "alpha = 0.5")

LArain.hw = HoltWinters(LArain.ts, alpha = 0.95, beta = FALSE, gamma = FALSE)
plot(LArain.hw, lwd = 2, col.predicted = "blue", main = "alpha = 0.95")

LArain.hw = HoltWinters(LArain.ts, alpha = 0.75, beta = FALSE, gamma = FALSE)
plot(LArain.hw, lwd = 2, col.predicted = "blue", main = "alpha = 0.75")

LArain.hw = HoltWinters(LArain.ts, alpha = 0.25, beta = FALSE, gamma = FALSE)
plot(LArain.hw, lwd = 2, col.predicted = "blue", main = "alpha = 0.25")

LArain.hw = HoltWinters(LArain.ts, alpha = 0.05, beta = FALSE, gamma = FALSE)
plot(LArain.hw, lwd = 2, col.predicted = "blue", main = "alpha = 0.05")

实际上我们可以让HoltWinters来指定$\alpha$，它通过最小化均方预测误差来确定。

预测举例：

在predict 中调用 n.ahead 参数

LArain.hw1 = HoltWinters(LArain.ts, beta = FALSE, gamma = FALSE)
LArain.pred = predict(LArain.hw1, n.ahead = 5)
plot(LArain.hw1, predicted.values = LArain.pred, lwd = 2, col.predicted = "blue")

LArain.pred = predict(LArain.hw1, n.ahead = 15, prediction.interval = TRUE)
plot(LArain.hw1, predicted.values = LArain.pred, lwd = 2, col.predicted = "blue")

hist(resid(LArain.hw1))

这个模型似乎不是很好，对数据取对数来去掉向右歪斜，同时避免第

log.LArain.ts = log(LArain.ts)
LArain.hw2 = HoltWinters(log.LArain.ts, beta = FALSE, gamma = FALSE)
plot(LArain.hw2)

log.LArain.pred = predict(LArain.hw2, n.ahead = 15, prediction.interval = TRUE)
exp(log.LArain.pred)

## Time Series:
## Start = 1943 
## End = 1957 
## Frequency = 1 
##           fit      upr      lwr
## 1943 343.7816 929.1857 127.1928
## 1944 343.7816 931.5911 126.8644
## 1945 343.7816 933.9966 126.5377
## 1946 343.7816 936.4020 126.2126
## 1947 343.7816 938.8074 125.8893
## 1948 343.7816 941.2129 125.5675
## 1949 343.7816 943.6184 125.2474
## 1950 343.7816 946.0239 124.9289
## 1951 343.7816 948.4296 124.6121
## 1952 343.7816 950.8353 124.2968
## 1953 343.7816 953.2411 123.9831
## 1954 343.7816 955.6470 123.6709
## 1955 343.7816 958.0531 123.3604
## 1956 343.7816 960.4593 123.0513
## 1957 343.7816 962.8657 122.7438

用HoltWinters预测

可以用HoltWinters预测包含趋势和季节的时间序列
在拟合过程中有三个不同的指数平滑，更新公式：
- 水平作用 $a_t$
- 斜率作用 $b_t$
- 季节作用 $s_t$
每一个更新的方程都有它自己平滑常数：$\alpha, \beta , \gamma$
预测模型为： \[F_{T+h} = a_T + hb_T + s_{T,h}\]
- 预测是通过估计水平，斜率和季节在时间上的作用来实现的。T是时间序列中的最后一个时间点。
- $s_{T,h}$ 表示持续重复的季节作用，尤其是在时间T上估算出来的，并用来计算h个时间单位以后的季节作用。

log.pass.ts = ts(log(airpass.df$passengers), frequency = 12, start = 1949)
log.pass.hw = HoltWinters(log.pass.ts)
log.pass.pred = predict(log.pass.hw, n.ahead = 30)
plot(log.pass.hw, log.pass.pred, col.predicted = "blue", lwd = 2)

让R来决定最优的 $\alpha, \beta , \gamma$

log.pass.pred = predict(log.pass.hw, n.ahead = 30, prediction.interval = TRUE)
plot(log.pass.hw, log.pass.pred, col.predicted = "blue", col.intervals = "red", lwd = 2)

针对时间序列的线性回归模型

时间序列线性回归

我们前面一直假设在标准模型中，误差是相互独立的，但这个假设在时间序列不成立。
通过明确建立依赖模型，可以扩展线性模型来放宽这个假设要求
在此简述一阶（first-order）自相关作用

自相关作用

拟合线性模型的到时间序列的主要问题就是随机误成分的值常常非独立
这个问题成为自相关或者连续相关
自相关很难人言识别，尤其是有很多解释变量的时候。所以要集中注意力在解释那些拟合解释变量到响应变量以后所剩下的产值上。

识别自相关

最简单的就是检查残差的自相关函数图（ACF）首先要明确用回归模型分析时间序列的策略：

以可用的解释变量给均值建模，如果没有，到第二步
通过增加一个季节分类变量来给季节性（如有）建模(这个方法的缺陷是它没有施加任何模式到季节成分上，所以它没有利用1，2月比1月和7月更近这样的事实)
通过增加一个（或多个）时间项到模型中来给所有时间依赖关系建模
检查残差的ACF图。如果有自相关证据，延迟相应（lagged-response）模型可以解决

举例：

plot(log.passengers.ts)

没有解释变量，跳过第一步
有很强的季节效果，建立一个以季节分类建立一个线性模型

# 第二步：添加季节项 ----
airpass.df = within(airpass.df, { month = factor(rep(month.abb, 12), levels = month.abb); time = 1:144})
pass.fit = lm(log(airpass.df$passengers)~month, data = airpass.df)
plot(pass.fit$residuals)
# plot(pass.fit$residuals, type = "l")
lines(airpass.df$time, residuals(pass.fit)) # 用线将点连接起来
lines(lowess(airpass.df$time, residuals(pass.fit))) # 用离散点平滑后画出的曲线，用来了解趋势再好不过了。

# lowess returns a list containing components x and y which give the coordinates of the smooth. The smooth can be added to a plot of the original points with the function lines: see the examples. 在这里显然不用了，因为散点图已经可以看出明显的趋势了。

可以看到很强的时间作用（线性），需要再对时间建模，

# 第三部：添加时间项 ----
pass.fit1 = lm( log(airpass.df$passengers)~time + month, data = airpass.df )
anova(pass.fit1)

## Analysis of Variance Table
## 
## Response: log(airpass.df$passengers)
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## time        1 25.1233 25.1233 7143.582 < 2.2e-16 ***
## month      11  2.2843  0.2077   59.047 < 2.2e-16 ***
## Residuals 131  0.4607  0.0035                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p-value可以看成指示性的，因为不满足独立性假设。

看残差图

plot( residuals(pass.fit1)~fitted(pass.fit1))
lines(lowess(fitted(pass.fit1), residuals(pass.fit1)))

二次方的？

重新拟合

pass.fit2 = lm( log(airpass.df$passengers)~time + month + I(time^2), data = airpass.df )
anova(pass.fit2)

## Analysis of Variance Table
## 
## Response: log(airpass.df$passengers)
##            Df  Sum Sq Mean Sq   F value    Pr(>F)    
## time        1 25.1233 25.1233 10813.900 < 2.2e-16 ***
## month      11  2.2843  0.2077    89.386 < 2.2e-16 ***
## I(time^2)   1  0.1587  0.1587    68.307 1.409e-13 ***
## Residuals 130  0.3020  0.0023                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(residuals(pass.fit2)~fitted(pass.fit2))
lines(lowess(fitted(pass.fit2), residuals(pass.fit2)))

plot(residuals(pass.fit2), type = "l") #link the points lower case L
lines(lowess(fitted(pass.fit2), residuals(pass.fit2)))

* 这回看起来好多了，相对恒定而且没有趋势了 * 但残差对时间的图表现出非独立性 4. 残差是否有自相关性？

acf(residuals(pass.fit2))

ACF 图

ACF图是自变量h=0,1,…,T-2，的一个函数图，通常没到T-2就被截断了，因为数据太少了
垂线的高度代表我们感兴趣的变量（通常为残差）在时间t和t+h的值的相关性
第一条垂直线的高度恒为1，因为$\rm Cor[Y_t, Y_{t}] = 1$
第二条垂直线的高度是$\rm Cor[Y_t, Y_{t+1}]$， lag-1 自相关叫1阶自相关
第三条垂直线的高度是$\rm Cor[Y_t, Y_{t+2}]$， lag-2自相关又称为二阶自相关，以此类推
蓝色的水平虚线是我们认为自相关性I显著的位置。大致来讲，如果垂直线超过虚线，则$H_0: \rho_h = 0$ 的 P-value将会小于0.05。（由于多重比较，即使没有相关性，有一两个自相关值稍微超过虚线也是正常的）

回到前面如何识别自相关性的第四步

残差是否有自相关性？

acf(residuals(pass.fit2))

* 这是明显的自相关模式，有很多垂直线超过了虚线。我们遇到了自相关问题。 * 要解决这个问题，我们明确建立一个一阶自相关模型，通过引进一个延迟响应变量作为解释变量到模型中来解决这个问题。 * 即：$Y_1$用来解释$Y_2$，$Y_2$用来解释$Y_3 \cdots$，以此类推 * 这意味着我们不能使用$Y_1$作为相应变量，因为没有$Y_0$ log(passengers)[-1] 去掉第一项 * 这类模型称为延迟响应模型

pass.fit3 = lm( log(passengers)[-1] ~ time[-1] + month[-1] + I(time[-1]^2) + log(passengers)[-144], data = airpass.df)

我们总要加上延迟相应变量作为模型的最后一项。

anova(pass.fit3)

## Analysis of Variance Table
## 
## Response: log(passengers)[-1]
##                        Df  Sum Sq Mean Sq  F value    Pr(>F)    
## time[-1]                1 24.4515 24.4515 19260.14 < 2.2e-16 ***
## month[-1]              11  2.2733  0.2067   162.79 < 2.2e-16 ***
## I(time[-1]^2)           1  0.1617  0.1617   127.35 < 2.2e-16 ***
## log(passengers)[-144]   1  0.1362  0.1362   107.25 < 2.2e-16 ***
## Residuals             128  0.1625  0.0013                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-Value很小，有力证明延迟相应项应该包含在模型中（假设我们的模型是正确的）

acf(residuals(pass.fit3))

至少有少量的最高点超过了虚线，这可能是因为多重比较导致的
留意一阶自相关建模是如何消除其他显著的自相关的
- 因为如果A和B紧密相关，而B和C紧密相关，则A和C也可能是紧密相关的。从而消除一阶自相关的时候可能同时消除了更高阶的关系。
- 有两个延迟（12 和 16）也超过了虚线。这个acf图有效的做了21个假设测试，它给我们图像化展示了前21个自相关性是否显著（在5%水平上）。我们不应该感到奇怪去掉自相关性后还有如果1到两个是显著的。
正式的通用延迟响应模型如下： \[Y_t = \beta_0 + \beta_1 t + \beta_2 X_{t1} + \cdots + \beta_{p+1} X_{tp} + \rho Y_{t-1}, \quad t = 2, \cdots , T\]
- 通过包含一个延迟的响应变量到解释变量中，明确的调整一阶自相关。
- 延迟相应变量的系数$\rho$可以当作残差自相关的天然估值。
- 这个模型可以用来做预测但不适合作超过往前1步的预测。
如果没有修正正自相关就拟合一个线性回归模型，则我们对残差标准差的估值将低于其实际值。
结果，将会高估$R^2$，还有回归系数的t-test以及anova F-test 也将变得无效。

为什么要关注自相关性

信息的遗失：自相关的结果是阻碍我们了解总体中的个体差异的真实大小
举例：

总结：

先说说GLM

对于广义线性模型，这里讨论的主要就是当响应变量Y不是连续正态分布的时候。
除了Poisson count data 是 log(E[Y]) 以外，其他无一例外都是log(odds) 或者称为 logit(p)：
glm 建模的区别：
- count data: family = poisson
- binary data: family = binomial
- count of success: cbind(success, fail), family = binomial
- proportion: family = binomial, weight = n(trials)
- 相同点都要看residual deviance vs degree of freedom
如何通过把它们变换形式后拟合线性模型，求出它们的系数，之后通过逆变换回归分析它们自身，
即：通过系数我们了解到当解释变量发生+1、×0.1、×2、×10（numerical explanatory）,或分别处在 level 1， level 2 的时候，相应变量将发生什么相应的变化。
由于Y进行了变换，有时候他们变换后的形式本身也具有统计意义，比如：

count response：log-linear model
- log(E[Y])
- $fac_1 * x2$ 直接解释吧，只是体现了glm 相对lm的优势，考虑了方差随均值的增加而增加，而且不用去掉本来有意义的outlier。
- $fac_1 * fac_2$ exponentiate coef 可以得到$\hat{OR}$ 能解释的东西就多了，把2-by-2表格颠来倒去看就行。如果没有关联系，还能分列，分行独立进行对比解释。
- binary response或者说binomial data或者说logistic regression，它的相应变量变为了logit(p)
- 即 log(odds)，所以我们只要exponentiate 那些系数就可以知道odds的变化，这样就行了，我们并不需要了解它的binary response的变化。
- predict 了以后记得要逆变换才能得到p概率
- $\frac{exp(logit(p))}{1+exp(logit(p))} = p$
count of successes
- 为什么成功数数据符合二项分布？因为他们是给定次伯努利实验的总和。
- 它的fitting 比较特殊有个cbind(success, fail)~x1*x2 — 在这种例子中，我们少了个体信息。还要假设给定行列，那么成功概率是确定的。每一次实验都是独立的。
- 败血症例子中，因为可以死亡的人数是有上限的，所以即使响应变量是个count，我们也不能用泊松分布
- 通过变换
response is a proportion
- S 型曲线在统计学语言里称为逻辑曲线，采用的方法称为逻辑回归
- $\beta_0 影响曲线的垂直高度，\beta_1 的量级（绝对值）影响曲线的陡峭程度，越大越陡，\beta_1 < 0 的时候曲线是反S形。$
- log(odds)
- 书本上predictCount()出来的可以直接解释为成功概率，不需要再exp了

这里有几个有趣的地方：关联列表就是一个two-way anova
它们的区别全部在响应变量的形式上：
对于响应变量是：
1. 各种组合下响应变量的值：比如来不来上课及段考过没过的学生的期考成绩
2. 各种组合出现的次数：count to contingency table
3. 各种组合下的成功与否：binary response，这个组合下成功为1不成功为0
4. 各种组合成功的次数：count of successes 每种组合下成功的次数

考点

好吧，重头戏登场了：

MA & Excu Summary procedure

MA
1. what & why a model was fitted & refitted
2. all assumption was satisfied
3. the final model is:
Excutive Summary
1. problem of interest
2. p-value
3. estimation & prediction
4. total variance explained & good or not for prediction

1. Null model was fit

2. simple linear model

MA
1. 相应变量和解释变量都是数值quantitative variables. so a simple linear model was fitted

3. quadratic model

MA
1. scatter plot suggested curvature
2. residual plot(simple-linear.fit) showes curvature
Excu Summary
1. For a one mark increase in assignment score, the increase in the expected exam score was greater as the the assignment score increased, e.g., there is little difference in expected exam scores for those getting 7 or 8 in assignment, much bigger for those getting 17 or 18.

4. Two-sample t-test for mean

MA:
1. linear model was fitted
2. with equal variance assumed in bot group)

5. Multitplicative model:

Multiplicative model 的特点：
- 从散点图plot/trendscatter 看：Y值有指数级下降的趋势，伴随着离散度也下降
- 从residual plot看，
  1. a log-linear model was fitted
  2. 前面几章的特点：求响应变量随解释变量变化的关系。
    - 注意：截距在这里没有意义，无须解释，因为截距已经不代表均值，而是当$x_2, x_3 \cdots , x_n$=0 的时候的值，而这些值在数据中通常都会不会为0.
    - 比如段考成绩为0的学生的期末考试成绩，是没有意义的。
  3. prediction interval 和 confidence interval的区别：predict的是预测样本中个体的值区间，confidence的是预测样本均值的区间，注意log-linear model只能预测median 而不是mean
  4. 对于相互影响的变量，我们通常需要知道的是它的调节作用，它调整了另一个变量的影响：It adjusts the effects of the test marks on exam marks.
1. ANCOVA without interaction
2. ANCOVA with interaction
3. several explanatory variables.
  1. 通常将哪个变量被logged，哪个变量被去掉即可。
4. Power law model. log-log model
5. one-way ANOVA
  1. 一个分类变量，里面分类超过2中
6. two-way ANOVA
  - MA
    - 是否parallel
    - 同A比B，同B比A 比大小趋势
    - 是否有 unusual high low
7. poisson model
8. poisson model
9. logistic regression model
11. time series
残差的同分布，指的是残差图evocheck(fit)/plot(fit, which=1)
- 图中对于每个相同x值上的多个y值（残差）在垂直方向上的分布是正态分布的，而不同x值对应的这个分布应该相同。所以应该都是分布在0周围，而且不应该有趋势：residuals (do not) appear to be randomly scattered around 0.

Exec Summary:

关于interpretation 的技巧：

研究对象：
- We want to quantify the relationship between … and …
- We wish to investigate/check/quantify
- We are interested in using…/checking whether…
时态：
1. 我们关注：we are interested …
2. 图标显示：the plots showes …
3. 我们估计（当时发生的情况）：用过去式
4. 我们估计如果…会发生…：可以用过去式也可以用一般现在时，will would
5. between … and/to … 单位和百分比，可以写在后面一个数字后。
6. 单位的单复数：表示量的多少需要复数，如5 grams, 1.8 meters. 如果是缩写则表示单位，不用复数：34.1 cm
分段解释：
1. A child has a higher epxected birth-weight the longer its gestiataion time - up to a 42 weeks - then it starts decreseing in size the longer it stays unborn.
2. 这里的分段位置在42 weeks，之前正相关，之后负相关。
3. 通过增加一个dummy variable，我们得到两个又相互影响的解释变量。
4. dummy 变量影响另一个解释变量的效果。e.g. for non-regular attenders each additional mark of … the … increase somewhere between … and …, for students who attended regularly, this effect on the exam mark increased ()
$R^2$ 解释的是数据的或或者是响应变量的variability 变异性
快速反应：log(Y) 及 log-log的区别
1. 对于：log-linear model
  - $log(y_i) = \beta_0 + \beta_1 x_i$
  - $y = exp^{\beta_0} \times exp^{\beta_1 \cdot x}$
  - $for \, x_i^{'} = x_i+a$
  - $y_i^{'} = y_i \times exp^{\beta_1 \times a}$
  - 所以x增加a，y就乘以 $e^{\beta_1 \cdot a}$
2. 对于： log-log model
  - $log(y_i) = \beta_0 + \beta_1 log(x_i)$
  - $y_i = exp^{\beta_0} \times exp^{\beta_1 \cdot log(x_i)}$
  - $y_i = exp^{\beta_0} \times x^{\beta_1}$
  - $for \, x_i^{'} = x_i \times a$
  - $y_i^{'} = y_i \times a^{\beta_1}$
  - x乘以a，y就乘以 $a^{\beta_1}$
3. 区别：
  - 注意在log-linear model中对比的x的增加量为了让影响更加显著 $\Delta_x= +1, +10, +100 \Rightarrow \Delta_y = \times e^{\beta_1 / 10\beta_1/100\beta_1}$
  - 在log-log model中对比的是x变化的倍数$\Delta_x= \times2, \times4, \times6 \Rightarrow \Delta_y = \times {(2/4/6)}^{\beta_1}$
常用句型：
- seems to be/appears to be/does not appear to be
- slight trend/variability increases with increaseing fitted values(y)
- there is also …
- roughly constant
- more-or-less averaging around zero
- come to a erroneous conclusion base on wishful thinking
统计显著：
- the P-value associated with the Attend variable is highly statistically significant.
Variation -差异正常预期结果与观测结果之间的差额总量即为差异，variation表示变化，variance表示差异
标准差
- 简单来说，标准差是一组数据平均值分散程度的一种度量。一个较大的标准差，代表大部分数值和其平均值之间差异较大；一个较小的标准差，代表这些数值较接近平均值。
\[\sigma = \sqrt{ \frac{1}{N} \sum_{i=0}^N (x_i-\mu)^2}\]
方差
标准误
- 标准误差针对样本统计量而言，是某个样本统计量的标准差。当谈及标准误差时，一般须指明对应的样本统计量才有意义。以下以样本均值（样本均值是一种样本统计量）作为例子：
- 例如，样本均值是总体均值的无偏估计。但是，来自同一总量的不同样本可能有不同的均值。于是，假设可以从总体中随机选取无限的大小相同的样本，那每个样本都可以有一个样本均值。依此法可以到一个由无限多样本均值组成的总体，该总体的标准差即为标准误差。
- $SE_{\overline x} = \frac{s}{\sqrt n}$
t值
iid:
1. independence: investigate how we obtained the data.
2. identically distributed: result in the variance of the residual being roughly constant(regardless of the fitted values) and the residuals more-or-less averaging around zero. (see with evocheck)
3. Normality (see using normcheck)
当要你comment一个plot的时候通常要你comment什么？
1. yx1x2x3关系，甚至xx关系(相关性)，
2. 变换，曲线，转折点
3. 异常点，极值点

STATS Learning Summary

Handout 1: Introduction and null model

Handout 2: Simple linear model

Handout 3: Fitting curves using linear models

Handout 4: Categorical explanatory variable:

Handout 5: Multiplicative models — working with logged response variable 乘法模型

Handout 6: Categorical and numeric explanatory variables (a.k.a. ANCOVA model)

Handout 7: Categorical and numeric explanatory variables (a.k.a. ANCOVA model) 分类和数值变量

Handout 8: Models with several explanatory variables 多重回归模型

Handout 9: Power law models 幂律模型

Handout 10: 多级分类变量以及多级比较问题

Handout 11: 两个分类变量

Handout 12: 分析计数值数据

计数值数据

泊松分布

相关性检验逻辑：

有效条件：

举例：

卡方检验的问题：

Handout 13： 比值比

Handout 14： 费舍尔检验 FET

Handout 15： 推广线性统计模型到关联列表

Handout 16： 推广线性统计模型到计数数据

举例1：

举例2：

举例3：

Handout 17.5 逻辑回归

当响应变量是二元量的时候 -（逻辑回归logistic regression）

当相应变量是成功数的时候

Handout 18： 当相应变量是百分比的时候

Handout 18.5： 垃圾邮件

Handout 时间序列

介绍时间序列

时间序列分析：

研究方法

时间序列的数据结构

时间序列的成分

时间序列模型

四个成分

STL（Seasonal Trend Loess）函数分解时间序列

如何读图

例1：

例2

例3

分解时间序列：

用一个解释分类变量来去掉季节成分

预测

指数平滑（exponential smoothing）

概念

基本形式：

性质

指数平滑的基本形式：（废除其他成分，免得成为通用形式）

预测举例：

用HoltWinters预测

针对时间序列的线性回归模型

时间序列线性回归

自相关作用

识别自相关

举例：

相关性和自相关性

ACF 图

为什么要关注自相关性

总结：

先说说GLM

考点

MA & Excu Summary procedure

1. Null model was fit

2. simple linear model

3. quadratic model

4. Two-sample t-test for mean

5. Multitplicative model:

Exec Summary:

关于interpretation 的技巧：

Handout 13：比值比

Handout 14：费舍尔检验 FET

Handout 15：推广线性统计模型到关联列表

Handout 16：推广线性统计模型到计数数据

Handout 18：当相应变量是百分比的时候

Handout 18.5：垃圾邮件