\[ \begin{align} \sum_{i=1}^n (Y_i - \mu)^2 & = \ \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 (\bar Y - \mu) (\sum_{i=1}^n Y_i - n \bar Y) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2\\ & \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \ \end{align} \]
\[\begin{align} \ \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = & \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i) (\hat \beta_0 + \hat \beta_1 X_i - \beta_0 - \beta_1 X_i) \\ = & (\hat \beta_0 - \beta_0) \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i) + (\hat \beta_1 - \beta_1)\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i)X_i\\ = & 0 \end{align} \]
\[ \begin{align} \sum_{i=1}^n (Y_i - \bar Y)^2 & = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \\ & = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\ \end{align} \]
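The decomposition above can be checked numerically. The following is a minimal sketch, assuming the built-in `mtcars` data and a simple regression of `mpg` on `wt` (neither appears in the original notes). For a least squares fit with an intercept, the cross term should vanish up to rounding error, so the total sum of squares splits into the residual and regression parts.

```r
# Illustrative check on built-in data; the data set and variable names are assumptions
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
y_hat <- fitted(fit)

sst <- sum((y - mean(y))^2)                    # total sum of squares
sse <- sum((y - y_hat)^2)                      # residual sum of squares
ssr <- sum((y_hat - mean(y))^2)                # regression sum of squares
cross <- sum((y - y_hat) * (y_hat - mean(y)))  # cross term from the expansion

# SST should match SSE + SSR, and the cross term should be numerically zero
c(SST = sst, SSE_plus_SSR = sse + ssr, cross = cross)
```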
data(anscombe); example(anscombe)
\[ \begin{align} Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y) (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\ & = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X) \right) }{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\ & = \frac{\sum_{i=1}^n \sigma^2(X_i - \bar X)^2}{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\ & = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2} \\ \end{align} \]
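As a rough check of the variance formula, the sketch below plugs the usual residual-based estimate of \(\sigma^2\) into it and compares the result with what `vcov` reports. The choice of the first Anscombe pair (`y1` on `x1`) is an assumption made only for illustration.

```r
# Illustrative check; the choice of y1 ~ x1 from anscombe is an assumption
fit <- lm(y1 ~ x1, data = anscombe)
x <- anscombe$x1

sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)  # estimate of sigma^2
var_b1 <- sigma2_hat / sum((x - mean(x))^2)         # formula for Var(beta1 hat)

# The hand-computed value should agree with the slope entry of vcov()
c(formula = var_b1, vcov = vcov(fit)["x1", "x1"])
```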
The prediction interval is the range in which a single future observation is most likely to fall, whereas the confidence interval is the range in which the mean of future observations is most likely to lie.
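The difference between the two intervals can be seen with `predict`. This is a minimal sketch, assuming the built-in `mtcars` data and a hypothetical new car with `wt = 3`. Both intervals share the same centre, but the prediction interval is wider because it also accounts for the noise in a single new observation.

```r
# Illustrative sketch; the data set and the new observation are assumptions
fit <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)  # hypothetical new observation

conf_int <- predict(fit, new_car, interval = "confidence")  # interval for the mean response
pred_int <- predict(fit, new_car, interval = "prediction")  # interval for one new response

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```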
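# x1 is nearly collinear with x2; the true coefficients are -1 (x1) and 1 (x2)
n <- 100; x2 <- 1 : n; x1 <- .01 * x2 + runif(n, -.1, .1); y <- -x1 + x2 + rnorm(n, sd = .01)
# Omitting x2 badly biases the estimated coefficient of x1
summary(lm(y ~ x1))$coef
# Including x2 gives estimates close to the true values
summary(lm(y ~ x1 + x2))$coef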
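R automatically detects and drops a regressor that is generated from the other regressors (an exact linear combination). In the example above, x1 needs the runif(n, -.1, .1) noise term for both coefficients to be estimated.
t tests: if the intercept is removed from the model, every factor level is t-tested against zero; otherwise the coefficients are differences in group means. Use relevel(data, 'name') to choose the reference level for the comparison.
See ?influence.measures for influence diagnostics.
There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.
Donald Rumsfeld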
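Returning to the note on t tests and `relevel`, here is a minimal sketch, assuming the built-in `InsectSprays` data (not used in the original notes). It fits the same factor model three ways: with the default reference level, with spray "C" as the reference, and without an intercept, where each coefficient is simply a group mean tested against zero.

```r
# Illustrative sketch; InsectSprays and the choice of "C" are assumptions
group_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)

# Default: spray A is the reference, other coefficients are mean differences from A
fit_ref_A <- lm(count ~ spray, data = InsectSprays)

# relevel() changes the reference level to spray C
sprays_ref_C <- transform(InsectSprays, spray = relevel(spray, ref = "C"))
fit_ref_C <- lm(count ~ spray, data = sprays_ref_C)

# No intercept: each coefficient is a group mean, t-tested against zero
fit_no_int <- lm(count ~ spray - 1, data = InsectSprays)

summary(fit_ref_A)$coef
summary(fit_ref_C)$coef
summary(fit_no_int)$coef
```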
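Use vif to check for variance inflation. Coefficient estimates are biased when the model is underfitted, that is, when relevant covariates are left out. offset can be used to estimate rates, such as growth rates.
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
John Tukey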
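As a sketch of the offset note above, the example below fits a Poisson model to simulated counts with unequal exposure times. The data, the true rate of 0.3 and all variable names are assumptions made for illustration. With only an intercept and `offset(log(exposure))`, the exponentiated intercept is the estimated rate per unit of exposure, which should match total events divided by total exposure.

```r
# Simulated data; every name and number here is hypothetical
set.seed(1)
exposure <- runif(50, 1, 10)                  # observation time per unit
events <- rpois(50, lambda = 0.3 * exposure)  # counts with true rate 0.3 per unit time

# The offset fixes the coefficient of log(exposure) at 1, so the model is for the rate
fit_rate <- glm(events ~ 1 + offset(log(exposure)), family = poisson)
rate_hat <- exp(coef(fit_rate)[["(Intercept)"]])

c(model = rate_hat, direct = sum(events) / sum(exposure))
```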
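caret package: createDataPartition (split by proportion), resampling, time slices, train, predict
train <- createDataPartition(y=spam$type,p=0.75, list=FALSE)  # 75/25 split of the data, returns the row index
folds <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=TRUE)  # split the data into 10 folds, returns a list with one element per fold
folds <- createResample(y=spam$type,times=10,list=TRUE)  # bootstrap resampling, returns a list with one element per resample
folds <- createTimeSlices(y=tme,initialWindow=20,horizon=10)  # resampling for time series: training windows of 20 points, each followed by 10 points to predict
args(train.default)  # method selects the algorithm, metric selects the evaluation metric, trainControl controls the training procedure
In trainControl, method selects the resampling scheme (bootstrap, cross-validation, leave-one-out), number sets the number of folds or iterations, repeats sets the number of repeated resamples, and seed controls reproducibility. Set one overall seed, and use a list of seeds to control each individual resample, especially for models fitted in parallel.
featurePlot and ggplot2 for exploratory plots
In train, preProcess=c("center","scale") standardizes the predictors.
spatialSign: this transformation can improve computational efficiency, but it is biased.
preProcess(training[,-58],method=c("BoxCox"))  # transform towards normality
method="knnImpute" fills in missing values by nearest neighbours.
nearZeroVar removes zero-variance variables.
findCorrelation removes correlated variables.
findLinearCombos removes variables that are linear combinations of others.
classDist measures distances for categorical (class) variables and generates new variables.
bs in the splines package
In preProcess, set method to pca; pcaComp specifies the number of principal components.
fancyRpartPlot in the rattle package draws attractive tree plots.
caretEnsemble package
library(ISLR); data(Wage); library(ggplot2); library(caret);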
Wage <- subset(Wage,select=-c(logwage))
# Create a building data set and validation set
inBuild <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
validation <- Wage[-inBuild,]; buildData <- Wage[inBuild,]
inTrain <- createDataPartition(y=buildData$wage,p=0.7, list=FALSE)
training <- buildData[inTrain,]; testing <- buildData[-inTrain,]
mod1 <- train(wage ~.,method="glm",data=training)
mod2 <- train(wage ~.,method="rf",data=training,trControl = trainControl(method="cv",number=3))
pred1 <- predict(mod1,testing); pred2 <- predict(mod2,testing)
qplot(pred1,pred2,colour=wage,data=testing)
predDF <- data.frame(pred1,pred2,wage=testing$wage)
combModFit <- train(wage ~.,method="gam",data=predDF)
combPred <- predict(combModFit,predDF)
sqrt(sum((pred1-testing$wage)^2))
sqrt(sum((pred2-testing$wage)^2))
sqrt(sum((combPred-testing$wage)^2))
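clue package: cl_predict function
decompose
window: select a time window
ma: moving-average smoothing
ets: exponential smoothing
forecast: forecasting
quantmod package or quandl package for financial data
install.packages("shiny"); library(shiny)
ui.R controls the appearance; server.R controls the computation.
runApp() starts the application.
Code in server.R before shinyServer runs only once, when the application starts; it is a good place to read in data.
Non-reactive code inside shinyServer(function(input, output){ runs only once per user.
Render* functions are reactive; they rerun whenever an input value changes.
runApp(display.mode='showcase') highlights the code as it executes.
reactive is used to speed up the exchange of information outside the render functions.
actionButton submits all inputs at once; if (input$goButton == 1){ Conditional statements } defines conditional behaviour.
cat and browser() for debugging
fluidRow lays content out in grid rows.
require(devtools); install_github('ramnathv/rCharts')
install.packages("devtools"); library(devtools); install_github('ramnathv/slidify'); install_github('ramnathv/slidifyLibraries'); library(slidify)
author("yufree")
YAML configures the structure of the slide deck.
## starts a slide; --- followed by a blank line ends it; .class #id set the class and id used by a custom CSS file.
slidify("index.Rmd") builds the deck; browseURL("index.html") opens it.
publish_github(user, repo) publishes to GitHub.
DESCRIPTION describes the contents of the package:
Package: package name
Title: full name
Description: one-sentence description
Version: version number
Author: author
Maintainer: maintainer
License: license
Depends: dependencies
Suggests: suggested packages
Date: release date in YYYY-MM-DD format
URL: project home page
R: source code
Documentation: Rd files
NAMESPACE: directives for the functions and classes that are imported and exported
R CMD build/check newpackage builds and checks the package.
roxygen2 generates documentation from comments in the source files.
setClass defines a class; setMethod defines a method that handles that class.
Generic functions dispatch on the object and are open to extension; when no method is defined for a class, the general method is used.
stats4 provides many object definitions and methods aimed at maximum likelihood estimation.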
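The notes on setClass, setMethod and generics can be sketched as follows. The class `Measurement` and the generic `describe` are hypothetical names invented for this illustration; only `setClass`, `setGeneric`, `setMethod` and `new` come from the methods package.

```r
library(methods)

# A hypothetical S4 class with two slots
setClass("Measurement", representation(value = "numeric", unit = "character"))

# A hypothetical generic; dispatch happens on the class of `object`
setGeneric("describe", function(object, ...) standardGeneric("describe"))

# Method for the specific class
setMethod("describe", "Measurement", function(object, ...) {
  paste(object@value, object@unit)
})

# Fallback method, used when no more specific method is defined
setMethod("describe", "ANY", function(object, ...) "no specific method")

m <- new("Measurement", value = 2.5, unit = "kg")
describe(m)    # dispatches to the Measurement method
describe(1:3)  # falls back to the ANY method
```

Defining the fallback on "ANY" is one way to get the "use the general method when no class-specific one exists" behaviour; S3-style `describe.default` functions are the analogous idea for S3 classes.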