library(UsingR)
library(ggplot2)
ggplot(nym.2002, aes(age, time)) +
geom_point(col="#808000", size=2) +
geom_smooth(method="lm") +
theme_bw() +
theme(axis.title=element_text(size=12),
axis.text = element_text(size=12))
Figure 1.1: Scatter plot with regression line
The population regression function is generally written as \[\begin{align} Y=\beta_{0}+X\beta_{1}+\epsilon \tag{2.1} \end{align}\] or \[ Y=f(X)+\epsilon \]
\(Y\) is the dependent variable and \(X=(X_{1},X_{2},\ldots,X_{p})\) are the independent variables. \(f\) is a fixed but unknown function describing the relationship between \(Y\) and \(X_{1},X_{2},\ldots,X_{p}\). \(\epsilon\) is a random variable called the error term; it may contain variables we did not measure or cannot measure, and it is independent of \(X\). For example, patients with different conditions respond differently to a drug, and that relationship can be written as \(Y=f(X)\). But the drug's potency may also vary because of factors in its manufacture, and patients may respond differently depending on their physical state that day; all of this is collected in \(\epsilon\).
If we replace \(Y\) with \(E[Y|X]\), we are computing the conditional expectation of \(Y\), so equation (2.1) becomes the conditional expectation function:
\[\begin{align} E[Y|X]=\beta_{0}+\beta_{1}X+E[\epsilon|X] \tag{2.2} \end{align}\]
We assume \(E[\epsilon|X]=0\), so equation (2.2) says that when \(X\) is fixed at a given value, the mean of \(Y\) is determined by \(\beta_{0}\) and \(\beta_{1}\). The predicted value of an individual \(Y_{i}\) is written as: \[ \hat{Y_{i}}=\hat{\beta_{0}}+\hat{\beta_{1}}X_{i} \]
If \(E[\epsilon|X]\neq 0\), the error term is correlated with the independent variable, which biases the estimates of \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\).
A regression model is a function that approximates, as closely as possible, the mean of \(Y\) at each value of \(X\). Once such a regression function is found, it can be used to predict other observations.
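As a concrete illustration (a minimal sketch added here, not part of the original analysis), we can fit such a function to the nym.2002 data plotted above and use it to predict finishing times at a few ages:

```r
# Fit a simple linear regression of finishing time on age
fit <- lm(time ~ age, data = nym.2002)
coef(fit)                                            # estimated beta_0 and beta_1
# Predicted conditional mean of time at selected ages
predict(fit, newdata = data.frame(age = c(25, 40, 55)))
```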
The discussion of kernel density estimation (KDE) below uses the standard normal (Gaussian) density as the kernel: \[ \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)\]
\[\begin{align*} \hat{f}(x;h)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-X_{i}}{h}\right) \end{align*}\]
\(K\) is the kernel function, which acts like a weight; \(h\) (the bandwidth) controls how smooth the estimated density is, and we set it to 1 for now. \(n\) is the number of observations.
Looking at the formula closely, suppose \(h=1\): we take a particular value, subtract each observation from it, plug each difference into the kernel function, add the results, and divide by the sample size \(n\); this gives the kernel density at that value. The larger \(h\) is, the smoother the estimated density.
How do we choose \(h\)? An article by Andrey Akinshin notes that \(h\approx 1.06\times \hat{\sigma}n^{-1/5}\) works well when the kernel is Gaussian.
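As a small check (a sketch with an assumed simulated sample, not from the original text), this rule of thumb can be computed directly and compared with R's built-in bw.nrd(), which implements a closely related normal-reference rule:

```r
set.seed(1)
z <- rnorm(200, mean = 30, sd = 5)    # an assumed sample, for illustration only
# Rule-of-thumb bandwidth for a Gaussian kernel
1.06 * sd(z) * length(z)^(-1/5)
# R's normal-reference bandwidth; it additionally guards against heavy tails,
# so for roughly normal data the two numbers should be close
bw.nrd(z)
```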
Suppose we have three observations, 3, 4, and 7. What distribution might they come from? We can estimate it with KDE as follows:
# Define data and bandwidth
x <- c(3, 4, 7)
h <- 1
# Define a sequence of points where we want to evaluate the density
x_grid <- seq(min(x) - 2*h, max(x) + 2*h, length=100)
# Function to calculate the normal kernel density for a single point
kernel <- function(x_i, mu, h) {
dnorm(x_i, mu, h) # dnorm is R's normal density function
}
# Calculate kernel density for each data point
densities <- sapply(x, function(xi) kernel(x_grid, xi, h))
# Sum the kernel densities from each data point
density_estimate <- apply(densities, 1, sum)
# Normalize by the number of data points
density_estimate <- density_estimate / length(x)
# Plot the kernel density estimate
plot(x_grid, density_estimate, type = "l",
main = "Kernel Density Estimation (Normal Kernel)",
xlab = "Value", ylab = "Density")
- Using the men's dating-age variable in \(\textbf{UsingR::too.young}\) as an example, we can draw Figure 2.1 as follows; note that we set the bandwidth with bw=5:
lattice::densityplot(UsingR::too.young$Male, bw=5)
Figure 2.1: Kernel density plot 1
k=function(x){
(1/sqrt(2*pi))*exp(-(x^2)/2)
}
xs <- UsingR::too.young$Male
x1 = 20   # evaluate the density at x = 20
n = length(xs); h = 1
pdf = sum(k((x1-xs)/h))/(n*h)   # kernel density estimate at x1 (identical to sum(k(x1-xs))/n when h = 1)
print(pdf)
## [1] 0.03701
lattice::densityplot(UsingR::too.young$Male, bw=6)
Figure 2.2: Kernel density plot 2
n = length(xs); h = 1.2
# As in the original code, only the normalizing constant 1/(n*h) changes here;
# a full KDE would also rescale the kernel argument as k((x1-xs)/h)
pdf = sum(k(x1-xs))/(n*h)
print(pdf)
## [1] 0.03084
im1<-here::here('Fig','nonpreg1.jpg')
knitr::include_graphics(im1)
Figure 2.3: Nonparametric regression line 1
Around a focal value \(x_{0}\), the relationship can be approximated locally by a polynomial: \(Y_{i}=\alpha_{0}+\beta_{1}(x_{i}-x_{0})+\cdots +\beta_{p}(x_{i}-x_{0})^{p}+E_{i}\)
im2<-here::here('Fig','nonpreg2.jpg')
knitr::include_graphics(im2)
Figure 2.4: Nonparametric regression line 2
# Fit a local (loess) regression of intensity on inequality
lwl<-loess(intensity ~ inequality, data=carData::Chirot)
plot(intensity ~ inequality, data=carData::Chirot)
# Sort by inequality so the fitted curve is drawn from left to right
j <- order(carData::Chirot$inequality)
lines(carData::Chirot$inequality[j], lwl$fitted[j], col='red', lwd=2, lty=2)
im3<-here::here('Fig','nonpreg3.jpg')
knitr::include_graphics(im3)
Figure 2.5: Nonparametric regression line 3
Comparing nonparametric regression with linear regression, the data points sit closer to the nonparametric regression line than to the linear regression line, which means smaller errors; however, the fitted values along the nonparametric line are themselves more dispersed, so the fit is more flexible but also more uncertain.
Using the mean squared error, we can see the tension between variance and bias: when one is large, the other tends to be small.
\[MSE_{pred.-y_{i}}=Var_{pred.}+Bias_{\mathit{f_{i}}-y_{i}}^2\]
Define the mean squared error (MSE) as \(E[(y - \hat{f}(x))^2]\), where
- \(y=f(x)+\epsilon\)
- \(y\) is the observed value
- \(f(x)\) is the true model
- \(\hat{f}(x)\) is the model estimated from the sample
- \(\epsilon\) is the error between \(y\) and \(f(x)\)
The MSE can be decomposed as follows:
\[ E[(y - \hat{f}(x))^2] = E[y^2] - 2E[y\times \hat{f}(x)] + E[\hat{f}(x)^2] \]
\[\begin{align*} E[(y - \hat{f}(x))^2] & = E[\epsilon^2] + \Big(E[f(x)] - E[\hat{f}(x)]\Big)^2 + \mathrm{Var}\big(f(x)-\hat{f}(x)\big) \\ & = \text{irreducible error}+\text{bias}^2+\text{variance of prediction} \end{align*}\]
Holding the MSE fixed, a larger \(\Big(E[f(x)] - E[\hat{f}(x)]\Big)^2\) means that the average prediction of the estimated model is further from the truth, and the variance of the predictions, \(\mathrm{Var}\big(f(x)-\hat{f}(x)\big)\), must then be smaller; that is, the model's predictions change little in response to particular observations (it is less flexible).
The estimators used in nonparametric regression are very flexible. The smaller the bandwidth, the more flexible the fit and the larger the variance of the predictions; the larger the bandwidth, the less flexible the fit and the larger the bias relative to the true model. For example, Figures 2.7 and 2.8 show different bandwidths over different intervals of \(X\): the more intervals, the smaller the bias but the larger the variance. A small simulation below illustrates this tradeoff.
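Here is a minimal simulation sketch of that tradeoff; the true function \(\sin(2x)\), the noise level, and the two loess spans are illustrative assumptions rather than values from the text. A flexible fit (small span) is nearly unbiased at a fixed point but varies a lot across samples; a rigid fit (large span) is stable but biased.

```r
set.seed(1234)
f  <- function(x) sin(2 * x)   # assumed "true" model
x0 <- 2                        # evaluation point
nsim <- 200
pred_small <- pred_large <- numeric(nsim)
for (i in seq_len(nsim)) {
  x <- runif(100, 0, 4)
  y <- f(x) + rnorm(100, 0, 0.5)
  d <- data.frame(x, y)
  # Flexible fit (small span) versus rigid fit (large span), evaluated at x0
  pred_small[i] <- predict(loess(y ~ x, data = d, span = 0.2), newdata = data.frame(x = x0))
  pred_large[i] <- predict(loess(y ~ x, data = d, span = 0.9), newdata = data.frame(x = x0))
}
c(bias_small = mean(pred_small) - f(x0), var_small = var(pred_small),
  bias_large = mean(pred_large) - f(x0), var_large = var(pred_large))
```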
Scatter plot of the raw data:
with(carData::Chirot, plot(intensity ~ inequality))
Figure 2.6: Scatter plot
im4<-here::here('Fig','nonpa_4.jpg')
knitr::include_graphics(im4)
Figure 2.7: Nonparametric regression line 4
im5<-here::here('Fig','nonpa_8.jpg')
knitr::include_graphics(im5)
Figure 2.8: Nonparametric regression line 5
im6<-here::here('Fig','loess_r.jpg')
knitr::include_graphics(im6)
Figure 2.9: Nonparametric regression line 6
\[ E[Y|X]=\beta_{0}+X\beta_{1} \]
There are only two coefficients: \(\beta_{0}\) is called the intercept or constant, and \(\beta_{1}\) is the slope coefficient. When \(X=0\), the expected value of \(Y\) given \(X=0\) is \(\beta_{0}\), that is, \(E[Y|X=0]=\beta_{0}\). Alternatively, if we subtract the mean of \(X\) from \(X\), the centered variable has expectation 0 and \(E[Y]=\beta_{0}\). - Because the mean is a constant, and the expectation of a constant is the constant itself, we have:
\[ E[\bar{X}]=\bar{X} \]
So when \(X^{*}=X-\bar{X}\): \[\begin{align*} E[Y|X^{*}] & = \beta_{0}+\beta_{1}X^{*} \\ & = E[\beta_{0}]+E[\beta_{1}\times (X-\bar{X})] \\ &= \beta_{0}+\beta_{1}\times E[(X-\bar{X})] \\ &= \beta_{0}+\beta_{1}\times (E[X]-E[\bar{X}]) \\ &= \beta_{0}+\beta_{1}\times (\bar{X}-\bar{X}) \\ & = \beta_{0} \end{align*}\]
One of the assumptions of the linear model is that the intercept is fixed.
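A quick numerical illustration of the centering result (a sketch with simulated data, not from the original text): after centering \(X\), the estimated intercept equals the sample mean of \(Y\).

```r
set.seed(123)
x <- rnorm(100, 5, 2)
y <- 3 + 2 * x + rnorm(100)
coef(lm(y ~ x))["(Intercept)"]               # intercept with raw X
coef(lm(y ~ I(x - mean(x))))["(Intercept)"]  # intercept with centered X
mean(y)                                      # equals the centered-X intercept
```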
Figure 3.1: Conditional expectation of Y
\[E[X]=\int_{-\infty}^{\infty}xf(x)dx\]
\[E[X]=\sum xf(x)=\sum x\cdot P(X=x)\]
\[\rm{Var}(X)=E(X^2)-(E(X))^2\]
\[\boxed{E(Y|X=x)=\mu_{Y|X=x}=\sum yf(y|x)}\]
We use the following example to show how to compute the conditional expectation of Y (see Rachel Fewster):
Suppose X is a random variable with the following distribution (3.1):
\[\begin{equation} X = \begin{cases} 1 & \rm{with\hspace{.3em} probability}\quad 1/8\\ 2 & \rm{with\hspace{.3em}probability}\quad 7/8 \end{cases} \tag{3.1} \end{equation}\]
\[\begin{equation} Y|X= \begin{cases} 2X & \rm{with\hspace{.3em}probability}\quad 3/4\\ 3X & \rm{with\hspace{.3em}probability}\quad 1/4 \end{cases} \tag{3.2} \end{equation}\]
\[\begin{equation} Y|(X=1)= \begin{cases} 2 & \rm{with\hspace{.3em}probability}\quad 3/4\\ 3 & \rm{with\hspace{.3em}probability}\quad 1/4 \end{cases} \tag{3.3} \end{equation}\]
\[\boxed{Var(Y|X=x)=E(Y^2|X)-(E(Y|X))^2}\]
\[\begin{equation} Y^2|(X=1)= \begin{cases} 4 & \rm{with\hspace{.3em}probability}\quad 3/4\\ 9 & \rm{with\hspace{.3em}probability}\quad 1/4 \end{cases} \tag{3.4} \end{equation}\]
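Putting (3.3) and (3.4) together, the conditional mean and variance work out as follows (a worked step added for clarity):
\[\begin{align*} E(Y|X=1) &= 2\cdot\frac{3}{4}+3\cdot\frac{1}{4}=\frac{9}{4}\\ E(Y^2|X=1) &= 4\cdot\frac{3}{4}+9\cdot\frac{1}{4}=\frac{21}{4}\\ Var(Y|X=1) &= \frac{21}{4}-\left(\frac{9}{4}\right)^2=\frac{3}{16} \end{align*}\]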
\[E[Y]=\sum E[Y|X=x]P(X=x)\]
\[E(Y)=E[E(Y|X)]\]
| Day | Tea | Coffee | Total |
|---|---|---|---|
| Monday | 200 | 100 | 300 |
| Tuesday | 280 | 180 | 460 |
| Wednesday | 240 | 200 | 440 |
\[\begin{align} E[T] & = E[E(Tea|Day)]\\ & = E(T|Day=\text{Monday})\cdot P(\text{Monday}) + E(T|Day=\text{Tuesday})\cdot P(\text{Tuesday}) + E(T|Day=\text{Wednesday})\cdot P(\text{Wednesday}) \\ & = \frac{200}{300}\cdot \frac{300}{1200}+ \frac{280}{460}\cdot \frac{460}{1200}+ \frac{240}{440}\cdot \frac{440}{1200}\\ & = 0.6 \tag{3.5} \end{align}\]
| Day | Tea | Coffee | Total |
|---|---|---|---|
| Monday | 200 | 100 | 300 |
| Tuesday | 280 | 180 | 460 |
| Wednesday | 240 | 200 | 440 |
| Total | 720 | 480 | 1,200 |
\[ E[Y] = E[Y|X=1]\times P(X=1)+E[Y|X=2]\times P(X=2)=E[E[Y|X]] \]
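For the example in (3.1) and (3.2), the same identity gives (a worked step added for illustration):
\[ E(Y|X=2)=4\cdot\frac{3}{4}+6\cdot\frac{1}{4}=\frac{9}{2},\qquad E[Y]=\frac{9}{4}\cdot\frac{1}{8}+\frac{9}{2}\cdot\frac{7}{8}=\frac{135}{32} \]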
set.seed(02138)
#X, Y
x<-rnorm(1000, 0, 0.1)
y<-rnorm(1000, 0, 1) + x
library(dplyr)
tmp <- data.table::data.table(X=x, Y=y) %>%
group_by(X) %>%
summarise(y.bar=mean(Y))
mean(tmp$y.bar)
[1] 0.04224
mean(y)
[1] 0.04224
\[\begin{align*} \{\tilde{\beta}_{0},\tilde{\beta}_{1}\}&=\arg\min\limits_{{\tilde{\beta}_{0},\tilde{\beta}_{1}}}\sum _{i=1}^n(y_{i}-\tilde{\beta}_{0}-x_{i}\tilde{\beta}_{1})^2\\ &= \arg\min\limits_{{\tilde{\beta}_{0},\tilde{\beta}_{1}}}\sum _{i=1}^n\hat{u_{i}}^2 \end{align*}\]
Figure 4.1: Regression line
sales \(\approx {\beta}_{0}+{\beta}_{1}\cdot\) TV
file<-here::here('data','advertising.csv')
Advertising<-read.csv(file, sep=',', header = TRUE)
OLS1<-lm(sales ~ TV, data=Advertising)
stargazer::stargazer(OLS1, title='The relationship between TV advertising and sales',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
|  | Dependent variable: sales |
|---|---|
| TV | 0.048*** (0.003) |
| Constant | 7.033*** (0.458) |
| Observations | 200 |
| R2 | 0.612 |
| Adjusted R2 | 0.610 |
| Residual Std. Error | 3.259 (df = 198) |
| F Statistic | 312.100*** (df = 1; 198) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
library(ggplot2)
A1<-ggplot(Advertising, aes(x=TV, y=sales)) +
labs(y = 'Sales', x = 'TV') +
geom_point(col="saddlebrown") +
geom_smooth(method="lm", col='blue', se=F, size=1.5) ; A1
Figure 5.1: The relationship between TV advertising and sales
m1 <- lm(sales ~ TV, Advertising)
Advertising$predicted <- predict(m1) # Save the predicted values
Advertising$residuals <- residuals(m1) # Save the residual values
# Quick look at the actual, predicted, and residual values
#library(dplyr)
#Advertising %>% select(predicted, residuals) %>% head()
ggplot(Advertising, aes(x=TV, y=sales)) +
geom_point(color="saddlebrown") +
geom_point(aes(y = predicted), shape = 1, color="red") +
geom_segment(aes(xend=TV, yend=predicted), color="lightgray") +
theme_bw()
\[\begin{align*} \tilde{u} & = y_{i}- \hat{y}_{i} \\ & = y_{i}- \tilde{\beta}_{0}-\tilde{\beta}_{1}x_{i} \end{align*}\]
\[\begin{align} S(\tilde{u}) = {\sum\limits_{i=1}^n(y_{i}-\hat{y})^2} \\ L(y) = {\sum\limits_{i=1}^n(y_{i}-\hat{y})^2} \tag{6.1} \end{align}\]
\[\begin{align*} \frac{d}{d\hat{y}}L(y) & = \frac{d}{d\hat{y}}{\sum\limits_{i=1}^n(y_{i}-\hat{y})^2} \\ & = \frac{d}{d\hat{y}}\sum\limits_{i=1}^n(y_{i}^2-2y_{i}\hat{y}+\hat{y}^2) \\ & = \sum(-2y_{i}+2\hat{y}) \\ & = -2\sum(y_{i} - \hat{y}) \end{align*}\]
\[-2\sum(y_{i} - \hat{y})=0\]
\[\begin{align*} -2\sum(y_{i} - \hat{y}) & = -2\Big(\sum y_{i}-n\hat{y}\Big)=0\\ n\hat{y} &= \sum y_{i}\\ \hat{y} &= \frac{1}{n}\sum y_{i}=\bar{y} \end{align*}\]
#create a vector of 1,000 values of Y with mean = 10
set.seed(666)
Y <- rnorm(1000, mean = 10, sd = 2)
#create 20 candidate summary values with mean = 10
obs <- rnorm(20, 10, 3)
# pre-allocate a vector for the sum of squared residuals of each candidate
squared_residuals <- rep(NA, length(obs))
min_residual <- Inf
min_observation <- NULL
#calculate the sum of squared residuals between Y and obs
for (i in 1:length(obs)) {
squared_residuals[i] = sum((Y - obs[i])^2)
if (squared_residuals[i] < min_residual) {
min_residual <- squared_residuals[i]
min_observation <- obs[i]
}
}
cat('Minimization of Squared Residuals=', min_observation)
## Minimization of Squared Residuals= 9.893
#plot the sum of squared residuals for the 20 candidate values to find the one with the lowest SSR
data.frame(obs_values = obs, squared_residuals = squared_residuals) %>%
ggplot(aes(obs_values, squared_residuals)) +
stat_smooth(method="lm",
formula = y ~ poly(x, 2),
se = FALSE,
colour = "#FCC3B0",
linetype = "dashed") +
geom_point(col = "#661E4F", size = 2.2, alpha = 0.7) +
geom_vline(xintercept = min_observation, linetype='dashed', colour='#EEAA11') +
geom_hline(yintercept=min(squared_residuals), linetype='dashed', colour='#113BEA') +
labs(title = "Sum of Squared Residuals (SSR) Loss Function",
x = "Summary Value",
y = "SSR") +
theme_minimal() +
# theme(text = element_text(family = ""),
# plot.title = element_text(family = "", hjust = 0.5) +
scale_y_continuous(labels = scales::comma)
Figure 6.1: Sum of squared residuals loss function
\[\begin{align*} S(\tilde{\beta}_{0},\tilde{\beta}_{1})= {\sum\limits_{i=1}^n(y_{i}-\tilde{\beta}_{0}-x_{i}\tilde{\beta}_{1})^2} \end{align*}\]
1. First, differentiate with respect to \(\{\tilde{\beta}_{0},\tilde{\beta}_{1}\}\).
2. Set both derivatives equal to 0.
3. Solve for \(\{\tilde{\beta}_{0},\tilde{\beta}_{1}\}\).
\[y_{i}-\tilde{\beta}_{0}-x_{i}\tilde{\beta}_{1}\]
\[\begin{align*} S(\tilde{\beta}_{0},\tilde{\beta}_{1}) & = {\sum\limits_{i=1}^n(y_{i}-\tilde{\beta}_{0}-x_{i}\tilde{\beta}_{1})^2} \\ & = {\sum\limits_{i=1}^n(y_{i}^2-2y_{i}\tilde{\beta}_{0}-2y_{i}\tilde{\beta}_{1}x_{i}+ \tilde{\beta}_{0}^2+2\tilde{\beta}_{0}\tilde{\beta}_{1}x_{i}+ \tilde{\beta}_{1}^2x_{i}^2)} \end{align*}\]
\[\begin{align*} \frac{\partial S(\tilde{\beta}_{0},\tilde{\beta}_{1})}{\partial \tilde{\beta}_{0}}= {\sum\limits_{i=1}^n(-2y_{i}+2\tilde{\beta}_{0}+2\tilde{\beta}_{1}x_{i})} \end{align*}\]
\[\begin{align*} \frac{\partial S(\tilde{\beta}_{0},\tilde{\beta}_{1})}{\partial \tilde{\beta}_{1}}= {\sum\limits_{i=1}^n(-2y_{i}x_{i}+2\tilde{\beta}_{0}x_{i}+2\tilde{\beta}_{1}x_{i}^2)} \end{align*}\]
\[\begin{align*} 0 & = {\sum\limits_{i=1}^n(-2y_{i}+2\tilde{\beta}_{0}+2\tilde{\beta}_{1}x_{i})} \tag{6.2} \end{align*}\]
\[\begin{align*} 0 & = {\sum\limits_{i=1}^n(-2y_{i}x_{i}+2\tilde{\beta}_{0}x_{i}+2\tilde{\beta}_{1}x_{i}^2)} \tag{6.3} \end{align*}\]
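Before solving, a quick symbolic check of these two first-order conditions (a sketch, not part of the original derivation): base R's D() can differentiate a single squared residual.

```r
# Differentiate one squared residual (y - b0 - b1*x)^2 symbolically
S_i <- expression((y - b0 - b1 * x)^2)
D(S_i, "b0")   # equals -2*(y - b0 - b1*x), up to how R arranges the expression
D(S_i, "b1")   # equals -2*x*(y - b0 - b1*x), up to how R arranges the expression
```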
\[\begin{align*} \hat{\beta}_{0}n &= {(\sum\limits_{i=1}^ny_{i})- \hat{\beta}_{1}{(\sum\limits_{i=1}^nx_{i})}} \\ \tilde{\beta}_{0} &=\bar{Y}-\tilde{\beta}_{1} \bar{x} \tag{6.4} \end{align*}\]
\[\begin{align*} \hat{\beta}_{1}\sum\limits_{i=1}^nx_{i}^2 &= (\sum\limits_{i=1}^nx_{i}y_{i})- \tilde{\beta}_{0}{(\sum\limits_{i=1}x_{i}}) \tag{6.5} \end{align*}\]
These two equations are called the normal equations.
From the first normal equation (6.4) we obtain (6.6), and from the second, (6.5), we obtain (6.7):
\[\begin{align*} \hat{\beta}_{1} &=\frac{\sum y_{i}-\hat{\beta}_{0}n}{\sum x_{i}} \tag{6.6} \end{align*}\]
\[\begin{align*} \hat{\beta}_{0}=\frac{\sum x_{i}y_{i}-\hat{\beta}_{1}\sum x_{i}^2}{\sum x_{i}} \tag{6.7} \end{align*}\]
Substituting (6.7) into (6.6) and simplifying gives: \[\begin{align*} \hat{\beta}_{1} & =\frac{\sum y_{i}-n\frac{\sum x_{i}y_{i}-\hat{\beta}_{1}\sum x_{i}^2}{\sum x_{i}}}{\sum x_{i}}\\ & = \frac{n\sum x_{i}y_{i}-\sum x_{i}\sum y_{i}}{n\sum x_{i}^2-(\sum x_{i})^2} \tag{6.8} \end{align*}\]
\[\begin{align*} {n\sum x_{i}y_{i}-\sum x_{i}\sum y_{i}} & = {n(\sum x_{i}y_{i}-\bar{y}\sum x_{i})}\\ & = {n(\sum x_{i}y_{i}-n\bar{x}\bar{y})} \\ & = {n(\sum x_{i}y_{i}-n\bar{x}\bar{y}-n\bar{x}\bar{y}+n\bar{x}\bar{y})} \\ & = {n(\sum x_{i}y_{i}-\bar{y}\sum x_{i}-\bar{x}\sum y_{i}+\sum \bar{x}\bar{y})} \\ & = {n\sum (x_{i}-\bar{x})(y_{i}-\bar{y})} \tag{6.9} \end{align*}\]
\[\begin{align*} {n\sum x_{i}^2-(\sum x_{i})^2} & = {n\sum x_{i}^2-(\sum x_{i})^2+(\sum x_{i})^2-(\sum x_{i})^2}\\ & = {n\sum x_{i}^2-2n\bar{x}(\sum x_{i})+n^2(\bar{x})^2} \\ & = {n(\sum x_{i}^2-2\bar{x}\sum x_{i}+n(\bar{x})^2)}\\ & = {n(\sum x_{i}^2-2\bar{x}\sum x_{i}+\sum \bar{x}^2)}\\ & = {n\sum(x_{i}-\bar{x})^2} \tag{6.10} \end{align*}\] (because \(\sum \bar{x}^2=\bar{x}^2+\bar{x}^2+\cdots +\bar{x}^2=n\bar{x}^2\))
\[\begin{align*} {\hat{\beta}_{1}=\frac{\sum_{i=1}^n(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^n(x_{i}-\bar{x})^2} }=\frac{\mathrm {Sample\hspace{.5em} covariance\hspace{.5em} of\hspace{.5em} X\hspace{.5em} and\hspace{.5em} Y}}{\mathrm {sample\hspace{.5em} variance\hspace{.5em} of\hspace{.5em} X}}\\ \end{align*}\]
- The larger the variance of \(X\), the smaller \(\hat{\beta}_{1}\).
- The larger the covariance of \(X\) and \(Y\), the larger \(\hat{\beta}_{1}\). The covariance can be written as \(\sum xy\), where \(x=(X-\bar{X})\) and \(y=(Y-\bar{Y})\).
In addition,
\[ \hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x} \]
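These closed-form expressions are easy to verify numerically (a quick check, not part of the original text) using the Advertising data loaded earlier:

```r
x <- Advertising$TV; y <- Advertising$sales
b1 <- cov(x, y) / var(x)   # sample covariance of X and Y over sample variance of X
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))            # should match b0 and b1
```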
★ Suppose \(X\) is the number of library visitors and \(Y\) is the number of books checked out (in hundreds of books), and the summary statistics for the past 25 days are \(\sum X=200\), \(\sum Y=300\), \(\sum X^2=1660\), \(\sum Y^2=1696\), and \(\sum XY=2436\). Find \(\hat{Y}=\hat{\beta}_{0}+\hat{\beta}_{1}X\).
\(\blacksquare\) Substituting these values into (6.8), we obtain \(\beta_{1}\) and then \(\beta_{0}\):
n=25
#sum(X)
sum.x=200
#sum(Y)
sum.y=300
#sum(XY)
sum.xy <- 2436
#sum(x^2)
sumxsq <- 1660
#sum(y^2)
sumysq <- 1696
#(sumX)^2
sumx.sq <- sum.x^2
beta1 <- (n*sum.xy-sum.x*sum.y)/(n*sumxsq-sumx.sq)
beta1
[1] 0.6
beta0=sum.y/n-beta1*sum.x/n
beta0
[1] 7.2
## -x + x^2
Figure 6.2: Derivative of a quadratic function 1
Figure 6.3: Derivative of a quadratic function 2
\[\begin{align*} f(\xi+h)-f(\xi) & =(\xi+h)^2-(\xi+h)-(\xi^2-\xi) \\ & = \xi^2+2\xi\cdot h + h^2-\xi-h-\xi^2+\xi \\ & = 2\xi\cdot h + h^2 -h \tag{6.11} \end{align*}\]
\[\begin{align*} m &=\frac{f(\xi+h)-f(\xi)}{\xi+h-\xi} \\ & = \frac{2\xi\cdot h + h^2 -h}{h} \\ & = 2\xi + h -1 \tag{6.12} \end{align*}\]
\[f^{\prime}=\frac{\partial y}{\partial x}\]
\[f^{\prime}(x)=ax^{a-1}\quad \text{when } f(x)=x^{a}\]
The blue dashed line in Figure 6.3 is the tangent line, defined as the straight line that passes through the point \((\xi,f(\xi))\) with slope \(m\).
The green point marks where \(f^{\prime}=0\); there \(f(\xi)\) is the minimum of \(f(x)\). A quick mental calculation shows that \(f^{\prime}=0\) when \(\xi=0.5\).
\[\begin{align*} Y & =\hat{\beta}_{0}+ \hat{\beta}_{1}X+\hat{u} \\ \frac{\partial Y(\hat{\beta}_{0}, \hat{\beta}_{1})}{\partial X} & =\hat{\beta}_{1} \end{align*}\]
★ Granovetter and Soong (1988) propose a function describing whites' tolerance for Black residents in a neighborhood: \[ f(x)=R\left[1-\frac{x}{N_{w}}\right]x \]
where \(x\) is the proportion of whites, \(R\) is whites' tolerance of Black residents, \(N_{w}\) is the number of Black residents, and \(f(x)\) describes the relationship between the number of Black residents and whites moving out.
If \(N_{w}=100\) and \(R=5\), what is the rate of change in whites moving out when the proportion of whites \(x\) is 20?
Substituting \(N_{w}=100\) and \(R=5\) gives:
\[ f(x)=5x-\frac{x^2}{20} \]
\[\begin{align*} f'(x) & = \mathrm{lim}_{\mathit{h}\rightarrow 0}\frac{[5(x+h)-\frac{1}{20}(x+h)^2]-[5x-\frac{1}{20}x^2]}{h} \\ & = \mathrm{lim}_{\mathit{h}\rightarrow 0}\frac{5h-\frac{1}{20}(x^2+2xh+h^2)+\frac{1}{20}x^2}{h}\\ & = \mathrm{lim}_{\mathit{h}\rightarrow 0}\frac{5h-\frac{2}{20}xh-\frac{1}{20}h^2}{h}\\ & = \mathrm{lim}_{\mathit{h}\rightarrow 0}\left(5-\frac{1}{10}x-\frac{1}{20}h \right)\\ & = 5-\frac{1}{10}x \end{align*}\]
Substituting \(x=20\) gives \(f'(20)=3\); \(x=40\) gives \(f'(40)=1\); and \(x=50\) gives \(f'(50)=0\). Different values of \(X\) therefore correspond to tangent lines with different slopes.
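A quick numerical check of these slopes (a sketch, not in the original text):

```r
f <- function(x) 5 * x - x^2 / 20
h <- 1e-6
# Finite-difference approximation of f'(x) at x = 20, 40, 50
(f(c(20, 40, 50) + h) - f(c(20, 40, 50))) / h   # approximately 3, 1, 0
```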
The graph of \(f(x)=5x-\frac{x^2}{20}\) is shown in Figure 6.4:
# Set up an empty plotting region
plot(1, type = 'n', xlim = c(-6, 80), ylim = c(0, 130), xlab = 'X', ylab = 'Y')
f <- function(x)(5*x - (1/20)*x^2)
curve(f, -10, 80, col = 'red', ylab = expression(italic(f(x))),
add = T)
grid(NULL, NULL, col = 'gray50')
# Define the coordinates of the point through which the line passes
x_point <- 20
y_point <- 5*x_point - (1/20)*x_point^2
# Add the point to the plot
points(x_point, y_point, col = 'red', pch = 16)
# Add a line with a slope of 3 passing through the point (20, y)
abline(a = y_point - 3 * x_point, b = 3, col = 'blue')
Figure 6.4: A quadratic function and its tangent line
\[\sum {x}_{i}\hat{u}_{i}=0\]
Proof: \[\begin{align*} \sum x_{i}\hat{u}_{i} & = \sum x_{i}(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})\\ & = \sum x_{i}y_{i}-\hat{\beta}_{0}\sum x_{i}-\hat{\beta}_{1}\sum x_{i}^2 \end{align*}\]
By the normal equations, specifically (6.5), we have \(\hat{\beta}_{0}\sum x_{i}+\hat{\beta}_{1}\sum x_{i}^2=\sum x_{i}y_{i}\), so the expression above equals zero.
\[\bar{\hat{y}}=\bar{y}\]
\[\sum \hat{y}_{i}\hat{u}_{i}=0\]
\[\begin{align*} \sum \hat{y}_{i}\hat{u}_{i} & = \sum (\hat{\beta}_{0}+\hat{\beta}_{1}x_{i})\hat{u}_{i}\\ & = \sum \hat{\beta}_{0}\hat{u}_{i}+\hat{\beta}_{1}\sum x_{i}\hat{u}_{i} \\ &= \hat{\beta}_{0}\sum \hat{u}_{i}+\hat{\beta}_{1}\sum x_{i}\hat{u}_{i} \end{align*}\]
Since \[\sum \hat{u}_{i}= 0\qquad\text{and}\qquad\sum x_{i}\hat{u}_{i}=0,\]
\[\hat{\beta}_{0}\sum \hat{u}_{i}+\hat{\beta}_{1}\sum x_{i}\hat{u}_{i}= 0\]
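These properties are easy to confirm numerically with the model m1 estimated earlier (a quick check, not part of the original text):

```r
sum(residuals(m1))                            # residuals sum to (essentially) zero
sum(Advertising$TV * residuals(m1))           # residuals are orthogonal to X
sum(fitted(m1) * residuals(m1))               # residuals are orthogonal to the fitted values
c(mean(fitted(m1)), mean(Advertising$sales))  # mean of fitted values equals mean of y
```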
image3<-here::here('Fig','regybar0.jpg')
knitr::include_graphics(image3)
Figure 6.5: Regression line with the mean of Y
image5<-here::here('Fig','regybarxbar.jpg')
knitr::include_graphics(image5)
Figure 6.6: Regression line with the means of X and Y
image6<-here::here('Fig','regybar1.jpg')
knitr::include_graphics(image6)
Figure 6.7: Regression line with the mean of Y and the fitted regression line
\[\begin{align*} u &= y_{i}-\hat{y}_{i} \\ &= y_{i}- \tilde{\beta}_{0}-\tilde{\beta}_{1}x_{i} \end{align*}\]
Minimizing the sum of squared residuals \(\sum_{i=1}^{n}\hat{u}_{i}^{2}\) gives the slope estimator
\[\begin{align*} \hat{\beta}_{1} & = \frac{\sum_{i=1}^n(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^n(x_{i}-\bar{x})^2} \\ & = \frac{Sample\hspace{.5em} covariance\hspace{.5em} of\hspace{.5em} X\hspace{.5em} and\hspace{.5em} Y}{sample\hspace{.5em} variance\hspace{.5em} of\hspace{.5em} X} \end{align*}\]
On the other hand, \[ \hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x} \]
Residual Sum of Squares
\[\begin{align*} \sum _{i}^n(\hat{y}_{i}-y_{i})^2 & = \sum _{i}^n\hat{u}^2\\ & = SSR \\ & = \rm{Var}[\hat{u}] \end{align*}\]
Explained Sum of Squares
\[\begin{align*} \sum _{i}^n(\hat{y}-\bar{y})^2 & = SSE\\ & = Var[\hat{y}] \end{align*}\]
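Using the model m1 again, we can compute these sums of squares and confirm that \(R^2=\) SSE/SST \(= 1-\) SSR/SST (a sketch following the text's naming, in which SSE is the explained sum of squares):

```r
y_obs <- Advertising$sales
SSR <- sum(residuals(m1)^2)               # residual sum of squares
SSE <- sum((fitted(m1) - mean(y_obs))^2)  # explained sum of squares
SST <- sum((y_obs - mean(y_obs))^2)       # total sum of squares
c(SSE / SST, 1 - SSR / SST, summary(m1)$r.squared)   # all three should agree
```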
x <- seq(1,10,length.out = 100)
y <- 2 + 1.2*x + rnorm(100,0,sd = 2)
cat("Y~X, R-squared:", summary(lm(y ~ x))$r.squared, "\n")
## Y~X, R-squared: 0.762
cat("X~Y, R-squared:", summary(lm(x ~ y))$r.squared, "\n")
## X~Y, R-squared: 0.762
set.seed(116)
obs = rnorm(50, 10, 1)
dm<-data.frame(obs=rep(obs,2),
Y = c(rep('y1',50), rep('y2', 50)),
value = c(1 + 1.5*obs + runif(50, 0, 3),
1 + 1.5*obs + rchisq(50, 2)))
library(ggpubr)  # stat_cor() and stat_regline_equation() below come from ggpubr
ggscatter(dm, x='obs', y='value', add='reg.line',
color='Y', palette = "jco") +
facet_wrap(~ Y) +
stat_cor(label.y = 30) +
stat_regline_equation(label.y = 27)
Figure 8.1: Regression lines with different \(R^2\)
set.seed(21)
obs = rnorm(50, 10, 1)
dm<-data.frame(obs=rep(obs,2),
Y = c(rep('y1',50), rep('y2', 50)),
value = c(1 + 1.5*obs + runif(50, 0, 3),
1 + 1.5*obs + rchisq(50, 2)))
ggpubr::ggscatter(dm, x='obs', y='value',add='reg.line',
color='Y', palette = "jco") +
facet_wrap(~ Y) +
stat_cor(label.y = 30) +
stat_regline_equation(label.y = 27)
Figure 8.2: Regression lines from different random draws
set.seed(3939889)
r2.0 <- function(sig){
x <- seq(1,10, length.out = 300) # our predictor
y <- 2 + 1.2*x + rnorm(300, 0, sd = sig) # our response; a function of x plus some random noise
summary(lm(y ~ x))$r.squared # return the R-squared value
}
sigmas <- seq(1, 10, length.out = 20)
rout <- sapply(sigmas, r2.0) # apply our function to a series of sigma values
dt <- data.frame(sigmas, rout)
library(ggplot2)
ggplot(data=dt, aes(x = sigmas, y = rout)) +
geom_line(col='#FF6600', size=1.5) +
geom_point() +
labs(y=expression(R^2), x=expression(sigma^2)) +
theme_classic()
Figure 8.3: Variance and explained variance 1
set.seed(3939889)
beta <- function(sig){
x <- seq(1,10,length.out = 300) # our predictor
y <- 2 + 1.2*x + rnorm(300,0, sd = sig) # our response; a function of x plus some random noise
summary(lm(y ~ x))$coefficients[1,2] # return the standard error of the estimated intercept
}
sigmas <- seq(1, 10,length.out = 20)
res <- sapply(sigmas, beta)
dt <- data.frame(x=sigmas, y=res)
library(ggplot2)
ggplot(data=dt, aes(sigmas, res)) +
geom_line(col='#FF6600', size=1.5) +
geom_point() +
labs(y=expression(beta[1]), x=expression(sigma^2)) +
theme_classic()
Figure 8.4: Variance and explained variance 2
showtext::showtext_auto()
set.seed(02138)
SSRSST <- function(sig){
x <- seq(1,10,length.out = 300) # our predictor
y <- 2 + 1.2*x + rnorm(300, 0, sd = sig) # our response; a function of x plus some random noise
m1 <- lm(y ~ x)
return(sum((m1$fitted.values-mean(y))^2))
}
sigmas <- seq(1, 10, length.out = 20)
tmp<- sapply(sigmas, SSRSST) # apply the function; returns the explained sum of squares (SSE) for each sigma
plot(sigmas, tmp, main='SSE versus the error variance',
ylab = expression(SSE),
xlab = expression(sigma^2), col='#ee1122')
Figure 8.5: Explained sum of squares and the error variance
The adjusted \(R^2\) takes the sample size \(n\) and the number of predictors \(k\) into account: \[ R^2_{Adj}=1-\frac{(1-R^2)(n-1)}{n-k-1} \]
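A minimal sketch (added here, not from the original text) computing the adjusted \(R^2\) by hand for the OLS1 model estimated earlier and comparing it with the value summary() reports:

```r
r2 <- summary(OLS1)$r.squared
n  <- nobs(OLS1)                 # number of observations
k  <- length(coef(OLS1)) - 1     # number of predictors, excluding the intercept
1 - (1 - r2) * (n - 1) / (n - k - 1)
summary(OLS1)$adj.r.squared      # should match
```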
DT<-data.frame(Y=c(1,1,1,1,1,2,2,2,2,2,3,3,3,10),
X=c(1,2,1,1,2,1,2,1,3,1,2,3,1,11))
m1 <- lm(Y ~ X, data=DT)
library(ggpubr)
ggscatter(DT, x = "X", y = "Y", add = "reg.line", color = '#aecc02') +
stat_cor(label.x = 3, label.y = 32) +
stat_regline_equation(label.x = 3, label.y = 35)
Figure 8.6: \(R^2\) scatter plot 1
library(tidyverse)
DT<-tibble(
X=seq(1, 10, by=0.5),
Y=X^2)
library(ggpubr)
ggscatter(DT, x = "X", y = "Y", add = "reg.line", color = '#c12c02') +
stat_cor(label.x = 4, label.y = 75) +
stat_regline_equation(label.x = 4, label.y = 85)
Figure 8.7: \(R^2\) scatter plot 2
DTnew<-tibble(X=c(7, 7.5, 8, 8.9, 8.7, 1, 2),
Y=c(19, 19.5, 20, 30 , 23, 1, 1.3))
library(ggpubr)
ggscatter(DTnew, x = "X", y = "Y", add = "reg.line", color = '#21baec') +
stat_cor(label.x = 1, label.y = 32) +
stat_regline_equation(label.x = 1, label.y = 35)
Figure 8.8: \(R^2\) scatter plot 3
|  | 20-39 | 40-90 |
|---|---|---|
| Landline | 13 | 11 |
| Cellular | 17 | 14 |
(use the `predict` function)
Y <- rnorm(1000, 0, 1)
X <- rnorm(1000, 0, 1)
Using the `mtcars` data, draw a scatter plot with a regression line for hp and mpg, and compute the regression coefficients.
Using the `babies` data in the `UsingR` package, estimate the effect of age (age) and height (ht) on baby weight (wt). Note that these variables have specific valid ranges.
X1<-c(1,2,3,2,5,5,2,4,3,3)
X2<-c(4,6,7,6,7,7,5,7,6,5)
Y<-c(30,38,44,38,50,52,33,50,45,40)
mag <- rbind(X1,X2,Y)
knitr::kable(mag, format = 'simple')
X1 | 1 | 2 | 3 | 2 | 5 | 5 | 2 | 4 | 3 | 3 |
X2 | 4 | 6 | 7 | 6 | 7 | 7 | 5 | 7 | 6 | 5 |
Y | 30 | 38 | 44 | 38 | 50 | 52 | 33 | 50 | 45 | 40 |
Using R, estimate the effect of X1 on Y and the effect of X2 on Y.
In the `nym.2002` data, do gender and age affect the time to finish the marathon?
## Last updated 05/13/2024