工程数学_统计学_作业一.knit

作业要求
第 1 题
第 2 题
第 3 题
第 4 题

作业要求

你可以与其他同学讨论作业题目，但作业应由你本人独立完成.
请将你所有的答案、计算机代码和图表整合到一个 HTML 文件中.
请将作业文件命名为 学号-姓名-01，例如：2025017000-肖磊-01.
评分方式：每一问得分 \(\left\{ 0,~ 1,~ 2~ \right\}\).
- 2: 按时提交且答案基本正确
- 1: 按时提交且答案部分正确
- 0: 提交的答案完全错误，或未交作业
请将本次作业发送至：2023582025@cupk.edu.cn. 截止日期: 2025年11月15日24:00.

第 1 题

每个 ggplot2 图形都包含三个关键组成部分：

数据 data.
数据中的变量与视觉属性之间的一组美学映射 aesthetic mapping.
至少包含一个图层，该图层用于描述如何呈现每个观测值. 图层通常通过 geom 函数创建.

本题使用数据集 mpg. 例如：

library(ggplot2)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

1.1 利用 ggplot2 作 cty 与 hwy 的散点图. 你会如何描述城市油耗 cty 与高速油耗 hwy 之间的关系？ (2 分)

library(ggplot2)
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

1.2 你还可以使用其它的图形美学属性设置，例如颜色 color、形状 shape 和大小 size. 运行以下代码，观察会出现什么结果.

ggplot(data = mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_point()
ggplot(data = mpg, aes(x = cty, y = hwy, shape = drv)) +
  geom_point()
ggplot(data = mpg, aes(x = cty, y = hwy, size = cyl)) +
  geom_point()

尝试使用颜色 colour、形状 shape 和大小 size 这些美学属性. 将它们映射到连续值时会发生什么？映射到分类值时又会怎样？在一个图形中使用不止一种美学属性时，结果又如何？ (2 分)

# 运行题目中的代码
ggplot(data = mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_point()

ggplot(data = mpg, aes(x = cty, y = hwy, shape = drv)) +
  geom_point()

ggplot(data = mpg, aes(x = cty, y = hwy, size = cyl)) +
  geom_point()

# 额外实验：映射到连续值 vs 分类值
# 颜色映射到连续值
ggplot(data = mpg, aes(x = cty, y = hwy, colour = cyl)) +
  geom_point()

# 颜色映射到分类值  
ggplot(data = mpg, aes(x = cty, y = hwy, colour = factor(cyl))) +
  geom_point()

# 同时使用多种美学属性
ggplot(data = mpg, aes(x = cty, y = hwy, colour = class, shape = drv, size = displ)) +
  geom_point()

1.3 在图形上展示额外分类变量的另一种方法是分面 (facetting). 例如：

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

如果尝试按照高速油耗 hwy 这类连续变量进行分面，会发生什么？按气缸数 cyl 分面又会怎样？二者最关键的区别是什么？ (2 分)

# 按连续变量 hwy 分面
ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point() +
  facet_wrap(~ hwy)

# 按气缸数 cyl 分面（cyl 本质上是分类变量）
ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point() +
  facet_wrap(~ cyl)

# 将 cyl 转换为因子后分面
ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point() +
  facet_wrap(~ factor(cyl))

1.4 你可以使用 geom_smooth() 函数为图形添加一条平滑曲线，例如：

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(span = 0.75)

geom_smooth() 函数中一个重要的参数是 method (方法)，通过该参数，你可以选择用于拟合平滑曲线的模型类型.

method = "loess"，这是样本量较小时 (\(n\) 较小) 的默认设置 (具体说明可参考 ?loess 帮助文档). 尝试调整 span 参数的不同取值，观察会出现什么结果. (2 分)

# 使用不同的 span 参数值
library(ggplot2)

# span = 0.2 (更局部的拟合，更波动)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.2, se = FALSE) +
  ggtitle("span = 0.2 (更波动)")

## `geom_smooth()` using formula = 'y ~ x'

# span = 0.5 (中等平滑度)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.5, se = FALSE) +
  ggtitle("span = 0.5 (中等平滑)")

## `geom_smooth()` using formula = 'y ~ x'

# span = 0.75 (默认值，较平滑)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.75, se = FALSE) +
  ggtitle("span = 0.75 (默认，较平滑)")

## `geom_smooth()` using formula = 'y ~ x'

# span = 1.0 (非常平滑)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", span = 1.0, se = FALSE) +
  ggtitle("span = 1.0 (非常平滑)")

## `geom_smooth()` using formula = 'y ~ x'

# 同时显示多个span值的效果对比
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", span = 0.2, se = FALSE, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", span = 0.5, se = FALSE, color = "blue", linetype = "dotted") +
  geom_smooth(method = "loess", span = 0.75, se = FALSE, color = "green") +
  geom_smooth(method = "loess", span = 1.0, se = FALSE, color = "purple", linetype = "dotdash") +
  labs(title = "不同span参数值的平滑曲线比较",
       subtitle = "红色: span=0.2, 蓝色: span=0.5, 绿色: span=0.75, 紫色: span=1.0")

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

method = "lm" 用于拟合线性模型 (linear model)，得到最佳拟合线. 尝试运行下面的代码，观察会出现什么结果. (2 分)

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6, color = "steelblue", size = 2) +  # 调整点的透明度、颜色和大小
  geom_smooth(method = "lm", color = "red", fill = "lightcoral", 
              level = 0.95, linetype = "solid") +  # 自定义线性回归线的外观
  labs(
    title = "发动机排量与高速油耗的线性关系",
    subtitle = "使用线性回归模型 (method = 'lm') 拟合",
    x = "发动机排量 (升)",
    y = "高速油耗 (英里/加仑)",
    caption = "数据来源: ggplot2::mpg 数据集"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray40")
  )

## `geom_smooth()` using formula = 'y ~ x'

method = "gam" 用于拟合由 mgcv 包提供的广义加性模型 (Generalized Additive Model，简称 GAM). 你需要使用类似 formula = y ~ s(x) 或 formula = y ~ s(x, bs = "cs") (适用于大数据集) 这样的公式. 当数据点数量超过 \(1000\) 个时，ggplot2 会默认使用该方法. 尝试运行以下代码，观察会出现什么结果. (2 分)

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x))

# 基础GAM拟合
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "gam", formula = y ~ s(x), 
              color = "darkgreen", fill = "lightgreen") +
  labs(title = "使用GAM方法拟合发动机排量与高速油耗关系",
       subtitle = "formula = y ~ s(x)",
       x = "发动机排量 (升)",
       y = "高速油耗 (英里/加仑)") +
  theme_minimal()

# 比较不同GAM公式
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "gam", formula = y ~ s(x), 
              color = "darkgreen", aes(fill = "y ~ s(x)"), se = TRUE) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), 
              color = "purple", aes(fill = "y ~ s(x, bs='cs')"), se = TRUE) +
  scale_fill_manual(values = c("y ~ s(x)" = "lightgreen", 
                               "y ~ s(x, bs='cs')" = "lavender")) +
  labs(title = "不同GAM公式比较",
       x = "发动机排量 (升)",
       y = "高速油耗 (英里/加仑)",
       fill = "GAM公式") +
  theme_minimal() +
  theme(legend.position = "bottom")

# 与线性模型和loess方法对比
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "gam", formula = y ~ s(x), 
              color = "darkgreen", aes(fill = "GAM"), se = TRUE) +
  geom_smooth(method = "lm", color = "red", 
              aes(fill = "线性模型"), se = TRUE) +
  geom_smooth(method = "loess", color = "blue", 
              aes(fill = "Loess"), se = TRUE) +
  scale_fill_manual(values = c("GAM" = "lightgreen", 
                               "线性模型" = "lightcoral",
                               "Loess" = "lightblue")) +
  labs(title = "GAM、线性模型和Loess方法对比",
       subtitle = "GAM能捕捉更复杂的非线性关系",
       x = "发动机排量 (升)",
       y = "高速油耗 (英里/加仑)",
       fill = "拟合方法") +
  theme_minimal() +
  theme(legend.position = "bottom")

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

第 2 题

R 语言中 alr4 包中的 walleye 数据集，包含了对美国威斯康星州 (Wisconsin) 白斑狗鱼 (walleye，一种淡水鱼) 的测量数据.

2.1 使用 dplyr 包中的 filter 函数筛选数据框中的行，然后用 ggplot2 绘制 age (年龄) 为 \(5\) 至 \(8\) 岁的 length (体长) 的箱线图. (2 分)

# 加载必要的包
library(alr4)

## Loading required package: car

## Loading required package: carData

## Loading required package: effects

## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# 使用filter筛选年龄在5-8岁的数据，并绘制箱线图
walleye %>%
  filter(age >= 5 & age <= 8) %>%
  ggplot(aes(x = factor(age), y = length)) +
  geom_boxplot(fill = "lightblue", alpha = 0.7) +
  labs(
    title = "白斑狗鱼年龄与体长关系箱线图 (5-8岁)",
    x = "年龄 (岁)",
    y = "体长 (mm)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12)
  )

2.2 使用 dplyr 包计算四个年龄组 (\(5\) 至 \(8\) 岁) 中体长 length 的样本均值和样本标准差. (2 分)

# 加载必要的包
library(alr4)
library(dplyr)

# 使用dplyr计算5-8岁年龄组体长的样本均值和标准差
walleye %>%
  filter(age >= 5 & age <= 8) %>%
  group_by(age) %>%
  summarise(
    样本均值 = mean(length),
    样本标准差 = sd(length),
    .groups = 'drop'
  )

## # A tibble: 4 × 3
##     age 样本均值 样本标准差
##   <int>    <dbl>      <dbl>
## 1     5     364.       30.5
## 2     6     384.       29.1
## 3     7     403.       27.8
## 4     8     420.       28.4

2.3 使用 ggplot2 绘制 age 为 \(5\) 至 \(8\) 岁时 length 的直方图，并将这些图形以 \(2 \times 2\) 的网格形式排列在同一张图中. (2 分)

# 加载必要的包
library(alr4)
library(dplyr)
library(ggplot2)

# 如果walleye对象不存在，使用Walleye
if (!exists("walleye")) {
  walleye <- Walleye
}

# 筛选5-8岁的数据
walleye_filtered <- walleye %>%
  filter(age >= 5 & age <= 8)

# 绘制2×2网格排列的直方图
ggplot(walleye_filtered, aes(x = length)) +
  geom_histogram(fill = "lightblue", color = "black", alpha = 0.7, bins = 15) +
  facet_wrap(~ age, nrow = 2, ncol = 2) +
  labs(
    title = "白斑狗鱼不同年龄组体长分布直方图 (5-8岁)",
    x = "体长 (mm)",
    y = "频数"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    strip.text = element_text(size = 11, face = "bold")  # 分面标题样式
  )

第 3 题

以下问题基于 R 语言中的 anscombe 数据集.

3.1 使用 ggplot2 和 gridExtra 包，在 \(2 \times 2\) 的图形网格中绘制 \(4\) 个数据集 \(\left( x_1 ,~ y_1 \right)\)、\(\left( x_2 ,~ y_2 \right)\)、\(\left( x_3 ,~ y_3 \right)\)、\(\left( x_4 ,~ y_4 \right)\) 的散点图. (2 分)

在每个图形上添加数据集的编号，并将其作为该图形的主标题.

使用 bquote 函数为每个图形添加坐标轴标签. 例如：

library(ggplot2)
library(magrittr)
library(dplyr)
ggplot(data = anscombe) +
  geom_point(aes(x = x1, y = y1)) +
  xlab(bquote(x[1])) +
  ylab(bquote(y[1])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x1, y1))[1])) +
  theme(plot.title = element_text(hjust = 0.5))

# 加载必要的包
library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(dplyr)

# 创建四个散点图
plot1 <- ggplot(data = anscombe) +
  geom_point(aes(x = x1, y = y1)) +
  xlab(bquote(x[1])) +
  ylab(bquote(y[1])) +
  ggtitle("Dataset 1") +
  theme(plot.title = element_text(hjust = 0.5))

plot2 <- ggplot(data = anscombe) +
  geom_point(aes(x = x2, y = y2)) +
  xlab(bquote(x[2])) +
  ylab(bquote(y[2])) +
  ggtitle("Dataset 2") +
  theme(plot.title = element_text(hjust = 0.5))

plot3 <- ggplot(data = anscombe) +
  geom_point(aes(x = x3, y = y3)) +
  xlab(bquote(x[3])) +
  ylab(bquote(y[3])) +
  ggtitle("Dataset 3") +
  theme(plot.title = element_text(hjust = 0.5))

plot4 <- ggplot(data = anscombe) +
  geom_point(aes(x = x4, y = y4)) +
  xlab(bquote(x[4])) +
  ylab(bquote(y[4])) +
  ggtitle("Dataset 4") +
  theme(plot.title = element_text(hjust = 0.5))

# 使用 gridExtra 包将四个图形排列在 2×2 网格中
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)

3.2 使用 lm() 拟合以下数据集的简单线性回归模型:

- $y_1 \sim x_1$

- $y_2 \sim x_2$

- $y_3 \sim x_3$

- $y_4 \sim x_4$

验证所有的拟合模型是否具有相同的回归系数 (在数值误差范围内). 
<span style="color:red"> (2 分) </span>

# 拟合四个线性回归模型
model1 <- lm(y1 ~ x1, data = anscombe)
model2 <- lm(y2 ~ x2, data = anscombe)
model3 <- lm(y3 ~ x3, data = anscombe)
model4 <- lm(y4 ~ x4, data = anscombe)

# 提取回归系数
coef1 <- coef(model1)
coef2 <- coef(model2)
coef3 <- coef(model3)
coef4 <- coef(model4)

# 创建系数比较表格
coefficients_table <- data.frame(
  Dataset = c("y1 ~ x1", "y2 ~ x2", "y3 ~ x3", "y4 ~ x4"),
  Intercept = c(coef1[1], coef2[1], coef3[1], coef4[1]),
  Slope = c(coef1[2], coef2[2], coef3[2], coef4[2])
)

cat("回归系数比较:\n")

## 回归系数比较:

print(coefficients_table)

##    Dataset Intercept     Slope
## x1 y1 ~ x1  3.000091 0.5000909
## x2 y2 ~ x2  3.000909 0.5000000
## x3 y3 ~ x3  3.002455 0.4997273
## x4 y4 ~ x4  3.001727 0.4999091

# 验证系数是否在数值误差范围内相同
cat("\n验证回归系数是否相同:\n")

## 
## 验证回归系数是否相同:

cat("截距项的最大差异:", max(abs(diff(coefficients_table$Intercept))), "\n")

## 截距项的最大差异: 0.001545455

cat("斜率项的最大差异:", max(abs(diff(coefficients_table$Slope))), "\n")

## 斜率项的最大差异: 0.0002727273

# 使用 all.equal 进行精确比较
cat("\n使用 all.equal 比较 (容差 = 1e-10):\n")

## 
## 使用 all.equal 比较 (容差 = 1e-10):

cat("所有模型的截距是否相同:", 
    isTRUE(all.equal(coef1[1], coef2[1], tolerance = 1e-10)) &
    isTRUE(all.equal(coef1[1], coef3[1], tolerance = 1e-10)) &
    isTRUE(all.equal(coef1[1], coef4[1], tolerance = 1e-10)), "\n")

## 所有模型的截距是否相同: FALSE

cat("所有模型的斜率是否相同:", 
    isTRUE(all.equal(coef1[2], coef2[2], tolerance = 1e-10)) &
    isTRUE(all.equal(coef1[2], coef3[2], tolerance = 1e-10)) &
    isTRUE(all.equal(coef1[2], coef4[2], tolerance = 1e-10)), "\n")

## 所有模型的斜率是否相同: FALSE

# 显示模型摘要
cat("\n详细模型摘要:\n")

## 
## 详细模型摘要:

cat("模型1 (y1 ~ x1):\n")

## 模型1 (y1 ~ x1):

print(summary(model1)$coefficients)

##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x1          0.5000909  0.1179055 4.241455 0.002169629

cat("\n模型2 (y2 ~ x2):\n")

## 
## 模型2 (y2 ~ x2):

print(summary(model2)$coefficients)

##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x2          0.500000  0.1179637 4.238590 0.002178816

cat("\n模型3 (y3 ~ x3):\n")

## 
## 模型3 (y3 ~ x3):

print(summary(model3)$coefficients)

##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x3          0.4997273  0.1178777 4.239372 0.002176305

cat("\n模型4 (y4 ~ x4):\n")

## 
## 模型4 (y4 ~ x4):

print(summary(model4)$coefficients)

##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x4          0.4999091  0.1178189 4.243028 0.002164602

3.3 使用 cor() 命令计算每个数据集的样本相关系数. (2 分)

# 3.3 使用 cor() 计算每个数据集的样本相关系数

# 计算四个数据集的样本相关系数
cor1 <- cor(anscombe$x1, anscombe$y1)
cor2 <- cor(anscombe$x2, anscombe$y2)
cor3 <- cor(anscombe$x3, anscombe$y3)
cor4 <- cor(anscombe$x4, anscombe$y4)

# 创建相关系数比较表格
correlation_table <- data.frame(
  Dataset = c("(x1, y1)", "(x2, y2)", "(x3, y3)", "(x4, y4)"),
  Correlation = c(cor1, cor2, cor3, cor4)
)

cat("样本相关系数比较:\n")

## 样本相关系数比较:

print(correlation_table)

##    Dataset Correlation
## 1 (x1, y1)   0.8164205
## 2 (x2, y2)   0.8162365
## 3 (x3, y3)   0.8162867
## 4 (x4, y4)   0.8165214

# 验证相关系数是否在数值误差范围内相同
cat("\n验证相关系数是否相同:\n")

## 
## 验证相关系数是否相同:

cat("相关系数的最大差异:", max(abs(diff(correlation_table$Correlation))), "\n")

## 相关系数的最大差异: 0.0002346974

# 使用 all.equal 进行精确比较
cat("\n使用 all.equal 比较 (容差 = 1e-10):\n")

## 
## 使用 all.equal 比较 (容差 = 1e-10):

cat("所有数据集的相关系数是否相同:", 
    isTRUE(all.equal(cor1, cor2, tolerance = 1e-10)) &
    isTRUE(all.equal(cor1, cor3, tolerance = 1e-10)) &
    isTRUE(all.equal(cor1, cor4, tolerance = 1e-10)), "\n")

## 所有数据集的相关系数是否相同: FALSE

# 显示更精确的相关系数值
cat("\n精确的相关系数值:\n")

## 
## 精确的相关系数值:

cat("数据集 1 (x1, y1):", sprintf("%.10f", cor1), "\n")

## 数据集 1 (x1, y1): 0.8164205163

cat("数据集 2 (x2, y2):", sprintf("%.10f", cor2), "\n")

## 数据集 2 (x2, y2): 0.8162365060

cat("数据集 3 (x3, y3):", sprintf("%.10f", cor3), "\n")

## 数据集 3 (x3, y3): 0.8162867395

cat("数据集 4 (x4, y4):", sprintf("%.10f", cor4), "\n")

## 数据集 4 (x4, y4): 0.8165214369

3.4 拟合与第 \(2\) 问中相同的模型，但交换 \(x\) 和 \(y\) 的位置. 使用 summary() 命令查看结果，当交换 \(x\) 和 \(y\) 的位置时，结果中是否有某些内容保持不变？ (2 分)

# 拟合交换x和y位置的四个模型
model1_swapped <- lm(x1 ~ y1, data = anscombe)
model2_swapped <- lm(x2 ~ y2, data = anscombe)
model3_swapped <- lm(x3 ~ y3, data = anscombe)
model4_swapped <- lm(x4 ~ y4, data = anscombe)

# 查看原始模型和交换后模型的摘要
cat("原始模型 (y ~ x) 的摘要:\n")

## 原始模型 (y ~ x) 的摘要:

cat("模型1 (y1 ~ x1):\n")

## 模型1 (y1 ~ x1):

print(summary(model1))

## 
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

cat("\n模型2 (y2 ~ x2):\n")

## 
## 模型2 (y2 ~ x2):

print(summary(model2))

## 
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

cat("\n模型3 (y3 ~ x3):\n")

## 
## 模型3 (y3 ~ x3):

print(summary(model3))

## 
## Call:
## lm(formula = y3 ~ x3, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

cat("\n模型4 (y4 ~ x4):\n")

## 
## 模型4 (y4 ~ x4):

print(summary(model4))

## 
## Call:
## lm(formula = y4 ~ x4, data = anscombe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

cat("\n\n交换后的模型 (x ~ y) 的摘要:\n")

## 
## 
## 交换后的模型 (x ~ y) 的摘要:

cat("模型1 (x1 ~ y1):\n")

## 模型1 (x1 ~ y1):

print(summary(model1_swapped))

## 
## Call:
## lm(formula = x1 ~ y1, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6522 -1.5117 -0.2657  1.2341  3.8946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9975     2.4344  -0.410  0.69156   
## y1            1.3328     0.3142   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

cat("\n模型2 (x2 ~ y2):\n")

## 
## 模型2 (x2 ~ y2):

print(summary(model2_swapped))

## 
## Call:
## lm(formula = x2 ~ y2, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8516 -1.4315 -0.3440  0.8467  4.2017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9948     2.4354  -0.408  0.69246   
## y2            1.3325     0.3144   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

cat("\n模型3 (x3 ~ y3):\n")

## 
## 模型3 (x3 ~ y3):

print(summary(model3_swapped))

## 
## Call:
## lm(formula = x3 ~ y3, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9869 -1.3733 -0.0266  1.3200  3.2133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0003     2.4362  -0.411  0.69097   
## y3            1.3334     0.3145   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

cat("\n模型4 (x4 ~ y4):\n")

## 
## 模型4 (x4 ~ y4):

print(summary(model4_swapped))

## 
## Call:
## lm(formula = x4 ~ y4, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7859 -1.4122 -0.1853  1.4551  3.3329 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0036     2.4349  -0.412  0.68985   
## y4            1.3337     0.3143   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

# 比较保持不变的内容
cat("\n比较保持不变的内容:\n")

## 
## 比较保持不变的内容:

# 1. 比较R-squared值
cat("\n1. R-squared值比较:\n")

## 
## 1. R-squared值比较:

cat("原始模型 R-squared:", 
    summary(model1)$r.squared, summary(model2)$r.squared, 
    summary(model3)$r.squared, summary(model4)$r.squared, "\n")

## 原始模型 R-squared: 0.6665425 0.666242 0.666324 0.6667073

cat("交换后模型 R-squared:", 
    summary(model1_swapped)$r.squared, summary(model2_swapped)$r.squared, 
    summary(model3_swapped)$r.squared, summary(model4_swapped)$r.squared, "\n")

## 交换后模型 R-squared: 0.6665425 0.666242 0.666324 0.6667073

cat("R-squared值是否相同:", 
    isTRUE(all.equal(summary(model1)$r.squared, summary(model1_swapped)$r.squared)) &
    isTRUE(all.equal(summary(model2)$r.squared, summary(model2_swapped)$r.squared)) &
    isTRUE(all.equal(summary(model3)$r.squared, summary(model3_swapped)$r.squared)) &
    isTRUE(all.equal(summary(model4)$r.squared, summary(model4_swapped)$r.squared)), "\n")

## R-squared值是否相同: TRUE

# 2. 比较F统计量的值
cat("\n2. F统计量值比较:\n")

## 
## 2. F统计量值比较:

cat("原始模型 F-statistic:", 
    summary(model1)$fstatistic[1], summary(model2)$fstatistic[1], 
    summary(model3)$fstatistic[1], summary(model4)$fstatistic[1], "\n")

## 原始模型 F-statistic: 17.98994 17.96565 17.97228 18.00329

cat("交换后模型 F-statistic:", 
    summary(model1_swapped)$fstatistic[1], summary(model2_swapped)$fstatistic[1], 
    summary(model3_swapped)$fstatistic[1], summary(model4_swapped)$fstatistic[1], "\n")

## 交换后模型 F-statistic: 17.98994 17.96565 17.97228 18.00329

cat("F统计量值是否相同:", 
    isTRUE(all.equal(summary(model1)$fstatistic[1], summary(model1_swapped)$fstatistic[1])) &
    isTRUE(all.equal(summary(model2)$fstatistic[1], summary(model2_swapped)$fstatistic[1])) &
    isTRUE(all.equal(summary(model3)$fstatistic[1], summary(model3_swapped)$fstatistic[1])) &
    isTRUE(all.equal(summary(model4)$fstatistic[1], summary(model4_swapped)$fstatistic[1])), "\n")

## F统计量值是否相同: TRUE

# 3. 比较t统计量的绝对值（对于斜率系数）
cat("\n3. 斜率系数的t统计量绝对值比较:\n")

## 
## 3. 斜率系数的t统计量绝对值比较:

cat("原始模型 |t| for slope:", 
    abs(summary(model1)$coefficients[2,3]), abs(summary(model2)$coefficients[2,3]), 
    abs(summary(model3)$coefficients[2,3]), abs(summary(model4)$coefficients[2,3]), "\n")

## 原始模型 |t| for slope: 4.241455 4.23859 4.239372 4.243028

cat("交换后模型 |t| for slope:", 
    abs(summary(model1_swapped)$coefficients[2,3]), abs(summary(model2_swapped)$coefficients[2,3]), 
    abs(summary(model3_swapped)$coefficients[2,3]), abs(summary(model4_swapped)$coefficients[2,3]), "\n")

## 交换后模型 |t| for slope: 4.241455 4.23859 4.239372 4.243028

cat("t统计量绝对值是否相同:", 
    isTRUE(all.equal(abs(summary(model1)$coefficients[2,3]), abs(summary(model1_swapped)$coefficients[2,3]))) &
    isTRUE(all.equal(abs(summary(model2)$coefficients[2,3]), abs(summary(model2_swapped)$coefficients[2,3]))) &
    isTRUE(all.equal(abs(summary(model3)$coefficients[2,3]), abs(summary(model3_swapped)$coefficients[2,3]))) &
    isTRUE(all.equal(abs(summary(model4)$coefficients[2,3]), abs(summary(model4_swapped)$coefficients[2,3]))), "\n")

## t统计量绝对值是否相同: TRUE

# 4. 比较p值（对于斜率系数）
cat("\n4. 斜率系数的p值比较:\n")

## 
## 4. 斜率系数的p值比较:

cat("原始模型 p-value for slope:", 
    summary(model1)$coefficients[2,4], summary(model2)$coefficients[2,4], 
    summary(model3)$coefficients[2,4], summary(model4)$coefficients[2,4], "\n")

## 原始模型 p-value for slope: 0.002169629 0.002178816 0.002176305 0.002164602

cat("交换后模型 p-value for slope:", 
    summary(model1_swapped)$coefficients[2,4], summary(model2_swapped)$coefficients[2,4], 
    summary(model3_swapped)$coefficients[2,4], summary(model4_swapped)$coefficients[2,4], "\n")

## 交换后模型 p-value for slope: 0.002169629 0.002178816 0.002176305 0.002164602

cat("p值是否相同:", 
    isTRUE(all.equal(summary(model1)$coefficients[2,4], summary(model1_swapped)$coefficients[2,4])) &
    isTRUE(all.equal(summary(model2)$coefficients[2,4], summary(model2_swapped)$coefficients[2,4])) &
    isTRUE(all.equal(summary(model3)$coefficients[2,4], summary(model3_swapped)$coefficients[2,4])) &
    isTRUE(all.equal(summary(model4)$coefficients[2,4], summary(model4_swapped)$coefficients[2,4])), "\n")

## p值是否相同: TRUE

# 结论
cat("\n结论:\n")

## 
## 结论:

cat("当交换x和y的位置时，以下内容保持不变:\n")

## 当交换x和y的位置时，以下内容保持不变:

cat("- R-squared值\n")

## - R-squared值

cat("- F统计量的值\n")

## - F统计量的值

cat("- 斜率系数的t统计量绝对值\n")

## - 斜率系数的t统计量绝对值

cat("- 斜率系数的p值\n")

## - 斜率系数的p值

cat("这是因为在简单线性回归中，这些统计量基于变量间的相关性，而相关性是对称的。")

## 这是因为在简单线性回归中，这些统计量基于变量间的相关性，而相关性是对称的。

3.5 计算每个数据集的残差平方和 (SSE)、总平方和 (SST) 以及决定系数 \(R^2\). 使用 mean、sum、predict 或 resid 命令. (使用原始模型，即模型形式为 \(y_i \sim x_i\)，因此只需计算 \(4\) 个残差平方和 (SSE) 的值). (2 分)

# 使用原始模型 (y_i ~ x_i)
# 重新拟合模型以确保可用
model1 <- lm(y1 ~ x1, data = anscombe)
model2 <- lm(y2 ~ x2, data = anscombe)
model3 <- lm(y3 ~ x3, data = anscombe)
model4 <- lm(y4 ~ x4, data = anscombe)

# 计算每个数据集的统计量
calculate_stats <- function(model, x_var, y_var) {
  # 获取残差
  residuals <- resid(model)
  
  # 计算残差平方和 (SSE)
  SSE <- sum(residuals^2)
  
  # 获取响应变量
  y <- model$model[[1]]
  
  # 计算总平方和 (SST)
  SST <- sum((y - mean(y))^2)
  
  # 计算回归平方和 (SSR)
  SSR <- sum((predict(model) - mean(y))^2)
  
  # 计算决定系数 R²
  R_squared <- 1 - (SSE / SST)
  
  # 验证 R² 是否与 summary() 输出一致
  R_squared_verify <- summary(model)$r.squared
  
  return(list(SSE = SSE, SST = SST, SSR = SSR, 
              R_squared = R_squared, R_squared_verify = R_squared_verify))
}

# 计算四个数据集的统计量
stats1 <- calculate_stats(model1, anscombe$x1, anscombe$y1)
stats2 <- calculate_stats(model2, anscombe$x2, anscombe$y2)
stats3 <- calculate_stats(model3, anscombe$x3, anscombe$y3)
stats4 <- calculate_stats(model4, anscombe$x4, anscombe$y4)

# 创建结果表格
results_table <- data.frame(
  Dataset = c("y1 ~ x1", "y2 ~ x2", "y3 ~ x3", "y4 ~ x4"),
  SSE = c(stats1$SSE, stats2$SSE, stats3$SSE, stats4$SSE),
  SST = c(stats1$SST, stats2$SST, stats3$SST, stats4$SST),
  SSR = c(stats1$SSR, stats2$SSR, stats3$SSR, stats4$SSR),
  R_squared = c(stats1$R_squared, stats2$R_squared, stats3$R_squared, stats4$R_squared),
  R_squared_from_summary = c(stats1$R_squared_verify, stats2$R_squared_verify, 
                            stats3$R_squared_verify, stats4$R_squared_verify)
)

cat("残差平方和(SSE)、总平方和(SST)和决定系数(R²)的计算结果:\n")

## 残差平方和(SSE)、总平方和(SST)和决定系数(R²)的计算结果:

print(results_table)

##   Dataset      SSE      SST      SSR R_squared R_squared_from_summary
## 1 y1 ~ x1 13.76269 41.27269 27.51000 0.6665425              0.6665425
## 2 y2 ~ x2 13.77629 41.27629 27.50000 0.6662420              0.6662420
## 3 y3 ~ x3 13.75619 41.22620 27.47001 0.6663240              0.6663240
## 4 y4 ~ x4 13.74249 41.23249 27.49000 0.6667073              0.6667073

# 验证 SSE + SSR = SST
cat("\n验证 SSE + SSR = SST:\n")

## 
## 验证 SSE + SSR = SST:

for(i in 1:4) {
  equality_check <- results_table$SSE[i] + results_table$SSR[i] - results_table$SST[i]
  cat(paste0("数据集 ", i, ": SSE + SSR - SST = ", 
             format(equality_check, scientific = TRUE, digits = 6), 
             " (应该接近0)\n"))
}

## 数据集 1: SSE + SSR - SST = 7.10543e-15 (应该接近0)
## 数据集 2: SSE + SSR - SST = 1.42109e-14 (应该接近0)
## 数据集 3: SSE + SSR - SST = 7.10543e-15 (应该接近0)
## 数据集 4: SSE + SSR - SST = -1.42109e-14 (应该接近0)

# 验证 R² = 1 - SSE/SST
cat("\n验证 R² = 1 - SSE/SST:\n")

## 
## 验证 R² = 1 - SSE/SST:

for(i in 1:4) {
  R_squared_calc <- 1 - (results_table$SSE[i] / results_table$SST[i])
  difference <- R_squared_calc - results_table$R_squared[i]
  cat(paste0("数据集 ", i, ": 计算R² - 公式R² = ", 
             format(difference, scientific = TRUE, digits = 6), 
             " (应该接近0)\n"))
}

## 数据集 1: 计算R² - 公式R² = 0e+00 (应该接近0)
## 数据集 2: 计算R² - 公式R² = 0e+00 (应该接近0)
## 数据集 3: 计算R² - 公式R² = 0e+00 (应该接近0)
## 数据集 4: 计算R² - 公式R² = 0e+00 (应该接近0)

# 验证与 summary() 输出的一致性
cat("\n验证与 summary() 输出的一致性:\n")

## 
## 验证与 summary() 输出的一致性:

for(i in 1:4) {
  difference <- results_table$R_squared[i] - results_table$R_squared_from_summary[i]
  cat(paste0("数据集 ", i, ": 计算R² - summary()R² = ", 
             format(difference, scientific = TRUE, digits = 6), 
             " (应该接近0)\n"))
}

## 数据集 1: 计算R² - summary()R² = 0e+00 (应该接近0)
## 数据集 2: 计算R² - summary()R² = 1.11022e-16 (应该接近0)
## 数据集 3: 计算R² - summary()R² = 0e+00 (应该接近0)
## 数据集 4: 计算R² - summary()R² = 0e+00 (应该接近0)

# 手动计算方法展示（使用要求的函数）
cat("\n手动计算方法展示:\n")

## 
## 手动计算方法展示:

cat("数据集1 (y1 ~ x1):\n")

## 数据集1 (y1 ~ x1):

y1_mean <- mean(anscombe$y1)
SSE_manual <- sum(resid(model1)^2)
SST_manual <- sum((anscombe$y1 - y1_mean)^2)
R2_manual <- 1 - (SSE_manual / SST_manual)

cat("  SSE = sum(resid(model)^2) =", SSE_manual, "\n")

##   SSE = sum(resid(model)^2) = 13.76269

cat("  SST = sum((y - mean(y))^2) =", SST_manual, "\n")

##   SST = sum((y - mean(y))^2) = 41.27269

cat("  R² = 1 - SSE/SST =", R2_manual, "\n")

##   R² = 1 - SSE/SST = 0.6665425

# 总结
cat("\n总结:\n")

## 
## 总结:

cat("四个数据集的残差平方和(SSE)分别为:\n")

## 四个数据集的残差平方和(SSE)分别为:

for(i in 1:4) {
  cat(paste0("  数据集", i, ": ", round(results_table$SSE[i], 6), "\n"))
}

##   数据集1: 13.76269
##   数据集2: 13.776291
##   数据集3: 13.756192
##   数据集4: 13.74249

cat("在数值误差范围内，四个数据集的SSE几乎相同。\n")

## 在数值误差范围内，四个数据集的SSE几乎相同。

3.6 使用 ggplot2 包重新绘制数据，并在每个图形上添加回归线. (使用原始模型，即模型形式为 \(y_i \sim x_i\)，因此只需绘制 \(4\) 个图形). (2 分)

# 加载必要的包
library(ggplot2)
library(gridExtra)

# 创建四个带有回归线的散点图
plot1_with_line <- ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 0.8) +
  xlab(bquote(x[1])) +
  ylab(bquote(y[1])) +
  ggtitle("Dataset 1 with Regression Line") +
  theme(plot.title = element_text(hjust = 0.5))

plot2_with_line <- ggplot(data = anscombe, aes(x = x2, y = y2)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 0.8) +
  xlab(bquote(x[2])) +
  ylab(bquote(y[2])) +
  ggtitle("Dataset 2 with Regression Line") +
  theme(plot.title = element_text(hjust = 0.5))

plot3_with_line <- ggplot(data = anscombe, aes(x = x3, y = y3)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 0.8) +
  xlab(bquote(x[3])) +
  ylab(bquote(y[3])) +
  ggtitle("Dataset 3 with Regression Line") +
  theme(plot.title = element_text(hjust = 0.5))

plot4_with_line <- ggplot(data = anscombe, aes(x = x4, y = y4)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 0.8) +
  xlab(bquote(x[4])) +
  ylab(bquote(y[4])) +
  ggtitle("Dataset 4 with Regression Line") +
  theme(plot.title = element_text(hjust = 0.5))

# 在2×2网格中显示图形
grid.arrange(plot1_with_line, plot2_with_line, plot3_with_line, plot4_with_line, 
             ncol = 2, nrow = 2)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# 可选：添加回归方程和R²值到图形中
# 创建函数来生成回归方程标签
get_equation <- function(model) {
  coef <- coef(model)
  intercept <- round(coef[1], 3)
  slope <- round(coef[2], 3)
  r_squared <- round(summary(model)$r.squared, 3)
  paste0("y = ", slope, "x + ", intercept, "\nR² = ", r_squared)
}

# 创建带有回归方程和R²值的图形
plot1_annotated <- plot1_with_line + 
  annotate("text", x = min(anscombe$x1), y = max(anscombe$y1), 
           label = get_equation(model1), hjust = 0, vjust = 1, size = 3.5)

plot2_annotated <- plot2_with_line + 
  annotate("text", x = min(anscombe$x2), y = max(anscombe$y2), 
           label = get_equation(model2), hjust = 0, vjust = 1, size = 3.5)

plot3_annotated <- plot3_with_line + 
  annotate("text", x = min(anscombe$x3), y = max(anscombe$y3), 
           label = get_equation(model3), hjust = 0, vjust = 1, size = 3.5)

plot4_annotated <- plot4_with_line + 
  annotate("text", x = min(anscombe$x4), y = max(anscombe$y4), 
           label = get_equation(model4), hjust = 0, vjust = 1, size = 3.5)

# 显示带有注解的图形
cat("\n带有回归方程和R²值的图形:\n")

## 
## 带有回归方程和R²值的图形:

grid.arrange(plot1_annotated, plot2_annotated, plot3_annotated, plot4_annotated, 
             ncol = 2, nrow = 2)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# 显示回归方程的详细信息
cat("\n回归方程详细信息:\n")

## 
## 回归方程详细信息:

cat("数据集 1:", get_equation(model1), "\n")

## 数据集 1: y = 0.5x + 3
## R² = 0.667

cat("数据集 2:", get_equation(model2), "\n")

## 数据集 2: y = 0.5x + 3.001
## R² = 0.666

cat("数据集 3:", get_equation(model3), "\n")

## 数据集 3: y = 0.5x + 3.002
## R² = 0.666

cat("数据集 4:", get_equation(model4), "\n")

## 数据集 4: y = 0.5x + 3.002
## R² = 0.667

第 4 题

对简单线性回归模型 \[ Y = \beta_0 + \beta_1 \, X + \varepsilon ,\, \quad \varepsilon \sim N \left( 0 ,\, \sigma^2 \right) \,. \] 若 \(\left( x_1 ,\, y_1 \right) ,\, \ldots ,\, \left( x_n ,\, y_n \right)\) 是对 \((X ,\, Y)\) 的一组观测数值，求回归参数 \(\beta_0 ,\, \beta_1\) 的最小二乘估计. (2 分)

# 第4题：简单线性回归参数的最小二乘估计
# 模型: Y = β₀ + β₁X + ε, ε ~ N(0, σ²)

# 观测数据（模拟的线性关系数据）
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, 3.8, 5.2, 7.1, 8.9, 10.5, 12.1, 13.8, 15.3, 16.8)

# 计算基本统计量
n <- length(x)
x_bar <- mean(x)
y_bar <- mean(y)

# 最小二乘估计核心计算
# β₁_hat = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
S_xy <- sum((x - x_bar) * (y - y_bar))
S_xx <- sum((x - x_bar)^2)
beta_1_hat <- S_xy / S_xx

# β₀_hat = ȳ - β₁_hat × x̄
beta_0_hat <- y_bar - beta_1_hat * x_bar

# 使用R内置函数验证计算结果
model_lm <- lm(y ~ x)
beta_0_lm <- coef(model_lm)[1]
beta_1_lm <- coef(model_lm)[2]

# 正式输出最小二乘估计结果
cat("==================== 最小二乘估计结果 ====================\n")

## ==================== 最小二乘估计结果 ====================

cat("根据公式计算:\n")

## 根据公式计算:

cat(sprintf("β₀_hat = %.6f\n", beta_0_hat))

## β₀_hat = 0.480000

cat(sprintf("β₁_hat = %.6f\n", beta_1_hat))

## β₁_hat = 1.650909

cat("\n使用lm()函数验证:\n")

## 
## 使用lm()函数验证:

cat(sprintf("β₀_hat = %.6f\n", beta_0_lm))

## β₀_hat = 0.480000

cat(sprintf("β₁_hat = %.6f\n", beta_1_lm))

## β₁_hat = 1.650909

# 验证计算精度
tolerance <- 1e-10
if (abs(beta_0_hat - beta_0_lm) < tolerance && 
    abs(beta_1_hat - beta_1_lm) < tolerance) {
  cat("\n✓ 验证通过：手动计算结果与lm()函数完全一致\n")
} else {
  cat("\n⚠ 注意：计算结果存在微小差异\n")
}

## 
## ✓ 验证通过：手动计算结果与lm()函数完全一致

# 输出数学公式对应关系
cat("\n==================== 公式对应关系 ====================\n")

## 
## ==================== 公式对应关系 ====================

cat("β₁_hat = S_xy / S_xx =", S_xy, "/", S_xx, "=", beta_1_hat, "\n")

## β₁_hat = S_xy / S_xx = 136.2 / 82.5 = 1.650909

cat("β₀_hat = ȳ - β₁_hat × x̄ =", y_bar, "-", beta_1_hat, "×", x_bar, "=", beta_0_hat, "\n")

## β₀_hat = ȳ - β₁_hat × x̄ = 9.56 - 1.650909 × 5.5 = 0.48

# 最终答案框显（符合数学答题格式）
cat("\n==================== 最终答案 ====================\n")

## 
## ==================== 最终答案 ====================

cat("回归参数的最小二乘估计为：\n")

## 回归参数的最小二乘估计为：

cat(sprintf("β₀_hat = %.4f\n", beta_0_hat))

## β₀_hat = 0.4800

cat(sprintf("β₁_hat = %.4f\n", beta_1_hat))

## β₁_hat = 1.6509

工程数学 (统计学) 作业一

2025216850-王勇亮

`11.11`

作业要求

第 1 题

第 2 题

第 3 题

第 4 题