Day 3 - Linear Regression and Correlation Analysis

🎯 Introduction

This report walks through a complete analysis of variance, correlation, and linear regression in R, using the practice data in the file Demo.csv and other simulated data as required by the exercises.

1️⃣ Analysis of variance (ANOVA)

1.1. Entering the data for 4 groups

A <- c(8, 9, 11, 4, 7, 8, 5)
B <- c(7, 17, 10, 14, 12, 24, 11, 22)
C <- c(28, 21, 26, 11, 24, 19)
D <- c(26, 16, 13, 12, 9, 10, 11, 17, 15)

weight <- c(A, B, C, D)
group <- factor(rep(c("A","B","C","D"), times = c(length(A),length(B),length(C),length(D))))
data <- data.frame(group, weight)

1.2. Describing the data by group

aggregate(weight ~ group, data = data, FUN = function(x) c(n=length(x), mean=mean(x), sd=sd(x)))

##   group  weight.n weight.mean weight.sd
## 1     A  7.000000    7.428571  2.370453
## 2     B  8.000000   14.625000  5.950690
## 3     C  6.000000   21.500000  6.090977
## 4     D  9.000000   14.333333  5.147815

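boxplot(weight ~ group, data=data, col="lightblue", main="Weight comparison across 4 groups")

Interpretation (1.2):
The summary table and boxplot compare the mean and spread of weight across the 4 groups. Group A has the lowest mean (7.43) and the smallest standard deviation (2.37), group C has the highest mean (21.50), and groups B and D are nearly identical (14.63 and 14.33).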

1.3. ANOVA test

anova_result <- aov(weight ~ group, data=data)
summary(anova_result)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        3  642.3  214.09   8.197 0.000528 ***
## Residuals   26  679.1   26.12                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

shapiro.test(residuals(anova_result))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(anova_result)
## W = 0.98028, p-value = 0.8329

Interpretation (1.3):
The F test gives F(3, 26) = 8.197 with p = 0.000528, which is below 0.05, so at least two group means differ. The Shapiro-Wilk test on the residuals (W = 0.980, p = 0.833) gives no evidence against the normality assumption.
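As a cross-check on the F statistic above, the sketch below recomputes it from the between-group and within-group sums of squares, and adds Bartlett's test for equal variances, which ANOVA also assumes. The data are re-entered so the block runs on its own; the names `dat`, `ss_between`, `ss_within`, `f_manual`, and `bartlett_res` are illustrative and not part of the original analysis.

```r
# Re-enter the four groups from section 1.1 so this block is self-contained
A <- c(8, 9, 11, 4, 7, 8, 5)
B <- c(7, 17, 10, 14, 12, 24, 11, 22)
C <- c(28, 21, 26, 11, 24, 19)
D <- c(26, 16, 13, 12, 9, 10, 11, 17, 15)
dat <- data.frame(
  weight = c(A, B, C, D),
  group  = factor(rep(c("A", "B", "C", "D"),
                      times = c(length(A), length(B), length(C), length(D))))
)

grand_mean  <- mean(dat$weight)
group_means <- tapply(dat$weight, dat$group, mean)
group_n     <- tapply(dat$weight, dat$group, length)

# Between-group sum of squares: group size times squared distance to the grand mean
ss_between <- sum(group_n * (group_means - grand_mean)^2)

# Within-group sum of squares: ave() returns each observation's own group mean
ss_within <- sum((dat$weight - ave(dat$weight, dat$group))^2)

k <- nlevels(dat$group)   # number of groups
n <- nrow(dat)            # total number of observations

# F is the ratio of the two mean squares
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual <- pf(f_manual, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Bartlett's test of homogeneity of variances across the groups
bartlett_res <- bartlett.test(weight ~ group, data = dat)

round(c(SS_between = ss_between, SS_within = ss_within, F = f_manual, p = p_manual), 4)
bartlett_res
```

The hand-computed sums of squares and F should agree with the `summary(anova_result)` table, which makes it easy to see where each number in that table comes from.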

1.4. Post hoc analysis

tukey_res <- TukeyHSD(anova_result)
tukey_res

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ group, data = data)
## 
## $group
##           diff          lwr        upr     p adj
## B-A  7.1964286  -0.05969765 14.4525548 0.0525014
## C-A 14.0714286   6.27132726 21.8715299 0.0002134
## D-A  6.9047619  -0.16073856 13.9702624 0.0571911
## C-B  6.8750000  -0.69675602 14.4467560 0.0850381
## D-B -0.2916667  -7.10424368  6.5209103 0.9994049
## D-C -7.1666667 -14.55594392  0.2226106 0.0597131

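Interpretation (1.4):
Tukey HSD identifies which pairs of groups differ significantly. Here only C vs A does (difference 14.07, adjusted p = 0.0002). The pairs B-A, D-A, C-B, and D-C are borderline (adjusted p between 0.05 and 0.09), and D-B shows no difference (adjusted p = 0.999). Note: if the group variances are not homogeneous, consider the Games-Howell test.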
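Games-Howell is not part of base R; add-on packages provide it. As a rough base-R stand-in when the variances look unequal, the sketch below runs pairwise Welch t-tests with Holm adjustment through `pairwise.t.test(..., pool.sd = FALSE)`. This is a different procedure from Games-Howell, so treat it as a sensitivity check rather than a replacement. The data are re-entered so the block runs on its own, and the names `wt`, `grp`, and `pw` are illustrative.

```r
# Re-enter the four groups from section 1.1 so this block is self-contained
A <- c(8, 9, 11, 4, 7, 8, 5)
B <- c(7, 17, 10, 14, 12, 24, 11, 22)
C <- c(28, 21, 26, 11, 24, 19)
D <- c(26, 16, 13, 12, 9, 10, 11, 17, 15)
wt  <- c(A, B, C, D)
grp <- factor(rep(c("A", "B", "C", "D"),
                  times = c(length(A), length(B), length(C), length(D))))

# pool.sd = FALSE: each pair uses its own variances (Welch t-test)
# p.adjust.method = "holm": controls the family-wise error rate over the 6 comparisons
pw <- pairwise.t.test(wt, grp, p.adjust.method = "holm", pool.sd = FALSE)

# Lower-triangular matrix of adjusted p-values (rows B, C, D; columns A, B, C)
round(pw$p.value, 4)
```

Comparing this matrix with the `p adj` column from `TukeyHSD()` shows how sensitive the borderline pairs are to the equal-variance assumption.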

2️⃣ Correlation analysis

2.1. Reading the data

df <- read.csv("/cloud/project/Demo.csv", stringsAsFactors = FALSE)
str(df)

## 'data.frame':    1217 obs. of  7 variables:
##  $ X     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age   : int  53 65 64 56 54 52 66 50 61 58 ...
##  $ gender: chr  "F" "M" "F" "F" ...
##  $ weight: int  49 52 57 53 51 47 58 65 54 60 ...
##  $ height: int  150 165 157 156 160 153 155 167 165 158 ...
##  $ pcfat : num  37.3 16.8 34 33.8 14.8 32.2 35.3 28 21.1 36.6 ...

summary(df)

##        X              id              age           gender         
##  Min.   :   1   Min.   :   1.0   Min.   :13.00   Length:1217       
##  1st Qu.: 305   1st Qu.: 309.0   1st Qu.:35.00   Class :character  
##  Median : 609   Median : 615.0   Median :48.00   Mode  :character  
##  Mean   : 609   Mean   : 614.5   Mean   :47.15                     
##  3rd Qu.: 913   3rd Qu.: 921.0   3rd Qu.:58.00                     
##  Max.   :1217   Max.   :1227.0   Max.   :88.00                     
##      weight          height          pcfat     
##  Min.   :34.00   Min.   :136.0   Min.   : 9.2  
##  1st Qu.:49.00   1st Qu.:151.0   1st Qu.:27.0  
##  Median :54.00   Median :155.0   Median :32.4  
##  Mean   :55.14   Mean   :156.7   Mean   :31.6  
##  3rd Qu.:61.00   3rd Qu.:162.0   3rd Qu.:36.8  
##  Max.   :95.00   Max.   :185.0   Max.   :48.4

2.2. Describing weight and height

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df %>% summarise(n=n(),
mean_weight = mean(weight, na.rm=TRUE),
sd_weight = sd(weight, na.rm=TRUE),
median_weight = median(weight, na.rm=TRUE),
mean_height = mean(height, na.rm=TRUE),
sd_height = sd(height, na.rm=TRUE),
median_height = median(height, na.rm=TRUE))

##      n mean_weight sd_weight median_weight mean_height sd_height median_height
## 1 1217     55.1438  9.404989            54    156.7239  7.977726           155

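Interpretation (2.2):
The mean, standard deviation, and median describe the distribution of each variable. Across the 1217 observations, weight averages 55.1 kg (SD 9.4, median 54) and height averages 156.7 cm (SD 8.0, median 155).

2.3. Scatter plot (weight vs height)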

library(ggplot2)
ggplot(df, aes(x = height, y = weight)) +
geom_point(alpha=0.7) +
geom_smooth(method="lm", se=TRUE) +
labs(title="Scatter: Weight vs Height", x="Height (cm)", y="Weight (kg)") +
theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Interpretation (2.3):
The scatter plot shows a positive linear trend; the fitted regression line makes the direction and slope of the relationship easy to read.

2.4. Quantitative correlation analysis (weight and height)

# Pearson (linear) and Spearman (rank-based)

cor_pearson_wh <- cor.test(df$height, df$weight, method="pearson")
cor_spearman_wh <- cor.test(df$height, df$weight, method="spearman")

## Warning in cor.test.default(df$height, df$weight, method = "spearman"): Cannot
## compute exact p-value with ties

cor_pearson_wh

## 
##  Pearson's product-moment correlation
## 
## data:  df$height and df$weight
## t = 25.984, df = 1215, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5602911 0.6326135
## sample estimates:
##       cor 
## 0.5976667

cor_spearman_wh

## 
##  Spearman's rank correlation rho
## 
## data:  df$height and df$weight
## S = 132686871, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##     rho 
## 0.55832

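Interpretation (2.4):
- Pearson r = 0.598 (95% CI 0.560 to 0.633, p < 2.2e-16): a moderately strong positive linear correlation between height and weight. The p-value tests whether the true correlation is zero.
- Spearman rho = 0.558 agrees with it. Prefer Spearman when the data contain outliers or are not normally distributed.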
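To show where the t statistic and the confidence interval in the Pearson output come from, the sketch below recomputes both from the reported r and sample size alone. It assumes the standard formulas (t = r·sqrt(n-2)/sqrt(1-r²) and the Fisher z transformation for the interval); the variable names are illustrative.

```r
r <- 0.5976667   # Pearson r reported by cor.test() for height and weight
n <- 1217        # number of observations, so df = n - 2 = 1215

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation
z    <- atanh(r)            # transform r to the z scale
se_z <- 1 / sqrt(n - 3)     # standard error on the z scale
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)   # back-transform to the r scale

round(t_stat, 3)
round(ci, 4)
```

Both values should match the `cor.test()` printout above to the digits shown, which confirms that the interval is asymmetric around r because it is built on the z scale.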

2.5. Correlation analysis (height and pcfat)

cor_pearson_hp <- cor.test(df$height, df$pcfat, method="pearson")
cor_spearman_hp <- cor.test(df$height, df$pcfat, method="spearman")

## Warning in cor.test.default(df$height, df$pcfat, method = "spearman"): Cannot
## compute exact p-value with ties

cor_pearson_hp

## 
##  Pearson's product-moment correlation
## 
## data:  df$height and df$pcfat
## t = -19.063, df = 1215, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5219407 -0.4353664
## sample estimates:
##        cor 
## -0.4798206

cor_spearman_hp

## 
##  Spearman's rank correlation rho
## 
## data:  df$height and df$pcfat
## S = 441808157, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.4706643

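Interpretation (2.5): Height and percent body fat (pcfat) are negatively correlated: Pearson r = -0.480 (95% CI -0.522 to -0.435) and Spearman rho = -0.471, both with p < 2.2e-16. Taller individuals in this sample tend to have a lower percent body fat.

3️⃣ Simple linear regression

3.1. Vietnam data (gapminder / data supplied)

# If using gapminder:

# library(gapminder)

# vn_gap <- gapminder %>% filter(country == "Vietnam")

# Data supplied with the exercise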

year <- c(1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007)
lifeExp <- c(40.4,42.9,45.4,47.8,50.3,55.8,58.8,62.8,67.7,70.7,73.0,74.2)
vn <- data.frame(year, lifeExp)

3.2. Building the regression model

model_vn <- lm(lifeExp ~ year, data=vn)
summary(model_vn)

## 
## Call:
## lm(formula = lifeExp ~ year, data = vn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1494 -0.5944  0.1387  0.7324  1.8268 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.271e+03  4.377e+01  -29.04 5.47e-11 ***
## year         6.712e-01  2.211e-02   30.35 3.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.322 on 10 degrees of freedom
## Multiple R-squared:  0.9893, Adjusted R-squared:  0.9882 
## F-statistic: 921.4 on 1 and 10 DF,  p-value: 3.527e-11

3.3. Checking the model assumptions

par(mfrow=c(2,2))
plot(model_vn)

shapiro.test(residuals(model_vn))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model_vn)
## W = 0.95642, p-value = 0.7317

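Interpretation (3.3): The diagnostic plots check linearity, normality of the residuals, constant variance, and the influence of outliers. The Shapiro-Wilk test (W = 0.956, p = 0.732) gives no evidence against normally distributed residuals.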
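The four diagnostic plots can be backed by numbers. A minimal sketch, assuming the same 12 data points as section 3.1: it tabulates residuals, leverage, and Cook's distance to flag potentially influential years. The 4/n cutoff for Cook's distance is a common rule of thumb, not something the exercise prescribes, and the names `vn_check`, `fit`, and `diag_tab` are illustrative.

```r
# Re-enter the Vietnam data from section 3.1 so this block is self-contained
year    <- c(1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007)
lifeExp <- c(40.4, 42.9, 45.4, 47.8, 50.3, 55.8, 58.8, 62.8, 67.7, 70.7, 73.0, 74.2)
vn_check <- data.frame(year, lifeExp)

fit <- lm(lifeExp ~ year, data = vn_check)

# One row per observation: residual, leverage (hat value), Cook's distance
diag_tab <- data.frame(
  year     = vn_check$year,
  residual = round(residuals(fit), 3),
  leverage = round(hatvalues(fit), 3),
  cooks_d  = round(cooks.distance(fit), 3)
)

# Rule-of-thumb flag (assumption): Cook's distance above 4 / n
diag_tab$flag <- diag_tab$cooks_d > 4 / nrow(vn_check)
diag_tab
```

Rows with `flag = TRUE` are the ones worth inspecting in the "Residuals vs Leverage" panel of `plot(model_vn)`.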

3.4. Regression equation and interpretation

coef(model_vn)

##   (Intercept)          year 
## -1271.1349184     0.6711888

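# Fitted equation: lifeExp = -1271.135 + 0.671 * year

Interpretation (3.4): The slope gives the average change in life expectancy per calendar year: about 0.67 years per year, or roughly 6.7 years per decade, over 1952-2007. R^2 = 0.989 means the linear trend explains about 99% of the variation in life expectancy. The intercept (-1271) is the extrapolated value at year 0 and has no practical meaning.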
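To make the equation concrete, the sketch below plugs a year into the fitted coefficients and checks the result against `predict()`. The data are re-entered so the block is self-contained; the choice of 2007 as the example year and the names `vn_pred`, `fit`, `by_hand`, and `by_predict` are illustrative.

```r
# Re-enter the Vietnam data from section 3.1 so this block is self-contained
year    <- c(1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007)
lifeExp <- c(40.4, 42.9, 45.4, 47.8, 50.3, 55.8, 58.8, 62.8, 67.7, 70.7, 73.0, 74.2)
vn_pred <- data.frame(year, lifeExp)

fit <- lm(lifeExp ~ year, data = vn_pred)
b   <- coef(fit)

# Fitted value for 2007, computed directly from the equation
by_hand <- unname(b["(Intercept)"] + b["year"] * 2007)

# Same value through predict(), which avoids rounding the coefficients by hand
by_predict <- unname(predict(fit, newdata = data.frame(year = 2007)))

# 95% confidence interval for the slope (years of life expectancy per calendar year)
slope_ci <- confint(fit, "year", level = 0.95)

round(c(by_hand = by_hand, by_predict = by_predict), 3)
slope_ci
```

For 2007 the fitted value is about 75.9 years against an observed 74.2, which is one of the larger negative residuals in the model.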

3.5. Direct analysis of the supplied data

# Same model as in 3.2 (fitted to the vn data set), so the summary is identical.

summary(model_vn)

4️⃣ Multiple linear regression

4.1. Entering the data Y, X1, X2

Y <- c(12.1, 11.9, 10.2, 8.0, 7.7, 5.3, 7.9, 7.8, 5.5, 2.6)
X1 <- 0:9
X2 <- c(7,4,4,6,4,2,1,1,1,0)
df2 <- data.frame(Y, X1, X2)
summary(df2)

##        Y               X1             X2   
##  Min.   : 2.60   Min.   :0.00   Min.   :0  
##  1st Qu.: 6.05   1st Qu.:2.25   1st Qu.:1  
##  Median : 7.85   Median :4.50   Median :3  
##  Mean   : 7.90   Mean   :4.50   Mean   :3  
##  3rd Qu.: 9.65   3rd Qu.:6.75   3rd Qu.:4  
##  Max.   :12.10   Max.   :9.00   Max.   :7

4.2. Simple regression Y ~ X1

model_y_x1 <- lm(Y ~ X1, data=df2)
summary(model_y_x1)

## 
## Call:
## lm(formula = Y ~ X1, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1606 -1.0735  0.1742  0.8621  2.0970 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.8545     0.8283  14.312 5.54e-07 ***
## X1           -0.8788     0.1552  -5.664 0.000474 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.409 on 8 degrees of freedom
## Multiple R-squared:  0.8004, Adjusted R-squared:  0.7755 
## F-statistic: 32.08 on 1 and 8 DF,  p-value: 0.0004737

4.3. Simple regression Y ~ X2

model_y_x2 <- lm(Y ~ X2, data=df2)
summary(model_y_x2)

## 
## Call:
## lm(formula = Y ~ X2, data = df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.702 -1.533 -0.034  1.667  3.066 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   5.0980     1.1222   4.543  0.00189 **
## X2            0.9340     0.2999   3.114  0.01436 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.121 on 8 degrees of freedom
## Multiple R-squared:  0.548,  Adjusted R-squared:  0.4915 
## F-statistic: 9.698 on 1 and 8 DF,  p-value: 0.01436

4.4. Multiple regression Y ~ X1 + X2

model_y_both <- lm(Y ~ X1 + X2, data=df2)
summary(model_y_both)

## 
## Call:
## lm(formula = Y ~ X1 + X2, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.46078 -0.33384  0.00026  0.81856  1.98476 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  14.7076     2.9785   4.938  0.00168 **
## X1           -1.2042     0.3614  -3.332  0.01255 * 
## X2           -0.4629     0.4642  -0.997  0.35187   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.41 on 7 degrees of freedom
## Multiple R-squared:  0.8252, Adjusted R-squared:  0.7753 
## F-statistic: 16.53 on 2 and 7 DF,  p-value: 0.002232

# Check for multicollinearity if needed

# library(car); vif(model_y_both)

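Interpretation (4.4): Comparing coefficients and p-values between the simple and multiple models shows the independent effect of each variable. X1 stays significant (-0.879 alone; -1.204 adjusted for X2, p = 0.013), whereas X2 changes sign (0.934 alone, p = 0.014; -0.463 adjusted for X1, p = 0.352) and adds little: adjusted R-squared is 0.7755 with X1 alone and 0.7753 with both predictors. This pattern suggests that X1 and X2 are correlated, so the apparent effect of X2 in 4.3 largely reflects X1.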
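The multicollinearity check can be done without the car package. A minimal sketch, assuming the same 10 observations as section 4.1: it computes the correlation between the two predictors, the variance inflation factor by hand (with two predictors it is 1 / (1 - r²) for both), and a nested-model F test of whether X2 adds anything beyond X1. The names `d`, `m1`, `m12`, `vif_manual`, and `cmp` are illustrative.

```r
# Re-enter the data from section 4.1 so this block is self-contained
Y  <- c(12.1, 11.9, 10.2, 8.0, 7.7, 5.3, 7.9, 7.8, 5.5, 2.6)
X1 <- 0:9
X2 <- c(7, 4, 4, 6, 4, 2, 1, 1, 1, 0)
d  <- data.frame(Y, X1, X2)

# Correlation between the two predictors
r_x1x2 <- cor(d$X1, d$X2)

# Variance inflation factor; with exactly two predictors it is the same for X1 and X2
vif_manual <- 1 / (1 - r_x1x2^2)

# Nested-model comparison: does adding X2 improve on the X1-only model?
m1  <- lm(Y ~ X1, data = d)
m12 <- lm(Y ~ X1 + X2, data = d)
cmp <- anova(m1, m12)

round(c(r_X1_X2 = r_x1x2, VIF = vif_manual), 3)
cmp
```

The predictors are strongly negatively correlated (r is about -0.90, VIF about 5.4), which explains the sign flip of X2. The p-value of the nested F test equals the p-value of the X2 coefficient in `summary(model_y_both)`, since both test the same single-parameter hypothesis.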

5️⃣ Summary, report export, and practical applications

5.1. Writing results to a file (CSV)

# Example: write a summary table or result data

write.csv(df, "Ngay3_Hoi_Qui_Tuyen_Tinh.csv", row.names = FALSE)
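The call above exports the raw data frame. To export a result table instead, one option is to write the coefficient table of a fitted model. A minimal sketch, assuming the Vietnam model from section 3.2; the file name `vn_model_coefficients.csv` is hypothetical and the file is written to a temporary directory so the example leaves no trace in the project folder.

```r
# Re-fit the Vietnam model from section 3.2 so this block is self-contained
year    <- c(1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007)
lifeExp <- c(40.4, 42.9, 45.4, 47.8, 50.3, 55.8, 58.8, 62.8, 67.7, 70.7, 73.0, 74.2)
fit <- lm(lifeExp ~ year)

# Coefficient matrix (estimate, standard error, t value, p-value) as a data frame
coef_tab <- as.data.frame(summary(fit)$coefficients)

# Move the row names into a regular column so they survive row.names = FALSE
coef_tab <- cbind(term = rownames(coef_tab), coef_tab)

# Hypothetical output file in a temporary directory
out_file <- file.path(tempdir(), "vn_model_coefficients.csv")
write.csv(coef_tab, out_file, row.names = FALSE)

# Read the file back to confirm what was written
check <- read.csv(out_file)
check
```

Note that `read.csv()` rewrites column names such as `Std. Error` into syntactically valid ones by default; pass `check.names = FALSE` to keep the original headers.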

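5.2. Building the complete RMarkdown and publishing to RPubs

# After Knit -> HTML, use RStudio's Publish button, or upload from the console (rpubsUpload() is provided by the rsconnect package):

# rsconnect::rpubsUpload(title = "Ngay3_Hoi_Qui_Tuyen_Tinh", contentFile = "Ngay3_HoiQui_Pro.html", originalDoc = NULL)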

🧠 Conclusion and practical relevance

Regression and correlation analysis are foundational statistical tools in scientific and engineering research. In electrical and electronic engineering, they help to:

- Forecast trends in energy consumption and load power.
- Estimate optimal parameters of a device or system.
- Analyze reliability and optimize production processes.

Mastering these analyses helps students and engineers develop data-driven thinking and apply it to the design and management of smart systems.