Ngày 4: Hồi quy tuyến tính lm: linear Model

Hồi qui tuyến tính

##4. So sánh tỉ trọng mỡ giữa nam và nữ

on4 = read.csv("D:\\OneDrive\\Documents\\HPK1 2018\\NVPS2025\\Viet bao khoa hoc paper\\R\\DU LIEU THUC HANH\\Obesity data.csv")
library(lessR)
## 
## lessR 4.4.3                         feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")  Read data file, many formats available, e.g., Excel
##   d is default data frame, data= in analysis routines optional
## 
## Many examples of reading, writing, and manipulating data, 
## graphics, testing means and proportions, regression, factor analysis,
## customization, forecasting, and aggregation from pivot tables
##   Enter: browseVignettes("lessR")
## 
## View lessR updates, now including time series forecasting
##   Enter: news(package="lessR")
## 
## Interactive data analysis
##   Enter: interact()
## 
## Attaching package: 'lessR'
## The following object is masked from 'package:base':
## 
##     sort_by
Histogram(pcfat, fill = "blue", xlab = "Percentage of fat (%)", ylab = "Frequency",  data = on4)

## >>> Suggestions 
## bin_width: set the width of each bin 
## bin_start: set the start of the first bin 
## bin_end: set the end of the last bin 
## Histogram(pcfat, density=TRUE)  # smoothed curve + histogram 
## Plot(pcfat)  # Violin/Box/Scatterplot (VBS) plot 
## 
## --- pcfat --- 
##  
##        n   miss            mean              sd             min             mdn             max 
##      1217      0       31.604786        7.182862        9.200000       32.400000       48.400000 
##  
## 
##   
## --- Outliers ---     from the box plot: 10 
##  
## Small       Large 
## -----       ----- 
##   9.2            
##   9.7            
##   9.8            
##  10.3            
##  10.3            
##  10.7            
##  11.0            
##  11.4            
##  11.7            
##  11.9            
## 
## 
## Bin Width: 5 
## Number of Bins: 9 
##  
##      Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## ------------------------------------------------- 
##   5 > 10     7.5      3    0.00        3     0.00 
##  10 > 15    12.5     26    0.02       29     0.02 
##  15 > 20    17.5     61    0.05       90     0.07 
##  20 > 25    22.5    128    0.11      218     0.18 
##  25 > 30    27.5    244    0.20      462     0.38 
##  30 > 35    32.5    338    0.28      800     0.66 
##  35 > 40    37.5    294    0.24     1094     0.90 
##  40 > 45    42.5    107    0.09     1201     0.99 
##  45 > 50    47.5     16    0.01     1217     1.00 
## 

##Suggestions bin_width: set the width of each bin bin_start: set the start of the first bin bin_end: set the end of the last bin Histogram(pcfat, density=TRUE) # smoothed curve + histogram Plot(pcfat) # Violin/Box/Scatterplot (VBS) plot

— pcfat —

   n   miss            mean              sd             min             mdn             max 
 1217      0       31.604786        7.182862        9.200000       32.400000       48.400000 

— Outliers — from the box plot: 10

Small Large
9.2
9.7
9.8
10.3
10.3
10.7
11.0
11.4
11.7
11.9

Bin Width: 5 Number of Bins: 9

Bin Midpnt Count Prop Cumul.c Cumul.p

5 > 10 7.5 3 0.00 3 0.00 10 > 15 12.5 26 0.02 29 0.02 15 > 20 17.5 61 0.05 90 0.07 20 > 25 22.5 128 0.11 218 0.18 25 > 30 27.5 244 0.20 462 0.38 30 > 35 32.5 338 0.28 800 0.66 35 > 40 37.5 294 0.24 1094 0.90 40 > 45 42.5 107 0.09 1201 0.99 45 > 50 47.5 16 0.01 1217 1.00 ## ###: Pcfat = 34 +-10genderM, tỷ trọng mỡ nữ (FM) 34%, Nam thấp hơn 10%.sum ##5.Sử dụng kiểm định t

ttest(pcfat ~ gender, data = on4)
## 
## Compare pcfat across gender with levels F and M 
## Grouping Variable:  gender
## Response Variable:  pcfat
## 
## 
## ------ Describe ------
## 
## pcfat for gender F:  n.miss = 0,  n = 862,  mean = 34.672,  sd = 5.187
## pcfat for gender M:  n.miss = 0,  n = 355,  mean = 24.156,  sd = 5.764
## 
## Mean Difference of pcfat:  10.516
## 
## Weighted Average Standard Deviation:   5.362 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of pcfat.
## Group F: Sample mean assumed normal because n > 30, so no test needed.
## Group M: Sample mean assumed normal because n > 30, so no test needed.
## 
## Null hypothesis is equal variances of pcfat, homogeneous.
## Variance Ratio test:  F = 33.223/26.909 = 1.235,  df = 354;861,  p-value = 0.016
## Levene's test, Brown-Forsythe:  t = -2.232,  df = 1215,  p-value = 0.026
## 
## 
## ------ Infer ------
## 
## --- Assume equal population variances of pcfat for each gender 
## 
## t-cutoff for 95% range of variation: tcut =  1.962 
## Standard Error of Mean Difference: SE =  0.338 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 31.101,  df = 1215,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.663
## 95% Confidence Interval for Mean Difference:  9.853 to 11.180
## 
## 
## --- Do not assume equal population variances of pcfat for each gender 
## 
## t-cutoff: tcut =  1.964 
## Standard Error of Mean Difference: SE =  0.353 
## 
## Hypothesis Test of 0 Mean Diff:  t = 29.768,  df = 602.015, p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.694
## 95% Confidence Interval for Mean Difference:  9.823 to 11.210
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of pcfat for each gender 
## 
## Standardized Mean Difference of pcfat, Cohen's d:  1.961
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for gender F: 1.475
## Density bandwidth for gender M: 1.867

##Compare pcfat across gender with levels F and M Grouping Variable: gender Response Variable: pcfat

—— Describe ——

pcfat for gender F: n.miss = 0, n = 862, mean = 34.672, sd = 5.187 pcfat for gender M: n.miss = 0, n = 355, mean = 24.156, sd = 5.764

Mean Difference of pcfat: 10.516

Weighted Average Standard Deviation: 5.362

—— Assumptions ——

Note: These hypothesis tests can perform poorly, and the t-test is typically robust to violations of assumptions. Use as heuristic guides instead of interpreting literally.

Null hypothesis, for each group, is a normal distribution of pcfat. Group F: Sample mean assumed normal because n > 30, so no test needed. Group M: Sample mean assumed normal because n > 30, so no test needed.

Null hypothesis is equal variances of pcfat, homogeneous. Variance Ratio test: F = 33.223/26.909 = 1.235, df = 354;861, p-value = 0.016 Levene’s test, Brown-Forsythe: t = -2.232, df = 1215, p-value = 0.026

—— Infer ——

— Assume equal population variances of pcfat for each gender

t-cutoff for 95% range of variation: tcut = 1.962 Standard Error of Mean Difference: SE = 0.338

Hypothesis Test of 0 Mean Diff: t-value = 31.101, df = 1215, p-value = 0.000

Margin of Error for 95% Confidence Level: 0.663 95% Confidence Interval for Mean Difference: 9.853 to 11.180

— Do not assume equal population variances of pcfat for each gender

t-cutoff: tcut = 1.964 Standard Error of Mean Difference: SE = 0.353

Hypothesis Test of 0 Mean Diff: t = 29.768, df = 602.015, p-value = 0.000

Margin of Error for 95% Confidence Level: 0.694 95% Confidence Interval for Mean Difference: 9.823 to 11.210

—— Effect Size ——

— Assume equal population variances of pcfat for each gender

Standardized Mean Difference of pcfat, Cohen’s d: 1.961

—— Practical Importance ——

Minimum Mean Difference of practical importance: mmd Minimum Standardized Mean Difference of practical importance: msmd Neither value specified, so no analysis

—— Graphics Smoothing Parameter ——

Density bandwidth for gender F: 1.475 Density bandwidth for gender M: 1.867 ## ##6. Sử dụng mô hình hồi qui tuyến tính

m.1 = lm(pcfat ~ gender, data = on4)
summary(m.1)
## 
## Call:
## lm(formula = pcfat ~ gender, data = on4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0724  -3.2724   0.1484   3.6276  14.8439 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  34.6724     0.1826   189.9 <0.0000000000000002 ***
## genderM     -10.5163     0.3381   -31.1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.362 on 1215 degrees of freedom
## Multiple R-squared:  0.4432, Adjusted R-squared:  0.4428 
## F-statistic: 967.3 on 1 and 1215 DF,  p-value: < 0.00000000000000022

##Call: lm(formula = pcfat ~ gender, data = on4)

Residuals: Min 1Q Median 3Q Max -20.0724 -3.2724 0.1484 3.6276 14.8439

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.6724 0.1826 189.9 <0.0000000000000002 genderM -10.5163 0.3381 -31.1 <0.0000000000000002 — Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.362 on 1215 degrees of freedom Multiple R-squared: 0.4432, Adjusted R-squared: 0.4428 F-statistic: 967.3 on 1 and 1215 DF, p-value: < 0.00000000000000022 ##

library(lessR)
m.2 = reg(pcfat ~ gender, data = on4)
## 
## >>>  gender is not numeric. Converted to indicator variables.

##7.>>> Suggestion # Create an R markdown file for interpretative output with Rmd = “file_name” reg(pcfat ~ gender, data=on4, Rmd=“eg”)

BACKGROUND

Data Frame: on4

Response Variable: pcfat Predictor Variable: genderM

Number of cases (rows) of data: 1217 Number of cases retained for analysis: 1217

BASIC ANALYSIS

– Estimated Model for pcfat

          Estimate    Std Err  t-value  p-value    Lower 95%    Upper 95% 

(Intercept) 34.672413 0.182622 189.859 0.000 34.314123 35.030703 genderM -10.516344 0.338131 -31.101 0.000 -11.179729 -9.852959

– Model Fit

Standard deviation of pcfat: 7.182861

Standard deviation of residuals: 5.361759 for df=1215 95% range of residuals: 21.038669 = 2 * (1.962 * 5.361759)

R-squared: 0.443 Adjusted R-squared: 0.443 PRESS R-squared: 0.441

Null hypothesis of all 0 population slope coefficients: F-statistic: 967.297 df: 1 and 1215 p-value: 0.000

– Analysis of Variance

           df        Sum Sq       Mean Sq     F-value   p-value 

Model 1 27808.311497 27808.311497 967.297285 0.000 Residuals 1215 34929.384159 28.748464 pcfat 1216 62737.695656 51.593500

K-FOLD CROSS-VALIDATION

RELATIONS AMONG THE VARIABLES

– Correlation Matrix

      pcfat genderM 
pcfat  1.00   -0.67 

genderM -0.67 1.00

RESIDUALS AND INFLUENCE

– Data, Fitted, Residual, Studentized Residual, Dffits, Cook’s Distance [sorted by Cook’s Distance] [n_res_rows = 20, out of 1217 rows of data, or do n_res_rows=“all”] ————————————————————————— genderM pcfat fitted resid rstdnt dffits cooks 210 1 9.200000 24.156069 -14.956069 -2.801192 -0.148882 0.011020 509 1 39.000000 24.156069 14.843931 2.780055 0.147758 0.010860 179 1 38.700000 24.156069 14.543931 2.723523 0.144754 0.010420 518 1 9.700000 24.156069 -14.456069 -2.706970 -0.143874 0.010300 200 1 9.800000 24.156069 -14.356069 -2.688132 -0.142873 0.010150 563 1 38.300000 24.156069 14.143931 2.648179 0.140749 0.009860 318 1 10.300000 24.156069 -13.856069 -2.593980 -0.137869 0.009460 972 1 10.300000 24.156069 -13.856069 -2.593980 -0.137869 0.009460 388 1 10.700000 24.156069 -13.456069 -2.518700 -0.133867 0.008920 203 1 11.000000 24.156069 -13.156069 -2.462262 -0.130868 0.008530 1137 0 14.600000 34.672413 -20.072413 -3.766065 -0.128347 0.008150 893 0 14.700000 34.672413 -19.972413 -3.747085 -0.127700 0.008070 688 1 11.400000 24.156069 -12.756069 -2.387042 -0.126870 0.008020 403 1 11.700000 24.156069 -12.456069 -2.330649 -0.123873 0.007640 858 1 11.900000 24.156069 -12.256069 -2.293064 -0.121875 0.007400 158 1 36.300000 24.156069 12.143931 2.271993 0.120755 0.007270 1106 1 36.300000 24.156069 12.143931 2.271993 0.120755 0.007270 827 1 36.000000 24.156069 11.843931 2.215637 0.117760 0.006910 756 1 12.400000 24.156069 -11.756069 -2.199135 -0.116883 0.006810 196 1 12.500000 24.156069 -11.656069 -2.180355 -0.115885 0.006690

PREDICTION ERROR

– Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals [sorted by lower bound of prediction interval] [to see all intervals add n_pred_rows=“all”] ———————————————-

    genderM     pcfat      pred   s_pred    pi.lwr    pi.upr     width 
 2        1 16.800000 24.156069 5.369306 13.621929 34.690209 21.068280 
 5        1 14.800000 24.156069 5.369306 13.621929 34.690209 21.068280 

… 1209 1 26.400000 24.156069 5.369306 13.621929 34.690209 21.068280 1 0 37.300000 34.672413 5.364869 24.146979 45.197847 21.050869 3 0 34.000000 34.672413 5.364869 24.146979 45.197847 21.050869 … 1215 0 34.400000 34.672413 5.364869 24.146979 45.197847 21.050869 1216 0 41.300000 34.672413 5.364869 24.146979 45.197847 21.050869 1217 0 33.200000 34.672413 5.364869 24.146979 45.197847 21.050869

Plot 1: Distribution of Residuals
Plot 2: Residuals vs Fitted Values

7. Kiểm tra giả định.

par(mfrow = c(2, 2))
plot(m.1)

##Dùng gói ggfortify

library(ggfortify)
## Loading required package: ggplot2
autoplot(m.1)

pcfat = 34.7 - 10.5*SexM ##8. Đánh giá mối liên quan giữa cân nặng và tỉ trọng mỡ ### 8.1.Vẽ biểu đồ tán xạ

Plot(weight, pcfat, fit = "lm", data = on4)

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(weight, pcfat, enhance=TRUE)  # many options
## Plot(weight, pcfat, fill="skyblue")  # interior fill color of points
## Plot(weight, pcfat, out_cut=.10)  # label top 10% from center as outliers 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 1217 
## Sample Correlation of weight and pcfat: r = 0.057 
##   
## Hypothesis Test of 0 Correlation:  t = 1.975,  df = 1215,  p-value = 0.049 
## 95% Confidence Interval for Correlation:  0.000 to 0.112 
##   
## 
##   Line: b0 = 29.222947    b1 = 0.043193    Linear Model MSE = 51.470803   Rsq = 0.003
## 

8.2. Sử dụng mô hình hồi qui tuyến tính

m.3 = lm(pcfat ~ weight, data = on4)
summary(m.3)
## 
## Call:
## lm(formula = pcfat ~ weight, data = on4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3122  -4.5234   0.8902   5.2695  16.9742 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 29.22295    1.22370  23.881 <0.0000000000000002 ***
## weight       0.04319    0.02188   1.975              0.0485 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.174 on 1215 degrees of freedom
## Multiple R-squared:  0.003199,   Adjusted R-squared:  0.002378 
## F-statistic: 3.899 on 1 and 1215 DF,  p-value: 0.04855
library(ggfortify)
##vẽ biểu đồ phù hợp nhất cho kiểu dữ liệu đầu vào
autoplot(m.3)

## ##9.Đánh giá mối liên quan độc lập giữa cân nặng và tỉ trọng mỡ ###9.1. Xây dựng và kiểm tra giả định mô hình

m.4 = lm(pcfat ~ weight + age + gender + height, data = on4)
summary(m.4)
## 
## Call:
## lm(formula = pcfat ~ weight + age + gender + height, data = on4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.208  -2.543   0.019   2.582  15.706 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  48.368722   3.505431  13.798 < 0.0000000000000002 ***
## weight        0.439169   0.015594  28.163 < 0.0000000000000002 ***
## age           0.056166   0.007404   7.585   0.0000000000000658 ***
## genderM     -11.483254   0.344343 -33.348 < 0.0000000000000002 ***
## height       -0.257013   0.023768 -10.813 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.974 on 1212 degrees of freedom
## Multiple R-squared:  0.695,  Adjusted R-squared:  0.694 
## F-statistic: 690.4 on 4 and 1212 DF,  p-value: < 0.00000000000000022
## vẽ biểu đồ phù hợp nhất cho kiểu dữ liệu đầu vào
autoplot(m.4)

### Viết phương trình ####pcfat = 48.4 + 0.44weight + 0.06age - 11.5genderM - 0.26height ##. 10.Xây dựng mô hình dự báo tỉ trọng mỡ ###10.1. Xây dựng mô hình dự báo tối ưu bằng phương pháp BMA

library(BMA)
## Loading required package: survival
## Loading required package: leaps
## Loading required package: robustbase
## 
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
## 
##     heart
## Loading required package: inline
## Loading required package: rrcov
## Scalable Robust Estimators with High Breakdown Point (version 1.7-7)
yvar = on4[, c("pcfat")]
xvar = on4[, c("gender", "age", "height", "weight", "bmi")]
m.bma = bicreg(xvar, yvar, strict = FALSE, OR = 20)
summary(m.bma)
## 
## Call:
## bicreg(x = xvar, y = yvar, strict = FALSE, OR = 20)
## 
## 
##   3  models were selected
##  Best  3  models (cumulative posterior probability =  1 ): 
## 
##            p!=0    EV        SD        model 1      model 2      model 3    
## Intercept  100.0    5.26146  4.582901      7.95773     -0.79279      8.13735
## genderM    100.0  -11.25139  0.429659    -11.44430    -11.42764    -10.80625
## age        100.0    0.05259  0.008048      0.05497      0.05473      0.04715
## height      31.4    0.01759  0.028494        .          0.05598        .    
## weight      39.2    0.03102  0.042611      0.07921        .            .    
## bmi        100.0    1.01265  0.111625      0.89419      1.08852      1.08936
##                                                                             
## nVar                                         4            4            3    
## r2                                         0.697        0.696        0.695  
## BIC                                    -1423.06312  -1422.62198  -1422.49027
## post prob                                  0.392        0.314        0.294
###
imageplot.bma(m.bma)

###10.2.Kiểm tra giả định của mô hình

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following objects are masked from 'package:lessR':
## 
##     bc, recode, sp
library(BMA)
# Vẽ image plot mà không truyền main
imageplot.bma(m.bma, col = c("white", "black"))

# Thêm tiêu đề thủ công
title(main = "BMA Posterior Inclusion Plot")

###pcfat = 8.0 - 11.4genderM + 0.05age + 0.08weight + 0.89bmi ###10.3. Xác định tầm quan trọng library(relaimpo)

library(relaimpo)
## Loading required package: MASS
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit
## The following object is masked from 'package:robustbase':
## 
##     salinity
## The following object is masked from 'package:survival':
## 
##     aml
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
## Loading required package: mitools
## This is the global version of package relaimpo.
## If you are a non-US user, a version with the interesting additional metric pmvd is available
## from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
on4$sex = ifelse(on4$gender == "F", 1, 0)
m.bma2 = lm(pcfat ~ sex + age + weight + bmi, data = on4)
calc.relimp(m.bma2, type = "lmg", rela = TRUE, rank = TRUE)
## Response variable: pcfat 
## Total response variance: 51.5935 
## Analysis based on 1217 observations 
## 
## 4 Regressors: 
## sex age weight bmi 
## Proportion of variance explained by model: 69.66%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##               lmg
## sex    0.59317775
## age    0.06893066
## weight 0.09175463
## bmi    0.24613695
## 
## Average coefficients for different model sizes: 
## 
##                 1X         2Xs         3Xs         4Xs
## sex    10.51634414 11.71834412 11.80453842 11.44430262
## age     0.12768705  0.10445197  0.05168496  0.05496623
## weight  0.04319324 -0.05539405 -0.06907993  0.07920690
## bmi     1.03619023  1.50631405  1.54278433  0.89419395

m.5 <- lm(pcfat ~ age + gender, data = on4)
summary(m.5)
## 
## Call:
## lm(formula = pcfat ~ age + gender, data = on4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.6748  -3.2841   0.3097   3.3655  17.7216 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  30.119597   0.451098   66.77 <0.0000000000000002 ***
## age           0.093731   0.008566   10.94 <0.0000000000000002 ***
## genderM     -10.059716   0.325415  -30.91 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.118 on 1214 degrees of freedom
## Multiple R-squared:  0.4932, Adjusted R-squared:  0.4924 
## F-statistic: 590.8 on 2 and 1214 DF,  p-value: < 0.00000000000000022

###Kết thúc