Việc 1. Đọc dữ liệu vào R

df = read.csv("C:/Users/leduc/Downloads/Bone data.csv")

Việc 2. Đánh giá mối liên quan giữa cân nặng và mật độ xương cổ xương đùi

2.1 Vẽ biểu đồ đánh giá mối liên quan giữa cân nặng và mật độ xương cổ xương đùi

library(lessR)

## 
## lessR 4.4.3                         feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")  Read data file, many formats available, e.g., Excel
##   d is default data frame, data= in analysis routines optional
## 
## Many examples of reading, writing, and manipulating data, 
## graphics, testing means and proportions, regression, factor analysis,
## customization, forecasting, and aggregation from pivot tables
##   Enter: browseVignettes("lessR")
## 
## View lessR updates, now including time series forecasting
##   Enter: news(package="lessR")
## 
## Interactive data analysis
##   Enter: interact()

## 
## Attaching package: 'lessR'

## The following object is masked from 'package:base':
## 
##     sort_by

Plot(weight, fnbmd, fit = "loess", data = df)

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(weight, fnbmd, enhance=TRUE)  # many options
## Plot(weight, fnbmd, fill="skyblue")  # interior fill color of points
## Plot(weight, fnbmd, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
##    Loess Model MSE = 0.0156
##

2.2 Phân tích đánh giá tương quan giữa cân nặng và mật độ xương đùi

Correlation(weight, fnbmd, data = df)

## Correlation Analysis for Variables weight and fnbmd 
##   
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 2121 
## Number of cases (rows of data) deleted: 41 
## 
## Sample Covariance: s = 1.269 
##  
## Sample Correlation: r = 0.581 
## 
## Hypothesis Test of 0 Correlation:  t = 32.882,  df = 2119,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.552 to 0.609

Việc 3. Đọc dữ liệu vào R

ob = read.csv("C:/Users/leduc/Downloads/Obesity data.csv")

Việc 4. So sánh tỉ trọng mỡ giữa nam và nữ

4.1 Đánh giá phân bố của tỉ trọng mỡ

library(lessR)
Histogram(pcfat, fill = "blue", xlab = "Percentage of fat (%)", ylab = "Frequency",  data = ob)

## >>> Suggestions 
## bin_width: set the width of each bin 
## bin_start: set the start of the first bin 
## bin_end: set the end of the last bin 
## Histogram(pcfat, density=TRUE)  # smoothed curve + histogram 
## Plot(pcfat)  # Violin/Box/Scatterplot (VBS) plot 
## 
## --- pcfat --- 
##  
##        n   miss            mean              sd             min             mdn             max 
##      1217      0       31.604786        7.182862        9.200000       32.400000       48.400000 
##  
## 
##   
## --- Outliers ---     from the box plot: 10 
##  
## Small       Large 
## -----       ----- 
##   9.2            
##   9.7            
##   9.8            
##  10.3            
##  10.3            
##  10.7            
##  11.0            
##  11.4            
##  11.7            
##  11.9            
## 
## 
## Bin Width: 5 
## Number of Bins: 9 
##  
##      Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## ------------------------------------------------- 
##   5 > 10     7.5      3    0.00        3     0.00 
##  10 > 15    12.5     26    0.02       29     0.02 
##  15 > 20    17.5     61    0.05       90     0.07 
##  20 > 25    22.5    128    0.11      218     0.18 
##  25 > 30    27.5    244    0.20      462     0.38 
##  30 > 35    32.5    338    0.28      800     0.66 
##  35 > 40    37.5    294    0.24     1094     0.90 
##  40 > 45    42.5    107    0.09     1201     0.99 
##  45 > 50    47.5     16    0.01     1217     1.00 
##

4.2 Sử dụng kiểm định t

ttest(pcfat ~ gender, data = ob)

## 
## Compare pcfat across gender with levels F and M 
## Grouping Variable:  gender
## Response Variable:  pcfat
## 
## 
## ------ Describe ------
## 
## pcfat for gender F:  n.miss = 0,  n = 862,  mean = 34.672,  sd = 5.187
## pcfat for gender M:  n.miss = 0,  n = 355,  mean = 24.156,  sd = 5.764
## 
## Mean Difference of pcfat:  10.516
## 
## Weighted Average Standard Deviation:   5.362 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of pcfat.
## Group F: Sample mean assumed normal because n > 30, so no test needed.
## Group M: Sample mean assumed normal because n > 30, so no test needed.
## 
## Null hypothesis is equal variances of pcfat, homogeneous.
## Variance Ratio test:  F = 33.223/26.909 = 1.235,  df = 354;861,  p-value = 0.016
## Levene's test, Brown-Forsythe:  t = -2.232,  df = 1215,  p-value = 0.026
## 
## 
## ------ Infer ------
## 
## --- Assume equal population variances of pcfat for each gender 
## 
## t-cutoff for 95% range of variation: tcut =  1.962 
## Standard Error of Mean Difference: SE =  0.338 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 31.101,  df = 1215,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.663
## 95% Confidence Interval for Mean Difference:  9.853 to 11.180
## 
## 
## --- Do not assume equal population variances of pcfat for each gender 
## 
## t-cutoff: tcut =  1.964 
## Standard Error of Mean Difference: SE =  0.353 
## 
## Hypothesis Test of 0 Mean Diff:  t = 29.768,  df = 602.015, p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.694
## 95% Confidence Interval for Mean Difference:  9.823 to 11.210
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of pcfat for each gender 
## 
## Standardized Mean Difference of pcfat, Cohen's d:  1.961
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for gender F: 1.475
## Density bandwidth for gender M: 1.867

## 4.3 Sử dụng mô hình hồi qui tuyến tính

m.1 = lm(pcfat ~ gender, data = ob)
summary(m.1)

## 
## Call:
## lm(formula = pcfat ~ gender, data = ob)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0724  -3.2724   0.1484   3.6276  14.8439 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  34.6724     0.1826   189.9 <0.0000000000000002 ***
## genderM     -10.5163     0.3381   -31.1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.362 on 1215 degrees of freedom
## Multiple R-squared:  0.4432, Adjusted R-squared:  0.4428 
## F-statistic: 967.3 on 1 and 1215 DF,  p-value: < 0.00000000000000022

library(lessR)
m.2 = reg(pcfat ~ gender, data = ob)

## 
## >>>  gender is not numeric. Converted to indicator variables.

par(mfrow = c(2, 2))
plot(m.1)

library(ggfortify)

## Loading required package: ggplot2

autoplot(m.1)

# Việc 5. Đánh giá mối liên quan giữa cân nặng và tỉ trọng mỡ ## 5.1 Vẽ biểu đồ tán xạ

Plot(weight, pcfat, fit = "lm", data = ob)

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(weight, pcfat, enhance=TRUE)  # many options
## Plot(weight, pcfat, color="red")  # exterior edge color of points
## Plot(weight, pcfat, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 1217 
## Sample Correlation of weight and pcfat: r = 0.057 
##   
## Hypothesis Test of 0 Correlation:  t = 1.975,  df = 1215,  p-value = 0.049 
## 95% Confidence Interval for Correlation:  0.000 to 0.112 
##   
## 
##   Line: b0 = 29.222947    b1 = 0.043193    Linear Model MSE = 51.470803   Rsq = 0.003
##

5.2 Sử dụng mô hình hồi qui tuyến tính

m.3 = lm(pcfat ~ weight, data = ob)
summary(m.3)

## 
## Call:
## lm(formula = pcfat ~ weight, data = ob)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3122  -4.5234   0.8902   5.2695  16.9742 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 29.22295    1.22370  23.881 <0.0000000000000002 ***
## weight       0.04319    0.02188   1.975              0.0485 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.174 on 1215 degrees of freedom
## Multiple R-squared:  0.003199,   Adjusted R-squared:  0.002378 
## F-statistic: 3.899 on 1 and 1215 DF,  p-value: 0.04855

library(ggfortify)
autoplot(m.3)

# Việc 6. Đánh giá mối liên quan độc lập giữa cân nặng và tỉ trọng mỡ ## 6.1 Xây dựng và kiểm định giả định mô hình

m.4 = lm(pcfat ~ weight + age + gender + height, data = ob)
summary(m.4)

## 
## Call:
## lm(formula = pcfat ~ weight + age + gender + height, data = ob)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.208  -2.543   0.019   2.582  15.706 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  48.368722   3.505431  13.798 < 0.0000000000000002 ***
## weight        0.439169   0.015594  28.163 < 0.0000000000000002 ***
## age           0.056166   0.007404   7.585   0.0000000000000658 ***
## genderM     -11.483254   0.344343 -33.348 < 0.0000000000000002 ***
## height       -0.257013   0.023768 -10.813 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.974 on 1212 degrees of freedom
## Multiple R-squared:  0.695,  Adjusted R-squared:  0.694 
## F-statistic: 690.4 on 4 and 1212 DF,  p-value: < 0.00000000000000022

autoplot(m.4)

R Exercise Day 4

Khoa

2025-05-14