R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

t = file.choose()
#"D:\\THU24\\LopAI-NCKH01.2026\\Obesity data .csv"
ob = read.csv(t)
ob = read.csv(t)
dim (ob)
## [1] 1217   11
head (ob)
##   id gender height weight  bmi age  bmc  bmd   fat  lean pcfat
## 1  1      F    150     49 21.8  53 1312 0.88 17802 28600  37.3
## 2  2      M    165     52 19.1  65 1309 0.84  8381 40229  16.8
## 3  3      F    157     57 23.1  64 1230 0.84 19221 36057  34.0
## 4  4      F    156     53 21.8  56 1171 0.80 17472 33094  33.8
## 5  5      M    160     51 19.9  54 1681 0.98  7336 40621  14.8
## 6  6      F    153     47 20.1  52 1358 0.91 14904 30068  32.2
library(lessR)
## 
## lessR 4.5                            feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")  Read data file, many formats available, e.g., Excel
##   d is the default data frame, data= in analysis routines optional
## 
## Many examples of reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, forecasting, and aggregation to pivot tables.
##   Enter: browseVignettes("lessR")
## 
## View lessR updates, now including modern time series forecasting
##   and many, new Plotly interactive visualizations output. Most
##   visualization functions are now reorganized to three functions:
##      Chart(): type="bar", "pie", "radar", "bubble", "treemap", "icicle"
##      X(): type="histogram", "density", "vbs" and more
##      XY(): type="scatter" for a scatterplot, or "contour", "smooth"
##    Most previous function calls still work, such as:
##      BarChart(), Histogram, and Plot().
##   Enter: news(package="lessR"), or ?Chart, ?X, or ?XY
## There is also Flows() for Sankey flow diagrams, see ?Flows
## 
## Interactive data analysis for constructing visualizations.
##   Enter: interact()
Histogram(pcfat,
          data = ob,
          fill = "blue",
          xlab = "Tỉ trọng mỡ (%)",
          ylab = "Số người",
          main = "Phân bố tỉ trọng mỡ")
## lessR visualizations are now unified over just three core functions:
##   - Chart() for pivot tables, such as bar charts. More info: ?Chart
##   - X() for a single variable x, such as histograms. More info: ?X
##   - XY() for scatterplots of two variables, x and y. More info: ?XY
## 
## Histogram() is deprecated, though still working for now.
## Please use X(..., type = "histogram") going forward.
## [Interactive plot from the Plotly R package (Sievert, 2020)]

## >>> Suggestions 
## bin_width: set the width of each bin 
## bin_start: set the start of the first bin 
## bin_end: set the end of the last bin 
## X(pcfat, type="density")  # smoothed curve + histogram 
## X(pcfat, type="vbs")  # Violin/Box/Scatterplot (VBS) plot 
## 
## --- pcfat --- 
##  
##        n   miss            mean              sd             min             mdn             max 
##      1217      0       31.604786        7.182862        9.200000       32.400000       48.400000 
##  
## 
##   
## --- Outliers ---     from the box plot: 10 
##  
## Small       Large 
## -----       ----- 
##   9.2            
##   9.7            
##   9.8            
##  10.3            
##  10.3            
##  10.7            
##  11.0            
##  11.4            
##  11.7            
##  11.9            
## 
## 
## Bin Width: 5 
## Number of Bins: 9 
##  
##      Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## ------------------------------------------------- 
##   5 > 10     7.5      3    0.00        3     0.00 
##  10 > 15    12.5     26    0.02       29     0.02 
##  15 > 20    17.5     61    0.05       90     0.07 
##  20 > 25    22.5    128    0.11      218     0.18 
##  25 > 30    27.5    244    0.20      462     0.38 
##  30 > 35    32.5    338    0.28      800     0.66 
##  35 > 40    37.5    294    0.24     1094     0.90 
##  40 > 45    42.5    107    0.09     1201     0.99 
##  45 > 50    47.5     16    0.01     1217     1.00 
## 
library(table1)
## 
## Attaching package: 'table1'
## The following object is masked from 'package:lessR':
## 
##     label
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ pcfat | gender, data = ob)
F
(N=862)
M
(N=355)
Overall
(N=1217)
pcfat
Mean (SD) 34.7 (5.19) 24.2 (5.76) 31.6 (7.18)
Median [Min, Max] 34.7 [14.6, 48.4] 24.6 [9.20, 39.0] 32.4 [9.20, 48.4]
ttest(pcfat ~ gender, data = ob)
## 
## Compare pcfat across gender with levels F and M 
## Grouping Variable:  gender
## Response Variable:  pcfat
## 
## 
## ------ Describe ------
## 
## pcfat for gender F:  n.miss = 0,  n = 862,  mean = 34.672,  sd = 5.187
## pcfat for gender M:  n.miss = 0,  n = 355,  mean = 24.156,  sd = 5.764
## 
## Mean Difference of pcfat:  10.516
## 
## Weighted Average Standard Deviation:   5.362 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of pcfat.
## Group F: Sample mean assumed normal because n > 30, so no test needed.
## Group M: Sample mean assumed normal because n > 30, so no test needed.
## 
## Null hypothesis is equal variances of pcfat, homogeneous.
## Variance Ratio test:  F = 33.223/26.909 = 1.235,  df = 354;861,  p-value = 0.016
## Levene's test, Brown-Forsythe:  t = -2.232,  df = 1215,  p-value = 0.026
## 
## 
## ------ Infer ------
## 
## --- Assume equal population variances of pcfat for each gender 
## 
## t-cutoff for 95% range of variation: tcut =  1.962 
## Standard Error of Mean Difference: SE =  0.338 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 31.101,  df = 1215,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.663
## 95% Confidence Interval for Mean Difference:  9.853 to 11.180
## 
## 
## --- Do not assume equal population variances of pcfat for each gender 
## 
## t-cutoff: tcut =  1.964 
## Standard Error of Mean Difference: SE =  0.353 
## 
## Hypothesis Test of 0 Mean Diff:  t = 29.768,  df = 602.015, p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.694
## 95% Confidence Interval for Mean Difference:  9.823 to 11.210
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of pcfat for each gender 
## 
## Standardized Mean Difference of pcfat, Cohen's d:  1.961
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for gender F: 1.475
## Density bandwidth for gender M: 1.867

model <- lm(pcfat ~ gender, data = ob)
summary(model)
## 
## Call:
## lm(formula = pcfat ~ gender, data = ob)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0724  -3.2724   0.1484   3.6276  14.8439 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.6724     0.1826   189.9   <2e-16 ***
## genderM     -10.5163     0.3381   -31.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.362 on 1215 degrees of freedom
## Multiple R-squared:  0.4432, Adjusted R-squared:  0.4428 
## F-statistic: 967.3 on 1 and 1215 DF,  p-value: < 2.2e-16
Regression(pcfat ~ gender, data = ob, graphics = TRUE)
## 
## >>>  gender is not numeric. Converted to indicator variables.

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## Regression(my_formula=pcfat ~ gender, data=ob, graphics=TRUE, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  ob 
##  
## Response Variable: pcfat 
## Predictor Variable: genderM 
##  
## Number of cases (rows) of data:  1217 
## Number of cases retained for analysis:  1217 
## 
## 
##   BASIC ANALYSIS 
## 
##               Estimate    Std Err  t-value  p-value    Lower 95%    Upper 95% 
## (Intercept)  34.672413   0.182622  189.859    0.000    34.314123    35.030703 
##     genderM -10.516344   0.338131  -31.101    0.000   -11.179729    -9.852959 
## 
## Standard deviation of pcfat: 7.182861 
##  
## Standard deviation of residuals:  5.361759 for df=1215 
## 95% range of residuals:  21.038669 = 2 * (1.962 * 5.361759) 
##  
## R-squared: 0.443    Adjusted R-squared: 0.443    PRESS R-squared: 0.441 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 967.297     df: 1 and 1215     p-value:  0.000 
## 
## -- Analysis of Variance 
##  
##                df        Sum Sq       Mean Sq     F-value   p-value 
## Model           1  27808.311497  27808.311497  967.297285     0.000 
## Residuals    1215  34929.384159     28.748464 
## pcfat        1216  62737.695656     51.593500 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
##           pcfat genderM 
##     pcfat  1.00   -0.67 
##   genderM -0.67    1.00 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [n_res_rows = 20, out of 1217 rows of data, or do n_res_rows="all"] 
## --------------------------------------------------------------------------- 
##         genderM     pcfat    fitted      resid    rstdnt    dffits    cooks 
##    210        1  9.200000 24.156069 -14.956069 -2.801192 -0.148882 0.011020 
##    509        1 39.000000 24.156069  14.843931  2.780055  0.147758 0.010860 
##    179        1 38.700000 24.156069  14.543931  2.723523  0.144754 0.010420 
##    518        1  9.700000 24.156069 -14.456069 -2.706970 -0.143874 0.010300 
##    200        1  9.800000 24.156069 -14.356069 -2.688132 -0.142873 0.010150 
##    563        1 38.300000 24.156069  14.143931  2.648179  0.140749 0.009860 
##    318        1 10.300000 24.156069 -13.856069 -2.593980 -0.137869 0.009460 
##    972        1 10.300000 24.156069 -13.856069 -2.593980 -0.137869 0.009460 
##    388        1 10.700000 24.156069 -13.456069 -2.518700 -0.133867 0.008920 
##    203        1 11.000000 24.156069 -13.156069 -2.462262 -0.130868 0.008530 
##   1137        0 14.600000 34.672413 -20.072413 -3.766065 -0.128347 0.008150 
##    893        0 14.700000 34.672413 -19.972413 -3.747085 -0.127700 0.008070 
##    688        1 11.400000 24.156069 -12.756069 -2.387042 -0.126870 0.008020 
##    403        1 11.700000 24.156069 -12.456069 -2.330649 -0.123873 0.007640 
##    858        1 11.900000 24.156069 -12.256069 -2.293064 -0.121875 0.007400 
##    158        1 36.300000 24.156069  12.143931  2.271993  0.120755 0.007270 
##   1106        1 36.300000 24.156069  12.143931  2.271993  0.120755 0.007270 
##    827        1 36.000000 24.156069  11.843931  2.215637  0.117760 0.006910 
##    756        1 12.400000 24.156069 -11.756069 -2.199135 -0.116883 0.006810 
##    196        1 12.500000 24.156069 -11.656069 -2.180355 -0.115885 0.006690 
## 
## 
##   PREDICTION ERROR 
## 
## -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals add n_pred_rows="all"] 
##  ---------------------------------------------- 
## 
##         genderM     pcfat      pred   s_pred    pi.lwr    pi.upr     width 
##      2        1 16.800000 24.156069 5.369306 13.621929 34.690209 21.068280 
##      5        1 14.800000 24.156069 5.369306 13.621929 34.690209 21.068280 
## ... 
##   1209        1 26.400000 24.156069 5.369306 13.621929 34.690209 21.068280 
##      1        0 37.300000 34.672413 5.364869 24.146979 45.197847 21.050869 
##      3        0 34.000000 34.672413 5.364869 24.146979 45.197847 21.050869 
## ... 
##   1215        0 34.400000 34.672413 5.364869 24.146979 45.197847 21.050869 
##   1216        0 41.300000 34.672413 5.364869 24.146979 45.197847 21.050869 
##   1217        0 33.200000 34.672413 5.364869 24.146979 45.197847 21.050869 
## 
## ---------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## ----------------------------------
Plot(pcfat, weight, data = ob, fit = "lm",)
## lessR visualizations are now unified over just three core functions:
##   - Chart() for pivot tables, such as bar charts. More info: ?Chart
##   - X() for a single variable x, such as histograms. More info: ?X
##   - XY() for scatterplots of two variables, x and y. More info: ?XY
## 
## Plot() is deprecated, though still working for now.
## Please use X(...) or XY(...) going forward.
## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(pcfat, weight, enhance=TRUE)  # many options
## XY(pcfat, weight, color="red")  # exterior edge color of points
## XY(pcfat, weight, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 1217 
## Sample Correlation of pcfat and weight: r = 0.057 
##   
## Hypothesis Test of 0 Correlation:  t = 1.975,  df = 1215,  p-value = 0.049 
## 95% Confidence Interval for Correlation:  0.000 to 0.112 
##   
## 
##   Line: b0 = 52.803    b1 = 0.074
##   Linear Model MSE = 88.2435   Rsq = 0.003
## 
     #main = "Mối liên hệ giữa cân nặng và tỉ trọng mỡ",
     #xlab = "Cân nặng (kg)",
     #ylab = "Tỉ trọng mỡ (%)")
model <- lm(pcfat ~ weight, data = ob)
par(mfrow = c(2, 2))
plot(model)

summary(model)
## 
## Call:
## lm(formula = pcfat ~ weight, data = ob)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3122  -4.5234   0.8902   5.2695  16.9742 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.22295    1.22370  23.881   <2e-16 ***
## weight       0.04319    0.02188   1.975   0.0485 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.174 on 1215 degrees of freedom
## Multiple R-squared:  0.003199,   Adjusted R-squared:  0.002378 
## F-statistic: 3.899 on 1 and 1215 DF,  p-value: 0.04855
Regression(pcfat ~ weight, data = ob, graphics = TRUE)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## Regression(my_formula=pcfat ~ weight, data=ob, graphics=TRUE, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  ob 
##  
## Response Variable: pcfat 
## Predictor Variable: weight 
##  
## Number of cases (rows) of data:  1217 
## Number of cases retained for analysis:  1217 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept) 29.222947   1.223696   23.881    0.000   26.822156   31.623738 
##      weight  0.043193   0.021875    1.975    0.049    0.000276    0.086111 
## 
## Standard deviation of pcfat: 7.182861 
##  
## Standard deviation of residuals:  7.174315 for df=1215 
## 95% range of residuals:  28.150843 = 2 * (1.962 * 7.174315) 
##  
## R-squared: 0.003    Adjusted R-squared: 0.002    PRESS R-squared: 0.000 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 3.899     df: 1 and 1215     p-value:  0.049 
## 
## -- Analysis of Variance 
##  
##                df        Sum Sq     Mean Sq   F-value   p-value 
## Model           1    200.669670  200.669670  3.898709     0.049 
## Residuals    1215  62537.025985   51.470803 
## pcfat        1216  62737.695656   51.593500 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
##          pcfat weight 
##    pcfat  1.00   0.06 
##   weight  0.06   1.00 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [n_res_rows = 20, out of 1217 rows of data, or do n_res_rows="all"] 
## ---------------------------------------------------------------------------- 
##           weight     pcfat    fitted      resid    rstdnt    dffits    cooks 
##    373        85 47.400000 32.894372  14.505628  2.033775  0.194997 0.018960 
##    923        95 43.500000 33.326305  10.173695  1.429871  0.179944 0.016180 
##    972        42 10.300000 31.037063 -20.737063 -2.902805 -0.143205 0.010190 
##    716        74 17.600000 32.419246 -14.819246 -2.072679 -0.133434 0.008880 
##    858        42 11.900000 31.037063 -19.137063 -2.677456 -0.132088 0.008680 
##    177        38 15.700000 30.864290 -15.164290 -2.120502 -0.126644 0.008000 
##   1071        70 48.100000 32.246474  15.853526  2.216504  0.118990 0.007060 
##    943        73 18.700000 32.376053 -13.676053 -1.911957 -0.117867 0.006930 
##    876        34 18.800000 30.691517 -11.891517 -1.662860 -0.117617 0.006910 
##    245        35 18.400000 30.734710 -12.334710 -1.724650 -0.117167 0.006850 
##    762        80 22.600000 32.678406 -10.078406 -1.409997 -0.114628 0.006560 
##     88        68 15.300000 32.160087 -16.860087 -2.357246 -0.114610 0.006540 
##    200        47  9.800000 31.253029 -21.453029 -3.002259 -0.113942 0.006450 
##    688        46 11.400000 31.209836 -19.809836 -2.771011 -0.110895 0.006120 
##    184        39 17.200000 30.907483 -13.707483 -1.915842 -0.109309 0.005960 
##    388        47 10.700000 31.253029 -20.553029 -2.875431 -0.109129 0.005920 
##    895        65 13.600000 32.030507 -18.430507 -2.577138 -0.107125 0.005710 
##    276        40 16.900000 30.950676 -14.050676 -1.963672 -0.106882 0.005700 
##    528        39 17.600000 30.907483 -13.307483 -1.859774 -0.106110 0.005620 
##    441        72 20.000000 32.332860 -12.332860 -1.723410 -0.101599 0.005150 
## 
## 
##   PREDICTION ERROR 
## 
## -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals add n_pred_rows="all"] 
##  ---------------------------------------------- 
## 
##           weight     pcfat      pred   s_pred    pi.lwr    pi.upr     width 
##    876        34 18.800000 30.691517 7.192151 16.581104 44.801929 28.220825 
##     55        35 27.100000 30.734710 7.190777 16.626993 44.842427 28.215435 
##    245        35 18.400000 30.734710 7.190777 16.626993 44.842427 28.215435 
## ... 
##   1211        54 34.100000 31.555382 7.177306 17.474093 45.636670 28.162577 
##     20        55 19.300000 31.598575 7.177263 17.517370 45.679779 28.162409 
##     27        55 37.200000 31.598575 7.177263 17.517370 45.679779 28.162409 
## ... 
##    402        93 32.300000 33.239918 7.224879 19.065295 47.414541 28.349246 
##    628        95 32.900000 33.326305 7.230024 19.141587 47.511022 28.369435 
##    891        95 30.100000 33.326305 7.230024 19.141587 47.511022 28.369435 
## 
## ---------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## ----------------------------------
model_adj <- lm(pcfat ~ weight + gender + age + height, data = ob)
par(mfrow = c(2, 2))
plot(model_adj)

summary(model_adj)
## 
## Call:
## lm(formula = pcfat ~ weight + gender + age + height, data = ob)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.208  -2.543   0.019   2.582  15.706 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48.368722   3.505431  13.798  < 2e-16 ***
## weight        0.439169   0.015594  28.163  < 2e-16 ***
## genderM     -11.483254   0.344343 -33.348  < 2e-16 ***
## age           0.056166   0.007404   7.585 6.58e-14 ***
## height       -0.257013   0.023768 -10.813  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.974 on 1212 degrees of freedom
## Multiple R-squared:  0.695,  Adjusted R-squared:  0.694 
## F-statistic: 690.4 on 4 and 1212 DF,  p-value: < 2.2e-16
Regression(pcfat ~ weight + gender + age + height, data = ob, graphics = TRUE)
## 
## >>>  gender is not numeric. Converted to indicator variables.

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## Regression(my_formula=pcfat ~ weight + gender + age + height, data=ob, graphics=TRUE, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  ob 
##  
## Response Variable: pcfat 
## Predictor Variable 1: weight 
## Predictor Variable 2: genderM 
## Predictor Variable 3: age 
## Predictor Variable 4: height 
##  
## Number of cases (rows) of data:  1217 
## Number of cases retained for analysis:  1217 
## 
## 
##   BASIC ANALYSIS 
## 
##               Estimate    Std Err  t-value  p-value    Lower 95%    Upper 95% 
## (Intercept)  48.368722   3.505431   13.798    0.000    41.491335    55.246110 
##      weight   0.439169   0.015594   28.163    0.000     0.408576     0.469763 
##     genderM -11.483254   0.344343  -33.348    0.000   -12.158828   -10.807679 
##         age   0.056166   0.007404    7.585    0.000     0.041639     0.070693 
##      height  -0.257013   0.023768  -10.813    0.000    -0.303644    -0.210382 
## 
## Standard deviation of pcfat: 7.182861 
##  
## Standard deviation of residuals:  3.973577 for df=1212 
## 95% range of residuals:  15.591705 = 2 * (1.962 * 3.973577) 
##  
## R-squared: 0.695    Adjusted R-squared: 0.694    PRESS R-squared: 0.692 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 690.357     df: 4 and 1212     p-value:  0.000 
## 
## -- Analysis of Variance 
##  
##                df        Sum Sq       Mean Sq      F-value   p-value 
##    weight       1    200.669670    200.669670    12.709209     0.000 
##   genderM       1  38576.550161  38576.550161  2443.206529     0.000 
##       age       1   2977.577719   2977.577719   188.581853     0.000 
##    height       1   1846.251961   1846.251961   116.930488     0.000 
##  
## Model           4  43601.049512  10900.262378   690.357020     0.000 
## Residuals    1212  19136.646144     15.789312 
## pcfat        1216  62737.695656     51.593500 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
##           pcfat weight genderM   age height 
##     pcfat  1.00   0.06   -0.67  0.31  -0.48 
##    weight  0.06   1.00    0.47 -0.05   0.60 
##   genderM -0.67   0.47    1.00 -0.13   0.67 
##       age  0.31  -0.05   -0.13  1.00  -0.37 
##    height -0.48   0.60    0.67 -0.37   1.00 
## 
##           Tolerance       VIF 
##    weight     0.604     1.656 
##   genderM     0.530     1.888 
##       age     0.794     1.260 
##    height     0.361     2.769 
## 
##  weight genderM age height    R2adj    X's 
##       1       1   1      1    0.694      4 
##       1       1   0      1    0.680      3 
##       1       1   1      0    0.665      3 
##       1       1   0      0    0.617      2 
##       0       1   1      1    0.494      3 
##       0       1   1      0    0.492      2 
##       0       1   0      1    0.444      2 
##       0       1   0      0    0.443      1 
##       1       0   1      1    0.414      3 
##       1       0   0      1    0.413      2 
##       0       0   1      1    0.248      2 
##       0       0   0      1    0.230      1 
##       1       0   1      0    0.098      2 
##       0       0   1      0    0.094      1 
##       1       0   0      0    0.002      1 
##  
## [based on Thomas Lumley's leaps function from the leaps package] 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [n_res_rows = 20, out of 1217 rows of data, or do n_res_rows="all"] 
## ---------------------------------------------------------------------------------------------------------- 
##           weight  genderM       age     height     pcfat    fitted      resid    rstdnt    dffits    cooks 
##    563        50        1        48        150 38.300000 22.987969  15.312031  3.892904  0.365059 0.026350 
##    923        95        0        57        160 43.500000 52.169213  -8.669213 -2.212032 -0.347600 0.024090 
##   1000        51        1        43        148 35.200000 23.660335  11.539665  2.929329  0.308975 0.018970 
##    509        67        1        13        167 39.000000 24.118827  14.881173  3.777390  0.300716 0.017890 
##   1106        49        1        67        150 36.300000 23.615951  12.684049  3.218255  0.299627 0.017820 
##    562        85        0        21        167 34.800000 43.956458  -9.156458 -2.327441 -0.298503 0.017760 
##    377        76        1        40        148 26.900000 34.471075  -7.571075 -1.929208 -0.291924 0.017010 
##    893        53        0        18        163 14.700000 30.762588 -16.062588 -4.078127 -0.283131 0.015830 
##    245        35        0        80        145 18.400000 30.966052 -12.566052 -3.186550 -0.279949 0.015560 
##     49        45        1        77        156 32.100000 20.878854  11.221146  2.845738  0.278598 0.015430 
##    876        34        0        76        141 18.800000 31.330271 -12.530271 -3.177361 -0.278691 0.015420 
##   1008        54        0        82        165 25.500000 34.282346  -8.782346 -2.228558 -0.257750 0.013240 
##    316        62        1        19        176 32.100000 19.946858  12.153142  3.079313  0.250519 0.012460 
##   1137        50        0        55        158 14.600000 32.808281 -18.208281 -4.627167 -0.243679 0.011680 
##    269        52        1        36        156 34.000000 21.650241  12.349759  3.128488  0.241383 0.011570 
##     16        70        1        49        150 24.300000 31.827525  -7.527525 -1.909655 -0.225639 0.010160 
##    179        75        1        23        168 38.700000 27.936828  10.763172  2.724895  0.222502 0.009850 
##    891        95        1        41        172 30.100000 36.703152  -6.603152 -1.676741 -0.215937 0.009310 
##    135        52        1        72        160 33.200000 22.644159  10.555841  2.671843  0.215079 0.009210 
##    264        55        1        78        154 35.600000 25.840740   9.759260  2.470316  0.212774 0.009020 
## 
## 
##   PREDICTION ERROR 
## 
## -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals add n_pred_rows="all"] 
##  ---------------------------------------------- 
## 
##          weight  genderM       age     height     pcfat      pred   s_pred    pi.lwr    pi.upr     width 
##   177        38        1        55        162 15.700000 15.026941 3.996117  7.186868 22.867015 15.680148 
##   186        52        1        22        175 15.400000 15.980674 3.991308  8.150033 23.811315 15.661281 
##   331        45        1        50        169 17.500000 16.021208 3.993505  8.186259 23.856158 15.669899 
## ... 
##   734        55        0        47        158 36.900000 34.554801 3.977017 26.752200 42.357403 15.605203 
##   348        52        0        48        153 33.600000 34.578523 3.975890 26.778132 42.378915 15.600783 
##   164        50        0        50        150 31.900000 34.583554 3.976421 26.782122 42.384987 15.602865 
## ... 
##   373        85        0        61        153 47.400000 49.801272 4.007267 41.939322 57.663223 15.723901 
##   923        95        0        57        160 43.500000 52.169213 4.021169 44.279988 60.058439 15.778452 
## 
## ---------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## ----------------------------------
library(gapminder)
vn = subset(gapminder, country=="Vietnam")
library(ggplot2)
ggplot(data=vn, aes(x=year, y=gdpPercap)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

vn$knot = ifelse(vn$year>1992, 1, 0) 
vn$diff = vn$year - 1992
vn$int = vn$diff * vn$knot
m = lm(gdpPercap ~ year + int, data=vn)
summary(m)
## 
## Call:
## lm(formula = gdpPercap ~ year + int, data = vn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.291  -54.225   -5.144   41.823  133.329 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11328.061   3963.982  -2.858   0.0188 *  
## year             6.116      2.009   3.044   0.0139 *  
## int             95.389      7.244  13.168 3.48e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.95 on 9 degrees of freedom
## Multiple R-squared:  0.9829, Adjusted R-squared:  0.9791 
## F-statistic: 259.2 on 2 and 9 DF,  p-value: 1.107e-08
data("mtcars")
dim(mtcars)
## [1] 32 11
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
ggplot(data=mtcars, aes(x=hp, y=mpg)) + geom_point() 

m1 =  lm(mpg ~ hp, data = mtcars)
summary(m1)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
library(ggfortify)
autoplot(m1)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

mtcars$hp2 = mtcars$hp^2
m2 =  lm(mpg ~ hp + hp2, data = mtcars)
summary(m2)
## 
## Call:
## lm(formula = mpg ~ hp + hp2, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5512 -1.6027 -0.6977  1.5509  8.7213 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.041e+01  2.741e+00  14.744 5.23e-15 ***
## hp          -2.133e-01  3.488e-02  -6.115 1.16e-06 ***
## hp2          4.208e-04  9.844e-05   4.275 0.000189 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.077 on 29 degrees of freedom
## Multiple R-squared:  0.7561, Adjusted R-squared:  0.7393 
## F-statistic: 44.95 on 2 and 29 DF,  p-value: 1.301e-09
mtcars$hp3 = mtcars$hp^3 
m3 =  lm(mpg ~ hp + hp2 + hp3, data = mtcars)
summary(m3)
## 
## Call:
## lm(formula = mpg ~ hp + hp2 + hp3, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8605 -1.3972 -0.5736  1.6461  9.0738 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.422e+01  5.961e+00   7.419 4.43e-08 ***
## hp          -2.945e-01  1.178e-01  -2.500   0.0185 *  
## hp2          9.115e-04  6.863e-04   1.328   0.1949    
## hp3         -8.701e-07  1.204e-06  -0.722   0.4760    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.103 on 28 degrees of freedom
## Multiple R-squared:  0.7606, Adjusted R-squared:  0.7349 
## F-statistic: 29.65 on 3 and 28 DF,  p-value: 7.769e-09
ggplot(data=mtcars, aes(x=hp, y=mpg)) + geom_point() + 
geom_smooth(method="lm", formula=y~x+I(x^2))

library(BMA)
## Loading required package: survival
## Loading required package: leaps
## Loading required package: robustbase
## 
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
## 
##     heart
## Loading required package: inline
## Loading required package: rrcov
## Scalable Robust Estimators with High Breakdown Point (version 1.7-7)
X <- ob[, c("gender", "height", "weight", "bmi", "age")]
Y <- ob$pcfat
bma_model <- bicreg(x = X, y = Y)
summary(bma_model)
## 
## Call:
## bicreg(x = X, y = Y)
## 
## 
##   3  models were selected
##  Best  3  models (cumulative posterior probability =  1 ): 
## 
##            p!=0    EV        SD        model 1     model 2     model 3   
## Intercept  100.0    5.26146  4.582901   7.958e+00  -7.928e-01   8.137e+00
## genderM    100.0  -11.25139  0.429659  -1.144e+01  -1.143e+01  -1.081e+01
## height      31.4    0.01759  0.028494       .       5.598e-02       .    
## weight      39.2    0.03102  0.042611   7.921e-02       .           .    
## bmi        100.0    1.01265  0.111625   8.942e-01   1.089e+00   1.089e+00
## age        100.0    0.05259  0.008048   5.497e-02   5.473e-02   4.715e-02
##                                                                          
## nVar                                      4           4           3      
## r2                                      0.697       0.696       0.695    
## BIC                                    -1.423e+03  -1.423e+03  -1.422e+03
## post prob                               0.392       0.314       0.294
m.bma = lm(pcfat ~ gender + age + weight + bmi, data = ob)
par(mfrow = c(2, 2))
plot(m.bma)

summary(m.bma)
## 
## Call:
## lm(formula = pcfat ~ gender + age + weight + bmi, data = ob)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.225  -2.557   0.033   2.608  15.646 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.957732   0.852500   9.335  < 2e-16 ***
## genderM     -11.444303   0.342565 -33.408  < 2e-16 ***
## age           0.054966   0.007395   7.433 1.99e-13 ***
## weight        0.079207   0.028620   2.768  0.00573 ** 
## bmi           0.894194   0.080297  11.136  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.963 on 1212 degrees of freedom
## Multiple R-squared:  0.6966, Adjusted R-squared:  0.6956 
## F-statistic: 695.7 on 4 and 1212 DF,  p-value: < 2.2e-16
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following objects are masked from 'package:lessR':
## 
##     recode, sp
vif(m.bma)
##   gender      age   weight      bmi 
## 1.878778 1.263468 5.609651 4.663425