R Markdown

Here is my attempt to create an raport for additional assigmnet about estimating the multiple regression equation for the percentage of body fat.

Data comes from mplot package (bodyfat).

Including Plots

Firstly, here is a summary table of this data set. The most important for this report is part about the body fat. From the table we can see that that on avarage people form this sample have 19.19% of fat in their bodies, the lowest observed value is 3% and the highest is 34.8%.

summary(bodyfat)
##        Id            Bodyfat           Age            Weight      
##  Min.   :  2.00   Min.   : 3.00   Min.   :22.00   Min.   : 56.81  
##  1st Qu.: 69.75   1st Qu.:12.97   1st Qu.:35.00   1st Qu.: 72.29  
##  Median :136.00   Median :19.50   Median :43.50   Median : 80.06  
##  Mean   :130.32   Mean   :19.19   Mean   :44.67   Mean   : 81.18  
##  3rd Qu.:181.50   3rd Qu.:25.32   3rd Qu.:52.00   3rd Qu.: 89.50  
##  Max.   :252.00   Max.   :34.80   Max.   :81.00   Max.   :119.18  
##      Height           Neck           Chest             Abdo       
##  Min.   :65.75   Min.   :31.50   Min.   : 85.10   Min.   : 72.80  
##  1st Qu.:68.25   1st Qu.:36.67   1st Qu.: 94.35   1st Qu.: 86.08  
##  Median :70.00   Median :37.95   Median : 99.25   Median : 90.90  
##  Mean   :70.33   Mean   :38.02   Mean   :100.43   Mean   : 92.53  
##  3rd Qu.:72.06   3rd Qu.:39.65   3rd Qu.:105.38   3rd Qu.: 99.33  
##  Max.   :75.25   Max.   :43.20   Max.   :128.30   Max.   :126.20  
##       Hip            Thigh            Knee           Ankle      
##  Min.   : 88.2   Min.   :49.30   Min.   :33.70   Min.   :20.40  
##  1st Qu.: 95.8   1st Qu.:56.88   1st Qu.:37.27   1st Qu.:22.10  
##  Median : 99.3   Median :59.25   Median :38.60   Median :22.85  
##  Mean   :100.1   Mean   :59.86   Mean   :38.72   Mean   :23.07  
##  3rd Qu.:103.5   3rd Qu.:63.27   3rd Qu.:40.00   3rd Qu.:24.00  
##  Max.   :125.6   Max.   :72.90   Max.   :46.00   Max.   :26.60  
##       Bic             Fore           Wrist      
##  Min.   :25.60   Min.   :21.00   Min.   :16.10  
##  1st Qu.:30.27   1st Qu.:27.50   1st Qu.:17.68  
##  Median :32.10   Median :28.70   Median :18.20  
##  Mean   :32.19   Mean   :28.75   Mean   :18.24  
##  3rd Qu.:34.08   3rd Qu.:30.00   3rd Qu.:18.80  
##  Max.   :39.10   Max.   :34.90   Max.   :21.40

Here is a histogram showing the fregency of the distribution of values of the percentage of the bodyfat. We can see that there is very small number of really low body fat percentage. The histogram is quite unregular.

ggplot(bodyfat, aes(x=Bodyfat)) + 
 geom_histogram(colour="black", fill="white")+
  geom_vline(aes(xintercept=mean(Bodyfat)), color="red", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From this scatter plot we can notice that the body fat percentage isn’t dependednt on the age. The plot is non linear, we can also assume that the correlation coeeficient beteen these two variables is low.

ggplot(bodyfat, aes(x=Age, y=Bodyfat)) +
  geom_point(size=2, shape=1)+
  geom_smooth(method="lm", se=FALSE, fullrange=FALSE)
## `geom_smooth()` using formula 'y ~ x'

In this plot, there is visible that dependency between neck perimeter and bodyfat is still quite low, but slightly bigger than between pervious pair.

ggplot(bodyfat, aes(x=Neck, y=Bodyfat)) +
  geom_point(size=2, shape=1)+
  geom_smooth(method="lm", se=FALSE, fullrange=FALSE)
## `geom_smooth()` using formula 'y ~ x'

As the opposite for the previous two graphs the plot of percenrage of body fat and perimeter of abdominal area can be called linear (the correlation coefficient of these variables is high, correlation is positive). That means that the percentage of the body fat is strongly related to this measuremnt.

## `geom_smooth()` using formula 'y ~ x'

Similatly hot the hip area. There is a little bit smaller correlation than with the abdominal, but still visibly higher than for the neck and age.

ggplot(bodyfat, aes(x=Hip, y=Bodyfat)) +
  geom_point(size=2, shape=1)+
  geom_smooth(method="lm", se=FALSE, fullrange=FALSE)
## `geom_smooth()` using formula 'y ~ x'

With this simple linear regression model we can say that the least squares regression line for the linear model is y=-15.85+0.43x.

mod0<-lm(Bodyfat~Weight,data=bodyfat)
summary(mod0)
## 
## Call:
## lm(formula = Bodyfat ~ Weight, data = bodyfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2072  -4.0904   0.0681   4.3035  15.1893 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.84630    3.57562  -4.432 2.01e-05 ***
## Weight        0.43157    0.04358   9.902  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.844 on 126 degrees of freedom
## Multiple R-squared:  0.4376, Adjusted R-squared:  0.4332 
## F-statistic: 98.05 on 1 and 126 DF,  p-value: < 2.2e-16

Here is an summary of the full linear regression model.

full.mod<-lm(Bodyfat~.,data=subset(bodyfat,select=-Id))
full.mod1<-glm(Bodyfat~.,data=subset(bodyfat,select=-Id))
summary(full.mod)
## 
## Call:
## lm(formula = Bodyfat ~ ., data = subset(bodyfat, select = -Id))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3767 -2.5514 -0.1723  2.6391  9.1393 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -52.553646  40.062856  -1.312   0.1922    
## Age           0.009288   0.043470   0.214   0.8312    
## Weight       -0.271016   0.243569  -1.113   0.2682    
## Height        0.258388   0.320810   0.805   0.4223    
## Neck         -0.592669   0.322125  -1.840   0.0684 .  
## Chest         0.090883   0.164738   0.552   0.5822    
## Abdo          0.995184   0.123072   8.086 7.29e-13 ***
## Hip          -0.141981   0.204533  -0.694   0.4890    
## Thigh         0.101272   0.200714   0.505   0.6148    
## Knee         -0.096682   0.325889  -0.297   0.7673    
## Ankle        -0.048017   0.507695  -0.095   0.9248    
## Bic           0.075332   0.244105   0.309   0.7582    
## Fore          0.412107   0.272144   1.514   0.1327    
## Wrist        -0.263067   0.745145  -0.353   0.7247    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.081 on 114 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7236 
## F-statistic: 26.57 on 13 and 114 DF,  p-value: < 2.2e-16
glance(full.mod)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.752         0.724  4.08      26.6 1.45e-28    13  -354.  738.  781.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(full.mod)
## # A tibble: 14 x 5
##    term         estimate std.error statistic  p.value
##    <chr>           <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept) -52.6       40.1      -1.31   1.92e- 1
##  2 Age           0.00929    0.0435    0.214  8.31e- 1
##  3 Weight       -0.271      0.244    -1.11   2.68e- 1
##  4 Height        0.258      0.321     0.805  4.22e- 1
##  5 Neck         -0.593      0.322    -1.84   6.84e- 2
##  6 Chest         0.0909     0.165     0.552  5.82e- 1
##  7 Abdo          0.995      0.123     8.09   7.29e-13
##  8 Hip          -0.142      0.205    -0.694  4.89e- 1
##  9 Thigh         0.101      0.201     0.505  6.15e- 1
## 10 Knee         -0.0967     0.326    -0.297  7.67e- 1
## 11 Ankle        -0.0480     0.508    -0.0946 9.25e- 1
## 12 Bic           0.0753     0.244     0.309  7.58e- 1
## 13 Fore          0.412      0.272     1.51   1.33e- 1
## 14 Wrist        -0.263      0.745    -0.353  7.25e- 1

From this plot it is visible that the neatly normal residuals condition is fullfield. The plot is almost a straight line, that means that it’s normal.

plot(full.mod, 2)+
  geom_smooth(method="lm", se=FALSE, fullrange=FALSE)

## NULL

The red line on the plot is almost a straight line, which means that the data is homoscedastic (has constance variance)

plot(full.mod, 3)

From the summary table of gvlma function, it is checked na dreclaimed that the model is reliable. The function for variance inflation factor (VIF) shows that the correlaiton between vairables in bodyfat data set is really low

full.mod2<-gvlma(full.mod)
summary(full.mod2, 5) 
## 
## Call:
## lm(formula = Bodyfat ~ ., data = subset(bodyfat, select = -Id))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3767 -2.5514 -0.1723  2.6391  9.1393 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -52.553646  40.062856  -1.312   0.1922    
## Age           0.009288   0.043470   0.214   0.8312    
## Weight       -0.271016   0.243569  -1.113   0.2682    
## Height        0.258388   0.320810   0.805   0.4223    
## Neck         -0.592669   0.322125  -1.840   0.0684 .  
## Chest         0.090883   0.164738   0.552   0.5822    
## Abdo          0.995184   0.123072   8.086 7.29e-13 ***
## Hip          -0.141981   0.204533  -0.694   0.4890    
## Thigh         0.101272   0.200714   0.505   0.6148    
## Knee         -0.096682   0.325889  -0.297   0.7673    
## Ankle        -0.048017   0.507695  -0.095   0.9248    
## Bic           0.075332   0.244105   0.309   0.7582    
## Fore          0.412107   0.272144   1.514   0.1327    
## Wrist        -0.263067   0.745145  -0.353   0.7247    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.081 on 114 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7236 
## F-statistic: 26.57 on 13 and 114 DF,  p-value: < 2.2e-16
## 
## Correlation of Coefficients:
##        (Intercept) Age   Weight Height Neck  Chest Abdo  Hip   Thigh Knee 
## Age     0.11                                                              
## Weight  0.95        0.19                                                  
## Height -0.86       -0.01 -0.76                                            
## Neck   -0.23       -0.05 -0.24   0.09                                     
## Chest  -0.69       -0.13 -0.68   0.55   0.05                              
## Abdo   -0.30       -0.36 -0.39   0.34   0.02 -0.17                        
## Hip    -0.48        0.04 -0.49   0.20   0.21  0.28 -0.14                  
## Thigh  -0.21        0.33 -0.18   0.32  -0.09  0.23 -0.04 -0.28            
## Knee   -0.06       -0.34 -0.14  -0.19   0.05  0.08  0.02  0.05 -0.27      
## Ankle  -0.29        0.10 -0.29   0.21   0.18  0.18  0.17  0.05 -0.07 -0.29
## Bic    -0.11        0.01 -0.16   0.16  -0.04 -0.02  0.15  0.00 -0.29 -0.03
## Fore   -0.21        0.02 -0.21   0.11  -0.16  0.06  0.16  0.18  0.06  0.09
## Wrist  -0.22       -0.37 -0.16   0.18  -0.41  0.11  0.15 -0.09  0.13  0.03
##        Ankle Bic   Fore 
## Age                     
## Weight                  
## Height                  
## Neck                    
## Chest                   
## Abdo                    
## Hip                     
## Thigh                   
## Knee                    
## Ankle                   
## Bic     0.22            
## Fore   -0.12 -0.34      
## Wrist  -0.29 -0.05  0.00
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = full.mod) 
## 
##                     Value p-value                Decision
## Global Stat        2.4851  0.6473 Assumptions acceptable.
## Skewness           0.1753  0.6754 Assumptions acceptable.
## Kurtosis           0.2964  0.5862 Assumptions acceptable.
## Link Function      1.5207  0.2175 Assumptions acceptable.
## Heteroscedasticity 0.4927  0.4827 Assumptions acceptable.
vif_val<-vif(full.mod2)
vif_val
##       Age    Weight    Height      Neck     Chest      Abdo       Hip     Thigh 
##  2.256602 64.038005  4.323426  3.849251 13.233597 10.759846 12.318911  7.432930 
##      Knee     Ankle       Bic      Fore     Wrist 
##  4.506660  3.382574  4.086372  2.344915  3.662724

From plot based on the regdubsets function we can say that variable age and ankle are least usefull for predicitng the percentage of the body fat. So next the model will be created with omitting those 2.

rsub<-regsubsets(Bodyfat~.,data=subset(bodyfat,select=-Id), nbest = 1, nvmax = 13)
rsub
## Subset selection object
## Call: regsubsets.formula(Bodyfat ~ ., data = subset(bodyfat, select = -Id), 
##     nbest = 1, nvmax = 13)
## 13 Variables  (and intercept)
##        Forced in Forced out
## Age        FALSE      FALSE
## Weight     FALSE      FALSE
## Height     FALSE      FALSE
## Neck       FALSE      FALSE
## Chest      FALSE      FALSE
## Abdo       FALSE      FALSE
## Hip        FALSE      FALSE
## Thigh      FALSE      FALSE
## Knee       FALSE      FALSE
## Ankle      FALSE      FALSE
## Bic        FALSE      FALSE
## Fore       FALSE      FALSE
## Wrist      FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
plot(rsub)

From this plots we can say that all the necessary assumptions for the model to be reliable (linearity, normaloty, constance viarance) are met.

full.mod3<-lm(Bodyfat~.-Age -Ankle,data=subset(bodyfat,select=-Id))
summary(full.mod3)
## 
## Call:
## lm(formula = Bodyfat ~ . - Age - Ankle, data = subset(bodyfat, 
##     select = -Id))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.319 -2.606 -0.224  2.528  9.091 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -54.92632   37.60236  -1.461   0.1468    
## Weight       -0.28973    0.22509  -1.287   0.2006    
## Height        0.26752    0.31048   0.862   0.3907    
## Neck         -0.58217    0.31323  -1.859   0.0656 .  
## Chest         0.09926    0.15885   0.625   0.5333    
## Abdo          1.00757    0.11113   9.066 3.64e-15 ***
## Hip          -0.14266    0.20237  -0.705   0.4822    
## Thigh         0.08476    0.18681   0.454   0.6509    
## Knee         -0.08327    0.29236  -0.285   0.7763    
## Bic           0.08095    0.23620   0.343   0.7324    
## Fore          0.40736    0.26795   1.520   0.1312    
## Wrist        -0.22674    0.66017  -0.343   0.7319    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.047 on 116 degrees of freedom
## Multiple R-squared:  0.7517, Adjusted R-squared:  0.7282 
## F-statistic: 31.93 on 11 and 116 DF,  p-value: < 2.2e-16
plot(full.mod3)

Here is an example data of 3 persons from the dataset, to check if the predictions based on the model are in agreement with the real data. The prediction are showed with the lower and upper boundries of interval of confidence.

new.person <- data.frame(
  Age = c(39, 54, 46),
  Weight=c(56.81, 70.42, 80.29),
  Height=c(68, 69.25, 70.00),
  Neck=c(31.5,37.5, 37.2),
  Chest=c(85.1, 89.3, 99.7),
  Abdo=c(76.0,78.4, 95.6),
  Hip=c(88.2, 96.1, 102.2),
  Thigh=c(50.0, 56.0, 58.3),
  Knee=c(34.7, 37.4, 38.2),
  Ankle=c(21.0,22.4, 22.5),
  Bic=c(26.1, 32.6, 29.1),
  Fore=c(23.1, 28.1, 27.7),
  Wrist=c(16.1, 18.1, 17.7)
)
predict(full.mod3, newdata = new.person, interval = "confidence")
##         fit       lwr       upr
## 1 10.127775  7.498504 12.757045
## 2  7.127225  4.916171  9.338279
## 3 21.907985 20.086765 23.729205

Finally we checking the predictions are coresspodning to the real values with the rmse function (Root mean square deviation). The lower the value is the better. Here it is 1.689262. But there is no any fixed boundries for the RMSE value to bo good or bad.

p1<-predict(full.mod3, newdata=new.person, interval="none")
p<-data.frame(actual=c(7.7, 6.3, 20.5), predicted=p1)
rmse(p$actual, p$predicted)
## [1] 1.689262