• Create two quantitative variables: age and body-mass index (BMI) with random samples of size 1000 each:
•Age: 0 to 99 (random samples)
•BMI: 10 to 40 (random samples)
#setting seed so that the random numbers are fixed
set.seed(17)
age <- c(sample(x=seq(0,99), size=1000, replace = T))
bmi <- c(runif(1000, min=10, max=40))
• Create a binary variable sex (1=Male and 0=Female) of 1000 random samples
set.seed(17)
sex <- sample(x=c("male", "female"), size = 1000, replace = T)
sex <- factor(sex, levels = c("male", "female"))
head(sex,15)
## [1] female female male female female male male male male male
## [11] female female male male female
## Levels: male female
• Create a data frame as df containing four variables/features:Serial Number, BMI, Age and Sex
serial_number <- seq(1:1000)
df <- data.frame(serial_number, bmi, age, sex)
head(df)
## serial_number bmi age sex
## 1 1 27.60871 49 female
## 2 2 30.07957 96 female
## 3 3 14.08479 93 male
## 4 4 28.42358 44 female
## 5 5 26.66123 38 female
## 6 6 27.20891 41 male
• Split the data into “train” and “test” data using 80-20 partition
set.seed(17)
library(caret)
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
ind <- sample(2, nrow(df), replace = T, prob = c(0.8,0.2))
head(train <- df[ind==1,])
## serial_number bmi age sex
## 1 1 27.60871 49 female
## 3 3 14.08479 93 male
## 4 4 28.42358 44 female
## 5 5 26.66123 38 female
## 6 6 27.20891 41 male
## 7 7 12.14658 21 male
test <- df[ind==2,]
• Fit a linear regression model with BMI as dependent variable and age and sex and predictors in the train data samples
#fitting linear regression model
lrm <- lm(bmi ~ age + sex, data=train)
#getting summary of the regression model
summary(lrm)
##
## Call:
## lm(formula = bmi ~ age + sex, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1677 -7.3683 -0.0547 7.2590 15.2115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.1791204 0.6714574 37.499 <2e-16 ***
## age 0.0008408 0.0102696 0.082 0.935
## sexfemale -0.4559793 0.5991988 -0.761 0.447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.503 on 803 degrees of freedom
## Multiple R-squared: 0.0007317, Adjusted R-squared: -0.001757
## F-statistic: 0.294 on 2 and 803 DF, p-value: 0.7454
• Conduct residual analysis of the fitted model with graphs (suggestive) and tests (confirmative)
LINE Test
Linearity of residuals
plot(lrm,which=1,col=c("blue"))

summary(lrm$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -15.16774 -7.36828 -0.05472 0.00000 7.25905 15.21150
From the graph we can see that the LOESS line lies in the zero line of the y-axis. From the summary we can see that the Mean of residuals is 0 which means that the residuals are linear.
Independence of residuals
#to test independence of residuals we can use auto correlation function plot
acf(lrm$residuals,main="Autocorrelation Function Plot (ACF)")

set.seed(17)
#we need to load "car" library for the Durbin Watson Test for independence of residuls
library(car)
durbinWatsonTest(lrm)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.0425776 2.082003 0.24
## Alternative hypothesis: rho != 0
In the Autocorrelation Function Plot (ACF), the plot shows ups and down bars on X-axis and P-Value of Durbin-Watson test is lesser than 0.05 so there exists autocorrelation.
Normality of residuals
plot(lrm, which = 2,col=c("red"))

shapiro.test(lrm$residuals)
##
## Shapiro-Wilk normality test
##
## data: lrm$residuals
## W = 0.96037, p-value = 5.931e-14
From Shapiro-Wilk normality test we can see that p-value is less than 0.05 which means that the residuals do not follow normal distribution.
Equal variance (homoscedasticity) of residuals
plot(lrm, which=3)

#we need to load library "lmtest" for bptest
library(lmtest)
bptest(lrm)
##
## studentized Breusch-Pagan test
##
## data: lrm
## BP = 4.7409, df = 2, p-value = 0.09344
The Scatter plot of residuals vs predicted values shows values are distributed randomly. Also, the P-Value of bp-test is greater than 0.05 we can say that variances are equal (homoscedasticity is present).
• Use the fitted model to predict the random test data samples
pred <- predict(lrm, newdata = test)
• Get R-square, MAE and RMSE for training as well as test data and interpret them carefully
library(caret)
R_squared <- R2(pred, test$bmi)
(paste("R_squared: ",R_squared))
## [1] "R_squared: 0.00494364110778339"
RMSE<- RMSE(pred, test$bmi)
(paste("RMSE: ",RMSE))
## [1] "RMSE: 8.26701576035122"
MSE <- RMSE^2
(paste("MSE: ",MSE))
## [1] "MSE: 68.3435495818954"
The error calculation gives us a very low R-squared and high RMSE. Typically, an R-squared above 0.8 is considered to be acceptable whereas RMSE should be closer to zero for the model to be accurate. Since we have contradicting values, we can say that our model is not accurate.
• Take decision and write conclusion based on all the results obtained so far
Based on Regression metrics, we conclude that the model is not accurate and therefore should not be used for prediction. The predictors used, i.e, age and sex, are not very good predictors of BM.