• Create two quantitative variables: age and body-mass index (BMI) with random samples of size 1000 each:

•Age: 0 to 99 (random samples)

•BMI: 10 to 40 (random samples)

#setting seed so that the random numbers are fixed
set.seed(17)
age <- c(sample(x=seq(0,99), size=1000, replace = T))
bmi <- c(runif(1000, min=10, max=40))

• Create a binary variable sex (1=Male and 0=Female) of 1000 random samples

set.seed(17)
sex <- sample(x=c("male", "female"), size = 1000, replace = T)
sex <- factor(sex, levels = c("male", "female"))
head(sex,15)
##  [1] female female male   female female male   male   male   male   male  
## [11] female female male   male   female
## Levels: male female

• Create a data frame as df containing four variables/features:Serial Number, BMI, Age and Sex

serial_number <- seq(1:1000)
df <- data.frame(serial_number, bmi, age, sex)
head(df)
##   serial_number      bmi age    sex
## 1             1 27.60871  49 female
## 2             2 30.07957  96 female
## 3             3 14.08479  93   male
## 4             4 28.42358  44 female
## 5             5 26.66123  38 female
## 6             6 27.20891  41   male

• Split the data into “train” and “test” data using 80-20 partition

set.seed(17)

library(caret)
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
ind <- sample(2, nrow(df), replace = T, prob = c(0.8,0.2))
head(train <- df[ind==1,])
##   serial_number      bmi age    sex
## 1             1 27.60871  49 female
## 3             3 14.08479  93   male
## 4             4 28.42358  44 female
## 5             5 26.66123  38 female
## 6             6 27.20891  41   male
## 7             7 12.14658  21   male
test <- df[ind==2,]

• Fit a linear regression model with BMI as dependent variable and age and sex and predictors in the train data samples

#fitting linear regression model
lrm <- lm(bmi ~ age + sex, data=train)
#getting summary of the regression model
summary(lrm)
## 
## Call:
## lm(formula = bmi ~ age + sex, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1677  -7.3683  -0.0547   7.2590  15.2115 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.1791204  0.6714574  37.499   <2e-16 ***
## age          0.0008408  0.0102696   0.082    0.935    
## sexfemale   -0.4559793  0.5991988  -0.761    0.447    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.503 on 803 degrees of freedom
## Multiple R-squared:  0.0007317,  Adjusted R-squared:  -0.001757 
## F-statistic: 0.294 on 2 and 803 DF,  p-value: 0.7454

• Conduct residual analysis of the fitted model with graphs (suggestive) and tests (confirmative)

LINE Test

Linearity of residuals
plot(lrm,which=1,col=c("blue"))

summary(lrm$residuals)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -15.16774  -7.36828  -0.05472   0.00000   7.25905  15.21150
From the graph we can see that the LOESS line lies in the zero line of the y-axis. From the summary we can see that the Mean of residuals is 0 which means that the residuals are linear.
Independence of residuals
#to test independence of residuals we can use auto correlation function plot
acf(lrm$residuals,main="Autocorrelation Function Plot (ACF)")

set.seed(17)
#we need to load "car" library for the Durbin Watson Test for independence of residuls
library(car)
durbinWatsonTest(lrm)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.0425776      2.082003    0.24
##  Alternative hypothesis: rho != 0
In the Autocorrelation Function Plot (ACF), the plot shows ups and down bars on X-axis and P-Value of Durbin-Watson test is lesser than 0.05 so there exists autocorrelation.
Normality of residuals
plot(lrm, which = 2,col=c("red"))

shapiro.test(lrm$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  lrm$residuals
## W = 0.96037, p-value = 5.931e-14
From Shapiro-Wilk normality test we can see that p-value is less than 0.05 which means that the residuals do not follow normal distribution.
Equal variance (homoscedasticity) of residuals
plot(lrm, which=3)

#we need to load library "lmtest" for bptest
library(lmtest)
bptest(lrm)
## 
##  studentized Breusch-Pagan test
## 
## data:  lrm
## BP = 4.7409, df = 2, p-value = 0.09344
The Scatter plot of residuals vs predicted values shows values are distributed randomly. Also, the P-Value of bp-test is greater than 0.05 we can say that variances are equal (homoscedasticity is present). 

• Use the fitted model to predict the random test data samples

pred <- predict(lrm, newdata = test)

• Get R-square, MAE and RMSE for training as well as test data and interpret them carefully

library(caret)
R_squared <- R2(pred, test$bmi)
(paste("R_squared: ",R_squared))
## [1] "R_squared:  0.00494364110778339"
RMSE<- RMSE(pred, test$bmi)
(paste("RMSE: ",RMSE))
## [1] "RMSE:  8.26701576035122"
MSE <- RMSE^2
(paste("MSE: ",MSE))
## [1] "MSE:  68.3435495818954"
The error calculation gives us a very low R-squared and high RMSE. Typically, an R-squared above 0.8 is considered to be acceptable whereas RMSE should be closer to zero for the model to be accurate. Since we have contradicting values, we can say that our model is not accurate.

• Take decision and write conclusion based on all the results obtained so far

Based on Regression metrics, we conclude that the model is not accurate and therefore should not be used for prediction. The predictors used, i.e, age and sex, are not very good predictors of BM.