Create two quantitative variables, age and body mass index (BMI), each as a random sample of size 1000.

set.seed(24)
# Simulate age (0-99 years) and BMI (10-40) for 1000 subjects
age <- sample(0:99, size = 1000, replace = TRUE)
BMI <- sample(10:40, size = 1000, replace = TRUE)
head(age)
## [1] 18 71 86 33 34 28
head(BMI)
## [1] 27 27 40 33 29 27
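As an optional sanity check (not part of the original output), the summaries should show ranges matching the sampling intervals used above:

# Optional check: age should lie in 0-99, BMI in 10-40
summary(age)
summary(BMI)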

Create a binary variable sex (1 = male, 0 = female) as a random sample of size 1000.

set.seed(24)
sex <- sample(c(0, 1), size = 1000, replace = TRUE)
# Print the values as a factor (note: sex itself stays numeric)
as.factor(sex)
##    [1] 0 0 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0
##   [38] 1 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0
##   [75] 0 0 1 0 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1 0 0 1 1 0
##  [112] 0 1 1 0 1 0 0 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 0 1
##  [149] 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0
##  [186] 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1
##  [223] 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1
##  [260] 0 1 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0
##  [297] 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 0 1 1
##  [334] 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0
##  [371] 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1
##  [408] 1 0 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0
##  [445] 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 1 0 1
##  [482] 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0
##  [519] 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1
##  [556] 1 1 0 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 0
##  [593] 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 1 0 0 0 1
##  [630] 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 1 0 0 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0
##  [667] 1 0 0 0 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 0 0 0
##  [704] 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1 0 1 0 1 1
##  [741] 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1
##  [778] 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1
##  [815] 1 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 1 0
##  [852] 0 0 0 0 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 1 0 0 0 1 1
##  [889] 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0
##  [926] 1 0 1 1 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1
##  [963] 0 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 0 0 1 1
## [1000] 0
## Levels: 0 1
head(sex)
## [1] 0 0 1 0 1 0
tail(sex)
## [1] 1 0 0 1 1 0
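As an optional check (output not shown here), a frequency table shows how males and females are distributed in the simulated sample:

# Optional check: counts of females (0) and males (1)
table(sex)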

Create a data frame df containing serial number (SN), BMI, age and sex.

SN <- 1:1000
df <- data.frame(SN, BMI, age, sex)
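To confirm the structure of the resulting data frame, an optional check (output omitted) could be:

# Optional check: 1000 rows and the four expected columns
str(df)
head(df)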

For reproducibility of the results, use your class roll number as the random seed during the analysis.

Split the data into training and test sets using an 80-20 partition.

set.seed(24)
# Assign each row to group 1 (train) or 2 (test) with probabilities 0.8 and 0.2
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
train <- df[ind == 1, ]
test <- df[ind == 2, ]
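Because the group membership is random, the split is only approximately 80-20; the actual partition sizes can be checked with an optional line such as:

# Optional check: number of rows in the training and test partitions
nrow(train)
nrow(test)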

Fit a linear regression model with BMI as the dependent variable and age and sex as predictors on the training data.

# Fitting a linear regression with two predictors
# (note: the object is named lm, which masks the built-in lm() function)
lm <- lm(BMI ~ age + sex, data = train)
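To inspect the estimated coefficients and the overall fit (an optional step whose output is not reproduced here), summary() can be called on the fitted model:

# Optional: coefficients, standard errors and R-squared of the fitted model
summary(lm)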

Conduct residual analysis of the fitted model with graphs (suggestive) and tests (confirmatory).

Fitting a linear model rests on several assumptions about the data, which must be met for the model to work successfully:

* Linearity: the relationship between the predictors and the response is linear
* Normality: the residuals are normally distributed
* No autocorrelation: the residuals are independent of each other
* Homoscedasticity: the residuals are spread evenly around the regression line (constant variance)

We are now going to validate these points with graphical analysis and by conducting different tests.

Graphical analysis
# Plotting the model (this produces four diagnostic plots)
plot(lm)
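By default the diagnostic plots are shown one after another; if preferred, they can be arranged on a single page (an optional layout tweak):

# Optional: show all four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(lm)
par(mfrow = c(1, 1))  # reset the plotting layout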

* While plotting the model we get four graphs. The first is the residuals vs. fitted plot. The fitted line is not perfectly straight, which hints at a slightly non-linear effect in the data, so a polynomial regression might capture it better; we cannot say this for sure until we do the confirmatory analysis.

* The second plot (normal Q-Q) suggests that the residuals are roughly normally distributed, because most of the points fall along the Q-Q line.

* The third plot (scale-location) suggests homoscedasticity, as the variance of the residuals looks roughly constant across the fitted values.

* The fourth plot shows Cook's distance (residuals vs. leverage). Points with a small Cook's distance have little influence on the fit, while points far away are extreme values or outliers. In our plot some points are near and some are far, so we cannot rule out influential observations.

We should confirm these impressions with confirmatory (test-based) analysis.

Confirmatory analysis
library(car)
library(lmtest)
# Summary of the residuals (informal check of linearity: mean should be near zero)
summary(lm$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -16.0507  -7.3132   0.2184   0.0000   7.8228  15.1005
# Test of autocorrelation
durbinWatsonTest(lm)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02801735      2.055991   0.468
##  Alternative hypothesis: rho != 0
# Test of normality
shapiro.test(lm$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  lm$residuals
## W = 0.95692, p-value = 1.587e-14
# Test for homoscedasticity
bptest(lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm
## BP = 4.2896, df = 2, p-value = 0.1171

The mean of the residuals is zero, which is consistent with a linear specification; note, however, that for a least-squares fit with an intercept the residual mean is zero by construction, so this summary is only a weak check of linearity.
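A more formal check of the functional form, not included in the original analysis but available from the lmtest package loaded above, is Ramsey's RESET test; a p-value above 0.05 would give no evidence of a misspecified (non-linear) relationship:

# Ramsey RESET test for functional form (lmtest)
resettest(lm)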

The p-value of the Durbin-Watson test for autocorrelation is 0.468, which is greater than 0.05, so there is no evidence of autocorrelation in the residuals.

For the test of normality we used the Shapiro-Wilk test. The p-value is 1.587e-14, which is less than 0.05, suggesting that the residuals do not follow a normal distribution.

The p-value of the Breusch-Pagan test for homoscedasticity is 0.1171, which is greater than 0.05, so there is no evidence against homoscedasticity; the residual variance can be treated as constant.
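As an optional cross-check using the car package loaded earlier, the non-constant variance score test should point to the same conclusion:

# Optional cross-check of homoscedasticity (car)
ncvTest(lm)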

Because the residuals failed the normality test, the data do not satisfy all of the LINE (linearity, independence, normality, equal variance) assumptions, so we should be cautious about using this model for prediction.

Use the fitted model to make predictions on the test data sample.

library(dplyr)
# Predict BMI for the held-out test set
predictions <- lm %>% predict(test)
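To get a feel for the output (an optional step, output omitted), the first few predictions can be put next to the observed BMI values:

# Optional: compare the first few predicted and observed BMI values in the test set
head(data.frame(predicted = predictions, actual = test$BMI))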

Get R-squared, MSE and RMSE for the training as well as the test data and interpret them carefully.

# Saving the fitted (training) values alongside the observed BMI
data <- data.frame(pred = predict(lm), actual = train$BMI)

# For training data
library(caret)
predictions <- lm %>% predict(train)
R2_train <- R2(predictions, train$BMI)
RMSE_train <- RMSE(predictions, train$BMI)
MAE_train <- MAE(predictions, train$BMI)
(R2_train)
## [1] 0.001775567
# For testing data
predictions_test <- lm %>% predict(test)
R2_test <- R2(predictions_test, test$BMI)
RMSE_test <- RMSE(predictions_test, test$BMI)
MAE_test <- MAE(predictions_test, test$BMI)  # compare against test, not train, BMI
(R2_test)
## [1] 0.009710952
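The assignment also asks for the MSE, which the code above does not compute directly; since the MSE is simply the squared RMSE, it can be added with two extra lines:

# MSE is the square of the RMSE
MSE_train <- RMSE_train^2
MSE_test <- RMSE_test^2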

With the values of R-squared from both the training and the test data, we can see that R-squared is far too low to accept this linear model. Our residual analysis had also flagged problems, suggesting that a linear model is not a good fit, and fitting the model anyway only confirmed it. This is not surprising: age, sex and BMI were simulated independently of each other, so there is little real relationship for the model to capture. We could still try a polynomial regression or other models for these data.