set.seed(24)
age <- sample(0:99, size = 1000, replace = TRUE)
BMI <- sample(10:40, size = 1000, replace = TRUE)
head(age)
## [1] 18 71 86 33 34 28
head(BMI)
## [1] 27 27 40 33 29 27
set.seed(24)
as.factor(sex <- sample(c(0,1), size = 1000, replace = TRUE))
## [1] 0 0 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0
## [38] 1 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0
## [75] 0 0 1 0 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1 0 0 1 1 0
## [112] 0 1 1 0 1 0 0 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 0 1
## [149] 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0
## [186] 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1
## [223] 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1
## [260] 0 1 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0
## [297] 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 0 1 1
## [334] 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0
## [371] 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1
## [408] 1 0 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0
## [445] 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 1 0 1
## [482] 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0
## [519] 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1
## [556] 1 1 0 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 0
## [593] 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 1 0 0 0 1
## [630] 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 1 0 0 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0
## [667] 1 0 0 0 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 0 0 0
## [704] 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1 0 1 0 1 1
## [741] 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1
## [778] 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1
## [815] 1 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 1 0
## [852] 0 0 0 0 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 1 0 0 0 1 1
## [889] 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0
## [926] 1 0 1 1 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1
## [963] 0 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 0 0 1 1
## [1000] 0
## Levels: 0 1
head(sex)
## [1] 0 0 1 0 1 0
tail(sex)
## [1] 1 0 0 1 1 0
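Note that sex is still stored as a numeric 0/1 vector: the as.factor() call above only printed a factor version of the draw, it did not overwrite sex. If an explicit factor were wanted (it is not required for the numeric analysis below), a minimal sketch would be the following, where the labels are an assumption rather than part of the original data:
# Hypothetical labelled factor built from the 0/1 codes (labels assumed)
sex_factor <- factor(sex, levels = c(0, 1), labels = c("female", "male"))
table(sex_factor)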
SN <- 1:1000
df <- data.frame(SN, BMI, age, sex)
set.seed(24)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
train <- df[ind == 1, ]
test <- df[ind == 2, ]
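A quick sanity check on the split can be done by counting the rows in each partition; a small sketch (the exact counts depend on the random draw and are not shown in the original output):
# Roughly 80/20 split: how many rows ended up in each partition
table(ind)
nrow(train)
nrow(test)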
#Creating a linear regression with two variables
lm <- lm(BMI ~ age + sex, data = train)
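Before checking the assumptions it can be worth glancing at the fitted coefficients; a minimal sketch (its output is not shown here):
# Coefficients, standard errors and p-values of the fitted model
summary(lm)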
Fitting a linear model relies on several assumptions about the data, which must hold for the model to work well:
* Linearity: the relationship between the predictors and the outcome is linear
* Multivariate normality: the residuals are normally distributed
* No autocorrelation: the residuals are independent of each other
* Homoscedasticity: the residuals have constant variance across the regression line
We now check these assumptions with graphical analysis and by conducting formal tests.
#Plotting the model
plot(lm)
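By default plot() on an lm object draws the four diagnostic plots one after another; to see them on a single page the plotting layout can be set first (a small sketch):
# Show all four diagnostic plots in a 2 x 2 grid, then reset the layout
par(mfrow = c(2, 2))
plot(lm)
par(mfrow = c(1, 1))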
* Plotting the model produces four diagnostic graphs. The first (Residuals vs Fitted) shows a slightly non-linear pattern in the data, since the fitted line is not straight. A polynomial regression might capture this non-linear effect, but we cannot say so for sure without confirmatory analysis.
* The second plot (Normal Q-Q) suggests the residuals are roughly normally distributed, because most of the points fall along the Q-Q line.
* The third plot (Scale-Location) indicates homoscedasticity, as the variance of the residuals looks roughly constant.
* The fourth plot shows Cook's distance. Points with a small Cook's distance have little influence on the fit, while points that are far away are extreme values or outliers. In our plot some points are near and some are far, so we cannot say all points are well behaved.
We should confirm these impressions with confirmatory (test-based) analysis.
library(car)
library(lmtest)
#Test of linearity
summary(lm$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16.0507 -7.3132 0.2184 0.0000 7.8228 15.1005
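Looking at the residual summary is only an informal check of linearity. As a complementary check that was not part of the original analysis, the Ramsey RESET test from the already loaded lmtest package could be used:
# RESET test: a small p-value suggests the linear specification is inadequate
resettest(lm)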
#Test of autocorrelation
durbinWatsonTest(lm)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02801735 2.055991 0.468
## Alternative hypothesis: rho != 0
#Test of normality
shapiro.test(lm$residuals)
##
## Shapiro-Wilk normality test
##
## data: lm$residuals
## W = 0.95692, p-value = 1.587e-14
#Test for homoscedasticity
bptest(lm)
##
## studentized Breusch-Pagan test
##
## data: lm
## BP = 4.2896, df = 2, p-value = 0.1171
As the mean of the residuals is essentially zero, the linearity assumption looks reasonable.
The p-value of the Durbin-Watson test is 0.468, which is greater than 0.05, so there is no autocorrelation present.
For the test of normality we used the Shapiro-Wilk test. The p-value is 1.587e-14, which is far below 0.05, suggesting that the residuals do not follow a normal distribution.
The p-value of the Breusch-Pagan test for homoscedasticity is 0.1171, which is greater than 0.05, so homoscedasticity holds and the residual variances can be treated as equal.
As the data did not pass all of the LINE assumptions (the normality check failed), we should not be using this model to predict.
library(dplyr)
predictions <- lm %>% predict(test)
#Saving the predicted and actual values for the training data:
data <- data.frame(pred = predict(lm), actual = train$BMI)
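One way to use this data frame is a quick predicted-versus-actual plot for the training data; a small sketch (the resulting figure is not shown here):
# Actual vs predicted BMI; points near the diagonal would indicate a good fit
plot(data$actual, data$pred, xlab = "Actual BMI", ylab = "Predicted BMI")
abline(a = 0, b = 1, col = "red")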
#For training data
library(caret)
predictions <- lm %>% predict(train)
R2_train = R2(predictions,train$BMI)
RMSE_train = RMSE(predictions,train$BMI)
MAE_train = MAE(predictions, train$BMI)
(R2_train)
## [1] 0.001775567
#For testing data
predictions_test <- lm %>% predict(test)
R2_test = R2(predictions_test,test$BMI)
RMSE_test = RMSE(predictions_test,test$BMI)
MAE_test = MAE(predictions_test, test$BMI)
(R2_test)
## [1] 0.009710952
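To compare the two sets of metrics side by side, the stored values can be collected into one small table (a sketch using only the objects computed above):
# Train vs test performance in a single data frame
data.frame(metric = c("R2", "RMSE", "MAE"),
           train = c(R2_train, RMSE_train, MAE_train),
           test = c(R2_test, RMSE_test, MAE_test))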
The values of R squared from both the training and the test data are far too low to accept this linear model. Our residual analysis had also failed, suggesting that we cannot fit a linear model. We fitted the model anyway, and the R squared values confirmed it. So everything suggests that a linear model is not suitable for this data, and we could try a polynomial regression model or some other model instead.
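As a possible next step, a second-degree polynomial in age could be tried on the same training data; a minimal sketch (not run here, and not part of the original analysis):
# Quadratic term for age; poly() builds orthogonal polynomial terms
lm_poly <- lm(BMI ~ poly(age, 2) + sex, data = train)
summary(lm_poly)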