The objective of this project is to fit a linear regression model on training data and to make predictions on test data. The task has the following major steps:
i. Data generation and preparation
ii. Partitioning the data into training and testing data
iii. Fitting a linear regression model on the training data
iv. Performing residual analysis of the fitted model
v. Making predictions on the testing data
#Generating age as a random sample of size 1000 from the integers 0 to 99
age <- sample(0:99,1000,replace = TRUE)
#Generating BMI as a random sample of size 1000 from the integers 10 to 40
BMI <- sample(10:40,1000,replace = TRUE)
Interpretation: Here the sample() function is used to draw random samples of size 1000, with replacement, for age (integers 0 to 99) and BMI (integers 10 to 40).
# Generating sex
sex <- sample(0:1,1000,replace = TRUE)
The ‘sex’ variable is also drawn as a random sample of size 1000. It is a binary variable: 1 for male, 0 for female.
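As far as this section shows, the samples above are drawn before any seed is set, so their values change from run to run. A minimal sketch of a reproducible version follows; the seed value 123 and the names n and age_again are assumptions chosen only for illustration, and the numbers it produces are not the ones reported in this document.

```r
# Hypothetical seed value, chosen only for illustration
set.seed(123)

n <- 1000
age <- sample(0:99, n, replace = TRUE)   # integer ages 0..99
BMI <- sample(10:40, n, replace = TRUE)  # integer BMI 10..40
sex <- sample(0:1, n, replace = TRUE)    # 1 = male, 0 = female

# Re-seeding and redrawing reproduces exactly the same vector
set.seed(123)
age_again <- sample(0:99, n, replace = TRUE)
identical(age, age_again)
```

Setting the seed once, before the first call to sample(), is enough to make every later draw in the same script repeatable.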
#Creating data frame named 'df' for the four variables
df <- data.frame(SN = 1:1000,BMI = BMI,Age = age,Sex=sex)
head(df)
##   SN BMI Age Sex
## 1  1  40  25   0
## 2  2  15  81   0
## 3  3  14  48   0
## 4  4  34  21   1
## 5  5  24  27   0
## 6  6  27  78   1
A data frame is created and its first six rows are displayed.
#Setting seed as class roll number, i.e. 9
set.seed(9)
#Partitioning the whole data into 2 subsets with probabilities 0.8 and 0.2
ind <- sample(2,nrow(df),replace = TRUE,prob = c(0.8,0.2))
#Separating training and testing data
train <- df[ind==1,]
test <- df[ind==2,]
The data frame is randomly partitioned into two subsets: training data and testing data. The model is fitted on the training data, and the testing data is then used to assess the accuracy of the fitted model on observations it has not seen. Because each row is assigned independently with probability 0.8 or 0.2, the split is only approximately 80/20.
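The partitioning step can be sketched in a self-contained form as below, first with the label-based approach used here and then with an alternative that draws exactly 80% of the row numbers. The data are simulated afresh inside the block, and the names train_rows, train_exact and test_exact are hypothetical, so the counts will not match the ones behind this document's output.

```r
set.seed(9)  # same seed value as in the text; the data here are simulated afresh
df <- data.frame(SN  = 1:1000,
                 BMI = sample(10:40, 1000, replace = TRUE),
                 Age = sample(0:99, 1000, replace = TRUE),
                 Sex = sample(0:1, 1000, replace = TRUE))

# Approach used in the text: each row is labelled 1 or 2 independently
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
prop.table(table(ind))   # proportions are typically close to 0.8 and 0.2

# Alternative: draw exactly 80% of the row numbers without replacement
train_rows  <- sample(nrow(df), size = round(0.8 * nrow(df)))
train_exact <- df[train_rows, ]
test_exact  <- df[-train_rows, ]
c(train = nrow(train_exact), test = nrow(test_exact))
```

Either approach keeps the two subsets disjoint; the second one simply fixes their sizes in advance.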
#Fitting linear regression of BMI on Age and Sex with the training data
linear_model <- lm(BMI ~ Age+Sex,data=train)
summary(linear_model)
##
## Call:
## lm(formula = BMI ~ Age + Sex, data = train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.2279  -7.6110  -0.3552   8.1344  16.4762
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.443665   0.744860  31.474   <2e-16 ***
## Age          0.004715   0.011008   0.428   0.6686
## Sex          1.515527   0.639986   2.368   0.0181 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.986 on 795 degrees of freedom
## Multiple R-squared: 0.00709, Adjusted R-squared: 0.004592
## F-statistic: 2.838 on 2 and 795 DF, p-value: 0.05912
A linear regression model is fitted on the training data using the ‘lm()’ function, with BMI as the response variable and Age and Sex as predictors, and its summary is displayed.
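To make the coefficient table concrete, the fitted equation can be written out by hand. The sketch below copies the three estimates from the summary output above; the helper name predict_bmi and the example inputs (a 30-year-old male and female) are hypothetical and serve only as illustration.

```r
# Coefficients copied from the summary output above
b0    <- 23.443665   # (Intercept)
b_age <- 0.004715    # Age
b_sex <- 1.515527    # Sex

# Fitted equation: BMI_hat = b0 + b_age * Age + b_sex * Sex
predict_bmi <- function(age, sex) {
  b0 + b_age * age + b_sex * sex
}

# Hypothetical inputs: a 30-year-old male (Sex = 1) and female (Sex = 0)
predict_bmi(30, 1)
predict_bmi(30, 0)
```

The two results differ by exactly the Sex coefficient, which is how that estimate is read: the predicted difference in BMI between the two groups at the same age. Calling predict() on the fitted model with a matching newdata frame should agree with this hand calculation up to rounding of the copied coefficients.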
The residual analysis is carried out with the ‘LINE’ test, which checks whether the assumptions behind the fitted model hold. LINE stands for:
L - Linearity of residuals
I - Independence of residuals
N - Normality of residuals
E - Equal variance of residuals
Each assumption can be checked in two ways:
i. Graphical
ii. Calculation
The graphical method is considered suggestive only, whereas the calculation method is considered confirmatory.
#If the LOESS line lies close to the zero line of the y-axis then the residuals are linear
plot(linear_model,which = 1,col = "red")
#The mean of the residuals is checked; it is zero by construction for a least-squares fit with an intercept, so this is a sanity check rather than evidence of linearity
summary(linear_model$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -15.2279  -7.6110  -0.3552   0.0000   8.1344  16.4762
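#For the independence, autocorrelation is checked; if the bars at non-zero lags stay within the dashed confidence bounds of the autocorrelation plot then there is no significant autocorrelation
acf(linear_model$residuals)
#For the normality, the normal Q-Q plot is viewed first; points lying close to the reference line suggest normally distributed residuals
plot(linear_model,which = 2,col = "red")
#Then the Shapiro-Wilk test is performed. If the p-value > 0.05, it can be concluded that the residuals follow the normal distribution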
shapiro.test(linear_model$residuals)
##
## Shapiro-Wilk normality test
##
## data: linear_model$residuals
## W = 0.95311, p-value = 2.967e-15
#Here the scale-location plot is observed: the square root of the absolute standardized residuals (y-axis) against the fitted values (x-axis). If the values are distributed randomly, i.e. if the plot doesn't show any pattern, then it is considered as homoscedasticity
plot(linear_model,which = 3, col = "red")
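Apart from the Shapiro-Wilk test, the checks above are graphical. As a hedged sketch of calculation counterparts for independence and equal variance, the block below computes a Durbin-Watson statistic and a studentized Breusch-Pagan statistic in base R. These two tests are not part of the original analysis; the data are simulated afresh, and the names d, fit, dw, aux, bp_stat and bp_p are hypothetical. Packages such as lmtest offer ready-made versions of both tests.

```r
set.seed(9)  # illustrative seed; simulated data, not the sample used in the text
d <- data.frame(BMI = sample(10:40, 800, replace = TRUE),
                Age = sample(0:99, 800, replace = TRUE),
                Sex = sample(0:1, 800, replace = TRUE))
fit <- lm(BMI ~ Age + Sex, data = d)
e <- residuals(fit)

# Independence: Durbin-Watson statistic; values near 2 suggest
# no first-order autocorrelation in the residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Equal variance: studentized Breusch-Pagan statistic.
# Regress the squared residuals on the predictors; the statistic is
# n * R^2 of that auxiliary regression, compared with a chi-square
# distribution whose degrees of freedom equal the number of predictors (2)
d$e2 <- e^2
aux <- lm(e2 ~ Age + Sex, data = d)
bp_stat <- nobs(aux) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(DW = dw, BP = bp_stat, BP_p_value = bp_p)
```

A Breusch-Pagan p-value above 0.05 would be read as no evidence against equal variance, in the same way the Shapiro-Wilk p-value is read for normality.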
#Predicting BMI for the testing data with the fitted model
predict_test <- predict(linear_model,test)
#R2() and RMSE() come from the caret package
library(caret)
data.frame(R2 = R2(predict_test,test$BMI),
MSE = mean((predict_test - test$BMI)^2),
RMSE = RMSE(predict_test,test$BMI))
##           R2      MSE     RMSE
## 1 0.01314017 81.88826 9.049213
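The three metrics can also be computed without any extra package, which shows what each one measures. The sketch below uses made-up observed and predicted values (obs and pred are hypothetical, not taken from the test set) and assumes that R2 is taken as the squared correlation between predictions and observations, which to my understanding is the default behaviour of caret's R2().

```r
# Hypothetical observed and predicted values, for illustration only
obs  <- c(22, 31, 18, 27, 35, 14)
pred <- c(24, 25, 23, 26, 25, 24)

mse  <- mean((pred - obs)^2)   # mean squared error
rmse <- sqrt(mse)              # root mean squared error, in the units of BMI
r2   <- cor(pred, obs)^2       # squared correlation of predictions and observations

data.frame(R2 = r2, MSE = mse, RMSE = rmse)
```

Since RMSE is the square root of MSE, the two always carry the same information; RMSE is usually easier to read because it is on the same scale as the response.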
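The model fitted above is considered a weak model because:
i. The coefficient of determination is far below 0.5 (multiple R-squared = 0.00709).
ii. The regression ANOVA is not statistically significant at the 5% level (F-statistic p-value = 0.05912).
iii. The slope for Age is not statistically significant (p-value = 0.6686); the intercept and the slope for Sex (p-value = 0.0181) are significant at the 5% level, but the model still explains almost none of the variation in BMI.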
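The LINE test does not fully validate the residuals: the Shapiro-Wilk test rejects normality (p-value = 2.967e-15, far below 0.05). In addition, the value of R2 is very low for both the training data (0.00709) and the testing data (0.0131), and the RMSE of about 9.05 is large relative to the 10 to 40 range of BMI, so the model cannot be considered a good one. This is expected, since BMI, Age and Sex were generated independently of one another.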