The objective of this project is to fit a linear regression model on training data and to use it to make predictions on test data. The task has the following major steps:

i. Data generation and preparation
ii. Partitioning the data into training and testing data
iii. Fitting a linear regression model on the training data
iv. Performing residual analysis of the fitted model
v. Making predictions with the testing data

1. Creating two quantitative variables, age and BMI (Body Mass Index)

# Generate 1000 ages by sampling the integers 0 to 99 with replacement
age <- sample(0:99, 1000, replace = TRUE)

# Generate 1000 BMI values by sampling the integers 10 to 40 with replacement
BMI <- sample(10:40, 1000, replace = TRUE)

Interpretation: The sample() function draws a random sample of size 1000, with replacement, for each of age and BMI.
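
The samples above are drawn before any seed is set, so every run produces different data. A minimal sketch of a reproducible variant is given here, assuming it is acceptable to call set.seed() before the first sample(). The seed value 9 is only an illustration, and the outputs printed in this report come from the original unseeded draws, so this variant would not reproduce them.

```r
# Fix the random number generator state so the draws can be repeated
set.seed(9)  # illustrative seed (assumption, not the seed used for the printed outputs)

age <- sample(0:99, 1000, replace = TRUE)
BMI <- sample(10:40, 1000, replace = TRUE)

# Quick sanity checks on sample size and value range
length(age)
range(age)
range(BMI)
```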

2. Creating the binary variable sex

# Generate sex as a random 0/1 sample of size 1000
sex <- sample(0:1, 1000, replace = TRUE)

The ‘sex’ variable is also drawn as a random sample of size 1000. It is a binary variable: 1 for male, 0 for female.
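
If sex is to be shown with readable labels in tables or plots, the 0/1 coding can be wrapped in a factor. The following is a small sketch under that assumption; sex_f is a hypothetical name that does not appear elsewhere in this report, and the model below still uses the numeric 0/1 column.

```r
# Regenerate a 0/1 sample so that the example runs on its own
sex <- sample(0:1, 1000, replace = TRUE)

# Hypothetical labelled version of the variable: 0 = Female, 1 = Male
sex_f <- factor(sex, levels = c(0, 1), labels = c("Female", "Male"))

# Frequency of each category
table(sex_f)
```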

3. Creating a data frame df with four variables: SN, BMI, Age and Sex

# Create a data frame named 'df' with the four variables
df <- data.frame(SN = seq_len(1000), BMI = BMI, Age = age, Sex = sex)
head(df)

##   SN BMI Age Sex
## 1  1  40  25   0
## 2  2  15  81   0
## 3  3  14  48   0
## 4  4  34  21   1
## 5  5  24  27   0
## 6  6  27  78   1

A data frame is created and its first six rows are displayed.

4. Splitting the data into training and testing data

# Set the seed to the class roll number, i.e. 9
set.seed(9)

# Assign each row to subset 1 or 2 with probabilities 0.8 and 0.2
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))

# Separate the training and testing data
train <- df[ind == 1, ]
test <- df[ind == 2, ]

The data frame is randomly partitioned into two subsets, training data and testing data. The model is fitted on the training data, and the testing data is then used to assess the accuracy of the fitted model. Because each row is assigned independently, the split is only approximately 80/20.
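
The split sizes can be checked directly, and an exact 80/20 split is possible by sampling row indices without replacement. The sketch below regenerates stand-in data with the same columns as df so that it runs on its own; train_idx, train_exact and test_exact are hypothetical names, and the exact split is an alternative, not the method used in this report.

```r
set.seed(9)

# Stand-in data with the same columns as df (values are regenerated here)
df <- data.frame(SN = seq_len(1000),
                 BMI = sample(10:40, 1000, replace = TRUE),
                 Age = sample(0:99, 1000, replace = TRUE),
                 Sex = sample(0:1, 1000, replace = TRUE))

# Row-wise random assignment: subset sizes are only approximately 80/20
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
prop.table(table(ind))

# Alternative: draw exactly 80% of the row indices without replacement
train_idx <- sample(nrow(df), size = round(0.8 * nrow(df)))
train_exact <- df[train_idx, ]
test_exact <- df[-train_idx, ]
c(train = nrow(train_exact), test = nrow(test_exact))
```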

5. Fitting a linear regression model with BMI as the dependent variable and age and sex as predictors

linear_model <- lm(BMI ~ Age + Sex, data = train)
summary(linear_model)

## 
## Call:
## lm(formula = BMI ~ Age + Sex, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2279  -7.6110  -0.3552   8.1344  16.4762 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.443665   0.744860  31.474   <2e-16 ***
## Age          0.004715   0.011008   0.428   0.6686    
## Sex          1.515527   0.639986   2.368   0.0181 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.986 on 795 degrees of freedom
## Multiple R-squared:  0.00709,    Adjusted R-squared:  0.004592 
## F-statistic: 2.838 on 2 and 795 DF,  p-value: 0.05912

A linear regression is fitted to the training data using the ‘lm()’ function with BMI as the response variable, and the summary is displayed.
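
The quantities in the summary can also be pulled out as objects, which helps when they are quoted in a conclusion. The sketch below refits the same formula on regenerated stand-in data, so its numbers will differ from the summary printed above; coef_table, ci and f_pvalue are hypothetical names.

```r
set.seed(9)

# Stand-in training data with the same columns as train
train <- data.frame(BMI = sample(10:40, 800, replace = TRUE),
                    Age = sample(0:99, 800, replace = TRUE),
                    Sex = sample(0:1, 800, replace = TRUE))
linear_model <- lm(BMI ~ Age + Sex, data = train)

# Coefficient table: estimate, standard error, t value and p-value per term
coef_table <- coef(summary(linear_model))
coef_table

# 95% confidence intervals for the coefficients
ci <- confint(linear_model, level = 0.95)
ci

# p-value of the overall F-test, recomputed from the stored F statistic
fstat <- summary(linear_model)$fstatistic
f_pvalue <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                      lower.tail = FALSE))
f_pvalue
```

A coefficient whose confidence interval contains zero is not significant at the 5% level, which is the same information as its p-value in the summary.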

6. Residual analysis

Residual analysis is described by the ‘LINE’ test, which checks the validity of the fitted model. LINE stands for:

L - Linearity of residuals
I - Independence of residuals
N - Normality of residuals
E - Equal variance of residuals

Each assumption can be checked in two ways: i. graphically ii. by calculation

Graphical methods are considered suggestive only; calculation methods are considered confirmatory.

Linearity of the residuals

# If the LOESS line stays close to the zero line of the y-axis, the residuals are linear
plot(linear_model, which = 1, col = "red")

# The mean of the residuals should be zero. Least squares with an intercept forces this mean to zero, so this is a sanity check rather than a test of linearity
summary(linear_model$residuals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -15.2279  -7.6110  -0.3552   0.0000   8.1344  16.4762
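
A zero residual mean is guaranteed by least squares whenever the model has an intercept, so it cannot detect non-linearity by itself. The sketch below illustrates this on deliberately non-linear simulated data (all names and the seed are assumptions made for the example) and shows one possible calculation-based check: comparing the straight-line fit with a fit that adds a squared term.

```r
set.seed(1)  # illustrative seed

# Simulated data with a clearly non-linear (quadratic) relationship
x <- runif(200, min = -3, max = 3)
y <- x^2 + rnorm(200, sd = 0.5)

# Deliberately misspecified straight-line model
bad_fit <- lm(y ~ x)

# The residual mean is zero up to rounding error despite the wrong functional form
mean(residuals(bad_fit))

# Adding a squared term and comparing the nested fits with an F-test
better_fit <- lm(y ~ x + I(x^2))
lin_check <- anova(bad_fit, better_fit)
lin_check
```

A small p-value in the comparison indicates that the squared term improves the fit, i.e. that the straight-line model misses curvature.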

Independence of residuals

# Independence is checked through autocorrelation: if the bars at lags greater than 0 stay within the dashed confidence bounds of the ACF plot, there is no significant autocorrelation
acf(linear_model$residuals)
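
The ACF plot is the graphical check. A calculation-based counterpart can be sketched in base R with the Durbin-Watson statistic, computed by hand, and the Ljung-Box test from Box.test(). The model is refitted on regenerated stand-in data so that the example runs on its own; the lag of 10 is an arbitrary choice made for illustration.

```r
set.seed(9)

# Stand-in training data and model with the same formula as in the text
train <- data.frame(BMI = sample(10:40, 800, replace = TRUE),
                    Age = sample(0:99, 800, replace = TRUE),
                    Sex = sample(0:1, 800, replace = TRUE))
linear_model <- lm(BMI ~ Age + Sex, data = train)
res <- residuals(linear_model)

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
dw <- sum(diff(res)^2) / sum(res^2)
dw

# Ljung-Box test: a p-value above 0.05 gives no evidence of autocorrelation
lb <- Box.test(res, lag = 10, type = "Ljung-Box")
lb
```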

Normality of residuals

# Q-Q plot of the standardized residuals
plot(linear_model, which = 2, col = "red")

# Normality is checked with the Shapiro-Wilk test. If the p-value > 0.05, it can be concluded that the residuals follow the normal distribution

shapiro.test(linear_model$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  linear_model$residuals
## W = 0.95311, p-value = 2.967e-15

The p-value (2.967e-15) is far below 0.05, so the Shapiro-Wilk test rejects normality of the residuals.

Equal variance (homoscedasticity) of residuals

# The scale-location plot shows the square root of the absolute standardized residuals (y-axis) against the fitted values (x-axis). If the points are scattered randomly, i.e. the plot shows no pattern, the residuals are considered homoscedastic
plot(linear_model, which = 3, col = "red")
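
A calculation-based check of equal variance can be sketched in base R with the studentized (Koenker) form of the Breusch-Pagan test: the squared residuals are regressed on the same predictors, and n times the R-squared of that auxiliary fit is compared with a chi-squared distribution. The model is refitted on regenerated stand-in data; res_sq, aux and the bp_* names are hypothetical.

```r
set.seed(9)

# Stand-in training data and model with the same formula as in the text
train <- data.frame(BMI = sample(10:40, 800, replace = TRUE),
                    Age = sample(0:99, 800, replace = TRUE),
                    Sex = sample(0:1, 800, replace = TRUE))
linear_model <- lm(BMI ~ Age + Sex, data = train)

# Auxiliary regression of the squared residuals on the same predictors
train$res_sq <- residuals(linear_model)^2
aux <- lm(res_sq ~ Age + Sex, data = train)

# Test statistic: n * R-squared of the auxiliary regression
bp_stat <- nrow(train) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1  # number of predictors
bp_pvalue <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(statistic = bp_stat, df = bp_df, p.value = bp_pvalue)
```

A p-value above 0.05 gives no evidence against equal variance.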

Using the model to make predictions on the test data

predict_test <- predict(linear_model, newdata = test)
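
Point predictions can be accompanied by prediction intervals through the interval argument of predict(). The sketch below uses regenerated stand-in data, and pred_int is a hypothetical name. These intervals rely on normally distributed errors, so they should be read with caution when a normality test rejects.

```r
set.seed(9)

# Stand-in training and testing data with the same columns as in the text
train <- data.frame(BMI = sample(10:40, 800, replace = TRUE),
                    Age = sample(0:99, 800, replace = TRUE),
                    Sex = sample(0:1, 800, replace = TRUE))
test <- data.frame(BMI = sample(10:40, 200, replace = TRUE),
                   Age = sample(0:99, 200, replace = TRUE),
                   Sex = sample(0:1, 200, replace = TRUE))
linear_model <- lm(BMI ~ Age + Sex, data = train)

# Matrix with the point prediction (fit) and 95% prediction bounds (lwr, upr)
pred_int <- predict(linear_model, newdata = test,
                    interval = "prediction", level = 0.95)
head(pred_int)
```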

Getting R2, MSE and RMSE

For testing data

# R2() and RMSE() are assumed to come from the caret package
library(caret)

data.frame(R2 = R2(predict_test, test$BMI),
           MSE = mean((predict_test - test$BMI)^2),
           RMSE = RMSE(predict_test, test$BMI))

##           R2      MSE     RMSE
## 1 0.01314017 81.88826 9.049213
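
The same three metrics can be computed in base R for both subsets, which makes the training and testing figures directly comparable. The sketch below regenerates stand-in data, so its numbers differ from the output above. metrics() is a hypothetical helper, and R2 is taken as the squared correlation between predictions and observed values, which is assumed to match the default behaviour of the R2() call used above.

```r
set.seed(9)

# Stand-in data, split and model following the steps in the text
df <- data.frame(SN = seq_len(1000),
                 BMI = sample(10:40, 1000, replace = TRUE),
                 Age = sample(0:99, 1000, replace = TRUE),
                 Sex = sample(0:1, 1000, replace = TRUE))
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
train <- df[ind == 1, ]
test <- df[ind == 2, ]
linear_model <- lm(BMI ~ Age + Sex, data = train)

# Hypothetical helper returning R2, MSE and RMSE for a given data set
metrics <- function(model, data) {
  pred <- predict(model, newdata = data)
  err <- pred - data$BMI
  data.frame(R2 = cor(pred, data$BMI)^2,  # squared correlation
             MSE = mean(err^2),
             RMSE = sqrt(mean(err^2)))
}

train_metrics <- metrics(linear_model, train)
test_metrics <- metrics(linear_model, test)
rbind(train = train_metrics, test = test_metrics)
```

On the training data, the squared correlation equals the Multiple R-squared reported by summary().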

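Decision and conclusion

The model fitted above is considered a weak model because:
i. The coefficient of determination is far below 0.5 (R2 = 0.00709 on the training data).
ii. The regression ANOVA is not statistically significant at the 5% level (F = 2.838, p-value = 0.05912).
iii. The slope for Age is not statistically significant (p-value = 0.6686), although the intercept and the Sex coefficient are significant at the 5% level.

The LINE checks do not fully validate the residuals either: the Shapiro-Wilk test gives a p-value of 2.967e-15, so normality of the residuals is rejected. The value of R2 is very low for both the training data (0.00709) and the testing data (0.0131). The RMSE of about 9.05 (MSE = 81.89) is also large for a BMI that only ranges from 10 to 40. This is expected, since age, BMI and sex were generated independently of one another.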