Linear Regression on Sample Data Set

Overview

In this example we implement a linear regression model on a randomly generated sample data set. We first use sampling techniques to prepare the data needed for the regression. After that, we conduct residual analysis on the fitted model using the LINE test. Finally, if the necessary conditions are satisfied, predictions are carried out on the held-out test samples.

Data Preparation and Split

Data Preparation

The sample data set can be created using the sample() function.
# setting seed
set.seed(16)
# creating variables as per need
# random samples of age from 0-99
age <- sample(0:99, size = 1000, replace = TRUE)
# random sample of BMI from 10-40
bmi <- sample(10:40, size = 1000, replace = TRUE)
# random samples of sex (1 = Male, 0 = Female)
sex <- sample(x = c(1, 0), size = 1000, replace = TRUE)
# creating data frame
df <- data.frame(s_no = c(1:1000), bmi, age, sex)
# head
head(df)
##   s_no bmi age sex
## 1    1  21  96   0
## 2    2  26  58   1
## 3    3  35  60   0
## 4    4  16  14   0
## 5    5  10  42   0
## 6    6  40   7   0
As shown above, a data frame with the variables s_no, bmi, age and sex has been created.
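Before moving on, it is worth quickly verifying that the simulated variables fall in the intended ranges (output omitted; exact values depend on the seed).
# quick check of variable types and ranges in the simulated data
str(df)
range(df$age)   # should lie within 0-99
range(df$bmi)   # should lie within 10-40
table(df$sex)   # counts of 0/1, roughly balanced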

Data Split

We will now split the data set into training and test data, assigning each row to the training set with probability 80% and to the test set with probability 20%.
# splitting data into train data and test data
set.seed(16)
index <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
train_data <- df[index == 1, ]
test_data <- df[index == 2, ]
The data set is now divided into a training set and a test set. With the data prepared, we can move forward and fit the linear regression model.
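As a quick sanity check of the split, we can look at the number of rows in each subset (the exact counts depend on the random draw, but should be roughly 800 and 200).
# sizes of the training and test sets
nrow(train_data)
nrow(test_data)
# observed split proportions
prop.table(table(index))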

Fitting the Linear Regression Model

Linear Regression

We will use bmi as the dependent variable and age and sex as the independent variables to fit the linear regression model.
# fitting linear regression
linear_model <- lm(bmi ~ age + sex, data = train_data)
summary(linear_model)
## 
## Call:
## lm(formula = bmi ~ age + sex, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9140  -7.5551  -0.1702   7.4540  15.5130 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.917073   0.699454  37.053   <2e-16 ***
## age         -0.003028   0.010809  -0.280   0.7795    
## sex         -1.130364   0.627770  -1.801   0.0721 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.852 on 793 degrees of freedom
## Multiple R-squared:  0.004195,   Adjusted R-squared:  0.001684 
## F-statistic:  1.67 on 2 and 793 DF,  p-value: 0.1888
After fitting the linear regression model, we conduct residual analysis of the fitted model. This can be done using the LINE test, where LINE stands for:
  L = Linearity of residuals
  I = Independence of residuals
  N = Normality of residuals
  E = Equal variance of residuals (most important)
  
To conduct the residual analysis, both graphical (suggestive) and formal test (confirmative) methods can be used, as shown in the process below.
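As an optional first look before walking through each assumption separately, the four standard base-R diagnostic plots can be drawn on a single device:
# draw all four base-R diagnostic plots at once
par(mfrow = c(2, 2))
plot(linear_model, col = "blue")
# reset the plotting layout
par(mfrow = c(1, 1))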

Linearity of Residuals

Linearity of the residuals can be assessed using a scatter plot of residuals vs fitted values. It can then be checked further using the summary() function.
# LOESS scatter plot of residuals
plot(linear_model, which = 1, col = "blue")

# summary
summary(linear_model$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -15.9140  -7.5551  -0.1702   0.0000   7.4540  15.5130
Here, the obtained plot is a scatter plot of residuals vs fitted values. The red line represents the LOESS smoothing, which lies close to the zero line of the y-axis, and from the summary the mean of the residuals is zero; together these suggest that the linearity assumption holds.
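The same check can also be reproduced manually with base graphics, which makes explicit what is being drawn (a minimal sketch using fitted(), resid() and lowess()):
# manual residuals vs fitted plot with a LOESS-style smoother
plot(fitted(linear_model), resid(linear_model), col = "blue",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fitted(linear_model), resid(linear_model)), col = "red")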

Independence of Residuals

Autocorrelation occurs when the residuals are not independent of each other. To check for autocorrelation, the acf() function can be used, and the result can be confirmed with the Durbin-Watson test.
# ACF plot
acf(linear_model$residuals)

# calculation
library(car)
durbinWatsonTest(linear_model)
##  lag Autocorrelation D-W Statistic p-value
##    1    -0.001846823      2.000901   0.998
##  Alternative hypothesis: rho != 0
Here, the obtained plot shows no meaningful autocorrelation in the residuals at any lag. Furthermore, the obtained p-value, 0.998, is greater than 0.05, so we fail to reject the null hypothesis of the test. Therefore there is no evidence of autocorrelation between the residuals.
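Note that the lmtest package (loaded later for the Breusch-Pagan test) offers an equivalent dwtest() function, so the same conclusion can be cross-checked without car:
# equivalent Durbin-Watson test from the lmtest package
library(lmtest)
dwtest(linear_model)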

Normality of Residuals

Similarly, to check whether the residuals are normally distributed, we can use a Q-Q plot and confirm our assumption with the Shapiro-Wilk test.
# Q-Q plot of linear model
plot(linear_model, which = 2, col = "blue")

# test
shapiro.test(linear_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  linear_model$residuals
## W = 0.95835, p-value = 2.999e-14
Here, from the obtained Q-Q plot we can see that the residuals deviate clearly from the reference line, so they do not appear normally distributed. Also, from the Shapiro-Wilk test the obtained p-value, 2.999e-14, is less than the predefined significance level of 0.05, which confirms that the residuals do not follow a normal distribution.
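A histogram of the residuals gives another simple visual impression of how far their distribution departs from normality (an optional extra check):
# histogram of residuals as an additional visual check
hist(linear_model$residuals, breaks = 30, col = "lightblue",
     main = "Histogram of residuals", xlab = "Residuals")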
So the fitted model fails the normality test. However, we can still proceed with the model if it passes the last and most important test, equal variance of residuals.

Variance of Residuals

One of the most important assumptions of linear regression is that there should be no heteroscedasticity of the residuals, which means the variance of the residuals should be constant (homoscedasticity). To check for equal variance of the residuals, we use the scale-location plot, which plots the square root of the absolute standardized residuals (y-axis) against the fitted values (x-axis), and we confirm with the Breusch-Pagan test.
# scatter plot of standardized residuals
plot(linear_model, which = 3, col = "blue")
# bptest
library(lmtest)

bptest(linear_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  linear_model
## BP = 1.1366, df = 2, p-value = 0.5665
From the obtained plot the points are scattered randomly, and from the bptest the obtained p-value, 0.5665, is greater than 0.05, so we fail to reject the null hypothesis of homoscedasticity. Therefore the residual variances can be treated as equal (homoscedasticity).
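As a cross-check, the car package (already loaded for the Durbin-Watson test) provides ncvTest(), a score test for non-constant error variance that should point to the same conclusion:
# non-constant variance score test from the car package
library(car)
ncvTest(linear_model)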
As the fitted linear regression model passes this necessary condition, we assume it is a valid model and use it to make predictions.

Predictions Using Fitted Linear Regression Model

After validating the linear regression model with the LINE test, we will evaluate the performance of the model on the test data.
# library
library(dplyr)
library(caret)
# predictions using fitted model
predictions <- linear_model %>%
  predict(test_data)
# calculation of R2, RMSE and MAE
results <- data.frame(R2 = R2(predictions, test_data$bmi),
                      RMSE = RMSE(predictions, test_data$bmi),
                      MAE = MAE(predictions, test_data$bmi)
                      )
# results of predictions
print(results)
##             R2    RMSE    MAE
## 1 0.0001822038 9.80541 8.6228
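For reference, the same metrics can also be computed by hand without caret (a minimal sketch; by default caret's R2() returns the squared correlation between predictions and observations, which is what is reproduced here):
# manual computation of the same metrics, without caret
rmse_manual <- sqrt(mean((test_data$bmi - predictions)^2))
mae_manual  <- mean(abs(test_data$bmi - predictions))
r2_manual   <- cor(test_data$bmi, predictions)^2
c(R2 = r2_manual, RMSE = rmse_manual, MAE = mae_manual)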
From the obtained results, the R-squared for training (0.004195, from the model summary shown earlier) and for testing (0.0001822038) is very low. Also, the RMSE value for testing (9.80541) is too high.
Although this model passes the necessary conditions of the LINE test, its predictive performance is not good enough for it to be useful. So, I suggest not using linear regression on the above data set.