Multiple linear regression analysis is an inferential statistical method used to model the relationship between one dependent variable (the response) and two or more independent variables (the predictors).
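In general form, this model can be written as follows (a standard textbook formulation, added here for reference):

y_i = β0 + β1·x_i1 + β2·x_i2 + … + βp·x_ip + ε_i,   with ε_i ~ Normal(0, σ^2)

where y_i is the response for observation i, x_ij are the values of the p explanatory variables, βj are the coefficients estimated from the data, and ε_i is the random error.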

Data Preparation

The data used in this tutorial is the 2018 Human Development Index (HDI) data for 119 districts/cities in Java. There are 16 columns (apart from the ID column): the HDI value is the response variable and the other 15 are explanatory variables (indicators aggregated from the 2018 Village Potential data), all of numeric type. The description of each column is as follows:

> ID: District/city code (4-digit BPS code)
> IPM: HDI value
> PR_NO_LIS:

data <- read.csv("D:/Portofolio/DATASET/data_ipm_jawa_2018.csv")
str(data)
## 'data.frame':    119 obs. of  17 variables:
##  $ id                : int  3101 3171 3172 3173 3174 3175 3201 3202 3203 3204 ...
##  $ ipm_2018          : num  70.9 84.4 82.1 81 80.9 ...
##  $ PR_NO_LIS         : num  0 0 0 0 0 ...
##  $ PR_SAMPAH         : num  0 0 0 0 0 ...
##  $ PR_TINJA          : num  0 0.0154 0.0308 0 0 ...
##  $ PR_MKM_SUNGAI     : num  0 0.615 0.446 0.25 0.304 ...
##  $ PR_SUNGAI_LMBH    : num  0 0.169 0.369 0.432 0.679 ...
##  $ PR_KUMUH          : num  0.167 0.369 0.554 0.841 0.679 ...
##  $ PRA_1000          : num  0.787 0.199 0.228 0.255 0.134 ...
##  $ SD_1000           : num  0.829 0.311 0.279 0.398 0.234 ...
##  $ SM_1000           : num  0.622 0.219 0.233 0.287 0.172 ...
##  $ RS_PKS_PDK_1000   : num  0.331 0.337 0.298 0.393 0.269 ...
##  $ LIN_BID_POS_P_1000: num  0.29 0.1046 0.1626 0.0552 0.1242 ...
##  $ APT_OBT_1000      : num  0.0414 0.1349 0.1488 0.2293 0.2583 ...
##  $ DOK_DRG_1000      : num  0.373 0.212 0.448 0.313 0.262 ...
##  $ BID_1000          : num  1.906 0.0868 0.2123 0.04 0.1309 ...
##  $ GZ_BURUK_1000     : num  0.16574 0.00757 0.00343 0.01622 0 ...
# Delete the ID column; it is not used in the modeling
data$id <- NULL

Exploratory Data Analysis

Data Summary

Before entering the modeling stage, it is advisable to explore the data to better understand its condition. There are several checks we will perform, starting with the data summary. The summary gives a little insight into the characteristics of each variable. For example, for our response variable ipm_2018, the smallest value in the data is 61.00 and the largest is 86.11, with a median of 71.23 and a mean of 71.98. We also need to make sure there is no missing data. If there is, it needs to be handled, either by deleting rows or columns or by imputation. The dataset we use happens to have no missing data, so we can continue with other checks.

summary(data)
##     ipm_2018       PR_NO_LIS           PR_SAMPAH          PR_TINJA     
##  Min.   :61.00   Min.   :0.0000000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:67.91   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.1522  
##  Median :71.23   Median :0.0001314   Median :0.04982   Median :0.3169  
##  Mean   :71.98   Mean   :0.0015136   Mean   :0.07444   Mean   :0.3485  
##  3rd Qu.:74.92   3rd Qu.:0.0010234   3rd Qu.:0.10515   3rd Qu.:0.4942  
##  Max.   :86.11   Max.   :0.0246482   Max.   :0.35890   Max.   :0.9471  
##  PR_MKM_SUNGAI     PR_SUNGAI_LMBH      PR_KUMUH          PRA_1000     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.1344  
##  1st Qu.:0.07685   1st Qu.:0.1428   1st Qu.:0.02484   1st Qu.:0.3792  
##  Median :0.12648   Median :0.2174   Median :0.07576   Median :0.4748  
##  Mean   :0.17818   Mean   :0.2682   Mean   :0.15779   Mean   :0.5107  
##  3rd Qu.:0.23558   3rd Qu.:0.3617   3rd Qu.:0.19812   3rd Qu.:0.6020  
##  Max.   :0.76471   Max.   :0.9649   Max.   :0.84091   Max.   :1.2772  
##     SD_1000          SM_1000       RS_PKS_PDK_1000  LIN_BID_POS_P_1000
##  Min.   :0.2093   Min.   :0.1627   Min.   :0.1487   Min.   :0.05515   
##  1st Qu.:0.4782   1st Qu.:0.2237   1st Qu.:0.2198   1st Qu.:0.26374   
##  Median :0.6318   Median :0.2760   Median :0.2641   Median :0.43536   
##  Mean   :0.6159   Mean   :0.3019   Mean   :0.3080   Mean   :0.43437   
##  3rd Qu.:0.7508   3rd Qu.:0.3433   3rd Qu.:0.3486   3rd Qu.:0.58097   
##  Max.   :1.1616   Max.   :0.7648   Max.   :0.8780   Max.   :0.92269   
##   APT_OBT_1000      DOK_DRG_1000        BID_1000       GZ_BURUK_1000    
##  Min.   :0.04144   Min.   :0.06402   Min.   :0.04001   Min.   :0.00000  
##  1st Qu.:0.13320   1st Qu.:0.13407   1st Qu.:0.31515   1st Qu.:0.03831  
##  Median :0.18011   Median :0.17146   Median :0.42429   Median :0.07618  
##  Mean   :0.21075   Mean   :0.23894   Mean   :0.42723   Mean   :0.12594  
##  3rd Qu.:0.24919   3rd Qu.:0.28077   3rd Qu.:0.52864   3rd Qu.:0.15679  
##  Max.   :1.09135   Max.   :1.07837   Max.   :1.90602   Max.   :0.93573
# checking for missing values
cek.na <- colSums(is.na(data))
# convert to dataframe to make it neater
cek.na.df <- data.frame(Variabel=names(cek.na), "Jumlah NA"=cek.na, row.names = NULL)
print(cek.na.df)
##              Variabel Jumlah.NA
## 1            ipm_2018         0
## 2           PR_NO_LIS         0
## 3           PR_SAMPAH         0
## 4            PR_TINJA         0
## 5       PR_MKM_SUNGAI         0
## 6      PR_SUNGAI_LMBH         0
## 7            PR_KUMUH         0
## 8            PRA_1000         0
## 9             SD_1000         0
## 10            SM_1000         0
## 11    RS_PKS_PDK_1000         0
## 12 LIN_BID_POS_P_1000         0
## 13       APT_OBT_1000         0
## 14       DOK_DRG_1000         0
## 15           BID_1000         0
## 16      GZ_BURUK_1000         0
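Our dataset has no missing values, but if it did, the handling mentioned above could look like the following minimal sketch (illustrative only; data_complete, impute_median, and data_imputed are hypothetical names, not part of this tutorial's workflow):

# Illustrative only: drop rows containing any NA ...
data_complete <- na.omit(data)
# ... or impute each column's NAs with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
data_imputed <- as.data.frame(lapply(data, impute_median))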

Boxplots of the Distribution of Numerical Variables

All variables in the dataset are numeric, so we can create a boxplot for each variable. Through boxplots we can see how the data spread and identify the presence of outliers. By utilizing the lapply function, we can create all the boxplots at once with just a few lines of code. The boxplot visualization below shows that each variable has its own distribution pattern, and many data points are flagged as outliers. Of course, this needs to be looked at more carefully: if this condition is natural, there is no problem, but if something seems off, it needs to be checked and handled. Here, we will take these conditions as facts on the ground and will not apply any treatment.

library(ggplot2)    # to create a boxplot
library(cowplot)    # to display in a grid
# create boxplots at once for all variables
plots <- lapply(names(data), function(var_x){
  ggplot(data) +
    aes(x = !!sym(var_x)) +
    geom_boxplot(color = "black", fill = "darkgreen",
                 lwd = 0.2, alpha = 0.8)
})
# Show Plot
plot_grid(plotlist = plots)
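To follow up on the outliers flagged by the boxplots, a quick count per variable using the standard 1.5×IQR whisker rule can be done with a few lines like these (an illustrative sketch; count_outliers is a hypothetical helper):

# count points outside the 1.5*IQR whiskers for each variable
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(data, count_outliers)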

Correlation between Variables

Looking at the correlation between variables is an important step. It can give an early indication of whether there is multicollinearity in our data; of course, we can also look at the VIF values later to be more certain. The following correlation plot shows that some columns have high correlation values, so a multicollinearity condition might be present.

As for the correlation between the explanatory variables and the response variable, some variables have a fairly high correlation, such as DOK_DRG_1000, SD_1000, PR_TINJA and so on. This correlation also gives initial information on the direction of the relationship between each explanatory variable and the response. Some of it looks a bit "strange": for example, SD_1000 and SM_1000 are negatively correlated with the HDI value, meaning the more SD_1000, the lower the HDI value in the district/city. Conversely, PR_KUMUH and PR_SUNGAI_LMBH are positively correlated with the HDI value, meaning the higher the proportion of slums, or of rivers polluted by sewage, the higher the HDI value in the district. Many things might cause this, and it needs to be studied more comprehensively, but in this short article we will not look beyond the technicalities of the regression analysis.

library(corrplot)
## corrplot 0.92 loaded
# compute the correlation between columns
corr_matrix <- round(cor(data), 2)
# create the correlation plot
corrplot(corr_matrix, 
         type="lower",
         method = "color", 
         tl.cex = 0.6, 
         tl.col = "black",
         addCoef.col = "#2F2F2F",
         addCoefasPercent = FALSE,
         number.cex = 0.6,
         diag = FALSE)
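As a complement to the plot, a few lines like the following (an illustrative sketch; high_corr is a hypothetical name) can list the variable pairs, if any, whose absolute correlation exceeds the 0.80 warning level discussed later:

# list pairs of variables with |r| > 0.8 (upper triangle only, to avoid duplicates)
high_corr <- which(abs(corr_matrix) > 0.8 & upper.tri(corr_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(corr_matrix)[high_corr[, 1]],
           var2 = colnames(corr_matrix)[high_corr[, 2]],
           r    = corr_matrix[high_corr])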

Regression Model

If the data condition is 'clean', we can enter the multiple linear regression modeling stage. The stats package provides the lm function, which we can use for a quick and simple linear regression model. There are several other packages, but for now we will only use lm. The parameters that need to be specified are formula and data. The formula names the response variable followed by the names of the explanatory variables (for example: y ~ x1 + x2 + x3). If all variables in the data are to be used, we can simply write y ~ . instead.

# Train a linear regression model
model_reg <- lm(ipm_2018 ~ ., data = data)
# display the model output summary
summary(model_reg)
## 
## Call:
## lm(formula = ipm_2018 ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9279 -1.3472 -0.0228  1.2981  6.6353 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         77.69339    1.61562  48.089  < 2e-16 ***
## PR_NO_LIS          -54.16621   72.84019  -0.744 0.458793    
## PR_SAMPAH          -12.32380    3.34207  -3.687 0.000364 ***
## PR_TINJA            -3.85873    1.63650  -2.358 0.020266 *  
## PR_MKM_SUNGAI       -0.30645    2.23861  -0.137 0.891383    
## PR_SUNGAI_LMBH      -0.06651    1.70957  -0.039 0.969043    
## PR_KUMUH             1.14777    1.87576   0.612 0.541953    
## PRA_1000            -1.47333    1.35941  -1.084 0.280981    
## SD_1000             -8.47723    2.23944  -3.785 0.000258 ***
## SM_1000             -3.55704    2.85238  -1.247 0.215211    
## RS_PKS_PDK_1000      5.06631    3.63456   1.394 0.166340    
## LIN_BID_POS_P_1000   2.95048    2.03778   1.448 0.150685    
## APT_OBT_1000        -7.72740    2.88662  -2.677 0.008646 ** 
## DOK_DRG_1000        13.50010    2.48140   5.441 3.61e-07 ***
## BID_1000            -2.22060    1.52796  -1.453 0.149178    
## GZ_BURUK_1000        0.63440    1.56650   0.405 0.686335    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.282 on 103 degrees of freedom
## Multiple R-squared:  0.8474, Adjusted R-squared:  0.8252 
## F-statistic: 38.14 on 15 and 103 DF,  p-value: < 2.2e-16

The output presented above displays various values that are commonly interpreted from the modeling results. Some of these are:

The t-test statistic

This component shows which explanatory variables are significant. If the p-value is smaller than a chosen significance level α (e.g. 0.05), then the variable is significant at a confidence level of 1 − α. This information can also be read from the significance codes in the last column: '***' states that the variable is significant even for very small values of α (below 0.001), while the conventional 0.05 level corresponds to '*' and the 0.1 level to '.'. The results above show that quite a few variables are not significant. These results still need to be validated by checking the assumptions and conditions that must not be violated in linear regression models.

The F-test statistic

If the t-test statistic shows the partial significance of each variable, the F-test statistic measures the significance of the model as a whole. In this example, the p-value < 2.2e-16 (practically 0) indicates the model is significant even at a very small error rate. In other words, there is at least one explanatory variable whose coefficient differs from zero, which is consistent with the results of the partial tests.

Coefficient of Determination (R^2)

The R^2 value shows the amount of variability in the response variable that can be explained by the explanatory variables. In this example, R^2 = 0.8474, meaning the explanatory variables in the model are able to explain about 84.74 percent of the variability in HDI values.

Adjusted R^2

The adjusted R^2 is the R^2 value penalized by the number of explanatory variables used. This measure is usually used when comparing models while still considering their complexity: a simpler model is penalized less than a complex one. For example, suppose Model A with 15 variables has an adjusted R^2 of 0.84 and Model B with only 5 variables has an adjusted R^2 of 0.82. This can be interpreted as: with a large reduction in model complexity (from 15 variables to only 5), the model still performs well and the adjusted R^2 drops only slightly (from 0.84 to 0.82). In a situation like this, we might be more inclined to choose Model B over Model A.
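As a quick check, the adjusted R^2 can be recomputed from R^2, the number of observations n, and the number of predictors p with the standard formula adjusted R^2 = 1 − (1 − R^2)(n − 1)/(n − p − 1); the values below are taken from the summary output above:

# recompute adjusted R^2 from the values reported by summary(model_reg)
r2 <- 0.8474   # Multiple R-squared
n  <- 119      # number of observations
p  <- 15       # number of explanatory variables
1 - (1 - r2) * (n - 1) / (n - p - 1)   # approximately 0.8252, matching the summary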

Assumption Checking

Multiple linear regression analysis has several assumptions and conditions that must be met. If some of them are not met, the results obtained can be inaccurate or even invalid. The following is an explanation of the assumptions in multiple linear regression analysis.

Linearity

The main premise of multiple linear regression analysis is that there is a linear relationship between the response variable and the explanatory variables. This linearity can be checked visually using a scatterplot, which should show a relationship with a straight line trend.

library(ggplot2)
library(cowplot)
var.respon <- "ipm_2018"
var.penjelas <- setdiff(names(data), var.respon)
# scatterplot of the response against each explanatory variable
plots <- lapply(var.penjelas, function(var_x){
  ggplot(data, aes(x = !!sym(var_x), y = !!sym(var.respon))) +
    geom_point(color = "blue") +
    theme_minimal() +
    theme(axis.text = element_blank(), axis.title = element_text(size = 8))
})
plot_grid(plotlist = plots, nrow = 5, ncol = 3)
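To make the linear trend easier to judge, a least-squares line can be overlaid on each panel. This is a minimal variation of the code above (plots_lm is a hypothetical name used for illustration):

# same scatterplots, with a fitted least-squares line added to each panel
plots_lm <- lapply(var.penjelas, function(var_x){
  ggplot(data, aes(x = !!sym(var_x), y = !!sym(var.respon))) +
    geom_point(color = "blue") +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    theme_minimal() +
    theme(axis.text = element_blank(), axis.title = element_text(size = 8))
})
plot_grid(plotlist = plots_lm, nrow = 5, ncol = 3)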

Based on the scatter plots, we can see the relationship pattern between the response variable and each explanatory variable. Some explanatory variables show a fairly strong linear relationship, while others do not seem to be linearly related; those variables may turn out not to have a significant effect.

Depending on the researcher's interest, these non-linearly related variables can be excluded from the model or given certain treatments. If they are still to be included, one technique that can be tried is data transformation, such as the natural log, square root, or Box-Cox transformation.

Normality

Multiple linear regression analysis assumes that the residuals (the differences between observed and predicted values) are normally distributed. This assumption can be assessed by examining a histogram or Q-Q plot of the residuals, or through statistical tests such as the Kolmogorov-Smirnov test, the Shapiro-Wilk test and so on.

Based on the Q-Q plot and the formal test using the K-S test below (fail to reject H0), the model residuals fulfill the normality assumption, namely that they are distributed Normal(0, σ^2).

# Get the residuals of the model
model.res <- residuals(model_reg)
mean.res = round(mean(model.res), 4)
sd.res = round(sd(model.res), 4)
cat('Center value of residuals:', mean.res, '\n')
## Center value of residuals: 0
cat('Standard deviation of residuals:', sd.res, '\n')
## Standard deviation of residuals: 2.1323
# Create a Q-Q plot of the residuals
qqnorm(model.res)
qqline(model.res)

# Perform the Kolmogorov-Smirnov test on the residuals
ks_result <- ks.test(model.res, 'pnorm', mean = mean.res, sd = sd.res)
ks_result
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  model.res
## D = 0.049367, p-value = 0.9339
## alternative hypothesis: two-sided
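As a cross-check, the Shapiro-Wilk test mentioned above is also available in base R and can be applied directly to the residuals (shown as an illustrative alternative; its output is not reproduced here):

# Shapiro-Wilk test as an alternative formal normality test
shapiro.test(model.res)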

Homoskedasticity

The variance of the residuals should be consistent across the levels of the explanatory variables. The scatterplot of the residuals versus the predicted values should not display any visible patterns, such as a conical or funnel-shaped distribution, which would indicate heteroscedasticity. If required, there are also statistical tests used to check this assumption such as the Breusch-Pagan test, White’s test and so on. If heteroscedasticity is present, it can also be treated through data transformation.

In this case, the Breusch-Pagan test does not reject H0, and in the plot below the residuals look relatively random across the predicted values, with no tendency to form a cone pattern. Thus it can be said that there is no heteroscedasticity.

library(lmtest) # package for linear model testing
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Perform Breusch-Pagan test
breusch_pagan_test <- bptest(model_reg)
breusch_pagan_test
## 
##  studentized Breusch-Pagan test
## 
## data:  model_reg
## BP = 15.26, df = 15, p-value = 0.4329
# Plot residuals vs predicted values with red scatter plot
plot(fitted(model_reg),model.res,
     xlab = "Fitted Values", ylab ="Residuals", 
     main = "Residuals vs Fitted Values",
     col="red", pch=20, cex=1.5)
# Adding a red dashed line at value 0
abline(h = 0, col = "blue", lty = 2)
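White's test, also mentioned above, has no dedicated function in lmtest, but a commonly used shortcut (an illustrative variant, not the full White test with all cross-products) is to pass bptest an auxiliary formula containing the fitted values and their squares:

# White-style test via bptest's auxiliary-regression formula
bptest(model_reg, ~ fitted(model_reg) + I(fitted(model_reg)^2))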

Non-Multicollinearity

Multicollinearity (this is not strictly an assumption) is a problem that needs to be addressed. Analysis of a model in the presence of multicollinearity can be misleading: the variance of the estimates becomes larger, which widens the confidence intervals. As a result, tests can become insignificant even though a variable should have a strong influence. It is also possible for a coefficient estimate to change sign (positive to negative or vice versa). If this happens, the analysis results become very unreliable. The existence of a multicollinearity problem can be detected in the following ways:

> The correlation matrix, where a correlation coefficient above 0.80 can be a 'warning'. In the previous section, we presented the correlation matrix and saw some explanatory variables with high correlation.
> The Variance Inflation Factor (VIF), with VIF values above 10 indicating a multicollinearity problem.

library(car) # vif function available in car package
## Loading required package: carData
# Calculate VIF for each variable in the model
vif_values <- vif(model_reg)
data.frame(vif_values)
##                    vif_values
## PR_NO_LIS            1.674004
## PR_SAMPAH            1.752889
## PR_TINJA             3.883222
## PR_MKM_SUNGAI        2.519156
## PR_SUNGAI_LMBH       2.248597
## PR_KUMUH             2.945363
## PRA_1000             1.840095
## SD_1000              4.638975
## SM_1000              2.323137
## RS_PKS_PDK_1000      5.283640
## LIN_BID_POS_P_1000   3.867567
## APT_OBT_1000         3.014481
## DOK_DRG_1000         4.545848
## BID_1000             2.377227
## GZ_BURUK_1000        1.272896

If such a multicollinearity problem exists, we may need to exclude one or more of the strongly correlated variables.

Based on the results above, there is no indication of a multicollinearity problem among the explanatory variables (all VIF values are well below 10). This reassures us that the multiple linear regression analysis we conducted meets the assumptions, and there are no problems that could interfere with the results and their interpretation.
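If some VIF value had been above 10, the fix could look like the sketch below (purely illustrative: RS_PKS_PDK_1000, the largest VIF here at about 5.28, is used only as an example, and model_reduced is a hypothetical name):

# illustrative: drop the variable with the highest VIF and refit,
# then compare adjusted R^2 with the full model
model_reduced <- lm(ipm_2018 ~ . - RS_PKS_PDK_1000, data = data)
summary(model_reduced)$adj.r.squared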

Source: https://sainsdata.id/statistika/9773/analisis-regresi-linier-berganda-dengan-r/