Multiple linear regression analysis is an inferential statistical analysis used to model the relationship between one dependent variable (response) and two or more independent variables (predictors).
The data used in this tutorial is the 2018 Human Development Index (HDI) data for 119 districts/cities in Java. There are 16 columns (apart from the ID column): the HDI value will be the response variable and the other 15 serve as explanatory variables (indicators resulting from the aggregation of 2018 Village Potential data), all of numeric type. The description of the columns is as follows:

- ID: district/city code (4-digit BPS code)
- IPM: HDI value
- PR_NO_LIS: …
data <- read.csv("D:/Portofolio/DATASET/data_ipm_jawa_2018.csv")
str(data)
## 'data.frame': 119 obs. of 17 variables:
## $ id : int 3101 3171 3172 3173 3174 3175 3201 3202 3203 3204 ...
## $ ipm_2018 : num 70.9 84.4 82.1 81 80.9 ...
## $ PR_NO_LIS : num 0 0 0 0 0 ...
## $ PR_SAMPAH : num 0 0 0 0 0 ...
## $ PR_TINJA : num 0 0.0154 0.0308 0 0 ...
## $ PR_MKM_SUNGAI : num 0 0.615 0.446 0.25 0.304 ...
## $ PR_SUNGAI_LMBH : num 0 0.169 0.369 0.432 0.679 ...
## $ PR_KUMUH : num 0.167 0.369 0.554 0.841 0.679 ...
## $ PRA_1000 : num 0.787 0.199 0.228 0.255 0.134 ...
## $ SD_1000 : num 0.829 0.311 0.279 0.398 0.234 ...
## $ SM_1000 : num 0.622 0.219 0.233 0.287 0.172 ...
## $ RS_PKS_PDK_1000 : num 0.331 0.337 0.298 0.393 0.269 ...
## $ LIN_BID_POS_P_1000: num 0.29 0.1046 0.1626 0.0552 0.1242 ...
## $ APT_OBT_1000 : num 0.0414 0.1349 0.1488 0.2293 0.2583 ...
## $ DOK_DRG_1000 : num 0.373 0.212 0.448 0.313 0.262 ...
## $ BID_1000 : num 1.906 0.0868 0.2123 0.04 0.1309 ...
## $ GZ_BURUK_1000 : num 0.16574 0.00757 0.00343 0.01622 0 ...
# Drop the id column, which is not used in the modeling
data$id <- NULL
Before entering the modeling stage, it is advisable to explore the data to better understand its condition. We will perform several checks, starting with the data summary. The summary gives a little insight into the characteristics of each variable: for example, for our response variable ipm_2018, the smallest value in the data is 61.00 and the largest is 86.11, with a median of 71.23 (mean 71.98). We also need to make sure there is no missing data. If there were missing data, it would need to be handled, either by deleting rows or columns or by imputation. In the dataset we use there happens to be no missing data, so we can continue with the other checks.
summary(data)
## ipm_2018 PR_NO_LIS PR_SAMPAH PR_TINJA
## Min. :61.00 Min. :0.0000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:67.91 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.1522
## Median :71.23 Median :0.0001314 Median :0.04982 Median :0.3169
## Mean :71.98 Mean :0.0015136 Mean :0.07444 Mean :0.3485
## 3rd Qu.:74.92 3rd Qu.:0.0010234 3rd Qu.:0.10515 3rd Qu.:0.4942
## Max. :86.11 Max. :0.0246482 Max. :0.35890 Max. :0.9471
## PR_MKM_SUNGAI PR_SUNGAI_LMBH PR_KUMUH PRA_1000
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.1344
## 1st Qu.:0.07685 1st Qu.:0.1428 1st Qu.:0.02484 1st Qu.:0.3792
## Median :0.12648 Median :0.2174 Median :0.07576 Median :0.4748
## Mean :0.17818 Mean :0.2682 Mean :0.15779 Mean :0.5107
## 3rd Qu.:0.23558 3rd Qu.:0.3617 3rd Qu.:0.19812 3rd Qu.:0.6020
## Max. :0.76471 Max. :0.9649 Max. :0.84091 Max. :1.2772
## SD_1000 SM_1000 RS_PKS_PDK_1000 LIN_BID_POS_P_1000
## Min. :0.2093 Min. :0.1627 Min. :0.1487 Min. :0.05515
## 1st Qu.:0.4782 1st Qu.:0.2237 1st Qu.:0.2198 1st Qu.:0.26374
## Median :0.6318 Median :0.2760 Median :0.2641 Median :0.43536
## Mean :0.6159 Mean :0.3019 Mean :0.3080 Mean :0.43437
## 3rd Qu.:0.7508 3rd Qu.:0.3433 3rd Qu.:0.3486 3rd Qu.:0.58097
## Max. :1.1616 Max. :0.7648 Max. :0.8780 Max. :0.92269
## APT_OBT_1000 DOK_DRG_1000 BID_1000 GZ_BURUK_1000
## Min. :0.04144 Min. :0.06402 Min. :0.04001 Min. :0.00000
## 1st Qu.:0.13320 1st Qu.:0.13407 1st Qu.:0.31515 1st Qu.:0.03831
## Median :0.18011 Median :0.17146 Median :0.42429 Median :0.07618
## Mean :0.21075 Mean :0.23894 Mean :0.42723 Mean :0.12594
## 3rd Qu.:0.24919 3rd Qu.:0.28077 3rd Qu.:0.52864 3rd Qu.:0.15679
## Max. :1.09135 Max. :1.07837 Max. :1.90602 Max. :0.93573
# checking for missing values
cek.na <- colSums(is.na(data))
# convert to dataframe to make it neater
cek.na.df <- data.frame(Variabel=names(cek.na), "Jumlah NA"=cek.na, row.names = NULL)
print(cek.na.df)
## Variabel Jumlah.NA
## 1 ipm_2018 0
## 2 PR_NO_LIS 0
## 3 PR_SAMPAH 0
## 4 PR_TINJA 0
## 5 PR_MKM_SUNGAI 0
## 6 PR_SUNGAI_LMBH 0
## 7 PR_KUMUH 0
## 8 PRA_1000 0
## 9 SD_1000 0
## 10 SM_1000 0
## 11 RS_PKS_PDK_1000 0
## 12 LIN_BID_POS_P_1000 0
## 13 APT_OBT_1000 0
## 14 DOK_DRG_1000 0
## 15 BID_1000 0
## 16 GZ_BURUK_1000 0
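Since there happen to be no missing values, no handling is needed here. Purely as an illustration of the two options mentioned earlier (dropping incomplete rows or a simple median imputation), a minimal sketch could look like the following; data_complete and data_imputed are names introduced only for this illustration, and with this dataset both options leave the data unchanged.
# Option 1: drop any row that contains a missing value
data_complete <- na.omit(data)
# Option 2: impute each numeric column with its median
data_imputed <- data
for (kol in names(data_imputed)) {
  na.idx <- is.na(data_imputed[[kol]])
  data_imputed[[kol]][na.idx] <- median(data_imputed[[kol]], na.rm = TRUE)
}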
All variables in the dataset are numeric, so we can create a boxplot for each of them. Boxplots show how the data are spread and can also be used to identify outliers. By using the lapply function, we can create all the boxplots at once with just a few lines of code. The boxplot visualization below shows that the variables have quite diverse distribution patterns, and in several variables quite a few data points are flagged as outliers. Of course, this needs to be looked at more carefully: if the condition is natural, there is no problem, but if something does not look right, it needs to be checked and handled. Here we will treat these conditions as facts on the ground and will not apply any treatment.
library(ggplot2) # to create a boxplot
library(cowplot) # to display in a grid
# create boxplots for all variables at once
plots <- lapply(names(data), function(var_x) {
  ggplot(data) +
    aes(x = !!sym(var_x)) +
    geom_boxplot(color = "black", fill = "darkgreen",
                 lwd = 0.2, alpha = 0.8)
})
# Show Plot
plot_grid(plotlist = plots)
#### Correlation between Variables

Looking at the correlation between variables is an important step. It can be an early indication of whether there is multicollinearity in our data; of course, we can later look at the VIF values to be more certain. The following correlation plot shows that some columns have high correlation values, so there might be a multicollinearity condition. As for the correlation between the explanatory variables and the response variable, some variables have a fairly high correlation, such as DOK_DRG_1000, SD_1000, PR_TINJA and so on. In addition, these correlations give initial information about the relationship between each explanatory variable and the response. Some of them look a bit "strange": for example, SD_1000 and SM_1000 are negatively correlated with the HDI value, meaning the higher SD_1000, the lower the HDI value in the district/city. Conversely, PR_KUMUH and PR_SUNGAI_LMBH are positively correlated with the HDI value, meaning the higher the proportion of slums, or the proportion of rivers polluted by sewage, the higher the HDI value in the district. There are many things that might cause this condition, and it would need to be studied more comprehensively, but in this short article we will not look beyond the technicalities of regression analysis.
library(corrplot)
## corrplot 0.92 loaded
# compute the correlation between columns
corr_matrix <- round(cor(data), 2)
# create the correlation plot
corrplot(corr_matrix,
         type = "lower",
         method = "color",
         tl.cex = 0.6,
         tl.col = "black",
         addCoef.col = "#2F2F2F",
         addCoefasPercent = FALSE,
         number.cex = 0.6,
         diag = FALSE)
#### Regression Model

If the data condition is 'clean', we can move on to the multiple linear regression modeling stage. The stats package provides the lm function, which we can use for a quick and simple linear regression model. There are several other packages, but for now we will only use lm. The parameters that need to be specified in lm are formula and data. The formula names the response variable followed by the explanatory variables (for example: y ~ x1 + x2 + x3). However, if all variables in the data are to be used, we can simply write y ~ .
# Train a linear regression model
model_reg <- lm(ipm_2018 ~ ., data = data)
# display the model output summary
summary(model_reg)
##
## Call:
## lm(formula = ipm_2018 ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9279 -1.3472 -0.0228 1.2981 6.6353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.69339 1.61562 48.089 < 2e-16 ***
## PR_NO_LIS -54.16621 72.84019 -0.744 0.458793
## PR_SAMPAH -12.32380 3.34207 -3.687 0.000364 ***
## PR_TINJA -3.85873 1.63650 -2.358 0.020266 *
## PR_MKM_SUNGAI -0.30645 2.23861 -0.137 0.891383
## PR_SUNGAI_LMBH -0.06651 1.70957 -0.039 0.969043
## PR_KUMUH 1.14777 1.87576 0.612 0.541953
## PRA_1000 -1.47333 1.35941 -1.084 0.280981
## SD_1000 -8.47723 2.23944 -3.785 0.000258 ***
## SM_1000 -3.55704 2.85238 -1.247 0.215211
## RS_PKS_PDK_1000 5.06631 3.63456 1.394 0.166340
## LIN_BID_POS_P_1000 2.95048 2.03778 1.448 0.150685
## APT_OBT_1000 -7.72740 2.88662 -2.677 0.008646 **
## DOK_DRG_1000 13.50010 2.48140 5.441 3.61e-07 ***
## BID_1000 -2.22060 1.52796 -1.453 0.149178
## GZ_BURUK_1000 0.63440 1.56650 0.405 0.686335
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.282 on 103 degrees of freedom
## Multiple R-squared: 0.8474, Adjusted R-squared: 0.8252
## F-statistic: 38.14 on 15 and 103 DF, p-value: < 2.2e-16
The output above displays the values that are commonly interpreted from a regression fit. Some of these are:

##### The t-test statistic

This component shows which explanatory variables are individually significant. If the p-value is smaller than a chosen significance level, e.g. 0.05, the variable is significant at the corresponding confidence level. The same information can be read from the significance codes in the last column: '***' means the variable is significant even at a very small significance level (0.001), '*' corresponds to the standard 0.05 level, while '.' and a blank indicate weaker or no evidence. The results above show that quite a few variables are not significant. These results still need to be validated by checking the assumptions and conditions that should not be violated in a linear regression model.

##### The F-test statistic

While the t statistics show the partial significance of each variable, the F statistic measures the significance of the model as a whole. In this example, the p-value < 2.2e-16 indicates that the model is significant even at a very small error rate; in other words, at least one explanatory variable has a coefficient that differs from zero. This agrees with the results of the partial tests.

##### Coefficient of Determination (R^2)

The R^2 value shows how much of the variability in the response variable can be explained by the explanatory variables. In this example R^2 = 0.8474, which means the explanatory variables in the model are able to explain about 84.74 percent of the variability in HDI values.

##### Adjusted R^2

The adjusted R^2 is the R^2 value penalized by the number of explanatory variables used. It is usually used when comparing models while still taking model complexity into account: a simpler model is penalized less than a complex one. For illustration, suppose Model A with 15 variables and Model B with only 5 variables have adjusted R^2 values of 0.84 and 0.82, respectively. With a large reduction in model complexity (from 15 variables to only 5), Model B still performs well and the fit drops only slightly (from 0.84 to 0.82). In a situation like this, we might be more inclined to choose Model B over Model A.
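To make the relationship between R^2 and adjusted R^2 concrete, here is a short sketch that pulls both values from the summary of model_reg fitted above and recomputes the adjusted value by hand (r2.adj.manual is a name introduced only for this illustration); with n = 119 observations and p = 15 explanatory variables it should reproduce the reported 0.8252 up to rounding.
# extract goodness-of-fit values from the fitted model
s <- summary(model_reg)
r2 <- s$r.squared            # multiple R-squared (0.8474)
r2.adj <- s$adj.r.squared    # adjusted R-squared (0.8252)
n <- nrow(data)                   # number of observations
p <- length(coef(model_reg)) - 1  # number of explanatory variables
# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2.adj.manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(R2 = r2, R2_adj = r2.adj, R2_adj_manual = r2.adj.manual)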
Multiple linear regression analysis has several assumptions and conditions that must be met. If any of them are violated, the validity of the results may suffer, or the conclusions may even be wrong. The following is an explanation of the assumptions in multiple linear regression analysis.
##### Linearity

The main premise of multiple linear regression analysis is that there is a linear relationship between the response variable and the explanatory variables. This linearity can be checked visually using scatterplots, which should show a relationship with a straight-line trend.
library(ggplot2)
library(cowplot)
var.respon <- "ipm_2018"
var.penjelas <- setdiff(names(data), var.respon)
# scatter plots of the response against each explanatory variable
plots <- lapply(var.penjelas, function(var_x) {
  ggplot(data) +
    aes(x = !!sym(var_x), y = !!sym(var.respon)) +
    geom_point(color = "blue") +
    theme_minimal() +
    theme(axis.text = element_blank(),
          axis.title = element_text(size = 8))
})
plot_grid(plotlist = plots, nrow = 5, ncol = 3)
Based on the scatter plot, we can see the relationship pattern between
the response variable and the explanatory variables. For some
explanatory variables, there is a fairly strong linear relationship.
However, there are some variables that do not seem to be linearly
related. These variables may not have a significant effect.
Depending on the researcher's interest, these non-linearly related variables can be excluded from the model or given certain treatments. If these variables are still to be included, one technique that can be tried is data transformation, for example the natural log, square root, or Box-Cox transformation.

##### Normality

Multiple linear regression analysis assumes that the residuals (the difference between observed and predicted values) are normally distributed. This assumption can be assessed by examining a histogram or Q-Q plot of the residuals, or through statistical tests such as the Kolmogorov-Smirnov test, the Shapiro-Wilk test, and so on.
Based on the Q-Q plot and the formal Kolmogorov-Smirnov test below (do not reject H0), the model residuals satisfy the normality assumption, i.e. they are spread as Normal(0, σ^2).
# Get the residuals of the model
model.res <- residuals(model_reg)
mean.res = round(mean(model.res), 4)
sd.res = round(sd(model.res), 4)
cat('Center value of residuals:', mean.res, '\n')
## Center value of residuals: 0
cat('Standard deviation of residuals:', sd.res, '\n')
## Standard deviation of residuals: 2.1323
# Create a Q-Q plot of the residuals
qqnorm(model.res)
qqline(model.res)
# Perform the Kolmogorov-Smirnov test on the residuals
ks_result <- ks.test(model.res, 'pnorm', mean = mean.res, sd = sd.res)
ks_result
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: model.res
## D = 0.049367, p-value = 0.9339
## alternative hypothesis: two-sided
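The Shapiro-Wilk test mentioned earlier can serve as an additional normality check; since the residuals are already stored in model.res, a minimal sketch (output not shown here) is:
# Shapiro-Wilk test of normality on the model residuals
# (a p-value above 0.05 would again not reject normality)
shapiro.test(model.res)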
##### Homoscedasticity

The variance of the residuals should be consistent across the levels of the explanatory variables. A scatterplot of the residuals versus the predicted values should not display any visible pattern, such as a conical or funnel-shaped spread, which would indicate heteroscedasticity. If required, there are also statistical tests for checking this assumption, such as the Breusch-Pagan test, White's test and so on. If heteroscedasticity is present, it can also be treated through data transformation.
In this case, the Breusch-Pagan test does not reject H0, and the plot of residuals against fitted values looks relatively random for each predicted value, with no tendency to form a cone pattern. Thus it can be said that there is no heteroscedasticity.
library(lmtest) # package for linear model testing
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Perform Breusch-Pagan test
breusch_pagan_test <- bptest(model_reg)
breusch_pagan_test
##
## studentized Breusch-Pagan test
##
## data: model_reg
## BP = 15.26, df = 15, p-value = 0.4329
# Plot residuals vs predicted values with red scatter plot
plot(fitted(model_reg),model.res,
xlab = "Fitted Values", ylab ="Residuals",
main = "Residuals vs Fitted Values",
col="red", pch=20, cex=1.5)
# Add a blue dashed line at zero
abline(h = 0, col = "blue", lty = 2)
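White's test, also mentioned above, is not run in the original analysis; as a sketch, one common way to approximate it with the lmtest package is to give bptest() an auxiliary variance formula in the fitted values and their squares (output not shown here).
# White-type test: variance modeled on the fitted values and their squares
white_test <- bptest(model_reg, ~ fitted(model_reg) + I(fitted(model_reg)^2))
white_test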
#### Non-Multicollinearity

Multicollinearity (this is not strictly an assumption) is a problem that needs to be addressed. Analyzing a model in the presence of multicollinearity can be misleading: the variance of the estimates becomes larger, which also widens the confidence intervals. Wider confidence intervals mean that tests can become insignificant even though the variable actually has a strong influence. Multicollinearity can even cause a coefficient estimate to change sign (positive to negative or vice versa). If this happens, the analysis results become very unreliable. The existence of a multicollinearity problem can be detected in the following ways (a quick programmatic check of the first criterion is sketched after this list):

- The correlation matrix, where a correlation coefficient of about 0.80 or more can be a 'warning'. In the previous section we presented the correlation matrix and saw some explanatory variables with high correlation.
- The Variance Inflation Factor (VIF), with VIF values above 10 indicating a multicollinearity problem.
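As a sketch of the first check, the corr_matrix object computed in the correlation section can be scanned for pairs whose absolute correlation reaches the 0.80 'warning' level (note that the matrix also contains the response ipm_2018, and the result may be empty if no pair crosses the threshold):
# find variable pairs with |correlation| >= 0.80 (upper triangle only, to avoid duplicates)
high.corr <- which(abs(corr_matrix) >= 0.80 & upper.tri(corr_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(corr_matrix)[high.corr[, 1]],
           var2 = colnames(corr_matrix)[high.corr[, 2]],
           r = corr_matrix[high.corr])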
library(car) # vif function available in car package
## Loading required package: carData
# Calculate VIF for each variable in the model
vif_values <- vif(model_reg)
data.frame(vif_values)
## vif_values
## PR_NO_LIS 1.674004
## PR_SAMPAH 1.752889
## PR_TINJA 3.883222
## PR_MKM_SUNGAI 2.519156
## PR_SUNGAI_LMBH 2.248597
## PR_KUMUH 2.945363
## PRA_1000 1.840095
## SD_1000 4.638975
## SM_1000 2.323137
## RS_PKS_PDK_1000 5.283640
## LIN_BID_POS_P_1000 3.867567
## APT_OBT_1000 3.014481
## DOK_DRG_1000 4.545848
## BID_1000 2.377227
## GZ_BURUK_1000 1.272896
If such a multicollinearity problem were present, we might need to exclude one or more of the strongly correlated variables, as illustrated in the sketch below.
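Purely as an illustration (it is not needed for this data), if the variable with the largest VIF, RS_PKS_PDK_1000, had to be dropped, the model could be refitted without it using update() and the VIFs recalculated; model_reg2 is just an illustrative name.
# refit the model without RS_PKS_PDK_1000 and recheck the VIF values
model_reg2 <- update(model_reg, . ~ . - RS_PKS_PDK_1000)
vif(model_reg2)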
Based on the results above, there is no indication of a multicollinearity problem among the explanatory variables. This reassures us that the multiple linear regression analysis we conducted meets the assumptions and that there are no problems that could interfere with the results and their interpretation.