Multiple Linear Regression Analysis with R

The project is a modeling of medicine availability using a multiple linear regression model with a number of regressors derived using exploratory factor analysis and measuring physical conditions of clinical pharmacies, stock management practices, medicine ordering, stockout management, medicine receipting, inventory management, medicine reporting systems and observed inventory management systems.

The dataset we will use is a curated and contains the 8 composite constructs to be used as explanatory variables in the multiple regression analysis modeling factors influencing medicine availability.

We will import the data and adjust the variables for analysis, and also adjusting variable values for randomness and accounting for multiple linear regression model assumptions.

Packages <- c("gtsummary","broom","readxl","descr","stargazer","ggplot2","DMwR2")
lapply(Packages, library, character.only = TRUE)

## Warning: package 'gtsummary' was built under R version 4.3.1

## #Uighur

## Warning: package 'broom' was built under R version 4.3.1

## Warning: package 'readxl' was built under R version 4.3.1

## Warning: package 'descr' was built under R version 4.3.1

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

## Warning: package 'ggplot2' was built under R version 4.3.1

## Warning: package 'DMwR2' was built under R version 4.3.1

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## [[1]]
## [1] "gtsummary" "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
## [1] "broom"     "gtsummary" "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "readxl"    "broom"     "gtsummary" "stats"     "graphics"  "grDevices"
##  [7] "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "descr"     "readxl"    "broom"     "gtsummary" "stats"     "graphics" 
##  [7] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[5]]
##  [1] "stargazer" "descr"     "readxl"    "broom"     "gtsummary" "stats"    
##  [7] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[6]]
##  [1] "ggplot2"   "stargazer" "descr"     "readxl"    "broom"     "gtsummary"
##  [7] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [13] "base"     
## 
## [[7]]
##  [1] "DMwR2"     "ggplot2"   "stargazer" "descr"     "readxl"    "broom"    
##  [7] "gtsummary" "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [13] "methods"   "base"

Data <- read_excel("C:/Users/tawon/Documents/R/REG_DATA.xlsx")

Data$Availability <- as.numeric(Data$`_ADEQUATEAVAILABILITY_`)

## Warning: NAs introduced by coercion

Data$PhysicalCond <- Data$abs_Physical_Conditions
Data$StockManagement <- Data$abs_Stock_Management
Data$MedicineOrdering <- Data$abs_Medicine_Ordering
Data$StockOutMgt <- Data$abs_Stockout_Mgt
Data$MedReceipting <- Data$abs_MedicineReceipting
Data$InventoryMgt <- Data$abs_InventoryManagement
Data$ReportingSystem <- Data$abs_Reporting_System
Data$InventorySystem <- Data$abs_InventorySystem

In the foregoing steps, we simplified variable naming conventions, and changed the data structure of the dependent variable from string object to a numeric, so that the decimal values could not be rounded off as when the function as.integer is used. We then create a simplified data frame containing only the variables required for this multiple linear regression modelling project.

Df <- Data[, c("Availability","PhysicalCond","StockManagement","MedicineOrdering", "StockOutMgt","MedReceipting","InventoryMgt","ReportingSystem","InventorySystem")]
summary(Df)

##   Availability     PhysicalCond       StockManagement     MedicineOrdering  
##  Min.   :0.0000   Min.   : 0.000293   Min.   : 0.000001   Min.   : 0.02001  
##  1st Qu.:0.7200   1st Qu.: 0.015220   1st Qu.: 0.097202   1st Qu.: 0.12516  
##  Median :0.8100   Median : 0.033854   Median : 0.173845   Median : 0.16518  
##  Mean   :0.7363   Mean   : 0.945543   Mean   : 0.912616   Mean   : 0.93528  
##  3rd Qu.:0.8700   3rd Qu.: 0.112357   3rd Qu.: 0.376095   3rd Qu.: 0.19370  
##  Max.   :0.9600   Max.   :19.636299   Max.   :14.916897   Max.   :10.69765  
##  NA's   :3                                                                  
##   StockOutMgt       MedReceipting        InventoryMgt     ReportingSystem 
##  Min.   : 0.04477   Min.   : 0.005927   Min.   : 0.1564   Min.   :0.0000  
##  1st Qu.: 0.04477   1st Qu.: 0.013782   1st Qu.: 0.1605   1st Qu.:1.0000  
##  Median : 0.04477   Median : 0.083594   Median : 0.1947   Median :1.0000  
##  Mean   : 0.88153   Mean   : 0.955466   Mean   : 0.9803   Mean   :0.9545  
##  3rd Qu.: 0.04477   3rd Qu.: 0.083594   3rd Qu.: 0.1947   3rd Qu.:1.0000  
##  Max.   :18.47967   Max.   :20.614220   Max.   :12.0107   Max.   :1.0000  
##                                                           NA's   :2       
##  InventorySystem 
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.6364  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##  NA's   :2

The summary statistics presented above show some standard metrics of statistical distribution of continuous variables. The dependent variable “Availability” has a minimum value of 0 and a maximum value of 0.96, missing values (NAs are also reported = 3). The rest of the variables excluding “Inventory System” are index variables with a continuous scale, showing the scores for each participating entity in the study, in this case, Pharmacies at the various healthcare facilities. These range of values for these variables is also positive with positive minimum values. This is because the values were squared, with the intention towards log transformation given assessment of non-constant variance. We proceed to assess the relationships among the variables using correlation analysis.

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.1

## corrplot 0.92 loaded

DF.Corr <- cor(Df)
corrplot(DF.Corr, method = "number")

The corrplot shows very high correlations among the predicting variables as high as 0.8, however, we observe that the correlations with the response variables are very small. We will continue with regression modeling and run model diagnostics to evaluate whether the variables require transformations.

The sample data set we are using is small, with N = 46 observations, which makes creation of a training and validation sample challenging. Since the minimum number of observations for estimating a normal distribution is 30, we will use 76% of the observations for training and the remaining values as test data set.

set.seed(1234)
RndSmple <- sample(1:nrow(Df),36)
tr.dat <- Df[RndSmple, ]
ts.dat <- Df[-RndSmple,]
tr.dat <- knnImputation(tr.dat, k = 10)

We generate a plot to check the distribution of values of the dependent variable.

ggplot(tr.dat, aes(x=Availability)) + geom_histogram(aes(y = after_stat(density))) + stat_function(fun = dnorm, 
                                                                                                   args = list(mean = mean(tr.dat$Availability),
                                                                                                               sd = sd(tr.dat$Availability)),
                                                                                                   col = "#1b98e0",
                                                                                                   linewidth = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(tr.dat, aes(x=Availability)) + geom_histogram(aes(y = after_stat(density))) + geom_density(color = "red") + geom_rug() + 
  ggtitle("History of Medicine Availability") + xlab(" ") + ylab(" ")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(car)

## Warning: package 'car' was built under R version 4.3.1

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.1

qqPlot(tr.dat$Availability, main = 'Normal QQ plot of Medicine Availability',
       ylab = " ")

## [1] 26 20

The distribution of the dependent variable is skewed, which might further have been intensified by reducing the dataset into train (.76) and test (0.24). We will keep the variable in its present form as we build the model. Next we assess the explanatory variables. The normal QQ plot shows that except for the extreme values coded #26 and #20, the rest of the data points are within the 95% confidence band of the straight line.

The multiple linear regression can be run and diagnostics conducted

Model1 <- lm(Availability~PhysicalCond+StockManagement+MedicineOrdering+StockOutMgt+MedReceipting+InventoryMgt+ReportingSystem+InventorySystem, tr.dat)
summary(Model1)

## 
## Call:
## lm(formula = Availability ~ PhysicalCond + StockManagement + 
##     MedicineOrdering + StockOutMgt + MedReceipting + InventoryMgt + 
##     ReportingSystem + InventorySystem, data = tr.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40329 -0.08077  0.00839  0.05973  0.31162 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.41468    0.23781   5.949 2.42e-06 ***
## PhysicalCond      0.05022    0.16318   0.308  0.76064    
## StockManagement  -0.11163    0.06248  -1.787  0.08521 .  
## MedicineOrdering  0.00396    0.01317   0.301  0.76593    
## StockOutMgt      -0.39513    0.45601  -0.866  0.39386    
## MedReceipting     0.47646    0.40596   1.174  0.25079    
## InventoryMgt     -0.17716    0.04530  -3.911  0.00056 ***
## ReportingSystem  -0.54269    0.23366  -2.323  0.02798 *  
## InventorySystem  -0.03312    0.05568  -0.595  0.55690    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1357 on 27 degrees of freedom
## Multiple R-squared:  0.4331, Adjusted R-squared:  0.2651 
## F-statistic: 2.578 on 8 and 27 DF,  p-value: 0.03118

The initial model with all the 8 predictor variables, reports an adjusted R-squared value of 26.5%, meaning that the model with the 8 predictors explains 26.5% of the variation observed in medicine availability. With a larger level of significance at 10%, stock management, inventory management and reporting systems are statistically significant predictors of medicine availability. The model is significant with an F Statistic on 8 and 27 degrees of freedom = 2.578, with a p-value (Prob > F) = 0.03. We can check the analysis of variance and employ backward elimination method to cull the model and retain only statistically significant explanatory variables.

Analysis of Variance

anova(Model1)

## Analysis of Variance Table
## 
## Response: Availability
##                  Df  Sum Sq  Mean Sq F value   Pr(>F)   
## PhysicalCond      1 0.00420 0.004199  0.2279 0.636895   
## StockManagement   1 0.00170 0.001695  0.0920 0.763958   
## MedicineOrdering  1 0.00814 0.008137  0.4417 0.511939   
## StockOutMgt       1 0.00011 0.000107  0.0058 0.939760   
## MedReceipting     1 0.00770 0.007696  0.4178 0.523504   
## InventoryMgt      1 0.24043 0.240429 13.0515 0.001221 **
## ReportingSystem   1 0.11118 0.111182  6.0354 0.020736 * 
## InventorySystem   1 0.00652 0.006518  0.3538 0.556897   
## Residuals        27 0.49738 0.018421                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the variable stockout management had the least contribution to the observed variation in the response variable. We can use backward elimination and update the multiple regression model, or we can use the step function which employs the AIC to update the model with the most efficient predictive performance.

Model2 <- step(Model1)

## Start:  AIC=-136.15
## Availability ~ PhysicalCond + StockManagement + MedicineOrdering + 
##     StockOutMgt + MedReceipting + InventoryMgt + ReportingSystem + 
##     InventorySystem
## 
##                    Df Sum of Sq     RSS     AIC
## - MedicineOrdering  1  0.001666 0.49905 -138.03
## - PhysicalCond      1  0.001745 0.49912 -138.02
## - InventorySystem   1  0.006518 0.50390 -137.68
## - StockOutMgt       1  0.013830 0.51121 -137.16
## - MedReceipting     1  0.025375 0.52275 -136.36
## <none>                          0.49738 -136.15
## - StockManagement   1  0.058809 0.55619 -134.13
## - ReportingSystem   1  0.099369 0.59675 -131.59
## - InventoryMgt      1  0.281709 0.77909 -121.99
## 
## Step:  AIC=-138.03
## Availability ~ PhysicalCond + StockManagement + StockOutMgt + 
##     MedReceipting + InventoryMgt + ReportingSystem + InventorySystem
## 
##                   Df Sum of Sq     RSS     AIC
## - PhysicalCond     1  0.002018 0.50106 -139.88
## - InventorySystem  1  0.005966 0.50501 -139.60
## - StockOutMgt      1  0.015163 0.51421 -138.95
## - MedReceipting    1  0.027217 0.52626 -138.12
## <none>                         0.49905 -138.03
## - StockManagement  1  0.060873 0.55992 -135.88
## - ReportingSystem  1  0.104707 0.60375 -133.17
## - InventoryMgt     1  0.282464 0.78151 -123.88
## 
## Step:  AIC=-139.88
## Availability ~ StockManagement + StockOutMgt + MedReceipting + 
##     InventoryMgt + ReportingSystem + InventorySystem
## 
##                   Df Sum of Sq     RSS     AIC
## - InventorySystem  1  0.006686 0.50775 -141.41
## - StockOutMgt      1  0.013887 0.51495 -140.90
## <none>                         0.50106 -139.88
## - MedReceipting    1  0.032991 0.53405 -139.59
## - StockManagement  1  0.063316 0.56438 -137.60
## - ReportingSystem  1  0.106253 0.60732 -134.96
## - InventoryMgt     1  0.282756 0.78382 -125.78
## 
## Step:  AIC=-141.41
## Availability ~ StockManagement + StockOutMgt + MedReceipting + 
##     InventoryMgt + ReportingSystem
## 
##                   Df Sum of Sq     RSS     AIC
## - StockOutMgt      1  0.017422 0.52517 -142.19
## <none>                         0.50775 -141.41
## - MedReceipting    1  0.036665 0.54441 -140.90
## - StockManagement  1  0.057157 0.56491 -139.57
## - ReportingSystem  1  0.118160 0.62591 -135.87
## - InventoryMgt     1  0.292903 0.80065 -127.01
## 
## Step:  AIC=-142.19
## Availability ~ StockManagement + MedReceipting + InventoryMgt + 
##     ReportingSystem
## 
##                   Df Sum of Sq     RSS     AIC
## <none>                         0.52517 -142.19
## - StockManagement  1  0.070683 0.59585 -139.65
## - ReportingSystem  1  0.179533 0.70471 -133.61
## - MedReceipting    1  0.194622 0.71979 -132.84
## - InventoryMgt     1  0.309684 0.83486 -127.50

Summarizing the final model

summary(Model2)

## 
## Call:
## lm(formula = Availability ~ StockManagement + MedReceipting + 
##     InventoryMgt + ReportingSystem, data = tr.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40655 -0.08152  0.00749  0.06833  0.31261 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.24350    0.13088   9.501 1.07e-10 ***
## StockManagement -0.11628    0.05693  -2.043 0.049673 *  
## MedReceipting    0.17396    0.05132   3.389 0.001924 ** 
## InventoryMgt    -0.17103    0.04000  -4.276 0.000169 ***
## ReportingSystem -0.39294    0.12070  -3.255 0.002739 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1302 on 31 degrees of freedom
## Multiple R-squared:  0.4014, Adjusted R-squared:  0.3242 
## F-statistic: 5.197 on 4 and 31 DF,  p-value: 0.002536

The updated model show that medicine availability is best predicted stock management, Medicine receipting, inventory management and inventory reporting systems. The performance of the model is not impressive, with an adjusted R-Squared value of 34.42. This model has the lowest AIC = -142.19.

We compare whether the statistical differences between Model1 and Model2 are significant. We use the Analysis of Variance anova() function which assesses the models based on F Statistics and probability of obtaining that F statistic.

anova(Model1,Model2)

## Analysis of Variance Table
## 
## Model 1: Availability ~ PhysicalCond + StockManagement + MedicineOrdering + 
##     StockOutMgt + MedReceipting + InventoryMgt + ReportingSystem + 
##     InventorySystem
## Model 2: Availability ~ StockManagement + MedReceipting + InventoryMgt + 
##     ReportingSystem
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     27 0.49738                           
## 2     31 0.52517 -4 -0.027792 0.3772 0.8229

We can observe that the p-value is greater than 5% at (82.89%) showing that the differences between the two models are not statistically significant. We will compute the regression model, using the full data without train and test partitions.

DF <- knnImputation(Df, k=10)
Model3 <- lm(Availability~PhysicalCond+StockManagement+MedicineOrdering+StockOutMgt+MedReceipting+InventoryMgt+ReportingSystem+InventorySystem, DF)
summary(Model3)

## 
## Call:
## lm(formula = Availability ~ PhysicalCond + StockManagement + 
##     MedicineOrdering + StockOutMgt + MedReceipting + InventoryMgt + 
##     ReportingSystem + InventorySystem, data = DF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22451 -0.08550 -0.00034  0.06380  0.54719 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.524086   0.235944   6.460 1.50e-07 ***
## PhysicalCond     -0.001294   0.167431  -0.008   0.9939    
## StockManagement  -0.163979   0.063270  -2.592   0.0136 *  
## MedicineOrdering  0.004436   0.014021   0.316   0.7535    
## StockOutMgt      -0.155817   0.457143  -0.341   0.7351    
## MedReceipting     0.403695   0.400972   1.007   0.3206    
## InventoryMgt     -0.293981   0.033308  -8.826 1.23e-10 ***
## ReportingSystem  -0.604403   0.235645  -2.565   0.0145 *  
## InventorySystem  -0.060405   0.052629  -1.148   0.2584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1462 on 37 degrees of freedom
## Multiple R-squared:  0.7162, Adjusted R-squared:  0.6549 
## F-statistic: 11.67 on 8 and 37 DF,  p-value: 4.243e-08

The results presented above for Model3 shows significant improvement in the model performance which is based on the full dataset. Interestingly the differences between this dataframe and the training data used in Model 1 and Model 2 is only 10 observations. The adjusted R-Squared for Model 3 is 65.5% meaning that the model explains 65.5% of observed variation in the response variable measuring medicine availability. The model is statistically significant with an F Statistic (8,37) = 11.67, and p-value = 0.00000004243.

We use step function to update this model

Model4 <- step(Model3)

## Start:  AIC=-168.9
## Availability ~ PhysicalCond + StockManagement + MedicineOrdering + 
##     StockOutMgt + MedReceipting + InventoryMgt + ReportingSystem + 
##     InventorySystem
## 
##                    Df Sum of Sq     RSS     AIC
## - PhysicalCond      1   0.00000 0.79104 -170.90
## - MedicineOrdering  1   0.00214 0.79318 -170.78
## - StockOutMgt       1   0.00248 0.79352 -170.76
## - MedReceipting     1   0.02167 0.81271 -169.66
## - InventorySystem   1   0.02816 0.81920 -169.29
## <none>                          0.79104 -168.90
## - ReportingSystem   1   0.14065 0.93168 -163.37
## - StockManagement   1   0.14361 0.93464 -163.23
## - InventoryMgt      1   1.66550 2.45654 -118.78
## 
## Step:  AIC=-170.9
## Availability ~ StockManagement + MedicineOrdering + StockOutMgt + 
##     MedReceipting + InventoryMgt + ReportingSystem + InventorySystem
## 
##                    Df Sum of Sq     RSS     AIC
## - MedicineOrdering  1   0.00215 0.79319 -172.78
## - StockOutMgt       1   0.00261 0.79365 -172.75
## - MedReceipting     1   0.02249 0.81353 -171.61
## - InventorySystem   1   0.02842 0.81946 -171.28
## <none>                          0.79104 -170.90
## - ReportingSystem   1   0.14066 0.93169 -165.37
## - StockManagement   1   0.14431 0.93535 -165.19
## - InventoryMgt      1   1.66986 2.46090 -120.69
## 
## Step:  AIC=-172.78
## Availability ~ StockManagement + StockOutMgt + MedReceipting + 
##     InventoryMgt + ReportingSystem + InventorySystem
## 
##                   Df Sum of Sq     RSS     AIC
## - StockOutMgt      1   0.00307 0.79625 -174.60
## - MedReceipting    1   0.02446 0.81764 -173.38
## - InventorySystem  1   0.02724 0.82042 -173.22
## <none>                         0.79319 -172.78
## - ReportingSystem  1   0.14683 0.94001 -166.96
## - StockManagement  1   0.14756 0.94074 -166.93
## - InventoryMgt     1   1.66920 2.46238 -122.67
## 
## Step:  AIC=-174.6
## Availability ~ StockManagement + MedReceipting + InventoryMgt + 
##     ReportingSystem + InventorySystem
## 
##                   Df Sum of Sq     RSS     AIC
## - InventorySystem  1   0.03093 0.82719 -174.84
## <none>                         0.79625 -174.60
## - StockManagement  1   0.16125 0.95750 -168.12
## - ReportingSystem  1   0.38443 1.18069 -158.48
## - MedReceipting    1   0.59983 1.39609 -150.77
## - InventoryMgt     1   1.72668 2.52294 -123.55
## 
## Step:  AIC=-174.84
## Availability ~ StockManagement + MedReceipting + InventoryMgt + 
##     ReportingSystem
## 
##                   Df Sum of Sq     RSS     AIC
## <none>                         0.82719 -174.84
## - StockManagement  1   0.14493 0.97212 -169.42
## - ReportingSystem  1   0.41860 1.24579 -158.01
## - MedReceipting    1   0.56900 1.39619 -152.76
## - InventoryMgt     1   1.80190 2.62908 -123.65

We observe that the step function yields similar final regression models for the full data and for the train partition. The final model was chosen as the model with the lowest AIC = -174.84. We first compare this final model with the initial model.

anova(Model3,Model4)

## Analysis of Variance Table
## 
## Model 1: Availability ~ PhysicalCond + StockManagement + MedicineOrdering + 
##     StockOutMgt + MedReceipting + InventoryMgt + ReportingSystem + 
##     InventorySystem
## Model 2: Availability ~ StockManagement + MedReceipting + InventoryMgt + 
##     ReportingSystem
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     37 0.79104                           
## 2     41 0.82719 -4 -0.036152 0.4227 0.7912

The differences between the two models are not statistically significant.We assess the predictive performance and other metrics of this model

summary(Model4)

## 
## Call:
## lm(formula = Availability ~ StockManagement + MedReceipting + 
##     InventoryMgt + ReportingSystem, data = DF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21426 -0.09071 -0.00043  0.07160  0.55696 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.43232    0.13089  10.943 9.76e-14 ***
## StockManagement -0.15941    0.05948  -2.680   0.0105 *  
## MedReceipting    0.25611    0.04823   5.311 4.11e-06 ***
## InventoryMgt    -0.28133    0.02977  -9.450 7.50e-12 ***
## ReportingSystem -0.55781    0.12246  -4.555 4.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.142 on 41 degrees of freedom
## Multiple R-squared:  0.7033, Adjusted R-squared:  0.6743 
## F-statistic: 24.29 on 4 and 41 DF,  p-value: 2.352e-10

The model with 4 predictors (Stock Management, Medicine Receipting, Inventory Management and Reporting Systems) explains 67.43% of the observed variation in the response variable measuring Medicine Availability.

Regression model diagnostics

par(mfrow = c(2,2))
plot(Model4)

The normal QQ plot above does show that the residuals(errors) of the estimated final regression model all lie more or less on the line but we also observe some very extreme values, a very small value coded 24, and very high values coded 37 and 3 respectively, which reflect the cases (observations) associated with these extreme values.

We examine whether these values are influential points and the strategy we can use to improve their influence on the regression model

Outliers <- hatvalues(Model4) > 3*mean(hatvalues(Model4))
DF[Outliers,]

## # A tibble: 4 × 9
##   Availability PhysicalCond StockManagement MedicineOrdering StockOutMgt
##          <dbl>        <dbl>           <dbl>            <dbl>       <dbl>
## 1        0.514       19.6       14.9                  10.7        18.5  
## 2        0.514       19.6       14.9                  10.7        18.5  
## 3        0.85         0.281      2.35                  0.194       0.508
## 4        0.94         0.350      0.00000101            1.06        0.508
## # ℹ 4 more variables: MedReceipting <dbl>, InventoryMgt <dbl>,
## #   ReportingSystem <dbl>, InventorySystem <dbl>

We see that there are four observations positioned far from the fitted regression line, that is points with larger residuals.

df <- Model4$df.residual
p <- length(Model4$coefficients)
n <- nrow(Model4$model)

dffits_crit <- 2*sqrt((p+1)/(n-p-1))
Model_Diff <- dffits(Model4)
Df <- data.frame(obs = names(Model_Diff), DFFITS = Model_Diff)

ggplot(Df, aes(y = DFFITS, x = obs)) + geom_point() + 
  geom_hline(yintercept = c(dffits_crit), linetype = "dashed") + 
  labs(title = "DFFITS",
       subtitle = "Influential Observations",
       x = "Observation Number",
       y = "DFFITS")

The DFFITS plot above shows three potential influential data points. We can list which ones as follows

DF[which(abs(Model_Diff) > dffits_crit), ]

## # A tibble: 5 × 9
##   Availability PhysicalCond StockManagement MedicineOrdering StockOutMgt
##          <dbl>        <dbl>           <dbl>            <dbl>       <dbl>
## 1         0.74       0.129       0.157                0.125       0.0448
## 2         0          0.239       0.0149               0.0344      0.0448
## 3         0.85       0.281       2.35                 0.194       0.508 
## 4         0.85       0.0967      1.47                 0.125       0.0448
## 5         0.94       0.350       0.00000101           1.06        0.508 
## # ℹ 4 more variables: MedReceipting <dbl>, InventoryMgt <dbl>,
## #   ReportingSystem <dbl>, InventorySystem <dbl>

We can assess outliers with Cook’s distance

cooks_crit = 0.5
Model_cooks <- cooks.distance(Model4)
df <- data.frame(obs = names(Model_cooks),
                 cooks = Model_cooks)
ggplot(df, aes(y = cooks, x = obs)) +
  geom_point() +
  geom_hline(yintercept = cooks_crit, linetype="dashed") +
  labs(title = "Cook's Distance",
       x = "Observation Number",
       y = "Cook's")

The computed cook distance, shows 3 influential observations. We can remove the influential observations and compare the regression with the observations removed against the current regression model. Alternatively we can compute robust regression as an alternative to the ols regression. Before doing so, we will compute the normalised mean square error for the final regression.

Predictions_lm <- predict(Model4, DF)
(nmse.model4 <- mean((Predictions_lm - DF[['Availability']])^2)/
    mean((mean(DF[['Availability']]) - DF[['Availability']])^2))

## [1] 0.2967335

The NMSE is a unitless error measure with values ranging from 0 to 1. Our model is evidently performing better than a very simple baseline, as our nmse is sufficiently low.

dg <- data.frame(Model5 = Predictions_lm,
                 True.Model <- DF[['Availability']])
ggplot(dg, aes(x = Model5, y = True.Model)) + geom_point() + 
  geom_abline(slope = 1, intercept = 0, color = "red") + 
  ggtitle("Multiple Linear Regression Model")

Robust Regression Model The model accounts for outliers to provide a better fit to the majority of the data. When theoretical reasons constrain the need to keep outliers in the data, robust regression is a better alternative to the OLS regression model.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:gtsummary':
## 
##     select

Model6 <- rlm(Availability~PhysicalCond+StockManagement+StockOutMgt+MedicineOrdering+MedReceipting+InventoryMgt+ReportingSystem+InventorySystem, data = DF)
summary(Model6)

## 
## Call: rlm(formula = Availability ~ PhysicalCond + StockManagement + 
##     StockOutMgt + MedicineOrdering + MedReceipting + InventoryMgt + 
##     ReportingSystem + InventorySystem, data = DF)
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.1517449 -0.0834245 -0.0004181  0.0621174  0.6559029 
## 
## Coefficients:
##                  Value    Std. Error t value 
## (Intercept)        1.6678   0.1760     9.4741
## PhysicalCond      -0.0356   0.1249    -0.2848
## StockManagement   -0.2059   0.0472    -4.3621
## StockOutMgt       -0.2397   0.3411    -0.7027
## MedicineOrdering   0.0045   0.0105     0.4337
## MedReceipting      0.5621   0.2992     1.8790
## InventoryMgt      -0.3315   0.0249   -13.3408
## ReportingSystem   -0.7549   0.1758    -4.2935
## InventorySystem   -0.0374   0.0393    -0.9531
## 
## Residual standard error: 0.1197 on 37 degrees of freedom

summary(Model4)

## 
## Call:
## lm(formula = Availability ~ StockManagement + MedReceipting + 
##     InventoryMgt + ReportingSystem, data = DF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21426 -0.09071 -0.00043  0.07160  0.55696 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.43232    0.13089  10.943 9.76e-14 ***
## StockManagement -0.15941    0.05948  -2.680   0.0105 *  
## MedReceipting    0.25611    0.04823   5.311 4.11e-06 ***
## InventoryMgt    -0.28133    0.02977  -9.450 7.50e-12 ***
## ReportingSystem -0.55781    0.12246  -4.555 4.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.142 on 41 degrees of freedom
## Multiple R-squared:  0.7033, Adjusted R-squared:  0.6743 
## F-statistic: 24.29 on 4 and 41 DF,  p-value: 2.352e-10

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.1

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data.frame(resid = Model6$wresid,
           weight = Model6$w) %>%
  arrange(weight) %>%
  head(n=15)

##           resid    weight
## 3   0.655902902 0.2454358
## 32  0.314412641 0.5120513
## 46  0.178213575 0.9033055
## 1  -0.123996732 1.0000000
## 2   0.075236663 1.0000000
## 4   0.001017944 1.0000000
## 5   0.004267988 1.0000000
## 6   0.039739378 1.0000000
## 7   0.063057713 1.0000000
## 8   0.016191709 1.0000000
## 9  -0.007023709 1.0000000
## 10  0.090224306 1.0000000
## 11 -0.083895514 1.0000000
## 12  0.059296458 1.0000000
## 13  0.086628314 1.0000000

While the ols regression allocates equal weights of 1 on all variables in the model, we can see that robust regression assigns 1 to low residuals and lower weights to slightly higher residuals. We also observe that the differences are larger between t values of an ols regression and those of the robust regression, we choose the latter for modeling medicine availability, although the ols model has better interpretability.

The ols regression model, explains 67% of variation in medicine availability in a model with Stock management, medicine receipting, inventory management and reporting systems as predictors. It must be observed that these explanatory variables are composites computed using factor analysis as a data reduction technique.

Multiple Linear Regression Analysis with R

Tawonga Rushambwa: Masters in Development Studies.

2023-07-18