The project is a modeling of medicine availability using a multiple linear regression model with a number of regressors derived using exploratory factor analysis and measuring physical conditions of clinical pharmacies, stock management practices, medicine ordering, stockout management, medicine receipting, inventory management, medicine reporting systems and observed inventory management systems.
The dataset we will use is a curated and contains the 8 composite constructs to be used as explanatory variables in the multiple regression analysis modeling factors influencing medicine availability.
We will import the data and adjust the variables for analysis, and also adjusting variable values for randomness and accounting for multiple linear regression model assumptions.
Packages <- c("gtsummary","broom","readxl","descr","stargazer","ggplot2","DMwR2")
lapply(Packages, library, character.only = TRUE)
## Warning: package 'gtsummary' was built under R version 4.3.1
## #Uighur
## Warning: package 'broom' was built under R version 4.3.1
## Warning: package 'readxl' was built under R version 4.3.1
## Warning: package 'descr' was built under R version 4.3.1
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'DMwR2' was built under R version 4.3.1
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## [[1]]
## [1] "gtsummary" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## [[2]]
## [1] "broom" "gtsummary" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "readxl" "broom" "gtsummary" "stats" "graphics" "grDevices"
## [7] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "descr" "readxl" "broom" "gtsummary" "stats" "graphics"
## [7] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "stargazer" "descr" "readxl" "broom" "gtsummary" "stats"
## [7] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[6]]
## [1] "ggplot2" "stargazer" "descr" "readxl" "broom" "gtsummary"
## [7] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [13] "base"
##
## [[7]]
## [1] "DMwR2" "ggplot2" "stargazer" "descr" "readxl" "broom"
## [7] "gtsummary" "stats" "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
Data <- read_excel("C:/Users/tawon/Documents/R/REG_DATA.xlsx")
Data$Availability <- as.numeric(Data$`_ADEQUATEAVAILABILITY_`)
## Warning: NAs introduced by coercion
Data$PhysicalCond <- Data$abs_Physical_Conditions
Data$StockManagement <- Data$abs_Stock_Management
Data$MedicineOrdering <- Data$abs_Medicine_Ordering
Data$StockOutMgt <- Data$abs_Stockout_Mgt
Data$MedReceipting <- Data$abs_MedicineReceipting
Data$InventoryMgt <- Data$abs_InventoryManagement
Data$ReportingSystem <- Data$abs_Reporting_System
Data$InventorySystem <- Data$abs_InventorySystem
In the foregoing steps, we simplified variable naming conventions, and changed the data structure of the dependent variable from string object to a numeric, so that the decimal values could not be rounded off as when the function as.integer is used. We then create a simplified data frame containing only the variables required for this multiple linear regression modelling project.
Df <- Data[, c("Availability","PhysicalCond","StockManagement","MedicineOrdering", "StockOutMgt","MedReceipting","InventoryMgt","ReportingSystem","InventorySystem")]
summary(Df)
## Availability PhysicalCond StockManagement MedicineOrdering
## Min. :0.0000 Min. : 0.000293 Min. : 0.000001 Min. : 0.02001
## 1st Qu.:0.7200 1st Qu.: 0.015220 1st Qu.: 0.097202 1st Qu.: 0.12516
## Median :0.8100 Median : 0.033854 Median : 0.173845 Median : 0.16518
## Mean :0.7363 Mean : 0.945543 Mean : 0.912616 Mean : 0.93528
## 3rd Qu.:0.8700 3rd Qu.: 0.112357 3rd Qu.: 0.376095 3rd Qu.: 0.19370
## Max. :0.9600 Max. :19.636299 Max. :14.916897 Max. :10.69765
## NA's :3
## StockOutMgt MedReceipting InventoryMgt ReportingSystem
## Min. : 0.04477 Min. : 0.005927 Min. : 0.1564 Min. :0.0000
## 1st Qu.: 0.04477 1st Qu.: 0.013782 1st Qu.: 0.1605 1st Qu.:1.0000
## Median : 0.04477 Median : 0.083594 Median : 0.1947 Median :1.0000
## Mean : 0.88153 Mean : 0.955466 Mean : 0.9803 Mean :0.9545
## 3rd Qu.: 0.04477 3rd Qu.: 0.083594 3rd Qu.: 0.1947 3rd Qu.:1.0000
## Max. :18.47967 Max. :20.614220 Max. :12.0107 Max. :1.0000
## NA's :2
## InventorySystem
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6364
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :2
The summary statistics presented above show some standard metrics of statistical distribution of continuous variables. The dependent variable “Availability” has a minimum value of 0 and a maximum value of 0.96, missing values (NAs are also reported = 3). The rest of the variables excluding “Inventory System” are index variables with a continuous scale, showing the scores for each participating entity in the study, in this case, Pharmacies at the various healthcare facilities. These range of values for these variables is also positive with positive minimum values. This is because the values were squared, with the intention towards log transformation given assessment of non-constant variance. We proceed to assess the relationships among the variables using correlation analysis.
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.1
## corrplot 0.92 loaded
DF.Corr <- cor(Df)
corrplot(DF.Corr, method = "number")
The corrplot shows very high correlations among the predicting variables
as high as 0.8, however, we observe that the correlations with the
response variables are very small. We will continue with regression
modeling and run model diagnostics to evaluate whether the variables
require transformations.
The sample data set we are using is small, with N = 46 observations, which makes creation of a training and validation sample challenging. Since the minimum number of observations for estimating a normal distribution is 30, we will use 76% of the observations for training and the remaining values as test data set.
set.seed(1234)
RndSmple <- sample(1:nrow(Df),36)
tr.dat <- Df[RndSmple, ]
ts.dat <- Df[-RndSmple,]
tr.dat <- knnImputation(tr.dat, k = 10)
We generate a plot to check the distribution of values of the dependent variable.
ggplot(tr.dat, aes(x=Availability)) + geom_histogram(aes(y = after_stat(density))) + stat_function(fun = dnorm,
args = list(mean = mean(tr.dat$Availability),
sd = sd(tr.dat$Availability)),
col = "#1b98e0",
linewidth = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(tr.dat, aes(x=Availability)) + geom_histogram(aes(y = after_stat(density))) + geom_density(color = "red") + geom_rug() +
ggtitle("History of Medicine Availability") + xlab(" ") + ylab(" ")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(car)
## Warning: package 'car' was built under R version 4.3.1
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.1
qqPlot(tr.dat$Availability, main = 'Normal QQ plot of Medicine Availability',
ylab = " ")
## [1] 26 20
The distribution of the dependent variable is skewed, which might further have been intensified by reducing the dataset into train (.76) and test (0.24). We will keep the variable in its present form as we build the model. Next we assess the explanatory variables. The normal QQ plot shows that except for the extreme values coded #26 and #20, the rest of the data points are within the 95% confidence band of the straight line.
The multiple linear regression can be run and diagnostics conducted
Model1 <- lm(Availability~PhysicalCond+StockManagement+MedicineOrdering+StockOutMgt+MedReceipting+InventoryMgt+ReportingSystem+InventorySystem, tr.dat)
summary(Model1)
##
## Call:
## lm(formula = Availability ~ PhysicalCond + StockManagement +
## MedicineOrdering + StockOutMgt + MedReceipting + InventoryMgt +
## ReportingSystem + InventorySystem, data = tr.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40329 -0.08077 0.00839 0.05973 0.31162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.41468 0.23781 5.949 2.42e-06 ***
## PhysicalCond 0.05022 0.16318 0.308 0.76064
## StockManagement -0.11163 0.06248 -1.787 0.08521 .
## MedicineOrdering 0.00396 0.01317 0.301 0.76593
## StockOutMgt -0.39513 0.45601 -0.866 0.39386
## MedReceipting 0.47646 0.40596 1.174 0.25079
## InventoryMgt -0.17716 0.04530 -3.911 0.00056 ***
## ReportingSystem -0.54269 0.23366 -2.323 0.02798 *
## InventorySystem -0.03312 0.05568 -0.595 0.55690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1357 on 27 degrees of freedom
## Multiple R-squared: 0.4331, Adjusted R-squared: 0.2651
## F-statistic: 2.578 on 8 and 27 DF, p-value: 0.03118
The initial model with all the 8 predictor variables, reports an adjusted R-squared value of 26.5%, meaning that the model with the 8 predictors explains 26.5% of the variation observed in medicine availability. With a larger level of significance at 10%, stock management, inventory management and reporting systems are statistically significant predictors of medicine availability. The model is significant with an F Statistic on 8 and 27 degrees of freedom = 2.578, with a p-value (Prob > F) = 0.03. We can check the analysis of variance and employ backward elimination method to cull the model and retain only statistically significant explanatory variables.
Analysis of Variance
anova(Model1)
## Analysis of Variance Table
##
## Response: Availability
## Df Sum Sq Mean Sq F value Pr(>F)
## PhysicalCond 1 0.00420 0.004199 0.2279 0.636895
## StockManagement 1 0.00170 0.001695 0.0920 0.763958
## MedicineOrdering 1 0.00814 0.008137 0.4417 0.511939
## StockOutMgt 1 0.00011 0.000107 0.0058 0.939760
## MedReceipting 1 0.00770 0.007696 0.4178 0.523504
## InventoryMgt 1 0.24043 0.240429 13.0515 0.001221 **
## ReportingSystem 1 0.11118 0.111182 6.0354 0.020736 *
## InventorySystem 1 0.00652 0.006518 0.3538 0.556897
## Residuals 27 0.49738 0.018421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that the variable stockout management had the least contribution to the observed variation in the response variable. We can use backward elimination and update the multiple regression model, or we can use the step function which employs the AIC to update the model with the most efficient predictive performance.
Model2 <- step(Model1)
## Start: AIC=-136.15
## Availability ~ PhysicalCond + StockManagement + MedicineOrdering +
## StockOutMgt + MedReceipting + InventoryMgt + ReportingSystem +
## InventorySystem
##
## Df Sum of Sq RSS AIC
## - MedicineOrdering 1 0.001666 0.49905 -138.03
## - PhysicalCond 1 0.001745 0.49912 -138.02
## - InventorySystem 1 0.006518 0.50390 -137.68
## - StockOutMgt 1 0.013830 0.51121 -137.16
## - MedReceipting 1 0.025375 0.52275 -136.36
## <none> 0.49738 -136.15
## - StockManagement 1 0.058809 0.55619 -134.13
## - ReportingSystem 1 0.099369 0.59675 -131.59
## - InventoryMgt 1 0.281709 0.77909 -121.99
##
## Step: AIC=-138.03
## Availability ~ PhysicalCond + StockManagement + StockOutMgt +
## MedReceipting + InventoryMgt + ReportingSystem + InventorySystem
##
## Df Sum of Sq RSS AIC
## - PhysicalCond 1 0.002018 0.50106 -139.88
## - InventorySystem 1 0.005966 0.50501 -139.60
## - StockOutMgt 1 0.015163 0.51421 -138.95
## - MedReceipting 1 0.027217 0.52626 -138.12
## <none> 0.49905 -138.03
## - StockManagement 1 0.060873 0.55992 -135.88
## - ReportingSystem 1 0.104707 0.60375 -133.17
## - InventoryMgt 1 0.282464 0.78151 -123.88
##
## Step: AIC=-139.88
## Availability ~ StockManagement + StockOutMgt + MedReceipting +
## InventoryMgt + ReportingSystem + InventorySystem
##
## Df Sum of Sq RSS AIC
## - InventorySystem 1 0.006686 0.50775 -141.41
## - StockOutMgt 1 0.013887 0.51495 -140.90
## <none> 0.50106 -139.88
## - MedReceipting 1 0.032991 0.53405 -139.59
## - StockManagement 1 0.063316 0.56438 -137.60
## - ReportingSystem 1 0.106253 0.60732 -134.96
## - InventoryMgt 1 0.282756 0.78382 -125.78
##
## Step: AIC=-141.41
## Availability ~ StockManagement + StockOutMgt + MedReceipting +
## InventoryMgt + ReportingSystem
##
## Df Sum of Sq RSS AIC
## - StockOutMgt 1 0.017422 0.52517 -142.19
## <none> 0.50775 -141.41
## - MedReceipting 1 0.036665 0.54441 -140.90
## - StockManagement 1 0.057157 0.56491 -139.57
## - ReportingSystem 1 0.118160 0.62591 -135.87
## - InventoryMgt 1 0.292903 0.80065 -127.01
##
## Step: AIC=-142.19
## Availability ~ StockManagement + MedReceipting + InventoryMgt +
## ReportingSystem
##
## Df Sum of Sq RSS AIC
## <none> 0.52517 -142.19
## - StockManagement 1 0.070683 0.59585 -139.65
## - ReportingSystem 1 0.179533 0.70471 -133.61
## - MedReceipting 1 0.194622 0.71979 -132.84
## - InventoryMgt 1 0.309684 0.83486 -127.50
Summarizing the final model
summary(Model2)
##
## Call:
## lm(formula = Availability ~ StockManagement + MedReceipting +
## InventoryMgt + ReportingSystem, data = tr.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40655 -0.08152 0.00749 0.06833 0.31261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.24350 0.13088 9.501 1.07e-10 ***
## StockManagement -0.11628 0.05693 -2.043 0.049673 *
## MedReceipting 0.17396 0.05132 3.389 0.001924 **
## InventoryMgt -0.17103 0.04000 -4.276 0.000169 ***
## ReportingSystem -0.39294 0.12070 -3.255 0.002739 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1302 on 31 degrees of freedom
## Multiple R-squared: 0.4014, Adjusted R-squared: 0.3242
## F-statistic: 5.197 on 4 and 31 DF, p-value: 0.002536
The updated model show that medicine availability is best predicted stock management, Medicine receipting, inventory management and inventory reporting systems. The performance of the model is not impressive, with an adjusted R-Squared value of 34.42. This model has the lowest AIC = -142.19.
We compare whether the statistical differences between Model1 and Model2 are significant. We use the Analysis of Variance anova() function which assesses the models based on F Statistics and probability of obtaining that F statistic.
anova(Model1,Model2)
## Analysis of Variance Table
##
## Model 1: Availability ~ PhysicalCond + StockManagement + MedicineOrdering +
## StockOutMgt + MedReceipting + InventoryMgt + ReportingSystem +
## InventorySystem
## Model 2: Availability ~ StockManagement + MedReceipting + InventoryMgt +
## ReportingSystem
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 27 0.49738
## 2 31 0.52517 -4 -0.027792 0.3772 0.8229
We can observe that the p-value is greater than 5% at (82.89%) showing that the differences between the two models are not statistically significant. We will compute the regression model, using the full data without train and test partitions.
DF <- knnImputation(Df, k=10)
Model3 <- lm(Availability~PhysicalCond+StockManagement+MedicineOrdering+StockOutMgt+MedReceipting+InventoryMgt+ReportingSystem+InventorySystem, DF)
summary(Model3)
##
## Call:
## lm(formula = Availability ~ PhysicalCond + StockManagement +
## MedicineOrdering + StockOutMgt + MedReceipting + InventoryMgt +
## ReportingSystem + InventorySystem, data = DF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22451 -0.08550 -0.00034 0.06380 0.54719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.524086 0.235944 6.460 1.50e-07 ***
## PhysicalCond -0.001294 0.167431 -0.008 0.9939
## StockManagement -0.163979 0.063270 -2.592 0.0136 *
## MedicineOrdering 0.004436 0.014021 0.316 0.7535
## StockOutMgt -0.155817 0.457143 -0.341 0.7351
## MedReceipting 0.403695 0.400972 1.007 0.3206
## InventoryMgt -0.293981 0.033308 -8.826 1.23e-10 ***
## ReportingSystem -0.604403 0.235645 -2.565 0.0145 *
## InventorySystem -0.060405 0.052629 -1.148 0.2584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1462 on 37 degrees of freedom
## Multiple R-squared: 0.7162, Adjusted R-squared: 0.6549
## F-statistic: 11.67 on 8 and 37 DF, p-value: 4.243e-08
The results presented above for Model3 shows significant improvement in the model performance which is based on the full dataset. Interestingly the differences between this dataframe and the training data used in Model 1 and Model 2 is only 10 observations. The adjusted R-Squared for Model 3 is 65.5% meaning that the model explains 65.5% of observed variation in the response variable measuring medicine availability. The model is statistically significant with an F Statistic (8,37) = 11.67, and p-value = 0.00000004243.
We use step function to update this model
Model4 <- step(Model3)
## Start: AIC=-168.9
## Availability ~ PhysicalCond + StockManagement + MedicineOrdering +
## StockOutMgt + MedReceipting + InventoryMgt + ReportingSystem +
## InventorySystem
##
## Df Sum of Sq RSS AIC
## - PhysicalCond 1 0.00000 0.79104 -170.90
## - MedicineOrdering 1 0.00214 0.79318 -170.78
## - StockOutMgt 1 0.00248 0.79352 -170.76
## - MedReceipting 1 0.02167 0.81271 -169.66
## - InventorySystem 1 0.02816 0.81920 -169.29
## <none> 0.79104 -168.90
## - ReportingSystem 1 0.14065 0.93168 -163.37
## - StockManagement 1 0.14361 0.93464 -163.23
## - InventoryMgt 1 1.66550 2.45654 -118.78
##
## Step: AIC=-170.9
## Availability ~ StockManagement + MedicineOrdering + StockOutMgt +
## MedReceipting + InventoryMgt + ReportingSystem + InventorySystem
##
## Df Sum of Sq RSS AIC
## - MedicineOrdering 1 0.00215 0.79319 -172.78
## - StockOutMgt 1 0.00261 0.79365 -172.75
## - MedReceipting 1 0.02249 0.81353 -171.61
## - InventorySystem 1 0.02842 0.81946 -171.28
## <none> 0.79104 -170.90
## - ReportingSystem 1 0.14066 0.93169 -165.37
## - StockManagement 1 0.14431 0.93535 -165.19
## - InventoryMgt 1 1.66986 2.46090 -120.69
##
## Step: AIC=-172.78
## Availability ~ StockManagement + StockOutMgt + MedReceipting +
## InventoryMgt + ReportingSystem + InventorySystem
##
## Df Sum of Sq RSS AIC
## - StockOutMgt 1 0.00307 0.79625 -174.60
## - MedReceipting 1 0.02446 0.81764 -173.38
## - InventorySystem 1 0.02724 0.82042 -173.22
## <none> 0.79319 -172.78
## - ReportingSystem 1 0.14683 0.94001 -166.96
## - StockManagement 1 0.14756 0.94074 -166.93
## - InventoryMgt 1 1.66920 2.46238 -122.67
##
## Step: AIC=-174.6
## Availability ~ StockManagement + MedReceipting + InventoryMgt +
## ReportingSystem + InventorySystem
##
## Df Sum of Sq RSS AIC
## - InventorySystem 1 0.03093 0.82719 -174.84
## <none> 0.79625 -174.60
## - StockManagement 1 0.16125 0.95750 -168.12
## - ReportingSystem 1 0.38443 1.18069 -158.48
## - MedReceipting 1 0.59983 1.39609 -150.77
## - InventoryMgt 1 1.72668 2.52294 -123.55
##
## Step: AIC=-174.84
## Availability ~ StockManagement + MedReceipting + InventoryMgt +
## ReportingSystem
##
## Df Sum of Sq RSS AIC
## <none> 0.82719 -174.84
## - StockManagement 1 0.14493 0.97212 -169.42
## - ReportingSystem 1 0.41860 1.24579 -158.01
## - MedReceipting 1 0.56900 1.39619 -152.76
## - InventoryMgt 1 1.80190 2.62908 -123.65
We observe that the step function yields similar final regression models for the full data and for the train partition. The final model was chosen as the model with the lowest AIC = -174.84. We first compare this final model with the initial model.
anova(Model3,Model4)
## Analysis of Variance Table
##
## Model 1: Availability ~ PhysicalCond + StockManagement + MedicineOrdering +
## StockOutMgt + MedReceipting + InventoryMgt + ReportingSystem +
## InventorySystem
## Model 2: Availability ~ StockManagement + MedReceipting + InventoryMgt +
## ReportingSystem
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 37 0.79104
## 2 41 0.82719 -4 -0.036152 0.4227 0.7912
The differences between the two models are not statistically significant.We assess the predictive performance and other metrics of this model
summary(Model4)
##
## Call:
## lm(formula = Availability ~ StockManagement + MedReceipting +
## InventoryMgt + ReportingSystem, data = DF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.21426 -0.09071 -0.00043 0.07160 0.55696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43232 0.13089 10.943 9.76e-14 ***
## StockManagement -0.15941 0.05948 -2.680 0.0105 *
## MedReceipting 0.25611 0.04823 5.311 4.11e-06 ***
## InventoryMgt -0.28133 0.02977 -9.450 7.50e-12 ***
## ReportingSystem -0.55781 0.12246 -4.555 4.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.142 on 41 degrees of freedom
## Multiple R-squared: 0.7033, Adjusted R-squared: 0.6743
## F-statistic: 24.29 on 4 and 41 DF, p-value: 2.352e-10
The model with 4 predictors (Stock Management, Medicine Receipting, Inventory Management and Reporting Systems) explains 67.43% of the observed variation in the response variable measuring Medicine Availability.
Regression model diagnostics
par(mfrow = c(2,2))
plot(Model4)
The normal QQ plot above does show that the residuals(errors) of the estimated final regression model all lie more or less on the line but we also observe some very extreme values, a very small value coded 24, and very high values coded 37 and 3 respectively, which reflect the cases (observations) associated with these extreme values.
We examine whether these values are influential points and the strategy we can use to improve their influence on the regression model
Outliers <- hatvalues(Model4) > 3*mean(hatvalues(Model4))
DF[Outliers,]
## # A tibble: 4 × 9
## Availability PhysicalCond StockManagement MedicineOrdering StockOutMgt
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.514 19.6 14.9 10.7 18.5
## 2 0.514 19.6 14.9 10.7 18.5
## 3 0.85 0.281 2.35 0.194 0.508
## 4 0.94 0.350 0.00000101 1.06 0.508
## # ℹ 4 more variables: MedReceipting <dbl>, InventoryMgt <dbl>,
## # ReportingSystem <dbl>, InventorySystem <dbl>
We see that there are four observations positioned far from the fitted regression line, that is points with larger residuals.
df <- Model4$df.residual
p <- length(Model4$coefficients)
n <- nrow(Model4$model)
dffits_crit <- 2*sqrt((p+1)/(n-p-1))
Model_Diff <- dffits(Model4)
Df <- data.frame(obs = names(Model_Diff), DFFITS = Model_Diff)
ggplot(Df, aes(y = DFFITS, x = obs)) + geom_point() +
geom_hline(yintercept = c(dffits_crit), linetype = "dashed") +
labs(title = "DFFITS",
subtitle = "Influential Observations",
x = "Observation Number",
y = "DFFITS")
The DFFITS plot above shows three potential influential data points. We can list which ones as follows
DF[which(abs(Model_Diff) > dffits_crit), ]
## # A tibble: 5 × 9
## Availability PhysicalCond StockManagement MedicineOrdering StockOutMgt
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.74 0.129 0.157 0.125 0.0448
## 2 0 0.239 0.0149 0.0344 0.0448
## 3 0.85 0.281 2.35 0.194 0.508
## 4 0.85 0.0967 1.47 0.125 0.0448
## 5 0.94 0.350 0.00000101 1.06 0.508
## # ℹ 4 more variables: MedReceipting <dbl>, InventoryMgt <dbl>,
## # ReportingSystem <dbl>, InventorySystem <dbl>
We can assess outliers with Cook’s distance
cooks_crit = 0.5
Model_cooks <- cooks.distance(Model4)
df <- data.frame(obs = names(Model_cooks),
cooks = Model_cooks)
ggplot(df, aes(y = cooks, x = obs)) +
geom_point() +
geom_hline(yintercept = cooks_crit, linetype="dashed") +
labs(title = "Cook's Distance",
x = "Observation Number",
y = "Cook's")
The computed cook distance, shows 3 influential observations. We can
remove the influential observations and compare the regression with the
observations removed against the current regression model. Alternatively
we can compute robust regression as an alternative to the ols
regression. Before doing so, we will compute the normalised mean square
error for the final regression.
Predictions_lm <- predict(Model4, DF)
(nmse.model4 <- mean((Predictions_lm - DF[['Availability']])^2)/
mean((mean(DF[['Availability']]) - DF[['Availability']])^2))
## [1] 0.2967335
The NMSE is a unitless error measure with values ranging from 0 to 1. Our model is evidently performing better than a very simple baseline, as our nmse is sufficiently low.
dg <- data.frame(Model5 = Predictions_lm,
True.Model <- DF[['Availability']])
ggplot(dg, aes(x = Model5, y = True.Model)) + geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
ggtitle("Multiple Linear Regression Model")
Robust Regression Model The model accounts for outliers to provide a
better fit to the majority of the data. When theoretical reasons
constrain the need to keep outliers in the data, robust regression is a
better alternative to the OLS regression model.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:gtsummary':
##
## select
Model6 <- rlm(Availability~PhysicalCond+StockManagement+StockOutMgt+MedicineOrdering+MedReceipting+InventoryMgt+ReportingSystem+InventorySystem, data = DF)
summary(Model6)
##
## Call: rlm(formula = Availability ~ PhysicalCond + StockManagement +
## StockOutMgt + MedicineOrdering + MedReceipting + InventoryMgt +
## ReportingSystem + InventorySystem, data = DF)
## Residuals:
## Min 1Q Median 3Q Max
## -0.1517449 -0.0834245 -0.0004181 0.0621174 0.6559029
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 1.6678 0.1760 9.4741
## PhysicalCond -0.0356 0.1249 -0.2848
## StockManagement -0.2059 0.0472 -4.3621
## StockOutMgt -0.2397 0.3411 -0.7027
## MedicineOrdering 0.0045 0.0105 0.4337
## MedReceipting 0.5621 0.2992 1.8790
## InventoryMgt -0.3315 0.0249 -13.3408
## ReportingSystem -0.7549 0.1758 -4.2935
## InventorySystem -0.0374 0.0393 -0.9531
##
## Residual standard error: 0.1197 on 37 degrees of freedom
summary(Model4)
##
## Call:
## lm(formula = Availability ~ StockManagement + MedReceipting +
## InventoryMgt + ReportingSystem, data = DF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.21426 -0.09071 -0.00043 0.07160 0.55696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43232 0.13089 10.943 9.76e-14 ***
## StockManagement -0.15941 0.05948 -2.680 0.0105 *
## MedReceipting 0.25611 0.04823 5.311 4.11e-06 ***
## InventoryMgt -0.28133 0.02977 -9.450 7.50e-12 ***
## ReportingSystem -0.55781 0.12246 -4.555 4.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.142 on 41 degrees of freedom
## Multiple R-squared: 0.7033, Adjusted R-squared: 0.6743
## F-statistic: 24.29 on 4 and 41 DF, p-value: 2.352e-10
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.1
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data.frame(resid = Model6$wresid,
weight = Model6$w) %>%
arrange(weight) %>%
head(n=15)
## resid weight
## 3 0.655902902 0.2454358
## 32 0.314412641 0.5120513
## 46 0.178213575 0.9033055
## 1 -0.123996732 1.0000000
## 2 0.075236663 1.0000000
## 4 0.001017944 1.0000000
## 5 0.004267988 1.0000000
## 6 0.039739378 1.0000000
## 7 0.063057713 1.0000000
## 8 0.016191709 1.0000000
## 9 -0.007023709 1.0000000
## 10 0.090224306 1.0000000
## 11 -0.083895514 1.0000000
## 12 0.059296458 1.0000000
## 13 0.086628314 1.0000000
While the ols regression allocates equal weights of 1 on all variables in the model, we can see that robust regression assigns 1 to low residuals and lower weights to slightly higher residuals. We also observe that the differences are larger between t values of an ols regression and those of the robust regression, we choose the latter for modeling medicine availability, although the ols model has better interpretability.
The ols regression model, explains 67% of variation in medicine availability in a model with Stock management, medicine receipting, inventory management and reporting systems as predictors. It must be observed that these explanatory variables are composites computed using factor analysis as a data reduction technique.