In the post-pandemic era, work arrangements have undergone significant transformations. Many organizations have shifted from traditional office-based setups to more flexible arrangements, including hybrid or fully remote work. This change has sparked widespread interest in understanding the factors that influence an individual’s choice of work location.
Work location decisions are shaped by a combination of personal preferences, organizational policies, and external circumstances. For example, an employee’s preference for remote work might be influenced by the frequency of virtual meetings, their ability to maintain work-life balance, or feelings of social isolation. Similarly, organizations might consider these factors when designing policies to improve employee satisfaction and productivity.
This project focuses on predicting work location, categorized as Remote, On-site, or Hybrid, using factors from the Global Mental Health dataset. By employing nominal logistic regression, the project aims to uncover relationships between work location and key predictors such as the number of virtual meetings, work-life balance ratings, and social isolation ratings.
Research Objective: The primary goal is to predict an individual’s work location based on these predictors and to assess the effectiveness of nominal logistic regression for this task. This analysis not only provides insights into the predictors but also evaluates the strengths and limitations of this modeling approach in real-world applications.
The dataset contains 5,000 records collected from employees worldwide, with variables covering mental health and workload-related factors. The dependent variable is work location, since the goal is to predict it from the independent variables. The first independent variable is the number of virtual meetings, which captures the frequency of online meetings and may influence preferences for remote work. The second is the work-life balance rating, which reflects employees’ self-assessment of their ability to balance work and personal life. Employees who work remotely may report a better work-life balance because they save time by not commuting. The third is the social isolation rating, which measures perceived isolation due to work conditions and is particularly relevant for remote workers. Personally, when I worked remotely for a company, I was satisfied with my work-life balance but felt somewhat socially isolated, so I included this variable because I expected it to be a key predictor.
Research questions:
How do virtual meetings, work-life balance, and social isolation influence work location?
Can nominal logistic regression effectively model and predict work location probabilities?
What are the limitations of this approach, and how can they be addressed in future analyses?
This study is particularly relevant in the current climate, where understanding work preferences can guide organizations in crafting policies that align with employee needs while maintaining operational efficiency.
Statistical technique:
This project employs Nominal Logistic Regression, a statistical modeling technique used to predict a categorical dependent variable with three or more unordered categories. Unlike binary logistic regression, which deals with two outcomes, nominal logistic regression is designed for situations where there is no natural ordering among the categories of the dependent variable.
When do we need nominal logistic regression?
The dependent variable is categorical with three or more unordered categories (Remote, On-site, Hybrid).
Independent variables can be either quantitative or qualitative.
Why this technique?
Traditional linear regression cannot handle categorical dependent variables.
Binary logistic regression is limited to two outcome categories, making it unsuitable for this project.
Other techniques, such as ordinal logistic regression, assume a natural order among the categories, which does not apply to work location.
Why Do We Need a New Technique?
The dependent variable in this project consists of three unordered categories (Remote, On-site, Hybrid). To model the probabilities of these categories based on predictors, nominal logistic regression provides the flexibility to calculate the likelihood of each category without assuming any inherent ranking among them.
\(\log\left(\frac{P(Y = j)}{P(Y = K)}\right) = \beta_0^j + \beta_1^j X_1 + \beta_2^j X_2 + \dots + \beta_p^j X_p\)
j is the category being predicted (On-site or Remote).
K is the reference category (Hybrid).
\(X_1, X_2, \dots, X_p\) are the independent variables (predictors).
\(\beta_0^j\) is the intercept and \(\beta_1^j, \dots, \beta_p^j\) are the coefficients of the independent variables for category j relative to category K.
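Because the probabilities of all categories must sum to 1, the log-odds can be converted back into category probabilities. The block below is the standard multinomial logit identity written in this section's notation; \(\eta_j\) is shorthand introduced here for the linear predictor of category j and does not appear elsewhere in the report.

```latex
\eta_j = \beta_0^j + \beta_1^j X_1 + \dots + \beta_p^j X_p, \qquad j \neq K

P(Y = j) = \frac{e^{\eta_j}}{1 + \sum_{l \neq K} e^{\eta_l}}, \qquad
P(Y = K) = \frac{1}{1 + \sum_{l \neq K} e^{\eta_l}}
```

With Hybrid as the reference category K, the sum runs over On-site and Remote only.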
library(nnet)
library(car)
## Loading required package: carData
model <- multinom(Work_Location ~ Number_of_Virtual_Meetings +
Work_Life_Balance_Rating + Social_Isolation_Rating,
data = data_subset)
## # weights: 15 (8 variable)
## initial value 5493.061443
## iter 10 value 5486.714801
## final value 5485.274199
## converged
summary(model)
## Call:
## multinom(formula = Work_Location ~ Number_of_Virtual_Meetings +
## Work_Life_Balance_Rating + Social_Isolation_Rating, data = data_subset)
##
## Coefficients:
## (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite -0.06716496 0.02247721 -0.03863882
## Remote 0.04436945 0.01715389 -0.02039069
## Social_Isolation_Rating
## Onsite 0.002074439
## Remote -0.024267405
##
## Std. Errors:
## (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite 0.1247448 0.007541983 0.02477523
## Remote 0.1229951 0.007453141 0.02447730
## Social_Isolation_Rating
## Onsite 0.02504872
## Remote 0.02475168
##
## Residual Deviance: 10970.55
## AIC: 10986.55
A separate set of coefficients was estimated for Onsite and for Remote, each relative to the reference category Hybrid.
Fitted model:
\(\log\left(\frac{P(\text{On-site})}{P(\text{Hybrid})}\right) = -0.06716496 + 0.02247721 \cdot \text{NumberOfVirtualMeetings} - 0.03863882 \cdot \text{WorkLifeBalanceRating} + 0.002074439 \cdot \text{SocialIsolationRating}\)
\(\log\left(\frac{P(\text{Remote})}{P(\text{Hybrid})}\right) = 0.04436945 + 0.01715389 \cdot \text{NumberOfVirtualMeetings} - 0.02039069 \cdot \text{WorkLifeBalanceRating} - 0.024267405 \cdot \text{SocialIsolationRating}\)
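To make the two log-odds equations concrete, the following minimal sketch turns them into predicted probabilities for a single employee. The coefficients are copied from the `summary(model)` output; the predictor values (10 meetings, work-life balance 4, social isolation 2) are hypothetical and chosen only for illustration.

```r
# Fitted coefficients copied from the summary(model) output
# (rows are the non-reference categories; Hybrid is the reference).
coefs <- rbind(
  Onsite = c(-0.06716496, 0.02247721, -0.03863882,  0.002074439),
  Remote = c( 0.04436945, 0.01715389, -0.02039069, -0.024267405)
)
colnames(coefs) <- c("(Intercept)", "Number_of_Virtual_Meetings",
                     "Work_Life_Balance_Rating", "Social_Isolation_Rating")

# Hypothetical employee (illustrative values, not taken from the data).
# The leading 1 multiplies the intercept.
x <- c(1, 10, 4, 2)

# Log-odds of each category versus Hybrid; Hybrid itself has log-odds 0.
log_odds <- c(Hybrid = 0, drop(coefs %*% x))

# Convert log-odds to probabilities that sum to 1.
probs <- exp(log_odds) / sum(exp(log_odds))
print(round(probs, 3))
```

For these inputs all three probabilities stay close to one-third, with Remote slightly ahead, which mirrors the pattern in the model's own predictions.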
z_values <- summary(model)$coefficients / summary(model)$standard.errors
p_values <- (1 - pnorm(abs(z_values), 0, 1)) * 2
p_values
## (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite 0.5902878 0.002879863 0.1188604
## Remote 0.7182927 0.021359692 0.4048195
## Social_Isolation_Rating
## Onsite 0.9339977
## Remote 0.3268716
Most of the coefficients are not statistically significant at the 0.05 level. The exception is the number of virtual meetings, which is significant for both Onsite (p ≈ 0.003) and Remote (p ≈ 0.021) relative to Hybrid, suggesting that it contributes to the prediction of work location.
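As a worked check of the Wald test, the z-value and two-sided p-value for Number_of_Virtual_Meetings in the Onsite equation can be reproduced by hand from the reported coefficient and standard error. Here \(\Phi\) denotes the standard normal cumulative distribution function, which is what `pnorm` computes.

```latex
z = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac{0.02247721}{0.007541983} \approx 2.98,
\qquad
p = 2\,\bigl(1 - \Phi(|z|)\bigr) \approx 0.0029
```

This agrees with the 0.002879863 shown in the p-value table.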
loglik_full <- logLik(model)
loglik_null <- logLik(multinom(Work_Location ~ 1, data = data_subset))
## # weights: 6 (2 variable)
## initial value 5493.061443
## final value 5492.036004
## converged
pseudo_r2 <- 1 - as.numeric(loglik_full) / as.numeric(loglik_null)
cat("Pseudo R^2:", pseudo_r2, "\n")
## Pseudo R^2: 0.001231202
The pseudo R-squared value (McFadden's, computed as one minus the ratio of the full-model to the null-model log-likelihood) is 0.0012, indicating that the predictors improve on the intercept-only model by only about 0.1%. This emphasizes the need for a more comprehensive set of predictors or alternative modeling techniques.
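The same two log-likelihoods can also be compared with a likelihood-ratio test, which asks whether the three predictors jointly improve on the intercept-only model. The sketch below is an addition, not part of the original analysis. It assumes the "final value" lines printed by `multinom()` are negative log-likelihoods, which is consistent with the reported residual deviance of 10970.55 (twice 5485.27).

```r
# Log-likelihoods taken from the "final value" lines of the multinom() output.
loglik_full <- -5485.274199   # model with three predictors (8 parameters)
loglik_null <- -5492.036004   # intercept-only model (2 parameters)

# Likelihood-ratio statistic and its degrees of freedom.
lr_stat <- 2 * (loglik_full - loglik_null)
df <- 8 - 2

# Upper-tail chi-squared probability.
p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)

cat("LR statistic:", round(lr_stat, 2), "on", df, "df; p-value:",
    signif(p_value, 3), "\n")
```

The statistic is about 13.5 on 6 degrees of freedom, which falls between the 5% and 2.5% critical values of the chi-squared distribution. With 5,000 observations, a model can therefore be jointly significant while still explaining almost none of the variation.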
predicted_probs <- predict(model, type = "probs")
head(predicted_probs)
## Hybrid Onsite Remote
## 1 0.3205549 0.3253904 0.3540547
## 2 0.3321857 0.3289884 0.3388259
## 3 0.3300521 0.3284677 0.3414802
## 4 0.3341621 0.3224496 0.3433883
## 5 0.3042331 0.3469919 0.3487749
## 6 0.3617390 0.3132515 0.3250095
The model generates a probability for each work location based on the predictors.
Example: For the first employee in the output above, the predicted probabilities are Remote: 35.4%, Onsite: 32.5%, Hybrid: 32.1%.
These probabilities give a graded prediction rather than a single class label, although here they all stay close to one-third, so the model separates the categories only weakly.
odds_ratios <- exp(coef(model))
odds_ratios
## (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite 0.9350409 1.022732 0.9620981
## Remote 1.0453685 1.017302 0.9798158
## Social_Isolation_Rating
## Onsite 1.0020766
## Remote 0.9760247
Odds ratios were calculated to understand the relative impact of the predictors. For instance, the odds ratio for Number of Virtual Meetings is approximately 1.02 for both Onsite and Remote, meaning that each additional virtual meeting multiplies the odds of that location relative to Hybrid by about 1.02. Similarly, the Work-Life Balance Rating has odds ratios of about 0.96 (Onsite) and 0.98 (Remote), indicating an inverse relationship with both, although these coefficients are not statistically significant.
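Point estimates of odds ratios are easier to judge alongside their uncertainty. The following sketch, which is an addition to the original analysis, builds approximate 95% Wald confidence intervals from the coefficients and standard errors reported in `summary(model)`. The variable names `or_lower`, `or_upper` and `excludes_one` are introduced here for illustration.

```r
# Coefficients and standard errors copied from the summary(model) output
# (intercepts omitted; Hybrid is the reference category).
predictors <- c("Number_of_Virtual_Meetings", "Work_Life_Balance_Rating",
                "Social_Isolation_Rating")
coefs <- rbind(
  Onsite = c(0.02247721, -0.03863882,  0.002074439),
  Remote = c(0.01715389, -0.02039069, -0.024267405)
)
ses <- rbind(
  Onsite = c(0.007541983, 0.02477523, 0.02504872),
  Remote = c(0.007453141, 0.02447730, 0.02475168)
)
colnames(coefs) <- colnames(ses) <- predictors

# 95% Wald interval on the log-odds scale, then exponentiate.
z_crit <- qnorm(0.975)
or_lower <- exp(coefs - z_crit * ses)
or_upper <- exp(coefs + z_crit * ses)

# An interval that contains 1 is consistent with no effect.
excludes_one <- or_lower > 1 | or_upper < 1

print(round(or_lower, 3))
print(round(or_upper, 3))
print(excludes_one)
```

Only the intervals for Number_of_Virtual_Meetings exclude 1, which matches the p-value table.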
predicted_classes <- predict(model, type = "class")
actual_classes <- data_subset$Work_Location
accuracy <- mean(predicted_classes == actual_classes)
accuracy
## [1] 0.3552
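The model achieved an overall accuracy of 35.5%, meaning it correctly predicted the work location for slightly more than one-third of the cases. With three categories, this is only marginally better than the roughly 33% expected from random guessing, so the model has little practical predictive value, although it still gives some indication of which factors are associated with work location.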
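A single accuracy figure is easier to interpret next to a confusion matrix and a naive baseline. The helper below, `classification_summary`, is a hypothetical function written for this illustration. It tabulates actual against predicted classes and compares accuracy with always predicting the most common class. It is demonstrated on toy vectors, not on the project data.

```r
# Hypothetical helper: confusion matrix, accuracy, and majority-class baseline.
classification_summary <- function(actual, predicted) {
  # Use a common set of levels so the table is square.
  lv <- sort(unique(c(as.character(actual), as.character(predicted))))
  actual <- factor(actual, levels = lv)
  predicted <- factor(predicted, levels = lv)
  confusion <- table(Actual = actual, Predicted = predicted)
  list(
    confusion = confusion,
    accuracy  = sum(diag(confusion)) / sum(confusion),
    baseline  = max(table(actual)) / length(actual)
  )
}

# Toy vectors for illustration only (not the project data).
toy_actual    <- c("Remote", "Hybrid", "Onsite", "Remote", "Hybrid", "Onsite")
toy_predicted <- c("Remote", "Remote", "Onsite", "Remote", "Onsite", "Hybrid")

result <- classification_summary(toy_actual, toy_predicted)
print(result$confusion)
cat("Accuracy:", result$accuracy, " Baseline:", result$baseline, "\n")
```

Applied to this report's own objects, the call would be `classification_summary(actual_classes, predicted_classes)`. The confusion matrix would show whether the model favors one category over the others.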
vif(model)
## Warning in vif.default(model): No intercept: vifs may not be sensible.
## Number_of_Virtual_Meetings Work_Life_Balance_Rating
## 5.012687 7.275317
## Social_Isolation_Rating
## 7.564903
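The warning above suggests that `vif()` from `car` may not give sensible values for a `multinom` fit, so the figures of 5 to 7.6 should be read with caution. Variance inflation factors depend only on the predictors, so one alternative is to compute them directly by regressing each predictor on the others. The sketch below uses a hypothetical helper, `manual_vif`, on simulated, independent predictors. The simulated data are an assumption for illustration and are not the project data.

```r
# Hypothetical helper: VIF for each column of a numeric data frame.
# Regress each column on all the others and use 1 / (1 - R^2).
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated, independent predictors (illustrative only).
set.seed(1)
toy <- data.frame(
  Number_of_Virtual_Meetings = rpois(500, 7),
  Work_Life_Balance_Rating   = sample(1:5, 500, replace = TRUE),
  Social_Isolation_Rating    = sample(1:5, 500, replace = TRUE)
)

vifs <- manual_vif(toy)
print(round(vifs, 2))
```

Assuming the three predictor columns in `data_subset` are numeric, the same helper could be applied to them. Values near 1 would indicate little multicollinearity, while values above roughly 5 to 10 are commonly treated as a concern.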
The multinomial logistic regression model identified only weak relationships between the predictors and work location. Although the accuracy (35.5%) and pseudo R-squared value were low, the model still provided some insights:
The number of virtual meetings is the only statistically significant predictor of work location probabilities.
The model outputs probabilities for work locations, with the highest probability determining the prediction.
Work-life balance ratings showed a weaker, statistically non-significant association with work location.
Social isolation ratings had minimal impact, suggesting that other unmeasured factors may better explain work location decisions.
Implications
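Organizations can use insights from this analysis as a starting point when designing policies that align with employee preferences. For instance, employees with more virtual meetings were slightly more likely to work Onsite or Remote than Hybrid, so meeting load may be worth considering when planning work arrangements. Given the model's weak fit, these findings should be treated as tentative associations rather than as evidence that changing meeting frequency or work-life balance would change where people work.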
What did we find out?
The number of virtual meetings emerged as the most impactful predictor, with higher frequencies associated with slightly higher odds of Onsite and Remote work relative to Hybrid. Work-life balance ratings showed a weaker, statistically non-significant inverse association with Onsite and Remote work, so employees with better work-life balance may lean slightly toward Hybrid arrangements. Social isolation ratings had a negligible effect, suggesting that other factors might play a more critical role in work location preferences. The model assigned probabilities to each work location category, offering graded predictions rather than a single class label. However, the overall accuracy (35.5%) and pseudo R² (0.001) indicate that the model has limited predictive power.
Strength of the Analysis:
Nominal logistic regression provides straightforward insights into how each predictor affects the likelihood of each work location. Odds ratios and predicted probabilities make the results easy to communicate to non-technical audiences. The methodology is well-suited for categorical outcomes like work location, where there is no inherent order among categories. The model highlights actionable factors, such as the impact of virtual meetings, which can inform organizational policies.
Weaknesses of the Analysis:
The model’s accuracy (35.5%) and pseudo R² (0.001) suggest that the chosen predictors explain only a small fraction of the variation in work location. The limited set of predictors may have excluded key factors such as job role, industry, or geographic location, which are likely to have a significant impact. Nominal logistic regression also assumes that observations are independent, and the data may not fully meet this assumption, as employees within the same organization could have correlated responses. Without interaction terms or non-linear transformations, the model assumes a linear relationship between continuous predictors and the log-odds, which might oversimplify complex relationships.
# Required packages
library(nnet)
library(car)
# Fit the model
model <- multinom(Work_Location ~ Number_of_Virtual_Meetings +
Work_Life_Balance_Rating + Social_Isolation_Rating,
data = data_subset)
## # weights: 15 (8 variable)
## initial value 5493.061443
## iter 10 value 5486.714801
## final value 5485.274199
## converged
# Model Summary
summary(model)
## Call:
## multinom(formula = Work_Location ~ Number_of_Virtual_Meetings +
## Work_Life_Balance_Rating + Social_Isolation_Rating, data = data_subset)
##
## Coefficients:
## (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite -0.06716496 0.02247721 -0.03863882
## Remote 0.04436945 0.01715389 -0.02039069
## Social_Isolation_Rating
## Onsite 0.002074439
## Remote -0.024267405
##
## Std. Errors:
## (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite 0.1247448 0.007541983 0.02477523
## Remote 0.1229951 0.007453141 0.02447730
## Social_Isolation_Rating
## Onsite 0.02504872
## Remote 0.02475168
##
## Residual Deviance: 10970.55
## AIC: 10986.55
# Calculate Pseudo R²
loglik_full <- logLik(model)
loglik_null <- logLik(multinom(Work_Location ~ 1, data = data_subset))
## # weights: 6 (2 variable)
## initial value 5493.061443
## final value 5492.036004
## converged
pseudo_r2 <- 1 - as.numeric(loglik_full) / as.numeric(loglik_null)
cat("Pseudo R²:", pseudo_r2, "\n")
## Pseudo R²: 0.001231202
# Predicted Probabilities
predicted_probs <- predict(model, type = "probs")
head(predicted_probs)
## Hybrid Onsite Remote
## 1 0.3205549 0.3253904 0.3540547
## 2 0.3321857 0.3289884 0.3388259
## 3 0.3300521 0.3284677 0.3414802
## 4 0.3341621 0.3224496 0.3433883
## 5 0.3042331 0.3469919 0.3487749
## 6 0.3617390 0.3132515 0.3250095
# Classification Accuracy
predicted_classes <- predict(model, type = "class")
actual_classes <- data_subset$Work_Location
accuracy <- mean(predicted_classes == actual_classes)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.3552
# Odds Ratios
odds_ratios <- exp(coef(model))
print(odds_ratios)
## (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite 0.9350409 1.022732 0.9620981
## Remote 1.0453685 1.017302 0.9798158
## Social_Isolation_Rating
## Onsite 1.0020766
## Remote 0.9760247
# VIF for multicollinearity check
vif_values <- vif(model)
## Warning in vif.default(model): No intercept: vifs may not be sensible.
print(vif_values)
## Number_of_Virtual_Meetings Work_Life_Balance_Rating
## 5.012687 7.275317
## Social_Isolation_Rating
## 7.564903