Final Project Report: Predicting Work Location Using Nominal Logistic Regression

Introduction

In the post-pandemic era, work arrangements have undergone significant transformations. Many organizations have shifted from traditional office-based setups to more flexible arrangements, including hybrid or fully remote work. This change has sparked widespread interest in understanding the factors that influence an individual’s choice of work location.

Work location decisions are shaped by a combination of personal preferences, organizational policies, and external circumstances. For example, an employee’s preference for remote work might be influenced by the frequency of virtual meetings, their ability to maintain work-life balance, or feelings of social isolation. Similarly, organizations might consider these factors when designing policies to improve employee satisfaction and productivity.

This project focuses on predicting work location (categorized as Remote, On-site, or Hybrid) using factors from the Global Mental Health dataset. By employing nominal logistic regression, the project aims to uncover relationships between work location and key predictors such as the number of virtual meetings, work-life balance ratings, and social isolation ratings.

Research Objective: The primary goal is to predict an individual’s work location based on these predictors and to assess the effectiveness of nominal logistic regression for this task. This analysis not only provides insights into the predictors but also evaluates the strengths and limitations of this modeling approach in real-world applications.

Data description

The data consist of 5,000 records collected from employees worldwide, covering variables related to mental health and workload. The dependent variable is work location, since the goal is to predict it from the independent variables. I chose three independent variables. The first is the number of virtual meetings, which captures the frequency of online meetings and may influence preferences for remote work. The second is the work-life balance rating, which reflects employees’ self-assessment of their ability to balance work and personal life. Employees who work remotely may report a better work-life balance because they save time by not commuting. The third is the social isolation rating, which captures perceived isolation due to work conditions and is particularly relevant for remote workers. Personally, when I worked remotely for a company, I was satisfied with my work-life balance but felt somewhat socially isolated, so I included this variable because I expected it to be a key predictor.

Key Questions:

How do virtual meetings, work-life balance, and social isolation influence work location?
Can nominal logistic regression effectively model and predict work location probabilities?
What are the limitations of this approach, and how can they be addressed in future analyses?

This study is particularly relevant in the current climate, where understanding work preferences can guide organizations in crafting policies that align with employee needs while maintaining operational efficiency.

Methodology

Statistical technique:

This project employs Nominal Logistic Regression, a statistical modeling technique used to predict a categorical dependent variable with three or more unordered categories. Unlike binary logistic regression, which deals with two outcomes, nominal logistic regression is designed for situations where there is no natural ordering among the categories of the dependent variable.

When do we need nominal logistic regression?

The dependent variable is categorical with three or more unordered categories (Remote, On-site, Hybrid).
Independent variables can be either quantitative or qualitative.

Why this technique?

Traditional linear regression cannot handle categorical dependent variables.
Binary logistic regression is limited to two outcome categories, making it unsuitable for this project.
Other techniques, such as ordinal logistic regression, assume a natural order among the categories, which does not apply to work location.

Why Do We Need a New Technique?

The dependent variable in this project consists of three unordered categories (Remote, On-site, Hybrid). To model the probabilities of these categories based on predictors, nominal logistic regression provides the flexibility to calculate the likelihood of each category without assuming any inherent ranking among them.

General formula

\(\log\left(\frac{P(Y = j)}{P(Y = K)}\right) = \beta_0^j + \beta_1^j X_1 + \beta_2^j X_2 + \dots + \beta_p^j X_p\)

j is the category being predicted (Remote or On-site).
K is the reference category (Hybrid).
\(X_1, X_2, \dots, X_p\) are the independent variables (predictors).
\(\beta_0^j\) is the intercept and \(\beta_1^j, \dots, \beta_p^j\) are the coefficients of the independent variables for category j relative to category K.
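
The log-odds formula above can be turned into category probabilities. The following is a short derivation sketch; the symbol \(\eta_j\) is introduced here only as shorthand for the right-hand side of the formula and does not appear elsewhere in the report.

```latex
\[
\eta_j = \beta_0^j + \beta_1^j X_1 + \dots + \beta_p^j X_p
\quad\Rightarrow\quad
P(Y = j) = \frac{\exp(\eta_j)}{1 + \sum_{k \neq K} \exp(\eta_k)},
\qquad
P(Y = K) = \frac{1}{1 + \sum_{k \neq K} \exp(\eta_k)}
\]
```

The reference category K has log-odds of zero by construction, which is where the 1 in the denominator comes from. The probabilities across all categories sum to one.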

Model Assumptions

Independence of observations: Each observation (employee) must be independent. This ensures that the predictions are not influenced by related data points.
No multicollinearity: The independent variables should not be highly correlated. This was checked using the Variance Inflation Factor (VIF) to ensure the stability of the model coefficients.
For continuous predictors, the relationship between the predictors and the log-odds of the outcomes should be linear. Scatter plots of predictors versus predicted probabilities were used to assess this assumption.
A sufficiently large sample is required to ensure reliable and stable estimates for each category of the dependent variable. With 5,000 observations, the sample size was deemed adequate for this analysis.

library(nnet)
library(car)

## Loading required package: carData

model <- multinom(Work_Location ~ Number_of_Virtual_Meetings + 
                   Work_Life_Balance_Rating + Social_Isolation_Rating, 
                  data = data_subset)

## # weights:  15 (8 variable)
## initial  value 5493.061443 
## iter  10 value 5486.714801
## final  value 5485.274199 
## converged

Results

summary(model)

## Call:
## multinom(formula = Work_Location ~ Number_of_Virtual_Meetings + 
##     Work_Life_Balance_Rating + Social_Isolation_Rating, data = data_subset)
## 
## Coefficients:
##        (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite -0.06716496                 0.02247721              -0.03863882
## Remote  0.04436945                 0.01715389              -0.02039069
##        Social_Isolation_Rating
## Onsite             0.002074439
## Remote            -0.024267405
## 
## Std. Errors:
##        (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite   0.1247448                0.007541983               0.02477523
## Remote   0.1229951                0.007453141               0.02447730
##        Social_Isolation_Rating
## Onsite              0.02504872
## Remote              0.02475168
## 
## Residual Deviance: 10970.55 
## AIC: 10986.55

The coefficients for the predictors were estimated for each work location, whether it is Onsite or Remote.

Fitted model:

\(\log(P(\text{On-site})/P(\text{Hybrid})) = -0.06716496 + 0.02247721\,\text{NumberOfVirtualMeetings} - 0.03863882\,\text{WorkLifeBalanceRating} + 0.002074439\,\text{SocialIsolationRating}\)

\(\log(P(\text{Remote})/P(\text{Hybrid})) = 0.04436945 + 0.01715389\,\text{NumberOfVirtualMeetings} - 0.02039069\,\text{WorkLifeBalanceRating} - 0.024267405\,\text{SocialIsolationRating}\)
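
To make the two equations concrete, here is a minimal sketch that converts them into predicted probabilities for a single employee. The coefficients are copied from the summary(model) output above; the employee's values (5 virtual meetings, ratings of 3 and 3) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model) output (Hybrid is the reference).
coefs <- rbind(
  Onsite = c(-0.06716496, 0.02247721, -0.03863882,  0.002074439),
  Remote = c( 0.04436945, 0.01715389, -0.02039069, -0.024267405)
)
colnames(coefs) <- c("(Intercept)", "Number_of_Virtual_Meetings",
                     "Work_Life_Balance_Rating", "Social_Isolation_Rating")

# Hypothetical employee (illustrative values, not taken from the data):
# intercept term, 5 virtual meetings, work-life balance 3, social isolation 3.
x <- c(1, 5, 3, 3)

# Log-odds versus Hybrid for each non-reference category.
eta <- drop(coefs %*% x)

# The reference category has log-odds 0, so its numerator is exp(0) = 1.
numerators <- c(Hybrid = 1, exp(eta))
probs <- numerators / sum(numerators)
round(probs, 4)
```

With real data, predict(model, type = "probs") performs the same calculation for every row at once.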

z_values <- summary(model)$coefficients / summary(model)$standard.errors
p_values <- (1 - pnorm(abs(z_values), 0, 1)) * 2
p_values

##        (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite   0.5902878                0.002879863                0.1188604
## Remote   0.7182927                0.021359692                0.4048195
##        Social_Isolation_Rating
## Onsite               0.9339977
## Remote               0.3268716

Most of the coefficients are not statistically significant at the 0.05 level. The exception is the number of virtual meetings, which is significant for both On-site (p = 0.003) and Remote (p = 0.021) relative to Hybrid, suggesting that it contributes to the prediction of work location.

Pseudo R-Squared

loglik_full <- logLik(model)
loglik_null <- logLik(multinom(Work_Location ~ 1, data = data_subset))

## # weights:  6 (2 variable)
## initial  value 5493.061443 
## final  value 5492.036004 
## converged

pseudo_r2 <- 1 - as.numeric(loglik_full) / as.numeric(loglik_null)
cat("Pseudo R^2:", pseudo_r2, "\n")

## Pseudo R^2: 0.001231202

The pseudo R-squared value (McFadden’s, computed as one minus the ratio of the full-model to the null-model log-likelihood) is 0.0012, indicating that the model explains a very small proportion of the variation in the dependent variable. This further emphasizes the need for a more comprehensive set of predictors or alternative modeling techniques.
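
The same two log-likelihoods can also be compared with a likelihood-ratio test, which asks whether the three predictors jointly improve on the intercept-only model. This is a sketch that was not part of the original analysis. It assumes the "final value" lines printed by multinom() are negative log-likelihoods, which is consistent with the residual deviance of 10970.55 being twice 5485.27.

```r
# Log-likelihoods taken from the "final value" lines printed by multinom().
loglik_full <- -5485.274199
loglik_null <- -5492.036004

# McFadden's pseudo R-squared, as computed in the report.
pseudo_r2 <- 1 - loglik_full / loglik_null

# Likelihood-ratio statistic; degrees of freedom are the difference in the
# number of estimated parameters (8 in the full model, 2 in the null model).
lr_stat <- 2 * (loglik_full - loglik_null)
lr_df   <- 8 - 2
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(pseudo_r2 = pseudo_r2, lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p)
```

A small p-value here would only say that the predictors improve the fit by more than chance alone would suggest. It would not contradict the very low pseudo R-squared, since with 5,000 observations even weak effects can be detected.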

Predicted Probabilities

predicted_probs <- predict(model, type = "probs")
head(predicted_probs)

##      Hybrid    Onsite    Remote
## 1 0.3205549 0.3253904 0.3540547
## 2 0.3321857 0.3289884 0.3388259
## 3 0.3300521 0.3284677 0.3414802
## 4 0.3341621 0.3224496 0.3433883
## 5 0.3042331 0.3469919 0.3487749
## 6 0.3617390 0.3132515 0.3250095

The model generated probabilities for each work location based on the predictors.

Example: For the first employee in the output above, the predicted probabilities are Remote: 35.4%, On-site: 32.5%, Hybrid: 32.1%.

These probabilities allow for nuanced predictions rather than binary outcomes.
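
As a small illustration of how these probabilities become class predictions, the sketch below takes the six rows printed above, picks the most probable category for each, and reports the gap between the top two probabilities. The margin column is an illustrative addition and not something the original analysis computed.

```r
# First six rows of predicted probabilities, copied from the output above.
probs <- matrix(
  c(0.3205549, 0.3253904, 0.3540547,
    0.3321857, 0.3289884, 0.3388259,
    0.3300521, 0.3284677, 0.3414802,
    0.3341621, 0.3224496, 0.3433883,
    0.3042331, 0.3469919, 0.3487749,
    0.3617390, 0.3132515, 0.3250095),
  ncol = 3, byrow = TRUE,
  dimnames = list(as.character(1:6), c("Hybrid", "Onsite", "Remote"))
)

# Predicted class = column with the largest probability in each row.
predicted <- colnames(probs)[max.col(probs, ties.method = "first")]

# Gap between the largest and second-largest probability in each row.
margin <- unname(apply(probs, 1, function(p) {
  s <- sort(p, decreasing = TRUE)
  s[1] - s[2]
}))

data.frame(predicted, margin = round(margin, 4))
```

Small margins mean the predicted class is decided by very slight differences between categories.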

Odds Ratio

odds_ratios <- exp(coef(model))
odds_ratios

##        (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite   0.9350409                   1.022732                0.9620981
## Remote   1.0453685                   1.017302                0.9798158
##        Social_Isolation_Rating
## Onsite               1.0020766
## Remote               0.9760247

Odds ratios were calculated to understand the relative impact of the predictors, each relative to the Hybrid reference category. For instance, the odds ratio for Number of Virtual Meetings is approximately 1.02 for On-site, meaning that each additional virtual meeting multiplies the odds of working On-site rather than Hybrid by about 1.02. Similarly, the Work-Life Balance Rating has odds ratios of about 0.96 (On-site) and 0.98 (Remote), indicating a weak inverse relationship with both categories relative to Hybrid.
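
Odds ratios are easier to judge with an interval around them. The sketch below builds approximate 95% Wald confidence intervals from the coefficients and standard errors reported in summary(model). It was not part of the original analysis, and the shortened column names are used only for readability.

```r
# Slope estimates and standard errors copied from the summary(model) output.
est <- rbind(
  Onsite = c(0.02247721, -0.03863882,  0.002074439),
  Remote = c(0.01715389, -0.02039069, -0.024267405)
)
se <- rbind(
  Onsite = c(0.007541983, 0.02477523, 0.02504872),
  Remote = c(0.007453141, 0.02447730, 0.02475168)
)
colnames(est) <- colnames(se) <-
  c("Virtual_Meetings", "Work_Life_Balance", "Social_Isolation")

# Wald interval on the log-odds scale, then exponentiate to get odds ratios.
z <- qnorm(0.975)  # about 1.96 for a 95% interval
or_lower <- exp(est - z * se)
or_upper <- exp(est + z * se)

# TRUE where the interval excludes 1, matching a Wald test at the 5% level.
excludes_one <- or_lower > 1 | or_upper < 1
excludes_one
```

An interval that contains 1 means the data are compatible with no association for that predictor and category.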

Classification Accuracy

predicted_classes <- predict(model, type = "class")
actual_classes <- data_subset$Work_Location
accuracy <- mean(predicted_classes == actual_classes)
accuracy

## [1] 0.3552

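The model achieved an overall accuracy of 35.5%, meaning it correctly predicted the work location for roughly one-third of the cases. With three categories, random guessing would be correct about 33% of the time if the classes are roughly balanced, so the model performs only slightly better than chance. The accuracy figure is therefore best read as evidence that these predictors carry little information about work location.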
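
An accuracy figure is easier to interpret next to a confusion matrix and a simple baseline. The sketch below shows one way to compute both. The label vectors are toy values made up for illustration; with the real model they would be replaced by predicted_classes and actual_classes.

```r
# Toy labels for illustration only (not the real data).
lvls <- c("Hybrid", "Onsite", "Remote")
actual    <- factor(c("Remote", "Hybrid", "Onsite", "Remote", "Hybrid", "Remote"),
                    levels = lvls)
predicted <- factor(c("Remote", "Remote", "Remote", "Remote", "Hybrid", "Onsite"),
                    levels = lvls)

# Confusion matrix: rows are predictions, columns are true classes.
confusion <- table(Predicted = predicted, Actual = actual)

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Majority-class baseline: accuracy from always predicting the most common class.
baseline <- max(table(actual)) / length(actual)

confusion
c(accuracy = accuracy, baseline = baseline)
```

If the accuracy is no higher than the baseline, the predictors add nothing beyond knowing which class is most common.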

vif(model)

## Warning in vif.default(model): No intercept: vifs may not be sensible.

## Number_of_Virtual_Meetings   Work_Life_Balance_Rating 
##                   5.012687                   7.275317 
##    Social_Isolation_Rating 
##                   7.564903

Model conditions

Independence cannot be assumed because the data do not come from a random sample.
Linearity of the log-odds for continuous predictors could not be formally verified, since the model includes no interaction or transformed terms to test it against.
The sample size of 5,000 is sufficient.
Outliers and influential points are unlikely because the rating predictors are bounded on a 1-to-5 scale.
The VIF values (5.0 to 7.6) exceed 5, which suggests some multicollinearity among the variables, although the warning from vif() indicates that these values may not be reliable for this model.
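
Because vif() warns that its values may not be sensible for a multinom fit, one alternative is to compute VIFs from the predictors alone, using auxiliary linear regressions. The sketch below is not what the report did. It uses simulated stand-in data with arbitrary ranges, so the printed values say nothing about the real dataset; with the real data, `sim` would be replaced by the three predictor columns of data_subset.

```r
set.seed(1)

# Simulated stand-in for the three predictors (NOT the real data;
# the value ranges are arbitrary assumptions).
n <- 5000
sim <- data.frame(
  Number_of_Virtual_Meetings = sample(0:15, n, replace = TRUE),
  Work_Life_Balance_Rating   = sample(1:5,  n, replace = TRUE),
  Social_Isolation_Rating    = sample(1:5,  n, replace = TRUE)
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all of the other predictors.
vif_by_hand <- sapply(names(sim), function(v) {
  aux_formula <- reformulate(setdiff(names(sim), v), response = v)
  r2 <- summary(lm(aux_formula, data = sim))$r.squared
  1 / (1 - r2)
})

round(vif_by_hand, 3)
```

Values near 1 indicate that a predictor is almost uncorrelated with the others. This version depends only on the predictors, not on the outcome model.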

Conclusion

The multinomial logistic regression model quantified the relationship between the predictors and work location, but that relationship is weak: both the accuracy (35.5%) and the pseudo R-squared value (0.0012) were low. The main findings are:

The number of virtual meetings is the only statistically significant predictor of work location probabilities.
The model outputs probabilities for work locations, with the highest probability determining the prediction.
Work-life balance ratings showed a weaker association that was not statistically significant (p = 0.12 for On-site and p = 0.40 for Remote).
Social isolation ratings had minimal impact (p = 0.93 for On-site and p = 0.33 for Remote), suggesting that other unmeasured factors may better explain work location decisions.

Implications

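Organizations can use insights from this analysis as a starting point when designing policies that align with employee preferences. For instance, the model associates fewer virtual meetings and higher work-life balance ratings with slightly higher odds of Hybrid work relative to On-site or Remote work. Because these associations are weak and observational, they should be confirmed with richer data before they guide policy.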

Discussion and Critique

What did we find out?

The number of virtual meetings emerged as the most impactful predictor, with higher frequencies associated with slightly higher odds of both On-site and Remote work relative to Hybrid (odds ratios of about 1.02 per additional meeting). Work-life balance ratings had negative coefficients for both On-site and Remote relative to Hybrid, but neither was statistically significant. Social isolation ratings had a negligible effect, suggesting that other factors might play a more critical role in work location preferences. The model assigned probabilities to each work location category, offering nuanced predictions rather than binary classifications. However, the overall accuracy (35.5%) and explained variance (pseudo R² = 0.001) indicate that the model has limited predictive power.

Strength of the Analysis:

Nominal logistic regression provides straightforward insights into how each predictor affects the likelihood of each work location. Odds ratios and predicted probabilities make the results easy to communicate to non-technical audiences. The methodology is well-suited for categorical outcomes like work location, where there is no inherent order among categories. The model highlights actionable factors, such as the impact of virtual meetings, which can inform organizational policies.

Weaknesses of the Analysis:

The model’s accuracy (35.5%) and pseudo R² (0.001) suggest that the chosen predictors explain only a small fraction of the variance in work location. The limited set of predictors may have excluded key factors such as job role, industry, or geographic location, which are likely to have a significant impact. The data may not fully meet the independence assumption, as employees within the same organization could have correlated responses. Without interaction terms or non-linear transformations, the model assumes a linear relationship between continuous predictors and the log-odds, which might oversimplify complex relationships.

R code

# Required packages
library(nnet)
library(car)

# Fit the model
model <- multinom(Work_Location ~ Number_of_Virtual_Meetings + 
                  Work_Life_Balance_Rating + Social_Isolation_Rating, 
                  data = data_subset)

## # weights:  15 (8 variable)
## initial  value 5493.061443 
## iter  10 value 5486.714801
## final  value 5485.274199 
## converged

# Model Summary
summary(model)

## Call:
## multinom(formula = Work_Location ~ Number_of_Virtual_Meetings + 
##     Work_Life_Balance_Rating + Social_Isolation_Rating, data = data_subset)
## 
## Coefficients:
##        (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite -0.06716496                 0.02247721              -0.03863882
## Remote  0.04436945                 0.01715389              -0.02039069
##        Social_Isolation_Rating
## Onsite             0.002074439
## Remote            -0.024267405
## 
## Std. Errors:
##        (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite   0.1247448                0.007541983               0.02477523
## Remote   0.1229951                0.007453141               0.02447730
##        Social_Isolation_Rating
## Onsite              0.02504872
## Remote              0.02475168
## 
## Residual Deviance: 10970.55 
## AIC: 10986.55

# Calculate Pseudo R²
loglik_full <- logLik(model)
loglik_null <- logLik(multinom(Work_Location ~ 1, data = data_subset))

## # weights:  6 (2 variable)
## initial  value 5493.061443 
## final  value 5492.036004 
## converged

pseudo_r2 <- 1 - as.numeric(loglik_full) / as.numeric(loglik_null)
cat("Pseudo R²:", pseudo_r2, "\n")

## Pseudo R²: 0.001231202

# Predicted Probabilities
predicted_probs <- predict(model, type = "probs")
head(predicted_probs)

##      Hybrid    Onsite    Remote
## 1 0.3205549 0.3253904 0.3540547
## 2 0.3321857 0.3289884 0.3388259
## 3 0.3300521 0.3284677 0.3414802
## 4 0.3341621 0.3224496 0.3433883
## 5 0.3042331 0.3469919 0.3487749
## 6 0.3617390 0.3132515 0.3250095

# Classification Accuracy
predicted_classes <- predict(model, type = "class")
actual_classes <- data_subset$Work_Location
accuracy <- mean(predicted_classes == actual_classes)
cat("Accuracy:", accuracy, "\n")

## Accuracy: 0.3552

# Odds Ratios
odds_ratios <- exp(coef(model))
print(odds_ratios)

##        (Intercept) Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Onsite   0.9350409                   1.022732                0.9620981
## Remote   1.0453685                   1.017302                0.9798158
##        Social_Isolation_Rating
## Onsite               1.0020766
## Remote               0.9760247

# VIF for multicollinearity check
vif_values <- vif(model)

## Warning in vif.default(model): No intercept: vifs may not be sensible.

print(vif_values)

## Number_of_Virtual_Meetings   Work_Life_Balance_Rating 
##                   5.012687                   7.275317 
##    Social_Isolation_Rating 
##                   7.564903