data <- read.csv("C:\\Users\\91814\\Desktop\\Statistics\\nurses.csv")

I’m choosing “Location_Quotient” as the basis for the response variable. Location quotient measures the concentration of Registered Nurses in a particular state compared to the national average. It is a continuous measure, so I convert it into a binary variable by categorizing states with a location quotient greater than 1 as “above average” and states with a location quotient less than or equal to 1 as “average or below.”

I’m selecting four explanatory variables to build the logistic regression model: “Hourly_Wage_Median,” “Annual_Salary_Median,” “Hourly_10th_Percentile,” and “Annual_90th_Percentile.”

data$Location_Quotient_binary <- ifelse(data$Location_Quotient > 1, 1, 0)
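
A quick check of how this recoding treats the boundary value and missing values may be useful. The vector below is made up for illustration and is not taken from nurses.csv:

```r
# Hypothetical location quotients, including the boundary value 1 and a missing value
lq <- c(0.8, 1.0, 1.2, NA)

# Values above 1 become 1, values at or below 1 become 0, and NA stays NA
lq_binary <- ifelse(lq > 1, 1, 0)
print(lq_binary)

# Count each class while keeping missing values visible
print(table(lq_binary, useNA = "ifany"))
```

Rows where the recoded response is NA are dropped by glm() under its default missing-value handling, so they count toward the observations reported as deleted due to missingness.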

model <- glm(Location_Quotient_binary ~ Hourly_Wage_Median + Annual_Salary_Median + Hourly_10th_Percentile + Annual_90th_Percentile, data = data, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = Location_Quotient_binary ~ Hourly_Wage_Median + 
##     Annual_Salary_Median + Hourly_10th_Percentile + Annual_90th_Percentile, 
##     family = "binomial", data = data)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.911e+00  6.022e-01   4.834 1.34e-06 ***
## Hourly_Wage_Median     -5.399e+01  2.834e+01  -1.905   0.0568 .  
## Annual_Salary_Median    2.554e-02  1.362e-02   1.874   0.0609 .  
## Hourly_10th_Percentile  6.415e-01  1.045e-01   6.140 8.27e-10 ***
## Annual_90th_Percentile  1.153e-04  2.373e-05   4.859 1.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 819.87  on 591  degrees of freedom
## Residual deviance: 712.24  on 587  degrees of freedom
##   (650 observations deleted due to missingness)
## AIC: 722.24
## 
## Number of Fisher Scoring iterations: 4

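Interpreting the coefficients:

All coefficients are on the log-odds scale. Hourly_10th_Percentile (0.6415, p < 0.001) and Annual_90th_Percentile (1.153e-04, p < 0.001) are positive and clearly significant, so higher values of either are associated with higher odds of an above-average location quotient, holding the other variables fixed. Hourly_Wage_Median (-53.99, p = 0.0568) and Annual_Salary_Median (0.02554, p = 0.0609) fall just short of the 5% level. Their large standard errors and opposite signs are a pattern often seen when two predictors are strongly correlated, so these two estimates should be read with caution. The model used 592 observations; 650 were dropped because of missing values.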
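
Log-odds coefficients are often easier to read as odds ratios. The sketch below exponentiates two estimates copied from the summary(model) output above. The 10,000-unit rescaling of Annual_90th_Percentile is an illustrative choice, not something taken from the analysis:

```r
# Estimates copied from the summary(model) output (log-odds scale)
log_odds <- c(
  Hourly_10th_Percentile = 6.415e-01,
  Annual_90th_Percentile = 1.153e-04
)

# Odds ratio for a one-unit increase in each predictor
odds_ratio_per_unit <- exp(log_odds)

# A one-unit change in an annual figure is tiny, so also express
# Annual_90th_Percentile per 10,000 units (illustrative step size)
odds_ratio_per_10k <- exp(log_odds["Annual_90th_Percentile"] * 10000)

print(round(odds_ratio_per_unit, 4))
print(round(odds_ratio_per_10k, 2))
```

With the fitted model in memory, exp(coef(model)) gives the same per-unit odds ratios for every term at once.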

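library(ggplot2)

# Keep rows with no missing values in any of the four predictors,
# so that the means below are not NA
predictors <- c("Hourly_Wage_Median", "Annual_Salary_Median",
                "Hourly_10th_Percentile", "Annual_90th_Percentile")
data_clean <- data[complete.cases(data[, predictors]), ]

# Vary Hourly_Wage_Median over its observed range; fix the others at their means
plot_data <- expand.grid(
  Hourly_Wage_Median = seq(min(data_clean$Hourly_Wage_Median), max(data_clean$Hourly_Wage_Median), length.out = 100),
  Annual_Salary_Median = mean(data_clean$Annual_Salary_Median),
  Hourly_10th_Percentile = mean(data_clean$Hourly_10th_Percentile),
  Annual_90th_Percentile = mean(data_clean$Annual_90th_Percentile)
)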

# Predicting probabilities
plot_data$predicted_prob <- predict(model, newdata = plot_data, type = "response")

ggplot(plot_data, aes(x = Hourly_Wage_Median, y = predicted_prob)) +
  geom_line() +
  labs(x = "Hourly Wage Median", y = "Predicted Probability of Above Average Location Quotient") +
  theme_minimal()

This will plot the predicted probability of having an above-average location quotient against the Hourly Wage Median, while holding the other explanatory variables constant at their means.
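
To make the link between the linear predictor and the plotted probability concrete, here is a minimal sketch on simulated data. The names x, y, and sim_fit are hypothetical stand-ins, not part of the nurses analysis. It also shows one common way to get an approximate 95% band: build the interval on the log-odds scale and transform it with plogis().

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not nurses.csv)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
sim_fit <- glm(y ~ x, family = "binomial")

new_x <- data.frame(x = seq(-2, 2, length.out = 5))

# Linear predictor (log-odds) with standard errors
link_pred <- predict(sim_fit, newdata = new_x, type = "link", se.fit = TRUE)

# The inverse logit turns log-odds into probabilities
p_from_link <- plogis(link_pred$fit)

# predict(type = "response") returns the same probabilities directly
p_response <- predict(sim_fit, newdata = new_x, type = "response")
print(all.equal(p_from_link, p_response))

# Approximate 95% band: interval on the link scale, then transformed
z <- qnorm(0.975)
band <- data.frame(
  x     = new_x$x,
  prob  = p_from_link,
  lower = plogis(link_pred$fit - z * link_pred$se.fit),
  upper = plogis(link_pred$fit + z * link_pred$se.fit)
)
print(band)
```

Building the band on the link scale keeps both bounds inside the 0 to 1 range, which a symmetric interval on the probability scale does not guarantee.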

Logistic Regression Model:

Insight: The logistic regression model was built to predict the binary variable Location_Quotient_binary using the explanatory variables Hourly_Wage_Median, Annual_Salary_Median, Hourly_10th_Percentile, and Annual_90th_Percentile.

Significance: This model helps us understand how wage and salary measures relate to the likelihood of a state having an above-average location quotient for registered nurses.

Further Questions:

Are these explanatory variables sufficient to explain the variation in location quotient?

Are there any interactions or nonlinear effects that should be considered in the model?
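
One way to start on both questions is sketched below with simulated data. Every name here (hourly, annual, above, sim) is hypothetical, and the factor 2080 assumes an annual figure equal to a full-time year of 40 hours over 52 weeks. The sketch checks how strongly two predictors are correlated and then compares nested models with a likelihood ratio test to see whether a quadratic term improves the fit.

```r
set.seed(42)

# Simulated stand-in data: annual is built to be nearly collinear with hourly
n <- 300
hourly <- runif(n, 20, 60)
annual <- hourly * 2080 + rnorm(n, sd = 500)
above  <- rbinom(n, 1, plogis(-3 + 0.08 * hourly))
sim <- data.frame(above, hourly, annual)

# A correlation close to 1 means the two predictors carry nearly the same information
predictor_cor <- cor(sim$hourly, sim$annual)
print(predictor_cor)

# Nested models: linear term only versus linear plus quadratic term
fit_linear <- glm(above ~ hourly, data = sim, family = "binomial")
fit_quad   <- glm(above ~ hourly + I(hourly^2), data = sim, family = "binomial")

# Likelihood ratio (deviance) test of the added quadratic term
lrt <- anova(fit_linear, fit_quad, test = "Chisq")
print(lrt)
```

The same anova() comparison works for an interaction by fitting a second model that adds a product term such as a:b to the first.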

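# Point estimates and standard errors from the fitted model
est <- coef(model)
se <- summary(model)$coefficients[, "Std. Error"]

# Critical value from the t distribution with the residual degrees of freedom.
# With 587 degrees of freedom this is almost identical to qnorm(0.975).
resid_df <- df.residual(model)
t_value <- qt(0.975, resid_df)

# 95% confidence bounds for each coefficient
ci_lower <- est - t_value * se
ci_upper <- est + t_value * se

confidence_intervals <- data.frame(ci_lower, ci_upper)
rownames(confidence_intervals) <- names(est)

print(confidence_intervals)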
##                             ci_lower     ci_upper
## (Intercept)             1.728189e+00 4.0937955362
## Hourly_Wage_Median     -1.096605e+02 1.6706094078
## Annual_Salary_Median   -1.222618e-03 0.0522927923
## Hourly_10th_Percentile  4.362848e-01 0.8466982676
## Annual_90th_Percentile  6.869520e-05 0.0001618963

I’m calculating a 95% confidence interval for each coefficient in the logistic regression model. The ci_lower and ci_upper variables hold the lower and upper bounds. The code uses a t quantile; for a logistic regression the usual Wald interval uses the standard normal quantile instead, but with 587 residual degrees of freedom the two are almost identical (about 1.964 versus 1.960). Each interval gives a range of plausible values for the true coefficient. For example, the interval for Hourly_Wage_Median is approximately [-109.66, 1.67], so we are 95% confident that the true effect of Hourly_Wage_Median on the log odds of an above-average location quotient lies in that range. Because the interval includes zero, the data are also consistent with no effect, which matches the p-value of 0.0568 in the model summary.
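
For comparison, here is a minimal sketch of the normal-quantile (Wald) interval on simulated data. The names x, y, and sim_fit are hypothetical. It checks a manual calculation against R's built-in confint.default() and then moves the interval to the odds-ratio scale.

```r
set.seed(7)

# Simulated stand-in data (hypothetical, not nurses.csv)
n <- 250
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 0.9 * x))
sim_fit <- glm(y ~ x, family = "binomial")

est <- coef(sim_fit)
se  <- sqrt(diag(vcov(sim_fit)))

# Manual Wald interval using the standard normal quantile
z <- qnorm(0.975)
manual_ci <- cbind(est - z * se, est + z * se)

# Built-in Wald interval; should agree with the manual calculation
wald_ci <- confint.default(sim_fit, level = 0.95)
print(all.equal(unname(manual_ci), unname(wald_ci)))

# Exponentiate to express the interval on the odds-ratio scale
odds_ratio_ci <- exp(wald_ci)
print(odds_ratio_ci)
```

On the odds-ratio scale, an interval that contains 1 plays the same role as a log-odds interval that contains 0.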

library(ggplot2)

# Store the coefficient names and interval widths as columns for plotting
confidence_intervals$term <- rownames(confidence_intervals)
confidence_intervals$width <- confidence_intervals$ci_upper - confidence_intervals$ci_lower

# One bar per coefficient; bar length is the width of its 95% interval
ggplot(confidence_intervals, aes(x = term, y = width)) +
  geom_col(fill = "skyblue") +
  labs(x = "Coefficient", y = "Width of Confidence Interval") +
  coord_flip() +  # Flip axes for horizontal bars
  theme_minimal()

Insight: Confidence intervals for the coefficients of the logistic regression model were calculated.

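Significance: The width of each interval shows how precisely the coefficient is estimated. The intervals for Hourly_Wage_Median and Annual_Salary_Median include zero, so the data are consistent with no effect for those two variables at the 5% level, while the intervals for Hourly_10th_Percentile and Annual_90th_Percentile lie entirely above zero.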

Further Questions:

Are there any coefficients with particularly wide confidence intervals, indicating uncertainty in the estimation?

How do the confidence intervals for different coefficients compare to each other?
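
Raw widths are hard to compare because the predictors are on very different scales. One possible approach, sketched below, divides each interval's half-width by the size of its estimate. The estimates and standard errors are copied from the model summary; the normal quantile is used here in place of the t quantile, which makes little difference with 587 degrees of freedom.

```r
# Estimates and standard errors copied from the summary(model) output
est <- c(
  "(Intercept)"          = 2.911e+00,
  Hourly_Wage_Median     = -5.399e+01,
  Annual_Salary_Median   = 2.554e-02,
  Hourly_10th_Percentile = 6.415e-01,
  Annual_90th_Percentile = 1.153e-04
)
se <- c(6.022e-01, 2.834e+01, 1.362e-02, 1.045e-01, 2.373e-05)

# Half-width of the 95% Wald interval relative to the size of the estimate.
# A value above 1 means the interval reaches past zero.
relative_half_width <- qnorm(0.975) * se / abs(est)

print(round(sort(relative_half_width, decreasing = TRUE), 3))
```

Sorting the result puts the least precisely estimated coefficients first, independent of the units each predictor is measured in.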