data <- read.csv("C:\\Users\\91814\\Desktop\\Statistics\\nurses.csv")
I’m choosing “Location_Quotient” as the basis for the response variable. Location quotient is a continuous measure of the concentration of Registered Nurses in a particular state compared to the national average, so it is not binary as recorded. It can be reasonably converted into a binary variable by categorizing states with a location quotient greater than 1 as “above average” (coded 1) and states with a location quotient less than or equal to 1 as “average or below” (coded 0).
I’m selecting four explanatory variables to build the logistic regression model: “Hourly_Wage_Median,” “Annual_Salary_Median,” “Hourly_10th_Percentile,” and “Annual_90th_Percentile.”
data$Location_Quotient_binary <- ifelse(data$Location_Quotient > 1, 1, 0)
model <- glm(Location_Quotient_binary ~ Hourly_Wage_Median + Annual_Salary_Median + Hourly_10th_Percentile + Annual_90th_Percentile, data = data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Location_Quotient_binary ~ Hourly_Wage_Median +
## Annual_Salary_Median + Hourly_10th_Percentile + Annual_90th_Percentile,
## family = "binomial", data = data)
##
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)             2.911e+00  6.022e-01   4.834 1.34e-06 ***
## Hourly_Wage_Median     -5.399e+01  2.834e+01  -1.905   0.0568 .
## Annual_Salary_Median    2.554e-02  1.362e-02   1.874   0.0609 .
## Hourly_10th_Percentile  6.415e-01  1.045e-01   6.140 8.27e-10 ***
## Annual_90th_Percentile  1.153e-04  2.373e-05   4.859 1.18e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 819.87 on 591 degrees of freedom
## Residual deviance: 712.24 on 587 degrees of freedom
## (650 observations deleted due to missingness)
## AIC: 722.24
##
## Number of Fisher Scoring iterations: 4
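The deviance figures in the summary can be turned into an overall likelihood-ratio test of the fitted model against the intercept-only model. The sketch below is an illustration rather than part of the original analysis: it simply copies the four numbers printed above and compares the drop in deviance to a chi-squared distribution. The variable names are my own.

```r
# Values copied from the summary(model) output above
null_deviance     <- 819.87
residual_deviance <- 712.24
null_df           <- 591
residual_df       <- 587

# Likelihood-ratio statistic: the drop in deviance from adding the four predictors
lr_stat <- null_deviance - residual_deviance
lr_df   <- null_df - residual_df

# Upper-tail chi-squared probability for that drop
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df; p-value:", format.pval(p_value), "\n")
```

A small p-value here says that the four predictors together improve on the intercept-only model; it says nothing about which individual predictor is responsible.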
Interpreting the coefficients:
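Hourly_Wage_Median: The estimate is -53.99, so, holding the other variables fixed, a one-dollar increase in the median hourly wage lowers the log odds of being above average in location quotient (compared to average or below) by about 54. The effect is not significant at the 5% level (p = 0.0568), and the very large magnitude is likely a symptom of collinearity with Annual_Salary_Median, which measures nearly the same quantity on an annual scale.
Annual_Salary_Median: The estimate is 0.02554, so, holding the other variables fixed, a one-dollar increase in the median annual salary raises the log odds of being above average in location quotient by about 0.026. This effect is also not significant at the 5% level (p = 0.0609).
Hourly_10th_Percentile: The estimate is 0.6415, so a one-dollar increase in the hourly 10th percentile wage raises the log odds of being above average in location quotient by about 0.64, holding the other variables fixed. This effect is highly significant (p < 0.001).
Annual_90th_Percentile: The estimate is 0.0001153, so a one-dollar increase in the annual 90th percentile salary raises the log odds of being above average in location quotient by about 0.000115 (roughly 0.115 per $1,000), holding the other variables fixed. This effect is highly significant (p < 0.001).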
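Log odds are hard to read directly, so it can help to exponentiate the estimates into odds ratios. The sketch below is illustrative only: it copies the estimates from the summary output rather than refitting the model, and the variable names are assumptions of mine. With the fitted model object available, `exp(coef(model))` would give the same kind of result.

```r
# Estimates copied from the summary(model) output (log-odds scale)
estimates <- c(
  Hourly_Wage_Median     = -5.399e+01,
  Annual_Salary_Median   =  2.554e-02,
  Hourly_10th_Percentile =  6.415e-01,
  Annual_90th_Percentile =  1.153e-04
)

# Odds ratio for a one-unit (one-dollar) increase in each predictor
odds_ratios <- exp(estimates)

# A one-dollar change in an annual salary is tiny, so rescale to a $1,000 change
or_90th_per_1000 <- exp(1000 * estimates[["Annual_90th_Percentile"]])

print(signif(odds_ratios, 4))
cat("Odds ratio per $1,000 of Annual_90th_Percentile:", signif(or_90th_per_1000, 4), "\n")
```

An odds ratio above 1 means the odds of an above-average location quotient rise with the predictor, and a ratio below 1 means they fall. For example, the Hourly_10th_Percentile estimate corresponds to the odds being multiplied by roughly 1.9 per extra dollar.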
library(ggplot2)
# Keep only rows with complete data on every predictor, so the means below are not NA
predictors <- c("Hourly_Wage_Median", "Annual_Salary_Median", "Hourly_10th_Percentile", "Annual_90th_Percentile")
data_clean <- data[complete.cases(data[, predictors]), ]
# Vary Hourly_Wage_Median over its observed range; hold the other predictors at their means
plot_data <- expand.grid(
  Hourly_Wage_Median = seq(min(data_clean$Hourly_Wage_Median), max(data_clean$Hourly_Wage_Median), length.out = 100),
  Annual_Salary_Median = mean(data_clean$Annual_Salary_Median),
  Hourly_10th_Percentile = mean(data_clean$Hourly_10th_Percentile),
  Annual_90th_Percentile = mean(data_clean$Annual_90th_Percentile)
)
# Predicting probabilities
plot_data$predicted_prob <- predict(model, newdata = plot_data, type = "response")
ggplot(plot_data, aes(x = Hourly_Wage_Median, y = predicted_prob)) +
  geom_line() +
  labs(x = "Hourly Wage Median", y = "Predicted Probability of Above Average Location Quotient") +
  theme_minimal()
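This plots the predicted probability of having an above-average location quotient against the Hourly Wage Median, while holding the other explanatory variables constant at their means. Because the median hourly wage and the median annual salary tend to move together, holding one at its mean while varying the other describes combinations that may not occur in the data, so the curve should be read with caution.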
Logistic Regression Model:
Insight: The logistic regression model was built to predict the binary variable Location_Quotient_binary using the explanatory variables Hourly_Wage_Median, Annual_Salary_Median, Hourly_10th_Percentile, and Annual_90th_Percentile.
Significance: This model helps understand how different factors such as wage and salary percentiles relate to the likelihood of a state having an above-average location quotient for registered nurses.
Further Questions:
Are these explanatory variables sufficient to explain the variation in location quotient?
Are there any interactions or nonlinear effects that should be considered in the model?
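One way to explore the interaction and nonlinearity questions is to fit a richer model and compare it to the simpler one with a likelihood-ratio test. The sketch below runs on simulated data, not the nurses data: the data frame `sim`, its columns `wage` and `low_wage`, and the coefficients used to generate `y` are all made up for illustration.

```r
set.seed(42)

# Simulated stand-in data (hypothetical columns, not from nurses.csv)
n <- 300
sim <- data.frame(
  wage     = runif(n, 20, 60),
  low_wage = runif(n, 15, 35)
)
sim$y <- rbinom(n, 1, plogis(-2 + 0.05 * sim$wage + 0.02 * sim$low_wage))

# Baseline model with main effects only
base_fit <- glm(y ~ wage + low_wage, data = sim, family = "binomial")

# Candidate extensions: an interaction term, and a quadratic term in wage
inter_fit <- glm(y ~ wage * low_wage, data = sim, family = "binomial")
quad_fit  <- glm(y ~ wage + I(wage^2) + low_wage, data = sim, family = "binomial")

# Likelihood-ratio (chi-squared) tests of each extension against the baseline
lr_inter <- anova(base_fit, inter_fit, test = "Chisq")
lr_quad  <- anova(base_fit, quad_fit, test = "Chisq")

print(lr_inter)
print(lr_quad)
```

A small p-value in either table would suggest that the extra term improves the fit. The same pattern could be applied to the real model by swapping in the actual column names.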
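# Point estimates and standard errors from the fitted model
coef_est <- coef(model)
se <- summary(model)$coefficients[, "Std. Error"]
# Critical value from the t-distribution on the residual degrees of freedom;
# with this many degrees of freedom it is almost identical to qnorm(0.975)
resid_df <- df.residual(model)
t_value <- qt(0.975, resid_df)
ci_lower <- coef_est - t_value * se
ci_upper <- coef_est + t_value * se
confidence_intervals <- data.frame(ci_lower, ci_upper)
rownames(confidence_intervals) <- names(coef_est)
print(confidence_intervals)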
##                             ci_lower     ci_upper
## (Intercept)             1.728189e+00 4.0937955362
## Hourly_Wage_Median     -1.096605e+02 1.6706094078
## Annual_Salary_Median   -1.222618e-03 0.0522927923
## Hourly_10th_Percentile  4.362848e-01 0.8466982676
## Annual_90th_Percentile  6.869520e-05 0.0001618963
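I’m calculating 95% confidence intervals for each coefficient in the logistic regression model as the estimate plus or minus a critical value times its standard error. The critical value here comes from the t-distribution on the residual degrees of freedom. For a binomial GLM the conventional Wald interval uses the standard normal distribution instead, but with 587 residual degrees of freedom the two critical values (about 1.964 and 1.960) are nearly identical. The ci_lower and ci_upper variables represent the lower and upper bounds of the confidence intervals, respectively. The confidence intervals provide a range of values within which we can be confident that the true coefficient lies. For example, the interval for Hourly_Wage_Median is approximately [-109.66, 1.67]. This means we are 95% confident that the true effect of Hourly_Wage_Median on the log odds of being above average in location quotient lies between -109.66 and 1.67. Because this interval includes zero, the data are also compatible with no effect, which matches the p-value of 0.0568 in the model summary.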
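For comparison, the normal-based Wald interval can be computed by hand with `qnorm()` or obtained from `confint.default()`, which is part of base R's stats package. The sketch below uses simulated data rather than the nurses data, so the data frame `sim` and its columns are hypothetical. It is meant only to show that the two routes agree and how to move the bounds onto the odds-ratio scale.

```r
set.seed(1)

# Simulated stand-in data (hypothetical columns, not from nurses.csv)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.5 + 0.8 * sim$x1 - 0.4 * sim$x2))

fit <- glm(y ~ x1 + x2, data = sim, family = "binomial")

# Manual Wald interval: estimate +/- z * standard error
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
z   <- qnorm(0.975)
manual_ci <- cbind(lower = est - z * se, upper = est + z * se)

# The same interval from the built-in helper
wald_ci <- confint.default(fit, level = 0.95)

# Exponentiate the bounds to express them as odds ratios
or_ci <- exp(wald_ci)

print(manual_ci)
print(wald_ci)
print(or_ci)
```

On the odds-ratio scale, an interval that includes 1 plays the same role as an interval that includes 0 on the log-odds scale.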
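library(ggplot2)
# Add term names and point estimates so each interval is drawn around its estimate
confidence_intervals$term <- rownames(confidence_intervals)
confidence_intervals$estimate <- coef(model)
ggplot(confidence_intervals, aes(x = term, y = estimate)) +
  geom_point(color = "steelblue") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2, color = "black") +
  geom_hline(yintercept = 0, linetype = "dashed") + # Reference line at no effect
  labs(x = "Coefficient", y = "Estimate with 95% Confidence Interval (log odds)") +
  coord_flip() + # Flip axes so the terms are listed vertically
  theme_minimal()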
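Insight: Confidence intervals for the coefficients of the logistic regression model were calculated. The intervals for Hourly_Wage_Median and Annual_Salary_Median include zero, while those for the intercept, Hourly_10th_Percentile, and Annual_90th_Percentile do not.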
Significance: Confidence intervals provide a range of values within which we can be confident that the true coefficient lies. They help assess the precision of the coefficient estimates.
Further Questions:
Are there any coefficients with particularly wide confidence intervals, indicating uncertainty in the estimation?
How do the confidence intervals for different coefficients compare to each other?
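As a starting point for these questions, the printed bounds can be tabulated with their widths and a flag for whether each interval includes zero. This is a minimal sketch that copies the numbers from the output above instead of refitting the model. The column names `width` and `contains_zero` are my own.

```r
# Bounds copied from the printed confidence_intervals table
ci <- data.frame(
  term = c("(Intercept)", "Hourly_Wage_Median", "Annual_Salary_Median",
           "Hourly_10th_Percentile", "Annual_90th_Percentile"),
  ci_lower = c(1.728189e+00, -1.096605e+02, -1.222618e-03, 4.362848e-01, 6.869520e-05),
  ci_upper = c(4.0937955362, 1.6706094078, 0.0522927923, 0.8466982676, 0.0001618963)
)

# Width of each interval on the log-odds scale
ci$width <- ci$ci_upper - ci$ci_lower

# TRUE when the interval includes zero, i.e. no effect cannot be ruled out
ci$contains_zero <- ci$ci_lower < 0 & ci$ci_upper > 0

# Widest intervals first
print(ci[order(-ci$width), ])
```

Raw widths are not directly comparable across predictors measured in different units (hourly dollars versus annual dollars), so the zero flag is usually the more informative column.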