data <- read.csv("C:\\Users\\91814\\Desktop\\Statistics\\nurses.csv")

I’m selecting Total_Employed_National_Aggregate as the response variable and Hourly_90th_Percentile, Annual_10th_Percentile, and Location_Quotient as the explanatory variables.

# Fit a Gaussian GLM with identity link (equivalent to ordinary least squares)
model <- glm(Total_Employed_National_Aggregate ~ Hourly_90th_Percentile + Annual_10th_Percentile + Location_Quotient, 
             data = data, 
             family = gaussian(link = "identity"))

# Summary of the model
summary(model)
## 
## Call:
## glm(formula = Total_Employed_National_Aggregate ~ Hourly_90th_Percentile + 
##     Annual_10th_Percentile + Location_Quotient, family = gaussian(link = "identity"), 
##     data = data)
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.247e+08  2.083e+06  59.872  < 2e-16 ***
## Hourly_90th_Percentile -5.831e+04  6.704e+04  -0.870   0.3847    
## Annual_10th_Percentile  2.680e+02  6.699e+01   4.000 7.15e-05 ***
## Location_Quotient       2.695e+06  1.357e+06   1.987   0.0474 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3.703143e+13)
## 
##     Null deviance: 2.3789e+16  on 591  degrees of freedom
## Residual deviance: 2.1774e+16  on 588  degrees of freedom
##   (650 observations deleted due to missingness)
## AIC: 20182
## 
## Number of Fisher Scoring iterations: 2

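Insight: The GLM models Total_Employed_National_Aggregate as a linear function of Hourly_90th_Percentile, Annual_10th_Percentile, and Location_Quotient. Because 650 observations were deleted due to missingness, the fit uses 592 rows (591 null degrees of freedom).

Significance: The sign and size of each coefficient give the direction and magnitude of the association, and the p-values test whether each coefficient differs from zero. At the 5% level, Annual_10th_Percentile (p = 7.15e-05) and Location_Quotient (p = 0.0474) are significant, while Hourly_90th_Percentile (p = 0.3847) is not. The residual deviance (2.1774e+16) is only modestly below the null deviance (2.3789e+16), so the predictors explain a small share of the variability in employment.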
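
As a rough sketch of how the coefficient table and the effect of missing values can be inspected programmatically, the snippet below fits the same kind of model to synthetic data. The data frame `toy` and its columns `x1`, `x2`, `x3`, and `y` are hypothetical stand-ins, not the nurses data, and the 0.05 threshold is only the conventional choice.

```r
# Synthetic stand-in data; names and values are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(
  x1 = rnorm(n, 50, 10),
  x2 = rnorm(n, 40000, 5000),
  x3 = runif(n, 0.5, 1.5)
)
toy$y <- 1000 + 0.5 * toy$x2 + rnorm(n, sd = 2000)
toy$x1[sample(n, 30)] <- NA   # inject missing predictor values

fit <- glm(y ~ x1 + x2 + x3, data = toy,
           family = gaussian(link = "identity"))

# Coefficient table as a data frame, with a flag for p < 0.05
coef_tab <- as.data.frame(coef(summary(fit)))
coef_tab$significant_5pct <- coef_tab[["Pr(>|t|)"]] < 0.05
print(coef_tab)

# Rows in the data versus rows actually used after dropping incomplete cases
cat("Rows in data:", nrow(toy), " Rows used in fit:", nobs(fit), "\n")
```

Comparing `nrow()` with `nobs()` makes the size of the analysed sample explicit, which matters when a large share of rows is dropped.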

Further Questions:

  1. How do changes in wage and salary percentiles and location quotient affect employment?

  2. Are model assumptions met, and if not, how can we improve model performance?

  3. How accurately does the model predict employment trends, and is it generalizable?

  4. Are there other variables that should be included to enhance explanatory power?
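
One possible way to approach the question about predictive accuracy is k-fold cross-validation. The sketch below uses synthetic data; `toy`, `x1`, `x2`, `y`, and the choice of 5 folds are assumptions for illustration, not part of the original analysis.

```r
# Synthetic stand-in data; names and values are hypothetical.
set.seed(5)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)

k <- 5
# Randomly assign each row to one of k folds of (nearly) equal size
fold <- sample(rep(seq_len(k), length.out = n))
rmse <- numeric(k)

for (i in seq_len(k)) {
  train    <- toy[fold != i, ]
  held_out <- toy[fold == i, ]
  fit_i <- glm(y ~ x1 + x2, data = train,
               family = gaussian(link = "identity"))
  pred <- predict(fit_i, newdata = held_out)
  # Out-of-sample root mean squared error for this fold
  rmse[i] <- sqrt(mean((held_out$y - pred)^2))
}

cv_rmse <- mean(rmse)
cat("Per-fold RMSE:", round(rmse, 3), "\n")
cat("Mean cross-validated RMSE:", round(cv_rmse, 3), "\n")
```

The cross-validated RMSE is on the scale of the response, so it can be compared directly with the spread of the observed values.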

Diagnosing the model

# 1. Check model assumptions
# Plotting residuals vs. fitted values
plot(model, which = 1)

# Plotting Q-Q plot of residuals
plot(model, which = 2)

# Plotting scale-location plot
plot(model, which = 3)

# Plotting residuals vs. leverage
plot(model, which = 5)

# 2. Examining coefficients and significance
summary(model)
## 
## Call:
## glm(formula = Total_Employed_National_Aggregate ~ Hourly_90th_Percentile + 
##     Annual_10th_Percentile + Location_Quotient, family = gaussian(link = "identity"), 
##     data = data)
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.247e+08  2.083e+06  59.872  < 2e-16 ***
## Hourly_90th_Percentile -5.831e+04  6.704e+04  -0.870   0.3847    
## Annual_10th_Percentile  2.680e+02  6.699e+01   4.000 7.15e-05 ***
## Location_Quotient       2.695e+06  1.357e+06   1.987   0.0474 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3.703143e+13)
## 
##     Null deviance: 2.3789e+16  on 591  degrees of freedom
## Residual deviance: 2.1774e+16  on 588  degrees of freedom
##   (650 observations deleted due to missingness)
## AIC: 20182
## 
## Number of Fisher Scoring iterations: 2
# 3. Assessing model fit
# Calculating residual deviance and degrees of freedom
residual_deviance <- deviance(model)
df_residual <- df.residual(model)

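# Calculating null deviance and degrees of freedom
# Caution: the intercept-only refit and nrow(data) below include rows that
# `model` dropped for missingness (650 observations), so these values are not
# on the same sample as the residual deviance. The same-sample values are
# stored in model$null.deviance and model$df.null.
null_deviance <- deviance(glm(Total_Employed_National_Aggregate ~ 1, data = data, family = gaussian(link = "identity")))
df_null <- nrow(data) - 1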

# Calculating R-squared
r_squared <- 1 - (residual_deviance / null_deviance)

# Printing model fit statistics
cat("Residual Deviance:", residual_deviance, "\n")
## Residual Deviance: 2.177448e+16
cat("Degrees of Freedom (Residual):", df_residual, "\n")
## Degrees of Freedom (Residual): 588
cat("Null Deviance:", null_deviance, "\n")
## Null Deviance: 4.653621e+16
cat("Degrees of Freedom (Null):", df_null, "\n")
## Degrees of Freedom (Null): 1241
cat("R-squared:", r_squared, "\n")
## R-squared: 0.532096
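
The null deviances differ between the two outputs (2.3789e+16 in the model summary, 4.653621e+16 here) because they are computed on different sets of rows. The following is a minimal sketch on synthetic data of how a deviance-based R-squared behaves when both deviances come from the same sample versus when the null model is refit on all rows. The names `toy`, `x`, and `y` are hypothetical.

```r
# Synthetic stand-in data with missing predictor values; names are hypothetical.
set.seed(2)
n <- 300
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 0.3 * toy$x + rnorm(n)
toy$x[sample(n, 120)] <- NA   # 120 rows will be dropped by glm()

fit <- glm(y ~ x, data = toy, family = gaussian(link = "identity"))

# Same-sample version: both deviances are stored on the fitted object
r2_same <- 1 - fit$deviance / fit$null.deviance

# Cross-check: for a Gaussian identity-link GLM this matches lm()'s R-squared
r2_lm <- summary(lm(y ~ x, data = toy))$r.squared

# Mismatched version: the null model is refit on all rows with a non-missing y
null_all <- glm(y ~ 1, data = toy, family = gaussian(link = "identity"))
r2_mixed <- 1 - deviance(fit) / deviance(null_all)

cat("Rows used by fit:", nobs(fit), " Null df:", fit$df.null, "\n")
cat("R-squared (same sample):", r2_same, "\n")
cat("R-squared (lm check):   ", r2_lm, "\n")
cat("R-squared (mixed rows): ", r2_mixed, "\n")
```

In this sketch the mixed-sample figure is inflated simply because the null deviance is summed over more rows, not because the model fits better.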
  1. Residuals vs. Fitted Values Plot:

    The residuals vs. fitted values plot shows a pattern rather than a random scatter around zero, which points to nonlinearity, heteroscedasticity, or both. The assumptions of a linear mean and of constant variance may therefore not hold.

  2. Q-Q Plot of Residuals:

    The Q-Q plot of residuals shows deviations from normality, indicating that the residuals may not follow a normal distribution. This violates the assumption of normally distributed residuals in the model.

  3. Scale-Location Plot:

    The scale-location plot does not show a clear pattern, but there is some evidence of heteroscedasticity, which is consistent with the findings in the residuals vs. fitted values plot.

  4. Residuals vs. Leverage Plot:

    The residuals vs. leverage plot does not indicate any influential outliers that significantly affect the model fit.

  5. Coefficient Significance:

    Not every coefficient is statistically significant. The coefficient for Hourly_90th_Percentile has a p-value of 0.3847, well above the typical significance level of 0.05, so there is no evidence that this variable is associated with the response once the other predictors are in the model. Annual_10th_Percentile (p = 7.15e-05) and Location_Quotient (p = 0.0474) are significant at the 5% level.

  6. Model Fit:

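    The model summary reports a null deviance of 2.3789e+16 on 591 degrees of freedom and a residual deviance of 2.1774e+16 on 588 degrees of freedom. On that common sample the proportion of deviance explained is 1 - 2.1774e+16 / 2.3789e+16, about 0.085, so the model accounts for roughly 8.5% of the variability in the response. The R-squared of 0.532096 printed above is not comparable, because its null deviance (4.653621e+16 on 1241 degrees of freedom) comes from a refit that includes rows the fitted model dropped for missingness.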
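
The visual impressions from the diagnostic plots can be complemented with numeric checks. The sketch below, on synthetic data that is heteroscedastic by construction, computes a studentized Breusch-Pagan-style statistic by hand, a Shapiro-Wilk test on the residuals, and Cook's distances. The names `toy`, `x1`, `x2`, and `y` are hypothetical, and the `4 / n` cutoff for Cook's distance is a common rule of thumb rather than a formal test.

```r
# Synthetic stand-in data; names and values are hypothetical.
set.seed(3)
n <- 500
toy <- data.frame(x1 = runif(n, 1, 10), x2 = rnorm(n))
# Error standard deviation grows with x1, so the variance is not constant
toy$y <- 5 + 2 * toy$x1 + toy$x2 + rnorm(n, sd = toy$x1)

fit <- glm(y ~ x1 + x2, data = toy, family = gaussian(link = "identity"))
res <- residuals(fit, type = "response")

# Heteroscedasticity: regress squared residuals on the predictors and use
# n * R^2 as a chi-squared statistic (df = number of predictors)
toy$res2 <- res^2
aux <- lm(res2 ~ x1 + x2, data = toy)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
cat("Breusch-Pagan-style statistic:", bp_stat, " p-value:", bp_p, "\n")

# Normality of residuals (shapiro.test accepts 3 to 5000 observations)
sw <- shapiro.test(res)
cat("Shapiro-Wilk W:", sw$statistic, " p-value:", sw$p.value, "\n")

# Influence: flag observations whose Cook's distance exceeds 4 / n
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
cat("Observations above the 4/n rule of thumb:", length(flagged), "\n")
```

Small p-values from the first two checks would support the reading of the residual plots, while the flagged rows are candidates for closer inspection rather than automatic removal.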

Insight: The provided R code conducts diagnostic checks on a generalized linear model (GLM). It assesses model assumptions, examines coefficient significance, and evaluates model fit using various statistics.

Significance: These diagnostic procedures are essential for ensuring the reliability and adequacy of the GLM. They help identify potential issues with model assumptions, assess the significance of explanatory variables, and evaluate how well the model fits the data.

Further Questions:

  1. Are there any violations of model assumptions observed in the diagnostic plots?

  2. What are the practical implications of coefficient significance for interpreting the model?

  3. How well does the model explain the variability in the response variable, and what are the implications of the R-squared value?

Interpreting the coefficient:

Insight: The estimated coefficient for Annual_10th_Percentile is 2.680e+02: holding the other predictors fixed, a one-unit increase in the annual salary at the 10th percentile is associated with an increase of about 268 in Total_Employed_National_Aggregate. The small p-value (7.15e-05) indicates that this association is statistically significant.

Significance: This insight provides valuable information about the impact of annual salaries at the 10th percentile on the response variable, bolstered by its statistical significance.
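
To make a coefficient easier to read in practical terms, it can be rescaled to a more meaningful step and reported with an interval. The sketch below does this on synthetic data; `toy`, `wage_p10`, `lq`, `employed`, and the step of 1000 are hypothetical choices for illustration. `confint.default()` gives a Wald interval based on normal quantiles, which is an approximation.

```r
# Synthetic stand-in data; names and values are hypothetical.
set.seed(4)
n <- 250
toy <- data.frame(
  wage_p10 = rnorm(n, 45000, 6000),
  lq       = runif(n, 0.5, 1.5)
)
toy$employed <- 1e5 + 3 * toy$wage_p10 + 2e4 * toy$lq + rnorm(n, sd = 2e4)

fit <- glm(employed ~ wage_p10 + lq, data = toy,
           family = gaussian(link = "identity"))

# Point estimate and 95% Wald interval for the slope of wage_p10
est <- coef(fit)["wage_p10"]
ci  <- confint.default(fit, parm = "wage_p10", level = 0.95)

# Rescale to the change associated with a 1000-unit increase in wage_p10
step <- 1000
cat("Change per", step, "units:", est * step, "\n")
cat("95% Wald interval:", ci[1, 1] * step, "to", ci[1, 2] * step, "\n")

# Cross-check with predictions at two values that differ only by `step`
newdata <- data.frame(wage_p10 = c(45000, 45000 + step), lq = c(1, 1))
pred_diff <- diff(predict(fit, newdata = newdata))
cat("Difference in predictions:", pred_diff, "\n")
```

With an identity link the difference in predictions equals the coefficient times the step, regardless of the values at which the other predictors are held.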

Further Questions:

  1. Practical Implications: How do higher annual salaries at the 10th percentile affect the response variable in practical terms?

  2. Variable Interpretation: Does this relationship align with expectations or existing knowledge in the field?

  3. Model Robustness: Are there other factors or model specifications that could influence this relationship?

  4. Generalizability: Can this relationship be applied to other populations or contexts beyond the dataset used for modeling?