Week 11

data <- read.csv("C:\\Users\\91814\\Desktop\\Statistics\\nurses.csv")

I’m selecting Total_Employed_National_Aggregate as the response variable and Hourly_90th_Percentile, Annual_10th_Percentile, and Location_Quotient as the explanatory variables.

#generalized linear model (GLM)
model <- glm(Total_Employed_National_Aggregate ~ Hourly_90th_Percentile + Annual_10th_Percentile + Location_Quotient, 
             data = data, 
             family = gaussian(link = "identity"))

# Summary of the model
summary(model)

## 
## Call:
## glm(formula = Total_Employed_National_Aggregate ~ Hourly_90th_Percentile + 
##     Annual_10th_Percentile + Location_Quotient, family = gaussian(link = "identity"), 
##     data = data)
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.247e+08  2.083e+06  59.872  < 2e-16 ***
## Hourly_90th_Percentile -5.831e+04  6.704e+04  -0.870   0.3847    
## Annual_10th_Percentile  2.680e+02  6.699e+01   4.000 7.15e-05 ***
## Location_Quotient       2.695e+06  1.357e+06   1.987   0.0474 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3.703143e+13)
## 
##     Null deviance: 2.3789e+16  on 591  degrees of freedom
## Residual deviance: 2.1774e+16  on 588  degrees of freedom
##   (650 observations deleted due to missingness)
## AIC: 20182
## 
## Number of Fisher Scoring iterations: 2

Response Variable: Total_Employed_National_Aggregate (total number of employed individuals in the national aggregate)

Explanatory Variables:

Hourly_90th_Percentile: Hourly wage at the 90th percentile.

Annual_10th_Percentile: Annual salary at the 10th percentile.

Location_Quotient: Concentration of workers in a particular state compared to the national average.

Model Building:

I’m using the glm() function to build a GLM with Total_Employed_National_Aggregate as the response variable and the three explanatory variables mentioned above.

I specified family = gaussian(link = "identity"), which makes the GLM equivalent to an ordinary linear regression fitted by least squares.
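
To illustrate why the Gaussian family with an identity link amounts to linear regression, here is a minimal sketch that fits the same formula with glm() and lm() and compares the results. The data are simulated and the names sim, x1, x2, and y are hypothetical stand-ins, not columns from nurses.csv.

```r
# Simulated data; x1, x2, and y are hypothetical stand-ins for real columns.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- runif(n)
y  <- 3 + 2 * x1 - x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

# Same formula, fitted two ways.
fit_glm <- glm(y ~ x1 + x2, data = sim, family = gaussian(link = "identity"))
fit_lm  <- lm(y ~ x1 + x2, data = sim)

# The coefficient estimates should agree up to numerical tolerance.
coef_match <- isTRUE(all.equal(coef(fit_glm), coef(fit_lm)))

# The GLM dispersion parameter should equal the squared residual standard error from lm().
disp_match <- isTRUE(all.equal(summary(fit_glm)$dispersion, summary(fit_lm)$sigma^2))

print(c(coefficients = coef_match, dispersion = disp_match))
```

If both comparisons come back TRUE, the two functions are describing the same fitted model, and the choice between them is mostly about which summary output is more convenient.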

Summary:

The summary() function provides a summary of the GLM, including coefficient estimates, standard errors, t-values, and p-values for each variable, as well as measures of overall model fit.
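
The pieces of the summary can also be pulled out as a matrix for further work. The sketch below uses simulated data (the names sim, x1, x2, and y are hypothetical) and shows that each t value is the estimate divided by its standard error.

```r
# Simulated data; all names here are hypothetical.
set.seed(2)
sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
sim$y <- 1 + 0.5 * sim$x1 + rnorm(100)

fit <- glm(y ~ x1 + x2, data = sim, family = gaussian(link = "identity"))

# Coefficient table as a matrix with columns
# "Estimate", "Std. Error", "t value", and "Pr(>|t|)".
coef_table <- coef(summary(fit))
print(round(coef_table, 4))

# Recompute the t values by hand from the first two columns.
t_by_hand <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]
print(round(t_by_hand, 4))
```

The same check works on the output above: 2.680e+02 / 6.699e+01 is approximately 4.000, the t value reported for Annual_10th_Percentile.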

Insight: The GLM explores how the total number of employed individuals in the national aggregate relates to hourly wage percentiles, annual salary percentiles, and location quotients.

Significance: Coefficients indicate the direction and strength of these relationships, while p-values assess their statistical significance. Model fit statistics gauge how well the model explains variability in employment.

Further Questions:

How do changes in wage and salary percentiles and location quotient affect employment?
Are model assumptions met, and if not, how can we improve model performance?
How accurately does the model predict employment trends, and is it generalizable?
Are there other variables that should be included to enhance explanatory power?

Diagnosing the model

# 1. Check model assumptions
# Plotting residuals vs. fitted values
plot(model, which = 1)

# Plotting Q-Q plot of residuals
plot(model, which = 2)

# Plotting scale-location plot
plot(model, which = 3)

# Plotting residuals vs. leverage
plot(model, which = 5)

# 2. Examining coefficients and significance
summary(model)

## 
## Call:
## glm(formula = Total_Employed_National_Aggregate ~ Hourly_90th_Percentile + 
##     Annual_10th_Percentile + Location_Quotient, family = gaussian(link = "identity"), 
##     data = data)
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.247e+08  2.083e+06  59.872  < 2e-16 ***
## Hourly_90th_Percentile -5.831e+04  6.704e+04  -0.870   0.3847    
## Annual_10th_Percentile  2.680e+02  6.699e+01   4.000 7.15e-05 ***
## Location_Quotient       2.695e+06  1.357e+06   1.987   0.0474 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3.703143e+13)
## 
##     Null deviance: 2.3789e+16  on 591  degrees of freedom
## Residual deviance: 2.1774e+16  on 588  degrees of freedom
##   (650 observations deleted due to missingness)
## AIC: 20182
## 
## Number of Fisher Scoring iterations: 2

# 3. Assessing model fit
# Calculating residual deviance and degrees of freedom
residual_deviance <- deviance(model)
df_residual <- df.residual(model)

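# Calculating null deviance and degrees of freedom
# Note: this refits the intercept-only model on every row with a non-missing
# response, and nrow(data) counts all rows, whereas `model` used only the 592
# complete cases. model$null.deviance and model$df.null give the values that
# match `model` (2.3789e+16 on 591 degrees of freedom in the summary above).
null_deviance <- deviance(glm(Total_Employed_National_Aggregate ~ 1, data = data, family = gaussian(link = "identity")))
df_null <- nrow(data) - 1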

# Calculating R-squared
r_squared <- 1 - (residual_deviance / null_deviance)

# Printing model fit statistics
cat("Residual Deviance:", residual_deviance, "\n")

## Residual Deviance: 2.177448e+16

cat("Degrees of Freedom (Residual):", df_residual, "\n")

## Degrees of Freedom (Residual): 588

cat("Null Deviance:", null_deviance, "\n")

## Null Deviance: 4.653621e+16

cat("Degrees of Freedom (Null):", df_null, "\n")

## Degrees of Freedom (Null): 1241

cat("R-squared:", r_squared, "\n")

## R-squared: 0.532096

Residuals vs. Fitted Values Plot:

There appears to be a pattern in the residuals vs. fitted values plot, indicating nonlinearity or heteroscedasticity in the model. This suggests that the assumption of constant variance may not hold.

Q-Q Plot of Residuals:

The Q-Q plot of residuals shows deviations from normality, indicating that the residuals may not follow a normal distribution. This violates the assumption of normally distributed residuals in the model.

Scale-Location Plot:

The scale-location plot does not show a clear pattern, but there is some evidence of heteroscedasticity, which is consistent with the findings in the residuals vs. fitted values plot.

Residuals vs. Leverage Plot:

The residuals vs. leverage plot does not indicate any influential outliers that significantly affect the model fit.
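
The visual checks above can be complemented with a few numerical ones. This is a rough sketch on simulated data (sim, x, and y are hypothetical names), built so that the noise grows with x and the constant-variance assumption is violated on purpose. The 4/n cutoff for Cook's distance is a common rule of thumb, not a formal test.

```r
# Simulated data with noise that widens as x grows; names are hypothetical.
set.seed(3)
n <- 150
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
sim <- data.frame(x, y)

fit <- glm(y ~ x, data = sim, family = gaussian(link = "identity"))
res <- residuals(fit)

# Normality: Shapiro-Wilk test; a small p-value is evidence against normal residuals.
sw <- shapiro.test(res)
cat("Shapiro-Wilk p-value:", sw$p.value, "\n")

# Constant variance: a crude check is whether the size of the residuals
# trends with the fitted values (positive correlation = spread grows).
spread_trend <- cor(abs(res), fitted(fit))
cat("Correlation of |residual| with fitted:", spread_trend, "\n")

# Influence: flag observations whose Cook's distance exceeds 4/n.
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
cat("Observations flagged by Cook's distance:", length(flagged), "\n")
```

On the real model, the same three calls would take `model` in place of `fit`, which gives numbers to set beside the impressions from the four plots.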
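
Coefficient Significance:

Not all coefficients are statistically significant at the 0.05 level. The coefficient for Hourly_90th_Percentile has a p-value of 0.3847, well above 0.05, so there is no evidence that this variable is related to the response once the other predictors are in the model. Annual_10th_Percentile (p = 7.15e-05) and Location_Quotient (p = 0.0474) are both significant at the 0.05 level, although Location_Quotient only narrowly.

Model Fit:

The model summary reports a null deviance of 2.3789e+16 on 591 degrees of freedom and a residual deviance of 2.1774e+16 on 588 degrees of freedom, both computed on the 592 observations left after 650 incomplete rows were dropped. The null deviance calculated above (4.653621e+16) comes from an intercept-only model refitted on every row with a non-missing response, and the 1241 degrees of freedom are simply nrow(data) - 1. The R-squared of 0.532 therefore compares models fitted to different samples and overstates the fit. Using the summary’s own values, R-squared is 1 - 2.1774e+16 / 2.3789e+16 ≈ 0.085, so the three predictors explain roughly 8.5% of the variability in the response.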
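
A minimal sketch of why the two deviances need to come from the same rows, using simulated data with missing predictor values (sim, x1, x2, and y are hypothetical names). It relies on the null.deviance component that glm() stores in the fitted object.

```r
# Simulated data; 120 of the 300 rows lose a predictor value.
set.seed(4)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 10 + 0.3 * sim$x1 + rnorm(n, sd = 2)
sim$x2[sample(n, 120)] <- NA

# glm() silently drops the incomplete rows.
fit <- glm(y ~ x1 + x2, data = sim, family = gaussian(link = "identity"))

# Null deviance stored in the fit: based on the same complete-case rows.
r2_consistent <- 1 - deviance(fit) / fit$null.deviance

# Null deviance from a refit on all rows: a different, larger sample.
null_all <- deviance(glm(y ~ 1, data = sim, family = gaussian(link = "identity")))
r2_mixed <- 1 - deviance(fit) / null_all

cat("Rows used by fit:", nobs(fit), "of", nrow(sim), "\n")
cat("R-squared (same rows):", r2_consistent, "\n")
cat("R-squared (mixed rows):", r2_mixed, "\n")
```

The mixed-rows figure is inflated because the refitted null deviance sums over more observations than the residual deviance does. For the model in this report, the matching quantities would be model$null.deviance and model$df.null.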

Insight: The provided R code conducts diagnostic checks on a generalized linear model (GLM). It assesses model assumptions, examines coefficient significance, and evaluates model fit using various statistics.

Significance: These diagnostic procedures are essential for ensuring the reliability and adequacy of the GLM. They help identify potential issues with model assumptions, assess the significance of explanatory variables, and evaluate how well the model fits the data.

Further Questions:

Are there any violations of model assumptions observed in the diagnostic plots?
What are the practical implications of coefficient significance for interpreting the model?
How well does the model explain the variability in the response variable, and what are the implications of the R-squared value?

Interpreting the coefficient

For every one-unit increase in the annual salary at the 10th percentile (Annual_10th_Percentile), the model predicts an increase of 268.0 in Total_Employed_National_Aggregate, holding the other variables constant.
In simpler terms, states with higher annual salaries at the 10th percentile tend to have a higher Total_Employed_National_Aggregate, according to the model.
Additionally, the p-value associated with this coefficient is very small (7.15e-05), so the coefficient is statistically significant: an association this strong would be unlikely to arise by chance alone if the true coefficient were zero.
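
A confidence interval gives a sense of how precisely the coefficient is estimated. The sketch below works only from the numbers printed in the summary (estimate, standard error, and residual degrees of freedom), so it does not need the data file. It assumes the usual t-based interval for a Gaussian model, and the variable names are chosen here for illustration.

```r
# Values copied from the summary output for Annual_10th_Percentile.
estimate  <- 2.680e+02
std_error <- 6.699e+01
df_resid  <- 588

# 95% interval: estimate plus or minus the t critical value times the standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

# Recompute the t value and two-sided p-value from the same inputs.
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

cat("95% CI:", round(ci, 1), "\n")
cat("t value:", round(t_value, 3), " p-value:", signif(p_value, 3), "\n")
```

The interval runs from roughly 136 to 400, so it excludes zero, in line with the small p-value, but it also shows that the size of the effect is estimated quite loosely.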

Insight: The coefficient for Annual_10th_Percentile indicates that higher annual salaries at the 10th percentile are associated with a higher Total_Employed_National_Aggregate, according to the model. The small p-value suggests that this relationship is statistically significant.

Significance: This shows how annual salaries at the 10th percentile relate to total employment in the model, and the small p-value supports treating the relationship as more than noise.

Further Questions:

Practical Implications: How do higher annual salaries at the 10th percentile affect the response variable in practical terms?
Variable Interpretation: Does this relationship align with expectations or existing knowledge in the field?
Model Robustness: Are there other factors or model specifications that could influence this relationship?
Generalizability: Can this relationship be applied to other populations or contexts beyond the dataset used for modeling?

2024-04-02
