Week 9 Data Dive

data <- read.csv("C:\\Users\\91814\\Desktop\\Statistics\\nurses.csv")

Interaction Term: An interaction term is the product of two variables. Here it captures how the effect of Location_Quotient on the response variable, Annual_Salary_Avg, changes with the level of Employed_Standard_Error (and vice versa).
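As a minimal sketch of what the `*` operator does in an R formula, the block below fits the same model two ways on simulated data. The data frame `toy` and its values are made up for illustration; only the column names mirror this analysis.

```r
# Simulated stand-in data (hypothetical values, not taken from nurses.csv)
set.seed(1)
toy <- data.frame(
  Location_Quotient       = runif(50, 0.5, 1.5),
  Employed_Standard_Error = runif(50, 1, 10)
)
toy$Annual_Salary_Avg <- 70000 +
  500 * toy$Location_Quotient * toy$Employed_Standard_Error +
  rnorm(50, sd = 100)

# `a * b` in a formula is shorthand for `a + b + a:b`
fit_star  <- lm(Annual_Salary_Avg ~ Location_Quotient * Employed_Standard_Error,
                data = toy)
fit_colon <- lm(Annual_Salary_Avg ~ Location_Quotient + Employed_Standard_Error +
                  Location_Quotient:Employed_Standard_Error,
                data = toy)

# Both fits contain the same four coefficients
names(coef(fit_star))
same_coefs <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))

# The interaction column of the design matrix is the elementwise product
X <- model.matrix(fit_star)
is_product <- isTRUE(all.equal(
  unname(X[, "Location_Quotient:Employed_Standard_Error"]),
  toy$Location_Quotient * toy$Employed_Standard_Error
))
```

Both `same_coefs` and `is_product` should come out TRUE, which is why the model includes the two main effects even though only the product is called the interaction term.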

Including Hourly_Wage_Avg lets us estimate its direct association with Annual_Salary_Avg while holding the other predictors fixed, that is, how much annual salary changes per unit change in hourly wage.

lm_model <- lm(Annual_Salary_Avg ~ Location_Quotient * Employed_Standard_Error + Hourly_Wage_Avg, data = data)

summary(lm_model)

## 
## Call:
## lm(formula = Annual_Salary_Avg ~ Location_Quotient * Employed_Standard_Error + 
##     Hourly_Wage_Avg, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2337  -4.8747  -0.0873   5.1256  14.8843 
## 
## Coefficients:
##                                             Estimate Std. Error   t value
## (Intercept)                                  2.13585    2.91288     0.733
## Location_Quotient                           -3.60681    2.09060    -1.725
## Employed_Standard_Error                     -0.44969    0.28979    -1.552
## Hourly_Wage_Avg                           2080.00753    0.04828 43084.177
## Location_Quotient:Employed_Standard_Error    0.65412    0.29730     2.200
##                                           Pr(>|t|)    
## (Intercept)                                 0.4637    
## Location_Quotient                           0.0850 .  
## Employed_Standard_Error                     0.1213    
## Hourly_Wage_Avg                             <2e-16 ***
## Location_Quotient:Employed_Standard_Error   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.506 on 587 degrees of freedom
##   (650 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.94e+08 on 4 and 587 DF,  p-value: < 2.2e-16
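Because the model contains an interaction, the Location_Quotient coefficient is the slope only when Employed_Standard_Error equals 0. The sketch below computes the implied slope at other values from the two estimates printed above. The values in `ese_values` are hypothetical and chosen only for illustration.

```r
# Estimates copied from the coefficient table above
b_lq  <- -3.60681   # Location_Quotient
b_int <-  0.65412   # Location_Quotient:Employed_Standard_Error

# Slope of Annual_Salary_Avg with respect to Location_Quotient at a given
# Employed_Standard_Error, holding Hourly_Wage_Avg fixed
lq_slope <- function(ese) b_lq + b_int * ese

# Hypothetical values of Employed_Standard_Error, for illustration only
ese_values <- c(1, 5, 10)
data.frame(Employed_Standard_Error = ese_values,
           slope = lq_slope(ese_values))

# Value of Employed_Standard_Error at which the implied slope changes sign
ese_zero_slope <- -b_lq / b_int
```

These are point estimates only. A proper comparison would also need the standard error of each implied slope, which depends on the coefficient covariance matrix (`vcov(lm_model)`).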

To visualize how the predictors relate to the dependent variable, we can plot Location_Quotient against Annual_Salary_Avg and color the points by Employed_Standard_Error.

# Load the required library
library(ggplot2)

# Create a scatterplot with interaction effect
ggplot(data, aes(x = Location_Quotient, y = Annual_Salary_Avg, color = Employed_Standard_Error)) +
  geom_point() +
  labs(title = "Interaction Effect on Annual Salary",
       x = "Location Quotient",
       y = "Annual Salary Avg",
       color = "Employed Standard Error")

## Warning: Removed 650 rows containing missing values (`geom_point()`).
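Both the model summary and the plot warning report 650 dropped rows, so it can help to see where the missing values sit. The sketch below uses a small made-up data frame (`toy`, hypothetical values) with the same column names. The same calls could be pointed at `data` instead.

```r
# Hypothetical stand-in for the nurses data with some missing values
toy <- data.frame(
  Annual_Salary_Avg       = c(70000, NA, 82000, 91000, 65000),
  Location_Quotient       = c(1.1, 0.9, NA, 1.3, 0.8),
  Employed_Standard_Error = c(2.5, 3.1, 4.0, NA, 1.9),
  Hourly_Wage_Avg         = c(33.65, 36.00, 39.42, 43.75, 31.25)
)

model_vars <- c("Annual_Salary_Avg", "Location_Quotient",
                "Employed_Standard_Error", "Hourly_Wage_Avg")

# Number of missing values in each model variable
na_per_column <- colSums(is.na(toy[model_vars]))

# Rows lm() keeps under the default na.action (na.omit):
# only rows with no missing value in any model variable
n_complete <- sum(complete.cases(toy[model_vars]))
n_dropped  <- nrow(toy) - n_complete
```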

Interaction Term (between Location_Quotient and Employed_Standard_Error):

Reason for Inclusion: An interaction term allows us to capture the joint effect of Location_Quotient and Employed_Standard_Error on Annual_Salary_Avg. It is warranted when there is reason to believe the two effects are not additive, that is, when the effect of one variable depends on the level of the other.
Multicollinearity: When including an interaction term, multicollinearity between the interaction term and its constituent variables should be assessed, because the product is often strongly correlated with them. High multicollinearity can lead to unstable coefficient estimates and reduced interpretability.
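One common way to reduce the correlation between a product term and its constituents is to mean-center the variables before multiplying. The sketch below illustrates this on simulated, independent stand-ins (`lq` and `ese` are hypothetical). It is not a result from the nurses data.

```r
# Simulated stand-ins for the two interacting predictors (hypothetical values)
set.seed(42)
n   <- 500
lq  <- runif(n, 0.5, 1.5)   # plays the role of Location_Quotient
ese <- runif(n, 1, 10)      # plays the role of Employed_Standard_Error

# Correlation of the raw product with each constituent
raw_cor <- c(lq = cor(lq * ese, lq), ese = cor(lq * ese, ese))

# Mean-center, then form the product again
lq_c  <- lq - mean(lq)
ese_c <- ese - mean(ese)
centered_cor <- c(lq = cor(lq_c * ese_c, lq_c),
                  ese = cor(lq_c * ese_c, ese_c))

raw_cor
centered_cor
```

Centering changes the meaning of the main-effect coefficients (they become slopes at the mean of the other variable) but leaves the interaction coefficient and the fitted values unchanged.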

Hourly Wage Average:

Reason for Inclusion: Hourly wage average represents a crucial factor influencing the annual salary of registered nurses. Including this variable allows us to assess its direct impact on Annual_Salary_Avg while controlling for other variables. Additionally, it provides insights into how changes in hourly wage affect annual salary.
Multicollinearity: Before inclusion, it is important to check how strongly Hourly_Wage_Avg is correlated with the variables already in the model. Highly correlated predictors inflate the standard errors of the affected coefficients, so they should be added with care.
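The summary above reports a Hourly_Wage_Avg coefficient of about 2080 and an R-squared of 1. Since 2080 = 40 hours x 52 weeks, one possible explanation is that annual salary is derived from the hourly wage. This is an assumption and is not confirmed by the source. The sketch below simulates that situation to show what the fit looks like under it. All values are hypothetical.

```r
# Simulated hourly wages (hypothetical values, not taken from nurses.csv)
set.seed(7)
hourly <- runif(100, 25, 60)

# Assumption for illustration: annual salary = hourly wage x 2080 hours,
# plus a small amount of rounding-like noise
annual <- 2080 * hourly + rnorm(100, sd = 5)

fit <- lm(annual ~ hourly)
slope <- coef(fit)[["hourly"]]
r2    <- summary(fit)$r.squared

slope
r2

# On the real data, a comparable check would be the ratio of the two columns:
# summary(data$Annual_Salary_Avg / data$Hourly_Wage_Avg)
```

If the ratio on the real data is essentially constant, Hourly_Wage_Avg explains almost all of the variation by construction, and the remaining coefficients describe only the small leftover residual.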

Employed Standard Error:

Reason for Inclusion: The employed standard error could represent the uncertainty in the measurement of employment levels in the healthcare sector. Including this variable allows us to examine how the precision of employment estimates impacts the relationship between Location_Quotient and Annual_Salary_Avg.
Multicollinearity: As with the other variables, the correlation between Employed_Standard_Error and the remaining predictors should be evaluated. If it is highly correlated with them, the coefficient estimates become less stable.
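The multicollinearity checks described above can be made concrete with variance inflation factors (VIF). The sketch below computes them in base R on simulated predictors. The helper `vif_manual` and the data frame `toy` are hypothetical names introduced here for illustration.

```r
# Simulated, independent predictors (hypothetical values)
set.seed(3)
n <- 200
toy <- data.frame(
  Location_Quotient       = runif(n, 0.5, 1.5),
  Employed_Standard_Error = runif(n, 1, 10),
  Hourly_Wage_Avg         = runif(n, 25, 60)
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    f  <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(f, data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
vifs
```

Values near 1 indicate little collinearity. Common rules of thumb treat values above roughly 5 or 10 as a warning sign, though these cutoffs are conventions rather than formal tests. The `vif()` function in the car package offers a ready-made alternative for fitted models.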

lm_model <- lm(Annual_Salary_Avg ~ Location_Quotient * Employed_Standard_Error + Hourly_Wage_Avg, data = data)

#diagnostic plots
par(mfrow=c(2, 2)) # Arrange plots in a 2x2 grid

# Residuals vs Fitted Values Plot
plot(lm_model, which = 1)

# Normal Q-Q Plot
plot(lm_model, which = 2)

# Scale-Location Plot
plot(lm_model, which = 3)

# Residuals vs Leverage Plot
plot(lm_model, which = 5)

For the Residuals vs Fitted plot, look for a random scatter around the horizontal line at 0, which supports linearity and roughly constant variance.
In the Normal Q-Q plot, check if points follow the diagonal line, indicating normality of residuals.
The Scale-Location plot should ideally show a roughly horizontal trend line with points scattered evenly around it, indicating homoscedasticity.
In the Residuals vs Leverage plot, look for points outside the dashed Cook’s distance contours, which might be influential observations.

Residuals vs Fitted Values Plot:

Issue Indications: Look for patterns or non-random scatter in the residuals. If there’s a clear pattern (e.g., a curve or funnel shape), it suggests non-linearity or heteroscedasticity.
Severity: The severity of non-linearity or heteroscedasticity can vary. Strong patterns indicate more severe issues.
Confidence: If the plot shows random scatter around the horizontal line at 0, it supports the assumption of linearity and homoscedasticity with higher confidence.

Normal Q-Q Plot:

Issue Indications: Deviation from the diagonal line suggests non-normality of residuals.
Severity: If points deviate significantly from the diagonal line, it indicates a severe departure from normality.
Confidence: If points follow the diagonal line closely, it supports the assumption of normality with higher confidence.
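A numeric complement to the Q-Q plot is the Shapiro-Wilk test on the residuals. The sketch below runs it on a simulated model with normal errors. The data are hypothetical. With the fitted model from this analysis, the call would be `shapiro.test(residuals(lm_model))`.

```r
# Simulated regression with normally distributed errors (hypothetical data)
set.seed(11)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
sw$statistic   # W, close to 1 when residuals look normal
sw$p.value     # a small p-value is evidence against normality
```

With several hundred observations the test can flag departures too small to matter in practice, so it is best read alongside the Q-Q plot rather than instead of it.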

Scale-Location Plot:

Issue Indications: Look for patterns or non-random scatter in the spread of residuals.
Severity: Similar to the Residuals vs Fitted plot, strong patterns indicate more severe issues with homoscedasticity.
Confidence: If the plot shows a horizontal line with points randomly scattered, it supports the assumption of homoscedasticity with higher confidence.
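The visual check for homoscedasticity can be paired with a simple test. The sketch below implements a studentized Breusch-Pagan-style statistic by hand: n times the R-squared from regressing the squared residuals on the predictor. It runs on simulated data whose error spread grows with `x` by construction. It is an illustration, not the analysis of the nurses data.

```r
# Simulated data with heteroscedastic errors (hypothetical values)
set.seed(5)
n <- 300
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # error spread increases with x
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic n * R^2, compared with a chi-squared distribution whose
# degrees of freedom equal the number of predictors in the auxiliary model
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

bp_stat
bp_p
```

A small p-value suggests the residual variance changes with the predictor, which matches a funnel shape in the Residuals vs Fitted and Scale-Location plots.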

Residuals vs Leverage Plot:

Issue Indications: Look for points outside the dashed lines, which are Cook’s distance contours and may indicate influential observations.
Severity: Points far outside the dashed lines suggest highly influential observations, especially when high leverage is combined with a large residual.
Confidence: If all points fall within the dashed lines, the plot gives little evidence of unduly influential observations.
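Influence can also be inspected numerically with `cooks.distance()` and `hatvalues()`. The sketch below plants one high-leverage, off-trend point in simulated data and checks that it is flagged. The 4/n cutoff is a common rule of thumb and an assumption here, not a threshold taken from this analysis.

```r
# Simulated data with one deliberately influential point (hypothetical values)
set.seed(9)
x <- c(runif(49, 0, 10), 30)                 # last point has high leverage
y <- c(1 + 2 * x[1:49] + rnorm(49), 0)       # and lies far below the trend
fit <- lm(y ~ x)

cd <- cooks.distance(fit)   # influence of each observation on the fit
hv <- hatvalues(fit)        # leverage of each observation

# Observation with the largest Cook's distance
most_influential <- unname(which.max(cd))

# Observations above the 4/n rule-of-thumb cutoff
flagged <- unname(which(cd > 4 / length(cd)))

most_influential
flagged
```

Flagged observations are candidates for closer inspection, not automatic removal. Refitting the model with and without them shows how much the coefficients actually depend on those rows.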

2024-03-22