Final Project

Summary Statistics
Data Analysis
Correlation Analysis
Hypothesis Testing
Model Building and plotting
Conclusions and Recommendations

Reading the dataset:

data <- read.csv("C:\\Users\\91814\\Desktop\\Statistics\\nurses.csv")

Loading the libraries:

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

library(tidyr)

library(ggthemes)

Summarizing the data:

Exploring the structure of the dataset

str(data)

## 'data.frame':    1242 obs. of  22 variables:
##  $ State                                       : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Year                                        : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ Total_Employed_RN                           : int  48850 6240 55520 25300 307060 52330 33400 11410 10320 183130 ...
##  $ Employed_Standard_Error                     : num  2.9 13 3.7 4.2 2 2.8 6.5 11.4 1.2 2.2 ...
##  $ Hourly_Wage_Avg                             : num  29 45.8 38.6 30.6 58 ...
##  $ Hourly_Wage_Median                          : num  28.2 45.2 38 30 56.9 ...
##  $ Annual_Salary_Avg                           : int  60230 95270 80380 63640 120560 77860 84850 74330 90050 69510 ...
##  $ Annual_Salary_Median                        : int  58630 94070 79010 62330 118410 76500 82770 72110 89440 67510 ...
##  $ Wage_standard_error                         : num  0.8 1.4 0.9 1.4 1 0.7 1 2.5 1.5 0.7 ...
##  $ Hourly_10th_Percentile                      : num  20.8 31.5 27.7 21.5 36.6 ...
##  $ Hourly_25th_Percentile                      : num  23.7 36.9 32.6 25.7 45.2 ...
##  $ Hourly_75th_Percentile                      : num  33.1 53.3 44.7 35.4 71.1 ...
##  $ Hourly_90th_Percentile                      : num  38.7 60.7 50.1 39.6 83.3 ...
##  $ Annual_10th_Percentile                      : int  43150 65530 57530 44660 76180 55820 60560 54260 61410 50220 ...
##  $ Annual_25th_Percentile                      : int  49360 76830 67760 53490 93970 64580 70960 60940 72900 57570 ...
##  $ Annual_75th_Percentile                      : int  68960 110890 92920 73630 147830 90410 99040 84070 105180 79630 ...
##  $ Annual_90th_Percentile                      : int  80420 126260 104290 82480 173370 104070 113320 100150 123540 93500 ...
##  $ Location_Quotient                           : num  1.2 0.98 0.91 1 0.87 0.95 1.01 1.25 0.7 1.01 ...
##  $ Total_Employed_National_Aggregate           : int  140019790 140019790 140019790 140019790 140019790 140019790 140019790 140019790 140019790 140019790 ...
##  $ Total_Employed_Healthcare_National_Aggregate: int  8632190 8632190 8632190 8632190 8632190 8632190 8632190 8632190 8632190 8632190 ...
##  $ Total_Employed._Healthcare_State_Aggregate  : int  128600 17730 171010 80410 844740 144490 100470 30010 28210 553130 ...
##  $ Yearly_Total_Employed_Aggregate             : int  1903210 296300 2835110 1177860 16430660 2578000 1540870 426380 687160 8441750 ...

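The dataset contains employment and wage statistics for registered nurses (RNs) by state and year. It has 1,242 rows and 22 columns. Note that str() prints only the first few values of each column, which here all come from 2020. The columns are:

State and Year: Each row describes registered nurses in one US state in one year. The first rows shown are for 2020, and the summary statistics further down show that Year runs from 1998 to 2020.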
Employment Statistics:
- Total_Employed_RN: The total number of employed registered nurses in each state. Among the 2020 rows shown, values range from 6,240 in Alaska to 307,060 in California.
- Employed_Standard_Error: The standard error associated with the number of employed nurses, which indicates the uncertainty in the employment estimates. In the rows shown it tends to be higher in states with fewer nurses (for example, 13.0 for Alaska versus 2.0 for California).
Wage Statistics:
- Hourly_Wage_Avg and Hourly_Wage_Median: Average and median hourly wages, respectively, which vary widely across states. For instance, among the rows shown the average hourly wage ranges from about $29 in Alabama to about $58 in California.
- Annual_Salary_Avg and Annual_Salary_Median: The average and median annual salaries. Among the rows shown, the average ranges from $60,230 in Alabama to $120,560 in California.
- Wage_standard_error: The standard error of the wage estimates, which indicates their precision. As with the employment standard error, a higher value means a less precise estimate.
Percentile Wages:
- Hourly and annual wages at various percentiles (10th, 25th, 75th, 90th) provide insights into the wage distribution among nurses within each state. The data shows, for example, that the 90th percentile annual salary reaches up to $173,370 in California.
Location Quotient:
- Indicates the concentration of registered nurses in a state compared to the national average. A quotient above 1 suggests a higher concentration than average. For example, a location quotient of 1.25 in Delaware implies a higher concentration of nurses relative to the national context.
Employment Aggregates:
- Total_Employed_National_Aggregate and Total_Employed_Healthcare_National_Aggregate reflect the total employment figures on a national scale, providing a benchmark for comparing state-level data.
- Total_Employed._Healthcare_State_Aggregate (the column name contains a stray period): Shows the total number employed in the healthcare sector within each state, helping contextualize the RN figures.
- Yearly_Total_Employed_Aggregate: The total number of people employed annually in each state across all sectors, giving further insight into the overall employment landscape.

This data can be useful for understanding regional variations in nurse employment and wages, potentially aiding in policy-making, healthcare planning, and economic analysis specific to the healthcare labor market.
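
Because str() prints only the first few values of each column, it is worth checking how many years the data actually covers before describing it. The following is a minimal sketch of that check. It uses a small made-up data frame (`toy`, with hypothetical values) as a stand-in for `data`, so the numbers are not results from nurses.csv.

```r
# Hypothetical stand-in for `data`: two states observed in three years (made-up values)
toy <- data.frame(
  State = rep(c("Alabama", "Alaska"), each = 3),
  Year  = rep(c(2018, 2019, 2020), times = 2),
  Annual_Salary_Avg = c(58000, 59000, 60000, 90000, 92000, 95000)
)

# Range of years covered and number of rows per year
year_range    <- range(toy$Year)
rows_per_year <- table(toy$Year)
print(year_range)
print(rows_per_year)

# Keep a single year when the goal is a one-year, state-by-state comparison
toy_2020 <- subset(toy, Year == 2020)
print(toy_2020)
```

Running the same three calls on `data` would show whether a per-state plot or table mixes several years together.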

Summary statistics:

summary(data)

##     State                Year      Total_Employed_RN Employed_Standard_Error
##  Length:1242        Min.   :1998   Min.   :   240    Min.   : 0.70          
##  Class :character   1st Qu.:2003   1st Qu.: 12210    1st Qu.: 2.50          
##  Mode  :character   Median :2009   Median : 31160    Median : 3.50          
##                     Mean   :2009   Mean   : 47704    Mean   : 4.36          
##                     3rd Qu.:2015   3rd Qu.: 60230    3rd Qu.: 5.10          
##                     Max.   :2020   Max.   :307060    Max.   :26.10          
##                                    NA's   :5         NA's   :5              
##  Hourly_Wage_Avg Hourly_Wage_Median Annual_Salary_Avg Annual_Salary_Median
##  Min.   : 9.23   Min.   : 8.64      Min.   : 19190    Min.   : 17970      
##  1st Qu.:23.70   1st Qu.:23.08      1st Qu.: 49300    1st Qu.: 47995      
##  Median :28.25   Median :27.58      Median : 58750    Median : 57375      
##  Mean   :28.48   Mean   :27.86      Mean   : 59248    Mean   : 57958      
##  3rd Qu.:32.39   3rd Qu.:31.73      3rd Qu.: 67378    3rd Qu.: 65988      
##  Max.   :57.96   Max.   :56.93      Max.   :120560    Max.   :118410      
##  NA's   :6       NA's   :6          NA's   :6         NA's   :6           
##  Wage_standard_error Hourly_10th_Percentile Hourly_25th_Percentile
##  Min.   :0.400       Min.   : 6.38          Min.   : 7.33         
##  1st Qu.:0.900       1st Qu.:16.81          1st Qu.:19.47         
##  Median :1.100       Median :20.04          Median :23.24         
##  Mean   :1.272       Mean   :20.23          Mean   :23.54         
##  3rd Qu.:1.425       3rd Qu.:23.54          3rd Qu.:27.01         
##  Max.   :7.500       Max.   :36.62          Max.   :45.18         
##  NA's   :6           NA's   :6              NA's   :6             
##  Hourly_75th_Percentile Hourly_90th_Percentile Annual_10th_Percentile
##  Min.   :10.04          Min.   :12.33          Min.   :13260         
##  1st Qu.:27.21          1st Qu.:32.51          1st Qu.:34958         
##  Median :32.61          Median :37.51          Median :41670         
##  Mean   :32.92          Mean   :38.16          Mean   :42088         
##  3rd Qu.:37.33          3rd Qu.:43.41          3rd Qu.:48955         
##  Max.   :71.07          Max.   :83.35          Max.   :76180         
##  NA's   :6              NA's   :6              NA's   :6             
##  Annual_25th_Percentile Annual_75th_Percentile Annual_90th_Percentile
##  Min.   :15260          Min.   : 20890         Min.   : 25650        
##  1st Qu.:40488          1st Qu.: 56598         1st Qu.: 67620        
##  Median :48335          Median : 67835         Median : 78015        
##  Mean   :48969          Mean   : 68465         Mean   : 79367        
##  3rd Qu.:56195          3rd Qu.: 77638         3rd Qu.: 90290        
##  Max.   :93970          Max.   :147830         Max.   :173370        
##  NA's   :6              NA's   :6              NA's   :6             
##  Location_Quotient Total_Employed_National_Aggregate
##  Min.   :0.32      Min.   :124143490                
##  1st Qu.:0.90      1st Qu.:129059020                
##  Median :1.01      Median :131713800                
##  Mean   :1.01      Mean   :134075564                
##  3rd Qu.:1.13      3rd Qu.:138885360                
##  Max.   :1.50      Max.   :147838700                
##  NA's   :649       NA's   :4                        
##  Total_Employed_Healthcare_National_Aggregate
##  Min.   :5854360                             
##  1st Qu.:6226540                             
##  Median :7250140                             
##  Mean   :7268640                             
##  3rd Qu.:8076300                             
##  Max.   :8727310                             
##  NA's   :4                                   
##  Total_Employed._Healthcare_State_Aggregate Yearly_Total_Employed_Aggregate
##  Min.   :   110                             Min.   :     110               
##  1st Qu.: 33448                             1st Qu.:  596520               
##  Median : 87435                             Median : 1557110               
##  Mean   :134743                             Mean   : 2387209               
##  3rd Qu.:175293                             3rd Qu.: 2888682               
##  Max.   :844930                             Max.   :17382400               
##  NA's   :2

Employment of RNs:
- The dataset records RN employment from a minimum of 240 to a maximum of 307,060 per state and year, with a much lower median of 31,160. This wide range suggests significant variability in RN employment across states, likely reflecting differences in population size and healthcare infrastructure.
Employment Variability:
- The Employed_Standard_Error varies from 0.7 to 26.1, with a median of 3.5. This indicates that some states have much more precise employment estimates than others, which may be influenced by the size of the RN workforce in each state.
Wage Distribution:
- The average hourly wage (Hourly_Wage_Avg) ranges from $9.23 to $57.96 with a median of $28.25, and the average annual salary (Annual_Salary_Avg) has a median of $58,750. Because the rows span 1998 to 2020, this range reflects wage growth over time as well as regional cost of living differences, state-specific healthcare demand, and varying levels of experience and specialization among nurses.
- The percentile columns show that the 10th percentile annual salary is as low as $13,260 in the lowest state and year, while the 90th percentile reaches $173,370 in the highest, which points to a wide spread between the lowest-paid and highest-paid RNs.
Location Quotient:
- The Location_Quotient ranges from 0.32 to 1.50, with a median of 1.01. A location quotient greater than 1 suggests that RNs are more concentrated in that state compared to the national average. The variability here reflects different regional healthcare needs and employment saturation.
National and State Employment Aggregates:
- Total_Employed_National_Aggregate varies slightly over the years but averages around 134 million, indicating the total employment figures across all sectors nationally, providing a context for the scale of healthcare employment.
- Total_Employed_Healthcare_National_Aggregate and Total_Employed._Healthcare_State_Aggregate show healthcare employment at the national and state levels, respectively. Healthcare is a significant employment sector, with an average of around 7.3 million healthcare workers nationally.
Yearly Employment Trends:
- Yearly_Total_Employed_Aggregate indicates the total number of employed individuals in a year, with an average exceeding 2.3 million. This aggregate data reflects the overall employment status within states and can help in understanding economic trends and labor market conditions.
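
The location quotient described above can be written as the state's RN share of employment divided by the national RN share. The sketch below assumes that this is how Location_Quotient was derived, which the dataset itself does not confirm. `location_quotient()` is a hypothetical helper, and the national RN count of 3,000,000 is a round illustrative figure that does not appear in the dataset. The Alabama inputs are the 2020 values shown in the str() output.

```r
# Location quotient: (state RN share of employment) / (national RN share of employment)
# location_quotient() is a hypothetical helper, not part of the original analysis
location_quotient <- function(rn_state, total_state, rn_national, total_national) {
  (rn_state / total_state) / (rn_national / total_national)
}

lq_alabama <- location_quotient(
  rn_state       = 48850,      # Total_Employed_RN, Alabama 2020
  total_state    = 1903210,    # Yearly_Total_Employed_Aggregate, Alabama 2020
  rn_national    = 3000000,    # ASSUMED round figure, not in the dataset
  total_national = 140019790   # Total_Employed_National_Aggregate, 2020
)
print(round(lq_alabama, 2))
```

With these inputs the result comes out close to the 1.2 recorded for Alabama, which is consistent with the assumed definition but does not prove it.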

Data Analysis:

Let’s explore the potential factors that could influence salary variations-

How do the average annual salaries of registered nurses vary across different states in the US, and what factors might influence these variations?

To find out, let’s visualize and analyze the average annual salaries across different states:

# Visualizing average annual salaries for 2020 only
# (without the Year filter, the bars would stack every year from 1998 to 2020 for each state)
salary_2020 <- filter(data, Year == 2020, !is.na(Annual_Salary_Avg))
ggplot(salary_2020, aes(x = State, y = Annual_Salary_Avg)) +
  geom_col(fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(x = "State", y = "Average Annual Salary", title = "Average Annual Salaries of RNs by State (2020)")

The average RN salary in the United States for 2020 shows notable geographic variation. The states with the highest average salaries are Massachusetts, California, and Hawaii, probably for a mix of reasons. These states tend to have a higher cost of living than other places, which requires a higher salary to maintain a comparable standard of living. They may also face nurse shortages, which would increase demand and raise wages.
The graph suggests a possible geographic pattern, with the Northeast and the West Coast having the highest earnings. Possible explanations include a greater concentration of specialised healthcare services in these areas or a higher overall cost of living.
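
A ranking such as "the highest average salaries" is easier to read off a bar chart when the bars are sorted by value rather than alphabetically. The following is a minimal sketch of that idea using reorder(). The three rows in `toy_2020` are made-up values for illustration, not figures from the dataset.

```r
library(ggplot2)

# Hypothetical 2020 values for three states (made up for illustration)
toy_2020 <- data.frame(
  State = c("Alabama", "California", "Texas"),
  Annual_Salary_Avg = c(60000, 120000, 76000)
)

# reorder() sorts the State factor levels by salary, lowest first
toy_2020$State <- reorder(toy_2020$State, toy_2020$Annual_Salary_Avg)

# coord_flip() puts the state names on the vertical axis so they stay readable
p <- ggplot(toy_2020, aes(x = State, y = Annual_Salary_Avg)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(x = "State", y = "Average Annual Salary")
print(p)
```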

Exploring potential factors influencing the above salary variations, such as location quotient or total employment:

Potential influence of Location Quotient factor-

# Visualizing location quotient for 2020 only
# (without the Year filter, the bars would stack every available year for each state)
lq_2020 <- filter(data, Year == 2020, !is.na(Location_Quotient))
ggplot(lq_2020, aes(x = State, y = Location_Quotient)) +
  geom_col(fill = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(x = "State", y = "Location Quotient", title = "Location Quotient of RN Employment by State (2020)")

From the above graph, states with higher average salaries appear to have somewhat lower location quotients. This inverse relationship between location quotient and average salary is weak (the correlation analysis below gives -0.125), and it has several possible interpretations:

Competition vs. Salary:
- States with higher salaries may face less competition for healthcare talent, resulting in lower relative concentration (LQ).
- Lower salary states may need higher concentrations of healthcare workers to meet demand despite less competitive pay.
Cost of Living and Opportunities:
- Higher salary states often have higher living costs, necessitating higher pay to attract talent.
- Lower salary states may offer fewer job opportunities or have less competitive markets, leading to higher LQs.
Economic Specialization:
- States with higher salaries might have more diversified economies, making healthcare a smaller share of employment (lower LQ).
- Lower salary states may rely heavily on healthcare, resulting in higher LQs.
Policy and Planning:
- Understanding LQ vs. salary informs workforce planning and policy decisions.
- Efforts to address disparities in healthcare access and workforce distribution are crucial.

Potential Influence of Total Employment -

# Visualizing total employment in healthcare for 2020 only
# (without the Year filter, the bars would stack every year from 1998 to 2020 for each state)
healthcare_2020 <- filter(data, Year == 2020, !is.na(Total_Employed._Healthcare_State_Aggregate))
ggplot(healthcare_2020, aes(x = State, y = Total_Employed._Healthcare_State_Aggregate)) +
  geom_col(fill = "lightcoral") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(x = "State", y = "Total Employment in Healthcare", title = "Total Employment in Healthcare by State (2020)")

California, Florida, Texas, and New Jersey have the tallest bars in the graph, indicating the highest levels of total healthcare employment among the states. Several factors may contribute to this:

Population Size and Healthcare Demand:
- California, Texas, and Florida are the three most populous states in the United States, and New Jersey is also among the more populous. Larger populations generally mean greater demand for healthcare services and therefore a larger healthcare workforce.
Economic and Healthcare Infrastructure:
- These states may have robust healthcare infrastructures, including hospitals, clinics, and medical centers, to meet the healthcare needs of their populations.
- Investments in healthcare facilities, research institutions, and medical schools may contribute to a larger healthcare workforce in these states.
Diverse Healthcare Sector:
- California, Florida, Texas, and New Jersey are known for their diverse economies and healthcare sectors. They may offer a wide range of healthcare services and specialties, attracting healthcare professionals from various fields.
Urban Centers and Healthcare Hubs:
- Major urban centers within these states, such as Los Angeles, Miami, and Houston, along with the New York City metropolitan area that extends into New Jersey, serve as healthcare hubs with extensive medical facilities and employment opportunities for healthcare workers.
Policy and Regulation:
- State-level healthcare policies and regulations can influence healthcare employment levels. These states may have enacted policies to support healthcare workforce development, recruitment, and retention.
Consideration of Challenges:
- Despite high healthcare employment levels, these states may still face challenges such as healthcare workforce shortages, disparities in access to care, and healthcare affordability issues.

Correlation Analysis:

Let’s conduct correlation analysis to investigate relationships between variables:

# Correlation matrix; "pairwise.complete.obs" drops missing values pair by pair
cor_vars <- c("Annual_Salary_Avg", "Location_Quotient", "Total_Employed._Healthcare_State_Aggregate")
correlation_matrix <- cor(data[, cor_vars], use = "pairwise.complete.obs")
print(correlation_matrix)

##                                            Annual_Salary_Avg Location_Quotient
## Annual_Salary_Avg                                  1.0000000       -0.12516077
## Location_Quotient                                 -0.1251608        1.00000000
## Total_Employed._Healthcare_State_Aggregate         0.3088250        0.01929619
##                                            Total_Employed._Healthcare_State_Aggregate
## Annual_Salary_Avg                                                          0.30882498
## Location_Quotient                                                          0.01929619
## Total_Employed._Healthcare_State_Aggregate                                 1.00000000

Annual Salary and Location Quotient:
- There is a weak negative correlation (-0.125) between annual salary and location quotient. This suggests that, on average, as the location quotient increases (indicating a higher concentration of RN employment relative to the national average), the annual salary tends to decrease slightly. The relationship is weak, so location quotient alone says little about salary.
Annual Salary and Total Employment in Healthcare:
- There is a weak to moderate positive correlation (0.309) between annual salary and total employment in healthcare. This indicates that, on average, states with higher total employment in healthcare tend to have higher average annual RN salaries, although the relationship is far from strong.
Location Quotient and Total Employment in Healthcare:
- There is a very weak positive correlation (0.019) between location quotient and total employment in healthcare. States with a higher concentration of RN employment relative to the national average do not necessarily have higher total employment in healthcare, so there is essentially no linear relationship between these two variables.
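
A correlation coefficient on its own does not show how precisely it is estimated or how many rows it is based on, which matters here because Location_Quotient is missing in many rows. The following is a minimal sketch using cor.test() from base R, which reports a p-value and a 95% confidence interval. The data frame `toy` holds simulated, hypothetical values rather than the nurses data.

```r
set.seed(1)

# Hypothetical stand-in for two columns of `data`, with some missing values
toy <- data.frame(
  Annual_Salary_Avg = rnorm(50, mean = 60000, sd = 10000),
  Location_Quotient = rnorm(50, mean = 1, sd = 0.2)
)
toy$Location_Quotient[1:5] <- NA

# cor.test() drops incomplete pairs and reports r with a p-value and 95% CI
ct <- cor.test(toy$Annual_Salary_Avg, toy$Location_Quotient)
print(ct$estimate)
print(ct$conf.int)
print(ct$p.value)

# Number of complete pairs actually used in the calculation
n_pairs <- sum(complete.cases(toy))
print(n_pairs)
```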

Visualization :

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.92 loaded

# Plot the correlation matrix
corrplot(correlation_matrix, method = "color", type = "upper", order = "hclust", addrect = 2)

Hypothesis Testing :

We can perform hypothesis testing to determine if there are significant differences in average annual salaries between states:

We consider only two states, California and Texas, because the goal is a direct comparison of average salaries between these two specific states. The test uses each state's Annual_Salary_Avg values across all years in the dataset, and the comparison can point to differences in economic conditions, cost of living, or other factors that affect salary levels.

Limiting the comparison to just two states simplifies the analysis and allows us to focus on a specific question or hypothesis. Additionally, comparing more than two states at once could make the analysis more complex and less interpretable, especially if there are multiple factors influencing salary levels across different regions.

By focusing on California and Texas, we can conduct a direct comparison and determine if there is a statistically significant difference in average salaries between these two states. If there is a significant difference, it could have implications for various stakeholders, such as policymakers, employers, or individuals considering relocation.

# Example of t-test comparing average salaries between two states (e.g., California and Texas)
t_test_result <- t.test(data[data$State == "California", "Annual_Salary_Avg"], 
                        data[data$State == "Texas", "Annual_Salary_Avg"])

print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  data[data$State == "California", "Annual_Salary_Avg"] and data[data$State == "Texas", "Annual_Salary_Avg"]
## t = 4.6611, df = 34.141, p-value = 4.666e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12795.42 32573.27
## sample estimates:
## mean of x mean of y 
##  83305.65  60621.30
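
The two samples above are yearly state averages, so the values for California and Texas are matched by year rather than independent draws. A paired test on the year-by-year differences is one alternative worth considering. The sketch below shows the call on made-up values (`toy` is hypothetical), so its output says nothing about the real data.

```r
# Hypothetical yearly averages for two states (made-up values, same years in both)
toy <- data.frame(
  Year       = 2016:2020,
  California = c(100000, 102000, 106000, 113000, 120000),
  Texas      = c(70000, 72000, 73000, 74000, 76000)
)

# Paired t-test: tests whether the mean year-by-year difference is zero
paired_result <- t.test(toy$California, toy$Texas, paired = TRUE)
print(paired_result)

# With the real data, the two vectors would first need to be aligned by Year,
# for example by merging the California and Texas rows on the Year column
```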

Plotting the test for two states (illustrative values, not the dataset) -

# Made-up example values for illustration only; these are NOT taken from nurses.csv
California_salary <- c(50000, 55000, 60000, 65000, 70000)  # Example salaries for California
Texas_salary <- c(48000, 52000, 58000, 62000, 67000)  # Example salaries for Texas

# Create a new data frame
newdata <- data.frame(State = c(rep("California", length(California_salary)), rep("Texas", length(Texas_salary))),
                      Annual_Salary_Avg = c(California_salary, Texas_salary))

# t-test on the example values, stored under its own name so that
# t_test_result from the real data above is not overwritten
example_t_test <- t.test(newdata[newdata$State == "California", "Annual_Salary_Avg"], 
                         newdata[newdata$State == "Texas", "Annual_Salary_Avg"])

# Visualization; annotate() draws the p-value label once instead of once per row
ggplot(newdata, aes(x = State, y = Annual_Salary_Avg, fill = State)) +
  geom_boxplot() +
  annotate("text", x = 1.5, y = max(newdata$Annual_Salary_Avg), vjust = -1,
           label = paste("p-value =", signif(example_t_test$p.value, digits = 3))) +
  labs(title = "Comparison of Example Salaries for California and Texas",
       y = "Annual Salary Average",
       fill = "State") +
  theme_minimal()

Note that this plot is built from five made-up example values per state, not from rows of the dataset, so it illustrates the plotting approach rather than the actual comparison.
Median Salary: The horizontal line within each box represents the median salary. In the example values the median is $60,000 for California and $58,000 for Texas.
Salary Range: The boxes represent the interquartile range (IQR), which contains the middle 50% of the salaries for each state. California’s IQR runs from $55,000 to $65,000 and Texas’s from $52,000 to $62,000, so both boxes are $10,000 wide.
Variability: The whiskers extend from the box to the most extreme values within 1.5 times the IQR. No example value lies beyond that limit, so the whiskers reach the minimum and maximum of each group: $50,000 to $70,000 for California and $48,000 to $67,000 for Texas.
Statistical Significance: The plot shows a p-value of 0.61 for the example values. A p-value greater than 0.05 generally indicates that the observed difference is not statistically significant, which is expected for two small, heavily overlapping samples.

This result applies only to the illustrative values. On the actual data, the Welch t-test reported above gives mean salaries of $83,306 for California and $60,621 for Texas, with a 95% confidence interval for the difference of $12,795 to $32,573 and a p-value of 4.666e-05, so the difference between the two states is statistically significant.
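
The same kind of boxplot can be drawn directly from rows of a data frame instead of hand-typed vectors, so that the plot and the test use the same observations. The following is a minimal sketch of that approach. The data frame `toy` and its values are hypothetical and stand in for `data`.

```r
library(ggplot2)

# Hypothetical stand-in for `data` (made-up values)
toy <- data.frame(
  State = rep(c("California", "Texas", "Ohio"), each = 4),
  Year  = rep(2017:2020, times = 3),
  Annual_Salary_Avg = c(102000, 106000, 113000, 120000,
                        72000, 73000, 74000, 76000,
                        65000, 66000, 68000, 70000)
)

# Keep only the two states being compared
two_states <- subset(toy, State %in% c("California", "Texas"))

# Welch t-test via the formula interface
welch <- t.test(Annual_Salary_Avg ~ State, data = two_states)

# Boxplot of the same rows, with the p-value shown in the subtitle
p <- ggplot(two_states, aes(x = State, y = Annual_Salary_Avg, fill = State)) +
  geom_boxplot() +
  labs(subtitle = paste("Welch t-test p-value =", signif(welch$p.value, 3)),
       y = "Average Annual Salary") +
  theme_minimal()
print(p)
```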

Regression Model Building and Plotting :

Building a regression model to predict average annual salary from relevant factors:

# Build regression model
model <- lm(Annual_Salary_Avg ~ Location_Quotient + Total_Employed._Healthcare_State_Aggregate, data = data)
summary(model)

## 
## Call:
## lm(formula = Annual_Salary_Avg ~ Location_Quotient + Total_Employed._Healthcare_State_Aggregate, 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34850  -7325  -2124   6963  39014 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                 7.252e+04  2.537e+03  28.581
## Location_Quotient                          -8.219e+03  2.437e+03  -3.372
## Total_Employed._Healthcare_State_Aggregate  2.467e-02  2.880e-03   8.568
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## Location_Quotient                          0.000795 ***
## Total_Employed._Healthcare_State_Aggregate  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11150 on 589 degrees of freedom
##   (650 observations deleted due to missingness)
## Multiple R-squared:  0.1248, Adjusted R-squared:  0.1218 
## F-statistic: 41.98 on 2 and 589 DF,  p-value: < 2.2e-16

Coefficients:
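- The intercept is the estimated average annual salary when all predictor variables are zero: $72,520. A state with a location quotient of zero and no healthcare employment does not occur in the data, so the intercept is an extrapolation rather than a meaningful salary.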
- The coefficient for Location Quotient is -8,219, indicating that for each unit increase in Location Quotient, the average annual salary decreases by $8,219, holding other variables constant.
- The coefficient for Total Employed in Healthcare State Aggregate is 0.02467. It suggests that for each additional person employed in healthcare at the state level, the average annual salary increases by $0.02467, holding other variables constant.
Significance:
- All predictor variables have p-values less than 0.05, indicating that they are statistically significant predictors of average annual salary.
Residuals:
- The residuals (errors) represent the differences between the observed and predicted values of the dependent variable (average annual salary). The summary shows the minimum, 1st quartile, median, 3rd quartile, and maximum of these residuals.
Model Fit:
- The adjusted R-squared value, which measures the proportion of variance in the dependent variable explained by the independent variables, is 0.1218. It suggests that around 12.18% of the variability in average annual salary is explained by the Location Quotient and Total Employed in Healthcare State Aggregate variables in the model.
- The F-statistic tests the overall significance of the model. With a p-value of less than 0.05, the model is statistically significant.
Residual Standard Error:
- The residual standard error (the estimated standard deviation of the residuals) is approximately $11,150. It indicates the typical size of the gap between the observed and predicted values.
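
The coefficients above are expressed per one full unit of each predictor, which is awkward here: Location_Quotient only ranges from 0.32 to 1.50, and one additional healthcare worker is a negligible change. The following short worked example rescales the reported estimates to more realistic step sizes. The step sizes of 0.1 and 10,000 are arbitrary choices for illustration.

```r
# Coefficients as reported in the model summary above
coef_lq         <- -8219      # dollars per 1.0 change in Location_Quotient
coef_healthcare <- 0.02467    # dollars per additional healthcare worker

# Rescale to changes of a more realistic size
effect_per_tenth_lq    <- coef_lq * 0.1          # per 0.1 change in location quotient
effect_per_10k_workers <- coef_healthcare * 1e4  # per 10,000 additional healthcare workers
print(effect_per_tenth_lq)
print(effect_per_10k_workers)
```

On this scale, a 0.1 increase in location quotient corresponds to a decrease of about $822 in predicted salary, and 10,000 additional healthcare workers to an increase of about $247, holding the other predictor constant.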

Visualization:

# Load necessary libraries
library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

# Diagnostic plots for regression model
# Plotting residuals vs. fitted values
residuals_vs_fitted <- ggplot(model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs. Fitted Values")

# Plotting normal Q-Q plot of residuals
qq_plot <- ggplot(model, aes(sample = .stdresid)) +
  geom_qq() +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Normal Q-Q Plot of Residuals")

# Plotting scale-location plot
scale_location_plot <- ggplot(model, aes(x = .fitted, y = sqrt(abs(.stdresid)))) +
  geom_point() +
  geom_smooth(se = FALSE, method = "loess", color = "red") +
  labs(x = "Fitted Values", y = "√|Standardized Residuals|", title = "Scale-Location Plot")

# Plotting residuals vs. leverage plot
# (.hat is the leverage column that ggplot2 adds when given an lm object)
residuals_vs_leverage <- ggplot(model, aes(x = .hat, y = .stdresid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Leverage", y = "Standardized Residuals", title = "Residuals vs. Leverage")

# Arrange plots in a grid
grid.arrange(residuals_vs_fitted, qq_plot, scale_location_plot, residuals_vs_leverage, ncol = 2)

## `geom_smooth()` using formula = 'y ~ x'

Residuals vs. Fitted Values:
- Purpose: This plot is used to check the homoscedasticity assumption — that the residuals have constant variance across all levels of fitted values.
- Interpretation: Ideally, the points should be randomly dispersed around the horizontal line (zero residual). If the spread of the residuals widens or narrows as the fitted values increase, it suggests non-constant variance (heteroscedasticity). In this plot, the residuals cluster and then spread out as the fitted values increase, indicating potential heteroscedasticity.
Normal Q-Q Plot of Residuals:
- Purpose: This plot checks whether the residuals are approximately normally distributed, a key assumption of many regression models.
- Interpretation: The points should ideally lie on the dashed line. Deviations from this line suggest deviations from normality. In this plot, the slight deviations at both tails might indicate some departure from normality, suggesting the presence of outliers or long tails in the distribution of residuals.
Scale-Location Plot (Spread-Location Plot):
- Purpose: Also used to assess homoscedasticity. It shows how the spread of the standardized residuals changes across the range of fitted values.
- Interpretation: The red line (a smoothed average) should be roughly horizontal if the residuals are spread equally across the fitted values. In this plot, the red line curves, suggesting that the variance of the residuals changes with the fitted values (heteroscedasticity).
Residuals vs. Leverage Plot:
- Purpose: This plot helps to identify influential cases (outliers that might affect the regression line more than others).
- Interpretation: Points with high leverage can have a large impact on the fitted model. The version drawn here shows only a dashed zero-residual line and no Cook’s distance contours, so influence has to be judged from the combination of high leverage and a large standardized residual. This plot shows several points with higher leverage, a few of which also have large residuals, which may be a cause for concern in terms of their influence on the model.
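
Influence can also be quantified directly with Cook's distance, which combines leverage and residual size for each observation. The following is a minimal sketch using cooks.distance() from base R on simulated data. `toy` and `toy_model` are hypothetical stand-ins for `data` and `model`, and the 4/n cutoff is a common rule of thumb rather than a formal test.

```r
set.seed(42)

# Hypothetical data with one deliberately unusual observation
toy <- data.frame(x = c(rnorm(30), 8))
toy$y <- 2 * toy$x + rnorm(31)
toy$y[31] <- -20   # far from the trend and at an extreme x value

toy_model <- lm(y ~ x, data = toy)

# Cook's distance for every observation in the fitted model
cooks <- cooks.distance(toy_model)

# Flag observations above the 4/n rule-of-thumb cutoff
threshold   <- 4 / nrow(toy)
influential <- which(cooks > threshold)
print(influential)
```

Applied to `model`, the flagged rows could then be inspected by state and year before deciding whether they distort the fit.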

Conclusion:

The model indicates that both Location Quotient and Total Employed in Healthcare State Aggregate are significant predictors of average annual salary.

Variation in Average Annual Salaries:
- The initial analysis suggests that average annual salaries of registered nurses vary significantly across different states in the US. This variation can be attributed to several factors, including differences in cost of living, demand for healthcare services, labor market conditions, and state-specific regulations and policies regarding healthcare wages.
Influential Factors:
- The analysis identified two key factors, location quotient and total employment in healthcare, as significant predictors of average annual salaries. Location quotient reflects the concentration of registered nurses in a particular state relative to the national average, while total employment in healthcare captures the overall demand for healthcare services in a state. These factors are likely to influence salary variations by affecting the supply and demand dynamics of registered nurses in different regions.
Regression Modeling Insights:
- Regression modeling provided deeper insights into the relationships between variables and offered predictive capabilities for average annual salaries. By examining the coefficients and statistical significance of predictor variables, the model revealed how changes in location quotient and total employment in healthcare correspond to changes in average annual salaries. Additionally, measures such as adjusted R-squared and residual standard error provided information about the model’s explanatory power and predictive accuracy.
Further Analysis:
- To gain a comprehensive understanding of salary variations among registered nurses, further analysis could explore additional factors that may impact wages. For example, factors such as cost of living, nurse-to-patient ratios, presence of labor unions, and state-level policies regarding minimum wage and overtime regulations could be considered. Analyzing these factors alongside location quotient and total employment in healthcare could provide a more nuanced understanding of salary determinants.
- Additionally, employing more sophisticated modeling techniques, such as hierarchical linear modeling or machine learning algorithms, could enhance predictive accuracy and uncover nonlinear relationships between variables. These advanced techniques can account for complex interactions and heterogeneity across states, leading to more accurate salary predictions and actionable insights for stakeholders in the healthcare sector.