Summary Statistics
Data Analysis
Correlation Analysis
Hypothesis Testing
Model Building and plotting
Conclusions and Recommendations
data <- read.csv("C:\\Users\\91814\\Desktop\\Statistics\\nurses.csv")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(ggthemes)
Exploring the structure of the dataset
str(data)
## 'data.frame': 1242 obs. of 22 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ Total_Employed_RN : int 48850 6240 55520 25300 307060 52330 33400 11410 10320 183130 ...
## $ Employed_Standard_Error : num 2.9 13 3.7 4.2 2 2.8 6.5 11.4 1.2 2.2 ...
## $ Hourly_Wage_Avg : num 29 45.8 38.6 30.6 58 ...
## $ Hourly_Wage_Median : num 28.2 45.2 38 30 56.9 ...
## $ Annual_Salary_Avg : int 60230 95270 80380 63640 120560 77860 84850 74330 90050 69510 ...
## $ Annual_Salary_Median : int 58630 94070 79010 62330 118410 76500 82770 72110 89440 67510 ...
## $ Wage_standard_error : num 0.8 1.4 0.9 1.4 1 0.7 1 2.5 1.5 0.7 ...
## $ Hourly_10th_Percentile : num 20.8 31.5 27.7 21.5 36.6 ...
## $ Hourly_25th_Percentile : num 23.7 36.9 32.6 25.7 45.2 ...
## $ Hourly_75th_Percentile : num 33.1 53.3 44.7 35.4 71.1 ...
## $ Hourly_90th_Percentile : num 38.7 60.7 50.1 39.6 83.3 ...
## $ Annual_10th_Percentile : int 43150 65530 57530 44660 76180 55820 60560 54260 61410 50220 ...
## $ Annual_25th_Percentile : int 49360 76830 67760 53490 93970 64580 70960 60940 72900 57570 ...
## $ Annual_75th_Percentile : int 68960 110890 92920 73630 147830 90410 99040 84070 105180 79630 ...
## $ Annual_90th_Percentile : int 80420 126260 104290 82480 173370 104070 113320 100150 123540 93500 ...
## $ Location_Quotient : num 1.2 0.98 0.91 1 0.87 0.95 1.01 1.25 0.7 1.01 ...
## $ Total_Employed_National_Aggregate : int 140019790 140019790 140019790 140019790 140019790 140019790 140019790 140019790 140019790 140019790 ...
## $ Total_Employed_Healthcare_National_Aggregate: int 8632190 8632190 8632190 8632190 8632190 8632190 8632190 8632190 8632190 8632190 ...
## $ Total_Employed._Healthcare_State_Aggregate : int 128600 17730 171010 80410 844740 144490 100470 30010 28210 553130 ...
## $ Yearly_Total_Employed_Aggregate : int 1903210 296300 2835110 1177860 16430660 2578000 1540870 426380 687160 8441750 ...
The dataset contains employment and wage statistics for registered nurses (RNs) by state and year. It has the following columns:
State and Year: The state and the year of each observation. The rows printed by str() are all from 2020, but the summary of Year further down shows that the data spans 1998 to 2020.
Employment Statistics:
Total_Employed_RN: The total number of employed registered nurses in each state. Among the 2020 rows shown above, values range from 6,240 in Alaska to 307,060 in California.
Employed_Standard_Error: The standard error of the employment estimate, a measure of its uncertainty. It tends to be larger in states with fewer nurses (for example, 13.0 in Alaska versus 2.0 in California).
Wage Statistics:
Hourly_Wage_Avg and Hourly_Wage_Median: Average and median hourly wages, which vary considerably across states. In the 2020 rows shown above, the average hourly wage ranges from $29.0 in Alabama to $58.0 in California.
Annual_Salary_Avg and Annual_Salary_Median: The corresponding average and median annual salaries. In the same rows, the average annual salary ranges from $60,230 in Alabama to $120,560 in California.
Wage_standard_error: The standard error of the wage estimates. As with the employment standard error, larger values indicate less precise estimates.
Percentile Wages:
Location Quotient:
Employment Aggregates:
Total_Employed_National_Aggregate and Total_Employed_Healthcare_National_Aggregate reflect the total employment figures on a national scale, providing a benchmark for comparing state-level data.
Total_Employed._Healthcare_State_Aggregate: The total number employed in the healthcare sector within each state, which helps contextualize the RN figures. (The column name contains a stray period, as shown in the str() output.)
Yearly_Total_Employed_Aggregate: The total number of people employed annually in each state across all sectors, giving further insight into the overall employment landscape.
This data can be useful for understanding regional variations in nurse employment and wages, potentially aiding in policy-making, healthcare planning, and economic analysis specific to the healthcare labor market.
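Because str() prints only the first few values of each column, it can help to confirm which years a file covers and where values are missing before interpreting anything. The following is a minimal sketch, assuming a small stand-in data frame (`toy`, hypothetical, with partly invented values) in place of nurses.csv:

```r
# Hypothetical stand-in for the nurses data; the 2019 values are invented.
toy <- data.frame(
  State = c("Alabama", "Alaska", "Alabama", "Alaska"),
  Year = c(2020L, 2020L, 2019L, 2019L),
  Annual_Salary_Avg = c(60230, 95270, NA, 90500),
  Location_Quotient = c(1.20, 0.98, NA, NA)
)

# Which years are present, and how many rows does each year have?
year_range <- range(toy$Year)
rows_per_year <- table(toy$Year)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

print(year_range)
print(rows_per_year)
print(na_counts)
```

Applied to the real data, the same three calls would show at a glance whether a column such as Location_Quotient is missing for whole years rather than at random.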
summary(data)
## State Year Total_Employed_RN Employed_Standard_Error
## Length:1242 Min. :1998 Min. : 240 Min. : 0.70
## Class :character 1st Qu.:2003 1st Qu.: 12210 1st Qu.: 2.50
## Mode :character Median :2009 Median : 31160 Median : 3.50
## Mean :2009 Mean : 47704 Mean : 4.36
## 3rd Qu.:2015 3rd Qu.: 60230 3rd Qu.: 5.10
## Max. :2020 Max. :307060 Max. :26.10
## NA's :5 NA's :5
## Hourly_Wage_Avg Hourly_Wage_Median Annual_Salary_Avg Annual_Salary_Median
## Min. : 9.23 Min. : 8.64 Min. : 19190 Min. : 17970
## 1st Qu.:23.70 1st Qu.:23.08 1st Qu.: 49300 1st Qu.: 47995
## Median :28.25 Median :27.58 Median : 58750 Median : 57375
## Mean :28.48 Mean :27.86 Mean : 59248 Mean : 57958
## 3rd Qu.:32.39 3rd Qu.:31.73 3rd Qu.: 67378 3rd Qu.: 65988
## Max. :57.96 Max. :56.93 Max. :120560 Max. :118410
## NA's :6 NA's :6 NA's :6 NA's :6
## Wage_standard_error Hourly_10th_Percentile Hourly_25th_Percentile
## Min. :0.400 Min. : 6.38 Min. : 7.33
## 1st Qu.:0.900 1st Qu.:16.81 1st Qu.:19.47
## Median :1.100 Median :20.04 Median :23.24
## Mean :1.272 Mean :20.23 Mean :23.54
## 3rd Qu.:1.425 3rd Qu.:23.54 3rd Qu.:27.01
## Max. :7.500 Max. :36.62 Max. :45.18
## NA's :6 NA's :6 NA's :6
## Hourly_75th_Percentile Hourly_90th_Percentile Annual_10th_Percentile
## Min. :10.04 Min. :12.33 Min. :13260
## 1st Qu.:27.21 1st Qu.:32.51 1st Qu.:34958
## Median :32.61 Median :37.51 Median :41670
## Mean :32.92 Mean :38.16 Mean :42088
## 3rd Qu.:37.33 3rd Qu.:43.41 3rd Qu.:48955
## Max. :71.07 Max. :83.35 Max. :76180
## NA's :6 NA's :6 NA's :6
## Annual_25th_Percentile Annual_75th_Percentile Annual_90th_Percentile
## Min. :15260 Min. : 20890 Min. : 25650
## 1st Qu.:40488 1st Qu.: 56598 1st Qu.: 67620
## Median :48335 Median : 67835 Median : 78015
## Mean :48969 Mean : 68465 Mean : 79367
## 3rd Qu.:56195 3rd Qu.: 77638 3rd Qu.: 90290
## Max. :93970 Max. :147830 Max. :173370
## NA's :6 NA's :6 NA's :6
## Location_Quotient Total_Employed_National_Aggregate
## Min. :0.32 Min. :124143490
## 1st Qu.:0.90 1st Qu.:129059020
## Median :1.01 Median :131713800
## Mean :1.01 Mean :134075564
## 3rd Qu.:1.13 3rd Qu.:138885360
## Max. :1.50 Max. :147838700
## NA's :649 NA's :4
## Total_Employed_Healthcare_National_Aggregate
## Min. :5854360
## 1st Qu.:6226540
## Median :7250140
## Mean :7268640
## 3rd Qu.:8076300
## Max. :8727310
## NA's :4
## Total_Employed._Healthcare_State_Aggregate Yearly_Total_Employed_Aggregate
## Min. : 110 Min. : 110
## 1st Qu.: 33448 1st Qu.: 596520
## Median : 87435 Median : 1557110
## Mean :134743 Mean : 2387209
## 3rd Qu.:175293 3rd Qu.: 2888682
## Max. :844930 Max. :17382400
## NA's :2
Employment of RNs:
Employment Variability:
Wage Distribution:
State-level average hourly wages (Hourly_Wage_Avg) range from $9.23 to $57.96 with a median of $28.25, and the median of the state-level median annual salaries is $57,375. Because the data spans 1998 to 2020, part of this range reflects wage growth over time; the rest may be attributable to regional cost-of-living differences, state-specific healthcare demand, and varying levels of experience and specialization among nurses.
The percentile columns show a wide spread: the lowest 10th-percentile annual wage in the data is $13,260, while the highest 90th-percentile annual wage is $173,370. These two extremes come from different state-year observations, so they describe the range of the data rather than the wage progression of a typical nurse.
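Since the summary pools every year together, one way to separate change over time from variation between states is to summarise wages within each year. A minimal sketch, assuming a hypothetical data frame `toy` with invented values and the same column names as the dataset:

```r
# Hypothetical data: two states observed in two years (values invented).
toy <- data.frame(
  Year = c(1998L, 1998L, 2020L, 2020L),
  State = c("A", "B", "A", "B"),
  Hourly_Wage_Avg = c(18.5, 22.0, 30.0, 45.0)
)

# Minimum, median and maximum of the state averages within each year.
# aggregate() stores the three statistics as a matrix column.
by_year <- aggregate(
  Hourly_Wage_Avg ~ Year,
  data = toy,
  FUN = function(x) c(min = min(x), median = median(x), max = max(x))
)

print(by_year)
```

If the within-year ranges turn out much narrower than the pooled range, most of the overall spread is a time effect rather than a difference between states.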
Location Quotient:
National and State Employment Aggregates:
Total_Employed_National_Aggregate ranges from about 124 million to 148 million over the years and averages around 134 million. It gives total national employment across all sectors, which provides context for the scale of healthcare employment.
Total_Employed_Healthcare_National_Aggregate and Total_Employed._Healthcare_State_Aggregate show healthcare employment at the national and state levels, respectively. Healthcare is a significant employment sector, with an average of around 7.3 million healthcare workers nationally.
Yearly Employment Trends:
Let’s explore the potential factors that could influence salary variations.
How do the average annual salaries of registered nurses vary across different states in the US, and what factors might influence these variations?
To find out, let’s visualize and analyze the average annual salaries across states:
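# Visualizing average annual salaries
# Note: `data` holds every year from 1998 to 2020, so stat = "identity" stacks all
# yearly values for a state into one bar; filter to Year == 2020 to match the title.
ggplot(data, aes(x = State, y = Annual_Salary_Avg)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(x = "State", y = "Average Annual Salary", title = "Average Annual Salaries of RNs by State (2020)")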
## Warning: Removed 6 rows containing missing values (`position_stack()`).
Average nurse salaries in the United States show notable geographic variation. The states with the highest average wages are Massachusetts, California, and Hawaii, perhaps as a result of a mix of factors. These states tend to have a higher cost of living than other places, so a higher salary is needed to maintain a comparable standard of living. They may also face a shortage of nurses, which would increase demand and raise wages.
The graph suggests a possible geographic pattern, with the Northeast and the West Coast having the highest earnings. Possible explanations include a greater concentration of specialized healthcare services in those areas or a higher overall cost of living.
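A bar chart of one value per state is easiest to read when the data is first restricted to a single year and the bars are sorted. The following is a rough sketch of that idea, not the plot produced above; `toy` is a hypothetical data frame whose 2019 values are invented:

```r
library(ggplot2)

# Hypothetical stand-in for the nurses data (2019 values are invented).
toy <- data.frame(
  State = c("Alabama", "Alaska", "Arizona", "Alabama", "Alaska", "Arizona"),
  Year = c(2020L, 2020L, 2020L, 2019L, 2019L, 2019L),
  Annual_Salary_Avg = c(60230, 95270, 80380, 59000, 90000, 78000)
)

# Keep one row per state by restricting to a single year and dropping NAs
salaries_2020 <- subset(toy, Year == 2020 & !is.na(Annual_Salary_Avg))

# geom_col() takes bar heights directly from the data;
# reorder() sorts the states by salary so rankings are easy to read.
p <- ggplot(salaries_2020,
            aes(x = reorder(State, Annual_Salary_Avg), y = Annual_Salary_Avg)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(x = "State", y = "Average Annual Salary",
       title = "Average Annual Salaries of RNs by State (2020)")

print(p)
```

Flipping the coordinates keeps long state names horizontal, which avoids the rotated axis labels.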
Exploring potential factors influencing the above salary variations, such as location quotient or total employment:
Potential influence of the location quotient:
# Visualizing location quotient
# Note: `data` spans 1998 to 2020, so each bar stacks every available year for a state.
ggplot(data, aes(x = State, y = Location_Quotient)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(x = "State", y = "Location Quotient", title = "Location Quotient of RN Employment by State (2020)")
## Warning: Removed 649 rows containing missing values (`position_stack()`).
From the above graph, states with higher average salaries appear to have somewhat lower location quotients. This inverse relationship between location quotient and average salary across states admits several possible interpretations:
Competition vs. Salary:
States with higher salaries may face less competition for healthcare talent, resulting in lower relative concentration (LQ).
Lower salary states may need higher concentrations of healthcare workers to meet demand despite less competitive pay.
Cost of Living and Opportunities:
Higher salary states often have higher living costs, necessitating higher pay to attract talent.
Lower salary states may offer fewer job opportunities or have less competitive markets, leading to higher LQs.
Economic Specialization:
States with higher salaries might have more diversified economies, making healthcare a smaller share of employment (lower LQ).
Lower salary states may rely heavily on healthcare, resulting in higher LQs.
Policy and Planning:
Understanding LQ vs. salary informs workforce planning and policy decisions.
Efforts to address disparities in healthcare access and workforce distribution are crucial.
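The interpretations above rest on what a location quotient measures. The sketch below assumes the usual definition (a state's share of employment in an occupation divided by the national share) and that the Location_Quotient column follows it; the three states and all numbers are hypothetical, and "national" here simply means the sum of the toy rows:

```r
# Hypothetical employment counts for three states (values invented).
toy <- data.frame(
  State = c("A", "B", "C"),
  rn_employed = c(20000, 5000, 75000),
  total_employed = c(1000000, 200000, 3800000)
)

# National RN share of employment: 100,000 / 5,000,000 = 0.02
national_rn_share <- sum(toy$rn_employed) / sum(toy$total_employed)

# Each state's RN share, and its ratio to the national share
toy$rn_share <- toy$rn_employed / toy$total_employed
toy$lq <- toy$rn_share / national_rn_share

print(toy)
```

An LQ above 1 (state B here) means nurses make up a larger share of that state's workforce than they do nationally; it says nothing directly about the absolute number of nurses.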
Potential influence of total employment:
# Visualizing total employment in healthcare
# Note: `data` spans 1998 to 2020, so each bar is the sum of a state's yearly values.
ggplot(data, aes(x = State, y = Total_Employed._Healthcare_State_Aggregate)) +
  geom_bar(stat = "identity", fill = "lightcoral") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(x = "State", y = "Total Employment in Healthcare", title = "Total Employment in Healthcare by State (2020)")
## Warning: Removed 2 rows containing missing values (`position_stack()`).
California, Florida, Texas, and New Jersey have the tallest bars in the graph, which indicates that these states likely have the highest levels of total healthcare employment. Several conclusions can be drawn from this observation:
Population Size and Healthcare Demand:
Economic and Healthcare Infrastructure:
These states may have robust healthcare infrastructures, including hospitals, clinics, and medical centers, to meet the healthcare needs of their populations.
Investments in healthcare facilities, research institutions, and medical schools may contribute to a larger healthcare workforce in these states.
Diverse Healthcare Sector:
Urban Centers and Healthcare Hubs:
Policy and Regulation:
Consideration of Challenges:
Let’s conduct correlation analysis to investigate relationships between variables:
# Correlation matrix with handling missing and NA values
correlation_matrix <- cor(data[, c("Annual_Salary_Avg", "Location_Quotient", "Total_Employed._Healthcare_State_Aggregate")], use = "pairwise.complete.obs")
print(correlation_matrix)
## Annual_Salary_Avg Location_Quotient
## Annual_Salary_Avg 1.0000000 -0.12516077
## Location_Quotient -0.1251608 1.00000000
## Total_Employed._Healthcare_State_Aggregate 0.3088250 0.01929619
## Total_Employed._Healthcare_State_Aggregate
## Annual_Salary_Avg 0.30882498
## Location_Quotient 0.01929619
## Total_Employed._Healthcare_State_Aggregate 1.00000000
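Annual Salary and Location Quotient: r = -0.125, a weak negative correlation.
Annual Salary and Total Employment in Healthcare: r = 0.309, a weak-to-moderate positive correlation.
Location Quotient and Total Employment in Healthcare: r = 0.019, essentially no linear relationship.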
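A correlation matrix gives point estimates only. To judge whether a single coefficient is distinguishable from zero, `cor.test()` from base R reports a confidence interval and p-value. A minimal sketch on simulated data (the data frame `toy` and the numbers used to generate it are assumptions for illustration, not estimates from the nurses data):

```r
set.seed(1)

# Simulated data with a weak negative relationship (all values invented).
n <- 50
toy <- data.frame(Location_Quotient = runif(n, min = 0.5, max = 1.5))
toy$Annual_Salary_Avg <- 70000 - 8000 * toy$Location_Quotient +
  rnorm(n, sd = 10000)

# Pearson correlation test: estimate, 95% confidence interval, p-value
ct <- cor.test(toy$Annual_Salary_Avg, toy$Location_Quotient)

print(ct$estimate)
print(ct$conf.int)
print(ct$p.value)
```

On the real data, the same call on a pair of columns drops incomplete pairs automatically, much like `use = "pairwise.complete.obs"` does in `cor()`.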
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
# Plot the correlation matrix
corrplot(correlation_matrix, method = "color", type = "upper", order = "hclust", addrect = 2)
We can use hypothesis testing to determine whether average annual salaries differ significantly between states.
We consider only two states, California and Texas, because we want to compare the average salaries of these two specific states. The test below compares each state's yearly Annual_Salary_Avg values across all years in the dataset. This comparison can provide insight into differences in economic conditions, cost of living, or other factors that might affect salary levels between the two states.
Limiting the comparison to just two states simplifies the analysis and allows us to focus on a specific question or hypothesis. Additionally, comparing more than two states at once could make the analysis more complex and less interpretable, especially if there are multiple factors influencing salary levels across different regions.
By focusing on California and Texas, we can conduct a direct comparison and determine if there is a statistically significant difference in average salaries between these two states. If there is a significant difference, it could have implications for various stakeholders, such as policymakers, employers, or individuals considering relocation.
# Example of t-test comparing average salaries between two states (e.g., California and Texas)
t_test_result <- t.test(data[data$State == "California", "Annual_Salary_Avg"],
data[data$State == "Texas", "Annual_Salary_Avg"])
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: data[data$State == "California", "Annual_Salary_Avg"] and data[data$State == "Texas", "Annual_Salary_Avg"]
## t = 4.6611, df = 34.141, p-value = 4.666e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12795.42 32573.27
## sample estimates:
## mean of x mean of y
## 83305.65 60621.30
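The Welch test above treats the two states' yearly values as independent samples. Because both states are observed in the same years, a paired test on the year-by-year differences is a possible alternative that removes the shared time trend. This is only a sketch of that alternative, using a hypothetical data frame `toy` with invented salaries:

```r
# Hypothetical yearly averages for two states (all values invented).
toy <- data.frame(
  State = rep(c("California", "Texas"), each = 4),
  Year = rep(2017:2020, times = 2),
  Annual_Salary_Avg = c(100000, 104000, 110000, 118000,
                        70000, 72000, 75000, 77000)
)

# Put the two states side by side, matched on Year
ca <- toy[toy$State == "California", c("Year", "Annual_Salary_Avg")]
tx <- toy[toy$State == "Texas", c("Year", "Annual_Salary_Avg")]
paired <- merge(ca, tx, by = "Year", suffixes = c("_CA", "_TX"))

# Paired t-test on the within-year differences
paired_test <- t.test(paired$Annual_Salary_Avg_CA,
                      paired$Annual_Salary_Avg_TX,
                      paired = TRUE)

print(paired_test)
```

Matching on Year also drops any year in which one of the two states has no row, so both samples always have the same length.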
# Illustrative example with made-up salaries (not taken from nurses.csv)
California_salary <- c(50000, 55000, 60000, 65000, 70000) # Sample salaries for California
Texas_salary <- c(48000, 52000, 58000, 62000, 67000) # Sample salaries for Texas
# Create a new data frame
newdata <- data.frame(State = c(rep("California", length(California_salary)), rep("Texas", length(Texas_salary))),
                      Annual_Salary_Avg = c(California_salary, Texas_salary))
# Welch t-test on the example data (this overwrites the earlier t_test_result)
t_test_result <- t.test(newdata[newdata$State == "California", "Annual_Salary_Avg"],
                        newdata[newdata$State == "Texas", "Annual_Salary_Avg"])
# Visualization; annotate() draws the p-value label once instead of once per row
ggplot(newdata, aes(x = State, y = Annual_Salary_Avg, fill = State)) +
  geom_boxplot() +
  annotate("text", x = 1.5, y = max(newdata$Annual_Salary_Avg), vjust = -1,
           label = paste("p-value =", signif(t_test_result$p.value, digits = 3))) +
  labs(title = "Comparison of Average Salaries between California and Texas",
       y = "Annual Salary Average",
       fill = "State") +
  theme_minimal()
Median Salary: The horizontal line within each box represents the median salary. In the example data, the median is $60,000 for California and $58,000 for Texas.
Salary Range: The boxes represent the interquartile range (IQR), which contains the middle 50% of the salaries for each state. California’s box runs from $55,000 to $65,000 and Texas’s from $52,000 to $62,000, so the two boxes have the same width, with Texas shifted slightly lower.
Variability: The whiskers extend from the box to the most extreme values within 1.5 times the IQR. Here they reach from $50,000 to $70,000 for California and from $48,000 to $67,000 for Texas, with no outliers.
Statistical Significance: The graph notes a p-value = 0.61. A p-value greater than 0.05 generally indicates that the observed difference is not statistically significant, so the difference between the two example samples is not meaningful from a statistical perspective.
This interpretation applies only to the made-up example values used for the boxplot. The Welch t-test on the actual dataset, reported earlier, gives a mean of $83,306 for California versus $60,621 for Texas with p = 4.666e-05, which is a statistically significant difference.
Building a regression model to predict variation in annual salaries based on relevant factors:
# Build regression model
model <- lm(Annual_Salary_Avg ~ Location_Quotient + Total_Employed._Healthcare_State_Aggregate, data = data)
summary(model)
##
## Call:
## lm(formula = Annual_Salary_Avg ~ Location_Quotient + Total_Employed._Healthcare_State_Aggregate,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34850 -7325 -2124 6963 39014
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.252e+04 2.537e+03 28.581
## Location_Quotient -8.219e+03 2.437e+03 -3.372
## Total_Employed._Healthcare_State_Aggregate 2.467e-02 2.880e-03 8.568
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Location_Quotient 0.000795 ***
## Total_Employed._Healthcare_State_Aggregate < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11150 on 589 degrees of freedom
## (650 observations deleted due to missingness)
## Multiple R-squared: 0.1248, Adjusted R-squared: 0.1218
## F-statistic: 41.98 on 2 and 589 DF, p-value: < 2.2e-16
Coefficients:
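The intercept term indicates the estimated average annual salary when all predictor variables are zero. In this case, it’s $72,520. A Location Quotient of zero lies outside the observed range (0.32 to 1.50), so the intercept is an extrapolation rather than a meaningful salary.
The coefficient for Location Quotient is -8,219, indicating that each one-unit increase in Location Quotient is associated with a decrease of $8,219 in average annual salary, holding other variables constant. Because the Location Quotient spans only about one unit in this data, a 0.1 increase (about $822 lower) is a more practical scale.
The coefficient for Total Employed in Healthcare State Aggregate is 0.02467. It suggests that each additional person employed in healthcare at the state level is associated with an increase of $0.02467 in average annual salary, holding other variables constant, or about $247 per 10,000 additional healthcare workers.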
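To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the three estimates from the summary output; the state it evaluates (a Location Quotient of 1.0 and 100,000 healthcare workers) is a hypothetical example, not a row from the data:

```r
# Coefficient estimates copied from the summary(model) output
b0 <- 7.252e+04      # intercept
b_lq <- -8.219e+03   # Location_Quotient
b_emp <- 2.467e-02   # Total_Employed._Healthcare_State_Aggregate

# Hypothetical state: LQ of 1.0 and 100,000 healthcare workers
lq <- 1.0
healthcare_employment <- 100000

# Predicted average annual salary: 72,520 - 8,219 + 2,467
predicted_salary <- b0 + b_lq * lq + b_emp * healthcare_employment
print(predicted_salary)

# Change in predicted salary for 10,000 additional healthcare workers
effect_10k <- b_emp * 10000
print(effect_10k)
```

With the fitted model object available, `predict(model, newdata = ...)` would give the same kind of figure without the rounding in the printed coefficients.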
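Significance: Both predictors have p-values below 0.001 (0.000795 for Location Quotient and < 2e-16 for Total Employed in Healthcare State Aggregate), so both are statistically significant at conventional levels.
Residuals: The residuals range from -34,850 to 39,014 with a median of -2,124, so individual predictions can be off by tens of thousands of dollars.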
Model Fit:
The adjusted R-squared value, which measures the proportion of variance in the dependent variable explained by the model after adjusting for the number of predictors, is 0.1218. Around 12% of the variability in average annual salary is explained by the Location Quotient and Total Employed in Healthcare State Aggregate variables, so most of the variation is left unexplained.
The F-statistic tests the overall significance of the model. With a p-value of less than 0.05, the model is statistically significant.
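The fit statistics in the summary are linked by simple formulas, which can be checked from the reported numbers. A short sketch, using only values printed in the output (R-squared of 0.1248, 589 residual degrees of freedom, two predictors):

```r
# Values copied from the summary(model) output
r_squared <- 0.1248
residual_df <- 589
p <- 2                      # number of predictors
n <- residual_df + p + 1    # rows actually used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Overall F-statistic expressed through R-squared
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

print(n)
print(round(adj_r_squared, 4))
print(round(f_stat, 2))
```

The results agree with the reported adjusted R-squared and F-statistic up to the rounding of the printed R-squared.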
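Residual Standard Error: The residual standard error is 11,150 on 589 degrees of freedom. Because 650 observations were dropped for missing values, the model is fit on 592 of the 1,242 rows.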
# Load necessary libraries
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Diagnostic plots for regression model
# Plotting residuals vs. fitted values
residuals_vs_fitted <- ggplot(model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs. Fitted Values")
# Plotting normal Q-Q plot of residuals
qq_plot <- ggplot(model, aes(sample = .stdresid)) +
geom_qq() +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(title = "Normal Q-Q Plot of Residuals")
# Plotting scale-location plot
scale_location_plot <- ggplot(model, aes(x = .fitted, y = sqrt(abs(.stdresid)))) +
geom_point() +
geom_smooth(se = FALSE, method = "loess", color = "red") +
labs(x = "Fitted Values", y = "√|Standardized Residuals|", title = "Scale-Location Plot")
# Plotting residuals vs. leverage plot
# (.hat holds the leverage values, equivalent to hatvalues(model))
residuals_vs_leverage <- ggplot(model, aes(x = .hat, y = .stdresid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Leverage", y = "Standardized Residuals", title = "Residuals vs. Leverage")
# Arrange plots in a grid
grid.arrange(residuals_vs_fitted, qq_plot, scale_location_plot, residuals_vs_leverage, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
Residuals vs. Fitted Values:
Purpose: This plot is used to check the homoscedasticity assumption — that the residuals have constant variance across all levels of fitted values.
Interpretation: Ideally, the points should be randomly dispersed around the horizontal line (zero residual). If the spread of the residuals widens or narrows as the fitted values increase, it suggests non-constant variance (heteroscedasticity). In this plot, there seems to be a pattern, with residuals clustering and then spreading out as fitted values increase, indicating potential heteroscedasticity.
Normal Q-Q Plot of Residuals:
Purpose: This plot checks whether the residuals are approximately normally distributed, a key assumption of many regression models.
Interpretation: The points should ideally lie on the dashed line. Deviations from this line suggest deviations from normality. In this plot, the slight deviations at both tails might indicate some departure from normality, suggesting the presence of outliers or long tails in the distribution of residuals.
Scale-Location Plot (Spread-Location Plot):
Purpose: Also used to assess homoscedasticity. It shows how the spread of the standardized residuals changes across the range of fitted values.
Interpretation: The red line (a smoothed average) should be roughly horizontal if the residuals are spread equally across the fitted values. In this plot, the red line curves, suggesting that the variance of the residuals changes with the fitted values (heteroscedasticity).
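A visual impression of heteroscedasticity can be backed by a formal check. The sketch below hand-computes a studentized Breusch-Pagan style statistic (n times the R-squared of a regression of squared residuals on the predictors) in base R. It uses simulated data in which the error spread grows with the predictor; the variables `x`, `y` and `fit` are hypothetical and unrelated to the salary model:

```r
set.seed(42)

# Simulated data whose error spread grows with x (values invented).
n <- 200
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ x)

# Test statistic and p-value; df = number of predictors in the auxiliary model
bp_stat <- n * summary(aux)$r.squared
bp_df <- 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(bp_stat)
print(bp_p)
```

A small p-value points to non-constant variance, in which case the usual standard errors of the regression coefficients should be treated with caution.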
Residuals vs. Leverage Plot:
Purpose: This plot helps to identify influential cases (outliers that might affect the regression line more than others).
Interpretation: Points with high leverage can have a large impact on the model’s fit, especially when they also have large standardized residuals. This version of the plot draws only a dashed reference line at zero, not Cook’s distance contours. The plot shows several points with higher leverage, and a few of them also have large residuals, which might be a cause for concern in terms of their influence on the model.
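Influence can also be quantified directly with Cook's distance, which combines leverage and residual size. A minimal sketch using base R on simulated data; the cutoff of 4/n is a common rule of thumb rather than a fixed standard, and `fit` is a hypothetical model, not the salary regression:

```r
set.seed(7)

# Simulated data with one deliberately unusual point (values invented).
n <- 30
x <- c(runif(n - 1, min = 0, max = 10), 30)   # last x is far from the rest
y <- 1 + 2 * x + rnorm(n)
y[n] <- y[n] + 25                             # and it sits off the trend
fit <- lm(y ~ x)

# Leverage, standardized residuals and Cook's distance for every row
influence_tbl <- data.frame(
  leverage = hatvalues(fit),
  std_resid = rstandard(fit),
  cooks_d = cooks.distance(fit)
)

# Flag rows whose Cook's distance exceeds the rule-of-thumb cutoff
threshold <- 4 / n
flagged <- which(influence_tbl$cooks_d > threshold)

print(influence_tbl[flagged, ])
```

Flagged rows are candidates for closer inspection, not automatic removal; refitting with and without them shows how much the coefficients depend on them.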
The model indicates that both Location Quotient and Total Employed in Healthcare State Aggregate are statistically significant predictors of average annual salary, although together they explain only about 12% of its variation.
Variation in Average Annual Salaries:
Influential Factors:
Regression Modeling Insights:
Further Analysis:
To gain a comprehensive understanding of salary variations among registered nurses, further analysis could explore additional factors that may impact wages. For example, factors such as cost of living, nurse-to-patient ratios, presence of labor unions, and state-level policies regarding minimum wage and overtime regulations could be considered. Analyzing these factors alongside location quotient and total employment in healthcare could provide a more nuanced understanding of salary determinants.
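One low-cost extension along these lines is to add Year as a predictor, since the data spans many years and nominal wages tend to rise over time. This is a sketch under stated assumptions, not a result from the nurses data: the data frame `toy` is simulated, and the trend and effect sizes used to generate it are invented.

```r
set.seed(3)

# Simulated state-year panel (all values invented for illustration).
toy <- expand.grid(State = c("A", "B", "C", "D"), Year = 2010:2020)
toy$Location_Quotient <- runif(nrow(toy), min = 0.7, max = 1.3)
toy$Annual_Salary_Avg <- 50000 + 1500 * (toy$Year - 2010) -
  5000 * toy$Location_Quotient + rnorm(nrow(toy), sd = 2000)

# Model without and with Year as an additional predictor
base_fit <- lm(Annual_Salary_Avg ~ Location_Quotient, data = toy)
year_fit <- lm(Annual_Salary_Avg ~ Location_Quotient + Year, data = toy)

# F-test comparing the nested models, and the gain in adjusted R-squared
print(anova(base_fit, year_fit))
print(summary(base_fit)$adj.r.squared)
print(summary(year_fit)$adj.r.squared)
```

The same pattern extends to other candidate factors such as cost of living, provided they can be joined to the data by state and year.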