Week 7

data <- read.csv("C:\\Users\\91814\\Desktop\\Statistics\\nurses.csv")

Hypothesis 1

Null Hypothesis H0 : There is no significant difference in the average annual salary of registered nurses between California and New York in 2020.

Alternative Hypothesis H1: There is a significant difference between the average annual salary of registered nurses of California and New York in 2020.

Alpha Level: 0.05

Reason: An alpha level of 0.05 balances the risk of a Type I error (rejecting the null hypothesis when it is actually true) against the need for a conclusive outcome. This matters for policy makers and employment standards agencies that might use the analysis to evaluate and compare labor market conditions, because it limits to 5% the chance of declaring a salary difference between the states when none exists.

Level of Power: 0.80

Reason: A power level of 0.80 means an 80% chance of correctly rejecting the null hypothesis when it is false. This limits the risk of a Type II error (failing to detect a true difference when one exists) to 20%, which is particularly important in salary comparisons that could influence employment policies, negotiations, or funding allocations based on perceived regional disparities.

Minimum Effect Size: 0.3

Reason: A minimum effect size of 0.3 (Cohen’s d) lies between Cohen’s conventional benchmarks for a small effect (0.2) and a medium effect (0.5), so it targets a small-to-medium difference that is still meaningful from a practical standpoint. In salary comparisons, a difference of this size is non-trivial and would warrant attention from stakeholders such as healthcare institutions and policy makers. Setting this threshold keeps the focus on differences with practical significance rather than on findings that are statistically significant but practically negligible.
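
The three design choices above (alpha, power, and minimum effect size) jointly determine how many observations per state would be needed. A minimal sketch using base R's `power.t.test()`, assuming a two-sided two-sample t-test with equal group sizes and the effect expressed in standard-deviation units (so `sd = 1` and `delta` equals Cohen's d). The variable names are illustrative and are not part of the original analysis:

```r
# Required sample size per group for a two-sided two-sample t-test.
# Assumption: the effect is expressed as Cohen's d, so sd is fixed at 1.
alpha <- 0.05   # Type I error rate
power <- 0.80   # 1 - Type II error rate
min_d <- 0.3    # minimum effect size of interest (Cohen's d)

design <- power.t.test(delta = min_d, sd = 1,
                       sig.level = alpha, power = power,
                       type = "two.sample", alternative = "two.sided")

# Round up, because a fraction of an observation cannot be collected
n_per_group <- ceiling(design$n)
cat("Observations needed per state:", n_per_group, "\n")
```

If either state contributes far fewer rows than this, the test would be underpowered for an effect of d = 0.3, and a non-significant result would be hard to interpret.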

Performing a Neyman-Pearson hypothesis test and a Fisher’s style test for significance on the same hypothesis

california_salary <- data[data$State == "California", "Annual_Salary_Avg"]
new_york_salary <- data[data$State == "New York", "Annual_Salary_Avg"]

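# Perform two-sample t-test (t.test() defaults to Welch's test, which does not assume equal variances)
t_test_result <- t.test(california_salary, new_york_salary)

# Neyman-Pearson decision rule: reject H0 when |t| exceeds the two-sided critical value
alpha <- 0.05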
if (abs(t_test_result$statistic) > qt(1 - alpha/2, df = t_test_result$parameter)) {
  cat("Reject the null hypothesis at the", alpha, "level of significance.\n")
} else {
  cat("Fail to reject the null hypothesis at the", alpha, "level of significance.\n")
}

## Reject the null hypothesis at the 0.05 level of significance.

cat("P-value:", t_test_result$p.value, "\n")

## P-value: 0.01061057

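# Test labelled "Fisher's style" in this report: var.test() is an F test for
# equality of variances, so it compares the two variances, not the two means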
fisher_result <- var.test(california_salary, new_york_salary)

# Interpret the p-value
cat("Fisher's style test p-value:", fisher_result$p.value, "\n")

## Fisher's style test p-value: 0.02427058

Based on the statistical analysis conducted, we can infer the following insights:

The two-sample t-test found a statistically significant difference in the average annual salary of registered nurses (RNs) between California and New York in 2020, with a p-value of 0.0106. The second test, labelled Fisher’s style above, is R’s var.test(), an F test for equality of variances. Its p-value of 0.0243 indicates that the salary variances of the two states differ; it does not by itself add evidence about the difference in means, but it does support the use of the Welch t-test, which t.test() applies by default and which does not assume equal variances. Under Fisher’s approach, the t-test p-value of 0.0106 is itself read as the measure of evidence against the null hypothesis.
The null hypothesis was rejected at the 0.05 level of significance, which is evidence that average annual salaries for registered nurses in California and New York differ. This finding is relevant to registered nurses, medical facilities, legislators, and other interested parties. It underlines the importance of taking regional salary differences into account in decisions about workforce planning, remuneration, and healthcare policy.
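
The relationship between the two testing styles can be sketched as follows. The Neyman-Pearson rule makes a fixed-level decision from the critical value, while Fisher's approach reads the p-value as graded evidence; for the same two-sided t-test, comparing the p-value with alpha reproduces the critical-value decision. The data below are simulated for illustration only and are not values from nurses.csv, and the names `group_a` and `group_b` are hypothetical:

```r
# Illustrative data only: simulated salaries, not values from nurses.csv
set.seed(1)
group_a <- rnorm(25, mean = 100000, sd = 12000)
group_b <- rnorm(25, mean = 90000,  sd = 6000)

alpha <- 0.05
res <- t.test(group_a, group_b)   # Welch test (var.equal = FALSE by default)

# Neyman-Pearson: fixed-level decision from the two-sided critical value
reject_by_critical <- abs(res$statistic) > qt(1 - alpha / 2, df = res$parameter)

# Fisher: the p-value is a graded measure of evidence against H0;
# thresholding it at alpha gives the same accept/reject decision
reject_by_pvalue <- res$p.value < alpha

cat("Decision via critical value:", unname(reject_by_critical), "\n")
cat("Decision via p-value:       ", reject_by_pvalue, "\n")
cat("p-value as evidence:        ", signif(res$p.value, 3), "\n")
```

Because both decisions come from the same t statistic, they always agree; what differs is the interpretation, not the arithmetic.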

Although the data showed a notable difference in the average annual salary between California and New York, more research is needed to determine the underlying causes of this difference. Possible questions for further investigation include:

Which factors contribute to the difference in RN pay between New York and California?
Do the dynamics of supply and demand, the cost of living, or healthcare policies differ between the two states?
What effects do education, specialization, and experience level have on RN earnings in these states?
Are there regional differences within California and New York that require additional investigation?
What effects do these pay disparities have on hiring, retention, and job satisfaction among registered nurses?
How can healthcare companies and legislators address disparities in RN pay to ensure equitable compensation and workforce stability?

Visualization to illustrate the results of hypothesis 1

library(ggplot2)

# Subset data for California and New York
ca_ny_data <- data[data$State %in% c("California", "New York"), ]

# Create a boxplot
ggplot(ca_ny_data, aes(x = factor(State), y = Annual_Salary_Avg, fill = State)) +
  geom_boxplot(width = 0.6) +
  labs(title = "Comparison of Average Annual Salaries between California and New York (2020)",
       x = "State", y = "Average Annual Salary") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The boxplots above compare the average annual salaries of registered nurses (RNs) in California and New York for 2020. Together with the two-sample t-test, they support the following observations:

According to the two-sample t-test, the average annual salary for registered nurses in 2020 differed significantly between California and New York (p = 0.0106). The boxplots show how the two salary distributions compare.
Although the statistical test found a significant difference, the practical implications of this finding also matter. The size of the average annual salary gap between the two states should be assessed to judge whether it has any bearing on the pay and welfare of registered nurses.
More research may be necessary to identify the causes of the disparity.
Overall, the statistical analysis showed a significant difference in the average annual salary for registered nurses in 2020 between California and New York, but further work is needed to understand its causes and its wider implications for healthcare workforce planning and policy.
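
To assess the size of the gap rather than only its statistical significance, the observed effect can be expressed as Cohen's d and compared with the minimum effect size of 0.3 set earlier. A minimal sketch, assuming the pooled-standard-deviation form of Cohen's d; `cohens_d` is a hypothetical helper written for this illustration, not a base R function, and the toy vectors are not values from nurses.csv:

```r
# Cohen's d with a pooled standard deviation (hypothetical helper)
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled SD weights each group's variance by its degrees of freedom
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy vectors for illustration only
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)
d <- cohens_d(x, y)
cat("Cohen's d:", round(d, 3), "\n")

# With the report's data this would be called as (not run here):
# cohens_d(california_salary, new_york_salary)
```

Since the variance test above suggests unequal variances, the pooled standard deviation is only one possible standardizer; using the standard deviation of one reference group is a common alternative.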

Hypothesis 2

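Null Hypothesis H0: The hourly wage distribution of registered nurses in Texas follows a normal distribution in 2020.

Alternative Hypothesis H1: The hourly wage distribution of registered nurses in Texas does not follow a normal distribution in 2020.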

Alpha Level: 0.05

Reason: An alpha level of 0.05 is chosen to control Type I errors, which is important when assessing the normality of the wage distribution. A Type I error in this context would mean incorrectly concluding that the wage distribution is not normal when it actually is, which could mislead policy interventions aimed at wage standardization or adjustments. Using this conventional alpha level supports conclusions that can inform economic planning and regulatory actions within the healthcare sector in Texas.

Power Level: 0.80

Reason: Setting the power level at 0.80 is significant for effectively detecting deviations from a normal distribution, if present. This high level of power reduces the risk of Type II errors (failing to detect non-normality when it actually exists). In practical terms, this is crucial because recognizing non-normality in wage distributions can prompt further investigation into inequalities or anomalies in wage structures, thereby guiding more equitable policy decisions. Maintaining a high power ensures that if significant wage distribution issues exist, they are likely to be detected and addressed.

Minimum Effect Size: 0.2

Reason: A minimum effect size of 0.2 (a small effect by Cohen’s conventions) is chosen to detect even modest deviations from normality that are still practically significant. In wage analysis, even small deviations can indicate issues such as wage compression or unexpected skewness, which are important for stakeholders such as labor unions, hospital administration, and governmental agencies overseeing healthcare and labor regulations. This sensitivity to smaller effects helps ensure that meaningful irregularities are not overlooked, so that interventions needed to maintain fairness and competitiveness in the labor market can be considered.

Performing a Neyman-Pearson hypothesis test and a Fisher’s style test for significance on the same hypothesis

# Assuming 'data' is your data frame
texas_hourly_wage <- data[data$State == "Texas", "Hourly_Wage_Avg"]

# Neyman-Pearson decision rule using a one-sample t.test
# Note: with the default mu = 0 this tests H0: mean wage = 0, not normality
alpha <- 0.05
t_test_result_hourly_wage <- t.test(texas_hourly_wage)

# Interpret the results
if (abs(t_test_result_hourly_wage$statistic) > qt(1 - alpha/2, df = t_test_result_hourly_wage$parameter)) {
  cat("Reject the null hypothesis (t.test for Hourly Wage) at the", alpha, "level of significance.\n")
} else {
  cat("Fail to reject the null hypothesis (t.test for Hourly Wage) at the", alpha, "level of significance.\n")
}

## Reject the null hypothesis (t.test for Hourly Wage) at the 0.05 level of significance.

cat("P-value (t.test for Hourly Wage):", t_test_result_hourly_wage$p.value, "\n")

## P-value (t.test for Hourly Wage): 5.661245e-18

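# Fisher's style test for significance (using simulated data)
# Note: chisq.test() coerces two numeric vectors to factors and tests independence
# in their contingency table; it is not a goodness-of-fit test for normality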
simulated_data <- rnorm(length(texas_hourly_wage), mean = mean(texas_hourly_wage), sd = sd(texas_hourly_wage))
chisq_result_hourly_wage <- chisq.test(texas_hourly_wage, simulated_data)

## Warning in chisq.test(texas_hourly_wage, simulated_data): Chi-squared
## approximation may be incorrect

# Interpret the p-value
cat("Fisher's style test p-value (Hourly Wage):", chisq_result_hourly_wage$p.value, "\n")

## Fisher's style test p-value (Hourly Wage): 0.2363731

Based on the above statistical tests for hourly wage in Texas:

The one-sample t-test for hourly wage in Texas yielded a very low p-value of approximately $5.66 \times 10^{-18}$, so its null hypothesis is rejected at the 0.05 significance level. Because t.test() was called without a mu argument, that null hypothesis is that the mean hourly wage equals zero. The result therefore shows only that the mean wage differs from zero and does not bear on whether the wage distribution is normal. The Fisher’s style test (chisq.test() applied to the observed wages and a simulated normal sample) produced a p-value of approximately 0.236, which gives no grounds to reject its null hypothesis.
The significant t-test result is expected, since hourly wages are positive and their mean cannot plausibly be zero. Understanding the wage landscape for registered nurses in Texas remains relevant to workforce planning, compensation strategies, and policy-making in the healthcare industry, but this t-test contributes little to it.
The two tests give different outcomes because they address different questions, and neither directly tests normality. chisq.test() converts the two numeric vectors to factors and tests independence in the resulting contingency table, and the warning shows that its chi-squared approximation is unreliable here. A test designed for normality, such as the Shapiro-Wilk test, would be needed to assess the stated hypothesis.
Additionally, exploring the factors that shape hourly wages, such as regional variations, cost of living, demand-supply dynamics, and healthcare policies, would provide valuable insights into the wage dynamics for registered nurses in Texas.
In summary, the t-test shows that the mean hourly wage in Texas differs from zero, and the Fisher’s style test is inconclusive, so neither result settles whether hourly wages are normally distributed. Further investigation with a dedicated normality test and consideration of additional factors are necessary to provide a comprehensive understanding of the wage landscape for registered nurses in Texas and its implications.
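
A test aimed directly at the normality hypothesis can be sketched with base R's `shapiro.test()` (the Shapiro-Wilk test), whose null hypothesis is that the sample comes from a normal distribution. This is offered as one possible approach, not as the method used in the analysis above. The data are simulated for illustration only, and `report` is a hypothetical helper:

```r
# Illustrative data only: simulated wages, not values from nurses.csv
set.seed(42)
wages_normal <- rnorm(200, mean = 35, sd = 5)   # roughly bell-shaped
wages_skewed <- rexp(200, rate = 1 / 35)        # strongly right-skewed

alpha <- 0.05
sw_normal <- shapiro.test(wages_normal)
sw_skewed <- shapiro.test(wages_skewed)

# Hypothetical helper: print W and the fixed-level decision
report <- function(label, test) {
  decision <- if (test$p.value < alpha) "reject normality" else "fail to reject normality"
  cat(label, ": W =", round(test$statistic, 3), ", decision:", decision, "\n")
}
report("Normal sample", sw_normal)
report("Skewed sample", sw_skewed)

# With the report's data this would be called as (not run here):
# shapiro.test(texas_hourly_wage)
```

`shapiro.test()` accepts between 3 and 5000 non-missing values. A W statistic close to 1 is consistent with normality, while smaller values indicate departure from it.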

Visualization to illustrate the results of Hypothesis 2:

library(ggplot2)

# Creating a histogram for the Hourly Wage distribution in Texas
ggplot(data[data$State == "Texas", ], aes(x = Hourly_Wage_Avg)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Hourly Wage Distribution of Registered Nurses in Texas (2020)",
       x = "Hourly Wage", y = "Frequency") +
  theme_minimal()

The histogram illustrates the distribution of hourly wages for RNs in Texas for the year 2020 and shows the frequency of different wage levels. The t-test indicates that the average hourly wage in Texas differs significantly from the reference value of zero, which is the default when no mu argument is given.

The rejection of the null hypothesis in the t-test indicates only that the average hourly wage in Texas is not zero; it does not by itself reveal disparities or unusual features in the wage distribution. The histogram is more informative for the normality hypothesis, because it shows the shape and spread of the wage distribution directly.

The warning about the chi-squared approximation in the Fisher’s style test raises questions about the validity of that result. Further investigation is needed into the shape of the hourly wage distribution and the appropriateness of the statistical tests employed.

Exploring potential factors contributing to the wage distribution, such as regional variations, cost of living, demand-supply dynamics, and healthcare policies specific to Texas, would provide deeper insights into the wage landscape for RNs in the state.

Additionally, assessing the robustness of the findings through sensitivity analyses or alternative statistical approaches could help validate the observed differences in hourly wages in Texas.
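
One alternative approach of the kind mentioned above is a normal Q-Q plot, which compares sample quantiles with the quantiles expected under normality. A minimal sketch using base R's `qqnorm()` and `qqline()`, assuming simulated right-skewed wages for illustration only (not values from nurses.csv). The Q-Q correlation at the end is an informal summary, not a formal test:

```r
# Illustrative data only: simulated wages, not values from nurses.csv
set.seed(7)
wages <- rlnorm(150, meanlog = log(35), sdlog = 0.4)

# Points close to the reference line suggest approximate normality;
# systematic curvature suggests skewness or heavy tails
qq <- qqnorm(wages, main = "Normal Q-Q plot of simulated hourly wages")
qqline(wages, col = "red")

# Correlation between theoretical and sample quantiles as a rough summary
qq_cor <- cor(qq$x, qq$y)
cat("Q-Q correlation:", round(qq_cor, 3), "\n")
```

In ggplot2, `stat_qq()` and `stat_qq_line()` with a `sample` aesthetic produce the equivalent plot in the style of the other figures.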

In summary, the histogram visualization and statistical tests shed light on the hourly wage distribution for RNs in Texas, but the tests used here show only that the mean wage differs from zero and do not settle whether the distribution is normal.

Further investigation and consideration of additional factors are necessary to fully understand the implications of these findings and guide informed decision-making in healthcare workforce planning and policy development.