Research Question
How do monthly salaries vary across geographical regions, and is there a significant difference in median salaries between Northern America and other regions?
This analysis compares median monthly salaries in Northern America with those in the rest of the world. The primary objective is to determine whether the mean of the country-level median salaries differs significantly between the two groups. Specifically, it tests the hypothesis that Northern America has a higher average median salary than the rest of the world.
These are the libraries used in this analysis:
library(yaml)
library(RMySQL)
## Loading required package: DBI
library(infer)
library(ggplot2)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ stringr 1.5.0
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
The data for this project was downloaded from Kaggle. The code below reads it from a MySQL table, using connection settings stored in config.yaml:
config <- yaml::read_yaml("config.yaml")
con <- dbConnect(
  RMySQL::MySQL(),
  dbname = config$dbname,
  host = config$host,
  port = config$port,
  user = config$user,
  password = config$password
)
query <- "SELECT * FROM project3.salary_data"
world_salary <- dbGetQuery(con, query)
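Once the query result is in memory, the connection is no longer needed and can be released. The sketch below shows the pattern with `DBI::dbDisconnect()`. It is an illustration only: it assumes the RSQLite package as an in-memory stand-in for the MySQL server, and the table and object names (`salary_data`, `con_demo`, `demo_salary`) are hypothetical.

```r
library(DBI)

# Assumption: RSQLite stands in for the MySQL server so the sketch runs anywhere.
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical stand-in for the project3.salary_data table.
dbWriteTable(con_demo, "salary_data", data.frame(
  country_name  = c("A", "B"),
  median_salary = c(854, 3319)
))

# Pull the table into a data frame, as in the analysis.
demo_salary <- dbGetQuery(con_demo, "SELECT * FROM salary_data")

# Release the connection once the data is in memory.
dbDisconnect(con_demo)

# FALSE once the connection has been closed.
connection_open <- dbIsValid(con_demo)
```

In the analysis itself, the equivalent call would be `dbDisconnect(con)` after `world_salary` has been created.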
Looking at the first few rows of the dataset.
head(world_salary)
This snippet provides an overview of the dataset by showing the first few rows, including column names and sample data. Let’s examine the dataset’s structure to understand the data types and column names:
str(world_salary)
## 'data.frame': 221 obs. of 7 variables:
## $ country_name : chr "Afghanistan" "Aland Islands" "Albania" "Algeria" ...
## $ continent_name: chr "Asia" "Europe" "Europe" "Africa" ...
## $ wage_span : chr "Monthly" "Monthly" "Monthly" "Monthly" ...
## $ median_salary : num 854 3319 833 1149 1390 ...
## $ average_salary: num 1001 3858 957 1309 1570 ...
## $ lowest_salary : num 253 973 241 330 400 ...
## $ highest_salary: num 4461 17125 4258 5824 6980 ...
Data Tidying
Now let’s check for missing values to see whether any cleanup is necessary before the analysis:
# Check for missing values in the entire dataset
missing_values <- is.na(world_salary)
# Summarizing the number of missing values in each column
col_missing_count <- colSums(missing_values)
# Displaying the columns with missing values
colnames(world_salary)[col_missing_count > 0]
## character(0)
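The same check can be written in tidyverse style, which also makes it easy to list the affected rows. This is a minimal sketch on a small made-up tibble (`toy` is hypothetical, not the salary data), assuming dplyr 1.0.4 or later for `if_any()`.

```r
library(dplyr)

# Hypothetical toy data with two missing values.
toy <- tibble(
  country_name  = c("A", "B", "C"),
  median_salary = c(854, NA, 833),
  lowest_salary = c(253, 973, NA)
)

# Count the missing values in every column.
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keep only the rows that contain at least one missing value.
rows_with_na <- toy %>%
  filter(if_any(everything(), is.na))
```

On the real dataset every count would be zero, matching the `character(0)` result above.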
As we can see, there are no missing values in the dataset. There is, however, a problem with the column that names the regions, currently called continent_name: it includes places such as the Caribbean and distinguishes between “Northern America” and “North America”, so we will rename it geographical_region. Since we know that all salaries are monthly, we can also remove the wage_span column.
colnames(world_salary)[colnames(world_salary) == "continent_name"] <- "geographical_region"
world_salary <- world_salary %>% select(-wage_span)
The dataset lists both North America and Northern America, so we merge them into a single category.
world_salary <- world_salary %>%
  mutate(geographical_region = ifelse(geographical_region == "North America", "Northern America", geographical_region))
unique(world_salary$geographical_region)
## [1] "Asia" "Europe" "Africa" "Oceania"
## [5] "Caribbean" "South America" "Northern America" "Central America"
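The three tidying steps (renaming the column, dropping wage_span, merging the two labels) can also be chained into one pipeline. The sketch below uses a hypothetical three-row tibble in place of the real data, and uses `dplyr::if_else()`, which is stricter about types than `ifelse()`.

```r
library(dplyr)

# Hypothetical toy data mirroring the relevant columns.
toy <- tibble(
  continent_name = c("North America", "Northern America", "Asia"),
  wage_span      = "Monthly",
  median_salary  = c(3000, 4000, 900)
)

tidy <- toy %>%
  rename(geographical_region = continent_name) %>%  # clearer column name
  select(-wage_span) %>%                            # constant column, not needed
  mutate(geographical_region = if_else(
    geographical_region == "North America",
    "Northern America",
    geographical_region
  ))

regions <- unique(tidy$geographical_region)
```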
Looking at the summary statistics for the remaining columns
summary(world_salary[, c("median_salary", "average_salary", "lowest_salary", "highest_salary")])
## median_salary average_salary lowest_salary highest_salary
## Min. : 0.261 Min. : 0.286 Min. : 0.0721 Min. : 1.27
## 1st Qu.: 567.210 1st Qu.: 651.000 1st Qu.: 163.9300 1st Qu.: 2900.48
## Median :1227.460 Median : 1344.230 Median : 339.4500 Median : 5974.36
## Mean :1762.632 Mean : 1982.340 Mean : 502.7832 Mean : 8802.17
## 3rd Qu.:2389.010 3rd Qu.: 2740.000 3rd Qu.: 690.0000 3rd Qu.:12050.74
## Max. :9836.070 Max. :11292.900 Max. :2850.2700 Max. :50363.93
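From this summary, the first thing I noticed is that the lowest median salary in the dataset is about $0.26 a month, which is implausibly low and may point to a data-entry or currency-conversion problem (not verified here). The mean of the median_salary column is about $1,763 a month. Note that this is an unweighted average over the 221 rows of the dataset, not the average salary of the world’s workers.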
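A quick way to follow up on a suspicious minimum is to list the rows with the smallest values. The sketch below does this with `dplyr::slice_min()` on a hypothetical four-row tibble; the country labels and values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; "B" plays the role of the suspicious row.
toy <- tibble(
  country_name  = c("A", "B", "C", "D"),
  median_salary = c(854, 0.261, 833, 3319)
)

# The two rows with the lowest median salary, in ascending order.
lowest <- toy %>%
  slice_min(median_salary, n = 2)
```

Applied to `world_salary`, the same call would show which countries sit at the bottom of the range.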
ggplot(world_salary, aes(x = geographical_region, y = median_salary, fill = geographical_region)) +
  geom_boxplot(color = "darkblue", alpha = 0.7, outlier.color = "red") +
  labs(title = "Boxplot of Median Salary by Geographical Region",
       x = "Geographical Region",
       y = "Median Salary") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
Histogram to visualize the distribution of median salary
# after_stat(x) replaces the deprecated ..x.. notation (ggplot2 >= 3.4.0)
ggplot(world_salary, aes(x = median_salary, fill = after_stat(x))) +
  geom_histogram(binwidth = 500, color = "white", alpha = 0.7) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Histogram of Median Salary",
       x = "Median Salary",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(color = "darkblue"),
        axis.title = element_text(color = "darkblue"),
        plot.title = element_text(color = "darkblue", size = 16, face = "bold"))
Here we present our null and alternative hypotheses:
H0 (Null Hypothesis) - There is no difference between the average median salary of Northern America and that of the rest of the world.
Ha (Alternative Hypothesis) - Northern America has a higher average median salary than the rest of the world.
Significance Level - For our hypothesis test we will use a significance level of 0.05.
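Stated symbolically, the hypotheses could be written as follows. The symbols are assumptions introduced here for illustration: \mu_{\text{NA}} denotes the mean of the country-level median salaries in the Northern America group and \mu_{\text{other}} the corresponding mean for all other regions.

```latex
\begin{align*}
H_0 &: \mu_{\text{NA}} = \mu_{\text{other}} \\
H_a &: \mu_{\text{NA}} > \mu_{\text{other}} \\
\alpha &= 0.05
\end{align*}
```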
# Flag the comparison group. Note: as written, "Yes" also includes Central America.
world_salary <- world_salary %>%
  mutate(Northern_America = ifelse(geographical_region %in% c("Northern America","North America","Central America"), "Yes", "No"))
ggplot(world_salary, aes(x = median_salary, y = Northern_America, fill = Northern_America)) +
  geom_boxplot(color = "darkblue", alpha = 0.7, outlier.color = "red") +
  scale_fill_manual(values = c("lightblue", "lightgreen")) + # Custom fill colors
  labs(title = "Boxplot of Median Salary by Region",
       x = "Median Salary",
       y = "Region") +
  theme_minimal() +
  theme(axis.text = element_text(color = "darkblue"),
        axis.title = element_text(color = "darkblue"),
        plot.title = element_text(color = "darkblue", size = 16, face = "bold"))
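In the box plot, the “Yes” group shows the higher median salary and the wider spread. Note that, as coded above, this group includes Central America in addition to Northern America, so it is broader than the U.S. and Canada alone, and the wide spread reflects differences between the countries in the group rather than income inequality within any one of them.
Since we are going to compare the means of the two groups, we split the data accordingly and drop outliers, defined here as rows with a median salary of 5,000 or more. Note that these filtered subsets are used only for the normality checks below; Levene’s test and the t-test are run on the full world_salary data frame.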
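What a box plot shows about centre and spread can be backed up with a grouped summary. The sketch below computes the group size, median, interquartile range, and standard deviation per group on a hypothetical six-row tibble; the values are invented and are not results from the salary data.

```r
library(dplyr)

# Hypothetical toy data with the same two columns used in the plot.
toy <- tibble(
  Northern_America = c("Yes", "Yes", "Yes", "No", "No", "No"),
  median_salary    = c(1500, 3000, 6000, 500, 1000, 1500)
)

# One row per group: size, centre, and two measures of spread.
group_summary <- toy %>%
  group_by(Northern_America) %>%
  summarise(
    n      = n(),
    median = median(median_salary),
    iqr    = IQR(median_salary),
    sd     = sd(median_salary)
  )
```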
yes_group <- world_salary %>%
  filter(Northern_America == "Yes") %>%
  filter(median_salary < 5000)
no_group <- world_salary %>%
  filter(Northern_America == "No") %>%
  filter(median_salary < 5000)
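The 5,000 cutoff used here is a fixed threshold. A common data-driven alternative, which is not what this analysis uses, is the 1.5 × IQR rule that box plots apply when marking outliers. A minimal sketch on hypothetical data:

```r
library(dplyr)

# Hypothetical toy data with one extreme value.
toy <- tibble(
  country       = paste0("C", 1:6),
  median_salary = c(800, 900, 1000, 1100, 1200, 9000)
)

# First and third quartiles, and the 1.5 * IQR distance beyond them.
q     <- quantile(toy$median_salary, c(0.25, 0.75))
fence <- 1.5 * (q[[2]] - q[[1]])

# Keep only the rows that lie within the fences.
trimmed <- toy %>%
  filter(median_salary >= q[[1]] - fence,
         median_salary <= q[[2]] + fence)
```

Whichever rule is chosen, it should be applied to the same data that enters the hypothesis test so that the assumption checks and the test describe the same sample.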
Because we want to compare the average median salary of Northern America with that of the rest of the world and determine whether the former is significantly higher, we will conduct an independent-samples t-test on the means of the two groups.
Before conducting the test, we have to check its assumptions to see whether a t-test is valid in this case. First, we check for normality with Q-Q plots:
qqnorm(no_group$median_salary)
qqline(no_group$median_salary, col = "red")
qqnorm(yes_group$median_salary)
qqline(yes_group$median_salary, col = "red")
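Q-Q plots are judged by eye. A formal test such as Shapiro-Wilk (`stats::shapiro.test()`) can complement them, with the caveat that salary data are often right-skewed. The sketch below runs the test on simulated samples, not on the salary data; the object names are hypothetical.

```r
set.seed(1)

# Simulated samples: one roughly normal, one strongly right-skewed.
normal_sample <- rnorm(100, mean = 1500, sd = 300)
skewed_sample <- rlnorm(100, meanlog = 7, sdlog = 1)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal.
normal_result <- shapiro.test(normal_sample)
skewed_result <- shapiro.test(skewed_sample)

# A small p-value is evidence against normality.
skewed_p <- skewed_result$p.value
normal_p <- normal_result$p.value
```

For the groups in this analysis, the calls would be `shapiro.test(no_group$median_salary)` and `shapiro.test(yes_group$median_salary)`.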
The last condition determines which version of the test to run: we use Levene’s test to check for equal variances.
levene_test_result <- leveneTest(median_salary ~ Northern_America, data = world_salary)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
print(levene_test_result)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.0758 0.7834
## 219
Since the p-value (0.7834) is greater than 0.05, we fail to reject the hypothesis of equal variances; that is, the two groups’ variances are not significantly different.
We will now conduct a t-test to test our hypothesis:
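To see what Levene’s test measures, it helps to compute it by hand. With `center = median`, the test is a one-way ANOVA on each observation’s absolute deviation from its group median. The sketch below reproduces that calculation in base R on hypothetical data, so the numbers are illustrative only.

```r
# Hypothetical toy data: two groups of four observations.
toy <- data.frame(
  salary = c(800, 950, 1100, 1300, 2000, 2100, 2600, 3400),
  group  = factor(rep(c("No", "Yes"), each = 4))
)

# Absolute deviation of each observation from its own group's median.
group_medians <- ave(toy$salary, toy$group, FUN = median)
toy$abs_dev   <- abs(toy$salary - group_medians)

# One-way ANOVA on the deviations: its F test is the Levene statistic.
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
f_value <- levene_by_hand[["F value"]][1]
p_value <- levene_by_hand[["Pr(>F)"]][1]
```

A large F value means the typical deviation differs between groups, which is evidence of unequal variances.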
# Perform independent samples t-test assuming equal variances
t_test_equal_var <- t.test(median_salary ~ Northern_America, data = world_salary, var.equal = TRUE)
# Print the results
print(t_test_equal_var)
##
## Two Sample t-test
##
## data: median_salary by Northern_America
## t = -2.1577, df = 219, p-value = 0.03204
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -1646.14559 -74.48586
## sample estimates:
## mean in group No mean in group Yes
## 1692.561 2552.877
The p-value of the t-test (p-value = 0.03204) is less than the significance level of 0.05, so we reject the null hypothesis. Note that the test as run is two-sided; because the observed difference is in the hypothesized direction (a mean of 2552.877 for the “Yes” group versus 1692.561 for the “No” group), the corresponding one-sided p-value is half of that, about 0.016. The data therefore support the alternative hypothesis that Northern America (as grouped above, together with Central America) has a significantly higher average median salary than the rest of the world.
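The alternative hypothesis is directional, while the `t.test()` call above uses the default two-sided alternative. A one-sided version can be requested with the `alternative` argument. With the formula interface the difference is taken as first factor level minus second (here “No” minus “Yes”), so “Yes is higher” corresponds to `alternative = "less"`. The sketch below uses hypothetical data to show the relationship between the two p-values.

```r
# Hypothetical toy data with the same column names as the analysis.
toy <- data.frame(
  median_salary    = c(800, 950, 1100, 1300, 1250, 2000, 2100, 2600, 3400),
  Northern_America = factor(c(rep("No", 5), rep("Yes", 4)),
                            levels = c("No", "Yes"))
)

# Two-sided test, as in the analysis above.
two_sided <- t.test(median_salary ~ Northern_America, data = toy,
                    var.equal = TRUE)

# One-sided test of mean(No) - mean(Yes) < 0, i.e. "Yes" is higher.
one_sided <- t.test(median_salary ~ Northern_America, data = toy,
                    var.equal = TRUE, alternative = "less")

# When the t statistic is negative, the one-sided p-value is half
# the two-sided p-value.
halved <- two_sided$p.value / 2
```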