Introduction

Research Question

How do monthly salaries vary across geographical regions, and is there a significant difference in median salaries between Northern America and other regions?

This analysis compares median salaries in Northern America with those in the rest of the world. The primary objective is to determine whether the mean of the country-level median salaries differs significantly between the two groups. The analysis aims to provide insight into regional salary disparities, specifically by testing the hypothesis that Northern America has a higher mean median salary than the rest of the world.

Data

These are the libraries used in this analysis:

library(yaml)
library(RMySQL)

## Loading required package: DBI

library(infer)
library(ggplot2)
library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ stringr   1.5.0
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The data for this project was downloaded from Kaggle and loaded into a MySQL database, which we query below.

config <- yaml::read_yaml("config.yaml")
con <- dbConnect(
  RMySQL::MySQL(),
  dbname = config$dbname,
  host = config$host,
  port = config$port,
  user = config$user,
  password =  config$password
)
query <- "SELECT * FROM project3.salary_data"
world_salary <- dbGetQuery(con, query)

Looking at the first few rows of the dataset.

head(world_salary)

This snippet provides an overview of the dataset by showing the first few rows, including column names and sample data. Let’s examine the dataset’s structure to understand the data types and column names:

str(world_salary)

## 'data.frame':    221 obs. of  7 variables:
##  $ country_name  : chr  "Afghanistan" "Aland Islands" "Albania" "Algeria" ...
##  $ continent_name: chr  "Asia" "Europe" "Europe" "Africa" ...
##  $ wage_span     : chr  "Monthly" "Monthly" "Monthly" "Monthly" ...
##  $ median_salary : num  854 3319 833 1149 1390 ...
##  $ average_salary: num  1001 3858 957 1309 1570 ...
##  $ lowest_salary : num  253 973 241 330 400 ...
##  $ highest_salary: num  4461 17125 4258 5824 6980 ...

Data Tidying

Now let’s check for missing values to see whether any cleanup is necessary before the analysis.

# Check for missing values in the entire dataset
missing_values <- is.na(world_salary)

# Summarizing the number of missing values in each column
col_missing_count <- colSums(missing_values)

# Displaying the columns with missing values
colnames(world_salary)[col_missing_count > 0]

## character(0)
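
An empty result is easier to interpret next to a case where something really is missing. The sketch below applies the same check to a small made-up data frame (`toy` and its values are hypothetical, not part of the salary data) to show what the check would report in that situation.

```r
# Hypothetical toy data with one missing value in median_salary
toy <- data.frame(
  country_name  = c("A", "B", "C"),
  median_salary = c(854, NA, 833),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
toy_missing_count <- colSums(is.na(toy))

# Names of the columns that contain at least one missing value
toy_missing_cols <- names(toy_missing_count)[toy_missing_count > 0]
print(toy_missing_cols)
```

For the salary data the equivalent vector is empty, which is why R prints `character(0)`.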

As we can see, there are no missing values in the dataset. There is, however, a problem with the name of the region column, which is currently continent_name. Because it includes places such as the Caribbean and distinguishes between "Northern America" and "North America", we rename it to geographical_region. Since we know that all salaries are monthly, we can also remove the wage_span column.

colnames(world_salary)[colnames(world_salary) == "continent_name"] <- "geographical_region"
world_salary <- world_salary %>% select(-wage_span)

The dataset uses both "North America" and "Northern America" as labels, so we merge them into a single category.

world_salary <- world_salary %>%
  mutate(geographical_region = ifelse(geographical_region == "North America", "Northern America", geographical_region))

unique(world_salary$geographical_region)

## [1] "Asia"             "Europe"           "Africa"           "Oceania"         
## [5] "Caribbean"        "South America"    "Northern America" "Central America"

Next, we look at the summary statistics for the remaining numeric columns:

summary(world_salary[, c("median_salary", "average_salary", "lowest_salary", "highest_salary")])

##  median_salary      average_salary      lowest_salary       highest_salary    
##  Min.   :   0.261   Min.   :    0.286   Min.   :   0.0721   Min.   :    1.27  
##  1st Qu.: 567.210   1st Qu.:  651.000   1st Qu.: 163.9300   1st Qu.: 2900.48  
##  Median :1227.460   Median : 1344.230   Median : 339.4500   Median : 5974.36  
##  Mean   :1762.632   Mean   : 1982.340   Mean   : 502.7832   Mean   : 8802.17  
##  3rd Qu.:2389.010   3rd Qu.: 2740.000   3rd Qu.: 690.0000   3rd Qu.:12050.74  
##  Max.   :9836.070   Max.   :11292.900   Max.   :2850.2700   Max.   :50363.93

From this summary, the first thing I noticed is that the lowest median salary in the dataset is only $0.261 a month. The mean of the median_salary column is about $1,762 a month, which suggests that a typical monthly salary worldwide may be around $1,762, although this figure counts every country equally regardless of its population.
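
A plain mean of country-level medians gives every country the same weight, so it can differ considerably from a population-weighted figure. The following is a minimal sketch with invented numbers; the `population` column is hypothetical and does not exist in this dataset.

```r
# Hypothetical toy data: one small high-salary country, one large low-salary country
toy <- data.frame(
  country       = c("Small", "Large"),
  median_salary = c(4000, 1000),
  population    = c(1e6, 9e6)
)

# Unweighted mean: each country counts once
unweighted <- mean(toy$median_salary)

# Population-weighted mean: each country counts in proportion to its population
weighted <- weighted.mean(toy$median_salary, w = toy$population)

print(c(unweighted = unweighted, weighted = weighted))
```

This is why the $1,762 figure is best read as an average across countries rather than across workers.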

Exploratory data analysis

# Boxplot to visualize median salary distribution
ggplot(world_salary, aes(x = geographical_region, y = median_salary, fill = geographical_region)) +
  geom_boxplot(color = "darkblue", alpha = 0.7, outlier.color = "red") +
  labs(title = "Boxplot of Median Salary by Geographical Region",
       x = "Geographical Region",
       y = "Median Salary") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Histogram to visualize the distribution of median salary

ggplot(world_salary, aes(x = median_salary, fill = after_stat(x))) +
  geom_histogram(binwidth = 500, color = "white", alpha = 0.7) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Histogram of Median Salary",
       x = "Median Salary",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(color = "darkblue"),
        axis.title = element_text(color = "darkblue"),
        plot.title = element_text(color = "darkblue", size = 16, face = "bold"))

Inference

Here we present our null and alternative hypotheses:

H0 (Null Hypothesis) - There is no difference between the mean median salary of Northern America and that of the rest of the world.

Ha (Alternative Hypothesis) - Northern America has a higher mean median salary than the rest of the world.

world_salary <- world_salary %>%
  mutate(Northern_America = ifelse(geographical_region %in% c("Northern America","North America","Central America"), "Yes", "No"))

ggplot(world_salary, aes(x = median_salary, y = Northern_America, fill = Northern_America)) +
  geom_boxplot(color = "darkblue", alpha = 0.7, outlier.color = "red") +
  scale_fill_manual(values = c("lightblue", "lightgreen")) +  # Custom fill colors
  labs(title = "Boxplot of Median Salary by Region",
       x = "Median Salary",
       y = "Region") +
  theme_minimal() +
  theme(axis.text = element_text(color = "darkblue"),
        axis.title = element_text(color = "darkblue"),
        plot.title = element_text(color = "darkblue", size = 16, face = "bold"))

In the box plot, the "Yes" group, which as coded above covers Northern America (the U.S. and Canada) together with Central America, has the higher median salary and the larger variability in salaries. This makes sense, since the United States and Canada are known for having diverse income distributions.

Since we are going to compare the means of the two groups, we split the data accordingly and remove outliers by keeping only countries with a median salary below 5,000 a month.

yes_group <- world_salary %>%
  filter(Northern_America == "Yes", median_salary < 5000)

no_group <- world_salary %>%
  filter(Northern_America == "No", median_salary < 5000)
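
A fixed cutoff can remove a different share of each group, which in turn shifts the group means, so it is worth counting the dropped rows before comparing. This is a minimal sketch on invented data (the values in `toy` are illustrative only); in the analysis the same call would be applied to `world_salary`.

```r
# Hypothetical toy data mimicking the two groups
toy <- data.frame(
  Northern_America = c("Yes", "Yes", "No", "No", "No"),
  median_salary    = c(6000, 3000, 9800, 1200, 800)
)

# Number of rows per group that the median_salary < 5000 filter would drop
dropped <- tapply(toy$median_salary >= 5000, toy$Northern_America, sum)
print(dropped)

# Share of each group that would be dropped
dropped_share <- tapply(toy$median_salary >= 5000, toy$Northern_America, mean)
print(dropped_share)
```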

Independent Sample T-Test

Because we want to compare the mean median salary of Northern America with that of the rest of the world, and to see whether the Northern America mean is significantly higher, we use an independent-samples t-test on the two groups.

Before conducting the test, we need to check whether its assumptions hold, starting with whether each group is approximately normally distributed.

Creating QQ plot for the group without North America

qqnorm(no_group$median_salary)
qqline(no_group$median_salary, col = "red")

Creating QQ plot for the group with North America

qqnorm(yes_group$median_salary)
qqline(yes_group$median_salary, col = "red")
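
QQ plots are judged by eye, so a formal check such as the Shapiro-Wilk test can complement them. The sketch below is not part of the original analysis; it runs `shapiro.test()` on simulated data (the vectors `x_normal` and `x_skewed` are hypothetical) to show how the p-value behaves for a roughly normal sample and for a right-skewed one.

```r
set.seed(606)

# Simulated samples: one roughly normal, one strongly right-skewed
x_normal <- rnorm(50, mean = 1500, sd = 400)
x_skewed <- rlnorm(50, meanlog = 7, sdlog = 1)

# Shapiro-Wilk test: a small p-value is evidence against normality
p_normal <- shapiro.test(x_normal)$p.value
p_skewed <- shapiro.test(x_skewed)$p.value

print(c(normal = p_normal, skewed = p_skewed))
```

The same call could be made on `yes_group$median_salary` and `no_group$median_salary`. With only a handful of observations the test has little power, so it is best read alongside the plots rather than instead of them.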

In both plots the points follow the reference line reasonably well, so we treat the groups as approximately normally distributed. We will now conduct a t-test to test our hypothesis.

# One-sample t-tests of each group's mean against 0 (var.equal has no effect with a single sample)
t_test_yes <- t.test(yes_group$median_salary, var.equal = FALSE)
t_test_no <- t.test(no_group$median_salary, var.equal = FALSE)

print(t_test_yes)

## 
##  One Sample t-test
## 
## data:  yes_group$median_salary
## t = 8.0441, df = 15, p-value = 8.046e-07
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1501.059 2583.285
## sample estimates:
## mean of x 
##  2042.172

print(t_test_no)

## 
##  One Sample t-test
## 
## data:  no_group$median_salary
## t = 16.627, df = 194, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1316.478 1670.827
## sample estimates:
## mean of x 
##  1493.652
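
Both outputs above are one-sample tests of a group mean against zero, so neither compares the two groups directly. A direct comparison would be a two-sample Welch test, for example `t.test(yes_group$median_salary, no_group$median_salary, alternative = "greater")`. As a rough sketch, the Welch statistic can also be approximated from the summaries already printed, using n = df + 1 and standard error = mean / t. The helper name `welch_from_summary` is hypothetical, and the result is only as precise as the rounded figures it is given.

```r
# Approximate a one-sided Welch two-sample t-test from two one-sample
# t-test summaries (each tested against mu = 0).
welch_from_summary <- function(mean1, t1, df1, mean2, t2, df2) {
  # Standard errors recovered from t = mean / se
  se1 <- mean1 / t1
  se2 <- mean2 / t2

  # Welch t statistic for the difference in means
  se_diff <- sqrt(se1^2 + se2^2)
  t_stat  <- (mean1 - mean2) / se_diff

  # Welch-Satterthwaite degrees of freedom (df1 = n1 - 1, df2 = n2 - 1)
  df <- (se1^2 + se2^2)^2 / (se1^4 / df1 + se2^4 / df2)

  # One-sided p-value for H_a: mean1 > mean2
  p_greater <- pt(t_stat, df = df, lower.tail = FALSE)

  list(t = t_stat, df = df, p_greater = p_greater)
}

# Figures copied from the two printed outputs above
res <- welch_from_summary(
  mean1 = 2042.172, t1 = 8.0441, df1 = 15,
  mean2 = 1493.652, t2 = 16.627, df2 = 194
)
print(res)
```

Running `t.test()` on the raw group vectors is preferable, since it avoids rounding error and reports the confidence interval for the difference directly.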

Conclusion

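The two t-tests above are one-sample tests of each group's mean against zero. Their very small p-values (8.046e-07 for the Northern America group) therefore show only that each group's mean median salary differs from zero, not that the groups differ from each other. The sample means do point in the hypothesized direction: about $2,042 a month for the Northern America group versus about $1,494 a month for the rest of the world, after excluding median salaries of 5,000 or more. However, the 95% confidence intervals (1,501 to 2,583 and 1,316 to 1,671) overlap, so these results are consistent with the alternative hypothesis but do not by themselves establish it. A two-sample comparison of the groups is needed to determine whether the difference is statistically significant.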

Data 606 Project

Mikhail Broomes and Tilon Bobb

2023-12-07