Introduction

The aim of this paper is to apply the Principal Component Analysis (PCA) method of dimension reduction to a dataset of salary determinants. Using only the numeric (including binary) variables, the preprocessing reduced the number of variables from 28 to 11, which PCA then condensed into a small set of principal components. From my personal and professional background, I have always been interested in identifying the key factors that influence salary levels. I recently founded a start-up, “Recrupay”, an online job board where candidates earn money for passing each stage of the recruitment process. To research this area more deeply, I decided to devote this project to finding salary determinants, and the results have proved very useful in the development of my new company.

Methodology

I found the PCA method of dimension reduction extremely useful in identifying the key factors that influence salary levels. The dataset I used contained 28 variables describing characteristics such as company age, industry, and required skills, which made it complex to analyse. Without PCA, it would have been challenging to analyse all the variables at once, and the results would have been difficult to interpret.

The paper “Big Data Dimension Reduction Using PCA” by Zhang and Yang (2016) further reinforced the usefulness of PCA in my project. The paper highlights the importance of using dimension-reduction and feature-selection methods to reduce the number of variables in large datasets and to improve the accuracy and efficiency of machine learning models. PCA is one such method: it reduces the dimensionality of the dataset and extracts the components that explain most of the variability in the data.

By applying PCA to my dataset, I was able to reduce the initial variables to a smaller number of principal components that explain the majority of the variance in the data. These principal components allowed me to identify the key factors that influence salary levels. Additionally, I was able to create visualizations that helped me better understand the relationships between the different variables.

Overall, the PCA method of dimension reduction was an essential tool in my analysis, enabling me to effectively explore the relationships between the various factors and their impact on salary levels. By reducing the dimensionality of the dataset, I was able to focus on the most significant variables and gain deeper insights into the factors that influence salaries.
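
The general workflow applied later in this paper can be summarised in a few lines of R. This is only a schematic sketch: it assumes a purely numeric data frame called num_df (here the built-in mtcars data stands in for the cleaned Glassdoor data), and the 80% variance threshold is an arbitrary illustration.

# schematic PCA workflow: standardise, decompose, keep the components that explain most variance
num_df <- mtcars                                   # stand-in numeric data frame
pca_fit <- prcomp(num_df, center = TRUE, scale = TRUE)

# proportion of variance explained by each component
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)

# smallest number of components whose cumulative share reaches 80%
n_keep <- which(cumsum(var_explained) >= 0.80)[1]
scores <- pca_fit$x[, 1:n_keep, drop = FALSE]      # reduced representation of the observations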

Dataset

I chose the Kaggle Jobs Dataset from Glassdoor for my project because it provides a rich source of information about job listings and salaries across a wide range of industries and locations. I used the “salary_data_cleaned.csv” file for my analysis, which contains 742 job listings with information on job titles, companies, locations, and salaries.

What I found particularly useful about this dataset is that it includes both numeric and categorical variables, which makes it well suited to exploring the factors that influence salary levels. By using principal component analysis (PCA) to reduce the dimensionality of the data, I was able to identify the key variables most strongly associated with salary levels.

Setup

library(GGally)
library(pdp)
library(plotly)
library(kableExtra)
library(cluster)
library(labdsv)
library(ggthemes)
library(RColorBrewer)
library(dplyr)
library(factoextra)
library(fpc)
library(readxl)
library(psych)
library(maptools)
library(corrplot)
library(smacof)
library(flexclust)
library(gridExtra)
library(ade4)
library(missMDA)
library(patchwork)
library(Rtsne)
library(tidyverse)
library(stringr)
library(scales)
library(jpeg)
library(ClusterR)
library(tibble)

Dataset source

df <- read.csv("salary_data_cleaned.csv", header = TRUE, stringsAsFactors = FALSE)

Variables

Description of the initial variables:

  1. Job.Title: A string variable containing the job title of each observation in the dataset.
  2. Salary.Estimate: A string variable containing the estimated salary range for each job posting.
  3. Job.Description: A string variable containing the full job description for each posting.
  4. Rating: A numeric variable representing the company rating on Glassdoor.
  5. Company.Name: A string variable containing the name of the company posting the job.
  6. Location: A string variable containing the location of the job.
  7. Headquarters: A string variable containing the location of the company’s headquarters.
  8. Size: A string variable containing the size of the company.
  9. Founded: An integer variable representing the year the company was founded.
  10. Type.of.ownership: A string variable containing the type of ownership for the company.
  11. Industry: A string variable containing the industry of the company.
  12. Sector: A string variable containing the sector of the company.
  13. Revenue: A string variable containing the revenue of the company.
  14. Competitors: A string variable containing the competitors of the company.
  15. hourly: A binary variable indicating whether the salary is hourly or not.
  16. employer_provided: A binary variable indicating whether the employer provided the salary estimate or not.
  17. min_salary: An integer variable containing the minimum salary in the salary estimate.
  18. max_salary: An integer variable containing the maximum salary in the salary estimate.
  19. avg_salary: A numeric variable representing the average salary for each job posting.
  20. company_txt: A string variable containing only the name of the company posting the job.
  21. job_state: A string variable containing the state where the job is located.
  22. same_state: A binary variable indicating whether the company and job are in the same state or not.
  23. age: An integer variable representing the age of the company.
  24. python_yn: A binary variable indicating whether Python is a requirement for the job or not.
  25. R_yn: A binary variable indicating whether R is a requirement for the job or not.
  26. spark: A binary variable indicating whether Spark is a requirement for the job or not.
  27. aws: A binary variable indicating whether AWS is a requirement for the job or not.
  28. excel: A binary variable indicating whether Excel is a requirement for the job or not.

We can then see that the data has no missing values in any of the columns, which is promising for the further analysis.

str(df)
## 'data.frame':    742 obs. of  28 variables:
##  $ Job.Title        : chr  "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
##  $ Salary.Estimate  : chr  "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
##  $ Job.Description  : chr  "Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, "| __truncated__ "What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analyt"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological "| __truncated__ ...
##  $ Rating           : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ Company.Name     : chr  "Tecolote Research\n3.8" "University of Maryland Medical System\n3.4" "KnowBe4\n4.8" "PNNL\n3.8" ...
##  $ Location         : chr  "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
##  $ Headquarters     : chr  "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
##  $ Size             : chr  "501 to 1000 employees" "10000+ employees" "501 to 1000 employees" "1001 to 5000 employees" ...
##  $ Founded          : int  1973 1984 2010 1965 1998 2000 2008 2005 2014 2009 ...
##  $ Type.of.ownership: chr  "Company - Private" "Other Organization" "Company - Private" "Government" ...
##  $ Industry         : chr  "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
##  $ Sector           : chr  "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
##  $ Revenue          : chr  "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
##  $ Competitors      : chr  "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
##  $ hourly           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ employer_provided: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ min_salary       : int  53 63 80 56 86 71 54 86 38 120 ...
##  $ max_salary       : int  91 112 90 97 143 119 93 142 84 160 ...
##  $ avg_salary       : num  72 87.5 85 76.5 114.5 ...
##  $ company_txt      : chr  "Tecolote Research\n" "University of Maryland Medical System\n" "KnowBe4\n" "PNNL\n" ...
##  $ job_state        : chr  " NM" " MD" " FL" " WA" ...
##  $ same_state       : int  0 0 1 1 1 1 1 0 1 1 ...
##  $ age              : int  47 36 10 55 22 20 12 15 6 11 ...
##  $ python_yn        : int  1 1 1 1 1 1 0 1 0 1 ...
##  $ R_yn             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ spark            : int  0 0 1 0 0 0 0 1 0 1 ...
##  $ aws              : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ excel            : int  1 0 1 0 1 1 1 1 0 0 ...
# count missing values in each column
colSums(is.na(df))
##         Job.Title   Salary.Estimate   Job.Description            Rating 
##                 0                 0                 0                 0 
##      Company.Name          Location      Headquarters              Size 
##                 0                 0                 0                 0 
##           Founded Type.of.ownership          Industry            Sector 
##                 0                 0                 0                 0 
##           Revenue       Competitors            hourly employer_provided 
##                 0                 0                 0                 0 
##        min_salary        max_salary        avg_salary       company_txt 
##                 0                 0                 0                 0 
##         job_state        same_state               age         python_yn 
##                 0                 0                 0                 0 
##              R_yn             spark               aws             excel 
##                 0                 0                 0                 0

Cleaning

In this part, I cleaned and extracted salary information from the “Salary.Estimate” variable in the data frame. The first line removes the dollar signs, the “K” suffix, and the “(Glassdoor est.)” text from the variable. The next three lines clean the variable further by removing non-numeric and non-hyphen characters, trailing hyphens, and leading hyphens. Then I used the strsplit function to split each salary estimate into its lower and upper limits. Finally, I converted the values to numeric, calculated the mean of the range, and assigned it back to the “Salary.Estimate” variable. The result is a cleaned, numeric salary variable that can be used for further analysis.

# Remove dollar sign, "K", and "(Glassdoor est.)" text from Salary.Estimate
df$Salary.Estimate <- gsub("\\$|K|(Glassdoor est.)", "", df$Salary.Estimate)

# clean the variable to extract numeric range
df$Salary.Estimate <- gsub("[^0-9-]", "", df$Salary.Estimate) # remove non-numeric and non-hyphen characters
df$Salary.Estimate <- gsub("-$", "", df$Salary.Estimate) # remove trailing hyphen
df$Salary.Estimate <- gsub("^-", "", df$Salary.Estimate) # remove leading hyphen

# Extract lower and upper limits of salary estimate
limits <- strsplit(df$Salary.Estimate, "-")

# Convert the values to numeric and calculate mean
df$Salary.Estimate <- sapply(limits, function(x) mean(as.numeric(x)))
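
To make these steps concrete, here is the same pipeline applied to a single raw value from the dataset. This is a standalone illustration, not part of the analysis code.

# worked example of the salary cleaning on one raw value
x <- "$53K-$91K (Glassdoor est.)"
x <- gsub("\\$|K|(Glassdoor est.)", "", x)   # "53-91 ()" - the leftover characters are cleaned next
x <- gsub("[^0-9-]", "", x)                  # "53-91"
limits_x <- strsplit(x, "-")[[1]]            # c("53", "91")
mean(as.numeric(limits_x))                   # 72, matching the first row's Salary.Estimate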

The next step manipulates the “Size” column of the data frame by extracting its minimum and maximum values into two new variables, “Size_min” and “Size_max”. To achieve this, I used regular expressions to extract the numeric values from the “Size” column and assigned them to the new variables. I then checked for missing values in both variables and replaced them with a value of 10001 (these NAs correspond to values such as “10000+ employees” that do not match the pattern). Finally, I combined the two limits into a single “Size_mean” variable, dropped “Size_min” and “Size_max”, and printed the number of missing values in each column. In this way I cleaned and prepared the company-size information for analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Extract minimum and maximum values from Size column
df$Size_min <- as.numeric(sub("^(\\d+) to (\\d+).*", "\\1", df$Size))
## Warning: NAs introduced by coercion
df$Size_max <- as.numeric(sub("^(\\d+) to (\\d+).*", "\\2", df$Size))
## Warning: NAs introduced by coercion
# count missing values in each column
colSums(is.na(df))
##         Job.Title   Salary.Estimate   Job.Description            Rating 
##                 0                 0                 0                 0 
##      Company.Name          Location      Headquarters              Size 
##                 0                 0                 0                 0 
##           Founded Type.of.ownership          Industry            Sector 
##                 0                 0                 0                 0 
##           Revenue       Competitors            hourly employer_provided 
##                 0                 0                 0                 0 
##        min_salary        max_salary        avg_salary       company_txt 
##                 0                 0                 0                 0 
##         job_state        same_state               age         python_yn 
##                 0                 0                 0                 0 
##              R_yn             spark               aws             excel 
##                 0                 0                 0                 0 
##          Size_min          Size_max 
##               139               139
# replace NA values with 10001
df$Size_min[is.na(df$Size_min)] <- 10001
df$Size_max[is.na(df$Size_max)] <- 10001

# create mean variable
df$Size_mean <- (df$Size_min + df$Size_max) / 2
df <- select(df, -Size_min, -Size_max)
colSums(is.na(df))
##         Job.Title   Salary.Estimate   Job.Description            Rating 
##                 0                 0                 0                 0 
##      Company.Name          Location      Headquarters              Size 
##                 0                 0                 0                 0 
##           Founded Type.of.ownership          Industry            Sector 
##                 0                 0                 0                 0 
##           Revenue       Competitors            hourly employer_provided 
##                 0                 0                 0                 0 
##        min_salary        max_salary        avg_salary       company_txt 
##                 0                 0                 0                 0 
##         job_state        same_state               age         python_yn 
##                 0                 0                 0                 0 
##              R_yn             spark               aws             excel 
##                 0                 0                 0                 0 
##         Size_mean 
##                 0
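
As a quick illustration of the Size parsing, here is the regular expression applied to two example values taken from the data. This is a standalone sketch, not part of the pipeline above.

# worked example of the Size parsing
s <- "501 to 1000 employees"
as.numeric(sub("^(\\d+) to (\\d+).*", "\\1", s))   # minimum: 501
as.numeric(sub("^(\\d+) to (\\d+).*", "\\2", s))   # maximum: 1000
# values such as "10000+ employees" do not match the pattern, so coercion yields NA
as.numeric(sub("^(\\d+) to (\\d+).*", "\\1", "10000+ employees"))   # NA (later replaced by 10001)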

Then I converted all remaining variables to numeric, including the binary variables. I also converted the minimum and maximum salary values to a clean numeric format using the mutate function.

library(dplyr)

df$Founded <- as.numeric(df$Founded)

df <- df %>%
  mutate(min_salary = as.numeric(gsub("[^0-9-]", "", min_salary)),
         max_salary = as.numeric(gsub("[^0-9-]", "", max_salary)))

df$age <- as.numeric(df$age)

df$python_yn <- as.numeric(df$python_yn)
df$R_yn <- as.numeric(df$R_yn)
df$spark <- as.numeric(df$spark)
df$aws <- as.numeric(df$aws)
df$excel <- as.numeric(df$excel)

I converted the last variable to numeric and decided to remove the hourly and employer_provided variables, because they did not add valuable information to my PCA analysis of salary determinants. I then kept only the numeric variables and displayed the cleaned data below with the str(df) function. The cleaned data has no missing values. Finally, I tidied the variable names by capitalising them.

library(dplyr)

df$same_state <- as.numeric(df$same_state)
df <- select(df, -hourly, -employer_provided)
df <- select_if(df, is.numeric)

str(df)
## 'data.frame':    742 obs. of  14 variables:
##  $ Salary.Estimate: num  72 87.5 85 76.5 114.5 ...
##  $ Rating         : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ Founded        : num  1973 1984 2010 1965 1998 ...
##  $ min_salary     : num  53 63 80 56 86 71 54 86 38 120 ...
##  $ max_salary     : num  91 112 90 97 143 119 93 142 84 160 ...
##  $ avg_salary     : num  72 87.5 85 76.5 114.5 ...
##  $ same_state     : num  0 0 1 1 1 1 1 0 1 1 ...
##  $ age            : num  47 36 10 55 22 20 12 15 6 11 ...
##  $ python_yn      : num  1 1 1 1 1 1 0 1 0 1 ...
##  $ R_yn           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ spark          : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ aws            : num  0 0 0 0 0 1 0 1 0 0 ...
##  $ excel          : num  1 0 1 0 1 1 1 1 0 0 ...
##  $ Size_mean      : num  750 10001 750 3000 126 ...
# count missing values in each column
colSums(is.na(df))
## Salary.Estimate          Rating         Founded      min_salary      max_salary 
##               0               0               0               0               0 
##      avg_salary      same_state             age       python_yn            R_yn 
##               0               0               0               0               0 
##           spark             aws           excel       Size_mean 
##               0               0               0               0
library(tools)
colnames(df) <- toTitleCase(colnames(df))
colnames(df)
##  [1] "Salary.Estimate" "Rating"          "Founded"         "Min_salary"     
##  [5] "Max_salary"      "Avg_salary"      "Same_state"      "Age"            
##  [9] "Python_yn"       "R_yn"            "Spark"           "Aws"            
## [13] "Excel"           "Size_mean"

Statistics

The code summary(df) generates a summary of the dataset df. The summary provides information about the variables in the dataset: Salary.Estimate, Rating, Founded, Min_salary, Max_salary, Avg_salary, Same_state, Age, Python_yn, R_yn, Spark, Aws, Excel, and Size_mean.

For each variable, the summary provides several descriptive statistics, including the minimum value, the first quartile, the median, the mean, the third quartile, and the maximum value. These statistics provide a quick overview of the distribution of each variable in the dataset.

For example, we can see that the mean Avg_salary is around 100.6, with a minimum of 13.5 and a maximum of 254.0. We can also see that the Same_state variable is binary, with a mean of 0.558, indicating that slightly more than half of the jobs in the dataset are in the same state as the company headquarters.

summary(df)
##  Salary.Estimate     Rating          Founded       Min_salary    
##  Min.   : 13.5   Min.   :-1.000   Min.   :  -1   Min.   : 10.00  
##  1st Qu.: 73.5   1st Qu.: 3.300   1st Qu.:1939   1st Qu.: 52.00  
##  Median : 97.5   Median : 3.700   Median :1988   Median : 69.50  
##  Mean   :100.6   Mean   : 3.619   Mean   :1837   Mean   : 74.07  
##  3rd Qu.:122.5   3rd Qu.: 4.000   3rd Qu.:2007   3rd Qu.: 91.00  
##  Max.   :254.0   Max.   : 5.000   Max.   :2019   Max.   :202.00  
##    Max_salary      Avg_salary      Same_state         Age        
##  Min.   : 16.0   Min.   : 13.5   Min.   :0.000   Min.   : -1.00  
##  1st Qu.: 96.0   1st Qu.: 73.5   1st Qu.:0.000   1st Qu.: 11.00  
##  Median :124.0   Median : 97.5   Median :1.000   Median : 24.00  
##  Mean   :127.2   Mean   :100.6   Mean   :0.558   Mean   : 46.59  
##  3rd Qu.:155.0   3rd Qu.:122.5   3rd Qu.:1.000   3rd Qu.: 59.00  
##  Max.   :306.0   Max.   :254.0   Max.   :1.000   Max.   :276.00  
##    Python_yn           R_yn              Spark             Aws        
##  Min.   :0.0000   Min.   :0.000000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.000000   Median :0.0000   Median :0.0000  
##  Mean   :0.5283   Mean   :0.002695   Mean   :0.2251   Mean   :0.2372  
##  3rd Qu.:1.0000   3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.000000   Max.   :1.0000   Max.   :1.0000  
##      Excel          Size_mean      
##  Min.   :0.0000   Min.   :   -1.0  
##  1st Qu.:0.0000   1st Qu.:  350.5  
##  Median :1.0000   Median :  750.5  
##  Mean   :0.5229   Mean   : 3456.1  
##  3rd Qu.:1.0000   3rd Qu.: 7500.5  
##  Max.   :1.0000   Max.   :10001.0

Correlation

The correlation plot generated by the code indicates the strength and direction of the correlation between pairs of variables in the dataset. The strength of the correlation is indicated by the color and size of the circles in the plot.

Variables that are strongly positively correlated have a large circle that is colored in shades of purple. Variables that are strongly negatively correlated have a large circle that is colored in shades of red.

Variables that are weakly correlated have smaller circles that are green or light blue.

library(corrplot)
## corrplot 0.92 loaded
library(RColorBrewer)

# Define color palette
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(n = 200)

# Compute correlation matrix
cor.matrix <- cor(df, method = "pearson")

# Plot correlation matrix
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col=colors)

Dropping

Based on the correlation plot, most of the variables are weakly correlated. The exceptions are Min_salary, Max_salary and Avg_salary, which are strongly correlated with each other and with Salary.Estimate. This is why I decided to drop these three variables while keeping Salary.Estimate, which is the dependent variable of my project.
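
The same conclusion can be checked numerically. The short sketch below lists the variable pairs whose absolute correlation exceeds a chosen cut-off (0.8 is an arbitrary threshold here), using the cor.matrix computed above.

# list variable pairs with |correlation| above 0.8 (upper triangle only, to avoid duplicates)
high_cor <- which(abs(cor.matrix) > 0.8 & upper.tri(cor.matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor.matrix)[high_cor[, 1]],
           var2 = colnames(cor.matrix)[high_cor[, 2]],
           cor  = round(cor.matrix[high_cor], 2))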

library(dplyr)

df <- select(df, -Min_salary, -Max_salary, -Avg_salary)

str(df)
## 'data.frame':    742 obs. of  11 variables:
##  $ Salary.Estimate: num  72 87.5 85 76.5 114.5 ...
##  $ Rating         : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ Founded        : num  1973 1984 2010 1965 1998 ...
##  $ Same_state     : num  0 0 1 1 1 1 1 0 1 1 ...
##  $ Age            : num  47 36 10 55 22 20 12 15 6 11 ...
##  $ Python_yn      : num  1 1 1 1 1 1 0 1 0 1 ...
##  $ R_yn           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Spark          : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ Aws            : num  0 0 0 0 0 1 0 1 0 0 ...
##  $ Excel          : num  1 0 1 0 1 1 1 1 0 0 ...
##  $ Size_mean      : num  750 10001 750 3000 126 ...

Distributions

A common assumption behind PCA-based analysis is that the variables are approximately normally distributed, or at least not heavily skewed. As visible in the histograms below, additional transformations are needed to bring the variables closer to this.

library(ggplot2)

# create a list of histogram plots for each variable
plots <- lapply(names(df), function(x) {
  ggplot(df, aes(x = .data[[x]])) + 
    geom_histogram(bins = 10, fill = "yellow", color = "black") + 
    labs(title = x, x = x, y = "Count")
})

# display the plots together using grid.arrange
gridExtra::grid.arrange(grobs = plots, ncol = 3)
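
A numeric check backs up the visual impression. The sketch below computes the sample skewness of each variable in base R; values far from zero indicate skewed, non-normal distributions (a similar check could be done with psych::skew).

# sample skewness of each column: values far from 0 indicate skewed distributions
skewness <- sapply(df, function(x) mean((x - mean(x))^3) / sd(x)^3)
round(sort(skewness, decreasing = TRUE), 2)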

Binary variables

It is difficult to determine whether dropping all binary variables in my project would be a good decision.

Binary variables can be very informative in some contexts, so dropping them without good reason may result in the loss of potentially valuable information. On the other hand, if the binary variables are not very informative, are highly correlated with other variables in the dataset, or introduce multicollinearity into the analysis, it may be reasonable to drop them. In my project this is not the case, so I decided to keep my binary variables.

Before performing PCA on a mix of binary and numeric variables, I standardised the binary variables. Since PCA is sensitive to the scale of the variables, it is generally a good idea to standardise all variables before performing PCA; the remaining numeric variables are standardised later through the center and scale arguments of prcomp().

df[, c("Same_state", "Python_yn", "R_yn", "Spark", "Aws", "Excel")] <- scale(df[, c("Same_state", "Python_yn", "R_yn", "Spark", "Aws", "Excel")], center = TRUE, scale = TRUE)

Logarithmic transformation

To perform a logarithmic transformation on the skewed variables in the df data frame, I used the log() function in R, overwriting the original columns with their log-transformed values. The transformed variables should have a distribution closer to normal than the originals. Taking the logarithm of non-positive values (the -1 placeholders in Rating, Founded, Age and Size_mean) produces NaNs, so I removed the affected rows afterwards. The green histograms show that the distributions after this change are closer to normal.

df$Salary.Estimate <- log(df$Salary.Estimate)
df$Rating <- log(df$Rating)
## Warning in log(df$Rating): NaNs produced
df$Founded <- log(df$Founded)
## Warning in log(df$Founded): NaNs produced
df$Age <- log(df$Age)
## Warning in log(df$Age): NaNs produced
df$Size_mean <- log(df$Size_mean)
## Warning in log(df$Size_mean): NaNs produced
# count missing values in each column
colSums(is.na(df))
## Salary.Estimate          Rating         Founded      Same_state             Age 
##               0              11              50               0              50 
##       Python_yn            R_yn           Spark             Aws           Excel 
##               0               0               0               0               0 
##       Size_mean 
##               1
# modify the original data frame
df <- na.omit(df)
# count missing values in each column
colSums(is.na(df))
## Salary.Estimate          Rating         Founded      Same_state             Age 
##               0               0               0               0               0 
##       Python_yn            R_yn           Spark             Aws           Excel 
##               0               0               0               0               0 
##       Size_mean 
##               0
library(ggplot2)

# create a list of histogram plots for each variable
plots <- lapply(names(df), function(x) {
  ggplot(df, aes(x = .data[[x]])) + 
    geom_histogram(bins = 10, fill = "green", color = "black") + 
    labs(title = x, x = x, y = "Count")
})

# display the plots together using grid.arrange
gridExtra::grid.arrange(grobs = plots, ncol = 3)

PCA

Summary of Principal Component Analysis

The code below performs a principal component analysis (PCA) on the dataframe df using the prcomp function. The center = TRUE and scale = TRUE arguments standardize the variables before computing the PCA.

The summary(df.pca) function call then prints a summary of the results of the PCA analysis, including the standard deviation, proportion of variance, and cumulative proportion of variance explained by each principal component (PC).

There are 11 principal components in total, with PC1 explaining 23.55% of the variance, PC2 explaining 16.10%, and so on. The cumulative proportion of variance is also shown: the first two PCs account for 39.66% of the total variance and the first three for 49.50%.

The summary also displays the standard deviation for each PC. The standard deviation gives an indication of how much of the variability in the original dataset is captured by each PC. Finally, the proportion of variance for each PC is shown, which gives an indication of how much information is retained by each component. The higher the proportion of variance, the more important the component is in capturing the original variation in the data.

df.pca <- prcomp(df, center = TRUE, scale = TRUE)
summary(df.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.6095 1.3310 1.04045 1.02307 0.97383 0.95625 0.88389
## Proportion of Variance 0.2355 0.1610 0.09841 0.09515 0.08621 0.08313 0.07102
## Cumulative Proportion  0.2355 0.3966 0.49497 0.59012 0.67633 0.75946 0.83048
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.83449 0.76419 0.68083 0.34756
## Proportion of Variance 0.06331 0.05309 0.04214 0.01098
## Cumulative Proportion  0.89379 0.94688 0.98902 1.00000

Results

Eigenvalues

The get_eigenvalue function is used to extract the eigenvalues and variance explained by each principal component from the prcomp object df.pca.

The output eig.val displays the eigenvalues and variance percentages for each of the 11 principal components. The first column gives the eigenvalues, the second column gives the percentage of variance explained by each principal component, and the third column gives the cumulative percentage of variance explained up to that component.

From this output, we can see that the first principal component (Dim.1) has the largest eigenvalue and explains 23.55% of the variance in the data. The second principal component (Dim.2) has the second-largest eigenvalue and explains an additional 16.10% of the variance. Together, the first two principal components explain 39.66% of the variance in the data.

The cumulative variance explained increases with each additional principal component, and by the 11th component, 100% of the variance is explained.

library(factoextra) 
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
eig.val <- get_eigenvalue(df.pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   2.5905641        23.550582                    23.55058
## Dim.2   1.7715188        16.104716                    39.65530
## Dim.3   1.0825458         9.841325                    49.49662
## Dim.4   1.0466705         9.515187                    59.01181
## Dim.5   0.9483429         8.621299                    67.63311
## Dim.6   0.9144178         8.312889                    75.94600
## Dim.7   0.7812700         7.102454                    83.04845
## Dim.8   0.6963670         6.330609                    89.37906
## Dim.9   0.5839797         5.308906                    94.68797
## Dim.10  0.4635271         4.213883                    98.90185
## Dim.11  0.1207964         1.098149                   100.00000

Scree Plots

The scree plot is a visualization of the eigenvalues of the principal components, one for each dimension. As visible in the previous table, Dim.1 has an eigenvalue of 2.5905641, Dim.2 of 1.7715188, Dim.3 of 1.0825458 and Dim.4 of 1.0466705, while the remaining dimensions have eigenvalues smaller than 1, so by the eigenvalue-greater-than-one rule (the Kaiser criterion) we should not include them in further analysis.

The plot also shows an “elbow” at the third principal component, indicating that the first few components may be the most important in explaining the variance in the data. Overall, the scree plot suggests that 3 or 4 principal components may be sufficient to capture most of the variation in the data.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(factoextra)
library(labdsv)
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.
## This is labdsv 2.0-1
## convert existing ordinations with as.dsvord()
## 
## Attaching package: 'labdsv'
## The following object is masked from 'package:psych':
## 
##     pca
## The following object is masked from 'package:stats':
## 
##     density
library(psy)
## 
## Attaching package: 'psy'
## The following object is masked from 'package:psych':
## 
##     wkappa
scree.plot(eig.val[,1], title = "Scree Plot", type = "E",  simu = "P")
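
The Kaiser criterion mentioned above can also be checked directly from the eigenvalue table; a one-line sketch:

# number of components with eigenvalue > 1 - four here (Dim.1 to Dim.4)
sum(eig.val[, "eigenvalue"] > 1)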

The plot below shows the proportion of the total variance in the data explained by each principal component. The first principal component explains the largest proportion of variance (23.55%), followed by the second (16.10%), and so on. This plot also shows an “elbow” at the third principal component, indicating that the first few components may be the most important in explaining the variance in the data. The conclusions are the same as from the scree plot above.

fviz_eig(df.pca)

Correlation circle

The correlation circle shows how each variable relates to the principal components. The length of each arrow reflects how well the variable is represented in the first two components, and the colour of the arrow and its label reflects the variable’s contribution to them. The plot helps to identify variables that are strongly associated with each principal component, variables that are important for distinguishing groups or clusters in the data, and, more generally, which variables are most important for explaining the variation in the data.

The plot indicates that the most important variables in the first two components are “Founded”, “Age”, and “Size_mean”. The least important are the “Excel” and “R_yn” variables.

library(ggpubr)
library(factoextra)
library(RColorBrewer)

# Create a rainbow color palette
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(length(df.pca$sdev))

# Plot PCA results
p <- fviz_pca_var(df.pca, col.var="contrib", col.var.rep = TRUE, 
                  gradient.cols = colors, col.axis = "#757575") + 
  scale_color_gradientn(colors = colors)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
# Adjust label size and position
p <- ggpar(p, label_size = 6, label_y = 0.6)

# Print plot
print(p)

Loadings

The first plot shows the variable loadings for PC1 and PC2 in a scatter plot. Each variable is represented by an arrow pointing in the direction of its loading on the two components. The length of the arrow represents the magnitude of the loading, and the colour represents the contribution of that variable to the components, with a rainbow palette ranging from red (low contribution) to purple (high contribution). The colour scale ranges from 0% to 100%, and the colour bar on the right side of the plot labels the corresponding values. The plot helps to identify which variables contribute most to the separation of the observations along PC1 and PC2; here these are “Age” and “Founded”.

# Define the rainbow color palette
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(100)

# Plot the variable loadings for PC1 and PC2, with rainbow color palette
fviz_pca_var(df.pca, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
  scale_color_gradientn(colors = colors, limits = c(0, 100), guide = "colorbar", 
                        breaks = seq(0, 100, by = 10), 
                        labels = c("0%", "10%", "20%", "30%", "40%", "50%", "60%", "70%", "80%", "90%", "100%"))

The second plot shows the variable loadings for PC2 and PC3, with the same rainbow colour palette. Here, the variable that mainly drives the separation along these components is “Excel”.

colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(100)

# Plot the variable loadings for PC2 and PC3, with rainbow color palette
fviz_pca_var(df.pca, col.var="contrib", repel = TRUE, axes = c(2, 3)) +
  labs(title="Variables loadings for PC2 and PC3", x="PC2", y="PC3")+
  scale_color_gradientn(colors = colors, limits = c(0, 100), guide = "colorbar", 
                        breaks = seq(0, 100, by = 10), 
                        labels = c("0%", "10%", "20%", "30%", "40%", "50%", "60%", "70%", "80%", "90%", "100%"))

Variance explained

The plot produced by plot(summary(df.pca)$importance[3,]) displays the cumulative proportion of the total variance explained as principal components are added. The x-axis shows the component number, while the y-axis shows the cumulative proportion of variance explained. The plot allows for visual inspection of how many principal components need to be retained for further analysis.

plot(summary(df.pca)$importance[3,])
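
The same cumulative-proportion vector can also be queried directly, for example to find how many components are needed to reach a chosen share of the variance (80% is an arbitrary threshold here):

# first component whose cumulative proportion of variance reaches 80%
cum_var <- summary(df.pca)$importance["Cumulative Proportion", ]
which(cum_var >= 0.80)[1]   # PC7: PC6 stops at about 76%, PC7 reaches about 83%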

PCA plot

This plot shows the PCA results for the individual data points (the job postings) in the dataset. The red-to-purple colour scheme represents the quality of representation (cos2) of each individual on the principal components, with red indicating a low cos2 and purple a high one. This plot can help identify patterns and clusters of data points in the PCA results.

# Define a rainbow color palette
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(n = nrow(df.pca$ind))

# Plot the PCA results with a rainbow color scheme
fviz_pca_ind(df.pca, col.ind = "cos2", geom = "point", 
             gradient.cols = colors, col.ind.rep = TRUE) +
  scale_color_gradientn(colors = colors)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

Contributions

The plots show the contribution of each variable to the first three principal components of the PCA, one plot per component. In each plot, the variables are ranked in decreasing order of contribution, and the height of each bar shows that variable’s percentage contribution to the component. By examining the plots, we can identify the variables that are most important in explaining the variation of the data along each principal component.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Create plots for each principal component
PC1 <- fviz_contrib(df.pca, choice = "var", axes = 1,fill = "cyan",color = "purple") +
       ggtitle("Contribution to Dim-1")
PC2 <- fviz_contrib(df.pca, choice = "var", axes = 2,fill = "green",color = "purple") +
       ggtitle("Contribution to Dim-2")
PC3 <- fviz_contrib(df.pca, choice = "var", axes = 3,fill = "yellow",color = "purple") +
       ggtitle("Contribution to Dim-3")

# Combine the plots into a single grid
grid.arrange(PC1, PC2, PC3, ncol = 3)

Summary

This paper aimed to determine the most important variables influencing the Salary.Estimate variable, i.e. the estimated salary for each job posting. The preprocessing reduced the number of variables from 28 to 11, keeping the numeric and binary data types, and the PCA then condensed these into a small number of principal components.

The scree plots and the elbow method showed that the optimal number of dimensions is three, and that these first components are the most important in explaining the variance in the data.

The first principal component is dominated by three variables: “Age”, “Founded” and “Size_mean”. The second principal component is dominated by four variables: “Salary.Estimate”, “Python_yn”, “Spark” and “Aws”. The third is dominated by two: “Excel” and “Same_state”. This suggests that the variables that can best explain the changes in the Salary.Estimate variable are “Age”, “Founded” and “Size_mean”.
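
These groupings can be read directly off the loadings (rotation) matrix of the PCA; a small sketch that sorts the variables by the absolute size of their loadings on each of the first three components:

# variables sorted by absolute loading on each of the first three components
round(sort(abs(df.pca$rotation[, "PC1"]), decreasing = TRUE), 2)
round(sort(abs(df.pca$rotation[, "PC2"]), decreasing = TRUE), 2)
round(sort(abs(df.pca$rotation[, "PC3"]), decreasing = TRUE), 2)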

I will use this research in the development of my recent business project, “Recrupay”, where the practical use of this dimension-reduction analysis can be seen. The results suggest that, when estimating salaries on my website, I should focus mainly on variables such as the age of the company, the year the company was founded, and the size of the company. This information can be useful both for candidates and for recruiters, as both parties estimate salaries for various purposes.

References

Salary Prediction. (2022, November 16). Kaggle. https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor

Recrupay. (n.d.). Recrupay. https://www.recrupay.com/

RPubs - Dimension Reduction. (n.d.). https://rpubs.com/paulinazabska/DimensionReduction

Zhang, T., & Yang, B. (2016). Big Data Dimension Reduction Using PCA. 2016 IEEE International Conference on Smart Cloud (SmartCloud). https://doi.org/10.1109/smartcloud.2016.33