The aim of this paper is to apply Principal Component Analysis (PCA) as a dimension-reduction method to a dataset of salary determinants. The result was a reduction of the number of variables from 28 to 11, using only the numeric data types. Given my personal and professional background, I have always been interested in identifying the key factors that influence salary levels. I have recently created a start-up, “Recrupay”, an online job board where candidates are paid for passing each stage of the recruitment process. To research this area more deeply, I decided to devote this project to finding salary determinants, and the results have been very useful in the development of my new company.
I found the PCA method of dimension reduction extremely useful in identifying the key factors that influence salary levels. The dataset I used, which includes 28 variables covering characteristics such as company age, industry, and required skills, is quite large and complex. Without PCA, it would have been challenging to analyze all the variables at once, and the results would have been difficult to interpret.
The paper “Big Data Dimension Reduction Using PCA” by Zhang and Yang (2016) further reinforced the usefulness of PCA in my project. The paper highlights the importance of using dimension-reduction and feature-selection methods to reduce the number of variables in large datasets and to improve the accuracy and efficiency of machine learning models. PCA is one such method: it reduces the dimensionality of the dataset and extracts the components that explain most of the variability in the data.
By applying PCA to my dataset, I was able to reduce the initial variables to a smaller number of principal components that explain the majority of the variance in the data. These principal components allowed me to identify the key factors that influence salary levels. Additionally, I created visualizations that helped me better understand the relationships between the different variables.
Overall, PCA was an essential tool in my analysis, enabling me to effectively explore the relationships between the various factors and their impact on salary levels. By reducing the dimensionality of the dataset, I was able to focus on the most significant variables and gain deeper insights into the factors that influence salaries.
I chose the Kaggle Jobs Dataset from Glassdoor for my project because it provides a rich source of information about job listings and salaries across a wide range of industries and locations, covering job titles, companies, locations, and salary estimates. I used the “salary_data_cleaned.csv” file for my analysis, which contains 742 job listings described by 28 variables.
What I found particularly useful about this dataset is that it includes both numeric and categorical variables, which makes it well suited to exploring the factors that influence salary levels. By using principal component analysis (PCA) to reduce the dimensionality of the data, I was able to identify the variables most strongly associated with salary levels and to explore how they vary across different industries and locations.
library(GGally)
library(pdp)
library(plotly)
library(kableExtra)
library(cluster)
library(labdsv)
library(ggthemes)
library(RColorBrewer)
library(dplyr)
library(factoextra)
library(fpc)
library(readxl)
library(psych)
library(maptools)
library(corrplot)
library(smacof)
library(flexclust)
library(gridExtra)
library(ade4)
library(missMDA)
library(patchwork)
library(Rtsne)
library(tidyverse)
library(stringr)
library(scales)
library(jpeg)
library(ClusterR)
library(tibble)
Dataset source
df <- read.csv("salary_data_cleaned.csv", header = TRUE, stringsAsFactors = FALSE)
Description of the initial variables:
Below we can also see that the data has no missing values in any of the columns, which is promising for the further analysis.
str(df)
## 'data.frame': 742 obs. of 28 variables:
## $ Job.Title : chr "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
## $ Salary.Estimate : chr "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
## $ Job.Description : chr "Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, "| __truncated__ "What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analyt"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological "| __truncated__ ...
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Company.Name : chr "Tecolote Research\n3.8" "University of Maryland Medical System\n3.4" "KnowBe4\n4.8" "PNNL\n3.8" ...
## $ Location : chr "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
## $ Headquarters : chr "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
## $ Size : chr "501 to 1000 employees" "10000+ employees" "501 to 1000 employees" "1001 to 5000 employees" ...
## $ Founded : int 1973 1984 2010 1965 1998 2000 2008 2005 2014 2009 ...
## $ Type.of.ownership: chr "Company - Private" "Other Organization" "Company - Private" "Government" ...
## $ Industry : chr "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
## $ Sector : chr "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
## $ Revenue : chr "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
## $ Competitors : chr "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
## $ hourly : int 0 0 0 0 0 0 0 0 0 0 ...
## $ employer_provided: int 0 0 0 0 0 0 0 0 0 0 ...
## $ min_salary : int 53 63 80 56 86 71 54 86 38 120 ...
## $ max_salary : int 91 112 90 97 143 119 93 142 84 160 ...
## $ avg_salary : num 72 87.5 85 76.5 114.5 ...
## $ company_txt : chr "Tecolote Research\n" "University of Maryland Medical System\n" "KnowBe4\n" "PNNL\n" ...
## $ job_state : chr " NM" " MD" " FL" " WA" ...
## $ same_state : int 0 0 1 1 1 1 1 0 1 1 ...
## $ age : int 47 36 10 55 22 20 12 15 6 11 ...
## $ python_yn : int 1 1 1 1 1 1 0 1 0 1 ...
## $ R_yn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ spark : int 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : int 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : int 1 0 1 0 1 1 1 1 0 0 ...
# count missing values in each column
colSums(is.na(df))
## Job.Title Salary.Estimate Job.Description Rating
## 0 0 0 0
## Company.Name Location Headquarters Size
## 0 0 0 0
## Founded Type.of.ownership Industry Sector
## 0 0 0 0
## Revenue Competitors hourly employer_provided
## 0 0 0 0
## min_salary max_salary avg_salary company_txt
## 0 0 0 0
## job_state same_state age python_yn
## 0 0 0 0
## R_yn spark aws excel
## 0 0 0 0
In this part, I cleaned and extracted salary information from the “Salary.Estimate” variable in the data frame. The first line removed dollar signs, “K” and “(Glassdoor est.)” text from the variable. Then, in the next three lines, I cleaned the variable by removing non-numeric and non-hyphen characters, trailing hyphens, and leading hyphens. Next, I used the “strsplit” function to split the salary estimates into lower and upper limits. Finally, I converted the values to numeric, calculated the mean of the range, and assigned it to the “Salary.Estimate” variable in the data frame. The result was a cleaned and transformed salary variable in the data frame, which can be used for further analysis.
# Remove dollar sign, "K", and "(Glassdoor est.)" text from Salary.Estimate
df$Salary.Estimate <- gsub("\\$|K|\\(Glassdoor est\\.\\)", "", df$Salary.Estimate)
# clean the variable to extract numeric range
df$Salary.Estimate <- gsub("[^0-9-]", "", df$Salary.Estimate) # remove non-numeric and non-hyphen characters
df$Salary.Estimate <- gsub("-$", "", df$Salary.Estimate) # remove trailing hyphen
df$Salary.Estimate <- gsub("^-", "", df$Salary.Estimate) # remove leading hyphen
# Extract lower and upper limits of salary estimate
limits <- strsplit(df$Salary.Estimate, "-")
# Convert the values to numeric and calculate mean
df$Salary.Estimate <- sapply(limits, function(x) mean(as.numeric(x)))
The next step manipulates the “Size” column of the df data frame, extracting its minimum and maximum values into two new variables, “Size_min” and “Size_max”. To achieve this, I used regular expressions to extract the numeric values from the “Size” column and assigned them to the new variables. I then checked for missing values in both variables (these arise for categories such as “10000+ employees” that do not follow the “X to Y” pattern) and replaced them with a value of 10001. Finally, I created a “Size_mean” variable as the average of the two limits, dropped the separate minimum and maximum columns, and counted the missing values in each column. In this way I cleaned and prepared the “Size” variable for the analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Extract minimum and maximum values from Size column
df$Size_min <- as.numeric(sub("^(\\d+) to (\\d+).*", "\\1", df$Size))
## Warning: NAs introduced by coercion
df$Size_max <- as.numeric(sub("^(\\d+) to (\\d+).*", "\\2", df$Size))
## Warning: NAs introduced by coercion
# count missing values in each column
colSums(is.na(df))
## Job.Title Salary.Estimate Job.Description Rating
## 0 0 0 0
## Company.Name Location Headquarters Size
## 0 0 0 0
## Founded Type.of.ownership Industry Sector
## 0 0 0 0
## Revenue Competitors hourly employer_provided
## 0 0 0 0
## min_salary max_salary avg_salary company_txt
## 0 0 0 0
## job_state same_state age python_yn
## 0 0 0 0
## R_yn spark aws excel
## 0 0 0 0
## Size_min Size_max
## 139 139
# replace NA values with 10001
df$Size_min[is.na(df$Size_min)] <- 10001
df$Size_max[is.na(df$Size_max)] <- 10001
# create mean variable
df$Size_mean <- (df$Size_min + df$Size_max) / 2
df <- select(df, -Size_min, -Size_max)
colSums(is.na(df))
## Job.Title Salary.Estimate Job.Description Rating
## 0 0 0 0
## Company.Name Location Headquarters Size
## 0 0 0 0
## Founded Type.of.ownership Industry Sector
## 0 0 0 0
## Revenue Competitors hourly employer_provided
## 0 0 0 0
## min_salary max_salary avg_salary company_txt
## 0 0 0 0
## job_state same_state age python_yn
## 0 0 0 0
## R_yn spark aws excel
## 0 0 0 0
## Size_mean
## 0
Then, I converted all remaining variables to numeric, including the binary variables. I also converted the minimum and maximum salary values to a numeric format using the “mutate” function.
library(dplyr)
df$Founded <- as.numeric(df$Founded)
df <- df %>%
mutate(min_salary = as.numeric(gsub("[^0-9-]", "", min_salary)),
max_salary = as.numeric(gsub("[^0-9-]", "", max_salary)))
df$age <- as.numeric(df$age)
df$python_yn <- as.numeric(df$python_yn)
df$R_yn <- as.numeric(df$R_yn)
df$spark <- as.numeric(df$spark)
df$aws <- as.numeric(df$aws)
df$excel <- as.numeric(df$excel)
I changed the last variable to numeric and decided to remove the hourly and employer_provided variables, because they did not add valuable information to my PCA analysis of salary determinants. I selected the numeric variables and display the cleaned data below with the str(df) function. The cleaned data has no missing values. I also capitalised the variable names.
library(dplyr)
df$same_state <- as.numeric(df$same_state)
df <- select(df, -hourly, -employer_provided)
df <- select_if(df, is.numeric)
str(df)
## 'data.frame': 742 obs. of 14 variables:
## $ Salary.Estimate: num 72 87.5 85 76.5 114.5 ...
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Founded : num 1973 1984 2010 1965 1998 ...
## $ min_salary : num 53 63 80 56 86 71 54 86 38 120 ...
## $ max_salary : num 91 112 90 97 143 119 93 142 84 160 ...
## $ avg_salary : num 72 87.5 85 76.5 114.5 ...
## $ same_state : num 0 0 1 1 1 1 1 0 1 1 ...
## $ age : num 47 36 10 55 22 20 12 15 6 11 ...
## $ python_yn : num 1 1 1 1 1 1 0 1 0 1 ...
## $ R_yn : num 0 0 0 0 0 0 0 0 0 0 ...
## $ spark : num 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : num 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : num 1 0 1 0 1 1 1 1 0 0 ...
## $ Size_mean : num 750 10001 750 3000 126 ...
# count missing values in each column
colSums(is.na(df))
## Salary.Estimate Rating Founded min_salary max_salary
## 0 0 0 0 0
## avg_salary same_state age python_yn R_yn
## 0 0 0 0 0
## spark aws excel Size_mean
## 0 0 0 0
library(tools)
colnames(df) <- toTitleCase(colnames(df))
colnames(df)
## [1] "Salary.Estimate" "Rating" "Founded" "Min_salary"
## [5] "Max_salary" "Avg_salary" "Same_state" "Age"
## [9] "Python_yn" "R_yn" "Spark" "Aws"
## [13] "Excel" "Size_mean"
The code summary(df) generates a summary of the dataset df. The summary provides information about the variables in the dataset: Salary.Estimate, Rating, Founded, Min_salary, Max_salary, Avg_salary, Same_state, Age, Python_yn, R_yn, Spark, Aws, Excel, and Size_mean.
For each variable, the summary provides several descriptive statistics, including the minimum value, the first quartile, the median, the mean, the third quartile, and the maximum value. These statistics provide a quick overview of the distribution of each variable in the dataset.
For example, we can see that the mean Avg_salary is around 100.6, with a minimum of 13.5 and a maximum of 254.0. We can also see that the Same_state variable is binary, with a mean of 0.558, indicating that slightly more than half of the jobs in the dataset are in the same state as the company headquarters.
summary(df)
## Salary.Estimate Rating Founded Min_salary
## Min. : 13.5 Min. :-1.000 Min. : -1 Min. : 10.00
## 1st Qu.: 73.5 1st Qu.: 3.300 1st Qu.:1939 1st Qu.: 52.00
## Median : 97.5 Median : 3.700 Median :1988 Median : 69.50
## Mean :100.6 Mean : 3.619 Mean :1837 Mean : 74.07
## 3rd Qu.:122.5 3rd Qu.: 4.000 3rd Qu.:2007 3rd Qu.: 91.00
## Max. :254.0 Max. : 5.000 Max. :2019 Max. :202.00
## Max_salary Avg_salary Same_state Age
## Min. : 16.0 Min. : 13.5 Min. :0.000 Min. : -1.00
## 1st Qu.: 96.0 1st Qu.: 73.5 1st Qu.:0.000 1st Qu.: 11.00
## Median :124.0 Median : 97.5 Median :1.000 Median : 24.00
## Mean :127.2 Mean :100.6 Mean :0.558 Mean : 46.59
## 3rd Qu.:155.0 3rd Qu.:122.5 3rd Qu.:1.000 3rd Qu.: 59.00
## Max. :306.0 Max. :254.0 Max. :1.000 Max. :276.00
## Python_yn R_yn Spark Aws
## Min. :0.0000 Min. :0.000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.000000 Median :0.0000 Median :0.0000
## Mean :0.5283 Mean :0.002695 Mean :0.2251 Mean :0.2372
## 3rd Qu.:1.0000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000000 Max. :1.0000 Max. :1.0000
## Excel Size_mean
## Min. :0.0000 Min. : -1.0
## 1st Qu.:0.0000 1st Qu.: 350.5
## Median :1.0000 Median : 750.5
## Mean :0.5229 Mean : 3456.1
## 3rd Qu.:1.0000 3rd Qu.: 7500.5
## Max. :1.0000 Max. :10001.0
The correlation plot generated by the code indicates the strength and direction of the correlation between pairs of variables in the dataset. The strength of the correlation is indicated by the color and size of the circles in the plot.
Variables that are strongly positively correlated have a large circle that is colored in shades of purple. Variables that are strongly negatively correlated have a large circle that is colored in shades of red.
Variables that are weakly correlated have smaller circles that are green or light blue.
library(corrplot)
## corrplot 0.92 loaded
library(RColorBrewer)
# Define color palette
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(n = 200)
# Compute correlation matrix
cor.matrix <- cor(df, method = "pearson")
# Plot correlation matrix
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col=colors)
Based on the correlation plot, most of the variables are weakly correlated. The exceptions are Min_salary, Max_salary and Avg_salary, which are strongly correlated with each other and with Salary.Estimate. I therefore decided to drop these three variables and keep Salary.Estimate, because it is the dependent variable in my project.
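As a quick numeric check of this claim, the relevant block of the correlation matrix computed above can be printed directly; this is only a small sketch, and the helper name salary_vars is just for illustration.
# correlations behind the decision to drop the salary range variables (sketch; reuses cor.matrix)
salary_vars <- c("Salary.Estimate", "Min_salary", "Max_salary", "Avg_salary")
round(cor.matrix[salary_vars, salary_vars], 2)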
library(dplyr)
df <- select(df, -Min_salary, -Max_salary, -Avg_salary)
str(df)
## 'data.frame': 742 obs. of 11 variables:
## $ Salary.Estimate: num 72 87.5 85 76.5 114.5 ...
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Founded : num 1973 1984 2010 1965 1998 ...
## $ Same_state : num 0 0 1 1 1 1 1 0 1 1 ...
## $ Age : num 47 36 10 55 22 20 12 15 6 11 ...
## $ Python_yn : num 1 1 1 1 1 1 0 1 0 1 ...
## $ R_yn : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Spark : num 0 0 1 0 0 0 0 1 0 1 ...
## $ Aws : num 0 0 0 0 0 1 0 1 0 0 ...
## $ Excel : num 1 0 1 0 1 1 1 1 0 0 ...
## $ Size_mean : num 750 10001 750 3000 126 ...
PCA works best when the variables have an approximately normal, or at least not strongly skewed, distribution. As visible in the histograms below, additional transformations are needed to get closer to this.
library(ggplot2)
# create a list of histogram plots for each variable
plots <- lapply(names(df), function(x) {
ggplot(df, aes(x = .data[[x]])) +
geom_histogram(bins = 10, fill = "yellow", color = "black") +
labs(title = paste(x), x = x, y = "Count")
})
# display the plots together using grid.arrange
gridExtra::grid.arrange(grobs = plots, ncol = 3)
It is difficult to determine whether dropping all binary variables in my project would be a good decision.
Binary variables can be very informative in some contexts, so dropping them without good reason may result in the loss of potentially valuable information. On the other hand, if the binary variables are not very informative, are highly correlated with other variables in the dataset, or introduce multicollinearity into the analysis, it may be reasonable to drop them. In my project this is not the case, so I decided to keep the binary variables.
Before performing PCA on a mix of binary and numeric variables, I scaled the binary variables. Since PCA is sensitive to the scale of the variables, the numeric variables are also standardised before the PCA is computed, via the center and scale arguments of prcomp().
df[, c("Same_state", "Python_yn", "R_yn", "Spark", "Aws", "Excel")] <- scale(df[, c("Same_state", "Python_yn", "R_yn", "Spark", "Aws", "Excel")], center = TRUE, scale = TRUE)
To perform a logarithmic transformation of the skewed variables in the df data frame, I used the log() function in R, overwriting the original columns with their log-transformed values. These transformed variables should have a more normal distribution than the original ones. Because some columns contain the placeholder value -1 (visible as the minimum in the summary above), the transformation produced NaNs, which I removed with na.omit(). The green histograms below show that the distributions after this change are closer to normal.
df$Salary.Estimate <- log(df$Salary.Estimate)
df$Rating <- log(df$Rating)
## Warning in log(df$Rating): NaNs produced
df$Founded <- log(df$Founded)
## Warning in log(df$Founded): NaNs produced
df$Age <- log(df$Age)
## Warning in log(df$Age): NaNs produced
df$Size_mean <- log(df$Size_mean)
## Warning in log(df$Size_mean): NaNs produced
# count missing values in each column
colSums(is.na(df))
## Salary.Estimate Rating Founded Same_state Age
## 0 11 50 0 50
## Python_yn R_yn Spark Aws Excel
## 0 0 0 0 0
## Size_mean
## 1
# modify the original data frame
df <- na.omit(df)
# count missing values in each column
colSums(is.na(df))
## Salary.Estimate Rating Founded Same_state Age
## 0 0 0 0 0
## Python_yn R_yn Spark Aws Excel
## 0 0 0 0 0
## Size_mean
## 0
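As a side note, these NaNs come from the -1 placeholder values. A minimal alternative sketch, assuming a fresh copy of the cleaned data frame called df_raw (a hypothetical name) in which -1 marks an unknown value, would recode those placeholders as NA before taking logs, which avoids the NaN warnings while dropping the same rows.
# alternative sketch (not run here): recode -1 placeholders to NA before the log transform
placeholder_cols <- c("Rating", "Founded", "Age", "Size_mean")
df_raw[placeholder_cols] <- lapply(df_raw[placeholder_cols], function(x) replace(x, x <= 0, NA))
df_raw[placeholder_cols] <- lapply(df_raw[placeholder_cols], log)  # log(NA) stays NA, no warning
df_raw <- na.omit(df_raw)  # Salary.Estimate would still be log-transformed as above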
library(ggplot2)
# create a list of histogram plots for each variable
plots <- lapply(names(df), function(x) {
ggplot(df, aes(x = .data[[x]])) +
geom_histogram(bins = 10, fill = "green", color = "black") +
labs(title = paste(x), x = x, y = "Count")
})
# display the plots together using grid.arrange
gridExtra::grid.arrange(grobs = plots, ncol = 3)
# PCA
The code below performs a principal component analysis (PCA) on the dataframe df using the prcomp function. The center = TRUE and scale = TRUE arguments standardize the variables before computing the PCA.
The summary(df.pca) function call then prints a summary of the results of the PCA analysis, including the standard deviation, proportion of variance, and cumulative proportion of variance explained by each principal component (PC).
There are 11 principal components in total, with PC1 explaining 23.55% of the variance, PC2 explaining 16.10% of the variance, and so on. The cumulative proportion of variance is also shown: the first two PCs account for 39.66% of the total variance and the first three PCs for 49.50%.
The summary also displays the standard deviation for each PC. The standard deviation gives an indication of how much of the variability in the original dataset is captured by each PC. Finally, the proportion of variance for each PC is shown, which gives an indication of how much information is retained by each component. The higher the proportion of variance, the more important the component is in capturing the original variation in the data.
df.pca <- prcomp(df, center = TRUE, scale = TRUE)
summary(df.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6095 1.3310 1.04045 1.02307 0.97383 0.95625 0.88389
## Proportion of Variance 0.2355 0.1610 0.09841 0.09515 0.08621 0.08313 0.07102
## Cumulative Proportion 0.2355 0.3966 0.49497 0.59012 0.67633 0.75946 0.83048
## PC8 PC9 PC10 PC11
## Standard deviation 0.83449 0.76419 0.68083 0.34756
## Proportion of Variance 0.06331 0.05309 0.04214 0.01098
## Cumulative Proportion 0.89379 0.94688 0.98902 1.00000
The get_eigenvalue function is used to extract the eigenvalues and variance explained by each principal component from the prcomp object df.pca.
The output eig.val displays the eigenvalues and variance percentages for each of the 11 principal components. The first column gives the eigenvalues, the second column gives the percentage of variance explained by each principal component, and the third column gives the cumulative percentage of variance explained up to that component.
From this output, we can see that the first principal component (Dim.1) has the largest eigenvalue and explains 23.55% of the variance in the data. The second principal component (Dim.2) has the second-largest eigenvalue and explains an additional 16.10% of the variance. Together, the first two principal components explain 39.66% of the variance in the data.
The cumulative variance explained increases with each additional principal component, and by the 11th component, 100% of the variance is explained.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
eig.val <- get_eigenvalue(df.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.5905641 23.550582 23.55058
## Dim.2 1.7715188 16.104716 39.65530
## Dim.3 1.0825458 9.841325 49.49662
## Dim.4 1.0466705 9.515187 59.01181
## Dim.5 0.9483429 8.621299 67.63311
## Dim.6 0.9144178 8.312889 75.94600
## Dim.7 0.7812700 7.102454 83.04845
## Dim.8 0.6963670 6.330609 89.37906
## Dim.9 0.5839797 5.308906 94.68797
## Dim.10 0.4635271 4.213883 98.90185
## Dim.11 0.1207964 1.098149 100.00000
The scree plot is a visualization of the eigenvalues of the principal components, showing the eigenvalue for each dimension. As visible in the previous table, Dim.1 has an eigenvalue of 2.5905641, Dim.2 of 1.7715188, Dim.3 of 1.0825458 and Dim.4 of 1.0466705, while the remaining dimensions have eigenvalues smaller than 1, so we should not include them in the further analysis.
The plot also shows an “elbow” at around the third principal component, indicating that the first few components may be the most important in explaining the variance in the data. Overall, the scree plot suggests that 3 or 4 principal components may be sufficient to capture most of the variation in the data.
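The “eigenvalue greater than 1” rule used above can also be applied programmatically; a minimal sketch using the eig.val table computed earlier:
# Kaiser criterion (sketch): count the components with eigenvalue greater than 1
sum(eig.val[, "eigenvalue"] > 1)  # 4 for this data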
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(factoextra)
library(labdsv)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.
## This is labdsv 2.0-1
## convert existing ordinations with as.dsvord()
##
## Attaching package: 'labdsv'
## The following object is masked from 'package:psych':
##
## pca
## The following object is masked from 'package:stats':
##
## density
library(psy)
##
## Attaching package: 'psy'
## The following object is masked from 'package:psych':
##
## wkappa
scree.plot(eig.val[,1], title = "Scree Plot", type = "E", simu = "P")
The plot below shows the proportion of the total variance in the data explained by each principal component. The first principal component explains the largest proportion of variance (23.55%), followed by the second (16.10%), and so on. The plot also shows an “elbow” at the third principal component, indicating that these components may be the most important in explaining the variance in the data. The outcomes of this plot are the same as those from the previous scree plot above.
fviz_eig(df.pca)
The plot shows the contribution of each variable to the principal components in the PCA analysis. The direction and length of each arrow represent the correlation (loading) of the variable with the displayed components, while the colour gradient indicates the variable’s contribution to those components. The plot helps to identify variables that are strongly associated with each principal component, variables that are important for distinguishing different groups or clusters in the data, and, more generally, which variables are most important for explaining the variation in the data.
The plot indicates that the most important variables in the PCA analysis are “Founded”, “Age”, and “Size_mean”. The least important are the “Excel” and “R_yn” variables.
library(ggpubr)
library(factoextra)
library(RColorBrewer)
# Create a rainbow color palette
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(length(df.pca$sdev))
# Plot PCA results
p <- fviz_pca_var(df.pca, col.var = "contrib", repel = TRUE,
gradient.cols = colors, col.axis = "#757575") +
scale_color_gradientn(colors = colors)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
# Adjust label size and position
p <- ggpar(p, label_size = 6, label_y = 0.6)
# Print plot
print(p)
# Loadings
The first plot shows the variable loadings for PC1 and PC2 in a scatter plot. Each variable is represented by an arrow pointing in the direction of its loading for that PC. The length of the arrow represents the magnitude of the loading, and the color represents the contribution of that variable to the PC, with a rainbow color palette ranging from red (low contribution) to purple (high contribution). The color scale ranges from 0% to 100%, and the colorbar on the right side of the plot shows the corresponding labels for each value. The plot helps to identify which variables contribute most to the separation of the observations in the PC1 and PC2 dimensions, which are: “Age” and “Founded”.
# Define the rainbow color palette
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(100)
# Plot the variable loadings for PC1 and PC2, with rainbow color palette
fviz_pca_var(df.pca, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
scale_color_gradientn(colors = colors, limits = c(0, 100), guide = "colorbar",
breaks = seq(0, 100, by = 10),
labels = c("0%", "10%", "20%", "30%", "40%", "50%", "60%", "70%", "80%", "90%", "100%"))
The second plot shows the variable loadings for PC2 and PC3, with the same rainbow color palette. Here, the variable that is mainly driving the separation between different samples or groups is the “Excel” variable.
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(100)
# Plot the variable loadings for PC2 and PC3, with rainbow color palette
fviz_pca_var(df.pca, col.var="contrib", repel = TRUE, axes = c(2, 3)) +
labs(title="Variables loadings for PC2 and PC3", x="PC2", y="PC3")+
scale_color_gradientn(colors = colors, limits = c(0, 100), guide = "colorbar",
breaks = seq(0, 100, by = 10),
labels = c("0%", "10%", "20%", "30%", "40%", "50%", "60%", "70%", "80%", "90%", "100%"))
The plot produced by plot(summary(df.pca)$importance[3,]) displays the cumulative proportion of the total variance explained by the principal components (PCs). The x-axis shows the PC number, while the y-axis shows the cumulative proportion of variance explained up to that component. The plot allows for visual inspection of the number of principal components to retain for further analysis.
plot(summary(df.pca)$importance[3,])
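A numeric counterpart to this plot, sketched below with the illustrative name cum_var, is to ask how many components are needed to reach a given threshold of cumulative variance, for example 80%.
# sketch: first component at which cumulative variance reaches 80%
cum_var <- summary(df.pca)$importance[3, ]
which(cum_var >= 0.80)[1]  # for this data: PC7 (83.0% cumulative)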
This plot shows the principal component analysis (PCA) results for the individual data points in the dataset. The colour scale represents the quality of representation (cos2) of each individual on the principal components, ranging from red (poorly represented) to purple (well represented). This plot can help identify patterns and clusters of data points in the PCA results.
# Define a rainbow color palette
colors <- colorRampPalette(c("red", "orange", "yellow", "green", "cyan", "blue", "purple"))(n = nrow(df.pca$x))  # one colour per individual; prcomp stores the scores in $x
# Plot the PCA results with a rainbow color scheme
fviz_pca_ind(df.pca, col.ind = "cos2", geom = "point",
gradient.cols = colors) +
scale_color_gradientn(colors = colors)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
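The cos2 values behind the colour scale can also be extracted numerically via factoextra; a short sketch (pca_ind is just an illustrative name) for the first few individuals:
# quality of representation (cos2) of the first individuals on the first three PCs (sketch)
pca_ind <- get_pca_ind(df.pca)
round(head(pca_ind$cos2[, 1:3]), 2)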
The plots show the contribution of each variable to the first three principal components of the PCA, one plot per component. In each plot, the variables are ranked by their contribution to that specific component: the taller a variable’s bar, the more it contributes. By examining the plots, we can identify the variables that are most important in explaining the variation of the data along each principal component.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Create plots for each principal component
PC1 <- fviz_contrib(df.pca, choice = "var", axes = 1,fill = "cyan",color = "purple") +
ggtitle("Contribution to Dim-1")
PC2 <- fviz_contrib(df.pca, choice = "var", axes = 2,fill = "green",color = "purple") +
ggtitle("Contribution to Dim-2")
PC3 <- fviz_contrib(df.pca, choice = "var", axes = 3,fill = "yellow",color = "purple") +
ggtitle("Contribution to Dim-3")
# Combine the plots into a single grid
grid.arrange(PC1, PC2, PC3, ncol = 3)
This paper aimed to determine the most important variables influencing the Salary.Estimate variable, i.e. the estimated salary level for each job posting. The result of the PCA analysis was a reduction of the number of variables from 28 to 11, using the numeric and binary data types.
The scree plots and the elbow method showed that the optimal number of dimensions is three, and that these three components may be the most important in explaining the variance in the data.
The first principal component is dominated by three variables: “Age”, “Founded” and “Size_mean”. The second principal component is dominated by four variables: “Salary.Estimate”, “Python_yn”, “Spark” and “Aws”. The third is dominated by only two: “Excel” and “Same_state”. This suggests that the variables that can best explain the changes in the Salary.Estimate variable are “Age”, “Founded” and “Size_mean”.
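These groupings can be checked numerically as well; a small sketch (pca_vars is just an illustrative name) listing the variable contributions, in percent, to the first three components:
# variable contributions (%) to the first three principal components (sketch)
pca_vars <- get_pca_var(df.pca)
round(pca_vars$contrib[, 1:3], 1)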
I will use this research as part of the development of my recent business project “Recrupay”, where the practical use of the analysis made in this dimension-reduction project can be seen. The results showed that when estimating salaries on my website, I should focus mostly on variables such as the age of the company, the year the company was founded and the size of the company. This information can be useful both for candidates and recruiters, as both parties perform salary estimation for various purposes.
Salary Prediction. (2022, November 16). Kaggle. https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor
Recrupay. (n.d.). Recrupay. https://www.recrupay.com/
RPubs - Dimension Reduction. (n.d.). https://rpubs.com/paulinazabska/DimensionReduction
Zhang, T., & Yang, B. (2016). Big Data Dimension Reduction Using PCA. 2016 IEEE International Conference on Smart Cloud (SmartCloud). https://doi.org/10.1109/smartcloud.2016.33