Job-Education Mismatch

Problem

Does job-education mismatch depend on major?

About Dataset

The dataset consists of responses from students who took part in our survey. It covers the participants’ demographics (gender, year of birth), education level, major, current employment status, job-education mismatch, job satisfaction, whether their degree was necessary for their current job, whether they chose their major for a specific career field, whether they have considered pursuing additional qualifications, whether their education adequately prepared them for job searching, whether they experienced difficulties finding a job in their desired field, and the factors contributing to those difficulties. The dependent variable “Job-education mismatch” has five categories: Moderate mismatch, Minor mismatch, Not applicable, No mismatch, and Major mismatch.

Analyzing the Dataset

Preliminary

First, we loaded the libraries required to analyse and visualise the dataset.

# Load the required libraries
library(ggplot2)
library(gridExtra)
library(nnet)  
library(tidyverse)  
# Read in the data
data <- read.csv("Questionnaire.csv")

# Define the new variable names
new_variable_names <- c("time_stamp", "email", "gender", "year_of_birth", "education_level", "major", "employment_status", "major_job_relation", "job_field", "job_satisfaction", "job_education_mismatch", "degree_necessary", "major_career", "additional_courses", "education_job_preparation", "job_search_difficulties", "job_search_factors", "different_major", "additional_comments")

# Replace the column names with the new variable names
names(data) <- new_variable_names

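# Extract the year from the year_of_birth variable and save it as a new variable called "year"
# Note: as.numeric() turns date strings such as "09/01/1987" into NA before str_extract() runs,
# so "year" is NA for those rows (this is what triggers the coercion warning)
data$year <- str_extract(as.numeric(data$year_of_birth), "\\d{4}")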
## Warning in vctrs::vec_size_common(string = string, pattern = pattern,
## replacement = replacement, : NAs introduced by coercion
# Save the changes
write.csv(data, "Questionnaire_new.csv", row.names = FALSE)
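
The coercion warning above comes from calling as.numeric() on date strings. A minimal sketch of one possible fix is shown below, assuming the birth year always appears as four consecutive digits somewhere in the string. The `dob` vector is a hypothetical sample that mirrors the formats visible in year_of_birth; it is not the survey data.

```r
library(stringr)

# Hypothetical sample mirroring the formats seen in year_of_birth
dob <- c("09/01/1987", "24.11.1988", "16/06/1986", "1990", "")

# Extract the four-digit year from the raw string first,
# then convert the extracted text to an integer
year <- as.integer(str_extract(dob, "\\d{4}"))
year
```

Applied to the survey, the same pattern would read `data$year <- as.integer(str_extract(data$year_of_birth, "\\d{4}"))`. Entries without a four-digit run stay NA.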

Data Cleaning and EDA

Missing Value Checks

# Show the rows that contain at least one NA
data[!complete.cases(data), ]
# Count the NAs in each column
colSums(is.na(data))
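
One caveat: read.csv() keeps blank answers in character columns as empty strings, so is.na() does not count them (job_field, for example, contains "" values). The sketch below shows one way to count blanks together with NAs. The `toy` data frame is hypothetical and only borrows two of the survey's column names.

```r
# Hypothetical mini data frame; column names follow the renamed survey columns
toy <- data.frame(
  job_field = c("IT - software testing", "", "Software Engineering"),
  job_search_factors = c("", "Other (please specify)", NA),
  stringsAsFactors = FALSE
)

# Count values that are NA or blank after trimming whitespace
missing_counts <- sapply(toy, function(col) sum(is.na(col) | trimws(col) == ""))
missing_counts
```

Alternatively, passing `na.strings = c("", "NA")` to read.csv() would convert blank answers to NA at import time, so that colSums(is.na(data)) reports them directly.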

Check the Changes and the Final Version
head(data)

EDA

# Check the structure of the data
str(data)
## 'data.frame':    27 obs. of  20 variables:
##  $ time_stamp               : chr  "2023/04/14 11:36:46 ÖÖ GMT+2" "2023/04/14 11:40:19 ÖÖ GMT+2" "2023/04/14 11:53:07 ÖÖ GMT+2" "2023/04/14 12:07:06 ÖS GMT+2" ...
##  $ email                    : chr  "piola48@wp.pl" "semrabaktir.eng@gmail.com" "coskunhacer@gmail.com" "hamedahmedhamed100@gmail.com" ...
##  $ gender                   : chr  "Female" "Female" "Female" "Male" ...
##  $ year_of_birth            : chr  "09/01/1987" "24.11.1988" "16/06/1986" "04/01/1997" ...
##  $ education_level          : chr  "Bachelor's degree (e.g. BA, BS)" "Bachelor's degree (e.g. BA, BS)" "Master's degree (e.g. MA, MS)" "Bachelor's degree (e.g. BA, BS)" ...
##  $ major                    : chr  "Economics" "Education" "Technology;Finance, Accounting" "Data Science and/or Business Analytics" ...
##  $ employment_status        : chr  "Employed full-time" "Unemployed and currently not looking for work" "Employed full-time" "Employed full-time" ...
##  $ major_job_relation       : chr  "No" "Yes" "Partially related" "No" ...
##  $ job_field                : chr  "IT - software testing" "" "" "Software Engineering" ...
##  $ job_satisfaction         : chr  "Very satisfied" "Not applicable" "Neutral" "Somewhat satisfied" ...
##  $ job_education_mismatch   : chr  "Moderate mismatch" "Moderate mismatch" "Minor mismatch" "Moderate mismatch" ...
##  $ degree_necessary         : chr  "No, my degree was not required but it was helpful in obtaining my current job." "N/A - I am not currently employed or my degree is not relevant to my current job." "No, my degree was not required but it was helpful in obtaining my current job." "N/A - I am not currently employed or my degree is not relevant to my current job." ...
##  $ major_career             : chr  "No" "Yes" "Not Sure" "Yes" ...
##  $ additional_courses       : chr  "Yes, I plan to pursue additional qualifications in the near future." "Yes, I plan to pursue additional qualifications in the near future." "Yes, I plan to pursue additional qualifications in the near future." "Yes, I plan to pursue additional qualifications in the near future." ...
##  $ education_job_preparation: chr  "No, not at all." "No, not at all." "Yes, somewhat." "Yes, somewhat." ...
##  $ job_search_difficulties  : chr  "No" "Not sure" "No" "Yes" ...
##  $ job_search_factors       : chr  "" "Other (please specify)" "Limited networking opportunities" "Lack of relevant experience" ...
##  $ different_major          : chr  "Yes," "Yes," "Maybe, but I am not sure what field would be better suited to my current job" "No, I am happy with my current field of study" ...
##  $ additional_comments      : chr  "I believe that we don't know what we want to do when we are 18. With time and experience we can find out our field." "" "" "" ...
##  $ year                     : chr  NA NA NA NA ...
# Check the summary statistics of the numerical variables
summary(data$year)
##    Length     Class      Mode 
##        27 character character
# Check the distribution of the categorical variables
table(data$gender)
## 
##            Female              Male Prefer not to say 
##                15                 6                 6
table(data$education_level)
## 
## Associate's degree (e.g. AA, AS)  Bachelor's degree (e.g. BA, BS) 
##                                1                               14 
##  Doctoral degree (e.g. PhD, EdD)          High school diploma/GED 
##                                1                                3 
##    Master's degree (e.g. MA, MS)      Some college, but no degree 
##                                4                                4
# Check for missing values
colSums(is.na(data))
##                time_stamp                     email                    gender 
##                         0                         0                         0 
##             year_of_birth           education_level                     major 
##                         0                         0                         0 
##         employment_status        major_job_relation                 job_field 
##                         0                         0                         0 
##          job_satisfaction    job_education_mismatch          degree_necessary 
##                         0                         0                         0 
##              major_career        additional_courses education_job_preparation 
##                         0                         0                         0 
##   job_search_difficulties        job_search_factors           different_major 
##                         0                         0                         0 
##       additional_comments                      year 
##                         0                        26
# Create bar plots of categorical variables
ggplot(data, aes(x = gender)) +
  geom_bar() +
  ggtitle("Gender distribution")

ggplot(data, aes(x = education_level)) +
  geom_bar() +
  ggtitle("Education level distribution")

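# Both variables are categorical, so geom_count() sizes each point by frequency
# instead of drawing identical answers on top of each other
ggplot(data, aes(x = job_satisfaction, y = education_level)) +
  geom_count() +
  ggtitle("Job satisfaction vs. education level")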

# TODO: the year variable is almost entirely NA and needs to be checked and fixed;
# the numerical checks below stay commented out until then
# Check for outliers in the numerical variables
#boxplot(data$year)

# Create histograms of numerical variables
#hist(data$year)

# Create scatter plots to explore relationships between variables
#ggplot(data, aes(x = job_satisfaction, y = age)) +
 # geom_point() +
  #ggtitle("Job satisfaction vs. age")

# Create a correlation matrix to explore relationships between variables
#cor(data[, c("age", "job_satisfaction")])
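
Since the question is whether mismatch depends on major, a cross-tabulation of the two variables is a natural addition to the EDA. The sketch below uses a hypothetical `toy` data frame with made-up rows, purely to show the shape of the table. No survey results are implied.

```r
# Hypothetical rows; only the column names are taken from the survey
toy <- data.frame(
  major = c("Economics", "Education", "Economics", "Education", "Economics"),
  job_education_mismatch = c("Moderate mismatch", "No mismatch",
                             "Moderate mismatch", "Minor mismatch", "No mismatch")
)

# Counts of each mismatch category within each major
counts <- table(toy$major, toy$job_education_mismatch)

# Row proportions: share of each mismatch category within a major
props <- prop.table(counts, margin = 1)
round(props, 2)
```

On the real data the equivalent call would be `table(data$major, data$job_education_mismatch)`. With 27 respondents many cells will be empty or contain a single answer, which is worth keeping in mind when reading any model output.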

Building Model

We chose a multinomial logistic regression model because the dependent variable “Job-education mismatch” is categorical with five categories: Moderate mismatch, Minor mismatch, Not applicable, No mismatch, and Major mismatch. Multinomial logistic regression is designed for categorical dependent variables with more than two unordered categories. It estimates the probability of each category of the dependent variable given a set of independent variables, which makes it appropriate for analyzing the relationship between the independent variables and job-education mismatch.
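
For reference, the model can be written in its usual baseline-category form. The notation here is an assumption introduced for illustration: x is the vector of independent variables for one respondent, beta_k is the coefficient vector for category k, K = 5 is the number of categories, and category 1 is the reference level (multinom() uses the first factor level as the reference).

```latex
P(Y = k \mid \mathbf{x})
  = \frac{\exp(\mathbf{x}^{\top}\boldsymbol{\beta}_k)}
         {1 + \sum_{j=2}^{K} \exp(\mathbf{x}^{\top}\boldsymbol{\beta}_j)},
  \quad k = 2, \dots, K,
\qquad
P(Y = 1 \mid \mathbf{x})
  = \frac{1}{1 + \sum_{j=2}^{K} \exp(\mathbf{x}^{\top}\boldsymbol{\beta}_j)}
```

Each coefficient in beta_k is therefore a log-odds effect relative to the reference category, and the K probabilities sum to one.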

# Create a factor variable for the dependent variable
data$job_education_mismatch <- factor(data$job_education_mismatch)

# Fit the multinomial logistic regression model
model <- multinom(job_education_mismatch ~ year_of_birth + gender + education_level +
                    major + employment_status + job_satisfaction + additional_courses
                  + education_job_preparation + job_search_difficulties, data = data)
## # weights:  390 (320 variable)
## initial  value 48.377506 
## iter  10 value 0.187543
## iter  20 value 0.000318
## final  value 0.000088 
## converged
# Summarize the results
summary(model)
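
The fit above estimates 320 parameters from 27 observations and converges to a value close to zero, which suggests that it reproduces the sample almost exactly rather than estimating a generalizable effect. Part of the reason is probably that year_of_birth enters as a raw date string and is therefore treated as a categorical variable. A smaller model may answer the stated question more directly. The sketch below compares an intercept-only model with a major-only model through a likelihood-ratio test. The `toy` data frame is simulated and hypothetical, and the choice of a major-only model is an assumption, not the analysis performed above.

```r
library(nnet)

# Simulated, hypothetical data; the real analysis would use the survey data frame
set.seed(1)
toy <- data.frame(
  major = factor(rep(c("Economics", "Education", "Technology"), each = 20)),
  mismatch = factor(sample(c("No mismatch", "Minor mismatch", "Major mismatch"),
                           60, replace = TRUE))
)

# Intercept-only model versus a model with major as the only predictor
null_model  <- multinom(mismatch ~ 1, data = toy, trace = FALSE)
major_model <- multinom(mismatch ~ major, data = toy, trace = FALSE)

# Likelihood-ratio test: drop in deviance against the difference in parameters
lrt_stat <- null_model$deviance - major_model$deviance
lrt_df   <- major_model$edf - null_model$edf
p_value  <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
c(statistic = lrt_stat, df = lrt_df, p_value = p_value)
```

`anova(null_model, major_model)` reports the same comparison. With a sample as small as 27 the chi-squared approximation is rough, so the result should be read as indicative only.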