Recently, I got interested in analyzing H1B facts and figures in the United States because my most recent employer has moved forward with an H1B application. I took on this analysis thinking it would provide ample opportunities to hone my data science skills and learn some advanced ones. This work features all key aspects of data science outlined in (Estrellado et al., 2021):
This is research on the trend of H1B employees in the United States. The data come from the online repository of the Department of Labor. The data don't tell us whether somebody applied for or actually got the H1B visa; nonetheless, they describe the critical step before the actual H1B application. In addition, they provide insights into the hiring company, job type, prevailing wage, state, etc. I am going to dive deep; I don't really have any agenda here. I am open to exploration and surprises!
Along with numerous online resources, especially Google Search and Twitter feeds, my analyses rely heavily on these references:
Now, let’s move sequentially and strategically.
I want to do a trend analysis. I am planning to get data from all quarters of 2020 and 2021 and the first quarter of 2022. Based on the layout of the data listed on https://www.dol.gov/agencies/eta/foreign-labor/performance, I have to download 9 different data sets. I have to work on tables, which requires the tidyverse package; I have to save the downloaded files, and an easy way to build the paths is the here package; working with dates requires the lubridate package; and finally, the readxl package reads the .xlsx data files.
library(tidyverse)
library(here)
library(lubridate)
library(readxl)
library(gt)
library(ggplot2)
library(janitor)
library(sjlabelled)
library(stringr)
Disclaimer:
My intention in conducting this research is manifold. My first objective is to hone my skills of working in the R environment and to successfully complete a project. I have done so many times in the past, but I feel like I have to keep practicing. Second, for some time now, I have taken a lot of courses in R. I learned a lot, but I barely remember most of it. When I sit down to work in R, I stumble on every small bump, the fundamental aspects, and I need to look into my notes or do a Google search. The thing is, I overlooked the basics of R and tried to achieve big things, i.e., using R for advanced statistical research that requires rigorous methodological insight: Hierarchical Linear Modeling (HLM), Structural Equation Modeling (SEM), longitudinal modeling, multivariate regression, and even machine learning. Once you are there, you realize you lack the basic building blocks. Thus, my intention is to use these basic tools and learn and/or relearn them through this study. This is why I am overusing some of the R features above and will do so throughout this paper. Once done, I want to keep this document handy so that I can refer back to it over and over.
Finally, my intention is to look into the situation of H1B applicants and learn more about this population. I conducted a small pilot study using one quarter of data, i.e., about 120,000 data points, which inspired me to use bigger data to strengthen the findings. I am currently in the process of becoming one of these applicants, and knowing the landscape better should help me keep reasonable expectations. I am using raw data; I have not read any serious research on this topic, and I don't know what to expect. Maybe there is good research out there that could help me ask better questions, but I would rather not look yet. Thus, everything I come up with will be a pure discovery for me.
The download.file() function lets me type in the url where the data are located and a destfile where I can save them, but it did not work properly for me. So, I downloaded the data manually and saved them on my local disk in a folder named all_data.
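For reference, a sketch of the kind of call I attempted is below; the URL shown is illustrative (the real links are on the DOL performance page), and mode = "wb" matters for binary files like .xlsx on Windows.
download.file(
  url = "https://www.dol.gov/.../LCA_Disclosure_Data_FY2020_Q4.xlsx", # illustrative URL
  destfile = here::here("all_data", "LCA_Disclosure_Data_FY2020_Q4.xlsx"),
  mode = "wb" # write in binary mode so the .xlsx file is not corrupted
)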
Now, I want to make a quick check on one of the data sets:
read_excel(here::here(
"all_data",
"LCA_Disclosure_Data_FY2020_Q4.xlsx"
))
# A tibble: 117,395 x 96
CASE_NUMBER CASE_STATUS RECEIVED_DATE DECISION_DATE
<chr> <chr> <dttm> <dttm>
1 I-200-20176-676243 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
2 I-200-20176-676246 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
3 I-200-20176-676247 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
4 I-200-20176-676251 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
5 I-200-20176-676254 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
6 I-200-20176-676255 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
7 I-200-20176-676257 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
8 I-200-20176-676259 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
9 I-200-20176-676260 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
10 I-200-20176-676265 Certified 2020-06-24 00:00:00 2020-07-01 00:00:00
# ... with 117,385 more rows, and 92 more variables: ORIGINAL_CERT_DATE <lgl>,
# VISA_CLASS <chr>, JOB_TITLE <chr>, SOC_CODE <chr>, SOC_TITLE <chr>,
# FULL_TIME_POSITION <chr>, BEGIN_DATE <dttm>, END_DATE <dttm>,
# TOTAL_WORKER_POSITIONS <dbl>, NEW_EMPLOYMENT <dbl>,
# CONTINUED_EMPLOYMENT <dbl>, CHANGE_PREVIOUS_EMPLOYMENT <dbl>,
# NEW_CONCURRENT_EMPLOYMENT <dbl>, CHANGE_EMPLOYER <dbl>,
# AMENDED_PETITION <dbl>, EMPLOYER_NAME <chr>, TRADE_NAME_DBA <chr>, ...
The data have been saved, and I got an important peek into them. There are 96 columns and 117,395 rows in the 2020 Q4 data set. I have 9 different data files like this and I want to use all of them in this study. Handling a single file at a time sounds boring and takes a lot of time. Thus, I will be using code that passes through all the data files in the folder and does the designated tasks.
First of all, I want to list all the data files in the folder all_data using the list.files() function. I extract the full paths using the full.names = TRUE argument because I need them to pull the data. Finally, I save the names in the file_path vector so that I can use them easily.
# Saving the file names in a vector
file_path <- xfun::cache_rds({
list.files(
path = here::here("all_data"),
full.names = TRUE
)
})
# Checking
file_path
[1] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q1.xlsx"
[2] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q2.xlsx"
[3] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q3.xlsx"
[4] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q4.xlsx"
[5] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2021_Q1.xlsx"
[6] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2021_Q2.xlsx"
[7] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2021_Q3.xlsx"
[8] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2021_Q4.xlsx"
[9] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2022_Q1.xlsx"
I got the complete paths to the data files stored in the all_data folder. Now, I am going to read the data into R using the read_excel() function, applied to all data files at once with the map() function.
total_data <- xfun::cache_rds({
  file_path %>%
    map(~ read_excel(.x, sheet = 1))
})
Because there are many data files and each contains more than 100,000 data points, the system became pretty slow. Thus, I wrapped the above syntax in xfun::cache_rds({}). It is faster than the cache=TRUE chunk option and easier to reuse.
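To illustrate the mechanism, here is a minimal sketch with a made-up computation: cache_rds() evaluates the expression once, saves the result to an .rds file, and simply reads that file back on subsequent runs.
slow_result <- xfun::cache_rds({
  Sys.sleep(5) # stands in for an expensive computation
  42
})
# First run: ~5 seconds; later runs: near-instant, read from the .rds cache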
My ultimate goal is to create one giant data file combining (row-binding) all of these 9 files. I will be able to do so once all of the data files have the same columns. Let's check a few pairs of data files for consistency in column names using the identical() function.
identical(names(total_data[[1]]), names(total_data[[2]]))
[1] TRUE
identical(names(total_data[[1]]), names(total_data[[6]]))
[1] TRUE
identical(names(total_data[[1]]), names(total_data[[9]]))
[1] FALSE
The output shows that data sets 1 & 2 and 1 & 6 have identical column names, but data set 9 does not. Let's dig in a little more. First, let's see if these data files have an equal number of columns using the map() function:
xfun::cache_rds({
total_data |>
map(ncol)
})
[[1]]
[1] 96
[[2]]
[1] 96
[[3]]
[1] 96
[[4]]
[1] 96
[[5]]
[1] 96
[[6]]
[1] 96
[[7]]
[1] 96
[[8]]
[1] 96
[[9]]
[1] 96
Good grace! All of the data files have an equal number of columns. I know that I am not going to use all of them, but I have to find out what is different among them.
Let’s check for the files 1 and 9.
file_path[[1]]
[1] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q1.xlsx"
file_path[[9]]
[1] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2022_Q1.xlsx"
One of them is quarter 1 data from 2020 and the other is quarter 1 data from 2022. Now, I want to check whether all 2020 data (quarters 1 through 4) differ from quarter 1 of 2022.
identical(names(total_data[[9]]), names(total_data[[2]]))
[1] FALSE
identical(names(total_data[[9]]), names(total_data[[3]]))
[1] FALSE
identical(names(total_data[[9]]), names(total_data[[4]]))
[1] FALSE
Yes. The difference lies between the 2022 data set and the earlier ones. Let's check the column names.
xfun::cache_rds({
data.frame(names(total_data[[1]]), names(total_data[[9]]))
})
names.total_data..1... names.total_data..9...
1 CASE_NUMBER CASE_NUMBER
2 CASE_STATUS CASE_STATUS
3 RECEIVED_DATE RECEIVED_DATE
4 DECISION_DATE DECISION_DATE
5 ORIGINAL_CERT_DATE ORIGINAL_CERT_DATE
6 VISA_CLASS VISA_CLASS
7 JOB_TITLE JOB_TITLE
8 SOC_CODE SOC_CODE
9 SOC_TITLE SOC_TITLE
10 FULL_TIME_POSITION FULL_TIME_POSITION
11 BEGIN_DATE BEGIN_DATE
12 END_DATE END_DATE
13 TOTAL_WORKER_POSITIONS TOTAL_WORKER_POSITIONS
14 NEW_EMPLOYMENT NEW_EMPLOYMENT
15 CONTINUED_EMPLOYMENT CONTINUED_EMPLOYMENT
16 CHANGE_PREVIOUS_EMPLOYMENT CHANGE_PREVIOUS_EMPLOYMENT
17 NEW_CONCURRENT_EMPLOYMENT NEW_CONCURRENT_EMPLOYMENT
18 CHANGE_EMPLOYER CHANGE_EMPLOYER
19 AMENDED_PETITION AMENDED_PETITION
20 EMPLOYER_NAME EMPLOYER_NAME
21 TRADE_NAME_DBA TRADE_NAME_DBA
22 EMPLOYER_ADDRESS1 EMPLOYER_ADDRESS1
23 EMPLOYER_ADDRESS2 EMPLOYER_ADDRESS2
24 EMPLOYER_CITY EMPLOYER_CITY
25 EMPLOYER_STATE EMPLOYER_STATE
26 EMPLOYER_POSTAL_CODE EMPLOYER_POSTAL_CODE
27 EMPLOYER_COUNTRY EMPLOYER_COUNTRY
28 EMPLOYER_PROVINCE EMPLOYER_PROVINCE
29 EMPLOYER_PHONE EMPLOYER_PHONE
30 EMPLOYER_PHONE_EXT EMPLOYER_PHONE_EXT
31 NAICS_CODE NAICS_CODE
32 EMPLOYER_POC_LAST_NAME EMPLOYER_POC_LAST_NAME
33 EMPLOYER_POC_FIRST_NAME EMPLOYER_POC_FIRST_NAME
34 EMPLOYER_POC_MIDDLE_NAME EMPLOYER_POC_MIDDLE_NAME
35 EMPLOYER_POC_JOB_TITLE EMPLOYER_POC_JOB_TITLE
36 EMPLOYER_POC_ADDRESS1 EMPLOYER_POC_ADDRESS1
37 EMPLOYER_POC_ADDRESS2 EMPLOYER_POC_ADDRESS2
38 EMPLOYER_POC_CITY EMPLOYER_POC_CITY
39 EMPLOYER_POC_STATE EMPLOYER_POC_STATE
40 EMPLOYER_POC_POSTAL_CODE EMPLOYER_POC_POSTAL_CODE
41 EMPLOYER_POC_COUNTRY EMPLOYER_POC_COUNTRY
42 EMPLOYER_POC_PROVINCE EMPLOYER_POC_PROVINCE
43 EMPLOYER_POC_PHONE EMPLOYER_POC_PHONE
44 EMPLOYER_POC_PHONE_EXT EMPLOYER_POC_PHONE_EXT
45 EMPLOYER_POC_EMAIL EMPLOYER_POC_EMAIL
46 AGENT_REPRESENTING_EMPLOYER AGENT_REPRESENTING_EMPLOYER
47 AGENT_ATTORNEY_LAST_NAME AGENT_ATTORNEY_LAST_NAME
48 AGENT_ATTORNEY_FIRST_NAME AGENT_ATTORNEY_FIRST_NAME
49 AGENT_ATTORNEY_MIDDLE_NAME AGENT_ATTORNEY_MIDDLE_NAME
50 AGENT_ATTORNEY_ADDRESS1 AGENT_ATTORNEY_ADDRESS1
51 AGENT_ATTORNEY_ADDRESS2 AGENT_ATTORNEY_ADDRESS2
52 AGENT_ATTORNEY_CITY AGENT_ATTORNEY_CITY
53 AGENT_ATTORNEY_STATE AGENT_ATTORNEY_STATE
54 AGENT_ATTORNEY_POSTAL_CODE AGENT_ATTORNEY_POSTAL_CODE
55 AGENT_ATTORNEY_COUNTRY AGENT_ATTORNEY_COUNTRY
56 AGENT_ATTORNEY_PROVINCE AGENT_ATTORNEY_PROVINCE
57 AGENT_ATTORNEY_PHONE AGENT_ATTORNEY_PHONE
58 AGENT_ATTORNEY_PHONE_EXT AGENT_ATTORNEY_PHONE_EXT
59 AGENT_ATTORNEY_EMAIL_ADDRESS AGENT_ATTORNEY_EMAIL_ADDRESS
60 LAWFIRM_NAME_BUSINESS_NAME LAWFIRM_NAME_BUSINESS_NAME
61 STATE_OF_HIGHEST_COURT STATE_OF_HIGHEST_COURT
62 NAME_OF_HIGHEST_STATE_COURT NAME_OF_HIGHEST_STATE_COURT
63 WORKSITE_WORKERS WORKSITE_WORKERS
64 SECONDARY_ENTITY SECONDARY_ENTITY
65 SECONDARY_ENTITY_BUSINESS_NAME SECONDARY_ENTITY_BUSINESS_NAME
66 WORKSITE_ADDRESS1 WORKSITE_ADDRESS1
67 WORKSITE_ADDRESS2 WORKSITE_ADDRESS2
68 WORKSITE_CITY WORKSITE_CITY
69 WORKSITE_COUNTY WORKSITE_COUNTY
70 WORKSITE_STATE WORKSITE_STATE
71 WORKSITE_POSTAL_CODE WORKSITE_POSTAL_CODE
72 WAGE_RATE_OF_PAY_FROM WAGE_RATE_OF_PAY_FROM
73 WAGE_RATE_OF_PAY_TO WAGE_RATE_OF_PAY_TO
74 WAGE_UNIT_OF_PAY WAGE_UNIT_OF_PAY
75 PREVAILING_WAGE PREVAILING_WAGE
76 PW_UNIT_OF_PAY PW_UNIT_OF_PAY
77 PW_TRACKING_NUMBER PW_TRACKING_NUMBER
78 PW_WAGE_LEVEL PW_WAGE_LEVEL
79 PW_OES_YEAR PW_OES_YEAR
80 PW_OTHER_SOURCE PW_OTHER_SOURCE
81 PW_OTHER_YEAR PW_OTHER_YEAR
82 PW_SURVEY_PUBLISHER PW_SURVEY_PUBLISHER
83 PW_SURVEY_NAME PW_SURVEY_NAME
84 TOTAL_WORKSITE_LOCATIONS TOTAL_WORKSITE_LOCATIONS
85 AGREE_TO_LC_STATEMENT AGREE_TO_LC_STATEMENT
86 H-1B_DEPENDENT H1B_DEPENDENT
87 WILLFUL_VIOLATOR WILLFUL_VIOLATOR
88 SUPPORT_H1B SUPPORT_H1B
89 STATUTORY_BASIS STATUTORY_BASIS
90 APPENDIX_A_ATTACHED APPENDIX_A_ATTACHED
91 PUBLIC_DISCLOSURE PUBLIC_DISCLOSURE
92 PREPARER_LAST_NAME PREPARER_LAST_NAME
93 PREPARER_FIRST_NAME PREPARER_FIRST_NAME
94 PREPARER_MIDDLE_INITIAL PREPARER_MIDDLE_INITIAL
95 PREPARER_BUSINESS_NAME PREPARER_BUSINESS_NAME
96 PREPARER_EMAIL PREPARER_EMAIL
Unfortunately, I don't see any differences in the column names. If my eyes are not mistaken, the problem may be something like leading or trailing spaces in some of the column names, which expands my duty! Now, I will try to parse out the odd man out using the compare() function in the {waldo} package:
# Comparing the column names of data files 1 and 9
waldo::compare(names(total_data[[1]]), names(total_data[[9]]))
old | new
[83] "PW_SURVEY_NAME" | "PW_SURVEY_NAME" [83]
[84] "TOTAL_WORKSITE_LOCATIONS" | "TOTAL_WORKSITE_LOCATIONS" [84]
[85] "AGREE_TO_LC_STATEMENT" | "AGREE_TO_LC_STATEMENT" [85]
[86] "H-1B_DEPENDENT" - "H1B_DEPENDENT" [86]
[87] "WILLFUL_VIOLATOR" | "WILLFUL_VIOLATOR" [87]
[88] "SUPPORT_H1B" | "SUPPORT_H1B" [88]
[89] "STATUTORY_BASIS" | "STATUTORY_BASIS" [89]
Thanks to the {waldo} package, the task became much easier. The problem is in the 86th column. It is named H1B_DEPENDENT in the 2022 file but H-1B_DEPENDENT in the 2020 and 2021 files. Now, I have a choice: change the variable in the eight 2020-2021 data files, or in the one 2022 file. However, it is not one of the variables that I need, so I don't worry much about it.
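Had I needed that column, a fix along these lines would harmonize the name across all files before row-binding (a sketch; the helper fix_names is my own):
fix_names <- function(df) {
  # Rename H-1B_DEPENDENT to H1B_DEPENDENT; a no-op where the old name is absent
  names(df) <- str_replace(names(df), fixed("H-1B_DEPENDENT"), "H1B_DEPENDENT")
  df
}
total_data <- map(total_data, fix_names)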
My next step is to select the required columns from these data sets and leave the others behind. Again, it takes a lot of code and time to pass select() to all 9 data sets one by one. Thus, I am going to create a function that selects the required columns once I pass it through total_data. Based on my prior experience, I am going to select the following columns:
**Variables and Their Brief Description**

| Variable_Names | Description |
|---|---|
| case_number | LCA application case number |
| case_status | Application outcome/decision |
| lca_application_date | Application date |
| lca_decision_date | Date of application decision |
| visa_class | Employee's proposed visa type, e.g., H1B |
| soc_title | Job category, e.g., computer systems analyst |
| job_type | Full-time or part-time job |
| job_begin_date | First date of work |
| total_worker | Total H1B workers in the company |
| employer_name | Name of the hiring company |
| employer_state | Employer's state |
| employer_zip | Employer's zip code |
| attorney_representation | Whether an attorney filed the case |
| prevailing_wage | Proposed wage for the employment |
| pay_wage_level | Wage level: ranges from I to IV |
Here’s a function select_variables that selects the
required variables. I will then pass it through the
total_data and check if it worked.
# Creating the function
select_variables <- function(sv) {
  sv |>
    select(
      CASE_NUMBER,
      CASE_STATUS, RECEIVED_DATE,
      DECISION_DATE, VISA_CLASS,
      SOC_TITLE, FULL_TIME_POSITION,
      BEGIN_DATE, TOTAL_WORKER_POSITIONS,
      EMPLOYER_NAME, EMPLOYER_STATE,
      EMPLOYER_POSTAL_CODE,
      AGENT_REPRESENTING_EMPLOYER,
      PREVAILING_WAGE, PW_UNIT_OF_PAY
    )
}
# Passing it through the list total_data
total_data <- xfun::cache_rds({
total_data |>
map(select_variables)
})
# Checking if the above action worked
total_data |>
map(names)
[[1]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
[[2]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
[[3]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
[[4]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
[[5]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
[[6]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
[[7]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
[[8]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
[[9]]
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "VISA_CLASS" "SOC_TITLE"
[7] "FULL_TIME_POSITION" "BEGIN_DATE"
[9] "TOTAL_WORKER_POSITIONS" "EMPLOYER_NAME"
[11] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"
[15] "PW_UNIT_OF_PAY"
Beautiful! I successfully trimmed all of the data sets at once. I now have only those columns that I deem important for my study.
I still have 9 separate data sets. I now have to combine them into a single giant data frame by row-binding with bind_rows(). I am going to save the new object as LCA_data.
# Combining
LCA_data <- xfun::cache_rds({
total_data |>
bind_rows()
})
# Checking
str(LCA_data)
tibble [1,526,247 x 15] (S3: tbl_df/tbl/data.frame)
$ CASE_NUMBER : chr [1:1526247] "I-200-19268-393467" "I-200-19268-638983" "I-200-19268-177184" "I-200-19268-936403" ...
$ CASE_STATUS : chr [1:1526247] "Certified" "Certified" "Certified" "Certified" ...
$ RECEIVED_DATE : POSIXct[1:1526247], format: "2019-09-25" "2019-09-25" ...
$ DECISION_DATE : POSIXct[1:1526247], format: "2019-10-01" "2019-10-01" ...
$ VISA_CLASS : chr [1:1526247] "H-1B" "H-1B" "H-1B" "H-1B" ...
$ SOC_TITLE : chr [1:1526247] "COMPUTER OCCUPATIONS, ALL OTHER" "SOFTWARE DEVELOPERS, APPLICATIONS" "MECHANICAL ENGINEERS" "SOFTWARE DEVELOPERS, APPLICATIONS" ...
$ FULL_TIME_POSITION : chr [1:1526247] "Y" "Y" "Y" "Y" ...
$ BEGIN_DATE : POSIXct[1:1526247], format: "2019-10-07" "2020-01-08" ...
$ TOTAL_WORKER_POSITIONS : num [1:1526247] 1 1 1 1 1 1 1 1 1 1 ...
$ EMPLOYER_NAME : chr [1:1526247] "JO-ANN STORES, INC." "DENKEN SOLUTIONS INC." "EPITEC, INC." "SYSTEMS TECHNOLOGY GROUP, INC." ...
$ EMPLOYER_STATE : chr [1:1526247] "OH" "CA" "MI" "MI" ...
$ EMPLOYER_POSTAL_CODE : chr [1:1526247] "44236" "92618" "48033" "48084" ...
$ AGENT_REPRESENTING_EMPLOYER: chr [1:1526247] "Y" "Y" "Y" "N" ...
$ PREVAILING_WAGE : num [1:1526247] 95118 39 39 53 65333 ...
$ PW_UNIT_OF_PAY : chr [1:1526247] "Year" "Hour" "Hour" "Hour" ...
Based on the output, I have a big data file named
LCA_data that has 15 columns and
1,526,247 rows.
Now I have come to the stage where I tidy my data for further analyses. I still have raw data, which require some filtering, type changes, and maybe some case conversions.
Based on my prior experience, I came to realize that some LCA applications take a very long time, even more than a year. These cases are often first certified and withdrawn afterward. I don't really need this block of data, but before deleting it, let's check the categories of case_status.
xfun::cache_rds({
  LCA_data |>
    count(CASE_STATUS) |>
    mutate(pct = n / sum(n) * 100) |>
    gt()
})
| CASE_STATUS | n | pct |
|---|---|---|
| Certified | 1430063 | 93.70 |
| Certified - Withdrawn | 58374 | 3.82 |
| Denied | 8802 | 0.58 |
| Withdrawn | 29008 | 1.90 |
Approximately 94% of LCA applications at DoL are certified. As shown above, about 4% were certified then withdrawn, less than 1% were denied, and roughly 2% were withdrawn. Now, I am going to keep just the certified cases.
LCA_data <- xfun::cache_rds({
LCA_data |>
filter(CASE_STATUS == "Certified")
})
# Checking
dim(LCA_data)
[1] 1430063 15
The new dimension of LCA_data shows that the total number of data points now equals the number of certified cases. Amazing!
Although most H1B employees work on a full-time basis, I have seen (FULL_TIME_POSITION) some part-time positions as well. There were 23,066 part-time cases, i.e., a proportion of 0.016 (about 1.6%), to be exact. I am not sure how that is possible. Thus, I want to remove all part-time employees from further analyses.
LCA_data |>
  count(FULL_TIME_POSITION) |>
  mutate(pct = n / sum(n)) |>
  gt()
| FULL_TIME_POSITION | n | pct |
|---|---|---|
| N | 23066 | 0.016 |
| Y | 1406997 | 0.984 |
LCA_data <- xfun::cache_rds({
LCA_data |>
filter(FULL_TIME_POSITION == "Y")
})
# Checking
dim(LCA_data)
[1] 1406997 15
The new dimension shows that the total number of data points dropped to 1,406,997 from 1,430,063. The gap is exactly equal to the number of part-time positions.
Similarly, when I looked at the structure of the variable PW_UNIT_OF_PAY, I saw that wages are reported on an hourly, weekly, bi-weekly, monthly, or yearly basis. I need consistency here; all of them should be on the same basis of pay. I could convert all other pay rates to yearly wages, but I am not sure the conversions would reflect the actual cases. Given that I have a huge data set, I would rather limit my study to the H1B employees who reported a yearly wage in their LCA applications.
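For reference, a conversion would look something like the sketch below, assuming full-time schedules of 2,080 hours, 52 weeks, 26 bi-weekly periods, or 12 months per year, which is exactly the kind of assumption I want to avoid:
# Hypothetical annualization (not used here)
LCA_data |>
  mutate(
    yearly_wage = case_when(
      PW_UNIT_OF_PAY == "Hour"      ~ PREVAILING_WAGE * 2080,
      PW_UNIT_OF_PAY == "Week"      ~ PREVAILING_WAGE * 52,
      PW_UNIT_OF_PAY == "Bi-Weekly" ~ PREVAILING_WAGE * 26,
      PW_UNIT_OF_PAY == "Month"     ~ PREVAILING_WAGE * 12,
      TRUE                          ~ PREVAILING_WAGE
    )
  )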
# Checking
LCA_data |>
  count(PW_UNIT_OF_PAY) |>
  mutate(pct = n / sum(n) * 100) |>
  gt()
| PW_UNIT_OF_PAY | n | pct |
|---|---|---|
| Bi-Weekly | 200 | 0.014 |
| Hour | 60105 | 4.272 |
| Month | 790 | 0.056 |
| Week | 181 | 0.013 |
| Year | 1345721 | 95.645 |
# Filtering
LCA_data <- LCA_data |>
filter(PW_UNIT_OF_PAY == "Year")
# Checking
dim(LCA_data)
[1] 1345721 15
As can be seen, the total number of data points came down to 1,345,721 after limiting the data to yearly wages, which matches the number of cases that reported a yearly wage in their LCA applications.
Finally, I can see that there are a few visa types mentioned in the LCAs. I would like to limit my study to the H-1B visa type only, not the E-3 Australian or the H-1B1 Chile/Singapore classes.
# Calculating
LCA_data |>
  count(VISA_CLASS) |>
  mutate(pct = n / sum(n) * 100) |>
  gt()
| VISA_CLASS | n | pct |
|---|---|---|
| E-3 Australian | 27075 | 2.01 |
| H-1B | 1312667 | 97.54 |
| H-1B1 Chile | 2956 | 0.22 |
| H-1B1 Singapore | 3023 | 0.22 |
# Filtering
LCA_data <- LCA_data |>
filter(VISA_CLASS == "H-1B")
dim(LCA_data)
[1] 1312667 15
I got the statistics, and went ahead and removed the E-3 Australian, H-1B1 Chile, and H-1B1 Singapore visa classes, which brought my data down to 1,312,667 points.
I have worked on four variables so far. All of them had a few categories to begin with, but I limited my study to a single sub-category of each. For example, CASE_STATUS had four different categories, which I filtered down to "Certified"; the same is true of FULL_TIME_POSITION == "Y", PW_UNIT_OF_PAY == "Year", and VISA_CLASS == "H-1B". When there is no variability, I don't need these variables any more. I am going to get rid of them right now:
LCA_data$CASE_STATUS <- NULL
LCA_data$FULL_TIME_POSITION <- NULL
LCA_data$PW_UNIT_OF_PAY <- NULL
LCA_data$VISA_CLASS <- NULL
# Getting rid of NAs
LCA_data <- LCA_data |>
na.omit()
# Checking
names(LCA_data)
[1] "CASE_NUMBER" "RECEIVED_DATE"
[3] "DECISION_DATE" "SOC_TITLE"
[5] "BEGIN_DATE" "TOTAL_WORKER_POSITIONS"
[7] "EMPLOYER_NAME" "EMPLOYER_STATE"
[9] "EMPLOYER_POSTAL_CODE" "AGENT_REPRESENTING_EMPLOYER"
[11] "PREVAILING_WAGE"
The final list of variables shows that I got rid of the four variables mentioned above; na.omit() also dropped a few incomplete rows along the way (from 1,312,667 down to 1,312,647, as the later summary shows).
So far, so good. When I observed the output of str(LCA_data), I realized that the time-related variables are in the POSIXct class. That is not bad; however, they still carry an HH:MM:SS component. First, I have to get rid of these precision time units. Second, I want to be able to count the number of days between the LCA application, the decision, and the first day of work, so I will convert them to plain Date objects with as.Date() (lubridate's as_date() would do the same).
LCA_data$RECEIVED_DATE <- as.Date(LCA_data$RECEIVED_DATE)
LCA_data$DECISION_DATE <- as.Date(LCA_data$DECISION_DATE)
LCA_data$BEGIN_DATE <- as.Date(LCA_data$BEGIN_DATE)
# Checking
data.frame(class(LCA_data$RECEIVED_DATE), class(LCA_data$DECISION_DATE), class(LCA_data$BEGIN_DATE))
class.LCA_data.RECEIVED_DATE. class.LCA_data.DECISION_DATE.
1 Date Date
class.LCA_data.BEGIN_DATE.
1 Date
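For my future reference, the same conversion can be written as a tidyverse one-liner (an equivalent alternative, not run here):
LCA_data <- LCA_data |>
  mutate(across(c(RECEIVED_DATE, DECISION_DATE, BEGIN_DATE), lubridate::as_date))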
Furthermore, based on my previous studies, I know that some applications took a very long time to be adjudicated. I want to check the status in this data set and see whether I can limit cases based on this timetable. I have to create two new variables to be able to do this.
Now, I want to create two new variables using the date variables I worked on above. The first one is DECISION_TIME, the number of days it took cases to get a decision from the Department of Labor. The second is DAYS_TOWORK, the gap between the application date and the proposed first day of work.
# Creating
LCA_data <- xfun::cache_rds({
LCA_data |>
mutate(
DECISION_TIME = DECISION_DATE - RECEIVED_DATE,
DAYS_TOWORK = BEGIN_DATE - RECEIVED_DATE
)
})
# Checking
data.frame(mean(LCA_data$DECISION_TIME), mean(LCA_data$DAYS_TOWORK))
mean.LCA_data.DECISION_TIME. mean.LCA_data.DAYS_TOWORK.
1 7.3 days 71 days
data.frame(range(LCA_data$DECISION_TIME), range(LCA_data$DAYS_TOWORK))
range.LCA_data.DECISION_TIME. range.LCA_data.DAYS_TOWORK.
1 4 days -34 days
2 29 days 184 days
Based on the output, the average LCA decision time was 7.3 days, with a range of 4-29 days. Likewise, the average gap between the application date and the proposed first day of work was 71 days, ranging from -34 to 184 days. This makes sense: employers can only file the actual H1B petition after the LCA is approved, because LCA approval alone doesn't allow anyone to start working.
It is inconvenient to have the column names in uppercase letters. It slows down the flow of my work and makes me more prone to typing errors, because column references are case sensitive. Thus, I am going to change all column names to lowercase using the clean_names() function from the {janitor} package. Afterward, I would like to change the class of some of the variables to factor using the as_factor() function.
# Changing to lower case names
LCA_data <- LCA_data |>
clean_names()
# Changing the variable class
LCA_data <- LCA_data |>
mutate(
employer_state = as_factor(employer_state),
employer_postal_code = as_factor(employer_postal_code),
agent_representing_employer = as_factor(agent_representing_employer),
soc_title = as_factor(soc_title)
) |>
select(
    case_number, received_date, decision_date, decision_time,
    begin_date, days_towork, employer_name, employer_state,
    employer_postal_code, agent_representing_employer,
    soc_title, total_worker_positions, prevailing_wage
  )
Now, I want to check the summary of all of these variables. I will be looking for discrepancies in the structure that need further action.
summary(LCA_data)
case_number received_date decision_date decision_time
Length:1312647 Min. :2019-09-25 Min. :2019-10-01 Length:1312647
Class :character 1st Qu.:2020-05-13 1st Qu.:2020-05-20 Class :difftime
Mode :character Median :2020-12-09 Median :2020-12-16 Mode :numeric
Mean :2020-11-14 Mean :2020-11-21
3rd Qu.:2021-04-21 3rd Qu.:2021-04-28
Max. :2021-12-22 Max. :2021-12-30
begin_date days_towork employer_name employer_state
Min. :2019-09-25 Length:1312647 Length:1312647 CA :281308
1st Qu.:2020-08-29 Class :difftime Class :character TX :153793
Median :2021-01-25 Mode :numeric Mode :character NJ :142180
Mean :2021-01-24 NY : 85307
3rd Qu.:2021-08-09 WA : 78639
Max. :2022-06-23 IL : 76981
(Other):494439
employer_postal_code agent_representing_employer
77845 : 50009 N :149293
98121 : 39835 No :223577
94043 : 31671 Y :352137
20850 : 27108 Yes:587640
98052 : 22954
95054 : 22871
(Other):1118199
soc_title
Software Developers, Applications :436552
Computer Systems Analysts : 97500
Software Developers, Systems Software : 82575
Computer Systems Engineers/Architects : 43121
Software Quality Assurance Engineers and Testers: 33817
Information Technology Project Managers : 29919
(Other) :589163
total_worker_positions prevailing_wage
Min. : 1 Min. : 15080
1st Qu.: 1 1st Qu.: 76898
Median : 1 Median : 93558
Mean : 2 Mean : 98300
3rd Qu.: 1 3rd Qu.:115773
Max. :300 Max. :760202
A very clean, nice summary. I can immediately see problems in a couple of variables. The first is agent_representing_employer, which mixes two encodings of the same answer (N/Y and No/Yes); I have to collapse them into one form. Second, looking at total_worker_positions, I realize that I may not need this variable at all, because the overwhelming majority of cases have one worker: the minimum, 1st quartile, median, and even 3rd quartile are all 1. It is hugely skewed, so I am better off getting rid of it.
# Managing the agent_representing column
LCA_data <- LCA_data |>
  mutate(
    agent_representing_employer = case_when(
      agent_representing_employer %in% c("N", "No") ~ "No",
      agent_representing_employer %in% c("Y", "Yes") ~ "Yes"
    )
  ) |>
  mutate(agent_representing_employer = as_factor(agent_representing_employer)) |>
  select(
    case_number, received_date, decision_date, decision_time,
    begin_date, days_towork, employer_name, employer_state,
    employer_postal_code, agent_representing_employer,
    soc_title, prevailing_wage
  )
# Checking
summary(LCA_data$agent_representing_employer)
No Yes
372870 939777
I have successfully gotten rid of total_worker_positions and collapsed the N/Y codes into No and Yes under the agent_representing_employer variable. In my pilot study of a similar data set, I stumbled on a problem with the employer_postal_code variable: most applications used 5-digit postal codes, while some used 5-plus-4-digit codes. Thus, I would like to make them consistent by truncating all codes to 5 digits.
glimpse(LCA_data$employer_postal_code)
Factor w/ 11961 levels "00646","00717-9997",..: 5457 6337 3833 8563 7793 3752 3744 10237 10020 10237 ...
LCA_data$employer_postal_code <- as_factor((substr(LCA_data$employer_postal_code, 1, 5)))
glimpse(LCA_data$employer_postal_code)
Factor w/ 10780 levels "00646","00717",..: 4888 5705 3391 7756 7040 3326 3318 9304 9093 9304 ...
The employer_postal_code is in good shape now. Looking at the summary of my data frame earlier, I realized that there can be some problems in the employers' names, i.e., employer_name, and the job titles, i.e., soc_title. I want to change all of them to title case. In addition, some of the company names and job titles carry stray numeric suffixes, which I remove with gsub() inside mutate().
LCA_data <- LCA_data |>
mutate(
soc_title = str_to_title(soc_title),
employer_name = str_to_title(employer_name)
) |>
mutate(
soc_title = factor(gsub("\\.-[0-9]*$", "", soc_title)),
employer_name = factor(gsub("\\.-[0-9]*$", "", employer_name))
)
I have the desired variables and clean names. There may be a few rough edges along the way, but I think I am in pretty good shape so far. I am ready to create some plots and start visualizing the data.
Let me begin with the visuals of the decision time and first day of work. I am going to use ggplot2.
decision_density <- ggplot() +
geom_freqpoly(data = LCA_data, aes(decision_time, ..density..), color = "darkred") +
geom_freqpoly(data = LCA_data, aes(days_towork, ..density..), color = "darkblue") +
xlab("Red: decision_time in days & Blue: days_towork") +
ggtitle("Distribution of decision_time & days_towork among LCA Applicants")
decision_density
One of the most amazing things in the plot above is the consistency of the LCA decision time. It peaks the day after the employer registers the LCA application and drops to zero within about 10 days. They must have a pretty good system for taking care of LCA applications. The first day of work extends much further, which makes complete sense because it is a tentative date and sometimes expectations are not met.
prevailing_wage_distribution <- ggplot(LCA_data, aes(prevailing_wage / 1000)) +
geom_histogram(fill = "dodgerblue", color = "hotpink4", binwidth = 5) +
geom_vline(xintercept = mean(LCA_data$prevailing_wage) / 1000, color = "darkred", linetype = "dashed", size = 1.5, alpha = 0.4) +
xlab("Prevailing Wages in Thousands of Dollar") +
ylab(expression("Cumulative Frequency or Count")) +
ggtitle("Prevailing-Wage-Distribution for H1B Employees: 10/2019 - 12/2021")
prevailing_wage_distribution
mean(LCA_data$prevailing_wage)
[1] 98300
It looks like the prevailing wages are roughly normally distributed. As found in the pilot study, the average wage is around $98,300/year. Let's zoom in on the $200,000 to $300,000 range.
zoomed_in <- prevailing_wage_distribution +
xlim(c(200, 300)) + ylim(c(0, 1200)) +
xlab("Prevailing Wages Between $200,000 and $300,000") +
ylab(expression("Cumulative Frequency or Count")) +
ggtitle("Upper Bound Prevailing Wage Distribution")
zoomed_in
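A side note on the zoom technique: xlim() drops the out-of-range rows before the histogram is computed (hence ggplot's warnings about removed rows), whereas coord_cartesian() zooms the view while keeping all the data. A sketch of the alternative:
# Zoom without discarding the out-of-range observations
zoomed_in_alt <- prevailing_wage_distribution +
  coord_cartesian(xlim = c(200, 300), ylim = c(0, 1200))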
It looks like there are a few thousand data points in the $200,000 to $250,000 prevailing wage range. I will limit my main study to wages below $250,000. For now, I would like to see which job titles and companies pay the highest rates, so I am going to zoom in on the $300,000 to $800,000 range.
highest_wages <- prevailing_wage_distribution +
  xlim(c(300, 800)) + ylim(c(0, 6)) +
  xlab("Prevailing Wages Between $300,000 and $800,000") +
  ylab(expression("Cumulative Frequency or Count")) +
  ggtitle("Highest Prevailing Wage for H1B Between October 2019-December 2021")
highest_wages
Yes, there are a few data points between $300,000 and $800,000 in yearly prevailing wages. I will expand the cutoff a little to include annual wages of $250,000 and higher, subset the data, and do further analyses.
My ultimate interest is in the regular data, with prevailing wages below $250,000. I am going to split the data into two smaller data sets. The first includes only the cases with a prevailing wage under $250,000; the second includes only the cases with an annual prevailing wage of $250,000 or higher. I will save them as LCA_Final and LCA_High, respectively.
LCA_Final <- LCA_data |>
filter(prevailing_wage < 250000)
LCA_High <- LCA_data |>
filter(prevailing_wage >= 250000)
rbind(dim(LCA_Final), dim(LCA_High))
[,1] [,2]
[1,] 1311377 12
[2,] 1270 12
I have successfully divided the data into two portions, i.e., LCA_Final and LCA_High. LCA_Final has 1,311,377 cases and LCA_High has 1,270; both have 12 columns.
As mentioned earlier, I want to dig a little deeper into the LCA_High data set. First, I would like to see whether there are any patterns in terms of soc_title and employer_name.
# Checking for job title
j_title <- LCA_High |>
  count(soc_title) |>
  mutate(pct = n / sum(n) * 100) |>
  arrange(desc(pct))
data.frame(j_title)
soc_title n pct
1 Computer And Information Systems Managers 303 23.858
2 Physicians And Surgeons, All Other 274 21.575
3 Financial Managers 202 15.906
4 Chief Executives 72 5.669
5 Internists, General 67 5.276
6 Hospitalists 49 3.858
7 Neurologists 30 2.362
8 Psychiatrists 29 2.283
9 Lawyers 22 1.732
10 Radiologists 20 1.575
11 Financial Managers, Branch Or Department 19 1.496
12 Surgeons 18 1.417
13 Health Specialties Teachers, Postsecondary 16 1.260
14 Anesthesiologists 15 1.181
15 Architectural And Engineering Managers 15 1.181
16 Marketing Managers 15 1.181
17 Pediatricians, General 14 1.102
18 Family And General Practitioners 10 0.787
19 Medical Scientists, Except Epidemiologists 9 0.709
20 Obstetricians And Gynecologists 8 0.630
21 Pathologists 8 0.630
22 Treasurers And Controllers 7 0.551
23 Dentists, General 6 0.472
24 Financial Analysts 6 0.472
25 General And Operations Managers 6 0.472
26 Management Analysts 6 0.472
27 Managers, All Other 6 0.472
28 Natural Sciences Managers 5 0.394
29 Urologists 2 0.157
30 Architecture Teachers, Postsecondary 1 0.079
31 Chemists 1 0.079
32 Chief Sustainability Officers 1 0.079
33 Clinical Research Coordinators 1 0.079
34 Dermatologists 1 0.079
35 Hospitalist 1 0.079
36 Physical Medicine And Rehabilitation Physicians 1 0.079
37 Physicans And Surgeons, All Others 1 0.079
38 Physicians And Surgeons 1 0.079
39 Physicians And Surgeons,All Other 1 0.079
40 Sales Managers 1 0.079
Job title definitely seems to determine the prevailing wage. However, the output showed 40 different job titles, and some of them were near-duplicates, e.g., Physicians and Surgeons, All Others vs. Physicians and Surgeons, All Other, listed separately only because of the s in other vs. others. After a close look at these job titles, I decided to put them into four big buckets, namely Finance_Sector, Engineering_Sector, Medical_Sector, & Computer_IT_Managers.
# Putting job_titles in big buckets
LCA_High_Modified <- LCA_High |>
mutate(
soc_title = case_when(
soc_title %in% c("Computer And Information Systems Managers") ~ "Computer_IT_Managers",
soc_title %in% c(
"Physicians And Surgeons,All Other",
"Physicians And Surgeons",
"Physicans And Surgeons, All Others",
"Physical Medicine And Rehabilitation Physicians",
"Hospitalist",
"Dermatologists",
"Urologists",
"Dentists, General",
"Pathologists",
"Obstetricians And Gynecologists",
"Family And General Practitioners",
"Medical Scientists, Except Epidemiologists",
"Pediatricians, General",
"Anesthesiologists",
"Health Specialties Teachers, Postsecondary",
"Surgeons",
"Radiologists",
"Neurologists",
"Psychiatrists",
"Lawyers",
"Hospitalists",
"Clinical Research Coordinators",
"Physicians And Surgeons, All Other",
"Chemists"
) ~ "Medical_Sector",
soc_title %in% c(
"Architectural And Engineering Managers",
"Natural Sciences Managers",
"Architecture Teachers, Postsecondary",
"Chief Sustainability Officers"
) ~ "Engineering_Sector",
soc_title %in% c(
"Sales Managers",
"Financial Managers",
"Chief Executives",
"Internists, General",
"Financial Managers, Branch Or Department",
"Marketing Managers",
"Treasurers And Controllers",
"Financial Analysts",
"General And Operations Managers",
"Management Analysts",
"Managers, All Other"
) ~ "Finance_Sector",
TRUE ~ as.character(soc_title)
)
)
# Plotting the prevailing wages
g_density <- ggplot(LCA_High_Modified, aes(prevailing_wage)) +
geom_density(aes(fill = factor(soc_title)), alpha = 0.4) +
xlim(c(230000, 320000)) +
ylim(c(0.0000, 0.0002)) +
labs(
title = "Prevailing Wage Density Plot",
subtitle = "Four Highest Paying Job Areas",
x = "Prevailing Wage",
fill = "# Job Areas"
)
g_density
# Counting by new Job Titles
j_title_modified <- LCA_High_Modified |>
group_by(soc_title) |>
count() |>
mutate(pct = n / 1270 * 100) |>
arrange(desc(pct))
# Counting Minimum, Mean, Maximum Prevailing Wages
wages <- LCA_High_Modified |>
select(soc_title, prevailing_wage) |>
group_by(soc_title) |>
summarize(
minimum_wage = min(prevailing_wage),
mean_wage = mean(prevailing_wage),
maximum_wage = max(prevailing_wage)
)
# Merging Above Tables
wages_by_jtitles <- merge(j_title_modified, wages, by = "soc_title")
wages_by_jtitles
soc_title n pct minimum_wage mean_wage maximum_wage
1 Computer_IT_Managers 303 23.9 250286 255872 305707
2 Engineering_Sector 22 1.7 251326 261095 287394
3 Finance_Sector 407 32.0 250099 264691 415589
4 Medical_Sector 538 42.4 250000 276317 760202
Besides job title, I tried to see whether there was a pattern of high prevailing wages by employer, but nothing stood out.
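The employer check was along these lines (a sketch of what I ran):
# Top employers among the high-wage cases
LCA_High |>
  count(employer_name, sort = TRUE) |>
  head(10)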
Hurrah! I made it up to this question. It's a whole new world: I could do a lot of things, in other words, spend months with this data set identifying trends in every single aspect. However, every researcher reaches a point where they have to decide what to dig into and what to leave open. With 96 unique variables and well over a million data points, the possibilities are innumerable. After much thought about what I want to do, I came up with three ideas:
Take the top 5 companies and check their trend of hiring H1B employees based on the number of LCA filings. I am particularly interested to see whether the number of filings dropped. The data come from a time frame when the whole world was in the hot mess of COVID-19, which caused private, public, and government organizations to shut down. Many people lost their jobs. However, a few companies, especially tech companies like Amazon, reportedly made more money during this period as many people resorted to online shopping. More business requires more employees. Thus, it would be interesting to figure out how things played out in these top 5 companies.
In the meantime, academic institutions had to resort to hiring freezes, and many of them had not returned to regular operations even at the time this report was prepared. On the other hand, wages tend to go up a little every year. I would like to check whether there was any discontinuity in wage increases during this period; or simply, how these two aspects played out in the top 5 job titles in academia.
Finally, it would be interesting to check whether I could predict the number of LCAs for H1B based on these data. So, I will try to predict the total number of LCAs for the upcoming quarter.
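One simple way this could be attempted (a naive linear-trend sketch; quarterly_counts and trend_fit are my own names, and a serious forecast would need a proper time-series model):
# Count LCAs per quarter and fit a straight-line trend
quarterly_counts <- LCA_Final |>
  mutate(quarter = zoo::as.yearqtr(received_date)) |>
  count(quarter)
trend_fit <- lm(n ~ as.numeric(quarter), data = quarterly_counts)
# Predict the quarter after the last observed one
predict(trend_fit, newdata = data.frame(quarter = max(quarterly_counts$quarter) + 0.25))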
Based on the analyses conducted above, we know that Amazon, Cognizant Technology Solutions, Microsoft Corporation, Tata Consultancy Services, and Google were the top 5 H1B employers during the study period. There are a few different companies with Amazon in their names, like Amazon Web Services, Amazon.com Services, etc. Thus, I index all the companies using keywords and bring them under the same name afterward.
top_5_companies <- LCA_Final |>
  filter(str_detect(employer_name, "Amazon|Tata|Microsoft|Cognizant|Google")) |>
  select(received_date, employer_name)
unique(top_5_companies$employer_name)
The output shows a lot of Amazon companies, a few Cognizant & Tata ones, and one each for Microsoft and Google. I am going to assign a single name to each respective company using the case_when() function. It's going to be a long manual process.
top_5_companies <- top_5_companies |>
  mutate(
    companies = case_when( # collapse the various names into a single name
      employer_name %in% c(
        "Amazon.com Services Llc",
        "Amazon Web Services, Inc.",
        "Amazon Advertising Llc",
        "Amazon Development Center U.s., Inc.",
        "Amazon Web Services Inc",
        "Amazon Data Services, Inc",
        "Amazon Data Services, Inc.",
        "Amazon.com.ca, Inc.",
        "Amazon Robotics Llc",
        "Amazon Capital Services, Inc.",
        "Amazon Payments, Inc.",
        "Amazon Mechanical Turk, Inc.",
        "Amazon Registry Services Inc.",
        "Amazon Logistics",
        "Amazon Web Service, Inc.",
        "Amazon.com.ca, Inc. (Us)"
      ) ~ "Amazon",
      employer_name %in% c(
        "Cognizant Technology Solutions Us Corp",
        "Trizetto Corporation - A Cognizant Company",
        "Tmg Health - A Cognizant Company",
        "Cognizant Worldwide Limited",
        "Cognizant Technology Solutions Us Corporation"
      ) ~ "Cognizant",
      employer_name %in% c(
        "Tata Consultancy Services Limited",
        "Tata Elxsi Ltd",
        "Tata Elxsi Limited",
        "Tata Technologies, Inc.",
        "Tata Communications (America), Inc."
      ) ~ "Tata",
      employer_name %in% "Google Llc" ~ "Google",
      employer_name %in% "Microsoft Corporation" ~ "Microsoft",
      TRUE ~ as.character(employer_name)
    )
  ) |>
  mutate(
    received_date_m = lubridate::date(received_date),
    quarter_date = zoo::as.yearqtr(received_date_m) # changing the date layout
  ) |>
  select(received_date_m, quarter_date, companies) |> # keeping only the needed columns
  arrange(desc(received_date_m))
# unique(top_5_companies$companies)
data.frame(top_5_companies)
Now I have collapsed all the different names into just these 5 companies.