Recently, I got interested in analyzing H1B facts and figures in the United States because my most recent employer has moved forward with an H1B application. I wanted to pursue this analysis thinking that it would provide ample opportunities to hone my data science skills and learn some advanced ones. This work features all key aspects of data science outlined in (Estrellado et al., 2021):

  1. Importing Data from Multiple Online Sources,
  2. Tidying Data,
  3. Transforming Data,
  4. Visualizing Data,
  5. Modeling Data, and
  6. Communication of Results

This study examines trends among H1B employees in the United States. The data come from the online repository of the Department of Labor. The data don't tell us whether somebody actually applied for or received an H1B visa; nonetheless, they capture the critical step before the actual H1B application. In addition, they provide insights into the hiring company, job type, prevailing wage, state, etc. I am going to dive deep, and I don't really have any agenda here. I am open to exploration and surprises!

Along with numerous online resources, especially Google Search and Twitter feeds, my analysis relies heavily on a handful of key references.

Now, let’s move sequentially and strategically.

1. Importing Data from Multiple Online Sources

I want to do a trend analysis, so I plan to get data for all quarters of fiscal years 2020 and 2021, plus the first quarter of fiscal year 2022. Based on the layout of the data listed at https://www.dol.gov/agencies/eta/foreign-labor/performance, I have to download 9 different data sets. I will need the tidyverse package to work on my tables; the here package to point at the downloaded files without hard-coding paths; the lubridate package to work with dates; and, finally, the readxl package to read the .xlsx data files.

library(tidyverse)  # data manipulation and plotting (attaches ggplot2 & stringr)
library(here)       # project-relative file paths
library(lubridate)  # date handling
library(readxl)     # reading .xlsx files
library(gt)         # presentation-quality tables
library(ggplot2)    # already attached by tidyverse; loaded again out of habit
library(janitor)    # clean_names() and other cleaning helpers
library(sjlabelled) # working with variable labels
library(stringr)    # already attached by tidyverse; loaded again out of habit
Disclaimer:
My intention in conducting this research is manifold. My first objective is to hone my skills of working in the R environment and to successfully complete a project. I have done so many times in the past, but I feel like I have to keep practicing. Second, for some time now, I have taken a lot of courses in R. I learned a lot, but I barely remember most of it. When I sit down to work in R, I stumble on every small bump, the fundamental aspects, and I need to look into my notes or do a Google search. The thing is, I overlooked the basic aspects of R and tried to achieve big things, i.e., using R for advanced statistical research that requires rigorous methodological insight: Hierarchical Linear Modeling (HLM), Structural Equation Modeling (SEM), Longitudinal Modeling, Multivariate Regression, and even Machine Learning. Once you are there, you realize you lack the basic building blocks. Thus, my intention is to use these basic tools and learn and/or relearn them through this study. This is the reason I am overusing some of the R features above and will do so throughout this paper. Once done, I want to keep this document handy so that I can refer back to it over and over.
Finally, my intention is to look into the situation of H1B applicants and learn more about this population. I conducted a small pilot study using one quarter of data, i.e., 120,000 data points, which inspired me to use big data to strengthen the findings. I am currently in the process of becoming one of these applicants, and knowing the landscape better will help me keep reasonable expectations. I am using raw data; I have not read any serious research on this topic, and I don't know what to expect. Maybe there is pretty good research out there that could help me ask better questions; however, I would rather not consult it. Thus, everything I come up with will be a pure discovery for me.

Downloading the data files from the online repository

The download.file() function, which lets me type in the respective url where the data are located and a destfile where I can save them, did not work properly, so I downloaded the data manually and saved them on my local disk in a folder named all_data. Now, I want to make a quick check on one of the data sets:
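For the record, here is roughly what I attempted; the file URL below is an illustrative guess based on the DOL performance page, not a verified link, and mode = "wb" matters for binary files like .xlsx on Windows:

# Sketch of the attempted download; the URL is an assumed placeholder
url <- "https://www.dol.gov/sites/dolgov/files/ETA/oflc/pdfs/LCA_Disclosure_Data_FY2020_Q4.xlsx"
download.file(
  url,
  destfile = here::here("all_data", "LCA_Disclosure_Data_FY2020_Q4.xlsx"),
  mode = "wb" # write in binary mode so the .xlsx is not corrupted
)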

read_excel(here::here(
  "all_data",
  "LCA_Disclosure_Data_FY2020_Q4.xlsx"
))
# A tibble: 117,395 x 96
   CASE_NUMBER        CASE_STATUS RECEIVED_DATE       DECISION_DATE      
   <chr>              <chr>       <dttm>              <dttm>             
 1 I-200-20176-676243 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
 2 I-200-20176-676246 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
 3 I-200-20176-676247 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
 4 I-200-20176-676251 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
 5 I-200-20176-676254 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
 6 I-200-20176-676255 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
 7 I-200-20176-676257 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
 8 I-200-20176-676259 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
 9 I-200-20176-676260 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
10 I-200-20176-676265 Certified   2020-06-24 00:00:00 2020-07-01 00:00:00
# ... with 117,385 more rows, and 92 more variables: ORIGINAL_CERT_DATE <lgl>,
#   VISA_CLASS <chr>, JOB_TITLE <chr>, SOC_CODE <chr>, SOC_TITLE <chr>,
#   FULL_TIME_POSITION <chr>, BEGIN_DATE <dttm>, END_DATE <dttm>,
#   TOTAL_WORKER_POSITIONS <dbl>, NEW_EMPLOYMENT <dbl>,
#   CONTINUED_EMPLOYMENT <dbl>, CHANGE_PREVIOUS_EMPLOYMENT <dbl>,
#   NEW_CONCURRENT_EMPLOYMENT <dbl>, CHANGE_EMPLOYER <dbl>,
#   AMENDED_PETITION <dbl>, EMPLOYER_NAME <chr>, TRADE_NAME_DBA <chr>, ...

The data have been saved, and I got an important peek into them. There were 96 columns and 117,395 rows in the 2020 Q4 data set. I have 9 data files like this that I want to use in this study. Handling a single file at a time is tedious and takes a lot of time. Thus, I will use code that passes through all the data files in the folder and performs the designated tasks.

First of all, I want to list all the data files in the folder all_data using the list.files() function. I will extract the full paths using the full.names = TRUE argument because I need them to pull the data. Finally, I will save the names in the file_path vector so that I can use them easily.

# Saving the file names in a vector
file_path <- xfun::cache_rds({
  list.files(
    path = here::here("all_data"),
    full.names = TRUE
  )
})
# Checking
file_path
[1] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q1.xlsx"
[2] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q2.xlsx"
[3] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q3.xlsx"
[4] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q4.xlsx"
[5] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2021_Q1.xlsx"
[6] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2021_Q2.xlsx"
[7] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2021_Q3.xlsx"
[8] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2021_Q4.xlsx"
[9] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2022_Q1.xlsx"

I got the complete paths to the data files stored in the all_data folder. Now, I am going to read the data into R using the read_excel() function, applying it to all data files at once with the map() function.

total_data <- xfun::cache_rds({
  file_path %>%
    map(~ read_excel(.x, sheet = 1)) # read every .xlsx in one pass
})

Because there are many data files and each contains more than 100,000 data points, the system became pretty slow. Thus, I wrapped the above syntax in xfun::cache_rds({}). It is faster than the cache=TRUE chunk option, and the cached result is much easier to reuse.

My ultimate goal is to create one giant data frame by row-binding all 9 files. I can only do so once all the data files have identical columns. Let's check a few data files for consistency in column names using the identical() function.

identical(names(total_data[[1]]), names(total_data[[2]]))
[1] TRUE
identical(names(total_data[[1]]), names(total_data[[6]]))
[1] TRUE
identical(names(total_data[[1]]), names(total_data[[9]]))
[1] FALSE

The output shows that data sets 1 & 2 and 1 & 6 have identical column names, but data set 9 does not. Let's dig in a little more. First, let's see if these data files have an equal number of columns using the map() function:

xfun::cache_rds({
  total_data |>
    map(ncol)
})
[[1]]
[1] 96

[[2]]
[1] 96

[[3]]
[1] 96

[[4]]
[1] 96

[[5]]
[1] 96

[[6]]
[1] 96

[[7]]
[1] 96

[[8]]
[1] 96

[[9]]
[1] 96

Good grace! All of the data files have an equal number of columns. I know that I am not going to use all of them, but I still have to figure out what differs among the files.

Let’s check for the files 1 and 9.

file_path[[1]]
[1] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2020_Q1.xlsx"
file_path[[9]]
[1] "C:/Users/nirma/Documents/GitHub/EDA_Project2/all_data/LCA_Disclosure_Data_FY2022_Q1.xlsx"

One of them is quarter 1 data from 2020 and the other is quarter 1 data from 2022. Now, I want to check whether the remaining 2020 files (quarters 2 through 4) also differ from quarter 1 of 2022.

identical(names(total_data[[9]]), names(total_data[[2]]))
[1] FALSE
identical(names(total_data[[9]]), names(total_data[[3]]))
[1] FALSE
identical(names(total_data[[9]]), names(total_data[[4]]))
[1] FALSE

Yes. The mismatch is between the 2020 files and the 2022 file. Let's inspect the column names side by side.

xfun::cache_rds({
  data.frame(names(total_data[[1]]), names(total_data[[9]]))
})
           names.total_data..1...         names.total_data..9...
1                     CASE_NUMBER                    CASE_NUMBER
2                     CASE_STATUS                    CASE_STATUS
3                   RECEIVED_DATE                  RECEIVED_DATE
4                   DECISION_DATE                  DECISION_DATE
5              ORIGINAL_CERT_DATE             ORIGINAL_CERT_DATE
6                      VISA_CLASS                     VISA_CLASS
7                       JOB_TITLE                      JOB_TITLE
8                        SOC_CODE                       SOC_CODE
9                       SOC_TITLE                      SOC_TITLE
10             FULL_TIME_POSITION             FULL_TIME_POSITION
11                     BEGIN_DATE                     BEGIN_DATE
12                       END_DATE                       END_DATE
13         TOTAL_WORKER_POSITIONS         TOTAL_WORKER_POSITIONS
14                 NEW_EMPLOYMENT                 NEW_EMPLOYMENT
15           CONTINUED_EMPLOYMENT           CONTINUED_EMPLOYMENT
16     CHANGE_PREVIOUS_EMPLOYMENT     CHANGE_PREVIOUS_EMPLOYMENT
17      NEW_CONCURRENT_EMPLOYMENT      NEW_CONCURRENT_EMPLOYMENT
18                CHANGE_EMPLOYER                CHANGE_EMPLOYER
19               AMENDED_PETITION               AMENDED_PETITION
20                  EMPLOYER_NAME                  EMPLOYER_NAME
21                 TRADE_NAME_DBA                 TRADE_NAME_DBA
22              EMPLOYER_ADDRESS1              EMPLOYER_ADDRESS1
23              EMPLOYER_ADDRESS2              EMPLOYER_ADDRESS2
24                  EMPLOYER_CITY                  EMPLOYER_CITY
25                 EMPLOYER_STATE                 EMPLOYER_STATE
26           EMPLOYER_POSTAL_CODE           EMPLOYER_POSTAL_CODE
27               EMPLOYER_COUNTRY               EMPLOYER_COUNTRY
28              EMPLOYER_PROVINCE              EMPLOYER_PROVINCE
29                 EMPLOYER_PHONE                 EMPLOYER_PHONE
30             EMPLOYER_PHONE_EXT             EMPLOYER_PHONE_EXT
31                     NAICS_CODE                     NAICS_CODE
32         EMPLOYER_POC_LAST_NAME         EMPLOYER_POC_LAST_NAME
33        EMPLOYER_POC_FIRST_NAME        EMPLOYER_POC_FIRST_NAME
34       EMPLOYER_POC_MIDDLE_NAME       EMPLOYER_POC_MIDDLE_NAME
35         EMPLOYER_POC_JOB_TITLE         EMPLOYER_POC_JOB_TITLE
36          EMPLOYER_POC_ADDRESS1          EMPLOYER_POC_ADDRESS1
37          EMPLOYER_POC_ADDRESS2          EMPLOYER_POC_ADDRESS2
38              EMPLOYER_POC_CITY              EMPLOYER_POC_CITY
39             EMPLOYER_POC_STATE             EMPLOYER_POC_STATE
40       EMPLOYER_POC_POSTAL_CODE       EMPLOYER_POC_POSTAL_CODE
41           EMPLOYER_POC_COUNTRY           EMPLOYER_POC_COUNTRY
42          EMPLOYER_POC_PROVINCE          EMPLOYER_POC_PROVINCE
43             EMPLOYER_POC_PHONE             EMPLOYER_POC_PHONE
44         EMPLOYER_POC_PHONE_EXT         EMPLOYER_POC_PHONE_EXT
45             EMPLOYER_POC_EMAIL             EMPLOYER_POC_EMAIL
46    AGENT_REPRESENTING_EMPLOYER    AGENT_REPRESENTING_EMPLOYER
47       AGENT_ATTORNEY_LAST_NAME       AGENT_ATTORNEY_LAST_NAME
48      AGENT_ATTORNEY_FIRST_NAME      AGENT_ATTORNEY_FIRST_NAME
49     AGENT_ATTORNEY_MIDDLE_NAME     AGENT_ATTORNEY_MIDDLE_NAME
50        AGENT_ATTORNEY_ADDRESS1        AGENT_ATTORNEY_ADDRESS1
51        AGENT_ATTORNEY_ADDRESS2        AGENT_ATTORNEY_ADDRESS2
52            AGENT_ATTORNEY_CITY            AGENT_ATTORNEY_CITY
53           AGENT_ATTORNEY_STATE           AGENT_ATTORNEY_STATE
54     AGENT_ATTORNEY_POSTAL_CODE     AGENT_ATTORNEY_POSTAL_CODE
55         AGENT_ATTORNEY_COUNTRY         AGENT_ATTORNEY_COUNTRY
56        AGENT_ATTORNEY_PROVINCE        AGENT_ATTORNEY_PROVINCE
57           AGENT_ATTORNEY_PHONE           AGENT_ATTORNEY_PHONE
58       AGENT_ATTORNEY_PHONE_EXT       AGENT_ATTORNEY_PHONE_EXT
59   AGENT_ATTORNEY_EMAIL_ADDRESS   AGENT_ATTORNEY_EMAIL_ADDRESS
60     LAWFIRM_NAME_BUSINESS_NAME     LAWFIRM_NAME_BUSINESS_NAME
61         STATE_OF_HIGHEST_COURT         STATE_OF_HIGHEST_COURT
62    NAME_OF_HIGHEST_STATE_COURT    NAME_OF_HIGHEST_STATE_COURT
63               WORKSITE_WORKERS               WORKSITE_WORKERS
64               SECONDARY_ENTITY               SECONDARY_ENTITY
65 SECONDARY_ENTITY_BUSINESS_NAME SECONDARY_ENTITY_BUSINESS_NAME
66              WORKSITE_ADDRESS1              WORKSITE_ADDRESS1
67              WORKSITE_ADDRESS2              WORKSITE_ADDRESS2
68                  WORKSITE_CITY                  WORKSITE_CITY
69                WORKSITE_COUNTY                WORKSITE_COUNTY
70                 WORKSITE_STATE                 WORKSITE_STATE
71           WORKSITE_POSTAL_CODE           WORKSITE_POSTAL_CODE
72          WAGE_RATE_OF_PAY_FROM          WAGE_RATE_OF_PAY_FROM
73            WAGE_RATE_OF_PAY_TO            WAGE_RATE_OF_PAY_TO
74               WAGE_UNIT_OF_PAY               WAGE_UNIT_OF_PAY
75                PREVAILING_WAGE                PREVAILING_WAGE
76                 PW_UNIT_OF_PAY                 PW_UNIT_OF_PAY
77             PW_TRACKING_NUMBER             PW_TRACKING_NUMBER
78                  PW_WAGE_LEVEL                  PW_WAGE_LEVEL
79                    PW_OES_YEAR                    PW_OES_YEAR
80                PW_OTHER_SOURCE                PW_OTHER_SOURCE
81                  PW_OTHER_YEAR                  PW_OTHER_YEAR
82            PW_SURVEY_PUBLISHER            PW_SURVEY_PUBLISHER
83                 PW_SURVEY_NAME                 PW_SURVEY_NAME
84       TOTAL_WORKSITE_LOCATIONS       TOTAL_WORKSITE_LOCATIONS
85          AGREE_TO_LC_STATEMENT          AGREE_TO_LC_STATEMENT
86                 H-1B_DEPENDENT                  H1B_DEPENDENT
87               WILLFUL_VIOLATOR               WILLFUL_VIOLATOR
88                    SUPPORT_H1B                    SUPPORT_H1B
89                STATUTORY_BASIS                STATUTORY_BASIS
90            APPENDIX_A_ATTACHED            APPENDIX_A_ATTACHED
91              PUBLIC_DISCLOSURE              PUBLIC_DISCLOSURE
92             PREPARER_LAST_NAME             PREPARER_LAST_NAME
93            PREPARER_FIRST_NAME            PREPARER_FIRST_NAME
94        PREPARER_MIDDLE_INITIAL        PREPARER_MIDDLE_INITIAL
95         PREPARER_BUSINESS_NAME         PREPARER_BUSINESS_NAME
96                 PREPARER_EMAIL                 PREPARER_EMAIL

Unfortunately, I don't readily see any differences in the column names. If my eyes are not mistaken, the problem may be something like leading or trailing spaces in some of the column names, which expands my duty! Now, I have to try to parse out the odd man out using the compare() function from the {waldo} package:

# Comparing the column names of data files 1 and 9
waldo::compare(names(total_data[[1]]), names(total_data[[9]]))
     old                        | new                            
[83] "PW_SURVEY_NAME"           | "PW_SURVEY_NAME"           [83]
[84] "TOTAL_WORKSITE_LOCATIONS" | "TOTAL_WORKSITE_LOCATIONS" [84]
[85] "AGREE_TO_LC_STATEMENT"    | "AGREE_TO_LC_STATEMENT"    [85]
[86] "H-1B_DEPENDENT"           - "H1B_DEPENDENT"            [86]
[87] "WILLFUL_VIOLATOR"         | "WILLFUL_VIOLATOR"         [87]
[88] "SUPPORT_H1B"              | "SUPPORT_H1B"              [88]
[89] "STATUTORY_BASIS"          | "STATUTORY_BASIS"          [89]

Thanks to the {waldo} package, the task became much easier. The problem is in the 86th column: it is named H1B_DEPENDENT in the 2022 file but H-1B_DEPENDENT in the 2020 and 2021 files. Now, I have a choice: change the variable in the eight 2020-2021 data files, or in the one 2022 file. However, it is not one of the variables that I need, so I won't worry much about it.
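That said, if I ever needed that column, harmonizing it would be a one-liner; a sketch of the fix, renaming the 2022 column to match the other files:

# Sketch: align the 2022 column name with the 2020-2021 files (not needed here)
names(total_data[[9]])[names(total_data[[9]]) == "H1B_DEPENDENT"] <- "H-1B_DEPENDENT"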

My next step is to select the required columns from these data sets and drop the others. Again, it would take a lot of code and time to apply select() to all 9 data sets one by one. Thus, I am going to create a function that selects the required columns once I map it over total_data. Based on my prior experience, I am going to select the following columns:

Variables and Their Brief Descriptions.
Variable_Names           Description
case_number              LCA application case number
case_status              Application outcome/decision
lca_application_date     Application date
lca_decision_date        Date of the application decision
visa_class               Employee's proposed visa type, e.g., H1B
soc_title                Job category, e.g., computer systems analyst
job_type                 Full-time or part-time job
job_begin_date           First date of work
total_worker             Total H1B workers in the company
employer_name            Name of the hiring company
employer_state           Employer's state
employer_zip             Employer's zip code
attorney_representation  Whether an attorney filed the case
prevailing_wage          Proposed wage for the employment
pay_wage_level           Wage level, ranging from I to IV

Here's a function, select_variables, that selects the required variables. I will then map it over total_data and check that it worked.

# Creating the function
select_variables <- function(sv) {
  sv |>
    select( # select() supersedes select_at(vars(...))
      CASE_NUMBER,
      CASE_STATUS, RECEIVED_DATE,
      DECISION_DATE, VISA_CLASS,
      SOC_TITLE, FULL_TIME_POSITION,
      BEGIN_DATE, TOTAL_WORKER_POSITIONS,
      EMPLOYER_NAME, EMPLOYER_STATE,
      EMPLOYER_POSTAL_CODE,
      AGENT_REPRESENTING_EMPLOYER,
      PREVAILING_WAGE, PW_UNIT_OF_PAY
    )
}

# Passing it through the list total_data
total_data <- xfun::cache_rds({
  total_data |>
    map(select_variables)
})

# Checking if the above action worked
total_data |>
  map(names)
[[1]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

[[2]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

[[3]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

[[4]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

[[5]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

[[6]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

[[7]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

[[8]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

[[9]]
 [1] "CASE_NUMBER"                 "CASE_STATUS"                
 [3] "RECEIVED_DATE"               "DECISION_DATE"              
 [5] "VISA_CLASS"                  "SOC_TITLE"                  
 [7] "FULL_TIME_POSITION"          "BEGIN_DATE"                 
 [9] "TOTAL_WORKER_POSITIONS"      "EMPLOYER_NAME"              
[11] "EMPLOYER_STATE"              "EMPLOYER_POSTAL_CODE"       
[13] "AGENT_REPRESENTING_EMPLOYER" "PREVAILING_WAGE"            
[15] "PW_UNIT_OF_PAY"             

Beautiful! I successfully trimmed all of the data sets at once. I now have only those columns that I deem important for my study.

I still have 9 separate data sets. Now I have to combine them into a single giant data frame by row-binding them with dplyr's bind_rows(). I am going to save the new object as LCA_data.

# Combining
LCA_data <- xfun::cache_rds({
  total_data |>
    bind_rows()
})

# Checking
str(LCA_data)
tibble [1,526,247 x 15] (S3: tbl_df/tbl/data.frame)
 $ CASE_NUMBER                : chr [1:1526247] "I-200-19268-393467" "I-200-19268-638983" "I-200-19268-177184" "I-200-19268-936403" ...
 $ CASE_STATUS                : chr [1:1526247] "Certified" "Certified" "Certified" "Certified" ...
 $ RECEIVED_DATE              : POSIXct[1:1526247], format: "2019-09-25" "2019-09-25" ...
 $ DECISION_DATE              : POSIXct[1:1526247], format: "2019-10-01" "2019-10-01" ...
 $ VISA_CLASS                 : chr [1:1526247] "H-1B" "H-1B" "H-1B" "H-1B" ...
 $ SOC_TITLE                  : chr [1:1526247] "COMPUTER OCCUPATIONS, ALL OTHER" "SOFTWARE DEVELOPERS, APPLICATIONS" "MECHANICAL ENGINEERS" "SOFTWARE DEVELOPERS, APPLICATIONS" ...
 $ FULL_TIME_POSITION         : chr [1:1526247] "Y" "Y" "Y" "Y" ...
 $ BEGIN_DATE                 : POSIXct[1:1526247], format: "2019-10-07" "2020-01-08" ...
 $ TOTAL_WORKER_POSITIONS     : num [1:1526247] 1 1 1 1 1 1 1 1 1 1 ...
 $ EMPLOYER_NAME              : chr [1:1526247] "JO-ANN STORES, INC." "DENKEN SOLUTIONS INC." "EPITEC, INC." "SYSTEMS TECHNOLOGY GROUP, INC." ...
 $ EMPLOYER_STATE             : chr [1:1526247] "OH" "CA" "MI" "MI" ...
 $ EMPLOYER_POSTAL_CODE       : chr [1:1526247] "44236" "92618" "48033" "48084" ...
 $ AGENT_REPRESENTING_EMPLOYER: chr [1:1526247] "Y" "Y" "Y" "N" ...
 $ PREVAILING_WAGE            : num [1:1526247] 95118 39 39 53 65333 ...
 $ PW_UNIT_OF_PAY             : chr [1:1526247] "Year" "Hour" "Hour" "Hour" ...

Based on the output, I have a big data file named LCA_data that has 15 columns and 1,526,247 rows.

2. Tidying Data

Now, I have come to the stage where I tidy my data for further analyses. I have raw data, which requires some filtering, changing of variable types, and perhaps changing letter cases.

A. Getting Rid of Unnecessary Data Points

Based on my prior experience, I realized that some LCA applications take a very long time, even more than a year. These cases are often first certified and withdrawn afterward. I don't really need this block of data. But before deleting anything, let's check the categories of CASE_STATUS.

xfun::cache_rds({
  LCA_data |>
    count(CASE_STATUS) |>
    mutate(pct = n / sum(n) * 100) |> # avoid hard-coding the row total
    gt()
})
CASE_STATUS n pct
Certified 1430063 93.70
Certified - Withdrawn 58374 3.82
Denied 8802 0.58
Withdrawn 29008 1.90

Approximately 94% of LCA applications at DOL are approved. As shown above, about 4% were certified then withdrawn, less than 1% were denied, and roughly 2% were withdrawn. Now, I am going to keep just the certified cases.

LCA_data <- xfun::cache_rds({
  LCA_data |>
    filter(CASE_STATUS == "Certified")
})
# Checking
dim(LCA_data)
[1] 1430063      15

The new dimensions of LCA_data confirm that the number of rows equals the number of certified cases. Amazing!

Although most H1B employees work on a full-time basis, I have seen in FULL_TIME_POSITION that some are part time as well. There were a total of 23,066 part-time cases, i.e., about 1.6% of the total. I am not sure how that is possible. Thus, I want to remove all part-time employees from further analyses.

LCA_data |>
  count(FULL_TIME_POSITION) |>
  mutate(pct = n / sum(n)) |> # proportion; multiply by 100 for percent
  gt()
FULL_TIME_POSITION n pct
N 23066 0.016
Y 1406997 0.984
LCA_data <- xfun::cache_rds({
  LCA_data |>
    filter(FULL_TIME_POSITION == "Y")
})
# Checking
dim(LCA_data)
[1] 1406997      15

The new dimensions show that the total dropped from 1,430,063 to 1,406,997 data points. The gap is exactly equal to the number of part-time positions.

Similarly, when I looked at the structure of the variable PW_UNIT_OF_PAY, I saw that some wages are reported on an hourly, weekly, bi-weekly, monthly, or yearly basis. I need consistency here: all of them should be on the same basis of pay. I could convert all the other pay rates to yearly wages, but I am not sure those conversions would be valid. Given that I have a huge data set, I would rather limit my study to the H1B employees who reported a yearly wage in their LCA applications.
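For the record, the conversion I decided against would look something like the sketch below; the multipliers (2,080 work hours, 52 weeks, 26 bi-weekly periods, 12 months per year) are my own assumptions about full-time schedules, which is exactly the uncertainty that makes me prefer filtering:

# Sketch only (not run): convert every pay unit to an assumed yearly basis
LCA_data |>
  mutate(
    yearly_wage = case_when(
      PW_UNIT_OF_PAY == "Hour"      ~ PREVAILING_WAGE * 2080, # 40 h x 52 wk assumed
      PW_UNIT_OF_PAY == "Week"      ~ PREVAILING_WAGE * 52,
      PW_UNIT_OF_PAY == "Bi-Weekly" ~ PREVAILING_WAGE * 26,
      PW_UNIT_OF_PAY == "Month"     ~ PREVAILING_WAGE * 12,
      PW_UNIT_OF_PAY == "Year"      ~ PREVAILING_WAGE
    )
  )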

# Checking
LCA_data |>
  count(PW_UNIT_OF_PAY) |>
  mutate(pct = n / sum(n) * 100) |>
  gt()
PW_UNIT_OF_PAY n pct
Bi-Weekly 200 0.014
Hour 60105 4.272
Month 790 0.056
Week 181 0.013
Year 1345721 95.645
# Filtering
LCA_data <- LCA_data |>
  filter(PW_UNIT_OF_PAY == "Year")
# Checking
dim(LCA_data)
[1] 1345721      15

As can be seen, the total came down to 1,345,721 data points after limiting the data to yearly wages, matching the number of cases that reported a yearly wage in their LCA applications.

Finally, I can see that a few visa types are mentioned in the LCAs. I would like to limit my study to the H-1B visa class only, excluding the E-3 Australian and the H-1B1 Chile and Singapore classes.

# Calculating
LCA_data |>
  count(VISA_CLASS) |>
  mutate(pct = n / sum(n) * 100) |>
  gt()
VISA_CLASS n pct
E-3 Australian 27075 2.01
H-1B 1312667 97.54
H-1B1 Chile 2956 0.22
H-1B1 Singapore 3023 0.22
# Filtering
LCA_data <- LCA_data |>
  filter(VISA_CLASS == "H-1B")
dim(LCA_data)
[1] 1312667      15

I got the statistics, and I went ahead and removed the E-3 Australian, H-1B1 Chile, & H-1B1 Singapore visa classes, which brought my data down to 1,312,667 points.

B. Removing Unnecessary Variables

I have worked on four variables so far. All of them had a few categories to begin with; however, I limited my study to single sub-categories. For example, CASE_STATUS had four different categories, which I filtered down to "Certified"; the same is true of FULL_TIME_POSITION == "Y", PW_UNIT_OF_PAY == "Year", and VISA_CLASS == "H-1B". When there is no variability, I don't need to keep these variables any more. I am going to get rid of them right now:

LCA_data$CASE_STATUS <- NULL
LCA_data$FULL_TIME_POSITION <- NULL
LCA_data$PW_UNIT_OF_PAY <- NULL
LCA_data$VISA_CLASS <- NULL

# Getting rid of NAs
LCA_data <- LCA_data |>
  na.omit()

# Checking
names(LCA_data)
 [1] "CASE_NUMBER"                 "RECEIVED_DATE"              
 [3] "DECISION_DATE"               "SOC_TITLE"                  
 [5] "BEGIN_DATE"                  "TOTAL_WORKER_POSITIONS"     
 [7] "EMPLOYER_NAME"               "EMPLOYER_STATE"             
 [9] "EMPLOYER_POSTAL_CODE"        "AGENT_REPRESENTING_EMPLOYER"
[11] "PREVAILING_WAGE"            

The final list of variables shows that I got rid of the four variables mentioned above.

3. Transforming Data

So far, so good. When I observed the output of str(LCA_data), I realized that the time-related variables are in the POSIXct class. That's not bad; however, it still carries an HH:MM:SS component. First, I want to drop these precision time units. Second, I want to be able to count the number of days between the LCA application date, the decision date, and the first day of work, so I will convert them to plain Date class (lubridate's as_date() would work just as well).

A. Changing Class

# as.Date() on POSIXct drops the clock time; no format string is needed here
LCA_data$RECEIVED_DATE <- as.Date(LCA_data$RECEIVED_DATE)
LCA_data$DECISION_DATE <- as.Date(LCA_data$DECISION_DATE)
LCA_data$BEGIN_DATE <- as.Date(LCA_data$BEGIN_DATE)
# Checking
data.frame(class(LCA_data$RECEIVED_DATE), class(LCA_data$DECISION_DATE), class(LCA_data$BEGIN_DATE))
  class.LCA_data.RECEIVED_DATE. class.LCA_data.DECISION_DATE.
1                          Date                          Date
  class.LCA_data.BEGIN_DATE.
1                       Date

Furthermore, based on my previous studies, I know that some applications took a very long time to be adjudicated. I want to check the situation in this data set and see whether I can limit cases based on this timeline. I have to create two new variables to be able to do this.

B. Creating New Variables

Now, I want to create two new variables using the date variables I worked on above. The first, DECISION_TIME, is the number of days it took for a case to get a decision from the Department of Labor. The second, DAYS_TOWORK, is the gap between the application date and the proposed first day of work.

# Creating
LCA_data <- xfun::cache_rds({
  LCA_data |>
    mutate(
      DECISION_TIME = DECISION_DATE - RECEIVED_DATE,
      DAYS_TOWORK = BEGIN_DATE - RECEIVED_DATE
    )
})
# Checking
data.frame(mean(LCA_data$DECISION_TIME), mean(LCA_data$DAYS_TOWORK))
  mean.LCA_data.DECISION_TIME. mean.LCA_data.DAYS_TOWORK.
1                     7.3 days                    71 days
data.frame(range(LCA_data$DECISION_TIME), range(LCA_data$DAYS_TOWORK))
  range.LCA_data.DECISION_TIME. range.LCA_data.DAYS_TOWORK.
1                        4 days                    -34 days
2                       29 days                    184 days

Based on the output, the average LCA decision time was 7.3 days, with a range of 4-29 days. Likewise, the average gap between the application date and the proposed first day of work was 71 days, with a range of -34 to 184 days. This makes sense because employers have to file the H1B petition after the LCA is approved; the LCA approval alone doesn't allow one to start working.

C. Changing Column Names to Lowercase for Consistency and Ease

It is inconvenient to have the column names in uppercase letters. It slows down the flow of my work, and I am more prone to typing errors because column references are case sensitive. Thus, I am going to change all column names to lowercase using the clean_names() function from the {janitor} package. Afterward, I will change the class of some of the variables to factor using as_factor().

# Changing to lower case names
LCA_data <- LCA_data |>
  clean_names()

# Changing the variable class
LCA_data <- LCA_data |>
  mutate(
    employer_state = as_factor(employer_state),
    employer_postal_code = as_factor(employer_postal_code),
    agent_representing_employer = as_factor(agent_representing_employer),
    soc_title = as_factor(soc_title)
  ) |>
  select(
    case_number, received_date, decision_date, decision_time,
    begin_date, days_towork, employer_name, employer_state,
    employer_postal_code, agent_representing_employer,
    soc_title, total_worker_positions, prevailing_wage
  )

Now, I want to check the summary of all of these variables. I will be looking for discrepancies in the structure that need further action.

summary(LCA_data)
 case_number        received_date        decision_date        decision_time    
 Length:1312647     Min.   :2019-09-25   Min.   :2019-10-01   Length:1312647   
 Class :character   1st Qu.:2020-05-13   1st Qu.:2020-05-20   Class :difftime  
 Mode  :character   Median :2020-12-09   Median :2020-12-16   Mode  :numeric   
                    Mean   :2020-11-14   Mean   :2020-11-21                    
                    3rd Qu.:2021-04-21   3rd Qu.:2021-04-28                    
                    Max.   :2021-12-22   Max.   :2021-12-30                    
                                                                               
   begin_date         days_towork       employer_name      employer_state  
 Min.   :2019-09-25   Length:1312647    Length:1312647     CA     :281308  
 1st Qu.:2020-08-29   Class :difftime   Class :character   TX     :153793  
 Median :2021-01-25   Mode  :numeric    Mode  :character   NJ     :142180  
 Mean   :2021-01-24                                        NY     : 85307  
 3rd Qu.:2021-08-09                                        WA     : 78639  
 Max.   :2022-06-23                                        IL     : 76981  
                                                           (Other):494439  
 employer_postal_code agent_representing_employer
 77845  :  50009      N  :149293                 
 98121  :  39835      No :223577                 
 94043  :  31671      Y  :352137                 
 20850  :  27108      Yes:587640                 
 98052  :  22954                                 
 95054  :  22871                                 
 (Other):1118199                                 
                                            soc_title     
 Software Developers, Applications               :436552  
 Computer Systems Analysts                       : 97500  
 Software Developers, Systems Software           : 82575  
 Computer Systems Engineers/Architects           : 43121  
 Software Quality Assurance Engineers and Testers: 33817  
 Information Technology Project Managers         : 29919  
 (Other)                                         :589163  
 total_worker_positions prevailing_wage 
 Min.   :  1            Min.   : 15080  
 1st Qu.:  1            1st Qu.: 76898  
 Median :  1            Median : 93558  
 Mean   :  2            Mean   : 98300  
 3rd Qu.:  1            3rd Qu.:115773  
 Max.   :300            Max.   :760202  
                                        

A nice, clean summary. I can immediately see problems in a couple of variables. The first is agent_representing_employer, which mixes two encodings of Yes/No (Y/Yes and N/No). I have to collapse them into one form. Second, looking at total_worker_positions, I realize that I may not need this variable at all, because the overwhelming majority of cases have one worker at the minimum, 1st quartile, median, and even 3rd quartile. It is hugely skewed, so I am better off dropping it.

# Recoding agent_representing_employer and dropping total_worker_positions
LCA_data <- LCA_data |>
  mutate(
    agent_representing_employer = case_when(
      agent_representing_employer %in% c("N", "No") ~ "No",
      agent_representing_employer %in% c("Y", "Yes") ~ "Yes"
    )
  ) |>
  mutate(agent_representing_employer = as_factor(agent_representing_employer)) |>
  select(
    case_number, received_date, decision_date, decision_time,
    begin_date, days_towork, employer_name, employer_state,
    employer_postal_code, agent_representing_employer,
    soc_title, prevailing_wage
  )

# Checking
summary(LCA_data$agent_representing_employer)
    No    Yes 
372870 939777 

I have successfully gotten rid of total_worker_positions and recoded N and Y to No and Yes in the agent_representing_employer variable. In my pilot study of a similar data set, I stumbled on a problem with the employer_postal_code variable: most applications used 5-digit postal codes, while some used 5-plus-4-digit codes. Thus, I am going to make them consistent by truncating all of them to 5-digit codes.

glimpse(LCA_data$employer_postal_code)
 Factor w/ 11961 levels "00646","00717-9997",..: 5457 6337 3833 8563 7793 3752 3744 10237 10020 10237 ...
LCA_data$employer_postal_code <- as_factor((substr(LCA_data$employer_postal_code, 1, 5)))
glimpse(LCA_data$employer_postal_code)
 Factor w/ 10780 levels "00646","00717",..: 4888 5705 3391 7756 7040 3326 3318 9304 9093 9304 ...

The employer_postal_code is in good shape now. Looking at the summary of my data frame earlier, I realized that there can be some problems in the employers' names, i.e., employer_name, and the job titles, i.e., soc_title. I want to change all of them to title case. In addition, some of the company names and job titles are trailed by numbers and special characters. I am going to remove all of these inside mutate(), using gsub() on the offending pattern.

LCA_data <- LCA_data |>
  mutate(
    soc_title = str_to_title(soc_title),
    employer_name = str_to_title(employer_name)
  ) |>
  mutate(
    # strip a trailing ".-<digits>" artifact from the names
    soc_title = factor(gsub("\\.-[0-9]*$", "", soc_title)),
    employer_name = factor(gsub("\\.-[0-9]*$", "", employer_name))
  )
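Since I mentioned str_remove_all() but ended up using gsub(), here is the equivalent {stringr} phrasing for my notes; both remove the same trailing pattern:

# Equivalent cleanup with stringr (same regex, same result)
LCA_data |>
  mutate(
    soc_title = factor(str_remove_all(as.character(soc_title), "\\.-[0-9]*$")),
    employer_name = factor(str_remove_all(as.character(employer_name), "\\.-[0-9]*$"))
  )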

I got the desired variables and clean names. There may be a few rough edges along the way, but I think I am in pretty good shape so far. I am ready to create some plots and start visualizing the data.

4. Visualizing Data

Let me begin with the visuals of the decision time and first day of work. I am going to use ggplot2.

decision_density <- ggplot() +
  geom_freqpoly(data = LCA_data, aes(decision_time, after_stat(density)), color = "darkred") +
  geom_freqpoly(data = LCA_data, aes(days_towork, after_stat(density)), color = "darkblue") +
  xlab("Red: decision_time in days & Blue: days_towork") +
  ggtitle("Distribution of decision_time & days_towork among LCA Applicants")
decision_density

One of the most amazing things in the plot above is the consistency of the LCA decision time. It peaks the day after the employer registers the LCA application and drops to zero within 10 days of application. They must have a pretty good system for processing LCA applications. The first day of work extends much further out, which makes complete sense because it is a tentative date and sometimes expectations are not met.

A. Checking to see Prevailing Wage Distribution

prevailing_wage_distribution <- ggplot(LCA_data, aes(prevailing_wage / 1000)) +
  geom_histogram(fill = "dodgerblue", color = "hotpink4", binwidth = 5) +
  geom_vline(xintercept = mean(LCA_data$prevailing_wage) / 1000, color = "darkred", linetype = "dashed", size = 1.5, alpha = 0.4) +
  xlab("Prevailing Wages in Thousands of Dollar") +
  ylab(expression("Cumulative Frequency or Count")) +
  ggtitle("Prevailing-Wage-Distribution for H1B Employees: 10/2019 - 12/2021")

prevailing_wage_distribution

mean(LCA_data$prevailing_wage)
[1] 98300

It looks like the prevailing wages are roughly normally distributed. As found in the pilot study, the average wage is around $98,300/year. Let's zoom in on the range between 200 and 300 thousand dollars.

zoomed_in <- prevailing_wage_distribution +
  xlim(c(200, 300)) + ylim(c(0, 1200)) +
  xlab("Prevailing Wages Between $200,000 and $300,000") +
  ylab(expression("Cumulative Frequency or Count")) +
  ggtitle("Upper Bound Prevailing Wage Distribution")
zoomed_in

It looks like there are a few thousand data points in the $200,000-$250,000 prevailing wage range. I will limit my study to wages below the $250,000 mark. For now, I would like to see which job titles and companies pay the highest rates. So, I am going to zoom in on the range between 300 and 800 thousand dollars.

highest_wages <- prevailing_wage_distribution +
  xlim(c(300, 800)) + ylim(c(0, 6)) +
  xlab("Prevailing Wages Between $300,000 and $8000,000") +
  ylab(expression("Cumulative Frequency or Count")) +
  ggtitle("Highest Prevailing Wage for H1B Between October 2019-December 2021")
highest_wages

Yes, there are a few data points between $300,000 and $800,000 in yearly prevailing wages. I am going to widen that slightly to include annual wages of $250,000 and higher, subset the data, and do further analyses.

B. Splitting Data in Two Groups

My ultimate interest is the regular data, with prevailing wages below $250,000. I am going to split the data into two smaller data sets. The first includes only the cases with a prevailing wage less than $250,000; the second includes only the cases with an annual prevailing wage of $250,000 or higher. I will save them as LCA_Final and LCA_High, respectively.

LCA_Final <- LCA_data |>
  filter(prevailing_wage < 250000)
LCA_High <- LCA_data |>
  filter(prevailing_wage >= 250000)
rbind(dim(LCA_Final), dim(LCA_High))
        [,1] [,2]
[1,] 1311377   12
[2,]    1270   12

I have successfully divided the data into two portions, LCA_Final and LCA_High. LCA_Final has 1,311,377 cases, LCA_High has 1,270, and both have 12 columns.

As mentioned earlier, I want to dig a little deeper into the LCA_High data set. First, I would like to see whether there are any patterns in terms of soc_title and employer_name.

# Checking for job title
j_title <- LCA_High |>
  count(soc_title) |>
  mutate(pct = n / sum(n) * 100) |>
  arrange(desc(pct))
data.frame(j_title)
                                         soc_title   n    pct
1        Computer And Information Systems Managers 303 23.858
2               Physicians And Surgeons, All Other 274 21.575
3                               Financial Managers 202 15.906
4                                 Chief Executives  72  5.669
5                              Internists, General  67  5.276
6                                     Hospitalists  49  3.858
7                                     Neurologists  30  2.362
8                                    Psychiatrists  29  2.283
9                                          Lawyers  22  1.732
10                                    Radiologists  20  1.575
11        Financial Managers, Branch Or Department  19  1.496
12                                        Surgeons  18  1.417
13      Health Specialties Teachers, Postsecondary  16  1.260
14                               Anesthesiologists  15  1.181
15          Architectural And Engineering Managers  15  1.181
16                              Marketing Managers  15  1.181
17                          Pediatricians, General  14  1.102
18                Family And General Practitioners  10  0.787
19      Medical Scientists, Except Epidemiologists   9  0.709
20                 Obstetricians And Gynecologists   8  0.630
21                                    Pathologists   8  0.630
22                      Treasurers And Controllers   7  0.551
23                               Dentists, General   6  0.472
24                              Financial Analysts   6  0.472
25                 General And Operations Managers   6  0.472
26                             Management Analysts   6  0.472
27                             Managers, All Other   6  0.472
28                       Natural Sciences Managers   5  0.394
29                                      Urologists   2  0.157
30            Architecture Teachers, Postsecondary   1  0.079
31                                        Chemists   1  0.079
32                   Chief Sustainability Officers   1  0.079
33                  Clinical Research Coordinators   1  0.079
34                                  Dermatologists   1  0.079
35                                     Hospitalist   1  0.079
36 Physical Medicine And Rehabilitation Physicians   1  0.079
37              Physicans And Surgeons, All Others   1  0.079
38                         Physicians And Surgeons   1  0.079
39               Physicians And Surgeons,All Other   1  0.079
40                                  Sales Managers   1  0.079

Job title definitely seems to drive the prevailing wage. However, the output shows 40 different job titles, some of which clearly refer to the same occupation, e.g., Physicians And Surgeons, All Others vs. Physicians And Surgeons, All Other; they are listed separately only because of the trailing s in others. After a close look at these job titles, I decided to put them into four big buckets: Finance_Sector, Engineering_Sector, Medical_Sector, & Computer_IT_Managers.

# Putting job_titles in big buckets
LCA_High_Modified <- LCA_High |>
  mutate(
    soc_title = case_when(
      soc_title %in% c("Computer And Information Systems Managers") ~ "Computer_IT_Managers",
      soc_title %in% c(
        "Physicians And Surgeons,All Other",
        "Physicians And Surgeons",
        "Physicans And Surgeons, All Others",
        "Physical Medicine And Rehabilitation Physicians",
        "Hospitalist",
        "Dermatologists",
        "Urologists",
        "Dentists, General",
        "Pathologists",
        "Obstetricians And Gynecologists",
        "Family And General Practitioners",
        "Medical Scientists, Except Epidemiologists",
        "Pediatricians, General",
        "Anesthesiologists",
        "Health Specialties Teachers, Postsecondary",
        "Surgeons",
        "Radiologists",
        "Neurologists",
        "Psychiatrists",
        "Lawyers",
        "Hospitalists",
        "Clinical Research Coordinators",
        "Physicians And Surgeons, All Other",
        "Chemists"
      ) ~ "Medical_Sector",
      soc_title %in% c(
        "Architectural And Engineering Managers",
        "Natural Sciences Managers",
        "Architecture Teachers, Postsecondary",
        "Chief Sustainability Officers"
      ) ~ "Engineering_Sector",
      soc_title %in% c(
        "Sales Managers",
        "Financial Managers",
        "Chief Executives",
        "Internists, General",
        "Financial Managers, Branch Or Department",
        "Marketing Managers",
        "Treasurers And Controllers",
        "Financial Analysts",
        "General And Operations Managers",
        "Management Analysts",
        "Managers, All Other"
      ) ~ "Finance_Sector",
      TRUE ~ as.character(soc_title)
    )
  )

# Plotting the prevailing wages
g_density <- ggplot(LCA_High_Modified, aes(prevailing_wage)) +
  geom_density(aes(fill = factor(soc_title)), alpha = 0.4) +
  xlim(c(230000, 320000)) +
  ylim(c(0.0000, 0.0002)) +
  labs(
    title = "Prevailing Wage Density Plot",
    subtitle = "Four Highest Paying Job Areas",
    x = "Prevailing Wage",
    fill = "# Job Areas"
  )
g_density

# Counting by new Job Titles
j_title_modified <- LCA_High_Modified |>
  group_by(soc_title) |>
  count() |>
  mutate(pct = n / 1270 * 100) |>
  arrange(desc(pct))

# Counting Minimum, Mean, Maximum Prevailing Wages
wages <- LCA_High_Modified |>
  select(soc_title, prevailing_wage) |>
  group_by(soc_title) |>
  summarize(
    minimum_wage = min(prevailing_wage),
    mean_wage = mean(prevailing_wage),
    maximum_wage = max(prevailing_wage)
  )

# Merging Above Tables
wages_by_jtitles <- merge(j_title_modified, wages, by = "soc_title")
wages_by_jtitles
             soc_title   n  pct minimum_wage mean_wage maximum_wage
1 Computer_IT_Managers 303 23.9       250286    255872       305707
2   Engineering_Sector  22  1.7       251326    261095       287394
3       Finance_Sector 407 32.0       250099    264691       415589
4       Medical_Sector 538 42.4       250000    276317       760202

Besides the job title, I checked whether high prevailing wages followed a pattern by employer, but nothing stood out.
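For transparency, the employer check was a quick tally along these lines:

# Top employers in the high-wage subset; no single company dominated
LCA_High |>
  count(employer_name, sort = TRUE) |>
  head(10)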

5. Modeling Data

Q. What is the most recent trend of applying for LCA at the Department of Labor?

Hurrah! I made it to this question. It's a whole new world: I could do a lot of things, in other words, spend months with this data set identifying trends in every single aspect. However, every researcher comes to a point where they must decide what to dig into and what to leave open. With 96 original variables and well over a million data points, the possibilities are innumerable. After much thought about what I want to do, I came up with three ideas (a first modeling sketch follows the list):

  1. Take the top 5 companies and check their trend of hiring H1B employees based on the number of LCA filings. I am particularly interested to see whether the number of filings dropped. The data come from a time frame when the whole world was in the hot mess of COVID-19, which forced private, public, and government organizations to shut down. Many people lost their jobs. However, a few companies, especially tech companies like Amazon, reportedly made more money during this period as many people resorted to online shopping. More business requires more employees. Thus, it would be interesting to figure out how things played out in these top 5 companies.

  2. In the meantime, academic institutions resorted to hiring freezes. Many of them had not returned to regular operations even at the time this report was prepared. On the other hand, wages tend to go up a little every year. I would like to check whether there was any discontinuity in wage increases during this period, or simply how these two aspects played out in the top 5 job titles in academia.

  3. Finally, it would be interesting to check whether I can predict the number of LCAs for H1B based on these data. So, I will try to predict the total number of LCAs for the upcoming quarter.
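As a very first pass at idea 3, a minimal sketch could aggregate the certified filings by quarter and fit a simple linear trend. This is a naive baseline of my own, not a serious forecasting model; a proper attempt would have to account for seasonality and the annual H1B cap cycle:

# Naive baseline (sketch): quarterly counts + linear trend
quarterly_counts <- LCA_Final |>
  mutate(quarter = zoo::as.yearqtr(received_date)) |>
  count(quarter) |>
  mutate(quarter_index = as.numeric(quarter)) # e.g., 2020 Q2 -> 2020.25

trend_fit <- lm(n ~ quarter_index, data = quarterly_counts)

# Extrapolating one quarter ahead
next_quarter <- max(quarterly_counts$quarter_index) + 0.25
predict(trend_fit, newdata = data.frame(quarter_index = next_quarter))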

B. How did COVID-19 impact the top five US companies in terms of hiring international employees, i.e., employees who require an H1B to work in the United States?

Based on the analyses conducted above, we know that Amazon, Cognizant Technology Solutions, Microsoft Corporation, Tata Consultancy Services, and Google were the top 5 H1B employers during the study period. There are several different companies having Amazon in their names, like Amazon Web Services, Amazon.com Services, etc. Thus, I index all the companies using keywords and collapse them under a single name afterward.

# Indexing the top 5 companies for the trend analysis
top_5_companies <- LCA_Final |>
  filter(str_detect(employer_name, "Amazon|Tata|Microsoft|Cognizant|Google")) |>
  select(received_date, employer_name)
unique(top_5_companies$employer_name)


The output shows a lot of Amazon companies, a few Cognizant & Tata, and one each for Microsoft and Google. I am going to assign a single name to each respective company using the case_when() function. It's going to be a long manual process.
top_5_companies <- top_5_companies |>
  mutate(
    companies = case_when( # collapse the name variants into a single name
      employer_name %in% c(
        "Amazon.com Services Llc",
        "Amazon Web Services, Inc.",
        "Amazon Advertising Llc",
        "Amazon Development Center U.s., Inc.",
        "Amazon Web Services Inc",
        "Amazon Data Services, Inc",
        "Amazon Data Services, Inc.",
        "Amazon.com.ca, Inc.",
        "Amazon Robotics Llc",
        "Amazon Capital Services, Inc.",
        "Amazon Payments, Inc.",
        "Amazon Mechanical Turk, Inc.",
        "Amazon Registry Services Inc.",
        "Amazon Logistics",
        "Amazon Web Service,  Inc.",
        "Amazon.com.ca, Inc. (Us)"
      ) ~ "Amazon",
      employer_name %in% c(
        "Cognizant Technology Solutions Us Corp",
        "Trizetto Corporation - A Cognizant Company",
        "Tmg Health - A Cognizant Company",
        "Cognizant Worldwide Limited",
        "Cognizant Technology Solutions Us Corporation"
      ) ~ "Cognizant",
      employer_name %in% c(
        "Tata Consultancy Services Limited",
        "Tata Elxsi Ltd",
        "Tata Elxsi Limited",
        "Tata Technologies, Inc.",
        "Tata Communications (America), Inc."
      ) ~ "Tata",
      employer_name %in% "Google Llc" ~ "Google",
      employer_name %in% "Microsoft Corporation" ~ "Microsoft",
      TRUE ~ as.character(employer_name)
    )
  ) |>
  mutate(
    received_date_m = lubridate::date(received_date),
    quarter_date = zoo::as.yearqtr(received_date_m) # quarterly date layout
  ) |>
  select(received_date_m, quarter_date, companies) |>
  arrange(desc(received_date_m))
data.frame(top_5_companies)

Now I have collapsed all the different names into just five companies.
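From here, the trend in idea 1 is one count() and one line plot away; a sketch of what I have in mind, assuming {zoo} is installed for the quarterly axis:

# Quarterly LCA filings per company (sketch)
top_5_trend <- top_5_companies |>
  count(quarter_date, companies)

ggplot(top_5_trend, aes(x = quarter_date, y = n, color = companies)) +
  geom_line() +
  zoo::scale_x_yearqtr(format = "%Y Q%q") +
  labs(
    title = "Quarterly LCA Filings by the Top 5 H1B Employers",
    x = "Quarter", y = "Number of LCA Filings", color = "Company"
  )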

6. Communication of Results