H-1B VISA PETITION ANALYSIS 2016-17

Introduction

The H-1B program allows companies in the United States to temporarily employ foreign workers in occupations that require the theoretical and practical application of a body of highly specialized knowledge and a bachelor’s degree or higher in the specific specialty, or its equivalent. H-1B specialty occupations may include fields such as science, engineering and information technology.

In carrying out its responsibility for the processing of labor certification and labor attestation applications, the Office of Foreign Labor Certification (OFLC) generates program data that is essential both for internal assessment of program effectiveness and for providing the Department’s external stakeholders with useful information about the immigration programs administered by OFLC. In line with the Department’s commitment to the Open Government initiative and specific regulatory disclosure requirements, OFLC makes public the annual releases of program disclosure data to assist with external research and program evaluation.

It is always of interest to find out which locations, employers, job titles and salary range make up most of the H1B petitions. I’m also interested to learn the changes in trends of H-1B visa applicants in FY 2017 as compared to the FY 2016. H-1B filing data for FY 2016 and FY 2017 are pulled in from the Office of Foreign Labor Certification.

Packages Required

To analyze this data, we will use the following R packages:

library(readr)      #For reading csv file
library(dplyr)      #For Data transformation
library(ggplot2)    #For graphics
library(lubridate)  #For date time conversions
library(DT)         #For diplaying data with formatting
library(tidyr)      #For data transformation

Data Preparation

The raw data available is untidy and might not be suitable for analysis. So, prior to assessing the data, we must acquire and clean the data.

Loading H-1B data

The data for this project originated from the following sources:

FY-2016 H1-B filing data

FY-2017 H1-B filing data

These datasets are imported into R as csv files.

#Importing 2016 and 2017 data
efile_data_2016 <- read_csv("E:\\Data Wrangling with R\\project\\H-1B_Disclosure_Data_FY16.csv")
efile_data_2017 <- read_csv("E:\\Data Wrangling with R\\project\\H-1B_Disclosure_Data_FY17.csv")

Variable selection and data merging

There are 40 variables present in the 2016 H-1B dataset, whose data dictionary can be found here.The 2017 H-1B dataset contains 52 variables, whose data dictionary can be found here.These two datasets are merged to create a single dataframe containing both 2016 and 2017 Financial Year data. Out of the variables present in both the datasets, only 38 are common, So, we keep only those 38 variables in our dataset.

#merging datasets
names(efile_data_2016)[which(names(efile_data_2016) == "H-1B_DEPENDENT")] <- "H1B_DEPENDENT"  
common_cols <- intersect(colnames(efile_data_2016), colnames(efile_data_2017))
efile_data <- rbind(efile_data_2016[common_cols], efile_data_2017[common_cols])

# cleaning the environment
rm(efile_data_2016,efile_data_2017,common_cols)

Let’s take a look at the first six observations of the merged dataset.

The merged dataset has 38 variables and over 1.2 million records.

Data Cleaning

The raw merged data is messy and some cleaning steps needs to be performed on the data table before any exploratory data analysis can be performed on it. These are the data cleaning steps that are performed to get our dataset into the right shape:

  • The merged dataset has filing records for other kinds of visa processing such as H1-B1, H2-B2 etc. So, these records are filtered out to retain only those records corresponding to H-1B data.

  • The merged dataset we have has only 10 variables that are of interest for our analysis. So, variables other than these 10 are dropped from the final dataset.

  • We want to know if an employee is a full time or a part time worker.So, we create a new variable FULL_TIME and assign a value of Y when PW_UNIT_OF_PAY is Year and N when PW_UNIT_OF_PAY is one of Hour, Weekly, Bi-Weekly or Month.

  • Now we create FIN_YEAR variable and assign it a value of either 2016 or 2017, based on the financial year a decision is made on the case by the authorities.

  • We have to build a common scale for salary data for part time and full time employees so that it becomes easy to compare them.We do that by scaling the hourly, Weekly, Bi-Weekly and monthly salaries to an yearly pay.

  • It is found that the dataset has some missing values and the percentage of records with missing values is less than 0.1. So, these records are removed from the dataset.

  • Records with extreme PREVAILING_WAGE values are discarded.

##df1
df1 <- efile_data %>%
      filter(VISA_CLASS %in% "H-1B") %>%
      select(c("CASE_STATUS","DECISION_DATE","EMPLOYER_NAME","JOB_TITLE","SOC_NAME"),PREVAILING_WAGE:PW_UNIT_OF_PAY,
      H1B_DEPENDENT,WORKSITE_CITY,WORKSITE_STATE) %>%
      mutate( FULL_TIME = case_when(PW_UNIT_OF_PAY == 'Year' ~ 'Y', PW_UNIT_OF_PAY != 'Year' ~ 'N'),DECISION_DATE =   
      as.numeric(as.Date(DECISION_DATE, format = "%d-%m-%Y")),FIN_YEAR = case_when(16709 <= DECISION_DATE & 
      DECISION_DATE  <= 17074 ~ 2016,17075 <= DECISION_DATE & DECISION_DATE  <= 17439 ~ 2017)) %>%
      mutate(PREVAILING_WAGE = case_when(PW_UNIT_OF_PAY == 'Hour' ~ PREVAILING_WAGE*2087,PW_UNIT_OF_PAY == 'Year' ~
      PREVAILING_WAGE,PW_UNIT_OF_PAY=='Month'~PREVAILING_WAGE*12,PW_UNIT_OF_PAY=='Bi-Weekly'~PREVAILING_WAGE*21)) %>%
      select(-DECISION_DATE,-PW_UNIT_OF_PAY) %>%
      na.omit(df1) %>%
      filter(PREVAILING_WAGE > 5000 & (PREVAILING_WAGE < 200000))

Now that the data cleaning operations are performed and the final dataset that is used to perform the exploratory data analysis is obtained, let’s take a look at the final dataset.

Data Summary

The H-1B dataset has 10 variables in which most of them are cateforical. PREVAILING_WAGE is the only numeric variable pesent in the dataset. Let’s take a look at the Data dictionary.

VARIABLE DATA TYPE VARIABLE DESCRIPTION
CASE_STATUS Character Status associated with the last significant event or decision.
EMPLOYER_NAME Character Name of employer submitting labor condition application
JOB_TITLE Character Title of the job.
SOC_NAME Character Occupational name as per SOC system
PREVAILING_WAGE numeric Prevailing Wage for the job being requested
H1B_DEPENDENT Character Y = Employer is H-1B Dependent; N = Employer is not H-1B Dependent
WORKSITE_CITY Character City information of the foreign worker’s intended area of employment.
WORKSITE_STATE Character State information of the foreign worker’s intended area of employment.
FULL_TIME Character Y = Full Time, N = Part Time
FIN_YEAR numeric financial Year (2016/2017)

PREVAILING_WAGE is the variable that we are more concerned about. The data suggests that the median salary of the people who submitted H-1B applications in 2016 was $67,891 and that in 2017 is $71,510, which suggests that a 5.33% increase in median salary of applicants was observed in 2017 as compared to 2016.

Exploratory Data Analysis

Here are some of the relations that could be interesting to look at

  • Number of h1b submissions by state.

  • Employers who submit the most number of H-1B visa applications.

  • Most common Job Titles.

  • Cities that have high number of submissions.

  • States and cities with highest median salaries.

  • Comparision of 2016 statistics with 2017.

In order to look at the graphical representation of the data it is better to look at top 10 submissions based on a factor.

So, in order to extract top 10 records based on a factor per year, we create a function called ‘top_10_recordsby_year’. In order to extract the top 10 records overall, we also create a function called ‘top_n_records’.

#####function to extract top 10 records by a parameter

top_n_records <- function(col_name,n_rec) {
  col_name <- as.name(col_name)
  df <- df1 %>%
    group_by_("FIN_YEAR",col_name) %>%
    summarise(num_apps = n()) %>%
    arrange(desc(num_apps)) %>%
    slice(1:n_rec)
}
######function to extract top 10 records by a parameter based on year######

top_10_recordsby_year <- function(col_name) {
  col_name <- as.name(col_name)
df <- df1 %>%
    group_by_("FIN_YEAR",col_name) %>%
    summarise(num_apps = n())
df <- spread(df,key = FIN_YEAR, value = num_apps)
df$num_apps <- df$`2016` + df$`2017`
df <- arrange(df,desc(num_apps))
df <- df[1:10,-4]
df <-  gather(df, key = "FIN_YEAR",value = "num_apps",2:3)
}

H1B Visa Case Status

There are four types of case application statuses in the dataset. An application can be denied, withdrawn, certified or certified-withdrawn. Certified does not mean that the h1b is approved but it means that the application is approved and is sent for further processing. It is for this very reason, we are unable to fit a model to predict the factors impacting approval of h1b visa applications.

#### H1B Visa Case Status
ggplot(top_n_records("CASE_STATUS",10),aes(x = reorder(CASE_STATUS,-num_apps),y = num_apps, fill = CASE_STATUS)) +
  geom_bar(stat = "identity", alpha = 0.9, width = 0.7) +
  ggtitle("Figure1: case status of applications") +
  labs(x = "case status", y = "No. of Applications")

Bar chart above shows us the distribution of h1b visa status. After looking at the plot of case status, we can easily observe that the certified cases dominate this dataset.

Top H1B applicant states by year

Let’s look at the states with maximum number of submissions and compare the statistics of 2016 with those in 2017. This will help us understand which states file a lot of applications. Also, we can see the difference in the number of applications between 2016 and 2017 to find out if there is any significant difference in the number of applications in some states.

#### Top H1B applicant states by year
ggplot(top_10_recordsby_year("WORKSITE_STATE"),aes(x = WORKSITE_STATE, y = num_apps,fill = FIN_YEAR)) +
  geom_bar(stat = "identity", position = position_dodge() , alpha = 0.9,  width = 0.7) +
  coord_flip() +
  ggtitle("Figure2: Top 10 work sites with most applications") +
  labs(x = "State", y = "No. of Applications")

It is observed that California has the maximum number of submissions, followed by Texas, but, with only as many as half the number of submissions as that of California. There hasn’t been a lot of difference in the number of applications submitted in 2017 as compared to 2016 in any of the top 10 leading states.

Top H1B applicant cities by year

Let’s look at the cities with maximum number of submissions and compare the statistics of 2016 with those in 2017. This will help us understand which cities file a lot of applications. Also, we can see the difference in the number of applications between 2016 and 2017 to find out if there is any significant difference in the number of applications in some cities.

#### Top H1B applicant cities by year
ggplot(top_10_recordsby_year("WORKSITE_CITY"),aes(x = WORKSITE_CITY, y = num_apps,fill = FIN_YEAR)) +
  geom_bar(stat = "identity", position = position_dodge() , alpha = 0.9,  width = 0.7) +
  coord_flip() +
  ggtitle("Figure3: Top 10 work site cities with most applications") +
  labs(x = "City", y = "No. of Applications")

We see that New York has the maximum number of application in 2016 and 2017. Houston and San Franscisco come in distant second with less than half the number of applications submitted in New York. Among all the top 10 states, Houston sees a considerable decrease in the number of applications in 2017 as compared to 2016.

Full time vs part time applicants

Let’s look at the distribution of the number of applications based on full time status. It has been observed that in 2016 as well as 2017, the maximum number of applicants were in a full time role.

#####full time vs part time

df1 %>%
  group_by(FULL_TIME, FIN_YEAR) %>%
  summarise(Num_applications = n())
## # A tibble: 4 x 3
## # Groups:   FULL_TIME [?]
##   FULL_TIME FIN_YEAR Num_applications
##       <chr>    <dbl>            <int>
## 1         N     2016            37316
## 2         N     2017            37320
## 3         Y     2016           594870
## 4         Y     2017           570912

Comparision of wages by h1b dependency

Let’s take a look at the spread of wages for applications filed by both h1b dependant and non h1b dependant companies.

ggplot(df1, aes(x = H1B_DEPENDENT, y = PREVAILING_WAGE)) +
  geom_boxplot() +
  ggtitle("wage rate per year(in USD)") +
  labs(x = "H1B dependancy", y = "wage")

Occupations with highest wages

Because the dataset has a lot of outliers and is severely skewed, using the median wage as the metric to compare prevailing wages of different occupations will help reduce distortion and provide a better picture. Based on the median wage, we observe that family & general practitioners have the highest median wage, which is close to 200k.

#####Top 10 socs with the highest wages

top_10_soc_highest_wage <- df1 %>%
  group_by(SOC_NAME) %>%
  summarise(median_wage = median(PREVAILING_WAGE)) %>%
  arrange(desc(median_wage)) %>%
  slice(1:10) %>%
  select(SOC_NAME, median_wage)

ggplot(top_10_soc_highest_wage,aes(x = SOC_NAME, y = median_wage)) +
  geom_bar(stat = "identity", fill = "Blue", alpha = 0.9,  width = 0.7) +
  coord_flip() +
  ggtitle("Top 10 Standard Occupational Classification names and their median pay") +
  labs(x = "Standard Occupational Classification", y = "median salary")

States with highest wages

California and Washington are the states with highest median wages among all the applicants across all the states in the United States. Their median wage is approximately 95k.

#####top 10 states with highest wages

top_10_state_highest_wage <- df1 %>%
  group_by(WORKSITE_STATE) %>%
  summarise(median_wage = median(PREVAILING_WAGE)) %>%
  arrange(desc(median_wage)) %>%
  slice(1:10) %>%
  select(WORKSITE_STATE, median_wage)

ggplot(top_10_state_highest_wage,aes(x = reorder(WORKSITE_STATE, -median_wage), y = median_wage,fill = WORKSITE_STATE)) +
  geom_bar(stat = "identity", alpha = 0.9,  width = 0.7) +
  coord_flip() +
  ggtitle("Top 10 State names and their median pay") +
  labs(x = "State", y = "median salary")

Cities with highest wages

Surprisingly, it is observed that some smaller cities have highest median pay. It is probably because there are very less number of applicants from these cities and they are in a high paying role.

######top 10 cities with highest wages

top_10_city_highest_wage <- df1 %>%
  group_by(WORKSITE_CITY) %>%
  summarise(median_wage = median(PREVAILING_WAGE)) %>%
  arrange(desc(median_wage)) %>%
  slice(1:10) %>%
  select(WORKSITE_CITY, median_wage)

ggplot(top_10_city_highest_wage,aes(x = reorder(WORKSITE_CITY, -median_wage), y = median_wage,fill = WORKSITE_CITY)) +
  geom_bar(stat = "identity", alpha = 0.9,  width = 0.7) +
  coord_flip() +
  ggtitle("Top 10 cities and their median pay") +
  labs(x = "City", y = "median salary")

Summary

Exploratory data analysis was performed on the h1b petition dataset for the years 2016 and 2017 and following conclusions were made.

Anesthesiologists are the top earners among all the occupations. Anesthesiologist MD are the top earners.

California and Washington are the states with highest median wages among all the applicants across all the states in the United States. Their median wage is approximately 95k.

It is observed that California has the maximum number of submissions, followed by Texas, but, with only as many as half the number of submissions as that of California.

We see that New York has the maximum number of application in 2016 and 2017. Houston and San Francisco come in distant second with less than half the number of applications submitted in New York. Among all the top 10 states, Houston sees a considerable decrease in the number of applications in 2017 as compared to 2016.

Infosys Limited filed more than twice as many h1b visa applications as Tata Consultancy did in during both the years. Four out of the top ten companies in this chart are all Indian information technology companies.

Information Technology sector uses the maximum number of h1b slots available to be filled.

Computer developers, programmers and system analysts are the occupations which contribute to the maximum share of the h1b applications. This suggests that there is a dearth in the number of computer engineers in the United States.

This analysis provides quick information about the sectors in which there is lack of local talent and foreign resources are needed. This provides insights to universities/ job seekers to learn these technologies/skills to find a better job in the United States. One can also see which job roles are top paying. Also, h1b seekers will get a fair idea on the companies that sponsor the maximum share of h1b applications in the Unites States.

Limitations and Challenges

Data cleaning was the most challenging job apart from understanding the data. The dataset has lots of variables that were not of our interest. This dataset does not provide information if the h1b application is accepted or not. It is for this very reason, we are unable to fit a model to predict the factors impacting approval of h1b visa applications. Even if the dataset contains such information, it has a lot of categorical variables and very less continous and factor variables. So, it would still be difficult to build a model to predict the success of the visa application.

This dataset can be combined with the state wise literacy data and comparision of statewise literacy rates with number of submissions by state can be made. At the same time, one can also look at relationship between cost of living and the wages offered at various locations.