The H-1B program allows companies in the United States to temporarily employ foreign workers in occupations that require the theoretical and practical application of a body of highly specialized knowledge and a bachelor’s degree or higher in the specific specialty, or its equivalent. H-1B specialty occupations may include fields such as science, engineering and information technology.
In carrying out its responsibility for the processing of labor certification and labor attestation applications, the Office of Foreign Labor Certification (OFLC) generates program data that is essential both for internal assessment of program effectiveness and for providing the Department’s external stakeholders with useful information about the immigration programs administered by OFLC. In line with the Department’s commitment to the Open Government initiative and specific regulatory disclosure requirements, OFLC makes public the annual releases of program disclosure data to assist with external research and program evaluation.
It is always of interest to find out which locations, employers, job titles and salary ranges account for most H-1B petitions. I am also interested in how the trends among H-1B applicants changed in FY 2017 compared to FY 2016. H-1B filing data for FY 2016 and FY 2017 are pulled from the Office of Foreign Labor Certification.
To analyze this data, we will use the following R packages:
library(readr) #For reading csv file
library(dplyr) #For Data transformation
library(ggplot2) #For graphics
library(lubridate) #For date time conversions
library(DT) #For displaying data with formatting
library(tidyr) #For data transformation
The raw data is untidy and not directly suitable for analysis, so before assessing it we must acquire and clean the data.
The data for this project originated from the following sources:
These datasets are imported into R as csv files.
#Importing 2016 and 2017 data
efile_data_2016 <- read_csv("E:\\Data Wrangling with R\\project\\H-1B_Disclosure_Data_FY16.csv")
efile_data_2017 <- read_csv("E:\\Data Wrangling with R\\project\\H-1B_Disclosure_Data_FY17.csv")
There are 40 variables present in the 2016 H-1B dataset, whose data dictionary can be found here. The 2017 H-1B dataset contains 52 variables, whose data dictionary can be found here. These two datasets are merged to create a single data frame containing both FY 2016 and FY 2017 data. Only 38 variables are common to both datasets, so we keep just those 38 in our dataset.
#merging datasets
names(efile_data_2016)[which(names(efile_data_2016) == "H-1B_DEPENDENT")] <- "H1B_DEPENDENT"
common_cols <- intersect(colnames(efile_data_2016), colnames(efile_data_2017))
efile_data <- rbind(efile_data_2016[common_cols], efile_data_2017[common_cols])
# cleaning the environment
rm(efile_data_2016,efile_data_2017,common_cols)
Let’s take a look at the first six observations of the merged dataset.
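For a quick peek (a minimal sketch; the published report may have rendered this with DT::datatable instead):
#### first six observations and dimensions of the merged dataset
head(efile_data)
dim(efile_data)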
The merged dataset has 38 variables and over 1.2 million records.
The raw merged data is messy, and some cleaning steps need to be performed on the data table before any exploratory data analysis can be done. These are the data cleaning steps performed to get our dataset into the right shape:
The merged dataset has filing records for other visa classes such as H-1B1 and E-3, so these records are filtered out to retain only the records corresponding to H-1B data.
Only 10 of the 38 variables in the merged dataset are of interest for our analysis, so the remaining variables are dropped from the final dataset.
We want to know whether an employee is a full-time or a part-time worker, so we create a new variable FULL_TIME and assign it Y when PW_UNIT_OF_PAY is Year and N when PW_UNIT_OF_PAY is Hour, Week, Bi-Weekly or Month.
Next, we create a FIN_YEAR variable and assign it a value of 2016 or 2017, based on the fiscal year in which the authorities made a decision on the case.
We build a common salary scale for part-time and full-time employees so that they are easy to compare, by converting hourly, weekly, bi-weekly and monthly pay to a yearly figure.
The dataset has some missing values, but records with missing values make up less than 0.1% of the data, so these records are removed.
Records with extreme PREVAILING_WAGE values (below $5,000 or above $200,000 per year) are discarded.
## df1: the cleaned H-1B-only dataset
df1 <- efile_data %>%
  # keep only H-1B records, dropping other visa classes (H-1B1, E-3)
  filter(VISA_CLASS %in% "H-1B") %>%
  select(CASE_STATUS, DECISION_DATE, EMPLOYER_NAME, JOB_TITLE, SOC_NAME,
         PREVAILING_WAGE:PW_UNIT_OF_PAY, H1B_DEPENDENT, WORKSITE_CITY, WORKSITE_STATE) %>%
  # full-time flag and the fiscal year of the decision
  # (FY 2016: 2015-10-01 to 2016-09-30; FY 2017: 2016-10-01 to 2017-09-30)
  mutate(FULL_TIME = if_else(PW_UNIT_OF_PAY == "Year", "Y", "N"),
         DECISION_DATE = as.Date(DECISION_DATE, format = "%d-%m-%Y"),
         FIN_YEAR = case_when(
           DECISION_DATE >= as.Date("2015-10-01") & DECISION_DATE <= as.Date("2016-09-30") ~ 2016,
           DECISION_DATE >= as.Date("2016-10-01") & DECISION_DATE <= as.Date("2017-09-30") ~ 2017)) %>%
  # scale all wages to a yearly figure (2087 working hours, 26 bi-weekly pay periods per year)
  mutate(PREVAILING_WAGE = case_when(
    PW_UNIT_OF_PAY == "Hour"      ~ PREVAILING_WAGE * 2087,
    PW_UNIT_OF_PAY == "Week"      ~ PREVAILING_WAGE * 52,   # weekly pay; unit assumed to be recorded as 'Week'
    PW_UNIT_OF_PAY == "Bi-Weekly" ~ PREVAILING_WAGE * 26,
    PW_UNIT_OF_PAY == "Month"     ~ PREVAILING_WAGE * 12,
    PW_UNIT_OF_PAY == "Year"      ~ PREVAILING_WAGE)) %>%
  select(-DECISION_DATE, -PW_UNIT_OF_PAY) %>%
  na.omit() %>%
  # drop records with extreme prevailing wages
  filter(PREVAILING_WAGE > 5000, PREVAILING_WAGE < 200000)
Now that the data cleaning operations are complete, let’s take a look at the final dataset used for the exploratory data analysis.
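A compact way to inspect its structure (a sketch; the report itself may have displayed a formatted table with the DT package):
#### structure of the cleaned dataset
glimpse(df1)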
The H-1B dataset has 10 variables, most of which are categorical. PREVAILING_WAGE is the only numeric variable present in the dataset. Let’s take a look at the data dictionary.
VARIABLE | DATA TYPE | VARIABLE DESCRIPTION
---|---|---
CASE_STATUS | Character | Status associated with the last significant event or decision.
EMPLOYER_NAME | Character | Name of the employer submitting the labor condition application.
JOB_TITLE | Character | Title of the job.
SOC_NAME | Character | Occupational name as per the SOC system.
PREVAILING_WAGE | Numeric | Prevailing wage for the job being requested (scaled to a yearly figure, USD).
H1B_DEPENDENT | Character | Y = employer is H-1B dependent; N = employer is not H-1B dependent.
WORKSITE_CITY | Character | City of the foreign worker’s intended area of employment.
WORKSITE_STATE | Character | State of the foreign worker’s intended area of employment.
FULL_TIME | Character | Y = full time; N = part time.
FIN_YEAR | Numeric | Fiscal year of the decision (2016 or 2017).
PREVAILING_WAGE is the variable we are most concerned about. The data suggests that the median salary of H-1B applicants was $67,891 in 2016 and $71,510 in 2017, a 5.33% increase from 2016 to 2017.
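The yearly medians quoted above can be reproduced with a short summary (a sketch based on the df1 built earlier; exact figures depend on the cleaning choices):
#### median prevailing wage by fiscal year
df1 %>%
  group_by(FIN_YEAR) %>%
  summarise(median_wage = median(PREVAILING_WAGE))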
Here are some of the relationships that could be interesting to look at:
Number of H-1B submissions by state.
Employers who submit the most H-1B visa applications.
Most common job titles.
Cities with a high number of submissions.
States and cities with the highest median salaries.
Comparison of 2016 statistics with 2017.
For a graphical representation of the data, it is easier to look at the top 10 submissions for a given factor. So, to extract the top 10 records for a factor split by year, we create a function called ‘top_10_recordsby_year’. We also create a function called ‘top_n_records’ that extracts the top n records for a factor within each year.
##### function to extract the top n records for a given column, per fiscal year
top_n_records <- function(col_name, n_rec) {
  # .data[[col_name]] replaces the deprecated group_by_() and keeps the original column name
  df1 %>%
    group_by(FIN_YEAR, .data[[col_name]]) %>%
    summarise(num_apps = n()) %>%
    arrange(desc(num_apps)) %>%
    slice(1:n_rec)   # applied within each remaining FIN_YEAR group
}
###### function to extract the top 10 records for a given column, split by year ######
top_10_recordsby_year <- function(col_name) {
  # count applications per fiscal year for the chosen column
  df <- df1 %>%
    group_by(FIN_YEAR, .data[[col_name]]) %>%
    summarise(num_apps = n())
  # reshape to wide so the 2016 and 2017 counts can be summed and ranked
  df <- spread(df, key = FIN_YEAR, value = num_apps)
  df$num_apps <- df$`2016` + df$`2017`
  df <- arrange(df, desc(num_apps))
  df <- df[1:10, -4]   # keep the top 10, drop the combined total
  # back to long format for plotting, one row per year
  gather(df, key = "FIN_YEAR", value = "num_apps", 2:3)
}
There are four case statuses in the dataset: an application can be denied, withdrawn, certified or certified-withdrawn. Certified does not mean that the H-1B is approved; it means that the application is certified and sent on for further processing. It is for this reason that we are unable to fit a model to predict the factors impacting approval of H-1B visa applications.
#### H1B Visa Case Status
ggplot(top_n_records("CASE_STATUS",10),aes(x = reorder(CASE_STATUS,-num_apps),y = num_apps, fill = CASE_STATUS)) +
geom_bar(stat = "identity", alpha = 0.9, width = 0.7) +
ggtitle("Figure1: case status of applications") +
labs(x = "case status", y = "No. of Applications")
The bar chart above shows the distribution of H-1B case statuses. Looking at the plot, we can easily see that certified cases dominate this dataset.
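To put a number on how strongly certified cases dominate, a quick share table can be computed (a sketch using the df1 built above):
#### share of applications by case status
df1 %>%
  count(CASE_STATUS) %>%
  mutate(share = round(n / sum(n), 3))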
Let’s look at the states with the most submissions and compare the 2016 statistics with those of 2017. This shows which states file the most applications, and whether there is any significant difference in application volume between the two years in any of them.
#### Top H1B applicant states by year
ggplot(top_10_recordsby_year("WORKSITE_STATE"),aes(x = WORKSITE_STATE, y = num_apps,fill = FIN_YEAR)) +
geom_bar(stat = "identity", position = position_dodge() , alpha = 0.9, width = 0.7) +
coord_flip() +
ggtitle("Figure2: Top 10 work sites with most applications") +
labs(x = "State", y = "No. of Applications")
California has the most submissions, followed by Texas with only about half as many. There has not been much difference between the number of applications submitted in 2016 and in 2017 in any of the top 10 states.
Let’s look at the cities with the most submissions and compare the 2016 statistics with those of 2017. This shows which cities file the most applications, and whether there is any significant difference in application volume between the two years in any of them.
#### Top H1B applicant cities by year
ggplot(top_10_recordsby_year("WORKSITE_CITY"),aes(x = WORKSITE_CITY, y = num_apps,fill = FIN_YEAR)) +
geom_bar(stat = "identity", position = position_dodge() , alpha = 0.9, width = 0.7) +
coord_flip() +
ggtitle("Figure3: Top 10 work site cities with most applications") +
labs(x = "City", y = "No. of Applications")
We see that New York has the most applications in both 2016 and 2017. Houston and San Francisco come in a distant second with less than half the number of applications submitted in New York. Among the top 10 cities, Houston sees a considerable decrease in applications in 2017 compared to 2016.
It is always interesting to observe which companies sponsor the most H-1B applications. It also gives an indication of the sector in which the most applications are filed.
#### Popular H1B Visa Sponsors by year
ggplot(top_10_recordsby_year("EMPLOYER_NAME"),aes(x = EMPLOYER_NAME, y = num_apps,fill = FIN_YEAR)) +
geom_bar(stat = "identity", position = position_dodge() , alpha = 0.9, width = 0.7) +
coord_flip() +
ggtitle("Figure4: Top 10 employers with most applications") +
labs(x = "Empoyer", y = "No. of Applications")
The bar chart tells us that Infosys Limited filed more than twice as many H-1B visa applications as Tata Consultancy Services during both years. Four of the top ten companies in this chart are Indian information technology companies. Wipro Limited, Infosys Limited, Capgemini America and Accenture submitted significantly fewer applications in 2017 than in 2016.
Programmer Analyst and Software Engineer are the two job titles that attracted the most H-1B applications. The job titles also suggest that the Information Technology sector accounts for the largest share of the available H-1B slots.
#### Popular Job Titles by year
ggplot(top_10_recordsby_year("JOB_TITLE"),aes(x = JOB_TITLE, y = num_apps,fill = FIN_YEAR)) +
geom_bar(stat = "identity", position = position_dodge() , alpha = 0.9, width = 0.7) +
coord_flip() +
ggtitle("Figure5: Top 10 job titles with most applications") +
labs(x = "Job Title", y = "No. of Applications")
Computer occupations such as software developers, programmers and systems analysts account for the largest share of H-1B applications. This suggests a shortage of computer engineers in the United States.
#### Popular Occupations by year
ggplot(top_10_recordsby_year("SOC_NAME"),aes(x = SOC_NAME, y = num_apps,fill = FIN_YEAR)) +
geom_bar(stat = "identity", position = position_dodge() , alpha = 0.9, width = 0.7) +
coord_flip() +
ggtitle("Figure6: Top 10 Standard Occupational Classification names with most applications") +
labs(x = "Standard Occupational Classification", y = "No. of Applications")
It is interesting to know which companies are H-1B dependent. It is not practical to look at the H-1B dependency status of every company that filed an application, so we check the dependency status of the most popular H-1B sponsors.
#### H-1B dependent or not
df1 %>%
  group_by(EMPLOYER_NAME) %>%
  # max() over the "Y"/"N" flag returns "Y" if any of the employer's records is marked dependent
  summarise(Num_applications = n(), h1b_dependent = max(H1B_DEPENDENT)) %>%
  arrange(desc(Num_applications)) %>%
  slice(1:10) %>%
  select(EMPLOYER_NAME, h1b_dependent)
## # A tibble: 10 x 2
## EMPLOYER_NAME h1b_dependent
## <chr> <chr>
## 1 INFOSYS LIMITED Y
## 2 TATA CONSULTANCY SERVICES LIMITED Y
## 3 CAPGEMINI AMERICA INC Y
## 4 IBM INDIA PRIVATE LIMITED Y
## 5 ACCENTURE LLP N
## 6 WIPRO LIMITED Y
## 7 DELOITTE CONSULTING LLP Y
## 8 TECH MAHINDRA (AMERICAS),INC. Y
## 9 ERNST & YOUNG U.S. LLP N
## 10 MICROSOFT CORPORATION N
It is observed that 7 of the top 10 companies are H-1B dependent. ACCENTURE LLP, ERNST & YOUNG U.S. LLP and MICROSOFT CORPORATION are the three companies that are not H-1B dependent.
Let’s look at the distribution of the number of applications by full-time status. In both 2016 and 2017, the vast majority of applicants were in a full-time role.
#####full time vs part time
df1 %>%
group_by(FULL_TIME, FIN_YEAR) %>%
summarise(Num_applications = n())
## # A tibble: 4 x 3
## # Groups: FULL_TIME [?]
## FULL_TIME FIN_YEAR Num_applications
## <chr> <dbl> <int>
## 1 N 2016 37316
## 2 N 2017 37320
## 3 Y 2016 594870
## 4 Y 2017 570912
Let’s take a look at the spread of wages for applications filed by H-1B dependent and non-dependent companies.
ggplot(df1, aes(x = H1B_DEPENDENT, y = PREVAILING_WAGE)) +
geom_boxplot() +
ggtitle("wage rate per year(in USD)") +
labs(x = "H1B dependancy", y = "wage")
Because the dataset has many outliers and is severely skewed, using the median wage to compare the prevailing wages of different occupations reduces distortion and gives a better picture. Based on the median wage, family & general practitioners have the highest median wage, close to $200k.
#####Top 10 socs with the highest wages
top_10_soc_highest_wage <- df1 %>%
group_by(SOC_NAME) %>%
summarise(median_wage = median(PREVAILING_WAGE)) %>%
arrange(desc(median_wage)) %>%
slice(1:10) %>%
select(SOC_NAME, median_wage)
ggplot(top_10_soc_highest_wage,aes(x = SOC_NAME, y = median_wage)) +
geom_bar(stat = "identity", fill = "Blue", alpha = 0.9, width = 0.7) +
coord_flip() +
ggtitle("Top 10 Standard Occupational Classification names and their median pay") +
labs(x = "Standard Occupational Classification", y = "median salary")
California and Washington have the highest median wages among all applicants across the states in the United States, at approximately $95k.
#####top 10 states with highest wages
top_10_state_highest_wage <- df1 %>%
group_by(WORKSITE_STATE) %>%
summarise(median_wage = median(PREVAILING_WAGE)) %>%
arrange(desc(median_wage)) %>%
slice(1:10) %>%
select(WORKSITE_STATE, median_wage)
ggplot(top_10_state_highest_wage,aes(x = reorder(WORKSITE_STATE, -median_wage), y = median_wage,fill = WORKSITE_STATE)) +
geom_bar(stat = "identity", alpha = 0.9, width = 0.7) +
coord_flip() +
ggtitle("Top 10 State names and their median pay") +
labs(x = "State", y = "median salary")
Surprisingly, some smaller cities have the highest median pay. This is probably because there are very few applicants from these cities and those applicants are in high-paying roles; a quick check of the counts follows the plot below.
######top 10 cities with highest wages
top_10_city_highest_wage <- df1 %>%
group_by(WORKSITE_CITY) %>%
summarise(median_wage = median(PREVAILING_WAGE)) %>%
arrange(desc(median_wage)) %>%
slice(1:10) %>%
select(WORKSITE_CITY, median_wage)
ggplot(top_10_city_highest_wage,aes(x = reorder(WORKSITE_CITY, -median_wage), y = median_wage,fill = WORKSITE_CITY)) +
geom_bar(stat = "identity", alpha = 0.9, width = 0.7) +
coord_flip() +
ggtitle("Top 10 cities and their median pay") +
labs(x = "City", y = "median salary")
Exploratory data analysis was performed on the H-1B petition dataset for the years 2016 and 2017, and the following conclusions were drawn.
Anesthesiologists are the top earners among all occupations.
California and Washington have the highest median wages among all applicants across the states in the United States, at approximately $95k.
California has the most submissions, followed by Texas with only about half as many.
New York has the most applications in both 2016 and 2017. Houston and San Francisco come in a distant second with less than half the number of applications submitted in New York. Among the top 10 cities, Houston sees a considerable decrease in applications in 2017 compared to 2016.
Infosys Limited filed more than twice as many H-1B visa applications as Tata Consultancy Services during both years. Four of the top ten companies in this chart are Indian information technology companies.
The Information Technology sector accounts for the largest share of the available H-1B slots.
Computer occupations such as software developers, programmers and systems analysts account for the largest share of H-1B applications. This suggests a shortage of computer engineers in the United States.
This analysis provides quick information about the sectors where local talent is scarce and foreign workers are needed. It gives universities and job seekers insight into which technologies and skills to learn in order to find a better job in the United States. One can also see which job roles pay the most, and H-1B seekers get a fair idea of the companies that sponsor the largest share of H-1B applications in the United States.
Data cleaning was the most challenging task, apart from understanding the data. The dataset has many variables that were not of interest to us. It also does not indicate whether an H-1B application is ultimately approved, which is why we are unable to fit a model to predict the factors impacting approval of H-1B visa applications. Even if the dataset contained such information, it consists mostly of categorical variables with very few continuous ones, so it would still be difficult to build a model to predict the success of a visa application.
This dataset can be combined with state-wise literacy data to compare state literacy rates with the number of submissions by state. One could also look at the relationship between the cost of living and the wages offered at various locations.
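As a starting point for that extension, here is a minimal sketch; state_literacy is a hypothetical external table (with columns WORKSITE_STATE and literacy_rate) that the reader would need to supply:
###### joining per-state submission counts with an external state-level table
state_counts <- df1 %>%
  count(WORKSITE_STATE, name = "num_apps")
# state_literacy is a placeholder dataset, so the join is left commented out:
# combined <- left_join(state_counts, state_literacy, by = "WORKSITE_STATE")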